InnoDB: ###### Diagnostic info printed to the standard error stream
2021-11-19T14:52:21.671328+08:00 0 [ERROR] [MY-012872] [InnoDB] [FATAL] Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
2021-11-19T14:52:21.671407+08:00 0 [ERROR] [MY-013183] [InnoDB] Assertion failure: ut0ut.cc:532 thread 139928164968192
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/8.0/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
06:52:21 UTC - mysqld got signal 6 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
Thread 139932014196480 is waiting at buf0flu.cc line 1357 for a semaphore: it wants an SX-lock on an RW-latch created at buf0buf.cc line 781, but thread 139928049043200
already holds that latch in exclusive mode.
Thread 139928049043200, in turn, is waiting at btr0cur.cc line 233 for a semaphore: an X-lock (wait_ex) on another RW-latch, also created at buf0buf.cc line 781. At this point Thread 139928049043200
is itself still waiting for the exclusive lock (wait exclusive), that is, this second RW-latch has still not been released to it.
[Warning] [MY-012985] [InnoDB] A long semaphore wait:
--Thread 139932014196480 has waited at buf0flu.cc line 1357 for 784 seconds the semaphore:
SX-lock on RW-latch at 0x7f439b4df988 created in file buf0buf.cc line 781
a writer (thread id 139928049043200) has reserved it in mode exclusive
2021-11-19T14:51:51.381694+08:00 0 [Warning] [MY-012985] [InnoDB] A long semaphore wait:
--Thread 139928049043200 has waited at btr0cur.cc line 233 for 923 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x7f43a36c0108 created in file buf0buf.cc line 781
a writer (thread id 139928049043200) has reserved it in mode wait exclusive
First, buf0buf.cc: this is the database buffer pool code. Line 781 is rw_lock_create(PFS_NOT_INSTRUMENTED, &block->lock, SYNC_LEVEL_VARYING);, which creates a read-write lock. It sits inside buf_block_init(), the function that initializes a buffer control block when the buffer pool is set up; in other words, these are the latches that guard buffer pool pages.
buf0flu.cc implements MySQL's buffer pool flushing. Line 1357 is rw_lock_sx_lock_gen(rw_lock, BUF_IO_WRITE), which looks like the rw_lock that must be acquired when flushing a buffer page.
btr0cur.cc is described as the index tree cursor: every row operation that changes a B-tree or a record must go through this module, and every modification or insertion of a clustered index record also writes its undo log record here.
Line 233 is in the code path that fetches a B-tree block, and you can see it has to request RW_X_LATCH:
232 get_block = btr_block_get(page_id_t(page_id.space(), left_page_no),
233 page_size, RW_X_LATCH, cursor->index, mtr);
So this path looks like it is triggered only by changes to the table. The error log also shows that TRANSACTION 787138017 on thread 139928049043200 had been rolling back the whole time, which further confirms the picture above:
---TRANSACTION 787138017, ACTIVE 4387 sec rollback <<<<<<<
mysql tables in use 2, locked 1
ROLLING BACK 1 lock struct(s), heap size 1136, 0 row lock(s), undo log entries 7891101
MySQL thread id 11963922, OS thread handle 139928049043200, query id 1348226447 30.88.75.45 msvcmdata query end
INSERT INTO `xxl_job_log_copy1` SELECT * FROM `msvcm`.`xxl_job_log`
Trx read view will not see trx with id >= 787138017, sees < 787138017
At the same time, another similar SQL statement was busy inserting, so buffer pool pressure really was high at that moment:
---TRANSACTION 787233964, ACTIVE 3040 sec inserting
mysql tables in use 2, locked 1
1 lock struct(s), heap size 1136, 0 row lock(s), undo log entries 8544482
MySQL thread id 11964240, OS thread handle 139933323695872, query id 1348381276 30.88.75.45 msvcmdata executing
INSERT INTO `xxl_job_log_copy2` SELECT * FROM `msvcm`.`xxl_job_log`
Trx read view will not see trx with id >= 787233964, sees < 787138017
Looking at the InnoDB status history, the number of pending buffer pool flushes kept climbing:
Pending flushes (fsync) log: 1; buffer pool: 30
Pending flushes (fsync) log: 1; buffer pool: 33
Pending flushes (fsync) log: 1; buffer pool: 31
Pending flushes (fsync) log: 1; buffer pool: 31
Pending flushes (fsync) log: 1; buffer pool: 31
Pending flushes (fsync) log: 1; buffer pool: 35
Pending flushes (fsync) log: 0; buffer pool: 37
Pending flushes (fsync) log: 0; buffer pool: 48
Pending flushes (fsync) log: 1; buffer pool: 54
Pending flushes (fsync) log: 1; buffer pool: 54
Pending flushes (fsync) log: 1; buffer pool: 62
Pending flushes (fsync) log: 0; buffer pool: 68
Pending flushes (fsync) log: 0; buffer pool: 63
Pending flushes (fsync) log: 1; buffer pool: 63
Pending flushes (fsync) log: 0; buffer pool: 67
Pending flushes (fsync) log: 0; buffer pool: 71
Pending flushes (fsync) log: 1; buffer pool: 70
Pending flushes (fsync) log: 1; buffer pool: 71
Pending flushes (fsync) log: 1; buffer pool: 72
Pending flushes (fsync) log: 1; buffer pool: 73
Pending flushes (fsync) log: 0; buffer pool: 76
Pending flushes (fsync) log: 1; buffer pool: 75
Pending flushes (fsync) log: 1; buffer pool: 75
Pending flushes (fsync) log: 0; buffer pool: 75
Pending flushes (fsync) log: 1; buffer pool: 75
Pending flushes (fsync) log: 1; buffer pool: 76
Pending flushes (fsync) log: 0; buffer pool: 76
Pending flushes (fsync) log: 1; buffer pool: 78
Pending flushes (fsync) log: 0; buffer pool: 82
Pending flushes (fsync) log: 0; buffer pool: 85
Pending flushes (fsync) log: 1; buffer pool: 89
Pending flushes (fsync) log: 1; buffer pool: 95
Pending flushes (fsync) log: 1; buffer pool: 97
Pending flushes (fsync) log: 0; buffer pool: 102
The root cause is that two large INSERT ... SELECT statements, each doing a full table scan, were running at the same time, one inserting and one rolling back. On top of that the buffer pool was relatively small, so buffer pool resources were tight; while flushing and modifying buffer pages, the server contended on the page latches and waited for a long time, which finally triggered the semaphore-wait timeout.
So far I have not found a matching known-bug report. The fix for this class of problem is to keep SQL from holding latches for long periods:
1. Rewrite the SELECT to avoid full table scans, so latches are not held for long stretches.
2. Add filter conditions and suitable indexes so the SELECT runs faster and each statement touches fewer rows.
3. Increase the buffer pool size and the number of buffer pool instances; each instance has its own latches, which reduces latch contention.
4. Avoid long-running rollbacks.
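For point 3, this is roughly how the settings might look in my.cnf. The values below are purely illustrative assumptions, not recommendations; size them to the host's RAM and actual workload:

```ini
[mysqld]
# Illustrative value only: a larger pool gives two big INSERT ... SELECT
# jobs more room before flushing becomes a bottleneck.
innodb_buffer_pool_size = 32G
# Each buffer pool instance has its own latches, so splitting the pool
# spreads latch contention across instances.
innodb_buffer_pool_instances = 8
```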