InnoDB: ###### Diagnostic info printed to the standard error stream
2021-11-19T14:52:21.671328+08:00 0 [ERROR] [MY-012872] [InnoDB] [FATAL] Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
2021-11-19T14:52:21.671407+08:00 0 [ERROR] [MY-013183] [InnoDB] Assertion failure: ut0ut.cc:532 thread 139928164968192
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/8.0/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
06:52:21 UTC - mysqld got signal 6 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
Thread 139932014196480 is waiting at buf0flu.cc line 1357 for a semaphore: it wants an SX-lock on an RW-latch created at buf0buf.cc line 781, but thread 139928049043200
already holds that latch in exclusive mode.
Thread 139928049043200, in turn, is waiting at btr0cur.cc line 233 for a semaphore: an X-lock (wait_ex) on another RW-latch, also created at buf0buf.cc line 781. At this point Thread 139928049043200
is itself still waiting for the exclusive lock (wait exclusive), that is, this second RW-latch has still not been released to it.
[Warning] [MY-012985] [InnoDB] A long semaphore wait:
--Thread 139932014196480 has waited at buf0flu.cc line 1357 for 784 seconds the semaphore:
SX-lock on RW-latch at 0x7f439b4df988 created in file buf0buf.cc line 781
a writer (thread id 139928049043200) has reserved it in mode exclusive
2021-11-19T14:51:51.381694+08:00 0 [Warning] [MY-012985] [InnoDB] A long semaphore wait:
--Thread 139928049043200 has waited at btr0cur.cc line 233 for 923 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x7f43a36c0108 created in file buf0buf.cc line 781
a writer (thread id 139928049043200) has reserved it in mode wait exclusive
First, buf0buf.cc: this is the database buffer pool code. Line 781 is rw_lock_create(PFS_NOT_INSTRUMENTED, &block->lock, SYNC_LEVEL_VARYING);, which creates a read-write lock. It sits inside buf_block_init(), the function that initializes a buffer control block when the buffer pool is set up; in other words, these are the latches that guard buffer pool pages.
buf0flu.cc implements MySQL's buffer pool flushing. Line 1357 is rw_lock_sx_lock_gen(rw_lock, BUF_IO_WRITE), which looks like the rw_lock that must be acquired when flushing a buffer page.
btr0cur.cc is described as the index tree cursor: every row operation that changes a B-tree or a record must go through this module, and every modification or insertion of a clustered index record also writes its undo log record here.
Line 233 is in the code path that fetches a B-tree block, and you can see it has to request RW_X_LATCH:
232 get_block = btr_block_get(page_id_t(page_id.space(), left_page_no),
233 page_size, RW_X_LATCH, cursor->index, mtr);
So this path looks like it is triggered only by changes to the table. The error log also shows that TRANSACTION 787138017 on thread 139928049043200 had been rolling back the whole time, which further confirms the picture above:
---TRANSACTION 787138017, ACTIVE 4387 sec rollback <<<<<<<
mysql tables in use 2, locked 1
ROLLING BACK 1 lock struct(s), heap size 1136, 0 row lock(s), undo log entries 7891101
MySQL thread id 11963922, OS thread handle 139928049043200, query id 1348226447 30.88.75.45 msvcmdata query end
INSERT INTO `xxl_job_log_copy1` SELECT * FROM `msvcm`.`xxl_job_log`
Trx read view will not see trx with id >= 787138017, sees < 787138017
At the same time, another similar SQL statement was busy inserting, so buffer pool pressure really was high at that moment:
---TRANSACTION 787233964, ACTIVE 3040 sec inserting
mysql tables in use 2, locked 1
1 lock struct(s), heap size 1136, 0 row lock(s), undo log entries 8544482
MySQL thread id 11964240, OS thread handle 139933323695872, query id 1348381276 30.88.75.45 msvcmdata executing
INSERT INTO `xxl_job_log_copy2` SELECT * FROM `msvcm`.`xxl_job_log`
Trx read view will not see trx with id >= 787233964, sees < 787138017
Looking at the InnoDB status history, the number of pending buffer pool flushes kept climbing:
Pending flushes (fsync) log: 1; buffer pool: 30
Pending flushes (fsync) log: 1; buffer pool: 33
Pending flushes (fsync) log: 1; buffer pool: 31
Pending flushes (fsync) log: 1; buffer pool: 31
Pending flushes (fsync) log: 1; buffer pool: 31
Pending flushes (fsync) log: 1; buffer pool: 35
Pending flushes (fsync) log: 0; buffer pool: 37
Pending flushes (fsync) log: 0; buffer pool: 48
Pending flushes (fsync) log: 1; buffer pool: 54
Pending flushes (fsync) log: 1; buffer pool: 54
Pending flushes (fsync) log: 1; buffer pool: 62
Pending flushes (fsync) log: 0; buffer pool: 68
Pending flushes (fsync) log: 0; buffer pool: 63
Pending flushes (fsync) log: 1; buffer pool: 63
Pending flushes (fsync) log: 0; buffer pool: 67
Pending flushes (fsync) log: 0; buffer pool: 71
Pending flushes (fsync) log: 1; buffer pool: 70
Pending flushes (fsync) log: 1; buffer pool: 71
Pending flushes (fsync) log: 1; buffer pool: 72
Pending flushes (fsync) log: 1; buffer pool: 73
Pending flushes (fsync) log: 0; buffer pool: 76
Pending flushes (fsync) log: 1; buffer pool: 75
Pending flushes (fsync) log: 1; buffer pool: 75
Pending flushes (fsync) log: 0; buffer pool: 75
Pending flushes (fsync) log: 1; buffer pool: 75
Pending flushes (fsync) log: 1; buffer pool: 76
Pending flushes (fsync) log: 0; buffer pool: 76
Pending flushes (fsync) log: 1; buffer pool: 78
Pending flushes (fsync) log: 0; buffer pool: 82
Pending flushes (fsync) log: 0; buffer pool: 85
Pending flushes (fsync) log: 1; buffer pool: 89
Pending flushes (fsync) log: 1; buffer pool: 95
Pending flushes (fsync) log: 1; buffer pool: 97
Pending flushes (fsync) log: 0; buffer pool: 102
The root cause is that two large INSERT ... SELECT statements, each doing a full table scan, were running at the same time, one inserting and one rolling back. On top of that the buffer pool was relatively small, so buffer pool resources were tight; while flushing and modifying buffer pages, the server contended on the page latches and waited for a long time, which finally triggered the semaphore-wait timeout.
So far I have not found a matching known-bug report. The fix for this class of problem is to keep SQL from holding latches for long periods:
1. Rewrite the SELECT to avoid full table scans, so latches are not held for long stretches.
2. Add filter conditions and suitable indexes so the SELECT runs faster and each statement touches fewer rows.
3. Increase the buffer pool size and the number of buffer pool instances; each instance has its own latches, which reduces latch contention.
4. Avoid long-running rollbacks.
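For point 3, this is roughly how the settings might look in my.cnf. The values below are purely illustrative assumptions, not recommendations; size them to the host's RAM and actual workload:

```ini
[mysqld]
# Illustrative value only: a larger pool gives two big INSERT ... SELECT
# jobs more room before flushing becomes a bottleneck.
innodb_buffer_pool_size = 32G
# Each buffer pool instance has its own latches, so splitting the pool
# spreads latch contention across instances.
innodb_buffer_pool_instances = 8
```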