Preface
This morning a friend came to me with a situation: "Have you ever seen a SELECT on a standby run for ages, blocking replay on the standby, without a recovery conflict ever firing?" In other words, a big query was running on the standby; it hit a recovery conflict, yet no error was ever raised — PostgreSQL never killed the session, WAL replay was blocked, and the replication lag kept growing.
Symptoms
A recovery conflict is a streaming-replication conflict.
Long-running analytical and reporting SQL is commonly offloaded to the standby. When such a query runs on the standby, the rows it reads may meanwhile be updated or deleted on the primary; once the primary vacuums away the old row versions, the same cleanup is replayed on the standby, conflicting with any standby query that still needs those versions.
Several parameters control this mechanism:
1. max_standby_streaming_delay: defaults to 30s. A query on the standby may conflict with WAL currently being applied; if the query has not finished within this delay, it is canceled. Set to -1, the WAL-apply process instead waits indefinitely for the conflicting standby query to finish. In my friend's environment this was set to 2min.
2. hot_standby_feedback: when on, the standby reports the queries it is running to the primary, and the primary refrains from cleaning up old row versions the standby still needs, so standby queries are not canceled. The downside is table bloat on the primary, proportional to the write traffic on the table and the runtime of the standby's queries. Defaults to off; it was off in my friend's environment.
3. vacuum_defer_cleanup_age: the number of transactions by which vacuum defers cleaning up dead tuples. Defaults to 0; it was not set in my friend's environment.
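For reference, the standby-related settings from the environment above would look like this as a postgresql.conf fragment (the values are the ones from the case described; tune them for your own workload):

```
max_standby_streaming_delay = 2min   # cancel conflicting standby queries after 2 minutes
hot_standby_feedback = off           # standby does not report its oldest snapshot to the primary
vacuum_defer_cleanup_age = 0         # primary does not defer dead-tuple cleanup
```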
Reproduction
Let's first reproduce a replication conflict. Start by building a large table:
postgres=# create table test(id int);
CREATE TABLE
postgres=# insert into test values(generate_series(1,100000000));
INSERT 0 100000000
To make reproduction easier, run a vacuum analyze on the primary first. This populates the visibility map (the vm fork), so subsequent vacuums can skip blocks that are all_visible:
postgres=# vacuum analyze test;
VACUUM
postgres=# select relallvisible,relpages from pg_class where relname = 'test';
relallvisible | relpages
---------------+----------
442478 | 442478
(1 row)
postgres=# vacuum analyze test;
VACUUM
postgres=# select * from pg_visibility_map_summary('test');
all_visible | all_frozen
-------------+------------
442478 | 0
(1 row)
Then run a big query on the standby (anything that takes a while):
postgres=# explain select count(*) from test where id = 66666666;
QUERY PLAN
----------------------------------------------------------------
Aggregate (cost=1692478.40..1692478.41 rows=1 width=8)
-> Seq Scan on test (cost=0.00..1692478.40 rows=1 width=0)
Filter: (id = 66666666)
(3 rows)
To trigger the conflict reliably, use a small trick: locate the row by its CTID, so the primary can delete it almost instantly — otherwise the standby's query might finish before the primary even finds the row, and no snapshot conflict would occur. Also lower the standby's max_standby_streaming_delay to 5s.
postgres=# select ctid from test where id = 66666666;
ctid
-------------
(294985,56)
(1 row)
postgres=# explain delete from test where ctid = '(294985,56)'; --- locating the row by ctid: total cost is only 4
QUERY PLAN
----------------------------------------------------------
Delete on test (cost=0.00..4.01 rows=0 width=0)
-> Tid Scan on test (cost=0.00..4.01 rows=1 width=6)
TID Cond: (ctid = '(294985,56)'::tid)
(3 rows)
postgres=# explain delete from test where id = 66666666; --- deleting by id needs a sequential scan, which is far more expensive
QUERY PLAN
----------------------------------------------------------------
Delete on test (cost=0.00..1692478.40 rows=0 width=0)
-> Seq Scan on test (cost=0.00..1692478.40 rows=1 width=6)
Filter: (id = 66666666)
(3 rows)
Start the query on the standby:
postgres=# select count(*) from test where id = 66666666;
--- query running, not yet returned
Meanwhile, immediately delete that row on the primary and vacuum:
postgres=# delete from test where ctid = '(294985,56)';vacuum test;
DELETE 1
VACUUM
Once 5s have elapsed, the standby's query errors out:
postgres=# select count(*) from test where id = 66666666;
ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
This is the classic snapshot conflict, the most common kind of replication conflict. The DETAIL line comes from errdetail_recovery_conflict() in the server source:
/*
 * errdetail_recovery_conflict
 *
 * Add an errdetail() line showing conflict source.
 */
static int
errdetail_recovery_conflict(void)
{
    switch (RecoveryConflictReason)
    {
        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
            errdetail("User was holding shared buffer pin for too long.");
            break;
        case PROCSIG_RECOVERY_CONFLICT_LOCK:
            errdetail("User was holding a relation lock for too long.");
            break;
        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
            errdetail("User was or might have been using tablespace that must be dropped.");
            break;
        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
            errdetail("User query might have needed to see row versions that must be removed.");
            break;
        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
            errdetail("User transaction caused buffer deadlock with recovery.");
            break;
        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
            errdetail("User was connected to a database that must be dropped.");
            break;
        default:
            break;
            /* no errdetail */
    }
    return 0;
}
We can also watch the startup process: once max_standby_streaming_delay expires, it is no longer marked waiting.
[postgres@xiongcc ~]$ ps -ef | grep startup;date
postgres 2309 2307 4 14:47 ? 00:00:32 postgres: startup recovering 00000001000000020000002B waiting
postgres 2536 32729 0 15:00 pts/7 00:00:00 grep --color=auto startup
Fri Apr 29 15:00:30 CST 2022
[postgres@xiongcc ~]$ ps -ef | grep startup;date
postgres 2309 2307 4 14:47 ? 00:00:32 postgres: startup recovering 00000001000000020000002B
postgres 2548 32729 0 15:00 pts/7 00:00:00 grep --color=auto startup
Fri Apr 29 15:00:34 CST 2022
At this point something doesn't add up: normally PostgreSQL kills whichever session conflicts with the startup process — so where did things go wrong?
PostgreSQL has two ways to kill a connection. The first is pg_cancel_backend(), equivalent to SIGINT: the gentle kill, which aborts the current statement but leaves the session alive. The second is pg_terminate_backend(), equivalent to SIGTERM: the forceful kill, which aborts the current transaction and terminates the session as well. When pg_cancel_backend() fails to do the job, pg_terminate_backend() is the usual follow-up.
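As a quick illustration of the two (the filter on query text is just an example — adapt it to the session you are chasing):

```sql
-- gentle: cancel the current statement, keep the session (SIGINT)
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND query ILIKE '%test%'
  AND pid <> pg_backend_pid();

-- forceful: terminate the whole session (SIGTERM)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND query ILIKE '%test%'
  AND pid <> pg_backend_pid();
```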
But — as you have probably run into yourself, and I have more than once — sometimes even pg_terminate_backend() doesn't work. Could it be that PostgreSQL hit the recovery conflict and simply failed to kill the session?
Let's check which method PostgreSQL uses when a recovery conflict occurs. The simplest way is to see whether the backend pid changes:
postgres=# select pg_backend_pid();
pg_backend_pid
----------------
2417
(1 row)
postgres=# select count(*) from test where id = 66666666;
ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
postgres=# select pg_backend_pid();
pg_backend_pid
----------------
2417
(1 row)
The pid did not change, so PostgreSQL uses the gentle method — a cancel, not a terminate.
Now the story starts to make sense: PostgreSQL hit the recovery conflict and kept trying to kill the session, but never succeeded, so replication kept falling behind. Indeed, when my friend called pg_cancel_backend() by hand it failed; pg_terminate_backend() worked, and the lag dropped right away — process 3121 below was remarkably stubborn.
Unfortunately only a perf profile was captured at the time, but it still tells us something: the stack contains procsignal_sigusr1_handler, a signal handler that sets the corresponding latch.
StartupProcessMain() -> procsignal_sigusr1_handler is where the conflict handling is wired up:
/* ----------------------------------
 * Startup Process main entry point
 * ----------------------------------
 */
void
StartupProcessMain(void)
{
    /* Arrange to clean up at startup process exit */
    on_shmem_exit(StartupProcExit, 0);

    /*
     * Properly accept or ignore signals the postmaster might send us.
     */
    pqsignal(SIGHUP, StartupProcSigHupHandler);    /* reload config file */
    pqsignal(SIGINT, SIG_IGN);                     /* ignore query cancel */
    pqsignal(SIGTERM, StartupProcShutdownHandler); /* request shutdown */
    /* SIGQUIT handler was already set up by InitPostmasterChild */
    InitializeTimeouts();       /* establishes SIGALRM handler */
    pqsignal(SIGPIPE, SIG_IGN);
    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
    pqsignal(SIGUSR2, StartupProcTriggerHandler);
    ...
static void
procsignal_sigusr1_handler(SIGNAL_ARGS)
{
    int save_errno = errno;

    if (CheckProcSignal(PROCSIG_CATCHUP_INTERRUPT))
        HandleCatchupInterrupt();
    if (CheckProcSignal(PROCSIG_NOTIFY_INTERRUPT))
        HandleNotifyInterrupt();
    if (CheckProcSignal(PROCSIG_PARALLEL_MESSAGE))
        HandleParallelMessageInterrupt();
    if (CheckProcSignal(PROCSIG_WALSND_INIT_STOPPING))
        HandleWalSndInitStopping();
    if (CheckProcSignal(PROCSIG_BARRIER))
        HandleProcSignalBarrierInterrupt();
    if (CheckProcSignal(PROCSIG_LOG_MEMORY_CONTEXT))
        HandleLogMemoryContextInterrupt();
    if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_DATABASE))
        RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_DATABASE);
    if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_TABLESPACE))
        RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_TABLESPACE);
    if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_LOCK))
        RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_LOCK);
    if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT))
        RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
    if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK))
        RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
    if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
        RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);

    SetLatch(MyLatch);

    errno = save_errno;
}
The WaitEventSetWait frames below it mean the process is waiting for one of the events in its wait-event set to fire, or for a timeout.
Higher in the stack there is also postgres __errno_location@plt.
A quick search: errno is a symbol defined by POSIX and ISO C; system calls and certain library functions set it to indicate what went wrong.
__errno_location — address of errno variable
The __errno_location() function shall return the address of the errno variable for the current thread.
__errno_location() is not in the source standard; it is only in the binary standard.
I asked a Linux-kernel friend about it:
__errno_location@plt returns the address of the current thread's errno — it is what the errno macro expands to, and plt refers to the procedure linkage table, a jump table.
So PostgreSQL apparently hit some error. On Linux, the header /usr/include/asm-generic/errno-base.h defines the common base errno values, including the familiar Out of memory:
#ifndef _ASM_GENERIC_ERRNO_BASE_H
#define _ASM_GENERIC_ERRNO_BASE_H

#define EPERM    1  /* Operation not permitted */
#define ENOENT   2  /* No such file or directory */
#define ESRCH    3  /* No such process */
#define EINTR    4  /* Interrupted system call */
#define EIO      5  /* I/O error */
#define ENXIO    6  /* No such device or address */
#define E2BIG    7  /* Argument list too long */
#define ENOEXEC  8  /* Exec format error */
#define EBADF    9  /* Bad file number */
#define ECHILD  10  /* No child processes */
#define EAGAIN  11  /* Try again */
#define ENOMEM  12  /* Out of memory */
#define EACCES  13  /* Permission denied */
#define EFAULT  14  /* Bad address */
#define ENOTBLK 15  /* Block device required */
#define EBUSY   16  /* Device or resource busy */
Going back to that unkillable session: it had been running for over 16 hours (query_time), was connected through navicat (application_name), and was waiting on ClientWrite — "Waiting to write data to the client." Most likely the client connection had dropped abnormally: the backend was stuck waiting to write to it, could not be killed, and sat there holding its snapshot, so the lag kept growing.
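Sessions stuck like this are easy to spot in pg_stat_activity (the one-hour threshold below is only an example):

```sql
SELECT pid, application_name, state,
       now() - query_start AS query_time,
       wait_event_type, wait_event
FROM pg_stat_activity
WHERE wait_event = 'ClientWrite'
  AND now() - query_start > interval '1 hour';
```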
I tried to reproduce this afterwards — closing the cloud host's security group mid-query, killing Eclipse, forcibly dropping the connection — but the backend always errored out normally. So far I haven't found a way to reproduce it.
Does old_snapshot_threshold have an effect?
Just as I was about to test whether the standby's old_snapshot_threshold affects the startup process, I noticed something interesting:
postgres=# show old_snapshot_threshold ;
old_snapshot_threshold
------------------------
0
(1 row)
postgres=# select count(*) from test where id = 66666666;
ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
postgres=# show old_snapshot_threshold ;
old_snapshot_threshold
------------------------
1h
(1 row)
postgres=# select count(*) from test where id = 66666666;
ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
And on yet another attempt, I found that PostgreSQL can also use terminate!
postgres=# show old_snapshot_threshold ;
old_snapshot_threshold
------------------------
1min
(1 row)
postgres=# select count(*),pg_sleep(120) from test where id = 66666666;
FATAL: terminating connection due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
postgres=# select pg_backend_pid();
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
What? So I dug into the source and found that there are exactly three places that terminate the connection:
/*
 * ProcessInterrupts: out-of-line portion of CHECK_FOR_INTERRUPTS() macro
 *
 * If an interrupt condition is pending, and it's safe to service it,
 * then clear the flag and accept the interrupt.  Called only when
 * InterruptPending is true.
 *
 * Note: if INTERRUPTS_CAN_BE_PROCESSED() is true, then ProcessInterrupts
 * is guaranteed to clear the InterruptPending flag before returning.
 * (This is not the same as guaranteeing that it's still clear when we
 * return; another interrupt could have arrived.  But we promise that
 * any pre-existing one will have been serviced.)
 */
void
ProcessInterrupts(void)
{
    /* OK to accept any interrupts now? */
    if (InterruptHoldoffCount != 0 || CritSectionCount != 0)
        return;
    InterruptPending = false;
    ...
    else if (RecoveryConflictPending && RecoveryConflictRetryable)
    {
        pgstat_report_recovery_conflict(RecoveryConflictReason);
        ereport(FATAL,
                (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                 errmsg("terminating connection due to conflict with recovery"),
                 errdetail_recovery_conflict()));
    }
    else if (RecoveryConflictPending)
    {
        /* Currently there is only one non-retryable recovery conflict */
        Assert(RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_DATABASE);
        pgstat_report_recovery_conflict(RecoveryConflictReason);
        ereport(FATAL,
                (errcode(ERRCODE_DATABASE_DROPPED),
                 errmsg("terminating connection due to conflict with recovery"),
                 errdetail_recovery_conflict()));
    }
    ...
    /*
     * If a recovery conflict happens while we are waiting for input from the
     * client, the client is presumably just sitting idle in a transaction,
     * preventing recovery from making progress.  Terminate the connection to
     * dislodge it.
     */
    if (RecoveryConflictPending && DoingCommandRead)
    {
        QueryCancelPending = false; /* this trumps QueryCancel */
        RecoveryConflictPending = false;
        LockErrorCleanup();
        pgstat_report_recovery_conflict(RecoveryConflictReason);
        ereport(FATAL,
                (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                 errmsg("terminating connection due to conflict with recovery"),
                 errdetail_recovery_conflict(),
                 errhint("In a moment you should be able to reconnect to the"
                         " database and repeat your command.")));
    }
ERRCODE_T_R_SERIALIZATION_FAILURE
Going by the ERRCODE alone, the first case uses ERRCODE_T_R_SERIALIZATION_FAILURE, the code behind "could not serialize access due to concurrent update" — a serialization failure that only occurs at isolation levels above Read Committed, i.e. Repeatable Read and Serializable. My session was running at the default Read Committed level, and my error message contained the telltale hint "In a moment you should be able to reconnect to the database and repeat your command.", which this first branch does not emit — so it was not this one.
ERRCODE_DATABASE_DROPPED
The second is ERRCODE_DATABASE_DROPPED — a dropped database? Easy to try: drop the database on the primary while the standby query is running.
postgres=# \c mydb
You are now connected to database "mydb" as user "postgres".
mydb=# drop database postgres WITH (FORCE);
DROP DATABASE
The standby connection is indeed terminated. That is easy to understand — the database is gone, so there is nothing left to connect to. But it doesn't match our case either:
postgres=# select count(*) from test where id = 66666666;
FATAL: terminating connection due to conflict with recovery
DETAIL: User was connected to a database that must be dropped.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Time: 18310.956 ms (00:18.311)
DoingCommandRead
That leaves DoingCommandRead, which is more puzzling: if a recovery conflict arrives while the backend is waiting to read input from the client, the connection is terminated.
/*
* Flag to indicate that we are doing the outer loop's read-from-client,
* as opposed to any random read from client that might happen within
* commands like COPY FROM STDIN.
*/
static bool DoingCommandRead = false;
This was the only terminate event in the log,
and the log also printed:
In a moment you should be able to reconnect to the database and repeat your command.
So, going by the source, it seems I had just stumbled into this third case. In practice it appears to be quite rare — I retried many times without hitting it again.
There is one more scenario, however — the Repeatable Read one — and it can be reproduced reliably. Here is an example.
Create a table on the primary:
postgres=# create table test1(id int,info text, crt_time timestamp);
CREATE TABLE
postgres=# insert into test1 select 1,md5(random()::text),now() from generate_series(1,10000000);
INSERT 0 10000000
Then open a REPEATABLE READ transaction on the standby. Note that hot_standby_feedback must be on at this point, so the primary will not reclaim the row versions yet:
postgres=# begin transaction isolation level repeatable read;
BEGIN
postgres=*# select count(*) from test1;
count
----------
10000000
(1 row)
Now update the rows on the primary and vacuum:
postgres=# update test1 set info=info;
UPDATE 10000000
postgres=# vacuum test1;
VACUUM
Then turn hot_standby_feedback off on the standby, reload, and run the query again — the conflict reproduces:
[postgres@xiongcc ~]$ pg_ctl -D standby_data/ reload
server signaled
[postgres@xiongcc ~]$ psql -p 5433
psql (14.2)
Type "help" for help.
postgres=# begin transaction isolation level repeatable read;
BEGIN
postgres=*# select count(*) from test1;
count
----------
10000000
(1 row)
postgres=*# select * from test1 limit 10;
id | info | crt_time
----+----------------------------------+----------------------------
1 | 1ce80e68af1a170ee1416798d0653c5f | 2022-04-29 18:05:16.247001
1 | 86b4f0ac2e5364d3ad7407234fac6566 | 2022-04-29 18:05:16.247001
1 | 27e3f5e32014b7d9a92a539a166a2b7c | 2022-04-29 18:05:16.247001
1 | 8a3c45e3536722296704aa6d981a236c | 2022-04-29 18:05:16.247001
1 | 0b7f707d69678a6c1a945ed32453c9e3 | 2022-04-29 18:05:16.247001
1 | 0912d20bd74dc6a04d67bd8636dc0425 | 2022-04-29 18:05:16.247001
1 | 2ae8466239b58bd820fdaf520cb07672 | 2022-04-29 18:05:16.247001
1 | 6f07759afd87be0a26867abc87223a08 | 2022-04-29 18:05:16.247001
1 | d0be573c4d18657f794d6603cfd1b923 | 2022-04-29 18:05:16.247001
1 | dd7beac08ad56aa6691244aebd6f9c8b | 2022-04-29 18:05:16.247001
(10 rows)
postgres=*# select * from test1 limit 10;
FATAL: terminating connection due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
As you can see, once hot_standby_feedback is off, the primary no longer stays its hand: it cleans up the row versions the standby still needs, and the standby connection is promptly terminated.
Summary
So, to pull it all together:
• By default, when PostgreSQL hits a recovery conflict, it gracefully cancels the conflicting query; it does not terminate the connection.
• At Repeatable Read or above, a recovery conflict terminates the connection.
• Dropping the database also terminates the connection — though in that case, start polishing your resume.
• The graceful cancel can fail for all sorts of improbable reasons — as in today's case — leaving replication ever further behind, so replication-delay monitoring is essential.
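A simple delay-and-conflict check to run on the standby might look like this (thresholds and alerting are left to you):

```sql
-- replay lag on the standby
SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                       pg_last_wal_replay_lsn()) AS replay_lag_bytes,
       now() - pg_last_xact_replay_timestamp()  AS replay_delay;

-- cumulative recovery-conflict counts per database
SELECT datname, confl_snapshot, confl_lock, confl_bufferpin,
       confl_deadlock, confl_tablespace
FROM pg_stat_database_conflicts;
```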
Sometimes you will find that neither pg_cancel_backend() nor pg_terminate_backend() works — much like the big "D" processes in Linux: TASK_UNINTERRUPTIBLE, uninterruptible sleep. That state usually means the process is waiting on I/O — disk, network, or some other device. If the I/O it is waiting for goes unanswered long enough for ps to catch it in that state, there is very likely a real I/O problem: a failing device, or something like a mounted remote filesystem that has become unreachable.
A process which ends up in "D" state for any measurable length of time is trapped in the midst of a system call (usually an I/O operation on a device — thus the initial "D" in the ps output).
Such a process cannot be killed — it would risk leaving the kernel in an inconsistent state, leading to a panic. In general you can consider this to be a bug in the device driver that the process is accessing.
So the honest approach is to dig into why the backend cannot be killed: grab a stack trace, run strace, and read the comment in src/include/miscadmin.h. The backend is effectively in an uninterruptible section — HOLD_INTERRUPTS() guarantees that no cancel or die interrupt will be serviced, even if CHECK_FOR_INTERRUPTS() is reached, until the matching RESUME_INTERRUPTS().
/*****************************************************************************
* System interrupt and critical section handling
*
* There are two types of interrupts that a running backend needs to accept
* without messing up its state: QueryCancel (SIGINT) and ProcDie (SIGTERM).
* In both cases, we need to be able to clean up the current transaction
* gracefully, so we can't respond to the interrupt instantaneously ---
* there's no guarantee that internal data structures would be self-consistent
* if the code is interrupted at an arbitrary instant. Instead, the signal
* handlers set flags that are checked periodically during execution.
*
* The CHECK_FOR_INTERRUPTS() macro is called at strategically located spots
* where it is normally safe to accept a cancel or die interrupt. In some
* cases, we invoke CHECK_FOR_INTERRUPTS() inside low-level subroutines that
* might sometimes be called in contexts that do *not* want to allow a cancel
* or die interrupt. The HOLD_INTERRUPTS() and RESUME_INTERRUPTS() macros
* allow code to ensure that no cancel or die interrupt will be accepted,
* even if CHECK_FOR_INTERRUPTS() gets called in a subroutine. The interrupt
* will be held off until CHECK_FOR_INTERRUPTS() is done outside any
* HOLD_INTERRUPTS() ... RESUME_INTERRUPTS() section.