PostgreSQL：“ FATAL: requested WAL segment 0000000800002A0000000000 has already been removed”

http://francs3.blog.163.com/blog/static/4057672720136225542598/

昨天，一重要生产环境的备库主机由于硬件故障需要停机做硬件检测，由于是流复制环境，

备库可以停，停机检测大概花了 2 小时左右，之后再次启动备库时，报了如下错误：

--csv 日志

2013-07-01 13:25:29.430 CST,,,27738,,51d112c8.6c5a,1,,2013-07-01 13:25:28 CST,,0,LOG,00000,"streaming replication successfully connected to primary",,,,,,,,"libpqrcv_connect, libpqwalreceiver.c:171",""

2013-07-01 13:25:29.430 CST,,,27738,,51d112c8.6c5a,2,,2013-07-01 13:25:28 CST,,0,FATAL,XX000,"could not receive data from WAL stream: FATAL: requested WAL segment 0000000800002A0000000000 has already been removed

",,,,,,,,"libpqrcv_receive, libpqwalreceiver.c:389",""

备注：根据报错信息，很容易知道是由于停机时间的过程中备库所需的 WAL 已经被主库循环使用覆盖了，而在备库

停机维护过程中，主库并未打开归档，这时这个备库需要重做了。也许有人会问，为何不一直打开主库的归档，

我想说的是，这个库在 TB 级而且比较繁忙，忙的时候一天的归档 600 GB左右，这么大的归档需要大量的存储。

但是在备库停机维护过程中，建议主库打开归档，只要不把归档目录撑满，那么在备库重新恢复后，有了主库

的归档，那么备库依然能够跟上主库，为了加深理解，下面模拟这个错误，并演示规避方法：

一环境信息

主机：笔记本虚拟机

系统： Red Hat Enterprise Linux Server release 6.2

版本： PostgreSQL 9.3beta1

主库IP： 192.168.1.36 主机名：redhatB

备库IP： 192.168.1.35 主机名: redhat6

备注：流复制搭建过程略，参考 PostgreSQL：使用 pg_basebackup 搭建流复制环境

二模拟过程

--2.1 设置主库 postgresql.conf

为了容易出演示效果，设置以下参数，其它参数根据需求设置：

checkpoint_segments = 3

archive_mode = on

archive_command = 'cp %p /archive/pg93/%f'

max_wal_senders = 3

wal_keep_segments = 3

max_wal_senders = 3

备注：归档目录 /archive/pg93/ 需要创建好并赋相应权限。

--2.2 重载配置文件

[pg93@redhatB pg_root]$ pg_ctl reload -D $PGDATA

server signaled

备注：参数修改好后需要 reload 。

--2.3 测试前: 主库数据

[pg93@redhatB ~]$ psql

psql (9.3beta1)

Type "help" for help.

postgres=# select * from test_1;

id | create_time

----+---------------------

1 | 2013-07-01 21:15:34

2 | 2013-07-01 21:55:37

3 | 2013-07-01 22:01:18

(3 rows)

--2.4 测试前: 备库数据

[pg93@redhat6 ~]$ psql

psql (9.3beta1)

Type "help" for help.

postgres=# select * from test_1;

id | create_time

----+---------------------

1 | 2013-07-01 21:15:34

2 | 2013-07-01 21:55:37

3 | 2013-07-01 22:01:18

(3 rows)

备注：这里先标记下数据，以便后面做对比。

--2.5 停备库

[pg93@redhat6 ~]$ pg_ctl stop -m fast -D $PGDATA

waiting for server to shut down.... done

server stopped

--2.6 在主库上执行以下操作

postgres=# insert into test_1 values (5,now());

INSERT 0 1

postgres=# select pg_switch_xlog();

pg_switch_xlog

----------------

1/310000AC

(1 row)

postgres=# insert into test_1 values (5,now());

INSERT 0 1

postgres=# select pg_switch_xlog();

pg_switch_xlog

----------------

1/320004D0

(1 row)

postgres=# insert into test_1 values (5,now());

INSERT 0 1

postgres=# select pg_switch_xlog();

pg_switch_xlog

----------------

1/330000AC

(1 row)

.....

便于显示，重复内容就不贴了。

--2.7 查看归档目录

[pg93@redhatB pg93]$ ll /archive/pg93/ | wc -l

36

备注：这时归档目录就产生了一些归档的 WAL 日志。

--2.8 启动备库

[pg93@redhat6 ~]$ pg_ctl start -D $PGDATA

server starting

备注：数据库能启来，但查看日志，报以下错误：

--2.9 csv 日志

[pg93@redhat6 pg_log]$ tail -f postgresql-2013-07-02_062642.csv

",,,,,,,,,""

2013-07-02 06:27:17.672 CST,,,3704,,51d20245.e78,1,,2013-07-02 06:27:17 CST,,0,LOG,00000,"started streaming WAL from primary at 1/22000000 on timeline 1",,,,,,,,,""

2013-07-02 06:27:17.674 CST,,,3704,,51d20245.e78,2,,2013-07-02 06:27:17 CST,,0,FATAL,XX000,"could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000100000022 has already been removed

备注：目标的错误已重现，在主库的 $PGDATA/pg_xlog 目录里已找不到 000000010000000100000022 文件了，

因为 XLOG 文件已被循环使用覆盖了，但在归档目录 /archive/pg93 里可以找到。

--2.10 复制主库归档目录的 WAL 到备节点

[pg93@redhatB pg93]$ scp /archive/pg93/* pg93@192.168.1.35:/archive/pg93

备注：这时把主节点上归档目录的 WAL 文件复制到备节点的归档目录。

--2.11 修改备库 recovery.conf 文件的以下参数

restore_command = 'cp /archive/pg93/%f %p'

备注：其它参数略, 参考 PostgreSQL：使用 pg_basebackup 搭建流复制环境

--2.12 重启备库

[pg93@redhat6 pg_root]$ pg_ctl stop -m fast -D $PGDATA

waiting for server to shut down.... done

server stopped

[pg93@redhat6 pg_root]$ pg_ctl start -D $PGDATA

server starting

--2.13 查看备库日志

[pg93@redhat6 pg_log]$ tail -f postgresql-2013-07-02_063606.csv

2013-07-02 06:36:08.123 CST,,,4008,,51d20456.fa8,4,,2013-07-02 06:36:06 CST,1/0,0,LOG,00000,"redo starts at 1/200002BC",,,,,,,,,""

2013-07-02 06:36:08.592 CST,,,4008,,51d20456.fa8,5,,2013-07-02 06:36:06 CST,1/0,0,LOG,00000,"restored log file ""000000010000000100000021"" from archive",,,,,,,,,""

2013-07-02 06:36:08.618 CST,,,4008,,51d20456.fa8,6,,2013-07-02 06:36:06 CST,1/0,0,LOG,00000,"consistent recovery state reached at 1/22000000",,,,,,,,,""

2013-07-02 06:36:08.630 CST,,,4006,,51d20456.fa6,1,,2013-07-02 06:36:06 CST,,0,LOG,00000,"database system is ready to accept read only connections",,,,,,,,,""

2013-07-02 06:36:10.014 CST,,,4008,,51d20456.fa8,7,,2013-07-02 06:36:06 CST,1/0,0,LOG,00000,"restored log file ""000000010000000100000022"" from archive",,,,,,,,,""

2013-07-02 06:36:10.800 CST,,,4008,,51d20456.fa8,8,,2013-07-02 06:36:06 CST,1/0,0,LOG,00000,"restored log file ""000000010000000100000023"" from archive",,,,,,,,,""

2013-07-02 06:36:12.139 CST,,,4008,,51d20456.fa8,9,,2013-07-02 06:36:06 CST,1/0,0,LOG,00000,"restored log file ""000000010000000100000024"" from archive",,,,,,,,,""

2013-07-02 06:36:13.937 CST,,,4008,,51d20456.fa8,10,,2013-07-02 06:36:06 CST,1/0,0,LOG,00000,"restored log file ""000000010000000100000025"" from archive",,,,,,,,,""

....

省略部分内容

2013-07-02 06:36:47.400 CST,,,4059,,51d2047f.fdb,1,,2013-07-02 06:36:47 CST,,0,LOG,00000,"started streaming WAL from primary at 1/43000000 on timeline 1",,,,,,,,,""

备注：此时备库从归档目录取到所需的 WAL 后开始拼命追主库，直到看到上面日志的最后一行信息时表示

已完全追上主库。

三验证

--3.1 主库

[pg93@redhatB ~]$ psql

psql (9.3beta1)

Type "help" for help.

postgres=# select * from test_1 order by create_time desc limit 3;

id | create_time

----+---------------------

5 | 2013-07-02 06:38:07

5 | 2013-07-02 06:26:11

5 | 2013-07-02 06:26:09

(3 rows)

postgres=# select count(*) from test_1;

count

-------

39

(1 row)

--3.2 备库

[pg93@redhat6 pg_root]$ psql

psql (9.3beta1)

Type "help" for help.

postgres=# select * from test_1 order by create_time desc limit 3;

id | create_time

----+---------------------

5 | 2013-07-02 06:38:07

5 | 2013-07-02 06:26:11

5 | 2013-07-02 06:26:09

(3 rows)

postgres=# select count(*) from test_1;

count

-------

39

(1 row)

备注：此时在备库停掉过程中新增的数据也已经同步到备库了，恢复成功。

四总结

1 对于比较繁忙的库，建议给 pg_xlog 分配较大的存储，从而能保留较多的 WAL 文件，在备节点

停机维护后，能够获得更多的停库时间。

2 对于更繁忙的大数据库，例如 TB 级，如果没有足够的存储长期开启归档，那么至少在备节点需要

停库维护时的这段时间把归档开启，否则当备库启来后追不上主库时，TB 级库重做 standby 的代

价是可想而知。

五参考

PostgreSQL：“ FATAL: requested WAL segment00800002A0 has already been removed”

PostgreSQL：“ FATAL: requested WAL segment 0000000800002A0000000000 has already been removed”