PostgreSQL:“ FATAL: requested WAL segment 0000000800002A0000000000 has already been removed”
http://francs3.blog.163.com/blog/static/4057672720136225542598/
昨天,一重要生产环境的备库主机由于硬件故障需要停机做硬件检测,由于是流复制环境,
备库可以停,停机检测大概花了 2 小时左右,之后再次启动备库时,报了如下错误:
--csv 日志
2013-07-01 13:25:29.430 CST,,,27738,,51d112c8.6c5a,1,,2013-07-01 13:25:28 CST,,0,LOG,00000,"streaming replication successfully connected to primary",,,,,,,,"libpqrcv_connect, libpqwalreceiver.c:171",""
2013-07-01 13:25:29.430 CST,,,27738,,51d112c8.6c5a,2,,2013-07-01 13:25:28 CST,,0,FATAL,XX000,"could not receive data from WAL stream: FATAL: requested WAL segment 0000000800002A0000000000 has already been removed
",,,,,,,,"libpqrcv_receive, libpqwalreceiver.c:389",""
备注:根据报错信息,很容易知道是由于停机时间的过程中备库所需的 WAL 已经被主库循环使用覆盖了,而在备库
停机维护过程中,主库并未打开归档,这时这个备库需要重做了。也许有人会问,为何不一直打开主库的归档,
我想说的是,这个库在 TB 级而且比较繁忙,忙的时候一天的归档 600 GB左右,这么大的归档需要大量的存储。
但是在备库停机维护过程中,建议主库打开归档,只要不把归档目录撑满,那么在备库重新恢复后,有了主库
的归档,那么备库依然能够跟上主库,为了加深理解,下面模拟这个错误,并演示规避方法:
一 环境信息
主机: 笔记本虚拟机
系统: Red Hat Enterprise Linux Server release 6.2
版本: PostgreSQL 9.3beta1
主库IP: 192.168.1.36 主机名:redhatB
备库IP: 192.168.1.35 主机名: redhat6
备注: 流复制搭建过程略,参考 PostgreSQL:使用 pg_basebackup 搭建流复制环境
二 模拟过程
--2.1 设置主库 postgresql.conf
为了容易出演示效果,设置以下参数,其它参数根据需求设置:
|
checkpoint_segments = 3
archive_mode = on
archive_command = 'cp %p /archive/pg93/%f'
max_wal_senders = 3
wal_keep_segments = 3
max_wal_senders = 3
|
--2.2 重载配置文件
|
[pg93@redhatB pg_root]$ pg_ctl reload -D $PGDATA
server signaled
|
--2.3 测试前: 主库数据
|
[pg93@redhatB ~]$ psql
psql (9.3beta1)
Type "help" for help.
postgres=# select * from test_1;
id | create_time
----+---------------------
1 | 2013-07-01 21:15:34
2 | 2013-07-01 21:55:37
3 | 2013-07-01 22:01:18
(3 rows)
|
--2.4 测试前: 备库数据
|
[pg93@redhat6 ~]$ psql
psql (9.3beta1)
Type "help" for help.
postgres=# select * from test_1;
id | create_time
----+---------------------
1 | 2013-07-01 21:15:34
2 | 2013-07-01 21:55:37
3 | 2013-07-01 22:01:18
(3 rows)
|
--2.5 停备库
|
[pg93@redhat6 ~]$ pg_ctl stop -m fast -D $PGDATA
waiting for server to shut down.... done
server stopped
|
--2.6 在主库上执行以下操作
|
postgres=# insert into test_1 values (5,now());
INSERT 0 1
postgres=# select pg_switch_xlog();
pg_switch_xlog
----------------
1/310000AC
(1 row)
postgres=# insert into test_1 values (5,now());
INSERT 0 1
postgres=# select pg_switch_xlog();
pg_switch_xlog
----------------
1/320004D0
(1 row)
postgres=# insert into test_1 values (5,now());
INSERT 0 1
postgres=# select pg_switch_xlog();
pg_switch_xlog
----------------
1/330000AC
(1 row)
.....
便于显示,重复内容就不贴了。
|
--2.7 查看归档目录
|
[pg93@redhatB pg93]$ ll /archive/pg93/ | wc -l
36
|
--2.8 启动备库
|
[pg93@redhat6 ~]$ pg_ctl start -D $PGDATA
server starting
|
--2.9 csv 日志
|
[pg93@redhat6 pg_log]$ tail -f postgresql-2013-07-02_062642.csv
",,,,,,,,,""
2013-07-02 06:27:17.672 CST,,,3704,,51d20245.e78,1,,2013-07-02 06:27:17 CST,,0,LOG,00000,"started streaming WAL from primary at 1/22000000 on timeline 1",,,,,,,,,""
2013-07-02 06:27:17.674 CST,,,3704,,51d20245.e78,2,,2013-07-02 06:27:17 CST,,0,FATAL,XX000,"could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000100000022 has already been removed
|
因为 XLOG 文件已被循环使用覆盖了,但在归档目录 /archive/pg93 里可以找到。
--2.10 复制主库归档目录的 WAL 到备节点
| [pg93@redhatB pg93]$ scp /archive/pg93/* pg93@192.168.1.35:/archive/pg93 |
--2.11 修改备库 recovery.conf 文件的以下参数
| restore_command = 'cp /archive/pg93/%f %p' |
--2.12 重启备库
|
[pg93@redhat6 pg_root]$ pg_ctl stop -m fast -D $PGDATA
waiting for server to shut down.... done
server stopped
[pg93@redhat6 pg_root]$ pg_ctl start -D $PGDATA
server starting
|
--2.13 查看备库日志
|
[pg93@redhat6 pg_log]$ tail -f postgresql-2013-07-02_063606.csv
2013-07-02 06:36:08.123 CST,,,4008,,51d20456.fa8,4,,2013-07-02 06:36:06 CST,1/0,0,LOG,00000,"redo starts at 1/200002BC",,,,,,,,,""
2013-07-02 06:36:08.592 CST,,,4008,,51d20456.fa8,5,,2013-07-02 06:36:06 CST,1/0,0,LOG,00000,"restored log file ""000000010000000100000021"" from archive",,,,,,,,,""
2013-07-02 06:36:08.618 CST,,,4008,,51d20456.fa8,6,,2013-07-02 06:36:06 CST,1/0,0,LOG,00000,"consistent recovery state reached at 1/22000000",,,,,,,,,""
2013-07-02 06:36:08.630 CST,,,4006,,51d20456.fa6,1,,2013-07-02 06:36:06 CST,,0,LOG,00000,"database system is ready to accept read only connections",,,,,,,,,""
2013-07-02 06:36:10.014 CST,,,4008,,51d20456.fa8,7,,2013-07-02 06:36:06 CST,1/0,0,LOG,00000,"restored log file ""000000010000000100000022"" from archive",,,,,,,,,""
2013-07-02 06:36:10.800 CST,,,4008,,51d20456.fa8,8,,2013-07-02 06:36:06 CST,1/0,0,LOG,00000,"restored log file ""000000010000000100000023"" from archive",,,,,,,,,""
2013-07-02 06:36:12.139 CST,,,4008,,51d20456.fa8,9,,2013-07-02 06:36:06 CST,1/0,0,LOG,00000,"restored log file ""000000010000000100000024"" from archive",,,,,,,,,""
2013-07-02 06:36:13.937 CST,,,4008,,51d20456.fa8,10,,2013-07-02 06:36:06 CST,1/0,0,LOG,00000,"restored log file ""000000010000000100000025"" from archive",,,,,,,,,""
....
省略部分内容
2013-07-02 06:36:47.400 CST,,,4059,,51d2047f.fdb,1,,2013-07-02 06:36:47 CST,,0,LOG,00000,"started streaming WAL from primary at 1/43000000 on timeline 1",,,,,,,,,""
|
已完全追上主库。
三 验证
--3.1 主库
|
[pg93@redhatB ~]$ psql
psql (9.3beta1)
Type "help" for help.
postgres=# select * from test_1 order by create_time desc limit 3;
id | create_time
----+---------------------
5 | 2013-07-02 06:38:07
5 | 2013-07-02 06:26:11
5 | 2013-07-02 06:26:09
(3 rows)
postgres=# select count(*) from test_1;
count
-------
39
(1 row)
|
--3.2 备库
|
[pg93@redhat6 pg_root]$ psql
psql (9.3beta1)
Type "help" for help.
postgres=# select * from test_1 order by create_time desc limit 3;
id | create_time
----+---------------------
5 | 2013-07-02 06:38:07
5 | 2013-07-02 06:26:11
5 | 2013-07-02 06:26:09
(3 rows)
postgres=# select count(*) from test_1;
count
-------
39
(1 row)
|
四 总结
1 对于比较繁忙的库,建议给 pg_xlog 分配较大的存储,从而能保留较多的 WAL 文件,在备节点
停机维护后,能够获得更多的停库时间。
2 对于更繁忙的大数据库,例如 TB 级,如果没有足够的存储长期开启归档,那么至少在备节点需要
停库维护时的这段时间把归档开启,否则当备库启来后追不上主库时,TB 级库重做 standby 的代
价是可想而知。
五 参考