1 概述
DBA关闭所有数据库实例,但未关闭OGG,重启OGG所有进程,抽取进程,MGR进程、大部分复制进程都恢复正常,但还有几个进程的Time Since Chkpt 时间一直处于增加,等待一个小时,重启MGR及相关进程,进行观察;再次等待15小时,发现Time Since Chkpt还是处于增加状态,OGG没有应用任何数据,故认定由于数据库重启,导致OGG的REPLICAT进程异常,处于假死状态。
2 OGG进程状态查看
Program Status Group Lag at Chkpt Time Since Chkpt
REPLICAT RUNNING repaa 00:00:05 00:12:02
REPLICAT RUNNING repaa 00:00:05 00:44:50
GGSCI (testdb3) 25> info repaa
REPLICAT repaa Last Started 2022-03-30 17:05 Status RUNNING
Checkpoint Lag 00:0 0:05 (updated 01:04:53 ago)
Log Read Checkpoint File ./dirdat/aa559442
2022-03-30 16:53:00.000921 RBA 1169746574
查看report 报告,已经读取到559446文件了
Opened trail file ./dirdat/aa559442 at 2022-03-30 17:05:01
Switching to next trail file ./dirdat/aa559443 at 2022-03-30 17:05:02 due to EOF, with current RBA 1999998495
Opened trail file ./dirdat/aa559443 at 2022-03-30 17:05:02
Switching to next trail file ./dirdat/aa559444 at 2022-03-30 17:05:43 due to EOF, with current RBA 1999998562
Opened trail file ./dirdat/aa559444 at 2022-03-30 17:05:43
Switching to next trail file ./dirdat/aa559445 at 2022-03-30 17:07:33 due to EOF, with current RBA 1999998898
Opened trail file ./dirdat/aa559445 at 2022-03-30 17:07:33
Switching to next trail file ./dirdat/aa559446 at 2022-03-30 17:12:51 due to EOF, with current RBA 1999998960
Opened trail file ./dirdat/aa559446 at 2022-03-30 17:12:51
查看进程状态,目前进程读取trail文件到559446,但事务处理的数据为0
GGSCI (testdb3) 9> send repaa status
Sending STATUS request to REPLICAT repaa ...
Current status: Processing data
Sequence #: 559446
RBA: 876964647
0 records in current transaction
根据如上信息,暂时认为OGG进程是正常的,等待1小时后,发现时间还是处于增长状态,
根据经验,重启MGR及此进程,一般进程就会恢复正常。
2 重启相关进程
GGSCI (testdb3) 24> start repaa
Sending START request to MANAGER ...
REPLICAT repaa starting
GGSCI (testdb3) 25> info repaa
REPLICAT repaa Last Started 2022-03-30 17:57 Status RUNNING
Checkpoint Lag 00:00:05 (updated 01:04:53 ago)
Log Read Checkpoint File ./dirdat/aa559442
2022-03-30 16:53:00.000921 RBA 1169746574
3 等待15小时,发现此进程延迟还是增加,如下
REPLICAT RUNNING repaa 00:00:05 15:42:26
4 查看OGG进程执行的语句,发现一直执行SELECT COUNT (DISTINCT thread#) FROM v$log
,不进行任何其它的SQL
SYS@testdb3 > SELECT /*+ ORDERED */ sql_id,sql_text FROM v$sqltext a WHERE (a.hash_value, a.address) IN
(SELECT DECODE(sql_hash_value, 0, prev_hash_value, sql_hash_value),
DECODE(sql_hash_value, 0, prev_sql_addr, sql_address)
FROM v$session b WHERE b.sid='&SID' and b.SERIAL#='&SERIAL') order by piece asc;
Enter value for sid: 2103
Enter value for serial: 5
old 1: SELECT /*+ ORDERED */ sql_id,sql_text FROM v$sqltext a WHERE (a.hash_value, a.address) IN
(SELECT DECODE(sql_hash_value, 0, prev_hash_value, sql_hash_value),
DECODE(sql_hash_value, 0, prev_sql_addr, sql_address)
FROM v$session b WHERE b.sid='&SID' and b.SERIAL#='&SERIAL') order by piece asc
new 1: SELECT /*+ ORDERED */ sql_id,sql_text FROM v$sqltext a WHERE (a.hash_value, a.address) IN
(SELECT DECODE(sql_hash_value, 0, prev_hash_value, sql_hash_value),
DECODE(sql_hash_value, 0, prev_sql_addr, sql_address)
FROM v$session b WHERE b.sid='2103' and b.SERIAL#='5') order by piece asc
SQL_ID SQL_TEXT
------------------------------ ----------------------------------------------------------------
gnxg91rmnxtdj SELECT COUNT (DISTINCT thread#) FROM v$log
5 对进程做相关trace,未发现任何异常,如下:
send repaa,trace2 ./dirtmp/repaa_20220331.log
09:20:27.875 (58526527) General statistics:
0.00% Checking messages (includes checkpointing)
0.00% Checking periodic tasks
0.00% Waiting for more data
0.00% Converting ASCII header to internal
0.00% Converting ASCII data to internal
0.00% Reading input records
0.00% Writing output records (replicate_io)
0.00% Mapping columns
0.00% Outputting data records
0.00% Performing SQL statements
0.00% Performing BATCHSQL statements
0.00% Performing actual DB op
0.00% Preparing SQL statements
0.00% Performing transaction commits
0.00% Checkpointing
6 查找官方文档及各大网站,未发现解决方法,经过分析,认为有可能是数据库关闭,
导致OGG某些 程序异常,如果重置RBA,一般OGG会从某个位置重新开始录入数据,
那我指定相同的EXTSEQNO和 RBA号,是不OGG也会恢复正常?
进行测试,重新执行相同的RBA号, 验证进程是否正常, 发现OGG进程恢复正常。
GGSCI (testdb3) 77> info repaa
REPLICAT repaa Last Started 2022-03-30 17:57 Status RUNNING
Checkpoint Lag 00:00:05 (updated 15:47:14 ago)
Log Read Checkpoint File ./dirdat/aa559442
2022-03-30 16:53:00.000921 RBA 1169746574
由于不能正常关闭OGG进程,需要在操作系统层面KILL,KILL完OGG进程后,执行如下名
GGSCI (testdb3) 18> alter repaa extseqno 559442 extrba 1169746574
REPLICAT altered.
GGSCI (testdb3) 19> info repaa
REPLICAT repaa Initialized 2022-03-31 09:32 Status STOPPED
Checkpoint Lag 00:00:00 (updated 00:00:10 ago)
Log Read Checkpoint File ./dirdat/aa559442
First Record RBA 1169746574
经过几分钟的等待,OGG恢复正常,如下:
REPLICAT RUNNING repaa 09:40:41 00:00:04
等待几小时,延迟变为0:
REPLICAT RUNNING repaa 00:00:04 00:00:04
GGSCI (testdb3) 73> info repaa
REPLICAT repaa Last Started 2022-03-31 08:59 Status RUNNING
Checkpoint Lag 00:00:03 (updated 00:00:06 ago)
Log Read Checkpoint File ./dirdat/aa559881
2022-03-31 15:53:58.000975 RBA 14563502