OGG的replicat进程的Time Since Chkpt一直增加,进程处于假死状态

1 概述

DBA关闭所有数据库实例,但未关闭OGG,重启OGG所有进程,抽取进程,MGR进程、大部分复制进程都恢复正常,但还有几个进程的Time Since Chkpt 时间一直处于增加,等待一个小时,重启MGR及相关进程,进行观察;再次等待15小时,发现Time Since Chkpt还是处于增加状态,OGG没有应用任何数据,故认定由于数据库重启,导致OGG的REPLICAT进程异常,处于假死状态。


2 OGG进程状态查看

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

REPLICAT    RUNNING     repaa    00:00:05      00:12:02  


REPLICAT    RUNNING     repaa    00:00:05      00:44:50   


GGSCI (testdb3) 25> info repaa


REPLICAT   repaa  Last Started 2022-03-30 17:05   Status RUNNING

Checkpoint Lag       00:0 0:05 (updated 01:04:53 ago)

Log Read Checkpoint  File ./dirdat/aa559442

                     2022-03-30 16:53:00.000921  RBA 1169746574

                     

查看report 报告,已经读取到559446文件了

Opened trail file ./dirdat/aa559442 at 2022-03-30 17:05:01


Switching to next trail file ./dirdat/aa559443 at 2022-03-30 17:05:02 due to EOF, with current RBA 1999998495

Opened trail file ./dirdat/aa559443 at 2022-03-30 17:05:02



Switching to next trail file ./dirdat/aa559444 at 2022-03-30 17:05:43 due to EOF, with current RBA 1999998562

Opened trail file ./dirdat/aa559444 at 2022-03-30 17:05:43



Switching to next trail file ./dirdat/aa559445 at 2022-03-30 17:07:33 due to EOF, with current RBA 1999998898

Opened trail file ./dirdat/aa559445 at 2022-03-30 17:07:33



Switching to next trail file ./dirdat/aa559446 at 2022-03-30 17:12:51 due to EOF, with current RBA 1999998960

Opened trail file ./dirdat/aa559446 at 2022-03-30 17:12:51



查看进程状态,目前进程读取trail文件到559446,但事务处理的数据为0

GGSCI (testdb3) 9> send repaa status


Sending STATUS request to REPLICAT repaa ...

  Current status: Processing data

  Sequence #: 559446

  RBA: 876964647

  0 records in current transaction


根据如上信息,暂时认为OGG进程是正常的,等待1小时后,发现时间还是处于增长状态,

根据经验,重启MGR及此进程,一般进程就会恢复正常。


2 重启相关进程


GGSCI (testdb3) 24> start repaa


Sending START request to MANAGER ...

REPLICAT repaa starting



GGSCI (testdb3) 25> info repaa


REPLICAT   repaa  Last Started 2022-03-30 17:57   Status RUNNING

Checkpoint Lag       00:00:05 (updated 01:04:53 ago)

Log Read Checkpoint  File ./dirdat/aa559442

                     2022-03-30 16:53:00.000921  RBA 1169746574


3 等待15小时,发现此进程延迟还是增加,如下


REPLICAT    RUNNING     repaa    00:00:05      15:42:26  


4 查看OGG进程执行的语句,发现一直执行SELECT COUNT (DISTINCT thread#)  FROM v$log

,不进行任何其它的SQL


SYS@testdb3 > SELECT /*+ ORDERED */ sql_id,sql_text  FROM v$sqltext a WHERE (a.hash_value, a.address) IN      

 (SELECT DECODE(sql_hash_value, 0, prev_hash_value, sql_hash_value),               

DECODE(sql_hash_value, 0, prev_sql_addr, sql_address)     

      FROM v$session b         WHERE b.sid='&SID'  and b.SERIAL#='&SERIAL') order by piece asc; 

Enter value for sid: 2103

Enter value for serial: 5

old   1:  SELECT /*+ ORDERED */ sql_id,sql_text  FROM v$sqltext a WHERE (a.hash_value, a.address) IN   

    (SELECT DECODE(sql_hash_value, 0, prev_hash_value, sql_hash_value),     

DECODE(sql_hash_value, 0, prev_sql_addr, sql_address)   

            FROM v$session b         WHERE b.sid='&SID'  and b.SERIAL#='&SERIAL') order by piece asc

new   1:  SELECT /*+ ORDERED */ sql_id,sql_text  FROM v$sqltext a WHERE (a.hash_value, a.address) IN  

     (SELECT DECODE(sql_hash_value, 0, prev_hash_value, sql_hash_value),     

DECODE(sql_hash_value, 0, prev_sql_addr, sql_address)  

              FROM v$session b         WHERE b.sid='2103'  and b.SERIAL#='5') order by piece asc


SQL_ID                         SQL_TEXT

------------------------------ ----------------------------------------------------------------

gnxg91rmnxtdj                  SELECT COUNT (DISTINCT thread#)  FROM v$log



5 对进程做相关trace,未发现任何异常,如下:


send repaa,trace2 ./dirtmp/repaa_20220331.log


09:20:27.875 (58526527) General statistics:

                         0.00%  Checking messages (includes checkpointing)

                                 0.00%  Checking periodic tasks

                         0.00%  Waiting for more data

                         0.00%  Converting ASCII header to internal

                         0.00%  Converting ASCII data to internal

                         0.00%  Reading input records

                         0.00%  Writing output records (replicate_io)

                                 0.00%  Mapping columns

                                 0.00%  Outputting data records

                                 0.00%  Performing SQL statements

                         0.00%  Performing BATCHSQL statements

                         0.00%  Performing actual DB op

                         0.00%  Preparing SQL statements

                         0.00%  Performing transaction commits

                         0.00%  Checkpointing

                         

                         

6 查找官方文档及各大网站,未发现解决方法,经过分析,认为有可能是数据库关闭,

导致OGG某些 程序异常,如果重置RBA,一般OGG会从某个位置重新开始录入数据,

那我指定相同的EXTSEQNO和 RBA号,是不OGG也会恢复正常?

进行测试,重新执行相同的RBA号, 验证进程是否正常, 发现OGG进程恢复正常。

                       

GGSCI (testdb3) 77> info repaa


REPLICAT   repaa  Last Started 2022-03-30 17:57   Status RUNNING

Checkpoint Lag       00:00:05 (updated 15:47:14 ago)

Log Read Checkpoint  File ./dirdat/aa559442

                     2022-03-30 16:53:00.000921  RBA 1169746574

                      

由于不能正常关闭OGG进程,需要在操作系统层面KILL,KILL完OGG进程后,执行如下名

 

GGSCI (testdb3) 18> alter   repaa extseqno    559442 extrba 1169746574

REPLICAT altered.



GGSCI (testdb3) 19> info repaa


REPLICAT   repaa  Initialized   2022-03-31 09:32   Status STOPPED

Checkpoint Lag       00:00:00 (updated 00:00:10 ago)

Log Read Checkpoint  File ./dirdat/aa559442

                     First Record  RBA 1169746574 

 

经过几分钟的等待,OGG恢复正常,如下:

REPLICAT    RUNNING     repaa    09:40:41      00:00:04 

等待几小时,延迟变为0:

REPLICAT    RUNNING     repaa    00:00:04      00:00:04 

 

GGSCI (testdb3) 73> info repaa


REPLICAT   repaa  Last Started 2022-03-31 08:59   Status RUNNING

Checkpoint Lag       00:00:03 (updated 00:00:06 ago)

Log Read Checkpoint  File ./dirdat/aa559881

                     2022-03-31 15:53:58.000975  RBA 14563502




请使用浏览器的分享功能分享到微信等