1 寻找一个时间点,四个OGG抽取进程读取的日志相差特别小,使用强制命令停止OGG进程
首先根据归档日志切换,找到每日数据库压力最小的时间点,认为7-9点和17点-21点为业务数据量比较清闲是时间段。在此时间段内寻找合并的抽取进程差异最小的时间点,数据量相差在10M内。

使用 info ET_* 查看合并的OGG抽取进程读取的归档日志,找到合并的进程差异最小时间点,并强行停止OGG抽取进程。
如果MGR进程配置OGG进程自动重启,则需要将此参数禁用,并重启MGR进程。
SEND EXTRACT ET_CJ, FORCESTOP
SEND EXTRACT ET_JA, FORCESTOP
SEND EXTRACT ET_TB, FORCESTOP
SEND EXTRACT ET_YD, FORCESTOP
停止后的状态如下
EXTRACT ABENDED ET_CJ 00:00:01 00:01:19
EXTRACT ABENDED ET_JA 00:00:01 00:01:16
EXTRACT STOPPED ET_TB 00:00:00 00:01:13
EXTRACT STOPPED ET_YD 00:00:01 00:01:13
查看每个抽取进程进程读取的RBA号的位置,并以最小的时间为准。
GGSCI (XXDB3) 24> info ET_*
EXTRACT ET_CJ Last Started 2021-11-19 08:49 Status ABENDED
Checkpoint Lag 00:00:01 (updated 00:00:13 ago)
Log Read Checkpoint Oracle Redo Logs
2021-11-19 08:51:38 Thread 1, Seqno 923151, RBA 226304
SCN 3689.4151044528 (15848285399472)
Log Read Checkpoint Oracle Redo Logs
2021-11-19 08:51:40 Thread 2, Seqno 513373, RBA 580112
SCN 3689.4151046683 (15848285401627)
Log Read Checkpoint Oracle Redo Logs
2021-11-19 08:43:41 Thread 3, Seqno 700612, RBA 759541456
SCN 3689.4150310212 (15848284665156)
EXTRACT ET_JA Last Started 2021-11-19 08:49 Status ABENDED
Checkpoint Lag 00:00:01 (updated 00:00:10 ago)
Log Read Checkpoint Oracle Redo Logs
2021-11-19 08:51:42 Thread 1, Seqno 923151, RBA 243200
SCN 3689.4151048854 (15848285403798)
Log Read Checkpoint Oracle Redo Logs
2021-11-19 08:51:43 Thread 2, Seqno 513373, RBA 599568
SCN 3689.4151049970 (15848285404914)
Log Read Checkpoint Oracle Redo Logs
2021-11-19 08:43:40 Thread 3, Seqno 700612, RBA 750326584
SCN 3689.4150309682 (15848284664626)
EXTRACT ET_TB Last Started 2021-11-19 08:49 Status STOPPED
Checkpoint Lag 00:00:00 (updated 00:00:07 ago)
Log Read Checkpoint Oracle Redo Logs
2021-11-19 08:51:45 Thread 1, Seqno 923151, RBA 261120
SCN 3689.4151052791 (15848285407735)
Log Read Checkpoint Oracle Redo Logs
2021-11-19 08:51:47 Thread 2, Seqno 513373, RBA 616292
SCN 3689.4151054181 (15848285409125)
Log Read Checkpoint Oracle Redo Logs
2021-11-19 08:43:50 Thread 3, Seqno 700612, RBA 1104938896
SCN 3689.4150328763 (15848284683707)
EXTRACT ET_YD Last Started 2021-11-19 08:49 Status STOPPED
Checkpoint Lag 00:00:01 (updated 00:00:07 ago)
Log Read Checkpoint Oracle Redo Logs
2021-11-19 08:51:45 Thread 1, Seqno 923151, RBA 259072
SCN 3689.4151052078 (15848285407022)
Log Read Checkpoint Oracle Redo Logs
2021-11-19 08:51:46 Thread 2, Seqno 513373, RBA 614928
SCN 3689.4151053645 (15848285408589)
Log Read Checkpoint Oracle Redo Logs
2021-11-19 08:43:50 Thread 3, Seqno 700612, RBA 1089680932
SCN 3689.4150328161 (15848284683105)
根据如上信息,决定使用使用每个线程最低的RBA号进行数据抽取,差异在10M内,
有可能会导致目标端的数据有重复数据但不会导致数据丢失,应该是在可以接受的范围内。
否则,则需要业务系统所有的业务,才能保证几个抽取进程读取的位置一致。
实例1 2021-11-19 08:51:38 Thread 1, Seqno 923151, RBA 226304
实例2 2021-11-19 08:51:40 Thread 2, Seqno 513373, RBA 580112
实例3 2021-11-19 08:43:40 Thread 3, Seqno 700612, RBA 750326584
2 停止投递进程,并将投递进程中的参数文件予以修改。即将抽取进程中的表,复制到投递进程中,
并记录投递进程读取的RBA号,以作回退使用。
GGSCI (XXDB3) 24> stop DP_*
Sending STOP request to EXTRACT DP_CJ ...
Request processed.
Sending STOP request to EXTRACT DP_JA ...
Request processed.
Sending STOP request to EXTRACT DP_TB ...
Request processed.
Sending STOP request to EXTRACT DP_YD ...
Request processed.
GGSCI (XXDB3) 25> info DP_*
EXTRACT DP_CJ Last Started 2021-11-15 23:33 Status STOPPED
Checkpoint Lag 00:00:00 (updated 00:00:08 ago)
Log Read Checkpoint File ./dirdat/cj2/cj041171
2021-11-19 08:49:45.202412 RBA 1384
EXTRACT DP_JA Last Started 2021-11-19 09:42 Status STOPPED
Checkpoint Lag 00:00:00 (updated 00:00:07 ago)
Log Read Checkpoint File ./dirdat/ja/jb128023
2021-11-19 08:49:45.275881 RBA 1385
EXTRACT DP_TB Last Started 2021-11-15 23:33 Status STOPPED
Checkpoint Lag 00:00:00 (updated 00:00:07 ago)
Log Read Checkpoint File ./dirdat/tb/tb004071
2021-11-19 08:49:45.523086 RBA 1061
EXTRACT DP_YD Last Started 2021-11-15 23:33 Status STOPPED
Checkpoint Lag 00:00:00 (updated 00:00:06 ago)
Log Read Checkpoint File ./dirdat/yd/yd008514
2021-11-19 08:49:45.218477 RBA 1061
3 将 ET_CJ、ET_JA、ET_TB、ET_YD 四个抽取进程中的表进行整理去重,合并到一起,并添加到ET_TB进程中
此步骤相关的表就不列在此处了。
4 调整ET_TB进程的trail文件大小及读取的位置,此此调整trail文件大小,主要是解决11G的OGG的trail
文件号达到999999时的故障,提前进行处理,避免未来某一天因为此问题导致OGG进程的异常。
修改ET_TB 的抽取trail文件大小为2G
extract Trail: ./dirdat/tb/tb
extract: ET_TB
Seqno: 4071
RBA: 1061
File Size: 200M --修改前
以下为执行的修改命令
GGSCI (XXDB3) 39> ALTER EXTTRAIL ./dirdat/tb/tb, EXTRACT ET_TB, MEGABYTES 2000
EXTTRAIL altered.
extract Trail: ./dirdat/tb/tb
extract: ET_TB
Seqno: 4071
RBA: 1061
File Size: 2000M --修改后
修改ET_TB 读取归档日志的位置:
GGSCI (XXDB3) 41> info ET_tb
EXTRACT ET_TB Initialized 2021-11-19 08:49 Status STOPPED
Checkpoint Lag 00:00:00 (updated 01:19:32 ago)
Log Read Checkpoint Oracle Redo Logs
2021-11-19 08:51:45 Thread 1, Seqno 923151, RBA 261120
SCN 3689.4151052791 (15848285407735)
Log Read Checkpoint Oracle Redo Logs
2021-11-19 08:51:47 Thread 2, Seqno 513373, RBA 616292
SCN 3689.4151054181 (15848285409125)
Log Read Checkpoint Oracle Redo Logs
2021-11-19 08:43:50 Thread 3, Seqno 700612, RBA 1104938896
SCN 3689.4150328763 (15848284683707)
GGSCI (XXDB3) 42> ALTER EXTRACT ET_TB, Thread 1, ET_SEQNO 923151, ET_RBA 226304
EXTRACT altered.
GGSCI (XXDB3) 43> ALTER EXTRACT ET_TB, Thread 2, ET_SEQNO 513373, ET_RBA 580112
EXTRACT altered.
GGSCI (XXDB3) 44> ALTER EXTRACT ET_TB, Thread 3, ET_SEQNO 700612, ET_RBA 750326584
EXTRACT altered.
GGSCI (XXDB3) 45> info ET_tb
EXTRACT ET_TB Initialized 2021-11-19 10:12 Status STOPPED
Checkpoint Lag 01:20:20 (updated 00:00:04 ago)
Log Read Checkpoint Oracle Redo Logs
2021-11-19 08:51:45 Thread 1, Seqno 923151, RBA 226304
SCN 0.0 (0)
Log Read Checkpoint Oracle Redo Logs
2021-11-19 08:51:47 Thread 2, Seqno 513373, RBA 580112
SCN 0.0 (0)
Log Read Checkpoint Oracle Redo Logs
2021-11-19 08:43:50 Thread 3, Seqno 700612, RBA 750326584
SCN 0.0 (0)
5 启动OGG抽取进程,发现报错,对错误的处理
2021-11-19 10:13:14 INFO OGG-01516 Positioned to (Thread 2) Sequence 513373, RBA 580112, SCN 0.0, Nov 19, 2021 8:51:47 AM.
2021-11-19 10:13:14 INFO OGG-01513 Positioning to (Thread 3) Sequence 700612, RBA 750326584, SCN 0.0.
2021-11-19 10:13:15 INFO OGG-01516 Positioned to (Thread 3) Sequence 700612, RBA 750326584, SCN 0.0, Nov 19, 2021 8:43:50 AM.
2021-11-19 10:13:15 INFO OGG-01517 Position of first record processed for Thread 1, Sequence 923151, RBA 226320, SCN 3689.4151044612, N
ov 19, 2021 8:51:38 AM.
2021-11-19 10:13:15 INFO OGG-01055 Recovery initialization completed for target file ./dirdat/tb/tb004071, at RBA 1061.
2021-11-19 10:13:15 INFO OGG-01478 Output file ./dirdat/tb/tb is using format RELEASE 11.2.
2021-11-19 10:13:15 INFO OGG-01026 Rolling over remote file ./dirdat/tb/tb004071.
2021-11-19 10:13:15 INFO OGG-01053 Recovery completed for target file ./dirdat/tb/tb004072, at RBA 1061.
2021-11-19 10:13:15 INFO OGG-01057 Recovery completed for all targets.
***********************************************************************
** Run Time Messages **
***********************************************************************
Source ContET_ :
SourceModule : [er.redo.ora.rtc]
SourceID : [/scratch/pradshar/view_storage/pradshar_pse_15852019/oggcore/OpenSys/src/app/er/redo/oracle/rtc.c]
SourceFunction : [RTC_producer]
SourceLine : [2111]
ThreadBacktrace : [6] elements
: [/goldengate/libgglog.so(CMessageContET_::AddThreadContET_()+0x1e) [0x7f46be57771e]]
: [/goldengate/libgglog.so(CMessageFactory::CreateMessage(CSourceContET_*, unsigned int, ...)+0x2cc) [0x7f46be5706
bc]]
: [/goldengate/libgglog.so(_MSG_ERR_ER_GENERIC_FAILURE(CSourceContET_*, char const*, CMessageFactory::MessageDispo
sition)+0x31) [0x7f46be55b859]]
: [/goldengate/ET_ract(RTC_producer+0x4f6) [0x72b826]]
: [/lib64/libpthread.so.0(+0x7dd5) [0x7f46be6f1dd5]]
: [/lib64/libc.so.6(clone+0x6d) [0x7f46ba78fb3d]]
2021-11-19 10:13:15 ERROR OGG-01028 Non-standard redo detected in 10g compatible format.
根据如下文档ET_ract Abends with OGG-01028 Non-Standard Redo Detected in 10g
Compatible Format (Doc ID 1313864.1) 进行调整,抽取进程恢复正常
以下为执行的操作
GGSCI (XXDB3) 53> ALTER EXTRACT ET_TB, Thread 2, BEGIN 2021-11-19 08:51:40
EXTRACT altered.
GGSCI (XXDB3) 54> ALTER EXTRACT ET_TB, Thread 3, BEGIN 2021-11-19 08:43:40
EXTRACT altered.
GGSCI (XXDB3) 59> info ET_tb --OGG抽取进程已经恢复正常
EXTRACT ET_TB Last Started 2021-11-19 10:26 Status RUNNING
Checkpoint Lag 01:26:07 (updated 00:00:00 ago)
Log Read Checkpoint Oracle Redo Logs
2021-11-19 08:55:00 Thread 1, Seqno 923152, RBA 325136
SCN 3689.4151325171 (15848285680115)
Log Read Checkpoint Oracle Redo Logs
2021-11-19 09:00:39 Thread 2, Seqno 513376, RBA 13507168
SCN 3689.4151908598 (15848286263542)
Log Read Checkpoint Oracle Redo Logs
2021-11-19 08:43:40 Thread 3, Seqno 700612, RBA 748003028
SCN 3689.4150309574 (15848284664518)
查看生产的trail文件,为调整其它OGG投递进程做依据。
[oracle@XXDB3 tb]$ ls -ltr
total 3736044
-rw-rw-rw- 1 oracle oinstall 1061 Nov 19 10:13 tb004071
-rw-rw-rw- 1 oracle oinstall 1061 Nov 19 10:26 tb004072
-rw-rw-rw- 1 oracle oinstall 1999998590 Nov 19 10:31 tb004073
-rw-rw-rw- 1 oracle oinstall 798726201 Nov 19 10:32 tb004074
6 修改DP_tb 投递进程的大小,并启动(由于此投递进程以前就是ET_TB抽取进程对应的投递进程,故只需要调整大小就行了)
GGSCI (XXDB3) 61> ALTER RMTTRAIL ./dirdat/tb, EXTRACT DP_TB, MEGABYTES 2000
RMTTRAIL altered.
GGSCI (XXDB3) 62> info ET_trail ./dirdat/tb
extract Trail: ./dirdat/tb
ET_ract: DP_TB
Seqno: 4066
RBA: 111462747
File Size: 2000M
GGSCI (XXDB3) 64> start DP_TB
Sending START request to MANAGER ...
EXTRACT DP_TB starting
GGSCI (XXDB3) 74> info DP_tb
EXTRACT DP_TB Last Started 2021-11-19 10:31 Status RUNNING
Checkpoint Lag 01:46:02 (updated 00:00:00 ago)
Log Read Checkpoint File ./dirdat/tb/tb004074
2021-11-19 08:46:29.000000 RBA 1088769399
7 调整DP_CJ投递进程生成trail文件的大小,及读取trail文件的名称及位置(由于此投递进程都去的trail文件名不是ET_TB生成的trail文件名,故需要进行调整)
调整读取trail文件的名称
GGSCI (XXDB3) 80> ALTER EXTRACT DP_CJ, EXTTRAILSOURCE ./dirdat/tb/tb
EXTRACT altered.
调整读取trail文件的RBA号
GGSCI (XXDB3) 81> ALTER DP_CJ,extseqno 4073,extrba 0
EXTRACT altered.
GGSCI (XXDB3) 82> info DP_cj
EXTRACT DP_CJ Initialized 2021-11-19 10:39 Status STOPPED
Checkpoint Lag 00:00:00 (updated 00:00:09 ago)
Log Read Checkpoint File ./dirdat/tb/tb004073
First Record RBA 0
调整TRAIL文件的大小
GGSCI (XXDB3) 83> ALTER RMTTRAIL ./dirdat/yx/yx, EXTRACT DP_CJ, MEGABYTES 2000
RMTTRAIL altered.
ET_ract Trail: ./dirdat/yx/yx
ET_ract: DP_CJ
Seqno: 41162
RBA: 15795712
File Size: 2000M
启动
GGSCI (XXDB3) 84> start DP_cj
Sending START request to MANAGER ...
EXTRACT DP_CJ starting
GGSCI (XXDB3) 91> !
info DP_cj
EXTRACT DP_CJ Last Started 2021-11-19 10:41 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:00:01 ago)
Log Read Checkpoint File ./dirdat/tb/tb004085
2021-11-19 09:22:27.000000 RBA 1032207204
其它的OGG投递进程就不一一列举。
8 查看目标端OGG进程是否有报错,经验证,只要一个复制进程报ORA-00001错误,忽略后,问题解决。
9 对比4个OGG抽取进程合并后的性能,初步计算,4个OGG抽取进程合并为一个,每天大概能为磁盘节省12T到24T的IO。
经过几天的观察,OGG目前没有发生延迟,也未发生以前延迟88小时的现象。
9.1 从日志切换,在19日OGG抽取进程合并后22日有5T的归档日志量,但抽取进程没有任何延迟

9.2 从OGG抽取进程处理数据的性能来比对,发现峰值,OGG性能提升比以前提升许多倍。
合并前,11月2日,处理数据的速度,根据delta(此值后续的数字为每秒中OGG抽取进程处理数据的数量),每秒大概在几条数据,11月4日,为230条左右


合并后,根据11月22日的数据,每秒4000条,有时能够达到5万条每秒,最高达到11万每秒


根据delta信息的比对,可以发现OGG抽取进程性能最少提升20倍以上,最大估计能达到1000倍以上。
通过此事件可以清晰的看出,不合理的架构,会极大的浪费系统的资源。稍微进行调整,性能就会有极大的提升。
如果将另一个节点17个抽取进程合并为一个,估计每天就能为系统节省140T的IO吞吐量,想想都兴奋,主机的IO、CPU、内存、网络等性能压力都将进行释放,这是一个多么伟大的事件。估计比花100多万买设备的效果都好。
IT系统优化两条路,一为提升硬件的性能,二为优化业务系统。优化业务系统,需要懂数据库、业务、架构等各方面知识,且有多年经验的,能静下心来认值思考问题的可能原因,通过各种技术手段进行验证,才能找到解决问题的根源,否则一切都是
都是空谈。