问题现象:
一套Oracle 19C 两节点RAC,每个节点上各有3个实例,其中节点2上三个实例突然陆续宕机,宕机后马上又自动启动。
问题分析:
一:查看数据库告警日志
宕机后,分别查看节点二上三个实例(chendb2,judb2,chaodb2)对应数据库告警日志
---1 chendb2实例IPC0进程被强制中断,实例中断。
[oracle@cjcdb02 trace]$ vim alert_chendb2.log 2020-12-29T20:07:44.176617+08:00 PMON (ospid: ): terminating the instance due to ORA error 2020-12-29T20:07:44.184140+08:00 Cause - 'Instance is being terminated due to fatal process death (pid: 5, ospid: 20103, IPC0)'
---2 judb2实例DBW0进程被强制中断,实例中断。
[oracle@cjcdb02 trace]$ vim alert_judb2.log 2020-12-29T20:07:43.495906+08:00 PMON (ospid: ): terminating the instance due to ORA error 2020-12-29T20:07:43.506822+08:00 Cause - 'Instance is being terminated due to fatal process death (pid: 44, ospid: 11307, DBW0)' 2020-12-29T20:07:44.649325+08:00
---3 chao2实例LGWR进程被强制中断,实例中断。
[oracle@cjcdb02 trace]$ vim alert_chaodb2.log 2020-12-29T20:07:43.056011+08:00 PMON (ospid: ): terminating the instance due to ORA error 2020-12-29T20:07:43.059794+08:00 Cause - 'Instance is being terminated due to fatal process death (pid: 78, ospid: 4197, LGWR)' 2020-12-29T20:07:44.132532+08:00
二:查看grid告警日志
显然如下,节点间通信异常,发生脑裂,节点2被强制驱逐并重启。
CRS-1612: Network communication with node cjcdb01 (1) has been missing for 50% of the timeout interval. If this persists, removal of this node from cluster will occur in 14.240 seconds
详细日志如下:
[grid@cjcdb02 trace]$ pwd /oracle/grid/diag/crs/cjcdb02/crs/trace [grid@cjcdb02 trace]$ vim alert.log ...... 2020-12-29 19:40:57.560 [OCTSSD(158770)]CRS-2412: The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /oracle/grid/diag/crs/cjcdb02/crs/trace/octssd.trc. 2020-12-29 20:07:36.698 [OCSSD(156435)]CRS-1612: Network communication with node cjcdb01 (1) has been missing for 50% of the timeout interval. If this persists, removal of this node from cluster will occur in 14.240 seconds 2020-12-29 20:07:40.287 [OCSSD(156435)]CRS-1609: This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /oracle/grid/diag/crs/cjcdb02/crs/trace/ocssd.trc. 2020-12-29 20:07:40.288 [OCSSD(156435)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /oracle/grid/diag/crs/cjcdb02/crs/trace/ocssd.trc 2020-12-29 20:07:40.288 [OCSSD(156435)]CRS-1608: This node was evicted by node 1, cjcdb01; details at (:CSSNM00005:) in /oracle/grid/diag/crs/cjcdb02/crs/trace/ocssd.trc. 2020-12-29 20:07:40.289 [OCSSD(156435)]CRS-1608: This node was evicted by node 1, cjcdb01; details at (:CSSNM00005:) in /oracle/grid/diag/crs/cjcdb02/crs/trace/ocssd.trc. 2020-12-29 20:07:40.326 [OCSSD(156435)]CRS-1652: Starting clean up of CRSD resources. 2020-12-29 20:07:40.469 [OCSSD(156435)]CRS-1608: This node was evicted by node 1, cjcdb01; details at (:CSSNM00005:) in /oracle/grid/diag/crs/cjcdb02/crs/trace/ocssd.trc. 2020-12-29 20:07:42.128 [OCSSD(156435)]CRS-1654: Clean up of CRSD resources finished successfully. 2020-12-29 20:07:42.133 [OCSSD(156435)]CRS-1655: CSSD on node cjcdb02 detected a problem and started to shutdown. 2020-12-29 20:07:42.160 [ORAROOTAGENT(159654)]CRS-5822: Agent '/oracle/crs/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:2:87} in /oracle/grid/diag/crs/cjcdb02/crs/trace/crsd_orarootagent_root.trc. 2020-12-29 20:07:42.507 [CRSD(77885)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 77885 2020-12-29 20:07:44.922 [CSSDMONITOR(88318)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 88318 2020-12-29 20:07:46.783 [OCSSD(156435)]CRS-8503: Oracle Clusterware process OCSSD with operating system process ID 156435 experienced fatal signal or exception code 6. 2020-12-29T20:07:46.788757+08:00 Errors in file /oracle/grid/diag/crs/cjcdb02/crs/trace/ocssd.trc (incident=1): CRS-8503 [] [] [] [] [] [] [] [] [] [] [] [] Incident details in: /oracle/grid/diag/crs/cjcdb02/crs/incident/incdir_1/ocssd_i1.trc 2020-12-29 20:07:47.827 [CRSD(77885)]CRS-0804: Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-23: Error in cluster services layer Cluster services error [ [3]]. Details at (:CRSD00111:) in /oracle/grid/diag/crs/cjcdb02/crs/trace/crsd.trc.
三:查看messages日志
通过message可以看到:
节点2重启实例时,节点2的VIP执行了Deleting interface,节点1接管了节点2的VIP。
节点2启动后,VIP又漂会节点1.
[root@cjcdb02 ~]# vim /var/log/messages Dec 29 20:07:41 cjcdb02 ntpd[76526]: Deleting interface #15 bond0:1, 192.100.100.54#123, interface stats: received=0, sent=0, dropped=0, active_time=379716 secs Dec 29 20:07:47 cjcdb02 abrt-hook-ccpp: Process 156435 (ocssd.bin) of user 501 killed by SIGABRT - dumping core Dec 29 20:07:47 cjcdb02 abrt-hook-ccpp: Failed to create core_backtrace: waitpid failed: No child processes Dec 29 20:07:47 cjcdb02 abrt-server: Executable '/oracle/crs/bin/ocssd.bin' doesn't belong to any package and ProcessUnpackaged is set to 'no' Dec 29 20:07:47 cjcdb02 abrt-server: 'post-create' on '/var/spool/abrt/ccpp-2020-12-29-20:07:47-156435' exited with 1 Dec 29 20:07:47 cjcdb02 abrt-server: Deleting problem directory '/var/spool/abrt/ccpp-2020-12-29-20:07:47-156435'
[root@cjcdb01 ~]# vim /var/log/messages Dec 29 20:07:47 sy-fkjx-db01 ntpd[76447]: Listen normally on 14 bond0:3 192.100.100.54 UDP 123 Dec 29 20:08:01 sy-fkjx-db01 systemd: Started Session 3644 of user root. Dec 29 20:08:02 sy-fkjx-db01 su: (to oracle) root on none Dec 29 20:08:13 sy-fkjx-db01 ntpd[76447]: Deleting interface #14 bond0:3, 192.100.100.54#123, interface stats: received=0, sent=0, dropped=0, active_time=26 secs Dec 29 20:09:19 sy-fkjx-db01 ntpd[76447]: Listen normally on 15 bond0:3 192.100.100.54 UDP 123 Dec 29 20:09:43 sy-fkjx-db01 ntpd[76447]: Deleting interface #15 bond0:3, 192.100.100.54#123, interface stats: received=0, sent=0, dropped=0, active_time=24 secs
四:问题分析
查看日志显示两个节点通信异常导致节点2被驱逐自动重启,但是导致节点通信异常异常的原因又是什么呢?
通常是某些资源达到瓶颈导致的,例如网络、内存、CPU、I/O。
查看对应的OSW日志:
查看问题时间段内存、CPU、I/O都很闲,没有什么负载。
查看问题时间段私网通信情况,显示有30秒左右通信异常,
[root@cjcdb02 oswprvtnet]# pwd /oracle/grid/tfa/repository/suptools/cjcdb02/oswbb/grid/archive/oswprvtnet [root@cjcdb02 oswprvtnet]# ll -rth total 2.1M -rw-r--r-- 1 grid oinstall 38K Dec 28 19:59 cjcdb02_prvtnet_20.12.28.1900.dat ...... -rw-r--r-- 1 grid oinstall 38K Dec 29 19:59 cjcdb02_prvtnet_20.12.29.1900.dat -rw-r--r-- 1 grid oinstall 38K Dec 29 20:59 cjcdb02_prvtnet_20.12.29.2000.dat -rw-r--r-- 1 grid oinstall 38K Dec 29 21:59 cjcdb02_prvtnet_20.12.29.2100.dat -rw-r--r-- 1 grid oinstall 38K Dec 29 22:59 cjcdb02_prvtnet_20.12.29.2200.dat -rw-r--r-- 1 grid oinstall 38K Dec 29 23:59 cjcdb02_prvtnet_20.12.29.2300.dat -rw-r--r-- 1 grid oinstall 38K Dec 30 00:59 cjcdb02_prvtnet_20.12.30.0000.dat -rw-r--r-- 1 grid oinstall 38K Dec 30 01:59 cjcdb02_prvtnet_20.12.30.0100.dat ...... -rw-r--r-- 1 grid oinstall 30K Dec 30 19:46 cjcdb02_prvtnet_20.12.30.1900.dat
[root@cjcdb02 oswprvtnet]# vim cjcdb02_prvtnet_20.12.29.2000.dat ...... traceroute to 10.116.59.113 (10.116.59.113), 30 hops max, 60 byte packets 1 cjcdb01-priv (10.116.59.113) 0.053 ms 0.039 ms 0.033 ms traceroute to 10.116.59.114 (10.116.59.114), 30 hops max, 60 byte packets 1 cjcdb02-priv (10.116.59.114) 0.016 ms 0.004 ms 0.003 ms zzz ***Tue Dec 29 20:06:45 CST 2020 traceroute to 10.116.59.113 (10.116.59.113), 30 hops max, 60 byte packets 1 cjcdb01-priv (10.116.59.113) 0.066 ms 0.054 ms 0.041 ms traceroute to 10.116.59.114 (10.116.59.114), 30 hops max, 60 byte packets 1 cjcdb02-priv (10.116.59.114) 0.020 ms 0.005 ms 0.004 ms zzz ***Tue Dec 29 20:07:15 CST 2020 traceroute to 10.116.59.113 (10.116.59.113), 30 hops max, 60 byte packets 1 * * * 2 * * * 3 * * * 4 * * * 5 * * * 6 * * * 7 * * * 8 * * * 9 * * * 10 * * * 11 * * * 12 * * * 13 * * * 14 * * * 15 * * * 16 * * * 17 * * * 18 * * * 19 * * * 20 * * * 21 * * * 22 * cjcdb01-priv (10.116.59.113) 0.039 ms * traceroute to 10.116.59.114 (10.116.59.114), 30 hops max, 60 byte packets 1 cjcdb02-priv (10.116.59.114) 0.015 ms 0.004 ms 0.003 ms zzz ***Tue Dec 29 20:07:45 CST 2020 traceroute to 10.116.59.113 (10.116.59.113), 30 hops max, 60 byte packets 1 cjcdb01-priv (10.116.59.113) 0.057 ms 0.041 ms 0.027 ms traceroute to 10.116.59.114 (10.116.59.114), 30 hops max, 60 byte packets 1 cjcdb02-priv (10.116.59.114) 0.019 ms 0.004 ms 0.003 ms zzz ***Tue Dec 29 20:08:15 CST 2020 traceroute to 10.116.59.113 (10.116.59.113), 30 hops max, 60 byte packets 1 cjcdb01-priv (10.116.59.113) 0.062 ms 0.047 ms 0.042 ms traceroute to 10.116.59.114 (10.116.59.114), 30 hops max, 60 byte packets 1 cjcdb02-priv (10.116.59.114) 0.016 ms 0.004 ms 0.004 ms
查看ocss日志
[grid@cjcdb02 trace]$ pwd /oracle/grid/diag/crs/cjcdb02/crs/trace [grid@cjcdb02 trace]$ ls -lhrt ocssd*.trc -rw-rw---- 1 grid oinstall 51M Dec 20 19:04 ocssd_16.trc -rw-rw---- 1 grid oinstall 51M Dec 21 23:52 ocssd_17.trc -rw-rw---- 1 grid oinstall 51M Dec 23 04:25 ocssd_18.trc -rw-rw---- 1 grid oinstall 51M Dec 24 09:14 ocssd_19.trc -rw-rw---- 1 grid oinstall 51M Dec 25 10:40 ocssd_20.trc -rw-rw---- 1 grid oinstall 51M Dec 26 14:54 ocssd_21.trc -rw-rw---- 1 grid oinstall 51M Dec 27 19:15 ocssd_22.trc -rw-rw---- 1 grid oinstall 51M Dec 28 23:47 ocssd_23.trc -rw-rw---- 1 grid oinstall 51M Dec 29 23:50 ocssd_24.trc -rw-rw---- 1 grid oinstall 36M Dec 30 19:44 ocssd.trc [grid@cjcdb02 trace]$ vim ocssd_24.trc ......
五:结论
初步怀疑集群节点私有网络出现问题导致的节点2被驱逐,但是网络工程师检查两个私网接在同一个交换机上,没有查出有什么问题,先记录下来,以后有进展在更新!
chenjuchao 2020-12-31 21:30