Case 1:
The alert log of the failing instance (node 1, orcl1) showed the following:
Tue Aug 30 17:33:10 2022
No connectivity to other instances in the cluster during startup. Hence, LMON
is terminating the instance. Please check the LMON trace file for details.
Also, please check the network logs of this instance along with clusterwide
network health for problems and then re-start this instance.
LMON (ospid: 11223): terminating the instance
Tue Aug 30 17:33:10 2022
System state dump requested by (instance=1, osid=11223 (LMON)), summary=[abnormal instance termination].
System State dumped to trace file /oracle/app/diag/rdbms/orcl/orcl1/trace/orcl1_diag_11213_20220830173310.trc
Dumping diagnostic data in directory=[cdmp_20220830173311], requested by (instance=1, osid=11223 (LMON)), summary=[abnormal instance termination].
Instance terminated by LMON, pid = 11223
LMON (Global Enqueue Service Monitor): this process maintains cluster-level node membership (Cluster Group Services, CGS) and exchanges regular heartbeats with the LMON processes of the other instances. When communication problems arise between nodes, it performs the instance-level reconfiguration and the GES-level instance recovery. It likewise performs the instance-level reconfiguration whenever one or more instances leave or join the database cluster. In addition, it works with the LMD process on some GES-level management tasks and carries out part of the DRM work. Each database instance has exactly one LMON process.
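For reference, a quick way to locate an instance's LMON process on a node and match it against the ospid reported in the alert log (the output below is illustrative only; PID 11223 corresponds to the message above):

$ ps -ef | grep '[o]ra_lmon'
oracle   11223      1  0 17:20 ?  00:00:03 ora_lmon_orcl1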
Check the trace file referenced above:
Trace file /oracle/app/diag/rdbms/orcl/orcl1/trace/orcl1_diag_11213_20220830173310.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
ORACLE_HOME = /oracle/app/product/11.2.0/db_1
System name:    Linux
Node name:      11grac1
Release:        2.6.32-696.el6.x86_64
Version:        #1 SMP Tue Mar 21 19:29:05 UTC 2017
Machine:        x86_64
Instance name: orcl1
Redo thread mounted by this instance: 0
Oracle process number: 6
Unix process pid: 11213, image: oracle@11grac1 (DIAG)

*** 2022-08-30 17:33:10.946
*** SESSION ID:(3.1) 2022-08-30 17:33:10.946
*** CLIENT ID:() 2022-08-30 17:33:10.946
*** SERVICE NAME:() 2022-08-30 17:33:10.946
*** MODULE NAME:() 2022-08-30 17:33:10.946
*** ACTION NAME:() 2022-08-30 17:33:10.946

kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).
===================================================
SYSTEM STATE (level=10)
------------
System global information:
     processes: base 0xf37c5d20, size 300, cleanup 0xf37ef2c0
     allocation: free sessions 0xf39459c0, free calls (nil)
     control alloc errors: 1092 (process), 1092 (session), 1092 (call)
     PMON latch cleanup depth: 0
     seconds since PMON's last scan for dead processes: 45
     system statistics:
0 OS CPU Qt wait time
0 Requests to/from client
29 logons cumulative
27 logons current
0 opened cursors cumulative
0 opened cursors current
0 user commits
0 user rollbacks
60 user calls
6 recursive calls
1 recursive cpu usage
0 pinned cursors current
0 user logons cumulative
0 user logouts cumulative
0 session logical reads
0 session logical reads in local numa group
0 session logical reads in remote numa group
0 session stored procedure space
0 CPU used when call started
0 CPU used by this session
0 DB time
0 cluster wait time
Reference document:
Exadata Rac Node Instance Crash with kjzdattdlm: Can not attach to DLM (Doc ID 1386843.1)
Based on that MOS note, this failure most likely stems from a problem with the private interconnect.
Check the cluster network configuration:
oifcfg getif
[grid@11grac1 ~]$ oifcfg getif
eth0  192.168.238.0  global  public
eth1  10.10.10.28  global  cluster_interconnect
Here is the problem: the cluster_interconnect interface is registered with a host IP (10.10.10.28) rather than a subnet address. A healthy configuration looks like this:
[grid@host1 ~]$ oifcfg getif
eth0  192.168.242.0  global  public
eth1  172.16.205.0  global  cluster_interconnect
So the fix is simply to change the interconnect registration back to the correct subnet, as sketched below.
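A minimal sketch of the fix, assuming the correct private subnet is 10.10.10.0 on eth1 (run as the grid user; the oifcfg change generally only takes effect after the Clusterware stack is restarted):

[grid@11grac1 ~]$ oifcfg delif -global eth1
# 10.10.10.0 below is an assumed value; substitute the actual private subnet
[grid@11grac1 ~]$ oifcfg setif -global eth1/10.10.10.0:cluster_interconnect
[grid@11grac1 ~]$ oifcfg getif
eth0  192.168.238.0  global  public
eth1  10.10.10.0  global  cluster_interconnect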
Case 2:
An engineer reported that database startup on one node of a 12.2 RAC was hanging: node 1 came up normally, but node 2 got stuck during startup.
It was also reported that the customer had recently upgraded the network cards (NIC drivers).
Following Case 1, I asked the colleague to check the cluster network configuration:
oifcfg getif
The colleague confirmed that the interconnect (heartbeat) registration was indeed wrong; after correcting it, the node started normally.
The customer then explained that the configuration had probably been wrong all along, but the cluster had never been restarted, so it went unnoticed. This time the NIC driver upgrade required a reboot, which exposed the problem. At first they insisted the cluster itself was fine and only the database would not start; I then noticed that node 2 kept restarting itself. Given that the NICs had been upgraded a few days earlier, we checked the heartbeat configuration and found it was indeed wrong.
Case 3:
A customer reported that the database on one RAC node would not start.
First, I asked the customer to verify that the cluster stack was healthy; they confirmed it was normal (typical checks are sketched below).
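For reference, the usual commands for this check (run as the grid user; the exact output depends on the environment):

crsctl check cluster -all    # Clusterware stack health on every node
crsctl stat res -t           # status of all cluster resources (ASM, listeners, database instances, ...)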
Next, I asked the customer to check the database alert log. It clearly reported private-network errors, so we concluded that the interconnect (heartbeat) network was at fault; a simple way to scan for such messages is sketched below.
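A minimal sketch of that scan, assuming the same diag layout and instance name as Case 1 (adjust the path for the actual environment):

# look for interconnect / LMON related messages in the alert log
grep -iE 'connectivity|interconnect|ipc|LMON' /oracle/app/diag/rdbms/orcl/orcl1/trace/alert_orcl1.log | tail -20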
Summary: when the database on one RAC node fails to start, or when only one of the nodes can be up at a time, it is well worth checking the private interconnect (heartbeat) network.