Environment

Item	Value
OS type	Linux x86 64-bit
Kernel	3.10.0-957.el7.x86_64
Database version	19.7.0.0.0
Memory	500 GB
CPUs	128
Architecture	RAC

Problem description

Node 2 of the database keeps restarting on its own.

Layer 1: why did the instance restart?

When a database restarts, the alert log is usually the first place to look. It shows the following:

2020-07-16T16:59:10.896311+08:00
System state dump requested by (instance=2, osid=124268 (PMON)), summary=[abnormal instance termination]. error - 'Instance is terminating.'
System State dumped to trace file /oracle/diag/rdbms/loandb/loandb2/trace/loandb2_diag_124290.trc
2020-07-16T16:59:10.918179+08:00
PMON (ospid: 124268): terminating the instance due to ORA error 471
2020-07-16T16:59:10.922909+08:00
Cause - 'Instance is being terminated due to fatal process death (pid: 76, ospid: 124440, DBW4)'
2020-07-16T16:59:20.408801+08:00
Termination issued to instance processes. Waiting for the processes to exit, wait time 5 sec
2020-07-16T16:59:20.704260+08:00
ORA-1092 : opitsk aborting process
2020-07-16T16:59:24.413038+08:00
Instance terminated by PMON, pid = 124268
2020-07-16T16:59:28.427812+08:00
Starting ORACLE instance (normal) (OS id: 59964)

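The alert log says that PMON terminated the instance because DBW4 died, but not why it died, so we move on to the trace file for more detail: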

74: DBW2 ospid 124436 sid 3553 ser 57003, waiting for 'rdbms ipc message' 
75: DBW3 ospid 124438 sid 3601 ser 57536, waiting for 'rdbms ipc message' 
76: DBW4 ospid 124440 sid 3649 ser 47151, waiting for 'rdbms ipc message' (DEAD)
77: DBW5 ospid 124442 sid 3697 ser 63467, waiting for 'rdbms ipc message' 
78: DBW6 ospid 124444 sid 3745 ser 3580, waiting for 'rdbms ipc message'

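Here we can see that DBW4, one of the database writer processes, is dead. At this point I was fairly sure that the restart was caused by the death of a DBWn process: DBWn is a critical background process, and when one dies PMON terminates the instance (the ORA error 471 in the alert log above).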
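Spotting the dead process by eye is easy in a five-line excerpt, but a real system state dump lists hundreds of processes. The following is a minimal sketch that scans such lines for the `(DEAD)` marker. The line layout is assumed from the excerpt above only, and the names `find_dead_processes` and `SAMPLE_TRACE` are illustrative, not part of any Oracle tool.

```python
import re

# Layout assumed from the excerpt above, e.g.:
#   76: DBW4 ospid 124440 sid 3649 ser 47151, waiting for 'rdbms ipc message' (DEAD)
PROCESS_LINE = re.compile(
    r"^\s*(?P<slot>\d+):\s+(?P<name>\S+)\s+ospid\s+(?P<ospid>\d+)\s+"
    r"sid\s+\d+\s+ser\s+\d+,\s+waiting for\s+'(?P<event>[^']*)'"
    r"\s*(?P<dead>\(DEAD\))?"
)


def find_dead_processes(trace_text):
    """Return one dict per process line that carries the (DEAD) marker."""
    dead = []
    for line in trace_text.splitlines():
        match = PROCESS_LINE.match(line)
        if match and match.group("dead"):
            dead.append({
                "slot": int(match.group("slot")),
                "name": match.group("name"),
                "ospid": int(match.group("ospid")),
                "event": match.group("event"),
            })
    return dead


# The five lines quoted above from the diag trace.
SAMPLE_TRACE = """\
74: DBW2 ospid 124436 sid 3553 ser 57003, waiting for 'rdbms ipc message'
75: DBW3 ospid 124438 sid 3601 ser 57536, waiting for 'rdbms ipc message'
76: DBW4 ospid 124440 sid 3649 ser 47151, waiting for 'rdbms ipc message' (DEAD)
77: DBW5 ospid 124442 sid 3697 ser 63467, waiting for 'rdbms ipc message'
78: DBW6 ospid 124444 sid 3745 ser 3580, waiting for 'rdbms ipc message'
"""

if __name__ == "__main__":
    for proc in find_dead_processes(SAMPLE_TRACE):
        print(proc)
```

The ospid it reports (124440) is the same one named in the "fatal process death" line of the alert log, which ties the two logs together.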

Layer 2: who killed DBWn?

In the system log I found the following two fragments:

[1142037.749433] 72137435 total pagecache pages
[1142037.749441] 127913 pages in swap cache
[1142037.749442] Swap cache stats: add 13821614, delete 13682513, find 6832838/8029664
[1142037.749443] Free swap  = 0kB
[1142037.749444] Total swap = 16777212kB
[1142037.749445] 134110877 pages RAM
[1142037.749446] 0 pages HighMem/MovableOnly
[1142037.749447] 2172212 pages reserved
.
.
.
[1142037.751657] Out of memory: Kill process 124440 (ora_dbw4_loandb) score 91 or sacrifice child
[1142037.751659] Killed process 124440 (ora_dbw4_loandb) total-vm:309293368kB, anon-rss:20240kB, file-rss:1604kB, shmem-rss:49057684kB

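Memory was exhausted and free swap was 0, and the log shows that the Linux OOM killer (Out-Of-Memory killer) stepped in. When the system runs out of memory, out_of_memory() is triggered and calls select_bad_process() to pick a "bad" process to kill. How "bad" a process is gets decided by oom_badness(), and the worst one is essentially the process with the largest memory footprint, adjusted by its oom_score_adj setting. The whole flow can be followed in linux/mm/oom_kill.c in the kernel source.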
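To see which processes the OOM killer would currently favor, the kernel exposes each process's badness score in `/proc/<pid>/oom_score` and its adjustment in `/proc/<pid>/oom_score_adj`. The sketch below ranks processes by that score. The function name `oom_candidates` and the `proc_root` parameter (there so the function can be pointed at a test directory) are my own choices, not a standard tool.

```python
import os


def _read_first_line(path):
    """Return the first line of a file, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.readline().strip()
    except OSError:
        # The process may have exited, or we may lack permission.
        return None


def oom_candidates(proc_root="/proc", top=5):
    """Return up to `top` tuples (oom_score, pid, comm, oom_score_adj),
    highest score first."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        base = os.path.join(proc_root, entry)
        score = _read_first_line(os.path.join(base, "oom_score"))
        if score is None or not score.isdigit():
            continue
        adj = _read_first_line(os.path.join(base, "oom_score_adj"))
        comm = _read_first_line(os.path.join(base, "comm"))
        adj_value = int(adj) if adj and adj.lstrip("-").isdigit() else None
        rows.append((int(score), int(entry), comm or "?", adj_value))
    # Highest score first; ties broken by pid for a stable order.
    rows.sort(key=lambda row: (-row[0], row[1]))
    return rows[:top]


if __name__ == "__main__":
    for score, pid, comm, adj in oom_candidates():
        print(f"{score:>6}  pid={pid:<8} adj={adj}  {comm}")
```

Run on a live RAC node, a listing like this shows whether Oracle background processes sit near the top. In the log above, ora_dbw4_loandb was selected with a score of 91.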

Layer 3: what was DBWn doing?

From the system log above we can get a rough picture of DBW4's memory usage:

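total-vm: the virtual memory size of the process, here about 295 GiB (309293368 kB), consistent with the size of the SGA
anon-rss: resident anonymous memory, that is, private pages not backed by a file that are currently held in RAM, here just under 20 MiB (20240 kB)
file-rss: resident memory mapped from files or devices, here about 1.6 MiB (1604 kB)
shmem-rss: resident shared memory, here about 46.8 GiB (49057684 kB), most likely SGA pages that the process has touched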
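The unit conversions above are easy to get wrong by a factor of ten, so here is a small sketch that parses the "Killed process" line from the system log and converts each field. The kernel's "kB" is treated as 1024 bytes. `parse_kill_line` and `kib_to_gib` are illustrative helper names.

```python
import re

# The kill line exactly as it appears in the system log above.
KILL_LINE = (
    "[1142037.751659] Killed process 124440 (ora_dbw4_loandb) "
    "total-vm:309293368kB, anon-rss:20240kB, file-rss:1604kB, "
    "shmem-rss:49057684kB"
)


def parse_kill_line(line):
    """Return (pid, command, {field: size_in_kB}) from an OOM kill line."""
    header = re.search(r"Killed process (\d+) \(([^)]+)\)", line)
    if header is None:
        raise ValueError("not an OOM 'Killed process' line")
    fields = {name: int(size) for name, size in re.findall(r"([a-z-]+):(\d+)kB", line)}
    return int(header.group(1)), header.group(2), fields


def kib_to_gib(kib):
    """Convert kernel 'kB' (1024-byte units) to GiB."""
    return kib / (1024 * 1024)


if __name__ == "__main__":
    pid, command, fields = parse_kill_line(KILL_LINE)
    print(f"pid {pid} ({command})")
    for name, size_kib in fields.items():
        print(f"  {name:<10} {size_kib:>12} kB = {kib_to_gib(size_kib):.3f} GiB")
```

Almost all of the resident memory charged to DBW4 is shmem-rss, not private memory: the process itself is small, and it scores high mainly because of the shared memory it has mapped.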

To understand in more detail what DBW4 was doing, I also went through DBW4's own trace (.trc) file:

GLOBAL CACHE ELEMENT DUMP (address: 0x277f6a9060):
  id1: 0x3761b49 id2: 0x2 pkey: OBJ#0,30,204525 block: (2/58071881)
  lock: X rls: 0x0 acq: 0x0 bucket: 51453725 latch: 392 lms: 12
  flags: 0x20 fair: 0 recovery: 0
  bscn: 0x826bb2de scan: 0x0
  lch: [0x3d7e102140,0x3d7e102140] nml: [0x782ab9200,0x2cfeece4a8]
  seq: 8 hist: 65 225 60 65 143:0 325 352 32 97 197 197 48 121 239 197:48
  LIST OF BUFFERS LINKED TO THIS GLOBAL CACHE ELEMENT:
    flg: 0x200021 lflg: 0x8 state: XCURRENT tsn: [0/30] tsh: 0
      fpin: 'ktspbwh2: ktspfmdb' fscn: 0x8267e7c8
      addr: 0x3d7e101fd8 obj: 204525 cls: DATA
      bscn: 0x826bb2de seq: 2 bflg: 0x4
 GCS SHADOW 0x277f6a90e8,7 resp[0x5f2144720,0x3761b49.2] pkey 0.30.204525  lock domid 0, res domid 0 
    domid 0, inreco? 0, rdom flgs x20, bfinc 0, nxtvalid bfinc 5 
   grant 2 cvt 0 mode 0x2 role 0x0 st 0x10e lst 0x40 GRANTQ rl LOCAL
   master 2 owner 2 sid 12 lms LMSC remote[(nil),0]
   KJBL history 0xf5.0x5b.0xf5.0xe.0x9.0x2a.0x5b.0xf5.0xe.0x9.0x2a.0xe.0x2.0x2a.0xe.0x2.
   cflag 0x0 sender 0 flags 0x0 replay# 5 abast (nil).x0.1
   disk: 0x0 write request: 0x826c6af7 wcseq: x0
   pi scn: 0x0 sq[0x5f2144768,0x5f2144768]
   msgseq 0x0 updseq 0x0 reqids[7,0,0] lockseq x877
   infop (nil).0 rcv scn: 0x0 pinc 7
     pkey 0.30.204525
 GCS SHADOW END
 GCS RESOURCE 0x5f2144720 hashq [0x36b6a61c0,0x75b4cf0b0] name[0x3761b49.2] pkey 0.30.204525  domid 0
   grant 0x277f6a90e8 cvt (nil) send (nil)@1,0 write (nil),0@65536
   flag 0x400000 mdrole 0x2 mode 2 scan 0.0 role LOCAL
   disk: 0x82688aec write: 0x0 lwbscn: 0x8267e7e2 cnt 0x0 hist 0x0
   xid 0x0000.000.00000000 sid 12 lms LMSC pkwait 0s rmacks 0

The trace consists almost entirely of GCS SHADOW and GCS RESOURCE entries. These structures are used to manage the buffer cache in RAC, so their memory usage depends on the size of the buffer cache.

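After the database starts, GCS SHADOW and GCS RESOURCE can grow dynamically with Cache Fusion activity. The memory for that growth comes from free memory in the shared pool, which can lead to ORA-4031 or, as in this case, to memory being used up.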
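The observation that the trace is dominated by GCS entries can be checked with a quick tally instead of scrolling through the file. The sketch below counts the section headers seen in the excerpt above. It only measures what fills the trace file, not how much memory the structures use, and the header patterns are assumed from that excerpt alone.

```python
import re
from collections import Counter

# Section headers as they appear in the DBW4 trace excerpt; the closing
# "GCS SHADOW END" marker is deliberately excluded from the count.
SECTION_HEADER = re.compile(
    r"^\s*(GLOBAL CACHE ELEMENT DUMP|GCS SHADOW|GCS RESOURCE)\b(?! END)"
)


def count_sections(trace_text):
    """Count how often each known section header starts a line."""
    counts = Counter()
    for line in trace_text.splitlines():
        match = SECTION_HEADER.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# A few lines copied from the trace excerpt above.
SAMPLE = """\
GLOBAL CACHE ELEMENT DUMP (address: 0x277f6a9060):
  lock: X rls: 0x0 acq: 0x0 bucket: 51453725 latch: 392 lms: 12
 GCS SHADOW 0x277f6a90e8,7 resp[0x5f2144720,0x3761b49.2] pkey 0.30.204525  lock domid 0, res domid 0
   master 2 owner 2 sid 12 lms LMSC remote[(nil),0]
 GCS SHADOW END
 GCS RESOURCE 0x5f2144720 hashq [0x36b6a61c0,0x75b4cf0b0] name[0x3761b49.2] pkey 0.30.204525  domid 0
"""

if __name__ == "__main__":
    for section, count in count_sections(SAMPLE).most_common():
        print(f"{count:>8}  {section}")
```

To run it against a real trace, read the file's contents and pass the text to `count_sections`.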

Global Cache Service (GCS)

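GCS is best understood together with Cache Fusion, since the global cache is all about data blocks. The Global Cache Service maintains cache coherency across the global buffer cache: whenever an instance wants to modify a data block, it must first obtain a global lock resource, which rules out another instance modifying the same block at the same time. The modifying instance holds the current version of the block (with both committed and uncommitted transactions) as well as its past image (PI). If another instance requests the block, the Global Cache Service keeps track of which instance holds it, which version that instance holds, and in what mode. The LMS processes are the key component of the Global Cache Service.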

Layer 4: how to fix it

Let us go back and look at the busiest wait events before the failure.

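About 75% of the waits were gc (global cache) waits, which fits the analysis above. From there, the work was to trace the waits back to specific SQL statements and to optimize those statements and the business flows around them. Since that involves details of the customer's application, I will not go into it here.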

References

Linux: Out-of-Memory (OOM) Killer (Doc ID 452000.1)
How To Prevent OOM Killer from killing processes (Doc ID 2260273.1)
ORA-4031 Due To Large 'GCS RESOURCES' And 'GCS SHADOWS' (Doc ID 844879.1)