Platform: AIX 6.1
Symptom: core dump files keep being generated in the $ORACLE_HOME/dbs directory
Step 1: Analyze the core file
$ cd $ORACLE_HOME/dbs
$ ls
core0 core14 core2 core8 hc_circbjdg1.dat
core1 core15 core3 core9 init.ora
core10 core16 core4 core_15270062 initcircbjdg1.ora
core11 core17 core5 core_16384198 initcircbjdg1.ora.bak
core12 core18 core6 core_16384200 orapwcircbjdg1
core13 core19 core7 core_17367274 snapcf_circbjdg1.f
$ cd core_17367274
$ dbx
Type 'help' for help.
enter object file name (default is `a.out', ^D to exit): $ORACLE_HOME/bin/oracle core
cannot read $ORACLE_HOME/bin/oracle core
enter object file name (default is `a.out', ^D to exit): ^C$
$ dbx $ORACLE_HOME/bin/oracle core
Type 'help' for help.
[using memory image in core]
reading symbolic information ...
IOT/Abort trap in pthread_kill at 0x9000000004efa30 ($t1)
0x9000000004efa30 (pthread_kill+0xb0) e8410028 ld r2,0x28(r1)
(dbx) where
pthread_kill(??, ??) at 0x9000000004efa30
_p_raise(??) at 0x9000000004ef2a8
raise.raise(??) at 0x90000000002c2ac
abort() at 0x90000000007d084
skgdbgcra(??) at 0x1008c45f0
sksdbgcra(??, ??) at 0x1028b1160
ksdbgcra() at 0x1028b0b30
ssexhd(??, ??, ??) at 0x10299d3a8
ksmpclrpga() at 0x101e2d9f0
opidcl(??, ??) at 0x10781dd84
opidrv(??, ??, ??) at 0x10781d660
sou2o(??, ??, ??, ??) at 0x1078131e8
opimai_real(??, ??) at 0x10000089c
ssthrdmain(??, ??) at 0x1000ee84c
main(??, ??) at 0x10000064c
(dbx) quit
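Because many core_<pid> directories pile up, the dbx session above can be repeated without typing it each time. This is a minimal sketch, not a tested AIX tool: it assumes each core_<pid> directory holds one file named core (as in the session above) and that dbx reads its subcommands from standard input. The function names list_core_dirs and dump_stacks are hypothetical.

```shell
#!/bin/sh
# list_core_dirs DIR: print every DIR/core_* directory that contains a file named "core".
list_core_dirs() {
    for d in "$1"/core_*; do
        [ -d "$d" ] && [ -f "$d/core" ] && echo "$d"
    done
    return 0
}

# dump_stacks DIR BINARY: run the dbx "where" subcommand against each core under DIR.
# Assumption: dbx is on PATH and accepts subcommands on standard input.
dump_stacks() {
    list_core_dirs "$1" | while read -r dir; do
        echo "== $dir"
        printf 'where\nquit\n' | dbx "$2" "$dir/core"
    done
}
```

A call such as dump_stacks "$ORACLE_HOME/dbs" "$ORACLE_HOME/bin/oracle" would print one stack per core directory, which makes it easier to see whether all the dumps share the same call path.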
Step 2: Find the Oracle bug number
Bug 13808372 : CORE GENERATED IN ORACLE_HOME/DBS
Hdr: 13808372 11.2.0.3 RDBMS 11.2.0.3 UNKNOWN PRODID-5 PORTID-212
Abstract: CORE GENERATED IN ORACLE_HOME/DBS

*** 03/05/12 12:30 am ***

PROBLEM:
--------
CORE FILE ARE GENERATED IN $ORACLE_HOME/dbs
DATABASE Version is 11.2.0.3

$ ls -l
drwxrwx--- 74 oracle dba-4096 Feb 10 13:24 ..
drwxr-x--- 2 oracle dba-256 Mar 02 11:51 core_4260040
drwxr-x--- 2 oracle dba-256 Mar 02 11:52 core_7340074
drwxr-x--- 2 oracle dba-256 Mar 02 11:52 core_6225922
drwxr-x--- 2 oracle dba-256 Mar 02 11:52 core_17236164
drwxr-x--- 2 oracle dba-256 Mar 02 12:00 core_9175102
drwxr-x--- 2 oracle dba-256 Mar 02 12:00 core_7798936
drwxr-x--- 2 oracle dba-256 Mar 02 12:02 core_9502832
drwxr-x--- 2 oracle dba-256 Mar 02 12:02 core_13697132
drwxr-x--- 2 oracle dba-256 Mar 02 12:11 core_7667918
drwxr-x--- 2 oracle dba-256 Mar 02 12:12 core_5570702
drwxr-x--- 2 oracle dba-256 Mar 02 12:15 core_12714056
drwxr-x--- 2 oracle dba-256 Mar 02 12:21 core_11272362
drwxr-x--- 2 oracle dba-256 Mar 02 12:22 core_5570670
drwxr-x--- 2 oracle dba-256 Mar 02 12:31 core_12714106
drwxr-x--- 2 oracle dba-256 Mar 02 12:32 core_9175240
drwxr-x--- 2 oracle dba-256 Mar 02 12:41 core_7995586
drwxr-x--- 2 oracle dba-256 Mar 02 12:42 core_8978624
drwxr-x--- 2 oracle dba-256 Mar 02 12:51 core_14942342
drwxr-x--- 2 oracle dba-256 Mar 02 12:52 core_6029324
drwxr-x--- 2 oracle dba-256 Mar 02 12:52 core_7733320
............
CT say there is no error such lke ora-7445 in alertlog.

DIAGNOSTIC ANALYSIS:
--------------------
drwxr-xr-x 462 oracle dba-24576 Feb 27 17:44 ..
-rw-r----- 1 oracle dba-13298965 Feb 27 17:44 core

[DGQIS01] oracle@gqmdbd01:/ora_engine/1120/dbs/core_15270044 $ file core
core: AIX core file fulldump 64-bit, oracle

[DGQIS01] oracle@gqmdbd01:/ora_engine/1120/dbs/core_4260040 $ dbx $ORACLE_HOME/bin/oracle core
Type 'help' for help.
[using memory image in core]
reading symbolic information ...
IOT/Abort trap in pthread_kill at 0x9000000004efa30 ($t1)
0x9000000004efa30 (pthread_kill+0xb0) e8410028 ld r2,0x28(r1)
(dbx) where
pthread_kill(??, ??) at 0x9000000004efa30
_p_raise(??) at 0x9000000004ef2a8
raise.raise(??) at 0x90000000002c2ac
abort() at 0x90000000007d084
skgdbgcra(??) at 0x1008c45f0
sksdbgcra(??, ??) at 0x102db2440
ksdbgcra() at 0x102db1e10
ssexhd(??, ??, ??) at 0x102e9b5a8
.() at 0x0
dbgerEvaluateRules(??, ??, ??) at 0x1006d1610
dbgerEvaluateRules(??, ??, ??) at 0x1006d1610
dbgexPhaseII(??, ??, ??) at 0x1002c30b4
dbgexExplicitEndInc(??, ??) at 0x1002c429c
dbgeEndDDEInvocationImpl(??, ??) at 0x10015ec20
dbgeEndDDEInvocation(??) at 0x10015e930
ssexhd(??, ??, ??) at 0x102e9b4bc
.() at 0x0
dbgerEvaluateRules(??, ??, ??) at 0x1006d1610
dbgerEvaluateRules(??, ??, ??) at 0x1006d1610
dbgexPhaseII(??, ??, ??) at 0x1002c30b4
dbgexExplicitEndInc(??, ??) at 0x1002c429c
dbgeEndDDEInvocationImpl(??, ??) at 0x10015ec20
dbgeEndDDEInvocation(??) at 0x10015e930
ssexhd(??, ??, ??) at 0x102e9b4bc
ksmpclrpga() at 0x101e2d620
opidcl(??, ??) at 0x107587224
opidrv(??, ??, ??) at 0x107586b00
sou2o(??, ??, ??, ??) at 0x10757c688
opimai_real(??, ??) at 0x10000089c
ssthrdmain(??, ??) at 0x1000ee84c
main(??, ??) at 0x10000064c
(dbx) quit

WORKAROUND:
-----------
n/a

RELATED BUGS:
-------------
i check the known issue like below but CT does not use EM/GRID CONTROL Agent & RMAN, TSM.
++ RMAN Core Dumps With TSM Client 6.x (Doc ID 1248324.1)
++ RMAN Creating Core Dump Files in $ORACLE_HOME/dbs (Doc ID 1275194.1)
++ Core Files Generated Under $ORACLE_HOME/dbs Directory (Doc ID 1327258.1)

REPRODUCIBILITY:
----------------
YES, EVERY DAY

TEST CASE:
----------
N/A

STACK TRACE:
------------
pthread_kill <- p_raise <- raise <- abort <- skgdbgcra <- sksdbgcra <- ksdbgcra <- ssexhd <- dbgerEvaluateRules <- dbgerEvaluateRules <- dbgexPhaseII <- dbgexExplicitEndInc <- dbgeEndDDEInvocationImpl <- dbgeEndDDEInvocation <- ssexhd <- dbgerEvaluateRules <- dbgerEvaluateRules <- dbgexPhaseII <- dbgexExplicitEndInc <- dbgeEndDDEInvocationImpl <- dbgeEndDDEInvocation <- ssexhd <- ksmpclrpga <- opidcl <- opidrv <- sou2o <- opimai_real <- ssthrdmain <- main
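Matching a local core against the bug comes down to comparing call stacks. As a rough aid, the sketch below turns dbx "where" output into a one-line signature in the same "a <- b <- c" style as the STACK TRACE section of the bug. The function name stack_signature is hypothetical, and the result is only an approximation of Oracle's format (for example, it keeps the leading underscore of _p_raise).

```shell
#!/bin/sh
# stack_signature: read dbx "where" output on stdin and print the function
# names joined with " <- ". Frames without a name, such as ".() at 0x0",
# and non-frame lines such as "(dbx) quit" are skipped.
stack_signature() {
    awk '
        /^[A-Za-z_][A-Za-z0-9_.]*\(.*\) at 0x/ {
            name = $0
            sub(/\(.*/, "", name)    # drop the argument list and the address
            sub(/^.*\./, "", name)   # drop a module prefix such as "raise."
            out = (out == "") ? name : (out " <- " name)
        }
        END { print out }
    '
}
```

Saving the "where" output of a local core to a file and piping it through stack_signature gives a line that can be compared by eye with the signature in the bug text; here both end in ksmpclrpga <- opidcl <- opidrv <- sou2o <- opimai_real <- ssthrdmain <- main.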
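Step 3: Find the solution
Apply OS level patch IFIX IV09580 and relink the oracle software.
1. Download the patch
Description of the IV09580 emergency fix: https://www-304.ibm.com/support/docview.wss?uid=isg1IV09580
Download of the IV09580 emergency fix: ftp://public.dhe.ibm.com/aix/efixes/iv09580
2. Apply the IV09580 fix with the operating system's emgr command:
Download the IV09580 emergency fix from the address above, then apply it with the following steps.
1) Preview the emergency fix installation: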
#emgr -p -e IV09580.epkg.Z
Run the installation command that follows only if the preview output shows INSTALL PREVIEW and SUCCESS.
2) Apply the emergency fix:
#emgr -e IV09580.epkg.Z
3) Check the fix status:
mzrac1@root[/]emgr -l
ID  STATE LABEL      INSTALL TIME      UPDATED BY ABSTRACT
=== ===== ========== ================= ========== ======================================
1    S    IV09580s01 06/27/12 21:55:58            Ifix for IV09580@6.1TL7SP1
STATE codes:
S = STABLE
M = MOUNTED
U = UNMOUNTED
Q = REBOOT REQUIRED
B = BROKEN
I = INSTALLING
R = REMOVING
T = TESTED
P = PATCHED
N = NOT PATCHED
SP = STABLE + PATCHED
SN = STABLE + NOT PATCHED
QP = BOOT IMAGE MODIFIED + PATCHED
QN = BOOT IMAGE MODIFIED + NOT PATCHED
RQ = REMOVING + REBOOT REQUIRED
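When several nodes have to be checked, the STATE column of the emgr -l listing above can be read by a script instead of by eye. This is a minimal sketch that assumes the column order shown above (ID, STATE, LABEL); the function name ifix_state is hypothetical.

```shell
#!/bin/sh
# ifix_state LABEL_PREFIX: read "emgr -l" output on stdin and print the STATE
# code of the first fix whose LABEL starts with LABEL_PREFIX.
# Header and separator lines are skipped because their first field is not numeric.
ifix_state() {
    awk -v p="$1" '$1 ~ /^[0-9]+$/ && index($3, p) == 1 { print $2; exit }'
}
```

Running emgr -l | ifix_state IV09580 on the node above would print S (STABLE). According to the state table, a code containing Q means a reboot is still required before the fix is fully in effect.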
Step 4: Relink Oracle
For Oracle Grid Infrastructure (GI) 11.2 and later, some binaries in the Grid home need to be relinked after an OS upgrade or OS patch. For the database software (RDBMS binaries), relinking is recommended after an OS upgrade or OS patch; the same applies to RAC binaries. The following is the relink procedure in an 11.2 cluster environment, covering both GI and RAC:

1. First stop all database instances on this node. Stopping CRS later would also stop them, but with shutdown abort; they should be stopped with shutdown immediate or normal instead:
$su - oracle
$srvctl stop instance -d -i -o immediate

2. If the business requires high availability, make sure the services on this instance have already been relocated to instances on other nodes.
$ srvctl status service -d

3. As root, run /crs/install/rootcrs.pl -unlock to change the permissions of the relevant directories and stop GI:
[root@rac1 ~]# cd /u01/app/11.2.0/grid/crs/install
[root@rac1 install]# perl rootcrs.pl -unlock
Using configuration parameter file: ./crsconfig_params
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'rac1'
CRS-2673: Attempting to stop 'ora.crsd' on 'rac1'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'rac1'
CRS-2673: Attempting to stop 'ora.rac2.vip' on 'rac1'
CRS-2673: Attempting to stop 'ora.oc4j' on 'rac1'
CRS-2673: Attempting to stop 'ora.LISTENER_SCAN1.lsnr' on 'rac1'
CRS-2673: Attempting to stop 'ora.cvu' on 'rac1'
CRS-2677: Stop of 'ora.rac2.vip' on 'rac1' succeeded
CRS-2677: Stop of 'ora.LISTENER_SCAN1.lsnr' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.scan1.vip' on 'rac1'
CRS-2677: Stop of 'ora.scan1.vip' on 'rac1' succeeded
CRS-2677: Stop of 'ora.oc4j' on 'rac1' succeeded
CRS-2677: Stop of 'ora.cvu' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'rac1'
CRS-2673: Attempting to stop 'ora.CRS.dg' on 'rac1'
CRS-2673: Attempting to stop 'ora.racdb.db' on 'rac1'
CRS-2677: Stop of 'ora.LISTENER.lsnr' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.rac1.vip' on 'rac1'
CRS-2677: Stop of 'ora.rac1.vip' on 'rac1' succeeded
CRS-2677: Stop of 'ora.CRS.dg' on 'rac1' succeeded
CRS-2677: Stop of 'ora.racdb.db' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.DATA.dg' on 'rac1'
CRS-2673: Attempting to stop 'ora.RECO.dg' on 'rac1'
CRS-2677: Stop of 'ora.DATA.dg' on 'rac1' succeeded
CRS-2677: Stop of 'ora.RECO.dg' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.asm' on 'rac1'
CRS-2677: Stop of 'ora.asm' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.ons' on 'rac1'
CRS-2677: Stop of 'ora.ons' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.net1.network' on 'rac1'
CRS-2677: Stop of 'ora.net1.network' on 'rac1' succeeded
CRS-2792: Shutdown of Cluster Ready Services-managed resources on 'rac1' has completed
CRS-2677: Stop of 'ora.crsd' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.mdnsd' on 'rac1'
CRS-2673: Attempting to stop 'ora.crf' on 'rac1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'rac1'
CRS-2673: Attempting to stop 'ora.evmd' on 'rac1'
CRS-2673: Attempting to stop 'ora.asm' on 'rac1'
CRS-2677: Stop of 'ora.mdnsd' on 'rac1' succeeded
CRS-2677: Stop of 'ora.crf' on 'rac1' succeeded
CRS-2677: Stop of 'ora.evmd' on 'rac1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'rac1' succeeded
CRS-2677: Stop of 'ora.asm' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'rac1'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'rac1'
CRS-2677: Stop of 'ora.cssd' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'rac1'
CRS-2677: Stop of 'ora.gipcd' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'rac1'
CRS-2677: Stop of 'ora.gpnpd' on 'rac1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'rac1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
Successfully unlock /u01/app/11.2.0/grid

Note: if there are many audit files under $GRID_HOME/rdbms/audit, rootcrs.pl can take a long time to run. In that case, back up the $GRID_HOME/rdbms/audit/*.aud files to a location outside GRID_HOME and then delete them.

4. Disable GI autostart after an OS reboot. Upgrading or patching the OS may require a host reboot, and GI must not start before the relink is done. As root:
[root@rac1 install]# crsctl disable crs
CRS-4621: Oracle High Availability Services autostart is disabled.

5. Back up the GI and RDBMS ORACLE_HOMEs.

6. Upgrade or patch the OS, including rebooting the host if required.

7. Relink the GI binaries as the GI owner:
[root@rac1 audit]# su - grid
[grid@rac1 ~]$ export ORACLE_HOME=/u01/app/11.2.0/grid
Make sure GI is stopped, then run relink:
[grid@rac1 ~]$ ps -ef|grep d.bin
grid 3408 3360 0 17:09 pts/0 00:00:00 grep d.bin
[grid@rac1 ~]$ crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
[grid@rac1 ~]$ $ORACLE_HOME/bin/relink
writing relink log to: /u01/app/11.2.0/grid/install/relink.log
[grid@rac1 ~]$ <=== when relink finishes it prints no message; only the command prompt returns.
Check /u01/app/11.2.0/grid/install/relink.log for errors. Some lines from the end of the log are excerpted below:
...
- Linking Oracle
rm -f /u01/app/11.2.0/grid/rdbms/lib/oracle
gcc -o /u01/app/11.2.0/grid/rdbms/lib/oracle -m64 -L/u01/app/11.2.0/grid/rdbms/lib/ -L/u01/app/11.2.0/grid/lib/ - ... lsnls11 -lnls11 -lcore11 -lnls11 -lasmclnt11 -lcommon11 -lcore11 -laio `cat /u01/app/11.2.0/grid/lib/sysliblist` -Wl,-rpath,/u01/app/11.2.0/grid/lib -lm `cat /u01/app/11.2.0/grid/lib/sysliblist` -ldl -lm -L/u01/app/11.2.0/grid/lib
test ! -f /u01/app/11.2.0/grid/bin/oracle ||\
mv -f /u01/app/11.2.0/grid/bin/oracle /u01/app/11.2.0/grid/bin/oracleO
mv /u01/app/11.2.0/grid/rdbms/lib/oracle /u01/app/11.2.0/grid/bin/oracle
chmod 6751 /u01/app/11.2.0/grid/bin/oracle

8. Relink the database binaries as the RDBMS owner:
su - oracle
Make sure $ORACLE_HOME is set to the database ORACLE_HOME, then run:
[oracle@rac1 ~]$ $ORACLE_HOME/bin/relink all
writing relink log to: /u01/app/oracle/product/11.2.0/dbhome_1/install/relink.log
<=== when relink finishes it prints no message; only the command prompt returns.
Check /u01/app/oracle/product/11.2.0/dbhome_1/install/relink.log for errors. Excerpt from relink.log:
Starting Oracle Universal Installer... <<<<<< beginning
...
le/product/11.2.0/dbhome_1/lib/sysliblist` -ldl -lm -L/u01/app/oracle/product/11.2.0/dbhome_1/lib
test ! -f /u01/app/oracle/product/11.2.0/dbhome_1/bin/oracle ||\
mv -f /u01/app/oracle/product/11.2.0/dbhome_1/bin/oracle /u01/app/oracle/product/11.2.0/dbhome_1/bin/oracleO
mv /u01/app/oracle/product/11.2.0/dbhome_1/rdbms/lib/oracle /u01/app/oracle/product/11.2.0/dbhome_1/bin/oracle
chmod 6751 /u01/app/oracle/product/11.2.0/dbhome_1/bin/oracle <<<<<< end

9. As root, run /crs/install/rootcrs.pl -patch to change the permissions of the relevant directories and start GI:
[root@rac1 ~]# cd /u01/app/11.2.0/grid/crs/install
[root@rac1 install]# perl rootcrs.pl -patch
Using configuration parameter file: ./crsconfig_params
CRS-4123: Oracle High Availability Services has been started.

10. Enable CRS so that GI starts automatically after a host reboot:
[root@rac1 install]# crsctl enable crs
CRS-4622: Oracle High Availability Services autostart is enabled.

11. Confirm that all resources that should be running have started:
[root@rac1 install]# crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.CRS.dg
               ONLINE  ONLINE       rac1
               ONLINE  ONLINE       rac2
ora.DATA.dg
               ONLINE  ONLINE       rac1
               ONLINE  ONLINE       rac2
ora.LISTENER.lsnr
               ONLINE  ONLINE       rac1
               ONLINE  ONLINE       rac2
ora.RECO.dg
               ONLINE  ONLINE       rac1
               ONLINE  ONLINE       rac2
ora.asm
               ONLINE  ONLINE       rac1                     Started
               ONLINE  ONLINE       rac2                     Started
ora.gsd
               OFFLINE OFFLINE      rac1
               OFFLINE OFFLINE      rac2
ora.net1.network
               ONLINE  ONLINE       rac1
               ONLINE  ONLINE       rac2
ora.ons
               ONLINE  ONLINE       rac1
               ONLINE  ONLINE       rac2
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       rac2
ora.cvu
      1        ONLINE  ONLINE       rac2
ora.oc4j
      1        ONLINE  ONLINE       rac2
ora.rac1.vip
      1        ONLINE  ONLINE       rac1
ora.rac2.vip
      1        ONLINE  ONLINE       rac2
ora.racdb.db
      1        ONLINE  ONLINE       rac2                     Open
      2        OFFLINE OFFLINE                               Instance Shutdown
ora.scan1.vip
      1        ONLINE  ONLINE       rac2
If an instance has not started, start it manually:
$srvctl start instance -d -i

12. The methods in the following MOS note can be used to confirm that the oracle binary is RAC-enabled:
How to Check Whether Oracle Binary/Instance is RAC Enabled and Relink Oracle Binary in RAC [ID 284785.1]
Method 1: if the following command finds kcsm.o, the binary is RAC-enabled:
su - oracle
$ar -t $ORACLE_HOME/rdbms/lib/libknlopt.a|grep kcsm.o
kcsm.o
On AIX the command is different:
ar -X32_64 -t $ORACLE_HOME/rdbms/lib/libknlopt.a|grep kcsm.o
Method 2: check whether RAC-specific background processes exist, for example:
[grid@rac1 ~]$ ps -ef|grep lmon
grid 7732 1 0 17:59 ? 00:00:17 asm_lmon_+ASM1
oracle 18605 1 0 20:49 ? 00:00:00 ora_lmon_RACDB1 <===========
grid 20992 10160 0 21:10 pts/2 00:00:00 grep lmon

All of the steps above must be performed on each node of the cluster in turn.
The GI relink procedure above comes from the section "Do I need to relink the Oracle Clusterware / Grid Infrastructure home after an OS upgrade?" of the following MOS note:
RAC: Frequently Asked Questions [ID 220970.1]
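Because relink prints nothing when it finishes, the check of relink.log is easy to skip. The sketch below scans a log for lines that look like link failures. The search pattern is an assumption chosen for illustration, not an official list of relink error messages, and the function name relink_log_errors is hypothetical; a clean result does not replace reading the end of the log.

```shell
#!/bin/sh
# relink_log_errors LOGFILE: print (with line numbers) the lines of LOGFILE
# that look like link failures. Exit status is 0 if any such line was found
# and non-zero otherwise, following grep's convention.
# Assumption: the keywords below are a guess at typical linker complaints.
relink_log_errors() {
    grep -n -i -E 'error|fatal|undefined symbol|cannot find' "$1"
}
```

For example, relink_log_errors /u01/app/11.2.0/grid/install/relink.log after step 7, and the same call on the database home's relink.log after step 8, would flag suspicious lines before GI is started again in step 9.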