ONS goes offline by itself - Authentication OSD error, op: scls_auth_client_response_set

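A 4-node 10.2.0.4 RAC (64-bit). Node 2 had a problem and was taken offline, leaving nodes 1, 3, 4 and 5. Linux AS 5.3, 64-bit.
One memory module on node 3 had failed and had to be replaced with the node powered down. Node 3 was shut
down at around 17:15, the memory was replaced, and the node was started again. After that, all CRS services
on all nodes were running normally. The remaining actions were performed by an overseas DBA and were not
closely monitored. He appears to have rebuilt an index on one table, and the logs suggest that an expdp
backup was also run. The ONS log shows clear error messages, but it is still not obvious what actually
caused ONS to fail at 22:30:39.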





mxrac01<*mxdell1*/u01/product/admin/mxdell/bdump>$crs_stat -t  
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.mxdell.db  application    ONLINE    ONLINE    mxrac01     
ora....l1.inst application    ONLINE    ONLINE    mxrac01     
ora....l3.inst application    ONLINE    ONLINE    mxrac03     
ora....l4.inst application    ONLINE    ONLINE    mxrac04     
ora....l5.inst application    ONLINE    ONLINE    mxrac05     
ora....01.lsnr application    ONLINE    ONLINE    mxrac01     
ora....c01.gsd application    ONLINE    ONLINE    mxrac01     
ora....c01.ons application    ONLINE    ONLINE    mxrac01     
ora....c01.vip application    ONLINE    ONLINE    mxrac01     
ora....03.lsnr application    ONLINE    ONLINE    mxrac03     
ora....c03.gsd application    ONLINE    ONLINE    mxrac03     
ora....c03.ons application    ONLINE    ONLINE    mxrac03     
ora....c03.vip application    ONLINE    ONLINE    mxrac03     
ora....04.lsnr application    ONLINE    ONLINE    mxrac04     
ora....c04.gsd application    ONLINE    ONLINE    mxrac04     
ora....c04.ons application    ONLINE    ONLINE    mxrac04     
ora....c04.vip application    ONLINE    ONLINE    mxrac04     
ora....05.lsnr application    ONLINE    ONLINE    mxrac05     
ora....c05.gsd application    ONLINE    ONLINE    mxrac05     
ora....c05.ons application    ONLINE    OFFLINE              
ora....c05.vip application    ONLINE    ONLINE    mxrac05     
mxrac01<*mxdell1*/u01/product/admin/mxdell/bdump>$
mxrac01<*mxdell1*/u01/product/admin/mxdell/bdump>$
mxrac01<*mxdell1*/u01/product/admin/mxdell/bdump>$
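With this many resources, a State that does not match its Target is easy to miss in the `crs_stat -t` listing. Below is a minimal Python sketch that picks out such rows from saved output. The function name `offline_resources` and the sample text are illustrative assumptions, not part of any Oracle tooling.

```python
def offline_resources(crs_stat_output):
    """Return (name, target, state) for rows whose State differs from Target."""
    mismatched = []
    for line in crs_stat_output.splitlines():
        fields = line.split()
        # Data rows start with "ora." and carry at least Name, Type, Target, State.
        if len(fields) < 4 or not fields[0].startswith("ora."):
            continue
        name, _type, target, state = fields[:4]
        if target != state:
            mismatched.append((name, target, state))
    return mismatched


# A few rows copied from the listing above (hypothetical saved output).
sample = """\
Name           Type           Target    State     Host
------------------------------------------------------------
ora....c05.gsd application    ONLINE    ONLINE    mxrac05
ora....c05.ons application    ONLINE    OFFLINE
ora....c05.vip application    ONLINE    ONLINE    mxrac05
"""

print(offline_resources(sample))
```

For the rows shown here, only the ONS resource of node 5 is reported.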






The ONS log on node 5. The errors look like authentication/permission-related messages.


mxrac05<*mxdell5*/u01/product/crs/log/mxrac05/racg>$vi ora.mxrac05.ons.log   

2010-11-14 22:30:39.763: [ CSSCLNT][3030990112]clsssInitNative: connect failed, rc 2
2010-11-14 22:30:39.772: [    RACG][3030990112] [29004][3030990112][ora.mxrac05.ons]: clsrccssgetctx: clsssinit() failed. rc=3
2010-11-14 22:30:39.773: [ COMMCRS][3030990112]Authentication OSD error, op: scls_auth_client_response_set
loc: write
info: len -1 != expected 4
dep: 28
2010-11-14 22:30:39.773: [    RACG][3030990112] [29004][3030990112][ora.mxrac05.ons]: clsrcgetprsrctx: prsr_init_ext returned rc = 3
2010-11-14 22:30:39.978: [    RACG][3030990112] [29004][3030990112][ora.mxrac05.ons]: clsrons_init failed, stat = 504, crerr = 32
ons is not running ...

2010-11-14 22:30:39.978: [    RACG][3030990112] [29004][3030990112][ora.mxrac05.ons]: clsrcexecut: cmd = /u01/product/crs/bin/racgeut -e _USR_ORA_DEBUG=0 540 /u01/product/crs/bin/onsctl ping
2010-11-14 22:30:39.978: [    RACG][3030990112] [29004][3030990112][ora.mxrac05.ons]: clsrcexecut: rc = 1, time = 0.210s
2010-11-14 22:30:39.978: [    RACG][3030990112] [29004][3030990112][ora.mxrac05.ons]: end for resource = ora.mxrac05.ons, action = check, status = 1, time = 0.250s
2010-11-14 22:30:40.646: [ CSSCLNT][2823294240]clsssInitNative: connect failed, rc 2
2010-11-14 22:30:40.646: [    RACG][2823294240] [29024][2823294240][ora.mxrac05.ons]: clsrccssgetctx: clsssinit() failed. rc=3
2010-11-14 22:30:40.647: [ COMMCRS][2823294240]Authentication OSD error, op: scls_auth_client_response_set
loc: write
info: len -1 != expected 4
dep: 28
2010-11-14 22:30:40.647: [    RACG][2823294240] [29024][2823294240][ora.mxrac05.ons]: clsrcgetprsrctx: prsr_init_ex
2010-11-15 01:43:21.975: [    RACG][807959840] [18202][807959840][ora.mxrac05.ons]: Number of onsconfiguration retrieved, numcfg = 4
onscfg[0]
   {node = mxrac01, port = 6200}
Adding remote host mxrac01:6200
onscfg[1]
   {node = mxrac03, port = 6200}
Adding remote host mxrac03:6200
onscfg[2]
   {node = mxrac04, port = 6200}
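To see when the authentication errors began and how often they repeat, the racg log can be split into timestamped entries, with the continuation lines (`loc:`, `info:`, `dep:`) attached to the entry they belong to. This is only a rough sketch: the regular expression is inferred from the lines shown above and is an assumption about the log layout, and `parse_entries` and `auth_errors` are hypothetical helper names.

```python
import re

# Timestamp, component tag, thread id, then the rest of the message.
ENTRY = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): \[\s*(\w+)\]\[(\d+)\](.*)$"
)


def parse_entries(log_text):
    """Group a racg log into entries; untimestamped lines join the previous entry."""
    entries = []
    for line in log_text.splitlines():
        m = ENTRY.match(line)
        if m:
            ts, component, thread, msg = m.groups()
            entries.append(
                {"ts": ts, "component": component, "thread": thread,
                 "lines": [msg.strip()]}
            )
        elif entries and line.strip():
            entries[-1]["lines"].append(line.strip())
    return entries


def auth_errors(entries):
    """Keep only the COMMCRS 'Authentication OSD error' entries."""
    return [
        e for e in entries
        if e["component"] == "COMMCRS"
        and "Authentication OSD error" in e["lines"][0]
    ]


sample = """\
2010-11-14 22:30:39.763: [ CSSCLNT][3030990112]clsssInitNative: connect failed, rc 2
2010-11-14 22:30:39.772: [    RACG][3030990112] [29004][3030990112][ora.mxrac05.ons]: clsrccssgetctx: clsssinit() failed. rc=3
2010-11-14 22:30:39.773: [ COMMCRS][3030990112]Authentication OSD error, op: scls_auth_client_response_set
loc: write
info: len -1 != expected 4
dep: 28
"""

entries = parse_entries(sample)
for err in auth_errors(entries):
    print(err["ts"], err["lines"])
```

Run against the whole file, the first timestamp printed would show when the check action started failing.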




The CRS log on node 5 (node 3 had a bad memory module replaced, so the errors below are expected).

2010-09-06 02:20:52.588
[crsd(11964)]CRS-1201:CRSD started on node mxrac05.
2010-11-07 02:32:47.411
[cssd(12637)]CRS-1605:CSSD voting file is online: /ocfs_data/crs/votingdisk. Details in /u01/product/crs/log/mxrac05/cssd/ocssd.log.
[cssd(12637)]CRS-1601:CSSD Reconfiguration complete. Active nodes are mxrac01 mxrac03 mxrac04 mxrac05 .
2010-11-07 02:32:48.583
[crsd(12013)]CRS-1012:The OCR service started on node mxrac05.
2010-11-07 02:32:48.656
[evmd(11875)]CRS-1401:EVMD started on node mxrac05.
2010-11-07 02:32:50.115
[crsd(12013)]CRS-1201:CRSD started on node mxrac05.
2010-11-14 17:13:50.019
[cssd(12637)]CRS-1612:node mxrac03 (3) at 50% heartbeat fatal, eviction in 29.020 seconds
2010-11-14 17:14:04.048
[cssd(12637)]CRS-1611:node mxrac03 (3) at 75% heartbeat fatal, eviction in 14.222 seconds
2010-11-14 17:14:05.050
[cssd(12637)]CRS-1611:node mxrac03 (3) at 75% heartbeat fatal, eviction in 13.222 seconds
2010-11-14 17:14:13.066
[cssd(12637)]CRS-1610:node mxrac03 (3) at 90% heartbeat fatal, eviction in 5.202 seconds
2010-11-14 17:14:14.068
[cssd(12637)]CRS-1610:node mxrac03 (3) at 90% heartbeat fatal, eviction in 4.202 seconds
2010-11-14 17:14:15.070
[cssd(12637)]CRS-1610:node mxrac03 (3) at 90% heartbeat fatal, eviction in 3.202 seconds
2010-11-14 17:14:16.072
[cssd(12637)]CRS-1610:node mxrac03 (3) at 90% heartbeat fatal, eviction in 2.202 seconds
2010-11-14 17:14:17.074
[cssd(12637)]CRS-1610:node mxrac03 (3) at 90% heartbeat fatal, eviction in 1.202 seconds
2010-11-14 17:14:18.075
[cssd(12637)]CRS-1610:node mxrac03 (3) at 90% heartbeat fatal, eviction in 0.192 seconds
[cssd(12637)]CRS-1601:CSSD Reconfiguration complete. Active nodes are mxrac01 mxrac04 mxrac05 .
[cssd(12637)]CRS-1601:CSSD Reconfiguration complete. Active nodes are mxrac01 mxrac03 mxrac04 mxrac05 .





The Oracle alert log on node 5. Earlier testing showed that whenever expdp is run on a node,
commands that modify service_names appear in that node's alert log.

Sun Nov 14 23:09:05 2010
Thread 5 advanced to log sequence 14077 (LGWR switch)
  Current log# 49 seq# 14077 mem# 0: /ocfs_ctrl_redo/mxdell/redo49_a.log
  Current log# 49 seq# 14077 mem# 1: /ocfs_data/mxdell/redo49_b.log
Sun Nov 14 23:09:16 2010
ALTER SYSTEM SET service_names='SYS$SYS.KUPC$S_5_20101114225709.MXDELL','mxdell' SCOPE=MEMORY SID='mxdell5';
Sun Nov 14 23:09:16 2010
ALTER SYSTEM SET service_names='mxdell' SCOPE=MEMORY SID='mxdell5';
Sun Nov 14 23:10:24 2010
Thread 5 cannot allocate new log, sequence 14078
Checkpoint not complete
  Current log# 49 seq# 14077 mem# 0: /ocfs_ctrl_redo/mxdell/redo49_a.log
  Current log# 49 seq# 14077 mem# 1: /ocfs_data/mxdell/redo49_b.log
Sun Nov 14 23:10:27 2010
Thread 5 advanced to log sequence 14078 (LGWR switch)
  Current log# 50 seq# 14078 mem# 0: /ocfs_ctrl_redo/mxdell/redo50_a.log
  Current log# 50 seq# 14078 mem# 1: /ocfs_data/mxdell/redo50_b.log




Checking the ONS processes on node 5:

mxrac05<*mxdell5*/u01/product/oracle>$ps -ef |grep ons
oracle   13281     1  0 Nov07 ?        00:00:00 /u01/product/crs/opmn/bin/ons -d
oracle   13282 13281  0 Nov07 ?        00:00:00 /u01/product/crs/opmn/bin/ons -d
oracle   19743 15970  0 00:33 pts/0    00:00:00 grep ons




The Oracle alert log on node 5, entries from around 22:30:
Sun Nov 14 22:17:12 2010
Thread 5 advanced to log sequence 13988 (LGWR switch)
  Current log# 50 seq# 13988 mem# 0: /ocfs_ctrl_redo/mxdell/redo50_a.log
  Current log# 50 seq# 13988 mem# 1: /ocfs_data/mxdell/redo50_b.log
Sun Nov 14 22:27:28 2010
Thread 5 advanced to log sequence 13989 (LGWR switch)
  Current log# 51 seq# 13989 mem# 0: /ocfs_ctrl_redo/mxdell/redo51_a.log
  Current log# 51 seq# 13989 mem# 1: /ocfs_data/mxdell/redo51_b.log
Sun Nov 14 22:28:18 2010
ALTER SYSTEM SET service_names='SYS$SYS.KUPC$S_5_20101114221629.MXDELL','mxdell' SCOPE=MEMORY SID='mxdell5';
Sun Nov 14 22:28:18 2010
ALTER SYSTEM SET service_names='mxdell' SCOPE=MEMORY SID='mxdell5';
Sun Nov 14 22:34:50 2010
Thread 5 advanced to log sequence 13990 (LGWR switch)
  Current log# 52 seq# 13990 mem# 0: /ocfs_ctrl_redo/mxdell/redo52_a.log
  Current log# 52 seq# 13990 mem# 1: /ocfs_data/mxdell/redo52_b.log
Sun Nov 14 22:35:34 2010
The value (30) of MAXTRANS parameter ignored.
Sun Nov 14 22:35:35 2010
ALTER SYSTEM SET service_names='mxdell','SYS$SYS.KUPC$C_5_20101114223535.MXDELL' SCOPE=MEMORY SID='mxdell5';
Sun Nov 14 22:35:35 2010
ALTER SYSTEM SET service_names='SYS$SYS.KUPC$C_5_20101114223535.MXDELL','mxdell','SYS$SYS.KUPC$S_5_20101114223535.MXDELL' SCOPE=MEMORY SID='mxdell5';
kupprdp: master process DM00 started with pid=92, OS id=31287
         to execute - SYS.KUPM$MCP.MAIN('SYS_EXPORT_TABLE_02', 'SYSTEM', 'KUPC$C_5_20101114223535', 'KUPC$S_5_20101114223535', 0);
kupprdp: worker process DW01 started with worker id=1, pid=104, OS id=31300
         to execute - SYS.KUPW$WORKER.MAIN('SYS_EXPORT_TABLE_02', 'SYSTEM');
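The temporary Data Pump service names seem to embed the instance number and a creation timestamp: the `KUPC$C_5_20101114223535` service is registered at 22:35:35 in the alert log above. Assuming that naming pattern holds (an inference from this log, not something confirmed from Oracle documentation), the following Python sketch extracts those timestamps so they can be compared with the 22:30:39 ONS failure. `datapump_queue_times` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Assumed pattern: KUPC$<C|S>_<instance>_<YYYYMMDDHH24MISS>
KUPC = re.compile(r"KUPC\$([CS])_(\d+)_(\d{14})")


def datapump_queue_times(alert_text):
    """Return unique (kind, instance, datetime) tuples in order of first appearance."""
    seen = set()
    result = []
    for kind, inst, stamp in KUPC.findall(alert_text):
        key = (kind, int(inst), stamp)
        if key in seen:
            continue
        seen.add(key)
        result.append((kind, int(inst), datetime.strptime(stamp, "%Y%m%d%H%M%S")))
    return result


# Two statements copied from the alert log above.
sample = """\
ALTER SYSTEM SET service_names='SYS$SYS.KUPC$S_5_20101114221629.MXDELL','mxdell' SCOPE=MEMORY SID='mxdell5';
ALTER SYSTEM SET service_names='mxdell','SYS$SYS.KUPC$C_5_20101114223535.MXDELL' SCOPE=MEMORY SID='mxdell5';
"""

for kind, inst, when in datapump_queue_times(sample):
    print(kind, inst, when)
```

If the assumption is right, the jobs visible in this excerpt were created at 22:16:29 and 22:35:35, on either side of the ONS failure time, which is worth checking against the full alert log before drawing any conclusion.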






Manually starting ONS on node 5:
mxrac05<*mxdell5*/home/oracle>$crs_start    ora.mxrac05.ons
Attempting to start `ora.mxrac05.ons` on member `mxrac05`
Start of `ora.mxrac05.ons` on member `mxrac05` succeeded.
mxrac05<*mxdell5*/home/oracle>$


The ONS log after the manual start:
2010-11-15 01:43:21.975: [    RACG][807959840] [18202][807959840][ora.mxrac05.ons]: Adding remote host mxrac04:6200
onscfg[3]
   {node = mxrac05, port = 6200}
Adding remote host mxrac05:6200
onsctl: ons is already running
2010-11-15 01:43:21.975: [    RACG][807959840] [18202][807959840][ora.mxrac05.ons]: clsrcexecut: env ORACLE_CONFIG_HOME=/u01/product/crs
2010-11-15 01:43:21.975: [    RACG][807959840] [18202][807959840][ora.mxrac05.ons]: clsrcexecut: cmd = /u01/product/crs/bin/racgeut -e _USR_ORA_DEBUG=0 540 /u01/product/crs/bin/onsctl start
2010-11-15 01:43:21.975: [    RACG][807959840] [18202][807959840][ora.mxrac05.ons]: clsrcexecut: rc = 1, time = 0.210s
2010-11-15 01:43:22.180: [    RACG][807959840] [18202][807959840][ora.mxrac05.ons]: Number of onsconfiguration retrieved, numcfg = 4
onscfg[0]
   {node = mxrac01, port = 6200}
Adding remote host mxrac01:6200
onscfg[1]
   {node = mxrac03, port = 6200}
Adding remote host mxrac03:6200
onscfg[2]
   {node = mxrac04, port = 6200}
2010-11-15 01:43:22.180: [    RACG][807959840] [18202][807959840][ora.mxrac05.ons]: Adding remote host mxrac04:6200
onscfg[3]
   {node = mxrac05, port = 6200}
Adding remote host mxrac05:6200
ons is running ...






ONS on node 5 is back to normal.


mxrac05<*mxdell5*/u01/product/crs/log/mxrac05/racg>$crs_stat -t  
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.mxdell.db  application    ONLINE    ONLINE    mxrac01     
ora....l1.inst application    ONLINE    ONLINE    mxrac01     
ora....l3.inst application    ONLINE    ONLINE    mxrac03     
ora....l4.inst application    ONLINE    ONLINE    mxrac04     
ora....l5.inst application    ONLINE    ONLINE    mxrac05     
ora....01.lsnr application    ONLINE    ONLINE    mxrac01     
ora....c01.gsd application    ONLINE    ONLINE    mxrac01     
ora....c01.ons application    ONLINE    ONLINE    mxrac01     
ora....c01.vip application    ONLINE    ONLINE    mxrac01     
ora....03.lsnr application    ONLINE    ONLINE    mxrac03     
ora....c03.gsd application    ONLINE    ONLINE    mxrac03     
ora....c03.ons application    ONLINE    ONLINE    mxrac03     
ora....c03.vip application    ONLINE    ONLINE    mxrac03     
ora....04.lsnr application    ONLINE    ONLINE    mxrac04     
ora....c04.gsd application    ONLINE    ONLINE    mxrac04     
ora....c04.ons application    ONLINE    ONLINE    mxrac04     
ora....c04.vip application    ONLINE    ONLINE    mxrac04     
ora....05.lsnr application    ONLINE    ONLINE    mxrac05     
ora....c05.gsd application    ONLINE    ONLINE    mxrac05     
ora....c05.ons application    ONLINE    ONLINE    mxrac05     
ora....c05.vip application    ONLINE    ONLINE    mxrac05