客户反映说一个库前几天因为磁盘损坏,导致操作系统宕机。几经折腾终于把盘恢复了,却又发现数据库无法启动了。并且已经使用trace里面的backup controlfile重建了控制文件,但还是无法正常启动。
检查之后, mount 实例,尝试 recover database 并 open 数据库 :
recover database until cancel using backup controlfile;
因为controlfile是重建出来的,所以当前的controlfile并不知道哪个在线日志是current的,需要手动指定。
Thu Apr 16 13:01:14 2015
ALTER DATABASE RECOVER database until cancel using backup controlfile
Thu Apr 16 13:01:14 2015
Media Recovery Start
WARNING! Recovering data file 1 from a fuzzy file. If not the current file
it might be an online backup taken without entering the begin backup command.
WARNING! Recovering data file 2 from a fuzzy file. If not the current file
it might be an online backup taken without entering the begin backup command.
...
WARNING! Recovering data file 75 from a fuzzy file. If not the current file
it might be an online backup taken without entering the begin backup command.
ORA-279 signalled during: ALTER DATABASE RECOVER database until cancel usin...
Thu Apr 16 13:02:31 2015
ALTER DATABASE RECOVER LOGFILE 'E:\datafile\redo01.log'
Thu Apr 16 13:02:31 2015
Media Recovery Log E:\datafile\redo01.log
Errors with log E:\datafile\redo01.log
ORA-339 signalled during: ALTER DATABASE RECOVER LOGFILE 'E:\datafile\red...
Thu Apr 16 13:02:31 2015
ALTER DATABASE RECOVER CANCEL
ORA-1547 signalled during: ALTER DATABASE RECOVER CANCEL ...
Thu Apr 16 13:03:02 2015
ALTER DATABASE RECOVER database until cancel using backup controlfile
Thu Apr 16 13:03:02 2015
Media Recovery Start
WARNING! Recovering data file 1 from a fuzzy file. If not the current file
it might be an online backup taken without entering the begin backup command.
WARNING! Recovering data file 2 from a fuzzy file. If not the current file
it might be an online backup taken without entering the begin backup command.
...
WARNING! Recovering data file 75 from a fuzzy file. If not the current file
it might be an online backup taken without entering the begin backup command.
ORA-279 signalled during: ALTER DATABASE RECOVER database until cancel usin...
Thu Apr 16 13:03:16 2015
ALTER DATABASE RECOVER LOGFILE 'E:\datafile\redo02.log'
Thu Apr 16 13:03:16 2015
Media Recovery Log E:\datafile\redo02.log
Errors with log E:\datafile\redo02.log
ORA-339 signalled during: ALTER DATABASE RECOVER LOGFILE 'E:\datafile\red...
Thu Apr 16 13:03:16 2015
ALTER DATABASE RECOVER CANCEL
ORA-1547 signalled during: ALTER DATABASE RECOVER CANCEL ...
Thu Apr 16 13:03:34 2015
ALTER DATABASE RECOVER database until cancel using backup controlfile
Thu Apr 16 13:03:34 2015
Media Recovery Start
WARNING! Recovering data file 1 from a fuzzy file. If not the current file
it might be an online backup taken without entering the begin backup command.
WARNING! Recovering data file 2 from a fuzzy file. If not the current file
it might be an online backup taken without entering the begin backup command.
...
WARNING! Recovering data file 75 from a fuzzy file. If not the current file
it might be an online backup taken without entering the begin backup command.
ORA-279 signalled during: ALTER DATABASE RECOVER database until cancel usin...
Thu Apr 16 13:03:49 2015
ALTER DATABASE RECOVER LOGFILE 'E:\datafile\redo03.log'
Thu Apr 16 13:03:49 2015
Media Recovery Log E:\datafile\redo03.log
Incomplete recovery applied all redo ever generated.
Recovery completed through change 14081497748113
Media Recovery Complete
Completed: ALTER DATABASE RECOVER LOGFILE 'E:\datafile\red
多次尝试后发现redo03.log是所需要的current group member,尝试alter database open resetlogs,但是出现ORA-00600 [4000]错误:
Thu Apr 16 08:00:40 2015
Errors in file d:\oracle\admin\kf2\udump\kf2_ora_144.trc:
ORA-00600:
内部错误代码,参数
: [4000], [3], [], [], [], [], [], []
Thu Apr 16 08:00:40 2015
Errors in file d:\oracle\admin\kf2\udump\kf2_ora_144.trc:
ORA-00704:
引导程序进程失败
ORA-00704:
引导程序进程失败
ORA-00600:
内部错误代码,参数
: [4000], [3], [], [], [], [], [], []
Thu Apr 16 08:00:40 2015
Error 704 happened during db open, shutting down database
USER: terminating instance due to error 704
Thu Apr 16 08:00:40 2015
Errors in file d:\oracle\admin\kf2\bdump\kf2_pmon_5172.trc:
ORA-00704: bootstrap process failure
以上ORA-00600: [4000], [3],说明在使用usn=3的回滚段rollback数据块时发现rollback segment存在错误,且伴随有ORA-00704: bootstrap process failure错误,说明需要回滚的数据块是bootstrap需要的自举对象。
一般来说bootstrap object需要做rollback或cleanup,而apply undo数据时,我们是无法使用_corrupted_rollback_segments,_offline_rollback_segments或10513事件来阻止ORA-00600: [4000]的发生。
不过还是抱着试一试的心理尝试一下:
*._allow_resetlogs_corruption=true
event=('10510 trace name context forever,level 1','10511 trace name context forever,level 2','10512 trace name context forever,level 1','10513 trace name context forever,level 2')
*._corrupted_rollback_segments=(_SYSSMU3$)
*._offline_rollback_segments=(_SYSSMU3$)
再次尝试启动数据库:
SMON: enabling cache recovery
Thu Apr 16 08:10:20 2015
Errors in file d:\oracle\admin\kf2\udump\kf2_ora_5148.trc:
ORA-00600:
内部错误代码,参数
: [4000], [3], [], [], [], [], [], []
Thu Apr 16 08:10:20 2015
Errors in file d:\oracle\admin\kf2\udump\kf2_ora_5148.trc:
ORA-00704:
引导程序进程失败
ORA-00704:
引导程序进程失败
ORA-00600:
内部错误代码,参数
: [4000], [3], [], [], [], [], [], []
确实还是无法避免bootstrap对象发生ORA-00600:[4000]错误。没办法,要想恢复这个数据库就必须要解决这个bootstrap对象。
先来看一下ORA-00600:[4000]内部错误的trace日志:
Dump file d:\oracle\admin\kf2\udump\kf2_ora_5148.trc
Thu Apr 16 08:10:08 2015
ORACLE V9.2.0.8.0 - Production vsnsta=0
vsnsql=12 vsnxtr=3
Windows 2000 Version 5.2 Service Pack 2, CPU type 586
Oracle9i Enterprise Edition Release 9.2.0.8.0 - Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.8.0 - Production
Windows 2000 Version 5.2 Service Pack 2, CPU type 586
Instance name: kf2
Redo thread mounted by this instance: 1
Oracle process number: 12
Windows thread id: 5148, image: ORACLE.EXE
*** SESSION ID
9.3) 2015-04-16 08:10:08.515
Start recovery at thread 1 ckpt scn 14081497668136 logseq 1 block 2
*** 2015-04-16 08:10:10.265
Media Recovery Log D:\ORACLE\ORA92\RDBMS\ARC00001.001
*** 2015-04-16 08:10:10.281
Media Recovery Log D:\ORACLE\ORA92\RDBMS\ARC00001.001
*** 2015-04-16 08:10:18.015
*** 2015-04-16 08:10:20.140
ksedmp: internal or fatal error
ORA-00600:
内部错误代码,参数
: [4000], [3], [], [], [], [], [], []
Current SQL statement for this session:
select ctime, mtime, stime from obj$ where obj# = :1
引发错误的数据块头信息:
Block header dump: 0x0040007a
Object id on Block? Y
seg/obj: 0x12 csc: 0xcce.9aabd24f itc: 1 flg: - typ: 1 - DATA
fsl: 0 fnx: 0x0 ver: 0x01
Itl Xid Uba Flag Lck Scn/Fsc
0x01 0x0003.000.0001a69c 0x0080002d.33df.01 --U- 1 fsc 0x0000.9aabd250
data_block_dump,data header at 0x34270044
通过上面的trace文件内容,我们知道:
1. 引发ORA-00600:[4000], [3]错误的语句是”select ctime, mtime, stime from obj$ where obj# = :1″。这是一条常用的递归SQL语句,查询的对象是非常重要的bootstrap自举对象OBJ$基表,这说明需要cleanup的块是OBJ$表上的。
2. 引发ORA-00600:[4000], [3]错误的数据块是1号数据文件的122块,seg/obj为0×12,块类型为Data,且存在有一条ITL entry:
SQL> select DBMS_UTILITY.data_block_address_file (TO_NUMBER ('0040007a','xxxxxxxx')) file_no,
2 DBMS_UTILITY.data_block_address_block (TO_NUMBER ('0040007a','xxxxxxxx')) block_no
3 from dual;
FILE_NO BLOCK_NO
---------- ----------
1 122
SQL>
Itl Xid Uba Flag Lck Scn/Fsc
0x01 0x0003.000.0001a69c 0x0080002d.33df.01 --U- 1 fsc 0x0000.9aabd250
因为该致命内部错误涉及到的对象是重要的Bootstrap表OBJ$,无法使用传统的例如_corrupted_rollback_segments,_offline_rollback_segments或10513事件来阻止ORA-00600: [4000]的发生,必须使用块修改工具BBED来修改存在问题的数据块将ITL事务槽的FLAG从U修改为C(Commit),手工提交该事务。
事务状态标识:
TRANSACTION_COMMITED = 0x08;TRANSACTION_UPBOUND = 0x02;TRANSACTION_ACTIVE = 0x01;
Flag= -U- 即TRANSACTION_UPBOUND时,flag值为0×02,需要将该字节修改为TRANSACTION_COMMITED = 0×08;
下面用bbed修改system01.dbf文件。注意修改前一定要先备份。
E:\datafile>bbed filename=system01.dbf password=blockedit blocksize=8192 mode=edit
BBED: Release 2.0.0.0.0 - Limited Production on
星期四
4
月
16 09:58:02 2015
Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.
************* !!! For Oracle Internal Use only !!! ***************
BBED> map
File: system01.dbf (0)
Block: 1 Dba:0x00000000
------------------------------------------------------------
Data File Header
struct kcvfh, 360 bytes @0
ub4 tailchk @8188
BBED> set block 122
BLOCK# 122
BBED> map
File: system01.dbf (0)
Block: 122 Dba:0x00000000
------------------------------------------------------------
KTB Data Block (Table/Cluster)
struct kcbh, 20 bytes @0
struct ktbbh, 48 bytes @20
struct kdbh, 14 bytes @68
struct kdbt[1], 4 bytes @82
sb2 kdbr[108] @86
ub1 freespace[859] @302
ub1 rowdata[7027] @1161
ub4 tailchk @8188
BBED> p ktbbh
struct ktbbh, 48 bytes @20
ub1 ktbbhtyp @20 0x01 (KDDBTDATA)
union ktbbhsid, 4 bytes @24
ub4 ktbbhsg1 @24 0x00000012
ub4 ktbbhod1 @24 0x00000012
struct ktbbhcsc, 8 bytes @28
ub4 kscnbas @28 0x9aabd24f
ub2 kscnwrp @32 0x0cce
b2 ktbbhict @36 1
ub1 ktbbhflg @38 0x02 (NONE)
ub1 ktbbhfsl @39 0x00
ub4 ktbbhfnx @40 0x00000000
struct ktbbhitl[0], 24 bytes @44
struct ktbitxid, 8 bytes @44
ub2 kxidusn @44 0x0003
ub2 kxidslt @46 0x0000
ub4 kxidsqn @48 0x0001a69c
struct ktbituba, 8 bytes @52
ub4 kubadba @52 0x0080002d
ub2 kubaseq @56 0x33df
ub1 kubarec @58 0x01
ub2 ktbitflg @60 0x2001 (KTBFUPB)
union _ktbitun, 2 bytes @62
b2 _ktbitfsc @62 0
ub2 _ktbitwrp @62 0x0000
ub4 ktbitbas @64 0x9aabd250
BBED> set offset 61
OFFSET 61
BBED> set count 16
COUNT 16
BBED> d
File: system01.dbf (0)
Block: 122 Offsets: 61 to 76 Dba:0x00000000
------------------------------------------------------------------------
20000050 d2ab9a00 016c00ff ffea0045
<32 bytes per line>
BBED> m /x 0x80
File: system01.dbf (0)
Block: 122 Offsets: 61 to 76 Dba:0x00000000
------------------------------------------------------------------------
80000050 d2ab9a00 016c00ff ffea0045
<32 bytes per line>
BBED> sum apply
Check value for File 0, Block 122:
current = 0xb0d6, required = 0xb0d6
BBED>
再次尝试打开数据库,出现 ORA-00600:[2256] 错误:
Thu Apr 16 14:43:57 2015
Errors in file d:\oracle\admin\kf2\udump\kf2_ora_5744.trc:
ORA-00600:
内部错误代码,参数
: [2256], [3278], [3221225472], [3278], [3221225539], [], [], []
Thu Apr 16 14:43:57 2015
Errors in file d:\oracle\admin\kf2\udump\kf2_ora_5744.trc:
ORA-00600:
内部错误代码,参数
: [2256], [3278], [3221225472], [3278], [3221225539], [], [], []
Thu Apr 16 14:43:57 2015
Error 600 happened during db open, shutting down database
USER: terminating instance due to error 600
一般ORA-00600:[2256]错误可以通过10015 ADJUST_SCN事件来推进SCN解决。
MOS里面对ORA-00600:[2256]的各个参数解释如下:
Arg [a] Current SCN WRAP
Arg
Current SCN BASE
Arg [c] dependent SCN WRAP
Arg [d] dependent SCN BASE
Arg [e] Where present this is the DBA where the dependent SCN came from.
我们的日志中的各个参数值如下:
ORA-00600:
内部错误代码,参数
: [2662], [3278], [2594951966], [3278], [2595857063], [4235250], [], []
根据level的计算规则算出需要推进的level大小:
Level = Arg[c] * 4 + Arg[d]/1024/1024/1024 = 13115
再次open数据库,alert日志如下:
可以看到,这时的报错已经不一样了,这是因为system01.dbf文件中还有其他的坏块从而导致down库的。
用dbv检查system01.dbf文件:
检查出有4个坏块(43268,44180,44052,47652)。
再次用bbed工具修复坏块:
正常情况下tail check=scn base+block type+scn seq,因此taikchk的值应该是7baa + 06 + 01,即[url=]
0×7baa0601[/url]
。而我们的当前值是
0xaf060601
。
根据little-endian的规则,0×7baa0601应该写成:01 06 aa 7b
验证后,43268的块已经修复。用同样的方法将另外3个块修复。
再次打开数据库:
终于启动起来了。
不过还需要重建undo表空间,并立马将数据exp导出备份好。事实上,数据字典还是有所损坏,部分数据查询时还有错误,在此不再赘述。