一、 数据库坏块背景说明
某次,用户反馈数据库后台日志出现数据库坏块,部分表查询时报错。在业务恢复之后,通过梳理处理记录,做一次总结分析,以下将对该案例的诊断过程进行说明。
二、问题详细诊断过程
2 .1 数据库 alter 日志
Thu Jun 10 08:36:33 2021
OS Pid: 12648460 executed alter system set events '10231 trace name context forever,level 10'
Thu Jun 10 08:37:38 2021
ALTER SYSTEM: Flushing buffer cache
Thu Jun 10 08:37:55 2021
Hex dump of (file 62, block 5521) in trace file /oracle/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_12648460.trc
Corrupt block relative dba: 0x0f801591 (file 62, block 5521)
Bad header found during buffer read
Data in bad block:
type: 6 format: 2 rdba: 0x2acf8b99
last change scn: 0x0006.edc07b41 seq: 0x1 flg: 0x06
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0x7b410601
check value in block header: 0xdb02
computed block checksum: 0x0
Reading datafile '+DATA/orcl/datafile/tbs_sys.ora' for corruption at rdba: 0x0f801591 (file 62, block 5521)
Read datafile mirror 'DATA_0000' (file 62, block 5521) found same corrupt data (no logical check)
WARNING: requested mirror side 2 of virtual extent 43 logical extent 1 offset 139264 is not allocated; I/O request failed
Errors in file /oracle/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_12648460.trc (incident=912662):
ORA-01578: ORACLE data block corrupted (file # 62, block # 5521)
ORA-01110: data file 62: '+DATA/orcl/datafile/tbs_sys.ora'
Incident details in: /oracle/oracle/diag/rdbms/orcl/orcl1/incident/incdir_912662/orcl1_ora_12648460_i912662.trc
Thu Jun 10 08:37:55 2021
Corrupt Block Found
TSN = 62, TSNAME = TBS_SYS
RFN = 62, BLK = 5521, RDBA = 260052369
OBJN = 82103, OBJD = 82103, OBJECT = XXXX100D, SUBOBJECT =
SEGMENT OWNER = XXXXXX_ST, SEGMENT TYPE = Table Segment
Thu Jun 10 08:37:57 2021
Dumping diagnostic data in directory=[cdmp_20210610083757], requested by (instance=1, osid=12648460), summary=[incident=912662].
Thu Jun 10 08:37:58 2021
Sweep [inc][912662]: completed
Sweep [inc2][912774]: completed
Sweep [inc2][912662]: completed
Thu Jun 10 08:38:14 2021
Dumping diagnostic data in directory=[cdmp_20210610083814], requested by (instance=2, osid=9175400), summary=[incident=353303].
Thu Jun 10 08:38:29 2021
OS Pid: 12648460 executed alter system set events '10231 trace name context off'
从数据库后台日志可以看出,数据文件 '+DATA/orcl/datafile/tbs_sys.ora' 发生损坏,出现了部分损坏的 block 。损坏的块属于 XXXXXX_ST 用户下的 XXXX100 表。
2 .2 操作系统错误日志

检查 A IX 主机的操作系统日志,发现存在多块磁盘存在 DISK OPERATION ERROR 错误。
2 .3 R MN 校验数据库坏块
RMAN> run {
allocate channel d1 type disk;
allocate channel d2 type disk;
allocate channel d3 type disk;
allocate channel d4 type disk;
backup validate check logical database;
release channel d1;
release channel d2;
release channel d3;
release channel d4;
}


当使用 R MAN 校验到 6 2 号数据文件 t bs _ sys.ora 时报错,自动退出。
S QL> select * from g v$database_block_corrutption ;
no rows selected
查询数据库坏块视图,没有返回数据库中坏块的信息。
2 .4 使用 exp 尝试导出表数据

使用 exp 导出 XXXX100 表,在查询到 6 2 号文件的 5 248 号块时报错,导出失败。
2 .5 RMAN 备份修复坏块
RMAN> blockrecover datafile 6 2 block 5248 ; -- 单块修复
RMAN> blockrecover corruption list; -- 对视图中所有坏块修复
sys@ORCL> select count(*) from XXXXXX_ST.XXXX100;
select count(*) from XXXXXX_ST.XXXX100
*
ERROR at line 1:
ORA-01115: IO error reading block from file (block # )
ORA-01115: IO error reading block from file 62 (block # 5248)
ORA-15081: failed to submit an I/O operation to a disk
ORA-15081: failed to submit an I/O operation to a disk
再次查询 XXXX100 表在访问 5 248 号块时,报 I O error 。
2 .6 设置 1 0231 事件跳过坏块
SQL>alter system SET EVENTS '10231 trace name context forever,level 10';
设置 1 0231 事件后,查询 XXXX100 表,仍然报数据块损坏,无法查询。

2 .7 使用 dbms 包标记坏块
sys@ORCL> begin
dbms_repair.skip_corrupt_blocks(
schema_name=>'XXXXXX_ST',
object_name=>'XXXX100',
object_type=>dbms_repair.table_object,
flags=>dbms_repair.skip_flag);
end;
/ 2 3 4 5 6 7 8
PL/SQL procedure successfully completed.
sys@ORCL> select count(*) from XXXXXX_ST.XXXX100;
select count(*) from XXXXXX_ST.XXXX100
*
ERROR at line 1:
ORA-01115: IO error reading block from file (block # )
ORA-01115: IO error reading block from file 62 (block # 5248)
ORA-15081: failed to submit an I/O operation to a disk
ORA-15081: failed to submit an I/O operation to a disk
sys@ORCL> exec dbms_repair.skip_corrupt_blocks(schema_name => 'XXXXXX_ST',object_name => 'XXXX100',flags => 1);
PL/SQL procedure successfully completed.
sys@ORCL>
sys@ORCL> select count(*) from XXXXXX_ST.XXXX100;
select count(*) from XXXXXX_ST.XXXX100
*
ERROR at line 1:
ORA-01115: IO error reading block from file (block # )
ORA-01115: IO error reading block from file 62 (block # 5248)
ORA-15081: failed to submit an I/O operation to a disk
ORA-15081: failed to submit an I/O operation to a disk
从结果上来看,使用 dbms 包标记的方法也无法正常跳过坏块,查询出来数据。
2 .8 基于 rowid 抽取数据
2 .8.1 查看表属性

2 .8.2 基于块 rowid 生成插入语句
-- 基于 rowid 抽取坏块数据
select 'insert into test select * from XXXXXX_ST.XXXX100 t where rowid between ' || '''' ||
dbms_rowid.rowid_create(1,
o.data_object_id,
e.RELATIVE_FNO,
e.BLOCK_ID,
0) || '''' || ' and ' || '''' ||
DBMS_ROWID.ROWID_CREATE(1,
o.data_object_id,
e.RELATIVE_FNO,
e.BLOCK_ID + e.BLOCKS - 1,
10000) || '''' || ';' || CHR(10) ||
'commit;'
from dba_extents e, dba_objects o
where e.segment_name = 'XXXX100'
and e.owner = 'XXXXXX_ST'
AND o.object_name = 'XXXX100'
AND o.owner = 'XXXXXX_ST';

2 .8.3 创建空表
SQL> create table test as select * from XXXXXX_ST.XXXX100 where 1=2;
2 .8.4 插入数据

从图中可以看出就只有第一个 insert 语句因为数据块损坏无法正常执行,其他都执行成功。
2 .8.5 查询表数据


可以看出新建的表,能够正常访问和查询。
三、解决办法和建议
3.1 解决办法
从问题详细诊断过程的基于 rowid 抽取数据的方法可以看出,此方法能够跳过坏块,读取出表中可用的数据。因此,该问题的处理思路大概如下:
( 1 )将原表 rename 掉,例如: XXXX100 rename 为 XXXX100 _BAK
( 2 )复制 XXXX100_ BAK 的表结构创建一个名为 XXXX100 的空表。
( 3 )使用 rowid 的抽取方式,把没有损坏的块的数据抽取到 XXXX100 表中。
( 4 )查看 XXXX100 _BAK 的表属性,重新创建 XXXX100 的索引及约束等。
3.2 分析建议
( 1 )因使用 R MAN 校验失败,无法从数据库视图中直接查询有多少表受到了影响,可以使用 select 批量生成查询脚本或使用 exp 导出该用户的方式,当访问或导出问题表时,会出现报错,通过此方法来确定有多少表受到了影响,进而统一进行修复。
( 2 )对重要的数据库做异地容灾,可以避免如机房断电、硬件故障等原因造成的业务中断或数据丢失。