Oracle数据库坏块典型案例分析

一、 数据库坏块背景说明

某次,用户反馈数据库后台日志出现数据库坏块,部分表查询时报错。在业务恢复之后,通过梳理处理记录,做一次总结分析,以下将对该案例的诊断过程进行说明。

二、问题详细诊断过程

2 .1 数据库 alter 日志

Thu Jun 10 08:36:33 2021

OS Pid: 12648460 executed alter system set events '10231 trace name context forever,level 10'

Thu Jun 10 08:37:38 2021

ALTER SYSTEM: Flushing buffer cache

Thu Jun 10 08:37:55 2021

Hex dump of (file 62, block 5521) in trace file /oracle/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_12648460.trc

Corrupt block relative dba: 0x0f801591  (file 62, block 5521)

Bad header found during buffer read

Data in bad block:

 type: 6 format: 2 rdba: 0x2acf8b99

 last change scn: 0x0006.edc07b41 seq: 0x1 flg: 0x06

 spare1: 0x0 spare2: 0x0 spare3: 0x0

 consistency value in tail: 0x7b410601

 check value in block header: 0xdb02

 computed block checksum: 0x0

Reading datafile '+DATA/orcl/datafile/tbs_sys.ora' for corruption at rdba: 0x0f801591 (file 62, block 5521)

Read datafile mirror 'DATA_0000' (file 62, block 5521) found same corrupt data (no logical check)

WARNING: requested mirror side 2 of virtual extent 43 logical extent 1 offset 139264 is not allocated; I/O request failed

Errors in file /oracle/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_12648460.trc  (incident=912662):

ORA-01578: ORACLE data block corrupted (file # 62, block # 5521)

ORA-01110: data file 62: '+DATA/orcl/datafile/tbs_sys.ora'

Incident details in: /oracle/oracle/diag/rdbms/orcl/orcl1/incident/incdir_912662/orcl1_ora_12648460_i912662.trc

Thu Jun 10 08:37:55 2021

Corrupt Block Found

         TSN = 62, TSNAME = TBS_SYS

         RFN = 62, BLK = 5521, RDBA = 260052369

         OBJN = 82103, OBJD = 82103, OBJECT = XXXX100D, SUBOBJECT =

         SEGMENT OWNER = XXXXXX_ST, SEGMENT TYPE = Table Segment

Thu Jun 10 08:37:57 2021

Dumping diagnostic data in directory=[cdmp_20210610083757], requested by (instance=1, osid=12648460), summary=[incident=912662].

Thu Jun 10 08:37:58 2021

Sweep [inc][912662]: completed

Sweep [inc2][912774]: completed

Sweep [inc2][912662]: completed

Thu Jun 10 08:38:14 2021

Dumping diagnostic data in directory=[cdmp_20210610083814], requested by (instance=2, osid=9175400), summary=[incident=353303].

Thu Jun 10 08:38:29 2021

OS Pid: 12648460 executed alter system set events '10231 trace name context off'

 

从数据库后台日志可以看出,数据文件 '+DATA/orcl/datafile/tbs_sys.ora' 发生损坏,出现了部分损坏的 block 。损坏的块属于 XXXXXX_ST 用户下的 XXXX100 表。

2 .2 操作系统错误日志

 

检查 A IX 主机的操作系统日志,发现存在多块磁盘存在 DISK OPERATION ERROR 错误。

 

2 .3 R MN 校验数据库坏块

RMAN> run {

 allocate channel d1 type disk;

 allocate channel d2 type disk;

 allocate channel d3 type disk;

 allocate channel d4 type disk;

 backup validate check logical database;

 release channel d1;

 release channel d2;

 release channel d3;

 release channel d4;

 }

 

 

 

当使用 R MAN 校验到 6 2 号数据文件 t bs _ sys.ora 时报错,自动退出。

S QL> select   *   from   g v$database_block_corrutption

 

no rows selected

查询数据库坏块视图,没有返回数据库中坏块的信息。

2 .4 使用 exp 尝试导出表数据

 

使用 exp 导出 XXXX100 表,在查询到 6 2 号文件的 5 248 号块时报错,导出失败。

2 .5 RMAN 备份修复坏块

RMAN> blockrecover datafile 6 2   block 5248 ;  -- 单块修复

RMAN> blockrecover corruption list;  -- 对视图中所有坏块修复

 

sys@ORCL> select count(*) from XXXXXX_ST.XXXX100;

select count(*) from XXXXXX_ST.XXXX100

                                *

ERROR at line 1:

ORA-01115: IO error reading block from file  (block # )

ORA-01115: IO error reading block from file 62 (block # 5248)

ORA-15081: failed to submit an I/O operation to a disk

ORA-15081: failed to submit an I/O operation to a disk

再次查询 XXXX100 表在访问 5 248 号块时,报 I O error

 

2 .6 设置 1 0231 事件跳过坏块

SQL>alter system  SET EVENTS '10231 trace name context forever,level 10';

设置 1 0231 事件后,查询 XXXX100 表,仍然报数据块损坏,无法查询。

2 .7 使用 dbms 包标记坏块

sys@ORCL> begin

dbms_repair.skip_corrupt_blocks(

schema_name=>'XXXXXX_ST',

object_name=>'XXXX100',

object_type=>dbms_repair.table_object,

flags=>dbms_repair.skip_flag);

end;

/  2    3    4    5    6    7    8

 

PL/SQL procedure successfully completed.

 

sys@ORCL> select count(*) from XXXXXX_ST.XXXX100;

select count(*) from XXXXXX_ST.XXXX100

                                *

ERROR at line 1:

ORA-01115: IO error reading block from file  (block # )

ORA-01115: IO error reading block from file 62 (block # 5248)

ORA-15081: failed to submit an I/O operation to a disk

ORA-15081: failed to submit an I/O operation to a disk

 

sys@ORCL> exec dbms_repair.skip_corrupt_blocks(schema_name => 'XXXXXX_ST',object_name => 'XXXX100',flags => 1);

 

PL/SQL procedure successfully completed.

 

sys@ORCL>

sys@ORCL> select count(*) from XXXXXX_ST.XXXX100;

select count(*) from XXXXXX_ST.XXXX100

                                *

ERROR at line 1:

ORA-01115: IO error reading block from file  (block # )

ORA-01115: IO error reading block from file 62 (block # 5248)

ORA-15081: failed to submit an I/O operation to a disk

ORA-15081: failed to submit an I/O operation to a disk

从结果上来看,使用 dbms 包标记的方法也无法正常跳过坏块,查询出来数据。

2 .8 基于 rowid 抽取数据

2 .8.1 查看表属性

 

2 .8.2 基于块 rowid 生成插入语句

-- 基于 rowid 抽取坏块数据

select 'insert into test select * from XXXXXX_ST.XXXX100 t where rowid between ' || '''' ||

       dbms_rowid.rowid_create(1,

                               o.data_object_id,

                               e.RELATIVE_FNO,

                               e.BLOCK_ID,

                               0) || '''' || ' and ' || '''' ||

       DBMS_ROWID.ROWID_CREATE(1,

                               o.data_object_id,

                               e.RELATIVE_FNO,

                               e.BLOCK_ID + e.BLOCKS - 1,

                               10000) || '''' || ';' || CHR(10) ||

       'commit;'

  from dba_extents e, dba_objects o

 where e.segment_name = 'XXXX100'

   and e.owner = 'XXXXXX_ST'

   AND o.object_name = 'XXXX100'

   AND o.owner = 'XXXXXX_ST';

 

2 .8.3 创建空表

SQL> create table test as select * from XXXXXX_ST.XXXX100 where 1=2;

2 .8.4 插入数据

 

从图中可以看出就只有第一个 insert 语句因为数据块损坏无法正常执行,其他都执行成功。

2 .8.5 查询表数据

 

 

可以看出新建的表,能够正常访问和查询。

三、解决办法和建议

3.1  解决办法

从问题详细诊断过程的基于 rowid 抽取数据的方法可以看出,此方法能够跳过坏块,读取出表中可用的数据。因此,该问题的处理思路大概如下:

1 )将原表 rename 掉,例如: XXXX100 rename XXXX100 _BAK

2 )复制 XXXX100_ BAK 的表结构创建一个名为 XXXX100 的空表。

3 )使用 rowid 的抽取方式,把没有损坏的块的数据抽取到 XXXX100 表中。

4 )查看 XXXX100 _BAK 的表属性,重新创建 XXXX100 的索引及约束等。

3.2  分析建议

1 )因使用 R MAN 校验失败,无法从数据库视图中直接查询有多少表受到了影响,可以使用 select 批量生成查询脚本或使用 exp 导出该用户的方式,当访问或导出问题表时,会出现报错,通过此方法来确定有多少表受到了影响,进而统一进行修复。

2 )对重要的数据库做异地容灾,可以避免如机房断电、硬件故障等原因造成的业务中断或数据丢失。

 


请使用浏览器的分享功能分享到微信等