暂无图片
暂无图片
1
暂无图片
暂无图片
暂无图片

ADG之ORA-00600坏块问题处理和一些疑问

原创 凯多 2023-03-07
662

Table of Contents

1 现象

备库无法启动mrp0进程,alert报错如下。

Fri Mar 03 11:55:33 2023
ALTER DATABASE RECOVER managed standby database using current logfile disconnect from session
Attempt to start background Managed Standby Recovery process (racdbdg)
Fri Mar 03 11:55:33 2023
MRP0 started with pid=34, OS id=12815
MRP0: Background Managed Standby Recovery process started (racdbdg)
started logmerger process
Fri Mar 03 11:55:38 2023
Managed Standby Recovery starting Real Time Apply
Parallel Media Recovery started with 4 slaves
Waiting for all non-current ORLs to be archived...
All non-current ORLs have been archived.
Media Recovery Log /u01/archivelog/1_3475_1071146677.dbf
Media Recovery Log /u01/archivelog/2_3450_1071146677.dbf
Fri Mar 03 11:55:38 2023
Errors in file /u01/app/oracle/diag/rdbms/racdbdg/racdbdg/trace/racdbdg_pr02_12867.trc (incident=118384):
ORA-00600: internal error code, arguments: [3020], [1], [4929], [4199233], [], [], [], [], [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 1, block# 4929, file offset is 40378368 bytes)
ORA-10564: tablespace SYSTEM
ORA-01110: data file 1: '/u01/datafile/system01.dbf'
ORA-10561: block type 'TRANSACTION MANAGED DATA BLOCK', data object# 713
Incident details in: /u01/app/oracle/diag/rdbms/racdbdg/racdbdg/incident/incdir_118384/racdbdg_pr02_12867_i118384.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Slave exiting with ORA-600 exception
Errors in file /u01/app/oracle/diag/rdbms/racdbdg/racdbdg/trace/racdbdg_pr02_12867.trc:
ORA-00600: internal error code, arguments: [3020], [1], [4929], [4199233], [], [], [], [], [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 1, block# 4929, file offset is 40378368 bytes)
ORA-10564: tablespace SYSTEM
ORA-01110: data file 1: '/u01/datafile/system01.dbf'
ORA-10561: block type 'TRANSACTION MANAGED DATA BLOCK', data object# 713
Errors in file /u01/app/oracle/diag/rdbms/racdbdg/racdbdg/trace/racdbdg_mrp0_12815.trc (incident=118360):
ORA-00600: internal error code, arguments: [3020], [1], [4929], [4199233], [], [], [], [], [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 1, block# 4929, file offset is 40378368 bytes)
ORA-10564: tablespace SYSTEM
ORA-01110: data file 1: '/u01/datafile/system01.dbf'
ORA-10561: block type 'TRANSACTION MANAGED DATA BLOCK', data object# 713
Incident details in: /u01/app/oracle/diag/rdbms/racdbdg/racdbdg/incident/incdir_118360/racdbdg_mrp0_12815_i118360.trc
Completed: ALTER DATABASE RECOVER managed standby database using current logfile disconnect from session
Fri Mar 03 11:55:39 2023
Dumping diagnostic data in directory=[cdmp_20230303115539], requested by (instance=1, osid=12867 (PR02)), summary=[incident=118384].
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Recovery Slave PR02 previously exited with exception 600
Errors with log /u01/archivelog/2_3450_1071146677.dbf
MRP0: Background Media Recovery terminated with error 448
Errors in file /u01/app/oracle/diag/rdbms/racdbdg/racdbdg/trace/racdbdg_pr00_12863.trc:
ORA-00448: normal completion of background process
Managed Standby Recovery not using Real Time Apply
Recovery interrupted!
Recovered data files to a consistent state at change 107986693
MRP0: Background Media Recovery process shutdown (racdbdg)
Dumping diagnostic data in directory=[cdmp_20230303115540], requested by (instance=1, osid=12815 (MRP0)), summary=[incident=118360].

复制

2 分析

这套环境一直是比较稳定的,架构为11.2.0.4 rac to single。一开始想是不是临时性的错误,于是尝试重启数据库到mount再开启日志同步还是报一样的错。根据日志可知和gap无关,直观来看是system表空间出现了坏块。

ORA-00600: internal error code, arguments: [3020], [1], [4929], [4199233], [], [], [], [], [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 1, block# 4929, file offset is 40378368 bytes)

复制

3 处理

(1)主库单独备份system表空间。 backup datafile 1,2,3 format ‘/dbbak/dbf_%Y_%M_%D_%U.bak’;

RMAN> backup datafile 1 format '/home/oracle/system_%Y_%M_%D_%U.bak';

Starting backup at 2023-03-06 15:40:23
using channel ORA_DISK_1
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00001 name=+DATA/racdb/datafile/system.261.1070276763
channel ORA_DISK_1: starting piece 1 at 2023-03-06 15:40:23
channel ORA_DISK_1: finished piece 1 at 2023-03-06 15:40:30
piece handle=/home/oracle/system_2023_03_06_kl1mcen7_1_1.bak tag=TAG20230306T154023 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:07
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
including current control file in backup set
including current SPFILE in backup set
channel ORA_DISK_1: starting piece 1 at 2023-03-06 15:40:31
channel ORA_DISK_1: finished piece 1 at 2023-03-06 15:40:32
piece handle=/home/oracle/system_2023_03_06_km1mcene_1_1.bak tag=TAG20230306T154023 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01
Finished backup at 2023-03-06 15:40:32

RMAN>
[oracle@ltrac1 ~]$ ll
-rw-r----- 1 oracle asmadmin 762363904 Mar 6 15:40 system_2023_03_06_kl1mcen7_1_1.bak
-rw-r----- 1 oracle asmadmin 20611072 Mar 6 15:40 system_2023_03_06_km1mcene_1_1.bak

复制

(2)将备份scp到备库。

[oracle@ltrac1 ~]$ scp system_2023_03_06_k* oracle@168.5.80.180:/home/oracle/rman0306
oracle@168.5.80.180's password:
system_2023_03_06_kl1mcen7_1_1.bak 100% 727MB 103.9MB/s 00:07
system_2023_03_06_km1mcene_1_1.bak 100% 20MB 19.7MB/s 00:00

复制

(3)备库恢复。

[oracle@ltracdg rman0306]$ ll
total 764624
-rw-r-----. 1 oracle oinstall 762363904 Mar 6 15:37 system_2023_03_06_kl1mcen7_1_1.bak
-rw-r-----. 1 oracle oinstall 20611072 Mar 6 15:37 system_2023_03_06_km1mcene_1_1.bak
[oracle@ltracdg rman0306]$ pwd
/home/oracle/rman0306


shutdown immediate
startup mount
rman target /

RMAN> catalog start with '/home/oracle/rman0306';

using target database control file instead of recovery catalog
searching for all files that match the pattern /home/oracle/rman0306
List of Files Unknown to the Database

File Name: /home/oracle/rman0306/system_2023_03_06_kl1mcen7_1_1.bak
File Name: /home/oracle/rman0306/system_2023_03_06_km1mcene_1_1.bak

Do you really want to catalog the above files (enter YES or NO)? yes
cataloging files...
cataloging done

List of Cataloged Files

File Name: /home/oracle/rman0306/system_2023_03_06_kl1mcen7_1_1.bak
File Name: /home/oracle/rman0306/system_2023_03_06_km1mcene_1_1.bak

RMAN> restore datafile 1;

Starting restore at 06-MAR-23
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=434 device type=DISK

channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00001 to /u01/datafile/system01.dbf
channel ORA_DISK_1: reading from backup piece /home/oracle/rman0306/system_2023_03_06_kl1mcen7_1_1.bak
channel ORA_DISK_1: piece handle=/home/oracle/rman0306/system_2023_03_06_kl1mcen7_1_1.bak tag=TAG20230306T154023
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:07
Finished restore at 06-MAR-23

复制

(4)此时应用日志,发现又报错undo也有问题。

Hex dump of (file 3, block 192) in trace file /u01/app/oracle/diag/rdbms/racdbdg/racdbdg/trace/racdbdg_pr01_15651.trc
Reading datafile '/u01/datafile/undotbs101.dbf' for corruption at rdba: 0x00c000c0 (file 3, block 192)
Reread (file 3, block 192) found same corrupt data (logically corrupt)
Errors in file /u01/app/oracle/diag/rdbms/racdbdg/racdbdg/trace/racdbdg_pr01_15651.trc (incident=121287):
ORA-00600: internal error code, arguments: [3020], [3], [192], [12583104], [], [], [], [], [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 3, block# 192, file offset is 1572864 bytes)
ORA-10564: tablespace UNDOTBS1
ORA-01110: data file 3: '/u01/datafile/undotbs101.dbf'
ORA-10560: block type 'KTU SMU HEADER BLOCK'
Incident details in: /u01/app/oracle/diag/rdbms/racdbdg/racdbdg/incident/incdir_121287/racdbdg_pr01_15651_i121287.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Slave exiting with ORA-600 exception
Errors in file /u01/app/oracle/diag/rdbms/racdbdg/racdbdg/trace/racdbdg_pr01_15651.trc:
ORA-00600: internal error code, arguments: [3020], [3], [192], [12583104], [], [], [], [], [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 3, block# 192, file offset is 1572864 bytes)
ORA-10564: tablespace UNDOTBS1
ORA-01110: data file 3: '/u01/datafile/undotbs101.dbf'
ORA-10560: block type 'KTU SMU HEADER BLOCK'
Errors in file /u01/app/oracle/diag/rdbms/racdbdg/racdbdg/trace/racdbdg_mrp0_15607.trc (incident=121272):
ORA-00600: internal error code, arguments: [3020], [3], [192], [12583104], [], [], [], [], [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 3, block# 192, file offset is 1572864 bytes)
ORA-10564: tablespace UNDOTBS1
ORA-01110: data file 3: '/u01/datafile/undotbs101.dbf'
ORA-10560: block type 'KTU SMU HEADER BLOCK'
Incident details in: /u01/app/oracle/diag/rdbms/racdbdg/racdbdg/incident/incdir_121272/racdbdg_mrp0_15607_i121272.trc

复制

(5)按照刚才的方法再修复一下undo。

(6)再次在mount状态下打开实时应用日志,观察备库日志发现已经可以正常应用日志了。

Mon Mar 06 16:25:06 2023
ALTER DATABASE RECOVER managed standby database using current logfile disconnect from session
Attempt to start background Managed Standby Recovery process (racdbdg)
Mon Mar 06 16:25:06 2023
MRP0 started with pid=37, OS id=18647
MRP0: Background Managed Standby Recovery process started (racdbdg)
started logmerger process
Mon Mar 06 16:25:11 2023
Managed Standby Recovery starting Real Time Apply
Parallel Media Recovery started with 4 slaves
Waiting for all non-current ORLs to be archived...
All non-current ORLs have been archived.
Mon Mar 06 16:26:20 2023
Media Recovery Log /u01/archivelog/2_3491_1071146677.dbf
Media Recovery Log /u01/archivelog/1_3512_1071146677.dbf
Media Recovery Log /u01/archivelog/1_3513_1071146677.dbf
Media Recovery Log /u01/archivelog/2_3492_1071146677.dbf
Media Recovery Log /u01/archivelog/1_3514_1071146677.dbf
Media Recovery Log /u01/archivelog/2_3493_1071146677.dbf
Media Recovery Log /u01/archivelog/1_3515_1071146677.dbf
Media Recovery Log /u01/archivelog/2_3494_1071146677.dbf
Media Recovery Log /u01/archivelog/1_3516_1071146677.dbf
Media Recovery Log /u01/archivelog/2_3495_1071146677.dbf
Media Recovery Log /u01/archivelog/1_3517_1071146677.dbf
Media Recovery Log /u01/archivelog/2_3496_1071146677.dbf
Media Recovery Log /u01/archivelog/2_3497_1071146677.dbf
Media Recovery Log /u01/archivelog/1_3518_1071146677.dbf
Media Recovery Log /u01/archivelog/2_3498_1071146677.dbf
Media Recovery Log /u01/archivelog/2_3499_1071146677.dbf
Media Recovery Log /u01/archivelog/1_3519_1071146677.dbf
Media Recovery Log /u01/archivelog/1_3520_1071146677.dbf
Media Recovery Log /u01/archivelog/2_3500_1071146677.dbf
Mon Mar 06 16:26:31 2023
Media Recovery Log /u01/archivelog/1_3521_1071146677.dbf
Media Recovery Log /u01/archivelog/1_3522_1071146677.dbf
Media Recovery Log /u01/archivelog/2_3501_1071146677.dbf
Media Recovery Log /u01/archivelog/1_3523_1071146677.dbf
Media Recovery Log /u01/archivelog/2_3502_1071146677.dbf
Media Recovery Log /u01/archivelog/2_3503_1071146677.dbf
Media Recovery Log /u01/archivelog/1_3524_1071146677.dbf
Media Recovery Log /u01/archivelog/1_3525_1071146677.dbf
Media Recovery Log /u01/archivelog/2_3504_1071146677.dbf
Media Recovery Log /u01/archivelog/1_3526_1071146677.dbf
Media Recovery Log /u01/archivelog/2_3505_1071146677.dbf
Media Recovery Log /u01/archivelog/2_3506_1071146677.dbf
Media Recovery Log /u01/archivelog/1_3527_1071146677.dbf
Media Recovery Waiting for thread 1 sequence 3528 (in transit)
Recovery of Online Redo Log: Thread 1 Group 10 Seq 3528 Reading mem 0
Mem# 0: /u01/onlinelog/group_10.272.1085395667
Media Recovery Waiting for thread 2 sequence 3507 (in transit)
Recovery of Online Redo Log: Thread 2 Group 15 Seq 3507 Reading mem 0
Mem# 0: /u01/onlinelog/group_15.280.1085395671

复制

(7)备库关闭实时应用,并启动到open,再次打开实时应用正常,broker也正常,故障恢复。

4 一些疑问

(1)一向稳定的ADG为何会发生ORA-00600故障?

–没有太好的思路,通常是bug或写丢失的情况导致。

(2)正常rman是需要先恢复控制文件,用备份restore,再用备份的归档文件recover,为什么此次故障只有restore呢?

–因为备库的控制文件也是从主库备份过来的,而启动mrp进程就相当于利用归档来进行recover了,或者通过 RECOVER DATABASE NOREDO 也行。

这里确实基础不太牢固,我记得数据库正常运行必须保证数据库、数据文件和数据文件头的checkpoint_change#是一致的,一开始猜想是因为这个原因备库才可以直接用rman恢复主库的数据文件,检查发现确实主备库的是一致的。

但是当我手动切换检查点后,主库更新了,但是备库的却不一致了,有大佬说ADG是通过redo传输,而检查点的操作不记录redo,所以备库即使和主库不一致也是正常的,但是备库的不统一真的没问题吗?希望有懂的大佬帮忙解答一下。

主库:


SYS@racdb1>select distinct checkpoint_change# from v$database;

CHECKPOINT_CHANGE#
------------------
109764566

Elapsed: 00:00:00.00
SYS@racdb1>select distinct checkpoint_change# from v$datafile;

CHECKPOINT_CHANGE#
------------------
109764566

Elapsed: 00:00:00.00
SYS@racdb1>select distinct checkpoint_change# from v$datafile_header;

CHECKPOINT_CHANGE#
------------------
109764566

Elapsed: 00:00:00.01

复制

备库:


SQL> select distinct checkpoint_change# from v$database;
select distinct checkpoint_change# from v$datafile;
select distinct checkpoint_change# from v$datafile_header;
CHECKPOINT_CHANGE#
------------------
109643118

SQL>
CHECKPOINT_CHANGE#
------------------
109764566

SQL>

CHECKPOINT_CHANGE#
------------------
109764566

SQL>

复制
最后修改时间:2023-03-07 10:55:11
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论