Table of Contents
1 现象
备库无法启动mrp0进程,alert报错如下。
Fri Mar 03 11:55:33 2023 ALTER DATABASE RECOVER managed standby database using current logfile disconnect from session Attempt to start background Managed Standby Recovery process (racdbdg) Fri Mar 03 11:55:33 2023 MRP0 started with pid=34, OS id=12815 MRP0: Background Managed Standby Recovery process started (racdbdg) started logmerger process Fri Mar 03 11:55:38 2023 Managed Standby Recovery starting Real Time Apply Parallel Media Recovery started with 4 slaves Waiting for all non-current ORLs to be archived... All non-current ORLs have been archived. Media Recovery Log /u01/archivelog/1_3475_1071146677.dbf Media Recovery Log /u01/archivelog/2_3450_1071146677.dbf Fri Mar 03 11:55:38 2023 Errors in file /u01/app/oracle/diag/rdbms/racdbdg/racdbdg/trace/racdbdg_pr02_12867.trc (incident=118384): ORA-00600: internal error code, arguments: [3020], [1], [4929], [4199233], [], [], [], [], [], [], [], [] ORA-10567: Redo is inconsistent with data block (file# 1, block# 4929, file offset is 40378368 bytes) ORA-10564: tablespace SYSTEM ORA-01110: data file 1: '/u01/datafile/system01.dbf' ORA-10561: block type 'TRANSACTION MANAGED DATA BLOCK', data object# 713 Incident details in: /u01/app/oracle/diag/rdbms/racdbdg/racdbdg/incident/incdir_118384/racdbdg_pr02_12867_i118384.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Slave exiting with ORA-600 exception Errors in file /u01/app/oracle/diag/rdbms/racdbdg/racdbdg/trace/racdbdg_pr02_12867.trc: ORA-00600: internal error code, arguments: [3020], [1], [4929], [4199233], [], [], [], [], [], [], [], [] ORA-10567: Redo is inconsistent with data block (file# 1, block# 4929, file offset is 40378368 bytes) ORA-10564: tablespace SYSTEM ORA-01110: data file 1: '/u01/datafile/system01.dbf' ORA-10561: block type 'TRANSACTION MANAGED DATA BLOCK', data object# 713 Errors in file /u01/app/oracle/diag/rdbms/racdbdg/racdbdg/trace/racdbdg_mrp0_12815.trc (incident=118360): ORA-00600: internal error code, arguments: [3020], [1], [4929], [4199233], [], [], [], [], [], [], [], [] ORA-10567: Redo is inconsistent with data block (file# 1, block# 4929, file offset is 40378368 bytes) ORA-10564: tablespace SYSTEM ORA-01110: data file 1: '/u01/datafile/system01.dbf' ORA-10561: block type 'TRANSACTION MANAGED DATA BLOCK', data object# 713 Incident details in: /u01/app/oracle/diag/rdbms/racdbdg/racdbdg/incident/incdir_118360/racdbdg_mrp0_12815_i118360.trc Completed: ALTER DATABASE RECOVER managed standby database using current logfile disconnect from session Fri Mar 03 11:55:39 2023 Dumping diagnostic data in directory=[cdmp_20230303115539], requested by (instance=1, osid=12867 (PR02)), summary=[incident=118384]. Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Recovery Slave PR02 previously exited with exception 600 Errors with log /u01/archivelog/2_3450_1071146677.dbf MRP0: Background Media Recovery terminated with error 448 Errors in file /u01/app/oracle/diag/rdbms/racdbdg/racdbdg/trace/racdbdg_pr00_12863.trc: ORA-00448: normal completion of background process Managed Standby Recovery not using Real Time Apply Recovery interrupted! Recovered data files to a consistent state at change 107986693 MRP0: Background Media Recovery process shutdown (racdbdg) Dumping diagnostic data in directory=[cdmp_20230303115540], requested by (instance=1, osid=12815 (MRP0)), summary=[incident=118360].
复制
2 分析
这套环境一直是比较稳定的,架构为11.2.0.4 rac to single。一开始想是不是临时性的错误,于是尝试重启数据库到mount再开启日志同步还是报一样的错。根据日志可知和gap无关,直观来看是system表空间出现了坏块。
ORA-00600: internal error code, arguments: [3020], [1], [4929], [4199233], [], [], [], [], [], [], [], [] ORA-10567: Redo is inconsistent with data block (file# 1, block# 4929, file offset is 40378368 bytes)
复制
3 处理
(1)主库单独备份system表空间。 backup datafile 1,2,3 format ‘/dbbak/dbf_%Y_%M_%D_%U.bak’;
RMAN> backup datafile 1 format '/home/oracle/system_%Y_%M_%D_%U.bak'; Starting backup at 2023-03-06 15:40:23 using channel ORA_DISK_1 channel ORA_DISK_1: starting full datafile backup set channel ORA_DISK_1: specifying datafile(s) in backup set input datafile file number=00001 name=+DATA/racdb/datafile/system.261.1070276763 channel ORA_DISK_1: starting piece 1 at 2023-03-06 15:40:23 channel ORA_DISK_1: finished piece 1 at 2023-03-06 15:40:30 piece handle=/home/oracle/system_2023_03_06_kl1mcen7_1_1.bak tag=TAG20230306T154023 comment=NONE channel ORA_DISK_1: backup set complete, elapsed time: 00:00:07 channel ORA_DISK_1: starting full datafile backup set channel ORA_DISK_1: specifying datafile(s) in backup set including current control file in backup set including current SPFILE in backup set channel ORA_DISK_1: starting piece 1 at 2023-03-06 15:40:31 channel ORA_DISK_1: finished piece 1 at 2023-03-06 15:40:32 piece handle=/home/oracle/system_2023_03_06_km1mcene_1_1.bak tag=TAG20230306T154023 comment=NONE channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01 Finished backup at 2023-03-06 15:40:32 RMAN> [oracle@ltrac1 ~]$ ll -rw-r----- 1 oracle asmadmin 762363904 Mar 6 15:40 system_2023_03_06_kl1mcen7_1_1.bak -rw-r----- 1 oracle asmadmin 20611072 Mar 6 15:40 system_2023_03_06_km1mcene_1_1.bak
复制
(2)将备份scp到备库。
[oracle@ltrac1 ~]$ scp system_2023_03_06_k* oracle@168.5.80.180:/home/oracle/rman0306 oracle@168.5.80.180's password: system_2023_03_06_kl1mcen7_1_1.bak 100% 727MB 103.9MB/s 00:07 system_2023_03_06_km1mcene_1_1.bak 100% 20MB 19.7MB/s 00:00
复制
(3)备库恢复。
[oracle@ltracdg rman0306]$ ll total 764624 -rw-r-----. 1 oracle oinstall 762363904 Mar 6 15:37 system_2023_03_06_kl1mcen7_1_1.bak -rw-r-----. 1 oracle oinstall 20611072 Mar 6 15:37 system_2023_03_06_km1mcene_1_1.bak [oracle@ltracdg rman0306]$ pwd /home/oracle/rman0306 shutdown immediate startup mount rman target / RMAN> catalog start with '/home/oracle/rman0306'; using target database control file instead of recovery catalog searching for all files that match the pattern /home/oracle/rman0306 List of Files Unknown to the Database File Name: /home/oracle/rman0306/system_2023_03_06_kl1mcen7_1_1.bak File Name: /home/oracle/rman0306/system_2023_03_06_km1mcene_1_1.bak Do you really want to catalog the above files (enter YES or NO)? yes cataloging files... cataloging done List of Cataloged Files File Name: /home/oracle/rman0306/system_2023_03_06_kl1mcen7_1_1.bak File Name: /home/oracle/rman0306/system_2023_03_06_km1mcene_1_1.bak RMAN> restore datafile 1; Starting restore at 06-MAR-23 allocated channel: ORA_DISK_1 channel ORA_DISK_1: SID=434 device type=DISK channel ORA_DISK_1: starting datafile backup set restore channel ORA_DISK_1: specifying datafile(s) to restore from backup set channel ORA_DISK_1: restoring datafile 00001 to /u01/datafile/system01.dbf channel ORA_DISK_1: reading from backup piece /home/oracle/rman0306/system_2023_03_06_kl1mcen7_1_1.bak channel ORA_DISK_1: piece handle=/home/oracle/rman0306/system_2023_03_06_kl1mcen7_1_1.bak tag=TAG20230306T154023 channel ORA_DISK_1: restored backup piece 1 channel ORA_DISK_1: restore complete, elapsed time: 00:00:07 Finished restore at 06-MAR-23
复制
(4)此时应用日志,发现又报错undo也有问题。
Hex dump of (file 3, block 192) in trace file /u01/app/oracle/diag/rdbms/racdbdg/racdbdg/trace/racdbdg_pr01_15651.trc Reading datafile '/u01/datafile/undotbs101.dbf' for corruption at rdba: 0x00c000c0 (file 3, block 192) Reread (file 3, block 192) found same corrupt data (logically corrupt) Errors in file /u01/app/oracle/diag/rdbms/racdbdg/racdbdg/trace/racdbdg_pr01_15651.trc (incident=121287): ORA-00600: internal error code, arguments: [3020], [3], [192], [12583104], [], [], [], [], [], [], [], [] ORA-10567: Redo is inconsistent with data block (file# 3, block# 192, file offset is 1572864 bytes) ORA-10564: tablespace UNDOTBS1 ORA-01110: data file 3: '/u01/datafile/undotbs101.dbf' ORA-10560: block type 'KTU SMU HEADER BLOCK' Incident details in: /u01/app/oracle/diag/rdbms/racdbdg/racdbdg/incident/incdir_121287/racdbdg_pr01_15651_i121287.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Slave exiting with ORA-600 exception Errors in file /u01/app/oracle/diag/rdbms/racdbdg/racdbdg/trace/racdbdg_pr01_15651.trc: ORA-00600: internal error code, arguments: [3020], [3], [192], [12583104], [], [], [], [], [], [], [], [] ORA-10567: Redo is inconsistent with data block (file# 3, block# 192, file offset is 1572864 bytes) ORA-10564: tablespace UNDOTBS1 ORA-01110: data file 3: '/u01/datafile/undotbs101.dbf' ORA-10560: block type 'KTU SMU HEADER BLOCK' Errors in file /u01/app/oracle/diag/rdbms/racdbdg/racdbdg/trace/racdbdg_mrp0_15607.trc (incident=121272): ORA-00600: internal error code, arguments: [3020], [3], [192], [12583104], [], [], [], [], [], [], [], [] ORA-10567: Redo is inconsistent with data block (file# 3, block# 192, file offset is 1572864 bytes) ORA-10564: tablespace UNDOTBS1 ORA-01110: data file 3: '/u01/datafile/undotbs101.dbf' ORA-10560: block type 'KTU SMU HEADER BLOCK' Incident details in: /u01/app/oracle/diag/rdbms/racdbdg/racdbdg/incident/incdir_121272/racdbdg_mrp0_15607_i121272.trc
复制
(5)按照刚才的方法再修复一下undo。
(6)再次在mount状态下打开实时应用日志,观察备库日志发现已经可以正常应用日志了。
Mon Mar 06 16:25:06 2023 ALTER DATABASE RECOVER managed standby database using current logfile disconnect from session Attempt to start background Managed Standby Recovery process (racdbdg) Mon Mar 06 16:25:06 2023 MRP0 started with pid=37, OS id=18647 MRP0: Background Managed Standby Recovery process started (racdbdg) started logmerger process Mon Mar 06 16:25:11 2023 Managed Standby Recovery starting Real Time Apply Parallel Media Recovery started with 4 slaves Waiting for all non-current ORLs to be archived... All non-current ORLs have been archived. Mon Mar 06 16:26:20 2023 Media Recovery Log /u01/archivelog/2_3491_1071146677.dbf Media Recovery Log /u01/archivelog/1_3512_1071146677.dbf Media Recovery Log /u01/archivelog/1_3513_1071146677.dbf Media Recovery Log /u01/archivelog/2_3492_1071146677.dbf Media Recovery Log /u01/archivelog/1_3514_1071146677.dbf Media Recovery Log /u01/archivelog/2_3493_1071146677.dbf Media Recovery Log /u01/archivelog/1_3515_1071146677.dbf Media Recovery Log /u01/archivelog/2_3494_1071146677.dbf Media Recovery Log /u01/archivelog/1_3516_1071146677.dbf Media Recovery Log /u01/archivelog/2_3495_1071146677.dbf Media Recovery Log /u01/archivelog/1_3517_1071146677.dbf Media Recovery Log /u01/archivelog/2_3496_1071146677.dbf Media Recovery Log /u01/archivelog/2_3497_1071146677.dbf Media Recovery Log /u01/archivelog/1_3518_1071146677.dbf Media Recovery Log /u01/archivelog/2_3498_1071146677.dbf Media Recovery Log /u01/archivelog/2_3499_1071146677.dbf Media Recovery Log /u01/archivelog/1_3519_1071146677.dbf Media Recovery Log /u01/archivelog/1_3520_1071146677.dbf Media Recovery Log /u01/archivelog/2_3500_1071146677.dbf Mon Mar 06 16:26:31 2023 Media Recovery Log /u01/archivelog/1_3521_1071146677.dbf Media Recovery Log /u01/archivelog/1_3522_1071146677.dbf Media Recovery Log /u01/archivelog/2_3501_1071146677.dbf Media Recovery Log /u01/archivelog/1_3523_1071146677.dbf Media Recovery Log /u01/archivelog/2_3502_1071146677.dbf Media Recovery Log /u01/archivelog/2_3503_1071146677.dbf Media Recovery Log /u01/archivelog/1_3524_1071146677.dbf Media Recovery Log /u01/archivelog/1_3525_1071146677.dbf Media Recovery Log /u01/archivelog/2_3504_1071146677.dbf Media Recovery Log /u01/archivelog/1_3526_1071146677.dbf Media Recovery Log /u01/archivelog/2_3505_1071146677.dbf Media Recovery Log /u01/archivelog/2_3506_1071146677.dbf Media Recovery Log /u01/archivelog/1_3527_1071146677.dbf Media Recovery Waiting for thread 1 sequence 3528 (in transit) Recovery of Online Redo Log: Thread 1 Group 10 Seq 3528 Reading mem 0 Mem# 0: /u01/onlinelog/group_10.272.1085395667 Media Recovery Waiting for thread 2 sequence 3507 (in transit) Recovery of Online Redo Log: Thread 2 Group 15 Seq 3507 Reading mem 0 Mem# 0: /u01/onlinelog/group_15.280.1085395671
复制
(7)备库关闭实时应用,并启动到open,再次打开实时应用正常,broker也正常,故障恢复。
4 一些疑问
(1)一向稳定的ADG为何会发生ORA-00600故障?
–没有太好的思路,通常是bug或写丢失的情况导致。
(2)正常rman是需要先恢复控制文件,用备份restore,再用备份的归档文件recover,为什么此次故障只有restore呢?
–因为备库的控制文件也是从主库备份过来的,而启动mrp进程就相当于利用归档来进行recover了,或者通过 RECOVER DATABASE NOREDO 也行。
这里确实基础不太牢固,我记得数据库正常运行必须保证数据库、数据文件和数据文件头的checkpoint_change#是一致的,一开始猜想是因为这个原因备库才可以直接用rman恢复主库的数据文件,检查发现确实主备库的是一致的。
但是当我手动切换检查点后,主库更新了,但是备库的却不一致了,有大佬说ADG是通过redo传输,而检查点的操作不记录redo,所以备库即使和主库不一致也是正常的,但是备库的不统一真的没问题吗?希望有懂的大佬帮忙解答一下。
主库:
SYS@racdb1>select distinct checkpoint_change# from v$database; CHECKPOINT_CHANGE# ------------------ 109764566 Elapsed: 00:00:00.00 SYS@racdb1>select distinct checkpoint_change# from v$datafile; CHECKPOINT_CHANGE# ------------------ 109764566 Elapsed: 00:00:00.00 SYS@racdb1>select distinct checkpoint_change# from v$datafile_header; CHECKPOINT_CHANGE# ------------------ 109764566 Elapsed: 00:00:00.01
复制
备库:
SQL> select distinct checkpoint_change# from v$database; select distinct checkpoint_change# from v$datafile; select distinct checkpoint_change# from v$datafile_header; CHECKPOINT_CHANGE# ------------------ 109643118 SQL> CHECKPOINT_CHANGE# ------------------ 109764566 SQL> CHECKPOINT_CHANGE# ------------------ 109764566 SQL>
复制