Oracle 12.2 Rac redo异常的恢复案例

原创 Roger 2019-12-05

4187

前同事问的一个问题；其一套12.2 rac由于存储问题导致无法启动；重建控制文件后报如下错误：

2019-12-05T10:12:55.112874+08:00
Sleep 5 seconds and then try to clear SRLs in 2 time(s)
Errors in file /u01/oracle/app/oracle/diag/rdbms/orapuabis/xxoo2/trace/xxoo2_ora_327541.trc  (incident=5376496):
ORA-00600: internal error code, arguments: [kcbzib_kcrsds_1], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/oracle/app/oracle/diag/rdbms/xxoo/xxoo2/incident/incdir_5376496/xxoo2_ora_327541_i5376496.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2019-12-05T10:12:56.191277+08:00
*****************************************************************
An internal routine has requested a dump of selected redo.
This usually happens following a specific internal error, when
analysis of the redo logs will help Oracle Support with the
diagnosis.
It is recommended that you retain all the redo logs generated (by
all the instances) during the past 12 hours, in case additional
redo dumps are required to help with the diagnosis.
*****************************************************************
2019-12-05T10:12:56.485111+08:00
Errors in file /u01/oracle/app/oracle/diag/rdbms/xxoo/xxoo2/trace/xxoo2_ora_327541.trc:
ORA-00704: bootstrap process failure
ORA-00600: internal error code, arguments: [kcbzib_kcrsds_1], [], [], [], [], [], [], [], [], [], [], []
2019-12-05T10:12:56.485226+08:00
Errors in file /u01/oracle/app/oracle/diag/rdbms/xxoo/xxoo2/trace/xxoo2_ora_327541.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00600: internal error code, arguments: [kcbzib_kcrsds_1], [], [], [], [], [], [], [], [], [], [], []
2019-12-05T10:12:56.485323+08:00
Errors in file /u01/oracle/app/oracle/diag/rdbms/xxoo/xxoo2/trace/xxoo2_ora_327541.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00600: internal error code, arguments: [kcbzib_kcrsds_1], [], [], [], [], [], [], [], [], [], [], []
Error 704 happened during db open, shutting down database
Errors in file /u01/oracle/app/oracle/diag/rdbms/xxoo/xxoo2/trace/xxoo2_ora_327541.trc  (incident=5376497):
ORA-00603: ORACLE server session terminated by fatal error
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00600: internal error code, arguments: [kcbzib_kcrsds_1], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/oracle/app/oracle/diag/rdbms/xxoo/xxoo2/incident/incdir_5376497/xxoo2_ora_327541_i5376497.trc
2019-12-05T10:12:56.602381+08:00
Dumping diagnostic data in directory=[cdmp_20191205101256], requested by (instance=2, osid=327541), summary=[incident=5376496].
2019-12-05T10:12:57.664168+08:00
opiodr aborting process unknown ospid (327541) as a result of ORA-603复制

这个600错误我还是第一次遇见。最近2年很少做恢复，因此在华为云服务器针对该错误进行了一次测试；通过模拟事务丢失（dd破坏current redo等）很容易出现该错误；如下是我的模拟过程：

SQL> startup
ORACLE instance started.

Total System Global Area 3221225472 bytes
Fixed Size                  8797928 bytes
Variable Size             939524376 bytes
Database Buffers         2264924160 bytes
Redo Buffers                7979008 bytes
Database mounted.
Database opened.
SQL> show pdbs;

    CON_ID CON_NAME                       OPEN MODE  RESTRICTED
---------- ------------------------------ ---------- ----------
         2 PDB$SEED                       READ ONLY  NO
         3 ENMOTEST                       MOUNTED
         4 PDBTEST1                       MOUNTED

SQL> alter pluggable database  ENMOTEST open;

Pluggable database altered.

SQL> alter pluggable database PDBTEST1 open;


alter 
Pluggable database altered.

SQL> SQL> SQL> session set container=enmotest;

Session altered.

SQL> create table roger as select * from sys.dba_objects where rownum < 10001;

Table created.

SQL> alter system flush buffer_cache;

System altered.

SQL> delete from roger where rownum < 1001;

1000 rows deleted.

SQL> select * from v$Log;

    GROUP#    THREAD#  SEQUENCE#      BYTES  BLOCKSIZE    MEMBERS ARCHIV STATUS                           FIRST_CHANGE# FIRST_TIME   NEXT_CHANGE# NEXT_TIME        CON_ID
---------- ---------- ---------- ---------- ---------- ---------- ------ -------------------------------- ------------- ------------ ------------ ------------ ----------
         1          1         37  209715200        512          1 NO     INACTIVE                               3232261 04-DEC-19         3335566 05-DEC-19       0
         2          1         38  209715200        512          1 NO     CURRENT                                3335566 05-DEC-19      1.8447E+19                 0
         3          1         36  209715200        512          1 NO     INACTIVE                               3121301 28-NOV-19         3232261 04-DEC-19       0

SQL> select member from v$Logfile;

MEMBER
---------------------------------------------------------------------
/sas_data/12c/app/oracle/oradata/test12c/redo03.log
/sas_data/12c/app/oracle/oradata/test12c/redo02.log
/sas_data/12c/app/oracle/oradata/test12c/redo01.log


SQL> conn /as sysdba
Connected.
SQL> shutdown  abort;
ORACLE instance shut down.
SQL> !
[oracle@enmodb3 ~]$ dd if=/dev/zero of=/sas_data/12c/app/oracle/oradata/test12c/redo02.log bs=1024k count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00221134 s, 474 MB/s
[oracle@enmodb3 ~]$ exit
exit

SQL> startup
ORACLE instance started.

Total System Global Area 3221225472 bytes
Fixed Size                  8797928 bytes
Variable Size             939524376 bytes
Database Buffers         2264924160 bytes
Redo Buffers                7979008 bytes
Database mounted.
ORA-00313: open failed for members of log group 2 of thread 1
ORA-00312: online log 2 thread 1: '/sas_data/12c/app/oracle/oradata/test12c/redo02.log'
ORA-27048: skgfifi: file header information is invalid
Additional information: 2


SQL> startup pfile='/tmp/12_2.ora' mount;
ORACLE instance started.

Total System Global Area 3221225472 bytes
Fixed Size                  8797928 bytes
Variable Size             939524376 bytes
Database Buffers         2264924160 bytes
Redo Buffers                7979008 bytes
Database mounted.
SQL> oradebug setmypid
Statement processed.
SQL> alter database backup controlfile to trace;

Database altered.

SQL> oradebug tracefile_name
/sas_data/12c/app/oracle/diag/rdbms/test12c/test12c/trace/test12c_ora_30127.trc


在重建控制文件后，分区进行强制open时分别遇到了如下错误：


SQL> alter database open resetlogs;
ERROR:
ORA-03114: not connected to ORACLE


alter database open resetlogs
*
ERROR at line 1:
ORA-00603: ORACLE server session terminated by fatal error
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00600: internal error code, arguments: [kcbzib_kcrsds_1], [], [], [], [], [], [], [], [], [], [], []
Process ID: 13970
Session ID: 1 Serial number: 15896

SQL> alter pluggable database ENMOTEST open force;
alter pluggable database ENMOTEST open force
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [kcffo_online_pdb_check: fno_system], [3], [], [], [], [], [], [], [], [],
[], []

复制

我们言归正传；回到本文的案例中来；针对第一个600错误的处理方式是，首先通过10046 trace定位到访问的数据块；然后通过bbed修改相关block，提交事务后，再使用event="21307096 trace name context forever, level 1' 来推进scn即可。

当然，有可能在你在恢复过程中还会遇到ora-00600 [4194]等错误，这是跟undo相关的问题，可以通过屏蔽回滚段或者重建undo来解决。

当处理完毕后，强烈建议使用hcheck脚本检查数据字典的一致性，如果检查没有任何问题，那么基本上数据库运行不会有太大问题；如下是本case的检测结果：

SQL> @/home/oracle/hcheck.sql
HCheck Version 07MAY18 on 05-DEC-2019 11:28:59                                  
----------------------------------------------                                  
Catalog Version 12.2.0.1.0 (1202000100)                                         
db_name: xxxx                                                               
Is CDB?: NO                                                                     

                                   Catalog       Fixed                         
Procedure Name                     Version    Vs Release    Timestamp           
Result                                                                          
------------------------------ ... ---------- -- ---------- --------------      
------                                                                          
.- LobNotInObj                 ... 1202000100 <=  *All Rel* 12/05 11:28:59 PASS 
.- MissingOIDOnObjCol          ... 1202000100 <=  *All Rel* 12/05 11:28:59 PASS 
.- SourceNotInObj              ... 1202000100 <=  *All Rel* 12/05 11:28:59 PASS 
.- OversizedFiles              ... 1202000100 <=  *All Rel* 12/05 11:28:59 PASS 
.- PoorDefaultStorage          ... 1202000100 <=  *All Rel* 12/05 11:28:59 PASS 
.- PoorStorage                 ... 1202000100 <=  *All Rel* 12/05 11:28:59 PASS 
.- TabPartCountMismatch        ... 1202000100 <=  *All Rel* 12/05 11:28:59 PASS 
.- OrphanedTabComPart          ... 1202000100 <=  *All Rel* 12/05 11:28:59 PASS 
.- MissingSum$                 ... 1202000100 <=  *All Rel* 12/05 11:28:59 PASS 
.- MissingDir$                 ... 1202000100 <=  *All Rel* 12/05 11:28:59 PASS 
.- DuplicateDataobj            ... 1202000100 <=  *All Rel* 12/05 11:28:59 PASS 
.- ObjSynMissing               ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- ObjSeqMissing               ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- OrphanedUndo                ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- OrphanedIndex               ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- OrphanedIndexPartition      ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- OrphanedIndexSubPartition   ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- OrphanedTable               ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- OrphanedTablePartition      ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- OrphanedTableSubPartition   ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- MissingPartCol              ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- OrphanedSeg$                ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- OrphanedIndPartObj#         ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- DuplicateBlockUse           ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- FetUet                      ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- Uet0Check                   ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- SeglessUET                  ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- BadInd$                     ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- BadTab$                     ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- BadIcolDepCnt               ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- ObjIndDobj                  ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- TrgAfterUpgrade             ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- ObjType0                    ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- BadOwner                    ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- StmtAuditOnCommit           ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- BadPublicObjects            ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- BadSegFreelist              ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- BadDepends                  ... 1202000100 <=  *All Rel* 12/05 11:29:00 PASS 
.- CheckDual                   ... 1202000100 <=  *All Rel* 12/05 11:29:01 PASS 
.- ObjectNames                 ... 1202000100 <=  *All Rel* 12/05 11:29:01 PASS 
.- BadCboHiLo                  ... 1202000100 <=  *All Rel* 12/05 11:29:01 PASS 
.- ChkIotTs                    ... 1202000100 <=  *All Rel* 12/05 11:29:01 PASS 
.- NoSegmentIndex              ... 1202000100 <=  *All Rel* 12/05 11:29:01 PASS 
.- BadNextObject               ... 1202000100 <=  *All Rel* 12/05 11:29:01 PASS 
.- DroppedROTS                 ... 1202000100 <=  *All Rel* 12/05 11:29:01 PASS 
.- FilBlkZero                  ... 1202000100 <=  *All Rel* 12/05 11:29:01 PASS 
.- DbmsSchemaCopy              ... 1202000100 <=  *All Rel* 12/05 11:29:01 PASS 
.- OrphanedIdnseqObj           ... 1202000100 >  1201000000 12/05 11:29:01 PASS 
.- OrphanedIdnseqSeq           ... 1202000100 >  1201000000 12/05 11:29:01 PASS 
.- OrphanedObjError            ... 1202000100 >  1102000000 12/05 11:29:01 PASS 
.- ObjNotLob                   ... 1202000100 <=  *All Rel* 12/05 11:29:01 PASS 
.- MaxControlfSeq              ... 1202000100 <=  *All Rel* 12/05 11:29:01 PASS 
.- SegNotInDeferredStg         ... 1202000100 >  1102000000 12/05 11:29:01 PASS 
.- SystemNotRfile1             ... 1202000100 >   902000000 12/05 11:29:01 PASS 
.- DictOwnNonDefaultSYSTEM     ... 1202000100 <=  *All Rel* 12/05 11:29:01 PASS 
.- OrphanTrigger               ... 1202000100 <=  *All Rel* 12/05 11:29:01 PASS 
.- ObjNotTrigger               ... 1202000100 <=  *All Rel* 12/05 11:29:01 PASS 
---------------------------------------                                         
05-DEC-2019 11:29:01  Elapsed: 2 secs                                           
---------------------------------------                                         
Found 0 potential problem(s) and 0 warning(s)                                   

PL/SQL procedure successfully completed.复制

第一次遇到这个问题；很久没搞Oracle了。友情支持；就记录一下吧。

话说这里我自己在华为云12.2 版本测试所遇到的ora-00600 kcffo_online_pdb_check: fno_system 问题；还没找到很好的解决方法；放到下一篇文章吧。

oracle

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者

文章被以下合辑收录

Roger's Database Notes（共77篇）

非知名DBA、数据库爱好者、专注于Oracle/MySQL/PostgreSQL/NewSQL/国产数据库等

Oracle 12.2 Rac redo异常的恢复案例

文章被以下合辑收录

评论

相关阅读