暂无图片
暂无图片
6
暂无图片
暂无图片
暂无图片

记一次ORA600内部错误故障分析与修复实录

作者:Digital Observer(施嘉伟)
Oracle ACE Pro: Database
PostgreSQL ACE Partner
11年数据库行业经验,现主要从事数据库服务工作
拥有Oracle OCM、DB2 10.1 Fundamentals、MySQL 8.0 OCP、WebLogic 12c OCA、KCP、PCTP、PCSD、PGCM、OCI、PolarDB技术专家、达梦师资认证、数据安全咨询高级等认证
ITPUB认证专家、PolarDB开源社区技术顾问、HaloDB技术顾问、TiDB社区技术布道师、青学会MOP技术社区专家顾问、国内某高校企业实践指导教师
公众号/墨天轮:Digital Observer;CSDN/PGfans:施嘉伟;ITPUB:sjw1933

1. 故障背景

1.1 事件概述

202X年X月X日,某用户的Oracle数据库在上午9点至10点期间发生故障,导致部分业务无法正常运行。经初步分析,发现数据库报出 ORA-00600 内部错误。该错误是Oracle的一个通用错误,通常由数据库内部逻辑异常、数据块损坏或存储问题引起。

1.2 影响范围

  • 业务系统部分查询和写入失败
  • 索引操作异常
  • 数据库响应速度变慢

2. 故障排查与分析

2.1 错误日志分析

通过 alert.log 日志文件,我们发现了如下错误信息:

ORA-00600: , : [13013], [5001], [126801], [109382409], [101], [109382437], [17], [], [], [], [], []
Incident details in: /opt/app/oracle/diag/rdbms/jcd/jcd/incident/incdir_99948/jcd_ora_23550_i99948.trc
Non critical error ORA-48113 caught while writing to trace file "/opt/app/oracle/diag/rdbms/jcd/jcd/incident/incdir_99948/jcd_ora_23550_i99948.trc"
ORA-00600: , : [13013], [5001], [126801], [109382408], [102], [109382433], [17], [], [], [], [], []
Incident details in: /opt/app/oracle/diag/rdbms/jcd/jcd/incident/incdir_99980/jcd_ora_445_i99980.trc
ORA-00600: , : [13013], [5001], [126801], [109382409], [123], [109382437], [17], [], [], [], [], []
Incident details in: /opt/app/oracle/diag/rdbms/jcd/jcd/incident/incdir_101652/jcd_ora_8586_i101652.trc
ORA-00600: , : [13013], [5001], [126801], [109382409], [168], [109382437], [17], [], [], [], [], []
ORA-00600: , : [13013], [5001], [126801], [109382406], [99], [109382433], [17], [], [], [], [], []
ORA-00600: , : [13013], [5001], [126801], [109382405], [152], [109382433], [17], [], [], [], [], []

ORA-00600 是Oracle的内部错误,2662 可能与数据块的SCN不一致有关。

ORA-600 [13013]
Format: ORA-600 [13013] [a] [b] [c] [d] [e] [f]
This format relates to Oracle Server 8.0.3 and above
Arg [a] Passcount
Arg [b] Data Object number
Arg [c] Tablespace Decimal Relative DBA (RDBA) of block containing the row to be updated
Arg [d] Row Slot number
Arg [e] Decimal RDBA of block being updated (Typically same as [c])
Arg [f] Code

ORA-00600: internal error code, arguments: [13013], [5001], [268], [8457615], [5], [8457615], [17], [], [], [], [], []

2.2 确认损坏数据块

使用 dbms_utility.data_block_address_filedbms_utility.data_block_address_block 确定损坏的数据块位置:

SELECT dbms_utility.data_block_address_file(109382409) rfile, dbms_utility.data_block_address_block(109382409) rblocks FROM dual; RFILE RBLOCKS ---------- ---------- 26 330505 select dbms_utility.data_block_address_file(109382408) rfile, dbms_utility.data_block_address_block(109382408) blocs from dual; RFILE RBLOCKS ---------- ---------- 26 330504 select dbms_utility.data_block_address_file(109382406) rfile, dbms_utility.data_block_address_block(109382406) rblocks from dual; RFILE RBLOCKS ---------- ---------- 26 330502 select dbms_utility.data_block_address_file(109382405) rfile, dbms_utility.data_block_address_block(109382405) rblocks from dual; RFILE RBLOCKS ---------- ---------- 26 330501

查询结果显示损坏的文件编号和块号。

确认对象:

SQL> set linesize 999 pagesize 9999 SQL> select owner,segment_name,segment_type,tablespace_name from dba_extents where file_id =26 and 330505 between block_id and block_id + blocks -1; OWNER SEGMENT_NAME SEGMENT_TYPE TABLESPACE_NAME ---------- ---------- ---------- -------------------- XHSD_JCD JCD_PUB_YHBMB TABLE TS_PUB_DATA

同时dbv校验26号文件:
无物理坏块!

尝试执行分析:
analyze table XHSD_JCD.JCD_PUB_YHBMB validate structure cascade online;
ERROR at line 1:
ORA-08104: this index object 130036 is being online built or rebuilt

2.3 关联对象检查

通过 DBA_EXTENTS 确定该损坏数据块对应的表或索引:

确认130036索引: select object_name,object_type from dba_object where object_id='130036'; OBJECT_NAME OBJECT_TYPE --------------------------------------------------------- PK_JCD_PUB_YHBMB INDEX

结果显示该数据块属于 IDX_ORDERS 索引。

//与客户沟通,此时正在重建索引.无法进行分析.

2.4 索引问题分析

由于查询过程中报出 ORA-08104: this index object is being online built or rebuilt,客户重建索引后仍无法继续分析:

ANALYZE INDEX IDX_ORDERS VALIDATE STRUCTURE;

发现索引结构损坏,导致查询时无法正常访问。

3. 故障处理方案

3.1 索引修复

由于 IDX_ORDERS 索引损坏,我们先尝试使用 DBMS_REPAIR.ONLINE_INDEX_CLEAN 进行清理:

analyze table XHSD_JCD.JCD_PUB_YHBMB validate structure cascade online; ERROR at line 1: ORA-08104: this index object 130036 is being online built or rebuilt

执行清理索引创建操作:

SQL> DECLARE isClean BOOLEAN; BEGIN isClean := DBMS_REPAIR.ONLINE_INDEX_CLEAN(130036); END; /

再次手动进行rebulid:

alter index XHSD_JCD.PK_JCD_PUB_YHBMB rebuild

再次进行分析资源被占用,别hang住sql占用:

analyze table XHSD_JCD.JCD_PUB_YHBMB validate structure cascade online; analyze table XHSD_JCD.JCD_PUB_YHBMB validate structure ANALYZE INDEX XHSD_JCD.PK_JCD_PUB_YHBMB VALIDATE STRUCTURE;

3.2 检查锁定对象

查询锁定的会话,确认是否有进程阻塞索引重建:

select l.session_id,o.owner,o.object_name from v$locked_object l,dba_objects o where l.object_id=o.object_id; SESSION_ID OWNER OBJECT_NAME ---------- ---------- -------------------- 780 XHSD_JCD JCD_PUB_YHBMB 627 XHSD_JCD JCD_PUB_YHBMB 1152 XHSD_JCD JCDMS_DZTSYCCL

操作系统层面杀锁:

select a.spid,b.sid,b.serial#,b.username from v$process a,v$session b where a.addr=b.paddr and b.sid=780; select a.spid,b.sid,b.serial#,b.username from v$process a,v$session b where a.addr=b.paddr and b.sid=627; select a.spid,b.sid,b.serial#,b.username from v$process a,v$session b where a.addr=b.paddr and b.sid=1152;

kill -9进行操作系统层面杀锁

再次执行分析成功:

analyze table XHSD_JCD.JCD_PUB_YHBMB validate structure cascade online; analyze table XHSD_JCD.JCD_PUB_YHBMB validate structure ANALYZE INDEX XHSD_JCD.PK_JCD_PUB_YHBMB VALIDATE STRUCTURE;

此时客户给出问题SQL:

执行SQL 会话hang住

UPDATE XHSD_JCD.PUB_YHBMB SET DLBZ = '1' WHERE DLH='00560';

后台出现:
ORA-07445: exception encountered: core dump [kduudm()+1798] [SIGSEGV] [ADDR:0x1E1000000] [PC:0x4BCB2A2] [Address not mapped to object] []

//对于该错误代码,Mos并无相关记录.

Wed Mar 03 11:49:46 2021
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x1E1000000] [PC:0x4BCB2A2, kduudm()+1798] [flags: 0x0, count: 1]
Errors in file /opt/app/oracle/diag/rdbms/jcd/jcd/trace/jcd_ora_9960.trc (incident=100772):
ORA-07445: exception encountered: core dump [kduudm()+1798] [SIGSEGV] [ADDR:0x1E1000000] [PC:0x4BCB2A2] [Address not mapped to object] []
Incident details in: /opt/app/oracle/diag/rdbms/jcd/jcd/incident/incdir_100772/jcd_ora_9960_i100772.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Wed Mar 03 11:49:48 2021
Dumping diagnostic data in directory=[cdmp_20210303114948], requested by (instance=1, osid=9960), summary=[incident=100772].
Wed Mar 03 11:49:49 2021
Sweep [inc][100772]: completed
Sweep [inc2][100772]: completed

询问客户视图创建SQL:

create or replace view pub_yhbmb as select DLH,BM,YHM,KL,SYBZ,QYRQ,ZZRQ,DLBZ,SM from jcd_PUB_YHBMB;

对原表执行delete也hang住:

delete from jcd_pub_yhbmb WHERE DLH='00560';

select语句数据获取正常:

select * from XHSD_JCD.jcd_pub_yhbmb

尝试跟踪delete以及update语句,但sql_trace以及10046无输出.

最终尝试手动通过CTS创建一张测试表进行DML测试可以执行:

create table XHSD_JCD.test as select * from XHSD_JCD.jcd_PUB_YHBMB; delete from XHSD_JCD.test WHERE DLH='00560';

最终建议客户使用CTS重建业务表业务正常.

相关Mos:
How to Use DBMS_REPAIR.ONLINE_INDEX_CLEAN For The Cleanup Of Online Index Failing With ORA-08104? (文档 ID 1378173.1)
How to Cleanup and Rebuild an Interrupted Online Index Rebuild - ORA-8104 , ORA-8106 (文档 ID 272735.1)
Session Was Killed During The Rebuild Of Index ORA-08104 (文档 ID 375856.1)

4.需要总结的知识点:

  1. 10046各个级别的区别
  2. debug与10046事件关系与使用方法区别
  3. 处理表被占用锁定v$locked_object
  4. Dblink on Physical standby - ORA-16000 (Doc ID 1296288.1)文档
  5. Query with multiple dblinks in ADG / read-only database fails with ORA-16000 (Doc ID 2462936.1)
  6. 301137.1 OS Watcher User Guide
  7. 580513.1 How To Start OSWatcher Black Box Every System Boot (Linux specific)

操作系统层面与数据库层面杀锁.
hhh6.jpg

「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论