背景:
2015年9月24日凌晨00:00开始,某库关键数据导入进程impdp hang,凌晨06:51将进程手工stop,随后重新进行导入,有两个分区表仍出现非常缓慢甚至hang的情况,其他表导入速度正常。对这两个表改用sqlldr测试速度如何,结果也是缓慢,中止;
Workaround:24日下午16:41客户将两个表上的index drop重建,再用sqlldr工具导入,响应速度恢复快速正常。
现场服务内容:
1、经过现场与客户的沟通,检查相应时间段内的数据库alertlog和相关trace files,发现在应用hang期间,数据库日志正常无报错,仅在客户手工stop impdp进程时有个别ORA-00600的报错,分析是客户强行中断impdp引起。
Node1:
Wed Sep 24 07:54:07 2015
Errors in file /oracle/products/admin/xxdb/bdump/xxdb1_dw05_6762660.trc:
ORA-00600: internal error code, arguments: [17092], [39119], [], [], [], [], [], []
ORA-39119: worker process interrupt for delete worker processes call by master process
ORA-29913: error in executing ODCIEXTTABLECLOSE callout
ORA-44002: invalid object name
ORA-06512: at "SYS.KUPF$FILE", line 5849
ORA-06512: at line 1
ORA-39119: worker process interrupt for delete worker processes call by master process
ORA-06512: at "SYS.ORACLE_DATAPUMP", line 99
ORA-39119: worker process interrupt for delete worker processes call by master process
Wed Sep 24 07:54:09 2015
2、分析故障期间的AWR报告(客户反映03:00-04:00时间段hang现象典型,重点分析):
DB Name DB Id Instance Inst num Release RAC Host
xxdb 890782739 xxdb1 1 10.2.0.2.0 YES S7_C_YZ_YZSJK
Snap Id Snap Time Sessions Cursors/Session
Begin Snap: 3350 24-Sep-08 03:00:12 763 18.7
End Snap: 3352 24-Sep-08 04:00:46 774 14.6
Elapsed: 60.56 (mins)
DB Time: 2,302.99 (mins)
从top event来看,数据库整体状态良好,显示数据库运行正常。这与客户反映的仅两个表的impdp hang,其他数据库应用正常相吻合。
Top 5 Timed Events
Node1:
Event Waits Time(s) Avg Wait(ms) % Total Call Time Wait Class
CPU time 41,611 30.1
db file sequential read 5,971,226 40,841 7 29.6 User I/O
direct path read 9,288,752 22,790 2 16.5 User I/O
db file scattered read 1,411,206 14,849 11 10.7 User I/O
DXX lock handle 475,670 3,931 8 2.8 Other
Node2:
Event Waits Time(s) Avg Wait(ms) % Total Call Time Wait Class
CPU time 29,369 65.9
db file sequential read 351,368 3,364 10 7.5 User I/O
direct path read 1,714,420 3,244 2 7.3 User I/O
db file scattered read 170,324 1,992 12 4.5 User I/O
library cache load lock 79,666 918 12 Concurrency
但是在enqueue 统计信息中,发现有较高的oracle CF(控制文件事务锁)等待,等待次数和等待时间都比较高。正常情况下,该事件的等待次数和时间都应该很小,不会进入top。导致CF等待加重的原因通常与SMON recovery,或share pool中某latch争用有关。通过system dump或hanganalyze可以对此类等待有较全面的分析。
• node1:
Enqueue Type (Request Reason) Requests Succ Gets Failed Gets Waits Wt Time (s) Av Wt Time(ms)
CF-Controlfile Transaction 79,524 79,472 54 29,791 2,860 96.01
TM-DML 898,799 897,936 847 2,838 248 87.53
TX-Transaction (allocate ITL entry) 18,718 18,537 0 16,452 127 7.71
RO-Multiple Object Reuse (fast object reuse) 47,349 47,348 0 13,721 77 5.59
PS-PX Process Reservation 819,391 426,282 392,956 71,335 55 0.77
• node2:
Enqueue Type (Request Reason) Requests Succ Gets Failed Gets Waits Wt Time (s) Av Wt Time(ms)
CF-Controlfile Transaction 17,280 17,249 31 4,791 262 54.76
TX-Transaction (allocate ITL entry) 3,983 3,912 0 3,831 80 20.77
WF-AWR Flush 30 30 0 23 39 1,717.35
PS-PX Process Reservation 22,617 22,595 0 22,303 32 1.42
TM-DML 229,710 229,696 0 1,573 13 8.34
3、检查两个表和表上index的情况,状态正常:
select count(*) ,OBJECT_TYPE from dba_objects where object_name='IB_WL_EBOXCHGLOG' and owner='XXZWHIS' group by object_type;
COUNT(*) OBJECT_TYPE
---------- -------------------
108 TABLE SUBPARTITION
27 TABLE PARTITION
1 TABLE
select count(*) ,OBJECT_TYPE from dba_objects where object_name='IB_WL_WOFFLOG' and owner='XXZWHIS' group by object_type;
COUNT(*) OBJECT_TYPE
---------- -------------------
108 TABLE SUBPARTITION
27 TABLE PARTITION
1 TABLE
select index_name,INDEX_TYPE , UNIQUENESS,INI_TRANS,to_char(LAST_ANALYZED,'yyyy-mm-dd hh24:mi'),DEGREE from dba_indexes where table_name='IB_WL_EBOXCHGLOG' and owner='XXZWHIS'
INDEX_NAME INDEX_TYPE UNIQUENES INI_TRANS
------------------------------ --------------------------- --------- ----------
TO_CHAR(LAST_ANA DEGREE
---------------- ----------------------------------------
IDX_IB_WL_EBOXCHGLOG4 NORMAL NONUNIQUE
2015-09-24 19:03 20
IDX_IB_WL_EBOXCHGLOG3 NORMAL NONUNIQUE
2015-09-24 19:01 20
IDX_IB_WL_EBOXCHGLOG2 NORMAL NONUNIQUE
2015-09-24 18:57 20
IDX_IB_WL_EBOXCHGLOG1 NORMAL NONUNIQUE
2015-09-24 18:49 20
select index_name,INDEX_TYPE , UNIQUENESS,INI_TRANS,to_char(LAST_ANALYZED,'yyyy-mm-dd hh24:mi'),DEGREE from dba_indexes where table_name='IB_WL_WOFFLOG' and owner='XXZWHIS';
INDEX_NAME INDEX_TYPE UNIQUENES INI_TRANS
------------------------------ --------------------------- --------- ----------
TO_CHAR(LAST_ANA DEGREE
---------------- ----------------------------------------
IDX_IB_WL_WOFFLOG3 NORMAL NONUNIQUE
2015-09-24 20:24 20
IDX_IB_WL_WOFFLOG1 NORMAL NONUNIQUE
2015-09-24 20:16 20
IDX_IB_WL_WOFFLOG2 NORMAL NONUNIQUE
2015-09-24 20:18 20
select * from dba_objects where owner='XXZWHIS' and object_name='IDX_IB_WL_EBOXCHGLOG4';
--all valid
4、在同一个生产数据库中,同一个表空间上,使用同样的expdmp文件,建立测试schema,对出问题的两个表进行impdp的测试,并行度开到比较大的12个,import快速顺利完成,共约12G的data,1.5G的index,12分钟完成。 Imp期间等待事件正常,较多表现为direct path write (写入数据)和direct path read temp(创建index)。
SQL> create user zcytest identified by oracle123 default tablespace XX_his_data temporary tablespace temp21;
User created.
SQL> grant connect to zcytest;
Grant succeeded.
SQL> alter user zcytest quota 20000m on XX_his_data;
User altered.
SQL> grant imp_full_database to zcytest;
Grant succeeded.
$ pwd
/oracle/datadump/move_tab/XX
$ more zcy_imp.par
userid=zcytest/oracle123
directory=MOVE_XX
dumpfile=exp_XXZW20150923%U.dmp
logfile=zcy_imp1.log
remap_schema=XXzw:zcytest
include=table:"in(
'IB_WL_WOFFLOG','IB_WL_EBOXCHGLOG'
)"
PARALLEL=12
分析:
异常等待CF-Controlfile Transaction事件导致datapump的import进程hang,因没有进一步的system dump或hanganalyze信息,故目前还无法定位引发该事件的真正原因。
客户反映当时看到SMON在作事情,分析可能是smon在进行一些与出问题的两个表相关的回滚操作,因回滚操作持续时间较长,导致随后多次尝试的impdp操作仍然表现为hang,直到24日下午,SMON回滚操作完成,故数据导入速度恢复快速正常。
因客户重建两个表上的index后,重新进行数据导入则顺利完成,分析也可能是两个表上的某个index状态异常,导致出现异常的row cache object争用,引起CF-Controlfile Transaction事件。
建议:
1、 若datapump再次出现慢或hang的现象,尽量对数据库作一个较全面的system dump或hanganalyze,如下方法一,若对时间要求紧迫,使用方法二,可快速得到一个初步的hanganlzye结果。
方法一:
sqlplus '/as sysdba'
oradebug setmypid
oradebug unlimit
oradebug dump systemstate 266/258
或:
sqlplus '/as sysdba'
oradebug setmypid
oradebug unlimit;
oradebug hanganalyze 3
方法二:
> sqlplus '/as sysdba'
SQL> oradebug setmypid
SQL> oradebug unlimit;
SQL> oradebug hanganalyze
2、表上的index可以大大优化条件查询,但却会给DML操作带来性能的负面影响,尤其在批量导入或修改数据时。建议应用在批量导入数据之前,可以先将nonunique index drop,数据导入完成后再重新create;或者在数据导入前先将表上的nonunique index置为unusable,并指定系统不检查unusable的index;
alter index *** unusable;
alter session/system set skip_unusable_indexes=true;