暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

由控制文件事件导致datapump的import进程hang

数据老匠 2016-07-28
916


背景:

        2015年9月24日凌晨00:00开始,某库关键数据导入进程impdp hang,凌晨06:51将进程手工stop,随后重新进行导入,有两个分区表仍出现非常缓慢甚至hang的情况,其他表导入速度正常。对这两个表改用sqlldr测试速度如何,结果也是缓慢,中止;

Workaround:24日下午16:41客户将两个表上的index drop重建,再用sqlldr工具导入,响应速度恢复快速正常。


现场服务内容:

        1、经过现场与客户的沟通,检查相应时间段内的数据库alertlog和相关trace files,发现在应用hang期间,数据库日志正常无报错,仅在客户手工stop impdp进程时有个别ORA-00600的报错,分析是客户强行中断impdp引起。

Node1:

Wed Sep 24 07:54:07 2015

Errors in file /oracle/products/admin/xxdb/bdump/xxdb1_dw05_6762660.trc:

ORA-00600: internal error code, arguments: [17092], [39119], [], [], [], [], [], []

ORA-39119: worker process interrupt for delete worker processes call by master process

ORA-29913: error in executing ODCIEXTTABLECLOSE callout

ORA-44002: invalid object name

ORA-06512: at "SYS.KUPF$FILE", line 5849

ORA-06512: at line 1

ORA-39119: worker process interrupt for delete worker processes call by master process

ORA-06512: at "SYS.ORACLE_DATAPUMP", line 99

ORA-39119: worker process interrupt for delete worker processes call by master process

Wed Sep 24 07:54:09 2015

        2、分析故障期间的AWR报告(客户反映03:00-04:00时间段hang现象典型,重点分析):

DB Name DB Id      Instance Inst num    Release  RAC Host

        xxdb 890782739   xxdb1 1         10.2.0.2.0 YES  S7_C_YZ_YZSJK

Snap Id Snap Time Sessions Cursors/Session

Begin Snap: 3350 24-Sep-08 03:00:12 763 18.7

 End Snap:   3352 24-Sep-08 04:00:46 774 14.6

    Elapsed: 60.56 (mins)  

   DB Time: 2,302.99 (mins)  


        从top event来看,数据库整体状态良好,显示数据库运行正常。这与客户反映的仅两个表的impdp hang,其他数据库应用正常相吻合。

Top 5 Timed Events 

Node1:

Event Waits Time(s) Avg Wait(ms) % Total Call Time Wait Class

CPU time           41,611 30.1  

db file sequential read 5,971,226           40,841 7 29.6 User I/O

direct path read            9,288,752         22,790 2 16.5 User I/O

db file scattered read 1,411,206         14,849 11 10.7 User I/O

DXX lock handle           475,670         3,931 8 2.8         Other

Node2:

Event Waits Time(s) Avg Wait(ms) % Total Call Time Wait Class

CPU time         29,369 65.9  

db file sequential read 351,368         3,364 10 7.5 User I/O

direct path read         1,714,420         3,244 2 7.3 User I/O

db file scattered read 170,324         1,992 12 4.5 User I/O

library cache load lock 79,666         918         12 Concurrency

        但是在enqueue 统计信息中,发现有较高的oracle CF(控制文件事务锁)等待,等待次数和等待时间都比较高。正常情况下,该事件的等待次数和时间都应该很小,不会进入top。导致CF等待加重的原因通常与SMON recovery,或share pool中某latch争用有关。通过system dump或hanganalyze可以对此类等待有较全面的分析。

node1: 

Enqueue Type (Request Reason) Requests Succ Gets Failed Gets Waits Wt Time (s) Av Wt Time(ms)

CF-Controlfile Transaction 79,524 79,472 54 29,791 2,860 96.01

TM-DML 898,799 897,936 847 2,838 248 87.53

TX-Transaction (allocate ITL entry) 18,718 18,537 0 16,452 127 7.71

RO-Multiple Object Reuse (fast object reuse) 47,349 47,348 0 13,721 77 5.59

PS-PX Process Reservation 819,391 426,282 392,956 71,335 55 0.77

node2:

Enqueue Type (Request Reason) Requests Succ Gets Failed Gets Waits Wt Time (s) Av Wt Time(ms)

CF-Controlfile Transaction 17,280 17,249 31 4,791 262 54.76

TX-Transaction (allocate ITL entry) 3,983 3,912 0 3,831 80 20.77

WF-AWR Flush 30 30 0 23 39 1,717.35

PS-PX Process Reservation 22,617 22,595 0 22,303 32 1.42

TM-DML 229,710 229,696 0 1,573 13 8.34

3、检查两个表和表上index的情况,状态正常:

  select  count(*) ,OBJECT_TYPE from dba_objects where object_name='IB_WL_EBOXCHGLOG'  and owner='XXZWHIS' group by object_type;

    COUNT(*) OBJECT_TYPE

---------- -------------------

       108 TABLE SUBPARTITION

        27 TABLE PARTITION

         1 TABLE

select  count(*) ,OBJECT_TYPE from dba_objects where object_name='IB_WL_WOFFLOG'  and owner='XXZWHIS' group by object_type;

     COUNT(*) OBJECT_TYPE

---------- -------------------

       108 TABLE SUBPARTITION

        27 TABLE PARTITION

         1 TABLE

select index_name,INDEX_TYPE , UNIQUENESS,INI_TRANS,to_char(LAST_ANALYZED,'yyyy-mm-dd hh24:mi'),DEGREE  from dba_indexes where table_name='IB_WL_EBOXCHGLOG'  and owner='XXZWHIS'

INDEX_NAME                     INDEX_TYPE                  UNIQUENES  INI_TRANS

------------------------------ --------------------------- --------- ----------

TO_CHAR(LAST_ANA DEGREE

---------------- ----------------------------------------

IDX_IB_WL_EBOXCHGLOG4          NORMAL                      NONUNIQUE

2015-09-24 19:03 20

IDX_IB_WL_EBOXCHGLOG3          NORMAL                      NONUNIQUE

2015-09-24 19:01 20

IDX_IB_WL_EBOXCHGLOG2          NORMAL                      NONUNIQUE

2015-09-24 18:57 20

IDX_IB_WL_EBOXCHGLOG1          NORMAL                      NONUNIQUE

2015-09-24 18:49 20

select index_name,INDEX_TYPE , UNIQUENESS,INI_TRANS,to_char(LAST_ANALYZED,'yyyy-mm-dd hh24:mi'),DEGREE  from dba_indexes where table_name='IB_WL_WOFFLOG'  and owner='XXZWHIS';

INDEX_NAME                     INDEX_TYPE                  UNIQUENES  INI_TRANS

------------------------------ --------------------------- --------- ----------

TO_CHAR(LAST_ANA DEGREE

---------------- ----------------------------------------

IDX_IB_WL_WOFFLOG3             NORMAL                      NONUNIQUE

2015-09-24 20:24 20

IDX_IB_WL_WOFFLOG1             NORMAL                      NONUNIQUE

2015-09-24 20:16 20

IDX_IB_WL_WOFFLOG2             NORMAL                      NONUNIQUE

2015-09-24 20:18 20

select * from dba_objects where  owner='XXZWHIS' and object_name='IDX_IB_WL_EBOXCHGLOG4';

--all valid


4、在同一个生产数据库中,同一个表空间上,使用同样的expdmp文件,建立测试schema,对出问题的两个表进行impdp的测试,并行度开到比较大的12个,import快速顺利完成,共约12G的data,1.5G的index,12分钟完成。 Imp期间等待事件正常,较多表现为direct path write (写入数据)和direct path read temp(创建index)。

SQL> create user zcytest identified by oracle123 default tablespace XX_his_data temporary tablespace temp21;

User created.

SQL> grant connect to zcytest;

Grant succeeded.

SQL> alter user zcytest quota 20000m on XX_his_data;

User altered.

SQL> grant imp_full_database to zcytest;

Grant succeeded.

$ pwd

/oracle/datadump/move_tab/XX

$ more zcy_imp.par

userid=zcytest/oracle123

directory=MOVE_XX

dumpfile=exp_XXZW20150923%U.dmp

logfile=zcy_imp1.log

remap_schema=XXzw:zcytest

include=table:"in(

'IB_WL_WOFFLOG','IB_WL_EBOXCHGLOG'

)"

PARALLEL=12


分析:

        异常等待CF-Controlfile Transaction事件导致datapump的import进程hang,因没有进一步的system dump或hanganalyze信息,故目前还无法定位引发该事件的真正原因。

        客户反映当时看到SMON在作事情,分析可能是smon在进行一些与出问题的两个表相关的回滚操作,因回滚操作持续时间较长,导致随后多次尝试的impdp操作仍然表现为hang,直到24日下午,SMON回滚操作完成,故数据导入速度恢复快速正常。

        因客户重建两个表上的index后,重新进行数据导入则顺利完成,分析也可能是两个表上的某个index状态异常,导致出现异常的row cache object争用,引起CF-Controlfile Transaction事件。


建议:

1、 若datapump再次出现慢或hang的现象,尽量对数据库作一个较全面的system dump或hanganalyze,如下方法一,若对时间要求紧迫,使用方法二,可快速得到一个初步的hanganlzye结果。

方法一:

sqlplus '/as sysdba'

oradebug setmypid

oradebug unlimit

oradebug dump systemstate 266/258

或:

sqlplus '/as sysdba'

oradebug setmypid

oradebug unlimit;

oradebug hanganalyze 3

方法二:

> sqlplus '/as sysdba'

SQL> oradebug setmypid

SQL> oradebug unlimit;

SQL> oradebug hanganalyze

2、表上的index可以大大优化条件查询,但却会给DML操作带来性能的负面影响,尤其在批量导入或修改数据时。建议应用在批量导入数据之前,可以先将nonunique index drop,数据导入完成后再重新create;或者在数据导入前先将表上的nonunique index置为unusable,并指定系统不检查unusable的index;

alter index *** unusable;

alter session/system set skip_unusable_indexes=true;


文章转载自数据老匠,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论