Oracle游标故障案例解析

IT那活儿 2022-04-12

2509

点击上方“IT那活儿”，关注后了解更多内容，不管IT什么活儿，干就完了！！！

游标的概念

游标是SQL的一个内存工作区，由系统或用户以变量的形式定义。游标的作用就是用于临时存储从数据库中提取的数据块。

在某些情况下，需要把数据从存放在磁盘的表中调到计算机内存中进行处理，最后将处理结果显示出来或最终写回数据库。这样数据处理的速度才会提高，否则频繁的磁盘数据交换会降低效率。

游标有两种类型：显式游标和隐式游标。

在前述程序中用到的SELECT...INTO...查询语句，一次只能从数据库中提取一行数据，对于这种形式的查询和DML操作，系统都会使用一个隐式游标。

但是如果要提取多行数据，就要由程序员定义一个显式游标，并通过与游标有关的语句进行处理。显式游标对应一个返回结果为多行多列的SELECT语句。

游标一旦打开，数据就从数据库中传送到游标变量中，然后应用程序再从游标变量中分解出需要的数据，并进行处理。

--隐式游标

如前所述，DML操作和单行SELECT语句会使用隐式游标，它们是：

插入操作：INSERT。
更新操作：UPDATE。
删除操作：DELETE。
单行查询操作：SELECT ... INTO ...。

当系统使用一个隐式游标时，可以通过隐式游标的属性来了解操作的状态和结果，进而控制程序的流程。

隐式游标可以使用名字SQL来访问，但要注意，通过SQL游标名总是只能访问前一个DML操作或单行SELECT操作的游标属性。所以通常在刚刚执行完操作之后，立即使用SQL游标名来访问属性。

游标的属性有四种，如下所示：

故障类型及解析

1. cursor:pin S wait on X等待事件

原因分析：

A session waits for this event when it is requesting a shared mutex pin and another session is holding an exclusive mutex pin on the same cursor object.（当会话请求一个共享互斥锁引脚，而另一个会话持有同一个游标对象上的互斥锁引脚时，会话等待该事件。）

这个事件的出现受到很多因素的影响，在高并发的情况下：

sga自动管理,sga的频繁扩展和收缩。
过渡硬解析,造成library cache中的cursor object被频繁的reload。
bug。

案例分析一：

数据库bug原因引发cursor:pin S wait on X等待事件。

1）查看等待事件详情

--查看系统上现有的快照信息：

SQL> col mintime for a30SQL> col maxtime for a30SQL> select min(snap_id) minid, max(snap_id) maxid,  2 to_char(min(begin_interval_time),'yyyy-mm-dd hh24:mi:ss') mintime,  3 to_char(max(end_interval_time),'yyyy-mm-dd hh24:mi:ss') maxtime  4 from dba_hist_snapshot;

--根据快照信息，我们来查看一下对应的等待事件分类情况：

SQL> 1  select wait_class_id,wait_class, count(*) cnt  2  from dba_hist_active_sess_history  3  where snap_id between 78303 and 78472  4  group by wait_class_id, wait_class  5   * order by 3WAIT_CLASS_ID WAIT_CLASS CNT------------- -------------------- ----------   2723168908 Idle 2   4166625743 Administrative 6   2000153315 Nication 829   3290255840 Configuration 4128   4108307767 System I/O 9234   1893977003 Other 11043   3386400367 Commit 26802   1740759767 User I/O 28076   3875070507 Concurrency 888984

--查看具体的等待事件情况：

SQL> select event_id, event, count(*) cnt   2  from dba_hist_active_sess_history  3  where snap_id between 78303 and 78472  4  and wait_class_id=3875070507  5  group by event_id, event  6  order by 3;  EVENT_ID EVENT CNT---------- -------------------------------- ---------877525844       cursor: mutex X 186156091        os thread startup 61242501677      latch: library cache pin 71714089451      row cache lock                      72952162927      library cache load lock               102802704141      library cache pin 222032051689      latch: library cache lock              451117386924      latch: row cache objects 601394127552      latch: In memory undo latch 682779959231      latch: cache buffers chains 8732161531084      buffer busy waits 4286916468430       library cache lock                  45492696347763      latch: shared pool 12360589947255       latch: library cache 127181729366244      cursor: pin S wait on X 853972

2）查找出pin S wait on X对应的SQL

SQL> select sql_id, count(*) cnt from dba_hist_active_sess_histo 2  rywhere snap_id between 78303 and 3   78472  4  and event_id in (1729366244)  5 group by sql_idhaving count(*)> 6 100order by   7 2 desc;SQL_ID CNT------------- ----------0nuvj12m3ryvy 853880

--接着上面的查询我们可以从awr历史信息里找到这些sql语句主要在等待那些对象：

SQL> select owner,current_obj#,object_name,object_type, count(*) cnt   2  from dba_hist_active_sess_history a, dba_objects b  3  where snap_id between 78303 and 78472  4  and event_id in (1729366244)   5  and sql_id in ('0nuvj12m3ryvy')  6  and a.current_obj#=b.object_id  7  group by owner,current_obj#,object_name,object_type  8  having count(*) > 10  9  order by 5 desc;OWNER CURRENT_OBJ# OBJECT_NAME OBJECT_TYPE CNT---------- ------------ ------------------------------ ------------------- ----------SETTLE 49326 T_OPERATE_LOG TABLE 654899SYS 73541 LOG$INFORMATION TABLE 16337SETTLE 48117 G_MENU_RIGHT TABLE 9684SETTLE 141993 CONFIG_UNIX INDEX 9567SETTLE 136520 T_MANAGE_WARN_CONFIG TABLE 9565SETTLE 51955 T_BILL_LOG TABLE 9520SETTLE 48128 G_ROLE TABLE 9458

3）下面确认等待的数据库是否过于集中，也就是确认是否存在热块儿问题：

SQL> select current_file#,current_block#, count(*) cnt  2  from dba_hist_active_sess_history  3  where snap_id between 78303 and 78472  4  and event_id in (1729366244) and sql_id in ('0nuvj12m3ryv 5 y')  6  and current_obj# in (49326,48117,141993,136520,51955,48128)  7  group by current_file#, current_block#  8  having count(*)>50  9  order by 3;CURRENT_FILE# CURRENT_BLOCK# CNT------------- -------------- ----------          9           4436       9458          276         839623       9500          246         857417       9520          276         839495       9521           2         532140       9565          55        1153960       9567          276         840134       9648          25         739537       9684          ...          275         905767      16209          275         909728      16213          275         904723      16262          275         908888      16263          276         844986      16275          276         844862      16347          275         906325      16394          275         904842      16403          275         908197      24737          276         841357      25472

4）排除热快根据MOS提示的

How to Determine the Blocking Session for Event: 'cursor: pin S wait on X' (Doc ID 786507.1).

Cursor: pin S wait on X当一个session为一个与pin相关的共享操作（such as executing a cursor）请求一个mutex时,该session会有Cursor: pin S wait on X等待事件。

但是该mutex不能被授权，因为该mutex正在被其他session以排他模式持有（比如 parsing the cursor）。

v$session or v$session_wait中的p2raw列给出了 cursor: pin S wait on X等待事件的阻塞者session（持有者session）按照MOS文档方法查看。

SQL> select p2raw from v$session where event = 'cursor: pin S wait on X'; P2RAW----------------000001B200000000000001B200000000000001B200000000...000001B200000000

--操作系统是64位的，用命令来换算得出。

SQL> select p2raw,to_number(substr(to_char(rawtohex(p2raw)),1,8),'XXXXXXXX') sid   2       from v$session   3       where event = 'cursor: pin S wait on X'; P2RAW SID---------------- ----------000001B200000000 434000001B200000000 434000001B200000000 434...000001B200000000 434

--找到block session会话关系：

SQL> select p1, p2raw, count(*) from v$session      where event ='cursor: pin S 2 wait on X'       and wait_time = 0         group by p1, p2raw; P1 P2RAW COUNT(*)---------- ---------------- ----------2788948862 000001B200000000 59

--参数说明：

p1 = the mutex Id  This has the same definition as v$mutex_sleep_history.mutex_identifier  p2raw = holding Session Id | Ref Count  The most significant bytes always store the Holding Session Id (Holding SId).  The least significant bytes always store the Ref Count.  SQL> select sid,serial#,SQL_ID,BLOCKING_SESSION,BLOCKING_SESSION_STATUS,EVENT     2       from v$session where SID=434; SID    SERIAL# SQL_ID        BLOCKING_SESSION BLOCKING_SE EVENT---------- ---------- ------------- ---------------- ----------- 434      34745 0nuvj12m3ryvy        UNKNOWN     single-task messageSQL> select sid,serial#,SQL_ID,BLOCKING_SESSION,BLOCKING_SESSION_STATUS,EVENT     2       from v$session where event ='cursor: pin S wait on X'  ; SID     SERIAL# SQL_ID         BLOCKING_SESSION BLOCKING_SE EVENT---------- ---------- ------------- ---------------- -----------     332      59875 0nuvj12m3ryvy                  UNKNOWN     cursor: pin S wait on X    333      27868 0nuvj12m3ryvy                  UNKNOWN     cursor: pin S wait on X   350      54031 0nuvj12m3ryvy                  UNKNOWN     cursor: pin S wait on X    365       5053 0nuvj12m3ryvy                  UNKNOWN     cursor: pin S wait on X    ...   1043      53471 0nuvj12m3ryvy                  UNKNOWN     cursor: pin S wait on X  1082      13982 0nuvj12m3ryvy                  UNKNOWN     cursor: pin S wait on X

--查看游标数量因为当前系统version count并不高,所以判断遇到了bug了,。

SQL> select sql_id,version_count from v$sqlarea where version_count> 100 order by 2 desc ;no rows selected

发生cursor: pin S wait on X原因：

Frequent Hard Parses If the frequency of Hard Parsing is extremely high, then contention can occur on this pin.High Version Counts When Version counts become excessive, a long chain of versions needs to be examined and this can lead to contention on this event Known bugs.Bug 5907779 - Self deadlock hang on "cursor: pin S wait on X" (typically from DBMS_STATS) [ID 5907779.8]Bug 7568642: BLOCKING_SESSION EMPTY FOR "CURSOR: PIN S WAIT ON X".

2. cursor:mutex X等待事件

原因分析：

当一个session为一个与pin相关的共享操作（such as executing a cursor）请求一个mutex时,该session会有Cursor: pin S wait on X等待事件。但是该mutex不能被授权，因为该mutex正在被其他session以排他模式持有（比如 parsing the cursor）。

触发该事件的情况有：

Mutex持有者得不到CPU。
过多的子游标 High Version Counts，过多的子游标版本Version Count可能导致Mutex 争用，一般一个SQL的Version Count不要高于500。
更新或构件SQL统计信息V$SQLSTATS。
已经被KILLED的SESSION仍持有Mutex。

案例分析一：

通过dba_hist_active_sess_history查看当时主要的等待事件：（开始出现Cursor：Mute X的时间），找到造成cursor: mutex X的对应SQL。

SQL_ID          COUNT(*)------------- ----------b1gtr4yvjk7uc     149275

--造成Cursor:mute X的主要原因：

select USER_BIND_PEEK_MISMATCH,count(*) from v$sql_shared_cursor where sql_id='b1gtr4yvjk7uc' group by USER_BIND_PEEK_MISMATCH;U COUNT(*)- ----------N 4Y 21

--查看AWR报告中Mutex Sleep Summary：

1）Wait time较长的Mutex type和BUG 28889389相匹配。

原因：

在观察到多个唯一会话在以下堆栈下的同一游标上以独占模式等待父游标互斥锁后，此错误被命中。'cursor: mutex X' 将被用于父游标以找出该特定的匹配子游标SQL。在高并发下，当所有会话同时执行相同的光标导致此互斥争用，该BUG没有workaround只能通过打补丁Patch 28889389来修复。

2）与本case比较相近的另一个BUG是BUG 32755517 : HIGH VERSION COUNT DUE TO USER_BIND_PEEK_MISMATCH AFTER FLUSH SHARED POOL。

该BUG的触发条件时Sql profile且里包含了_optim_peek_user_binds = false。这个问题直到23.1才被修复。

--Workaround的方法有2种：

3）建议将_cursor_obsolete_threshold降低。

这个参数可以控制version count,该值目前为12C默认值，建议将其设置成1000（该参数时静态参数，需要重启生效）。

High Version Counts For SQL Statements (>1024) Post Upgrade To 12.2 and Above Causing Database Slow Performance (Doc ID 2431353.1)

4）故障分析总结如下：

根据数据库当前已有信息，我们可以推测和上文提到的两个BUG相似，如果需要进一步匹配，需要通过call stack来进行分析，但当前数据库并没有保留该类信息。

Hanganalyze使用方法如下：

---------单机

SQL> sqlplus -prelim  as sysdbaSQL> oradebug setmypidSQL> oradebug unlimitSQL> oradebug hanganalyze 3SQL> oradebug hanganalyze 3SQL> oradebug hanganalyze 3SQL> oradebug dump systemstate 266 ---266或者258SQL> oradebug dump systemstate 266SQL> oradebug dump systemstate 266SQL> oradebug tracefile_name------------racSQL> sqlplus -prelim  as sysdbaSQL> oradebug setmypidSQL> oradebug unlimitSQL> oradebug -g all hanganalyze 3SQL> oradebug -g all hanganalyze 3SQL> oradebug -g all hanganalyze 3SQL> oradebug -g all dump systemstate 266SQL> oradebug -g all dump systemstate 266SQL> oradebug -g all dump systemstate 266SQL> oradebug tracefile_name

5）处理建议总结：

取消SQL ID为b1gtr4yvjk7uc的sql profile。
当问题重现的情况下，请及时使用hanganalyze收集信息，并取出系统留存，以便未来定位BUG。
建议将_cursor_obsolete_threshold降低。

案例分析二：

BIND_EQUIV_FAILURE绑定值的选择性与现有子游标的选择性不匹配，加上频次执行量较大创建大量子游标造成cursor:mutex X等待事件。

1）Sql执行频率：

SNAP_ID     NODE BEGIN_INTERVAL_TIME   SQL_ID         EXECS    AVG_ETIME   AVG_IO30615 2 22-JAN-22 08.00.08.843 AM      bq31p1f8gz5fg          796 .004   117.89321630616 2 22-JAN-22 09.00.36.368 AM      bq31p1f8gz5fg        6,521 .004   117.04201830617 1 22-JAN-22 10.00.04.663 AM      bq31p1f8gz5fg    9,612,074 14.498   161.60581330617 2 22-JAN-22 10.00.04.740 AM      bq31p1f8gz5fg       10,458 .004   124.729489

2）查看执行计划：

SNAP_ID   NODE BEGIN_INTERVAL_TIME            SQL_ID        PLAN_HASH_VALUE     EXECS    AVG_ETIME     AVG_IO------ ------ ------------------------------ ------------- --------------- --------- ------------ ----------30415 2 14-JAN-22 12.00.04.921 AM      bq31p1f8gz5fg       212347024 1,695 .006 164.184661...30615 2 22-JAN-22 08.00.08.843 AM      bq31p1f8gz5fg                       796 .004 117.89321630616 2 22-JAN-22 09.00.36.368 AM      bq31p1f8gz5fg                     6,521 .004 117.04201830617 1 22-JAN-22 10.00.04.663 AM      bq31p1f8gz5fg                 9,612,074 14.498 161.60581330617 2 22-JAN-22 10.00.04.740 AM      bq31p1f8gz5fg                    10,458 .004 124.72948930618 1 22-JAN-22 11.00.33.928 AM      bq31p1f8gz5fg                    23,631 .004  123.2808630618 2 22-JAN-22 11.00.34.020 AM      bq31p1f8gz5fg                     6,177 .004 130.68188430619 1 22-JAN-22 12.00.01.677 PM      bq31p1f8gz5fg                    16,435 .004 123.45810830619 2 22-JAN-22 12.00.01.755 PM      bq31p1f8gz5fg                     4,803 .004  128.1946730620 1 22-JAN-22 01.00.19.436 PM      bq31p1f8gz5fg                    15,061 .004 124.27939730620 2 22-JAN-22 01.00.19.355 PM      bq31p1f8gz5fg                     4,322 .004 131.075659

解决方法：删除不匹配的执行计划，选择正确执行计划。

本文作者：郭琳

本文来源：IT那活儿（上海新炬王翦团队）

数据库

最后修改时间：2022-04-12 10:24:30

文章转载自IT那活儿，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。