暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

Oracle 11.2.0.3.9 RAC is hang (for Linux)

原创 Roger 2014-07-16
704
一个朋友的库出现异常,6月30号就出现hang的问题,当时也给了建议直接kill会话解决了,没想到现在又出现了。简单的帮忙分析了一下原因,如下是alert log的信息:
Mon Jul 14 18:50:06 2014
Archived Log entry 3689 added for thread 1 sequence 2223 ID 0x1db958b2 dest 1:
Mon Jul 14 19:40:29 2014
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_lmhb_17008.trc (incident=224129):
ORA-00445: background process "GCR0" did not start after 120 seconds
Incident details in: /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/incident/incdir_224129/xxxx1_lmhb_17008_i224129.trc
Mon Jul 14 19:40:30 2014
Dumping diagnostic data in directory=[cdmp_20140714194030], requested by (instance=1, osid=17008 (LMHB)), summary=[incident=224129].
Mon Jul 14 19:40:46 2014
LCK0 (ospid: 17064) waits for latch 'shared pool' for 169 secs.
Mon Jul 14 19:41:29 2014
PMON failed to acquire latch, see PMON dump
Mon Jul 14 19:42:01 2014
Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_lmhb_17008.trc (incident=224130):
ORA-29771: process USER (OSID 4221) blocks LCK0 (OSID 17064) for more than 70 seconds
Incident details in: /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/incident/incdir_224130/xxxx1_lmhb_17008_i224130.trc
Mon Jul 14 19:42:29 2014
PMON failed to acquire latch, see PMON dump
Mon Jul 14 19:43:10 2014
USER (ospid: 4221) is blocking LCK0 (ospid: 17064) in a wait
LMHB (ospid: 17008) kills USER (ospid: 4221).
复制

从上面的alert log信息来看,可以看到GCR0进程启动失败,且核心进程LCK0也在等待latch:shared pool. 更重要的一点是:
连pmon进程也无法获得latch了。很明显,这个数据库hang住了。

对于Oracle hang问题,大家所了解做法是,先登录数据库,然后找到阻塞源头kill会话。

那么如果数据库hang住了之后,怎么登录呢?对于Oracle 10g以及之后的版本可以通过sqlplus -prelim / as sysdba方式登录.

然后进行hanganalyze dump,可惜是朋友这里操作了之后看到任何信息,那么只能进行systemstate dump和processstate dump了。

首先我们来看下PMON进程在等待为什么,为什么无法获得latch ?
PROCESS 2: PMON
----------------------------------------
SO: 0x55ccf1b98, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x55ccf1b98, name=process, file=ksu.h LINE:12616, pg=0
(process) Oracle pid:2, ser:1, calls cur/top: 0x55d7c76b0/0x55d7c76b0
flags : (0xe) SYSTEM
flags2: (0x0), flags3: (0x10)
intr error: 0, call error: 0, sess error: 0, txn error 0
intr queue: empty
ksudlp FALSE at location: 0
(post info) last post received: 0 0 146
last post received-location: kjm.h LINE:1248 ID:kjfmHealth_Latch_killHolder: post pmon after latch holder kill
last process to post me: 544d25368 1 2
last post sent: 0 0 52
last post sent-location: ksv2.h LINE:1682 ID:ksvpst: cldel - post master
last process posted by me: 544d25368 1 2
(latch info) wait_event=0 bits=0
Location from where call was made: kgh.h LINE:6483 ID:kghfre: Chunk Header
Context saved from call: 22949992976
waiting for 601091f8 Child shared pool level=7 child#=2
Location from where latch is held: kgh.h LINE:6463 ID:kghalo:
Context saved from call: 0
state=busy [holder orapid=146] wlstate=free [value=0]
waiters [orapid (seconds since: put on list, posted, alive check)]:
205 (1830, 1405339770, 1769)
30 (1829, 1405339770, 1650)
208 (1793, 1405339770, 328)
66 (1770, 1405339770, 1770)
104 (1743, 1405339770, 1743)
16 (1536, 1405339770, 1536)
96 (1489, 1405339770, 1489)
58 (270, 1405339770, 270)
waiter count=8
gotten 2181033884 times wait, failed first 7066852 sleeps 865327
gotten 215950 times nowait, failed: 464
possible holder pid = 146 ospid=13135
Process Group: DEFAULT, pseudo proc: 0x540f7d060
O/S info: user: oracle, term: UNKNOWN, ospid: 16945
OSD pid info: Unix process pid: 16945, image: oracle@itvxxxx1 (PMON)
复制

可以看到pmon进程无法获得shared pool latch,而该资源(601091f8)正在被orapid=146所持有。下面来看下orapid 146在干什么?
PROCESS 146:
----------------------------------------
SO: 0x55cd17458, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x55cd17458, name=process, file=ksu.h LINE:12616, pg=0
(process) Oracle pid:146, ser:39, calls cur/top: 0x26c90ee80/0x26c9099e0
flags : (0x0) -
flags2: (0x0), flags3: (0x10)
intr error: 0, call error: 0, sess error: 0, txn error 0
intr queue: empty
ksudlp FALSE at location: 0
(post info) last post received: 0 0 259
last post received-location: kgl.h LINE:8714 ID:kgllkdl: post after freeing latch
last process to post me: 548d29e00 4 0
last post sent: 0 0 26
last post sent-location: ksa2.h LINE:285 ID:ksasnd
last process posted by me: 55ccf6f08 1 6
(latch info) wait_event=0 bits=80
holding (efd=17) 601091f8 Child shared pool level=7 child#=2
Location from where latch is held: kgh.h LINE:6463 ID:kghalo:
Context saved from call: 0
state=busy [holder orapid=146] wlstate=free [value=0]
waiters [orapid (seconds since: put on list, posted, alive check)]:
205 (2115, 1405340055, 2054)
30 (2114, 1405340055, 1935)
208 (2078, 1405340055, 162)
66 (2055, 1405340055, 2055)
104 (2028, 1405340055, 2028)
16 (1821, 1405340055, 1821)
96 (1774, 1405340055, 1774)
58 (555, 1405340055, 555)
65 (255, 1405340055, 255)
waiter count=9
Process Group: DEFAULT, pseudo proc: 0x540f7d060
O/S info: user: grid, term: UNKNOWN, ospid: 13135
OSD pid info: Unix process pid: 13135, image: oracle@itvxxxx1
复制

ok,我们看到该进程正在持有 601091f8这个Child shared pool。朋友这里单独对orapid=146这个会话进行了dump,下面来深入分析一下:
SO: 0x55cd17458, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x55cd17458, name=process, file=ksu.h LINE:12616, pg=0
(process) Oracle pid:146, ser:39, calls cur/top: 0x26c90ee80/0x26c9099e0
flags : (0x0) -
flags2: (0x0), flags3: (0x10)
intr error: 0, call error: 0, sess error: 0, txn error 0
intr queue: empty
ksudlp FALSE at location: 0
(post info) last post received: 0 0 259
last post received-location: kgl.h LINE:8714 ID:kgllkdl: post after freeing latch
last process to post me: 548d29e00 4 0
last post sent: 0 0 26
last post sent-location: ksa2.h LINE:285 ID:ksasnd
last process posted by me: 55ccf6f08 1 6
(latch info) wait_event=0 bits=80
holding (efd=17) 601091f8 Child shared pool level=7 child#=2
Location from where latch is held: kgh.h LINE:6463 ID:kghalo:
Context saved from call: 0
state=busy [holder orapid=146] wlstate=free [value=0]
waiters [orapid (seconds since: put on list, posted, alive check)]:
205 (3715, 1405341655, 3654)
30 (3714, 1405341655, 3535)
208 (3678, 1405341655, 406)
66 (3655, 1405341655, 3655)
104 (3628, 1405341655, 3628)
16 (3421, 1405341655, 3421)
96 (3374, 1405341655, 3374)
58 (2155, 1405341655, 2155)
65 (1855, 1405341655, 1855)
64 (956, 1405341655, 956)
69 (551, 1405341655, 551)
waiter count=11
Process Group: DEFAULT, pseudo proc: 0x540f7d060
O/S info: user: grid, term: UNKNOWN, ospid: 13135
OSD pid info: Unix process pid: 13135, image: oracle@itvxxxx1
。。。。。。。。。
。。。。。。。。。
SO: 0x55d09c5e0, type: 4, owner: 0x55cd17458, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x55cd17458, name=session, file=ksu.h LINE:12624, pg=0
(session) sid: 385 ser: 3787 trans: (nil), creator: 0x55cd17458
flags: (0x45) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
flags2: (0x40009) -/-/INC
DID: , short-term DID:
txn branch: (nil)
oct: 3, prv: 0, sql: 0x54f4c5250, psql: 0x4bc1de318, user: 34/ITVMANAGER
ksuxds FALSE at location: 0
service name: xxxx
client details:
O/S info: user: root, term: unknown, ospid: 1234
machine: localhost.localdomain program: JDBC Thin Client
application name: JDBC Thin Client, hash value=2546894660
Current Wait Stack:
0: waiting for 'library cache: mutex X'
idn=0x1a43f94e, value=0x76b00000000, where=0x4f
wait_id=7745 seq_num=7746 snap_id=1
wait times: snap=1 min 38 sec, exc=1 min 38 sec, total=1 min 38 sec
wait times: max=infinite, heur=61 min 56 sec
wait counts: calls=8957 os=8957
in_wait=1 iflags=0x15b2
There is at least one session blocking this session.
Dumping 1 direct blocker(s):
inst: 1, sid: 1899, ser: 35707
Dumping final blocker:
inst: 1, sid: 2654, ser: 39173
复制

可以看到该进程正在等待library cache: mutex X,而从最后的几行信息可以看到提示,fina blocker为:
inst: 1, sid: 2654, ser: 39173  (注意:11gR2之前这个信息是没有的,可见11gR2比较强大了)

既然这里提到最终的blocker 会话是2654,那么我们搜索关键字:sid: 2654 找到如下信息:
SO: 0x5417cacb0, type: 4, owner: 0x55cd163a8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x55cd163a8, name=session, file=ksu.h LINE:12624, pg=0
(session) sid: 2654 ser: 39173 trans: (nil), creator: 0x55cd163a8
flags: (0x45) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
flags2: (0x9) -/-/INC
DID: , short-term DID:
txn branch: (nil)
oct: 3, prv: 0, sql: 0x34f05aa78, psql: 0x29ac2e878, user: 34/xxxxxx
复制

 ksuxds FALSE at location: 0
service name: xxxx
client details:
O/S info: user: root, term: unknown, ospid: 1234
machine: localhost.localdomain program: JDBC Thin Client
application name: JDBC Thin Client, hash value=2546894660
Current Wait Stack:
0: waiting for 'latch: shared pool'
address=0x60109298, number=0x133, tries=0x0
wait_id=1048 seq_num=1049 snap_id=1
wait times: snap=42 min 12 sec, exc=42 min 12 sec, total=42 min 12 sec
wait times: max=infinite, heur=42 min 12 sec
wait counts: calls=0 os=0
in_wait=1 iflags=0x2520
There is at least one session blocking this session.
Dumping 1 direct blocker(s):
inst: 1, sid: 1716, ser: 19027
Dumping final blocker:
inst: 1, sid: 1716, ser: 19027
There are 57 sessions blocked by this session.
复制

这里又提示该进程的最终blocker为 sid 1716。而2654 会话等待等待latch: shared pool。 下面继续搜索sid: 1716
SO: 0x541505e80, type: 4, owner: 0x540d8abf0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x540d8abf0, name=session, file=ksu.h LINE:12624, pg=0
(session) sid: 1716 ser: 19027 trans: (nil), creator: 0x540d8abf0
flags: (0x45) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
flags2: (0x40009) -/-/INC
DID: , short-term DID:
txn branch: (nil)
oct: 3, prv: 0, sql: 0x55ba986f0, psql: 0x34ea4cf40, user: 34/xxxxxxxx
ksuxds FALSE at location: 0
service name: xxxx
client details:
O/S info: user: root, term: unknown, ospid: 1234
machine: localhost.localdomain program: JDBC Thin Client
application name: JDBC Thin Client, hash value=2546894660
Current Wait Stack:
0: waiting for 'library cache: mutex X'
idn=0x1a43f94e, value=0x76b00000000, where=0x4f
wait_id=32063 seq_num=32067 snap_id=1
wait times: snap=5 min 22 sec, exc=5 min 22 sec, total=5 min 22 sec
wait times: max=infinite, heur=41 min 33 sec
wait counts: calls=29210 os=29210
in_wait=1 iflags=0x15b2
There is at least one session blocking this session.
Dumping 1 direct blocker(s):
inst: 1, sid: 1899, ser: 35707
Dumping final blocker:
inst: 1, sid: 1899, ser: 35707
There are 56 sessions blocked by this session.
Dumping one waiter:
inst: 1, sid: 2654, ser: 39173
wait event: 'latch: shared pool'
p1: 'address'=0x60109298
p2: 'number'=0x133
p3: 'tries'=0x0
row_wait_obj#: 19692, block#: 3852, row#: 0, file# 50
min_blocked_time: 2483 secs, waiter_cache_ver: 11844
Wait State:
fixed_waits=0 flags=0x22 boundary=(nil)/-1
复制

可以看到sid 1716 在等待library cache: mutex X,而该会话的最终blocker为 sid:1899,继续搜索sid:1899
SO: 0x54553f600, type: 4, owner: 0x55cd152f8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x55cd152f8, name=session, file=ksu.h LINE:12624, pg=0
(session) sid: 1899 ser: 35707 trans: (nil), creator: 0x55cd152f8
flags: (0x45) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
flags2: (0x40009) -/-/INC
DID: , short-term DID:
txn branch: (nil)
oct: 3, prv: 0, sql: 0x555f6e3b0, psql: 0x55a0664c0, user: 34/xxxxxxx
ksuxds FALSE at location: 0
service name: xxxx
client details:
O/S info: user: root, term: unknown, ospid: 1234
machine: localhost.localdomain program: JDBC Thin Client
application name: JDBC Thin Client, hash value=2546894660
Current Wait Stack:
0: waiting for 'library cache: mutex X'
idn=0x1d694212, value=0xa5e00000000, where=0x7f
wait_id=928 seq_num=929 snap_id=1
wait times: snap=6 min 0 sec, exc=6 min 0 sec, total=6 min 0 sec
wait times: max=infinite, heur=42 min 12 sec
wait counts: calls=32681 os=32681
in_wait=1 iflags=0x15b2
There is at least one session blocking this session.
Dumping 1 direct blocker(s):
inst: 1, sid: 2654, ser: 39173
Dumping final blocker:
inst: 1, sid: 2654, ser: 39173
There are 57 sessions blocked by this session.
复制

我们可以清楚的看到,最后又回到了sid:2654. 感觉饶了一圈又回来了。 如果你感觉这样分析有点费劲,那么可以通过
Oracle systemstate dump的格式化脚本将前面的systemstate dump的trace格式化,格式化之后,可以你会发现如下的信息:
Starting Systemstate 1
..............................................................................
..................
.............................................................
...............................................................................
...............................................................................
...
Ass.Awk Version 1.0.39
~~~~~~~~~~~~~~~~~~~~~~
Source file : /soft/xxxx1_ora_23533.trc

System State 1 (2014-07-14 20:09:30.961)
~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~
1: [DEAD]
2: waiting for 'latch: shared pool' [Latch 601091f8]
3: waiting for 'rdbms ipc message'
4: waiting for 'VKTM Logical Idle Wait'
。。。。。。。
。。。。。。。

Resource Holder State
Latch 601091f8 146: 146: is waiting for 138:
Latch 60109298 105: 105: is waiting for 138:
PIN: Handle=0x55b1664c8 ??? Blocker
Latch 60109158 97: 97: is waiting for 138:
Enq PV-00000000-00000000 16: 16: is waiting for 146:
Latch 53330ec30 54: 54: is waiting for Rcache object=0x491dfa5e0
Rcache object=0x491dfa5e0 ??? Blocker
Mutex 4f3f1fc3 97: 97: is waiting for 138:
Mutex 1a43f94e 138: 138: is waiting for 142:
LOCK: Handle=0x54f06d310 138: 138: is waiting for 142:
Mutex 1d694212 142: 142: is waiting for 105:
Mutex b182f101 103: 103: is waiting for 105:
Latch 6000a018 208: 208: is waiting for 146:
Latch 60010f38 226: 226: is waiting for 97:
复制

我们把上面的信息简单整理一下,如下:

146(library cache: mutex X) 等待 138 (library cache: mutex X)
138(library cache: mutex X) 等待 142 (latch: shared pool)
142(latch: shared pool)     等待 105 (library cache: mutex X)
105(library cache: mutex X) 等待 138 (library cache: mutex X)

我们可以看出,很明显Oracle在这里形成了死锁,即library cache: mutex X 和 latch:shared pool构成了死锁.

当然这里如果是要解决问题,那么简单,将138,142,105 3个会话kill掉即可。

凭直觉我们可能就知道,这样的问题很可能是Oracle bug导致,注意朋友这里的环境已经是11.2.0.3.9了。

搜索了MOS,我认为90{39ecd679003247f2ed728ad9c7ed019a369dd84d0731b449c26bf628d3c1a20b}的情况是这个Bug 17588480 - library cache mutex/shared pool latch deadlock (文档 ID 17588480.8)

可惜的是朋友这里没有做errorstack,无法准确定位到是不是这个bug,较为遗憾。
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论