The hardware environment has been in service for eight years, so the machines are fairly old.
Recently, starting in October, the database has kept hanging. This happened before as well, but only once every few months. After a hang we generally recover by rebooting the operating system, almost always rebooting both nodes at the same time: if only one node's OS is rebooted, the database on that node may fail to start, and ocssd.log reports a "takeover aborted due to ALIVE node on Disk" error.
Around the time the database is restarted, ORA-29702 errors are often reported. I don't know whether this is the cause of the hang or simply a consequence of my rebooting the operating system.
Around 13:50 on November 1 the database hung again, and we recovered by rebooting the OS on both nodes. The logs show ORA-29702 errors starting at 13:56, but the database was already hung before that, so I suspect the ORA-29702 may have been caused by my OS reboot, though I can't be sure. I have uploaded all the logs; please help analyze them and pin down the exact cause of the failure. Many thanks.
As for the cause of the hang, I suspect an Oracle bug, but there is no conclusive evidence in the logs, so the current plan is to upgrade to Oracle RAC 10.2.0.5 and see whether that resolves the problem.
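For reference when reading the eviction timings and misstime values in the logs below, the configured CSS heartbeat thresholds can be checked from either node with crsctl. This is a minimal sketch, assuming the clusterware binaries under /oracle/product/10.2.0/crs (the CRS home shown in the logs); the output is not included here.

# Hedged sketch: query the CSS timeouts on 10.2 clusterware, run as the clusterware owner.
/oracle/product/10.2.0/crs/bin/crsctl get css misscount      # network heartbeat timeout, seconds
/oracle/product/10.2.0/crs/bin/crsctl get css disktimeout    # voting disk I/O timeout, seconds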

Key events from each log:
- lsyy1 - Fri Nov 1 13:55:06 2019: System State dumped lsyy1_ora_17433530.trc
- asm1 - Fri Nov 1 13:56:52 2019: Trace dumping is performing id=[cdmp_20191101135652]
- asm2 - Fri Nov 1 13:56:51 2019: GMON: terminating instance due to error 29702
- grid1 - 2019-11-01 13:57:06.598: [cssd(2818714)]CRS-1612:node p550b (0) at 50% heartbeat fatal, eviction in 0.000 seconds
- CSSD - 2019-11-01 13:56:51.540: WARNING: clssnmeventhndlr: Receive failure with node 2 (p550b), state 3, con(111e0e230), probe(0), rc=11
alert_grid1
[crsd(2949782)]CRS-1205:Auto-start failed for the CRS resource . Details in p550a.
2019-10-24 04:00:54.152
[crsd(2949782)]CRS-1205:Auto-start failed for the CRS resource . Details in p550a.
[cssd(2818714)]CRS-1601:CSSD Reconfiguration complete. Active nodes are p550a p550b .
2019-11-01 13:57:06.598
[cssd(2818714)]CRS-1612:node p550b (0) at 50% heartbeat fatal, eviction in 0.000 seconds
2019-11-01 13:57:07.599
[cssd(2818714)]CRS-1612:node p550b (0) at 50% heartbeat fatal, eviction in 0.000 seconds
2019-11-01 13:57:14.612
[cssd(2818714)]CRS-1611:node p550b (0) at 75% heartbeat fatal, eviction in 0.000 seconds
2019-11-01 13:57:18.615
[cssd(2818714)]CRS-1610:node p550b (0) at 90% heartbeat fatal, eviction in 0.000 seconds
2019-11-01 13:57:19.616
[cssd(2818714)]CRS-1610:node p550b (0) at 90% heartbeat fatal, eviction in 0.000 seconds
2019-11-01 13:57:20.616
[cssd(2818714)]CRS-1610:node p550b (0) at 90% heartbeat fatal, eviction in 0.000 seconds
2019-11-01 13:57:21.126
[cssd(2818714)]CRS-1607:CSSD evicting node p550b. Details in /oracle/product/10.2.0/crs/log/p550a/cssd/ocssd.log.
[cssd(2818714)]CRS-1601:CSSD Reconfiguration complete. Active nodes are p550a .
2019-11-01 13:57:33.903
[crsd(2949782)]CRS-1204:Recovering CRS resources for node p550b.
2019-11-01 13:57:47.118
[cssd(25952658)]CRS-1605:CSSD voting file is online: /dev/rhdiskpower4. Details in /oracle/product/10.2.0/crs/log/p550a/cssd/ocssd.log.
2019-11-01 14:04:01.636
alert_grid2
2019-10-24 04:01:10.403
[crsd(2491016)]CRS-1201:CRSD started on node p550b.
2019-11-01 14:03:13.299
[cssd(1376764)]CRS-1605:CSSD voting file is online: /dev/rhdiskpower4. Details in /oracle/product/10.2.0/crs/log/p550b/cssd/ocssd.log.
[cssd(1376764)]CRS-1601:CSSD Reconfiguration complete. Active nodes are p550b .
2019-11-01 14:04:00.184
[evmd(2425748)]CRS-1401:EVMD started on node p550b.
2019-11-01 14:04:00.241
cssd_1
[ CSSD]2019-10-24 04:01:06.497 [3858] >TRACE: clssgmReconfigThread: completed for reconfig(2), with status(1)
[ CSSD]2019-11-01 13:56:51.540 [1801] >WARNING: clssnmeventhndlr: Receive failure with node 2 (p550b), state 3, con(111e0e230), probe(0), rc=11
[ CSSD]2019-11-01 13:56:51.540 [1801] >TRACE: clssnmDiscHelper: p550b, node(2) connection failed, con (111e0e230), probe(0)
[ CSSD]2019-11-01 13:56:51.540 [2829] >TRACE: clssgmPeerDeactivate: node 2 (p550b), death 0, state 0x1 connstate 0xf
[ CSSD]2019-11-01 13:57:06.598 [3086] >WARNING: clssnmPollingThread: node p550b (2) at 50% heartbeat fatal, eviction in 14.523 seconds seedhbimpd 0
[ CSSD]2019-11-01 13:57:06.598 [3086] >TRACE: clssnmPollingThread: node p550b (2) is impending reconfig, flag 1, misstime 15477
[ CSSD]2019-11-01 13:57:06.598 [3086] >TRACE: clssnmPollingThread: diskTimeout set to (27000)ms impending reconfig status(1)
[ CSSD]2019-11-01 13:57:07.599 [3086] >WARNING: clssnmPollingThread: node p550b (2) at 50% heartbeat fatal, eviction in 13.522 seconds seedhbimpd 1
[ CSSD]2019-11-01 13:57:14.612 [3086] >WARNING: clssnmPollingThread: node p550b (2) at 75% heartbeat fatal, eviction in 6.508 seconds seedhbimpd 1
[ CSSD]2019-11-01 13:57:18.615 [3086] >WARNING: clssnmPollingThread: node p550b (2) at 90% heartbeat fatal, eviction in 2.505 seconds seedhbimpd 1
[ CSSD]2019-11-01 13:57:19.616 [3086] >WARNING: clssnmPollingThread: node p550b (2) at 90% heartbeat fatal, eviction in 1.505 seconds seedhbimpd 1
[ CSSD]2019-11-01 13:57:20.616 [3086] >WARNING: clssnmPollingThread: node p550b (2) at 90% heartbeat fatal, eviction in 0.504 seconds seedhbimpd 1
[ CSSD]2019-11-01 13:57:21.122 [3086] >TRACE: clssnmPollingThread: Eviction started for node p550b (2), flags 0x0001, state 3, wt4c 0 seedhbimpd 1
[ CSSD]2019-11-01 13:57:21.122 [3600] >TRACE: clssnmDoSyncUpdate: Initiating sync 3
[ CSSD]2019-11-01 13:57:21.122 [3600] >TRACE: clssnmDoSyncUpdate: diskTimeout set to (27000)ms
[ CSSD]2019-11-01 13:57:21.122 [3600] >TRACE: clssnmSetupAckWait: Ack message type (11)
[ CSSD]2019-11-01 13:57:21.122 [3600] >TRACE: clssnmSetupAckWait: node(1) is ALIVE
[ CSSD]2019-11-01 13:57:21.122 [3600] >TRACE: clssnmSendSync: syncSeqNo(3)
[ CSSD]2019-11-01 13:57:21.123 [3600] >TRACE: clssnmWaitForAcks: Ack message type(11), ackCount(1)
[ CSSD]2019-11-01 13:57:21.123 [1801] >TRACE: clssnmHandleSync: diskTimeout set to (27000)ms
[ CSSD]2019-11-01 13:57:21.123 [1801] >TRACE: clssnmHandleSync: Acknowledging sync: src[1] srcName[p550a] seq[9] sync[3]
[ CSSD]2019-11-01 13:57:21.123 [1] >USER: NMEVENT_SUSPEND [00][00][00][06]
[ CSSD]2019-11-01 13:57:21.124 [3600] >TRACE: clssnmWaitForAcks: done, msg type(11)
[ CSSD]2019-11-01 13:57:21.124 [3600] >TRACE: clssnmDoSyncUpdate: Terminating node 2, p550b, misstime(30001) state(5), seedhbimpd(1)
[ CSSD]2019-11-01 13:57:21.124 [3600] >TRACE: clssnmSetupAckWait: Ack message type (13)
[ CSSD]2019-11-01 13:57:21.124 [3600] >TRACE: clssnmSetupAckWait: node(1) is ACTIVE
[ CSSD]2019-11-01 13:57:21.124 [3600] >TRACE: clssnmWaitForAcks: Ack message type(13), ackCount(1)
[ CSSD]2019-11-01 13:57:21.124 [1801] >TRACE: clssnmSendVoteInfo: node(1) syncSeqNo(3)
[ CSSD]2019-11-01 13:57:21.125 [3600] >TRACE: clssnmWaitForAcks: done, msg type(13)
[ CSSD]2019-11-01 13:57:21.125 [3600] >TRACE: clssnmCheckDskInfo: Checking disk info...
[ CSSD]2019-11-01 13:57:21.126 [3600] >TRACE: clssnmEvict: Start
[ CSSD]2019-11-01 13:57:21.126 [3600] >TRACE: clssnmEvict: Evicting node 2, p550b, birth 2, death 3, impendingrcfg 1, stateflags 0x1
[ CSSD]2019-11-01 13:57:21.126 [3600] >TRACE: clssnmWaitOnEvictions: Start
[ CSSD]2019-11-01 13:57:21.126 [3600] >TRACE: clssnmWaitOnEvictions: node 2, p550b, is not dead, seedhbimpd 1
[ CSSD]2019-11-01 13:57:21.126 [3600] >TRACE: clssnmCheckKillStatus: Node 2, p550b, down, LATS(727368467),timeout(14527)
[ CSSD]2019-11-01 13:57:21.126 [3600] >TRACE: clssnmSetupAckWait: Ack message type (15)
[ CSSD]2019-11-01 13:57:21.126 [3600] >TRACE: clssnmSetupAckWait: node(1) is ACTIVE
[ CSSD]2019-11-01 13:57:21.127 [3600] >TRACE: clssnmSendUpdate: syncSeqNo(3)
[ CSSD]2019-11-01 13:57:21.128 [3600] >TRACE: clssnmWaitForAcks: Ack message type(15), ackCount(1)
[ CSSD]2019-11-01 13:57:21.128 [1801] >TRACE: clssnmUpdateNodeState: node 0, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
[ CSSD]2019-11-01 13:57:21.128 [1801] >TRACE: clssnmUpdateNodeState: node 1, state (3/3) unique (1571860840/1571860840) prevConuni(0) birth (1/1) (old/new)
[ CSSD]2019-11-01 13:57:21.128 [1801] >TRACE: clssnmUpdateNodeState: node 2, state (5/0) unique (1571860861/1571860861) prevConuni(1571860861) birth (2/2) (old/new)
[ CSSD]2019-11-01 13:57:21.129 [1801] >TRACE: clssnmDeactivateNode: node 2 (p550b) left cluster
[ CSSD]2019-11-01 13:57:21.129 [1801] >USER: clssnmHandleUpdate: SYNC(3) from node(1) completed
[ CSSD]2019-11-01 13:57:21.129 [1801] >USER: clssnmHandleUpdate: NODE 1 (p550a) IS ACTIVE MEMBER OF CLUSTER
[ CSSD]2019-11-01 13:57:21.129 [1801] >TRACE: clssnmHandleUpdate: diskTimeout set to (200000)ms
[ CSSD]2019-11-01 13:57:21.130 [3600] >TRACE: clssnmWaitForAcks: done, msg type(15)
[ CSSD]2019-11-01 13:57:21.130 [3600] >TRACE: clssnmDoSyncUpdate: Sync 3 complete!
[ CSSD]2019-11-01 13:57:21.135 [3859] >TRACE: clssgmReconfigThread: started for reconfig (3)
[ CSSD]2019-11-01 13:57:21.135 [3859] >USER: NMEVENT_RECONFIG [00][00][00][02]
[ CSSD]2019-11-01 13:57:21.135 [3859] >TRACE: clssgmCleanupGrocks: cleaning up grock crs_version type 2
[ CSSD]2019-11-01 13:57:21.136 [3859] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(crs_version) birth(2/0)
[ CSSD]2019-11-01 13:57:21.136 [3859] >TRACE: clssgmCleanupGrocks: cleaning up grock ORA_CLSRD_1_lsyy type 2
[ CSSD]2019-11-01 13:57:21.136 [3859] >TRACE: clssgmCleanupGrocks: cleaning up grock ORA_CLSRD_1_lsyy type 3
[ CSSD]2019-11-01 13:57:21.136 [3859] >TRACE: clssgmCleanupGrocks: cleaning up grock ORA_CLSRD_2_lsyy type 2
[ CSSD]2019-11-01 13:57:21.136 [3859] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(0) grock(ORA_CLSRD_2_lsyy) birth(2/0)
[ CSSD]2019-11-01 13:57:21.136 [3859] >TRACE: clssgmCleanupGrocks: cleaning up grock ORA_CLSRD_2_lsyy type 3
[ CSSD]2019-11-01 13:57:21.137 [3859] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(0) grock(ORA_CLSRD_2_lsyy) birth(2/0)
[ CSSD]2019-11-01 13:57:21.137 [3859] >TRACE: clssgmCleanupGrocks: cleaning up grock DB+ASM type 2
[ CSSD]2019-11-01 13:57:21.137 [3859] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(DB+ASM) birth(2/0)
[ CSSD]2019-11-01 13:57:21.137 [3859] >TRACE: clssgmCleanupGrocks: cleaning up grock DG+ASM type 2
[ CSSD]2019-11-01 13:57:21.137 [3859] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(DG+ASM) birth(2/0)
[ CSSD]2019-11-01 13:57:21.138 [3859] >TRACE: clssgmCleanupGrocks: cleaning up grock DBLSYY type 2
[ CSSD]2019-11-01 13:57:21.138 [3859] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(DBLSYY) birth(2/0)
[ CSSD]2019-11-01 13:57:21.138 [3859] >TRACE: clssgmCleanupGrocks: cleaning up grock DGLSYY type 2
[ CSSD]2019-11-01 13:57:21.138 [3859] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(DGLSYY) birth(2/0)
[ CSSD]2019-11-01 13:57:21.139 [3859] >TRACE: clssgmCleanupGrocks: cleaning up grock DG_DATA type 2
[ CSSD]2019-11-01 13:57:21.139 [3859] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(2) grock(DG_DATA) birth(2/0)
[ CSSD]2019-11-01 13:57:21.139 [3859] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(3) grock(DG_DATA) birth(2/0)
[ CSSD]2019-11-01 13:57:21.140 [3859] >TRACE: clssgmCleanupGrocks: cleaning up grock OSM_ALL type 2
[ CSSD]2019-11-01 13:57:21.140 [3859] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(OSM_ALL) birth(2/0)
[ CSSD]2019-11-01 13:57:21.140 [3859] >TRACE: clssgmCleanupGrocks: cleaning up grock DAALL_DB type 2
[ CSSD]2019-11-01 13:57:21.140 [3859] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(2) grock(DAALL_DB) birth(2/0)
[ CSSD]2019-11-01 13:57:21.140 [3859] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(3) grock(DAALL_DB) birth(2/0)
[ CSSD]2019-11-01 13:57:21.140 [3859] >TRACE: clssgmCleanupGrocks: cleaning up grock CRSDMAIN type 2
[ CSSD]2019-11-01 13:57:21.140 [3859] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(2) grock(CRSDMAIN) birth(2/0)
[ CSSD]2019-11-01 13:57:21.141 [3859] >TRACE: clssgmCleanupGrocks: cleaning up grock EVMDMAIN type 2
[ CSSD]2019-11-01 13:57:21.141 [3859] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(2) grock(EVMDMAIN) birth(2/0)
[ CSSD]2019-11-01 13:57:21.141 [3859] >TRACE: clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_p550a type 3
[ CSSD]2019-11-01 13:57:21.141 [3859] >TRACE: clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_p550b type 3
[ CSSD]2019-11-01 13:57:21.142 [3859] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(0) grock(_ORA_CRS_MEMBER_p550b) birth(2/0)
[ CSSD]2019-11-01 13:57:21.142 [3859] >TRACE: clssgmCleanupGrocks: cleaning up grock IGLSYYALL type 2
[ CSSD]2019-11-01 13:57:21.142 [3859] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(2) grock(IGLSYYALL) birth(2/0)
[ CSSD]2019-11-01 13:57:21.142 [3859] >TRACE: clssgmCleanupGrocks: cleaning up grock ocr_crs type 2
[ CSSD]2019-11-01 13:57:21.142 [3859] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(2) grock(ocr_crs) birth(2/0)
[ CSSD]2019-11-01 13:57:21.143 [3859] >TRACE: clssgmCleanupGrocks: cleaning up grock #CSS_CLSSOMON type 2
[ CSSD]2019-11-01 13:57:21.143 [3859] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(2) grock(#CSS_CLSSOMON) birth(2/0)
[ CSSD]2019-11-01 13:57:21.144 [3859] >TRACE: clssgmEstablishConnections: 1 nodes in cluster incarn 3
[ CSSD]2019-11-01 13:57:21.144 [2829] >TRACE: clssgmPeerListener: connects done (1/1)
[ CSSD]2019-11-01 13:57:21.145 [3859] >TRACE: clssgmEstablishMasterNode: MASTER for 3 is node(1) birth(1)
[ CSSD]2019-11-01 13:57:21.145 [3859] >TRACE: clssgmMasterCMSync: Synchronizing group/lock status
[ CSSD]2019-11-01 13:57:21.146 [3859] >TRACE: clssgmMasterSendDBDone: group/lock status synchronization complete
[ CSSD]CLSS-3000: reconfiguration successful, incarnation 3 with 1 nodes
[ CSSD]CLSS-3001: local node number 1, master node number 1
alert_db2
Fri Nov 1 13:47:11 2019
Thread 2 advanced to log sequence 83093 (LGWR switch)
Current log# 11 seq# 83093 mem# 0: +DATA/lsyy/onlinelog/group211
Fri Nov 1 13:56:51 2019
Error: KGXGN aborts the instance (6)
Fri Nov 1 13:56:51 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/bdump/lsyy2_lmon_2490836.trc:
ORA-29702: error occurred in Cluster Group Service operation
LMON: terminating instance due to error 29702
Fri Nov 1 13:56:51 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/bdump/lsyy2_lms0_2818464.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Nov 1 13:56:51 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/bdump/lsyy2_lms1_3539096.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Nov 1 13:56:51 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/bdump/lsyy2_lms2_3276922.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Nov 1 13:56:51 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/bdump/lsyy2_lms3_3080900.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Nov 1 13:56:51 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/bdump/lsyy2_lmd0_2818208.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Nov 1 13:56:51 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/bdump/lsyy2_pmon_3211936.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Nov 1 13:56:52 2019
System state dump is made for local instance
System State dumped to trace file /oracle/product/10.2.0/db_1/admin/lsyy/bdump/lsyy2_diag_1114194.trc
Fri Nov 1 13:56:57 2019
Instance terminated by LMON, pid = 2490836
Fri Nov 1 14:04:39 2019
alert_db1
Fri Nov 1 13:55:06 2019
System State dumped to trace file /oracle/product/10.2.0/db_1/admin/lsyy/udump/lsyy1_ora_17433530.trc
Fri Nov 1 13:56:52 2019
Trace dumping is performing id=[cdmp_20191101135651]
Fri Nov 1 13:57:06 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/udump/lsyy1_ora_14614914.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-03135: connection lost contact
ORA-06512: at line 3
Fri Nov 1 13:57:06 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/udump/lsyy1_ora_6620096.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-03135: connection lost contact
ORA-06512: at line 3
Fri Nov 1 13:57:09 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/udump/lsyy1_ora_7274598.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-03135: connection lost contact
ORA-06512: at line 3
Fri Nov 1 13:57:10 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/udump/lsyy1_ora_11403798.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-03135: connection lost contact
ORA-06512: at line 3
Fri Nov 1 13:57:11 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/udump/lsyy1_ora_22937984.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-03135: connection lost contact
ORA-06512: at line 3
Fri Nov 1 13:57:11 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/udump/lsyy1_ora_11076094.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-03135: connection lost contact
ORA-06512: at line 3
Fri Nov 1 13:57:12 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/udump/lsyy1_ora_16384294.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-03135: connection lost contact
ORA-06512: at line 3
Fri Nov 1 13:57:12 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/udump/lsyy1_ora_8258520.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-03135: connection lost contact
ORA-06512: at line 3
Fri Nov 1 13:57:14 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/udump/lsyy1_ora_6554444.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-03135: connection lost contact
ORA-06512: at line 3
Fri Nov 1 13:57:18 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/udump/lsyy1_ora_10551480.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-03135: connection lost contact
ORA-06512: at line 3
Fri Nov 1 13:57:18 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/udump/lsyy1_ora_12910728.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-03135: connection lost contact
ORA-06512: at line 3
Fri Nov 1 13:57:19 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/udump/lsyy1_ora_23003910.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-03135: connection lost contact
ORA-06512: at line 3
Fri Nov 1 13:57:19 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/udump/lsyy1_ora_13763432.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-03135: connection lost contact
ORA-06512: at line 3
Fri Nov 1 13:57:20 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/udump/lsyy1_ora_16974330.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-03135: connection lost contact
ORA-06512: at line 3
Fri Nov 1 13:57:23 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/udump/lsyy1_ora_6620098.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-03135: connection lost contact
ORA-06512: at line 3
Fri Nov 1 13:57:24 2019
Reconfiguration started (old inc 4, new inc 6)
List of nodes:
0
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Fri Nov 1 13:57:24 2019
LMS 2: 5 GCS shadows cancelled, 2 closed
Fri Nov 1 13:57:24 2019
LMS 3: 8 GCS shadows cancelled, 4 closed
Fri Nov 1 13:57:24 2019
LMS 0: 4 GCS shadows cancelled, 0 closed
Fri Nov 1 13:57:24 2019
LMS 1: 2 GCS shadows cancelled, 1 closed
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Fri Nov 1 13:57:26 2019
Instance recovery: looking for dead threads
Fri Nov 1 13:57:26 2019
Beginning instance recovery of 1 threads
Fri Nov 1 13:57:26 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/udump/lsyy1_ora_12058732.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-03135: connection lost contact
ORA-06512: at line 3
Fri Nov 1 13:57:27 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/udump/lsyy1_ora_14614616.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-03135: connection lost contact
ORA-06512: at line 3
Fri Nov 1 13:57:27 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/udump/lsyy1_ora_21627848.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-03135: connection lost contact
ORA-06512: at line 3
Fri Nov 1 13:57:29 2019
LMS 0: 710688 GCS shadows traversed, 0 replayed
Fri Nov 1 13:57:29 2019
LMS 3: 722670 GCS shadows traversed, 0 replayed
Fri Nov 1 13:57:29 2019
LMS 2: 723550 GCS shadows traversed, 0 replayed
Fri Nov 1 13:57:29 2019
LMS 1: 723294 GCS shadows traversed, 0 replayed
Fri Nov 1 13:57:29 2019
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Fri Nov 1 13:57:32 2019
parallel recovery started with 15 processes
Fri Nov 1 13:57:33 2019
Started redo scan
Fri Nov 1 13:57:33 2019
Completed redo scan
12508 redo blocks read, 1428 data blocks need recovery
Fri Nov 1 13:57:33 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/udump/lsyy1_ora_23462652.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-03135: connection lost contact
ORA-06512: at line 3
Fri Nov 1 13:57:33 2019
Started redo application at
Thread 2: logseq 83093, block 16146
Fri Nov 1 13:57:33 2019
Recovery of Online Redo Log: Thread 2 Group 11 Seq 83093 Reading mem 0
Mem# 0: +DATA/lsyy/onlinelog/group211
Fri Nov 1 13:57:33 2019
Completed redo application
Fri Nov 1 13:57:33 2019
Completed instance recovery at
Thread 2: logseq 83093, block 28654, scn 22641184229
1037 data blocks read, 1536 data blocks written, 12508 redo blocks read
Fri Nov 1 13:57:34 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/bdump/lsyy1_asmb_3146160.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Fri Nov 1 13:57:34 2019
ASMB: terminating instance due to error 15064
Fri Nov 1 13:57:34 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/bdump/lsyy1_lms1_3081164.trc:
ORA-15064: communication failure with ASM instance
Fri Nov 1 13:57:34 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/bdump/lsyy1_lmon_2621854.trc:
ORA-15064: communication failure with ASM instance
Fri Nov 1 13:57:34 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/bdump/lsyy1_lms0_2883754.trc:
ORA-15064: communication failure with ASM instance
Fri Nov 1 13:57:34 2019
Errors in file /oracle/product/10.2.0/db_1/admin/lsyy/bdump/lsyy1_lms2_3539612.trc:
ORA-15064: communication failure with ASM instance
Instance terminated by ASMB, pid = 3146160
Fri Nov 1 14:05:55 2019
alert_asm2
NOTE: ASMB process exiting due to lack of ASM file activity
Fri Nov 1 13:56:51 2019
Errors in file /oracle/product/10.2.0/db_1/admin/+ASM/bdump/+asm2_gmon_2819014.trc:
ORA-29702: error occurred in Cluster Group Service operation
ORA-29702: error occurred in Cluster Group Service operation
Fri Nov 1 13:56:51 2019
GMON: terminating instance due to error 29702
Fri Nov 1 13:56:52 2019
System state dump is made for local instance
System State dumped to trace file /oracle/product/10.2.0/db_1/admin/+ASM/bdump/+asm2_diag_2949468.trc
Fri Nov 1 13:56:52 2019
Trace dumping is performing id=[cdmp_20191101135652]
Fri Nov 1 13:56:57 2019
Instance terminated by GMON, pid = 2819014
Fri Nov 1 14:04:10 2019
Starting ORACLE instance (normal)
alert_asm1
Starting background process ASMB
ASMB started with pid=21, OS id=6030044
Fri Nov 1 13:56:52 2019
Trace dumping is performing id=[cdmp_20191101135652]
Fri Nov 1 13:57:21 2019
Reconfiguration started (old inc 2, new inc 3)
List of nodes:
0
Global Resource Directory frozen
* dead instance detected - domain 1 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Fri Nov 1 13:57:21 2019
LMS 0: 0 GCS shadows cancelled, 0 closed
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Fri Nov 1 13:57:21 2019
NOTE: SMON starting instance recovery for group 1 (mounted)
NOTE: F1X0 found on disk 0 fcn 0.0
NOTE: starting recovery of thread=2 ckpt=580.4028 group=1
Fri Nov 1 13:57:21 2019
LMS 0: 5486 GCS shadows traversed, 0 replayed
Fri Nov 1 13:57:21 2019
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Fri Nov 1 13:57:21 2019
NOTE: advancing ckpt for thread=2 ckpt=580.4028
NOTE: smon did instance recovery for domain 1
Fri Nov 1 13:57:21 2019
NOTE: recovering COD for group 1/0x7fc555b1 (DATA)
SUCCESS: completed COD recovery for group 1/0x7fc555b1 (DATA)
Fri Nov 1 13:57:34 2019
Error: KGXGN aborts the instance (6)
Fri Nov 1 13:57:34 2019
Errors in file /oracle/product/10.2.0/db_1/admin/+ASM/bdump/+asm1_lmon_2622036.trc:
ORA-29702: error occurred in Cluster Group Service operation
LMON: terminating instance due to error 29702
Fri Nov 1 13:57:34 2019
System state dump is made for local instance
System State dumped to trace file /oracle/product/10.2.0/db_1/admin/+ASM/bdump/+asm1_diag_2687330.trc
Fri Nov 1 13:57:34 2019
Trace dumping is performing id=[cdmp_20191101135734]
Fri Nov 1 13:57:39 2019
Instance terminated by LMON, pid = 2622036
Fri Nov 1 14:05:36 2019
Starting ORACLE instance (normal)
awk_dump
[root@pr7 ~]# awk -f ass109.awk lsyy1_ora_17433530.trc
Starting Systemstate 1
..............................................................................
...............................................................................
...............................................................................
...............................................................................
...............................................................................
...............................................................................
...............................................................................
...............................................................................
...............................................................................
...............................................................................
....................................................
Ass.Awk Version 1.0.9 - Processing lsyy1_ora_17433530.trc
System State 1
~~~~~~~~~~~~~~~~
1:
2: waiting for 'pmon timer' wait
3: waiting for 'DIAG idle wait' wait
4: waiting for 'rdbms ipc message' wait
5: waiting for 'rdbms ipc message' wait
6: waiting for 'ges remote message' wait
7: waiting for 'gcs remote message' wait
8: waiting for 'gcs remote message' wait
9: waiting for 'gcs remote message' wait
10: waiting for 'gcs remote message' wait
11: waiting for 'rdbms ipc message' wait
12: waiting for 'rdbms ipc message' wait
13: waiting for 'rdbms ipc message' wait
14: waiting for 'rdbms ipc message' wait
15: waiting for 'rdbms ipc message' wait
16: waiting for 'smon timer' wait
17: waiting for 'rdbms ipc message' wait
18: waiting for 'rdbms ipc message' wait
19: waiting for 'rdbms ipc message' wait
20: waiting for 'rdbms ipc message' wait
21:
22:
23: waiting for 'rdbms ipc message' wait
24: waiting for 'jobq slave wait' wait
25: waiting for 'ASM background timer' wait
26: waiting for 'rdbms ipc message' wait
27: waiting for 'SQL*Net message from client' wait
28: waiting for 'SQL*Net message from client' wait
29: waiting for 'SQL*Net message from client' wait
30: waiting for 'SQL*Net message from client' wait
31: waiting for 'SQL*Net message from client' wait
Cmd: Select
32: waiting for 'SQL*Net message from client' wait
33: waiting for 'SQL*Net message from client' wait
34: waiting for 'SQL*Net message from client' wait
Cmd: Select
35: waiting for 'SQL*Net message from client' wait
Cmd: Select
36: waiting for 'SQL*Net message from client' wait
Cmd: Select
37: waiting for 'SQL*Net message from client' wait
Cmd: Select
38: waiting for 'SQL*Net message from client' wait
Cmd: Select
39: waiting for 'SQL*Net message from client' wait
Cmd: Select
40: waiting for 'SQL*Net message from client' wait
41: last wait for 'ksdxexeotherwait'
42: waiting for 'SQL*Net message from client' wait
Cmd: Select
43: waiting for 'rdbms ipc message' wait
44: waiting for 'rdbms ipc message' wait
45: waiting for 'SQL*Net message from client' wait
46: waiting for 'Streams AQ: qmn coordinator idle wait' wait
47: waiting for 'jobq slave wait' wait
48: waiting for 'jobq slave wait' wait
49: waiting for 'jobq slave wait' wait
50: waiting for 'jobq slave wait' wait
51: for 'Streams AQ: waiting for messages in the queue' wait
52: for 'Streams AQ: waiting for time management or cleanup tasks' wait
53: waiting for 'SQL*Net message from client' wait
54: waiting for 'Streams AQ: delete acknowledged messages' wait
55: waiting for 'SQL*Net message from client' wait
Cmd: Select
56: waiting for 'SQL*Net message from client' wait
Cmd: PL/SQL Execute
57: waiting for 'SQL*Net message from client' wait
Cmd: Select
(remaining processes omitted)
NO BLOCKING PROCESSES FOUND
3075273 Lines Processed.


Are there any errors in the operating system logs? Is there any monitoring on the servers, e.g. CPU and memory? Also, could you provide an AWR report from before the restart?


1. Following 章芋文's reminder, I adjusted the following two DRM parameters on the evening of November 1:
_gc_affinity_limit=250
_gc_affinity_minimum=10485760
I am now watching to see whether the hang recurs. I have not found solid evidence that DRM is the cause, so I can't say for certain that it is responsible. (A sketch of how such parameters can be staged in the spfile is included after this list.)
2. Running awk -f ass109.awk lsyy2_diag_1114194.trc against that file just hangs and never finishes; I don't know why. The most frequent wait in that file is 'gc current request'. (A rough grep-based way to tally the waits is sketched after this list.)
3. I tried to look for DRM-related errors in the LMD process trace files. They do contain "begin drm ... end drm" entries, but none of them fall within the failure window; I am not sure whether that is because the trace files could no longer be written once the system hung.
4. errpt shows nothing obviously wrong. At the OS level we run oswbb monitoring; the log files are in the attached 20191101.rar.
p550a_iostat_19.10.30.1300.dat
p550a_prvtnet_19.10.30.1300.dat
p550a_top_19.10.30.1300.dat
p550b_iostat_19.11.01.1300.dat
p550b_prvtnet_19.11.01.1300.dat
Nothing obvious stands out in those logs either. However, because the hardware is old, CPU and I/O usage at the system level are often high: CPU utilization sometimes reaches 90%, and the disks are frequently 100% busy.
5. I have re-uploaded some more logs, including the AWR report for 13:00–13:30 on November 1 (the failure occurred shortly after 13:50); because the operating system was rebooted, no AWR data was collected for the failure window itself. Two AWR reports from normal periods are also included.
Business load ramps up after 8:00 in the morning and again after 13:50 in the afternoon.
6. I have also uploaded some LMD process traces, mostly from the failure window.
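Regarding item 1: for the record, a minimal sketch of how the two underscore parameters can be staged in the spfile for both instances. These are my assumed commands, not a verified recommendation; hidden parameters should normally only be changed under Oracle Support guidance.

# Hedged sketch for item 1: put the two DRM-related parameters into the shared
# spfile for all instances; they take effect after the instances are restarted.
# Assumes local SYSDBA access on one of the nodes.
sqlplus -S / as sysdba <<'EOF'
ALTER SYSTEM SET "_gc_affinity_limit"=250 SCOPE=SPFILE SID='*';
ALTER SYSTEM SET "_gc_affinity_minimum"=10485760 SCOPE=SPFILE SID='*';
EOF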
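Regarding item 2: since ass109.awk never finishes on lsyy2_diag_1114194.trc, the wait events can be tallied straight from the system state dump with a plain grep on the same Linux box used above. This is only a rough sketch and approximates what ass109.awk would report.

# Hedged workaround for item 2: count occurrences of each "waiting for '<event>'"
# line in the diag trace; the top entry should match the 'gc current request'
# observation above.
grep -o "waiting for '[^']*'" lsyy2_diag_1114194.trc | sort | uniq -c | sort -rn | head -20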


The AWR report upload does not support versions below 11.2.0.3, so I have put the reports into the newly uploaded attachment instead.


