作者:Digital Observer(施嘉伟)
Oracle ACE Pro: Database
PostgreSQL ACE Partner
11年数据库行业经验,现主要从事数据库服务工作
拥有Oracle OCM、DB2 10.1 Fundamentals、MySQL 8.0 OCP、WebLogic 12c OCA、KCP、PCTP、PCSD、PGCM、OCI、PolarDB技术专家、达梦师资认证、数据安全咨询高级等认证
ITPUB认证专家、PolarDB开源社区技术顾问、HaloDB技术顾问、TiDB社区技术布道师、青学会MOP技术社区专家顾问、国内某高校企业实践指导教师
公众号/墨天轮:Digital Observer;CSDN/PGfans:施嘉伟;ITPUB:sjw1933
客户反馈对Oracle 19C RAC集群进行参数修改时,重启生效时出现报错ORA-27154,ORA-27300等报错,报错处理完成后出现节点一DB启动hang问题。
修改 ASM process=1000及 DB process=30000,启动数据库报错
ORA-27154: post/wait create failed
ORA-27300: OS system dependent operation:semget failed with status: 28
ORA-27301: OS failure message: No space left on device
ORA-27302: failure occurred at: sskgpcreates
检查操作系统参数
kernel.shmall = 268435456
kernel.shmmax = 549755813888
fs.file-max = 6815744
kernel.shmmni = 4096
kernel.sem = 250 32000 100 128 ####
fs.aio-max-nr = 3048576
net.ipv4.conf.all.rp_filter = 2
net.ipv4.conf.default.rp_filter = 2
net.ipv4.ip_local_port_range = 9000 65500
kernel.panic_on_oops = 1
net.core.rmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.core.wmem_max = 1048576
信号量(Semaphore),有时被称为信号灯,是在多线程环境下使用的一种设施,它负责协调各个线程, 以保证它们能够正确、合理的使用公共资源。
由于参数修改导致超过系统参数上限制 32000导致启动报错。
调整db processes=25000
重启单节点db hang
节点一启动DB ,日志显示如下,且长时间无响应。
检查节点一alert日志
File Name or Source
-------------------------
alert_crm1.log
Description
--------------
DB Alert Log -- Node 1
Relevant Information Collection
---------------------------------------
2023-12-04T22:19:03.865956+08:00
Shutting down ORACLE instance (immediate) (OS id: 136267)
2023-12-04T22:19:03.867555+08:00
Shutdown is initiated by oraagent.bin@dbcrmserv1 (TNS V1-V3).
2023-12-04T22:19:21.948959+08:00
Entire Buffer Cache invalidation started
2023-12-04T22:23:14.743479+08:00
Entire Buffer Cache invalidation completed
Shutting down archive processes
Archiving is disabled
2023-12-04T22:26:09.314673+08:00
Instance shutdown complete (OS id: 136267)
2023-12-04T22:27:51.936189+08:00
Starting ORACLE instance (normal) (OS id: 185824)
2023-12-04T22:27:52.038442+08:00
****************************************************
Sys-V shared memory will be used for creating SGA
****************************************************
……此处省略
Increasing priority of 32 RS
* Setting GES domain 0
Attached to domain 0 (addr: 0x1184620388)
Reconfiguration started (old inc 0, new inc 12)
Dynamic remastering is disabled
List of instances (total 2) :
1 2
My inst 1 (I'm a new instance)
Global Resource Directory frozen
Enabling Dynamic Remastering: NONE->NORM switch
Communication channels reestablished
2023-12-04T22:28:07.491225+08:00
Increasing priority of 32 RS
* Setting GES domain 0
Attached to domain 0 (addr: 0x1184620388)
Reconfiguration started (old inc 0, new inc 12)
Dynamic remastering is disabled
List of instances (total 2) :
1 2
My inst 1 (I'm a new instance)
Global Resource Directory frozen
Enabling Dynamic Remastering: NONE->NORM switch
Communication channels reestablished
2023-12-04T22:28:07.540099+08:00
* domain 0 valid = 1 (flags x8820, pdb flags x8000) according to instance 2
2023-12-04T22:28:07.569867+08:00
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
2023-12-04T22:28:07.622630+08:00
LMS 7: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
2023-12-04T22:28:07.622640+08:00
LMS 16: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
2023-12-04T22:28:07.622641+08:00
LMS 11: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
2023-12-04T22:28:07.622645+08:00
LMS 3: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
2023-12-04T22:28:07.622648+08:00
LMS 17: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
2023-12-04T22:28:07.622651+08:00
LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
2023-12-04T22:28:07.622653+08:00
LMS 5: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
2023-12-04T22:28:07.622655+08:00
LMS 27: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
2023-12-04T22:28:07.622658+08:00
LMS 26: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
2023-12-04T22:28:07.622661+08:00
……此处省略
LMS 28: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
2023-12-04T22:28:07.622929+08:00
LMS 12: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
2023-12-04T22:28:07.622935+08:00
LMS 19: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
2023-12-04T22:28:07.622946+08:00
LMS 22: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
2023-12-04T22:28:07.622947+08:00
LMS 23: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
2023-12-04T22:28:07.622948+08:00
LMS 8: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
2023-12-04T22:28:07.622951+08:00
LMS 20: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
2023-12-04T22:28:07.622953+08:00
LMS 15: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
2023-12-04T22:28:07.622962+08:00
LMS 29: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
2023-12-04T22:28:07.622965+08:00
LMS 21: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
Set master node info
2023-12-04T22:28:08.519434+08:00
Dwn-cvts replayed, VALBLKs dubious
2023-12-04T22:28:08.725648+08:00
All grantable enqueues granted ★★★ 在这儿卡住, 看起来在reconfiguration阶段出现hang
2023-12-04T22:34:34.149170+08:00
Shutting down ORACLE instance (abort) (OS id: 1906)
2023-12-04T22:34:34.149244+08:00
Shutdown is initiated by sqlplus@dbcrmserv1 (TNS V1-V3).
License high water mark = 10
USER (ospid: 1906): terminating the instance
2023-12-04T22:34:35.166987+08:00
Instance terminated by USER, pid = 1906
2023-12-04T22:34:35.576929+08:00
Instance shutdown complete (OS id: 1906)
2023-12-04T22:45:01.248337+08:00
Starting ORACLE instance (normal) (OS id: 15464)
2023-12-04T22:45:18.602905+08:00
Increasing priority of 32 RS
* Setting GES domain 0
Attached to domain 0 (addr: 0x1184620388)
Reconfiguration started (old inc 0, new inc 4)
Dynamic remastering is disabled
List of instances (total 2) :
1 2
My inst 1 (I'm a new instance)
Global Resource Directory frozen
Enabling Dynamic Remastering: NONE->NORM switch
Communication channels reestablished
2023-12-04T22:45:18.660518+08:00
* domain 0 valid = 1 (flags x8000, pdb flags x8000) according to instance 2
2023-12-04T22:45:18.691300+08:00
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
2023-12-04T22:45:18.751986+08:00
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
2023-12-04T22:45:18.752329+08:00
LMS 14: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
……此处省略
2023-12-04T22:45:18.752487+08:00
LMS 24: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
Set master node info
Dwn-cvts replayed, VALBLKs dubious
2023-12-04T22:45:19.200830+08:00
All grantable enqueues granted ★★★ 这次顺利通过
2023-12-04T22:45:19.717210+08:00
Reconfiguration complete (total time 1.1 secs) ★★★ Reconfiguration 完成
Decreasing priority of 32 RS
2023-12-04T22:45:19.717603+08:00
Starting background process LCK0
2023-12-04T22:45:20.083120+08:00
NOTE: MARK has subscribed
ALTER DATABASE MOUNT /* db agent *//* {1:22401:2} */ ★★★ MOUNT
2023-12-04T22:45:25.108279+08:00
Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE)
Lost write protection disabled
.... (PID:16426): Using STANDBY_ARCHIVE_DEST parameter default value as +ARCH/archivelog [krsd.c:18221]
Completed: ALTER DATABASE MOUNT /* db agent *//* {1:22401:2} */ ★★★ MOUNT Completed
ALTER DATABASE OPEN /* db agent *//* {1:22401:2} */
2023-12-04T22:45:29.345239+08:00
CJQ0 started with pid=228, OS id=16747
Completed: ALTER DATABASE OPEN /* db agent *//* {1:22401:2} */ ★★★ OPEN Completed
通过节点一db alert日志发现:
2023-12-04 22:28:08 instance启动卡住, 初步判断在reconfiguration阶段出现hang。
检查同一时间段节点二alert日志,无异常
-- Log File --
File Name or Source
-------------------------
alert_crm2.log
Description
--------------
DB Alert Log -- Node 2
Relevant Information Collection
---------------------------------------
2023-12-04T22:24:17.604166+08:00
Increasing priority of 32 RS
Reconfiguration started (old inc 8, new inc 10)
List of instances (total 1) :
2
Dead instances (total 1) :
1
My inst 2
2023-12-04T22:24:22.045343+08:00
Reconfiguration complete (total time 4.4 secs)
2023-12-04T22:28:07.491040+08:00
Reconfiguration started (old inc 10, new inc 12)
List of instances (total 2) :
1 2
New instances (total 1) :
1
My inst 2
2023-12-04T22:28:07.638190+08:00
LMS 31: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
Set master node info
Dwn-cvts replayed, VALBLKs dubious
2023-12-04T22:28:08.767645+08:00
All grantable enqueues granted
2023-12-04T22:34:02.334571+08:00
minact-scn: useg scan erroring out with error e:12751
2023-12-04T22:34:34.545353+08:00
Reconfiguration started (old inc 12, new inc 14)
List of instances (total 1) :
2
Dead instances (total 1) :
1
My inst 2
Nested reconfiguration detected.
publish big name space - dead or down/up instance detected, invalidate domain 0
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
2023-12-04T22:34:39.383865+08:00
Reconfiguration complete (total time 4.8 secs)
2023-12-04T22:34:56.682166+08:00
Shutting down ORACLE instance (immediate) (OS id: 5996)
2023-12-04T22:34:56.682242+08:00
Shutdown is initiated by sqlplus@dbcrmserv2 (TNS V1-V3).
2023-12-04T22:37:22.743618+08:00
Instance shutdown complete (OS id: 5996)
2023-12-04T22:37:48.215771+08:00
Starting ORACLE instance (normal) (OS id: 9074)
2023-12-04T22:38:04.819147+08:00
Increasing priority of 32 RS
* Setting GES domain 0
Attached to domain 0 (addr: 0x1184620388)
Reconfiguration started (old inc 0, new inc 2)
Dynamic remastering is disabled
List of instances (total 1) :
2
My inst 2 (I'm a new instance)
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
2023-12-04T22:38:04.898179+08:00
LMS 9: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
2023-12-04T22:38:04.898179+08:00
LMS 30: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
Set master node info
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
2023-12-04T22:38:05.465367+08:00
Reconfiguration complete (total time 0.6 secs)
2023-12-04T22:38:17.210594+08:00
CJQ0 started with pid=227, OS id=10441
Completed: ALTER DATABASE OPEN /* db agent *//* {2:33869:22894} */
查看节点一lmon日志
- Log File --
File Name or Source
-------------------------
crm1_lmon_189071.trc
Description
--------------
Relevant Information Collection
---------------------------------------
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.19.0.0.0
Build label: RDBMS_19.19.0.0.0DBRU_LINUX.X64_230321.1
ORACLE_HOME: /oracle/app/product/19c/db_1
System name: Linux
Node name: dbcrmserv1
Release: 5.4.17-2011.6.2.el7uek.x86_64
Version: #2 SMP Thu Sep 3 14:09:14 PDT 2020
Machine: x86_64
Instance name: crm1
Redo thread mounted by this instance: 0 <none>
Oracle process number: 22
Unix process pid: 189071, image: oracle@dbcrmserv1 (LMON)
*** 2023-12-04T22:28:06.972084+08:00
kjxgmrcfg: Reconfiguration started, type 1 ★★★ 实例启动
CGS/IMR TIMEOUTS:
CSS recovery timeout = 31 sec (Total CSS waittime = 65)
IMR Reconfig timeout = 75 sec
CGS rcfg timeout = 85 sec
kjxgmcs: Setting state to 0 0.
2023-12-04 22:28:07.510 : * Begin lmon rcfg step KJGA_RCFG_BEGIN (kjidomena 0, rcfginfo x0)
Reconfiguration started (old inc 0, new inc 12)
2023-12-04 22:28:07.512 : * Begin lmon rcfg step KJGA_RCFG_FREEZE
2023-12-04 22:28:07.514 : * Begin lmon rcfg step KJGA_RCFG_COMM
2023-12-04 22:28:07.519 : * Begin lmon rcfg step KJGA_RCFG_EXCHANGE (kjidomena 0, hvmaster 2)
2023-12-04 22:28:07.569 : * Begin lmon rcfg step KJGA_RCFG_ENQCLEANUP
2023-12-04 22:28:07.571 : * Begin lmon rcfg step KJGA_RCFG_CLEANUP
2023-12-04 22:28:07.655 : * Begin lmon rcfg step KJGA_RCFG_TIMERQ
2023-12-04 22:28:07.655 : * Begin lmon rcfg step KJGA_RCFG_DDQ
2023-12-04 22:28:07.655 : * Begin lmon rcfg step KJGA_RCFG_SETMASTER
2023-12-04 22:28:08.323 : * Begin lmon rcfg step KJGA_RCFG_ENQREPLAY
2023-12-04 22:28:08.463 : * Begin lmon rcfg step KJGA_RCFG_ENQDUBIOUS
2023-12-04 22:28:08.519 : * Begin lmon rcfg step KJGA_RCFG_ENQGRANT
2023-12-04 22:28:08.768 : * Begin lmon rcfg step KJGA_RCFG_PCMREPLAY ★★★ 应该是在这个阶段出现了hang
*** 2023-12-04T22:29:07.562705+08:00
KJC Communication Summary Dump:
Advisory codes:
0x1 tkts on the wire are more than half of the available tkts
0x2 available ticket below threshold
0x4 sp_rsv ticket low
0x8 flushing error, detected flush delay in flush records
tkt_total = total # of tickets assigned to this send proxy
avl = tickets available for all senders
sp_rsv = send proxy reserved tickets low
real_avl = count of all tickets available for send proxy
tkt_total=1000 (min=1000) avl=750 sp_rsv=250 max_sp_rsv=250 ★★★ 可用的 ticket=750
tkt_total=1000 (min=1000) avl=750 sp_rsv=250 max_sp_rsv=250
tkt_total=1000 (min=1000) avl=750 sp_rsv=250 max_sp_rsv=250
tkt_total=1000 (min=1000) avl=750 sp_rsv=250 max_sp_rsv=250
tkt_total=1000 (min=1000) avl=750 sp_rsv=250 max_sp_rsv=250
tkt_total=1000 (min=1000) avl=750 sp_rsv=250 max_sp_rsv=250
tkt_total=1000 (min=1000) avl=750 sp_rsv=250 max_sp_rsv=250
tkt_total=1000 (min=1000) avl=750 sp_rsv=250 max_sp_rsv=250
tkt_total=1000 (min=1000) avl=750 sp_rsv=250 max_sp_rsv=250
tkt_total=1000 (min=1000) avl=750 sp_rsv=250 max_sp_rsv=250
tkt_total=1000 (min=1000) avl=750 sp_rsv=250 max_sp_rsv=250
tkt_total=1000 (min=1000) avl=750 sp_rsv=250 max_sp_rsv=250
tkt_total=1000 (min=1000) avl=750 sp_rsv=250 max_sp_rsv=250
tkt_total=1000 (min=1000) avl=750 sp_rsv=250 max_sp_rsv=250
tkt_total=1000 (min=1000) avl=750 sp_rsv=250 max_sp_rsv=250
tkt_total=1000 (min=1000) avl=750 sp_rsv=250 max_sp_rsv=250
tkt_total=1000 (min=1000) avl=750 sp_rsv=249 max_sp_rsv=250
tkt_total=1000 (min=1000) avl=750 sp_rsv=249 max_sp_rsv=250
tkt_total=1000 (min=1000) avl=750 sp_rsv=249 max_sp_rsv=250
tkt_total=1000 (min=1000) avl=750 sp_rsv=249 max_sp_rsv=250
数据库实例重启后,加入集群式需要进行reconfiguration,在lmon trace中看到,reconfiguration执行到"KJGA_RCFG_PCMREPLAY "后,就没有继续往下走。
但是除此之外,没有看到其它的异常。暂时,也没有找到类似的bug。
继续排查oswatch日志
-- Log File --
File Name or Source
-------------------------
dbcrmserv1_netstat_23.12.04.2200.dat
Description
--------------
netstat output -- 节点1
Relevant Information Collection
---------------------------------------
zzz ***Mon Dec 4 22:00:19 CST 2023
IpReasmFails 3342138 0.0
zzz ***Mon Dec 4 22:00:49 CST 2023
IpReasmFails 3342138 0.0
zzz ***Mon Dec 4 22:01:19 CST 2023
IpReasmFails 3342138 0.0
zzz ***Mon Dec 4 22:01:50 CST 2023
IpReasmFails 3342138 0.0
zzz ***Mon Dec 4 22:02:20 CST 2023
IpReasmFails 3342138 0.0
...
zzz ***Mon Dec 4 22:26:51 CST 2023
IpReasmFails 3342138 0.0
zzz ***Mon Dec 4 22:27:21 CST 2023
IpReasmFails 3342138 0.0
zzz ***Mon Dec 4 22:27:51 CST 2023
IpReasmFails 3342138 0.0
zzz ***Mon Dec 4 22:28:21 CST 2023 ★★★ 从这个时间点开始,IpReasmFails 指标开始增长,且增长速度很快
IpReasmFails 3438192 0.0
zzz ***Mon Dec 4 22:28:51 CST 2023
IpReasmFails 3742778 0.0
zzz ***Mon Dec 4 22:29:21 CST 2023
IpReasmFails 4047921 0.0
zzz ***Mon Dec 4 22:29:51 CST 2023
IpReasmFails 4353455 0.0
zzz ***Mon Dec 4 22:30:21 CST 2023
IpReasmFails 4658519 0.0
zzz ***Mon Dec 4 22:30:51 CST 2023
IpReasmFails 4960593 0.0
zzz ***Mon Dec 4 22:31:21 CST 2023
IpReasmFails 5256155 0.0
zzz ***Mon Dec 4 22:31:51 CST 2023
IpReasmFails 5551935 0.0
zzz ***Mon Dec 4 22:32:21 CST 2023
IpReasmFails 5846746 0.0
zzz ***Mon Dec 4 22:32:51 CST 2023
IpReasmFails 6137016 0.0
zzz ***Mon Dec 4 22:33:21 CST 2023
IpReasmFails 6426802 0.0
zzz ***Mon Dec 4 22:33:51 CST 2023
IpReasmFails 6715219 0.0
zzz ***Mon Dec 4 22:34:21 CST 2023
IpReasmFails 7005903 0.0
zzz ***Mon Dec 4 22:34:51 CST 2023
IpReasmFails 7121921 0.0
zzz ***Mon Dec 4 22:35:21 CST 2023
IpReasmFails 7121922 0.0
zzz ***Mon Dec 4 22:35:51 CST 2023
IpReasmFails 7121922 0.0
zzz ***Mon Dec 4 22:36:21 CST 2023
IpReasmFails 7121922 0.0
zzz ***Mon Dec 4 22:36:51 CST 2023
IpReasmFails 7121922 0.0
zzz ***Mon Dec 4 22:37:21 CST 2023
IpReasmFails 7121922 0.0
zzz ***Mon Dec 4 22:37:51 CST 2023
IpReasmFails 7121922 0.0
zzz ***Mon Dec 4 22:38:21 CST 2023
IpReasmFails 7121922 0.0
zzz ***Mon Dec 4 22:38:51 CST 2023
IpReasmFails 7121922 0.0
zzz ***Mon Dec 4 22:39:21 CST 2023
IpReasmFails 7121922 0.0
zzz ***Mon Dec 4 22:39:51 CST 2023
IpReasmFails 7121922 0.0
zzz ***Mon Dec 4 22:40:21 CST 2023
IpReasmFails 7121922 0.0
zzz ***Mon Dec 4 22:40:51 CST 2023
IpReasmFails 7121922 0.0
zzz ***Mon Dec 4 22:41:22 CST 2023
IpReasmFails 7121922 0.0
zzz ***Mon Dec 4 22:41:52 CST 2023
IpReasmFails 7121922 0.0
zzz ***Mon Dec 4 22:42:22 CST 2023
IpReasmFails 7121922 0.0
zzz ***Mon Dec 4 22:42:52 CST 2023
IpReasmFails 7121922 0.0
zzz ***Mon Dec 4 22:43:22 CST 2023
IpReasmFails 7121922 0.0
zzz ***Mon Dec 4 22:43:52 CST 2023
IpReasmFails 7121922 0.0
zzz ***Mon Dec 4 22:44:22 CST 2023
IpReasmFails 7121922 0.0
zzz ***Mon Dec 4 22:44:52 CST 2023
IpReasmFails 7121922 0.0
zzz ***Mon Dec 4 22:45:22 CST 2023
IpReasmFails 7121941 0.0
zzz ***Mon Dec 4 22:45:52 CST 2023
IpReasmFails 7122149 0.0
zzz ***Mon Dec 4 22:46:22 CST 2023
IpReasmFails 7122149 0.0
Comment
----------------
IpReasmFails指数开始增加的时间,刚好是数据库实例启动,并进行reconfiguration的时段。
From DB alert log:
2023-12-04T22:28:08.519434+08:00
Dwn-cvts replayed, VALBLKs dubious
2023-12-04T22:28:08.725648+08:00
All grantable enqueues granted ★★★ 在这儿卡住
2023-12-04T22:34:34.149170+08:00
Shutting down ORACLE instance (abort) (OS id: 1906)
可以发现IpReasmFails指数开始增加的时间,刚好是数据库实例启动,并进行reconfiguration的时段。
通过OSWatcher的netstat输出发现,数据库在进行reconfiguration的时段,IpReasmFails指标的增长速度非常快。
From dbcrmserv1_netstat_23.12.04.2200.dat:
------------------------------------------------------------------------
zzz ***Mon Dec 4 22:26:51 CST 2023
IpReasmFails 3342138 0.0
zzz ***Mon Dec 4 22:27:21 CST 2023
IpReasmFails 3342138 0.0
zzz ***Mon Dec 4 22:27:51 CST 2023
IpReasmFails 3342138 0.0
zzz ***Mon Dec 4 22:28:21 CST 2023 ★★★ 从这个时间点开始,IpReasmFails 指标开始增长,且增长速度很快
IpReasmFails 3438192 0.0
zzz ***Mon Dec 4 22:28:51 CST 2023
IpReasmFails 3742778 0.0
zzz ***Mon Dec 4 22:29:21 CST 2023
IpReasmFails 4047921 0.0
zzz ***Mon Dec 4 22:29:51 CST 2023
IpReasmFails 4353455 0.0
再去查看节点一alert日志
From alert_crm1.log:
------------------------------------------------------------------------
2023-12-04T22:27:51.936189+08:00
Starting ORACLE instance (normal) (OS id: 185824) ★★★ 数据库实例启动
2023-12-04T22:28:07.491225+08:00
Increasing priority of 32 RS
* Setting GES domain 0
Attached to domain 0 (addr: 0x1184620388)
Reconfiguration started (old inc 0, new inc 12) ★★★ Reconfiguration开始
Dynamic remastering is disabled
List of instances (total 2) :
1 2
My inst 1 (I'm a new instance)
2023-12-04T22:28:08.519434+08:00
Dwn-cvts replayed, VALBLKs dubious
2023-12-04T22:28:08.725648+08:00
All grantable enqueues granted ★★★ 在这儿卡住
2023-12-04T22:34:34.149170+08:00
Shutting down ORACLE instance (abort) (OS id: 1906)
------------------------------------------------------------------------
可以看到,上面的2个现象都发生在 “2023-12-04T22:28” 左右,时间非常吻合。IpReasmFails指标的增长表明,集群私网通信有异常。所以,这个问题应当跟私网通信有关系。
解决办法
请联系网络或系统管理员,启用Jumbo Frame。(把MTU设置为9000)
启用Jumbo Frame可以缓解 IpReasmFails 增长的现象。
参考文档:
Recommendation for the Real Application Cluster Interconnect and Jumbo Frames ( Doc ID 341788.1 )
Oracle Linux: How to Change MTU Size ( Doc ID 2520148.1 )
应急预案
对于问题二:
数据库正常运行不受影响,若MTU调整后还是出现这个问题,那么在数据库执行hang analyze。
ACTION PLAN
============
1. Please run the following on one instance as sysdba:
SQL>oradebug setmypid
SQL>oradebug unlimit
SQL>oradebug -g all hanganalyze 3
Wait for 30 seconds
SQL>oradebug -g all hanganalyze 3
SQL>exit
2. Open another session as sysdba:
SQL>oradebug setmypid
SQL>oradebug unlimit
SQL>oradebug -g all dump systemstate 258
Wait for 30 seconds
SQL>oradebug -g all dump systemstate 258
SQL>oradebug tracefile_name
SQL>exit
可以通过关闭故障节点集群服务保障业务运行。在定停机窗口进行RAC所有节点集群重启。如果没有停机窗口,可以尝试通过重启故障节点主机。





