大家好,这里是公众号 DBA学习之路,分享一些学习国产数据库路上的知识和经验。
目录
前言
半夜 11 点多接到电话,说有一个客户的生产库宕了一个节点,需要紧急恢复一下。赶紧下床打开电脑,一顿操作连上了客户的数据库主机,是一套 19C(19.22) RAC 2 节点数据库,看了下是节点 2 宕掉了,集群起不来了。
经过和客户的沟通之后,解决了这个问题,本文记录一下分析与解决过程,分享给大家一起进步。
问题分析
连上主机之后,看了下节点 2 的集群状态,发现是启动集群卡在了 ora.storage
起不来:
通过
crsctl stat res -t -init
检查启动进度时,我们可以看到在启动ora.storage
资源时会 HANG 住,等待一顿时间后就会超时。
ora.storage 1 ONLINE OFFLINE STABLE --------------------------------------------------------------------------------
查看 ohasd.trc
日志(已排除存储相关问题):
2025-02-19 00:23:02.300 : CRSPE:2791253760: [ INFO] {0:1:29} Got agent-specific msg: ORA-12541: TNS:no listener
ORA-12541: TNS:no listener
2025-02-19 00:23:02.300 : AGFW:2803861248: [ INFO] {0:1:29} Agfw Proxy Server sending the reply to PE for message:RESOURCE_START[ora.storage 1 1] ID 4
098:1471
2025-02-19 00:23:02.300 : CRSPE:2791253760: [ INFO] {0:1:29} Received reply to action [Start] message ID: 1471
2025-02-19 00:23:02.300 : CRSPE:2791253760: [ INFO] {0:1:29} Got agent-specific msg: CRS-5055: unable to connect to an ASM instance because no ASM inst
ance is running in the cluster
2025-02-19 00:23:02.300 : CRSPE:2791253760: [ INFO] {0:1:29} Received reply to action [Start] message ID: 1471
2025-02-19 00:23:02.301 : CRSMAIN:2791253760: [ NONE] {0:1:29} {0:1:29} Created alert : (:CRSPE00221:) : Start action timed out!
2025-02-19 00:23:02.301 : CRSPE:2791253760: [ INFO] {0:1:29} Start action failed with error code: 3
2025-02-19 00:23:02.301 : CRSRPT:2789152512: [ INFO] {0:1:29} Published to EVM CRS_ACTION_FAILURE for ora.storage
可以看到报错是 ORA-12541: TNS:no listener
,查看 ohasd_orarootagent_root.trc
日志:
025-02-19 00:14:23.824 : USRTHRD:3772593920: [ INFO] {0:1:29} [ora.storage] Error [kgfoAl06] in [kgfokge] at kgfo.c:3182
2025-02-19 00:14:23.824 : USRTHRD:3772593920: [ INFO] {0:1:29} [ora.storage] ORA-12541: TNS:no listener
ORA-12541: TNS:no listener
2025-02-19 00:14:23.824 : USRTHRD:3772593920: [ INFO] {0:1:29} [ora.storage] Category: 7
2025-02-19 00:14:23.824 : USRTHRD:3772593920: [ INFO] {0:1:29} [ora.storage] DepInfo: 12541
2025-02-19 00:14:23.824 : USRTHRD:3772593920: [ INFO] {0:1:29} [ora.storage] ADR is not properly configured
2025-02-19 00:14:23.824 : USRTHRD:3772593920: [ INFO] {0:1:29} [ora.storage] -- trace dump end --
2025-02-19 00:14:23.824 : USRTHRD:3772593920: [ INFO] {0:1:29} Thread:kgfoCheckMountExt isRunning is reset to false here
2025-02-19 00:14:23.824 : USRTHRD:3772593920: [ INFO] {0:1:29} Thread:kgfoCheckMountExt isFinished set to true
2025-02-19 00:14:23.824 : USRTHRD:3764188928: [ INFO] {0:1:29} Thread:kgfoCheckMountExt Tasklet::doTask m_cv.timewait returned
2025-02-19 00:14:23.824 : USRTHRD:3764188928: [ INFO] {0:1:29} Thread:kgfoCheckMountExt Tasklet::doTask(kgfoCheckMountExt) executed in 9 seconds
2025-02-19 00:14:23.824 :CLSDYNAM:3764188928: [ora.storage]{0:1:29} [start] StorageAgent::parsekgforetcodes retcode = 7, kgfoCheckMount(CRS), flag 2
2025-02-19 00:14:23.824 :CLSDYNAM:3764188928: [ora.storage]{0:1:29} [start] (null) category: 7, operation: kgfoAl06, loc: kgfokge, OS error: 12541, other: OR
A-12541: TNS:no listener
ORA-12541: TNS:no listener
2025-02-19 00:14:23.824 :CLSDYNAM:3764188928: [ora.storage]{0:1:29} [start] StorageAgent::check 260 kgfo returncode 1
2025-02-19 00:14:23.824 :CLSDYNAM:3764188928: [ora.storage]{0:1:29} [start] (:CLSN00140:)StorageAgent::check 300 parsekgforretcodes OCR dgName CRS state 1
2025-02-19 00:14:23.824 :CLSDYNAM:3764188928: [ora.storage]{0:1:29} [start] Storage::start waiting for check to not return PARTIAL or UNPLANNED_OFFLINE 1
2025-02-19 00:14:23.827 :CLSDYNAM:3764188928: [ora.storage]{0:1:29} [start] CssGroup::regis registration of groupName:ASM_NETWORK mbrid:-1 m_grpNum:2 regType
:1 succeeded
可以发现再启动 ora.storage
时报错 ORA-12541: TNS:no listener
,连接私网监听 ASM_NETWORK 报错,看到这里我也是有点懵逼的,在 MOS 上也搜不到相关问题记录。
在网上搜索到两篇问题现象比较相近的文章:
看了一下,大致原因都是人工重启主机导致,需要重新加载一下私网监听 ASMNET1LSNR_ASM。
随后咨询了一下客户,果然是节点 2 实例 crash 之后无法启动,执行 reboot 重启主机没有反应,所以使用管理端强制重启了主机,主机起来之后就一直 hang 在这里。
那看来现象基本吻合,看一下节点 1 的私网监听 ASMNET1LSNR_ASM 状态:
[grid@mesdb1 ~]$ lsnrctl stat ASMNET1LSNR_ASM
LSNRCTL for Linux: Version 19.0.0.0.0 - Production on 19-FEB-2025 00:23:31
Copyright (c) 1991, 2023, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM)))
STATUS of the LISTENER
------------------------
Alias ASMNET1LSNR_ASM
Version TNSLSNR for Linux: Version 19.0.0.0.0 - Production
Start Date 31-JAN-2025 18:18:39
Uptime 18 days 6 hr. 4 min. 51 sec
Trace Level off
Security ON: Local OS Authentication
SNMP OFF
Listener Parameter File /u01/app/19.3.0/grid/network/admin/listener.ora
Listener Log File /u01/app/grid/diag/tnslsnr/mesdb1/asmnet1lsnr_asm/alert/log.xml
Listening Endpoints Summary...
(DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=ASMNET1LSNR_ASM)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=11.11.11.1)(PORT=1525)))
Services Summary...
Service "+ASM" has 2 instance(s).
Instance "+ASM1", status READY, has 2 handler(s) for this service...
Instance "+ASM2", status READY, has 1 handler(s) for this service...
Service "+ASM_CRS" has 2 instance(s).
Instance "+ASM1", status READY, has 2 handler(s) for this service...
Instance "+ASM2", status READY, has 1 handler(s) for this service...
Service "+ASM_DATA" has 2 instance(s).
Instance "+ASM1", status READY, has 2 handler(s) for this service...
Instance "+ASM2", status READY, has 1 handler(s) for this service...
Service "+ASM_FRA" has 2 instance(s).
Instance "+ASM1", status READY, has 2 handler(s) for this service...
Instance "+ASM2", status READY, has 1 handler(s) for this service...
The command completed successfully
可以看到 2 个实例都存在,明明节点 2 的 ASM 实例已经宕掉了,按理说应该只有节点 1 的存在才对。再次查看节点 1 的私网监听的 service:
[grid@mesdb1 ~]$ lsnrctl services ASMNET1LSNR_ASM
LSNRCTL for Linux: Version 19.0.0.0.0 - Production on 19-FEB-2025 00:26:42
Copyright (c) 1991, 2023, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM)))
Services Summary...
Service "+ASM" has 2 instance(s).
Instance "+ASM1", status READY, has 2 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=11.11.11.1)(PORT=1525)))
"DEDICATED" established:0 refused:0 state:ready
LOCAL SERVER
Instance "+ASM2", status READY, has 1 handler(s) for this service...
Handler(s):
"DEDICATED" established:4578 refused:0 state:ready
REMOTE SERVER
(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=11.11.11.2)(PORT=1525)))
Service "+ASM_CRS" has 2 instance(s).
Instance "+ASM1", status READY, has 2 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=11.11.11.1)(PORT=1525)))
"DEDICATED" established:0 refused:0 state:ready
LOCAL SERVER
Instance "+ASM2", status READY, has 1 handler(s) for this service...
Handler(s):
"DEDICATED" established:4578 refused:0 state:ready
REMOTE SERVER
(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=11.11.11.2)(PORT=1525)))
Service "+ASM_DATA" has 2 instance(s).
Instance "+ASM1", status READY, has 2 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=11.11.11.1)(PORT=1525)))
"DEDICATED" established:0 refused:0 state:ready
LOCAL SERVER
Instance "+ASM2", status READY, has 1 handler(s) for this service...
Handler(s):
"DEDICATED" established:4578 refused:0 state:ready
REMOTE SERVER
(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=11.11.11.2)(PORT=1525)))
Service "+ASM_FRA" has 2 instance(s).
Instance "+ASM1", status READY, has 2 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=11.11.11.1)(PORT=1525)))
"DEDICATED" established:0 refused:0 state:ready
LOCAL SERVER
Instance "+ASM2", status READY, has 1 handler(s) for this service...
Handler(s):
"DEDICATED" established:4578 refused:0 state:ready
REMOTE SERVER
(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=11.11.11.2)(PORT=1525)))
The command completed successfully
确实可以看到有节点 2 的监听服务依然存在,这里参考上述文章的解决方案是重载私网监听。
解决方案
在节点 1 上手动执行私网监听重载:
[grid@mesdb1 ~]$ lsnrctl reload ASMNET1LSNR_ASM
LSNRCTL for Linux: Version 19.0.0.0.0 - Production on 19-FEB-2025 00:27:40
Copyright (c) 1991, 2023, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM)))
The command completed successfully
[grid@mesdb1 ~]$ lsnrctl services ASMNET1LSNR_ASM
LSNRCTL for Linux: Version 19.0.0.0.0 - Production on 19-FEB-2025 00:27:42
Copyright (c) 1991, 2023, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM)))
The listener supports no services
The command completed successfully
[grid@mesdb1 ~]$ lsnrctl services ASMNET1LSNR_ASM
LSNRCTL for Linux: Version 19.0.0.0.0 - Production on 19-FEB-2025 00:27:44
Copyright (c) 1991, 2023, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM)))
The listener supports no services
The command completed successfully
[grid@mesdb1 ~]$ lsnrctl services ASMNET1LSNR_ASM
LSNRCTL for Linux: Version 19.0.0.0.0 - Production on 19-FEB-2025 00:27:45
Copyright (c) 1991, 2023, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM)))
The listener supports no services
The command completed successfully
手动注册一下监听:
[grid@mesdb1 ~]$ sqlplus / as sysasm
SQL*Plus: Release 19.0.0.0.0 - Production on Wed Feb 19 00:27:48 2025
Version 19.22.0.0.0
Copyright (c) 1982, 2023, Oracle. All rights reserved.
Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.22.0.0.0
SQL> alter system register;
System altered.
SQL> exit
Disconnected from Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.22.0.0.0
再次查看私网监听状态:
[grid@mesdb1 ~]$ lsnrctl services ASMNET1LSNR_ASM
LSNRCTL for Linux: Version 19.0.0.0.0 - Production on 19-FEB-2025 00:28:03
Copyright (c) 1991, 2023, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM)))
Services Summary...
Service "+ASM" has 1 instance(s).
Instance "+ASM1", status READY, has 2 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=11.11.11.1)(PORT=1525)))
"DEDICATED" established:0 refused:0 state:ready
LOCAL SERVER
Service "+ASM_CRS" has 1 instance(s).
Instance "+ASM1", status READY, has 2 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=11.11.11.1)(PORT=1525)))
"DEDICATED" established:0 refused:0 state:ready
LOCAL SERVER
Service "+ASM_DATA" has 1 instance(s).
Instance "+ASM1", status READY, has 2 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=11.11.11.1)(PORT=1525)))
"DEDICATED" established:0 refused:0 state:ready
LOCAL SERVER
Service "+ASM_FRA" has 1 instance(s).
Instance "+ASM1", status READY, has 2 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=11.11.11.1)(PORT=1525)))
"DEDICATED" established:0 refused:0 state:ready
LOCAL SERVER
The command completed successfully
已经没有节点 2 的监听服务存在了。再次尝试开启节点 2 的集群服务:
[grid@mesdb2 ~]$ which crsctl
/u01/app/19.3.0/grid/bin/crsctl
[grid@mesdb2 ~]$ exit
logout
[root@mesdb2 ~]# /u01/app/19.3.0/grid/bin/crsctl start cluster
CRS-2672: Attempting to start 'ora.crsd' on 'mesdb2'
CRS-2676: Start of 'ora.crsd' on 'mesdb2' succeeded
集群开启很顺利,问题解决。
写在最后
这次问题的解决过程告诉我们,表象并不总是反映真正的原因,我们需要透过现象深入分析,最终才能找到问题的根源和有效的解决方案!
ora.storage: 这真不是我的锅啊!
当然,这次的宕机问题到这其实仍未完全解决,尽管集群已恢复正常,但节点 2 的实例崩溃的原因以及如何恢复的问题仍需进一步探讨。我将在下一篇文章中继续分享。