暂无图片
暂无图片
8
暂无图片
暂无图片
暂无图片

Oracle 19C RAC 集群起不来了,ora.storage 的锅?

原创 Lucifer三思而后行 1天前
446

大家好,这里是公众号 DBA学习之路,分享一些学习国产数据库路上的知识和经验。

目录

前言

半夜 11 点多接到电话,说有一个客户的生产库宕了一个节点,需要紧急恢复一下。赶紧下床打开电脑,一顿操作连上了客户的数据库主机,是一套 19C(19.22) RAC 2 节点数据库,看了下是节点 2 宕掉了,集群起不来了。

经过和客户的沟通之后,解决了这个问题,本文记录一下分析与解决过程,分享给大家一起进步。

问题分析

连上主机之后,看了下节点 2 的集群状态,发现是启动集群卡在了 ora.storage 起不来:

通过 crsctl stat res -t -init 检查启动进度时,我们可以看到在启动 ora.storage 资源时会 HANG 住,等待一顿时间后就会超时。

ora.storage 1 ONLINE OFFLINE STABLE --------------------------------------------------------------------------------

查看 ohasd.trc 日志(已排除存储相关问题):

2025-02-19 00:23:02.300 : CRSPE:2791253760: [ INFO] {0:1:29} Got agent-specific msg: ORA-12541: TNS:no listener ORA-12541: TNS:no listener 2025-02-19 00:23:02.300 : AGFW:2803861248: [ INFO] {0:1:29} Agfw Proxy Server sending the reply to PE for message:RESOURCE_START[ora.storage 1 1] ID 4 098:1471 2025-02-19 00:23:02.300 : CRSPE:2791253760: [ INFO] {0:1:29} Received reply to action [Start] message ID: 1471 2025-02-19 00:23:02.300 : CRSPE:2791253760: [ INFO] {0:1:29} Got agent-specific msg: CRS-5055: unable to connect to an ASM instance because no ASM inst ance is running in the cluster 2025-02-19 00:23:02.300 : CRSPE:2791253760: [ INFO] {0:1:29} Received reply to action [Start] message ID: 1471 2025-02-19 00:23:02.301 : CRSMAIN:2791253760: [ NONE] {0:1:29} {0:1:29} Created alert : (:CRSPE00221:) : Start action timed out! 2025-02-19 00:23:02.301 : CRSPE:2791253760: [ INFO] {0:1:29} Start action failed with error code: 3 2025-02-19 00:23:02.301 : CRSRPT:2789152512: [ INFO] {0:1:29} Published to EVM CRS_ACTION_FAILURE for ora.storage

可以看到报错是 ORA-12541: TNS:no listener,查看 ohasd_orarootagent_root.trc 日志:

025-02-19 00:14:23.824 : USRTHRD:3772593920: [ INFO] {0:1:29} [ora.storage] Error [kgfoAl06] in [kgfokge] at kgfo.c:3182 2025-02-19 00:14:23.824 : USRTHRD:3772593920: [ INFO] {0:1:29} [ora.storage] ORA-12541: TNS:no listener ORA-12541: TNS:no listener 2025-02-19 00:14:23.824 : USRTHRD:3772593920: [ INFO] {0:1:29} [ora.storage] Category: 7 2025-02-19 00:14:23.824 : USRTHRD:3772593920: [ INFO] {0:1:29} [ora.storage] DepInfo: 12541 2025-02-19 00:14:23.824 : USRTHRD:3772593920: [ INFO] {0:1:29} [ora.storage] ADR is not properly configured 2025-02-19 00:14:23.824 : USRTHRD:3772593920: [ INFO] {0:1:29} [ora.storage] -- trace dump end -- 2025-02-19 00:14:23.824 : USRTHRD:3772593920: [ INFO] {0:1:29} Thread:kgfoCheckMountExt isRunning is reset to false here 2025-02-19 00:14:23.824 : USRTHRD:3772593920: [ INFO] {0:1:29} Thread:kgfoCheckMountExt isFinished set to true 2025-02-19 00:14:23.824 : USRTHRD:3764188928: [ INFO] {0:1:29} Thread:kgfoCheckMountExt Tasklet::doTask m_cv.timewait returned 2025-02-19 00:14:23.824 : USRTHRD:3764188928: [ INFO] {0:1:29} Thread:kgfoCheckMountExt Tasklet::doTask(kgfoCheckMountExt) executed in 9 seconds 2025-02-19 00:14:23.824 :CLSDYNAM:3764188928: [ora.storage]{0:1:29} [start] StorageAgent::parsekgforetcodes retcode = 7, kgfoCheckMount(CRS), flag 2 2025-02-19 00:14:23.824 :CLSDYNAM:3764188928: [ora.storage]{0:1:29} [start] (null) category: 7, operation: kgfoAl06, loc: kgfokge, OS error: 12541, other: OR A-12541: TNS:no listener ORA-12541: TNS:no listener 2025-02-19 00:14:23.824 :CLSDYNAM:3764188928: [ora.storage]{0:1:29} [start] StorageAgent::check 260 kgfo returncode 1 2025-02-19 00:14:23.824 :CLSDYNAM:3764188928: [ora.storage]{0:1:29} [start] (:CLSN00140:)StorageAgent::check 300 parsekgforretcodes OCR dgName CRS state 1 2025-02-19 00:14:23.824 :CLSDYNAM:3764188928: [ora.storage]{0:1:29} [start] Storage::start waiting for check to not return PARTIAL or UNPLANNED_OFFLINE 1 2025-02-19 00:14:23.827 :CLSDYNAM:3764188928: [ora.storage]{0:1:29} [start] CssGroup::regis registration of groupName:ASM_NETWORK mbrid:-1 m_grpNum:2 regType :1 succeeded

可以发现再启动 ora.storage 时报错 ORA-12541: TNS:no listener,连接私网监听 ASM_NETWORK 报错,看到这里我也是有点懵逼的,在 MOS 上也搜不到相关问题记录。

在网上搜索到两篇问题现象比较相近的文章

  1. ora.storage无法启动报ORA-12514故障处理
  2. ora.storage hangs after node reboot

看了一下,大致原因都是人工重启主机导致,需要重新加载一下私网监听 ASMNET1LSNR_ASM

随后咨询了一下客户,果然是节点 2 实例 crash 之后无法启动,执行 reboot 重启主机没有反应,所以使用管理端强制重启了主机,主机起来之后就一直 hang 在这里。

那看来现象基本吻合,看一下节点 1 的私网监听 ASMNET1LSNR_ASM 状态:

[grid@mesdb1 ~]$ lsnrctl stat ASMNET1LSNR_ASM LSNRCTL for Linux: Version 19.0.0.0.0 - Production on 19-FEB-2025 00:23:31 Copyright (c) 1991, 2023, Oracle. All rights reserved. Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM))) STATUS of the LISTENER ------------------------ Alias ASMNET1LSNR_ASM Version TNSLSNR for Linux: Version 19.0.0.0.0 - Production Start Date 31-JAN-2025 18:18:39 Uptime 18 days 6 hr. 4 min. 51 sec Trace Level off Security ON: Local OS Authentication SNMP OFF Listener Parameter File /u01/app/19.3.0/grid/network/admin/listener.ora Listener Log File /u01/app/grid/diag/tnslsnr/mesdb1/asmnet1lsnr_asm/alert/log.xml Listening Endpoints Summary... (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=ASMNET1LSNR_ASM))) (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=11.11.11.1)(PORT=1525))) Services Summary... Service "+ASM" has 2 instance(s). Instance "+ASM1", status READY, has 2 handler(s) for this service... Instance "+ASM2", status READY, has 1 handler(s) for this service... Service "+ASM_CRS" has 2 instance(s). Instance "+ASM1", status READY, has 2 handler(s) for this service... Instance "+ASM2", status READY, has 1 handler(s) for this service... Service "+ASM_DATA" has 2 instance(s). Instance "+ASM1", status READY, has 2 handler(s) for this service... Instance "+ASM2", status READY, has 1 handler(s) for this service... Service "+ASM_FRA" has 2 instance(s). Instance "+ASM1", status READY, has 2 handler(s) for this service... Instance "+ASM2", status READY, has 1 handler(s) for this service... The command completed successfully

可以看到 2 个实例都存在,明明节点 2 的 ASM 实例已经宕掉了,按理说应该只有节点 1 的存在才对。再次查看节点 1 的私网监听的 service:

[grid@mesdb1 ~]$ lsnrctl services ASMNET1LSNR_ASM LSNRCTL for Linux: Version 19.0.0.0.0 - Production on 19-FEB-2025 00:26:42 Copyright (c) 1991, 2023, Oracle. All rights reserved. Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM))) Services Summary... Service "+ASM" has 2 instance(s). Instance "+ASM1", status READY, has 2 handler(s) for this service... Handler(s): "DEDICATED" established:0 refused:0 state:ready REMOTE SERVER (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=11.11.11.1)(PORT=1525))) "DEDICATED" established:0 refused:0 state:ready LOCAL SERVER Instance "+ASM2", status READY, has 1 handler(s) for this service... Handler(s): "DEDICATED" established:4578 refused:0 state:ready REMOTE SERVER (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=11.11.11.2)(PORT=1525))) Service "+ASM_CRS" has 2 instance(s). Instance "+ASM1", status READY, has 2 handler(s) for this service... Handler(s): "DEDICATED" established:0 refused:0 state:ready REMOTE SERVER (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=11.11.11.1)(PORT=1525))) "DEDICATED" established:0 refused:0 state:ready LOCAL SERVER Instance "+ASM2", status READY, has 1 handler(s) for this service... Handler(s): "DEDICATED" established:4578 refused:0 state:ready REMOTE SERVER (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=11.11.11.2)(PORT=1525))) Service "+ASM_DATA" has 2 instance(s). Instance "+ASM1", status READY, has 2 handler(s) for this service... Handler(s): "DEDICATED" established:0 refused:0 state:ready REMOTE SERVER (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=11.11.11.1)(PORT=1525))) "DEDICATED" established:0 refused:0 state:ready LOCAL SERVER Instance "+ASM2", status READY, has 1 handler(s) for this service... Handler(s): "DEDICATED" established:4578 refused:0 state:ready REMOTE SERVER (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=11.11.11.2)(PORT=1525))) Service "+ASM_FRA" has 2 instance(s). Instance "+ASM1", status READY, has 2 handler(s) for this service... Handler(s): "DEDICATED" established:0 refused:0 state:ready REMOTE SERVER (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=11.11.11.1)(PORT=1525))) "DEDICATED" established:0 refused:0 state:ready LOCAL SERVER Instance "+ASM2", status READY, has 1 handler(s) for this service... Handler(s): "DEDICATED" established:4578 refused:0 state:ready REMOTE SERVER (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=11.11.11.2)(PORT=1525))) The command completed successfully

确实可以看到有节点 2 的监听服务依然存在,这里参考上述文章的解决方案是重载私网监听。

解决方案

在节点 1 上手动执行私网监听重载:

[grid@mesdb1 ~]$ lsnrctl reload ASMNET1LSNR_ASM LSNRCTL for Linux: Version 19.0.0.0.0 - Production on 19-FEB-2025 00:27:40 Copyright (c) 1991, 2023, Oracle. All rights reserved. Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM))) The command completed successfully [grid@mesdb1 ~]$ lsnrctl services ASMNET1LSNR_ASM LSNRCTL for Linux: Version 19.0.0.0.0 - Production on 19-FEB-2025 00:27:42 Copyright (c) 1991, 2023, Oracle. All rights reserved. Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM))) The listener supports no services The command completed successfully [grid@mesdb1 ~]$ lsnrctl services ASMNET1LSNR_ASM LSNRCTL for Linux: Version 19.0.0.0.0 - Production on 19-FEB-2025 00:27:44 Copyright (c) 1991, 2023, Oracle. All rights reserved. Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM))) The listener supports no services The command completed successfully [grid@mesdb1 ~]$ lsnrctl services ASMNET1LSNR_ASM LSNRCTL for Linux: Version 19.0.0.0.0 - Production on 19-FEB-2025 00:27:45 Copyright (c) 1991, 2023, Oracle. All rights reserved. Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM))) The listener supports no services The command completed successfully

手动注册一下监听:

[grid@mesdb1 ~]$ sqlplus / as sysasm SQL*Plus: Release 19.0.0.0.0 - Production on Wed Feb 19 00:27:48 2025 Version 19.22.0.0.0 Copyright (c) 1982, 2023, Oracle. All rights reserved. Connected to: Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production Version 19.22.0.0.0 SQL> alter system register; System altered. SQL> exit Disconnected from Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production Version 19.22.0.0.0

再次查看私网监听状态:

[grid@mesdb1 ~]$ lsnrctl services ASMNET1LSNR_ASM LSNRCTL for Linux: Version 19.0.0.0.0 - Production on 19-FEB-2025 00:28:03 Copyright (c) 1991, 2023, Oracle. All rights reserved. Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM))) Services Summary... Service "+ASM" has 1 instance(s). Instance "+ASM1", status READY, has 2 handler(s) for this service... Handler(s): "DEDICATED" established:0 refused:0 state:ready REMOTE SERVER (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=11.11.11.1)(PORT=1525))) "DEDICATED" established:0 refused:0 state:ready LOCAL SERVER Service "+ASM_CRS" has 1 instance(s). Instance "+ASM1", status READY, has 2 handler(s) for this service... Handler(s): "DEDICATED" established:0 refused:0 state:ready REMOTE SERVER (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=11.11.11.1)(PORT=1525))) "DEDICATED" established:0 refused:0 state:ready LOCAL SERVER Service "+ASM_DATA" has 1 instance(s). Instance "+ASM1", status READY, has 2 handler(s) for this service... Handler(s): "DEDICATED" established:0 refused:0 state:ready REMOTE SERVER (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=11.11.11.1)(PORT=1525))) "DEDICATED" established:0 refused:0 state:ready LOCAL SERVER Service "+ASM_FRA" has 1 instance(s). Instance "+ASM1", status READY, has 2 handler(s) for this service... Handler(s): "DEDICATED" established:0 refused:0 state:ready REMOTE SERVER (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=11.11.11.1)(PORT=1525))) "DEDICATED" established:0 refused:0 state:ready LOCAL SERVER The command completed successfully

已经没有节点 2 的监听服务存在了。再次尝试开启节点 2 的集群服务:

[grid@mesdb2 ~]$ which crsctl /u01/app/19.3.0/grid/bin/crsctl [grid@mesdb2 ~]$ exit logout [root@mesdb2 ~]# /u01/app/19.3.0/grid/bin/crsctl start cluster CRS-2672: Attempting to start 'ora.crsd' on 'mesdb2' CRS-2676: Start of 'ora.crsd' on 'mesdb2' succeeded

集群开启很顺利,问题解决。

写在最后

这次问题的解决过程告诉我们,表象并不总是反映真正的原因,我们需要透过现象深入分析,最终才能找到问题的根源和有效的解决方案!

ora.storage: 这真不是我的锅啊!

当然,这次的宕机问题到这其实仍未完全解决,尽管集群已恢复正常,但节点 2 的实例崩溃的原因以及如何恢复的问题仍需进一步探讨。我将在下一篇文章中继续分享。

「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
1人已赞赏
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论