
Oracle RAC Hits a Bug: crsd Restarts Endlessly!

Original · Lucifer三思而后行 · 5 days ago

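Hello everyone, this is the WeChat public account DBA学习之路, where I share knowledge and experience gathered while learning domestic (Chinese) databases.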


Preface

Today I was checking an Oracle RAC 12.2.0.1 database. When I tried to check the cluster status, the cluster commands hung and never returned.

After some analysis the problem was resolved. It turned out to be fairly simple, so here is the troubleshooting process.

Problem Analysis

My first suspicion was that a cluster resource was down. Checking the cluster resources showed that ora.crsd was OFFLINE:

## Node 1
[grid@lucifer1 ~]$ crsctl stat res -t -init
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       mesdb0                   Started,STABLE
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       mesdb0                   STABLE
ora.crf
      1        ONLINE  ONLINE       mesdb0                   STABLE
ora.crsd
      1        ONLINE  OFFLINE                               STABLE
ora.cssd
      1        ONLINE  ONLINE       mesdb0                   STABLE
ora.cssdmonitor
      1        ONLINE  ONLINE       mesdb0                   STABLE
ora.ctssd
      1        ONLINE  ONLINE       mesdb0                   OBSERVER,STABLE
ora.diskmon
      1        OFFLINE OFFLINE                               STABLE
ora.evmd
      1        ONLINE  ONLINE       mesdb0                   STABLE
ora.gipcd
      1        ONLINE  ONLINE       mesdb0                   STABLE
ora.gpnpd
      1        ONLINE  ONLINE       mesdb0                   STABLE
ora.mdnsd
      1        ONLINE  ONLINE       mesdb0                   STABLE
ora.storage
      1        ONLINE  ONLINE       mesdb0                   STABLE
--------------------------------------------------------------------------------

## Node 2
[grid@lucifer2 ~]$ crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
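With this many resources, the one line that matters is easy to overlook. As a minimal sketch (the function name `find_offline_resources` is my own and not part of any Oracle tool), the tabular output above can be filtered for resources whose Target is ONLINE while the State is OFFLINE. It assumes the two-line layout shown above: a resource name line starting with `ora.`, followed by an instance line.

```shell
# Print resources whose Target is ONLINE but whose State is OFFLINE.
# Reads `crsctl stat res -t -init` style output on stdin.
find_offline_resources() {
  awk '
    /^ora\./ { name = $1; next }         # remember the resource name line
    name != "" && $1 ~ /^[0-9]+$/ {      # instance line: id, target, state, ...
      if ($2 == "ONLINE" && $3 == "OFFLINE") print name
    }
  '
}
```

A possible use is `crsctl stat res -t -init | find_offline_resources`. Resources that are intentionally OFFLINE, such as ora.diskmon above, are skipped because their Target is also OFFLINE.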

Check the CRS alert.log:

2025-02-24 06:11:42.105 [ORAROOTAGENT(29459)]CRS-8500: Oracle Clusterware ORAROOTAGENT process is starting with operating system process ID 29459
2025-02-24 06:12:42.142 [ORAROOTAGENT(29459)]CRS-5818: Aborted command 'check' for resource 'ora.crsd'. Details at (:CRSAGF00113:) {0:15:2} in /oracle/app/grid/diag/crs/mesdb0/crs/trace/ohasd_orarootagent_root.trc.
2025-02-24 06:13:20.260 [CRSD(30357)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 30357
2025-02-24 06:13:22.541 [CRSD(30357)]CRS-1019: The OCR service exited on host mesdb0. Details in /oracle/app/grid/diag/crs/mesdb0/crs/trace/crsd.trc
2025-02-24T06:13:22.563713+08:00
Errors in file /oracle/app/grid/diag/crs/mesdb0/crs/trace/crsd.trc (incident=41):
CRS-1019 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /oracle/app/grid/diag/crs/mesdb0/crs/incident/incdir_41/crsd_i41.trc
2025-02-24 06:13:22.584 [CRSD(30357)]CRS-8505: Oracle Clusterware CRSD process with operating system process ID 30357 encountered internal error CRS-01019
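A restart loop shows up in the alert log as a series of CRS-8500 start messages from CRSD, each with a new process ID. The sketch below counts the distinct CRSD process IDs that announced a start. The helper name `count_crsd_starts` is hypothetical, and it matches on the message code and the `[CRSD(pid)]` prefix rather than the message text, so it should not depend on the log language.

```shell
# Count the distinct CRSD process IDs that logged a CRS-8500 start message
# in the Clusterware alert log given as $1.
count_crsd_starts() {
  grep -oE '\[CRSD\([0-9]+\)\]CRS-8500' "$1" | sort -u | wc -l | tr -d ' '
}
```

A count that keeps growing between two runs a few minutes apart suggests crsd is being restarted over and over.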

Check the crsd log:

2025-02-24 06:13:22.514 : OCRMSG:3187623680: prom_listen: Port str [a0f4-81a3-c06c-03aa]
2025-02-24 06:13:22.514 : OCRSRV:3187623680: proath_listen: listening to remote requests at portstr [a0f4-81a3-c06c-03aa]
2025-02-24 06:13:22.518 : OCRMSG:3168728832: prom_listen: Port str [ab1d-0688-2d30-7387]
2025-02-24 06:13:22.518 : OCRSRV:3168728832: th_invalidate_cache: listening to cache_invalidation requests at portstr [ab1d-0688-2d30-7387]
2025-02-24 06:13:22.522 : OCRMSG:3166627584: prom_listen: Port str [c71c-c1a3-dc88-994f]
2025-02-24 06:13:22.522 : OCRSRV:3166627584: proath_listen: listening to remote rim requests at portstr [c71c-c1a3-dc88-994f]
2025-02-24 06:13:22.533 : OCRMAS:3164526336: th_calc_av: Configured Active Patch Level [0]
2025-02-24 06:13:22.533 : OCRMAS:3164526336: th_calc_av:5'': Return persisted APL [0]
OCRMAS:3164526336: th_calc_av:5': Return persisted AV [203424000] [12.2.0.1.0]
2025-02-24 06:13:22.535 : OCRMAS:3164526336: th_master_prereg: Persistent upgrade state retrieved from OCR is [0].
2025-02-24 06:13:22.537 : OCRMAS:3164526336: th_master_prereg: Persistent upgrade toversion buffer retrieved from OCR is [12.2.0.1.0]. Setting toversion to [203424000].
2025-02-24 06:13:22.541 : CSSCLNT:3164526336: clssgsGroupJoin: member in use group(1/ocrlocal)
2025-02-24 06:13:22.541 : default:3164526336: procr_reg_localgrp: Error [14] from clssgsreglocalgrp(). Return [23].
2025-02-24 06:13:22.541 : default:3164526336: SLOS : [clsuSlosFormatDiag called with non-error slos.]
2025-02-24 06:13:22.541 : OCRMAS:3164526336: th_master_register: Failed to register in OCRLOCAL group. Retval:[23]
2025-02-24 06:13:22.541 : OCRAPI:3164526336: procr_ctx_set_invalid: ctx is in state [6].
2025-02-24 06:13:22.541 : OCRAPI:3164526336: procr_ctx_set_invalid: ctx set to invalid
Trace file /oracle/app/grid/diag/crs/mesdb0/crs/trace/crsd.trc
Oracle Database 12c Clusterware Release 12.2.0.1.0 - Production Copyright 1996, 2016 Oracle. All rights reserved.
DDE: Flood control is not active
2025-02-24T06:13:22.564565+08:00
Incident 41 created, dump file: /oracle/app/grid/diag/crs/mesdb0/crs/incident/incdir_41/crsd_i41.trc
CRS-1019 [] [] [] [] [] [] [] [] [] [] [] []
2025-02-24 06:13:22.706 : OCRAPI:3164526336: procr_ctx_set_invalid: Aborting...
Trace file /oracle/app/grid/diag/crs/mesdb0/crs/trace/crsd.trc
Oracle Database 12c Clusterware Release 12.2.0.1.0 - Production Copyright 1996, 2016 Oracle. All rights reserved.
default:2552033344: 1: clskec:has:CLSU:910 4 args[CLSD00302][mod=clsdadr.c][loc=(:CLSD00302:)][msg=clsdAdrInit: Trace file size and number of segments fetched from environemnt variable: ORA_DAEMON_TRACE_FILE_OPTIONS filesize=26214400,numsegments=10]
CLSB:2552033344: Argument count (argc) for this daemon is 2
CLSB:2552033344: Argument 0 is: /oracle/app/12.2.0/grid/bin/crsd.bin
CLSB:2552033344: Argument 1 is: reboot
2025-02-24 06:13:22.829 : CSSCLNT:2552033344: clsssinit: initialized context: (0x4edf930) flags 0x207
2025-02-24 06:13:22.829 : CRSMAIN:2552033344: First attempt: init CSS context succeeded.
2025-02-24 06:13:22.829 : CRSMAIN:2552033344: Start mode: normal
2025-02-24 06:13:22.831 : CLSDMT:2343307008: PID for the Process [30402], connkey CRSD
2025-02-24 06:13:23.745 : CRSMAIN:2552033344: CRS Daemon Starting
2025-02-24 06:13:23.745 : CRSMAIN:2343307008: Process environment is not initialized yet!
2025-02-24 06:13:23.746 : CRSD:2552033344: Logging level for Module: clsdadr 0
2025-02-24 06:13:23.746 : CRSD:2552033344: Logging level for Module: clsdnreg 0
2025-02-24 06:13:23.746 : CRSD:2552033344: Logging level for Module: clsdynam 0
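The two lines that stand out in this excerpt are the `procr_reg_localgrp` error and the failure to register in the OCRLOCAL group, right before crsd aborts and starts again. To check another node or a later trace for the same pattern, a small helper along these lines may do. The name `has_ocrlocal_failure` is made up for illustration, and the strings are copied from the excerpt above, not from any official signature list.

```shell
# Exit 0 if the trace file given as $1 contains both lines that mark the
# OCRLOCAL group registration failure seen in the excerpt above.
has_ocrlocal_failure() {
  grep -qF 'procr_reg_localgrp: Error [14] from clssgsreglocalgrp()' "$1" &&
  grep -qF 'th_master_register: Failed to register in OCRLOCAL group' "$1"
}
```

For example: `has_ocrlocal_failure /oracle/app/grid/diag/crs/mesdb0/crs/trace/crsd.trc && echo "same failure pattern"`.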

Look at the incident trace dump:

----- Invocation Context Dump -----
Address: 0x7f1a9c024340
Phase: 3
flags: 0x10E0000
Incident ID: 41
Error Descriptor: CRS-1019 [] [] [] [] [] [] [] [] [] [] [] []
Error class: 0
Problem Key # of args: 0
Number of actions: 10
----- Incident Context Dump -----
Address: 0x7f1abc9d99d0
Incident ID: 41
Problem Key: CRS 1019
Error: CRS-1019 [] [] [] [] [] [] [] [] [] [] [] []
[00]: dbgePostErrorDirectVaList_int [diag_dde]
[01]: dbgePostErrorDirect [diag_dde]
[02]: clsdAdrPostError []
[03]: clsdadrpr_CreateIncidentCheck []
[04]: clsdadrprAlert []
[05]: clsd_alertprintft []
[06]: proath_master_exit_helper []<-- Signaling
[07]: proath_master_register []
[08]: proath_master []
[09]: start_thread []
MD [00]: 'Client ProcId'='crsd.bin@mesdb0.30357_139752810403584' (0x0)
Impact 0:
Impact 1:
Impact 2:
Impact 3:
Derived Impact:
----- END Incident Context Dump -----

This looked very much like a bug. A search on MOS (My Oracle Support) turned up two notes that match closely:

  1. crsd.bin Fail With Error CRS-1019 When ohasd Restarted (Doc ID 2291799.1)
  2. Bug 24396050 - crsd.bin failed several times with error CRS-1019 (Doc ID 24396050.8)

[Screenshots of the two MOS notes]

The MOS notes match the logs from this incident exactly, which confirms it is a bug that has to be fixed with a patch.

Solution

Download the patch for the bug: Patch 24396050: LNX64-12.2-CRS: CRSD.BIN FAILED SEVERAL TIMES WITH ERROR CRS-1019

Update OPatch

The patch README states: You must use the OPatch utility version 12.2.0.1.5 or later to apply this patch.

Check whether the current OPatch version meets the requirement:

[grid@mesdb0 ~]$ cd $ORACLE_HOME/OPatch/
[grid@mesdb0 OPatch]$ ./opatch version
OPatch Version: 12.2.0.1.6

OPatch succeeded.

Version 12.2.0.1.6 is newer than the required 12.2.0.1.5, so OPatch does not need to be updated.
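When several nodes or homes have to be checked, the comparison can be scripted. The following is only a sketch: `version_ge` and `opatch_version_from_output` are hypothetical helper names, and the comparison relies on `sort -V` (version sort) from GNU coreutils, which I assume is available on the Linux hosts.

```shell
# Succeeds if version $1 is greater than or equal to version $2.
# Uses GNU sort's version ordering, so 12.2.0.1.10 sorts after 12.2.0.1.5.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Extract the version number from `opatch version` output read on stdin.
opatch_version_from_output() {
  sed -n 's/^OPatch Version: *//p'
}
```

Usage could look like this: `v=$($ORACLE_HOME/OPatch/opatch version | opatch_version_from_output); version_ge "$v" 12.2.0.1.5 && echo "OPatch is recent enough"`.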

Unzip the Patch

## Run as root
unzip -q /soft/p24396050_122010_Linux-x86-64.zip -d /soft/
chown -R oracle:oinstall /soft/24396050

Install the Patch

## Run as root
export GI_HOME=/oracle/app/12.2.0/grid
## Pre-installation check
$GI_HOME/OPatch/opatchauto apply /soft/24396050 -analyze
## Apply the patch
$GI_HOME/OPatch/opatchauto apply /soft/24396050 -oh $GI_HOME
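To confirm that the patch was actually registered in the inventory, the output of `opatch lspatches` can be checked for the patch ID. This is a sketch under the assumption that each applied patch is listed as `ID;description` on its own line. The helper name `patch_listed` is my own.

```shell
# Succeeds if patch ID $1 starts a line of `opatch lspatches` output
# read from stdin (assumed line format: "ID;description").
patch_listed() {
  grep -q "^$1;"
}
```

Run as the grid user with ORACLE_HOME pointing at the GI home, for example: `$ORACLE_HOME/OPatch/opatch lspatches | patch_listed 24396050 && echo "patch 24396050 is in the inventory"`.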

After installing the patch, I rebooted the system and verified that the cluster was back to normal.
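One way to script that verification is to look at the message codes printed by `crsctl check crs`. The sketch below treats CRS-4537 (Cluster Ready Services is online) as success and CRS-4535 (Cannot communicate with Cluster Ready Services, the error seen on node 2 earlier) as failure. Matching on codes rather than text is a deliberate choice because the message text is localized on this system. The function name `crs_online` is hypothetical.

```shell
# Succeeds if the `crsctl check crs` output on stdin reports CRS as online
# (CRS-4537 present) and does not contain the CRS-4535 communication error.
crs_online() {
  local out
  out=$(cat)
  printf '%s\n' "$out" | grep -q '^CRS-4537:' &&
  ! printf '%s\n' "$out" | grep -q '^CRS-4535:'
}
```

For example, on each node: `crsctl check crs | crs_online && echo "CRS is online"`.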


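Comments

筱悦星辰 (1 day ago): Words carry energy. Every positive sentence we say is like an invisible spring breeze that nourishes the heart and makes people more upbeat.

淡定 (3 days ago): Oracle RAC hits a bug, crsd restarts endlessly!

reddey (4 days ago): GI patches are normally applied after stopping the database services and the cluster first, right?

reddey (4 days ago): 三哥, did you apply this GI patch on all nodes, or only on the failed node?

  Lucifer三思而后行 (reply, 4 days ago): I'll write a separate post on the patching steps in a couple of days. The process has a few pitfalls.

  reddey (reply, 3 days ago): @Lucifer三思而后行 Looking forward to the new post.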