Problem Overview
A recently installed 11.2 RAC cluster uses ASMLib for its ASM disks. After the node 2 host was rebooted, CRS failed to start.
[root@db2 ~]# /u01/app/11.2.0/grid/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
Root Cause Analysis
(1) Log analysis
The GI alert log makes the error fairly clear: the cluster failed to start because no voting files were found.
[ohasd(2999)]CRS-1301:Oracle High Availability Service started on node db2.
2023-10-30 15:00:29.915:
[ohasd(2999)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
2023-10-30 15:00:33.267:
[/u01/app/11.2.0/grid/bin/orarootagent.bin(10754)]CRS-2302:Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running).
2023-10-30 15:00:37.635:
[gpnpd(10867)]CRS-2328:GPNPD started on node db2.
2023-10-30 15:00:39.977:
[cssd(11019)]CRS-1713:CSSD daemon is started in clustered mode
2023-10-30 15:00:41.814:
[ohasd(2999)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2023-10-30 15:00:48.544:
[cssd(11019)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/db2/cssd/ocssd.log
2023-10-30 15:01:03.551:
[cssd(11019)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/db2/cssd/ocssd.log
2023-10-30 15:01:18.558:
[cssd(11019)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/db2/cssd/ocssd.log
2023-10-30 15:01:33.564:
[cssd(11019)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/db2/cssd/ocssd.log
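Since CSSD retries discovery every 15 seconds, counting the CRS-1714 entries gives a rough idea of how long the node has been stuck. The following is a minimal sketch, assuming the alert log is a plain-text file; the helper name count_vf_retries is hypothetical and not part of any Oracle tool.

```shell
# Count CRS-1714 ("Unable to discover any voting files") entries in a GI alert log.
# count_vf_retries is a hypothetical helper name; $1 is the path to the alert log.
count_vf_retries() {
    grep -c 'CRS-1714:' "$1"
}
```

It would be called with the path of the alert log on the affected node, for example count_vf_retries followed by the log file name. A count that keeps growing between runs suggests discovery is still failing.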
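The cssd log also reports "Voting file not found". However, the same log shows that the cluster can discover ARCHDISK1 and DATADISK1, the two disks that hold the archived logs and the data files respectively.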
2023-10-30 15:00:48.543: [ CSSD][136529664]clssnmvDDiscThread: using discovery string for initial discovery
2023-10-30 15:00:48.543: [ SKGFD][136529664]Discovery with str::
2023-10-30 15:00:48.543: [ SKGFD][136529664]UFS discovery with ::
2023-10-30 15:00:48.543: [ SKGFD][136529664]Execute glob on the string /dev/raw/*
2023-10-30 15:00:48.543: [ SKGFD][136529664]running stat on disk:/dev/raw/rawctl
2023-10-30 15:00:48.543: [ SKGFD][136529664]Fetching UFS disk :/dev/raw/rawctl:
2023-10-30 15:00:48.543: [ SKGFD][136529664]OSS discovery with ::
2023-10-30 15:00:48.543: [ SKGFD][136529664]Discovery with asmlib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: str ::
2023-10-30 15:00:48.543: [ SKGFD][136529664]Fetching asmlib disk :ORCL:ARCHDISK1:
2023-10-30 15:00:48.543: [ SKGFD][136529664]Fetching asmlib disk :ORCL:DATADISK1:
2023-10-30 15:00:48.543: [ SKGFD][136529664]Handle 0x7f5ff8096c50 from lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: for disk :ORCL:ARCHDISK1:
2023-10-30 15:00:48.543: [ SKGFD][136529664]Handle 0x7f5ff80975f0 from lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: for disk :ORCL:DATADISK1:
2023-10-30 15:00:48.544: [ SKGFD][136529664]Lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: closing handle 0x7f5ff8096c50 for disk :ORCL:ARCHDISK1:
2023-10-30 15:00:48.544: [ SKGFD][136529664]Lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: closing handle 0x7f5ff80975f0 for disk :ORCL:DATADISK1:
2023-10-30 15:00:48.544: [ CSSD][136529664]clssnmvDiskVerify: Successful discovery of 0 disks
2023-10-30 15:00:48.544: [ CSSD][136529664]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
2023-10-30 15:00:48.544: [ CSSD][136529664]clssnmvFindInitialConfigs: No voting files found
2023-10-30 15:00:48.544: [ CSSD][136529664](:CSSNM00070:)clssnmCompleteInitVFDiscovery: Voting file not found. Retrying discovery in 15 seconds
2023-10-30 15:00:49.108: [ CSSD][139220736]clssscSelect: cookie accept request 0x7f5ff4084740
2023-10-30 15:00:49.108: [ CSSD][139220736]clssscevtypSHRCON: getting client with cmproc 0x7f5ff4084740
2023-10-30 15:00:49.108: [ CSSD][139220736]clssgmRegisterClient: proc(4/0x7f5ff4084740), client(4/0x7f5ff40774d0)
2023-10-30 15:00:49.108: [ CSSD][139220736]clssgmExecuteClientRequest(): type(6) size(684) only connect and exit messages are allowed before lease acquisition
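The list of ASMLib disks that CSSD actually fetched can be pulled out of ocssd.log instead of being read by eye. This is a sketch that assumes the "Fetching asmlib disk :ORCL:NAME:" line format shown in the excerpt above; asmlib_disks_seen is a hypothetical helper name.

```shell
# Print the unique ASMLib disk names that CSSD fetched during discovery.
# Reads ocssd.log text on stdin; asmlib_disks_seen is a hypothetical helper name.
asmlib_disks_seen() {
    # Capture the name between ":ORCL:" and the closing ":" on matching lines only
    sed -n 's/.*Fetching asmlib disk :ORCL:\([^:]*\):.*/\1/p' | sort -u
}
```

Fed with the ocssd.log from this case, it would print only ARCHDISK1 and DATADISK1, which makes the absence of the OCR/voting disks easy to spot.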
(2) Check the ASM disks
On node 2, the three OCR disks are not listed, and a rescan still does not detect them.
[root@db2 ~]# cd /dev/oracleasm/disks/
[root@db2 disks]# ll
total 0
brw-rw---- 1 grid asmadmin 8, 81 Oct 30 15:00 ARCHDISK1
brw-rw---- 1 grid asmadmin 8, 97 Oct 30 15:00 DATADISK1
[root@db2 ~]# oracleasm scandisks
Reloading disk partitions: done
Cleaning any stale ASM disks ...
Scanning system for ASM disks ..
[root@db2 ~]#
[root@db2 ~]# oracleasm listdisks
ARCHDISK1
DATADISK1
On node 1, all disks look normal.
[root@db1 ~]# oracleasm listdisks
ARCHDISK1
DATADISK1
OCR_VOTE1
OCR_VOTE2
OCR_VOTE3
[root@db1 ~]# ls -l /dev/oracleasm/disks
total 0
brw-rw---- 1 grid asmadmin 8, 65 Oct 26 11:38 ARCHDISK1
brw-rw---- 1 grid asmadmin 8, 81 Oct 26 11:38 DATADISK1
brw-rw---- 1 grid asmadmin 8, 17 Oct 26 11:38 OCR_VOTE1
brw-rw---- 1 grid asmadmin 8, 33 Oct 26 11:38 OCR_VOTE2
brw-rw---- 1 grid asmadmin 8, 49 Oct 26 11:38 OCR_VOTE3
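The difference between the two nodes can also be computed rather than compared by eye. The sketch below assumes the output of oracleasm listdisks from each node has been saved to a file, one disk name per line; missing_asm_disks and the file names in the usage note are hypothetical.

```shell
# List disk names present in the reference list ($1) but absent from the local list ($2).
# Each file holds one disk name per line, e.g. saved output of `oracleasm listdisks`.
# missing_asm_disks is a hypothetical helper name.
missing_asm_disks() {
    ref_sorted=$(mktemp) || return 1
    loc_sorted=$(mktemp) || { rm -f "$ref_sorted"; return 1; }
    sort -u "$1" > "$ref_sorted"
    sort -u "$2" > "$loc_sorted"
    # comm -23 keeps only the lines unique to the first (reference) file
    comm -23 "$ref_sorted" "$loc_sorted"
    rm -f "$ref_sorted" "$loc_sorted"
}
```

For example, with the db1 list saved as db1.disks and the db2 list as db2.disks (both names are placeholders), missing_asm_disks db1.disks db2.disks would report the disks that db2 fails to see.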
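(3) Physical disk check
The three 5 GB disks in the figure below are the OCR disks.
The permissions on these disks are the same as on the other two, healthy ASM disks.
(4) Read the disk headers with kfed
Reading the disk headers of the three OCR disks with kfed succeeds for all of them.
The checks above show no obvious problem with the physical disks.
The blkid command, however, reveals one anomaly: the healthy ASM disks carry LABEL information, while none of the three OCR disks do.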
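The blkid comparison can be automated by filtering for devices that blkid identifies as ASM disks but reports without a LABEL. This is only a sketch: it assumes blkid prints lines of the form device: LABEL="..." TYPE="oracleasm", which may differ between blkid versions, and unlabeled_asm_devices is a hypothetical helper name.

```shell
# Read blkid output on stdin and print devices typed "oracleasm" that have no LABEL.
# unlabeled_asm_devices is a hypothetical helper name; the line format is an assumption.
unlabeled_asm_devices() {
    awk '/TYPE="oracleasm"/ && !/(^| )LABEL="[^"]+"/ {
        sub(/:$/, "", $1)   # strip the trailing colon from the device field
        print $1
    }'
}
```

Run as root, piping the output of blkid into unlabeled_asm_devices would list the candidate devices. An empty result does not prove the headers are intact, so the kfed check is still needed.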
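(5) Confirm the LABEL information in the ASM disk header
ASMLib stores the LABEL in the driver.provstr attribute of the disk header. Comparing this attribute against a healthy disk quickly confirms that the problem is the missing LABEL information.
Can the disk header be restored from its backup copy? On inspection, the LABEL information is missing from the backup block as well.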
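The provstr comparison can be scripted by extracting the label that follows the ORCLDISK prefix in kfed output. The sketch assumes the usual kfed line layout, kfdhdb.driver.provstr: followed by the value and then " ; 0x000: length=N"; asmlib_label_from_kfed is a hypothetical helper name.

```shell
# Read `kfed read <device>` output on stdin and print the ASMLib label,
# i.e. whatever follows the "ORCLDISK" prefix in kfdhdb.driver.provstr.
# Prints an empty line when the label part is missing.
# asmlib_label_from_kfed is a hypothetical helper name.
asmlib_label_from_kfed() {
    sed -n 's/^kfdhdb\.driver\.provstr:[[:space:]]*ORCLDISK\([^[:space:];]*\).*/\1/p'
}
```

Piping kfed read /dev/sdc1 into asmlib_label_from_kfed would print the label on a healthy disk and an empty line on a disk in the state described here.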
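Solution
After confirming that the problem was caused by the loss of LABEL information on the OCR disks, I found the same case on MOS (see References). The note does not explain why the LABEL was lost; it only states that it is unclear how the disks got into this state. An unidentified bug may have been triggered.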
The devices still had the OCR and voting file content, but were missing the ASMLIB driver information, it is unclear how they got into this state.
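Note:
The following solution modifies ASM disk header information and carries a high risk. If something goes wrong, the cluster may fail to start. Assess the risk thoroughly before proceeding, and take a full backup of the database first.
(1) Stop CRS on all nodes (including the healthy ones)
crsctl stop crs
(2) Repair the ASM disk headers with the oracleasm renamedisk command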
[root@db2 ~]# /usr/sbin/oracleasm renamedisk -f /dev/sdc1 OCR_VOTE1
Writing disk header: done
Instantiating disk "OCR_VOTE1": done
[root@db2 ~]# /usr/sbin/oracleasm renamedisk -f /dev/sdd1 OCR_VOTE2
Writing disk header: done
Instantiating disk "OCR_VOTE2": done
[root@db2 ~]# /usr/sbin/oracleasm renamedisk -f /dev/sde1 OCR_VOTE3
Writing disk header: done
Instantiating disk "OCR_VOTE3": done
[root@db2 ~]# ls -l /dev/oracleasm/disks
total 0
brw-rw---- 1 grid asmadmin 8, 81 Oct 30 17:47 ARCHDISK1
brw-rw---- 1 grid asmadmin 8, 97 Oct 30 17:47 DATADISK1
brw-rw---- 1 grid asmadmin 8, 33 Oct 30 20:01 OCR_VOTE1
brw-rw---- 1 grid asmadmin 8, 49 Oct 30 20:02 OCR_VOTE2
brw-rw---- 1 grid asmadmin 8, 65 Oct 30 20:02 OCR_VOTE3
Use blkid to confirm that the LABEL information is back to normal.
(3) Restart CRS
All nodes start normally.
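The result of crsctl check crs can be verified in a script by looking for the four "online" messages. In the sketch below, CRS-4638 comes from the output shown at the top of this article, while CRS-4537, CRS-4529 and CRS-4533 are the standard healthy-state messages for CRS, CSS and EVM and do not appear in this case's output; crs_all_online is a hypothetical helper name.

```shell
# Read `crsctl check crs` output on stdin; succeed only if all four
# "is online" message codes are present.
# crs_all_online is a hypothetical helper name.
crs_all_online() {
    out=$(cat)
    for code in CRS-4638 CRS-4537 CRS-4529 CRS-4533; do
        printf '%s\n' "$out" | grep -q "^$code:" || return 1
    done
    return 0
}
```

On each node, the output of crsctl check crs can be piped into crs_all_online and the exit status tested. Given the failing output from the Problem Overview, the function returns non-zero.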
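Other Recommendations
Using ASMLib to manage ASM disks is not recommended. ASMLib was essentially created to solve the problem of device permissions and ownership, and to provide a "never-changing" device name. The native Linux udev mechanism can do the same, and does it better. ASMLib has largely fallen out of use by now, and binding device names with udev is strongly recommended.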
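As an illustration of the udev approach, the sketch below prints one rule line per disk. Everything in it is an assumption for illustration: the rule matches the first partition of an sd device by its ID_SERIAL udev property, creates a symlink named asm-ALIAS, and reuses the grid:asmadmin ownership seen in the listings above. asm_udev_rule is a hypothetical helper name, and the exact match keys should be adapted to the storage in use (multipath devices, for example, need a different match).

```shell
# Print one udev rule line that gives a partition a stable symlink and grid ownership.
# $1 = value of the device's ID_SERIAL udev property
# $2 = alias for the symlink (hypothetical, e.g. ocr_vote1)
# asm_udev_rule is a hypothetical helper name; adapt the match keys to your storage.
asm_udev_rule() {
    printf 'KERNEL=="sd?1", SUBSYSTEM=="block", ENV{ID_SERIAL}=="%s", SYMLINK+="asm-%s", OWNER="grid", GROUP="asmadmin", MODE="0660"\n' "$1" "$2"
}
```

The serial can be looked up with udevadm info --query=property --name=/dev/sdc1, and the generated lines would go into a rules file under /etc/udev/rules.d/ before reloading the rules. The ASM discovery string then has to point at the new symlinks.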
References
Linux: Voting Files Not Found Due To The Loss Of The ASMLIB Driver Information In The ASM Diskheader (Doc ID 1557191.1)