Cluster Fails to Start Because the ASMLIB Driver LABEL Information Is Missing from the OCR Disk Headers

Original article by yuqi.zhou, 2023-11-28

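Problem Overview

A recently installed 11.2 RAC cluster uses ASMLIB for its ASM disks. After the node 2 host was rebooted, CRS on that node could not start.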

[root@db2 ~]# /u01/app/11.2.0/grid/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager

Root Cause Analysis

(1) Log analysis

The GI alert log states the failure plainly: the cluster cannot start because no voting files were found.

[ohasd(2999)]CRS-1301:Oracle High Availability Service started on node db2.
2023-10-30 15:00:29.915: 
[ohasd(2999)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
2023-10-30 15:00:33.267: 
[/u01/app/11.2.0/grid/bin/orarootagent.bin(10754)]CRS-2302:Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running). 
2023-10-30 15:00:37.635: 
[gpnpd(10867)]CRS-2328:GPNPD started on node db2. 
2023-10-30 15:00:39.977: 
[cssd(11019)]CRS-1713:CSSD daemon is started in clustered mode
2023-10-30 15:00:41.814: 
[ohasd(2999)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2023-10-30 15:00:48.544: 
[cssd(11019)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/db2/cssd/ocssd.log
2023-10-30 15:01:03.551: 
[cssd(11019)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/db2/cssd/ocssd.log
2023-10-30 15:01:18.558: 
[cssd(11019)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/db2/cssd/ocssd.log
2023-10-30 15:01:33.564: 
[cssd(11019)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/db2/cssd/ocssd.log
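
The retry cadence in these entries can be checked mechanically. The following is a minimal Python sketch, not part of the original troubleshooting. It assumes the alert log layout shown above (a timestamp line followed by the message line); the function names `crs1714_timestamps` and `retry_intervals` are made up for this illustration, and the sample messages are trimmed copies of the entries above.

```python
from datetime import datetime

# Trimmed copies of the CRS-1714 entries from the alert log above.
SAMPLE_ALERT_LOG = """\
2023-10-30 15:00:48.544:
[cssd(11019)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds
2023-10-30 15:01:03.551:
[cssd(11019)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds
2023-10-30 15:01:18.558:
[cssd(11019)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds
2023-10-30 15:01:33.564:
[cssd(11019)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds
"""

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def crs1714_timestamps(log_text):
    """Return the timestamp that precedes every CRS-1714 message."""
    stamps = []
    last_seen = None
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        if "CRS-1714" in line:
            if last_seen is not None:
                stamps.append(last_seen)
            continue
        try:
            # Timestamp lines end with a colon, e.g. "2023-10-30 15:00:48.544:"
            last_seen = datetime.strptime(line.rstrip(":"), TIMESTAMP_FORMAT)
        except ValueError:
            pass  # some other message line; keep the previous timestamp
    return stamps


def retry_intervals(stamps):
    """Seconds between consecutive CRS-1714 messages."""
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    stamps = crs1714_timestamps(SAMPLE_ALERT_LOG)
    print("CRS-1714 occurrences:", len(stamps))
    print("Intervals (s):", retry_intervals(stamps))
```

A steady interval of about 15 seconds with no other CSSD progress in between suggests the daemon is stuck in voting file discovery rather than failing at a later stage.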

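The cssd log (ocssd.log) likewise reports "Voting file not found". It also shows, however, that discovery does find ARCHDISK1 and DATADISK1, the two disks that hold the archived logs and the data files respectively.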

2023-10-30 15:00:48.543: [    CSSD][136529664]clssnmvDDiscThread: using discovery string  for initial discovery 
2023-10-30 15:00:48.543: [   SKGFD][136529664]Discovery with str::
2023-10-30 15:00:48.543: [   SKGFD][136529664]UFS discovery with ::
2023-10-30 15:00:48.543: [   SKGFD][136529664]Execute glob on the string /dev/raw/*
2023-10-30 15:00:48.543: [   SKGFD][136529664]running stat on disk:/dev/raw/rawctl
2023-10-30 15:00:48.543: [   SKGFD][136529664]Fetching UFS disk :/dev/raw/rawctl:
2023-10-30 15:00:48.543: [   SKGFD][136529664]OSS discovery with ::
2023-10-30 15:00:48.543: [   SKGFD][136529664]Discovery with asmlib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: str ::
2023-10-30 15:00:48.543: [   SKGFD][136529664]Fetching asmlib disk :ORCL:ARCHDISK1:
2023-10-30 15:00:48.543: [   SKGFD][136529664]Fetching asmlib disk :ORCL:DATADISK1:
2023-10-30 15:00:48.543: [   SKGFD][136529664]Handle 0x7f5ff8096c50 from lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: for disk :ORCL:ARCHDISK1:
2023-10-30 15:00:48.543: [   SKGFD][136529664]Handle 0x7f5ff80975f0 from lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: for disk :ORCL:DATADISK1:
2023-10-30 15:00:48.544: [   SKGFD][136529664]Lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: closing handle 0x7f5ff8096c50 for disk :ORCL:ARCHDISK1:
2023-10-30 15:00:48.544: [   SKGFD][136529664]Lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: closing handle 0x7f5ff80975f0 for disk :ORCL:DATADISK1:
2023-10-30 15:00:48.544: [    CSSD][136529664]clssnmvDiskVerify: Successful discovery of 0 disks
2023-10-30 15:00:48.544: [    CSSD][136529664]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
2023-10-30 15:00:48.544: [    CSSD][136529664]clssnmvFindInitialConfigs: No voting files found
2023-10-30 15:00:48.544: [    CSSD][136529664](:CSSNM00070:)clssnmCompleteInitVFDiscovery: Voting file not found. Retrying discovery in 15 seconds
2023-10-30 15:00:49.108: [    CSSD][139220736]clssscSelect: cookie accept request 0x7f5ff4084740
2023-10-30 15:00:49.108: [    CSSD][139220736]clssscevtypSHRCON: getting client with cmproc 0x7f5ff4084740
2023-10-30 15:00:49.108: [    CSSD][139220736]clssgmRegisterClient: proc(4/0x7f5ff4084740), client(4/0x7f5ff40774d0)
2023-10-30 15:00:49.108: [    CSSD][139220736]clssgmExecuteClientRequest(): type(6) size(684) only connect and exit messages are allowed before lease acquisition 
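
The two facts read from this excerpt (which ASMLIB disks CSSD fetched, and how many voting disks it verified) can be extracted with a short script. This is a minimal Python sketch under the assumption that the log lines keep the wording shown above; `summarize_discovery` is a hypothetical helper name, and the sample holds only the relevant lines copied from the excerpt.

```python
import re

# Relevant lines copied from the ocssd.log excerpt above.
SAMPLE_OCSSD_LOG = """\
2023-10-30 15:00:48.543: [   SKGFD][136529664]Fetching asmlib disk :ORCL:ARCHDISK1:
2023-10-30 15:00:48.543: [   SKGFD][136529664]Fetching asmlib disk :ORCL:DATADISK1:
2023-10-30 15:00:48.544: [    CSSD][136529664]clssnmvDiskVerify: Successful discovery of 0 disks
2023-10-30 15:00:48.544: [    CSSD][136529664]clssnmvFindInitialConfigs: No voting files found
"""

FETCH_RE = re.compile(r"Fetching asmlib disk :ORCL:([A-Za-z0-9_]+):")
VERIFY_RE = re.compile(r"clssnmvDiskVerify: Successful discovery of (\d+) disks")


def summarize_discovery(log_text):
    """Return (sorted ASMLIB disk names fetched, last verified disk count or None)."""
    disks = sorted(set(FETCH_RE.findall(log_text)))
    counts = VERIFY_RE.findall(log_text)
    verified = int(counts[-1]) if counts else None
    return disks, verified


if __name__ == "__main__":
    disks, verified = summarize_discovery(SAMPLE_OCSSD_LOG)
    print("ASMLIB disks fetched by CSSD:", disks)
    print("Voting disks verified:", verified)
```

Disks that ASMLIB never presents do not appear in the "Fetching asmlib disk" lines at all, which is why the OCR/voting disks are absent here rather than reported as failed.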

(2) Check the ASM disks

On node 2, the three OCR disks are not listed, and a rescan still fails to detect them.

[root@db2 ~]# cd /dev/oracleasm/disks/
[root@db2 disks]# ll
total 0
brw-rw---- 1 grid asmadmin 8, 81 Oct 30 15:00 ARCHDISK1
brw-rw---- 1 grid asmadmin 8, 97 Oct 30 15:00 DATADISK1

[root@db2 ~]# oracleasm scandisks
Reloading disk partitions: done
Cleaning any stale ASM disks...
Scanning system for ASM disks...
[root@db2 ~]#
[root@db2 ~]# oracleasm listdisks
ARCHDISK1
DATADISK1

On node 1, all disks are visible.

[root@db1 ~]# oracleasm listdisks
ARCHDISK1
DATADISK1
OCR_VOTE1
OCR_VOTE2
OCR_VOTE3

[root@db1 ~]# ls -l /dev/oracleasm/disks
total 0
brw-rw---- 1 grid asmadmin 8, 65 Oct 26 11:38 ARCHDISK1
brw-rw---- 1 grid asmadmin 8, 81 Oct 26 11:38 DATADISK1
brw-rw---- 1 grid asmadmin 8, 17 Oct 26 11:38 OCR_VOTE1
brw-rw---- 1 grid asmadmin 8, 33 Oct 26 11:38 OCR_VOTE2
brw-rw---- 1 grid asmadmin 8, 49 Oct 26 11:38 OCR_VOTE3
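
The node-to-node comparison amounts to a set difference over the two `oracleasm listdisks` outputs. A minimal Python sketch of that comparison follows; the inputs are the listings shown above, and `missing_disks` is a hypothetical helper name, not an oracleasm feature.

```python
# Output of "oracleasm listdisks" on each node, copied from the sessions above.
NODE1_LISTDISKS = """\
ARCHDISK1
DATADISK1
OCR_VOTE1
OCR_VOTE2
OCR_VOTE3
"""

NODE2_LISTDISKS = """\
ARCHDISK1
DATADISK1
"""


def missing_disks(reference_output, suspect_output):
    """Return disk labels present on the reference node but not on the suspect node."""
    reference = set(reference_output.split())
    suspect = set(suspect_output.split())
    return sorted(reference - suspect)


if __name__ == "__main__":
    for label in missing_disks(NODE1_LISTDISKS, NODE2_LISTDISKS):
        print("Not visible on node 2:", label)
```

Node 1 still lists the OCR disks here, but its device nodes were created on Oct 26, before the reboot of node 2, so its listing is a reference for what should exist rather than proof that the on-disk labels are intact.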

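(3) Physical disk check

In the original screenshot, the three 5 GB disks are the OCR disks.

Their permissions are also identical to those of the other two, healthy ASM disks.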

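(4) Read the disk headers with kfed

kfed reads the disk headers of all three OCR disks without error.
[Screenshot: kfed output for the OCR disk headers]
From the checks above, the physical disks show no obvious problem…

The blkid command does reveal one anomaly: the healthy ASM disks carry LABEL information, while none of the three OCR disks do.
[Screenshot: blkid output]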
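
The blkid check can be automated by parsing its `KEY="value"` output. The sketch below is a minimal Python illustration. The sample lines are hypothetical (the original output survives only as a screenshot); the device names are inferred from the major/minor numbers and renamedisk commands elsewhere in this article, and whether blkid still prints a `TYPE` tag for an unlabeled disk is an assumption. For that reason the helper `devices_without_label` (a made-up name) only looks for a missing `LABEL` and also flags devices that blkid does not report at all.

```python
import re

# Hypothetical blkid output for node 2; not copied from the original screenshot.
SAMPLE_BLKID = """\
/dev/sdf1: LABEL="ARCHDISK1" TYPE="oracleasm"
/dev/sdg1: LABEL="DATADISK1" TYPE="oracleasm"
/dev/sdc1: TYPE="oracleasm"
/dev/sdd1: TYPE="oracleasm"
/dev/sde1: TYPE="oracleasm"
"""

TAG_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_blkid(blkid_output):
    """Map each device path to its dict of blkid tags."""
    devices = {}
    for line in blkid_output.splitlines():
        device, sep, rest = line.partition(": ")
        if sep:
            devices[device] = dict(TAG_RE.findall(rest))
    return devices


def devices_without_label(blkid_output, expected_devices):
    """Return the expected devices that have no LABEL tag (or are not listed)."""
    tags_by_device = parse_blkid(blkid_output)
    return [
        device
        for device in expected_devices
        if not tags_by_device.get(device, {}).get("LABEL")
    ]


if __name__ == "__main__":
    asm_devices = ["/dev/sdc1", "/dev/sdd1", "/dev/sde1", "/dev/sdf1", "/dev/sdg1"]
    print("No ASMLIB label:", devices_without_label(SAMPLE_BLKID, asm_devices))
```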

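(5) Confirm the LABEL information in the ASM disk headers

The ASMLIB LABEL is recorded in the kfdhdb.driver.provstr field of the ASM disk header. Comparing a healthy disk with an OCR disk quickly confirms that the problem is the missing LABEL information.
[Screenshot: kfed comparison of kfdhdb.driver.provstr]
Can the disk header be restored from its backup copy? On checking, the backup block is missing the LABEL information as well.
[Screenshot: kfed output for the backup header block]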
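
The provstr comparison can be expressed as a small parser. On an ASMLIB disk this field normally holds the prefix `ORCLDISK` followed directly by the label, so an empty remainder means the label is gone. The Python sketch below assumes the usual `kfed read` line layout; the two sample lines are hypothetical stand-ins for the screenshots, and `asmlib_label` is a made-up helper name.

```python
import re

# Hypothetical kfed lines in the usual layout; not copied from the screenshots.
HEALTHY_HEADER = "kfdhdb.driver.provstr: ORCLDISKDATADISK1 ; 0x000: length=17"
DAMAGED_HEADER = "kfdhdb.driver.provstr:         ORCLDISK ; 0x000: length=8"

PROVSTR_RE = re.compile(r"^kfdhdb\.driver\.provstr:\s*(\S+)\s*;", re.MULTILINE)
ASMLIB_PREFIX = "ORCLDISK"


def asmlib_label(kfed_output):
    """Return the label stored after ORCLDISK in provstr.

    Returns "" when the prefix is present but no label follows it, and
    None when no usable provstr line is found.
    """
    match = PROVSTR_RE.search(kfed_output)
    if match is None or not match.group(1).startswith(ASMLIB_PREFIX):
        return None
    return match.group(1)[len(ASMLIB_PREFIX):]


if __name__ == "__main__":
    for name, text in (("healthy", HEALTHY_HEADER), ("damaged", DAMAGED_HEADER)):
        print(name, "->", repr(asmlib_label(text)))
```

The same function can be run against the backup header block; in this case it would report an empty label there too, which is why restoring the header from the backup copy was not an option.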

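Solution

After confirming that the missing LABEL information on the OCR disks was the cause, I found the same case on MOS (see References). The note does not explain why the LABEL was lost; an unidentified bug may have been triggered. In the note's own words:

The devices still had the OCR and voting file content, but were missing the ASMLIB driver information, it is unclear how they got into this state.

Note:
The solution below modifies ASM disk header information and carries a high risk. If something goes wrong, the cluster may be unable to start. Assess the risk thoroughly before proceeding, and take a full database backup first.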


(1) Stop CRS on all nodes (including the healthy node)

crsctl stop crs

(2) Repair the ASM disk headers with the oracleasm renamedisk command

[root@db2 ~]# /usr/sbin/oracleasm renamedisk -f /dev/sdc1 OCR_VOTE1
Writing disk header: done
Instantiating disk "OCR_VOTE1": done
[root@db2 ~]# /usr/sbin/oracleasm renamedisk -f /dev/sdd1 OCR_VOTE2
Writing disk header: done
Instantiating disk "OCR_VOTE2": done
[root@db2 ~]# /usr/sbin/oracleasm renamedisk -f /dev/sde1 OCR_VOTE3
Writing disk header: done
Instantiating disk "OCR_VOTE3": done

[root@db2 ~]# ls -l /dev/oracleasm/disks
total 0
brw-rw---- 1 grid asmadmin 8, 81 Oct 30 17:47 ARCHDISK1
brw-rw---- 1 grid asmadmin 8, 97 Oct 30 17:47 DATADISK1
brw-rw---- 1 grid asmadmin 8, 33 Oct 30 20:01 OCR_VOTE1
brw-rw---- 1 grid asmadmin 8, 49 Oct 30 20:02 OCR_VOTE2
brw-rw---- 1 grid asmadmin 8, 65 Oct 30 20:02 OCR_VOTE3
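
Because a wrong device-to-label pairing would write the wrong label into a disk header, it can help to generate the commands from an explicit mapping and review them before running anything. The following is a minimal Python sketch that only builds and prints the command lines; it does not execute them. The mapping comes from the session above, and `build_renamedisk_commands` is a hypothetical helper name.

```python
import shlex

# Device-to-label mapping taken from the renamedisk session above (node db2).
REPAIR_PLAN = {
    "/dev/sdc1": "OCR_VOTE1",
    "/dev/sdd1": "OCR_VOTE2",
    "/dev/sde1": "OCR_VOTE3",
}


def build_renamedisk_commands(plan):
    """Return the renamedisk command lines for review; nothing is executed."""
    labels = list(plan.values())
    if len(set(labels)) != len(labels):
        raise ValueError("each device must map to a distinct label")
    return [
        shlex.join(["/usr/sbin/oracleasm", "renamedisk", "-f", device, label])
        for device, label in sorted(plan.items())
    ]


if __name__ == "__main__":
    for command in build_renamedisk_commands(REPAIR_PLAN):
        print(command)
```

The mapping itself still has to be verified by hand, for example against disk sizes and the major/minor numbers seen on the healthy node, since device names such as /dev/sdc1 are not guaranteed to match between nodes.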

blkid confirms that the LABEL information is back to normal.
[Screenshot: blkid output after the repair]

(3) Restart CRS

All nodes start normally.

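Other Recommendations

Using ASMLIB to manage ASM disks is not recommended. ASMLIB exists essentially to solve the problems of device permissions and device ownership, and to provide a device name that "never changes". Linux's native udev can do the same, and does it better. ASMLIB has largely fallen out of use, and binding device names with udev is strongly recommended instead.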

References

Linux: Voting Files Not Found Due To The Loss Of The ASMLIB Driver Information In The ASM Diskheader (Doc ID 1557191.1)

Last modified: 2023-11-29 09:50:04