
zData repeatedly raises false "ASM disk offline" alarms

Original article by stofm, 2022-11-22

Applies to

zData 4.9.0
Oracle 11.2.0.4

Problem overview and resolution

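1. On October 31, the zData web page at a life-insurance customer (architecture: 4 compute nodes + 3 storage nodes) repeatedly reported that an ASM disk named ocr_0000 was offline. A check showed every ASM disk in a normal state, and no disk named ocr_0000 exists. Before the alarms appeared, the historical data in the zmon database on the monitoring node had been cleaned up.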

2. Alarm shown on the zData web page:

image.png

3. The alarm email reads as follows:
image.png

4. A check of the disks in the OCR disk group shows that all of them are in a normal state:
image.png
image.png

5. The alarms began at 12:06:28 on October 31, which is exactly when the historical data in the zmon database on the zData monitoring node was cleaned up.
image.png

6. On any etcd node, use etcdctl to walk the keys under /hms/oracle down to the ASM disk entries. Every compute node carries the erroneous ocr_0000 entry:

[root@zcce01 /]# etcdctl -u root:xxxx ls /hms/oracle/zcdb01/asm_disk/ocr
/hms/oracle/zcdb01/asm_disk/ocr/ocr_0003
/hms/oracle/zcdb01/asm_disk/ocr/ocr_0000
/hms/oracle/zcdb01/asm_disk/ocr/ocr_0001
/hms/oracle/zcdb01/asm_disk/ocr/ocr_0002

[root@zcce01 /]# etcdctl -u root:xxxx ls /hms/oracle/zcdb02/asm_disk/ocr
/hms/oracle/zcdb02/asm_disk/ocr/ocr_0002
/hms/oracle/zcdb02/asm_disk/ocr/ocr_0003
/hms/oracle/zcdb02/asm_disk/ocr/ocr_0000
/hms/oracle/zcdb02/asm_disk/ocr/ocr_0001

[root@zcce01 /]# etcdctl -u root:xxxx ls /hms/oracle/zcdb03/asm_disk/ocr
/hms/oracle/zcdb03/asm_disk/ocr/ocr_0000
/hms/oracle/zcdb03/asm_disk/ocr/ocr_0001
/hms/oracle/zcdb03/asm_disk/ocr/ocr_0002
/hms/oracle/zcdb03/asm_disk/ocr/ocr_0003

[root@zcce01 /]# etcdctl -u root:xxxx ls /hms/oracle/zcdb04/asm_disk/ocr
/hms/oracle/zcdb04/asm_disk/ocr/ocr_0000
/hms/oracle/zcdb04/asm_disk/ocr/ocr_0001
/hms/oracle/zcdb04/asm_disk/ocr/ocr_0002
/hms/oracle/zcdb04/asm_disk/ocr/ocr_0003
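
The four listings above repeat the same command once per compute node, so they can be wrapped in a small loop. The following is only a sketch: the helper names asm_disk_dir and list_asm_disks and the ETCD_USER environment variable are hypothetical, while the node names, the key layout, and the etcdctl v2 "ls" syntax are taken from the transcript above.

```shell
#!/bin/sh
# Compute node names as they appear in the listings above.
NODES="zcdb01 zcdb02 zcdb03 zcdb04"

# Build the etcd directory that holds one node's ASM disk entries
# for a given disk group (hypothetical helper).
asm_disk_dir() {
    printf '/hms/oracle/%s/asm_disk/%s\n' "$1" "$2"
}

# List the disk keys of one disk group on every compute node.
# ETCD_USER is an assumed variable holding "user:password".
list_asm_disks() {
    group=$1
    for node in $NODES; do
        etcdctl -u "${ETCD_USER:?set ETCD_USER to user:password}" \
            ls "$(asm_disk_dir "$node" "$group")" | sort
    done
}

# Example (run on an etcd node):
#   export ETCD_USER=root:xxxx
#   list_asm_disks ocr
```

Sorting each listing makes it easier to compare the nodes side by side and to spot a key that does not correspond to a real disk.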

7. Manually delete the erroneous entry for each compute node in turn:

[root@zcce01 /]# etcdctl -u root:xxxx rm  /hms/oracle/zcdb01/asm_disk/ocr/ocr_0000
PrevNode.Value: {"bytes_read":"","bytes_written":"","create_date":"03-JUL-20","disk_number":"0","failgroup":"OCR_0000","failgroup_label":"","failgroup_type":"REGULAR","free_mb":"4772","group_name":"OCR","group_number":"4","header_stat":"UNKNOWN","incarn":"3941234489","label":"","library":"System","meta_host_ip":"xx.xx.xx.xx","meta_hostname":"zcdb01","meta_listen_port":":8091","meta_updated_time":"2020-07-10 15:54:11","meta_updated_timestamp":1594367651,"mode_stat":"OFFLINE","mount_date":"03-JUL-20","mount_stat":"MISSING","name":"OCR_0000","os_mb":"0","path":"","product":"","read_errors":"","read_time":"","reads":"","redund":"UNKNOWN","repair_timer":"6371","site_guid":"","site_label":"","site_name":"","site_status":"","state":"NORMAL","total_mb":"5120","udid":"","voting_file":"N","write":"","write_errors":"","write_time":""}
[root@zcce01 /]# etcdctl -u root:xxxx rm  /hms/oracle/zcdb02/asm_disk/ocr/ocr_0000
PrevNode.Value: {"bytes_read":"16793600","bytes_written":"4096","create_date":"03-JUL-20","disk_number":"0","failgroup":"OCR_0000","failgroup_label":"","failgroup_type":"REGULAR","free_mb":"4684","group_name":"OCR","group_number":"4","header_stat":"MEMBER","incarn":"3959303653","label":"","library":"System","meta_host_ip":"xx.xx.xx.xx","meta_hostname":"zcdb02","meta_listen_port":":8091","meta_updated_time":"2020-07-10 14:02:03","meta_updated_timestamp":1594360923,"mode_stat":"ONLINE","mount_date":"03-JUL-20","mount_stat":"CACHED","name":"OCR_0000","os_mb":"5120","path":"/dev/mapper/ZDATA_HDISK_ZCCE01_081","product":"","read_errors":"1","read_time":"53.830744","reads":"4101","redund":"UNKNOWN","repair_timer":"0","site_guid":"","site_label":"","site_name":"","site_status":"","state":"NORMAL","total_mb":"5120","udid":"","voting_file":"Y","write":"1","write_errors":"0","write_time":".000278"}
[root@zcce01 /]# etcdctl -u root:xxxx rm  /hms/oracle/zcdb03/asm_disk/ocr/ocr_0000
PrevNode.Value: {"bytes_read":"","bytes_written":"","create_date":"03-JUL-20","disk_number":"0","failgroup":"OCR_0000","failgroup_label":"","failgroup_type":"REGULAR","free_mb":"4772","group_name":"OCR","group_number":"4","header_stat":"UNKNOWN","incarn":"3957926746","label":"","library":"System","meta_host_ip":"xx.xx.xx.xx","meta_hostname":"zcdb03","meta_listen_port":":8091","meta_updated_time":"2020-07-10 15:55:49","meta_updated_timestamp":1594367749,"mode_stat":"OFFLINE","mount_date":"03-JUL-20","mount_stat":"MISSING","name":"OCR_0000","os_mb":"0","path":"","product":"","read_errors":"","read_time":"","reads":"","redund":"UNKNOWN","repair_timer":"6371","site_guid":"","site_label":"","site_name":"","site_status":"","state":"NORMAL","total_mb":"5120","udid":"","voting_file":"N","write":"","write_errors":"","write_time":""}
[root@zcce01 /]# etcdctl -u root:xxxx rm  /hms/oracle/zcdb04/asm_disk/ocr/ocr_0000
PrevNode.Value: {"bytes_read":"16740352","bytes_written":"4096","create_date":"03-JUL-20","disk_number":"0","failgroup":"OCR_0000","failgroup_label":"","failgroup_type":"REGULAR","free_mb":"4684","group_name":"OCR","group_number":"4","header_stat":"MEMBER","incarn":"3934153297","label":"","library":"System","meta_host_ip":"xx.xx.xx.xx","meta_hostname":"zcdb04","meta_listen_port":":8091","meta_updated_time":"2020-07-10 14:04:31","meta_updated_timestamp":1594361071,"mode_stat":"ONLINE","mount_date":"03-JUL-20","mount_stat":"CACHED","name":"OCR_0000","os_mb":"5120","path":"/dev/mapper/ZDATA_HDISK_ZCCE01_081","product":"","read_errors":"0","read_time":"65.16166","reads":"4087","redund":"UNKNOWN","repair_timer":"0","site_guid":"","site_label":"","site_name":"","site_status":"","state":"NORMAL","total_mb":"5120","udid":"","voting_file":"Y","write":"1","write_errors":"0","write_time":".004119"}
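
All four removed values carry a meta_updated_time of 2020-07-10, more than two years before the alarm, and the values for zcdb01 and zcdb03 show mode_stat OFFLINE with mount_stat MISSING. That suggests, though the article does not confirm it, that these were leftover entries rather than live status. Before deleting a key it may be worth checking its age first. The sketch below assumes the value format shown in the PrevNode.Value output above; the function names meta_timestamp and is_stale and the one-day threshold are hypothetical.

```shell
#!/bin/sh
# Extract the numeric meta_updated_timestamp field from one JSON value
# read on stdin (plain sed, no JSON parser required).
meta_timestamp() {
    sed -n 's/.*"meta_updated_timestamp":\([0-9][0-9]*\).*/\1/p'
}

# Succeed (exit status 0) when the entry is older than max_age seconds.
# Arguments: entry timestamp, current timestamp, max_age in seconds.
is_stale() {
    [ $(( $2 - $1 )) -gt "$3" ]
}

# Example (run on an etcd node; ETCD_USER is an assumed variable):
#   key=/hms/oracle/zcdb01/asm_disk/ocr/ocr_0000
#   ts=$(etcdctl -u "$ETCD_USER" get "$key" | meta_timestamp)
#   if is_stale "$ts" "$(date +%s)" 86400; then
#       echo "$key was last updated more than a day ago"
#   fi
```

An entry that is still refreshed by the monitoring agent should have a recent timestamp, so an old one is a reasonable hint that the key is no longer maintained.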

8. In the monitoring alarms view of the zData web page, select the existing ASM alarm records and delete them one by one.
Screenshot omitted…

9. After the erroneous entries were deleted, the false alarms cleared and a recovery email was received:
image.png

Root cause

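Cleaning up the historical data in the zData zMon database may leave erroneous entries in etcd, which the monitoring then reports as false alarms.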

Solution

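1. On any etcd node, use etcdctl to walk the keys under /hms/oracle down to the ASM disk entries, find the falsely reported entry for each compute node in turn, and delete it;
2. In the monitoring web page, select the existing ASM alarm records and delete them one by one.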
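
Because deleting keys from etcd cannot be undone easily, one cautious way to carry out the first step is to generate the removal commands and review them before running anything. This is a sketch only: print_rm_commands is a hypothetical helper and ETCD_USER an assumed environment variable, while the key layout and the etcdctl v2 "rm" syntax follow the commands used in the article.

```shell
#!/bin/sh
# Print (do not execute) one "etcdctl rm" command per compute node.
# Arguments: disk group, disk key name, then the node names.
print_rm_commands() {
    group=$1
    disk=$2
    shift 2
    for node in "$@"; do
        printf 'etcdctl -u "$ETCD_USER" rm /hms/oracle/%s/asm_disk/%s/%s\n' \
            "$node" "$group" "$disk"
    done
}

# Example: write the commands to a file, read them, then run them.
#   print_rm_commands ocr ocr_0000 zcdb01 zcdb02 zcdb03 zcdb04 > rm_stale.sh
#   cat rm_stale.sh
#   ETCD_USER=root:xxxx sh rm_stale.sh
```

Keeping the generated file also leaves a record of exactly which keys were removed.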
