RAC Heartbeat Mechanisms

Original · 乔治和猫 · 2022-11-18

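Heartbeat types

  • Cluster heartbeat (cssdagent, cssdmonitor)
      1. Checks connectivity between nodes
      2. Keeps node connectivity information in a shared location, recording and updating it promptly
      3. Self-monitoring of the local node
  • Network heartbeat (Network HeartBeat, NHB)
    • Ensures connectivity between nodes so that each node's state can be confirmed
    • The ocssd.bin process sends a network heartbeat to the other nodes every second and takes action when the heartbeat fails
    • Related ocssd.bin threads
      • Sending thread: sends a network heartbeat to the other nodes every second
      • Analysis thread: analyzes the heartbeat information; if a node's heartbeats keep going missing, it notifies the cluster to reconfigure
      • Dispatch thread: receives messages and delivers them to the appropriate thread
      • Cluster reconfiguration thread: starts the reconfiguration when it receives the notification from the analysis thread
  • Disk heartbeat (Disk HeartBeat, DHB)
    • Comes from the voting disk
    • Resolves split-brain
    • Once split-brain occurs, the reconfiguration thread uses the information on the voting disk to learn the connectivity between cluster nodes, and from that determines how many sub-clusters the cluster will split into
    • Related threads
      • Disk heartbeat thread: writes disk heartbeats to the voting disk, and also reads the kill block on the voting disk to decide whether this node should restart
      • Disk heartbeat monitor thread: monitors whether the disk heartbeat thread can send heartbeats normally and read the kill block correctly
      • Kill block thread: monitors the kill block information in the voting file (VF)
    • Use an odd number of voting disks, so that more than half of them can always be accessed
  • Local heartbeat (Local HeartBeat, LHB)
    • Monitors ocssd.bin and the state of the local node
    • Along with the per-second network heartbeat, the state of the local ocssd.bin is sent to the local cssdagent and cssdmonitor
    • Related threads
      • Sending thread
    • In 11.2+, the local state is folded into the overall heartbeat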
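
The two rules in the list above (a node is suspect once its network heartbeats have been missing for too long, and a node needs more than half of the voting disks) can be sketched as a toy model. This is only an illustration, not how ocssd.bin is implemented: the `NodeView` class and both function names are hypothetical, and the 30-second default is the misscount value queried later in this article.

```python
from dataclasses import dataclass


@dataclass
class NodeView:
    """Hypothetical record of what the local node knows about a peer."""
    name: str
    last_nhb: float  # time, in seconds, when the last network heartbeat arrived


def nodes_missing_nhb(nodes, now, misscount=30):
    """Return the peers whose last network heartbeat is at least misscount seconds old.

    Toy model only: the real analysis thread is far more involved.
    """
    return [n.name for n in nodes if now - n.last_nhb >= misscount]


def has_voting_majority(accessible, total):
    """True when strictly more than half of the voting disks are reachable."""
    return 2 * accessible > total


peers = [NodeView("oel7n01", last_nhb=100.0), NodeView("oel7n02", last_nhb=65.0)]
print(nodes_missing_nhb(peers, now=100.0))
print(has_voting_majority(2, 3), has_voting_majority(2, 4))
```

The second call hints at why an odd count is recommended: with four voting disks a node still needs three of them, so the fourth disk does not let the cluster tolerate any more failures than three disks would.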

Log analysis

  • Analyzing the ocssd.trc log
2022-11-18 09:45:24.016 :    CSSD:3559126784: [     INFO] clssnmSendingThread: sending status msg to all nodes
2022-11-18 09:45:24.017 :    CSSD:3559126784: [     INFO] clssnmSendingThread: sent 5 status msgs to all nodes ---> [local heartbeat]
2022-11-18 09:45:25.548 :    CSSD:3584321280: [     INFO] clssgmcpGroupDataResp: sending type 5, size 164, status 0 to clientID 1:23:0
2022-11-18 09:45:25.809 :    CSSD:3587475200: [     INFO]   : Processing member data change type 1, size 4 for group HB+ASM, memberID 17:2:1 ---> [ASM heartbeat]
2022-11-18 09:45:25.809 :    CSSD:3587475200: [     INFO]   : Sending member data change to GMP for group HB+ASM, memberID 17:2:1
2022-11-18 09:45:25.810 :    CSSD:3599832832: [     INFO] clssgmpcMemberDataUpdt: grockName HB+ASM memberID 17:2:1, datatype 1 datasize 4    ---> [ASM heartbeat update]
2022-11-18 09:45:25.810 :    CSSD:3584321280: [     INFO] clssgmcpDataUpdtCmpl: Status 0 mbr data updt memberID 17:2:1 from clientID 1:41:2
2022-11-18 09:45:26.450 :    CSSD:3584321280: [     INFO] clssgmcpGroupDataResp: Completed request with sequence number(201) for clientID 1:42:0
2022-11-18 09:45:26.450 :    CSSD:3584321280: [     INFO] clssgmcpGroupDataResp: sending type 5, size 167, status 0 to clientID 1:42:0
2022-11-18 09:45:27.866 :    CSSD:3587475200: [     INFO]   : Processing member data change type 1, size 4 for group HB+ASM, memberID 17:2:1
2022-11-18 09:45:27.866 :    CSSD:3587475200: [     INFO]   : Sending member data change to GMP for group HB+ASM, memberID 17:2:1
2022-11-18 09:45:27.866 :    CSSD:3599832832: [     INFO] clssgmpcMemberDataUpdt: grockName HB+ASM memberID 17:2:1, datatype 1 datasize 4 
2022-11-18 09:45:27.866 :    CSSD:3584321280: [     INFO] clssgmcpDataUpdtCmpl: Status 0 mbr data updt memberID 17:2:1 from clientID 1:41:2
2022-11-18 09:45:28.370 :    CSSD:3591948032: [     INFO] clssgmpcGMCReqWorkerThread: processing msg (0x7f9cc40414f0) type 2, msg size 76, payload (0x7f9cc404151c) size 32, sequence 2232, for clientID 1:41:2
2022-11-18 09:45:28.639 :    CSSD:3584321280: [     INFO] clssgmcpGroupDataResp: Completed request with sequence number(202) for clientID 1:42:0
2022-11-18 09:45:28.639 :    CSSD:3584321280: [     INFO] clssgmcpGroupDataResp: sending type 5, size 167, status 0 to clientID 1:42:0
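
To read the rhythm of these messages without eyeballing timestamps, the trace lines can be split into fields and the gaps between matching lines measured. The sketch below assumes every line follows the layout seen in the excerpt above (timestamp, component, thread ID, level, message); the regular expression and the helper names are my own and are not an Oracle-provided format definition.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt: "<timestamp> :    CSSD:<thread>: [     INFO] <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) :\s+"
    r"(?P<component>\w+):(?P<thread>\d+): \[\s*(?P<level>\w+)\]\s*(?P<message>.*)$"
)


def parse_line(line):
    """Split one trace line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["ts"] = datetime.strptime(record["ts"], "%Y-%m-%d %H:%M:%S.%f")
    return record


def intervals(lines, needle):
    """Seconds between consecutive lines whose message contains needle."""
    records = (parse_line(line) for line in lines)
    stamps = [r["ts"] for r in records if r and needle in r["message"]]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


SAMPLE = [
    "2022-11-18 09:45:25.809 :    CSSD:3587475200: [     INFO]   : Processing member data change type 1, size 4 for group HB+ASM, memberID 17:2:1",
    "2022-11-18 09:45:27.866 :    CSSD:3587475200: [     INFO]   : Processing member data change type 1, size 4 for group HB+ASM, memberID 17:2:1",
]
print(intervals(SAMPLE, "Processing member data change"))
```

On the two HB+ASM lines copied from the excerpt, the gap works out to roughly two seconds.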

  • NHB
[root@oel7n01 trace]# cat ocssd.trc |grep NHB

  • DHB
[root@oel7n01 trace]# cat ocssd.trc |grep DHB
2022-12-06 21:42:57.386 :    CSSD:1122473728: [     INFO] clssnmvReadDskHeartbeat: Reading DHBs to get the latest info for node(2/oel7n02), LATSvalid(0), nodeInfoDHB uniqueness(0)
2022-12-06 21:42:57.386 :    CSSD:1122473728: [     INFO] clssnmvDHBValidateNcopy: Saving DHB uniqueness for node(2/oel7n02), latestInfo(1670334162), readInfo(1670334162), nodeInfoDHB(0)
2022-12-06 21:42:57.386 :    CSSD:1122473728: [     INFO] clssnmvDHBValidateNcopy: Setting LATS valid due to second DHB seen on disk(0x7fc33c0fa110) for node(2/oel7n02) nodeStatus 0x1
2022-12-06 21:49:34.317 :    CSSD:1122473728: [     INFO] clssnmvReadDskHeartbeat: Reading DHBs to get the latest info for node(2/oel7n02), LATSvalid(0), nodeInfoDHB uniqueness(1670334162)
2022-12-06 21:49:34.317 :    CSSD:1122473728: [     INFO] clssnmvDHBValidateNcopy: Setting LATS valid due to uniqueness change for node(2/oel7n02), nodeInfoDHB(1670334162), readInfo(1670334565)
2022-12-06 21:49:34.317 :    CSSD:1122473728: [     INFO] clssnmvDHBValidateNcopy: Saving DHB uniqueness for node(2/oel7n02), latestInfo(1670334162), readInfo(1670334565), nodeInfoDHB(1670334162)
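
The DHB lines above carry their values as `name(value)` pairs, so a small helper can pull out the numeric ones and flag a line where the uniqueness read from disk differs from the one already recorded. This is a sketch under two assumptions of mine: that a `nodeInfoDHB` of 0 means "nothing recorded yet" (suggested by the first line of the excerpt, but not confirmed), and that comparing `nodeInfoDHB` with `readInfo` is a fair reading of the "uniqueness change" message. The function names are hypothetical.

```python
import re

# Matches pairs such as "readInfo(1670334565)"; non-numeric ones like "node(2/oel7n02)" are skipped.
FIELD_RE = re.compile(r"(\w+)\((\d+)\)")


def numeric_fields(line):
    """Return every name(number) pair found in a trace line as a dict."""
    return {name: int(value) for name, value in FIELD_RE.findall(line)}


def uniqueness_changed(line):
    """True when a previously recorded uniqueness differs from the one just read.

    Assumption: nodeInfoDHB(0) means no uniqueness had been recorded yet.
    """
    fields = numeric_fields(line)
    recorded = fields.get("nodeInfoDHB", 0)
    read = fields.get("readInfo")
    return recorded != 0 and read is not None and recorded != read


FIRST_SEEN = "clssnmvDHBValidateNcopy: Saving DHB uniqueness for node(2/oel7n02), latestInfo(1670334162), readInfo(1670334162), nodeInfoDHB(0)"
CHANGED = "clssnmvDHBValidateNcopy: Setting LATS valid due to uniqueness change for node(2/oel7n02), nodeInfoDHB(1670334162), readInfo(1670334565)"

print(numeric_fields(CHANGED))
print(uniqueness_changed(FIRST_SEEN), uniqueness_changed(CHANGED))
```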


  • LHB
[root@oel7n01 trace]# cat ocssd.trc |grep LHB

Default values and how to change them

  • Querying the default NHB and DHB values
#NHB
[root@oel7n01 ~]# crsctl get css misscount
CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.

#DHB
[root@oel7n01 ~]# crsctl get css disktimeout
CRS-4678: Successful get disktimeout 200 for Cluster Synchronization Services.

# The default threshold is 30 s for the network heartbeat and 200 s for the disk heartbeat
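
For scripted checks, the value can be pulled out of the CRS-4678 message shown above. A minimal sketch, assuming the message wording is exactly what this transcript shows; `parse_css_get` is a hypothetical helper, not part of any Oracle tooling.

```python
import re

# Pattern copied from the wording of the two CRS-4678 lines in the transcript.
GET_RE = re.compile(
    r"CRS-4678: Successful get (\w+) (\d+) for Cluster Synchronization Services\."
)


def parse_css_get(output):
    """Return (parameter, value) from the output of `crsctl get css <parameter>`."""
    match = GET_RE.search(output)
    if match is None:
        raise ValueError(f"unexpected crsctl output: {output!r}")
    return match.group(1), int(match.group(2))


print(parse_css_get("CRS-4678: Successful get misscount 30 for Cluster Synchronization Services."))
print(parse_css_get("CRS-4678: Successful get disktimeout 200 for Cluster Synchronization Services."))
```

On a real cluster node the input string would come from running the command itself, for example through `subprocess.run`, which is left out here so the example runs anywhere.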
  • Changing the NHB and DHB values
#NHB
[root@oel7n02 ~]# crsctl set css misscount 50
CRS-4678: Successful set of parameter misscount to 50 for Cluster Synchronization Services.

[root@oel7n01 ~]# crsctl get css misscount
CRS-4678: Successful get misscount 50 for Cluster Synchronization Services.
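# Note: the change was made on node 2, and querying from node 1 returns the same value, so the heartbeat timeout is consistent across nodes.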


#DHB
[root@oel7n01 ~]# crsctl get css disktimeout
CRS-4678: Successful get disktimeout 200 for Cluster Synchronization Services.


[root@oel7n01 trace]# crsctl set css disktimeout 50
CRS-4696: Failed to set parameter disktimeout to 50 due to conflicting parameter misscount; the new value for disktimeout must be greater than 50.
[root@oel7n01 trace]# crsctl set css disktimeout 51
CRS-4684: Successful set of parameter disktimeout to 51 for Cluster Synchronization Services.
##### disktimeout must be greater than misscount, so with misscount at 50 the smallest accepted value is 51
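
The CRS-4696 rejection above amounts to a one-line rule that can be checked before calling crsctl. A minimal sketch based only on that message; `disktimeout_allowed` is a hypothetical name, and any other constraints crsctl may enforce are not modeled.

```python
def disktimeout_allowed(disktimeout, misscount):
    """Mirror the rule from the CRS-4696 message: disktimeout must exceed misscount."""
    return disktimeout > misscount


# With misscount set to 50, as in the transcript:
print(disktimeout_allowed(50, 50))  # the value that was rejected with CRS-4696
print(disktimeout_allowed(51, 50))  # the value that was accepted
```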


[root@oel7n01 trace]# crsctl set css disktimeout 1000
CRS-4684: Successful set of parameter disktimeout to 1000 for Cluster Synchronization Services.

#################### Easter egg
# Let's see how large the value can go
[root@oel7n01 trace]# crsctl set css disktimeout 100000000
CRS-4684: Successful set of parameter disktimeout to 100000000 for Cluster Synchronization Services.
[root@oel7n01 trace]# crsctl set css disktimeout 10000000000000000000000
Negative values are not allowed for parameter disktimeout.
[root@oel7n01 trace]# crsctl set css disktimeout 1000000000000000000
Negative values are not allowed for parameter disktimeout.
[root@oel7n01 trace]# crsctl set css disktimeout 1000000000000000
Negative values are not allowed for parameter disktimeout.
[root@oel7n01 trace]# crsctl set css disktimeout 10000000000000
CRS-4684: Successful set of parameter disktimeout to 1316134912 for Cluster Synchronization Services.
[root@oel7n01 trace]# crsctl set css disktimeout 1316134913
CRS-4684: Successful set of parameter disktimeout to 1316134913 for Cluster Synchronization Services.
[root@oel7n01 trace]#  crsctl set css disktimeout 2000000000
CRS-4684: Successful set of parameter disktimeout to 2000000000 for Cluster Synchronization Services.
[root@oel7n01 trace]#  crsctl set css disktimeout 20000000000
Negative values are not allowed for parameter disktimeout.
[root@oel7n01 trace]#  crsctl set css disktimeout 9000000000
CRS-4684: Successful set of parameter disktimeout to 410065408 for Cluster Synchronization Services.
[root@oel7n01 trace]#  crsctl set css disktimeout 9000000000
CRS-4684: Successful set of parameter disktimeout to 410065408 for Cluster Synchronization Services.
[root@oel7n01 trace]#  crsctl set css disktimeout 9000000000
CRS-4684: Successful set of parameter disktimeout to 410065408 for Cluster Synchronization Services.
[root@oel7n01 trace]#  crsctl set css disktimeout 9000099999
CRS-4684: Successful set of parameter disktimeout to 410165407 for Cluster Synchronization Services.
[root@oel7n01 trace]#  crsctl set css disktimeout 499999999
CRS-4684: Successful set of parameter disktimeout to 499999999 for Cluster Synchronization Services.
[root@oel7n01 trace]#  crsctl set css disktimeout 999999999
CRS-4684: Successful set of parameter disktimeout to 999999999 for Cluster Synchronization Services.
[root@oel7n01 trace]#  crsctl set css disktimeout 9999999999
CRS-4684: Successful set of parameter disktimeout to 1410065407 for Cluster Synchronization Services.
[root@oel7n01 trace]#  crsctl set css disktimeout 99999999999
CRS-4684: Successful set of parameter disktimeout to 1215752191 for Cluster Synchronization Services.
[root@oel7n01 trace]#  crsctl set css disktimeout 999999999999
Negative values are not allowed for parameter disktimeout.
[root@oel7n01 trace]#  crsctl set css disktimeout 99999999999
CRS-4684: Successful set of parameter disktimeout to 1215752191 for Cluster Synchronization Services.
[root@oel7n01 trace]#  crsctl set css disktimeout 1215759999
CRS-4684: Successful set of parameter disktimeout to 1215759999 for Cluster Synchronization Services.
[root@oel7n01 trace]#  crsctl set css disktimeout 1219999999
CRS-4684: Successful set of parameter disktimeout to 1219999999 for Cluster Synchronization Services.
[root@oel7n01 trace]#  crsctl set css disktimeout 9999999999
CRS-4684: Successful set of parameter disktimeout to 1410065407 for Cluster Synchronization Services.
......
[root@oel7n01 trace]#  crsctl set css disktimeout 2200099999
Negative values are not allowed for parameter disktimeout.
[root@oel7n01 trace]#  crsctl set css disktimeout 2109999999
CRS-4684: Successful set of parameter disktimeout to 2109999999 for Cluster Synchronization Services.
[root@oel7n01 trace]#  crsctl set css disktimeout 2109999999



[root@oel7n01 trace]# crsctl set css disktimeout 1000
CRS-4684: Successful set of parameter disktimeout to 1000 for Cluster Synchronization Services.
[root@oel7n01 trace]#  crsctl set css disktimeout 2109999999
CRS-4684: Successful set of parameter disktimeout to 2109999999 for Cluster Synchronization Services.
[root@oel7n01 trace]# crsctl set css disktimeout 1000
CRS-4684: Successful set of parameter disktimeout to 1000 for Cluster Synchronization Services.
[root@oel7n01 trace]# crsctl set css disktimeout 200
CRS-4684: Successful set of parameter disktimeout to 200 for Cluster Synchronization Services.
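
Every result in the Easter egg transcript above is consistent with one simple hypothesis: the requested number is cut down to its low 32 bits and then read as a signed 32-bit integer, with negative results rejected. This is an inference from the outputs alone, not documented crsctl behavior, and `as_int32` and `predicted_result` are hypothetical names used only to test that reading.

```python
def as_int32(value):
    """Keep the low 32 bits of value and reinterpret them as a signed integer."""
    low = value & 0xFFFFFFFF
    return low - (1 << 32) if low >= (1 << 31) else low


def predicted_result(requested):
    """Predict what the transcript would show for a requested disktimeout."""
    stored = as_int32(requested)
    if stored < 0:
        return "Negative values are not allowed"
    return f"set to {stored}"


# Requests taken from the transcript above.
for requested in (10000000000000, 9000000000, 99999999999, 2200099999, 2109999999):
    print(requested, "->", predicted_result(requested))
```

If the hypothesis holds, the largest value that can be stored would be 2147483647 (2^31 - 1). The transcript does not test that exact number, so treat it as a prediction rather than a confirmed limit.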

Last modified: 2022-12-21 15:14:23