Introduction to CTSS
The Cluster Time Synchronization Service (CTSS) can detect time synchronization problems among the nodes of a cluster.
CTSS is installed as part of Oracle Clusterware. If it detects a time synchronization service (such as NTP or chrony) or a time synchronization configuration on the system, whether valid or broken, CTSS runs in observer mode. For example, if an /etc/ntp.conf file exists on any node in the cluster, CTSS runs in observer mode even if no time synchronization software is actually running.
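To check which mode CTSS is currently in, Clusterware can be queried directly. A minimal sketch, assuming a standard Grid Infrastructure environment with crsctl and cluvfy in the grid user's PATH (the exact message text varies with mode and version):

[grid@rac01 ~]$ crsctl check ctss              # reports Active/Observer mode and the current offset
[grid@rac01 ~]$ cluvfy comp clocksync -n all   # verifies clock synchronization across all nodes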
If CTSS detects that no node in the cluster has a time synchronization service or configuration, CTSS enters active mode and takes over time management for the cluster.
If CTSS is running in active mode while non-NTP time synchronization software is also running, you can switch CTSS to observer mode by creating an /etc/ntp.conf file. CTSS records the change to observer mode in the alert log.
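A minimal sketch of that switch, assuming you deliberately want observer mode without actually running ntpd (in production you would normally configure a real NTP or chrony service instead):

# As root, on every node: the mere presence of /etc/ntp.conf is enough for
# CTSS to detect an NTP configuration and change to observer mode.
[root@rac01 ~]# touch /etc/ntp.conf
[root@rac02 ~]# touch /etc/ntp.conf
# The mode change is then recorded in the Clusterware alert log.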
When nodes join the cluster, if CTSS is in active mode it compares their time against a reference clock on one node of the cluster. If the two times differ and the difference is within a certain step limit, CTSS performs step time synchronization: it adjusts the joining node's time forward or backward so that it matches the reference time.
For various reasons, the clocks on cluster nodes periodically drift out of sync with the reference clock (which CTSS uses as its baseline and which resides on the first node started in the cluster). When this happens, CTSS performs slew time synchronization: it speeds up or slows down a node's system time until it matches the reference time. With this method CTSS never sets the time backward, so the system time is guaranteed to increase monotonically.
When Oracle Clusterware starts, if CTSS is running in active mode and the time difference exceeds the step limit (24 hours), CTSS writes an alert to the alert log and exits, and Oracle Clusterware fails to start. You must manually adjust the joining node's time to match the cluster before Oracle Clusterware can start and let CTSS manage the node's time.
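A sketch of that manual recovery, assuming node 1 holds the reference time used in this test:

# On the node that failed to join, as root: set the clock close to the
# reference node's time, then start Clusterware again.
[root@rac02 ~]# date -s "2023-05-27 21:26:13"
[root@rac02 ~]# crsctl start crs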
During slew time synchronization, CTSS never adjusts the time backward to match the reference clock. CTSS periodically writes alerts to the alert log describing how often it adjusts a node's time to keep it in sync with the reference clock.
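To watch these messages, the CTSS trace file and the Clusterware alert log can be followed directly. The paths below are the ones from this environment (ORACLE_BASE /u01/app/grid, node rac02) and would need adjusting elsewhere:

[grid@rac02 ~]$ grep -i octssd /u01/app/grid/diag/crs/rac02/crs/trace/alert.log | tail
[grid@rac02 ~]$ tail -f /u01/app/grid/diag/crs/rac02/crs/trace/octssd.trc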
CTSS active mode: node offset greater than 24 hours, starting the new node fails
Node 1's time is 2023-05-27 21:26:13; node 2's time is changed to May 30, more than 24 hours ahead. Node 2 then fails to start.
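The offset was produced by setting node 2's clock directly; a sketch of the test setup (the exact target timestamp is illustrative):

[root@rac02 ~]# date -s "2023-05-30 00:00:00"   # jump node 2 more than 24 hours ahead of node 1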
- Node 2 cluster alert log at startup
2023-05-27 21:26:13.991 [ORAROOTAGENT(12332)]CRS-5822: 代理 '/u01/app/19.0.0/grid/bin/orarootagent_root' 已从服务器断开连接。详细信息见 (:CRSAGF00117:) {0:4:12} (位于 /u01/app/grid/diag/crs/rac02/crs/trace/ohasd_orarootagent_root.trc)。
2023-05-30 00:00:13.885 [OHASD(62963)]CRS-8500: Oracle Clusterware OHASD 进程以操作系统进程 ID 62963 开头
2023-05-30 00:00:13.925 [OHASD(62963)]CRS-0714: Oracle Clusterware 发行版 19.0.0.0.0。
2023-05-30 00:00:13.934 [OHASD(62963)]CRS-2112: 已在节点 rac02 上启动 OLR 服务。
2023-05-30 00:00:14.079 [OHASD(62963)]CRS-1301: 已在节点 rac02 上启动 Oracle 高可用性服务。
2023-05-30 00:00:14.081 [OHASD(62963)]CRS-8017: 位置: /etc/oracle/lastgasp 具有 2 个重新启动指导日志文件, 0 个已发布, 0 个出现错误
2023-05-30 00:00:14.513 [CSSDAGENT(63766)]CRS-8500: Oracle Clusterware CSSDAGENT 进程以操作系统进程 ID 63766 开头
2023-05-30 00:00:14.545 [CSSDMONITOR(63769)]CRS-8500: Oracle Clusterware CSSDMONITOR 进程以操作系统进程 ID 63769 开头
2023-05-30 00:00:14.539 [ORAAGENT(63758)]CRS-8500: Oracle Clusterware ORAAGENT 进程以操作系统进程 ID 63758 开头
2023-05-30 00:00:14.583 [ORAROOTAGENT(63748)]CRS-8500: Oracle Clusterware ORAROOTAGENT 进程以操作系统进程 ID 63748 开头
2023-05-30 00:00:15.291 [ORAAGENT(63860)]CRS-8500: Oracle Clusterware ORAAGENT 进程以操作系统进程 ID 63860 开头
2023-05-30 00:00:15.539 [MDNSD(63884)]CRS-8500: Oracle Clusterware MDNSD 进程以操作系统进程 ID 63884 开头
2023-05-30 00:00:15.540 [EVMD(63885)]CRS-8500: Oracle Clusterware EVMD 进程以操作系统进程 ID 63885 开头
2023-05-30 00:00:16.555 [GPNPD(63923)]CRS-8500: Oracle Clusterware GPNPD 进程以操作系统进程 ID 63923 开头
2023-05-30 00:00:17.456 [GPNPD(63923)]CRS-2328: 已在节点 rac02 上启动 GPNPD。
2023-05-30 00:00:17.584 [GIPCD(63988)]CRS-8500: Oracle Clusterware GIPCD 进程以操作系统进程 ID 63988 开头
2023-05-30 00:00:18.416 [GIPCD(63988)]CRS-7517: The Oracle Grid Interprocess Communication (GIPC) failed to identify the Fast Node Death Detection (FNDD).
2023-05-30 00:00:19.598 [CSSDMONITOR(64020)]CRS-8500: Oracle Clusterware CSSDMONITOR 进程以操作系统进程 ID 64020 开头
2023-05-30 00:00:20.002 [OSYSMOND(64041)]CRS-8500: Oracle Clusterware OSYSMOND 进程以操作系统进程 ID 64041 开头
2023-05-30 00:00:19.973 [CSSDAGENT(64039)]CRS-8500: Oracle Clusterware CSSDAGENT 进程以操作系统进程 ID 64039 开头
2023-05-30 00:00:24.815 [OCSSD(64072)]CRS-8500: Oracle Clusterware OCSSD 进程以操作系统进程 ID 64072 开头
2023-05-30 00:00:25.869 [OCSSD(64072)]CRS-1713: CSSD 守护程序已在 hub 模式下启动
2023-05-30 00:00:27.150 [OCSSD(64072)]CRS-1707: 节点 rac02 (编号为 2) 的租约获取已完成
2023-05-30 00:00:28.235 [OCSSD(64072)]CRS-1621: 在 Oracle 注册表中存储的此节点的 IPMI 配置数据不完整; 详细信息见 (:CSSNK00002:) (位于 /u01/app/grid/diag/crs/rac02/crs/trace/ocssd.trc)
2023-05-30 00:00:28.236 [OCSSD(64072)]CRS-1617: 对节点 rac02 执行节点终止操作所需的信息不完整; 详细信息见 (:CSSNM00004:) (位于 /u01/app/grid/diag/crs/rac02/crs/trace/ocssd.trc)
2023-05-30 00:00:28.239 [OCSSD(64072)]CRS-1605: CSSD 表决文件联机: /dev/asm-diskf; 详细信息见 /u01/app/grid/diag/crs/rac02/crs/trace/ocssd.trc。
2023-05-30 00:00:28.243 [OCSSD(64072)]CRS-1605: CSSD 表决文件联机: /dev/asm-diskh; 详细信息见 /u01/app/grid/diag/crs/rac02/crs/trace/ocssd.trc。
2023-05-30 00:00:28.248 [OCSSD(64072)]CRS-1605: CSSD 表决文件联机: /dev/asm-diskc; 详细信息见 /u01/app/grid/diag/crs/rac02/crs/trace/ocssd.trc。
2023-05-30 00:10:20.137 [CSSDAGENT(64039)]CRS-5818: 已中止命令 'start' (对于资源 'ora.cssd')。详细信息见 (:CRSAGF00113:) {0:5:3} (位于 /u01/app/grid/diag/crs/rac02/crs/trace/ohasd_cssdagent_root.trc)。
2023-05-30 00:10:20.361 [OHASD(62963)]CRS-2757: 命令 'Start' 在等待来自资源 'ora.cssd' 的响应时超时。详细信息见 (:CRSPE00221:) {0:5:3} (位于 /u01/app/grid/diag/crs/rac02/crs/trace/ohasd.trc)。
2023-05-30 00:10:21.139 [OCSSD(64072)]CRS-1656: CSS 守护程序由于致命错误而正在终止; 详细信息见 (:CSSSC00012:) (位于 /u01/app/grid/diag/crs/rac02/crs/trace/ocssd.trc)
2023-05-30 00:10:21.139 [OCSSD(64072)]CRS-1603: 节点 rac02 上的 CSSD 已关闭。
2023-05-30T00:10:26.149335+08:00
Errors in file /u01/app/grid/diag/crs/rac02/crs/trace/ocssd.trc (incident=17):
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/grid/diag/crs/rac02/crs/incident/incdir_17/ocssd_i17.trc
2023-05-30 00:10:26.140 [OCSSD(64072)]CRS-8503: Oracle Clusterware 进程 OCSSD (具有操作系统进程 ID 64072) 遇到致命信号或异常错误代码 6。
- Node 2 ocssd log at startup
The log repeatedly shows gipcwait returned with status gipcretPosted (17):
2023-05-30 00:04:37.041 : CSSD:60405504: clsssc_CLSFAInit_CB: System not ready for CLSFA initialization
2023-05-30 00:04:37.066 : CSSD:435600000: [ INFO] clssscWaitOnEventValue: after Reconfig Event val 2, eval 1 waited 100 with cvtimewait status 4294967186
2023-05-30 00:04:37.667 : CSSD:435600000: [ INFO] clssscWaitOnEventValue: after Reconfig Event val 2, eval 1 waited 100 with cvtimewait status 4294967186
2023-05-30 00:04:37.916 : CSSD:2670159616: [ INFO] clssscWaitChangeEventValue: ev(ConnectedNodes) unchanged from 0
2023-05-30 00:04:37.916 : CSSD:2670159616: [ INFO] clssscUpdateEventValue: ConnectedNodes val 0, changes 252
2023-05-30 00:04:37.916 : CSSD:46216960: [ INFO] clssscSelect: gipcwait returned with status gipcretPosted (17)
2023-05-30 00:04:37.916 : CSSD:46216960: [ INFO] clssgmPeerListener: connected to 1 of 2
- Node 2 gipc log at startup
The log contains both gipc_latency_time and gipc_round_trip_time entries:
2023-05-30 00:03:19.053 :GIPCHTHR:1153718016: gipchaWorkerWork: workerThread heart beat, time interval since last heartBeat 30040 loopCount 36 sendCount 0 recvCount 18 postCount 0 sendCmplCount 0 recvCmplCount 0
2023-05-30 00:03:19.054 :GIPCHTHR:1151547136: gipchaDaemonWork: DaemonThread heart beat, time interval since last heartBeat 30040 loopCount 42
2023-05-30 00:03:26.718 :GIPCDMON:1160857344: gipcdMonitorPublishDiags: key gipc_latency_time handle 0x7fc024193890 writeTime 3784044 value <rac01:3765014:3297621us:4291669975us>
2023-05-30 00:03:49.090 :GIPCHTHR:1153718016: gipchaWorkerWork: workerThread heart beat, time interval since last heartBeat 30040 loopCount 42 sendCount 0 recvCount 19 postCount 0 sendCmplCount 0 recvCmplCount 0
2023-05-30 00:03:49.090 :GIPCHTHR:1151547136: gipchaDaemonWork: DaemonThread heart beat, time interval since last heartBeat 30040 loopCount 42
2023-05-30 00:03:49.127 :GIPCDMON:1160857344: gipcdMonitorSaveInfMetrics: reset trace level, 3815124-3815124-5000
2023-05-30 00:03:49.127 :GIPCDMON:1160857344: gipcdMonitorSaveInfMetrics: inf[ 0] ens38 - rank 99, node rac01 , avgms 0.769231 [13/13/13]
2023-05-30 00:03:49.127 :GIPCDMON:1160857344: gipcdMonitorSaveInfMetrics: inf[ 1] ens38 - rank 99, node rac02 , avgms 0.769231 [13/13/13]
2023-05-30 00:03:53.868 :GIPCDCLT:1165059840: gipcdClientInterfaceMetrics: Received type(gipcdmsgtypeInterfaceMetrics), endp(000000000000039d), len(1024), buf(0x7fc02c22a588), inf(ip: 10.10.100.2:22209, mask: 255.255.255.0, subnet: 10.10.100.0, mac: , ifname: ens38) time(10), retry(0), stamp(10), send(10), recv(10)
2023-05-30 00:03:53.868 :GIPCDCLT:1165059840: gipcdClientInterfaceMetrics: enqueue interface metrics (1:rac01:ens38) to worklist
2023-05-30 00:03:53.868 :GIPCDCLT:1165059840: gipcdClientInterfaceMetrics: Received type(gipcdmsgtypeInterfaceMetrics), endp(000000000000039d), len(1024), buf(0x7fc02c235a48), inf(ip: 10.10.100.3:17001, mask: 255.255.255.0, subnet: 10.10.100.0, mac: , ifname: ens38) time(10), retry(0), stamp(10), send(10), recv(10)
2023-05-30 00:03:53.868 :GIPCDCLT:1165059840: gipcdClientInterfaceMetrics: enqueue interface metrics (1:rac02:ens38) to worklist
2023-05-30 00:03:54.095 :GIPCDCLT:1165059840: gipcdClientInterfaceMetrics: Received type(gipcdmsgtypeInterfaceMetrics), endp(0000000000000386), len(1024), buf(0x7fc02c235a48), inf(ip: 10.10.100.2:40180, mask: 255.255.255.0, subnet: 10.10.100.0, mac: , ifname: ens38) time(0), retry(0), stamp(2), send(3), recv(2)
2023-05-30 00:03:54.095 :GIPCDCLT:1165059840: gipcdClientInterfaceMetrics: enqueue interface metrics (1:rac01:ens38) to worklist
2023-05-30 00:03:54.095 :GIPCDCLT:1165059840: gipcdClientInterfaceMetrics: Received type(gipcdmsgtypeInterfaceMetrics), endp(0000000000000386), len(1024), buf(0x7fc02c3263f8), inf(ip: 10.10.100.3:25105, mask: 255.255.255.0, subnet: 10.10.100.0, mac: , ifname: ens38) time(0), retry(0), stamp(2), send(3), recv(2)
2023-05-30 00:03:54.095 :GIPCDCLT:1165059840: gipcdClientInterfaceMetrics: enqueue interface metrics (1:rac02:ens38) to worklist
2023-05-30 00:03:59.339 :GIPCDMON:1160857344: gipcdMonitorPublishDiags: key gipc_round_trip_time handle 0x7fc024193680 writeTime 3814084 value <rac01:6:3795064:224us>
2023-05-30 00:04:19.125 :GIPCHTHR:1153718016: gipchaWorkerWork: workerThread heart beat, time interval since last heartBeat 30030 loopCount 39 sendCount 0 recvCount 17 postCount 0 sendCmplCount 0 recvCmplCount 0
2023-05-30 00:04:19.125 :GIPCHTHR:1151547136: gipchaDaemonWork: DaemonThread heart beat, time interval since last heartBeat 30030 loopCount 42
2023-05-30 00:04:32.071 :GIPCDMON:1160857344: gipcdMonitorPublishDiags: key gipc_latency_time handle 0x7fc024193890 writeTime 3844114 value <rac01:3825094:3297725us:4291670122us>
2023-05-30 00:04:42.274 :GIPCDMON:1160857344: gipcdMonitorSaveInfMetrics: reset trace level, 3868264-3868264-5000
2023-05-30 00:04:42.274 :GIPCDMON:1160857344: gipcdMonitorSaveInfMetrics: inf[ 0] ens38 - rank 99, node rac01 , avgms 0.833333 [13/12/12]
2023-05-30 00:04:42.274 :GIPCDMON:1160857344: gipcdMonitorSaveInfMetrics: inf[ 1] ens38 - rank 99, node rac02 , avgms 0.833333 [13/12/12]
2023-05-30 00:04:49.166 :GIPCHTHR:1153718016: gipchaWorkerWork: workerThread heart beat, time interval since last heartBeat 30050 loopCount 41 sendCount 0 recvCount 18 postCount 0 sendCmplCount 0 recvCmplCount 0
2023-05-30 00:04:49.166 :GIPCHTHR:1151547136: gipchaDaemonWork: DaemonThread heart beat, time interval since last heartBeat 30050 loopCount 42
CTSS active mode: node offset within 24 hours (under 13 hours), starting the new node also fails
The offset here exceeds the CTSS correction limit of 10 minutes (600 seconds).
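As a quick sanity check, the offset reported in the logs below (42164830518 microseconds) converts to about 11.7 hours: inside the 24-hour startup step limit, but far beyond the 600-second CTSS correction limit:

[grid@rac02 ~]$ echo 'scale=4; 42164830518 / 1000000 / 3600' | bc   # microseconds -> hours
11.7124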
- Node 2 alert log at startup
2023-05-28 10:01:31.832 [OHASD(62963)]CRS-2791: 正在启动用于关闭 'rac02' 上 Oracle 高可用性服务管理的资源的操作
2023-05-28 10:01:33.388 [MDNSD(63884)]CRS-5602: mDNS 服务根据请求停止。
2023-05-28 10:01:33.392 [GPNPD(63923)]CRS-2329: 节点 rac02 上的 GPNPD 已关闭。
2023-05-28 10:01:33.405 [MDNSD(63884)]CRS-8504: Oracle Clusterware MDNSD 进程 (具有操作系统进程 ID 63884) 正在退出
2023-05-28 10:01:53.862 [OHASD(62963)]CRS-2793: 关闭 'rac02' 上 Oracle 高可用性服务管理的资源的操作已完成
2023-05-28 10:01:53.872 [ORAROOTAGENT(63748)]CRS-5822: 代理 '/u01/app/19.0.0/grid/bin/orarootagent_root' 已从服务器断开连接。详细信息见 (:CRSAGF00117:) {0:2:13} (位于 /u01/app/grid/diag/crs/rac02/crs/trace/ohasd_orarootagent_root.trc)。
2023-05-28 10:01:53.874 [CSSDAGENT(64039)]CRS-5822: 代理 '/u01/app/19.0.0/grid/bin/cssdagent_root' 已从服务器断开连接。详细信息见 (:CRSAGF00117:) {0:7:5} (位于 /u01/app/grid/diag/crs/rac02/crs/trace/ohasd_cssdagent_root.trc)。
2023-05-28 10:02:06.671 [OHASD(101744)]CRS-8500: Oracle Clusterware OHASD 进程以操作系统进程 ID 101744 开头
2023-05-28 10:02:06.711 [OHASD(101744)]CRS-0714: Oracle Clusterware 发行版 19.0.0.0.0。
2023-05-28 10:02:06.719 [OHASD(101744)]CRS-2112: 已在节点 rac02 上启动 OLR 服务。
2023-05-28 10:02:06.899 [OHASD(101744)]CRS-1301: 已在节点 rac02 上启动 Oracle 高可用性服务。
2023-05-28 10:02:06.899 [OHASD(101744)]CRS-8017: 位置: /etc/oracle/lastgasp 具有 2 个重新启动指导日志文件, 0 个已发布, 0 个出现错误
2023-05-28 10:02:07.348 [CSSDAGENT(101857)]CRS-8500: Oracle Clusterware CSSDAGENT 进程以操作系统进程 ID 101857 开头
2023-05-28 10:02:07.349 [ORAROOTAGENT(101838)]CRS-8500: Oracle Clusterware ORAROOTAGENT 进程以操作系统进程 ID 101838 开头
2023-05-28 10:02:07.365 [CSSDMONITOR(101859)]CRS-8500: Oracle Clusterware CSSDMONITOR 进程以操作系统进程 ID 101859 开头
2023-05-28 10:02:07.378 [ORAAGENT(101848)]CRS-8500: Oracle Clusterware ORAAGENT 进程以操作系统进程 ID 101848 开头
2023-05-28 10:02:08.099 [ORAAGENT(101950)]CRS-8500: Oracle Clusterware ORAAGENT 进程以操作系统进程 ID 101950 开头
2023-05-28 10:02:08.346 [MDNSD(101973)]CRS-8500: Oracle Clusterware MDNSD 进程以操作系统进程 ID 101973 开头
2023-05-28 10:02:08.343 [EVMD(101974)]CRS-8500: Oracle Clusterware EVMD 进程以操作系统进程 ID 101974 开头
2023-05-28 10:02:09.366 [GPNPD(102012)]CRS-8500: Oracle Clusterware GPNPD 进程以操作系统进程 ID 102012 开头
2023-05-28 10:02:10.287 [GPNPD(102012)]CRS-2328: 已在节点 rac02 上启动 GPNPD。
2023-05-28 10:02:10.409 [GIPCD(102078)]CRS-8500: Oracle Clusterware GIPCD 进程以操作系统进程 ID 102078 开头
2023-05-28 10:02:11.231 [GIPCD(102078)]CRS-7517: The Oracle Grid Interprocess Communication (GIPC) failed to identify the Fast Node Death Detection (FNDD).
2023-05-28 10:02:12.410 [CSSDMONITOR(102113)]CRS-8500: Oracle Clusterware CSSDMONITOR 进程以操作系统进程 ID 102113 开头
2023-05-28 10:02:14.226 [CSSDAGENT(102134)]CRS-8500: Oracle Clusterware CSSDAGENT 进程以操作系统进程 ID 102134 开头
2023-05-28 10:02:14.265 [OSYSMOND(102136)]CRS-8500: Oracle Clusterware OSYSMOND 进程以操作系统进程 ID 102136 开头
2023-05-28 10:02:19.064 [OCSSD(102167)]CRS-8500: Oracle Clusterware OCSSD 进程以操作系统进程 ID 102167 开头
2023-05-28 10:02:20.113 [OCSSD(102167)]CRS-1713: CSSD 守护程序已在 hub 模式下启动
2023-05-28 10:02:21.468 [OCSSD(102167)]CRS-1707: 节点 rac02 (编号为 2) 的租约获取已完成
2023-05-28 10:02:22.541 [OCSSD(102167)]CRS-1621: 在 Oracle 注册表中存储的此节点的 IPMI 配置数据不完整; 详细信息见 (:CSSNK00002:) (位于 /u01/app/grid/diag/crs/rac02/crs/trace/ocssd.trc)
2023-05-28 10:02:22.541 [OCSSD(102167)]CRS-1617: 对节点 rac02 执行节点终止操作所需的信息不完整; 详细信息见 (:CSSNM00004:) (位于 /u01/app/grid/diag/crs/rac02/crs/trace/ocssd.trc)
2023-05-28 10:02:22.547 [OCSSD(102167)]CRS-1605: CSSD 表决文件联机: /dev/asm-diskf; 详细信息见 /u01/app/grid/diag/crs/rac02/crs/trace/ocssd.trc。
2023-05-28 10:02:22.554 [OCSSD(102167)]CRS-1605: CSSD 表决文件联机: /dev/asm-diskh; 详细信息见 /u01/app/grid/diag/crs/rac02/crs/trace/ocssd.trc。
2023-05-28 10:02:22.561 [OCSSD(102167)]CRS-1605: CSSD 表决文件联机: /dev/asm-diskc; 详细信息见 /u01/app/grid/diag/crs/rac02/crs/trace/ocssd.trc。
2023-05-28 10:02:24.626 [OCSSD(102167)]CRS-1601: CSSD 重新配置完毕。活动节点为 rac01 rac02 。
2023-05-28 10:02:26.470 [OCSSD(102167)]CRS-1720: 集群同步服务守护程序 (CSSD) 已准备好进行操作。
2023-05-28 10:02:26.591 [OCTSSD(102688)]CRS-8500: Oracle Clusterware OCTSSD 进程以操作系统进程 ID 102688 开头
2023-05-28 10:02:27.418 [OCTSSD(102688)]CRS-2403: 主机 rac02 上的集群时间同步服务处于观察程序模式。
2023-05-28 10:02:28.633 [OCTSSD(102688)]CRS-2407: 新的集群时间同步服务引用节点为主机 rac01。
2023-05-28 10:02:28.634 [OCTSSD(102688)]CRS-2401: 已在主机 rac02 上启动了集群时间同步服务。
2023-05-28 10:02:28.716 [OCTSSD(102688)]CRS-2419: The clock on host rac02 differs from mean cluster time by 42164830518 microseconds. The Cluster Time Synchronization Service will not perform time synchronization because the time difference is beyond the permissible offset of 600 seconds. Details in /u01/app/grid/diag/crs/rac02/crs/trace/octssd.trc.
2023-05-28 10:02:29.416 [OCTSSD(102688)]CRS-2402: 已在主机 rac02 上中止了集群时间同步服务。详细信息见 (:ctsselect_msm3:) (位于 /u01/app/grid/diag/crs/rac02/crs/trace/octssd.trc)。
- Node 2 ctss log at startup
2023-05-28 10:02:28.704 : CTSS:3520726784: ctssslave_swm2_1: Waiting for time sync message from master. sync_state[2].
2023-05-28 10:02:28.705 : CTSS:3524929280: ctsscomm_recv_cb4_2: Receive active version change msg. Old active version [318767104] New active version [318767104].
2023-05-28 10:02:28.716 : CTSS:3524929280: ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state[3].
2023-05-28 10:02:28.716 : CTSS:3520726784: ctssslave_swm2_3: Received time sync message from master.
2023-05-28 10:02:28.716 : CTSS:3520726784: ctssslave_swm: The magnitude [42164830518] of the offset [42164830518 usec] is larger than [600000000 usec] sec which is the CTSS limit. Hence CTSS is exiting.
2023-05-28 10:02:28.716 : CTSS:3524929280: ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
2023-05-28 10:02:28.716 : CTSS:3520726784: (:ctsselect_msm3:): Failed in clsctssslave_sync_with_master [12]: Time offset is too much to be corrected, exiting.
2023-05-28 10:02:29.144 : CRSCCL:3532748544: clsCclGetPriMemberData: Detected pridata change for node[1]. Retrieving it to the cache.
2023-05-28 10:02:29.416 : CTSS:3556767488: ctss_checkcb: clsdm requested check alive. checkcb_data{mode[0xf6], offset[42164830 ms]}, length=[8].
2023-05-28 10:02:29.416 : CTSS:3520726784: ctsselect_msm: CTSS daemon exiting [12] as offset is to large.
2023-05-28 10:02:29.416 : CTSS:3520726784: CTSS daemon aborting
- Node 1 cluster alert log
2023-05-27 22:58:12.986 [OCSSD(7181)]CRS-1612: 50% 的超时时间间隔内缺少与节点 rac02 (2) 的网络通信。如果这种情况持续存在, 将在 14.030 秒后从集群中删除此节点
2023-05-27 22:58:19.989 [OCSSD(7181)]CRS-1611: 75% 的超时时间间隔内缺少与节点 rac02 (2) 的网络通信。如果这种情况持续存在, 将在 7.030 秒后从集群中删除此节点
2023-05-27 22:58:24.993 [OCSSD(7181)]CRS-1610: 90% 的超时时间间隔内缺少与节点 rac02 (2) 的网络通信。如果这种情况持续存在, 将在 2.020 秒后从集群中删除此节点
2023-05-27 22:58:27.018 [OCSSD(7181)]CRS-1632: 节点 rac02 正在从集群副本 580693754 的集群中删除
2023-05-27 22:58:27.538 [OCSSD(7181)]CRS-1601: CSSD 重新配置完毕。活动节点为 rac01 。
2023-05-27 22:58:28.976 [CRSD(8767)]CRS-1012: 已在节点 rac01 上启动 OCR 服务。
CTSS active mode: node offset within 10 minutes, the new node starts normally
In this case, even though the new node has a chrony time synchronization configuration, CTSS on the new node is forced into observer mode and synchronizes against node 1's reference clock. The CRS services and the database also start normally.
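To confirm what CTSS detected ("Chrony active" in the trace below), chrony's status can be checked on the new node; a minimal sketch assuming the standard chronyd service and chronyc client:

[root@rac02 ~]# systemctl is-active chronyd
[root@rac02 ~]# chronyc tracking   # shows the reference source and the current system-clock offset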
- Node 2 ctss log at startup
2023-05-27 22:45:31.145 : CTSS:589285120: ctssslave_swm2_3: Received time sync message from master.
2023-05-27 22:45:31.145 : CTSS:589285120: ctssslave_swm: The magnitude [478860370] of the offset [478860370 usec] is detected to be too larger.
2023-05-27 22:45:31.145 : CTSS:589285120: ctssslave_swm17: LT [1685198731sec 145891sec], MT [1685198252sec 285521usec], Delta [1313usec]
2023-05-27 22:45:31.145 : CTSS:589285120: ctssslave_swm19: The offset is [478860370 usec] and sync interval set to [5]
2023-05-27 22:45:31.145 : CTSS:589285120: ctsselect_msm: Sync interval returned in [5]
2023-05-27 22:45:31.145 : CTSS:593487616: ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
2023-05-27 22:45:31.832 : CTSS:961419008: ctss_checkcb: clsdm requested check alive. checkcb_data{mode[0xe6], offset[478860 ms]}, length=[8].
2023-05-27 22:45:32.146 :GIPCHTHR:957216512: gipchaWorkerWork: workerThread heart beat, time interval since last heartBeat 30010 loopCount 38 sendCount 4 recvCount 8 postCount 4 sendCmplCount 4 recvCmplCount 4
2023-05-27 22:45:32.551 :GIPCHTHR:955115264: gipchaDaemonWork: DaemonThread heart beat, time interval since last heartBeat 30320 loopCount 42
2023-05-27 22:45:39.149 : CTSS:589285120: sclsctss_ivsr2: default pid file not found
2023-05-27 22:45:39.149 : CTSS:589285120: sclsctss_ivsr1: default config file found
2023-05-27 22:45:39.149 : CTSS:589285120: sclsctss_gvss5: Chrony active, forcing observer mode
2023-05-27 22:45:39.149 : CTSS:589285120: ctss_check_vendor_sw: Vendor time sync software is detected. status [2].
2023-05-27 22:45:39.149 : CTSS:589285120: ctsselect_msm: CTSS mode is [0xe6]
2023-05-27 22:45:39.149 : CTSS:589285120: ctssslave_swm1_2: Ready to initiate new time sync process.
2023-05-27 22:45:39.149 : CTSS:589285120: ctssslave_swm2_1: Waiting for time sync message from master. sync_state[2].
2023-05-27 22:45:39.150 : CTSS:593487616: ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state[3].
2023-05-27 22:45:39.150 : CTSS:589285120: ctssslave_swm2_3: Received time sync message from master.
2023-05-27 22:45:39.150 : CTSS:589285120: ctssslave_swm: The magnitude [478860379] of the offset [478860379 usec] is detected to be too larger.
2023-05-27 22:45:39.150 : CTSS:589285120: ctssslave_swm17: LT [1685198739sec 150664sec], MT [1685198260sec 290285usec], Delta [1407usec]
2023-05-27 22:45:39.150 : CTSS:589285120: ctssslave_swm19: The offset is [478860379 usec] and sync interval set to [5]
2023-05-27 22:45:39.150 : CTSS:589285120: ctsselect_msm: Sync interval returned in [5]
2023-05-27 22:45:39.150 : CTSS:593487616: ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
2023-05-27 22:45:47.151 : CTSS:589285120: ctsselect_msm: CTSS mode is [0xe6]
2023-05-27 22:45:47.151 : CTSS:589285120: ctssslave_swm1_2: Ready to initiate new time sync process.
2023-05-27 22:45:47.152 : CTSS:589285120: ctssslave_swm2_1: Waiting for time sync message from master. sync_state[2].
2023-05-27 22:45:47.152 : CTSS:593487616: ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state[3].
2023-05-27 22:45:47.152 : CTSS:589285120: ctssslave_swm2_3: Received time sync message from master.
2023-05-27 22:45:47.152 : CTSS:589285120: ctssslave_swm: The magnitude [478860401] of the offset [478860401 usec] is detected to be too larger.
2023-05-27 22:45:47.152 : CTSS:589285120: ctssslave_swm17: LT [1685198747sec 152788sec], MT [1685198268sec 292387usec], Delta [1610usec]
2023-05-27 22:45:47.152 : CTSS:589285120: ctssslave_swm19: The offset is [478860401 usec] and sync interval set to [5]
2023-05-27 22:45:47.152 : CTSS:589285120: ctsselect_msm: Sync interval returned in [5]
2023-05-27 22:45:47.153 : CTSS:593487616: ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
CTSS observer mode: node offset within 10 minutes, the new node starts normally
Even in observer mode, the startup likewise steps the time into sync with node 1's reference clock almost instantly, and the cluster starts normally. (Note the alert log below: the CRS-2403 entry is stamped 23:16:09, earlier than the CRS-2401 entry at 23:25:25, because the clock was stepped backward by about 9 minutes during startup.)
- Node 2 alert log at startup
2023-05-27 23:25:25.575 [OCTSSD(35377)]CRS-2401: 已在主机 rac02 上启动了集群时间同步服务。
2023-05-27 23:16:09.718 [OCTSSD(35377)]CRS-2403: 主机 rac02 上的集群时间同步服务处于观察程序模式。
- Node 2 ctss log at startup
2023-05-27 23:25:25.632 : CTSS:4249044736: ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state[3].
2023-05-27 23:25:25.632 : CTSS:4246943488: ctssslave_swm2_3: Received time sync message from master.
2023-05-27 23:25:25.632 : CTSS:4246943488: clsctssslave.c:797:CLSCTSS_CVAL_SET: &cval[68177cc] prev_value[0] new_value[2408] start_tm [0] curr_tm [10125904] period [1800000].
2023-05-27 23:16:09.718 : CTSS:4246943488: CRS-2408: The clock on host rac02 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
2023-05-27 23:16:09.718 : CTSS:4246943488: ctssslave_swm15: The CTSS master is behind this node. The local time offset [-555913426 usec] is being adjusted. Sync method [1]
2023-05-27 23:16:09.718 : CTSS:4246943488: ctssslave_swm17: LT [1685201125sec 631998sec], MT [1685200569sec 718572usec], Delta [56139usec]
2023-05-27 23:16:09.718 : CTSS:4246943488: ctssslave_swm19: The offset is [555913426 usec] and sync interval set to [1]
2023-05-27 23:16:09.718 : CTSS:4246943488: ctssslave_swm: Operating role changed to Observer according to master.
CTSS active mode: node 2's time changed directly by more than 24 hours on a running cluster; cluster operation is unaffected
- Node 2 cluster status
The cluster itself stays healthy; only the ctss resource goes abnormal.
[root@rac02 ~]# crsctl stat res -t -init
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       rac02                    STABLE
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       rac02                    STABLE
ora.crf
      1        ONLINE  ONLINE       rac02                    STABLE
ora.crsd
      1        ONLINE  ONLINE       rac02                    STABLE
ora.cssd
      1        ONLINE  ONLINE       rac02                    STABLE
ora.cssdmonitor
      1        ONLINE  ONLINE       rac02                    STABLE
ora.ctssd
      1        OFFLINE OFFLINE                               Wrong check return., STABLE
ora.diskmon
      1        OFFLINE OFFLINE                               STABLE
ora.drivers.acfs
      1        ONLINE  ONLINE       rac02                    STABLE
ora.evmd
      1        ONLINE  ONLINE       rac02                    STABLE
ora.gipcd
      1        ONLINE  ONLINE       rac02                    STABLE
ora.gpnpd
      1        ONLINE  ONLINE       rac02                    STABLE
ora.mdnsd
      1        ONLINE  ONLINE       rac02                    STABLE
ora.storage
      1        ONLINE  ONLINE       rac02                    STABLE
--------------------------------------------------------------------------------
[root@rac02 ~]# crsctl check ctss
CRS-4703: The Cluster Time Synchronization Service is Offline.
[root@rac02 ~]#
- Node 2 ctss log
Because the time offset was too large, the daemon terminated outright.
2023-05-30 00:00:03.485 : CTSS:1690789632: sclsctss_ivsr2: default pid file not found
2023-05-30 00:00:03.485 : CTSS:1690789632: sclsctss_ivsr2: default pid file not found
2023-05-30 00:00:03.485 : CTSS:1690789632: sclsctss_ivsr2: default pid file not found
2023-05-30 00:00:03.485 : CTSS:1690789632: sclsctss_ivsr2: default pid file not found
2023-05-30 00:00:03.485 : CTSS:1690789632: ctss_check_vendor_sw: Vendor time sync software is not detected. status [1].
2023-05-30 00:00:03.485 : CTSS:1690789632: ctsselect_msm: CTSS mode is [0xc4]
2023-05-30 00:00:03.485 : CTSS:1690789632: ctssslave_swm1_2: Ready to initiate new time sync process.
2023-05-30 00:00:03.486 : CTSS:1690789632: ctssslave_swm2_1: Waiting for time sync message from master. sync_state[2].
2023-05-30 00:00:03.486 : CTSS:1697093376: ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state[3].
2023-05-30 00:00:03.486 : CTSS:1690789632: ctssslave_swm2_3: Received time sync message from master.
2023-05-30 00:00:03.486 : CTSS:1690789632: ctssslave_swm: The magnitude [174311249197] of the offset [174311249197 usec] is larger than [600000000 usec] sec which is the CTSS limit. Hence CTSS is exiting.
2023-05-30 00:00:03.487 : CTSS:1690789632: (:ctsselect_msm3:): Failed in clsctssslave_sync_with_master [12]: Time offset is too much to be corrected, exiting.
2023-05-30 00:00:03.487 : CTSS:1697093376: ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
2023-05-30 00:00:13.046 : CTSS:1729017600: ctss_checkcb: clsdm requested check alive. checkcb_data{mode[0xd4], offset[174311249 ms]}, length=[8].
2023-05-30 00:00:13.046 : CTSS:1690789632: ctsselect_msm: CTSS daemon exiting [12] as offset is to large.
2023-05-30 00:00:13.046 : CTSS:1690789632: CTSS daemon aborting
- After adjusting the clock so it is less than 10 minutes fast and then starting ctss, node 2's time steps back to the correct time
A backward time adjustment occurred here, whether the ctss process was in observer or active mode, which contradicts the notion that CTSS keeps system time strictly monotonically increasing. (Per the description above, only slew synchronization is guaranteed monotonic; the step synchronization performed at daemon startup can move the clock backward.)
2023年 05月 27日 星期六 23:55:37 CST
[grid@rac02 ~]$ crsctl start res ora.ctssd -init
CRS-2672: Attempting to start 'ora.ctssd' on 'rac02'
CRS-2676: Start of 'ora.ctssd' on 'rac02' succeeded
[grid@rac02 ~]$ date
2023年 05月 27日 星期六 23:48:01 CST
[grid@rac02 ~]$
Tips:
Behavior differs across database versions. In tests on 11g RAC, even when node times differed by more than 600 seconds, or even by decades, RAC still started normally.
WeChat subscription account: DongDB手记
墨天轮: https://www.modb.pro/u/231198