Oracle 19c cluster: node 1 down and unable to start after a server reboot
[grid@o1:/home/grid]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
The log locations changed in Oracle 19c: the Clusterware debug trace files now live under $ORACLE_BASE/diag/crs/<hostname>/crs/trace (here, ojndev51).
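If the hostname directory under diag/crs isn't obvious, the ADR homes can be listed instead of hard-coding the path; a small sketch (adrci ships with Grid Infrastructure, run as grid):
# List all ADR homes this installation knows about; the Clusterware one
# ends in .../crs/<hostname>/crs.
adrci exec="show homes"
# Or expand the directory directly:
ls -d $ORACLE_BASE/diag/crs/*/crs/trace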
[grid@o1:/home/grid]$ echo $ORACLE_BASE
/u01/app/grid
[grid@o1:/home/grid]$ cd /u01/app/grid/diag/crs/ojndev51/crs/trace
[grid@o1:/u01/app/grid/diag/crs/ojndev51/crs/trace]$ pwd
/u01/app/grid/diag/crs/ojndev51/crs/trace
OHASD trace files:
[grid@o1:/u01/app/grid/diag/crs/ojndev51/crs/trace]$ ls ohas*
OCSSD trace files:
[grid@o1:/u01/app/grid/diag/crs/ojndev51/crs/trace]$ ls ocssd.*
EVMD trace files:
[grid@o1:/u01/app/grid/diag/crs/ojndev51/crs/trace]$ ls evm*
Scanning the logs together shows the errors arrive in bursts; the listing below covers the most recent burst, around 10:23-10:24 (a sketch for pulling this window out follows the listing):
-rw-rw---- 1 grid oinstall 4105008 Apr 7 10:23 evmd.trm
-rw-rw---- 1 grid oinstall 25307636 Apr 7 10:23 evmd.trc
-rw-rw---- 1 root oinstall 2392498 Apr 7 10:23 ohasd_orarootagent_root.trm
-rw-rw---- 1 root oinstall 19506810 Apr 7 10:23 ohasd_orarootagent_root.trc
-rw-rw---- 1 grid oinstall 3609590 Apr 7 10:23 gipcd.trm
-rw-rw---- 1 grid oinstall 25180586 Apr 7 10:23 gipcd.trc
-rw-rw---- 1 root oinstall 2947623 Apr 7 10:24 osysmond.trc
-rw-rw---- 1 root oinstall 427298 Apr 7 10:24 osysmond.trm
-rw-rw---- 1 root oinstall 62367 Apr 7 10:24 ohasd.trm
-rw-rw---- 1 root oinstall 425772 Apr 7 10:24 ohasd.trc
-rw-rw---- 1 grid oinstall 424041 Apr 7 10:24 gpnpd.trm
-rw-rw---- 1 grid oinstall 7341956 Apr 7 10:24 gpnpd.trc
-rw-rw---- 1 grid oinstall 2306606 Apr 7 10:24 ohasd_oraagent_grid.trm
-rw-rw---- 1 grid oinstall 10956946 Apr 7 10:24 ohasd_oraagent_grid.trc
-rw-rw---- 1 root oinstall 3022316 Apr 7 10:24 ohasd_cssdmonitor_root.trm
-rw-rw---- 1 root oinstall 23691362 Apr 7 10:24 ohasd_cssdmonitor_root.trc
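A quick way to isolate that window is to sort the traces by modification time (a sketch; the diag hostname directory is assumed here to match `hostname -s`, adjust if yours differs):
# Newest traces last; the burst window shows up at the bottom.
cd $ORACLE_BASE/diag/crs/$(hostname -s)/crs/trace
ls -ltr *.trc *.trm | tail -20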
Now work through these files one by one to locate the real errors.
First, ohasd_orarootagent_root.trc:
[grid@o1:/u01/app/grid/diag/crs/o1/crs/trace]$ tail -3000 ohasd_orarootagent_root.trc
2024-04-07 10:34:30.404 :CLSDYNAM:1170007808: [ora.storage]{0:0:2} [check] Time:04/07/2024 10:34:30.403 Tint:{0:0:2} action:104 resname:ora.storage lastCall:(:CLSN00109:) Agent::commonCheck check failed action:0104 retval:1
A web search turned up nothing on this error, and it is the only "failed" line in the file, so I set it aside for now.
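For triage it can help to first rank all the traces by how many error lines they contain; a rough sketch (adjust the pattern to taste):
# Count ERROR/failed lines per trace file and rank the noisiest ones.
grep -ciE 'error|failed' *.trc | sort -t: -k2 -rn | head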
Next, evmd.trc. The EVMD process is responsible for publishing the events (Event) generated by CRS.
[grid@o1:/u01/app/grid/diag/crs/ojndev51/crs/trace]$ tail -3000 evmd.trc
2024-04-07 10:34:34.349 :EVMAGENT:1830906560: [ INFO] [ENTER] Got CE filter: 'not_expecting_events' for client 0
2024-04-07 10:34:34.352 : EVMEVT:1830906560: [ INFO] 0x7ff750000ba0 queueing filter event 0x5630920ec470 as 0x5630920e6ab0 until membership is available
2024-04-07 10:34:34.354 :EVMAGENT:1830906560: [ INFO] Got CE filter removal for client 0
2024-04-07 10:34:34.354 :EVMAGENT:1830906560: [ INFO] Removing CE filter: 'not_expecting_events' for client 0
2024-04-07 10:34:34.355 : EVMEVT:1830906560: [ INFO] 0x7ff750000ba0 queueing filter event 0x5630920f4650 as 0x563092101110 until membership is available
2024-04-07 10:35:27.115 : EVMEVT:1660151552: [ ERROR] EVMD waiting for CSS to be ready err = 3
2024-04-07 10:36:27.426 : EVMEVT:1660151552: [ ERROR] EVMD waiting for CSS to be ready err = 3
2024-04-07 10:37:27.731 : EVMEVT:1660151552: [ ERROR] EVMD waiting for CSS to be ready err = 3
2024-04-07 10:38:27.058 : EVMEVT:1660151552: [ ERROR] EVMD waiting for CSS to be ready err = 3
2024-04-07 10:39:27.401 : EVMEVT:1660151552: [ ERROR] EVMD waiting for CSS to be ready err = 3
One suggestion found online is to mv the entire /var/log/.oracle directory out of the way. I didn't dare run that here; noting it down as a last resort.
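For the record only, a reversible form of that suggestion would look like the sketch below, with the stack fully stopped first (not executed in this incident):
# NOT run here -- recorded only as the suggestion found online.
# Move (don't delete) the directory so it can be restored if needed.
/u01/app/19.3.0/grid/bin/crsctl stop crs -f
mv /var/log/.oracle /var/log/.oracle.bak.$(date +%Y%m%d)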
Next, gipcd.trc:
tail -3000f gipcd.trc
2024-04-07 10:54:02.217 : GIPCLIB:373221120: gipclibCheckProcessAliveness: ospid 10697, timestamp 2365 is ALIVE
2024-04-07 10:54:02.217 : GIPC:373221120: gipcdsMemoryDelDeadSubscribers: subscriber gipcd (pid 10697) is alive
2024-04-07 10:54:02.217 : GIPC:373221120: gipcdsMemoryDelDeadSubscribers: processed the subscribers list
2024-04-07 10:54:02.217 : GIPC:373221120: gipcdsMemoryGarbageCollection: garbage collection completed
2024-04-07 10:54:02.622 :GIPCHTHR:371119872: gipchaDaemonWork: DaemonThread heart beat, time interval since last heartBeat 30850 loopCount 33
2024-04-07 10:54:13.231 :GIPCHTHR:373221120: gipchaWorkerWork: workerThread heart beat, time interval since last heartBeat 30040 loopCount 30 sendCount 0 recvCount 0 postCount 0 sendCmplCount 0 recvCmplCount 0
2024-04-07 10:54:20.061 :GIPCDMON:379569920: gipcdMonitorPublishDiags: key gipc_round_trip_time handle 0x7fda0c3245f0 writeTime 1343768054 value <>
This doesn't look related either; it only reports the interval since the previous heartbeat, which reads more like a symptom than a cause.
Next, osysmond.trc:
tail -3000f osysmond.trc
2024-04-07 10:56:10.003 : default:1109950208: scrfosm_fill_all_nic_info: NIC: virbr0-nic: not found in ioctl array
2024-04-07 10:56:10.135 : CRFMOND:1109950208: Sender thread took more than expected time to send. Logging nodeview locally and going ahead with nodeview generation. Serial num = 268832
So the sender thread is lagging. I pinged the public IPs, private IPs, VIPs, and SCAN IP between the two nodes and everything answered, but I began to suspect the network anyway.
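The sweep itself was just a loop like the sketch below (the VIP/SCAN hostnames are illustrative; substitute the real ones):
# Two pings per address, one-second timeout; report OK/FAIL per host.
for h in o1 o2 o1-priv o2-priv o1-vip o2-vip scan; do
    ping -c 2 -W 1 "$h" >/dev/null 2>&1 && echo "$h OK" || echo "$h FAIL"
done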
Next, ohasd.trc:
[grid@o1:/u01/app/grid/diag/crs/o1/crs/trace]$ more ohasd.trc
*** 2024-04-07T08:47:30.352172+08:00
*** TRACE CONTINUED FROM FILE /u01/app/grid/diag/crs/o1/crs/trace/ohasd_14.trc ***
2024-04-07 08:47:30.321 : CRSPE:316630784: [ INFO] {0:0:48805} Processing PE command id=114267 origin:ojndev51. Description: [Stat Resource : 0x7f0de41fccf0]
2024-04-07 08:47:30.322 :UiServer:310327040: [ INFO] {0:0:48805} Done for ctx=0x7f0de0044ce0
2024-04-07 08:47:30.328 :UiServer:310327040: [ INFO] {0:0:48806} Sending to PE. ctx= 0x7f0de0044c40, ClientPID=10309 set Properties (root,29093001), orig.tint: {0:0:2}
2024-04-07 08:47:30.329 : CRSPE:316630784: [ INFO] {0:0:48806} Processing PE command id=114268 origin:ojndev51. Description: [Stat Resource : 0x7f0de404ee50]
2024-04-07 08:47:30.330 :UiServer:310327040: [ INFO] {0:0:48806} Done for ctx=0x7f0de0044c40
2024-04-07 08:47:30.337 :UiServer:310327040: [ INFO] {0:0:48807} Sending to PE. ctx= 0x7f0de0044e80, ClientPID=10309 set Properties (root,29093128), orig.tint: {0:0:2}
2024-04-07 08:47:30.337 : CRSPE:316630784: [ INFO] {0:0:48807} Processing PE command id=114269 origin:ojndev51. Description: [Stat Resource : 0x7f0de41fccf0]
2024-04-07 08:47:30.337 : CRSPE:316630784: [ INFO] {0:0:48807} Expression Filter : ((LAST_SERVER == ojndev51) AND (NAME == ora.cssd))
2024-04-07 08:47:30.338 :UiServer:310327040: [ INFO] {0:0:48807} Done for ctx=0x7f0de0044e80
2024-04-07 08:47:36.139 :GIPCHTHR:306124544: gipchaDaemonWork: DaemonThread heart beat, time interval since last heartBeat 30830 loopCount 33
2024-04-07 08:47:38.411 :UiServer:310327040: [ INFO] {0:0:2}
OHAS DIAGNOSTICS
Last initiated command : Start Resource : 0x7f0de4055c70
Last initiated command timestamp : 03/22/2024 22:18:40
Rate of (local) STAT submissions : 5 reqs/minute
Rate of (local) non-STAT submissions : 0 reqs/minute
Rate of (PE) STAT submissions : 5 reqs/minute
Rate of (PE) STAT completions : 5 reqs/minute
Rate of (PE) non-STAT submissions : 0 reqs/minute
Rate of (PE) non-STAT completions : 0 reqs/minute
Job Scheduler Queue Size : 0
Pending (PE) STAT count : 0
Pending (PE) non-STAT count : 0
2024-04-07 08:47:47.347 :GIPCHTHR:333440768: gipchaWorkerWork: workerThread heart beat, time interval since last heartBeat 30030 loopCount 30 sendCount 0 recvCount 0 postCount 0 sendCmplCount 0 recvCmplCount 0
2024-04-07 08:47:53.443 :UiServer:310327040: [ INFO] {0:0:48808} Sending to PE. ctx= 0x7f0de0043d20, ClientPID=19610 set Properties (root,29093468)
2024-04-07 08:47:53.443 : CRSPE:316630784: [ INFO] {0:0:48808} Processing PE command id=114270 origin:ojndev51. Description: [Stat Resource : 0x7f0de404ee50]
2024-04-07 08:47:53.444 : CRSPE:316630784: [ INFO] {0:0:48808} Expression Filter : (((NAME == ora.crsd) OR (NAME == ora.cssd)) OR (NAME == ora.evmd))
2024-04-07 08:47:53.446 :UiServer:310327040: [ INFO] {0:0:48808} Done for ctx=0x7f0de0043d20
2024-04-07 08:48:00.248 :UiServer:310327040: [ INFO] {0:0:2} Periodic check of IPC sockets...
2024-04-07 08:48:00.248 :UiServer:310327040: [ INFO] {0:0:2} ...socket check done
2024-04-07 08:48:00.353 :UiServer:310327040: [ INFO] {0:0:48809} Sending to PE. ctx= 0x7f0de0044770, ClientPID=10583 set Properties (grid,29093724)
2024-04-07 08:48:00.354 : CRSPE:316630784: [ INFO] {0:0:48809} Processing PE command id=114271 origin:ojndev51. Description: [Stat Resource : 0x7f0de41fccf0]
2024-04-07 08:48:00.355 :UiServer:310327040: [ INFO] {0:0:48809} Done for ctx=0x7f0de0044770
No particular errors here either, just the same heartbeat-interval messages as before.
Next, gpnpd.trc:
2024-04-07 11:24:46.890 :GIPCXCPT:288073472: gipcInternalConnectSync: failed sync request, addr 0x7f55dc08ad90 [0000000000ce4064] { gipcAddress : name 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_UI_SOCKET)(GIPCID=00000000-00000000-0))', objFlags 0x0, addrFlags 0x4 }, ret gipcretConnectionRefused (29)
2024-04-07 11:24:46.890 :GIPCXCPT:288073472: gipcConnectSyncF [clscrsconGipcConnect : clscrscon.c : 700]: EXCEPTION[ ret gipcretConnectionRefused (29) ] failed sync connect endp 0x7f55dc08a5f0 [0000000000ce405d] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=00000000-00000000-0))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_UI_SOCKET)(GIPCID=00000000-00000000-0))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj 0x7f55dc82c390, sendp 0x7f55dd69e900 status 13flags 0xa108071a, flags-2 0x0, usrFlags 0x0 }, addr 0x7f55dc08ad90 [0000000000ce4064] { gipcAddress : name 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_UI_SOCKET)(GIPCID=00000000-00000000-0))', objFlags 0x0, addrFlags 0x4 }, flags 0x0
2024-04-07 11:24:46.891 : GPNP:288073472: clsgpnp_queryCrs(): CRS is not ready. Cannot query GNS resource state.
2024-04-07 11:24:51.891 : GPNP:288073472: clsgpnp_queryCrs(): Querying CRS for resource type "ora.gns.type".
To my eye this file has the most numerous and most obvious errors, yet searching for them turned up nothing definitive. Some posts blame a misconfigured private interconnect and say reconfiguring and restarting fixes it, but I had already rebooted the server many times, and crsctl stop crs / crsctl start crs made no difference either.
Nothing left to lose, so:
High Availability Services (HAS)
[root@o1 log]# /u01/app/19.3.0/grid/bin/crsctl stop has -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'o1'
CRS-2673: Attempting to stop 'ora.crf' on 'o1'
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'o1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'o1'
CRS-2673: Attempting to stop 'ora.evmd' on 'o1'
CRS-2677: Stop of 'ora.cssdmonitor' on 'o1' succeeded
CRS-2677: Stop of 'ora.crf' on 'o1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'o1'
CRS-2673: Attempting to stop 'ora.gpnpd' on 'o1'
CRS-2677: Stop of 'ora.evmd' on 'o1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'o1' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'o1' succeeded
CRS-2677: Stop of 'ora.gipcd' on 'o1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'o1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
[root@o1 log]# /u01/app/19.3.0/grid/bin/crsctl start has
CRS-4123: Oracle High Availability Services has been started.
Checked the cluster status afterwards, and it actually came back?
Questions
What does HAS do? Which processes does it involve? Why did a forced stop-then-start fix it when a server reboot did not? For comparison, this is what the checks looked like while it was broken:
[root@o1 log]# /u01/app/19.3.0/grid/bin/crsctl check crs
CRS-4639: Could not contact Oracle High Availability Services
[root@o1 log]# /u01/app/19.3.0/grid/bin/crsctl status res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
High Availability Services
A server reboot didn't fix it, restarting CRS didn't fix it, but a forced stop and start of HAS did. That suggests something is wrong with the boot-time autostart settings.
Which raises the next questions: what is the difference between restarting HAS and restarting CRS, and which components does each affect? Why does restarting HAS bring the cluster back when rebooting the whole server does not? A bug, or did some step in the startup sequence change?
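One thing worth checking on an autostart suspicion is whether Clusterware autostart is even enabled; a quick sketch (run as root):
# Show whether OHAS autostart is enabled on this node.
/u01/app/19.3.0/grid/bin/crsctl config crs
# If it reports "disabled", re-enable it:
/u01/app/19.3.0/grid/bin/crsctl enable crs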
########################################################################################################################################
After a second server reboot, the database failed to come up again.
And this time the forced HAS stop/start trick no longer helped:
[root@o1 ~]# /u01/app/19.3.0/grid/bin/crsctl stop has -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'o1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'o1'
CRS-2677: Stop of 'ora.mdnsd' on 'o1' succeeded
CRS-2673: Attempting to stop 'ora.crf' on 'o1'
CRS-2673: Attempting to stop 'ora.gpnpd' on 'o1'
CRS-2673: Attempting to stop 'ora.evmd' on 'o1'
CRS-2677: Stop of 'ora.crf' on 'o1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'o1'
CRS-2677: Stop of 'ora.gpnpd' on 'o1' succeeded
CRS-2677: Stop of 'ora.evmd' on 'o1' succeeded
CRS-2677: Stop of 'ora.gipcd' on 'o1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'o1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
First, empty the trace files from before, so stale entries can't cause a wrong diagnosis:
echo > evmd.trc
echo > ohasd_orarootagent_root.trc
echo > gipcd.trc
echo > osysmond.trc
echo > ohasd.trc
echo > gpnpd.trc
echo > ohasd_oraagent_grid.trc
echo > ohasd_cssdmonitor_root.trc
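Truncating with > (rather than rm) matters here: the daemons keep their file descriptors open, and emptying the file in place lets them continue writing to it. The same truncation as a loop, for reference:
# ':>' empties each trace in place, keeping the inode (and any open fd)
# intact -- rm would leave the daemons writing to deleted files.
for f in evmd ohasd_orarootagent_root gipcd osysmond ohasd gpnpd \
         ohasd_oraagent_grid ohasd_cssdmonitor_root; do
    : > "$f.trc"
done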
After starting it again:
[root@o1 ~]# /u01/app/19.3.0/grid/bin/crsctl start has
All of these files are now freshly written:
-rw-rw---- 1 root oinstall 897 Apr 9 14:26 crsctl_18172.trm
-rw-rw---- 1 root oinstall 1630 Apr 9 14:26 crsctl_18172.trc
-rw-rw---- 1 root oinstall 1038 Apr 9 14:26 crsctl_18224.trm
-rw-rw---- 1 root oinstall 2337 Apr 9 14:26 crsctl_18224.trc
-rw-rw---- 1 grid oinstall 23135 Apr 9 14:26 evmlogger.trm
-rw-rw---- 1 grid oinstall 50412 Apr 9 14:26 evmlogger.trc
-rw-rw---- 1 grid oinstall 279145 Apr 9 14:26 mdnsd.trm
-rw-rw---- 1 grid oinstall 1936017 Apr 9 14:26 mdnsd.trc
-rw-rw---- 1 grid oinstall 4081 Apr 9 14:27 alert.log
-rw-rw---- 1 grid oinstall 94702 Apr 9 14:27 evmd.trm
-rw-rw---- 1 grid oinstall 19044 Apr 9 14:27 evmd.trc
-rw-rw---- 1 grid oinstall 2866766 Apr 9 14:27 ohasd_oraagent_grid.trm
-rw-rw---- 1 grid oinstall 137709 Apr 9 14:27 ohasd_oraagent_grid.trc
-rw-rw---- 1 root oinstall 2389270 Apr 9 14:27 ohasd_orarootagent_root.trm
-rw-rw---- 1 root oinstall 105889 Apr 9 14:27 ohasd_orarootagent_root.trc
-rw-rw---- 1 root oinstall 1091748 Apr 9 14:27 osysmond.trm
-rw-rw---- 1 root oinstall 14824 Apr 9 14:27 osysmond.trc
-rw-rw---- 1 root oinstall 1446060 Apr 9 14:27 ohasd_cssdagent_root.trm
-rw-rw---- 1 root oinstall 9750176 Apr 9 14:27 ohasd_cssdagent_root.trc
-rw-rw---- 1 root oinstall 1378922 Apr 9 14:27 ohasd.trm
-rw-rw---- 1 root oinstall 251695 Apr 9 14:27 ohasd.trc
-rw-rw---- 1 grid oinstall 1599154 Apr 9 14:27 gpnpd.trm
-rw-rw---- 1 grid oinstall 152768 Apr 9 14:27 gpnpd.trc
-rw-rw---- 1 grid oinstall 1117850 Apr 9 14:27 gipcd.trm
-rw-rw---- 1 grid oinstall 146112 Apr 9 14:27 gipcd.trc
-rw-rw---- 1 root oinstall 2455414 Apr 9 14:27 ohasd_cssdmonitor_root.trm
-rw-rw---- 1 root oinstall 107119 Apr 9 14:27 ohasd_cssdmonitor_root.trc
-rw-rw---- 1 grid oinstall 3610000 Apr 9 14:27 ocssd.trm
-rw-rw---- 1 grid oinstall 25281800 Apr 9 14:27 ocssd.trc
This time read from the bottom of the listing upwards, since ocssd.trc is newly appearing.
Start with ocssd.trc:
2024-04-09 14:29:42.625 : CSSD:1412859648: [ INFO] clssnmvDHBValidateNCopy: node 2, o2, has a disk HB, but no network HB, DHB has rcfg 606608534, wrtcnt, 9461155, LATS 99078554, lastSeqNo 9461152, uniqueness 1711113715, timestamp 1712644140/1530158194
Finally, the key log line. Node 2 "has a disk HB, but no network HB": node 1 can see node 2's heartbeats through the voting disks, yet nothing arrives over the interconnect. Shared storage is fine; the private network is broken. Search results agree this points at the private interconnect.
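A more targeted check than a plain ping is to force the ICMP traffic out of the interconnect NIC itself (ens224 in this environment; take the name from the oifcfg output shown later):
# Bind the ping to the interconnect interface instead of letting the
# routing table pick a path.
ping -c 3 -I ens224 o2-priv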
[grid@o1:/u01/app/grid/diag/crs/o1/crs/trace]$ crsctl stat res -t -init
--------------------------------------------------------------------------------
Name Target State Server State details
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
1 ONLINE OFFLINE STABLE
ora.cluster_interconnect.haip
1 ONLINE OFFLINE STABLE
ora.crf
1 ONLINE ONLINE o1 STABLE
ora.crsd
1 ONLINE OFFLINE STABLE
ora.cssd
1 ONLINE OFFLINE STABLE
ora.cssdmonitor
1 ONLINE ONLINE o1 STABLE
ora.ctssd
1 ONLINE OFFLINE STABLE
ora.diskmon
1 OFFLINE OFFLINE STABLE
ora.evmd
1 ONLINE INTERMEDIATE o1 STABLE
ora.gipcd
1 ONLINE ONLINE o1 STABLE
ora.gpnpd
1 ONLINE ONLINE o1 STABLE
ora.mdnsd
1 ONLINE ONLINE o1 STABLE
ora.storage
1 ONLINE OFFLINE STABLE
--------------------------------------------------------------------------------
On a hunch I shut the cluster down on node 2, and node 1 then started. In other words, only one of the two nodes could join the cluster at a time.
Resource state of the node that failed to start:
[root@o2 log]# /u01/app/19.3.0/grid/bin/crsctl stat res -t -init
--------------------------------------------------------------------------------
Name Target State Server State details
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
1 ONLINE OFFLINE STABLE
ora.cluster_interconnect.haip
1 ONLINE OFFLINE STABLE
ora.crf
1 ONLINE ONLINE o2 STABLE
ora.crsd
1 ONLINE OFFLINE STABLE
ora.cssd
1 ONLINE OFFLINE STABLE
ora.cssdmonitor
1 ONLINE ONLINE o2 STABLE
ora.ctssd
1 ONLINE OFFLINE STABLE
ora.diskmon
1 OFFLINE OFFLINE STABLE
ora.evmd
1 ONLINE INTERMEDIATE o2 STABLE
ora.gipcd
1 ONLINE ONLINE o2 STABLE
ora.gpnpd
1 ONLINE ONLINE o2 STABLE
ora.mdnsd
1 ONLINE ONLINE o2 STABLE
ora.storage
1 ONLINE OFFLINE STABLE
--------------------------------------------------------------------------------
The system log:
Apr 9 15:09:44 o2 abrt-hook-ccpp: Process 31554 (ocssd.bin) of user 11012 killed by SIGABRT - dumping core
Apr 9 15:09:45 o2 abrt-hook-ccpp: Failed to create core_backtrace: waitpid failed: No child processes
Apr 9 15:09:45 o2 abrt-server: Executable '/u01/app/19.3.0/grid/bin/ocssd.bin' doesn't belong to any package and ProcessUnpackaged is set to 'no'
Apr 9 15:09:45 o2 abrt-server: 'post-create' on '/var/spool/abrt/ccpp-2024-04-09-15:09:44-31554' exited with 1
Apr 9 15:09:45 o2 abrt-server: Deleting problem directory '/var/spool/abrt/ccpp-2024-04-09-15:09:44-31554'
The OS log says it plainly: ocssd.bin is the process in trouble, killed by SIGABRT and dumping core.
[grid@o2:/u01/app/grid/diag/crs/o2/crs/trace]$ pwd
/u01/app/grid/diag/crs/o2/crs/trace
[grid@o2:/u01/app/grid/diag/crs/o2/crs/trace]$ tail -3000f ocssd.trc
*** 2024-04-09T15:09:43.404195+08:00
DDE: Flood control is not active
2024-04-09T15:09:43.419027+08:00
Incident 473 created, dump file: /u01/app/grid/diag/crs/o2/crs/incident/incdir_473/ocssd_i473.trc
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
2024-04-09 15:09:44.025 : CSSD:353658624: [ INFO] clssnmvDHBValidateNCopy: node 1, o1, has a disk HB, but no network HB, DHB has rcfg 608140712, wrtcnt, 13451654, LATS 1532601984, lastSeqNo 13451651, uniqueness 1712645900, timestamp 1712646625/101521814
That all but confirms a network problem.
Check the network configuration of both nodes:
[grid@o2:/u01/app/grid/diag/crs/o2/crs/trace]$ logout
[root@o2 log]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
#Private IP
172.0.0.1 o1-priv
172.0.0.2 o2-priv
ping was clean, and so was traceroute. At this point I was close to giving up and reinstalling, but I tried ssh one more time, and it asked for a password. That should never happen: passwordless SSH between these nodes was set up long ago.
[root@o2 log]# traceroute o1
traceroute to o1 (192.168.254.221), 30 hops max, 60 byte packets
1 o1 (192.168.254.221) 0.346 ms 0.243 ms 0.215 ms
[root@o2 log]# traceroute o1-priv
traceroute to o1-priv (172.0.0.1), 30 hops max, 60 byte packets
1 o1-priv (172.0.0.1) 0.420 ms 0.255 ms 0.221 ms
[root@o2 log]# ssh o1-priv
The authenticity of host 'o1-priv (172.0.0.1)' can't be established.
ECDSA key fingerprint is SHA256:E6ZX+p3xt3ObHUmzIdOKg/QYbTFjXUahPYLIkFQDrGw.
ECDSA key fingerprint is MD5:40:37:92:4d:40:4c:82:e2:d3:81:e4:20:70:b8:d5:74.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'o1-priv' (ECDSA) to the list of known hosts.
Warning: the ECDSA host key for 'o1-priv' differs from the key for the IP address '172.0.0.1'
Offending key for IP in /root/.ssh/known_hosts:1
Are you sure you want to continue connecting (yes/no)? yes
root@o1-priv's password:
Last login: Wed Apr 3 17:22:07 2024 from 192.168.144.52
[root@o41 ~]#
Here I discovered I had logged into a node of an entirely different cluster, because its private IP is also 172.0.0.1!
I logged out and retried, several times; the result was the same every time.
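A duplicate IP like this can also be confirmed directly at ARP level, without relying on ssh host-key warnings; a sketch (iputils arping, interconnect NIC ens224 assumed):
# -D is duplicate address detection: any reply means another machine
# already answers for 172.0.0.1 on this segment.
arping -D -c 3 -I ens224 172.0.0.1
# Compare the MAC address the kernel has cached for the peer:
ip neigh show to 172.0.0.1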
[root@ojndev41 ~]# su - oracle
Last login: Tue Apr 9 10:36:08 CST 2024
[oracle@ojndev41:/home/oracle]$ ssh o1-priv
ssh: Could not resolve hostname o1-priv: Name or service not known
[oracle@ojndev41:/home/oracle]$ logout
[root@ojndev41 ~]# logout
Connection to o1-priv closed.
[root@o2 log]# ssh o1-priv
Warning: the ECDSA host key for 'o1-priv' differs from the key for the IP address '172.0.0.1'
Offending key for IP in /root/.ssh/known_hosts:1
Matching host key in /root/.ssh/known_hosts:2
Are you sure you want to continue connecting (yes/no)? yes
root@o1-priv's password:
Last login: Tue Apr 9 15:32:02 2024 from ojndev42-priv
[root@ojndev41 ~]# logout
Connection to o1-priv closed.
So the private IPs of this cluster had to change.
Changing the private IP
(A side note: 172.0.0.0/24 is not actually RFC 1918 private address space; the reserved range is 172.16.0.0/12. Two teams improvising addresses out of the same non-reserved /24 is what made this collision possible.)
Original IP configuration:
[root@o1 ~]# cat /etc/hosts
#Private IP
172.0.0.1 o1-priv
172.0.0.2 o2-priv
Only the PRIV private IPs change.
New IP configuration:
[root@o1 ~]# cat /etc/hosts
#Private IP
172.0.0.11 o1-priv
172.0.0.12 o2-priv
Once the new IPs are confirmed, the preparation work is:
Back up the OCR and the GPNP profile
Cleanly stop the database, listener, and CRS
Edit the /etc/hosts configuration file
Change the private-interconnect NIC address at the OS level
Start CRS
Back up the OCR and GPNP profile
Node 1
[root@o1 ~]# su - grid
[grid@o1:/u01/app/grid]$ cd /u01/app/19.3.0/grid/gpnp/o1/profiles/peer
[grid@o1:/u01/app/19.3.0/grid/gpnp/o1/profiles/peer]$ cp -p profile.xml profile.xml.bak
Node 2
[root@o2 log]# su - grid
[grid@o2:/home/grid]$ cd /u01/app/19.3.0/grid/gpnp/o2/profiles/peer
[grid@o2:/u01/app/19.3.0/grid/gpnp/o2/profiles/peer]$ cp -p profile.xml profile.xml.bak
As root, take a manual OCR backup:
[root@o1 ~]# /u01/app/19.3.0/grid/bin/ocrconfig -manualbackup
o1 2024/04/09 16:08:31 +OCR:/ojndev-cluster/OCRBACKUP/backup_20240409_160831.ocr.263.1165853313 2204791795
List the manual OCR backups:
[root@o1 ~]# /u01/app/19.3.0/grid/bin/ocrconfig -showbackup manual
o1 2024/04/09 16:08:31 +OCR:/ojndev-cluster/OCRBACKUP/backup_20240409_160831.ocr.263.1165853313 2204791795
Cleanly stop the database, listener, and CRS
## Run on a single node (shown here as root, using the full grid-home path)
[root@o1 ~]$ /u01/app/19.3.0/grid/bin/srvctl stop database -d o5
[root@o2 ~]$ /u01/app/19.3.0/grid/bin/srvctl stop listener
## Run as root on all nodes
[root@o1 ~]# /u01/app/19.3.0/grid/bin/crsctl stop crs
[root@o2 ~]# /u01/app/19.3.0/grid/bin/crsctl stop crs
📢 Note: after stopping this way, the next crs start will not automatically bring the listener and the database back up.
Edit the /etc/hosts configuration file
Back it up first:
[root@o1 ~]# cp /etc/hosts /etc/hosts.bak
[root@o2 ~]# cp /etc/hosts /etc/hosts.bak
Then edit /etc/hosts and change the corresponding entries as follows:
[root@o1 ~]# cat /etc/hosts
#Private IP
172.0.0.11 o1-priv
172.0.0.12 o2-priv
Everything else stays the same.
Change the private-interconnect NIC address at the OS level
This must be done on both nodes:
[root@o1 ~]# ifconfig
ens224: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.0.0.1 netmask 255.255.255.0 broadcast 172.0.0.255
inet6 fe80::20c:29ff:fef2:f449 prefixlen 64 scopeid 0x20<link>
ether 00:0c:29:f2:f4:49 txqueuelen 1000 (Ethernet)
RX packets 17003814 bytes 14565525692 (13.5 GiB)
RX errors 0 dropped 1115 overruns 0 frame 0
TX packets 12245881 bytes 13036068008 (12.1 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Change only the NIC that needs it; leave the rest alone.
Node 1
[root@o1:~]# nmcli connection modify ens224 ipv4.addresses 172.0.0.11/24
[root@o1:~]# nmcli connection down ens224
[root@o1:~]# nmcli connection up ens224
Node 2
[root@o2:~]# nmcli connection modify ens224 ipv4.addresses 172.0.0.12/24
[root@o2:~]# nmcli connection down ens224
[root@o2:~]# nmcli connection up ens224
The safer approach is still to edit the interface configuration file by hand (and if all else fails, reboot the server after the change).
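A sketch of that hand edit, assuming RHEL 7-style network scripts and device name ens224 (node 1 shown; use 172.0.0.12 on node 2):
# Back the file up, patch the address, and bounce the network stack.
cp /etc/sysconfig/network-scripts/ifcfg-ens224{,.bak}
sed -i 's/^IPADDR=.*/IPADDR=172.0.0.11/' /etc/sysconfig/network-scripts/ifcfg-ens224
systemctl restart network   # or reboot, as noted above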
Verify the private IP in the cluster configuration
Because only the host part of each address changed (.1 to .11 and .2 to .12, still inside 172.0.0.0/24), the subnet that oifcfg stores is unchanged and no re-registration is needed; just confirm it:
[root@o1 ~]# /u01/app/19.3.0/grid/bin/oifcfg getif
ens192 192.168.254.128 global public
ens224 172.0.0.0 global cluster_interconnect,asm
## Verify on all nodes that the interconnect definition is intact
[root@o1:~]# /u01/app/19.3.0/grid/bin/oifcfg getif
[root@o2:~]# /u01/app/19.3.0/grid/bin/oifcfg getif
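Had the subnet itself changed, the stored interface definition would have to be re-registered; for completeness, a sketch of that case (run as root, with CRS down on all nodes):
# Only needed when the interconnect SUBNET changes -- not in this incident.
/u01/app/19.3.0/grid/bin/oifcfg delif -global ens224
/u01/app/19.3.0/grid/bin/oifcfg setif -global ens224/172.0.0.0:cluster_interconnect,asm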
Finally, restart the database, the cluster, and the hosts to confirm the change survives a reboot.
[grid@o1:~]$ srvctl stop database -d o1
[grid@o1:~]$ srvctl stop listener
## 在所有节点用 root 用户执行
[root@o1:~]# /u01/app/19.3.0/grid/bin/crsctl stop crs
[root@o1:~]# shutdown -r now
[root@o2:~]# /u01/app/19.3.0/grid/bin/crsctl stop crs
[root@o2:~]# shutdown -r now
Start everything back up:
[grid@o1:~]$ srvctl start database -d o
[grid@o1:~]$ srvctl start listener
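And to close the loop, the post-restart checks on both nodes (as grid or root):
# Cluster stack health, full resource state, and node membership.
/u01/app/19.3.0/grid/bin/crsctl check crs
/u01/app/19.3.0/grid/bin/crsctl stat res -t
/u01/app/19.3.0/grid/bin/olsnodes -n -s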