
Oracle 19c RAC: Handling Node 1 Failing to Start After a Server Power-Off


Oracle 19c cluster: after a server reboot, node 1 went down and would not start.

[grid@o1:/home/grid]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager

In plain terms: Oracle High Availability Services is online, but Cluster Ready Services, the Cluster Synchronization Services daemon, and the Event Manager are all unreachable.

Oracle 19c changed the log locations: the Oracle Clusterware debug trace files live under $ORACLE_BASE/diag/crs/<hostname>/crs/trace, here $ORACLE_BASE/diag/crs/ojndev51/crs/trace.

[grid@o1:/home/grid]$ echo $ORACLE_BASE
/u01/app/grid
[grid@o1:/home/grid]$ cd /u01/app/grid/diag/crs/ojndev51/crs/trace
[grid@o1:/u01/app/grid/diag/crs/ojndev51/crs/trace]$ pwd
/u01/app/grid/diag/crs/ojndev51/crs/trace
OHASD logs:
[grid@o1:/u01/app/grid/diag/crs/ojndev51/crs/trace]$ ls ohas*
OCSSD logs:
[grid@o1:/u01/app/grid/diag/crs/ojndev51/crs/trace]$ ls ocssd.*
EVMD logs:
[grid@o1:/u01/app/grid/diag/crs/ojndev51/crs/trace]$ ls evm*

Scanning the logs together shows the errors arrive in bursts over time; I picked the most recent window.

Files last written around 10:23–10:24:
-rw-rw---- 1 grid oinstall 4105008 Apr 7 10:23 evmd.trm
-rw-rw---- 1 grid oinstall 25307636 Apr 7 10:23 evmd.trc
-rw-rw---- 1 root oinstall 2392498 Apr 7 10:23 ohasd_orarootagent_root.trm
-rw-rw---- 1 root oinstall 19506810 Apr 7 10:23 ohasd_orarootagent_root.trc
-rw-rw---- 1 grid oinstall 3609590 Apr 7 10:23 gipcd.trm
-rw-rw---- 1 grid oinstall 25180586 Apr 7 10:23 gipcd.trc
-rw-rw---- 1 root oinstall 2947623 Apr 7 10:24 osysmond.trc
-rw-rw---- 1 root oinstall 427298 Apr 7 10:24 osysmond.trm
-rw-rw---- 1 root oinstall 62367 Apr 7 10:24 ohasd.trm
-rw-rw---- 1 root oinstall 425772 Apr 7 10:24 ohasd.trc
-rw-rw---- 1 grid oinstall 424041 Apr 7 10:24 gpnpd.trm
-rw-rw---- 1 grid oinstall 7341956 Apr 7 10:24 gpnpd.trc
-rw-rw---- 1 grid oinstall 2306606 Apr 7 10:24 ohasd_oraagent_grid.trm
-rw-rw---- 1 grid oinstall 10956946 Apr 7 10:24 ohasd_oraagent_grid.trc
-rw-rw---- 1 root oinstall 3022316 Apr 7 10:24 ohasd_cssdmonitor_root.trm
-rw-rw---- 1 root oinstall 23691362 Apr 7 10:24 ohasd_cssdmonitor_root.trc
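
For reference, this kind of time-sorted view can be produced directly; a small sketch (trace path as used in this environment):

# list trace files by modification time, newest last, to spot the active ones
ls -lrt /u01/app/grid/diag/crs/ojndev51/crs/trace/*.tr[cm] | tail -20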

Work through the files above one by one to locate the errors.

First, look at ohasd_orarootagent_root.trc:

[grid@o1:/u01/app/grid/diag/crs/o1/crs/trace]$ tail -3000 ohasd_orarootagent_root.trc

2024-04-07 10:34:30.404 :CLSDYNAM:1170007808: [ora.storage]{0:0:2} [check] Time:04/07/2024 10:34:30.403 Tint:{0:0:2} action:104 resname:ora.storage lastCall:(:CLSN00109:) Agent::commonCheck check failed action:0104 retval:1

A web search turned up nothing about this error, and it is the only "failed" entry, so set it aside for now.


Next, evmd.trc.

The EVMD process is responsible for publishing the events generated by CRS.

[grid@o1:/u01/app/grid/diag/crs/ojndev51/crs/trace]$ tail -3000 evmd.trc

2024-04-07 10:34:34.349 :EVMAGENT:1830906560: [ INFO] [ENTER] Got CE filter: 'not_expecting_events' for client 0
2024-04-07 10:34:34.352 : EVMEVT:1830906560: [ INFO] 0x7ff750000ba0 queueing filter event 0x5630920ec470 as 0x5630920e6ab0 until membership is available
2024-04-07 10:34:34.354 :EVMAGENT:1830906560: [ INFO] Got CE filter removal for client 0
2024-04-07 10:34:34.354 :EVMAGENT:1830906560: [ INFO] Removing CE filter: 'not_expecting_events' for client 0
2024-04-07 10:34:34.355 : EVMEVT:1830906560: [ INFO] 0x7ff750000ba0 queueing filter event 0x5630920f4650 as 0x563092101110 until membership is available
2024-04-07 10:35:27.115 : EVMEVT:1660151552: [ ERROR] EVMD waiting for CSS to be ready err = 3
2024-04-07 10:36:27.426 : EVMEVT:1660151552: [ ERROR] EVMD waiting for CSS to be ready err = 3
2024-04-07 10:37:27.731 : EVMEVT:1660151552: [ ERROR] EVMD waiting for CSS to be ready err = 3
2024-04-07 10:38:27.058 : EVMEVT:1660151552: [ ERROR] EVMD waiting for CSS to be ready err = 3
2024-04-07 10:39:27.401 : EVMEVT:1660151552: [ ERROR] EVMD waiting for CSS to be ready err = 3

One suggestion found online is to move the whole /var/log/.oracle directory out of the way, but I didn't dare do that; noted here as a last resort.
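
The EVMD errors themselves are downstream: EVMD blocks until CSS reports cluster membership, so "waiting for CSS to be ready" only tells us CSSD never came up. A quick way to check where CSSD stands (standard 19c resource name; run from the GI home):

# ora.cssd lives in the lower, OHASD-managed stack, hence -init
/u01/app/19.3.0/grid/bin/crsctl stat res ora.cssd -init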


Next, gipcd.trc:

tail -3000f gipcd.trc

2024-04-07 10:54:02.217 : GIPCLIB:373221120: gipclibCheckProcessAliveness: ospid 10697, timestamp 2365 is ALIVE
2024-04-07 10:54:02.217 : GIPC:373221120: gipcdsMemoryDelDeadSubscribers: subscriber gipcd (pid 10697) is alive
2024-04-07 10:54:02.217 : GIPC:373221120: gipcdsMemoryDelDeadSubscribers: processed the subscribers list
2024-04-07 10:54:02.217 : GIPC:373221120: gipcdsMemoryGarbageCollection: garbage collection completed
2024-04-07 10:54:02.622 :GIPCHTHR:371119872: gipchaDaemonWork: DaemonThread heart beat, time interval since last heartBeat 30850 loopCount 33
2024-04-07 10:54:13.231 :GIPCHTHR:373221120: gipchaWorkerWork: workerThread heart beat, time interval since last heartBeat 30040 loopCount 30 sendCount 0 recvCount 0 postCount 0 sendCmplCount 0 recvCmplCount 0
2024-04-07 10:54:20.061 :GIPCDMON:379569920: gipcdMonitorPublishDiags: key gipc_round_trip_time handle 0x7fda0c3245f0 writeTime 1343768054 value <>

Nothing here seems related; it only reports the interval since the last heartbeat, which feels more like a symptom than a cause.


Next, osysmond.trc:

tail -3000f osysmond.trc

2024-04-07 10:56:10.003 : default:1109950208: scrfosm_fill_all_nic_info: NIC: virbr0-nic: not found in ioctl array
2024-04-07 10:56:10.135 : CRFMOND:1109950208: Sender thread took more than expected time to send. Logging nodeview locally and going ahead with nodeview generation. Serial num = 268832

The sender thread is lagging. I pinged the physical IPs, private IPs, VIPs, and SCAN IP between the two nodes and everything answered, so I began to wonder whether this was a network problem after all.
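
For the record, a minimal sketch of those reachability checks, looping over each address class (the VIP and SCAN names below are placeholders, not this cluster's real ones; substitute your own):

#!/bin/bash
# ping every address class between the nodes: 2 probes, 1-second timeout each
for h in o1 o2 o1-priv o2-priv o1-vip o2-vip o-scan; do
    if ping -c 2 -W 1 "$h" >/dev/null 2>&1; then
        echo "$h: reachable"
    else
        echo "$h: UNREACHABLE"
    fi
done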


Next, ohasd.trc:

[grid@o1:/u01/app/grid/diag/crs/o1/crs/trace]$ more ohasd.trc

*** 2024-04-07T08:47:30.352172+08:00

*** TRACE CONTINUED FROM FILE /u01/app/grid/diag/crs/o1/crs/trace/ohasd_14.trc ***

2024-04-07 08:47:30.321 : CRSPE:316630784: [ INFO] {0:0:48805} Processing PE command id=114267 origin:ojndev51. Description: [Stat Resource : 0x7f0de41fccf0]
2024-04-07 08:47:30.322 :UiServer:310327040: [ INFO] {0:0:48805} Done for ctx=0x7f0de0044ce0
2024-04-07 08:47:30.328 :UiServer:310327040: [ INFO] {0:0:48806} Sending to PE. ctx= 0x7f0de0044c40, ClientPID=10309 set Properties (root,29093001), orig.tint: {0:0:2}
2024-04-07 08:47:30.329 : CRSPE:316630784: [ INFO] {0:0:48806} Processing PE command id=114268 origin:ojndev51. Description: [Stat Resource : 0x7f0de404ee50]
2024-04-07 08:47:30.330 :UiServer:310327040: [ INFO] {0:0:48806} Done for ctx=0x7f0de0044c40
2024-04-07 08:47:30.337 :UiServer:310327040: [ INFO] {0:0:48807} Sending to PE. ctx= 0x7f0de0044e80, ClientPID=10309 set Properties (root,29093128), orig.tint: {0:0:2}
2024-04-07 08:47:30.337 : CRSPE:316630784: [ INFO] {0:0:48807} Processing PE command id=114269 origin:ojndev51. Description: [Stat Resource : 0x7f0de41fccf0]
2024-04-07 08:47:30.337 : CRSPE:316630784: [ INFO] {0:0:48807} Expression Filter : ((LAST_SERVER == ojndev51) AND (NAME == ora.cssd))
2024-04-07 08:47:30.338 :UiServer:310327040: [ INFO] {0:0:48807} Done for ctx=0x7f0de0044e80
2024-04-07 08:47:36.139 :GIPCHTHR:306124544: gipchaDaemonWork: DaemonThread heart beat, time interval since last heartBeat 30830 loopCount 33
2024-04-07 08:47:38.411 :UiServer:310327040: [ INFO] {0:0:2}
OHAS DIAGNOSTICS
Last initiated command : Start Resource : 0x7f0de4055c70
Last initiated command timestamp : 03/22/2024 22:18:40
Rate of (local) STAT submissions : 5 reqs/minute
Rate of (local) non-STAT submissions : 0 reqs/minute
Rate of (PE) STAT submissions : 5 reqs/minute
Rate of (PE) STAT completions : 5 reqs/minute
Rate of (PE) non-STAT submissions : 0 reqs/minute
Rate of (PE) non-STAT completions : 0 reqs/minute
Job Scheduler Queue Size : 0
Pending (PE) STAT count : 0
Pending (PE) non-STAT count : 0
2024-04-07 08:47:47.347 :GIPCHTHR:333440768: gipchaWorkerWork: workerThread heart beat, time interval since last heartBeat 30030 loopCount 30 sendCount 0 recvCount 0 postCount 0 sendCmplCount 0 recvCmplCount 0
2024-04-07 08:47:53.443 :UiServer:310327040: [ INFO] {0:0:48808} Sending to PE. ctx= 0x7f0de0043d20, ClientPID=19610 set Properties (root,29093468)
2024-04-07 08:47:53.443 : CRSPE:316630784: [ INFO] {0:0:48808} Processing PE command id=114270 origin:ojndev51. Description: [Stat Resource : 0x7f0de404ee50]
2024-04-07 08:47:53.444 : CRSPE:316630784: [ INFO] {0:0:48808} Expression Filter : (((NAME == ora.crsd) OR (NAME == ora.cssd)) OR (NAME == ora.evmd))
2024-04-07 08:47:53.446 :UiServer:310327040: [ INFO] {0:0:48808} Done for ctx=0x7f0de0043d20
2024-04-07 08:48:00.248 :UiServer:310327040: [ INFO] {0:0:2} Periodic check of IPC sockets...
2024-04-07 08:48:00.248 :UiServer:310327040: [ INFO] {0:0:2} ...socket check done
2024-04-07 08:48:00.353 :UiServer:310327040: [ INFO] {0:0:48809} Sending to PE. ctx= 0x7f0de0044770, ClientPID=10583 set Properties (grid,29093724)
2024-04-07 08:48:00.354 : CRSPE:316630784: [ INFO] {0:0:48809} Processing PE command id=114271 origin:ojndev51. Description: [Stat Resource : 0x7f0de41fccf0]
2024-04-07 08:48:00.355 :UiServer:310327040: [ INFO] {0:0:48809} Done for ctx=0x7f0de0044770

No notable errors here either, just heartbeat intervals again.

Next, gpnpd.trc:

2024-04-07 11:24:46.890 :GIPCXCPT:288073472: gipcInternalConnectSync: failed sync request, addr 0x7f55dc08ad90 [0000000000ce4064] { gipcAddress : name 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_UI_SOCKET)(GIPCID=00000000-00000000-0))', objFlags 0x0, addrFlags 0x4 }, ret gipcretConnectionRefused (29)
2024-04-07 11:24:46.890 :GIPCXCPT:288073472: gipcConnectSyncF [clscrsconGipcConnect : clscrscon.c : 700]: EXCEPTION[ ret gipcretConnectionRefused (29) ] failed sync connect endp 0x7f55dc08a5f0 [0000000000ce405d] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=00000000-00000000-0))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_UI_SOCKET)(GIPCID=00000000-00000000-0))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj 0x7f55dc82c390, sendp 0x7f55dd69e900 status 13flags 0xa108071a, flags-2 0x0, usrFlags 0x0 }, addr 0x7f55dc08ad90 [0000000000ce4064] { gipcAddress : name 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_UI_SOCKET)(GIPCID=00000000-00000000-0))', objFlags 0x0, addrFlags 0x4 }, flags 0x0
2024-04-07 11:24:46.891 : GPNP:288073472: clsgpnp_queryCrs(): CRS is not ready. Cannot query GNS resource state.
2024-04-07 11:24:51.891 : GPNP:288073472: clsgpnp_queryCrs(): Querying CRS for resource type "ora.gns.type".

This file has the most numerous and most obvious errors, yet searching online turned up nothing definitive about them.

Some say the private-network configuration is at fault and that reconfiguring and restarting fixes it. But I had already rebooted the server many times, and cycling the stack with crsctl stop crs did nothing either.

Nothing left to lose at this point, so try anything.

High Availability Services (HAS)


[root@o1 log]# /u01/app/19.3.0/grid/bin/crsctl stop has -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'o1'
CRS-2673: Attempting to stop 'ora.crf' on 'o1'
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'o1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'o1'
CRS-2673: Attempting to stop 'ora.evmd' on 'o1'
CRS-2677: Stop of 'ora.cssdmonitor' on 'o1' succeeded
CRS-2677: Stop of 'ora.crf' on 'o1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'o1'
CRS-2673: Attempting to stop 'ora.gpnpd' on 'o1'
CRS-2677: Stop of 'ora.evmd' on 'o1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'o1' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'o1' succeeded
CRS-2677: Stop of 'ora.gipcd' on 'o1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'o1' has completed
CRS-4133: Oracle High Availability Services has been stopped.

[root@o1 log]# /u01/app/19.3.0/grid/bin/crsctl start has
CRS-4123: Oracle High Availability Services has been started.

Check the cluster status... it actually came back?


Questions

What is HAS for? Which processes does it involve? Why did a forced stop and start fix things when rebooting the server did not?

[root@o1 log]# /u01/app/19.3.0/grid/bin/crsctl check crs
CRS-4639: Could not contact Oracle High Availability Services

[root@o1 log]# /u01/app/19.3.0/grid/bin/crsctl status res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.

High Availability Services

Rebooting the server didn't help, and restarting CRS didn't help, but a forced stop and start of the high-availability (HAS) stack did. That suggests something is wrong in the boot-time autostart settings.
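
If boot-time autostart is the suspect, it can at least be inspected and re-enabled with standard crsctl options (run as root; I can't confirm this was the cause here):

# show whether Oracle High Availability Services autostart is enabled
/u01/app/19.3.0/grid/bin/crsctl config crs
# re-enable autostart if it reports disabled
/u01/app/19.3.0/grid/bin/crsctl enable crs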

Which raises the questions: what is the difference between restarting HAS and restarting CRS? What does each of them affect? Why does restarting HAS bring the database cluster back when a full server reboot does not? Is this a bug, or did some startup step change?
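
My rough understanding, hedged: OHASD (the "has" stack) is the bootstrap daemon launched at boot via init.ohasd, and everything else, CRSD included, is brought up underneath it. The two layers can at least be inspected side by side:

# lower stack: resources managed directly by OHASD (cssd, ctssd, asm, ...)
/u01/app/19.3.0/grid/bin/crsctl stat res -t -init
# upper stack: cluster resources managed by CRSD (databases, listeners, VIPs, ...)
/u01/app/19.3.0/grid/bin/crsctl stat res -t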

########################################################################################################################################

After a second server reboot, the database again failed to come up.

This time, the forced HAS stop-and-start trick did not help either.

[root@o1 ~]# /u01/app/19.3.0/grid/bin/crsctl stop has -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'o1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'o1'
CRS-2677: Stop of 'ora.mdnsd' on 'o1' succeeded

CRS-2673: Attempting to stop 'ora.crf' on 'o1'
CRS-2673: Attempting to stop 'ora.gpnpd' on 'o1'
CRS-2673: Attempting to stop 'ora.evmd' on 'o1'
CRS-2677: Stop of 'ora.crf' on 'o1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'o1'
CRS-2677: Stop of 'ora.gpnpd' on 'o1' succeeded
CRS-2677: Stop of 'ora.evmd' on 'o1' succeeded
CRS-2677: Stop of 'ora.gipcd' on 'o1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'o1' has completed
CRS-4133: Oracle High Availability Services has been stopped.


First, truncate the trace files identified earlier, so stale entries don't mislead the next round of analysis.

-rw-rw---- 1 grid oinstall  4105008 Apr  7 10:23 evmd.trm
-rw-rw---- 1 grid oinstall 25307636 Apr  7 10:23 evmd.trc
-rw-rw---- 1 root oinstall  2392498 Apr  7 10:23 ohasd_orarootagent_root.trm
-rw-rw---- 1 root oinstall 19506810 Apr  7 10:23 ohasd_orarootagent_root.trc
-rw-rw---- 1 grid oinstall  3609590 Apr  7 10:23 gipcd.trm
-rw-rw---- 1 grid oinstall 25180586 Apr  7 10:23 gipcd.trc
-rw-rw---- 1 root oinstall  2947623 Apr  7 10:24 osysmond.trc
-rw-rw---- 1 root oinstall   427298 Apr  7 10:24 osysmond.trm
-rw-rw---- 1 root oinstall    62367 Apr  7 10:24 ohasd.trm
-rw-rw---- 1 root oinstall   425772 Apr  7 10:24 ohasd.trc
-rw-rw---- 1 grid oinstall   424041 Apr  7 10:24 gpnpd.trm
-rw-rw---- 1 grid oinstall  7341956 Apr  7 10:24 gpnpd.trc
-rw-rw---- 1 grid oinstall  2306606 Apr  7 10:24 ohasd_oraagent_grid.trm
-rw-rw---- 1 grid oinstall 10956946 Apr  7 10:24 ohasd_oraagent_grid.trc
-rw-rw---- 1 root oinstall  3022316 Apr  7 10:24 ohasd_cssdmonitor_root.trm
-rw-rw---- 1 root oinstall 23691362 Apr  7 10:24 ohasd_cssdmonitor_root.trc


echo > evmd.trc
echo > ohasd_orarootagent_root.trc
echo > gipcd.trc
echo > osysmond.trc
echo > ohasd.trc
echo > gpnpd.trc
echo > ohasd_oraagent_grid.trc
echo > ohasd_cssdmonitor_root.trc


After starting it again:

[root@o1 ~]# /u01/app/19.3.0/grid/bin/crsctl start has

These are the most recently written files:

-rw-rw---- 1 root oinstall      897 Apr  9 14:26 crsctl_18172.trm
-rw-rw---- 1 root oinstall     1630 Apr  9 14:26 crsctl_18172.trc
-rw-rw---- 1 root oinstall     1038 Apr  9 14:26 crsctl_18224.trm
-rw-rw---- 1 root oinstall     2337 Apr  9 14:26 crsctl_18224.trc
-rw-rw---- 1 grid oinstall    23135 Apr  9 14:26 evmlogger.trm
-rw-rw---- 1 grid oinstall    50412 Apr  9 14:26 evmlogger.trc
-rw-rw---- 1 grid oinstall   279145 Apr  9 14:26 mdnsd.trm
-rw-rw---- 1 grid oinstall  1936017 Apr  9 14:26 mdnsd.trc
-rw-rw---- 1 grid oinstall     4081 Apr  9 14:27 alert.log
-rw-rw---- 1 grid oinstall    94702 Apr  9 14:27 evmd.trm
-rw-rw---- 1 grid oinstall    19044 Apr  9 14:27 evmd.trc
-rw-rw---- 1 grid oinstall  2866766 Apr  9 14:27 ohasd_oraagent_grid.trm
-rw-rw---- 1 grid oinstall   137709 Apr  9 14:27 ohasd_oraagent_grid.trc
-rw-rw---- 1 root oinstall  2389270 Apr  9 14:27 ohasd_orarootagent_root.trm
-rw-rw---- 1 root oinstall   105889 Apr  9 14:27 ohasd_orarootagent_root.trc
-rw-rw---- 1 root oinstall  1091748 Apr  9 14:27 osysmond.trm
-rw-rw---- 1 root oinstall    14824 Apr  9 14:27 osysmond.trc
-rw-rw---- 1 root oinstall  1446060 Apr  9 14:27 ohasd_cssdagent_root.trm
-rw-rw---- 1 root oinstall  9750176 Apr  9 14:27 ohasd_cssdagent_root.trc
-rw-rw---- 1 root oinstall  1378922 Apr  9 14:27 ohasd.trm
-rw-rw---- 1 root oinstall   251695 Apr  9 14:27 ohasd.trc
-rw-rw---- 1 grid oinstall  1599154 Apr  9 14:27 gpnpd.trm
-rw-rw---- 1 grid oinstall   152768 Apr  9 14:27 gpnpd.trc
-rw-rw---- 1 grid oinstall  1117850 Apr  9 14:27 gipcd.trm
-rw-rw---- 1 grid oinstall   146112 Apr  9 14:27 gipcd.trc
-rw-rw---- 1 root oinstall  2455414 Apr  9 14:27 ohasd_cssdmonitor_root.trm
-rw-rw---- 1 root oinstall   107119 Apr  9 14:27 ohasd_cssdmonitor_root.trc
-rw-rw---- 1 grid oinstall  3610000 Apr  9 14:27 ocssd.trm
-rw-rw---- 1 grid oinstall 25281800 Apr  9 14:27 ocssd.trc

This time read from the newest files backward, since ocssd traces now appear among the most recently written.

First, ocssd.trc:

2024-04-09 14:29:42.625 :    CSSD:1412859648: [     INFO] clssnmvDHBValidateNCopy: node 2, o2, has a disk HB, but no network HB, DHB has rcfg 606608534, wrtcnt, 9461155, LATS 99078554, lastSeqNo 9461152, uniqueness 1711113715, timestamp 1712644140/1530158194

This is the key entry: node 1 can still see node 2's heartbeats on the voting disk (disk HB) but receives no network heartbeat from it over the interconnect. Online sources agree this points at the private network.
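
A trivial sketch to gauge how long the condition persisted, run from the trace directory:

# first and last occurrences, plus a count, of the no-network-HB pattern
grep "has a disk HB, but no network HB" ocssd.trc | head -1
grep "has a disk HB, but no network HB" ocssd.trc | tail -1
grep -c "has a disk HB, but no network HB" ocssd.trc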


[grid@o1:/u01/app/grid/diag/crs/o1/crs/trace]$ crsctl stat res -t -init
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details       
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  OFFLINE                               STABLE
ora.cluster_interconnect.haip
      1        ONLINE  OFFLINE                               STABLE
ora.crf
      1        ONLINE  ONLINE       o1                 STABLE
ora.crsd
      1        ONLINE  OFFLINE                               STABLE
ora.cssd
      1        ONLINE  OFFLINE                               STABLE
ora.cssdmonitor
      1        ONLINE  ONLINE       o1                 STABLE
ora.ctssd
      1        ONLINE  OFFLINE                               STABLE
ora.diskmon
      1        OFFLINE OFFLINE                               STABLE
ora.evmd
      1        ONLINE  INTERMEDIATE o1                 STABLE
ora.gipcd
      1        ONLINE  ONLINE       o1                 STABLE
ora.gpnpd
      1        ONLINE  ONLINE       o1                 STABLE
ora.mdnsd
      1        ONLINE  ONLINE       o1                 STABLE
ora.storage
      1        ONLINE  OFFLINE                               STABLE
--------------------------------------------------------------------------------

On a hunch I shut the cluster down on node 2, and node 1 then came up. In other words, only one of the two nodes could be up at a time.


The stack on the node that failed to start:

[root@o2 log]# /u01/app/19.3.0/grid/bin/crsctl stat res -t -init
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details       
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  OFFLINE                               STABLE
ora.cluster_interconnect.haip
      1        ONLINE  OFFLINE                               STABLE
ora.crf
      1        ONLINE  ONLINE       o2                 STABLE
ora.crsd
      1        ONLINE  OFFLINE                               STABLE
ora.cssd
      1        ONLINE  OFFLINE                               STABLE
ora.cssdmonitor
      1        ONLINE  ONLINE       o2                 STABLE
ora.ctssd
      1        ONLINE  OFFLINE                               STABLE
ora.diskmon
      1        OFFLINE OFFLINE                               STABLE
ora.evmd
      1        ONLINE  INTERMEDIATE o2                 STABLE
ora.gipcd
      1        ONLINE  ONLINE       o2                 STABLE
ora.gpnpd
      1        ONLINE  ONLINE       o2                 STABLE
ora.mdnsd
      1        ONLINE  ONLINE       o2                 STABLE
ora.storage
      1        ONLINE  OFFLINE                               STABLE
--------------------------------------------------------------------------------


The system log:

Apr  9 15:09:44 o2 abrt-hook-ccpp: Process 31554 (ocssd.bin) of user 11012 killed by SIGABRT - dumping core
Apr  9 15:09:45 o2 abrt-hook-ccpp: Failed to create core_backtrace: waitpid failed: No child processes
Apr  9 15:09:45 o2 abrt-server: Executable '/u01/app/19.3.0/grid/bin/ocssd.bin' doesn't belong to any package and ProcessUnpackaged is set to 'no'
Apr  9 15:09:45 o2 abrt-server: 'post-create' on '/var/spool/abrt/ccpp-2024-04-09-15:09:44-31554' exited with 1
Apr  9 15:09:45 o2 abrt-server: Deleting problem directory '/var/spool/abrt/ccpp-2024-04-09-15:09:44-31554'

This states it plainly: ocssd.bin is the process in trouble.
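
A hedged filter to pull all such aborts out of the system log (RHEL-style /var/log/messages assumed):

# clusterware processes killed or caught by abrt
grep -E "ocssd|abrt" /var/log/messages | tail -20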


[grid@o2:/u01/app/grid/diag/crs/o2/crs/trace]$ pwd
/u01/app/grid/diag/crs/o2/crs/trace
[grid@o2:/u01/app/grid/diag/crs/o2/crs/trace]$ tail -3000f ocssd.trc

*** 2024-04-09T15:09:43.404195+08:00
DDE: Flood control is not active
2024-04-09T15:09:43.419027+08:00
Incident 473 created, dump file: /u01/app/grid/diag/crs/o2/crs/incident/incdir_473/ocssd_i473.trc
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
2024-04-09 15:09:44.025 :    CSSD:353658624: [     INFO] clssnmvDHBValidateNCopy: node 1, o1, has a disk HB, but no network HB, DHB has rcfg 608140712, wrtcnt, 13451654, LATS 1532601984, lastSeqNo 13451651, uniqueness 1712645900, timestamp 1712646625/101521814

That basically confirmed a network problem.


Check the network configuration on both nodes:

[grid@o2:/u01/app/grid/diag/crs/o2/crs/trace]$ logout
[root@o2 log]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

#Private IP
172.0.0.1 o1-priv
172.0.0.2 o2-priv


I tried ping first, then traceroute, and both looked fine. By this point I was ready to give up and reinstall, but I tried ssh one more time, and it prompted for a password. That should not happen, because the nodes had logged into each other before.

[root@o2 log]# traceroute o1
traceroute to o1 (192.168.254.221), 30 hops max, 60 byte packets
 1  o1 (192.168.254.221)  0.346 ms  0.243 ms  0.215 ms

[root@o2 log]# traceroute o1-priv
traceroute to o1-priv (172.0.0.1), 30 hops max, 60 byte packets
 1  o1-priv (172.0.0.1)  0.420 ms  0.255 ms  0.221 ms

[root@o2 log]# ssh o1-priv
The authenticity of host 'o1-priv (172.0.0.1)' can't be established.
ECDSA key fingerprint is SHA256:E6ZX+p3xt3ObHUmzIdOKg/QYbTFjXUahPYLIkFQDrGw.
ECDSA key fingerprint is MD5:40:37:92:4d:40:4c:82:e2:d3:81:e4:20:70:b8:d5:74.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'o1-priv' (ECDSA) to the list of known hosts.
Warning: the ECDSA host key for 'o1-priv' differs from the key for the IP address '172.0.0.1'
Offending key for IP in /root/.ssh/known_hosts:1
Are you sure you want to continue connecting (yes/no)? yes
root@o1-priv's password:
Last login: Wed Apr  3 17:22:07 2024 from 192.168.144.52
[root@o41 ~]#

At this point I discovered I had logged into a node of a completely different cluster!!!! Its private IP is also 172.0.0.1.


I logged out and tried again; after multiple attempts the result was the same.

[root@ojndev41 ~]# su - oracle
Last login: Tue Apr  9 10:36:08 CST 2024
[oracle@ojndev41:/home/oracle]$ ssh o1-priv
ssh: Could not resolve hostname o1-priv: Name or service not known
[oracle@ojndev41:/home/oracle]$ logout
[root@ojndev41 ~]# logout
Connection to o1-priv closed.

[root@o2 log]# ssh o1-priv
Warning: the ECDSA host key for 'o1-priv' differs from the key for the IP address '172.0.0.1'
Offending key for IP in /root/.ssh/known_hosts:1
Matching host key in /root/.ssh/known_hosts:2
Are you sure you want to continue connecting (yes/no)? yes
root@o1-priv's password:
Last login: Tue Apr  9 15:32:02 2024 from ojndev42-priv
[root@ojndev41 ~]# logout
Connection to o1-priv closed.
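
In hindsight, ARP-level duplicate-address detection would have confirmed this directly; a sketch using iputils arping (interface name as in this environment):

# -D: duplicate address detection -- a reply from ANOTHER host means the
# address is already in use elsewhere on the segment
arping -D -c 3 -I ens224 172.0.0.1
echo "exit status: $? (0 = no duplicate detected)"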

I concluded that I needed to change my private IPs.


Changing the private IPs

Original IP configuration:

[root@o1 ~]# cat /etc/hosts
#Private IP
172.0.0.1 o1-priv
172.0.0.2 o2-priv


The main change is the -priv private IPs.

New IP configuration:

[root@o1 ~]# cat /etc/hosts
#Private IP
172.0.0.11 o1-priv
172.0.0.12 o2-priv


With the new IPs confirmed, the preparation work is:

Back up the OCR and GPnP profile
Cleanly stop the database, listener, and CRS
Modify /etc/hosts
Change the private NIC addresses at the OS level
Start CRS


Back up the OCR and GPnP profile

Node 1

[root@o1 ~]# su - grid
[grid@o1:/u01/app/grid]$ cd /u01/app/19.3.0/grid/gpnp/o1/profiles/peer
[grid@o1:/u01/app/19.3.0/grid/gpnp/o1/profiles/peer]$ cp -p profile.xml profile.xml.bak

Node 2

[root@o2 log]# su - grid
[grid@o2:/home/grid]$ cd /u01/app/19.3.0/grid/gpnp/o2/profiles/peer
[grid@o2:/u01/app/19.3.0/grid/gpnp/o2/profiles/peer]$ cp -p profile.xml profile.xml.bak


Run a manual OCR backup as root:

[root@o1 ~]# /u01/app/19.3.0/grid/bin/ocrconfig -manualbackup
o1     2024/04/09 16:08:31     +OCR:/ojndev-cluster/OCRBACKUP/backup_20240409_160831.ocr.263.1165853313     2204791795

List the manual OCR backups:

[root@o1 ~]# /u01/app/19.3.0/grid/bin/ocrconfig -showbackup manual
o1     2024/04/09 16:08:31     +OCR:/ojndev-cluster/OCRBACKUP/backup_20240409_160831.ocr.263.1165853313     2204791795


Cleanly stop the database, listener, and CRS

## run as the grid user on a single node
[root@o1 ~]$ /u01/app/19.3.0/grid/bin/srvctl stop database -d o5
[root@o2 ~]$ /u01/app/19.3.0/grid/bin/srvctl stop listener

## run as root on all nodes
[root@o1 ~]# /u01/app/19.3.0/grid/bin/crsctl stop crs
[root@o2 ~]# /u01/app/19.3.0/grid/bin/crsctl stop crs

📢 Note: after stopping this way, the next CRS start will not bring the listener and database up automatically.


Modify /etc/hosts

Back it up first:

[root@o1 ~]# cp /etc/hosts /etc/hosts.bak
[root@o2 ~]# cp /etc/hosts /etc/hosts.bak

Then edit /etc/hosts and change the corresponding entries to:

[root@o1 ~]# cat /etc/hosts
#Private IP
172.0.0.11 o1-priv
172.0.0.12 o2-priv

Leave everything else unchanged.


Change the private NIC address at the OS level

This must be done on both nodes.

[root@o1 ~]# ifconfig
ens224: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.0.0.1  netmask 255.255.255.0  broadcast 172.0.0.255
        inet6 fe80::20c:29ff:fef2:f449  prefixlen 64  scopeid 0x20<link>
        ether 00:0c:29:f2:f4:49  txqueuelen 1000  (Ethernet)
        RX packets 17003814  bytes 14565525692 (13.5 GiB)
        RX errors 0  dropped 1115  overruns 0  frame 0
        TX packets 12245881  bytes 13036068008 (12.1 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Change only the NIC that needs changing; leave the others alone.

Node 1

[root@o1:~]# nmcli connection modify ens224 ipv4.addresses 172.0.0.11/24
[root@o1:~]# nmcli connection down ens224
[root@o1:~]# nmcli connection up ens224

Node 2

[root@o2:~]# nmcli connection modify ens224 ipv4.addresses 172.0.0.12/24
[root@o2:~]# nmcli connection down ens224
[root@o2:~]# nmcli connection up ens224

The safer approach is still to edit the value in the interface config file by hand (and if that doesn't take effect, reboot the server after the change).
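
On a RHEL/CentOS 7-style system that means editing the ifcfg file; a sketch (default path and key name, verify against your setup):

# node 1: change the static address in the interface config
vi /etc/sysconfig/network-scripts/ifcfg-ens224    # set IPADDR=172.0.0.11
systemctl restart network                         # or reboot, as noted above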


The private IP at the cluster layer

[root@o1 ~]# /u01/app/19.3.0/grid/bin/oifcfg getif
ens192  192.168.254.128  global  public
ens224  172.0.0.0  global  cluster_interconnect,asm

Note that oifcfg registers the interconnect subnet (172.0.0.0), not host addresses; since the new private IPs stay inside the same /24, the registration itself needs no change.


## verify on all nodes
[root@hisdb01:~]# /u01/app/19.3.0/grid/bin/oifcfg getif
[root@hisdb02:~]# /u01/app/19.3.0/grid/bin/oifcfg getif
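
If the interconnect subnet itself had changed, the GPnP-registered interface would also need updating; for completeness, a hedged sketch of the oifcfg syntax (the new subnet below is hypothetical):

# register the new interconnect subnet, then drop the old registration
/u01/app/19.3.0/grid/bin/oifcfg setif -global ens224/172.0.1.0:cluster_interconnect,asm
/u01/app/19.3.0/grid/bin/oifcfg delif -global ens224/172.0.0.0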


Finally, restart the database, the cluster, and the hosts to verify that the change holds.

[grid@o1:~]$ srvctl stop database -d o1
[grid@o1:~]$ srvctl stop listener

## run as root on all nodes
[root@o1:~]# /u01/app/19.3.0/grid/bin/crsctl stop crs
[root@o1:~]# shutdown -r now
[root@o2:~]# /u01/app/19.3.0/grid/bin/crsctl stop crs
[root@o2:~]# shutdown -r now


Start up

[grid@o1:~]$ srvctl start database -d o
[grid@o1:~]$ srvctl start listener
