
19c RAC IPv6 environment: cluster stop failure (subnet prefix length greater than 64)

Original article by 袁长刚, 2020-02-28
  • On a newly installed 19c RAC IPv6 environment, the OCR disk group had just been replaced; when I tried to restart the cluster, it simply would not stop. The error output:
[root@xydb5node1 bin]# ./crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'xydb5node1'
CRS-2673: Attempting to stop 'ora.crsd' on 'xydb5node1'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on server 'xydb5node1'
CRS-2679: Attempting to clean 'ora.xydb5node1.vip' on 'xydb5node1'
CRS-2673: Attempting to stop 'ora.LISTENER_SCAN1.lsnr' on 'xydb5node1'
CRS-33673: Attempting to stop resource group 'ora.asmgroup' on server 'xydb5node1'
CRS-2673: Attempting to stop 'ora.CRSDG.dg' on 'xydb5node1'
CRS-2673: Attempting to stop 'ora.DATADG1.dg' on 'xydb5node1'
CRS-2673: Attempting to stop 'ora.FRADG.dg' on 'xydb5node1'
CRS-2673: Attempting to stop 'ora.OCRDG.dg' on 'xydb5node1'
CRS-2681: Clean of 'ora.xydb5node1.vip' on 'xydb5node1' succeeded
CRS-2677: Stop of 'ora.DATADG1.dg' on 'xydb5node1' succeeded
CRS-2677: Stop of 'ora.CRSDG.dg' on 'xydb5node1' succeeded
CRS-2677: Stop of 'ora.FRADG.dg' on 'xydb5node1' succeeded
CRS-2677: Stop of 'ora.LISTENER_SCAN1.lsnr' on 'xydb5node1' succeeded
CRS-2673: Attempting to stop 'ora.scan1.vip' on 'xydb5node1'
CRS-2677: Stop of 'ora.OCRDG.dg' on 'xydb5node1' succeeded
CRS-2673: Attempting to stop 'ora.asm' on 'xydb5node1'
CRS-2677: Stop of 'ora.asm' on 'xydb5node1' succeeded
CRS-2673: Attempting to stop 'ora.ASMNET1LSNR_ASM.lsnr' on 'xydb5node1'
CRS-2677: Stop of 'ora.ASMNET1LSNR_ASM.lsnr' on 'xydb5node1' succeeded
CRS-2673: Attempting to stop 'ora.asmnet1.asmnetwork' on 'xydb5node1'
CRS-2677: Stop of 'ora.asmnet1.asmnetwork' on 'xydb5node1' succeeded
CRS-33677: Stop of resource group 'ora.asmgroup' on server 'xydb5node1' succeeded.
Action for VIP aborted
CRS-2675: Stop of 'ora.scan1.vip' on 'xydb5node1' failed
CRS-2679: Attempting to clean 'ora.scan1.vip' on 'xydb5node1'
CRS-2678: 'ora.scan1.vip' on 'xydb5node1' has experienced an unrecoverable failure
CRS-0267: Human intervention required to resume its availability.
CRS-2672: Attempting to start 'ora.xydb5node1.vip' on 'xydb5node2'
CRS-5005: IP Address: 2409:8760:1282:0001:0f11:0000:0000:0044 is already in use in the network
CRS-2674: Start of 'ora.xydb5node1.vip' on 'xydb5node2' failed
CRS-2799: Failed to shut down resource 'ora.scan1.vip' on 'xydb5node1'
CRS-2794: Shutdown of Cluster Ready Services-managed resources on 'xydb5node1' has failed
CRS-2675: Stop of 'ora.crsd' on 'xydb5node1' failed
CRS-2799: Failed to shut down resource 'ora.crsd' on 'xydb5node1'
CRS-2795: Shutdown of Oracle High Availability Services-managed resources on 'xydb5node1' has failed
CRS-4687: Shutdown command has completed with errors.
CRS-4000: Command Stop failed, or completed with errors.


Checking the CRS alert log; an excerpt of the errors is below:
/u01/app/grid/diag/crs/xydb5node1/crs/trace/alert.log

2020-02-27 18:21:12.138 [CRSD(286254)]CRS-2758: Resource 'ora.scan1.vip' is in an unknown state.
2020-02-27 18:21:12.138 [CRSD(286254)]CRS-2769: Unable to failover resource 'ora.net1.network'.
2020-02-27 18:21:12.404 [ORAROOTAGENT(425250)]CRS-8500: Oracle Clusterware ORAROOTAGENT process is starting with operating system process ID 425250
2020-02-27 18:21:17.816 [OHASD(284843)]CRS-2795: Shutdown of Oracle High Availability Services-managed resources on 'xydb5node1' has failed
2020-02-27 18:21:39.564 [OHASD(284843)]CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'xydb5node1'
2020-02-27 18:22:39.581 [ORAROOTAGENT(425250)]CRS-5818: Aborted command 'stop' for resource 'ora.xydb5node1.vip'. Details at (:CRSAGF00113:) {1:42136:7050} in /u01/app/grid/diag/crs/xydb5node1/crs/trace/crsd_orarootagent_root.trc.
2020-02-27 18:22:39.600 [CRSD(286254)]CRS-2757: Command 'Stop' timed out waiting for response from the resource 'ora.xydb5node1.vip'. Details at (:CRSPE00221:) {1:42136:7050} in /u01/app/grid/diag/crs/xydb5node1/crs/trace/crsd.trc.
2020-02-27 18:23:41.601 [ORAROOTAGENT(425250)]CRS-5818: Aborted command 'clean' for resource 'ora.xydb5node1.vip'. Details at (:CRSAGF00113:) {1:42136:7050} in /u01/app/grid/diag/crs/xydb5node1/crs/trace/crsd_orarootagent_root.trc.
2020-02-27 18:24:41.890 [ORAROOTAGENT(435394)]CRS-8500: Oracle Clusterware ORAROOTAGENT process is starting with operating system process ID 435394
2020-02-27 18:25:42.261 [ORAROOTAGENT(435394)]CRS-5818: Aborted command 'clean' for resource 'ora.xydb5node1.vip'. Details at (:CRSAGF00113:) {0:8:2} in /u01/app/grid/diag/crs/xydb5node1/crs/trace/crsd_orarootagent_root.trc.
2020-02-27 18:26:42.550 [ORAROOTAGENT(436268)]CRS-8500: Oracle Clusterware ORAROOTAGENT process is starting with operating system process ID 436268
2020-02-27 18:27:42.922 [ORAROOTAGENT(436268)]CRS-5818: Aborted command 'clean' for resource 'ora.xydb5node1.vip'. Details at (:CRSAGF00113:) {0:9:2} in /u01/app/grid/diag/crs/xydb5node1/crs/trace/crsd_orarootagent_root.trc.
2020-02-27 18:28:42.949 [OHASD(284843)]CRS-2795: Shutdown of Oracle High Availability Services-managed resources on 'xydb5node1' has failed
2020-02-27 18:28:42.942 [CRSD(286254)]CRS-2758: Resource 'ora.xydb5node1.vip' is in an unknown state.
2020-02-27 18:28:43.209 [ORAROOTAGENT(438555)]CRS-8500: Oracle Clusterware ORAROOTAGENT process is starting with operating system process ID 438555
2020-02-27 18:28:43.739 [ORAAGENT(438579)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 438579
2020-02-27 18:55:14.545 [OHASD(284843)]CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'xydb5node1'

As shown above, the alert log only tells us the shutdown is stuck stopping the cluster VIP resource; nothing else can be pinpointed yet. Digging further into the trace log:
/u01/app/grid/diag/crs/xydb5node1/crs/trace/crsd_orarootagent_root.trc

2020-02-27 19:17:13.902 :CLSDYNAM:1310668544: [ora.xydb5node2.vip]{0:19:2} [clean] Failed to delete 2409:8760:1282:0001:0f11:0000:0000:0045 on bond0
2020-02-27 19:17:13.902 :CLSDYNAM:1310668544: [ora.xydb5node2.vip]{0:19:2} [clean] (null) category: -2, operation: ioctl, loc: SIOCDIFADDR, OS error: 99, other: failed to delete address
2020-02-27 19:17:13.902 :CLSDYNAM:1310668544: [ora.xydb5node2.vip]{0:19:2} [clean] VipActions::stopIpV6 }
2020-02-27 19:17:14.903 :CLSDYNAM:1310668544: [ora.xydb5node2.vip]{0:19:2} [clean] VipActions::stopIpV6 {
2020-02-27 19:17:14.903 :CLSDYNAM:1310668544: [ora.xydb5node2.vip]{0:19:2} [clean] Deleting ipv6 address '2409:8760:1282:0001:0f11:0000:0000:0045', on the interface name 'bond0'
2020-02-27 19:17:14.903 :CLSDYNAM:1310668544: [ora.xydb5node2.vip]{0:19:2} [clean] sclsideladdrsv6 returned 
2020-02-27 19:17:14.903 :CLSDYNAM:1310668544: [ora.xydb5node2.vip]{0:19:2} [clean] Failed to delete 2409:8760:1282:0001:0f11:0000:0000:0045 on bond0
2020-02-27 19:17:14.903 :CLSDYNAM:1310668544: [ora.xydb5node2.vip]{0:19:2} [clean] (null) category: -2, operation: ioctl, loc: SIOCDIFADDR, OS error: 99, other: failed to delete address
2020-02-27 19:17:14.903 :CLSDYNAM:1310668544: [ora.xydb5node2.vip]{0:19:2} [clean] VipActions::stopIpV6 }
2020-02-27 19:17:15.903 :CLSDYNAM:1310668544: [ora.xydb5node2.vip]{0:19:2} [clean] VipActions::stopIpV6 {
2020-02-27 19:17:15.903 :CLSDYNAM:1310668544: [ora.xydb5node2.vip]{0:19:2} [clean] Deleting ipv6 address '2409:8760:1282:0001:0f11:0000:0000:0045', on the interface name 'bond0'
2020-02-27 19:17:15.903 :CLSDYNAM:1310668544: [ora.xydb5node2.vip]{0:19:2} [clean] sclsideladdrsv6 returned 
2020-02-27 19:17:15.903 :CLSDYNAM:1310668544: [ora.xydb5node2.vip]{0:19:2} [clean] Failed to delete 2409:8760:1282:0001:0f11:0000:0000:0045 on bond0
2020-02-27 19:17:15.903 :CLSDYNAM:1310668544: [ora.xydb5node2.vip]{0:19:2} [clean] (null) category: -2, operation: ioctl, loc: SIOCDIFADDR, OS error: 99, other: failed to delete address
2020-02-27 19:17:15.903 :CLSDYNAM:1310668544: [ora.xydb5node2.vip]{0:19:2} [clean] VipActions::stopIpV6 }
2020-02-27 19:17:16.903 :CLSDYNAM:1310668544: [ora.xydb5node2.vip]{0:19:2} [clean] VipActions::stopIpV6 {
2020-02-27 19:17:16.903 :CLSDYNAM:1310668544: [ora.xydb5node2.vip]{0:19:2} [clean] Deleting ipv6 address '2409:8760:1282:0001:0f11:0000:0000:0045', on the interface name 'bond0'
2020-02-27 19:17:16.903 :CLSDYNAM:1310668544: [ora.xydb5node2.vip]{0:19:2} [clean] sclsideladdrsv6 returned 
2020-02-27 19:17:16.903 :CLSDYNAM:1310668544: [ora.xydb5node2.vip]{0:19:2} [clean] Failed to delete 2409:8760:1282:0001:0f11:0000:0000:0045 on bond0
2020-02-27 19:17:16.903 :CLSDYNAM:1310668544: [ora.xydb5node2.vip]{0:19:2} [clean] (null) category: -2, operation: ioctl, loc: SIOCDIFADDR, OS error: 99, other: failed to delete address
2020-02-27 19:17:16.903 :CLSDYNAM:1310668544: [ora.xydb5node2.vip]{0:19:2} [clean] VipActions::stopIpV6 }
2020-02-27 19:17:17.903 :CLSDYNAM:1310668544: [ora.xydb5node2.vip]{0:19:2} [clean] VipActions::stopIpV6 {
2020-02-27 19:17:17.903 :CLSDYNAM:1310668544: [ora.xydb5node2.vip]{0:19:2} [clean] Deleting ipv6 address '2409:8760:1282:0001:0f11:0000:0000:0045', on the interface name 'bond0'
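One concrete datum can be pulled out of this loop of trace lines: the repeated `OS error: 99`. On Linux (an assumption about this host, since errno numbering is platform-specific), errno 99 is `EADDRNOTAVAIL`, which matches the agent repeatedly failing to delete the VIP address via the `SIOCDIFADDR` ioctl:

```shell
# Decode errno 99 from the trace above.
# Assumes a Linux host with python3 available; errno values differ by OS.
python3 -c 'import errno, os; print(errno.errorcode[99], "-", os.strerror(99))'
# → EADDRNOTAVAIL - Cannot assign requested address
```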

A pile of strange errors that made no immediate sense, and a MOS search turned up no document on them. A colleague suggested the cause was a subnet prefix length greater than 64; checking this cluster's public IP showed a /128 prefix:

[root@xydb5node1 bin]# ./oifcfg iflist -n
bond0  192.168.122.0  255.255.255.0
bond1  1.1.4.64  255.255.255.248
bond0  2409:8760:1282:1:f11::42  /128
bond1  fd17:625c:f037:a801::  /64
bond1  fd2a:1a21:628e:1::  /64
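The same check can be made from plain OS data as well; a minimal sketch that flags any IPv6 CIDR whose prefix length is not /64 (the address is inlined from the `oifcfg` output above; on a live node you could feed the list from `ip -6 addr show dev bond0 scope global` instead):

```shell
# Flag IPv6 addresses whose prefix length is not /64.
# On a real node: ip -6 addr show dev bond0 scope global | awk '/inet6/{print $2}'
addrs='2409:8760:1282:1:f11::42/128'
for a in $addrs; do
  len=${a##*/}                       # strip everything up to the last '/'
  if [ "$len" -ne 64 ]; then
    echo "WARNING: $a has prefix /$len (expected /64)"
  fi
done
```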

I modified the NIC configuration directly and restarted the network service:

[grid@xydb5node2 ~]$ oifcfg iflist -n
bond0  192.168.122.0  255.255.255.0
bond1  1.1.4.64  255.255.255.248
bond0  2409:8760:1282:1::  /64
bond1  fd17:625c:f037:a801::  /64
bond1  fd2a:1a21:628e:1::  /64

As shown above, bond0's prefix length is now 64, but stopping CRS still failed, so I had no choice but to reboot the server.

2020-02-28 13:41:23.331 [OHASD(54165)]CRS-8500: Oracle Clusterware OHASD process is starting with operating system process ID 54165
2020-02-28 13:41:23.419 [OHASD(54165)]CRS-0714: Oracle Clusterware Release 19.0.0.0.0.
2020-02-28 13:41:23.432 [OHASD(54165)]CRS-2112: The OLR service started on node xydb6node2.
2020-02-28 13:41:23.457 [OHASD(60420)]CRS-8500: Oracle Clusterware OHASD process is starting with operating system process ID 60420
2020-02-28 13:41:23.545 [OHASD(60420)]CRS-0714: Oracle Clusterware Release 19.0.0.0.0.
2020-02-28 13:41:23.697 [OHASD(54165)]CRS-1301: Oracle High Availability Service started on node xydb6node2.
2020-02-28 13:41:23.697 [OHASD(54165)]CRS-8017: location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
2020-02-28 13:41:23.896 [OHASD(54336)]CRS-8500: Oracle Clusterware OHASD process is starting with operating system process ID 54336
2020-02-28 13:41:24.003 [OHASD(54336)]CRS-2112: The OLR service started on node xydb6node2.
2020-02-28 13:41:24.018 [OHASD(54336)]CRS-1301: Oracle High Availability Service started on node xydb6node2.
2020-02-28 13:41:24.018 [OHASD(54336)]CRS-8017: location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
2020-02-28 13:41:24.026 [OHASD(60420)]CRS-1301: Oracle High Availability Service started on node xydb6node2.
2020-02-28 13:41:24.026 [OHASD(60420)]CRS-8017: location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
2020-02-28 13:41:24.025 [OHASD(62185)]CRS-8500: Oracle Clusterware OHASD process is starting with operating system process ID 62185
2020-02-28 13:41:24.132 [OHASD(62185)]CRS-0714: Oracle Clusterware Release 19.0.0.0.0.
2020-02-28 13:41:24.142 [OHASD(62185)]CRS-2112: The OLR service started on node xydb6node2.
2020-02-28 13:41:24.632 [OHASD(62185)]CRS-1301: Oracle High Availability Service started on node xydb6node2.
2020-02-28 13:41:24.632 [OHASD(62185)]CRS-8017: location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
2020-02-28 13:41:24.812 [CSSDAGENT(62356)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 62356
2020-02-28 13:41:24.835 [CSSDMONITOR(62358)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 62358
2020-02-28 13:41:24.880 [ORAROOTAGENT(62337)]CRS-8500: Oracle Clusterware ORAROOTAGENT process is starting with operating system process ID 62337
2020-02-28 13:41:24.910 [ORAAGENT(62347)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 62347
2020-02-28 13:41:25.499 [OHASD(62185)]CRS-6015: Oracle Clusterware has experienced an internal error. Details at (:CLSGEN00100:) {0:0:2} in /u01/app/grid/diag/crs/xydb6node2/crs/trace/ohasd.trc.
2020-02-28T13:41:25.515580+08:00
Errors in file /u01/app/grid/diag/crs/xydb6node2/crs/trace/ohasd.trc  (incident=1):
CRS-6015 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/grid/diag/crs/xydb6node2/crs/incident/incdir_1/ohasd_i1.trc

2020-02-28 13:41:25.531 [OHASD(62185)]CRS-8505: Oracle Clusterware OHASD process with operating system process ID 62185 encountered internal error CRS-06015

Continuing with the trace log:

2763925248: [CLSDIMT] 2020-02-28 13:41:25.533 :GIPCHGEN:3446142720:  gipchaInternalGroupDestroy: Destroyed hagroup 0x7fd7c4021490 [00000000000092b0] { gipchaGroup : numDead 0, numEndp 0, numZombi+
2763925248: [CLSDIMT] 2020-02-28 13:41:25.533 :GIPCHGEN:3446142720:  gipchaGroupFree: destroying ha group 0x7fd7c4021490 [00000000000092b0] { gipchaGroup : numDead 0, numEndp 0, numZombie 0, numP+
2763925248: [CLSDIMT] 2020-02-28 13:41:25.533 :GIPCGEN:3446142720:  gipcEndpointFree: destroying the endp 0x7fd7c4021a70 endpId 00000000000092b4
2763925248: [CLSDIMT] 2020-02-28 13:41:25.533 :GIPCGEN:3446142720:  gipcEndpointCheckFlush: GIPC_FLAG_CLOSE_IMMEDIATE set for endp 0x7fd7c4021a70
2763925248: [CLSDIMT] 2020-02-28 13:41:25.534 :CLSCEVT:3446142720: (:CLSCE0099:)clsce_publish_internal 0x55bc53750280 destroying connection (nil)
2763925248: [CLSDIMT] 2020-02-28 13:41:25.534 :CLSCEVT:3446142720: mx 0x55bc53750280 release {
2763925248: [CLSDIMT] 2020-02-28 13:41:25.534 :CLSCEVT:3446142720: mx 0x55bc53750280 release }
2763925248: [CLSDIMT] 2020-02-28 13:41:25.534 :CLSCEVT:3446142720: clsce_publish_internal 0x55bc53750280 }
2763925248: [CLSDIMT] 2020-02-28 13:41:25.534 :CLSCAL:3446142720: (:CLSCAL0811:)clscal_repository_write_publish_evt_new: clsce_publish() failed, ret [4], err [CRS-10203: (:CLSCE0047:)  Could no+
2763925248: [CLSDIMT] 2020-02-28 13:41:25.534 :CLSCEVT:3446142720: clsce_event_serialize {
2763925248: [CLSDIMT] 2020-02-28 13:41:25.534 :CLSCEVT:3446142720: clsce_event_serialize }
2763925248: [CLSDIMT] 2020-02-28 13:41:25.534 :CLSCEVT:3446142720: clsce_event_destroy {
2763925248: [CLSDIMT] 2020-02-28 13:41:25.534 :CLSCEVT:3446142720: (:CLSCE0056:)clsce_event_destroy event 0x440d86a0 destroyed
2763925248: [CLSDIMT] 2020-02-28 13:41:25.534 :CLSCEVT:3446142720: clsce_event_destroy }
2020-02-28 13:41:25.716 :CLSDIMT:2763925248: Wraps: [16] Size: [10005,129]
2020-02-28 13:41:25.716 :CLSDIMT:2763925248: ===> CLSD In-memory buffer ends
----- END DDE Action: 'clsdAdrActions' (SUCCESS, 19 csec) -----
[TOC00018-END]
----- END DDE Actions Dump (total 20 csec) -----
[TOC00004-END]
End of Incident Dump
[TOC00002-END]
TOC00000 - Table of contents
TOC00001 - Error Stack
TOC00002 - Dump for incident 1 (CRS 6015)
| TOC00003 - START Event Driven Actions Dump
| TOC00004 - START DDE Actions Dump
| | TOC00005 - START DDE Action: 'dumpFrameContext' (Sync)
| | | TOC00006 - START Frame Context DUMP
| | TOC00007 - START DDE Action: 'dumpDiagCtx' (Sync)
| | | TOC00008 - Diag Context Dump
| | TOC00009 - START DDE Action: 'dumpBuckets' (Sync)
| | | TOC00010 - Trace Bucket Dump Begin: CLSD_SHARED_BUCKET
| | TOC00011 - START DDE Action: 'dumpGeneralConfiguration' (Sync)
| | | TOC00012 - General Configuration
| | TOC00013 - START DDE Action: 'xdb_dump_buckets' (Sync)
| | TOC00014 - START DDE Action: 'dumpKGERing' (Sync)
| | TOC00015 - START DDE Action: 'dumpKGEIEParms' (Sync)
| | TOC00016 - START DDE Action: 'dumpKGEState' (Sync)
| | TOC00017 - START DDE Action: 'kpuActionDefault' (Sync)
| | TOC00018 - START DDE Action: 'clsdAdrActions' (Sync)


Still no clue what the problem was. CRS would not start normally, and MOS again had no answer; the only interim option was to change the prefix length to 64 first and then reinstall the database. The original NIC configuration (a /120 prefix here):

IPV6INIT=yes
IPV6_FAILURE_FATAL=no
IPV6ADDR=2409:8760:1282:0001:0F11:0000:0000:0047/120
IPV6_DEFAULTGW=2409:8760:1282:0001:0F11:0000:0000:00FF

Changed to a 64-bit prefix:

IPV6INIT=yes
IPV6_FAILURE_FATAL=no
IPV6ADDR=2409:8760:1282:0001:0F11:0000:0000:0047/64
IPV6_DEFAULTGW=2409:8760:1282:0001:0F11:0000:0000:00FF
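Since the fix is a one-line edit of `IPV6ADDR`, it can be scripted; a hedged sketch, demonstrated on a scratch copy (the real file would be `/etc/sysconfig/network-scripts/ifcfg-bond0` on a RHEL/CentOS 7-style host, which is an assumption about this environment):

```shell
# Demo on a scratch copy; on the real host edit
# /etc/sysconfig/network-scripts/ifcfg-bond0 (RHEL/CentOS 7 layout assumed).
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
IPV6INIT=yes
IPV6ADDR=2409:8760:1282:0001:0F11:0000:0000:0047/128
EOF
# Rewrite whatever prefix length IPV6ADDR carries to /64
sed -i 's|^\(IPV6ADDR=[^/]*\)/[0-9]*$|\1/64|' "$cfg"
grep '^IPV6ADDR' "$cfg"
# → IPV6ADDR=2409:8760:1282:0001:0F11:0000:0000:0047/64
```

On the real host the edit must of course be followed by a network restart (`systemctl restart network`, or bouncing the bond via `nmcli`) before the new prefix takes effect.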

Interim workaround

  1. Change the public IP subnet prefix length to 64.
  2. Use a 64-bit prefix for the private IPs as well.
  3. After reinstalling the GI, everything returned to normal.
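Items 1 and 2 can be verified in one pass over the `oifcfg iflist -n` output; a small sketch with sample lines from this post inlined (on a real node, pipe `$GRID_HOME/bin/oifcfg iflist -n` into the awk instead):

```shell
# Verify that every IPv6 network reported by oifcfg uses a /64 prefix.
# Sample lines inlined; on a node: $GRID_HOME/bin/oifcfg iflist -n | awk ...
oifcfg_out='bond0  2409:8760:1282:1::  /64
bond1  fd17:625c:f037:a801::  /64
bond1  fd2a:1a21:628e:1::  /64'
echo "$oifcfg_out" | awk '
  $3 ~ /^\// && $3 != "/64" { print "BAD:", $0; bad = 1 }
  END { if (!bad) print "OK: all IPv6 prefixes are /64" }'
```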

Summary

I have not yet found an official statement on this 19c RAC IPv6 subnet-prefix-length issue. The workaround found by testing is to change the prefix length to 64; in my tests neither longer nor shorter prefixes work: installation succeeds, but the cluster cannot be shut down, and after force-killing the processes it will not come back up. I will add to this post once a proper fix is found.

Appendix: IPv6 basics

Last modified: 2020-02-28 20:02:38
[Copyright notice] This is original content by a 墨天轮 (modb.pro) user; when reproducing it, credit the source (墨天轮), the article link, and the author.
