Kingbase RWC Dual-Primary Failure Handling and Cluster Shrink in Practice

Original article by virvle, 2025-03-22

Environment

OS          Spec              Data disk   System disk   IP              Node
CentOS 7.8  4 cores / 16 GB   100 GB      > 8 GB        10.10.100.236   node1
CentOS 7.8  4 cores / 16 GB   100 GB      > 8 GB        10.10.100.235   node2

1. The dual-primary failure

After the RWC primary and standby had been shut down, the cluster was mistakenly started from the standby node, leaving the standby in an abnormal state.
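One way to reduce the chance of this kind of mistake is to check, on each node, what state the local instance was last in before starting anything; sys_controldata reads the control file directly, so it works while the database is stopped. A minimal sketch using the paths from this article:

# Run on each node before starting the cluster. A node that was cleanly stopped
# while acting as a standby normally reports "shut down in recovery", whereas a
# former primary reports "shut down".
cd /yk/data/Kingbase/cluster/kingbase/bin
./sys_controldata -D /yk/data/Kingbase/V9/data | \
    grep -E "Database cluster state|WalTimeLineID"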

1.1 The startup error

[kingbase@dba236 bin]$ ./sys_monitor.sh start
2025-03-20 21:13:20 Ready to start all DB ...
2025-03-20 21:13:20 begin to start DB on "[10.10.100.236]".
waiting for server to start.... done
server started
2025-03-20 21:13:21 execute to start DB on "[10.10.100.236]" success, connect to check it.
2025-03-20 21:13:22 DB on "[10.10.100.236]" start success.
2025-03-20 21:13:22 Try to ping trusted_servers on host 10.10.100.236 ...
2025-03-20 21:13:25 Try to ping trusted_servers on host 10.10.100.235 ...
2025-03-20 21:13:28 begin to start DB on "[10.10.100.235]".
waiting for server to start.... done
server started
2025-03-20 21:13:29 execute to start DB on "[10.10.100.235]" success, connect to check it.
2025-03-20 21:13:30 DB on "[10.10.100.235]" start success.
[WARNING] following issues were detected
  - node "node2" (ID: 2) is registered as standby but running as primary
 ID | Name  | Role    | Status               | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+----------------------+----------+----------+----------+----------+---------+-------------------
 1  | node1 | primary | * running            |          | default  | 100      | 1        |         | host=10.10.100.236 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
 2  | node2 | standby | ! running as primary |          | default  | 100      | 2        | 0 bytes | host=10.10.100.235 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
2025-03-20 21:13:30 There are more than one primary DBs([2] DBs are running), will stop DB and cron and exit.
2025-03-20 21:13:35 begin to stop DB on "[10.10.100.236]".
waiting for server to shut down.... done
server stopped
2025-03-20 21:13:36 DB on "[10.10.100.236]" stop success.
2025-03-20 21:13:40 begin to stop DB on "[10.10.100.235]".
waiting for server to shut down.... done
server stopped
2025-03-20 21:13:41 DB on "[10.10.100.235]" stop success.

1.2 Check the cluster status

./repmgr cluster show

1.3 Compare the control files to find the data divergence between the nodes

On node2:

[kingbase@ocdb235 bin]$ ./sys_controldata -D /yk/data/Kingbase/V9/data
sys_control version number:            1201
Catalog version number:                202404121
Database system identifier:            7472217260596738986
Database cluster state:                shut down
sys_control last modified:             Thu 20 Mar 2025 09:44:46 PM CST
Latest checkpoint location:            0/F000028
Latest checkpoint's REDO location:     0/F000028
Latest checkpoint's REDO WAL file:     00000002000000000000000F
Latest checkpoint's WalTimeLineID:     2
Latest checkpoint's PrevTimeLineID:    2
Latest checkpoint's full_page_writes:  on
Latest checkpoint's NextXID:           0:1136
Latest checkpoint's NextOID:           16467
Latest checkpoint's NextMultiXactId:   1
Latest checkpoint's NextMultiOffset:   0
Latest checkpoint's oldestXID:         1078
Latest checkpoint's oldestXID's DB:    1
Latest checkpoint's oldestActiveXID:   0
Latest checkpoint's oldestMultiXid:    1
Latest checkpoint's oldestMulti's DB:  1
Latest checkpoint's oldestCommitTsXid: 0
Latest checkpoint's newestCommitTsXid: 0
Time of latest checkpoint:             Thu 20 Mar 2025 09:44:46 PM CST
Fake LSN counter for unlogged rels:    0/3E8
Minimum recovery ending location:      0/0
Min recovery ending loc's timeline:    0
Backup start location:                 0/0
Backup end location:                   0/0
End-of-backup record required:         no
wal_level setting:                     replica
wal_log_hints setting:                 on
max_connections setting:               100
max_worker_processes setting:          8
max_wal_senders setting:               32
max_prepared_xacts setting:            100
max_locks_per_xact setting:            64
track_commit_timestamp setting:        off
Maximum data alignment:                8
Database block size:                   8192
Blocks per segment of large relation:  131072
WAL block size:                        8192
Bytes per WAL segment:                 16777216
Maximum length of identifiers:         64
Maximum columns in an index:           32
Maximum size of a TOAST chunk:         1988
Size of a large-object chunk:          2048
Date/time type storage:                64-bit integers
Float4 argument passing:               by value
Float8 argument passing:               by value
Data page checksum version:            1
Data page checksum device:             0
Mock authentication nonce:             4e5145da4a12f0a0039d7fcfb32e5a8cc7e7c37c16ff44a43baf8c295105b96f
database mode:                         1
auth method mode:                      0

On node1:

[kingbase@dba236 bin]$ ./sys_controldata -D /yk/data/Kingbase/V9/data
sys_control version number:            1201
Catalog version number:                202404121
Database system identifier:            7472217260596738986
Database cluster state:                shut down
sys_control last modified:             Thu 20 Mar 2025 09:44:41 PM CST
Latest checkpoint location:            0/F000028
Latest checkpoint's REDO location:     0/F000028
Latest checkpoint's REDO WAL file:     00000001000000000000000F
Latest checkpoint's WalTimeLineID:     1
Latest checkpoint's PrevTimeLineID:    1
Latest checkpoint's full_page_writes:  on
Latest checkpoint's NextXID:           0:1133
Latest checkpoint's NextOID:           16467
Latest checkpoint's NextMultiXactId:   1
Latest checkpoint's NextMultiOffset:   0
Latest checkpoint's oldestXID:         1078
Latest checkpoint's oldestXID's DB:    1
Latest checkpoint's oldestActiveXID:   0
Latest checkpoint's oldestMultiXid:    1
Latest checkpoint's oldestMulti's DB:  1
Latest checkpoint's oldestCommitTsXid: 0
Latest checkpoint's newestCommitTsXid: 0
Time of latest checkpoint:             Thu 20 Mar 2025 09:44:41 PM CST
Fake LSN counter for unlogged rels:    0/3E8
Minimum recovery ending location:      0/0
Min recovery ending loc's timeline:    0
Backup start location:                 0/0
Backup end location:                   0/0
End-of-backup record required:         no
wal_level setting:                     replica
wal_log_hints setting:                 on
max_connections setting:               100
max_worker_processes setting:          8
max_wal_senders setting:               32
max_prepared_xacts setting:            100
max_locks_per_xact setting:            64
track_commit_timestamp setting:        off
Maximum data alignment:                8
Database block size:                   8192
Blocks per segment of large relation:  131072
WAL block size:                        8192
Bytes per WAL segment:                 16777216
Maximum length of identifiers:         64
Maximum columns in an index:           32
Maximum size of a TOAST chunk:         1988
Size of a large-object chunk:          2048
Date/time type storage:                64-bit integers
Float4 argument passing:               by value
Float8 argument passing:               by value
Data page checksum version:            1
Data page checksum device:             0
Mock authentication nonce:             4e5145da4a12f0a0039d7fcfb32e5a8cc7e7c37c16ff44a43baf8c295105b96f
database mode:                         1
auth method mode:                      0

Comparing the two outputs, node2's Latest checkpoint's WalTimeLineID is 2, higher than node1's 1, and node2's NextXID (1136) is also higher than node1's (1133). This shows that node2's database is more up to date, so node2 was chosen as the new primary.
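For a quick side-by-side comparison there is no need to read the whole control file; a small sketch that pulls out only the fields used in the reasoning above (run in the bin directory on each node):

# Extract only the fields that matter when deciding which node is further ahead.
./sys_controldata -D /yk/data/Kingbase/V9/data | \
    grep -E "Database cluster state|Latest checkpoint location|WalTimeLineID|NextXID"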

2. Repairing the dual-primary cluster

2.1 Start the database

[kingbase@ocdb235 bin]$ ./sys_ctl -D /yk/data/Kingbase/V9/data
sys_ctl: no operation specified
Try "sys_ctl --help" for more information.
[kingbase@ocdb235 bin]$ ./sys_ctl -D /yk/data/Kingbase/V9/data restart
sys_ctl: PID file "/yk/data/Kingbase/V9/data/kingbase.pid" does not exist
Is server running?
trying to start server anyway
waiting for server to start....2025-03-20 22:06:27.467 CST [26257] LOG:  config the real archive_command string as soon as possible to archive WAL files
2025-03-20 22:06:27.475 CST [26257] LOG:  sepapower extension initialized
2025-03-20 22:06:27.481 CST [26257] LOG:  starting KingbaseES V009R001C002B0014 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28), 64-bit
2025-03-20 22:06:27.482 CST [26257] LOG:  listening on IPv4 address "0.0.0.0", port 54321
2025-03-20 22:06:27.482 CST [26257] LOG:  listening on IPv6 address "::", port 54321
2025-03-20 22:06:27.487 CST [26257] LOG:  listening on Unix socket "/tmp/.s.KINGBASE.54321"
2025-03-20 22:06:27.562 CST [26257] LOG:  redirecting log output to logging collector process
2025-03-20 22:06:27.562 CST [26257] HINT:  Future log output will appear in directory "sys_log".
 done
server started

2.2 Check the status again

./repmgr cluster show

2.3 Rejoin the diverged node to the cluster

[kingbase@ocdb235 bin]$ ./repmgr node rejoin -h 10.10.100.236 -U esrep -d esrep
[NOTICE] rejoin target is node "node1" (ID: 1)
[NOTICE] sys_rewind execution required for this node to attach to rejoin target node 1
[HINT] provide --force-rewind
[kingbase@ocdb235 bin]$ ./repmgr node rejoin -h 10.10.100.236 -U esrep -d esrep --force-rewind
[NOTICE] rejoin target is node "node1" (ID: 1)
[NOTICE] executing sys_rewind
[DETAIL] sys_rewind command is "/yk/data/Kingbase/cluster/kingbase/bin/sys_rewind -D '/yk/data/Kingbase/V9/data' --source-server='host=10.10.100.236 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000' -X stream"
sys_rewind: servers diverged at WAL location 0/A0000A0 on timeline 1
sys_rewind: rewinding from last common checkpoint at 0/A000028(redo lsn at 0/A000028) on timeline 1
sys_rewind: find last common checkpoint start time from 2025-03-20 22:12:26.479567 CST to 2025-03-20 22:12:26.584334 CST, in "0.104767" seconds.
sys_rewind: collect the number of WAL files to be processed:8, start time from 2025-03-20 22:12:26.584523 CST to 2025-03-20 22:12:26.625223 CST, cost "0.040700" seconds.
sys_rewind: read WAL in target server from 0/A000028 to 0/110000A0 (endpoint 0/110000A0)
sys_rewind: read the local Wal file information, start time from 2025-03-20 22:12:26.584523 CST to 2025-03-20 22:12:26.626283 CST, cost "0.001030" seconds.
sys_rewind: waiting for background process to finish streaming ...
sys_rewind: file replication start time from 2025-03-20 22:12:26.626313 CST to 2025-03-20 22:12:28.761475 CST, cost "2.135162" seconds.
sys_rewind: update the control file: minRecoveryPoint is '0/1100E900', minRecoveryPointTLI is '1', and database state is 'in archive recovery'
sys_rewind: rewind start wal location 0/A000028 (file 00000001000000000000000A), end wal location 0/1100E900 (file 000000010000000000000011). wal data increment:114746(kB). time from 2025-03-20 22:12:26.626313 CST to 2025-03-20 22:12:29.243817 CST, in "2.764250" seconds.
sys_rewind: Done!
[NOTICE] 0 files copied to /yk/data/Kingbase/V9/data
[INFO] creating replication slot as user "esrep"
[NOTICE] setting node 2's upstream to node 1
[WARNING] unable to ping "host=10.10.100.235 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000"
[DETAIL] KCIping() returned "KCIPING_NO_RESPONSE"
[NOTICE] begin to start server at 2025-03-20 22:12:29.256344
[NOTICE] starting server using "/yk/data/Kingbase/cluster/kingbase/bin/sys_ctl -w -t 90 -D '/yk/data/Kingbase/V9/data' -l /yk/data/Kingbase/cluster/kingbase/bin/logfile start"
[NOTICE] start server finish at 2025-03-20 22:12:29.469731
[NOTICE] NODE REJOIN successful
[DETAIL] node 2 is now attached to node 1
[kingbase@ocdb235 bin]$ repmgr service status
 ID | Name  | Role    | Status    | Upstream | repmgrd     | PID | Paused? | Upstream last seen
----+-------+---------+-----------+----------+-------------+-----+---------+--------------------
 1  | node1 | primary | * running |          | not running | n/a | n/a     | n/a
 2  | node2 | standby |   running | node1    | not running | n/a | n/a     | n/a
[kingbase@ocdb235 bin]$
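To summarise the rejoin shown above: it is run on the node that has diverged (here node2, with its database shut down), with -h pointing at the node that will remain the upstream; --force-rewind lets repmgr call sys_rewind to roll the local data directory back to the last common checkpoint before re-attaching. A condensed sketch of the same sequence, using only the commands that appear in this walkthrough:

# On the diverged node (its database must be stopped before the rejoin).
cd /yk/data/Kingbase/cluster/kingbase/bin

# Rewind the local data directory and re-attach to the surviving primary.
./repmgr node rejoin -h 10.10.100.236 -U esrep -d esrep --force-rewind

# Confirm the node is back as a standby and check whether repmgrd is running.
./repmgr cluster show
./repmgr service status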

The status output shows that repmgrd is not running, so restart the cluster:

./sys_monitor.sh restart

Check the status again: the cluster is back to normal.

3. Cluster shrink: removing the standby node

A node can be removed by first unregistering the standby node and then stopping the standby database, which deletes the standby from the cluster. Finally, query the replication slots on the primary and drop any slot left behind by the removed standby.
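A hedged manual sketch of those three steps, using the same tools that cluster_install.sh drives in the next section. The node ID, paths and the esrep maintenance database are the ones used throughout this article, the slot name follows the repmgr_slot_<node_id> convention visible in the logs, and the ksql command-line client is assumed to be available:

# 1. On the standby being removed: unregister it from the cluster
#    (repmgr connects to the primary to remove the node's metadata).
/yk/data/Kingbase/cluster/kingbase/bin/repmgr standby unregister --node-id=2

# 2. On the standby being removed: stop its database.
/yk/data/Kingbase/cluster/kingbase/bin/sys_ctl -D /yk/data/Kingbase/V9/data stop

# 3. On the primary: list and drop any replication slot left by the removed standby.
ksql -U esrep -d esrep -p 54321 \
     -c "SELECT slot_name, active FROM pg_replication_slots;" \
     -c "SELECT pg_drop_replication_slot('repmgr_slot_2');"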

3.1 Edit the configuration file install.conf (in the bin directory under the installation directory)

Modify the following settings in the [shrink] section:

[shrink]
shrink_type="0"             # The node type of standby/witness node, which would be delete from cluster. 0:standby 1:witness
primary_ip="10.10.100.236"  # The ip addr of cluster primary node, which need to shrink a standby/witness node.
shrink_ip="10.10.100.235"   # The ip addr of standby/witness node, which would be delete from cluster.
node_id="2"                 # The node_id of standby/witness node, which would be delete from cluster.
                            # It does not the same with any one in cluster node
                            # for example: node_id="3"
## Specific instructions ,see it under [install]
install_dir="/yk/data/Kingbase/cluster"
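Before running the shrink, it is worth double-checking that these values match what the cluster actually reports; a small sketch using the paths from this article:

cd /yk/data/Kingbase/cluster/kingbase/bin

# Node IDs, roles and IPs to copy into the [shrink] section.
./repmgr cluster show

# Confirm the settings the shrink will use.
grep -E "^(shrink_type|primary_ip|shrink_ip|node_id|install_dir)=" install.conf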

3.2 Run the shrink

./cluster_install.sh shrink
[CONFIG_CHECK] will deploy the cluster of
[RUNNING] success connect to the target "10.10.100.235" ..... OK
[RUNNING] success connect to "10.10.100.235" from current node by 'ssh' ..... OK
[RUNNING] success connect to the target "10.10.100.236" ..... OK
[RUNNING] success connect to "10.10.100.236" from current node by 'ssh' ..... OK
[RUNNING] Primary node ip is 10.10.100.236 ...
[RUNNING] Primary node ip is 10.10.100.236 ... OK
[CONFIG_CHECK] set install_with_root=1
[RUNNING] success connect to "" from current node by 'ssh' ..... OK
[RUNNING] success connect to the target "10.10.100.236" ..... OK
[RUNNING] success connect to "10.10.100.236" from current node by 'ssh' ..... OK
[INSTALL] load config from cluster.....
[INFO] db_user=system
[INFO] db_port=54321
[INFO] use_scmd=1
[INFO] auto_cluster_recovery_level=1
[INFO] synchronous=quorum
[INSTALL] load config from cluster.....OK
[CONFIG_CHECK] check database connection ...
[CONFIG_CHECK] check database connection ... OK
[CONFIG_CHECK] shrink_ip[10.10.100.235] is a standby node IP in the cluster ...
[CONFIG_CHECK] shrink_ip[10.10.100.235] is a standby node IP in the cluster ...ok
[CONFIG_CHECK] The localhost is shrink_ip:[10.10.100.235] or primary_ip:[10.10.100.236]...
[CONFIG_CHECK] The localhost is shrink_ip:[10.10.100.235] or primary_ip:[10.10.100.236]...ok
[RUNNING] Primary node ip is 10.10.100.236 ...
[RUNNING] Primary node ip is 10.10.100.236 ... OK
[CONFIG_CHECK] check node_id is in cluster ...
[CONFIG_CHECK] check node_id is in cluster ...OK
[RUNNING] The /yk/data/Kingbase/cluster/kingbase/bin dir exist on "10.10.100.235" ...
[RUNNING] The /yk/data/Kingbase/cluster/kingbase/bin dir exist on "10.10.100.235" ... OK
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        |         | host=10.10.100.236 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
 2  | node2 | standby |   running | node1    | default  | 100      | 1        | 0 bytes | host=10.10.100.235 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
[RUNNING] Del node is standby ...
[INFO] node:10.10.100.235 can be deleted ... OK
[RUNNING] query archive command at 10.10.100.236 ...
[RUNNING] current cluster not config sys_rman,return.
[Thu Mar 20 23:13:38 CST 2025] [INFO] /yk/data/Kingbase/cluster/kingbase/bin/repmgr standby unregister --node-id=2 ...
[INFO] connecting to local standby
[INFO] connecting to primary database
[NOTICE] unregistering node 2
[INFO] SET synchronous TO "async" on primary host
[INFO] change synchronous_standby_names from "ANY 1( node2)" to ""
[INFO] try to drop slot "repmgr_slot_2" of node 2 on primary node
[WARNING] replication slot "repmgr_slot_2" is still active on node 2
[INFO] standby unregistration complete
[Thu Mar 20 23:13:39 CST 2025] [INFO] /yk/data/Kingbase/cluster/kingbase/bin/repmgr standby unregister --node-id=2 ...OK
[Thu Mar 20 23:13:39 CST 2025] [INFO] check db connection ...
[Thu Mar 20 23:13:40 CST 2025] [INFO] check db connection ...ok
2025-03-20 23:13:40 Ready to stop local kbha daemon and repmgrd daemon ...
2025-03-20 23:13:45 begin to stop repmgrd on "[localhost]".
2025-03-20 23:13:46 repmgrd on "[localhost]" stop success.
2025-03-20 23:13:46 Done.
2025-03-20 23:13:46 begin to stop DB on "[localhost]".
waiting for server to shut down.... done
server stopped
2025-03-20 23:13:47 DB on "[localhost]" stop success.
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        |         | host=10.10.100.236 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
[Thu Mar 20 23:13:47 CST 2025] [INFO] drop replication slot:repmgr_slot_2...
 pg_drop_replication_slot
--------------------------

(1 row)

[Thu Mar 20 23:13:47 CST 2025] [INFO] drop replication slot:repmgr_slot_2...OK
[Thu Mar 20 23:13:47 CST 2025] [INFO] modify synchronous parameter configuration...
[Thu Mar 20 23:13:48 CST 2025] [INFO] modify synchronous parameter configuration...ok
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        |         | host=10.10.100.236 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000

3.3 Check the cluster status

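One way to confirm the shrink completed cleanly is to check both the cluster membership and the replication slots left on the remaining primary; a minimal sketch, assuming the ksql client and the esrep maintenance database used throughout this article:

cd /yk/data/Kingbase/cluster/kingbase/bin

# Only node1 should remain in the cluster.
./repmgr cluster show

# No slot for the removed node (repmgr_slot_2) should be left behind.
ksql -U esrep -d esrep -p 54321 \
     -c "SELECT slot_name, active FROM pg_replication_slots;"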

4. Pitfalls encountered along the way

4.1 A primary node that still has a standby cannot be removed by the shrink

[kingbase@ocdb235 bin]$ cluster_install.sh shrink
[CONFIG_CHECK] will deploy the cluster of
[RUNNING] success connect to the target "10.10.100.236" ..... OK
[RUNNING] success connect to "10.10.100.236" from current node by 'ssh' ..... OK
[RUNNING] success connect to the target "10.10.100.235" ..... OK
[RUNNING] success connect to "10.10.100.235" from current node by 'ssh' ..... OK
[RUNNING] Primary node ip is 10.10.100.235 ...
[RUNNING] Primary node ip is not 10.10.100.235, check fail!

4.2 install_dir is not set in install.conf

./cluster_install.sh shrink
[CONFIG_CHECK] will deploy the cluster of
[ERROR] param [install_dir] is not set in config file "/yk/data/Kingbase/cluster/kingbase/bin/install.conf" or in myself shell sc

4.3 install_dir in install.conf is set incorrectly; it must be the installation directory that contains repmgr

./cluster_install.sh shrink
[CONFIG_CHECK] will deploy the cluster of
[RUNNING] success connect to the target "10.10.100.235" ..... OK
[RUNNING] success connect to "10.10.100.235" from current node by 'ssh' ..... OK
[RUNNING] success connect to the target "10.10.100.236" ..... OK
[RUNNING] success connect to "10.10.100.236" from current node by 'ssh' ..... OK
[RUNNING] Primary node ip is 10.10.100.236 ...
bash: /yk/data/Kingbase/kingbase/bin/repmgr: No such file or directory
[RUNNING] Primary node ip is not 10.10.100.236, check fail!
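The error above comes from install_dir pointing at the wrong level of the directory tree: judging from the path in the message, the script looks for repmgr under <install_dir>/kingbase/bin. A quick check before re-running the shrink:

# install_dir must be the directory that contains kingbase/bin/repmgr.
install_dir="/yk/data/Kingbase/cluster"   # value taken from install.conf

ls -l "${install_dir}/kingbase/bin/repmgr" \
    && echo "install_dir looks correct" \
    || echo "install_dir is wrong: ${install_dir}/kingbase/bin/repmgr not found"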