MogHA is an enterprise-grade high-availability system built on MogDB's synchronous and asynchronous streaming replication. When a server or a database instance goes down, it performs automatic primary/standby switchover and automatic virtual IP (VIP) failover, cutting database outage time from minutes to seconds and keeping business systems running.
MogHA provides a fairly rich set of parameters for configuring its failover behavior, and these parameters directly affect the RTO. The experiments below look at them one by one.
Whether MogHA steps in after detecting that the primary is down (switching over to a standby or restarting the primary) is controlled by the parameter handle_down_primary. It defaults to True; when it is set to False, MogHA takes no action on a primary outage.
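For reference, this switch lives in MogHA's node.conf (here /home/omm/mogha/node.conf); the relevant entry, with every other mandatory setting omitted, looks like this:
# whether MogHA reacts to a down primary at all (switchover or restart); default True
handle_down_primary=True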
Change handle_down_primary to False, then stop the primary:
[omm@mogdb ~]$ sed -i 's/handle_down_primary=True/handle_down_primary=False/g' /home/omm/mogha/node.conf && sudo systemctl restart mogha
[omm@mogdb ~]$ gs_ctl stop -D /mogdb/data/db1/
[2021-12-30 14:43:30.527][73254][][gs_ctl]: gs_ctl stopped ,datadir is /mogdb/data/db1
waiting for server to shut down...... done
server stopped
Check the cluster status; the primary is abnormal:
[omm@mogdb ~]$ gs_om -t status --detail
[ Cluster State ]
cluster_state : Unavailable
redistributing : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state | node node_ip instance state | node node_ip instance state
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 mogdb 172.16.71.30 6001 /mogdb/data/db1 P Down Manually stopped | 2 mogdb1 172.16.71.31 6002 /mogdb/data/db1 S Standby Need repair(Connecting) | 3 mogdb2 172.16.71.32 6003 /mogdb/data/db1 S Standby Need repair(Disconnected)
The MogHA log shows that, because handle_down_primary is set to False, MogHA detects the primary failure but takes no action:
2021-12-30 14:44:55,193 - heartbeat.loop - INFO [toolkit.py:180]: Ping result:{'172.16.71.32': True, '172.16.71.2': True, '172.16.71.31': True}
2021-12-30 14:44:55,207 - heartbeat.loop - INFO [loop.py:50]: Detect that local instance is down, try to handle it
2021-12-30 14:44:55,213 - heartbeat.loop - INFO [loop.py:55]: Local instance is stopped primary
2021-12-30 14:44:55,214 - heartbeat.loop - ERROR [loop.py:178]: Heartbeat failed:
2021-12-30 14:44:55,214 - heartbeat.loop - ERROR [loop.py:179]: Error: local instance is down,Please check for more information, mogha do nothing because `handle_down_primary` config is set False.
The VIP also stays on the server that hosted the original primary:
[omm@mogdb ~]$ ifconfig
ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.16.71.30 netmask 255.255.255.0 broadcast 172.16.71.255
inet6 fe80::bf6c:c163:d890:aa83 prefixlen 64 scopeid 0x20<link>
ether 00:0c:29:8c:47:fa txqueuelen 1000 (Ethernet)
RX packets 21774590 bytes 2652721311 (2.4 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 33345158 bytes 37380483262 (34.8 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens33:1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.16.71.29 netmask 255.255.255.0 broadcast 172.16.71.255
ether 00:0c:29:8c:47:fa txqueuelen 1000 (Ethernet)
Normally handle_down_primary is left at its default of True. In that case, once MogHA detects that the primary is down, the follow-up action is governed by primary_down_handle_method: with the value restart, MogHA first tries to bring the primary back up, and only switches over to a standby after the number of attempts or the elapsed time exceeds the limits set by restart_strategy.
# Set primary_down_handle_method to restart, and limit restart attempts to 5 tries or 2 minutes
sed -i 's/primary_down_handle_method=failover/primary_down_handle_method=restart/g' /home/omm/mogha/node.conf && sed -i 's/restart_strategy=10\/3/restart_strategy=5\/2/g' /home/omm/mogha/node.conf && sudo systemctl restart mogha
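Before going further, it is worth confirming that both substitutions took effect; a grep against the same node.conf should now show primary_down_handle_method=restart and restart_strategy=5/2:
grep -E 'primary_down_handle_method|restart_strategy' /home/omm/mogha/node.conf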
The current primary is on mogdb2 (172.16.71.32):
[omm@mogdb2 ~]$ gs_om -t status --detail
[ Cluster State ]
cluster_state : Normal
redistributing : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state | node node_ip instance state | node node_ip instance state
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 mogdb 172.16.71.30 6001 /mogdb/data/db1 P Standby Normal | 2 mogdb1 172.16.71.31 6002 /mogdb/data/db1 S Standby Normal | 3 mogdb2 172.16.71.32 6003 /mogdb/data/db1 S Primary Normal
Stop the primary:
[omm@mogdb2 ~]$ gs_ctl stop -D /mogdb/data/db1/
[2021-12-30 15:05:07.721][37336][][gs_ctl]: gs_ctl stopped ,datadir is /mogdb/data/db1
waiting for server to shut down............ done
server stopped
[omm@mogdb2 ~]$ gs_om -t status --detail
[ Cluster State ]
cluster_state : Unavailable
redistributing : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state | node node_ip instance state | node node_ip instance state
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 mogdb 172.16.71.30 6001 /mogdb/data/db1 P Standby Need repair(Connecting) | 2 mogdb1 172.16.71.31 6002 /mogdb/data/db1 S Standby Need repair(Connecting) | 3 mogdb2 172.16.71.32 6003 /mogdb/data/db1 S Down Manually stopped
Checking the cluster status again a little later, it has already returned to normal:
[omm@mogdb2 ~]$ gs_om -t status --detail
[ Cluster State ]
cluster_state : Normal
redistributing : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state | node node_ip instance state | node node_ip instance state
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 mogdb 172.16.71.30 6001 /mogdb/data/db1 P Standby Normal | 2 mogdb1 172.16.71.31 6002 /mogdb/data/db1 S Standby Normal | 3 mogdb2 172.16.71.32 6003 /mogdb/data/db1 S Primary Normal
The MogHA log shows that the restart was carried out soon after the primary failure was detected:
2021-12-30 15:08:13,083 - instance - ERROR [opengauss.py:70]: gsql exec error: failed to connect /opt/mogdb/tools/omm_mppdb:26000.
, cmd: /opt/mogdb/app/bin/gsql -h /opt/mogdb/tools/omm_mppdb -U omm -p 26000 postgres -c "select client_addr host,sync_state from pg_stat_replication"
2021-12-30 15:08:13,084 - heartbeat.loop - ERROR [loop.py:178]: Heartbeat failed:
2021-12-30 15:08:13,084 - heartbeat.loop - ERROR [loop.py:179]: primary_backup and primary_second_backup not exist in []
2021-12-30 15:08:16,703 - heartbeat.loop - INFO [toolkit.py:180]: Ping result:{'172.16.71.30': True, '172.16.71.31': True, '172.16.71.2': True}
2021-12-30 15:08:16,715 - heartbeat.loop - INFO [loop.py:50]: Detect that local instance is down, try to handle it
2021-12-30 15:08:16,723 - heartbeat.loop - INFO [loop.py:55]: Local instance is stopped primary
2021-12-30 15:08:16,724 - heartbeat.loop - INFO [loop.py:69]: disk check success, try to restart
2021-12-30 15:08:16,724 - heartbeat.primary - INFO [primary_heartbeat.py:518]: try to restart local instance, count: 1
2021-12-30 15:08:18,895 - instance - INFO [opengauss.py:279]: Instance start:
[2021-12-30 15:08:18.894][40573][][gs_ctl]: done
[2021-12-30 15:08:18.894][40573][][gs_ctl]: server started (/mogdb/data/db1)
, err:
2021-12-30 15:08:18,973 - heartbeat.primary - INFO [primary_heartbeat.py:522]: restart local instance successfully.
2021-12-30 15:08:18,974 - instance - INFO [opengauss.py:211]: VIP: 172.16.71.29 already set in local host: ['172.16.71.32', '172.16.71.29', '192.168.122.1']
2021-12-30 15:08:23,592 - heartbeat.loop - INFO [toolkit.py:180]: Ping result:{'172.16.71.30': True, '172.16.71.31': True, '172.16.71.2': True}
2021-12-30 15:08:23,621 - heartbeat.loop - INFO [loop.py:148]: Detect that local instance is active primary
If the original primary cannot be brought back quickly, the setting restart_strategy=5/2 makes MogHA keep trying for up to 5 restart attempts or 2 minutes before it falls back to a switchover. To see this, we simulate a primary that cannot start by stopping the mogdb2 (172.16.71.32) instance and renaming its data directory, then watch what MogHA does.
[omm@mogdb2 data]$ gs_ctl stop -D /mogdb/data/db1 && mv /mogdb/data/db1 /mogdb/data/db1_bak
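To follow the restart attempts in real time, one option is to watch the MogHA service output. Since MogHA is managed by systemd here, journalctl works if the service's output actually reaches journald (this depends on how logging is configured in the deployment; otherwise, tail MogHA's own log file):
[omm@mogdb2 ~]$ sudo journalctl -u mogha -f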
After detecting the instance failure, MogHA tries to restart it:
2021-12-30 23:52:35,900 - heartbeat.loop - INFO [loop.py:50]: Detect that local instance is down, try to handle it
2021-12-30 23:52:35,909 - heartbeat.loop - INFO [loop.py:55]: Local instance is stopped primary
2021-12-30 23:52:35,910 - heartbeat.loop - INFO [loop.py:69]: disk check success, try to restart
2021-12-30 23:52:35,910 - heartbeat.primary - INFO [primary_heartbeat.py:518]: try to restart local instance, count: 1
2021-12-30 23:52:39,657 - instance - INFO [opengauss.py:279]: Instance start:
2021-12-30 23:52:40,679 - heartbeat.primary - INFO [primary_heartbeat.py:518]: try to restart local instance, count: 2
2021-12-30 23:52:44,422 - instance - INFO [opengauss.py:279]: Instance start:
2021-12-30 23:52:54,989 - heartbeat.primary - INFO [primary_heartbeat.py:518]: try to restart local instance, count: 5
2021-12-30 23:52:58,693 - instance - INFO [opengauss.py:279]: Instance start:
Once the number of restart attempts exceeds the maximum set by restart_strategy (5 in this example), a switchover to a standby is initiated:
2021-12-30 23:52:59,721 - heartbeat.primary - INFO [primary_heartbeat.py:532]: restart 6 times, but still failure
2021-12-30 23:52:59,721 - heartbeat.primary - INFO [primary_heartbeat.py:544]: restart local instance failed, try failover
2021-12-30 23:52:59,722 - instance - INFO [opengauss.py:249]: VIP:172.16.71.29 already offline, local ip list: ['172.16.71.30']
2021-12-30 23:52:59,722 - heartbeat.primary - INFO [primary_heartbeat.py:495]: send failover request to host 172.16.71.31
If the whole server hosting the primary goes down, MogHA switches over to a standby once primary_lost_timeout has elapsed.
# Detection window for a lost primary: if the primary server is down and cannot be pinged for this long, a standby is allowed to take over as the new primary
primary_lost_timeout=10
Below we shut down the mogdb2 (172.16.71.32) server to simulate this situation:
[root@mogdb2 ~]# init 0
Connection refused
The MogHA log on the standby shows:
2021-12-30 17:09:26,215 - heartbeat.standby - ERROR [standby_heartbeat.py:51]: not found primary. maybe primary lost
2021-12-30 17:09:26,216 - instance - INFO [opengauss.py:249]: VIP:172.16.71.29 already offline, local ip list: ['172.16.71.30']
2021-12-30 17:09:26,216 - heartbeat.standby - INFO [standby_heartbeat.py:85]: primary lost check...
2021-12-30 17:09:28,219 - heartbeat.standby - ERROR [standby_heartbeat.py:114]: primary lost check :2s
2021-12-30 17:09:31,223 - heartbeat.standby - ERROR [standby_heartbeat.py:114]: primary lost check :5s
2021-12-30 17:09:34,226 - heartbeat.standby - ERROR [standby_heartbeat.py:114]: primary lost check :8s
2021-12-30 17:09:37,229 - heartbeat.standby - ERROR [standby_heartbeat.py:114]: primary lost check :11s
2021-12-30 17:09:38,230 - heartbeat.standby - INFO [standby_heartbeat.py:228]: Start failover...
2021-12-30 17:09:38,274 - heartbeat.standby - INFO [standby_heartbeat.py:235]: Start gs_ctl failover... now lsn:0/63DCE70
2021-12-30 17:09:39,325 - heartbeat.standby - INFO [standby_heartbeat.py:238]: Failover result:
out: [2021-12-30 17:09:38.309][50094][][gs_ctl]: gs_ctl failover ,datadir is /mogdb/data/db1
[2021-12-30 17:09:38.309][50094][][gs_ctl]: failover term (1)
[2021-12-30 17:09:38.314][50094][][gs_ctl]: waiting for server to failover...
.[2021-12-30 17:09:39.324][50094][][gs_ctl]: done
[2021-12-30 17:09:39.324][50094][][gs_ctl]: failover completed (/mogdb/data/db1)
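Reading the timestamps above gives a rough feel for the RTO of a full server outage: the standby first suspected the primary was lost at 17:09:26, the accumulated lost-check time crossed primary_lost_timeout (11s > 10s) at 17:09:37, and gs_ctl failover completed at 17:09:39, roughly 13 seconds in total, i.e. about primary_lost_timeout plus one heartbeat round plus the failover itself. The exact figure will of course vary with load and configuration.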
- Summary
MogHA reacts to a primary failure (failover handling) only when handle_down_primary is True.
The parameter primary_down_handle_method determines what that reaction is: with primary_down_handle_method=failover, a primary failure triggers an immediate switchover to a standby; with primary_down_handle_method=restart, MogHA first tries to restart the primary, and only switches over after the attempts/minutes budget in restart_strategy has been exhausted and the primary still will not start.
The parameter heartbeat_interval controls the heartbeat check interval (default 3s). When the primary server stays unreachable (cannot be pinged) for longer than primary_lost_timeout (default 10s), the server is judged to be down and a switchover to a standby is triggered.
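Putting the pieces together, the failover-related portion of node.conf tuned along the lines of this post would look roughly like the excerpt below (an illustrative sketch containing only the parameters discussed here; the node and IP settings every deployment needs are omitted):
# whether MogHA reacts to a down primary at all (default True)
handle_down_primary=True
# on primary failure: try to restart it first ("restart") or switch over immediately ("failover")
primary_down_handle_method=restart
# give up restarting after 5 attempts or 2 minutes, then switch over
restart_strategy=5/2
# heartbeat check interval, in seconds (default 3)
heartbeat_interval=3
# a standby may take over only after the primary server has been unreachable this long, in seconds (default 10)
primary_lost_timeout=10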




