适用范围
pg12+
方案概述
通过Keepalived 实现 PG HA的自主切换,以达到高可用的目的,由于Keepalived 是在PG流复制之上,所以在安装Keepalived 之前,我们需要先搭建好PG流复制,这里我们使用PG12的版本,如果是用PG12以前的版本,需要修改failover.sh的切换内容。
实施步骤
1. 资源规划:
2. 安装准备:
安装前我们需要创建数据库Keepalived,并且创建表探测表sr_delay,后续Keepalived探测,刷新sr_delay表的last_alive字段为当前探测时间。
这张表用来判断主备延迟情况,数据库故障切换时会查询用到这张表。
postgres=# create user keepalived password 'keepalived' CONNECTION LIMIT 4 ;
postgres=# create database keepalived owner keepalived;
postgres=# \c keepalived keepalived
keepalived=> create table sr_delay(id int4, last_alive timestamp(0) without time zone);
#表sr_delay只允许写入一条记录,并且不允许删除此表数据,通过触发器实现。创建触发器函数,如下所示:
CREATE FUNCTION cannt_delete ()
RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
RAISE EXCEPTION 'Table sr_delay can not delete !';
END;
$$;
#创建触发器:
CREATE TRIGGER trigger_sr_delay_del
BEFORE DELETE ON sr_delay
FOR EACH ROW EXECUTE PROCEDURE cannt_delete() ;
CREATE TRIGGER trigger_sr_delay_tru BEFORE TRUNCATE ON sr_delay FOR STATEMENT EXECUTE PROCEDURE cannt_delete() ;
#插入数据:
INSERT INTO sr_delay VALUES(1,now()) ;复制
后续Keepalived会每隔指定时间探测PostgreSQL数据库存活,并且以Keepalived用户登录Keepalived数据库刷新这张表, 配置主备库pg_hba.conf,增加如下内容:
echo "
host keepalived keepalived 10.85.10.0/24 md5
host keepalived keepalived 10.85.10.0/24 md5" >> pg_hba.conf复制
3.安装keepalived
#两个节点
$ wget https://www.keepalived.org/software/keepalived-2.2.7.tar.gz
tar xf keepalived-2.2.7.tar.gz
cd keepalived-2.2.7
./configure --prefix=/usr/local/keepalived/
# ./configure
或
#./configure --prefix=/postgres/keepalived-2.2.7
make
make install复制
3.1 配置 keepalived.conf
vi etc/keepalived/keepalived.conf
! Configuration File for keepalived
global_defs {
# 邮件通知信息
notification_email {
# 定义收件人
446464087@qq.com
}
# 定义发件人
notification_email_from test@cqdba.cn
# SMTP服务器地址
smtp_server 10.85.10.51
smtp_connect_timeout 30
# 路由器标识,一般不用改,也可以写成每个主机自己的主机名
router_id huyidb03
vrrp_skip_check_adv_addr
vrrp_strict
vrrp_garp_interval 0
vrrp_gna_interval 0
}
# 定义用于实例执行的脚本内容,比如可以在线降低优先级,用于强制切换
vrrp_script check_pg_alived {
script "/etc/keepalived/scirpts/check_pg.sh"
interval 5
fall 3 # require 3 failures for KO
}
# 一个vrrp_instance就是定义一个虚拟路由器的,实例名称
vrrp_instance VI_1 {
# 定义初始状态,可以是MASTER或者BACKUP
state MASTER
#非抢占模式
nopreempt
# 工作接口,通告选举使用哪个接口进行
interface ens38
# 虚拟路由ID,如果是一组虚拟路由就定义一个ID,如果是多组就要定义多个,而且这个虚拟
# ID还是虚拟MAC最后一段地址的信息,取值范围0-255
virtual_router_id 51
#权重 如果你上面定义了MASTER,这里的优先级就需要定义的比其他的高
priority 100
#通告频率 单位s
advert_int 1
#通信认证机制,这里是明文认证还有一种是加密认证
authentication {
auth_type PASS
auth_pass abcdefgh
}
# 设置虚拟VIP地址
virtual_ipaddress {
10.85.10.77
}
# 追踪脚本,通常用于去执行上面的vrrp_script定义的脚本内容
track_script {
check_pg_alived
}
# 如果主机状态变成Master|Backup|Fault之后会去执行的通知脚本,脚本要自己写
smtp_alert
notify_master "/etc/keepalived/scirpts/failover.sh"
notify_fault "/etc/keepalived/scirpts/fault.sh"
}
}复制
以上是Keepalived主节点的配置,Keepalived备节点的priority参数改成90 ,state改为BACKUP, 其余参数配置一样。
3.2 配置check_pg.sh
vi etc/keepalived/scirpts/check_pg.sh
#!/bin/bash
# 配置环境变量
pgport=5432
pguser=keepalived
pgdb=keepalived
pgpwd='keepalived'
LANG=en_US.utf8
PGHOME=/postgresql/pg12
LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib
PATH=$PGHOME/bin:$PATH:.
MONITOR_LOG="/etc/keepalived/log/pg_monitor.log"
SQL1="UPDATE sr_delay SET last_alive= now();"
SQL2='SELECT 1;'
keeplognums=30000
#此脚本不检查备库存活状态,如果是备库则退出
standby_flg=`psql -p$pgport -Upostgres -At -c "SELECT pg_is_in_recovery();"`
if [ ${standby_flg} == 't' ];
then
echo -e "`date +%F\ %T`:This is a standby database, exit!\n" > $MONITOR_LOG
exit 0
fi
export PGPASSWORD=$pgpwd
#主库更新sr_delay 表
echo $SQL1 | psql -At -p $pgport -U $pguser -d $pgdb >> $MONITOR_LOG
#判断主库是否可用
echo $SQL2 | psql -At -p $pgport -U $pguser -d $pgdb
if [ $? -eq 0 ] ;
then
echo -e "`date +%F\ %T`:Primary db is health." >> $MONITOR_LOG
exit 0
else
echo -e "`date +%F\ %T`:Attention: Primary db is not health!" >> $MONITOR_LOG
exit 1
fi
#日志保留 keeplognums 行
if [ ! -f ${MONITOR_LOG} ] ;then touch ${MONITOR_LOG};fi
lognums=`cat ${MONITOR_LOG} |wc -l`
catnum=$((${lognums} -${keeplognums}))
if [[ $lognums -gt ${keeplognums} ]] ; then sed -i "1,${catnum}d" ${MONITOR_LOG}; fi复制
3.3 配置failover.sh
vi etc/keepalived/scirpts/failover.sh
#!/bin/bash
export PGPORT=5432
export PGUSER=keepalived
export PG_OS_USER=postgres
export PGDBNAME=keepalived
export LANG=en_US.utf8
export PGPATH=/postgresql/pg12
export PGDATA=/postgresql/data
export PATH=$PATH:$PGPATH/bin
LOGFILE='/etc/keepalived/log/failover.log'
# 主备数据库同步时延,单位为秒
sr_allowed_delay_time=100
SQL1='select pg_is_in_recovery from pg_is_in_recovery();'
SQL2="select count(1) as delay_time from sr_delay where now()<(last_alive + interval '${sr_allowed_delay_time} seconds');"
#SQL2="select count(1) as delay_time from sr_delay where now()<(last_alive + interval '100 seconds');"
sleep $sr_allowed_delay_time
db_role=`echo $SQL1 | psql -At -p $PGPORT -U $PGUSER -d $PGDBNAME -w`
db_sr_delaytime=`echo $SQL2 | psql -p $PGPORT -d $PGDBNAME -U $PGUSER -At -w`
SWITCH_COMMAND='pg_ctl promote -D $PGDATA'
# 如果为备库,且延迟大于指定时间则切换为主库
# 判断 是否为备库,为主库退出
if [ ${db_role} == 'f' ];
then
echo -e `date +"%F %T"` "Attention:The current postgreSQL DB is master database,cannot switched!" >> $LOGFILE
exit 0
fi
# 判断延迟 时间, 小于延迟时间退出
if [ $db_sr_delaytime -gt 0 ];
then
echo -e `date +"%F %T"` "Attention:The current master database is health,the standby DB cannot switched!" >> $LOGFILE
exit 0
fi
#大于延迟 时间切换
if [ !$db_sr_delaytime ];
then
echo -e `date +"%F %T"` "Attention:The current database is statndby,ready to switch master database!" >> $LOGFILE
su - $PG_OS_USER -c "$SWITCH_COMMAND"
elif [ $? -eq 0 ];
then
echo -e `date +"%F %T"` "success:The current standby database successed to switched the primary PG database !" >> $LOGFILE
exit 0
else
echo -e `date +"%F %T"` "Error: the standby database failed to switch the primary PG database !,pelease checked it!" >> $LOGFILE
exit 1
fi复制
3.4 配置fault.sh
vi etc/keepalived/scirpts/fault.sh
#!/bin/bash
LOGFILE=/etc/keepalived/log/pg_db_fault.log
PGDATA=/postgresql/data
PGPORT=5432
echo -e `date +"%F %T"` "Error:Because of the priamry DB happend some unknown problem,So turn off the PostgreSQL Database!" >> $LOGFILE
PGPID="`head -n1 $PGDATA/postmaster.pid`"
systemctl stop keepalived
kill -9 $PGPID
if [ $? -eq 0 ] ;
then
echo -e `date +"%F %T"` "Error:Because of the priamry DB happend some unknown problem,So turn off the PostgreSQL Database!" >> $LOGFILE
systemctl stop keepalived
exit 1
fi复制
4.启动keepalived
chmod -R 755 etc/keepalived/scirpts/*
systemctl daemon-reload
systemctl enable keepalived
systemctl start keepalived
#状态查询 :
[root@huyidb03 ~]# systemctl status keepalived
● keepalived.service - LVS and VRRP High Availability Monitor
Loaded: loaded (/usr/lib/systemd/system/keepalived.service; enabled; vendor preset: disabled)
Active: active (running) since Sat 2022-06-11 11:49:14 CST; 5s ago
Docs: man:keepalived(8)
man:keepalived.conf(5)
man:genhash(1)
https://keepalived.org
Process: 97742 ExecStart=/usr/local/sbin/keepalived $KEEPALIVED_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 97743 (keepalived)
CGroup: system.slice/keepalived.service
├─97612 sleep 100
├─97743 usr/local/sbin/keepalived -D
├─97744 usr/local/sbin/keepalived -D
├─97757 bin/bash etc/keepalived/scirpts/failover.sh
└─97758 sleep 100
Jun 11 11:49:18 huyidb03 Keepalived_vrrp[97744]: (VI_1) Receive advertisement timeout
Jun 11 11:49:18 huyidb03 Keepalived_vrrp[97744]: (VI_1) Entering MASTER STATE # 主库上 MASTER 角色
Jun 11 11:49:18 huyidb03 Keepalived_vrrp[97744]: (VI_1) setting VIPs.
Jun 11 11:49:18 huyidb03 Keepalived_vrrp[97744]: (VI_1) Sending/queueing gratuitous ARPs on ens38 for 10.85.10.77 # 启动了 vip
Jun 11 11:49:18 huyidb03 Keepalived_vrrp[97744]: Sending gratuitous ARP on ens38 for 10.85.10.77
Jun 11 11:49:18 huyidb03 Keepalived_vrrp[97744]: Sending gratuitous ARP on ens38 for 10.85.10.77
Jun 11 11:49:18 huyidb03 Keepalived_vrrp[97744]: Sending gratuitous ARP on ens38 for 10.85.10.77
Jun 11 11:49:18 huyidb03 Keepalived_vrrp[97744]: Sending gratuitous ARP on ens38 for 10.85.10.77
Jun 11 11:49:18 huyidb03 Keepalived_vrrp[97744]: Sending gratuitous ARP on ens38 for 10.85.10.77复制
5.切换演示
5.1 主库日志
#主库停库
postgres@huyidb03-> pg_ctl stop
waiting for server to shut down....2022-06-11 12:03:43.219 CST [98168] LOG: received fast shutdown request
2022-06-11 12:03:43.220 CST [98168] LOG: aborting any active transactions
2022-06-11 12:03:43.221 CST [98168] LOG: background worker "logical replication launcher" (PID 98175) exited with exit code 1
2022-06-11 12:03:43.221 CST [98170] LOG: shutting down
2022-06-11 12:03:43.239 CST [98168] LOG: database system is shut down
done
server stopped
# 查看 keepalived
[root@huyidb03 ~]# systemctl status keepalived
● keepalived.service - LVS and VRRP High Availability Monitor
Loaded: loaded (/usr/lib/systemd/system/keepalived.service; enabled; vendor preset: disabled)
Active: inactive (dead) since Sat 2022-06-11 12:03:58 CST; 6s ago
Docs: man:keepalived(8)
man:keepalived.conf(5)
man:genhash(1)
https://keepalived.org
Process: 98183 ExecStart=/usr/local/sbin/keepalived $KEEPALIVED_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 98184 (code=exited, status=0/SUCCESS)
CGroup: system.slice/keepalived.service
Jun 11 11:55:56 huyidb03 Keepalived_vrrp[98185]: Sending gratuitous ARP on ens38 for 10.85.10.77
Jun 11 11:55:56 huyidb03 Keepalived_vrrp[98185]: Sending gratuitous ARP on ens38 for 10.85.10.77
Jun 11 12:03:47 huyidb03 Keepalived_vrrp[98185]: Script `check_pg_alived` now returning 1
Jun 11 12:03:57 huyidb03 Keepalived_vrrp[98185]: VRRP_Script(check_pg_alived) failed (exited with status 1)
Jun 11 12:03:57 huyidb03 Keepalived_vrrp[98185]: (VI_1) Entering FAULT STATE
Jun 11 12:03:57 huyidb03 Keepalived_vrrp[98185]: (VI_1) sent 0 priority
Jun 11 12:03:57 huyidb03 Keepalived_vrrp[98185]: (VI_1) removing VIPs.
Jun 11 12:03:57 huyidb03 Keepalived[98184]: Stopping
Jun 11 12:03:57 huyidb03 systemd[1]: Stopping LVS and VRRP High Availability Monitor...
Jun 11 12:03:58 huyidb03 systemd[1]: Stopped LVS and VRRP High Availability Monitor.
#查看日志
tail -100f etc/keepalived/log/pg_monitor.log
2022-06-11 12:03:42: Primary db is health.
2022-06-11 12:03:47: Attention: Primary db is not health!
2022-06-11 12:03:52: Attention: Primary db is not health!
2022-06-11 12:03:57: Attention: Primary db is not health!
[root@huyidb03 log]# tail -100f etc/keepalived/log/pg_db_fault.log
2022-06-11 11:52:59 Error:Because of the priamry DB happend some unknown problem,So turn off the PostgreSQL Database!
2022-06-11 12:03:57 Error:Because of the priamry DB happend some unknown problem,So turn off the PostgreSQL Database!复制
5.2 备库日志
tail -100f etc/keepalived/log/pg_monitor.log
[root@huyidb04 log]# cat failover.log
2022-06-11 12:05:38 Attention:The current database is statndby,ready to switch master database!
[root@huyidb04 log]# tail -100f pg_monitor.log
2022-06-11 12:04:38: This is a standby database, exit!\n
UPDATE 1
2022-06-11 12:05:43: Primary db is health.
UPDATE 1
2022-06-11 12:05:48: Primary db is health.
UPDATE 1
[root@huyidb04 ~]# systemctl status keepalived
● keepalived.service - LVS and VRRP High Availability Monitor
Loaded: loaded (/usr/lib/systemd/system/keepalived.service; disabled; vendor preset: disabled)
Active: active (running) since Sat 2022-06-11 12:01:38 CST; 9min ago
Docs: man:keepalived(8)
man:keepalived.conf(5)
man:genhash(1)
https://keepalived.org
Process: 79603 ExecStart=/usr/local/sbin/keepalived $KEEPALIVED_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 79604 (keepalived)
CGroup: /system.slice/keepalived.service
├─79604 /usr/local/sbin/keepalived -D
└─79605 /usr/local/sbin/keepalived -D
Jun 11 12:03:58 huyidb04 Keepalived_vrrp[79605]: Sending gratuitous ARP on ens38 for 10.85.10.77
Jun 11 12:03:58 huyidb04 Keepalived_vrrp[79605]: Sending gratuitous ARP on ens38 for 10.85.10.77
Jun 11 12:04:01 huyidb04 Keepalived_vrrp[79605]: smtp fd 14 returned write error
Jun 11 12:04:03 huyidb04 Keepalived_vrrp[79605]: (VI_1) Sending/queueing gratuitous ARPs on ens38 for 10.85.10.77
Jun 11 12:04:03 huyidb04 Keepalived_vrrp[79605]: Sending gratuitous ARP on ens38 for 10.85.10.77
Jun 11 12:04:03 huyidb04 Keepalived_vrrp[79605]: Sending gratuitous ARP on ens38 for 10.85.10.77
Jun 11 12:04:03 huyidb04 Keepalived_vrrp[79605]: Sending gratuitous ARP on ens38 for 10.85.10.77
Jun 11 12:04:03 huyidb04 Keepalived_vrrp[79605]: Sending gratuitous ARP on ens38 for 10.85.10.77
Jun 11 12:04:03 huyidb04 Keepalived_vrrp[79605]: Sending gratuitous ARP on ens38 for 10.85.10.77复制
6. keepalived使用注意事项
启动keepalived 前, 一定要先启动PG, 且同步状态正常
先库动PG主库的keepalived ,在启动PG备库的keepalive
keepalive master 机 down掉, VIP会自动切换到 backup 上, 即使主数据库正常 , 这时keepalive 也会将备库切换为新的主库 (如果不是用VIP连接数据库的,就会出现双主,数据不一致的情况)
正常维护的时候 , 先停备库的keepalived ;再停主库的 keepalived 。然后在停数据库
避免开机启动keepalived ,当PG还在启动中时,不确认谁是主时,keepaliveed 谁先启动,可能会先成为MSTER 状态。VIP 将在MASTER 上启动.
数据库DOWN掉时,keepalived 发现没有进程, 也会DOWN掉
当keepalived 自己 down掉,在另一台机器上切换为master 后, 数据库不会立即切换为主, 时间由 failover.sh 脚本中的,sr_allowed_delay_time 参数控制keepalived 只能 ,建议和DB 安装在同一台机器, 原因是在切换的时候需要技持OS 脚本,如:pg_ctl promote -D $PGDATA ,如果在不同机器,需要编写脚本,能远程连过来执行
如何回切呢?-- 原主库启动后,先配置与新从库之间的同步,数据达到一致后,再找个时间进行切换