暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

PostgreSQL同步逻辑复制槽的restart_lsn到备库

原创 仙人掌 2023-06-16
1491

使用逻辑复制时,会在主库创建复制槽来记录逻辑复制的相关信息,但是复制槽的信息不会被同步到备库,当数据库发生主备切换时,原备库没有复制槽信息,复制槽只能重建,导致故障期间的数据无法通过逻辑复制同步给订阅端。

有一个办法是在创建复制槽后,把主库$PGDATA/pg_replslot目录中的文件拷贝到备库,并重启备库,即可在备库中看到该复制槽

但是这又会产生一个新的问题,wal日志的清理依赖复制槽中的restart_lsn,备库的复制槽又不会更新,就导致备库的wal日志一直无法清理,借鉴patroni中使用函数pg_replication_slot_advance来推进备库复制槽的lsn,编写了以下脚本

脚本功能

读取主库复制槽的restart_lsn和pg_stat_replication中备库信息,使用函数pg_replication_slot_advance推进备库复制槽的lsn

脚本使用示例

执行脚本前,备库复制槽restart_lsn为从主库拷贝时的值,且restart_lsn不会更新,wal日志无法清理

[postgres@db pg_wal]$ ls -ltr |head -10 total 11763788 -rw------- 1 postgres postgres 16777216 Jun 6 11:08 000000010000000000000005.partial -rw------- 1 postgres postgres 41 Jun 6 11:08 00000002.history -rw------- 1 postgres postgres 83 Jun 6 14:20 00000003.history -rw------- 1 postgres postgres 16777216 Jun 6 14:20 000000020000000000000006 -rw------- 1 postgres postgres 16777216 Jun 6 16:18 000000030000000000000006 -rw------- 1 postgres postgres 16777216 Jun 6 16:18 000000030000000000000007.partial -rw------- 1 postgres postgres 125 Jun 6 16:18 00000004.history -rw------- 1 postgres postgres 16777216 Jun 6 16:35 000000040000000000000007 -rw------- 1 postgres postgres 167 Jun 6 16:35 00000005.history [postgres@db pg_wal]$ psql psql (12.7) Type "help" for help. postgres=# checkpoint ; CHECKPOINT postgres=# select * from pg_replication_slots ; slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn -----------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+--------------------- idcsync_0f99_26 | pgoutput | logical | 16546 | idc_test | f | f | | | 2284 | 0/6050080 | 0/60500B9
复制

执行脚本后,备库复制槽restart_lsn更新,wal日志可以正常清理

[postgres@db pg_wal]$ ls -ltr |head -10 total 18874472 -rw------- 1 postgres postgres 41 Jun 6 11:08 00000002.history -rw------- 1 postgres postgres 83 Jun 6 14:20 00000003.history -rw------- 1 postgres postgres 125 Jun 6 16:18 00000004.history -rw------- 1 postgres postgres 167 Jun 6 16:35 00000005.history -rw------- 1 postgres postgres 209 Jun 6 16:39 00000006.history -rw------- 1 postgres postgres 251 Jun 6 16:45 00000007.history -rw------- 1 postgres postgres 16777216 Jun 12 09:06 0000000700000002000000C7 -rw------- 1 postgres postgres 16777216 Jun 12 09:07 0000000700000002000000C8 -rw------- 1 postgres postgres 16777216 Jun 12 09:07 0000000700000002000000C9 [postgres@db-19-65 pg_wal]$ du -sh 19G . [postgres@db pg_wal]$ psql psql (12.7) Type "help" for help. postgres=# select * from pg_replication_slots ; slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn -----------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+--------------------- idcsync_0f99_26 | pgoutput | logical | 16546 | idc_test | f | f | | | 4622 | 2/C7E10468 | 2/C7E11770 (1 row)
复制

脚本运行日志

2023-06-12__11:06:15-- begin ...,vip_list_file /home/postgres/pg_replication_slot_advance_vip.list 2023-06-12__11:06:15--main-- get slots_list,ip 192.168.1.11:5433,sql [select string_agg(concat_ws('|',slot_name,restart_lsn,database),' ') from pg_replication_slots;] 2023-06-12__11:06:15--main-- slots_list [[idcsync_0f99_26|2/C7E11770|idc_test]] 2023-06-12__11:06:15--main-- get standby_list,ip 192.168.1.11:5433,sql [select string_agg(replace(application_name,'_',':'),' ') from pg_stat_replication where usename='repl';] 2023-06-12__11:06:15--main-- standby_list [[192.168.1.13:5432]] 2023-06-12__11:06:15--slot_advance-- set restart_lsn,ip 192.168.1.13:5432: slot_name idcsync_0f99_26,lsn 2/C7E11770; 2023-06-12__11:06:15--slot_advance-- sql [select case pg_is_in_recovery() when true then pg_replication_slot_advance('idcsync_0f99_26','2/C7E11770') end] (idcsync_0f99_26,2/C7E11770) 2023-06-12__11:06:15--clear_log clear log /home/postgres/pg_replication_slot_advance_log.20230605 rm: cannot remove ‘/home/postgres/pg_replication_slot_advance_log.20230605’: No such file or directory 2023-06-12__11:06:15-- end ...
复制

测试运行正常后添加定时任务,每20分钟执行一次该脚本

[postgres@db ~]$ crontab -l */20 * * * * . ~/.bash_profile; sh ~/pg_replication_slot_advance.sh
复制

查看运行日志

2023-06-14__09:40:01-- begin ...,vip_list_file /home/postgres/pg_replication_slot_advance_vip.list 2023-06-14__09:40:01--main-- get slots_list,ip 192.168.1.11:5433,sql [select string_agg(concat_ws('|',slot_name,restart_lsn,database),' ') from pg_replication_slots;] 2023-06-14__09:40:01--main-- slots_list [[idcsync_0f99_26|7/478DFEE0|idc_test]] 2023-06-14__09:40:01--main-- get standby_list,ip 192.168.1.11:5433,sql [select string_agg(replace(application_name,'_',':'),' ') from pg_stat_replication where usename='repl';] 2023-06-14__09:40:01--main-- standby_list [[192.168.1.13:5432]] 2023-06-14__09:40:01--slot_advance-- set restart_lsn,ip 192.168.1.13:5432: slot_name idcsync_0f99_26,lsn 7/478DFEE0; 2023-06-14__09:40:01--slot_advance-- sql [select case pg_is_in_recovery() when true then pg_replication_slot_advance('idcsync_0f99_26','7/478DFEE0') end] (idcsync_0f99_26,7/478DFEE0) 2023-06-14__09:40:02--clear_log clear log /home/postgres/pg_replication_slot_advance_log/pg_replication_slot_advance_log.20230607 rm: cannot remove ‘/home/postgres/pg_replication_slot_advance_log/pg_replication_slot_advance_log.20230607’: No such file or directory 2023-06-14__09:40:02-- end ...
复制

查看备库复制槽已更新,且wal日志正常清理

[postgres@db ~]$ psql psql (12.7) Type "help" for help. postgres=# select * from pg_replication_slots ; slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn -----------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+--------------------- idcsync_0f99_26 | pgoutput | logical | 16546 | idc_test | f | f | | | 18627 | 7/46B02B50 | 7/478DFEE0 (1 row) [postgres@db pg_wal]$ ls -ltr |head -10 total 901156 -rw------- 1 postgres postgres 41 Jun 6 11:08 00000002.history -rw------- 1 postgres postgres 83 Jun 6 14:20 00000003.history -rw------- 1 postgres postgres 125 Jun 6 16:18 00000004.history -rw------- 1 postgres postgres 167 Jun 6 16:35 00000005.history -rw------- 1 postgres postgres 209 Jun 6 16:39 00000006.history -rw------- 1 postgres postgres 251 Jun 6 16:45 00000007.history -rw------- 1 postgres postgres 16777216 Jun 12 21:02 000000070000000700000046 -rw------- 1 postgres postgres 16777216 Jun 13 15:00 000000070000000700000047 -rw------- 1 postgres postgres 16777216 Jun 13 15:00 000000070000000700000048
复制
脚本如下

pg_replication_slot_advance.sh

#!/bin/bash #set standby pg_replication_slots restart_lsn follow master path=`pwd` vip_list=$path/pg_replication_slot_advance_vip.list log_name=$path/pg_replication_slot_advance_log write_log(){ echo "`date +%Y-%m-%d__%H:%M:%S`--$*" } #connect to pg pg_con(){ user=$1 host=$2 dbname=$3 passwd="$4" sql="$5" psql "postgresql://$user:$passwd@$host/$dbname" -Atc "$sql" } #set standby pg_replication_slots restart_lsn ues function "pg_replication_slot_advance" slot_advance(){ user=$1 host=$2 dbname=$3 passwd="$4" slot_name="$5" lsn="$6" sql="select case pg_is_in_recovery() when true then pg_replication_slot_advance('$slot_name','$lsn') end" write_log "${FUNCNAME}-- set restart_lsn,ip $host:$port slot_name $slot_name,lsn $lsn;" write_log "${FUNCNAME}-- sql [$sql]" pg_con $user $host $dbname $passwd "$sql" } clear_log(){ clear_log_time=`date -d '7 day ago' +%Y%m%d` write_log "${FUNCNAME} clear log $log_name.$clear_log_time" rm $log_name.$clear_log_time } main(){ #get slot list(slot_name,restart_lsn,database) sql="select string_agg(concat_ws('|',slot_name,restart_lsn,database),' ') from pg_replication_slots where slot_type='logical';" write_log "${FUNCNAME}-- get slots_list,ip $host,sql [$sql]" slots_list=`pg_con $user $host $dbname $passwd "$sql"` write_log "${FUNCNAME}-- slots_list [[$slots_list]]" if [ "$slots_list" ];then #get standby list(ip,port) sql="select string_agg(replace(application_name,'_',':'),' ') from pg_stat_replication where usename='repl' and application_name<>'pg_basebackup';" write_log "${FUNCNAME}-- get standby_list,ip $host,sql [$sql]" standby_list=`pg_con $user $host $dbname $passwd "$sql"` write_log "${FUNCNAME}-- standby_list [[$standby_list]]" for i in $slots_list do slot_name=`echo $i|awk -F '|' '{print $1}'` lsn=`echo $i|awk -F '|' '{print $2}'` dbname=`echo $i|awk -F '|' '{print $3}'` for host in $standby_list do slot_advance $user $host $dbname $passwd $slot_name $lsn done done else write_log "${FUNCNAME} $vip slots count is 0" fi } log_time=`date +%Y%m%d` write_log " begin ...,vip_list_file $vip_list" >>$log_name.$log_time for vip in `cat $vip_list` do host=$vip user=username dbname=database passwd="password" main >>$log_name.$log_time 2>&1 done clear_log >>$log_name.$log_time 2>&1 write_log " end ..." >>$log_name.$log_time
复制
参考文档

Patroni 如何解决 PostgreSQL 集群中的逻辑复制槽故障转移问题 (percona.com)

patroni/patroni/postgresql/slots.py at master · zalando/patroni · GitHub

「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论