repmgr搭建参考上一篇 https://www.modb.pro/db/15358
1.每台服务器修改repmgr文件,增加以下内容
$ vi /opt/repmgr.conf failover='automatic' connection_check_type=ping reconnect_attempts=6 reconnect_interval=10 #如果主节点失败,则要进行切换的工作,并记录到日志 promote_command='/opt/pgsql/bin/repmgr standby promote -f /opt/repmgr.conf --log-to-file' #如果有连接到主节点的其他从节点,会根据命令连接到新主的节点 follow_command='/opt/pgsql/bin/repmgr standby follow -f /opt/repmgr.conf --log-to-file --upstream-node-id=%n'
复制
2.每台服务器postgresql.conf增加配置
$ vi data6000/postgresql.conf shared_preload_libraries = 'repmgr'
复制
3.每台服务器启动repmgrd守护进程
$ /opt/pgsql/bin/repmgrd -f /opt/repmgr.conf --通过事件记录查看repmgrd的开启情况 [postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster event --event=repmgrd_start Node ID | Name | Event | OK | Timestamp | Details ---------+-------+---------------+----+---------------------+-------------------------------------------------------- 1 | node1 | repmgrd_start | t | 2020-02-02 22:33:30 | monitoring cluster primary "node1" (ID: 1) 2 | node2 | repmgrd_start | t | 2020-02-02 22:33:25 | monitoring connection to upstream node "node1" (ID: 1) --查看repmgr和repmgrd进程 [postgres@ha1 ~]$ ps f -u postgres PID TTY STAT TIME COMMAND 13124 pts/0 S 0:00 -bash 13389 pts/0 R+ 0:00 \_ ps f -u postgres 13191 ? S 0:00 /opt/pgsql/bin/repmgrd -f /opt/repmgr.conf 3247 ? Ss 0:00 /opt/pg12/bin/postgres -D /opt/data6000 3272 ? Ss 0:00 \_ postgres: logger 3680 ? Ss 0:00 \_ postgres: checkpointer 3681 ? Ss 0:00 \_ postgres: background writer 3682 ? Ss 0:00 \_ postgres: walwriter 3683 ? Ss 0:00 \_ postgres: autovacuum launcher 3684 ? Ss 0:00 \_ postgres: archiver 3685 ? Ss 0:00 \_ postgres: stats collector 3686 ? Ss 0:00 \_ postgres: logical replication launcher 13184 ? Ss 0:00 \_ postgres: repmgr repmgr 192.168.99.202(41551) idle 13189 ? Ss 0:00 \_ postgres: repmgr repmgr 192.168.99.200(49369) idle 13289 ? Ss 0:00 \_ postgres: walsender repmgr 192.168.99.202(41735) streaming 0/17003308 [postgres@ha2 ~]$ ps f -u postgres PID TTY STAT TIME COMMAND 3926 pts/0 S 0:00 -bash 4056 pts/0 R+ 0:00 \_ ps f -u postgres 3950 ? S 0:00 /opt/pgsql/bin/repmgrd -f /opt/repmgr.conf 3241 ? Ss 0:00 /opt/pg12/bin/postgres -D /opt/data6000 3259 ? Ss 0:00 \_ postgres: logger 3260 ? Ss 0:00 \_ postgres: startup recovering 000000070000000000000017 3674 ? Ss 0:00 \_ postgres: checkpointer 3675 ? Ss 0:00 \_ postgres: background writer 3676 ? Ss 0:00 \_ postgres: stats collector 3948 ? Ss 0:00 \_ postgres: repmgr repmgr 192.168.99.202(62092) idle 4002 ? Ss 0:00 \_ postgres: walreceiver streaming 0/17003308
复制
4.关闭主节点HA1
[postgres@ha1 ~]$ /opt/pgsql/bin/pg_ctl -D /opt/data6000/ stop waiting for server to shut down.... done server stopped --查看备库HA2日志,可以看到1 分钟后,无法连接到主库HA1,程序已经开始了切库的服务 [postgres@ha2 ~]$ cat /opt/data6000/repmgr.log ... [2020-02-03 11:25:34] [INFO] node "node2" (ID: 2) monitoring upstream node "node1" (ID: 1) in normal state [2020-02-03 11:25:36] [WARNING] unable to ping "host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2" [2020-02-03 11:25:36] [DETAIL] PQping() returned "PQPING_NO_RESPONSE" [2020-02-03 11:25:36] [WARNING] unable to connect to upstream node "node1" (ID: 1) [2020-02-03 11:25:36] [INFO] checking state of node 1, 1 of 6 attempts [2020-02-03 11:25:36] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.99.200 port=6000 fallback_application_name=repmgr" [2020-02-03 11:25:36] [DETAIL] PQping() returned "PQPING_NO_RESPONSE" [2020-02-03 11:25:36] [INFO] sleeping 10 seconds until next reconnection attempt [2020-02-03 11:25:46] [INFO] checking state of node 1, 2 of 6 attempts [2020-02-03 11:25:46] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.99.200 port=6000 fallback_application_name=repmgr" [2020-02-03 11:25:46] [DETAIL] PQping() returned "PQPING_NO_RESPONSE" [2020-02-03 11:25:46] [INFO] sleeping 10 seconds until next reconnection attempt [2020-02-03 11:25:56] [INFO] checking state of node 1, 3 of 6 attempts [2020-02-03 11:25:56] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.99.200 port=6000 fallback_application_name=repmgr" [2020-02-03 11:25:56] [DETAIL] PQping() returned "PQPING_NO_RESPONSE" [2020-02-03 11:25:56] [INFO] sleeping 10 seconds until next reconnection attempt [2020-02-03 11:26:06] [INFO] checking state of node 1, 4 of 6 attempts [2020-02-03 11:26:06] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.99.200 port=6000 fallback_application_name=repmgr" [2020-02-03 11:26:06] [DETAIL] PQping() returned "PQPING_NO_RESPONSE" [2020-02-03 11:26:06] [INFO] sleeping 10 seconds until next reconnection attempt [2020-02-03 11:26:16] [INFO] checking state of node 1, 5 of 6 attempts [2020-02-03 11:26:16] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.99.200 port=6000 fallback_application_name=repmgr" [2020-02-03 11:26:16] [DETAIL] PQping() returned "PQPING_NO_RESPONSE" [2020-02-03 11:26:16] [INFO] sleeping 10 seconds until next reconnection attempt [2020-02-03 11:26:26] [INFO] checking state of node 1, 6 of 6 attempts [2020-02-03 11:26:26] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.99.200 port=6000 fallback_application_name=repmgr" [2020-02-03 11:26:26] [DETAIL] PQping() returned "PQPING_NO_RESPONSE" [2020-02-03 11:26:26] [WARNING] unable to reconnect to node 1 after 6 attempts [2020-02-03 11:26:26] [INFO] 0 active sibling nodes registered [2020-02-03 11:26:26] [INFO] primary and this node have the same location ("default") [2020-02-03 11:26:26] [INFO] no other sibling nodes - we win by default [2020-02-03 11:26:26] [NOTICE] this node is the only available candidate and will now promote itself [2020-02-03 11:26:26] [INFO] promote_command is: "/opt/pgsql/bin/repmgr standby promote -f /opt/repmgr.conf --log-to-file" [2020-02-03 11:26:26] [NOTICE] redirecting logging output to "/opt/data6000/repmgr.log" [2020-02-03 11:26:26] [NOTICE] promoting standby to primary [2020-02-03 11:26:26] [DETAIL] promoting server "node2" (ID: 2) using pg_promote() [2020-02-03 11:26:26] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete [2020-02-03 11:26:27] [NOTICE] STANDBY PROMOTE successful [2020-02-03 11:26:27] [DETAIL] server "node2" (ID: 2) was successfully promoted to primary [2020-02-03 11:26:27] [INFO] 0 followers to notify [2020-02-03 11:26:27] [INFO] switching to primary monitoring mode [2020-02-03 11:26:27] [NOTICE] monitoring cluster primary "node2" (ID: 2) --HA2查看集群状态,HA2已经promote成功 [postgres@ha2 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------- 1 | node1 | primary | - failed | | default | 100 | ? | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2 2 | node2 | primary | * running | | default | 100 | 14 | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2 WARNING: following issues were detected - unable to connect to node "node1" (ID: 1)
复制
5.HA1(故障服务未启动)重新加入集群
[postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf node rejoin -d'host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2' INFO: local node 1 can attach to rejoin target node 2 DETAIL: local node's recovery point: 0/20000028; rejoin target node's fork point: 0/200000A0 NOTICE: setting node 1's upstream to node 2 WARNING: unable to ping "host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2" DETAIL: PQping() returned "PQPING_NO_RESPONSE" NOTICE: starting server using "/opt/pgsql/bin/pg_ctl -w -D '/opt/data6000' start" NOTICE: NODE REJOIN successful DETAIL: node 1 is now attached to node 2 --HA1查看集群状态 [postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------- 1 | node1 | standby | running | node2 | default | 100 | 13 | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2 2 | node2 | primary | * running | | default | 100 | 14 | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2 --备库HA1执行checkpoint确保时间基线一致 [postgres@ha1 ~]$ /opt/pgsql/bin/psql -p6000 -Upostgres postgres=# delete from t; ERROR: cannot execute DELETE in a read-only transaction postgres=# checkpoint; CHECKPOINT postgres=# \q [postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------- 1 | node1 | standby | running | node2 | default | 100 | 14 | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2 2 | node2 | primary | * running | | default | 100 | 14 | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2 从上面可以看出通过node rejoin将刚才已经关闭的主库再次上线,并以备库的身份连到新主,加入集群成功。
复制
6.再次测试
[postgres@ha2 ~]$ /opt/pgsql/bin/psql -p6000 -Upostgres postgres=# insert into t values(5,'from node2'); INSERT 0 1 postgres=# select * from t; id | info ----+------------------------------- 1 | 2020-02-02 21:11:51.66735+08 2 | 2020-02-02 22:05:21.757823+08 3 | 2020-02-02 22:07:03.546043+08 4 | 2020-02-03 11:03:42.050227+08 5 | from node2 (5 rows) --关闭主库HA2 [postgres@ha2 ~]$ /opt/pgsql/bin/pg_ctl -D /opt/data6000/ stop --1分钟后查看备库HA1,可以看到已经切换为主库 [postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------- 1 | node1 | primary | * running | | default | 100 | 15 | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2 2 | node2 | primary | - failed | | default | 100 | ? | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2 WARNING: following issues were detected - unable to connect to node "node2" (ID: 2) --此时HA2先不直接使用node rejoin --HA1主库插入一些数据,并执行checkpoint [postgres@ha1 ~]$ /opt/pgsql/bin/psql -p6000 -Upostgres postgres=# insert into t values(6,'from node1'); INSERT 0 1 postgres=# select * from t; id | info ----+------------------------------- 1 | 2020-02-02 21:11:51.66735+08 2 | 2020-02-02 22:05:21.757823+08 3 | 2020-02-02 22:07:03.546043+08 4 | 2020-02-03 11:03:42.050227+08 5 | from node2 6 | from node1 (6 rows) postgres=# checkpoint; CHECKPOINT --HA2手工启动服务 [postgres@ha2 ~]$ /opt/pgsql/bin/pg_ctl -D /opt/data6000/ start & [postgres@ha2 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+-------+---------+----------------------+----------+----------+----------+----------+--------------------------------------------------------------------------- 1 | node1 | standby | ! running as primary | | default | 100 | 15 | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2 2 | node2 | primary | * running | | default | 100 | 14 | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2 WARNING: following issues were detected - node "node1" (ID: 1) is registered as standby but running as primary --HA2插入一些数据,并执行checkpoint [postgres@ha2 ~]$ /opt/pgsql/bin/psql -p6000 -Upostgres postgres=# insert into t values(7,'from node2'); INSERT 0 1 postgres=# select * from t; id | info ----+------------------------------- 1 | 2020-02-02 21:11:51.66735+08 2 | 2020-02-02 22:05:21.757823+08 3 | 2020-02-02 22:07:03.546043+08 4 | 2020-02-03 11:03:42.050227+08 5 | from node2 7 | from node2 (6 rows) --HA2关闭服务 [postgres@ha2 ~]$ /opt/pgsql/bin/pg_ctl -D /opt/data6000/ stop --HA2执行node rejoin [postgres@ha2 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf node rejoin -d'host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2' ERROR: this node cannot attach to rejoin target node 1 DETAIL: rejoin target server's timeline 15 forked off current database system timeline 14 before current recovery point 0/22000028 HINT: use --force-rewind to execute pg_rewind [postgres@ha2 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf node rejoin -d'host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2' --force-rewind NOTICE: pg_rewind execution required for this node to attach to rejoin target node 1 DETAIL: rejoin target server's timeline 15 forked off current database system timeline 14 before current recovery point 0/22000028 ERROR: --force-rewind specified but pg_rewind cannot be used DETAIL: "wal_log_hints" is set to "off" and data checksums are disabled 从上面可以看出node rejoin操作失败,因为两边库的数据已经不一致。
复制
7.增加wal_log_hints并恢复环境
[postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf primary unregister --node-id=2 INFO: node "node2" (ID: 2) was successfully unregistered [postgres@ha1 ~]$ vi /opt/data6000/postgresql.conf wal_log_hints=on [postgres@ha1 ~]$ /opt/pgsql/bin/pg_ctl -D /opt/data6000/ restart & [postgres@ha2 ~]$ rm -rf /opt/archive6000/* [postgres@ha2 ~]$ rm -rf /opt/data6000/* [postgres@ha2 ~]$ /opt/pgsql/bin/repmgr -h 192.168.99.200 -p6000 -U repmgr -d repmgr -f /opt/repmgr.conf standby clone [postgres@ha2 ~]$ vi /opt/data6000/postgresql.conf primary_conninfo='host=192.168.99.200 port=6000 user=repuser' [postgres@ha2 ~]$ /opt/pgsql/bin/pg_ctl -D /opt/data6000/ start & [postgres@ha2 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf standby register [postgres@ha2 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------- 1 | node1 | primary | * running | | default | 100 | 15 | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2 2 | node2 | standby | running | node1 | default | 100 | 15 | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2 [postgres@ha1 ~]$ > /opt/data6000/repmgr.log [postgres@ha2 ~]$ > /opt/data6000/repmgr.log [postgres@ha1 ~]$ /opt/pgsql/bin/repmgrd -f /opt/repmgr.conf [postgres@ha2 ~]$ /opt/pgsql/bin/repmgrd -f /opt/repmgr.conf
复制
8.再次测试
[postgres@ha1 ~]$ /opt/pgsql/bin/psql -p6000 -Upostgres postgres=# insert into t values(101,'from node1 '||now()); INSERT 0 1 postgres=# select * from t; id | info -----+------------------------------------------ 1 | 2020-02-02 21:11:51.66735+08 2 | 2020-02-02 22:05:21.757823+08 3 | 2020-02-02 22:07:03.546043+08 4 | 2020-02-03 11:03:42.050227+08 5 | from node2 6 | from node1 100 | from node1 2020-02-03 12:20:27.444394+08 101 | from node1 2020-02-03 12:33:26.365621+08 (8 rows) --关闭主库HA1 [postgres@ha1 ~]$ /opt/pgsql/bin/pg_ctl -D /opt/data6000/ stop --1分钟后查看备库HA2,可以看到已经切换为主库 [postgres@ha2 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------- 1 | node1 | primary | - failed | | default | 100 | ? | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2 2 | node2 | primary | * running | | default | 100 | 16 | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2 WARNING: following issues were detected - unable to connect to node "node1" (ID: 1) --HA2新主库插入一些数据,并执行checkpoint [postgres@ha2 ~]$ /opt/pgsql/bin/psql -p6000 -Upostgres postgres=# insert into t values(200,'from node2 '||now()); INSERT 0 1 postgres=# select * from t; id | info -----+------------------------------------------ 1 | 2020-02-02 21:11:51.66735+08 2 | 2020-02-02 22:05:21.757823+08 3 | 2020-02-02 22:07:03.546043+08 4 | 2020-02-03 11:03:42.050227+08 5 | from node2 6 | from node1 100 | from node1 2020-02-03 12:20:27.444394+08 101 | from node1 2020-02-03 12:33:26.365621+08 200 | from node2 2020-02-03 12:39:04.007665+08 (9 rows) postgres=# checkpoint; CHECKPOINT --HA1手工启动服务 [postgres@ha1 ~]$ /opt/pgsql/bin/pg_ctl -D /opt/data6000/ start & [postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+-------+---------+----------------------+----------+----------+----------+----------+--------------------------------------------------------------------------- 1 | node1 | primary | * running | | default | 100 | 15 | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2 2 | node2 | standby | ! running as primary | | default | 100 | 16 | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2 WARNING: following issues were detected - node "node2" (ID: 2) is registered as standby but running as primary --HA1插入一些数据,并执行checkpoint [postgres@ha1 ~]$ /opt/pgsql/bin/psql -p6000 -Upostgres postgres=# insert into t values(300,'from node1 '||now()); INSERT 0 1 postgres=# select * from t; id | info -----+------------------------------------------ 1 | 2020-02-02 21:11:51.66735+08 2 | 2020-02-02 22:05:21.757823+08 3 | 2020-02-02 22:07:03.546043+08 4 | 2020-02-03 11:03:42.050227+08 5 | from node2 6 | from node1 100 | from node1 2020-02-03 12:20:27.444394+08 101 | from node1 2020-02-03 12:33:26.365621+08 300 | from node1 2020-02-03 12:42:21.438444+08 (9 rows) --HA1关闭服务 [postgres@ha1 ~]$ /opt/pgsql/bin/pg_ctl -D /opt/data6000/ stop --HA1执行node rejoin [postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf node rejoin -d'host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2' --force-rewind NOTICE: pg_rewind execution required for this node to attach to rejoin target node 2 DETAIL: rejoin target server's timeline 16 forked off current database system timeline 15 before current recovery point 0/27000028 NOTICE: executing pg_rewind DETAIL: pg_rewind command is "/opt/pgsql/bin/pg_rewind -D '/opt/data6000' --source-server='host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2'" pg_rewind: servers diverged at WAL location 0/260000A0 on timeline 15 pg_rewind: rewinding from last common checkpoint at 0/26000028 on timeline 15 pg_rewind: Done! NOTICE: 0 files copied to /opt/data6000 NOTICE: setting node 1's upstream to node 2 WARNING: unable to ping "host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2" DETAIL: PQping() returned "PQPING_NO_RESPONSE" NOTICE: starting server using "/opt/pgsql/bin/pg_ctl -w -D '/opt/data6000' start" WARNING: unable to ping "host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2" DETAIL: PQping() returned "PQPING_REJECT" NOTICE: NODE REJOIN successful DETAIL: node 1 is now attached to node 2 --HA1查看集群状态 [postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------- 1 | node1 | standby | running | node2 | default | 100 | 15 | host=192.168.99.200 port=6000 user=repmgr dbname=repmgr connect_timeout=2 2 | node2 | primary | * running | | default | 100 | 16 | host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2 --HA1执行checkpoint或者服务restart,timeline保持一致 --验证数据 [postgres@ha1 ~]$ /opt/pgsql/bin/psql -p6000 -Upostgres postgres=# select * from t; id | info -----+------------------------------------------ 1 | 2020-02-02 21:11:51.66735+08 2 | 2020-02-02 22:05:21.757823+08 3 | 2020-02-02 22:07:03.546043+08 4 | 2020-02-03 11:03:42.050227+08 5 | from node2 6 | from node1 100 | from node1 2020-02-03 12:20:27.444394+08 101 | from node1 2020-02-03 12:33:26.365621+08 200 | from node2 2020-02-03 12:39:04.007665+08 (9 rows) postgres=# delete from t; ERROR: cannot execute DELETE in a read-only transaction [postgres@ha2 ~]$ /opt/pgsql/bin/psql -p6000 -Upostgres postgres=# select * from t; id | info -----+------------------------------------------ 1 | 2020-02-02 21:11:51.66735+08 2 | 2020-02-02 22:05:21.757823+08 3 | 2020-02-02 22:07:03.546043+08 4 | 2020-02-03 11:03:42.050227+08 5 | from node2 6 | from node1 100 | from node1 2020-02-03 12:20:27.444394+08 101 | from node1 2020-02-03 12:33:26.365621+08 200 | from node2 2020-02-03 12:39:04.007665+08 (9 rows) postgres=# insert into t values(400,'from node2 '||now()); INSERT 0 1
复制
从上面测试可以看出加上wal_log_hints后,主备库数据如果不一致时仍然可以使用node rejoin操作成功,新主库上线至旧主库rejoin之间时间段内,旧主库的写数据会丢失。
常见错误1
[postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf node rejoin -d’host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2’
ERROR: this node cannot attach to rejoin target node 2
DETAIL: rejoin target server’s timeline 8 forked off current database system timeline 7 before current recovery point 0/1A000028
HINT: use --force-rewind to execute pg_rewind
–分析:主备库时间线不一致
常见错误2
[postgres@ha1 ~]$ /opt/pgsql/bin/repmgr -f /opt/repmgr.conf node rejoin -d’host=192.168.99.202 port=6000 user=repmgr dbname=repmgr connect_timeout=2’ --force-rewind
NOTICE: pg_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server’s timeline 8 forked off current database system timeline 7 before current recovery point 0/1A000028
ERROR: --force-rewind specified but pg_rewind cannot be used
DETAIL: “wal_log_hints” is set to “off” and data checksums are disabled
–分析:数据库未开启wal_log_hints,数据目录初始化时也未打开checksums
常见错误3
timeline不一致问题,如何同步
1.备库执行checkpoint
2.关闭备库,执行node rejoin
保持联系
从2019年12月开始写第一篇文章,分享的初心一直在坚持,本人现在组建了一个PG乐知乐享交流群,欢迎关注我文章的小伙伴进群吹牛唠嗑,交流技术,互赞文章。
如果群二维码失效可以加我微信。