How to Recover a Failed Database Node in a Patroni Cluster

励志成为postgresql大神 2021-05-25


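Introduction

Today we look at the following question: in a Patroni cluster, a PostgreSQL node cannot start because the log files it needs have already been deleted. How do we recover it?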

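Changing parameters

Before digging into the problem, we need to set a few logging parameters so that errors are easier to observe. In a Patroni cluster, a parameter set directly in postgresql.conf may not take effect, and the setting may even be rolled back after the PostgreSQL instance restarts. So we use the patronictl edit-config command instead.

Running patronictl -c /etc/patroni.yml show-config lists the PostgreSQL-related parameters. Neither log_directory nor logging_collector is set.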

[postgres@133e0e204e206 pgdata]$ patronictl -c /etc/patroni.yml show-config
loop_wait: 10
master_start_timeout: 300
maximum_lag_on_failover: 1048576
postgresql:
  parameters:
    hot_standby: 'on'
    listen_addresses: 0.0.0.0
    max_replication_slots: 10
    max_wal_senders: 10
    port: 5432
    wal_keep_segments: 100
    wal_level: logical
    wal_log_hints: 'on'
  use_pg_rewind: true
  use_slots: true
retry_timeout: 10
synchronous_mode: false
ttl: 30

Let's set them. Running patronictl -c /etc/patroni.yml edit-config opens the configuration in a text editor.
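The edit itself is not shown in the article. As a minimal sketch, assuming the goal is to set logging_collector to on and log_directory to log (the directory name that PostgreSQL reports in a later log hint), the change amounts to merging two keys under postgresql.parameters in the dynamic configuration. The helper deep_merge and the trimmed current dictionary are illustrative only and are not part of Patroni.

```python
import json


def deep_merge(base, patch):
    """Return a new dict with patch merged into base, recursing into nested dicts."""
    merged = dict(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Trimmed copy of the show-config output shown earlier.
current = {
    "loop_wait": 10,
    "postgresql": {
        "parameters": {"hot_standby": "on", "wal_level": "logical"},
        "use_pg_rewind": True,
    },
}

# Assumed values: the article names the two parameters but not the values typed in.
logging_patch = {
    "postgresql": {
        "parameters": {"logging_collector": "on", "log_directory": "log"}
    }
}

updated = deep_merge(current, logging_patch)
print(json.dumps(updated["postgresql"]["parameters"], indent=2))
```

In the editor opened by edit-config, the equivalent change would be two extra lines under postgresql: parameters:, with the existing keys left as they are.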

Save the file once the changes are made, then reload the cluster configuration.

patronictl -c /etc/patroni.yml reload patnori-test

The reload takes effect on all three postgres nodes at once.

[postgres@133e0e204e206 pgdata]$ patronictl -c /etc/patroni.yml reload patnori-test
+ Cluster: patnori-test (6962171552537974697) --+----+-----------+
| Member    | Host          | Role    | State   | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| postgres1 | 133.0.204.206 | Leader  | running |  7 |           |
| postgres2 | 133.0.204.207 | Replica | running |  7 |         0 |
| postgres3 | 133.0.204.208 | Replica | running |  7 |         0 |
+-----------+---------------+---------+---------+----+-----------+
Are you sure you want to reload members postgres2, postgres1, postgres3? [y/N]: y
Reload request received for member postgres2 and will be processed within 10 seconds
Reload request received for member postgres1 and will be processed within 10 seconds
Reload request received for member postgres3 and will be processed within 10 seconds

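Next, restart the instance on each of the three nodes so that the parameter takes effect (changing logging_collector requires an instance restart).

pg_ctl stop
pg_ctl start

Patroni brings the PostgreSQL instance back up automatically. Checking the three nodes afterwards shows that the log directory has been created.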

Simulating a node failure

Next we simulate a node failure: we stop the cluster on all three nodes, then restart node 1, by itself, twice.

[postgres@133e0e204e206 ~]$ patronictl -c /etc/patroni.yml list
+ Cluster: patnori-test (6962171552537974697) --+----+-----------+-----------------+
| Member    | Host          | Role    | State   | TL | Lag in MB | Pending restart |
+-----------+---------------+---------+---------+----+-----------+-----------------+
| postgres1 | 133.0.204.206 | Leader  | running |  9 |           |                 |
| postgres2 | 133.0.204.207 | Replica | running |  9 |         0 | *               |
| postgres3 | 133.0.204.208 | Replica | running |  9 |         0 | *               |
+-----------+---------------+---------+---------+----+-----------+-----------------+

Node 1's timeline has now advanced from 9 to 11.

[postgres@133e0e204e206 ~]$ patronictl -c /etc/patroni.yml list
+ Cluster: patnori-test (6962171552537974697) -+----+-----------+
| Member    | Host          | Role   | State   | TL | Lag in MB |
+-----------+---------------+--------+---------+----+-----------+
| postgres1 | 133.0.204.206 | Leader | running | 11 |           |
+-----------+---------------+--------+---------+----+-----------+

Next, restart node 2.

[postgres@133e0e204e206 ~]$ patronictl -c /etc/patroni.yml list
+ Cluster: patnori-test (6962171552537974697) --+----+-----------+
| Member    | Host          | Role    | State   | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| postgres1 | 133.0.204.206 | Leader  | running | 11 |           |
| postgres2 | 133.0.204.207 | Replica | running | 11 |         0 |
+-----------+---------------+---------+---------+----+-----------+

Now look at the log on node 2.

2021-05-23 22:59:15.410 CST [10540] FATAL:  the database system is starting up
2021-05-23 22:59:15.414 CST [10539] LOG:  redo starts at 0/70004E0
2021-05-23 22:59:15.414 CST [10539] LOG:  consistent recovery state reached at 0/70005C8
2021-05-23 22:59:15.414 CST [10539] LOG:  invalid record length at 0/70005C8: wanted 24, got 0
2021-05-23 22:59:15.415 CST [10536] LOG:  database system is ready to accept read only connections
2021-05-23 22:59:15.422 CST [10545] LOG:  fetching timeline history file for timeline 10 from primary server
2021-05-23 22:59:15.430 CST [10545] LOG:  fetching timeline history file for timeline 11 from primary server
2021-05-23 22:59:15.434 CST [10539] LOG:  new target timeline is 11
2021-05-23 22:59:25.455 CST [10575] LOG:  started streaming WAL from primary at 0/7000000 on timeline 9
2021-05-23 22:59:25.455 CST [10575] LOG:  replication terminated by primary server
2021-05-23 22:59:25.455 CST [10575] DETAIL:  End of WAL reached on timeline 9 at 0/7000640.
2021-05-23 22:59:30.453 CST [10539] LOG:  invalid record length at 0/7000640: wanted 24, got 0
2021-05-23 22:59:30.454 CST [10575] LOG:  restarted WAL streaming at 0/7000000 on timeline 10
2021-05-23 22:59:30.516 CST [10575] LOG:  replication terminated by primary server
2021-05-23 22:59:30.516 CST [10575] DETAIL:  End of WAL reached on timeline 10 at 0/7000848.
2021-05-23 22:59:35.458 CST [10539] LOG:  invalid record length at 0/7000848: wanted 24, got 0
2021-05-23 22:59:35.459 CST [10575] LOG:  restarted WAL streaming at 0/7000000 on timeline 11

Two lines here are important.

fetching timeline history file for timeline 10 from primary server
fetching timeline history file for timeline 11 from primary server
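To pick these lines out of a longer PostgreSQL log, a small parser is enough. The following is a rough sketch, not a Patroni or PostgreSQL tool: the function name fetched_timelines is made up for illustration, and the sample text is copied from the node 2 log in this article.

```python
import re

# Matches the standby message that requests a timeline history file.
FETCH_RE = re.compile(
    r"fetching timeline history file for timeline (\d+) from primary server"
)


def fetched_timelines(log_text):
    """Return the requested timeline IDs in order of first appearance."""
    seen = []
    for match in FETCH_RE.finditer(log_text):
        timeline_id = int(match.group(1))
        if timeline_id not in seen:
            seen.append(timeline_id)
    return seen


sample = (
    "2021-05-23 22:59:15.422 CST [10545] LOG:  fetching timeline history file "
    "for timeline 10 from primary server\n"
    "2021-05-23 22:59:15.430 CST [10545] LOG:  fetching timeline history file "
    "for timeline 11 from primary server\n"
    "2021-05-23 22:59:15.434 CST [10539] LOG:  new target timeline is 11\n"
)

print(fetched_timelines(sample))  # [10, 11]
```

The result lists the timelines whose history files the replica had to fetch from the primary, which are exactly the files the next experiment deletes.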

If we delete the timeline history files on node 1 and node 2 before restarting node 3, node 3 will not be able to start.

Let's test this.

[root@133e0e204e206 pg_wal]# rm -rf 0000000A.history
[root@133e0e204e206 pg_wal]# rm -rf 0000000B.history
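The file names follow from the timeline IDs: PostgreSQL names a history file after the timeline ID written as eight uppercase hexadecimal digits, followed by .history, so timelines 10 and 11 map to 0000000A.history and 0000000B.history. A small sketch of that mapping, with illustrative function names that are not part of any PostgreSQL API:

```python
def history_file_name(timeline_id):
    """Build the history file name for a timeline ID (8 uppercase hex digits)."""
    if not 1 <= timeline_id <= 0xFFFFFFFF:
        raise ValueError(f"timeline ID out of range: {timeline_id}")
    return f"{timeline_id:08X}.history"


def timeline_from_history_file(file_name):
    """Recover the timeline ID from a name such as 0000000B.history."""
    stem, _, suffix = file_name.partition(".")
    if suffix != "history" or len(stem) != 8:
        raise ValueError(f"not a timeline history file name: {file_name}")
    return int(stem, 16)


for timeline_id in (10, 11):
    print(timeline_id, history_file_name(timeline_id))
```

Going the other way, timeline_from_history_file("0000000B.history") gives 11, which is the timeline named in the error messages that follow.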

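Start node 3. Because the timeline history file is missing, node 3 cannot come up as a working replica: the instance accepts read-only connections, but every attempt to fetch the history file for timeline 11 from the primary fails.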

2021-05-23 23:04:46.820 CST [6254] LOG:  consistent recovery state reached at 0/7000880
2021-05-23 23:04:46.820 CST [6254] LOG:  invalid record length at 0/7000880: wanted 24, got 0
2021-05-23 23:04:46.821 CST [6251] LOG:  database system is ready to accept read only connections
2021-05-23 23:04:46.827 CST [6260] LOG:  fetching timeline history file for timeline 11 from primary server
2021-05-23 23:04:46.828 CST [6260] FATAL:  could not receive timeline history file from the primary server: ERROR:  could not open file "pg_wal/0000000B.history": No such file or directory
2021-05-23 23:04:46.835 CST [6262] LOG:  fetching timeline history file for timeline 11 from primary server
2021-05-23 23:04:46.835 CST [6262] FATAL:  could not receive timeline history file from the primary server: ERROR:  could not open file "pg_wal/0000000B.history": No such file or directory
2021-05-23 23:04:51.838 CST [6295] LOG:  fetching timeline history file for timeline 11 from primary server
2021-05-23 23:04:51.838 CST [6295] FATAL:  could not receive timeline history file from the primary server: ERROR:  could not open file "pg_wal/0000000B.history": No such file or directory
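When the same error repeats every few seconds, it helps to reduce the log to the set of timelines whose history files are missing. This is a minimal sketch under the assumption that the error text matches the messages quoted in this article; missing_history_timelines is a hypothetical helper, not an existing tool.

```python
import re

# Error text as it appears in the node 3 log quoted in this article.
MISSING_RE = re.compile(
    r'could not open file "pg_wal/([0-9A-F]{8})\.history": '
    r"No such file or directory"
)


def missing_history_timelines(log_text):
    """Return the sorted, de-duplicated timeline IDs whose history file is missing."""
    return sorted({int(m.group(1), 16) for m in MISSING_RE.finditer(log_text)})


line = (
    "2021-05-23 23:04:46.828 CST [6260] FATAL:  could not receive timeline "
    "history file from the primary server: ERROR:  could not open file "
    '"pg_wal/0000000B.history": No such file or directory\n'
)

# The same line three times, as in the repeated failures above.
print(missing_history_timelines(line * 3))  # [11]
```

Because the pattern only looks at the could not open file part of the message, it would also match an error raised by another client that asks the primary for the same file.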

How do we recover at this point? We can use patronictl reinit to rebuild the failed instance. The command removes the data directory and restores it with pg_basebackup.

[postgres@133e0e204e208 ~]$ patronictl -c /etc/patroni.yml reinit patnori-test postgres3
+ Cluster: patnori-test (6962171552537974697) --+----+-----------+
| Member    | Host          | Role    | State   | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| postgres1 | 133.0.204.206 | Leader  | running | 11 |           |
| postgres2 | 133.0.204.207 | Replica | running | 11 |         0 |
| postgres3 | 133.0.204.208 | Replica | running | 10 |        16 |
+-----------+---------------+---------+---------+----+-----------+
Are you sure you want to reinitialize members postgres3? [y/N]:

The catch is that pg_basebackup fails as well, also reporting that 0000000B.history cannot be found.

May 23 23:17:16 133e0e204e208 patroni[7889]: 2021-05-23 23:17:16,508 INFO: Selected new etcd server http://133.0.204.205:2379
May 23 23:17:16 133e0e204e208 patroni[7889]: 2021-05-23 23:17:16,515 INFO: No PostgreSQL configuration items changed, nothing to reload.
May 23 23:17:16 133e0e204e208 patroni[7889]: 2021-05-23 23:17:16,522 INFO: Lock owner: postgres1; I am postgres3
May 23 23:17:16 133e0e204e208 patroni[7889]: 2021-05-23 23:17:16,527 INFO: trying to bootstrap from leader 'postgres1'
May 23 23:17:16 133e0e204e208 patroni[7889]: pg_basebackup: error: could not send replication command "TIMELINE_HISTORY": ERROR:  could not open file "pg_wal/0000000B.history": No such file or directory
May 23 23:17:16 133e0e204e208 patroni[7889]: pg_basebackup: error: child process exited with exit code 1
May 23 23:17:16 133e0e204e208 patroni[7889]: pg_basebackup: removing data directory "/app/pgdata"
May 23 23:17:16 133e0e204e208 patroni[7889]: 2021-05-23 23:17:16,892 ERROR: Error when fetching backup: pg_basebackup exited with code=1
May 23 23:17:16 133e0e204e208 patroni[7889]: 2021-05-23 23:17:16,892 WARNING: Trying again in 5 seconds

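The fix is not complicated: restart Patroni on node 1 once more so that the timeline changes again. The cluster then no longer needs this timeline history file, and recovery succeeds.

At this point node 2 is still on timeline 11 because of the same missing history file. It is handled the same way.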

[postgres@133e0e204e206 log]$ patronictl -c /etc/patroni.yml list
+ Cluster: patnori-test (6962171552537974697) --+----+-----------+
| Member    | Host          | Role    | State   | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| postgres1 | 133.0.204.206 | Leader  | running | 12 |           |
| postgres2 | 133.0.204.207 | Replica | running | 11 |        96 |
| postgres3 | 133.0.204.208 | Replica | running | 12 |         0 |
+-----------+---------------+---------+---------+----+-----------+
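A replica that needs attention can be spotted by comparing each member's TL column with the leader's. The sketch below parses the plain-text table printed by patronictl list, using the output above as sample input. Both function names are illustrative, and the parsing assumes the table layout shown in this article.

```python
def parse_member_table(text):
    """Turn the ASCII table from patronictl list into a list of row dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines and the cluster title line
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def replicas_off_leader_timeline(rows):
    """Return members whose TL differs from the leader's TL."""
    leader_tl = next((r["TL"] for r in rows if r["Role"] == "Leader"), None)
    if leader_tl is None:
        return []  # no leader in the table, nothing to compare against
    return [r["Member"] for r in rows
            if r["Role"] != "Leader" and r["TL"] != leader_tl]


table = """
+ Cluster: patnori-test (6962171552537974697) --+----+-----------+
| Member    | Host          | Role    | State   | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| postgres1 | 133.0.204.206 | Leader  | running | 12 |           |
| postgres2 | 133.0.204.207 | Replica | running | 11 |        96 |
| postgres3 | 133.0.204.208 | Replica | running | 12 |         0 |
+-----------+---------------+---------+---------+----+-----------+
"""

print(replicas_off_leader_timeline(parse_member_table(table)))  # ['postgres2']
```

Here only postgres2 is reported, which matches the situation in the article: node 3 has already been rebuilt on timeline 12, while node 2 is still on timeline 11.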

Run reinit on node 2.

The system log shows that reinit first removes the entire data directory, then selects the correct node to take the base backup from.

May 23 23:45:17 133e0e204e207 patroni: 2021-05-23 23:45:17,306 INFO: Removing data directory: /app/pgdata
May 23 23:45:17 133e0e204e207 shell_cmd: {postgres,/home/postgres,  122  ,2021-05-23 23:45:15,133.0.206.107,patronictl -c  /etc/patroni.yml reinit patnori-test postgres2}
May 23 23:45:18 133e0e204e207 patroni: 2021-05-23 23:45:18,146 INFO: replica has been created using basebackup
May 23 23:45:18 133e0e204e207 patroni: 2021-05-23 23:45:18,147 INFO: bootstrapped from leader 'postgres1'
May 23 23:45:18 133e0e204e207 patroni: 2021-05-23 23:45:18,149 INFO: closed patroni connection to the postgresql cluster
May 23 23:45:18 133e0e204e207 patroni: 2021-05-23 23:45:18.321 CST [14280] LOG:  redirecting log output to logging collector process
May 23 23:45:18 133e0e204e207 patroni: 2021-05-23 23:45:18.321 CST [14280] HINT:  Future log output will appear in directory "log".
May 23 23:45:18 133e0e204e207 patroni: 2021-05-23 23:45:18,326 INFO: postmaster pid=14280
May 23 23:45:18 133e0e204e207 patroni: localhost:5432 - rejecting connections
May 23 23:45:18 133e0e204e207 patroni: localhost:5432 - rejecting connections
May 23 23:45:19 133e0e204e207 patroni: localhost:5432 - accepting connections
May 23 23:45:22 133e0e204e207 patroni: 2021-05-23 23:45:22,180 INFO: Lock owner: postgres1; I am postgres2

Afterword

Here is a roundup of the earlier articles in this Patroni series:

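Before learning Patroni, first tackle etcd (1)
Continuing with etcd (2): a simple introduction to the Raft protocol
etcd (3): basic operations
Carelessly broke a node: etcd disaster recovery
Installing Patroni without internet access: Python, the king of torment, arrives
Patroni 2.0, finally up and running
A brief look at the Patroni 2.0 configuration file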


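This article is reposted from 励志成为postgresql大神. If you believe it infringes your rights, email contact@modb.pro to report it and provide supporting evidence. Once verified, 墨天轮 will remove the content immediately.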
