暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

纯 Raft 上的 Patroni

原创 CiciLee 2022-08-07
1417

自 2020 年 9 月及其 2.0 版本以来,Patroni 能够依赖 pysyncobj 模块来使用 python Raft 实现作为 DCS。

在这篇文章中,我们将设置一个演示集群来说明该功能。

安装

对于这个演示,我们将在 Streaming Replication 中安装 3 个 PostgreSQL 节点,在 Rocky Linux 8 上运行。

如果你熟悉 Vagrant,这里有一个简单的 Vagrantfile 来启动 3 个虚拟机:

# Vagrantfile
Vagrant.configure(2) do |config|
    config.vm.box = 'rockylinux/8'
    config.vm.provider 'libvirt' do |lv|
        lv.cpus = 1
        lv.memory = 1024
    end
    config.vm.synced_folder ".", "/vagrant", disabled: true

    nodes = 'srv1', 'srv2', 'srv3'
    nodes.each do |node|
        config.vm.define node do |conf|
            conf.vm.hostname = node
        end
    end

    config.vm.provision "shell", inline: <<-SHELL
        #-----------------------------
        sudo dnf install -y bind-utils
        #-----------------------------
    SHELL
end


复制

PostgreSQL

首先,让我们在所有节点上安装 PostgreSQL:

$ sudo dnf install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-8-x86_64/pgdg-redhat-repo-latest.noarch.rpm
$ sudo dnf -qy module disable postgresql
$ sudo dnf install -y postgresql14-server postgresql14-contrib
$ sudo systemctl disable postgresql-14

复制

Patroni 将引导(创建)初始 PostgreSQL 集群并负责启动服务,因此请确保对 PostgreSQL 服务禁用 systemctl。

看门狗

看门狗设备是软件或硬件机制,当它们在指定的时间范围内没有获得保持活动的心跳时,它们将重置整个系统。 这增加了额外的故障保护层,以防常见的 Patroni 脑裂保护机制发生故障。

Patroni 将是与看门狗设备交互的组件。 设置软件看门狗的权限:

$ cat <<EOF | sudo tee /etc/udev/rules.d/99-watchdog.rules
KERNEL=="watchdog", OWNER="postgres", GROUP="postgres"
EOF
$ sudo sh -c 'echo "softdog" >> /etc/modules-load.d/softdog.conf'
$ sudo modprobe softdog
$ sudo chown postgres: /dev/watchdog

复制

守护神

安装 Patroni 及其 Raft 依赖项:

$ sudo dnf install -y python39
$ sudo -iu postgres pip3 install --user --upgrade pip
$ sudo -iu postgres pip3 install --user setuptools_rust
$ sudo -iu postgres pip3 install --user psycopg[binary]>=3.0.0
$ sudo -iu postgres pip3 install --user patroni[raft]

复制

备注:Patroni 自 2021 年 12 月及其版本 2.1.2 起支持 psycopg3。

由于我们为 postgres 用户安装了 Patroni,让我们将其位置添加到用户 PATH:

$ sudo -u postgres sh -c 'echo "export PATH=\"/var/lib/pgsql/.local/bin:\$PATH\"" >> ~/.bash_profile'
$ sudo -iu postgres patroni --version
$ sudo -iu postgres syncobj_admin --help

复制

为 Raft 创建数据目录:

$ sudo mkdir /var/lib/raft
$ sudo chown postgres: /var/lib/raft

复制

守护神配置

我们需要定义参与 Raft 共识集群的 Patroni 节点列表。 要动态获取它,您可以使用这个简单的 shell 脚本(其中 srv1 srv2 srv3 是 3 个 Patroni 主机):

# Fetch the IP addresses of all Patroni hosts
MY_IP=$(hostname -I | awk ' {print $1}')
patroni_nodes=( srv1 srv2 srv3 )
i=0
for node in "${patroni_nodes[@]}"
do
  i=$i+1
  target_ip=$(dig +short $node)
  if [[ "$target_ip" = "$MY_IP" ]]; then
    continue
  fi
  target_array[$i]="'$target_ip:5010'"
done
RAFT_PARTNER_ADDRS=$(printf ",%s" "${target_array[@]}")
export RAFT_PARTNER_ADDRS="[${RAFT_PARTNER_ADDRS:1}]"
echo "partner_addrs: $RAFT_PARTNER_ADDRS"

复制

现在让我们在 /etc/patroni.yml 中定义 Patroni 配置:

$ CLUSTER_NAME="demo-cluster-1"
$ MY_NAME=$(hostname --short)
$ MY_IP=$(hostname -I | awk ' {print $1}')
$ cat <<EOF | sudo tee /etc/patroni.yml
scope: $CLUSTER_NAME
namespace: /db/
name: $MY_NAME

restapi:
  listen: "0.0.0.0:8008"
  connect_address: "$MY_IP:8008"
  authentication:
    username: patroni
    password: mySupeSecretPassword

raft:
  data_dir: /var/lib/raft
  self_addr: "$MY_IP:5010"
  partner_addrs: $RAFT_PARTNER_ADDRS

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: false
      use_slots: true
      parameters:
        archive_mode: "on"
        archive_command: "/bin/true"

  initdb:
  - encoding: UTF8
  - data-checksums
  - auth-local: peer
  - auth-host: scram-sha-256

  pg_hba:
  - host replication replicator 0.0.0.0/0 scram-sha-256
  - host all all 0.0.0.0/0 scram-sha-256

  # Some additional users which needs to be created after initializing new cluster
  users:
    admin:
      password: admin%
      options:
        - createrole
        - createdb

postgresql:
  listen: "0.0.0.0:5432"
  connect_address: "$MY_IP:5432"
  data_dir: /var/lib/pgsql/14/data
  bin_dir: /usr/pgsql-14/bin
  pgpass: /tmp/pgpass0
  authentication:
    replication:
      username: replicator
      password: confidential
    superuser:
      username: postgres
      password: my-super-password
    rewind:
      username: rewind_user
      password: rewind_password
  parameters:
    unix_socket_directories: '/var/run/postgresql,/tmp'

watchdog:
  mode: required
  device: /dev/watchdog
  safety_margin: 5

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
EOF

复制

除了与本地主机相关的MY_IP、MY_NAME 和$RAFT_PARTNER_ADDRS 外,所有Patroni 节点上的patroni.yml 配置应该相同。

根据 Patroni 安装源,如果在安装过程中没有创建 systemd 文件并启动 Patroni 服务:

$ cat <<EOF | sudo tee /etc/systemd/system/patroni.service
[Unit]
Description=Runners to orchestrate a high-availability PostgreSQL
After=syslog.target network.target

[Service]
Type=simple
User=postgres
Group=postgres
ExecStart=python3 /var/lib/pgsql/.local/bin/patroni /etc/patroni.yml
ExecReload=/bin/kill -s HUP \$MAINPID
KillMode=process
TimeoutSec=30
Restart=no

[Install]
WantedBy=multi-user.target
EOF

$ sudo systemctl daemon-reload
$ sudo systemctl enable patroni
$ sudo systemctl start patroni

复制

要检查 Raft 集群状态,请使用 syncobj_admin 命令:

$ sudo -iu postgres syncobj_admin -conn localhost:5010 -status

复制

要列出集群的成员,请使用patronictl 命令:

$ sudo -iu postgres patronictl -c /etc/patroni.yml topology
+--------+-----------------+---------+---------+----+-----------+
| Member | Host            | Role    | State   | TL | Lag in MB |
+ Cluster: demo-cluster-1 (7117556723320621508) ----+-----------+
| srv2   | 192.168.121.12  | Leader  | running |  1 |           |
| + srv1 | 192.168.121.126 | Replica | running |  1 |         0 |
| + srv3 | 192.168.121.194 | Replica | running |  1 |         0 |
+--------+-----------------+---------+---------+----+-----------+

复制

数据库连接

可以设置 HAProxy,而不是直接连接到数据库服务器,以便应用程序将连接到代理,然后将请求转发到 PostgreSQL。 当使用 HAproxy 时,还可以将只读请求路由到一个或多个副本,以实现负载平衡。 HAproxy 可以作为独立服务器安装,但也可以安装在应用程序服务器或数据库服务器本身上。

另一种可能性是使用支持客户端连接故障转移的 PostgreSQL 客户端库,如 libpq 和 jdbc。 连接字符串包含多个服务器(例如:host=srv1,srv2,srv3),客户端库在可用主机上循环以查找可用且能够进行读写或只读操作的连接。 此功能允许客户端在切换期间跟随主集群。

比如:

$ psql "host=srv1,srv2,srv3 dbname=postgres user=admin target_session_attrs=read-write" -c "SELECT pg_is_in_recovery();"
 pg_is_in_recovery
-------------------
 f
(1 row)

$ psql "host=srv1,srv2,srv3 dbname=postgres user=admin target_session_attrs=read-only" -c "SELECT pg_is_in_recovery();"
 pg_is_in_recovery
-------------------
 t
(1 row)

$ psql "host=srv1,srv2,srv3 dbname=postgres user=admin target_session_attrs=read-only" -c "\conninfo"
You are connected to database "postgres" as user "admin" on host "srv2" (address "192.168.121.36") at port "5432".

复制

自动故障转移测试

默认情况下,Patroni 将设置看门狗在 TTL 到期前 5 秒到期。 在默认设置 loop_wait=10 和 ttl=30 的情况下,这使 HA 循环至少有 15 秒(ttl - safety_margin - loop_wait)在系统被强制重置之前完成。 默认情况下,访问 DCS 配置为 10 秒后超时。 这意味着当 DCS 不可用时,例如由于网络问题,Patroni 和 PostgreSQL 将有至少 5 秒 (ttl - safety_margin - loop_wait - retry_timeout) 进入终止所有客户端连接的状态。

只需在领导节点上运行 pgbench 并断开 VM 网络接口几秒钟,即可注意到故障转移可能非常容易发生!

比如:

09:22:45,326 INFO: no action. I am (srv1), a secondary, and following a leader (srv2)
09:22:55,333 INFO: no action. I am (srv1), a secondary, and following a leader (srv2)
09:23:05,355 INFO: Got response from srv3 http://192.168.121.194:8008/patroni: {"state": "running", ...}
09:23:07,268 WARNING: Request failed to srv2: GET http://192.168.121.12:8008/patroni (...)
09:23:07,280 INFO: Software Watchdog activated with 25 second timeout, timing slack 15 seconds
09:23:07,319 INFO: promoted self to leader by acquiring session lock
09:23:07 srv1 python3[27101]: server promoting
09:23:07,340 INFO: cleared rewind state after becoming the leader
09:23:08,760 INFO: no action. I am (srv1), the leader with the lock

复制

当网络接口在 srv2 上恢复时,如果它接收到额外的数据,则复制可能会中断:

FATAL:  could not start WAL streaming: ERROR:  requested starting point 0/D9000000 on timeline 1 is not in this server's history
  DETAIL:  This server's history forked from timeline 1 at 0/CDBDA4A8.
LOG:  new timeline 2 forked off current database system timeline 1 before current recovery point 0/D9683DC8

复制

由于我们没有将 Patroni 配置为使用 pg_rewind,复制延迟可能会增长得非常快:

$ sudo -iu postgres patronictl -c /etc/patroni.yml list
+--------+-----------------+---------+---------+----+-----------+
| Member | Host            | Role    | State   | TL | Lag in MB |
+ Cluster: demo-cluster-1 (7117556723320621508) ----+-----------+
| srv1   | 192.168.121.126 | Leader  | running |  2 |           |
| srv2   | 192.168.121.12  | Replica | running |  1 |       169 |
| srv3   | 192.168.121.194 | Replica | running |  2 |         0 |
+--------+-----------------+---------+---------+----+-----------+

复制

非常有希望的,我们定义了一个 maximum_lag_on_failover 来防止故障备用服务器上的故障转移:

$ sudo -iu postgres patronictl -c /etc/patroni.yml switchover --candidate srv2 --force
Current cluster topology
+--------+-----------------+---------+---------+----+-----------+
| Member | Host            | Role    | State   | TL | Lag in MB |
+ Cluster: demo-cluster-1 (7117556723320621508) ----+-----------+
| srv1   | 192.168.121.126 | Leader  | running |  2 |           |
| srv2   | 192.168.121.12  | Replica | running |  1 |      1046 |
| srv3   | 192.168.121.194 | Replica | running |  2 |         0 |
+--------+-----------------+---------+---------+----+-----------+
Switchover failed, details: 503, Switchover failed

复制

从Patroni logs:

INFO: Member srv2 exceeds maximum replication lag
WARNING: manual failover: no healthy members found, failover is not possible

复制

我们必须重新初始化失败的备用:

$ sudo -iu postgres patronictl -c /etc/patroni.yml reinit demo-cluster-1 srv2
+--------+-----------------+---------+---------+----+-----------+
| Member | Host            | Role    | State   | TL | Lag in MB |
+ Cluster: demo-cluster-1 (7117556723320621508) ----+-----------+
| srv1   | 192.168.121.126 | Leader  | running |  2 |           |
| srv2   | 192.168.121.12  | Replica | running |  1 |      1575 |
| srv3   | 192.168.121.194 | Replica | running |  2 |         0 |
+--------+-----------------+---------+---------+----+-----------+
Are you sure you want to reinitialize members srv2? [y/N]: y
Success: reinitialize for member srv2

复制

从Patroni logs:

INFO: Removing data directory: /var/lib/pgsql/14/data
INFO: Lock owner: srv1; I am srv2
INFO: reinitialize in progress
...

复制

reinit 步骤默认使用 pg_basebackup 执行,没有快速检查点模式。 因此,根据检查点配置和数据库大小,可能需要很多时间。

总结

了解影响自动故障转移启动的参数以及切换/故障转移的后果,甚至是不使用 pg_rewind 的影响是非常重要的。

像往常一样,测试自己的配置很重要,一旦投入生产,更重要的是拥有一个良好的监控和警报系统!

原文标题:Patroni on pure Raft
原文作者:Stefan Fercot
原文地址:https://pgstef.github.io/2022/07/11/patroni_on_pure_raft.html

「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论