自 2020 年 9 月及其 2.0 版本以来,Patroni 能够依赖 pysyncobj 模块来使用 python Raft 实现作为 DCS。
在这篇文章中,我们将设置一个演示集群来说明该功能。
安装
对于这个演示,我们将在 Streaming Replication 中安装 3 个 PostgreSQL 节点,在 Rocky Linux 8 上运行。
如果你熟悉 Vagrant,这里有一个简单的 Vagrantfile 来启动 3 个虚拟机:
# Vagrantfile Vagrant.configure(2) do |config| config.vm.box = 'rockylinux/8' config.vm.provider 'libvirt' do |lv| lv.cpus = 1 lv.memory = 1024 end config.vm.synced_folder ".", "/vagrant", disabled: true nodes = 'srv1', 'srv2', 'srv3' nodes.each do |node| config.vm.define node do |conf| conf.vm.hostname = node end end config.vm.provision "shell", inline: <<-SHELL #----------------------------- sudo dnf install -y bind-utils #----------------------------- SHELL end
复制
PostgreSQL
首先,让我们在所有节点上安装 PostgreSQL:
$ sudo dnf install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-8-x86_64/pgdg-redhat-repo-latest.noarch.rpm $ sudo dnf -qy module disable postgresql $ sudo dnf install -y postgresql14-server postgresql14-contrib $ sudo systemctl disable postgresql-14
复制
Patroni 将引导(创建)初始 PostgreSQL 集群并负责启动服务,因此请确保对 PostgreSQL 服务禁用 systemctl。
看门狗
看门狗设备是软件或硬件机制,当它们在指定的时间范围内没有获得保持活动的心跳时,它们将重置整个系统。 这增加了额外的故障保护层,以防常见的 Patroni 脑裂保护机制发生故障。
Patroni 将是与看门狗设备交互的组件。 设置软件看门狗的权限:
$ cat <<EOF | sudo tee /etc/udev/rules.d/99-watchdog.rules KERNEL=="watchdog", OWNER="postgres", GROUP="postgres" EOF $ sudo sh -c 'echo "softdog" >> /etc/modules-load.d/softdog.conf' $ sudo modprobe softdog $ sudo chown postgres: /dev/watchdog
复制
守护神
安装 Patroni 及其 Raft 依赖项:
$ sudo dnf install -y python39 $ sudo -iu postgres pip3 install --user --upgrade pip $ sudo -iu postgres pip3 install --user setuptools_rust $ sudo -iu postgres pip3 install --user psycopg[binary]>=3.0.0 $ sudo -iu postgres pip3 install --user patroni[raft]
复制
备注:Patroni 自 2021 年 12 月及其版本 2.1.2 起支持 psycopg3。
由于我们为 postgres 用户安装了 Patroni,让我们将其位置添加到用户 PATH:
$ sudo -u postgres sh -c 'echo "export PATH=\"/var/lib/pgsql/.local/bin:\$PATH\"" >> ~/.bash_profile' $ sudo -iu postgres patroni --version $ sudo -iu postgres syncobj_admin --help
复制
为 Raft 创建数据目录:
$ sudo mkdir /var/lib/raft $ sudo chown postgres: /var/lib/raft
复制
守护神配置
我们需要定义参与 Raft 共识集群的 Patroni 节点列表。 要动态获取它,您可以使用这个简单的 shell 脚本(其中 srv1 srv2 srv3 是 3 个 Patroni 主机):
# Fetch the IP addresses of all Patroni hosts MY_IP=$(hostname -I | awk ' {print $1}') patroni_nodes=( srv1 srv2 srv3 ) i=0 for node in "${patroni_nodes[@]}" do i=$i+1 target_ip=$(dig +short $node) if [[ "$target_ip" = "$MY_IP" ]]; then continue fi target_array[$i]="'$target_ip:5010'" done RAFT_PARTNER_ADDRS=$(printf ",%s" "${target_array[@]}") export RAFT_PARTNER_ADDRS="[${RAFT_PARTNER_ADDRS:1}]" echo "partner_addrs: $RAFT_PARTNER_ADDRS"
复制
现在让我们在 /etc/patroni.yml 中定义 Patroni 配置:
$ CLUSTER_NAME="demo-cluster-1" $ MY_NAME=$(hostname --short) $ MY_IP=$(hostname -I | awk ' {print $1}') $ cat <<EOF | sudo tee /etc/patroni.yml scope: $CLUSTER_NAME namespace: /db/ name: $MY_NAME restapi: listen: "0.0.0.0:8008" connect_address: "$MY_IP:8008" authentication: username: patroni password: mySupeSecretPassword raft: data_dir: /var/lib/raft self_addr: "$MY_IP:5010" partner_addrs: $RAFT_PARTNER_ADDRS bootstrap: dcs: ttl: 30 loop_wait: 10 retry_timeout: 10 maximum_lag_on_failover: 1048576 postgresql: use_pg_rewind: false use_slots: true parameters: archive_mode: "on" archive_command: "/bin/true" initdb: - encoding: UTF8 - data-checksums - auth-local: peer - auth-host: scram-sha-256 pg_hba: - host replication replicator 0.0.0.0/0 scram-sha-256 - host all all 0.0.0.0/0 scram-sha-256 # Some additional users which needs to be created after initializing new cluster users: admin: password: admin% options: - createrole - createdb postgresql: listen: "0.0.0.0:5432" connect_address: "$MY_IP:5432" data_dir: /var/lib/pgsql/14/data bin_dir: /usr/pgsql-14/bin pgpass: /tmp/pgpass0 authentication: replication: username: replicator password: confidential superuser: username: postgres password: my-super-password rewind: username: rewind_user password: rewind_password parameters: unix_socket_directories: '/var/run/postgresql,/tmp' watchdog: mode: required device: /dev/watchdog safety_margin: 5 tags: nofailover: false noloadbalance: false clonefrom: false nosync: false EOF
复制
除了与本地主机相关的MY_IP、MY_NAME 和$RAFT_PARTNER_ADDRS 外,所有Patroni 节点上的patroni.yml 配置应该相同。
根据 Patroni 安装源,如果在安装过程中没有创建 systemd 文件并启动 Patroni 服务:
$ cat <<EOF | sudo tee /etc/systemd/system/patroni.service [Unit] Description=Runners to orchestrate a high-availability PostgreSQL After=syslog.target network.target [Service] Type=simple User=postgres Group=postgres ExecStart=python3 /var/lib/pgsql/.local/bin/patroni /etc/patroni.yml ExecReload=/bin/kill -s HUP \$MAINPID KillMode=process TimeoutSec=30 Restart=no [Install] WantedBy=multi-user.target EOF $ sudo systemctl daemon-reload $ sudo systemctl enable patroni $ sudo systemctl start patroni
复制
要检查 Raft 集群状态,请使用 syncobj_admin 命令:
$ sudo -iu postgres syncobj_admin -conn localhost:5010 -status
复制
要列出集群的成员,请使用patronictl 命令:
$ sudo -iu postgres patronictl -c /etc/patroni.yml topology +--------+-----------------+---------+---------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | + Cluster: demo-cluster-1 (7117556723320621508) ----+-----------+ | srv2 | 192.168.121.12 | Leader | running | 1 | | | + srv1 | 192.168.121.126 | Replica | running | 1 | 0 | | + srv3 | 192.168.121.194 | Replica | running | 1 | 0 | +--------+-----------------+---------+---------+----+-----------+
复制
数据库连接
可以设置 HAProxy,而不是直接连接到数据库服务器,以便应用程序将连接到代理,然后将请求转发到 PostgreSQL。 当使用 HAproxy 时,还可以将只读请求路由到一个或多个副本,以实现负载平衡。 HAproxy 可以作为独立服务器安装,但也可以安装在应用程序服务器或数据库服务器本身上。
另一种可能性是使用支持客户端连接故障转移的 PostgreSQL 客户端库,如 libpq 和 jdbc。 连接字符串包含多个服务器(例如:host=srv1,srv2,srv3),客户端库在可用主机上循环以查找可用且能够进行读写或只读操作的连接。 此功能允许客户端在切换期间跟随主集群。
比如:
$ psql "host=srv1,srv2,srv3 dbname=postgres user=admin target_session_attrs=read-write" -c "SELECT pg_is_in_recovery();" pg_is_in_recovery ------------------- f (1 row) $ psql "host=srv1,srv2,srv3 dbname=postgres user=admin target_session_attrs=read-only" -c "SELECT pg_is_in_recovery();" pg_is_in_recovery ------------------- t (1 row) $ psql "host=srv1,srv2,srv3 dbname=postgres user=admin target_session_attrs=read-only" -c "\conninfo" You are connected to database "postgres" as user "admin" on host "srv2" (address "192.168.121.36") at port "5432".
复制
自动故障转移测试
默认情况下,Patroni 将设置看门狗在 TTL 到期前 5 秒到期。 在默认设置 loop_wait=10 和 ttl=30 的情况下,这使 HA 循环至少有 15 秒(ttl - safety_margin - loop_wait)在系统被强制重置之前完成。 默认情况下,访问 DCS 配置为 10 秒后超时。 这意味着当 DCS 不可用时,例如由于网络问题,Patroni 和 PostgreSQL 将有至少 5 秒 (ttl - safety_margin - loop_wait - retry_timeout) 进入终止所有客户端连接的状态。
只需在领导节点上运行 pgbench 并断开 VM 网络接口几秒钟,即可注意到故障转移可能非常容易发生!
比如:
09:22:45,326 INFO: no action. I am (srv1), a secondary, and following a leader (srv2) 09:22:55,333 INFO: no action. I am (srv1), a secondary, and following a leader (srv2) 09:23:05,355 INFO: Got response from srv3 http://192.168.121.194:8008/patroni: {"state": "running", ...} 09:23:07,268 WARNING: Request failed to srv2: GET http://192.168.121.12:8008/patroni (...) 09:23:07,280 INFO: Software Watchdog activated with 25 second timeout, timing slack 15 seconds 09:23:07,319 INFO: promoted self to leader by acquiring session lock 09:23:07 srv1 python3[27101]: server promoting 09:23:07,340 INFO: cleared rewind state after becoming the leader 09:23:08,760 INFO: no action. I am (srv1), the leader with the lock
复制
当网络接口在 srv2 上恢复时,如果它接收到额外的数据,则复制可能会中断:
FATAL: could not start WAL streaming: ERROR: requested starting point 0/D9000000 on timeline 1 is not in this server's history DETAIL: This server's history forked from timeline 1 at 0/CDBDA4A8. LOG: new timeline 2 forked off current database system timeline 1 before current recovery point 0/D9683DC8
复制
由于我们没有将 Patroni 配置为使用 pg_rewind,复制延迟可能会增长得非常快:
$ sudo -iu postgres patronictl -c /etc/patroni.yml list +--------+-----------------+---------+---------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | + Cluster: demo-cluster-1 (7117556723320621508) ----+-----------+ | srv1 | 192.168.121.126 | Leader | running | 2 | | | srv2 | 192.168.121.12 | Replica | running | 1 | 169 | | srv3 | 192.168.121.194 | Replica | running | 2 | 0 | +--------+-----------------+---------+---------+----+-----------+
复制
非常有希望的,我们定义了一个 maximum_lag_on_failover 来防止故障备用服务器上的故障转移:
$ sudo -iu postgres patronictl -c /etc/patroni.yml switchover --candidate srv2 --force Current cluster topology +--------+-----------------+---------+---------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | + Cluster: demo-cluster-1 (7117556723320621508) ----+-----------+ | srv1 | 192.168.121.126 | Leader | running | 2 | | | srv2 | 192.168.121.12 | Replica | running | 1 | 1046 | | srv3 | 192.168.121.194 | Replica | running | 2 | 0 | +--------+-----------------+---------+---------+----+-----------+ Switchover failed, details: 503, Switchover failed
复制
从Patroni logs:
INFO: Member srv2 exceeds maximum replication lag WARNING: manual failover: no healthy members found, failover is not possible
复制
我们必须重新初始化失败的备用:
$ sudo -iu postgres patronictl -c /etc/patroni.yml reinit demo-cluster-1 srv2 +--------+-----------------+---------+---------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | + Cluster: demo-cluster-1 (7117556723320621508) ----+-----------+ | srv1 | 192.168.121.126 | Leader | running | 2 | | | srv2 | 192.168.121.12 | Replica | running | 1 | 1575 | | srv3 | 192.168.121.194 | Replica | running | 2 | 0 | +--------+-----------------+---------+---------+----+-----------+ Are you sure you want to reinitialize members srv2? [y/N]: y Success: reinitialize for member srv2
复制
从Patroni logs:
INFO: Removing data directory: /var/lib/pgsql/14/data INFO: Lock owner: srv1; I am srv2 INFO: reinitialize in progress ...
复制
reinit 步骤默认使用 pg_basebackup 执行,没有快速检查点模式。 因此,根据检查点配置和数据库大小,可能需要很多时间。
总结
了解影响自动故障转移启动的参数以及切换/故障转移的后果,甚至是不使用 pg_rewind 的影响是非常重要的。
像往常一样,测试自己的配置很重要,一旦投入生产,更重要的是拥有一个良好的监控和警报系统!
原文标题:Patroni on pure Raft
原文作者:Stefan Fercot
原文地址:https://pgstef.github.io/2022/07/11/patroni_on_pure_raft.html