Patroni
Patroni是一个使用异步流复制、Etcd和HAProxy的PostgreSQL高可用性解决方案。要了解更多细节,可以阅读Patroni文档,它非常好,涵盖了很多细节。在这篇文章中,我将解释:
- Patroni架构
- Ansible集群设置
- 验证patroni/etcd集群
- 切换
- 修复不同步问题
体系结构
Patroni代理负责在每个节点上运行PostgreSQL,并提供用于检查特定节点当前运行状况的API。HAProxy也运行在每个节点上,使用这个API,并为终端用户提供访问集群主节点的单一端点,主节点是管理整个集群所有读/写操作的节点,通常称为主节点。
这个端点可以同时用于读和写,并且可以在集群中端口“5000”上的每个节点上使用。Patroni集群还可以提供一个额外的HAproxy端点,它只接受端口‘6000’上可用的读取。Patroni REST API可直接在端口’ 8008 '上的每个节点上使用。
标准安装让端点在每个节点上运行,但不让应用程序使用全局端点。最好是添加另一个带负载均衡器的层,这样就可以有一个进入集群中可用节点的单一入口点,而不必关心在客户端使用哪个入口点
使用Ansible与Patroni创建HA PostgreSQL集群
Ansible是一套软件工具,可以将基础设施转换为代码。它是开源的,套件包括软件供应、配置管理和应用程序部署功能。要了解更多细节,请查看ansible文档
这里有一个包含两个剧本的例子。第一个ansible任务将在谷歌云平台(GCP)上设置三个计算实例,这些计算实例将用于部署PostgreSQL数据库。第二个ansible任务将把这三个节点配置为一个Patroni集群。
# de-test-patroni-cluster.yml
- name: Launch VM(s)
hosts: localhost
gather_facts: false
roles:
- role: gcp_vm
gcp_vm_instances:
- name: de-gcp-patroni-example-01
- name: de-gcp-patroni-example-02
- name: de-gcp-patroni-example-03
gcp_vm_options:
defaults:
state: present
os_image: ubuntu1804
type: n1-standard-1
ansible_group: patroni
- role: gcp_lb
gcp_lb_name: de-gcp-patroni-example
gcp_lb_draining_timeout_secs: 60
gcp_lb_healthcheck:
tcp_port: 5000
gcp_lb_instances: patroni # This is the name of the Ansible host group containing the instances we want to use with this load balancer.
gcp_lb_dns_name: de-gcp-patroni-example
- name: Configure VM(s)
hosts: patroni
become: true
roles:
- role: patroni
patroni_etcd_cluster_name: my_etcd_cluster
patroni_etcd_master_group_name: patroni
patroni_master_node: de-gcp-patroni-example-01
patroni_replication_nodes: "{{ groups['patroni'] }}"
patroni_dbbackup: true
patroni_users:
- name: myuser
pass: mypass
patroni_databases:
- name: mydatabase
owner: myuser
- name: myseconddatabase
owner: myuser
patroni_database_schemas:
- database: mydatabae
schema: myschema
owner: myuser
patroni_user_privileges:
- name: myuser
db: mydatabase
priv: ALL
role_attr_flags: NOCREATEDB
复制
Ansible playbook可以这样运行(上面的Ansible代码假设您使用的是Ansible version>2.1,可以在此处找到安装说明)
ansible-playbook de-test-patroni-cluster.yml
复制
在运行这个剧本之后,最终用户可以通过GCP负载均衡器’ de-gcp-patroni-example.de.gcp.cloud ‘在端口’ 5000 ‘上访问PostgreSQL集群进行读写,或者通过端口’ 6000 '访问PostgreSQL集群。
验证集群状态
要使集群可用,Patroni需要三个节点中的至少两个节点可操作。另一个好处是,Patroni完全管理PostgreSQL数据库。在打开和关闭它方面,您可以通过以下方法验证一个Patroni集群和数据库的状态:
HTTP API
向端口’ 8008 '上的一个节点发出HTTP请求。您可以在任何与您的Patroni集群有ssh连接的终端上运行它(下面的例子使用的是mac上的iterm终端)。您应该在响应中得到一个JSON文档,其中显示节点的角色、关于其他复制节点的详细信息以及它们的状态。举个例子,对用上面代码创建的集群中的第一个节点使用curl,将显示以下输出:
$ curl de-gcp-patroni-example-01:8008 { "database_system_identifier": "6820776744425517415", "postmaster_start_time": "2022-04-28 14:57:20.932 UTC", "timeline": 1, "cluster_unlocked": false, "patroni": { "scope": "patroniha", "version": "1.6.1" }, "replication": [ { "sync_state": "async", "sync_priority": 0, "client_addr": "10.80.129.111", "state": "streaming", "application_name": "de_gcp_patroni_example_03", "usename": "replicator" }, { "sync_state": "async", "sync_priority": 0, "client_addr": "10.80.129.113", "state": "streaming", "application_name": "de_gcp_patroni_example_02", "usename": "replicator" } ], "state": "running", "role": "master", "xlog": { "location": 67125840 }, "server_version": 110007 }
复制
Patronictl
Patroni附带了一个名为patronictl的CLI实用程序。可以使用此命令行实用程序执行与Postgres数据库或集群相关的任何管理操作。您可以使用命令行接口,通过向其中一个节点执行sh命令来描述当前的集群状态,并运行以下命令:
$ patronictl -d etcd://127.0.0.1:2379 list patroniha
+-----------+---------------------------+--------------------+--------+---------+----+-----------+
| Cluster | Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------------------+--------------------+--------+---------+----+-----------+
| patroniha | de_gcp_patroni_example_01 | 10.80.131.214:5433 | Leader | running | 1 | |
| patroniha | de_gcp_patroni_example_02 | 10.80.129.113:5433 | | running | 1 | 0.0 |
| patroniha | de_gcp_patroni_example_03 | 10.80.129.111:5433 | | running | 1 | 0.0 |
+-----------+---------------------------+--------------------+--------+---------+----+-----------+
复制
检查日志
Patroni是通过SystemD进行管理的,因此,您可以通过sh命令获取特定服务器的错误日志,然后使用’ journalctl '进行查询。例如,在主机上:
$ sudo journalctl -f -u patroni
Apr 28 16:06:21 de-gcp-patroni-example-01 patroni[20782]: 2022-04-28 16:06:21,983 INFO: Lock owner: de_gcp_patroni_example_01; I am de_gcp_patroni_example_01
Apr 28 16:06:22 de-gcp-patroni-example-01 patroni[20782]: 2022-04-28 16:06:22,017 INFO: no action. i am the leader with the lock
复制
在第二张照片上,你会看到一些稍微不同的东西:
$ sudo journalctl -f -u patroni
Apr 28 16:08:11 de-gcp-patroni-example-02 patroni[20552]: 2022-04-28 16:08:11,990 INFO: does not have lock
Apr 28 16:08:11 de-gcp-patroni-example-02 patroni[20552]: 2022-04-28 16:08:11,997 INFO: no action. i am a secondary and i am following a leader
复制
Etcd集群状态验证
Etcd用于Patroni在节点之间分发配置和状态,因此它还必须是健康的,才能进行正确的集群操作。
您可以通过ssh到其中一个节点,并执行以下命令检查集群的状态:
$ etcdctl cluster-health patroniha member 13a17454e5cb0797 is healthy: got healthy result from http://10.80.129.113:2379 member 4ba8a9618e42f288 is healthy: got healthy result from http://10.80.129.111:2379 member 6a453ce7684a6c8d is healthy: got healthy result from http://10.80.131.214:2379 cluster is healthy
复制
手动切换
使用命令行界面,您可以强制Patroni切换到一个新的leader。切换将立即发生,并需要少量停机时间(集群不可用)。
- 首先,您应该SSH到集群中的一个节点并验证当前的集群状态。
$ patronictl -d etcd://127.0.0.1:2379 list patroniha
+-----------+---------------------------+--------------------+--------+---------+----+-----------+
| Cluster | Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------------------+--------------------+--------+---------+----+-----------+
| patroniha | de_gcp_patroni_example_01 | 10.80.131.214:5433 | Leader | running | 1 | |
| patroniha | de_gcp_patroni_example_02 | 10.80.129.113:5433 | | running | 1 | 0.0 |
| patroniha | de_gcp_patroni_example_03 | 10.80.129.111:5433 | | running | 1 | 0.0 |
+-----------+---------------------------+--------------------+--------+---------+----+-----------+
复制
2.接下来,您可以触发切换并指定要创建新引线的节点。您还可以选择在将来安排切换或立即执行切换:
$ patronictl -d etcd://127.0.0.1:2379 switchover patroniha
Master [de_gcp_patroni_example_01]: <return>
Candidate ['de_gcp_patroni_example_02', 'de_gcp_patroni_example_03'] []: de_gcp_patroni_example_02
When should the switchover take place (e.g. 2020-04-28T17:15 ) [now]: <return>
Current cluster topology
+-----------+---------------------------+--------------------+--------+---------+----+-----------+
| Cluster | Member | Host | Role | State | TL | Lag in MB |
+-----------+--------------------------+--------------------+--------+---------+----+-----------+
| patroniha | de_gcp_patroni_example_01 | 10.80.131.214:5433 | Leader | running | 1 | |
| patroniha | de_gcp_patroni_example_02 | 10.80.129.113:5433 | | running | 1 | 0.0 |
| patroniha | de_gcp_patroni_example_03 | 10.80.129.111:5433 | | running | 1 | 0.0 |
+-----------+---------------------------+--------------------+--------+---------+----+-----------+
Are you sure you want to switchover cluster patroniha, demoting current master de_gcp_patroni_example_01? [y/N]:
复制
您可以验证操作是否正确,然后输入“y”继续,然后您将看到以下内容:
2020-04-28 16:17:09.50350 Successfully switched over to "de_gcp_patroni_example_02"
+-----------+---------------------------+--------------------+--------+---------+----+-----------+
| Cluster | Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------------------+--------------------+--------+---------+----+-----------+
| patroniha | de_gcp_patroni_example_01 | 10.80.131.214:5433 | | stopped | | unknown |
| patroniha | de_gcp_patroni_example_02 | 10.80.129.113:5433 | Leader | running | 1 | |
| patroniha | de_gcp_patroni_example_03 | 10.80.129.111:5433 | | running | 1 | 0.0 |
+-----------+---------------------------+--------------------+--------+---------+----+-----------+
复制
3.最后,您可以通过再次检查状态来验证切换现已完成:
$ patronictl -d etcd://127.0.0.1:2379 list patroniha
+-----------+---------------------------+--------------------+--------+---------+----+-----------+
| Cluster | Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------------------+--------------------+--------+---------+----+-----------+
| patroniha | de_gcp_patroni_example_01 | 10.80.131.214:5433 | | running | 2 | 0.0 |
| patroniha | de_gcp_patroni_example_02 | 10.80.129.113:5433 | Leader | running | 2 | |
| patroniha | de_gcp_patroni_example_03 | 10.80.129.111:5433 | | running | 2 | 0.0 |
+-----------+---------------------------+--------------------+--------+---------+----+-----------+
复制
修复了Patroni不同步的问题
在某些情况下,PostgreSQL备节点可能会不同步。例如,当主服务器有大量事务时,或者生成了大量的归档文件,消耗了大量的磁盘空间,或者由于错误删除了一些Wal文件,或者由于网络问题删除了一些Wal文件。
您可以按照以下步骤确定辅助节点是否同步,如果不同步,则如何使用Patroni进行修复。
- 首先,您应该SSH到集群中的主节点并验证当前集群状态;您可以在任何集群节点上执行以下命令,以找出哪个是主节点。
$ patronictl -d etcd://127.0.0.1:2379 list patroniha
root@de-gcp-patroni-example-01:~# patronictl -d etcd://127.0.0.1:2379 list patroniha
+ Cluster: patroniha (6830699472358332339) ------+--------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+---------------------------+--------------------+--------+---------+----+-----------+
| de_gcp_patroni_example_01 | 10.80.128.61:5433 | Leader | running | 2 | |
| de_gcp_patroni_example_02 | 10.80.128.160:5433 | | running | 2 | 75 |
| de_gcp_patroni_example_03 | 10.80.128.236:5433 | | running | 2 | 0 |
+---------------------------+--------------------+--------+---------+----+-----------+
复制
如果您发现’ Lag in MB '大于任何二级服务器的零,这意味着它的同步有问题,可以在节点中使用命令reinit来解决这个问题。
- 登录集群的主节点,执行以下命令。Patroni有一个内置的功能reinit,使待机再次同步,节省您的手工工作。该命令将强制集群节点从主节点同步数据。
patronictl -d etcd://127.0.0.1:2379 reinit cluster_name node_name
复制
例如:
root@de-gcp-patroni-example-01:~# patronictl -d etcd://127.0.0.1:2379 reinit patroniha de_gcp_patroni_example_02
+ Cluster: patroniha (6830699472358332339) ------+--------+---------+----+------------+
| Member | Host | Role | State | TL | Lag in MB |
+---------------------------+--------------------+--------+---------+----+------------+
| de_gcp_patroni_example_01 | 10.80.128.61:5433 | Leader | running | 2 | |
| de_gcp_patroni_example_02 | 10.80.128.160:5433 | | running | 2 | 75 |
| de_gcp_patroni_example_03 | 10.80.128.236:5433 | | running | 2 | 0 |
+---------------------------+--------------------+--------+---------+----+------------+
Are you sure you want to reinitialize members de_gcp_patroni_example_02? [y/N]: y
Success: reinitialize for member de_gcp_patroni_example_02
root@de-gcp-patroni-example-01:~#
复制
登录节点de_gcp_patroni_example_02,也可以在servicecenter服务日志中进行验证。
-
ssh de-gcp-patroni-example-02
-
作为根用户运行
root@de-gcp-patroni-example-02:~# journalctl -f -u patroni
复制
您将在日志中看到一些信息:
2022-05-26 08:54:41.748 UTC [2846] LOG: database system is shut down
2022-05-26 08:54:41,760 INFO: Removing data directory: /data/postgresql/db
2022-05-26 0854:46,627 INFO: replica has been created using basebackup
2022-05-26 08:54:46,628 INFO: bootstrapped from leader 'de_gcp_patroni_example_01'
2022-05-26 08:54:46,633 INFO: closed patroni connection to the postgresql cluster
2022-05-26 08:54:47,442 INFO: postmaster pid=4426
localhost:5433 - no response
2022-05-26 08:54:47.713 UTC [4426] LOG: starting PostgreSQL 12.3 (Ubuntu 12.3-1.pgdg18.04+1) on x86_64-pc
And, normal logging agai
2022-05-26 08:55:00,441 INFO: no action. i am a secondary and i am following a leader
2022-05-26 08:55:10,433 INFO: Lock owner: de_gcp_patroni_example_01; I am de_gcp_patroni_example_02
2022-05-26 08:55:10,433 INFO: does not have lock
2022-05-26 08:55:10,440 INFO: no action. i am a secondary and i am following a leader
复制
3.一段时间后,根据数据库大小,“Lag in MB”应再次为零。您可以验证为
$ patronictl -d etcd://127.0.0.1:2379 list patroniha
root@de-gcp-patroni-example-01:~# patronictl -d etcd://127.0.0.1:2379 list patroniha
+ Cluster: patroniha (6830699472358332339) ----+--------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+---------------------------+--------------------+--------+---------+----+-----------+
| de_gcp_patroni_example_01 | 10.80.128.61:5433 | Leader | running | 2 | |
| de_gcp_patroni_example_02 | 10.80.128.160:5433 | | running | 2 | 0 |
| de_gcp_patroni_example_03 | 10.80.128.236:5433 | | running | 2 | 0 |
+---------------------------+--------------------+--------+---------+----+-----------+
复制
结论
在这篇文章中,我介绍了如何使用Ansible构建一个Patroni集群,验证集群状态,并了解Patroni如何帮助处理两个操作:切换和reinit。
原文标题:PATRONI – AN AWESOME POSTGRESQL HA SOLUTION
原文作者:Lochan Marwah
原文地址:https://blog.pythian.com/patroni-an-awesome-postgresql-ha-solution/