暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

postgresql HA解决方案

原创 X丶 2022-10-20
894

Patroni

Patroni是一个使用异步流复制、Etcd和HAProxy的PostgreSQL高可用性解决方案。要了解更多细节,可以阅读Patroni文档,它非常好,涵盖了很多细节。在这篇文章中,我将解释:

  • Patroni架构
  • Ansible集群设置
  • 验证patroni/etcd集群
  • 切换
  • 修复不同步问题

体系结构

Patroni代理负责在每个节点上运行PostgreSQL,并提供用于检查特定节点当前运行状况的API。HAProxy也运行在每个节点上,使用这个API,并为终端用户提供访问集群主节点的单一端点,主节点是管理整个集群所有读/写操作的节点,通常称为主节点。

这个端点可以同时用于读和写,并且可以在集群中端口“5000”上的每个节点上使用。Patroni集群还可以提供一个额外的HAproxy端点,它只接受端口‘6000’上可用的读取。Patroni REST API可直接在端口’ 8008 '上的每个节点上使用。

标准安装让端点在每个节点上运行,但不让应用程序使用全局端点。最好是添加另一个带负载均衡器的层,这样就可以有一个进入集群中可用节点的单一入口点,而不必关心在客户端使用哪个入口点

使用Ansible与Patroni创建HA PostgreSQL集群

Ansible是一套软件工具,可以将基础设施转换为代码。它是开源的,套件包括软件供应、配置管理和应用程序部署功能。要了解更多细节,请查看ansible文档

这里有一个包含两个剧本的例子。第一个ansible任务将在谷歌云平台(GCP)上设置三个计算实例,这些计算实例将用于部署PostgreSQL数据库。第二个ansible任务将把这三个节点配置为一个Patroni集群。

# de-test-patroni-cluster.yml - name: Launch VM(s) hosts: localhost gather_facts: false roles: - role: gcp_vm gcp_vm_instances: - name: de-gcp-patroni-example-01 - name: de-gcp-patroni-example-02 - name: de-gcp-patroni-example-03 gcp_vm_options: defaults: state: present os_image: ubuntu1804 type: n1-standard-1 ansible_group: patroni - role: gcp_lb gcp_lb_name: de-gcp-patroni-example gcp_lb_draining_timeout_secs: 60 gcp_lb_healthcheck: tcp_port: 5000 gcp_lb_instances: patroni # This is the name of the Ansible host group containing the instances we want to use with this load balancer. gcp_lb_dns_name: de-gcp-patroni-example - name: Configure VM(s) hosts: patroni become: true roles: - role: patroni patroni_etcd_cluster_name: my_etcd_cluster patroni_etcd_master_group_name: patroni patroni_master_node: de-gcp-patroni-example-01 patroni_replication_nodes: "{{ groups['patroni'] }}" patroni_dbbackup: true patroni_users: - name: myuser pass: mypass patroni_databases: - name: mydatabase owner: myuser - name: myseconddatabase owner: myuser patroni_database_schemas: - database: mydatabae schema: myschema owner: myuser patroni_user_privileges: - name: myuser db: mydatabase priv: ALL role_attr_flags: NOCREATEDB
复制

Ansible playbook可以这样运行(上面的Ansible代码假设您使用的是Ansible version>2.1,可以在此处找到安装说明)

ansible-playbook de-test-patroni-cluster.yml
复制

在运行这个剧本之后,最终用户可以通过GCP负载均衡器’ de-gcp-patroni-example.de.gcp.cloud ‘在端口’ 5000 ‘上访问PostgreSQL集群进行读写,或者通过端口’ 6000 '访问PostgreSQL集群。

验证集群状态

要使集群可用,Patroni需要三个节点中的至少两个节点可操作。另一个好处是,Patroni完全管理PostgreSQL数据库。在打开和关闭它方面,您可以通过以下方法验证一个Patroni集群和数据库的状态:

HTTP API

向端口’ 8008 '上的一个节点发出HTTP请求。您可以在任何与您的Patroni集群有ssh连接的终端上运行它(下面的例子使用的是mac上的iterm终端)。您应该在响应中得到一个JSON文档,其中显示节点的角色、关于其他复制节点的详细信息以及它们的状态。举个例子,对用上面代码创建的集群中的第一个节点使用curl,将显示以下输出:

$ curl de-gcp-patroni-example-01:8008 { "database_system_identifier": "6820776744425517415", "postmaster_start_time": "2022-04-28 14:57:20.932 UTC", "timeline": 1, "cluster_unlocked": false, "patroni": { "scope": "patroniha", "version": "1.6.1" }, "replication": [ { "sync_state": "async", "sync_priority": 0, "client_addr": "10.80.129.111", "state": "streaming", "application_name": "de_gcp_patroni_example_03", "usename": "replicator" }, { "sync_state": "async", "sync_priority": 0, "client_addr": "10.80.129.113", "state": "streaming", "application_name": "de_gcp_patroni_example_02", "usename": "replicator" } ], "state": "running", "role": "master", "xlog": { "location": 67125840 }, "server_version": 110007 }
复制

Patronictl

Patroni附带了一个名为patronictl的CLI实用程序。可以使用此命令行实用程序执行与Postgres数据库或集群相关的任何管理操作。您可以使用命令行接口,通过向其中一个节点执行sh命令来描述当前的集群状态,并运行以下命令:

$ patronictl -d etcd://127.0.0.1:2379 list patroniha +-----------+---------------------------+--------------------+--------+---------+----+-----------+ | Cluster | Member | Host | Role | State | TL | Lag in MB | +-----------+---------------------------+--------------------+--------+---------+----+-----------+ | patroniha | de_gcp_patroni_example_01 | 10.80.131.214:5433 | Leader | running | 1 | | | patroniha | de_gcp_patroni_example_02 | 10.80.129.113:5433 | | running | 1 | 0.0 | | patroniha | de_gcp_patroni_example_03 | 10.80.129.111:5433 | | running | 1 | 0.0 | +-----------+---------------------------+--------------------+--------+---------+----+-----------+
复制

检查日志

Patroni是通过SystemD进行管理的,因此,您可以通过sh命令获取特定服务器的错误日志,然后使用’ journalctl '进行查询。例如,在主机上:

$ sudo journalctl -f -u patroni Apr 28 16:06:21 de-gcp-patroni-example-01 patroni[20782]: 2022-04-28 16:06:21,983 INFO: Lock owner: de_gcp_patroni_example_01; I am de_gcp_patroni_example_01 Apr 28 16:06:22 de-gcp-patroni-example-01 patroni[20782]: 2022-04-28 16:06:22,017 INFO: no action. i am the leader with the lock
复制

在第二张照片上,你会看到一些稍微不同的东西:

$ sudo journalctl -f -u patroni Apr 28 16:08:11 de-gcp-patroni-example-02 patroni[20552]: 2022-04-28 16:08:11,990 INFO: does not have lock Apr 28 16:08:11 de-gcp-patroni-example-02 patroni[20552]: 2022-04-28 16:08:11,997 INFO: no action. i am a secondary and i am following a leader
复制

Etcd集群状态验证
Etcd用于Patroni在节点之间分发配置和状态,因此它还必须是健康的,才能进行正确的集群操作。
您可以通过ssh到其中一个节点,并执行以下命令检查集群的状态:

$ etcdctl cluster-health patroniha member 13a17454e5cb0797 is healthy: got healthy result from http://10.80.129.113:2379 member 4ba8a9618e42f288 is healthy: got healthy result from http://10.80.129.111:2379 member 6a453ce7684a6c8d is healthy: got healthy result from http://10.80.131.214:2379 cluster is healthy
复制

手动切换
使用命令行界面,您可以强制Patroni切换到一个新的leader。切换将立即发生,并需要少量停机时间(集群不可用)。

  1. 首先,您应该SSH到集群中的一个节点并验证当前的集群状态。
$ patronictl -d etcd://127.0.0.1:2379 list patroniha +-----------+---------------------------+--------------------+--------+---------+----+-----------+ | Cluster | Member | Host | Role | State | TL | Lag in MB | +-----------+---------------------------+--------------------+--------+---------+----+-----------+ | patroniha | de_gcp_patroni_example_01 | 10.80.131.214:5433 | Leader | running | 1 | | | patroniha | de_gcp_patroni_example_02 | 10.80.129.113:5433 | | running | 1 | 0.0 | | patroniha | de_gcp_patroni_example_03 | 10.80.129.111:5433 | | running | 1 | 0.0 | +-----------+---------------------------+--------------------+--------+---------+----+-----------+
复制

2.接下来,您可以触发切换并指定要创建新引线的节点。您还可以选择在将来安排切换或立即执行切换:

$ patronictl -d etcd://127.0.0.1:2379 switchover patroniha Master [de_gcp_patroni_example_01]: <return> Candidate ['de_gcp_patroni_example_02', 'de_gcp_patroni_example_03'] []: de_gcp_patroni_example_02 When should the switchover take place (e.g. 2020-04-28T17:15 ) [now]: <return> Current cluster topology +-----------+---------------------------+--------------------+--------+---------+----+-----------+ | Cluster | Member | Host | Role | State | TL | Lag in MB | +-----------+--------------------------+--------------------+--------+---------+----+-----------+ | patroniha | de_gcp_patroni_example_01 | 10.80.131.214:5433 | Leader | running | 1 | | | patroniha | de_gcp_patroni_example_02 | 10.80.129.113:5433 | | running | 1 | 0.0 | | patroniha | de_gcp_patroni_example_03 | 10.80.129.111:5433 | | running | 1 | 0.0 | +-----------+---------------------------+--------------------+--------+---------+----+-----------+ Are you sure you want to switchover cluster patroniha, demoting current master de_gcp_patroni_example_01? [y/N]:
复制

您可以验证操作是否正确,然后输入“y”继续,然后您将看到以下内容:

2020-04-28 16:17:09.50350 Successfully switched over to "de_gcp_patroni_example_02" +-----------+---------------------------+--------------------+--------+---------+----+-----------+ | Cluster | Member | Host | Role | State | TL | Lag in MB | +-----------+---------------------------+--------------------+--------+---------+----+-----------+ | patroniha | de_gcp_patroni_example_01 | 10.80.131.214:5433 | | stopped | | unknown | | patroniha | de_gcp_patroni_example_02 | 10.80.129.113:5433 | Leader | running | 1 | | | patroniha | de_gcp_patroni_example_03 | 10.80.129.111:5433 | | running | 1 | 0.0 | +-----------+---------------------------+--------------------+--------+---------+----+-----------+
复制

3.最后,您可以通过再次检查状态来验证切换现已完成:

$ patronictl -d etcd://127.0.0.1:2379 list patroniha +-----------+---------------------------+--------------------+--------+---------+----+-----------+ | Cluster | Member | Host | Role | State | TL | Lag in MB | +-----------+---------------------------+--------------------+--------+---------+----+-----------+ | patroniha | de_gcp_patroni_example_01 | 10.80.131.214:5433 | | running | 2 | 0.0 | | patroniha | de_gcp_patroni_example_02 | 10.80.129.113:5433 | Leader | running | 2 | | | patroniha | de_gcp_patroni_example_03 | 10.80.129.111:5433 | | running | 2 | 0.0 | +-----------+---------------------------+--------------------+--------+---------+----+-----------+
复制

修复了Patroni不同步的问题
在某些情况下,PostgreSQL备节点可能会不同步。例如,当主服务器有大量事务时,或者生成了大量的归档文件,消耗了大量的磁盘空间,或者由于错误删除了一些Wal文件,或者由于网络问题删除了一些Wal文件。

您可以按照以下步骤确定辅助节点是否同步,如果不同步,则如何使用Patroni进行修复。

  1. 首先,您应该SSH到集群中的主节点并验证当前集群状态;您可以在任何集群节点上执行以下命令,以找出哪个是主节点。
$ patronictl -d etcd://127.0.0.1:2379 list patroniha root@de-gcp-patroni-example-01:~# patronictl -d etcd://127.0.0.1:2379 list patroniha + Cluster: patroniha (6830699472358332339) ------+--------+---------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +---------------------------+--------------------+--------+---------+----+-----------+ | de_gcp_patroni_example_01 | 10.80.128.61:5433 | Leader | running | 2 | | | de_gcp_patroni_example_02 | 10.80.128.160:5433 | | running | 2 | 75 | | de_gcp_patroni_example_03 | 10.80.128.236:5433 | | running | 2 | 0 | +---------------------------+--------------------+--------+---------+----+-----------+
复制

如果您发现’ Lag in MB '大于任何二级服务器的零,这意味着它的同步有问题,可以在节点中使用命令reinit来解决这个问题。

  1. 登录集群的主节点,执行以下命令。Patroni有一个内置的功能reinit,使待机再次同步,节省您的手工工作。该命令将强制集群节点从主节点同步数据。
patronictl -d etcd://127.0.0.1:2379 reinit cluster_name node_name
复制

例如:

root@de-gcp-patroni-example-01:~# patronictl -d etcd://127.0.0.1:2379 reinit patroniha de_gcp_patroni_example_02 + Cluster: patroniha (6830699472358332339) ------+--------+---------+----+------------+ | Member | Host | Role | State | TL | Lag in MB | +---------------------------+--------------------+--------+---------+----+------------+ | de_gcp_patroni_example_01 | 10.80.128.61:5433 | Leader | running | 2 | | | de_gcp_patroni_example_02 | 10.80.128.160:5433 | | running | 2 | 75 | | de_gcp_patroni_example_03 | 10.80.128.236:5433 | | running | 2 | 0 | +---------------------------+--------------------+--------+---------+----+------------+ Are you sure you want to reinitialize members de_gcp_patroni_example_02? [y/N]: y Success: reinitialize for member de_gcp_patroni_example_02 root@de-gcp-patroni-example-01:~#
复制

登录节点de_gcp_patroni_example_02,也可以在servicecenter服务日志中进行验证。

  1. ssh de-gcp-patroni-example-02

  2. 作为根用户运行

root@de-gcp-patroni-example-02:~# journalctl -f -u patroni
复制

您将在日志中看到一些信息:

2022-05-26 08:54:41.748 UTC [2846] LOG: database system is shut down 2022-05-26 08:54:41,760 INFO: Removing data directory: /data/postgresql/db 2022-05-26 0854:46,627 INFO: replica has been created using basebackup 2022-05-26 08:54:46,628 INFO: bootstrapped from leader 'de_gcp_patroni_example_01' 2022-05-26 08:54:46,633 INFO: closed patroni connection to the postgresql cluster 2022-05-26 08:54:47,442 INFO: postmaster pid=4426 localhost:5433 - no response 2022-05-26 08:54:47.713 UTC [4426] LOG: starting PostgreSQL 12.3 (Ubuntu 12.3-1.pgdg18.04+1) on x86_64-pc And, normal logging agai 2022-05-26 08:55:00,441 INFO: no action. i am a secondary and i am following a leader 2022-05-26 08:55:10,433 INFO: Lock owner: de_gcp_patroni_example_01; I am de_gcp_patroni_example_02 2022-05-26 08:55:10,433 INFO: does not have lock 2022-05-26 08:55:10,440 INFO: no action. i am a secondary and i am following a leader
复制

3.一段时间后,根据数据库大小,“Lag in MB”应再次为零。您可以验证为

$ patronictl -d etcd://127.0.0.1:2379 list patroniha root@de-gcp-patroni-example-01:~# patronictl -d etcd://127.0.0.1:2379 list patroniha + Cluster: patroniha (6830699472358332339) ----+--------+---------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +---------------------------+--------------------+--------+---------+----+-----------+ | de_gcp_patroni_example_01 | 10.80.128.61:5433 | Leader | running | 2 | | | de_gcp_patroni_example_02 | 10.80.128.160:5433 | | running | 2 | 0 | | de_gcp_patroni_example_03 | 10.80.128.236:5433 | | running | 2 | 0 | +---------------------------+--------------------+--------+---------+----+-----------+
复制

结论

在这篇文章中,我介绍了如何使用Ansible构建一个Patroni集群,验证集群状态,并了解Patroni如何帮助处理两个操作:切换和reinit。

原文标题:PATRONI – AN AWESOME POSTGRESQL HA SOLUTION
原文作者:Lochan Marwah
原文地址:https://blog.pythian.com/patroni-an-awesome-postgresql-ha-solution/

「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论