postgresql HA解决方案

原创 X丶 2022-10-20

894

Patroni

Patroni是一个使用异步流复制、Etcd和HAProxy的PostgreSQL高可用性解决方案。要了解更多细节，可以阅读Patroni文档，它非常好，涵盖了很多细节。在这篇文章中，我将解释:

Patroni架构
Ansible集群设置
验证patroni/etcd集群
切换
修复不同步问题

体系结构

Patroni代理负责在每个节点上运行PostgreSQL，并提供用于检查特定节点当前运行状况的API。HAProxy也运行在每个节点上，使用这个API，并为终端用户提供访问集群主节点的单一端点，主节点是管理整个集群所有读/写操作的节点，通常称为主节点。

这个端点可以同时用于读和写，并且可以在集群中端口“5000”上的每个节点上使用。Patroni集群还可以提供一个额外的HAproxy端点，它只接受端口‘6000’上可用的读取。Patroni REST API可直接在端口’ 8008 '上的每个节点上使用。

标准安装让端点在每个节点上运行，但不让应用程序使用全局端点。最好是添加另一个带负载均衡器的层，这样就可以有一个进入集群中可用节点的单一入口点，而不必关心在客户端使用哪个入口点

使用Ansible与Patroni创建HA PostgreSQL集群

Ansible是一套软件工具，可以将基础设施转换为代码。它是开源的，套件包括软件供应、配置管理和应用程序部署功能。要了解更多细节，请查看ansible文档

这里有一个包含两个剧本的例子。第一个ansible任务将在谷歌云平台(GCP)上设置三个计算实例，这些计算实例将用于部署PostgreSQL数据库。第二个ansible任务将把这三个节点配置为一个Patroni集群。

# de-test-patroni-cluster.yml 

- name:         Launch VM(s)
  hosts:        localhost
  gather_facts: false
  roles:
    - role: gcp_vm
      gcp_vm_instances:
        - name:          de-gcp-patroni-example-01
        - name:          de-gcp-patroni-example-02
        - name:          de-gcp-patroni-example-03

      gcp_vm_options:
        defaults:
          state:         present
          os_image:      ubuntu1804
          type:          n1-standard-1
          ansible_group: patroni

    - role: gcp_lb
      gcp_lb_name: de-gcp-patroni-example
      gcp_lb_draining_timeout_secs: 60
      gcp_lb_healthcheck:
        tcp_port: 5000
      gcp_lb_instances: patroni   # This is the name of the Ansible host group containing the instances we want to use with this load balancer.
      gcp_lb_dns_name: de-gcp-patroni-example

- name:         Configure VM(s)
  hosts:        patroni
  become:       true

  roles:
    - role: patroni
      patroni_etcd_cluster_name:      my_etcd_cluster
      patroni_etcd_master_group_name: patroni                       
      patroni_master_node:            de-gcp-patroni-example-01     
      patroni_replication_nodes:      "{{ groups['patroni'] }}"
      patroni_dbbackup:               true
      patroni_users:

        - name: myuser
          pass: mypass                    

      patroni_databases:                  
        - name:  mydatabase
          owner: myuser

        - name:  myseconddatabase
          owner: myuser

      patroni_database_schemas:           

        - database: mydatabae
          schema:   myschema
          owner:    myuser

      patroni_user_privileges:            

        - name:            myuser
          db:              mydatabase
          priv:            ALL
          role_attr_flags: NOCREATEDB
复制

Ansible playbook可以这样运行（上面的Ansible代码假设您使用的是Ansible version>2.1，可以在此处找到安装说明）

ansible-playbook de-test-patroni-cluster.yml
复制

在运行这个剧本之后，最终用户可以通过GCP负载均衡器’ de-gcp-patroni-example.de.gcp.cloud ‘在端口’ 5000 ‘上访问PostgreSQL集群进行读写，或者通过端口’ 6000 '访问PostgreSQL集群。

验证集群状态

要使集群可用，Patroni需要三个节点中的至少两个节点可操作。另一个好处是，Patroni完全管理PostgreSQL数据库。在打开和关闭它方面，您可以通过以下方法验证一个Patroni集群和数据库的状态:

HTTP API

向端口’ 8008 '上的一个节点发出HTTP请求。您可以在任何与您的Patroni集群有ssh连接的终端上运行它(下面的例子使用的是mac上的iterm终端)。您应该在响应中得到一个JSON文档，其中显示节点的角色、关于其他复制节点的详细信息以及它们的状态。举个例子，对用上面代码创建的集群中的第一个节点使用curl，将显示以下输出:

$ curl de-gcp-patroni-example-01:8008

{
  "database_system_identifier": "6820776744425517415",
  "postmaster_start_time": "2022-04-28 14:57:20.932 UTC",
  "timeline": 1,
  "cluster_unlocked": false,
  "patroni": {
    "scope": "patroniha",
    "version": "1.6.1"
  },

  "replication": [

    {

      "sync_state": "async",
      "sync_priority": 0,
      "client_addr": "10.80.129.111",
      "state": "streaming",
      "application_name": "de_gcp_patroni_example_03",
      "usename": "replicator"
    },

    {
      "sync_state": "async",
      "sync_priority": 0,
      "client_addr": "10.80.129.113",
      "state": "streaming",
      "application_name": "de_gcp_patroni_example_02",
      "usename": "replicator"
    }
  ],

  "state": "running",
  "role": "master",
  "xlog": {
    "location": 67125840
  },
  "server_version": 110007

}
复制

Patronictl

Patroni附带了一个名为patronictl的CLI实用程序。可以使用此命令行实用程序执行与Postgres数据库或集群相关的任何管理操作。您可以使用命令行接口，通过向其中一个节点执行sh命令来描述当前的集群状态，并运行以下命令:

$ patronictl -d etcd://127.0.0.1:2379 list patroniha

+-----------+---------------------------+--------------------+--------+---------+----+-----------+
|  Cluster  |           Member          |        Host        |  Role  |  State  | TL | Lag in MB |
+-----------+---------------------------+--------------------+--------+---------+----+-----------+
| patroniha | de_gcp_patroni_example_01 | 10.80.131.214:5433 | Leader | running |  1 |           |
| patroniha | de_gcp_patroni_example_02 | 10.80.129.113:5433 |        | running |  1 |       0.0 |
| patroniha | de_gcp_patroni_example_03 | 10.80.129.111:5433 |        | running |  1 |       0.0 |
+-----------+---------------------------+--------------------+--------+---------+----+-----------+
复制

检查日志

Patroni是通过SystemD进行管理的，因此，您可以通过sh命令获取特定服务器的错误日志，然后使用’ journalctl '进行查询。例如，在主机上:

$ sudo journalctl -f -u patroni

Apr 28 16:06:21 de-gcp-patroni-example-01 patroni[20782]: 2022-04-28 16:06:21,983 INFO: Lock owner: de_gcp_patroni_example_01; I am de_gcp_patroni_example_01
Apr 28 16:06:22 de-gcp-patroni-example-01 patroni[20782]: 2022-04-28 16:06:22,017 INFO: no action.  i am the leader with the lock
复制

在第二张照片上，你会看到一些稍微不同的东西：

$ sudo journalctl -f -u patroni

Apr 28 16:08:11 de-gcp-patroni-example-02 patroni[20552]: 2022-04-28 16:08:11,990 INFO: does not have lock
Apr 28 16:08:11 de-gcp-patroni-example-02 patroni[20552]: 2022-04-28 16:08:11,997 INFO: no action.  i am a secondary and i am following a leader
复制

Etcd集群状态验证
Etcd用于Patroni在节点之间分发配置和状态，因此它还必须是健康的，才能进行正确的集群操作。
您可以通过ssh到其中一个节点，并执行以下命令检查集群的状态:

$ etcdctl cluster-health patroniha

member 13a17454e5cb0797 is healthy: got healthy result from http://10.80.129.113:2379
member 4ba8a9618e42f288 is healthy: got healthy result from http://10.80.129.111:2379
member 6a453ce7684a6c8d is healthy: got healthy result from http://10.80.131.214:2379

cluster is healthy
复制

手动切换
使用命令行界面，您可以强制Patroni切换到一个新的leader。切换将立即发生，并需要少量停机时间(集群不可用)。

首先，您应该SSH到集群中的一个节点并验证当前的集群状态。

$ patronictl -d etcd://127.0.0.1:2379 list patroniha

   +-----------+---------------------------+--------------------+--------+---------+----+-----------+
   |  Cluster  |           Member          |        Host        |  Role  |  State  | TL | Lag in MB |
   +-----------+---------------------------+--------------------+--------+---------+----+-----------+
   | patroniha | de_gcp_patroni_example_01 | 10.80.131.214:5433 | Leader | running |  1 |           |
   | patroniha | de_gcp_patroni_example_02 | 10.80.129.113:5433 |        | running |  1 |       0.0 |
   | patroniha | de_gcp_patroni_example_03 | 10.80.129.111:5433 |        | running |  1 |       0.0 |
   +-----------+---------------------------+--------------------+--------+---------+----+-----------+
复制

2.接下来，您可以触发切换并指定要创建新引线的节点。您还可以选择在将来安排切换或立即执行切换：

$ patronictl -d etcd://127.0.0.1:2379 switchover patroniha

Master [de_gcp_patroni_example_01]: <return>
Candidate ['de_gcp_patroni_example_02', 'de_gcp_patroni_example_03'] []: de_gcp_patroni_example_02
When should the switchover take place (e.g. 2020-04-28T17:15 )  [now]: <return>
Current cluster topology

+-----------+---------------------------+--------------------+--------+---------+----+-----------+
|  Cluster  |           Member          |        Host        |  Role  |  State  | TL | Lag in MB |
+-----------+--------------------------+--------------------+--------+---------+----+-----------+
| patroniha | de_gcp_patroni_example_01 | 10.80.131.214:5433 | Leader | running |  1 |           |
| patroniha | de_gcp_patroni_example_02 | 10.80.129.113:5433 |        | running |  1 |       0.0 |
| patroniha | de_gcp_patroni_example_03 | 10.80.129.111:5433 |        | running |  1 |       0.0 |
+-----------+---------------------------+--------------------+--------+---------+----+-----------+

Are you sure you want to switchover cluster patroniha, demoting current master de_gcp_patroni_example_01? [y/N]:
复制

您可以验证操作是否正确，然后输入“y”继续，然后您将看到以下内容：

2020-04-28 16:17:09.50350 Successfully switched over to "de_gcp_patroni_example_02"

+-----------+---------------------------+--------------------+--------+---------+----+-----------+
|  Cluster  |           Member          |        Host        |  Role  |  State  | TL | Lag in MB |
+-----------+---------------------------+--------------------+--------+---------+----+-----------+
| patroniha | de_gcp_patroni_example_01 | 10.80.131.214:5433 |        | stopped |    |   unknown |
| patroniha | de_gcp_patroni_example_02 | 10.80.129.113:5433 | Leader | running |  1 |           |
| patroniha | de_gcp_patroni_example_03 | 10.80.129.111:5433 |        | running |  1 |       0.0 |
+-----------+---------------------------+--------------------+--------+---------+----+-----------+
复制

3.最后，您可以通过再次检查状态来验证切换现已完成：

$ patronictl -d etcd://127.0.0.1:2379 list patroniha

+-----------+---------------------------+--------------------+--------+---------+----+-----------+
|  Cluster  |           Member          |        Host        |  Role  |  State  | TL | Lag in MB |
+-----------+---------------------------+--------------------+--------+---------+----+-----------+
| patroniha | de_gcp_patroni_example_01 | 10.80.131.214:5433 |        | running |  2 |       0.0 |
| patroniha | de_gcp_patroni_example_02 | 10.80.129.113:5433 | Leader | running |  2 |           |
| patroniha | de_gcp_patroni_example_03 | 10.80.129.111:5433 |        | running |  2 |       0.0 |
+-----------+---------------------------+--------------------+--------+---------+----+-----------+
复制

修复了Patroni不同步的问题
在某些情况下，PostgreSQL备节点可能会不同步。例如，当主服务器有大量事务时，或者生成了大量的归档文件，消耗了大量的磁盘空间，或者由于错误删除了一些Wal文件，或者由于网络问题删除了一些Wal文件。

您可以按照以下步骤确定辅助节点是否同步，如果不同步，则如何使用Patroni进行修复。

首先，您应该SSH到集群中的主节点并验证当前集群状态;您可以在任何集群节点上执行以下命令，以找出哪个是主节点。

$ patronictl -d etcd://127.0.0.1:2379 list patroniha

root@de-gcp-patroni-example-01:~# patronictl -d etcd://127.0.0.1:2379 list  patroniha
+ Cluster: patroniha (6830699472358332339) ------+--------+---------+----+-----------+
|          Member           |        Host        |  Role  |  State  | TL | Lag in MB |
+---------------------------+--------------------+--------+---------+----+-----------+
| de_gcp_patroni_example_01 | 10.80.128.61:5433  | Leader | running |  2 |           |
| de_gcp_patroni_example_02 | 10.80.128.160:5433 |        | running |  2 |        75 |
| de_gcp_patroni_example_03 | 10.80.128.236:5433 |        | running |  2 |         0 |
+---------------------------+--------------------+--------+---------+----+-----------+
复制

如果您发现’ Lag in MB '大于任何二级服务器的零，这意味着它的同步有问题，可以在节点中使用命令reinit来解决这个问题。

登录集群的主节点，执行以下命令。Patroni有一个内置的功能reinit，使待机再次同步，节省您的手工工作。该命令将强制集群节点从主节点同步数据。

patronictl -d etcd://127.0.0.1:2379 reinit cluster_name node_name
复制

例如：

root@de-gcp-patroni-example-01:~# patronictl -d etcd://127.0.0.1:2379 reinit patroniha de_gcp_patroni_example_02

   + Cluster: patroniha (6830699472358332339) ------+--------+---------+----+------------+
   |          Member           |        Host        |  Role  |  State  | TL | Lag in MB  |
   +---------------------------+--------------------+--------+---------+----+------------+
   | de_gcp_patroni_example_01 | 10.80.128.61:5433  | Leader | running |  2 |            |
   | de_gcp_patroni_example_02 | 10.80.128.160:5433 |        | running |  2 |         75 |
   | de_gcp_patroni_example_03 | 10.80.128.236:5433 |        | running |  2 |          0 |
   +---------------------------+--------------------+--------+---------+----+------------+

   Are you sure you want to reinitialize members de_gcp_patroni_example_02? [y/N]: y
   Success: reinitialize for member de_gcp_patroni_example_02

   root@de-gcp-patroni-example-01:~#
复制

登录节点de_gcp_patroni_example_02，也可以在servicecenter服务日志中进行验证。

ssh de-gcp-patroni-example-02
作为根用户运行

root@de-gcp-patroni-example-02:~# journalctl -f -u patroni
复制

您将在日志中看到一些信息：

2022-05-26 08:54:41.748 UTC [2846] LOG:  database system is shut down
2022-05-26 08:54:41,760 INFO: Removing data directory: /data/postgresql/db
2022-05-26 0854:46,627 INFO: replica has been created using basebackup
2022-05-26 08:54:46,628 INFO: bootstrapped from leader 'de_gcp_patroni_example_01'
2022-05-26 08:54:46,633 INFO: closed patroni connection to the postgresql cluster
2022-05-26 08:54:47,442 INFO: postmaster pid=4426
localhost:5433 - no response

2022-05-26 08:54:47.713 UTC [4426] LOG:  starting PostgreSQL 12.3 (Ubuntu 12.3-1.pgdg18.04+1) on x86_64-pc

And, normal logging agai
    
2022-05-26 08:55:00,441 INFO: no action.  i am a secondary and i am following a leader
2022-05-26 08:55:10,433 INFO: Lock owner: de_gcp_patroni_example_01; I am de_gcp_patroni_example_02
2022-05-26 08:55:10,433 INFO: does not have lock
2022-05-26 08:55:10,440 INFO: no action.  i am a secondary and i am following a leader
复制

3.一段时间后，根据数据库大小，“Lag in MB”应再次为零。您可以验证为

$ patronictl -d etcd://127.0.0.1:2379 list patroniha

root@de-gcp-patroni-example-01:~# patronictl -d etcd://127.0.0.1:2379 list patroniha

+ Cluster: patroniha (6830699472358332339) ----+--------+---------+----+-----------+
|          Member           |        Host        |  Role  |  State  | TL | Lag in MB |
+---------------------------+--------------------+--------+---------+----+-----------+
| de_gcp_patroni_example_01 | 10.80.128.61:5433  | Leader | running |  2 |           |
| de_gcp_patroni_example_02 | 10.80.128.160:5433 |        | running |  2 |         0 |
| de_gcp_patroni_example_03 | 10.80.128.236:5433 |        | running |  2 |         0 |
+---------------------------+--------------------+--------+---------+----+-----------+
复制

结论

在这篇文章中，我介绍了如何使用Ansible构建一个Patroni集群，验证集群状态，并了解Patroni如何帮助处理两个操作:切换和reinit。

原文标题：PATRONI – AN AWESOME POSTGRESQL HA SOLUTION
原文作者：Lochan Marwah
原文地址：https://blog.pythian.com/patroni-an-awesome-postgresql-ha-solution/

postgresql 墨力翻译

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者

postgresql HA解决方案

Patroni

体系结构

使用Ansible与Patroni创建HA PostgreSQL集群

结论

评论

相关阅读