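1 Introduction to Halo Shield
Halo Shield is a high-availability solution that automatically manages Halo primary/standby databases. It consists of the following components:
Etcd: a distributed key-value store
Patroni: a process that creates, manages, and monitors streaming replication and performs automatic failover
An etcd cluster needs an odd number of nodes, at least three, to elect a leader. Etcd can generally be deployed on the same servers as the Halo database. Etcd members exchange heartbeats. When a network failure partitions the cluster, the side that holds fewer than a majority of the members becomes unavailable, which prevents split-brain.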
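The majority rule behind the odd-node recommendation can be worked out with a little arithmetic. The sketch below is only an illustration; the function names etcd_quorum and etcd_fault_tolerance are made up for this example and are not part of etcd or Halo Shield.

```shell
# Majority (quorum) size for an etcd cluster of n voting members.
etcd_quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Number of member failures the cluster can survive while keeping quorum.
etcd_fault_tolerance() {
    echo $(( ($1 - 1) / 2 ))
}

for n in 3 4 5; do
    echo "members=$n quorum=$(etcd_quorum "$n") tolerated_failures=$(etcd_fault_tolerance "$n")"
done
```

A fourth member raises the quorum to 3 without raising the number of tolerated failures, which is why odd cluster sizes are preferred.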
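Patroni uses the etcd cluster to store and monitor the state and configuration of the Halo primary and standby nodes. When a fault stops a node from working, Patroni detects it automatically and updates the node information in etcd, which notifies all other nodes. If the failed node is the primary, Patroni also fails over automatically and rebinds the VIP to the new primary, reducing the impact on applications.
Patroni performs heartbeat checks on the other nodes through etcd. When the primary cannot update the leader lock in etcd, Patroni stops the Halo database on that node to prevent split-brain.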
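The leader-lock rule can be pictured as a simple time comparison. The sketch below is a deliberately simplified model, not Patroni's actual implementation; the function name leader_lock_action and the 30-second TTL in the usage lines are assumptions chosen for illustration.

```shell
# Simplified model: decide what a primary should do, given how many
# seconds ago it last renewed the leader lock and the lock's TTL.
leader_lock_action() {
    seconds_since_renew=$1
    ttl=$2
    if [ "$seconds_since_renew" -lt "$ttl" ]; then
        echo "keep-primary"     # lock is still valid
    else
        echo "stop-database"    # lock expired; another node may take over
    fi
}

leader_lock_action 10 30
leader_lock_action 45 30
```

Stopping the database once the lock can no longer be renewed means two nodes never accept writes as primary at the same time.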
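2 Installing Halo Shield
Software installation
The Halo product installation package already includes the full Halo Shield functionality. Once the Halo database is installed correctly, Halo Shield is installed as well. For the Halo installation procedure, see Chapter 2.
3 Configuring Halo Shield
Planning and building the physical replication environment
Halo Shield can be configured on top of an existing physical replication environment. We recommend building the physical replication environment manually as described in Chapter 10.
Halo Shield depends on the etcd distributed key-value store. If the primary/standby databases span more than three servers, we recommend configuring and running the etcd service on three of them.
Etcd configuration
Make sure the HALO_BASE environment variable is set. Then, on every server that will run the etcd service, run etcd_config.sh as the halo user and enter the IP addresses of all servers on which etcd will run.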
vi /home/halo/.bash_profile
export HALO_BASE=/u01/app/halo
$HALO_BASE/product/shield/etcd/v3.5.2/conf/etcd_config.sh
Starting to run etcd configuration setup
Please set HALO_BASE environment variables before proceed
Press y/Y to continue, any other key to cancel
y
Please input IP list where etcd cluster will be running
e.g. 192.168.1.1,192.168.1.2,192.168.1.3
10.16.16.155,10.16.16.156,10.16.16.157
10.16.16.155 will be used as node IP
Initializing done.
Following steps to be done manually.
Run following commands as root
ln -s /u01/app/halo/product/shield/etcd/v3.5.2/conf/etcd.service /usr/lib/systemd/system/etcd.service
Following the output, create the autostart service link as root on every server that runs the etcd service:
ln -s /u01/app/halo/product/shield/etcd/v3.5.2/conf/etcd.service /usr/lib/systemd/system/etcd.service
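Before typing the IP list into etcd_config.sh, it can help to confirm that the list satisfies the odd-number, at-least-three rule described in the introduction. This is a minimal sketch; count_ips and valid_etcd_ip_list are hypothetical helper names, and the check only counts entries without validating the address syntax.

```shell
# Count the comma-separated entries in an IP list.
count_ips() {
    echo "$1" | awk -F',' '{ print NF }'
}

# Succeed only if the list has an odd number of entries, at least three.
valid_etcd_ip_list() {
    n=$(count_ips "$1")
    [ "$n" -ge 3 ] && [ $(( n % 2 )) -eq 1 ]
}

if valid_etcd_ip_list "10.16.16.155,10.16.16.156,10.16.16.157"; then
    echo "IP list looks suitable for an etcd cluster"
fi
```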
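Patroni configuration
Make sure the HALO_BASE, HALO_HOME, and PGDATA environment variables are set. Then, on every primary and standby server, run patroni_config.sh as the halo user and enter the IP addresses of all servers on which etcd will run, the name of the current server, and the VIP information.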
$HALO_BASE/product/shield/patroni/conf/patroni_config.sh
Starting to run patroni configuration setup
Please set HALO_BASE, HALO_HOME, PGDATA environment variables before proceed
Press y/Y to continue, any other key to cancel
y
Please input IP list where etcd cluster will be running
e.g. 192.168.1.1,192.168.1.2,192.168.1.3
10.16.16.155,10.16.16.156,10.16.16.157
Input current node name
Default: node1
Input VIP for the HA cluster
10.16.16.199
10.16.16.199 will be used as VIP
Input network interface to bind the VIP
Default: eth0
eth0 will be used as network interface
Input VIP netmask
Default: 255.255.255.0
255.255.255.0 will be used as VIP netmask
Input VIP broadcast address
Default: 10.16.16.255
10.16.16.255 will be used as VIP broadcast address
Initialize python ...
Python initializing done.
Initializing done.
Following steps to be done manually.
1. Add the following line into the end of .bash_profile of user halo
export PATH=/u01/app/halo/product/shield/patroni/python/bin:$PATH
export PATRONICTL_CONFIG_FILE=/u01/app/halo/product/shield/patroni/conf/patroni_halo.yml
2. Run following commands as root
ln -s /u01/app/halo/product/shield/patroni/conf/patroni.service /usr/lib/systemd/system/patroni.service
3. Add the following line to /etc/sudoers
halo ALL=(ALL) NOPASSWD: /usr/sbin/ip, /usr/bin/arping, /usr/sbin/iptables
Following the output, perform these steps on every primary and standby server:
1. Add the following lines to the halo user's .bash_profile
export PATH=/u01/app/halo/product/shield/patroni/python/bin:$PATH
export PATRONICTL_CONFIG_FILE=/u01/app/halo/product/shield/patroni/conf/patroni_halo.yml
2. As root, create the autostart service link
ln -s /u01/app/halo/product/shield/patroni/conf/patroni.service /usr/lib/systemd/system/patroni.service
3. As root, add the following line to /etc/sudoers
halo ALL=(ALL) NOPASSWD: /usr/sbin/ip, /usr/bin/arping, /usr/sbin/iptables
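Once these steps are done, a quick check in a fresh halo login shell can confirm that the variables this section relies on are actually set. The sketch below assumes a POSIX shell; missing_vars is a hypothetical helper, not something shipped with Halo Shield.

```shell
# Print the names of the given variables that are unset or empty.
missing_vars() {
    for name in "$@"; do
        eval "value=\${$name:-}"
        [ -n "$value" ] || echo "$name"
    done
}

missing=$(missing_vars HALO_BASE HALO_HOME PGDATA PATRONICTL_CONFIG_FILE)
if [ -n "$missing" ]; then
    echo "Not set:" $missing
else
    echo "All required variables are set"
fi
```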
Creating the database administration user
Create the following user on the primary:
psql
create user patroni SUPERUSER password 'patroni';
Creating the watchdog service (optional)
Run the following commands as root:
yum install -y watchdog
modprobe softdog
chown halo /dev/watchdog
systemctl start watchdog
systemctl enable watchdog
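After these commands, the halo user should be able to write to the watchdog device. The sketch below checks only file existence and write permission, without opening the device; watchdog_usable is a hypothetical helper name.

```shell
# Succeed if the given device node exists and the current user may write to it.
# The permission test does not open the device.
watchdog_usable() {
    dev=${1:-/dev/watchdog}
    [ -e "$dev" ] && [ -w "$dev" ]
}

if watchdog_usable /dev/watchdog; then
    echo "/dev/watchdog is writable by $(id -un)"
else
    echo "/dev/watchdog is missing or not writable by $(id -un)"
fi
```

Run it as the halo user, since that is the account the chown command above grants access to.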
4 Using Halo Shield
Starting the etcd service
Start the service as root on every server configured for etcd:
systemctl start etcd
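Make sure etcd is started on at least two servers at about the same time. If it is started on only one server, etcd fails to start because it cannot reach the other members.
To start etcd automatically at boot, run the following command as root: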
systemctl enable etcd
Checking etcd status
$HALO_BASE/product/shield/etcd/v3.5.2/etcdctl endpoint status --cluster=true -w table
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://10.16.16.155:2379 | 1c3f7702e434565f | 3.5.2 | 37 kB | true | false | 2 | 54 | 54 | |
| http://10.16.16.157:2379 | 698a77fbbe16ea8a | 3.5.2 | 37 kB | false | false | 2 | 54 | 54 | |
| http://10.16.16.156:2379 | c67e61e43ed4f1d4 | 3.5.2 | 37 kB | false | false | 2 | 54 | 54 | |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
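A healthy cluster shows exactly one member with IS LEADER set to true. As a rough sketch, the table can be piped through a small awk filter to count leaders; count_etcd_leaders is a hypothetical helper, and it assumes the column layout shown above, with IS LEADER as the fifth column.

```shell
# Read an "etcdctl endpoint status -w table" listing on stdin and
# print how many rows have IS LEADER = true (fifth table column).
count_etcd_leaders() {
    awk -F'|' 'NF > 6 && $6 ~ /true/ { n++ } END { print n + 0 }'
}

# Usage, on a server where etcd is running:
#   $HALO_BASE/product/shield/etcd/v3.5.2/etcdctl endpoint status --cluster=true -w table | count_etcd_leaders
```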
Starting the Patroni service
Start the patroni service as root on every primary and standby server:
systemctl start patroni
To start patroni automatically at boot, run the following command as root:
systemctl enable patroni
Listing primary and standby nodes
patronictl list
+--------+-------------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: halo-cluster (7078872291747273152) ---+----+-----------+
| node1 | 10.16.16.155:1921 | Leader | running | 2 | |
| node2 | 10.16.16.156:1921 | Replica | running | 2 | 0 |
| node3 | 10.16.16.157:1921 | Replica | running | 2 | 0 |
+--------+-------------------+---------+---------+----+-----------+
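For scripts that need the current primary's name, the Role column of this table can be filtered with awk. This is a sketch that assumes the table layout shown above; patroni_leader is a hypothetical helper name.

```shell
# Read a "patronictl list" table on stdin and print the member name
# of each row whose Role column is Leader.
patroni_leader() {
    awk -F'|' '$4 ~ /Leader/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Usage, as the halo user:
#   patronictl list | patroni_leader
```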
Pausing automatic failover
patronictl pause
Success: cluster management is paused
[halo@node1 ~]$ patronictl list
+--------+-------------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: halo-cluster (7078872291747273152) ---+----+-----------+
| node1 | 10.16.16.155:1921 | Leader | running | 2 | |
| node2 | 10.16.16.156:1921 | Replica | running | 2 | 0 |
| node3 | 10.16.16.157:1921 | Replica | running | 2 | 0 |
+--------+-------------------+---------+---------+----+-----------+
Maintenance mode: on
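Maintenance mode stays on until it is switched off again with patronictl resume. The sketch below wraps a maintenance task between the two commands so that automatic failover is re-enabled even if the task fails; run_paused is a hypothetical helper name.

```shell
# Run a command while automatic failover is paused, then resume
# cluster management and return the command's exit status.
run_paused() {
    patronictl pause || return 1
    "$@"
    status=$?
    patronictl resume
    return "$status"
}

# Usage, as the halo user (the task shown is a placeholder):
#   run_paused sh -c 'echo "maintenance task goes here"'
```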
Manual switchover
patronictl switchover
Master [node1]:
Candidate ['node2', 'node3'] []: node2
When should the switchover take place (e.g. 2022-05-10T13:57 ) [now]:
Current cluster topology
+--------+-------------------+---------+---------+----+-----------+-----------------+
| Member | Host | Role | State | TL | Lag in MB | Pending restart |
+ Cluster: halo-cluster (7078872291747273152) ---+----+-----------+-----------------+
| node1 | 10.16.16.155:1921 | Leader | running | 2 | | |
| node2 | 10.16.16.156:1921 | Replica | running | 2 | 0 | |
| node3 | 10.16.16.157:1921 | Replica | running | 2 | 0 | * |
+--------+-------------------+---------+---------+----+-----------+-----------------+
Maintenance mode: on
Are you sure you want to switchover cluster halo-cluster, demoting current master node1? [y/N]: y
2022-05-10 12:57:57.44645 Successfully switched over to "node2"
+--------+-------------------+---------+---------+----+-----------+-----------------+
| Member | Host | Role | State | TL | Lag in MB | Pending restart |
+ Cluster: halo-cluster (7078872291747273152) ---+----+-----------+-----------------+
| node1 | 10.16.16.155:1921 | Replica | stopped | | unknown | |
| node2 | 10.16.16.156:1921 | Leader | running | 2 | | |
| node3 | 10.16.16.157:1921 | Replica | running | 2 | 0 | * |
+--------+-------------------+---------+---------+----+-----------+-----------------+
Maintenance mode: on
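The same switchover can be requested without the interactive prompts. The sketch below only prints the command so it can be reviewed first. It assumes a Patroni release whose patronictl accepts --master, matching the "Master [node1]:" prompt above; newer releases name this option --leader, so check patronictl switchover --help before relying on it. switchover_command is a hypothetical helper name.

```shell
# Print (do not run) a non-interactive switchover command.
# Arguments: cluster name, current primary, candidate.
# Assumption: this patronictl version accepts --master.
switchover_command() {
    printf 'patronictl switchover %s --master %s --candidate %s --force\n' "$1" "$2" "$3"
}

switchover_command halo-cluster node1 node2
```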
Restarting a single node
patronictl restart halo-cluster node3
+--------+-------------------+---------+---------+----+-----------+-----------------+
| Member | Host | Role | State | TL | Lag in MB | Pending restart |
+ Cluster: halo-cluster (7078872291747273152) ---+----+-----------+-----------------+
| node1 | 10.16.16.155:1921 | Replica | running | 3 | 0 | |
| node2 | 10.16.16.156:1921 | Leader | running | 3 | | |
| node3 | 10.16.16.157:1921 | Replica | running | 2 | 0 | * |
+--------+-------------------+---------+---------+----+-----------+-----------------+
Maintenance mode: on
When should the restart take place (e.g. 2022-05-10T13:58) [now]:
Are you sure you want to restart members node3? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2) []:
Success: restart on member node3
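Members that still need a restart are marked with an asterisk in the Pending restart column. As a sketch, they can be listed by filtering the seven-column table shown above; pending_restart_members is a hypothetical helper name.

```shell
# Read a patronictl table that includes the "Pending restart" column
# on stdin and print the members flagged with "*".
pending_restart_members() {
    awk -F'|' '$8 ~ /\*/ { gsub(/[ \t]/, "", $2); print $2 }'
}
```

Each name it prints can then be passed to patronictl restart halo-cluster, as in the transcript above.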