
From Outage to Recovery: A Distributed Database Troubleshooting Log

Original post by szrsu, 2025-03-23

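1. Symptoms

I previously set up a GBase 8c distributed database in a virtual machine environment (see "GBase 8c Cluster Installation and Deployment Guide: Getting Started the Easy Way"). The host lost power suddenly and was never shut down cleanly, which left the cluster in an abnormal state: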

-- Check the cluster status
[gbase@gbase8c1 ~]$ gha_ctl monitor all -H -l http://10.10.10.34:2379
{
    "ret":80000301,
    "msg":"Transport endpoint unreach"
}
[gbase@gbase8c1 ~]$ gha_ctl monitor all -H -l http://10.10.10.36:2379
+----+-------------+-------------+-------+---------+--------+
| No | name        | host        | port  | state   | leader |
+----+-------------+-------------+-------+---------+--------+
| 0  | gha_server1 | 10.10.10.34 | 20001 | running | True   |
+----+-------------+-------------+-------+---------+--------+
+----+------+-------------+------+---------------------------+---------+---------+
| No | name | host        | port | work_dir                  | state   | role    |
+----+------+-------------+------+---------------------------+---------+---------+
| 0  | gtm1 | 10.10.10.34 | 6666 | /home/gbase/data/gtm/gtm1 | running | primary |
+----+------+-------------+------+---------------------------+---------+---------+
+----+------+-------------+------+----------------------------+---------+---------+
| No | name | host        | port | work_dir                   | state   | role    |
+----+------+-------------+------+----------------------------+---------+---------+
| 0  | cn1  | 10.10.10.34 | 5432 | /home/gbase/data/coord/cn1 | running | primary |
+----+------+-------------+------+----------------------------+---------+---------+
+----+-------+-------+-------------+-------+----------------------------+---------+---------+
| No | group | name  | host        | port  | work_dir                   | state   | role    |
+----+-------+-------+-------------+-------+----------------------------+---------+---------+
| 0  | dn1   | dn1_1 | 10.10.10.35 | 15432 | /home/gbase/data/dn1/dn1_1 | running | primary |
| 1  | dn2   | dn2_1 | 10.10.10.36 | 20010 | /home/gbase/data/dn2/dn2_1 | running | primary |
+----+-------+-------+-------------+-------+----------------------------+---------+---------+
+----+-------------------------+--------+-----------+----------+
| No | url                     | name   | state     | isLeader |
+----+-------------------------+--------+-----------+----------+
| 0  | http://10.10.10.36:2379 | node_2 | healthy   | False    |
| 1  | http://10.10.10.34:2379 | node_0 | unhealthy | False    |
| 2  | http://10.10.10.35:2379 | node_1 | healthy   | True     |
+----+-------------------------+--------+-----------+----------+

The etcd member on 10.10.10.34 is reported as unhealthy, so the problem lies on that node.
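Spotting the unhealthy row by eye works for three nodes, but the same check can be scripted. The following is a minimal sketch that assumes the etcd table keeps the column order shown above (No, url, name, state, isLeader); the function name `unhealthy_etcd_nodes` is made up for illustration and is not part of gha_ctl.

```shell
# Read `gha_ctl monitor all` output on stdin and print "name url" for every
# etcd row whose state is not "healthy". Only rows whose second column is a
# URL are inspected, so the gha_server/gtm/cn/dn tables are ignored.
unhealthy_etcd_nodes() {
    awk -F'|' '
        # Example row: | 1  | http://10.10.10.34:2379 | node_0 | unhealthy | False |
        $3 ~ /^ *https?:\/\// {
            url = $3; name = $4; state = $5
            gsub(/ /, "", url); gsub(/ /, "", name); gsub(/ /, "", state)
            if (state != "healthy") print name, url
        }
    '
}

# Usage (run against an endpoint that still answers):
#   gha_ctl monitor all -H -l http://10.10.10.36:2379 | unhealthy_etcd_nodes
```

An empty result means every etcd member in the table was reported as healthy.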

2. Troubleshooting

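Based on the error { "ret":80000301, "msg":"Transport endpoint unreach" }, the investigation focused on the following:

(1) Time synchronization: check whether the three machines' clocks differ, and whether the ntpd service is running.

(2) Network: check whether any machine's IP address has changed and whether the nodes can reach each other.

(3) The etcd service: check whether it is running normally.

The first two checks showed nothing abnormal. The problem surfaced when checking the etcd service.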
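The first two checks can be sketched in a few lines of shell. This is only an illustration under stated assumptions: it presumes passwordless ssh between the three hosts and a `timeout` command on the PATH, and the helper names `clock_skew` and `port_open` are invented here.

```shell
# Print the spread (max - min) of a list of epoch timestamps, in seconds.
clock_skew() {
    local min="$1" max="$1" t
    for t in "$@"; do
        if [ "$t" -lt "$min" ]; then min="$t"; fi
        if [ "$t" -gt "$max" ]; then max="$t"; fi
    done
    echo $(( max - min ))
}

# Return 0 if a TCP connection to host:port succeeds within 2 seconds.
# Uses bash's /dev/tcp pseudo-device, hence the explicit `bash -c`.
port_open() {
    timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Usage (assumes passwordless ssh; sequential ssh calls add a little latency,
# so treat a spread of a second or two as noise):
#   clock_skew $(for h in 10.10.10.34 10.10.10.35 10.10.10.36; do ssh "$h" date +%s; done)
#   port_open 10.10.10.35 2380 && echo "peer port reachable"
```

Ports 2379 (client) and 2380 (peer) are the ones used by the etcd configuration in this cluster.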

2.1 Check the etcd service

-- Check the etcd service status on every node
# systemctl status etcd
● etcd.service - Etcd Server
   Active: activating (start) since Thu 2025-03-22 23:03:26 CST; 48s ago
[...]
Mar 22 23:03:34 gbase8c1 etcd[10363]: publish error: etcdserver: request timed out

The service was running normally on the other nodes, but on 10.10.10.34 the status output showed the following errors:

[root@gbase8c1 member]# systemctl status etcd
● etcd.service - Etcd Server
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Thu 2025-03-20 23:03:26 CST; 48s ago
Main PID: 10363 (etcd)
   CGroup: /docker/e7ff60899b159f0e16156801bae5649ccb06983ff489abe0cdd252941cb2fcfa/system.slice/etcd.service
           └─10363 /usr/bin/etcd --name=node_0 --data-dir=/var/lib/etcd/default.etcd --listen-client-urls=http://10.10.10.34:2379
           ‣ 10363 /usr/bin/etcd --name=node_0 --data-dir=/var/lib/etcd/default.etcd --listen-client-urls=http://10.10.10.34:2379

Mar 22 23:03:27 gbase8c1 etcd[10363]: established a TCP streaming connection with peer 1ee8d2017f324082 (stream Message writer)
Mar 22 23:03:27 gbase8c1 etcd[10363]: established a TCP streaming connection with peer 1ee8d2017f324082 (stream MsgApp v2 writer)
Mar 22 23:03:27 gbase8c1 etcd[10363]: established a TCP streaming connection with peer e0dea71e4a2e0936 (stream MsgApp v2 writer)
Mar 22 23:03:27 gbase8c1 etcd[10363]: 9c5365ebdda29888 initialzed peer connection; fast-forwarding 8 ticks (election ticks 10) with 2 active peer(s)
Mar 22 23:03:34 gbase8c1 etcd[10363]: publish error: etcdserver: request timed out, possibly due to connection lost
Mar 22 23:03:41 gbase8c1 etcd[10363]: publish error: etcdserver: request timed out
Mar 22 23:03:48 gbase8c1 etcd[10363]: publish error: etcdserver: request timed out
Mar 22 23:03:55 gbase8c1 etcd[10363]: publish error: etcdserver: request timed out
Mar 22 23:04:02 gbase8c1 etcd[10363]: publish error: etcdserver: request timed out
Mar 22 23:04:09 gbase8c1 etcd[10363]: publish error: etcdserver: request timed out

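The errors show this etcd member timing out while trying to publish to the cluster, and the unit never leaves the "activating (start)" state. Possible causes include network problems, a misconfigured node, or insufficient resources.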
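One quick way to tell whether a member is actually serving clients is etcd's HTTP `/health` endpoint on the client URL. The sketch below assumes `curl` is installed; the helper names are made up, and the pattern accepts both `{"health":"true"}` and `{"health":true}` because the exact body can differ between etcd versions.

```shell
# Return 0 if the given /health response body reports a healthy member.
health_body_ok() {
    printf '%s' "$1" | grep -Eq '"health"[[:space:]]*:[[:space:]]*"?true"?'
}

# Probe one client URL and print "<url> healthy" or "<url> unhealthy".
# A connection failure or timeout counts as unhealthy.
probe_etcd() {
    local url="$1" body
    body=$(curl -s --max-time 3 "$url/health") || body=""
    if health_body_ok "$body"; then
        echo "$url healthy"
    else
        echo "$url unhealthy"
    fi
}

# Usage:
#   for u in http://10.10.10.34:2379 http://10.10.10.35:2379 http://10.10.10.36:2379; do
#       probe_etcd "$u"
#   done
```

A member stuck in startup has not begun serving client requests yet, so the probe would be expected to fail for it, which matches the "Transport endpoint unreach" error seen earlier.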

2.2 Check the system log

Check the system log on 10.10.10.34:

[root@gbase8c1 ~]# tail -f /var/log/messages
Mar 22 23:12:24 gbase8c1 etcd: established a TCP streaming connection with peer 1ee8d2017f324082 (stream MsgApp v2 writer)
Mar 22 23:12:24 gbase8c1 etcd: established a TCP streaming connection with peer 1ee8d2017f324082 (stream MsgApp v2 reader)
Mar 22 23:12:24 gbase8c1 etcd: established a TCP streaming connection with peer 1ee8d2017f324082 (stream Message writer)
Mar 22 23:12:24 gbase8c1 etcd: established a TCP streaming connection with peer e0dea71e4a2e0936 (stream MsgApp v2 writer)
Mar 22 23:12:24 gbase8c1 etcd: established a TCP streaming connection with peer e0dea71e4a2e0936 (stream Message writer)
Mar 22 23:12:24 gbase8c1 etcd: started streaming with peer e0dea71e4a2e0936 (stream Message reader)
Mar 22 23:12:24 gbase8c1 etcd: raft.node: 9c5365ebdda29888 elected leader e0dea71e4a2e0936 at term 173
Mar 22 23:12:24 gbase8c1 etcd: established a TCP streaming connection with peer e0dea71e4a2e0936 (stream Message reader)
Mar 22 23:12:24 gbase8c1 etcd: 9c5365ebdda29888 initialzed peer connection; fast-forwarding 8 ticks (election ticks 10) with 2 active peer(s)
Mar 22 23:12:31 gbase8c1 etcd: publish error: etcdserver: request timed out, possibly due to connection lost
Mar 22 23:12:38 gbase8c1 etcd: publish error: etcdserver: request timed out
Mar 22 23:12:45 gbase8c1 etcd: publish error: etcdserver: request timed out
Mar 22 23:12:52 gbase8c1 etcd: publish error: etcdserver: request timed out
Mar 22 23:12:59 gbase8c1 etcd: publish error: etcdserver: request timed out
Mar 22 23:13:06 gbase8c1 etcd: publish error: etcdserver: request timed out
Mar 22 23:13:13 gbase8c1 etcd: publish error: etcdserver: request timed out
Mar 22 23:13:20 gbase8c1 etcd: publish error: etcdserver: request timed out
Mar 22 23:13:27 gbase8c1 etcd: publish error: etcdserver: request timed out
Mar 22 23:13:34 gbase8c1 etcd: publish error: etcdserver: request timed out
Mar 22 23:13:41 gbase8c1 etcd: publish error: etcdserver: request timed out
Mar 22 23:13:48 gbase8c1 etcd: publish error: etcdserver: request timed out
Mar 22 23:13:54 gbase8c1 systemd: etcd.service start operation timed out. Terminating.
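The pattern in this log (peer connections come up, a leader is seen, then repeated publish timeouts until systemd gives up on the start) is easier to read as a summary. A rough sketch, assuming the message wording shown above; `summarize_etcd_log` is a hypothetical helper name:

```shell
# Summarize etcd-related syslog lines read on stdin: number of publish
# timeouts, number of times systemd aborted the start, and the last
# leader/term that the local member observed.
summarize_etcd_log() {
    awk '
        /publish error: etcdserver: request timed out/ { timeouts++ }
        /etcd\.service start operation timed out/      { aborted++ }
        # "... elected leader <id> at term <n>": id is 4th field from the end
        /elected leader/ { leader = $(NF-3); term = $NF }
        END {
            printf "publish timeouts: %d\n", timeouts
            printf "start aborted by systemd: %d\n", aborted
            if (leader != "") printf "last leader seen: %s (term %s)\n", leader, term
        }
    '
}

# Usage:
#   grep -E 'etcd|systemd' /var/log/messages | summarize_etcd_log
```

If a leader is seen and peer streams are established, as here, the node can talk to its peers, which makes a plain network outage a less likely explanation.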

2.3 Check the etcd configuration

-- Check the etcd configuration on every node

[root@gbase8c1 ~]# cat /etc/etcd/etcd.conf
ETCD_DATA_DIR = "/var/lib/etcd/default.etcd"
ETCD_ENABLE_V2 = "true"
ETCD_INITIAL_CLUSTER_TOKEN = "etcd-cluster"
ETCD_NAME="node_0"
ETCD_LISTEN_PEER_URLS="http://10.10.10.34:2380"
ETCD_LISTEN_CLIENT_URLS="http://10.10.10.34:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.10.10.34:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://10.10.10.34:2379"
ETCD_INITIAL_CLUSTER="node_0=http://10.10.10.34:2380,node_1=http://10.10.10.35:2380,node_2=http://10.10.10.36:2380"

[root@gbase8c2 ~]# cat /etc/etcd/etcd.conf
ETCD_DATA_DIR = "/var/lib/etcd/default.etcd"
ETCD_ENABLE_V2 = "true"
ETCD_INITIAL_CLUSTER_TOKEN = "etcd-cluster"
ETCD_NAME="node_1"
ETCD_LISTEN_PEER_URLS="http://10.10.10.35:2380"
ETCD_LISTEN_CLIENT_URLS="http://10.10.10.35:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.10.10.35:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://10.10.10.35:2379"
ETCD_INITIAL_CLUSTER="node_0=http://10.10.10.34:2380,node_1=http://10.10.10.35:2380,node_2=http://10.10.10.36:2380"

[root@gbase8c3 ~]# cat /etc/etcd/etcd.conf
ETCD_DATA_DIR = "/var/lib/etcd/default.etcd"
ETCD_ENABLE_V2 = "true"
ETCD_INITIAL_CLUSTER_TOKEN = "etcd-cluster"
ETCD_NAME="node_2"
ETCD_LISTEN_PEER_URLS="http://10.10.10.36:2380"
ETCD_LISTEN_CLIENT_URLS="http://10.10.10.36:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.10.10.36:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://10.10.10.36:2379"
ETCD_INITIAL_CLUSTER="node_0=http://10.10.10.34:2380,node_1=http://10.10.10.35:2380,node_2=http://10.10.10.36:2380"

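Notes on the relevant settings:

ETCD_NAME
# Name of this member within the etcd cluster. Any value works as long as it is distinguishable and unique.
ETCD_LISTEN_PEER_URLS
# URLs to listen on for traffic between members; more than one can be given. Members exchange data through these URLs (leader election, data replication, and so on).
ETCD_INITIAL_ADVERTISE_PEER_URLS
# Peer URLs this member advertises to the rest of the cluster; the other members use this value to communicate with it.
ETCD_LISTEN_CLIENT_URLS
# URLs to listen on for client traffic; again, more than one can be given.
ETCD_ADVERTISE_CLIENT_URLS
# Client URLs this member advertises; etcd proxies and other etcd members use this value to reach the node.
ETCD_INITIAL_CLUSTER_TOKEN
# Token for the cluster. With it set, the cluster generates a unique cluster ID and a unique ID for each member. If another cluster is started from the same configuration file, the two clusters do not interfere with each other as long as their tokens differ.
ETCD_INITIAL_CLUSTER
# The combined list of every member's INITIAL_ADVERTISE_PEER_URLS, in name=URL form.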

The etcd configuration is correct on all nodes.
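Comparing three configuration files by eye is error-prone, so part of this check can be automated. The sketch below is an illustration only: `conf_value` and `member_listed` are invented helper names, and the parser is deliberately tolerant of the spaces around "=" that appear in some lines of these files.

```shell
# Print the value of KEY from an etcd.conf-style file.
# Accepts optional spaces around "=" and optional double quotes.
conf_value() {
    local key="$1" file="$2"
    sed -n -E "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*\"?([^\"]*)\"?[[:space:]]*$/\1/p" "$file" | tail -n 1
}

# Return 0 if the member named in FILE appears in its own ETCD_INITIAL_CLUSTER
# with exactly the peer URL it advertises.
member_listed() {
    local file="$1" name peer cluster
    name=$(conf_value ETCD_NAME "$file")
    peer=$(conf_value ETCD_INITIAL_ADVERTISE_PEER_URLS "$file")
    cluster=$(conf_value ETCD_INITIAL_CLUSTER "$file")
    case ",$cluster," in
        *",$name=$peer,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on each node:
#   member_listed /etc/etcd/etcd.conf && echo "member entry OK"
#   conf_value ETCD_INITIAL_CLUSTER /etc/etcd/etcd.conf   # should match on all nodes
```

The two things worth confirming are that ETCD_INITIAL_CLUSTER is identical on every node and that each node's own name and peer URL appear in it.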

3. Resolution

3.1 Restart the etcd service

-- Restart the etcd service

# systemctl restart etcd

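I first restarted the etcd service on 10.10.10.34, which had no effect: the same errors came back. I then restarted etcd on the other nodes as well, still with no effect.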

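3.2 Clear the data on the faulty node

Because the machine went down abruptly, I suspected that the etcd data on the cluster members had become inconsistent and that this was causing the errors. I tried removing the data of the failing member so that it would resynchronize from the others: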

# Stop the etcd service
[root@gbase8c1 ~]# systemctl stop etcd
# To be safe, move the data directory elsewhere with mv instead of deleting it, so it can be restored if something goes wrong
[root@gbase8c1 ~]# mv /var/lib/etcd/default.etcd /tmp
# Start the etcd service again
[root@gbase8c1 ~]# systemctl start etcd
# Check the etcd status
[root@gbase8c1 ~]# systemctl status etcd
● etcd.service - Etcd Server
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2025-03-22 23:40:55 CST; 3min 57s ago
 Main PID: 277 (etcd)
   CGroup: /docker/e7ff60899b159f0e16156801bae5649ccb06983ff489abe0cdd252941cb2fcfa/system.slice/etcd.service
           └─277 /usr/bin/etcd --name=node_0 --data-dir=/var/lib/etcd/default.etcd --listen-client-urls=http://10.10.10.34:2379
           ‣ 277 /usr/bin/etcd --name=node_0 --data-dir=/var/lib/etcd/default.etcd --listen-client-urls=http://10.10.10.34:2379

Mar 22 23:40:55 gbase8c1 etcd[277]: established a TCP streaming connection with peer 1ee8d2017f324082 (stream Message reader)
Mar 22 23:40:55 gbase8c1 etcd[277]: 9c5365ebdda29888 [term: 186] received a MsgVote message with higher term from e0dea71e4a2e0936 [term: 187]
Mar 22 23:40:55 gbase8c1 etcd[277]: 9c5365ebdda29888 became follower at term 187
Mar 22 23:40:55 gbase8c1 etcd[277]: 9c5365ebdda29888 [logterm: 186, index: 724331, vote: 0] cast MsgVote for e0dea71e4a2e0936 [logterm: 186, index: 724331] at term 187
Mar 22 23:40:55 gbase8c1 etcd[277]: raft.node: 9c5365ebdda29888 elected leader e0dea71e4a2e0936 at term 187
Mar 22 23:40:55 gbase8c1 etcd[277]: 9c5365ebdda29888 initialzed peer connection; fast-forwarding 8 ticks (election ticks 10) with 2 active peer(s)
Mar 22 23:40:55 gbase8c1 etcd[277]: published {Name:node_0 ClientURLs:[http://10.10.10.34:2379]} to cluster 3503f38b8057518f
Mar 22 23:40:55 gbase8c1 etcd[277]: ready to serve client requests
Mar 22 23:40:55 gbase8c1 etcd[277]: serving insecure client requests on 10.10.10.34:2379, this is strongly discouraged!
Mar 22 23:40:55 gbase8c1 systemd[1]: Started Etcd Server.

The etcd service is back to normal and no longer reports errors.
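The "move it aside rather than delete it" step can be made a little more robust. On many distributions /tmp is cleaned periodically, so a dedicated backup location with a timestamped name may be a safer place for the old data. A minimal sketch, in which `set_aside_dir` and the path /var/backup/etcd are assumptions chosen for illustration:

```shell
# Move SRC into BACKUP_ROOT under a timestamped name and print the new path.
# Fails loudly if SRC is missing, so a mistyped path cannot pass unnoticed.
set_aside_dir() {
    local src="$1" backup_root="$2" dest
    [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
    dest="$backup_root/$(basename "$src").$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$backup_root" && mv "$src" "$dest" && echo "$dest"
}

# Usage on the faulty node only (the backup path is hypothetical):
#   systemctl stop etcd
#   set_aside_dir /var/lib/etcd/default.etcd /var/backup/etcd
#   systemctl start etcd
```

As in the steps above, this should be done on one faulty member at a time while the remaining members still form a majority; otherwise there is nothing left to resynchronize from.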

3.3 Verification

Check the cluster status:

[gbase@gbase8c1 ~]$ gha_ctl monitor all -H -l http://10.10.10.35:2379
+----+-------------+-------------+-------+---------+--------+
| No | name        | host        | port  | state   | leader |
+----+-------------+-------------+-------+---------+--------+
| 0  | gha_server1 | 10.10.10.34 | 20001 | running | True   |
+----+-------------+-------------+-------+---------+--------+
+----+------+-------------+------+---------------------------+---------+---------+
| No | name | host        | port | work_dir                  | state   | role    |
+----+------+-------------+------+---------------------------+---------+---------+
| 0  | gtm1 | 10.10.10.34 | 6666 | /home/gbase/data/gtm/gtm1 | running | primary |
+----+------+-------------+------+---------------------------+---------+---------+
+----+------+-------------+------+----------------------------+---------+---------+
| No | name | host        | port | work_dir                   | state   | role    |
+----+------+-------------+------+----------------------------+---------+---------+
| 0  | cn1  | 10.10.10.34 | 5432 | /home/gbase/data/coord/cn1 | running | primary |
+----+------+-------------+------+----------------------------+---------+---------+
+----+-------+-------+-------------+-------+----------------------------+---------+---------+
| No | group | name  | host        | port  | work_dir                   | state   | role    |
+----+-------+-------+-------------+-------+----------------------------+---------+---------+
| 0  | dn1   | dn1_1 | 10.10.10.35 | 15432 | /home/gbase/data/dn1/dn1_1 | running | primary |
| 1  | dn2   | dn2_1 | 10.10.10.36 | 20010 | /home/gbase/data/dn2/dn2_1 | running | primary |
+----+-------+-------+-------------+-------+----------------------------+---------+---------+
+----+-------------------------+--------+---------+----------+
| No | url                     | name   | state   | isLeader |
+----+-------------------------+--------+---------+----------+
| 0  | http://10.10.10.36:2379 | node_2 | healthy | False    |
| 1  | http://10.10.10.34:2379 | node_0 | healthy | False    |
| 2  | http://10.10.10.35:2379 | node_1 | healthy | True     |
+----+-------------------------+--------+---------+----------+

All nodes in the cluster are back to normal.

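[Copyright notice] This article is original content by a Modb (墨天轮) user. Reprints must credit the source (Modb), the article link, and the author; otherwise the author and Modb reserve the right to pursue liability. If you find content on Modb that appears to be plagiarized or infringing, please report it by email to contact@modb.pro with supporting evidence. Once verified, Modb will remove the content immediately.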