
Troubleshooting a MySQL cluster failure on a Kubernetes platform

请叫我水哥 2020-08-31

I. Fault Symptoms

1. Routine inspection finds that pods in the mysqlha cluster have restarted
[root@pass02 ~]# kubectl get pods --all-namespaces |grep mysqlha
acs-system mysqlha-mysqlha-0 1/1 Running 4 159d
acs-system mysqlha-mysqlha-1 1/1 Running 2 80d
acs-system mysqlha-mysqlha-2 1/1 Running 2 80d
acs-system mysqlha-phpmyadmin-7566586968-rffjj 1/1 Running 4 159d
acs-system mysqlha-sentinel-558bd59f88-8q9l5 1/1 Running 2 80d
2. Check the logs for the related errors
The log of mysqlha-mysqlha-2 is as follows:
[root@pass02 ~]# kubectl logs mysqlha-mysqlha-2 -n acs-system
2020-08-28T23:03:36.562225Z 1717127 [ERROR] Slave SQL for channel '': Could not execute Write_rows event on table ecm.ecm_global_lock; Duplicate entry 'MSG_LOCK-MSG_LOCK' for key 'uk_type_lock_key', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysqlha-mysqlha-0-bin.000044, end_log_pos 66209378, Error_code: 1062
2020-08-28T23:03:36.562249Z 1717127 [Warning] Slave: Duplicate entry 'MSG_LOCK-MSG_LOCK' for key 'uk_type_lock_key' Error_code: 1062
2020-08-28T23:03:36.562258Z 1717127 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysqlha-mysqlha-0-bin.000044' position 66209102.
2020-08-28T23:03:43.267703Z 1717126 [Note] Slave I/O thread killed while reading event for channel ''
2020-08-28T23:03:43.267737Z 1717126 [Note] Slave I/O thread exiting for channel '', read up to log 'mysqlha-mysqlha-0-bin.000044', position 66223930
2020-08-28T23:03:43.278969Z 1717130 [Note] 'CHANGE MASTER TO FOR CHANNEL '' executed'. Previous state master_host='mysqlha-mysqlha-0.mysqlha-mysqlha', master_port= 3306, master_log_file='mysqlha-mysqlha-0-bin.000044', master_log_pos= 66223930, master_bind=''. New state master_host='mysqlha-mysqlha-0.mysqlha-mysqlha', master_port= 3306, master_log_file='mysqlha-mysqlha-0-bin.000044', master_log_pos= 66223373, master_bind=''.
2020-08-28T23:03:43.291744Z 1717132 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2020-08-28T23:03:43.292226Z 1717133 [Warning] Slave SQL for channel '': If a crash happens this configuration does not guarantee that the relay log info will be consistent, Error_code: 0
2020-08-28T23:03:43.292285Z 1717133 [Note] Slave SQL thread for channel '' initialized, starting replication in log 'mysqlha-mysqlha-0-bin.000044' at position 66223373, relay log './mysqlha-mysqlha-2-relay-bin.000001' position: 4
2020-08-28T23:03:43.293537Z 1717132 [Note] Slave I/O thread for channel '': connected to master 'aliyun_cos@mysqlha-mysqlha-0.mysqlha-mysqlha:3306',replication started in log 'mysqlha-mysqlha-0-bin.000044' at position 66223373
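Error 1062 (duplicate key) means the SQL thread tried to apply a row that already exists on the slave, i.e. the slave's data has diverged from the master. A common quick workaround is to skip the offending event, but that leaves the drift in place, which is presumably why the slave was later rebuilt from a fresh dump instead. A sketch of the skip approach, with <slave_ip> as a placeholder for the broken slave's pod IP:

mysql -p -h<slave_ip> -e "STOP SLAVE; SET GLOBAL sql_slave_skip_counter = 1; START SLAVE;"
mysql -p -h<slave_ip> -e "SHOW SLAVE STATUS\G" | grep -E "Slave_(IO|SQL)_Running|Last_SQL_Error"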
The HA component (sentinel) reports the following errors:
[root@pass02 ~]# kubectl logs mysqlha-sentinel-558bd59f88-8q9l5 -n acs-system
07:41:14-2020-08-22 Defualt-Mode-Set-Slave: Label mysqlha-mysqlha-2 to slave: master_file: mysqlha-mysqlha-0-bin.000042, master_pos: 858028038
pod "mysqlha-mysqlha-2" not labeled
07:41:20-2020-08-22 Slave_SQL_Running is Not Yes
07:41:21-2020-08-22 Defualt-Mode-Set-Slave: Label mysqlha-mysqlha-2 to slave: master_file: mysqlha-mysqlha-0-bin.000042, master_pos: 858044740
pod "mysqlha-mysqlha-2" not labeled
07:41:27-2020-08-22 Slave_SQL_Running is Not Yes
07:41:27-2020-08-22 Defualt-Mode-Set-Slave: Label mysqlha-mysqlha-2 to slave: master_file: mysqlha-mysqlha-0-bin.000042, master_pos: 858061796
pod "mysqlha-mysqlha-2" not labeled
07:41:34-2020-08-22 Slave_SQL_Running is Not Yes
07:41:34-2020-08-22 Defualt-Mode-Set-Slave: Label mysqlha-mysqlha-2 to slave: master_file: mysqlha-mysqlha-0-bin.000042, master_pos: 858076207
MySQL cluster status. On the xx.xx.xx.55 slave, Slave_SQL_Running is No and Exec_Master_Log_Pos lags behind Read_Master_Log_Pos, so the SQL thread has stopped:
[root@pass02 ~]# master_pod=$(kubectl get pod -n acs-system -l mysql-role=master -o wide | grep -v NAME | awk '{print $6}');echo "MASTER STATUS:";mysql -h $master_pod -uroot -paliyun_cos --connect-timeout 2 -e "SHOW MASTER STATUS\G" | egrep '(File|Position)';for pod in $(kubectl get pod -n acs-system -l mysql-role=slave -o wide | grep -v NAME | awk '{print $6}'); do echo "SLAVE ${pod} STATUS:"; mysql -h $pod -uroot -paliyun_cos --connect-timeout 2 -e "SHOW SLAVE STATUS\G" | egrep '(Master_Log_File|Read_Master_Log_Pos|Relay_Log_File|Relay_Log_Pos|Relay_Master_Log_File|Slave_SQL_Running|Slave_IO_Running|Exec_Master_Log_Pos)'; done
MASTER STATUS:
File: mysqlha-mysqlha-0-bin.000044
Position: 500869724
SLAVE xx.xx.xx.54 STATUS:
Master_Log_File: mysqlha-mysqlha-0-bin.000044
Read_Master_Log_Pos: 500872009
Relay_Log_File: mysqlha-mysqlha-1-relay-bin.000040
Relay_Log_Pos: 500872246
Relay_Master_Log_File: mysqlha-mysqlha-0-bin.000044
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Exec_Master_Log_Pos: 500872009
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
SLAVE xx.xx.xx.55 STATUS:
Master_Log_File: mysqlha-mysqlha-0-bin.000044
Read_Master_Log_Pos: 500872009
Relay_Log_File: mysqlha-mysqlha-2-relay-bin.000002
Relay_Log_Pos: 332
Relay_Master_Log_File: mysqlha-mysqlha-0-bin.000044
Slave_IO_Running: Yes
Slave_SQL_Running: No
Exec_Master_Log_Pos: 500858798
Slave_SQL_Running_State:

II. Fault Resolution

The steps to rebuild the slave are as follows.

1. Determine the master address
[root@pass02 ~]# kubectl get pod -n acs-system -l mysql-role=master -o custom-columns=MySQL-IP:.status.podIP | sed -n '$p'
xx.xx.xx.44  //production environment IPs are masked
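The backup command in the next step references $mysql_ip and $backup_file, which are not shown being set; a minimal sketch of how they could be defined (variable names taken from the commands that follow, timestamp format assumed):

mysql_ip=$(kubectl get pod -n acs-system -l mysql-role=master -o custom-columns=MySQL-IP:.status.podIP | sed -n '$p')
backup_file="backup-$(date +%Y-%m-%d-%H-%M).sql"    # same naming pattern as backup-2020-08-31-10-08.sql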
2. Run the backup (--master-data=2 records the master's binlog file and position as a commented CHANGE MASTER TO statement in the dump, which step 5 parses out)
[root@pass02 ~]# mysqldump -h$mysql_ip -p --single-transaction --all-databases --events --routines --flush-logs --master-data=2 > $backup_file
Enter password:
[root@pass02 ~]# ll backup-2020-08-31-10-08.sql
-rw-r--r-- 1 root root 277113942 Aug 31 10:47 backup-2020-08-31-10-08.sql
3. Pause the HA component (so the sentinel does not attempt relabeling or failover while the slave is being rebuilt)
[root@pass02 ~]# kubectl scale --current-replicas=1 --replicas=0 deployment/mysqlha-sentinel -n acs-system
deployment.extensions/mysqlha-sentinel scaled
[root@pass02 ~]# kubectl get pods -n acs-system |grep mysqlha
mysqlha-mysqlha-0 1/1 Running 4 159d
mysqlha-mysqlha-1 1/1 Running 2 80d
mysqlha-mysqlha-2 1/1 Running 2 80d
mysqlha-phpmyadmin-7566586968-rffjj 1/1 Running 4 159d
mysqlha-sentinel-558bd59f88-8q9l5 1/1 Terminating 2 80d
[root@pass02 ~]# kubectl get pods -n acs-system |grep mysqlha //confirm it has stopped: the mysqlha-sentinel pod is gone
mysqlha-mysqlha-0 1/1 Running 4 159d
mysqlha-mysqlha-1 1/1 Running 2 80d
mysqlha-mysqlha-2 1/1 Running 2 80d
mysqlha-phpmyadmin-7566586968-rffjj 1/1 Running 4 159d
4. Find the slave that is inconsistent with the master
[root@pass02 ~]# slave_ips=$(kubectl get pod -n acs-system -l mysql-role=slave -o custom-columns=MySQL-IP:.status.podIP | grep -v MySQL-IP)
[root@pass02 ~]# echo $slave_ips
xx.xx.xx.54 xx.xx.xx.55
[root@pass02 ~]# slave_ip_1=$(echo $slave_ips | awk '{print $1}')
[root@pass02 ~]# echo $slave_ip_1
xx.xx.xx.54
[root@pass02 ~]# mysql -p -h$slave_ip_1 -e "show slave status\G" | grep -w -E "(Slave_IO_Running)|(Slave_SQL_Running)"
Enter password:
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
[root@pass02 ~]# slave_ip_2=$(echo $slave_ips | awk '{print $2}')
[root@pass02 ~]# echo $slave_ip_2
xx.xx.xx.55
[root@pass02 ~]# mysql -p -h$slave_ip_2 -e "show slave status\G" | grep -w -E "(Slave_IO_Running)|(Slave_SQL_Running)"
Enter password:
Slave_IO_Running: Yes
Slave_SQL_Running: No
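The per-slave checks above can also be run as a single loop over $slave_ips (a sketch; -p prompts for the password once per host):

for ip in $slave_ips; do
  echo "SLAVE $ip:"
  mysql -p -h$ip -e "show slave status\G" | grep -w -E "(Slave_IO_Running)|(Slave_SQL_Running)"
done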
5. Restore the slave from the backup
[root@pass02 ~]# repl_info=$(cat $backup_file | grep "CHANGE MASTER TO")
[root@pass02 ~]# master_file=$(echo $repl_info | awk -F',' '{print $1}' | awk -F'=' '{print $2}')
[root@pass02 ~]# master_pos=$(echo $repl_info | awk -F',' '{print $2}' | awk -F'=' '{print $2}' | sed 's/.$//' | awk -F';' '{print $1}')
[root@pass02 ~]# echo $master_file $master_pos
'mysqlha-mysqlha-0-bin.000045' 154
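//For reference, the line parsed above is the commented-out CHANGE MASTER TO statement that --master-data=2 writes near the top of the dump; with the values echoed above it looks like:
//-- CHANGE MASTER TO MASTER_LOG_FILE='mysqlha-mysqlha-0-bin.000045', MASTER_LOG_POS=154;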
[root@pass02 ~]# mysql -h$slave_ip_2 -p -e "STOP SLAVE; RESET SLAVE;"
[root@pass02 ~]# mysql -p -h$slave_ip_2 -e "show slave status\G" | grep -w -E "(Slave_IO_Running)|(Slave_SQL_Running)"
Enter password:
Slave_IO_Running: No
Slave_SQL_Running: No
[root@pass02 ~]# cmd=$(mysql -p -h$slave_ip_2 -e "select group_concat(concat('kill ',id) separator '; ') as cmd from information_schema.processlist where user not in ('root', 'system user');" | grep -v "cmd\|NULL")
[root@pass02 ~]# mysql -p -h$slave_ip_2 -e "$cmd"
[root@pass02 ~]# echo $backup_file
backup-2020-08-31-10-08.sql
[root@pass02 ~]# mysql -h$slave_ip_2 -p < $backup_file
[root@pass02 ~]# mysql -h$slave_ip_2 -p -e "change master to master_host='mysqlha-mysqlha-0.mysqlha-mysqlha',master_user='aliyun_cos',master_password='aliyun_cos',master_log_file=$master_file,master_log_pos=$master_pos;"
[root@pass02 ~]# mysql -h$slave_ip_2 -p -e "set global read_only=1; start slave;"
6. Verify the result
[root@pass02 ~]# master_pod=$(kubectl get pod -n acs-system -l mysql-role=master -o wide | grep -v NAME | awk '{print $6}');echo "MASTER STATUS:";mysql -h $master_pod -uroot -paliyun_cos --connect-timeout 2 -e "SHOW MASTER STATUS\G" | egrep '(File|Position)';for pod in $(kubectl get pod -n acs-system -l mysql-role=slave -o wide | grep -v NAME | awk '{print $6}'); do
> echo "SLAVE ${pod} STATUS:"
> mysql -h $pod -uroot -paliyun_cos --connect-timeout 2 -e "SHOW SLAVE STATUS\G" | egrep '(Master_Log_File|Read_Master_Log_Pos|Relay_Log_File|Relay_Log_Pos|Relay_Master_Log_File|Slave_SQL_Running|Slave_IO_Running|Exec_Master_Log_Pos)'
> done
MASTER STATUS:
File: mysqlha-mysqlha-0-bin.000045 //binlog file matches
Position: 1808815
SLAVE xx.xx.xx.54 STATUS:
Master_Log_File: mysqlha-mysqlha-0-bin.000045 //binlog file matches
Read_Master_Log_Pos: 1808815
Relay_Log_File: mysqlha-mysqlha-1-relay-bin.000042
Relay_Log_Pos: 1809052
Relay_Master_Log_File: mysqlha-mysqlha-0-bin.000045
Slave_IO_Running: Yes //both the SQL and IO threads are Yes
Slave_SQL_Running: Yes
Exec_Master_Log_Pos: 1808815 //last executed binlog position matches the master
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
SLAVE xx.xx.xx.55 STATUS:
Master_Log_File: mysqlha-mysqlha-0-bin.000045 //binlog file matches
Read_Master_Log_Pos: 1808815
Relay_Log_File: mysqlha-mysqlha-2-relay-bin.000002
Relay_Log_Pos: 1808993
Relay_Master_Log_File: mysqlha-mysqlha-0-bin.000045
Slave_IO_Running: Yes //both the SQL and IO threads are Yes
Slave_SQL_Running: Yes
Exec_Master_Log_Pos: 1808815 //last executed binlog position matches the master
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
7. Restore the HA component
[root@pass02 ~]# kubectl scale --current-replicas=0 --replicas=1 deployment/mysqlha-sentinel -n acs-system
deployment.extensions/mysqlha-sentinel scaled
[root@pass02 ~]# kubectl get pods -n acs-system |grep mysqlha
mysqlha-mysqlha-0 1/1 Running 4 159d
mysqlha-mysqlha-1 1/1 Running 2 80d
mysqlha-mysqlha-2 1/1 Running 2 80d
mysqlha-phpmyadmin-7566586968-rffjj 1/1 Running 4 159d
mysqlha-sentinel-558bd59f88-q7bqt 1/1 Running 0 60s
[root@pass02 ~]# kubectl logs mysqlha-sentinel-558bd59f88-q7bqt -n acs-system
03:01:01-2020-08-31 Starting to running Sentinel...
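Once the sentinel is running again, it is also worth confirming that the pods carry the expected mysql-role labels (a check added here for illustration, using the same command style as in Part III):

kubectl get pod -n acs-system -L mysql-role | grep mysqlha-mysqlha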
8. The mysqlha-mysqlha-2 log no longer shows errors
2020-08-31T02:59:03.068963Z 1900418 [Note] Slave SQL thread for channel '' initialized, starting replication in log 'mysqlha-mysqlha-0-bin.000045' at position 154, relay log './mysqlha-mysqlha-2-relay-bin.000001' position: 4
2020-08-31T02:59:03.070742Z 1900417 [Note] Slave I/O thread for channel '': connected to master 'aliyun_cos@mysqlha-mysqlha-0.mysqlha-mysqlha:3306',replication started in log 'mysqlha-mysqlha-0-bin.000045' at position 154

III. Other Related Faults

The other faults listed here were handled much like the one above; all of them are slave-node problems.

The slave pod is in CrashLoopBackOff, and only one slave is present even though the StatefulSet's desired replica count is 3. For stateful applications, Kubernetes recommends managing the pods with a StatefulSet controller; the StatefulSet status is shown below:

1. Log in to slave node pass10 and remove the mysql data directory (in practice it is moved aside as a backup, see below). pass10 is the node hosting mysqlha-mysqlha-1, which is a slave, so the directory can be removed safely; once the pod is restarted and the slave rebuilt, the data is restored from the master.
[root@pass02 ~]# kubectl get pod -n acs-system -L mysql-role -o wide|grep mysqlha-mysqlha
mysqlha-mysqlha-0 1/1 Running 4 77d 150.0.172.49 pass02 <none> master
mysqlha-mysqlha-1 0/1 CrashLoopBackOff 2 77d 150.0.172.54 pass10 <none> slave
2. Check the StatefulSet status
[root@pass02 ~]# kubectl get statefulset -n acs-system
NAME DESIRED CURRENT AGE
elasticsearch 3 3 78d
mysqlha-mysqlha 3 2 78d
//desired replicas: 3, current replicas: 2; this is clearly abnormal
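A likely reason only two replicas exist: with the default OrderedReady pod management policy, the StatefulSet controller will not (re)create mysqlha-mysqlha-2 until mysqlha-mysqlha-1 is Running and Ready, so the crash-looping slave blocks the missing replica. The controller's view can be inspected with (a command added here for illustration):

kubectl describe statefulset mysqlha-mysqlha -n acs-system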
3. Check the status of the individual pods
[root@pass02 ~]# kubectl get pod -n acs-system -L mysql-role |grep mysqlha-mysqlha
mysqlha-mysqlha-0 1/1 Running 2 77d master
mysqlha-mysqlha-1 0/1 CrashLoopBackOff 8168 77d slave
//the pod keeps failing and has restarted 8168 times; fortunately the master is unaffected
4. The MySQL log shows the following errors
kubectl logs mysqlha-mysqlha-1 -n acs-system
2020-06-11T01:37:42.899292Z 0 [ERROR] InnoDB: Page [page id: space=0, page number=6] log sequence number 5793633031 is in the future! Current system log sequence number 5793624231.
2020-06-11T01:37:42.899312Z 0 [ERROR] InnoDB: Your database may be corrupt or you may have copied the InnoDB tablespace but not the InnoDB log files. Please refer to http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html for information about forcing recovery.
2020-06-11T01:37:42.899520Z 0 [ERROR] InnoDB: Page [page id: space=0, page number=45] log sequence number 5793642104 is in the future! Current system log sequence number 5793624231.
2020-06-11T01:37:42.899530Z 0 [ERROR] InnoDB: Your database may be corrupt or you may have copied the InnoDB tablespace but not the InnoDB log files. Please refer to http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html for information about forcing recovery.
2020-06-11T01:37:42.899598Z 0 [ERROR] InnoDB: Page [page id: space=0, page number=274] log sequence number 5793631648 is in the future! Current system log sequence number 5793624231.
2020-06-11T01:37:42.899606Z 0 [ERROR] InnoDB: Your database may be corrupt or you may have copied the InnoDB tablespace but not the InnoDB log files. Please refer to http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html for information about forcing recovery.
2020-06-11T01:37:42.899755Z 0 [ERROR] InnoDB: Page [page id: space=0, page number=46] log sequence number 5793633070 is in the future! Current system log sequence number 5793624231.
2020-06-11T01:37:42.899764Z 0 [ERROR] InnoDB: Your database may be corrupt or you may have copied the InnoDB tablespace but not the InnoDB log files. Please refer to http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html for information about forcing recovery.
2020-06-11T01:37:42.899830Z 0 [ERROR] InnoDB: Page [page id: space=0, page number=277] log sequence number 5793642349 is in the future! Current system log sequence number 5793624231.
2020-06-11T01:37:42.899838Z 0 [ERROR] InnoDB: Your database may be corrupt or you may have copied the InnoDB tablespace but not the InnoDB log files. Please refer to http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html for information about forcing recovery.
2020-06-11T01:37:42.899898Z 0 [ERROR] InnoDB: Page [page id: space=0, page number=532] log sequence number 5793633070 is in the future! Current system log sequence number 5793624231.
2020-06-11T01:37:42.899906Z 0 [ERROR] InnoDB: Your database may be corrupt or you may have copied the InnoDB tablespace but not the InnoDB log files. Please refer to http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html for information about forcing recovery.


Following the hint in the log, see the official documentation for the fix: http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
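For reference, the documented forced-recovery approach is to start mysqld with innodb_force_recovery set in my.cnf (values of 4 or greater can permanently corrupt data files) and then dump whatever can still be read. Because this node is only a slave, the simpler route taken below is to discard its data directory and rebuild it from the master. The setting would look like:

[mysqld]
innodb_force_recovery = 1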

The recovery steps are as follows:

1. Log in to the server that holds the failed pod's persistent data directory
[root@pass02 ~]# ssh root@pass10
[root@pass10 ~]# ls /var/lib/docker/mysqlha/mysql //this directory is the MySQL installation and data directory
[root@pass10 ~]# mv /var/lib/docker/mysqlha/mysql /var/lib/docker/mysqlha/mysql-20200611-bak
[root@pass02 ~]# kubectl delete pod -n acs-system mysqlha-mysqlha-1
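After the delete, the StatefulSet controller recreates mysqlha-mysqlha-1 with an empty data directory; it can be watched coming back with (a check added here for illustration):

kubectl get pod mysqlha-mysqlha-1 -n acs-system -w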


2. Back up the master database
[root@pass02 ~]# backup_file="backup-`date +%Y-%m-%d-%H-%M`.sql"
[root@pass02 ~]# mysqldump -h $mysql_ip -p --single-transaction --all-databases --events --routines --flush-logs --master-data=2 > $backup_file


3. MySQL status at this point
[root@pass02 ~]# kubectl get pod -n acs-system -L mysql-role |grep mysqlha-mysqlha
mysqlha-mysqlha-0 1/1 Running 2 77d master
mysqlha-mysqlha-1 1/1 Running 0 7m14s slave
mysqlha-mysqlha-2 1/1 Running 0 6m49s slave


4. Re-sync the database; from here on the steps are the same as the recovery steps in Part II.


