前言
此案例介绍半同步复制环境中,因为主库binary log index file里面出现两个相同binlog文件名,导致该实例的slave库I/O线程停止。
1 环境信息
自建MySQL:5.7.22
OS 版本:CentOS Linux release 7.6
2 错误描述
早晨接到报警,提示报表库(见上面拓扑图)I/O线程异常。
2.1 登录报表库,查看复制状态
[ root@报表库:~ ]# mysql
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 132984
Server version: 5.7.22-log MySQL Community Server (GPL)
Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> show slave status\G;
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: 备份库地址
Master_User: repl_user
Master_Port: 3306
Connect_Retry: 1
Master_Log_File: mysql_bin.003427
Read_Master_Log_Pos: 4
Relay_Log_File: relay.001627
Relay_Log_Pos: 241
Relay_Master_Log_File: mysql_bin.003427
Slave_IO_Running: No
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 4
Relay_Log_Space: 525
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Could not open log file'
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 10002
Master_UUID: 85cb9384-4db4-11ec-b654-fa1b79c33900
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp: 221021 04:29:36
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 8077181f-4db4-11ec-8225-fa417cfd7e00:169899218-205014272
Executed_Gtid_Set: 3c07e8e0-399a-11ed-9be9-5254006919a1:1-156029,
8077181f-4db4-11ec-8225-fa417cfd7e00:1-205014272,
88958222-4db4-11ec-b637-faa9e751b600:1-243629236
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.00 sec)
根据报错,提示打不开binlog文件,报错时间为:221021 04:29:36。
2.2 查看对应时间报表库错误日志:
2022-10-21T04:29:36.176599+08:00 1 [ERROR] Error reading packet from server for channel '': Could not open log file (server_errno=1236)
2022-10-21T04:29:36.176638+08:00 1 [ERROR] Slave I/O for channel '': Got fatal error 1236 from master when reading data from binary log: 'Could not open log file', Error_code: 1
236
2022-10-21T04:29:36.176642+08:00 1 [Note] Slave I/O thread exiting for channel '', read up to log 'mysql_bin.003427', position 4
可以看到读到mysql_bin.003427文件position 4的时候报错。
3 排查
3.1 登录备份库,查看binlog
[ root@备份库:/xxxxxx/binlog ]# ll
total 247390320
.................................省略.............................
.................................省略.............................
-rw-r----- 1 mysql mysql 1080514038 Oct 21 04:23 mysql_bin.003425
-rw-r----- 1 mysql mysql 1080022172 Oct 21 04:29 mysql_bin.003426
-rw-r----- 1 mysql mysql 1073745885 Oct 21 10:45 mysql_bin.003427
-rw-r----- 1 mysql mysql 1073742926 Oct 21 14:56 mysql_bin.003428
-rw-r----- 1 mysql mysql 1073749274 Oct 21 16:56 mysql_bin.003429
-rw-r----- 1 mysql mysql 147230918 Oct 21 17:32 mysql_bin.003430
-rw-r----- 1 mysql mysql 12642 Oct 21 16:56 mysql_bin.index
[ root@备份库:/xxxxxxx/binlog ]# cat mysql_bin.index
.................................省略.............................
.................................省略.............................
/xxxxxx/mysql_bin.003425
/xxxxxx/mysql_bin.003426
/xxxxxx/mysql_bin.003427
/xxxxxx/mysql_bin.003427
看到备份库正在写mysql_bin.003427文件,且只有一个。但是mysql_bin.index里面记录了两个mysql_bin.003427文件。
3.2 查看错误日志
2022-10-21T04:29:36.171970+08:00 815671 [ERROR] Failed to open log (file '/xxxxxx/binlog/mysql_bin.003427', errno 2)
2022-10-21T04:29:36.171982+08:00 669961 [ERROR] Failed to open log (file '/xxxxxx/binlog/mysql_bin.003427', errno 2)
2022-10-21T04:29:36.172200+08:00 815671 [Note] Stop asynchronous binlog_dump to slave (server_id: 1)
2022-10-21T04:29:36.172200+08:00 669961 [Note] Stop asynchronous binlog_dump to slave (server_id: 330647)
可以看到报错和从库一致。
3.3 解析备份库binlog
根据2.1步骤,报表库/从库最后一个事务gtid为: '8077181f-4db4-11ec-8225-fa417cfd7e00:205014272'
先不考虑为什么有两个 mysql_bin.003427 文件,先解析备份库binlog,
mysql_bin.003427日志的第一个事物gtid为:'8077181f-4db4-11ec-8225-fa417cfd7e00:205221231'
[ root@备份库:/xxxxxx/binlog ]# mysqlbinlog --base64-output=decode-rows -v -v mysql_bin.003427|more
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#221021 8:15:54 server id 10002 end_log_pos 123 CRC32 0xd51f2be9 Start: binlog v 4, server v 5.7.22-log created 221021 8:15:54
# at 123
#221021 8:15:54 server id 10002 end_log_pos 234 CRC32 0x7d2337cf Previous-GTIDs
# 8077181f-4db4-11ec-8225-fa417cfd7e00:1-205221230,
# 88958222-4db4-11ec-b637-faa9e751b600:1-243629236
# at 234
#221021 8:15:54 server id 10003 end_log_pos 299 CRC32 0xcaae95a6 GTID last_committed=0 sequence_number=1 rbr_only=yes
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;
SET @@SESSION.GTID_NEXT= '8077181f-4db4-11ec-8225-fa417cfd7e00:205221231'/*!*/;
# at 299
上一个日志mysql_bin.003426的最后一个事物gtid为:'8077181f-4db4-11ec-8225-fa417cfd7e00:205014272'
可以看到,两个binlog日志文件差了 205221231-205014272 将近21万个gtid事务。
4 原因
之前有人复现过,主库手动删除在用的binlog日志文件+延迟 会复现上面的报错,参考文章:
文章链接,转载注明出处:https://cdn.modb.pro/db/401771
简单说就是手动删除主库在用的binlog文件,会一定概率导致从库I/O线程停止。
但是此次问题不是这种情形。从拓扑图可以看到,出现两个同名binlog的虽然是主库,但是是线上的备份库,只读的,不会有人手动清理binlog,并且日志清理默认是15天。
mysql> show variables like '%expire%';
+--------------------------------+-------+
| Variable_name | Value |
+--------------------------------+-------+
| disconnect_on_expired_password | ON |
| expire_logs_days | 15 |
+--------------------------------+-------+
2 rows in set (0.00 sec)
备份库每天凌晨1点物理备份全库,应该不是这个原因。
还有一个就是,发现该实例一直在新启动一个新的mysql进程(间隔大约是100秒一次),但是一直失败。不知道是不是这个原因(后面通过停止pid为73985的进程解决这个问题,没有再出现启动新进程)
[ root@备份库:/etc/systemd/system ]# ps -ef|grep mysqld
mysql 6708 1 5 Jun24 ? 6-00:05:03 /usr/local/mysql57/sbin/mysqld --defaults-file=/etc/mysql57/my_3306.cnf --basedir=/usr/local/mysql57 --plugin-dir=/usr/local/mysql57/lib64/mysql/plugin
mysql 73985 1 3 14:23 ? 00:00:01 /usr/local/mysql57/sbin/mysqld --defaults-file=/etc/mysql57/my_3306.cnf --basedir=/usr/local/mysql57 --plugin-dir=/usr/local/mysql57/lib64/mysql/plugin
root 74130 69736 0 14:24 pts/2 00:00:00 grep --colour mysqld
错误日志:
2022-10-21T03:50:07.523347+08:00 0 [Note] /usr/local/mysql57/sbin/mysqld (mysqld 5.7.22-log) starting as process 109889 ...
2022-10-21T03:50:07.527315+08:00 0 [Warning] option 'innodb-sort-buffer-size': unsigned value 134217728 adjusted to 67108864
2022-10-21T03:50:07.527829+08:00 0 [Note] InnoDB: PUNCH HOLE support available
2022-10-21T03:50:07.527858+08:00 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2022-10-21T03:50:07.527865+08:00 0 [Note] InnoDB: Uses event mutexes
2022-10-21T03:50:07.527870+08:00 0 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2022-10-21T03:50:07.527875+08:00 0 [Note] InnoDB: Compressed tables use zlib 1.2.3
2022-10-21T03:50:07.527881+08:00 0 [Note] InnoDB: Using Linux native AIO
2022-10-21T03:50:07.531363+08:00 0 [Note] InnoDB: Number of pools: 1
2022-10-21T03:50:07.531516+08:00 0 [Note] InnoDB: Using CPU crc32 instructions
2022-10-21T03:50:07.535524+08:00 0 [Note] InnoDB: Initializing buffer pool, total size = 8G, instances = 16, chunk size = 128M
2022-10-21T03:50:07.535555+08:00 0 [Note] InnoDB: Setting NUMA memory policy to MPOL_INTERLEAVE
2022-10-21T03:50:08.222880+08:00 0 [Note] InnoDB: Setting NUMA memory policy to MPOL_DEFAULT
2022-10-21T03:50:08.222925+08:00 0 [Note] InnoDB: Completed initialization of buffer pool
2022-10-21T03:50:08.434086+08:00 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2022-10-21T03:50:08.434440+08:00 0 [ERROR] InnoDB: Unable to lock ./ibdata1 error: 11
2022-10-21T03:50:08.434463+08:00 0 [Note] InnoDB: Check that you do not already have another mysqld process using the same InnoDB data or log files.
2022-10-21T03:50:08.434470+08:00 0 [Note] InnoDB: Retrying to lock the first data file
2022-10-21T03:50:09.434584+08:00 0 [ERROR] InnoDB: Unable to lock ./ibdata1 error: 11
主机日志:
Oct 21 14:20:04 systemd[1]: mysql57@3306.service: main process exited, code=exited, status=1/FAILURE
Oct 21 14:20:04 kill[73396]: Usage:
Oct 21 14:20:04 kill[73396]: kill [options] <pid|name> [...]
Oct 21 14:20:04 kill[73396]: Options:
Oct 21 14:20:04 kill[73396]: -a, --all do not restrict the name-to-pid conversion to processes
Oct 21 14:20:04 kill[73396]: with the same uid as the present process
Oct 21 14:20:04 kill[73396]: -s, --signal <sig> send specified signal
Oct 21 14:20:04 kill[73396]: -q, --queue <sig> use sigqueue(2) rather than kill(2)
Oct 21 14:20:04 kill[73396]: -p, --pid print pids without signaling them
Oct 21 14:20:04 kill[73396]: -l, --list [=<signal>] list signal names, or convert one to a name
Oct 21 14:20:04 kill[73396]: -L, --table list signal names and numbers
Oct 21 14:20:04 kill[73396]: -h, --help display this help and exit
Oct 21 14:20:04 kill[73396]: -V, --version output version information and exit
Oct 21 14:20:04 kill[73396]: For more details see kill(1).
Oct 21 14:20:04 systemd[1]: mysql57@3306.service: control process exited, code=exited status=1
Oct 21 14:20:04 systemd[1]: Unit mysql57@3306.service entered failed state.
Oct 21 14:20:04 systemd[1]: mysql57@3306.service failed.
Oct 21 14:20:04 systemd[1]: mysql57@3306.service holdoff time over, scheduling restart.
Oct 21 14:20:04 systemd[1]: Stopped MySQL Community Server.
Oct 21 14:20:04 systemd[1]: Starting MySQL Community Server...
Oct 21 14:20:05 systemd[1]: Started MySQL Community Server.
5 解决方案
既然备份库的binlog中缺失了gtid,并且缺失的gtid区间为:205014273-205221230 。那就解析线上主库对应时间段的binlog日志文件,确定对应gtid的postion,然后通过mysqlbinlog --start-positon和--stop-position生成sql文件,到报表库执行即可。
执行期间,通过show master status;可以看到Executed_Gtid_Set的事务id一直在增长。
最后start slave即可。