MySQL主从复制 - 常见故障与处理办法
[TOC]
测试环境版本:
[root@master data]# mysql -V
mysql Ver 8.0.30 for Linux on x86_64 (MySQL Community Server - GPL)
一、主从同步报错
这种情况常发生于从库没有配置super_read_only=1,然后错误地在从库修改了数据,导致从库与主库数据不一致。
要解决这类问题,通常需要在从库执行反向操作,比如删掉这些错误增改的数据,让主从数据恢复到之前一致状态。当然也可以先忽略错误,继续同步后面的数据。
1.1 常见错误:1062主键冲突
1.1.1 模拟错误
- 主库建表,插入数据
mysql> use testdb1;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> create table t1 (id int primary key,val varchar(20));
Query OK, 0 rows affected (0.02 sec)
mysql> insert into t1 values(1,'hi');
Query OK, 1 row affected (0.00 sec)
- salve检查同步正常
mysql> use tesdb1;
ERROR 1049 (42000): Unknown database 'tesdb1'
mysql> use testdb1;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql>
mysql> select * from t1;
+----+------+
| id | val |
+----+------+
| 1 | hi |
+----+------+
1 row in set (0.00 sec)
- slave 手动添加一条数据
mysql> insert into t1 values(2,'hello');
Query OK, 1 row affected (0.00 sec)
- 主库继续添加数据
mysql> insert into t1 values(2,'ni hao');
Query OK, 1 row affected (0.01 sec)
- slave检查同步报错
mysql> show slave status \G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for source to send event
Master_Host: 172.17.0.2
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: master-bin.000002
Read_Master_Log_Pos: 1430
Relay_Log_File: slave-relay-bin.000002
Relay_Log_Pos: 1235
Relay_Master_Log_File: master-bin.000002
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB: information_schema,performance_schema,mysql,sys
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1062
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '2a94bc88-5018-11ed-9277-0242ac110002:5' at master log master-bin.000002, end_log_pos 1399. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Skip_Counter: 0
Exec_Master_Log_Pos: 1136
Relay_Log_Space: 1739
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1062
Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '2a94bc88-5018-11ed-9277-0242ac110002:5' at master log master-bin.000002, end_log_pos 1399. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Replicate_Ignore_Server_Ids:
Master_Server_Id: 80303306
Master_UUID: 2a94bc88-5018-11ed-9277-0242ac110002
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp: 221020 10:15:17
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 2a94bc88-5018-11ed-9277-0242ac110002:2-5
Executed_Gtid_Set: 2a94bc88-5018-11ed-9277-0242ac110002:1-4,
2c1a941e-5018-11ed-90e0-0242ac110003:1-2
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
Master_public_key_path:
Get_master_public_key: 0
Network_Namespace:
1 row in set, 1 warning (0.01 sec)
ERROR:
No query specified
结论:可以看到抛出如下错误:
Slave_SQL_Running: No Last_SQL_Errno: 1062 Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '2a94bc88-5018-11ed-9277-0242ac110002:5' at master log master-bin.000002, end_log_pos 1399. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
1.1.2 解决办法:Slave端反向处理错误数据
- 根据Last_Error中的报错信息获取具体出错的SQL
方法1:通过主库binlog查询
-- 从上面报错可以看到,错误的事务是 master log master-bin.000002, end_log_pos 1399
[root@master data]# mysqlbinlog -v --base64-output=DECODE-ROWS master-bin.000002|grep 'end_log_pos 1399' -C 6
SET TIMESTAMP=1666232117/*!*/;
BEGIN
/*!*/;
# at 1293
#221020 10:15:17 server id 80303306 end_log_pos 1352 CRC32 0x5a1ace2e Table_map: `testdb1`.`t1` mapped to number 166
# at 1352
#221020 10:15:17 server id 80303306 end_log_pos 1399 CRC32 0xd290df5a Write_rows: table id 166 flags: STMT_END_F
### INSERT INTO `testdb1`.`t1`
### SET
### @1=2
### @2='ni hao'
# at 1399
#221020 10:15:17 server id 80303306 end_log_pos 1430 CRC32 0x4b78fb88 Xid = 119
结论: 可以看到end_log_pos 1399 对应的事务就是 INSERT INTO testdb1
.t1
SET @1=2 @2=‘ni hao’,这一行数据在slave端无法插入,因为主键冲突,所以删除slave端主键冲突的这一行数据即可。
方法2 :slave端通过表查询
Last_SQL_Errno: 1062
mysql> select * from performance_schema.replication_applier_status_by_worker where LAST_ERROR_NUMBER=1062\G;
*************************** 1. row ***************************
CHANNEL_NAME:
WORKER_ID: 1
THREAD_ID: NULL
SERVICE_STATE: OFF
LAST_ERROR_NUMBER: 1062
LAST_ERROR_MESSAGE: Worker 1 failed executing transaction '2a94bc88-5018-11ed-9277-0242ac110002:5' at master log master-bin.000002, end_log_pos 1399; Could not execute Write_rows event on table testdb1.t1; Duplicate entry '2' for key 't1.PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log master-bin.000002, end_log_pos 1399
LAST_ERROR_TIMESTAMP: 2022-10-20 10:15:17.969478
LAST_APPLIED_TRANSACTION: 2a94bc88-5018-11ed-9277-0242ac110002:4
LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 2022-10-20 10:13:41.920825
LAST_APPLIED_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 2022-10-20 10:13:41.920825
LAST_APPLIED_TRANSACTION_START_APPLY_TIMESTAMP: 2022-10-20 10:13:41.921418
LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP: 2022-10-20 10:13:41.936177
APPLYING_TRANSACTION: 2a94bc88-5018-11ed-9277-0242ac110002:5
APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 2022-10-20 10:15:17.968554
APPLYING_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 2022-10-20 10:15:17.968554
APPLYING_TRANSACTION_START_APPLY_TIMESTAMP: 2022-10-20 10:15:17.969275
LAST_APPLIED_TRANSACTION_RETRIES_COUNT: 0
LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER: 0
LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE:
LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP: 0000-00-00 00:00:00.000000
APPLYING_TRANSACTION_RETRIES_COUNT: 0
APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER: 0
APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE:
APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP: 0000-00-00 00:00:00.000000
1 row in set (0.00 sec)
ERROR:
No query specified
结论: 可以看到 Could not execute Write_rows event on table testdb1.t1; Duplicate entry ‘2’ for key ‘t1.PRIMARY’,,因为主键冲突,且冲突的是主键=2的这一行,所以删除slave端主键冲突的这一行数据即可。
- 定位出错误语句后,在从库执行反向操作,我们这里删除主键冲突的这一行,然后重新启动slave进程。
mysql> delete from t1 where id=2;
Query OK, 1 row affected (0.00 sec)
mysql> start slave;
Query OK, 0 rows affected, 1 warning (0.01 sec)
mysql> show slave status\G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for source to send event
Master_Host: 172.17.0.2
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: master-bin.000002
Read_Master_Log_Pos: 1430
Relay_Log_File: slave-relay-bin.000002
Relay_Log_Pos: 1529
Relay_Master_Log_File: master-bin.000002
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB: information_schema,performance_schema,mysql,sys
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 1430
Relay_Log_Space: 1739
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 80303306
Master_UUID: 2a94bc88-5018-11ed-9277-0242ac110002
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 2a94bc88-5018-11ed-9277-0242ac110002:2-5
Executed_Gtid_Set: 2a94bc88-5018-11ed-9277-0242ac110002:1-5,
2c1a941e-5018-11ed-90e0-0242ac110003:1-3
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
Master_public_key_path:
Get_master_public_key: 0
Network_Namespace:
1 row in set, 1 warning (0.01 sec)
ERROR:
No query specified
mysql> select * from t1;
+----+--------+
| id | val |
+----+--------+
| 1 | hi |
| 2 | ni hao |
+----+--------+
2 rows in set (0.00 sec)
结论:主从同步恢复正常。
1.2 常见错误:1032更改的数据不存在
1.2.1 模拟错误
- slave 手动删除一条数据
mysql> delete from t1 where id=2;
Query OK, 1 row affected (0.00 sec)
- 主库更新数据
mysql> update t1 set val='hello' where id=2;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
- salve 报错
mysql> show slave status\G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for source to send event
Master_Host: 172.17.0.2
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: master-bin.000002
Read_Master_Log_Pos: 1745
Relay_Log_File: slave-relay-bin.000002
Relay_Log_Pos: 1529
Relay_Master_Log_File: master-bin.000002
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB: information_schema,performance_schema,mysql,sys
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1032
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '2a94bc88-5018-11ed-9277-0242ac110002:6' at master log master-bin.000002, end_log_pos 1714. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Skip_Counter: 0
Exec_Master_Log_Pos: 1430
Relay_Log_Space: 2054
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1032
Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '2a94bc88-5018-11ed-9277-0242ac110002:6' at master log master-bin.000002, end_log_pos 1714. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Replicate_Ignore_Server_Ids:
Master_Server_Id: 80303306
Master_UUID: 2a94bc88-5018-11ed-9277-0242ac110002
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp: 221020 12:57:36
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 2a94bc88-5018-11ed-9277-0242ac110002:2-6
Executed_Gtid_Set: 2a94bc88-5018-11ed-9277-0242ac110002:1-5,
2c1a941e-5018-11ed-90e0-0242ac110003:1-4
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
Master_public_key_path:
Get_master_public_key: 0
Network_Namespace:
1 row in set, 1 warning (0.00 sec)
ERROR:
No query specified
结论:可以看到抛出如下错误:
Slave_SQL_Running: No Last_Errno: 1032 Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '2a94bc88-5018-11ed-9277-0242ac110002:6' at master log master-bin.000002, end_log_pos 1714. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
1.2.2 解决办法:跳过错误的事务
从slave status中可以看到执行报错的事务是failed executing transaction ‘2a94bc88-5018-11ed-9277-0242ac110002:6’,所以需要跳过该事务。
-- GTID模式下跳过事务
mysql> stop slave;
Query OK, 0 rows affected, 1 warning (0.01 sec)
mysql> set gtid_next='2a94bc88-5018-11ed-9277-0242ac110002:6';
Query OK, 0 rows affected (0.00 sec)
mysql> begin;commit;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql> set gtid_next='AUTOMATIC';
Query OK, 0 rows affected (0.00 sec)
mysql> start slave;
Query OK, 0 rows affected, 1 warning (0.01 sec)
mysql> show slave status \G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for source to send event
Master_Host: 172.17.0.2
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: master-bin.000002
Read_Master_Log_Pos: 1745
Relay_Log_File: slave-relay-bin.000003
Relay_Log_Pos: 463
Relay_Master_Log_File: master-bin.000002
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB: information_schema,performance_schema,mysql,sys
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 1745
Relay_Log_Space: 2360
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 80303306
Master_UUID: 2a94bc88-5018-11ed-9277-0242ac110002
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 2a94bc88-5018-11ed-9277-0242ac110002:2-6
Executed_Gtid_Set: 2a94bc88-5018-11ed-9277-0242ac110002:1-6,
2c1a941e-5018-11ed-90e0-0242ac110003:1-4
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
Master_public_key_path:
Get_master_public_key: 0
Network_Namespace:
1 row in set, 1 warning (0.01 sec)
ERROR:
No query specified
结论:跳过错误的事务后,恢复同步状态。
-- 主库继续插入数据
mysql> insert into t1 values(3,'good luck');
Query OK, 1 row affected (0.00 sec)
-- slave端同步正常
mysql> select * from t1;
+----+-----------+
| id | val |
+----+-----------+
| 1 | hi |
| 3 | good luck |
+----+-----------+
2 rows in set (0.00 sec)
1.3 其他主从报错解决办法
a. 设置参数 sql_slave_skip_counter 跳过当前错误
set global sql_slave_skip_counter=1; -- 1跳过当前出错事务
slave_skip_errors=all; --all跳过所有出错事务
注意: 在GTID模式下,不支持该方式。
-- 5.7版本执行会抛出如下错误:
mysql> set global sql_slave_skip_counter=1;
ERROR 1858 (HY000): sql_slave_skip_counter can not be set when the server is running with @@GLOBAL.GTID_MODE = ON. Instead, for each transaction that you want to skip, generate an empty transaction with the same GTID as the transaction
-- 8.0版本,不报错,但是有2 warnings,也不能解决问题。
mysql> set global sql_slave_skip_counter=1;
Query OK, 0 rows affected, 2 warnings (0.02 sec)
b. 设置参数slave_skip_errors 跳过指定类型的错误
使用GTID复制从库跳过错误,不支持设置sql_slave_skip_counter,但是支持设置参数slave_skip_errors(read only variable),且只能在参数文件中设置。
slave_skip_errors选项有四个可用值,分别为: off,all,ErorCode,ddl_exist_errors 。
my.cnf中的写法有:
slave_skip_errors=1062,1053
slave_skip_errors=all -- 跳过所有错误,不建议使用
slave_skip_errors=ddl_exist_errors
该参数不支持在线修改:
mysql> set global slave_skip_errors=1062;
ERROR 1238 (HY000): Variable 'slave_skip_errors' is a read only variable
c. 使用pt-slave-restart工具
示例1:跳过1个错误 ,ip 为从库ip地址
[root@mysql]# pt-slave-restart --user=root --password=xxx --host=172.17.0.3 --port=3307 --skip-count=1
示例2:跳过错误代码为1062的错误 ,ip 为从库ip地址
[root@mysql]# pt-slave-restart --user=root --password=xxx --host=172.17.0.3 --port=3307 --error-numbers=1062
二、主从数据不一致
导致主库数据不一致的根本原因就是直接在从库上进行增删改操作,比如上面测试中,我们通过跳过事务恢复同步,导致主库和备库数据不一致。
--主库:
mysql> select * from testdb1.t1;
+----+-----------+
| id | val |
+----+-----------+
| 1 | hi |
| 2 | hello |
| 3 | good luck |
+----+-----------+
3 rows in set (0.01 sec)
--slave:
mysql> select * from testdb1.t1;
+----+-----------+
| id | val |
+----+-----------+
| 1 | hi |
| 3 | good luck |
+----+-----------+
2 rows in set (0.00 sec)
解决办法
1. 通过mysqldump重新初始化表
略。。。
2. 使用pt-table-sync主从数据修复
pt-table-sync是Percona-Toolkit工具集中的一个组件,主要用于表数据的高效同步。
使用对两个库不一致的数据进行同步,他能够自动发现两个实例间不一致的数据,然后进行sync操作,但是pt-table-sync无法同步表结构和索引等对象,只能同步数据。
NOTE1: 如果是sync主从数据,只有当需要sync的表都有唯一键(主键或唯一索引),才能使用–sync-to-master and/or --replicate。当使用–replicate或者–sync-to-master参数来对从库数据进行同步时,如果主从数据不一致,不会直接在从库上修改不一致的数据,而是在主库上进行replace 操作(需要表有唯一键),通过binlog把动作传递给从库,从而达到修改从库数据的目的。当然在主库上做的修改应该是no-op的,并不会修改主库表的当前值,只对从库数据产生影响。
NOTE2: 没有唯一键,那么只能通过直接修改从库上的数据来实现主从数据的同步。(不能使用–replicate或者–sync-to-master)
没有主键的情况下,几种修复方式:
a) 同步所有的库和表
[root@mysql57 ~]# pt-table-sync --charset=utf8 --ignore-databases=mysql,sys,percona dsn=u=root,p=xxx,h=172.17.0.2,P=3306 dsn=u=root,p=xxx,h=172.17.0.3,P=3307 --execute --print --no-check-slave
b) 可以指定多个从库,命令里有多个ip,第一次出现的是Master的地址,第2个是Slave的地址,第3个也是slave地址…
[root@mysql57 ~]# pt-table-sync --charset=utf8 --ignore-databases=mysql,sys,percona dsn=u=root,p=xxx,h=172.17.0.2,P=3306 dsn=u=root,p=xxx,h=172.17.0.3,P=3307 dsn=u=root,p=xxx,h=172.172.0.4,P=3308 --execute –print --no-check-slave
c) 只对指定的库进行sync
[root@mysql57 ~]# pt-table-sync --charset=utf8 --ignore-databases=mysql,sys,percona --databases=test1 --no-check-slave dsn=u=root,p=xxx,h=172.17.0.2,P=3306 dsn=u=root,p=xxx,h=172.17.0.3,P=3307 --execute --print --no-check-slave
d) 只对指定的表进行sync 修复我们当前环境不同步的t1表
[root@slave ~]# pt-table-sync --charset=utf8mb4 --ignore-databases=mysql,sys,percona --databases=testdb1 --tables=t1 --no-check-slave dsn=u=root,p=lhr,h=172.17.0.2,P=3306 dsn=u=root,p=lhr,h=172.17.0.3,P=3307 --execute --print --no-check-slave
*******************************************************************
# A software update is available:
INSERT INTO `testdb1`.`t1`(`id`, `val`) VALUES ('2', 'hello') /*percona-toolkit src_db:testdb1 src_tbl:t1 src_dsn:A=utf8mb4,P=3306,h=172.17.0.2,p=... dst_db:testdb1 dst_tbl:t1 dst_dsn:A=utf8mb4,P=3307,h=172.17.0.3,p=... lock:0 transaction:1 changing_src:0 replicate:0 bidirectional:0 pid:17159 user:root host:slave*/;
-- salve 再次查询数据一致
mysql> select * from testdb1.t1;
+----+-----------+
| id | val |
+----+-----------+
| 1 | hi |
| 2 | hello |
| 3 | good luck |
+----+-----------+
3 rows in set (0.00 sec)
三、主库binlog丢失
主库binlog被删除,从库会因为读取日志失败产生错误。要解决这个问题,通常就是找一个业务空闲期停服,然后从库重新做主从配置。但是,有的时候考虑到数据量较大,不想重建,选择放弃丢失binlog的这一部分数据,继续同步。
3.1 模拟错误
- 停止备库
[root@slave ~]# systemctl stop mysqld
[root@slave ~]# ps -ef|grep mysql
root 22632 17129 0 16:39 pts/1 00:00:00 grep --color=auto mysql
- 主库操作,并删除binlog日志
mysql> flush logs;
Query OK, 0 rows affected (0.01 sec)
mysql> delete from testdb1.t1 where id=1;
Query OK, 1 row affected (0.00 sec)
mysql> flush logs;
Query OK, 0 rows affected (0.01 sec)
mysql> delete from testdb1.t1 where id=2;
Query OK, 1 row affected (0.01 sec)
mysql> insert into t1 values (4,'hi');
Query OK, 1 row affected (0.01 sec)
mysql> select * from t1;
+----+-----------+
| id | val |
+----+-----------+
| 3 | good luck |
| 4 | hi |
+----+-----------+
2 rows in set (0.00 sec)
mysql> show master status;
+-------------------+----------+--------------+-------------------------------------------------+-----------
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+-------------------+----------+--------------+-------------------------------------------------+-----------
| master-bin.000005 | 773 | | mysql,information_schema,performance_schema,sys | 2a94bc88-5018-11ed-9277-0242ac110002:1-10 |
+-------------------+----------+--------------+-------------------------------------------------+-----------
1 row in set (0.00 sec)
-- 删除主库binlog日志
[root@master data]# rm -rf master-bin.000003 master-bin.000004
- 从库启动,同步报错
[root@slave ~]# systemctl start mysqld
[root@slave ~]# mysql -uroot -plhr
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 14
Server version: 8.0.30 MySQL Community Server - GPL
Copyright (c) 2000, 2022, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> show slave status\G;
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: 172.17.0.2
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: master-bin.000003
Read_Master_Log_Pos: 197
Relay_Log_File: slave-relay-bin.000004
Relay_Log_Pos: 415
Relay_Master_Log_File: master-bin.000003
Slave_IO_Running: No
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB: information_schema,performance_schema,mysql,sys
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 197
Relay_Log_Space: 635
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 13114
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Cannot replicate because the master purged required binary logs. Replicate the missing transactions from elsewhere, or provision a new slave from backup. Consider increasing the master's binary log expiration period. The GTID set sent by the slave is '2a94bc88-5018-11ed-9277-0242ac110002:1-7,
2c1a941e-5018-11ed-90e0-0242ac110003:1-9', and the missing transactions are '2a94bc88-5018-11ed-9277-0242ac110002:8''
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 80303306
Master_UUID: 2a94bc88-5018-11ed-9277-0242ac110002
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp: 221020 16:46:16
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 2a94bc88-5018-11ed-9277-0242ac110002:2-7
Executed_Gtid_Set: 2a94bc88-5018-11ed-9277-0242ac110002:1-7,
2c1a941e-5018-11ed-90e0-0242ac110003:1-9
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
Master_public_key_path:
Get_master_public_key: 0
Network_Namespace:
1 row in set, 1 warning (0.01 sec)
ERROR:
No query specified
结论:可以看到抛出如下错误:
Relay_Master_Log_File: master-bin.000003 --slave正在同步master上的binlog日志名,但是已经被删了,所以报错
Slave_IO_Running: No
Last_IO_Errno: 13114
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Cannot replicate because the master purged required binary logs. Replicate the missing transactions from elsewhere, or provision a new slave from backup. Consider increasing the master's binary log expiration period. The GTID set sent by the slave is '2a94bc88-5018-11ed-9277-0242ac110002:1-7,
2c1a941e-5018-11ed-90e0-0242ac110003:1-9', and the missing transactions are '2a94bc88-5018-11ed-9277-0242ac110002:8''
3.2 解决办法
- 清空原主从配置
-- 从库操作
mysql> reset master; -- 清空从库的 gtid
Query OK, 0 rows affected (0.01 sec)
mysql> stop slave;
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> reset slave all; --清空从库信息
Query OK, 0 rows affected, 1 warning (0.00 sec)
- 为了尽可能的少丢数据,我们检查主库目前最旧的binlog日志
[root@master data]# mysqlbinlog -v --base64-output=DECODE-ROWS master-bin.000005 |more
# The proper term is pseudo_replica_mode, but we use this compatibility alias
# to make the statement usable on server versions 8.0.24 and older.
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#221020 16:43:40 server id 80303306 end_log_pos 126 CRC32 0x57df434c Start: binlog v 4, server v 8.0.30 created 221020 16:43:40
# Warning: this binlog is either in use or was not closed properly.
# at 126
#221020 16:43:40 server id 80303306 end_log_pos 197 CRC32 0xebea1c88 Previous-GTIDs
# 2a94bc88-5018-11ed-9277-0242ac110002:1-8
结论:Previous-GTIDs 2a94bc88-5018-11ed-9277-0242ac110002:1-8
- 手动设置GTID_PURGED
mysql> SET @@GLOBAL.GTID_PURGED='2a94bc88-5018-11ed-9277-0242ac110002:1-8';
Query OK, 0 rows affected (0.00 sec)
mysql> show master status;
+------------------+----------+--------------+-------------------------------------------------+------------
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+----------+--------------+-------------------------------------------------+------------
| slave-bin.000001 | 157 | | mysql,information_schema,performance_schema,sys | 2a94bc88-5018-11ed-9277-0242ac110002:1-8 |
+------------------+----------+--------------+-------------------------------------------------+------------
1 row in set (0.00 sec)
mysql> change master to
-> master_host='172.17.0.2',
-> master_port=3306,
-> master_user='repl',
-> master_password='repl',
-> master_auto_position=1;
Query OK, 0 rows affected, 8 warnings (0.02 sec)
mysql> start slave;
Query OK, 0 rows affected, 1 warning (0.01 sec)
mysql> show slave status\G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for source to send event
Master_Host: 172.17.0.2
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: master-bin.000005
Read_Master_Log_Pos: 773
Relay_Log_File: slave-relay-bin.000002
Relay_Log_Pos: 951
Relay_Master_Log_File: master-bin.000005
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB: information_schema,performance_schema,mysql,sys
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 773
Relay_Log_Space: 1161
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 80303306
Master_UUID: 2a94bc88-5018-11ed-9277-0242ac110002
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 2a94bc88-5018-11ed-9277-0242ac110002:9-10
Executed_Gtid_Set: 2a94bc88-5018-11ed-9277-0242ac110002:1-10
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
Master_public_key_path:
Get_master_public_key: 0
Network_Namespace:
1 row in set, 1 warning (0.00 sec)
ERROR:
No query specified
结论:主从同步恢复,但是因为舍弃了主库丢失的binlog日志,导致部分同步事务丢失,数据不一致,案例中从库t1表id=1的行数据未同步删除,id=2的行数据被正常同步删除。
mysql> select * from t1;
+----+-----------+
| id | val |
+----+-----------+
| 1 | hi |
| 3 | good luck |
| 4 | hi |
+----+-----------+
3 rows in set (0.00 sec)
四、从库日志丢失
误删除了从库上的relay log,导致还有事务没应用,发生主从故障。
4.1 模拟错误
- 备库停止同步
mysql> stop slave sql_thread;
Query OK, 0 rows affected, 1 warning (0.00 sec)
- 删除还没有同步relay log
mysql> show slave status\G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for source to send event
Master_Host: 172.17.0.2
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: master-bin.000008
Read_Master_Log_Pos: 197
Relay_Log_File: slave-relay-bin.000005
Relay_Log_Pos: 415
Relay_Master_Log_File: master-bin.000007
Slave_IO_Running: Yes
Slave_SQL_Running: No
[root@slave data]# ll slave-relay-bin.*
-rw-r----- 1 mysql mysql 1040 Oct 24 13:19 slave-relay-bin.000004
-rw-r----- 1 mysql mysql 468 Oct 24 13:20 slave-relay-bin.000005
-rw-r----- 1 mysql mysql 298 Oct 24 13:20 slave-relay-bin.000006
-rw-r----- 1 mysql mysql 701 Oct 24 13:23 slave-relay-bin.000007
-rw-r----- 1 mysql mysql 100 Oct 24 13:20 slave-relay-bin.index
-- 删除从库relay log
[root@slave data]# rm -rf slave-relay-bin.000006 slave-relay-bin.000007
- 主库继续插入数据
mysql> insert into testdb1.t1 values (5,'hello');
Query OK, 1 row affected (0.01 sec)
mysql> flush logs;
Query OK, 0 rows affected (0.01 sec)
mysql> insert into testdb1.t1 values (6,'hello');
Query OK, 1 row affected (0.00 sec)
- 启动slave,报错
mysql> start slave sql_thread;
Query OK, 0 rows affected, 1 warning (0.01 sec)
mysql> show slave status\G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for source to send event
Master_Host: 172.17.0.2
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: master-bin.000009
Read_Master_Log_Pos: 483
Relay_Log_File: slave-relay-bin.000005
Relay_Log_Pos: 415
Relay_Master_Log_File: master-bin.000007
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB: information_schema,performance_schema,mysql,sys
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 13121
Last_Error: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, the server was unable to fetch a keyring key required to open an encrypted relay log file, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.
Skip_Counter: 0
Exec_Master_Log_Pos: 197
Relay_Log_Space: 2519
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 13121
Last_SQL_Error: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, the server was unable to fetch a keyring key required to open an encrypted relay log file, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.
Replicate_Ignore_Server_Ids:
Master_Server_Id: 80303306
Master_UUID: 2a94bc88-5018-11ed-9277-0242ac110002
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp: 221024 13:24:33
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 2a94bc88-5018-11ed-9277-0242ac110002:9-16
Executed_Gtid_Set: 2a94bc88-5018-11ed-9277-0242ac110002:1-14
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
Master_public_key_path:
Get_master_public_key: 0
Network_Namespace:
1 row in set, 1 warning (0.00 sec)
ERROR:
No query specified
结论:可以看到抛出如下错误:
Slave_SQL_Running: No
Last_Errno: 13121
Last_Error: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, the server was unable to fetch a keyring key required to open an encrypted relay log file, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.
4.2 解决办法
- 根据slave status 信息,可以得到我们从库同步中断,所对应的主库位置:
Relay_Master_Log_File: master-bin.000007 Exec_Master_Log_Pos: 197 Executed_Gtid_Set: 2a94bc88-5018-11ed-9277-0242ac110002:1-14
主要观察Relay_Master_Log_File和Exec_Master_Log_Pos、Executed_Gtid_Set的值,代表了从库已经处理的中继日志文件和位置所对应的binlog文件和位置。也可以通过查询主库binlog信息,进一步确认上面信息是否正确。
[root@master data]# mysqlbinlog -v --base64-output=DECODE-ROWS master-bin.000007
# The proper term is pseudo_replica_mode, but we use this compatibility alias
# to make the statement usable on server versions 8.0.24 and older.
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#221024 13:18:26 server id 80303306 end_log_pos 126 CRC32 0xd75fbc8a Start: binlog v 4, server v 8.0.30 created 221024 13:18:26
# at 126
#221024 13:18:26 server id 80303306 end_log_pos 197 CRC32 0xb2e81f0f Previous-GTIDs
# 2a94bc88-5018-11ed-9277-0242ac110002:1-14
# at 197
#221024 13:20:58 server id 80303306 end_log_pos 245 CRC32 0xdd30bcb7 Rotate to master-bin.000008 pos: 4
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
- 重新配置同步
-- 从库操作
mysql> reset master; -- 清空从库的 gtid
Query OK, 0 rows affected (0.01 sec)
mysql> stop slave;
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> reset slave all; --清空从库信息
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> SET @@GLOBAL.GTID_PURGED='2a94bc88-5018-11ed-9277-0242ac110002:1-14';
Query OK, 0 rows affected (0.01 sec)
mysql> show master status;
+------------------+----------+--------------+-------------------------------------------------+------------
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+----------+--------------+-------------------------------------------------+------------
| slave-bin.000001 | 157 | | mysql,information_schema,performance_schema,sys | 2a94bc88-5018-11ed-9277-0242ac110002:1-14 |
+------------------+----------+--------------+-------------------------------------------------+------------
1 row in set (0.01 sec)
mysql> change master to
-> master_host='172.17.0.2',
-> master_port=3306,
-> master_user='repl',
-> master_password='repl',
-> master_auto_position=1;
Query OK, 0 rows affected, 8 warnings (0.02 sec)
mysql> start slave;
Query OK, 0 rows affected, 1 warning (0.01 sec)
- 检查同步正常
mysql> show slave status\G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for source to send event
Master_Host: 172.17.0.2
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: master-bin.000009
Read_Master_Log_Pos: 483
Relay_Log_File: slave-relay-bin.000004
Relay_Log_Pos: 701
Relay_Master_Log_File: master-bin.000009
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB: information_schema,performance_schema,mysql,sys
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 483
Relay_Log_Space: 999
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 80303306
Master_UUID: 2a94bc88-5018-11ed-9277-0242ac110002
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 2a94bc88-5018-11ed-9277-0242ac110002:15-16
Executed_Gtid_Set: 2a94bc88-5018-11ed-9277-0242ac110002:1-16
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
Master_public_key_path:
Get_master_public_key: 0
Network_Namespace:
1 row in set, 1 warning (0.00 sec)
ERROR:
No query specified
-- 主库新插入的数据也都同步过来了
mysql> select * from testdb1.t1;
+----+-----------+
| id | val |
+----+-----------+
| 1 | hi |
| 3 | good luck |
| 4 | hi |
| 5 | hello |
| 6 | hello |
+----+-----------+
5 rows in set (0.00 sec)