The Great Battle to Repair a Broken SQL Thread in MySQL Master-Master Replication: One Misstep, Lasting Regret!

不背锅运维 2022-06-06

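Summary up front

Lead-in to the summary

This round of trial and error went through several stages. First I parsed the binlog, located the failing position, and manually converted the event into SQL to run on the replica. That works when there are only a few problems, but it falls apart when there are many, and the hand-converted SQL is not guaranteed to execute successfully anyway. My master-master environment had a great many problems, so in the end the only approach left was to keep skipping the faulty GTID transactions, without knowing in advance how many there would be. I did not want to rebuild the master-master environment, and in production you cannot casually rebuild it either.

The core of the summary

Whether it is a production or a test master-master environment, if this kind of problem occurs and you cannot or do not want to rebuild, proceed as follows:

Stop the upper-layer application so that nothing reads from or writes to the database any more;
In a master-master setup both servers also act as slaves; on each of the two, keep skipping the faulty GTIDs until the SQL thread shows Yes on both;
Once the SQL thread shows Yes on both slaves, keep watching both for a while with "show replica status\G"; more faulty GTIDs may still turn up, and they are skipped the same way;
After a period of observation in which no further faulty GTID appears on either slave, stop replication and then the MySQL service in the normal order, start MySQL again in the normal order, and keep checking that the IO and SQL threads are Yes on both MySQL servers;
When the master-master environment really is healthy, create a test database on each master and verify that it replicates to the corresponding slave;
Only when the master-master MySQL environment is really, really, really fine, start the upper-layer application again. My application only connects to one of the two servers; I did not set up any fancy read/write splitting.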
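
The "both threads must be Yes" check in the steps above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and it assumes the status text has been captured from "show replica status\G" with the field names shown later in this article.

```python
def parse_vertical_status(text):
    # Turn the vertical (\G-style) status output into a dict of field -> value.
    fields = {}
    for line in text.splitlines():
        # Skip the row separator and anything that is not "name: value".
        if ":" not in line or line.lstrip().startswith("*"):
            continue
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return fields


def replication_healthy(fields):
    # Healthy only when both the IO thread and the SQL thread report Yes.
    return (fields.get("Replica_IO_Running") == "Yes"
            and fields.get("Replica_SQL_Running") == "Yes")


sample = """
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
           Replica_IO_Running: Yes
          Replica_SQL_Running: No
                   Last_Errno: 1452
"""
status = parse_vertical_status(sample)
print(replication_healthy(status))  # False, because the SQL thread is No
```

In a master-master setup the same check has to pass on both servers before moving on to the next step.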

Now for the full troubleshooting and repair process. On to the main topic!

Master-slave environment

Role	Hostname	IP
master	db01	192.168.11.151
slave	db02	192.168.11.152

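Note: my environment uses GTID-based replication. I will share my experience with GTID mode versus the traditional (binlog file and position) mode in a later post.

Symptoms of the failure

On the slave, the SQL thread shows No. The details are as follows: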

mysql> show replica status\G;
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: 192.168.11.151
                  Source_User: syn_a
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: mysql-bin.000005
          Read_Source_Log_Pos: 1986340
               Relay_Log_File: zbx-db02-relay-bin.000012
                Relay_Log_Pos: 630974
        Relay_Source_Log_File: mysql-bin.000001
           Replica_IO_Running: Yes
          Replica_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1452
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634234. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

Master binlog file: mysql-bin.000001 (master log)
End position in the master binlog: 634234 (end_log_pos)
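
Reading these values off by eye gets tedious when the error repeats. As a rough sketch, the GTID, binlog file, and end position can be pulled out of the Last_Error text with a regular expression. The helper name is hypothetical, and the pattern assumes the exact message wording shown above, which may differ in other MySQL versions.

```python
import re

# Matches the wording of the Last_Error message shown in this article.
LAST_ERROR_RE = re.compile(
    r"failed executing transaction '(?P<gtid>[0-9a-f-]+:\d+)' "
    r"at master log (?P<log_file>[^,\s]+), end_log_pos (?P<end_log_pos>\d+)"
)


def parse_last_error(message):
    # Return the failing GTID and binlog coordinates, or None if not found.
    match = LAST_ERROR_RE.search(message)
    if match is None:
        return None
    return {
        "gtid": match.group("gtid"),
        "log_file": match.group("log_file"),
        "end_log_pos": int(match.group("end_log_pos")),
    }


last_error = (
    "Coordinator stopped because there were error(s) in the worker(s). "
    "The most recent failure being: Worker 1 failed executing transaction "
    "'92099aae-4731-11ec-a3da-00505629525b:768' at master log "
    "mysql-bin.000001, end_log_pos 634234. See error log and/or "
    "performance_schema.replication_applier_status_by_worker table."
)
info = parse_last_error(last_error)
print(info["gtid"], info["log_file"], info["end_log_pos"])
```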

Check the error log on the slave

...
2022-05-10T08:06:05.008230Z 69 [ERROR] [MY-010584] [Repl] Slave SQL for channel '': Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634234; Could not execute Write_rows event on table zabbix.event_recovery; Cannot add or update a child row: a foreign key constraint fails (`zabbix`.`event_recovery`, CONSTRAINT `c_event_recovery_1` FOREIGN KEY (`eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 634234, Error_code: MY-001452
...

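A quick analysis of the error log:

The error log shows that the event_recovery table has a foreign key constraint named c_event_recovery_1: its eventid column is a foreign key referencing the eventid column of the events table. In other words, events is the parent table and event_recovery is the child table. Replication is trying to insert a row into the child table, but the referenced row is missing from the parent table, so the insert fails.

Further analysis and investigation

On the slave, look at the CREATE TABLE statement of event_recovery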

mysql> show create table zabbix.event_recovery\G;
*************************** 1. row ***************************
       Table: event_recovery
Create Table: CREATE TABLE `event_recovery` (
  `eventid` bigint unsigned NOT NULL,
  `r_eventid` bigint unsigned NOT NULL,
  `c_eventid` bigint unsigned DEFAULT NULL,
  `correlationid` bigint unsigned DEFAULT NULL,
  `userid` bigint unsigned DEFAULT NULL,
  PRIMARY KEY (`eventid`),
  KEY `event_recovery_1` (`r_eventid`),
  KEY `event_recovery_2` (`c_eventid`),
  CONSTRAINT `c_event_recovery_1` FOREIGN KEY (`eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE,
  CONSTRAINT `c_event_recovery_2` FOREIGN KEY (`r_eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE,
  CONSTRAINT `c_event_recovery_3` FOREIGN KEY (`c_eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3 COLLATE=utf8_bin
1 row in set (0.00 sec)

ERROR: 
No query specified

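Constraint c_event_recovery_1: the child table is event_recovery and the parent table is events; the child column eventid references the parent column eventid. Note that r_eventid and c_eventid also reference events.eventid, through c_event_recovery_2 and c_event_recovery_3.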
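
To see every foreign key at a glance instead of reading the DDL line by line, the constraints can be extracted from the SHOW CREATE TABLE text. The sketch below is an illustration only: the function name is hypothetical and the pattern handles just the single-column constraint form that appears in this table.

```python
import re

# Single-column foreign keys in the form printed by SHOW CREATE TABLE.
FK_RE = re.compile(
    r"CONSTRAINT `(?P<name>[^`]+)` FOREIGN KEY \(`(?P<column>[^`]+)`\) "
    r"REFERENCES `(?P<parent>[^`]+)` \(`(?P<parent_column>[^`]+)`\)"
)


def foreign_keys(create_table_sql):
    # One dict per constraint: name, child column, parent table, parent column.
    return [m.groupdict() for m in FK_RE.finditer(create_table_sql)]


ddl = """
  CONSTRAINT `c_event_recovery_1` FOREIGN KEY (`eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE,
  CONSTRAINT `c_event_recovery_2` FOREIGN KEY (`r_eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE,
  CONSTRAINT `c_event_recovery_3` FOREIGN KEY (`c_eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE
"""
for fk in foreign_keys(ddl):
    print(fk["name"], fk["column"], "->", fk["parent"] + "." + fk["parent_column"])
```

All three constraints point at events.eventid, so any non-NULL value in eventid, r_eventid, or c_eventid needs a matching row in events.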

On the master, parse mysql-bin.000001 and find the end position 634234

[root@zbx-db01 ~]# mysqlbinlog -v --base64-output=decode-rows --stop-position=634234 /data/mysql_data/mysql-bin.000001 | tail -20

# Content found at position 634234:
#220507 22:04:25 server id 5  end_log_pos 634234 CRC32 0xf6000142       Write_rows: table id 174 flags: STMT_END_F
### INSERT INTO `zabbix`.`event_recovery`
### SET
###   @1=21751
###   @2=22357
###   @3=NULL
###   @4=NULL
###   @5=NULL
ROLLBACK /* added by mysqlbinlog */ /*!*/;
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file

From the parsed output, the event at position 634234 on the master, manually converted into an executable statement, looks like this:

INSERT INTO `zabbix`.`event_recovery` values(21751,22357,NULL,NULL,NULL);

Note: the eventid column of the child table event_recovery references the eventid column of the parent table events.
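
The manual conversion from the decoded binlog rows to an INSERT statement can also be scripted. The following is a minimal sketch, not a general binlog converter: the function name is hypothetical, it handles one decoded INSERT row only, and it copies the values exactly as mysqlbinlog -v printed them.

```python
import re


def pseudo_sql_to_insert(decoded_rows):
    # Build one INSERT from the "### INSERT INTO ... ### @n=value" lines.
    table = None
    values = []
    for line in decoded_rows.splitlines():
        line = line.strip()
        if not line.startswith("###"):
            continue  # headers, ROLLBACK, DELIMITER and so on
        body = line[3:].strip()
        if body.startswith("INSERT INTO "):
            table = body[len("INSERT INTO "):]
        else:
            match = re.match(r"@\d+=(.*)$", body)
            if match:
                values.append(match.group(1))
    if table is None or not values:
        raise ValueError("no decoded INSERT row found")
    return "INSERT INTO {} VALUES ({});".format(table, ", ".join(values))


decoded = """
### INSERT INTO `zabbix`.`event_recovery`
### SET
###   @1=21751
###   @2=22357
###   @3=NULL
###   @4=NULL
###   @5=NULL
"""
print(pseudo_sql_to_insert(decoded))
```

Multi-row events, UPDATE and DELETE rows are deliberately left out; they would need the WHERE/SET sections handled separately.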

On the master, check whether the parent table events has a row with eventid 21751

mysql> select * from zabbix.events where eventid=21751\G;
*************************** 1. row ***************************
     eventid: 21751
      source: 0
      object: 0
    objectid: 13560
       clock: 1651925304
       value: 1
acknowledged: 0
          ns: 381417199
        name: Zabbix task manager processes more than 75% busy
    severity: 3
1 row in set (0.00 sec)

The row does exist on the master.

Next, run the same check on the slave: does the parent table events have a row with eventid 21751?

mysql> select * from zabbix.events where eventid=21751;
Empty set (0.01 sec)

mysql> 

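On the slave, the parent table events has no row with eventid 21751, which is why the replication SQL thread fails when it applies the insert automatically.

On the slave, try running the hand-converted statement. The insert into event_recovery is rejected with ERROR 1452: cannot add or update a child row, a foreign key constraint fails.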

mysql> INSERT INTO zabbix.event_recovery values(21751,22357,NULL,NULL,NULL);
ERROR 1452 (23000): Cannot add or update a child row: a foreign key constraint fails (`zabbix`.`event_recovery`, CONSTRAINT `c_event_recovery_1` FOREIGN KEY (`eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE) 

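It fails as well. Inserting into the child table while the parent row is missing fails whether the statement is run by hand or applied by replication.

Solution and process

Fetch the row with eventid 21751 from the parent table events on the master, then build an executable INSERT statement from it

# First, query on the master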
mysql> select * from zabbix.events where eventid=21751\G;
*************************** 1. row ***************************
     eventid: 21751
      source: 0
      object: 0
    objectid: 13560
       clock: 1651925304
       value: 1
acknowledged: 0
          ns: 381417199
        name: Zabbix task manager processes more than 75% busy
    severity: 3
1 row in set (0.13 sec)

ERROR: 
No query specified

# Build the executable INSERT statement
insert into zabbix.events values(21751,0,0,13560,1651925304,1,0,381417199,"Zabbix task manager processes more than 75% busy",3);

Run the statement on the slave, so that the slave's parent table events gets the same row as the master's

mysql> insert into zabbix.events values(21751,0,0,13560,1651925304,1,0,381417199,"Zabbix task manager processes more than 75% busy",3);
Query OK, 1 row affected (0.00 sec)

mysql> 

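Then, on the slave, run the statement that failed originally (the insert into the child table event_recovery). It fails again, with the same error code but on a different constraint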

mysql> INSERT INTO `zabbix`.`event_recovery` values(21751,22357,NULL,NULL,NULL);
ERROR 1452 (23000): Cannot add or update a child row: a foreign key constraint fails (`zabbix`.`event_recovery`, CONSTRAINT `c_event_recovery_2` FOREIGN KEY (`r_eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE)

This time the failing constraint is c_event_recovery_2, which is a new problem. From the CREATE TABLE statement of event_recovery, c_event_recovery_2 is defined as:

CONSTRAINT `c_event_recovery_2` FOREIGN KEY (`r_eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE

On the master, look at the structure of the child table event_recovery

mysql> desc zabbix.event_recovery;
+---------------+-----------------+------+-----+---------+-------+
| Field         | Type            | Null | Key | Default | Extra |
+---------------+-----------------+------+-----+---------+-------+
| eventid       | bigint unsigned | NO   | PRI | NULL    |       |
| r_eventid     | bigint unsigned | NO   | MUL | NULL    |       |
| c_eventid     | bigint unsigned | YES  | MUL | NULL    |       |
| correlationid | bigint unsigned | YES  |     | NULL    |       |
| userid        | bigint unsigned | YES  |     | NULL    |       |
+---------------+-----------------+------+-----+---------+-------+
5 rows in set (0.01 sec)

In other words, the second value in values(21751,22357,NULL,NULL,NULL), namely 22357, is the r_eventid column, and the events row it references is also missing on the slave.
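
Checking the foreign key columns one failure at a time is what caused the second surprise here. A small sketch of listing every parent id a row needs up front is shown below. The function name is hypothetical; the column order comes from the desc output above, and the list of foreign key columns from the CREATE TABLE statement shown earlier.

```python
# Column order of zabbix.event_recovery, as shown by desc.
COLUMNS = ["eventid", "r_eventid", "c_eventid", "correlationid", "userid"]
# Columns that reference events.eventid (c_event_recovery_1, _2 and _3).
FK_COLUMNS = ["eventid", "r_eventid", "c_eventid"]


def required_parent_ids(row_values, columns=COLUMNS, fk_columns=FK_COLUMNS):
    # Map positional values to column names, then collect the non-NULL
    # foreign key values; NULL foreign keys need no parent row.
    row = dict(zip(columns, row_values))
    return sorted({row[c] for c in fk_columns if row[c] is not None})


print(required_parent_ids([21751, 22357, None, None, None]))  # [21751, 22357]
```

Both ids can then be looked up in events on the master and on the slave in one pass.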

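Check on both the master and the slave whether the parent table events has a row with eventid 22357; if the slave does not, the row has to be built by hand

# On the master: the row exists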
mysql> select * from zabbix.events where eventid=22357\G;
*************************** 1. row ***************************
     eventid: 22357
      source: 0
      object: 0
    objectid: 13560
       clock: 1651932264
       value: 0
acknowledged: 0
          ns: 578308822
        name: Zabbix task manager processes more than 75% busy
    severity: 0
1 row in set (0.00 sec)

ERROR: 
No query specified

mysql> 

# On the slave: sure enough, it is missing
mysql> select * from zabbix.events where eventid=22357; 
Empty set (0.00 sec)

# Build the executable INSERT statement
insert into zabbix.events values(22357,0,0,13560,1651932264,0,0,578308822,"Zabbix task manager processes more than 75% busy",0);

Run the statement on the slave, so that the slave's parent table events gets the same row as the master's

mysql> insert into zabbix.events values(22357,0,0,13560,1651932264,0,0,578308822,"Zabbix task manager processes more than 75% busy",0);
Query OK, 1 row affected (0.09 sec)

mysql> 

Then, on the slave, run the originally failing statement, the insert into the child table event_recovery. This time it succeeds.

mysql> INSERT INTO `zabbix`.`event_recovery` values(21751,22357,NULL,NULL,NULL);
Query OK, 1 row affected (0.00 sec)

mysql> 

Next, start replication and check the status

mysql> start replica;
Query OK, 0 rows affected (0.01 sec)

mysql> 
mysql> 
mysql> show replica status\G;
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: 192.168.11.151
                  Source_User: syn_a
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: mysql-bin.000006
          Read_Source_Log_Pos: 236
               Relay_Log_File: zbx-db02-relay-bin.000012
                Relay_Log_Pos: 630974
        Relay_Source_Log_File: mysql-bin.000001
           Replica_IO_Running: Yes
          Replica_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1062
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634116. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
                 Skip_Counter: 0

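The SQL thread is still No, and the position is now end_log_pos 634116. The position has changed, so a different problem is behind it. To find out which one, check the MySQL error log again and parse the corresponding binlog file. The routine is the same as before, so let's keep going.

Continuing: resolving the problem at position 634116

On the slave, check the MySQL error log and filter for the entries matching Last_Errno: 1062; the most recent one is enough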

[root@zbx-db02 mysql_data]# cat mysql3306.err | grep 1062
2022-05-11T00:53:03.919190Z 9 [ERROR] [MY-010584] [Repl] Slave SQL for channel '': Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634116; Could not execute Write_rows event on table zabbix.events; Duplicate entry '22357' for key 'events.PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 634116, Error_code: MY-001062

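Rough analysis: the Write_rows event on table zabbix.events could not be executed because of duplicate entry '22357' for key 'events.PRIMARY', error code 1062.

On the master, parse the binlog file mysql-bin.000001, find end_log_pos 634116, and manually convert the event into an executable SQL statement

# Run the parse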
mysqlbinlog -v --base64-output=decode-rows --stop-position=634116 /data/mysql_data/mysql-bin.000001

# After parsing, the content at position 634116 is:
#220507 22:04:25 server id 5  end_log_pos 634116 CRC32 0xf54d313b       Write_rows: table id 113 flags: STMT_END_F
### INSERT INTO `zabbix`.`events`
### SET
###   @1=22357
###   @2=0
###   @3=0
###   @4=13560
###   @5=1651932264
###   @6=0
###   @7=0
###   @8=578308822
###   @9='Zabbix task manager processes more than 75% busy'
###   @10=0
ROLLBACK /* added by mysqlbinlog */ /*!*/;
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file

# Manually converted into an executable SQL statement
INSERT INTO zabbix.events values(22357,0,0,13560,1651932264,0,0,578308822,'Zabbix task manager processes more than 75% busy',0);

Run the converted SQL statement on the slave

mysql> INSERT INTO zabbix.events values(22357,0,0,13560,1651932264,0,0,578308822,'Zabbix task manager processes more than 75% busy',0);
ERROR 1062 (23000): Duplicate entry '22357' for key 'events.PRIMARY'
mysql> 

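It fails. The slave's MySQL error log showed the same error a moment ago, so the statement fails whether the slave's SQL thread applies it automatically or I run it by hand. The primary key of the events table is eventid (check the table structure to confirm), which means a row with eventid 22357 already exists. Inserting it again creates a duplicate, and a primary key must be unique, hence the error.

Thinking back, row 22357 was inserted during the steps in the "Solution and process" section above.

An attempted fix on the slave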

mysql> delete from zabbix.events where eventid=22357;
Query OK, 1 row affected (0.54 sec)

mysql> INSERT INTO zabbix.events values(22357,0,0,13560,1651932264,0,0,578308822,'Zabbix task manager processes more than 75% busy',0);
Query OK, 1 row affected (0.00 sec)

mysql> start replica;
Query OK, 0 rows affected (0.02 sec)

mysql> show replica status\G;
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: 192.168.11.151
                  Source_User: syn_a
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: mysql-bin.000007
          Read_Source_Log_Pos: 236
               Relay_Log_File: zbx-db02-relay-bin.000024
                Relay_Log_Pos: 324
        Relay_Source_Log_File: mysql-bin.000001
           Replica_IO_Running: Yes
          Replica_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1062
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634116. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
                 Skip_Counter: 0
          Exec_Source_Log_Pos: 633751
              Relay_Log_Space: 41924759

Check the slave's MySQL error log again (the most recent entry is enough)

[root@zbx-db02 mysql_data]# cat mysql3306.err | grep 1062
2022-05-11T02:43:53.329019Z 23 [ERROR] [MY-010584] [Repl] Slave SQL for channel '': Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634116; Could not execute Write_rows event on table zabbix.events; Duplicate entry '22357' for key 'events.PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 634116, Error_code: MY-001062

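The events table already contains a row with eventid (the primary key) 22357. Strange: it is the same duplicate primary key problem, and it is still row 22357. So was the previous fix attempt all for nothing? Argh! Most likely yes: both errors belong to the same transaction 768, and when replication restarts the SQL thread re-applies that transaction from its beginning. The transaction itself inserts row 22357 into events (the event ending at position 634116), so any copy of that row inserted by hand collides with the replayed insert.

A new solution

The new solution: skip the specific GTID transaction (that is, ignore the primary key conflict on the slave). Note that my master-slave environment has GTID mode enabled.

Replication was stopped earlier, so start it again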

mysql> start replica;
Query OK, 0 rows affected (0.02 sec)

Check the replication status

mysql> show replica status\G;
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: 192.168.11.151
                  Source_User: syn_a
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: mysql-bin.000007
          Read_Source_Log_Pos: 236
               Relay_Log_File: zbx-db02-relay-bin.000024
                Relay_Log_Pos: 324
        Relay_Source_Log_File: mysql-bin.000001
           Replica_IO_Running: Yes
          Replica_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1062
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634116. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
                 Skip_Counter: 0
          Exec_Source_Log_Pos: 633751
              Relay_Log_Space: 41925179
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Source_SSL_Allowed: No
           Source_SSL_CA_File: 
           Source_SSL_CA_Path: 
              Source_SSL_Cert: 
            Source_SSL_Cipher: 
               Source_SSL_Key: 
        Seconds_Behind_Source: NULL
Source_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1062
               Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634116. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
  Replicate_Ignore_Server_Ids: 
             Source_Server_Id: 5
                  Source_UUID: 92099aae-4731-11ec-a3da-00505629525b
             Source_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
    Replica_SQL_Running_State: 
           Source_Retry_Count: 86400
                  Source_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 220511 10:43:53
               Source_SSL_Crl: 
           Source_SSL_Crlpath: 
           Retrieved_Gtid_Set: 92099aae-4731-11ec-a3da-00505629525b:768-71845 # GTID set retrieved from the source
            Executed_Gtid_Set: 9208096f-4731-11ec-a23e-005056210589:1-55, # GTID set already executed
92099aae-4731-11ec-a3da-00505629525b:1-767
                Auto_Position: 0
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Source_TLS_Version: 
       Source_public_key_path: 
        Get_Source_public_key: 0
            Network_Namespace: 
1 row in set (0.00 sec)

ERROR: 
No query specified

Retrieved_Gtid_Set (GTIDs retrieved from the source): 92099aae-4731-11ec-a3da-00505629525b:768-71845
Executed_Gtid_Set (GTIDs already executed): 9208096f-4731-11ec-a23e-005056210589:1-55,92099aae-4731-11ec-a3da-00505629525b:1-767

In-depth analysis of the failure

Reasoning it through:

The output above shows that the slave has retrieved the transactions '92099aae-4731-11ec-a3da-00505629525b:768-71845' from the master and has executed (Executed_Gtid_Set) up to '92099aae-4731-11ec-a3da-00505629525b:1-767'. The next transaction to apply is therefore 768.
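
The same conclusion can be reached mechanically by subtracting the executed set from the retrieved set. The sketch below is only an illustration with hypothetical function names; it assumes the plain uuid:first-last[:first-last] notation shown above and expands intervals into Python sets, which is fine for sets of this size.

```python
def parse_gtid_set(text):
    # Parse "uuid:1-5:8,uuid2:3" into {uuid: set of transaction numbers}.
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        numbers = result.setdefault(uuid, set())
        for interval in intervals:
            first, _, last = interval.partition("-")
            numbers.update(range(int(first), int(last or first) + 1))
    return result


def first_pending(retrieved, executed):
    # Lowest retrieved-but-not-executed transaction number per source UUID.
    done = parse_gtid_set(executed)
    pending = {}
    for uuid, numbers in parse_gtid_set(retrieved).items():
        remaining = numbers - done.get(uuid, set())
        if remaining:
            pending[uuid] = min(remaining)
    return pending


retrieved = "92099aae-4731-11ec-a3da-00505629525b:768-71845"
executed = ("9208096f-4731-11ec-a23e-005056210589:1-55,\n"
            "92099aae-4731-11ec-a3da-00505629525b:1-767")
print(first_pending(retrieved, executed))
```

For the values above this reports transaction 768 of source 92099aae-4731-11ec-a3da-00505629525b, which matches the transaction named in the error.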

From the slave's MySQL error log seen earlier:

[root@zbx-db02 mysql_data]# cat mysql3306.err | grep 1062
2022-05-11T02:43:53.329019Z 23 [ERROR] [MY-010584] [Repl] Slave SQL for channel '': Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768' at master log mysql-bin.000001, end_log_pos 634116; Could not execute Write_rows event on table zabbix.events; Duplicate entry '22357' for key 'events.PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 634116, Error_code: MY-001062

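Note this part of the error: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:768'. The transaction that failed is '92099aae-4731-11ec-a3da-00505629525b:768'.

A bold guess: would skipping transaction '92099aae-4731-11ec-a3da-00505629525b:768' be enough? In other words, when master and slave run into a primary key conflict (a duplicate, as in the current problem), can the transaction be skipped by injecting an empty transaction? I decided to give it a try.

On the slave, try skipping the specified GTID transaction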

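# Stop replication
stop replica;

# Set the GTID of the next transaction to the one to skip, i.e. '92099aae-4731-11ec-a3da-00505629525b:768'
set gtid_next='92099aae-4731-11ec-a3da-00505629525b:768';
begin;

# Commit right away, which injects an empty transaction under that GTID
commit;

# Switch GTID assignment back to automatic
set gtid_next='AUTOMATIC';

# Start replication again
start replica;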

After skipping the GTID, check the replication status on the slave again

mysql> show replica status\G;
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: 192.168.11.151
                  Source_User: syn_a
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: mysql-bin.000014
          Read_Source_Log_Pos: 236
               Relay_Log_File: zbx-db02-relay-bin.000045
                Relay_Log_Pos: 324
        Relay_Source_Log_File: mysql-bin.000001
           Replica_IO_Running: Yes
          Replica_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1452
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:784' at master log mysql-bin.000001, end_log_pos 643505. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

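Argh!!! The SQL thread is still No. There seem to be a lot of problems, so let's fix them one by one. This time it is a new problem: the transaction ID has changed to '92099aae-4731-11ec-a3da-00505629525b:784' and the position has changed to 643505.

On the slave, check the MySQL error log for error code 1452 (the most recent entry is enough)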

[root@zbx-db02 mysql_data]# cat mysql3306.err | grep 1452
2022-06-06T02:41:00.836059Z 18 [ERROR] [MY-010584] [Repl] Slave SQL for channel '': Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:784' at master log mysql-bin.000001, end_log_pos 643505; Could not execute Write_rows event on table zabbix.event_recovery; Cannot add or update a child row: a foreign key constraint fails (`zabbix`.`event_recovery`, CONSTRAINT `c_event_recovery_1` FOREIGN KEY (`eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 643505, Error_code: MY-001452

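This time the cause is a foreign key constraint failure rather than a primary key conflict. The GTID is 92099aae-4731-11ec-a3da-00505629525b:784

Continuing: resolving the foreign key constraint failure (skipping the specified GTID transaction)

From the error just shown, the binlog coordinates are: master log mysql-bin.000001, end_log_pos 643505. Solution: I decided to keep skipping the specified GTID transaction.

On the slave, try skipping the specified GTID transaction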

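# Stop replication
stop replica;

# Set the GTID of the next transaction to the one to skip, i.e. '92099aae-4731-11ec-a3da-00505629525b:784'
set gtid_next='92099aae-4731-11ec-a3da-00505629525b:784';
begin;

# Commit right away, which injects an empty transaction under that GTID
commit;

# Switch GTID assignment back to automatic
set gtid_next='AUTOMATIC';

# Start replication again
start replica;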

Check the slave's replica status again

mysql> show replica status\G;
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: 192.168.11.151
                  Source_User: syn_a
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: mysql-bin.000014
          Read_Source_Log_Pos: 236
               Relay_Log_File: zbx-db02-relay-bin.000045
                Relay_Log_Pos: 22387
        Relay_Source_Log_File: mysql-bin.000001
           Replica_IO_Running: Yes
          Replica_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1452
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '92099aae-4731-11ec-a3da-00505629525b:1069' at master log mysql-bin.000001, end_log_pos 772696. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

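Still not fixed: the SQL thread is still No and the transaction ID has changed again, this time to '92099aae-4731-11ec-a3da-00505629525b:1069'. Skip it the same way as before.

The last resort: skip every GTID transaction that fails

By now it is clear that there are far too many problems. Master and slave are badly out of sync, and primary key conflicts, constraint failures and so on keep driving the slave's SQL thread to No. There is no telling how many problems remain. Parsing the binlog, finding the position, and converting each event into SQL by hand, as before, is no longer realistic. So the only remaining option, the last resort, is to skip every GTID transaction that fails. Note that my master-slave environment has GTID mode enabled; without GTID mode this approach does not apply.

Keep skipping the failing GTID transactions, like this: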

stop replica;
set gtid_next='92099aae-4731-11ec-a3da-00505629525b:1069';
begin;
commit;
set gtid_next='AUTOMATIC';
start replica;
show replica status\G;

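Every faulty GTID is skipped the same way. Each failing transaction has a different ID, so only the value in gtid_next='' changes to the faulty GTID; the other statements and the order of the steps stay the same.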
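
Since only the GTID changes between rounds, the statement sequence can be generated instead of retyped. The sketch below is an illustration with a hypothetical helper name: it only builds the text of the statements used above and does not connect to MySQL. The GTID is validated first because it is pasted straight into the SQL.

```python
import re

# A single GTID: a source UUID followed by one transaction number.
GTID_RE = re.compile(r"^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}:\d+$")


def skip_gtid_statements(gtid):
    # Return the statements that inject an empty transaction for one GTID.
    if not GTID_RE.match(gtid):
        raise ValueError("not a single GTID: {!r}".format(gtid))
    return [
        "stop replica;",
        "set gtid_next='{}';".format(gtid),
        "begin;",
        "commit;",
        "set gtid_next='AUTOMATIC';",
        "start replica;",
        "show replica status\\G",
    ]


print("\n".join(skip_gtid_statements("92099aae-4731-11ec-a3da-00505629525b:1069")))
```

The printed statements are meant to be reviewed and then pasted into the mysql client on the slave whose SQL thread stopped on that GTID.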

After the last resort, the SQL threads of the master-master environment are back to normal

master (192.168.11.152), slave (192.168.11.151)

mysql> show replica status\G;
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: 192.168.11.152
                  Source_User: syn_b
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: mysql-bin.000010
          Read_Source_Log_Pos: 42583848
               Relay_Log_File: zbx-db01-relay-bin.000050
                Relay_Log_Pos: 922
        Relay_Source_Log_File: mysql-bin.000010
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
              Replicate_Do_DB: 

master (192.168.11.151), slave (192.168.11.152)

mysql> show replica status\G;
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: 192.168.11.151
                  Source_User: syn_a
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: mysql-bin.000014
          Read_Source_Log_Pos: 2178
               Relay_Log_File: zbx-db02-relay-bin.000084
                Relay_Log_Pos: 404
        Relay_Source_Log_File: mysql-bin.000014
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
              Replicate_Do_DB: 

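Closing self-reflection

Why did this happen? After some reflection, the core reason is that this is a virtual machine environment, and precisely because of that I paid no attention to the start and stop order when shutting down. For convenience I even forced power-off on the VMs running MySQL and the upper-layer application, which eventually led to this whole series of database problems. Fortunately it was a test environment, and it taught me a serious lesson. If you treat even a test environment carelessly and without rigor, the habit sticks, and sooner or later the production environment you maintain will be ruined by your own hands.

The correct start and stop order

Assume the application's backend database is a master-master architecture (as in my test environment), with no other databases or middleware involved.

Start

Start the master-master databases (master, slave) and check that the replica status is healthy;
Start the upper-layer application;

Stop

Stop the upper-layer application;
On the master-master databases (master, slave), stop replication first, then stop the mysql service;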

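This article is reposted from 不背锅运维. If you suspect infringement, please email contact@modb.pro to report it and provide the relevant evidence; once verified, 墨天轮 will delete the content immediately.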