故障原因
生产环境的故障原因:
(Galera集群MariaDB 10.0.15)
节点1加字段成功,节点3由于没空间,加字段失败,节点2加字段成功。等节点3有空间了,节点3还可以正常加入集群,是以少字段的形式在复制。虽然节点3的错误日志中有记录,表加字段失败了。但是后面还是可以正常加班集群的,也可以正常复制。所以节点3就一直没有这个字段,并且继续复制。
MariaDB 10.0.15 -Galera集群缺字段复制
garela cluster集群中,有一台或2台服务器的表中少一个字段,同步数据的时候不报错。
集群版本:
mysql Ver 15.1 Distrib 10.0.15-MariaDB, for Linux (x86_64) using readline 5.1
在节点1上建张表,并插入几笔数据 mysql> CREATE TABLE `sbtest2` ( -> `id` int(11) NOT NULL AUTO_INCREMENT, -> `k` int(11) NOT NULL DEFAULT '0', -> `c` char(120) NOT NULL DEFAULT '', -> `pad` char(60) NOT NULL DEFAULT '', -> PRIMARY KEY (`id`), -> KEY `k_1` (`k`) -> ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8; Query OK, 0 rows affected (1.04 sec) mysql> insert into sbtest2(k,c,pad) select k,c,pad from sbtest1 limit 5; Query OK, 5 rows affected (0.04 sec) Records: 5 Duplicates: 0 Warnings: 0 之后业务在1节点添加字段: alter table sbtest2 add column check_flag varchar(1) default '1';
复制
检查集群的3台服务器,此时数据和表结构是一致的。
此时执行第3节点以单点模式运行,并删除字段check_flag:
alter table sbtest2 drop column check_flag;
后面三节点重新开启集群参数,加入节点:
验证过程
现在开始对表进行DML操作,对比有字段check_flag和无字段check_flag的服务器数据
第1节点服务器(有字段check_flag): mysql> insert into sbtest2(id,k,c,pad,check_flag) values(100,100,'100k','100c','0'); Query OK, 1 row affected (0.00 sec) mysql> select * from sbtest2 where id=100; +-----+-----+------+------+------------+ | id | k | c | pad | check_flag | +-----+-----+------+------+------------+ | 100 | 100 | 100k | 100c | 0 | +-----+-----+------+------+------------+ 1 row in set (0.00 sec) 第3节点服务器(无字段check_flag): MariaDB [sbtest]> select * from sbtest2 where id=100; +-----+-----+------+------+ | id | k | c | pad | +-----+-----+------+------+ | 100 | 100 | 100k | 100c | +-----+-----+------+------+ 1 row in set (0.00 sec) 第1节点服务器(有字段check_flag): mysql> update sbtest2 set pad='111c' where id=100; Query OK, 1 row affected (0.00 sec) Rows matched: 1 Changed: 1 Warnings: 0 mysql> select * from sbtest2 where id=100; +-----+-----+------+------+------------+ | id | k | c | pad | check_flag | +-----+-----+------+------+------------+ | 100 | 100 | 100k | 111c | 0 | +-----+-----+------+------+------------+ 1 row in set (0.00 sec) 第3节点服务器(无字段check_flag): MariaDB [sbtest]> select * from sbtest2 where id=100; +-----+-----+------+------+ | id | k | c | pad | +-----+-----+------+------+ | 100 | 100 | 100k | 111c | +-----+-----+------+------+ 1 row in set (0.00 sec) 第1节点服务器(有字段check_flag): mysql> delete from sbtest2 where id=100; Query OK, 1 row affected (0.00 sec) mysql> select * from sbtest2 where id=100; Empty set (0.00 sec) 第3节点服务器(无字段check_flag): MariaDB [sbtest]> select * from sbtest2 where id=100; Empty set (0.00 sec) 第1节点服务器(有字段check_flag): mysql> insert into sbtest2(id,k,c,pad,check_flag) values(110,110,'110k','110c','0'); Query OK, 1 row affected (0.01 sec) mysql> update sbtest2 set check_flag=1 where id=110; Query OK, 1 row affected (0.01 sec) Rows matched: 1 Changed: 1 Warnings: 0 mysql> select * from sbtest2 where id=110; +-----+-----+------+------+------------+ | id | k | c | pad | check_flag | +-----+-----+------+------+------------+ | 110 | 110 | 110k | 110c | 1 | +-----+-----+------+------+------------+ 1 row in set (0.00 sec) 第3节点服务器(无字段check_flag): MariaDB [sbtest]> select * from sbtest2 where id=110; +-----+-----+------+------+ | id | k | c | pad | +-----+-----+------+------+ | 110 | 110 | 110k | 110c | +-----+-----+------+------+ 1 row in set (0.00 sec)
复制
反过来操作,在无字段check_flag的服务器上操作:
第3节点服务器(无字段check_flag): MariaDB [sbtest]> select * from sbtest2 where id=110; +-----+-----+------+------+ | id | k | c | pad | +-----+-----+------+------+ | 110 | 110 | 110k | 110c | +-----+-----+------+------+ 1 row in set (0.00 sec) MariaDB [sbtest]> update sbtest2 set check_flag='0' where id=110; ERROR 1054 (42S22): Unknown column 'check_flag' in 'field list' MariaDB [sbtest]> delete from sbtest2 where check_flag='0'; ERROR 1054 (42S22): Unknown column 'check_flag' in 'where clause' 第3节点服务器(无字段check_flag): MariaDB [sbtest]> insert into sbtest2(id,k,c,pad) values(120,120,'110k','110c'); Query OK, 1 row affected (0.01 sec) MariaDB [sbtest]> select * from sbtest2 where id=120; +-----+-----+------+------+ | id | k | c | pad | +-----+-----+------+------+ | 120 | 120 | 110k | 110c | +-----+-----+------+------+ 1 row in set (0.00 sec) 第1节点服务器(有字段check_flag): mysql> select * from sbtest2 where id=120; +-----+-----+------+------+------------+ | id | k | c | pad | check_flag | +-----+-----+------+------+------------+ | 120 | 120 | 110k | 110c | 1 | +-----+-----+------+------+------------+ 1 row in set (0.00 sec)
复制
在有字段check_flag的服务器只操作check_flag字段
第1节点服务器(有字段check_flag): mysql> update sbtest2 set check_flag='0' where id=120; Query OK, 1 row affected (0.01 sec) Rows matched: 1 Changed: 1 Warnings: 0 mysql> select * from sbtest2 where id=120; +-----+-----+------+------+------------+ | id | k | c | pad | check_flag | +-----+-----+------+------+------------+ | 120 | 120 | 110k | 110c | 0 | +-----+-----+------+------+------------+ 1 row in set (0.00 sec) mysql> delete from sbtest2 where check_flag='0'; Query OK, 1 row affected (0.00 sec) mysql> select * from sbtest2 where id=120; Empty set (0.00 sec) 第3节点服务器(无字段check_flag): MariaDB [sbtest]> select * from sbtest2 where id=120; Empty set (0.00 sec) 数据也成功同步
复制
测试结论
从以上测试可以看出,只要在无字段check_flag的服务器上执行DML,不相关check_flag字段就不会报错。
而在有字段check_flag的服务器上执行DML,数据同样会复制到无字段check_flag的服务器上,并且不报错。
这是可能导致系统运行很长时间,都没有发现数据不一致的情况。
MySQL 8.0.25-MGR集群测试
第3节点 单独执行dml语句之后,MGR集群会拒绝它加入集群,并且报错:
2021-06-23T02:18:29.463798Z 0 [ERROR] [MY-011526] [Repl] Plugin group_replication reported: 'This member has more executed transactions than those present in the group. Local transactions: b2347c8c-c373-11eb-a6bc-e2cf689cd558:1-24, be1ba1b2-c376-11eb-9484-000c2902fb88:1 > Group transactions: b2347c8c-c373-11eb-a6bc-e2cf689cd558:1-24'
复制
测试结论:
1.MGR在节点加字段报错的时候,就会把节点提出集群。
2.本次gelera节点发生3节点字段添加没成功,在生产上的原因是由于3节点磁盘满了。等磁盘有空间后,会继续同步,且不会引起报错,导致数据不一致。