记一次MySQL8.0 MGR中mysqld-auto.cnf的坑
能力有限,如有错误请谅解
在上次使用mysqlshell
完成MGR
集群部署后,在测试集群节点启停重新加入集群时发现group_replication_start_on_boot
被自动设置为ON
, 导致节点在重启后非自愿的自动加入了集群。
环境介绍
部署环境可以看上一篇文章MGR + Router+mysqlsh 集群部署安装

简单做下说明:这个架构是在完成实例初始化后,使用mysqlsh
建立集群,之后使用Router实现读写分离
背景
配置文件关键参数
[root@mgr1 3306]# grep boot my_3306_mgr.cnf
loose-group_replication_start_on_boot = OFF
loose-group_replication_bootstrap_group = OFF
数据库状态
mysql> show global variables like '%boot%';
+-----------------------------------+-------+
| Variable_name | Value |
+-----------------------------------+-------+
| group_replication_bootstrap_group | OFF |
| group_replication_start_on_boot | ON |
+-----------------------------------+-------+
2 rows in set (0.00 sec)
mysql> set global group_replication_start_on_boot=off;
Query OK, 0 rows affected (0.00 sec)
mysql> show global variables like '%boot%';
+-----------------------------------+-------+
| Variable_name | Value |
+-----------------------------------+-------+
| group_replication_bootstrap_group | OFF |
| group_replication_start_on_boot | OFF |
+-----------------------------------+-------+
2 rows in set (0.01 sec)
这里将group_replication_start_on_boot
手动修改为OFF
, 但在创建集群后没注意到这个参数,自以为是和配置参数中一样。但实际已经被修改为ON
。
模拟再现
关闭mgr1:3306 节点数据库
mysql> shutdown;
Query OK, 0 rows affected (0.22 sec)
mysql> show processlist;
ERROR 2013 (HY000): Lost connection to MySQL server during query
No connection. Trying to reconnect...
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/data/mysql_mgr/3306/mysql.sock' (2)
ERROR:
Can't connect to the server
其他节点上查看集群节点状态
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+----------------------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION | MEMBER_COMMUNICATION_STACK |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+----------------------------+
| group_replication_applier | 96c67e22-a01e-11ec-ac1d-080027eef429 | mgr3 | 3306 | ONLINE | PRIMARY | 8.0.27 | XCom |
| group_replication_applier | 96c9d0ee-a01e-11ec-b610-080027cdfa8a | mgr2 | 3306 | ONLINE | SECONDARY | 8.0.27 | XCom |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+----------------------------+
2 rows in set (0.01 sec)
此时mgr1:3306 已经被下线处理
重启mgr1:3306节点
[root@mgr1 3306]# /usr/local/mysql_8/bin/mysqld_safe --defaults-file=my_3306_mgr.cnf &
[1] 31039
[root@mgr1 3306]# dblogin.sh 3306
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+----------------------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION | MEMBER_COMMUNICATION_STACK |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+----------------------------+
| group_replication_applier | 96c63f53-a01e-11ec-82de-080027d56d22 | mgr1 | 3306 | ONLINE | SECONDARY | 8.0.27 | XCom |
| group_replication_applier | 96c67e22-a01e-11ec-ac1d-080027eef429 | mgr3 | 3306 | ONLINE | PRIMARY | 8.0.27 | XCom |
| group_replication_applier | 96c9d0ee-a01e-11ec-b610-080027cdfa8a | mgr2 | 3306 | ONLINE | SECONDARY | 8.0.27 | XCom |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+----------------------------+
3 rows in set (0.01 sec)
此时发现已经自动加入了集群
排查
因为是不识别文件中的参数设置,所以先怀疑配置文件书写出了问题,在测试其他节点重启后依然重新加入了集群后排除这种可能。
怀疑是加载文件上出了问题
MySQL
加载文件使用顺序的,因为 还停留在5.7
的固定思维上因为并没有发现真正的问题
[root@mgr1 3306]# mysqld --verbose --help> mysqld_help.out
读取配置文件信息
Default options are read from the following files in the given order:
/etc/my.cnf /etc/mysql/my.cnf /usr/local/mysql/etc/my.cnf ~/.my.cnf
The following groups are read: mysqld server mysqld-8.0
四处找可能的原因, performance_schema.replication_%
相关的表。
最后还是认为是有文件的可能性最高,原因是该参数在启动后就被修改为ON
,是在MySQL
启动加载时设置的
最后在datadir/
路径下找到了可疑文件 mysqld-auto.cnf
文件内容如下
[root@mgr1 3306]# cat data/mysqld-auto.cnf
{ "Version" : 1 , "mysql_server" : { "auto_increment_offset" : { "Value" : "2" , "Metadata" : { "Timestamp" : 1646888302629225 , "User" : "myshell" , "Host" : "" } } , "super_read_only" : { "Value" : "ON" , "Metadata" : { "Timestamp" : 1646888302615680 , "User" : "myshell" , "Host" : "" } } , "auto_increment_increment" : { "Value" : "1" , "Metadata" : { "Timestamp" : 1646888302625629 , "User" : "myshell" , "Host" : "" } } , "mysql_server_static_options" : { "group_replication_enforce_update_everywhere_checks" : { "Value" : "OFF" , "Metadata" : { "Timestamp" : 1646888302619246 , "User" : "myshell" , "Host" : "" } } , "group_replication_group_name" : { "Value" : "b619e3e1-a02e-11ec-90a1-080027d56d22" , "Metadata" : { "Timestamp" : 1646888302617028 , "User" : "myshell" , "Host" : "" } } , "group_replication_group_seeds" : { "Value" : "10.0.2.31:33061,10.0.2.32:33061,10.0.2.33:33061,10.0.2.31:33061,mgr2:33061,mgr3:33061" , "Metadata" : { "Timestamp" : 1646912487400711 , "User" : "myshell" , "Host" : "" } } , "group_replication_local_address" : { "Value" : "mgr1:33061" , "Metadata" : { "Timestamp" : 1646888302623824 , "User" : "myshell" , "Host" : "" } } , "group_replication_recovery_use_ssl" : { "Value" : "ON" , "Metadata" : { "Timestamp" : 1646888302622209 , "User" : "myshell" , "Host" : "" } } , "group_replication_single_primary_mode" : { "Value" : "ON" , "Metadata" : { "Timestamp" : 1646888302620939 , "User" : "myshell" , "Host" : "" } } , "group_replication_ssl_mode" : { "Value" : "REQUIRED" , "Metadata" : { "Timestamp" : 1646888302623037 , "User" : "myshell" , "Host" : "" } } , "group_replication_start_on_boot" : { "Value" : "ON" , "Metadata" : { "Timestamp" : 1646888302624712 , "User" : "myshell" , "Host" : "" } } , "group_replication_view_change_uuid" : { "Value" : "b61a07ef-a02e-11ec-90a1-080027d56d22" , "Metadata" : { "Timestamp" : 1646888302618557 , "User" : "myshell" , "Host" : "" } } , "replica_parallel_workers" : { "Value" : "4" , "Metadata" : { "Timestamp" : 1646887972321149 , "User" : "myshell" , "Host" : "" } } , "slave_parallel_workers" : { "Value" : "4" , "Metadata" : { "Timestamp" : 1646887972321149 , "User" : "myshell" , "Host" : "" } } } } }[root@mgr1 3306]#
然后就是去找有关于mysqld-auto.cnf
文件的说明了 文档中是这样描述的:SET PERSIST Statement[1] 大意就是这个文件是在8.0 中新增的功能,他可以让DBA可以在客户端运行SET PERSIST
修改 动态全局参数 的同时达到 持久化 的效果(写到文件中)。可以更好的适应云产品功能,如果用过一些私有云产品数据库也可以找到类似的文件。
那问题来了,出现该文件是需要执行 set persist
命令,自己可以确认没有执行这类语句,哪是谁执行的呢。代码不会骗人,肯定是触发了什么机制,我第一时间想到了mysqlshell
去mysqlshell
手册中去搜索关键字知道如下信息 6.1.5 Persisting Settings[2] 由此可知:该文件在使用mysqlshell
修改参数时会自动实现持久化效果 满足条件
The instance is running MySQL version 8.0.11 or later.
persisted_globals_load is set to ON.
The instance has not been started with the --no-defaults option.
但是这只能知道可能是mysqlshell
导致的没有找到具体的证据,由于没有阅读源码的能力只能考虑别的方式证明了。所以开启general_log
重新部署集群。最后找到相关信息
2022-03-10T19:38:52.453756+08:00 13 Query SET PERSIST `super_read_only` = 'ON'
2022-03-10T19:38:52.455366+08:00 13 Query SET PERSIST `group_replication_group_name` = 'b619e3e1-a02e-11ec-90a1-080027d56d22'
2022-03-10T19:38:52.457020+08:00 13 Query SET PERSIST `group_replication_view_change_uuid` = 'b61a07ef-a02e-11ec-90a1-080027d56d22'
2022-03-10T19:38:52.458181+08:00 13 Query SET PERSIST `group_replication_enforce_update_everywhere_checks` = 'OFF'
2022-03-10T19:38:52.459315+08:00 13 Query SET PERSIST `group_replication_single_primary_mode` = 'ON'
2022-03-10T19:38:52.461049+08:00 13 Query SET PERSIST `group_replication_recovery_use_ssl` = 'ON'
2022-03-10T19:38:52.462309+08:00 13 Query SET PERSIST `group_replication_ssl_mode` = 'REQUIRED'
2022-03-10T19:38:52.463640+08:00 13 Query SET PERSIST `group_replication_local_address` = 'mgr3:33061'
2022-03-10T19:38:52.465535+08:00 13 Query SET PERSIST `group_replication_group_seeds` = 'mgr1:33061,mgr2:33061'
2022-03-10T19:38:52.467082+08:00 13 Query SET PERSIST `group_replication_consistency` = 'EVENTUAL'
2022-03-10T19:38:52.468948+08:00 13 Query SET PERSIST `group_replication_member_expel_timeout` = 5
2022-03-10T19:38:52.473389+08:00 13 Query SET PERSIST `group_replication_start_on_boot` = 'ON'
2022-03-10T19:38:52.474831+08:00 13 Query SET PERSIST `auto_increment_increment` = 1
2022-03-10T19:38:52.475981+08:00 13 Query SET PERSIST `auto_increment_offset` = 2
2022-03-10T19:38:52.477608+08:00 13 Query CHANGE REPLICATION SOURCE TO SOURCE_USER = 'mysql_innodb_cluster_333306', SOURCE_PASSWORD = <secret> FOR CHANNEL 'group_replication_recovery'
2022-03-10T19:38:52.497598+08:00 13 Query SELECT NOW(6)
2022-03-10T19:38:52.498107+08:00 13 Query START GROUP_REPLICATION
可以看到这些都是在执行l.addInstance('10.0.2.32:3306')
出现的,至此可以证明使用mysqlshell
工具部署MGR集群他会自动执行他的一整套的 从配置环境到启动集群 的命令集合
这样我可以通过SQL语句或者修改文件的方法达到恢复的效果
说来惭愧,排查原因使用过strace 进行跟踪启动,但是因为经验不足还是没找到相关信息,但其实也有类似的信息,这里就不做相关讨论了。
找到原因就好解决了,手工将mysqld-auto.cnf
对应参数进行修改就可以了。
扩展思考
使用mysqlshell
虽然方便了集群管理,但是他的命令对我来说还是个黑盒子,还不能完全掌握他命令所带来的的影响。在找到原因后又想到了,他是自动的命令,当我们手动修改了mysqld-auto.cnf
,可再次使用mysqlshell
时他是否会重新修改该文件,所以在后面的测试的时候找到了一些场景,如:
l.switchToMultiPrimaryMode() l.switchToSinglePrimaryMode() dba.configureInstance() dba.createCluster(''); l.addInstance('')
也需要注意不止mysqlshell
会执行持久化的SQL ,在执行多主/单主(select group_replication_switch_to_single_primary_mode();,select group_replication_switch_to_multi_primary_mode();)
模式切换函数也同样会持久化参数(这也是正常的,毕竟这样也是避免忘记和减少了修改my.cnf文件的操作)
在查找问题的时候发现使用mysqlshell
配置MGR
时,还发现有一些MGR
集群标识的参数是自动生成而不是参考my.cnf
的,这多少有些不理解:
SET PERSIST group_replication_group_name = 'b619e3e1-a02e-11ec-90a1-080027d56d22'
SET PERSIST 'group_replication_local_address' = 'mgr3:33061'
SET PERSIST 'group_replication_group_seeds' = 'mgr1:33061,mgr2:33061'
这几个参数在识别MGR
集群 还是比较重要的
还有在创建MGR
验证时也用的是自动生成的用户
CHANGE REPLICATION SOURCE TO SOURCE_USER = 'mysql_innodb_cluster_333306', SOURCE_PASSWORD = <secret> FOR CHANNEL 'group_replication_recovery'
这里有个不好的点就是这用户 DBA是不知道密码的,这样也要求我们既然使用了mysqlshell
进行部署mgr
,那么后期运维也需要通过mysqlshell
进行会方便许多,但也增加了些不透明性,不过这也算是官方推荐的方法吧
References
[1]
SET PERSIST Statement: https://dev.mysql.com/blog-archive/mysql-8-0-persisting-configuration-variables/#:~:text=introduced%20in%208.0-,SET%20PERSIST%20Statement,-In%20Worklog%20%238688[2]
6.1.5 Persisting Settings: https://dev.mysql.com/doc/mysql-shell/8.0/en/admin-api-persisting-settings.html#:~:text=these%20configuration%20changes%20can%20be%20persisted%20to%20the%20instance%20automatically.




