暂无图片
暂无图片
6
暂无图片
暂无图片
暂无图片

腾讯云TDSQL分布式数据库一次主从切换案例分享

原创 老王 2022-08-23
2400

背景:

腾讯云的分布式数据库TDSQL拥有强大的数据库监控和分析能力,对于运维DBA来说是一件好事,方便DBA定位问题的根本原因。通过近期发生的一次主从切换案例,给大家分享如何利用腾讯云的工具进行问题分析诊断的过程。

一、问题现象

收到数据库主从切换的告警,登录腾讯云控制台,在实例列表中,根据告警内容的实例 ID / 名称,点击进入对应实例,此次故障实例ID为 tdsqlshard-4c7nma6d。

依次点击“监控告警”–》“实例告警”,可以看到告警发生时间为2022-08-15 21:43:00。

在“监控告警”的指标处,下拉选择“主从切换次数”,时间段选择告警前后10分钟左右,发现2022-08-15 21:43:00采样值为1,意味着确实发生过主从切换。

二、分析过程

确认了告警内容为真实的后,接下来进行详细分析。

2.1 主从切换的原因

说到TDSQL的主从切换,就必须得先搞清楚其高可用原理。下图是引用腾讯云数据库专家工程师雷海林在DTCC2019年中国数据库大会上《TDSQL智能运维平台-扁鹊架构与实践》的技术分享内容。

据其介绍,TDSQL高可用一般是通过探活检测,其利用一个agent模块定期的连接DB,并向TDSQL自建的一个心跳表中写入数据,这样无论是磁盘坏块,磁盘满了还是DB重启导致DB不可用,agent都能准确的判断出来,

当agent连续一段时间写入心跳失败或超时就会触发切换。

那引起切换的原因可能有如下几种:

  • DB服务器故障

  • 内核bug导致的系统hang

  • 磁盘或文件系统故障

  • 资源竞争引起的DB异常

    • –IO耗尽

    • –InnoDB并发线程耗尽

    • – binlog写入竞争

2.2 分析过程

首先,分析DB服务器是否发生故障,通过查看后台日志,发现在2022-08-15 21:43:00前,并未有服务器关机或重启信息,所以排除服务器故障导致的主从切换的可能。通过赤兔平台的日志分析功能也可以看到服务器未重启。

其次,分析系统CPU、IO和文件系统占用情况,使用控制台的指标监控。

2.2.1 CPU占用情况

通过历史监控数据,可以看到在问题发生前,如2022-08-15 21:42:00,CPU使用率达到了78%,并且在80%左右持续了很长一段时间。而告警时刻2022-08-15 21:43:00,使用率突降为0,很有可能是MySQL实例异常后被降级为备库,相关SQL终止了导致使用率突降。之后,CPU使用率又迅速涨到了80%左右,且持续维持在高位。通过这个曲线,说明主从切换前后,MySQL进程的负载较大且持续。

2.2.2 IO使用情况

看到在问题发生前,如2022-08-15 21:42:00,磁盘空间使用率达到了6.85%,并不高。告警时刻2022-08-15 21:43:00,也是稍微增长,说明IO没有达到瓶颈。

2.2.3 磁盘空间占用情况

看到在问题发生前,如2022-08-15 21:42:00,磁盘空间使用率达到了104%,也就是空间已经满了,这时很可能MySQL进程异常。我们知道数据目录空间满是很可能导致MySQL进程异常的。

2.2.4 分析赤兔平台日志

赤兔运营平台是腾讯云的DBA分析利器,通常在私有云环境部署,公有云暂未开放。如果在公有云环境想解决相关问题,可联系腾讯云客服进行咨询。

以下是赤兔平台的后台agent日志,涵盖了服务器状态和数据库信息,格式化其中的JSON格式日志,可以看到问题发生前,如2022-08-15 21:41:58 740323显示MySQL进程占用CPU为61%,而数据目录使用率达到了105%,与监控数据显示一致。

“fs”:"/data2/4466/dbdata_raw/data",“total”:“10000”,“usage”:“105”,“used”:“10523”
复制

初始复制关系为:30.121.109.38_446为主库(可用区:广州七区),9.251.175.72_4466为从库(可用区:广州六区)

[2022-08-15 21:42:20 346409] DEBUG 86051,/data/landun/workspace/TDSQL_Agent/src/localcomm/mysqlinstance.cpp:1161:setrunChange,tid:0x7ff7d1486700,86098,get path:/noshard1/group_1655820823_2764/sets/set@set_1657183327_7/setrun@set_1657183327_7,value is:{“cgroup_cpu”:“hard”,“degrade_flag”:1,“history_ids”:[{“id”:“set_1657183327_7”,“timestamp”:1657183327,“type”:0},{“id”:“set_1657166229_5”,“timestamp”:1657166229,“type”:0},{“id”:“set_1655820937_1”,“timestamp”:1655820937,“type”:0}],“kpstatus”:0,“master”:{“alive”:"-1",“city”:“default”,“election”:true,“hb_err”:“1”,“idc”:“IDC_2_1439_3751_6106-B15”,“idc_weight”:“1”,“losthbtime”:“1660570932”,“name”:“30.121.109.38_4466”,“sqlasyn”:“0”,“weight”:“1”,“zone”:“ap-guangzhou-7”},“password”:“I0pDtpY@xC56rqeDn3”,“proxy”:[{“name”:“30.121.109.38_24466”},{“name”:“9.251.175.72_24466”}],“read_only”:“0”,“resource_info”:{“cpu”:50,“data_disk”:10000,“log_disk”:8000,“mem”:2000},“set”:“set_1657183327_7”,“slave”:[{“alive”:“0”,“city”:“default”,“election”:true,“hb_err”:“0”,“idc”:“IDC_2_1439_2171_1101-P11”,“idc_weight”:“1”,“losthbtime”:“0”,“name”:“9.251.175.72_4466”,“sqlasyn”:“0”,“weight”:“1”,“zone”:“ap-guangzhou-6”}],“specid”:32768,“status”:0,“uniqueid”:“unique_1655820937_1”,“user”:“tdsqlsys_normal”,“wait_slaves_run”:1}
复制

在2022-08-15 21:42:20 346544左右,开始执行主从切换操作

[2022-08-15 21:42:20 346544] DEBUG 86051,/data/landun/workspace/TDSQL_Agent/src/localcomm/mysqlinstance.cpp:374:parseFromSetrunJson,tid:0x7ff7d1486700,86098,set run change info m_masterCon->host:30.121.109.38,port:4466,m_ip:9.251.175.72,m_port:4466 [2022-08-15 21:42:20 346574] DEBUG 86051,/data/landun/workspace/TDSQL_Agent/src/localcomm/mysqlinstance.cpp:1168:setrunChange,tid:0x7ff7d1486700,86098,parseFromSetrunJson success,result:set:set_1657183327_7,election:truemaster:{runtype:0,name:30.121.109.38_4466,alive:-1,sqlasyn:0}tmprep:{runtype:2,name:_0,alive:-1,sqlasyn:0}read_only:0,slave:{runtype:1,name:9.251.175.72_4466,alive:0,sqlasyn:0},resource:specid:32768,resource_info:{mem:2000,data_disk:10000,log_disk:8000,cpu:50},path:/noshard1/group_1655820823_2764/sets/set@set_1657183327_7/setrun@set_1657183327_7 [2022-08-15 21:42:20 367606] DEBUG 86051,/data/landun/workspace/TDSQL_Agent/src/report_new/checkalivethread.cpp:502:dealCheckAlive,tid:0x7ff78d7fa700,87327,thread:7ff78d7fa700 begin to work
复制

在2022-08-15 21:42:20 381168开始,在原主库的相关命令执行报错,提示无法连接MySQL server。

[2022-08-15 21:42:20 381448] ERROR 86051,/data/landun/workspace/TDSQL_Agent/src/report_new/reportstatus.cpp:678:computeReplStatus,tid:0x7ff78dffb700,87326,error [], delay: 21,Slave_IO_Running:Connecting,Slave_SQL_Running:Yes [2022-08-15 21:42:20 382365] ERROR 86051,mysql/mysqlop.cpp:150:tranToVarMap,tid:0x7ff78dffb700,87326,run sql show variables like ‘server_uuid’ error:connect error[Can’t connect to MySQL server on30.121.109.38’ (111Connection refused”)], errno:2003 [2022-08-15 21:42:20 382370] ERROR 86051,/data/landun/workspace/TDSQL_Agent/src/report_new/reportstatus.cpp:393:getServerId,tid:0x7ff78dffb700,87326,get delayUuid failed. error:connect error[Can’t connect to MySQL server on30.121.109.38’ (111Connection refused”)], errno:2003
复制

在2022-08-15 21:42:21 549671开始,看到主从库完成角色切换,9.251.175.72_4466为主库(可用区:广州六区),30.121.109.38_446为从库(可用区:广州七区)。

[2022-08-15 21:42:21 549671] DEBUG 86051,/data/landun/workspace/TDSQL_Agent/src/localcomm/mysqlinstance.cpp:1161:setrunChange,tid:0x7ff7d1486700,86098,get path:/noshard1/group_1655820823_2764/sets/set@set_1657183327_7/setrun@set_1657183327_7,value is:{“cgroup_cpu”:“hard”,“degrade_flag”:1,“history_ids”:[{“id”:“set_1657183327_7”,“timestamp”:1657183327,“type”:0},{“id”:“set_1657166229_5”,“timestamp”:1657166229,“type”:0},{“id”:“set_1655820937_1”,“timestamp”:1655820937,“type”:0}],“kpstatus”:0,“master”:{“alive”:“0”,“city”:“default”,“election”:true,“hb_err”:“0”,“idc”:“IDC_2_1439_2171_1101-P11”,“idc_weight”:“1”,“losthbtime”:“0”,“name”:“9.251.175.72_4466”,“sqlasyn”:“0”,“weight”:“1”,“zone”:“ap-guangzhou-6”},“password”:“I0pDtpY@xC56rqeDn3”,“proxy”:[{“name”:“30.121.109.38_24466”},{“name”:“9.251.175.72_24466”}],“read_only”:“0”,“resource_info”:{“cpu”:50,“data_disk”:10000,“log_disk”:8000,“mem”:2000},“set”:“set_1657183327_7”,“slave”:[{“alive”:"-1",“city”:“default”,“election”:true,“hb_err”:“1”,“idc”:“IDC_2_1439_3751_6106-B15”,“idc_weight”:“1”,“losthbtime”:“1660570932”,“name”:“30.121.109.38_4466”,“sqlasyn”:“0”,“weight”:“1”,“zone”:“ap-guangzhou-7”}],“specid”:32768,“status”:0,“uniqueid”:“unique_1655820937_1”,“user”:“tdsqlsys_normal”,“wait_slaves_run”:1}
复制

2.2.5 分析慢日志

通过TDSQL自带的慢日志分析工具,分析问题时间段前后的日志(2022-08-15 21:00:53-2022-08-15 21:46:53),看出共有慢查询10条

导出结果为Excel,可以清晰看到各类慢SQL的数据,包括语句,执行时间,扫描行数和发送行数等信息。其实主要是两类,一类是UPDATE语句,另一类是多表查询语句。

且这些查询时间大多数在200秒左右,几乎是运行到问题发生时刻。

抽象后SQL语句 SQL示例 最后执行时间 总次数 总时间占比 最大时间(秒) 发送行数 扫描行数
update user_table? set carid=? where actorid%?=? /7:29460170-1660570231-2/update user_table14 set carid=‘cmncxsetwerzsgedye’ where actorId%20=2 2022/8/15 21:36 4 32.57% 309.2 0 5.94E+07
update user_table? set carid=? where actorid%?=? /7:30049999-1660570370-2/update user_table14 set carid=‘zbzxbdert3w45sfsf2’ where actorId%20=1 2022/8/15 21:39 2 16.46% 362.99 0 3.30E+07
select date(create_time) as date_time , actorid as cityid , cast(count(?) as char) as pv , cast(count(distinct actorid) as char) as uv , max( date(create_time)) as max(date_time) from ( select * from ( select actorid , vlog , age , create_time , update_time from user_table? where carid = ? group by create_time, carid /repeat union/ ) a ) as expr_qry where create_time >= ? and create_time <= ? group by date(create_time) , actorid order by max(date(create_time)) desc ,update_time,vlog /7:29918921-1660570108-2/SELECT date(create_time) AS date_time , actorId AS cityId , CAST(COUNT(1) AS CHAR) AS PV , CAST(COUNT(DISTINCT actorId) AS CHAR) AS UV , max( date(create_time)) AS MAX(date_time) FROM ( select * from ( SELECT actorId , vlog , age , create_time , update_time from user_table14 WHERE carid = ‘wrt36fdthrtuy567sdfg’ group by create_time, carid union SELECT actorId , vlog , age , create_time , update_time from user_table13 WHERE carid = ‘csdfgejhw3rwfsdfs’ group by create_time, carid union SELECT actorId , vlog , age , create_time , update_time from user_table12 WHERE carid = ‘cmncxsetwerzsgedye’ group by create_time, carid ) a ) AS expr_qry WHERE create_time >= ‘2000-08-08 00:00:00.000000’ AND create_time <= ‘2030-08-31 00:00:00.000000’ GROUP BY date(create_time) , actorId ORDER BY max(date(create_time)) DESC ,update_time,vlog 2022/8/15 21:30 5 16.15% 105.68 1.34E+06 4.90E+06
select date(create_time) as date_time , actorid as cityid , cast(count(?) as char) as pv , cast(count(distinct actorid) as char) as uv , max( date(create_time)) as max(date_time) from ( select * from ( select actorid , vlog , age , create_time , update_time from user_table? where carid = ? group by create_time, carid /repeat union all/ ) a ) as expr_qry where create_time >= ? and create_time <= ? group by date(create_time) , actorid order by max(date(create_time)) desc ,update_time,vlog /7:29689545-1660570274-2/SELECT date(create_time) AS date_time , actorId AS cityId , CAST(COUNT(1) AS CHAR) AS PV , CAST(COUNT(DISTINCT actorId) AS CHAR) AS UV , max( date(create_time)) AS MAX(date_time) FROM ( select * from ( SELECT actorId , vlog , age , create_time , update_time from user_table14 WHERE carid = ‘cmncxsetwerzsgedye’ group by create_time, carid union SELECT actorId , vlog , age , create_time , update_time from user_table13 WHERE carid = ‘cmncxsetwerzsgedye’ group by create_time, carid union all SELECT actorId , vlog , age , create_time , update_time from user_table12 WHERE carid = ‘cmncxsetwerzsgedye’ group by create_time, carid ) a ) AS expr_qry WHERE create_time >= ‘2000-08-08 00:00:00.000000’ AND create_time <= ‘2030-08-31 00:00:00.000000’ GROUP BY date(create_time) , actorId ORDER BY max(date(create_time)) DESC ,update_time,vlog 2022/8/15 21:35 1 7.72% 244.09 269634 983247
select date(create_time) as date_time , actorid as cityid , cast(count(?) as char) as pv , cast(count(distinct actorid) as char) as uv , max( date(create_time)) as max(date_time) from ( select * from ( select actorid , vlog , age , create_time , update_time from user_table? where carid = ? group by create_time, carid ,vlog /repeat union all/ ) a ) as expr_qry where create_time >= ? and create_time <= ? group by date(create_time),vlog , actorid order by max(date(create_time)) desc ,update_time,vlog /7:29361869-1660570463-2/SELECT date(create_time) AS date_time , actorId AS cityId , CAST(COUNT(1) AS CHAR) AS PV , CAST(COUNT(DISTINCT actorId) AS CHAR) AS UV , max( date(create_time)) AS MAX(date_time) FROM ( select * from ( SELECT actorId , vlog , age , create_time , update_time from user_table14 WHERE carid = ‘zbzxbdert3w45sfsf2’ group by create_time, carid ,vlog union SELECT actorId , vlog , age , create_time , update_time from user_table13 WHERE carid = ‘zbzxbdert3w45sfsf2’ group by create_time, carid ,vlog union all SELECT actorId , vlog , age , create_time , update_time from user_table12 WHERE carid = ‘zbzxbdert3w45sfsf2’ group by create_time, carid ,vlog ) a ) AS expr_qry WHERE create_time >= ‘2000-08-08 00:00:00.000000’ AND create_time <= ‘2030-08-31 00:00:00.000000’ GROUP BY date(create_time),vlog , actorId ORDER BY max(date(create_time)) DESC ,update_time,vlog 2022/8/15 21:37 1 6.76% 213.9 400698 1.46E+06
insert into user_table? (name,age,vlog,create_time,update_time) select name,age,vlog,create_time,now() from user_table? /7:29755081-1660569476-2/insert into user_table14 (name,age,vlog,create_time,update_time) select name,age,vlog,create_time,now() from user_table13 2022/8/15 21:21 1 6.01% 190.05 0 1.32E+07
select date(create_time) as date_time , actorid as cityid , cast(count(?) as char) as pv , cast(count(distinct actorid) as char) as uv , max( date(create_time)) as max(date_time) from ( select * from ( select actorid , vlog , age , create_time , update_time from user_table? where carid = ? group by create_time, carid /repeat union/ ) a ) as expr_qry where create_time >= ? and create_time <= ? group by date(create_time) , actorid order by max(date(create_time)) desc ,update_time,vlog /7:29558472-1660569811-2/SELECT date(create_time) AS date_time , actorId AS cityId , CAST(COUNT(1) AS CHAR) AS PV , CAST(COUNT(DISTINCT actorId) AS CHAR) AS UV , max( date(create_time)) AS MAX(date_time) FROM ( select * from ( SELECT actorId , vlog , age , create_time , update_time from user_table14 WHERE carid = ‘asfdxbdty3we43sdfsg2’ group by create_time, carid union SELECT actorId , vlog , age , create_time , update_time from user_table13 WHERE carid = ‘asfdxbdty3we43sdfsg2’ group by create_time, carid union SELECT actorId , vlog , age , create_time , update_time from user_table12 WHERE carid = ‘asfdxbdty3we43sdfsg2’ group by create_time, carid ) a ) AS expr_qry WHERE create_time >= ‘2000-08-08 00:00:00.000000’ AND create_time <= ‘2030-08-31 00:00:00.000000’ GROUP BY date(create_time) , actorId ORDER BY max(date(create_time)) DESC ,update_time,vlog 2022/8/15 21:25 5 5.70% 38.66 1.34E+06 4.90E+06
select date(create_time) as date_time , actorid as cityid , cast(count(?) as char) as pv , cast(count(distinct actorid) as char) as uv , max( date(create_time)) as max(date_time) from ( select actorid , vlog , age , create_time , update_time from user_table? where carid = ? group by create_time, carid /repeat union/ ) as expr_qry where create_time >= ? and create_time <= ? group by date(create_time) , actorid order by max(date(create_time)) desc limit ? /7:30082756-1660569305-2/SELECT date(create_time) AS date_time , actorId AS cityId , CAST(COUNT(1) AS CHAR) AS PV , CAST(COUNT(DISTINCT actorId) AS CHAR) AS UV , max( date(create_time)) AS MAX(date_time) FROM ( SELECT actorId , vlog , age , create_time , update_time from user_table14 WHERE carid = ‘zbzxbdert3w45sfsf2’ group by create_time, carid union SELECT actorId , vlog , age , create_time , update_time from user_table13 WHERE carid = ‘zbzxbdert3w45sfsf2’ group by create_time, carid union SELECT actorId , vlog , age , create_time , update_time from user_table12 WHERE carid = ‘zbzxbdert3w45sfsf2’ group by create_time, carid ) AS expr_qry WHERE create_time >= ‘2000-08-08 00:00:00.000000’ AND create_time <= ‘2030-08-31 00:00:00.000000’ GROUP BY date(create_time) , actorId ORDER BY max(date(create_time)) DESC LIMIT 10000 2022/8/15 21:16 4 3.58% 31.49 40000 3.03E+06
select date(create_time) as date_time , actorid as cityid , cast(count(?) as char) as pv , cast(count(distinct actorid) as char) as uv , max( date(create_time)) as max(date_time) from ( select actorid , vlog , age , create_time , update_time from user_table? where carid = ? group by create_time, carid /repeat union/ ) as expr_qry where create_time >= ? and create_time <= ? group by date(create_time) , actorid order by max(date(create_time)) desc /7:29787847-1660569481-2/SELECT date(create_time) AS date_time , actorId AS cityId , CAST(COUNT(1) AS CHAR) AS PV , CAST(COUNT(DISTINCT actorId) AS CHAR) AS UV , max( date(create_time)) AS MAX(date_time) FROM ( SELECT actorId , vlog , age , create_time , update_time from user_table14 WHERE carid = ‘cmncxsetwerzsgedye’ group by create_time, carid union SELECT actorId , vlog , age , create_time , update_time from user_table13 WHERE carid = ‘cmncxsetwerzsgedye’ group by create_time, carid union SELECT actorId , vlog , age , create_time , update_time from user_table12 WHERE carid = ‘cmncxsetwerzsgedye’ group by create_time, carid ) AS expr_qry WHERE create_time >= ‘2000-08-08 00:00:00.000000’ AND create_time <= ‘2030-08-31 00:00:00.000000’ GROUP BY date(create_time) , actorId ORDER BY max(date(create_time)) DESC 2022/8/15 21:19 2 2.98% 67.8 539268 1.97E+06
select date(create_time) as date_time , actorid as cityid , cast(count(?) as char) as pv , cast(count(distinct actorid) as char) as uv , max( date(create_time)) as max(date_time) from ( select actorid , vlog , age , create_time , update_time from user_table? where carid = ? group by create_time, carid /repeat union/ ) as expr_qry where create_time >= ? and create_time <= ? group by date(create_time) , actorid order by max(date(create_time)) desc limit ? /7:29689539-1660569470-2/SELECT date(create_time) AS date_time , actorId AS cityId , CAST(COUNT(1) AS CHAR) AS PV , CAST(COUNT(DISTINCT actorId) AS CHAR) AS UV , max( date(create_time)) AS MAX(date_time) FROM ( SELECT actorId , vlog , age , create_time , update_time from user_table14 WHERE carid = ‘zbzxbdert3w45sfsf2’ group by create_time, carid union SELECT actorId , vlog , age , create_time , update_time from user_table13 WHERE carid = ‘zbzxbdert3w45sfsf2’ group by create_time, carid union SELECT actorId , vlog , age , create_time , update_time from user_table12 WHERE carid = ‘zbzxbdert3w45sfsf2’ group by create_time, carid ) AS expr_qry WHERE create_time >= ‘2000-08-08 00:00:00.000000’ AND create_time <= ‘2030-08-31 00:00:00.000000’ GROUP BY date(create_time) , actorId ORDER BY max(date(create_time)) DESC LIMIT 10000 2022/8/15 21:18 1 2.07% 65.52 10000 758699

查看慢日志实际信息,发现问题发生前,很多平常执行非常快的心跳检测信息*(SysDB.StatusTableForHb)查询语句,查询时间(Query_time: 1.178992)达到秒级,而且大部分是处于锁等待时间(Lock_time: 1.178888),说明数据库有严重资源竞争问题。同时系统的字典视图查询也非常慢,说明当时数据库性能非常差。

# Time: 2022-08-15T21:40:25.337243+08:00 # User@Host: tdsqlsys_agent[tdsqlsys_agent] @ localhost [] Id: 95 # Query_time: 2.519860 Lock_time: 0.000043 Rows_sent: 189967 Rows_examined: 189967 SET timestamp=1660570822; select * from performance_schema.data_locks; # Time: 2022-08-15T21:40:31.266382+08:00 # User@Host: tdsqlsys_kp_new[tdsqlsys_kp_new] @ [9.251.175.72] Id: 301073 # Query_time: 1.178992 Lock_time: 1.178888 Rows_sent: 1 Rows_examined: 1 SET timestamp=1660570830; select * from SysDB.StatusTableForHb limit 1; # Time: 2022-08-15T21:40:33.167208+08:00 # User@Host: tdsqlsys_agent[tdsqlsys_agent] @ localhost [] Id: 95 # Query_time: 3.218136 Lock_time: 0.000050 Rows_sent: 246421 Rows_examined: 246421 SET timestamp=1660570829; select * from performance_schema.data_locks; # Time: 2022-08-15T21:40:41.903564+08:00 # User@Host: tdsqlsys_agent[tdsqlsys_agent] @ localhost [] Id: 91 # Query_time: 1.626103 Lock_time: 1.625971 Rows_sent: 0 Rows_examined: 0 SET timestamp=1660570840; show table status in xa like 'gtid_log_t'; # Time: 2022-08-15T21:40:41.903568+08:00 # User@Host: tdsqlsys_agent[tdsqlsys_agent] @ localhost [] Id: 30099 # Query_time: 1.349831 Lock_time: 1.349707 Rows_sent: 0 Rows_examined: 0 SET timestamp=1660570840; CREATE TABLE IF NOT EXISTS `StatusTable` (`ip` varchar(20) NOT NULL DEFAULT '',`port` int(11) NOT NULL DEFAULT '0',`ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_T IMESTAMP,PRIMARY KEY (`ip`,`port`)) ENGINE=InnoDB; # Time: 2022-08-15T21:40:41.937017+08:00 # User@Host: tdsqlsys_kp_new[tdsqlsys_kp_new] @ [9.251.175.72] Id: 301087 # Query_time: 1.143792 Lock_time: 1.143639 Rows_sent: 1 Rows_examined: 1 SET timestamp=1660570840; select * from SysDB.StatusTableForHb limit 1; # Time: 2022-08-15T21:40:44.468514+08:00 # User@Host: tdsqlsys_agent[tdsqlsys_agent] @ localhost [] Id: 95 # Query_time: 4.195132 Lock_time: 0.000176 Rows_sent: 324735 Rows_examined: 324735 SET timestamp=1660570840; select * from performance_schema.data_locks; /data/tdsql_run/4466/mysql-server-8.0.18/bin/mysqld, Version: 8.0.22-v18-txsql-2.0.1-V2.0R710D002-20210517-2251 (Source distribution). started with: Tcp port: 4466 Unix socket: /data2/4466/prod/mysql.sock Time Id Command Argument # Time: 2022-08-15T21:43:52.946641+08:00 # User@Host: tdsqlsys_agent[tdsqlsys_agent] @ localhost [] Id: 95 # Query_time: 1.292711 Lock_time: 0.000038 Rows_sent: 81827 Rows_examined: 81827 SET timestamp=1660571031; select * from performance_schema.data_locks;
复制

下面是正常情况下的慢日志内容,可以看到几乎没有什么慢SQL。

/data/tdsql_run/4466/mysql-server-8.0.18/bin/mysqld, Version: 8.0.22-v18-txsql-2.0.1-V2.0R710D002-20210517-2251 (Source distribution). started with: Tcp port: 4466 Unix socket: /data2/4466/prod/mysql.sock Time Id Command Argument /data/tdsql_run/4466/mysql-server-8.0.18/bin/mysqld, Version: 8.0.22-v18-txsql-2.0.1-V2.0R710D002-20210517-2251 (Source distribution). started with: Tcp port: 4466 Unix socket: /data2/4466/prod/mysql.sock Time Id Command Argument /data/tdsql_run/4466/mysql-server-8.0.18/bin/mysqld, Version: 8.0.22-v18-txsql-2.0.1-V2.0R710D002-20210517-2251 (Source distribution). started with: Tcp port: 4466 Unix socket: /data2/4466/prod/mysql.sock Time Id Command Argument /data/tdsql_run/4466/mysql-server-8.0.18/bin/mysqld, Version: 8.0.22-v18-txsql-2.0.1-V2.0R710D002-20210517-2251 (Source distribution). started with: Tcp port: 4466 Unix socket: /data2/4466/prod/mysql.sock Time Id Command Argument /data/tdsql_run/4466/mysql-server-8.0.18/bin/mysqld, Version: 8.0.22-v18-txsql-2.0.1-V2.0R710D002-20210517-2251 (Source distribution). started with: Tcp port: 4466 Unix socket: /data2/4466/prod/mysql.sock Time Id Command Argument /data/tdsql_run/4466/mysql-server-8.0.18/bin/mysqld, Version: 8.0.22-v18-txsql-2.0.1-V2.0R710D002-20210517-2251 (Source distribution). started with: Tcp port: 4466 Unix socket: /data2/4466/prod/mysql.sock Time Id Command Argument
复制

2.2.6 分析错误日志

看到在问题发生前几分钟,大量的连接因通信问题造成无法连接,而21:42:20.380420直接提示无法连接实例,接着实例就切换为从库了,说明实例并没有异常。

2022-08-15T21:35:07.040290+08:00 2722830 [Note] [MY-010914] [Server] Aborted connection 2722830 to db: 'unconnected' user: 'tdsqlsys_kp_new' host: '30.121.109.38' (Got an error reading communication packets). 2022-08-15T21:35:07.045254+08:00 0 [Note] [MY-011953] [InnoDB] Page cleaner took 4293ms to flush 429 pages 2022-08-15T21:35:40.966194+08:00 2723017 [Note] [MY-010914] [Server] Aborted connection 2723017 to db: 'unconnected' user: 'tdsqlsys_kp_new' host: '30.121.109.38' (Got an error reading communication packets). 2022-08-15T21:35:41.040021+08:00 2723019 [Note] [MY-010914] [Server] Aborted connection 2723019 to db: 'unconnected' user: 'tdsqlsys_kp_new' host: '9.251.175.72' (Got an error reading communication packets). 2022-08-15T21:35:41.040107+08:00 2723023 [Note] [MY-010914] [Server] Aborted connection 2723023 to db: 'sysdb' user: 'tdsqlsys_agent' host: 'localhost' (Got an error writing communication packets). 2022-08-15T21:35:41.147232+08:00 2722864 [Note] [MY-010914] [Server] Aborted connection 2722864 to db: 'sysdb' user: 'tdsqlsys_agent' host: 'localhost' (Got an error writing communication packets). 2022-08-15T21:35:41.147358+08:00 2722998 [Note] [MY-010914] [Server] Aborted connection 2722998 to db: 'sysdb' user: 'tdsqlsys_agent' host: 'localhost' (Got an error writing communication packets). 2022-08-15T21:35:41.748745+08:00 0 [Note] [MY-011953] [InnoDB] Page cleaner took 5699ms to flush 9574 pages 2022-08-15T21:35:54.940278+08:00 2723098 [Note] [MY-010914] [Server] Got an error reading communication packets 2022-08-15T21:35:54.940307+08:00 2723098 [Note] [MY-010914] [Server] Aborted connection 2723098 to db: 'unconnected' user: 'unauthenticated' host: '30.121.109.38' (Got an error reading communication packets). 2022-08-15T21:36:09.540272+08:00 2723175 [Note] [MY-010914] [Server] Got an error reading communication packets 2022-08-15T21:36:09.540304+08:00 2723175 [Note] [MY-010914] [Server] Aborted connection 2723175 to db: 'unconnected' user: 'unauthenticated' host: '30.121.109.38' (Got an error reading communication packets). 2022-08-15T21:36:26.761718+08:00 2723035 [Note] [MY-010914] [Server] Aborted connection 2723035 to db: 'sysdb' user: 'tdsqlsys_agent' host: 'localhost' (Got an error writing communication packets). 2022-08-15T21:36:58.140203+08:00 2723433 [Note] [MY-010914] [Server] Got an error reading communication packets 2022-08-15T21:36:58.140232+08:00 2723433 [Note] [MY-010914] [Server] Aborted connection 2723433 to db: 'unconnected' user: 'unauthenticated' host: '30.121.109.38' (Got an error reading communication packets). 2022-08-15T21:37:16.479774+08:00 2723516 [Note] [MY-010914] [Server] Aborted connection 2723516 to db: 'unconnected' user: 'tdsqlsys_kp_new' host: '9.251.175.72' (Got an error reading communication packets). 2022-08-15T21:37:49.740279+08:00 2723707 [Note] [MY-010914] [Server] Got an error reading communication packets 2022-08-15T21:37:49.740310+08:00 2723707 [Note] [MY-010914] [Server] Aborted connection 2723707 to db: 'unconnected' user: 'unauthenticated' host: '30.121.109.38' (Got an error reading communication packets). 2022-08-15T21:38:00.539938+08:00 2723766 [Note] [MY-010914] [Server] Got an error reading communication packets 2022-08-15T21:38:00.539971+08:00 2723766 [Note] [MY-010914] [Server] Aborted connection 2723766 to db: 'unconnected' user: 'unauthenticated' host: '9.251.175.72' (Got an error reading communication packets). 2022-08-15T21:42:20.380420+08:00 18 [ERROR] [MY-010557] [Repl] Error reading packet from server for channel '': Lost connection to MySQL server during query (server_errno=2013) 2022-08-15T21:42:20.380459+08:00 18 [Note] [MY-011027] [Repl] Failed registering on master, reconnecting to try again, log 'binlog.000066' at position 74625733. for channel ''. 2022-08-15T21:42:20.380751+08:00 18 [Warning] [MY-010897] [Repl] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information. 2022-08-15T21:42:20.381551+08:00 18 [ERROR] [MY-010584] [Repl] Slave I/O for channel '': error reconnecting to master 'tdsqlsys_repl@30.121.109.38:4466' - retry-time: 60 retries: 1 message: Can't connect to MySQL server on '30.121.109.38' (111), Error_code: MY-002003 2022-08-15T21:42:21.442460+08:00 18 [Note] [MY-011028] [Repl] Slave I/O thread killed during or after reconnect. 2022-08-15T21:42:21.442488+08:00 18 [Note] [MY-010570] [Repl] Slave I/O thread exiting for channel '', read up to log 'binlog.000066', position 74625733 2022-08-15T21:42:23.853441+08:00 301243 [Note] [MY-010462] [Repl] Start binlog_dump to master_thread_id(301243) slave_server(1831245938), pos(, 4) 2022-08-15T21:42:24.545757+08:00 19 [Note] [MY-010596] [Repl] Error reading relay log event for channel '': slave SQL thread was killed 2022-08-15T21:42:24.548974+08:00 19 [Note] [MY-010587] [Repl] Slave SQL thread for channel '' exiting, replication stopped in log 'binlog.000066' at position 74625733 2022-08-15T21:44:18.881308+08:00 300995 [Note] [MY-010914] [Server] Aborted connection 300995 to db: 'sysdb' user: 'tdsqlsys_agent' host: 'localhost' (Got an error writing communication packets). 2022-08-15T21:44:41.781175+08:00 301886 [Note] [MY-010914] [Server] Aborted connection 301886 to db: 'sysdb' user: 'tdsqlsys_agent' host: 'localhost' (Got an error writing communication packets). 2022-08-15T21:44:52.137729+08:00 302052 [Note] [MY-010914] [Server] Got an error reading communication packets 2022-08-15T21:44:52.137770+08:00 302052 [Note] [MY-010914] [Server] Aborted connection 302052 to db: 'unconnected' user: 'unauthenticated' host: '9.251.175.72' (Got an error reading communication packets). 2022-08-15T21:44:52.187806+08:00 302054 [Note] [MY-010914] [Server] Got an error reading communication packets
复制

2.3 根因分析

通过赤兔平台的agent日志分析结果看,并未找到发生切换的真实原因。

那我们从主从切换的原因以及各个维度的数据一一进行分析:

  1. 首先排除服务器故障

  2. 是否MySQL实例使用CPU满了导致,根据监控数据看,最高使用率在80%左右,并未使用满,但是需要注意的是监控室分钟级采样,而让agent进行切换只需几十秒,所以这个可能还是有的;

  3. 是否IO异常导致,通过IO使用率可以看出IO并未使用满,高峰只有15%左右(但是这个数据可能是整个服务器的,而不是这一个实例的IO使用情况),而且,如果IO异常了,那慢日志中肯定会出现很多commit超时的情况,而实际的慢日志并未出现写心跳超时的SQL,所以初步排除;

  4. 是否文件系统满导致,看监控数据,确实在问题时刻,磁盘使用率从80%增长到了105%,意味着磁盘空间满了,那MySQL通常会异常。但是由于是云MySQL实例,其空间最大可使用到150%,所以应该不会是存储空间满了导致的,而且切换后空间依然是105%,实例依然正常运行,说明文件系统满了导致的原因也可以排除;

  5. 是否是binlog写入竞争,因写心跳的binlog无法写入导致agent等待。这种情况一般binlog较大(一般超过10G),问题时间段前后binlog均没有这么大;

  6. 是否线程池竞争导致,根据每分钟监控采集数据看,活动会话是没有大于64,最多十几,所以初步排除。

最后,通过赤兔平台后台的抓取的活动会话数据(见下图),可以看到当时有很多慢查询(慢日志中的SQL)以及对几张大表的千万级DML大事务操作。由于实例配置低(1核2G),在多个并发慢查询以及大事务的压力下,导致心跳写入操作(replace into SysDB.StatusTableForHB)处于无法提交状态(waiting for handler commit ),最终触发主从切换。。
1402610692.jpg

三、案例回顾

通过以上案例分析流程,看到腾讯云TDSQL提供了非常完善的运维工具,如实例多维度的分钟级监控帮助分析问题前后的数据库状态、告警信息及时通知运维人员、慢查询的统计分析,以及功能强大的赤兔平台能及时收集故障前的现场信息并自动分析出相关原因。这些强大的功能都是DBA运维的利器。

最后修改时间:2022-08-26 15:45:54
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论