In this post, I use a production case to summarize the general approach and direction for analyzing a common class of problems: a slow database, or individual slow requests.
Problem Description
The symptom is that application write requests frequently time out with errors such as:
read tcp 1xx.1xx.220.9:24735->100.9xx.1xx.3:5432: i/o timeout.
The application log looks like this:
2020/10/28 10:11:05 AccessReport.go:77: [ERROR] ExecEx Error: read tcp 1xx.1xx.220.9:24735->100.9xx.1xx.3:5432: i/o timeout. dsn:100.9xx.1xx.3:5432, sql:INSERT INTO user_community_access_records(community_id, uin, access_time) VALUES(453, 1152921504794612653, '2020/10/28 10:11:05')
ON CONFLICT ON CONSTRAINT user_community_access_records_unique_constraint DO UPDATE SET access_time = '2020/10/28 10:11:05'
2020/10/28 10:20:07 TaskFlow.go:109: [ERROR] Exec Insert Error:read tcp 1xx.1xx.220.9:48782->100.9x.19x.44:5432: i/o timeout, insertSQL:INSERT INTO community_task_flow(task_type, task_flow_id, community_id, uin, status, finish_time, update_time)
VALUES(3, 'SignIn_366_2207245694_1603814400', 366, 2207245694, 0, TO_TIMESTAMP(1603814400), TO_TIMESTAMP(1603814400)) ON CONFLICT ON CONSTRAINT community_task_flow_unique_constraint DO UPDATE SET status = 0, update_time=TO_TIMESTAMP(1603814400)
The business system enforces a 200 ms timeout: if a statement in a transaction has not completed within 200 ms, the request is forcibly aborted, which surfaces as a business request error.
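For context, the 200 ms limit here is enforced on the application side. A roughly comparable server-side guard, shown only as an illustrative sketch and not something this system actually configures, would be PostgreSQL's statement_timeout:
-- illustration only: cancel any statement that runs longer than 200 ms
SET statement_timeout = '200ms';                           -- current session
ALTER ROLE plat_logic_2 SET statement_timeout = '200ms';   -- per role (plat_logic_2 is the role seen later in the server logs)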
Problem Diagnosis and Analysis
Slow statement execution in a database can broadly be attributed to one of the following:
the statement itself performs poorly, which usually shows up as slowness of specific requests, and in extreme cases can also create a resource bottleneck and a global performance problem;
the statement itself is fine, but it is hit by a resource or load bottleneck, which shows up as a global performance problem with large numbers of slow requests;
the statement itself is fine, but it is blocked by a preceding task, which usually shows up as slowness of specific requests; under high concurrency, severe blocking can escalate into a global performance problem.
First things first: examine the relevant information in the logs.
Since SQL requests are timing out at 200 ms, we start by collecting slow queries that exceeded 200 ms. They all turn out to be write requests; there are no slow read queries:
$ grep -e "duration" $PGDATA/pg_log/postgresql-Wednesday-10.csv | awk -F"duration:" '{ print $2}' |awk '{if ($1 >= 200) print $0}'| sort -n
232.439 ms statement: INSERT INTO user_community_access_records (community_id, uin, access_time) VALUES (345, 872097103, '2020-10-28 15:28:45+08'::timestamp with time zone) ON CONFLICT ON CONSTRAINT user_community_access_records_unique_constraint DO UPDATE SET access_time = '2020-10-28 15:28:45+08'::timestamp with time zone",,,,,,,,,"psql"
232.549 ms statement: INSERT INTO community_task_flow (task_type, task_flow_id, community_id, uin, status, finish_time, update_time) VALUES (3, 'SignIn_124_596203222_1603814400'::text, 124, 596203222, 0, to_timestamp((1603814400)::double precision), to_timestamp((1603814400)::double precision)) ON CONFLICT ON CONSTRAINT community_task_flow_unique_constraint DO UPDATE SET status = 0, update_time = to_timestamp((1603814400)::double precision)",,,,,,,,,"pgxc:cn004"
232.593 ms statement: INSERT INTO user_community_access_records (community_id, uin, access_time) VALUES (435, 1693958247, '2020-10-28 15:50:47+08'::timestamp with time zone) ON CONFLICT ON CONSTRAINT user_community_access_records_unique_constraint DO UPDATE SET access_time = '2020-10-28 15:50:47+08'::timestamp with time zone",,,,,,,,,"pgxc:cn004"
242.948 ms statement: INSERT INTO user_community_access_records (community_id, uin, access_time) VALUES (175, '2891971600'::bigint, '2020-10-28 14:38:39+08'::timestamp with time zone) ON CONFLICT ON CONSTRAINT user_community_access_records_unique_constraint DO UPDATE SET access_time = '2020-10-28 14:38:39+08'::timestamp with time zone",,,,,,,,,"pgxc:cn003"
249.911 ms statement: COMMIT TRANSACTION",,,,,,,,,"pgxc:cn003"
276.932 ms statement: INSERT INTO user_active_week_rank (uin, community_id, year, week, week_shard, active, status, is_hit, update_time) VALUES ('2752224720'::bigint, 154, 2020, 43, '154_2020_43'::text, 79, 0, 0, to_timestamp((1603867120)::double precision)) ON CONFLICT ON CONSTRAINT user_active_week_rank_unique_constraint DO UPDATE SET active = 79, update_time = to_timestamp((1603867120)::double precision)",,,,,,,,,"psql"
294.255 ms statement: COMMIT TRANSACTION",,,,,,,,,"pgxc:cn001"
372.986 ms statement: INSERT INTO user_active_week_rank (uin, community_id, year, week, week_shard, active, status, is_hit, update_time) VALUES ('2708789532'::bigint, 114, 2020, 43, '114_2020_43'::text, 264, 0, 0, to_timestamp((1603870296)::double precision)) ON CONFLICT ON CONSTRAINT user_active_week_rank_unique_constraint DO UPDATE SET active = 264, update_time = to_timestamp((1603870296)::double precision)",,,,,,,,,"pgxc:cn004"
410.946 ms statement: COMMIT TRANSACTION",,,,,,,,,"psql"
418.113 ms statement: COMMIT TRANSACTION",,,,,,,,,"pgxc:cn001"
447.159 ms statement: INSERT INTO user_active_week_rank (uin, community_id, year, week, week_shard, active, status, is_hit, update_time) VALUES ('1152921505034816585'::bigint, 132, 2020, 43, '132_2020_43'::text, 0, 0, 0, to_timestamp((1603867119)::double precision)) ON CONFLICT ON CONSTRAINT user_active_week_rank_unique_constraint DO UPDATE SET active = 0, update_time = to_timestamp((1603867119)::double precision)",,,,,,,,,"pgxc:cn003"
475.575 ms statement: COMMIT TRANSACTION",,,,,,,,,"pgxc:cn004"
487.518 ms statement: INSERT INTO user_active_week_rank (uin, community_id, year, week, week_shard, active, status, is_hit, update_time) VALUES ('2978056732'::bigint, 165, 2020, 43, '165_2020_43'::text, 609, 0, 0, to_timestamp((1603870308)::double precision)) ON CONFLICT ON CONSTRAINT user_active_week_rank_unique_constraint DO UPDATE SET active = 609, update_time = to_timestamp((1603870308)::double precision)",,,,,,,,,"pgxc:cn002"
499.915 ms statement: INSERT INTO user_active_week_rank (uin, community_id, year, week, week_shard, active, status, is_hit, update_time) VALUES ('2787039202'::bigint, 241, 2020, 43, '241_2020_43'::text, 0, 0, 0, to_timestamp((1603867118)::double precision)) ON CONFLICT ON CONSTRAINT user_active_week_rank_unique_constraint DO UPDATE SET active = 0, update_time = to_timestamp((1603867118)::double precision)",,,,,,,,,"pgxc:cn004"
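As an aside, the per-statement durations that the grep above relies on come from the server's slow-statement logging into the CSV log. This cluster evidently already has that enabled; a minimal sketch of the parameters that typically drive it (the values here are illustrative) would be:
-- log every statement that runs longer than 200 ms
ALTER SYSTEM SET log_min_duration_statement = '200ms';
ALTER SYSTEM SET log_destination = 'csvlog';   -- produces files like postgresql-Wednesday-10.csv (also requires logging_collector = on, which needs a restart)
SELECT pg_reload_conf();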
Next, we verify the intrinsic performance of the statements above: all of them complete within 3 ms, so the statements themselves are not the problem, and we need to check whether other factors are responsible.
target=> begin;
BEGIN
Time: 0.382 ms
target=> INSERT INTO user_target_access_records(target_id, uin, access_time) VALUES(175, 2076287361, TO_TIMESTAMP(1603850945));
INSERT 0 1
Time: 1.457 ms
target=> rollback;
ROLLBACK
Time: 2.488 ms
-- some verification steps omitted --
target=>
target=> INSERT INTO target_task_flow (task_type, task_flow_id, target_id, uin, status, finish_time, update_time) VALUES (3, 'SignIn_284_1152921505012932125_1603814400'::text, 284, '1152921505012932125'::bigint, 0, to_timestamp((1603814400)::double precision), to_timestamp((1603814400)::double precision)) ON CONFLICT ON CONSTRAINT target_task_flow_unique_constraint DO UPDATE SET status = 0, update_time = to_timestamp((1603814400)::double precision);
INSERT 0 1
Time: 1.854 ms
target=> rollback;
ROLLBACK
Time: 2.585 ms
target=>
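Besides \timing, a single statement's cost can also be inspected with EXPLAIN (ANALYZE, BUFFERS) inside a transaction that is rolled back afterwards. A minimal sketch against the same table used above:
BEGIN;
-- ANALYZE actually executes the INSERT, so the ROLLBACK below undoes the test row
EXPLAIN (ANALYZE, BUFFERS)
INSERT INTO user_target_access_records(target_id, uin, access_time)
VALUES (175, 2076287361, TO_TIMESTAMP(1603850945));
ROLLBACK;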
The slow-query durations recorded in the PostgreSQL log include lock wait time, so next we poll the lock wait situation every 100 ms:
postgres=# show deadlock_timeout;
deadlock_timeout
------------------
1s
(1 row)
postgres=# show log_lock_waits;
log_lock_waits
----------------
on
(1 row)
postgres=#
postgres=# select pid,pg_blocking_pids(pid),wait_event_type,query from pg_stat_activity where lower(wait_event_type) like '%lock%' and pid!=pg_backend_pid();\watch 0.1
pid | pg_blocking_pids | wait_event_type | query
-----+------------------+-----------------+-------
(0 rows)
Wed Oct 28 15:18:37 2020 (every 0.1s)
pid | pg_blocking_pids | wait_event_type | query
-----+------------------+-----------------+-------
(0 rows)
Wed Oct 28 15:18:37 2020 (every 0.1s)
pid | pg_blocking_pids | wait_event_type | query
-----+------------------+-----------------+-------
(0 rows)
After monitoring for a while, no lock-type wait events appear.
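For completeness, when pg_blocking_pids() does return blockers, the following variant (a sketch built on the same view) also shows what each blocking session is running:
-- list each waiting backend together with the statements of the backends blocking it
SELECT w.pid   AS waiting_pid,
       w.query AS waiting_query,
       b.pid   AS blocking_pid,
       b.query AS blocking_query
FROM pg_stat_activity w
JOIN pg_stat_activity b ON b.pid = ANY (pg_blocking_pids(w.pid))
WHERE cardinality(pg_blocking_pids(w.pid)) > 0;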
At the application level and in the database logs, only write requests are slow, which is clearly a localized performance problem. Throughout the incident window, resource load on the PostgreSQL gtm, cn, and dn nodes was also low:
[aken@dbhost]$ top
top - 14:54:02 up 159 days, 23:13, 1 user, load average: 0.41, 0.23, 0.24
Tasks: 787 total, 3 running, 783 sleeping, 0 stopped, 1 zombie
%Cpu(s): 2.1 us, 1.9 sy, 0.0 ni, 96.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 65282996 total, 258904 free, 3055888 used, 61968204 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 52239968 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6028 aken 20 0 6102596 21464 7380 S 24.5 0.0 0:00.74 postgres: aken target 100.6x.12x.186(38159) idle
5450 aken 20 0 6086192 62956 58512 S 2.0 0.1 0:00.06 postgres: aken postgres 100.65.128.13(65511) idle
7457 aken 20 0 6088972 65852 60848 S 2.0 0.1 0:00.06 postgres: aken target 100.9xx.1xx.3(45672) idle
Database disk I/O:
[aken@dbhost]$ iostat -d md1 -x 1 3
Linux 3.10.107-1-tlinux2-0052 (Tencent-SNG) 10/28/20 _x86_64_ (32 CPU)
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
md1 0.00 0.00 0.19 68.07 2.50 968.12 28.44 0.00 0.00 0.00 0.00 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
md1 0.00 0.00 0.00 12.00 0.00 72.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
md1 0.00 0.00 0.00 53.00 0.00 384.00 14.49 0.00 0.00 0.00 0.00 0.00 0.00
Neither locks nor resource load explain the slowness, so we turn next to the synchronous replication configuration:
postgres=# show synchronous_commit;
synchronous_commit
--------------------
on
(1 row)
postgres=# show synchronous_standby_names;
synchronous_standby_names
------------------------------------------------------------------------------------------------------------------------------
ANY 17 (9.2x.4x.66:11000,9.10x.13x.207:11000,9.2x.4x.72:11000,100.10x.14x.55:11002,9.18x.16x.253:11004,9.5x1.3x.148:11000,
9.23x.1xx.140:11000,10.5x.8xx.188:11004,9.5x.3x.152:11000,9.xx.xx.39:11000,9.xx.4x.68:110
00,9.23x.14x.17:11000,9.2xx.4x.1xx:11000,9.5x.3x.187:11000,9.1xx.1xx.x7:11000,100.9x.9x.121:11002,9.1xx.x1.x1:11000)
(1 row)
The relevant parameters are described in the PostgreSQL documentation:
https://www.postgresql.org/docs/current/warm-standby.html
Here the Multiple Synchronous Standbys setup is:
synchronous_commit=on
synchronous_standby_names=ANY 17(...)
That is, a transaction on the Primary must be synchronously replicated to 17 standbys before its commit can succeed. "Synchronous replication" here means:
every write request must ship its WAL stream to the standbys, the standbys must successfully apply it, and the data must be flushed from the OS cache to disk.
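The cross-region replication cost can be quantified per standby from pg_stat_replication. A sketch assuming the PostgreSQL 10+ lag columns (the pgxc-based fork used here may expose a slightly different set):
-- per-standby replication lag as seen from the primary
SELECT application_name,
       client_addr,
       sync_state,
       write_lag,
       flush_lag,
       replay_lag
FROM pg_stat_replication
ORDER BY flush_lag DESC NULLS LAST;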
Coming back to the synchronous_standby_names setting above: 12 of the 17 standby IPs are remote nodes in Shanghai and 5 are same-city nodes in Shenzhen. With this many standbys replicating synchronously across regions, it is very likely that synchronous-replication waits between the primary and the standbys are eating into the primary's transaction performance.
Looking further into the primary dn node's log confirms it: transactions are committed locally on the primary dn node but have not been replicated to the standbys, and the frontend is canceled after timing out while waiting for synchronous replication:
$ grep waiting postgresql-Wednesday-14.csv | more
2020-10-28 15:09:35.732 CST,"plat_logic_2","target",27513,coord(0,0),"100.9x.19x.44:48800",5f991903.6b79,coord(0,0),2,"UPDATE waiting for 139/19C21EB0",2020-10-28 15:08:51 CST,83/9190195,338241989,WARNING,01000,"canceling wait for synchronous replication due to user request","The transaction has already committed locally, but might not have been replicated to the standby.",,,,,,,,"pgxc:cn002"
2020-10-28 15:09:35.968 CST,"plat_logic_2","target",1566,coord(0,0),"100.6x.12x.186:45904",5f991928.61e,coord(0,0),1,"INSERT waiting for 139/19C33748",2020-10-28 15:09:28 CST,21/2431746,338241991,WARNING,01000,"canceling wait for synchronous replication due to user request","The transaction has already committed locally, but might not have been replicated to the standby.",,,,,,,,"pgxc:cn003"
2020-10-28 15:13:36.254 CST,"plat_logic_2","target",8630,coord(0,0),"100.6x.12x.186:47939",5f991a18.21b6,coord(0,0),1,"INSERT waiting for 139/1B445F80",2020-10-28 15:13:28 CST,6/2132308,338245059,WARNING,01000,"canceling wait for synchronous replication due to user request","The transaction has already committed locally, but might not have been replicated to the standby.",,,,,,,,"psql"
Regarding primary-standby replication, we can also check directly whether sessions are waiting on synchronous replication:
select query,wait_event_type,wait_event from pg_stat_activity where wait_event='SyncRep';
Solution
To reduce the overhead of synchronous replication, the replication scheme was changed to "ANY 1" synchronous replication within the same city, with fully asynchronous replication to the remote region.
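On a plain PostgreSQL primary such a change could be applied roughly as below; this is only a sketch reusing the same-city standby names from this cluster, and the distributed cluster here may manage the parameter through its own tooling. The resulting configuration is shown next:
-- keep only the five Shenzhen standbys in the quorum list; the parameter is reloadable, no restart needed
ALTER SYSTEM SET synchronous_standby_names =
  'ANY 1 (100.10x.14x.55:11002,9.23x.14x.17:11000,100.9x.9x.121:11002,10.5x.8xx.188:11004,9.23x.1xx.140:11000)';
SELECT pg_reload_conf();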
postgres=# show synchronous_standby_names;
synchronous_standby_names
-----------------------------------------------------------------------------------------------------------
ANY 1 (100.10x.14x.55:11002,9.23x.14x.17:11000,100.9x.9x.121:11002,10.5x.8xx.188:11004,9.23x.1xx.140:11000)
(1 row)
postgres=#
postgres=# SELECT client_addr,application_name,sync_state FROM pg_stat_replication;
client_addr | application_name | sync_state
----------------+----------------------+------------
100.9x.9x.121 | 100.9x.9x.121:11002 | quorum <<< Shenzhen, synchronous (ANY 1 list member)
9.5x.3x.187 | 9.5x.3x.187:11000 | async <<< Shanghai, asynchronous
9.23x.14x.17 | 9.23x.14x.17:11000 | quorum <<< Shenzhen, synchronous (ANY 1 list member)
10.5x.8xx.188 | 10.5x.8xx.188:11004 | quorum <<< Shenzhen, synchronous (ANY 1 list member)
9.23x.1xx.140 | 9.23x.1xx.140:11000 | quorum <<< Shenzhen, synchronous (ANY 1 list member)
100.10x.14x.55 | 100.10x.14x.55:11002 | quorum <<< Shenzhen, synchronous (ANY 1 list member)
(6 rows)
postgres=#
The modified configuration means:
the commit waits for synchronous replication to any one of the 5 same-city standbys; one remote standby replicates asynchronously, and the remaining remote standbys replicate from that asynchronous node.
After the change, the synchronous-replication waits disappear and the transaction timeout errors stop:
[aken@dbhost]$ grep waiting postgresql-Wednesday-18.csv
[aken@dbhost]$
Summary of the Analysis Approach
To wrap up, troubleshooting slowness in production systems follows much the same pattern:
first, pin down what is slow: is it globally slow or only locally slow;
for global slowness, check for resource problems first, focusing on what triggered the resource problem;
for local slowness, first analyze whether the individual statement itself has a performance problem;
for local slowness not caused by the statement itself, investigate external factors such as blocking and wait events.
Reposted from the WeChat public account: DB印象
Author: Aken
