
1、背景
在前一篇短文里:破坏你的PG(1): 让它尽快OOM, 或者让表急剧膨胀, 只是简要介绍了通过缩小以下两个参数:max_locks_per_transaction 和 max_connections的值,分别从64降到10, 100降到5。这样可以模拟快速达到边界,触发OOM。
采用默认值的时候,当一个事务的子事务超过64,也会出现问题。如果有兴趣,可以设计那样的案例来重现。
OOM大多是因为Shared Buffer用完了,不够用,但是同时又要申请,出现的问题。通常操作系统中会启用虚存SWAP空间,如果物理内存不够,SWAP来凑。如果我们禁用的了SWAP呢?
2、具体实例
我们可以参考一下白老师的一篇文章以及一些SWAP的一些设置,文章在:内存管理中的overcommit和oom killer (https://mp.weixin.qq.com/s/xTvxCm9EM9rZ64quB8ehqg)。
我们可以反其道而行之,在一个内存为2G的VM里,我们把SWAP先完全禁掉。
sysctl -w vm.oom-kill=1
sysctl -w vm.swappiness=0
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=10
vm.swappiness = 0, 完全禁用swap,不用虚存
vm.oom-kill=1, 启用oom kill
vm.overcommit_memory=2, 会分配一部分物理RAM用于内核使用。分配的数量由设置vm.overcommit_ratio配置。这意味着可用于程序的虚拟内存的数量实际上是:
RAM *(overcommit_ratio 100) + SWAP。
我们使用root用户将上述参数一设置,那么可能就只有那么200M左右的内存可供使用了。
2.1 准备表及数据
postgres=# create table t(id int primary key, col2 text);
CREATE TABLE
postgres=# CREATE OR REPLACE FUNCTION random_string( int ) RETURNS TEXT as $$
postgres$# SELECT string_agg(substring('abcdefghijklmnopqrstuvwxyz', round(random() * 26 + 0.5)::integer, 1), '')
postgres$# FROM generate_series(1, $1);
postgres$# $$ language sql;
CREATE FUNCTION
^
postgres=# insert into t select n, random_string(2480) from generate_series(1, 10000) as n;
INSERT 0 10000
2.2 更改PG相关参数
把shared_buffer参数从默认值再往下降一降, 把work_mem值再往上,涨到20M
vi postgresql.conf
shared_buffers = 32MB
work_mem = 1MB
pg_ctl restart
2.3 pgbench加压验证
只运行一个简单的查询脚本 query.sql:
select * from t where col2 is not null;
2.3.1、10个并发请求同时运行:
pgbench -c 10 -j 10 -r -n -T 240 -f query.sql
pgbench (14.7)
transaction type: query.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 10
duration: 240 s
number of transactions actually processed: 1820
latency average = 1322.868 ms
initial connection time = 72.444 ms
tps = 7.559331 (without initial connection time)
statement latencies in milliseconds:
1321.344 select * from t where col2 is not null;
用smem查看内存消耗:
[04:16:13-postgres@centos1:/var/lib/pgsql/14/data/log]$ smem -k
PID User Command Swap USS PSS RSS
83433 postgres postgres: archiver 0 152.0K 235.0K 2.2M
83429 postgres postgres: checkpointer 0 168.0K 247.0K 2.2M
83427 postgres postgres: logger 0 192.0K 264.0K 2.1M
83430 postgres postgres: background writer 0 228.0K 364.0K 2.7M
83434 postgres postgres: stats collector 0 284.0K 365.0K 2.3M
83435 postgres postgres: logical replicati 0 444.0K 554.0K 2.9M
83432 postgres postgres: autovacuum launch 0 576.0K 716.0K 3.2M
83431 postgres postgres: walwriter 0 224.0K 822.0K 3.3M
82193 postgres -bash 0 1.8M 2.1M 3.5M
82377 postgres -bash 0 1.8M 2.2M 3.6M
83625 postgres postgres: postgres postgres 0 1.0M 4.0M 32.8M
83627 postgres postgres: postgres postgres 0 1.0M 4.0M 32.8M
83628 postgres postgres: postgres postgres 0 1.0M 4.0M 32.8M
83626 postgres postgres: postgres postgres 0 1.0M 4.0M 32.8M
83629 postgres postgres: postgres postgres 0 1.0M 4.0M 32.8M
83630 postgres postgres: postgres postgres 0 1.0M 4.0M 32.8M
83631 postgres postgres: postgres postgres 0 1.0M 4.0M 32.8M
83633 postgres postgres: postgres postgres 0 1.0M 4.0M 32.8M
83632 postgres postgres: postgres postgres 0 1.0M 4.0M 32.8M
83624 postgres postgres: postgres postgres 0 1.0M 4.0M 32.8M
83426 postgres /usr/pgsql-14/bin/postgres 0 4.8M 5.8M 11.9M
83763 postgres python /usr/bin/smem -k 0 5.3M 6.0M 7.5M
83613 postgres pgbench -c 10 -j 10 -r -n - 0 146.8M 146.9M 148.9M
好像还是撑得住。
2.3.2、更换策略, 调大work_mem:
将参数work_mem调整为10M,再看。
vi postgresql.conf
work_mem = 10MB
pg_ctl restart
pgbench -c 10 -j 10 -r -n -T 240 -f query.sql
pgbench (14.7)
transaction type: query.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 10
duration: 240 s
number of transactions actually processed: 2524
latency average = 953.088 ms
initial connection time = 25.653 ms
tps = 10.492215 (without initial connection time)
statement latencies in milliseconds:
952.447 select * from t where col2 is not null;
运行过程中,查看内存情况:
[04:22:50-postgres@centos1:/var/lib/pgsql/14/data/log]$ smem -k
PID User Command Swap USS PSS RSS
84045 postgres postgres: archiver 0 152.0K 230.0K 2.2M
84041 postgres postgres: checkpointer 0 168.0K 244.0K 2.2M
84039 postgres postgres: logger 0 192.0K 262.0K 2.1M
84042 postgres postgres: background writer 0 228.0K 360.0K 2.7M
84046 postgres postgres: stats collector 0 284.0K 361.0K 2.3M
84047 postgres postgres: logical replicati 0 444.0K 548.0K 2.9M
84044 postgres postgres: autovacuum launch 0 576.0K 700.0K 3.2M
84043 postgres postgres: walwriter 0 224.0K 822.0K 3.3M
84244 postgres postgres: postgres postgres 0 984.0K 1.1M 4.6M
84243 postgres psql 0 1.4M 1.7M 4.0M
84164 postgres -bash 0 1.7M 1.9M 3.3M
82193 postgres -bash 0 1.8M 2.0M 3.5M
82377 postgres -bash 0 1.8M 2.1M 3.6M
84084 postgres postgres: postgres postgres 0 1.0M 4.0M 32.8M
84089 postgres postgres: postgres postgres 0 1.0M 4.0M 32.8M
84092 postgres postgres: postgres postgres 0 1.0M 4.0M 32.8M
84087 postgres postgres: postgres postgres 0 1.0M 4.0M 32.8M
84088 postgres postgres: postgres postgres 0 1.0M 4.0M 32.8M
84091 postgres postgres: postgres postgres 0 1.0M 4.0M 32.8M
84090 postgres postgres: postgres postgres 0 1.0M 4.0M 32.8M
84085 postgres postgres: postgres postgres 0 1.0M 4.0M 32.8M
84083 postgres postgres: postgres postgres 0 1.1M 4.0M 32.8M
84086 postgres postgres: postgres postgres 0 1.1M 4.1M 32.9M
84038 postgres /usr/pgsql-14/bin/postgres 0 4.8M 5.7M 11.9M
84326 postgres python /usr/bin/smem -k 0 5.3M 6.0M 7.5M
84072 postgres pgbench -c 10 -j 10 -r -n - 0 134.1M 134.3M 136.4M
似乎上边的参数不受太大的影响。实际每个连接还是只占1M左右。我们后边直接加大并发看看。这是因为query.sql不需要用到多大的work_mem。
我们回想一下,如何让work_mem真正的用起来,可以使用order by排序来加大实际利用率。
2.3.3、修改query.sql
select * from t order by col2;
[04:35:50-postgres@centos1:/var/lib/pgsql]$ pgbench -c 90 -j 90 -r -n -T 60 -f query.sql
[04:35:50-postgres@centos1:/var/lib/pgsql]$ pgbench -c 90 -j 90 -r -n -T 60 -f query.sql
pgbench (14.7)
pgbench: error: client 28 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 3 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 50 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 9 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 19 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 14 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 83 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 45 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 37 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 57 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 40 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 31 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 42 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 73 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 85 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 80 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 6 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 25 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 0 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 84 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 72 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 59 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 22 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 41 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 1 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 58 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 18 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 87 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 49 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 20 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 46 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 39 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 64 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 52 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 53 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 33 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 78 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 62 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 38 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 21 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 60 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 75 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 4 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 51 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 89 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 23 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 55 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 44 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 69 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 35 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 76 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 2 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 77 script 0 aborted in command 0 query 0: out of memory for query result
通过监控内存开销,我们也可以看到:
[04:35:16-postgres@centos1:/var/lib/pgsql/14/data/log]$ smem -k
PID User Command Swap USS PSS RSS
84039 postgres postgres: logger 0 192.0K 240.0K 2.1M
84045 postgres postgres: archiver 0 152.0K 271.0K 2.2M
84042 postgres postgres: background writer 0 212.0K 400.0K 2.7M
84046 postgres postgres: stats collector 0 288.0K 405.0K 2.3M
84041 postgres postgres: checkpointer 0 328.0K 462.0K 3.0M
84047 postgres postgres: logical replicati 0 444.0K 633.0K 2.9M
84044 postgres postgres: autovacuum launch 0 572.0K 789.0K 3.2M
84043 postgres postgres: walwriter 0 216.0K 848.0K 3.3M
84164 postgres -bash 0 1.7M 1.9M 3.3M
82193 postgres -bash 0 1.8M 2.0M 3.5M
82377 postgres -bash 0 1.8M 2.1M 3.5M
84038 postgres /usr/pgsql-14/bin/postgres 0 4.5M 5.4M 11.9M
85379 postgres python /usr/bin/smem -k 0 5.4M 6.1M 7.6M
85316 postgres postgres: postgres postgres 0 2.4M 12.3M 34.4M
85321 postgres postgres: postgres postgres 0 2.4M 12.3M 34.5M
85350 postgres postgres: postgres postgres 0 2.4M 12.3M 34.5M
85194 postgres pgbench -c 90 -j 90 -r -n - 0 27.0M 27.0M 29.0M
它的每个连接实际开销也在2.4M左右。这样90个连接加起来,已经超过我们前边设定的200M左右的上限了。于是OOM出现了。
3、总结
做这些验证的目的,不是真正的要破坏你的系统。只是想在限制资源的前提下去模拟重现OOM的现象。有时候因为你的硬件资源过于强大,你反而不那么容易重现。重现了问题之后,可以加深对一些系统参数的理解。
还有,上边限制vm内存的方法,是全局性的,也不是那么可取,纯粹是实验的目的。你也完全可以使用cgroup来进行特定进程的限制,从而不影响操作系统其它进程。
上边的脚本只是用的极简脚本来模拟,如果你使用benchmarkSQL中的大仓来跑,估计重现的更快,这里就不再试了。
当然,也别忘了最后恢复一下系统的vm控制参数:
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
vm.swappiness = 30
后边有机会,可以试着做一做大规模子事务耗光XID的模拟实验。





