破坏你的PostgreSQL(2): 使用vm参数让它尽快OOM

数据库杂记 2023-04-15

1、背景

在前一篇短文里：破坏你的PG(1): 让它尽快OOM, 或者让表急剧膨胀, 只是简要介绍了通过缩小以下两个参数：max_locks_per_transaction 和 max_connections的值，分别从64降到10， 100降到5。这样可以模拟快速达到边界，触发OOM。

采用默认值的时候，当一个事务的子事务超过64，也会出现问题。如果有兴趣，可以设计那样的案例来重现。

OOM大多是因为Shared Buffer用完了，不够用，但是同时又要申请，出现的问题。通常操作系统中会启用虚存SWAP空间，如果物理内存不够，SWAP来凑。如果我们禁用的了SWAP呢？

2、具体实例

我们可以参考一下白老师的一篇文章以及一些SWAP的一些设置，文章在：内存管理中的overcommit和oom killer (https://mp.weixin.qq.com/s/xTvxCm9EM9rZ64quB8ehqg)。

我们可以反其道而行之，在一个内存为2G的VM里，我们把SWAP先完全禁掉。

sysctl -w vm.oom-kill=1
sysctl -w vm.swappiness=0
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=10

vm.swappiness = 0, 完全禁用swap，不用虚存
vm.oom-kill=1, 启用oom kill
vm.overcommit_memory=2, 会分配一部分物理RAM用于内核使用。分配的数量由设置vm.overcommit_ratio配置。这意味着可用于程序的虚拟内存的数量实际上是：
RAM *（overcommit_ratio 100） + SWAP。

我们使用root用户将上述参数一设置，那么可能就只有那么200M左右的内存可供使用了。

2.1 准备表及数据

postgres=# create table t(id int primary key, col2 text);
CREATE TABLE
postgres=# CREATE OR REPLACE FUNCTION random_string( int ) RETURNS TEXT as $$
postgres$#     SELECT string_agg(substring('abcdefghijklmnopqrstuvwxyz', round(random() * 26 + 0.5)::integer, 1), '')
postgres$#         FROM generate_series(1, $1);
postgres$# $$ language sql;
CREATE FUNCTION
                           ^
postgres=# insert into t select n, random_string(2480) from generate_series(1, 10000) as n;
INSERT 0 10000

2.2 更改PG相关参数

把shared_buffer参数从默认值再往下降一降, 把work_mem值再往上，涨到20M

vi postgresql.conf
shared_buffers = 32MB
work_mem = 1MB

pg_ctl restart

2.3 pgbench加压验证

只运行一个简单的查询脚本 query.sql:

select * from t where col2 is not null;

2.3.1、10个并发请求同时运行：

 pgbench -c 10 -j 10 -r -n -T 240 -f query.sql

pgbench (14.7)
transaction type: query.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 10
duration: 240 s
number of transactions actually processed: 1820
latency average = 1322.868 ms
initial connection time = 72.444 ms
tps = 7.559331 (without initial connection time)
statement latencies in milliseconds:
      1321.344  select * from t where col2 is not null;

用smem查看内存消耗：

[04:16:13-postgres@centos1:/var/lib/pgsql/14/data/log]$ smem -k
  PID User     Command                         Swap      USS      PSS      RSS
83433 postgres postgres: archiver                 0   152.0K   235.0K     2.2M
83429 postgres postgres: checkpointer             0   168.0K   247.0K     2.2M
83427 postgres postgres: logger                   0   192.0K   264.0K     2.1M
83430 postgres postgres: background writer        0   228.0K   364.0K     2.7M
83434 postgres postgres: stats collector          0   284.0K   365.0K     2.3M
83435 postgres postgres: logical replicati        0   444.0K   554.0K     2.9M
83432 postgres postgres: autovacuum launch        0   576.0K   716.0K     3.2M
83431 postgres postgres: walwriter                0   224.0K   822.0K     3.3M
82193 postgres -bash                              0     1.8M     2.1M     3.5M
82377 postgres -bash                              0     1.8M     2.2M     3.6M
83625 postgres postgres: postgres postgres        0     1.0M     4.0M    32.8M
83627 postgres postgres: postgres postgres        0     1.0M     4.0M    32.8M
83628 postgres postgres: postgres postgres        0     1.0M     4.0M    32.8M
83626 postgres postgres: postgres postgres        0     1.0M     4.0M    32.8M
83629 postgres postgres: postgres postgres        0     1.0M     4.0M    32.8M
83630 postgres postgres: postgres postgres        0     1.0M     4.0M    32.8M
83631 postgres postgres: postgres postgres        0     1.0M     4.0M    32.8M
83633 postgres postgres: postgres postgres        0     1.0M     4.0M    32.8M
83632 postgres postgres: postgres postgres        0     1.0M     4.0M    32.8M
83624 postgres postgres: postgres postgres        0     1.0M     4.0M    32.8M
83426 postgres /usr/pgsql-14/bin/postgres         0     4.8M     5.8M    11.9M
83763 postgres python /usr/bin/smem -k            0     5.3M     6.0M     7.5M
83613 postgres pgbench -c 10 -j 10 -r -n -        0   146.8M   146.9M   148.9M

好像还是撑得住。

2.3.2、更换策略, 调大work_mem：

将参数work_mem调整为10M，再看。

vi postgresql.conf
work_mem = 10MB

pg_ctl restart

pgbench -c 10 -j 10 -r -n -T 240 -f query.sql

pgbench (14.7)
transaction type: query.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 10
duration: 240 s
number of transactions actually processed: 2524
latency average = 953.088 ms
initial connection time = 25.653 ms
tps = 10.492215 (without initial connection time)
statement latencies in milliseconds:
       952.447  select * from t where col2 is not null;

运行过程中，查看内存情况：

[04:22:50-postgres@centos1:/var/lib/pgsql/14/data/log]$ smem -k
  PID User     Command                         Swap      USS      PSS      RSS
84045 postgres postgres: archiver                 0   152.0K   230.0K     2.2M
84041 postgres postgres: checkpointer             0   168.0K   244.0K     2.2M
84039 postgres postgres: logger                   0   192.0K   262.0K     2.1M
84042 postgres postgres: background writer        0   228.0K   360.0K     2.7M
84046 postgres postgres: stats collector          0   284.0K   361.0K     2.3M
84047 postgres postgres: logical replicati        0   444.0K   548.0K     2.9M
84044 postgres postgres: autovacuum launch        0   576.0K   700.0K     3.2M
84043 postgres postgres: walwriter                0   224.0K   822.0K     3.3M
84244 postgres postgres: postgres postgres        0   984.0K     1.1M     4.6M
84243 postgres psql                               0     1.4M     1.7M     4.0M
84164 postgres -bash                              0     1.7M     1.9M     3.3M
82193 postgres -bash                              0     1.8M     2.0M     3.5M
82377 postgres -bash                              0     1.8M     2.1M     3.6M
84084 postgres postgres: postgres postgres        0     1.0M     4.0M    32.8M
84089 postgres postgres: postgres postgres        0     1.0M     4.0M    32.8M
84092 postgres postgres: postgres postgres        0     1.0M     4.0M    32.8M
84087 postgres postgres: postgres postgres        0     1.0M     4.0M    32.8M
84088 postgres postgres: postgres postgres        0     1.0M     4.0M    32.8M
84091 postgres postgres: postgres postgres        0     1.0M     4.0M    32.8M
84090 postgres postgres: postgres postgres        0     1.0M     4.0M    32.8M
84085 postgres postgres: postgres postgres        0     1.0M     4.0M    32.8M
84083 postgres postgres: postgres postgres        0     1.1M     4.0M    32.8M
84086 postgres postgres: postgres postgres        0     1.1M     4.1M    32.9M
84038 postgres /usr/pgsql-14/bin/postgres         0     4.8M     5.7M    11.9M
84326 postgres python /usr/bin/smem -k            0     5.3M     6.0M     7.5M
84072 postgres pgbench -c 10 -j 10 -r -n -        0   134.1M   134.3M   136.4M

似乎上边的参数不受太大的影响。实际每个连接还是只占1M左右。我们后边直接加大并发看看。这是因为query.sql不需要用到多大的work_mem。

我们回想一下，如何让work_mem真正的用起来，可以使用order by排序来加大实际利用率。

2.3.3、修改query.sql

select * from t order by col2;

[04:35:50-postgres@centos1:/var/lib/pgsql]$ pgbench -c 90 -j 90 -r -n -T 60 -f query.sql

[04:35:50-postgres@centos1:/var/lib/pgsql]$ pgbench -c 90 -j 90 -r -n -T 60 -f query.sql
pgbench (14.7)
pgbench: error: client 28 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 3 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 50 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 9 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 19 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 14 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 83 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 45 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 37 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 57 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 40 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 31 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 42 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 73 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 85 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 80 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 6 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 25 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 0 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 84 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 72 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 59 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 22 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 41 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 1 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 58 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 18 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 87 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 49 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 20 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 46 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 39 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 64 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 52 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 53 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 33 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 78 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 62 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 38 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 21 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 60 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 75 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 4 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 51 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 89 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 23 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 55 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 44 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 69 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 35 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 76 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 2 script 0 aborted in command 0 query 0: out of memory for query result
pgbench: error: client 77 script 0 aborted in command 0 query 0: out of memory for query result

通过监控内存开销，我们也可以看到：

[04:35:16-postgres@centos1:/var/lib/pgsql/14/data/log]$ smem -k
  PID User     Command                         Swap      USS      PSS      RSS
84039 postgres postgres: logger                   0   192.0K   240.0K     2.1M
84045 postgres postgres: archiver                 0   152.0K   271.0K     2.2M
84042 postgres postgres: background writer        0   212.0K   400.0K     2.7M
84046 postgres postgres: stats collector          0   288.0K   405.0K     2.3M
84041 postgres postgres: checkpointer             0   328.0K   462.0K     3.0M
84047 postgres postgres: logical replicati        0   444.0K   633.0K     2.9M
84044 postgres postgres: autovacuum launch        0   572.0K   789.0K     3.2M
84043 postgres postgres: walwriter                0   216.0K   848.0K     3.3M
84164 postgres -bash                              0     1.7M     1.9M     3.3M
82193 postgres -bash                              0     1.8M     2.0M     3.5M
82377 postgres -bash                              0     1.8M     2.1M     3.5M
84038 postgres /usr/pgsql-14/bin/postgres         0     4.5M     5.4M    11.9M
85379 postgres python /usr/bin/smem -k            0     5.4M     6.1M     7.6M
85316 postgres postgres: postgres postgres        0     2.4M    12.3M    34.4M
85321 postgres postgres: postgres postgres        0     2.4M    12.3M    34.5M
85350 postgres postgres: postgres postgres        0     2.4M    12.3M    34.5M
85194 postgres pgbench -c 90 -j 90 -r -n -        0    27.0M    27.0M    29.0M

它的每个连接实际开销也在2.4M左右。这样90个连接加起来，已经超过我们前边设定的200M左右的上限了。于是OOM出现了。

3、总结

做这些验证的目的，不是真正的要破坏你的系统。只是想在限制资源的前提下去模拟重现OOM的现象。有时候因为你的硬件资源过于强大，你反而不那么容易重现。重现了问题之后，可以加深对一些系统参数的理解。

还有，上边限制vm内存的方法，是全局性的，也不是那么可取，纯粹是实验的目的。你也完全可以使用cgroup来进行特定进程的限制，从而不影响操作系统其它进程。

上边的脚本只是用的极简脚本来模拟，如果你使用benchmarkSQL中的大仓来跑，估计重现的更快，这里就不再试了。

当然，也别忘了最后恢复一下系统的vm控制参数：

vm.overcommit_memory = 0
vm.overcommit_ratio = 50
vm.swappiness = 30

后边有机会，可以试着做一做大规模子事务耗光XID的模拟实验。

postgresql 虚拟机

文章转载自数据库杂记，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。