大家好,今天要分享一下关于打开WAL 日志 compress 压缩参数 wal_compression 的性能测试。
本篇文章是上一篇 《PG 故障分析之 select count(*) 产生大量WAL 日志》的延续:https://www.modb.pro/db/621695
我们打算通过开启WAL 日志压缩的方式来减少WAL的日志的生成量和减少跨DC之间网络传输的流量。
我们来看一下这个参数:
翻译成白话来说: 当开启这个参数的时候, PG SERVER 会把 full page writes 进行压缩(备份的时候也会), standby 从库 replay 这些压缩的 WAL 日志的时候,会自动解压。
压缩的算法支持 pglz和LZ4,具体的压缩方式取决于编译PG软件源码的时候,指的是 --with-lz4 还是 --with-zstd。 这个参数默认是OFF 关闭的。
只有拥有superuser 权限的账户才可以修改这个参数。
开启这个参数的优点是可以减小WAL的日志空间的大小(在不关闭full page writes 的情况下,为了保持数据的恢复的完整性),缺点是主库压缩日志和从库解压日志会带来
额外的CPU的消耗。
我们开启一下这个参数: 这个参数是不需要重启的
下面我们做一个对比的实验看一下, 开关 wal_compress 参数对 full page writes 和 系统性能的影响:
实验1:步骤
a)参数关闭 wal_compress = off ;
b)记录 pg_current_wal_lsn() , 标记LSN1
c)运行性能测试命令
d)记录 pg_current_wal_lsn(),标记LSN2
e)用 pg_waldump 分析在 标记LSN1和LSN2区间的日志分布情况
f)观测机器上CPU的使用趋势
a)参数关闭 wal_compress = off ;
postgres@[local:/tmp]:2026=#1282 show wal_compression;
wal_compression
-----------------
off
(1 row)
复制
b)记录 pg_current_wal_lsn() , 标记LSN1 = 3/BC017F68
postgres@[local:/tmp]:2026=#96759 select pg_current_wal_lsn();
pg_current_wal_lsn
--------------------
3/BC017F68
(1 row)
复制
c)运行性能测试命令, 跑15分钟
pgbench -M prepared -r -c 8 -j 4 -T 900 -U postgres -p 2026 -d pgbench -l
...
...
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 16
query mode: prepared
number of clients: 8
number of threads: 4
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 566459
number of failed transactions: 0 (0.000%)
latency average = 12.711 ms
initial connection time = 9.823 ms
tps = 629.393752 (without initial connection time)
statement latencies in milliseconds and failures:
0.155 0 \set aid random(1, 100000 * :scale)
0.129 0 \set bid random(1, 1 * :scale)
0.141 0 \set tid random(1, 10 * :scale)
0.127 0 \set delta random(-5000, 5000)
0.389 0 BEGIN;
0.803 0 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.658 0 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
0.883 0 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
1.987 0 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.716 0 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
6.208 0 END;
复制
d)记录 pg_current_wal_lsn(),标记LSN2 =4/2001C40
postgres@[local:/tmp]:2026=#115013 select pg_current_wal_lsn();
pg_current_wal_lsn
--------------------
4/2001C40
(1 row)
postgres@[local:/tmp]:2026=#115013 select pg_wal_lsn_diff('4/2001C40','3/BC017F68')/1024/1024 as "WAL size(MB)";
WAL size(MB)
-----------------------
1119.9132919311523438
(1 row)
复制
e)用 pg_waldump 分析在 标记LSN1和LSN2区间的日志分布情况
pg_waldump -p /data/postgreSQL/2026/data/pg_wal -z -s 3/BC017F68 -e 4/2001C40 -t 10
INFRA [postgres@wqdcsrv3352 pg_wal]# pg_waldump -p /data/postgreSQL/2026/data/pg_wal -z -s 3/BC017F68 -e 4/2001C40 -t 10
WAL statistics between 3/BC017F68 and 4/2001C40:
Type N (%) Record size (%) FPI size (%) Combined size (%)
---- - --- ----------- --- -------- --- ------------- ---
XLOG 38176 ( 1.05) 1870794 ( 0.79) 307375580 ( 33.76) 309246374 ( 26.98)
Transaction 566918 ( 15.62) 19279884 ( 8.17) 0 ( 0.00) 19279884 ( 1.68)
Storage 1 ( 0.00) 42 ( 0.00) 0 ( 0.00) 42 ( 0.00)
CLOG 17 ( 0.00) 510 ( 0.00) 0 ( 0.00) 510 ( 0.00)
Database 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
Tablespace 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
MultiXact 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
RelMap 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
Standby 111 ( 0.00) 8374 ( 0.00) 0 ( 0.00) 8374 ( 0.00)
Heap2 576109 ( 15.87) 36584534 ( 15.50) 543930604 ( 59.75) 580515138 ( 50.64)
Heap 2437845 ( 67.17) 177670987 ( 75.29) 16555068 ( 1.82) 194226055 ( 16.94)
Btree 9382 ( 0.26) 542973 ( 0.23) 42491888 ( 4.67) 43034861 ( 3.75)
Hash 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
Gin 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
Gist 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
Sequence 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
SPGist 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
BRIN 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
CommitTs 692 ( 0.02) 20760 ( 0.01) 0 ( 0.00) 20760 ( 0.00)
ReplicationOrigin 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
Generic 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
LogicalMessage 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
-------- -------- -------- --------
Total 3629251 235978858 [20.59%] 910353140 [79.41%] 1146331998 [100%]
复制
我们可以看到 FPI size 是 910353140 [79.41%]
实验2:步骤
a)参数关闭 wal_compress = on ;
b)记录 pg_current_wal_lsn() , 标记LSN1
c)运行性能测试命令
d)记录 pg_current_wal_lsn(),标记LSN2
e)用 pg_waldump 分析在 标记LSN1和LSN2区间的日志分布情况
f)观测机器上CPU的使用趋势
a)参数关闭 wal_compress = on ;
postgres@[local:/tmp]:2026=#126143 alter system set wal_compression = 'on';
ALTER SYSTEM
postgres@[local:/tmp]:2026=#126143 select pg_reload_conf();
pg_reload_conf
----------------
t
(1 row)
postgres@[local:/tmp]:2026=#126143 show wal_compression;
wal_compression
-----------------
pglz
(1 row)
复制
b)记录 pg_current_wal_lsn() , 标记LSN1
postgres@[local:/tmp]:2026=#66183 select pg_current_wal_lsn();
pg_current_wal_lsn
--------------------
3/A0036AE0
(1 row)
复制
c)运行性能测试命令, 跑15分钟
pgbench -M prepared -r -c 8 -j 4 -T 900 -U postgres -p 2026 -d pgbench -l
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 16
query mode: prepared
number of clients: 8
number of threads: 4
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 619202
number of failed transactions: 0 (0.000%)
latency average = 11.634 ms
initial connection time = 6.066 ms
tps = 687.653024 (without initial connection time)
statement latencies in milliseconds and failures:
0.140 0 \set aid random(1, 100000 * :scale)
0.133 0 \set bid random(1, 1 * :scale)
0.117 0 \set tid random(1, 10 * :scale)
0.123 0 \set delta random(-5000, 5000)
0.338 0 BEGIN;
0.805 0 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.585 0 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
0.790 0 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
1.801 0 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.602 0 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
5.747 0 END;
复制
d)记录 pg_current_wal_lsn(),标记LSN2
cappcore@[local:/tmp]:2026=#78160 select pg_current_wal_lsn();
pg_current_wal_lsn
--------------------
3/BAA0BDE0
(1 row)
cappcore@[local:/tmp]:2026=#78160 select pg_wal_lsn_diff('3/BAA0BDE0','3/A0036AE0')/1024/1024 as "WAL size(MB)";
WAL size(MB)
----------------------
425.8327636718750000
(1 row)
复制
e)用 pg_waldump 分析在 标记LSN1和LSN2区间的日志分布情况
pg_waldump -p /data/postgreSQL/2026/data/pg_wal -z -s 3/A0036AE0 -e 3/BAA0BDE0 -t 10
INFRA [postgres@wqdcsrv3352 pg_wal]# pg_waldump -p /data/postgreSQL/2026/data/pg_wal -z -s 3/A0036AE0 -e 3/BAA0BDE0 -t 10
WAL statistics between 3/A0036AE0 and 3/BAA0BDE0:
Type N (%) Record size (%) FPI size (%) Combined size (%)
---- - --- ----------- --- -------- --- ------------- ---
XLOG 9836 ( 0.25) 501223 ( 0.19) 19229477 ( 12.12) 19730700 ( 4.73)
Transaction 619693 ( 15.68) 21075693 ( 8.16) 0 ( 0.00) 21075693 ( 5.05)
Storage 2 ( 0.00) 84 ( 0.00) 0 ( 0.00) 84 ( 0.00)
CLOG 19 ( 0.00) 570 ( 0.00) 0 ( 0.00) 570 ( 0.00)
Database 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
Tablespace 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
MultiXact 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
RelMap 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
Standby 114 ( 0.00) 8612 ( 0.00) 0 ( 0.00) 8612 ( 0.00)
Heap2 629719 ( 15.93) 40019873 ( 15.49) 104085234 ( 65.58) 144105107 ( 34.55)
Heap 2673900 ( 67.65) 195600540 ( 75.71) 1808081 ( 1.14) 197408621 ( 47.33)
Btree 18783 ( 0.48) 1130845 ( 0.44) 33587996 ( 21.16) 34718841 ( 8.32)
Hash 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
Gin 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
Gist 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
Sequence 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
SPGist 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
BRIN 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
CommitTs 757 ( 0.02) 22710 ( 0.01) 0 ( 0.00) 22710 ( 0.01)
ReplicationOrigin 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
Generic 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
LogicalMessage 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
-------- -------- -------- --------
Total 3952823 258360150 [61.95%] 158710788 [38.05%] 417070938 [100%]
复制
我们可以明显的看到 FPI size 下降到了 158710788 [38.05%]
最后我们汇总比对一下:
参数 wal_compression | 打开 ON | 关闭 OFF |
---|---|---|
FPI size | 158710788 [38.05%] | 910353140 [79.41%] |
wal log size | WAL size(MB) ---------------------- 425.8327636718750000 |
WAL size(MB) --------------------- 1119.9132919311523438 |
空间节省:
1) 可以看到 FPI 的写入量从 868MB [79.41%] 下降到了 151MB [38.05%]
2)WAL log size 从 1119MB 下降到了 425MB
CPU 带来的额外消耗如下图:大致不到10%的上涨
Have a fun 🙂 !