PostgreSQL16预排序功能
在PG15及更早的版本中,包含ORDER BY或DISTINCT子句的聚合函数,总是在优化器计划的聚合节点内执行排序,这就使的优化器不能尝试生成一个计划来提供预先排序的输入,以便按顺序聚合行。在PG16中可以支持优化器尝试生成一个计划,该计划以正确的顺序将数据行提供给计划的聚合节点。下面我们来对预排序的相关功能进行一下验证。
准备测试用例
创建测试表t1,然后插入少量的数据。
create table t1 as
select round(random()*100) as id,a,
round(random()*500) as b,
round(random()*100) as c
FROM generate_series(1, 10000000) as t(a);
analyze t1;
PG15和PG16的执行计划对比
先来看一下PG15的执行计划
explain analyze
select
sum(id order by c, a),
sum(id order by a, b)
from t1
group by c;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=1814702.35..1914702.20 rows=101 width=24) (actual time=5701.518..15535.561 rows=101 loops=1)
Group Key: c
-> Sort (cost=1814702.35..1839702.06 rows=9999884 width=28) (actual time=5647.732..6488.243 rows=10000000 loops=1)
Sort Key: c
Sort Method: external merge Disk: 411056kB
-> Seq Scan on t1 (cost=0.00..173528.84 rows=9999884 width=28) (actual time=0.028..924.559 rows=10000000 loops=1)
Planning Time: 0.093 ms
Execution Time: 15565.374 ms
(8 rows)
Time: 15565.932 ms (00:15.566)
然后再来查看PG16的执行计划
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=1814735.51..1914737.32 rows=101 width=24) (actual time=7324.118..9284.522 rows=101 loops=1)
Group Key: c
-> Sort (cost=1814735.51..1839735.71 rows=10000080 width=28) (actual time=7315.336..8367.929 rows=10000000 loops=1)
Sort Key: c, a, b
Sort Method: external merge Disk: 411000kB
-> Seq Scan on t1 (cost=0.00..173530.80 rows=10000080 width=28) (actual time=0.032..1013.274 rows=10000000 loops=1)
Planning Time: 0.139 ms
Execution Time: 9313.202 ms
(8 rows)
Time: 9314.367 ms (00:09.314)
从上面的执行结果可以知道,当测试表中有1000w的数据量时,用同一个sql来测试聚合函数,在PG15用时15565.932 ms,PG16用时9314.367 ms,从耗时上来说查询速度提升了40%左右,而且查询速度的提升是目标表的数据量有关系,在数据量很少的时候,其实查询速度的提升并不是很明显。这里测试结果只是在个人的虚拟机上测试,测试结果仅供参考。
新增索引,然后来看一下观察一下对于查询速度的提升情况
create index idx_t1_ca on t1(c,a);
先来看一下PG15的执行计划
explain analyze
select
sum(id order by c, a),
sum(id order by a, b)
from t1
group by c;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=0.43..673145.07 rows=101 width=24) (actual time=153.823..19287.683 rows=101 loops=1)
Group Key: c
-> Index Scan using idx_t1_ca on t1 (cost=0.43..598144.06 rows=10000000 width=28) (actual time=0.049..12936.236 rows=10000000 loops=1)
Planning Time: 0.301 ms
Execution Time: 19287.854 ms
(5 rows)
Time: 19288.833 ms (00:19.289)
然后再来查看PG16的执行计划
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=0.50..1123140.18 rows=101 width=24) (actual time=89.467..16454.058 rows=101 loops=1)
Group Key: c
-> Incremental Sort (cost=0.50..1048139.17 rows=10000000 width=28) (actual time=0.133..15598.819 rows=10000000 loops=1)
Sort Key: c, a, b
Presorted Key: c, a
Full-sort Groups: 312500 Sort Method: quicksort Average Memory: 26kB Peak Memory: 26kB
-> Index Scan using idx_t1_ca on t1 (cost=0.43..598139.17 rows=10000000 width=28) (actual time=0.071..14028.081 rows=10000000 loops=1)
Planning Time: 0.100 ms
Execution Time: 16454.178 ms
(9 rows)
Time: 16454.864 ms (00:16.455)
当添加索引后,会发现查询速度反而比之前未加索引的时候,查询速度上慢了不少,但是相比cost值,确降低了不少。这个有可能与数据分组分布不均匀有关系。
ORDER BY聚合是如何选择排序键
由于下面只是为了测试功能,因此为了快速获得查询结果,在这里我们就插入少量的数据。
drop table t1;
create table t1 as
select round(random()*100) as id,a,
round(random()*500) as b,
round(random()*100) as c
FROM generate_series(1, 100) as t(a);
analyze t1;
ORDER BY聚合,语句无GROUP BY
--a的数量多
explain select max(a order by a),count(a order by a),count(b order by b) from t1;
QUERY PLAN
-----------------------------------------------------------------
Aggregate (cost=6.32..6.33 rows=1 width=20)
-> Sort (cost=5.32..5.57 rows=100 width=12)
Sort Key: a
-> Seq Scan on t1 (cost=0.00..2.00 rows=100 width=12)
(4 rows)
--b的数量多
explain select max(a order by a),max(b order by b),count(b order by b) from t1;
QUERY PLAN
-----------------------------------------------------------------
Aggregate (cost=6.32..6.33 rows=1 width=20)
-> Sort (cost=5.32..5.57 rows=100 width=12)
Sort Key: b
-> Seq Scan on t1 (cost=0.00..2.00 rows=100 width=12)
(4 rows)
从上面的验证可以得出,当聚合a列的数量大于b列的数量时,就选择a列作为预排序的排序键;同理,如果当聚合b列的数量大于a列的数量时,就选择b列作为预排序的排序键,这个时候与排序列的先后顺序没有关系。
--a和b的数量一致
explain select max(a order by a),count(a order by a),min(b order by b), avg(b order by b) from t1;
QUERY PLAN
-----------------------------------------------------------------
Aggregate (cost=6.57..6.58 rows=1 width=28)
-> Sort (cost=5.32..5.57 rows=100 width=12)
Sort Key: a
-> Seq Scan on t1 (cost=0.00..2.00 rows=100 width=12)
(4 rows)
--调换a和b列的先后顺序
explain select min(b order by b),max(a order by a),count(a order by a), avg(b order by b) from t1;
QUERY PLAN
-----------------------------------------------------------------
Aggregate (cost=6.57..6.58 rows=1 width=28)
-> Sort (cost=5.32..5.57 rows=100 width=12)
Sort Key: b
-> Seq Scan on t1 (cost=0.00..2.00 rows=100 width=12)
(4 rows)
当a和b列的数量一致时,可以看出这个时候就与ab列的先后顺序就有关系了,那个作为第一列,就是预排序的排序键。
ORDER BY聚合,语句含有GROUP BY
聚合排序列中含有GROUP BY列
在下面测试中,我们是以c列做分组列,聚合排序列为a、b和c3列。在这里我们先来调整聚合字段列的先后顺序,在这里主要是对sum(id order by c, a)进行了调整。
explain (costs off)
select
sum(id order by c, a), sum(id order by b),
sum(id order by a, b)
from t1
group by c;
QUERY PLAN
----------------------------
GroupAggregate
Group Key: c
-> Sort
Sort Key: c, a, b
-> Seq Scan on t1
(5 rows)
explain (costs off)
select
sum(id order by b),sum(id order by a,b),
sum(id order by c, a)
from t1
group by c;
QUERY PLAN
----------------------------
GroupAggregate
Group Key: c
-> Sort
Sort Key: c, a, b
-> Seq Scan on t1
(5 rows)
--没有a,b的排序顺序
explain (costs off)
select
sum(id order by a),sum(id order by b,a),
sum(id order by c, a)
from t1
group by c;
QUERY PLAN
----------------------------
GroupAggregate
Group Key: c
-> Sort
Sort Key: c, a
-> Seq Scan on t1
(5 rows)
explain (costs off)
select
sum(id order by c,b),min(id order by c,a),
sum(id order by c,a)
from t1
group by c;
QUERY PLAN
----------------------------
GroupAggregate
Group Key: c
-> Sort
Sort Key: c, a
-> Seq Scan on t1
(5 rows)
Time: 0.448 ms
explain (costs off)
select
sum(id order by c,b),min(id order by c,a),
sum(id order by c,a)
from t1
group by c;
在上面的测试中,可以看出分组c列始终作为排序键的第一列,第二列会优先选择分组列c开头的多级排序顺序,即在上面测试测用例中的c,a的排序顺序;如果聚合排序的存在a,b,那么最终就会组合为c,a,b的排序键;如果不存在a,b,则最终的排序键即为c,a。
聚合排序列中,没有GROUP BY列
在下面测试中,我们仍然是以c列做分组列,但是聚合排序列为a和b2列,不包含有c列。
explain (costs off)
select
sum(id order by b),sum(id order by a,b), min(id order by a,b)
from t1
group by c;
QUERY PLAN
----------------------------
GroupAggregate
Group Key: c
-> Sort
Sort Key: c, a, b
-> Seq Scan on t1
(5 rows)
explain (costs off)
select
sum(id order by b),sum(id order by b,a), min(id order by a,b),sum(id order by a)
from t1
group by c;
QUERY PLAN
----------------------------
GroupAggregate
Group Key: c
-> Sort
Sort Key: c, b, a
-> Seq Scan on t1
(5 rows)
从上面的测试结果可以看出,如果聚合排序列中不含有分组列时,排序键的规则是以ORDER BY聚合的排序选择规则为准,详细内容可以参考上面的结果。
总结
总之,通过上述的实际验证操作,我们大概清楚如下几个点:
1、理解了优化器是如何选择聚合函数的排序键。该排序键与相同排序数量的多少有关系,另外如果多种排序的数量一致,则以最前聚合函数的的顺序为准,即放在sql语句最左边的聚合函数的顺序为准。
2、如果SQL语句中含有group by语句,那么聚合函数的排序键中需要优先包含有分组列。在这里如果聚合函数的排序列中未含有分组列,则排序键中分组列之后的排列顺序以上述第一点的规则进行排序。如果聚合函数的排序列中含有分组列,这个时候排序键优先选择含有分组列的排序顺序,如果其他的排序列与之前的排序结果有交集,则优先使用该排列顺序。比如目前有分组列c,聚合函数的排序顺序有a,c,b 、b,a、a,b,那么这里会先选取c,b的顺序,然后根据当前的顺序与其他顺序看是否有交集,通过观察可以看到b,a符合我们的要求,因此最终排序键的结果是c,b,a。
3、PG16预排序对于SQL的性能提升还是比较明显的,在个人的电脑上测试提升大概40%左右。
参考
https://postgrespro.com/blog/pgsql/5969859#commit_1349d279
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=1349d2790
https://www.citusdata.com/blog/2024/02/08/whats-new-in-postgres-16-query-planner-optimizer/#faster-orderby-distinct
– / END / –
可以通过下面的方式联系我
如果这篇文章为你带来了灵感或启发,就请帮忙点赞、收藏、转发;如果文章中不严谨或者错漏之处,请及时评论指正。非常感谢!