PostgreSQL16预排序功能

原创墨竹 2025-02-05

107

PostgreSQL16预排序功能

在PG15及更早的版本中，包含ORDER BY或DISTINCT子句的聚合函数，总是在优化器计划的聚合节点内执行排序，这就使的优化器不能尝试生成一个计划来提供预先排序的输入，以便按顺序聚合行。在PG16中可以支持优化器尝试生成一个计划，该计划以正确的顺序将数据行提供给计划的聚合节点。下面我们来对预排序的相关功能进行一下验证。

准备测试用例

创建测试表t1，然后插入少量的数据。

create table t1 as 
select round(random()*100) as id,a,
round(random()*500) as b,
round(random()*100) as c
FROM generate_series(1, 10000000) as t(a);
analyze t1;

PG15和PG16的执行计划对比

先来看一下PG15的执行计划

explain analyze
select
 sum(id order by c, a), 
sum(id order by a, b)
from t1
group by c;

                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=1814702.35..1914702.20 rows=101 width=24) (actual time=5701.518..15535.561 rows=101 loops=1)
   Group Key: c
   ->  Sort  (cost=1814702.35..1839702.06 rows=9999884 width=28) (actual time=5647.732..6488.243 rows=10000000 loops=1)
         Sort Key: c
         Sort Method: external merge  Disk: 411056kB
         ->  Seq Scan on t1  (cost=0.00..173528.84 rows=9999884 width=28) (actual time=0.028..924.559 rows=10000000 loops=1)
 Planning Time: 0.093 ms
 Execution Time: 15565.374 ms
(8 rows)

Time: 15565.932 ms (00:15.566)

然后再来查看PG16的执行计划

                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=1814735.51..1914737.32 rows=101 width=24) (actual time=7324.118..9284.522 rows=101 loops=1)
   Group Key: c
   ->  Sort  (cost=1814735.51..1839735.71 rows=10000080 width=28) (actual time=7315.336..8367.929 rows=10000000 loops=1)
         Sort Key: c, a, b
         Sort Method: external merge  Disk: 411000kB
         ->  Seq Scan on t1  (cost=0.00..173530.80 rows=10000080 width=28) (actual time=0.032..1013.274 rows=10000000 loops=1)
 Planning Time: 0.139 ms
 Execution Time: 9313.202 ms
(8 rows)

Time: 9314.367 ms (00:09.314)

从上面的执行结果可以知道，当测试表中有1000w的数据量时，用同一个sql来测试聚合函数，在PG15用时15565.932 ms，PG16用时9314.367 ms，从耗时上来说查询速度提升了40%左右，而且查询速度的提升是目标表的数据量有关系，在数据量很少的时候，其实查询速度的提升并不是很明显。这里测试结果只是在个人的虚拟机上测试，测试结果仅供参考。

新增索引，然后来看一下观察一下对于查询速度的提升情况

create index idx_t1_ca on t1(c,a);

先来看一下PG15的执行计划

explain analyze
select
 sum(id order by c, a), 
sum(id order by a, b)
from t1
group by c;

                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.43..673145.07 rows=101 width=24) (actual time=153.823..19287.683 rows=101 loops=1)
   Group Key: c
   ->  Index Scan using idx_t1_ca on t1  (cost=0.43..598144.06 rows=10000000 width=28) (actual time=0.049..12936.236 rows=10000000 loops=1)
 Planning Time: 0.301 ms
 Execution Time: 19287.854 ms
(5 rows)

Time: 19288.833 ms (00:19.289)

然后再来查看PG16的执行计划

                                                                    QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.50..1123140.18 rows=101 width=24) (actual time=89.467..16454.058 rows=101 loops=1)
   Group Key: c
   ->  Incremental Sort  (cost=0.50..1048139.17 rows=10000000 width=28) (actual time=0.133..15598.819 rows=10000000 loops=1)
         Sort Key: c, a, b
         Presorted Key: c, a
         Full-sort Groups: 312500  Sort Method: quicksort  Average Memory: 26kB  Peak Memory: 26kB
         ->  Index Scan using idx_t1_ca on t1  (cost=0.43..598139.17 rows=10000000 width=28) (actual time=0.071..14028.081 rows=10000000 loops=1)
 Planning Time: 0.100 ms
 Execution Time: 16454.178 ms
(9 rows)

Time: 16454.864 ms (00:16.455)

当添加索引后，会发现查询速度反而比之前未加索引的时候，查询速度上慢了不少，但是相比cost值，确降低了不少。这个有可能与数据分组分布不均匀有关系。

ORDER BY聚合是如何选择排序键

由于下面只是为了测试功能，因此为了快速获得查询结果，在这里我们就插入少量的数据。

drop table t1;
create table t1 as 
select round(random()*100) as id,a,
round(random()*500) as b,
round(random()*100) as c
FROM generate_series(1, 100) as t(a);
analyze t1;

ORDER BY聚合，语句无GROUP BY

--a的数量多
explain select  max(a order by a),count(a order by a),count(b order by b)  from t1;
                           QUERY PLAN
-----------------------------------------------------------------
 Aggregate  (cost=6.32..6.33 rows=1 width=20)
   ->  Sort  (cost=5.32..5.57 rows=100 width=12)
         Sort Key: a
         ->  Seq Scan on t1  (cost=0.00..2.00 rows=100 width=12)
(4 rows)
--b的数量多
explain select  max(a order by a),max(b order by b),count(b order by b)  from t1;

                           QUERY PLAN
-----------------------------------------------------------------
 Aggregate  (cost=6.32..6.33 rows=1 width=20)
   ->  Sort  (cost=5.32..5.57 rows=100 width=12)
         Sort Key: b
         ->  Seq Scan on t1  (cost=0.00..2.00 rows=100 width=12)
(4 rows)

从上面的验证可以得出，当聚合a列的数量大于b列的数量时，就选择a列作为预排序的排序键；同理，如果当聚合b列的数量大于a列的数量时,就选择b列作为预排序的排序键，这个时候与排序列的先后顺序没有关系。

--a和b的数量一致
explain select  max(a order by a),count(a order by a),min(b order by b), avg(b order by b)  from t1;
                           QUERY PLAN
-----------------------------------------------------------------
 Aggregate  (cost=6.57..6.58 rows=1 width=28)
   ->  Sort  (cost=5.32..5.57 rows=100 width=12)
         Sort Key: a
         ->  Seq Scan on t1  (cost=0.00..2.00 rows=100 width=12)
(4 rows)
--调换a和b列的先后顺序
explain select  min(b order by b),max(a order by a),count(a order by a), avg(b order by b)  from t1;
                           QUERY PLAN
-----------------------------------------------------------------
 Aggregate  (cost=6.57..6.58 rows=1 width=28)
   ->  Sort  (cost=5.32..5.57 rows=100 width=12)
         Sort Key: b
         ->  Seq Scan on t1  (cost=0.00..2.00 rows=100 width=12)
(4 rows)

当a和b列的数量一致时，可以看出这个时候就与ab列的先后顺序就有关系了，那个作为第一列，就是预排序的排序键。

ORDER BY聚合，语句含有GROUP BY

聚合排序列中含有GROUP BY列

在下面测试中，我们是以c列做分组列，聚合排序列为a、b和c3列。在这里我们先来调整聚合字段列的先后顺序，在这里主要是对sum(id order by c, a)进行了调整。

explain (costs off)
select
 sum(id order by c, a), sum(id order by b),
sum(id order by a, b)
from t1
group by c;

         QUERY PLAN
----------------------------
 GroupAggregate
   Group Key: c
   ->  Sort
         Sort Key: c, a, b
         ->  Seq Scan on t1
(5 rows)
explain (costs off)
select
 sum(id order by b),sum(id order by  a,b), 
sum(id order by c, a)
from t1
group by c;

         QUERY PLAN
----------------------------
 GroupAggregate
   Group Key: c
   ->  Sort
         Sort Key: c, a, b
         ->  Seq Scan on t1
(5 rows)
--没有a,b的排序顺序
explain (costs off)
select
 sum(id order by a),sum(id order by  b,a),
sum(id order by c, a)
from t1
group by c;
         QUERY PLAN
----------------------------
 GroupAggregate
   Group Key: c
   ->  Sort
         Sort Key: c, a
         ->  Seq Scan on t1
(5 rows)

explain (costs off)
select
 sum(id order by c,b),min(id order by  c,a),
sum(id order by c,a)
from t1
group by c;
         QUERY PLAN
----------------------------
 GroupAggregate
   Group Key: c
   ->  Sort
         Sort Key: c, a
         ->  Seq Scan on t1
(5 rows)

Time: 0.448 ms

explain (costs off)
select
 sum(id order by c,b),min(id order by  c,a),
sum(id order by c,a)
from t1
group by c;

在上面的测试中，可以看出分组c列始终作为排序键的第一列，第二列会优先选择分组列c开头的多级排序顺序，即在上面测试测用例中的c,a的排序顺序；如果聚合排序的存在a,b，那么最终就会组合为c,a,b的排序键；如果不存在a,b，则最终的排序键即为c,a。

聚合排序列中，没有GROUP BY列

在下面测试中，我们仍然是以c列做分组列，但是聚合排序列为a和b2列，不包含有c列。

explain (costs off)
select
 sum(id order by b),sum(id order by  a,b), min(id order by  a,b)
from t1
group by c;
         QUERY PLAN
----------------------------
 GroupAggregate
   Group Key: c
   ->  Sort
         Sort Key: c, a, b
         ->  Seq Scan on t1
(5 rows)

explain (costs off)
select
 sum(id order by b),sum(id order by  b,a), min(id order by  a,b),sum(id order by a)
from t1
group by c;
         QUERY PLAN
----------------------------
 GroupAggregate
   Group Key: c
   ->  Sort
         Sort Key: c, b, a
         ->  Seq Scan on t1
(5 rows)

从上面的测试结果可以看出，如果聚合排序列中不含有分组列时，排序键的规则是以ORDER BY聚合的排序选择规则为准，详细内容可以参考上面的结果。

总结

总之，通过上述的实际验证操作，我们大概清楚如下几个点：

1、理解了优化器是如何选择聚合函数的排序键。该排序键与相同排序数量的多少有关系，另外如果多种排序的数量一致，则以最前聚合函数的的顺序为准，即放在sql语句最左边的聚合函数的顺序为准。

2、如果SQL语句中含有group by语句，那么聚合函数的排序键中需要优先包含有分组列。在这里如果聚合函数的排序列中未含有分组列，则排序键中分组列之后的排列顺序以上述第一点的规则进行排序。如果聚合函数的排序列中含有分组列，这个时候排序键优先选择含有分组列的排序顺序，如果其他的排序列与之前的排序结果有交集，则优先使用该排列顺序。比如目前有分组列c，聚合函数的排序顺序有a，c,b 、b,a、a,b，那么这里会先选取c,b的顺序，然后根据当前的顺序与其他顺序看是否有交集，通过观察可以看到b,a符合我们的要求，因此最终排序键的结果是c,b,a。

3、PG16预排序对于SQL的性能提升还是比较明显的，在个人的电脑上测试提升大概40%左右。

参考

https://postgrespro.com/blog/pgsql/5969859#commit_1349d279

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=1349d2790

https://www.citusdata.com/blog/2024/02/08/whats-new-in-postgres-16-query-planner-optimizer/#faster-orderby-distinct

– / END / –

可以通过下面的方式联系我

微信公众号：@墨竹札记
墨天轮：@墨竹
微信：wshf395062788
PGFans：@墨竹

如果这篇文章为你带来了灵感或启发，就请帮忙点赞、收藏、转发；如果文章中不严谨或者错漏之处，请及时评论指正。非常感谢！

最后修改时间：2025-02-06 10:37:30

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者