暂无图片
暂无图片
4
暂无图片
暂无图片
暂无图片

PostgreSQL16预排序功能

原创 墨竹 2025-02-05
107

PostgreSQL16预排序功能

在PG15及更早的版本中,包含ORDER BY或DISTINCT子句的聚合函数,总是在优化器计划的聚合节点内执行排序,这就使的优化器不能尝试生成一个计划来提供预先排序的输入,以便按顺序聚合行。在PG16中可以支持优化器尝试生成一个计划,该计划以正确的顺序将数据行提供给计划的聚合节点。下面我们来对预排序的相关功能进行一下验证。

准备测试用例

创建测试表t1,然后插入少量的数据。

create table t1 as 
select round(random()*100) as id,a,
round(random()*500) as b,
round(random()*100) as c
FROM generate_series(1, 10000000) as t(a);
analyze t1;

PG15和PG16的执行计划对比

先来看一下PG15的执行计划

explain analyze
select
 sum(id order by c, a), 
sum(id order by a, b)
from t1
group by c;

                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=1814702.35..1914702.20 rows=101 width=24) (actual time=5701.518..15535.561 rows=101 loops=1)
   Group Key: c
   ->  Sort  (cost=1814702.35..1839702.06 rows=9999884 width=28) (actual time=5647.732..6488.243 rows=10000000 loops=1)
         Sort Key: c
         Sort Method: external merge  Disk: 411056kB
         ->  Seq Scan on t1  (cost=0.00..173528.84 rows=9999884 width=28) (actual time=0.028..924.559 rows=10000000 loops=1)
 Planning Time: 0.093 ms
 Execution Time: 15565.374 ms
(8 rows)

Time: 15565.932 ms (00:15.566)

然后再来查看PG16的执行计划

                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=1814735.51..1914737.32 rows=101 width=24) (actual time=7324.118..9284.522 rows=101 loops=1)
   Group Key: c
   ->  Sort  (cost=1814735.51..1839735.71 rows=10000080 width=28) (actual time=7315.336..8367.929 rows=10000000 loops=1)
         Sort Key: c, a, b
         Sort Method: external merge  Disk: 411000kB
         ->  Seq Scan on t1  (cost=0.00..173530.80 rows=10000080 width=28) (actual time=0.032..1013.274 rows=10000000 loops=1)
 Planning Time: 0.139 ms
 Execution Time: 9313.202 ms
(8 rows)

Time: 9314.367 ms (00:09.314)

从上面的执行结果可以知道,当测试表中有1000w的数据量时,用同一个sql来测试聚合函数,在PG15用时15565.932 ms,PG16用时9314.367 ms,从耗时上来说查询速度提升了40%左右,而且查询速度的提升是目标表的数据量有关系,在数据量很少的时候,其实查询速度的提升并不是很明显。这里测试结果只是在个人的虚拟机上测试,测试结果仅供参考。

新增索引,然后来看一下观察一下对于查询速度的提升情况

create index idx_t1_ca on t1(c,a);

先来看一下PG15的执行计划

explain analyze
select
 sum(id order by c, a), 
sum(id order by a, b)
from t1
group by c;

                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.43..673145.07 rows=101 width=24) (actual time=153.823..19287.683 rows=101 loops=1)
   Group Key: c
   ->  Index Scan using idx_t1_ca on t1  (cost=0.43..598144.06 rows=10000000 width=28) (actual time=0.049..12936.236 rows=10000000 loops=1)
 Planning Time: 0.301 ms
 Execution Time: 19287.854 ms
(5 rows)

Time: 19288.833 ms (00:19.289)

然后再来查看PG16的执行计划

                                                                    QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.50..1123140.18 rows=101 width=24) (actual time=89.467..16454.058 rows=101 loops=1)
   Group Key: c
   ->  Incremental Sort  (cost=0.50..1048139.17 rows=10000000 width=28) (actual time=0.133..15598.819 rows=10000000 loops=1)
         Sort Key: c, a, b
         Presorted Key: c, a
         Full-sort Groups: 312500  Sort Method: quicksort  Average Memory: 26kB  Peak Memory: 26kB
         ->  Index Scan using idx_t1_ca on t1  (cost=0.43..598139.17 rows=10000000 width=28) (actual time=0.071..14028.081 rows=10000000 loops=1)
 Planning Time: 0.100 ms
 Execution Time: 16454.178 ms
(9 rows)

Time: 16454.864 ms (00:16.455)

当添加索引后,会发现查询速度反而比之前未加索引的时候,查询速度上慢了不少,但是相比cost值,确降低了不少。这个有可能与数据分组分布不均匀有关系。

ORDER BY聚合是如何选择排序键

由于下面只是为了测试功能,因此为了快速获得查询结果,在这里我们就插入少量的数据。

drop table t1;
create table t1 as 
select round(random()*100) as id,a,
round(random()*500) as b,
round(random()*100) as c
FROM generate_series(1, 100) as t(a);
analyze t1;

ORDER BY聚合,语句无GROUP BY

--a的数量多
explain select  max(a order by a),count(a order by a),count(b order by b)  from t1;
                           QUERY PLAN
-----------------------------------------------------------------
 Aggregate  (cost=6.32..6.33 rows=1 width=20)
   ->  Sort  (cost=5.32..5.57 rows=100 width=12)
         Sort Key: a
         ->  Seq Scan on t1  (cost=0.00..2.00 rows=100 width=12)
(4 rows)
--b的数量多
explain select  max(a order by a),max(b order by b),count(b order by b)  from t1;

                           QUERY PLAN
-----------------------------------------------------------------
 Aggregate  (cost=6.32..6.33 rows=1 width=20)
   ->  Sort  (cost=5.32..5.57 rows=100 width=12)
         Sort Key: b
         ->  Seq Scan on t1  (cost=0.00..2.00 rows=100 width=12)
(4 rows)

从上面的验证可以得出,当聚合a列的数量大于b列的数量时,就选择a列作为预排序的排序键;同理,如果当聚合b列的数量大于a列的数量时,就选择b列作为预排序的排序键,这个时候与排序列的先后顺序没有关系。

--a和b的数量一致
explain select  max(a order by a),count(a order by a),min(b order by b), avg(b order by b)  from t1;
                           QUERY PLAN
-----------------------------------------------------------------
 Aggregate  (cost=6.57..6.58 rows=1 width=28)
   ->  Sort  (cost=5.32..5.57 rows=100 width=12)
         Sort Key: a
         ->  Seq Scan on t1  (cost=0.00..2.00 rows=100 width=12)
(4 rows)
--调换a和b列的先后顺序
explain select  min(b order by b),max(a order by a),count(a order by a), avg(b order by b)  from t1;
                           QUERY PLAN
-----------------------------------------------------------------
 Aggregate  (cost=6.57..6.58 rows=1 width=28)
   ->  Sort  (cost=5.32..5.57 rows=100 width=12)
         Sort Key: b
         ->  Seq Scan on t1  (cost=0.00..2.00 rows=100 width=12)
(4 rows)

当a和b列的数量一致时,可以看出这个时候就与ab列的先后顺序就有关系了,那个作为第一列,就是预排序的排序键。

ORDER BY聚合,语句含有GROUP BY

聚合排序列中含有GROUP BY列

在下面测试中,我们是以c列做分组列,聚合排序列为a、b和c3列。在这里我们先来调整聚合字段列的先后顺序,在这里主要是对sum(id order by c, a)进行了调整。

explain (costs off)
select
 sum(id order by c, a), sum(id order by b),
sum(id order by a, b)
from t1
group by c;

         QUERY PLAN
----------------------------
 GroupAggregate
   Group Key: c
   ->  Sort
         Sort Key: c, a, b
         ->  Seq Scan on t1
(5 rows)
explain (costs off)
select
 sum(id order by b),sum(id order by  a,b), 
sum(id order by c, a)
from t1
group by c;

         QUERY PLAN
----------------------------
 GroupAggregate
   Group Key: c
   ->  Sort
         Sort Key: c, a, b
         ->  Seq Scan on t1
(5 rows)
--没有a,b的排序顺序
explain (costs off)
select
 sum(id order by a),sum(id order by  b,a),
sum(id order by c, a)
from t1
group by c;
         QUERY PLAN
----------------------------
 GroupAggregate
   Group Key: c
   ->  Sort
         Sort Key: c, a
         ->  Seq Scan on t1
(5 rows)

explain (costs off)
select
 sum(id order by c,b),min(id order by  c,a),
sum(id order by c,a)
from t1
group by c;
         QUERY PLAN
----------------------------
 GroupAggregate
   Group Key: c
   ->  Sort
         Sort Key: c, a
         ->  Seq Scan on t1
(5 rows)

Time: 0.448 ms

explain (costs off)
select
 sum(id order by c,b),min(id order by  c,a),
sum(id order by c,a)
from t1
group by c;

在上面的测试中,可以看出分组c列始终作为排序键的第一列,第二列会优先选择分组列c开头的多级排序顺序,即在上面测试测用例中的c,a的排序顺序;如果聚合排序的存在a,b,那么最终就会组合为c,a,b的排序键;如果不存在a,b,则最终的排序键即为c,a。

聚合排序列中,没有GROUP BY列

在下面测试中,我们仍然是以c列做分组列,但是聚合排序列为a和b2列,不包含有c列。

explain (costs off)
select
 sum(id order by b),sum(id order by  a,b), min(id order by  a,b)
from t1
group by c;
         QUERY PLAN
----------------------------
 GroupAggregate
   Group Key: c
   ->  Sort
         Sort Key: c, a, b
         ->  Seq Scan on t1
(5 rows)

explain (costs off)
select
 sum(id order by b),sum(id order by  b,a), min(id order by  a,b),sum(id order by a)
from t1
group by c;
         QUERY PLAN
----------------------------
 GroupAggregate
   Group Key: c
   ->  Sort
         Sort Key: c, b, a
         ->  Seq Scan on t1
(5 rows)

从上面的测试结果可以看出,如果聚合排序列中不含有分组列时,排序键的规则是以ORDER BY聚合的排序选择规则为准,详细内容可以参考上面的结果。

总结

总之,通过上述的实际验证操作,我们大概清楚如下几个点:

1、理解了优化器是如何选择聚合函数的排序键。该排序键与相同排序数量的多少有关系,另外如果多种排序的数量一致,则以最前聚合函数的的顺序为准,即放在sql语句最左边的聚合函数的顺序为准。

2、如果SQL语句中含有group by语句,那么聚合函数的排序键中需要优先包含有分组列。在这里如果聚合函数的排序列中未含有分组列,则排序键中分组列之后的排列顺序以上述第一点的规则进行排序。如果聚合函数的排序列中含有分组列,这个时候排序键优先选择含有分组列的排序顺序,如果其他的排序列与之前的排序结果有交集,则优先使用该排列顺序。比如目前有分组列c,聚合函数的排序顺序有a,c,b 、b,a、a,b,那么这里会先选取c,b的顺序,然后根据当前的顺序与其他顺序看是否有交集,通过观察可以看到b,a符合我们的要求,因此最终排序键的结果是c,b,a。

3、PG16预排序对于SQL的性能提升还是比较明显的,在个人的电脑上测试提升大概40%左右。

参考

https://postgrespro.com/blog/pgsql/5969859#commit_1349d279

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=1349d2790

https://www.citusdata.com/blog/2024/02/08/whats-new-in-postgres-16-query-planner-optimizer/#faster-orderby-distinct

– / END / –

可以通过下面的方式联系我

  • 微信公众号:@墨竹札记
  • 墨天轮:@墨竹
  • 微信:wshf395062788
  • PGFans:@墨竹

如果这篇文章为你带来了灵感或启发,就请帮忙点赞收藏转发;如果文章中不严谨或者错漏之处,请及时评论指正。非常感谢!

最后修改时间:2025-02-06 10:37:30
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论