Author
digoal
Date
2020-05-15
Tags
PostgreSQL , Recursion , Deduplication
Background
Introduction to the data structure (do not consider changing it; other business depends on it).
Simplified example:
```
create extension hll;
create table tab1 (
vid int8, -- duplicated values; each vid may have around 100 rows in the table
score numeric, -- a vid's multiple rows all carry exactly the same score; different vids may or may not share the same score
other_col int -- other attributes
);
```
Insert 100 million rows of test data:
```
insert into tab1 select generate_series(1,1000000),
generate_series(1,1000000)/10000000000000::numeric ,
random()*10000 from generate_series(1,100);
```
Insert a few rows that share a score but have different vids:
```
insert into tab1 select generate_series(1000001,1000009),
(1000000-9)/10000000000000::numeric,
random()*10000 from generate_series(1,100);
```
Create table tab2, which uses hll to store some unique values; it serves as the filter condition for the test queries in this article:
```
create table tab2 (uid int8, vids hll);
```
At query time, the vids contained in this hll are filtered out:
```
insert into tab2 values
(1,
hll_empty() || hll_hash_bigint(1000000) || hll_hash_bigint(1000000-1) || hll_hash_bigint(1000000-5) || hll_hash_bigint(1000000-8)
);
```
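How the hll filter behaves can be sanity-checked as follows (a minimal sketch; note that hll is probabilistic, so a vid that was never added can, in rare cases, leave the structure unchanged and be dropped incorrectly):

```
-- vid 1000000 was added above, so appending its hash changes nothing
-- and the expression returns false (the vid would be filtered out):
select (vids || hll_hash_bigint(1000000)) <> vids as keep from tab2 where uid = 1;

-- vid 999998 was never added, so appending its hash (almost always)
-- changes the hll and the expression returns true (the vid is kept):
select (vids || hll_hash_bigint(999998)) <> vids as keep from tab2 where uid = 1;
```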
Query requirements:
- Part 1:
  - order by score descending, and return only one row per vid
- Part 2:
  - additionally, filter tab1's vid using tab2.vids
Create an index:
```
create index idx_tab1_1 on tab1 (score desc,vid);
```
Part 1 (order by score descending, one row per vid) can be queried in either of these ways:
```
select distinct on (score,vid) vid,score from tab1 order by score desc,vid limit 20;
select vid,score from tab1 group by score,vid order by score desc,vid limit 20;
```
```
vid | score
---------+----------------------------
1000000 | 0.000000100000000000000000
999999 | 0.000000099999900000000000
999998 | 0.000000099999800000000000
999997 | 0.000000099999700000000000
999996 | 0.000000099999600000000000
999995 | 0.000000099999500000000000
999994 | 0.000000099999400000000000
999993 | 0.000000099999300000000000
999992 | 0.000000099999200000000000
999991 | 0.000000099999100000000000
1000001 | 0.000000099999100000000000
1000002 | 0.000000099999100000000000
1000003 | 0.000000099999100000000000
1000004 | 0.000000099999100000000000
1000005 | 0.000000099999100000000000
1000006 | 0.000000099999100000000000
1000007 | 0.000000099999100000000000
1000008 | 0.000000099999100000000000
1000009 | 0.000000099999100000000000
999990 | 0.000000099999000000000000
(20 rows)
```
```
postgres=> explain (analyze,verbose,timing,costs,buffers)
select distinct on (score,vid) vid,score
from tab1 order by score desc,vid limit 20;
QUERY PLAN
Limit (cost=0.57..70.50 rows=20 width=16) (actual time=0.017..1.041 rows=20 loops=1)
Output: vid, score
Buffers: shared hit=1076
-> Result (cost=0.57..34965244.98 rows=10000090 width=16) (actual time=0.016..1.035 rows=20 loops=1)
Output: vid, score
Buffers: shared hit=1076
-> Unique (cost=0.57..34965244.98 rows=10000090 width=16) (actual time=0.015..1.030 rows=20 loops=1)
Output: score, vid
Buffers: shared hit=1076
-> Index Only Scan using idx_tab1_1 on public.tab1 (cost=0.57..34465240.50 rows=100000896 width=16) (actual time=0.013..0.683 rows=1901 loops=1)
Output: score, vid
Heap Fetches: 1901
Buffers: shared hit=1076
Planning Time: 0.092 ms
Execution Time: 1.062 ms
(15 rows)
postgres=> explain (analyze,verbose,timing,costs,buffers)
select vid,score from tab1
group by score,vid order by score desc,vid limit 20;
QUERY PLAN
Limit (cost=0.57..70.50 rows=20 width=16) (actual time=0.016..1.037 rows=20 loops=1)
Output: vid, score
Buffers: shared hit=1076
-> Group (cost=0.57..34965244.98 rows=10000090 width=16) (actual time=0.016..1.032 rows=20 loops=1)
Output: vid, score
Group Key: tab1.score, tab1.vid
Buffers: shared hit=1076
-> Index Only Scan using idx_tab1_1 on public.tab1 (cost=0.57..34465240.50 rows=100000896 width=16) (actual time=0.014..0.692 rows=1901 loops=1)
Output: score, vid
Heap Fetches: 1901
Buffers: shared hit=1076
Planning Time: 0.091 ms
Execution Time: 1.055 ms
(13 rows)
```
Adding Part 2 of the requirements, the SQL looks like this:
```
explain (analyze,verbose,timing,costs,buffers)
with tmp as (select vids from tab2 where uid=1)
select t1.vid,t1.score from tab1 t1,tmp
where (tmp.vids || hll_hash_bigint(t1.vid) <> tmp.vids)
group by t1.score,t1.vid
order by t1.score desc,t1.vid limit 20;
```
```
vid | score
---------+----------------------------
999998 | 0.000000099999800000000000
999997 | 0.000000099999700000000000
999996 | 0.000000099999600000000000
999994 | 0.000000099999400000000000
999993 | 0.000000099999300000000000
999991 | 0.000000099999100000000000
1000001 | 0.000000099999100000000000
1000002 | 0.000000099999100000000000
1000003 | 0.000000099999100000000000
1000004 | 0.000000099999100000000000
1000005 | 0.000000099999100000000000
1000006 | 0.000000099999100000000000
1000007 | 0.000000099999100000000000
1000008 | 0.000000099999100000000000
1000009 | 0.000000099999100000000000
999990 | 0.000000099999000000000000
999989 | 0.000000099998900000000000
999988 | 0.000000099998800000000000
999987 | 0.000000099998700000000000
999986 | 0.000000099998600000000000
(20 rows)
```
```
Limit (cost=25.57..133.47 rows=20 width=16) (actual time=0.626..3.477 rows=20 loops=1)
Output: t1.vid, t1.score
Buffers: shared hit=1475 read=4 dirtied=1
I/O Timings: read=0.619
CTE tmp
-> Seq Scan on public.tab2 (cost=0.00..25.00 rows=6 width=32) (actual time=0.309..0.310 rows=1 loops=1)
Output: tab2.vids
Filter: (tab2.uid = 1)
Buffers: shared read=1 dirtied=1
I/O Timings: read=0.299
-> Group (cost=0.57..53950415.15 rows=10000090 width=16) (actual time=0.625..3.471 rows=20 loops=1)
Output: t1.vid, t1.score
Group Key: t1.score, t1.vid
Buffers: shared hit=1475 read=4 dirtied=1
I/O Timings: read=0.619
-> Nested Loop (cost=0.57..50965388.40 rows=597005349 width=16) (actual time=0.624..3.098 rows=1901 loops=1)
Output: t1.vid, t1.score
Join Filter: ((tmp.vids || hll_hash_bigint(t1.vid, 0)) <> tmp.vids)
Rows Removed by Join Filter: 400
Buffers: shared hit=1475 read=4 dirtied=1
I/O Timings: read=0.619
-> Index Only Scan using idx_tab1_1 on public.tab1 t1 (cost=0.57..34465240.50 rows=100000896 width=16) (actual time=0.012..1.190 rows=2301 loops=1)
Output: t1.score, t1.vid
Heap Fetches: 2301
Buffers: shared hit=1475 read=3
I/O Timings: read=0.320
-> CTE Scan on tmp (cost=0.00..0.12 rows=6 width=32) (actual time=0.000..0.000 rows=1 loops=2301)
Output: tmp.vids
Buffers: shared read=1 dirtied=1
I/O Timings: read=0.299
Planning Time: 0.164 ms
Execution Time: 3.541 ms
(32 rows)
```
Where do the performance problems lie?
1. Too many loops: loops=2301. This is wasted work, because every row inside each group is fed to the outer join filter, not just one row per group.
2. Too many blocks scanned: 1475. Because each vid has multiple rows, grouping touches all heap blocks of every group covered (bounded only by the LIMIT). The larger the LIMIT, the more waste.

Although the SQL looks fast, much of the work is wasted (especially when vids have many duplicate rows and the LIMIT is large). Can it be optimized further? Under high concurrency in particular, the gain would be substantial.
Method
At first glance recursion seems unworkable here, because one score can map to multiple vids, so the skip would have to advance on two conditions, score and vid, whereas a recursive boundary is easiest to reason about with a single, step-by-step skip condition (similar to a skip index scan).
If different vids were guaranteed to have different scores, the skip boundary would be score alone, with no vid involved.
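For reference, the single-condition skip pattern looks like this (a generic sketch over tab1; it returns one row per distinct score, which is exactly why a single condition is insufficient here — distinct vids sharing a score would be collapsed):

```
with recursive s as (
(
select score from tab1 order by score desc limit 1  -- seed: the largest score
)
union all
(
-- each step jumps straight to the next smaller score via the index
select (select score from tab1 t where t.score < s.score order by t.score desc limit 1)
from s where s.score is not null
)
)
select score from s where score is not null limit 20;
```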
Idea:
Fold vid into score as a weight, via an expression, in a way that does not disturb the original score ordering.
For example, if score is guaranteed to be greater than 1/1e21, vid can be weighted into the decimal places beyond 1/1e21. The data type must, of course, have enough precision to hold the weighted digits: this example uses numeric with no declared precision, which is accurate to e25 without any problem; otherwise at least 25 decimal places must be declared.
Convert vid to a number (if vid is not numeric, one of PG's built-in hash functions can convert it), then take a modulus, which bounds the vid collision space and keeps the number of extra decimal places needed for the weight small.
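For a non-numeric vid this could look as follows (a hypothetical sketch; 'vid-abc123' is an invented text key, and abs() guards against hashtext()'s negative int4 results):

```
select mod(abs(hashtext('vid-abc123')), 9999);
```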
As below, mod 9999 needs 4 decimal digits to store the weight, so divide by 1e25 (21 + 4). Create the following index:
```
create index idx_tab1_2 on tab1 ( (score + (mod(vid,9999)/1e25)) );
```
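A quick sanity check of the weighting expression (a sketch; the three vids below all share the same score in the test data, yet receive distinct weighted values, and the overall score ordering is preserved):

```
select vid, score, score + (mod(vid,9999)/1e25) as weighted
from tab1
where vid in (999991, 1000001, 1000002)
group by vid, score
order by weighted desc;
```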
```
explain (analyze,verbose,timing,costs,buffers)
with recursive tmp as ( -- recursion
(
select array[vid::numeric, score, (score + (mod(vid,9999)/1e25))] as r
from tab1
order by (score + (mod(vid,9999)/1e25)) desc limit 1 -- seed: the row with the largest weighted score
)
union all
(
select
(select array[vid::numeric, score, (score + (mod(vid,9999)/1e25))] as r from tab1 t1
where (t1.score + (mod(t1.vid,9999)/1e25)) < (tmp.r)[3] -- jump to the next largest weighted score; rows with the same score but a different vid are also picked up (provided the vid modulus differs)
and (t1.score + (mod(t1.vid,9999)/1e25)) is not null
order by (t1.score + (mod(t1.vid,9999)/1e25)) desc limit 1 -- take one row
)
from tmp where (tmp.r)[3] is not null
)
),
tmp2 as
(
select vids from tab2 where uid = 1
)
select (v.r)[1]::int8 as vid, (v.r)[2] as score from tmp v, tmp2 t
where (t.vids || hll_hash_bigint(((v.r)[1])::int8) <> t.vids)
limit 20
;
```
A minor flaw: when multiple vids share the same score and also the same value of vid mod 9999, only one of those vids is returned. A possible mitigation is sketched below.
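If such collisions matter, one option (an untested sketch, not part of the original article) is to enlarge the modulus, at the cost of more decimal places in the divisor:

```
-- Hypothetical variant: a 6-digit modulus shrinks the collision space;
-- it needs 6 decimal places, so the divisor grows to 1e27 (21 + 6).
-- The weighting expression in the recursive query must be changed to match.
create index idx_tab1_3 on tab1 ( (score + (mod(vid,999999)/1e27)) );
```

Result of the recursive query: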
```
vid | score
---------+----------------------------
999998 | 0.000000099999800000000000
999997 | 0.000000099999700000000000
999996 | 0.000000099999600000000000
999994 | 0.000000099999400000000000
999993 | 0.000000099999300000000000
1000009 | 0.000000099999100000000000
1000008 | 0.000000099999100000000000
1000007 | 0.000000099999100000000000
1000006 | 0.000000099999100000000000
1000005 | 0.000000099999100000000000
1000004 | 0.000000099999100000000000
1000003 | 0.000000099999100000000000
1000002 | 0.000000099999100000000000
1000001 | 0.000000099999100000000000
999991 | 0.000000099999100000000000
999990 | 0.000000099999000000000000
999989 | 0.000000099998900000000000
999988 | 0.000000099998800000000000
999987 | 0.000000099998700000000000
999986 | 0.000000099998600000000000
(20 rows)
```
```
Limit (cost=117.68..118.37 rows=20 width=40) (actual time=0.077..0.336 rows=20 loops=1)
Output: ((v.r[1])::bigint), (v.r[2])
Buffers: shared hit=121
CTE tmp
-> Recursive Union (cost=0.57..92.68 rows=101 width=32) (actual time=0.021..0.292 rows=24 loops=1)
Buffers: shared hit=120
-> Subquery Scan on "*SELECT* 1" (cost=0.57..0.88 rows=1 width=32) (actual time=0.021..0.022 rows=1 loops=1)
Output: "*SELECT* 1".r
Buffers: shared hit=5
-> Limit (cost=0.57..0.87 rows=1 width=64) (actual time=0.020..0.021 rows=1 loops=1)
Output: (ARRAY[(tab1.vid)::numeric, tab1.score, (tab1.score + ((mod(tab1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric))]), ((tab1.score + ((mod(tab1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric)))
Buffers: shared hit=5
-> Index Scan Backward using idx_tab1_2 on public.tab1 (cost=0.57..29864985.67 rows=100000896 width=64) (actual time=0.020..0.020 rows=1 loops=1)
Output: ARRAY[(tab1.vid)::numeric, tab1.score, (tab1.score + ((mod(tab1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric))], (tab1.score + ((mod(tab1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric))
Buffers: shared hit=5
-> WorkTable Scan on tmp (cost=0.00..8.98 rows=10 width=32) (actual time=0.011..0.011 rows=1 loops=23)
Output: (SubPlan 1)
Filter: (tmp.r[3] IS NOT NULL)
Buffers: shared hit=115
SubPlan 1
-> Limit (cost=0.57..0.88 rows=1 width=64) (actual time=0.010..0.010 rows=1 loops=23)
Output: (ARRAY[(t1.vid)::numeric, t1.score, (t1.score + ((mod(t1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric))]), ((t1.score + ((mod(t1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric)))
Buffers: shared hit=115
-> Index Scan Backward using idx_tab1_2 on public.tab1 t1 (cost=0.57..10296397.54 rows=33166964 width=64) (actual time=0.010..0.010 rows=1 loops=23)
Output: ARRAY[(t1.vid)::numeric, t1.score, (t1.score + ((mod(t1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric))], (t1.score + ((mod(t1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric))
Index Cond: (((t1.score + ((mod(t1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric)) < (tmp.r)[3]) AND ((t1.score + ((mod(t1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric)) IS NOT NULL))
Buffers: shared hit=115
CTE tmp2
-> Seq Scan on public.tab2 (cost=0.00..25.00 rows=6 width=32) (actual time=0.005..0.005 rows=1 loops=1)
Output: tab2.vids
Filter: (tab2.uid = 1)
Buffers: shared hit=1
-> Nested Loop (cost=0.00..20.82 rows=603 width=40) (actual time=0.076..0.331 rows=20 loops=1)
Output: (v.r[1])::bigint, v.r[2]
Join Filter: ((t.vids || hll_hash_bigint((v.r[1])::bigint, 0)) <> t.vids)
Rows Removed by Join Filter: 4
Buffers: shared hit=121
-> CTE Scan on tmp2 t (cost=0.00..0.12 rows=6 width=32) (actual time=0.007..0.007 rows=1 loops=1)
Output: t.vids
Buffers: shared hit=1
-> CTE Scan on tmp v (cost=0.00..2.02 rows=101 width=32) (actual time=0.022..0.300 rows=24 loops=1)
Output: v.r
Buffers: shared hit=120
Planning Time: 0.224 ms
Execution Time: 0.386 ms
(45 rows)
```
Optimization results:

Metric | Before | After | Improvement
---|---|---|---
Loops | 2301 | 23 | 100x
Blocks scanned | 1475 | 121 | 12x
Elapsed time | 3.541 ms | 0.386 ms | 9x
Summary
People with real technical aspirations like PG more the more they use it. It also saves companies a great deal of money, along with energy and emissions.
By contrast, many products today can only scale performance by piling on machines, and they market this heavily (it looks impressive) while actually wasting resources; countless bosses have been burned this way.
As a bonus, here is another very handy PG feature that can easily improve performance by a factor of ten thousand or more:
《PostgreSQL 随机采样应用 - table sample, tsm_system_rows, tsm_system_time》