Author
digoal
Date
2020-05-15
Tags
PostgreSQL , Recursion , Deduplication
Background
Introduction to the data structure (do not consider changing it; other business depends on it).
Simplified example:
```
create extension hll;
create table tab1 (
vid int8, -- duplicated values; each vid may have around 100 rows in the table
score numeric, -- a vid's multiple rows all carry exactly the same score; different vids may or may not share the same score
other_col int -- other attributes
);
```
Insert 100 million rows of test data:
```
insert into tab1 select generate_series(1,1000000),
generate_series(1,1000000)/10000000000000::numeric ,
random()*10000 from generate_series(1,100);
```
Insert a few rows that share a score but have different vids:
```
insert into tab1 select generate_series(1000001,1000009),
(1000000-9)/10000000000000::numeric,
random()*10000 from generate_series(1,100);
```
Create table tab2, which uses hll to store some unique values; it serves as the filter condition for the test queries in this article:
```
create table tab2 (uid int8, vids hll);
```
At query time, the vids contained in this hll are filtered out:
```
insert into tab2 values
(1,
hll_empty() || hll_hash_bigint(1000000) || hll_hash_bigint(1000000-1) || hll_hash_bigint(1000000-5) || hll_hash_bigint(1000000-8)
);
```
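How the hll filter behaves can be sanity-checked as follows (a minimal sketch; note that hll is probabilistic, so a vid that was never added can, in rare cases, leave the structure unchanged and be dropped incorrectly):

```
-- vid 1000000 was added above, so appending its hash changes nothing
-- and the expression returns false (the vid would be filtered out):
select (vids || hll_hash_bigint(1000000)) <> vids as keep from tab2 where uid = 1;

-- vid 999998 was never added, so appending its hash (almost always)
-- changes the hll and the expression returns true (the vid is kept):
select (vids || hll_hash_bigint(999998)) <> vids as keep from tab2 where uid = 1;
```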
Query requirements:
- Part 1:
  - order by score descending, and return only one row per vid
- Part 2:
  - additionally, filter tab1's vid using tab2.vids
Create an index:
```
create index idx_tab1_1 on tab1 (score desc,vid);
```
Part 1 (order by score descending, one row per vid) can be queried in either of these ways:
```
select distinct on (score,vid) vid,score from tab1 order by score desc,vid limit 20;
select vid,score from tab1 group by score,vid order by score desc,vid limit 20;
```
```
vid | score
---------+----------------------------
1000000 | 0.000000100000000000000000
999999 | 0.000000099999900000000000
999998 | 0.000000099999800000000000
999997 | 0.000000099999700000000000
999996 | 0.000000099999600000000000
999995 | 0.000000099999500000000000
999994 | 0.000000099999400000000000
999993 | 0.000000099999300000000000
999992 | 0.000000099999200000000000
999991 | 0.000000099999100000000000
1000001 | 0.000000099999100000000000
1000002 | 0.000000099999100000000000
1000003 | 0.000000099999100000000000
1000004 | 0.000000099999100000000000
1000005 | 0.000000099999100000000000
1000006 | 0.000000099999100000000000
1000007 | 0.000000099999100000000000
1000008 | 0.000000099999100000000000
1000009 | 0.000000099999100000000000
999990 | 0.000000099999000000000000
(20 rows)
```
```
postgres=> explain (analyze,verbose,timing,costs,buffers)
select distinct on (score,vid) vid,score
from tab1 order by score desc,vid limit 20;
QUERY PLAN
Limit (cost=0.57..70.50 rows=20 width=16) (actual time=0.017..1.041 rows=20 loops=1)
Output: vid, score
Buffers: shared hit=1076
-> Result (cost=0.57..34965244.98 rows=10000090 width=16) (actual time=0.016..1.035 rows=20 loops=1)
Output: vid, score
Buffers: shared hit=1076
-> Unique (cost=0.57..34965244.98 rows=10000090 width=16) (actual time=0.015..1.030 rows=20 loops=1)
Output: score, vid
Buffers: shared hit=1076
-> Index Only Scan using idx_tab1_1 on public.tab1 (cost=0.57..34465240.50 rows=100000896 width=16) (actual time=0.013..0.683 rows=1901 loops=1)
Output: score, vid
Heap Fetches: 1901
Buffers: shared hit=1076
Planning Time: 0.092 ms
Execution Time: 1.062 ms
(15 rows)
postgres=> explain (analyze,verbose,timing,costs,buffers)
select vid,score from tab1
group by score,vid order by score desc,vid limit 20;
QUERY PLAN
Limit (cost=0.57..70.50 rows=20 width=16) (actual time=0.016..1.037 rows=20 loops=1)
Output: vid, score
Buffers: shared hit=1076
-> Group (cost=0.57..34965244.98 rows=10000090 width=16) (actual time=0.016..1.032 rows=20 loops=1)
Output: vid, score
Group Key: tab1.score, tab1.vid
Buffers: shared hit=1076
-> Index Only Scan using idx_tab1_1 on public.tab1 (cost=0.57..34465240.50 rows=100000896 width=16) (actual time=0.014..0.692 rows=1901 loops=1)
Output: score, vid
Heap Fetches: 1901
Buffers: shared hit=1076
Planning Time: 0.091 ms
Execution Time: 1.055 ms
(13 rows)
```
Adding Part 2 of the requirements, the SQL looks like this:
```
explain (analyze,verbose,timing,costs,buffers)
with tmp as (select vids from tab2 where uid=1)
select t1.vid,t1.score from tab1 t1,tmp
where (tmp.vids || hll_hash_bigint(t1.vid) <> tmp.vids)
group by t1.score,t1.vid
order by t1.score desc,t1.vid limit 20;
```
```
vid | score
---------+----------------------------
999998 | 0.000000099999800000000000
999997 | 0.000000099999700000000000
999996 | 0.000000099999600000000000
999994 | 0.000000099999400000000000
999993 | 0.000000099999300000000000
999991 | 0.000000099999100000000000
1000001 | 0.000000099999100000000000
1000002 | 0.000000099999100000000000
1000003 | 0.000000099999100000000000
1000004 | 0.000000099999100000000000
1000005 | 0.000000099999100000000000
1000006 | 0.000000099999100000000000
1000007 | 0.000000099999100000000000
1000008 | 0.000000099999100000000000
1000009 | 0.000000099999100000000000
999990 | 0.000000099999000000000000
999989 | 0.000000099998900000000000
999988 | 0.000000099998800000000000
999987 | 0.000000099998700000000000
999986 | 0.000000099998600000000000
(20 rows)
```
```
Limit (cost=25.57..133.47 rows=20 width=16) (actual time=0.626..3.477 rows=20 loops=1)
Output: t1.vid, t1.score
Buffers: shared hit=1475 read=4 dirtied=1
I/O Timings: read=0.619
CTE tmp
-> Seq Scan on public.tab2 (cost=0.00..25.00 rows=6 width=32) (actual time=0.309..0.310 rows=1 loops=1)
Output: tab2.vids
Filter: (tab2.uid = 1)
Buffers: shared read=1 dirtied=1
I/O Timings: read=0.299
-> Group (cost=0.57..53950415.15 rows=10000090 width=16) (actual time=0.625..3.471 rows=20 loops=1)
Output: t1.vid, t1.score
Group Key: t1.score, t1.vid
Buffers: shared hit=1475 read=4 dirtied=1
I/O Timings: read=0.619
-> Nested Loop (cost=0.57..50965388.40 rows=597005349 width=16) (actual time=0.624..3.098 rows=1901 loops=1)
Output: t1.vid, t1.score
Join Filter: ((tmp.vids || hll_hash_bigint(t1.vid, 0)) <> tmp.vids)
Rows Removed by Join Filter: 400
Buffers: shared hit=1475 read=4 dirtied=1
I/O Timings: read=0.619
-> Index Only Scan using idx_tab1_1 on public.tab1 t1 (cost=0.57..34465240.50 rows=100000896 width=16) (actual time=0.012..1.190 rows=2301 loops=1)
Output: t1.score, t1.vid
Heap Fetches: 2301
Buffers: shared hit=1475 read=3
I/O Timings: read=0.320
-> CTE Scan on tmp (cost=0.00..0.12 rows=6 width=32) (actual time=0.000..0.000 rows=1 loops=2301)
Output: tmp.vids
Buffers: shared read=1 dirtied=1
I/O Timings: read=0.299
Planning Time: 0.164 ms
Execution Time: 3.541 ms
(32 rows)
```
Where do the performance problems lie?
1. Too many loops: loops=2301. This is wasted work, because every row inside each group is fed to the outer join filter, not just one row per group.
2. Too many blocks scanned: 1475. Because each vid has multiple rows, grouping touches all heap blocks of every group covered (bounded only by the LIMIT). The larger the LIMIT, the more waste.

Although the SQL looks fast, much of the work is wasted (especially when vids have many duplicate rows and the LIMIT is large). Can it be optimized further? Under high concurrency in particular, the gain would be substantial.
Method
At first glance recursion seems unworkable here, because one score can map to multiple vids, so the skip would have to advance on two conditions, score and vid, whereas a recursive boundary is easiest to reason about with a single, step-by-step skip condition (similar to a skip index scan).
If different vids were guaranteed to have different scores, the skip boundary would be score alone, with no vid involved.
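For reference, the single-condition skip pattern looks like this (a generic sketch over tab1; it returns one row per distinct score, which is exactly why a single condition is insufficient here — distinct vids sharing a score would be collapsed):

```
with recursive s as (
(
select score from tab1 order by score desc limit 1  -- seed: the largest score
)
union all
(
-- each step jumps straight to the next smaller score via the index
select (select score from tab1 t where t.score < s.score order by t.score desc limit 1)
from s where s.score is not null
)
)
select score from s where score is not null limit 20;
```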
Idea:
Fold vid into score as a weight, via an expression, in a way that does not disturb the original score ordering.
For example, if score is guaranteed to be greater than 1/1e21, vid can be weighted into the decimal places beyond 1/1e21. The data type must, of course, have enough precision to hold the weighted digits: this example uses numeric with no declared precision, which is accurate to e25 without any problem; otherwise at least 25 decimal places must be declared.
Convert vid to a number (if vid is not numeric, one of PG's built-in hash functions can convert it), then take a modulus, which bounds the vid collision space and keeps the number of extra decimal places needed for the weight small.
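For a non-numeric vid this could look as follows (a hypothetical sketch; 'vid-abc123' is an invented text key, and abs() guards against hashtext()'s negative int4 results):

```
select mod(abs(hashtext('vid-abc123')), 9999);
```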
As below, mod 9999 needs 4 decimal digits to store the weight, so divide by 1e25 (21 + 4). Create the following index:
```
create index idx_tab1_2 on tab1 ( (score + (mod(vid,9999)/1e25)) );
```
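A quick sanity check of the weighting expression (a sketch; the three vids below all share the same score in the test data, yet receive distinct weighted values, and the overall score ordering is preserved):

```
select vid, score, score + (mod(vid,9999)/1e25) as weighted
from tab1
where vid in (999991, 1000001, 1000002)
group by vid, score
order by weighted desc;
```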
```
explain (analyze,verbose,timing,costs,buffers)
with recursive tmp as ( -- recursion
(
select array[vid::numeric, score, (score + (mod(vid,9999)/1e25))] as r
from tab1
order by (score + (mod(vid,9999)/1e25)) desc limit 1 -- seed: the row with the largest weighted score
)
union all
(
select
(select array[vid::numeric, score, (score + (mod(vid,9999)/1e25))] as r from tab1 t1
where (t1.score + (mod(t1.vid,9999)/1e25)) < (tmp.r)[3] -- jump to the next largest weighted score; rows with the same score but a different vid are also picked up (provided the vid modulus differs)
and (t1.score + (mod(t1.vid,9999)/1e25)) is not null
order by (t1.score + (mod(t1.vid,9999)/1e25)) desc limit 1 -- take one row
)
from tmp where (tmp.r)[3] is not null
)
),
tmp2 as
(
select vids from tab2 where uid = 1
)
select (v.r)[1]::int8 as vid, (v.r)[2] as score from tmp v, tmp2 t
where (t.vids || hll_hash_bigint(((v.r)[1])::int8) <> t.vids)
limit 20
;
```
A minor flaw: when multiple vids share the same score and also the same value of vid mod 9999, only one of those vids is returned. A possible mitigation is sketched below.
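If such collisions matter, one option (an untested sketch, not part of the original article) is to enlarge the modulus, at the cost of more decimal places in the divisor:

```
-- Hypothetical variant: a 6-digit modulus shrinks the collision space;
-- it needs 6 decimal places, so the divisor grows to 1e27 (21 + 6).
-- The weighting expression in the recursive query must be changed to match.
create index idx_tab1_3 on tab1 ( (score + (mod(vid,999999)/1e27)) );
```

Result of the recursive query: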
```
vid | score
---------+----------------------------
999998 | 0.000000099999800000000000
999997 | 0.000000099999700000000000
999996 | 0.000000099999600000000000
999994 | 0.000000099999400000000000
999993 | 0.000000099999300000000000
1000009 | 0.000000099999100000000000
1000008 | 0.000000099999100000000000
1000007 | 0.000000099999100000000000
1000006 | 0.000000099999100000000000
1000005 | 0.000000099999100000000000
1000004 | 0.000000099999100000000000
1000003 | 0.000000099999100000000000
1000002 | 0.000000099999100000000000
1000001 | 0.000000099999100000000000
999991 | 0.000000099999100000000000
999990 | 0.000000099999000000000000
999989 | 0.000000099998900000000000
999988 | 0.000000099998800000000000
999987 | 0.000000099998700000000000
999986 | 0.000000099998600000000000
(20 rows)
```
```
Limit (cost=117.68..118.37 rows=20 width=40) (actual time=0.077..0.336 rows=20 loops=1)
Output: ((v.r[1])::bigint), (v.r[2])
Buffers: shared hit=121
CTE tmp
-> Recursive Union (cost=0.57..92.68 rows=101 width=32) (actual time=0.021..0.292 rows=24 loops=1)
Buffers: shared hit=120
-> Subquery Scan on "*SELECT* 1" (cost=0.57..0.88 rows=1 width=32) (actual time=0.021..0.022 rows=1 loops=1)
Output: "*SELECT* 1".r
Buffers: shared hit=5
-> Limit (cost=0.57..0.87 rows=1 width=64) (actual time=0.020..0.021 rows=1 loops=1)
Output: (ARRAY[(tab1.vid)::numeric, tab1.score, (tab1.score + ((mod(tab1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric))]), ((tab1.score + ((mod(tab1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric)))
Buffers: shared hit=5
-> Index Scan Backward using idx_tab1_2 on public.tab1 (cost=0.57..29864985.67 rows=100000896 width=64) (actual time=0.020..0.020 rows=1 loops=1)
Output: ARRAY[(tab1.vid)::numeric, tab1.score, (tab1.score + ((mod(tab1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric))], (tab1.score + ((mod(tab1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric))
Buffers: shared hit=5
-> WorkTable Scan on tmp (cost=0.00..8.98 rows=10 width=32) (actual time=0.011..0.011 rows=1 loops=23)
Output: (SubPlan 1)
Filter: (tmp.r[3] IS NOT NULL)
Buffers: shared hit=115
SubPlan 1
-> Limit (cost=0.57..0.88 rows=1 width=64) (actual time=0.010..0.010 rows=1 loops=23)
Output: (ARRAY[(t1.vid)::numeric, t1.score, (t1.score + ((mod(t1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric))]), ((t1.score + ((mod(t1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric)))
Buffers: shared hit=115
-> Index Scan Backward using idx_tab1_2 on public.tab1 t1 (cost=0.57..10296397.54 rows=33166964 width=64) (actual time=0.010..0.010 rows=1 loops=23)
Output: ARRAY[(t1.vid)::numeric, t1.score, (t1.score + ((mod(t1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric))], (t1.score + ((mod(t1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric))
Index Cond: (((t1.score + ((mod(t1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric)) < (tmp.r)[3]) AND ((t1.score + ((mod(t1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric)) IS NOT NULL))
Buffers: shared hit=115
CTE tmp2
-> Seq Scan on public.tab2 (cost=0.00..25.00 rows=6 width=32) (actual time=0.005..0.005 rows=1 loops=1)
Output: tab2.vids
Filter: (tab2.uid = 1)
Buffers: shared hit=1
-> Nested Loop (cost=0.00..20.82 rows=603 width=40) (actual time=0.076..0.331 rows=20 loops=1)
Output: (v.r[1])::bigint, v.r[2]
Join Filter: ((t.vids || hll_hash_bigint((v.r[1])::bigint, 0)) <> t.vids)
Rows Removed by Join Filter: 4
Buffers: shared hit=121
-> CTE Scan on tmp2 t (cost=0.00..0.12 rows=6 width=32) (actual time=0.007..0.007 rows=1 loops=1)
Output: t.vids
Buffers: shared hit=1
-> CTE Scan on tmp v (cost=0.00..2.02 rows=101 width=32) (actual time=0.022..0.300 rows=24 loops=1)
Output: v.r
Buffers: shared hit=120
Planning Time: 0.224 ms
Execution Time: 0.386 ms
(45 rows)
```
Optimization results:

Metric | Before | After | Improvement
---|---|---|---
Loops | 2301 | 23 | 100x
Blocks scanned | 1475 | 121 | 12x
Elapsed time | 3.541 ms | 0.386 ms | 9x
Summary
People with real technical aspirations like PG more the more they use it. It also saves companies a great deal of money, along with energy and emissions.
By contrast, many products today can only scale performance by piling on machines, and they market this heavily (it looks impressive) while actually wasting resources; countless bosses have been burned this way.
As a bonus, here is another very handy PG feature that can easily improve performance by a factor of ten thousand or more:
《PostgreSQL 随机采样应用 - table sample, tsm_system_rows, tsm_system_time》