用PostgreSQL找回618秒逝去的青春 - 递归收敛优化

digoal 2016-12-01

208

TAG 15

作者

digoal

日期

2016-12-01

背景

有一个这样的场景，一张小表A，里面存储了一些ID，大约几百个。

（比如说巡逻车辆ID，环卫车辆的ID，公交车，微公交的ID）。

另外有一张日志表B，每条记录中的ID是来自前面那张小表的，但不是每个ID都出现在这张日志表中，比如说一天可能只有几十个ID会出现在这个日志表的当天的数据中。

（比如车辆的行车轨迹数据，每秒上报轨迹，数据量就非常庞大）。

那么我怎么快速的找出今天没有出现的ID呢。

（哪些巡逻车辆没有出现在这个片区，是不是偷懒了，哪些环卫车辆没有出行，哪些公交或微公交没有出行）

``` select id from A where id not in (select id from B where time between and );

select (300 ids) not in (select ids from 300万)
```

这个QUERY会很慢，有什么优化方法呢。

当然，你还可以让车辆签到的方式来解决这个问题，但是总有未签到的，或者没有这种设计的时候，那么怎么解决呢

优化方法1

其实方法也很精妙，和我之前做的两个CASE很相似。

《时序数据合并场景加速分析和实现 - 复合索引，窗口分组查询加速，变态递归加速》

《distinct xx和count(distinct xx)的变态递归优化方法 - 索引收敛(skip scan)扫描》

在B表中，其实ID的值是很稀疏的，只是由于是流水，所以总量大。

优化的手段就是对B的取值区间，做递归的收敛查询，然后再做NOT IN就很快了。

例子

建表

``` create table a(id int primary key, info text);

create table b(id int primary key, aid int, crt_time timestamp); create index b_aid on b(aid); ```

插入测试数据

``` -- a表插入1000条 insert into a select generate_series(1,1000), md5(random()::text);

-- b表插入500万条，只包含aid的500个id。 insert into b select generate_series(1,5000000), generate_series(1,500), clock_timestamp(); ```

优化前的性能

``` \timing

explain (analyze,verbose,timing,costs,buffers) select * from a where id not in (select aid from b);

                                                        QUERY PLAN

Seq Scan on public.a (cost=0.00..67030021.50 rows=500 width=37) (actual time=2932.080..618776.881 rows=500 loops=1) Output: a.id, a.info Filter: (NOT (SubPlan 1)) Rows Removed by Filter: 500 Buffers: shared hit=27037, temp read=4264454 written=8545 SubPlan 1 -> Materialize (cost=0.00..121560.00 rows=5000000 width=4) (actual time=0.002..298.049 rows=2500125 loops=1000) Output: b.aid Buffers: shared hit=27028, temp read=4264454 written=8545 -> Seq Scan on public.b (cost=0.00..77028.00 rows=5000000 width=4) (actual time=0.009..888.427 rows=5000000 loops=1) Output: b.aid Buffers: shared hit=27028 Planning time: 0.969 ms Execution time: 618794.299 ms (14 rows) ```

另外你有一种选择是使用outer join, b表同样需要全扫一遍，有很大的改进，不过还可以更好，继续往后看。

``` postgres=# explain (analyze,verbose,timing,costs,buffers) select a.id from a left join b on (a.id=b.aid) where b.* is null; QUERY PLAN

Hash Right Join (cost=31.50..145809.50 rows=25000 width=4) (actual time=2376.777..2376.862 rows=500 loops=1) Output: a.id Hash Cond: (b.aid = a.id) Filter: (b. IS NULL) Rows Removed by Filter: 5000000 Buffers: shared hit=27037 -> Seq Scan on public.b (cost=0.00..77028.00 rows=5000000 width=44) (actual time=0.012..1087.997 rows=5000000 loops=1) Output: b.aid, b. Buffers: shared hit=27028 -> Hash (cost=19.00..19.00 rows=1000 width=4) (actual time=0.355..0.355 rows=1000 loops=1) Output: a.id Buckets: 1024 Batches: 1 Memory Usage: 44kB Buffers: shared hit=9 -> Seq Scan on public.a (cost=0.00..19.00 rows=1000 width=4) (actual time=0.010..0.183 rows=1000 loops=1) Output: a.id Buffers: shared hit=9 Planning time: 0.302 ms Execution time: 2376.934 ms (18 rows) ```

递归收敛优化后的性能

``` explain (analyze,verbose,timing,costs,buffers) select * from a where id not in ( with recursive skip as (
(
select min(aid) aid from b where aid is not null
)
union all
(
select (select min(aid) aid from b where b.aid > s.aid and b.aid is not null)
from skip s where s.aid is not null
) -- 这里的where s.aid is not null 一定要加,否则就死循环了.
)
select aid from skip where aid is not null );

                                                                             QUERY PLAN

Seq Scan on public.a (cost=54.98..76.48 rows=500 width=37) (actual time=10.837..10.957 rows=500 loops=1) Output: a.id, a.info Filter: (NOT (hashed SubPlan 5)) Rows Removed by Filter: 500 Buffers: shared hit=2012 SubPlan 5 -> CTE Scan on skip (cost=52.71..54.73 rows=100 width=4) (actual time=0.042..10.386 rows=500 loops=1) Output: skip.aid Filter: (skip.aid IS NOT NULL) Rows Removed by Filter: 1 Buffers: shared hit=2003 CTE skip -> Recursive Union (cost=0.46..52.71 rows=101 width=4) (actual time=0.037..10.104 rows=501 loops=1) Buffers: shared hit=2003 -> Result (cost=0.46..0.47 rows=1 width=4) (actual time=0.036..0.036 rows=1 loops=1) Output: $1 Buffers: shared hit=4 InitPlan 3 (returns $1) -> Limit (cost=0.43..0.46 rows=1 width=4) (actual time=0.031..0.032 rows=1 loops=1) Output: b_1.aid Buffers: shared hit=4 -> Index Only Scan using b_aid on public.b b_1 (cost=0.43..131903.43 rows=5000000 width=4) (actual time=0.030..0.030 rows=1 loops=1) Output: b_1.aid Index Cond: (b_1.aid IS NOT NULL) Heap Fetches: 1 Buffers: shared hit=4 -> WorkTable Scan on skip s (cost=0.00..5.02 rows=10 width=4) (actual time=0.019..0.019 rows=1 loops=501) Output: (SubPlan 2) Filter: (s.aid IS NOT NULL) Rows Removed by Filter: 0 Buffers: shared hit=1999 SubPlan 2 -> Result (cost=0.47..0.48 rows=1 width=4) (actual time=0.018..0.018 rows=1 loops=500) Output: $3 Buffers: shared hit=1999 InitPlan 1 (returns $3) -> Limit (cost=0.43..0.47 rows=1 width=4) (actual time=0.017..0.017 rows=1 loops=500) Output: b.aid Buffers: shared hit=1999 -> Index Only Scan using b_aid on public.b (cost=0.43..66153.48 rows=1666667 width=4) (actual time=0.017..0.017 rows=1 loops=500) Output: b.aid Index Cond: ((b.aid > s.aid) AND (b.aid IS NOT NULL)) Heap Fetches: 499 Buffers: shared hit=1999 Planning time: 0.323 ms Execution time: 11.082 ms (46 rows) ```

采用收敛查询优化后，耗时从最初的 618794毫秒降低到了 11毫秒，感觉一下子节约了好多青春。

优化方法2

此方法来自SQL性能挑战赛，书写更简洁：

https://yq.aliyun.com/roundtable/56354

采用sub query，A表数据量小，查询A表的QUERY中使用SUB QUERY使得SUB QUERY的扫描次数下降到与A行数一致，SUB QUERY中采用LIMIT 1限定返回数，is null限定得出B表中未出现的aid。妙!!!

如下

``` postgres=# explain analyze select * from ( select a.* ,
(select aid from b where b.aid=a.id limit 1) as aid -- sub query, limit 1控制了扫描次数 from a -- a表很小
) as t where t.aid is null;

QUERY PLAN Seq Scan on a (cost=0.00..4137.84 rows=5 width=41) (actual time=18.232..18.904 rows=100 loops=1) Filter: ((SubPlan 2) IS NULL) Rows Removed by Filter: 901 SubPlan 1 -> Limit (cost=0.43..4.09 rows=1 width=4) (actual time=0.003..0.003 rows=0 loops=100) -> Index Only Scan using b_aid on b (cost=0.43..40614.63 rows=11099 width=4) (actual time=0.002..0.002 rows=0 loops=100) Index Cond: (aid = a.id) Heap Fetches: 0 SubPlan 2 -> Limit (cost=0.43..4.09 rows=1 width=4) (actual time=0.017..0.017 rows=1 loops=1001) -> Index Only Scan using b_aid on b b_1 (cost=0.43..40614.63 rows=11099 width=4) (actual time=0.017..0.017 rows=1 loops=1001) Index Cond: (aid = a.id) Heap Fetches: 901 Planning time: 0.297 ms Execution time: 18.980 ms (15 rows) ```

或者

``` postgres=# explain analyze select * from a where (select aid from b where b.aid=a.id limit 1) is null; -- sub query is NULL, 是不是很给力呢

QUERY PLAN Seq Scan on a (cost=0.00..4117.37 rows=5 width=37) (actual time=18.346..18.699 rows=100 loops=1) Filter: ((SubPlan 1) IS NULL) Rows Removed by Filter: 901 SubPlan 1 -> Limit (cost=0.43..4.09 rows=1 width=4) (actual time=0.017..0.018 rows=1 loops=1001) -> Index Only Scan using b_aid on b (cost=0.43..40614.63 rows=11099 width=4) (actual time=0.017..0.017 rows=1 loops=1001) Index Cond: (aid = a.id) Heap Fetches: 901 Planning time: 0.181 ms Execution time: 18.755 ms (10 rows) ```

小结

1、递归查询，A表全扫，B表索引扫描了若干次（若干 = 唯一AID在B中出现的次数）。

2、SUB QUERY，A表全扫，B表索引扫描了若干次（若干 = A表记录数）。

由于B表都是索引扫，两种方法差别不大(递归扫描的次数更少一些)。

文章转载自digoal，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。

用PostgreSQL找回618秒逝去的青春 - 递归收敛优化

TAG 15

作者

日期

标签

背景

优化方法1

优化方法2

小结

PostgreSQL 许愿链接

9.9元购买3个月阿里云RDS PostgreSQL实例

PostgreSQL 解决方案集合

德哥 / digoal's github - 公益是一辈子的事.

评论