暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

理解数据库扫描方法 - 利用扫描方法对数据存储进行优化

digoal 2018-06-08
342

作者

digoal

日期

2018-06-08

标签

PostgreSQL , 扫描方法 , 数据存储


背景

假设一个黑盒中有三种水果:苹果,香蕉、菠萝。一共有若干个水果。

假设你需要拿10个苹果,你需要拿多少次呢?

最差的情况,你可能需要把所有的水果都拿完。(全表扫描,扫到最后才拿到10个或者不足10个)

最好的情况,你可能10次就拿完。(全表扫描,扫10行全都是苹果。)

PS:索引扫描这里就不说了,因为要说的就是根据扫描方法来进行的优化。

全表扫描最好的情况优化

```
create table tbl (gid int, info text, crt_time timestamp);

insert into tbl select random()*10000 , 'test', now() from generate_series(1,10000000);

select * from tbl where gid=1 limit 10;

explain (analyze,verbose,timing,costs,buffers) select * from tbl where gid=1 limit 10;
QUERY PLAN


Limit (cost=0.00..1917.62 rows=10 width=17) (actual time=0.050..11.165 rows=10 loops=1)
Output: gid, info, crt_time
Buffers: shared hit=3 read=667 dirtied=354 written=340
-> Seq Scan on public.tbl (cost=0.00..188693.39 rows=984 width=17) (actual time=0.048..11.160 rows=10 loops=1)
Output: gid, info, crt_time
Filter: (tbl.gid = 1)
Rows Removed by Filter: 105132
Buffers: shared hit=3 read=667 dirtied=354 written=340
Planning time: 0.078 ms
Execution time: 11.184 ms
(10 rows)
```

存储优化

```
postgres=# begin;
BEGIN
postgres=# create temp table tmp_tbl1 as select * from tbl where gid<>1 or gid is null;
SELECT 9998987
postgres=# delete from tbl where gid<>1;
DELETE 9998987
postgres=# end;
COMMIT
postgres=# vacuum full tbl;
VACUUM
postgres=# insert into tbl select * from tmp_tbl1 ;
INSERT 0 9998987
postgres=#
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where gid=1 limit 10;
QUERY PLAN


Limit (cost=0.00..1972.60 rows=10 width=17) (actual time=0.018..0.022 rows=10 loops=1)
Output: gid, info, crt_time
Buffers: shared read=1
-> Seq Scan on public.tbl (cost=0.00..178914.70 rows=907 width=17) (actual time=0.017..0.019 rows=10 loops=1)
Output: gid, info, crt_time
Filter: (tbl.gid = 1)
Buffers: shared read=1
Planning time: 0.129 ms
Execution time: 0.041 ms
(9 rows)
```

场景升华 - 多表JOIN LIMIT优化

JOIN + LIMIT的场景:

通常有LIMIT的场景使用NESTLOOP JOIN性能可以比较好。

1、从外表开始扫

2、内表循环N次

存储优化方法

1、外表,一开始扫描到的就是内表符合条件的数据

2、根据这种思路重新整理数据

3、查看能耗

例子

```
create table a(id int, c1 int, c2 int, c3 int);

create table b(id int, c1 int, c2 int, c3 int);

insert into a select generate_series(1,10000000),1,1,1;

insert into b select random()100, random()100, random()100, random()100 from generate_series(1,10000000);

create index idx_a_1 on a(id,c1,c2,c3);

create index idx_b_1 on b(c1,c2);

vacuum analyze a;
vacuum analyze b;

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from a join b on (a.id=b.id and a.c1=1 and a.c2=1 and a.c3=1 and b.c1=1 and b.c2=1) limit 1000;
QUERY PLAN


Limit (cost=0.87..2669.74 rows=1000 width=32) (actual time=0.081..8.266 rows=991 loops=1)
Output: a.id, a.c1, a.c2, a.c3, b.id, b.c1, b.c2, b.c3
Buffers: shared hit=3984
-> Nested Loop (cost=0.87..2723.11 rows=1020 width=32) (actual time=0.080..7.996 rows=991 loops=1)
Output: a.id, a.c1, a.c2, a.c3, b.id, b.c1, b.c2, b.c3
Buffers: shared hit=3984
-> Index Scan using idx_b_1 on public.b (cost=0.43..1136.01 rows=1020 width=16) (actual time=0.053..2.569 rows=996 loops=1)
Output: b.id, b.c1, b.c2, b.c3
Index Cond: ((b.c1 = 1) AND (b.c2 = 1))
Buffers: shared hit=995
-> Index Only Scan using idx_a_1 on public.a (cost=0.43..1.55 rows=1 width=16) (actual time=0.004..0.004 rows=1 loops=996)
Output: a.id, a.c1, a.c2, a.c3
Index Cond: ((a.id = b.id) AND (a.c1 = 1) AND (a.c2 = 1) AND (a.c3 = 1))
Heap Fetches: 0
Buffers: shared hit=2989
Planning time: 0.603 ms
Execution time: 8.509 ms
(17 rows)
```

存储优化

第一种可能,如果一次LOOP就可以返回1000条,那么可以这样优化

都使用SEQ SCAN

但是把复合条件的数据提到前面。

1、找到内表能满足1000条以上的ID,数据提前。

2、找到与内表ID对应的数据,数据提前。

```
postgres=# select b.id,count() from a join b on (a.id=b.id and a.c1=1 and a.c2=1 and a.c3=1 and b.c1=1 and b.c2=1) group by 1 order by count() desc limit 10;
id | count
----+-------
26 | 18
68 | 18
52 | 16
94 | 16
35 | 16
80 | 15
77 | 15
96 | 15
73 | 15
74 | 15
(10 rows)

postgres=# create table b1 as select * from b where id in (select b.id from a join b on (a.id=b.id and a.c1=1 and a.c2=1 and a.c3=1 and b.c1=1 and b.c2=1) group by 1 order by count() desc limit 1000) and b.c1=1 and b.c2=1;
SELECT 991
postgres=# insert into b1 select * from b where not (id in (select b.id from a join b on (a.id=b.id and a.c1=1 and a.c2=1 and a.c3=1 and b.c1=1 and b.c2=1) group by 1 order by count(
) desc limit 1000) and b.c1=1 and b.c2=1)
postgres-# ;
INSERT 0 9999009
postgres=# alter table b rename to b2;
ALTER TABLE
postgres=# alter table b1 rename to b;
ALTER TABLE
```

外表只需要扫描6个数据块。

(但是注意这个方法,如果总共数据不满足1000条,那么会导致外表全扫)

```

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from a join b on (a.id=b.id and a.c1=1 and a.c2=1 and a.c3=1 and b.c1=1 and b.c2=1) limit 991;
QUERY PLAN


Limit (cost=0.43..205423.04 rows=876 width=32) (actual time=0.071..7.845 rows=991 loops=1)
Output: a.id, a.c1, a.c2, a.c3, b.id, b.c1, b.c2, b.c3
Buffers: shared hit=2980
-> Nested Loop (cost=0.43..205423.04 rows=876 width=32) (actual time=0.069..7.577 rows=991 loops=1)
Output: a.id, a.c1, a.c2, a.c3, b.id, b.c1, b.c2, b.c3
Buffers: shared hit=2980
-> Seq Scan on public.b (cost=0.00..204057.62 rows=876 width=16) (actual time=0.019..0.384 rows=991 loops=1)
Output: b.id, b.c1, b.c2, b.c3
Filter: ((b.c1 = 1) AND (b.c2 = 1))
Buffers: shared hit=6
-> Index Only Scan using idx_a_1 on public.a (cost=0.43..1.55 rows=1 width=16) (actual time=0.006..0.006 rows=1 loops=991)
Output: a.c1, a.c2, a.c3, a.id
Index Cond: ((a.c1 = 1) AND (a.c2 = 1) AND (a.c3 = 1) AND (a.id = b.id))
Heap Fetches: 0
Buffers: shared hit=2974
Planning time: 0.513 ms
Execution time: 8.079 ms
(17 rows)
```

参考

《PostgreSQL OUTER JOIN 优化的几个知识点 - 语义转换、内存带宽、JOIN算法、FILTER亲和力、TSP、HINT、命中率、存储顺序、扫描顺序、索引深度》

PostgreSQL 许愿链接

您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森.

9.9元购买3个月阿里云RDS PostgreSQL实例

PostgreSQL 解决方案集合

德哥 / digoal's github - 公益是一辈子的事.

digoal's wechat

文章转载自digoal,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论