暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

PostgreSQL 多查询条件,多个索引的选择算法与问题诊断方法

digoal 2018-03-09
374

作者

digoal

日期

2018-03-09

标签

PostgreSQL , 多列条件 , 多索引 , 单列索引 , 复合索引 , 联合索引 , 优化器 , 评估 , 行评估 , 成本


背景

当一个SQL中涉及多个条件,并且多个条件有多种索引可选时,数据库优化器是如何选择使用哪个索引的?

例如

有一张表,有2个字段,单列一个索引,双列一个复合索引.

```
建表。
postgres=# create table tbl(id int, gid int);
CREATE TABLE

插入1000万记录,其中ID唯一,GID只有10个值。
postgres=# insert into tbl select generate_series(1,10000000), random()*9 ;
INSERT 0 10000000

创建两个索引。
postgres=# create index idx1 on tbl(id);
CREATE INDEX
postgres=# create index idx2 on tbl(gid,id);
CREATE INDEX
```

下面三条SQL,会如何选择使用哪个索引呢?

```
select * from tbl where id in (1,2,3,4,5,6,7,8,9,10) and gid=123;

select * from tbl where id in (1,2,3,4,5,6,7,8,9,10) and gid=1;

select * from tbl where id in (1,2,3,4,5,6,7,8,9,10);
```

问题思考

人为选择

这三条QUERY,实际上有三重含义:

1、gid=123的行根本不存在。

如果让你来选索引,你肯定会选复合索引,马上就能定位到数据不存在扫描最少的BLOCK。

select * from tbl where id in (1,2,3,4,5,6,7,8,9,10) and gid=123;

2、gid=1存在,同时id里面的条件也存在。

如果让你来选索引,应该也是选择复合索引,因为精确定位到了所有的行。

当然如果id in里面很多记录不存在,那么你可能就会选择id单列索引,因为这个索引本身更小,可能扫描更少的BLOCK。

select * from tbl where id in (1,2,3,4,5,6,7,8,9,10) and gid=1;

3、只有id的条件。

此时,肯定选单列索引了。

select * from tbl where id in (1,2,3,4,5,6,7,8,9,10);

实际情况如何呢?

1、数据库执行计划与预期一致

```
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where id in (1,2,3,4,5,6,7,8,9,10) and gid=123;
QUERY PLAN


Index Only Scan using idx2 on public.tbl (cost=0.43..2.46 rows=1 width=8) (actual time=0.037..0.037 rows=0 loops=1)
Output: id, gid
Index Cond: (tbl.gid = 123)
Filter: (tbl.id = ANY ('{1,2,3,4,5,6,7,8,9,10}'::integer[]))
Heap Fetches: 0
Buffers: shared hit=3
Planning time: 0.829 ms
Execution time: 0.086 ms
(8 rows)
```

2、与预期一致

```
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where id in (1,2,3,4,5,6,7,8,9,10) and gid=1;
QUERY PLAN


Index Only Scan using idx2 on public.tbl (cost=0.43..15.46 rows=1 width=8) (actual time=0.026..0.037 rows=2 loops=1)
Output: id, gid
Index Cond: ((tbl.gid = 1) AND (tbl.id = ANY ('{1,2,3,4,5,6,7,8,9,10}'::integer[])))
Heap Fetches: 2
Buffers: shared hit=31
Planning time: 0.121 ms
Execution time: 0.058 ms
(7 rows)
```

3、与预期一致

```
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where id in (1,2,3,4,5,6,7,8,9,10);
QUERY PLAN


Index Scan using idx1 on public.tbl (cost=0.43..15.52 rows=10 width=8) (actual time=0.021..0.035 rows=10 loops=1)
Output: id, gid
Index Cond: (tbl.id = ANY ('{1,2,3,4,5,6,7,8,9,10}'::integer[]))
Buffers: shared hit=31
Planning time: 0.104 ms
Execution time: 0.055 ms
(6 rows)
```

问题升华

数据库生成执行计划靠的是统计信息,如果统计信息不准确,那么执行计划必然不准确。

例如我们人为关闭TBL的自动统计信息收集,然后写入一批新的数据。

postgres=# alter table tbl set (autovacuum_enabled =off); ALTER TABLE postgres=# insert into tbl select generate_series(1,10000000), 100; INSERT 0 10000000

这个数据的特点是GID=100,在原有的统计信息中,gid=100的行是不存在的,所以下面的SQL优化器显然做出了错误的决定。

```
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where id in (1,2,3,4,5,6,7,8,9,10) and gid=100;
QUERY PLAN


Index Only Scan using idx2 on public.tbl (cost=0.44..2.46 rows=1 width=8) (actual time=0.030..2051.851 rows=10 loops=1)
Output: id, gid
Index Cond: (tbl.gid = 100)
Filter: (tbl.id = ANY ('{1,2,3,4,5,6,7,8,9,10}'::integer[]))
Rows Removed by Filter: 9999990
Heap Fetches: 10000000
Buffers: shared hit=71574
Planning time: 0.130 ms
Execution time: 2051.900 ms
(9 rows)
```

更新统计信息后,执行计划就准确了。

```
postgres=# analyze tbl;
ANALYZE
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where id in (1,2,3,4,5,6,7,8,9,10) and gid=100;
QUERY PLAN


Index Only Scan using idx2 on public.tbl (cost=0.44..20.57 rows=10 width=8) (actual time=0.027..0.043 rows=10 loops=1)
Output: id, gid
Index Cond: ((tbl.gid = 100) AND (tbl.id = ANY ('{1,2,3,4,5,6,7,8,9,10}'::integer[])))
Heap Fetches: 10
Buffers: shared hit=31
Planning time: 0.212 ms
Execution time: 0.067 ms
(7 rows)
```

如何自动收集统计信息

开启autovacuum , track_counts即可。

有几个微调参数,决定了什么时候扫描是否需要收集统计信息,以及当前表的变化量。

```
track_counts = on

------------------------------------------------------------------------------

AUTOVACUUM PARAMETERS

------------------------------------------------------------------------------

autovacuum = on # Enable autovacuum subprocess? 'on'
# requires track_counts to also be on.

log_autovacuum_min_duration = -1 # -1 disables, 0 logs all actions and

                                    # their durations, > 0 logs only  
                                    # actions running at least this number  
                                    # of milliseconds.

autovacuum_max_workers = 3 # max number of autovacuum subprocesses

                                    # (change requires restart)

autovacuum_naptime = 3s # time between autovacuum runs

autovacuum_vacuum_threshold = 50 # min number of row updates before

                                    # vacuum

autovacuum_analyze_threshold = 50 # min number of row updates before

                                    # analyze

autovacuum_vacuum_scale_factor = 0.2 # fraction of table size before vacuum

autovacuum_analyze_scale_factor = 0.1 # fraction of table size before analyze

autovacuum_freeze_max_age = 200000000 # maximum XID age before forced vacuum

                                    # (change requires restart)

autovacuum_multixact_freeze_max_age = 400000000 # maximum multixact age

                                    # before forced vacuum  
                                    # (change requires restart)

autovacuum_vacuum_cost_delay = 0ms # default vacuum cost delay for
# autovacuum, in milliseconds;
# -1 means use vacuum_cost_delay

autovacuum_vacuum_cost_limit = -1 # default vacuum cost limit for

                                    # autovacuum, -1 means use  
                                    # vacuum_cost_limit

```

PostgreSQL优化器是支持CBO与遗传算法

《数据库优化器原理 - 如何治疗选择综合症》

评估每个条件过滤多少行

《PostgreSQL pg_stats used to estimate top N freps values and explain rows》

统计信息解读

《PostgreSQL pg_stat_ pg_statio_ 统计信息(scan,read,fetch,hit)源码解读》

《PostgreSQL 统计信息pg_statistic格式及导入导出dump_stat - 兼容Oracle》

升华-多列统计信息

《PostgreSQL 10 黑科技 - 自定义统计信息》

其他因统计信息不准导致的性能问题

《Greenplum 统计信息收集参数 - 暨统计信息不准引入的broadcast motion一例》

其他参考文献

《PostgreSQL 10 黑科技 - 自定义统计信息》

《数据库优化器原理 - 如何治疗选择综合症》

《PostgreSQL bitmapAnd, bitmapOr, bitmap index scan, bitmap heap scan》

《PostgreSQL 9种索引的原理和应用场景》

《Greenplum 统计信息收集参数 - 暨统计信息不准引入的broadcast motion一例》

《PostgreSQL pg_stat_ pg_statio_ 统计信息(scan,read,fetch,hit)源码解读》

《PostgreSQL 统计信息pg_statistic格式及导入导出dump_stat - 兼容Oracle》

《PostgreSQL pg_stats used to estimate top N freps values and explain rows》

PostgreSQL 许愿链接

您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森.

9.9元购买3个月阿里云RDS PostgreSQL实例

PostgreSQL 解决方案集合

德哥 / digoal's github - 公益是一辈子的事.

digoal's wechat

文章转载自digoal,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论