PostgreSQL 索引虚拟列 - 表达式索引 - JOIN提速

digoal 2017-11-21

207

作者

digoal

日期

2017-11-21

背景

CASE：使用虚拟索引，响应时间从2.3秒下降到0.3毫秒

业务系统在设计时，为了减少数据冗余，提升可读性，通常需要将不同的数据放到不同的表。

在查询时，通过多表JOIN来补齐需要查询或在过滤的内容。

比如这样的例子：

有两张表，分别有1千万和100万数据，当用户查询时，需要补齐那100万表中的某个字段进行过滤。

```
create table a (id int, bid int, c1 int, c2 int, c3 int);

CREATE TABLE b (id int primary key, path text);

insert into a select id, random()1000000 , random()10000000, random()10000000 , random()10000000 from generate_series(1,10000000) t(id);

insert into b select id, md5(random()::text) from generate_series(1,1000000) t(id);

create index idx_b_1 on b(path text_pattern_ops);

-- 查询
select a.* from a left join b on (a.bid=b.id and b.path like 'abc%');
```

那么它的性能如何呢？

```
postgres=# explain select a.* from a left join b on (a.bid=b.id) where b.path like 'abcde%';
QUERY PLAN

Hash Join (cost=9.70..289954.61 rows=1000 width=20)
Hash Cond: (a.bid = b.id)
-> Seq Scan on a (cost=0.00..163695.00 rows=10000000 width=20)
-> Hash (cost=8.45..8.45 rows=100 width=4)
-> Index Scan using idx_b_1 on b (cost=0.42..8.45 rows=100 width=4)
Index Cond: ((path ~>=~ 'abcde'::text) AND (path ~<~ 'abcdf'::text))
Filter: (path ~~ 'abcde%'::text)
(7 rows)

Time: 0.777 ms
postgres=# select a.* from a left join b on (a.bid=b.id) where b.path like 'abcde%';
id | bid | c1 | c2 | c3
---------+--------+---------+---------+---------
2423577 | 633740 | 846719 | 1720744 | 416608
2433286 | 633740 | 9797626 | 6737349 | 5669893
3851817 | 633740 | 8764393 | 3779499 | 2830950
4889541 | 633740 | 3892055 | 9470525 | 611262
5004634 | 633740 | 5420943 | 2448245 | 5719976
5372019 | 633740 | 5402891 | 3441462 | 8194368
6051251 | 633740 | 8691218 | 7184625 | 5940346
6344344 | 633740 | 5869018 | 9352883 | 636112
9751456 | 633740 | 3797867 | 1934900 | 2511398
(9 rows)

Time: 2348.506 ms (00:02.349)
```

条件越多，性能会越差。

这样的查询，并发一高，性能影响会比较大。

当b表是静态的时(没有DML)，可以用虚拟列索引来实现优化。

表达式索引 - 虚拟列索引

假设B表不会发生DML，是一个静态表。

1、创建一个获取path的函数

create or replace function get_path(int) returns text as $$ select path from b where id=$1; $$ language sql strict immutable;

这个函数用于从B表获取path，假设B表静态（不会有增删改），那么这个函数就是immutable的，无论什么时候输入一个ID，返回的都是同一个path。

2、在a表直接创建表达式索引（虚拟列索引）

create index idx_a_1 on a(get_path(bid) text_pattern_ops);

3、修改SQL语句如下

select * from a where get_path(bid) like 'abc%';

4、响应时间从2.3秒下降到0.3毫秒。

```
postgres=# select * from a where get_path(bid) like 'abcde%';
id | bid | c1 | c2 | c3
---------+--------+---------+---------+---------
2423577 | 633740 | 846719 | 1720744 | 416608
2433286 | 633740 | 9797626 | 6737349 | 5669893
3851817 | 633740 | 8764393 | 3779499 | 2830950
4889541 | 633740 | 3892055 | 9470525 | 611262
5004634 | 633740 | 5420943 | 2448245 | 5719976
5372019 | 633740 | 5402891 | 3441462 | 8194368
6051251 | 633740 | 8691218 | 7184625 | 5940346
6344344 | 633740 | 5869018 | 9352883 | 636112
9751456 | 633740 | 3797867 | 1934900 | 2511398
(9 rows)

postgres=# select * from b where path like 'abcde%';
id | path
--------+----------------------------------
633740 | abcde980c8568a9a6a140885d92fcebe
(1 row)

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from a where get_path(bid) like 'abcde%';
QUERY PLAN

Index Scan using idx_a_1 on public.a (cost=0.56..158697.56 rows=50000 width=20) (actual time=0.151..0.276 rows=9 loops=1)
Output: id, bid, c1, c2, c3
Index Cond: ((get_path(a.bid) ~>=~ 'abcde'::text) AND (get_path(a.bid) ~<~ 'abcdf'::text))
Filter: (get_path(a.bid) ~~ 'abcde%'::text)
Buffers: shared hit=50
Planning time: 0.092 ms
Execution time: 0.300 ms
(7 rows)
```

即使没有虚拟索引，也可以PostgreSQL的使用并行计算

32个并行，耗时454毫秒，依旧不如使用虚拟列索引的效果。

```
postgres=# set parallel_tuple_cost =0;
SET
postgres=# set parallel_setup_cost =0;
SET
postgres=# set max_parallel_workers_per_gather =32;
SET
postgres=# alter table a set (parallel_workers =32);
ALTER TABLE

postgres=# explain select a.* from a left join b on (a.bid=b.id) where b.path like 'abcde%';
QUERY PLAN

Gather (cost=20835.25..91600.56 rows=1000 width=20)
Workers Planned: 32
-> Hash Join (cost=20835.25..91600.56 rows=31 width=20)
Hash Cond: (a.bid = b.id)
-> Parallel Seq Scan on a (cost=0.00..66820.00 rows=312500 width=20)
-> Hash (cost=20834.00..20834.00 rows=100 width=4)
-> Seq Scan on b (cost=0.00..20834.00 rows=100 width=4)
Filter: (path ~~ 'abcde%'::text)
(8 rows)

Time: 0.685 ms

postgres=# select a.* from a left join b on (a.bid=b.id) where b.path like 'abcde%';
id | bid | c1 | c2 | c3
---------+--------+---------+---------+---------
5004634 | 633740 | 5420943 | 2448245 | 5719976
9751456 | 633740 | 3797867 | 1934900 | 2511398
3851817 | 633740 | 8764393 | 3779499 | 2830950
4889541 | 633740 | 3892055 | 9470525 | 611262
6344344 | 633740 | 5869018 | 9352883 | 636112
2433286 | 633740 | 9797626 | 6737349 | 5669893
2423577 | 633740 | 846719 | 1720744 | 416608
6051251 | 633740 | 8691218 | 7184625 | 5940346
5372019 | 633740 | 5402891 | 3441462 | 8194368
(9 rows)

Time: 454.405 ms
```

文章转载自digoal，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。

PostgreSQL 索引虚拟列 - 表达式索引 - JOIN提速

作者

日期

标签

背景

CASE：使用虚拟索引，响应时间从2.3秒下降到0.3毫秒

表达式索引 - 虚拟列索引

即使没有虚拟索引，也可以PostgreSQL的使用并行计算

PostgreSQL 许愿链接

9.9元购买3个月阿里云RDS PostgreSQL实例

PostgreSQL 解决方案集合

德哥 / digoal's github - 公益是一辈子的事.

评论

相关阅读

PostgreSQL 索引虚拟列 - 表达式索引 - JOIN提速

作者

日期

标签

背景

CASE： 使用虚拟索引，响应时间从2.3秒下降到0.3毫秒

表达式索引 - 虚拟列索引

即使没有虚拟索引，也可以PostgreSQL的使用并行计算

PostgreSQL 许愿链接

9.9元购买3个月阿里云RDS PostgreSQL实例

PostgreSQL 解决方案集合

德哥 / digoal's github - 公益是一辈子的事.

评论

相关阅读

CASE：使用虚拟索引，响应时间从2.3秒下降到0.3毫秒