PostgreSQL14新特性：索引方面的增强

中启乘数 2021-10-20

1477

1. 前言

当前PostgreSQL 14beta2已经发布，中启乘数科技专注于PostgreSQL数据库的各项新技术，在这里把PostgreSQL 14在索引方面的七点改进分享给大家。

2. 频繁更新时Btree数减少了膨胀

Btree索引是我们最常用的索引。PostgreSQL 14对Btree索引有很多方便的性能增强：

让nbtree和heapam更好的配合，以便更积极的移除因MVCC产生的重复行
避免因为多版本的重复行导致的索引块分裂
自下向上的索引项删除（bottom-up index deletion）：在唯一索引中删除旧版本时可以比较索引块的分裂（Deleting older versions in unique indexes to avoid page splits）
会把逻辑层面索引向没有改变的标志（”logically unchanged index” hint）下推到索引中，避免没有必要的索引的变更，具体可见：https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9dc718bd ，the “extra tuples” simple deletion enhancementincreases the number of index tuples deleted with almost any workloadthat has LP_DEAD bits set in leaf pages.
具体信息见：https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d168b666823b6e0bcf60ed19ce24fb5fb91b8ccf

总之通过上面的这些优化，在一些频繁更新的场景下，Btree减少了膨胀，提升了性能。

3. BRIN 索引支持多区间和布隆过滤器

原先的BRIN索引特别适合边界清晰的堆存储数据, 例如BLOCK 1到8 存储的id范围是100-10000, 9到16 存储的id范围是100001到200000, 检索id=1000时, 只需要扫描1到8号数据块。然而经常可能会在固定的值范围内插入了一个非常大的值，导致BRIN索引失去过滤性。为了解决这个问题, PostgreSQL 14 支持多区间的BRIN，即multi range brin, 1到8号块存储的ID范围可能是1-199, 10000-10019, 20000-20000, 占用5个value(1,199,10000,10019,20000), 一个blocks区间存储多少个value取决于values_per_range参数(8到256).
当不断插入数据时, 这些范围还可以被合并。

见例子：

CREATE TABLE t (a int);
CREATE INDEX ON t USING brin (a int4_minmax_multi_ops(values_per_range=16));

上面的索引子句中int4_minmax_multi_ops(values_per_range=16))
就是指定建立的一个多区间的BRIN索引。

更详细的信息可见：

https://www.postgresql.org/message-id/c1138ead-7668-f0e1-0638-c3be3237e812%402ndquadrant.com
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ab596105b55f1d7fbd5a66b66f65227d210b047d

在PostgreSQL 14实现了基于布隆过滤器的BRIN索引，每个连续heap blocks, 存储一个占位bits, 被索引字段的hash value经过bloom hash填充占位bit，创建的方法如下：：

CREATE TABLE t (a int); 
CREATE INDEX ON t USING brin (a int4_bloom_ops(false_positive_rate =0.05, n_distinct_per_range =100));

比较不同类型的BRIN索引的大小：

create index test_brin_idx on bloom_test using brin (id);
create index test_bloom_idx on bloom_test using brin (id uuid_bloom_ops);
create index test_btree_idx on bloom_test (id);

大小统计如下：

Schema	Name	Type	Owner	Table	Size
public	test_bloom_idx	index	tomas	bloom_test	12 MB
public	test_brin_idx	index	tomas	bloom_test	832 kB
public	test_btree_idx	index	tomas	bloom_test	6016 MB

4. 更快更小的GiST 索引

GiST 索引现在可以在其构建过程中对数据进行预排序，从而可以更快地创建索引并缩小索引。

目前GiST对point类型实现了预排序。这个功能是通过为point_ops类增加了一个sortsupport的函数来实现的。
见下面的例子：

建测试表：

create table x asselect point (random(),random())from generate_series(1,3000000,1);

PostgreSQL 13.3中

postgres=# create index ON x using gist (point );
CREATE INDEX
Time:49804.780 ms (00:49.805)

大小为：223264768

PostreSQL 14中：

postgres=# create index ON x using gist (point );
CREATE INDEX
Time:2551.954 ms (00:02.552)

索引的大小为：148955136

可以看到大小从223M减少到148M，创建时间更是直接从49秒减少到2.5秒，性能提升非常明显。

更具体的信息见：

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=16fa9b2b30a357b4aea982bd878ec2e5e002dbcc
https://www.postgresql.org/message-id/1A36620E-CAD8-4267-9067-FB31385E7C0D%40yandex-team.ru

5. SP-GiST 索引也支持覆盖索引（index only scan）

创建测试表和索引：

create table students(p point, addr text, student text);
insert into students select point (random(),random()), seq, seq from generate_series(1,1000000,1)as seq;
create index on students using spgist (p) include(addr,student);
analyze table students;

看执行计划：

postgres=# explain analyze select p,addr,student from students where p >^'(0.99999,0.99999)'::point;
                                                                     QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------
---------------------
IndexOnlyScanusing students_p_addr_student_idx on students  (cost=0.29..4986.28 rows=100000 width=28)(actual time=0.212..0
.742 rows=5 loops=1)
IndexCond:(p >^'(0.99999,0.99999)'::point)
HeapFetches:0
PlanningTime:0.056 ms
ExecutionTime:0.759 ms
(5 rows)

Time:1.077 ms

上面的p >^ '(0.99999,0.99999)'::point
是查找在点0.99999,0.99999
上面的点。可以看到走到了Index Only Scan
。

更详细的信息可见：https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=09c1c6ab4bc5764dd69c53ccfd43b2060b1fd090

6. 索引的VACUUM方面上的增强

PostgreSQL 14可以防止长时间创建索引导致VACUUM不能回收垃圾：当create index concurrently时, 只要不是表达式索引, partial index, 不是rc或ssi隔离级别, 那么这个操作的snapshot xmin就不会用做计算oldestxmin，从而它运行多长时间都不会导致vacuum无法回收某些垃圾而导致膨胀，具体可以建：

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d9d076222f5b94a85e0e318339cfc44b8f26022d

为了避免每次vacuum都要清理index, PostgreSQL 14进行了优化, 当vacuum一个table时, 如果低于2%的PAGE有dead LP(例如一个表占用了100个page, 如果只有2个page里面有dead LP), 那么VACUUM将跳过索引，并保留这些索引项。当table中的垃圾行（dead lp）积累到超过2% page时, 才需要对索引执行垃圾回收。
因为LP 只占用4字节, 所以不清理也影响不大, 但是大幅降低了因对索引的vacuum导致的vacuum负担。
目前阈值2%是在代码中写死的, 未来也许会支持索引级别配置, 或者支持GUC配置，更详细的信息见：

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5100010ee4d5c8ef46619dbd1d17090c627e6d0a

7. 可以通过reindex 命令把索引移动到另一个表空间中

REINDEX command
增加 tablespace 选项, 支持通过重建索引的方法把索引移动到另一个表空间中。

REINDEX的语法如下：

postgres=# \h reindex
Command:     REINDEX
Description: rebuild indexes
Syntax:
REINDEX [( option [,...])]{ INDEX | TABLE | SCHEMA | DATABASE | SYSTEM }[ CONCURRENTLY ] name

where option can be one of:

    CONCURRENTLY [boolean]
    TABLESPACE new_tablespace
    VERBOSE [boolean]

URL: https://www.postgresql.org/docs/14/sql-reindex.html

可以看到语法中增加了指定表空间的子句。

8. reindex 命令支持分区表

REINDEX
支持分区表, 用这个命令可以自动重建所有子分区的索引，方便了分区表的管理。

本站文章，未经作者同意，请勿转载，如需转载，请邮件customer@csudata.com.

postgresql

文章转载自中启乘数，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。