深入浅出PostgreSQL B-Tree索引结构

digoal 2016-05-28

490

作者

digoal

日期

2016-05-28

背景

PostgreSQL B-Tree是一种变种(high-concurrency B-tree management algorithm)，算法详情请参考

src/backend/access/nbtree/README

PostgreSQL 的B-Tree索引页分为几种类别

```
meta page
root page # btpo_flags=2
branch page # btpo_flags=0
leaf page # btpo_flags=1

如果即是leaf又是root则 btpo_flags=3。
```

其中meta page和root page是必须有的，meta page需要一个页来存储，表示指向root page的page id。

随着记录数的增加，一个root page可能存不下所有的heap item，就会有leaf page，甚至branch page，甚至多层的branch page。

一共有几层branch 和 leaf，就用btree page元数据的 level 来表示。

我们可以使用pageinspect插件，内窥B-Tree的结构。

层次可以从bt_page_stats的btpo得到，代表当前index page所处的层级。

注意层级并不是唯一的，例如btpo=3的层级，可能有分几个档。

打个比喻，腾讯的技术岗位级别T3，对应T3这个级别又有几个小的档位。和这里的含义差不多，只是没有区分小档位的值，但是后面我们能看到它的存在。

btpo=0级表示最底层，处于这个层级的index pages存储的items(ctid)是指向heap page的。

类别和层级不挂钩，类别里面又可以有多个层级，但是只有层级=0的index page存储的ctid内容才是指向heap page的; 其他层级index page存储的ctid内容都是指向同层级其他index page(双向链表)，或者指下级的index page。

0层结构，只有meta和root页。

root页最多可以存储的item数，取决于索引字段数据的长度、以及索引页的大小。

例子

```
postgres=# create extension pageinspect;

postgres=# create table tab1(id int primary key, info text);
CREATE TABLE
postgres=# insert into tab1 select generate_series(1,100), md5(random()::text);
INSERT 0 100
postgres=# vacuum analyze tab1;
VACUUM
```

查看meta page，可以看到root page id = 1 。

索引的level = 0，说明没有branch和leaf page。

postgres=# select * from bt_metap('tab1_pkey'); magic | version | root | level | fastroot | fastlevel --------+---------+------+-------+----------+----------- 340322 | 2 | 1 | 0 | 1 | 0 (1 row)

根据root page id = 1查看root page的stats

btpo=0 说明已经到了最底层

btpo_flags=3，说明它既是leaf又是root页。

postgres=# select * from bt_page_stats('tab1_pkey',1); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 1 | l | 100 | 0 | 16 | 8192 | 6148 | 0 | 0 | 0 | 3 (1 row)

btpo_prev和btpo_next分别表示该页的相邻页（branch page是双向链表）。

btpo_flags 可以在代码中查看(src/include/access/nbtree.h)，一共有几个

```
/ Bits defined in btpo_flags /

define BTP_LEAF (1 << 0) / leaf page, i.e. not internal page /

define BTP_ROOT (1 << 1) / root page (has no parent) /

define BTP_DELETED (1 << 2) / page has been deleted from tree /

define BTP_META (1 << 3) / meta-page /

define BTP_HALF_DEAD (1 << 4) / empty, but still in tree /

define BTP_SPLIT_END (1 << 5) / rightmost page of split group /

define BTP_HAS_GARBAGE (1 << 6) / page has LP_DEAD tuples /

define BTP_INCOMPLETE_SPLIT (1 << 7) / right sibling's downlink is missing /

```

查看0级 page存储的ctid (即items)

0级ctid 表示存储的是 heap页的寻址。（如果是多层结构，那么branch page中的ctid，它表示的是同级btree页(链条项ctid)或者下级btree页的寻址）。

当ctid指向heap时， data是对应的列值。(多级结构的data意义不一样，后面会讲)

postgres=# select * from bt_page_items('tab1_pkey',1); itemoffset | ctid | itemlen | nulls | vars | data ------------+---------+---------+-------+------+------------------------- 1 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00 2 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00 ... 99 | (0,99) | 16 | f | f | 63 00 00 00 00 00 00 00 100 | (0,100) | 16 | f | f | 64 00 00 00 00 00 00 00 (100 rows)

根据ctid 查看heap记录

postgres=# select * from tab1 where ctid='(0,100)'; id | info -----+---------------------------------- 100 | 68b63c269ee8cc2d99fe204f04d0ffcb (1 row)

1层结构，包括meta page, root page, leaf page.

例子

postgres=# truncate tab1; TRUNCATE TABLE postgres=# insert into tab1 select generate_series(1,1000), md5(random()::text); INSERT 0 1000 postgres=# vacuum analyze tab1; VACUUM

查看meta page，可以看到root page id = 3, 索引的level = 1。

level = 1 表示包含了leaf page。

postgres=# select * from bt_metap('tab1_pkey'); magic | version | root | level | fastroot | fastlevel --------+---------+------+-------+----------+----------- 340322 | 2 | 3 | 1 | 3 | 1 (1 row)

根据root page id 查看root page的stats

btpo = 1 说明还没有到最底层（最底层btpo=0, 这种页里面存储的ctid才代表指向heap page的地址）

btpo_flags=2 说明这个页是root page

postgres=# select * from bt_page_stats('tab1_pkey',3); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 3 | r | 3 | 0 | 13 | 8192 | 8096 | 0 | 0 | 1 | 2 (1 row)

查看root page存储的 leaf page items (指向leaf page)

一共3个leaf pages, data存储的是这个leaf page存储的最小值。

postgres=# select * from bt_page_items('tab1_pkey',3); itemoffset | ctid | itemlen | nulls | vars | data ------------+-------+---------+-------+------+------------------------- 1 | (1,1) | 8 | f | f | 2 | (2,1) | 16 | f | f | 6f 01 00 00 00 00 00 00 3 | (4,1) | 16 | f | f | dd 02 00 00 00 00 00 00 (3 rows)

第一条为空，是因为这个leaf page是最左边的PAGE，不存最小值。

对于有右leaf page的leaf page，第一条存储的heap item为该页的右链路。

第二条才是起始ITEM。

另外需要注意，虽然在item里面只存储右链，leaf page还是双向链表，在stats能看到它的prev 和next page。

根据leaf page id查看stats

最左leaf page = 1

prev btpo 指向meta page

可以看到btpo = 0了，说明这个页是底层页。 btpo_flags=1 说明是leaf page postgres=# select * from bt_page_stats('tab1_pkey',1); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 1 | l | 367 | 0 | 16 | 8192 | 808 | 0 | 2 | 0 | 1 (1 row)

next btpo 指向meta page

最右leaf page = 4

btpo_flags=1 说明是leaf page

postgres=# select * from bt_page_stats('tab1_pkey',4); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 4 | l | 268 | 0 | 16 | 8192 | 2788 | 2 | 0 | 0 | 1 (1 row)

中间leaf page = 2

btpo_flags=1 说明是leaf page

postgres=# select * from bt_page_stats('tab1_pkey',2); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 2 | l | 367 | 0 | 16 | 8192 | 808 | 1 | 4 | 0 | 1 (1 row)

查看leaf page存储的 heap ctid (即heap items)

含右页的例子, index page 1

第一条为右链表的第一条item, 第二条才是起始item

postgres=# select * from bt_page_items('tab1_pkey',1); itemoffset | ctid | itemlen | nulls | vars | data ------------+---------+---------+-------+------+------------------------- 1 | (3,7) | 16 | f | f | 6f 01 00 00 00 00 00 00 2 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00 3 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00 ... 367 | (3,6) | 16 | f | f | 6e 01 00 00 00 00 00 00 (367 rows)

不含右页的例子, index page 4

第一条就是起始ctid (即items)

postgres=# select * from bt_page_items('tab1_pkey',4); itemoffset | ctid | itemlen | nulls | vars | data ------------+---------+---------+-------+------+------------------------- 1 | (6,13) | 16 | f | f | dd 02 00 00 00 00 00 00 2 | (6,14) | 16 | f | f | de 02 00 00 00 00 00 00 ... 268 | (8,40) | 16 | f | f | e8 03 00 00 00 00 00 00 (268 rows)

根据ctid 查看heap记录

postgres=# select * from tab1 where ctid='(0,1)'; id | info ----+---------------------------------- 1 | 6ebc6b77aebf5dd11621a2ed846c08c4 (1 row)

记录数超过1层结构的索引可以存储的记录数时，会分裂为2层结构，除了meta page和root page，还可能包含1层branch page以及1层leaf page。

如果是边界页(branch or leaf)，那么其中一个方向没有PAGE，这个方向的链表信息都统一指向meta page。

例子

```
create table tbl1(id int primary key, info text);
postgres=# select 285^2;
?column?

(1 row)
postgres=# insert into tab2 select trunc(random()*10000000), md5(random()::text) from generate_series(1,1000000) on conflict on constraint tab2_pkey do nothing;
INSERT 0 951379
postgres=# vacuum analyze tab2;
VACUUM
```

查看meta page，可以看到root page id = 412, 索引的level=2，即包括1级 branch 和 1级 leaf。

postgres=# select * from bt_metap('tab2_pkey'); magic | version | root | level | fastroot | fastlevel --------+---------+------+-------+----------+----------- 340322 | 2 | 412 | 2 | 412 | 2 (1 row)

根据root page id 查看root page的stats

btpo = 2 当前在第二层，另外还表示下层是1

btpo_flags = 2 说明是root page

postgres=# select * from bt_page_stats('tab2_pkey', 412); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 412 | r | 11 | 0 | 15 | 8192 | 7936 | 0 | 0 | 2 | 2 (1 row)

查看root page存储的 branch page items (指向branch page)

postgres=# select * from bt_page_items('tab2_pkey', 412); itemoffset | ctid | itemlen | nulls | vars | data ------------+----------+---------+-------+------+------------------------- 1 | (3,1) | 8 | f | f | 2 | (2577,1) | 16 | f | f | e1 78 0b 00 00 00 00 00 3 | (1210,1) | 16 | f | f | ec 3a 18 00 00 00 00 00 4 | (2316,1) | 16 | f | f | de 09 25 00 00 00 00 00 5 | (574,1) | 16 | f | f | aa e8 33 00 00 00 00 00 6 | (2278,1) | 16 | f | f | 85 90 40 00 00 00 00 00 7 | (1093,1) | 16 | f | f | f6 e9 4e 00 00 00 00 00 8 | (2112,1) | 16 | f | f | a3 60 5c 00 00 00 00 00 9 | (411,1) | 16 | f | f | b2 ea 6b 00 00 00 00 00 10 | (2073,1) | 16 | f | f | db de 79 00 00 00 00 00 11 | (1392,1) | 16 | f | f | df b0 8a 00 00 00 00 00 (11 rows)

根据branch page id查看stats

btpo = 1 当前在第一层，另外还表示下层是0

btpo_flags = 0 说明是branch page

postgres=# select * from bt_page_stats('tab2_pkey', 3); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 3 | i | 254 | 0 | 15 | 8192 | 3076 | 0 | 2577 | 1 | 0 (1 row)

查看branch page存储的 leaf page ctid (指向leaf page)

只要不是最右边的页，第一条都代表右页的起始item。

第二条才是当前页的起始ctid

注意所有branch page的起始item对应的data都是空的。

也就是说它不存储当前branch page包含的所有leaf pages的索引字段内容的最小值。

postgres=# select * from bt_page_items('tab2_pkey', 3); itemoffset | ctid | itemlen | nulls | vars | data ------------+----------+---------+-------+------+------------------------- 1 | (735,1) | 16 | f | f | e1 78 0b 00 00 00 00 00 2 | (1,1) | 8 | f | f | 3 | (2581,1) | 16 | f | f | a8 09 00 00 00 00 00 00 4 | (1202,1) | 16 | f | f | f8 13 00 00 00 00 00 00 ... 254 | (3322,1) | 16 | f | f | ee 6f 0b 00 00 00 00 00 (254 rows)

根据ctid 查看leaf page

btpo = 0 当前在第0层，即最底层，这里存储的是heap ctid

btpo_flags = 1 说明是leaf page

```
postgres=# select * from bt_page_stats('tab2_pkey', 1);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
1 | l | 242 | 0 | 16 | 8192 | 3308 | 0 | 2581 | 0 | 1
(1 row)

postgres=# select * from bt_page_items('tab2_pkey', 1);
itemoffset | ctid | itemlen | nulls | vars | data
------------+------------+---------+-------+------+-------------------------
1 | (4985,16) | 16 | f | f | a8 09 00 00 00 00 00 00
2 | (7305,79) | 16 | f | f | 01 00 00 00 00 00 00 00
3 | (2757,120) | 16 | f | f | 09 00 00 00 00 00 00 00
...
242 | (1329,101) | 16 | f | f | a0 09 00 00 00 00 00 00
(242 rows)
```

查看leaf page中包含的heap page items。

如果我们根据索引页结构的原理，能推算出来(7305,79)是最小值，取它就没错了。

```
postgres=# select * from tab2 where ctid='(7305,79)';
id | info
----+----------------------------------
1 | 18aaeb74c359355311ac825ae2aeb22a
(1 row)

postgres=# select min(id) from tab2;
min

1
(1 row)
```

多层结构，除了meta page，还可能包含多层branch page，以及一层leaf page。

例子

postgres=# create table tab3(id int primary key, info text); CREATE TABLE postgres=# insert into tab3 select generate_series(1, 100000000), md5(random()::text);

查看meta page, 注意level，已经是3级了。

meta page postgres=# select * from bt_metap('tab3_pkey'); magic | version | root | level | fastroot | fastlevel --------+---------+--------+-------+----------+----------- 340322 | 2 | 116816 | 3 | 116816 | 3 (1 row)

btpo_flags=2 代表 root page

btpo = 3 代表第3层

```
postgres=# select * from bt_page_stats('tab3_pkey', 116816);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
--------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
116816 | r | 3 | 0 | 13 | 8192 | 8096 | 0 | 0 | 3 | 2
(1 row)

postgres=# select * from bt_page_items('tab3_pkey', 116816);
itemoffset | ctid | itemlen | nulls | vars | data
------------+------------+---------+-------+------+-------------------------
1 | (412,1) | 8 | f | f |
2 | (116815,1) | 16 | f | f | 5f 9e c5 01 00 00 00 00
3 | (198327,1) | 16 | f | f | bd 3c 8b 03 00 00 00 00
(3 rows)
```

btpo_flags=0 代表 branch page

btpo = 2 代表第2层

```
postgres=# select * from bt_page_stats('tab3_pkey', 412);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
412 | i | 286 | 0 | 15 | 8192 | 2436 | 0 | 116815 | 2 | 0
(1 row)

postgres=# select * from bt_page_items('tab3_pkey', 412);
itemoffset | ctid | itemlen | nulls | vars | data
------------+-----------+---------+-------+------+-------------------------
1 | (81636,1) | 16 | f | f | 5f 9e c5 01 00 00 00 00 -- 这是指向当前层级右页的ctid
2 | (3,1) | 8 | f | f | -- 注意第一条初始值是这
3 | (411,1) | 16 | f | f | 77 97 01 00 00 00 00 00
4 | (698,1) | 16 | f | f | ed 2e 03 00 00 00 00 00
...
286 | (81350,1) | 16 | f | f | e9 06 c4 01 00 00 00 00
(286 rows)
```

btpo_flags=0 代表 branch page

btpo = 1 代表第1层

```
postgres=# select * from bt_page_stats('tab3_pkey', 3);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
3 | i | 286 | 0 | 15 | 8192 | 2436 | 0 | 411 | 1 | 0
(1 row)

postgres=# select * from bt_page_items('tab3_pkey', 3);
itemoffset | ctid | itemlen | nulls | vars | data
------------+---------+---------+-------+------+-------------------------
1 | (287,1) | 16 | f | f | 77 97 01 00 00 00 00 00
2 | (1,1) | 8 | f | f |
3 | (2,1) | 16 | f | f | 6f 01 00 00 00 00 00 00
4 | (4,1) | 16 | f | f | dd 02 00 00 00 00 00 00
...
286 | (286,1) | 16 | f | f | 09 96 01 00 00 00 00 00
(286 rows)
```

btpo_flags=1 代表 leaf page

btpo = 0 代表第0层

```
postgres=# select * from bt_page_stats('tab3_pkey', 1);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
1 | l | 367 | 0 | 16 | 8192 | 808 | 0 | 2 | 0 | 1
(1 row)

postgres=# select * from bt_page_items('tab3_pkey', 1);
itemoffset | ctid | itemlen | nulls | vars | data
------------+---------+---------+-------+------+-------------------------
1 | (3,7) | 16 | f | f | 6f 01 00 00 00 00 00 00
2 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00
3 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00
...
367 | (3,6) | 16 | f | f | 6e 01 00 00 00 00 00 00
(367 rows)
```

通过第0层的ctid就可以获取到heap了.

heap tuple例子

postgres=# select * from tab3 where ctid='(0,1)'; id | info ----+---------------------------------- 1 | 370ee1989a2b7f5d8a5b43243596d91f (1 row)

如何解释explain analyze中的扫描了多少个btree page

实战例子1

postgres=# create table tbl1(id int primary key, info text); CREATE TABLE postgres=# insert into tbl1 select trunc(random()*10000000), md5(random()::text) from generate_series(1,5000000) on conflict on constraint tbl1_pkey do nothing; INSERT 0 3934875 postgres=# select ctid,* from tbl1 limit 10; ctid | id | info --------+---------+---------------------------------- (0,1) | 2458061 | 5c91812b54bdcae602321dceaf22e276 (0,2) | 8577271 | fe8e7a8be0d71a94e13b1b5a7786010b (0,3) | 4612744 | 56983e47f044b5a4655300e1868d2850 (0,4) | 3690167 | 4a5ec8abf67bc018dcc113be829a59da (0,5) | 2646638 | 7686b47dcb94e56c11d69ec04d6017f3 (0,6) | 6023272 | 4779d9a849c8287490be9d37a27b4637 (0,7) | 7163674 | 35af37f479f48caa65033a5ef56cd75e (0,8) | 4049257 | 12fa110d927c88dce0773b546cc600c6 (0,9) | 5815903 | 69ed9770ede59917d15ac2373ca8c797 (0,10) | 4068194 | 738595f73670da7ede40aefa8cb3d00c (10 rows) postgres=# vacuum analyze tbl1; VACUUM

首先我们需要了解索引的level，才能正确的判断需要扫描多少个index page才能取出1条记录。

postgres=# select * from bt_metap('tbl1_pkey'); magic | version | root | level | fastroot | fastlevel --------+---------+------+-------+----------+----------- 340322 | 2 | 412 | 2 | 412 | 2 (1 row)

level = 2的btree应该长这样

1. 以下查询，命中了1条记录，并且走的是index only scan。

读了4个INDEX PAGE, 包括1 meta page, 1 root page, 1 branch page, 1 leaf page. 1个heap visibility map page

```
postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id = 1;
QUERY PLAN

Index Only Scan using tbl1_pkey on public.tbl1 (cost=0.42..1.44 rows=1 width=4) (actual time=0.019..0.020 rows=1 loops=1)
Output: id
Index Cond: (tbl1.id = 1)
Heap Fetches: 0
Buffers: shared hit=4
Planning time: 0.072 ms
Execution time: 0.072 ms
(7 rows)
```
2. 以下查询，命中了0条记录，并且走的是index only scan。

读了4个INDEX PAGE, 包括1 meta page, 1 root page, 1 branch page, 1 leaf page. 0个heap visibility map page

但是explain只算了3个，因为rows=0, 没有匹配的行。不需要查询visibility map文件。

```
postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id in (3);
QUERY PLAN

Index Only Scan using tbl1_pkey on public.tbl1 (cost=0.43..1.45 rows=1 width=4) (actual time=0.010..0.010 rows=0 loops=1)
Output: id
Index Cond: (tbl1.id = 3)
Heap Fetches: 0
Buffers: shared hit=3
Planning time: 0.073 ms
Execution time: 0.031 ms
(7 rows)
```

3. 以下查询，命中了7条记录，并且走的是index only scan。

读了22个INDEX PAGE,

1 meta page + 7 * (1 root + 1 branch + 1 leaf) = 22

也就是说，每个value都扫了root,branch,leaf。

x个heap visibility map page

```
postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id in (1,2,3,4,100,1000,10000);
QUERY PLAN

Index Only Scan using tbl1_pkey on public.tbl1 (cost=0.42..10.10 rows=7 width=4) (actual time=0.018..0.033 rows=7 loops=1)
Output: id
Index Cond: (tbl1.id = ANY ('{1,2,3,4,100,1000,10000}'::integer[]))
Heap Fetches: 0
Buffers: shared hit=22
Planning time: 0.083 ms
Execution time: 0.056 ms
(7 rows)
```

4. 以下查询，命中了2条记录，并且走的是index only scan。

读了22个INDEX PAGE,

1 meta page + 7 * (1 root + 1 branch + 1 leaf) = 22

也就是说，每个value都扫了root,branch,leaf。

x个heap visibility map page

```
postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id in (1,2,3,4,5,6,7);
QUERY PLAN

Index Only Scan using tbl1_pkey on public.tbl1 (cost=0.43..10.13 rows=7 width=4) (actual time=0.039..0.046 rows=2 loops=1)
Output: id
Index Cond: (tbl1.id = ANY ('{1,2,3,4,5,6,7}'::integer[]))
Heap Fetches: 0
Buffers: shared hit=22
Planning time: 0.232 ms
Execution time: 0.086 ms
(7 rows)
```

5. 以下查询结果和以上查询一样，也命中了3条记录，并且走的是index only scan。

但是只读了4个INDEX PAGE,

1 meta page + 1 root + 1 branch + 1 leaf

x个heap visibility map page

```
postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id>0 and id <=7;
QUERY PLAN

Index Only Scan using tbl1_pkey on public.tbl1 (cost=0.43..1.49 rows=3 width=4) (actual time=0.008..0.009 rows=2 loops=1)
Output: id
Index Cond: ((tbl1.id > 0) AND (tbl1.id <= 7))
Heap Fetches: 0
Buffers: shared hit=4
Planning time: 0.127 ms
Execution time: 0.028 ms
(7 rows)
```

对于第四个查询，扫描了22个块，这个查询，优化器有优化的空间，比如找到1和7作为边界值，在查询到第一个值时，就可以取到leaf page的下一个page的最小值，从而得到1,2,3,4,5,6,7的值在当前page就可以完全取到，不需要去重复扫描。

hit计数为什么感觉不正确?: 算上visibility map时.

src/include/common/relpath.h

```
/
* Stuff for fork names.

* The physical storage of a relation consists of one or more forks.
* The main fork is always created, but in addition to that there can be
* additional forks for storing various metadata. ForkNumber is used when
* we need to refer to a specific fork in a relation.
*/
typedef enum ForkNumber
{
InvalidForkNumber = -1,
MAIN_FORKNUM = 0,
FSM_FORKNUM,
VISIBILITYMAP_FORKNUM,
INIT_FORKNUM

    /*  
     * NOTE: if you add a new fork, change MAX_FORKNUM and possibly  
     * FORKNAMECHARS below, and update the forkNames array in  
     * src/common/relpath.c  
     */

} ForkNumber;

define MAX_FORKNUM INIT_FORKNUM

```

src/backend/storage/buffer/bufmgr.c

```
/
* ReadBuffer_common -- common logic for all ReadBuffer variants

* hit is set to true if the request was satisfied from shared buffer cache.
/
static Buffer
ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BlockNumber blockNum, ReadBufferMode mode,
BufferAccessStrategy strategy, bool hit)
{
BufferDesc bufHdr;
Block bufBlock;
bool found;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);

    *hit = false;

... ...
/ Substitute proper block number if caller asked for P_NEW /
if (isExtend)
blockNum = smgrnblocks(smgr, forkNum);

    if (isLocalBuf)  
    {  
            bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);  
            if (found)  
                    pgBufferUsage.local_blks_hit++;  
            else if (isExtend)  
                    pgBufferUsage.local_blks_written++;  
            else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||  
                             mode == RBM_ZERO_ON_ERROR)  
                    pgBufferUsage.local_blks_read++;  
    }  
    else  
    {  
            /*  
             * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is  
             * not currently in memory.  
             */  
            bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,  
                                                     strategy, &found);  
            if (found)  
                    pgBufferUsage.shared_blks_hit++;  
            else if (isExtend)  
                    pgBufferUsage.shared_blks_written++;  
            else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||  
                             mode == RBM_ZERO_ON_ERROR)  
                    pgBufferUsage.shared_blks_read++;  
    }

```

src/backend/commands/explain.c

```
/
* Show buffer usage details.
/
static void
show_buffer_usage(ExplainState es, const BufferUsage usage, bool planning)
{

...
/ Show only positive counter values. /
if (has_shared || has_local || has_temp)
{
ExplainIndentText(es);
appendStringInfoString(es->str, "Buffers:");

                    if (has_shared)  
                    {  
                            appendStringInfoString(es->str, " shared");  
                            if (usage->shared_blks_hit > 0)  
                                    appendStringInfo(es->str, " hit=%ld",  
                                                                     usage->shared_blks_hit);

```

create extension pageinspect; create extension pg_buffercache;

创建解析page data的函数. 倒转, 二进制转int

create or replace function idx2int(text) returns int as $$ declare res text := ''; res1 int8; x text := ''; begin for x in select regexp_split_to_table($1,' ') loop res := x||res; end loop; execute format($_$ select x'%s'::int8 $_$, res) into res1; return res1::int; end; $$ language plpgsql strict;

```
postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id = 5;
QUERY PLAN

Index Only Scan using tbl1_pkey on public.tbl1 (cost=0.43..1.55 rows=1 width=4) (actual time=0.027..0.029 rows=1 loops=1)
Output: id
Index Cond: (tbl1.id = 5)
Heap Fetches: 0
Buffers: shared hit=4
Planning Time: 0.071 ms
Execution Time: 0.046 ms
(7 rows)

postgres=# SELECT * FROM bt_metap('tbl1_pkey'); -- meta
magic | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples | allequalimage
--------+---------+------+-------+----------+-----------+-------------+-------------------------+---------------
340322 | 4 | 412 | 2 | 412 | 2 | 0 | 3934241 | t
(1 row)

postgres=# select *,idx2int(data) from bt_page_items('tbl1_pkey', 412); -- root
itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | tids | idx2int
------------+----------+---------+-------+------+-------------------------+------+------+------+---------
1 | (3,0) | 8 | f | f | | | | | 0
2 | (9202,1) | 16 | f | f | 2b 5e 03 00 00 00 00 00 | | | | 220715
3 | (4616,1) | 16 | f | f | ac b1 06 00 00 00 00 00 | | | | 438700
4 | (9227,1) | 16 | f | f | 7a 4c 0a 00 00 00 00 00 | | | | 674938
...

postgres=# select *,idx2int(data) from bt_page_items('tbl1_pkey', 3); -- branch
itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | tids | idx2int
------------+-----------+---------+-------+------+-------------------------+------+------+------+---------
1 | (4521,1) | 16 | f | f | 2b 5e 03 00 00 00 00 00 | | | | 220715
2 | (1,0) | 8 | f | f | | | | | 0
3 | (10555,1) | 16 | f | f | 56 02 00 00 00 00 00 00 | | | | 598
4 | (5423,1) | 16 | f | f | 05 05 00 00 00 00 00 00 | | | | 1285
...

postgres=# select *,idx2int(data) from bt_page_items('tbl1_pkey', 1); -- leaf
itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | tids | idx2int
------------+-------------+---------+-------+------+-------------------------+------+-------------+------+---------
1 | (25828,1) | 16 | f | f | 56 02 00 00 00 00 00 00 | | | | 598
2 | (594,78) | 16 | f | f | 00 00 00 00 00 00 00 00 | f | (594,78) | | 0
3 | (19518,66) | 16 | f | f | 05 00 00 00 00 00 00 00 | f | (19518,66) | | 5
4 | (32751,68) | 16 | f | f | 06 00 00 00 00 00 00 00 | f | (32751,68) | | 6
5 | (20678,31) | 16 | f | f | 07 00 00 00 00 00 00 00 | f | (20678,31) | | 7
...

postgres=# select ctid,* from tbl1 where ctid='(19518,66)'; -- heap
ctid | id | info
------------+----+----------------------------------
(19518,66) | 5 | 01113e164911cf0eaf5db51b4c6e086b
(1 row)
```

计算heap page 19518 的2 bit位属于那个vm page num

```
postgres=# show block_size ;
block_size

8192
(1 row)

postgres=# select 2*19518/8.0;
?column?

4879.5000000000000000 -- bytes
(1 row)
```

访问了第1个vm数据块, 0号块

postgres=# select * from pg_buffercache where relfilenode=pg_relation_filenode('tbl1'::regclass) and relforknumber=2 ; -- table vm form bufferid | relfilenode | reltablespace | reldatabase | relforknumber | relblocknumber | isdirty | usagecount | pinning_backends ----------+-------------+---------------+-------------+---------------+----------------+---------+------------+------------------ 402016 | 1093455 | 1663 | 14174 | 2 | 0 | f | 5 | 0 -- 访问这个 402017 | 1093455 | 1663 | 14174 | 2 | 1 | f | 4 | 0 (2 rows)

存在的值, Buffers: shared hit=4 -- (meta+root+branch+leaf) + vm

```
postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id = 5;
QUERY PLAN

不存在的值, Buffers: shared hit=3 -- (meta+root+branch+leaf)

```
postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id = 1;
QUERY PLAN

Index Only Scan using tbl1_pkey on public.tbl1 (cost=0.43..1.55 rows=1 width=4) (actual time=0.025..0.066 rows=0 loops=1)
Output: id
Index Cond: (tbl1.id = 1)
Heap Fetches: 0
Buffers: shared hit=3
Planning Time: 0.086 ms
Execution Time: 0.082 ms
(7 rows)
```

存在、不存在唯一的差别是: 不存在时不需要访问vm.

所以从现象看, shared hit统计不准确: 可能的解释是 bt index meta page 没有被count? 有兴趣的同学可以从代码侧再分析一下。

还有访问多次时vm page是否被重复计算?

此前有对比过bitmap index scan和index scan, 在离散扫描上index sacn会重复算每次的heap page扫描, 二bitmap index scan只算一次heap scan.

文章转载自digoal，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。

深入浅出PostgreSQL B-Tree索引结构

作者

日期

标签

背景

define BTP_LEAF (1 << 0) / leaf page, i.e. not internal page /

define BTP_ROOT (1 << 1) / root page (has no parent) /

define BTP_DELETED (1 << 2) / page has been deleted from tree /

define BTP_META (1 << 3) / meta-page /

define BTP_HALF_DEAD (1 << 4) / empty, but still in tree /

define BTP_SPLIT_END (1 << 5) / rightmost page of split group /

define BTP_HAS_GARBAGE (1 << 6) / page has LP_DEAD tuples /

define BTP_INCOMPLETE_SPLIT (1 << 7) / right sibling's downlink is missing /

hit计数为什么感觉不正确?: 算上visibility map时.

define MAX_FORKNUM INIT_FORKNUM

PostgreSQL 许愿链接

9.9元购买3个月阿里云RDS PostgreSQL实例

PostgreSQL 解决方案集合

德哥 / digoal's github - 公益是一辈子的事.

评论