作者
digoal
日期
2021-03-26
标签
PostgreSQL , brin , 索引 , 块级索引 , bloom , 等值查询
背景
等值查询一般有些什么索引支持?
btree: 精准定位
hash: 精准定位
bloom: 超集定位, 支持任意组合等值查询, 少量bit占位符表示是否包含某个value, 由于不同value的bits不同, 当插入大量value后占位bit可能被大量设置为1, 导致超集出现, 等值查询真不一定为真, 假一定为假.
gin: 倒排索引
brin: 块级索引, 保存一个heap blocks段存储的被索引字段value 的范围. min,max
目前BRIN变种出现, 存储的不再是min, max, 而是支持了bloom filter, 也就是说它存储的是占位bits. 每个连续heap blocks, 存储一个占位bits, 被索引字段的hash value经过再次bloom hash填充占位bit.
brin bloom 的应用场景:
离散值较多时: 等值查询, 任意字段组合等值查询.
PostgreSQL 开发者确实很精致, 赞!!!
最佳设置实践:
1、false_positive_rate: false(0) bits 占比不能太低, 越低, 失真越高(都是1的占位bit, 意思是它包含任意value, 所以就没有过滤意义)
2、n_distinct_per_range: 一连串blocks包含多少个被索引字段的唯一值, 这个值越大越容易失真, 因为占位组合会更多
3、通过调整pages_per_range 来控制 n_distinct_per_range
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=77b88cd1bb9041a735f24072150cacfa06c699a3
```
BRIN bloom indexes
author Tomas Vondra tomas.vondra@postgresql.org
Fri, 26 Mar 2021 12:35:29 +0000 (13:35 +0100)
committer Tomas Vondra tomas.vondra@postgresql.org
Fri, 26 Mar 2021 12:35:32 +0000 (13:35 +0100)
commit 77b88cd1bb9041a735f24072150cacfa06c699a3
tree be9ca84d673e3aa17e0e75ec579be414ae7eac18 tree
parent a681e3c107aa97eb554f118935c4d2278892c3dd commit | diff
BRIN bloom indexes
Adds a BRIN opclass using a Bloom filter to summarize the range. Indexes
using the new opclasses allow only equality queries (similar to hash
indexes), but that works fine for data like UUID, MAC addresses etc. for
which range queries are not very common. This also means the indexes
work for data that is not well correlated to physical location within
the table, or perhaps even entirely random (which is a common issue with
existing BRIN minmax opclasses).
It's possible to specify opclass parameters with the usual Bloom filter
parameters, i.e. the desired false-positive rate and the expected number
of distinct values per page range.
CREATE TABLE t (a int);
CREATE INDEX ON t
USING brin (a int4_bloom_ops(false_positive_rate = 0.05,
n_distinct_per_range = 100));
The opclasses do not operate on the indexed values directly, but compute
a 32-bit hash first, and the Bloom filter is built on the hash value.
Collisions should not be a huge issue though, as the number of distinct
values in a page ranges is usually fairly small.
Bump catversion, due to various catalog changes.
Author: Tomas Vondra tomas.vondra@postgresql.org
Reviewed-by: Alvaro Herrera alvherre@alvh.no-ip.org
Reviewed-by: Alexander Korotkov aekorotkov@gmail.com
Reviewed-by: Sokolov Yura y.sokolov@postgrespro.ru
Reviewed-by: Nico Williams nico@cryptonector.com
Reviewed-by: John Naylor john.naylor@enterprisedb.com
Discussion: https://postgr.es/m/c1138ead-7668-f0e1-0638-c3be3237e812@2ndquadrant.com
Discussion: https://postgr.es/m/5d78b774-7e9c-c94e-12cf-fef51cc89b1a%402ndquadrant.com
```
用法
1 CREATE TABLE brintest_bloom (byteacol bytea,
2 charcol "char",
3 namecol name,
4 int8col bigint,
5 int2col smallint,
6 int4col integer,
7 textcol text,
8 oidcol oid,
9 float4col real,
10 float8col double precision,
11 macaddrcol macaddr,
12 inetcol inet,
13 cidrcol cidr,
14 bpcharcol character,
15 datecol date,
16 timecol time without time zone,
17 timestampcol timestamp without time zone,
18 timestamptzcol timestamp with time zone,
19 intervalcol interval,
20 timetzcol time with time zone,
21 numericcol numeric,
22 uuidcol uuid,
23 lsncol pg_lsn
24 ) WITH (fillfactor=10);
25 INSERT INTO brintest_bloom SELECT
26 repeat(stringu1, 8)::bytea,
27 substr(stringu1, 1, 1)::"char",
28 stringu1::name, 142857 * tenthous,
29 thousand,
30 twothousand,
31 repeat(stringu1, 8),
32 unique1::oid,
33 (four + 1.0)/(hundred+1),
34 odd::float8 / (tenthous + 1),
35 format('%s:00:%s:00:%s:00', to_hex(odd), to_hex(even), to_hex(hundred))::macaddr,
36 inet '10.2.3.4/24' + tenthous,
37 cidr '10.2.3/24' + tenthous,
38 substr(stringu1, 1, 1)::bpchar,
39 date '1995-08-15' + tenthous,
40 time '01:20:30' + thousand * interval '18.5 second',
41 timestamp '1942-07-23 03:05:09' + tenthous * interval '36.38 hours',
42 timestamptz '1972-10-10 03:00' + thousand * interval '1 hour',
43 justify_days(justify_hours(tenthous * interval '12 minutes')),
44 timetz '01:30:20+02' + hundred * interval '15 seconds',
45 tenthous::numeric(36,30) * fivethous * even / (hundred + 1),
46 format('%s%s-%s-%s-%s-%s%s%s', to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'))::uuid,
47 format('%s/%s%s', odd, even, tenthous)::pg_lsn
48 FROM tenk1 ORDER BY unique2 LIMIT 100;
49 -- throw in some NULL's and different values
50 INSERT INTO brintest_bloom (inetcol, cidrcol) SELECT
51 inet 'fe80::6e40:8ff:fea9:8c46' + tenthous,
52 cidr 'fe80::6e40:8ff:fea9:8c46' + tenthous
53 FROM tenk1 ORDER BY thousand, tenthous LIMIT 25;
54 -- test bloom specific index options
55 -- ndistinct must be >= -1.0
56 CREATE INDEX brinidx_bloom ON brintest_bloom USING brin (
57 byteacol bytea_bloom_ops(n_distinct_per_range = -1.1)
58 );
59 ERROR: value -1.1 out of bounds for option "n_distinct_per_range"
60 DETAIL: Valid values are between "-1.000000" and "2147483647.000000".
61 -- false_positive_rate must be between 0.0001 and 0.25
62 CREATE INDEX brinidx_bloom ON brintest_bloom USING brin (
63 byteacol bytea_bloom_ops(false_positive_rate = 0.00009)
64 );
65 ERROR: value 0.00009 out of bounds for option "false_positive_rate"
66 DETAIL: Valid values are between "0.000100" and "0.250000".
67 CREATE INDEX brinidx_bloom ON brintest_bloom USING brin (
68 byteacol bytea_bloom_ops(false_positive_rate = 0.26)
69 );
70 ERROR: value 0.26 out of bounds for option "false_positive_rate"
71 DETAIL: Valid values are between "0.000100" and "0.250000".
72 CREATE INDEX brinidx_bloom ON brintest_bloom USING brin (
73 byteacol bytea_bloom_ops,
74 charcol char_bloom_ops,
75 namecol name_bloom_ops,
76 int8col int8_bloom_ops,
77 int2col int2_bloom_ops,
78 int4col int4_bloom_ops,
79 textcol text_bloom_ops,
80 oidcol oid_bloom_ops,
81 float4col float4_bloom_ops,
82 float8col float8_bloom_ops,
83 macaddrcol macaddr_bloom_ops,
84 inetcol inet_bloom_ops,
85 cidrcol inet_bloom_ops,
86 bpcharcol bpchar_bloom_ops,
87 datecol date_bloom_ops,
88 timecol time_bloom_ops,
89 timestampcol timestamp_bloom_ops,
90 timestamptzcol timestamptz_bloom_ops,
91 intervalcol interval_bloom_ops,
92 timetzcol timetz_bloom_ops,
93 numericcol numeric_bloom_ops,
94 uuidcol uuid_bloom_ops,
95 lsncol pg_lsn_bloom_ops
96 ) with (pages_per_range = 1);
97 CREATE TABLE brinopers_bloom (colname name, typ text,
98 op text[], value text[], matches int[],
99 check (cardinality(op) = cardinality(value)),
100 check (cardinality(op) = cardinality(matches)));
101 INSERT INTO brinopers_bloom VALUES
102 ('byteacol', 'bytea',
103 '{=}',
104 '{BNAAAABNAAAABNAAAABNAAAABNAAAABNAAAABNAAAABNAAAA}',
105 '{1}'),
106 ('charcol', '"char"',
107 '{=}',
108 '{M}',
109 '{6}'),
110 ('namecol', 'name',
111 '{=}',
112 '{MAAAAA}',
113 '{2}'),
114 ('int2col', 'int2',
115 '{=}',
116 '{800}',
117 '{1}'),
118 ('int4col', 'int4',
119 '{=}',
120 '{800}',
121 '{1}'),
122 ('int8col', 'int8',
123 '{=}',
124 '{1257141600}',
125 '{1}'),
126 ('textcol', 'text',
127 '{=}',
128 '{BNAAAABNAAAABNAAAABNAAAABNAAAABNAAAABNAAAABNAAAA}',
129 '{1}'),
130 ('oidcol', 'oid',
131 '{=}',
132 '{8800}',
133 '{1}'),
134 ('float4col', 'float4',
135 '{=}',
136 '{1}',
137 '{4}'),
138 ('float8col', 'float8',
139 '{=}',
140 '{0}',
141 '{1}'),
142 ('macaddrcol', 'macaddr',
143 '{=}',
144 '{2c:00:2d:00:16:00}',
145 '{2}'),
146 ('inetcol', 'inet',
147 '{=}',
148 '{10.2.14.231/24}',
149 '{1}'),
150 ('inetcol', 'cidr',
151 '{=}',
152 '{fe80::6e40:8ff:fea9:8c46}',
153 '{1}'),
154 ('cidrcol', 'inet',
155 '{=}',
156 '{10.2.14/24}',
157 '{2}'),
158 ('cidrcol', 'inet',
159 '{=}',
160 '{fe80::6e40:8ff:fea9:8c46}',
161 '{1}'),
162 ('cidrcol', 'cidr',
163 '{=}',
164 '{10.2.14/24}',
165 '{2}'),
166 ('cidrcol', 'cidr',
167 '{=}',
168 '{fe80::6e40:8ff:fea9:8c46}',
169 '{1}'),
170 ('bpcharcol', 'bpchar',
171 '{=}',
172 '{W}',
173 '{6}'),
174 ('datecol', 'date',
175 '{=}',
176 '{2009-12-01}',
177 '{1}'),
178 ('timecol', 'time',
179 '{=}',
180 '{02:28:57}',
181 '{1}'),
182 ('timestampcol', 'timestamp',
183 '{=}',
184 '{1964-03-24 19:26:45}',
185 '{1}'),
186 ('timestamptzcol', 'timestamptz',
187 '{=}',
188 '{1972-10-19 09:00:00-07}',
189 '{1}'),
190 ('intervalcol', 'interval',
191 '{=}',
192 '{1 mons 13 days 12:24}',
193 '{1}'),
194 ('timetzcol', 'timetz',
195 '{=}',
196 '{01:35:50+02}',
197 '{2}'),
198 ('numericcol', 'numeric',
199 '{=}',
200 '{2268164.347826086956521739130434782609}',
201 '{1}'),
202 ('uuidcol', 'uuid',
203 '{=}',
204 '{52225222-5222-5222-5222-522252225222}',
205 '{1}'),
206 ('lsncol', 'pg_lsn',
207 '{=, IS, IS NOT}',
208 '{44/455222, NULL, NULL}',
209 '{1, 25, 100}');
210 DO $x$
211 DECLARE
212 r record;
213 r2 record;
214 cond text;
215 idx_ctids tid[];
216 ss_ctids tid[];
217 count int;
218 plan_ok bool;
219 plan_line text;
220 BEGIN
221 FOR r IN SELECT colname, oper, typ, value[ordinality], matches[ordinality] FROM brinopers_bloom, unnest(op) WITH ORDINALITY AS oper LOOP
222
223 -- prepare the condition
224 IF r.value IS NULL THEN
225 cond := format('%I %s %L', r.colname, r.oper, r.value);
226 ELSE
227 cond := format('%I %s %L::%s', r.colname, r.oper, r.value, r.typ);
228 END IF;
229
230 -- run the query using the brin index
231 SET enable_seqscan = 0;
232 SET enable_bitmapscan = 1;
233
234 plan_ok := false;
235 FOR plan_line IN EXECUTE format($y$EXPLAIN SELECT array_agg(ctid) FROM brintest_bloom WHERE %s $y$, cond) LOOP
236 IF plan_line LIKE '%Bitmap Heap Scan on brintest_bloom%' THEN
237 plan_ok := true;
238 END IF;
239 END LOOP;
240 IF NOT plan_ok THEN
241 RAISE WARNING 'did not get bitmap indexscan plan for %', r;
242 END IF;
243
244 EXECUTE format($y$SELECT array_agg(ctid) FROM brintest_bloom WHERE %s $y$, cond)
245 INTO idx_ctids;
246
247 -- run the query using a seqscan
248 SET enable_seqscan = 1;
249 SET enable_bitmapscan = 0;
250
251 plan_ok := false;
252 FOR plan_line IN EXECUTE format($y$EXPLAIN SELECT array_agg(ctid) FROM brintest_bloom WHERE %s $y$, cond) LOOP
253 IF plan_line LIKE '%Seq Scan on brintest_bloom%' THEN
254 plan_ok := true;
255 END IF;
256 END LOOP;
257 IF NOT plan_ok THEN
258 RAISE WARNING 'did not get seqscan plan for %', r;
259 END IF;
260
261 EXECUTE format($y$SELECT array_agg(ctid) FROM brintest_bloom WHERE %s $y$, cond)
262 INTO ss_ctids;
263
264 -- make sure both return the same results
265 count := array_length(idx_ctids, 1);
266
267 IF NOT (count = array_length(ss_ctids, 1) AND
268 idx_ctids @> ss_ctids AND
269 idx_ctids <@ ss_ctids) THEN
270 -- report the results of each scan to make the differences obvious
271 RAISE WARNING 'something not right in %: count %', r, count;
272 SET enable_seqscan = 1;
273 SET enable_bitmapscan = 0;
274 FOR r2 IN EXECUTE 'SELECT ' || r.colname || ' FROM brintest_bloom WHERE ' || cond LOOP
275 RAISE NOTICE 'seqscan: %', r2;
276 END LOOP;
277
278 SET enable_seqscan = 0;
279 SET enable_bitmapscan = 1;
280 FOR r2 IN EXECUTE 'SELECT ' || r.colname || ' FROM brintest_bloom WHERE ' || cond LOOP
281 RAISE NOTICE 'bitmapscan: %', r2;
282 END LOOP;
283 END IF;
284
285 -- make sure we found expected number of matches
286 IF count != r.matches THEN RAISE WARNING 'unexpected number of results % for %', count, r; END IF;
287 END LOOP;
288 END;
289 $x$;
290 RESET enable_seqscan;
291 RESET enable_bitmapscan;
292 INSERT INTO brintest_bloom SELECT
293 repeat(stringu1, 42)::bytea,
294 substr(stringu1, 1, 1)::"char",
295 stringu1::name, 142857 * tenthous,
296 thousand,
297 twothousand,
298 repeat(stringu1, 42),
299 unique1::oid,
300 (four + 1.0)/(hundred+1),
301 odd::float8 / (tenthous + 1),
302 format('%s:00:%s:00:%s:00', to_hex(odd), to_hex(even), to_hex(hundred))::macaddr,
303 inet '10.2.3.4' + tenthous,
304 cidr '10.2.3/24' + tenthous,
305 substr(stringu1, 1, 1)::bpchar,
306 date '1995-08-15' + tenthous,
307 time '01:20:30' + thousand * interval '18.5 second',
308 timestamp '1942-07-23 03:05:09' + tenthous * interval '36.38 hours',
309 timestamptz '1972-10-10 03:00' + thousand * interval '1 hour',
310 justify_days(justify_hours(tenthous * interval '12 minutes')),
311 timetz '01:30:20' + hundred * interval '15 seconds',
312 tenthous::numeric(36,30) * fivethous * even / (hundred + 1),
313 format('%s%s-%s-%s-%s-%s%s%s', to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'))::uuid,
314 format('%s/%s%s', odd, even, tenthous)::pg_lsn
315 FROM tenk1 ORDER BY unique2 LIMIT 5 OFFSET 5;
316 SELECT brin_desummarize_range('brinidx_bloom', 0);
317 brin_desummarize_range
318 ------------------------
319
320 (1 row)
321
322 VACUUM brintest_bloom; -- force a summarization cycle in brinidx
323 UPDATE brintest_bloom SET int8col = int8col * int4col;
324 UPDATE brintest_bloom SET textcol = '' WHERE textcol IS NOT NULL;
325 -- Tests for brin_summarize_new_values
326 SELECT brin_summarize_new_values('brintest_bloom'); -- error, not an index
327 ERROR: "brintest_bloom" is not an index
328 SELECT brin_summarize_new_values('tenk1_unique1'); -- error, not a BRIN index
329 ERROR: "tenk1_unique1" is not a BRIN index
330 SELECT brin_summarize_new_values('brinidx_bloom'); -- ok, no change expected
331 brin_summarize_new_values
332 ---------------------------
333 0
334 (1 row)
335
336 -- Tests for brin_desummarize_range
337 SELECT brin_desummarize_range('brinidx_bloom', -1); -- error, invalid range
338 ERROR: block number out of range: -1
339 SELECT brin_desummarize_range('brinidx_bloom', 0);
340 brin_desummarize_range
341 ------------------------
342
343 (1 row)
344
345 SELECT brin_desummarize_range('brinidx_bloom', 0);
346 brin_desummarize_range
347 ------------------------
348
349 (1 row)
350
351 SELECT brin_desummarize_range('brinidx_bloom', 100000000);
352 brin_desummarize_range
353 ------------------------
354
355 (1 row)
356
357 -- Test brin_summarize_range
358 CREATE TABLE brin_summarize_bloom (
359 value int
360 ) WITH (fillfactor=10, autovacuum_enabled=false);
361 CREATE INDEX brin_summarize_bloom_idx ON brin_summarize_bloom USING brin (value) WITH (pages_per_range=2);
362 -- Fill a few pages
363 DO $$
364 DECLARE curtid tid;
365 BEGIN
366 LOOP
367 INSERT INTO brin_summarize_bloom VALUES (1) RETURNING ctid INTO curtid;
368 EXIT WHEN curtid > tid '(2, 0)';
369 END LOOP;
370 END;
371 $$;
372 -- summarize one range
373 SELECT brin_summarize_range('brin_summarize_bloom_idx', 0);
374 brin_summarize_range
375 ----------------------
376 0
377 (1 row)
378
379 -- nothing: already summarized
380 SELECT brin_summarize_range('brin_summarize_bloom_idx', 1);
381 brin_summarize_range
382 ----------------------
383 0
384 (1 row)
385
386 -- summarize one range
387 SELECT brin_summarize_range('brin_summarize_bloom_idx', 2);
388 brin_summarize_range
389 ----------------------
390 1
391 (1 row)
392
393 -- nothing: page doesn't exist in table
394 SELECT brin_summarize_range('brin_summarize_bloom_idx', 4294967295);
395 brin_summarize_range
396 ----------------------
397 0
398 (1 row)
399
400 -- invalid block number values
401 SELECT brin_summarize_range('brin_summarize_bloom_idx', -1);
402 ERROR: block number out of range: -1
403 SELECT brin_summarize_range('brin_summarize_bloom_idx', 4294967296);
404 ERROR: block number out of range: 4294967296
405 -- test brin cost estimates behave sanely based on correlation of values
406 CREATE TABLE brin_test_bloom (a INT, b INT);
407 INSERT INTO brin_test_bloom SELECT x/100,x%100 FROM generate_series(1,10000) x(x);
408 CREATE INDEX brin_test_bloom_a_idx ON brin_test_bloom USING brin (a) WITH (pages_per_range = 2);
409 CREATE INDEX brin_test_bloom_b_idx ON brin_test_bloom USING brin (b) WITH (pages_per_range = 2);
410 VACUUM ANALYZE brin_test_bloom;
411 -- Ensure brin index is used when columns are perfectly correlated
412 EXPLAIN (COSTS OFF) SELECT * FROM brin_test_bloom WHERE a = 1;
413 QUERY PLAN
414 --------------------------------------------------
415 Bitmap Heap Scan on brin_test_bloom
416 Recheck Cond: (a = 1)
417 -> Bitmap Index Scan on brin_test_bloom_a_idx
418 Index Cond: (a = 1)
419 (4 rows)
420
421 -- Ensure brin index is not used when values are not correlated
422 EXPLAIN (COSTS OFF) SELECT * FROM brin_test_bloom WHERE b = 1;
423 QUERY PLAN
424 -----------------------------
425 Seq Scan on brin_test_bloom
426 Filter: (b = 1)
427 (2 rows)
原理:
1 /*
2 * brin_bloom.c
3 * Implementation of Bloom opclass for BRIN
4 *
5 * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
6 * Portions Copyright (c) 1994, Regents of the University of California
7 *
8 *
9 * A BRIN opclass summarizing page range into a bloom filter.
10 *
11 * Bloom filters allow efficient testing whether a given page range contains
12 * a particular value. Therefore, if we summarize each page range into a small
13 * bloom filter, we can easily (and cheaply) test whether it contains values
14 * we get later.
15 *
16 * The index only supports equality operators, similarly to hash indexes.
17 * Bloom indexes are however much smaller, and support only bitmap scans.
18 *
19 * Note: Don't confuse this with bloom indexes, implemented in a contrib
20 * module. That extension implements an entirely new AM, building a bloom
21 * filter on multiple columns in a single row. This opclass works with an
22 * existing AM (BRIN) and builds bloom filter on a column.
23 *
24 *
25 * values vs. hashes
26 * -----------------
27 *
28 * The original column values are not used directly, but are first hashed
29 * using the regular type-specific hash function, producing a uint32 hash.
30 * And this hash value is then added to the summary - i.e. it's hashed
31 * again and added to the bloom filter.
32 *
33 * This allows the code to treat all data types (byval/byref/...) the same
34 * way, with only minimal space requirements, because we're working with
35 * hashes and not the original values. Everything is uint32.
36 *
37 * Of course, this assumes the built-in hash function is reasonably good,
38 * without too many collisions etc. But that does seem to be the case, at
39 * least based on past experience. After all, the same hash functions are
40 * used for hash indexes, hash partitioning and so on.
41 *
42 *
43 * hashing scheme
44 * --------------
45 *
46 * Bloom filters require a number of independent hash functions. There are
47 * different schemes how to construct them - for example we might use
48 * hash_uint32_extended with random seeds, but that seems fairly expensive.
49 * We use a scheme requiring only two functions described in this paper:
50 *
51 * Less Hashing, Same Performance:Building a Better Bloom Filter
52 * Adam Kirsch, Michael Mitzenmacher†, Harvard School of Engineering and
53 * Applied Sciences, Cambridge, Massachusetts [DOI 10.1002/rsa.20208]
54 *
55 * The two hash functions h1 and h2 are calculated using hard-coded seeds,
56 * and then combined using (h1 + i * h2) to generate the hash functions.
57 *
58 *
59 * sizing the bloom filter
60 * -----------------------
61 *
62 * Size of a bloom filter depends on the number of distinct values we will
63 * store in it, and the desired false positive rate. The higher the number
64 * of distinct values and/or the lower the false positive rate, the larger
65 * the bloom filter. On the other hand, we want to keep the index as small
66 * as possible - that's one of the basic advantages of BRIN indexes.
67 *
68 * Although the number of distinct elements (in a page range) depends on
69 * the data, we can consider it fixed. This simplifies the trade-off to
70 * just false positive rate vs. size.
71 *
72 * At the page range level, false positive rate is a probability the bloom
73 * filter matches a random value. For the whole index (with sufficiently
74 * many page ranges) it represents the fraction of the index ranges (and
75 * thus fraction of the table to be scanned) matching the random value.
76 *
77 * Furthermore, the size of the bloom filter is subject to implementation
78 * limits - it has to fit onto a single index page (8kB by default). As
79 * the bitmap is inherently random (when "full" about half the bits is set
80 * to 1, randomly), compression can't help very much.
81 *
82 * To reduce the size of a filter (to fit to a page), we have to either
83 * accept higher false positive rate (undesirable), or reduce the number
84 * of distinct items to be stored in the filter. We can't alter the input
85 * data, of course, but we may make the BRIN page ranges smaller - instead
86 * of the default 128 pages (1MB) we may build index with 16-page ranges,
87 * or something like that. This should reduce the number of distinct values
88 * in the page range, making the filter smaller (with fixed false positive
89 * rate). Even for random data sets this should help, as the number of rows
90 * per heap page is limited (to ~290 with very narrow tables, likely ~20
91 * in practice).
92 *
93 * Of course, good sizing decisions depend on having the necessary data,
94 * i.e. number of distinct values in a page range (of a given size) and
95 * table size (to estimate cost change due to change in false positive
96 * rate due to having larger index vs. scanning larger indexes). We may
97 * not have that data - for example when building an index on empty table
98 * it's not really possible. And for some data we only have estimates for
99 * the whole table and we can only estimate per-range values (ndistinct).
100 *
101 * Another challenge is that while the bloom filter is per-column, it's
102 * the whole index tuple that has to fit into a page. And for multi-column
103 * indexes that may include pieces we have no control over (not necessarily
104 * bloom filters, the other columns may use other BRIN opclasses). So it's
105 * not entirely clear how to distribute the space between those columns.
106 *
107 * The current logic, implemented in brin_bloom_get_ndistinct, attempts to
108 * make some basic sizing decisions, based on the size of BRIN ranges, and
109 * the maximum number of rows per range.
110 *
111 *
112 * IDENTIFICATION
113 * src/backend/access/brin/brin_bloom.c
114 */
115 #include "postgres.h"
116
117 #include "access/genam.h"
118 #include "access/brin.h"
119 #include "access/brin_internal.h"
120 #include "access/brin_page.h"
121 #include "access/brin_tuple.h"
122 #include "access/hash.h"
123 #include "access/htup_details.h"
124 #include "access/reloptions.h"
125 #include "access/stratnum.h"
126 #include "catalog/pg_type.h"
127 #include "catalog/pg_amop.h"
128 #include "utils/builtins.h"
129 #include "utils/datum.h"
130 #include "utils/lsyscache.h"
131 #include "utils/rel.h"
132 #include "utils/syscache.h"
133
134 #include <math.h>
135
136 #define BloomEqualStrategyNumber 1
137
138 /*
139 * Additional SQL level support functions. We only have one, which is
140 * used to calculate hash of the input value.
141 *
142 * Procedure numbers must not use values reserved for BRIN itself; see
143 * brin_internal.h.
144 */
145 #define BLOOM_MAX_PROCNUMS 1 /* maximum support procs we need */
146 #define PROCNUM_HASH 11 /* required */
147
148 /*
149 * Subtract this from procnum to obtain index in BloomOpaque arrays
150 * (Must be equal to minimum of private procnums).
151 */
152 #define PROCNUM_BASE 11
153
154 /*
155 * Storage type for BRIN's reloptions.
156 */
157 typedef struct BloomOptions
158 {
159 int32 vl_len_; /* varlena header (do not touch directly!) */
160 double nDistinctPerRange; /* number of distinct values per range */
161 double falsePositiveRate; /* false positive for bloom filter */
162 } BloomOptions;
163
164 /*
165 * The current min value (16) is somewhat arbitrary, but it's based
166 * on the fact that the filter header is ~20B alone, which is about
167 * the same as the filter bitmap for 16 distinct items with 1% false
168 * positive rate. So by allowing lower values we'd not gain much. In
169 * any case, the min should not be larger than MaxHeapTuplesPerPage
170 * (~290), which is the theoretical maximum for single-page ranges.
171 */
172 #define BLOOM_MIN_NDISTINCT_PER_RANGE 16
173
174 /*
175 * Used to determine number of distinct items, based on the number of rows
176 * in a page range. The 10% is somewhat similar to what estimate_num_groups
177 * does, so we use the same factor here.
178 */
179 #define BLOOM_DEFAULT_NDISTINCT_PER_RANGE -0.1 /* 10% of values */
180
181 /*
182 * Allowed range and default value for the false positive range. The exact
183 * values are somewhat arbitrary, but were chosen considering the various
184 * parameters (size of filter vs. page size, etc.).
185 *
186 * The lower the false-positive rate, the more accurate the filter is, but
187 * it also gets larger - at some point this eliminates the main advantage
188 * of BRIN indexes, which is the tiny size. At 0.01% the index is about
189 * 10% of the table (assuming 290 distinct values per 8kB page).
190 *
191 * On the other hand, as the false-positive rate increases, larger part of
192 * the table has to be scanned due to mismatches - at 25% we're probably
193 * close to sequential scan being cheaper.
194 */
195 #define BLOOM_MIN_FALSE_POSITIVE_RATE 0.0001 /* 0.01% fp rate */
196 #define BLOOM_MAX_FALSE_POSITIVE_RATE 0.25 /* 25% fp rate */
197 #define BLOOM_DEFAULT_FALSE_POSITIVE_RATE 0.01 /* 1% fp rate */
198
199 #define BloomGetNDistinctPerRange(opts) \
200 ((opts) && (((BloomOptions *) (opts))->nDistinctPerRange != 0) ? \
201 (((BloomOptions *) (opts))->nDistinctPerRange) : \
202 BLOOM_DEFAULT_NDISTINCT_PER_RANGE)
203
204 #define BloomGetFalsePositiveRate(opts) \
205 ((opts) && (((BloomOptions *) (opts))->falsePositiveRate != 0.0) ? \
206 (((BloomOptions *) (opts))->falsePositiveRate) : \
207 BLOOM_DEFAULT_FALSE_POSITIVE_RATE)
208
209 /*
210 * And estimate of the largest bloom we can fit onto a page. This is not
211 * a perfect guarantee, for a couple of reasons. For example, the row may
212 * be larger because the index has multiple columns.
213 */
214 #define BloomMaxFilterSize \
215 MAXALIGN_DOWN(BLCKSZ - \
216 (MAXALIGN(SizeOfPageHeaderData + \
217 sizeof(ItemIdData)) + \
218 MAXALIGN(sizeof(BrinSpecialSpace)) + \
219 SizeOfBrinTuple))
220
221 /*
222 * Seeds used to calculate two hash functions h1 and h2, which are then used
223 * to generate k hashes using the (h1 + i * h2) scheme.
224 */
225 #define BLOOM_SEED_1 0x71d924af
226 #define BLOOM_SEED_2 0xba48b314
227
228 /*
229 * Bloom Filter
230 *
231 * Represents a bloom filter, built on hashes of the indexed values. That is,
232 * we compute a uint32 hash of the value, and then store this hash into the
233 * bloom filter (and compute additional hashes on it).
234 *
235 * XXX We could implement "sparse" bloom filters, keeping only the bytes that
236 * are not entirely 0. But while indexes don't support TOAST, the varlena can
237 * still be compressed. So this seems unnecessary, because the compression
238 * should do the same job.
239 *
240 * XXX We can also watch the number of bits set in the bloom filter, and then
241 * stop using it (and not store the bitmap, to save space) when the false
242 * positive rate gets too high. But even if the false positive rate exceeds the
243 * desired value, it still can eliminate some page ranges.
244 */
245 typedef struct BloomFilter
246 {
PostgreSQL 许愿链接
您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森.