PostgreSQL 14 preview - BRIN (典型IoT 时序场景) 块级索引支持 bloom filter - 随机,大量distinct value, 等值查询

digoal 2021-01-03

216

作者

digoal

日期

2021-03-26

背景

等值查询一般有些什么索引支持?

btree: 精准定位
hash: 精准定位
bloom: 超集定位, 支持任意组合等值查询, 少量bit占位符表示是否包含某个value, 由于不同value的bits不同, 当插入大量value后占位bit可能被大量设置为1, 导致超集出现, 等值查询真不一定为真, 假一定为假.
gin: 倒排索引
brin: 块级索引, 保存一个heap blocks段存储的被索引字段value 的范围. min,max

目前BRIN变种出现, 存储的不再是min, max, 而是支持了bloom filter, 也就是说它存储的是占位bits. 每个连续heap blocks, 存储一个占位bits, 被索引字段的hash value经过再次bloom hash填充占位bit.

brin bloom 的应用场景:

离散值较多时: 等值查询, 任意字段组合等值查询.

PostgreSQL 开发者确实很精致, 赞!!!

最佳设置实践:

1、false_positive_rate: false(0) bits 占比不能太低, 越低, 失真越高(都是1的占位bit, 意思是它包含任意value, 所以就没有过滤意义)
2、n_distinct_per_range: 一连串blocks包含多少个被索引字段的唯一值, 这个值越大越容易失真, 因为占位组合会更多
3、通过调整pages_per_range 来控制 n_distinct_per_range

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=77b88cd1bb9041a735f24072150cacfa06c699a3

```
BRIN bloom indexes
author Tomas Vondra tomas.vondra@postgresql.org
Fri, 26 Mar 2021 12:35:29 +0000 (13:35 +0100)
committer Tomas Vondra tomas.vondra@postgresql.org
Fri, 26 Mar 2021 12:35:32 +0000 (13:35 +0100)
commit 77b88cd1bb9041a735f24072150cacfa06c699a3
tree be9ca84d673e3aa17e0e75ec579be414ae7eac18 tree
parent a681e3c107aa97eb554f118935c4d2278892c3dd commit | diff
BRIN bloom indexes

Adds a BRIN opclass using a Bloom filter to summarize the range. Indexes
using the new opclasses allow only equality queries (similar to hash
indexes), but that works fine for data like UUID, MAC addresses etc. for
which range queries are not very common. This also means the indexes
work for data that is not well correlated to physical location within
the table, or perhaps even entirely random (which is a common issue with
existing BRIN minmax opclasses).

It's possible to specify opclass parameters with the usual Bloom filter
parameters, i.e. the desired false-positive rate and the expected number
of distinct values per page range.

CREATE TABLE t (a int);
CREATE INDEX ON t
USING brin (a int4_bloom_ops(false_positive_rate = 0.05,
n_distinct_per_range = 100));

The opclasses do not operate on the indexed values directly, but compute
a 32-bit hash first, and the Bloom filter is built on the hash value.
Collisions should not be a huge issue though, as the number of distinct
values in a page ranges is usually fairly small.

Bump catversion, due to various catalog changes.

Author: Tomas Vondra tomas.vondra@postgresql.org
Reviewed-by: Alvaro Herrera alvherre@alvh.no-ip.org
Reviewed-by: Alexander Korotkov aekorotkov@gmail.com
Reviewed-by: Sokolov Yura y.sokolov@postgrespro.ru
Reviewed-by: Nico Williams nico@cryptonector.com
Reviewed-by: John Naylor john.naylor@enterprisedb.com
Discussion: https://postgr.es/m/c1138ead-7668-f0e1-0638-c3be3237e812@2ndquadrant.com
Discussion: https://postgr.es/m/5d78b774-7e9c-c94e-12cf-fef51cc89b1a%402ndquadrant.com
```

用法

1 CREATE TABLE brintest_bloom (byteacol bytea, 2 charcol "char", 3 namecol name, 4 int8col bigint, 5 int2col smallint, 6 int4col integer, 7 textcol text, 8 oidcol oid, 9 float4col real, 10 float8col double precision, 11 macaddrcol macaddr, 12 inetcol inet, 13 cidrcol cidr, 14 bpcharcol character, 15 datecol date, 16 timecol time without time zone, 17 timestampcol timestamp without time zone, 18 timestamptzcol timestamp with time zone, 19 intervalcol interval, 20 timetzcol time with time zone, 21 numericcol numeric, 22 uuidcol uuid, 23 lsncol pg_lsn 24 ) WITH (fillfactor=10); 25 INSERT INTO brintest_bloom SELECT 26 repeat(stringu1, 8)::bytea, 27 substr(stringu1, 1, 1)::"char", 28 stringu1::name, 142857 * tenthous, 29 thousand, 30 twothousand, 31 repeat(stringu1, 8), 32 unique1::oid, 33 (four + 1.0)/(hundred+1), 34 odd::float8 / (tenthous + 1), 35 format('%s:00:%s:00:%s:00', to_hex(odd), to_hex(even), to_hex(hundred))::macaddr, 36 inet '10.2.3.4/24' + tenthous, 37 cidr '10.2.3/24' + tenthous, 38 substr(stringu1, 1, 1)::bpchar, 39 date '1995-08-15' + tenthous, 40 time '01:20:30' + thousand * interval '18.5 second', 41 timestamp '1942-07-23 03:05:09' + tenthous * interval '36.38 hours', 42 timestamptz '1972-10-10 03:00' + thousand * interval '1 hour', 43 justify_days(justify_hours(tenthous * interval '12 minutes')), 44 timetz '01:30:20+02' + hundred * interval '15 seconds', 45 tenthous::numeric(36,30) * fivethous * even / (hundred + 1), 46 format('%s%s-%s-%s-%s-%s%s%s', to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'))::uuid, 47 format('%s/%s%s', odd, even, tenthous)::pg_lsn 48 FROM tenk1 ORDER BY unique2 LIMIT 100; 49 -- throw in some NULL's and different values 50 INSERT INTO brintest_bloom (inetcol, cidrcol) SELECT 51 inet 'fe80::6e40:8ff:fea9:8c46' + tenthous, 52 cidr 'fe80::6e40:8ff:fea9:8c46' + tenthous 53 FROM tenk1 ORDER BY thousand, tenthous LIMIT 25; 54 -- test bloom specific index options 55 -- ndistinct must be >= -1.0 56 CREATE INDEX brinidx_bloom ON brintest_bloom USING brin ( 57 byteacol bytea_bloom_ops(n_distinct_per_range = -1.1) 58 ); 59 ERROR: value -1.1 out of bounds for option "n_distinct_per_range" 60 DETAIL: Valid values are between "-1.000000" and "2147483647.000000". 61 -- false_positive_rate must be between 0.0001 and 0.25 62 CREATE INDEX brinidx_bloom ON brintest_bloom USING brin ( 63 byteacol bytea_bloom_ops(false_positive_rate = 0.00009) 64 ); 65 ERROR: value 0.00009 out of bounds for option "false_positive_rate" 66 DETAIL: Valid values are between "0.000100" and "0.250000". 67 CREATE INDEX brinidx_bloom ON brintest_bloom USING brin ( 68 byteacol bytea_bloom_ops(false_positive_rate = 0.26) 69 ); 70 ERROR: value 0.26 out of bounds for option "false_positive_rate" 71 DETAIL: Valid values are between "0.000100" and "0.250000". 72 CREATE INDEX brinidx_bloom ON brintest_bloom USING brin ( 73 byteacol bytea_bloom_ops, 74 charcol char_bloom_ops, 75 namecol name_bloom_ops, 76 int8col int8_bloom_ops, 77 int2col int2_bloom_ops, 78 int4col int4_bloom_ops, 79 textcol text_bloom_ops, 80 oidcol oid_bloom_ops, 81 float4col float4_bloom_ops, 82 float8col float8_bloom_ops, 83 macaddrcol macaddr_bloom_ops, 84 inetcol inet_bloom_ops, 85 cidrcol inet_bloom_ops, 86 bpcharcol bpchar_bloom_ops, 87 datecol date_bloom_ops, 88 timecol time_bloom_ops, 89 timestampcol timestamp_bloom_ops, 90 timestamptzcol timestamptz_bloom_ops, 91 intervalcol interval_bloom_ops, 92 timetzcol timetz_bloom_ops, 93 numericcol numeric_bloom_ops, 94 uuidcol uuid_bloom_ops, 95 lsncol pg_lsn_bloom_ops 96 ) with (pages_per_range = 1); 97 CREATE TABLE brinopers_bloom (colname name, typ text, 98 op text[], value text[], matches int[], 99 check (cardinality(op) = cardinality(value)), 100 check (cardinality(op) = cardinality(matches))); 101 INSERT INTO brinopers_bloom VALUES 102 ('byteacol', 'bytea', 103 '{=}', 104 '{BNAAAABNAAAABNAAAABNAAAABNAAAABNAAAABNAAAABNAAAA}', 105 '{1}'), 106 ('charcol', '"char"', 107 '{=}', 108 '{M}', 109 '{6}'), 110 ('namecol', 'name', 111 '{=}', 112 '{MAAAAA}', 113 '{2}'), 114 ('int2col', 'int2', 115 '{=}', 116 '{800}', 117 '{1}'), 118 ('int4col', 'int4', 119 '{=}', 120 '{800}', 121 '{1}'), 122 ('int8col', 'int8', 123 '{=}', 124 '{1257141600}', 125 '{1}'), 126 ('textcol', 'text', 127 '{=}', 128 '{BNAAAABNAAAABNAAAABNAAAABNAAAABNAAAABNAAAABNAAAA}', 129 '{1}'), 130 ('oidcol', 'oid', 131 '{=}', 132 '{8800}', 133 '{1}'), 134 ('float4col', 'float4', 135 '{=}', 136 '{1}', 137 '{4}'), 138 ('float8col', 'float8', 139 '{=}', 140 '{0}', 141 '{1}'), 142 ('macaddrcol', 'macaddr', 143 '{=}', 144 '{2c:00:2d:00:16:00}', 145 '{2}'), 146 ('inetcol', 'inet', 147 '{=}', 148 '{10.2.14.231/24}', 149 '{1}'), 150 ('inetcol', 'cidr', 151 '{=}', 152 '{fe80::6e40:8ff:fea9:8c46}', 153 '{1}'), 154 ('cidrcol', 'inet', 155 '{=}', 156 '{10.2.14/24}', 157 '{2}'), 158 ('cidrcol', 'inet', 159 '{=}', 160 '{fe80::6e40:8ff:fea9:8c46}', 161 '{1}'), 162 ('cidrcol', 'cidr', 163 '{=}', 164 '{10.2.14/24}', 165 '{2}'), 166 ('cidrcol', 'cidr', 167 '{=}', 168 '{fe80::6e40:8ff:fea9:8c46}', 169 '{1}'), 170 ('bpcharcol', 'bpchar', 171 '{=}', 172 '{W}', 173 '{6}'), 174 ('datecol', 'date', 175 '{=}', 176 '{2009-12-01}', 177 '{1}'), 178 ('timecol', 'time', 179 '{=}', 180 '{02:28:57}', 181 '{1}'), 182 ('timestampcol', 'timestamp', 183 '{=}', 184 '{1964-03-24 19:26:45}', 185 '{1}'), 186 ('timestamptzcol', 'timestamptz', 187 '{=}', 188 '{1972-10-19 09:00:00-07}', 189 '{1}'), 190 ('intervalcol', 'interval', 191 '{=}', 192 '{1 mons 13 days 12:24}', 193 '{1}'), 194 ('timetzcol', 'timetz', 195 '{=}', 196 '{01:35:50+02}', 197 '{2}'), 198 ('numericcol', 'numeric', 199 '{=}', 200 '{2268164.347826086956521739130434782609}', 201 '{1}'), 202 ('uuidcol', 'uuid', 203 '{=}', 204 '{52225222-5222-5222-5222-522252225222}', 205 '{1}'), 206 ('lsncol', 'pg_lsn', 207 '{=, IS, IS NOT}', 208 '{44/455222, NULL, NULL}', 209 '{1, 25, 100}'); 210 DO $x$ 211 DECLARE 212 r record; 213 r2 record; 214 cond text; 215 idx_ctids tid[]; 216 ss_ctids tid[]; 217 count int; 218 plan_ok bool; 219 plan_line text; 220 BEGIN 221 FOR r IN SELECT colname, oper, typ, value[ordinality], matches[ordinality] FROM brinopers_bloom, unnest(op) WITH ORDINALITY AS oper LOOP 222 223 -- prepare the condition 224 IF r.value IS NULL THEN 225 cond := format('%I %s %L', r.colname, r.oper, r.value); 226 ELSE 227 cond := format('%I %s %L::%s', r.colname, r.oper, r.value, r.typ); 228 END IF; 229 230 -- run the query using the brin index 231 SET enable_seqscan = 0; 232 SET enable_bitmapscan = 1; 233 234 plan_ok := false; 235 FOR plan_line IN EXECUTE format($y$EXPLAIN SELECT array_agg(ctid) FROM brintest_bloom WHERE %s $y$, cond) LOOP 236 IF plan_line LIKE '%Bitmap Heap Scan on brintest_bloom%' THEN 237 plan_ok := true; 238 END IF; 239 END LOOP; 240 IF NOT plan_ok THEN 241 RAISE WARNING 'did not get bitmap indexscan plan for %', r; 242 END IF; 243 244 EXECUTE format($y$SELECT array_agg(ctid) FROM brintest_bloom WHERE %s $y$, cond) 245 INTO idx_ctids; 246 247 -- run the query using a seqscan 248 SET enable_seqscan = 1; 249 SET enable_bitmapscan = 0; 250 251 plan_ok := false; 252 FOR plan_line IN EXECUTE format($y$EXPLAIN SELECT array_agg(ctid) FROM brintest_bloom WHERE %s $y$, cond) LOOP 253 IF plan_line LIKE '%Seq Scan on brintest_bloom%' THEN 254 plan_ok := true; 255 END IF; 256 END LOOP; 257 IF NOT plan_ok THEN 258 RAISE WARNING 'did not get seqscan plan for %', r; 259 END IF; 260 261 EXECUTE format($y$SELECT array_agg(ctid) FROM brintest_bloom WHERE %s $y$, cond) 262 INTO ss_ctids; 263 264 -- make sure both return the same results 265 count := array_length(idx_ctids, 1); 266 267 IF NOT (count = array_length(ss_ctids, 1) AND 268 idx_ctids @> ss_ctids AND 269 idx_ctids <@ ss_ctids) THEN 270 -- report the results of each scan to make the differences obvious 271 RAISE WARNING 'something not right in %: count %', r, count; 272 SET enable_seqscan = 1; 273 SET enable_bitmapscan = 0; 274 FOR r2 IN EXECUTE 'SELECT ' || r.colname || ' FROM brintest_bloom WHERE ' || cond LOOP 275 RAISE NOTICE 'seqscan: %', r2; 276 END LOOP; 277 278 SET enable_seqscan = 0; 279 SET enable_bitmapscan = 1; 280 FOR r2 IN EXECUTE 'SELECT ' || r.colname || ' FROM brintest_bloom WHERE ' || cond LOOP 281 RAISE NOTICE 'bitmapscan: %', r2; 282 END LOOP; 283 END IF; 284 285 -- make sure we found expected number of matches 286 IF count != r.matches THEN RAISE WARNING 'unexpected number of results % for %', count, r; END IF; 287 END LOOP; 288 END; 289 $x$; 290 RESET enable_seqscan; 291 RESET enable_bitmapscan; 292 INSERT INTO brintest_bloom SELECT 293 repeat(stringu1, 42)::bytea, 294 substr(stringu1, 1, 1)::"char", 295 stringu1::name, 142857 * tenthous, 296 thousand, 297 twothousand, 298 repeat(stringu1, 42), 299 unique1::oid, 300 (four + 1.0)/(hundred+1), 301 odd::float8 / (tenthous + 1), 302 format('%s:00:%s:00:%s:00', to_hex(odd), to_hex(even), to_hex(hundred))::macaddr, 303 inet '10.2.3.4' + tenthous, 304 cidr '10.2.3/24' + tenthous, 305 substr(stringu1, 1, 1)::bpchar, 306 date '1995-08-15' + tenthous, 307 time '01:20:30' + thousand * interval '18.5 second', 308 timestamp '1942-07-23 03:05:09' + tenthous * interval '36.38 hours', 309 timestamptz '1972-10-10 03:00' + thousand * interval '1 hour', 310 justify_days(justify_hours(tenthous * interval '12 minutes')), 311 timetz '01:30:20' + hundred * interval '15 seconds', 312 tenthous::numeric(36,30) * fivethous * even / (hundred + 1), 313 format('%s%s-%s-%s-%s-%s%s%s', to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'))::uuid, 314 format('%s/%s%s', odd, even, tenthous)::pg_lsn 315 FROM tenk1 ORDER BY unique2 LIMIT 5 OFFSET 5; 316 SELECT brin_desummarize_range('brinidx_bloom', 0); 317 brin_desummarize_range 318 ------------------------ 319 320 (1 row) 321 322 VACUUM brintest_bloom; -- force a summarization cycle in brinidx 323 UPDATE brintest_bloom SET int8col = int8col * int4col; 324 UPDATE brintest_bloom SET textcol = '' WHERE textcol IS NOT NULL; 325 -- Tests for brin_summarize_new_values 326 SELECT brin_summarize_new_values('brintest_bloom'); -- error, not an index 327 ERROR: "brintest_bloom" is not an index 328 SELECT brin_summarize_new_values('tenk1_unique1'); -- error, not a BRIN index 329 ERROR: "tenk1_unique1" is not a BRIN index 330 SELECT brin_summarize_new_values('brinidx_bloom'); -- ok, no change expected 331 brin_summarize_new_values 332 --------------------------- 333 0 334 (1 row) 335 336 -- Tests for brin_desummarize_range 337 SELECT brin_desummarize_range('brinidx_bloom', -1); -- error, invalid range 338 ERROR: block number out of range: -1 339 SELECT brin_desummarize_range('brinidx_bloom', 0); 340 brin_desummarize_range 341 ------------------------ 342 343 (1 row) 344 345 SELECT brin_desummarize_range('brinidx_bloom', 0); 346 brin_desummarize_range 347 ------------------------ 348 349 (1 row) 350 351 SELECT brin_desummarize_range('brinidx_bloom', 100000000); 352 brin_desummarize_range 353 ------------------------ 354 355 (1 row) 356 357 -- Test brin_summarize_range 358 CREATE TABLE brin_summarize_bloom ( 359 value int 360 ) WITH (fillfactor=10, autovacuum_enabled=false); 361 CREATE INDEX brin_summarize_bloom_idx ON brin_summarize_bloom USING brin (value) WITH (pages_per_range=2); 362 -- Fill a few pages 363 DO $$ 364 DECLARE curtid tid; 365 BEGIN 366 LOOP 367 INSERT INTO brin_summarize_bloom VALUES (1) RETURNING ctid INTO curtid; 368 EXIT WHEN curtid > tid '(2, 0)'; 369 END LOOP; 370 END; 371 $$; 372 -- summarize one range 373 SELECT brin_summarize_range('brin_summarize_bloom_idx', 0); 374 brin_summarize_range 375 ---------------------- 376 0 377 (1 row) 378 379 -- nothing: already summarized 380 SELECT brin_summarize_range('brin_summarize_bloom_idx', 1); 381 brin_summarize_range 382 ---------------------- 383 0 384 (1 row) 385 386 -- summarize one range 387 SELECT brin_summarize_range('brin_summarize_bloom_idx', 2); 388 brin_summarize_range 389 ---------------------- 390 1 391 (1 row) 392 393 -- nothing: page doesn't exist in table 394 SELECT brin_summarize_range('brin_summarize_bloom_idx', 4294967295); 395 brin_summarize_range 396 ---------------------- 397 0 398 (1 row) 399 400 -- invalid block number values 401 SELECT brin_summarize_range('brin_summarize_bloom_idx', -1); 402 ERROR: block number out of range: -1 403 SELECT brin_summarize_range('brin_summarize_bloom_idx', 4294967296); 404 ERROR: block number out of range: 4294967296 405 -- test brin cost estimates behave sanely based on correlation of values 406 CREATE TABLE brin_test_bloom (a INT, b INT); 407 INSERT INTO brin_test_bloom SELECT x/100,x%100 FROM generate_series(1,10000) x(x); 408 CREATE INDEX brin_test_bloom_a_idx ON brin_test_bloom USING brin (a) WITH (pages_per_range = 2); 409 CREATE INDEX brin_test_bloom_b_idx ON brin_test_bloom USING brin (b) WITH (pages_per_range = 2); 410 VACUUM ANALYZE brin_test_bloom; 411 -- Ensure brin index is used when columns are perfectly correlated 412 EXPLAIN (COSTS OFF) SELECT * FROM brin_test_bloom WHERE a = 1; 413 QUERY PLAN 414 -------------------------------------------------- 415 Bitmap Heap Scan on brin_test_bloom 416 Recheck Cond: (a = 1) 417 -> Bitmap Index Scan on brin_test_bloom_a_idx 418 Index Cond: (a = 1) 419 (4 rows) 420 421 -- Ensure brin index is not used when values are not correlated 422 EXPLAIN (COSTS OFF) SELECT * FROM brin_test_bloom WHERE b = 1; 423 QUERY PLAN 424 ----------------------------- 425 Seq Scan on brin_test_bloom 426 Filter: (b = 1) 427 (2 rows)

原理:

1 /* 2 * brin_bloom.c 3 * Implementation of Bloom opclass for BRIN 4 * 5 * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group 6 * Portions Copyright (c) 1994, Regents of the University of California 7 * 8 * 9 * A BRIN opclass summarizing page range into a bloom filter. 10 * 11 * Bloom filters allow efficient testing whether a given page range contains 12 * a particular value. Therefore, if we summarize each page range into a small 13 * bloom filter, we can easily (and cheaply) test whether it contains values 14 * we get later. 15 * 16 * The index only supports equality operators, similarly to hash indexes. 17 * Bloom indexes are however much smaller, and support only bitmap scans. 18 * 19 * Note: Don't confuse this with bloom indexes, implemented in a contrib 20 * module. That extension implements an entirely new AM, building a bloom 21 * filter on multiple columns in a single row. This opclass works with an 22 * existing AM (BRIN) and builds bloom filter on a column. 23 * 24 * 25 * values vs. hashes 26 * ----------------- 27 * 28 * The original column values are not used directly, but are first hashed 29 * using the regular type-specific hash function, producing a uint32 hash. 30 * And this hash value is then added to the summary - i.e. it's hashed 31 * again and added to the bloom filter. 32 * 33 * This allows the code to treat all data types (byval/byref/...) the same 34 * way, with only minimal space requirements, because we're working with 35 * hashes and not the original values. Everything is uint32. 36 * 37 * Of course, this assumes the built-in hash function is reasonably good, 38 * without too many collisions etc. But that does seem to be the case, at 39 * least based on past experience. After all, the same hash functions are 40 * used for hash indexes, hash partitioning and so on. 41 * 42 * 43 * hashing scheme 44 * -------------- 45 * 46 * Bloom filters require a number of independent hash functions. There are 47 * different schemes how to construct them - for example we might use 48 * hash_uint32_extended with random seeds, but that seems fairly expensive. 49 * We use a scheme requiring only two functions described in this paper: 50 * 51 * Less Hashing, Same Performance:Building a Better Bloom Filter 52 * Adam Kirsch, Michael Mitzenmacher†, Harvard School of Engineering and 53 * Applied Sciences, Cambridge, Massachusetts [DOI 10.1002/rsa.20208] 54 * 55 * The two hash functions h1 and h2 are calculated using hard-coded seeds, 56 * and then combined using (h1 + i * h2) to generate the hash functions. 57 * 58 * 59 * sizing the bloom filter 60 * ----------------------- 61 * 62 * Size of a bloom filter depends on the number of distinct values we will 63 * store in it, and the desired false positive rate. The higher the number 64 * of distinct values and/or the lower the false positive rate, the larger 65 * the bloom filter. On the other hand, we want to keep the index as small 66 * as possible - that's one of the basic advantages of BRIN indexes. 67 * 68 * Although the number of distinct elements (in a page range) depends on 69 * the data, we can consider it fixed. This simplifies the trade-off to 70 * just false positive rate vs. size. 71 * 72 * At the page range level, false positive rate is a probability the bloom 73 * filter matches a random value. For the whole index (with sufficiently 74 * many page ranges) it represents the fraction of the index ranges (and 75 * thus fraction of the table to be scanned) matching the random value. 76 * 77 * Furthermore, the size of the bloom filter is subject to implementation 78 * limits - it has to fit onto a single index page (8kB by default). As 79 * the bitmap is inherently random (when "full" about half the bits is set 80 * to 1, randomly), compression can't help very much. 81 * 82 * To reduce the size of a filter (to fit to a page), we have to either 83 * accept higher false positive rate (undesirable), or reduce the number 84 * of distinct items to be stored in the filter. We can't alter the input 85 * data, of course, but we may make the BRIN page ranges smaller - instead 86 * of the default 128 pages (1MB) we may build index with 16-page ranges, 87 * or something like that. This should reduce the number of distinct values 88 * in the page range, making the filter smaller (with fixed false positive 89 * rate). Even for random data sets this should help, as the number of rows 90 * per heap page is limited (to ~290 with very narrow tables, likely ~20 91 * in practice). 92 * 93 * Of course, good sizing decisions depend on having the necessary data, 94 * i.e. number of distinct values in a page range (of a given size) and 95 * table size (to estimate cost change due to change in false positive 96 * rate due to having larger index vs. scanning larger indexes). We may 97 * not have that data - for example when building an index on empty table 98 * it's not really possible. And for some data we only have estimates for 99 * the whole table and we can only estimate per-range values (ndistinct). 100 * 101 * Another challenge is that while the bloom filter is per-column, it's 102 * the whole index tuple that has to fit into a page. And for multi-column 103 * indexes that may include pieces we have no control over (not necessarily 104 * bloom filters, the other columns may use other BRIN opclasses). So it's 105 * not entirely clear how to distribute the space between those columns. 106 * 107 * The current logic, implemented in brin_bloom_get_ndistinct, attempts to 108 * make some basic sizing decisions, based on the size of BRIN ranges, and 109 * the maximum number of rows per range. 110 * 111 * 112 * IDENTIFICATION 113 * src/backend/access/brin/brin_bloom.c 114 */ 115 #include "postgres.h" 116 117 #include "access/genam.h" 118 #include "access/brin.h" 119 #include "access/brin_internal.h" 120 #include "access/brin_page.h" 121 #include "access/brin_tuple.h" 122 #include "access/hash.h" 123 #include "access/htup_details.h" 124 #include "access/reloptions.h" 125 #include "access/stratnum.h" 126 #include "catalog/pg_type.h" 127 #include "catalog/pg_amop.h" 128 #include "utils/builtins.h" 129 #include "utils/datum.h" 130 #include "utils/lsyscache.h" 131 #include "utils/rel.h" 132 #include "utils/syscache.h" 133 134 #include <math.h> 135 136 #define BloomEqualStrategyNumber 1 137 138 /* 139 * Additional SQL level support functions. We only have one, which is 140 * used to calculate hash of the input value. 141 * 142 * Procedure numbers must not use values reserved for BRIN itself; see 143 * brin_internal.h. 144 */ 145 #define BLOOM_MAX_PROCNUMS 1 /* maximum support procs we need */ 146 #define PROCNUM_HASH 11 /* required */ 147 148 /* 149 * Subtract this from procnum to obtain index in BloomOpaque arrays 150 * (Must be equal to minimum of private procnums). 151 */ 152 #define PROCNUM_BASE 11 153 154 /* 155 * Storage type for BRIN's reloptions. 156 */ 157 typedef struct BloomOptions 158 { 159 int32 vl_len_; /* varlena header (do not touch directly!) */ 160 double nDistinctPerRange; /* number of distinct values per range */ 161 double falsePositiveRate; /* false positive for bloom filter */ 162 } BloomOptions; 163 164 /* 165 * The current min value (16) is somewhat arbitrary, but it's based 166 * on the fact that the filter header is ~20B alone, which is about 167 * the same as the filter bitmap for 16 distinct items with 1% false 168 * positive rate. So by allowing lower values we'd not gain much. In 169 * any case, the min should not be larger than MaxHeapTuplesPerPage 170 * (~290), which is the theoretical maximum for single-page ranges. 171 */ 172 #define BLOOM_MIN_NDISTINCT_PER_RANGE 16 173 174 /* 175 * Used to determine number of distinct items, based on the number of rows 176 * in a page range. The 10% is somewhat similar to what estimate_num_groups 177 * does, so we use the same factor here. 178 */ 179 #define BLOOM_DEFAULT_NDISTINCT_PER_RANGE -0.1 /* 10% of values */ 180 181 /* 182 * Allowed range and default value for the false positive range. The exact 183 * values are somewhat arbitrary, but were chosen considering the various 184 * parameters (size of filter vs. page size, etc.). 185 * 186 * The lower the false-positive rate, the more accurate the filter is, but 187 * it also gets larger - at some point this eliminates the main advantage 188 * of BRIN indexes, which is the tiny size. At 0.01% the index is about 189 * 10% of the table (assuming 290 distinct values per 8kB page). 190 * 191 * On the other hand, as the false-positive rate increases, larger part of 192 * the table has to be scanned due to mismatches - at 25% we're probably 193 * close to sequential scan being cheaper. 194 */ 195 #define BLOOM_MIN_FALSE_POSITIVE_RATE 0.0001 /* 0.01% fp rate */ 196 #define BLOOM_MAX_FALSE_POSITIVE_RATE 0.25 /* 25% fp rate */ 197 #define BLOOM_DEFAULT_FALSE_POSITIVE_RATE 0.01 /* 1% fp rate */ 198 199 #define BloomGetNDistinctPerRange(opts) \ 200 ((opts) && (((BloomOptions *) (opts))->nDistinctPerRange != 0) ? \ 201 (((BloomOptions *) (opts))->nDistinctPerRange) : \ 202 BLOOM_DEFAULT_NDISTINCT_PER_RANGE) 203 204 #define BloomGetFalsePositiveRate(opts) \ 205 ((opts) && (((BloomOptions *) (opts))->falsePositiveRate != 0.0) ? \ 206 (((BloomOptions *) (opts))->falsePositiveRate) : \ 207 BLOOM_DEFAULT_FALSE_POSITIVE_RATE) 208 209 /* 210 * And estimate of the largest bloom we can fit onto a page. This is not 211 * a perfect guarantee, for a couple of reasons. For example, the row may 212 * be larger because the index has multiple columns. 213 */ 214 #define BloomMaxFilterSize \ 215 MAXALIGN_DOWN(BLCKSZ - \ 216 (MAXALIGN(SizeOfPageHeaderData + \ 217 sizeof(ItemIdData)) + \ 218 MAXALIGN(sizeof(BrinSpecialSpace)) + \ 219 SizeOfBrinTuple)) 220 221 /* 222 * Seeds used to calculate two hash functions h1 and h2, which are then used 223 * to generate k hashes using the (h1 + i * h2) scheme. 224 */ 225 #define BLOOM_SEED_1 0x71d924af 226 #define BLOOM_SEED_2 0xba48b314 227 228 /* 229 * Bloom Filter 230 * 231 * Represents a bloom filter, built on hashes of the indexed values. That is, 232 * we compute a uint32 hash of the value, and then store this hash into the 233 * bloom filter (and compute additional hashes on it). 234 * 235 * XXX We could implement "sparse" bloom filters, keeping only the bytes that 236 * are not entirely 0. But while indexes don't support TOAST, the varlena can 237 * still be compressed. So this seems unnecessary, because the compression 238 * should do the same job. 239 * 240 * XXX We can also watch the number of bits set in the bloom filter, and then 241 * stop using it (and not store the bitmap, to save space) when the false 242 * positive rate gets too high. But even if the false positive rate exceeds the 243 * desired value, it still can eliminate some page ranges. 244 */ 245 typedef struct BloomFilter 246 {

文章转载自digoal，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。

PostgreSQL 14 preview - BRIN (典型IoT 时序场景) 块级索引支持 bloom filter - 随机,大量distinct value, 等值查询

作者

日期

标签

背景

PostgreSQL 许愿链接

9.9元购买3个月阿里云RDS PostgreSQL实例

PostgreSQL 解决方案集合

德哥 / digoal's github - 公益是一辈子的事.

评论

相关阅读