作者
digoal
日期
2019-11-08
标签
PostgreSQL , md5hash , int128 , bit
背景
md5例子:c4ca4238a0b923820dcc509a6f75849b,在数据库中需要使用text或varchar存储,比较浪费空间。因为text变长类型需要1到4个字节的头,同时每个字符耗费1个字节,等于需要33字节存储。
实际上可以压缩为128bit存储。md5hash插件就是这么做的,新增了一个数据类型md5,底层使用128bit存储。
https://pgxn.org/dist/md5hash/1.0.1/
This extension provides a simple data type storing 128-bit values (e.g. MD5 hashes) in a bit more efficient way - in a fixed-length columns and inline.
The extension defines the 'md5hash' type itself, casts, operators and an operator class for btree indexes. It's fairly straightforward and simple to add more operators or classes.
使用举例
以PG12为例。
1、安装软件
wget http://api.pgxn.org/dist/md5hash/1.0.1/md5hash-1.0.1.zip
unzip md5hash-1.0.1.zip
cd md5hash-1.0.1/
export PATH=$PGHOME/bin:$PATH
USE_PGXS=1 make
USE_PGXS=1 make install
2、安装插件
postgres=# create extension md5hash;
CREATE EXTENSION
3、使用md5hash类型
```
postgres=# create table ts(id md5hash primary key);
CREATE TABLE
postgres=# create table ts1(id md5hash, id1 text);
CREATE TABLE
postgres=# insert into ts1 select md5(i::text),md5(i::text) from generate_series(1,100) i;
INSERT 0 100
postgres=# select * from ts1 limit 10;
id | id1
----------------------------------+----------------------------------
c4ca4238a0b923820dcc509a6f75849b | c4ca4238a0b923820dcc509a6f75849b
c81e728d9d4c2f636f067f89cc14862c | c81e728d9d4c2f636f067f89cc14862c
eccbc87e4b5ce2fe28308fd9f2a7baf3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
a87ff679a2f3e71d9181a67b7542122c | a87ff679a2f3e71d9181a67b7542122c
e4da3b7fbbce2345d7772b0674a318d5 | e4da3b7fbbce2345d7772b0674a318d5
1679091c5a880faf6fb5e6087eb1b2dc | 1679091c5a880faf6fb5e6087eb1b2dc
8f14e45fceea167a5a36dedd4bea2543 | 8f14e45fceea167a5a36dedd4bea2543
c9f0f895fb98ab9159f51fd0297e236d | c9f0f895fb98ab9159f51fd0297e236d
45c48cce2e2d7fbdea1afc51c7c6ad26 | 45c48cce2e2d7fbdea1afc51c7c6ad26
d3d9446802a44259755d38e6d163e820 | d3d9446802a44259755d38e6d163e820
(10 rows)
```
md5hash与text比较时,需要强制转换一下
```
postgres=# select * from ts1 where id::text=id1 limit 10;
id | id1
----------------------------------+----------------------------------
c4ca4238a0b923820dcc509a6f75849b | c4ca4238a0b923820dcc509a6f75849b
c81e728d9d4c2f636f067f89cc14862c | c81e728d9d4c2f636f067f89cc14862c
eccbc87e4b5ce2fe28308fd9f2a7baf3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
a87ff679a2f3e71d9181a67b7542122c | a87ff679a2f3e71d9181a67b7542122c
e4da3b7fbbce2345d7772b0674a318d5 | e4da3b7fbbce2345d7772b0674a318d5
1679091c5a880faf6fb5e6087eb1b2dc | 1679091c5a880faf6fb5e6087eb1b2dc
8f14e45fceea167a5a36dedd4bea2543 | 8f14e45fceea167a5a36dedd4bea2543
c9f0f895fb98ab9159f51fd0297e236d | c9f0f895fb98ab9159f51fd0297e236d
45c48cce2e2d7fbdea1afc51c7c6ad26 | 45c48cce2e2d7fbdea1afc51c7c6ad26
d3d9446802a44259755d38e6d163e820 | d3d9446802a44259755d38e6d163e820
(10 rows)
postgres=# select * from ts1 where id=id1::md5hash limit 10;
id | id1
----------------------------------+----------------------------------
c4ca4238a0b923820dcc509a6f75849b | c4ca4238a0b923820dcc509a6f75849b
c81e728d9d4c2f636f067f89cc14862c | c81e728d9d4c2f636f067f89cc14862c
eccbc87e4b5ce2fe28308fd9f2a7baf3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
a87ff679a2f3e71d9181a67b7542122c | a87ff679a2f3e71d9181a67b7542122c
e4da3b7fbbce2345d7772b0674a318d5 | e4da3b7fbbce2345d7772b0674a318d5
1679091c5a880faf6fb5e6087eb1b2dc | 1679091c5a880faf6fb5e6087eb1b2dc
8f14e45fceea167a5a36dedd4bea2543 | 8f14e45fceea167a5a36dedd4bea2543
c9f0f895fb98ab9159f51fd0297e236d | c9f0f895fb98ab9159f51fd0297e236d
45c48cce2e2d7fbdea1afc51c7c6ad26 | 45c48cce2e2d7fbdea1afc51c7c6ad26
d3d9446802a44259755d38e6d163e820 | d3d9446802a44259755d38e6d163e820
(10 rows)
```
查看存储空间对比,节省空间
postgres=# select pg_column_size(id), pg_column_size(id1) from ts1 limit 10;
pg_column_size | pg_column_size
----------------+----------------
16 | 33
16 | 33
16 | 33
16 | 33
16 | 33
16 | 33
16 | 33
16 | 33
16 | 33
16 | 33
(10 rows)
除了节省空间,实际上md5hash的查询效率也更高。
```
CREATE TABLE test_md5hash (id md5hash PRIMARY KEY);
CREATE TABLE test_text (id varchar(32) PRIMARY KEY);
INSERT INTO test_md5hash SELECT md5(i::text)
FROM generate_series(1,1000000) s(i);
INSERT INTO test_text SELECT md5(i::text)
FROM generate_series(1,1000000) s(i);
SELECT relname,
(pg_relation_size(oid)/1024) AS relation_size_kB,
(pg_total_relation_size(oid)/1024) AS total_size_kB
FROM pg_class WHERE relname LIKE 'test_%';
relname | relation_size_kb | total_size_kb
复制
-------------------+------------------+---------------
test_md5hash | 43248 | 82744
test_md5hash_pkey | 39464 | 39464
test_text | 66672 | 141816
test_text_pkey | 75096 | 75096
(4 rows)
```
加速检索
Now, let's see the difference when querying the column - to eliminate planning overhead, I've used prepared statements and execute 1.000.000 queries with 'WHERE id = $1' condition (using the PK index).
TEXT 129 seconds
md5hash 117 seconds
小结
md5hash将md5值存储为128bit,节省了存储空间,同时也节省了shared buffer等内存的使用,对于pk列,md5hash的查询效率比text更高。
参考
https://pgxn.org/dist/md5hash/1.0.1/
PostgreSQL 许愿链接
您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森.