作者
digoal
日期
2021-06-15
标签
PostgreSQL , 非对称分区表智能JOIN
背景
非对称分区表智能JOIN
https://commitfest.postgresql.org/33/3099/
某些场景需要将分区数据append后再进行JOIN, 需要改写SQL才能做到基于每个分区JOIN后合并结果.
非对称分区表智能JOIN旨在自动优化这样的SQL改写.
```
Hello,
PostgreSQL optimizer right now considers join pairs on only
non-partition - non-partition or
partition-leaf - partition-leaf relations. On the other hands, it is
harmless and makes sense to
consider a join pair on non-partition - partition-leaf.
See the example below. ptable is partitioned by hash, and contains 10M
rows. ftable is not
partitioned and contains 50 rows. Most of ptable::fkey shall not have
matched rows in this
join.
create table ptable (fkey int, dist text) partition by hash (dist);
create table ptable_p0 partition of ptable for values with (modulus 3,
remainder 0);
create table ptable_p1 partition of ptable for values with (modulus 3,
remainder 1);
create table ptable_p2 partition of ptable for values with (modulus 3,
remainder 2);
insert into ptable (select x % 10000, md5(x::text) from
generate_series(1,10000000) x);
create table ftable (pkey int primary key, memo text);
insert into ftable (select x, 'ftable__#' || x::text from
generate_series(1,50) x);
vacuum analyze;
postgres=# explain analyze select count(*) from ptable p, ftable f
where p.fkey = f.pkey;
QUERY PLAN
Aggregate (cost=266393.38..266393.39 rows=1 width=8) (actual
time=2333.193..2333.194 rows=1 loops=1)
-> Hash Join (cost=2.12..260143.38 rows=2500000 width=0) (actual
time=0.056..2330.079 rows=50000 loops=1)
Hash Cond: (p.fkey = f.pkey)
-> Append (cost=0.00..233335.00 rows=10000000 width=4)
(actual time=0.012..1617.268 rows=10000000 loops=1)
-> Seq Scan on ptable_p0 p (cost=0.00..61101.96
rows=3332796 width=4) (actual time=0.011..351.137 rows=3332796
loops=1)
-> Seq Scan on ptable_p1 p_1 (cost=0.00..61106.25
rows=3333025 width=4) (actual time=0.005..272.925 rows=3333025
loops=1)
-> Seq Scan on ptable_p2 p_2 (cost=0.00..61126.79
rows=3334179 width=4) (actual time=0.006..416.141 rows=3334179
loops=1)
-> Hash (cost=1.50..1.50 rows=50 width=4) (actual
time=0.033..0.034 rows=50 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> Seq Scan on ftable f (cost=0.00..1.50 rows=50
width=4) (actual time=0.004..0.017 rows=50 loops=1)
Planning Time: 0.286 ms
Execution Time: 2333.264 ms
(12 rows)
We can manually rewrite this query as follows:
postgres=# explain analyze select count(*) from (
select * from ptable_p0 p, ftable f where p.fkey =
f.pkey union all
select * from ptable_p1 p, ftable f where p.fkey =
f.pkey union all
select * from ptable_p2 p, ftable f where p.fkey = f.pkey) subqry;
Because Append does not process tuples that shall have no matched
tuples in ftable,
this query has cheaper cost and short query execution time.
(2333ms --> 1396ms)
postgres=# explain analyze select count(*) from (
select * from ptable_p0 p, ftable f where p.fkey =
f.pkey union all
select * from ptable_p1 p, ftable f where p.fkey =
f.pkey union all
select * from ptable_p2 p, ftable f where p.fkey = f.pkey) subqry;
QUERY PLAN
Aggregate (cost=210478.25..210478.26 rows=1 width=8) (actual
time=1396.024..1396.024 rows=1 loops=1)
-> Append (cost=2.12..210353.14 rows=50042 width=0) (actual
time=0.058..1393.008 rows=50000 loops=1)
-> Subquery Scan on "SELECT 1" (cost=2.12..70023.66
rows=16726 width=0) (actual time=0.057..573.197 rows=16789 loops=1)
-> Hash Join (cost=2.12..69856.40 rows=16726
width=72) (actual time=0.056..571.718 rows=16789 loops=1)
Hash Cond: (p.fkey = f.pkey)
-> Seq Scan on ptable_p0 p (cost=0.00..61101.96
rows=3332796 width=4) (actual time=0.009..255.791 rows=3332796
loops=1)
-> Hash (cost=1.50..1.50 rows=50 width=4)
(actual time=0.034..0.035 rows=50 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> Seq Scan on ftable f (cost=0.00..1.50
rows=50 width=4) (actual time=0.004..0.019 rows=50 loops=1)
-> Subquery Scan on "SELECT 2" (cost=2.12..70027.43
rows=16617 width=0) (actual time=0.036..409.712 rows=16578 loops=1)
-> Hash Join (cost=2.12..69861.26 rows=16617
width=72) (actual time=0.036..408.626 rows=16578 loops=1)
Hash Cond: (p_1.fkey = f_1.pkey)
-> Seq Scan on ptable_p1 p_1
(cost=0.00..61106.25 rows=3333025 width=4) (actual time=0.005..181.422
rows=3333025 loops=1)
-> Hash (cost=1.50..1.50 rows=50 width=4)
(actual time=0.020..0.020 rows=50 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> Seq Scan on ftable f_1
(cost=0.00..1.50 rows=50 width=4) (actual time=0.004..0.011 rows=50
loops=1)
-> Subquery Scan on "SELECT 3" (cost=2.12..70051.84
rows=16699 width=0) (actual time=0.025..407.103 rows=16633 loops=1)
-> Hash Join (cost=2.12..69884.85 rows=16699
width=72) (actual time=0.025..406.048 rows=16633 loops=1)
Hash Cond: (p_2.fkey = f_2.pkey)
-> Seq Scan on ptable_p2 p_2
(cost=0.00..61126.79 rows=3334179 width=4) (actual time=0.004..181.015
rows=3334179 loops=1)
-> Hash (cost=1.50..1.50 rows=50 width=4)
(actual time=0.014..0.014 rows=50 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> Seq Scan on ftable f_2
(cost=0.00..1.50 rows=50 width=4) (actual time=0.003..0.008 rows=50
loops=1)
Planning Time: 0.614 ms
Execution Time: 1396.131 ms
(25 rows)
How about your opinions for this kind of asymmetric partition-wise
JOIN support by the optimizer?
I think we can harmlessly push-down inner-join and left-join if
partition-leaf is left side.
Probably, we need to implement two key functionalities.
1. Construction of RelOpInfo for join on non-partition table and
partition-leafs for each pairs.
Instead of JoinPaths, this logic adds AppendPath that takes
asymmetric partition-wise join
paths as sub-paths. Other optimization logic is equivalent as we
are currently doing.
2. Allow to share the hash-table built from table scan distributed to
individual partition leafs.
In the above example, SeqScan on ftable and relevant Hash path
will make identical hash-
table for the upcoming hash-join. If sibling paths have equivalent
results, it is reasonable to
reuse it.
Best regards,
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei
```
PostgreSQL 许愿链接
您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森.