等值关联条件可以通过效率较高的HashJoin来实现,但非等值关联条件只能通过NestLoopJoin实现,内外表的每行数据都需要进行关联,因此NestLoop在数据量较大的情况下会导致比较严重的性能问题。除了Join方式之外,NestLoop还需要将其中一个表进行Broadcast广播到所有计算节点进行。如果广播的表都比较大,也会占用大量网络资源并导致性能瓶颈。
1. 非等值关联条件场景
关联条件为非等值关联条件,或者关联条件恒为真(比如1=1)会产生笛卡尔积,这类语句都可能会导致性能较差。
wutongdb=> EXPLAIN(costs off) SELECT * FROM t1 INNER JOIN t2 ON t1.a<t2.a;
QUERY PLAN
---------------------------------------------------------------
Gather Motion 4:1 (slice1; segments: 4)
-> Nested Loop
Join Filter: (t1.a < t2.a)
-> Seq Scan on t1
-> Materialize
-> Broadcast Motion 4:4 (slice2; segments: 4)
-> Seq Scan on t2
Optimizer: GPORCA
NewExecutor: ON
(9 rows)
复制
2. OR关联条件场景
除此之外,关联条件包含OR也会导致只能通过NestLoop连接。应当避免此类用法或者根据表定义和实际语意尝试对此类SQL进行优化。
wutongdb=> EXPLAIN(COSTS OFF) SELECT * FROM t1 INNER JOIN t2 ON t1.a=t2.a OR t1.b=t2.b;
QUERY PLAN
---------------------------------------------------------------
Gather Motion 4:1 (slice1; segments: 4)
-> Nested Loop
Join Filter: ((t1.a = t2.a) OR (t1.b = t2.b))
-> Seq Scan on t1
-> Materialize
-> Broadcast Motion 4:4 (slice2; segments: 4)
-> Seq Scan on t2
Optimizer: GPORCA
NewExecutor: ON
(9 rows)
复制
可以考虑改写为UNION ALL:
wutongdb=> EXPLAIN(COSTS OFF) SELECT * FROM t1 INNER JOIN t2 ON t1.a=t2.a
UNION ALL
SELECT * FROM t1 INNER JOIN t2 ON (t1.a<>t2.a OR t1.a<>t2.a IS NULL) AND t1.b=t2.b;
QUERY PLAN
---------------------------------------------------------------------------------
Gather Motion 4:1 (slice1; segments: 4)
-> Append
-> Hash Join
Hash Cond: (t1.a = t2.a)
-> Redistribute Motion 4:4 (slice2; segments: 4)
Hash Key: t1.a
-> Seq Scan on t1
-> Hash
-> Redistribute Motion 4:4 (slice3; segments: 4)
Hash Key: t2.a
-> Seq Scan on t2
-> Hash Join
Hash Cond: (t1_1.b = t2_1.b)
Join Filter: ((t1_1.a <> t2_1.a) OR ((t1_1.a <> t2_1.a) IS NULL))
-> Redistribute Motion 4:4 (slice4; segments: 4)
Hash Key: t1_1.b
-> Seq Scan on t1 t1_1
-> Hash
-> Redistribute Motion 4:4 (slice5; segments: 4)
Hash Key: t2_1.b
-> Seq Scan on t2 t2_1
Optimizer: GPORCA
NewExecutor: ON
(23 rows)
复制
3. 相关子查询场景
相关子查询如果不能提升为JOIN的话,那么在计划中就会变成SubPlan,每次计算后的结果作为参数传入相关子查询并重新计算,性能很差。
wutongdb=> EXPLAIN(COSTS OFF) SELECT * FROM t1 WHERE a = (SELECT max(a) FROM t2 WHERE t1.b=t2.b GROUP BY b); QUERY PLAN ----------------------------------------------------------------------------- Gather Motion 4:1 (slice1; segments: 4) -> Seq Scan on t1 Filter: (a = (SubPlan 1)) SubPlan 1 -> HashAggregate Group Key: t2.b -> Result Filter: (t1.b = t2.b) -> Materialize -> Broadcast Motion 4:4 (slice2; segments: 4) -> Seq Scan on t2 Optimizer: GPORCA NewExecutor: ON (13 rows)
复制
可以将表达式相关子查询考虑改写为包含等值条件的EXISTS相关子查询:
wutongdb=> EXPLAIN(COSTS OFF) SELECT * FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t1.b=t2.b GROUP BY b HAVING t1.a = max(a));
QUERY PLAN
------------------------------------------------------------------------
Gather Motion 4:1 (slice1; segments: 4)
-> Hash Semi Join
Hash Cond: (t1.b = t2.b)
Join Filter: (t1.a = (max(t2.a)))
-> Redistribute Motion 4:4 (slice2; segments: 4)
Hash Key: t1.b
-> Seq Scan on t1
-> Hash
-> HashAggregate
Group Key: t2.b
-> Redistribute Motion 4:4 (slice3; segments: 4)
Hash Key: t2.b
-> Seq Scan on t2
Optimizer: GPORCA
NewExecutor: ON
(15 rows)
复制
总结
NestLoop是最简单的一种连接方式,很多无法通过HashJoin连接的复杂场景可以通过NestLoop轻松实现。但是NestLoop的性能在数据量大的情况下往往较差,所以应该尽可能地避免语句只能使用NestLoop方式执行的情况。
除此之外,改写时也需要确认改写前后逻辑上是否等价,是否改写后Join出现重复数据,NULL值是否正确处理等等。
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。