再聊下PostgreSQL下边IN, EXISTS及Values的骚操作

数据库杂记 2024-08-12

176

前言

以前也写过一篇短文：PostgreSQL中IN, EXISTS在SQL查询中到底有无区别, 总觉得意思未表达完整。最近又看了看薛晓刚老师的一篇文章：in null和not in null都是基础。不妨再用PostgreSQL作示例，对比总结一下：

实例

准备表及数据

mydb2=# create table a(id, col2) as values (1, 1), (2, 2), (3, 3), (null, 1);
SELECT 4

mydb2=# create table b(id, col2) as values (1, null), (2, 2), (null, 3);
SELECT 3

验证几个查询

1、IN与Exists

mydb2=# select * from b where id in (select id from a);
 id | col2
----+------
  1 |
  2 |    2
(2 rows)

可以看到id为NULL的记录都不在结果里头。这也说明NULL值不会参与其中的判断。这与exists的行为也基本上是一致的：
mydb2=# select * from b where exists (select id from a where id = b.id);
 id | col2
----+------
  1 |
  2 |    2
(2 rows)

2、Not IN与Not Exists

mydb2=# select * from b where id not in (select id from a);
 id | col2
----+------
(0 rows)

mydb2=# select * from b where not exists (select id from a where id = b.id);
 id | col2
----+------
    |    3
(1 row)

这个示例说明了啥？

not exists 操作，还好解释，将exists的那两行排除，剩下的就是not exists的结果。（null, 3)这一条记录就是不符合条件的。

而not in呢，是咋回事，一条记录也不符合？

好，直接再展开一下：

mydb2=# select * from b where id not in (1,2,3,null);
 id | col2
----+------
(0 rows)

mydb2=# select * from b where id not in (1,3,null);
 id | col2
----+------
(0 rows)

mydb2=# select * from b where id not in (1,null);
 id | col2
----+------
(0 rows)

只要not in的值列表或子查询里头有NULL值，结果一律为空。这就是一个巨大的差别。在生产环境当中，如果遇到这种情况，是不是要极力去避免？

再谈多列用于IN

Oracle：
select * from a where (id, col2) in (select 1 as id, 1 as col2 from dual)

PostgreSQL：// 至少有两种描述形式：
mydb2=# select count(*) from a where (id, col2) in (select 1, 1);
 count
-------
     1
(1 row)
mydb2=# select count(*) from a where (id, col2) in (values (1, 1),(2,2));
 count
-------
     2
(1 row)

PG中甚至可以对多列进行比较：

mydb2=# select count(*) from a where (id, col2) >= any (values (1, 1),(2,2));
 count
-------
     3
(1 row)

mydb2=# select count(*) from a where (id, col2) >= (values (2,2));
 count
-------
     2
(1 row)

mydb2=# select count(*) from a where (id, col2) >(values (2,2));
 count
-------
     1
(1 row)

关于values

PG中，有时候你完全可以不用构造表表验证一些SQL查询，比如：

mydb2=# select column1 from (values(1), (2), (3)) as A where A.column1 > 2;
 column1
---------
       3
(1 row)
-- 上边我们没有建任何表，就可以弄出一个有效的查询

mydb2=# select column1, column2 from (values(1, now()), (2, '2005-09-01'), (3, '2024-12-20')) as A where A.column1 > 2;
 column1 |        column2
---------+------------------------
       3 | 2024-12-20 00:00:00+08
(1 row)

这也是我喜欢PG的一个很好的理由。values的构造功能强大无比。

关于values及别名扩充或定义：

mydb2=# select aaaa as col1 from (values(1), (2), (3)) as A (aaaa);
 col1
------
    1
    2
    3
(3 rows)

mydb2=# select count(*) from (values(1), (2), (3)) as A (nolock);
 count
-------
     3
(1 row)

可以对默认的列: column1, column2进行定制。

借这个机会，我们看看下边的一个特殊的SQL：

1. mydb2=# select count(*) from a tableA (nolock);
 count
-------
     4
(1 row)

2. mydb2=# select count(*) from a tableA (nolock2);
 count
-------
     4
(1 row)

3. mydb2=# select count(*) from a tableA (nolock2,aaa);
 count
-------
     4
(1 row)

4. mydb2=# select * from a aa (nolock);
 nolock | col2
--------+------
      1 |    1
      2 |    2
      3 |    3
        |    1
(4 rows)

5. mydb2=# select * from a tableA (nolock, c1);
 nolock | c1
--------+----
      1 |  1
      2 |  2
      3 |  3
        |  1
(4 rows)

6. mydb2=# select * from a tableA (nolock, c1, c2, c3);
ERROR:  table "tablea" has 2 columns available but 4 columns specified

至此，有人甚至认为第3个SQL是BUG. 
第5，第6个好理解。就是重新定义了与表对应的列名。
第3个，因为用的是count(*)，它不是表的列。

如果使用下边的SQL，一样的会出错：
mydb2=# select count(*) from a tableA (c1, nolock, c2, countB);
ERROR:  table "tablea" has 2 columns available but 4 columns specified

也就是说，后边的别名，及列定义指定，不能超过表现存的总列数。

真是挺有意思的。

再谈性能

通常情况下，exists和in在性能上没有太大区别，没有NULL值干扰的情况下，两者确实大多数情况下可以转换着使用。区别特别大的地方是NOT IN, NOT EXISTS

select count(*) from ins_subscriptions is2 ;
count|
-----+
19679|

select count(*) from ins_instances ii ;
count |
------+
162525|

两个表都不算太大。只是下边的子查询里用到的是那张大表，也相当于是父表。i2.instance_id = i1.subscribed_instance_id，关联列上都建有索引。

explain analyze select count(*) from ins_subscriptions i1 where not exists (select 1  from ins_instances i2 where i2.instance_id = i1.subscribed_instance_id);
QUERY PLAN                                                                                                                                                                                     |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Aggregate  (cost=26282.35..26282.36 rows=1 width=8) (actual time=72.286..75.068 rows=1 loops=1)                                                                                                |
  ->  Gather  (cost=23694.09..26282.35 rows=1 width=0) (actual time=72.281..75.062 rows=0 loops=1)                                                                                             |
        Workers Planned: 1                                                                                                                                                                     |
        Workers Launched: 1                                                                                                                                                                    |
        ->  Parallel Hash Anti Join  (cost=22694.09..25282.25 rows=1 width=0) (actual time=69.841..69.843 rows=0 loops=2)                                                                      |
              Hash Cond: ((i1.subscribed_instance_id)::text = (i2.instance_id)::text)                                                                                                          |
              ->  Parallel Index Only Scan using idx_subcribed_instanceid_inssub on ins_subscriptions i1  (cost=0.41..2545.16 rows=11576 width=37) (actual time=0.094..5.471 rows=9840 loops=2)|
                    Heap Fetches: 19679                                                                                                                                                        |
              ->  Parallel Hash  (cost=21847.19..21847.19 rows=67719 width=37) (actual time=58.292..58.293 rows=81262 loops=2)                                                                 |
                    Buckets: 262144  Batches: 1  Memory Usage: 13536kB                                                                                                                         |
                    ->  Parallel Seq Scan on ins_instances i2  (cost=0.00..21847.19 rows=67719 width=37) (actual time=0.007..24.099 rows=81262 loops=2)                                        |
Planning Time: 0.218 ms                                                                                                                                                                        |
Execution Time: 75.107 ms                                                                                                                                                                      |

这个大概花了75ms.

我们再来看看not in的效果：

explain analyze select count(*) from ins_subscriptions i1 where subscribed_instance_id not in  (select instance_id  from ins_instances i2);

QUERY PLAN                                                                                                                                                                                             |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Finalize Aggregate  (cost=146346978.63..146346978.64 rows=1 width=8) (actual time=90916.094..90917.461 rows=1 loops=1)                                                                                 |
  ->  Gather  (cost=146346978.51..146346978.62 rows=1 width=8) (actual time=90722.703..90917.451 rows=2 loops=1)                                                                                       |
        Workers Planned: 1                                                                                                                                                                             |
        Workers Launched: 1                                                                                                                                                                            |
        ->  Partial Aggregate  (cost=146345978.51..146345978.52 rows=1 width=8) (actual time=90817.426..90817.428 rows=1 loops=2)                                                                      |
              ->  Parallel Index Only Scan using idx_subcribed_instanceid_inssub on ins_subscriptions i1  (cost=0.41..146345964.04 rows=5788 width=0) (actual time=90817.423..90817.424 rows=0 loops=2)|
                    Filter: (NOT (SubPlan 1))                                                                                                                                                          |
                    Rows Removed by Filter: 9840                                                                                                                                                       |
                    Heap Fetches: 19679                                                                                                                                                                |
                    SubPlan 1                                                                                                                                                                          |
                      ->  Materialize  (cost=0.00..24877.88 rows=162525 width=37) (actual time=0.001..6.101 rows=37705 loops=19679)                                                                    |
                            ->  Seq Scan on ins_instances i2  (cost=0.00..22795.25 rows=162525 width=37) (actual time=0.001..7.728 rows=44721 loops=4700)                                              |
Planning Time: 0.105 ms                                                                                                                                                                                |
Execution Time: 90919.448 ms                                                                                                                                                                           |

这里能明显看到，使用NOT IN, 差距明显，居然要花90秒。

所以，在实际使用当中，除非非常明确的较小的查询子集，我们可以使用IN语句来描述表达，其它绝大多数情况，都使用exists/not exists。没有必要在这上面花太多精力。最好的印证方式，就是将对应的查询计划试验出来，看看实际效果。

总结

写了这半天，也不知道到底写了些啥。PostgreSQL中可以玩的东西实在太多。在生产环境中，别乱用就好。满足客户基本需求，性能上别太拉垮，别太浪费资源、浪费客户的银子，就算是基本达标了。

我是【Sean】, 欢迎大家长按关注并加星公众号：数据库杂记。有好资源相送，同时为你提供及时更新。已关注的朋友，发送0、1到7，都有好资源相送。

往期导读:
1. PostgreSQL中配置单双向SSL连接详解
2. 提升PSQL使用技巧：PostgreSQL中PSQL使用技巧汇集(1)
3. 提升PSQL使用技巧：PostgreSQL中PSQL使用技巧汇集(2)
4. PostgreSQL SQL的基础使用及技巧
5. PostgreSQL开发技术基础:过程与函数
6. PostgreSQL中vacuum 物理文件truncate发生的条件
7. PostgreSQL中表的年龄与Vacuum的实验探索：Vacuum有大用
8. PostgreSQL利用分区表来弥补AutoVacuum的不足
9. 也聊聊PostgreSQL中的空间膨胀与AutoVacuum
10. 正确理解SAP BTP中hyperscaler PG中的IOPS (AWS篇)

文章转载自数据库杂记，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。