PostgreSQL创建扩展统计信息，让优化器做更好的选择

励志成为PostgreSQL大神 2021-03-25

1184

前言

今天我们来试试PostgreSQL中的扩展统计信息吧。

先科普一下统计信息

在 PostgreSQL中进行 SQL优化时，我们使用 Explain，它为 SQL语句生成的执行计划，它告诉我们如何扫描表(使用顺序扫描，使用索引扫描等)，以及在多个表关联时，使用那种连接算法(Merge Join, Hash Join等)。但 PostgreSQL优化器是如何知道应该使用那种扫描方式和连接算法呢？这主要取决于统计数据。统计信息可以告诉执行计划，在执行时要估计要返回多少行，这对于扫描方式或者连接算法有重要影响。可以通过 analyze、 vacuum和一些 ddl命令(例如 create index)收集或更新统计信息。

在pg_class和pg_statistics表中存储统计信息，pg_class基本上存储每个表和索引的总条目数，以及它们占用的磁盘page数量。而pg_statistic会存储每一列的统计数据，比如，有多少个不同的值，有多少列是空值(百分比%)，什么是最常见的值，直方图范围等等。

下一步，我们将做一个表的演示，创建t1表并插入100万个数据

create table t1 (a INT,b INT,c TEXT);

INSERT INTO t1 SELECT i/10000, i/100000 , repeat(' ', 100)  FROM generate_series (1,10000000) s(i); 

复制

对于列 a，是0,1,2,999，这个数值重复记录了1000次。对于 b列，0,1,2,999数据重复记录重复记录10000次。

对这个表进行统计信息收集，然后查看pg_class查看表的统计信息，reltuples代表了占用了多少元组，而relpages表示占用了多少page。

postgres=# analyze t1;
ANALYZE

postgres=# select reltuples,relpages from pg_class where relname='t1';
  reltuples   | relpages 
--------------+----------
 9.999943e+06 |   172414

复制

查看pg_stats视图可以查看列的统计信息。


postgres=# select * from pg_stats where tablename = 't1' and attname = 'a';
-[ RECORD 1 ]----------+---------------------------------------------------------------------------------------
schemaname             | public
tablename              | t1
attname                | a
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | 1000
most_common_vals       | {695,743,253,9,174,504,342,......,659,662,682,885,907}
most_common_freqs      | {0.0017333333,0.0017333333,0.0017,0.0016666667,0.0016666667,.......,0.0014333334,0.0014333334}
histogram_bounds       | {0,10,19,28,40,49,60,70,82,92,103,111,120,130,........,981,990,999}
correlation            | 1
most_common_elems      | 
most_common_elem_freqs | 
elem_count_histogram   | 

postgres=# select * from pg_stats where tablename = 't1' and attname = 'b';
-[ RECORD 1 ]----------+--------------------------------------------------------------------------------------
schemaname             | public
tablename              | t1
attname                | b
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | 100
most_common_vals       | {40,27,90,18,69,46,39,89,19,65,72,51,........,8,12,75,56,52,44,76}
most_common_freqs      | {0.011533333,0.011366666,0.0113,0.011266666,0.0112333335,..........0.008633333,0.0085,0.008166667}
histogram_bounds       | 
correlation            | 1
most_common_elems      | 
most_common_elem_freqs | 
elem_count_histogram   | 

复制

我们可以看到， A列的n_distinct和 B列的n_distinct分别为1000和100，这就是估算出的近似值。

当单列统计信息不足时

这些单列统计信息有助于优化器估算条件的选择性，我们首先来执行下列查询。

postgres=# EXPLAIN ANALYZE SELECT * FROM t1 where a = 1;
                                                QUERY PLAN                                                 
-------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..226463.03 rows=9657 width=109) (actual time=1.793..594.347 rows=10000 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on t1  (cost=0.00..224497.33 rows=4024 width=109) (actual time=389.960..586.246 rows=3333 loops=3)
         Filter: (a = 1)
         Rows Removed by Filter: 3330000
 Planning Time: 0.137 ms
 Execution Time: 595.021 ms
(8 rows)            

复制

这里用了并行执行计划，不方便我们观察，先把并行关闭，再执行一次。

postgres=# set max_parallel_workers_per_gather=0;
SET
postgres=# EXPLAIN ANALYZE SELECT * FROM t1 where a = 1;
                                                 QUERY PLAN                                                  
-------------------------------------------------------------------------------------------------------------
 Seq Scan on t1  (cost=0.00..297414.00 rows=9657 width=109) (actual time=1.279..1149.829 rows=10000 loops=1)
   Filter: (a = 1)
   Rows Removed by Filter: 9990000
 Planning Time: 1.658 ms
 Execution Time: 1150.464 ms
(5 rows)

复制

我们可以看到优化器估算a=1的过滤后行数为9657，而实际查询出来的结果是10000，非常接近。

但是当我们在把另外一个条件b=0带入。

postgres=# EXPLAIN ANALYZE SELECT * FROM t1 where a = 1 and b=0;
                                               QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..235923.90 rows=99 width=109) (actual time=1.671..597.417 rows=10000 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on t1  (cost=0.00..234914.00 rows=41 width=109) (actual time=195.301..590.047 rows=3333 loops=3)
         Filter: ((a = 1) AND (b = 0))
         Rows Removed by Filter: 3330000
 Planning Time: 0.049 ms
 Execution Time: 598.062 ms
(8 rows)

复制

可以看到它估算的是99行，而实际查询出的结果是10000行。

为什么会这样呢？因为 A列的n_distinct等于1000，即选择性等于1/1000，而 B列的n_distinct等于100，即选择性等于1/100。该优化器将基于这两种选择率进行相乘，从而得到选择率=0.001*0.01=0.00001。表中的行数是10000000行，乘以0.0001选择率，等于100。与这里显示的99接近。因此按照这种算法将估计一个错误的值。

扩展统计信息

回到刚刚我们这个问题，我们这个A和B列是有关联关系的。比如B列，它的取值只有A列的1/10。这个优化器是不知道的，但是我们是知道的。A列足以确定值，而B列没有特殊的能跳出去的值。所以在B列这个列上过滤不会删除任何A列的行。

所以我们可以创建扩展的统计信息，让优化器明白这个关系。

CREATE STATISTICS s1 (dependencies) on a, b from t1; 
ANALYZE tbl;

postgres=# EXPLAIN ANALYZE SELECT * FROM t1 where a = 1 and b=0;
                                                 QUERY PLAN                                                  
-------------------------------------------------------------------------------------------------------------
 Seq Scan on t1  (cost=0.00..322413.15 rows=9732 width=109) (actual time=0.009..1164.188 rows=10000 loops=1)
   Filter: ((a = 1) AND (b = 0))
   Rows Removed by Filter: 9990000
 Planning Time: 0.087 ms
 Execution Time: 1164.796 ms
(5 rows)

复制

创建完扩展统计信息再次执行后，估算出来的是9732行，这次是比较准确的。

postgres=# select * from pg_statistic_ext where stxname ='s1';
  oid  | stxrelid | stxname | stxnamespace | stxowner | stxstattarget | stxkeys | stxkind 
-------+----------+---------+--------------+----------+---------------+---------+---------
 33579 |    33564 | s1      |         2200 |       10 |            -1 | 1 2     | {f}
 
 postgres=# select * from pg_statistic_ext_data;
 stxoid | stxdndistinct |   stxddependencies   | stxdmcv 
--------+---------------+----------------------+---------
  33579 |               | {"1 => 2": 1.000000} | 

复制

通过查询pg_statistic_ext和pg_statistic_ext_data视图可以查看扩展统计信息的情况。stxkind字段的{f}表示收集有关依赖项的扩展信息。

这里可以看到，{"1 => 2": 1.000000}。从第一列到第二列的对应关系是1：1.00000，也就是说A列完全可以确定B列。

后记

扩展统计信息还是有一些作用，今天就到这里，下一节我们会更进一步的学习。

postgresql

文章转载自励志成为PostgreSQL大神，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。

PostgreSQL创建扩展统计信息，让优化器做更好的选择

前言

先科普一下统计信息

当单列统计信息不足时

扩展统计信息

后记

评论

相关阅读