为何函数又让优化器犯了难？

原创 xiongcc 2024-02-28

277

前言

最近看了一些 SQL 调优相关的文献和视频，其中不乏一些相对陌生的领域，GET 到不少新技能！这一期和各位聊聊一个鲜为人知的调优技巧——函数代价。

何如

对于优化器来说，函数通常被视为一个"黑盒"，优化器对于函数体内的逻辑知之甚少，比如位于 FROM 列表的函数就有一个独立的算子 FunctionScan

postgres=# explain select * from now();
                       QUERY PLAN                       
--------------------------------------------------------
 Function Scan on now  (cost=0.00..0.01 rows=1 width=8)
(1 row)

其次某些场景，优化器也无法很好评估函数执行的成本和返回的结果集行数，举个栗子：

postgres=# explain select * from generate_series('2024-02-01','2024-02-28',interval'1 day');
                               QUERY PLAN                               
------------------------------------------------------------------------
 Function Scan on generate_series  (cost=0.00..10.00 rows=1000 width=8)
(1 row)

站在上帝视角，我们当然知道二月总共就 28 天，但是优化器却犯了难，预估返回 1000 行，和实际相差了 50 倍！具体原理是取决于是否有相对应的支持函数，在 pg_proc.prosupport 有记录

Planner support function for this function, or zero if none

postgres=# select proname, prosupport, prorows 
           from pg_proc 
           where proname like '%generate%';
           proname            |          prosupport          | prorows 
------------------------------+------------------------------+---------
 generate_subscripts          | -                            |    1000
 generate_subscripts          | -                            |    1000
 generate_series              | generate_series_int4_support |    1000
 generate_series              | generate_series_int4_support |    1000
 generate_series_int4_support | -                            |       0
 generate_series              | generate_series_int8_support |    1000
 generate_series              | generate_series_int8_support |    1000
 generate_series_int8_support | -                            |       0
 generate_series              | -                            |    1000
 generate_series              | -                            |    1000
 generate_series              | -                            |    1000
 generate_series              | -                            |    1000
 generate_series              | -                            |    1000
(13 rows)

可以看到对于 int4 和 int8 就有相对应的支持函数，所以对于整型，优化器预估的行数就很准确

postgres=# explain select * from generate_series(1,1000);  ---预估准确
                               QUERY PLAN                               
------------------------------------------------------------------------
 Function Scan on generate_series  (cost=0.00..10.00 rows=1000 width=4)
(1 row)

postgres=# explain select * from generate_series(1,995);  ---预估准确
                              QUERY PLAN                              
----------------------------------------------------------------------
 Function Scan on generate_series  (cost=0.00..9.95 rows=995 width=4)
(1 row)

倘若没有对应的支持函数，那么预估行数直接取自 pg_proc 中的 prorows 列，所以对于某些场景，你可以根据需要自行调整相关值以使优化器的预估更为准确：

postgres=# alter function generate_series(timestamp with time zone, timestamp with time zone, interval) rows 28;
ALTER FUNCTION
postgres=# explain select * from generate_series('2024-02-01','2024-02-28',interval'1 day');  ---这次优化器预估行数便准确了
                             QUERY PLAN                              
---------------------------------------------------------------------
 Function Scan on generate_series  (cost=0.00..0.28 rows=28 width=8)
(1 row)

另外我在优化器文章中也有描述，PostgreSQL 会择机优化一些简单的函数，此功能叫做函数内联 (function inline)，函数内联的思想是尽可能减少函数调用，以及在函数表达式中提供常量折叠的机会，从而加快查询速度。看个例子：

postgres=# create or replace function myfunc(v_id int)returns int 
as $$
 select abs(v_id);
$$ language sql;
CREATE FUNCTION
postgres=# explain select * from generate_series(1,10) as x where myfunc(x) = 5;
                              QUERY PLAN                              
----------------------------------------------------------------------
 Function Scan on generate_series x  (cost=0.00..0.15 rows=1 width=4)
   Filter: (abs(x) = 5)
(2 rows)

postgres=# create or replace function myfunc(v_id int)returns int 
as $$
declare
 ret int;
begin
 select abs(v_id) into ret;
end;
$$ language plpgsql;
CREATE FUNCTION
postgres=# explain select * from generate_series(1,10) as x where myfunc(x) = 5;
                              QUERY PLAN                              
----------------------------------------------------------------------
 Function Scan on generate_series x  (cost=0.00..2.63 rows=1 width=4)
   Filter: (myfunc(x) = 5)
(2 rows)

第一条语句 myfunc 函数被替换成了 abs，但是仅限于 SQL 函数，所以第二条语句就无法替换 (plpgsql)。再看个常量折叠的例子：

postgres=# CREATE FUNCTION incr4(int) RETURNS int
AS 'SELECT $1 + (2 + 2)' LANGUAGE SQL;
CREATE FUNCTION
postgres=# explain SELECT * FROM t2 where incr4(id) = 5;
                     QUERY PLAN                     
----------------------------------------------------
 Seq Scan on t2  (cost=0.00..48.25 rows=13 width=4)
   Filter: ((id + 4) = 5)
(2 rows)

小众技巧

前文介绍了优化器对于函数返回的行数是如何预估的，那么问题来了，成本是怎么计算的呢？让我们看个更加复杂点的例子：

postgres=# CREATE TABLE t_test AS SELECT * FROM generate_series(1, 10000000) AS id;
SELECT 10000000
postgres=# CREATE FUNCTION returns_many (int)
    RETURNS int
    AS $$
BEGIN
    IF $1 % 2 = 0 THEN
        RETURN $1;
    END IF;
    RETURN 0;
END;
$$
LANGUAGE 'plpgsql';
CREATE FUNCTION

postgres=# CREATE FUNCTION returns_few (int)
    RETURNS int
    AS $$
BEGIN
    IF $1 % 1000 = 35 THEN
        RETURN $1;
    END IF;
    RETURN 0;
END;
$$
LANGUAGE 'plpgsql';
CREATE FUNCTION

此处创建了两个函数，returns_many 返回的结果集较大，returns_few 返回较少。现在让我们跑一个同时调用这两个函数的查询：

postgres=# EXPLAIN        
SELECT
    *
FROM
    t_test
WHERE
    returns_many (id) = id
    AND returns_few (id) = id;
                           QUERY PLAN                           
----------------------------------------------------------------
 Seq Scan on t_test  (cost=0.00..5194292.63 rows=250 width=4)
   Filter: ((returns_many(id) = id) AND (returns_few(id) = id))
(2 rows)

此处 5194292.63 是如何计算的？答案还是藏在 pg_proc 中，procost 表明了执行此函数相对应的代价

Estimated execution cost (in units of cpu_operator_cost); if proretset, this is cost per row returned
A positive number giving the estimated execution cost for the function, in units of cpu_operator_cost. If the function returns a set, this is the cost per returned row. If the cost is not specified, 1 unit is assumed for C-language and internal functions, and 100 units for functions in all other languages. Larger values cause the planner to try to avoid evaluating the function more often than necessary.
正数表示函数的预估执行成本，以 cpu_operator_cost 的单位计算。如果函数返回一个集合，这是每个返回行的成本。如果没有指定成本，对于 C 语言和内部函数，默认为 1 个单位，对于所有其他语言的函数，默认为 100 个单位。较大的值会导致规划器尝试避免不必要地频繁评估函数。

对于 C 类型和一些自带的函数，procost 都是 1，那么其成本便是 cpu_operator_cost * 1 = 0.0025

postgres=# select proname,procost from pg_proc where proname = 'now';
 proname | procost 
---------+---------
 now     |       1
(1 row)

对于自定义函数，其成本是 cpu_operator_cost * 100，所以总成本计算如下

postgres=# SELECT
    relpages * current_setting('seq_page_cost')::numeric + reltuples * current_setting('cpu_tuple_cost')::numeric + current_setting('cpu_operator_cost')::numeric * 100 * reltuples * 2 + current_setting('cpu_operator_cost')::numeric * reltuples * 2 AS total_cost
FROM
    pg_class
WHERE
    relname = 't_test';
 total_cost  
-------------
 5194292.635
(1 row)

现在让我们执行一下这条语句，为了观察函数执行效果需要提前打开 track_functions，这样在 pg_stat_user_functions 中便可以看到函数调用此处和执行时间：

postgres=# EXPLAIN ANALYZE
SELECT
    *
FROM
    t_test
WHERE
    returns_many (id) = id
    AND returns_few (id) = id;
                                                   QUERY PLAN                                                   
----------------------------------------------------------------------------------------------------------------
 Seq Scan on t_test  (cost=0.00..5194292.63 rows=250 width=4) (actual time=30130.571..30130.572 rows=0 loops=1)
   Filter: ((returns_many(id) = id) AND (returns_few(id) = id))
   Rows Removed by Filter: 10000000
 Planning Time: 0.066 ms
 Execution Time: 30130.606 ms
(5 rows)

postgres=# select * from pg_stat_user_functions ;
 funcid | schemaname |   funcname   |  calls   | total_time | self_time 
--------+------------+--------------+----------+------------+-----------
  24634 | public     | returns_many | 10000000 |  16273.401 | 16273.401
  24636 | public     | returns_few  |  5000000 |   8339.559 |  8339.559
(2 rows)

现在让我们做个操作，做了什么操作先卖个关子... 各位读者可以先行思考一下🤔

现在让我们再次观察一下：

postgres=# select pg_stat_reset();
 pg_stat_reset 
---------------
 
(1 row)

postgres=# EXPLAIN ANALYZE
SELECT
    *
FROM
    t_test
WHERE
    returns_many (id) = id
    AND returns_few (id) = id;
                                                   QUERY PLAN                                                    
-----------------------------------------------------------------------------------------------------------------
 Seq Scan on t_test  (cost=0.00..27694312.88 rows=250 width=4) (actual time=20786.101..20786.102 rows=0 loops=1)
   Filter: ((returns_few(id) = id) AND (returns_many(id) = id))
   Rows Removed by Filter: 10000000
 Planning Time: 0.062 ms
 Execution Time: 20786.130 ms
(5 rows)

postgres=# select * from pg_stat_user_functions ;
 funcid | schemaname |   funcname   |  calls   | total_time | self_time 
--------+------------+--------------+----------+------------+-----------
  24636 | public     | returns_few  | 10000000 |  16431.214 | 16431.214
  24634 | public     | returns_many |    10000 |     16.909 |    16.909
(2 rows)

可以很清晰的看到，不仅执行时间由 3 秒变成了 2 秒，效率提升了不说，其次 pg_stat_user_functions 表明这次变成了 returns_few 执行的次数变多了！意味着返回行数较少的函数执行的次数更多了，优化器貌似变聪明了？那我做了什么操作使得优化器做出了更为准确的判断呢？

duang！我做了如下这么一个操作，使得 returns_many 代价比 returns_few (默认是 100)更高了！

postgres=# alter function returns_many(int) cost 1000;
ALTER FUNCTION

那么对于优化器来说，如果有两个函数同时作为 SELECT 语句的谓词同时执行，那么优化器会先执行代价低的，再去执行成本高的，即使在查询语句中，我将 returns_many 摆在了前面，所以经过我这么一个操作，优化器先去执行了 returns_few，再去执行了 returns_many，也就意味着我先过滤了一大部分元组之后，再去执行后续的操作，其效率同之前相比，也就不言而喻了。

小结

由于函数对于优化器来说相对是个黑盒，在某些场景下，根据需要，我们可以调整函数返回的行数和代价，使得优化器预估更为准确，更加聪明。其次尽可能使得函数体相对简单，使用 SQL 语言编写函数可以减少函数调用开销，提升效率，这一点我们可以结合 pg_stat_user_functions 一起观察。

最后让我们再看一个网上的例子加深下印象吧：

postgres=# create table t1 ( a int, b text, c date );
CREATE TABLE
postgres=# insert into t1 select a,a::text,now() from generate_series(1,1000000) a;
INSERT 0 1000000
postgres=# create unique index i1 on t1(a);
CREATE INDEX
postgres=# analyze t1;
ANALYZE

create or replace function f_tmp ( a_id in int ) returns setof t1
as $$
declare
begin
  return query select * from t1 where a = $1;
end;
$$ language plpgsql;

postgres=# explain (analyze) select f_tmp (1);
                                         QUERY PLAN                                         
--------------------------------------------------------------------------------------------
 ProjectSet  (cost=0.00..5.27 rows=1000 width=32) (actual time=0.654..0.657 rows=1 loops=1)
   ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.003..0.004 rows=1 loops=1)
 Planning Time: 0.047 ms
 Execution Time: 0.696 ms
(4 rows)

create or replace function f_tmp ( a_id in int ) returns setof t1
as $$
declare
begin
  return query select * from t1 where a = $1;
end;
$$ language plpgsql
   rows 1;
   
postgres=# explain (analyze) select f_tmp (1);
                                        QUERY PLAN                                        
------------------------------------------------------------------------------------------
 ProjectSet  (cost=0.00..0.27 rows=1 width=32) (actual time=0.451..0.454 rows=1 loops=1)
   ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.003..0.004 rows=1 loops=1)
 Planning Time: 0.068 ms
 Execution Time: 0.503 ms
(4 rows)

小结

https://www.dbi-services.com/blog/telling-the-postgresql-optimizer-more-about-your-functions/

https://www.cybertec-postgresql.com/en/better-sql-functions-in-postgresql-v14/

https://www.postgresql.org/docs/current/sql-createfunction.html

为何函数又让优化器犯了难？

前言

何如

小众技巧

小结

小结

推荐阅读

Feel free to contact me

评论