PostgreSQL 物化视图

景闻你好 2021-06-10

2438

视图和物化视图

数据库中的/视图(View)/ 指的是通过一张或多张表查询出来的逻辑表，本身只是一段 SQL 的封装并不存储查询出来的数据。

创建视图的语法为：

CREATE [ OR REPLACE ] [ TEMP | TEMPORARY ] [ RECURSIVE ] VIEW name [ ( column_name [, ...] ) ]
    [ WITH ( view_option_name [= view_option_value] [, ... ] ) ]
    AS query
    [ WITH [ CASCADED | LOCAL ] CHECK OPTION ]

复制

而/物化视图/是是什么呢？这是一个很特别的概念，我们先看下Wikipedia中的描述

In computing, a materialized view is a database object that contains the results of a query. For example, it may be a local copy of data located remotely, or may be a subset of the rows and/or columns of a table or join result, or may be a summary using an aggregate function.

The process of setting up a materialized view is sometimes called materialization. This is a form of caching the results of a query, similar to memoization of the value of a function in functional languages, and it is sometimes described as a form of precomputation. As with other forms of precomputation, database users typically use materialized views for performance reasons, i.e. as a form of optimization.

PostgreSQL 9.3中支持物化视图的，创建物化视图的语法为：

CREATE MATERIALIZED VIEW [ IF NOT EXISTS ] table_name
    [ (column_name [, ...] ) ]
    [ WITH ( storage_parameter [= value] [, ... ] ) ]
    [ TABLESPACE tablespace_name ]
    AS query
    [ WITH [ NO ] DATA ]

复制

看下视图和物化视图查询plan，

创建一个销售记录的表

CREATE TABLE invoice (
    invoice_no    integer        PRIMARY KEY,
    seller_no     integer,       -- 销售员的 ID
    invoice_date  date,          -- 销售日期
    invoice_amt   numeric(13,2)  -- 销售量
);


CREATE TABLE invoice_award (
    invoice_no    integer        PRIMARY KEY,
    seller_no     integer,       -- 销售员的 ID
    invoice_date  date,          -- 获奖日期
    award   text                 -- 奖名
);

复制

创建视图

CREATE OR REPLACE VIEW  SALES AS
SELECT 
        aa.seller_no,aa.award_cnt,cc.invoice_amt
FROM
(
    SELECT seller_no,COUNT(award) AS award_cnt  
        FROM invoice_award
        GROUP  BY
            seller_no
)aa 
JOIN
(
    SELECT 
            seller_no,invoice_amt
    FROM    invoice
)cc    
ON  aa.seller_no = cc.seller_no
ORDER BY cc.invoice_amt desc;

复制

创建物化视图

CREATE MATERIALIZED VIEW  SALES_M AS
SELECT 
        aa.seller_no,aa.award_cnt,cc.invoice_amt
FROM
(
    SELECT seller_no,COUNT(award) AS award_cnt  
        FROM invoice_award
        GROUP  BY
            seller_no
)aa 
JOIN
(
    SELECT 
            seller_no,invoice_amt
    FROM    invoice
)cc    
ON  aa.seller_no = cc.seller_no
ORDER BY cc.invoice_amt desc;

复制

查询的plan

postgres=# EXPLAIN SELECT * FROM SALES;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Sort  (cost=137.97..141.60 rows=1450 width=28)
   Sort Key: invoice.invoice_amt DESC
   ->  Hash Join  (cost=33.45..61.83 rows=1450 width=28)
         Hash Cond: (invoice.seller_no = invoice_award.seller_no)
         ->  Seq Scan on invoice  (cost=0.00..24.50 rows=1450 width=20)
         ->  Hash  (cost=30.95..30.95 rows=200 width=12)
               ->  HashAggregate  (cost=26.95..28.95 rows=200 width=12)
                     Group Key: invoice_award.seller_no
                     ->  Seq Scan on invoice_award  (cost=0.00..21.30 rows=1130 width=36)
(9 rows)


postgres=# EXPLAIN SELECT * FROM SALES_M;
                         QUERY PLAN
------------------------------------------------------------
 Seq Scan on sales_m  (cost=0.00..24.50 rows=1450 width=28)
(1 row)

复制

可以看出，pg的查询计划视图还是从原来的关系表中去查询，而物化视图则是从物化视图中查询，这个查询的效率肯定是比视图要高很多。

物化视图使用

我们对刚才的表简单的插入一些数据，来做接下来的使用的分析。

postgres=# insert into invoice  values(1,1,'2021-06-07',100.0);
postgres=# insert into invoice_award values(1,1,'2021-06-07','A');
复制

查询视图和物化视图

postgres=# select * from sales_m;
 seller_no | award_cnt | invoice_amt
-----------+-----------+-------------
(0 rows)

postgres=# select * from sales;
 seller_no | award_cnt | invoice_amt
-----------+-----------+-------------
         1 |         1 |      100.00
(1 row)


复制

查询的结果我们看出，视图是可以查询到数据的，但是物化视图却是没有数据的。因此我们看下物化视图的数据更新策略。PostgreSQL使用物化视图数据的更新策略主要是通过关键字/REFRESH/。我们可以\h看下帮助文档。

postgres=# \h REFRESH
Command:     REFRESH MATERIALIZED VIEW
Description: replace the contents of a materialized view
Syntax:
REFRESH MATERIALIZED VIEW [ CONCURRENTLY ] name
    [ WITH [ NO ] DATA ]

URL: https://www.postgresql.org/docs/13/sql-refreshmaterializedview.html

复制

通过这个帮助文档的链接，阅读了refresh的策略知道，物化视图的数据更新策略是支持全量更新【with data】和增量更新【CONCURRENTLY】。

先来看下全量更新WITH data

postgres=# REFRESH MATERIALIZED VIEW sales_m WITH NO DATA;
REFRESH MATERIALIZED VIEW
postgres=# SELECT * FROM sales_m;
ERROR:  materialized view "sales_m" has not been populated
HINT:  Use the REFRESH MATERIALIZED VIEW command.

复制

with no data是会将物化视图设置为populated，这个状态下不能被查询的。需要使用with data然后重新执行下，才能看到数据被刷新进入了视图。而且创建视图的时候默认是with data。

再来看下增量更新CONCURRENTLY

对物化视图的刷新不会同时阻塞对物化视图的查询，如果没有这个option，一次影响很多行的refresh将使用更少的资源并且更快的结束，但是可能会阻塞其他尝试从物化视图中读取的连接。
只有当物化视图上有至少一个UNIQUE索引（只用列名并且包括所有行）时，才允许这个选项。也就是说，该索引不能建立在任何表达式上或者包括WHERE子句。
当物化视图还未被填充时，这个选项不能被使用。即
使带有这个选项，对于任意一个物化视图一次也只能运行一个 REFRESH。只是在刷新时不阻塞读当前物化视图，但多个物化视图刷新语句还是不能并发执行

postgres=# insert into invoice_award values(2,1,'2021-06-07','B');
INSERT 0 1
postgres=# SELECT * FROM sales_m;
 seller_no | award_cnt | invoice_amt
-----------+-----------+-------------
         1 |         1 |      100.00
(1 row)
postgres=# refresh materialized view CONCURRENTLY sales_m;
ERROR:  cannot refresh materialized view "public.sales_m" concurrently
HINT:  Create a unique index with no WHERE clause on one or more columns of the materialized view.
postgres=# create unique index idx_seller_no on sales_m(seller_no);
CREATE INDEX
postgres=# refresh materialized view CONCURRENTLY sales_m;
REFRESH MATERIALIZED VIEW
postgres=# SELECT * FROM sales_m;
 seller_no | award_cnt | invoice_amt
-----------+-----------+-------------
         1 |         2 |      100.00
(1 row)

复制

往原始的invoice_award插入了一条件记录，再去从物化视图中select下，发现数据没有增量更新，因此我们需要用到CONCURRENTLY的这个option的时候，系统提示需要先给物化视图创建一个unique的index才行，然后再进行refresh之后，数据才会进入了物化视图。

物化视图PG实现

基于当前pg12.3这个版本来分析实现的一些逻辑细节。

核心的Node

typedef struct RefreshMatViewStmt
{
 NodeTag  type;
 bool  concurrent;  /* allow concurrent access? */
 bool  skipData;  /* true for WITH NO DATA */
 RangeVar   *relation;  /* relation to insert into */
} RefreshMatViewStmt;

复制

入口函数

/*
 * ExecRefreshMatView -- execute a REFRESH MATERIALIZED VIEW command
 *
 * This refreshes the materialized view by creating a new table and swapping
 * the relfilenodes of the new table and the old materialized view, so the OID
 * of the original materialized view is preserved. Thus we do not lose GRANT
 * nor references to this materialized view.
 *
 * If WITH NO DATA was specified, this is effectively like a TRUNCATE;
 * otherwise it is like a TRUNCATE followed by an INSERT using the SELECT
 * statement associated with the materialized view.  The statement node's
 * skipData field shows whether the clause was used.
 *
 * Indexes are rebuilt too, via REINDEX. Since we are effectively bulk-loading
 * the new heap, it's better to create the indexes afterwards than to fill them
 * incrementally while we load.
 *
 * The matview's "populated" state is changed based on whether the contents
 * reflect the result set of the materialized view's query.
 */
ObjectAddress
ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
       ParamListInfo params, QueryCompletion *qc)

复制

case1：有concurrent选项

设置锁的模式为排它锁ExclusiveLock
做一些准确性的check

检查操作的是物化视图
检查是否被填充
冲突的情况，concurrent和skipdata
规则的校验
校验是否有唯一索引unique index

创建临时表空间tableSpace
创建临时表
调用refresh_by_match_merge完成刷新
关闭table对象，释放锁

case2：无concurrent选项

设置锁的模式为排它锁AccessExclusiveLock
做一些准确性的check

检查操作的是物化视图
检查是否被填充
冲突的情况，concurrent和skipdata
规则的校验
校验是否有唯一索引unique index

调用refresh_by_heap_swap完成刷新

交换数据文件
重建索引
清理临时表
清除映射关系

关闭table对象，释放锁

核心逻辑Lock

lockmode = concurrent ? ExclusiveLock : AccessExclusiveLock;

复制

ExclusiveLock（排他锁）: 每个事务在它的事务ID的整个时间里都会持有一个Exclusive Lock（排他锁）”。如果一个事务发现它需要特别地等待另一个事务，它就会尝试地在另一个事务ID上获取 Share Lock（共享锁）。这仅当另一个事务结束并释放它自己的锁时才会成功。（注意，冲突）。Exclusive Lock（排他锁）会在同一张表上阻塞INSERT, UPDATE, DELETE,CREATE INDEX, ALTER TABLE,DROP TABLE, /SELECT/...FOR UPDATE 以及 VACUUM命令。

AccessExclusiveLock（访问排他锁）：通过ALTER TABLE, DROP TABLE, 或者 VACUUM命令来修改表时自动获取。从开始在表上获取锁时，会阻塞任何并发命令或者其他锁模式。

核心逻辑unique index

foreach(indexoidscan, indexoidlist)
{
 Oid   indexoid = lfirst_oid(indexoidscan);
 Relation indexRel;

 indexRel = index_open(indexoid, AccessShareLock);
 hasUniqueIndex = is_usable_unique_index(indexRel);
 index_close(indexRel, AccessShareLock);
 if (hasUniqueIndex)
  break;
}

复制

获取视图的indexRel，通过is_usable_unique_index函数来判断是否存在唯一索引unique index。

核心逻辑tablespace

/* Concurrent refresh builds new data in temp tablespace, and does diff. */
if (concurrent)
{
 tableSpace = GetDefaultTablespace(RELPERSISTENCE_TEMP, false);
 relpersistence = RELPERSISTENCE_TEMP;
}
else
{
 tableSpace = matviewRel->rd_rel->reltablespace;
 relpersistence = matviewRel->rd_rel->relpersistence;
}

复制

如果是concurrent的case，创建一个临时表的存储空间tablespace，否则就会返回视图存在的tablespace。

核心逻辑Make_NEW_HEAP

/*
 * Create the transient table that will receive the regenerated data. Lock
 * it against access by any other process until commit (by which time it
 * will be gone).
 */
OIDNewHeap = make_new_heap(matviewOid, tableSpace, relpersistence,
         ExclusiveLock);

snprintf(NewHeapName, sizeof(NewHeapName), "pg_temp_%u", OIDOldHeap);

OIDNewHeap = heap_create_with_catalog(NewHeapName,
           namespaceid,
           NewTableSpace,
           InvalidOid,
           InvalidOid,
           InvalidOid,
           OldHeap->rd_rel->relowner,
           OldHeap->rd_rel->relam,
           OldHeapDesc,
           NIL,
           RELKIND_RELATION,
           relpersistence,
           false,
           RelationIsMapped(OldHeap),
           ONCOMMIT_NOOP,
           reloptions,
           false,
           true,
           true,
           OIDOldHeap,
           NULL);

复制

通过看函数的源码，是调用make_new_heap会调用heap_create_with_catalog()函数进行新的临时表创建，表名为pg_temp_%u。

核心逻辑Refresh by MatchMerge

主要是通过原表和临时的一些join来形成差异的数据表，然后进行删除数据。

static void
refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
        int save_sec_context)
{
 appendStringInfo(&querybuf,
     "CREATE TEMP TABLE %s AS "
     "SELECT mv.ctid AS tid, newdata "
     "FROM %s mv FULL JOIN %s newdata ON (",
     diffname, matviewname, tempname);


 appendStringInfoString(&querybuf,
         " AND newdata OPERATOR(pg_catalog.*=) mv) "
         "WHERE newdata IS NULL OR mv IS NULL "
         "ORDER BY tid");
         
 appendStringInfo(&querybuf,
      "DELETE FROM %s mv WHERE ctid OPERATOR(pg_catalog.=) ANY "
      "(SELECT diff.tid FROM %s diff "
      "WHERE diff.tid IS NOT NULL "
      "AND diff.newdata IS NULL)",
      matviewname, diffname);

 appendStringInfo(&querybuf,
      "INSERT INTO %s SELECT (diff.newdata).* "
      "FROM %s diff WHERE tid IS NULL",
      matviewname, diffname);
      
 appendStringInfo(&querybuf, "DROP TABLE %s, %s", diffname, tempname);
}        

复制

通过SPI执行query的方式进行数据的处理操作，完整的执行的逻辑如下：

创建临时差异表

CREATE TEMP TABLE diffname AS 
SELECT mv.ctid AS tid, newdata FROM matviename mv FULL JOIN tempname newdata ON (newdata.tid=mv.tid AND newdata = mv) 
WHERE newdata IS NULL OR mv IS NULL 
ORDER BY tid;

复制

删除原表已被更新数据

DELETE FROM matviewname mv WHERE ctid = ANY (SELECT diff.tid FROM diffname diff
WHERE diff.tid IS NOT NULL AND diff.newdata IS NULL);

复制

更新的数据插回原表

INSERT INTO matviewname SELECT (diff.newdata).* FROM diffname diff WHERE tid IS NULL;

复制

删除差异表和临时表

Drop table diffname,tempname;

复制

核心逻辑Refresh by HeapSwap

主要是开始做finish_heap_swap来完成交换数据文件、重建索引、清理临时表、清除映射关系几个实现步骤。

swap_relation_files(OIDOldHeap, OIDNewHeap,
      (OIDOldHeap == RelationRelationId),
      swap_toast_by_content, is_internal,
      frozenXid, cutoffMulti, mapped_tables);

复制

切换表的oid和tablespace等信息。

重建索引

reindex_relation(OIDOldHeap, reindex_flags, &reindex_params);

复制

PostgreSQL12还是实现了concurrent reindex ，让重建索引的时候不能执行select的问题，过程中将锁设置为Share Update ExclusiveLock。主要的实现逻辑为

在系统表中创建新的索引
构建新索引
使新索引与在构建期间出现的变化同步(catch up)
重命名新索引为旧索引的名称，并切换所有依赖旧索引的地方到新索引，旧索引设置为invalid
标记旧索引为dead状态
删除旧索引

物化视图应用场景

查询优化

一般在关系型的业务库中，需要做一些join查询，当数据量比较大的时候，往往会影响查询的性能，这个可能有多个方面的问题，比如数据量比较大，表的字段比较多，因此需要优化查询的时候，可以将查询的逻辑进行物化视图的创建，相当于做预聚合。只要有基表数据更新的时候，就触发下物化视图的刷新逻辑即可。

开发效率

在实时数仓的开发过程中，我们都会选择实时计算引擎比如flink，sparkstreaming等来开发计算逻辑，实时任务完成计算的数据会写入OLAP或者持久化的RDBMS等数据库中，数据会实时的按照逻辑insert到数据中。如果是OLAP库，我们会有一些中间数据比如DWD的明细数据也叫大宽表的数据提供交互查询服务，但是一旦计算量比较大，计算周期比较长的，我们可以创建一个物化视图来进行预聚合，但是这样场景如果不要求太实时的话，就是一个非常好的选择物化视图的条件。因为通常的做法是给实时计算任务增加压力，开发实时任务，然后进行数据聚合，将汇总好的数据写入持久化的数据库。

postgresql

文章转载自景闻你好，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。

让Joy飞🌻🎶

2年前

分析得太好啦，感谢！！

2年前