视图和物化视图
数据库中的/视图(View)/ 指的是通过一张或多张表查询出来的逻辑表,本身只是一段 SQL 的封装并不存储查询出来的数据。
创建视图的语法为:
CREATE [ OR REPLACE ] [ TEMP | TEMPORARY ] [ RECURSIVE ] VIEW name [ ( column_name [, ...] ) ]
[ WITH ( view_option_name [= view_option_value] [, ... ] ) ]
AS query
[ WITH [ CASCADED | LOCAL ] CHECK OPTION ]复制
而/物化视图/是是什么呢?这是一个很特别的概念,我们先看下Wikipedia中的描述
In computing, a materialized view is a database object that contains the results of a query. For example, it may be a local copy of data located remotely, or may be a subset of the rows and/or columns of a table or join result, or may be a summary using an aggregate function.
The process of setting up a materialized view is sometimes called materialization. This is a form of caching the results of a query, similar to memoization of the value of a function in functional languages, and it is sometimes described as a form of precomputation. As with other forms of precomputation, database users typically use materialized views for performance reasons, i.e. as a form of optimization.
PostgreSQL 9.3中支持物化视图的,创建物化视图的语法为:
CREATE MATERIALIZED VIEW [ IF NOT EXISTS ] table_name
[ (column_name [, ...] ) ]
[ WITH ( storage_parameter [= value] [, ... ] ) ]
[ TABLESPACE tablespace_name ]
AS query
[ WITH [ NO ] DATA ]复制
看下视图和物化视图查询plan,
创建一个销售记录的表
CREATE TABLE invoice (
invoice_no integer PRIMARY KEY,
seller_no integer, -- 销售员的 ID
invoice_date date, -- 销售日期
invoice_amt numeric(13,2) -- 销售量
);
CREATE TABLE invoice_award (
invoice_no integer PRIMARY KEY,
seller_no integer, -- 销售员的 ID
invoice_date date, -- 获奖日期
award text -- 奖名
);复制
创建视图
CREATE OR REPLACE VIEW SALES AS
SELECT
aa.seller_no,aa.award_cnt,cc.invoice_amt
FROM
(
SELECT seller_no,COUNT(award) AS award_cnt
FROM invoice_award
GROUP BY
seller_no
)aa
JOIN
(
SELECT
seller_no,invoice_amt
FROM invoice
)cc
ON aa.seller_no = cc.seller_no
ORDER BY cc.invoice_amt desc;复制
创建物化视图
CREATE MATERIALIZED VIEW SALES_M AS
SELECT
aa.seller_no,aa.award_cnt,cc.invoice_amt
FROM
(
SELECT seller_no,COUNT(award) AS award_cnt
FROM invoice_award
GROUP BY
seller_no
)aa
JOIN
(
SELECT
seller_no,invoice_amt
FROM invoice
)cc
ON aa.seller_no = cc.seller_no
ORDER BY cc.invoice_amt desc;复制
查询的plan
postgres=# EXPLAIN SELECT * FROM SALES;
QUERY PLAN
------------------------------------------------------------------------------------------
Sort (cost=137.97..141.60 rows=1450 width=28)
Sort Key: invoice.invoice_amt DESC
-> Hash Join (cost=33.45..61.83 rows=1450 width=28)
Hash Cond: (invoice.seller_no = invoice_award.seller_no)
-> Seq Scan on invoice (cost=0.00..24.50 rows=1450 width=20)
-> Hash (cost=30.95..30.95 rows=200 width=12)
-> HashAggregate (cost=26.95..28.95 rows=200 width=12)
Group Key: invoice_award.seller_no
-> Seq Scan on invoice_award (cost=0.00..21.30 rows=1130 width=36)
(9 rows)
postgres=# EXPLAIN SELECT * FROM SALES_M;
QUERY PLAN
------------------------------------------------------------
Seq Scan on sales_m (cost=0.00..24.50 rows=1450 width=28)
(1 row)复制
可以看出,pg的查询计划视图还是从原来的关系表中去查询,而物化视图则是从物化视图中查询,这个查询的效率肯定是比视图要高很多。
物化视图使用
我们对刚才的表简单的插入一些数据,来做接下来的使用的分析。
postgres=# insert into invoice values(1,1,'2021-06-07',100.0);
postgres=# insert into invoice_award values(1,1,'2021-06-07','A');复制
查询视图和物化视图
postgres=# select * from sales_m;
seller_no | award_cnt | invoice_amt
-----------+-----------+-------------
(0 rows)
postgres=# select * from sales;
seller_no | award_cnt | invoice_amt
-----------+-----------+-------------
1 | 1 | 100.00
(1 row)复制
查询的结果我们看出,视图是可以查询到数据的,但是物化视图却是没有数据的。因此我们看下物化视图的数据更新策略。PostgreSQL使用物化视图数据的更新策略主要是通过关键字/REFRESH/。我们可以\h看下帮助文档。
postgres=# \h REFRESH
Command: REFRESH MATERIALIZED VIEW
Description: replace the contents of a materialized view
Syntax:
REFRESH MATERIALIZED VIEW [ CONCURRENTLY ] name
[ WITH [ NO ] DATA ]
URL: https://www.postgresql.org/docs/13/sql-refreshmaterializedview.html复制
通过这个帮助文档的链接,阅读了refresh的策略知道,物化视图的数据更新策略是支持全量更新【with data】和增量更新【CONCURRENTLY】。
先来看下全量更新WITH data
postgres=# REFRESH MATERIALIZED VIEW sales_m WITH NO DATA;
REFRESH MATERIALIZED VIEW
postgres=# SELECT * FROM sales_m;
ERROR: materialized view "sales_m" has not been populated
HINT: Use the REFRESH MATERIALIZED VIEW command.复制
with no data是会将物化视图设置为populated,这个状态下不能被查询的。需要使用with data然后重新执行下,才能看到数据被刷新进入了视图。而且创建视图的时候默认是with data。
再来看下增量更新CONCURRENTLY
对物化视图的刷新不会同时阻塞对物化视图的查询,如果没有这个option, 一次影响很多行的refresh将使用更少的资源并且更快的结束,但是可能会阻塞其他尝试从物化视图中读取的连接。 只有当物化视图上有至少一个UNIQUE索引(只用列名 并且包括所有行)时,才允许这个选项。也就是说,该索引不能建立在任何 表达式上或者包括WHERE子句。 当物化视图还未被填充时,这个选项不能被使用。即 使带有这个选项,对于任意一个物化视图一次也只能运行一个 REFRESH。只是在刷新时不阻塞读当前物化视图,但多个物化视图刷新语句还是不能并发执行
postgres=# insert into invoice_award values(2,1,'2021-06-07','B');
INSERT 0 1
postgres=# SELECT * FROM sales_m;
seller_no | award_cnt | invoice_amt
-----------+-----------+-------------
1 | 1 | 100.00
(1 row)
postgres=# refresh materialized view CONCURRENTLY sales_m;
ERROR: cannot refresh materialized view "public.sales_m" concurrently
HINT: Create a unique index with no WHERE clause on one or more columns of the materialized view.
postgres=# create unique index idx_seller_no on sales_m(seller_no);
CREATE INDEX
postgres=# refresh materialized view CONCURRENTLY sales_m;
REFRESH MATERIALIZED VIEW
postgres=# SELECT * FROM sales_m;
seller_no | award_cnt | invoice_amt
-----------+-----------+-------------
1 | 2 | 100.00
(1 row)复制
往原始的invoice_award插入了一条件记录,再去从物化视图中select下,发现数据没有增量更新,因此我们需要用到CONCURRENTLY的这个option的时候,系统提示需要先给物化视图创建一个unique的index才行,然后再进行refresh之后,数据才会进入了物化视图。
物化视图PG实现
基于当前pg12.3这个版本来分析实现的一些逻辑细节。
核心的Node
typedef struct RefreshMatViewStmt
{
NodeTag type;
bool concurrent; /* allow concurrent access? */
bool skipData; /* true for WITH NO DATA */
RangeVar *relation; /* relation to insert into */
} RefreshMatViewStmt;复制
入口函数
/*
* ExecRefreshMatView -- execute a REFRESH MATERIALIZED VIEW command
*
* This refreshes the materialized view by creating a new table and swapping
* the relfilenodes of the new table and the old materialized view, so the OID
* of the original materialized view is preserved. Thus we do not lose GRANT
* nor references to this materialized view.
*
* If WITH NO DATA was specified, this is effectively like a TRUNCATE;
* otherwise it is like a TRUNCATE followed by an INSERT using the SELECT
* statement associated with the materialized view. The statement node's
* skipData field shows whether the clause was used.
*
* Indexes are rebuilt too, via REINDEX. Since we are effectively bulk-loading
* the new heap, it's better to create the indexes afterwards than to fill them
* incrementally while we load.
*
* The matview's "populated" state is changed based on whether the contents
* reflect the result set of the materialized view's query.
*/
ObjectAddress
ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
ParamListInfo params, QueryCompletion *qc)复制
case1:有concurrent选项
设置锁的模式为排它锁ExclusiveLock 做一些准确性的check 检查操作的是物化视图 检查是否被填充 冲突的情况,concurrent和skipdata 规则的校验 校验是否有唯一索引unique index 创建临时表空间tableSpace 创建临时表 调用refresh_by_match_merge完成刷新
关闭table对象,释放锁
case2:无concurrent选项
设置锁的模式为排它锁AccessExclusiveLock 做一些准确性的check 检查操作的是物化视图 检查是否被填充 冲突的情况,concurrent和skipdata 规则的校验 校验是否有唯一索引unique index 调用refresh_by_heap_swap完成刷新 交换数据文件 重建索引 清理临时表 清除映射关系 关闭table对象,释放锁
核心逻辑Lock
lockmode = concurrent ? ExclusiveLock : AccessExclusiveLock;
复制
ExclusiveLock(排他锁): 每个事务在它的事务ID的整个时间里都会持有一个Exclusive Lock(排他锁)”。如果一个事务发现它需要特别地等待另一个事务,它就会尝试地在另一个事务ID上获取 Share Lock(共享锁)。这仅当另一个事务结束并释放它自己的锁时才会成功。(注意,冲突)。Exclusive Lock(排他锁)会在同一张表上阻塞INSERT, UPDATE, DELETE,CREATE INDEX, ALTER TABLE,DROP TABLE, /SELECT/...FOR UPDATE 以及 VACUUM命令。
AccessExclusiveLock(访问排他锁):通过ALTER TABLE, DROP TABLE, 或者 VACUUM命令来修改表时自动获取。从开始在表上获取锁时,会阻塞任何并发命令或者其他锁模式。
核心逻辑unique index
foreach(indexoidscan, indexoidlist)
{
Oid indexoid = lfirst_oid(indexoidscan);
Relation indexRel;
indexRel = index_open(indexoid, AccessShareLock);
hasUniqueIndex = is_usable_unique_index(indexRel);
index_close(indexRel, AccessShareLock);
if (hasUniqueIndex)
break;
}复制
获取视图的indexRel,通过is_usable_unique_index函数来判断是否存在唯一索引unique index。
核心逻辑tablespace
/* Concurrent refresh builds new data in temp tablespace, and does diff. */
if (concurrent)
{
tableSpace = GetDefaultTablespace(RELPERSISTENCE_TEMP, false);
relpersistence = RELPERSISTENCE_TEMP;
}
else
{
tableSpace = matviewRel->rd_rel->reltablespace;
relpersistence = matviewRel->rd_rel->relpersistence;
}复制
如果是concurrent的case,创建一个临时表的存储空间tablespace,否则就会返回视图存在的tablespace。
核心逻辑Make_NEW_HEAP
/*
* Create the transient table that will receive the regenerated data. Lock
* it against access by any other process until commit (by which time it
* will be gone).
*/
OIDNewHeap = make_new_heap(matviewOid, tableSpace, relpersistence,
ExclusiveLock);
snprintf(NewHeapName, sizeof(NewHeapName), "pg_temp_%u", OIDOldHeap);
OIDNewHeap = heap_create_with_catalog(NewHeapName,
namespaceid,
NewTableSpace,
InvalidOid,
InvalidOid,
InvalidOid,
OldHeap->rd_rel->relowner,
OldHeap->rd_rel->relam,
OldHeapDesc,
NIL,
RELKIND_RELATION,
relpersistence,
false,
RelationIsMapped(OldHeap),
ONCOMMIT_NOOP,
reloptions,
false,
true,
true,
OIDOldHeap,
NULL);复制
通过看函数的源码,是调用make_new_heap会调用heap_create_with_catalog()函数进行新的临时表创建,表名为pg_temp_%u。
核心逻辑Refresh by MatchMerge
主要是通过原表和临时的一些join来形成差异的数据表,然后进行删除数据。
static void
refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
int save_sec_context)
{
appendStringInfo(&querybuf,
"CREATE TEMP TABLE %s AS "
"SELECT mv.ctid AS tid, newdata "
"FROM %s mv FULL JOIN %s newdata ON (",
diffname, matviewname, tempname);
appendStringInfoString(&querybuf,
" AND newdata OPERATOR(pg_catalog.*=) mv) "
"WHERE newdata IS NULL OR mv IS NULL "
"ORDER BY tid");
appendStringInfo(&querybuf,
"DELETE FROM %s mv WHERE ctid OPERATOR(pg_catalog.=) ANY "
"(SELECT diff.tid FROM %s diff "
"WHERE diff.tid IS NOT NULL "
"AND diff.newdata IS NULL)",
matviewname, diffname);
appendStringInfo(&querybuf,
"INSERT INTO %s SELECT (diff.newdata).* "
"FROM %s diff WHERE tid IS NULL",
matviewname, diffname);
appendStringInfo(&querybuf, "DROP TABLE %s, %s", diffname, tempname);
}复制
通过SPI执行query的方式进行数据的处理操作,完整的执行的逻辑如下:
创建临时差异表
CREATE TEMP TABLE diffname AS
SELECT mv.ctid AS tid, newdata FROM matviename mv FULL JOIN tempname newdata ON (newdata.tid=mv.tid AND newdata = mv)
WHERE newdata IS NULL OR mv IS NULL
ORDER BY tid;复制
删除原表已被更新数据
DELETE FROM matviewname mv WHERE ctid = ANY (SELECT diff.tid FROM diffname diff
WHERE diff.tid IS NOT NULL AND diff.newdata IS NULL);复制
更新的数据插回原表
INSERT INTO matviewname SELECT (diff.newdata).* FROM diffname diff WHERE tid IS NULL;
复制
删除差异表和临时表
Drop table diffname,tempname;
复制
核心逻辑Refresh by HeapSwap
主要是开始做finish_heap_swap来完成交换数据文件、重建索引、清理临时表、清除映射关系几个实现步骤。
swap_relation_files(OIDOldHeap, OIDNewHeap,
(OIDOldHeap == RelationRelationId),
swap_toast_by_content, is_internal,
frozenXid, cutoffMulti, mapped_tables);复制
切换表的oid和tablespace等信息。
重建索引
reindex_relation(OIDOldHeap, reindex_flags, &reindex_params);
复制
PostgreSQL12还是实现了concurrent reindex ,让重建索引的时候不能执行select的问题,过程中将锁设置为Share Update ExclusiveLock。主要的实现逻辑为
在系统表中创建新的索引 构建新索引 使新索引与在构建期间出现的变化同步(catch up) 重命名新索引为旧索引的名称,并切换所有依赖旧索引的地方到新索引,旧索引设置为invalid 标记旧索引为dead状态 删除旧索引
物化视图应用场景
查询优化
一般在关系型的业务库中,需要做一些join查询,当数据量比较大的时候,往往会影响查询的性能,这个可能有多个方面的问题,比如数据量比较大,表的字段比较多,因此需要优化查询的时候,可以将查询的逻辑进行物化视图的创建,相当于做预聚合。只要有基表数据更新的时候,就触发下物化视图的刷新逻辑即可。
开发效率
在实时数仓的开发过程中,我们都会选择实时计算引擎比如flink,sparkstreaming等来开发计算逻辑,实时任务完成计算的数据会写入OLAP或者持久化的RDBMS等数据库中,数据会实时的按照逻辑insert到数据中。如果是OLAP库,我们会有一些中间数据比如DWD的明细数据也叫大宽表的数据提供交互查询服务,但是一旦计算量比较大,计算周期比较长的,我们可以创建一个物化视图来进行预聚合,但是这样场景如果不要求太实时的话,就是一个非常好的选择物化视图的条件。因为通常的做法是给实时计算任务增加压力,开发实时任务,然后进行数据聚合,将汇总好的数据写入持久化的数据库。
评论
