暂无图片
暂无图片
2
暂无图片
暂无图片
1
暂无图片

庖丁解牛,深入浅出无日志表

原创 xiongcc 2024-12-07
213

前言

在 PostgreSQL 中,支持多种类型的表 — 临时表、普通表,以及无日志表,顾名思义,"无日志"其优势在于不用记录 WAL,那么写入速度自然也杠杠的,同理,无日志表在备库上没有数据 (只有一个壳),也无法进行访问,会提示 ERROR: cannot access temporary or unlogged relations during recovery,pg_basebackup 的时候也会跳过无日志表 (除了 init 分支)。那么无日志表又有哪些鲜为人知的细节呢?

分析

首先,让我们思考一下,无日志表的数据会落盘吗?或者说,按照常识,执行正常检查点的时候会落盘吗?

However, they are not crash-safe: an unlogged table is automatically truncated after a crash or unclean shutdown. The contents of an unlogged table are also not replicated to standby servers. Any indexes created on an unlogged table are automatically unlogged as well.

根据官网解释,在崩溃或者非正常关闭的时候,表文件会被截断,也就是说,在正常关闭的情况下,数据是会正常落盘的,让我们验证一下:

postgres=# create unlogged table t1(id int,info text);
CREATE TABLE
postgres=# insert into t1 select n,md5(random()::text) from generate_series(1,10000) as n;
INSERT 0 10000
postgres=# select pg_relation_filepath('t1');
 pg_relation_filepath 
----------------------
 base/5/51158
(1 row)
postgres=# checkpoint ;
CHECKPOINT
[postgres@mypg 5]$ ls -lrth 51158*
-rw------- 1 postgres postgres    0 Dec  7 15:04 51158_init
-rw------- 1 postgres postgres  24K Dec  7 15:04 51158_fsm
-rw------- 1 postgres postgres 672K Dec  7 15:04 51158
-rw------- 1 postgres postgres 8.0K Dec  7 15:04 51158_vm
[postgres@mypg 5]$ hexdump -C 51158
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000a8000
复制

hexdump 会将相同的都是 0 的行标记为 *,用 vim 打开也确实全是 0。

0000000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
....
复制

让我们正常关闭,再观察一下

postgres=# \q
[postgres@mypg ~]$ pg_ctl -D 17data stop
waiting for server to shut down.... done
server stopped
[postgres@mypg 5]$ hexdump -C 51158 | head -n 10
00000000  00 00 00 00 00 00 00 00  00 00 04 00 f8 01 00 02  |................|
00000010  00 20 04 20 00 00 00 00  c0 9f 7a 00 80 9f 7a 00  |. . ......z...z.|
00000020  40 9f 7a 00 00 9f 7a 00  c0 9e 7a 00 80 9e 7a 00  |@.z...z...z...z.|
00000030  40 9e 7a 00 00 9e 7a 00  c0 9d 7a 00 80 9d 7a 00  |@.z...z...z...z.|
00000040  40 9d 7a 00 00 9d 7a 00  c0 9c 7a 00 80 9c 7a 00  |@.z...z...z...z.|
00000050  40 9c 7a 00 00 9c 7a 00  c0 9b 7a 00 80 9b 7a 00  |@.z...z...z...z.|
00000060  40 9b 7a 00 00 9b 7a 00  c0 9a 7a 00 80 9a 7a 00  |@.z...z...z...z.|
00000070  40 9a 7a 00 00 9a 7a 00  c0 99 7a 00 80 99 7a 00  |@.z...z...z...z.|
00000080  40 99 7a 00 00 99 7a 00  c0 98 7a 00 80 98 7a 00  |@.z...z...z...z.|
00000090  40 98 7a 00 00 98 7a 00  c0 97 7a 00 80 97 7a 00  |@.z...z...z...z.|
复制

这次就可以很清楚地看到,数据都被刷盘了。那么这二者有什么区别?其实代码的注释就很清楚

    /*
     * Unless this is a shutdown checkpoint or we have been explicitly told,
     * we write only permanent, dirty buffers.  But at shutdown or end of
     * recovery, we write all dirty buffers.
     */
    if (!((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
                    CHECKPOINT_FLUSH_ALL))))
        mask |= BM_PERMANENT;
复制

也就是说,如果 checkpoint 的时候,其 flags 不是 CHECKPOINT_IS_SHUTDOWNCHECKPOINT_END_OF_RECOVERY 或者 CHECKPOINT_FLUSH_ALL,那么只会刷持久化的表。让我们验证一下:

(gdb) b CreateCheckPoint
Breakpoint 1 at 0x569430: file xlog.c, line 6827.
(gdb) signal SIGINT
Continuing with signal SIGINT.
Program received signal SIGINT, Interrupt.
0x00007f0ccd0ae0c3 in __epoll_wait_nocancel () from /lib64/libc.so.6
(gdb) c
Continuing.
Breakpoint 1, CreateCheckPoint (flags=108) at xlog.c:6827
6827    xlog.c: No such file or directory.
(gdb) p/x flags
$1 = 0x6c
(gdb) p flags
$2 = 108
复制

108 等于:

•CHECKPOINT_REQUESTED (0x0040 = 64)

•CHECKPOINT_WAIT (0x0020 = 32)

•CHECKPOINT_FORCE (0x0008 = 8)

•CHECKPOINT_IMMEDIATE (0x0004 = 4)

说明不带有 CHECKPOINT_IS_SHUTDOWN 或者 CHECKPOINT_FLUSH_ALL,对应到我们前面的现象,没有刷新 unlogged tables。再看看正常关机的情况 (注意此处需要用 SIGUSR2 的信号):

(gdb) b CreateCheckPoint
Breakpoint 1 at 0x569430: file xlog.c, line 6827.
(gdb) signal SIGUSR2
Continuing with signal SIGUSR2.
Program received signal SIGUSR2, User defined signal 2.
0x00007f0ccd0ae0c3 in __epoll_wait_nocancel () from /lib64/libc.so.6
(gdb) c
Continuing.
Breakpoint 1, CreateCheckPoint (flags=5) at xlog.c:6827
6827    xlog.c: No such file or directory.
(gdb) p/x flags
$1 = 0x5
复制

flags = 5 表示 CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE,所以对应到我们前面的现象,正常关机会进行刷盘。

另外上面细心的读者可能发现了,执行正常的 checkpoint,其大小也变为了同样的大小,我没有看具体代码,猜测是类似于打开 wal_init_zero 参数,调用 pg_pwrite_zeros,直接调用 mdextend -> FileWrite

/*
 * mdextend() -- Add a block to the specified relation.
 *
 * The semantics are nearly the same as mdwrite(): write at the
 * specified position.  However, this is to be used for the case of
 * extending a relation (i.e., blocknum is at or beyond the current
 * EOF).  Note that we assume writing a block beyond current EOF
 * causes intervening file space to become filled with zeroes.
 */
void
mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
         const void *buffer, bool skipFsync)
{
    off_t        seekpos;
    int            nbytes;
    MdfdVec    *v;
复制

见微知著

简单浏览了一下代码,CHECKPOINT_FLUSH_ALL 一共在两处进行了调用:

CreateDatabaseUsingFileCopy:Create a new database using the FILE_COPY strategy.

FILE_COPY 可以参照官网

movedb:对应到 ALTER DATABASE SET TABLESPACE

所以,对应到这些场景的时候,包括 unlogged tables,也需要进行刷盘,注意潜在的性能影响。

/*
 * Perform a checkpoint --- either during shutdown, or on-the-fly
 *
 * flags is a bitwise OR of the following:
 *    CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
 *    CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
 *    CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
 *        ignoring checkpoint_completion_target parameter.
 *    CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occurred
 *        since the last one (implied by CHECKPOINT_IS_SHUTDOWN or
 *        CHECKPOINT_END_OF_RECOVERY).
 *    CHECKPOINT_FLUSH_ALL: also flush buffers of unlogged tables.
复制

表的转换

还有需要注意的点在于,普通表和无日志表之间的转换

ALTER TABLE mytable SET UNLOGGED; -- cheap! 
ALTER TABLE mytable SET LOGGED; -- expensive!
复制

也就是将无日志表转为普通表的时候,会需要写大量的 REDO,不难理解,不然崩溃了就无法恢复数据了。反之则很快,只需要修改一下元数据即可。因此,这种行为也可以进行发散,直接通过修改元数据的方式,避免将无日志表转为普通表的时候写入大量 WAL,但是有什么幺蛾子就不好说了,比如 PITR,都没有这个表的 WAL,那自然恢复出来的数据也不一致了。

块结构

还有一点值得注意的细节是,块上面的 LSN

无日志表的 LSN 为 0,这意味着无日志表由其自己的数据文件中的数据表示,但不由日志文件中的数据表示。

小结

最后就用一张引用的图作为总结吧!

参考

https://www.crunchydata.com/blog/postgresl-unlogged-tables[1]

https://levelup.gitconnected.com/logged-unlogged-and-temporary-tables-in-postgresql-d390d9a4ef15[2]

https://blog.japinli.top/2022/11/postgresql-debug-checkpointer/[3]

https://github.com/digoal/blog/blob/master/202405/20240510_03.md[4]

「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论

吾亦可往
暂无图片
3月前
评论
暂无图片 0
在 PostgreSQL 中,支持多种类型的表 — 临时表、普通表,以及无日志表,顾名思义,"无日志"其优势在于不用记录 WAL,那么写入速度自然也杠杠的,同理,无日志表在备库上没有数据 (只有一个壳),也无法进行访问,会提示 ERROR: cannot access temporary or unlogged relations during recovery,pg_basebackup 的时候也会跳过无日志表 (除了 init 分支)。那么无日志表又有哪些鲜为人知的细节呢?
3月前
暂无图片 点赞
评论