前言
在 PostgreSQL 中,支持多种类型的表 — 临时表、普通表,以及无日志表,顾名思义,"无日志"其优势在于不用记录 WAL,那么写入速度自然也杠杠的,同理,无日志表在备库上没有数据 (只有一个壳),也无法进行访问,会提示 ERROR: cannot access temporary or unlogged relations during recovery
,pg_basebackup 的时候也会跳过无日志表 (除了 init 分支)。那么无日志表又有哪些鲜为人知的细节呢?
分析
首先,让我们思考一下,无日志表的数据会落盘吗?或者说,按照常识,执行正常检查点的时候会落盘吗?
However, they are not crash-safe: an unlogged table is automatically truncated after a crash or unclean shutdown. The contents of an unlogged table are also not replicated to standby servers. Any indexes created on an unlogged table are automatically unlogged as well.
根据官网解释,在崩溃或者非正常关闭的时候,表文件会被截断,也就是说,在正常关闭的情况下,数据是会正常落盘的,让我们验证一下:
postgres=# create unlogged table t1(id int,info text); CREATE TABLE postgres=# insert into t1 select n,md5(random()::text) from generate_series(1,10000) as n; INSERT 0 10000 postgres=# select pg_relation_filepath('t1'); pg_relation_filepath ---------------------- base/5/51158 (1 row) postgres=# checkpoint ; CHECKPOINT [postgres@mypg 5]$ ls -lrth 51158* -rw------- 1 postgres postgres 0 Dec 7 15:04 51158_init -rw------- 1 postgres postgres 24K Dec 7 15:04 51158_fsm -rw------- 1 postgres postgres 672K Dec 7 15:04 51158 -rw------- 1 postgres postgres 8.0K Dec 7 15:04 51158_vm [postgres@mypg 5]$ hexdump -C 51158 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| * 000a8000
复制
hexdump 会将相同的都是 0 的行标记为 *,用 vim 打开也确实全是 0。
0000000: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 0000010: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 0000020: 0000 0000 0000 0000 0000 0000 0000 0000 ................ ....
复制
让我们正常关闭,再观察一下
postgres=# \q [postgres@mypg ~]$ pg_ctl -D 17data stop waiting for server to shut down.... done server stopped [postgres@mypg 5]$ hexdump -C 51158 | head -n 10 00000000 00 00 00 00 00 00 00 00 00 00 04 00 f8 01 00 02 |................| 00000010 00 20 04 20 00 00 00 00 c0 9f 7a 00 80 9f 7a 00 |. . ......z...z.| 00000020 40 9f 7a 00 00 9f 7a 00 c0 9e 7a 00 80 9e 7a 00 |@.z...z...z...z.| 00000030 40 9e 7a 00 00 9e 7a 00 c0 9d 7a 00 80 9d 7a 00 |@.z...z...z...z.| 00000040 40 9d 7a 00 00 9d 7a 00 c0 9c 7a 00 80 9c 7a 00 |@.z...z...z...z.| 00000050 40 9c 7a 00 00 9c 7a 00 c0 9b 7a 00 80 9b 7a 00 |@.z...z...z...z.| 00000060 40 9b 7a 00 00 9b 7a 00 c0 9a 7a 00 80 9a 7a 00 |@.z...z...z...z.| 00000070 40 9a 7a 00 00 9a 7a 00 c0 99 7a 00 80 99 7a 00 |@.z...z...z...z.| 00000080 40 99 7a 00 00 99 7a 00 c0 98 7a 00 80 98 7a 00 |@.z...z...z...z.| 00000090 40 98 7a 00 00 98 7a 00 c0 97 7a 00 80 97 7a 00 |@.z...z...z...z.|
复制
这次就可以很清楚地看到,数据都被刷盘了。那么这二者有什么区别?其实代码的注释就很清楚
/* * Unless this is a shutdown checkpoint or we have been explicitly told, * we write only permanent, dirty buffers. But at shutdown or end of * recovery, we write all dirty buffers. */ if (!((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_FLUSH_ALL)))) mask |= BM_PERMANENT;
复制
也就是说,如果 checkpoint 的时候,其 flags 不是 CHECKPOINT_IS_SHUTDOWN
、CHECKPOINT_END_OF_RECOVERY
或者 CHECKPOINT_FLUSH_ALL
,那么只会刷持久化的表。让我们验证一下:
(gdb) b CreateCheckPoint Breakpoint 1 at 0x569430: file xlog.c, line 6827. (gdb) signal SIGINT Continuing with signal SIGINT. Program received signal SIGINT, Interrupt. 0x00007f0ccd0ae0c3 in __epoll_wait_nocancel () from /lib64/libc.so.6 (gdb) c Continuing. Breakpoint 1, CreateCheckPoint (flags=108) at xlog.c:6827 6827 xlog.c: No such file or directory. (gdb) p/x flags $1 = 0x6c (gdb) p flags $2 = 108
复制
108 等于:
•CHECKPOINT_REQUESTED (0x0040 = 64)
•CHECKPOINT_WAIT (0x0020 = 32)
•CHECKPOINT_FORCE (0x0008 = 8)
•CHECKPOINT_IMMEDIATE (0x0004 = 4)
说明不带有 CHECKPOINT_IS_SHUTDOWN
或者 CHECKPOINT_FLUSH_ALL
,对应到我们前面的现象,没有刷新 unlogged tables。再看看正常关机的情况 (注意此处需要用 SIGUSR2 的信号):
(gdb) b CreateCheckPoint Breakpoint 1 at 0x569430: file xlog.c, line 6827. (gdb) signal SIGUSR2 Continuing with signal SIGUSR2. Program received signal SIGUSR2, User defined signal 2. 0x00007f0ccd0ae0c3 in __epoll_wait_nocancel () from /lib64/libc.so.6 (gdb) c Continuing. Breakpoint 1, CreateCheckPoint (flags=5) at xlog.c:6827 6827 xlog.c: No such file or directory. (gdb) p/x flags $1 = 0x5
复制
flags = 5
表示 CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE
,所以对应到我们前面的现象,正常关机会进行刷盘。
另外上面细心的读者可能发现了,执行正常的 checkpoint,其大小也变为了同样的大小,我没有看具体代码,猜测是类似于打开 wal_init_zero 参数,调用 pg_pwrite_zeros,直接调用 mdextend -> FileWrite
/* * mdextend() -- Add a block to the specified relation. * * The semantics are nearly the same as mdwrite(): write at the * specified position. However, this is to be used for the case of * extending a relation (i.e., blocknum is at or beyond the current * EOF). Note that we assume writing a block beyond current EOF * causes intervening file space to become filled with zeroes. */ void mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, const void *buffer, bool skipFsync) { off_t seekpos; int nbytes; MdfdVec *v;
复制
见微知著
简单浏览了一下代码,CHECKPOINT_FLUSH_ALL
一共在两处进行了调用:
CreateDatabaseUsingFileCopy:Create a new database using the FILE_COPY strategy.
FILE_COPY 可以参照官网
movedb:对应到 ALTER DATABASE SET TABLESPACE
所以,对应到这些场景的时候,包括 unlogged tables,也需要进行刷盘,注意潜在的性能影响。
/* * Perform a checkpoint --- either during shutdown, or on-the-fly * * flags is a bitwise OR of the following: * CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown. * CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery. * CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP, * ignoring checkpoint_completion_target parameter. * CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occurred * since the last one (implied by CHECKPOINT_IS_SHUTDOWN or * CHECKPOINT_END_OF_RECOVERY). * CHECKPOINT_FLUSH_ALL: also flush buffers of unlogged tables.
复制
表的转换
还有需要注意的点在于,普通表和无日志表之间的转换
ALTER TABLE mytable SET UNLOGGED; -- cheap! ALTER TABLE mytable SET LOGGED; -- expensive!
复制
也就是将无日志表转为普通表的时候,会需要写大量的 REDO,不难理解,不然崩溃了就无法恢复数据了。反之则很快,只需要修改一下元数据即可。因此,这种行为也可以进行发散,直接通过修改元数据的方式,避免将无日志表转为普通表的时候写入大量 WAL,但是有什么幺蛾子就不好说了,比如 PITR,都没有这个表的 WAL,那自然恢复出来的数据也不一致了。
块结构
还有一点值得注意的细节是,块上面的 LSN

无日志表的 LSN 为 0,这意味着无日志表由其自己的数据文件中的数据表示,但不由日志文件中的数据表示。
小结
最后就用一张引用的图作为总结吧!

参考
https://www.crunchydata.com/blog/postgresl-unlogged-tables[1]
https://levelup.gitconnected.com/logged-unlogged-and-temporary-tables-in-postgresql-d390d9a4ef15[2]
https://blog.japinli.top/2022/11/postgresql-debug-checkpointer/[3]
https://github.com/digoal/blog/blob/master/202405/20240510_03.md[4]
评论
