pg_dump的原理与技巧，你学废了吗？

原创 xiongcc 2024-04-10

1270

前言

日常运维中，备份肯定少不了，那自然也少不了 pg_dump 的身影，常规备份手段也没啥好说的了，度娘一堆。那让我们聊些新鲜的，pg_dump 的原理和技巧。

备份格式

pg_dump 支持多种备份格式，文本格式 (plain) 当然是最方便的了，典型场景是需要恢复至另外一个库，如果相关用户也发生了改变，那么手动编辑文件，修改一下用户名，再使用 psql -f 的形式恢复。

另外一个常见技巧是 pg_dump | psql 搭配管道的形式，on the fly，即时恢复，不过要小心 OOM，库稍微大一点，很可能就 OOM 了。

tar 格式用的较少，并没有什么明显优势，默认会输出为一个 tar 包，需要自行解压。

Output a tar-format archive suitable for input into pg_restore. The tar format is compatible with the directory format: extracting a tar-format archive produces a valid directory-format archive. However, the tar format does not support compression. Also, when using tar format the relative order of table data items cannot be changed during restore.

不过 tar 格式有个有趣的地方，会生成一个额外的 restore.sql：

[postgres@mypg ~]$ file test_bk/*
test_bk/3460.dat:    ASCII text
test_bk/restore.sql: ASCII text
test_bk/toc.dat:     PostgreSQL custom database dump - v1.15-0
复制

pg_backup_tar.c 中有这么一段话，简而言之——restore.sql 这个文件没什么用，只是方便阅读，仅此而已。

 * This file is copied from the 'files' format file, but dumps data into
 * one temp file then sends it to the output TAR archive.
 *
 * The tar format also includes a 'restore.sql' script which is there for
 * the benefit of humans. This script is never used by pg_restore.
复制

至于 toc.dat，就是一个用于描述备份内容的文件，恢复期间，pg_restore 会读取该文件，创建相应的对象

[postgres@mypg ~]$ cat test_bk/toc.dat | strings | more
PGDMP
mydb
16.1
16.1
ENCODING
ENCODING
SET client_encoding = 'UTF8';
false
STDSTRINGS
STDSTRINGS
SET standa
...
...

TABLE DATA
COPY public.t1 (info) FROM stdin;
public
postgres
false
3460.dat
1259
159009
...
...
复制

其中当然也包括使用 COPY，从对应的 xxx.dat 导入相应的数据。这一点在前面的 restore.sql 文件中也有所体现：

--
-- Name: t1; Type: TABLE; Schema: public; Owner: postgres
--

CREATE TABLE public.t1 (
    info character varying(20)
);


ALTER TABLE public.t1 OWNER TO postgres;

--
-- Data for Name: t1; Type: TABLE DATA; Schema: public; Owner: postgres
--

COPY public.t1 (info) FROM stdin;
\.
COPY public.t1 (info) FROM '$$PATH$$/3460.dat';

--
-- Name: t1_info_idx; Type: INDEX; Schema: public; Owner: postgres
--

CREATE INDEX t1_info_idx ON public.t1 USING btree (info);
复制

custom 格式是类似的，一个常见技巧是，我们可以手动编辑描述文件，比如只恢复函数：

pg_dump -U username --format=c --schema-only -f dump_test your_database
pg_restore --list dump_test | grep FUNCTION > function_list    ---从list里面摘出来所有的函数
pg_restore -U username -d your_other_database -L function_list dump_test
复制

并行备份

其次并行备份，对于大库，吭哧吭哧单进程备份不知要备份到猴年马月，但是要使用并行，必须使用 Directory 的格式，即 pg_dump -Fd，Fd 是唯一支持并行的方式，每个表会生成一个文件。

The "directory" format is the only format that supports parallel dumps.

另外需要注意的是，使用 pg_restore 进行恢复的时候，还需要指定 -I 指定恢复索引，否则你恢复出来的对象，可能只有一张表，即

[postgres@xiongcc ~]$ pg_restore -t test_dump -d postgres -I test_dump_id_idx dump_dir/
[postgres@xiongcc ~]$ psql
psql (15beta1)
Type "help" for help.

postgres=# \d test_dump 
             Table "public.test_dump"
 Column |  Type   | Collation | Nullable | Default 
--------+---------+-----------+----------+---------
 id     | integer |           |          | 
 info   | text    |           |          | 
Indexes:
    "test_dump_id_idx" btree (id)
复制

但是有点遗憾，Fd 的格式不能像文本格式一样，on the fly，比如 pg_dump -Fd | pg_restore 的形式，必须落地，但是没关系，有一个工具就是基于此原理——pgcopydb，https://github.com/dimitri/pgcopydb，其以速度见长

pgcopydb is a tool that automates running pg_dump | pg_restore between two running Postgres servers. To make a copy of a database to another server as quickly as possible, one would like to use the parallel options of pg_dump and still be able to stream the data to as many pg_restore jobs.
Given that, pgcopydb then uses pg_dump and pg_restore for the schema parts of the process, and implements its own data copying multi-process streaming parts. Also, pgcopydb bypasses pg_restore index building and drives that internally so that all indexes may be built concurrently.

这个工具的原理就是基于并行导出，然后恢复，同时索引还会使用 concurrently 的方式创建，此外还会基于逻辑复制 + wal2json，进行 CDC，迁移完成之后还支持校验和比较。因此假如有 A 库迁移 B 库的需求，无疑，pgcopydb 最为高效的方式，Give it a try。

并行如何保证一致性

既然支持并行，那么多个 pg_dump 进程如何保证看到的数据均是一致的？肯定不能认为是"同一时刻"启动的，就想当然认为看到的数据是一致的。其实，pg_dump 借助了一个很有趣的技巧——pg_export_snapshot，leader 导出 snapshot id，其他多个事务便可以共用一个快照，以此确保不同事务间数据的一致性。

基于此，我们还可以实现一些其他高级技巧。针对传统 Fd 并行方式，仅仅支持的是不同的表，当然也可以是分区表，因为 Fd 的原理是多个备份进程同时备份不同的表，那假如现在有一个很大的表，又没有分区该怎么办？这个时候，Fd 就无济于事了，但是！但是，我们可以借助 pg_export_snapshot，实现高效伪并行备份，举个栗子，我们可以这样做：

test=# start transaction isolation level repeatable read;
START TRANSACTION
test=*# set transaction snapshot '00000004-000BF714-1';
SET    
test=*# copy (select * from pgbench_accounts where aid <= 12500000) to stdout;    
复制

借助类似的行为，多开几个事务，分别备份其中一段数据，第二个事务备份 12500000 ~ 25000000，第三个事务备份 25000000 ~... 最后，使用 pg_dump 指定快照，同时排除这个特别大的表

pg_dump \
  -Fd \
  -j2 \
  -f ./everything_but_pgba_data.dump \
  --snapshot="00000004-000BF714-1" \
  --exclude-table-data="pgbench_accounts" \
  test 
复制

这样，就既可以实现特别大的单独并行备份，同时利用 Fd 并行备份其他表，真是一个极佳的操作技巧。

备份期间有啥坑

备份期间，为了确保数据一致性，pg_dump 使用的是 SI 隔离级别，在 Art of PostgreSQL 中，有这么一段话

From PostgreSQL version 9.1 onward, pg_dump uses the isolation level serializable. It used to be repeatable read until SSI implementation… more on that later.
从 PostgreSQL 版本 9.1 开始，pg_dump 使用隔离级别可序列化。在实现 SSI 之前，它一直是可重复读取的。

代码中也有这么一段逻辑，开启事务

 /*
  * To support the combination of serializable_deferrable with the jobs
  * option we use REPEATABLE READ for the worker connections that are
  * passed a snapshot.  As long as the snapshot is acquired in a
  * SERIALIZABLE, READ ONLY, DEFERRABLE transaction, its use within a
  * REPEATABLE READ transaction provides the appropriate integrity
  * guarantees.  This is a kluge, but safe for back-patching.
  */
 if (dopt->serializable_deferrable && AH->sync_snapshot_id == NULL)
  ExecuteSqlStatement(AH,
       "SET TRANSACTION ISOLATION LEVEL "
       "SERIALIZABLE, READ ONLY, DEFERRABLE");
 else
  ExecuteSqlStatement(AH,
       "SET TRANSACTION ISOLATION LEVEL "
       "REPEATABLE READ, READ ONLY");
复制

所以备份期间，你还是可以正常读写的，不用担心看到的数据不一致，因为是一致性备份。但如果你在备库进行备份的话，那么则还是 RR，因为目前备库还不支持可串行化隔离级别。

The Serializable transaction isolation level is not yet available in hot standby.

其次，在备库备份还是会遇到类似的流复制冲突问题，主库 vacuum 了，清除了备库备份需要的死元组，就有可能导致备库备份提示 User query might have needed to see row versions that must be removed.，解决办法还是那三板斧，max_standby_streaming_delay + hot_standby_feedback，但是也要小心可能的表膨胀，因为在备份期间 xmin 都是不变的。

另外，pg_dump 还支持 --serializable-deferrable wait until the dump can run without anomalies，指定 deferrable，为了避免只读事务可能导致损害数据一致性的异常情况，PostgreSQL 提供了一个有趣的解决方案：可以推迟此事务，直到其执行变得安全，这也是 SELECT 语句唯一可能被更新阻塞的情况，具体原理可以参照《PostgreSQL 14 intenal》，这本书我已经完工 95%了，快了快了。

锁

另外一个容易掉坑的地方，便是锁了。pg_dump 会在备份最开始，给所有需要的表加上 AccessShareLock，1 级锁，大多数情况下，这是没有问题的，因为 1 级锁不会阻塞正常 DML。但是这会阻塞 DDL，比如 truncate，alter table，因此假如你的业务是一些 ETL/BI 类型的，经常会做一些 truncate，那么就很尴尬了，在备份期间，是无法执行 DDL 的。解决办法当然也有，使用临时表，因为临时表并不会进行备份，自然也不会加锁。

这一点其他备份工具，比如 gpbackup，其行为也是类似的，不过在 Greenplum 里面，只支持 RC 和 RR，备份期间开启 RR，也是一样给所有的表加 AccessShareLock。不过，Greenplum 要考虑的更多，比如要避免死锁，感兴趣的可以阅读下这篇文章 https://greenplum.org/improving-backup-performance-and-reliability-with-distributed-snapshots/

Since worker 0 collected all the ACCESS SHARE locks at the beginning of the backup, it can safely issue the commands without concern for deadlock.

// Set synchronized snapshot for connNum to snapshotId
func SetSynchronizedSnapshot(connectionPool *dbconn.DBConn, connNum int,  snapshotId string) error {
 if connectionPool.Tx[connNum] == nil {
  err := connectionPool.Begin(connNum) // Begins transaction in repeatable read
  if err != nil {
   return err
  }
 }
 _, err := connectionPool.Exec(fmt.Sprintf("SET TRANSACTION SNAPSHOT '%s'", snapshotId), connNum)
 if err != nil {
  return err
 }
 gplog.Debug("Worker %d: Setting synchronized snapshot to %s", connNum, snapshotId)
 return nil
}
复制

小结

以上，便是 pg_dump 的种种技巧与原理了，希望各位读者阅读以后，能够理解更加透彻。

参考

https://greenplum.org/improving-backup-performance-and-reliability-with-distributed-snapshots/

https://www.postgresql.org/docs/current/app-pgdump.html

https://developer.aliyun.com/article/647432

How to speed up pg_dump when dumping large Postgres databases

pg_dump的原理与技巧，你学废了吗？

前言

备份格式

并行备份

并行如何保证一致性

备份期间有啥坑

锁

小结

参考

推荐阅读

Feel free to contact me

评论