暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

PostgreSQL 最佳实践 - 块级增量备份(ZFS篇)单个数据库采用多个zfs卷(如表空间)时如何一致性备份

digoal 2016-08-23
443
TAG 24

作者

digoal

日期

2016-08-23

标签

PostgreSQL , 增量备份 , 块级 , COW , 写时复制 , zfs , clone , snapshot , 快照


背景

当我们使用了多个ZFS卷或者文件系统时,如果一个实例的多个部分,如表空间,放在了不同的zfs上,再使用基于ZFS快照的备份时,可能出现多个文件系统不一致的情况。

例如控制文件是新的,但是数据是旧的。

保物理备份的一致性检查

基于文件的物理备份,为了保证备份的一致性,在备份开始时,需要做一个检查点,同时打开FULL PAGE WRTIE,同时还会生成backup_label文件记录备份开始时的WAL文件,检查点位置等信息。

backup_label文件内容示例

START WAL LOCATION: 42/EB000060 (file 0000000100000042000000EB) CHECKPOINT LOCATION: 42/EB000098 BACKUP METHOD: pg_start_backup BACKUP FROM: master START TIME: 2016-08-23 19:43:32 CST LABEL: a

同时backup_label也指出了当前备份是从主节点还是从 standby备份的

使用read_backup_label解析backup_label文件

优先从backup_label中得到备份开始的WAL位置,检查点的WAL位置,进行恢复。

而不是控制文件的信息,因为控制文件可能在备份过程中,经过检查点后,发生了变化。

```
/
* read_backup_label: check to see if a backup_label file is present
*
* If we see a backup_label during recovery, we assume that we are recovering
* from a backup dump file, and we therefore roll forward from the checkpoint
* identified by the label file, NOT what pg_control says. This avoids the
* problem that pg_control might have been archived one or more checkpoints
* later than the start of the dump, and so if we rely on it as the start
* point, we will fail to restore a consistent database state.
*
* Returns TRUE if a backup_label was found (and fills the checkpoint
* location and its REDO location into
checkPointLoc and RedoStartLSN,
* respectively); returns FALSE if not. If this backup_label came from a
* streamed backup, backupEndRequired is set to TRUE. If this backup_label
* was created during recovery,
backupFromStandby is set to TRUE.
/
static bool
read_backup_label(XLogRecPtr
checkPointLoc, bool backupEndRequired,
bool
backupFromStandby)
{
char startxlogfilename[MAXFNAMELEN];
TimeLineID tli;
FILE *lfp;
char ch;
char backuptype[20];
char backupfrom[20];
uint32 hi,
lo;

    *backupEndRequired = false;      
    *backupFromStandby = false;

    /*      
     * See if label file is present      
     */      
    lfp = AllocateFile(BACKUP_LABEL_FILE, "r");      
    if (!lfp)      
    {      
            if (errno != ENOENT)      
                    ereport(FATAL,      
                                    (errcode_for_file_access(),      
                                     errmsg("could not read file \"%s\": %m",      
                                                    BACKUP_LABEL_FILE)));      
            return false;                   /* it's not there, all is fine */      
    }

    /*      
     * Read and parse the START WAL LOCATION and CHECKPOINT lines (this code      
     * is pretty crude, but we are not expecting any variability in the file      
     * format).      
     */      
    if (fscanf(lfp, "START WAL LOCATION: %X/%X (file %08X%16s)%c",      
                       &hi, &lo, &tli, startxlogfilename, &ch) != 5 || ch != '\n')      
            ereport(FATAL,      
                            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),      
                             errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));      
    RedoStartLSN = ((uint64) hi) << 32 | lo;      
    if (fscanf(lfp, "CHECKPOINT LOCATION: %X/%X%c",      
                       &hi, &lo, &ch) != 3 || ch != '\n')      
            ereport(FATAL,      
                            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),      
                             errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));      
    *checkPointLoc = ((uint64) hi) << 32 | lo;

    /*      
     * BACKUP METHOD and BACKUP FROM lines are new in 9.2. We can't restore      
     * from an older backup anyway, but since the information on it is not      
     * strictly required, don't error out if it's missing for some reason.      
     */      
    if (fscanf(lfp, "BACKUP METHOD: %19s\n", backuptype) == 1)      
    {      
            if (strcmp(backuptype, "streamed") == 0)      
                    *backupEndRequired = true;      
    }

    if (fscanf(lfp, "BACKUP FROM: %19s\n", backupfrom) == 1)      
    {      
            if (strcmp(backupfrom, "standby") == 0)      
                    *backupFromStandby = true;      
    }

    if (ferror(lfp) || FreeFile(lfp))      
            ereport(FATAL,      
                            (errcode_for_file_access(),      
                             errmsg("could not read file \"%s\": %m",      
                                            BACKUP_LABEL_FILE)));

    return true;

}
```

两种物理备份模式

物理备份分exclusive和non-exclusive两种模式。

* There are two kind of backups: exclusive and non-exclusive. An exclusive * backup is started with pg_start_backup(), and there can be only one active * at a time. The backup and tablespace map files of an exclusive backup are * written to $PGDATA/backup_label and $PGDATA/tablespace_map, and they are * removed by pg_stop_backup(). * * A non-exclusive backup is used for the streaming base backups (see * src/backend/replication/basebackup.c). The difference to exclusive backups * is that the backup label and tablespace map files are not written to disk. * Instead, their would-be contents are returned in *labelfile and *tblspcmapfile, * and the caller is responsible for including them in the backup archive as * 'backup_label' and 'tablespace_map'. There can be many non-exclusive backups * active at the same time, and they don't conflict with an exclusive backup * either. * * tblspcmapfile is required mainly for tar format in windows as native windows * utilities are not able to create symlinks while extracting files from tar. * However for consistency, the same is used for all platforms. * * needtblspcmapfile is true for the cases (exclusive backup and for * non-exclusive backup only when tar format is used for taking backup) * when backup needs to generate tablespace_map file, it is used to * embed escape character before newline character in tablespace path. * * Returns the minimum WAL position that must be present to restore from this * backup, and the corresponding timeline ID in *starttli_p. * * Every successfully started non-exclusive backup must be stopped by calling * do_pg_stop_backup() or do_pg_abort_backup(). * * It is the responsibility of the caller of this function to verify the * permissions of the calling user!

exclusive模式是使用pg_statrt_backup()函数调用产生的,这个函数只能在主节点调用

在备节点调用报错如下

XLogRecPtr do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p, char **labelfile, DIR *tblspcdir, List **tablespaces, char **tblspcmapfile, bool infotbssize, bool needtblspcmapfile) { bool exclusive = (labelfile == NULL); ... /* * Currently only non-exclusive backup can be taken during recovery. */ if (backup_started_in_recovery && exclusive) ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("recovery is in progress"), errhint("WAL control functions cannot be executed during recovery.")));

在standby备份如何保证输出backup_label

在standby备份,不能通过直接调用pg_start_backup()函数来生成backup_label文件,但是PG提供了pg_basebackup命令,是以non-exclusive的模式进行备份的。

labelfile非空,所以不会触发以上告警。

startptr = do_pg_start_backup(opt->label, opt->fastcheckpoint, &starttli, &labelfile, tblspcdir, &tablespaces, &tblspc_map_file, opt->progress, opt->sendtblspcmapfile);

这个labelfile是在最开始的时候备份的,最后备份控制文件。

```
if (ti->path == NULL)
{
struct stat statbuf;

                            /* In the main tar, include the backup_label first... */      
                            sendFileWithContent(BACKUP_LABEL_FILE, labelfile);

                            /*      
                             * Send tablespace_map file if required and then the bulk of      
                             * the files.      
                             */      
                            if (tblspc_map_file && opt->sendtblspcmapfile)      
                            {      
                                    sendFileWithContent(TABLESPACE_MAP, tblspc_map_file);      
                                    sendDir(".", 1, false, tablespaces, false);      
                            }      
                            else      
                                    sendDir(".", 1, false, tablespaces, true);

                            /* ... and pg_control after everything else. */      
                            if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)      
                                    ereport(ERROR,      
                                                    (errcode_for_file_access(),      
                                                     errmsg("could not stat control file \"%s\": %m",      
                                                                    XLOG_CONTROL_FILE)));      
                            sendFile(XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf, false);      
                    }      
                    else      
                            sendTablespace(ti->path, false);

```

通过以上方法,我们备份时的检查点信息得以保存在backup_label中,因此pg_basebackup的备份集,在恢复时可以拿到正确的WAL位置进行恢复,而不需要用到控制文件的检查点位置。

因为控制文件可能不是最初的位置。

zfs快照, 如何保证多个zfs文件系统的一致性

其实从pg_basebackup的备份步骤我们就能得知,备份集需要一个最老的检查点位置。

那我们不使用pg_basebackup,如何能保证在整个备份过程中控制文件是最老的呢?

因此最靠谱的方法是,优先给控制文件所在的ZFS打快照,然后再给其他ZFS文件系统打快照,这样就很好的解决了需要最老的控制文件的这个问题。

例子

回到我写的《PostgreSQL 最佳实践 - 块级增量备份(ZFS篇)方案与实战》

https://yq.aliyun.com/articles/59363

以上案例中一个standby集群用的是一个ZFS(所有表空间, 以及$PGDATA), 当数据库集群使用多个ZFS时, 因为快照只能基于单个volume或filesystem产生, 所以当我们使用了多个zfs filesystem是要得到数据库一致的备份, 怎么办呢?

例如 :

1. $PGDATA使用文件系统 zfs/pg_root

2. tbs1使用文件系统 zfs/tbs1

3. tbs2使用文件系统 zfs/tbs2

4. pg_xlog使用文件系统 zfs/pg_xlog

要使用zfs snapshot来备份这个PostgreSQL, 应该怎么做?

前面已经从代码中分析了,我们需要最老的控制文件即可。

首选创建控制文件所在zfs的快照, 再创建其他文件系统的快照.

pg_xlog的快照可以不创建, 而通过归档来恢复.

那么就比较清晰了, 以上快照的步骤如下 :

STIME=`date +%F%T` 1. zfs snapshot zp1/pg_root@$STIME 2. zfs snapshot zp1/tbs1@$STIME 3. zfs snapshot zp1/tbs2@$STIME

验证

验证一下这种方法的有效性

还记得我以前写过一篇使用老的控制文件来恢复一个异常的standby吗?

某数据库的流复制standby因为主库产生的XLOG过多, 延迟后触发了recovery.conf的restore命令, 但是restore里面用了sudo 进行copy并未保持源文件的owner, 属性等特征, 使用了root owner, 导致COPY完的xlog不能被postgresql 正常读取.

同事在处理这个事情, 因为没有了解实际情况, 上去就使用了pg_resetxlog修改standby的nextXID(resetxlog在这个场景属于扯淡的操作, 千万不要这么干), 接着大家知道的, standby无法正常完成standby的工作了.

这个数据库有10几个T, 要重新做standby的话, 拷贝的数据量太大了, 即使使用rsync重做, 工作量也比较大(表空间过多, 目录过多).

有没有省事的方法呢?

首先来分析一下事件,

1\. xlog不能被postgresql 正常读取(这个很好修复, 修改一下restore command就可以了, 或者不要用sudo 来拷贝).

2014-11-30 10:05:54.505 CST,,,3695,,5435f48b.e6f,6,,2014-10-09 10:35:55 CST,1/0,0,LOG,00000,"restored log file ""0000000300002A9D0000004D"" from archive",,,,,,,,"RestoreArchivedFile, xlogarchive.c:254","" 2014-11-30 10:05:54.534 CST,,,3695,,5435f48b.e6f,7,,2014-10-09 10:35:55 CST,1/0,0,PANIC,42501,"could not open file ""pg_xlog/0000000300002A9D0000004D"": Permission denied",,,,,,,,"XLogFileRead, xlog.c:2696",""

2\. pg_resetxlog对控制文件造成了持久性伤害, 没有办法修复.

pg_resetxlog -x 90010254 -f $PGDATA

为了快速恢复, 可以找到更早的控制文件, 刚好这套数据库的备份系统使用的是ZFS快照做的, 每天会创建一个快照, 所以可以取早些时间的快照, 拿到控制文件, 替换掉standby被人为"破坏"的控制文件, 启动standby后, 将从控制文件开始恢复需要的xlog. (只要这些xlog归档还在就可以了).

操作过程 :

1\. shutdown standby

2\. mount zfs snapshot old then standby crashed time.

3\. copy $PGDATA/global/pg_control from snapshot to standby.

4\. startup standby.

5\. umount snapshot.

注意, 拿到的控制文件必须是standby crash之前的控制文件, 并且控制文件至今的所有xlog归档必须都在. postgresql standby将从控制文件需要的xlog开始恢复数据块. 所以这么做是完全可以的.

现在standby数据库已经完全恢复了

882 postgres 20 0 2337m 15m 14m S 0.0 0.1 0:00.10 /opt/pgsql9.3.2/bin/postgres 883 postgres 20 0 157m 1072 476 S 0.0 0.0 0:00.14 postgres: logger process 884 postgres 20 0 2338m 2.0g 2.0g S 0.0 8.6 5:16.35 postgres: startup process recovering 0000000300002AA50000003F 889 postgres 20 0 2338m 2.0g 2.0g S 0.0 8.6 0:35.69 postgres: checkpointer process 890 postgres 20 0 2338m 2.0g 2.0g S 0.0 8.6 0:34.98 postgres: writer process 905 postgres 20 0 159m 1212 456 S 0.0 0.0 0:02.62 postgres: stats collector process 1922 postgres 20 0 105m 1632 1180 S 0.0 0.0 0:00.01 -bash 1959 postgres 20 0 137m 1764 1292 S 0.0 0.0 0:00.00 psql 1960 postgres 20 0 2340m 6780 4684 S 0.0 0.0 0:00.01 postgres: postgres postgres [local] idle 4309 postgres 20 0 2346m 7080 1748 S 0.0 0.0 0:00.67 postgres: wal receiver process streaming 2AA5/3F0429C0

PostgreSQL 许愿链接

您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森.

9.9元购买3个月阿里云RDS PostgreSQL实例

PostgreSQL 解决方案集合

德哥 / digoal's github - 公益是一辈子的事.

digoal's wechat

文章转载自digoal,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论