1、背景
数据库备份恢复是数据安全的最后一道防线,直接影响企业生存能力。它不仅是一项技术措施,更是风险管理、合规运营和客户信任的基石。忽视备份等同于将企业置于“数据裸奔”的风险中,一旦灾难发生,后果可能是毁灭性的。生产系统中一定要做好数据库备份!!!
正好项目上同事问磐维集中式数据库怎么实现远程备份,趁周末有空在虚拟机里面做了实验,现在把整个步骤分享出来,希望对大家有所帮助!
2、环境说明
主机名 | ip地址 | OS版本 | 内存、CPU | 安装软件 | 用途 |
---|---|---|---|---|---|
node1 | 192.*.*.60 | Centos7.9 | 4G 、 1个双核 | PanWeiDB_V2.0-S3.1.0_B01数据库 | 数据库主节点 |
node2 | 192.*.*.62 | Centos7.9 | 4G 、 1个双核 | PanWeiDB_V2.0-S3.1.0_B01数据库 | 数据库备节点 |
node3 | 192.*.*.64 | Centos7.9 | 4G 、 1个双核 | PanWeiDB_V2.0-S3.1.0_B01数据库 | 数据库备节点 |
pubnode | 192.*.*.100 | Centos7.9 | 3G 、 1个双核 | PanWeiDB_V2.0-S3.1.0_B01 工具包 | 备份机 |
3、数据库参数检查、设置
postgres=# show archive_mode;
archive_mode
--------------
on
(1 row)
# 如果archive_mode参数的值不是'on',登录主库用下面命令设置
# alter system set archive_mode=on;
postgres=# show archive_dest;
archive_dest
--------------
/archive
(1 row)
# 如果archive_dest参数没设置,登录主库用下面命令设置
# alter system set archive_dest = '/archive' ;
postgres=# show wal_sender_timeout ;
wal_sender_timeout
--------------------
10s
(1 row)
#如果wal_sender_timeout参数的值不是10s,登录主库用下面命令设置
# alter system set wal_sender_timeout = '10s' ;
postgres=# show session_timeout;
session_timeout
-----------------
0
(1 row)
# 如果session_timeout参数的值不是10s,登录主库用下面命令设置
# alter user omm set session_timeout to 0;
复制
4、远程备份实施过程
4.1、安装tools工具包(备份机上执行)
#root用户
mkdir /tools
groupadd -g 1101 dbgrp
useradd -g dbgrp -u 1101 -m omm
echo "omm密码" | passwd --stdin omm
chown -R omm:dbgrp /tools
#omm用户
su - omm
tar -zxf PanWeiDB_V2.0-S3.1.0_B01-tools-centos_7-x86_64.tar.gz -C /tools
cat >>.bashrc<<EOF
export PATH=/tools/bin:$PATH
export LD_LIBRARY_PATH=/tools/lib:/tools/lib/postgresql:$LD_LIBRARY_PATH
EOF
复制
4.2、创建备份目录、初始化备份路径(备份机上执行)
[root@pubnode ~]# mkdir /backup
[root@pubnode ~]# chown -R omm:dbgrp /backup
[root@pubnode ~]# su - omm
Last login: Sat Apr 13 16:12:08 CST 2025 on pts/0
[omm@pubnode ~]$ gs_probackup init -B /backup/probackup
INFO: Backup catalog '/backup/probackup' successfully inited
# 创建备份日志目录
[omm@pubnode ~]$ mkdir /backup/probackup/log/
[omm@pubnode ~]$
复制
4.3、设置备份机和数据库主机之间的ssh免密(备份机上执行)
#备份机
[omm@pubnode ~]$ mkdir ~/.ssh
[omm@pubnode ~]$ chmod 700 ~/.ssh
[omm@pubnode ~]$ ssh-keygen -t rsa
将公钥文件上传至同集群所有节点(包括本节点),即可实现免密登录(此操作需输入密码):
[omm@pubnode ~]$ ssh-copy-id omm@192.*.*.60
[omm@pubnode ~]$ ssh-copy-id omm@192.*.*.100
复制
4.4、创建备份实例(备份机上执行)
[omm@pubnode ~]$ gs_probackup add-instance -B /backup/probackup --instance=panweidb -D /data/panweidb/data --remote-host=192.*.*.60 --remote-user=omm --remote-port=22 --retention-redundancy=3 --retention-window=0
LOG: Start SSH client process, pid 3411
INFO: Instance 'panweidb' successfully inited
复制
4.5、备份数据库
注意:omm用户不允许远程连接,因此需要创建testusr用户进行备份,并设置白名单。
4.5.1、在数据库主机上创建备份用户、设置白名单:
#创建备份用户testusr
[omm@node1 ~]$ gsql -r
gsql ((PanWeiDB_V2.0-S3.1.0_B01) compiled at 2024-12-19 11:44:29 commit 5d08dc9 )
Non-SSL connection (SSL connection is recommended when requiring high-security)
Type "help" for help.
postgres=# create user testusr password '******';
CREATE ROLE
postgres=# \q
#设置白名单,允许备份机能连接上数据库
[omm@node1 ~]$ gs_guc reload -I all -N all -h "host all all 192.*.*.100/32 sha256"
复制
4.5.2、备份机上进行数据库备份
[omm@pubnode ~]$ gs_probackup backup --backup-path=/backup/probackup --instance=panweidb --backup-mode=full --delete-expired --remote-host=192.*.*.60 --remote-user=omm --remote-port=22 --pgdatabase=postgres -U testusr -W'******' -p 17700 --progress
INFO: Backup start, pw_probackup version: 2.4.2, instance: panweidb, backup ID: SUNGJH, backup mode: FULL, wal mode: STREAM, remote: true, compress-algorithm: none, compress-level: 1
LOG: Backup destination is initialized
ERROR: could not connect to database postgres: FATAL: Normal user is not allowed to use HA channel!
WARNING: Backup SUNGJH is running, setting its status to ERROR
复制
备份报错:ERROR: could not connect to database postgres: FATAL: Normal user is not allowed to use HA channel!
4.5.3、授予备份用户sysadmin权限,并再次备份
#在数据库主机上执行
postgres=# alter user testusr sysadmin;
ALTER ROLE
postgres=# alter user testusr set session_timeout to 0;
ALTER ROLE
#在备份机上再次备份:
[omm@pubnode ~]$ gs_probackup backup --backup-path=/backup/probackup --instance=panweidb --backup-mode=full --delete-expired --remote-host=192.*.*.60 --remote-user=omm --remote-port=22 --pgdatabase=postgres -U testusr -W'******' -p 17700
INFO: Backup start, pw_probackup version: 2.4.2, instance: panweidb, backup ID: SUNH9P, backup mode: FULL, wal mode: STREAM, remote: true, compress-algorithm: none, compress-level: 1
LOG: Backup destination is initialized
LOG: This PanWeiDB instance was initialized with data block checksums. Data block corruption will be detected
LOG: Start SSH client process, pid 7298
LOG: Database backup start
ERROR: could not connect to database postgres: FATAL: no pg_hba.conf entry for replication connection from host "192.*.*.100", user "testusr", SSL on
FATAL: no pg_hba.conf entry for replication connection from host "192.*.*.100", user "testusr", SSL off
WARNING: backup in progress, stop backup
INFO: wait for pg_stop_backup()
INFO: pg_stop backup() successfully executed
WARNING: Backup SUNH9P is running, setting its status to ERROR
复制
备份报错:FATAL: no pg_hba.conf entry for replication connection from host “192...100”, user “testusr”, SSL off
4.5.4、数据库主机上添加replication连接的白名单,再次备份
gs_guc reload -I all -N all -h "host replication all 192.*.*.100/32 sha256"
复制
再次备份:
[omm@pubnode wal]$ gs_probackup backup --backup-path=/backup/probackup --instance=panweidb --backup-mode=full --delete-expired --remote-host=192.*.*.60 --remote-user=omm --remote-port=22 --pgdatabase=postgres -U testusr -W'******' -p 17700
INFO: Backup start, pw_probackup version: 2.4.2, instance: panweidb, backup ID: SUNHQA, backup mode: FULL, wal mode: STREAM, remote: true, compress-algorithm: none, compress-level: 1
LOG: Backup destination is initialized
LOG: This PanWeiDB instance was initialized with data block checksums. Data block corruption will be detected
LOG: Start SSH client process, pid 7394
LOG: Database backup start
LOG: started streaming WAL at 0/1F000000 (timeline 1)
[2025-04-13 17:44:35]: check identify system success
[2025-04-13 17:44:35]: send START_REPLICATION 0/1F000000 success
[2025-04-13 17:44:35]: keepalive message is received
LOG: SSH process 7394 is terminated with status 0
INFO: PGDATA size: 720MB
INFO: Start transferring data files
[2025-04-13 17:44:35]: keepalive message is received
LOG: Start SSH client process, pid 7398
LOG: Creating page header map "/backup/probackup/backups/panweidb/SUNHQA/page_header_map"
[2025-04-13 17:44:40]: keepalive message is received
[2025-04-13 17:44:45]: keepalive message is received
LOG: SSH process 7398 is terminated with status 0
INFO: Data files are transferred, time elapsed: 13s
INFO: wait for pg_stop_backup()
INFO: pg_stop backup() successfully executed
LOG: stop_lsn: 0/1F000270
LOG: Looking for LSN 0/1F000270 in segment: 00000001000000000000001F
LOG: Found WAL segment: /backup/probackup/backups/panweidb/SUNHQA/database/pg_xlog/00000001000000000000001F
LOG: Thread [0]: Opening WAL segment "/backup/probackup/backups/panweidb/SUNHQA/database/pg_xlog/00000001000000000000001F"
LOG: Found LSN: 0/1F000270
[2025-04-13 17:44:53]:(null): not renaming "/backup/probackup/backups/panweidb/SUNHQA/database/pg_xlog/000000010000000000000020", segment is not complete.
LOG: finished streaming WAL at 0/200000B0 (timeline 1)
LOG: Getting the Recovery Time from WAL
LOG: Thread [0]: Opening WAL segment "/backup/probackup/backups/panweidb/SUNHQA/database/pg_xlog/00000001000000000000001F"
INFO: Syncing backup files to disk
INFO: Backup files are synced, time elapsed: 0
INFO: Validating backup SUNHQA
INFO: Backup SUNHQA data files are valid
INFO: Backup SUNHQA resident size: 752MB
INFO: Backup SUNHQA completed
LOG: REDUNDANCY=3
INFO: Evaluate backups by retention
INFO: Backup SUNHQA, mode: FULL, status: OK. Redundancy: 1/3, Time Window: 0d/0d. Active
INFO: Backup SUNH9P, mode: FULL, status: ERROR. Redundancy: 2/3, Time Window: 0d/0d. Active
INFO: Backup SUNH75, mode: FULL, status: ERROR. Redundancy: 3/3, Time Window: 0d/0d. Expired
LOG: Consider backup SUNH75 for purge
INFO: Delete: SUNH751970-01-01 08:00:00+08
INFO: There are no backups to merge by retention policy
INFO: Purging finished
INFO: There is no WAL to purge by retention policy
复制
备份成功!
4.5.5、查看备份(备份机上执行)
[omm@pubnode ~]$ gs_probackup show -B /backup/probackup/
BACKUP INSTANCE 'panweidb'
=========================================================================================================================================
Instance Version ID Recovery Time Mode WAL Mode TLI Time Data WAL Zratio Start LSN Stop LSN Type Status
=========================================================================================================================================
panweidb 9.2 SUNHU9 2025-04-13 17:47:10+08 FULL STREAM 1/0 18s 736MB 16MB 0.98 0/23000028 0/23000270 FILE OK
panweidb 9.2 SUNHRE 2025-04-13 17:45:29+08 FULL STREAM 1/0 20s 736MB 16MB 0.98 0/21000028 0/21000270 FILE OK
panweidb 9.2 SUNHQA 2025-04-13 17:44:48+08 FULL STREAM 1/0 19s 736MB 16MB 0.98 0/1F000028 0/1F000270 FILE OK
复制
5、远程恢复实施过程
5.1 数据库主机上,修改数据目录,模拟数据库异常
postgres=# show data_directory ;
data_directory
---------------------
/data/panweidb/data
(1 row)
[omm@node1 ~]$ cd /data/panweidb
[omm@node1 panweidb]$ mv data data.old
复制
5.2 备份机上执行数据库恢复
[omm@pubnode ~]$ gs_probackup restore --backup-path=/backup/probackup --instance=panweidb --remote-host=192.*.*.60 --remote-user=omm --remote-port=22
LOG: Start SSH client process, pid 8035
LOG: Restore begin.
LOG: there is no file tablespace_map
LOG: check tablespace directories of backup SUNJ1F
LOG: check external directories of backup SUNJ1F
INFO: Validating backup SUNJ1F
INFO: Backup SUNJ1F data files are valid
LOG: Thread [1]: Opening WAL segment "/backup/probackup/backups/panweidb/SUNJ1F/database/pg_xlog/000000010000000000000024"
INFO: Backup SUNJ1F WAL segments are valid
INFO: Backup SUNJ1F is valid.
INFO: Restoring the database from backup at 2025-04-13 18:12:51+08
LOG: there is no file tablespace_map
LOG: Restore directories and symlinks... in /data/panweidb/data
LOG: SSH process 8035 is terminated with status 0
INFO: Start restoring backup files. DATA size: 744MB
LOG: Start thread 1
LOG: Start SSH client process, pid 8039
LOG: SSH process 8039 is terminated with status 0
INFO: Backup files are restored. Transfered bytes: 744MB, time elapsed: 12s
INFO: Restore incremental ratio (less is better): 100% (744MB/744MB)
INFO: Syncing restored files to disk
LOG: Start SSH client process, pid 8040
INFO: Restored backup files are synced, time elapsed: 1s
LOG: remote execute command [gs_guc set -D /data/panweidb/data -c "pw_wal_directory='pg_xlog'" &> /dev/null] success
LOG: SSH process 8040 is terminated with status 0
INFO: Restore of backup SUNJ1F completed.
复制
由于本地虚拟机里面没有数据写入,所以恢复也比较简单。如果生产环境中有实时业务数据写入,PITR恢复请参考下面的文章:
【干货】磐维数据库物理备份&恢复实战
5.3 数据库主机上检查集群状态
[omm@node1 panweidb]$ gs_om -t status --detail [ CMServer State ] node node_ip instance state ----------------------------------------------------------------- 1 node1 192.*.*.60 1 /data/panweidb/cm/cm_server Primary 2 node2 192.*.*.62 2 /data/panweidb/cm/cm_server Standby 3 node3 192.*.*.64 3 /data/panweidb/cm/cm_server Standby [ Cluster State ] cluster_state : Normal redistributing : No balanced : No current_az : AZ_ALL [ Datanode State ] node node_ip instance state ------------------------------------------------------------------- 1 node1 192.*.*.60 6001 /data/panweidb/data P Standby Normal 2 node2 192.*.*.62 6002 /data/panweidb/data S Primary Normal 3 node3 192.*.*.64 6003 /data/panweidb/data S Standby Normal
复制
数据库集群恢复正常。
6、配置定时任务(备份机上执行)
配置omm用户的crontab定时任务:
- 每三天00:30:00执行数据库全备份,备份成功后删除过期全备份。
- 每两小时删除十天前的归档。
- 每天0:10:00 删除ERROR状态的备份
crontab -e
30 00 */3 * * source /home/omm/.bashrc;gs_probackup backup --backup-path=/backup/probackup --instance=panweidb --backup-mode=full --delete-expired --remote-host=192.*.*.60 --remote-user=omm --remote-port=22 --pgdatabase=postgres -U testusr -W'******' -p 17700 > /backup/probackup/log/backup_week$(date +\%u).log 2>&1 &
0 */2 * * * sh /home/omm/move_and_clear_archive.sh 9
10 00 * * * source /home/omm/.bashrc; gs_probackup delete -B /backup/probackup/ --status=ERROR --instance=panweidb > /dev/null 2>&1 &
复制
cat /home/omm/move_and_clear_archive.sh:
#!/bin/bash
fnum=$1
echo "begin to clear archive log..."
find /archive -mtime +${fnum} -type f -name "0000*" | xargs rm -f
echo "archive files clean successfully"
复制
7、总结
远程备份整体步骤总结:
- 数据库主机上检查、确认数据库参数设置
- 备份机上安装数据库工具包
- 备份机上创建备份目录、初始化备份路径
- 备份机上设置备份机和数据库主机之间的ssh免密
- 备份机上创建备份实例
- 数据库主机上设置白名单(设置远程连接以及replication连接)、创建远程备份用户(需要授予sysadmin权限、session_timeout)
- 备份机上执行远程备份
- 备份机上配置定时备份任务
远程恢复步骤总结:在本地恢复命令的基础上加了remote相关的参数。
生产环境中一定要做数据库备份,把数据库备份结果纳入监控平台,并定期做数据库恢复验证工作,防患于未然。
关于作者:
网名:飞天,墨天轮2024年度优秀原创作者,拥有 Oracle 10g OCM 认证、PGCE认证以及OBCA、KCP、KCSM、ACP、磐维等众多国产数据库认证证书,目前从事Oracle、Mysql、PostgresSQL、磐维数据库管理运维工作,喜欢结交更多志同道合的朋友,热衷于研究、分享数据库技术。
微信公众号:飞天online
墨天轮:https://www.modb.pro/u/15197
如有任何疑问,欢迎大家留言,共同探讨~~~