作者:Digital Observer(施嘉伟)
Oracle ACE Pro: Database
PostgreSQL ACE Partner
11年数据库行业经验,现主要从事数据库服务工作
拥有Oracle OCM、DB2 10.1 Fundamentals、MySQL 8.0 OCP、WebLogic 12c OCA、KCP、PCTP、PCSD、PGCM、OCI、PolarDB技术专家、达梦师资认证、数据安全咨询高级等认证
ITPUB认证专家、PolarDB开源社区技术顾问、HaloDB技术顾问、TiDB社区技术布道师、青学会MOP技术社区专家顾问、国内某高校企业实践指导教师
公众号/墨天轮:Digital Observer;CSDN/PGfans:施嘉伟;ITPUB:sjw1933
一、常规备份策略
为了尽可能的保证数据的安全和完整,我们需要制定完备的备份方案,当发生故障的时候能最大程度减少不必要的损失。举例子:某客户每天凌晨12点发起一个rman全备份,结果第二天晚上11点数据库磁盘损坏导致数据文件和归档文件全部损坏,数据库无法正常打开。在没有其他任何容灾备份的情况下,备份就显得格外重要了。但是我们备份是前一天12点发起的全备,意味着客户付出的代价是将近一天的数据丢失。所以制定完备的备份策略显得尤为重要!
由于备份会影响生产库性能,所以我们不能每时每刻对生产数据进行备份,所以合理的制定备份策略显得尤为重要。
1)选一天中业务相对不繁忙的时间进行备份
2)避开数据库跑批时间
3)合理利用资源制定备份策略,比如开并行
下面笔者介绍一个真实的备份备份案例:
每日:
- 01:00执行一次数据库全备
- 02:20将数据备份片传输到异机
- 03:20将归档备份片传输到异机
- 00:30删除三天前归档
注意
备份片生成后应及时将备份片复制一份到异机,复制到异机的目的是为了防止本地存储故障导致生产数据和备份数据都遭受损坏。
备份片复制到异机而不移动到异机是因为在生产出现问题后能快速执行恢复操作,省去拷贝备份片的时间。
由于次案例有容灾归档删除策略为备库应用后才可以删除,确保了数据的高安全性
备份脚本:
零级备份,代码如下:
. $HOME/.profile
CUSER=`id |cut -d"(" -f2 | cut -d ")" -f1`
RMAN_LOG_FILE=/archlog/script/hotbackup.out
if [ -f "$RMAN_LOG_FILE" ]
then
rm -f "$RMAN_LOG_FILE"
fi
echo >> $RMAN_LOG_FILE
chmod 666 $RMAN_LOG_FILE
echo Script $0 >> $RMAN_LOG_FILE
echo ==== started on `date` ==== >> $RMAN_LOG_FILEl
echo >> $RMAN_LOG_FILE
ORACLE_USER=oracle
CMD_STR="
ORACLE_HOME=$ORACLE_HOME
export ORACLE_HOME
ORACLE_SID=$ORACLE_SID
export ORACLE_SID
$ORACLE_HOME/bin/rman target / nocatalog msglog $RMAN_LOG_FILE append<<EOF
allocate channel for maintenance device type disk;
delete noprompt backupset completed before 'sysdate-3' tag='FULLDB';
delete noprompt backupset completed before 'sysdate-3' tag='ARCHIVELOG';
run {
allocate channel ch00 device type disk;
backup database include current controlfile format '/rmandata/fullbak/full_%d_%T_%s_%p_.bkp' tag='FULLDB';
sql 'alter system archive log current';
backup archivelog all format '/rmandata/archbak/arch_%d_%T_%s_%p_.bkp' tag='ARCHIVELOG';
release channel ch00;
}
EOF
"
if [ "$CUSER" = "root" ]
then
su - $ORACLE_USER -c "$CMD_STR" >> $RMAN_LOG_FILE
RSTAT=$?
else
/usr/bin/sh -c "$CMD_STR" >> $RMAN_LOG_FILE
RSTAT=$?
fi
if [ "$RSTAT" = "0" ]
then
LOGMSG="ended successfully"
else
LOGMSG="ended in error"
fi
echo >> $RMAN_LOG_FILE
echo Script $0 >> $RMAN_LOG_FILE
echo ==== $LOGMSG on `date` ==== >> $RMAN_LOG_FILE
echo >> $RMAN_LOG_FILE
复制
ftp传输脚本(数据):
find /rmandata/fullbak -type f -ctime 1|grep 'rmandata'|sed 's/\/rmandata\/fullbak\///g' > ftp_full_file.txt
#put_file="mput $file"
echo "ftp -n<<EOF" >putfile.sh
echo "open 10.33.81.97" >>putfile.sh
echo "user rman_rz ******" >>putfile.sh
echo "lcd /rmandata/fullbak" >>putfile.sh
echo "prompt off " >>putfile.sh
echo "binary" >>putfile.sh
cat ftp_full_file.txt|while read line
do
echo "put `echo $line|awk '{print $1}'` " >>putfile.sh
done
#echo "$put_file" >>putfile.sh
echo "bye" >>putfile.sh
echo "EOF" >>putfile.sh
echo "-----------Excute Backup----------"
cat putfile.sh
sh putfile.sh
echo "----------End----------"
复制
ftp传输脚本(归档):
find /rmandata/archbak -type f -ctime 1|grep 'rmandata\/archbak\/arch'|sed 's/\/rmandata\/archbak\///g' > ftp_arch_file.txt
#put_file="mput $file"
echo "ftp -n<<EOF" >putfile.sh
echo "open 10.33.81.97" >>putfile.sh
echo "user rman_rz ******" >>putfile.sh
echo "lcd /rmandata/archbak" >>putfile.sh
echo "prompt off " >>putfile.sh
echo "binary" >>putfile.sh
cat ftp_arch_file.txt|while read line
do
echo "put `echo $line|awk '{print $1}'` " >>putfile.sh
done
#echo "$put_file" >>putfile.sh
echo "bye" >>putfile.sh
echo "EOF" >>putfile.sh
echo "-----------Excute Backup----------"
cat putfile.sh
sh putfile.sh
echo "----------End----------"
复制
归档删除脚本:
. $HOME/.profile
export ORACLE_SID=oracl
export GGS_HOME=/ggs
sqlplus -s "sys/******@ORACL_std as sysdba" << EOF > /tmp/seq_dg.seq
set pagesize 0
set head off
set feedback off
select max(sequence#) from v\$archived_log where name is not null and applied='YES' and first_time<sysdate-3;
EOF
#cd $GGS_HOME
#./ggsci << EOF > /tmp/seq_ggs.seq.tmp
#info tj_ext showch
#EOF
#cat /tmp/seq_ggs.seq.tmp|grep -p "Recovery Checkpoint (position of oldest unprocessed transaction in the data source)"|grep "Sequence #
"|awk -F ":" '{print $2}' > /tmp/seq_ggs.seq
SEQ_DG=`cat /tmp/seq_dg.seq`
#SEQ_GGS=`cat /tmp/seq_ggs.seq`
#if [ "${SEQ_DG}" -le "${SEQ_GGS}" ];then
# SEQ_DEL=${SEQ_DG}
#else
# SEQ_DEL=${SEQ_GGS}
#fi
#SEQ_DEL=$((${SEQ_DEL}-10))
#echo $SEQ_DEL
rman target / nocatalog << EOF
crosscheck archivelog all;
delete noprompt archivelog until sequence ${SEQ_DG};
EOF
复制
crontab部署如下:
00 01 * * * sh /archlog/script/hotbackup.sh 20 02 * * * sh /archlog/script/ftp_full.sh >/archlog/script/ftp_full.log 20 03 * * * sh /archlog/script/ftp_arch.sh >/archlog/script/ftp_arch.log 30 00 * * * sh /archlog/script/del_arch.sh > /dev/null
复制
注意
再完备的备份策略也会存在相关风险点,为了确保备份片的可用性需要定期进行恢复演练以验证备份的有效性。
定期查看备份日志,一旦存在报错一定要引起重视。
二、常见问题
RMAN备份恢复常见报错以及解决方z法:
1)在恢复归档的时候,明明这个归档已经不存在,或者不属于本节点,但是数据库还是回去找文件,且提示文件不存在,解决方法:
Crosscheck archivelog all;
复制
2)rman前台运行过程中突然中断,终止rman任务方法:
查看RMAN分配的各个通道的进程号,命令如下:
select s.sid, s.serial#, p.spid, s.client_info from v$process p, v$session s where p.addr = s.paddr and s.program like '%rman%' ; SID SERIAL# SPID CLIENT_INFO ---------- ---------- ------------------------ ---------------------------------------------------------------- 89 64899 5157 107 49576 5170 96 39108 5172 rman channel=ORA_DISK_1
复制
根据第上面步骤中得到的进程号,终止RMAN备份:
注:这里既要kill 掉RMAN备份脚本的PID,也要kill 掉RMAN中分配的各个通道的PID
[root@asmdb ~]# ps -ef | grep beq oracle 1918 1 0 Jul10 ? 00:00:09 oracle+ASM (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq))) oracle 15790 1 0 08:53 ? 00:00:00 oracle+ASM …… (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq))) oracle 15901 15885 5 08:57 ? 00:00:00 oracleasmdb (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq))) root 15903 15823 0 08:57 pts/3 00:00:00 grep beq [root@asmdb ~]# kill -9 15898 [root@asmdb ~]# kill -9 15899 [root@asmdb ~]# kill -9 15900 [root@asmdb ~]# kill -9 15901 [root@asmdb ~]# ps -ef | grep rman oracle 15885 15740 1 08:57 pts/2 00:00:02 rman target / root 15926 15823 0 09:01 pts/3 00:00:00 grep rman [root@asmdb ~]# kill -9 15885
复制
注意 这里既要kill 掉RMAN备份脚本的PID,也要kill 掉RMAN中分配的各个通道的PID
3)RMAN备份或者恢复时间很长,想确认rman备份恢复进度,方法如下:
SQL> SELECT inst_id, 2 sid, 3 serial#, 4 opname, 5 ROUND(SOFAR / TOTALWORK * 100, 2) "%_COMPLETE" 6 FROM gV$SESSION_LONGOPS 7 WHERE OPNAME LIKE 'RMAN%' 8 --AND OPNAME NOT LIKE '%aggregate%' 9 AND TOTALWORK != 0 10 AND SOFAR <> TOTALWORK;
复制
4)RMAN-4071:
报错信息如下:
RMAN> startup nomount LRM-00109: could not open parameter file '/data/app/oracle/product/10.2.0/db_1/dbs/initKEN.ora' RMAN-4024: starting Oracle instance without parameter file for retrival of spfile DBGMISC: krmicomp: error 4014 signalled during compilation [15:33:17.252] RMAN-571: =========================================================== RMAN-569: =============== ERROR MESSAGE STACK FOLLOWS =============== RMAN-571: =========================================================== RMAN-3002: failure of startup command at 05/01/2007 15:33:17 RMAN-4014: startup failed: ORA-1078: failure in processing system parameters RMAN-4017: startup error description: ORA-821: Specified value of sga_target 152M is too small, needs to be at least 228M
复制
问题成因:
部分oracle 10g环境下可能会出现无法通过缺省pifle启动到nomount阶段的问。
解决思路:
执行以下语句,再进入rman启动到nomount即可:
export ORA_RMAN_SGA_TARGET = 350
复制
5)rman备份发生下面报错:
发生原因:被其他策略影响,排查其他备份策略。
6)RMAN-06564错误的原因及解决办法
报错信息如下:
RMAN> restore spfile from autobackup; Starting restore at 01-NOV-20 using target database control file instead of recovery catalog allocated channel: ORA_DISK_1 channel ORA_DISK_1: SID=100 device type=DISK RMAN-00571: =========================================================== RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS =============== RMAN-00571: =========================================================== RMAN-03002: failure of restore command at 11/01/2020 11:30:33 RMAN-06564: must use the TO clause when the instance is started with SPFILE
复制
解决方法:根据错误提示要指定文件位置,代码如下:
RMAN> restore spfile to '/tmp/spfile_sid.ora' from autobackup;
复制
7)RMAN-06026& RMAN-06023
报错信息如下:
RMAN> restore database validate; Starting restore at 03-JUN-20 using channel ORA_DISK_1 RMAN-00571: ================================================= RMAN-00569: ===== ERROR MESSAGE STACK FOLLOWS ==== RMAN-00571: ========================================== RMAN-03002: failure of restore command at 06/03/2020 11:21:28 RMAN-06026: some targets not found - aborting restore RMAN-06023: no backup or copy of datafile 3 found to restore RMAN-06023: no backup or copy of datafile 2 found to restore RMAN-06023: no backup or copy of datafile 1 found to restore
复制
解决思路:控制文件不正确,重新传输一个正确的控制文件
8)ORA-27072: File I/O error
报错信息如下:
ORA-19502: write error on file "/backup/testdb1/rman/testdb1_full_22pnh0ea_1_1", block number 14976 (block size=8192) ORA-27072: File I/O error
复制
排查思路:
(1)检查文件系统空间是否满
(2)查看操作系统日志,排查磁盘相关报错
9)RMAN-06054
报错信息如下:
RMAN> recover database; unable to find archived log archived log thread=1 sequence=19 RMAN-00571: =========================================================== RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS =============== RMAN-00571: =========================================================== RMAN-03002: failure of recover command at 06/02/2020 14:38:09 RMAN-06054: media recovery requesting unknown archived log for thread 1 with sequence 19 and starting SCN of 1125015
复制
问题成因:
当执行异机恢复时,由于redo日志组不存在,并且没有设置停止应用归档日志的条件,因此会不断尝试应用归档日志。
解决方法:
查看数据文件头部SCN与 控制文件中记录的SCN是否一致,一致直接打开即可。
或者
RMAN> recover database until scn 1125015
10)RMAN-08138
报错信息如下:
RMAN> delete archivelog all; released channel: ORA_DISK_1 allocated channel: ORA_DISK_1 channel ORA_DISK_1: SID=12 device type=DISK RMAN-08138: WARNING: archived log not deleted - must create more backups archived log file name=/oracle/app/arch/prod1_69_1046156085.dbf thread=1 sequence=69
复制
问题成因:
该问题是由于设置了归档日志删除策略。
解决方法:
1.将归档日志删除策略置为none RMAN> configure archivelog deletion policy to none; 2. RMAN>backup archivelog all delete input /备份完成后删除/ 3. RMAN>delete force archivelog all;/强制删除/
复制
10)RMAN-08120
报错信息如下:
RMAN> delete archivelog until time 'sysdate' ; released channel: ORA_DISK_1 allocated channel: ORA_DISK_1 channel ORA_DISK_1: SID=9 device type=DISK RMAN-08120: WARNING: archived log not deleted, not yet applied by standby
复制
archived log file name=/data/epodb/arch/1_12587_893259821.arc thread=1 sequence=12587
问题成因:
删除的归档日志还没有被备库应用,无法删除。
解决方法:
修复DG环境
或者执行强制删除归档日志,命令如下:
RMAN> delete FORCE archivelog until time ‘sysdate’;
11)RMAN-06208
RMAN> delete backup; RMAN-06207: WARNING: 3 objects could not be deleted for DISK channel(s) due RMAN-06208: to mismatched status. Use CROSSCHECK command to fix status RMAN-06210: List of Mismatched objects RMAN-06211: ========================== RMAN-06212: Object Type Filename/Handle RMAN-06213: --------------- --------------------------------------------------- RMAN-06214: Backup Piece /oracle/app/product/11.2.0/db_1/dbs/0lv8hubl_1_1 RMAN-06214: Backup Piece /oracle/app/product/11.2.0/db_1/dbs/0mv8huiu_1_1 RMAN-06214: Backup Piece /oracle/app/product/11.2.0/db_1/dbs/0nv8hul9_1_1
复制
问题成因:
备份文件在OS层面被删除后没有进行crosscheck检查,无法将其删除。
解决方法:
RMAN> crosscheck backup; RMAN> delete expired backup;
复制
12)RMAN-06564
报错信息如下:
RMAN> restore spfile; Starting restore at 23-AUG-20 using channel ORA_DISK_1 RMAN-03002: failure of restore command at 08/23/2020 02:36:57 RMAN-06564: must use the TO clause when the instance is started with SPFILE
复制
问题成因:
实例使用spfile启动,正在使用spfile,无法恢复至默认位置。
解决办法:
RMAN> restore spfile to '/home/oracle/spfile';
复制
13)RMAN-06556
报错信息如下:
RMAN> recover database until scn 993592; Starting recover at 24-JUN-20 using channel ORA_DISK_1 RMAN-00571: ================================================= RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS =============== RMAN-00571: =========================================================== RMAN-03002: failure of recover command at 06/24/2020 10:13:15 RMAN-06556: datafile 4 must be restored from backup older than SCN 993592
复制
问题成因:
数据文件4使用了单独的备份集进行恢复的,数据文件在scn993592后进行了单独备份,进行restore恢复时,数据文件4使用了最新的备份,与其他文件使用的备份集不同,无法进行不完全恢复。
解决办法:
RMAN> recover database
14)RMAN-03009
报错信息如下:
RMAN> backup database; Starting Control File and SPFILE Autobackup at 23-AUG-20 RMAN-00571: =========================================================== RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS =============== RMAN-00571: =========================================================== RMAN-03009: failure of Control File and SPFILE Autobackup command on ORA_DISK_1 channel at 08/23/2020 21:05:33 ORA-19504: failed to create file "/home/rman/c-457994418-20200823-01" ORA-27040: file create error, unable to create file Linux-x86_64 Error: 2: No such file or directory Additional information: 1
复制
问题成因:
备份一号文件时,触发了控制文件自动备份,RMAN开启了控制文件自动备份属性,但是设置的路径不存在或者权限出现问题。
解决方法:
排查目录是否存在,不存在进行创建,如果权限不对,修改权限
关闭控制文件自动备份RMAN>CONFIGURE CONTROLFILE AUTOBACKUP OFF;
15)RMAN-00600 [5041]
报错信息如下:
RMAN> recover database noredo; RMAN-00600: internal error, arguments [5041] [2] [16813038] [20712091] []
复制
报错场景:CDB备库进行增量恢复时出现的RMAN-00600错误
问题成因:
有数据文件头SCN(PDB数据文件)低于增量备份from scn,导致无法应用增量备份。
解决方法:
查看整个CDB环境最小SCN,在主库从该SCN开始增量备份,命令如下:
select hxfil FILENUMBER,con_id, fhsta STATUS, fhscn SCN, fhrba_seq SEQUENCE,fhafs,FHDBI,fhdbn,fhtnm from x$kcvfh order by fhsta;
16)RMAN-06056
报错信息如下:
RMAN> backup database; RMAN-06169: could not read file header for datafile 46 error reason 4 released channel: c1 released channel: c2 RMAN-00571: =========================================================== RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS =============== RMAN-00571: =========================================================== RMAN-03002: failure of backup command at 03/29/2016 13:40:51 RMAN-06056: could not access datafile 46
复制
解决方法:
RMAN> backup database skip inaccessible;(跳过不可访问的文件)
或者从控制文件中删除该文件的记录
17)RMAN-06900/RMAN-06901/ORA-04031
报错信息如下:
RMAN> backup database; RMAN-06900: WARNING: unable to generate V$RMAN_STATUS or V$RMAN_OUTPUT row RMAN-06901: WARNING: disabling update of V$RMAN_STATUS and V$RMAN_OUTPUT rows ORACLE error from target database: ORA-04031: unable to allocate 16 bytes of shared memory ("shared pool","DATABASE","KKTIN^ae4fab53","kglhin: temp") ORA-04031: unable to allocate 16 bytes of shared memory ("shared pool","select job, nvl2(last_date, ...","SQLP","tmp") 问题成因:
复制
共享池sharepool空间不足
解决办法:
合理调整数据库内存分配
18)RMAN-06100
报错信息如下:
RMAN> restore database; RMAN-06026: some targets not found - aborting restore RMAN-06100: no channel to restore a backup or copy of datafile 80 RMAN-06100: no channel to restore a backup or copy of datafile 79 RMAN-06100: no channel to restore a backup or copy of datafile 78 RMAN-06100: no channel to restore a backup or copy of datafile 77 RMAN-06100: no channel to restore a backup or copy of datafile 76 RMAN-06100: no channel to restore a backup or copy of datafile 75
复制
问题成因:
恢复控制文件后,没有将RMAN备份注册到控制文件中,导致RMAN在恢复数据文件时无法找到备份。
解决方法:
将RMAN备份注册到控制文件中,命令如下:
RMAN> CATALOG START WITH ‘/oradata/rman’;
19)RMAN-06726
RMAN> backup archivelog all RMAN-00571: =========================================================== RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS =============== RMAN-00571: =========================================================== RMAN-03002: failure of backup command at 02/03/2007 01:17:34 RMAN-06726: could not locate archivelog XXXXXXXXXXXXXXXXX
复制
环境描述:RAC2节点环境之后又添加一个节点。
解决方法:
归档日志全部都存在,检查参数发现其中一个实例中cluster_database_instances 参数值为2,由于为3节点环境,该参数需要修改。
将该参数修改为cluster_database_instances=3。