RHEL设置的oracle备份脚本crontab未正常执行排查
故障现象
这几天正在给财政厅的客户做oracle 数据库的备份恢复策略的梳理,总共有3套核心库,分别设置了备份的策略。如下:
[oracle@czdb1 sh]$ crontab -l
0 1 * * 0 /home/oracle/sh/full_bak.sh
0 1 * * 1,2,3,4,5,6 /home/oracle/sh/arch_bak.sh
[oracle@czdb1 sh]$ cat full_bak.sh
#!/bin/bash
export ORACLE_SID=orcl1
export ORACLE_BASE=/u01/app/oracle
export ORACLE_HOME=/u01/app/oracle/product/11.2.0
export NLS_DATE_FORMAT="yyyymmdd hh24:mi:ss"
rman target / log=/home/oracle/sh/log/full_`date +%Y%m%d_%H%M%S`.log << EOF
run{
configure retention policy to recovery window of 7 days;
configure controlfile autobackup on;
allocate channel c1 type disk;
allocate channel c2 type disk;
allocate channel c3 type disk;
backup as compressed backupset database tag='ORCL_FULL' format='/backup/rman_full/%d_%T_%s_%p_%U_FULL' plus archivelog delete all input format='/backup/rman_full/%d_%T_%s_%p_%U_ARCHIVE';
backup current controlfile tag='bak_ctlfile' format='/backup/rman_full/%d_%T_%s_%p_%U_CONTROL';
backup spfile tag='spfile' format='/backup/rman_full/%d_%T_%s_%p_%U_spfile';
crosscheck backup;
crosscheck archivelog all;
delete noprompt obsolete;
delete noprompt expired backup;
release channel c1;
release channel c2;
release channel c3;
}
exit;
EOF
[oracle@czdb1 sh]$ cat arch_bak.sh
#!/bin/bash
export ORACLE_SID=orcl1
export ORACLE_BASE=/u01/app/oracle
export ORACLE_HOME=/u01/app/oracle/product/11.2.0
export NLS_DATE_FORMAT="yyyymmdd hh24:mi:ss"
rman target / log=/home/oracle/sh/log/arch_`date +%Y%m%d_%H%M%S`.log << EOF
run {
allocate channel c1 type disk;
allocate channel c2 type disk;
sql 'alter system archive log current';
backup archivelog all not backed up 1 times delete all input format '/backup/rman_full/%d_%T_%s_%p_%U_ARCHIVE';
backup current controlfile tag='bak_ctlfile' format='/backup/rman_full/%d_%T_%s_%p_%U_CONTROL';
backup spfile tag='spfile' format='/backup/rman_full/%d_%T_%s_%p_%U_spfile';
crosscheck backup;
crosscheck archivelog all;
delete noprompt obsolete;
delete noprompt expired backup;
release channel c1;
release channel c2;
}
exit;
EOF
复制
每套库的备份策略都差不多,都是周日全备,然后每天备份归档日志,保留7天内随便恢复。
其中有1套库,当时测试手动执行,都没有错误,但是过了半个月去查看,并没有备份成功,备份集还是当时测试时候的备份集,所有的备份日志也没有生成。而剩余2套库的备份全部都成功。可见,只有这一套库的备份失败。
环境如下:
oracle 11204 2节点RAC,os RHEL 6.2
故障分析
- 手动测试所有备份脚本均能成功,但是添加进crontab 里面备份失败。所以先排查crontab是否有问题?crontab 是否语法或者其他问题,而没有真正的调用起来?
检查crontab 的日志,默认日志位置在/var/log/cron下面,里面详细记录了crontab 的执行情况,vi 打开,进行关键字匹配:
Mar 5 01:00:01 xxx CROND[33040]: (root) CMD (/usr/sbin/raid-check) Mar 5 01:00:01 xxx CROND[33041]: (oracle) CMD (/home/oracle/scripts/full_bak.sh) 。。。。 Feb 27 00:50:01 xxx CROND[35015]: (root) CMD (/usr/lib64/sa/sa1 1 1) Feb 27 01:00:01 xxx CROND[40299]: (oracle) CMD (/home/oracle/scripts/arch_bak.sh) Feb 27 01:00:01 xxx CROND[40300]: (root) CMD (/usr/lib64/sa/sa1 1 1) 。。。 Mar 1 01:00:01 qhczhwdb3 CROND[37163]: (oracle) CMD (/home/oracle/scripts/arch_bak.sh) Mar 1 01:01:01 xxx CROND[37616]: (root) CMD (run-parts /etc/cron.hourly)
复制
从日志里面可以看出,每天的备份任务都在正常调用,所以crontab 的语法以及crontab 的设置是没有问题的。那么问题肯定就在脚本本身了。
2. 但是之前手动测试的时候,均能成功。所以脚本也应该是没问题的,所以此时没有了排查思路。
3. 通过查看官方文档,crontab 执行失败,如果没有屏蔽错误的话,会产生一个系统邮件,位置在 /var/spool/mail/。
屏蔽错误的意思就是我们通常在crontab 里面的 2>&1 这种设置。如果要排查错误,可以先把这个屏蔽错误取消掉,等待几分钟查看邮件。
我们查看这台机器的mail 日志,如下:
[root@qhczhwdb3 mail]# ll total 37404 -rw-rw---- 1 grid mail 0 Oct 12 2021 grid -rw-rw---- 1 oracle mail 37668664 Mar 7 01:00 oracle -rw------- 1 root mail 622536 Mar 4 00:29 root 是oracle 下的crontab,所以我们查看oracle,vi编辑,会有如下报错: Message-Id: <20230304170001.6EB491E00C9@qhczhwdb3.localdomain> Date: Sun, 5 Mar 2023 01:00:01 +0800 (CST) /home/oracle/scripts/full_bak.sh: line 6: rman: command not found From root@qhczhwdb3.localdomain Mon Mar 6 01:00:01 2023 Return-Path: <root@qhczhwdb3.localdomain> X-Original-To: oracle Delivered-To: oracle@qhczhwdb3.localdomain Received: by qhczhwdb3.localdomain (Postfix, from userid 501) id D7F2C1E00CA; Mon, 6 Mar 2023 01:00:01 +0800 (CST) From: root@qhczhwdb3.localdomain (Cron Daemon) To: oracle@qhczhwdb3.localdomain Subject: Cron <oracle@qhczhwdb3> /home/oracle/scripts/arch_bak.sh Content-Type: text/plain; charset=UTF-8 Auto-Submitted: auto-generated X-Cron-Env: <LANG=en_US.UTF-8> X-Cron-Env: <SHELL=/bin/sh> X-Cron-Env: <HOME=/home/oracle> X-Cron-Env: <PATH=/usr/bin:/bin> X-Cron-Env: <LOGNAME=oracle> X-Cron-Env: <USER=oracle> Message-Id: <20230305170001.D7F2C1E00CA@qhczhwdb3.localdomain> Date: Mon, 6 Mar 2023 01:00:01 +0800 (CST) /home/oracle/scripts/arch_bak.sh: line 6: rman: command not found From root@qhczhwdb3.localdomain Tue Mar 7 01:00:01 2023 Return-Path: <root@qhczhwdb3.localdomain> X-Original-To: oracle Delivered-To: oracle@qhczhwdb3.localdomain Received: by qhczhwdb3.localdomain (Postfix, from userid 501) id 8C3EB1E00CB; Tue, 7 Mar 2023 01:00:01 +0800 (CST) From: root@qhczhwdb3.localdomain (Cron Daemon) To: oracle@qhczhwdb3.localdomain Subject: Cron <oracle@qhczhwdb3> /home/oracle/scripts/arch_bak.sh Content-Type: text/plain; charset=UTF-8 Auto-Submitted: auto-generated X-Cron-Env: <LANG=en_US.UTF-8> X-Cron-Env: <SHELL=/bin/sh> X-Cron-Env: <HOME=/home/oracle> X-Cron-Env: <PATH=/usr/bin:/bin> X-Cron-Env: <LOGNAME=oracle> X-Cron-Env: <USER=oracle> Message-Id: <20230306170001.8C3EB1E00CB@qhczhwdb3.localdomain> Date: Tue, 7 Mar 2023 01:00:01 +0800 (CST) /home/oracle/scripts/arch_bak.sh: line 6: rman: command not found
复制
关键错误信息:
X-Cron-Env: <PATH=/usr/bin:/bin>
/home/oracle/scripts/arch_bak.sh: line 6: rman: command not found
这两行,它引用了oracle 用户的默认PATH,导致下面没有rman的环境变量,所以会报错rman找不到命令。
4. 所以,此处还是环境变量的问题。脚本里面要加PATH 环境变量或者写全路径或者source .bash_profile。
解决方案
修改脚本,前面如下:
1、把rman 命令修改为全路径:
[oracle@czdb1 sh]$ cat arch_bak.sh
#!/bin/bash
export ORACLE_SID=orcl1
export ORACLE_BASE=/u01/app/oracle
export ORACLE_HOME=/u01/app/oracle/product/11.2.0
export NLS_DATE_FORMAT="yyyymmdd hh24:mi:ss"
rman target / log=/home/oracle/sh/log/arch_`date +%Y%m%d_%H%M%S`.log << EOF
复制
2、设置PATH 环境变量:
[oracle@czdb1 sh]$ cat arch_bak.sh
#!/bin/bash
export ORACLE_SID=orcl1
export ORACLE_BASE=/u01/app/oracle
export ORACLE_HOME=/u01/app/oracle/product/11.2.0
export PATH=$ORACLE_HOME/bin:$ORACLE_HOME/OPatch:/sbin:/bin:/usr/sbin:/usr/bin
export NLS_DATE_FORMAT="yyyymmdd hh24:mi:ss"
rman target / log=/home/oracle/sh/log/arch_`date +%Y%m%d_%H%M%S`.log << EOF
复制
3、引用Oracle用户的环境变量:
[oracle@czdb1 sh]$ cat arch_bak.sh
#!/bin/bash
source /home/oracle/.bash_profile
rman target / log=/home/oracle/sh/log/arch_`date +%Y%m%d_%H%M%S`.log << EOF
复制
最后发现剩余两套库的环境变量都设置了,只有这一套库每设置。
常见crontab 调用失败的原因
通过本案例,可以看出,linux 操作系统还是有很多日志可以查询的。想crontab 的日志,还有mail 的日志,我们日常都可以详细检查,用以排查错误。
下面总结下常见的crontab 失败错误的原因:
- crontab服务异常
- 脚本权限
- 环境变量