暂无图片
暂无图片
8
暂无图片
暂无图片
暂无图片

DG日志应用延时严重故障处理及优化(数据延迟6天将近3T归档日志文件)

原创 企系山顶上噶雪梨 2022-07-11
4098

DG日志应用延时严重故障处理及优化(数据延迟6天将近3T归档日志文件)

【说明】:操作部分若没有特别说明,都是在目标端

本文包含故障的处理过程,DG目标端应用速度优化,DG监控思路

摘要:11G DG(Data Gurad)单实例目标端应用延迟6天,将近3T的归档日志为应用,故障处理及数据积压后怎么快速追平数据

故障原因分析:

1)第一时间查看了数据库的打开状态为“ READ ONLY ”,而非正常的状态 “ READ ONLY WITH APPLY ”

SQL> select open_mode from v$database; OPEN_MODE READ ONLY

2)日志怎么会挂了呢?查看告警日志,其中一段显示,磁盘空间不足,造成后台进程MRP0 shutdown掉

09812 ARC3: Closing local archive destination LOG_ARCHIVE_DEST_1: '/mnt/vdd1/archivelog/ARC0000202222_1013199719.0001.arc' (error 19502) (orcl) 309813 ARCH: Archival stopped, error occurred. Will continue retrying 309814 ORACLE Instance orcl - Archival Error 309815 ORA-16038: log 4 sequence# 202222 cannot be archived 309816 ORA-19502: write error on file "", block number (block size=) 309817 ORA-00312: online log 4 thread 1: '/mnt/oracle/oradata/orcl/redo04.log' 309818 Sat Jul 02 14:14:06 2022 309819 Media Recovery Log /mnt/vdd1/archivelog/ARC0000201136_1013199719.0001.arc 309820 Sat Jul 02 14:14:08 2022 309821 ARCH: Archival stopped, error occurred. Will continue retrying 309822 ORACLE Instance orcl - Archival Error 309823 ORA-16014: log 4 sequence# 202222 not archived, no available destinations 309824 ORA-00312: online log 4 thread 1: '/mnt/oracle/oradata/orcl/redo04.log' 309825 Errors with log /mnt/vdd1/archivelog/ARC0000201136_1013199719.0001.arc 309826 MRP0: Background Media Recovery terminated with error 1237 309827 Errors in file /mnt/oracle/diag/rdbms/orcl_st/orcl/trace/orcl_pr00_29562.trc: 309828 ORA-01237: cannot extend datafile 91 309829 ORA-01110: data file 91: '/mnt/vdd1/oradata/TS_PART4_01' 309830 ORA-19502: write error on file "/mnt/vdd1/oradata/TS_PART4_01", block number 4130560 (block size=8192) 309831 ORA-27072: File I/O error 309832 Linux-x86_64 Error: 28: No space left on device 309833 Additional information: 4 309834 Additional information: 4130560 309835 Additional information: -1 309836 Managed Standby Recovery not using Real Time Apply 309837 Sat Jul 02 14:15:29 2022 309838 Recovery interrupted! 309839 Recovered data files to a consistent state at change 17592401117937 309840 Sat Jul 02 14:15:31 2022 309841 MRP0: Background Media Recovery process shutdown (orcl)

3)14点多发生磁盘空间不足,删除目标端日志的计划任务是在18点,那18点后,DG检测又空间了,继续将归档日志重传到目标端,但是由于日志应用的进程MRP0挂掉,导致数据同步失败

4)大坑:未及时发现,需设置监控告警DG情况

故障解决:

1)考虑归档日志已经被计划任务删除,需要先进行归档日志恢复,目标端最新应用到201136,而主库还存在归档是207163,限于磁盘空间,分多次进行恢复,先进行恢复202315到203000直接的归档日志文件 ,并后台执行

建立shell脚本:
cat arch_restore.sh

#!/bin/bash . $HOME/.bash_profile rman target / nocatalog msglog /mnt/script/logs/rman-arch.log <<EOF run { set archivelog destination to '/mnt/vdd1/archivelog'; allocate channel ch1 type disk; allocate channel ch2 type disk; allocate channel ch3 type disk; allocate channel ch4 type disk; allocate channel ch5 type disk; allocate channel ch6 type disk; allocate channel ch7 type disk; allocate channel ch8 type disk; restore archivelog from logseq 202315 until logseq 203000; release channel ch1; release channel ch2; release channel ch3; release channel ch4; release channel ch5; release channel ch6; release channel ch7; release channel ch8; } exit EOF

后台执行脚本 :

nohup sh arch_restore.sh >alog07070813.log &

2)重启MRP0,恢复日志应用 (### 然后忙去了,几个小时候去检查,发现才应用了三百个多文件… ###)

alter database recover managed standby database cancel; alter database recover managed standby database using current logfile disconnect;

3)再查日志,发现需要一个,到源端去拿一个,最后还直接挂了 (### 哭晕在厕所o(╥﹏╥)o… ###)

日志显示:

Wed Jul 06 18:59:07 2022 Archived Log entry 206097 added for thread 1 sequence 206015 ID 0xe08f092f dest 1: Wed Jul 06 18:59:41 2022 Media Recovery Log /mnt/vdd1/archivelog/ARC0000201380_1013199719.0001.arc Error opening /mnt/vdd1/archivelog/ARC0000201380_1013199719.0001.arc Attempting refetch Media Recovery Waiting for thread 1 sequence 201380 Fetching gap sequence in thread 1, gap sequence 201380-201380 Wed Jul 06 18:59:42 2022 RFS\[7513\]: Allowing overwrite of partial archivelog for thread 1 sequence 201380 RFS\[7513\]: Opened log for thread 1 sequence 201380 dbid 1116521861 branch 1013199719 Archived Log entry 206098 added for thread 1 sequence 201380 rlc 1013199719 ID 0x0 dest 2: Media Recovery Log /mnt/vdd1/archivelog/ARC0000201380_1013199719.0001.arc Wed Jul 06 19:00:09 2022 Archived Log entry 206099 added for thread 1 sequence 206016 ID 0xe08f092f dest 1: Wed Jul 06 19:01:35 2022 Media Recovery Log /mnt/vdd1/archivelog/ARC0000201381_1013199719.0001.arc Error opening /mnt/vdd1/archivelog/ARC0000201381_1013199719.0001.arc Attempting refetch Media Recovery Waiting for thread 1 sequence 201381 Fetching gap sequence in thread 1, gap sequence 201381-201381 Media Recovery Waiting for thread 1 sequence 201381 Fetching gap sequence in thread 1, gap sequence 201381-201381 Wed Jul 06 19:05:07 2022 FAL\[client\]: Failed to request gap sequence GAP - thread 1 sequence 201381-201381 DBID 1116521861 branch 1013199719 FAL\[client\]: All defined FAL servers have been attempted. Check that the CONTROL_FILE_RECORD_KEEP_TIME initialization parameter is defined to a value that's sufficiently large enough to maintain adequate log switch information to resolve archivelog gaps.

4)将日志批量复制到备库,并解决GAP

– 【目标端】GAP查询语句

select count(*) from v$archive_gap;

–【目标端】有时候该视图查询超级慢,也可以考虑用下面语句代替:

select USERENV('Instance'), high.thread#, low.lsq, high.hsq from (select a.thread#, rcvsq, min(a.sequence#)-1 hsq from v$archived_log a, (select lh.thread#, lh.resetlogs_change#, max(lh.sequence#) rcvsq from v$log_history lh, v$database_incarnation di where lh.resetlogs_time = di.resetlogs_time and lh.resetlogs_change# = di.resetlogs_change# and di.status = 'CURRENT' and lh.thread# is not null and lh.resetlogs_change# is not null and lh.resetlogs_time is not null group by lh.thread#, lh.resetlogs_change# ) b where a.thread# = b.thread# and a.resetlogs_change# = b.resetlogs_change# and a.sequence# > rcvsq group by a.thread#, rcvsq) high, (select srl_lsq.thread#, nvl(lh_lsq.lsq, srl_lsq.lsq) lsq from (select thread#, min(sequence#)+1 lsq from v$log_history lh, x$kccfe fe, v$database_incarnation di where to_number(fe.fecps) <= lh.next_change# and to_number(fe.fecps) >= lh.first_change# and fe.fedup!=0 and bitand(fe.festa, 12) = 12 and di.resetlogs_time = lh.resetlogs_time and lh.resetlogs_change# = di.resetlogs_change# and di.status = 'CURRENT' group by thread#) lh_lsq, (select thread#, max(sequence#)+1 lsq from v$log_history where (select min( to_number(fe.fecps)) from x$kccfe fe where fe.fedup!=0 and bitand(fe.festa, 12) = 12) >= next_change# group by thread#) srl_lsq where srl_lsq.thread# = lh_lsq.thread#(+) ) low where low.thread# = high.thread# and lsq < = hsq and hsq > rcvsq;

–【源端】复制归档日志目标端

scp -P port username@IP:目标端目录

–【目标端】手动注册归档日志(注册断掉的第一个就好,后续再注册会提示已注册)

alter database register logfile '/${归档目录绝对路径}/ARC0000201381_1013199719.0001.arc';

(### 你以为这就完了么??? ###)

在线重做日志我设置是500M,大部分归档日志文件在400~500M之间,观察日志应用发现,有时候是1分钟一个,有时候3到5分钟应用一个,有时候又1分钟俩三个,任由下去的话,要被业务部门KO了。怎么加速呢??

DG目标端应用优化加速

1)开启并行(视负载情况及CPU核数,最大设置cpu*2)

– 取消日志应用

alter database recover managed standby database cancel;

– 默认8通道,更改为12通道

alter database recover managed standby database parallel 16 using current logfile disconnect;

2)主库到备库的日志传输,考虑积压量大,可以传输一段时间关闭一段时间的方式(源端主库操作)

– 停止主库归档日志传输到备库

ALTER system SET log_archive_dest_state_2 = 'defer';

– 恢复归档日志传输

ALTER system SET log_archive_dest_state_2 = enable;

3)优化参数并重启数据库(目标端执行)

– 查看参数设置

show parameter parallel_execution_message_size --值为16384 show parameter filesystemio_options --默认none

– linux操作系统层面可查看

grep kio /proc/slabinfo

– 更改设置为异步IO(单机实例,11.2.0.4)

alter system set filesystemio_options=setall scope=spfile;

–指定并行执行中meassage的大小,11G取值2148~32768

alter system set parallel_execution_message_size=32768 scope=spfile;

通过上述1 2 点优化设置后,速度有点提升,但是速度不稳定(感谢lgs提供的思路);

停掉日志传输,并行度从4、6、8、12、16、 20都尝试过,但资源监控:CPU、IO都闲的发慌,占比20%以下,这让我再想加资源都没有理由。

最后发现第3点的参数调优,可以一试,趁着中午业务低峰期,调整参数后,导出pfile,比对之前的初始化参数文件,确定没问题后,重启目标端端数据库【慎之又慎】。

通过以上三点,优化完成后效果显著,结果是可喜的:
速度提升10-20倍,由原来的2分钟1-3个,提升到2分钟30-50个文件

开启DG监控,通过shell脚本+任务计划实现:

1)设定日志监控:通过监控alert日志的关键字(在使用中逐步优化)进行告警

2)通过视图v$dataguard_stats进行监控(单位分钟):

select to_number(substr(s.value,2,2))2460+to_number(substr(s.value,5,2))*60+to_number(substr(s.value,8,2)) as nums from v$dataguard_stats s where name='apply lag';

其他相关查询:

– 查询归档还原情况,若是正常在同步中,YES后是IN-MEMORY状态,表示正在应用的日志文件,或通过文件号定位最新的应用到的归档,YES为已应用,NO为未应用

select g.recid,g.sequence#,g.applied,status,archived,g.first_time from v$archived_log g where 1=1-- g.sequence#> 201134 and applied='IN-MEMORY' order by g.sequence#

– 查看备库DG状态

SELECT PROCESS,CLIENT_PROCESS,PROCESS,STATUS,THREAD#,SEQUENCE#,BLOCK#,BLOCKS FROM V$MANAGED_STANDBY;

– 查询在线日志情况

select thread#, group#, sequence#, bytes, archived, s.status , blocksize, member FROM v$standby_log s join v$logfile using (GROUP#) order by thread#, group#;

– 查询进程状态 dg库日志应用性能监控

select to_char(sysdate,'DD-MON-YYYY HH24:MI:SS') "Current time" ,s.process , p.spid , substr(s.program, -6) PROC , s.event , s.p1 , s.p2 , s.p3 , s.seconds_in_wait SIW , s.seq# from v$session s, v$process p where p.addr = s.paddr and (s.program like '%MRP%' or s.program like '%PR0%' or s.program like '%DBW%' or s.program like '%CKPT%') order by s.process;
最后修改时间:2022-09-08 18:07:30
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论