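Scope
Applies to all versions of OGG (Oracle GoldenGate).
Problem Overview
On November 11, 2022, a customer reported that a replicat process on the OGG target was showing a very large lag. Its RBA had stopped advancing, and the trail file was no longer being updated.
On the source, the capture process turned out to be ABENDED, and the time it failed matched the time the replicat stopped applying changes.
ggserr.log reported: Unable to write to file "./dirdat/ec005114" (error 28, No space left on device), so the directory had clearly run out of space. A check of the file system confirmed that the OGG directory was 100% used.
The mgr process was configured to purge expired trail files automatically, yet nothing had been purged. Further investigation showed that the capture and pump extract processes on the source each had two Extract Trail paths, and the one with Seqno: 0 was invalid.
Fault Analysis
1. A replicat process on the OGG target shows a very large lag.
2. First check the RBA of the process: it is not advancing.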
REPLICAT   REP1      Last Started 2022-05-31 10:20   Status RUNNING
Checkpoint Lag       00:00:00 (updated 00:00:00 ago)
Log Read Checkpoint  File ./dirdat/ec005133
                     2022-11-11 08:37:30.897554  RBA 60803824
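The "RBA is not advancing" check can be automated by sampling the `info` output twice and comparing the read position. The following is a minimal sketch, assuming the output has the layout shown above; the function names `read_position` and `is_stalled` are made up for illustration and are not part of OGG.

```python
import re

# Patterns for the fields shown in the GGSCI "info replicat" output above.
FILE_PATTERN = re.compile(r"Log Read Checkpoint\s+File\s+(\S+)")
RBA_PATTERN = re.compile(r"\bRBA\s+(\d+)")


def read_position(info_output):
    """Return (trail_file, rba) parsed from the text of an info report."""
    file_match = FILE_PATTERN.search(info_output)
    rba_match = RBA_PATTERN.search(info_output)
    if file_match is None or rba_match is None:
        raise ValueError("could not find the Log Read Checkpoint fields")
    return file_match.group(1), int(rba_match.group(1))


def is_stalled(earlier_output, later_output):
    """True when two samples taken some time apart show the same position."""
    return read_position(earlier_output) == read_position(later_output)


sample = (
    "REPLICAT REP1 Last Started 2022-05-31 10:20 Status RUNNING\n"
    "Checkpoint Lag 00:00:00 (updated 00:00:00 ago)\n"
    "Log Read Checkpoint File ./dirdat/ec005133\n"
    "2022-11-11 08:37:30.897554 RBA 60803824\n"
)
print(read_position(sample))
```

Comparing the trail file name together with the RBA avoids a false "stalled" result when the process has rolled over to a new trail file and happens to sit at the same offset.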
3. Neither ggserr.log nor the process report log shows any error, and v$session in the database shows no OGG sessions running large transactions.
set linesize 145
set pagesize 11111
col username for a12
col PROGRAM for a18
col MACHINE for a15
col EVENT for a34
col TERMINAL for a15
col osuser for a15
col sql_id for a13
col STATUS for a8
col sid for 9999
col serial# for 9999999
select username, SID, SERIAL#, BLOCKING_INSTANCE, blocking_session,
       BLOCKING_SESSION_STATUS, STATUS, MACHINE, PROGRAM, TERMINAL, SQL_ID, EVENT
  from v$session
 where username is not null
   and event not like '%message%'
   and username = 'OGG'
 order by event;
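4. Check the trail file the replicat is reading. Its modification time matches the replicat's Log Read Checkpoint time, which means the source has not delivered any new trail data to the target since then.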
[oracle@host dirdat]$ ls -l ec005133
-rw-r----- 1 oracle oinstall 60975590 Nov 11 08:37 ec005133
[oracle@host dirdat]$
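The comparison between the trail file's modification time and the Log Read Checkpoint time can be sketched as below. This is an illustrative helper, not an OGG tool: the function names and the 60-second tolerance are assumptions, and the timestamp format is taken from the `info` output earlier in this write-up.

```python
import os
from datetime import datetime


def checkpoint_time(text):
    """Parse a checkpoint timestamp such as '2022-11-11 08:37:30.897554'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def file_mtime(path):
    """Return the last modification time of a file as a local datetime."""
    return datetime.fromtimestamp(os.path.getmtime(path))


def matches_checkpoint(mtime, checkpoint, tolerance_seconds=60.0):
    """True when the two times differ by no more than the tolerance."""
    return abs((mtime - checkpoint).total_seconds()) <= tolerance_seconds


checkpoint = checkpoint_time("2022-11-11 08:37:30.897554")
# "Nov 11 08:37" from ls -l, with the year filled in from the incident date.
listed_mtime = datetime(2022, 11, 11, 8, 37)
print(matches_checkpoint(listed_mtime, checkpoint))
```

On a live system, `file_mtime("./dirdat/ec005133")` would replace the hand-entered `listed_mtime`. If the two times match while the replicat is still running, the replicat has probably consumed everything that was delivered, which points the investigation at the source side.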
5. Check the processes on the source: the capture process is ABENDED.
GGSCI (host) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING
EXTRACT     RUNNING     PUMP1       00:00:02      00:00:07
EXTRACT     ABENDED     EXT1        unknown       00:00:03
6. Check ggserr.log. The error message shows that the device has run out of space.
2022-11-11 08:13:39 ERROR OGG-01096 Oracle GoldenGate Capture for Oracle, ext1.prm: Unable to write to file "./dirdat/ec005114" (error 28, No space left on device).
2022-11-11 08:13:41 ERROR OGG-01668 Oracle GoldenGate Capture for Oracle, ext1.prm: PROCESS ABENDING.
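Scanning ggserr.log for ERROR entries by hand gets tedious on a large log. The sketch below filters them out, assuming each entry is a single line in the "timestamp level code message" layout seen in the excerpt above; other OGG releases may format timestamps differently, so the pattern would need adjusting. The helper names are hypothetical.

```python
import re

# Layout assumed from the excerpt: "YYYY-MM-DD HH:MM:SS LEVEL OGG-nnnnn text".
ENTRY = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<code>OGG-\d+)\s+(?P<text>.*)$"
)


def find_errors(log_text):
    """Return (timestamp, code, message) for every ERROR entry."""
    errors = []
    for line in log_text.splitlines():
        match = ENTRY.match(line.strip())
        if match and match.group("level") == "ERROR":
            errors.append((match.group("ts"), match.group("code"), match.group("text")))
    return errors


def out_of_space(errors):
    """True when any error message mentions a full device."""
    return any("No space left on device" in text for _, _, text in errors)


log_sample = (
    '2022-11-11 08:13:39 ERROR OGG-01096 Oracle GoldenGate Capture for Oracle, ext1.prm: '
    'Unable to write to file "./dirdat/ec005114" (error 28, No space left on device).\n'
    "2022-11-11 08:13:41 ERROR OGG-01668 Oracle GoldenGate Capture for Oracle, ext1.prm: "
    "PROCESS ABENDING.\n"
    "2022-11-11 18:09:29 INFO OGG-00957 Oracle GoldenGate Manager for Oracle, mgr.prm: "
    "Purged old extract file /oggfs/dirdat/ec005000, applying UseCheckPoints purge rule: "
    "Oldest Chkpt Seqno 5329 > 5000.\n"
)
errors = find_errors(log_sample)
for ts, code, text in errors:
    print(ts, code)
```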
7. Check file system usage: the OGG file system is indeed 100% used.
[oracle@host dirdat]$ df -h
Filesystem                          Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-LogVol_root     99G  8.5G   85G  10% /
tmpfs                               505G  658M  505G   1% /dev/shm
/dev/sda1                           240M   33M  195M  15% /boot
/dev/mapper/VolGroup-LogVol_oracle   99G   64G   30G  69% /u01
/dev/mapper/oggvg-lvogg              96G   96G     0 100% /oggfs
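A full file system is easier to catch before the extract abends. The following sketch uses the standard library's `shutil.disk_usage` to compute a usage percentage for a mount point; the 90% threshold and the function names are arbitrary choices for illustration. Note that `df` computes Use% from used and available blocks, so its figure can differ slightly from this one on file systems with reserved blocks.

```python
import shutil


def usage_percent(path):
    """Return used space as a percentage of total space for the file system holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def is_nearly_full(path, threshold=90.0):
    """True when usage is at or above the threshold (in percent)."""
    return usage_percent(path) >= threshold


# "." stands in for the OGG mount point (/oggfs in this incident).
print(f"{usage_percent('.'):.1f}% used")
```

Run from cron or a monitoring agent against the OGG mount point, a check like this would have raised an alert well before the usage reached 100%.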
8. Find out which files take up the most space: the trail files.
[oracle@host oggfs]$ du -sh dirdat
93G     dirdat
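When it is not obvious which directory to run `du` against, the subdirectories can be ranked by size. This is a rough sketch with hypothetical function names; it sums apparent file sizes, whereas `du` reports allocated blocks, so the numbers will not match `du` exactly.

```python
import os


def directory_size(path):
    """Sum the sizes in bytes of all regular files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full_path = os.path.join(root, name)
            if os.path.islink(full_path):
                continue  # do not follow or count symbolic links
            try:
                total += os.path.getsize(full_path)
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def largest_subdirectories(base, top=5):
    """Return up to `top` (name, bytes) pairs for subdirectories of base, largest first."""
    sizes = []
    with os.scandir(base) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((entry.name, directory_size(entry.path)))
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]


# "." stands in for the OGG home (/oggfs in this incident).
for name, size in largest_subdirectories("."):
    print(f"{size:>15,d}  {name}")
```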
9. mgr is configured to purge expired trail files automatically, but has not done so.
GGSCI (host) 3> view param mgr

port 8899
DYNAMICPORTLIST 8899-9988
--autostart er *
autorestart extract *, retries 5, waitminutes 1
purgeoldextracts ./dirdat/*, usecheckpoints, minkeepdays 1
userid ogg, password ogg
purgeddlhistory minkeepdays 15, maxkeepdays 30
purgemarkerhistory minkeepdays 15, maxkeepdays 30
10. Check the exttrail information. Each process has two Extract Trail paths, and the one with Seqno: 0 is invalid.
GGSCI (host) 1> info exttrail *

Extract Trail: /oggfs/dirdat/ec
Extract: PUMP1
Seqno: 0
RBA: 0
File Size: 100M

Extract Trail: ./dirdat/ec
Extract: PUMP1
Seqno: 5345
RBA: 37735873
File Size: 100M

Extract Trail: /oggfs/dirdat/ec
Extract: EXT1
Seqno: 0
RBA: 0
File Size: 100M

Extract Trail: ./dirdat/ec
Extract: EXT1
Seqno: 5328
RBA: 22457198
File Size: 100M
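Spotting the leftover trail definitions can be scripted by parsing the `info exttrail *` output. The sketch below is an illustration with made-up function names: it flags a trail as suspect only when it sits at Seqno 0 / RBA 0 while the same extract has another trail that has advanced. That heuristic is an assumption drawn from this incident, since a newly added trail also starts at Seqno 0, so each candidate still needs to be confirmed manually.

```python
import re
from collections import defaultdict

# Field order assumed from the "info exttrail *" output shown above.
ENTRY = re.compile(
    r"Extract Trail:\s*(?P<trail>\S+)\s+"
    r"Extract:\s*(?P<extract>\S+)\s+"
    r"Seqno:\s*(?P<seqno>\d+)\s+"
    r"RBA:\s*(?P<rba>\d+)"
)


def parse_exttrail(text):
    """Return one dict per trail definition found in the output."""
    return [
        {
            "trail": match.group("trail"),
            "extract": match.group("extract"),
            "seqno": int(match.group("seqno")),
            "rba": int(match.group("rba")),
        }
        for match in ENTRY.finditer(text)
    ]


def suspect_trails(entries):
    """Return (extract, trail) pairs stuck at 0/0 beside an active sibling trail."""
    by_extract = defaultdict(list)
    for entry in entries:
        by_extract[entry["extract"]].append(entry)
    suspects = []
    for extract, trails in by_extract.items():
        has_active = any(t["seqno"] > 0 or t["rba"] > 0 for t in trails)
        if not has_active:
            continue  # nothing has been written yet; seqno 0 is expected
        for t in trails:
            if t["seqno"] == 0 and t["rba"] == 0:
                suspects.append((extract, t["trail"]))
    return suspects


output = (
    "Extract Trail: /oggfs/dirdat/ec Extract: PUMP1 Seqno: 0 RBA: 0 File Size: 100M "
    "Extract Trail: ./dirdat/ec Extract: PUMP1 Seqno: 5345 RBA: 37735873 File Size: 100M "
    "Extract Trail: /oggfs/dirdat/ec Extract: EXT1 Seqno: 0 RBA: 0 File Size: 100M "
    "Extract Trail: ./dirdat/ec Extract: EXT1 Seqno: 5328 RBA: 22457198 File Size: 100M"
)
for extract, trail in suspect_trails(parse_exttrail(output)):
    print(f"{extract}: {trail} looks unused")
```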
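11. Try to delete the invalid Extract Trail path. It cannot be deleted while the processes are running. (Before deleting, make absolutely sure that no other process is using the Extract Trail path.)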
GGSCI (host) 6> DELETE EXTTRAIL /oggfs/dirdat/ec
Cannot delete extract trail /oggfs/dirdat/ec, extract PUMP1 is running.
Cannot delete extract trail /oggfs/dirdat/ec, extract EXT1 is running.
12. Stop the capture and pump processes first.
GGSCI (host) 11> stop ext1
Sending STOP request to EXTRACT EXT1 ...
Request processed.

GGSCI (host) 12> stop pump
Sending STOP request to EXTRACT PUMP1 ...
Request processed.
13. Delete the invalid Extract Trail path again.
GGSCI (host) 13> DELETE EXTTRAIL /oggfs/dirdat/ec
Deleting extract trail /oggfs/dirdat/ec for extract PUMP1
Deleting extract trail /oggfs/dirdat/ec for extract EXT1
14. Check the exttrail information again: the invalid Extract Trail path has been removed.
GGSCI (host) 14> info exttrail *

Extract Trail: ./dirdat/ec
Extract: PUMP1
Seqno: 5345
RBA: 94634565
File Size: 100M

Extract Trail: ./dirdat/ec
Extract: EXT1
Seqno: 5328
RBA: 79370115
File Size: 100M
15. The log shows that expired trail files are now being purged automatically.
2022-11-11 18:09:29 INFO OGG-00957 Oracle GoldenGate Manager for Oracle, mgr.prm: Purged old extract file /oggfs/dirdat/ec005000, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5000.
2022-11-11 18:09:29 INFO OGG-00957 Oracle GoldenGate Manager for Oracle, mgr.prm: Purged old extract file /oggfs/dirdat/ec005001, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5001.
2022-11-11 18:09:29 INFO OGG-00957 Oracle GoldenGate Manager for Oracle, mgr.prm: Purged old extract file /oggfs/dirdat/ec005002, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5002.
2022-11-11 18:09:29 INFO OGG-00957 Oracle GoldenGate Manager for Oracle, mgr.prm: Purged old extract file /oggfs/dirdat/ec005003, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5003.
2022-11-11 18:09:29 INFO OGG-00957 Oracle GoldenGate Manager for Oracle, mgr.prm: Purged old extract file /oggfs/dirdat/ec005004, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5004.
2022-11-11 18:09:29 INFO OGG-00957 Oracle GoldenGate Manager for Oracle, mgr.prm: Purged old extract file /oggfs/dirdat/ec005005, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5005.
2022-11-11 18:09:29 INFO OGG-00957 Oracle GoldenGate Manager for Oracle, mgr.prm: Purged old extract file /oggfs/dirdat/ec005006, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5006.
2022-11-11 18:09:29 INFO OGG-00957 Oracle GoldenGate Manager for Oracle, mgr.prm: Purged old extract file /oggfs/dirdat/ec005007, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5007.
2022-11-11 18:09:29 INFO OGG-00957 Oracle GoldenGate Manager for Oracle, mgr.prm: Purged old extract file /oggfs/dirdat/ec005008, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5008.
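Root Cause
On the source, the capture and pump extract processes each had two Extract Trail paths, one of which was invalid. This prevented mgr from purging expired trail files automatically, which filled the OGG file system and ultimately caused the capture process to ABEND with "No space left on device".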
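One plausible reading of why a leftover trail definition blocks purging follows from the purge messages in ggserr.log ("Oldest Chkpt Seqno 5329 > 5000"): a file is purged only when its sequence number is below the oldest checkpoint recorded for the trail. The sketch below is a simplified model of that comparison and nothing more. It is not Manager's actual implementation, it ignores minkeepdays, and the checkpoint values are illustrative.

```python
def purgeable_files(file_seqnos, checkpoint_seqnos):
    """Return the file sequence numbers that lie below every checkpoint.

    Simplified model of a "use checkpoints" purge rule: a trail file may go
    only when the oldest recorded checkpoint has already moved past it.
    """
    if not checkpoint_seqnos:
        return []  # no checkpoint information, so keep everything
    oldest = min(checkpoint_seqnos)
    return [seqno for seqno in sorted(file_seqnos) if seqno < oldest]


files_on_disk = list(range(5000, 5009))  # ec005000 .. ec005008

# With a leftover definition stuck at Seqno 0, the oldest checkpoint is 0.
print(purgeable_files(files_on_disk, [0, 5329]))

# Once the leftover definition is deleted, the oldest checkpoint is 5329.
print(purgeable_files(files_on_disk, [5329]))
```

Under this model the stale entry pins the oldest checkpoint at 0, so no file ever qualifies, which is consistent with purging resuming right after the entry was deleted.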
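Solution
1. Stop the processes that have an invalid Extract Trail path (the path cannot be deleted while they are running).
2. Delete the invalid Extract Trail path from those processes.
3. Restart the OGG processes; the automatic trail purge configuration then takes effect.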
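The order of operations matters here, so it can help to generate the GGSCI commands from the list of affected groups rather than type them ad hoc. The helper below is a hypothetical sketch that only builds the command text; it does not run anything. The `stop` and `delete exttrail` steps mirror the transcript above, while the `start extract` commands are an assumption, since the restart itself is not shown in the transcript.

```python
def cleanup_commands(extract_groups, stale_trail):
    """Return GGSCI commands, in order, to remove a stale trail definition."""
    commands = [f"stop extract {group}" for group in extract_groups]
    commands.append(f"delete exttrail {stale_trail}")
    commands.append("info exttrail *")  # confirm the stale entry is gone
    commands.extend(f"start extract {group}" for group in extract_groups)
    return commands


for command in cleanup_commands(["EXT1", "PUMP1"], "/oggfs/dirdat/ec"):
    print(command)
```

Reviewing the printed list before pasting it into GGSCI gives a last chance to confirm that the trail path being deleted is the unused one.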