
An invalid Extract Trail path prevents automatic purging of expired trail files

Original stofm 2022-11-22

Scope

Applies to all versions of OGG.

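Problem Overview

On November 11, 2022, a customer reported that a replicat process on the OGG target was lagging badly: its RBA was not advancing and its trail file was no longer being updated.
On the source, the capture process turned out to be ABENDED, at a time consistent with when the replicat stopped replicating.
ggserr.log reported: Unable to write to file "./dirdat/ec005114" (error 28, No space left on device), so the directory had run out of space. A check of the file system confirmed that the OGG directory was at 100% usage.
The mgr process was configured to purge expired trail files automatically, but it had not done so. Investigation showed that the capture and pump extract processes on the source each had two Extract Trail paths, and the one with Seqno: 0 was invalid.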

Fault Analysis

1. A replicat process on the OGG target shows a large lag.

2. First check the RBA of the process; it is not advancing.

REPLICAT   REP1      Last Started 2022-05-31 10:20   Status RUNNING
Checkpoint Lag       00:00:00 (updated 00:00:00 ago)
Log Read Checkpoint  File ./dirdat/ec005133
                     2022-11-11 08:37:30.897554  RBA 60803824

3. Neither ggserr.log nor the process report log shows any error, and v$session in the database shows no OGG-related sessions holding large transactions.

set linesize 145
set pagesize 11111
col username for a12
col PROGRAM for a18
col MACHINE for a15
col EVENT for a34
col TERMINAL for a15
col osuser for a15
col sql_id for a13
col STATUS for a8
col sid for 9999
col serial# for 9999999
select username,SID,SERIAL#,BLOCKING_INSTANCE,blocking_session,BLOCKING_SESSION_STATUS,STATUS,MACHINE,PROGRAM,TERMINAL,SQL_ID,EVENT from v$session where username is not null and event not like '%message%' and username='OGG' order by event;


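4. Check the trail file the replicat is reading. Its modification time matches the replicat's Log Read Checkpoint time, which means the source has not delivered any new trail data to the target since then.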

[oracle@host dirdat]$ ls -l ec005133
-rw-r----- 1 oracle oinstall 60975590 Nov  11 08:37  ec005133
[oracle@host dirdat]$ 


5. Check the source processes: the capture process has abended.

GGSCI (host) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING                                           
EXTRACT     RUNNING     PUMP1       00:00:02      00:00:07    
EXTRACT     ABENDED     EXT1        unknown       00:00:03    


6. Check ggserr.log; the error message shows that the device has run out of space.

2022-11-11 08:13:39  ERROR   OGG-01096  Oracle GoldenGate Capture for Oracle, ext1.prm:  Unable to write to file "./dirdat/ec005114" (error 28, No space left on device).
2022-11-11 08:13:41  ERROR   OGG-01668  Oracle GoldenGate Capture for Oracle, ext1.prm:  PROCESS ABENDING.

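When ggserr.log is large, it can help to filter it down to ERROR-level events before reading further. The following is a minimal sketch that assumes every event line starts with a `YYYY-MM-DD HH:MM:SS` timestamp followed by the severity, as in the excerpt above; the function name `ogg_errors` is made up for illustration.

```shell
# Print only ERROR-level events from a ggserr.log-style file.
# Assumes the line layout shown above: timestamp, severity, OGG-nnnnn code.
ogg_errors() {
    grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9:]{8} +ERROR +OGG-[0-9]+' "$1"
}
```

It could be called as `ogg_errors ggserr.log | tail -20` from the OGG home directory to see the most recent errors.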

7. Check file system usage; the OGG directory is indeed at 100%.

[oracle@host dirdat]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-LogVol_root
                       99G  8.5G   85G  10% /
tmpfs                 505G  658M  505G   1% /dev/shm
/dev/sda1             240M   33M  195M  15% /boot
/dev/mapper/VolGroup-LogVol_oracle
                       99G   64G   30G  69% /u01
/dev/mapper/oggvg-lvogg
                       96G   96G   0  100% /oggfs

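A periodic usage check could have raised an alarm before the capture process abended. The sketch below is only an illustration: the mount point `/oggfs` comes from the output above, while the 85% threshold and the function names are assumptions. It uses `df -P` so that each file system is printed on a single line, which avoids the wrapped device names seen in the plain `df -h` output.

```shell
# Extract the capacity column (as a bare integer) from `df -P` output on stdin.
# With -P, line 1 is the header and column 5 is the capacity, e.g. "100%".
use_pct_from_df() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Assumed defaults; override through the environment as needed.
OGG_FS="${OGG_FS:-/oggfs}"
THRESHOLD="${THRESHOLD:-85}"

# Report usage of the OGG file system and return non-zero above the threshold.
check_ogg_fs() {
    pct=$(df -P "$OGG_FS" 2>/dev/null | use_pct_from_df)
    if [ -z "$pct" ]; then
        echo "cannot read usage for $OGG_FS"
        return 2
    fi
    if [ "$pct" -ge "$THRESHOLD" ]; then
        echo "WARNING: $OGG_FS is ${pct}% full"
        return 1
    fi
    echo "OK: $OGG_FS is ${pct}% full"
}
```

Calling `check_ogg_fs` from cron and alerting on a non-zero exit status is one simple way to use it.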

8. Find out what is taking up the most space: it is the trail files.

[oracle@host oggfs]$ du -sh dirdat
93G	dirdat

9. mgr is configured to purge expired trail files automatically, yet nothing has been purged.

GGSCI (host) 3> view param mgr

port 8899
DYNAMICPORTLIST 8899-9988
--autostart er *
autorestart extract *, retries 5, waitminutes 1
purgeoldextracts ./dirdat/*, usecheckpoints, minkeepdays 1
userid ogg, password ogg 
purgeddlhistory minkeepdays 15, maxkeepdays 30
purgemarkerhistory minkeepdays 15, maxkeepdays 30


10. Check the exttrail information: each process has two Extract Trail paths, and the one with Seqno: 0 is invalid.

GGSCI (host) 1> info exttrail *

     Extract Trail: /oggfs/dirdat/ec
           Extract: PUMP1
             Seqno: 0
               RBA: 0
         File Size: 100M
     Extract Trail: ./dirdat/ec
           Extract: PUMP1
             Seqno: 5345
               RBA: 37735873
         File Size: 100M

     Extract Trail: /oggfs/dirdat/ec
           Extract: EXT1
             Seqno: 0
               RBA: 0
         File Size: 100M
     Extract Trail: ./dirdat/ec
           Extract: EXT1
             Seqno: 5328
               RBA: 22457198
         File Size: 100M
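
With many extract groups, spotting such entries by eye gets tedious. As a rough sketch, the output of `info exttrail *` can be scanned for trails whose checkpoint is still at Seqno 0 and RBA 0. The function name `flag_unused_trails` is hypothetical, and the label layout is assumed to match the output above. A hit is only a candidate: a trail that was just added and has not been written yet looks the same, so each one should be confirmed before anything is deleted.

```shell
# Read `info exttrail *` output on stdin and print "GROUP TRAIL" for every
# registered trail whose Seqno and RBA are both 0 (candidates for review only).
flag_unused_trails() {
    awk -F': *' '
        { gsub(/^[ \t]+/, "", $1) }            # strip indentation from the label
        $1 == "Extract Trail" { trail = $2 }   # remember the current trail path
        $1 == "Extract"       { group = $2 }   # remember the owning group
        $1 == "Seqno"         { seqno = $2 }
        $1 == "RBA" {
            if (seqno == 0 && $2 == 0)
                print group, trail
        }
    '
}
```

Assuming the classic `ggsci` binary accepts commands on standard input, it could be used as `echo "info exttrail *" | ./ggsci | flag_unused_trails`.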

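11. Try to delete the invalid Extract Trail path; it cannot be deleted while the processes are running. (Before deleting, make sure no other process is using that Extract Trail path.)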

GGSCI (host) 6> DELETE EXTTRAIL /oggfs/dirdat/ec
Cannot delete extract trail /oggfs/dirdat/ec, extract PUMP1 is running.
Cannot delete extract trail /oggfs/dirdat/ec, extract EXT1 is running.

12. Stop the capture and pump processes first.

GGSCI (host) 11> stop ext1

Sending STOP request to EXTRACT EXT1 ...
Request processed.


GGSCI (host) 12> stop pump1

Sending STOP request to EXTRACT PUMP1 ...
Request processed.


13. Delete the invalid Extract Trail path again.

GGSCI (host) 13> DELETE EXTTRAIL /oggfs/dirdat/ec
Deleting extract trail /oggfs/dirdat/ec for extract PUMP1
Deleting extract trail /oggfs/dirdat/ec for extract EXT1


14. Check the exttrail information again; the invalid Extract Trail path has been removed.

GGSCI (host) 14> info exttrail *

       Extract Trail: ./dirdat/ec
             Extract: PUMP1
               Seqno: 5345
                 RBA: 94634565
           File Size: 100M

       Extract Trail: ./dirdat/ec
             Extract: EXT1
               Seqno: 5328
                 RBA: 79370115
           File Size: 100M

15. The log shows that expired trail files are now being purged automatically.

2022-11-11 18:09:29  INFO    OGG-00957  Oracle GoldenGate Manager for Oracle, mgr.prm:  Purged old extract file /oggfs/dirdat/ec005000, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5000.
2022-11-11 18:09:29  INFO    OGG-00957  Oracle GoldenGate Manager for Oracle, mgr.prm:  Purged old extract file /oggfs/dirdat/ec005001, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5001.
2022-11-11 18:09:29  INFO    OGG-00957  Oracle GoldenGate Manager for Oracle, mgr.prm:  Purged old extract file /oggfs/dirdat/ec005002, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5002.
2022-11-11 18:09:29  INFO    OGG-00957  Oracle GoldenGate Manager for Oracle, mgr.prm:  Purged old extract file /oggfs/dirdat/ec005003, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5003.
2022-11-11 18:09:29  INFO    OGG-00957  Oracle GoldenGate Manager for Oracle, mgr.prm:  Purged old extract file /oggfs/dirdat/ec005004, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5004.
2022-11-11 18:09:29  INFO    OGG-00957  Oracle GoldenGate Manager for Oracle, mgr.prm:  Purged old extract file /oggfs/dirdat/ec005005, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5005.
2022-11-11 18:09:29  INFO    OGG-00957  Oracle GoldenGate Manager for Oracle, mgr.prm:  Purged old extract file /oggfs/dirdat/ec005006, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5006.
2022-11-11 18:09:29  INFO    OGG-00957  Oracle GoldenGate Manager for Oracle, mgr.prm:  Purged old extract file /oggfs/dirdat/ec005007, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5007.
2022-11-11 18:09:29  INFO    OGG-00957  Oracle GoldenGate Manager for Oracle, mgr.prm:  Purged old extract file /oggfs/dirdat/ec005008, applying UseCheckPoints purge rule: Oldest Chkpt Seqno 5329 > 5008.

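To confirm that purging keeps working after the fix, the OGG-00957 events can simply be counted. This is a small sketch with a made-up function name; it assumes that every purged file produces one OGG-00957 line, as in the excerpt above.

```shell
# Count Manager purge events (OGG-00957) in a ggserr.log-style file.
# grep -c prints 0 (and exits non-zero) when nothing has been purged.
count_purged() {
    grep -c 'OGG-00957' "$1"
}
```

Running `count_purged ggserr.log` now and again a day later should show the number growing if mgr is purging as configured.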

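Root Cause

The capture and pump extract processes on the source each had two Extract Trail paths, one of which was invalid. This prevented mgr from purging expired trail files automatically, so the file system holding the OGG directory filled up, and the capture process finally ABENDED with "No space left on device".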

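Solution

1. Stop the processes that have an invalid Extract Trail path (the path cannot be deleted while they are running);
2. Delete the invalid Extract Trail path from those processes;
3. Start the OGG processes again; the automatic trail purge configuration then takes effect.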
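
The three steps above can be sketched as a generated GGSCI command file. The helper name `gen_trail_cleanup` and the file name in the usage note are made up for illustration; the group names and trail path are the ones from this incident. The commands it prints are the same ones used in the analysis, plus `start extract` for the final step. The generated file should be reviewed before it is run, since `stop` only sends a request and a process may need a moment before it has actually stopped.

```shell
# Print the GGSCI commands for the three steps above.
# $1 = trail path to unregister; remaining arguments = extract groups.
gen_trail_cleanup() {
    trail=$1
    shift
    for g in "$@"; do
        echo "stop extract $g"        # step 1: stop the affected processes
    done
    echo "delete exttrail $trail"     # step 2: remove the invalid trail entry
    echo "info exttrail *"            # verify what is still registered
    for g in "$@"; do
        echo "start extract $g"       # step 3: start the processes again
    done
}
```

For example, `gen_trail_cleanup /oggfs/dirdat/ec EXT1 PUMP1 > cleanup.oby` writes the commands to a file that can then be run from GGSCI with `obey cleanup.oby`.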
