【Background】
A few days ago a friend messaged me that a GoldenGate replicat process would not start, with the following logs:
Error log:
2019-01-11 17:56:34 ERROR OGG-01296 Error mapping from SOURCE.AA to TARGET.AA
Discard file:
Mapping error to target column: DID
Mapping error to target column: DID
Current time: 2019-01-11 17:34:31
Discarded record from action ABEND on error 0 -- note: ABEND on error 0
Aborting transaction on ./dirdat/lq beginning at seqno 15172 rba 475244077
error at seqno 15172 rba 475427432
Problem replicating SOURCE.AA toTARGET.AA
Mapping problem with insert record (source format)...
*
DID = <Raw Data> ------------------> this primary key is numeric, yet it is not displayed correctly and shows up as <Raw Data>
000000: d5 11
DCODE = 0001
ORTYPE = <Raw Data>
000000: c5 e4 bc fe c6 d5 cd a8 b6 a9 b5 a5 |............ |
*
Process Abending : 2019-01-11 17:34:31
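The discard file locates the failing record by trail sequence number plus RBA. Below is a minimal Python sketch that turns those numbers into a trail file name and a short Logdump command list. It assumes OGG 11.2-style trail names: the two-character prefix followed by a six-digit zero-padded seqno, as in ./dirdat/lq015172 in the report file. The helper names are hypothetical. The Logdump commands (OPEN, GHDR ON, DETAIL DATA, POS, NEXT) are the commonly used ones, but check them against your Logdump version.

```python
from typing import List


def trail_file(prefix: str, seqno: int) -> str:
    """Trail file name: prefix + 6-digit zero-padded seqno (assumed 11.2 naming)."""
    return f"{prefix}{seqno:06d}"


def logdump_commands(prefix: str, seqno: int, rba: int) -> List[str]:
    """Hypothetical helper: Logdump commands to open the trail and show the record at rba."""
    return [
        f"open {trail_file(prefix, seqno)}",
        "ghdr on",       # show record headers
        "detail data",   # show column values in hex and ASCII
        f"pos {rba}",    # jump to the failing record
        "next",          # display that record
    ]


if __name__ == "__main__":
    txn_start_rba = 475244077  # "Aborting transaction ... beginning at seqno 15172 rba ..."
    error_rba = 475427432      # "error at seqno 15172 rba ..."
    print("bytes into transaction:", error_rba - txn_start_rba)
    for cmd in logdump_commands("./dirdat/lq", 15172, error_rba):
        print(cmd)
```

With the positions from this discard file, the failing record lies 183355 bytes into the aborted transaction.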
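【Environment】
OS: Windows Server 2008 R2
DB: Oracle 11.2.0.4
OGG: 11.2.1.0.17
Note: Windows takes some getting used to, including debugging and the like. Worse, the error messages differ from those on Linux.
Table structure: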
CREATE TABLE AA(
DID NUMBER(20) NOT NULL primary key,
DCCODE VARCHAR2(50),
OTYPE VARCHAR2(50),
DIST NUMBER(18,2),
CEDATE DATE,
DDLINE DATE,
CODE VARCHAR2(50))
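【Analysis】
1. Analyzing the data with logdump showed the primary key's hex value as 6061123939.
The discard file, however, did not show the actual value, only <Raw Data> followed by 000000: d5 11, and no base conversion of d5 11 matched it.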
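The conversions tried in step 1 can be scripted. Below is a minimal Python sketch that reads a <Raw Data> hex dump as big- and little-endian integers and also tries a GBK decode. GBK is an assumption here, based on a Chinese-locale Windows source (for example a ZHS16GBK database). Substitute the real character set. The function name is hypothetical.

```python
from typing import Dict, Optional, Union


def interpret_raw(hex_dump: str) -> Dict[str, Union[int, Optional[str]]]:
    """Try common interpretations of a discard-file <Raw Data> hex dump."""
    raw = bytes.fromhex(hex_dump)  # spaces between byte pairs are allowed
    result: Dict[str, Union[int, Optional[str]]] = {
        "big_endian": int.from_bytes(raw, "big"),
        "little_endian": int.from_bytes(raw, "little"),
    }
    try:
        result["gbk"] = raw.decode("gbk")  # assumed source character set
    except UnicodeDecodeError:
        result["gbk"] = None               # not valid GBK text
    return result


if __name__ == "__main__":
    for dump in ("d5 11", "c5 e4 bc fe c6 d5 cd a8 b6 a9 b5 a5"):
        r = interpret_raw(dump)
        text = r["gbk"]
        print(dump, r["big_endian"], r["little_endian"],
              "no valid GBK" if text is None else f"{len(text)} GBK chars")
```

For d5 11 this gives 54545 (big-endian) and 4565 (little-endian), and the GBK decode fails because 0x11 is not a valid GBK trail byte. The neighboring ORTYPE dump, by contrast, decodes cleanly to six double-byte characters. This fits the idea that the DID bytes are not a well-formed value for the column as the target defines it.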
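2. Comparing the table structures on both sides showed that they were identical.
Starting with 12.2, table definitions are stored in the trail file header and can be retrieved with the SCANFORMETADATA command: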
logdump> SCANFORMETADATA
For versions earlier than 12.2, they can be obtained in GGSCI with:
capture tabledefs xx.xx
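3. Use GoldenGate SQL debug mode to print the statements being executed:
Note: Oddly, in debug mode the process exited immediately without printing anything, although it normally prints every SQL statement.
Debug-mode log: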
2019-01-11 17:56:34 WARNING OGG-01431 Aborted grouped transaction on 'SOURCE.AA', Mapping error.
2019-01-11 17:56:34 WARNING OGG-01003 Repositioning to rba 475244077 in seqno 15172.
2019-01-11 17:56:34 WARNING OGG-01151 Error mapping from SOURCE.AA to SOURCE.AA.
2019-01-11 17:56:34 WARNING OGG-01003 Repositioning to rba 475244077 in seqno 15172.
Source Context :
SourceModule : [er.errors]
SourceID : [er/errors.cpp]
SourceFunction : [take_rep_err_action]
SourceLine : [632]
ThreadBacktrace : [11] elements
: [D:\dmsjk\gglog.dll(??1CContextItem@@UEAA@XZ+0x34f3) [0x0000000180114DE3]]
: [D:\dmsjk\gglog.dll(?_MSG_ERR_MAP_TO_TANDEM_FAILED@@YAPEAVCMessage@@PEAVCSourceContext@@AEBV?$CQualDBObjName@$00@ggapp@gglib@ggs@@1W4MessageDisposition@CMessageFactory@@@Z+0x138) [0x00000001800AE3D8]]
: [D:\dmsjk\replicat.exe(ERCALLBACK+0x72ae) [0x000000014009CA3E]]
: [D:\dmsjk\replicat.exe(ERCALLBACK+0x36577) [0x00000001400CBD07]]
: [D:\dmsjk\replicat.exe(ERCALLBACK+0x5c98a) [0x00000001400F211A]]
: [D:\dmsjk\replicat.exe(_ggTryDebugHook+0x13a34) [0x00000001401D05F4]]
: [D:\dmsjk\replicat.exe(_ggTryDebugHook+0x12ad3) [0x00000001401CF693]]
: [D:\dmsjk\replicat.exe(ERCALLBACK+0x5ce90) [0x00000001400F2620]]
: [D:\dmsjk\replicat.exe(CommonLexerNewSSD+0xcf20) [0x00000001402B2F70]]
: [C:\Windows\system32\kernel32.dll(BaseThreadInitThunk+0xd) [0x000000007760F56D]]
: [C:\Windows\SYSTEM32\ntdll.dll(RtlUserThreadStart+0x21) [0x0000000077743281]]
2019-01-11 17:56:34 ERROR OGG-01296 Error mapping from SOURCE.AA to SOURCE.AA.
***********************************************************************
* ** Run Time Statistics ** *
***********************************************************************
Reading ./dirdat/lq015172, current RBA 475427432, 0 records
Report at 2019-01-11 17:56:34 (activity since 2019-01-11 17:56:32)
From Table SOURCE.AA toSOURCE.AA:
# inserts: 0
# updates: 0
# deletes: 0
# discards: 1
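4. Logdump showed the first column with a length of 10, but the table structure shows 20.
On further inquiry: the source OGG was configured with DDL replication, and the primary key length was changed directly in the source database from 10 to 20. On the target, the replicat had failed for some particular reason, so the DDL was not replicated, and an operator ran the DDL manually there. Some time after that change, the OGG administrator found that the replicat process had gone down. To verify this sequence, I created an otherwise identical table with a primary key length of 10 (the original table contains data, so its column length cannot be reduced).
With debug mode enabled again, the SQL statements were now printed.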
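The root cause in step 4 is a length mismatch between the column definition recorded with the trail data and the definition the target table now declares. Below is a minimal sketch of that comparison. It assumes the column definitions have already been collected from both sides into plain dictionaries: the trail side from Logdump or DEFGEN output, and the target side from a dictionary view such as ALL_TAB_COLUMNS. The function name and data layout are hypothetical.

```python
from typing import Dict, List, Tuple

ColumnDef = Tuple[str, int]  # (data type, length or precision)


def diff_columns(trail_meta: Dict[str, ColumnDef],
                 target: Dict[str, ColumnDef]) -> List[str]:
    """Report columns whose type or length differs between trail metadata and target."""
    problems: List[str] = []
    for col, (t_type, t_len) in trail_meta.items():
        if col not in target:
            problems.append(f"{col}: missing on target")
            continue
        g_type, g_len = target[col]
        if (t_type, t_len) != (g_type, g_len):
            problems.append(f"{col}: trail {t_type}({t_len}) vs target {g_type}({g_len})")
    return problems


if __name__ == "__main__":
    # Values from this case: trail data recorded DID with length 10, target now NUMBER(20).
    trail = {"DID": ("NUMBER", 10), "DCCODE": ("VARCHAR2", 50)}
    target = {"DID": ("NUMBER", 20), "DCCODE": ("VARCHAR2", 50)}
    print(diff_columns(trail, target))
```

This prints `['DID: trail NUMBER(10) vs target NUMBER(20)']`. That is the discrepancy that step 2's comparison of the two live tables could not reveal, because by then both sides had already been altered to 20.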
5. When reproducing this scenario, the discard output on Linux differed considerably from that on Windows.
Discard file on Linux:
Current time: 2019-01-14 17:00:19
Discarded record from action ABEND on error 0
Aborting transaction on ./dirdat/lq beginning at seqno 15172 rba 475427432
error at seqno 15172 rba 475427432
Problem replicating DMS.T_SH_DISCOUNT to TARGET.T_SH_DISCOUNT
Mapping problem with insert record (source format)...
*
DID = ---> this is empty
DCODE = 5582
OTYPE = <Raw Data>
000000: c5 e4 bc fe c6 d5 cd a8 b6 a9 b5 a5 |............ |
Discard file on Windows:
Aborting transaction on ./dirdat/lq beginning at seqno 15172 rba 475244077
error at seqno 15172 rba 475427432
Problem replicating SOURCE.AA toTARGET.AA
Mapping problem with insert record (source format)...
*
DID = <Raw Data> ------------------> this primary key is numeric, yet it is not displayed correctly and shows up as <Raw Data>
000000: d5 11
DCODE = 0001
ORTYPE = <Raw Data>
000000: c5 e4 bc fe c6 d5 cd a8 b6 a9 b5 a5 |............ |
*
Process Abending : 2019-01-11 17:34:31
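The two discard files are easier to compare column by column once the record dump is parsed. Below is a minimal sketch that assumes the layout shown above: `COLUMN = value` lines, where `<Raw Data>` is followed by an offset-prefixed hex line. The function is hypothetical, not a GoldenGate tool. It makes the platform difference explicit: Linux reports DID as empty, while Windows reports it as raw bytes.

```python
import re
from typing import Dict, List, Optional

COL_RE = re.compile(r"^(\w+)\s*=\s*(.*)$")                           # "DID = <Raw Data>"
HEX_RE = re.compile(r"^[0-9a-fA-F]{6}:\s+((?:[0-9a-fA-F]{2}\s?)+)")  # "000000: d5 11 ..."


def parse_record(lines: List[str]) -> Dict[str, str]:
    """Map each column to its printed value, or to 'raw:<hex>' for <Raw Data> dumps."""
    cols: Dict[str, str] = {}
    current: Optional[str] = None
    for line in lines:
        line = line.strip()
        m = COL_RE.match(line)
        if m:
            current = m.group(1)
            cols[current] = m.group(2).strip()
            continue
        h = HEX_RE.match(line)
        if h and current is not None:
            # Replace "<Raw Data>" with the concatenated hex bytes (ASCII column ignored).
            cols[current] = "raw:" + "".join(h.group(1).split())
    return cols


if __name__ == "__main__":
    windows = ["DID = <Raw Data>", "000000: d5 11", "DCODE = 0001",
               "ORTYPE = <Raw Data>",
               "000000: c5 e4 bc fe c6 d5 cd a8 b6 a9 b5 a5 |............ |"]
    linux = ["DID =", "DCODE = 5582"]
    print(parse_record(windows))
    print(parse_record(linux))
```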
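【Summary】
1. When a table structure is altered, the target must finish applying all previously lagging data before the DDL is changed there.
2. When DDL replication is configured, source and target are synchronized automatically, and no manual intervention is needed.
3. Analyze the report file and the discard file closely to pick up clues.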
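Lesson 1 comes down to a position check before any manual DDL on the target. Below is a minimal sketch. It assumes you know the trail position (seqno, RBA) at which the source DDL was written, and that you can read the replicat's current read position, for example from the checkpoint reported by GGSCI `INFO REPLICAT`. The names and the example positions are hypothetical.

```python
from typing import NamedTuple


class TrailPos(NamedTuple):
    """A trail position: sequence number first, then byte offset (RBA)."""
    seqno: int
    rba: int


def safe_to_apply_ddl(replicat_pos: TrailPos, ddl_pos: TrailPos) -> bool:
    """True once the replicat has read up to where the source DDL was written,
    i.e. all DML written before the DDL has been processed.
    NamedTuple comparison is lexicographic: seqno first, then rba."""
    return replicat_pos >= ddl_pos


if __name__ == "__main__":
    ddl = TrailPos(15172, 475427432)                           # hypothetical DDL position
    print(safe_to_apply_ddl(TrailPos(15172, 475244077), ddl))  # False: replicat still behind
    print(safe_to_apply_ddl(TrailPos(15173, 0), ddl))          # True: already past the DDL
```

Comparing (seqno, rba) as a tuple avoids a common mistake: comparing RBAs alone across different trail files.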