oracle dblink 2pc+index split造成死锁

原创 badguy001 2022-06-26

883

问题

有一套生产环境，在上海部有一个linux 12.2.0.1 oracle数据库（shdb），深圳有一个linux 12.2.0.1 oracle数据库（szdb），业务跑在shdb这边，shdb这边的数据会有shdb上的存储过程通过dblink同步到szdb，业务操作跟数据同步都会往同一个按月分区的日志表插入数据。

某天，客户调整了szdb那边的网络，调整了后shdb与szdb之间的网络变得非常不稳定，一个连接经常网络层面数据包发出去之后，对端接受不到数据包。

由于业务操作过程中是不会使用dblink的，按理来说业务操作是不会收到影响，然而实际情况是业务有时候会在一段时间内中断，然后又自动恢复了。通过查看业务中断时间内的应用日志，有大量的ORA-01591: lock held by in-doubt distributed transaction XXX.XX.XX报错，而查询业务中断时间内shdb上的活动会话情况，更是发现了许多用户会话 --buffer busy wait--> 用户会话1 --enq: TX - contention--> RECO进程 --buffer busy wait--> 用户会话1 死锁。在DUMP HANGANALYZE及systemstate之后，将阻塞链上的所有用户会话杀掉，业务恢复正常。

初步分析

首先分析阻塞链，处于等待事件buffer busy wait中的用户会话及RECO进程都是在等待同一个日志表上一索引上的同一个块，而用户会话1则在等待RECO进程的事物锁。

用户会话1调用堆栈：ksedsts()+465<-ksdxfstk()+32<-ksdxcb()+1927<-sspuser()+112<-__sighandler()<-semtimedop()+10<-skgpwwait()+160<-ksliwat()+2022<-kslwaitctx()+163<-ksqcmi()+2848<-ksqgtlctx()+3501<-ksqgelctx()+557<-ktuGetTxForXid()+131<-ktcwit1()+5671<-ktuGetCommitTimes()+1133<-ktbgcl1()+6391<-ktbgfi()+4602<-kdiins0()+1711<-kdiinsp()+91<-kauxsin()+1784<-qesltcLoadIndexList()+922<-qesltcLoadIndexes()+55<-qerltcSimpleSingleInsRowCBK()+70<-qerltcSingleRowLoad()+279<-qerltcFetch()+380<-insexe()+682<-opiexe()+5632<-kpoal8()+2118<-opiodr()，sql语句为inert into 日志表 values (c1,c2...)，可以看出用户会话1往索引块上插入索引键数据，但不知何故，出现了enq: TX - contention等待。

RECO进程调用堆栈：ksedsts()+465<-ksdxfstk()+32<-ksdxcb()+1927<-sspuser()+112<-__sighandler()<-semtimedop()+10<-skgpwwait()+160<-ksliwat()+2022<-kslwaitctx()+163<-kcbzwb()+1548<-kcbget()+13605<-ktbxchg()+153<-kdifind()+706<-kdifind4undo()+67<-kdiulk()+9380<-kcoubk()+366<-ktundo()+1464<-kturCurrBackoutOneChg()+1442<-kturab()+1455<-ktdabt()+434<-k2lrab()+58<-k2vrec()+583<-k2vdrv()+7011<-ksbabs()+771<-k2vabs()+25<-ksbrdp()+1045<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+250<-ssthrdmain()+265<-main()，没有在执行sql语句，结合网络情况，基本可以确定，RECO进程在回滚因通过dblink发起分布式事物的进程异常终止而遗留的事物。根据systemstatedump的数据，RECO进程需要pin主的索引块上正好有用户会话1等待的事物。

通过在网上查找资料，找到https://nigelnoble.wordpress.com/2013/07/12/enq-tx-contention-on-select-with-a-large-buffer-cache-and-2pc/，文中说到，当dblink分布式事物进行两阶段提交的时候，在commit执行后进入prepare阶段，如果查询了被事物更新了的表，可能会造成查询会话阻塞在enq: TX - contention等待上。而在https://docwiki.embarcadero.com/DBOptimizer/en/Oracle:Enqueues#enq:_TX_-_contention上提到在索引分裂的时候也有可能造成enq: TX - contention等待。

根据现有的资料跟对条用堆栈的分析，通过猜测并且多次实验后，最终复现了上面的问题。

复现实验

环境：本地db，11.2.0.4.0，linux，dblink：lnk_test连接到远端db。远端db，12.1.0.2.0 linux。

创建测试表：

--本地db建表
DROP table scott.TEMP_20220625 purge;
CREATE TABLE scott.temp_20220625  (id varchar2(1000), idx varchar2(1000)) ;
CREATE INDEX scott.idx_20220625 ON scott.TEMP_20220625 (idx);
--远端db建表
create table system.temp_20220625 (id varchar2(1));
复制

本地db生成分布式事物：

--本地db
insert into scott.temp_20220625 (id,idx)values('a',lpad('1',1000,'0'));
insert into scott.temp_20220625 (id,idx)values('a',lpad('2',1000,'0'));
insert into scott.temp_20220625 (id,idx)values('a',lpad('2',1000,'0'));
insert into scott.temp_20220625 (id,idx)values('a',lpad('2',1000,'0'));
insert into scott.temp_20220625 (id,idx)values('a',lpad('2',1000,'0'));
insert into scott.temp_20220625 (id,idx)values('a',lpad('8',1000,'0'));
insert into scott.temp_20220625 (id,idx)values('a',lpad('9',1000,'0'));
insert into system.temp_20220625@LNK_TEST values ('d');
--不要提交复制

丢弃本地db与远端db的网络数据包，并杀掉dblink连接远端db而在远程db生成的会话：

# 远端db
iptables -t filter -I INPUT -p tcp --dport 1521 -j DROP
iptables -t filter -I OUTPUT -p tcp --sport 1521 -j DROP
# 找到对应的会话并kill掉复制

提交本地db生成分布式事物，此时提交会卡住，而查询scott.temp_20220625表会进入enq: TX - contention等待，插入数据如果没有造成索引分裂，插入语句也能正常执行，否则也会进入enq: TX - contention等待。

本地db开10个会话，执行插入语句，使得scott.temp_20220625表上的索引发生分裂：

--本地db开10个会话执行下面的sql，第一个执行的会话进入enq: TX - contention等待，后面的会话进入buffer busy waits等待
insert into scott.temp_20220625 (id,idx)values('a',lpad('5',1000,'0'));复制

挂起enq: TX - contention等待的会话：

--本地db找到enq: TX - contention等待的会话并挂起此会话
select * from v$process where addr = (select paddr from v$session where sid = 1234);
oradebug setorapid orapid;
oradebug suspend复制

恢复本地db与远端db的通信：

#远端db
iptables -t filter -F复制

等待enq: TX - contention会话会在数分钟内报错ORA-01591: lock held by in-doubt distributed transaction XXX.XX.XX

报错后恢复挂起的会话

--本地db
oradebug resume复制

此时查询等待链，大概率会出现上面问题中描述的死锁情况，如果没有出现，可以多尝试几次。

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者

oracle dblink 2pc+index split造成死锁

问题

初步分析

复现实验

评论

相关阅读