Problem Description
A customer's database ran into a severe performance problem: the application experienced widespread timeouts and a surge in sessions. Repeated attempts to kill sessions could not fully resolve the issue, and the system only returned to normal after a restart.
Expert Analysis
From the AWR report covering the failure window, the database's main waits at the time of the problem were Global transaction acquire instance locks and enq: TX - row lock contention:
Event                                      Waits    Time(s)  Avg wait (ms)  % DB time  Wait Class
Global transaction acquire instance locks  5,342    5,343    1000           74.09      Configuration
enq: TX - row lock contention              5        1,437    287308         19.92      Application
DB CPU                                              331                     4.59
direct path read                           37,708   72       2              0.99       User I/O
log file sync                              7,817    12       2              0.16       Commit
The TX - row lock contention wait is very common, and it is easy to see how this wait event blocks the application. Global transaction acquire instance locks, however, is not a common wait; taken literally, it means a global transaction is trying to acquire instance locks. Measured by wait time, this event consumed nearly 75% of DB time.
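If a similar incident is caught while it is still in progress, the same two waits can also be checked in the live session views before anything is restarted. A minimal sketch, assuming access to GV$SESSION (the event names are taken from the AWR report above):

-- which sessions are currently stuck on the two waits, and who blocks them
select inst_id, sid, event, blocking_session, seconds_in_wait
  from gv$session
 where event in ('Global transaction acquire instance locks',
                 'enq: TX - row lock contention')
 order by inst_id, sid;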
Of course, the most severe wait in the Top 5 is not necessarily the root cause of the problem. Analyzing the ASH data for the problem period shows that the global transaction lock waits appeared first, and the row lock waits only started afterwards:
SQL> select to_char(sample_time, 'hh24miss'), session_id, event, blocking_session
  2    from dba_hist_active_sess_history
  3   where sample_time >= to_date('201810180652', 'yyyymmddhh24mi')
  4     and sample_time <= to_date('201810180700', 'yyyymmddhh24mi')
  5     and instance_number = 1
  6   order by 1, 2;

TO_CHA SESSION_ID EVENT                                      BLOCKING_SESSION
------ ---------- ------------------------------------------ ----------------
065204         69 Global transaction acquire instance locks
065214         69 Global transaction acquire instance locks
065224         69 Global transaction acquire instance locks
065224        535
065234         69 Global transaction acquire instance locks
065234        232
065234        535
065234        763 Global transaction acquire instance locks
065244         69 Global transaction acquire instance locks
065244        535 direct path read
065244        695 enq: TX - row lock contention                           763
065244        763 Global transaction acquire instance locks
065254         69 Global transaction acquire instance locks
065254        535
065254        695 enq: TX - row lock contention                           763
065254        763 Global transaction acquire instance locks
065304        136 Global transaction acquire instance locks
065304        695 enq: TX - row lock contention                           763
065304        763 Global transaction acquire instance locks
065314        136 Global transaction acquire instance locks
065314        695 enq: TX - row lock contention                           763
065314        763 Global transaction acquire instance locks
065324         69 Global transaction acquire instance locks
065324        136 Global transaction acquire instance locks
065324        695 enq: TX - row lock contention                           763
065324        763 Global transaction acquire instance locks
065334         69 Global transaction acquire instance locks
065334        136 Global transaction acquire instance locks
065334        695 enq: TX - row lock contention                           763
065344         69 Global transaction acquire instance locks
065344        136 Global transaction acquire instance locks
065344        434 enq: TX - row lock contention                           763
065344        695 enq: TX - row lock contention                           763
065354         69 Global transaction acquire instance locks
065354        136 Global transaction acquire instance locks
065354        434 enq: TX - row lock contention                           763
065354        695 enq: TX - row lock contention                           763
065404         69 Global transaction acquire instance locks
065404        136 Global transaction acquire instance locks
065404        434 enq: TX - row lock contention                           763
065404        695 enq: TX - row lock contention                           763
. . .
065944        302 enq: TX - row lock contention                           763
065944        336 enq: TX - row lock contention                           763
065944        434 enq: TX - row lock contention                           763
065944        695 enq: TX - row lock contention                           763
065954         71 Global transaction acquire instance locks
065954        302 enq: TX - row lock contention                           763
065954        336 enq: TX - row lock contention                           763
065954        434 enq: TX - row lock contention                           763
065954        695 enq: TX - row lock contention                           763

216 rows selected.
The first session to show a problem is session 763, which is waiting on Global transaction acquire instance locks and is not reported as being blocked by any other session. Row lock waits then begin to appear, and every session waiting on enq: TX - row lock contention is blocked by session 763.
Clearly, although it was the enq: TX - row lock contention waits that ultimately hung a large part of the business, the root cause of the problem is the Global transaction acquire instance locks wait.
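To double-check that conclusion, the same ASH data can be summarized by blocker. A minimal sketch over the same time window and instance as the query above (the window and instance number are simply the values from this case):

-- count how many row-lock wait samples each blocker is responsible for
select blocking_session, count(*) as samples
  from dba_hist_active_sess_history
 where sample_time >= to_date('201810180652', 'yyyymmddhh24mi')
   and sample_time <= to_date('201810180700', 'yyyymmddhh24mi')
   and instance_number = 1
   and event = 'enq: TX - row lock contention'
 group by blocking_session
 order by samples desc;

Here the output would be dominated by a single blocker, session 763.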
Almost at the same time, the alert log started to report a large number of errors:
Thu Oct 18 06:53:33 2018
opiodr aborting process unknown ospid (26428) as a result of ORA-24756
Thu Oct 18 06:53:36 2018
Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_pmon_30688.trc:
ORA-24756: transaction does not exist
Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_pmon_30688.trc:
ORA-24756: transaction does not exist
Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_pmon_30688.trc:
ORA-24756: transaction does not exist
Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_pmon_30688.trc:
ORA-24756: transaction does not exist
The wait first appeared in the database at 06:52, and given that the wait goes through a timeout before the error is raised, the waits in the database are closely related to the ORA-24756 errors in the alert log.
Searching MOS with ORA-24756 as the keyword turns up an article that matches the current symptoms very closely: PMON cleanup fails with ORA-24756: transaction does not exist (Doc ID 2299927.1).
The error described in the document matches the current problem, and the database version is also close. Oracle attributes the error to the PMON process getting stuck while cleaning up a transaction that no longer exists. Oracle lists several similar bugs, but their status is either "could not reproduce" or "suspended", so there is no definitive conclusion for this kind of problem:
Bug 20676168 - PMON GOT CONTINUOUS ORA-24756 DURING SHUTDOWN                  <<< Closed, Could Not Reproduce
Bug 16317766 - EXADATA : ORA-24756: TRANSACTION DOES NOT EXIST                <<< Suspended, Req'd Info not Avail
Bug 9757979  - PMON CLEANUP FAILS WITH ORA-24756: TRANSACTION DOES NOT EXIST  <<< Closed, Could Not Reproduce
Both the Global transaction acquire instance locks wait and the recurring ORA-24756 errors in the alert log point to Oracle's lock handling and global transaction processing.
Starting with 11g, global transactions in RAC are handled automatically by the GTXn background processes. Whether these processes start is controlled by the initialization parameter global_txn_processes, which defaults to 1, meaning the GTXn processes are started by default when the database starts.
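The current setting can be confirmed directly from V$PARAMETER; a minimal sketch (SHOW PARAMETER global_txn_processes gives the same information):

-- default is 1, i.e. GTXn is expected to start together with the instance
select name, value, isdefault
  from v$parameter
 where name = 'global_txn_processes';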
We asked the customer whether the application or the database had been changed recently. The application had not changed at all, and no similar problem had occurred before. However, shortly before the incident a switchover drill had been carried out between the primary database and its Data Guard standby, and the failure appeared as soon as normal business resumed after the drill. The switchover drill was therefore the prime suspect.
A detailed review of the alert log revealed that when the database was switched back to the primary role after the Data Guard drill, the GTXn processes were not started:
Thu Oct 18 02:36:18 2018
alter database commit to switchover to physical standby with session shutdown
ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY [Process Id: 4366] (orcl1)
. . .
Switchover: Primary controlfile converted to standby controlfile succesfully.
Switchover: Complete - Database shutdown required
Completed: alter database commit to switchover to physical standby with session shutdown
Thu Oct 18 02:45:27 2018
. . .
RFS[8]: Assigned to RFS process 3216
RFS[8]: Identified database type as 'physical standby': Client is ARCH pid 11990
Thu Oct 18 03:11:13 2018
alter database open
AUDIT_TRAIL initialization parameter is changed to OS, as DB is NOT compatible for database opened with read-only access
This instance was first to open
Beginning standby crash recovery.
Serial Media Recovery started
. . .
Thu Oct 18 03:11:13 2018
SMON: enabling cache recovery
Dictionary check beginning
Dictionary check complete
Database Characterset is ZHS16GBK
No Resource Manager plan active
Starting background process GTX0
Thu Oct 18 03:11:14 2018
GTX0 started with pid=51, OS id=5041
replication_dependency_tracking turned off (no async multimaster replication found)
Physical standby database opened for read only access.
Completed: alter database open
. . .
Thu Oct 18 04:57:19 2018
alter database commit to switchover to primary with session shutdown
ALTER DATABASE SWITCHOVER TO PRIMARY (orcl1)
Maximum wait for role transition is 15 minutes.
Switchover: Media recovery is still active
Role Change: Canceling MRP - no more redo to apply
. . .
Switchover: Complete - Database mounted as primary
Completed: alter database commit to switchover to primary with session shutdown
Thu Oct 18 04:57:39 2018
alter database open
This instance was first to open
Picked broadcast on commit scheme to generate SCNs
. . .
Thu Oct 18 04:57:44 2018
QMNC started with pid=41, OS id=22585
LOGSTDBY: Validating controlfile with logical metadata
LOGSTDBY: Validation complete
Completed: alter database open
. . .
Thu Oct 18 06:53:33 2018
opiodr aborting process unknown ospid (26428) as a result of ORA-24756
Thu Oct 18 06:53:36 2018
Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_pmon_30688.trc:
ORA-24756: transaction does not exist
Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_pmon_30688.trc:
. . .
Starting background process GTX0
Thu Oct 18 09:51:20 2018
GTX0 started with pid=36, OS id=32470
Starting background process RCBG
Thu Oct 18 09:51:20 2018
The excerpted alert log above shows that at 02:36 the primary executed the SWITCHOVER operation and the drill began.
At 03:11 the database was opened; the alert log notes that it was now a physical standby opened in read-only mode, and the lines that follow ("Starting background process GTX0" / "GTX0 started with pid=51") show the GTX0 background process being started at this point.
At 04:57 the database performed another SWITCHOVER. This time the database was not fully shut down; it was converted directly to MOUNT mode, and the OPEN completed at 04:57:44. In this part of the alert log, however, there is no sign of the GTXn processes being started.
The next time GTX0 appears in the alert log is at 09:51, when instance 1 was restarted after the failure.
At this point the problem is clear: when the database was switched from physical standby back to primary, the GTXn processes were not started. As a result, Oracle could not handle the lock requests of distributed (global) transactions, so foreground sessions began to wait on Global transaction acquire instance locks, and PMON could not clean up sessions and transactions properly.
Because instance 2 was shut down and restarted during the SWITCHOVER, its GTXn processes were started normally, which is why the failure no longer occurred after the application was moved to instance 2.
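This difference between the two instances could have been confirmed at the time by checking which instances actually had GTXn processes running. A minimal sketch, assuming access to GV$PROCESS:

-- list the GTXn background processes on each RAC instance
select inst_id, pname, spid
  from gv$process
 where pname like 'GTX%'
 order by inst_id, pname;

Before the restart at 09:51, only instance 2 would have returned any rows.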
To avoid this kind of problem: first, after a Data Guard switchover drill, once the database has been switched back to the PRIMARY role, restart it once more to make sure all the necessary background processes are started; second, if the application really needs to modify tables in another database, it should do so through a separate database connection rather than updating the remote tables directly over a database link.
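Since the root cause involves distributed transactions, one further check that can reasonably be added to the post-switchover verification is a look for in-doubt distributed transactions before business is resumed. A minimal sketch, assuming DBA privileges:

-- any rows returned are in-doubt distributed transactions awaiting resolution
select local_tran_id, global_tran_id, state, fail_time
  from dba_2pc_pending
 order by fail_time;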