1 在巡检的过程中,多次发现UNDOTBS12表空间使用率为100%,多次查询数据库的alert日志,未发现数据库有undo表空间告警的相关信息。作为一个从业多年的DBA,已经认为这有可能有问题了,需要持续关注。
TABLESPACE_NAME TOT_SIZE TOT_FREE PCT_FREE MAX_FREE CHUNKS_FREE
------------------------------ ---------------- ---------------- ---------------- ---------------- ----------------
UNDOTBS11 184,320 134,760 73 3,968 1,697
UNDOTBS12 184,320 42 0 8 7
2 查看那个用户在使用undo数据块,发现只使用了一个数据库块,当前会话没有影响undo表空间资源耗尽。
set linesize 300
col username for a20
col machine for a20
col PROGRAM for a20
col name for a30
SELECT S.SID, S.USERNAME, S.PROGRAM, S.MACHINE, U.NAME, T.USED_UBLK
FROM V$TRANSACTION T, V$ROLLSTAT R, V$ROLLNAME U, V$SESSION S
WHERE S.TADDR = T.ADDR
AND T.XIDUSN = R.USN
AND R.USN = U.USN
ORDER BY S.USERNAME;
SID USERNAME PROGRAM MACHINE NAME USED_UBLK
---------- -------------------- -------------------- -------------------- ------------------------------ ---------
183 TEST_SN JDBC Thin Client yxnewapp6 _SYSSMU1787_2063356590$ 1
3 查看undo表空间UNEXPIRED、EXPIRED、ACTIVE 使用情况,发现过期的Undo数据块有177G,但undo表空间使用率为什么还是100%。
set linesize 300
col TABLESPACE_NAME for a20
SELECT TABLESPACE_NAME,STATUS,TRUNC(SUM(BLOCKS) * 8 / 1024) AS "Size M",COUNT(*) Undo_Extent_Num FROM DBA_UNDO_EXTENTS
where TABLESPACE_NAME='UNDOTBS12'
GROUP BY TABLESPACE_NAME, STATUS order by TABLESPACE_NAME,STATUS;
TABLESPACE_NAME STATUS Size M UNDO_EXTENT_NUM
-------------------- --------------------------- ---------- ---------------
UNDOTBS12 EXPIRED 177223 44795
UNDOTBS12 UNEXPIRED 6275 942
4 查看是否由于undo自动调试,引起的undo不足,经检查,没有自动调试,故认为是正常的
SELECT BEGIN_TIME, END_TIME, UNDOTSN, UNDOBLKS, TXNCOUNT,TUNED_UNDORETENTION,
MAXCONCURRENCY AS "MAXCON" FROM V$UNDOSTAT;
BEGIN_TIME END_TIME UNDOTSN UNDOBLKS TXNCOUNT TUNED_UNDORETENTION MAXCON
------------------ ------------------ ---------- ---------- ---------- ------------------- ----------
21-JAN-21 30-NOV-21 8 3418760733 2073276409 0 821
5 查询哪些undo段被用于回滚,查询结果为空
select b.name "UNDO Segment Name", b.inst# "Instance ID", b.status$ STATUS, a.ktuxesiz "UNDO Blocks", a.ktuxeusn, a.ktuxeslt xid_slot, a.ktuxesqn xid_seq
from x$ktuxe a, undo$ b
where a.ktuxesta = 'ACTIVE' and a.ktuxecfl like '%DEAD%' and a.ktuxeusn = b.us#;
6 查看数据库的版本,为11.2.0.4.171017 (26392168),故认为不可能是由于PSU引起的,
由于查看Undo相关参数设置是否正常,经检查,认为都是正常的。
查看Undo相关的隐含参数
NAME VALUE DESCRIB
------------------------------ -------------------- ------------------------------------------------------------
_flush_undo_after_tx_recovery TRUE if TRUE, flush undo buffers after TX recovery
_gc_undo_affinity FALSE if TRUE, enable dynamic undo affinity
_gc_undo_block_disk_reads TRUE if TRUE, enable undo block disk reads
_undo_block_compression TRUE enable undo block compression
undo_management AUTO instance runs in SMU mode if TRUE, else in RBU mode
undo_tablespace UNDOTBS12 use/switch undo tablespace
_collect_undo_stats TRUE Collect Statistics v$undostat
_undo_debug_mode 0 debug flag for undo related operations
_verify_undo_quota FALSE TRUE - verify consistency of undo quota statistics
_undo_autotune FALSE enable auto tuning of undo_retention
_highthreshold_undoretention 4294967294 high threshold undo_retention in seconds
undo_retention 7200 undo retention in seconds
_undo_debug_usage 0 invoke undo usage functions for testing
_disable_undo_tablespace_alerts FALSE disable tablespace alerts for UNDO tablespaces
_smon_undo_seg_rescan_limit 10 limit of SMON continous undo segments re-scan
_optimizer_undo_cost_change 11.2.0.4 optimizer undo cost change
_optimizer_undo_changes FALSE undo changes to query optimizer
_enable_default_undo_threshold TRUE Enable Default Tablespace Utilization Threshold for TEMPORARY Tablespaces
7 查看数据文件是否因为关闭自动扩展引起的undo调试时间变化,根据以上参数及设置,认为没有
SQL> set lines 300
SQL> col file_name for a60
SQL> select file_id,file_name,tablespace_name,bytes/1024/1024,MAXBYTES/1024/1024,autoextensible from dba_data_files
where tablespace_name in ('UNDOTBS12') order by 1;
FILE_ID FILE_NAME TABLESPACE_NAME BYTES/1024/1024 MAXBYTES/1024/1024 AUTOEXTEN
---------- ------------------------------------------------------------ ------------------------------ --------------- ------------------ ---------
218 +DATA_S/test/datafile/undotbs12.269.1022257191 UNDOTBS12 30720 0 NO
220 +DATA_S/test/datafile/undotbs12.268.1022257243 UNDOTBS12 30720 0 NO
271 +DATAVG/test/datafile/undotbs12.1972.1027500943 UNDOTBS12 30720 0 NO
272 +DATAVG/test/datafile/undotbs12.1973.1027500985 UNDOTBS12 30720 0 NO
344 +DATAVG/test/datafile/undotbs12.2056.1087226619 UNDOTBS12 30720 0 NO
345 +DATAVG/test/datafile/undotbs12.2055.1087226655 UNDOTBS12 30720 0 NO
6 rows selected.
8 查看undo表空间是否为NOGUARANTEE
SQL> col tablespace_name for a20
SQL> select tablespace_name,block_size,extent_management,segment_space_management,contents,
retention from dba_tablespaces where tablespace_name='UNDOTBS12';
TABLESPACE_NAME BLOCK_SIZE EXTENT_MANAGEMENT SEGMENT_SPACE_MANA CONTENTS RETENTION
-------------------- ---------- ------------------------------ ------------------ --------------------------- ---------------------------------
UNDOTBS12 8192 LOCAL MANUAL UNDO NOGUARANTEE
9 通过此视图,验证Undo是否有争用,发现undo header等待事件过长。
SQL> select * from v$waitstat;
CLASS COUNT TIME
------------------------------------------------------ ---------- ----------
data block 1054253604 977861564
sort block 0 0
save undo block 0 0
segment header 1256109 339064
save undo header 0 0
free list 0 0
extent map 3536 196
1st level bmb 15523793 1738257
2nd level bmb 3159275 17816519
3rd level bmb 847 102
bitmap block 0 0
bitmap index block 57 0
file header block 1630526 11230885
unused 0 0
system undo header 1225 183
system undo block 23 0
undo header 60728323 13951843
undo block 14025728 760038
18 rows selected.
10 查看回滚段的统计信息,通过统计信息,确定是否有异常。XACTS 为0,代表事务已经提交。不为0,则代表活动的事务。
Ratio的百分比如果过高,则代表undo数据头有争用。
通过此统计信息,认为数据库在undo使用上没有问题的
set linesize 300
col name for a25
col status for a15
SELECT n.name, s.extents, s.rssize, s.optsize, s.hwmsize, s.xacts,s.waits,s.gets,100*(s.waits/s.gets) "Ratio%", s.status FROM v$rollname n, v$rollstat s WHERE n.usn = s.usn;
NAME EXTENTS RSSIZE OPTSIZE HWMSIZE XACTS WAITS GETS Ratio% STATUS
------------------------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------------
SYSTEM 7 450560 450560 0 1072 186417 .57505485 ONLINE
_SYSSMU1291_1397830725$ 33 260169728 3860979712 0 50354 287278824 .017527919 ONLINE
_SYSSMU1292_252107360$ 181 199942144 2438062080 0 47865 280260848 .017078732 ONLINE
_SYSSMU1293_1495173289$ 51 190963712 3893485568 0 45848 295071129 .015537948 ONLINE
_SYSSMU1294_3251668230$ 58 214032384 2281824256 0 41967 260615769 .016103016 ONLINE
_SYSSMU1295_4122107009$ 34 225566720 4284604416 0 39520 255348602 .015476881 ONLINE
_SYSSMU1296_3699510013$ 39 178380800 4231127040 0 36448 270844107 .013457188 ONLINE
_SYSSMU1297_3481746686$ 45 228712448 1849483264 0 35460 250105401 .014178022 ONLINE
...
_SYSSMU2234_569768215$ 54 277995520 328327168 0 3029 34127401 .008875566 ONLINE
_SYSSMU2235_3141883639$ 37 183623680 316137472 0 3370 35601358 .009465931 ONLINE
823 rows selected.
11 以下查询用于通过查询,来确定将undo设置多大比较合适,经验证,认为将undo设置为7G,就满足目前数据库的使用。
SQL> SELECT (UR * (UPS * DBS)) AS "Bytes"
2 FROM (SELECT value AS UR FROM v$parameter WHERE name = 'undo_retention'),
3 (SELECT undoblks/((end_time-begin_time)*86400) AS UPS
4 FROM v$undostat
5 WHERE undoblks = (SELECT MAX(undoblks) FROM v$undostat)),
6 (SELECT block_size AS DBS
7 FROM dba_tablespaces
8 WHERE tablespace_name = (SELECT UPPER(value) FROM v$parameter WHERE name = 'undo_tablespace'));
Bytes
----------
7558168221
12 仔细考虑Oracle的架构,回收资源是mmon进程,既然alert日志没有相关信息,那么查看mmon进程是否有有用的信息, 果然发现undo 服务处于停滞状态,这估计就是导致undo表空间使用率降不下来的根本原因
*** 2021-11-30 16:49:56.094
Unable to schedule a MMON slave at: Auto DBFUS Main Slave action has been temporarily suspended
- Slave action had prior policy violations.
Unknown return code: 101
*** 2021-12-01 00:50:16.197
Unable to schedule a MMON slave at: Auto DBFUS Main
Slave action has been temporarily suspended
- Slave action had prior policy violations.
Unknown return code: 101
查看相关相关参数, _swrf_mmon_dbfus为true,是开启的,为正常。
NAME VALUE DESCRIB
------------------------------ -------------------- ------------------------------------------------------------
_swrf_mmon_dbfus TRUE Enable/disable SWRF MMON DB Feature Usage
_swrf_test_dbfus FALSE Enable/disable DB Feature Usage Testing
13 通过如下信息,可以确定在2020年7月16日后,undo就没有正常的工作过
col version for a20
col description for a30
col FIRST for a20
col LAST for a20
select a.name,a.version,a.detected_usages,to_char(a.first_usage_date,'yyyy-mm-dd hh24:mi:ss') "FIRST",
to_char(a.last_sample_date,'yyyy-mm-dd hh24:mi:ss') "LAST",a.description from dba_feature_usage_statistics a
where a.detected_usages >0 and a.name like '%Undo%' order by a.name,a.version;
NAME VERSION DETECTED_USAGES FIRST LAST DESCRIPTION
------------------------------ -------------------- --------------- -------------------- -------------------- ------------------------------
Automatic Undo Management 11.2.0.4.0 95 2018-09-26 11:29:52 2020-07-16 19:19:24 Oracle automatically manages u
ndo data using an UNDO tablesp
14 触发UNDO自动回收的机制为:一为undo表空间100%,二为调整undo的时间,但此系统已经有176G的undo数据块为
过期,故认为通过调整undo保留时间,不一定有效果。
但undo达到100%,按理会触发一次,按照目前系统的情况,认为数据库没有进行回收,故准备通过修改undo表空间阈值,
来触发数据库自动回收undo。
系统Undo表空间阈值默认值
SQL> SELECT OBJECT_NAME,warning_value, critical_value, status FROM dba_thresholds WHERE metrics_name = 'Tablespace Space Usage' AND object_name like 'UNDO%';
OBJECT_NAME WARNING_VALUE CRITICAL_VALUE STATUS
-------------------- ------------------------------ -------------------- ---------------------
UNDOTBS1 <SYSTEM-GENERATED THRESHOLD> 0 VALID
UNDOTBS11 <SYSTEM-GENERATED THRESHOLD> 0 VALID
UNDOTBS2 <SYSTEM-GENERATED THRESHOLD> 0 VALID
UNDOTBS12 <SYSTEM-GENERATED THRESHOLD> 0 VALID
执行如下语句,设置UNDOTBS12表空间阈值为50%收集一次,70%收集一次,
SQL> begin
2 DBMS_SERVER_ALERT.SET_THRESHOLD(
3 metrics_id => dbms_server_alert.tablespace_pct_full,
4 warning_operator => dbms_server_alert.operator_ge,
5 warning_value => '50',
6 critical_operator => dbms_server_alert.operator_ge,
7 critical_value => '70',
8 observation_period => 1,
9 consecutive_occurrences => 1,
10 instance_name => NULL,
11 object_type => dbms_server_alert.object_type_tablespace,
12 object_name => 'UNDOTBS12');
13 end;
14 /
OBJECT_NAME WARNING_VALUE CRITICAL_VALUE STATUS
-------------------- ------------------------------ -------------------- ---------------------
UNDOTBS1 <SYSTEM-GENERATED THRESHOLD> 0 VALID
UNDOTBS11 <SYSTEM-GENERATED THRESHOLD> 0 VALID
UNDOTBS2 <SYSTEM-GENERATED THRESHOLD> 0 VALID
UNDOTBS12 50 70 VALID
经过调整,发现数据库已经缓慢的收集undo,但Unable to schedule a MMON slave at: Auto DBFUS Main
Slave action has been temporarily suspended 还是每过几个小时触发一次。
15 DBFUS不正常,根据多年经验,认为有可能同数据库实例相关视图的统计信息有关,于是查看相关统计信息,
发现从2020年7月20日后就没有进行更新过,而undo统计信息是从2020-7-16日为最后一次。
故认为有可能是数据库实例相关视图统计信息过期,性能下降,导致undo自动回收工作异常。
15.1 查看数据库实例相关统计信息,发现最后一次为2020-7-20日。
OWNER TABLE_NAME NUM_ROWS LAST_ANALYZED
------------------------------ ------------------------------ ---------- -------------
SYS X$KSPPO 131 2020/7/20
SYS X$KSOLSSTAT 22 2020/7/20
SYS X$KSOLSFTS 566126 2020/7/20
SYS X$KSPVLD_VALUES 617 2020/7/20
SYS X$KSPSPFH 1 2020/7/20
SYS X$KSPPI 2914 2020/7/20
SYS X$KSPPSV2 2919 2020/7/20
SYS X$KSPPCV2 2919 2020/7/20
SYS X$KSPPSV 2914 2020/7/20
SYS X$KSPPCV 2914 2020/7/20
SYS X$KSPSPFILE 2932 2020/7/20
SYS X$KSUCPUSTAT 12 2020/7/20
SYS X$KSULL 1 2020/7/20
SYS X$KSUCF 10 2020/7/20
SYS X$KSUINSTSTAT 0 2020/7/20
SYS X$KSQST 320 2020/7/20
SYS X$KSQDN 1 2020/7/20
SYS X$KSQRS 24272 2020/7/20
SYS X$KSQEQTYP 210 2020/7/20
SYS X$KSUCLNDPCC 0 2020/7/20
15.2 手动收集相关统计信息,收集dbms_stats.gather_fixed_objects_stats统计信息,执行几天都没有结束,
对会话进行追踪,发现一直卡在X$KSOLSFTS表上,
查询此表的数据,使用select count(1) from X$KSOLSFTS 执行7天都没有结束,初步估计表的数据有可能有上百亿条数据,
故导致收集统计信息执行不成功。
SQL> exec dbms_stats.gather_table_stats('SYS', 'X$KEWRATTRNEW');
PL/SQL procedure successfully completed.
SQL> exec dbms_stats.gather_table_stats('SYS', 'X$KEWRTSQLPLAN');
PL/SQL procedure successfully completed.
SQL> exec dbms_stats.gather_table_stats('SYS', 'X$KEWRATTRNEW')
PL/SQL procedure successfully completed.
SQL> exec dbms_stats.gather_table_stats('SYS', 'X$KEWRSQLIDTAB');
PL/SQL procedure successfully completed.
SQL> exec dbms_stats.gather_fixed_objects_stats; --执行2天没有完成
15.3 经过观察,自从收集统计信息后,Unable to schedule a MMON slave at: Auto DBFUS Main
Slave action has been temporarily suspended 告警从每4小时
发生一次,变为到目前为止在没有发生过,故认为Unable to schedule a MMON slave at:
Auto DBFUS Main Slave action has been temporarily suspended 告警
确实是由于统计信息过期导致的。
15.4 自从收集统计信息后,发现undo表空间的使用率已经慢慢的降下来,目前剩余79G,使用率为57% 。
TABLESPACE_NAME TOT_SIZE TOT_FREE PCT_FREE MAX_FREE CHUNKS_FREE
------------------------------ ---------------- ---------------- ---------------- ---------------- ----------------
UNDOTBS12 184,320 79,552 43 2,343 6,799
15.5 查询Undo自动管理的JOB,发现在2021-12-12日再次执行过,证明通过收集统计信息,
Automatic Undo Management工作已经变为正常,虽然由于X$KSOLSFTS太大,
导致统计信息还有部分过期,但数据库已经缓慢恢复正常。
NAME VERSION DETECTED_USAGES FIRST LAST DESCRIPTION
------------------------------ -------------------- --------------- -------------------- -------------------- ------------------------------
Automatic Undo Management 11.2.0.4.0 97 2018-09-26 11:29:52 2021-12-12 19:17:17 Oracle automatically manages undo data using an UNDO tablespace.
16 本计划使用如下命令,针对Auto-DBFUS Action事件生成相关trace文件,找到明确的信息,
来证明是由于统计信息过期引起的异常,但自从收集统计信息后,DBFUS不在告警了,
故心里那个恨呀,为什么就手动执行了收集统计信息的操作呢,破坏了现场环境,
导致没有明确的信息来证明是由于统计信息过期引起的。
begin
dbms_monitor.serv_mod_act_trace_enable(service_name=>'SYS$BACKGROUND',
module_name=>'MMON_SLAVE',
action_name=>'Auto-DBFUS Action');
end;
/
begin
dbms_monitor.serv_mod_act_trace_disable(service_name=>'SYS$BACKGROUND',
module_name=>'MMON_SLAVE',
action_name=>'Auto-DBFUS Action');
end;
/
17 重启数据库,释放相关资源,收集统计信息,经过验证,undo表空间已经恢复正常。
重启后第3天,表空间的使用情况
SQL> select a.tablespace_name,sum(a.tots) Tot_Size, sum(a.sumb) Tot_Free, sum(a.sumb)*100/sum(a.tots) Pct_Free, sum(a.largest) Max_Free,sum(a.chunks) Chunks_Free from ( select tablespace_name,0 tots,sum(bytes/1024/1024) sumb, max(bytes/1024/1024) largest,count(*) chunks from dba_free_space a group by tablespace_name union select tablespace_name,sum(bytes/1024/1024) tots,0,0,0 from dba_data_files group by tablespace_name) a group by a.tablespace_name order by Pct_Free desc;
TABLESPACE_NAME TOT_SIZE TOT_FREE PCT_FREE MAX_FREE CHUNKS_FREE
------------------------------ ---------------- ---------------- ---------------- ---------------- ----------------
UNDOTBS11 184,320 172,153 93 3,968 1,686
UNDOTBS12 184,320 149,383 81 1,381 2,623
9 rows selected.