暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

一次特殊的ORA-04030故障处理

原创 Oracle 2023-03-28
158

      这是一则来自客户现场的真实案例,客户现场人员在通过PL/SQL Developer客户端工具创建索引时出现ORA-04030错误,导致索引创建失败。但是通过SQL*Plus重新执行创建索引语句却能够成功。为什么会出现这种错误和执行差异呢?请跟随我们的思路来一步步地了解这个案例的解析过程。

11.1 聚集数据的信息采集和分析

      首先,了解ORA-04030错误的原因。ORA-04030错误引起的原因大概有以下几种情况。

(1)是否有足够的可用内存?

查看操作系统内存使用情况。

+-topas_nmon--r=Resources--------Host=gisdata2-------Refresh=2 secs---10:03.50------------+

| Memory ---------------------------------------------------------------------------|

|     Physical PageSpace |    pages/sec In   Out | FileSystemCache      |

|% Used    87.7%   2.6% | to Paging Space  0.5  0.0 | (numperm) 4.6%      |

**|% Free    12.3%   97.4%**  | to File System  0.0  0.0 | Process  66.1%      |

|MB Used  55450.2MB  523.7MB | Page Scans    0.0    | System  17.0%      |

**|MB Free  7781.6MB 19956.3MB** | Page Cycles    0.0    | Free   12.3%      |

|Total(MB) 63231.9MB 20480.0MB | Page Steals    0.0    |      ------     |

|                 | Page Faults  2110.3    | Total  100.0%      |

|--------------------------------------------------------- | numclient 4.6%     |

|Min/Maxperm   3065MB( 5%) 9195MB( 15%) <--% of RAM    | maxclient 15.0%     |

|Min/Maxfree  3000  4000    Total Virtual  81.7GB   | User   67.7%      |

|Min/Maxpgahead  2   8  Accessed Virtual  50.8GB 62.2%| Pinned  19.8%     |

|                               | lruable pages  15692064.0|

|-----------------------------------------------------------------------------------|

复制

      查看当时出故障的服务器资源,可以看到剩余内存还有7.7GB左右,说明在操作系统层面还有足够多的可用内存。

(2)是否设置了Oracle的限制?

查看数据库中与PGA相关的设置。

SQL> show parameter PGA

NAME                 TYPE               VALUE

------------------------------------ ----------------- ------------------------------

**pga_aggregate_target         big integer           8400M**

SQL> select * from v$pgastat;

NAME                                  VALUE UNIT

------------------------------------------------------------- ---------- ------------

aggregate PGA target parameter                  8808038400 bytes

aggregate PGA auto target                     7288095744 bytes

global memory bound                        880803840 bytes

total PGA inuse                          720201728 bytes

total PGA allocated                        985562112 bytes

**maximum PGA allocated                      2692780032 bytes  ---**实际分配最大内存 **2568.03516MB**

total freeable PGA memory                     163905536 bytes

process count                              414

max processes count                           529

PGA memory freed back to OS                   2.4854E+11 bytes

total PGA used for auto workareas                 10177536 bytes

maximum PGA used for auto workareas                642154496 bytes

total PGA used for manual workareas                    0 bytes

maximum PGA used for manual workareas                537600 bytes

over allocation count                           0

bytes processed                         5.7321E+11 bytes

extra bytes read/written                     2125827072 bytes

cache hit percentage                         99.63 percent

recompute count (total)                        61401

SQL> SELECT x.ksppinm NAME, y.ksppstvl VALUE, x.ksppdesc describ

 2 FROM SYS.x$ksppi x, SYS.x$ksppcv y

 3 WHERE x.indx = y.indx

 4 AND x.ksppinm LIKE '%_pga_max_size%';

NAME              VALUE        DESCRIB

------------------------------ -----------------------------------------------------

**_pga_max_size     1761607680    Maximum size of the PGA memory for one process**

SQL> SELECT x.ksppinm NAME, y.ksppstvl VALUE, x.ksppdesc describ

 2 FROM SYS.x$ksppi x, SYS.x$ksppcv y

 3 WHERE x.indx = y.indx

 4 AND x.ksppinm LIKE '%_smm_max_size%';

NAME              VALUE        DESCRIB

------------------------------ ------ ------------------------------------------------

**_smm_max_size        860160       maximum work area size in auto mode (serial)**

 

复制

      从上面的内容可以看到PGA设置的大小为8400M,根据单个会话使用PGA的期望尺寸(也可以认为是实际分配的最大尺寸)计算公式是:min(5%pga_aggregate_target,50%_pga_max_size, _smm_max_size),可以简单计算下min(5%*8400M,50%1680M,840M)=420M(其中_pga_max_size的单位为bytes,_smm_max_size的单位为kb),即单个会话能使用PGA的期望尺寸为420M,那报错的会话是否超过了该限制呢?

查看ORA-04030报错的trace文件如下所示。


TOP 10 MEMORY USES FOR THIS PROCESS

\---------------------------------------

98% 109 MB, 1797 chunks: "permanent memory     " SQL

     sort subheap  ds=1108b67f0 dsprt=11048a508

 1% 1493 KB, 1779 chunks: "free memory        " 

     session heap  ds=11047a818 dsprt=11019eae0

 0% 193 KB, 26 chunks: "permanent memory     " 

     pga heap    ds=110072eb0 dsprt=0

 0%  82 KB,  6 chunks: "frame segment       " SQL

     kxs-heap-f   ds=11048a8c8 dsprt=11047a818

 0%  45 KB,  4 chunks: "permanent memory     " 

     session heap  ds=11047a818 dsprt=11019eae0

 0%  35 KB, 11 chunks: "kzctxhugi2        " 

     session heap  ds=11047a818 dsprt=11019eae0

 0%  33 KB,  1 chunk : "free memory        " 

     top call heap  ds=11019e8c0 dsprt=0

 0%  24 KB, 11 chunks: "koh-kghu session heap   " 

     session heap  ds=11047a818 dsprt=11019eae0

 0%  22 KB,  1 chunk : "Fixed Uga         " 

     pga heap    ds=110072eb0 dsprt=0

 0%  20 KB,  5 chunks: "kxsFrame4kPage      " 

     session heap  ds=11047a818 dsprt=11019eae0

 

=======================================

PRIVATE MEMORY SUMMARY FOR THIS PROCESS

\---------------------------------------

******************************************************

**PRIVATE HEAP SUMMARY DUMP**

**111 MB total:**

  111 MB commented, 194 KB permanent

  45 KB free (0 KB in empty extents),

   111 MB,  1 heap:  "session heap  "      

\------------------------------------------------------

Summary of subheaps at depth 1

111 MB total:

  109 MB commented, 55 KB permanent

 1513 KB free (1 KB in empty extents),

   2 KB uncommented freeable with mark,   110 MB,  1 heap:  "kxs-heap-w  "  1490 KB free held

\------------------------------------------------------

Summary of subheaps at depth 2

109 MB total:

  109 MB commented, 10 KB permanent

  14 KB free (4 KB in empty extents),

   0 KB uncommented freeable with mark,   109 MB,  1 heap:  "sort subheap  "      

\------------------------------------------------------

Summary of subheaps at depth 3

109 MB total:

   0 KB commented, 109 MB permanent

   2 KB free (0 KB in empty extents),

==========================================

INSTANCE-WIDE PRIVATE MEMORY USAGE SUMMARY

\-------------------------

Top 10 processes:

\-------------------------

(percentage is of 693 MB total allocated memory)

**16% pid 168: 111 MB used of 111 MB allocated <= CURRENT PROC**

 6% pid 3: 43 MB used of 43 MB allocated 

 2% pid 24: 402 KB used of 12 MB allocated (12 MB freeable)

 2% pid 33: 532 KB used of 12 MB allocated (11 MB freeable)

 2% pid 7: 11 MB used of 11 MB allocated 

 2% pid 8: 11 MB used of 11 MB allocated 

 2% pid 9: 11 MB used of 11 MB allocated 

 2% pid 10: 11 MB used of 11 MB allocated 

 2% pid 11: 11 MB used of 11 MB allocated 

 2% pid 12: 11 MB used of 11 MB allocated 


复制

      从以上的trace文件中可以看到,报错的进程实际分配的进程只有111MB,远远未达到420MB,说明并非是由于Oracle自身的限制引起的ORA-04030报错。

3)哪个进程需要的内存过多?

      上一个是否是由于Oracle自身限制引起的解释中,已经可以从trace文件中看到,消耗最多内存的进程就是报ORA-04030的进程,消耗的内存为110MB,并未发现其他更消耗内存的进程。

4)是否设置了操作系统限制?

查看操作系统限制,Oracle用户的限制。


$ ulimit -a

time(seconds)    unlimited

file(blocks)     unlimited

data(kbytes)     unlimited

stack(kbytes)    4194304

memory(kbytes)    unlimited

coredump(blocks)   unlimited

nofiles(descriptors) unlimited

threads(per process) unlimited

processes(per user) unlimited


复制

root用户的限制。


gisdata2@root[/]ulimit -a

time(seconds)    unlimited

file(blocks)     unlimited

**data(kbytes)     131072**

stack(kbytes)    32768

memory(kbytes)    unlimited

coredump(blocks)   unlimited

nofiles(descriptors) unlimited

threads(per process) unlimited

processes(per user) unlimited



复制

从上面root和oracle的limit限制来看,root用户的data(kbytes)的限定值得关注,该属性的意义是soft data segment size in blocks(进程数据段大小限制)。

11.2 聚焦疑点的跟踪测试与验证

      现在做一个反向测试:利用SQLPlus工具创建索引,创建成功。问题来了,为何SQLPlus会成功,PL/SQL Developer却会失败?通过PL/SQL Developer工具创建索引时报ORA-04030错误,但是通过SQLPus创建却能成功,两者除了使用的工具不同,还有就是连接的方式不同(PL/SQL Developer是通过监听程序建立的进程连接,而SQLPlus是在数据库服务器上直接创建创建的连接,绕过了监听程序建立的进程)。从连接工具和方式的不同得到了不一样的结果,如何来验证到底是连接工具的问题还是连接方式的引起的报错?由于PL/SQL Developer只能通过监听的方式进行连接,但是SQLPlus可以通过监听或则直接连接两种方式进行,所以先对连接方式进行测试。通过SQLPlus以tnsnames.ora标签名的方式通过监听进行连接,并执行创建索引报错的语句,发现错误依然存在,但是如果不通过监听而直接连接是不会报错的,说明跟是否通过监听进行连接有很大的关系。

为何会受监听的影响?
在Oracle RAC环境中,由于crs的启停是通过root用户进行的。

gisdata2@root[/]ps -ef|grep init

  root     1    0  0  Aug 31   - 18:24 /etc/init

  root 5046686    1  0  Jul 18   - 0:00 /bin/sh /etc/init.crsd run

  root 5702082 27066830  0  Jul 18   - 0:00 /bin/sh /etc/init.cssd oclsomon

  root 6554098 27066830  0  Jul 18   - 0:00 /bin/sh /etc/init.cssd daemon

  root 24969600    1  0  Jul 18   - 0:00 /bin/sh /etc/init.evmd run

  root 27066830    1  1  Jul 18   - 8:07 /bin/sh /etc/init.cssd fatal

  root 40108312 18940178  0 10:57:45 pts/1 0:00 grep init

复制

所以在crs会继承root用户的limit属性,当通过crs或则srvctl命令启动监听时,也会继承root用户相应的limit属性,即data(kbytes)为131072。如何验证该推断?

现在通过监听的形式进行连接。


**$ SQLplus sys/password@gissc2 as sysdba**

SQL*Plus: Release 10.2.0.5.0 - Production on Mon Jul 20 17:20:02 2015

Copyright (c) 1982, 2010, Oracle.  All Rights Reserved.

 

SQL> select * from v$mystat where rownum<2;

    SID STATISTIC#   VALUE

---------- ---------- ----------

   1427     0     1

SQL> select spid from v$process where addr =(select paddr from v$session where sid=1427);

SPID

\------------

**10158446         -----**该连接的操作系统进程号

 


复制

通过dbx工具查看该进程的limit信息:

**$ dbx -a 10158446**

Waiting to attach to process 10158446 ...

Successfully attached to oracle.

warning: Directory containing oracle could not be determined.

Apply 'use' command to initialize source path.

 

Type 'help' for help.

reading symbolic information ...

stopped in read at 0x90000000002e294 ($t1)

0x90000000002e294 (read+0x274) e8410028     ld  r2,0x28(r1)

 (dbx) proc rlimit

rlimit name:     rlimit_cur        rlimit_max    (units)

 RLIMIT_CPU:     (unlimited)       (unlimited)    sec

 RLIMIT_FSIZE:    (unlimited)       (unlimited)    bytes

 **RLIMIT_DATA:     134217728       (unlimited)    bytes**

 RLIMIT_STACK:     33554432       4294967296    bytes

 RLIMIT_CORE:    (unlimited)       (unlimited)    bytes

 RLIMIT_RSS:     (unlimited)       (unlimited)    bytes

 RLIMIT_AS:     (unlimited)       (unlimited)    bytes

 RLIMIT_NOFILE:      65534          65536    descriptors

 RLIMIT_THREADS:   (unlimited)       (unlimited)    per process

 RLIMIT_NPROC:    (unlimited)       (unlimited)    per user

(dbx) quit

复制

从上面的内容可以看到data属性的limit值为134217728 bytes即131072 kbytes与root的data(kbytes) 131072值完全吻合(stack的33554432 bytes即32768也与root的stack(kbytes) 32768一致),说明是通过监听建立连接进程的limit继承于root用户。

不通过监听进行连接:

$ SQLplus / as sysdba

SQL*Plus: Release 10.2.0.5.0 - Production on Mon Jul 20 17:16:09 2015

Copyright © 1982, 2010, Oracle. All Rights Reserved.

SQL> select * from v$mystat where rownum<2;

SID STATISTIC# VALUE


1375 0 1

SQL> select spid from v=(processwhereaddr=(selectpaddrfromvsession where sid=1375);

SPID

------------

47710432

跟踪这个服务器进程。


**$ dbx -a 47710432**

Waiting to attach to process 47710432 ...

Successfully attached to oracle.

warning: Directory containing oracle could not be determined.

Apply 'use' command to initialize source path.

 

Type 'help' for help.

reading symbolic information ...

stopped in read at 0x90000000002e294 ($t1)

0x90000000002e294 (read+0x274) e8410028     ld  r2,0x28(r1)

(dbx) proc rlimit

rlimit name:     rlimit_cur        rlimit_max    (units)

 RLIMIT_CPU:     (unlimited)       (unlimited)    sec

 RLIMIT_FSIZE:    (unlimited)       (unlimited)    bytes

 **RLIMIT_DATA:    (unlimited)       (unlimited)    bytes**

 RLIMIT_STACK:     33554432       4294967296    bytes

 RLIMIT_CORE:    (unlimited)       (unlimited)    bytes

 RLIMIT_RSS:     (unlimited)       (unlimited)    bytes

 RLIMIT_AS:     (unlimited)       (unlimited)    bytes

 RLIMIT_NOFILE:      65534       (unlimited)    descriptors

 RLIMIT_THREADS:   (unlimited)       (unlimited)    per process

 RLIMIT_NPROC:    (unlimited)       (unlimited)    per user

(dbx) quit

$

复制

从上面的内容可以看到,如果不通过监听连接数据库创建的进程,它的data限制为unlimited的即无限制。

最后查看crs中监听的启动日志:(/u01/oracle/product/10.2.0/db_1/log/gisdata2/racg中的日志文件ora.gisdata2.LISTENER_GISDATA2.lsnr.log)。

2015-07-18 14:16:54.676: [  RACG][1] [28508196][1][ora.gisdata2.LISTENER_GISDATA2.lsnr]: The command completed successfully

 

**2015-07-18 14:47:09.078:** [  RACG][1] [29950132][1][ora.gisdata2.LISTENER_GISDATA2.lsnr]:

LSNRCTL for IBM/AIX RISC System/6000: Version 10.2.0.5.0 - Production on 18-JUL-2015 14:47:08

 

Copyright (c) 1991, 2010, Oracle. All rights reserved.

Starting /u01/oracle/product/10.2.0/db_1/bin/tnslsnr: please wait...

2015-07-18 14:47:09.079: [  RACG][1] [29950132][1][ora.gisdata2.LISTENER_GISDATA2.lsnr]: TNSLSNR for IBM/AIX RISC System/6000: Version 10.2.0.5.0 - Production

System parameter file is /u01/oracle/product/10.2.0/db_1/network/admin/listener.ora

Log messages written to /u01/oracle/product/10.2.0/db_1/network/log/listener_gisdata2.log

 

2015-07-18 14:47:09.079: [  RACG][1] [29950132][1][ora.gisdata2.LISTENER_GISDATA2.lsnr]: Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.176.30.81)(PORT=1521)))

Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.176.30.79)(PORT=1521)))

Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC))) 

复制

从以上的内容可以发现监听是由crs(或则是srvctl命令)启动的。

以及监听的运行时间:


$ lsnrctl status

 

LSNRCTL for IBM/AIX RISC System/6000: Version 10.2.0.5.0 - Production on 20-JUL-2015 16:31:03

Copyright (c) 1991, 2010, Oracle. All rights reserved.

Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))

STATUS of the LISTENER

\------------------------

Alias           LISTENER_GISDATA2

Version          TNSLSNR for IBM/AIX RISC System/6000: Version 10.2.0.5.0 - Production

**Start Date        18-JUL-2015 14:56:39**

**Uptime          2 days 1 hr. 34 min. 24 sec**

Trace Level        off

Security          ON: Local OS Authentication

SNMP           ON

Listener Parameter File  /u01/oracle/product/10.2.0/db_1/network/admin/listener.ora

Listener Log File     /u01/oracle/product/10.2.0/db_1/network/log/listener_gisdata2.log

复制

监听启动的时间也与日志中的时间对应。

11.3 解析原理的问题总结与建议

      那为什么CRS会使用ROOT用户的limit限制呢?

首先,得明白在Linux中,每触发任何一个事件时,系统都会为它定义一个进程,并且给予一个ID,即PID,同时会根据触发这个进程的用户与相关属性关系,给这个PID设置有效的权限,如图11-1所示。


图11-1

      从上面可以知道,系统启动以后,CRS会自动启动,启动主要由/etc/init.d的几个脚本完成。而这些脚本的执行用户和用户组是root,也就是说当CRS启动时,Linux系统会根据UID/GID来判断资源属性和环境变量,那么这个进程所衍生的进程,也会沿用这些属性。
由此可以得到结论,由于监听是通过crs进行的启动,继承了root用户的limit限制,每个会话所能持有的内存大小最大不能超过128M,当通过监听进行数据库连接时,由监听创建的用户进程也将继承该limit限制,所以当通过P/LSQL Developer连接数据库(包括SQL*Plus等工具需要通过监听建立用户进程的情况),在创建索引过程中,当所请求的内存达到或非常接近该限制时,就会由于无法进一步申请更多的内存资源,抛出ORA-04030错误。

      提示,如果是在Linux系统中,可以通过cat /proc/PID/limits命令进行查看单个进程的limit属性值,其中PID为要查看进程的进程号。

以ORA-04030案例进行延伸扩展,为避免该错误可参考以下建议。

(1)在安装Oracle软件创建数据库之前应该对主机层面的内核参数、limit限制等进行规范的修改,以避免类似问题的发生。

(2)配置合理的内存,例如物理内存和交换空间。

(3)使用自动PGA内存管理可降低ORA-04030错误的概率。

「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论