Recently I ran into several cases where Linux asynchronous I/O (AIO) was exhausted, causing ASM instances to throw errors continuously. In the worst cases the ASM instance crashed and could not restart, leaving the cluster node in an abnormal state.
The errors in the ASM alert log look like the following, and the message is fairly explicit: ORA-27090, insufficient resources for asynchronous disk I/O.
Wed Apr 07 20:00:39 2021
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_17619.trc:
ORA-27090: Unable to reserve kernel resources for asynchronous disk I/O
Additional information: 3
Additional information: 128
Additional information: 8
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_17619.trc:
ORA-27090: Unable to reserve kernel resources for asynchronous disk I/O
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
Additional information: 128
Additional information: 180464224
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_17619.trc:
ORA-27090: Unable to reserve kernel resources for asynchronous disk I/O
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
Additional information: 128
Additional information: -1514856272
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_17619.trc:
ORA-27090: Unable to reserve kernel resources for asynchronous disk I/O
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
Additional information: 128
Additional information: -1514856272
Checking the aio-max-nr and aio-nr kernel parameters confirmed that asynchronous I/O was indeed exhausted:
[grid@node01 trace]$ cat /proc/sys/fs/aio-max-nr
1048576
[grid@node01 trace]$ cat /proc/sys/fs/aio-nr
1048493
The immediate fix is simple:
Set fs.aio-max-nr=3145728 in /etc/sysctl.conf, then apply it to the running kernel:
sysctl -w fs.aio-max-nr=3145728
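Since the limit is global, it is worth watching how close aio-nr gets to aio-max-nr before the error recurs. Below is a minimal sketch (not from the original post; the function name is my own) that expresses the usage as a percentage:

```shell
# Hypothetical helper: express fs.aio-nr as a percentage of fs.aio-max-nr.
# Pure integer arithmetic, so it can be exercised with any pair of numbers.
aio_usage_pct() {
    nr=$1
    max=$2
    echo $(( nr * 100 / max ))
}

# On a live system, feed it the real kernel counters:
# aio_usage_pct "$(cat /proc/sys/fs/aio-nr)" "$(cat /proc/sys/fs/aio-max-nr)"
```

With the values from the incident above (1048493 out of 1048576), this reports 99, i.e. the system was effectively out of AIO slots.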
But which process had opened that many AIO contexts? I went through everything under /proc/<pid> and found nothing AIO-related. At first I suspected monitoring queries hitting the ASM instance, but on reflection that did not fit: the error shows a request for just 128 events failing, so each request was small, and those are short-lived connections that release their contexts quickly. What we were seeing was clearly something that never releases.
Fortunately, when in doubt, ask Google. I found no way to query per-process AIO usage, but I did find a script for tracing AIO allocations, in an excellent post about this very ORA-27090 error:
https://blog.pythian.com/troubleshooting-ora-27090-async-io-errors/
I installed the required packages. These two come from the OS ISO:
systemtap
kernel-devel
The following two can be downloaded for your kernel version from http://debuginfo.centos.org/6/x86_64/:
kernel-debuginfo-common-x86_64
kernel-debuginfo
With the packages installed, I wrapped the probe in a script and ran it in the background.
vi aio
#!/bin/sh
stap -ve '
global allocated, allocatedctx, freed
probe syscall.io_setup {
allocatedctx[pid()] += maxevents; allocated[pid()]++;
printf("%d AIO events requested by PID %d (%s)\n",
maxevents, pid(), cmdline_str());
}
probe syscall.io_destroy {freed[pid()]++}
probe kprocess.exit {
if (allocated[pid()]) {
printf("PID %d exited\n", pid());
delete allocated[pid()];
delete allocatedctx[pid()];
delete freed[pid()];
}
}
probe end {
foreach (pid in allocated) {
printf("PID %d allocated=%d allocated events=%d freed=%d\n",
pid, allocated[pid], allocatedctx[pid], freed[pid]);
}
}
'
nohup sh aio > aio.out &
While the trace collected data in the background, I did not sit idle. I happened to have an environment with high AIO usage that could be restarted, so the plan was to shut things down one component at a time and catch the AIO "thief" in the act. The setup: session 1 performs the shutdown operations, while session 2 repeatedly runs cat /proc/sys/fs/aio-nr.
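The session-2 polling can be scripted instead of typed by hand. A sketch of such a loop (my own wrapper, not part of the original test; the source-file and interval parameters are assumptions added only to make the loop easy to try out):

```shell
# Sample an aio-nr style counter a fixed number of times, printing one
# timestamped value per line. Defaults to the real kernel counter; the
# source file and sleep interval are parameterized for illustration.
sample_aio_nr() {
    count=$1
    src=${2:-/proc/sys/fs/aio-nr}
    interval=${3:-1}
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s %s\n' "$(date +%Y%m%d_%H:%M:%S)" "$(cat "$src")"
        sleep "$interval"
        i=$((i + 1))
    done
}

# e.g. sample_aio_nr 300 > aio_nr.log    # poll once a second for 5 minutes
```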
Test 1: shut down the database instance
Session 1:
20210412_13:24:01
20210412_13:24:07 13:24:01 sys@MYDB > alter system switch logfile;
20210412_13:24:07
20210412_13:24:07 System altered.
20210412_13:24:07
20210412_13:24:07 Elapsed: 00:00:00.32
20210412_13:24:11 13:24:07 sys@MYDB > alter system checkpoint;
20210412_13:24:11
20210412_13:24:11 System altered.
20210412_13:24:11
20210412_13:24:11 Elapsed: 00:00:00.21
20210412_13:24:12 13:24:11 sys@MYDB >
20210412_13:24:12 13:24:12 sys@MYDB >
20210412_13:25:05 13:24:12 sys@MYDB > shut immediate;
20210412_13:25:20 Database closed.
20210412_13:25:20 Database dismounted.
20210412_13:26:37 ORACLE instance shut down.
20210412_13:26:41 13:26:37 sys@MYDB > exit
Session 2:
20210412_13:23:55 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:23:55 1063499
20210412_13:25:03 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:25:03 1063627
20210412_13:25:08 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:25:08 1062896
20210412_13:27:31 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:31 1050752
20210412_13:27:32 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:32 1050752
20210412_13:27:33 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:33 1050752
Clearly, the instance shutdown test shows the allocations were not held by the database instance.
Test 2: shut down the cluster stack
Session 1:
20210412_13:27:26 [root@node01 ~]# /u01/app/11.2.0/grid/bin/crsctl stop crs
20210412_13:27:27 CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'node01'
20210412_13:27:27 CRS-2673: Attempting to stop 'ora.crsd' on 'node01'
20210412_13:27:27 CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'node01'
20210412_13:27:27 CRS-2673: Attempting to stop 'ora.LISTENER_SCAN1.lsnr' on 'node01'
20210412_13:27:27 CRS-2673: Attempting to stop 'ora.cvu' on 'node01'
20210412_13:27:27 CRS-2673: Attempting to stop 'ora.oc4j' on 'node01'
20210412_13:27:27 CRS-2673: Attempting to stop 'ora.CRS.dg' on 'node01'
20210412_13:27:27 CRS-2673: Attempting to stop 'ora.registry.acfs' on 'node01'
20210412_13:27:27 CRS-2673: Attempting to stop 'ora.ARCH.dg' on 'node01'
20210412_13:27:27 CRS-2673: Attempting to stop 'ora.DATA.dg' on 'node01'
20210412_13:27:27 CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'node01'
20210412_13:27:27 CRS-2677: Stop of 'ora.cvu' on 'node01' succeeded
20210412_13:27:27 CRS-2672: Attempting to start 'ora.cvu' on 'node02'
20210412_13:27:27 CRS-2677: Stop of 'ora.LISTENER.lsnr' on 'node01' succeeded
20210412_13:27:27 CRS-2673: Attempting to stop 'ora.node01.vip' on 'node01'
20210412_13:27:27 CRS-2677: Stop of 'ora.LISTENER_SCAN1.lsnr' on 'node01' succeeded
20210412_13:27:27 CRS-2673: Attempting to stop 'ora.scan1.vip' on 'node01'
20210412_13:27:27 CRS-2676: Start of 'ora.cvu' on 'node02' succeeded
20210412_13:27:27 CRS-2677: Stop of 'ora.ARCH.dg' on 'node01' succeeded
20210412_13:27:28 CRS-2677: Stop of 'ora.DATA.dg' on 'node01' succeeded
20210412_13:27:28 CRS-2677: Stop of 'ora.registry.acfs' on 'node01' succeeded
20210412_13:27:29 CRS-2677: Stop of 'ora.scan1.vip' on 'node01' succeeded
20210412_13:27:29 CRS-2672: Attempting to start 'ora.scan1.vip' on 'node02'
20210412_13:27:29 CRS-2677: Stop of 'ora.node01.vip' on 'node01' succeeded
20210412_13:27:29 CRS-2672: Attempting to start 'ora.node01.vip' on 'node02'
20210412_13:27:31 CRS-2676: Start of 'ora.scan1.vip' on 'node02' succeeded
20210412_13:27:31 CRS-2672: Attempting to start 'ora.LISTENER_SCAN1.lsnr' on 'node02'
20210412_13:27:32 CRS-2676: Start of 'ora.node01.vip' on 'node02' succeeded
20210412_13:27:33 CRS-2676: Start of 'ora.LISTENER_SCAN1.lsnr' on 'node02' succeeded
20210412_13:27:34 CRS-2677: Stop of 'ora.oc4j' on 'node01' succeeded
20210412_13:27:34 CRS-2672: Attempting to start 'ora.oc4j' on 'node02'
20210412_13:27:37 CRS-2677: Stop of 'ora.CRS.dg' on 'node01' succeeded
20210412_13:27:37 CRS-2673: Attempting to stop 'ora.asm' on 'node01'
20210412_13:27:37 CRS-2677: Stop of 'ora.asm' on 'node01' succeeded
20210412_13:27:52 CRS-2676: Start of 'ora.oc4j' on 'node02' succeeded
20210412_13:27:52 CRS-2673: Attempting to stop 'ora.ons' on 'node01'
20210412_13:27:52 CRS-2677: Stop of 'ora.ons' on 'node01' succeeded
20210412_13:27:52 CRS-2673: Attempting to stop 'ora.net1.network' on 'node01'
20210412_13:27:52 CRS-2677: Stop of 'ora.net1.network' on 'node01' succeeded
20210412_13:27:52 CRS-2792: Shutdown of Cluster Ready Services-managed resources on 'node01' has completed
20210412_13:27:54 CRS-2677: Stop of 'ora.crsd' on 'node01' succeeded
20210412_13:27:54 CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'node01'
20210412_13:27:54 CRS-2673: Attempting to stop 'ora.crf' on 'node01'
20210412_13:27:54 CRS-2673: Attempting to stop 'ora.ctssd' on 'node01'
20210412_13:27:54 CRS-2673: Attempting to stop 'ora.evmd' on 'node01'
20210412_13:27:54 CRS-2673: Attempting to stop 'ora.asm' on 'node01'
20210412_13:27:54 CRS-2673: Attempting to stop 'ora.mdnsd' on 'node01'
20210412_13:27:55 CRS-2677: Stop of 'ora.evmd' on 'node01' succeeded
20210412_13:27:55 CRS-2677: Stop of 'ora.mdnsd' on 'node01' succeeded
20210412_13:27:55 CRS-2677: Stop of 'ora.crf' on 'node01' succeeded
20210412_13:27:55 CRS-2677: Stop of 'ora.ctssd' on 'node01' succeeded
20210412_13:28:02 CRS-2677: Stop of 'ora.asm' on 'node01' succeeded
20210412_13:28:02 CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'node01'
20210412_13:28:04 CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'node01' succeeded
20210412_13:28:04 CRS-2673: Attempting to stop 'ora.cssd' on 'node01'
20210412_13:28:05 CRS-2677: Stop of 'ora.cssd' on 'node01' succeeded
20210412_13:28:05 CRS-2673: Attempting to stop 'ora.gipcd' on 'node01'
20210412_13:28:05 CRS-2677: Stop of 'ora.drivers.acfs' on 'node01' succeeded
20210412_13:28:06 CRS-2677: Stop of 'ora.gipcd' on 'node01' succeeded
20210412_13:28:06 CRS-2673: Attempting to stop 'ora.gpnpd' on 'node01'
20210412_13:28:07 CRS-2677: Stop of 'ora.gpnpd' on 'node01' succeeded
20210412_13:28:07 CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'node01' has completed
20210412_13:28:07 CRS-4133: Oracle High Availability Services has been stopped.
Session 2:
20210412_13:27:23 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:23 1050496
20210412_13:27:30 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:30 1050752
20210412_13:27:31 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:31 1050752
20210412_13:27:32 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:32 1050752
20210412_13:27:33 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:33 1050752
20210412_13:27:33 [oracle@node01 ~]$
20210412_13:27:33 [oracle@node01 ~]$
20210412_13:27:34 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:34 1050752
20210412_13:27:35 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:35 1050880
20210412_13:27:36 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:36 1050880
20210412_13:27:37 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:37 1050880
20210412_13:27:37 [oracle@node01 ~]$
20210412_13:27:37 [oracle@node01 ~]$
20210412_13:27:38 [oracle@node01 ~]$
20210412_13:27:38 [oracle@node01 ~]$
20210412_13:27:39 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:39 1050624
20210412_13:27:40 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:40 1050624
20210412_13:27:40 [oracle@node01 ~]$
20210412_13:27:41 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:41 1050624
20210412_13:27:42 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:42 1050624
20210412_13:27:43 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:43 1050624
20210412_13:27:45 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:45 1050624
20210412_13:27:46 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:46 1050624
20210412_13:27:47 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:47 1050624
20210412_13:27:47 [oracle@node01 ~]$
20210412_13:27:47 [oracle@node01 ~]$
20210412_13:27:48 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:48 1050624
20210412_13:27:49 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:49 1050624
20210412_13:27:49 [oracle@node01 ~]$
20210412_13:27:49 [oracle@node01 ~]$
20210412_13:27:50 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:50 1050624
20210412_13:27:51 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:51 1050624
20210412_13:27:52 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:52 1050624
20210412_13:27:53 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:53 1050496
20210412_13:27:55 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:55 8192
20210412_13:27:56 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:56 8064
20210412_13:28:28 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:28:28 0
20210412_13:28:52 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:28:52 896
20210412_13:28:55 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:28:55 1408
20210412_13:28:56 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:28:56 1408
20210412_13:28:57 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:28:57 1408
20210412_13:28:57 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:28:57 1408
20210412_13:28:58 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:28:58 1408
20210412_13:28:59 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
Then the remarkable moment: the AIO allocations were suddenly released almost entirely. That confirms the AIO was held by cluster-related processes.
Lining up the time the AIO was released against the times the cluster processes stopped:
20210412_13:27:53 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:53 1050496
20210412_13:27:55 [oracle@node01 ~]$ cat /proc/sys/fs/aio-nr
20210412_13:27:55 8192
20210412_13:27:52 CRS-2792: Shutdown of Cluster Ready Services-managed resources on 'node01' has completed
20210412_13:27:54 CRS-2677: Stop of 'ora.crsd' on 'node01' succeeded
……
20210412_13:27:55 CRS-2677: Stop of 'ora.evmd' on 'node01' succeeded
20210412_13:27:55 CRS-2677: Stop of 'ora.mdnsd' on 'node01' succeeded
20210412_13:27:55 CRS-2677: Stop of 'ora.crf' on 'node01' succeeded
20210412_13:27:55 CRS-2677: Stop of 'ora.ctssd' on 'node01' succeeded
So the offending process is hiding among the resources ora.crsd, ora.evmd, ora.mdnsd, ora.crf, and ora.ctssd.
The least disruptive one to stop is ora.crf, and I still had a second node available, so I verified that one first:
[oracle@node02 ~]$ cat /proc/sys/fs/aio-nr
1063724
[grid@node02 ~]$ crsctl stop res ora.crf -init
CRS-2673: Attempting to stop 'ora.crf' on 'node02'
CRS-2677: Stop of 'ora.crf' on 'node02' succeeded
[oracle@node02 ~]$ cat /proc/sys/fs/aio-nr
22956
What luck: it really was ora.crf. That settled it at the resource level, but ora.crf runs two processes, ologgerd and osysmond.bin, so the exact culprit was still unknown.
A day later aio-nr had grown again, which is where the systemtap trace paid off. Given the conclusion above, it was easy to pin down: the osysmond.bin process keeps allocating new AIO contexts and never frees them, so the count grows without bound.
[root@node01 tmp]# cat aio.out|grep ologgerd
[root@node01 tmp]# cat aio.out|grep osysmond
128 AIO events requested by PID 20736 (/u01/app/11.2.0/grid/bin/osysmond.bin)
128 AIO events requested by PID 20736 (/u01/app/11.2.0/grid/bin/osysmond.bin)
128 AIO events requested by PID 20736 (/u01/app/11.2.0/grid/bin/osysmond.bin)
128 AIO events requested by PID 20736 (/u01/app/11.2.0/grid/bin/osysmond.bin)
128 AIO events requested by PID 20736 (/u01/app/11.2.0/grid/bin/osysmond.bin)
128 AIO events requested by PID 20736 (/u01/app/11.2.0/grid/bin/osysmond.bin)
128 AIO events requested by PID 20736 (/u01/app/11.2.0/grid/bin/osysmond.bin)
128 AIO events requested by PID 20736 (/u01/app/11.2.0/grid/bin/osysmond.bin)
128 AIO events requested by PID 20736 (/u01/app/11.2.0/grid/bin/osysmond.bin)
128 AIO events requested by PID 20736 (/u01/app/11.2.0/grid/bin/osysmond.bin)
128 AIO events requested by PID 20736 (/u01/app/11.2.0/grid/bin/osysmond.bin)
128 AIO events requested by PID 20736 (/u01/app/11.2.0/grid/bin/osysmond.bin)
128 AIO events requested by PID 20736 (/u01/app/11.2.0/grid/bin/osysmond.bin)
128 AIO events requested by PID 20736 (/u01/app/11.2.0/grid/bin/osysmond.bin)
128 AIO events requested by PID 20736 (/u01/app/11.2.0/grid/bin/osysmond.bin)
128 AIO events requested by PID 20736 (/u01/app/11.2.0/grid/bin/osysmond.bin)
128 AIO events requested by PID 20736 (/u01/app/11.2.0/grid/bin/osysmond.bin)
128 AIO events requested by PID 20736 (/u01/app/11.2.0/grid/bin/osysmond.bin)
128 AIO events requested by PID 20736 (/u01/app/11.2.0/grid/bin/osysmond.bin)
128 AIO events requested by PID 20736 (/u01/app/11.2.0/grid/bin/osysmond.bin)
128 AIO events requested by PID 20736 (/u01/app/11.2.0/grid/bin/osysmond.bin)
128 AIO events requested by PID 20736 (/u01/app/11.2.0/grid/bin/osysmond.bin)
128 AIO events requested by PID 20736 (/u01/app/11.2.0/grid/bin/osysmond.bin)
That is not quite enough, though: what if we had not been able to narrow it down to ora.crf? We also need a way to analyze the aio.out trace itself. Looking at the log, a process is recorded both when it allocates AIO and when it exits.
[root@node01 tmp]# more aio.out
Pass 1: parsed user script and 86 library script(s) using 195620virt/24308res/3036shr/21560data kb, in 160usr/10sys/177real ms.
Pass 2: analyzed script: 4 probe(s), 10 function(s), 2 embed(s), 3 global(s) using 360404virt/52364res/4136shr/48500data kb, in 470usr/180sys/645real ms.
Pass 3: using cached /root/.systemtap/cache/91/stap_91cc8fd721590c225d80de6761ca9205_6624.c
Pass 4: using cached /root/.systemtap/cache/91/stap_91cc8fd721590c225d80de6761ca9205_6624.ko
Pass 5: starting run.
128 AIO events requested by PID 3210 (oraclemydb1 (LOCAL=NO))
128 AIO events requested by PID 3214 (oracle+ASM1_user3210_mydb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq))))
256 AIO events requested by PID 3214 (oracle+ASM1_user3210_mydb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq))))
PID 3210 exited
PID 3214 exited
128 AIO events requested by PID 3224 (oraclemydb1 (LOCAL=NO))
128 AIO events requested by PID 3228 (oracle+ASM1_user3224_mydb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq))))
256 AIO events requested by PID 3228 (oracle+ASM1_user3224_mydb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq))))
PID 3224 exited
PID 3228 exited
128 AIO events requested by PID 3276 (oraclemydb1 (LOCAL=NO))
PID 3276 exited
128 AIO events requested by PID 3352 (oraclemydb1 (LOCAL=NO))
PID 3352 exited
128 AIO events requested by PID 3356 (oraclemydb1 (LOCAL=NO))
PID 3356 exited
128 AIO events requested by PID 3360 (oraclemydb1 (LOCAL=NO))
PID 3360 exited
128 AIO events requested by PID 3512 (oraclemydb1 (LOCAL=NO))
PID 3512 exited
128 AIO events requested by PID 3516 (oraclemydb1 (LOCAL=NO))
128 AIO events requested by PID 3518 (oraclemydb1 (LOCAL=NO))
PID 3516 exited
PID 3518 exited
So let's make an assumption: each recorded allocation counts as one allocation, and each recorded process exit counts as one release. (In reality a process releases all of its AIO contexts on exit; the assumption simply helps single out processes that allocate repeatedly.)
From that assumption the arithmetic is: for a process that has exited, net allocation count = total allocations − exit count (the subtraction also guards against PID reuse); for a process still running, net allocation count = total allocations. AIO consumed = net allocation count × events per allocation. For example, a PID with 24 recorded allocations of 128 events each and no exit record nets 24 × 128 = 3072 events.
With this approach we need two data sets. Data set 1: the allocation records, which the log lets us filter on "AIO events requested":
cat aio.out|grep "AIO events requested"|awk '{print $7" "$8" "$1}' |sort -rn -k1,2 |uniq -c |sort -n > file0.txt
The resulting format is:
count  PID  process-info (kept so that different programs reusing a PID are not lumped together)  AIO-events
1 10051 (oracle+ASM1_user10047_mydb1 128
1 10051 (oracle+ASM1_user10047_mydb1 256
1 10051 (oraclemydb1 128
1 10054 (oraclemydb1 128
1 10058 (oraclemydb1 128
1 10059 (oraclemydb1 128
1 10060 (oraclemydb1 128
1 10063 (oraclemydb1 128
1 10064 (oraclemydb1 128
1 10066 (oraclemydb1 128
1 10068 (oraclemydb1 128
1 1006 (oraclemydb1 128
1 10072 (oraclemydb1 128
1 10074 (oraclemydb1 128
1 10077 (oraclemydb1 128
1 10078 (oraclemydb1 128
1 1007 (ora_m000_mydb1) 128
1 10080 (oraclemydb1 128
1 10081 (oracle+ASM1_user10077_mydb1 128
1 10081 (oracle+ASM1_user10077_mydb1 256
1 10081 (oraclemydb1 128
1 10086 (oraclemydb1 128
Data set 2: the process-exit records, which can be filtered on " exited":
cat aio.out|grep " exited"|awk '{print $2}' |sort -rn |uniq -c |sort -n > file1.txt
The resulting format is:
exit-count  PID
1 1000
1 10000
1 10002
1 10004
1 10005
1 10006
1 10007
1 10008
(Of course, these filter strings can be customized by editing the monitoring script.)
Then join the two files and summarize:
awk 'NR==FNR{a[$2]=$1;next}{cnt=($2 in a)?$1-a[$2]:$1; print "PID: "$2" CNT: "cnt" AIO_CNT: "cnt*$4}' file1.txt file0.txt|sort -n -k4
……
PID: 9999 CNT: 1 AIO_CNT: 128
PID: 999 CNT: 1 AIO_CNT: 128
PID: 999 CNT: 1 AIO_CNT: 128
PID: 999 CNT: 1 AIO_CNT: 256
PID: 20736 CNT: 24 AIO_CNT: 3072 <<<< the biggest net allocator: osysmond.bin
[root@node01 tmp]# ps -ef |grep 20736
root 20736 1 4 Apr12 ? 01:35:22 /u01/app/11.2.0/grid/bin/osysmond.bin
root 26669 22789 0 00:53 pts/3 00:00:00 grep 20736
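The three commands above can be folded into a single script. This is a sketch under the same assumptions (the aio.out log format shown earlier and the fixed field positions it implies); the function name and temp-file paths are my own choices:

```shell
# Summarize an aio.out trace: per PID, net allocation count and total AIO
# events, using the "allocations minus exits" approximation described above.
analyze_aio() {
    log=$1
    # data set 1: "count PID proc-info events" per distinct allocation line
    grep "AIO events requested" "$log" | awk '{print $7" "$8" "$1}' \
        | sort -n | uniq -c > /tmp/aio_alloc.$$
    # data set 2: "count PID" per exited process
    grep " exited" "$log" | awk '{print $2}' | sort -n | uniq -c > /tmp/aio_exit.$$
    # join on PID; subtract the exit count when the PID is known to have exited
    awk 'NR==FNR{a[$2]=$1;next}
         {cnt=($2 in a)?$1-a[$2]:$1;
          print "PID: "$2" CNT: "cnt" AIO_CNT: "cnt*$4}' \
        /tmp/aio_exit.$$ /tmp/aio_alloc.$$ | sort -n -k4
    rm -f /tmp/aio_alloc.$$ /tmp/aio_exit.$$
}

# usage: analyze_aio aio.out    # the last line is the biggest net allocator
```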
With that, the AIO thief was caught, and the evidence has been filed with its vendor; now we wait for their answer.
Takeaways: 1. When something odd happens, dig a little deeper; you will learn more than you expect. 2. systemtap is a remarkably powerful tool.