一、现象描述
研发中心开发部一同事反馈Oracle一开发测试库无法访问,业务无法测试,让我协助帮忙处理下。
尝试登陆该服务器失败,但可以ping通该服务器,于是让系统组同事帮忙通过控制台登陆该服务器看下什么问题,被告知无法进入服务器,报内核错误,于是强制重启。
服务器重启一段时间后,可以正常登陆服务器,数据库已被自动拉起。
告知研发同事进行测试,被告知应用连接超时。
观察数据库日志,发现大量如下报错:
开发测试人员反馈,因为大量超时,应用无法正常进行测试。
二、问题分析
由于ORA-609,导致ospid(xxxx)进程被aborting了,同时由于TNS-12537报错,最终导致应用连接被关闭。
ORA-609错误原因是因为无法与进入的连接进行联系,所以无法将此连接转入监听器,所以数据库的process中止此进程,因此服务器进程被终止关闭。
根据报错信息,在网上及MOS上搜索了下相关信息,并根据相关信息,在Oracle用户$ORACLE_HOME/network/admin下创建sqlnet.ora,并配置
sqlnet.expire_time=10 SQLNET.INBOUND_CONNECT_TIMEOUT=180 DIAG_ADR_ENABLED = OFF
复制
让研发同事连接测试,结果应用连接后,很快超时,数据库负载压力突增,就连编辑文件都会被夯住。
根据报错,又反复调整了几个参数值,应用测试还是超时。
另外观察到ASM日志里保留大量如下错误:
根据资料显示 WARNING: Waited 139 secs for write IO to PST disk 0 in group 1 这种报错大意为PST通信链路在访问磁盘组中的磁盘时候等待了XXX秒,而且触发了持续的等待。超时等待会在频率触发的基础上递增每次的等待时间。
出现这种状况的原因一般与操作系统网络通信链路,数据库主机磁盘或者超时参数的设置有关。
并根据建议将_asm_hbeatiowait值进行了调整:alter system set “_asm_hbeatiowait”=120 scope=spfile;
并重启了ASM实例,重启完后,依然应用报连接超时。
此时观察到系统日志/var/log/messages里有报如下信息:
根据 “echo 0 > /proc/sys/kernel/hung_task_timeout_secs” disables this message 查询了相关资料,对此的解释是:
默认情况下, Linux会最多使用40%的可用内存作为文件系统缓存。当超过这个阈值后,文件系统会把将缓存中的内存全部写入磁盘, 导致后续的IO请求都是同步的。
将缓存写入磁盘时,有一个默认120秒的超时时间。 出现上面的问题的原因是IO子系统的处理速度不够快,不能在120秒将缓存中的数据全部写入磁盘。IO系统响应缓慢,导致越来越多的请求堆积,最终系统内存全部被占用,导致系统失去响应。
对此的建议修改是:
sysctl -w vm.dirty_ratio=10 sysctl -w vm.dirty_background_ratio=5 sysctl -p
复制
如果系统永久生效,修改/etc/sysctl.conf文件。加入如下两行
vm.dirty_background_ratio = 5 vm.dirty_ratio = 10
复制
重启了数据库和操作系统。
此时也到了周末,期待着经过周末,该问题能否自动修复。
但很遗憾,周一早上观察,依然报连接超时。
研发的同事反馈,部分SQL可以正常查询,但几个关键SQL查询很慢,超过了应用设置的超时时间,其中如下这个SQL要求在3毫米内查出。
-- SQL已做了脱敏处理 select * from ( select q.*,rownum rnum from ( select * from (select * from (select inquiry_id id, d.*, case when quo_num > 0 then '1' when gi_flag = '1' then '4' when left_day <= 0 then '3' else '3' end quotation_con, row_number() over(partition by inquiry_id order by so_flag asc,quo_id desc) rn from (select m.com_name comName, m.com_id, i.info_title, i.inquiry_id, (select count(ip.inquiry_prod_id) from inq_pr_in ip where ip.relation_id = i.inquiry_id and ip.relation_type = '1' and ip.delete_flag = '1') inq_prod, (select ip.pu_un_text from inq_pr_in ip where ip.relation_id = i.inquiry_id and ip.relation_type = '1' and ip.delete_flag = '1' and rownum=1) in_prod_un, (select sum(ip.pu_qua) from inq_pr_in ip where ip.relation_id = i.inquiry_id and ip.relation_type = '1' and ip.delete_flag = '1') in_prod_qu, i.member_id, (select com_id from cor_co_cn_ma_chk where xxx_member_id = i.member_id and rownum<=1) mic_com_id, (select mm.MEMBER_TYPE from xxx.CO_ME_INFO mm where mm.MEMBER_ID = i.member_id) member_type, (select l.log_user from xxx.core_logon_info l where l.member_id=i.member_id and l.user_role='0' and rownum<=1) log_user, i.delivery_province, i.delivery_city, i.eff_ti, i.added_time added_time_inquiry, trunc(i.eff_ti) - trunc(sysdate) + 1 left_day, f_get_big_buyer_flag(i.member_id) big_flag, f_get_com_cert_st(i.member_id) cert_st, q.invite_read_flag read_flag, q.gi_flag, '1' so_flag, q.quo_id quo_id, (select count(1) from quo_info q where q.relation_id = i.inquiry_id and q.relation_type = '1' and q.quotation_flag = '1' and q.member_id = xxxx) quo_num, (select max(q.added_time) from quo_info q where q.relation_id = i.inquiry_id and q.relation_type = '1' and q.quotation_flag = '1') added_time_order, '0' selection from inquiry_info i, quo_info q, cor_com_ma m where i.bu_flag = '1' and i.sensetive_flag = '0' and i.CON_ST in ('0', '3') and m.member_id = i.member_id and m.com_sta in ('1', '3') and q.relation_type = '1' and i.inquiry_id = q.relation_id and q.member_id = xxxx and q.invite_email is not null union all select m.com_name comName, m.com_id, i.info_title, i.inquiry_id, (select count(ip.inquiry_prod_id) from inq_pr_in ip where ip.relation_id = i.inquiry_id and ip.relation_type = '1' and ip.delete_flag = '1') inq_prod, (select ip.pu_un_text from inq_pr_in ip where ip.relation_id = i.inquiry_id and ip.relation_type = '1' and ip.delete_flag = '1' and rownum=1) in_prod_un, (select sum(ip.pu_qua) from inq_pr_in ip where ip.relation_id = i.inquiry_id and ip.relation_type = '1' and ip.delete_flag = '1') in_prod_qu, i.member_id, (select com_id from cor_co_cn_ma_chk where xxx_member_id = i.member_id and rownum<=1) mic_com_id, (select mm.MEMBER_TYPE from xxx.CO_ME_INFO mm where mm.MEMBER_ID = i.member_id) member_type, (select l.log_user from xxx.core_logon_info l where l.member_id=i.member_id and l.user_role='0' and rownum<=1) log_user, i.delivery_province, i.delivery_city, i.eff_ti, i.added_time added_time_inquiry, trunc(i.eff_ti) - trunc(sysdate) + 1 left_day, f_get_big_buyer_flag(i.member_id) big_flag, f_get_com_cert_st(i.member_id) cert_st, q.invite_read_flag read_flag, q.gi_flag, '1' so_flag, q.quo_id quo_id, (select count(1) from quo_info q where q.relation_id = i.inquiry_id and q.relation_type = '1' and q.quotation_flag = '1' and q.member_id = xxxx) quo_num, (select max(q.added_time) from quo_info q where q.relation_id = i.inquiry_id and q.relation_type = '1' and q.quotation_flag = '1') added_time_order, '0' selection from inquiry_info i, quo_info q, core_pr_se p, cor_com_ma m where i.bu_flag = '1' and i.sensetive_flag = '0' and i.CON_ST in ('0', '3') and m.member_id = i.member_id and m.com_sta in ('1', '3') and q.relation_type = '1' and i.inquiry_id = q.relation_id and i.mic_prod_id = p.prod_id and q.member_id = xxxx and q.mic_com_id is not null union all select i.comName, i.com_id, i.info_title, i.inquiry_id, (select count(ip.inquiry_prod_id) from inq_pr_in ip where ip.relation_id = i.inquiry_id and ip.relation_type = '1' and ip.delete_flag = '1') inq_prod, (select ip.pu_un_text from inq_pr_in ip where ip.relation_id = i.inquiry_id and ip.relation_type = '1' and ip.delete_flag = '1' and rownum=1) in_prod_un, (select sum(ip.pu_qua) from inq_pr_in ip where ip.relation_id = i.inquiry_id and ip.relation_type = '1' and ip.delete_flag = '1') in_prod_qu, i.member_id, (select com_id from cor_co_cn_ma_chk where xxx_member_id = i.member_id and rownum<=1) mic_com_id, (select mm.MEMBER_TYPE from xxx.CO_ME_INFO mm where mm.MEMBER_ID = i.member_id) member_type, (select l.log_user from xxx.core_logon_info l where l.member_id=i.member_id and l.user_role='0' and rownum<=1) log_user, i.delivery_province, i.delivery_city, i.eff_ti, i.added_time added_time_inquiry, trunc(i.eff_ti) - trunc(sysdate) + 1 left_day, f_get_big_buyer_flag(i.member_id) big_flag, f_get_com_cert_st(i.member_id) cert_st, i.readFlag read_flag, i.gi_flag, '2' so_flag, i.quo_id quo_id, (select count(1) from quo_info q where q.relation_id = i.inquiry_id and q.relation_type = '1' and q.quotation_flag = '1' and q.member_id = xxxx) quo_num, nvl((select max(q.added_time) from quo_info q where q.relation_id = i.inquiry_id and q.relation_type = '1' and q.quotation_flag = '1'), i.s_added_time) added_time_order, i.selection from (select row_number() over(partition by s.ma_rel_id order by s.added_time asc) rn, s.gi_flag, s.added_time s_added_time, s.read_flag_supp readFlag, s.match_id quo_id, m.com_id, m.com_name comName, s.selection, q.* from xxx.sy_ma_info s, xxx.inquiry_info q, xxx.cor_com_ma m, xxx.cor_co_cn_ma_chk k where s.ma_rel_id = q.inquiry_id and s.mic_com_id = k.com_id and s.relation_type in ('3', '4', '5') and s.delete_flag = '1' and q.bu_flag = '1' and q.sensetive_flag = '0' and q.CON_ST in ('0', '3') and m.member_id = s.member_id and m.com_sta in ('1', '3') and k.xxx_member_id = xxxx) i union all select i.comName, i.com_id, i.info_title, i.inquiry_id, (select count(ip.inquiry_prod_id) from inq_pr_in ip where ip.relation_id = i.inquiry_id and ip.relation_type = '1' and ip.delete_flag = '1') inq_prod, (select ip.pu_un_text from inq_pr_in ip where ip.relation_id = i.inquiry_id and ip.relation_type = '1' and ip.delete_flag = '1' and rownum=1) in_prod_un, (select sum(ip.pu_qua) from inq_pr_in ip where ip.relation_id = i.inquiry_id and ip.relation_type = '1' and ip.delete_flag = '1') in_prod_qu, i.member_id, (select com_id from cor_co_cn_ma_chk where xxx_member_id = i.member_id and rownum<=1) mic_com_id, (select mm.MEMBER_TYPE from xxx.CO_ME_INFO mm where mm.MEMBER_ID = i.member_id) member_type, (select l.log_user from xxx.core_logon_info l where l.member_id=i.member_id and l.user_role='0' and rownum<=1) log_user, i.delivery_province, i.delivery_city, i.eff_ti, i.added_time added_time_inquiry, trunc(i.eff_ti) - trunc(sysdate) + 1 left_day, f_get_big_buyer_flag(i.member_id) big_flag, f_get_com_cert_st(i.member_id) cert_st, i.readFlag read_flag, i.gi_flag, '3' so_flag, i.quo_id quo_id, (select count(1) from quo_info q where q.relation_id = i.inquiry_id and q.relation_type = '1' and q.quotation_flag = '1' and q.member_id = xxxx) quo_num, nvl((select max(q.added_time) from quo_info q where q.relation_id = i.inquiry_id and q.relation_type = '1' and q.quotation_flag = '1'), i.s_added_time) added_time_order, '0' selection from (select row_number() over(partition by s.ma_rel_id order by s.added_time asc) rn, s.gi_flag, s.added_time s_added_time, s.read_flag readFlag, s.match_id quo_id, m.com_id, m.com_name comName, q.* from sys_mat_cn s, inquiry_info q, cor_com_ma m where s.supp_member_id = xxxx and s.ma_rel_id = q.inquiry_id and s.ma_meth = '2' and q.bu_flag = '1' and q.sensetive_flag = '0' and q.CON_ST in ('0', '3') and m.member_id = q.member_id and m.com_sta in ('1', '3') AND (( ((TRUNC(s.ADDED_TIME) >= TRUNC(SYSDATE - 2)) OR q.inquiry_id = 2000522250 ) AND NOT EXISTS (SELECT * FROM quo_info k WHERE k.relation_id = q.inquiry_id AND k.relation_type = '1' AND k.quotation_flag = '1' AND k.member_id = xxxx)) OR s.gi_flag = '1' OR EXISTS (SELECT * FROM quo_info k WHERE k.relation_id = q.inquiry_id AND k.relation_type = '1' AND k.quotation_flag = '1' AND k.member_id = xxxx)) ) i union all select i.comName, i.com_id, i.info_title, i.inquiry_id, (select count(ip.inquiry_prod_id) from inq_pr_in ip where ip.relation_id = i.inquiry_id and ip.relation_type = '1' and ip.delete_flag = '1') inq_prod, (select ip.pu_un_text from inq_pr_in ip where ip.relation_id = i.inquiry_id and ip.relation_type = '1' and ip.delete_flag = '1' and rownum=1) in_prod_un, (select sum(ip.pu_qua) from inq_pr_in ip where ip.relation_id = i.inquiry_id and ip.relation_type = '1' and ip.delete_flag = '1') in_prod_qu, i.member_id, (select com_id from cor_co_cn_ma_chk where xxx_member_id = i.member_id and rownum<=1) mic_com_id, (select mm.MEMBER_TYPE from xxx.CO_ME_INFO mm where mm.MEMBER_ID = i.member_id) member_type, (select l.log_user from xxx.core_logon_info l where l.member_id=i.member_id and l.user_role='0' and rownum<=1) log_user, i.delivery_province, i.delivery_city, i.eff_ti, i.added_time added_time_inquiry, trunc(i.eff_ti) - trunc(sysdate) + 1 left_day, f_get_big_buyer_flag(i.member_id) big_flag, f_get_com_cert_st(i.member_id) cert_st, i.readFlag read_flag, i.gi_flag, '2' so_flag, i.quo_id quo_id, (select count(1) from quo_info q where q.relation_id = i.inquiry_id and q.relation_type = '1' and q.quotation_flag = '1' and q.member_id = xxxx) quo_num, nvl((select max(q.added_time) from quo_info q where q.relation_id = i.inquiry_id and q.relation_type = '1' and q.quotation_flag = '1'), i.s_added_time) added_time_order, i.selection from (select row_number() over(partition by s.ma_rel_id order by s.added_time asc) rn, s.gi_flag, s.added_time s_added_time, s.read_flag readFlag, s.match_id quo_id, m.com_id, m.com_name comName, '1' as selection, q.* from xxx.sys_mat_cn s, xxx.inquiry_info q, xxx.cor_com_ma m, xxx.cor_co_cn_ma_chk k where s.ma_rel_id = q.inquiry_id and s.mic_com_id = k.com_id and s.ma_meth ='2' and q.bu_flag = '1' and q.sensetive_flag = '0' and q.CON_ST in ('0', '3') and q.member_id = m.member_id and m.com_sta in ('1', '3') and k.xxx_member_id = xxxx) i where rn <= 1) d) where rn <= 1) d where d.quo_num=0 and d.gi_flag='0' and d.eff_ti>=trunc(sysdate) order by eff_ti DESC ) q ) p where rnum > 1 and rnum <= 10
复制
尝试执行了下该SQL,发现当查到有数据的时候,会非常慢,超过一分钟,当查询无数据的时候会很快,也就一秒左右。
备注:
在处理过程中,也遇到了监听报WARNING: Subscription for node down event still pending告警,如下所示:
在10g以后,涉及到Oracle监听器的消息默认会订阅到ONS(Oracle Notification Service),在非RAC环境,建议禁止订阅这些消息。
并根据MOS资料,在GRID用户 $ORACLE_HOME/network/admin下修改 listener.ora
添加 SUBSCRIBE_FOR_NODE_DOWN_EVENT_LISTENER=OFF
并对监听执行 lsnrctl reload。
根据一段时间观察,监听日志未再报 WARNING: Subscription for node down event still pending告警。
三、解决方法
该数据库服务器为64G,SGA和PGA分别设置了30G和10G。
对比了一个线上11G和此环境配置差不多的生产库,调整了下SGA和PGA以及sysctl.conf、limits.conf的参数,重启数据库和服务器依然如此。
并查询了SQL的缓存命中率。
于是将应用程序运行了一段时间,抓取问题时段的ASH报告,如下:
发现主要是direct path read等待事件。direct path read是 11G推出的一个新特性,
Direct path read的目的是让一些不常使用的大表数据(冷数据)在全表扫描时,每次都从磁盘读到用户的私有内存(PGA),而不要去挤占有限的、宝贵的、频繁使用的数据(热数据)所在的共享内存(SGA-buffer cache)。
热数据只在第一次访问时从磁盘读,读到SGA的buffer cache后,再次访问会直接从内存读,效率高、对存储压力小。
从共享内存读到直接路径读,这个变化在不频繁的全表扫描时是起到积极作用的,但某些SQL由于业务设计问题,导致SQL执行计划不合适,就会对数据库产生致命影响。
于是就尝试将 direct path read这一特性进行强制关闭。
ALTER SYSTEM SET _serial_direct_read='NEVER' SCOPE=BOTH;
复制
并通过如下语句查询执行结果:
SELECT x.ksppinm NAME, y.ksppstvl VALUE, x.ksppdesc describ FROM SYS.x$ksppi x, SYS.x$ksppcv y WHERE x.indx = y.indx AND (x.ksppinm = '_small_table_threshold' OR x.ksppinm = '_serial_direct_read');
复制
我将之前执行的SQL重新又执行了一次,耗时从原来的一分多钟一下子降到了一秒多,于是又重新找了几个变量带入进去,查询时间基本在3秒以下。
通知研发人员复测下,反馈时间确实块了很多,业务测试通过了。
四、问题反馈
对于该环境,通过和研发的沟通,目前存在硬解析较高,因为研发未对SQL添加绑定变量,导致硬解析高。另外禁用了direct path read是否会对其它SQL产生影响,还在观察中。
评论




