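Problem description
On the Oracle engineered system, the agents on compute node 1 and compute node 3 are running normally, but no operating system CPU, memory, I/O, or network data is being monitored.
Restarting the agent did not fix it.
Troubleshooting begins.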
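1 Confirm that the Agent certificate is valid
- The Agent uses SSL certificates for secure communication. To check whether the certificate has expired or is corrupted, start by reviewing the agent details with the following command: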
emctl status agent -details
[oracle@gsydbadm01 bin]$ ./emctl status agent -details
Oracle Enterprise Manager Cloud Control 13c Release 5
Copyright (c) 1996, 2021 Oracle Corporation. All rights reserved.
Agent Version : 13.5.0.0.0
OMS Version : 13.5.0.0.0
Protocol Version : 12.1.0.1.0
Agent Home : /u01/app/oracle/agent135/agent_inst
Agent Log Directory : /u01/app/oracle/agent135/agent_inst/sysman/log
Agent Binaries : /u01/app/oracle/agent135/agent_13.5.0.0.0
Core JAR Location : /u01/app/oracle/agent135/agent_13.5.0.0.0/jlib
Agent Process ID : 133557
Parent Process ID : 121999
Agent URL : https://gsydbadm01.local:3872/emd/main/
Local Agent URL in NAT : https://gsydbadm01.local:3872/emd/main/
Repository URL : https://em13c:4903/empbs/upload
Started at : 2025-02-06 17:01:21
Started by user : oracle
Operating System : Linux version 2.6.39-400.284.1.el6uek.x86_64 (amd64)
Number of Targets : 73
Last Reload : (none)    ------- no information
Last successful upload : (none)    ----- no information, meaning no upload has succeeded
Last attempted upload : 2025-02-08 08:42:56
Total Megabytes of XML files uploaded so far : 0
Number of XML files pending upload : 5,444
Size of XML files pending upload(MB) : 3.93
Available disk space on upload filesystem : 14.36%
Collection Status : [COLLECTIONS_HALTED( UPLOAD_SYSTEM Threshold (UploadMaxNumberXML: 5000) exceeded with 5110 files)]
Backoff Expiration : 2025-02-08 08:43:11
Heartbeat Status : Ok
Last attempted heartbeat to OMS : 2025-02-08 08:42:19
Last successful heartbeat to OMS : 2025-02-08 08:42:19
Next scheduled heartbeat to OMS : 2025-02-08 08:43:19
-----------------------------------------------------------------------------
Agent is Running and Ready
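The two fields that matter in this output are the pending-file count and the Collection Status. As a minimal sketch, the check can be scripted by piping the status output through small filters. The function names below are hypothetical helpers, not part of emctl, and they assume the output format shown above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of emctl). Both read the text of
# `emctl status agent -details` on stdin.

# Print the number of XML files pending upload, without thousands separators.
pending_upload_count() {
  awk -F':' '/Number of XML files pending upload/ {
    gsub(/[ ,]/, "", $2)   # strip spaces and commas: " 5,444" -> "5444"
    print $2
  }'
}

# Succeed (exit 0) when the agent reports that collections are halted.
collections_halted() {
  grep -q 'COLLECTIONS_HALTED'
}

# Usage, as the agent owner:
#   emctl status agent -details | pending_upload_count
#   emctl status agent -details | collections_halted && echo "collections halted"
```

A count above the UploadMaxNumberXML threshold shown in the status output (5000 here) together with a halted collection status matches the symptom in this case.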
The output above shows that uploads are failing. Why would they fail when the network is reachable?
2 Look for the problem in the logs
cd /u01/app/oracle/agent135/agent_inst/sysman/log
gcagent.log
emctl.log
tail -f emagent.nohup
EONSPROVIDER: oracle.eons.proxy.impl.ONSFactoryImpl
Feb 06, 2025 5:01:40 PM oracle.eons.proxy.impl.ConnectionManagerImpl readFormFactor
WARNING: unable to locate formfactor file - /u01/app/oracle/agent135/agent_13.5.0.0.0/eons/conf/.formfactor
Feb 06, 2025 5:01:42 PM oracle.sysman.diag.EMDiagImpl captureDiagData.478
SEVERE: Critical error: java.time.OffsetDateTime cannot be cast to java.sql.Timestamp
java.lang.ClassCastException: java.time.OffsetDateTime cannot be cast to java.sql.Timestamp
    at oracle.sysman.db.receivelet.aqmetricsdb.SvrGenAlertType.setValue(SvrGenAlertType.java:228)
    at oracle.sysman.db.receivelet.aqmetricsdb.SvrGenAlertType.<init>(SvrGenAlertType.java:131)
    at oracle.sysman.db.receivelet.aqmetricsdb.SvrGenAlrt$QueueListener.report(SvrGenAlrt.java:687)
    at oracle.sysman.db.receivelet.aqmetricsdb.SvrGenAlrt$QueueListener.run(SvrGenAlrt.java:1228)
    at oracle.sysman.gcagent.target.interaction.execution.ReceiveletInteractionMgr$3$1.run(ReceiveletInteractionMgr.java:1554)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at oracle.sysman.gcagent.util.system.GCAThread$RunnableWrapper.run(GCAThread.java:198)
    at java.lang.Thread.run(Thread.java:748)
Feb 06, 2025 5:01:42 PM oracle.sysman.diag.EMDiagImpl createIncident.648
INFO: incident 650 created with problem key java.lang.ClassCastException:oracle.sysman.db.receivelet.aqmetricsdb.SvrGenAlertType:228, in directory /u01/app/oracle/agent135/agent_inst/diag/ofm/emagent/emagent/incident/incdir_650
Feb 06, 2025 5:01:43 PM oracle.sysman.diag.EMDiagImpl captureDiagData.478
SEVERE: Critical error: java.time.OffsetDateTime cannot be cast to java.sql.Timestamp
java.lang.ClassCastException: java.time.OffsetDateTime cannot be cast to java.sql.Timestamp
    at oracle.sysman.db.receivelet.aqmetricsdb.SvrGenAlertType.setValue(SvrGenAlertType.java:228)
    at oracle.sysman.db.receivelet.aqmetricsdb.SvrGenAlertType.<init>(SvrGenAler

tail -f emagent.nohup
Feb 08, 2025 9:31:38 AM oracle.sysman.diag.EMDiagImpl createIncident.648
INFO: incident 654 created with problem key java.lang.ClassCastException:oracle.sysman.db.receivelet.aqmetricsdb.SvrGenAlertType:228, in directory /u01/app/oracle/agent135/agent_inst/diag/ofm/emagent/emagent/incident/incdir_654
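Rather than watching the tail by eye, the same log can be summarized. The sketch below is an assumption-laden convenience, not an Oracle tool: the function names are hypothetical, and it assumes each log record sits on its own line as in a real emagent.nohup file.

```shell
#!/bin/sh
# Hypothetical helpers. Both read emagent.nohup text on stdin.

# Count stack traces for the OffsetDateTime -> Timestamp cast failure.
# The exception line appears once per trace, so it is a reasonable proxy.
count_classcast_errors() {
  grep -c -F 'java.lang.ClassCastException: java.time.OffsetDateTime' || true
}

# List the distinct ADR incident directory names mentioned in the log.
incident_dirs() {
  sed -n 's/.*\(incdir_[0-9][0-9]*\).*/\1/p' | sort -u
}

# Usage:
#   count_classcast_errors < emagent.nohup
#   incident_dirs < emagent.nohup
```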
Go into the incident directory:
more readme.txt
Problem Key: java.lang.ClassCastException:oracle.sysman.db.receivelet.aqmetricsdb.SvrGenAlertType:228
ECID: 0000PJZBkDkBx0w0wFw0zk1bdfFr000003
Thread Id: 46
Error Message Id: OFM-99999

Context Values
--------------
threadName : AQMetricsDB

Stack Trace
-----------
java.lang.ClassCastException: java.time.OffsetDateTime cannot be cast to java.sql.Timestamp
    at oracle.sysman.db.receivelet.aqmetricsdb.SvrGenAlertType.setValue(SvrGenAlertType.java:228)
    at oracle.sysman.db.receivelet.aqmetricsdb.SvrGenAlertType.<init>(SvrGenAlertType.java:131)
    at oracle.sysman.db.receivelet.aqmetricsdb.SvrGenAlrt$QueueListener.report(SvrGenAlrt.java:687)
    at oracle.sysman.db.receivelet.aqmetricsdb.SvrGenAlrt$QueueListener.run(SvrGenAlrt.java:1228)
    at oracle.sysman.gcagent.target.interaction.execution.ReceiveletInteractionMgr$3$1.run(ReceiveletInteractionMgr.java:1554)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at oracle.sysman.gcagent.util.system.GCAThread$RunnableWrapper.run(GCAThread.java:198)
    at java.lang.Thread.run(Thread.java:748)

Supplemental Files
------------------
3 Search the official Oracle support site
Applies to:
Enterprise Manager Base Platform - Version 13.5.0.0.0 and later
Information in this document applies to any platform.

Symptoms
On : 13.5.0.0.0 version, OMS Upgrade
ACTUAL BEHAVIOR
---------------
EM 13.5 - We applied RU3 Patch but still see below issue
Receiving following incidents:
Observed following error in <AGENT_INST>/sysman/log/gcagent.log:
INFO - ADR Incident created: Id=4, message=[java.time.OffsetDateTime cannot be cast to java.sql.Timestamp], module=oracle.sysman.db.receivelet.aqmetricsdb.SvrGenAlertType, problemKey='java.lang.ClassCastException:oracle.sysman.db.receivelet.aqmetricsdb.SvrGenAlertType:228', direcotry=<AGENT_INST>/diag/ofm/emagent/emagent/incident/incdir_4

Cause
This issue is being addressed in the following bug:
BUG 33125216 - ClassCastException:oracle.sysman.db.receivelet.aqmetricsdb.SvrGenAlertType:228

Solution
Run the root.sh on the problematic agent server and it should resolve the issue.
4 Attempted fix
Run root.sh again.
It must be run as the root user:
cd /u01/app/oracle/agent135/agent_13.5.0.0.0
./root.sh
Then restart the agent.
Keep watching the emagent.nohup log. Judging by the output, the ClassCastException no longer appears:
--- EMState agent
----- 2025-02-08 09:48:38,319::119713::Mismatch detected between timezone in env (Asia/Shanghai) and in /u01/app/oracle/agent135/agent_inst/sysman/config/emd.properties (PRC). Forcing value to latter.. -----
----- 2025-02-08 09:48:38,764::119713::Auto tuning the agent at time 2025-02-08 09:48:38,764 -----
----- 2025-02-08 09:48:39,526::119713::Finished auto tuning the agent at time 2025-02-08 09:48:39,526 -----
----- 2025-02-08 09:48:39,529::119713::Launching the JVM with following options: -Xmx240M -XX:MaxMetaspaceSize=224M -server -Djava.security.egd=file:///dev/./urandom -Dsun.lang.ClassLoader.allowArraySyntax=true -XX:-UseLargePages -XX:+UseLinuxPosixThreadCPUClocks -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+UseCompressedOops -DHTTPClient.dontSeekTerminatingChunk=true -----
----- 2025-02-08 09:48:39,530::119713::Agent Launched with PID 123615 at time 2025-02-08 09:48:39,530 -----
----- 2025-02-08 09:48:39,530::123615::Time elapsed between Launch of Watchdog process and execing EMAgent is 2 secs -----
----- 2025-02-08 09:48:39,531::119713::Previous Thrash State(-1,-1) -----
2025-02-08 09:48:39,745 [1:main (@ 2025-02-08 09:48:39 CST)] WARN - Missing filename for log handler 'wsm'
2025-02-08 09:48:39,753 [1:main (@ 2025-02-08 09:48:39 CST)] WARN - Missing filename for log handler 'opss'
2025-02-08 09:48:39,754 [1:main (@ 2025-02-08 09:48:39 CST)] WARN - Missing filename for log handler 'opsscfg'
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
EONSPROVIDER: oracle.eons.proxy.impl.ONSFactoryImpl
Feb 08, 2025 9:48:53 AM oracle.eons.proxy.impl.ConnectionManagerImpl readFormFactor
WARNING: unable to locate formfactor file - /u01/app/oracle/agent135/agent_13.5.0.0.0/eons/conf/.formfactor
5 Problem not resolved
2025-02-08 10:00:59,871 [282:F37ECCBE:GC.SysExecutor.8 (AgentSystemMonitorTask)] WARN - Subsystem (Upload Manager) returned bad status of {+ Upload Manager: *Critical, but not mandatory component* +}
2025-02-08 10:01:00,224 [130:980F9148:GC.SysExecutor.2 (Ping OMS)] INFO - attempting another heartbeat
According to the log, the data still cannot be uploaded to the OMS.
Check the logs on the OMS server:
cd /u01/gc_inst/em/EMGC_OMS1/sysman/log
tail -200 emoms_pbs.log|more
2025-02-08 10:21:56,396 [GCLoader[response_severity] - https://gsydbadm03.local:3872/emd/main/] ERROR gcloader.DataLoader logp.251 - LOADER ERROR: Loader already procesing this request: tracking_key=26179.1738832488000 emd_url=https://gsydbadm03.local:3872/emd/main/ loadEntryGuid=78A31DA12406AE511D4587933BE24246 upload_type=response_severity stream_id=1
2025-02-08 10:21:56,396 [GCLoader[response_severity] - https://gsydbadm03.local:3872/emd/main/] ERROR gcloader.Receiver logp.251 - Upload failed: emdURL=https://gsydbadm03.local:3872/emd/main/ trackingKey=26179.1738832488000 type=response_severity e=ERROR-800|LOADER ERROR: Loader already procesing this request: tracking_key=26179.1738832488000 emd_url=https://gsydbadm03.local:3872/emd/main/ loadEntryGuid=78A31DA12406AE511D4587933BE24246 upload_type=response_severity stream_id=1
ERROR-800|LOADER ERROR: Loader already procesing this request: tracking_key=26179.1738832488000 emd_url=https://gsydbadm03.local:3872/emd/main/ loadEntryGuid=78A31DA12406AE511D4587933BE24246 upload_type=response_severity stream_id=1
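Since two compute nodes were affected, it helps to know which agent URLs the loader is rejecting. A minimal sketch follows, assuming each loader error occupies one log line and carries an emd_url= field as in the excerpt above. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper. Reads emoms_pbs.log (or .trc) text on stdin and prints
# the distinct agent URLs that hit the "Loader already proces(s)ing" error.
affected_agents() {
  # "proces" matches both spellings that appear in these logs.
  grep 'Loader already proces' |
    sed -n 's/.*emd_url=\([^ ]*\).*/\1/p' |
    sort -u
}

# Usage:
#   affected_agents < /u01/gc_inst/em/EMGC_OMS1/sysman/log/emoms_pbs.log
```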
6 Continue searching MOS
Applies to:
Enterprise Manager Base Platform - Version 13.2.0.0.0 and later
Information in this document applies to any platform.

Symptoms
On : 13.2.1.0.0 version, Agent
Agent Status shows running and ready. However the Section under Collection Status : [COLLECTIONS_HALTED( UPLOAD SYSTEM Threshold - unable to purge files in upload system)]

Oracle Enterprise Manager Cloud Control 13c Release 2
Copyright (c) 1996, 2016 Oracle Corporation. All rights reserved.
---------------------------------------------------------------
Agent Version : 13.2.0.0.0
OMS Version : 13.2.0.0.0
Protocol Version : 12.1.0.1.0
Agent Home : <AGENT BASE DIRECTORY>/agent_inst
Agent Log Directory : <AGENT BASE DIRECTORY>/agent_inst/sysman/log
Agent Binaries : <AGENT BASE DIRECTORY>/agent_13.2.0.0.0
Core JAR Location : <AGENT BASE DIRECTORY>/agent_13.2.0.0.0/jlib
Agent Process ID : 383892
Parent Process ID : 383541
Agent URL : https://<AGENT HOSTNAME>.<DOMAINNAME>:3872/emd/main/
Local Agent URL in NAT : https://<AGENT HOSTNAME>.<DOMAINNAME>:3872/emd/main/
Repository URL : https://<OMS HOSTNAME>.<DOMAINNAME>:4900/empbs/upload
Started at : 2018-10-04 09:25:05
Started by user : oracle
Operating System : Linux version 4.1.12-94.8.4.el6uek.x86_64 (amd64)
Number of Targets : 37
Last Reload : (none)
Last successful upload : 2018-10-08 08:00:13
Last attempted upload : 2018-10-09 08:25:54
Total Megabytes of XML files uploaded so far : 0.06
Number of XML files pending upload : 4,905
Size of XML files pending upload(MB) : 4.8
Available disk space on upload filesystem : 38.94%
Collection Status : [COLLECTIONS_HALTED( UPLOAD SYSTEM Threshold - unable to purge files in upload system)]
Backoff Expiration : 2018-10-09 08:26:17
Heartbeat Status : Ok
Last attempted heartbeat to OMS : 2018-10-09 08:25:12
Last successful heartbeat to OMS : 2018-10-09 08:25:12
Next scheduled heartbeat to OMS : 2018-10-09 08:26:12
---------------------------------------------------------------
Agent is Running and Ready

The file .../gc_inst/em/EMGC_OMS1/sysman/log/emoms_pbs.trc shows the following error regarding the Loader System:
2018-10-09 12:52:40,620 [GCLoader[severity] -https://<HOSTNAME>.<DOMAINNAME>:3872/emd/main/] ERROR gcloader.DataLoader logp.251 - LOADER ERROR: Loader already procesing this request: tracking_key=14693.1538138172000 emd_url=https://<HOSTNAME>.<DOMAINNAME>:3872/emd/main/ loadEntryGuid=6E19BEA94FCEC5091969986F77065F7F upload_type=severity stream_id=2
2018-10-09 12:52:40,621 [GCLoader[severity] - https://<HOSTNAME>.<DOMAINNAME>:3872/emd/main/] ERROR gcloader.Receiver logp.251 - Upload failed: emdURL=https://<HOSTNAME>.<DOMAINNAME>:3872/emd/main/ trackingKey=14693.1538138172000 type=severity e=ERROR-800|LOADER ERROR: Loader already processing this request: tracking_key=14693.1538138172000 emd_url=https://<HOSTNAME>.<DOMAINNAME>:3872/emd/main/ loadEntryGuid=6E19BEA94FCEC5091969986F77065F7F upload_type=severity stream_id=2
ERROR-800|LOADER ERROR: Loader already processing this request: tracking_key=14693.1538138172000 emd_url=https://<HOSTNAME>.<DOMAINNAME>:3872/emd/main/ loadEntryGuid=6E19BEA94FCEC5091969986F77065F7F upload_type=severity stream_id=2
    at oracle.sysman.core.pbs.gcloader.DataLoader.startUpload(DataLoader.java:2257)
    at oracle.sysman.core.pbs.gcloader.RequestMapper.processAll(RequestMapper.java:160)
    at oracle.sysman.core.pbs.gcloader.Receiver.processFile(Receiver.java:2835)
    at oracle.sysman.core.pbs.gcloader.Receiver.doPost(Receiver.java:2329)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:751)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:844)

Cause
This issue happens when a row found in the MGMT_LOAD_ENTRIES table for the emd_url + upload_type + stream_id combination is currently locked and being processed by another loader thread.
If the condition is found, an error is logged in emoms_pbs.trc:
ERROR-800|LOADER ERROR: Loader already processing this request
The most probable reason is that the previous process that uploaded the data got stuck without releasing the lock. This can be a locking problem in the repository database, as documented in the two bugs below:
Bug 26522375 Agent Upload timed out before completion
Bug 23509601 EM Agent upload failing due to backoff event

Solution
1. Stop the Oracle Management Server (OMS).
cd <OMS_HOME>/bin
./emctl stop oms -all -force
2. After the OMS is stopped, verify that no processes are left running or hanging:
ps -ef | grep EMGC_ADMINSERVER
ps -ef | grep EMGC_OMS1
ps -ef | grep java
ps -ef | grep opmn
3. Kill the leftover OMS java processes.
$kill -9 ( from above results)
4. Stop/Start the repository database.
- sqlplus as <SYS USER>/<SYS PASSWORD> as sysdba
sql> shutdown
- Stop the listener
lsnrctl stop
- Restart the listener
lsnrctl start
- Start the repository database
sqlplus as <SYS USER>/<SYS PASSWORD> as sysdba
sql> startup
5. Bounce the job subsystem.
- Log in to the repository database as SYS and verify the value of the parameter job_queue_processes
SQL> show parameter job_queue_processes
->> remember this value or write it down
SQL> alter system set job_queue_processes=0 scope=BOTH;
- Connect to the repository database as the <USER SYSMAN> user and run the following
SQL> connect SYSMAN/<SYSMAN PASSWORD>
SQL>exec emd_maintenance.remove_em_dbms_jobs;
SQL> commit;
- Reconnect to the repository database as the user with SYSDBA permission (<SYS USER>) and reset the value of job_queue_processes to its original value that you wrote down in the previous step.
SQL>Connect as SYS again
SQL>alter system set job_queue_processes= scope=BOTH;
For example:
SQL>alter system set job_queue_processes=1000 scope=BOTH;
- Connect to the repository database as the <SYSMAN USER> and re-submit the DBMS_SCHEDULER jobs.
SQL>exec emd_maintenance.submit_em_dbms_jobs;
SQL>commit;
6. Start the OMS and re-check the repository jobs on both the nodes.
$<OMS_HOME>/bin>./emctl start oms
$<OMS_HOME>/bin>./emctl status oms -details
Wait for the OMS to start.
7. For the affected agents:
emctl stop agent
emctl clearstate agent
emctl start agent
emctl upload agent
The agent may have many files to upload, and it may take several attempts to upload all the files.
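The process check in the Solution above is done by reading ps output by eye. As a rough sketch, the same filter can be expressed as one function that prints only the PIDs of leftover OMS processes. The function name is hypothetical, the PID is assumed to be in column 2 as in Linux `ps -ef`, and the generic `java` pattern is deliberately left out here because it would also match agent JVMs on the same host.

```shell
#!/bin/sh
# Hypothetical helper. Reads `ps -ef` output on stdin and prints the PID
# (column 2) of every process whose command line mentions an OMS component.
# Lines belonging to grep or awk themselves are skipped.
leftover_oms_procs() {
  awk '/EMGC_ADMINSERVER|EMGC_OMS1|opmn/ && !/awk|grep/ { print $2 }'
}

# Usage, after `emctl stop oms -all -force`:
#   ps -ef | leftover_oms_procs
# An empty result means nothing is left to kill.
```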
7 Attempted fix
Work through the seven steps in the Solution section of the MOS note above: stop the OMS, verify that no processes are left over and kill any that are, restart the repository database and listener, bounce the job subsystem, start the OMS, and finally run stop, clearstate, start, and upload on each affected agent.
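The agent-side part of the procedure is a fixed sequence of four emctl commands, so it is easy to wrap. The sketch below is only an illustration: the function names and the EMCTL and DRY_RUN variables are assumptions of this example, and by default it only prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical wrapper around the four agent commands from the MOS note.
# EMCTL and DRY_RUN are assumptions of this sketch, not Oracle settings.
EMCTL="${EMCTL:-emctl}"     # path to the agent's emctl binary
DRY_RUN="${DRY_RUN:-1}"     # 1 = print commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Stop the agent, clear its state, start it, and trigger an upload.
# The chain stops at the first command that fails.
reset_agent_upload() {
  run "$EMCTL" stop agent &&
  run "$EMCTL" clearstate agent &&
  run "$EMCTL" start agent &&
  run "$EMCTL" upload agent
}

# Usage, as the agent owner on each affected node:
#   DRY_RUN=0 EMCTL=/path/to/agent_inst/bin/emctl sh this_script.sh
#   (with a call to reset_agent_upload added at the end of the script)
```

As the note says, a large backlog may need `emctl upload agent` to be repeated several times before the pending count reaches zero.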
Problem solved
Pay particular attention to the following two bugs:
Bug 26522375 Agent Upload timed out before completion
Bug 23509601 EM Agent upload failing due to backoff event
This article is reposted from 二两烧麦.