客户在使用postgresql时,访问应用出现断断续续不稳定现象,客户环境为tomcat 8 + postgresql13.3。
应用报错
通过查看应用端tomcat的错误日志发现有如下错误:
org.postgresql.util.PSQLException:
FATAL: sorry, too many clients already
at
org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:613)
at
org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:161)
at
org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:213)
at
org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:51)
at
org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:223)
at
org.postgresql.Driver.makeConnection(Driver.java:465)
at org.postgresql.Driver.connect(Driver.java:264)
at
java.sql.DriverManager.getConnection(DriverManager.java:582)
at
java.sql.DriverManager.getConnection(DriverManager.java:185)
说明db那边已经不能分配jdbc连接了
分析过程
登录postgresql检查运行日志,同样发现错误提示:
17:45:37.065 CST
[8570] FATAL: sorry, too many clients
already
出现上面错误是已经达到了pg设置的最大连接数了
检查pg的配置:
postgres=# show
config_file;
config_file
------------------------------
/home/pgdata/postgresql.conf
postgres=# show
max_connections;
max_connections
-----------------
500
检查当前的连接消耗
postgres=# select count(*) from
pg_stat_activity where application_name='PostgreSQL JDBC Driver' ;
count
-------
495
(1 row)
最大500个连接,已经基本全部被耗光
开启日志
当前为默认值-1,不记录慢sql
postgres=# show
log_min_duration_statement;
log_min_duration_statement
----------------------------
-1
(1 row)
调整为5秒
vi
postgresql.conf
调整为: log_min_duration_statement
= 5000
在线生效:
postgres=#
SELECT pg_reload_conf();
pg_reload_conf
----------------
t
(1 row)
postgres=# show
log_directory;
log_directory
---------------
log
(1 row)
postgres=# show
log_filename ;
log_filename
-------------------
postgresql-%a.log
(1 row)
发现pg 日志中有大量的如下日志:
22:21:07.201 CST
[21146] LOG: duration: 147753.171
ms execute <unnamed>: select a.id
from minervdb_bench a,minervdb_bench1 b where a.id = b.id……
22:23:17.275 CST [21535] LOG: duration: 16739.452 ms execute <unnamed>: select a.id from
minervdb_bench a,minervdb_bench1 b where a.id = b.id……
22:24:20.080 CST [21598] LOG: duration: 16695.955 ms execute <unnamed>: select a.id from
minervdb_bench a,minervdb_bench1 b where a.id = b.id……
22:24:38.846 CST [21618] LOG: duration: 16660.069 ms execute <unnamed>: select a.id from
minervdb_bench a,minervdb_bench1 b where a.id = b.id……
……
大量10多秒的慢查询耗光了连接资源
至此大概故障的原因就明显了,由于sql语句的执行效率明显低于正常的预期,当访问量达到一定程度后,由于大量的连接没有释放,当连接积累到一定程度后达到了最大值,新的访问请求刚好在达到最大连接时出现访问异常,当访问量降低后,只要没有达到连接的最大值,业务处理还是能够正常返回结果,只不过需要等待较长的加载时间;优化慢sql后应用恢复正常。
评论
