HikariPool-1 - Connection is not available

This afternoon, I got a report that website A was not accessible. I logged into the server and first checked the Nginx error log. It showed that connections to the upstream were timing out.

...
2019/04/20 09:58:42 [error] 6327#0: *129762 upstream timed out (110: Connection timed out) while reading response header from upstream, client: <xxx.xxx.xx.xx>, server: <xxx.xxxxxxxx.xxx>, request: "GET /api/user/me HTTP/1.1", upstream: "http://127.0.0.1:5000/api/user/me", host: "<xxx.xxxxxxxx.xxx>", referrer: "https://<xxx.xxxxxxxx.xxx>/page/home"
2019/04/20 09:58:42 [error] 6327#0: *129763 upstream timed out (110: Connection timed out) while reading response header from upstream, client: <xxx.xxx.xx.xx>, server: <xxx.xxxxxxxx.xxx>, request: "GET /api/home/loadHomeData HTTP/1.1", upstream: "http://127.0.0.1:5000/api/home/loadHomeData", host: "<xxx.xxxxxxxx.xxx>", referrer: "https://<xxx.xxxxxxxx.xxx>/page/home"
...

Then I checked the log from Spring:

org.springframework.transaction.CannotCreateTransactionException: Could not open JPA EntityManager for transaction; nested exception is org.hibernate.exception.JDBCConnectionException: Unable to acquire JDBC Connection
...
Caused by: org.hibernate.exception.JDBCConnectionException: Unable to acquire JDBC Connection
... 61 common frames omitted
Caused by: java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms.
...

This clearly indicates that the connection pool has run out of free connections: connections are not being returned to the pool as expected, so incoming requests attempting to acquire a connection wait until they time out, which in turn causes Nginx's reads from the upstream to time out.
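
I did not yet know which code path was leaking, but for illustration, a leak typically looks like the hypothetical snippet below (the class and query are made up): a connection borrowed directly from the DataSource is never closed on the exception path, so it is never handed back to HikariCP.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.sql.DataSource;

public class ReportService {

    private final DataSource dataSource;

    public ReportService(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Leaky: if executeQuery() throws, conn.close() is never reached and
    // the connection stays checked out of the pool forever.
    public int countUsersLeaky() throws Exception {
        Connection conn = dataSource.getConnection();
        Statement st = conn.createStatement();
        ResultSet rs = st.executeQuery("SELECT count(*) FROM users");
        rs.next();
        int count = rs.getInt(1);
        conn.close();
        return count;
    }

    // Safe: try-with-resources always returns the connection to the pool.
    public int countUsersSafe() throws Exception {
        try (Connection conn = dataSource.getConnection();
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT count(*) FROM users")) {
            rs.next();
            return rs.getInt(1);
        }
    }
}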

Out of curiosity, I checked the network connection status with

netstat -peanut

As shown below, the majority of them are in CLOSE_WAIT. If you still remember the TCP state transitions, one end of a TCP connection enters CLOSE_WAIT when it has received a FIN from its peer but has not yet closed its own side. Since the local address is 127.0.0.1:5000 and the foreign addresses are ephemeral ports, we can infer that these connections were originally initiated from 127.0.0.1:xxxxx to 127.0.0.1:5000. This makes sense, as 127.0.0.1:5000 is the upstream server to which Nginx proxies incoming requests: Nginx timed out and closed its side, while the Java process, still stuck waiting for a database connection, never closed its end.

Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode       PID/Program name
tcp 1 0 127.0.0.1:5000 127.0.0.1:13160 CLOSE_WAIT 1000 66628224 4209/java
tcp 1 0 127.0.0.1:5000 127.0.0.1:13152 CLOSE_WAIT 1000 66628218 4209/java
tcp 1 0 127.0.0.1:5000 127.0.0.1:57766 CLOSE_WAIT 1000 62175223 4209/java
tcp 1 0 127.0.0.1:5000 127.0.0.1:21812 CLOSE_WAIT 1000 71135785 4209/java
tcp 1 0 127.0.0.1:5000 127.0.0.1:13150 CLOSE_WAIT 1000 66628836 4209/java
...
tcp 1 0 127.0.0.1:5000 127.0.0.1:13794 CLOSE_WAIT 1000 66644609 4209/java
tcp 1 0 127.0.0.1:5000 127.0.0.1:13154 CLOSE_WAIT 1000 66628222 4209/java
tcp 1 0 127.0.0.1:5000 127.0.0.1:13792 CLOSE_WAIT 1000 66637431 4209/java
tcp 1 0 127.0.0.1:5000 127.0.0.1:30190 CLOSE_WAIT 1000 64263646 4209/java
tcp 1 0 127.0.0.1:5000 127.0.0.1:13228 CLOSE_WAIT 1000 66629637 4209/java
tcp 1 0 127.0.0.1:5000 127.0.0.1:35004 CLOSE_WAIT 1000 72919978 4209/java
tcp 1 0 127.0.0.1:5000 127.0.0.1:13210 CLOSE_WAIT 1000 66629002 4209/java
tcp 1 0 127.0.0.1:5000 127.0.0.1:13198 CLOSE_WAIT 1000 66628344 4209/java

So, the symptom is clear. But what is the cause? Unfortunately, I had not previously enabled HikariCP's connection leak detection, and I had not enabled PostgreSQL's slow query log either.

HikariCP has been the default connection pool in Spring Boot since 2.0. I checked its configuration via the Spring Boot Actuator endpoint.

For your interest, the actuator was configured as follows:

import org.springframework.boot.actuate.autoconfigure.security.servlet.EndpointRequest;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.config.annotation.web.configuration.WebSecurityConfigurerAdapter;

@Configuration
public class ActuatorSecurityConfig extends WebSecurityConfigurerAdapter {

    @Override
    protected void configure(HttpSecurity http) throws Exception {
        // Protect all actuator endpoints with HTTP basic auth, SUPER_ADMIN only
        http.requestMatcher(EndpointRequest.toAnyEndpoint()).authorizeRequests()
                .anyRequest().hasRole("SUPER_ADMIN")
                .and()
                .httpBasic();
    }
}

Since the upstream server binds only to 127.0.0.1, I could only invoke that endpoint from within the server:

curl -u <account>:<password>  http://127.0.0.1:5000/actuator/configprops

Yes, sad but true: leakDetectionThreshold was at its default value of 0 (disabled).
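
As an aside (not something I had in place at the time), HikariCP also exposes live pool counters programmatically, so a small diagnostic helper like the sketch below, assuming you can get hold of the HikariDataSource bean, tells you at a glance whether the pool is exhausted:

import com.zaxxer.hikari.HikariDataSource;
import com.zaxxer.hikari.HikariPoolMXBean;

public class PoolStats {

    // Returns a one-line summary such as "active=10 idle=0 total=10 waiting=25"
    public static String describe(HikariDataSource dataSource) {
        HikariPoolMXBean pool = dataSource.getHikariPoolMXBean();
        return String.format("active=%d idle=%d total=%d waiting=%d",
                pool.getActiveConnections(),
                pool.getIdleConnections(),
                pool.getTotalConnections(),
                pool.getThreadsAwaitingConnection());
    }
}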

To enable leak detection, I added the following config to my application.yml:

spring:
  datasource:
    hikari:
      leak-detection-threshold: 30000 # 30 seconds
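
If you build the HikariCP pool yourself rather than relying on Spring Boot's auto-configuration, the same threshold can be set directly on HikariConfig. A minimal sketch, with made-up connection details:

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class DataSourceFactory {

    public static HikariDataSource create() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://localhost:5432/mydb"); // hypothetical database
        config.setUsername("myuser");                               // hypothetical credentials
        config.setPassword("secret");
        // Warn (with a stack trace) when a connection stays checked out
        // of the pool for longer than 30 seconds.
        config.setLeakDetectionThreshold(30_000);
        return new HikariDataSource(config);
    }
}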

By the way, I also changed com.zaxxer.hikari's log level to DEBUG, following the advice from the author of HikariCP here.

<?xml version="1.0" encoding="UTF-8"?>
<!-- logback-spring.xml in the classpath root -->
<configuration>
    <include resource="org/springframework/boot/logging/logback/base.xml"/>
    <logger name="com.zaxxer.hikari" level="DEBUG"/>
</configuration>

That's what I have done to instrument my Spring Boot server. Hopefully it will capture some useful information when the same problem happens again.

That's not enough, though. PostgreSQL, as the database, should be able to tell us something useful too, so I enabled the slow query log there by editing postgresql.conf:

# send server log output to stderr; the collector below writes it to ".log" files
log_destination = 'stderr'
# must be on for the log files to be generated
logging_collector = on
# the directory must exist and the postgres server must have read & write permission on it
log_directory = '/var/log/postgresql'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
# enable log file rotation
log_truncate_on_rotation = on
# rotate the log file on a daily basis
log_rotation_age = 1d
# if a query takes more than 2 seconds, log it
log_min_duration_statement = 2000
# statements already logged via log_statement are not repeated by log_min_duration_statement
log_statement = 'none'
log_timezone = 'Asia/Shanghai'

You can locate the postgresql.conf file with this query:

show config_file;

Some settings require a restart after being modified. For those that don't, simply executing the SQL below is enough to make the changes take effect:

SELECT pg_reload_conf();

To test it, run:

SELECT pg_sleep(7.5);

A corresponding entry should then appear in the configured log file.

You can also read this article, which explains why simply increasing maximumPoolSize as a countermeasure for this problem is not a good idea.

References:

How to identify slow queries in PostgreSQL

How to show queries log in PostgreSQL?

HikariCP

Production Ready Endpoints

logging