HikariPool-1 - Connection is not available

This afternoon, I got a report that website A was not accessible. I logged into the server and first checked the Nginx error log. It showed that connections to the upstream were timing out.

...
2019/04/20 09:58:42 [error] 6327#0: *129762 upstream timed out (110: Connection timed out) while reading response header from upstream, client: <xxx.xxx.xx.xx>, server: <xxx.xxxxxxxx.xxx>, request: "GET /api/user/me HTTP/1.1", upstream: "http://127.0.0.1:5000/api/user/me", host: "<xxx.xxxxxxxx.xxx>", referrer: "https://<xxx.xxxxxxxx.xxx>/page/home"
2019/04/20 09:58:42 [error] 6327#0: *129763 upstream timed out (110: Connection timed out) while reading response header from upstream, client: <xxx.xxx.xx.xx>, server: <xxx.xxxxxxxx.xxx>, request: "GET /api/home/loadHomeData HTTP/1.1", upstream: "http://127.0.0.1:5000/api/home/loadHomeData", host: "<xxx.xxxxxxxx.xxx>", referrer: "https://<xxx.xxxxxxxx.xxx>/page/home"
...

Then I checked the log from Spring:

org.springframework.transaction.CannotCreateTransactionException: Could not open JPA EntityManager for transaction; nested exception is org.hibernate.exception.JDBCConnectionException: Unable to acquire JDBC Connection
...
Caused by: org.hibernate.exception.JDBCConnectionException: Unable to acquire JDBC Connection
... 61 common frames omitted
Caused by: java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms.
...

This clearly indicates that the connection pool has run out of free connections: connections are not being returned to the pool as expected, so incoming requests attempting to acquire a connection wait until they time out, which in turn causes Nginx's reads from the upstream to time out.
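
I did not yet know which code path was leaking, but for illustration, a leak typically looks like the hypothetical snippet below (the class and query are made up): a connection borrowed directly from the DataSource is never closed on the exception path, so it is never handed back to HikariCP.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.sql.DataSource;

public class ReportService {

    private final DataSource dataSource;

    public ReportService(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Leaky: if executeQuery() throws, conn.close() is never reached and
    // the connection stays checked out of the pool forever.
    public int countUsersLeaky() throws Exception {
        Connection conn = dataSource.getConnection();
        Statement st = conn.createStatement();
        ResultSet rs = st.executeQuery("SELECT count(*) FROM users");
        rs.next();
        int count = rs.getInt(1);
        conn.close();
        return count;
    }

    // Safe: try-with-resources always returns the connection to the pool.
    public int countUsersSafe() throws Exception {
        try (Connection conn = dataSource.getConnection();
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT count(*) FROM users")) {
            rs.next();
            return rs.getInt(1);
        }
    }
}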

Out of curiosity, I checked the network connection status with

netstat -peanut

As shown below, the majority of them are in CLOSE_WAIT. If you still remember the TCP state transitions, one end of a TCP connection enters CLOSE_WAIT when it has received a FIN from its peer but has not yet closed its own side. Since the local address is 127.0.0.1:5000 and the foreign addresses are ephemeral ports, we can infer that these connections were originally initiated from 127.0.0.1:xxxxx to 127.0.0.1:5000. This makes sense, as 127.0.0.1:5000 is the upstream server to which Nginx proxies incoming requests: Nginx timed out and closed its side, while the Java process, still stuck waiting for a database connection, never closed its end.

Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode       PID/Program name
tcp 1 0 127.0.0.1:5000 127.0.0.1:13160 CLOSE_WAIT 1000 66628224 4209/java
tcp 1 0 127.0.0.1:5000 127.0.0.1:13152 CLOSE_WAIT 1000 66628218 4209/java
tcp 1 0 127.0.0.1:5000 127.0.0.1:57766 CLOSE_WAIT 1000 62175223 4209/java
tcp 1 0 127.0.0.1:5000 127.0.0.1:21812 CLOSE_WAIT 1000 71135785 4209/java
tcp 1 0 127.0.0.1:5000 127.0.0.1:13150 CLOSE_WAIT 1000 66628836 4209/java
...
tcp 1 0 127.0.0.1:5000 127.0.0.1:13794 CLOSE_WAIT 1000 66644609 4209/java
tcp 1 0 127.0.0.1:5000 127.0.0.1:13154 CLOSE_WAIT 1000 66628222 4209/java
tcp 1 0 127.0.0.1:5000 127.0.0.1:13792 CLOSE_WAIT 1000 66637431 4209/java
tcp 1 0 127.0.0.1:5000 127.0.0.1:30190 CLOSE_WAIT 1000 64263646 4209/java
tcp 1 0 127.0.0.1:5000 127.0.0.1:13228 CLOSE_WAIT 1000 66629637 4209/java
tcp 1 0 127.0.0.1:5000 127.0.0.1:35004 CLOSE_WAIT 1000 72919978 4209/java
tcp 1 0 127.0.0.1:5000 127.0.0.1:13210 CLOSE_WAIT 1000 66629002 4209/java
tcp 1 0 127.0.0.1:5000 127.0.0.1:13198 CLOSE_WAIT 1000 66628344 4209/java

So, the symptom is clear. But what is the cause? Unfortunately, I had not previously enabled HikariCP's connection leak detection, and I had not enabled PostgreSQL's slow query log either.

HikariCP has been the default connection pool in Spring Boot since 2.0. I checked its configuration via the Spring Boot Actuator endpoint.

For your interest, the actuator was configured as follows:

import org.springframework.boot.actuate.autoconfigure.security.servlet.EndpointRequest;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.config.annotation.web.configuration.WebSecurityConfigurerAdapter;

@Configuration
public class ActuatorSecurityConfig extends WebSecurityConfigurerAdapter {

    @Override
    protected void configure(HttpSecurity http) throws Exception {
        // Protect all actuator endpoints with HTTP basic auth, SUPER_ADMIN only
        http.requestMatcher(EndpointRequest.toAnyEndpoint()).authorizeRequests()
                .anyRequest().hasRole("SUPER_ADMIN")
                .and()
                .httpBasic();
    }
}

Since the upstream server binds only to 127.0.0.1, I could only invoke that endpoint from within the server:

curl -u <account>:<password>  http://127.0.0.1:5000/actuator/configprops

Yes, sad but true: leakDetectionThreshold was at its default value of 0 (disabled).
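
As an aside (not something I had in place at the time), HikariCP also exposes live pool counters programmatically, so a small diagnostic helper like the sketch below, assuming you can get hold of the HikariDataSource bean, tells you at a glance whether the pool is exhausted:

import com.zaxxer.hikari.HikariDataSource;
import com.zaxxer.hikari.HikariPoolMXBean;

public class PoolStats {

    // Returns a one-line summary such as "active=10 idle=0 total=10 waiting=25"
    public static String describe(HikariDataSource dataSource) {
        HikariPoolMXBean pool = dataSource.getHikariPoolMXBean();
        return String.format("active=%d idle=%d total=%d waiting=%d",
                pool.getActiveConnections(),
                pool.getIdleConnections(),
                pool.getTotalConnections(),
                pool.getThreadsAwaitingConnection());
    }
}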

To enable leak detection, I added the following config to my application.yml:

spring:
  datasource:
    hikari:
      leak-detection-threshold: 30000 # 30 seconds
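
If you build the HikariCP pool yourself rather than relying on Spring Boot's auto-configuration, the same threshold can be set directly on HikariConfig. A minimal sketch, with made-up connection details:

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class DataSourceFactory {

    public static HikariDataSource create() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://localhost:5432/mydb"); // hypothetical database
        config.setUsername("myuser");                               // hypothetical credentials
        config.setPassword("secret");
        // Warn (with a stack trace) when a connection stays checked out
        // of the pool for longer than 30 seconds.
        config.setLeakDetectionThreshold(30_000);
        return new HikariDataSource(config);
    }
}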

By the way, I also changed com.zaxxer.hikari's log level to DEBUG, following the advice from the author of HikariCP here.

<?xml version="1.0" encoding="UTF-8"?>
<!-- logback-spring.xml in the classpath root -->
<configuration>
    <include resource="org/springframework/boot/logging/logback/base.xml"/>
    <logger name="com.zaxxer.hikari" level="DEBUG"/>
</configuration>

That's what I have done to instrument my Spring Boot server. Hopefully it will capture some useful information when the same problem happens again.

That's not enough, though. PostgreSQL, as the database, should be able to tell us something useful too, so I enabled the slow query log there by editing postgresql.conf:

# send server log output to stderr; the collector below writes it to ".log" files
log_destination = 'stderr'
# must be on for the log files to be generated
logging_collector = on
# the directory must exist and the postgres server must have read & write permission on it
log_directory = '/var/log/postgresql'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
# enable log file rotation
log_truncate_on_rotation = on
# rotate the log file on a daily basis
log_rotation_age = 1d
# if a query takes more than 2 seconds, log it
log_min_duration_statement = 2000
# statements already logged via log_statement are not repeated by log_min_duration_statement
log_statement = 'none'
log_timezone = 'Asia/Shanghai'

You can locate the postgresql.conf file with this query:

show config_file;

Some settings require a restart after being modified. For those that don't, simply executing the SQL below is enough to make the changes take effect:

SELECT pg_reload_conf();

To test it, run:

SELECT pg_sleep(7.5);

A corresponding entry should then appear in the configured log file.

You can also read this article, which explains why simply increasing maximumPoolSize as a countermeasure for this problem is not a good idea.

References:

How to identify slow queries in PostgreSQL

How to show queries log in PostgreSQL?

HikariCP

Production Ready Endpoints

logging