相关背景:
https://issues.apache.org/jira/browse/HBASE-9393(Region Server fails to properly close socket resulting in many CLOSE_WAIT to Data Nodes)
https://issues.apache.org/jira/browse/HDFS-7694(FSDataInputStream should support "unbuffer")
复制
CLOSE_WAIT
keepalive机制
客户端或者服务端意外断电、死机、进程挂掉重启等;
中间网络出现问题,连接双方无法知道一直等待;
程序问题导致的长时间CLOSE_WAIT问题;
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
复制
tcp_keepalive_time,在TCP保活打开的情况下,最后一次数据交换到TCP发送第一个保活探测包的间隔,即允许的持续空闲时长,或者说每次正常发送心跳的周期,默认值为7200s(2h)。
tcp_keepalive_probes 在tcp_keepalive_time之后,没有接收到对方确认,继续发送保活探测包次数,默认值为9(次)
tcp_keepalive_intvl,在tcp_keepalive_time之后,没有接收到对方确认,继续发送保活探测包的发送频率,默认值为75s。
图一. 正常ack,保持连接
图二. 对方响应rst,释放连接
图三. 对方服务无响应,释放连接
源码解析:
/**
* Enable/disable {@link SocketOptions#SO_KEEPALIVE SO_KEEPALIVE}.
*
* @param on whether or not to have socket keep alive turned on.
* @exception SocketException if there is an error
* in the underlying protocol, such as a TCP error.
* @since 1.3
* @see #getKeepAlive()
*/
public void setKeepAlive(boolean on) throws SocketException {
if (isClosed())
throw new SocketException("Socket is closed");
getImpl().setOption(SocketOptions.SO_KEEPALIVE, Boolean.valueOf(on));
}
复制
SocketOptions相关代码:
/**
* When the keepalive option is set for a TCP socket and no data
* has been exchanged across the socket in either direction for
* 2 hours (NOTE: the actual value is implementation dependent),
* TCP automatically sends a keepalive probe to the peer. This probe is a
* TCP segment to which the peer must respond.
* One of three responses is expected:
* 1. The peer responds with the expected ACK. The application is not
* notified (since everything is OK). TCP will send another probe
* following another 2 hours of inactivity.
* 2. The peer responds with an RST, which tells the local TCP that
* the peer host has crashed and rebooted. The socket is closed.
* 3. There is no response from the peer. The socket is closed.
*
* The purpose of this option is to detect if the peer host crashes.
*
* Valid only for TCP socket: SocketImpl
*
* @see Socket#setKeepAlive
* @see Socket#getKeepAlive
*/
@Native public final static int SO_KEEPALIVE = 0x0008;
复制
int opt = 1;
setsockopt(sockfd, SOL_SOCKET, SO_KEEPALIVE, (void*)&opt, sizeof(opt));
复制
linux tcp keepalive机制相关源码实现可查看net/core/sock.c、net/ipv4/tcp_timer.c、net/ipv4/tcp_timer.c
重点总结
a. java层面开启keepalive需要通过socket实例调用setKeepAlive进行设置(建议在两端均设置),只能配置开关,其他参数依赖于sysctl在系统层面进行配置。
b. C语言开启keepalive需要在socket实例上调用setsockopt设置。
c. 调整keepalive内核参数后对现有已打开keepalive机制的socket链接直接生效,无需重启。
d. tcp keep-alive机制可以解决大量连接无法回收、占用资源的问题.