log-drive一次 java.io.IOException Connection reset by peer 故障排查

文章目录
  1. 1. tcpdump
    1. 1.1. tcpdump 下的 tcp/ip 协议解析
    2. 1.2. 抓包数据
    3. 1.3. 握手逻辑
    4. 1.4. 数据分析
    5. 1.5. 为什么会出现 RST

转载地址: https://mengkang.net/1118.html

我的弹幕服务,代码换了个环境运行,虽然服务可用,但是发现出现了如下情况,基本1秒输出来一次:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:853)
at io.netty.buffer.WrappedByteBuf.writeBytes(WrappedByteBuf.java:641)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:240)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:115)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:514)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:471)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:385)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:351)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at io.netty.util.internal.chmv8.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1412)
at io.netty.util.internal.chmv8.ForkJoinTask.doExec(ForkJoinTask.java:280)
at io.netty.util.internal.chmv8.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:877)
at io.netty.util.internal.chmv8.ForkJoinPool.scan(ForkJoinPool.java:1706)
at io.netty.util.internal.chmv8.ForkJoinPool.runWorker(ForkJoinPool.java:1661)
at io.netty.util.internal.chmv8.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:126)

https://github.com/netty/netty/issues/5936
Connection reset by peer is typically a result of your peer sending a TCP RST to you.

tcpdump

既然是网络故障,tcpdump 是否可用呢?

1
2
$sudo netstat -lnp|grep 8657
tcp 0 0 0.0.0.0:443 0.0.0.0:* LISTEN 8657/java

没有列出全部的 socket 连接,从当前进程监听的端口入手试试。ifconfig可以看到这台机器有两个网卡(eth0 内网 ip,eth1 公网 ip)。

tcpdump 下的 tcp/ip 协议解析

先看下 tcpdump 输出tcp/ip 数据的大致格式

参考 http://www.tcpdump.org/tcpdump_man.html 中关于 tcp 协议的描述,我整理了下格式

The general format of a TCP protocol line is

1
src > dst: Flags [tcpflags], seq data-seqno, ack ackno, win window, urg urgent, options [opts], length len
  1. src and dst are the source and destination IP addresses and ports.
  2. tcpflags are some combination of S (SYN), F (FIN), P (PUSH), R (RST), U (URG), W (ECN CWR), E (ECN-Echo) or . (ACK), or none if no flags are set.
  3. data-seqno describes the portion of sequence space covered by the data in this packet.
  4. ackno is sequence number of the next data expected the other direction on this connection.
  5. window is the number of bytes of receive buffer space available the other direction on this connection.
  6. urg indicates there is urgent data in the packet.
  7. opts are TCP options (e.g., mss 1024).
  8. len is the length of payload data.

抓包数据

通过tcpdump查看下443 端口的网络情况。与上面的格式略有不同。因为我有两个网卡,请求是从 eth0 进,从 eth1 出,我尝试使用tcpdump -i any 发现第三次握手输出的 askno 数据不对。(为什么会这样呢,还不知道,谁知道请告知下

1
2
3
4
5
6
$sudo tcpdump -i any -nn port 443

16:18:41.553460 IP xxx.xxx.238.110.5745 > xxx.xxx.198.40.443: S 806033:806033(0) win 14600 <mss 1460,sackOK,timestamp 2230744217 0,nop,wscale 8>
16:18:41.553483 IP xxx.xxx.198.40.443 > xxx.xxx.238.110.5745: S 1720728675:1720728675(0) ack 806034 win 14480 <mss 1460,sackOK,timestamp 2425389802 2230744217,nop,wscale 7>
16:18:41.553647 IP xxx.xxx.238.110.5745 > xxx.xxx.198.40.443: . ack 1 win 58 <nop,nop,timestamp 2230744217 2425389802>
16:18:41.553677 IP xxx.xxx.238.110.5745 > xxx.xxx.198.40.443: R 1:1(0) ack 1 win 58 <nop,nop,timestamp 2230744217 2425389802>

但是如果你要导出*.pcap文件然后用wireshark来查看,还是得将两个网卡的流量合并到一起才行。

两个网卡分开抓

1
2
3
4
5
6
7
8
# 合并到一起写
$sudo tcpdump -i eth0 -nn port 443
$sudo tcpdump -i eth1 -nn port 443

16:51:16.073956 IP xxx.xxx.238.2.61835 > xxx.xxx.198.40.443: S 2659415794:2659415794(0) win 14600 <mss 1460,sackOK,timestamp 2328745293 0,nop,wscale 8>
16:51:16.073985 IP xxx.xxx.198.40.443 > xxx.xxx.238.2.61835: S 955422999:955422999(0) ack 2659415795 win 14480 <mss 1460,sackOK,timestamp 2427344323 2328745293,nop,wscale 7>
16:51:16.074147 IP xxx.xxx.238.2.61835 > xxx.xxx.198.40.443: . ack 955423000 win 58 <nop,nop,timestamp 2328745293 2427344323>
16:51:16.074192 IP xxx.xxx.238.2.61835 > xxx.xxx.198.40.443: R 0:0(0) ack 1 win 58 <nop,nop,timestamp 2328745293 2427344323>

上面的握手比较简单对应着

1
时间 协议 src > dst tcpflags data-seqno ack ackno win window <opts>

握手逻辑

一次连接建立的三次握手的步骤:

  1. Caller sends SYN
  2. Recipient responds with SYN, ACK
  3. Caller sends ACK

数据分析

  1. src 发送的 tcpflags = S,也就是说发送的SYN请求,随机产生一个值 data-seqno = J(2659415794),并将该数据包发送给 dst
  2. dst 发送的 tcpflags = S,实际发送的时候 SYN 和 ACK 标志位都置1了,返回一个 ackno=J + 1,也就是2659415795,然后随机生成一个新的 data-seqno = K(955422999) 发送给 src
  3. src 检查收到的 ackno 是不是等于J+1,然后再发送一个 ackno = K+1,也就是955423000,前面的文档中说道tcpflags.的时候也表示ACK
  4. src 最后为什么又给 dst 发送一个 tcpflags = R 的请求呢?

一次 java.io.IOException: Connection reset by peer 故障排查

为什么会出现 RST

因为具有周期性,我大概猜到了,是 lvs 对我的后端服务的健康检查导致的,联系了网络运营服务客服人员,我把.pcap给他们

1
sudo tcpdump -i any port 443 -c 3 -w log.pcap

他们确认源地址就是LVS的local地址。
健康检查的机制就是建立成功后就 RST 掉,就不占用 session 了。这是公司的规范,问题算是定位了。既然公司的规范,那么我们这边需要定制修改下,针对 vip 发送的 RST 不处理。

其他参考

https://github.com/netty/netty/issues/6029
https://tools.ietf.org/html/rfc793
https://my.oschina.net/costaxu/blog/127394

1
2


1