Hi,
as soon as I put a little more load on my firewall cluster, it loses packets and TCP connections get closed.
The nodes are ProLiant DL380 G7s with 32 GB RAM, two Quad-Core Xeon X5660s and three Quad-Port Intel 82580 NICs, so I assume the hardware is not the problem. All interfaces use link aggregation in loadbalance mode.
The system is not under stress: approx. 10k sessions, 1% CPU load, plenty of mbufs, and no errors or drops on either the NICs or the switch ports.
At some unpredictable point the firewall starts losing packets.
The trouble starts after acknowledgment number 291137. The database server keeps sending packets until the TCP window is full, but these packets don't reach the other side, and the ACKs from the webserver don't reach the database either. After the retransmissions time out, the database server resets the connection.
The traces were captured on the firewall. I captured on both the physical and the lagg interfaces, with no difference.
Any ideas where to look further?
And why do I see ICMP packets from the firewall on this TCP connection?
Many thanks
Frank
lagg0 - 192.168.19.0/24
330 299.939233 172.16.6.69 -> 192.168.19.4 TCP 54 55353 > ms-sql-s [ACK] Seq=12642 Ack=283137 Win=45312 Len=0
331 299.939238 192.168.19.4 -> 172.16.6.69 TDS 1514 Unknown Packet Type: 0
332 299.939252 172.16.6.69 -> 192.168.19.4 TCP 54 55353 > ms-sql-s [ACK] Seq=12642 Ack=291137 Win=37376 Len=0
333 299.939397 192.168.19.4 -> 172.16.6.69 TDS 1514 Unknown Packet Type: 0
334 299.939572 192.168.19.4 -> 172.16.6.69 TDS 1514 Unknown Packet Type: 0 (Not last buffer)
335 299.939576 192.168.19.4 -> 172.16.6.69 TDS 1514 Unknown Packet Type: 0
336 299.939579 192.168.19.4 -> 172.16.6.69 TDS 1514 Unknown Packet Type: 0 (Not last buffer)
337 299.939582 192.168.19.4 -> 172.16.6.69 TDS 1514 Unknown Packet Type: 0 (Not last buffer)
338 299.939585 192.168.19.4 -> 172.16.6.69 TDS 1514 Unknown Packet Type: 0 (Not last buffer)
339 299.939588 192.168.19.4 -> 172.16.6.69 TDS 1514 Unknown Packet Type: 0 (Not last buffer)
340 299.939591 192.168.19.4 -> 172.16.6.69 TDS 1514 Unknown Packet Type: 0
341 299.939595 192.168.19.4 -> 172.16.6.69 TDS 1514 Unknown Packet Type: 0
342 299.939599 192.168.19.4 -> 172.16.6.69 TDS 1514 Unknown Packet Type: 0 (Not last buffer)
343 299.939602 192.168.19.4 -> 172.16.6.69 TDS 1514 Unknown Packet Type: 0
344 299.939605 192.168.19.4 -> 172.16.6.69 TDS 1514 Unknown Packet Type: 0
345 299.939608 192.168.19.4 -> 172.16.6.69 TCP 1514 ms-sql-s > 55353 [PSH, ACK] Seq=324657 Ack=12642 Win=65536 Len=1460
346 299.939610 192.168.19.4 -> 172.16.6.69 TDS 1514 Unknown Packet Type: 0
347 300.239719 192.168.19.4 -> 172.16.6.69 TCP 1514 [TCP Retransmission] ms-sql-s > 55353 [ACK] Seq=291137 Ack=12642 Win=65536 Len=1460
348 300.239743 192.168.19.31 -> 192.168.19.4 ICMP 82 Destination unreachable (Host unreachable)
349 300.838833 192.168.19.4 -> 172.16.6.69 TCP 1514 [TCP Retransmission] ms-sql-s > 55353 [ACK] Seq=291137 Ack=12642 Win=65536 Len=1460
350 300.838859 192.168.19.31 -> 192.168.19.4 ICMP 82 Destination unreachable (Host unreachable)
351 302.041479 192.168.19.4 -> 172.16.6.69 TCP 1514 [TCP Retransmission] ms-sql-s > 55353 [ACK] Seq=291137 Ack=12642 Win=65536 Len=1460
352 302.041502 192.168.19.31 -> 192.168.19.4 ICMP 82 Destination unreachable (Host unreachable)
353 304.438934 192.168.19.4 -> 172.16.6.69 TCP 1514 [TCP Retransmission] ms-sql-s > 55353 [ACK] Seq=291137 Ack=12642 Win=65536 Len=1460
354 304.438957 192.168.19.31 -> 192.168.19.4 ICMP 82 Destination unreachable (Host unreachable)
355 309.239126 192.168.19.4 -> 172.16.6.69 TCP 1514 [TCP Retransmission] ms-sql-s > 55353 [ACK] Seq=291137 Ack=12642 Win=65536 Len=1460
356 309.239148 192.168.19.31 -> 192.168.19.4 ICMP 82 Destination unreachable (Host unreachable)
357 318.839481 192.168.19.4 -> 172.16.6.69 TCP 60 ms-sql-s > 55353 [RST, ACK] Seq=292597 Ack=12642 Win=0 Len=0
358 329.939143 172.16.6.69 -> 192.168.19.4 TCP 55 [TCP Keep-Alive] [TCP Window Full] 55353 > ms-sql-s [ACK] Seq=12641 Ack=307137 Win=131328 Len=1
359 329.939261 192.168.19.4 -> 172.16.6.69 TCP 60 ms-sql-s > 55353 [RST] Seq=307137 Win=0 Len=0
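For what it's worth, the retransmissions of Seq=291137 in the lagg0 trace above look like ordinary exponential backoff, which would suggest the database server behaves normally and simply never sees the later ACKs. A minimal sketch in Python to check this, with the timestamps copied by hand from the trace (the variable and function names are mine, purely for illustration):

```python
# Timestamps (seconds) of the five retransmissions of Seq=291137
# in the lagg0 trace, copied by hand from frames 347 to 355.
retransmits = [300.239719, 300.838833, 302.041479, 304.438934, 309.239126]
# Time of the [RST, ACK] sent by the database server (frame 357).
rst_time = 318.839481


def gaps(times):
    """Return the time differences between consecutive events."""
    return [round(later - earlier, 3) for earlier, later in zip(times, times[1:])]


# Gaps between retransmissions, plus the gap from the last one to the RST.
intervals = gaps(retransmits + [rst_time])
# Ratio of each gap to the previous one; values near 2 indicate doubling.
ratios = [round(b / a, 2) for a, b in zip(intervals, intervals[1:])]

print("gaps in seconds:", intervals)
print("growth per step:", ratios)
```

Each gap comes out at roughly twice the previous one, so the sender is just backing off and then giving up. The loss itself has to happen somewhere between the two lagg interfaces.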
lagg1 - 172.16.6.0/24
329 299.939251 172.16.6.69 -> 192.168.19.4 TCP 60 55353 > ms-sql-s [ACK] Seq=12642 Ack=283137 Win=45312 Len=0
330 299.939261 192.168.19.4 -> 172.16.6.69 TDS 1514 Unknown Packet Type: 0
331 299.939271 172.16.6.69 -> 192.168.19.4 TCP 60 55353 > ms-sql-s [ACK] Seq=12642 Ack=291137 Win=37376 Len=0
332 299.939273 192.168.19.4 -> 172.16.6.69 TDS 1514 Unknown Packet Type: 0
333 299.939321 172.16.6.69 -> 192.168.19.4 TCP 60 55353 > ms-sql-s [ACK] Seq=12642 Ack=299137 Win=29440 Len=0
334 299.939492 172.16.6.69 -> 192.168.19.4 TCP 60 55353 > ms-sql-s [ACK] Seq=12642 Ack=307137 Win=21504 Len=0
335 299.939636 172.16.6.69 -> 192.168.19.4 TCP 60 [TCP Window Update] 55353 > ms-sql-s [ACK] Seq=12642 Ack=307137 Win=69376 Len=0
336 299.940190 172.16.6.69 -> 192.168.19.4 TCP 60 [TCP Window Update] 55353 > ms-sql-s [ACK] Seq=12642 Ack=307137 Win=131328 Len=0
337 318.839520 192.168.19.4 -> 172.16.6.69 TCP 54 ms-sql-s > 55353 [RST, ACK] Seq=292597 Ack=12642 Win=0 Len=0
338 329.939156 172.16.6.69 -> 192.168.19.4 TCP 60 [TCP Keep-Alive] [TCP Window Full] 55353 > ms-sql-s [ACK] Seq=12641 Ack=307137 Win=131328 Len=1
339 329.939300 192.168.19.4 -> 172.16.6.69 TCP 54 ms-sql-s > 55353 [RST] Seq=307137 Win=0 Len=0
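To make the difference between the two captures easier to see, here is a rough Python sketch that lists the ACK numbers from the webserver that show up on lagg1 but never on lagg0 before the reset. The lines are copied from the traces above; the regular expression and the helper name `acks_before` are my own assumptions about the text format, not part of any tool.

```python
import re

# Frame number, timestamp, source address, then the Ack= field further right.
LINE = re.compile(r"^\d+\s+(\d+\.\d+)\s+(\S+) -> \S+ .*\bAck=(\d+)")

LAGG0 = """
330 299.939233 172.16.6.69 -> 192.168.19.4 TCP 54 55353 > ms-sql-s [ACK] Seq=12642 Ack=283137 Win=45312 Len=0
332 299.939252 172.16.6.69 -> 192.168.19.4 TCP 54 55353 > ms-sql-s [ACK] Seq=12642 Ack=291137 Win=37376 Len=0
358 329.939143 172.16.6.69 -> 192.168.19.4 TCP 55 [TCP Keep-Alive] [TCP Window Full] 55353 > ms-sql-s [ACK] Seq=12641 Ack=307137 Win=131328 Len=1
"""

LAGG1 = """
329 299.939251 172.16.6.69 -> 192.168.19.4 TCP 60 55353 > ms-sql-s [ACK] Seq=12642 Ack=283137 Win=45312 Len=0
331 299.939271 172.16.6.69 -> 192.168.19.4 TCP 60 55353 > ms-sql-s [ACK] Seq=12642 Ack=291137 Win=37376 Len=0
333 299.939321 172.16.6.69 -> 192.168.19.4 TCP 60 55353 > ms-sql-s [ACK] Seq=12642 Ack=299137 Win=29440 Len=0
334 299.939492 172.16.6.69 -> 192.168.19.4 TCP 60 55353 > ms-sql-s [ACK] Seq=12642 Ack=307137 Win=21504 Len=0
335 299.939636 172.16.6.69 -> 192.168.19.4 TCP 60 [TCP Window Update] 55353 > ms-sql-s [ACK] Seq=12642 Ack=307137 Win=69376 Len=0
336 299.940190 172.16.6.69 -> 192.168.19.4 TCP 60 [TCP Window Update] 55353 > ms-sql-s [ACK] Seq=12642 Ack=307137 Win=131328 Len=0
338 329.939156 172.16.6.69 -> 192.168.19.4 TCP 60 [TCP Keep-Alive] [TCP Window Full] 55353 > ms-sql-s [ACK] Seq=12641 Ack=307137 Win=131328 Len=1
"""

WEBSERVER = "172.16.6.69"
RST_TIME = 318.839481  # first RST from the database server on lagg0


def acks_before(trace, src, cutoff):
    """Return the set of ACK numbers sent by `src` earlier than `cutoff` seconds."""
    seen = set()
    for line in trace.strip().splitlines():
        match = LINE.match(line.strip())
        if match and match.group(2) == src and float(match.group(1)) < cutoff:
            seen.add(int(match.group(3)))
    return seen


# ACKs that entered the firewall on lagg1 but never left on lagg0.
lost = sorted(acks_before(LAGG1, WEBSERVER, RST_TIME)
              - acks_before(LAGG0, WEBSERVER, RST_TIME))
print("ACKs seen on lagg1 only:", lost)
```

With the lines above, the ACKs for 299137 and 307137 arrive on lagg1 but are never forwarded on lagg0, which matches the database server retransmitting from Seq=291137.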
Solved it by increasing the undocumented igb(4) hw.igb.buf_ring_size setting.
It seems that the HPE NC365T adapter cannot push packets out to the wire fast enough.
But that's anyone's guess.
If someone else runs OPNsense on a ProLiant, here are my settings.
@franco: Could be the first settings for the network card tweak plugin. :)
/boot/loader.conf.local
ipmi_load="YES"
net.link.ifqmaxlen="8192"
hw.igb.buf_ring_size="32768"
hw.igb.max_interrupt_rate="96000"
hw.igb.num_queues="1"
hw.igb.rx_process_limit="4096"
hw.igb.tx_process_limit="4096"
hw.igb.rxd="4096"
hw.igb.txd="4096"
net.pf.states_hashsize="16777216"
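If you want to keep track of which values you changed between attempts, a small parser helps. This is only a sketch in Python: it assumes the simple `name="value"` format shown above, and the function name `parse_loader_conf` is made up for illustration.

```python
SAMPLE = '''
ipmi_load="YES"
net.link.ifqmaxlen="8192"
hw.igb.buf_ring_size="32768"
hw.igb.max_interrupt_rate="96000"
hw.igb.num_queues="1"
hw.igb.rx_process_limit="4096"
hw.igb.tx_process_limit="4096"
hw.igb.rxd="4096"
hw.igb.txd="4096"
net.pf.states_hashsize="16777216"
'''


def parse_loader_conf(text):
    """Parse name="value" lines into a dict, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip().strip('"')
    return settings


settings = parse_loader_conf(SAMPLE)
# Only the igb(4) driver tunables.
igb = {name: value for name, value in settings.items()
       if name.startswith("hw.igb.")}

for name, value in sorted(igb.items()):
    print(f"{name} = {value}")
```

On the firewall itself you would pass in the contents of /boot/loader.conf.local instead of the `SAMPLE` string.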
System -> Settings -> Tunables
kern.ipc.maxsockbuf 8388608
net.inet.tcp.sendbuf_max 16777216
net.inet.tcp.recvbuf_max 16777216
net.inet.tcp.sendspace 131072
net.inet.tcp.recvspace 131072
net.inet.tcp.sendbuf_inc 32768
net.inet.tcp.recvbuf_inc 65536
kern.ipc.soacceptqueue 1024
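To check that the tunables are actually active after a reboot, something like the following Python sketch could be run on the firewall. It assumes Python is available there and uses the standard `sysctl -n NAME` call to read each value. The names `WANTED`, `read_sysctl` and `mismatches` are mine.

```python
import subprocess

# Values from the tunables list above, as strings for easy comparison.
WANTED = {
    "kern.ipc.maxsockbuf": "8388608",
    "net.inet.tcp.sendbuf_max": "16777216",
    "net.inet.tcp.recvbuf_max": "16777216",
    "net.inet.tcp.sendspace": "131072",
    "net.inet.tcp.recvspace": "131072",
    "net.inet.tcp.sendbuf_inc": "32768",
    "net.inet.tcp.recvbuf_inc": "65536",
    "kern.ipc.soacceptqueue": "1024",
}


def read_sysctl(name):
    """Return the live value of one sysctl by running `sysctl -n NAME`."""
    result = subprocess.run(["sysctl", "-n", name],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


def mismatches(wanted, reader=read_sysctl):
    """Return {name: (wanted, live)} for every tunable whose live value differs."""
    diff = {}
    for name, value in wanted.items():
        live = reader(name)
        if live != value:
            diff[name] = (value, live)
    return diff
```

Calling `mismatches(WANTED)` on the firewall should return an empty dict if everything was applied. The `reader` parameter is only there so the comparison can be tried on another machine with a stand-in function.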
Interfaces -> Settings
uncheck 'Disable hardware CRC, TSO and LRO'
The HPE NC365T is an add-on card, iirc? So probably not just ProLiant-related?
Correct.
These settings are the result of many attempts until I got the best stability and performance.
Feel free to test them on other hardware ;)
No doubt others will. Good find.
I have, or have had, the 364T. Still have some dual ports at home.
Don't use them anymore since I went SFP+ (home usage, because I can).