High Packet Loss/CRC errors for Tx on axgbe (DEC2750/OPNsense 22.1)

Started by burly, April 07, 2022, 07:10:50 AM

More details to come tomorrow when I have time to consolidate and write up all these tests. Update: I've added the important supporting data below. I believe the root cause of the behavior I'm seeing with the VMs is different from what is going on with the DEC2750.

I was able to get 10 Gbps line rate working in the OPNsense 22.1 VM by enabling all hardware offloading on a virtio NIC, as long as the traffic went in and out the same interface. If hardware checksum offloading is enabled, though, checksums fail when traffic traverses the NAT. Essentially, I think I'm hitting this: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=235607
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=165059

Disabling hardware offloading avoids the checksum problem, but makes everything slow. I was able to get ~5 Gbps/4 Gbps working across NAT traversal with the following configuration:


# Ensure this returns 1
$ sysctl net.inet.tcp.tso
net.inet.tcp.tso: 1

# Enable TX checksum, TCP segmentation, and large receive offloading, but NOT receive checksum offloading, on the WAN device (e.g., vtnet0)
$ ifconfig vtnet0 -rxcsum txcsum tso lro txcsum6 -vlanhwtag -vlanhwtso

# Disable RX & TX checksum, TCP segmentation, and large receive offloading on the LAN device (e.g., vtnet1)
$ ifconfig vtnet1 -rxcsum -txcsum -tso -lro -txcsum6 -vlanhwtag -vlanhwtso


Note that the official docs do mention the possibility of hardware issues with offloading and additionally state that both checksum and TCP segmentation offload need to be disabled if using IPS (https://docs.opnsense.org/manual/interfaces_settings.html) - so take this into account if you are considering turning on TXCSUM/TSO in your OPNsense VM with virtio NICs, or RXCSUM/TXCSUM/TSO with non-virtio NICs.
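
For reference, a quick way to check which offload features are actually active on an interface (independent of what the web GUI claims) is plain FreeBSD ifconfig: the options= line lists what is currently enabled, and with -m the capabilities= line lists what the driver supports at all.

# Show enabled offload features (options=) and driver capabilities (capabilities=)
$ ifconfig -m vtnet0 | grep -E 'options=|capabilities='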

Investigating throughput via iperf3 through the OPNsense 22.1 VM shows middle-of-the-road results with all offloading disabled. No significant improvement is seen with most hardware offloading enabled on both interfaces, as long as receive checksum offloading is left disabled. Enabling rxcsum on either interface results in no TCP connectivity at all.

[client 1] <---> [ (vtnet1) opnsense 22.1 <== NAT ==> (vtnet0)] <--> [client 2 separate machine]

All offloading disabled

# on opnsense-22.1
$ ifconfig vtnet0 -rxcsum -txcsum -tso -lro -txcsum6 -vlanhwtag -vlanhwtso
$ ifconfig vtnet1 -rxcsum -txcsum -tso -lro -txcsum6 -vlanhwtag -vlanhwtso

# on client 1
$ iperf3 -c 172.16.5.57
...
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  6.43 GBytes  5.53 Gbits/sec  3018             sender
[  5]   0.00-10.00  sec  6.43 GBytes  5.52 Gbits/sec                  receiver

$ iperf3 -c 172.16.5.57 -R
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  4.26 GBytes  3.66 Gbits/sec  2144             sender
[  5]   0.00-10.00  sec  4.26 GBytes  3.66 Gbits/sec                  receiver


Only receive offloading disabled

# on opnsense-22.1
$ ifconfig vtnet0 -rxcsum txcsum tso lro txcsum6 -vlanhwtag -vlanhwtso
$ ifconfig vtnet1 -rxcsum txcsum tso lro txcsum6 -vlanhwtag -vlanhwtso

# on client 1
$ iperf3 -c 172.16.5.57
...
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  6.35 GBytes  5.46 Gbits/sec  1287             sender
[  5]   0.00-10.00  sec  6.35 GBytes  5.45 Gbits/sec                  receiver

$ iperf3 -c 172.16.5.57 -R
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  4.40 GBytes  3.78 Gbits/sec  845             sender
[  5]   0.00-10.00  sec  4.40 GBytes  3.78 Gbits/sec                  receiver


Receive offloading enabled

# on opnsense-22.1
$ ifconfig vtnet0 -rxcsum txcsum tso lro txcsum6 -vlanhwtag -vlanhwtso
$ ifconfig vtnet1 rxcsum txcsum tso lro txcsum6 -vlanhwtag -vlanhwtso

# on client 1
$ iperf3 -c 172.16.5.57
iperf3: error - unable to connect to server: Connection timed out


Only receive offloading enabled

# on opnsense-22.1
$ ifconfig vtnet0 -rxcsum -txcsum -tso -lro -txcsum6 -vlanhwtag -vlanhwtso
$ ifconfig vtnet1 rxcsum -txcsum -tso -lro -txcsum6 -vlanhwtag -vlanhwtso

# on client 1
$ iperf3 -c 172.16.5.57
iperf3: error - unable to connect to server: Connection timed out



Investigating via packet capture at the upstream next hop from the OPNsense WAN interface reveals good checksums with rxcsum offloading disabled and bad checksums with it enabled:

[client VM 1] <---> [ (vtnet1) opnsense 22.1 <== NAT ==> (vtnet0)] <--> [upstream hardware firewall]

Receive offloading disabled

# on opnsense-22.1
$ ifconfig vtnet0 -rxcsum txcsum tso lro txcsum6 -vlanhwtag -vlanhwtso
$ ifconfig vtnet1 -rxcsum txcsum tso lro txcsum6 -vlanhwtag -vlanhwtso

$ curl https://www.google.com
(immediate full result)

# on upstream firewall
root@fw:~ # tcpdump -nv host 172.16.5.58 -i ax0
tcpdump: listening on ax0, link-type EN10MB (Ethernet), capture size 262144 bytes
03:50:57.729234 IP (tos 0x0, ttl 63, id 26307, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.5.58.56398 > 172.253.122.147.443: Flags [S], cksum 0x0a68 (correct), seq 2786157904, win 64240, options [mss 1460,sackOK,TS val 2576672459 ecr 0,nop,wscale 7], length 0
03:50:57.734756 IP (tos 0x80, ttl 123, id 5969, offset 0, flags [none], proto TCP (6), length 60)
    172.253.122.147.443 > 172.16.5.58.56398: Flags [S.], cksum 0xf663 (correct), seq 3977252634, ack 2786157905, win 65535, options [mss 1430,sackOK,TS val 1789175857 ecr 2576672459,nop,wscale 8], length 0
03:50:57.735172 IP (tos 0x0, ttl 63, id 26308, offset 0, flags [DF], proto TCP (6), length 52)
    172.16.5.58.56398 > 172.253.122.147.443: Flags [.], cksum 0x2317 (correct), ack 1, win 502, options [nop,nop,TS val 2576672465 ecr 1789175857], length 0


Receive offloading enabled

# on opnsense-22.1
$ ifconfig vtnet0 -rxcsum txcsum tso lro txcsum6 -vlanhwtag -vlanhwtso
$ ifconfig vtnet1 rxcsum txcsum tso lro txcsum6 -vlanhwtag -vlanhwtso

# on client 1
$ curl https://www.google.com
(hangs, no result)

# on upstream firewall
root@fw:~ # tcpdump -nv host 172.16.5.58 -i ax0
tcpdump: listening on ax0, link-type EN10MB (Ethernet), capture size 262144 bytes
03:56:58.192294 IP (tos 0x0, ttl 63, id 18870, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.5.58.22311 > 172.253.122.103.443: Flags [S], cksum 0x387f (incorrect -> 0xe379), seq 4123969246, win 64240, options [mss 1460,sackOK,TS val 3062744264 ecr 0,nop,wscale 7], length 0
03:56:59.221337 IP (tos 0x0, ttl 63, id 18871, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.5.58.22311 > 172.253.122.103.443: Flags [S], cksum 0x387f (incorrect -> 0xdf74), seq 4123969246, win 64240, options [mss 1460,sackOK,TS val 3062745293 ecr 0,nop,wscale 7], length 0
03:57:01.237466 IP (tos 0x0, ttl 63, id 18872, offset 0, flags [DF], proto TCP (6), length 60)



Originating the traffic directly on the OPNsense VM to eliminate NAT traversal shows full line rate with all hardware offloading enabled; with receive checksum offloading disabled, it shows line rate for TX but low throughput for RX:

[opnsense 22.1 (vtnet0)] <--> [upstream hardware firewall]

All hardware offloading enabled

# on opnsense-22.1
$ ifconfig vtnet0 rxcsum txcsum tso lro txcsum6 -vlanhwtag -vlanhwtso
$ ifconfig vtnet1 rxcsum txcsum tso lro txcsum6 -vlanhwtag -vlanhwtso

root@opnsense-22:~ # iperf3 -c 172.16.5.57
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  10.8 GBytes  9.24 Gbits/sec  7698             sender
[  5]   0.00-10.00  sec  10.8 GBytes  9.24 Gbits/sec                  receiver

root@opnsense-22:~ # iperf3 -c 172.16.5.57 -R
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  10.8 GBytes  9.27 Gbits/sec  50066             sender
[  5]   0.00-10.00  sec  10.8 GBytes  9.27 Gbits/sec                  receiver



Only receive offloading disabled

# on opnsense-22.1
$ ifconfig vtnet0 -rxcsum txcsum tso lro txcsum6 -vlanhwtag -vlanhwtso
$ ifconfig vtnet1 -rxcsum txcsum tso lro txcsum6 -vlanhwtag -vlanhwtso

root@opnsense-22:~ # iperf3 -c 172.16.5.57
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  10.8 GBytes  9.29 Gbits/sec  11963             sender
[  5]   0.00-10.00  sec  10.8 GBytes  9.29 Gbits/sec                  receiver

root@opnsense-22:~ # iperf3 -c 172.16.5.57 -R
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.98 GBytes  1.70 Gbits/sec  189             sender
[  5]   0.00-10.00  sec  1.98 GBytes  1.70 Gbits/sec                  receiver


TODO: Add a second NIC to the vanilla FreeBSD 13.0 VM and perform NAT traversal to see if the issue is present there as well.

TODO: Continue/resume the hunt for the DEC2750 performance issue

NOTE: My original tests used the web GUI to enable or disable the various hardware offloading options, then saved and applied the settings on the individual interfaces so they would be issued to the underlying device. Under the hood this is done with ifconfig <interface> <options>. One issue I ran into was that on my production VMs, TSO was disabled at the sysctl level via the tunable net.inet.tcp.tso = 0 (which ultimately lives in /boot/loader.conf). I re-enabled it and also issued the interface changes manually to verify they had actually taken effect. This is why some of my original OPNsense VMs saw no performance increase even when I enabled hardware offloading in the web GUI.
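
A minimal sketch of that verification, assuming vtnet0 is the interface being checked and the tunable lives in the standard /boot/loader.conf location mentioned above:

# Confirm TSO isn't globally disabled by a tunable
$ sysctl net.inet.tcp.tso
$ grep -i tso /boot/loader.conf

# Confirm the per-interface flags actually took effect (look for TXCSUM/TSO4/LRO in the options= line)
$ ifconfig vtnet0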

NOTE: I found that processor affinity within the VM guest can affect the results. If the iperf3 process gets scheduled on the same core as the kthread responsible for handling the network queue, you'll get lower throughput for several seconds before the OS finally schedules one of the two onto a different core. When this occurs, I would re-run the tests, or check which core the kthread is on and run iperf3 with -A to set affinity to a different core.
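
For example, roughly like this (a sketch; the exact kthread names for the virtio queues vary, so you have to eyeball top's thread view):

# List kernel threads with the CPU they last ran on (the C column) and find the NIC queue thread
$ top -HS

# Pin iperf3 to a different core, e.g. core 3, so it doesn't compete with that kthread
$ iperf3 -c 172.16.5.57 -A 3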

NOTE: I found that turbo boosting of the cores by the host can affect the results. Prior to running the "official" test, I would run iperf3 with -t 60 to let it run for 60 s and stabilize the throughput, then immediately re-run it to capture comparable throughput and Retr values. This ignores additional complexities such as host processor affinity and hyper-threading, but the testing procedure seemed to produce fairly consistent data.
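
In shell terms, the warm-up procedure looks roughly like this (a sketch of the steps described above, not an exact transcript):

# Warm-up run to let clocks and throughput stabilize; result discarded
$ iperf3 -c 172.16.5.57 -t 60 > /dev/null

# Immediately re-run to capture the throughput/Retr numbers actually recorded
$ iperf3 -c 172.16.5.57
$ iperf3 -c 172.16.5.57 -R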

Quote from: burly on April 08, 2022, 04:04:26 AM
**RSS UPDATE**: I tried turning off RSS (dev.ax.0.rss_enabled="0" dev.ax.1.rss_enabled="0") and rebooting. I then re-tested send/receive with both single and parallel threads and observed no improvement. I believe that since both src host:port and dst host:port are in the hash, -P 4 should be able to generate different queue targets in the LSB of the hash and thus spread the load across cores. Said more simply, I think this is a valid test, but I'm not fully up to speed on RSS. See here for more details: https://forum.opnsense.org/index.php?topic=24409.0

Yes and no: uniqueness cannot be guaranteed in a hash in which only the ports are incremented sequentially (which is the case for iperf3). Also, RSS in the driver does not mean that the correct hash is used. If RSS is disabled in the kernel, the driver actually fills the hardware registers with random bytes to use as a hash key. If RSS is enabled in the kernel, the kernel-defined hash is used, which should distribute much more evenly (though still no guarantees can be made).

The AX driver specifically shuts off all multi-queue functionality in the hardware if hardware-RSS is disabled and forces everything through a bottleneck - always keep it enabled. This is different from other vendors such as Intel which play by more sophisticated RSS rules.
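
For reference, the knobs involved can be inspected from the shell like this (a sketch: dev.ax.* are the per-device tunables quoted above, while net.inet.rss.* only exists if the kernel was built with RSS support; all of these are loader tunables best set via System > Settings > Tunables followed by a reboot):

# Per-device RSS on the axgbe ports (leave these at 1)
$ sysctl dev.ax.0.rss_enabled dev.ax.1.rss_enabled

# Kernel-level RSS, if compiled in
$ sysctl net.inet.rss.enabled net.inet.rss.bits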

In an effort to clear the air regarding performance on axgbe (in this case specifically DEC2750 to match the situation as described by the OP), we have set up a standardized test bench in order to potentially catch some of the more performance-degrading changes in the kernel.

Because the world is very complex and there are an infinite number of ways to test, to wrongly interpret, and to set up a clean environment, we will stick to a single configuration that does not change except for the OPNsense version, in which simple firewall throughput is measured.

Linux (iperf client) ----> OPNsense ----> Linux (iperf server)

Because single iperf3 tests can be wonky for various reasons, e.g. iperf3 itself (it is a single-threaded application, at least on FreeBSD), throttling, system activity, link partner inconsistencies, etc., we measure 5 separate sessions with multiple threads. Also, NICs like certain packet sizes more than others; to account for this, multiple packet sizes are used in the tests.

Regarding system configuration such as hardware offloading, tunables, etc., only the system defaults (i.e., as delivered by Deciso) are used.
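
As a rough sketch of the sweep described above (the exact bench tooling isn't published here; -M sets the MSS, -P the number of parallel streams, and an MSS of 1500 will in practice be clamped to whatever the MTU allows):

# on the Linux iperf client: 5 runs per packet size
for mss in 1500 1200 500 100; do
    for run in 1 2 3 4 5; do
        iperf3 -c <iperf-server> -P 4 -M $mss
    done
done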

To start, we set a baseline on 21.7.8 (FreeBSD 12.1):
21.7.8
---------------------------------------------------------------------------
[Firewall]
iperf3 mss 1500
bps                  : 8.04 Gbps (avg 7.45 Gbps in 5 tries)
pps                  : 702.37 Kpps (avg 650.72 Kpps in 5 tries)
iperf3 mss 1200
bps                  : 7.90 Gbps (avg 7.81 Gbps in 5 tries)
pps                  : 863.29 Kpps (avg 853.30 Kpps in 5 tries)
iperf3 mss 500
bps                  : 3.16 Gbps (avg 3.10 Gbps in 5 tries)
pps                  : 828.86 Kpps (avg 813.87 Kpps in 5 tries)
iperf3 mss 100
bps                  : 565.02 Mbps (avg 528.56 Mbps in 5 tries)
pps                  : 723.22 Kpps (avg 676.56 Kpps in 5 tries)
netperf latency
mean_latency         : 150.92 Microseconds [RTT]
---------------------------------------------------------------------------

These results fall within expectations for the DEC2750's processor. The number of Kpps is the most important measurement in this setup.
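
As a rough back-of-the-envelope check (assuming iperf3 reports payload goodput and each packet carries about one MSS of payload), pps ≈ bps / (MSS × 8); e.g. 7.81 Gbit/s at MSS 1200 works out to roughly 810 Kpps, in the same ballpark as the ~853 Kpps measured.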

Next up, we test the different kernel versions starting from 22.1.2:

22.1.2
---------------------------------------------------------------------------
[Firewall]
iperf3 mss 1500
bps                  : 8.76 Gbps (avg 7.65 Gbps in 5 tries)
pps                  : 765.32 Kpps (avg 668.13 Kpps in 5 tries)
iperf3 mss 1200
bps                  : 8.58 Gbps (avg 8.36 Gbps in 5 tries)
pps                  : 937.37 Kpps (avg 912.66 Kpps in 5 tries)
iperf3 mss 500
bps                  : 3.45 Gbps (avg 3.38 Gbps in 5 tries)
pps                  : 903.84 Kpps (avg 886.74 Kpps in 5 tries)
iperf3 mss 100
bps                  : 595.11 Mbps (avg 551.87 Mbps in 5 tries)
pps                  : 761.74 Kpps (avg 706.39 Kpps in 5 tries)
netperf latency
mean_latency         : 148.03 Microseconds [RTT]
---------------------------------------------------------------------------

22.1.4
---------------------------------------------------------------------------
[Firewall]
iperf3 mss 1500
bps                  : 8.75 Gbps (avg 7.92 Gbps in 5 tries)
pps                  : 764.57 Kpps (avg 692.37 Kpps in 5 tries)
iperf3 mss 1200
bps                  : 8.26 Gbps (avg 6.78 Gbps in 5 tries)
pps                  : 902.74 Kpps (avg 740.71 Kpps in 5 tries)
iperf3 mss 500
bps                  : 3.35 Gbps (avg 3.13 Gbps in 5 tries)
pps                  : 878.98 Kpps (avg 820.60 Kpps in 5 tries)
iperf3 mss 100
bps                  : 574.86 Mbps (avg 486.28 Mbps in 5 tries)
pps                  : 735.82 Kpps (avg 622.44 Kpps in 5 tries)
netperf latency
mean_latency         : 148.66 Microseconds [RTT]
---------------------------------------------------------------------------

22.1.5
---------------------------------------------------------------------------
[Firewall]
iperf3 mss 1500
bps                  : 8.75 Gbps (avg 8.35 Gbps in 5 tries)
pps                  : 764.92 Kpps (avg 729.96 Kpps in 5 tries)
iperf3 mss 1200
bps                  : 8.43 Gbps (avg 7.66 Gbps in 5 tries)
pps                  : 920.95 Kpps (avg 836.44 Kpps in 5 tries)
iperf3 mss 500
bps                  : 3.56 Gbps (avg 3.44 Gbps in 5 tries)
pps                  : 933.90 Kpps (avg 901.74 Kpps in 5 tries)
iperf3 mss 100
bps                  : 622.28 Mbps (avg 608.44 Mbps in 5 tries)
pps                  : 796.51 Kpps (avg 778.81 Kpps in 5 tries)
netperf latency
mean_latency         : 169.09 Microseconds [RTT]
---------------------------------------------------------------------------

If anything, performance has increased since FreeBSD 13-STABLE.

Cheers,

Stephan

Thank you for the update and the testing!

Those numbers are consistent with what I was experiencing with 21.x. You're testing across LAN <-> WAN with pf enabled and NAT with just basic ACLs, right?

I'm going to hopefully have some time to dive into this further on my system this weekend.

Quote from: burly on April 24, 2022, 12:29:52 AM
Those numbers are consistent with what I was experiencing with 21.x. You're testing across LAN <-> WAN with pf enabled and NAT with just basic ACLs, right?

Correct :)

Were you able to solve this issue?

I appear to be hitting the same problem.

I have a Proxmox host with OPNsense running as a guest. When I have CRC, TSO, and LRO disabled I hit bottlenecks (~5-6 Gbit in one direction and ~7-8 Gbit in the other). When I enable CRC, TSO, and LRO I am able to hit ~9 Gbit from host to router and ~23 Gbit from router to ISP in both directions; however, going across the interfaces (NAT) I get pushed down to ~1 Mbit.

I think your issue is at best only remotely comparable to this one... first off, how do you install Proxmox on a DEC2750?

The OP did not have Proxmox running, but OPNsense directly on a DEC2750. Even the underlying FreeBSD version is different from yours (i.e., 13.0 vs. 13.1). So I doubt your situation is the same.

Second (because I do not know of any way to make Proxmox run on a DEC2750): what other hardware do you use with axgbe NICs?

I would start a new thread for your question, because at least one thing (OS, virtualization, hardware) is different. Please describe your hardware and OPNsense version, and provide as much other info as you have when you do.