OPNsense Forum

Archive => 22.1 Legacy Series => Topic started by: burly on April 07, 2022, 07:10:50 am

Title: High Packet Loss/CRC errors for Tx on axgbe (DEC2750/OPNsense 22.1)
Post by: burly on April 07, 2022, 07:10:50 am
I am experiencing high packet loss when transmitting from ax0 (LAN) to another LAN device on my DEC2750 running OPNsense 22.1. Throughput is poor in both directions (~1.8Gbps as sender, ~1.7Gbps as receiver); however, I'm only observing packet loss/retransmits when ax0 is the transmitter.

On my DEC2750 the LAN is ax0 and it is connected to port 8 of a USW-Aggregation 10Gbps switch via a Mellanox MCP2100-X003B DAC. Looking at the switch port I can see the input errors and CRC counts increasing when I run iperf. 

Code: [Select]
root@fw:~ # iperf3 -c 172.16.5.14
Connecting to host 172.16.5.14, port 5201
[  5] local 172.16.5.1 port 29519 connected to 172.16.5.14 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   180 MBytes  1.51 Gbits/sec  390   29.8 KBytes
[  5]   1.00-2.00   sec   122 MBytes  1.03 Gbits/sec  252   49.8 KBytes
[  5]   2.00-3.00   sec   259 MBytes  2.17 Gbits/sec  508   54.1 KBytes
[  5]   3.00-4.00   sec   255 MBytes  2.14 Gbits/sec  529   25.5 KBytes
[  5]   4.00-5.01   sec   134 MBytes  1.12 Gbits/sec  298    334 KBytes
[  5]   5.01-6.01   sec   192 MBytes  1.61 Gbits/sec  397    781 KBytes
[  5]   6.01-7.00   sec   218 MBytes  1.84 Gbits/sec  434   48.3 KBytes
[  5]   7.00-8.00   sec   117 MBytes   983 Mbits/sec  242   19.9 KBytes
[  5]   8.00-9.00   sec   176 MBytes  1.48 Gbits/sec  326   22.7 KBytes
[  5]   9.00-10.00  sec   215 MBytes  1.81 Gbits/sec  435   44.0 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.82 GBytes  1.57 Gbits/sec  3811             sender
[  5]   0.00-10.00  sec  1.82 GBytes  1.57 Gbits/sec                  receiver

Code: [Select]
SW-Aggregation# show interfaces TenGigabitEthernet 8
TenGigabitEthernet8 is up
  Hardware is Ten Gigabit Ethernet
  Full-duplex, 10Gb/s, media type is Fiber
  flow-control is off
  back-pressure is enabled
     262840538 packets input, 865223445 bytes, 0 throttles
     Received 2488 broadcasts (0 multicasts)
     0 runts, 477 giants, 0 throttles
     510220 input errors, 509743 CRC, 0 frame
     0 multicast, 0 pause input
     0 input packets with dribble condition detected
     156613060 packets output, 1602945509 bytes, 0 underrun
     644 output errors, 0 collisions
     644 babbles, 0 late collision, 0 deferred
     0 PAUSE output

I don't see any errors/discards at the fw LAN interface (DEC2750 ax0). MTU is 1500 all around.
Code: [Select]
root@fw:~ # ifconfig ax0
ax0: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: LAN
        options=4800028<VLAN_MTU,JUMBO_MTU,NOMAP>
        ether f4:90:ea:00:73:4a
        inet 172.16.5.1 netmask 0xffffff00 broadcast 172.16.5.255
        media: Ethernet autoselect (10GBase-SFI <full-duplex,rxpause,txpause>)
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>

root@fw:~ # netstat -i log | grep -iE "Name|ax0"
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
ax0    1500 <Link#4>      f4:90:ea:00:73:4a  9077437     0     0 12499343     0     0
ax0       - 172.16.5.0/24 fw                    6778     -     -    12065     -     -

I have verified:
 - I can send 9.4Gbps bi-directionally between all other devices connected to the USW-Aggregation switch
 - Switch CPU utilization is low (~3-5%)
 - iperf3 -u -b 9000M (UDP) shows the same bandwidth and packet loss behavior
 
Additionally, back in January on OPNsense 21.7 I verified that I could bi-directionally push 9.4Gbps on the LAN interface to other 10GbE devices (and well in excess of 5Gbps across the FW and out ax1).

I have tried:
 - Rebooting DEC2750 (no change)
 - Rebooting the switch (no change)
 - Switching to a known good DAC (no change)
 - Put the original DAC used by the FW on another known-good host (no change - the known-good host can hit 9.4Gbps with it without issue)
 - Change port on the switch (no change)
 - Switch to ax1 on the DEC2750 (no change)
 - enabling hardware checksum offloading on fw (no change)
 - enabling hardware tcp segmentation offloading on fw (no change)
 - enabling large receive offload on fw (no change)
 - enabling flow control on the switch (no change in throughput but it does completely eliminate the iperf3 TCP ReTxs)
 - enabling flow control on ax0 (added tunables dev.ax.0.rx_pause=1 and dev.ax.0.tx_pause=1, then rebooted; see the sketch below) (no change in throughput but it eliminates the iperf3 TCP ReTxs)
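
A minimal sketch of what that flow-control change amounts to (assuming the tunables are added via System > Settings > Tunables and checked afterwards with sysctl):
Code: [Select]
# Tunables added in the GUI (applied after reboot):
dev.ax.0.rx_pause="1"
dev.ax.0.tx_pause="1"

# After the reboot, confirm the values actually took effect:
sysctl dev.ax.0.rx_pause dev.ax.0.tx_pause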

 I have not yet tried:
 - Direct connecting the FW to another 10Gbps device (Update: see post below on this)
 - Downgrading to OPNsense 21.x
 - Using a verified tested & working DAC module (e.g. [DAC] UBIQUITI 10G 1M DAC) (Update: arrived and installed, no change)

The only thing that I know of that has changed is the update to OPNsense 22.1 (which is based on FreeBSD 13, whereas 21.x was on FreeBSD 12). Could this be a potential issue with OPNsense 22.1/FreeBSD 13 and axgbe?


Hardware: DEC2750

Software: OPNsense 22.1.4_1-amd64


Code: [Select]
$ uname -a
FreeBSD fw 13.0-STABLE FreeBSD 13.0-STABLE stable/22.1-n248063-ac40e064d3c SMP  amd64

$ dmesg | grep -i ax0 
ax0: <AMD 10 Gigabit Ethernet Driver> mem 0xd0060000-0xd007ffff,0xd0040000-0xd005ffff,0xd0082000-0xd0083fff at device 0.1 on pci6
ax0: Using 2048 TX descriptors and 2048 RX descriptors
ax0: Using 3 RX queues 3 TX queues
ax0: Using MSI-X interrupts with 7 vectors
ax0: Ethernet address: f4:90:ea:00:73:4a
ax0: xgbe_config_sph_mode: SPH disabled in channel 0
ax0: xgbe_config_sph_mode: SPH disabled in channel 1
ax0: xgbe_config_sph_mode: SPH disabled in channel 2
ax0: RSS Enabled
ax0: Receive checksum offload Enabled
ax0: VLAN filtering Enabled
ax0: VLAN Stripping Enabled
ax0: Checking GPIO expander validity
ax0: SFP detected:
ax0:   vendor:   Mellanox
ax0:   part number:    MCP2100-X003B
ax0:   revision level: A1
ax0:   serial number:  MT1403VS18803
ax0: netmap queues/slots: TX 3/2048, RX 3/2048

These are the potentially relevant tunables that were already modified "out-of-the-box" as delivered from Deciso:
Code: [Select]
dev.ax.0.iflib.override_nrxds 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048
dev.ax.0.iflib.override_ntxds 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048
dev.ax.0.rss_enabled 1
dev.ax.1.iflib.override_nrxds 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048
dev.ax.1.iflib.override_ntxds 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048
dev.ax.1.rss_enabled 1
Title: Re: High Packet Loss/CRC errors for Tx on axgbe (DEC2750/OPNsense 22.1)
Post by: tuto2 on April 07, 2022, 02:43:55 pm
This is interesting. I'll take a look at it on my setup once I find some time.

Glossing over your post it seems a downgrade to 21.x would be interesting for comparison.

No virtual interfaces running on top of ax0?

Cheers,

Stephan
Title: Re: High Packet Loss/CRC errors for Tx on axgbe (DEC2750/OPNsense 22.1)
Post by: burly on April 07, 2022, 04:54:29 pm
That would be great, thank you!

I've ordered one of the Deciso tested & working DAC models ([DAC] UBIQUITI 10G 1M DAC) to try. I'll likely try the direct connect this evening. Downgrading will take a little more setup time, as I first need to update the configuration on my backup FW VM and bring it online, since it's not set up in HA.

Update: Correct, no VLANs on this interface.
Title: Re: High Packet Loss/CRC errors for Tx on axgbe (DEC2750/OPNsense 22.1)
Post by: burly on April 08, 2022, 02:58:09 am
Tonight I tried direct connecting the DEC2750 (aka fw) port ax1 to another 10Gbps device with a known-good DAC. The results showed no packet loss over TCP, but the bandwidth is again limited to 1.7-1.9Gbps bidirectionally. UDP managed only ~2.6Gbps, and with high packet loss.

I may try and boot off a USB key into Linux this weekend to see if I can get different results in a different OS. If so, I can pursue downgrading to OPNsense 21.7.

fw ax1 configuration:
Code: [Select]
ax1: xgbe_config_sph_mode: SPH disabled in channel 0
ax1: xgbe_config_sph_mode: SPH disabled in channel 1
ax1: xgbe_config_sph_mode: SPH disabled in channel 2
ax1: RSS Enabled
ax1: Receive checksum offload Disabled
ax1: VLAN filtering Disabled
ax1: VLAN Stripping Disabled
ax1: Checking GPIO expander validity
ax1: SFP detected:
ax1:   vendor:   Mellanox
ax1:   part number:    MCP2100-X003B
ax1:   revision level: A1
ax1:   serial number:  MT1416VS02297
ax1: link state changed to DOWN
ax1: Link is UP - 10Gbps/Full - flow control off
ax1: link state changed to UP

TCP Results
fw ax1 as Sender
Code: [Select]
root@fw:~ # iperf3 -c 172.16.200.2
Connecting to host 172.16.200.2, port 5201
[  5] local 172.16.200.1 port 46924 connected to 172.16.200.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   228 MBytes  1.91 Gbits/sec    0   3.00 MBytes
[  5]   1.00-2.00   sec   201 MBytes  1.68 Gbits/sec    0   3.00 MBytes
[  5]   2.00-3.00   sec   217 MBytes  1.82 Gbits/sec    0   3.00 MBytes
[  5]   3.00-4.00   sec   213 MBytes  1.79 Gbits/sec    0   3.00 MBytes
[  5]   4.00-5.00   sec   209 MBytes  1.75 Gbits/sec    0   3.00 MBytes
[  5]   5.00-6.00   sec   228 MBytes  1.92 Gbits/sec    0   3.00 MBytes
[  5]   6.00-7.00   sec   170 MBytes  1.43 Gbits/sec    0   3.00 MBytes
[  5]   7.00-8.00   sec   220 MBytes  1.85 Gbits/sec    0   3.00 MBytes
[  5]   8.00-9.00   sec   196 MBytes  1.64 Gbits/sec    0   3.00 MBytes
[  5]   9.00-10.00  sec   221 MBytes  1.85 Gbits/sec    0   3.00 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.05 GBytes  1.76 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  2.05 GBytes  1.76 Gbits/sec                  receiver

fw ax1 as Receiver
Code: [Select]
root@fw:~ # iperf3 -c 172.16.200.2 -R
Connecting to host 172.16.200.2, port 5201
Reverse mode, remote host 172.16.200.2 is sending
[  5] local 172.16.200.1 port 59337 connected to 172.16.200.2 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   204 MBytes  1.71 Gbits/sec
[  5]   1.00-2.00   sec   213 MBytes  1.79 Gbits/sec
[  5]   2.00-3.00   sec   207 MBytes  1.74 Gbits/sec
[  5]   3.00-4.00   sec   218 MBytes  1.83 Gbits/sec
[  5]   4.00-5.00   sec   213 MBytes  1.78 Gbits/sec
[  5]   5.00-6.00   sec   211 MBytes  1.77 Gbits/sec
[  5]   6.00-7.00   sec   210 MBytes  1.77 Gbits/sec
[  5]   7.00-8.00   sec   213 MBytes  1.79 Gbits/sec
[  5]   8.00-9.00   sec   210 MBytes  1.76 Gbits/sec
[  5]   9.00-10.00  sec   210 MBytes  1.76 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.06 GBytes  1.77 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  2.06 GBytes  1.77 Gbits/sec                  receiver

UDP Results
fw ax1 as Sender
Code: [Select]
root@fw:~ # iperf3 -c 172.16.200.2 -u -b 9000M
Connecting to host 172.16.200.2, port 5201
[  5] local 172.16.200.1 port 34911 connected to 172.16.200.2 port 5201
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[  5]   0.00-1.00   sec   219 MBytes  1.84 Gbits/sec  157390
[  5]   1.00-2.00   sec   215 MBytes  1.80 Gbits/sec  154532
[  5]   2.00-3.00   sec   234 MBytes  1.97 Gbits/sec  168399
[  5]   3.00-4.00   sec   233 MBytes  1.96 Gbits/sec  167659
[  5]   4.00-5.00   sec   234 MBytes  1.96 Gbits/sec  167797
[  5]   5.00-6.00   sec   235 MBytes  1.98 Gbits/sec  169111
[  5]   6.00-7.00   sec   235 MBytes  1.97 Gbits/sec  168725
[  5]   7.00-8.00   sec   233 MBytes  1.96 Gbits/sec  167502
[  5]   8.00-9.00   sec   235 MBytes  1.97 Gbits/sec  168753
[  5]   9.00-10.00  sec   234 MBytes  1.96 Gbits/sec  167758
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-10.00  sec  2.25 GBytes  1.94 Gbits/sec  0.000 ms  0/1657626 (0%)  sender
[  5]   0.00-10.00  sec  2.25 GBytes  1.94 Gbits/sec  0.003 ms  269/1657626 (0.016%)  receiver

fw ax1 as Receiver
Code: [Select]
root@fw:~ # iperf3 -c 172.16.200.2 -u -b 9000M -R
Connecting to host 172.16.200.2, port 5201
Reverse mode, remote host 172.16.200.2 is sending
[  5] local 172.16.200.1 port 64097 connected to 172.16.200.2 port 5201
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-1.00   sec   315 MBytes  2.64 Gbits/sec  0.003 ms  1589/228030 (0.7%)
[  5]   1.00-2.00   sec   317 MBytes  2.66 Gbits/sec  0.003 ms  770/228374 (0.34%)
[  5]   2.00-3.00   sec   312 MBytes  2.62 Gbits/sec  0.004 ms  1234/225533 (0.55%)
[  5]   3.00-4.00   sec   295 MBytes  2.47 Gbits/sec  0.003 ms  15494/227089 (6.8%)
[  5]   4.00-5.00   sec   309 MBytes  2.59 Gbits/sec  0.003 ms  333/222378 (0.15%)
[  5]   5.00-6.00   sec   304 MBytes  2.55 Gbits/sec  0.004 ms  9455/227482 (4.2%)
[  5]   6.00-7.00   sec   312 MBytes  2.62 Gbits/sec  0.004 ms  1488/225716 (0.66%)
[  5]   7.00-8.00   sec   300 MBytes  2.51 Gbits/sec  0.003 ms  7986/223226 (3.6%)
[  5]   8.00-9.00   sec   311 MBytes  2.61 Gbits/sec  0.004 ms  878/224257 (0.39%)
[  5]   9.00-10.00  sec   321 MBytes  2.69 Gbits/sec  0.005 ms  1184/231395 (0.51%)
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-10.00  sec  3.08 GBytes  2.64 Gbits/sec  0.000 ms  0/2263509 (0%)  sender
[  5]   0.00-10.00  sec  3.02 GBytes  2.60 Gbits/sec  0.005 ms  40411/2263480 (1.8%)  receiver

fw Interface Statistics
Code: [Select]
root@fw:~ # ifconfig ax1
ax1: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: Test
        options=4800028<VLAN_MTU,JUMBO_MTU,NOMAP>
        ether f4:90:ea:00:73:4b
        inet 172.16.200.1 netmask 0xffffff00 broadcast 172.16.200.255
        media: Ethernet autoselect (10GBase-SFI <full-duplex,rxpause,txpause>)
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>

root@fw:~ # netstat -i log | grep -iE 'Name|ax1'
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
ax1    1500 <Link#5>      f4:90:ea:00:73:4b 14228226     0     0 15969806     0     0
ax1       - 172.16.200.0/ fw                14231375     -     - 15974070     -     -

root@fw:~ # netstat -ihw 1
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
       130     0     0        28K        133     0        28K     0
      1.5k     0     0       1.6M       1.5k     0       1.7M     0
      2.4k     0     0       3.3M       2.4k     0       222K     0
      151k     0     0       218M       151k     0        10M     0
      151k     0     0       218M       151k     0        10M     0
      150k     0     0       216M       150k     0        11M     0
      151k     0     0       218M       151k     0        10M     0
      151k     0     0       219M       151k     0        10M     0
      153k     0     0       221M       153k     0        12M     0
      151k     0     0       218M       151k     0        10M     0
      152k     0     0       220M       152k     0        10M     0
      152k     0     0       220M       152k     0        10M     0
      153k     0     0       221M       153k     0        10M     0
      1.3k     0     0       1.5M       1.3k     0       1.5M     0
       39k     0     0       2.6M        82k     0       118M     0
       76k     0     0       5.1M       175k     0       253M     0
       75k     0     0       5.0M       170k     0       247M     0
       72k     0     0       4.8M       178k     0       258M     0
       76k     0     0       5.1M       166k     0       239M     0
       70k     0     0       4.7M       152k     0       220M     0
       66k     0     0       4.5M       149k     0       216M     0

pve4 Interface Statistics
Code: [Select]
root@pve4:~# ifconfig enp6s0d1
enp6s0d1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.16.200.2  netmask 255.255.255.0  broadcast 0.0.0.0
        inet6 fe80::202:c9ff:fe0e:9ce9  prefixlen 64  scopeid 0x20<link>
        ether 00:02:c9:0e:9c:e9  txqueuelen 1000  (Ethernet)
        RX packets 15974098  bytes 17522362543 (16.3 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 14231333  bytes 17446822814 (16.2 GiB)
        TX errors 0  dropped 1 overruns 0  carrier 0  collisions 0

root@pve4:~# netstat -i log | grep -iE 'Iface|enp6s0d1'
Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
enp6s0d1  1500 15974098      0      0 0      14231333      0      1      0 BMR
Title: Re: High Packet Loss/CRC errors for Tx on axgbe (DEC2750/OPNsense 22.1)
Post by: burly on April 08, 2022, 03:10:38 am
Thought: This person is sending traffic through the FW, which introduces a whole lot of additional points where throughput can be reduced, yet it's interesting that they are seeing the same specific range of values (~1.8Gbps, occasionally bursting to 2.4Gbps). https://www.reddit.com/r/OPNsenseFirewall/comments/s6zu4b/help_with_bad_performance_on_dec2750_opnsense/

The 2.4Gbps value is suspiciously close to 1/4th the expected speed (~9.6Gbps). Is there any chance that the multi-queues are not actually being multi-processed by the kernel and thus we are only processing on one core at a time?


Code: [Select]
last pid: 73684;  load averages:  1.34,  0.47,  0.28                                                                                                                                                                                               up 0+01:09:53  01[0/515]
208 threads:   11 running, 167 sleeping, 30 waiting
CPU:  0.3% user,  0.0% nice, 24.8% system,  0.0% interrupt, 74.8% idle
Mem: 162M Active, 35M Inact, 400M Wired, 116M Buf, 7261M Free
Swap: 8478M Total, 8478M Free

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
   11 root        155 ki31     0B   128K CPU3     3  69:02 100.00% idle{idle: cpu3}
    0 root        -76    -     0B  1008K CPU2     2   1:39  99.92% kernel{if_io_tqg_2}
   11 root        155 ki31     0B   128K RUN      0  68:15  97.34% idle{idle: cpu0}
   11 root        155 ki31     0B   128K CPU1     1  69:34  97.28% idle{idle: cpu1}
73684 root        100    0    17M  6184K CPU6     6   0:27  96.72% iperf3
   11 root        155 ki31     0B   128K CPU7     7  68:34  95.35% idle{idle: cpu7}
   11 root        155 ki31     0B   128K CPU4     4  68:19  91.80% idle{idle: cpu4}
   11 root        155 ki31     0B   128K CPU5     5  68:45  81.63% idle{idle: cpu5}
   11 root        155 ki31     0B   128K RUN      6  67:52  29.06% idle{idle: cpu6}
    0 root        -92    -     0B  1008K -        4   1:01   1.44% kernel{axgbe dev taskq}
    0 root        -92    -     0B  1008K -        4   1:00   1.44% kernel{axgbe dev taskq}
   12 root        -72    -     0B   480K WAIT     5   0:01   0.70% intr{swi1: pfsync}
    0 root        -92    -     0B  1008K -        0   0:34   0.40% kernel{dummynet}
    6 root        -16    -     0B    16K -        4   0:04   0.18% rand_harvestq
    0 root        -76    -     0B  1008K -        0   0:48   0.16% kernel{if_io_tqg_0}
    0 root        -76    -     0B  1008K -        4   0:28   0.16% kernel{if_io_tqg_4}

Here is someone else reporting similar behavior (and discussing iflib, which is in play in my situation as well):
https://forum.opnsense.org/index.php?topic=18754.30

Update: This iflib issue reported against FreeBSD 12, although involving a vmx NIC, is in a similar vein: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=237166

Here is the relevant info from my device:
Code: [Select]
root@fw:~ # dmesg | grep -iE 'ax0|ax1' | grep -C2 queues
ax0: <AMD 10 Gigabit Ethernet Driver> mem 0xd0060000-0xd007ffff,0xd0040000-0xd005ffff,0xd0082000-0xd0083fff at device 0.1 on pci6
ax0: Using 2048 TX descriptors and 2048 RX descriptors
ax0: Using 3 RX queues 3 TX queues
ax0: Using MSI-X interrupts with 7 vectors
ax0: Ethernet address: f4:90:ea:00:73:4a
--
ax0:   revision level: A1
ax0:   serial number:  MT1403VS18803
ax0: netmap queues/slots: TX 3/2048, RX 3/2048
ax1: <AMD 10 Gigabit Ethernet Driver> mem 0xd0020000-0xd003ffff,0xd0000000-0xd001ffff,0xd0080000-0xd0081fff at device 0.2 on pci6
ax1: Using 2048 TX descriptors and 2048 RX descriptors
ax1: Using 3 RX queues 3 TX queues
ax1: Using MSI-X interrupts with 7 vectors

Code: [Select]
root@fw:~ # sysctl -a | grep override
dev.ax.1.iflib.override_nrxds: 2048
dev.ax.1.iflib.override_ntxds: 2048
dev.ax.1.iflib.override_qs_enable: 0
dev.ax.1.iflib.override_nrxqs: 0
dev.ax.1.iflib.override_ntxqs: 0
dev.ax.0.iflib.override_nrxds: 2048
dev.ax.0.iflib.override_ntxds: 2048
dev.ax.0.iflib.override_qs_enable: 0
dev.ax.0.iflib.override_nrxqs: 0
dev.ax.0.iflib.override_ntxqs: 0


**UPDATE** I tried to adapt the nic-queue-usage script found here to the ax driver, but it doesn't appear that iflib provides per-RX-queue packet stats?
https://github.com/ocochard/BSDRP/blob/master/BSDRP/Files/usr/local/bin/nic-queue-usage

Code: [Select]
root@fw:~ # sysctl dev.ax.1.iflib | grep -i rxq
dev.ax.1.iflib.rxq2.rxq_fl0.buf_size: 2048
dev.ax.1.iflib.rxq2.rxq_fl0.credits: 2047
dev.ax.1.iflib.rxq2.rxq_fl0.cidx: 1557
dev.ax.1.iflib.rxq2.rxq_fl0.pidx: 1556
dev.ax.1.iflib.rxq2.cpu: 2
dev.ax.1.iflib.rxq1.rxq_fl0.buf_size: 2048
dev.ax.1.iflib.rxq1.rxq_fl0.credits: 2047
dev.ax.1.iflib.rxq1.rxq_fl0.cidx: 974
dev.ax.1.iflib.rxq1.rxq_fl0.pidx: 973
dev.ax.1.iflib.rxq1.cpu: 0
dev.ax.1.iflib.rxq0.rxq_fl0.buf_size: 2048
dev.ax.1.iflib.rxq0.rxq_fl0.credits: 2047
dev.ax.1.iflib.rxq0.rxq_fl0.cidx: 703
dev.ax.1.iflib.rxq0.rxq_fl0.pidx: 702
dev.ax.1.iflib.rxq0.cpu: 6
dev.ax.1.iflib.override_nrxqs: 0
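
Lacking per-queue packet counters, one crude workaround (a sketch, not the nic-queue-usage script itself) is to sample each RX queue's free-list consumer index during a test and see which queues move at all:
Code: [Select]
# Sample each rxq's cidx twice, a second apart, and print how far the ring
# index advanced (mod the 2048-slot ring). At 10Gbps this wraps and is not an
# exact packet rate, but it does show which queues are active.
for q in 0 1 2; do
  a=$(sysctl -n dev.ax.1.iflib.rxq${q}.rxq_fl0.cidx)
  sleep 1
  b=$(sysctl -n dev.ax.1.iflib.rxq${q}.rxq_fl0.cidx)
  echo "rxq${q}: moved $(( (b - a + 2048) % 2048 )) slots"
done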
Title: Re: High Packet Loss/CRC errors for Tx on axgbe (DEC2750/OPNsense 22.1)
Post by: burly on April 08, 2022, 04:04:26 am
OK, getting somewhere:

I made the following changes to both ax0 (LAN) and ax1 (test) and rebooted. LAN is the primary interface that I care about and is the one connected to the switch; ax1 is direct-connected to another 10G host for testing purposes.
- Disabled flow control on rx and tx
- Enabled hardware TCP segmentation offload

Note that both hardware checksum offload and hardware large receive offload were left disabled.
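
Roughly, those changes correspond to something like the following (a sketch; the tunable and ifconfig forms are assumptions about how the GUI settings map down to the driver):
Code: [Select]
# Flow control off on both ports (tunables, applied after reboot):
dev.ax.0.rx_pause="0"
dev.ax.0.tx_pause="0"
dev.ax.1.rx_pause="0"
dev.ax.1.tx_pause="0"

# TSO on, checksum offload and LRO left off (equivalent ifconfig form):
ifconfig ax0 tso -rxcsum -txcsum -lro
ifconfig ax1 tso -rxcsum -txcsum -lro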

I was then able to:
 - Send at 9.4Gbps on ax1 with no Retx. Kernel times for if_io_tqg_2 were around 12-19%
 - Send at 8.25Gbps on ax0, but still with Retx (although fewer of them than when this all started). Kernel times for if_io_tqg_2 were around 18-24%

However, receiving is still underperforming:
 - Receive @ 2.32Gbps on ax1 with no Retx. Kernel times for if_io_tqg_2 were around 97-100%
 - Receive @ 2.32Gbps on ax0 with no Retx. Kernel times for if_io_tqg_2 were around 97-100%


Per the documentation (https://docs.opnsense.org/manual/interfaces_settings.html), all three offloading options should be disabled.

Two thoughts:
 1. Either a single core on this machine (DEC2750 -> AMD Ryzen V1500B) is expected to be able to handle full 10Gbps traffic on the interface, and the code path it is presently taking is slower than expected,
 2. Or the manner in which the system is expected to achieve this throughput with offloading disabled is through the use of multiple CPU cores.

My expectation is that multiple threads will be processing multiple queues across multiple cores to achieve the necessary throughput. This is why I find it so suspicious that a single kernel thread is being pegged out while all the other cores are basically idle. One thought I had was that Receive Side Scaling (RSS) may be forcing all the packets from this single TCP stream into a single queue for locality, thus effectively making this a single-threaded activity. More testing is needed; however, I would expect that if this were solely the issue, running 4 parallel streams should result in ~4x the throughput. My quick tests with TCP -P4 and UDP -P1/-P4 with iperf3 don't show 4x, but rather a 1.5-2x increase, with just two cores seeing utilization. I need to set up some better tests to investigate this line of thinking.
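
For reference, the parallel-stream tests look roughly like this (a sketch; the target IP and the per-stream UDP rate are illustrative):
Code: [Select]
# Four parallel TCP streams to the LAN test host:
iperf3 -c 172.16.5.14 -P 4

# Four parallel UDP streams (iperf3 applies -b per stream, so ~9Gbps aggregate):
iperf3 -c 172.16.5.14 -u -P 4 -b 2250M

# In another session, watch whether more than one if_io_tqg_* thread gets busy:
top -HSP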

**RSS UPDATE**: I tried turning off RSS (dev.ax.0.rss_enabled="0", dev.ax.1.rss_enabled="0") and rebooting. I then re-tested send/receive with both single and parallel streams and observed no improvement. I believe that since both src host:port and dst host:port go into the hash, -P4 should be able to generate different queue targets in the LSB of the hash and thus spread the load across cores. Said more simply, I think this is a valid test, but I'm not fully up to speed on RSS. See here for more details: https://forum.opnsense.org/index.php?topic=24409.0
Title: Re: High Packet Loss/CRC errors for Tx on axgbe (DEC2750/OPNsense 22.1)
Post by: burly on April 09, 2022, 07:03:16 pm
As I'm doing some testing to isolate the issue, it appears it could be related to the broader set of issues seen over the past several years regarding the NIC performance delta between OPNsense and the FreeBSD it's based upon. The overall issue I'm observing on the DEC2750 (a single core pegging out even with parallel streams, throughput limited to 1.8-2.4Gbps) is reproducible in VMs.

Using the base version of FreeBSD on the same hosts with the same guest configuration I see far higher throughput and the usage of multiple CPUs even when processing a single stream with multiple queues available.

I also grabbed the FreeBSD 13.0 kernel from the OPNsense website and booted my FreeBSD VM with it to see if it was the source of the issue, and it did reveal one: the throughput for the FreeBSD-13 VM booting the kernel from the OPNsense site was only half what it is for that same FreeBSD-13 VM when booting the stock kernel! In fact, it is very close in performance to what the stock FreeBSD-13 VM achieves with only a single queue presented to it.

**UPDATE:** Interestingly, pfSense-CE-2.6 exhibits the same throughput issues. It's further hampered by ALTQ as noted below (so I can't make use of multi-queues for virtio like I can for opnsense/FreeBSD).

More to come...

References:
 - https://forum.opnsense.org/index.php?topic=18754.0
 - https://www.reddit.com/r/OPNsenseFirewall/comments/s6zu4b/help_with_bad_performance_on_dec2750_opnsense/
 - https://forum.opnsense.org/index.php?topic=22477.0
 - https://github.com/opnsense/src/issues/119

Test Environment

Baremetal

Name         | OS                 | Kernel                 | CPU          | RAM  | NIC                | IP                                        | Notes
pve1         | Proxmox 7.1-1      | 5.13.19-14-pve         | E3-1270v2    | 32GB | Mellanox ConnectX2 | 172.16.5.3                                | ethtool -K vmbr1 tx off gso off
pve2         | Proxmox 7.1-1      | 5.13.19-14-pve         | E3-1270v2    | 32GB | Mellanox ConnectX2 | 172.16.5.4                                | ethtool -K vmbr1 tx off gso off
truenas      | TrueNAS Core 12-U8 | 12.2-RELEASE-p12 amd64 | E3-1240v5    | 64GB | Mellanox ConnectX2 | 172.16.200.2 (DAC to pve4)                |
fw (DEC2750) | OPNsense 22.1.5    | 13.0-STABLE            | Ryzen V1500B | 8GB  | AMD 10 Gigabit     | 172.16.5.1, 172.16.200.1 (DAC to truenas) |


VMs
Name         | OS                 | Kernel              | CPU               | RAM | NIC             | IP          | Notes
tank         | Ubuntu 20.04.4 LTS | 5.4.0-100-generic   | 1 vCPU IvyBridge  | 2GB | virtio 2 queues | 172.16.5.37 | Runs on pve1
mgmt         | Ubuntu 20.04.4 LTS | 5.13.0-39-generic   | 2 vCPU IvyBridge  | 4GB | virtio 2 queues | 172.16.5.36 | Runs on pve4
mgmt-clone   | Ubuntu 20.04.4 LTS | 5.13.0-39-generic   | 2 vCPU IvyBridge  | 4GB | virtio 2 queues | 172.16.5.57 | Runs on pve2
freebsd-13   | FreeBSD 13.0       | releng/13.0-n244733 | 2 vCPUs           | 4GB | virtio 2 queues | 172.16.5.59 | Runs on pve4
opnsense22.1 | OPNsense 22.1      | stable/22.1-n248059 | 4 vCPUs IvyBridge | 4GB | virtio 2 queues | 172.16.6.1  | Runs on pve4
opnsense21.7 | OPNsense 21.7      |                     | 4 vCPUs IvyBridge | 4GB | virtio 2 queues | 172.16.6.1  | Runs on pve4
pfsenseCE2.6 | pfSense 2.6.0      | 12.3-STABLE         | 4 vCPUs IvyBridge | 4GB | virtio 2 queues | 172.16.6.1  | Runs on pve4


Test Results

Baremetal
Client  | Server | Protocol | Tx Bitrate | Tx Retr | Rx Bitrate | Rx Retr | Notes
pve2    | pve1   | TCP      | 9.31Gbps   | 4       | 8.68Gbps   | 0       |
pve4    | pve1   | TCP      | 7.17Gbps   | 0       | 8.49Gbps   | 430     |
truenas | pve4   | TCP      | 9.24Gbps   | 0       | 9.31Gbps   | 0       | Hardware Offloading Enabled; {mlxen1 rx cq} <1% cpu; intr{mlx4_core0} 22% cpu
truenas | pve4   | TCP      | 9.08Gbps   | 3       | 9.30Gbps   | 231     | Hardware Offloading Disabled; {mlxen1 rx cq} 56% cpu; intr{mlx4_core0} 22% cpu
fw      | pve4   | TCP      | 1.87Gbps   | 0       | 1.63Gbps   | 0       | Hardware Offloading Disabled; kernel{if_io_tqg_4} 100% cpu

VMs
Client        | Server     | Protocol | Tx Bitrate | Tx Retr | Rx Bitrate | Rx Retr | Notes
tank          | mgmt       | TCP      | 9.31Gbps   | 38439   | 9.34Gbps   | 21532   |
mgmt-clone    | mgmt       | TCP      | 9.30Gbps   | 68001   | 9.36Gbps   | 10267   |
tank          | mgmt-clone | TCP      | 9.24Gbps   | 42137   | 9.27Gbps   | 63471   |
freebsd-13    | mgmt-clone | TCP      | 9.22Gbps   | 3927    | 9.00Gbps   | 75984   | Hardware Offloading Enabled; intr{irq27: virtio_pci3} 30% CPU; 1 queue
freebsd-13    | mgmt-clone | TCP      | 4.16Gbps   | 590     | 4.80Gbps   | 3566    | Hardware Offloading Disabled; intr{irq27: virtio_pci3} 65% CPU; 1 queue
freebsd-13    | mgmt-clone | TCP      | 9.27Gbps   | 4108    | 9.03Gbps   | 69501   | Hardware Offloading Enabled; intr{irq27: virtio_pci3} 33% CPU; 2 queues
freebsd-13    | mgmt-clone | TCP      | 9.26Gbps   | 8761    | 9.00Gbps   | 55472   | Hardware Offloading Disabled; intr{irq27: virtio_pci3} 65% CPU; 2 queues
freebsd-13    | mgmt-clone | TCP      | 4.51Gbps   | 260     | 4.25Gbps   | 3206    | OPNsense Kernel*; Hardware Offloading Disabled; intr{irq27: virtio_pci3} 100%; 2 queues
opnsense-22.1 | mgmt-clone | TCP      | 2.81Gbps   | 222     | 1.64Gbps   | 155     | Hardware Offloading Enabled; intr{irq27: virtio_pci3} 86% CPU; 2 queues
opnsense-22.1 | mgmt-clone | TCP      | 2.62Gbps   | 0       | 1.68Gbps   | 139     | Hardware Offloading Disabled; intr{irq27: virtio_pci3} 97% CPU; 2 queues
opnsense-21.7 | mgmt-clone | TCP      | 5.25Gbps   | 587     | 1.91Gbps   | 88      | Hardware Offloading Enabled; intr{irq27: virtio_pci3} XX% CPU; 2 queues
opnsense-21.7 | mgmt-clone | TCP      | 2.42Gbps   | 0       | 1.61Gbps   | 47      | Hardware Offloading Disabled; intr{irq27: virtio_pci3} xx% CPU; 2 queues
pfsenseCE2.6  | mgmt-clone | TCP      | 8.6Gbps    | 2432    | 1.40Gbps   | 14      | Hardware Offloading Enabled; intr{irq261: virtio_pci2} 53% CPU; 2 queues (but ALTQ **)
pfsenseCE2.6  | mgmt-clone | TCP      | 2.33Gbps   | 1       | 1.40Gbps   | 20      | Hardware Offloading Disabled; intr{irq261: virtio_pci2} 100% CPU; 2 queues (but ALTQ **)

[**] https://forum.netgate.com/topic/138174/pfsense-vtnet-lack-of-queues/8
Title: Re: High Packet Loss/CRC errors for Tx on axgbe (DEC2750/OPNsense 22.1)
Post by: meyergru on April 14, 2022, 05:39:04 pm
Subscribed. I see the bad performance on my DEC750 as well.

Did you compare the sysctl -a outputs to see if there is just a random parameter that limits the OpnSense kernel?

Although there seems to be a lot more than just parameter differences when I look at this comparison: https://hardenedbsd.org/content/easy-feature-comparison
Title: Re: High Packet Loss/CRC errors for Tx on axgbe (DEC2750/OPNsense 22.1)
Post by: Raketenmeyer on April 14, 2022, 06:50:35 pm
What exactly do you mean in that comparison?
Title: Re: High Packet Loss/CRC errors for Tx on axgbe (DEC2750/OPNsense 22.1)
Post by: Vesalius on April 14, 2022, 08:07:57 pm
What exactly do you mean in that comparison?
OPNsense no longer uses HardenedBSD as of 22.1, so this is an OPNsense FreeBSD 13 kernel versus the native FreeBSD 13 kernel difference.
Title: Re: High Packet Loss/CRC errors for Tx on axgbe (DEC2750/OPNsense 22.1)
Post by: meyergru on April 14, 2022, 10:27:37 pm
Oh, I just saw that I misread the OpenBSD column for OpnSense in that comparison...
Title: Re: High Packet Loss/CRC errors for Tx on axgbe (DEC2750/OPNsense 22.1)
Post by: burly on April 15, 2022, 12:10:18 am
Did you compare the sysctl -a outputs to see if there is just a random parameter that limits the OpnSense kernel?

Yes, in fact, I have done that; I just forgot to post about it! It's a pretty long list of deltas, but nothing stuck out to me. The filtered files were produced by grepping for `vtnet|virtio`.

The OPNsense VM has two NICs while the FreeBSD VM has a single NIC, so there are a bunch of extra entries for vtnet1 in the OPNsense file. Which makes me realize that I should test FreeBSD with a second NIC present to see if that has any effect on its performance.
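
For reference, a sketch of how such a filtered comparison can be generated (the exact commands are an assumption; the file names match the diff below):
Code: [Select]
# On each VM, dump the sysctl tree and keep only the virtio/vtnet entries:
sysctl -a | grep -E 'vtnet|virtio' > freebsd-13.0-filtered.sysctl    # FreeBSD VM
sysctl -a | grep -E 'vtnet|virtio' > opnsense-22.1-filtered.sysctl   # OPNsense VM

# Compare the two dumps:
diff -u freebsd-13.0-filtered.sysctl opnsense-22.1-filtered.sysctl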

Code: [Select]
diff --git a/freebsd-13.0/freebsd-13.0-filtered.sysctl b/opnsense-22.1/opnsense-22.1-filtered.sysctl
index e69a004..c36435b 100644
--- a/freebsd-13.0/freebsd-13.0-filtered.sysctl
+++ b/opnsense-22.1/opnsense-22.1-filtered.sysctl
@@ -1,26 +1,36 @@
-000.001395 [ 450] vtnet_netmap_attach       vtnet attached txq=1, txd=256 rxq=1, rxd=128
-586.211720 [ 450] vtnet_netmap_attach       vtnet attached txq=2, txd=256 rxq=2, rxd=128
-951.722084 [ 450] vtnet_netmap_attach       vtnet attached txq=1, txd=256 rxq=1, rxd=128
-<118>Apr  8 21:26:23 freebsd-13 dhclient[417]: Interface vtnet0 no longer appears valid.
-<118>Apr  8 21:26:23 freebsd-13 dhclient[417]: ioctl(SIOCGIFFLAGS) on vtnet0: Operation not permitted
-<118>Apr  8 21:26:23 freebsd-13 dhclient[417]: receive_packet failed on vtnet0: Device not configured
-<118>Apr  8 21:32:29 freebsd-13 dhclient[1463]: Interface vtnet0 no longer appears valid.
-<118>Apr  8 21:32:29 freebsd-13 dhclient[1463]: ioctl(SIOCGIFFLAGS) on vtnet0: Operation not permitted
-<118>Apr  8 21:32:29 freebsd-13 dhclient[1463]: receive_packet failed on vtnet0: Device not configured
-<118>DHCPDISCOVER on vtnet0 to 255.255.255.255 port 67 interval 6
-<118>DHCPREQUEST on vtnet0 to 255.255.255.255 port 67
-<118>Starting Network: lo0 vtnet0.
-<118>vtnet0: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
-<6>vtnet0: Ethernet address: 86:c3:9d:51:ba:5f
-<6>vtnet0: Ethernet address: 86:c3:9d:51:ba:5f
-<6>vtnet0: Ethernet address: 86:c3:9d:51:ba:5f
+000.001735 [ 450] vtnet_netmap_attach       vtnet attached txq=2, txd=256 rxq=2, rxd=128
+000.001735 [ 450] vtnet_netmap_attach       vtnet attached txq=2, txd=256 rxq=2, rxd=128
+000.001735 [ 450] vtnet_netmap_attach       vtnet attached txq=2, txd=256 rxq=2, rxd=128
+000.001736 [ 450] vtnet_netmap_attach       vtnet attached txq=2, txd=256 rxq=2, rxd=128
+000.001736 [ 450] vtnet_netmap_attach       vtnet attached txq=2, txd=256 rxq=2, rxd=128
+000.001736 [ 450] vtnet_netmap_attach       vtnet attached txq=2, txd=256 rxq=2, rxd=128
+<118> LAN (vtnet1)    -> v4: 172.16.6.1/24
+<118> LAN (vtnet1)    -> v4: 172.16.6.1/24
+<118> LAN (vtnet1)    -> v4: 172.16.6.1/24
+<118> WAN (vtnet0)    -> v4/DHCP4: 172.16.5.58/24
+<118> WAN (vtnet0)    -> v4/DHCP4: 172.16.5.58/24
+<118> WAN (vtnet0)    -> v4/DHCP4: 172.16.5.58/24
+<118>Reconfiguring IPv4 on vtnet0
+<118>Reconfiguring IPv4 on vtnet0
+<118>Reconfiguring IPv4 on vtnet0
+<6>vtnet0: Ethernet address: b2:6c:3a:1c:ce:cf
+<6>vtnet0: Ethernet address: b2:6c:3a:1c:ce:cf
+<6>vtnet0: Ethernet address: b2:6c:3a:1c:ce:cf
 <6>vtnet0: link state changed to UP
 <6>vtnet0: link state changed to UP
 <6>vtnet0: link state changed to UP
-<6>vtnet0: link state changed to UP
-<6>vtnet0: netmap queues/slots: TX 1/256, RX 1/128
-<6>vtnet0: netmap queues/slots: TX 1/256, RX 1/128
 <6>vtnet0: netmap queues/slots: TX 2/256, RX 2/128
+<6>vtnet0: netmap queues/slots: TX 2/256, RX 2/128
+<6>vtnet0: netmap queues/slots: TX 2/256, RX 2/128
+<6>vtnet1: Ethernet address: d2:db:66:e1:92:5f
+<6>vtnet1: Ethernet address: d2:db:66:e1:92:5f
+<6>vtnet1: Ethernet address: d2:db:66:e1:92:5f
+<6>vtnet1: link state changed to UP
+<6>vtnet1: link state changed to UP
+<6>vtnet1: link state changed to UP
+<6>vtnet1: netmap queues/slots: TX 2/256, RX 2/128
+<6>vtnet1: netmap queues/slots: TX 2/256, RX 2/128
+<6>vtnet1: netmap queues/slots: TX 2/256, RX 2/128
 dev.virtio_pci.%parent:
 dev.virtio_pci.0.%desc: VirtIO PCI (legacy) Balloon adapter
 dev.virtio_pci.0.%driver: virtio_pci
@@ -51,9 +61,17 @@ dev.virtio_pci.3.%driver: virtio_pci
 dev.virtio_pci.3.%location: slot=18 function=0 dbsf=pci0:0:18:0 handle=\_SB_.PCI0.S90_
 dev.virtio_pci.3.%parent: pci0
 dev.virtio_pci.3.%pnpinfo: vendor=0x1af4 device=0x1000 subvendor=0x1af4 subdevice=0x0001 class=0x020000
-dev.virtio_pci.3.host_features: 0x79bfffe7 <RingEventIdx,RingIndirectDesc,AnyLayout,NotifyOnEmpty,CtrlMacAddr,GuestAnnounce,CtrlRxModeExtra,CtrlVLANFilter,CtrlRxMode,CtrlVq,Status,MrgRxBuf,TxUFO,TxTSOECN,TxTSOv6,TxTSOv4,RxUFO,RxLROECN,RxLROv6,RxLROv4,TxGSO,MAC,CtrlRxOffloads,RxChecksum,TxChecksum>
-dev.virtio_pci.3.negotiated_features: 0x3087bbe7 <RingEventIdx,RingIndirectDesc,CtrlMacAddr,CtrlRxMode,CtrlVq,Status,MrgRxBuf,TxTSOECN,TxTSOv6,TxTSOv4,RxLROECN,RxLROv6,RxLROv4,TxGSO,MAC,CtrlRxOffloads,RxChecksum,TxChecksum>
-dev.virtio_pci.3.nvqs: 3
+dev.virtio_pci.3.host_features: 0x79ffffe7 <RingEventIdx,RingIndirectDesc,AnyLayout,NotifyOnEmpty,CtrlMacAddr,Multiqueue,GuestAnnounce,CtrlRxModeExtra,CtrlVLANFilter,CtrlRxMode,CtrlVq,Status,MrgRxBuf,TxUFO,TxTSOECN,TxTSOv6,TxTSOv4,RxUFO,RxLROECN,RxLROv6,RxLROv4,TxGSO,MAC,CtrlRxOffloads,RxChecksum,TxChecksum>
+dev.virtio_pci.3.negotiated_features: 0x30c7b865 <RingEventIdx,RingIndirectDesc,CtrlMacAddr,Multiqueue,CtrlRxMode,CtrlVq,Status,MrgRxBuf,TxTSOECN,TxTSOv6,TxTSOv4,TxGSO,MAC,CtrlRxOffloads,TxChecksum>
+dev.virtio_pci.3.nvqs: 5
+dev.virtio_pci.4.%desc: VirtIO PCI (legacy) Network adapter
+dev.virtio_pci.4.%driver: virtio_pci
+dev.virtio_pci.4.%location: slot=19 function=0 dbsf=pci0:0:19:0 handle=\_SB_.PCI0.S98_
+dev.virtio_pci.4.%parent: pci0
+dev.virtio_pci.4.%pnpinfo: vendor=0x1af4 device=0x1000 subvendor=0x1af4 subdevice=0x0001 class=0x020000
+dev.virtio_pci.4.host_features: 0x79ffffe7 <RingEventIdx,RingIndirectDesc,AnyLayout,NotifyOnEmpty,CtrlMacAddr,Multiqueue,GuestAnnounce,CtrlRxModeExtra,CtrlVLANFilter,CtrlRxMode,CtrlVq,Status,MrgRxBuf,TxUFO,TxTSOECN,TxTSOv6,TxTSOv4,RxUFO,RxLROECN,RxLROv6,RxLROv4,TxGSO,MAC,CtrlRxOffloads,RxChecksum,TxChecksum>
+dev.virtio_pci.4.negotiated_features: 0x30c7b865 <RingEventIdx,RingIndirectDesc,CtrlMacAddr,Multiqueue,CtrlRxMode,CtrlVq,Status,MrgRxBuf,TxTSOECN,TxTSOv6,TxTSOv4,TxGSO,MAC,CtrlRxOffloads,TxChecksum>
+dev.virtio_pci.4.nvqs: 5
 dev.vtballoon.0.%desc: VirtIO Balloon Adapter
 dev.vtballoon.0.%parent: virtio_pci0
 dev.vtblk.0.%desc: VirtIO Block Adapter
@@ -66,10 +84,10 @@ dev.vtnet.0.%driver: vtnet
 dev.vtnet.0.%location:
 dev.vtnet.0.%parent: virtio_pci3
 dev.vtnet.0.%pnpinfo: vendor=0x00001af4 device=0x1000 subvendor=0x1af4 device_type=0x00000001
-dev.vtnet.0.act_vq_pairs: 1
-dev.vtnet.0.max_vq_pairs: 1
+dev.vtnet.0.act_vq_pairs: 2
+dev.vtnet.0.max_vq_pairs: 2
 dev.vtnet.0.mbuf_alloc_failed: 0
-dev.vtnet.0.req_vq_pairs: 1
+dev.vtnet.0.req_vq_pairs: 2
 dev.vtnet.0.rx_csum_bad_ethtype: 0
 dev.vtnet.0.rx_csum_bad_ipproto: 0
 dev.vtnet.0.rx_csum_bad_offset: 0
@@ -80,14 +98,22 @@ dev.vtnet.0.rx_enq_replacement_failed: 0
 dev.vtnet.0.rx_frame_too_large: 0
 dev.vtnet.0.rx_mergeable_failed: 0
 dev.vtnet.0.rx_task_rescheduled: 0
-dev.vtnet.0.rxq0.csum: 1303017
+dev.vtnet.0.rxq0.csum: 263363
 dev.vtnet.0.rxq0.csum_failed: 0
-dev.vtnet.0.rxq0.host_lro: 1085080
-dev.vtnet.0.rxq0.ibytes: 31181915110
+dev.vtnet.0.rxq0.host_lro: 0
+dev.vtnet.0.rxq0.ibytes: 27354672
 dev.vtnet.0.rxq0.ierrors: 0
-dev.vtnet.0.rxq0.ipackets: 1370916
+dev.vtnet.0.rxq0.ipackets: 269781
 dev.vtnet.0.rxq0.iqdrops: 0
-dev.vtnet.0.rxq0.rescheduled: 0
+dev.vtnet.0.rxq0.rescheduled: 67
+dev.vtnet.0.rxq1.csum: 110384
+dev.vtnet.0.rxq1.csum_failed: 0
+dev.vtnet.0.rxq1.host_lro: 0
+dev.vtnet.0.rxq1.ibytes: 5388510337
+dev.vtnet.0.rxq1.ierrors: 0
+dev.vtnet.0.rxq1.ipackets: 3688347
+dev.vtnet.0.rxq1.iqdrops: 0
+dev.vtnet.0.rxq1.rescheduled: 3
 dev.vtnet.0.tx_csum_offloaded: 0
 dev.vtnet.0.tx_csum_proto_mismatch: 0
 dev.vtnet.0.tx_csum_unknown_ethtype: 0
@@ -97,12 +123,74 @@ dev.vtnet.0.tx_task_rescheduled: 0
 dev.vtnet.0.tx_tso_not_tcp: 0
 dev.vtnet.0.tx_tso_offloaded: 0
 dev.vtnet.0.tx_tso_without_csum: 0
-dev.vtnet.0.txq0.csum: 1117351
-dev.vtnet.0.txq0.obytes: 98345391
-dev.vtnet.0.txq0.omcasts: 0
-dev.vtnet.0.txq0.opackets: 1117426
+dev.vtnet.0.txq0.csum: 0
+dev.vtnet.0.txq0.obytes: 3799355110
+dev.vtnet.0.txq0.omcasts: 363
+dev.vtnet.0.txq0.opackets: 2510616
 dev.vtnet.0.txq0.rescheduled: 0
-dev.vtnet.0.txq0.tso: 200
+dev.vtnet.0.txq0.tso: 0
+dev.vtnet.0.txq1.csum: 0
+dev.vtnet.0.txq1.obytes: 232421133
+dev.vtnet.0.txq1.omcasts: 1
+dev.vtnet.0.txq1.opackets: 3517901
+dev.vtnet.0.txq1.rescheduled: 0
+dev.vtnet.0.txq1.tso: 0
+dev.vtnet.1.%desc: VirtIO Networking Adapter
+dev.vtnet.1.%driver: vtnet
+dev.vtnet.1.%location:
+dev.vtnet.1.%parent: virtio_pci4
+dev.vtnet.1.%pnpinfo: vendor=0x00001af4 device=0x1000 subvendor=0x1af4 device_type=0x00000001
+dev.vtnet.1.act_vq_pairs: 2
+dev.vtnet.1.max_vq_pairs: 2
+dev.vtnet.1.mbuf_alloc_failed: 0
+dev.vtnet.1.req_vq_pairs: 2
+dev.vtnet.1.rx_csum_bad_ethtype: 0
+dev.vtnet.1.rx_csum_bad_ipproto: 0
+dev.vtnet.1.rx_csum_bad_offset: 0
+dev.vtnet.1.rx_csum_bad_proto: 0
+dev.vtnet.1.rx_csum_failed: 0
+dev.vtnet.1.rx_csum_offloaded: 0
+dev.vtnet.1.rx_enq_replacement_failed: 0
+dev.vtnet.1.rx_frame_too_large: 0
+dev.vtnet.1.rx_mergeable_failed: 0
+dev.vtnet.1.rx_task_rescheduled: 0
+dev.vtnet.1.rxq0.csum: 23387
+dev.vtnet.1.rxq0.csum_failed: 0
+dev.vtnet.1.rxq0.host_lro: 0
+dev.vtnet.1.rxq0.ibytes: 1643315
+dev.vtnet.1.rxq0.ierrors: 0
+dev.vtnet.1.rxq0.ipackets: 23398
+dev.vtnet.1.rxq0.iqdrops: 0
+dev.vtnet.1.rxq0.rescheduled: 0
+dev.vtnet.1.rxq1.csum: 4920
+dev.vtnet.1.rxq1.csum_failed: 0
+dev.vtnet.1.rxq1.host_lro: 0
+dev.vtnet.1.rxq1.ibytes: 440413
+dev.vtnet.1.rxq1.ierrors: 0
+dev.vtnet.1.rxq1.ipackets: 5294
+dev.vtnet.1.rxq1.iqdrops: 0
+dev.vtnet.1.rxq1.rescheduled: 0
+dev.vtnet.1.tx_csum_offloaded: 0
+dev.vtnet.1.tx_csum_proto_mismatch: 0
+dev.vtnet.1.tx_csum_unknown_ethtype: 0
+dev.vtnet.1.tx_defrag_failed: 0
+dev.vtnet.1.tx_defragged: 0
+dev.vtnet.1.tx_task_rescheduled: 0
+dev.vtnet.1.tx_tso_not_tcp: 0
+dev.vtnet.1.tx_tso_offloaded: 0
+dev.vtnet.1.tx_tso_without_csum: 0
+dev.vtnet.1.txq0.csum: 0
+dev.vtnet.1.txq0.obytes: 29429366
+dev.vtnet.1.txq0.omcasts: 0
+dev.vtnet.1.txq0.opackets: 25058
+dev.vtnet.1.txq0.rescheduled: 0
+dev.vtnet.1.txq0.tso: 0
+dev.vtnet.1.txq1.csum: 0
+dev.vtnet.1.txq1.obytes: 48974151
+dev.vtnet.1.txq1.omcasts: 0
+dev.vtnet.1.txq1.opackets: 33751
+dev.vtnet.1.txq1.rescheduled: 0
+dev.vtnet.1.txq1.tso: 0
 device virtio
 device virtio_balloon
 device virtio_blk
@@ -119,24 +207,35 @@ hw.vtnet.mq_max_pairs: 32
 hw.vtnet.rx_process_limit: 1024
 hw.vtnet.tso_disable: 0
 hw.vtnet.tso_maxlen: 65535
-pfil: duplicate head "vtnet0"
-pfil: duplicate head "vtnet0"
 virtio_pci0: <VirtIO PCI (legacy) Balloon adapter> port 0xe080-0xe0bf mem 0xfe400000-0xfe403fff irq 11 at device 3.0 on pci0
-virtio_pci1: <VirtIO PCI (legacy) Console adapter> port 0xe0c0-0xe0ff mem 0xfea51000-0xfea51fff,0xfe404000-0xfe407fff irq 11 at device 8.0 on pci0
-virtio_pci2: <VirtIO PCI (legacy) Block adapter> port 0xe000-0xe07f mem 0xfea52000-0xfea52fff,0xfe408000-0xfe40bfff irq 10 at device 10.0 on pci0
-virtio_pci3: <VirtIO PCI (legacy) Network adapter> at device 18.0 on pci0
-virtio_pci3: <VirtIO PCI (legacy) Network adapter> at device 18.0 on pci0
-virtio_pci3: <VirtIO PCI (legacy) Network adapter> port 0xe120-0xe13f mem 0xfea53000-0xfea53fff,0xfe40c000-0xfe40ffff irq 10 at device 18.0 on pci0
+virtio_pci0: <VirtIO PCI (legacy) Balloon adapter> port 0xe080-0xe0bf mem 0xfe400000-0xfe403fff irq 11 at device 3.0 on pci0
+virtio_pci0: <VirtIO PCI (legacy) Balloon adapter> port 0xe080-0xe0bf mem 0xfe400000-0xfe403fff irq 11 at device 3.0 on pci0
+virtio_pci1: <VirtIO PCI (legacy) Console adapter> port 0xe0c0-0xe0ff mem 0xfea91000-0xfea91fff,0xfe404000-0xfe407fff irq 11 at device 8.0 on pci0
+virtio_pci1: <VirtIO PCI (legacy) Console adapter> port 0xe0c0-0xe0ff mem 0xfea91000-0xfea91fff,0xfe404000-0xfe407fff irq 11 at device 8.0 on pci0
+virtio_pci1: <VirtIO PCI (legacy) Console adapter> port 0xe0c0-0xe0ff mem 0xfea91000-0xfea91fff,0xfe404000-0xfe407fff irq 11 at device 8.0 on pci0
+virtio_pci2: <VirtIO PCI (legacy) Block adapter> port 0xe000-0xe07f mem 0xfea92000-0xfea92fff,0xfe408000-0xfe40bfff irq 10 at device 10.0 on pci0
+virtio_pci2: <VirtIO PCI (legacy) Block adapter> port 0xe000-0xe07f mem 0xfea92000-0xfea92fff,0xfe408000-0xfe40bfff irq 10 at device 10.0 on pci0
+virtio_pci2: <VirtIO PCI (legacy) Block adapter> port 0xe000-0xe07f mem 0xfea92000-0xfea92fff,0xfe408000-0xfe40bfff irq 10 at device 10.0 on pci0
+virtio_pci3: <VirtIO PCI (legacy) Network adapter> port 0xe100-0xe13f mem 0xfea93000-0xfea93fff,0xfe40c000-0xfe40ffff irq 10 at device 18.0 on pci0
+virtio_pci3: <VirtIO PCI (legacy) Network adapter> port 0xe100-0xe13f mem 0xfea93000-0xfea93fff,0xfe40c000-0xfe40ffff irq 10 at device 18.0 on pci0
+virtio_pci3: <VirtIO PCI (legacy) Network adapter> port 0xe100-0xe13f mem 0xfea93000-0xfea93fff,0xfe40c000-0xfe40ffff irq 10 at device 18.0 on pci0
+virtio_pci4: <VirtIO PCI (legacy) Network adapter> port 0xe140-0xe17f mem 0xfea94000-0xfea94fff,0xfe410000-0xfe413fff irq 11 at device 19.0 on pci0
+virtio_pci4: <VirtIO PCI (legacy) Network adapter> port 0xe140-0xe17f mem 0xfea94000-0xfea94fff,0xfe410000-0xfe413fff irq 11 at device 19.0 on pci0
+virtio_pci4: <VirtIO PCI (legacy) Network adapter> port 0xe140-0xe17f mem 0xfea94000-0xfea94fff,0xfe410000-0xfe413fff irq 11 at device 19.0 on pci0
 vm.uma.vtnet_tx_hdr.bucket_size: 254
 vm.uma.vtnet_tx_hdr.bucket_size_max: 254
-vm.uma.vtnet_tx_hdr.domain.0.imax: 254
-vm.uma.vtnet_tx_hdr.domain.0.imin: 254
-vm.uma.vtnet_tx_hdr.domain.0.nitems: 254
+vm.uma.vtnet_tx_hdr.domain.0.bimin: 762
+vm.uma.vtnet_tx_hdr.domain.0.imax: 762
+vm.uma.vtnet_tx_hdr.domain.0.imin: 762
+vm.uma.vtnet_tx_hdr.domain.0.limin: 267
+vm.uma.vtnet_tx_hdr.domain.0.nitems: 762
+vm.uma.vtnet_tx_hdr.domain.0.timin: 2080
 vm.uma.vtnet_tx_hdr.domain.0.wss: 0
 vm.uma.vtnet_tx_hdr.flags: 0x10000<FIRSTTOUCH>
 vm.uma.vtnet_tx_hdr.keg.align: 0
 vm.uma.vtnet_tx_hdr.keg.domain.0.free_items: 128
-vm.uma.vtnet_tx_hdr.keg.domain.0.pages: 6
+vm.uma.vtnet_tx_hdr.keg.domain.0.free_slabs: 0
+vm.uma.vtnet_tx_hdr.keg.domain.0.pages: 9
 vm.uma.vtnet_tx_hdr.keg.efficiency: 98
 vm.uma.vtnet_tx_hdr.keg.ipers: 168
 vm.uma.vtnet_tx_hdr.keg.name: vtnet_tx_hdr
@@ -149,13 +248,16 @@ vm.uma.vtnet_tx_hdr.limit.max_items: 0
 vm.uma.vtnet_tx_hdr.limit.sleepers: 0
 vm.uma.vtnet_tx_hdr.limit.sleeps: 0
 vm.uma.vtnet_tx_hdr.size: 24
-vm.uma.vtnet_tx_hdr.stats.allocs: 29346567
+vm.uma.vtnet_tx_hdr.stats.allocs: 6087327
 vm.uma.vtnet_tx_hdr.stats.current: 1
 vm.uma.vtnet_tx_hdr.stats.fails: 0
-vm.uma.vtnet_tx_hdr.stats.frees: 29346566
+vm.uma.vtnet_tx_hdr.stats.frees: 6087326
 vm.uma.vtnet_tx_hdr.stats.xdomain: 0
 vtnet0: <VirtIO Networking Adapter> on virtio_pci3
 vtnet0: <VirtIO Networking Adapter> on virtio_pci3
 vtnet0: <VirtIO Networking Adapter> on virtio_pci3
-vtnet0: detached
-vtnet0: detached
+vtnet1: <VirtIO Networking Adapter> on virtio_pci4
+vtnet1: <VirtIO Networking Adapter> on virtio_pci4
+vtnet1: <VirtIO Networking Adapter> on virtio_pci4
+vtnet1: vtnet_update_rx_offloads: cannot update Rx features
+vtnet1: vtnet_update_rx_offloads: cannot update Rx features
Title: Re: High Packet Loss/CRC errors for Tx on axgbe (DEC2750/OPNsense 22.1)
Post by: meyergru on April 15, 2022, 12:53:14 pm
Two comments:

1. Just one observation: The negotiated features differ in RX offloading:

Code: [Select]
-dev.virtio_pci.3.negotiated_features: 0x3087bbe7 <RingEventIdx,RingIndirectDesc,CtrlMacAddr,CtrlRxMode,CtrlVq,Status,MrgRxBuf,TxTSOECN,TxTSOv6,TxTSOv4,RxLROECN,RxLROv6,RxLROv4,TxGSO,MAC,CtrlRxOffloads,RxChecksum,TxChecksum>
+dev.virtio_pci.3.negotiated_features: 0x30c7b865 <RingEventIdx,RingIndirectDesc,CtrlMacAddr,Multiqueue,CtrlRxMode,CtrlVq,Status,MrgRxBuf,TxTSOECN,TxTSOv6,TxTSOv4,TxGSO,MAC,CtrlRxOffloads,TxChecksum>

2. By limiting the diff to driver-specific aspects, you miss any other performance-related things, like memory protection, threading settings or circumvention of CPU flaws (e.g. hw.ibrs_disable or hw.spec_store_bypass_disable or net.isr.bindthreads). However, I actually have no clue as to what might be the performance impact of any setting.
Title: Re: High Packet Loss/CRC errors for Tx on axgbe (DEC2750/OPNsense 22.1)
Post by: burly on April 15, 2022, 01:38:39 pm
1. Good point. A few of those I tried altering on the FreeBSD system to see if they made any difference, but I didn't find any. The ones that jump out at me here are LSO/TSO (which I tried changing) and multiqueue (which, from what I've been observing, appears to be more functional in some of the OPNsense tests). I'll specifically see if I can match the OPNsense negotiated features, or at least have them match in ethtool -k, and see if that makes a difference.

UPDATE: See https://forum.opnsense.org/index.php?topic=27828.msg135793

2. Very true. I actually did scan through the complete list but didn’t see anything that looked like it could explain it. I filtered to make something more manageable here, but as you correctly point out that could mask a contributing element. The specific ones you’ve mentioned I can check and I could also do a CPU and memory benchmark test in the VM to see if there are any drastic differences.

UPDATE: Not only did I look through the complete sysctl list pretty thoroughly, I also:
 - Compared /boot/loader.conf as well as copied it over from opnsense 22.1 to FreeBSD + OPNsense kernel
 - Compared kernel modules
 - Compared loaded module lists
 - Compared all of /boot directory

Ultimately, I believe the hardware offloading was the issue *for the VMs*. See #1
Title: Re: High Packet Loss/CRC errors for Tx on axgbe (DEC2750/OPNsense 22.1)
Post by: joellinn on April 15, 2022, 05:09:47 pm
Has anyone in here tried to pull some pmc statistics and see where there might be a delay? I tried briefly but couldn't make sense of the results.
Title: Re: High Packet Loss/CRC errors for Tx on axgbe (DEC2750/OPNsense 22.1)
Post by: burly on April 16, 2022, 04:56:21 am
More details to come tomorrow when I have time to consolidate and write up all these tests (Update - I've added the important supporting data below), but I believe the root cause of the behavior I'm seeing with the VMs is different from what is going on with the DEC2750.

I was able to get 10Gbps line rate working in the OPNsense 22.1 VM by enabling all hardware offloading on the virtio NIC, as long as the traffic was going in and out the same interface. If hardware checksum offloading is enabled, though, then checksums fail when the traffic traverses the NAT. Essentially, I think I'm hitting this: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=235607
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=165059

Disabling hardware offloading avoids the checksum problem, but makes everything slow. I was able to get ~5Gbps/4Gbps working across NAT traversal with the following configuration:

Code: [Select]
# Ensure this returns 1
$ sysctl net.inet.tcp.tso
net.inet.tcp.tso: 1

# Enable  tx checksum, tcp segmentation, and large receive offloading but NOT receive checksum offloading on the WAN device (e.g., vtnet0)
$ ifconfig vtnet0 -rxcsum txcsum tso lro txcsum6 -vlanhwtag -vlanhwtso

# Disable rx & tx checksum, tcp segmentation, and large receive offloading on the LAN device (e.g., vtnet1)
$  ifconfig vtnet1 -rxcsum -txcsum -tso -lro -txcsum6 -vlanhwtag -vlanhwtso

Note that the official docs do mention the possibility of hardware issues with offloading and additionally state that both checksum and TCP segmentation offload need to be disabled if using IPS (https://docs.opnsense.org/manual/interfaces_settings.html) - so take this into account if you are considering turning on TXCSUM/TSO in your OPNsense VM with virtio NICs, or RXCSUM/TXCSUM/TSO with non-virtio NICs.

Investigating throughput via iperf3 through the OPNsense 22.1 VM shows middle-of-the-road performance with all offloading disabled. No significant improvement is seen with most hardware offloading enabled on both interfaces (receive checksum left disabled). Enabling rxcsum at all on either interface results in no TCP connectivity.

[client 1] <---> [ (vtnet1) opnsense 22.1 <== NAT ==> (vtnet0)] <--> [client 2 separate machine]

All offloading disabled
Code: [Select]
# on opnsense-22.1
$ ifconfig vtnet0 -rxcsum -txcsum -tso -lro -txcsum6 -vlanhwtag -vlanhwtso
$ ifconfig vtnet1 -rxcsum -txcsum -tso -lro -txcsum6 -vlanhwtag -vlanhwtso

# on client 1
$ iperf3 -c 172.16.5.57
...
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  6.43 GBytes  5.53 Gbits/sec  3018             sender
[  5]   0.00-10.00  sec  6.43 GBytes  5.52 Gbits/sec                  receiver

$ iperf3 -c 172.16.5.57 -R
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  4.26 GBytes  3.66 Gbits/sec  2144             sender
[  5]   0.00-10.00  sec  4.26 GBytes  3.66 Gbits/sec                  receiver

Only receive offloading disabled
Code: [Select]
# on opnsense-22.1
$ ifconfig vtnet0 -rxcsum txcsum tso lro txcsum6 -vlanhwtag -vlanhwtso
$ ifconfig vtnet1 -rxcsum txcsum tso lro txcsum6 -vlanhwtag -vlanhwtso

# on client 1
$ iperf3 -c 172.16.5.57
...
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  6.35 GBytes  5.46 Gbits/sec  1287             sender
[  5]   0.00-10.00  sec  6.35 GBytes  5.45 Gbits/sec                  receiver

$ iperf3 -c 172.16.5.57 -R
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  4.40 GBytes  3.78 Gbits/sec  845             sender
[  5]   0.00-10.00  sec  4.40 GBytes  3.78 Gbits/sec                  receiver

Receive offloading enabled
Code: [Select]
# on opnsense-22.1
$ ifconfig vtnet0 -rxcsum txcsum tso lro txcsum6 -vlanhwtag -vlanhwtso
$ ifconfig vtnet1 rxcsum txcsum tso lro txcsum6 -vlanhwtag -vlanhwtso

# on client 1
$ iperf3 -c 172.16.5.57
iperf3: error - unable to connect to server: Connection timed out

Only receive offloading enabled
Code: [Select]
# on opnsense-22.1
$ ifconfig vtnet0 -rxcsum -txcsum -tso -lro -txcsum6 -vlanhwtag -vlanhwtso
$ ifconfig vtnet1 rxcsum -txcsum -tso -lro -txcsum6 -vlanhwtag -vlanhwtso

# on client 1
$ iperf3 -c 172.16.5.57
iperf3: error - unable to connect to server: Connection timed out


Investigating via packet capture at the upstream next hop from the WAN interface of the OPNsense VM reveals good checksums with rxcsum offloading disabled and bad ones with it enabled:

[client VM 1] <---> [ (vtnet1) opnsense 22.1 <== NAT ==> (vtnet0)] <--> [upstream hardware firewall]

Receive offloading disabled
Code: [Select]
# on opnsense-22.1
$ ifconfig vtnet0 -rxcsum txcsum tso lro txcsum6 -vlanhwtag -vlanhwtso
$ ifconfig vtnet1 -rxcsum txcsum tso lro txcsum6 -vlanhwtag -vlanhwtso

$ curl https://www.google.com
(immediate full result)

# on upstream firewall
root@fw:~ # tcpdump -nv host 172.16.5.58 -i ax0
tcpdump: listening on ax0, link-type EN10MB (Ethernet), capture size 262144 bytes
03:50:57.729234 IP (tos 0x0, ttl 63, id 26307, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.5.58.56398 > 172.253.122.147.443: Flags [S], cksum 0x0a68 (correct), seq 2786157904, win 64240, options [mss 1460,sackOK,TS val 2576672459 ecr 0,nop,wscale 7], length 0
03:50:57.734756 IP (tos 0x80, ttl 123, id 5969, offset 0, flags [none], proto TCP (6), length 60)
    172.253.122.147.443 > 172.16.5.58.56398: Flags [S.], cksum 0xf663 (correct), seq 3977252634, ack 2786157905, win 65535, options [mss 1430,sackOK,TS val 1789175857 ecr 2576672459,nop,wscale 8], length 0
03:50:57.735172 IP (tos 0x0, ttl 63, id 26308, offset 0, flags [DF], proto TCP (6), length 52)
    172.16.5.58.56398 > 172.253.122.147.443: Flags [.], cksum 0x2317 (correct), ack 1, win 502, options [nop,nop,TS val 2576672465 ecr 1789175857], length 0

Receive offloading enabled
Code: [Select]
# on opnsense-22.1
$ ifconfig vtnet0 -rxcsum txcsum tso lro txcsum6 -vlanhwtag -vlanhwtso
$ ifconfig vtnet1 rxcsum txcsum tso lro txcsum6 -vlanhwtag -vlanhwtso

# on client 1
$ curl https://www.google.com
(hangs, no result)

# on upstream firewall
root@fw:~ # tcpdump -nv host 172.16.5.58 -i ax0
tcpdump: listening on ax0, link-type EN10MB (Ethernet), capture size 262144 bytes
03:56:58.192294 IP (tos 0x0, ttl 63, id 18870, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.5.58.22311 > 172.253.122.103.443: Flags [S], cksum 0x387f (incorrect -> 0xe379), seq 4123969246, win 64240, options [mss 1460,sackOK,TS val 3062744264 ecr 0,nop,wscale 7], length 0
03:56:59.221337 IP (tos 0x0, ttl 63, id 18871, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.5.58.22311 > 172.253.122.103.443: Flags [S], cksum 0x387f (incorrect -> 0xdf74), seq 4123969246, win 64240, options [mss 1460,sackOK,TS val 3062745293 ecr 0,nop,wscale 7], length 0
03:57:01.237466 IP (tos 0x0, ttl 63, id 18872, offset 0, flags [DF], proto TCP (6), length 60)


Originating the traffic directly on the OPNsense VM, to eliminate the NAT traversal, shows full line rate with all hardware offloading enabled; with receive checksum offloading disabled, TX is still at line rate but RX throughput is low:

[opnsense 22.1 (vtnet0)] <--> [upstream hardware firewall]

All hardware offloading enabled
Code: [Select]
# on opnsense-22.1
$ ifconfig vtnet0 rxcsum txcsum tso lro txcsum6 -vlanhwtag -vlanhwtso
$ ifconfig vtnet1 rxcsum txcsum tso lro txcsum6 -vlanhwtag -vlanhwtso

root@opnsense-22:~ # iperf3 -c 172.16.5.57
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  10.8 GBytes  9.24 Gbits/sec  7698             sender
[  5]   0.00-10.00  sec  10.8 GBytes  9.24 Gbits/sec                  receiver

root@opnsense-22:~ # iperf3 -c 172.16.5.57 -R
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  10.8 GBytes  9.27 Gbits/sec  50066             sender
[  5]   0.00-10.00  sec  10.8 GBytes  9.27 Gbits/sec                  receiver


Only receive offloading disabled
Code: [Select]
# on opnsense-22.1
$ ifconfig vtnet0 -rxcsum txcsum tso lro txcsum6 -vlanhwtag -vlanhwtso
$ ifconfig vtnet1 -rxcsum txcsum tso lro txcsum6 -vlanhwtag -vlanhwtso

root@opnsense-22:~ # iperf3 -c 172.16.5.57
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  10.8 GBytes  9.29 Gbits/sec  11963             sender
[  5]   0.00-10.00  sec  10.8 GBytes  9.29 Gbits/sec                  receiver

root@opnsense-22:~ # iperf3 -c 172.16.5.57 -R
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.98 GBytes  1.70 Gbits/sec  189             sender
[  5]   0.00-10.00  sec  1.98 GBytes  1.70 Gbits/sec                  receiver

TODO: Add a second NIC to the vanilla FreeBSD 13.0 VM and perform NAT traversal to see if the issue is present there as well.
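For reference, a minimal sketch of what that vanilla FreeBSD NAT setup could look like (interface names and pf rules are assumptions for illustration, not a tested configuration):
Code: [Select]
# /etc/rc.conf additions (hypothetical: vtnet1 as the inside interface, vtnet0 as outside)
gateway_enable="YES"
pf_enable="YES"

# /etc/pf.conf - minimal outbound NAT from the inside network out vtnet0
ext_if="vtnet0"
int_if="vtnet1"
nat on $ext_if inet from ($int_if:network) to any -> ($ext_if)
pass all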

TODO: Continue/resume the hunt for the DEC2750 performance issue

NOTE: My original tests used the web GUI to enable or disable the various hardware offloading options, then saved and applied the settings on each interface so they would be pushed to the underlying device. Under the hood this is done with ifconfig <interface> <options>. One issue I ran into: on my production VMs, TSO was disabled at the sysctl level via the tunable net.inet.tcp.tso=0 (which ultimately lives in /boot/loader.conf). I re-enabled it and also issued the interface changes manually to verify they had actually taken effect. This is why some of my original OPNsense VMs saw no performance increase even after I enabled hardware offloading in the web GUI.
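A quick sanity check I could have used from the start (not part of the original GUI workflow) is to confirm both the tunable and the per-interface flags directly:
Code: [Select]
# 0 here means TSO is disabled stack-wide regardless of the GUI checkbox
sysctl net.inet.tcp.tso

# look for TSO4/LRO/RXCSUM/TXCSUM in the options= line to see what is actually active
ifconfig vtnet1 | grep options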

NOTE: I found that processor affinity within the VM guest can affect the results. If the iperf3 process gets scheduled on the same core as the kthread handling the network queue, throughput stays low for several seconds before the OS finally moves one of the two to a different core. When this happened, I would either re-run the test or check which core the kthread was on and run iperf3 with -A to pin it to a different core.
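As an example of what that looks like in practice (core numbers are illustrative only):
Code: [Select]
# -S shows system processes, -H shows threads; the C column is the CPU a thread last ran on
top -SH

# pin iperf3 to a core other than the one the vtnet/ax queue kthread is using (core 3 as an example)
iperf3 -c 172.16.5.57 -A 3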

NOTE: I found that turbo boosting of the cores by the host can also affect the results. Before running the "official" test, I would run iperf3 with -t 60 to let it run for 60 seconds and stabilize the throughput, then immediately re-run it to capture comparable throughput and Retr values. This ignores additional complexities such as host processor affinity and hyper-threading, but the procedure produced fairly consistent data.
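Concretely, the warm-up-then-measure procedure was roughly:
Code: [Select]
# warm-up run so clocks/turbo settle; the result is discarded
iperf3 -c 172.16.5.57 -t 60 > /dev/null

# immediately re-run to capture the throughput and Retr values reported above
iperf3 -c 172.16.5.57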
Title: Re: High Packet Loss/CRC errors for Tx on axgbe (DEC2750/OPNsense 22.1)
Post by: tuto2 on April 19, 2022, 11:23:30 am
**RSS UPDATE**: I tried turning off RSS (dev.ax.0.rss_enabled="0", dev.ax.1.rss_enabled="0") and rebooting. I then re-tested send/receive with both single and parallel threads and observed no improvement. Since both src host:port and dst host:port go into the hash, I believe -P4 should generate different queue targets in the LSB of the hash and thus spread the load across cores. Said more simply, I think this is a valid test, but I'm not fully up to speed on RSS. See here for more details: https://forum.opnsense.org/index.php?topic=24409.0

Yes and no; uniqueness cannot be guaranteed for a hash in which only the ports are incremented sequentially (which is the case for iperf3). Also, RSS in the driver does not mean that the correct hash is used. If RSS is disabled in the kernel, the driver actually fills the hardware registers with random bytes to use as a hash; if RSS is enabled in the kernel, the kernel-defined hash is used, which should distribute flows much more evenly (though still with no guarantees).

The AX driver specifically shuts off all multi-queue functionality in the hardware if hardware RSS is disabled and forces everything through a single bottleneck, so always keep it enabled. This is different from other vendors such as Intel, which play by more sophisticated RSS rules.
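For anyone checking their own box, the relevant knobs can be inspected with sysctl; note that the dev.ax.* entries may be boot-time tunables depending on driver version, and net.inet.rss.enabled only exists when the kernel was built with RSS support (both assumptions about your particular build):
Code: [Select]
# per-port hardware RSS in the axgbe driver - keep these at 1
sysctl dev.ax.0.rss_enabled dev.ax.1.rss_enabled

# kernel-level RSS, if compiled in
sysctl net.inet.rss.enabled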
Title: Re: High Packet Loss/CRC errors for Tx on axgbe (DEC2750/OPNsense 22.1)
Post by: tuto2 on April 21, 2022, 10:53:50 am
In an effort to clear the air regarding performance on axgbe (in this case specifically DEC2750 to match the situation as described by the OP), we have set up a standardized test bench in order to potentially catch some of the more performance-degrading changes in the kernel.

Because the world is very complex and there are countless ways to test, misinterpret, and set up a clean environment, we will stick to a single configuration that does not change, except for the OPNsense version, in which simple firewall throughput is measured.

Linux (iperf client) ----> OPNsense ----> Linux (iperf server)

Because single iperf3 tests can be wonky for various reasons, e.g. iperf3 itself (it is a single-threaded application, at least on FreeBSD), throttling, system activity, link partner inconsistencies etc., we measure 5 separate sessions with multiple threads. Also, NICs like certain packet sizes more than others; to account for this, multiple packet sizes are used in the tests.

Regarding the system configuration (hardware offloading, tunables etc.), only the system defaults as delivered by Deciso are used.
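For anyone wanting to approximate the procedure, the measurement loop amounts to something like the sketch below (server address and thread count are placeholders; the actual bench harness is not published here, and the mss values simply mirror the labels in the results):
Code: [Select]
# 5 sessions per MSS value, several parallel streams each; averages are taken over the 5 runs
for mss in 1500 1200 500 100; do
  for run in 1 2 3 4 5; do
    iperf3 -c <iperf-server> -P 4 -M "$mss"
  done
done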

To start, we set a baseline on 21.7.8 (FreeBSD 12.1):
21.7.8
---------------------------------------------------------------------------
[Firewall]
iperf3 mss 1500
bps                  : 8.04 Gbps (avg 7.45 Gbps in 5 tries)
pps                  : 702.37 Kpps (avg 650.72 Kpps in 5 tries)
iperf3 mss 1200
bps                  : 7.90 Gbps (avg 7.81 Gbps in 5 tries)
pps                  : 863.29 Kpps (avg 853.30 Kpps in 5 tries)
iperf3 mss 500
bps                  : 3.16 Gbps (avg 3.10 Gbps in 5 tries)
pps                  : 828.86 Kpps (avg 813.87 Kpps in 5 tries)
iperf3 mss 100
bps                  : 565.02 Mbps (avg 528.56 Mbps in 5 tries)
pps                  : 723.22 Kpps (avg 676.56 Kpps in 5 tries)
netperf latency
mean_latency         : 150.92 Microseconds [RTT]
---------------------------------------------------------------------------

These results fall within expectations for the DEC2750 processor. The Kpps figure is the most important measurement in this setup.
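As a rough consistency check (a back-of-the-envelope reading, not part of the bench output itself): dividing bps by pps gives the effective bytes per packet, which for the mss 1500 row works out to about 1430 bytes, i.e. roughly one MTU-sized packet per counted packet:
Code: [Select]
# 8.04 Gbps / 8 = ~1.005 GB/s; divided by 702370 packets/s gives ~1430 bytes per packet
echo "8.04 * 10^9 / 8 / 702370" | bc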

Next up, we test the different kernel versions starting from 22.1.2:

22.1.2
---------------------------------------------------------------------------
[Firewall]
iperf3 mss 1500
bps                  : 8.76 Gbps (avg 7.65 Gbps in 5 tries)
pps                  : 765.32 Kpps (avg 668.13 Kpps in 5 tries)
iperf3 mss 1200
bps                  : 8.58 Gbps (avg 8.36 Gbps in 5 tries)
pps                  : 937.37 Kpps (avg 912.66 Kpps in 5 tries)
iperf3 mss 500
bps                  : 3.45 Gbps (avg 3.38 Gbps in 5 tries)
pps                  : 903.84 Kpps (avg 886.74 Kpps in 5 tries)
iperf3 mss 100
bps                  : 595.11 Mbps (avg 551.87 Mbps in 5 tries)
pps                  : 761.74 Kpps (avg 706.39 Kpps in 5 tries)
netperf latency
mean_latency         : 148.03 Microseconds [RTT]
---------------------------------------------------------------------------

22.1.4
---------------------------------------------------------------------------
[Firewall]
iperf3 mss 1500
bps                  : 8.75 Gbps (avg 7.92 Gbps in 5 tries)
pps                  : 764.57 Kpps (avg 692.37 Kpps in 5 tries)
iperf3 mss 1200
bps                  : 8.26 Gbps (avg 6.78 Gbps in 5 tries)
pps                  : 902.74 Kpps (avg 740.71 Kpps in 5 tries)
iperf3 mss 500
bps                  : 3.35 Gbps (avg 3.13 Gbps in 5 tries)
pps                  : 878.98 Kpps (avg 820.60 Kpps in 5 tries)
iperf3 mss 100
bps                  : 574.86 Mbps (avg 486.28 Mbps in 5 tries)
pps                  : 735.82 Kpps (avg 622.44 Kpps in 5 tries)
netperf latency
mean_latency         : 148.66 Microseconds [RTT]
---------------------------------------------------------------------------

22.1.5
---------------------------------------------------------------------------
[Firewall]
iperf3 mss 1500
bps                  : 8.75 Gbps (avg 8.35 Gbps in 5 tries)
pps                  : 764.92 Kpps (avg 729.96 Kpps in 5 tries)
iperf3 mss 1200
bps                  : 8.43 Gbps (avg 7.66 Gbps in 5 tries)
pps                  : 920.95 Kpps (avg 836.44 Kpps in 5 tries)
iperf3 mss 500
bps                  : 3.56 Gbps (avg 3.44 Gbps in 5 tries)
pps                  : 933.90 Kpps (avg 901.74 Kpps in 5 tries)
iperf3 mss 100
bps                  : 622.28 Mbps (avg 608.44 Mbps in 5 tries)
pps                  : 796.51 Kpps (avg 778.81 Kpps in 5 tries)
netperf latency
mean_latency         : 169.09 Microseconds [RTT]
---------------------------------------------------------------------------

If anything, performance has increased since the move to FreeBSD 13-STABLE.

Cheers,

Stephan
Title: Re: High Packet Loss/CRC errors for Tx on axgbe (DEC2750/OPNsense 22.1)
Post by: burly on April 24, 2022, 12:29:52 am
Thank you for the update and the testing!

Those numbers are consistent with what I was experiencing with 21.x. You're testing across LAN <-> WAN with pf enabled and NAT with just basic ACLs, right?

I'm going to hopefully have some time to dive into this further on my system this weekend.
Title: Re: High Packet Loss/CRC errors for Tx on axgbe (DEC2750/OPNsense 22.1)
Post by: tuto2 on April 25, 2022, 01:20:45 pm
Those numbers are consistent with what I was experiencing with 21.x. You're testing across LAN <-> WAN with pf enabled and NAT with just basic ACLs, right?

Correct :)
Title: Re: High Packet Loss/CRC errors for Tx on axgbe (DEC2750/OPNsense 22.1)
Post by: sschueller on November 27, 2022, 07:03:44 pm
Were you able to solve this issue?

I appear to be hitting the same problem.

I have a Proxmox host with OPNsense running as a guest. With CRC, TSO and LRO disabled I hit bottlenecks (~5-6 Gbit in one direction and ~7-8 Gbit in the other). With CRC, TSO and LRO enabled I can hit ~9 Gbit from host to router and ~23 Gbit from router to ISP in both directions; however, going across the interfaces (NAT) I get pushed down to ~1 Mbit.
Title: Re: High Packet Loss/CRC errors for Tx on axgbe (DEC2750/OPNsense 22.1)
Post by: meyergru on November 28, 2022, 08:09:23 am
I think your issue is only remotely comparable to this one... first off, how do you install Proxmox on a DEC2750?

The OP did not have Proxmox running, but OPNsense directly on a DEC2750. Even the underlying FreeBSD version is different from yours (i.e. 13.0 vs. 13.1). So I doubt your situation is the same as this one.

Second (because I do not know of any way to make Proxmox run on a DEC2750): what other hardware do you use with axgbe NICs?

I would start a new thread for your question because at least one thing (OS, virtualization, hardware) is different. When you do, please describe your hardware and OPNsense version, and provide as much other info as you have.