High Packet Loss/CRC errors for Tx on axgbe (DEC2750/OPNsense 22.1)

Started by burly, April 07, 2022, 07:10:50 AM

I am experiencing high packet loss when transmitting from ax0 (LAN) to another LAN device on my DEC2750 running OPNsense 22.1. I have poor throughput in both directions (~1.8Gbps as sender, ~1.7Gbps as receiver); however, I am only observing packet loss/retransmits when ax0 is the transmitter.

On my DEC2750 the LAN is ax0 and it is connected to port 8 of a USW-Aggregation 10Gbps switch via a Mellanox MCP2100-X003B DAC. Looking at the switch port I can see the input errors and CRC counts increasing when I run iperf. 


root@fw:~ # iperf3 -c 172.16.5.14
Connecting to host 172.16.5.14, port 5201
[  5] local 172.16.5.1 port 29519 connected to 172.16.5.14 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   180 MBytes  1.51 Gbits/sec  390   29.8 KBytes
[  5]   1.00-2.00   sec   122 MBytes  1.03 Gbits/sec  252   49.8 KBytes
[  5]   2.00-3.00   sec   259 MBytes  2.17 Gbits/sec  508   54.1 KBytes
[  5]   3.00-4.00   sec   255 MBytes  2.14 Gbits/sec  529   25.5 KBytes
[  5]   4.00-5.01   sec   134 MBytes  1.12 Gbits/sec  298    334 KBytes
[  5]   5.01-6.01   sec   192 MBytes  1.61 Gbits/sec  397    781 KBytes
[  5]   6.01-7.00   sec   218 MBytes  1.84 Gbits/sec  434   48.3 KBytes
[  5]   7.00-8.00   sec   117 MBytes   983 Mbits/sec  242   19.9 KBytes
[  5]   8.00-9.00   sec   176 MBytes  1.48 Gbits/sec  326   22.7 KBytes
[  5]   9.00-10.00  sec   215 MBytes  1.81 Gbits/sec  435   44.0 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.82 GBytes  1.57 Gbits/sec  3811             sender
[  5]   0.00-10.00  sec  1.82 GBytes  1.57 Gbits/sec                  receiver



SW-Aggregation# show interfaces TenGigabitEthernet 8
TenGigabitEthernet8 is up
  Hardware is Ten Gigabit Ethernet
  Full-duplex, 10Gb/s, media type is Fiber
  flow-control is off
  back-pressure is enabled
     262840538 packets input, 865223445 bytes, 0 throttles
     Received 2488 broadcasts (0 multicasts)
     0 runts, 477 giants, 0 throttles
     510220 input errors, 509743 CRC, 0 frame
     0 multicast, 0 pause input
     0 input packets with dribble condition detected
     156613060 packets output, 1602945509 bytes, 0 underrun
     644 output errors, 0 collisions
     644 babbles, 0 late collision, 0 deferred
     0 PAUSE output


I don't see any errors/discards at the fw LAN interface (DEC2750 ax0). MTU is 1500 all around.

root@fw:~ # ifconfig ax0
ax0: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: LAN
        options=4800028<VLAN_MTU,JUMBO_MTU,NOMAP>
        ether f4:90:ea:00:73:4a
        inet 172.16.5.1 netmask 0xffffff00 broadcast 172.16.5.255
        media: Ethernet autoselect (10GBase-SFI <full-duplex,rxpause,txpause>)
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>

root@fw:~ # netstat -i log | grep -iE "Name|ax0"
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
ax0    1500 <Link#4>      f4:90:ea:00:73:4a  9077437     0     0 12499343     0     0
ax0       - 172.16.5.0/24 fw                    6778     -     -    12065     -     -


I have verified:
- I can send 9.4Gbps bi-directionally between all other devices connected to the USW-Aggregation switch
- Switch CPU utilization is low (~3-5%)
- iperf3 -u -b 9000M (UDP) shows the same bandwidth and packet loss behavior

Additionally, I verified back in January on OPNsense 21.7 that I could bi-directionally push 9.4Gbps on the LAN interface to other 10GbE devices (and well in excess of 5Gbps across the FW and out ax1).

I have tried:
- Rebooting DEC2750 (no change)
- Rebooting the switch (no change)
- Switching to a known good DAC (no change)
- Putting the original DAC used by the FW on another known-good host (no change - the known-good host can hit 9.4Gbps without issue)
- Changing ports on the switch (no change)
- Switching to ax1 on the DEC2750 (no change)
- Enabling hardware checksum offloading on fw (no change)
- Enabling hardware TCP segmentation offloading on fw (no change)
- Enabling large receive offload on fw (no change)
- Enabling flow control on the switch (no change in throughput, but it does completely eliminate the iperf3 TCP ReTxs)
- Enabling flow control on ax0 (add tunables dev.ax.0.rx_pause=1 and dev.ax.0.tx_pause=1, then reboot - see the sketch below) (no change in throughput, but it eliminates the iperf3 TCP ReTxs)
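
For reference, a minimal sketch of how those flow-control tunables can be set from the shell - I added them via the GUI tunables page, so the loader.conf.local mechanism shown here is an assumption:

# appended to /boot/loader.conf.local (assumed boot-time tunables, hence the reboot)
dev.ax.0.rx_pause="1"
dev.ax.0.tx_pause="1"

Reboot afterwards so the driver picks them up.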

I have not yet tried:
- Direct connecting the FW to another 10Gbps device (Update: see post below on this)
- Downgrading to OPNsense 21.x
- Using a verified tested & working DAC module (e.g. [DAC] UBIQUITI 10G 1M DAC) (Update: arrived and installed, no change)

The only thing that I know of that has changed is the update to OPNsense 22.1 (which is based on FreeBSD 13, whereas 21.x was based on FreeBSD 12). Could this be a potential issue with OPNsense 22.1/FreeBSD 13 and axgbe?


Hardware: DEC2750

Software: OPNsense 22.1.4_1-amd64



$ uname -a
FreeBSD fw 13.0-STABLE FreeBSD 13.0-STABLE stable/22.1-n248063-ac40e064d3c SMP  amd64

$ dmesg | grep -i ax0 
ax0: <AMD 10 Gigabit Ethernet Driver> mem 0xd0060000-0xd007ffff,0xd0040000-0xd005ffff,0xd0082000-0xd0083fff at device 0.1 on pci6
ax0: Using 2048 TX descriptors and 2048 RX descriptors
ax0: Using 3 RX queues 3 TX queues
ax0: Using MSI-X interrupts with 7 vectors
ax0: Ethernet address: f4:90:ea:00:73:4a
ax0: xgbe_config_sph_mode: SPH disabled in channel 0
ax0: xgbe_config_sph_mode: SPH disabled in channel 1
ax0: xgbe_config_sph_mode: SPH disabled in channel 2
ax0: RSS Enabled
ax0: Receive checksum offload Enabled
ax0: VLAN filtering Enabled
ax0: VLAN Stripping Enabled
ax0: Checking GPIO expander validity
ax0: SFP detected:
ax0:   vendor:   Mellanox
ax0:   part number:    MCP2100-X003B
ax0:   revision level: A1
ax0:   serial number:  MT1403VS18803
ax0: netmap queues/slots: TX 3/2048, RX 3/2048


These are the potentially relevant tunables that came modified "out of the box" as delivered by Deciso:

dev.ax.0.iflib.override_nrxds 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048
dev.ax.0.iflib.override_ntxds 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048
dev.ax.0.rss_enabled 1
dev.ax.1.iflib.override_nrxds 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048
dev.ax.1.iflib.override_ntxds 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048
dev.ax.1.rss_enabled 1

This is interesting. I'll take a look at it on my setup once I find some time.

Glossing over your post it seems a downgrade to 21.x would be interesting for comparison.

No virtual interfaces running on top of ax0?

Cheers,

Stephan

That would be great, thank you!

I've ordered one of the Deciso tested & working DAC models ([DAC] UBIQUITI 10G 1M DAC) to try. I'll likely try the direct connect this evening. Downgrading will take a little more setup time, as I need to update the configuration on the backup FW VM and bring it online first, since it's not set up in HA.

Update: Correct, no VLANs on this interface.

Tonight I tried direct connecting the DEC2750 (aka fw) port ax1 to another 10Gbps device with a known good DAC. The results showed no packet loss over TCP, but the bandwidth is again limited to 1.7-1.9Gbps bidirectionally. UDP was able to muster only ~2.6Gbps, and with high packet loss.

I may try and boot off a USB key into Linux this weekend to see if I can get different results in a different OS. If so, I can pursue downgrading to OPNsense 21.7.

fw ax1 configuration:

ax1: xgbe_config_sph_mode: SPH disabled in channel 0
ax1: xgbe_config_sph_mode: SPH disabled in channel 1
ax1: xgbe_config_sph_mode: SPH disabled in channel 2
ax1: RSS Enabled
ax1: Receive checksum offload Disabled
ax1: VLAN filtering Disabled
ax1: VLAN Stripping Disabled
ax1: Checking GPIO expander validity
ax1: SFP detected:
ax1:   vendor:   Mellanox
ax1:   part number:    MCP2100-X003B
ax1:   revision level: A1
ax1:   serial number:  MT1416VS02297
ax1: link state changed to DOWN
ax1: Link is UP - 10Gbps/Full - flow control off
ax1: link state changed to UP


TCP Results
fw ax1 as Sender

root@fw:~ # iperf3 -c 172.16.200.2
Connecting to host 172.16.200.2, port 5201
[  5] local 172.16.200.1 port 46924 connected to 172.16.200.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   228 MBytes  1.91 Gbits/sec    0   3.00 MBytes
[  5]   1.00-2.00   sec   201 MBytes  1.68 Gbits/sec    0   3.00 MBytes
[  5]   2.00-3.00   sec   217 MBytes  1.82 Gbits/sec    0   3.00 MBytes
[  5]   3.00-4.00   sec   213 MBytes  1.79 Gbits/sec    0   3.00 MBytes
[  5]   4.00-5.00   sec   209 MBytes  1.75 Gbits/sec    0   3.00 MBytes
[  5]   5.00-6.00   sec   228 MBytes  1.92 Gbits/sec    0   3.00 MBytes
[  5]   6.00-7.00   sec   170 MBytes  1.43 Gbits/sec    0   3.00 MBytes
[  5]   7.00-8.00   sec   220 MBytes  1.85 Gbits/sec    0   3.00 MBytes
[  5]   8.00-9.00   sec   196 MBytes  1.64 Gbits/sec    0   3.00 MBytes
[  5]   9.00-10.00  sec   221 MBytes  1.85 Gbits/sec    0   3.00 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.05 GBytes  1.76 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  2.05 GBytes  1.76 Gbits/sec                  receiver


fw ax1 as Receiver

root@fw:~ # iperf3 -c 172.16.200.2 -R
Connecting to host 172.16.200.2, port 5201
Reverse mode, remote host 172.16.200.2 is sending
[  5] local 172.16.200.1 port 59337 connected to 172.16.200.2 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   204 MBytes  1.71 Gbits/sec
[  5]   1.00-2.00   sec   213 MBytes  1.79 Gbits/sec
[  5]   2.00-3.00   sec   207 MBytes  1.74 Gbits/sec
[  5]   3.00-4.00   sec   218 MBytes  1.83 Gbits/sec
[  5]   4.00-5.00   sec   213 MBytes  1.78 Gbits/sec
[  5]   5.00-6.00   sec   211 MBytes  1.77 Gbits/sec
[  5]   6.00-7.00   sec   210 MBytes  1.77 Gbits/sec
[  5]   7.00-8.00   sec   213 MBytes  1.79 Gbits/sec
[  5]   8.00-9.00   sec   210 MBytes  1.76 Gbits/sec
[  5]   9.00-10.00  sec   210 MBytes  1.76 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.06 GBytes  1.77 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  2.06 GBytes  1.77 Gbits/sec                  receiver


UDP Results
fw ax1 as Sender

root@fw:~ # iperf3 -c 172.16.200.2 -u -b 9000M
Connecting to host 172.16.200.2, port 5201
[  5] local 172.16.200.1 port 34911 connected to 172.16.200.2 port 5201
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[  5]   0.00-1.00   sec   219 MBytes  1.84 Gbits/sec  157390
[  5]   1.00-2.00   sec   215 MBytes  1.80 Gbits/sec  154532
[  5]   2.00-3.00   sec   234 MBytes  1.97 Gbits/sec  168399
[  5]   3.00-4.00   sec   233 MBytes  1.96 Gbits/sec  167659
[  5]   4.00-5.00   sec   234 MBytes  1.96 Gbits/sec  167797
[  5]   5.00-6.00   sec   235 MBytes  1.98 Gbits/sec  169111
[  5]   6.00-7.00   sec   235 MBytes  1.97 Gbits/sec  168725
[  5]   7.00-8.00   sec   233 MBytes  1.96 Gbits/sec  167502
[  5]   8.00-9.00   sec   235 MBytes  1.97 Gbits/sec  168753
[  5]   9.00-10.00  sec   234 MBytes  1.96 Gbits/sec  167758
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-10.00  sec  2.25 GBytes  1.94 Gbits/sec  0.000 ms  0/1657626 (0%)  sender
[  5]   0.00-10.00  sec  2.25 GBytes  1.94 Gbits/sec  0.003 ms  269/1657626 (0.016%)  receiver


fw ax1 as Receiver

root@fw:~ # iperf3 -c 172.16.200.2 -u -b 9000M -R
Connecting to host 172.16.200.2, port 5201
Reverse mode, remote host 172.16.200.2 is sending
[  5] local 172.16.200.1 port 64097 connected to 172.16.200.2 port 5201
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-1.00   sec   315 MBytes  2.64 Gbits/sec  0.003 ms  1589/228030 (0.7%)
[  5]   1.00-2.00   sec   317 MBytes  2.66 Gbits/sec  0.003 ms  770/228374 (0.34%)
[  5]   2.00-3.00   sec   312 MBytes  2.62 Gbits/sec  0.004 ms  1234/225533 (0.55%)
[  5]   3.00-4.00   sec   295 MBytes  2.47 Gbits/sec  0.003 ms  15494/227089 (6.8%)
[  5]   4.00-5.00   sec   309 MBytes  2.59 Gbits/sec  0.003 ms  333/222378 (0.15%)
[  5]   5.00-6.00   sec   304 MBytes  2.55 Gbits/sec  0.004 ms  9455/227482 (4.2%)
[  5]   6.00-7.00   sec   312 MBytes  2.62 Gbits/sec  0.004 ms  1488/225716 (0.66%)
[  5]   7.00-8.00   sec   300 MBytes  2.51 Gbits/sec  0.003 ms  7986/223226 (3.6%)
[  5]   8.00-9.00   sec   311 MBytes  2.61 Gbits/sec  0.004 ms  878/224257 (0.39%)
[  5]   9.00-10.00  sec   321 MBytes  2.69 Gbits/sec  0.005 ms  1184/231395 (0.51%)
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-10.00  sec  3.08 GBytes  2.64 Gbits/sec  0.000 ms  0/2263509 (0%)  sender
[  5]   0.00-10.00  sec  3.02 GBytes  2.60 Gbits/sec  0.005 ms  40411/2263480 (1.8%)  receiver


fw Interface Statistics

root@fw:~ # ifconfig ax1
ax1: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: Test
        options=4800028<VLAN_MTU,JUMBO_MTU,NOMAP>
        ether f4:90:ea:00:73:4b
        inet 172.16.200.1 netmask 0xffffff00 broadcast 172.16.200.255
        media: Ethernet autoselect (10GBase-SFI <full-duplex,rxpause,txpause>)
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>

root@fw:~ # netstat -i log | grep -iE 'Name|ax1'
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
ax1    1500 <Link#5>      f4:90:ea:00:73:4b 14228226     0     0 15969806     0     0
ax1       - 172.16.200.0/ fw                14231375     -     - 15974070     -     -

root@fw:~ # netstat -ihw 1
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
       130     0     0        28K        133     0        28K     0
      1.5k     0     0       1.6M       1.5k     0       1.7M     0
      2.4k     0     0       3.3M       2.4k     0       222K     0
      151k     0     0       218M       151k     0        10M     0
      151k     0     0       218M       151k     0        10M     0
      150k     0     0       216M       150k     0        11M     0
      151k     0     0       218M       151k     0        10M     0
      151k     0     0       219M       151k     0        10M     0
      153k     0     0       221M       153k     0        12M     0
      151k     0     0       218M       151k     0        10M     0
      152k     0     0       220M       152k     0        10M     0
      152k     0     0       220M       152k     0        10M     0
      153k     0     0       221M       153k     0        10M     0
      1.3k     0     0       1.5M       1.3k     0       1.5M     0
       39k     0     0       2.6M        82k     0       118M     0
       76k     0     0       5.1M       175k     0       253M     0
       75k     0     0       5.0M       170k     0       247M     0
       72k     0     0       4.8M       178k     0       258M     0
       76k     0     0       5.1M       166k     0       239M     0
       70k     0     0       4.7M       152k     0       220M     0
       66k     0     0       4.5M       149k     0       216M     0


pve4 Interface Statistics

root@pve4:~# ifconfig enp6s0d1
enp6s0d1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.16.200.2  netmask 255.255.255.0  broadcast 0.0.0.0
        inet6 fe80::202:c9ff:fe0e:9ce9  prefixlen 64  scopeid 0x20<link>
        ether 00:02:c9:0e:9c:e9  txqueuelen 1000  (Ethernet)
        RX packets 15974098  bytes 17522362543 (16.3 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 14231333  bytes 17446822814 (16.2 GiB)
        TX errors 0  dropped 1 overruns 0  carrier 0  collisions 0

root@pve4:~# netstat -i log | grep -iE 'Iface|enp6s0d1'
Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
enp6s0d1  1500 15974098      0      0 0      14231333      0      1      0 BMR


Thought: Although this person is sending traffic through the FW, which introduces a whole lot of additional points where throughput can be reduced, it's interesting that they are seeing the same specific range of values (~1.8Gbps, occasionally bursting to 2.4Gbps). https://www.reddit.com/r/OPNsenseFirewall/comments/s6zu4b/help_with_bad_performance_on_dec2750_opnsense/

The 2.4Gbps value is suspiciously close to 1/4th the expected speed (~9.6Gbps). Is there any chance that the multi-queues are not actually being multi-processed by the kernel and thus we are only processing on one core at a time?



last pid: 73684;  load averages:  1.34,  0.47,  0.28    up 0+01:09:53
208 threads:   11 running, 167 sleeping, 30 waiting
CPU:  0.3% user,  0.0% nice, 24.8% system,  0.0% interrupt, 74.8% idle
Mem: 162M Active, 35M Inact, 400M Wired, 116M Buf, 7261M Free
Swap: 8478M Total, 8478M Free

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
   11 root        155 ki31     0B   128K CPU3     3  69:02 100.00% idle{idle: cpu3}
    0 root        -76    -     0B  1008K CPU2     2   1:39  99.92% kernel{if_io_tqg_2}
   11 root        155 ki31     0B   128K RUN      0  68:15  97.34% idle{idle: cpu0}
   11 root        155 ki31     0B   128K CPU1     1  69:34  97.28% idle{idle: cpu1}
73684 root        100    0    17M  6184K CPU6     6   0:27  96.72% iperf3
   11 root        155 ki31     0B   128K CPU7     7  68:34  95.35% idle{idle: cpu7}
   11 root        155 ki31     0B   128K CPU4     4  68:19  91.80% idle{idle: cpu4}
   11 root        155 ki31     0B   128K CPU5     5  68:45  81.63% idle{idle: cpu5}
   11 root        155 ki31     0B   128K RUN      6  67:52  29.06% idle{idle: cpu6}
    0 root        -92    -     0B  1008K -        4   1:01   1.44% kernel{axgbe dev taskq}
    0 root        -92    -     0B  1008K -        4   1:00   1.44% kernel{axgbe dev taskq}
   12 root        -72    -     0B   480K WAIT     5   0:01   0.70% intr{swi1: pfsync}
    0 root        -92    -     0B  1008K -        0   0:34   0.40% kernel{dummynet}
    6 root        -16    -     0B    16K -        4   0:04   0.18% rand_harvestq
    0 root        -76    -     0B  1008K -        0   0:48   0.16% kernel{if_io_tqg_0}
    0 root        -76    -     0B  1008K -        4   0:28   0.16% kernel{if_io_tqg_4}


Here is someone else reporting similar behavior (and discussing iflib, which is also in play in my situation):
https://forum.opnsense.org/index.php?topic=18754.30

Update: This iflib issue reported against FreeBSD 12, although involving a vmx NIC, points in a similar direction: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=237166

Here is the relevant info from my device:

root@fw:~ # dmesg | grep -iE 'ax0|ax1' | grep -C2 queues
ax0: <AMD 10 Gigabit Ethernet Driver> mem 0xd0060000-0xd007ffff,0xd0040000-0xd005ffff,0xd0082000-0xd0083fff at device 0.1 on pci6
ax0: Using 2048 TX descriptors and 2048 RX descriptors
ax0: Using 3 RX queues 3 TX queues
ax0: Using MSI-X interrupts with 7 vectors
ax0: Ethernet address: f4:90:ea:00:73:4a
--
ax0:   revision level: A1
ax0:   serial number:  MT1403VS18803
ax0: netmap queues/slots: TX 3/2048, RX 3/2048
ax1: <AMD 10 Gigabit Ethernet Driver> mem 0xd0020000-0xd003ffff,0xd0000000-0xd001ffff,0xd0080000-0xd0081fff at device 0.2 on pci6
ax1: Using 2048 TX descriptors and 2048 RX descriptors
ax1: Using 3 RX queues 3 TX queues
ax1: Using MSI-X interrupts with 7 vectors



root@fw:~ # sysctl -a | grep override
dev.ax.1.iflib.override_nrxds: 2048
dev.ax.1.iflib.override_ntxds: 2048
dev.ax.1.iflib.override_qs_enable: 0
dev.ax.1.iflib.override_nrxqs: 0
dev.ax.1.iflib.override_ntxqs: 0
dev.ax.0.iflib.override_nrxds: 2048
dev.ax.0.iflib.override_ntxds: 2048
dev.ax.0.iflib.override_qs_enable: 0
dev.ax.0.iflib.override_nrxqs: 0
dev.ax.0.iflib.override_ntxqs: 0



**UPDATE** I tried to adapt the nic-queue-usage script found here to the ax driver, but it doesn't appear that iflib provides per-RX-queue packet stats?
https://github.com/ocochard/BSDRP/blob/master/BSDRP/Files/usr/local/bin/nic-queue-usage


root@fw:~ # sysctl dev.ax.1.iflib | grep -i rxq
dev.ax.1.iflib.rxq2.rxq_fl0.buf_size: 2048
dev.ax.1.iflib.rxq2.rxq_fl0.credits: 2047
dev.ax.1.iflib.rxq2.rxq_fl0.cidx: 1557
dev.ax.1.iflib.rxq2.rxq_fl0.pidx: 1556
dev.ax.1.iflib.rxq2.cpu: 2
dev.ax.1.iflib.rxq1.rxq_fl0.buf_size: 2048
dev.ax.1.iflib.rxq1.rxq_fl0.credits: 2047
dev.ax.1.iflib.rxq1.rxq_fl0.cidx: 974
dev.ax.1.iflib.rxq1.rxq_fl0.pidx: 973
dev.ax.1.iflib.rxq1.cpu: 0
dev.ax.1.iflib.rxq0.rxq_fl0.buf_size: 2048
dev.ax.1.iflib.rxq0.rxq_fl0.credits: 2047
dev.ax.1.iflib.rxq0.rxq_fl0.cidx: 703
dev.ax.1.iflib.rxq0.rxq_fl0.pidx: 702
dev.ax.1.iflib.rxq0.cpu: 6
dev.ax.1.iflib.override_nrxqs: 0
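
Lacking per-queue packet counters, a crude way to watch which RX queues are active is to sample the free-list consumer index (cidx) for each queue and see which ones move - a rough sketch only, since the values wrap at the 2048-entry ring:

root@fw:~ # while :; do for q in 0 1 2; do printf 'rxq%d cidx=%s  ' "$q" "$(sysctl -n dev.ax.1.iflib.rxq${q}.rxq_fl0.cidx)"; done; echo; sleep 1; done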


OK, getting somewhere:

I made the following changes to both ax0 (LAN) and ax1 (test) and rebooted. LAN is the primary interface that I care about and is the one connected to the switch; ax1 is direct-connected to another 10G host for testing purposes.
- Disabled flow control on rx and tx
- Enabled hardware TCP segmentation offload

Note that both hardware checksum offload and hardware large receive offload were left disabled.
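
A sketch of the shell-level equivalents of those two changes (I made them through the GUI, so this mapping is an assumption): flow control goes back off by setting the same dev.ax.N.rx_pause/tx_pause tunables to 0 and rebooting, and TSO can be toggled at runtime:

root@fw:~ # ifconfig ax0 tso    # enables TSO4/TSO6; roughly what un-checking the "disable TSO" GUI option does
root@fw:~ # ifconfig ax1 tso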

I was then able to:
- Send at 9.4Gbps on ax1 with no ReTx. Kernel times for if_io_tqg_2 were around 12-19%
- Send at 8.25Gbps on ax0 but still with ReTx (although fewer of them than when this all started). Kernel times for if_io_tqg_2 were around 18-24%

However, receiving is still underperforming:
- Receive @ 2.32Gbps on ax1 with no ReTx. Kernel times for if_io_tqg_2 were around 97-100%
- Receive @ 2.32Gbps on ax0 with no ReTx. Kernel times for if_io_tqg_2 were around 97-100%


Per the documentation https://docs.opnsense.org/manual/interfaces_settings.html, all three offloading options should be disabled.
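
For reference, the currently active capability flags can be checked directly on the interface - TXCSUM/RXCSUM/TSO4/TSO6/LRO show up in the options field when enabled:

root@fw:~ # ifconfig ax0 | grep options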

Two thoughts:
1. Either a single core on this machine (DEC2750 -> AMD Ryzen V1500B) is expected to be able to handle full 10Gbps traffic on the interface, and the code path this is presently taking is slower than expected;
2. Or the system is expected to achieve this throughput with offloading disabled through the use of multiple CPU cores.

My expectation is that multiple threads would be processing multiple queues across multiple cores to achieve the necessary throughput. This is why I find it so suspicious that a single kernel thread is being pegged while all the other cores are basically idle. One thought I had was that Receive Side Scaling (RSS) may be forcing all the packets from this single TCP stream into a single queue for locality, effectively making this a single-threaded activity. More testing is needed; however, I would expect that if this were solely the issue, running 4 parallel streams should result in ~4x the throughput. My quick tests with iperf3 TCP -P4 and UDP -P1/-P4 don't show a 4x increase, but rather 1.5-2x, with just two cores seeing utilization. I need to set up some better tests to investigate this line of thinking; illustrative commands follow below.
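
The parallel-stream tests mentioned above were along these lines (illustrative, not the exact invocations; note that iperf3's -b is per stream when combined with -P):

root@fw:~ # iperf3 -c 172.16.200.2 -P 4                 # 4 parallel TCP streams
root@fw:~ # iperf3 -c 172.16.200.2 -u -b 2250M -P 4     # 4 UDP streams, ~9Gbps aggregate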

**RSS UPDATE**: I tried turning off RSS (dev.ax.0.rss_enabled="0", dev.ax.1.rss_enabled="0") and rebooting. I then re-tested send/receive with both single and parallel streams and observed no improvement. I believe that, since both src host:port and dst host:port go into the hash, -P4 should be able to generate different queue targets in the LSB of the hash and thus spread the load across cores. Said more simply, I think this is a valid test, but I'm not fully up to speed on RSS. See here for more details: https://forum.opnsense.org/index.php?topic=24409.0

As I do more testing to isolate the issue, it appears it could be related to the broader set of issues seen over the past several years regarding the NIC performance delta between OPNsense and the FreeBSD release it is based upon. The overall issue that I'm observing on the DEC2750 (a single core pegging out even with parallel streams, throughput limited to 1.8-2.4Gbps) is reproducible in VMs.

Using the base version of FreeBSD on the same hosts with the same guest configuration I see far higher throughput and the usage of multiple CPUs even when processing a single stream with multiple queues available.
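
A quick way to confirm how many virtio queue pairs the guest actually negotiated and is using (these vtnet sysctls also show up in the sysctl diff posted later in this thread):

root@freebsd-13:~ # sysctl dev.vtnet.0.max_vq_pairs dev.vtnet.0.act_vq_pairs dev.vtnet.0.req_vq_pairs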

I also grabbed the FreeBSD 13.0 kernel from the OPNsense website and booted that on my FreeBSD VM to see if it was the source of the issue, and it did reveal one: the throughput for the FreeBSD-13 VM booted off the kernel from the OPNsense site was only half of what it is for that same FreeBSD-13 VM when booted off the stock kernel! In fact, it is very close in performance to what the stock FreeBSD-13 VM gets with only a single queue presented to it.

**UPDATE:** Interestingly, pfSense CE 2.6 exhibits the same throughput issues. It's further hampered by ALTQ as noted below (so I can't make use of multiple queues for virtio like I can for OPNsense/FreeBSD).

More to come...

References:
- https://forum.opnsense.org/index.php?topic=18754.0
- https://www.reddit.com/r/OPNsenseFirewall/comments/s6zu4b/help_with_bad_performance_on_dec2750_opnsense/
- https://forum.opnsense.org/index.php?topic=22477.0
- https://github.com/opnsense/src/issues/119

Test Environment

Baremetal

Name | OS | Kernel | CPU | RAM | NIC | IP | Notes
pve1 | Proxmox 7.1-1 | 5.13.19-14-pve | E3-1270v2 | 32GB | Mellanox ConnectX2 | 172.16.5.3 | ethtool -K vmbr1 tx off gso off
pve2 | Proxmox 7.1-1 | 5.13.19-14-pve | E3-1270v2 | 32GB | Mellanox ConnectX2 | 172.16.5.4 | ethtool -K vmbr1 tx off gso off
truenas | TrueNAS Core 12-U8 | 12.2-RELEASE-p12 amd64 | E3-1240v5 | 64GB | Mellanox ConnectX2 | 172.16.200.2 | DAC to pve4
fw (DEC2750) | OPNsense 22.1.5 | 13.0-STABLE | Ryzen V1500B | 8GB | AMD 10 Gigabit | 172.16.5.1, 172.16.200.1 | DAC to truenas


VMs
Name | OS | Kernel | CPU | RAM | NIC | IP | Notes
tank | Ubuntu 20.04.4 LTS | 5.4.0-100-generic | 1 vCPU IvyBridge | 2GB | virtio 2 queues | 172.16.5.37 | Runs on pve1
mgmt | Ubuntu 20.04.4 LTS | 5.13.0-39-generic | 2 vCPU IvyBridge | 4GB | virtio 2 queues | 172.16.5.36 | Runs on pve4
mgmt-clone | Ubuntu 20.04.4 LTS | 5.13.0-39-generic | 2 vCPU IvyBridge | 4GB | virtio 2 queues | 172.16.5.57 | Runs on pve2
freebsd-13 | FreeBSD 13.0 | releng/13.0-n244733 | 2 vCPUs | 4GB | virtio 2 queues | 172.16.5.59 | Runs on pve4
opnsense22.1 | OPNsense 22.1 | stable/22.1-n248059 | 4 vCPUs IvyBridge | 4GB | virtio 2 queues | 172.16.6.1 | Runs on pve4
opnsense21.7 | OPNsense 21.7 | - | 4 vCPUs IvyBridge | 4GB | virtio 2 queues | 172.16.6.1 | Runs on pve4
pfsenseCE2.6 | pfSense 2.6.0 | 12.3-STABLE | 4 vCPUs IvyBridge | 4GB | virtio 2 queues | 172.16.6.1 | Runs on pve4


Test Results




Baremetal
Client | Server | Protocol | Tx Bitrate | Tx Retr | Rx Bitrate | Rx Retr | Notes
pve2 | pve1 | TCP | 9.31Gbps | 4 | 8.68Gbps | 0 |
pve4 | pve1 | TCP | 7.17Gbps | 0 | 8.49Gbps | 430 |
truenas | pve4 | TCP | 9.24Gbps | 0 | 9.31Gbps | 0 | Hardware Offloading Enabled; {mlxen1 rx cq} <1% cpu; intr{mlx4_core0} 22% cpu
truenas | pve4 | TCP | 9.08Gbps | 3 | 9.30Gbps | 231 | Hardware Offloading Disabled; {mlxen1 rx cq} 56% cpu; intr{mlx4_core0} 22% cpu
fw | pve4 | TCP | 1.87Gbps | 0 | 1.63Gbps | 0 | Hardware Offloading Disabled; kernel{if_io_tqg_4} 100% cpu
VMs
Client | Server | Protocol | Tx Bitrate | Tx Retr | Rx Bitrate | Rx Retr | Notes
tank | mgmt | TCP | 9.31Gbps | 38439 | 9.34Gbps | 21532 |
mgmt-clone | mgmt | TCP | 9.30Gbps | 68001 | 9.36Gbps | 10267 |
tank | mgmt-clone | TCP | 9.24Gbps | 42137 | 9.27Gbps | 63471 |
freebsd-13 | mgmt-clone | TCP | 9.22Gbps | 3927 | 9.00Gbps | 75984 | Hardware Offloading Enabled; intr{irq27: virtio_pci3} 30% CPU; 1 queue
freebsd-13 | mgmt-clone | TCP | 4.16Gbps | 590 | 4.80Gbps | 3566 | Hardware Offloading Disabled; intr{irq27: virtio_pci3} 65% CPU; 1 queue
freebsd-13 | mgmt-clone | TCP | 9.27Gbps | 4108 | 9.03Gbps | 69501 | Hardware Offloading Enabled; intr{irq27: virtio_pci3} 33% CPU; 2 queue
freebsd-13 | mgmt-clone | TCP | 9.26Gbps | 8761 | 9.00Gbps | 55472 | Hardware Offloading Disabled; intr{irq27: virtio_pci3} 65% CPU; 2 queue
freebsd-13 | mgmt-clone | TCP | 4.51Gbps | 260 | 4.25Gbps | 3206 | OPNsense Kernel*; Hardware Offloading Disabled; intr{irq27: virtio_pci3} 100% CPU; 2 queue
opnsense-22.1 | mgmt-clone | TCP | 2.81Gbps | 222 | 1.64Gbps | 155 | Hardware Offloading Enabled; intr{irq27: virtio_pci3} 86% CPU; 2 queue
opnsense-22.1 | mgmt-clone | TCP | 2.62Gbps | 0 | 1.68Gbps | 139 | Hardware Offloading Disabled; intr{irq27: virtio_pci3} 97% CPU; 2 queue
opnsense-21.7 | mgmt-clone | TCP | 5.25Gbps | 587 | 1.91Gbps | 88 | Hardware Offloading Enabled; intr{irq27: virtio_pci3} XX% CPU; 2 queue
opnsense-21.7 | mgmt-clone | TCP | 2.42Gbps | 0 | 1.61Gbps | 47 | Hardware Offloading Disabled; intr{irq27: virtio_pci3} xx% CPU; 2 queue
pfsenseCE2.6 | mgmt-clone | TCP | 8.6Gbps | 2432 | 1.40Gbps | 14 | Hardware Offloading Enabled; intr{irq261: virtio_pci2} 53% CPU; 2 queue (but ALTQ **)
pfsenseCE2.6 | mgmt-clone | TCP | 2.33Gbps | 1 | 1.40Gbps | 20 | Hardware Offloading Disabled; intr{irq261: virtio_pci2} 100% CPU; 2 queue (but ALTQ **)


Subscribed. I see the bad performance on my DEC750 as well.

Did you compare the sysctl -a outputs to see if there is just a random parameter that limits the OpnSense kernel?

Although there seems to be a lot more than just parameter differences when I look at this comparison: https://hardenedbsd.org/content/easy-feature-comparison


Quote from: Raketenmeyer on April 14, 2022, 06:50:35 PM
What exactly do you mean in that comparison?
OPNsense no longer uses HardenedBSD as of 22.1, so this is a difference between the OPNsense FreeBSD 13 kernel and the native FreeBSD 13 kernel.

Oh, I just saw that I misread the OpenBSD column for OpnSense in that comparison...

Quote from: meyergru on April 14, 2022, 05:39:04 PM
Did you compare the sysctl -a outputs to see if there is just a random parameter that limits the OpnSense kernel?

Yes, in fact I have done that; I just forgot to post about it! It's a pretty long list of deltas, but nothing stuck out to me. The filtered files were produced by grepping for `vtnet|virtio`.

The OPNsense VM has two NICs while the FreeBSD VM has a single NIC, so there are a bunch of extra entries for vtnet1 in the OPNsense file. Which makes me realize that I should test the FreeBSD VM with a second NIC present to see if that has any effect on its performance.
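
The exact filtering commands aren't shown above; since the diff below mixes sysctl values with boot messages, something along these lines would produce comparable files (a guess, with the filenames taken from the diff header):

( sysctl -a ; dmesg ) | grep -E 'vtnet|virtio' | sort > freebsd-13.0-filtered.sysctl     # run the equivalent on each system
diff -u freebsd-13.0/freebsd-13.0-filtered.sysctl opnsense-22.1/opnsense-22.1-filtered.sysctl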


diff --git a/freebsd-13.0/freebsd-13.0-filtered.sysctl b/opnsense-22.1/opnsense-22.1-filtered.sysctl
index e69a004..c36435b 100644
--- a/freebsd-13.0/freebsd-13.0-filtered.sysctl
+++ b/opnsense-22.1/opnsense-22.1-filtered.sysctl
@@ -1,26 +1,36 @@
-000.001395 [ 450] vtnet_netmap_attach       vtnet attached txq=1, txd=256 rxq=1, rxd=128
-586.211720 [ 450] vtnet_netmap_attach       vtnet attached txq=2, txd=256 rxq=2, rxd=128
-951.722084 [ 450] vtnet_netmap_attach       vtnet attached txq=1, txd=256 rxq=1, rxd=128
-<118>Apr  8 21:26:23 freebsd-13 dhclient[417]: Interface vtnet0 no longer appears valid.
-<118>Apr  8 21:26:23 freebsd-13 dhclient[417]: ioctl(SIOCGIFFLAGS) on vtnet0: Operation not permitted
-<118>Apr  8 21:26:23 freebsd-13 dhclient[417]: receive_packet failed on vtnet0: Device not configured
-<118>Apr  8 21:32:29 freebsd-13 dhclient[1463]: Interface vtnet0 no longer appears valid.
-<118>Apr  8 21:32:29 freebsd-13 dhclient[1463]: ioctl(SIOCGIFFLAGS) on vtnet0: Operation not permitted
-<118>Apr  8 21:32:29 freebsd-13 dhclient[1463]: receive_packet failed on vtnet0: Device not configured
-<118>DHCPDISCOVER on vtnet0 to 255.255.255.255 port 67 interval 6
-<118>DHCPREQUEST on vtnet0 to 255.255.255.255 port 67
-<118>Starting Network: lo0 vtnet0.
-<118>vtnet0: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
-<6>vtnet0: Ethernet address: 86:c3:9d:51:ba:5f
-<6>vtnet0: Ethernet address: 86:c3:9d:51:ba:5f
-<6>vtnet0: Ethernet address: 86:c3:9d:51:ba:5f
+000.001735 [ 450] vtnet_netmap_attach       vtnet attached txq=2, txd=256 rxq=2, rxd=128
+000.001735 [ 450] vtnet_netmap_attach       vtnet attached txq=2, txd=256 rxq=2, rxd=128
+000.001735 [ 450] vtnet_netmap_attach       vtnet attached txq=2, txd=256 rxq=2, rxd=128
+000.001736 [ 450] vtnet_netmap_attach       vtnet attached txq=2, txd=256 rxq=2, rxd=128
+000.001736 [ 450] vtnet_netmap_attach       vtnet attached txq=2, txd=256 rxq=2, rxd=128
+000.001736 [ 450] vtnet_netmap_attach       vtnet attached txq=2, txd=256 rxq=2, rxd=128
+<118> LAN (vtnet1)    -> v4: 172.16.6.1/24
+<118> LAN (vtnet1)    -> v4: 172.16.6.1/24
+<118> LAN (vtnet1)    -> v4: 172.16.6.1/24
+<118> WAN (vtnet0)    -> v4/DHCP4: 172.16.5.58/24
+<118> WAN (vtnet0)    -> v4/DHCP4: 172.16.5.58/24
+<118> WAN (vtnet0)    -> v4/DHCP4: 172.16.5.58/24
+<118>Reconfiguring IPv4 on vtnet0
+<118>Reconfiguring IPv4 on vtnet0
+<118>Reconfiguring IPv4 on vtnet0
+<6>vtnet0: Ethernet address: b2:6c:3a:1c:ce:cf
+<6>vtnet0: Ethernet address: b2:6c:3a:1c:ce:cf
+<6>vtnet0: Ethernet address: b2:6c:3a:1c:ce:cf
<6>vtnet0: link state changed to UP
<6>vtnet0: link state changed to UP
<6>vtnet0: link state changed to UP
-<6>vtnet0: link state changed to UP
-<6>vtnet0: netmap queues/slots: TX 1/256, RX 1/128
-<6>vtnet0: netmap queues/slots: TX 1/256, RX 1/128
<6>vtnet0: netmap queues/slots: TX 2/256, RX 2/128
+<6>vtnet0: netmap queues/slots: TX 2/256, RX 2/128
+<6>vtnet0: netmap queues/slots: TX 2/256, RX 2/128
+<6>vtnet1: Ethernet address: d2:db:66:e1:92:5f
+<6>vtnet1: Ethernet address: d2:db:66:e1:92:5f
+<6>vtnet1: Ethernet address: d2:db:66:e1:92:5f
+<6>vtnet1: link state changed to UP
+<6>vtnet1: link state changed to UP
+<6>vtnet1: link state changed to UP
+<6>vtnet1: netmap queues/slots: TX 2/256, RX 2/128
+<6>vtnet1: netmap queues/slots: TX 2/256, RX 2/128
+<6>vtnet1: netmap queues/slots: TX 2/256, RX 2/128
dev.virtio_pci.%parent:
dev.virtio_pci.0.%desc: VirtIO PCI (legacy) Balloon adapter
dev.virtio_pci.0.%driver: virtio_pci
@@ -51,9 +61,17 @@ dev.virtio_pci.3.%driver: virtio_pci
dev.virtio_pci.3.%location: slot=18 function=0 dbsf=pci0:0:18:0 handle=\_SB_.PCI0.S90_
dev.virtio_pci.3.%parent: pci0
dev.virtio_pci.3.%pnpinfo: vendor=0x1af4 device=0x1000 subvendor=0x1af4 subdevice=0x0001 class=0x020000
-dev.virtio_pci.3.host_features: 0x79bfffe7 <RingEventIdx,RingIndirectDesc,AnyLayout,NotifyOnEmpty,CtrlMacAddr,GuestAnnounce,CtrlRxModeExtra,CtrlVLANFilter,CtrlRxMode,CtrlVq,Status,MrgRxBuf,TxUFO,TxTSOECN,TxTSOv6,TxTSOv4,RxUFO,RxLROECN,RxLROv6,RxLROv4,TxGSO,MAC,CtrlRxOffloads,RxChecksum,TxChecksum>
-dev.virtio_pci.3.negotiated_features: 0x3087bbe7 <RingEventIdx,RingIndirectDesc,CtrlMacAddr,CtrlRxMode,CtrlVq,Status,MrgRxBuf,TxTSOECN,TxTSOv6,TxTSOv4,RxLROECN,RxLROv6,RxLROv4,TxGSO,MAC,CtrlRxOffloads,RxChecksum,TxChecksum>
-dev.virtio_pci.3.nvqs: 3
+dev.virtio_pci.3.host_features: 0x79ffffe7 <RingEventIdx,RingIndirectDesc,AnyLayout,NotifyOnEmpty,CtrlMacAddr,Multiqueue,GuestAnnounce,CtrlRxModeExtra,CtrlVLANFilter,CtrlRxMode,CtrlVq,Status,MrgRxBuf,TxUFO,TxTSOECN,TxTSOv6,TxTSOv4,RxUFO,RxLROECN,RxLROv6,RxLROv4,TxGSO,MAC,CtrlRxOffloads,RxChecksum,TxChecksum>
+dev.virtio_pci.3.negotiated_features: 0x30c7b865 <RingEventIdx,RingIndirectDesc,CtrlMacAddr,Multiqueue,CtrlRxMode,CtrlVq,Status,MrgRxBuf,TxTSOECN,TxTSOv6,TxTSOv4,TxGSO,MAC,CtrlRxOffloads,TxChecksum>
+dev.virtio_pci.3.nvqs: 5
+dev.virtio_pci.4.%desc: VirtIO PCI (legacy) Network adapter
+dev.virtio_pci.4.%driver: virtio_pci
+dev.virtio_pci.4.%location: slot=19 function=0 dbsf=pci0:0:19:0 handle=\_SB_.PCI0.S98_
+dev.virtio_pci.4.%parent: pci0
+dev.virtio_pci.4.%pnpinfo: vendor=0x1af4 device=0x1000 subvendor=0x1af4 subdevice=0x0001 class=0x020000
+dev.virtio_pci.4.host_features: 0x79ffffe7 <RingEventIdx,RingIndirectDesc,AnyLayout,NotifyOnEmpty,CtrlMacAddr,Multiqueue,GuestAnnounce,CtrlRxModeExtra,CtrlVLANFilter,CtrlRxMode,CtrlVq,Status,MrgRxBuf,TxUFO,TxTSOECN,TxTSOv6,TxTSOv4,RxUFO,RxLROECN,RxLROv6,RxLROv4,TxGSO,MAC,CtrlRxOffloads,RxChecksum,TxChecksum>
+dev.virtio_pci.4.negotiated_features: 0x30c7b865 <RingEventIdx,RingIndirectDesc,CtrlMacAddr,Multiqueue,CtrlRxMode,CtrlVq,Status,MrgRxBuf,TxTSOECN,TxTSOv6,TxTSOv4,TxGSO,MAC,CtrlRxOffloads,TxChecksum>
+dev.virtio_pci.4.nvqs: 5
dev.vtballoon.0.%desc: VirtIO Balloon Adapter
dev.vtballoon.0.%parent: virtio_pci0
dev.vtblk.0.%desc: VirtIO Block Adapter
@@ -66,10 +84,10 @@ dev.vtnet.0.%driver: vtnet
dev.vtnet.0.%location:
dev.vtnet.0.%parent: virtio_pci3
dev.vtnet.0.%pnpinfo: vendor=0x00001af4 device=0x1000 subvendor=0x1af4 device_type=0x00000001
-dev.vtnet.0.act_vq_pairs: 1
-dev.vtnet.0.max_vq_pairs: 1
+dev.vtnet.0.act_vq_pairs: 2
+dev.vtnet.0.max_vq_pairs: 2
dev.vtnet.0.mbuf_alloc_failed: 0
-dev.vtnet.0.req_vq_pairs: 1
+dev.vtnet.0.req_vq_pairs: 2
dev.vtnet.0.rx_csum_bad_ethtype: 0
dev.vtnet.0.rx_csum_bad_ipproto: 0
dev.vtnet.0.rx_csum_bad_offset: 0
@@ -80,14 +98,22 @@ dev.vtnet.0.rx_enq_replacement_failed: 0
dev.vtnet.0.rx_frame_too_large: 0
dev.vtnet.0.rx_mergeable_failed: 0
dev.vtnet.0.rx_task_rescheduled: 0
-dev.vtnet.0.rxq0.csum: 1303017
+dev.vtnet.0.rxq0.csum: 263363
dev.vtnet.0.rxq0.csum_failed: 0
-dev.vtnet.0.rxq0.host_lro: 1085080
-dev.vtnet.0.rxq0.ibytes: 31181915110
+dev.vtnet.0.rxq0.host_lro: 0
+dev.vtnet.0.rxq0.ibytes: 27354672
dev.vtnet.0.rxq0.ierrors: 0
-dev.vtnet.0.rxq0.ipackets: 1370916
+dev.vtnet.0.rxq0.ipackets: 269781
dev.vtnet.0.rxq0.iqdrops: 0
-dev.vtnet.0.rxq0.rescheduled: 0
+dev.vtnet.0.rxq0.rescheduled: 67
+dev.vtnet.0.rxq1.csum: 110384
+dev.vtnet.0.rxq1.csum_failed: 0
+dev.vtnet.0.rxq1.host_lro: 0
+dev.vtnet.0.rxq1.ibytes: 5388510337
+dev.vtnet.0.rxq1.ierrors: 0
+dev.vtnet.0.rxq1.ipackets: 3688347
+dev.vtnet.0.rxq1.iqdrops: 0
+dev.vtnet.0.rxq1.rescheduled: 3
dev.vtnet.0.tx_csum_offloaded: 0
dev.vtnet.0.tx_csum_proto_mismatch: 0
dev.vtnet.0.tx_csum_unknown_ethtype: 0
@@ -97,12 +123,74 @@ dev.vtnet.0.tx_task_rescheduled: 0
dev.vtnet.0.tx_tso_not_tcp: 0
dev.vtnet.0.tx_tso_offloaded: 0
dev.vtnet.0.tx_tso_without_csum: 0
-dev.vtnet.0.txq0.csum: 1117351
-dev.vtnet.0.txq0.obytes: 98345391
-dev.vtnet.0.txq0.omcasts: 0
-dev.vtnet.0.txq0.opackets: 1117426
+dev.vtnet.0.txq0.csum: 0
+dev.vtnet.0.txq0.obytes: 3799355110
+dev.vtnet.0.txq0.omcasts: 363
+dev.vtnet.0.txq0.opackets: 2510616
dev.vtnet.0.txq0.rescheduled: 0
-dev.vtnet.0.txq0.tso: 200
+dev.vtnet.0.txq0.tso: 0
+dev.vtnet.0.txq1.csum: 0
+dev.vtnet.0.txq1.obytes: 232421133
+dev.vtnet.0.txq1.omcasts: 1
+dev.vtnet.0.txq1.opackets: 3517901
+dev.vtnet.0.txq1.rescheduled: 0
+dev.vtnet.0.txq1.tso: 0
+dev.vtnet.1.%desc: VirtIO Networking Adapter
+dev.vtnet.1.%driver: vtnet
+dev.vtnet.1.%location:
+dev.vtnet.1.%parent: virtio_pci4
+dev.vtnet.1.%pnpinfo: vendor=0x00001af4 device=0x1000 subvendor=0x1af4 device_type=0x00000001
+dev.vtnet.1.act_vq_pairs: 2
+dev.vtnet.1.max_vq_pairs: 2
+dev.vtnet.1.mbuf_alloc_failed: 0
+dev.vtnet.1.req_vq_pairs: 2
+dev.vtnet.1.rx_csum_bad_ethtype: 0
+dev.vtnet.1.rx_csum_bad_ipproto: 0
+dev.vtnet.1.rx_csum_bad_offset: 0
+dev.vtnet.1.rx_csum_bad_proto: 0
+dev.vtnet.1.rx_csum_failed: 0
+dev.vtnet.1.rx_csum_offloaded: 0
+dev.vtnet.1.rx_enq_replacement_failed: 0
+dev.vtnet.1.rx_frame_too_large: 0
+dev.vtnet.1.rx_mergeable_failed: 0
+dev.vtnet.1.rx_task_rescheduled: 0
+dev.vtnet.1.rxq0.csum: 23387
+dev.vtnet.1.rxq0.csum_failed: 0
+dev.vtnet.1.rxq0.host_lro: 0
+dev.vtnet.1.rxq0.ibytes: 1643315
+dev.vtnet.1.rxq0.ierrors: 0
+dev.vtnet.1.rxq0.ipackets: 23398
+dev.vtnet.1.rxq0.iqdrops: 0
+dev.vtnet.1.rxq0.rescheduled: 0
+dev.vtnet.1.rxq1.csum: 4920
+dev.vtnet.1.rxq1.csum_failed: 0
+dev.vtnet.1.rxq1.host_lro: 0
+dev.vtnet.1.rxq1.ibytes: 440413
+dev.vtnet.1.rxq1.ierrors: 0
+dev.vtnet.1.rxq1.ipackets: 5294
+dev.vtnet.1.rxq1.iqdrops: 0
+dev.vtnet.1.rxq1.rescheduled: 0
+dev.vtnet.1.tx_csum_offloaded: 0
+dev.vtnet.1.tx_csum_proto_mismatch: 0
+dev.vtnet.1.tx_csum_unknown_ethtype: 0
+dev.vtnet.1.tx_defrag_failed: 0
+dev.vtnet.1.tx_defragged: 0
+dev.vtnet.1.tx_task_rescheduled: 0
+dev.vtnet.1.tx_tso_not_tcp: 0
+dev.vtnet.1.tx_tso_offloaded: 0
+dev.vtnet.1.tx_tso_without_csum: 0
+dev.vtnet.1.txq0.csum: 0
+dev.vtnet.1.txq0.obytes: 29429366
+dev.vtnet.1.txq0.omcasts: 0
+dev.vtnet.1.txq0.opackets: 25058
+dev.vtnet.1.txq0.rescheduled: 0
+dev.vtnet.1.txq0.tso: 0
+dev.vtnet.1.txq1.csum: 0
+dev.vtnet.1.txq1.obytes: 48974151
+dev.vtnet.1.txq1.omcasts: 0
+dev.vtnet.1.txq1.opackets: 33751
+dev.vtnet.1.txq1.rescheduled: 0
+dev.vtnet.1.txq1.tso: 0
device virtio
device virtio_balloon
device virtio_blk
@@ -119,24 +207,35 @@ hw.vtnet.mq_max_pairs: 32
hw.vtnet.rx_process_limit: 1024
hw.vtnet.tso_disable: 0
hw.vtnet.tso_maxlen: 65535
-pfil: duplicate head "vtnet0"
-pfil: duplicate head "vtnet0"
virtio_pci0: <VirtIO PCI (legacy) Balloon adapter> port 0xe080-0xe0bf mem 0xfe400000-0xfe403fff irq 11 at device 3.0 on pci0
-virtio_pci1: <VirtIO PCI (legacy) Console adapter> port 0xe0c0-0xe0ff mem 0xfea51000-0xfea51fff,0xfe404000-0xfe407fff irq 11 at device 8.0 on pci0
-virtio_pci2: <VirtIO PCI (legacy) Block adapter> port 0xe000-0xe07f mem 0xfea52000-0xfea52fff,0xfe408000-0xfe40bfff irq 10 at device 10.0 on pci0
-virtio_pci3: <VirtIO PCI (legacy) Network adapter> at device 18.0 on pci0
-virtio_pci3: <VirtIO PCI (legacy) Network adapter> at device 18.0 on pci0
-virtio_pci3: <VirtIO PCI (legacy) Network adapter> port 0xe120-0xe13f mem 0xfea53000-0xfea53fff,0xfe40c000-0xfe40ffff irq 10 at device 18.0 on pci0
+virtio_pci0: <VirtIO PCI (legacy) Balloon adapter> port 0xe080-0xe0bf mem 0xfe400000-0xfe403fff irq 11 at device 3.0 on pci0
+virtio_pci0: <VirtIO PCI (legacy) Balloon adapter> port 0xe080-0xe0bf mem 0xfe400000-0xfe403fff irq 11 at device 3.0 on pci0
+virtio_pci1: <VirtIO PCI (legacy) Console adapter> port 0xe0c0-0xe0ff mem 0xfea91000-0xfea91fff,0xfe404000-0xfe407fff irq 11 at device 8.0 on pci0
+virtio_pci1: <VirtIO PCI (legacy) Console adapter> port 0xe0c0-0xe0ff mem 0xfea91000-0xfea91fff,0xfe404000-0xfe407fff irq 11 at device 8.0 on pci0
+virtio_pci1: <VirtIO PCI (legacy) Console adapter> port 0xe0c0-0xe0ff mem 0xfea91000-0xfea91fff,0xfe404000-0xfe407fff irq 11 at device 8.0 on pci0
+virtio_pci2: <VirtIO PCI (legacy) Block adapter> port 0xe000-0xe07f mem 0xfea92000-0xfea92fff,0xfe408000-0xfe40bfff irq 10 at device 10.0 on pci0
+virtio_pci2: <VirtIO PCI (legacy) Block adapter> port 0xe000-0xe07f mem 0xfea92000-0xfea92fff,0xfe408000-0xfe40bfff irq 10 at device 10.0 on pci0
+virtio_pci2: <VirtIO PCI (legacy) Block adapter> port 0xe000-0xe07f mem 0xfea92000-0xfea92fff,0xfe408000-0xfe40bfff irq 10 at device 10.0 on pci0
+virtio_pci3: <VirtIO PCI (legacy) Network adapter> port 0xe100-0xe13f mem 0xfea93000-0xfea93fff,0xfe40c000-0xfe40ffff irq 10 at device 18.0 on pci0
+virtio_pci3: <VirtIO PCI (legacy) Network adapter> port 0xe100-0xe13f mem 0xfea93000-0xfea93fff,0xfe40c000-0xfe40ffff irq 10 at device 18.0 on pci0
+virtio_pci3: <VirtIO PCI (legacy) Network adapter> port 0xe100-0xe13f mem 0xfea93000-0xfea93fff,0xfe40c000-0xfe40ffff irq 10 at device 18.0 on pci0
+virtio_pci4: <VirtIO PCI (legacy) Network adapter> port 0xe140-0xe17f mem 0xfea94000-0xfea94fff,0xfe410000-0xfe413fff irq 11 at device 19.0 on pci0
+virtio_pci4: <VirtIO PCI (legacy) Network adapter> port 0xe140-0xe17f mem 0xfea94000-0xfea94fff,0xfe410000-0xfe413fff irq 11 at device 19.0 on pci0
+virtio_pci4: <VirtIO PCI (legacy) Network adapter> port 0xe140-0xe17f mem 0xfea94000-0xfea94fff,0xfe410000-0xfe413fff irq 11 at device 19.0 on pci0
vm.uma.vtnet_tx_hdr.bucket_size: 254
vm.uma.vtnet_tx_hdr.bucket_size_max: 254
-vm.uma.vtnet_tx_hdr.domain.0.imax: 254
-vm.uma.vtnet_tx_hdr.domain.0.imin: 254
-vm.uma.vtnet_tx_hdr.domain.0.nitems: 254
+vm.uma.vtnet_tx_hdr.domain.0.bimin: 762
+vm.uma.vtnet_tx_hdr.domain.0.imax: 762
+vm.uma.vtnet_tx_hdr.domain.0.imin: 762
+vm.uma.vtnet_tx_hdr.domain.0.limin: 267
+vm.uma.vtnet_tx_hdr.domain.0.nitems: 762
+vm.uma.vtnet_tx_hdr.domain.0.timin: 2080
vm.uma.vtnet_tx_hdr.domain.0.wss: 0
vm.uma.vtnet_tx_hdr.flags: 0x10000<FIRSTTOUCH>
vm.uma.vtnet_tx_hdr.keg.align: 0
vm.uma.vtnet_tx_hdr.keg.domain.0.free_items: 128
-vm.uma.vtnet_tx_hdr.keg.domain.0.pages: 6
+vm.uma.vtnet_tx_hdr.keg.domain.0.free_slabs: 0
+vm.uma.vtnet_tx_hdr.keg.domain.0.pages: 9
vm.uma.vtnet_tx_hdr.keg.efficiency: 98
vm.uma.vtnet_tx_hdr.keg.ipers: 168
vm.uma.vtnet_tx_hdr.keg.name: vtnet_tx_hdr
@@ -149,13 +248,16 @@ vm.uma.vtnet_tx_hdr.limit.max_items: 0
vm.uma.vtnet_tx_hdr.limit.sleepers: 0
vm.uma.vtnet_tx_hdr.limit.sleeps: 0
vm.uma.vtnet_tx_hdr.size: 24
-vm.uma.vtnet_tx_hdr.stats.allocs: 29346567
+vm.uma.vtnet_tx_hdr.stats.allocs: 6087327
vm.uma.vtnet_tx_hdr.stats.current: 1
vm.uma.vtnet_tx_hdr.stats.fails: 0
-vm.uma.vtnet_tx_hdr.stats.frees: 29346566
+vm.uma.vtnet_tx_hdr.stats.frees: 6087326
vm.uma.vtnet_tx_hdr.stats.xdomain: 0
vtnet0: <VirtIO Networking Adapter> on virtio_pci3
vtnet0: <VirtIO Networking Adapter> on virtio_pci3
vtnet0: <VirtIO Networking Adapter> on virtio_pci3
-vtnet0: detached
-vtnet0: detached
+vtnet1: <VirtIO Networking Adapter> on virtio_pci4
+vtnet1: <VirtIO Networking Adapter> on virtio_pci4
+vtnet1: <VirtIO Networking Adapter> on virtio_pci4
+vtnet1: vtnet_update_rx_offloads: cannot update Rx features
+vtnet1: vtnet_update_rx_offloads: cannot update Rx features

Two comments:

1. Just one observation: The negotiated features differ in RX offloading:


-dev.virtio_pci.3.negotiated_features: 0x3087bbe7 <RingEventIdx,RingIndirectDesc,CtrlMacAddr,CtrlRxMode,CtrlVq,Status,MrgRxBuf,TxTSOECN,TxTSOv6,TxTSOv4,RxLROECN,RxLROv6,RxLROv4,TxGSO,MAC,CtrlRxOffloads,RxChecksum,TxChecksum>
+dev.virtio_pci.3.negotiated_features: 0x30c7b865 <RingEventIdx,RingIndirectDesc,CtrlMacAddr,Multiqueue,CtrlRxMode,CtrlVq,Status,MrgRxBuf,TxTSOECN,TxTSOv6,TxTSOv4,TxGSO,MAC,CtrlRxOffloads,TxChecksum>


2. By limiting the diff to driver-specific aspects, you miss any other performance-related things, like memory protection, threading settings or circumvention of CPU flaws (e.g. hw.ibrs_disable or hw.spec_store_bypass_disable or net.isr.bindthreads). However, I actually have no clue as to what might be the performance impact of any setting.
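
For comparison, those particular knobs can be read directly on both kernels with something like:

sysctl hw.ibrs_disable hw.spec_store_bypass_disable net.isr.bindthreads net.isr.maxthreads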

1. Good point. A few of those I tried altering on the FreeBSD system to see if they made any difference, but I didn't find any. The ones that jump out to me here are LSO/TSO (which I tried changing) and Multiqueue (which, from what I've been observing, appears to be the bigger factor in some of the OPNsense tests). I'll specifically see if I can match the OPNsense negotiated feature set, or at least have the two sides match in ethtool -k, and see if that makes a difference.
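
A rough sketch of checking/aligning the offloads on the Linux peer (using pve4's enp6s0d1 from earlier; which features to toggle is an assumption):

root@pve4:~# ethtool -k enp6s0d1 | grep -E 'segmentation|receive-offload|checksumming'
root@pve4:~# ethtool -K enp6s0d1 tso off gso off gro off    # example toggle to match the FreeBSD side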

UPDATE: See https://forum.opnsense.org/index.php?topic=27828.msg135793

2. Very true. I actually did scan through the complete list but didn't see anything that looked like it could explain it. I filtered it to make something more manageable to post here, but as you correctly point out, that could mask a contributing element. The specific ones you've mentioned I can check, and I could also run CPU and memory benchmarks in the VM to see if there are any drastic differences.

UPDATE: Not only did I go through the complete sysctl list pretty thoroughly, I also:
- Compared /boot/loader.conf, and also copied it over from OPNsense 22.1 to the FreeBSD + OPNsense-kernel VM
- Compared kernel modules
- Compared loaded module lists
- Compared all of the /boot directory

Ultimately, I believe the hardware offloading was the issue *for the VMs*. See #1

Has anyone in here tried to pull some pmc statistics and see where there might be a delay? I tried briefly but couldn't make sense of the results.
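
A rough pmcstat recipe for that kind of sampling might look like the following - the 'instructions' event alias is only an example and may need to change for this CPU (pmccontrol -L lists what's supported):

root@fw:~ # kldload hwpmc
root@fw:~ # pmcstat -S instructions -T                           # live top-style view while iperf3 runs
root@fw:~ # pmcstat -S instructions -O /tmp/sample.pmc sleep 30  # or record for 30s...
root@fw:~ # pmcstat -R /tmp/sample.pmc -G /tmp/callgraph.txt     # ...and write out a callgraph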