Poor Throughput (Even On Same Network Segment)

Started by hax0rwax0r, August 25, 2020, 08:31:25 PM

Very interesting discussion here regarding degraded performance with OPNsense. Roughly one month ago I noticed degraded performance in SMB transfers between my own server and clients. At first I suspected the server itself as the bottleneck, due to a kernel upgrade a short time before. I am not sure whether the issue also correlates with an update of the OPNsense firewall that I performed in the meantime. I investigated a little and found some discussions regarding issues with the server's network card (Intel I219-LM) and Linux, but after buying a low-priced USB network adapter (Realtek chipset) for testing, I got the same poor performance results.

My next steps are to investigate the whole network and OPNsense this upcoming weekend (if the weather cooperates — ⛈ 🌩 ...). So this discussion is a very interesting starting point for me and my investigation.

Here are some details regarding my OPNsense (20.7.3):

- Mainboard: Supermicro A2SDi-4C-HLN4F
- RAM: 8GB
- Network performance (past): around 900 Mbit/s (SMB transfer across two subnets)
- Network performance (now): around 200 Mbit/s (SMB transfer across two subnets)


I tried re-running these tests with OPNsense 20.7.3 and also tried the netmap kernel. For my particular case, this did not result in a change in throughput.

I'll recap my environment:
HP Server ML10v2/Xeon E3 1220v3/32GB of RAM

VM configurations:
Each pfSense and OPNsense VM has 2 vCPU/4GB RAM/VMXNET3 NICs
Each pfSense and OPNsense VM has default settings and all hardware offloading disabled

The OPNsense netmap kernel was tested by doing the following:
opnsense-update -kr 20.7.3-netmap
reboot


When running these iperf3 tests, each test was run for 60 seconds; all tests were run twice and the last result is recorded here, to give the firewalls some time to "warm up" to the throughput load. All tests were performed on the same host, and two VMs were used to simulate a WAN/LAN configuration with separate vSwitches. This allows us to push traffic through the firewall, instead of using the firewall as an iperf3 client (a sketch of the invocation follows).
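For reference, a minimal sketch of how such a run can be invoked; the address and VM placement below are illustrative assumptions, not taken from my actual setup, only the 60-second duration matches the description above:

# on the receiving VM behind the LAN-side vSwitch:
iperf3 -s

# on the sending VM behind the WAN-side vSwitch, one 60-second run through the firewall:
iperf3 -c 192.0.2.10 -t 60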

Below are my results from today:

pfSense 2.5.0Build_10-16-20 1500MTU receiving from WAN, vmx3 NICs, all hardware offloading disabled, default ruleset
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  14.8 GBytes  2.12 Gbits/sec  550             sender
[  5]   0.00-60.00  sec  14.8 GBytes  2.12 Gbits/sec                  receiver


pfSense 2.4.5p1 1500MTU receiving from WAN, vmx3 NICs, all hardware offloading disabled, default ruleset
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  29.4 GBytes  4.21 Gbits/sec  12054             sender
[  5]   0.00-60.00  sec  29.4 GBytes  4.21 Gbits/sec                  receiver


OpenWRT 19.07.3 1500MTU receiving from WAN, vmx3 NICs, default ruleset
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  44.1 GBytes  6.31 Gbits/sec  40490             sender
[  5]   0.00-60.00  sec  44.1 GBytes  6.31 Gbits/sec                  receiver


OPNsense 20.7.3 1500MTU receiving from WAN, vmx3 NICs, all hardware offloading disabled, default ruleset
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  5.39 GBytes   771 Mbits/sec  362             sender
[  5]   0.00-60.00  sec  5.39 GBytes   771 Mbits/sec                  receiver


OPNsense 20.7.3(netflow disabled) 1500MTU receiving from WAN, vmx3 NICs, all hardware offloading disabled, default ruleset
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  6.66 GBytes   953 Mbits/sec  561             sender
[  5]   0.00-60.00  sec  6.66 GBytes   953 Mbits/sec                  receiver


OPNsense 20.7.3(netmap kernel) 1500MTU receiving from WAN, vmx3 NICs, all hardware offloading disabled, default ruleset
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  5.35 GBytes   766 Mbits/sec  434             sender
[  5]   0.00-60.00  sec  5.35 GBytes   766 Mbits/sec                  receiver


OPNsense 20.7.3(netmap kernel, netflow disabled) 1500MTU receiving from WAN, vmx3 NICs, all hardware offloading disabled, default ruleset
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  6.55 GBytes   937 Mbits/sec  399             sender
[  5]   0.00-60.00  sec  6.55 GBytes   937 Mbits/sec                  receiver



It's actually quite interesting to see the performance degradation from pfSense 2.4 to 2.5.

One would think that things were moving forward instead of backwards.

And could it be on purpose, now that TNSR is launched, which is somehow able to route significantly more?

I know it's kernel dependent, but it's really annoying that the new FreeBSD releases actually perform worse than 10.3 and the OSes that depend on it.

Given the right MTUs, you can easily push 7+ Gbit/s on a FW.

I probably should have clarified that. I tested both *sense-based distros just to show that they both see a hit with the FreeBSD 12.x kernel. I don't think this is out of malicious intent from either side, just teething issues due to the new way the 12.x kernel pushes packets. I'm NOT trying to compare OPNsense to pfSense; I merely wanted to show that they both see a hit moving to 12.x.

There is an upside to all of this. I'm running OPNsense 20.7.3 on bare metal at home with the stock kernel. With the FreeBSD 12.x implementation I no longer need to leave FQ_CoDel shaping enabled to get A+ scores on my 500/500 fiber connection. It seems the way FreeBSD 12.x handles transfer queues is much more efficient. I'm sure that as time moves forward this will all get worked out. I'm posting here mainly to show what I am seeing, and hopefully we can see the numbers get better as newer kernels are integrated.

Yes, it needs a bigger user base to test and diagnose. I'm sure if pfSense switched, there would be faster progress. Currently it's up to the Sensei guys and the 12.1 community.

Do they need a sponsor to make it happen sooner?


Although we haven't experienced performance issues on the equipment we sell ourselves, quite a lot of the feedback in this thread seems to be related to virtual setups.
Since we had a setup available from the webinar last Thursday, I thought I'd replicate the simple vmxnet3 test on our end.

Small disclaimer upfront: I'm not a frequent VMware ESXi user, so I just followed the obvious steps.

Our test machine is really small, not extremely fast, but usable for the purpose (a random desktop which was available).

Machine specs:

Lenovo 10T700AHMH desktop
6 CPUs x Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
8GB Memory
|- OPNsense vm, 2 vcores
|- kali1, 1 vcore
|- kali2, 1 vcore


While going through the VMware setup, for some reason I wasn't allowed to select VMXNET3, so I edited the .vmx file manually
to make sure all attached interfaces used the correct driver.


ethernetX.virtualDev = "vmxnet3"


The clients attached are simple Kali Linux installs, each using its own vSwitch, so traffic is measured from kali1 to kali2
using iperf3 (which doesn't say a lot about real-world performance, but I didn't have the time or spirit available to set up TRex and proper test sets).


[kali1, client] --- vswitch1 --- [OPNsense] --- vswitch2 --- [kali2, server]
192.168.1.100/24     -     192.168.1.1/24,192.168.2.1/24   -  192.168.2.100/24


Before testing, let's establish a baseline: move both Kali Linux machines into the same network and run iperf3 between them.


# iperf3 -c 192.168.2.100 -t 10000
Connecting to host 192.168.2.100, port 5201
[  5] local 192.168.2.101 port 55240 connected to 192.168.2.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  3.34 GBytes  28.7 Gbits/sec    0   1.91 MBytes       
[  5]   1.00-2.00   sec  5.03 GBytes  43.2 Gbits/sec    0   2.93 MBytes       
[  5]   2.00-3.00   sec  5.24 GBytes  45.0 Gbits/sec    0   3.08 MBytes       
[  5]   3.00-4.00   sec  5.18 GBytes  44.5 Gbits/sec    0   3.08 MBytes       
[  5]   4.00-5.00   sec  5.23 GBytes  45.0 Gbits/sec    0   3.08 MBytes       


Which is the absolute maximum my setup could reach, using Linux with all defaults set... but since we don't use
any offloading features on the firewall side (https://wiki.freebsd.org/10gFreeBSD/Router), it is fairer to check what the performance looks like when offloading is disabled on
the same setup.

So, we disable all offloading, assuming our router/firewall won't use them either.


# ethtool -K eth0 lro off   # large receive offload
# ethtool -K eth0 tso off   # TCP segmentation offload
# ethtool -K eth0 rx off    # receive checksum offload
# ethtool -K eth0 tx off    # transmit checksum offload
# ethtool -K eth0 sg off    # scatter-gather


And test again:


# iperf3 -c 192.168.2.100 -t 10000
Connecting to host 192.168.2.100, port 5201
[  5] local 192.168.2.101 port 55274 connected to 192.168.2.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.20 GBytes  10.3 Gbits/sec    0    458 KBytes       
[  5]   1.00-2.00   sec  1.30 GBytes  11.2 Gbits/sec    0   1007 KBytes       
[  5]   2.00-3.00   sec  1.30 GBytes  11.1 Gbits/sec    0   1.18 MBytes       
[  5]   3.00-4.00   sec  1.29 GBytes  11.1 Gbits/sec    0   1.24 MBytes       
[  5]   4.00-5.00   sec  1.30 GBytes  11.2 Gbits/sec    0   1.37 MBytes       
[  5]   5.00-6.00   sec  1.31 GBytes  11.2 Gbits/sec    0   1.43 MBytes       
[  5]   6.00-7.00   sec  1.30 GBytes  11.2 Gbits/sec    0   1.51 MBytes       


Which keeps about 25% of our original throughput; VMware seems to be very efficient when these hardware tasks are pushed back
to the hypervisor.

Now reconnect the Kali machines back into their own networks, with OPNsense (20.7.3 + new netmap kernel) in between.
The firewall policy is simple: just accept anything, no other features used.


# iperf3 -c 192.168.2.100 -t 10000
Connecting to host 192.168.2.100, port 5201
[  5] local 192.168.1.100 port 54870 connected to 192.168.2.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   280 MBytes  2.35 Gbits/sec   59    393 KBytes       
[  5]   1.00-2.00   sec   281 MBytes  2.35 Gbits/sec   33    383 KBytes       
[  5]   2.00-3.00   sec   279 MBytes  2.34 Gbits/sec   60    379 KBytes       
[  5]   3.00-4.00   sec   275 MBytes  2.31 Gbits/sec   46    380 KBytes       
[  5]   4.00-5.00   sec   276 MBytes  2.32 Gbits/sec   31    387 KBytes       


The next step is to check the man page of the vmx driver (man vmx), which lists quite a few sysctl tunables that
don't seem to work anymore on 12.x, probably due to the switch to iflib. One comment, however, seems quite relevant:

Quote
The vmx driver supports multiple transmit and receive queues.  Multiple
queues are only supported by certain VMware products, such as ESXi.  The
number of queues allocated depends on the presence of MSI-X, the number
of configured CPUs, and the tunables listed below.  FreeBSD does not
enable MSI-X support on VMware by default.  The
hw.pci.honor_msi_blacklist tunable must be disabled to enable MSI-X
support.

So we go to the tunables, disable hw.pci.honor_msi_blacklist (set it to 0) and reboot our machine.
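A minimal sketch of that change from a shell, assuming console access (on OPNsense the same thing can be done via the tunables page in the GUI; using /boot/loader.conf.local follows the usual FreeBSD convention and is an assumption for this box):

# loader tunable, only takes effect at boot:
echo 'hw.pci.honor_msi_blacklist="0"' >> /boot/loader.conf.local
reboot
# afterwards, check whether the vmx interfaces attached with MSI-X and how many queues came up:
dmesg | grep -i vmx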

Time to test again:


# iperf3 -c 192.168.2.100
Connecting to host 192.168.2.100, port 5201
[  5] local 192.168.1.100 port 54878 connected to 192.168.2.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   350 MBytes  2.93 Gbits/sec  589    304 KBytes       
[  5]   1.00-2.00   sec   342 MBytes  2.87 Gbits/sec  378    337 KBytes       
[  5]   2.00-3.00   sec   342 MBytes  2.87 Gbits/sec  324    298 KBytes       
[  5]   3.00-4.00   sec   343 MBytes  2.88 Gbits/sec  292    301 KBytes       
[  5]   4.00-5.00   sec   345 MBytes  2.89 Gbits/sec  337    307 KBytes       
[  5]   5.00-6.00   sec   341 MBytes  2.86 Gbits/sec  266    301 KBytes       
[  5]   6.00-7.00   sec   341 MBytes  2.86 Gbits/sec  301    311 KBytes       


Single-flow performance is often a challenge, so to be sure, let's try to push 2 parallel streams through iperf3:


# iperf3 -c 192.168.2.100 -P 2 -t 10000
Connecting to host 192.168.2.100, port 5201
[  5] local 192.168.1.100 port 54952 connected to 192.168.2.100 port 5201
[  7] local 192.168.1.100 port 54954 connected to 192.168.2.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   261 MBytes  2.19 Gbits/sec  176    281 KBytes       
[  7]   0.00-1.00   sec   245 MBytes  2.05 Gbits/sec  136    342 KBytes       
[SUM]   0.00-1.00   sec   506 MBytes  4.24 Gbits/sec  312             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   1.00-2.00   sec   302 MBytes  2.54 Gbits/sec   57    281 KBytes       
[  7]   1.00-2.00   sec   208 MBytes  1.74 Gbits/sec   25    375 KBytes       
[SUM]   1.00-2.00   sec   510 MBytes  4.28 Gbits/sec   82             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   2.00-3.00   sec   304 MBytes  2.55 Gbits/sec   45    284 KBytes       
[  7]   2.00-3.00   sec   210 MBytes  1.76 Gbits/sec    9    392 KBytes       
[SUM]   2.00-3.00   sec   514 MBytes  4.31 Gbits/sec   54             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   3.00-4.00   sec   304 MBytes  2.55 Gbits/sec   39    386 KBytes       
[  7]   3.00-4.00   sec   209 MBytes  1.75 Gbits/sec   15    331 KBytes       
[SUM]   3.00-4.00   sec   512 MBytes  4.30 Gbits/sec   54             
^C- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   4.00-4.95   sec   288 MBytes  2.54 Gbits/sec   39    287 KBytes       
[  7]   4.00-4.95   sec   198 MBytes  1.74 Gbits/sec   23    325 KBytes       
[SUM]   4.00-4.95   sec   485 MBytes  4.28 Gbits/sec   62             

Which is already way better. More sessions don't seem to impact my setup as far as I could see, but that could also
be caused by the number of queues configured (2, see dmesg | grep vmx). In the new iflib world I wasn't able to
increase that number, so I'll leave it at that (the relevant tunables are sketched below).
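For reference, the knobs that would normally control the queue count in the iflib world are the per-device override tunables from iflib(4); a hedged sketch, with purely illustrative values — I'm not claiming these actually take effect for vmx:

# in /boot/loader.conf.local (loader tunables, names per iflib(4)):
dev.vmx.0.iflib.override_ntxqs="4"
dev.vmx.0.iflib.override_nrxqs="4"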

Just for fun, I disabled pf (pfctl -d) to get a bit of insight into how the firewall itself impacts performance;
the details of that test are shown below (just for reference).

# iperf3 -c 192.168.2.100 -P 2 -t 10000
Connecting to host 192.168.2.100, port 5201
[  5] local 192.168.1.100 port 55038 connected to 192.168.2.100 port 5201
[  7] local 192.168.1.100 port 55040 connected to 192.168.2.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   300 MBytes  2.51 Gbits/sec    0    888 KBytes       
[  7]   0.00-1.00   sec   302 MBytes  2.53 Gbits/sec   69   2.18 MBytes       
[SUM]   0.00-1.00   sec   601 MBytes  5.04 Gbits/sec   69             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   1.00-2.00   sec   335 MBytes  2.81 Gbits/sec  167    904 KBytes       
[  7]   1.00-2.00   sec   342 MBytes  2.87 Gbits/sec  536   1.67 MBytes       
[SUM]   1.00-2.00   sec   678 MBytes  5.68 Gbits/sec  703             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   2.00-3.00   sec   335 MBytes  2.81 Gbits/sec    0   1.12 MBytes       
[  7]   2.00-3.00   sec   342 MBytes  2.87 Gbits/sec    0   1.81 MBytes       
[SUM]   2.00-3.00   sec   678 MBytes  5.68 Gbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   3.00-4.00   sec   332 MBytes  2.79 Gbits/sec  280   1.04 MBytes       
[  7]   3.00-4.00   sec   344 MBytes  2.88 Gbits/sec  482   1.44 MBytes       
[SUM]   3.00-4.00   sec   676 MBytes  5.67 Gbits/sec  762             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   4.00-5.00   sec   332 MBytes  2.79 Gbits/sec  206   1017 KBytes       
[  7]   4.00-5.00   sec   338 MBytes  2.83 Gbits/sec  292   1.22 MBytes       
[SUM]   4.00-5.00   sec   670 MBytes  5.62 Gbits/sec  498             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   5.00-6.00   sec   331 MBytes  2.78 Gbits/sec    0   1.21 MBytes       
[  7]   5.00-6.00   sec   339 MBytes  2.84 Gbits/sec    0   1.40 MBytes       
[SUM]   5.00-6.00   sec   670 MBytes  5.62 Gbits/sec    0             
^C- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   6.00-6.60   sec   199 MBytes  2.78 Gbits/sec    0   1.32 MBytes       
[  7]   6.00-6.60   sec   202 MBytes  2.83 Gbits/sec    0   1.50 MBytes       
[SUM]   6.00-6.60   sec   401 MBytes  5.61 Gbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
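(For completeness: disabling pf this way is only temporary for testing, and it can be switched back on without a reboot.)

pfctl -e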


On physical setups I've seen better numbers, but driver performance and settings may impact the situation (a lot).
While looking into the sysctl settings, I also stumbled on https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=237166.
It explains how to set the receive and send descriptors; for my test it didn't change a lot. Some other iflib setting might,
but I haven't tried.
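For anyone who wants to experiment with those descriptor settings, they are loader tunables along these lines (names per iflib(4); the values are illustrative guesses rather than measured recommendations, and the receive side may expect a comma-separated list per free list):

# in /boot/loader.conf.local
dev.vmx.0.iflib.override_ntxds="4096"
dev.vmx.0.iflib.override_nrxds="2048"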

Since we haven't seen huge performance degradations on our physical setups,
there's the possibility that default settings have changed in vmx (I haven't looked into that, nor do I plan to).
Driver quality might have been better pre-iflib, which is always a bit of a risk on FreeBSD after major upgrades, to be honest.

In our experience (on Intel) the situation isn't bad at all after switching to FreeBSD 12.1,
but that's just my personal opinion (based on measurements on our equipment some months ago).

Best regards,

Ad



Quote from: Supermule on October 18, 2020, 04:36:04 PM
But 45 gbit/s???

Quote from: AdSchellevis on October 17, 2020, 04:17:10 PM
The clients attached are simple Kali Linux installs, each using its own vSwitch, so traffic is measured from kali1 to kali2

:)

They still need drivers and networking, as the vSwitch is attached to a network adapter.



Quote from: Gauss23 on October 18, 2020, 04:37:56 PM
Quote from: Supermule on October 18, 2020, 04:36:04 PM
But 45 gbit/s???

Quote from: AdSchellevis on October 17, 2020, 04:17:10 PM
The clients attached are simple Kali Linux installs, each using its own vSwitch, so traffic is measured from kali1 to kali2

:)

Your point is? Just for clarification: the 45 Gbps is measured between two Linux (Kali) machines on the same network (vSwitch) using all default optimisations, which is the baseline (the maximum achievable without anything in between) in my case.

I did some tests and noticed that my network suffers from two different problems which interfere with each other. The I219-LM NIC in my server has autonegotiation problems, causing a performance degradation of around 80 percent. I solved this issue by forcing the NIC to 1 Gbit/s full-duplex. Now, performance tests with iperf3 reach around 980 Mbit/s in direct transfers between client and server, which looks fine.
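For reference, a hedged sketch of one way to do this on the Linux side (the interface name eno1 is an assumption; since 1000BASE-T formally requires autonegotiation, restricting the advertised modes is usually safer than turning autoneg off entirely):

# advertise only 1000baseT/Full (bitmask per ethtool(8)):
ethtool -s eno1 advertise 0x020
# verify the negotiated link:
ethtool eno1 | grep -E 'Speed|Duplex|Auto'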

After integrating the OPNsense into my setup again, so that the firewall routes the traffic between my server and client subnets, throughput degraded from 980 Mbit/s to ca. 245 Mbit/s. I should mention that my OPNsense (v20.7.3) runs on bare metal, so a virtualization impact can be ruled out.

Next steps will be some resource monitoring during the iperf3 tests; a rough sketch of what I plan to run is below.
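A hedged sketch of the kind of monitoring I have in mind on the OPNsense console while iperf3 is running (standard FreeBSD tools, nothing OPNsense-specific):

top -SHP        # per-CPU load including kernel threads, to spot interrupt/netisr hotspots
vmstat -i       # interrupt counts and rates per device
netstat -hw 1   # packets in/out per second across all interfaces, refreshed every second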