Poor Throughput (Even On Same Network Segment)

Started by hax0rwax0r, August 25, 2020, 08:31:25 PM

I am seeing very slow throughput on pfSense as well, testing with iperf3.

Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
16 CPUs: 2 package(s) x 8 core(s)
AES-NI CPU Crypto: Yes (inactive)

Using Suricata I can't get more than 200 Mbps... pretty annoying.

OK, so we have an upstream problem with FreeBSD and some chances to get it fixed in the next months.
So the interim options for now are to:

a) go back to 20.1
b) disable netmap (IPS/Sensei)
c) accept the lowered performance

I had a talk with Franco yesterday; there are some promising patches waiting and we definitely need some testers, so if you are not going back to 20.1, this would be a fine option.

Wasn't the problem OPNsense/pfSense rather than FreeBSD? Didn't the 10 Gbit tests show wire speed on a FreeBSD machine using pf?

No, OPNsense 20.7 and pfSense 2.5 are based on FreeBSD 12.x; OPNsense 20.1 and pfSense 2.4 on FreeBSD 11.x.

With FreeBSD 12 the interface/networking stack was changed to iflib, which has known problems with netmap; people are already working on it.
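One way to check whether a given NIC driver is iflib-based is to look for its per-device iflib sysctl subtree; "ix" unit 0 below is only an example, substitute your own driver and unit:

# iflib-based drivers expose a dev.<driver>.<unit>.iflib.* sysctl subtree
sysctl dev.ix.0.iflib | head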

@minimugmail

Quote from: hax0rwax0r on September 02, 2020, 07:34:01 AM
OK, back to basics here.  I couldn't leave well enough alone and I did more testing tonight because I just couldn't believe that my CPU couldn't even do single threaded gigabit.  Here's my test scenario:

Test Scenario 1:

  • Physical Linux Server (CentOS 7) on VLAN 2 (iperf3 client)
  • Virtual Linux Server (CentOS 7) on VLAN 24 (iperf3 server)
  • Dell PowerEdge R430 w/Intel X520-SR2 and HardenedBSD 12-STABLE (BUILD-LATEST 2020-08-31)
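For reference, a sketch of iperf3 invocations matching these runs; <server-ip> stands in for the VLAN 24 server's address and is a placeholder, since the exact commands were not included:

# on the VLAN 24 server
iperf3 -s
# single-threaded 10-second run from the VLAN 2 client
iperf3 -c <server-ip> -t 10
# 6 parallel streams
iperf3 -c <server-ip> -t 10 -P 6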

Single Threaded:
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  1.00 GBytes   863 Mbits/sec    0             sender
[  4]   0.00-10.00  sec  1.00 GBytes   860 Mbits/sec                  receiver


6 Parallel Threads:
[ ID] Interval           Transfer     Bandwidth       Retr
[SUM]   0.00-10.00  sec  2.23 GBytes  1.91 Gbits/sec  938             sender
[SUM]   0.00-10.00  sec  2.22 GBytes  1.90 Gbits/sec                  receiver


Notice a common theme here with the ~850 Mbps single-threaded test.  It's pretty close to what I get with OPNsense.  Note this is THROUGH the firewall and not from the firewall.  Also note my system did have IPv6 addresses from my ISP on each of the interfaces, though I was only testing IPv4 traffic.

Test Scenario 2:

  • Physical Linux Server (CentOS 7) on VLAN 2 (iperf3 client)
  • Virtual Linux Server (CentOS 7) on VLAN 24 (iperf3 server)
  • Dell PowerEdge R430 w/Intel X520-SR2 and FreeBSD 12.1-RELEASE

Single Threaded:
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  9.75 GBytes  8.38 Gbits/sec  573             sender
[  4]   0.00-10.00  sec  9.75 GBytes  8.38 Gbits/sec                  receiver


6 Parallel Threads:
[ ID] Interval           Transfer     Bandwidth       Retr
[SUM]   0.00-10.00  sec  10.5 GBytes  9.05 Gbits/sec  3607             sender
[SUM]   0.00-10.00  sec  10.5 GBytes  9.04 Gbits/sec                  receiver


I couldn't believe my eyes; I had to triple-check that it was in fact pushing 8.38 Gbps THROUGH the FreeBSD 12.1 server and not somehow taking some magical alternate path.  It was, in fact, going through the FreeBSD router.  As you can see, the parallel test is about 1 Gbps short of wire speed.  Excellent!  Also note my system did have IPv6 addresses from my ISP on each of the interfaces, though I was only testing IPv4 traffic.

I then enabled pf (via pfctl) on the FreeBSD 12.1 router to see how that affected performance.  I'm not sure how much adding rules impacts throughput, but I did notice a measurable drop in the single-thread test (6.23 Gbps), while the drop in the parallel-thread test was negligible (8.94 Gbps).
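For illustration only, a minimal permissive pf setup of the kind described here; the actual ruleset used in this test was not posted, so treat this as an assumption:

# /etc/pf.conf -- minimal permissive example ruleset (not the rules used above)
set skip on lo0
pass in all
pass out all keep state

# enable pf at boot and start it (loads /etc/pf.conf)
sysrc pf_enable=YES
service pf start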

As of right now, it seems very strange to me that HardenedBSD exhibits this same low single-threaded throughput, and likewise low parallel-thread throughput, compared to FreeBSD.

I am willing to accept that I am not accounting for something here; however, given near wire-speed throughput on the exact same hardware with FreeBSD versus HardenedBSD, it seems to me something is very different in HardenedBSD.

What are your thoughts?

@hax0rwax0r

Try to repeat the FreeBSD 12.1-RELEASE test with our kernel instead of the stock one. I don't expect any differences.

https://pkg.opnsense.org/FreeBSD:12:amd64/20.7/sets/kernel-20.7.2-amd64.txz


Cheers,
Franco

Details matter (a lot) in these cases; we haven't seen huge differences on our end (apart from netmap issues with certain cards, which we don't ship ourselves). That being said, IPS is a feature that really stresses your hardware; quite a few setups are not able to do more than the 200 Mbps mentioned in this thread.

Please be advised that HardenedBSD 12-STABLE isn't the same as OPNsense 20.7; the differences between the OPNsense 20.7 src and FreeBSD are a bit smaller. But if you're convinced your issue lies with HardenedBSD's additions, it might be a good starting point (and a plain install has fewer features enabled).

You can always try to install our kernel on the same FreeBSD install which worked without issues (as Franco suggested); it could make the steps easier to reproduce.

If you want to compare HBSD and FBSD anyway, always make sure you're comparing apples with apples: check interface settings, build options and tunables (sysctl -a). Testing between physical interfaces (not VLANs on the same one) is probably easier, so you know for sure traffic only flows through the physical interface once.
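A minimal sketch of how those settings could be captured on both boxes for comparison (hostnames, interface name and file paths below are placeholders):

# capture tunables and interface settings on each box
sysctl -a > /tmp/sysctl-$(hostname).txt
ifconfig -m ix0 > /tmp/ifconfig-$(hostname).txt    # "ix0" is an example interface
# copy the captures to one machine and compare
diff -u sysctl-fbsd.txt sysctl-hbsd.txt | less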

In case someone would like to reproduce your test, make sure to document step by step how one could do that (including network segments used).

Best regards,

Ad

Quote from: Supermule on September 02, 2020, 11:12:45 AM
@minimugmail

Quote from: hax0rwax0r on September 02, 2020, 07:34:01 AM
[...]

I have the same values with 20.7 on SuperMicro hardware with a Xeon and an X520, as posted before. It's something in your hardware.

I am not super familiar with FreeBSD, so how would I go about swapping your kernel in for the existing stock FreeBSD 12.1 one I am running?  I searched around on Google and found how to build a custom kernel from source, but the txz file you linked appears to be already compiled, so I don't think that's what I want to do.

I also found reference to pkg-static to install locally downloaded packages but wanted to get some initial guidance before totally hosing this up.

This should also be the same kernel which gets installed with latest 20.7.2

Oh, I guess I misunderstood franco's instructions.  I thought he was asking me to drop the linked 20.7.2 kernel in place on my FreeBSD 12.1 install, which is what I was asking how exactly to do.

I think, with your clarification and after re-reading the post, that franco was just asking me to try an install of 20.7.2, which happens to be running that kernel, and re-run my tests to see if it improves.

If that's the case, I will try and report back my findings with OPNsense 20.7.2.

No, I did mean FreeBSD 12.1 with our kernel. All the networking is in the kernel, so we will see whether this is OPNsense vs. HBSD vs. FBSD or some sort of tweaking effort.

# fetch https://pkg.opnsense.org/FreeBSD:12:amd64/20.7/sets/kernel-20.7.2-amd64.txz
# mv /boot/kernel /boot/kernel.old
# tar -C / -xf kernel-20.7.2-amd64.txz
# kldxref /boot/kernel

It should have a new /boot/kernel now and a reboot should activate it. You can compare build info after the system is back up.

# uname -rv
12.1-RELEASE-p8-HBSD FreeBSD 12.1-RELEASE-p8-HBSD #0  b3665671c4d(stable/20.7)-dirty: Thu Aug 27 05:58:53 CEST 2020     root@sensey64:/usr/obj/usr/src/amd64.amd64/sys/SMP
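If the box misbehaves with the new kernel, a rollback sketch using the backup made above (the .opn name is arbitrary):

# restore the stock kernel from the /boot/kernel.old backup created earlier
mv /boot/kernel /boot/kernel.opn
mv /boot/kernel.old /boot/kernel
kldxref /boot/kernel
reboot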


Cheers,
Franco

OK here are the test results as you requested:

FreeBSD 12.1 (pf enabled):

[root@fbsd1 ~]# uname -rv
12.1-RELEASE FreeBSD 12.1-RELEASE r354233 GENERIC

[root@fbsd1 ~]# top -aSH
last pid:  2954;  load averages:  0.44,  0.42,  0.41                                                                      up 0+01:38:55  20:13:46
132 threads:   10 running, 104 sleeping, 18 waiting
CPU:  0.0% user,  0.0% nice, 19.7% system,  5.2% interrupt, 75.1% idle
Mem: 10M Active, 6100K Inact, 271M Wired, 21M Buf, 39G Free
Swap: 3968M Total, 3968M Free

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
   11 root        155 ki31      0    96K RUN      5  94:58  95.25% [idle{idle: cpu5}]
   11 root        155 ki31      0    96K CPU1     1  93:26  83.69% [idle{idle: cpu1}]
   11 root        155 ki31      0    96K RUN      0  94:44  73.68% [idle{idle: cpu0}]
   11 root        155 ki31      0    96K CPU4     4  93:15  72.51% [idle{idle: cpu4}]
   11 root        155 ki31      0    96K CPU3     3  93:36  64.80% [idle{idle: cpu3}]
   11 root        155 ki31      0    96K RUN      2  92:55  62.29% [idle{idle: cpu2}]
    0 root        -76    -      0   480K CPU2     2   0:05  34.76% [kernel{if_io_tqg_2}]
    0 root        -76    -      0   480K CPU3     3   0:14  33.49% [kernel{if_io_tqg_3}]
   12 root        -52    -      0   304K CPU0     0  26:23  29.62% [intr{swi6: task queue}]
    0 root        -76    -      0   480K -        4   0:05  23.31% [kernel{if_io_tqg_4}]
    0 root        -76    -      0   480K -        0   0:05  12.31% [kernel{if_io_tqg_0}]
    0 root        -76    -      0   480K -        1   0:04  10.01% [kernel{if_io_tqg_1}]
   12 root        -88    -      0   304K WAIT     5   3:55   2.28% [intr{irq264: mfi0}]
    0 root        -76    -      0   480K -        5   0:06   1.88% [kernel{if_io_tqg_5}]
2954 root         20    0    13M  3676K CPU5     5   0:00   0.02% top -aSH
   12 root        -60    -      0   304K WAIT     0   0:01   0.01% [intr{swi4: clock (0)}]
    0 root        -76    -      0   480K -        4   0:02   0.01% [kernel{if_config_tqg_0}]


Single Thread:
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  8.45 GBytes  7.26 Gbits/sec  802             sender
[  4]   0.00-10.00  sec  8.45 GBytes  7.26 Gbits/sec                  receiver


10 Threads:
[ ID] Interval           Transfer     Bandwidth       Retr
[SUM]   0.00-10.00  sec  9.85 GBytes  8.46 Gbits/sec  2991             sender
[SUM]   0.00-10.00  sec  9.83 GBytes  8.45 Gbits/sec                  receiver



FreeBSD 12.1 with OPNsense Kernel (pf enabled):

[root@fbsd1 ~]# uname -rv
12.1-RELEASE FreeBSD 12.1-RELEASE r354233 GENERIC

[root@fbsd1 ~]# fetch https://pkg.opnsense.org/FreeBSD:12:amd64/20.7/sets/kernel-20.7.2-amd64.txz
[root@fbsd1 ~]# mv /boot/kernel /boot/kernel.old
[root@fbsd1 ~]# tar -C / -xf kernel-20.7.2-amd64.txz
[root@fbsd1 ~]# kldxref /boot/kernel
[root@fbsd1 ~]# reboot

[root@fbsd1 ~]# uname -rv
12.1-RELEASE-p8-HBSD FreeBSD 12.1-RELEASE-p8-HBSD #0  b3665671c4d(stable/20.7)-dirty: Thu Aug 27 05:58:53 CEST 2020     root@sensey64:/usr/obj/usr/src/amd64.amd64/sys/SMP

[root@fbsd1 ~]# top -aSH
last pid: 43891;  load averages:  0.99,  0.49,  0.20                                                                      up 0+00:04:28  20:29:24
131 threads:   13 running, 100 sleeping, 18 waiting
CPU:  0.0% user,  0.0% nice, 62.5% system,  3.5% interrupt, 33.9% idle
Mem: 14M Active, 1184K Inact, 270M Wired, 21M Buf, 39G Free
Swap: 3968M Total, 3968M Free

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
    0 root        -76    -      0   480K CPU3     3   0:08  81.27% [kernel{if_io_tqg_3}]
    0 root        -76    -      0   480K CPU1     1   0:09  74.39% [kernel{if_io_tqg_1}]
    0 root        -76    -      0   480K CPU5     5   0:08  73.20% [kernel{if_io_tqg_5}]
    0 root        -76    -      0   480K CPU0     0   0:21  71.79% [kernel{if_io_tqg_0}]
   11 root        155 ki31      0    96K RUN      4   4:09  54.15% [idle{idle: cpu4}]
   11 root        155 ki31      0    96K RUN      2   4:09  51.30% [idle{idle: cpu2}]
    0 root        -76    -      0   480K CPU2     2   0:05  40.10% [kernel{if_io_tqg_2}]
    0 root        -76    -      0   480K -        4   0:09  37.60% [kernel{if_io_tqg_4}]
   11 root        155 ki31      0    96K RUN      0   4:03  26.48% [idle{idle: cpu0}]
   11 root        155 ki31      0    96K RUN      5   4:14  25.87% [idle{idle: cpu5}]
   11 root        155 ki31      0    96K RUN      1   4:09  24.32% [idle{idle: cpu1}]
   12 root        -52    -      0   304K RUN      2   1:12  20.63% [intr{swi6: task queue}]
   11 root        155 ki31      0    96K CPU3     3   4:00  17.30% [idle{idle: cpu3}]
   12 root        -88    -      0   304K WAIT     5   0:10   1.47% [intr{irq264: mfi0}]
43891 root         20    0    13M  3660K CPU4     4   0:00   0.03% top -aSH
   21 root        -16    -      0    16K -        4   0:00   0.02% [rand_harvestq]
   12 root        -60    -      0   304K WAIT     1   0:00   0.02% [intr{swi4: clock (0)}]


Single Thread:
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  2.89 GBytes  2.48 Gbits/sec    0             sender
[  4]   0.00-10.00  sec  2.89 GBytes  2.48 Gbits/sec                  receiver


10 Threads:
[ ID] Interval           Transfer     Bandwidth       Retr
[SUM]   0.00-10.00  sec  8.16 GBytes  7.01 Gbits/sec  4260             sender
[SUM]   0.00-10.00  sec  8.13 GBytes  6.98 Gbits/sec                  receiver


I included the "top -aSH" output again because my general observation between OPNsense kernel and FreeBSD 12.1 stock kernel is the "[kernel{if_io_tqg_X}]" process usage.  Even on an actual OPNsense 20.7.2 installation I notice the exact same behavior of the "[kernel{if_io_tqg_X}]" being consistently higher and throughput significantly slower, specifically on single threaded tests.  Note that both of the top outputs were only from the 10 thread count tests only as I did not think to capture them during the single threaded test.

I can't help but think that whatever the high "[kernel{if_io_tqg_X}]" usage on the OPNsense kernel means, it is starving the system of its throughput potential.

Thoughts?  Next steps I can run and provide results from?
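For the missing single-thread captures, a sketch of how they could be logged in batch mode while the test runs (file paths are placeholders):

# log top in batch mode (system procs + threads) during the single-threaded run
top -aSHb -d 12 -s 1 > /tmp/top-single-thread.log &
vmstat -i > /tmp/vmstat-before.txt
# ...start the single-threaded iperf3 test on the client now...
vmstat -i > /tmp/vmstat-after.txt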

Just wanted to post here due to the excellent testing from OP and to corroborate the numbers that OP is seeing.

My testing setup is as follows:
ESXi 6.7u3, host has an E3 1220v3 and 32GB of RAM
All Firewall VMs have 2vCPU. 5GB of RAM allocated to OPNsense.
VMXNET3 NICs negotiated at 10 Gbps

In pfSense and OPNsense, I disabled all of the hardware offloading features. I am using client and server VMs on the WAN and LAN sides of the firewall VMs. This means I am pushing/pulling traffic through the firewalls, I am not running iperf directly on any of the firewalls themselves. Because I am doing this on a single ESXi host and the traffic is within the same host/vSwitch, the traffic is never routed to my physical network switch and therefore I can test higher throughput.

pfSense and OPNsense were both out-of-the-box installs with their default rulesets. I did not add any packages or make any config changes beyond making sure that all hardware offloading was disabled. All iperf3 tests were run with the LAN-side client pulling traffic through the WAN-side interface, to simulate a large download; however, if I perform upload tests, my throughput results are the same. All iperf3 tests ran for 60 seconds and used the default MTU of 1500. The results below show the average of the 60-second runs. I ran each test twice and used the final result, to allow the firewalls to "warm up" and stabilize their throughput during testing.
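A sketch of the invocation this describes; the server address is a placeholder and the use of -R (reverse mode, the server sends) is an assumption based on the "pulling through the WAN interface" wording:

# 60-second download-style test: LAN-side client pulls from the WAN-side server
iperf3 -c <wan-server-ip> -t 60 -R
# upload-style test for comparison (client sends)
iperf3 -c <wan-server-ip> -t 60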

pfSense 2.4.5p1 1500MTU receiving from WAN, vmx3 NICs, all hardware offloading disabled, default ruleset
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  31.5 GBytes  4.50 Gbits/sec  11715             sender
[  5]   0.00-60.00  sec  31.5 GBytes  4.50 Gbits/sec                  receiver

OpenWRT 19.07.3 1500MTU receiving from WAN, vmx3 NICs, default ruleset
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  47.5 GBytes  6.81 Gbits/sec  44252             sender
[  5]   0.00-60.00  sec  47.5 GBytes  6.81 Gbits/sec                  receiver

OPNsense 20.7.2 1500MTU receiving from WAN, vmx3 NICs, all hardware offloading disabled, default ruleset
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  6.83 GBytes   977 Mbits/sec  459             sender
[  5]   0.00-60.00  sec  6.82 GBytes   977 Mbits/sec                  receiver


I also notice that while doing a throughput test on OPNsense, one of the vCPUs is completely consumed. I did not see this behavior with Linux or pfSense in my testing; the attached screenshot shows the CPU usage I'm seeing while the iperf3 test is running.

Hi, Newbie here. 

I also notice this problem with OPNsense 20.7.2, which was released recently. I get only about 450 Mbps in my LAN when no one uses it besides me (I disconnected every downlink device). I use iperf3 on Windows to check it.

PS E:\Util> .\iperf3.exe -c 192.168.10.8 -p 26574
Connecting to host 192.168.10.8, port 26574
[  4] local 192.168.12.4 port 50173 connected to 192.168.10.8 port 26574
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec  49.1 MBytes   412 Mbits/sec
[  4]   1.00-2.00   sec  52.5 MBytes   440 Mbits/sec
[  4]   2.00-3.00   sec  51.8 MBytes   434 Mbits/sec
[  4]   3.00-4.00   sec  52.4 MBytes   439 Mbits/sec
[  4]   4.00-5.00   sec  52.1 MBytes   438 Mbits/sec
[  4]   5.00-6.00   sec  52.6 MBytes   441 Mbits/sec
[  4]   6.00-7.00   sec  52.4 MBytes   440 Mbits/sec
[  4]   7.00-8.00   sec  46.4 MBytes   389 Mbits/sec
[  4]   8.00-9.00   sec  49.0 MBytes   411 Mbits/sec
[  4]   9.00-10.00  sec  51.6 MBytes   433 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-10.00  sec   510 MBytes   428 Mbits/sec                  sender
[  4]   0.00-10.00  sec   510 MBytes   428 Mbits/sec                  receiver


My hardware is an AMD Ryzen 7 2700 with 16 GB of RAM. The NIC is an Intel i350-T2 Gigabit Ethernet adapter.
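If it helps narrow things down, a multi-stream comparison mirroring the parallel tests earlier in the thread could look like this (address and port taken from the output above; the -P value is arbitrary):

# single stream vs. 4 parallel streams through the same path
.\iperf3.exe -c 192.168.10.8 -p 26574 -t 10
.\iperf3.exe -c 192.168.10.8 -p 26574 -t 10 -P 4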