I originally posted on Reddit but figured I might get more traction here with this.
I have an OPNsense 20.7.1 server running on a Dell R430 with 16 GB DDR4 RAM, an Intel Xeon E5-2620 v3 (6 cores/12 threads @ 2.40GHz) CPU and an Intel X520-SR2 10GbE NIC.
My network has several VLANs and subnets, with my OPNsense box functioning as a router on a stick that handles all firewalling and routing between segments.
I recently upgraded my OPNsense to 20.7.1 and on a whim decided to run an iperf3 test between two VMs on different network segments to see what kind of throughput I was getting. I am certain, at least at some point, this very same hardware pushed over 6 Gbps on the same iperf3 test. Today it was getting around 850 Mbps every single time.
I started iperf3 as a server on my QNAP NAS device which is also attached to the same 10 Gbps switch and ran iperf3 as a client from OPNsense on the same network segment and got the same 850 Mbps throughput.
To make sure I wasn't limited by the QNAP NAS device, I ran the same iperf3 test with my other QNAP NAS device as a client to the first QNAP NAS device and it pushed 8.6 Gbps across the same network segment (no OPNsense involved) so both the QNAP and the switch can push it.
My question is: what do I have going wrong here? Even on the same network segment, OPNsense can't do more than 850 Mbps of throughput. I have no idea if this was happening before the upgrade to 20.7.1, but I know for sure it is happening now. I would assume an iperf3 test from the OPNsense server on the same network segment would remove any doubt that firewalling, etc. was the cause.
The interface shows 10 Gbps link speed, too, both from ifconfig and the switch itself.
My current MBUF Usage is 1 % (17726/1010734).
IDS/IPS package is installed but disabled.
I had "Hardware CRC" and "Hardware TSO" and "Hardware LRO" and "VLAN Hardware Filtering" all enabled. I have since set those all to disabled and rebooted. I can confirm that it disabled by looking at the interface flags in ifconfig:
Pre-reboot:
options=e53fbb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_UCAST,WOL_MCAST,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
Post-reboot:
options=803828<VLAN_MTU,JUMBO_MTU,WOL_UCAST,WOL_MCAST,WOL_MAGIC>
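For anyone who wants to flip these from the console instead of the GUI, here is a rough sketch of the equivalent (non-persistent) ifconfig toggles, with ix0 standing in for whatever name the NIC shows up as; OPNsense will re-apply its configured settings on the next reconfigure or reboot:
ifconfig ix0 -rxcsum -txcsum -rxcsum6 -txcsum6   # drop IPv4/IPv6 hardware checksum offload
ifconfig ix0 -tso -lro                           # drop TCP segmentation and large receive offload
ifconfig ix0 -vlanhwfilter                       # drop VLAN hardware filtering
ifconfig ix0                                     # the options=... line should shrink accordingly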
I ran top and was able to see a kernel thread (kernel{if_io_tqg_2}) using nearly 100% of a CPU core during this iperf3 test:
# top -aSH
last pid: 22772; load averages: 1.23, 0.94, 0.79 up 5+23:48:52 14:24:22
233 threads: 15 running, 193 sleeping, 25 waiting
CPU: 1.0% user, 0.0% nice, 16.1% system, 0.5% interrupt, 82.4% idle
Mem: 1485M Active, 297M Inact, 1657M Wired, 935M Buf, 12G Free
Swap: 8192M Total, 8192M Free
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
0 root -76 - 0 848K CPU2 2 279:51 99.77% [kernel{if_io_tqg_2}]
11 root 155 ki31 0 192K CPU3 3 130.8H 98.78% [idle{idle: cpu3}]
11 root 155 ki31 0 192K CPU9 9 131.3H 98.75% [idle{idle: cpu9}]
11 root 155 ki31 0 192K CPU1 1 129.7H 98.68% [idle{idle: cpu1}]
11 root 155 ki31 0 192K CPU10 10 138.1H 98.33% [idle{idle: cpu10}]
11 root 155 ki31 0 192K CPU5 5 130.5H 97.51% [idle{idle: cpu5}]
11 root 155 ki31 0 192K CPU0 0 138.3H 95.78% [idle{idle: cpu0}]
11 root 155 ki31 0 192K CPU8 8 137.7H 95.25% [idle{idle: cpu8}]
11 root 155 ki31 0 192K CPU6 6 138.7H 95.20% [idle{idle: cpu6}]
11 root 155 ki31 0 192K CPU4 4 138.4H 94.26% [idle{idle: cpu4}]
22772 root 82 0 15M 6772K CPU7 7 0:04 93.83% iperf3 -c 192.168.1.31
11 root 155 ki31 0 192K RUN 7 129.4H 68.75% [idle{idle: cpu7}]
11 root 155 ki31 0 192K RUN 11 126.8H 46.12% [idle{idle: cpu11}]
0 root -76 - 0 848K - 4 277:00 5.12% [kernel{if_io_tqg_4}]
12 root -60 - 0 400K WAIT 11 449:21 5.02% [intr{swi4: clock (0)}]
0 root -76 - 0 848K - 8 317:40 3.81% [kernel{if_io_tqg_8}]
0 root -76 - 0 848K - 0 272:13 2.71% [kernel{if_io_tqg_0}]
I occasionally see flowd_aggregate.py pop up to 100% but it doesn't seem consistent or relevant to when iperf3 is running:
# top -aSH
last pid: 99781; load averages: 1.15, 0.90, 0.77 up 5+23:47:27 14:22:57
232 threads: 14 running, 193 sleeping, 25 waiting
CPU: 8.5% user, 0.0% nice, 1.6% system, 0.4% interrupt, 89.5% idle
Mem: 1481M Active, 299M Inact, 1656M Wired, 935M Buf, 12G Free
Swap: 8192M Total, 8192M Free
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
43465 root 90 0 33M 25M CPU7 7 7:11 99.82% /usr/local/bin/python3 /usr/local/opnsense/scripts/netflow/flowd_aggregate.py (python3.7)
11 root 155 ki31 0 192K CPU9 9 131.3H 99.80% [idle{idle: cpu9}]
11 root 155 ki31 0 192K CPU3 3 130.8H 99.68% [idle{idle: cpu3}]
11 root 155 ki31 0 192K CPU10 10 138.1H 99.50% [idle{idle: cpu10}]
11 root 155 ki31 0 192K CPU6 6 138.7H 98.53% [idle{idle: cpu6}]
11 root 155 ki31 0 192K RUN 5 130.5H 98.20% [idle{idle: cpu5}]
11 root 155 ki31 0 192K CPU1 1 129.7H 97.97% [idle{idle: cpu1}]
11 root 155 ki31 0 192K CPU11 11 126.8H 96.52% [idle{idle: cpu11}]
11 root 155 ki31 0 192K CPU0 0 138.3H 96.43% [idle{idle: cpu0}]
11 root 155 ki31 0 192K CPU8 8 137.7H 95.95% [idle{idle: cpu8}]
11 root 155 ki31 0 192K CPU2 2 138.3H 95.81% [idle{idle: cpu2}]
11 root 155 ki31 0 192K CPU4 4 138.4H 93.94% [idle{idle: cpu4}]
12 root -60 - 0 400K WAIT 4 449:17 5.10% [intr{swi4: clock (0)}]
0 root -76 - 0 848K - 4 276:55 4.95% [kernel{if_io_tqg_4}]
What is going on here?
To add to this, I re-configured all my VLANs on bge0 (onboard NIC) and moved all my interfaces over to each respective bge0_vlanX interface and re-ran my iperf3 tests.
On my first test, I got the same throughput as with my Intel X520-SR2 NIC:
# iperf3 -c 192.168.1.31
Connecting to host 192.168.1.31, port 5201
[ 5] local 192.168.1.1 port 42455 connected to 192.168.1.31 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 92.0 MBytes 772 Mbits/sec 91 5.70 KBytes
[ 5] 1.00-2.00 sec 91.1 MBytes 764 Mbits/sec 88 145 KBytes
[ 5] 2.00-3.00 sec 86.1 MBytes 722 Mbits/sec 86 836 KBytes
[ 5] 3.00-4.00 sec 92.5 MBytes 776 Mbits/sec 76 589 KBytes
[ 5] 4.00-5.00 sec 107 MBytes 894 Mbits/sec 0 803 KBytes
[ 5] 5.00-6.00 sec 107 MBytes 898 Mbits/sec 2 731 KBytes
[ 5] 6.00-7.00 sec 109 MBytes 914 Mbits/sec 1 658 KBytes
[ 5] 7.00-8.00 sec 110 MBytes 926 Mbits/sec 0 863 KBytes
[ 5] 8.00-9.00 sec 107 MBytes 898 Mbits/sec 2 748 KBytes
[ 5] 9.00-10.00 sec 109 MBytes 918 Mbits/sec 1 663 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 1011 MBytes 848 Mbits/sec 347 sender
[ 5] 0.00-10.32 sec 1010 MBytes 821 Mbits/sec receiver
For reference, I just tested with my MacBook Pro against the same iperf3 server and was able to push 926 Mbps, and I re-tested the QNAP-to-QNAP transfer, which did 9.39 Gbps, to completely rule out an iperf3 server problem.
For the sake of testing because why not, I re-ran iperf3 from my OPNsense server once more and got near gigabit throughput:
# iperf3 -c 192.168.1.31
Connecting to host 192.168.1.31, port 5201
[ 5] local 192.168.1.1 port 8283 connected to 192.168.1.31 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 108 MBytes 906 Mbits/sec 0 792 KBytes
[ 5] 1.00-2.00 sec 111 MBytes 932 Mbits/sec 2 698 KBytes
[ 5] 2.00-3.00 sec 111 MBytes 930 Mbits/sec 1 638 KBytes
[ 5] 3.00-4.00 sec 108 MBytes 905 Mbits/sec 1 585 KBytes
[ 5] 4.00-5.00 sec 111 MBytes 929 Mbits/sec 0 816 KBytes
[ 5] 5.00-6.00 sec 111 MBytes 929 Mbits/sec 1 776 KBytes
[ 5] 6.00-7.00 sec 111 MBytes 928 Mbits/sec 1 725 KBytes
[ 5] 7.00-8.00 sec 108 MBytes 906 Mbits/sec 2 663 KBytes
[ 5] 8.00-9.00 sec 111 MBytes 928 Mbits/sec 2 616 KBytes
[ 5] 9.00-10.00 sec 111 MBytes 928 Mbits/sec 0 837 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 1.07 GBytes 922 Mbits/sec 10 sender
[ 5] 0.00-10.32 sec 1.07 GBytes 892 Mbits/sec receiver
One thing I noticed between the first and second iperf3 test was the "Retr" column of 347 vs 10. I researched what that meant for iperf3 and found this: "It's the number of TCP segments retransmitted. This can happen if TCP segments are lost in the network due to congestion or corruption."
I also noticed during my second iperf3 test that there was now a kernel process using 99.81% CPU:
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
11 root 155 ki31 0 192K CPU3 3 9:02 100.00% [idle{idle: cpu3}]
0 root -92 - 0 848K CPU2 2 0:30 99.81% [kernel{bge0 taskq}]
Additionally, I am not sure "Retr" in itself is a smoking gun, as the QNAP-to-QNAP test that yielded 9.39 Gbps had 2218 retransmits.
The search continues.
I know that the bge driver has problems with OPNsense, but the X520 should deliver fine performance.
I tested these cards with 20.7rc1 and got full wire speed.
I can run these tests again with the latest 20.7.1, but I need to finish some other stuff first.
I know that the Broadcom drivers aren't the best, but I figured it was worth a test. That being said, I just swapped the Intel X520-SR2 for a Chelsio T540-CR, which seems to have excellent FreeBSD support; that family of NICs is frequently recommended.
Here's the results from the Chelsio T540-CR:
# iperf3 -c 192.168.1.31
Connecting to host 192.168.1.31, port 5201
[ 5] local 192.168.1.1 port 19465 connected to 192.168.1.31 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 112 MBytes 943 Mbits/sec 0 8.00 MBytes
[ 5] 1.00-2.00 sec 110 MBytes 924 Mbits/sec 0 8.00 MBytes
[ 5] 2.00-3.00 sec 112 MBytes 939 Mbits/sec 0 8.00 MBytes
[ 5] 3.00-4.00 sec 112 MBytes 941 Mbits/sec 0 8.00 MBytes
[ 5] 4.00-5.00 sec 112 MBytes 941 Mbits/sec 0 8.00 MBytes
[ 5] 5.00-6.00 sec 112 MBytes 939 Mbits/sec 0 8.00 MBytes
[ 5] 6.00-7.00 sec 112 MBytes 940 Mbits/sec 0 8.00 MBytes
[ 5] 7.00-8.00 sec 112 MBytes 938 Mbits/sec 0 8.00 MBytes
[ 5] 8.00-9.00 sec 112 MBytes 940 Mbits/sec 0 8.00 MBytes
[ 5] 9.00-10.00 sec 112 MBytes 940 Mbits/sec 0 8.00 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 1.09 GBytes 939 Mbits/sec 0 sender
[ 5] 0.00-10.32 sec 1.09 GBytes 909 Mbits/sec receiver
Also thought it was interesting there were zero retransmits on the test.
I swapped out the optic on the NIC when I swapped the NIC itself. I will swap the optic on the switch and maybe try a different switch port and fiber patch cable tomorrow, though, I doubt those are the issue.
Unfortunately, it appears that the issue was not my Intel X520-SR2 NIC as the Chelsio T540-CR exhibits the same behavior.
Just a status update:
Swapped optics on the switch side (both have now been switched) and swapped for a new fiber patch cable. Same results. I also re-enabled "Hardware CRC" and "VLAN Hardware Filtering" but left "Hardware TSO" and "Hardware LRO" disabled as I read most drivers are broken for those functions.
I also added this to /boot/loader.conf.local and rebooted:
hw.cxgbe.toecaps_allowed=0
hw.cxgbe.rdmacaps_allowed=0
hw.cxgbe.iscsicaps_allowed=0
hw.cxgbe.fcoecaps_allowed=0
Absolutely zero impact in performance. Tomorrow I think I'll unbox my other PowerEdge R430 and put the original Intel X520-SR2 NIC in it and see if I can duplicate the problem.
I am at a total loss as to what is going on here.
OK, so at the risk of seeming like I am only talking to myself at this point, I think I found the common factor in the poor performance -- it's OPNsense.
I built a fresh new and updated OPNsense 20.7.1 VM on VMware ESXi 6.7U3, imported my configuration backup from my physical server and re-mapped all the interfaces to the new vmx0_vlanX names and things are working, albeit even slower than the physical hardware:
root@opnsense1:~ # iperf3 -c 192.168.1.31
...
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.01 sec 705 MBytes 591 Mbits/sec 0 sender
[ 5] 0.00-10.41 sec 705 MBytes 568 Mbits/sec receiver
Seems pretty awful. So I decided to create two new OPNsense 20.7.1 VMs and configure one as a VLAN trunk and the other as non-trunk, to test whether the problem lay within the VLAN implementation itself:
OPNsense 20.7.1 (amd64)
VLAN and pf Enabled:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 949 MBytes 796 Mbits/sec 0 sender
[ 5] 0.00-10.40 sec 949 MBytes 766 Mbits/sec receiver
VLAN and pf Disabled (pfctl -d):
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.01 sec 1.22 GBytes 1.05 Gbits/sec 0 sender
[ 5] 0.00-10.41 sec 1.22 GBytes 1.01 Gbits/sec receiver
Non-VLAN and pf Enabled:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 854 MBytes 716 Mbits/sec 0 sender
[ 5] 0.00-10.40 sec 854 MBytes 688 Mbits/sec receiver
Non-VLAN and pf Disabled (pfctl -d):
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 983 MBytes 825 Mbits/sec 0 sender
[ 5] 0.00-10.40 sec 983 MBytes 793 Mbits/sec receiver
As you can see, the VLAN-trunk-configured VM had slightly better performance. Perhaps environmental factors caused the differences, as I would expect them to be nearly the same. Even so, I would consider the differences mostly negligible given the link is 10 gigabit. I also tested without pf to see if the difference in throughput was measurable. Both tests show that it is in fact better without pf, though it's kinda pointless to have a network perimeter firewall without it running...
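For anyone repeating the pf on/off comparison, this is roughly how the state can be flipped and verified from the shell between runs:
pfctl -d        # disable pf before a "pf Disabled" run
pfctl -e        # re-enable pf before a "pf Enabled" run
pfctl -s info   # "Status: Enabled/Disabled" confirms which mode you are actually testing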
Next I thought maybe this is just a fluke and all three OPNsense servers just suck on VMware ESXi and dislike the hardware or configuration or maybe my ESX host just can't push traffic. I had a CentOS 8.2.2004 VM already deployed and configured on the same network segment I had been testing on so I loaded up iperf3 on it to see if it was an ESX host/network problem.
CentOS 8.2.2004 (x86_64)
Non-VLAN and firewalld Enabled:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 10.7 GBytes 9.17 Gbits/sec 11 sender
[ 5] 0.00-10.04 sec 10.7 GBytes 9.14 Gbits/sec receiver
Non-VLAN and firewalld Disabled:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 10.8 GBytes 9.32 Gbits/sec 1 sender
[ 5] 0.00-10.04 sec 10.8 GBytes 9.28 Gbits/sec receiver
Tested with firewall on and off just for fun to see how much iptables slowed the Linux test down. As you can see, 9.14 Gbps to 9.32 Gbps on this test. The problem isn't my ESX host or my network.
I then thought it might be a BSD problem. Perhaps something with running inside VMware or the vmxnet3 driver is problematic. I tried to figure out how to install HardenedBSD, but it seemed too difficult as my quick search for an ISO didn't yield much. As such, I used FreeBSD. Hopefully it's close enough!
FreeBSD 12.1 (amd64)
VLAN and pf Disabled (not configured):
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 10.9 GBytes 9.35 Gbits/sec 0 sender
[ 5] 0.00-10.42 sec 10.9 GBytes 8.97 Gbits/sec receiver
Non-VLAN and pf Disabled (not configured):
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 10.9 GBytes 9.36 Gbits/sec 13 sender
[ 5] 0.00-10.21 sec 10.9 GBytes 9.17 Gbits/sec receiver
I figured I hadn't spent enough time already dorking around with this, so why not configure one test VM with VLAN trunking and the other without to see if there are any differences. As you can see, FreeBSD 12.1 pushed the packets fast, VLAN or otherwise. The problem doesn't seem to be vmxnet3/ESXi or FreeBSD related.
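For anyone who wants to reproduce the FreeBSD router VM, the VLAN-trunk variant needs little more than VLAN child interfaces plus IP forwarding; a rough sketch (interface name, tags and addresses are placeholders, not my exact config):
ifconfig vlan2 create vlan 2 vlandev vmx0      # VLAN 2 child interface on the trunk port
ifconfig vlan2 inet 192.168.2.1/24 up
ifconfig vlan24 create vlan 24 vlandev vmx0    # VLAN 24 child interface on the trunk port
ifconfig vlan24 inet 192.168.24.1/24 up
sysctl net.inet.ip.forwarding=1                # turn the VM into an IPv4 router between the segments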
Finally, I came to the conclusion that maybe OPNsense 20.7 is just broken. As such, I loaded up an OPNsense 19.7 test VM and gave it a go.
OPNsense 19.7.10_1 (amd64)
Non-VLAN and pf Enabled:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 1.75 GBytes 1.50 Gbits/sec 0 sender
[ 5] 0.00-10.44 sec 1.75 GBytes 1.44 Gbits/sec receiver
Non-VLAN and pf Disabled (pfctl -d):
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 2.57 GBytes 2.21 Gbits/sec 0 sender
[ 5] 0.00-10.48 sec 2.57 GBytes 2.11 Gbits/sec receiver
Not good. You can see the results of 1.50 Gbps to 2.21 Gbps are measurably better than my test results with OPNsense 20.7, but nowhere near stellar. I was very much over testing at this point, so I opted not to do a VLAN versus non-VLAN comparison. That being said, based on the earlier results, I am sure the difference would have been negligible.
To add to this, as a general observation, whenever the iperf3 test is running on OPNsense, a constant ping of the firewall starts to drop packets like it is choked out and cannot keep up. I did not experience this at all on CentOS or FreeBSD when testing.
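That observation is easy to reproduce: run a constant ping of the firewall address from a LAN host in one window while the iperf3 test runs in another, roughly like this (same addresses as in the tests above):
ping 192.168.1.1                 # window 1: constant ping of the firewall, watch for drops/latency spikes (Ctrl-C to stop)
iperf3 -c 192.168.1.31 -t 30     # window 2: the throughput test running at the same time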
Why is OPNsense so bad at throughput in my tests? If it's not, what am I doing wrong? The commonality amongst these tests seems to be OPNsense, regardless of whether it's 19.7 or 20.7, though the former is better than the latter.
Edit: Because why not at this point. Let's test pfSense!
pfSense 2.4.5 (amd64)
Non-VLAN and pf Enabled:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 3.80 GBytes 3.26 Gbits/sec 67 sender
[ 5] 0.00-10.26 sec 3.80 GBytes 3.18 Gbits/sec receiver
Non-VLAN and pf Disabled (pfctl -d):
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 5.66 GBytes 4.86 Gbits/sec 109 sender
[ 5] 0.00-10.22 sec 5.66 GBytes 4.76 Gbits/sec receiver
pfSense is not stellar, especially considering it is based on FreeBSD 12.1 and I tested FreeBSD 12.1 and got very different (better) results. That being said, both results are much, much faster than any OPNsense test I could push regardless if physical or virtual.
Edit 2: Fixed a typo in my comments where I erroneously used 20.1 instead of 20.7 when referring to editions of OPNsense.
TL;DR: OPNsense seems to be dog slow compared to FreeBSD 12.1 and CentOS 8.2 at raw network throughput. What gives? What am I doing wrong that it can be this huge of a performance gap?
Your testing is amazing -
I have nothing to add (there are actually 2 other threads with this same subject matter - various reasons, but we're slow).
I am posting to let you know there are others and you aren't just talking to yourself.
Quote from: hax0rwax0r on August 28, 2020, 04:08:35 AM
What am I doing wrong that it can be this huge of a performance gap?
The problem is, you are not testing traffic *through* the firewall, you are measuring *against* the firewall.
iperf3 on OPNsense itself performs really badly. Can you test with sender and receiver on different interfaces?
Again, I'm doing regular performance tests with hardware details and I'm always near wirespeed:
https://www.routerperformance.net/opnsense/opnsense-performance-20-1-8/
https://www.routerperformance.net/routers/nexcom-nsa/fujitsu-rx1330/
https://www.routerperformance.net/routers/nexcom-nsa/thomas-krenn-ri1102d/
OK, I upgraded my lab now:
Client1: Ubuntu
FW1: 20.7.1 (Intel(R) Xeon(R) CPU E3-1240 v6 @ 3.70GHz (8 cores))
FW2: 20.7
Client2: Ubuntu
They are directly attached via TwinAx cables and a mix of Intel X520 and Mellanox ConnectX-3.
Client1 is iperf client, Client2 is iperf server:
With IPS enabled, 1 stream:
root@px3:~# iperf3 -p 5000 -f m -V -c 10.2.0.10 -P 1 -t 10 -R
iperf 3.1.3
Linux px3 4.15.18-12-pve #1 SMP PVE 4.15.18-35 (Wed, 13 Mar 2019 08:24:42 +0100) x86_64
Time: Fri, 28 Aug 2020 05:17:13 GMT
Connecting to host 10.2.0.10, port 5000
Reverse mode, remote host 10.2.0.10 is sending
Cookie: px3.1598591833.837625.6814fda03553a5
TCP MSS: 1448 (default)
[ 4] local 10.1.0.10 port 58842 connected to 10.2.0.10 port 5000
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test
[ ID] Interval Transfer Bandwidth
[ 4] 0.00-1.00 sec 159 MBytes 1335 Mbits/sec
[ 4] 1.00-2.00 sec 159 MBytes 1335 Mbits/sec
[ 4] 2.00-3.00 sec 156 MBytes 1308 Mbits/sec
[ 4] 3.00-4.00 sec 156 MBytes 1305 Mbits/sec
[ 4] 4.00-5.00 sec 157 MBytes 1313 Mbits/sec
[ 4] 5.00-6.00 sec 157 MBytes 1315 Mbits/sec
[ 4] 6.00-7.00 sec 156 MBytes 1309 Mbits/sec
[ 4] 7.00-8.00 sec 157 MBytes 1319 Mbits/sec
[ 4] 8.00-9.00 sec 155 MBytes 1298 Mbits/sec
[ 4] 9.00-10.00 sec 155 MBytes 1301 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 1.53 GBytes 1316 Mbits/sec 39 sender
[ 4] 0.00-10.00 sec 1.53 GBytes 1315 Mbits/sec receiver
CPU Utilization: local/receiver 63.0% (8.2%u/54.8%s), remote/sender 0.2% (0.0%u/0.2%s)
iperf Done.
Without IPS, 1 stream:
root@px3:~# iperf3 -p 5000 -f m -V -c 10.2.0.10 -P 1 -t 10 -R
iperf 3.1.3
Linux px3 4.15.18-12-pve #1 SMP PVE 4.15.18-35 (Wed, 13 Mar 2019 08:24:42 +0100) x86_64
Time: Fri, 28 Aug 2020 05:18:46 GMT
Connecting to host 10.2.0.10, port 5000
Reverse mode, remote host 10.2.0.10 is sending
Cookie: px3.1598591926.454562.6f7931ec23f094
TCP MSS: 1448 (default)
[ 4] local 10.1.0.10 port 58846 connected to 10.2.0.10 port 5000
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test
[ ID] Interval Transfer Bandwidth
[ 4] 0.00-1.00 sec 800 MBytes 6708 Mbits/sec
[ 4] 1.00-2.00 sec 816 MBytes 6844 Mbits/sec
[ 4] 2.00-3.00 sec 814 MBytes 6830 Mbits/sec
[ 4] 3.00-4.00 sec 814 MBytes 6829 Mbits/sec
[ 4] 4.00-5.00 sec 816 MBytes 6844 Mbits/sec
[ 4] 5.00-6.00 sec 816 MBytes 6844 Mbits/sec
[ 4] 6.00-7.00 sec 815 MBytes 6840 Mbits/sec
[ 4] 7.00-8.00 sec 816 MBytes 6840 Mbits/sec
[ 4] 8.00-9.00 sec 815 MBytes 6841 Mbits/sec
[ 4] 9.00-10.00 sec 816 MBytes 6841 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 7.95 GBytes 6829 Mbits/sec 36 sender
[ 4] 0.00-10.00 sec 7.95 GBytes 6826 Mbits/sec receiver
CPU Utilization: local/receiver 28.7% (1.2%u/27.5%s), remote/sender 1.2% (0.0%u/1.2%s)
iperf Done.
Without IPS, 10 parallel streams:
[ 4] 3.00-3.90 sec 106 MBytes 992 Mbits/sec
[ 6] 3.00-3.90 sec 105 MBytes 981 Mbits/sec
[ 8] 3.00-3.90 sec 71.7 MBytes 669 Mbits/sec
[ 10] 3.00-3.90 sec 69.8 MBytes 651 Mbits/sec
[ 12] 3.00-3.90 sec 73.6 MBytes 686 Mbits/sec
[ 14] 3.00-3.90 sec 97.8 MBytes 912 Mbits/sec
[ 16] 3.00-3.90 sec 101 MBytes 941 Mbits/sec
[ 18] 3.00-3.90 sec 80.4 MBytes 750 Mbits/sec
[ 20] 3.00-3.90 sec 137 MBytes 1279 Mbits/sec
[ 22] 3.00-3.90 sec 163 MBytes 1523 Mbits/sec
[SUM] 3.00-3.90 sec 1006 MBytes 9383 Mbits/sec
I mean, of course running a parallel test is going to yield better results if the firewall has a multi-core CPU and a single stream is maxing out one core.
The issue I have is that single-threaded throughput is only about 850 Mbps on my non-virtualized hardware. That doesn't seem right to me, but I only know my environment, so I might just be wrong.
And yes, I did test through the firewall before I started doing tests from the firewall. Through the firewall nets me similar single-threaded performance:
[root@client1 ~]# iperf3 -f m -c 192.168.1.31
...
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 973 MBytes 816 Mbits/sec 22 sender
[ 4] 0.00-10.00 sec 970 MBytes 814 Mbits/sec receiver
And, as expected, increased throughput when running in parallel:
[root@client1 ~]# iperf3 -f m -c 192.168.1.31 -P 10
...
[ ID] Interval Transfer Bandwidth Retr
...
[SUM] 0.00-10.00 sec 3.26 GBytes 2798 Mbits/sec 4464 sender
[SUM] 0.00-10.00 sec 3.23 GBytes 2776 Mbits/sec receiver
Can you humor me and run a single threaded test through your hardware and show me the output?
If OPNsense is truly not broken in this release then I guess my CPU core speed isn't enough to achieve what I am looking to do and I need to look on eBay for a faster one. That being said, it appears there are several others reporting degraded performance since upgrading so maybe there is something to my claim.
Edit: I see your single-threaded non-IPS throughput is 6826 Mbps. See, even your single-threaded test absolutely crushes mine. I get that your CPU is @ 3.7 GHz and a v6, but really, almost 7 Gbps versus less than 1 Gbps for me. I have a v3 Xeon with a higher clock rate (maybe 3.2 GHz?) that I can try out tomorrow to see what results I get.
Can you also test with pfSense 2.5.0-dev, since that is based on FreeBSD 12? 2.4.5 runs FreeBSD 11.
Fresh install of OPNsense 20.7 on a Dell T20 (Intel(R) Xeon(R) CPU E3-1225 v3 @ 3.20GHz (4 cores)):
[root@client1 ~]# iperf3 -c 192.168.1.31
...
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 8.29 GBytes 7.12 Gbits/sec 2 sender
[ 4] 0.00-10.00 sec 8.29 GBytes 7.12 Gbits/sec receiver
[root@client1 ~]# iperf3 -c 192.168.1.31 -P 10
...
[ ID] Interval Transfer Bandwidth Retr
[SUM] 0.00-10.01 sec 8.77 GBytes 7.53 Gbits/sec 139 sender
[SUM] 0.00-10.01 sec 8.77 GBytes 7.53 Gbits/sec receiver
It's just hard to believe that an E3-1225 v3 @ 3.2GHz/3.6GHz versus an E5-2620 v3 @ 2.4GHz/3.2GHz makes that much difference for a single-thread test; however, it's clear, the results don't lie. There's either something wrong with my hardware or my install, or the CPU is just too slow to push single-threaded performance past about 850 Mbps.
And you're right about the pfSense version of FreeBSD. I just double-checked the page (https://docs.netgate.com/pfsense/en/latest/releases/versions-of-pfsense-and-freebsd.html) and, in spite of it being clearly marked 2.5.0 TBD, I didn't even pay attention to the fact that it was not the edition I installed.
FreeBSD is known to be less performant than Linux for a single stream, especially with PPPoE, but your problem is weird. Sadly, I have no other hardware to test.
I built a new OPNsense server on my spare Dell PowerEdge R430 server that has the same CPU in it as my one I am currently using.
I can confirm that the problem appears to be my CPU and/or hardware, since the exact same NIC was moved from the Dell PowerEdge T20 (which previously tested at 7.53 Gbps) to this R430 server, and the test results are much lower:
[root@client1 ~]# iperf3 -c 192.168.1.31
...
[ 4] 0.00-10.00 sec 2.13 GBytes 1.83 Gbits/sec 72 sender
[ 4] 0.00-10.00 sec 2.13 GBytes 1.83 Gbits/sec receiver
[root@client1 ~]# iperf3 -c 192.168.1.31 -P 10
...
[SUM] 0.00-10.00 sec 4.78 GBytes 4.10 Gbits/sec 1143 sender
[SUM] 0.00-10.00 sec 4.75 GBytes 4.08 Gbits/sec receiver
One observation: on like-for-like hardware, this new R430 is pushing more than double the throughput on the single-thread test, and more than a gigabit more on the parallel test, compared to the R430 I have been having problems with. No idea why this is.
I guess I have a decision to make about buying a new CPU or a new server.
OK, back to basics here. I couldn't leave well enough alone and I did more testing tonight because I just couldn't believe that my CPU couldn't even do single threaded gigabit. Here's my test scenario:
Test Scenario 1:
- Physical Linux Server (CentOS 7) on VLAN 2 (iperf3 client)
- Virtual Linux Server (CentOS 7) on VLAN 24 (iperf3 server)
- Dell PowerEdge R430 w/Intel X520-SR2 and HardenedBSD 12-STABLE (BUILD-LATEST 2020-08-31)
Single Threaded:
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 1.00 GBytes 863 Mbits/sec 0 sender
[ 4] 0.00-10.00 sec 1.00 GBytes 860 Mbits/sec receiver
6 Parallel Threads:
[ ID] Interval Transfer Bandwidth Retr
[SUM] 0.00-10.00 sec 2.23 GBytes 1.91 Gbits/sec 938 sender
[SUM] 0.00-10.00 sec 2.22 GBytes 1.90 Gbits/sec receiver
Notice a common theme here with the ~850 Mbps single threaded test. It's pretty close to what I get with OPNsense. Note this is THROUGH the firewall and not from the firewall. Also note my system did have IPv6 addresses from my ISP on each of the interfaces, though, I was only testing IPv4 traffic.
Test Scenario 2:
- Physical Linux Server (CentOS 7) on VLAN 2 (iperf3 client)
- Virtual Linux Server (CentOS 7) on VLAN 24 (iperf3 server)
- Dell PowerEdge R430 w/Intel X520-SR2 and FreeBSD 12.1-RELEASE
Single Threaded:
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 9.75 GBytes 8.38 Gbits/sec 573 sender
[ 4] 0.00-10.00 sec 9.75 GBytes 8.38 Gbits/sec receiver
6 Parallel Threads:
[ ID] Interval Transfer Bandwidth Retr
[SUM] 0.00-10.00 sec 10.5 GBytes 9.05 Gbits/sec 3607 sender
[SUM] 0.00-10.00 sec 10.5 GBytes 9.04 Gbits/sec receiver
I couldn't believe my eyes as I had to do a triple check that it was in fact pushing 8.38 Gbps THROUGH the FreeBSD 12.1 server and it wasn't taking some magical alternate path somehow. It was, in fact, going through the FreeBSD router. As you can see, parallel test is about 1 Gbps less than wire speed. Excellent! Also note my system did have IPv6 addresses from my ISP on each of the interfaces, though, I was only testing IPv4 traffic.
I thought I would enable pf on the FreeBSD 12.1 router to see how that affected performance. I am not sure how much adding rules impacts throughput, but I did notice a measurable drop in the single-thread test (6.23 Gbps), while the drop in the parallel-thread test was negligible (8.94 Gbps).
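For anyone repeating that step, enabling pf on a stock FreeBSD 12.1 router takes little more than a minimal pass-all ruleset; a rough sketch (not necessarily the exact ruleset I used):
sysrc pf_enable=YES                        # have pf start at boot
echo 'pass all keep state' > /etc/pf.conf  # deliberately trivial stateful pass-all ruleset
service pf start                           # loads /etc/pf.conf and enables pf
pfctl -s info                              # confirm "Status: Enabled" and watch the state counters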
As of right now, it seems very strange to me that HardenedBSD exhibits the same low single-threaded throughput as OPNsense, and likewise lower parallel-thread throughput, compared to FreeBSD.
I am willing to accept that I am not accounting for something here; however, given near wire-speed throughput on the exact same hardware under FreeBSD versus HardenedBSD, it seems to me something is very different in HardenedBSD.
What are your thoughts?
I am seeing very slow throughput on pfSense as well, using iperf3 online.
Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
16 CPUs: 2 package(s) x 8 core(s)
AES-NI CPU Crypto: Yes (inactive)
Using Suricata and can't get more than 200 Mbps... pretty annoying.
OK, so we have an upstream problem with FreeBSD and some chance of getting it fixed in the next few months.
So the interim options for now are to:
a) go back to 20.1
b) disable netmap (IPS/Sensei)
c) accept the lowered performance
I had a talk with Franco yesterday; there are some promising patches waiting and we definitely need testers, so if you are not going back to 20.1, that would be a fine option.
Wasn't the problem OPN/pfSense rather than FreeBSD? Didn't the 10 Gbit tests show wire speed on a FreeBSD machine using pf?
No, OPNsense 20.7 and pfSense 2.5 use FreeBSD 12.X; 20.1 and pfSense 2.4 use FreeBSD 11.X.
With FreeBSD 12 the interface/networking stack was changed to iflib, which has known problems with netmap; people are already working on it.
@minimugmail
Quote from: hax0rwax0r on September 02, 2020, 07:34:01 AM
...
@hax0rwax0r
Try to repeat the FreeBSD 12.1-RELEASE test with our kernel instead of the stock one. I don't expect any differences.
https://pkg.opnsense.org/FreeBSD:12:amd64/20.7/sets/kernel-20.7.2-amd64.txz
Cheers,
Franco
Details matter (a lot) in these cases; we haven't seen huge differences on our end (apart from netmap issues with certain cards, which we don't ship ourselves). That being said, IPS is a feature that really stresses your hardware; quite a few setups are not able to do more than the 200 Mbps mentioned in this thread.
Please be advised that HardenedBSD 12-STABLE isn't the same as OPNsense 20.7; the differences between OPNsense 20.7 src and FreeBSD are a bit smaller, but if you're convinced your issue lies with HardenedBSD's additions it might be a good starting point (and a plain install has fewer features enabled).
You can always try to install our kernel on the same FreeBSD install which worked without issues (as Franco suggested); it could make reproducing the steps easier.
If you want to compare between HBSD and FBSD anyway, always make sure you're comparing apples with apples: check interface settings, build options and tunables (sysctl -a). Testing between interfaces (not VLANs on the same one) is probably easier, so you know for sure traffic only flows through the physical interface once.
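A quick way to do that tunables comparison is to dump sysctl on both installs and diff the results (file names are arbitrary):
sysctl -a | sort > /tmp/sysctl-fbsd.txt   # on the stock FreeBSD box
sysctl -a | sort > /tmp/sysctl-opn.txt    # on the OPNsense/HBSD box, then copy one file over
diff -u /tmp/sysctl-fbsd.txt /tmp/sysctl-opn.txt | less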
In case someone would like to reproduce your test, make sure to document step by step how one could do that (including network segments used).
Best regards,
Ad
Quote from: Supermule on September 02, 2020, 11:12:45 AM
@minimugmail
Quote from: hax0rwax0r on September 02, 2020, 07:34:01 AM
...
I get the same values as posted before with 20.7 on Supermicro hardware with a Xeon and X520. It's something in your hardware.
I am not super familiar with FreeBSD, so how would I go about swapping your kernel in for the stock FreeBSD 12.1 one I am running? I searched around on Google and found how to build a custom kernel from source, but this txz file you linked appears to be already compiled, so I don't think that's what I want to do.
I also found reference to pkg-static to install locally downloaded packages but wanted to get some initial guidance before totally hosing this up.
This should also be the same kernel that gets installed with the latest 20.7.2.
Oh, I guess I misunderstood franco's instructions. I had thought he was asking me to drop the linked 20.7.2 kernel in place on top of my FreeBSD 12.1 install, which is why I was asking how exactly to do that.
I think with your clarification and re-reading the post, franco was just asking me to try an install of 20.7.2, which happens to be running that kernel, and re-run my tests to see if it improves.
If that's the case, I will try and report back my findings with OPNsense 20.7.2.
No, I did mean FreeBSD 12.1 with our kernel. All the networking is in the kernel, so we will see if this is OPNsense vs. HBSD vs. FBSD or some sort of tweaking effort.
# fetch https://pkg.opnsense.org/FreeBSD:12:amd64/20.7/sets/kernel-20.7.2-amd64.txz
# mv /boot/kernel /boot/kernel.old
# tar -C / -xf kernel-20.7.2-amd64.txz
# kldxref /boot/kernel
It should have a new /boot/kernel now and a reboot should activate it. You can compare build info after the system is back up.
# uname -rv
12.1-RELEASE-p8-HBSD FreeBSD 12.1-RELEASE-p8-HBSD #0 b3665671c4d(stable/20.7)-dirty: Thu Aug 27 05:58:53 CEST 2020 root@sensey64:/usr/obj/usr/src/amd64.amd64/sys/SMP
Cheers,
Franco
OK here are the test results as you requested:
FreeBSD 12.1 (pf enabled):
[root@fbsd1 ~]# uname -rv
12.1-RELEASE FreeBSD 12.1-RELEASE r354233 GENERIC
[root@fbsd1 ~]# top -aSH
last pid: 2954; load averages: 0.44, 0.42, 0.41 up 0+01:38:55 20:13:46
132 threads: 10 running, 104 sleeping, 18 waiting
CPU: 0.0% user, 0.0% nice, 19.7% system, 5.2% interrupt, 75.1% idle
Mem: 10M Active, 6100K Inact, 271M Wired, 21M Buf, 39G Free
Swap: 3968M Total, 3968M Free
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
11 root 155 ki31 0 96K RUN 5 94:58 95.25% [idle{idle: cpu5}]
11 root 155 ki31 0 96K CPU1 1 93:26 83.69% [idle{idle: cpu1}]
11 root 155 ki31 0 96K RUN 0 94:44 73.68% [idle{idle: cpu0}]
11 root 155 ki31 0 96K CPU4 4 93:15 72.51% [idle{idle: cpu4}]
11 root 155 ki31 0 96K CPU3 3 93:36 64.80% [idle{idle: cpu3}]
11 root 155 ki31 0 96K RUN 2 92:55 62.29% [idle{idle: cpu2}]
0 root -76 - 0 480K CPU2 2 0:05 34.76% [kernel{if_io_tqg_2}]
0 root -76 - 0 480K CPU3 3 0:14 33.49% [kernel{if_io_tqg_3}]
12 root -52 - 0 304K CPU0 0 26:23 29.62% [intr{swi6: task queue}]
0 root -76 - 0 480K - 4 0:05 23.31% [kernel{if_io_tqg_4}]
0 root -76 - 0 480K - 0 0:05 12.31% [kernel{if_io_tqg_0}]
0 root -76 - 0 480K - 1 0:04 10.01% [kernel{if_io_tqg_1}]
12 root -88 - 0 304K WAIT 5 3:55 2.28% [intr{irq264: mfi0}]
0 root -76 - 0 480K - 5 0:06 1.88% [kernel{if_io_tqg_5}]
2954 root 20 0 13M 3676K CPU5 5 0:00 0.02% top -aSH
12 root -60 - 0 304K WAIT 0 0:01 0.01% [intr{swi4: clock (0)}]
0 root -76 - 0 480K - 4 0:02 0.01% [kernel{if_config_tqg_0}]
Single Thread:
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 8.45 GBytes 7.26 Gbits/sec 802 sender
[ 4] 0.00-10.00 sec 8.45 GBytes 7.26 Gbits/sec receiver
10 Threads:
[ ID] Interval Transfer Bandwidth Retr
[SUM] 0.00-10.00 sec 9.85 GBytes 8.46 Gbits/sec 2991 sender
[SUM] 0.00-10.00 sec 9.83 GBytes 8.45 Gbits/sec receiver
FreeBSD 12.1 with OPNsense Kernel (pf enabled):
[root@fbsd1 ~]# uname -rv
12.1-RELEASE FreeBSD 12.1-RELEASE r354233 GENERIC
[root@fbsd1 ~]# fetch https://pkg.opnsense.org/FreeBSD:12:amd64/20.7/sets/kernel-20.7.2-amd64.txz
[root@fbsd1 ~]# mv /boot/kernel /boot/kernel.old
[root@fbsd1 ~]# tar -C / -xf kernel-20.7.2-amd64.txz
[root@fbsd1 ~]# kldxref /boot/kernel
[root@fbsd1 ~]# reboot
[root@fbsd1 ~]# uname -rv
12.1-RELEASE-p8-HBSD FreeBSD 12.1-RELEASE-p8-HBSD #0 b3665671c4d(stable/20.7)-dirty: Thu Aug 27 05:58:53 CEST 2020 root@sensey64:/usr/obj/usr/src/amd64.amd64/sys/SMP
[root@fbsd1 ~]# top -aSH
last pid: 43891; load averages: 0.99, 0.49, 0.20 up 0+00:04:28 20:29:24
131 threads: 13 running, 100 sleeping, 18 waiting
CPU: 0.0% user, 0.0% nice, 62.5% system, 3.5% interrupt, 33.9% idle
Mem: 14M Active, 1184K Inact, 270M Wired, 21M Buf, 39G Free
Swap: 3968M Total, 3968M Free
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
0 root -76 - 0 480K CPU3 3 0:08 81.27% [kernel{if_io_tqg_3}]
0 root -76 - 0 480K CPU1 1 0:09 74.39% [kernel{if_io_tqg_1}]
0 root -76 - 0 480K CPU5 5 0:08 73.20% [kernel{if_io_tqg_5}]
0 root -76 - 0 480K CPU0 0 0:21 71.79% [kernel{if_io_tqg_0}]
11 root 155 ki31 0 96K RUN 4 4:09 54.15% [idle{idle: cpu4}]
11 root 155 ki31 0 96K RUN 2 4:09 51.30% [idle{idle: cpu2}]
0 root -76 - 0 480K CPU2 2 0:05 40.10% [kernel{if_io_tqg_2}]
0 root -76 - 0 480K - 4 0:09 37.60% [kernel{if_io_tqg_4}]
11 root 155 ki31 0 96K RUN 0 4:03 26.48% [idle{idle: cpu0}]
11 root 155 ki31 0 96K RUN 5 4:14 25.87% [idle{idle: cpu5}]
11 root 155 ki31 0 96K RUN 1 4:09 24.32% [idle{idle: cpu1}]
12 root -52 - 0 304K RUN 2 1:12 20.63% [intr{swi6: task queue}]
11 root 155 ki31 0 96K CPU3 3 4:00 17.30% [idle{idle: cpu3}]
12 root -88 - 0 304K WAIT 5 0:10 1.47% [intr{irq264: mfi0}]
43891 root 20 0 13M 3660K CPU4 4 0:00 0.03% top -aSH
21 root -16 - 0 16K - 4 0:00 0.02% [rand_harvestq]
12 root -60 - 0 304K WAIT 1 0:00 0.02% [intr{swi4: clock (0)}]
Single Thread:
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 2.89 GBytes 2.48 Gbits/sec 0 sender
[ 4] 0.00-10.00 sec 2.89 GBytes 2.48 Gbits/sec receiver
10 Threads:
[ ID] Interval Transfer Bandwidth Retr
[SUM] 0.00-10.00 sec 8.16 GBytes 7.01 Gbits/sec 4260 sender
[SUM] 0.00-10.00 sec 8.13 GBytes 6.98 Gbits/sec receiver
I included the "top -aSH" output again because my general observation between OPNsense kernel and FreeBSD 12.1 stock kernel is the "[kernel{if_io_tqg_X}]" process usage. Even on an actual OPNsense 20.7.2 installation I notice the exact same behavior of the "[kernel{if_io_tqg_X}]" being consistently higher and throughput significantly slower, specifically on single threaded tests. Note that both of the top outputs were only from the 10 thread count tests only as I did not think to capture them during the single threaded test.
I can't help but think that whatever high "[kernel{if_io_tqg_X}]" on the OPNsense kernel means is starving the system of throughput potential.
Thoughts? Next steps I can run and provide results from?
Just wanted to post here due to the excellent testing from OP and to corroborate the numbers that OP is seeing.
My testing setup is as follows:
ESXi 6.7u3, host has an E3 1220v3 and 32GB of RAM
All Firewall VMs have 2vCPU. 5GB of RAM allocated to OPNsense.
VMXnet3 NICs negotiated at 10gbps
In pfSense and OPNsense, I disabled all of the hardware offloading features. I am using client and server VMs on the WAN and LAN sides of the firewall VMs. This means I am pushing/pulling traffic through the firewalls; I am not running iperf directly on any of the firewalls themselves. Because I am doing this on a single ESXi host and the traffic stays within the same host/vSwitch, the traffic is never routed to my physical network switch and therefore I can test higher throughput.
pfSense and OPNsense were both out of the box installs with their default rulesets. I did not add any packages or make any config changes outside of making sure that all hardware offloading was disabled. All iperf3 tests were run with the LAN side client pulling traffic through the WAN side interface, to simulate a large download. However, if I perform upload tests, my throughput results are the same. All iperf3 tests were run for 60 seconds and used the default MTU of 1500. The results below show the average of the 60 second runs. I ran each test twice, and used the final result to allow the firewalls to "warm up" and stabilize with their throughput during testing.
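Concretely, each run looked roughly like the following, with the WAN-side server address as an example value; -R makes the LAN-side client pull traffic through the firewall, i.e. a simulated download:
iperf3 -s                           # on the WAN-side server VM
iperf3 -c 203.0.113.10 -t 60 -R     # on the LAN-side client VM: 60-second run, reversed direction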
pfSense 2.4.5p1 1500MTU receiving from WAN, vmx3 NICs, all hardware offloading disabled, default ruleset
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 31.5 GBytes 4.50 Gbits/sec 11715 sender
[ 5] 0.00-60.00 sec 31.5 GBytes 4.50 Gbits/sec receiver
OpenWRT 19.07.3 1500MTU receiving from WAN, vmx3 NICs, default ruleset
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 47.5 GBytes 6.81 Gbits/sec 44252 sender
[ 5] 0.00-60.00 sec 47.5 GBytes 6.81 Gbits/sec receiver
OPNsense 20.7.2 1500MTU receiving from WAN, vmx3 NICs, all hardware offloading disabled, default ruleset
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 6.83 GBytes 977 Mbits/sec 459 sender
[ 5] 0.00-60.00 sec 6.82 GBytes 977 Mbits/sec receiver
I also notice that while a throughput test is running on OPNsense, one of the vCPUs is completely consumed. I did not see this behavior with Linux or pfSense in my testing; the attached screenshot shows the CPU usage I'm seeing while the iperf3 test is running.
Hi, Newbie here.
I also noticed this problem with OPNsense 20.7.2, which was released recently. I get only about 450 Mbps in my LAN when no one is using it besides me (I disconnected every downlink device). I used iperf3 on Windows to check it out.
PS E:\Util> .\iperf3.exe -c 192.168.10.8 -p 26574
Connecting to host 192.168.10.8, port 26574
[ 4] local 192.168.12.4 port 50173 connected to 192.168.10.8 port 26574
[ ID] Interval Transfer Bandwidth
[ 4] 0.00-1.00 sec 49.1 MBytes 412 Mbits/sec
[ 4] 1.00-2.00 sec 52.5 MBytes 440 Mbits/sec
[ 4] 2.00-3.00 sec 51.8 MBytes 434 Mbits/sec
[ 4] 3.00-4.00 sec 52.4 MBytes 439 Mbits/sec
[ 4] 4.00-5.00 sec 52.1 MBytes 438 Mbits/sec
[ 4] 5.00-6.00 sec 52.6 MBytes 441 Mbits/sec
[ 4] 6.00-7.00 sec 52.4 MBytes 440 Mbits/sec
[ 4] 7.00-8.00 sec 46.4 MBytes 389 Mbits/sec
[ 4] 8.00-9.00 sec 49.0 MBytes 411 Mbits/sec
[ 4] 9.00-10.00 sec 51.6 MBytes 433 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth
[ 4] 0.00-10.00 sec 510 MBytes 428 Mbits/sec sender
[ 4] 0.00-10.00 sec 510 MBytes 428 Mbits/sec receiver
My hardware is an AMD Ryzen 7 2700 with 16 GB of RAM. Ethernet is an Intel i350-T2 gigabit NIC.
Quote from: hax0rwax0r on September 02, 2020, 10:40:50 PM
...
My first thought was maybe shared forwarding, but you have this with pfSense 2.5 too, correct?
OK, iflib, so it's 12.X-related only, but strange that it doesn't happen on vanilla 12.1:
https://forums.freebsd.org/threads/what-is-kernel-if_io_tqg-100-load-of-core.70642/
Do you still test with this hardware?
Dell T20 (Intel(R) Xeon(R) CPU E3-1225 v3 @ 3.20GHz (4 cores))
Quote from: mimugmail on September 03, 2020, 06:15:29 AM
My first thought was maybe shared forwarding, but you have this with pfsense 2.5 too, correct?
I have never tested pfSense 2.5. As you had previously pointed out, my test was pfSense 2.4 which was FreeBSD 11.3 based. I mistakenly looked at the version history page and mentioned it was FreeBSD 12.1 but we determined I was incorrect in my statement.
Quote from: mimugmail on September 03, 2020, 12:26:17 PM
Ok, iflib, so it's related to 12.X-only, but strange it doesn't happen to vanilla 12.1
https://forums.freebsd.org/threads/what-is-kernel-if_io_tqg-100-load-of-core.70642/
Yeah, I saw that forum post when I was Googling around, too. I don't know what is different between vanilla FreeBSD 12.1 and the OPNsense 20.7.2 kernel that causes the higher CPU usage, but it is consistent in my testing every single time.
Quote from: mimugmail on September 03, 2020, 12:36:34 PM
Do you still test with this hardware?
Dell T20 (Intel(R) Xeon(R) CPU E3-1225 v3 @ 3.20GHz (4 cores))
No, every single test, with the exception of that one test I did on the Dell T20 to see if more MHz helped, has been on a Dell R430. I have several R430s that are like-for-like, and I even ran different software on each one with consistent results, to rule out a bad X520 NIC or something similar. The results followed the OS/kernel installed regardless of which R430 I ran it on, so I am fairly confident in my hardware.
Quote from: mimugmail on September 03, 2020, 06:15:29 AM
My first thought was maybe shared forwarding, but you have this with pfsense 2.5 too, correct?
I tried this with the recent build of pfSense 2.5 Development (built 9/2/2020) and was able to get around 2.0 Gbits/sec using the same test scenario that I posted about yesterday. So it is still lower throughput than pfSense 2.4.x running on FreeBSD 11.2 in the same test scenario; however, it's still higher than what we're seeing with the OPNsense 20.7 series running the 12.x kernel.
Just for the record: I am also experiencing degraded throughput. LAN routing between different VLANs, with only the firewall enabled (no IPS etc.), is around 550 Mbit/s. The setup is switch -> 1 Gbit trunk -> switch -> 1 Gbit trunk -> OPNsense FW. Low overall traffic.
Overall usage core-wise when loading the FW.
16 cores, but only a few are used. It's like multicore usage in either IDS or pf is limited.
Quote from: Supermule on September 04, 2020, 11:55:34 AM
Overall usage core-wise when loading the FW.
16 cores, but only a few are used. It's like multicore usage in either IDS or pf is limited.
One stream can only be handled by one core; this was the case in 20.1 and still is in 20.7 :)
A quick follow-up. I am routing about 20 VLANs. I read a lot about performance tuning, and in one post the captive portal's performance impact was mentioned. I recently changed my WiFi setup and at some point had tried the captive portal function for a guest VLAN. So I gave it a shot and disabled the captive portal (it was active for one VLAN). I could not believe my eyes when I tested the throughput again.
captive portal enabled for one vlan:
530 Mbit/s
captive portal disabled:
910 Mbit/s
The captive portal uses shared forwarding, which sends every packet to ipfw; I'd guess 20.1 has the same problem.
Uh no, features decrease throughput. Where have I seen this before? Maybe in every industry firewall spec sheet... ;)
This thread is slowly degrading and losing focus. I can't say there aren't any penalties in using the software, but if we only focus on how much better others are we run the risk of not having an objective discussion: is your OPNsense too slow? The easiest fix is to get the hardware that performs well enough. There's already money saved from the lack of licensing.
Performance will likely increase over time in the default releases if we can identify the actual differences in configuration.
Cheers,
Franco
First of all, I like OPNsense and I am an absolute supporter; my comment was meant to be absolutely constructive... I personally wasn't aware that a rather simple-looking feature could have a nearly 50% performance impact, and I have a feeling I can't be the only one, so I just wanted to share the information.
Shaper and captive portal require enabling the second firewall (ipfw) in tandem with the normal one (pf). Both are nice firewalls, but most features come from pf historically, while others are better suited for ipfw or are only available there.
I just think we should talk about raw throughput here with minimum configuration to make results comparable between operating systems. The more configuration and features come into play it becomes less and less possible to derive meaningful results.
Cheers,
Franco
I stumbled across this thread after having the same issues as the OP with 20.7 and I'd done much of the same types of troubleshooting. Unless I missed it, I didn't see any kind of conclusion. I've read various things about issues with some nic drivers using iflib, but I haven't been able to nail anything down. For example, this post about a new netmap kernel: https://forum.opnsense.org/index.php?topic=19175.0
Though I don't know if that would even apply here since I'm not using Sensei or Suricata. I am using the vmxnet3 driver on ESXi 7 and can't get more than 1 Gb/sec through a new install of OPNsense. No traffic shaping or anything, and all test VMs (and OPNsense) are on the same vSwitch. Even just going between a test VM and an OPNsense LAN interface I am stuck at 1 Gb. I can at least get 4 Gb/sec using pfSense 2.4.x. I haven't tried older versions of OPNsense.
The OPNsense roadmap says "Fix stability and reliability issues with regard to vmx(4), vtnet(4), ixl(4), ix(4) and em(4) ethernet drivers." I guess I'm trying to find out if there are specific bugs or issues called out that this refers to. If the issue I'm seeing is already identified, great.
It's under investigation; 20.7.4 may bring an already fixed kernel.
Very interesting discussion here regarding degraded performance with OPNsense. Roughly one month ago I noticed degraded performance in SMB transfers between my own server and clients. At first I suspected the server itself as the performance bottleneck, due to a kernel upgrade a short time before. I am not sure whether this issue also correlates with an update of the OPNsense firewall that I performed in the meantime. I investigated a little and found some discussions regarding issues with the server network card (Intel I219-LM) and Linux. But after buying a low-priced USB network adapter (Realtek chipset) for testing, I got the same poor performance results.
My next steps are to investigate the whole network and OPNsense this upcoming weekend (if the weather permits ⛈ 🌩 ...). So this discussion is a very interesting starting point for me and my investigation.
Here some details regarding my Opnsense (20.7.3):
- Mainboard: Supermicro A2SDi-4C-HLN4F (Link to specs (https://www.supermicro.com/en/products/motherboard/A2SDi-4C-HLN4F))
- RAM: 8GB
- Network performance (past): around 900MBit/s (SMB transfer across two subnets)
- Network performance (now): around 200MBit/s (SMB transfer across two subnets)
I tried re-running these tests with OPNsense 20.7.3 and also tried the netmap kernel. For my particular case, this did not result in a change in throughput.
I'll recap my environment:
HP Server ML10v2/Xeon E3 1220v3/32GB of RAM
VM configurations:
Each pfSense and OPNsense VM has 2vCPU/4GB RAM/VMX3 NICs
Each pfSense and OPNsense VM has default settings and all hardware offloading disabled
The OPNsense netmap kernel was tested by doing the following:
opnsense-update -kr 20.7.3-netmap
reboot
When running these iperf3 tests, each test was run for 60 seconds; all tests were run twice and the last result is recorded here, to allow some of the firewalls time to "warm up" to the throughput load. All tests were performed on the same host, and two VMs were used to simulate a WAN/LAN configuration with separate vSwitches. This allows us to push traffic through the firewall, instead of using the firewall as an iperf3 client.
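For reference, a sketch of the client invocation implied by that methodology (addresses are placeholders; -t 60 matches the 60-second runs):

# on the VM behind the WAN interface
iperf3 -s
# on the VM behind the LAN interface, repeated twice per firewall under test
iperf3 -c 203.0.113.10 -t 60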
Below are my results from today:
pfSense 2.5.0Build_10-16-20 1500MTU receiving from WAN, vmx3 NICs, all hardware offloading disabled, default ruleset
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 14.8 GBytes 2.12 Gbits/sec 550 sender
[ 5] 0.00-60.00 sec 14.8 GBytes 2.12 Gbits/sec receiver
pfSense 2.4.5p1 1500MTU receiving from WAN, vmx3 NICs, all hardware offloading disabled, default ruleset
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 29.4 GBytes 4.21 Gbits/sec 12054 sender
[ 5] 0.00-60.00 sec 29.4 GBytes 4.21 Gbits/sec receiver
OpenWRT 19.07.3 1500MTU receiving from WAN, vmx3 NICs, default ruleset
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 44.1 GBytes 6.31 Gbits/sec 40490 sender
[ 5] 0.00-60.00 sec 44.1 GBytes 6.31 Gbits/sec receiver
OPNsense 20.7.3 1500MTU receiving from WAN, vmx3 NICs, all hardware offloading disabled, default ruleset
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 5.39 GBytes 771 Mbits/sec 362 sender
[ 5] 0.00-60.00 sec 5.39 GBytes 771 Mbits/sec receiver
OPNsense 20.7.3(netflow disabled) 1500MTU receiving from WAN, vmx3 NICs, all hardware offloading disabled, default ruleset
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 6.66 GBytes 953 Mbits/sec 561 sender
[ 5] 0.00-60.00 sec 6.66 GBytes 953 Mbits/sec receiver
OPNsense 20.7.3(netmap kernel) 1500MTU receiving from WAN, vmx3 NICs, all hardware offloading disabled, default ruleset
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 5.35 GBytes 766 Mbits/sec 434 sender
[ 5] 0.00-60.00 sec 5.35 GBytes 766 Mbits/sec receiver
OPNsense 20.7.3(netmap kernel, netflow disabled) 1500MTU receiving from WAN, vmx3 NICs, all hardware offloading disabled, default ruleset
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 6.55 GBytes 937 Mbits/sec 399 sender
[ 5] 0.00-60.00 sec 6.55 GBytes 937 Mbits/sec receiver
It's actually quite interesting to see the performance degradation from pfSense 2.4 to 2.5.
One would think that things were moving forward instead of backwards.
And could it be on purpose, now that TNSR is launched, which is somehow able to route significantly more?
I know it's kernel-dependent, but it's really annoying that the new FreeBSD releases actually perform worse than 10.3 and the OSes that depend on it.
Given the right MTUs you can easily push 7+ Gbit/s on a FW.
I probably should have clarified that. I tested both *sense-based distros just to show that they both take a hit with the FreeBSD 12.x kernel. I don't think this is out of malicious intent from either side, just teething issues due to the new way that the 12.x kernel pushes packets. I'm NOT trying to compare OPNsense to pfSense, I merely wanted to show that they both see a hit moving to 12.x.
There is an upside to all of this. I'm running OPNsense 20.7.3 on bare metal at home with the stock kernel. With the FreeBSD 12.x implementations I no longer need to leave FQ_Codel shaping enabled to get A+ scores on my 500/500 Fiber connection. It seems the way that FreeBSD 12.x handles transfer queues is much more efficient. I'm sure as time moves forward this will all get worked out. I'm posting here mainly just to show what I am seeing, and hopefully we can see the numbers get better as newer kernels are integrated.
Yes, it needs a bigger user base to test and diagnose. I'm sure if pfSense switched there would be faster progress. Currently it's up to the Sensei guys and the 12.1 community.
Do they need a sponsor to make it happen sooner?
No idea, just ask mb via PM
Although we haven't experienced performance issues on the equipment we sell ourselves, quite a bit of the feedback in this thread seems to be related to virtual setups.
Since we had a setup available from the webinar last Thursday, I thought to replicate the simple vmxnet3 test on our end.
Small disclaimer upfront, I'm not a frequent VMWare ESXi user, so I just followed the obvious steps.
Our test machine is really small, not extremely fast, but usable for the purpose (a random desktop which was available).
Machine specs:
Lenovo 10T700AHMH desktop
6 CPUs x Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
8GB Memory
|- OPNsense vm, 2 vcores
|- kali1, 1 vcore
|- kali2, 1 vcore
While going through the VMWare setup, for some reason I wasn't allowed to select VMXNET3, so I edited the .vmx file manually
to make sure all attached interfaces used the correct driver.
ethernetX.virtualDev = "vmxnet3"
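For clarity, a hypothetical excerpt of what the edited .vmx ends up containing for a VM with two adapters (ethernet0/ethernet1 are assumptions; the line is repeated once per attached NIC):

ethernet0.virtualDev = "vmxnet3"
ethernet1.virtualDev = "vmxnet3"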
The clients attached are simple Kali Linux installs, both using their own vSwitch, so traffic is measured from kali 1 to kali 2
using iperf3 (which doesn't really tell a lot about real-world performance, but I didn't have the time or spirit available to set up trex and proper test sets).
[kali1, client] --- vswitch1 --- [OPNsense] --- vswitch2 --- [kali2, server]
192.168.1.100/24 - 192.168.1.1/24,192.168.2.1/24 - 192.168.2.100/24
Before testing, let's establish a baseline: move both Kali Linux machines into the same network and iperf between them.
# iperf3 -c 192.168.2.100 -t 10000
Connecting to host 192.168.2.100, port 5201
[ 5] local 192.168.2.101 port 55240 connected to 192.168.2.100 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 3.34 GBytes 28.7 Gbits/sec 0 1.91 MBytes
[ 5] 1.00-2.00 sec 5.03 GBytes 43.2 Gbits/sec 0 2.93 MBytes
[ 5] 2.00-3.00 sec 5.24 GBytes 45.0 Gbits/sec 0 3.08 MBytes
[ 5] 3.00-4.00 sec 5.18 GBytes 44.5 Gbits/sec 0 3.08 MBytes
[ 5] 4.00-5.00 sec 5.23 GBytes 45.0 Gbits/sec 0 3.08 MBytes
That is the absolute maximum my setup could reach, using Linux with all defaults set... but since we don't use
any offloading features (https://wiki.freebsd.org/10gFreeBSD/Router), it's fairer to check what the performance would be with offloading disabled on
the same setup.
So we disable all offloading, assuming our router/firewall won't use it either.
# ethtool -K eth0 lro off
# ethtool -K eth0 tso off
# ethtool -K eth0 rx off
# ethtool -K eth0 tx off
# ethtool -K eth0 sg off
And test again:
# iperf3 -c 192.168.2.100 -t 10000
Connecting to host 192.168.2.100, port 5201
[ 5] local 192.168.2.101 port 55274 connected to 192.168.2.100 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 1.20 GBytes 10.3 Gbits/sec 0 458 KBytes
[ 5] 1.00-2.00 sec 1.30 GBytes 11.2 Gbits/sec 0 1007 KBytes
[ 5] 2.00-3.00 sec 1.30 GBytes 11.1 Gbits/sec 0 1.18 MBytes
[ 5] 3.00-4.00 sec 1.29 GBytes 11.1 Gbits/sec 0 1.24 MBytes
[ 5] 4.00-5.00 sec 1.30 GBytes 11.2 Gbits/sec 0 1.37 MBytes
[ 5] 5.00-6.00 sec 1.31 GBytes 11.2 Gbits/sec 0 1.43 MBytes
[ 5] 6.00-7.00 sec 1.30 GBytes 11.2 Gbits/sec 0 1.51 MBytes
That keeps about 25% of our original throughput; VMware seems to be very efficient when hardware tasks are pushed back
to the hypervisor.
Now reconnect the kali machines back into their own networks, with OPNsense (20.7.3+new netmap kernel) in between.
The firewall policy is simple, just accept anything, no other features used.
# iperf3 -c 192.168.2.100 -t 10000
Connecting to host 192.168.2.100, port 5201
[ 5] local 192.168.1.100 port 54870 connected to 192.168.2.100 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 280 MBytes 2.35 Gbits/sec 59 393 KBytes
[ 5] 1.00-2.00 sec 281 MBytes 2.35 Gbits/sec 33 383 KBytes
[ 5] 2.00-3.00 sec 279 MBytes 2.34 Gbits/sec 60 379 KBytes
[ 5] 3.00-4.00 sec 275 MBytes 2.31 Gbits/sec 46 380 KBytes
[ 5] 4.00-5.00 sec 276 MBytes 2.32 Gbits/sec 31 387 KBytes
The next step is to check the man page of the vmx driver (man vmx), which shows quite a few sysctl tunables that
don't seem to work anymore on 12.x, probably due to the switch to iflib. One comment, however, seems quite relevant:
Quote
The vmx driver supports multiple transmit and receive queues. Multiple
queues are only supported by certain VMware products, such as ESXi. The
number of queues allocated depends on the presence of MSI-X, the number
of configured CPUs, and the tunables listed below. FreeBSD does not
enable MSI-X support on VMware by default. The
hw.pci.honor_msi_blacklist tunable must be disabled to enable MSI-X
support.
So we go to tunables, disable hw.pci.honor_msi_blacklist (set it to 0) and reboot our machine.
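For anyone following along from the console, a sketch of the same change (assumption: a manual entry in /boot/loader.conf.local works as an alternative to the System > Settings > Tunables page; it is a loader tunable, so a reboot is required):

echo 'hw.pci.honor_msi_blacklist="0"' >> /boot/loader.conf.local
reboot
# after the reboot, verify the tunable and that vmx picked up MSI-X vectors
sysctl hw.pci.honor_msi_blacklist
dmesg | grep -i msi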
Time to test again:
# iperf3 -c 192.168.2.100
Connecting to host 192.168.2.100, port 5201
[ 5] local 192.168.1.100 port 54878 connected to 192.168.2.100 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 350 MBytes 2.93 Gbits/sec 589 304 KBytes
[ 5] 1.00-2.00 sec 342 MBytes 2.87 Gbits/sec 378 337 KBytes
[ 5] 2.00-3.00 sec 342 MBytes 2.87 Gbits/sec 324 298 KBytes
[ 5] 3.00-4.00 sec 343 MBytes 2.88 Gbits/sec 292 301 KBytes
[ 5] 4.00-5.00 sec 345 MBytes 2.89 Gbits/sec 337 307 KBytes
[ 5] 5.00-6.00 sec 341 MBytes 2.86 Gbits/sec 266 301 KBytes
[ 5] 6.00-7.00 sec 341 MBytes 2.86 Gbits/sec 301 311 KBytes
Single flow performance is often a challenge, so to be sure, let's try to push 2 sessions through iperf3
# iperf3 -c 192.168.2.100 -P 2 -t 10000
Connecting to host 192.168.2.100, port 5201
[ 5] local 192.168.1.100 port 54952 connected to 192.168.2.100 port 5201
[ 7] local 192.168.1.100 port 54954 connected to 192.168.2.100 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 261 MBytes 2.19 Gbits/sec 176 281 KBytes
[ 7] 0.00-1.00 sec 245 MBytes 2.05 Gbits/sec 136 342 KBytes
[SUM] 0.00-1.00 sec 506 MBytes 4.24 Gbits/sec 312
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 1.00-2.00 sec 302 MBytes 2.54 Gbits/sec 57 281 KBytes
[ 7] 1.00-2.00 sec 208 MBytes 1.74 Gbits/sec 25 375 KBytes
[SUM] 1.00-2.00 sec 510 MBytes 4.28 Gbits/sec 82
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 2.00-3.00 sec 304 MBytes 2.55 Gbits/sec 45 284 KBytes
[ 7] 2.00-3.00 sec 210 MBytes 1.76 Gbits/sec 9 392 KBytes
[SUM] 2.00-3.00 sec 514 MBytes 4.31 Gbits/sec 54
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 3.00-4.00 sec 304 MBytes 2.55 Gbits/sec 39 386 KBytes
[ 7] 3.00-4.00 sec 209 MBytes 1.75 Gbits/sec 15 331 KBytes
[SUM] 3.00-4.00 sec 512 MBytes 4.30 Gbits/sec 54
^C- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 4.00-4.95 sec 288 MBytes 2.54 Gbits/sec 39 287 KBytes
[ 7] 4.00-4.95 sec 198 MBytes 1.74 Gbits/sec 23 325 KBytes
[SUM] 4.00-4.95 sec 485 MBytes 4.28 Gbits/sec 62
Which is already way better. More sessions don't seem to impact my setup as far as I could see, but that could also
be caused by the number of queues configured (2, see dmesg | grep vmx). In the new iflib world I wasn't able to
increase that number, so I'll leave it at that.
Just for fun, I disabled pf (pfctl -d) to get a bit of insight into how the firewall impacts our performance;
the details of that test are shown below (just for reference).
# iperf3 -c 192.168.2.100 -P 2 -t 10000
Connecting to host 192.168.2.100, port 5201
[ 5] local 192.168.1.100 port 55038 connected to 192.168.2.100 port 5201
[ 7] local 192.168.1.100 port 55040 connected to 192.168.2.100 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 300 MBytes 2.51 Gbits/sec 0 888 KBytes
[ 7] 0.00-1.00 sec 302 MBytes 2.53 Gbits/sec 69 2.18 MBytes
[SUM] 0.00-1.00 sec 601 MBytes 5.04 Gbits/sec 69
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 1.00-2.00 sec 335 MBytes 2.81 Gbits/sec 167 904 KBytes
[ 7] 1.00-2.00 sec 342 MBytes 2.87 Gbits/sec 536 1.67 MBytes
[SUM] 1.00-2.00 sec 678 MBytes 5.68 Gbits/sec 703
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 2.00-3.00 sec 335 MBytes 2.81 Gbits/sec 0 1.12 MBytes
[ 7] 2.00-3.00 sec 342 MBytes 2.87 Gbits/sec 0 1.81 MBytes
[SUM] 2.00-3.00 sec 678 MBytes 5.68 Gbits/sec 0
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 3.00-4.00 sec 332 MBytes 2.79 Gbits/sec 280 1.04 MBytes
[ 7] 3.00-4.00 sec 344 MBytes 2.88 Gbits/sec 482 1.44 MBytes
[SUM] 3.00-4.00 sec 676 MBytes 5.67 Gbits/sec 762
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 4.00-5.00 sec 332 MBytes 2.79 Gbits/sec 206 1017 KBytes
[ 7] 4.00-5.00 sec 338 MBytes 2.83 Gbits/sec 292 1.22 MBytes
[SUM] 4.00-5.00 sec 670 MBytes 5.62 Gbits/sec 498
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 5.00-6.00 sec 331 MBytes 2.78 Gbits/sec 0 1.21 MBytes
[ 7] 5.00-6.00 sec 339 MBytes 2.84 Gbits/sec 0 1.40 MBytes
[SUM] 5.00-6.00 sec 670 MBytes 5.62 Gbits/sec 0
^C- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 6.00-6.60 sec 199 MBytes 2.78 Gbits/sec 0 1.32 MBytes
[ 7] 6.00-6.60 sec 202 MBytes 2.83 Gbits/sec 0 1.50 MBytes
[SUM] 6.00-6.60 sec 401 MBytes 5.61 Gbits/sec 0
- - - - - - - - - - - - - - - - - - - - - - - - -
On physical setups I've seen better numbers, but driver performance and settings may impact the situation (a lot).
While looking into the sysctl settings, I stumbled on this https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=237166
as well.
It explains how to set the receive and send descriptors; for my test it didn't change a lot. Some other iflib setting might,
but I haven't tried.
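For completeness, the knobs that report refers to appear to be the iflib descriptor overrides; a sketch is below. The tunable names come from iflib(4), but the values are examples only and the exact format (a single value vs. a comma-separated list per ring) is driver-dependent, so treat this as something to experiment with rather than a fix:

# set via loader.conf / the Tunables page, then reboot
dev.vmx.0.iflib.override_ntxds="4096"
dev.vmx.0.iflib.override_nrxds="2048"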
Since we haven't seen huge performance degradations on our physical setups,
there's the possibility that default settings have changed in vmx (I haven't looked into that, nor plan to).
Driver quality might have been better pre-iflib, which is always a bit of a risk on FreeBSD after major upgrades, to be honest.
In our experience (on Intel) the situation isn't bad at all after switching to FreeBSD 12.1,
but that's just my personal opinion (based on measurements on our equipment some months ago).
Best regards,
Ad
Nice write-up :)
But 45 gbit/s???
Quote from: Supermule on October 18, 2020, 04:36:04 PM
But 45 gbit/s???
Quote from: AdSchellevis on October 17, 2020, 04:17:10 PM
The clients attached are simple Kali Linux installs, both using their own vSwitch, so traffic is measured from kali 1 to kali 2
:)
They still need drivers and networking as the VSwitch is attached to a network adapter.
Quote from: Gauss23 on October 18, 2020, 04:37:56 PM
Quote from: Supermule on October 18, 2020, 04:36:04 PM
But 45 gbit/s???
Quote from: AdSchellevis on October 17, 2020, 04:17:10 PM
The clients attached are simple Kali Linux installs, both using their own vSwitch, so traffic is measured from kali 1 to kali 2
:)
Your point is? Just for clarification, the 45 Gbps is measured between two Linux (Kali) machines on the same network (vSwitch) using all default optimisations, which is the baseline (the maximum achievable without anything in between) in my case.
I did some tests and noticed that my network suffers from two different problems which interfere with each other. The I219-LM NIC in my server has autonegotiation problems which caused a performance degradation of around 80 percent. I solved this issue by forcing the NIC to 1 Gbit/s full duplex. Now performance tests with iperf3 reach around 980 Mbit/s in direct transfers between client and server, which looks fine.
After I integrated the OPNsense into my setup again, so that the firewall routes the traffic between my server and client subnets, the throughput degraded from 980 Mbit/s to ca. 245 Mbit/s. I should mention that my OPNsense (v20.7.3) runs on bare metal, so a virtualization impact is impossible.
Next steps will be some resource monitoring during the iperf3 tests.
How did you set your NIC to do that?
Quote from: FlightService on October 19, 2020, 10:42:26 AM
How did you set your NIC to do that?
I should clarify that my server is a Linux installation (Debian Buster) which runs on dedicated hardware. There are several discussions regarding issues with Intel NICs and Linux. It doesn't matter whether the server is connected directly to a client or through a switch: the NIC driver often reports a 10 Mbit/s link to the system, although the real throughput was higher than the reported speed. On the Linux machine I used "ethtool <dev> speed 1000" to disable autonegotiation.
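As a side note, a hedged example of pinning the link on the Linux side (modern ethtool uses the -s form for changing settings; eth0 is a placeholder for the real interface name):

ethtool -s eth0 speed 1000 duplex full autoneg off
# confirm what the link is actually running at
ethtool eth0 | grep -E 'Speed|Duplex|Auto-negotiation'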
Quote from: mimugmail on October 13, 2020, 07:20:17 AM
It's under investigation; 20.7.4 may bring an already fixed kernel.
Just to add more info on this topic: vmxnet3 can't handle more than 1 Gbps when traffic-testing OPNsense to Windows (and with reverse-mode iperf) on the same VLAN. It's a big hit since our users frequently access file server and PDM (AutoCAD-like) data that are on different VLANs (and thus all that traffic is forwarded by OPNsense). The whole network is 10 Gbps, including user workstations and ESXi 6.7u3 servers.
We have noticed a big hit on transfer speeds after changing our firewall vendor to OPNsense at that location, and we believe it relates to this vmxnet3 case.
OPNSense VM Specs:
- 4 vCPU - Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
- 8 GB RAM
- vmxnet3 attached to a vSwitch with 2 10Gbp/s - QLogic Corporation NetXtreme II BCM57800(broadcom Dell OEM).
- 10 vlans
OPNSense and Windows server, same vlan, opnsense as gateway of this server vlan:
OPNSENSE to WINDOWS:
iperf3 -c 10.254.win.ip -P 8 -w 128k 5201
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.02 sec 118 MBytes 98.8 Mbits/sec 0 sender
[ 5] 0.00-10.02 sec 118 MBytes 98.8 Mbits/sec receiver
[ 7] 0.00-10.02 sec 116 MBytes 96.8 Mbits/sec 0 sender
[ 7] 0.00-10.02 sec 116 MBytes 96.8 Mbits/sec receiver
[ 9] 0.00-10.02 sec 113 MBytes 94.5 Mbits/sec 0 sender
[ 9] 0.00-10.02 sec 113 MBytes 94.5 Mbits/sec receiver
[ 11] 0.00-10.02 sec 109 MBytes 91.5 Mbits/sec 0 sender
[ 11] 0.00-10.02 sec 109 MBytes 91.5 Mbits/sec receiver
[ 13] 0.00-10.02 sec 107 MBytes 89.7 Mbits/sec 0 sender
[ 13] 0.00-10.02 sec 107 MBytes 89.7 Mbits/sec receiver
[ 15] 0.00-10.02 sec 99.8 MBytes 83.5 Mbits/sec 0 sender
[ 15] 0.00-10.02 sec 99.8 MBytes 83.5 Mbits/sec receiver
[ 17] 0.00-10.02 sec 82.0 MBytes 68.7 Mbits/sec 0 sender
[ 17] 0.00-10.02 sec 82.0 MBytes 68.7 Mbits/sec receiver
[ 19] 0.00-10.02 sec 71.2 MBytes 59.6 Mbits/sec 0 sender
[ 19] 0.00-10.02 sec 71.2 MBytes 59.6 Mbits/sec receiver
[SUM] 0.00-10.02 sec 816 MBytes 683 Mbits/sec 0 sender
[SUM] 0.00-10.02 sec 816 MBytes 683 Mbits/sec receiver
OPNSENSE to WINDOWS (iperf3 reverse mode):
iperf3 -c 10.254.win.ip -P 8 -R -w 128k 5201
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.00 sec 88.4 MBytes 74.1 Mbits/sec sender
[ 5] 0.00-10.00 sec 88.2 MBytes 74.0 Mbits/sec receiver
[ 7] 0.00-10.00 sec 118 MBytes 98.7 Mbits/sec sender
[ 7] 0.00-10.00 sec 117 MBytes 98.5 Mbits/sec receiver
[ 9] 0.00-10.00 sec 91.9 MBytes 77.1 Mbits/sec sender
[ 9] 0.00-10.00 sec 91.7 MBytes 76.9 Mbits/sec receiver
[ 11] 0.00-10.00 sec 91.6 MBytes 76.9 Mbits/sec sender
[ 11] 0.00-10.00 sec 91.5 MBytes 76.7 Mbits/sec receiver
[ 13] 0.00-10.00 sec 92.6 MBytes 77.7 Mbits/sec sender
[ 13] 0.00-10.00 sec 92.4 MBytes 77.5 Mbits/sec receiver
[ 15] 0.00-10.00 sec 94.4 MBytes 79.2 Mbits/sec sender
[ 15] 0.00-10.00 sec 94.2 MBytes 79.0 Mbits/sec receiver
[ 17] 0.00-10.00 sec 100 MBytes 84.3 Mbits/sec sender
[ 17] 0.00-10.00 sec 100 MBytes 84.1 Mbits/sec receiver
[ 19] 0.00-10.00 sec 99.9 MBytes 83.8 Mbits/sec sender
[ 19] 0.00-10.00 sec 99.6 MBytes 83.6 Mbits/sec receiver
[SUM] 0.00-10.00 sec 777 MBytes 652 Mbits/sec sender
[SUM] 0.00-10.00 sec 775 MBytes 650 Mbits/sec receiver
Linux VM Specs:
- 1 vCPU - Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
- 4 GB RAM
- vmxnet3 attached to a vSwitch with 2 10Gbp/s - QLogic Corporation NetXtreme II BCM57800(broadcom Dell OEM).
- vmnet attached to the vm(no visibility on vlan tags)
Linux server and Windows server, same vlan cause they are designated on the "servers vlan":
LINUX TO WINDOWS:
iperf3 -c 10.254.win.ip -P 8 -w 128k 5201
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 1.17 GBytes 1.00 Gbits/sec 128 sender
[ 4] 0.00-10.00 sec 1.17 GBytes 1.00 Gbits/sec receiver
[ 6] 0.00-10.00 sec 275 MBytes 231 Mbits/sec 69 sender
[ 6] 0.00-10.00 sec 275 MBytes 231 Mbits/sec receiver
[ 8] 0.00-10.00 sec 1.12 GBytes 961 Mbits/sec 150 sender
[ 8] 0.00-10.00 sec 1.12 GBytes 961 Mbits/sec receiver
[ 10] 0.00-10.00 sec 1.13 GBytes 972 Mbits/sec 98 sender
[ 10] 0.00-10.00 sec 1.13 GBytes 972 Mbits/sec receiver
[ 12] 0.00-10.00 sec 264 MBytes 222 Mbits/sec 37 sender
[ 12] 0.00-10.00 sec 264 MBytes 222 Mbits/sec receiver
[ 14] 0.00-10.00 sec 1.13 GBytes 973 Mbits/sec 109 sender
[ 14] 0.00-10.00 sec 1.13 GBytes 973 Mbits/sec receiver
[ 16] 0.00-10.00 sec 280 MBytes 235 Mbits/sec 34 sender
[ 16] 0.00-10.00 sec 280 MBytes 235 Mbits/sec receiver
[ 18] 0.00-10.00 sec 246 MBytes 206 Mbits/sec 64 sender
[ 18] 0.00-10.00 sec 246 MBytes 206 Mbits/sec receiver
[SUM] 0.00-10.00 sec 5.59 GBytes 4.81 Gbits/sec 689 sender
[SUM] 0.00-10.00 sec 5.59 GBytes 4.80 Gbits/sec receiver
LINUX TO WINDOWS (reverse-mode iperf): this is where iperf and vmxnet reach their full potential.
iperf3 -c 10.254.win.ip -P 8 -R -w 128k 5201
[ ID] Interval Transfer Bandwidth
[ 4] 0.00-10.00 sec 3.17 GBytes 2.72 Gbits/sec sender
[ 4] 0.00-10.00 sec 3.17 GBytes 2.72 Gbits/sec receiver
[ 6] 0.00-10.00 sec 3.10 GBytes 2.66 Gbits/sec sender
[ 6] 0.00-10.00 sec 3.10 GBytes 2.66 Gbits/sec receiver
[ 8] 0.00-10.00 sec 2.91 GBytes 2.50 Gbits/sec sender
[ 8] 0.00-10.00 sec 2.91 GBytes 2.50 Gbits/sec receiver
[ 10] 0.00-10.00 sec 3.00 GBytes 2.58 Gbits/sec sender
[ 10] 0.00-10.00 sec 3.00 GBytes 2.58 Gbits/sec receiver
[ 12] 0.00-10.00 sec 2.78 GBytes 2.39 Gbits/sec sender
[ 12] 0.00-10.00 sec 2.78 GBytes 2.39 Gbits/sec receiver
[ 14] 0.00-10.00 sec 2.85 GBytes 2.45 Gbits/sec sender
[ 14] 0.00-10.00 sec 2.85 GBytes 2.45 Gbits/sec receiver
[ 16] 0.00-10.00 sec 2.68 GBytes 2.31 Gbits/sec sender
[ 16] 0.00-10.00 sec 2.68 GBytes 2.31 Gbits/sec receiver
[ 18] 0.00-10.00 sec 2.63 GBytes 2.26 Gbits/sec sender
[ 18] 0.00-10.00 sec 2.63 GBytes 2.26 Gbits/sec receiver
[SUM] 0.00-10.00 sec 23.1 GBytes 19.9 Gbits/sec sender
[SUM] 0.00-10.00 sec 23.1 GBytes 19.9 Gbits/sec receiver
I have customers pushing 6Gbit over vmxnet driver.
Quote from: mimugmail on October 19, 2020, 07:38:33 PM
I have customers pushing 6Gbit over vmxnet driver.
OK, and what am I supposed to do with this information? Not trying to be rude, but there are plenty of reports on this topic that go against your scenario.
Do you have any idea what I could tune to achieve better performance, then?
Quote from: nwildner on October 19, 2020, 08:20:36 PM
Quote from: mimugmail on October 19, 2020, 07:38:33 PM
I have customers pushing 6Gbit over vmxnet driver.
OK, and what am I supposed to do with this information? Not trying to be rude, but there are plenty of reports on this topic that go against your scenario.
Do you have any idea what I could tune to achieve better performance, then?
What about this idea?
https://xenomorph.net/freebsd/performance-esxi/
Quote from: Gauss23 on October 19, 2020, 08:37:01 PM
What about this idea?
https://xenomorph.net/freebsd/performance-esxi/
I'll try as soon as our users stop doing transfers at that remote office :)
Nice catch.
Where do you manually edit the rc.conf??
Quote from: Supermule on October 19, 2020, 09:42:48 PM
Where do you manually edit the rc.conf??
There is an option inside the web administration:
Interface > Settings > Hardware LRO > Uncheck it to enable LRO
Quote from: Gauss23 on October 19, 2020, 08:37:01 PM
What about this idea?
https://xenomorph.net/freebsd/performance-esxi/
Well, only enabling LRO didn't change much. The guy who wrote this tutorial is using the same NIC series I'm using, so it was worth trying to enable lro, tso and vlan_hwfilter, and after that things got a lot better.
Still not reaching 10 Gbps, but I could get almost 5 Gbps, which is pretty good:
Only enabling lro:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.17 sec 118 MBytes 97.5 Mbits/sec 0 sender
[ 5] 0.00-10.17 sec 118 MBytes 97.5 Mbits/sec receiver
[ 7] 0.00-10.17 sec 120 MBytes 98.9 Mbits/sec 0 sender
[ 7] 0.00-10.17 sec 120 MBytes 98.9 Mbits/sec receiver
[ 9] 0.00-10.17 sec 120 MBytes 98.8 Mbits/sec 0 sender
[ 9] 0.00-10.17 sec 120 MBytes 98.8 Mbits/sec receiver
[ 11] 0.00-10.17 sec 117 MBytes 96.8 Mbits/sec 0 sender
[ 11] 0.00-10.17 sec 117 MBytes 96.8 Mbits/sec receiver
[ 13] 0.00-10.17 sec 118 MBytes 97.4 Mbits/sec 0 sender
[ 13] 0.00-10.17 sec 118 MBytes 97.4 Mbits/sec receiver
[ 15] 0.00-10.17 sec 119 MBytes 98.0 Mbits/sec 0 sender
[ 15] 0.00-10.17 sec 119 MBytes 98.0 Mbits/sec receiver
[ 17] 0.00-10.17 sec 90.8 MBytes 74.9 Mbits/sec 0 sender
[ 17] 0.00-10.17 sec 90.8 MBytes 74.9 Mbits/sec receiver
[ 19] 0.00-10.17 sec 72.2 MBytes 59.6 Mbits/sec 0 sender
[ 19] 0.00-10.17 sec 72.2 MBytes 59.6 Mbits/sec receiver
[SUM] 0.00-10.17 sec 875 MBytes 722 Mbits/sec 0 sender
[SUM] 0.00-10.17 sec 875 MBytes 722 Mbits/sec receiver
iperf Done.
vmx0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=800428<VLAN_MTU,JUMBO_MTU,LRO>
ether 00:50:56:a5:d3:68
inet6 fe80::250:56ff:fea5:d368%vmx0 prefixlen 64 scopeid 0x1
media: Ethernet autoselect
status: active
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL
lro, tso and vlan_hwfilter enabled:
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.01 sec 1.08 GBytes 929 Mbits/sec 0 sender
[ 5] 0.00-10.01 sec 1.08 GBytes 929 Mbits/sec receiver
[ 7] 0.00-10.01 sec 510 MBytes 427 Mbits/sec 0 sender
[ 7] 0.00-10.01 sec 510 MBytes 427 Mbits/sec receiver
[ 9] 0.00-10.01 sec 1.05 GBytes 903 Mbits/sec 0 sender
[ 9] 0.00-10.01 sec 1.05 GBytes 903 Mbits/sec receiver
[ 11] 0.00-10.01 sec 953 MBytes 799 Mbits/sec 0 sender
[ 11] 0.00-10.01 sec 953 MBytes 799 Mbits/sec receiver
[ 13] 0.00-10.01 sec 447 MBytes 375 Mbits/sec 0 sender
[ 13] 0.00-10.01 sec 447 MBytes 375 Mbits/sec receiver
[ 15] 0.00-10.01 sec 409 MBytes 342 Mbits/sec 0 sender
[ 15] 0.00-10.01 sec 409 MBytes 342 Mbits/sec receiver
[ 17] 0.00-10.01 sec 379 MBytes 318 Mbits/sec 0 sender
[ 17] 0.00-10.01 sec 379 MBytes 318 Mbits/sec receiver
[ 19] 0.00-10.01 sec 825 MBytes 691 Mbits/sec 0 sender
[ 19] 0.00-10.01 sec 825 MBytes 691 Mbits/sec receiver
[SUM] 0.00-10.01 sec 5.57 GBytes 4.78 Gbits/sec 0 sender
[SUM] 0.00-10.01 sec 5.57 GBytes 4.78 Gbits/sec receiver
iperf Done.
root@fw01adb:~ # ifconfig vmx0
vmx0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=8507b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO>
ether 00:50:56:a5:d3:68
inet6 fe80::250:56ff:fea5:d368%vmx0 prefixlen 64 scopeid 0x1
media: Ethernet autoselect
status: active
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
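For reference, the same offload flags can also be toggled temporarily from the OPNsense console; a sketch is below (vmx0 is the interface shown above, the flags are standard ifconfig(8) capability names, and the change does not survive a reboot or an interface reconfiguration, so the web UI settings remain the persistent place to change this):

# enable the three capabilities
ifconfig vmx0 lro tso vlanhwfilter
# and turn them back off
ifconfig vmx0 -lro -tso -vlanhwfilter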
But that's not 5 Gbit/s...
I got better results disabling LRO on the ESXi host.
Quote from: nwildner on October 19, 2020, 08:20:36 PM
Quote from: mimugmail on October 19, 2020, 07:38:33 PM
I have customers pushing 6Gbit over vmxnet driver.
OK, and what am I supposed to do with this information? Not trying to be rude, but there are plenty of reports on this topic that go against your scenario.
Do you have any idea what I could tune to achieve better performance, then?
You wrote that vmxnet can't handle more than one Gb, which is not true. Now when someone googles a similar problem they might think it's a general limitation. I have no idea about hypervisors, but I don't want wrong facts going around.
Quote from: mimugmail on October 20, 2020, 06:03:29 AM
You wrote that vmxnet can't handle more than one Gb, which is not true. Now when someone googles a similar problem they might think it's a general limitation. I have no idea about hypervisors, but I don't want wrong facts going around.
Just read my reports again.
vmxnet3 is not handling more than 1 Gbps on FreeBSD (maybe due to OPNsense-specific patches). I never said vmxnet3 is garbage, and as you can see, Linux is handling the traffic fine. I have other physical machines in different offices and vmxnet3 is just fine with Linux and Windows.
And if you google for solutions, you will find plenty of information (and that also means misinformation). Bugs and other fixes (maybe iflib/vmx related) that
COULD work:
- https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=242070
- https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=236999
UPDATE REPORT: I had to disable lro, tso and vlan_hwfilter again since they made traffic entering that interface horribly slow (7 Mbps max), and that is a regression we could not live with.
Better to have an interface doing 1 Gbps than one that does 4.5 Gbps only one way.
@nwilder: would you be so kind not to keep spreading inaccurate / false information around. We don't use any modifications on the vmx driver, which can do more than 1Gbps at ease on a stock FreeBSD 12.1. LRO shouldn't be used on a router for obvious reasons (also pointed at in my earlier post https://forum.opnsense.org/index.php?topic=18754.msg90576#msg90576).
Quote from: AdSchellevis on October 20, 2020, 12:53:16 PM
@nwilder: would you be so kind not to keep spreading inaccurate / false information around. We don't use any modifications on the vmx driver, which can do more than 1Gbps at ease on a stock FreeBSD 12.1. LRO shouldn't be used on a router for obvious reasons (also pointed at in my earlier post https://forum.opnsense.org/index.php?topic=18754.msg90576#msg90576).
All right. I've tried lro/tso/vlan_hwfilter because I'm running out of options here. I tried all those sysctls from that FreeBSD bug report, and no noticeable performance increase was seen after tuning the tx/rx descriptors. Still the same 800 Mbps limit on transfers whenever OPNsense tries to contact another host.
Other tests I've made:
1 - iperf from one VLAN interface to another, same parent interface (vmx0): I ran another test by putting iperf3 to listen on one VLAN interface (parent vmx0) while binding the client to another VLAN interface (parent also vmx0) on this OPNsense box, and got pretty good forwarding rates:
iperf3 -c 10.254.117.ip -B 10.254.110.ip -P 8 -w 128k 5201
[SUM] 0.00-10.00 sec 8.86 GBytes 7.61 Gbits/sec 0 sender
[SUM] 0.00-10.16 sec 8.86 GBytes 7.49 Gbits/sec receiver
I was just trying to test internal forwarding.
2 - Disabling IPsec and its passthrough-related configs: thinking that IPsec could be throttling the connection through its passthrough tunnels for traffic that comes in/out of the VLAN interfaces, I disabled all IPsec configs, and iperf still got 800 Mbps max from the firewall to the Windows/Linux servers.
3 - Disabling pf: after disabling the IPsec tunnels I disabled pf entirely, did a fresh boot and put OPNsense in router mode. No luck (still the same iperf performance).
4 - Adding VLAN 117 to a new physical vmx interface, letting the hypervisor tag it: I presented a new interface with VLAN 117 tagged by the hypervisor and changed the assignment inside OPNsense ONLY for this specific servers network. The iperf tests kept getting the same speed.
Additional logs: bug id=237166 threw some light on this issue, and I found that MSI-X vectors aren't being handled correctly by VMware (assuming that the MSI-X related issues were resolved on FreeBSD). I'm looking for any documentation that could help me with this case. I'll try to tinker with hw.pci.honor_msi_blacklist=0 in loader.conf to see if I get better performance.
vmx0: <VMware VMXNET3 Ethernet Adapter> port 0x5000-0x500f mem 0xfd4fc000-0xfd4fcfff,0xfd4fd000-0xfd4fdfff,0xfd4fe000-0xfd4fffff irq 19 at device 0.0 on pci4
vmx0: Using 4096 TX descriptors and 2048 RX descriptors
vmx0: Using 4 RX queues 4 TX queues
vmx0: failed to allocate 5 MSI-X vectors, err: 6
vmx0: Using an MSI interrupt
vmx0: Ethernet address: 00:50:56:a5:d3:68
vmx0: netmap queues/slots: TX 1/4096, RX 1/4096
Edit: "hw.pci.honor_msi_blacklist: 0" removed the error form the log, but transfer rates remain the same:
vmx0: <VMware VMXNET3 Ethernet Adapter> port 0x5000-0x500f mem 0xfd4fc000-0xfd4fcfff,0xfd4fd000-0xfd4fdfff,0xfd4fe000-0xfd4fffff irq 19 at device 0.0 on pci4
vmx0: Using 4096 TX descriptors and 2048 RX descriptors
vmx0: Using 4 RX queues 4 TX queues
vmx0: Using MSI-X interrupts with 5 vectors
vmx0: Ethernet address: 00:50:56:a5:d3:68
vmx0: netmap queues/slots: TX 4/4096, RX 4/4096
root@fw01adb:~ # sysctl -a | grep blacklis
vm.page_blacklist:
hw.pci.honor_msi_blacklist: 0
Hope that some of my tests can shed light on this issue. Removing the MSI blacklist option allocated 4 netmap TX/RX queues :)
For those interested: I started a FreeBSD 13-CURRENT VM (2020-oct-08) with a vmxnet3 interface, created one 802.1Q VLAN, and ran some iperf between this guy and a Linux VM and, BOOM! Full performance with 4 parallel streams configured:
[ ID] Interval Transfer Bandwidth Retr
[ 5] 0.00-10.23 sec 2.34 GBytes 1.96 Gbits/sec 0 sender
[ 5] 0.00-10.23 sec 2.34 GBytes 1.96 Gbits/sec receiver
[ 7] 0.00-10.23 sec 2.09 GBytes 1.75 Gbits/sec 0 sender
[ 7] 0.00-10.23 sec 2.09 GBytes 1.75 Gbits/sec receiver
[ 9] 0.00-10.23 sec 1.67 GBytes 1.40 Gbits/sec 0 sender
[ 9] 0.00-10.23 sec 1.67 GBytes 1.40 Gbits/sec receiver
[ 11] 0.00-10.23 sec 1.65 GBytes 1.39 Gbits/sec 0 sender
[ 11] 0.00-10.23 sec 1.65 GBytes 1.39 Gbits/sec receiver
[SUM] 0.00-10.23 sec 7.75 GBytes 6.50 Gbits/sec 0 sender
[SUM] 0.00-10.23 sec 7.75 GBytes 6.50 Gbits/sec receiver
Maybe this is some regression on 12.1.
> How did you do that? [force 1Gbps NIC]
Turn off auto negotiation and set the nic's IF to 1gbps (?)
You guys got me interested in this subject. I have tested plenty of iperf3 against my VMs in my little 3-host homelab, my 10GbE is just a couple DACs connected between the 10Gbe "backbone" IFs of my Dell Powerconnect 7048P, which is really more of a gigabit switch.
Usually the VMs will peg right up to ~9.4Gbps with little fluctuation if nothing else is happening, but I'm recording 3 720p video streams and 6 high-MP (4MP & 8MP) IP cameras right now, and have no interest in stopping any of it for testing right now.
I could have sworn I'd iperfed my OPNsense VM and gotten somewhere around 2.9Gbps vs the 9.4Gbps I got on my Linux, OmniOS or FreeBSD VMs (don't think I tested Windows, iperf3 is compiled weird in Win32 and doesn't yield predictable results). So I expected it to be a bit slower, but not THIS much slower:
OPNsense 20.7.3 to OmniOS r151034
(on separate hosts)
This is a VM w/ 4 vCPU and 8GB ram, run on an E3-1230 v2 home-built Supermicro X9SPU-F host running ESXi 6.7U3. The LAN vNIC is vmxnet3, running open-vm-tools.
root@gateway:/ # uname -a
FreeBSD gateway.webtool.space 12.1-RELEASE-p10-HBSD FreeBSD 12.1-RELEASE-p10-HBSD #0 517e44a00df(stable/20.7)-dirty: Mon Sep 21 16:21:17 CEST 2020 root@sensey64:/usr/obj/usr/src/amd64.amd64/sys/SMP amd64
root@gateway:/ # iperf3 -c 192.168.1.56
Connecting to host 192.168.1.56, port 5201
[ 5] local 192.168.1.1 port 13640 connected to 192.168.1.56 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 125 MBytes 1.05 Gbits/sec 0 2.00 MBytes
[ 5] 1.00-2.00 sec 126 MBytes 1.06 Gbits/sec 0 2.00 MBytes
[ 5] 2.00-3.00 sec 132 MBytes 1.11 Gbits/sec 0 2.00 MBytes
[ 5] 3.00-4.00 sec 131 MBytes 1.10 Gbits/sec 0 2.00 MBytes
[ 5] 4.00-5.00 sec 132 MBytes 1.11 Gbits/sec 0 2.00 MBytes
[ 5] 5.00-6.00 sec 135 MBytes 1.13 Gbits/sec 0 2.00 MBytes
[ 5] 6.00-7.00 sec 138 MBytes 1.16 Gbits/sec 0 2.00 MBytes
[ 5] 7.00-8.00 sec 137 MBytes 1.15 Gbits/sec 0 2.00 MBytes
[ 5] 8.00-9.00 sec 133 MBytes 1.12 Gbits/sec 0 2.00 MBytes
[ 5] 9.00-10.00 sec 131 MBytes 1.10 Gbits/sec 0 2.00 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 1.29 GBytes 1.11 Gbits/sec 0 sender
[ 5] 0.00-10.00 sec 1.29 GBytes 1.11 Gbits/sec receiver
iperf Done.
That is abysmal. Compare that to this Bullseye VM going to same OmniOS VM (also on separate hosts)
Debian Bullseye to OmniOS r151034
avery@debbox:~$ uname -a
Linux debbox 5.4.0-4-amd64 #1 SMP Debian 5.4.19-1 (2020-02-13) x86_64 GNU/Linux
avery@debbox:~$ iperf3 -c 192.168.1.56
Connecting to host 192.168.1.56, port 5201
[ 5] local 192.168.1.39 port 58064 connected to 192.168.1.56 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 688 MBytes 5.77 Gbits/sec 0 2.00 MBytes
[ 5] 1.00-2.00 sec 852 MBytes 7.15 Gbits/sec 0 2.00 MBytes
[ 5] 2.00-3.00 sec 801 MBytes 6.72 Gbits/sec 1825 730 KBytes
[ 5] 3.00-4.00 sec 779 MBytes 6.53 Gbits/sec 33 1.13 MBytes
[ 5] 4.00-5.00 sec 788 MBytes 6.61 Gbits/sec 266 1.33 MBytes
[ 5] 5.00-6.00 sec 828 MBytes 6.94 Gbits/sec 392 1.43 MBytes
[ 5] 6.00-7.00 sec 830 MBytes 6.96 Gbits/sec 477 1.49 MBytes
[ 5] 7.00-8.00 sec 826 MBytes 6.93 Gbits/sec 1286 749 KBytes
[ 5] 8.00-9.00 sec 826 MBytes 6.93 Gbits/sec 0 1.26 MBytes
[ 5] 9.00-10.00 sec 775 MBytes 6.50 Gbits/sec 278 1.38 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 7.81 GBytes 6.71 Gbits/sec 4557 sender
[ 5] 0.00-10.00 sec 7.80 GBytes 6.70 Gbits/sec receiver
iperf Done.
So much better throughput. Even while that OmniOS VM is recording 8-9 streams of video over the network.
I'm going to install a FreeBSD kernel and see what happens. Will be back with more benchmarks.
It is odd that so many of us seem to find an artificial ~1 Gbps limit when testing OPNsense 20.7 on VMware ESXi with vmxnet3 adapters. It looks like there are at least 3 of us able to reproduce these results now?
I've disabled the hardware blacklist and did not see a difference in my test results from what I had posted here prior. The only way I can get a little better throughput is to add more vCPUs to the OPNsense VM; however, this does not scale well. For instance, if I go from 2 vCPU to 4 vCPU, I can start to get between 1.5 Gbps and 2.2 Gbps depending on how much parallelism I select on my iperf clients.
Quote from: opnfwb on October 22, 2020, 05:03:05 AM
It is odd that so many of us seem to find an artificial ~1 Gbps limit when testing OPNsense 20.7 on VMware ESXi with vmxnet3 adapters. It looks like there are at least 3 of us able to reproduce these results now?
I've disabled the hardware blacklist and did not see a difference in my test results from what I had posted here prior. The only way I can get a little better throughput is to add more vCPUs to the OPNsense VM; however, this does not scale well. For instance, if I go from 2 vCPU to 4 vCPU, I can start to get between 1.5 Gbps and 2.2 Gbps depending on how much parallelism I select on my iperf clients.
I don't think it's related to the "hardware" (even though in this case, it's virtual). I think it's the upstream regression mentioned on page 1 - since I used to get better speeds than this before I upgraded. I think I did my last LAN-side iperf3 tests around v18 or 19, and they were at least twice that. In fact, I'm fairly certain I doubled my vCPUs and ram since because I was testing Sensei and never re-configured it for 2 vCPU/4GB after I uninstalled it.
Quote from: opnfwb on October 22, 2020, 05:03:05 AM
It is odd that so many of us seem to find an artificial ~1 Gbps limit when testing OPNsense 20.7 on VMware ESXi with vmxnet3 adapters. It looks like there are at least 3 of us able to reproduce these results now?
I've disabled the hardware blacklist and did not see a difference in my test results from what I had posted here prior. The only way I can get a little better throughput is to add more vCPUs to the OPNsense VM; however, this does not scale well. For instance, if I go from 2 vCPU to 4 vCPU, I can start to get between 1.5 Gbps and 2.2 Gbps depending on how much parallelism I select on my iperf clients.
Be honest with yourself: would you buy a piece of hardware with only 2 cores if you have a requirement for 10G? The smallest hardware with 10G interfaces has 4 cores minimum.
Quote from: mimugmail on October 22, 2020, 07:27:38 AM
Be honest with yourself: would you buy a piece of hardware with only 2 cores if you have a requirement for 10G? The smallest hardware with 10G interfaces has 4 cores minimum.
I think we may be talking past each other here. I'm not talking about purchasing hardware. I'm discussing a lack of throughput that now exists after an upgrade on hardware that performs at a much higher rate with just a software change. That's why we're running tests on multiple VMs, all with the same specs. There's obviously some bottleneck occurring here that isn't just explained away by core count (or lack thereof).
Quote from: mimugmail on October 19, 2020, 07:38:33 PM
I have customers pushing 6Gbit over vmxnet driver.
I'm more interested in trying to understand what is different in my environment that is causing these issues on VMs. Is this claimed 6Gbit going through a virtualized OPNsense install? Do you have any additional details that we can check? I've even tried to change the CPU core assignment (change the number of sockets to 1 and add cores) to see if there was some weird NUMA scaling issue impacting OPNsense. So far everything I have tried has had no impact on throughput; even switching to the beta netmap kernel that is supposed to resolve some of this did not seem to work yet.
Quote from: AveryFreeman on October 22, 2020, 04:36:49 AM
You guys got me interested in this subject. I have tested plenty of iperf3 against my VMs in my little 3-host homelab, my 10GbE is just a couple DACs connected between the 10Gbe "backbone" IFs of my Dell Powerconnect 7048P, which is really more of a gigabit switch.
The infrastructure I have at that remote office I was reporting on so far:
- PowerEdge R630 (2 servers)
- 2-socket Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz with 12 cores each (24 cores per server)
- 3x NetXtreme II BCM57800 10 Gigabit Ethernet (dual-port NICs), meaning 6 physical adapters distributed into 3 virtual switches (2 NICs VM, 2 NICs vMotion, 2 NICs vmkernel)
- 512GB RAM per server
- Plenty of storage on an external SAS 12Gbps (2x 6Gbps active + 2x 6Gbps passive paths) MD3xxx Dell storage with round-robin paths
- 2x Dell N4032F as core/backbone switches with 10Gbps ports, stacked with 2x 40Gbps ports
- 6-port trunks for each server, 3 ports per trunk per stacking member, so each vSwitch NIC touches one stack member
- Stack members on the Dell N series are treated as a unit, so LACP can be configured across stack members (no MLAG involved)
Even when transferring data between VMs that are not registered on the same physical hardware I can achieve 8 Gbps easily, except with the vmxnet3 driver on FreeBSD 12.1.
Quote from: mimugmail on October 22, 2020, 07:27:38 AM
Be honest with yourself: would you buy a piece of hardware with only 2 cores if you have a requirement for 10G? The smallest hardware with 10G interfaces has 4 cores minimum.
What is not honest is to pretend that a VM can't push more than 1 Gbps or achieve decent throughput rates with only 1 vCPU configured, because that is simply not true. On the contrary, with virtualization you should always configure resources in a way that avoids CPU oversubscription. Having, for example, a 4-vCPU VM that is mostly idle and does not run CPU-intensive operations will create problems for other VMs on the same pool/share/physical hardware. For simple iperf3 and network transfer tests, FreeBSD 13 with 1 vCPU did fine, while OPNsense (FreeBSD 12.1) with 4 vCPUs and high CPU shares, as the only VM with that share configuration, crawled during transfers.
vmxnet3 on FreeBSD 12.1 is garbage. It seems that the port to iflib created regressions related to MSI-X, tx/rx queues, iflib leaking MSI-X messages, non-power-of-2 tx/rx queue configs and others. I could even find some LRO regressions in the commits that could explain the retransmissions and the abysmal lack of performance that I reported here (https://forum.opnsense.org/index.php?topic=18754.msg90766#msg90766) on a previous page while trying to enable LRO as a workaround for that performance issue. https://svnweb.freebsd.org/base/head/sys/dev/vmware/vmxnet3/if_vmx.c?view=log
In the test I made above with FreeBSD 13-CURRENT I was only using 1 vCPU, 4GB RAM, pvscsi and vmxnet3, and the system performed greatly compared with the state of the vmxnet3 driver in FreeBSD 12.1-RELEASE.
With Proxmox using the vnet adapter the speed is fine, and pfSense, based on FreeBSD 11, works fine with vmxnet3 too.
So the issue is with HBSD and the vmxnet adapter. I don't understand why OPNsense is based on a half-dead OS; HBSD has been abandoned by most of its devs. Just drop it and use standard FreeBSD again.
Quote from: Archanfel80 on October 26, 2020, 10:27:47 AM
With Proxmox using the vnet adapter the speed is fine, and pfSense, based on FreeBSD 11, works fine with vmxnet3 too.
So the issue is with HBSD and the vmxnet adapter. I don't understand why OPNsense is based on a half-dead OS; HBSD has been abandoned by most of its devs. Just drop it and use standard FreeBSD again.
FreeBSD 12.1 has the same issues ..
Quote from: mimugmail on October 26, 2020, 12:02:55 PM
Quote from: Archanfel80 on October 26, 2020, 10:27:47 AM
With Proxmox using the vnet adapter the speed is fine, and pfSense, based on FreeBSD 11, works fine with vmxnet3 too.
So the issue is with HBSD and the vmxnet adapter. I don't understand why OPNsense is based on a half-dead OS; HBSD has been abandoned by most of its devs. Just drop it and use standard FreeBSD again.
FreeBSD 12.1 has the same issues ..
Yes, but the pfSense current stable branch is still using FreeBSD 11.x, not 12. I think they are on point there. It's not a good idea to switch to a newer base OS if it still has many issues. Now I have to roll back to OPNsense 20.1 everywhere I upgraded to 20.7. And the issue is not just with vmxnet: after the upgrade to 20.7, one of my hardware firewalls with EFI boot no longer boots; it freezes during the EFI boot. That is also a FreeBSD 12 related issue, as I already figured out.
And Sophos is using a 3.12 kernel; why upgrade to a newer one ...
If no one takes the first step there won't be any progress. Usually mission-critical systems shouldn't be updated to a major release when it's not yet on .3 or .4; I'd even wait until a .6.
The whole discussion is way too off-topic and only fed with frustrated content.
It should talk about this, so maybe off-topic, but still.
Half-year release model, and I only updated recently: 20.7 is almost half a year old now, and we are close to 21.1, when 20.7 will be obsolete too. You're right that critical system software should wait before adopting new releases. So even the 21.x series should stay on FreeBSD 11 and wait with the upgrade to 12 until it is stable. A firewall is not a good place for experimenting and making the first step.
But I can say something that is not off-topic.
Disabling net.inet.ip.redirect and net.inet.ip6.redirect, and increasing net.inet.tcp.recvspace and net.inet.tcp.sendspace as well as kern.ipc.maxsockbuf and kern.ipc.somaxconn, helps a little. There is still a performance loss, but not as bad.
I attached my tunables-related config.
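Since the attachment isn't reproduced in the thread text, here is a sketch of those tunables with example values only (set via System > Settings > Tunables or sysctl.conf; sensible values depend on RAM and link speed):

net.inet.ip.redirect=0
net.inet.ip6.redirect=0
net.inet.tcp.recvspace=262144
net.inet.tcp.sendspace=262144
kern.ipc.maxsockbuf=16777216
kern.ipc.somaxconn=2048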
Just keep using 20.1 with all the security related caveats and missing features. I really don't see the point in complaining about user choices.
Cheers,
Franco
Quote from: franco on October 26, 2020, 02:08:45 PM
Just keep using 20.1 with all the security related caveats and missing features. I really don't see the point in complaining about user choices.
Cheers,
Franco
I did the rollback and everything is fine. The network speed is around 800 Mbit again (gigabit internet); with 20.7 this was just 500-600 Mbit. Speed is important here, and I don't care about missing features since I don't use any. I'm not sure about the security caveats; FreeBSD 11 is no less secure. Until this issue is fixed I'll stay with 20.1.x. These servers are used in a production environment, and I don't have the time or opportunity to use them as a playground. This was exactly the reason why I abandoned pfSense: they import untested kernels and features, the core system becomes unstable, and after an upgrade I fear what will go wrong. OPNsense has done it right so far, and I hope the devs fix this or at least give us some workaround. The speed is not the only issue: I also have to disable IPS/IDS and Sensei because they cause system freezes, so I have basically neglected my firewalls. I know this is still in a testing phase, but 20.7 is 4, almost 5 months old now and I'm still unable to use these features properly. And we paid for Sensei, which is unusable now. That is not acceptable. So yes, I take the "risk" and roll back wherever I can...
Would it be possible to install a stock FreeBSD 13 kernel? Maybe they fixed the regressions. I'm wondering if it has something to do with HBSD compile flags for security.
Quote from: AveryFreeman on October 26, 2020, 08:52:55 PM
Would it be possible to install a stock FreeBSD 13 kernel? Maybe they fixed the regressions. I'm wondering if it has something to do with HBSD compile flags for security.
Unfortunately this is not so easy. You can't use a precompiled kernel from another system; it wouldn't boot.
You have to compile from source, but a newer kernel means newer headers and libraries as dependencies, and the compilation process could fail at some point. The only solution that could work is to cherry-pick just the fix, apply it to the original kernel source tree and compile, but that needs work too.
I was an Android kernel developer many years back, so I know that experimenting with the kernel is always risky.
Quote from: Archanfel80 on October 27, 2020, 08:53:09 AM
Quote from: AveryFreeman on October 26, 2020, 08:52:55 PM
Would it be possible to install a stock FreeBSD 13 kernel? Maybe they fixed the regressions. I'm wondering if it has something to do with HBSD compile flags for security.
Unfortunately this is not so easy. You can't use a precompiled kernel from another system; it wouldn't boot.
You have to compile from source, but a newer kernel means newer headers and libraries as dependencies, and the compilation process could fail at some point. The only solution that could work is to cherry-pick just the fix, apply it to the original kernel source tree and compile, but that needs work too.
I was an Android kernel developer many years back, so I know that experimenting with the kernel is always risky.
Wouldn't it be easier to do it the other way round?
Make the OS work with FreeBSD 13, to eliminate any remnants of bad plugin code?
Quote from: Supermule on October 27, 2020, 10:01:12 AM
Quote from: Archanfel80 on October 27, 2020, 08:53:09 AM
Quote from: AveryFreeman on October 26, 2020, 08:52:55 PM
Would it be possible to install a stock FreeBSD 13 kernel? Maybe they fixed the regressions. I'm wondering if it has something to do with HBSD compile flags for security.
Unfortunately this is not so easy. You can't use a precompiled kernel from another system; it wouldn't boot.
You have to compile from source, but a newer kernel means newer headers and libraries as dependencies, and the compilation process could fail at some point. The only solution that could work is to cherry-pick just the fix, apply it to the original kernel source tree and compile, but that needs work too.
I was an Android kernel developer many years back, so I know that experimenting with the kernel is always risky.
Wouldn't it be easier to do it the other way round?
Make the OS work with FreeBSD 13, to eliminate any remnants of bad plugin code?
They just switched to FreeBSD 12; I don't think FreeBSD 13 will be adopted soon. But you have a point.
What I found out is that when OPNsense is used in a virtualized environment it only uses one core; the HW socket detection is faulty in that case.
net.isr.maxthreads and net.isr.numthreads always return 1.
But this can be changed in the tunables too.
It also requires changing net.isr.dispatch from "direct" to "deferred".
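A sketch of those tunables as they would be entered (assumptions: net.isr.maxthreads is a boot-time tunable where -1 is commonly used to mean one netisr thread per core, and net.isr.numthreads simply reflects the result after a reboot):

# System > Settings > Tunables, then reboot
net.isr.maxthreads=-1
# switches packet processing from direct to queued dispatch
net.isr.dispatch=deferred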
This gives me a massive performance boost on a gigabit connection, but it's still not perfect; the boost comes with overhead too. And only on FreeBSD 12: with 20.1, which is still based on FreeBSD 11, it's lightning fast :)
Attached: 20.7.4 with and without sysctl tuning, and 20.1 with tuning.
With 20.7.x nothing helps; the speed is capped and around 20-30 percent is lost because of the overhead. With 20.1, you see the difference :)
I'm also experiencing poor throughput with OPNsense 20.7. Maybe some of you have seen my thread in the general forum (https://forum.opnsense.org/index.php?topic=19426.0).
I did some testing and want to share the results with you.
Measure: In a first step, I disabled all packet filtering on the OPNsense device.
Result: No improvement.
Measure: In a second step and in order to rule out sources of error, I have removed the LAGG/LACP configuration in my setup.
Result: No improvement.
In the next step, I made some performance comparisons. I did tests with the following two setups:
a) Client (Ubuntu 20.04.1 LTS) <--> OPNsense (20.7.4) <--> File Server (Debian 10.6)
b) Client (Ubuntu 20.04.1 LTS) <--> Ubuntu (20.04.1 LTS) <--> File Server (Debian 10.6)
In both setups the client is a member of VLAN 70 and the file server is a member of VLAN 10. In setup b) I have enabled packet forwarding for IPv4.
The test results were as follows:
Samba transfer speeds (MB/sec)
Routing device | Client --> Server | Server --> Client
a) OPNsense    | 67.3              | 71.2
b) Ubuntu      | 108.7             | 113.8
iPerf3 UDP transfer speeds (MBit/sec)
Routing device | Client --> Server | Server --> Client
(UDP values not included in the post text)
Packet loss leads to approx. 25% reduced throughput on the receiving device.
Back with some more test results.
I did a rollback to OPNsense 20.1 for testing purposes.
Samba transfer speeds (MB/sec)
Routing device | Client --> Server | Server --> Client
OPNsense 20.1  | 109.3             | 102.6
iPerf3 UDP transfer speeds (MBit/sec)
Routing device | Client --> Server | Server --> Client
(UDP values not included in the post text)
As you can see OPNsense 20.1 gives me full wire speed.
Quote from: Supermule on October 27, 2020, 10:01:12 AM
Quote from: Archanfel80 on October 27, 2020, 08:53:09 AM
Quote from: AveryFreeman on October 26, 2020, 08:52:55 PM
Would it be possible to install a stock FreeBSD 13 kernel? Maybe they fixed the regressions. I'm wondering if it has something to do with HBSD compile flags for security.
Unfortunately this is not so easy. You can't use a precompiled kernel from another system; it wouldn't boot.
You have to compile from source, but a newer kernel means newer headers and library dependencies, so the compilation could fail at some point. The only approach that could work is to cherry-pick just the fix into the original kernel source tree and compile that, but this needs work too.
I was an Android kernel developer many years back, so I know that experimenting with the kernel is always risky.
Wouldn't it be easier to do it the other way round?
Make OPNsense work with FreeBSD 13, to eliminate any remnant of bad plugin code?
It does work, and it's fairly easy. Just install OPNsense using opnsense-bootstrap over a FreeBSD installation. You have to change the script if you want to install over a different version of FreeBSD (e.g. 13), but if you install 12.x you can just run the script. Then boot from kernel.old or copy the kernel back to /boot/kernel, kldxref, etc.
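A rough sketch of that procedure, under the assumption that the bootstrap script is fetched from the opnsense/update repository (the exact URL, script options and release handling may differ, so check that repository's README first):
# on a fresh FreeBSD 12.x installation
pkg install ca_root_nss                      # so fetch can verify the HTTPS certificate
fetch <URL of opnsense-bootstrap.sh.in from the opnsense/update repo>
sh ./opnsense-bootstrap.sh.in                # converts the FreeBSD install into OPNsense, then reboot
# afterwards, to keep the stock FreeBSD kernel instead of the OPNsense one:
mv /boot/kernel /boot/kernel.opnsense        # or simply pick kernel.old at the loader prompt
cp -a /boot/kernel.old /boot/kernel
kldxref /boot/kernel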
I can't vouch for the helpfulness of that, as my FreeBSD understanding is limited and I don't know much about kernel tuning. Your identification of net.isr.maxthreads and net.isr.numthreads always returning 1 seems more helpful than arbitrarily changing the kernel.
How would you recommend tuning the kernel for multi-threading? Is turning off hyperthreading a good idea?
By the way, I didn't see much of a speed increase installing OPNsense 20.7 over 13-CURRENT and I'm suspicious of its reliability, but there is a slight increase in speed installing OPNsense over 12.1-RELEASE and keeping the FreeBSD kernel: https://forum.opnsense.org/index.php?topic=19789.msg91356#msg91356
It would probably be more noticeable on 10G but I haven't done any benchmarking w/ it yet.
Looking into what "if_io_tqg" is, and why it eats up quite a bit of a core when doing (not even line rate) transfers on my apu2 board, I found this thread.
Has any conclusion been reached yet? Is there anything we can test/do?
Not yet, no
I had throughput issues with 20.7.4 on Proxmox (latest 6.2).
They were related to the hardware offload features being enabled (I know, I'm stupid).
Once disabled, everything is OK, maxing out the link.
This problem seems to be getting worse - I upgraded to 20.7.5 and my iperf3 speeds have dropped from ~2Gb/s to hovering around 1Gb/s, with VM->VM speeds at ~650Mbps :o :-\
CentOS 8 VM on the same machine gets around 9.4Gbps
I'll upload some speeds when I get a chance.
Has anyone rerun the tests with opnsense 21.1?
Here are my latest results.
Recap of my environment:
Server is HP ML10v2 ESXi 6.7 running build 17167734
Xeon E3-1220 v3 CPU
32GB of RAM
SSD/HDD backed datastore (vSAN enabled)
All firewalls are tested with their "out of the box" ruleset; no customizations were made besides configuring the WAN/LAN adapters to work for these tests. All firewalls have their version of VM Tools installed from the package manager.
The iperf3 client/server are both Fedora Desktop v33. The server sits behind the WAN interface, the client sits behind the LAN interface to simulate traffic through the firewall. No transfer tests are performed hosting iperf3 on the firewall itself.
OPNSense 21.1.1 VM Specs:
VM hardware version 14
2 vCPU
4GB RAM
2x vmx3 NICs
pfSense 2.5.0-RC VM Specs:
VM hardware version 14
2 vCPU
4GB RAM
2x vmx3 NICs
OpenWRT VM Specs:
VM hardware version 14
2 vCPU
1GB RAM
2x vmx3 NICs
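The exact iperf3 invocations are not listed; based on the 60-second single-stream and four-stream results below, they were presumably along the lines of:
iperf3 -c <server-behind-WAN> -t 60          # single stream ("p1")
iperf3 -c <server-behind-WAN> -t 60 -P 4     # four parallel streams ("p4")
(where <server-behind-WAN> stands for the Fedora iperf3 server's address)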
OPNsense 21.1.1 (netflow disabled) 1500MTU receiving from WAN, vmx3 NICs, all hardware offload disabled, single thread (p1)
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 8.10 GBytes 1.16 Gbits/sec 219 sender
[ 5] 0.00-60.00 sec 8.10 GBytes 1.16 Gbits/sec receiver
OPNsense 21.1.1 (netflow disabled) 1500MTU receiving from WAN, vmx3 NICs, all hardware offload disabled, four thread (p4)
[ ID] Interval Transfer Bitrate Retr
[SUM] 0.00-60.00 sec 13.4 GBytes 1.91 Gbits/sec 2752 sender
[SUM] 0.00-60.00 sec 13.3 GBytes 1.91 Gbits/sec receiver
OPNsense 21.1.1 (netflow disabled) 1500MTU receiving from WAN, vmx3 NICs, all hardware offload enabled, single thread (p1)
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 251 MBytes 35.0 Mbits/sec 56410 sender
[ 5] 0.00-60.00 sec 250 MBytes 35.0 Mbits/sec receiver
pfSense 2.5.0-RC 1500MTU receiving from WAN, vmx3 NICs, all hardware offload disabled, single thread (p1)
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 15.1 GBytes 2.15 Gbits/sec 1029 sender
[ 5] 0.00-60.00 sec 15.0 GBytes 2.15 Gbits/sec receiver
pfSense 2.5.0-RC 1500MTU receiving from WAN, vmx3 NICs, all hardware offload disabled, four thread (p4)
[ ID] Interval Transfer Bitrate Retr
[SUM] 0.00-60.00 sec 15.3 GBytes 2.19 Gbits/sec 12807 sender
[SUM] 0.00-60.00 sec 15.3 GBytes 2.18 Gbits/sec receiver
pfSense 2.5.0-RC 1500MTU receiving from WAN, vmx3 NICs, all hardware offload enabled, single thread (p1)
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 316 MBytes 44.2 Mbits/sec 48082 sender
[ 5] 0.00-60.00 sec 316 MBytes 44.2 Mbits/sec receiver
OpenWRT v19.07.6 1500MTU receiving from WAN, vmx3 NICs, no UI offload settings (using defaults), single thread (p1)
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 34.1 GBytes 4.88 Gbits/sec 21455 sender
[ 5] 0.00-60.00 sec 34.1 GBytes 4.88 Gbits/sec receiver
OpenWRT v19.07.6 1500MTU receiving from WAN, vmx3 NICs, no UI offload settings (using defaults), four thread (p4)
[ ID] Interval Transfer Bitrate Retr
[SUM] 0.00-60.00 sec 43.2 GBytes 6.18 Gbits/sec 79765 sender
[SUM] 0.00-60.00 sec 43.2 GBytes 6.18 Gbits/sec receiver
host CPU usage during the transfer was as follows:
OPNsense 97% host CPU used
pfSense 84% host CPU used
OpenWRT 63% host CPU used for p1, 76% host CPU used for p4
In this case, my environment is CPU constrained. However, the purpose of these transfers is to use a best case scenario (all 1500MTU packets) and see how much we can push through the firewall with the given CPU power available. I think we're still dealing with inherent bottlenecks within FreeBSD 12. Both of the BSDs here hit high host CPU usage regardless of the thread count during the transfer. Only the Linux system scaled with more threads and still did not max the host CPU during transfers.
I personally use OPNsense and it's a great firewall. Running on bare metal with IGB NICs and a modern processor made within the last 5 years or so, it will be plenty to cover gigabit speeds for most people. However, if we are virtualizing, all of the BSDs seem to want a lot of CPU power to scale beyond a steady 1Gb/s. Perhaps FreeBSD 13 will give us more efficient virtualization throughput?
I am curious if I am seeing this kernel problem on my bare-metal install. I have a passively cooled mini PC with 4 Intel NICs and a J1900 CPU at 2.00GHz and 4 GB of RAM. I know this CPU is fairly old, but the hardware sizing guide says I should be able to do 350-750 Mbit/s throughput. When I have no firewall rules enabled and the default IPS settings I get about 370-380 Mbit/s of my 400 Mbit/s inbound speed. If I enable firewall rules to set up fq_codel, then it drops my throughput to 320-340 Mbit/s. In both of these scenarios I see my CPU going up to 90+% on one thread. I do understand that my throughput will go down with different options like IPS and firewall rules, but I would think that with no other options running this hardware should be able to do better than 380 Mbit/s tops.
Quote from: DiHydro on February 11, 2021, 09:40:20 PM
I am curious if I am seeing this kernel problem on my bare-metal install. I have a passively cooled mini PC with 4 Intel NICs and a J1900 CPU at 2.00GHz and 4 GB of RAM. I know this CPU is fairly old, but the hardware sizing guide says I should be able to do 350-750 Mbit/s throughput. When I have no firewall rules enabled and the default IPS settings I get about 370-380 Mbit/s of my 400 Mbit/s inbound speed. If I enable firewall rules to set up fq_codel, then it drops my throughput to 320-340 Mbit/s. In both of these scenarios I see my CPU going up to 90+% on one thread. I do understand that my throughput will go down with different options like IPS and firewall rules, but I would think that with no other options running this hardware should be able to do better than 380 Mbit/s tops.
Using FQ_Codel or IPS is somewhat secondary to the overall discussion here. Both of these consume a large number of CPU cycles and won't illustrate the true throughput capabilities of the firewall due to their own inherent overhead.
I run a J3455 with a quad port Intel I340 NIC, and can easily push 1gigabit with the stock ruleset and have plenty of CPU overhead remaining. This unit can also enable FQ_Codel on WAN and still push 1gigabit, although CPU usage does increase around 20% at 1gigabit speeds.
I don't personally run any of the IPS components so I don't have any direct feedback on that. It's worth noting that both of these tests are done on a traditional DHCP WAN connection. If you're using PPPoE, that will be single thread bound and will limit your throughput to the maximum speed of a single core.
What most of the transfer speed tests are illustrating here are that FreeBSD seems to have very poor scaling when using 10gbit virtualized NICs and forwarding packets. This isn't an OPNsense induced issue, more of an issue that OPNsense gets stuck with due to the poor upstream support from FreeBSD. For the vast majority of users on 1gigabit or lower connections, this won't be a cause for concern in the near future.
Quote from: opnfwb on February 11, 2021, 10:27:08 PM
Quote from: DiHydro on February 11, 2021, 09:40:20 PM
I am curious if I am seeing this kernel problem on my bare-metal install. I have a passively cooled mini PC with 4 Intel NICs and a J1900 CPU at 2.00GHz and 4 GB of RAM. I know this CPU is fairly old, but the hardware sizing guide says I should be able to do 350-750 Mbit/s throughput. When I have no firewall rules enabled and the default IPS settings I get about 370-380 Mbit/s of my 400 Mbit/s inbound speed. If I enable firewall rules to set up fq_codel, then it drops my throughput to 320-340 Mbit/s. In both of these scenarios I see my CPU going up to 90+% on one thread. I do understand that my throughput will go down with different options like IPS and firewall rules, but I would think that with no other options running this hardware should be able to do better than 380 Mbit/s tops.
Using FQ_Codel or IPS is somewhat secondary to the overall discussion here. Both of these consume a large number of CPU cycles and won't illustrate the true throughput capabilities of the firewall due to their own inherent overhead.
I run a J3455 with a quad port Intel I340 NIC, and can easily push 1gigabit with the stock ruleset and have plenty of CPU overhead remaining. This unit can also enable FQ_Codel on WAN and still push 1gigabit, although CPU usage does increase around 20% at 1gigabit speeds.
I don't personally run any of the IPS components so I don't have any direct feedback on that. It's worth noting that both of these tests are done on a traditional DHCP WAN connection. If you're using PPPoE, that will be single thread bound and will limit your throughput to the maximum speed of a single core.
What most of the transfer speed tests are illustrating here are that FreeBSD seems to have very poor scaling when using 10gbit virtualized NICs and forwarding packets. This isn't an OPNsense induced issue, more of an issue that OPNsense gets stuck with due to the poor upstream support from FreeBSD. For the vast majority of users on 1gigabit or lower connections, this won't be a cause for concern in the near future.
It sounds like I may need to reset to the stock configuration and try this again. I thought that in some of my testing I had disabled all options and was running the device as a pure router and was still seeing the single-core limitation. Maybe I was mistaken and still had some option enabled that caused significant CPU usage. My cable modem gives a DHCP lease to my OPNsense box, so I am not running PPPoE. When directly connected to the modem I get 390-430 Mbit/s. That is what led me to look at the firewall itself as a throttle point.
Quote from: DiHydro on February 11, 2021, 09:40:20 PM
I am curious if I am seeing this kernel problem on my bare-metal install. I have a passively cooled mini PC with 4 Intel NICs and a J1900 CPU at 2.00GHz and 4 GB of RAM. I know this CPU is fairly old, but the hardware sizing guide says I should be able to do 350-750 Mbit/s throughput. When I have no firewall rules enabled and the default IPS settings I get about 370-380 Mbit/s of my 400 Mbit/s inbound speed. If I enable firewall rules to set up fq_codel, then it drops my throughput to 320-340 Mbit/s. In both of these scenarios I see my CPU going up to 90+% on one thread. I do understand that my throughput will go down with different options like IPS and firewall rules, but I would think that with no other options running this hardware should be able to do better than 380 Mbit/s tops.
I wonder what throughput you would get with a Linux-based firewall, just to see what the hardware is capable of. My experience with the current OPNsense 21.1 release is that it gives me only ~50% throughput even after performance tuning in a virtualized environment. A quick test with a virtualized OpenWrt gave me full gigabit wire speed without any optimization needed. I know that's comparing apples and oranges, but it's difficult to say what a hardware platform is capable of if you don't try different things.
Quote from: spi39492 on February 12, 2021, 04:19:49 PM
Quote from: DiHydro on February 11, 2021, 09:40:20 PM
I am curious if I am seeing this kernel problem on my bare-metal install. I have a passively cooled mini PC with 4 Intel NICs and a J1900 CPU at 2.00GHz and 4 GB of RAM. I know this CPU is fairly old, but the hardware sizing guide says I should be able to do 350-750 Mbit/s throughput. When I have no firewall rules enabled and the default IPS settings I get about 370-380 Mbit/s of my 400 Mbit/s inbound speed. If I enable firewall rules to set up fq_codel, then it drops my throughput to 320-340 Mbit/s. In both of these scenarios I see my CPU going up to 90+% on one thread. I do understand that my throughput will go down with different options like IPS and firewall rules, but I would think that with no other options running this hardware should be able to do better than 380 Mbit/s tops.
I wonder what throughput you would get with a Linux-based firewall, just to see what the hardware is capable of. My experience with the current OPNsense 21.1 release is that it gives me only ~50% throughput even after performance tuning in a virtualized environment. A quick test with a virtualized OpenWrt gave me full gigabit wire speed without any optimization needed. I know that's comparing apples and oranges, but it's difficult to say what a hardware platform is capable of if you don't try different things.
I am going to try this in a day or two. IPfire is my choice right now, unless someone has a different suggestion. I will probably come back to OPNsense either way as I like this community and the project.
Quote from: DiHydro on February 12, 2021, 10:49:11 PM
I am going to try this in a day or two. IPfire is my choice right now, unless someone has a different suggestion. I will probably come back to OPNsense either way as I like this community and the project.
Yeah, I like OPNsense as well. That's why it is so painful that in my setup the throughput is so limited. I did the tests with Debian and iptables on the one hand and with OpenWrt on the other, as it's available for many platforms and pretty simple to install on bare metal and in virtual environments.
So I put OPNsense on a PC that has an Intel PRO/1000 4-port NIC and an i7 2600, and with a default install I get my 450 Mbit/s. Once I put a firewall rule in to enable fq_codel, it drops to 360-380 Mbit/s. I can't believe that an i7 at 3.4 GHz with an Intel NIC can't handle these rules at full speed. What is wrong, what can I look at, and how can I help make this better?
Quote from: DiHydro on February 16, 2021, 01:05:39 AM
So I put OPNsense on a PC that has an Intel PRO/1000 4-port NIC and an i7 2600, and with a default install I get my 450 Mbit/s. Once I put a firewall rule in to enable fq_codel, it drops to 360-380 Mbit/s. I can't believe that an i7 at 3.4 GHz with an Intel NIC can't handle these rules at full speed. What is wrong, what can I look at, and how can I help make this better?
You can check with some of the performance setting tips laid out here: https://forum.opnsense.org/index.php?topic=9264.msg93315#msg93315
I have exactly the same problem. Apparently there are problems with the vmxnet3 vNIC here. It's sad, but I can't get higher than 1.4 Gbps. Please don't come at me about hardware. Sorry folks, it's 2021; 10 Gbps is what every firewall should be able to do by default. OPNsense is a wonderful product, but I think you are betting on a dead horse. Why not use Linux as the OS? FreeBSD slept through the virtualization era (see the s... vmxnet3 support and bugs). Now I've vented my frustration and will go back to work :).
Quote from: mm-5221 on February 21, 2021, 06:58:42 PM
I have exactly the same problem. Apparently there are problems with the vmxnet3 vNIC here. It's sad, but I can't get higher than 1.4 Gbps. Please don't come at me about hardware. Sorry folks, it's 2021; 10 Gbps is what every firewall should be able to do by default. OPNsense is a wonderful product, but I think you are betting on a dead horse. Why not use Linux as the OS? FreeBSD slept through the virtualization era (see the s... vmxnet3 support and bugs). Now I've vented my frustration and will go back to work :).
So there's always an option to use IPFire for this use-case? :)
No, I switched from Sophos UTM to OPNsense some time ago and I do not want another migration. With the exception of WAF and the fact that firewall aliases are not connected to DHCP, I find that OPNsense is a great product.
I have now solved my performance problem with the tunable hw.pci.honor_msi_blacklist=0. With iperf3 and -P 10 (parallel streams) I get between 8-9 Gbps without IPS. With IPS, unfortunately, only 1.7 Gbps (CPU only 30% utilized). I am still missing performance tuning of the IPS parameters in the UI. I think I could get 5-6 Gbps with about 8 cores, and with 12 cores it should be 8-9 Gbps. Currently IPS/Suricata seems to be artificially throttled somewhere in the configuration.
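For anyone wanting to try the same fix: hw.pci.honor_msi_blacklist is a boot-time tunable, so (assuming the usual workflow) it has to be added under System > Settings > Tunables and needs a reboot, roughly:
# System > Settings > Tunables -> add, then reboot:
hw.pci.honor_msi_blacklist = 0    # don't honor the PCI MSI blacklist, so MSI/MSI-X interrupts are used (relevant on some hypervisors)
# after rebooting, verify from a shell:
sysctl hw.pci.honor_msi_blacklist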
Do we have any solution here?
I have an R620 (Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz - 8 cores) under ESXi 7 and I get 700 Mbps between OPNsense <> Ubuntu VM on the same host, while two Ubuntu VMs can do 7 Gbps, 10 times faster.
Is there any news regarding this topic? Throughput is still slow on OPNsense 21.1.5 :'(
I've found a similar issue regarding slow transfers with iflib in TrueNAS which has been solved.
Maybe we're facing the same issue here in OPNsense.
Please have a look at the following links/commits:
- https://jira.ixsystems.com/browse/NAS-107593
- https://reviews.freebsd.org/D27683
- https://reviews.freebsd.org/R10:81be655266fac2b333e25f386f32c9bcd17f523d
Maybe there is an expert here who can review the code snippets.
I really hope this issue can be solved soon.
Related Github ticket: #119 (https://github.com/opnsense/src/issues/119)
Hello everyone
Unfortunately I have the same performance problem on ESXi 6.7 with vmxnet3 network adapters. The physical adapters behind them are as follows:
WAN: AQtion AQN-107 (10 Gbps)
LAN: Intel 10 Gigabit Ethernet Controller 82599 (10 Gbps)
DMZ: Intel 10 Gigabit Ethernet Controller 82599 (10 Gbps)
ISP: 10/10 Gbps (XGS-PON)
The speed on OPNsense (also on pfSense) is approximately as follows:
down: 7-10 Mbps
up: 2.5-3 Gbps
On any Linux firewall (e.g. IPFire and Untangle) I get the following values:
down & up: 5-6 Gbps
I have tried all possible tunables on the OPNsense, which unfortunately didn't help.
But now I just noticed something strange:
When I have performance monitoring active during a speed test (the Performance graph in the WebUI, or top via SSH), the speed is suddenly not even that bad:
down & up: 3-4 Gbps
If I deactivate the performance monitoring again, the values are as low as at the beginning.
Unfortunately I don't know exactly what triggers this phenomenon, but maybe someone of you has also noticed this?
Try WAN and LAN on the Intel card and don't use the other card.
Thank you for the answer.
I previously had an Intel X550-T2 purely for the WAN connection. But after testing I found that the onboard AQtion AQN-107 with the current driver from Marvell* is just as fast (so I could save one PCI-E slot).
On both Linux firewalls, I was able to max out the bandwidth of the ISP with both configurations (Intel or AQtion).
P.S. the problem was the same with the configuration with the Intel NIC
(*sorry, the driver is not from Broadcom, it's from Marvell)
Interfaces LAN MSS ... set to 1300
Thanks for the hint, but I had already adjusted this value before - unfortunately without success...
What is really strange is that the speed is normal (like on the Linux Firewalls) as soon as I have "top" open in the background.
(no matter if OPNsense is tuned or on factory settings).
As if (figuratively speaking) "top" keeps the floodgates open for the network packets to flow faster.
Can anyone perhaps verify this with the same problem (vmxnet3)?
I have recorded the phenomenon below:
https://ibb.co/rv8r4fn (https://ibb.co/rv8r4fn)
Just to chip in and offer a possibly "standard" hardware data point.
I'm using Deciso's "own" hardware, which should help with replicating/reproducing the issue.
Deciso DEC 840 with OPNsense 21.4.2-amd64, FreeBSD 12.1-RELEASE-p19-HBSD
I have one main VLAN routing to untagged (main LAN). I upgraded my main switch to 10 Gbps and changed my LAN+VLAN interface from a GbE port to an SFP+ port at 10 GbE.
Everything else works well, but VLAN <=> LAN routing causes massive lag on completely unrelated routing (like 400-1000 ms spikes); the extreme case was a CPU spike up to 80%+ which caused several seconds of 1000-1300 ms spikes on separate routes under light traffic.
I will reconfigure (likely today) the VLAN parts onto a separate GbE interface and see if that solves the issue; the next step will be restoring the whole network to GbE ports (as it was before).
I did install a new switch in the network, so it might play a part in this, but based on the behaviour it seems unlikely.
Do you use Sensei or IPS?
If you meant me: no, I don't have Sensei, and I believe I don't have IPS enabled (at least not on purpose; I can't even find the setting right now).
We do use traffic shaping policies for 2x WANs, but that's about it. All the rest is just basic (rule-limited) routing between LAN/VLANs.
I didn't touch anything in the recent change, except that I moved the LAN (plus the VLANs associated with it) from the igb0 interface to ax0.
I'll configure it back soonish (hopefully today), as the 10 GbE wasn't really utilized yet and the issue is really easy to spot right now. So I'll have more info about my scenario soon.
Ok, that was nice and clean to confirm.
To clarify the terms below Deciso 840 has 4x GbE ports (igb0,1,2,3) and 2x 10GbE SFP+ ports (ax0,ax1).
The issue with the Deciso 840 is the 10 GbE SFP+ ports routing VLAN traffic. In my case it was supposed to route that traffic alongside untagged LAN traffic, so this is the scenario I can confirm.
1. Before changes - VLAN routing worked
Before using the SFP+ ports I had LAN + VLAN routed on the igb0 interface. Everything worked well, no issues.
2. After changes - VLAN routing broken (affecting other routing too)
After moving LAN + VLAN over the SFP+ port (ax0), the issues started. When VLAN traffic was routed, there were heavy lag spikes on non-VLAN traffic as well. I don't have performance numbers, but the traffic wasn't heavy - yet it heavily affected the whole physical interface.
3. Fixed by moving VLAN to igb0 while keeping LAN on ax0
As I knew the "everything on igb0" setup worked, I wanted to try whether it was enough to move just the VLANs to igb0 and keep the LAN on ax0. It required some careful "tag denial" on the switch routes so as not to "loop" either the untagged traffic or the VLANs, but the solution worked.
EDIT: Of course this workaround/fix was only feasible because my VLAN networks didn't need the 10 GbE in the first place.
As I need to change 2x managed switches and be very careful not to make my OPNsense inaccessible, I'm hesitant to try it "the other way around" (moving VLANs to SFP+ and LAN to igb0) just to test whether all VLAN routing is broken, or whether the issue only appears when LAN/VLAN traffic is "routing back" through the same physical interface.
I also didn't test the 10 GbE speeds (no sensible way to test it through OPNsense right now), but the lag/latency issue was so clear that something was obviously not working.
@Kallex Can you try to update to 21.4.3? the axgbe driver from AMD had an issue with larger packets in vlans, which lead to a lot of spam in dmesg (and reduced performance). If you do suffer from the same issue, I expect quite some kernel messages (..Big packet...) when larger packets are being processed.
The release notes for 21.4.3 are available here https://docs.opnsense.org/releases/BE_21.4.html#august-11-2021
o src: axgbe: remove unneccesary packet length check (https://github.com/opnsense/src/commit/bee1ba0981190dabcd045b6c8debfc8b8820016c)
Best regards,
Ad
I can try; we're in a production environment, so the earliest I can try it is the weekend.
I guess that's not the "Stable Business Branch" release; can I easily roll back to the last stable one after checking that version out?
I'll report back regardless of whether I could test it or not.
EDIT: Realized it's indeed a business release. I'll test it on the weekend at the latest and report back.
Quote from: Kallex on August 24, 2021, 11:15:50 PM
I can try; we're in a production environment, so the earliest I can try it is the weekend.
I guess that's not the "Stable Business Branch" release; can I easily roll back to the last stable one after checking that version out?
I'll report back regardless of whether I could test it or not.
EDIT: Realized it's indeed a business release. I'll test it on the weekend at the latest and report back.
I got to test it now. My issue does not replicate anymore with this newest version, thank you :-).
So initially I had performance issues on routing VLAN <=> LAN through ax0 (10 GbE) on Deciso DEC 840. After this patch the issue is clearly gone.
I don't have any real performance numbers between VLANs, but the clear "laggy issue" is entirely gone now.
I also did some testing after I noticed at a customer site that even on a 10G uplink I would max out at 600 Mbps. Since then I have roughly tested this on all other sites where we run OPNsense and the result is the same everywhere. OPNsense runs everywhere on either ESXi or Proxmox, on Thomas Krenn servers with the following specs:
Supermicro mainboard X10SDV-TP8F
Intel Xeon D-1518
16 GB ECC DDR4 2666 RAM
I have now tested with 3 VMs, 2 running Debian Bullseye and 1 running OPNsense (latest 20.1 and latest 21.7). The results are quite poor.
Debian -> Debian
> 14Gbps
Debian -> OPNsense 20.1 -> Debian
< 700Mbps
Debian -> OPNsense 21.7 -> Debian
< 900Mbps
Both OPNsense installs are using default settings, hardware offloading disabled and updated to latest version.
I tried setting the following tunables:
net.isr.maxthreads=-1
I also noticed that net.isr.maxthreads always returns 1, but when set to -1 it reports the correct thread count. However, the network throughput does not change.
hw.ibrs_disable=1
This made a significant impact and throughput increased to 2.6 Gbps, which is still too low but a lot better than before.
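If you want to check whether your own install shows the same single-thread behaviour, the relevant values can be read from a shell; a quick check could look like this:
sysctl net.isr.maxthreads net.isr.numthreads   # 1 / 1 means all netisr work funnels through one thread
sysctl hw.ncpu                                  # compare with the number of vCPUs the VM actually has
sysctl hw.ibrs_disable                          # 0 = IBRS mitigation active (the costly default)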
@alh, in the case of ESXi the most relevant details are likely already documented in https://forum.opnsense.org/index.php?topic=18754.msg90576#msg90576. The 14 Gbps was probably measured with default settings; the D-1518 isn't a very fast machine, so that would be reasonable with all hardware-accelerated offloading settings enabled.
For virtualised environments it helps to look into SR-IOV.
Supermicro M11SDV-8C-LN4F with Intel X710-DA2 running Proxmox 7 with SR-IOV VFs configured for OPNsense LAN and WAN on separate SFP+ slots.
Running
iperf3 -c192.168.178.8 -R -P3 -t30
through the firewalls.
OPNsense 21.7.3_1 with Sensei
[SUM] 0.00-30.00 sec 10.5 GBytes 3.00 Gbits/sec 3117 sender
[SUM] 0.00-30.00 sec 10.5 GBytes 3.00 Gbits/sec receiver
OPNsense 21.7.3_1 without Sensei
[SUM] 0.00-30.00 sec 23.8 GBytes 6.82 Gbits/sec 514 sender
[SUM] 0.00-30.00 sec 23.8 GBytes 6.82 Gbits/sec receiver
Blindtest, Linux based firewall hardware:
[SUM] 0.00-30.00 sec 29.3 GBytes 8.40 Gbits/sec 0 sender
[SUM] 0.00-30.00 sec 29.3 GBytes 8.40 Gbits/sec receiver
@athurdent
Do you think SR-IOV also helps if the host (the virtualization platform) uses vSwitches?
I work with ESXi hosts where a NIC goes directly to a vSwitch, so the NIC doesn't seem to be "sliced" for the VM guests.
Thanks for the benchmarks btw.
T.
Quote from: testo_cz on September 28, 2021, 09:24:52 PM
@athurdent
Do you think SR-IOV also helps if the host (the virtualization platform) uses vSwitches?
I work with ESXi hosts where a NIC goes directly to a vSwitch, so the NIC doesn't seem to be "sliced" for the VM guests.
Thanks for the benchmarks btw.
T.
Hi, not sure about the ESXi implementation, they seem to have documentation on it though. https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.networking.doc/GUID-CC021803-30EA-444D-BCBE-618E0D836B9F.html
The card itself definitely has integrated switching capabilities. If I use a VLAN only on the card for 2 VMs to communicate (VLAN is not configured or allowed on the hardware switch the card is connected to), then I get around 18G throughput, which is done on the card internally.
Quote from: athurdent on September 29, 2021, 05:28:19 AM
Quote from: testo_cz on September 28, 2021, 09:24:52 PM
@athurdent
Do you think SR-IOV also helps if the host (the virtualization platform) uses vSwitches?
I work with ESXi hosts where a NIC goes directly to a vSwitch, so the NIC doesn't seem to be "sliced" for the VM guests.
Thanks for the benchmarks btw.
T.
Hi, not sure about the ESXi implementation, they seem to have documentation on it though. https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.networking.doc/GUID-CC021803-30EA-444D-BCBE-618E0D836B9F.html
The card itself definitely has integrated switching capabilities. If I use a VLAN only on the card for 2 VMs to communicate (VLAN is not configured or allowed on the hardware switch the card is connected to), then I get around 18G throughput, which is done on the card internally.
That's interesting information -- the SR-IOV card's VFs just switch between each other. It also makes sense; I can imagine how this would improve smaller setups, no matter whether it's ESXi or something else.
The ESXi docs say that Direct I/O enables HW acceleration too, regardless of the vSwitch, but only in some scenarios. I assume it's a combination of their VMXNET3 paravirtualized driver magic and the Physical Function of the NIC. From what I've seen it's the default for large ESXi setups.
18G means the traffic went through PCIe only, cool.
Thanks. T.
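In case someone wants to reproduce the SR-IOV setup on Proxmox, a very rough sketch follows; the interface name, PCI address and VM ID are placeholders, and the host needs IOMMU enabled first:
# on the Proxmox host: create two virtual functions on one X710 port
echo 2 > /sys/class/net/enp1s0f0/device/sriov_numvfs
lspci | grep -i 'virtual function'        # note the PCI addresses of the new VFs
# pass one VF straight into the OPNsense VM (VM ID 100 here)
qm set 100 -hostpci0 0000:01:02.0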
I made a new thread about this very same issue but with Proxmox guests in the mix:
https://forum.opnsense.org/index.php?topic=25410.msg122060#msg122060
I don't want to blame OPNsense 100% before I rule out OVS problems, but OVS has not had issues for me in the past :(
I'm chiming in to say I have seen similar issues. Running on Proxmox, I can only route about 600 Mbps in OPNsense using virtio/vtnet. A related kernel process in OPNsense shows 100% CPU usage and the underlying vhost process on the Proxmox host is pegged as well.
Trying a Linux VM on the same segment (i.e. not routing through the OPNsense) saturates the 1-gig NIC on my desktop with only 25% CPU usage on the associated vhost process for the VM's NIC.
I know some blame has been put on CPU speed/etc., but I think there is some sort of performance issue with the vtnet drivers. Even users of pfsense have had similar complaints. I also tried the new opnsense development build (freebsd 13) with no improvement.
I passed my nic through to the opnsense VM and reconfigured the interfaces and can route 1gbps no sweat. This is with the em driver (which supports my nic).
Note: I can get 1gbps with multiple queues set on the vtnet adapters for the opnsense VM. However, this still doesn't fix the performance issue with a single "stream."
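For reference, multiqueue on a Proxmox virtio NIC is a per-device option; something like the following should match what is described above (the VM ID, bridge name and queue count are just examples):
qm set 100 --net0 virtio,bridge=vmbr0,queues=4   # give the OPNsense LAN vNIC four queues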
Hello,
I'm joining this thread too. We have:
* 4 x DEC-3850
* OPNsense 21.10.2-amd64 (Business edition)
Since we started using OPNsense we have had this throughput problem. In the beginning we had a SuperMicro X11-SSH doing ~5 Gb/s and then switched to the appliances. We never reach more than 2-3 Gb/s (iperf3, without any special options) and it seems the problem is the VPN stack: if you have an IPsec tunnel, all traffic slows down, even traffic that does not go through the tunnel.
We tested:
* VM -> VM, same hypervisor (Proxmox), same VLAN = ~16 Gb/s
* VM -> VM, different hypervisor (Proxmox), same VLAN = ~10 Gb/s
* VM -> VM, different hypervisor (Proxmox), different VLAN = 1.5 - ~3 Gb/s
So if it goes via OPNsense, the network slows down.
https://www.mayrhofer.eu.org/post/firewall-throughput-opnsense-openwrt/
Quote: "When IPsec is active - even if the relevant traffic is not part of the IPsec policy - throughput is decreased by nearly 1/3. This seems like a real performance issue / bug in the FreeBSD/HardenedBSD kernel. I will need to try with VTI based IPsec routing to see if the in-kernel policy matching is a problem."
What makes us very sad is that, if this is the real issue, it is not easy to test by disabling VPN, but we will try to build a test scenario...
Pretty sad stuff...
Did you also test 22.1?
@linuxmail would you mind stopping random cross-posting, thanks
Is there a way I could test this with a bare-metal OPNsense installation? How would I proceed here?
EDIT - Resolved - see next post
Original post:
Quote from: iamperson347 on December 05, 2021, 07:48:25 PM
I'm chiming in to say I have seen similar issues. Running on proxmox, I can only route about 600 mbps in opnsense using virtio/vtnet. A related kernel process in opnsense shows 100% cpu usage and the underlying vhost process on the proxmox host is pegged as well.
I'm seeing throughput all over the place on a similar setup (i.e. in a Proxmox VM):
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 97.0 MBytes 814 Mbits/sec
[ 5] 1.00-2.00 sec 109 MBytes 911 Mbits/sec
[ 5] 2.00-3.00 sec 111 MBytes 934 Mbits/sec
[ 5] 3.00-4.00 sec 103 MBytes 867 Mbits/sec
[ 5] 4.00-5.00 sec 100 MBytes 843 Mbits/sec
[ 5] 5.00-6.00 sec 112 MBytes 937 Mbits/sec
[ 5] 6.00-7.00 sec 109 MBytes 911 Mbits/sec
[ 5] 7.00-8.00 sec 75.7 MBytes 635 Mbits/sec
[ 5] 8.00-9.00 sec 68.9 MBytes 578 Mbits/sec
[ 5] 9.00-10.00 sec 96.6 MBytes 810 Mbits/sec
[ 5] 10.00-11.00 sec 112 MBytes 936 Mbits/sec
And while that's happening, I see the virtio_pci process maxing out:
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
12 root -92 - 0B 400K CPU0 0 21:42 94.37% [intr{irq29: virtio_pci1}]
51666 root 4 0 17M 6600K RUN 1 0:18 68.65% iperf3 -s
11 root 155 ki31 0B 32K RUN 1 20.4H 13.40% [idle{idle: cpu1}]
11 root 155 ki31 0B 32K RUN 0 20.5H 3.61% [idle{idle: cpu0}]
Are there any settings that could help with this please?
I'm on 22.1.6
Further to my previous post, I actually fixed this just by turning on all the hardware acceleration options in "Interface -> Settings"
That includes CRC, TSO, and LRO. I removed the 'disabled' check and rebooted.
Now get rock solid iperf3 result:
[ 5] 166.00-167.00 sec 112 MBytes 941 Mbits/sec
[ 5] 167.00-168.00 sec 112 MBytes 941 Mbits/sec
[ 5] 168.00-169.00 sec 112 MBytes 941 Mbits/sec
[ 5] 169.00-170.00 sec 112 MBytes 941 Mbits/sec
And NIC processing load dropped to just 25% or so:
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
11 root 155 ki31 0B 32K RUN 1 3:14 77.39% [idle{idle: cpu1}]
11 root 155 ki31 0B 32K RUN 0 3:06 71.26% [idle{idle: cpu0}]
12 root -92 - 0B 400K WAIT 0 0:55 28.35% [intr{irq29: virtio_pci1}]
91430 root 4 0 17M 6008K RUN 0 0:43 21.94% iperf3 -s
What confused me was:
1) The acceleration is disabled by default (not sure why?)
2) I didn't think it would apply to virtio devices, but clearly they implement the right things to support it.
EDIT
Arghh - perhaps not. While this fixed the LAN side, the WAN side throughput suddenly plummeted.
This is strange because it's using the same virtio driver to a separate NIC of exactly the same type.
We also have a performance issue: we have a Scope7 5510 with 10G SFP+ and only get 1.2 Gbit/s, but it should be >9 Gbit/s.
Any ideas why this happens and how to fix it?
Quote from: linuxmail on February 02, 2022, 12:54:49 PM
https://www.mayrhofer.eu.org/post/firewall-throughput-opnsense-openwrt/
Quote: "When IPsec is active - even if the relevant traffic is not part of the IPsec policy - throughput is decreased by nearly 1/3. This seems like a real performance issue / bug in the FreeBSD/HardenedBSD kernel. I will need to try with VTI based IPsec routing to see if the in-kernel policy matching is a problem."
Well spotted! Exactly the same negative observation here on my end with IPsec policy based VPN.
Here is a first estimate of how IPsec affects my routing speed in the LAN:
Direction | IPsec enabled | IPsec disabled |
Server -> OPNsense -> Client | 48.1 MB/s | 74.2 MB/s |
Server <- OPNsense <- Client | 49.9 MB/s | 61.1 MB/s |
Overall, the routing speed remains very disappointing. Especially considering I had full routing performance up until OPNsense 20.1.
During my testing, I noticed that OPNsense doesn't seem to be utilizing all NIC queues. Two out of four NIC queues process almost no traffic and are bored.
dev.ix.2.queue3.rx_packets: 2959840
dev.ix.2.queue2.rx_packets: 2158082
dev.ix.2.queue1.rx_packets: 9861
dev.ix.2.queue0.rx_packets: 4387
dev.ix.2.queue3.tx_packets: 2967255
dev.ix.2.queue2.tx_packets: 2160888
dev.ix.2.queue1.tx_packets: 15955
dev.ix.2.queue0.tx_packets: 8725
Any take on this?
interrupt | total | rate |
irq51: ix2:rxq0 | 5136 | 11 |
irq52: ix2:rxq1 | 2176474 | 4708 |
irq53: ix2:rxq2 | 7203 | 16 |
irq54: ix2:rxq3 | 3299471 | 7138 |
irq55: ix2:aq | 1 | 0 |
This is really crap!
Update: I don't know if others have made the same mistake, but do a traceroute from your iperf client to iperf server and make sure it looks right. Do a netstat -rn on your opnsense box too and make sure the routing table looks sane. In my testing I was putting the wan side into my normal network, and the lan side into an isolated proxmox bridge with no physical port attached. For some reason, OPNsense is routing the traffic all the way to WAN's upstream gateway, which is a physical 1gb router outside of my proxmox environment. I'm not sure why yet, but here's what's happening:
vtnet0 WAN 10.0.0.1->WAN gateway (physical router) 10.0.0.254
vtnet1 LAN 10.0.1.1
iperf client on lan side: 10.0.1.1
iperf server on wan network: 10.0.0.100
traceroute from iperf client to iperf server (through opnsense):
1 10.0.1.1
2 10.0.0.254
3 10.0.0.100
traceroute from iperf client to iperf server (through pfsense):
1 10.0.1.1
2 10.0.0.100
Deleted the route entry from system->routes->status and it works as expected now, but how did that entry get there in the first place? I have a second opnsense test instance that did the same thing.
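A quick way to run the same sanity check, using the addresses from the example above:
# on the OPNsense shell: look for an unexpected route covering the iperf server's subnet
netstat -rn | grep '^10.0.0'
# on the iperf client: the path should be a single hop through the firewall, not via 10.0.0.254
traceroute 10.0.0.100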
Original post:
Anyone have any updates on this? Is this now considered a known bug? I saw early in the thread a link to a Github PR that was merged, but it looks like it's been included in 22.7. I setup two identical VMs in Proxmox, one Pfsense 2.6.0, one OPNsense 22.7. VMs have 12 E5-2620 cores (vm cpu set to "host"), 4GB of ram, and two virtio nics. Nothing was changed other than setting a static lan IP for each instance. Traffic is tested as such with all VMs on the same proxmox host (including the iperf client and server):
iperf client -> iperf server: 10 Gb/s
iperf client -> pfSense LAN -> pfSense WAN -> iperf server: 2.5 Gb/s
iperf client -> OPNsense LAN -> OPNsense WAN -> iperf server: 0.743 Gb/s
I then set hw.ibrs_disable=1 (note if CPU is set to the default KVM64, this isn't needed and performance is the same)
iperf client -> OPNsense LAN -> OPNsense WAN -> iperf server: 0.933 Gb/s
Also tested with multiple iperf streams (-P 20) and got the same speeds.
CPU usage was high when testing, but then I enabled multiqueue on the Proxmox NICs (six queues on each NIC) and CPU usage dropped to basically nothing, and then I topped out right at 940 Mb/s, exactly the max TCP throughput on a gigabit link. I find it pretty suspicious and it makes me think something in the chain is being limited to gigabit Ethernet. It does show the NICs as "10gbaseT <full duplex>" in the UI, and again my iperf client VM and iperf server VM both have 10G interfaces and pull a full 10 Gb/s when connected directly to each other.
I have multi-gigabit Internet and recently decided to transition to an OPNsense server running inside of a Proxmox VM with Virtio network adapters as my main router at home, not realizing at the time that so many performance issues existed....
I read through this entire thread and combed through numerous other resources online. It seems like a lot of people are hung up on this issue and definitive answers are in short supply.
I went through the journey and experienced pretty much everything mentioned in this thread, even marcosscriven's uncanny post about how hardware acceleration caused the LAN side performance to improve and WAN throughput to plummet.
I'm posting here now because I solved this issue for my setup. My OPNsense running in a Proxmox KVM virtual machine is now able to keep up with my 6 gig Internet.
(speed test result screenshot: https://binaryimpulse.com/wp-content/uploads/2022/11/st-normal-6g.png)
I made a lot of changes, and I'm not sure whether they all helped (I'm quite sure a large number of them had no immediately noticeable effect), but I decided to leave most of them in place because the things I'd read about them along the way made sense to me, and increasing the values seemed logical in many cases even where there was no noticeable performance improvement.
You can read my entire writeup on my blog, where I go through the whole journey in detail if you want: https://binaryimpulse.com/2022/11/opnsense-performance-tuning-for-multi-gigabit-internet/
In a nutshell my solution came down to leaving all of the hardware offloading disabled and configuring a bunch of sysctl values compiled from like 5 different sources which eventually led to my desired performance. I made some minor changes to the Proxmox VM too like enabling multiqueue on the network adapter, but I'm skeptical whether any of those changes really mattered.
The sysctl values that worked for me (and I think sysctl tuning overall did the most to solve the problem - along with disabling hardware offloading) were the following:
hw.ibrs_disable=1
net.isr.maxthreads=-1
net.isr.bindthreads = 1
net.isr.dispatch = deferred
net.inet.rss.enabled = 1
net.inet.rss.bits = 6
kern.ipc.maxsockbuf = 614400000
net.inet.tcp.recvbuf_max=4194304
net.inet.tcp.recvspace=65536
net.inet.tcp.sendbuf_inc=65536
net.inet.tcp.sendbuf_max=4194304
net.inet.tcp.sendspace=65536
net.inet.tcp.soreceive_stream = 1
net.pf.source_nodes_hashsize = 1048576
net.inet.tcp.mssdflt=1240
net.inet.tcp.abc_l_var=52
net.inet.tcp.minmss = 536
kern.random.fortuna.minpoolsize=128
net.isr.defaultqlimit=2048
If you want my sources and reasoning for the changes and how I arrived at them, I went into a lot of detail in my blog article (https://binaryimpulse.com/2022/11/opnsense-performance-tuning-for-multi-gigabit-internet/).
Just wanted to add my 2 cents to this very useful thread, which did start me off in the right direction toward solving the issue for my setup. Hopefully these details are helpful to someone else.
Cheers,
Kirk
@Kirk: How do I set these tweaks? I can't find these options in the Web GUI, except for the first one mentioned.
Quote from: Porfavor on November 22, 2022, 12:26:38 AM
@Kirk: How do I set these tweaks? I can't find these options in the Web GUI, except for the first one mentioned.
@Porfavor these settings are in System > Settings > Tunables. Some of the tunables will not be listed on that page. You can click the + icon to add the tunable you want to tweak.
For example once you hit + you would put a tunable like "net.inet.rss.enabled" in the tunable box, leave the description blank (it will autofill it with a description it has already from somewhere), and then copy the value, like 1, into the value box.
Keep in mind some of these tunables will not be applied until the system is rebooted.
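If you want to experiment before committing everything via the GUI, the runtime-settable ones can be tried from a shell first. Note that these changes are lost on reboot, and boot-only tunables such as net.isr.maxthreads and the net.inet.rss.* entries still have to go through the Tunables page plus a reboot; a few examples using the values from the list above:
sysctl hw.ibrs_disable=1
sysctl net.isr.dispatch=deferred
sysctl net.inet.tcp.sendbuf_max=4194304 net.inet.tcp.recvbuf_max=4194304
sysctl net.inet.tcp.mssdflt=1240 net.inet.tcp.abc_l_var=52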
I tried these settings to no avail. :(
I'm experiencing the exact same issue. I read that the blame was also being put on the CPU, so in order to test this I took my same appliance:
J4125
8GB RAM
Intel 225 (4 ports)
and installed Untangle on it. Once I loaded everything to match my OPNsense config, my speeds were normal. In fact the impact of IPS was minimal. I went from 1.1 Gbps (no IPS; 800-900 Mbps with IPS) to 1.4 Gbps, which is the speed I pay for from my provider.
This can't be a CPU issue. I'm going to try the tweaks above in OPNsense now and see if it makes any changes.
OPNsense DEC 840, which is supposed to be able to handle passing ~15 Gbit/s of traffic.
Speedtest from the firewall:
# speedtest --server-id=47746
Speedtest by Ookla
Server: AT&T - Miami, FL (id: 47746)
ISP: AT&T Internet
Idle Latency: 3.53 ms (jitter: 0.50ms, low: 3.06ms, high: 4.12ms)
Download: 2327.36 Mbps (data used: 2.6 GB)
5.18 ms (jitter: 1.65ms, low: 2.79ms, high: 26.40ms)
Upload: 378.54 Mbps (data used: 685.6 MB)
3.01 ms (jitter: 1.79ms, low: 2.03ms, high: 55.43ms)
Packet Loss: 0.0%
Result URL: https://www.speedtest.net/result/c/bbd0ee99-ad99-4e32-b3c9-ad05daf8bd84
Speedtest through the firewall (notice slow upload)
# speedtest --server-id=47746
Speedtest by Ookla
Server: AT&T - Miami, FL (id: 47746)
ISP: AT&T Internet
Idle Latency: 4.17 ms (jitter: 0.94ms, low: 3.06ms, high: 6.49ms)
Download: 2295.81 Mbps (data used: 1.5 GB)
5.08 ms (jitter: 2.15ms, low: 2.79ms, high: 53.90ms)
Upload: 329.78 Mbps (data used: 362.9 MB)
4.05 ms (jitter: 1.37ms, low: 3.12ms, high: 16.97ms)
Packet Loss: 0.0%
Result URL: https://www.speedtest.net/result/c/2f29bb86-def6-4379-ad30-7292ad3e1926
iperf3 from the same machine *to* the Opnsense firewall, normal and reverse
root@dev:/ # iperf3 -c gw
Connecting to host gw, port 5201
[ 5] local 10.27.3.230 port 31205 connected to 10.27.3.254 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 272 MBytes 2.28 Gbits/sec 413 472 KBytes
[ 5] 1.00-2.00 sec 287 MBytes 2.41 Gbits/sec 2 614 KBytes
[ 5] 2.00-3.00 sec 255 MBytes 2.14 Gbits/sec 61 593 KBytes
[ 5] 3.00-4.00 sec 280 MBytes 2.35 Gbits/sec 23 17.0 KBytes
[ 5] 4.00-5.00 sec 261 MBytes 2.19 Gbits/sec 82 257 KBytes
[ 5] 5.00-6.00 sec 257 MBytes 2.15 Gbits/sec 14 133 KBytes
[ 5] 6.00-7.00 sec 254 MBytes 2.13 Gbits/sec 20 737 KBytes
[ 5] 7.00-8.00 sec 260 MBytes 2.18 Gbits/sec 70 512 KBytes
[ 5] 8.00-9.00 sec 268 MBytes 2.25 Gbits/sec 140 737 KBytes
[ 5] 9.00-10.00 sec 266 MBytes 2.23 Gbits/sec 116 714 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 2.60 GBytes 2.23 Gbits/sec 941 sender
[ 5] 0.00-10.00 sec 2.60 GBytes 2.23 Gbits/sec receiver
iperf Done.
root@dev:/ # iperf3 -R -c gw
Connecting to host gw, port 5201
Reverse mode, remote host gw is sending
[ 5] local 10.27.3.230 port 12997 connected to 10.27.3.254 port 5201
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 254 MBytes 2.13 Gbits/sec
[ 5] 1.00-2.02 sec 262 MBytes 2.16 Gbits/sec
[ 5] 2.02-3.00 sec 257 MBytes 2.19 Gbits/sec
[ 5] 3.00-4.00 sec 250 MBytes 2.10 Gbits/sec
[ 5] 4.00-5.00 sec 234 MBytes 1.97 Gbits/sec
[ 5] 5.00-6.00 sec 244 MBytes 2.05 Gbits/sec
[ 5] 6.00-7.00 sec 251 MBytes 2.11 Gbits/sec
[ 5] 7.00-8.00 sec 229 MBytes 1.92 Gbits/sec
[ 5] 8.00-9.00 sec 248 MBytes 2.08 Gbits/sec
[ 5] 9.00-10.00 sec 238 MBytes 1.99 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 2.41 GBytes 2.07 Gbits/sec 14 sender
[ 5] 0.00-10.00 sec 2.41 GBytes 2.07 Gbits/sec receiver
iperf Done.
I actually expected more than this. With a loopback to my own server through my switch I can do 9 Gbit/s with a single stream. If I do multiple streams to the OPNsense firewall I can hit 4.2 Gbit/s max.
So where is this mysterious bottleneck coming from? I did have ipsec.ko loaded from an old setup, but I had no policies; the module is completely gone now. No amount of tuning or interface settings changes seems to matter.
How do I get this thing to actually push line rate? I've even swapped from 10GBase-T to fiber in case it was something odd with the media, but same results.
edit: I set up another test scenario where I do a speed test over WiFi from my laptop to my server using Librespeed (https://github.com/librespeed/speedtest-go), and when I hit it directly through my AP on the same switch connected to the server I can do 300/300, but when I force my traffic to go through the firewall (same segment, same VLAN) the download speed (my server's upload) can't break 100.
There is something very peculiar going on
CPU during the test? Fragmentation? iperf from A to B through the firewall? Any drops at the switch? A screenshot of the services/dashboard, please.
The majority of the issue was net.isr.dispatch=direct, which should be net.isr.dispatch=deferred so that multiple CPU cores are used. I can hit ~7 Gbit/s with iperf to the firewall and I've been able to get my full 2 Gbit/s through it.
I don't know why this isn't the default value in OPNsense. I understand why it's not in FreeBSD, but a networking appliance should be tuned out of the box for maximum networking performance. I hope to see this and more auto-tuning improvements in the future.
I also would have expected OPNsense to automatically recognize this hardware and apply specific tuning for it. It is one of their flagship products after all.
The inability to get a full 10 Gbit/s iperf to the firewall when the DEC840 spec sheet specifically states "14.4Gbps firewall throughput" and "Firewall Port to Port Throughput: 9Gbps" makes me wonder whether the OPNsense team has ever actually hit those numbers with this hardware or whether they're just advertising a theoretical max.
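To check what a given box is currently using, and to try the change before making it permanent (net.isr.dispatch can be flipped at runtime; add it under System > Settings > Tunables to keep it across reboots):
sysctl net.isr.dispatch              # shows the current mode, e.g. "direct"
sysctl net.isr.dispatch=deferred     # switch at runtime for testing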
First of all thanks for this tip.
I tested it. I am running OPNsense on an APU2D2.
Test setup:
- RPI4B+
- Win10 PC
- InterVlan setup
- Both hosts in separate VLans
net.isr.dispatch set to direct
-P 10
[SUM] 0.00-10.00 sec 788 MBytes 661 Mbits/sec sender
[SUM] 0.00-10.00 sec 785 MBytes 659 Mbits/sec receiver
net.isr.dispatch set to deferred
-P 10
[SUM] 0.00-10.00 sec 1.02 GBytes 878 Mbits/sec sender
[SUM] 0.00-10.00 sec 1.02 GBytes 877 Mbits/sec receiver
net.isr.dispatch set to deferred - running for 300s 10 streams
-P 10
[SUM] 0.00-300.00 sec 31.4 GBytes 899 Mbits/sec sender
[SUM] 0.00-300.00 sec 31.4 GBytes 899 Mbits/sec receiver
So there is definitely something to it. I was not able to get near 1G before; now I can after changing the value to "deferred".