Wireguard performance 100% faster on pfSense than OPNsense

Started by pfop, February 19, 2024, 05:04:59 PM

Hello colleagues

Intro

I'm a long-time pfSense user (over 15 years), now moving to OPNsense once my new fiber connection is ready, as OPNsense offers better NAT performance in my tests.
So far I have used pfSense on ALIX and APU devices from PC Engines, as well as virtually in VMs.

New hardware
For my fiber connection, which will be 10GBit symmetrical, I got a passively cooled Qotom device, which is powered by an 8-core Intel Atom C3758R CPU, 32GB DDR4 2400MHz ECC RAM (2x 16GB) and two NVMe SSDs in a ZFS mirror.
The device provides 4x SFP+ ports (X553) and 5x RJ45 2.5G ports (Intel I225-V).

Issue with Wireguard performance
What is currently bugging me is the Wireguard performance on OPNsense compared to pfSense.
On the C3758R with pfSense 2.7.2 and the 'WireGuard' package version 0.2.1, I get 1300Mbit of Wireguard throughput.
On the same C3758R with OPNsense 24.1.1, I get 630Mbit of Wireguard throughput.
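To make the two percentages used in this thread explicit ("100% faster" in the title, "50% slower" in the question), here is a small sketch of the arithmetic. The function names are made up for illustration; the two throughput figures are the ones measured above.

```python
def percent_faster(fast: float, slow: float) -> float:
    """How much faster `fast` is than `slow`, as a percentage of `slow`."""
    return (fast - slow) / slow * 100.0


def percent_slower(fast: float, slow: float) -> float:
    """How much slower `slow` is than `fast`, as a percentage of `fast`."""
    return (fast - slow) / fast * 100.0


PFSENSE_MBIT = 1300.0   # pfSense 2.7.2, measured above
OPNSENSE_MBIT = 630.0   # OPNsense 24.1.1, measured above

print(f"pfSense is {percent_faster(PFSENSE_MBIT, OPNSENSE_MBIT):.0f}% faster")
print(f"OPNsense is {percent_slower(PFSENSE_MBIT, OPNSENSE_MBIT):.0f}% slower")
```

Both numbers describe the same gap; they differ only in which result is taken as the baseline.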

Setup
The setup for both tests is exactly the same; the same physical box was used for all tests.

ServerA is wired directly to SFP+ port1 (ix1) on OPNsense with a 10G LR SM optic.
ServerB is wired directly to SFP+ port2 (ix2) on OPNsense with a 10G LR SM optic.



ix1 = OPNsense LAN, MTU 1500
ix2 = OPNsense WAN, outbound NAT active, MTU 1500

Testing
Doing iperf3 tests between ServerA and ServerB, I can reach up to 3.5GBit with 1 stream; with more streams, I can saturate the 10Gbit interfaces.

When establishing a Wireguard VPN between FW01 and ServerB and running iperf3 from ServerA to ServerB's WG IP, I reach about 630MBit with 1 stream, and the CPU utilization is at 100%.
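For anyone who wants to repeat this kind of measurement in a scripted way, here is a rough sketch that runs the iperf3 client with JSON output and reads the receiver-side throughput. It assumes iperf3 is installed on the client and a server is already listening on the other end; the helper names and the sample report are made up for illustration, and the sample contains only the fields that are actually read.

```python
import json
import subprocess


def run_iperf3(server: str, streams: int = 1, seconds: int = 10) -> str:
    """Run an iperf3 TCP client test and return its JSON report as text."""
    cmd = ["iperf3", "-c", server, "-P", str(streams), "-t", str(seconds), "-J"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def received_mbit(report_json: str) -> float:
    """Extract the receiver-side throughput in Mbit/s from an iperf3 JSON report."""
    report = json.loads(report_json)
    return report["end"]["sum_received"]["bits_per_second"] / 1e6


# Trimmed, made-up report so the parser can be tried without a network.
SAMPLE = '{"end": {"sum_received": {"bits_per_second": 630000000.0}}}'
print(received_mbit(SAMPLE))
```

On a real setup the call would look like `received_mbit(run_iperf3("<ServerB WG IP>", streams=4))`, with the placeholder replaced by the actual address.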

pfSense Wireguard performance
Doing exactly the same with pfSense on the same physical firewall, I can reach 1300MBit through Wireguard.

Question
Does anyone have an idea why OPNsense is 50% slower with regard to Wireguard throughput? Are there any hidden options that can be modified to get closer to the 1300MBit possible on pfSense?

I look forward to a constructive discussion!

Best regards
Firewall Specs: AMD Ryzen 5700G, 16GB DDR4 3200MHz RAM, Intel E810 Quad Port SFP28 NIC
Internet Specs: Init7 25GBit FTTH

Have you read the documentation section on performance?

A quick google search also returns this, so have you disabled the Spectre and Meltdown mitigations?

Also, some mitigations have been obsoleted by microcode updates, did you apply them?

If that does not help: The Wireguard performance on FreeBSD is not particularly good, so maybe the pfSense folks have come up with something special. Wireguard does not use AES, so AES-NI instructions do not help, either.
Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 440 up, Bufferbloat A+

Hello meyergru

Thank you for taking time to reply!
I have now disabled the Spectre/Meltdown mitigations and also the IPv4 random IDs, applied the settings, and rebooted the firewall.
net.inet.ip.random_id   0
hw.ibrs_disable 1
vm.pmap.pti 0
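To double-check after the reboot that these three tunables really have the intended values, something like the following sketch could be used. It assumes the usual `name: value` lines that FreeBSD's `sysctl` prints; the helper names and the sample output are made up for illustration (the sample deliberately has one value that differs).

```python
import subprocess

# Intended values, taken from the post above.
WANTED = {
    "net.inet.ip.random_id": "0",
    "hw.ibrs_disable": "1",
    "vm.pmap.pti": "0",
}


def parse_sysctl(output: str) -> dict:
    """Parse `name: value` lines into a {name: value} dict."""
    values = {}
    for line in output.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            values[name.strip()] = value.strip()
    return values


def mismatches(current: dict, wanted: dict) -> dict:
    """Return {name: (current, wanted)} for every tunable that differs."""
    return {k: (current.get(k), v) for k, v in wanted.items() if current.get(k) != v}


def read_live() -> dict:
    """Query the running system; only meaningful on the firewall itself."""
    out = subprocess.run(["sysctl", *WANTED], capture_output=True, text=True, check=True).stdout
    return parse_sysctl(out)


# Made-up sample output in which one tunable was not applied.
SAMPLE = "net.inet.ip.random_id: 0\nhw.ibrs_disable: 0\nvm.pmap.pti: 0\n"
print(mismatches(parse_sysctl(SAMPLE), WANTED))
```

On the firewall, `mismatches(read_live(), WANTED)` returning an empty dict would mean all three settings are active.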

Redoing the test shows maybe a slight increase to 650MBit, but nowhere close to the 1300MBit from pfSense.
Microcode Update is installed - yes.

It's correct that Wireguard is not using AES. There are some Intel QuickAssist implementations which can help, but this system's QuickAssist is too old afaik; anyway, it is the same for pfSense.

BR
Firewall Specs: AMD Ryzen 5700G, 16GB DDR4 3200MHz RAM, Intel E810 Quad Port SFP28 NIC
Internet Specs: Init7 25GBit FTTH


Can we cut to the chase and admit both are using the FreeBSD base kernel module now?


Cheers,
Franco

Hello everyone

Since 24.1, OPNsense uses the kernel-based WG implementation; you can also see this in the release notes:
wireguard: installed by default using the bundled FreeBSD 13.2 kernel module
Source: https://forum.opnsense.org/index.php?topic=38427.0

Mine is a fresh 24.1.1 install, not an upgrade where there could be some 'leftovers' from wireguard-go.
The current OPNsense 24.1.1 is running FreeBSD 13.2-RELEASE-p9:

root@OPNsense:~ # uname -r
13.2-RELEASE-p9

pfSense using WG kernel module:
[2.7.2-RELEASE][root@pfSense]/root: kldstat | grep wg
9    1 0xffffffff83e4f000    2e560 if_wg.ko

OPNsense using WG kernel module:
root@OPNsense:~ # kldstat | grep wg
22    1 0xffffffff82df2000    2f560 if_wg.ko

BR
Firewall Specs: AMD Ryzen 5700G, 16GB DDR4 3200MHz RAM, Intel E810 Quad Port SFP28 NIC
Internet Specs: Init7 25GBit FTTH

There can be no leftovers. Wireguard will even ignore the Go implementation if it has a kernel module readily available. But all that is moot because we only have code for kernel setup in 24.1 anyway. ;)

It's probably a rule for scrubbing or something else being configured suboptimally vs. pfSense.

We don't really hear anyone saying "it's much faster on OPNsense" which likely means it's the same speed on both in the average case. And there is no reason it shouldn't.


Cheers,
Franco

February 20, 2024, 10:03:40 PM #7 Last Edit: February 21, 2024, 11:09:13 PM by 36thchamber
I was trying all possible settings that were published, incl. RSS and IBRS, for months, and all had zero impact. IDS/IPS (and RSS) would slow it down, but I'm not using it.
Then I installed 24.1.x and WG throughput doubled. On all devices, all counter OS, iperf or web speedtests, different servers, different VPNs, different interfaces, with the ISP base speed monitored nonstop: everything went up to 2Gbit while CPU usage halved. Upload is generally slower, so it was at full speed before, but the small gap is gone; it's now a pure 100.0% of ISP speed.
So I reread the release notes for 24. I thought the WG package removal was about the GUI only, and that the 13.2 kernel WG was there before. :o

You can play with Normalization.

http://x.x.x.x/firewall_scrub.php

Add rule -
On Interface Wireguard Group
max MSS. 1300

This helps me to get max Performance with Wireguard. I do it on both sides.


Or in the instance, tick advanced mode and set the MTU to the same value on all devices.

This is getting interesting...

I am unable to compare against pfSense, nor do I want to start a war over which is faster. I take pfop's finding only as an indication of poor Wireguard performance compared to what could be expected.

From several discussions about wireguard speed in the past, I got the impression that the implementation suffers.

This may stem from the time when it was implemented in Go; however, I have an Intel Pentium Silver N6005 and now with the kernel implementation I get ~500 MBit/s Wireguard speed. It tops out there with 100% CPU load, regardless of whether I:

- use more threads (-P4 or -P8 are equal, only -P1 is a little slower).

- disable scrubbing (either on the Wireguard interface or on all of them; it makes no difference at all).

- set MSS 1300 or MTU on all WG interfaces.

I have disabled Spectre/Meltdown mitigations and traffic shaping.

My CPU should be more or less comparable to the C3758R, even somewhat faster in single-threaded applications.
Because more streams in iperf do not influence the result (much), one can infer that the kernel implementation either always uses all available threads for cryptography or is inherently single-threaded. But when I look at "top" with threads and system processes enabled, I can see the kernel at ~300% (the rest is interrupts and user processes), so all 4 threads seem to get utilized.

Paraphrasing what was said in one of the older threads: Wireguard with ChaCha20-Poly1305 was supposed to be much faster than IPsec and/or OpenVPN with AES, especially on slow CPUs. Considering that, I am disappointed by the results. For starters: I can download a file from an HTTPS site with curl on my OPNsense box at 1 GBit/s. Depending on what the website offers, this is also AES or ChaCha20-Poly1305, but much faster than Wireguard on the same system...

Thus, it would be really interesting to find the bottleneck. I am at a loss on where to look and alas, I lack the time to check if pfSense is really faster or investigate what the difference is.
Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 440 up, Bufferbloat A+

Quote from: franco on February 20, 2024, 09:57:22 PM
It's probably a rule for scrubbing or something else being configured suboptimally vs. pfSense.
So I disabled scrubbing on all interfaces, applied the config, rebooted OPNsense and retested. Still the same result.

Quote from: lewald on February 21, 2024, 10:45:33 AM
You can play with Normalization.

http://x.x.x.x/firewall_scrub.php

Add rule -
On Interface Wireguard Group
max MSS. 1300
Added the rule, applied it, rebooted opnSense and retested. Still same result.

Quote from: mimugmail on February 21, 2024, 12:37:33 PM
Or in the instance, tick advanced mode and set the MTU to the same value on all devices.
As I don't really know where to look for the issue, I also tried this change (which in general doesn't make sense, as smaller packets load the CPU more than bigger ones, and I'm testing on a network that uses MTU 1500 only).
Before, the MTU of the WG interfaces on both sides was the default, 1420:
opnSense: wg1: flags=80c1<UP,RUNNING,NOARP,MULTICAST> metric 0 mtu 1420
ServerB: 5: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1000

After setting MTU 1300 on both sides:
opnSense: wg1: flags=80c1<UP,RUNNING,NOARP,MULTICAST> metric 0 mtu 1300
ServerB: 6: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1300 qdisc noqueue state UNKNOWN group default qlen 1000

And as expected, the speed went down by about 10MBit because of more overhead with smaller packets.
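To put rough numbers on "more overhead with smaller packets", here is a back-of-the-envelope sketch. It assumes IPv4 both inside and outside the tunnel, TCP without options, full-size packets only, and it ignores Ethernet framing; the WireGuard data message is counted as a 16-byte header plus a 16-byte Poly1305 tag. The function names are made up for illustration.

```python
OUTER_IPV4 = 20      # outer IPv4 header
OUTER_UDP = 8        # outer UDP header
WG_DATA = 32         # WireGuard data message: 16-byte header + 16-byte auth tag
INNER_IPV4_TCP = 40  # inner IPv4 (20) + TCP (20), no TCP options (assumption)


def payload_fraction(wg_mtu: int) -> float:
    """Share of each outer IP packet that is actual TCP payload."""
    payload = wg_mtu - INNER_IPV4_TCP
    on_wire = wg_mtu + OUTER_IPV4 + OUTER_UDP + WG_DATA
    return payload / on_wire


def packets_per_second(goodput_mbit: float, wg_mtu: int) -> float:
    """Full-size packets per second needed to carry the given TCP goodput."""
    return goodput_mbit * 1e6 / 8 / (wg_mtu - INNER_IPV4_TCP)


for mtu in (1420, 1300):
    print(mtu, f"{payload_fraction(mtu):.3f}", f"{packets_per_second(630, mtu):.0f}")
```

Under these assumptions the payload share only drops from about 93.2% to 92.6%, while the same goodput needs roughly 9.5% more packets at MTU 1300, so per-packet work rather than header bytes is the larger part of the extra cost.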


Quote from: meyergru on February 21, 2024, 01:07:15 PM
- use more threads (-P4 or -P8 are equal, only -P1 is a little slower).
I agree on this: on both OPNsense and pfSense, one stream already gets close to the maximum throughput; with more streams you gain only very little additional speed.

Quote from: meyergru on February 21, 2024, 01:07:15 PM
My CPU should be more or less comparable to the C3758R, even somewhat faster in single-threaded applications.
....
But when I look at "top" with threads and system processes enabled, I can see the kernel at ~300% (the rest is interrupts and user processes), so all 4 threads seem to get utilized.
I see 750-780% on the kernel process:
    0 root        142 -16    -     0B  2272K swapin   3  16:30 784.82% kernel
Most likely my system has somewhat better hardware processing, as my interrupts stay very close to 0%; that is most likely why you see a bit less performance than on my setup, even though your CPU is some 5% faster than mine.
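Since the two "top" figures come from machines with different core counts, it helps to normalize them. A minimal sketch, assuming that 100% in top corresponds to one fully busy logical CPU, and that the N6005 has 4 and the C3758R has 8 logical CPUs; the function name is made up for illustration.

```python
def share_of_machine(top_percent: float, logical_cpus: int) -> float:
    """Convert a top CPU figure (100% = one logical CPU) to a share of the whole machine."""
    return top_percent / logical_cpus


print(f"N6005:  {share_of_machine(300.0, 4):.1f}% of the machine in the kernel")
print(f"C3758R: {share_of_machine(784.82, 8):.1f}% of the machine in the kernel")
```

So on both boxes the kernel is keeping most or nearly all of the cores busy during the test.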
Firewall Specs: AMD Ryzen 5700G, 16GB DDR4 3200MHz RAM, Intel E810 Quad Port SFP28 NIC
Internet Specs: Init7 25GBit FTTH

It has been some weeks; I got some new hardware and also did some additional tests that might be interesting to the community.

The logical setup stays the same, but now I used a Ryzen 5700G with an Intel E810 Quad Port SFP28 network card.

OPNsense WG performance
OPNsense 24.1.1 Wireguard performance: 1800MBit
--> So the Ryzen 5700G is about 3x faster than the C3758R, although the CPU itself is 4.8x faster

pfSense+ WG performance
pfSense+ 23.09.1 Wireguard performance: 6000MBit
--> Out of interest, I did some tests with pfSense+, which uses hardware acceleration for ChaCha20-Poly1305, and it shows an impressive 6000MBit throughput while the CPU is still 75% idle

Is the difference in post 1 because different FreeBSD versions are used?
I doubt that. I did some tests with Wireguard on FreeBSD VMs with identical configuration, and the results are really close to each other. For the FreeBSD tests I used an old i7-7700K; each FreeBSD VM got 1 vCPU assigned.

FreeBSD 13.2: 990MBit
FreeBSD 13.3: 1020MBit
FreeBSD 14.0: 980MBit

Conclusion
One core of the i7-7700K has about 10% of the processing power of a Ryzen 5700G, yet it achieves about 50% of the throughput of the Ryzen 5700G with OPNsense.
So for me it is clearly an issue related to OPNsense, and not FreeBSD / Kernel version in general.
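The comparison behind this conclusion can be written down as throughput per unit of compute. A small sketch using the measured figures from this post; the 10% relative compute figure is the rough estimate stated above, not a measurement, and the names are made up for illustration.

```python
def throughput_per_compute(mbit: float, relative_compute: float) -> float:
    """Throughput divided by compute, with the whole Ryzen 5700G counted as 1.0."""
    return mbit / relative_compute


opnsense_5700g = throughput_per_compute(1800.0, 1.0)   # OPNsense 24.1.1, whole CPU
freebsd_1vcpu = throughput_per_compute(990.0, 0.10)    # FreeBSD 13.2 VM, one i7-7700K core

print(f"raw throughput ratio: {990.0 / 1800.0:.0%}")
print(f"per-compute advantage of plain FreeBSD: {freebsd_1vcpu / opnsense_5700g:.1f}x")
```

With these inputs, plain FreeBSD gets about 5.5 times more Wireguard throughput out of the same amount of compute, which is the gap the conclusion refers to.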
Firewall Specs: AMD Ryzen 5700G, 16GB DDR4 3200MHz RAM, Intel E810 Quad Port SFP28 NIC
Internet Specs: Init7 25GBit FTTH

Tested it myself on OPNsense KVM VMs and I'm stuck at around 850 Mbit/s on a Ryzen 9 3900X.

So, this feature does all the magic in pfSense Plus? Intel CPU and IPsec Multi-Buffer (IPsec-MB, IIMB) Cryptographic Acceleration for ChaCha20-Poly1305?

https://docs.netgate.com/pfsense/en/latest/hardware/cryptographic-accelerators.html
Hardware:
DEC740