Wireguard performance 100% faster on pfSense than OPNsense

Started by pfop, February 19, 2024, 05:04:59 PM

Quote from: Monviech on March 10, 2024, 07:22:35 PM
So this feature does all the magic in pfSense Plus? Intel CPU and IPsec Multi-Buffer (IPsec-MB, IIMB) cryptographic acceleration for ChaCha20-Poly1305?

https://docs.netgate.com/pfsense/en/latest/hardware/cryptographic-accelerators.html

To be honest, I tried to find this out, but I'm not a 'low level' expert unfortunately.
What I can say is, installing intel-ipsec-mb-1.5_1 and loading cryptodev.ko didn't make a difference on OPNsense.

Here are the loaded modules on pfSense+ and OPNsense.
The iimb.ko on pfSense+ looks like the one we're missing...

pfSense+
[23.09.1-RELEASE][root@pfSense.home.arpa]/root: kldstat
Id Refs Address                Size Name
1   38 0xffffffff80200000  339f830 kernel
2    1 0xffffffff835a0000    abd98 ice_ddp.ko
3    1 0xffffffff8364c000     76f8 cryptodev.ko
4    1 0xffffffff83655000    1e2b0 opensolaris.ko
5    1 0xffffffff83674000   5d7790 zfs.ko
6    1 0xffffffff84710000     2220 cpuctl.ko
7    1 0xffffffff84713000     3210 intpm.ko
8    1 0xffffffff84717000     2178 smbus.ko
9    1 0xffffffff8471a000     9288 aesni.ko
10    1 0xffffffff84800000   666a08 iimb.ko
12    1 0xffffffff84753000     3158 amdtemp.ko
13    1 0xffffffff84757000     2130 amdsmn.ko
14    1 0xffffffff84724000    2e560 if_wg.ko


OPNsense
root@OPNsense:~ # kldstat
Id Refs Address                Size Name
1   86 0xffffffff80200000  216c2e0 kernel
2    1 0xffffffff8236d000     ab48 opensolaris.ko
3    1 0xffffffff82378000     4b58 if_enc.ko
4    3 0xffffffff8237d000    78aa0 pf.ko
5    1 0xffffffff823f6000     a458 cryptodev.ko
6    1 0xffffffff82401000    abc98 ice_ddp.ko
7    1 0xffffffff824ad000     f4c8 pfsync.ko
8    1 0xffffffff824bd000   59dfe0 zfs.ko
9    1 0xffffffff82a5b000     3b18 pflog.ko
10    1 0xffffffff82a5f000     f858 carp.ko
11    1 0xffffffff82a70000     aa70 if_gre.ko
12    1 0xffffffff82a7b000    16148 if_lagg.ko
13    2 0xffffffff82a92000     3538 if_infiniband.ko
14    1 0xffffffff82a96000     e8f8 if_bridge.ko
15    2 0xffffffff82aa5000     8958 bridgestp.ko
16    1 0xffffffff83010000     3378 acpi_wmi.ko
17    1 0xffffffff83014000     3218 intpm.ko
18    1 0xffffffff83018000     2180 smbus.ko
19    1 0xffffffff8301b000     3340 uhid.ko
20    1 0xffffffff8301f000     3380 usbhid.ko
21    1 0xffffffff83023000     31f8 hidbus.ko
22    1 0xffffffff83027000     3320 wmt.ko
23    1 0xffffffff8302b000     72a8 hifn.ko
24    1 0xffffffff83033000     2270 padlock.ko
25    1 0xffffffff83036000    15308 qat.ko
26    1 0xffffffff8304c000     43b0 safe.ko
27    1 0xffffffff83051000     3160 amdtemp.ko
28    1 0xffffffff83055000     2138 amdsmn.ko
29    1 0xffffffff83058000    2f560 if_wg.ko
30    1 0xffffffff83088000     4700 nullfs.ko
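
To compare two listings like these without reading them line by line, the module names can be diffed mechanically. The following is a minimal sketch, assuming each box's output was saved to a file with `kldstat > file`; the function names and file names are made up for illustration.

```shell
#!/bin/sh
# Sketch: list kernel modules present in one saved kldstat listing but
# missing from another. Function and file names are illustrative only.

# Print the Name column (5th field) of every data row, sorted.
# The header row is skipped because its first field is not numeric.
modules() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 5 { print $5 }' "$1" | LC_ALL=C sort -u
}

# Names found in listing $1 but not in listing $2.
# The body runs in a subshell so the temp-file variables stay local.
only_in_first() (
    a=$(mktemp) && b=$(mktemp) || exit 1
    modules "$1" > "$a"
    modules "$2" > "$b"
    LC_ALL=C comm -23 "$a" "$b"
    rm -f "$a" "$b"
)

# Usage (on a machine holding both saved listings):
#   only_in_first pfsense-plus.txt opnsense.txt
```

Applied to the two listings above, this reports aesni.ko, cpuctl.ko and iimb.ko as loaded only on pfSense+. A module missing from kldstat may still be compiled into the kernel, so the result is a starting point rather than proof.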
Firewall Specs: AMD Ryzen 5700G, 16GB DDR4 3200MHz RAM, Intel E810 Quad Port SFP28 NIC
Internet Specs: Init7 25GBit FTTH

Of course the presence of the library and/or the kernel module does not make a difference in itself, unless you actually use those functions.

As I already wrote: The FreeBSD implementation is not the best and, obviously, Netgate has actually done something special (at least for pfSense Plus):

https://redmine.pfsense.org/issues/14291

So this would best be addressed as a feature request for OPNsense, namely to add IPsec-MB support as an additional crypto acceleration technique. As far as I understand it, the acceleration is not strictly limited to Intel CPUs, but works whenever certain CPU features are available.
Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 440 up, Bufferbloat A+

It seems like there was another thread where tunables have been described:

https://forum.opnsense.org/index.php?topic=37808.0
Hardware:
DEC740

Quote from: Monviech on March 11, 2024, 08:24:37 AM
It seems like there was another thread where tunables have been described:

https://forum.opnsense.org/index.php?topic=37808.0

Thank you for your reply, I added those tunables already without any change in WG performance.

iimb.ko is pretty interesting because it's not found anywhere in FreeBSD, old or new.

Can you try to locate it on the plus install?

# find / -name "iimb.ko"

And then try to see if it belongs to a package or if it is part of the non-free plus sources?

# pkg which /path/to/iimb.ko


Cheers,
Franco


Quote from: franco on March 11, 2024, 10:00:56 AM
Can you try to locate it on the plus install?
# find / -name "iimb.ko"


[23.09.1-RELEASE][root@pfSense.home.arpa]/root: find / -name "iimb.ko"
/boot/kernel/iimb.ko

Quote from: franco on March 11, 2024, 10:00:56 AM
And then try to see if it belongs to a package or if it is part of the non-free plus sources?
# pkg which /path/to/iimb.ko


[23.09.1-RELEASE][root@pfSense.home.arpa]/root: pkg which /boot/kernel/iimb.ko
/boot/kernel/iimb.ko was installed by package pfSense-kernel-pfSense-23.09.1


To summarize:
pfSense CE on bare metal C3758R, 1300MBit Wireguard throughput
OPNsense on bare metal C3758R, 630MBit Wireguard throughput (-51%)

pfSense+ on bare metal Ryzen 5700G, 6000MBit Wireguard throughput (with IIMB!) at only ~25% CPU load
OPNsense on bare metal Ryzen 5700G, 1800MBit (-70%)

FreeBSD 13.2 on vSphere VM, 1 Core i7-7700K, 990MBit throughput
FreeBSD 13.3 on vSphere VM, 1 Core i7-7700K, 1020MBit throughput
FreeBSD 14.0 on vSphere VM, 1 Core i7-7700K, 980MBit throughput
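
The percentages in this summary can be reproduced from the raw throughput figures. This is a small sketch using awk for the arithmetic; the helper names `pct_change` and `speedup` are made up here, and the inputs are the Mbit/s values quoted above.

```shell
#!/bin/sh
# Sketch: reproduce the relative differences from the figures in the summary.
# Helper names are illustrative; inputs are the Mbit/s values quoted above.

# pct_change BASE VALUE: change of VALUE relative to BASE, in percent.
pct_change() {
    awk -v b="$1" -v v="$2" 'BEGIN { printf "%.1f\n", (v - b) / b * 100 }'
}

# speedup SLOW FAST: how many times faster FAST is than SLOW.
speedup() {
    awk -v s="$1" -v f="$2" 'BEGIN { printf "%.2f\n", f / s }'
}

pct_change 1300 630    # C3758R: OPNsense relative to pfSense CE
pct_change 6000 1800   # Ryzen 5700G: OPNsense relative to pfSense+
speedup 630 1300       # C3758R: pfSense CE over OPNsense
speedup 1800 6000      # Ryzen 5700G: pfSense+ over OPNsense
```

This prints -51.5, -70.0, 2.06 and 3.33. A 2.06x ratio corresponds to roughly "100% faster", which matches the thread title for the C3758R case.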

Unfortunately I was not able to do Wireguard tests on Ryzen 5700G with pfSense CE, as the network driver supplied there only supports one queue which could lead to measurement errors.

pfSense+ is clearly the leader with hardware acceleration of Wireguard.
But I still can't figure out why OPNsense throughput is so much lower compared to pfSense CE or FreeBSD. There is clearly something unoptimized on OPNsense when using WireGuard.

The last one is an interesting question. As far as I understand, pfSense uses the same crypto as OPNsense. What comes to mind is that some compiler optimization switches, CPU instruction sets or even the compiler itself (gcc vs. clang) could differ.

Actually, what I have found is that in kern.mk, there is a setting for amd64 which effectively disables some instructions:


#
# For AMD64, we explicitly prohibit the use of FPU, SSE and other SIMD
# operations inside the kernel itself.  These operations are exclusively
# reserved for user applications.
#
# gcc:
# Setting -mno-mmx implies -mno-3dnow
# Setting -mno-sse implies -mno-sse2, -mno-sse3, -mno-ssse3 and -mfpmath=387
#
# clang:
# Setting -mno-mmx implies -mno-3dnow and -mno-3dnowa
# Setting -mno-sse implies -mno-sse2, -mno-sse3, -mno-ssse3, -mno-sse41 and -mno-sse42
# (-mfpmath= is not supported)
#
.if ${MACHINE_CPUARCH} == "amd64"
CFLAGS.clang+= -mno-aes -mno-avx
CFLAGS+= -mcmodel=kernel -mno-red-zone -mno-mmx -mno-sse -msoft-float \
	-fno-asynchronous-unwind-tables
INLINE_LIMIT?= 8000
.endif


The reason given here is clear: The kernel is meant to run on any amd64-capable platform, regardless of specific features. This partly explains why the chacha20-poly1305 code is rather slow: not only is it not optimized for a specific CPU platform, but, being part of the kernel, it is also compiled for maximum compatibility (and, I can only guess, probably without '-O2').

Mind you: I do not know if pfSense CE and FreeBSD really compile this differently and I have no means to check. But this 100% improvement could be fairly easy to unlock.


As for the much faster iimb.ko module: What I have found out so far is that there is a kernel cryptography API that lets a crypto driver implement specific functions, as is done for the Intel QAT engine(s). iimb.ko seems to be such a driver, implementing the kernel functions for chacha20-poly1305 and others.

If I got it right, even the FreeBSD WireGuard implementation does not use the native WireGuard routines, but the kernel crypto functions instead. Thus, what is needed is a crypto driver using the Intel IIMB library. The API is rather arcane, though.

iimb.ko is a kernel-wrapped version of https://github.com/intel/intel-ipsec-mb as far as we can tell, which also speeds up WireGuard despite having 'IPsec' in the name.

The measurements here, however, are all over the place and even suggest modification of CE to an unknown degree.

To be frank, at this point the only conclusion is: please compare only FreeBSD and OPNsense.

And yet the measurements for FreeBSD and OPNsense given here are all over the place as well, which suggests a low-effort, out-of-the-box comparison without any accounting for sysctls and differing kernel versions. You could also load a FreeBSD kernel on OPNsense and vice versa. That should give you more consistent test results to compare.

And if you want to use proprietary software, please go ahead, but let's stop this advertisement now. I find it interesting that you go through all of the trouble to touch up product names with markup here. :)


Cheers,
Franco

@franco: Would you mind telling me how to install a FreeBSD kernel beneath OPNsense? Is it possible to do that after the fact (i.e. install a FreeBSD kernel package)? I know that one could install OPNsense on top of FreeBSD, but it would be easier if one could just replace the kernel.

On a side note: It looks to me as if either Netgate actually has done something with the CE version as well or it is simply because of differences between FreeBSD 14 and 13.2.

As a matter of fact, I have somewhat verified the "100% faster" claim: In my tests between two otherwise identical OPNsense and pfSense VM instances, both reached speeds of ~1.2 GBit/s in either direction (slow because of virtio networking). While doing that, the OPNsense VM had ~80% load, whereas the pfSense VM only had 40%.

Therefore, I would like to check with a pure FreeBSD 13.2 (and 14) replacement kernel for OPNsense.
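
The VM comparison above equates throughput at different CPU loads, so the underlying step is throughput per unit of load. A rough sketch of that arithmetic follows; the helper name is made up, and 1200 Mbit/s stands in for the "~1.2 GBit/s" quoted above.

```shell
#!/bin/sh
# Sketch: throughput delivered per percent of CPU load.
# The helper name is illustrative; 1200 Mbit/s approximates "~1.2 GBit/s".

# per_load_pct MBIT LOAD_PCT: Mbit/s handled per 1% of CPU load.
per_load_pct() {
    awk -v t="$1" -v l="$2" 'BEGIN { printf "%.1f\n", t / l }'
}

per_load_pct 1200 80   # OPNsense VM at ~80% load
per_load_pct 1200 40   # pfSense VM at ~40% load
```

This prints 15.0 and 30.0, a factor of two. Reading that as "100% faster" assumes throughput scales linearly with CPU load, which a virtio-limited test cannot confirm.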

Before doing that, I'd rather suggest the easier route: hide iimb.ko from the pfSense system so it cannot be loaded on boot (kldunload may work as well to some degree) in order to see the performance drop.

You could even take that file and move it to a compatible FreeBSD release, kldload it and see what difference it makes. The assumption is that this is plug-and-play crypto, but be aware that this load/unload could crash the kernel in mid-use.
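
The hide-and-unload test could be sketched as below. This is a cautious outline, not a verified procedure: it assumes the module lives at /boot/kernel/iimb.ko as reported earlier in the thread, that renaming the file is enough to keep it from loading at boot, and the function names and the `.disabled` suffix are made up. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: take iimb.ko out of play for an A/B throughput test.
# Assumptions: path as reported by find earlier in the thread; renaming
# the file prevents loading at boot. Function names are illustrative.

DRY_RUN=${DRY_RUN:-1}        # 1 = only print commands, 0 = execute them
MODULE=iimb
KO=/boot/kernel/iimb.ko

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

disable_module() {
    # Unloading may fail (or crash the kernel) while the module is in use;
    # the rename still takes effect at the next reboot.
    run kldunload "$MODULE"
    run mv "$KO" "$KO.disabled"
}

restore_module() {
    run mv "$KO.disabled" "$KO"
    run kldload "$MODULE"
}

# Usage: review the printed commands first, then rerun with DRY_RUN=0:
#   disable_module    # run the WireGuard test, then:
#   restore_module
```

Rebooting after the rename is the safer variant, since it avoids unloading a crypto driver that live tunnels may be using.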

Let me try to compile the steps to load a FreeBSD kernel for OPNsense and get back in a bit.


Cheers,
Franco

I only have pfSense CE and there is neither an iimb.ko module loaded nor even present:

[2.7.2-RELEASE][root@pfSense.mgsoft]/boot/kernel: kldstat
Id Refs Address                Size Name
1   32 0xffffffff80200000  339ce08 kernel
2    1 0xffffffff8359d000    1e2b0 opensolaris.ko
3    1 0xffffffff835bc000     76f8 cryptodev.ko
4    1 0xffffffff835c4000   5d7790 zfs.ko
5    1 0xffffffff84418000     2220 cpuctl.ko
6    1 0xffffffff8441b000     3210 intpm.ko
7    1 0xffffffff8441f000     2178 smbus.ko
9    1 0xffffffff84451000     9288 aesni.ko
10    1 0xffffffff8445b000     3158 amdtemp.ko
11    1 0xffffffff8445f000     2130 amdsmn.ko
12    1 0xffffffff84422000    2e560 if_wg.ko


I do not really know if it might be compiled statically, but if the speed results of @pfop are correct (i.e. OpnSense 24.1.3_1 = 100%, pfSense CE 2.7.2 = 200% and pfSense+ = 400%), it suggests Netgate really limits use of the integration to pfSense+.


Quote from: meyergru on March 18, 2024, 04:47:01 PM
I do not really know if it might be compiled statically, but if the speed results of @pfop are correct (i.e. OpnSense 24.1.3_1 = 100%, pfSense CE 2.7.2 = 200% and pfSense+ = 400%), it suggests Netgate really limits use of the integration to pfSense+.

The CPU crypto offloading/acceleration with iimb.ko is only available in pfSense+, not in pfSense CE.

I know. I could see that the module is missing, but it is still theoretically possible that pfSense CE has some sort of "lightweight" module or static kernel part that delivers the 200% speed compared to OPNsense.

I was rather referring to the implications I already suggested indirectly:

The "long way to go" would be to do the same as Netgate has done with iimb.ko in pfSense+ and integrate the (poorly - if at all - documented) FreeBSD kernel crypto API with the corresponding library functions to achieve the full improvement by a factor of 4.

I understand that this is a lot of work and, with the advent of 14.1, it would probably have to be done twice if those APIs change. So I would not expect it anytime soon, at least not before the integration of 14.1.

However, since we can take it that pfSense CE does not use that approach (BTW: I have direct confirmation of that) and is still faster, maybe there is a "quick win" for the upcoming FreeBSD 13.3-based version 24.7 that does not require as much effort but still doubles WireGuard performance.

My own setup clearly indicates that with typical current OPNsense hardware (N5105, N100 and the like), doubling WG performance would break the magical 1 GBit/s barrier, which would be a decent improvement over the current situation.


I did not follow the whole thread, but did you compare pfSense and OPNsense on KVM without WireGuard?