Slow WireGuard Performance

Started by blitzer909, November 29, 2023, 08:45:23 AM

Hi All,
I've been searching through the threads about slow WireGuard performance on OPNsense. I'm hoping someone can provide some clarity as to what is causing my WireGuard throughput to max out at about 383 Mbits/sec.

Here is my layout:
I'm testing between 2 locations that each have a 1 Gbit/s fibre-optic PPPoE connection.
When I run iperf3 between both locations using the WAN IP, I get near line speed. However, when I test using the internal IP of a machine behind the OPNsense router, I get a max of about 383 Mbits/sec, even with parallel connections.

I also tested OPNsense both as a VM in Proxmox and installed directly on the hardware without a hypervisor (identical hardware), and the speeds did not change all that much.

This is the summary output of iperf3 using the WAN IP:

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  81.5 MBytes   684 Mbits/sec   91    394 KBytes
[  5]   1.00-2.00   sec  94.2 MBytes   790 Mbits/sec   28    429 KBytes
[  5]   2.00-3.00   sec  92.1 MBytes   772 Mbits/sec    2    465 KBytes
[  5]   3.00-4.00   sec  90.0 MBytes   755 Mbits/sec   47    346 KBytes
[  5]   4.00-5.00   sec  91.1 MBytes   764 Mbits/sec   19    378 KBytes
[  5]   5.00-6.00   sec  92.5 MBytes   776 Mbits/sec   20    405 KBytes
[  5]   6.00-7.00   sec  92.1 MBytes   773 Mbits/sec   21    433 KBytes
[  5]   7.00-8.00   sec  91.8 MBytes   770 Mbits/sec    4    465 KBytes
[  5]   8.00-9.00   sec  89.8 MBytes   753 Mbits/sec   37    360 KBytes
[  5]   9.00-10.00  sec  89.5 MBytes   751 Mbits/sec    9    402 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   905 MBytes   759 Mbits/sec  278             sender
[  5]   0.00-10.04  sec   903 MBytes   754 Mbits/sec                  receiver


These are the speeds using the LAN IP

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.01   sec  46.1 MBytes   385 Mbits/sec  107    529 KBytes
[  5]   1.01-2.00   sec  45.0 MBytes   379 Mbits/sec    0    574 KBytes
[  5]   2.00-3.00   sec  47.5 MBytes   398 Mbits/sec    0    620 KBytes
[  5]   3.00-4.00   sec  50.0 MBytes   419 Mbits/sec    0    663 KBytes
[  5]   4.00-5.00   sec  76.2 MBytes   640 Mbits/sec   22    546 KBytes
[  5]   5.00-6.00   sec  45.0 MBytes   377 Mbits/sec    0    597 KBytes
[  5]   6.00-7.00   sec  41.2 MBytes   345 Mbits/sec    0    640 KBytes
[  5]   7.00-8.00   sec  62.5 MBytes   526 Mbits/sec    0    693 KBytes
[  5]   8.00-9.00   sec  72.5 MBytes   608 Mbits/sec   18    581 KBytes
[  5]   9.00-10.00  sec  77.5 MBytes   650 Mbits/sec    0    671 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   564 MBytes   473 Mbits/sec  147             sender
[  5]   0.00-10.05  sec   561 MBytes   468 Mbits/sec                  receiver
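
To put a number on the gap between the two runs, the receiver summary lines can be parsed and compared. This is only a minimal Python sketch, assuming the plain-text iperf3 summary format pasted above with rates reported in Mbits/sec; the helper name `receiver_mbits` is made up for illustration.

```python
import re

# Matches the bitrate on an iperf3 summary line that ends in "receiver".
# Only handles rates printed in Mbits/sec, as in the output above.
_RECEIVER = re.compile(r"([\d.]+)\s+Mbits/sec\s+receiver\s*$")

def receiver_mbits(output: str) -> float:
    """Return the receiver-side bitrate (Mbit/s) from iperf3 text output."""
    for line in output.splitlines():
        match = _RECEIVER.search(line)
        if match:
            return float(match.group(1))
    raise ValueError("no receiver summary line found")

# Summary lines copied from the two runs above.
wan = receiver_mbits("[  5]   0.00-10.04  sec   903 MBytes   754 Mbits/sec                  receiver")
tunnel = receiver_mbits("[  5]   0.00-10.05  sec   561 MBytes   468 Mbits/sec                  receiver")
print(f"LAN IP run reaches {tunnel / wan:.0%} of the WAN IP run")
```

For these two runs that works out to roughly 62% of the WAN IP rate.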


the hardware specs I have are:
Intel(R) Celeron(R) N5105 @ 2.00GHz
8GB of Memory
128 SSD

I am using the kernel package for WireGuard. Any help would be appreciated.

November 29, 2023, 12:58:24 PM #1 Last Edit: November 29, 2023, 01:00:20 PM by meyergru
I get similar results with the same CPU, even somewhat lower, but I think that is because I use CrowdSec and Netflow. The CPU maxes out at 100%, whereas the counterpart, an AMD V1500B, is only at 40%.

Wireguard uses all CPU threads, and the N5105 has no hyperthreading, so only 4 threads. AFAIK, when available, AVX features are being leveraged for ChaCha20. The N5105 does not have these extensions, as you can see with 'x86info -a':

Feature flags:
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflsh ds acpi mmx fxsr sse sse2 ss ht tm pbe sse3 pclmuldq dtes64 monitor ds-cpl vmx est tm2 ssse3 sdbg cx16 xTPR pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc-deadline aes xsave osxsave rdrnd


With the V1500B this reads:

Feature flags:
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflsh mmx fxsr sse sse2 ht sse3 pclmulqdq mwait ssse3 fma cmpxchg16b sse4_1 sse4_2 [1:ecx:22] popcnt aes xsave osxsave avx f16c [1:ecx:30]
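
Comparing such flag lists by eye is error-prone, so here is a small Python sketch that picks the AVX-family entries out of a space-separated flag line. The flag names `avx2` and `avx512f` follow Linux `/proc/cpuinfo` naming; whether x86info prints them under the same names is an assumption, and the function name `avx_flags` is invented for this example.

```python
# AVX-family flag names (Linux /proc/cpuinfo spelling; assumed for x86info).
AVX_FAMILY = ("avx", "avx2", "avx512f")

def avx_flags(flag_line: str) -> list[str]:
    """Return the AVX-family flags present in a space-separated flag list."""
    present = set(flag_line.lower().split())
    return [flag for flag in AVX_FAMILY if flag in present]

# Tail ends of the two flag lists quoted above.
n5105 = "ssse3 sdbg cx16 xTPR pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc-deadline aes xsave osxsave rdrnd"
v1500b = "ssse3 fma cmpxchg16b sse4_1 sse4_2 [1:ecx:22] popcnt aes xsave osxsave avx f16c [1:ecx:30]"

print("N5105: ", avx_flags(n5105))
print("V1500B:", avx_flags(v1500b))
```

Splitting on whitespace and matching whole tokens avoids false hits such as finding `avx` inside `avx2`.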


If you want faster cryptography, you need something like an N100 or better.
Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 800 up, Bufferbloat A+

okay so I ran another test.

This time I set up a port forward to a machine behind OPNsense, installed WireGuard on that machine, and ran an iperf3 test to it.

When I run the iperf3 test against the WAN IP, I get near line speed:

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  85.3 MBytes   715 Mbits/sec   26    506 KBytes
[  5]   1.00-2.00   sec  91.7 MBytes   770 Mbits/sec   42    359 KBytes
[  5]   2.00-3.00   sec  91.6 MBytes   768 Mbits/sec   20    366 KBytes
[  5]   3.00-4.00   sec  91.8 MBytes   770 Mbits/sec   21    373 KBytes
[  5]   4.00-5.00   sec  92.0 MBytes   772 Mbits/sec    6    392 KBytes
[  5]   5.00-6.00   sec  95.1 MBytes   798 Mbits/sec    4    411 KBytes
[  5]   6.00-7.00   sec  93.3 MBytes   782 Mbits/sec   21    420 KBytes
[  5]   7.00-8.00   sec  94.1 MBytes   790 Mbits/sec   21    430 KBytes
[  5]   8.00-9.00   sec  93.8 MBytes   787 Mbits/sec   22    443 KBytes
[  5]   9.00-10.00  sec  91.4 MBytes   767 Mbits/sec   43    449 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   920 MBytes   772 Mbits/sec  226             sender
[  5]   0.00-10.04  sec   918 MBytes   767 Mbits/sec                  receiver


when I run iperf3 behind the firewall I get this:

[  5] local 192.168.7.5 port 48580 connected to 192.168.7.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  52.6 MBytes   441 Mbits/sec   74    477 KBytes
[  5]   1.00-2.00   sec  44.7 MBytes   375 Mbits/sec   29    386 KBytes
[  5]   2.00-3.00   sec  63.2 MBytes   528 Mbits/sec    0    486 KBytes
[  5]   3.00-4.00   sec  55.1 MBytes   463 Mbits/sec   42    411 KBytes
[  5]   4.00-5.00   sec  58.6 MBytes   492 Mbits/sec    0    498 KBytes
[  5]   5.00-6.00   sec  64.8 MBytes   543 Mbits/sec   22    448 KBytes
[  5]   6.00-7.00   sec  56.8 MBytes   477 Mbits/sec   41    377 KBytes
[  5]   7.00-8.00   sec  47.4 MBytes   396 Mbits/sec    0    452 KBytes
[  5]   8.00-9.00   sec  58.3 MBytes   489 Mbits/sec   22    375 KBytes
[  5]   9.00-10.00  sec  30.1 MBytes   253 Mbits/sec    0    428 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   531 MBytes   446 Mbits/sec  230             sender
[  5]   0.00-10.04  sec   530 MBytes   442 Mbits/sec                  receiver



This machine is much more powerful and has the AVX feature in its processor:

Specs:
AMD Ryzen 7 PRO 3700U w/ Radeon Vega Mobile Gfx
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sev sev_es

It's got 30 GB of memory, so this machine is not starved for resources, yet the speeds are not much better.

a. How would it help to make one side of a wireguard connection faster? Or what is the other side?

b. By "AVX", I meant the whole family of AVX extensions, including AVX2 and AVX512. I do not know which exactly is needed / used.

c. Once you pass the firewall, there may be other inspections done, like Crowdsec, Zenarmor, Intrusion detection or Netflow, that put stress on your OpnSense, limiting the attainable speed.

d. Also: Did you set a smaller MTU than 1420, especially if you go over IPv6 and/or PPPoE and/or VLAN?
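
The MTU question in point d comes down to simple arithmetic on the headers WireGuard adds around each inner packet. The following Python sketch works that out; it assumes a PPPoE link MTU of 1492 (standard 8-byte PPPoE overhead on 1500-byte Ethernet, no baby jumbo frames), and the function name `wireguard_mtu` is made up for illustration.

```python
# Per-packet overhead WireGuard adds around the inner packet (bytes).
WG_HEADER = 32   # type/reserved 4 + receiver index 4 + counter 8 + auth tag 16
UDP_HEADER = 8
IP_HEADER = {"ipv4": 20, "ipv6": 40}

def wireguard_mtu(link_mtu: int, outer: str) -> int:
    """Largest tunnel MTU that avoids fragmenting the outer packet."""
    return link_mtu - IP_HEADER[outer] - UDP_HEADER - WG_HEADER

print(wireguard_mtu(1500, "ipv6"))  # plain Ethernet, IPv6 outer  -> 1420
print(wireguard_mtu(1492, "ipv4"))  # PPPoE (assumed 1492), IPv4  -> 1432
print(wireguard_mtu(1492, "ipv6"))  # PPPoE (assumed 1492), IPv6  -> 1412
```

Under these assumptions, 1420 still fits when the tunnel runs over IPv4 on PPPoE, but with IPv6 as the outer protocol the tunnel MTU would need to be 1412 or lower.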

Quote from: meyergru on November 29, 2023, 05:03:46 PM
a. How would it help to make one side of a wireguard connection faster? Or what is the other side?

b. By "AVX", I meant the whole family of AVX extensions, including AVX2 and AVX512. I do not know which exactly is needed / used.

c. Once you pass the firewall, there may be other inspections done, like Crowdsec, Zenarmor, Intrusion detection or Netflow, that put stress on your OpnSense, limiting the attainable speed.

d. Also: Did you set a smaller MTU than 1420, especially if you go over IPv6 and/or PPPoE and/or VLAN?

To answer your questions,
A: The WAN connection on both ends stays the same. I removed WireGuard from the OPNsense instance so that it would only handle routing rather than routing and VPN. The speed on both ends is the same, using the same ISP.

C: This is a newly deployed instance. I haven't turned on anything that is not on by default, with the exception of WireGuard, and there are about 4 rules in WireGuard that I'm using at the moment.

The part that is very curious to me is that when I run iperf3 via the WAN IP, I get near line speed routing back to the machine behind OPNsense. However, when I use WireGuard, whether it's managed by OPNsense or by the machine behind the OPNsense instance, I see similarly slow speeds.

Is there anything else I could look at?

Your help is greatly appreciated.

I am obviously missing how you are measuring. One side is an OpnSense with an N5105 CPU, but what is the other one?
I assumed this is a WireGuard site-to-site VPN between two OpnSenses.

The speed you get between two VPN endpoints is limited by the slower of the two (and by the speed between both sides when you do not use encryption). Also, if the encryption is done on the router itself, everything else that is done on the router adds to the CPU load (e.g. routing, NAT, firewalling, packet inspection, logging).

I'm also surprised how slow WireGuard is on generic x86 machines. I guess it's the lack of hardware offloading and acceleration; ChaCha also profits least from CPU acceleration. No surprise OpenVPN could be faster on multi-gigabit connections. Perhaps WG is meant only for Android / ARM CPUs?

First, note that I didn't see any impact from OPNsense functions except Zenarmor in active mode. There was no difference even with the firewall off.

I saw slow performance on the V1500B too. It's like a curse: slow everywhere except on low-end devices.

Glad I took an N305-type PC instead of an N100 for the firewall. I can now reach 1600 Mbps, but that is still low considering it's an 8505 vPro CPU with double the IPC, all possible extensions and even QAT. The CPU is up to 200x faster in crypto benchmarks than the Armada 385 and stronger than my 16-core desktop PC, yet an old router with an Armada 385 and no CPU extensions can do a decent 800 Mbps, which is half. I don't get it. How?


Performance of the CPUs mentioned:
https://www.cpubenchmark.net/compare/4412vs4304vs5157vs3426vs4775/Intel-Celeron-N5105-vs-AMD-Ryzen-Embedded-V1500B-vs-Intel-N100-vs-AMD-Ryzen-7-3700U-vs-Intel-Pentium-Gold-8505

Quote from: meyergru on November 29, 2023, 07:10:56 PM
I am obviously missing how you are measuring. One side is an OpnSense with an N5105 CPU, but what is the other one?
I assumed this is a WireGuard site-to-site VPN between two OpnSenses.

The speed you get between two VPN endpoints is limited by the slower of the two (and by the speed between both sides when you do not use encryption). Also, if the encryption is done on the router itself, everything else that is done on the router adds to the CPU load (e.g. routing, NAT, firewalling, packet inspection, logging).

To add a little more colour for you: the machine on the other end is just a generic Ubuntu 22.04 server acting as a client. When it makes an iperf3 connection to the WAN IP, I get near line speed. When it connects to WireGuard hosted by OPNsense, or to the WireGuard service on a generic Ubuntu 22.04 server behind the OPNsense server, I get the reduced performance.

I'm not an expert, but I don't believe OPNsense would be doing any cryptography when it's simply matching packets against a NAT rule, so that doesn't explain it. Again, I appreciate everyone's input. Either I've missed something big, or perhaps I should see how pfSense handles this work.

What I am trying to tell you is that if that counterpart Ubuntu machine is, say, able to handle wire speed at 1 GBit/s, but VPN speed at 300 MBit/s for the same reasons I think your OpnSense is slow, then maybe it is not your OpnSense that is the culprit.

In that case, you could use a 14900K for an OpnSense and nothing would change in your VPN speed measurements, because both sides have to handle the encryption.
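
The point can be written down as a one-line model: the tunnel runs at the rate of its slowest part. A minimal Python sketch follows, with purely hypothetical figures in Mbit/s (1000 for the line, 300 for the Ubuntu peer, 400 and 5000 for a slow and a fast firewall CPU); none of these are measurements from this thread.

```python
def vpn_throughput(link: float, crypto_a: float, crypto_b: float) -> float:
    """Tunnel throughput is capped by the slowest of link and both endpoints."""
    return min(link, crypto_a, crypto_b)

# Hypothetical figures in Mbit/s; side B stands for the Ubuntu peer.
before = vpn_throughput(link=1000, crypto_a=400, crypto_b=300)
after = vpn_throughput(link=1000, crypto_a=5000, crypto_b=300)  # far faster firewall CPU

print(before, after)  # both are capped by side B
```

As long as side B stays at 300, no upgrade on side A moves the result.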

To get into that situation is very easy: For example, if that "generic ubuntu 22.04 server" is running on the wrong 5.15 or 6.2 kernel or as a VM under Proxmox, then AVX2 extensions could be disabled by accident, even if your CPU has them.

Details matter. There is a saying in German: "Wer misst, misst Mist." (Who measures, measures manure). The framework conditions are vital for assessing the validity of the outcome.