Poor Throughput (Even On Same Network Segment)

Started by hax0rwax0r, August 25, 2020, 08:31:25 PM

Hello everyone,

Unfortunately I have the same performance problem on ESXi 6.7 with vmxnet3 network adapters. The physical adapters behind them are as follows:

WAN: AQtion AQN-107 (10 Gbps)
LAN: Intel 10 Gigabit Ethernet Controller 82599 (10 Gbps)
DMZ: Intel 10 Gigabit Ethernet Controller 82599 (10 Gbps)

ISP: 10/10 Gbps (XGS-PON)


The speed on OPNsense (also on pfSense) is approximately as follows:
down: 7-10 Mbps
up: 2.5-3 Gbps

On any Linux firewall (e.g. IPFire or Untangle) I get the following values:
down & up: 5-6 Gbps

I have tried every tunable I could find on OPNsense, which unfortunately didn't help.

But now I just noticed something strange:
When I have performance monitoring active during a speed test (the Performance graph in the WebUI, or top via SSH), the speed is suddenly not even that bad:
down & up: 3-4 Gbps

If I deactivate the performance monitoring again, the values drop back to where they started.

Unfortunately I don't know exactly what triggers this phenomenon, but perhaps some of you have noticed it too?


August 19, 2021, 12:43:27 PM #122 Last Edit: August 23, 2021, 06:47:36 PM by balrog
Thank you for the answer.

I previously had an Intel X550-T2 purely for the WAN connection. But after testing I found that the onboard AQtion AQN-107 with the current driver from Marvell* is just as fast (so I could save a PCIe slot).
On both Linux firewalls I was able to max out the ISP's bandwidth with either configuration (Intel or AQtion).

P.S. The problem was the same with the Intel NIC configuration.

(*Sorry, the driver is not from Broadcom; it's from Marvell.)


Thanks for the hint, but I had already adjusted this value before - unfortunately without success...

What is really strange is that the speed is normal (like on the Linux firewalls) as soon as I have "top" open in the background
(no matter if OPNsense is tuned or on factory settings).

As if (figuratively speaking) "top" keeps the floodgates open so the network packets can flow faster.


Can anyone with the same problem (vmxnet3) perhaps verify this?


Just to chip in and offer a data point from possibly more "standard" hardware.

I'm using Deciso's "own" hardware, which should help with replicating/reproducing the issue.

Deciso DEC 840 with OPNsense 21.4.2-amd64, FreeBSD 12.1-RELEASE-p19-HBSD

I have one main VLAN routing to untagged (main LAN). I upgraded my main switch to 10 Gbps and moved my LAN+VLAN interface from a GbE port to an SFP+ port at 10 GbE.

Everything else works well, but VLAN <=> LAN routing causes massive lag on completely separate routed traffic (spikes of around 400-1000 ms); in the extreme case a CPU spike above 80% caused several seconds of 1000-1300 ms spikes on separate routing under light traffic.

I will reconfigure the VLAN parts to a separate GbE interface (likely today) and see if that resolves the issue; the next step will be restoring the whole network to GbE ports (as it was before).

I did install a new switch in the network, so it might play a part in this, but based on the behaviour it seems unlikely.


If you meant me: no, I don't have Sensei, and I believe I don't have IPS enabled (at least not on purpose; I can't even find the setting right now).

We do use traffic shaping policies for the 2x WANs, but that's about it. Everything else is just basic (rule-limited) routing between LAN/VLANs.

I didn't touch anything in the recent change, except moving the LAN (+ the VLANs associated with it) from the igb0 interface to ax0.

I'll configure it back soonish (hopefully today), as the 10 GbE wasn't really utilized yet and the issue is really easy to spot right now. So I'll have more info about my scenario soon.

August 24, 2021, 04:30:50 PM #129 Last Edit: August 24, 2021, 04:33:59 PM by Kallex
Ok, that was nice and clean to confirm.

To clarify the terms below: the Deciso DEC 840 has 4x GbE ports (igb0, igb1, igb2, igb3) and 2x 10 GbE SFP+ ports (ax0, ax1).

The issue with the Deciso 840 is VLAN traffic routed over the 10 GbE SFP+ ports. In my case the port was supposed to carry that traffic alongside untagged LAN traffic, so this is the scenario I can confirm.


1. Before changes - VLAN routing worked

Before using the SFP+ ports I had LAN + VLAN routed via the igb0 interface. Everything worked well, no issues.

2. After changes - VLAN routing broken (affecting other routing too)

After moving LAN + VLAN over the SFP+ port (ax0), the issues started. When VLAN traffic was routed, there were heavy lag spikes on non-VLAN traffic as well. I don't have performance numbers, but the traffic wasn't heavy; yet it heavily affected the whole physical interface.

3. Fixed by moving VLAN to igb0 while keeping LAN on ax0

As I knew that "everything on igb0" worked, I wanted to try whether it was enough to move just the VLANs to igb0 and keep the LAN on ax0. It required some careful "tag denial" on the switch routes so as not to loop either the untagged traffic or the VLANs, but the solution worked.

EDIT: Of course this workaround/fix was only feasible because my VLAN networks didn't need the 10 GbE in the first place.


As I would need to change 2x managed switches and be very careful not to make my OPNsense inaccessible, I'm hesitant to try it the other way around (moving the VLANs to SFP+ and the LAN to igb0) just to test whether VLAN routing is broken altogether, or whether the issue only appears when LAN/VLAN traffic "routes back" through the same physical interface.

I also didn't test the 10 GbE speeds (no sensible way to test them through OPNsense right now), but the lag/latency issue was so clear that something was obviously not working.

@Kallex Can you try updating to 21.4.3? The axgbe driver from AMD had an issue with larger packets in VLANs, which led to a lot of spam in dmesg (and reduced performance). If you do suffer from the same issue, I expect quite a few kernel messages (..Big packet...) when larger packets are being processed.

The release notes for 21.4.3 are available here https://docs.opnsense.org/releases/BE_21.4.html#august-11-2021

o src: axgbe: remove unnecessary packet length check (https://github.com/opnsense/src/commit/bee1ba0981190dabcd045b6c8debfc8b8820016c)

Best regards,

Ad

August 24, 2021, 11:15:50 PM #131 Last Edit: August 24, 2021, 11:23:35 PM by Kallex
I can try; we're in a production environment, so the earliest I can try it is on the weekend.

I guess that's not the "Stable Business Branch" release; can I easily roll back to the last stable one after checking out that version?

I'll report back regardless of whether I could test it or not.

EDIT: I realized it's indeed a business release. I'll test it on the weekend at the latest and report back.

Quote from: Kallex on August 24, 2021, 11:15:50 PM
I can try; we're in a production environment, so the earliest I can try it is on the weekend.

I guess that's not the "Stable Business Branch" release; can I easily roll back to the last stable one after checking out that version?

I'll report back regardless of whether I could test it or not.

EDIT: I realized it's indeed a business release. I'll test it on the weekend at the latest and report back.

I got to test it now. My issue no longer occurs with this newest version, thank you :-).

So initially I had performance issues when routing VLAN <=> LAN through ax0 (10 GbE) on the Deciso DEC 840. After this patch the issue is clearly gone.

I don't have any real performance numbers between the VLANs, but the obvious "laggy" behaviour is entirely gone now.

I also did some testing after noticing at a customer site that even on a 10G uplink I would max out at 600 Mbps. Since then I have roughly tested this on all the other sites where we run OPNsense, and the result is the same everywhere. OPNsense runs everywhere on either ESXi or Proxmox, on Thomas Krenn servers with the following specs:

Supermicro mainboard X10SDV-TP8F
Intel Xeon D-1518
16 GB ECC DDR4 2666 RAM

I have now tested with 3 VMs: 2 running Debian Bullseye and 1 running OPNsense (latest 20.1 and latest 21.7). The results are quite poor.

Debian -> Debian
> 14 Gbps

Debian -> OPNsense 20.1 -> Debian
< 700 Mbps

Debian -> OPNsense 21.7 -> Debian
< 900 Mbps

Both OPNsense installs use default settings, have hardware offloading disabled, and are updated to the latest version.

I tried setting the following tunables:

net.isr.maxthreads=-1
I also noticed that net.isr.maxthreads always returns 1, but when set to -1 it reports the correct number of threads. However, network throughput does not change.

hw.ibrs_disable=1
This made a significant impact: throughput increased to 2.6 Gbps, which is still too low but a lot better than before.


@alh, in the case of ESXi the most relevant details are likely already documented in https://forum.opnsense.org/index.php?topic=18754.msg90576#msg90576. The 14 Gbps was probably measured with default settings; the D-1518 isn't a very fast machine, so that number would be reasonable with all hardware-accelerated offloading settings enabled.