Hello, homelab-networker-newbie here.
For more than a day I have been debugging a strange issue where my connections halt/freeze/stop when they go beyond 1 Gbit/s. I have a 2 Gbit/s ISP WAN connection; https://www.speedtest.net/, however, works fine and shows 2000+ Mbit/s up/down.
My setup is Proxmox with an OPNsense VM with virtio network devices (underlying hardware is an N150 with I226-V network devices).
I started noticing the issue when downloading a large file with curl: it kept stalling at random points. With '--limit-rate 120m' added, curl still works fine, but when I raise it to 130m (or higher) the issue occurs. I tried this from devices on the network as well as directly from the OPNsense host.
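For reference, curl's --limit-rate takes bytes per second, and its m suffix is 1024-based, so the two test values bracket roughly the 1 Gbit/s mark. A quick sketch of the conversion (the "works/stalls" labels just restate the observation above):

```shell
# curl's --limit-rate is in bytes/second; the "m" suffix is 1024-based.
# Converting both test rates to bits/second shows the stall threshold
# sits right around the 1 Gbit/s mark:
WORKS=$(( 120 * 1024 * 1024 * 8 ))    # --limit-rate 120m, still fine
STALLS=$(( 130 * 1024 * 1024 * 8 ))   # --limit-rate 130m, stalls
echo "120m = ${WORKS} bit/s"          # ~1.01 Gbit/s
echo "130m = ${STALLS} bit/s"         # ~1.09 Gbit/s
```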
On the same Proxmox instance I also tried an OpenWRT VM, on the same type of virtio device, and there I have no issues at all. So the underlying hardware looks OK.
I also tried connecting directly to my fiber modem and that is also working fine.
I tried changing MTU and MSS settings, which also did not change anything; for OpenWRT the same settings work fine. To debug further I tried turning hardware offloading options on/off, and a lot of tunables (as suggested in other topics), but without any noticeable change.
Before I change more options, or even create a new OPNsense instance to test, I want to understand what the possible issue is. I just don't get why the connection halts when a certain threshold is reached. Other network connections continue just fine, so only high-speed connections break.
Any suggestions what the underlying issue could be? Or how I can debug it further? Any help would be appreciated.
If you only had general speed issues, I would point you to the recommendations here (https://forum.opnsense.org/index.php?topic=44159.0), especially multiqueue. You can also try RSS (https://docs.opnsense.org/troubleshooting/performance.html).
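On the Proxmox side, multiqueue is set per virtio NIC. A hedged sketch, assuming VM ID 100 and a vmbr0 bridge (adjust both to your own setup):

```shell
# Hedged example: enable 4 virtio queues on the first NIC of VM 100.
# The VM ID and bridge name are placeholders for your own values;
# Proxmox auto-generates the MAC address when it is omitted.
qm set 100 --net0 virtio,bridge=vmbr0,queues=4
```

A common rule of thumb is to match the queue count to the number of vCPUs assigned to the firewall VM.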
However, you seem to reach the advertised speeds, only that the connection stutters when you actually use that bandwidth.
It is a well-known fact that, because of the bandwidth-delay product, ISPs use buffering. A similar mechanism is built into TCP, which nowadays does not wait for each single packet to be acknowledged, but keeps a window of packets in flight and gathers their ACKs later.
When those buffers are overrun, packet drops occur, causing retransmissions. This effect is called bufferbloat.
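To make the bandwidth-delay product concrete: it is the amount of data that must be in flight to keep the link full. A sketch with assumed example numbers (a 2 Gbit/s link and a 20 ms round-trip time; your RTT will differ):

```shell
# BDP = bandwidth * round-trip time: the bytes "in flight" on a full link.
# The numbers below are assumptions for illustration only.
BANDWIDTH_BITS=2000000000   # 2 Gbit/s link
RTT_MS=20                   # assumed round-trip time in milliseconds
BDP_BYTES=$(( BANDWIDTH_BITS / 8 * RTT_MS / 1000 ))
echo "BDP: ${BDP_BYTES} bytes"   # 5 MB must be in flight to fill the link
```

If windows or buffers along the path are smaller than this, the link cannot be filled; if buffers are much larger, latency balloons under load.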
You can measure how bad this is for you. There are test sites linked to on https://bufferbloat.net, like Waveform.
In order to remedy this, you can do traffic shaping. While there is an official guide (https://docs.opnsense.org/manual/how-tos/shaper_bufferbloat.html), I recommend working through this (https://forum.opnsense.org/index.php?topic=46990.0) - and I really mean "through", because it does not only apply to IPv6, but also includes specific rules for ICMPv4 traffic. Also, you will find that changing the traffic shaper sometimes requires a reboot before the settings are applied.
@meyergru thanks for your suggestions. I did the bufferbloat test and it shows Grade A+. No issues at all.
Still, I tried some traffic shaping, but it had no effect.
I am a bit confused, as bufferbloat does not seem to be an issue and I do not see anything that indicates retransmissions. The download streams just stop at certain moments. No data comes in after a stop; it just keeps waiting.
I finally created a new OPNsense VM with just the required WAN and LAN gateway, tested it, and there were no issues at all. I will continue rebuilding my OPNsense instances and keep testing until/if it goes wrong again at some point. I wanted to put certain configs into Terraform anyway.
Thanks for the help, and if I discover which setting is triggering the issue I will update this topic.
By retransmissions, I mean packets that must be retransmitted because they were dropped, e.g. ACK packets for downstream data. The other side will stop transmitting until all outstanding ACKs have been received. The symptom you see is that the downstream stops until you retransmit the ACK packets. These pauses cause a pumping effect: bursts of data with short stalls in between.
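If you want to verify on the OPNsense box itself whether retransmissions occur during a stall, the FreeBSD TCP counters and a live capture are a quick check. A sketch that just prints the two commands to run while reproducing the stall (vtnet1 as the WAN interface name is an assumption; substitute your own):

```shell
# Commands to run from the OPNsense (FreeBSD) shell while a download
# stalls: netstat's TCP statistics include retransmit counters, and
# tcpdump shows duplicate ACKs live. vtnet1 as WAN is an assumption.
COUNTERS='netstat -s -p tcp | grep -i retransmit'
CAPTURE='tcpdump -ni vtnet1 -c 200 tcp'
printf 'run: %s\nrun: %s\n' "$COUNTERS" "$CAPTURE"
```

Sampling the counters before and after a stall and comparing the numbers tells you whether anything was actually retransmitted.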
Such effects can occur if one of these problems exist:
1. Bufferbloat, where both sides think they can push more than actually available.
2. Hardware errors, like frame errors, where packets are dropped for whatever reason.
3. Line congestion or overprovisioning.
4. MTU configuration problems, where packets are dropped because they are too big.
#4 sometimes occurs with only some sites, which lack PMTU discovery. I have a tutorial on how to set the preferred MTU of 1500 (https://forum.opnsense.org/index.php?topic=45658.0). It largely depends on how your internet connection is set up - PPPoE is a problematic candidate.
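A quick way to test the path MTU yourself is to ping with the don't-fragment bit set and a payload that exactly fills 1500 bytes. A sketch of the arithmetic (the -D flag is the FreeBSD/OPNsense spelling; on Linux it would be -M do, and the target host is up to you):

```shell
# A 1500-byte IPv4 MTU leaves 1472 bytes of ICMP payload:
# 1500 - 20 (IPv4 header) - 8 (ICMP header) = 1472.
MTU=1500
PAYLOAD=$(( MTU - 20 - 8 ))
echo "try: ping -D -s ${PAYLOAD} <host>"   # FreeBSD -D sets don't-fragment
# If this fails while smaller payloads succeed, something on the path
# (e.g. a PPPoE link) has an MTU below 1500.
```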
All is still running fine; I basically have the same setup, because I imported the old configs in full into my new instances. The only real change was the Proxmox VM machine type, from i440fx to q35.
Again thanks for the suggestions, they are still useful in optimizing flows.