Hello All,
After a couple of days of testing, I think there may be a software issue with PF. I've tried various configurations across a couple of different hardware setups and keep experiencing the exact same issue. Full details below.
I currently have a fiber link of 1gbit up/1gbit down. Connecting directly to the fiber modem without going through the firewall, my speed tests run 950mbit up and 950mbit down. I have run the tests multiple times over a period of an hour to verify there's no deviation in that speed when connected directly to the modem.
Now, I have built a couple firewalls during my tests:
1) Intel Xeon E3-1220 v3 @ 3.10GHz, 250GB Samsung SSD, 8GB ram, 2x Intel CT Desktop NICs (Intel EXPI9301CTBLK)
2) Virtual machine on an AMD Ryzen 5950X: 8 threads assigned to the Hyper-V VM, 128GB disk on 4x RAID 10 SSDs, 8GB RAM, Intel NICs
On both machines, I have the exact same experience:
1) I power up the firewall
2) I do a speedtest. The speedtests are 950mbit up/down like they are when directly plugged in
3) I watch some YouTube videos for about 5 minutes.
4) I do another speedtest. The speedtests are 600-700mbit down and only 60mbit up.
5) I reboot
6) Speedtest returns back to normal
I have performed the speedtests using a Windows 10 machine and also a Windows Server 2019 machine. I have plugged the Ethernet cables from the test machines directly into the firewall LAN port (no switches in between).
Additionally, I have also tried installing pfSense on the same machines to see whether it was something specific to OPNsense. I experienced the exact same issue: the speed drops after the firewall has been online for a few minutes.
The performance fluctuates. Occasionally it will come back up to 950Mbit, but the majority of the time the speed is slower. The upload rate is the primary issue; it is always below 100Mbit for some reason.
I have tried enabling RSS in the tunables. That did not help. I also tried disabling the Spectre and Meltdown mitigations. Disabling the Meltdown mitigations for some reason makes the Xeon run slower - the download never goes above 600Mbit, but the upload does seem to be a little faster than 60Mbit when it goes into slow-down mode.
I've tried enabling and disabling the "Hardware CRC", "Hardware TSO", "Hardware LRO" in the interface settings. I tried enabling/disabling interface scrubbing.
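For anyone who wants to reproduce those toggles from a shell instead of the GUI, this is roughly what they look like. The interface name igb0 is just a placeholder for whatever your NIC shows up as, and the RSS entries are the ones commonly suggested for OPNsense, not anything exotic:

ifconfig igb0 | grep options              # show which offload features are currently active
ifconfig igb0 -rxcsum -txcsum -tso -lro   # disable checksum offload, TSO and LRO
# (same effect as unticking "Hardware CRC/TSO/LRO" in the GUI; only the GUI setting persists across reboots)
# RSS-related entries under System > Settings > Tunables (reboot required):
#   net.inet.rss.enabled = 1
#   net.inet.rss.bits = 2
#   net.isr.bindthreads = 1
#   net.isr.maxthreads = -1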
When performing speedtests, I watch the OPNsense interface statistics to make sure the speeds match what the speedtest shows. They are very close to each other, which shows that there's no background activity occurring other than the speedtest.
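If you'd rather watch the same thing from a console instead of the GUI, a quick sketch (igb0 again being a placeholder interface name):

netstat -I igb0 -w 1     # per-second in/out throughput for one interface
systat -ifstat 1         # live overview of all interfaces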
Since I have this issue with opnsense and pfsense, the only thing that makes sense to me is an issue with PF. Anyone have a similar issue?
When it enters "slow down" mode, the CPU doesn't even go over 1% - i.e., the upload is at 60Mbit while the CPU sits at 1%. Before it goes into slow-down mode, the CPU hits 25% when downloading and uploading. It's like there's a bottleneck somewhere.
Have you tried disabling powerd? And disable all sleep states in the BIOS, if that option is available.
I just swapped to 2 completely different NICs in the 5950X box. I'm still experiencing the same slowness.
When I make a configuration change on the WAN interface (any change that reloads it), the speed returns back to normal.
powerd is not enabled. I have tried enabling it and setting it to both conservative and maximum, and that didn't change anything.
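For what it's worth, the usual way to confirm nothing is clocking the CPU down is a handful of standard FreeBSD sysctls (sketch below, nothing OPNsense-specific):

service powerd onestatus       # confirms whether powerd is running
sysctl dev.cpu.0.freq          # current frequency of core 0
sysctl dev.cpu.0.freq_levels   # available frequency steps
sysctl hw.acpi.cpu.cx_lowest   # lowest C-state the OS is allowed to enter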
I am not using Suricata. I basically have a vanilla install, but configured 10.0.1.0/24 for the LAN.
I just bought an Intel X540 dual-port NIC, but I doubt it will help at this point, since I've now been through 4 NICs that never had any issues in the past.
When I made LAN interface config changes, it reloaded the LAN interface, but the speed was still 60mbit upload.
If I do the same with the WAN interface, the speed goes back up to 1Gbit/s.
Any config change on the WAN interface allows the speed to return to normal. That's the only thing I know for sure at this point, and it holds across 4 NICs (2 different Intel NICs, a Realtek, and the Hyper-V virtual NICs). Make any WAN config change, click apply, and the speed returns.
I feel absolutely dingy now. I just set up a netfilter-based distro (IPFire) on the same hardware. I do not have any performance slowdowns now; everything is 900Mbit+ up/down.
I really prefer OPNsense, but the slowdowns are making it unusable at the moment.
You say you use a VM? What happens when you install OPNsense on the hardware itself?
Exactly the same thing. I've tried it on bare metal and got the exact same issue. Since changing to netfilter, it's been consistently 900mbps+.
Can you try setting MSS to 1300 in Interface : LAN?
Could this be related? https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=268490
After testing the MSS settings: they had no effect. Tried on both WAN and LAN.
I will say that changing the MSS setting on the LAN reduced maximum throughput to about 700Mbit up/down, but the throughput eventually still dropped to 150Mbit down and 60Mbit up.
Could you elaborate a bit more on the network topology? Are there things like PPPoE involved?
I am using a very basic configuration at the moment while troubleshooting.
Fiber Modem -> Firewall WAN. The fiber modem gives OPNsense a WAN IP using DHCP. No PPPoE, just basic DHCP.
Firewall LAN -> Network Device. Firewall gives device an IP via DHCP (10.0.1.0/24)
I have tried the following NICs:
- Intel Gigabit CT Desktop Adapter (x2)
- Intel i210 Gigabit Adapter
- Realtek PCIe 2.5GbE Family Controller RTL8125
For every NIC, the network device gets 900Mbit+ for about 5-10 minutes before dropping to about 600Mbit down and 60Mbit up.
I have also tried different network cables (CAT8).
At the moment, I am running this in a Hyper-V virtual machine with 8 cores and 4GB RAM assigned. Before virtualizing the setup, I was running bare metal on a Xeon E3-1220 v3 @ 3.10GHz quad-core with a 128GB SSD and 8GB RAM. The exact same thing happened on that hardware; there I was using the Intel Gigabit CT Desktop NICs.
Now that I have everything virtualized, I've setup new VMs with the following:
- IPFire - Consistent 900mbps up/down with no performance degradation
- Endian- Consistent 900mbps up/down with no performance degradation
- Pfsense - 900mbps for 5-10mins before dropping to 600mbps down and 60mbps up. If the connection remains idle for a short period, performance returns for a few minutes before dropping again
- OPNSense - 900mbps for 5-10mins before dropping to 600mbps down and 60mbps up. If the connection remains idle for a short period, performance returns for a few minutes before dropping again
I can't imagine this happens on bare metal too :o
I can imagine. OPNsense has several performance issues. On my OPNsense, routing traffic between different LAN segments (1Gbit/s each) is slow and drops to 40-60MB/s (especially SMB transfers).
I can't remember exactly anymore, but if I'm right the issues began with version 18.x or 19.x, perhaps with the migration to "iflib". The issues are still not solved and I have started coming to terms with them. I don't know whether these performance issues are specific to OPNsense or whether they are the same in pfSense or vanilla FreeBSD.
I noticed that performance increases when disabling IPsec (even though it's not related to the routing between the above-mentioned networks). Furthermore, "netflow" has a non-negligible negative impact on performance.
See also:
- https://forum.opnsense.org/index.php?topic=19426
- https://forum.opnsense.org/index.php?topic=18754.0
Edit: My Opnsense runs on bare metal (Supermicro A2SDi-4C-HLN4F)
I have a datacenter OPNsense on 22.1.10, a Xeon E on Supermicro with an X710, running 10G in both directions. It can't be a generic problem (and this device exports netflow to an external collector, too).
Tried using the tunables Kirk recommended in that second post. Didn't make any difference, unfortunately. :(
I wonder if this is related to my issue?: https://forum.opnsense.org/index.php?topic=31748.0
and another guy's issue on reddit?: https://www.reddit.com/r/opnsense/comments/1055v4l/abysmally_low_upload_speed/
https://forum.opnsense.org/index.php?topic=31753.msg153441#msg153441
960Mbit upload ...
Quote from: mimugmail on January 07, 2023, 02:17:02 PM
https://forum.opnsense.org/index.php?topic=31753.msg153441#msg153441
960Mbit upload ...
Testing throughput across the WAN, compared to the LAN, involves many more factors that can degrade performance, so I recommend starting by testing performance in the LAN. I don't want to hijack this thread with my own observed problems, but maybe they are related. Following up on my previous post (#15 (https://forum.opnsense.org/index.php?topic=31680.msg153331#msg153331)), I ran some tests again:
Scenario: Server <-> Opnsense <-> Client:
- Client downloaded a 10GB file from the server via SMB in the LAN
- All LAN links are 1Gbit/s
- Server CPU load was around 10-15% (all cases)
- Client CPU load was around 10-15% (all cases)
- Opnsense CPU load was around 25% (all cases)
- Result (regular configuration): Download speed was around 40MB/s
- Result (regular configuration and service "samplicate" stopped): Download speed was around 60MB/s
- Result (regular configuration and services "samplicate", "strongswan" stopped): Download speed was around 90MB/s
Disabling both of these services increases LAN performance a lot, but dropping IPsec is not an option for me. Unfortunately, this slow-throughput situation has existed for a very long time with no prospect of improvement.
Edit:
Playing with some tunables from here (#15) (https://forum.opnsense.org/index.php?PHPSESSID=87qpqtobv6gjs8nknrripi90bh&topic=25844.msg126362#msg126362) increases the performance for the third case from around 90MB/s to 100-105MB/s.
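To take SMB and the end hosts out of the equation, a plain iperf3 run between the two segments is a reasonable baseline (a sketch; <server-ip> is simply whatever address the box in the other VLAN has):

iperf3 -s                        # on the server side
iperf3 -c <server-ip> -t 30      # on the client side: client -> server
iperf3 -c <server-ip> -t 30 -R   # same test reversed: server -> client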
Why is your OPNsense involved in LAN to LAN traffic? Do you use a LAN bridge instead of a switch? You should definitely get a dedicated switch, then.
Quote from: pmhausen on January 08, 2023, 02:58:26 PM
Why is your OPNsense involved in LAN to LAN traffic? Do you use a LAN bridge instead of a switch? You should definitely get a dedicated switch, then.
Please don't be confused by the term "LAN"; I use it in a broader sense (everything behind the firewall, as opposed to the WAN). The server and client are in different LAN segments (VLANs), and the OPNsense routes the traffic between them based on pf rules. Additionally, the VLANs are on different physical interfaces, so no sharing of bandwidth is involved.
Just installed the Intel x540-t2. Exact same issue.
Quote from: eneerge on January 11, 2023, 07:14:38 AM
Just installed the Intel x540-t2. Exact same issue.
@eneerge: Have you also tested the performance and possible impact when stopping the services "samplicate" and "strongswan" on the Dashboard?
So, I just stumbled upon this post on reddit that mentioned the EXACT issue I described (speed dropping to the exact same rate). https://www.reddit.com/r/HomeNetworking/comments/p63zbo/calix_ont_to_3rd_party_router_not_working/
The last post mentions that it's caused by the ARP timeout. Maybe the gateway's ARP entry needs to be statically assigned to resolve it. I will test at some point.
I do not want to "jinx" myself, but holy f-ing s, it seems to be fixed after changing the ARP expiration timeout.
When using OpenWrt, I never experienced any slowdowns. Linux by default has an ARP expiration of 60 seconds; pf/OPNsense has a default expiration of 20 minutes. The slowdowns started at exactly 600 seconds into the ARP entry's lifetime. With pfSense, I was able to manually remove the individual cached entry for the gateway: removing the ARP entry for my gateway instantly restored my speeds (and also caused a new ARP entry to be created immediately). So now I can apply the same fix to OPNsense.
I just added this to the tunables:
- net.link.ether.inet.max_age = 540
This should set the ARP cache to expire every 9 minutes instead of 20.
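For anyone who wants to check or test this on their own box, a short sketch (192.0.2.1 is a placeholder for your WAN gateway IP; the tunable itself goes under System > Settings > Tunables so it survives reboots):

sysctl net.link.ether.inet.max_age   # FreeBSD default is 1200 seconds (20 minutes)
arp -an                              # list the cached ARP entries
arp -d 192.0.2.1                     # drop just the gateway's entry; it is re-learned instantly
# persistent change via System > Settings > Tunables:
#   net.link.ether.inet.max_age = 540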
I don't understand why this works. The MAC address of my gateway is the exact same even after the expiration and renew. Anyone have any idea why an old ARP cache entry (which is actually still valid) would cause this issue?
For reference, my fiber runs into a Calix GigaPoint 803G ONT.
Calix 803g -> Opnsense -> Switch -> Devices
Anyone who can enlighten me as to why this fixed the issue, please feel free to do so. It just seems odd that deleting a cached entry and re-creating the exact same entry every 540 seconds instead of every 1200 seconds fixes the issue.
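If anyone wants to dig deeper, one way to see what the ONT actually does around that 600-second mark would be to capture ARP on the WAN side (igb0 is again just a placeholder for the WAN interface name):

tcpdump -eni igb0 arp    # print ARP frames along with their MAC addresses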