Hi. Another "help with bufferbloat".
I am currently on OPNsense 25.1.12-amd64 in a VM that has been running fine and has been through a few major release upgrades. All good.
At some point in the past, perhaps two years ago, I followed one of the threads here to get decent bufferbloat results. It worked fine and I got a B on the Waveform site, with only the "low latency gaming" measure flagged. That was good enough for me. I don't game; I only need video calls (MS Teams / Zoom) to work reliably when needed.
My ISP package is fibre to the premises at 520 Mbps down / 72 Mbps up. The speeds are normally consistent.
I noticed what seemed like some buffering last week and went to check my settings. I thought I might need to reconfigure, so I a) read a few recent posts (up to about 24 months old) and b) checked the current docs. I admit I can't follow the docs' note about the "limit" parameter and the reference to the bug.
I decided to set it up per docs and made note of what I had first.
Result: consistently C grades, including after reboots when changing the flows setting.
I went back to what I had before and still get mostly C, sometimes B.
So that's the background. Can someone suggest what values to use?
These are the values I had before the change, yet the results now _appear_ worse. Yes, it doesn't make a whole lot of sense, but I'm looking for another set of eyes in case I've stared at it too long.
Download pipe
Enabled X
Bandwidth 490
Bandwidth Metric Mbit/s
Queue
Mask (none)
Buckets
Scheduler type FlowQueue-CoDel
Enable CoDel
(FQ-)CoDel target
(FQ-)CoDel interval
(FQ-)CoDel ECN X
FQ-CoDel quantum
FQ-CoDel limit 20480
FQ-CoDel flows 8192
Enable PIE
Delay 1
Description Download pipe
Download queue
Enabled X
Pipe Download pipe
Weight 100
mask destination
Buckets
Enable CoDel
(FQ-)CoDel target
(FQ-)CoDel interval
(FQ-)CoDel ECN X
Enable PIE
Description Download queue
Download rule
Enabled X
Sequence 1
Interface WAN
Interface 2 None
Protocol ip
Max Packet Length
Source any
Invert source
Src-port any
Destination any
Invert destination
Dst-port any
DSCP Nothing selected
Direction in
Target Download queue
Description Download rule
The mask in the Download queue should be (none). Also, you should define the Upstream side of things as well.
Is a downstream shaper (particularly a single queue) likely to have the effect you want? I used downstream shapers in the past, but my purpose was to control offered load by adding latency, using multiple queues on a CBQ shaper. I didn't bother after my link passed 10Mb; it did help at 6-10Mb.
I'd think a simple fair queue with no shaper would be the best option for you. I don't know the best way to accomplish that - perhaps open the pipe beyond 520Mb/s (toward single-station LAN speed). I haven't looked at the fq-codel implementation in... a while. The one I recall used a flow hash, and you could set the number of bits (up to 16, I believe). It looks like the ipfw implementation has that limit (65536). I'd think more can't hurt - fewer (potential) collisions. I wouldn't expect any negatives, but you never can tell. PIE just sounds like a RED implementation - I can't see that it'd have much if any effect, as I wouldn't expect your queue depths/times to reach discard levels.
Of course, you could have upstream issues, at any point in the path.
Quote from: meyergru on December 01, 2025, 07:28:42 PM
The mask in the Download queue should be (none). Also, you should define the Upstream side of things as well.
Yes, I tried with that removed, as per the docs. Still bad.
Anything else you can spot?
Edit: P.S. The upload side seems very good in the bufferbloat tests, but I can add those results to the thread no problem. I wanted to keep it as tidy as possible.
Quote from: pfry on December 01, 2025, 08:18:47 PM
Is a downstream shaper (particularly a single queue) likely to have the effect you want? I used downstream shapers in the past, but my purpose was to control offered load by adding latency, using multiple queues on a CBQ shaper. I didn't bother after my link passed 10Mb; it did help at 6-10Mb.
I'd think a simple fair queue with no shaper would be the best option for you. I don't know the best way to accomplish that - perhaps open the pipe beyond 520Mb/s (toward single-station LAN speed). I haven't looked at the fq-codel implementation in... a while. The one I recall used a flow hash, and you could set the number of bits (up to 16, I believe). It looks like the ipfw implementation has that limit (65536). I'd think more can't hurt - fewer (potential) collisions. I wouldn't expect any negatives, but you never can tell. PIE just sounds like a RED implementation - I can't see that it'd have much if any effect, as I wouldn't expect your queue depths/times to reach discard levels.
Of course, you could have upstream issues, at any point in the path.
You mean set it up as per the docs https://docs.opnsense.org/manual/how-tos/shaper_bufferbloat.html ?
But I can try it, see if I follow the thinking, and open a pipe beyond the 520 Mbps to see what happens. Thanks for the idea.
Going a little mad with this at the moment.
Thing is, I have a decent (for me) 520 Mbps of bandwidth. Normally I wouldn't bother with shaping, but I seem to get the odd bit of buffering now after this change I made. Frustratingly, it has not gone back to normal after restoring the previous settings.
To make it factual, my just-made 2 test results:
BUFFERBLOAT GRADE
B
LATENCY
Unloaded 26 ms
Download Active +39 ms
Upload Active +0 ms
SPEED ↓ Download 259.5 Mbps
↑ Upload 66.9 Mbps
Second:
BUFFERBLOAT GRADE
B
Your latency increased moderately under load.
LATENCY
Unloaded 21 ms
Download Active +42 ms
Upload Active +0 ms
SPEED ↓ Download 262.4 Mbps
↑ Upload 66.8 Mbps
==
So it's giving me Bs at the moment. Is this a "good enough, leave it alone" result? Tomorrow it might give me Cs though. I'll keep checking.
Cookie,
Looking at your original configuration in the very first post, it looks misaligned with the docs.
Please align the configuration exactly with the official documentation. It was tested on several different setups (HW + WANs) and is designed to provide a proper baseline with minimal configuration, which usually results in B or higher scores if you at least set the BW properly.
The main point of a properly configured FQ_C setup is to set the BW correctly and to have Pipes and Queues for both Download and Upload. The rest of the parameters are for advanced fine-tuning.
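For reference, this is roughly the shape of the doc-style baseline, written from memory against your 520/72 line; treat it as an illustration only and double-check every field against the official how-to:
Download pipe > Bandwidth ~470 Mbit/s (start around 90% of 520, then tune), Scheduler type FlowQueue-CoDel, everything else per the how-to
Upload pipe > Bandwidth ~65 Mbit/s (around 90% of 72, then tune), Scheduler type FlowQueue-CoDel, everything else per the how-to
Download queue > Pipe: Download pipe, Weight 100, Mask (none)
Upload queue > Pipe: Upload pipe, Weight 100, Mask (none)
Download rule > Interface WAN, Direction in, Target Download queue
Upload rule > Interface WAN, Direction out, Target Upload queue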
Quote from: cookiemonster on December 01, 2025, 07:09:25 PM
I admit I can't follow the docs' note about the "limit" parameter and the reference to the bug.
Prior to OPN 25.7.8 there was a BUG that caused CPU hogging due to excessive logging when the limit queue was exceeded, so the advice was to leave Limit blank. Franco did FIX this (at least on the OPN side). It is now safe and beneficial to use the Limit parameter and set it to 1000 for speeds under 10 Gbit/s.
I also updated the docs; the PR was merged, and once Ad recompiles the docs the change will show up:
https://github.com/opnsense/docs/pull/811/files
-----------
Alright, let's dissect this:
Quote from: pfry on December 01, 2025, 08:18:47 PM
I'd think a simple fair queue with no shaper would be the best option for you. I don't know the best way to accomplish that - perhaps open the pipe beyond 520Mb/s (toward single-station LAN speed).
Your QoS/shaping should be implemented on the interface where you want to control the bottleneck, i.e. as close as possible to the source of bufferbloat. Plain FQ doesn't handle bufferbloat in any way; FQ only shares the BW equally amongst all the flows. To control bufferbloat you need an AQM (FQ_CoDel, FQ_PIE) or an SQM (CAKE).
Another point: you should not set your Pipe to more than you actually have; this introduces issues. You cannot give out what you don't have, in our case BW. By setting the BW higher than your real rate you end up back in bufferbloat land, latency goes haywire, and you hand control of the queue back to the ISP.
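To make that concrete, a quick back-of-the-envelope for the 520/72 line in this thread (a small Python sketch of my own; the exact percentage is a tuning knob, not a rule):
# Pipe sizing sketch: keep the pipe a bit below the real line rate so the
# queue forms in the shaper (where FQ_C controls it), not at the ISP.
link_down, link_up = 520, 72  # advertised rates in Mbit/s
for pct in (0.85, 0.90, 0.95):
    print(f"{int(pct * 100)}%: download pipe ~{link_down * pct:.0f} Mbit/s, "
          f"upload pipe ~{link_up * pct:.0f} Mbit/s")
# 85%: download pipe ~442 Mbit/s, upload pipe ~61 Mbit/s
# 90%: download pipe ~468 Mbit/s, upload pipe ~65 Mbit/s
# 95%: download pipe ~494 Mbit/s, upload pipe ~68 Mbit/s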
Quote from: pfry on December 01, 2025, 08:18:47 PM
I haven't looked at the fq-codel implementation in... a while. The one I recall used a flow hash, and you could set the number of bits (up to 16, I believe).
FQ_C creates internal flow queues per 5-tuple using a HASH. Due to the stochastic nature of hashing, multiple flows may end up hashed into the same slot. This can be controlled by the flows parameter in FQ_C.
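As a toy illustration of why more hash buckets mean fewer collisions (a small Python sketch of the general idea, not the actual dummynet hash):
import random
from collections import Counter

# Toy model: n_flows concurrent flows hashed into 'buckets' slots.
# Counts how many flows end up sharing a slot with another flow.
def avg_collisions(n_flows, buckets, trials=1000):
    total = 0
    for _ in range(trials):
        counts = Counter(random.randrange(buckets) for _ in range(n_flows))
        total += sum(c - 1 for c in counts.values() if c > 1)
    return total / trials

for buckets in (1024, 8192, 65535):
    print(buckets, round(avg_collisions(200, buckets), 1))
# With 200 active flows: roughly 18 collisions at 1024 buckets,
# ~2.4 at 8192 and ~0.3 at 65535 (averages; they vary per run).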
Quote from: pfry on December 01, 2025, 08:18:47 PM
It looks like the ipfw implementation has that limit (65536). I'd think more can't hurt - fewer (potential) collisions. I wouldn't expect any negatives, but you never can tell.
This is a very bad idea if we are speaking about the "limit" parameter. Limit is effectively the queue size for the internal flows created by FQ_C. If you have a long queue but cannot process the packets in it in time, you create latency. Because FQ_C is an AQM, it measures the sojourn time of each packet in the queue and, when that is exceeded, either marks or drops the packet. But having too big a queue is still bad overall. We want to TAIL-drop packets when we cannot handle them, not store them.
(The limit parameter maxes out at 20480; the flows parameter at 65535.)
Setting the flows parameter higher is not a bad idea; the desired outcome is to have as few flows as possible overlapping into the same queue. The higher it is set, the more memory it takes (in reality not that much).
Rule of thumb:
Limit > below 10 Gbit/s, around 1000 is a good starting point (usable since 25.7.8)
Flows > if possible, set to the max of 65535
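Rough numbers on why ~1000 is already plenty of queue at these speeds (my own back-of-the-envelope in Python, assuming full-size 1500 B packets and the example 468 Mbit/s download pipe from above):
# Drain time of a full 'limit' queue at the pipe rate -- the worst-case
# standing latency a full queue could add before tail drops kick in
# (CoDel normally intervenes long before that point).
pkt_bytes = 1500
rate_mbps = 468
for limit in (1000, 20480):
    drain_ms = limit * pkt_bytes * 8 / (rate_mbps * 1e6) * 1000
    print(f"limit={limit}: ~{drain_ms:.0f} ms of buffer at {rate_mbps} Mbit/s")
# limit=1000:  ~26 ms of buffer
# limit=20480: ~525 ms of buffer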
Quote from: pfry on December 01, 2025, 08:18:47 PM
PIE just sounds like a RED implementation - I can't see that it'd have much if any effect, as I wouldn't expect your queue depths/times to reach discard levels.
I really don't want to go too deep into PIE (e.g. FQ_PIE); it works similarly to FQ_C but has a different use case, so I will just say this:
PIE
- Probabilistic, gradual
- Used in ISP networks, broadband, general traffic
CoDel
- Adaptive, based on packet age
- Low-latency applications, real-time traffic
Regards,
S.
My router CPU is Intel N5105 and I have OPNsense 25.7.8. I noticed that with Flow size 65535 the CPU usage as per 'top' was hitting 50% during the download portion of the Bufferbloat test. Reducing the flow size to 16384 per pipe, I got the usage down to 20-25%.
I also noticed that increasing this value from its default caused higher latency initially until I decreased the pipe BW (even lower than the recommended 85%). Not sure if that's just a side effect of the weaker CPU.
Result is very good, though. Consistent A+ when tested on a client running Windows. Drops to A when tested on same client with Linux, but still consistent.
@cookiemonster, the hard part is finding the sweet spot for the BW value. For me it's kind of a tipping point. There's a narrow range that if I deviate from, in either direction, then the latency starts to go up again. It's not enough to matter in practice (we're talking tolerances within A to B range) but enough to drive a perfectionist crazy. ;) I think my ISP makes it more difficult because cable modem speeds here fluctuate throughout the day, and the service is over-provisioned for short bursts so that I get 120% of the advertised speed initially before it levels out.
Honestly, I no longer remember whether Flows increases CPU utilization as such, but it's possible, as it has to spawn more flow queues. Overall this is the desired behaviour: we do not want to mix packets from flow A and flow B in the same queue. But as mentioned, it's a trade-off of extra resources.
I would not say it creates any additional latency at initial start; it would create a persistent one if you don't have the horsepower to run it, e.g. if your CPUs are too weak. That is not due to FQ_C but due to the Shaper, as software shaping/QoS is a CPU-intensive task. Most likely the latency you saw is due to the variable nature of your internet connection.
I am on cable too. Half a year ago my ISP had a capacity problem: during peak hours speeds were extremely variable, so basically no constant throughput rate. Even then I ran FQ_C, because FQ_C can handle this; instead of 2 s of latency it kept things in check at around 100 ms when connectivity dropped from 500 Mbit to 300 Mbit (pipe set to 495 Mbit).
On Linux you can achieve A+ as well; same as on Windows, the scores also depend on your browser performance. If you check the GitHub link and description, that testing was done on Linux in the (Floorp) Firedragon browser 12. The documentation was written, and all tests performed, on Linux as well.
Regards,
S.
Very good information. Thank you @OPNenthu, your observation about the BW is interesting.
@Seimus, many thanks for the advice. I'll need to digest it a bit and go back to resetting everything as per the docs, BUT I am on OPN 25.1.12 and worry about upgrading to the latest for whatever other changes it might bring, unrelated to the shaper. And yes, setting the BW right seems to be the hardest part. I just tested and got an A. I was closer to the AP for that test, so it seems my testing methodology is something I need to be more conscious of. And the BW measured was 151 Mbps for this A result, which makes me suspect the results a little.
Also, rookie question but I'll ask. Do zenarmor / crowdsec interfere when running the bufferbloat tests?
And to clarify. Can I/should I reset as per docs on my 25.1.12 version ? Suggested testing method ?
I would advise running the test over a cable. Unless you have at least WiFi 6, plus all the BW available in the channel, plus no noise or channel overlap, testing via WiFi is not advised, as any of those three things can introduce wireless-specific latency.
Quote from: cookiemonster on December 02, 2025, 03:42:31 PM
Also, rookie question but I'll ask. Do zenarmor / crowdsec interfere when running the bufferbloat tests?
Not directly and not by intent. This comes back to the CPU bottleneck: if your CPU cannot keep up, you will see latency introduced by the CPU's processing of the packets. For example, I run ZA on an N100 and it has no problem handling 500+ Mbit of throughput on WAN with shaping enabled.
Quote from: cookiemonster on December 02, 2025, 03:42:31 PM
And to clarify. Can I/should I reset as per docs on my 25.1.12 version ? Suggested testing method ?
Docs are valid for any OPNsense version.
What you should focus on is the configuration + the (basic) tuning via the BW parameter. The FQ_C configuration as well as the BW tuning methodology are in the docs.
The advanced tuning is mostly not needed; it's really only there if you want to deep-dive and squeeze the last bit out.
Regards,
S.
Hey. I've been using a Windows laptop for testing the bufferbloat so far. Normally I use Linux but have had to stay booted into Windows the last few days. It is connected to a Wi-Fi 6 (802.11ax) network using an Intel Wi-Fi 6E AX210 160MHz adapter. Depending on location I can get as little as 480/721 Mbps aggregated link speed (rx/tx), so I have a bottleneck there at times. The only wired connection is a PC, but I can't get to it most of the time.
For OPN's CPU I'm using an AMD Ryzen 5 5600U on Proxmox with two vCPUs. A ubench run on it gives: Ubench Single CPU: 910759 (0.41s). So I think that is OK.
I've now reset the shaper to the docs defaults, this time including the upload side. I need to reboot (I had limit and flows set on the pipe); I'll update the post.
Quote from: Seimus on December 02, 2025, 10:12:33 AM
On Linux you can achieve A+ as well; same as on Windows, the scores also depend on your browser performance. If you check the GitHub link and description, that testing was done on Linux in the (Floorp) Firedragon browser 12. The documentation was written, and all tests performed, on Linux as well.
Marginal differences at best, but I do get a consistent +5 to +10 ms on the download portion of the test under Linux, using the latest version of Firefox on both (and keeping all OPNsense parameters constant):
Linux: https://www.waveform.com/tools/bufferbloat?test-id=964b7180-4a1f-4eed-a114-1dfb613e9b63
Win10: https://www.waveform.com/tools/bufferbloat?test-id=edad2d94-d2c8-41e1-8b63-a31eeb2539bb
I've spent some time trying to close the gap but no luck :) Maybe it's a quirk with my motherboard's i225V (rev02) NIC and the Windows driver is just a little bit better.
Maybe that is due to the TCP congestion algorithms used. You can change it on Windows; I think under Win10 it was BBR2, but that had some problems, so they reverted to CUBIC for Win11.
With Linux, you can easily change it via sysctl. These are the values I use:
net.core.rmem_default = 2048000
net.core.wmem_default = 2048000
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 1024000 33554432
net.ipv4.tcp_wmem = 4096 1024000 33554432
# don't cache ssthresh from previous connection
#net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_moderate_rcvbuf = 1
net.ipv4.tcp_adv_win_scale = 5
# recommended to increase this for 1000 BT or higher
net.core.netdev_max_backlog = 30000
# for 10 GigE, use this
# net.core.netdev_max_backlog = 30000
net.ipv4.tcp_syncookies = 1
# Enable BBR for Kernel >= 4.9
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
Quote from: cookiemonster on December 02, 2025, 06:14:28 PM
Hey. I've been using a Windows laptop for testing the bufferbloat so far. Normally I use Linux but have had to stay booted into Windows the last few days. It is connected to a Wi-Fi 6 (802.11ax) network using an Intel Wi-Fi 6E AX210 160MHz adapter. Depending on location I can get as little as 480/721 Mbps aggregated link speed (rx/tx), so I have a bottleneck there at times. The only wired connection is a PC, but I can't get to it most of the time.
For OPN's CPU I'm using an AMD Ryzen 5 5600U on Proxmox with two vCPUs. A ubench run on it gives: Ubench Single CPU: 910759 (0.41s). So I think that is OK.
I've now reset the shaper to the docs defaults, this time including the upload side. I need to reboot (I had limit and flows set on the pipe); I'll update the post.
HW should be okay to handle ZA + Shaper and that throughput.
But keep in mind the stuff about WiFi I mentioned above.
Regards,
S.
Quote from: OPNenthu on December 02, 2025, 09:21:40 PM
Linux: https://www.waveform.com/tools/bufferbloat?test-id=964b7180-4a1f-4eed-a114-1dfb613e9b63
Win10: https://www.waveform.com/tools/bufferbloat?test-id=edad2d94-d2c8-41e1-8b63-a31eeb2539bb
These results are not bad, and if they are consistent I would call it a win.
But as mentioned by @meyergru, you may see differences due to the congestion algorithms. On Win10 I think the default is CTCP or CUBIC.
I personally run BBR on Linux.
bat /etc/sysctl.d/bbr.conf
File: /etc/sysctl.d/bbr.conf
net.core.default_qdisc=fq
net.ipv4.tcp_congestion_control=bbr
Regards,
S.
It was worth a try guys, but I'm not seeing a difference between BBR and CUBIC on my system. At least the loaded latency in all cases doesn't go above +10ms.
I have been relentlessly trying all sorts of guides and settings for bufferbloat. I have finally found a solution (1 Gbit fibre PPPoE) and am getting A grades at waveform (https://www.waveform.com/tools/bufferbloat).
I will call the guide I used (https://www.xda-developers.com/how-opnsense-traffic-shaping-improve-your-lan/) "dirty but effective".
That is because it says to use an FQ-CoDel quantum of 3000.
But when I change the quantum to the described 1500 (the MTU of the connection) I get a B instead of an A.
In addition to the guide above, I added the [control plane / ICMP] setup as described elsewhere on this forum.
So although I do not understand why, and a quantum of 3000 is neither right nor recommended, I have to say it works great over here.
Quote from: RamSense on December 03, 2025, 07:37:11 AM
I will call the guide I used (https://www.xda-developers.com/how-opnsense-traffic-shaping-improve-your-lan/) "dirty but effective".
The creator of that article doesn't look like they know what they are doing.
There are several misstatements about Pipes and Queues.
There are several misconfigurations in the Pipe > BW, Queues, Quantum.
There are several misconfigurations in the Queue > MASK, ECN, enabled CoDel.
And so on.
Some of the configured options don't do anything and some of them have an impact, like MASK, Quantum & BW. And having FQ_C in the scheduler overlap with CoDel enabled on the Queue means the CoDel algorithm effectively runs twice, fighting itself.
In layman's terms, Quantum controls how many bytes a flow queue (the internal FQ_C one) may send at once. The reason you want the Quantum at your MTU size when your BW is above 100 Mbit is to serve one packet per flow per round (there can be thousands of flows; 1 flow = 1 internal FQ_CoDel queue). A larger Quantum can starve smaller packets, and too small a Quantum can starve bigger packets. Both are relative to the configured interface MTU.
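Rough numbers to illustrate (my own Python example, assuming plain 1500 B Ethernet frames and a default quantum of about 1514 B, i.e. MTU plus hardware header):
# Quantum is the per-flow byte budget per scheduling round.
mtu, eth_hdr = 1500, 14
full_frame = mtu + eth_hdr  # 1514 bytes
for quantum in (300, 1514, 3000):
    print(f"quantum={quantum}: ~{quantum / full_frame:.1f} full-size frames "
          f"per flow per round")
# quantum=300:  ~0.2 -> a full-size packet waits several rounds (starves big packets)
# quantum=1514: ~1.0 -> one packet per flow per round, the intended behaviour
# quantum=3000: ~2.0 -> a bulk flow can send ~2 full frames before flows with
#               single small packets get their next turn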
Regards,
S.
Quote from: meyergru on December 02, 2025, 11:08:38 PM
Maybe that is due to the TCP congestion algorithms used. You can change it on Windows; I think under Win10 it was BBR2, but that had some problems, so they reverted to CUBIC for Win11.
With Linux, you can easily change it via sysctl. These are the values I use:
net.core.rmem_default = 2048000
net.core.wmem_default = 2048000
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 1024000 33554432
net.ipv4.tcp_wmem = 4096 1024000 33554432
# don't cache ssthresh from previous connection
#net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_moderate_rcvbuf = 1
net.ipv4.tcp_adv_win_scale = 5
# recommended to increase this for 1000 BT or higher
net.core.netdev_max_backlog = 30000
# for 10 GigE, use this
# net.core.netdev_max_backlog = 30000
net.ipv4.tcp_syncookies = 1
# Enable BBR for Kernel >= 4.9
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
Interesting. I did not know anything about this. Thanks @meyergru
Quote from: Seimus on December 02, 2025, 11:30:01 PM
Quote from: cookiemonster on December 02, 2025, 06:14:28 PM
Hey. I've been using a Windows laptop for testing the bufferbloat so far. Normally I use Linux but have had to stay booted into Windows the last few days. It is connected to a Wi-Fi 6 (802.11ax) network using an Intel Wi-Fi 6E AX210 160MHz adapter. Depending on location I can get as little as 480/721 Mbps aggregated link speed (rx/tx), so I have a bottleneck there at times. The only wired connection is a PC, but I can't get to it most of the time.
For OPN's CPU I'm using an AMD Ryzen 5 5600U on Proxmox with two vCPUs. A ubench run on it gives: Ubench Single CPU: 910759 (0.41s). So I think that is OK.
I've now reset the shaper to the docs defaults, this time including the upload side. I need to reboot (I had limit and flows set on the pipe); I'll update the post.
HW should be okay to handle ZA + Shaper and that throughput.
But keep in mind the stuff about WiFi I mentioned above.
Regards,
S.
So far, having gone back to exactly what the docs say, I am getting consistent B grades. It seems to confirm my testing was flawed too. Wired testing seems better, but I don't have the values at hand.
That said, although I knew to expect wired/wifi differences, I was hoping that the bufferbloat cure would help the wireless clients, which are the majority in the household; hence I was testing this way.
Is it possible, or even desirable, to tweak the shaper with wireless as the main target?
If you implement/tune FQ_C properly for the wire, it will reflect on wireless as well. But implementing/tuning it while testing from WiFi seems to me a very bad idea; wireless has so many variables that can add latency on the wireless last mile alone.
I have a lot of devices on wireless too (only one PC and the servers are wired; phones, laptops and other stuff are wireless), WiFi 6, but my wireless APs are specifically tuned to the environment I live in (I mean tweaking on the AP itself plus a lot of measurements of the wireless spectrum with scanners). A lot of those devices are TVs & PCs for IPTV, VoIP, live streams, even online gaming, basically all the good stuff that hates any jitter.
The way I implemented/tuned FQ_C was as described in the docs, with testing run from the single wired PC on the network. When I was happy with the score from the artificial tests, I turned on a live stream and ran a speedtest to see whether the stream would stutter. If the stutter was present and constant (unbearable), I tuned FQ_C again to mitigate it.
My tuning went into the advanced section as well, but that was mainly because, when the docs were created, I needed to know what each parameter does, how it behaves, and whether it even works as it should. Usually you really only need to configure it as per the docs + basic tuning (increasing/decreasing the BW) to get positive results. The main tuning parameter is the BW. All the other parameter defaults are fine out of the box except "limit", but to use "limit" you need to be at least on 25.7.8.
Regards,
S.
Alrighty. Thanks Seimus. I'm beginning to feel I'm close.
What I've always wondered is what MTU to set the Quantum to, since I have three different interfaces with different MTU settings:
Physical WAN: 1512
VLAN WAN: 1508
PPPOE WAN: 1500
The documentation doesn't appear to cover this case.
It should be set based on the interface you apply the shaping to (defined by the rule). Also, for the standard ~1500 B MTU you can leave Quantum at its default, as the default covers the 1500 B plus the 14 B hardware header.
There is very rarely a need to change the Quantum. Most cases where it needs changing are sub-100 Mbit speeds or Jumbo frames.
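A quick sanity check for the setup above (a small Python sketch, assuming the shaper rule sits on the PPPoE WAN, which is the usual place in a PPPoE setup):
# The quantum tracks the MTU of the interface the shaper rule is applied to.
# With the rule on the PPPoE WAN (MTU 1500), the default quantum (~1514 B)
# already covers one full packet plus the 14 B hardware header.
mtu_pppoe, hw_header, default_quantum = 1500, 14, 1514
print(default_quantum >= mtu_pppoe + hw_header)  # True -> leave the default alone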
Regards,
S.