Suggestion for Bufferbloat fix. Fibre to the Home. No PPPoE.

Started by cookiemonster, December 01, 2025, 07:09:25 PM

Previous topic - Next topic
Hi. Another "help with bufferbloat" thread.
I am currently on OPN 25.1.12-amd64 in a VM that has been running fine and has been updated through a few major releases. All good.
At some point in the past, perhaps two years ago, I followed one of the threads here to get decent bufferbloat help. It worked fine and I got a B on the Waveform site, with only the "low latency gaming" measure getting a "!". That was good enough for me. I don't do gaming; I only need video (MS Teams / Zoom) to work reliably when needed.
My ISP package is fibre to the premises at 520 down / 72 up. Their speeds are normally consistent.

I had what seemed like some buffering last week and went to check the settings. I realised I perhaps needed to reconfigure it, so I a) read a few recent posts (to a max of 24 months old); b) checked the current docs. I admit I can't understand the current guidance in the "limit" note of the docs, with its reference to the bug.

I decided to set it up per the docs and made a note of what I had first.
Result: consistently C grades, including across reboots when changing the flows.

I went back to what I had before and still got mostly C, sometimes B.

So this is the background. Can someone suggest what values to use?
These are the values I had before the change, but now the results _appear_ worse. And yes, it doesn't make a whole lot of sense, but I'm looking for another set of eyes in case I've stared at this too long.

Download pipe

Enabled X
Bandwidth 490
Bandwidth Metric Mbit/s
Queue
Mask (none)
Buckets
Scheduler type FlowQueue-CoDel
Enable CoDel
(FQ-)CoDel target
(FQ-)CoDel interval
(FQ-)CoDel ECN X
FQ-CoDel quantum
FQ-CoDel limit 20480
FQ-CoDel flows 8192
Enable PIE
Delay 1
Description Download pipe

 
Download queue

Enabled X
Pipe Download pipe
Weight 100
Mask destination
Buckets
Enable CoDel
(FQ-)CoDel target
(FQ-)CoDel interval
(FQ-)CoDel ECN X
Enable PIE
Description Download queue


Download rule

Enabled X
Sequence 1
Interface WAN
Interface 2 None
Protocol ip
Max Packet Length
Source any
Invert source
Src-port any
Destination any
Invert destination
Dst-port any
DSCP Nothing selected
Direction in
Target Download queue
Description Download rule
 

The mask in the Download queue should be (none). Also, you should define the Upstream side of things as well.
Intel N100, 4* I226-V, 2* 82559, 16 GByte, 500 GByte NVME, ZTE F6005

1100 down / 800 up, Bufferbloat A+

Is a downstream shaper (particularly a single queue) likely to have the effect you want? I used downstream shapers in the past, but my purpose was to control offered load by adding latency, using multiple queues on a CBQ shaper. I didn't bother after my link passed 10Mb; it did help at 6-10Mb.

I'd think a simple fair queue with no shaper would be the best option for you. I don't know the best way to accomplish that - perhaps open the pipe beyond 520Mb/s (toward single-station LAN speed). I haven't looked at the fq-codel implementation in... a while. The one I recall used a flow hash, and you could set the number of bits (up to 16, I believe). It looks like the ipfw implementation has that limit (65536). I'd think more can't hurt - fewer (potential) collisions. I wouldn't expect any negatives, but you never can tell. PIE just sounds like a RED implementation - I can't see that it'd have much if any effect, as I wouldn't expect your queue depths/times to reach discard levels.

Of course, you could have upstream issues, at any point in the path.

Quote from: meyergru on December 01, 2025, 07:28:42 PM: The mask in the Download queue should be (none). Also, you should define the Upstream side of things as well.
Yes, I tried with that removed, as per the docs. Still bad.
Anything else you can spot?
Edit: p.s. uploads seem very good in the bufferbloat tests, but I can add them to the thread no problem. I wanted to keep it as tidy as possible.

Quote from: pfry on December 01, 2025, 08:18:47 PM: Is a downstream shaper (particularly a single queue) likely to have the effect you want? I used downstream shapers in the past, but my purpose was to control offered load by adding latency, using multiple queues on a CBQ shaper. I didn't bother after my link passed 10Mb; it did help at 6-10Mb.

I'd think a simple fair queue with no shaper would be the best option for you. I don't know the best way to accomplish that - perhaps open the pipe beyond 520Mb/s (toward single-station LAN speed). I haven't looked at the fq-codel implementation in... a while. The one I recall used a flow hash, and you could set the number of bits (up to 16, I believe). It looks like the ipfw implementation has that limit (65536). I'd think more can't hurt - fewer (potential) collisions. I wouldn't expect any negatives, but you never can tell. PIE just sounds like a RED implementation - I can't see that it'd have much if any effect, as I wouldn't expect your queue depths/times to reach discard levels.

Of course, you could have upstream issues, at any point in the path.
You mean set it up as per the docs, https://docs.opnsense.org/manual/how-tos/shaper_bufferbloat.html ?
But I can try it, to see if I follow the thinking, and open the pipe beyond 520 Mbps to see what happens. Thanks for the idea.
Going a little mad with this at the moment.

Thing is, I have a decent (for me) 520 Mbps bandwidth. Normally I wouldn't bother with shaping, but I seem to get the odd bit of buffering now after this change I made. Frustratingly, it is not better, i.e. back to normal, after restoring the previous settings.

To make it factual, here are my two just-made test results:

First:
BUFFERBLOAT GRADE: B

LATENCY
Unloaded: 26 ms
Download Active: +39 ms
Upload Active: +0 ms

SPEED
↓ Download: 259.5 Mbps
↑ Upload: 66.9 Mbps

Second:
BUFFERBLOAT GRADE: B
"Your latency increased moderately under load."

LATENCY
Unloaded: 21 ms
Download Active: +42 ms
Upload Active: +0 ms

SPEED
↓ Download: 262.4 Mbps
↑ Upload: 66.8 Mbps
==
So it's giving me Bs at the moment. Is this a "good enough", leave-it-alone result? Tomorrow it might give me Cs, though. I'll keep checking.

Cookie,

Looking at your original configuration in the very first post, it looks to be misaligned with the docs.

Please align the configuration exactly as it is in the official documentation. It was tested on several different configurations (HW + WANs) and is designed to provide a proper baseline with minimal configuration needed, which usually results in B or higher scores if you at least set the BW properly.

The main point of a properly configured FQ_C is to set the BW properly and to have Pipes and Queues for both Download and Upload. The rest of the parameters should be used for advanced fine-tuning.
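For orientation, this is the rough shape of the docs' baseline, from memory and hedged (verify against https://docs.opnsense.org/manual/how-tos/shaper_bufferbloat.html); it is written as plain Python data only to show the symmetry of the setup, and the names are mine, not the docs':

# Hedged sketch of the baseline structure: two pipes, two queues, two rules.
baseline = {
    "pipes": {
        "DownPipe": {"bandwidth": "a bit below measured download",
                     "scheduler": "FlowQueue-CoDel"},
        "UpPipe":   {"bandwidth": "a bit below measured upload",
                     "scheduler": "FlowQueue-CoDel"},
    },
    "queues": {
        "DownQueue": {"pipe": "DownPipe", "weight": 100, "mask": "none"},
        "UpQueue":   {"pipe": "UpPipe",   "weight": 100, "mask": "none"},
    },
    "rules": [  # both attached to WAN
        {"direction": "in",  "target": "DownQueue"},
        {"direction": "out", "target": "UpQueue"},
    ],
}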

Quote from: cookiemonster on December 01, 2025, 07:09:25 PM: I admit I can't understand the current guidance in the "limit" note of the docs, with its reference to the bug.
Prior to OPN 25.7.8 there was a BUG that caused CPU hogging due to excessive logging when the limit queue was exceeded, so the advice was to leave Limit blank. Franco did FIX this (well, at least on the OPN side). So it is now safe and beneficial to use the Limit parameter and set it to 1000 for speeds under 10 Gbit/s.

I updated the docs as well; the PR was merged, and once Ad recompiles the docs they will be updated:
https://github.com/opnsense/docs/pull/811/files

-----------

Alright, let's dissect this:

Quote from: pfry on December 01, 2025, 08:18:47 PM: I'd think a simple fair queue with no shaper would be the best option for you. I don't know the best way to accomplish that - perhaps open the pipe beyond 520Mb/s (toward single-station LAN speed).

Your QoS/shaping should be implemented on the interface where you want to control the bottleneck, i.e. closer to the source of the bufferbloat. An FQ as such doesn't handle bufferbloat in any way; FQ only shares the BW equally amongst all the flows. To control bufferbloat you need an AQM (FQ_CoDel, FQ_PIE) or an SQM (CAKE).
Another point: you should not set your Pipe to more than you actually have; this introduces issues. You cannot give out what you don't have, in our case BW. By setting the BW higher than you have, you will end up in bufferbloat land, latency will go haywire, and you are giving control back to the ISP.
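To make the headroom arithmetic concrete, here is a minimal sketch (plain Python, purely illustrative; the 85-95% range is the starting point discussed in this thread, not a hard rule):

# The pipe BW is set below the measured line rate so the shaper, not the
# ISP's buffer, becomes the bottleneck. Figures below use the OP's rates.
measured_down_mbps = 520
measured_up_mbps = 72

for pct in (0.85, 0.90, 0.95):
    print(f"{pct:.0%}: down {measured_down_mbps * pct:.0f} Mbit/s, "
          f"up {measured_up_mbps * pct:.0f} Mbit/s")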


Quote from: pfry on December 01, 2025, 08:18:47 PM: I haven't looked at the fq-codel implementation in... a while. The one I recall used a flow hash, and you could set the number of bits (up to 16, I believe).
FQ_C creates internal flow queues per 5-tuple using a hash. Due to the stochastic nature of hashing, multiple flows may end up being hashed into the same slot. This can be controlled with the flows parameter in FQ_C.
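To illustrate the effect of the flows parameter, here is a hedged back-of-the-envelope sketch (plain Python; the classic balls-into-bins estimate, and the 500-flow figure is an arbitrary example, not a measurement):

def expected_shared(n_flows: int, m_queues: int) -> float:
    # Expected number of occupied queues is m * (1 - (1 - 1/m)^n);
    # the remaining flows landed in an already-occupied queue.
    occupied = m_queues * (1 - (1 - 1 / m_queues) ** n_flows)
    return n_flows - occupied

for m in (1024, 8192, 65535):
    print(f"flows={m}: ~{expected_shared(500, m):.0f} of 500 flows share a queue")

With 500 active flows this gives roughly 104 shared flows at 1024 queues, 15 at 8192 and 2 at 65535, which is why a high flows value reduces flow mixing.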

Quote from: pfry on December 01, 2025, 08:18:47 PM: It looks like the ipfw implementation has that limit (65536). I'd think more can't hurt - fewer (potential) collisions. I wouldn't expect any negatives, but you never can tell.
This is a very bad idea if we are talking about the "limit" parameter. Limit is effectively the queue size for the internal flows created by FQ_C. If you have a long queue but cannot process the packets in it in time, you create latency. FQ_C, because it's an AQM, measures the sojourn time of each packet in the queue, and if it is exceeded it either marks or drops the packet. But having too big a queue is still bad overall. We want to tail-drop packets when we cannot handle them, not store them.
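As a back-of-the-envelope illustration of why (a sketch in plain Python; it assumes full-size ~1500-byte packets and ignores that CoDel starts dropping well before the queue fills, so these are worst-case upper bounds):

def full_queue_delay_ms(limit_pkts: int, bw_mbps: float, pkt_bytes: int = 1500) -> float:
    # Time to drain a completely full queue at the pipe's rate.
    return limit_pkts * pkt_bytes * 8 / (bw_mbps * 1e6) * 1000

for limit in (20480, 1000):
    print(f"limit={limit}: ~{full_queue_delay_ms(limit, 490):.0f} ms worst case at 490 Mbit/s")

That is roughly 500 ms of potential standing delay at limit=20480 versus roughly 25 ms at limit=1000.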

(For reference: the limit parameter maxes out at 20480, the flows parameter at 65535.)

Setting the flows parameter higher is not a bad idea; the desired outcome is to have as few flows as possible overlapping into the same queue. But the higher this parameter is set, the more memory it takes (in reality it's not that much).
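A rough sketch of that memory cost (plain Python; the per-flow-queue state size is an assumption for illustration, not the actual dummynet struct size - the point is only the order of magnitude):

MAX_FLOWS = 65535
for state_bytes in (64, 256):  # assumed per-queue state, not measured
    print(f"{MAX_FLOWS} flow queues x {state_bytes} B "
          f"~= {MAX_FLOWS * state_bytes / 2**20:.1f} MiB")

So even at the maximum, the bookkeeping is on the order of a few MiB up to roughly 16 MiB under these assumptions.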

Rule of thumb:
Limit > below 10 Gbit/s, a good starting point is around 1000 (usable since 25.7.8)
Flows > if possible, set to the max of 65535

Quote from: pfry on December 01, 2025, 08:18:47 PM: PIE just sounds like a RED implementation - I can't see that it'd have much if any effect, as I wouldn't expect your queue depths/times to reach discard levels.
I really don't want to go into PIE too much (e.g. FQ_PIE); it works similarly to FQ_C but has a different use case, so I will just say this:

PIE
- Probabilistic, gradual
- Used in ISP networks, broadband, general traffic

CoDel
- Adaptive, based on packet age
- Used for low-latency applications, real-time traffic

Regards,
S.

Networking is love. You may hate it, but in the end, you always come back to it.

OPNSense HW
APU2D2 - deceased
N5105 - i226-V | Patriot 2x8G 3200 DDR4 | L 790 512G - VM HA(SOON)
N100   - i226-V | Crucial 16G  4800 DDR5 | S 980 500G - PROD

My router CPU is an Intel N5105 and I have OPNsense 25.7.8. I noticed that with a flows value of 65535, the CPU usage per 'top' was hitting 50% during the download portion of the bufferbloat test. Reducing flows to 16384 per pipe got the usage down to 20-25%.

I also noticed that increasing this value from its default initially caused higher latency, until I decreased the pipe BW (even lower than the recommended 85%). Not sure if that's just a side effect of the weaker CPU.

The result is very good, though. Consistent A+ when tested on a client running Windows. Drops to A when tested on the same client running Linux, but still consistent.

@cookiemonster, the hard part is finding the sweet spot for the BW value. For me it's kind of a tipping point: there's a narrow range, and if I deviate from it in either direction, the latency starts to go up again. It's not enough to matter in practice (we're talking tolerances within the A to B range) but enough to drive a perfectionist crazy. ;) I think my ISP makes it more difficult, because cable modem speeds here fluctuate throughout the day and the service is over-provisioned for short bursts, so I get 120% of the advertised speed initially before it levels out.

Honestly, I no longer remember whether flows increases CPU utilization as such, but it's possible, as it has to span more flow queues. Overall this is the desired behavior: we do not want to mix packets from flow A and flow B into the same queue. But as mentioned, it's a trade-off of extra resources.

I would not say it creates any additional latency at the initial start; it would create a persistent one if you don't have the horsepower to run it, e.g. if your CPUs are too weak. This is not due to FQ_C but due to the shaper, as logical shaping/QoS is a CPU-intensive task. Most likely the latency seen is due to the variable nature of your internet connection.

I am on cable too; half a year ago my ISP had a capacity problem where speeds were extremely variable during peak hours, so there was basically no constant throughput rate. Even in this case I ran FQ_C, because FQ_C can handle it: instead of 2 s of latency, it kept things in check at up to 100 ms when connectivity dropped from 500 Mbit to 300 Mbit (pipe set to 495 Mbit).

You can achieve A+ on Linux as well; as on Windows, the scores also depend on your browser's performance. If you check the GitHub link and description, that testing was done on Linux in the (Floorp) Firedragon browser 12. The documentation was written, and all tests performed, on Linux as well.

Regards,
S.
Networking is love. You may hate it, but in the end, you always come back to it.

OPNSense HW
APU2D2 - deceased
N5105 - i226-V | Patriot 2x8G 3200 DDR4 | L 790 512G - VM HA(SOON)
N100   - i226-V | Crucial 16G  4800 DDR5 | S 980 500G - PROD

Very good information. Thank you @OPNethu, your observation about the BW is interesting.
@Seimus, I'm very thankful for the advice. I'll need to digest it a bit and go back to resetting everything as per the docs, BUT I am on OPN 25.1.12 and worry about what other changes an upgrade to the latest might bring, unrelated to the shaper. And yes, setting the BW right seems to be the hardest part. I just tested and got an A. I am closer to the AP for this test, so it seems my testing methodology is something I need to be more conscious of. The measured BW was 151 Mbps for this A result, which makes me suspect the results a little.

Also, a rookie question, but I'll ask: do Zenarmor / CrowdSec interfere when running the bufferbloat tests?
And to clarify: can I / should I reset as per the docs on my 25.1.12 version? Any suggested testing method?

I would advise running the test over a cable. Unless you have at least WiFi 6, plus all the BW available in the channel, plus no noise or overlap on the channel, testing via WiFi is not advised, as any of those three things can introduce wireless-specific latency.


Quote from: cookiemonster on Today at 03:42:31 PM: Also, a rookie question, but I'll ask: do Zenarmor / CrowdSec interfere when running the bufferbloat tests?
Not directly and not by intent. This comes back to the CPU bottleneck: if your CPU cannot keep up, you will see latency introduced by the CPU's processing of the packets. For example, I run ZA on an N100, and there is no problem handling 500+ Mbit of throughput on WAN with shaping enabled.

Quote from: cookiemonster on Today at 03:42:31 PM: And to clarify: can I / should I reset as per the docs on my 25.1.12 version? Any suggested testing method?
The docs are valid for any OPNsense version.
What you should focus on is the configuration plus the (basic) tuning via the BW parameter. The FQ_C configuration as well as the BW tuning methodology are in the docs.
The advanced tuning is mostly not needed; it's really just for when you want to deep-dive and squeeze things.

Regards,
S.
Networking is love. You may hate it, but in the end, you always come back to it.

OPNSense HW
APU2D2 - deceased
N5105 - i226-V | Patriot 2x8G 3200 DDR4 | L 790 512G - VM HA(SOON)
N100   - i226-V | Crucial 16G  4800 DDR5 | S 980 500G - PROD

Hey. I've been using a Windows laptop for testing the bufferbloat so far. Normally I use Linux but have needed to stay booted into Windows the last few days. The laptop is connected to a Wi-Fi 6 (802.11ax) network using an Intel Wi-Fi 6E AX210 160 MHz adapter. Depending on location I can get as little as 480/721 Mbps aggregated link speed (rx/tx), so I have a bottleneck there at times. The only wired connection is a PC, and I can't get to it most of the time.
For OPN's CPU I'm using an AMD Ryzen 5 5600U on Proxmox with two vCPUs. I just did a ubench run on it, which gives: Ubench Single CPU: 910759 (0.41s). So I think that is OK.
I've now reset the shaper to the docs' defaults, this time including the upload side. I need to reboot (I had limit and flows set on the pipe); I'll update the post.

Quote from: Seimus on Today at 10:12:33 AM: You can achieve A+ on Linux as well; as on Windows, the scores also depend on your browser's performance. If you check the GitHub link and description, that testing was done on Linux in the (Floorp) Firedragon browser 12. The documentation was written, and all tests performed, on Linux as well.

Marginal differences at best, but I do get a consistent +5 to +10 ms on the download portion of the test under Linux, using the latest version of Firefox on both (and keeping all OPNsense parameters constant):

Linux: https://www.waveform.com/tools/bufferbloat?test-id=964b7180-4a1f-4eed-a114-1dfb613e9b63
Win10: https://www.waveform.com/tools/bufferbloat?test-id=edad2d94-d2c8-41e1-8b63-a31eeb2539bb

I've spent some time trying to close the gap, but no luck. :) Maybe it's a quirk of my motherboard's i225V (rev02) NIC and the Windows driver is just a little bit better.

Maybe that is due to the TCP congestion control algorithms used. You can change it in Windows; I think under Win10 it was BBR2, but that had some problems, so they reverted to CUBIC for Win11.

With Linux, you can easily change it via sysctl. These are the values I use:

net.core.rmem_default = 2048000
net.core.wmem_default = 2048000
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 1024000 33554432
net.ipv4.tcp_wmem = 4096 1024000 33554432

# don't cache ssthresh from previous connection
#net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_moderate_rcvbuf = 1
net.ipv4.tcp_adv_win_scale = 5
# recommended to increase this for 1000 BT or higher; same value for 10 GigE
net.core.netdev_max_backlog = 30000
net.ipv4.tcp_syncookies = 1
# Enable BBR for Kernel >= 4.9
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
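For reference, these typically go in /etc/sysctl.conf (or a file under /etc/sysctl.d/) and are applied with "sysctl -p" or "sysctl --system"; the bbr setting additionally needs the tcp_bbr kernel module available. That's generic Linux behaviour, nothing OPNsense-specific.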
Intel N100, 4* I226-V, 2* 82559, 16 GByte, 500 GByte NVME, ZTE F6005

1100 down / 800 up, Bufferbloat A+