Hello,
I've upgraded my ISP plan to 10G symmetric and configured my DEC3840 accordingly.
Bizarrely enough, 10G clients are getting full speed, while 1G clients are getting ~300M download / 940M upload.
Here's my topology:
ISP--10G-->DEC3840--10G-->Meraki MS125-24P-->Clients (10G, 2.5G, 1G)
Any clue, anyone?
How are you testing that? TCP/UDP? One stream or multiple? What type of server/client?
Also: From the looks of it, I would argue that the sender overwhelms your client by sending too much data and packets are lost. With a TCP-based test, this should not happen because of the congestion algorithm in use, but different algorithms may yield better or worse results.
Especially with speeds > 1 GBit/s, there are many tuning parameters to be considered and also testing methodology matters (https://forum.opnsense.org/index.php?topic=37407.msg183443#msg183443).
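For example, a quick way to separate testing methodology from the actual problem is iperf3 in both directions and with single vs. multiple streams (the server address below is just a placeholder):

# single TCP stream, client -> server (upload)
iperf3 -c <server-ip> -t 30
# single TCP stream, server -> client (download) - the direction that shows the ~300M issue
iperf3 -c <server-ip> -t 30 -R
# 8 parallel streams in the download direction
iperf3 -c <server-ip> -t 30 -R -P 8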
Thanks. Testing was done with a simple speedtest.
Did you see huge fluctuations in the graph? You could also try https://www.waveform.com/tools/bufferbloat
No fluctuations whatsoever.
Just to be clear: with WAN on an igbX (1G) port I get the full 940/940. When WAN is on ax0, 10G clients get full bandwidth whilst 1G clients get 300/940.
Makes sense?
Yes, I understood that. What I am saying is that with 10 GBit/s instead of 1 GBit/s, your provider or the OOKLA server sends too much traffic at once for your 1 GBit/s clients, such that packets get lost. When that is detected, TCP traffic is suspended for a short while until it restarts (this is done by the congestion algorithm). Thus, traffic can "oscillate / fluctuate". Normally, this should be handled by TCP itself, but sometimes, intermediaries add to this.
This is known as bufferbloat. You can configure your OpnSense shaper to fight that if that is the problem. The Waveform test site shows you if you have a bufferbloat problem.
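If you want to see which congestion algorithm is actually in play, the stock sysctls will tell you on both ends (FreeBSD on the OPNsense box, the Linux equivalents on a client; nothing here is DEC3840-specific):

# on OPNsense / FreeBSD: current and available TCP congestion control algorithms
sysctl net.inet.tcp.cc.algorithm
sysctl net.inet.tcp.cc.available
# on a Linux client
sysctl net.ipv4.tcp_congestion_control
sysctl net.ipv4.tcp_available_congestion_control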
Probably a good case for flow control, assuming it's off currently and isn't the cause of the problems ;)
Ultimately, that's what it's good at: controlling the flow when stepping 'down', i.e. 10G -> 1G, and helping to stop micro-bursts.
Usually 'off' on the WAN side, 'on' everywhere else or at least the uplink/trunks and 1G ports (opnsense, switches, devices).
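On a Linux client you can check and toggle the negotiated pause frames with ethtool (eth0 here is just a placeholder for the actual interface name); the Meraki and OPNsense sides are configured in their respective UIs:

# show current pause frame (flow control) settings
ethtool -a eth0
# enable RX/TX pause frames on the NIC
ethtool -A eth0 rx on tx on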
Quote from: meyergru on December 08, 2023, 06:54:48 PM
Yes, I understood that. What I am saying is that with 10 GBit/s instead of 1 GBit/s, your provider or the OOKLA server sends too much traffic at once for your 1 GBit/s clients, such that packets get lost. When that is detected, TCP traffic is suspended for a short while until it restarts (this is done by the congestion algorithm). Thus, traffic can "oscillate / fluctuate". Normally, this should be handled by TCP itself, but sometimes, intermediaries add to this.
This is known as bufferbloat. You can configure your OpnSense shaper to fight that if that is the problem. The Waveform test site shows you if you have a bufferbloat problem.
I get that, however I'm not affected by bufferbloat.
The fact that uploads are constantly 940 also suggests the root cause is somewhere else.
Moreover DEC3840 CPU is capable of routing ~17Gbps
I'm a bit lost on how to get to the bottom of this...
Quote from: NW4FUN on December 08, 2023, 09:48:23 PM
I get that, however I'm not affected by bufferbloat.
The fact that uploads are constantly 940 also suggests the root cause is somewhere else.
No, that is incorrect. You have a bottleneck only in the downstream direction when your OpnSense can receive data at 10 GBit/s but the client device can only handle 1 GBit/s. That is not a problem in the other direction.
To the contrary, exactly this proves my point: You get 10 GBit/s at your OpnSense downstream, but it cannot hand that down to your slower client devices.
Quote
Moreover DEC3840 CPU is capable of routing ~17Gbps
That is irrelevant to the problem at hand. As I said: Your OpnSense can handle 10 GBit downstream, your clients can not. Thus, packets received by OpnSense are lost and the overhead of retrying causes the degradation.
This is a problem that does not turn up very often: not many people have a WAN connection that is 10 times faster than their clients.
To analyse it, you would have to use packet traces and look at them with Wireshark. But the interesting question is: how to fix it? Traffic shaping at the firewall will not help, unless you reduce the downstream bandwidth to 1 GBit/s, which you do not want.
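If you do take packet traces, a capture on the LAN-facing interface during a speedtest is enough to spot retransmissions and window behaviour in Wireshark; the interface name and client address below are placeholders:

# capture traffic for one 1G client on the LAN interface, write to a file for Wireshark
tcpdump -i igb1 -s 96 -w 1g-client.pcap host 192.168.1.50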
What client do you test with? Are your results consistent with different OSes?
Quote from: NW4FUN on December 08, 2023, 09:48:23 PM
I'm a bit lost on how to get to the bottom of this...
Flow Control.
Turn it on, on the switch/Meraki (at least the trunk/uplink and all 1G ports), leave it off on the WAN side, make sure it has not been disabled on opnsense (it's usually enabled by default).
As @meyergru has pointed out, your 10G WAN is likely causing microbursts (at rates higher than 1G) that saturate the 1G interfaces/ports. Throughput then scales back. You do not have this problem in the reverse direction, as that is 1G -> 10G.
You likely have 3 options:
- Some form of QoS, to limit only the 1G clients (see the sketch after this list). This is less than ideal: either you restrict all 1G clients to a shared total of 1G, or you create multiple 1G pipes (for example one per 1G client/port/interface), which might cause other problems elsewhere.
- Flow control on the switch, so that pause frames are sent to stop the 1G ports being hammered by the 10G upstream
- Make the buffers deeper on the 1G switch ports, to handle the microburst from the 10G
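For option #1, this would normally be done via the Shaper in the OPNsense GUI; underneath it is ipfw/dummynet, so conceptually it boils down to something like the sketch below (interface name, subnet and the 940 Mbit figure are placeholders, not an exact config):

# cap downstream towards the 1G segment, with CoDel queue management
ipfw pipe 1 config bw 940Mbit/s codel
# send WAN->LAN traffic for the 1G clients through that pipe
ipfw add 100 pipe 1 ip from any to 192.168.1.0/24 in recv ax0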
Bufferbloat shows A+ (+1ms D/U)...
Moreover, iPerf results are super solid on 10G, 2.5G and 1G clients alike, proving OPNsense is configured properly (I guess).
What's left for me to wonder about is the SFP+ RJ45 transceiver on the WAN. If that's not it, what could it be?
Pleased you got it all figured out ;)
Quote from: iMx on December 09, 2023, 02:09:16 PM
Pleased you got it all figured out ;)
The spirit of this forum is supposed to be collaborative, and I honestly do not appreciate this useless sarcasm of yours.
As for your suggested approaches:
#1 - as per your own comment, this is not desirable
#2 and #3 - My switches are Meraki, which are built in a self-healing fashion, so I'm not able to play with buffers and/or limit flows on specific ports
I'm at a point where I'm just trying to narrow down the possible causes to fewer scenarios. Logic suggests - but I might very well be mistaken, hence suggestions are welcome - that the transceiver on the WAN might not be working properly.
What do you guys think?
Thanks
Quote from: NW4FUN on December 09, 2023, 03:44:33 PM
I'm at a point where I'm just trying to narrow down the possible causes to fewer scenarios. Logic suggests - but I might very well be mistaken, hence suggestions are welcome - that the transceiver on the WAN might not be working properly.
Up to this point, you do not accept suggestions, but instead hop from one wild guess to another. This is another good example: Ask yourself this: If the WAN transceiver was broken, how could your 10G clients get the full speed?
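For what it's worth, you can also rule the WAN link itself in or out from the OPNsense shell; whether ifconfig -v shows module data depends on the driver, and ax0 is simply the WAN interface name used in this thread:

# transceiver/module details, if the driver exposes them
ifconfig -v ax0
# per-interface error and drop counters
netstat -I ax0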
About collaboration: You did not answer questions, we do not even know which type of client you try this with. Why is this relevant? Because if you try with Windows clients, you are likely to have problems with the default settings of the MS TCP stack, see this for what I mean (https://borncity.com/win/2023/02/14/microsofts-tcp-mess-how-to-optimize-in-windows-10-11/).
Also, what does "built in a self-healing fashion" mean? Did you try to enable Flow Control or egress limiting on your switches?
Good luck, you will surely figure this out by yourself.
Quote from: meyergru on December 09, 2023, 04:20:42 PM
Quote from: NW4FUN on December 09, 2023, 03:44:33 PM
I'm at a point where I'm just trying to narrow down the possible causes to fewer scenarios. Logic suggests - but I might very well be mistaken, hence suggestions are welcome - that the transceiver on the WAN might not be working properly.
Up to this point, you do not accept suggestions, but instead hop from one wild guess to another. This is another good example: Ask yourself this: If the WAN transceiver was broken, how could your 10G clients get the full speed?
About collaboration: You did not answer questions, we do not even know which type of client you try this with. Why is this relevant? Because if you try with Windows clients, you are likely to have problems with the default settings of the MS TCP stack, see this for what I mean (https://borncity.com/win/2023/02/14/microsofts-tcp-mess-how-to-optimize-in-windows-10-11/).
Also, what does "built in a self-healing fashion" mean? Did you try to enable Flow Control or egress limiting on your switches?
Good luck, you will surely figure this out by yourself.
I am suggesting my WAN transceiver might not be compatible, I'm not saying it is broken.
10G clients are capped at 300 down like the 1G ones (my bad, I did not post the update), whilst they are at full capacity on upload.
As for clients, they are all OSX or Linux. There's neither Windows nor Android stuff around here (and there never will be).
Re your comment on flow control: on Meraki you can only play with QoS at best (https://documentation.meraki.com/MS/Other_Topics/QoS_(Quality_of_Service)), and that has not been configured either.
A bit less attitude would be appreciated ;)
I have the exact same problem as what you originally posted: 10G clients are fine, but 1G clients and wireless clients connecting via an AP with a 1G port have the issue where only downloads are affected. Downloads are usually a little slower but inconsistent, and iperf is fine for both 10G and 1G clients.