OPNsense Forum

English Forums => Hardware and Performance => Topic started by: johnoatwork on December 03, 2021, 11:21:39 am

Title: 10GB LAN Performance
Post by: johnoatwork on December 03, 2021, 11:21:39 am
Hi All,

I've got a LAN performance issue that I'm having problems isolating and I could really use some help.

A simplified version of the infrastructure is set out below:

OPNsense 21.7
Dell R620, Dual Xeon E5-2680 v2 @ 2.80GHz CPUs
Dual Chelsio T520-CR 10Gb NICs
Stacked Dell Force10 S4810s
OPNsense and Proxmox/Windows servers LACP bond to the S4810s

It all works, but here is what I'm finding:

If I run speedtest-cli from OPNsense I get throughput between 5 and 8 Gbps depending on the time of day. All good
If I run speedtest from a Proxmox or Windows Server connected through OPNsense, the throughput ranges from 850Mbps to 1900Mbps, i.e. ~10% of the WAN throughput
If I run iperf3 as a server on the OPNsense LAN interface and hit it with a Proxmox or Windows Server client, same result, i.e. max throughput a bit over 1Gbps

This is only a problem with connections to OPNsense.  I have LAGGs on other internal networks that are getting nearly line speed with an identical configuration.

I've checked and rechecked the switch and server configurations:

The switchports comprising the LAGGs all show as connected at 10Gbps
The OPNsense and Proxmox/Windows server LAGGs on the switch show as connected at 20Gbps
The LAGGs are all configured correctly and the partners are all bundled
I've tried using jumbo packets and tweaking the kernel on the Proxmox server, but it doesn't make much difference.

I just don't get it. If the performance hit were associated with packet filtering I would expect to see some load on the OPNsense CPUs, but the dashboard has them at barely 20% during testing. Anyway, I get the same result with packet filtering completely disabled on OPNsense.

Pulling my hair out on this.  Any tips or pointers would be greatly appreciated.

johno
Title: Re: 10GB LAN Performance
Post by: rungekutta on December 03, 2021, 01:50:53 pm
Check that OpnSense is using netmap in native mode. I just went through quite a struggle to get decent performance on the same NICs (coming from a T420-CR to begin with). It was not at all trivial, and none of it is properly documented anywhere. See https://forum.opnsense.org/index.php?topic=25263.0
Title: Re: 10GB LAN Performance
Post by: johnoatwork on December 05, 2021, 11:26:27 pm
Hi rungekutta, thanks for taking the time to respond. 

I went through the steps in your post on the Chelsio card. Very informative, but using VFs instead of PFs didn't make any difference for me. Netmap is loaded for the VFs but I'm not sure if it is in emulated or native mode. Do you know if it's possible to set this in the driver?

Code: [Select]
# dmesg | grep vcxl | grep netmap
vcxl0: netmap queues/slots: TX 8/1023, RX 8/1024
vcxl0: 8 txq, 8 rxq (NIC); 1 txq (ETHOFLD); 8 txq, 8 rxq (netmap)
vcxl1: netmap queues/slots: TX 8/1023, RX 8/1024
vcxl1: 8 txq, 8 rxq (NIC); 1 txq (ETHOFLD); 8 txq, 8 rxq (netmap)
vcxl2: netmap queues/slots: TX 8/1023, RX 8/1024
vcxl2: 8 txq, 8 rxq (NIC); 1 txq (ETHOFLD); 8 txq, 8 rxq (netmap)
vcxl3: netmap queues/slots: TX 8/1023, RX 8/1024
vcxl3: 8 txq, 8 rxq (NIC); 1 txq (ETHOFLD); 8 txq, 8 rxq (netmap)

I agree there isn't a lot of good doco around on the Chelsio cards and I'm kinda wishing I'd gone Intel, but it's a bit late now as I have quite a few of the T520s. But I did pick up a few things along the way that might help others struggling with these cards.

## Show current firmware

Code: [Select]
root@pfw1:~ # sysctl dev.t5nex.0.firmware_version
dev.t5nex.0.firmware_version: 1.25.6.0
root@pfw1:~ # sysctl dev.t5nex.1.firmware_version
dev.t5nex.1.firmware_version: 1.25.4.0

Interestingly, running the latest firmware on these cards isn't necessarily optimal. The driver includes a compatible blob that may be earlier than the firmware on the card, but by default it won't downgrade to the compatible firmware. You can fix that with this tunable in loader.conf.local:
Code: [Select]
hw.cxgbe.fw_install="2"
Then:
Code: [Select]
root@pfw1:~ # dmesg | grep 1.23
t5nex0: firmware on card (1.26.2.0) is different than the version bundled with this driver, installing firmware 1.23.0.0 on card.
t5nex1: firmware on card (1.26.2.0) is different than the version bundled with this driver, installing firmware 1.23.0.0 on card.

Anyway, that's as far as I've got with it. I'm heading up to the DC today to check if there are any cabling issues but I doubt it.  Wish me luck!
Title: Re: 10GB LAN Performance
Post by: rungekutta on December 06, 2021, 01:07:28 pm
That’s native mode. If in emulated mode the log will say something like

Quote
generic_netmap_register   Emulated adapter for cxgbe1 activated

Also make sure that your interface assignments are against these VFs (vcxl) and not cxgbe.
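
A quick way to double-check from the shell (a sketch based on the log lines above; the exact wording may vary between FreeBSD versions):
Code: [Select]
# emulated adapters announce themselves via generic_netmap_register at attach time
dmesg | grep -i generic_netmap
No output there, together with the "netmap queues/slots" lines you already posted, points to native mode.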

Last but not least, are you running Suricata? With that enabled I was never able to top 1Gb/s through the fw, even without any rules active. So there's clearly some kind of bottleneck there too.
Title: Re: 10GB LAN Performance
Post by: johnoatwork on December 06, 2021, 11:15:27 pm
Definitely using the VFs.

I'm at the DC now and I've swapped around some cables, but no joy. I also:

* Swapped the offending Chelsio NIC for an Intel NIC
* Ran up pfSense to see if it was an OPNsense issue

Got exactly the same result with both tests. So it seems it's not the Chelsio card or the distro that's the issue.

I'm doing some testing now to rule out the switch and after that I'm out of ideas.  I'll post the solution if I have a breakthrough.
Title: Re: 10GB LAN Performance
Post by: cookiemonster on December 16, 2021, 11:38:31 pm
Are you able to swap to a Linux-based firewall, to rule out *BSD?
Title: Re: 10GB LAN Performance
Post by: johnoatwork on December 17, 2021, 12:37:39 am
Actually I've run up both TNSR and VyOS as VMs with SR-IOV passthrough VFs from the T520-CRs. I don't have the numbers yet for VyOS, but with TNSR and a basic set of ACLs, performance right out of the box is double what I was getting with FreeBSD packet filters. The thing with TNSR of course is that it doesn't have a nice GUI like OPNsense. It's reasonably easy to configure from the CLI, but staying on top of it from an ongoing management perspective would likely be a struggle for me.

Anyway, I've ordered an X710-DA2 for testing (the Intel card I previously tested with was an X520). Still hoping I can do this with OPNsense as it really is a great product. But if I can't pinpoint the throughput issues I'll have to run with a Linux-based distro.
Title: Re: 10GB LAN Performance
Post by: testo_cz on December 17, 2021, 08:59:58 am
I'd say: try connecting another PC with 10GbE to your switch stack,
to clearly see whether the bottleneck is the Proxmox/Windows boxes OR the OPNsense server.

Throughput around 1Gbps on this 20G setup seems crazy low to me. Unless you have left some IPS or shaping settings on in OPNsense -- what is the CPU load on the OPNsense server when you test throughput?
Title: Re: 10GB LAN Performance
Post by: johnoatwork on December 17, 2021, 11:43:39 pm
Thanks for the feedback. I've tested with servers connected through the switch. I've also direct-connected a server to the firewall to rule out the switch, but same result. To restate the issue: speedtest from OPNsense itself gets 5-8Gbps, while servers connected through OPNsense top out at a bit over 1Gbps.

For others who come across this thread, I note that the T520-CRs were originally installed with generic DACs, and WAN throughput from OPNsense was less than 450Mbps on a 10Gbps link. I swapped the cables out for fibre with genuine Chelsio transceivers at the card and Dell at the switch, and the WAN throughput came up to acceptable levels (just not for servers connected through the firewall)!
Title: Re: 10GB LAN Performance
Post by: rungekutta on December 18, 2021, 10:36:54 am
Quote
Actually I've run up both TNSR and VyOS as VMs with SR-IOV passthrough VFs from the T520-CRs. I don't have the numbers yet for VyOS, but with TNSR and a basic set of ACLs, performance right out of the box is double what I was getting with FreeBSD packet filters.
That's to say you got double 1Gb/s i.e. still only ~2Gb/s forwarding performance? Still sounds very low. Would be interesting to hear your equivalent VyOS performance.

Quote
Anyway, I've ordered an X710-DA2 for testing (the Intel card I previously tested with was an X520). Still hoping I can do this with OPNsense as it really is a great product. But if I can't pinpoint the throughput issues I'll have to run with a Linux-based distro.
After all my woes (https://forum.opnsense.org/index.php?topic=25263.15) I managed to get forwarding performance up to ~5 Gb/s through the Chelsio T520-SO-CR and Ryzen hardware, so it's a bit weird that your performance is so low after having followed similar steps. Will be interesting to hear your results on the Intel X710. And as mentioned on Linux also. NB that's a side project for me as well - setting up a minimal Debian 11 with routing and firewall through nftables, also unbound and dhcp server etc. Not got as far as live-testing it yet but curious how it will perform in comparison.
Title: Re: 10GB LAN Performance
Post by: johnoatwork on January 01, 2022, 03:59:38 am
Bit of an update on this. After swapping the Chelsio cards for Intel X710-DA2s and getting more or less the same result, I've figured out at least the iperf issue. iperf3 is single-threaded: even if you run it with the -P option it still only hits one CPU core. If you want multithreaded operation you have to use iperf2.

I'd been checking CPU utilisation on the firewall dashboard while iperf3 was running and not seeing any significant numbers, but when I checked with top directly from the console, the single CPU core being hit by iperf was running at close to 100%.

So I installed iperf2 and ran it multithreaded and boom! Near wire speed with 20+ concurrent threads!!
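
For anyone following along, iperf2 and iperf3 live side by side in the FreeBSD package collection (a sketch; assumes your configured repo carries the benchmarks/iperf port, which is version 2):
Code: [Select]
pkg install iperf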

Running iperf continuously makes it easier to monitor top. For those who are interested, this runs the iperf2 client continuously with 50 threads (with an iperf2 server, iperf -s, listening at the other end):

Code: [Select]
iperf -c hostname -tinf -P 50
Then run top on the firewall like this:

Code: [Select]
top -PCH
But here's what I don't get. If I run iperf2 *through* the firewall to a server on the same 10Gbps network segment as the WAN, I get around 5Gbps with a single thread and 7-8Gbps multithreaded. But the same client running speedtest-cli peaks at around 1Gbps. Looking at top on the firewall while speedtest is running doesn't show any significant CPU utilisation, and anyway, if the firewall is only running a single thread for speedtest it should realistically be capable of way better than 1Gbps (half of that with Win10!).

The obvious culprit is the ISP network, but I'm still getting up to 8Gbps running speedtest directly from the firewall. I've also tested with mtr (no data loss and super-low latency) and tracepath (no MTU issues all the way through to 1.1.1.1).
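
For reference, those checks were along these lines (a sketch from memory, run from a Linux box; flags may differ between versions):
Code: [Select]
# 100-cycle report-mode mtr to look for loss and latency along the path
mtr -rwc 100 1.1.1.1
# check path MTU end to end
tracepath 1.1.1.1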

In summary, here is what I have found:
  • There is not much difference I can tell in performance between the Intel X710-DA2 and the Chelsio T520-CRs
  • The internal 10Gbps network and attached clients are healthy and can transfer data at close to wire speed
  • The overhead from packet filtering on the firewall (passing iperf traffic) is 2-3Gbps which is bearable. Faster CPUs might reduce this, but with 10 cores engaged utilisation is only about 25-30%
  • The ISP upstream network is healthy
So I'm not sure why there is such a big difference in firewall throughput between speedtest and iperf. I'm guessing speedtest uses tcp/443 and iperf defaults to tcp/5001 (5201 for iperf3).

Unless the firewall is doing additional processing for tcp/443? I don't have any special rules set up for https and there is no IDS running at the moment. I'm going to have a close look at the proxy setup to see if that leads anywhere.
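
One way to test that hunch would be to run iperf itself on tcp/443 (an untested sketch; assumes a spare box on the far side with nothing else bound to port 443):
Code: [Select]
# on the test server (root needed for ports below 1024)
iperf -s -p 443
# on a LAN client, through the firewall
iperf -c hostname -p 443 -t 30 -P 20
If tcp/443 through the firewall is just as fast as tcp/5001, that rules out port-based processing; if it is slow, the proxy becomes the prime suspect.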
Title: Re: 10GB LAN Performance
Post by: johnoatwork on January 01, 2022, 04:41:16 am
Quote
Would be interesting to hear your equivalent VyOS performance.

Actually I didn't go any further with VyOS. The version I downloaded for initial testing was an old version available on their website. As I understand it, you need to be a contributor or buy a subscription to get the current version. The only other alternative seems to be an untested rolling release. I don't mind paying for support for open source software, but I'm not interested in paying what they are asking just for a proof of concept with the current stable release.
Title: Re: 10GB LAN Performance
Post by: testo_cz on January 01, 2022, 06:42:24 pm
Quote
Bit of an update on this. After swapping the Chelsio cards for Intel X710-DA2s and getting more or less the same result I've figured out at least the iperf issue. iperf3 is single threaded, even if you run it with the -P option it still only hits one CPU core. If you want multithreaded operation you have to use iperf2.

I'd been checking CPU utilisation on the firewall dashboard while iperf3 was running and not seeing any significant numbers, but when I checked with top directly from the console the single CPU core being hit by iperf was running at close to 100%.

So I installed iperf2 and ran it multithreaded and boom! Near wire speed with 20+ concurrent threads!!

Running iperf continuously makes it easier to monitor top. For those who are interested, this runs the iperf2 client continuously with 50 threads:

Code: [Select]
iperf -c hostname -tinf -P 50
Then run top on the firewall like this:

Code: [Select]
top -PCH
But here's what I don't get. If I run iperf2 *through* the firewall to a server on the same 10Gbps network segment as the WAN, I get around 5Gbps with a single thread and 7-8Gbps multithreaded. But the same client running speedtest cli peaks at around 1Gbps. Looking at top on the firewall while speedtest is running doesn't show any significant CPU utilisation and anyway, if the firewall is only running a single thread for speedtest realistically it should be capable of way better than 1Gbps (half of that with WIN10!).

The obvious culprit is the ISP network but I'm still getting up to 8Gbps running speedtest directly from the firewall. I've also tested with mtr (no data loss and super low latency) and tracepath (no mtu issues all the way through to 1.1.1.1).

In summary, here is what I have found:
  • There is not much difference I can tell in performance between the Intel X710-DA2 and the Chelsio T520-CRs
  • The internal 10Gbps network and attached clients are healthy and can transfer data at close to wire speed
  • The overhead from packet filtering on the firewall (passing iperf traffic) is 2-3Gbps which is bearable. Faster CPUs might reduce this, but with 10 cores engaged utilisation is only about 25-30%
  • The ISP upstream network is healthy
So I'm not sure why there is such a big difference in firewall throughput between speedtest and iperf. I'm guessing speedtest uses tcp/443 and iperf defaults tcp/5001 (5201 for iperf3).

Unless the firewall is doing additional processing for tcp/443? I don't have any special rules set up for https and there is no IDS running at the moment. I'm going to have a close look at the proxy setup see if that leads anywhere.

Nice finding about iperf2 vs iperf3. Thanks.

I think "rungekutta" reported about similar forwarding performance as yours in iperf3 testing.
My ESX based testbed (only 10G capable I've got) runs also something over 5Gbps  with iperf3 but I haven't tweaked it much.

When you mentioned "10 cores engaged utilisation is only about 25-30%", does that mean that each of the ten CPU cores is utilized at 25-30%?

A few things I would check:
whether power management allows the CPU to scale its frequency up:
Code: [Select]
sysctl -a | grep cpu | grep freq

whether these network tunables for multicore CPUs are on:
Code: [Select]
net.isr.maxthreads = "-1"
net.isr.bindthreads = "1"

whether flow control per network interface is off:
Code: [Select]
dev.ixl.#.fc = "0"

Further, I'd say one could try increasing the number of RX/TX queues and descriptors.
If not through ixl(4) then iflib(4)-based tunables might let you do so. Check the sysctl values of 'nrxqs', 'ntxqs', 'nrxds' and 'ntxds' and see if you can override them to make them bigger. Overrides require a reboot, I guess.
Docs e.g. here:
https://www.freebsd.org/cgi/man.cgi?query=iflib&sektion=4&apropos=0&manpath=FreeBSD+12.2-RELEASE+and+Ports
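For example, to see the current values before overriding anything (a sketch; assumes ixl interfaces and the iflib(4) sysctl node names):
Code: [Select]
sysctl dev.ixl.0.iflib | grep -E 'override_n(rx|tx)(qs|ds)'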

This approach boosted forwarding performance on my ESX setup with vmx interfaces.


With regard to the speedtest-cli to the Internet, I'd say try tcpdump/wireshark on both sides of the firewall to see if the packets flow nicely as expected or if there are resends, rubbish or something strange going on.
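
Something like this on the firewall, capturing the same flow on both interfaces (a sketch; the interface names are just examples):
Code: [Select]
# LAN-side capture of the speedtest flow
tcpdump -i vcxl0 -s 96 -w lan.pcap 'tcp port 443'
# WAN-side capture in a second shell
tcpdump -i vcxl2 -s 96 -w wan.pcap 'tcp port 443'
Then open both pcaps in Wireshark and filter on tcp.analysis.retransmission or tcp.analysis.flags.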
Title: Re: 10GB LAN Performance
Post by: johnoatwork on January 01, 2022, 10:52:24 pm
Hey testo_cz, thanks for the ideas.  I've previously tried tweaking the queues and did see some performance improvement but that config got lost along the way. I'll give it another try along with the other things you suggested and report back.

Quote
When you mentioned "10 cores engaged utilisation is only about 25-30%", does that mean that each of the ten CPU cores is utilized at 25-30%?

Yes that's correct, the load seems to be more or less evenly distributed across the CPUs. But you have to use the switches for top like this: "top -PCH". Otherwise it just reports a single oversubscribed CPU (like 550% etc.).
Title: Re: 10GB LAN Performance
Post by: johnoatwork on January 02, 2022, 09:02:33 am

Quote
Few shots I would check:
if power management allows CPU to scale its frequency up ?
Code: [Select]
sysctl -a | grep cpu | grep freq

if this network tunables for multicore CPUs are on:
Code: [Select]
net.isr.maxthreads = "-1"
net.isr.bindthreads = "1"

if flow control per network interface is off ?
Code: [Select]
dev.ixl.#.fc = "0"

Further, I'd say that one may try to increase number of RX/TX queues and descriptors.
If not ixl(4) then iflib(4) based tunables might let you to do so. Check the sysctl values of 'nrxqs', 'ntxqs', 'nrxds' and 'ntxds' and see if you may override them to make them bigger/larger. Overrides require reboot , I guess.
Docs e.g. here:
https://www.freebsd.org/cgi/man.cgi?query=iflib&sektion=4&apropos=0&manpath=FreeBSD+12.2-RELEASE+and+Ports

This approach boosted forwarding performance on my ESX setup with vmx interfaces.

With regard to the speedtest-cli to the Internet, I'd say try tcpdump/wireshark on both sides of the firewall to see if the packets flow nicely as expected or if there are resends, rubbish or something strange going on.

Well I rigged up a test with tshark and a Debian iperf server (I call it FAUX_WAN) and captured a bunch of data. I was a bit worried at first because I was seeing quite a few retransmits, but on digging into it I believe this is just the nature of TCP. I managed to drive the link through the firewall as high as 8Gbps with multithreaded iperf, and some retransmission is inevitable when you are flogging the link like that. I did implement the tunables you suggested and I think they helped, so thanks again.
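
For anyone repeating this, counting retransmissions in a capture is a one-liner (a sketch; assumes the analysis box has tshark installed):
Code: [Select]
tshark -r capture.pcap -Y 'tcp.analysis.retransmission' | wc -l
A handful of retransmits at 8Gbps is expected; a large fraction of all segments would point at a real problem.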

Re the low performance results from speedtest.net: the obvious issue is that what you get back from any particular server depends on what else it is doing and how much other traffic it is processing. Here in Australia, I've found that speedtest reports on average only 25-30% of what I see from iiNet (https://www.iinet.net.au/internet-products/broadband/speed-test/). I'm not an iiNet customer, but I'm more inclined to believe their numbers, which show I can get ~800Mbps down and up to 2.7Gbps up on a Windows VM with 2 cores and 4GB RAM. In the end I'm not entirely sure how I got here, but that's good enough for me :)

Thanks to everyone who contributed to the discussion. I just love open source software!
Title: Re: 10GB LAN Performance
Post by: johnoatwork on January 02, 2022, 09:30:43 am
Oh, and for those who might come across this, here is my /boot/loader.conf.local for the X710-DA2. Not sure it's optimal, and I'm always interested in suggestions on how it can be improved.

Code: [Select]
kern.ipc.nmbclusters=1000000
kern.ipc.nmbjumbop=524288
hw.intr_storm_threshold=10000
net.inet.tcp.tso=0
net.isr.maxthreads=-1
net.isr.bindthreads=1
dev.ixl.0.iflib.override_qs_enable=1
dev.ixl.1.iflib.override_qs_enable=1
dev.ixl.0.iflib.override_nrxqs=128
dev.ixl.1.iflib.override_nrxqs=128
dev.ixl.0.iflib.override_ntxqs=128
dev.ixl.1.iflib.override_ntxqs=128
dev.ixl.0.iflib.override_nrxds=128
dev.ixl.1.iflib.override_nrxds=128
dev.ixl.0.iflib.override_ntxds=128
dev.ixl.1.iflib.override_ntxds=128
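After a reboot you can sanity-check that the overrides took effect (a sketch; sysctl names per iflib(4), and the dmesg grep just shows what the driver actually allocated):
Code: [Select]
sysctl dev.ixl.0.iflib.override_nrxqs dev.ixl.0.iflib.override_ntxqs
dmesg | grep ixl0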
Title: Re: 10GB LAN Performance
Post by: rungekutta on May 29, 2022, 12:47:34 am
Quote
After all my woes (https://forum.opnsense.org/index.php?topic=25263.15) I managed to get forwarding performance up to ~5 Gb/s through the Chelsio T520-SO-CR and Ryzen hardware, so it's a bit weird that your performance is so low after having followed similar steps. Will be interesting to hear your results on the Intel X710. And as mentioned on Linux also. NB that's a side project for me as well - setting up a minimal Debian 11 with routing and firewall through nftables, also unbound and dhcp server etc. Not got as far as live-testing it yet but curious how it will perform in comparison.

Ok so I'm opening up this thread again now, because I've done exactly this. The results are kind of interesting.

Note first and foremost that I'm a big fan of OpnSense. The admin GUI is superb and it has really served me well (and continues to do so) and helped me get up the curve on networking stuff.

All that said, I think the testing reveals some differences between Linux and FreeBSD, or at least FreeBSD as configured in OpnSense. My Linux setup is a minimal Debian 11 VM in Proxmox with nftables for firewall & routing and dnsmasq for DNS & DHCP. Dnsmasq forwards DNS to unbound as a resolver. A bunch of rules that control traffic between the various internal networks, and a bunch of DNAT forwarding to services on the DMZ, with hairpin.
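
For flavour, the skeleton of that kind of ruleset looks roughly like this (a minimal sketch rather than my actual config; the lan0/wan0 names and the DMZ address are made up):
Code: [Select]
#!/usr/sbin/nft -f
flush ruleset
table inet filter {
  chain forward {
    type filter hook forward priority 0; policy drop;
    ct state established,related accept
    iifname "lan0" accept comment "LAN out to anywhere"
  }
}
table ip nat {
  chain prerouting {
    type nat hook prerouting priority -100;
    iifname "wan0" tcp dport 443 dnat to 192.0.2.10 comment "dnat https to a dmz host"
  }
  chain postrouting {
    type nat hook postrouting priority 100;
    oifname "wan0" masquerade
  }
}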

I applied minimal tuning - increased some TCP buffers etc. Don't know if that made any difference or not.

Out of the box, iperf3 against 2 other servers on the internal network is a solid 9.4-9.8 Gb/s and Debian still runs 95% idle. NAT routing performance out on the 10Gb WAN (using fast.com and speedtest.net) varies according to client and time of day between 6-8 Gb/s while the Debian VM idles at 98% (!).

Note that I'm not running suricata or any fancy metrics or instrumentation (only nftables stats).

Still, this is quite some difference. I never managed to get much more than 5-6Gb/s through OpnSense on the same hardware, and the CPU had to work much harder too.

Maybe it partly comes down to kernel optimizations for Ryzen? In any case, I wasn't expecting quite such a difference.

Would OpnSense ever consider re-basing on Linux? I realize it would be a non-trivial exercise... The TrueNAS folks did it though...
Title: Re: 10GB LAN Performance
Post by: jclendineng on June 04, 2022, 03:34:12 pm
It may just come down to the BSDs lacking driver support for many things, which is where Linux shines since the community is much, much bigger. That said, I can get a solid 9.8Gb/s NATted over OPNsense, running an older Xeon here. I get 9.8Gb/s from OPNsense to my Unraid server running iperf3 (using -P 8 for 8 streams), but Unraid server to OPNsense is only ~5-6Gb/s, sometimes 4, though I think that has more to do with Unraid at that point, since I'm able to do 10Gb/s one way.
Title: Re: 10GB LAN Performance
Post by: lilsense on June 04, 2022, 11:41:02 pm
OPNsense is BSD based. You should try Vyatta if you are looking for Linux. No need to change what's rockin'.

I also am able to sustain 9.8Gbps for minutes without any resource use... LOL
Title: Re: 10GB LAN Performance
Post by: rungekutta on June 06, 2022, 10:56:34 am
Good to hear those speeds are achievable. What NICs do you guys use? Did you have to fiddle with tunables in order to get the performance?

Fwiw I looked at Vyatta also but didn’t really see the point. Nftables in itself is straightforward enough so not so much gained vs a vanilla Debian - where you also get more flexibility. In both cases losing out vs OpnSense’s awesome gui.
Title: Re: 10GB LAN Performance
Post by: lilsense on June 06, 2022, 03:02:33 pm
lol... Let me understand this... you like the GUI and want to change the entire underlay to something else because of the GUI???? LOL

I like Ferraris, so let's make all 18-wheeler trucks look like Ferraris! LOL
Title: Re: 10GB LAN Performance
Post by: rungekutta on June 06, 2022, 11:46:46 pm
Indeed, I like the product including its GUI and plug-in ecosystem and its community. And I am raising the question whether in the long run it would be better off based on Linux than on BSD. I understand it's a sensitive topic for some. Alas iXsystems and Netgate both seem to be heading in that direction, so it's not like nobody ever thought of it before. But feel free to lol if that makes you feel better ;-). Or maybe add some actual thoughts on the topic.
Title: Re: 10GB LAN Performance
Post by: lilsense on June 06, 2022, 11:49:44 pm
You seem not to understand the BSD ecosystem. It's not your fault and that's OK.

No worries though.... :)
Title: Re: 10GB LAN Performance
Post by: jclendineng on June 07, 2022, 12:15:27 am
Quote
Good to hear those speeds are achievable. What NICs do you guys use? Did you have to fiddle with tunables in order to get the performance?

Fwiw I looked at Vyatta also but didn’t really see the point. Nftables in itself is straightforward enough so not so much gained vs a vanilla Debian - where you also get more flexibility. In both cases losing out vs OpnSense’s awesome gui.

Mellanox ConnectX-3 10gb SFP dual port here, 1 to WAN and 1 to my LAN. No tunables set up.
Title: Re: 10GB LAN Performance
Post by: rungekutta on June 07, 2022, 07:22:10 pm
Quote
You seem not to understand the BSD ecosystem. It's not your fault and that's OK.
Thank you for your thoughtful contribution to the topic.
Title: Re: 10GB LAN Performance
Post by: rungekutta on June 07, 2022, 07:27:17 pm
Quote
Mellanox ConnectX-3 10gb SFP dual port here, 1 to WAN and 1 to my LAN. No tunables set up.

That’s interesting. I have Chelsio NICs, which are supposedly well supported, but I had to mess around with tunables and settings before I managed to get netmap to run in native mode and offer half decent performance. https://forum.opnsense.org/index.php?topic=25263.0
Title: Re: 10GB LAN Performance
Post by: jclendineng on July 17, 2022, 02:47:03 am
Quote
Mellanox ConnectX-3 10gb SFP dual port here, 1 to WAN and 1 to my LAN. No tunables set up.

That’s interesting. I have Chelsio NICs, which are supposedly well supported, but I had to mess around with tunables and settings before I managed to get netmap to run in native mode and offer half decent performance. https://forum.opnsense.org/index.php?topic=25263.0

Further testing is seeing suboptimal speeds. One direction is 9.4Gb/s or so (fine), the other direction is 5-6Gb/s. I think that comes down to single-thread performance. I have a good (older) CPU in a rack server and iperf doesn't even touch it, but I'm thinking single-thread is the issue, as iperf3 is single-threaded even with -P set: that only sets the number of streams, and (my understanding is) multi-stream is still all single-threaded per the original dev.
Title: Re: 10GB LAN Performance
Post by: lilsense on July 17, 2022, 01:43:10 pm
Quote
You seem not to understand the BSD ecosystem. It's not your fault and that's OK.
Thank you for your thoughtful contribution to the topic.
Just to clarify this: Netgate's TNSR is Linux based and NOT free. Netgate is doing this for $$$$. TrueNAS is doing it for a totally different reason, and that's got nothing to do with FreeBSD networking being slow.

Here's an example of a driver you can use on FreeBSD, bnxt(4), to get 100G:
Broadcom BCM57454 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb Ethernet

And!!!

Here's how Netflix is using FreeBSD:
https://papers.freebsd.org/2021/eurobsdcon/gallatin-netflix-freebsd-400gbps/

Here's the YouTube video:
https://www.youtube.com/watch?v=_o-HcG8QxPc

I can't wait for this year's EuroBSDcon for this topic:
The “other” FreeBSD optimizations used by Netflix to serve video at 800Gb/s from a single server
Title: Re: 10GB LAN Performance
Post by: rungekutta on July 21, 2022, 06:37:35 pm
Thanks for demystifying that earlier comment.  ;)

No doubt FreeBSD, like other Unixes and like Linux, is capable of producing great results. And those Netflix stats are impressive. However, note also that the use case is very specific. They stream static files from SSDs and have carefully optimized and tuned everything along the way, from software to OS to hardware and drivers. In some cases they have found and removed bottlenecks and submitted the fixes back to FreeBSD (e.g. async sendfile). They have also worked closely with AMD and Mellanox and others.

I’m sure some of that has benefitted FreeBSD more broadly, but I don’t know how relevant it is to most users on this forum who are trying to get good filtering, routing and forwarding performance out of a range of different hardware, from small appliances to enterprise. And in addition, on a relatively complex setup that involves netmap, automatically generated firewall rules, and additional software such as suricata layered on top. That’s quite a different gig.. and the complexity of it, in combination with the range of available x86 hardware, is presumably why this forum is so full of people reporting such a wide range of experience from OpnSense in terms of performance.

Btw, if you’re looking for other extreme examples, Linux reached a forwarding packet rate of 1 terabit back in 2017 ;)
https://www.fiercetelecom.com/telecom/linux-foundation-s-fd-io-virtual-switch-project-doubles-packet-throughput-to-terabit-speeds

For the regular user though, I’m not sure this is any more relevant than the Netflix example. And I notice this is going off-topic as well - my intention was not to start an o/s war. Sorry. Will stop there.