10Gb LAN Performance

Started by johnoatwork, December 03, 2021, 11:21:39 AM

Hi All,

I've got a LAN performance issue that I'm having problems isolating and I could really use some help.

A simplified version of the infrastructure is set out below:

OPNsense 21.7
Dell R620, Dual Xeon E5-2680 v2 @ 2.80GHz CPUs
Dual Chelsio T520-CR 10Gb NICs
Stacked Dell Force10 S4810s
OPNsense and Proxmox/Windows servers LACP bond to the S4810s

It all works, but here is what I'm finding:

If I run speedtest-cli from OPNsense I get throughput between 5 and 8 Gbps depending on the time of day. All good.
If I run speedtest from a Proxmox or Windows Server connected through OPNsense the throughput ranges from 850Mbps to 1900Mbps, i.e. roughly 10-25% of the WAN throughput.
If I run iperf3 as a server on the OPNsense LAN interface and hit it with a Proxmox or Windows Server client I get the same result, i.e. max throughput a bit over 1Gbps.
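
For reference, this is roughly how I'm running the iperf3 tests (the address and duration below are just placeholders). On the firewall:

iperf3 -s

And from a Proxmox or Windows server, where 10.0.0.1 stands in for the firewall's LAN address:

iperf3 -c 10.0.0.1 -t 30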

This is only a problem with connections to OPNsense.  I have LAGGs on other internal networks that are getting nearly line speed with an identical configuration.

I've checked and rechecked the switch and server configurations:

The switchports comprising the LAGGs all show as connected at 10Gbps
The OPNsense and Proxmox/Windows server LAGGs on the switch show as connected at 20Gbps
The LAGGs are all configured correctly and the partners are all bundled
I've tried using jumbo frames and tweaking the kernel on the Proxmox server, but it doesn't make much difference (a sketch of the jumbo frame setup is below).
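
For anyone wanting to try the same thing, the jumbo frame test on the Proxmox side amounts to raising the MTU on the bond and bridge, something like this (bond0/vmbr0 are just example names, and the switchports and the OPNsense LAGG would need a matching MTU for it to take effect):

ip link set dev bond0 mtu 9000
ip link set dev vmbr0 mtu 9000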

I just don't get it. If the performance hit were associated with packet filtering I would expect to see some hit on the OPNsense CPUs, but the dashboard has them at barely 20% during testing. Anyway, I get the same result with packet filtering completely disabled on OPNsense.

Pulling my hair out on this.  Any tips or pointers would be greatly appreciated.

johno

Check that OPNsense is using netmap in native mode. I just went through quite a struggle to get decent performance on the same NICs (coming from a T420-CR to begin with). It was not at all trivial and none of it is properly documented anywhere. See https://forum.opnsense.org/index.php?topic=25263.0

Hi rungekutta, thanks for taking the time to respond. 

I went through the steps in your post on the Chelsio card. Very informative, but using VFs instead of PFs didn't make any difference for me. Netmap is loaded for the VFs but I'm not sure if it is in emulated or native mode. Do you know if it's possible to set this in the driver?

# dmesg | grep vcxl | grep netmap
vcxl0: netmap queues/slots: TX 8/1023, RX 8/1024
vcxl0: 8 txq, 8 rxq (NIC); 1 txq (ETHOFLD); 8 txq, 8 rxq (netmap)
vcxl1: netmap queues/slots: TX 8/1023, RX 8/1024
vcxl1: 8 txq, 8 rxq (NIC); 1 txq (ETHOFLD); 8 txq, 8 rxq (netmap)
vcxl2: netmap queues/slots: TX 8/1023, RX 8/1024
vcxl2: 8 txq, 8 rxq (NIC); 1 txq (ETHOFLD); 8 txq, 8 rxq (netmap)
vcxl3: netmap queues/slots: TX 8/1023, RX 8/1024
vcxl3: 8 txq, 8 rxq (NIC); 1 txq (ETHOFLD); 8 txq, 8 rxq (netmap)


I agree there isn't a lot of good doco around on the Chelsio cards and I'm kinda wishing I'd gone Intel, but it's a bit late now as I have quite a few of the T520s. But I did pick up a few things along the way that might help others struggling with these cards.

## Show current firmware

root@pfw1:~ # sysctl dev.t5nex.0.firmware_version
dev.t5nex.0.firmware_version: 1.25.6.0
root@pfw1:~ # sysctl dev.t5nex.1.firmware_version
dev.t5nex.1.firmware_version: 1.25.4.0


Interestingly, running the latest firmware on these cards isn't necessarily optimal. The driver includes a compatible blob that may be earlier than the firmware on a card, but by default it won't downgrade to the compatible firmware. You can fix it with this tunable in loader.conf.local:
hw.cxgbe.fw_install="2"
Then:
root@pfw1:~ # dmesg | grep 1.23
t5nex0: firmware on card (1.26.2.0) is different than the version bundled with this driver, installing firmware 1.23.0.0 on card.
t5nex1: firmware on card (1.26.2.0) is different than the version bundled with this driver, installing firmware 1.23.0.0 on card.


Anyway, that's as far as I've got with it. I'm heading up to the DC today to check if there are any cabling issues but I doubt it.  Wish me luck!

That's native mode. If in emulated mode the log will say something like

generic_netmap_register   Emulated adapter for cxgbe1 activated
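
A quick way to check whether that ever happened is to grep the boot log for it, e.g.:

dmesg | grep -i "emulated adapter"

If that comes back empty and you see the vcxl netmap queue lines instead, you should be in native mode.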

Also make sure that your interface assignments are against these VFs (vcxl) and not cxgbe.
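
In case it helps anyone else: if I remember right, the vcxl VFs only exist because of an extra-virtual-interfaces setting in loader.conf.local, something like the below (check cxgbe(4) and the thread I linked earlier for the details, this is from memory):

hw.cxgbe.num_vis="2"

After a reboot each port should get a vcxl interface alongside the regular one, and those vcxl interfaces are the ones to assign in OPNsense.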

Last but not least, are you running Suricata? With that enabled I was never able to top 1Gb/s through the fw even without any rules active. So there's clearly some kind of bottleneck there too.

Definitely using the VFs.

I'm at the DC now and I've swapped around some cables but no joy.  I also:

* Swapped the offending Chelsio NIC for an Intel NIC
* Ran up pfSense to see if it was an OPNsense issue

Got exactly the same result with both tests. So it seems it's not the Chelsio card or the distro that's the issue.

I'm doing some testing now to rule out the switch and after that I'm out of ideas.  I'll post the solution if I have a breakthrough.

Are you able to swap to a Linux-based firewall, to rule out *BSD?

Actually I've run up both TNSR and VyOS as VMs with SR-IOV passthrough VFs from the T520-CRs. I don't have the numbers yet for VyOS, but with TNSR and a basic set of ACLs, performance right out of the box is double what I was getting with FreeBSD packet filters. The thing with TNSR of course is that it doesn't have a nice GUI like OPNsense. It's reasonably easy to configure from the CLI, but staying on top of it from an ongoing management perspective would likely be a struggle for me.

Anyway, I've ordered an X710-DA2 for testing (the Intel card I previously tested with was an X520). Still hoping I can do this with OPNsense as it really is a great product. But if I can't pinpoint the throughput issues I'll have to run with a Linux-based distro.

I'd say: try connecting another PC with 10GbE to your switch stack, to see clearly whether the bottleneck is the Proxmox/Windows servers or the OPNsense server.

Throughput around 1Gbps on this 20G setup seems crazy low to me. Unless you have left some IPS or shaping settings on in OPNsense -- what is the CPU load on the OPNsense server when you test throughput?

Thanks for the feedback. I've tested with servers connected through the switch. I've also direct-connected a server to the firewall to rule out the switch, but got the same result. To restate the issue:

• Speedtest-cli run on OPNsense shows that the ISP network is performing at between 5 and 8Gbps
• Speedtest-cli run on a server connected through OPNsense gets max throughput of less than 2Gbps, typically 1-1.5Gbps. Not running Suricata or anything else
• iperf from a connected server to the inner OPNsense interface also tops out under 1.5Gbps, even with packet filtering disabled

For others who come across this thread, I note that the T520-CRs were originally installed with generic DACs and WAN throughput from OPNsense was less than 450Mbps on a 10Gbps link. I swapped the cables out for fibre with genuine Chelsio transceivers at the card and Dell at the switch, and the WAN throughput came up to acceptable levels (just not for servers connected through the firewall)!

December 18, 2021, 10:36:54 AM #9 Last Edit: December 18, 2021, 10:38:58 AM by rungekutta
Quote from: johnoatwork on December 17, 2021, 12:37:39 AM
Actually I've run up both TNSR and VyOS as VMs with SR-IOV passthrough VFs from the T520-CRs. I don't have the numbers yet for VyOS, but with TNSR and a basic set of ACLs performance right out of the box is double what I was getting with FreeBSD packet filters.
That's to say you got double 1Gb/s, i.e. still only ~2Gb/s forwarding performance? That still sounds very low. Would be interesting to hear your equivalent VyOS performance.

Quote from: johnoatwork on December 17, 2021, 12:37:39 AM
Anyway, I've ordered an X710-DA2 for testing (the Intel card I previously tested with was an X520). Still hoping I can do this with OPNsense as it really is a great product. But if I can't pinpoint the throughput issues I'll have to run with a Linux-based distro.
After all my woes (https://forum.opnsense.org/index.php?topic=25263.15) I managed to get forwarding performance up to ~5 Gb/s through the Chelsio T520-SO-CR on Ryzen hardware, so it's a bit weird that your performance is so low after having followed similar steps. Will be interesting to hear your results on the Intel X710, and, as mentioned, on Linux as well. NB that's a side project for me too - setting up a minimal Debian 11 with routing and firewall through nftables, plus unbound and a DHCP server etc. Not got as far as live-testing it yet but curious how it will perform in comparison.

January 01, 2022, 03:59:38 AM #10 Last Edit: January 01, 2022, 04:54:53 AM by johnoatwork
Bit of an update on this. After swapping the Chelsio cards for Intel X710-DA2s and getting more or less the same result, I've figured out at least the iperf issue. iperf3 is single-threaded; even if you run it with the -P option it still only hits one CPU core. If you want multithreaded operation you have to use iperf2.
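
(As an aside, a workaround if you want to stay on iperf3 is to run several server instances on different ports with one client per port, roughly like this, with 10.0.0.1 standing in for the server address:

iperf3 -s -p 5201 &
iperf3 -s -p 5202 &
iperf3 -c 10.0.0.1 -p 5201 -t 30 &
iperf3 -c 10.0.0.1 -p 5202 -t 30 &

Each instance is its own process, so together they can use more than one core.)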

I'd been checking CPU utilisation on the firewall dashboard while iperf3 was running and not seeing any significant numbers, but when I checked with top directly from the console, the single CPU core being hit by iperf was running at close to 100%.

So I installed iperf2 and ran it multithreaded and boom! Near wire speed with 20+ concurrent threads!!

Running iperf continuously makes it easier to monitor top. For those who are interested, this runs the iperf2 client continuously with 50 threads:

iperf -c hostname -tinf -P 50

Then run top on the firewall like this:

top -PCH
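
The listening side is just iperf2 in server mode on the target box:

iperf -s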

But here's what I don't get. If I run iperf2 *through* the firewall to a server on the same 10Gbps network segment as the WAN, I get around 5Gbps with a single thread and 7-8Gbps multithreaded. But the same client running speedtest-cli peaks at around 1Gbps. Looking at top on the firewall while speedtest is running doesn't show any significant CPU utilisation, and anyway, if the firewall is only running a single thread for speedtest it should realistically be capable of way better than 1Gbps (half of that with Win10!).

The obvious culprit is the ISP network, but I'm still getting up to 8Gbps running speedtest directly from the firewall. I've also tested with mtr (no data loss and super low latency) and tracepath (no MTU issues all the way through to 1.1.1.1).
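
In case anyone wants to repeat those checks, the sort of commands I mean are (the probe count is arbitrary):

mtr -rwc 100 1.1.1.1
tracepath 1.1.1.1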

In summary, here is what I have found:

• There is not much difference that I can tell in performance between the Intel X710-DA2s and the Chelsio T520-CRs
• The internal 10Gbps network and attached clients are healthy and can transfer data at close to wire speed
• The overhead from packet filtering on the firewall (passing iperf traffic) is 2-3Gbps, which is bearable. Faster CPUs might reduce this, but with 10 cores engaged utilisation is only about 25-30%
• The ISP upstream network is healthy

So I'm not sure why there is such a big difference in firewall throughput between speedtest and iperf. I'm guessing speedtest uses tcp/443 while iperf defaults to tcp/5001 (5201 for iperf3).

Unless the firewall is doing additional processing for tcp/443? I don't have any special rules set up for https and there is no IDS running at the moment. I'm going to have a close look at the proxy setup and see if that leads anywhere.
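
One way I figure I could test the tcp/443 theory is to run the iperf2 server on port 443 on a box on the WAN-side segment (assuming that port is free there):

iperf -s -p 443

and then from a client behind the firewall (the address, duration and thread count are only placeholders):

iperf -c 10.0.0.1 -p 443 -t 60 -P 20

If throughput drops to speedtest levels on 443 only, that would point at the firewall treating that port differently.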

Quote from: rungekutta on December 18, 2021, 10:36:54 AM
Would be interesting to hear your equivalent VyOS performance.

Actually I didn't go any further with VyOS. The version I downloaded for initial testing was an old one available on their website. As I understand it, you need to be a contributor or buy a subscription to get the current version. The only other alternative seems to be an untested rolling release. I don't mind paying for support for open source software, but I'm not interested in paying what they are asking just for a proof of concept with the current stable release.

Quote from: johnoatwork on January 01, 2022, 03:59:38 AM
...iperf3 is single-threaded; even if you run it with the -P option it still only hits one CPU core. If you want multithreaded operation you have to use iperf2.

Nice finding about iperf2 vs iperf3. Thanks.

I think "rungekutta" reported similar forwarding performance to yours in iperf3 testing.
My ESX-based testbed (the only 10G-capable one I've got) also runs at something over 5Gbps with iperf3, but I haven't tweaked it much.

When you mentioned "10 cores engaged utilisation is only about 25-30%", does that mean that each of the ten CPU cores is utilized at 25-30%?

A few things I would check:

Whether power management allows the CPU to scale its frequency up:

sysctl -a | grep cpu | grep freq

Whether these network tunables for multicore CPUs are on:

net.isr.maxthreads = "-1"
net.isr.bindthreads = "1"

Whether flow control is off per network interface:

dev.ixl.#.fc = "0"

Further, I'd say one may try to increase the number of RX/TX queues and descriptors.
If not ixl(4), then iflib(4)-based tunables might let you do so. Check the sysctl values of 'nrxqs', 'ntxqs', 'nrxds' and 'ntxds' and see if you may override them to make them larger. Overrides require a reboot, I guess.
Docs e.g. here:
https://www.freebsd.org/cgi/man.cgi?query=iflib&sektion=4&apropos=0&manpath=FreeBSD+12.2-RELEASE+and+Ports
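
For example, with ixl and iflib the overrides would go into loader.conf.local roughly like this - the device index and values are only examples, check iflib(4) for the exact tunables on your FreeBSD version:

dev.ixl.0.iflib.override_nrxqs="8"
dev.ixl.0.iflib.override_ntxqs="8"
dev.ixl.0.iflib.override_nrxds="4096"
dev.ixl.0.iflib.override_ntxds="4096"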


This approach boosted forwarding performance on my ESX setup with vmx interfaces.


With regards to the speedtest-cli to the Internet, I'd say try tcpdump/wireshark on both sides of the firewall to see if the packets flow nicely as expected, or if there are resends, rubbish or something strange going on.
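
Something along these lines on the firewall, one capture per side - the interface names are placeholders, and filtering on port 443 since that is presumably what speedtest uses:

tcpdump -ni vcxl0 -w lan_side.pcap port 443
tcpdump -ni vcxl1 -w wan_side.pcap port 443

Then compare the two captures in Wireshark.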




Hey testo_cz, thanks for the ideas. I've previously tried tweaking the queues and did see some performance improvement, but that config got lost along the way. I'll give it another try along with the other things you suggested and report back.

Quote from: testo_cz on January 01, 2022, 06:42:24 PM
When you mentioned "10 cores engaged utilisation is only about 25-30%", does that mean that each of the ten CPU cores is utilized at 25-30%?

Yes, that's correct, the load seems to be more or less evenly distributed across the CPUs. But you have to use the switches for top like this: "top -PCH". Otherwise it just reports a single CPU as oversubscribed (like 550% etc).

Quote from: testo_cz on January 01, 2022, 06:42:24 PM
With regards to the speedtest-cli to the Internet, I'd say try tcpdump/wireshark on both sides of the firewall to see if the packets flow nicely as expected, or if there are resends, rubbish or something strange going on.

Well, I rigged up a test with tshark and a Debian iperf server (I call it FAUX_WAN) and captured a bunch of data. I was a bit worried at first because I was seeing quite a few retransmits, but on digging into it I believe this is just the nature of TCP. I managed to drive the link through the firewall as high as 8Gbps with multithreaded iperf, and some retransmission is inevitable when you are flogging the link like that. I did implement the tunables you suggested and I think they helped, so thanks again.
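
For the record, counting the retransmits out of a capture is just a display filter in tshark, roughly like this, with capture.pcap standing in for the actual file:

tshark -r capture.pcap -Y tcp.analysis.retransmission | wc -l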

Re the low performance results from speedtest.net, the obvious issue is that what you get back from any particular server depends on what else it is doing and how much other traffic it is processing. Here in Australia, I've found that speedtest reports on average only 25-30% of what I see from iiNet (https://www.iinet.net.au/internet-products/broadband/speed-test/). I'm not an iiNet customer but I'm more inclined to believe their numbers, which show I can get ~800Mbps down and up to 2.7Gbps up on a Windows VM with 2 cores and 4GB RAM. In the end I'm not entirely sure how I got here, but that's good enough for me :)

Thanks to everyone who contributed to the discussion. I just love open source software!