
[Tutorial/Call for Testing] Enabling Receive Side Scaling on OPNsense


tuto2:
Hi all,

In a future 21.7.x OPNsense release, in-kernel support for Receive Side Scaling will be included. The implementation of RSS is coupled with PCBGROUP – a mechanism which introduces notions of CPU affinity for connections. While the latter is of lesser importance for OPNsense, since it specifically applies to connections built up in userland using sockets (which is relevant to servers, not middleboxes), the idea of distributing work at a lower level with hardware support provides a myriad of benefits – especially with regard to multithreading in Suricata in our use case (please see the note at the end of this post).

Without going into too much technical detail, I’ll provide a short description of the inner workings of RSS – as well as how to set up the correct tunables to ensure steady state operation. All of this will hopefully also serve as a troubleshooting guide.

Overview

RSS is used to distribute packets over CPU cores using a hashing function – either with hardware support, which offloads the hashing for you, or in software. The idea is to take the TCP 4-tuple of a packet (source address, source port, destination address, destination port) as input, hash it using an in-kernel defined key, and use the least significant bits of the resulting value as an index into a user-configurable indirection table. The indirection table is loaded into the hardware during boot and is used by the NIC to decide which CPU to interrupt with a given packet. All of this allows packets of the same origin/destination (a.k.a. flows) to be queued consistently on the same CPU.

By default, RSS will be disabled since its impact is quite far-reaching. Only enable this feature if you’re interested in testing it and seeing whether it increases your throughput under high load – such as when using IDS/IPS. Since I do not have every type of hardware available to me – nor the time to test all of it – no guarantee is given that a NIC driver will properly handle the kernel implementation or is even capable of using it.


The NIC/Driver

Assuming you are using a modern NIC which supports multiple hardware queues and RSS, the configuration of the NIC decides how and on which queue packets arrive on your system. This is hardware dependent and will not be the same for every NIC. Should your driver support the option to enable/disable RSS, a sysctl tunable will be available. You can search for one using

--- Code: ---sysctl -a | grep rss
--- End code ---
or (assuming you are using, for example, the axgbe driver)

--- Code: ---sysctl dev.ax | grep rss
--- End code ---
Sticking with the axgbe example, rss can be enabled by setting

--- Code: ---dev.ax.0.rss_enabled = 1
dev.ax.1.rss_enabled = 1
--- End code ---
in the OPNsense System->Settings->Tunables interface.
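
After a reboot, you can verify on the console that the setting took effect (sticking with the axgbe example):

--- Code: ---sysctl dev.ax.0.rss_enabled dev.ax.1.rss_enabled
--- End code ---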


It is also possible that a driver does not expose this ability to the user, in which case you’d want to look up whether the NIC/driver supports RSS at all – using online datasheets or a simple Google search. For example, igb enables RSS by default, but does not reflect this in any configuration parameter. However, since it uses multiple queues:

--- Code: ---dmesg | grep vectors

igb0: Using MSI-X interrupts with 5 vectors
igb1: Using MSI-X interrupts with 5 vectors
igb2: Using MSI-X interrupts with 5 vectors
igb3: Using MSI-X interrupts with 5 vectors

--- End code ---
it will most likely have some form of packet filtering to distribute packets over the hardware queues – in igb's case, this is RSS.

For most NICs, RSS is the primary method of deciding which CPU to interrupt with a packet. NICs that do not implement any other type of filter and whose RSS feature is missing or turned off will most likely interrupt only CPU 0 at all times – which reduces potential throughput due to cache line migrations and lock contention. Please keep system-wide RSS disabled if this is the case.
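
A quick way to check whether interrupts are actually being spread across CPUs is to look at the per-queue interrupt counters. A minimal check, assuming an igb NIC (queue naming differs per driver):

--- Code: ---# each rx/tx queue accumulates its own interrupt count over time;
# if only queue 0 ever increases, packets are most likely all hitting CPU 0
vmstat -i | grep igb
--- End code ---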

Last but not least, driver support for the in-kernel implementation of RSS is a must. Proper driver support ensures the correct key and indirection table are set in hardware. Drivers which support RSS according to the source code (but mostly untested; a quick way to check which driver your NIC uses is sketched after this list):

* em
* igb -> tested & working
* axgbe -> tested & working
* netvsc
* ixgbe
* ixl
* cxgbe
* lio
* mlx5
* sfxge
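
If you are unsure which driver a NIC is attached to, it can be read from the PCI device listing. A quick, illustrative check:

--- Code: ---# the device name in the selector line (e.g. igb0, ax0, ixl0) tells you the driver
pciconf -lv | grep -B4 network
--- End code ---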

The Kernel

Internally, FreeBSD uses netisr as an abstraction layer for dispatching packets to the upper protocols. Within the implementation, the default setting is to restrict packet processing to one thread only. Since RSS now provides a way to keep flows local to a CPU, the following sysctls should be set in System->Settings->Tunables:


--- Code: ---net.isr.bindthreads = 1
--- End code ---
causes threads to be bound to a CPU

--- Code: ---net.isr.maxthreads = -1
--- End code ---
assigns a workstream to each CPU core available.
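
Whether these values were picked up can be verified after a reboot – a minimal check using the stock netisr sysctls:

--- Code: ---# bindthreads should read 1, and numthreads should equal the number of CPUs
sysctl net.isr.bindthreads net.isr.maxthreads net.isr.numthreads
--- End code ---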


Furthermore, the RSS implementation also provides a few necessary sysctls:


--- Code: ---net.inet.rss.enabled = 1
--- End code ---
makes sure RSS is enabled. Disabled by default to prevent regressions on NICs that do not properly implement the RSS interface.


--- Code: ---net.inet.rss.bits = X
--- End code ---
This one depends on the number of cores you have. By default, the number of bits here represents twice the number of cores in binary. This is done on purpose to allow for load balancing, but since there is no current implementation for that, I recommend setting this value to the number of bits needed to represent the number of CPU cores. This means we use the following values:
- for 4-core systems, use ‘2’
- for 8-core systems, use ‘3’
- for 16-core systems, use ‘4’
etc.
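
After a reboot, the resulting configuration can be read back – a minimal check, assuming the sysctls exposed by the FreeBSD RSS code:

--- Code: ---# hw.ncpu shows the logical CPU count; net.inet.rss.bits should be the value
# chosen above, and net.inet.rss.buckets the resulting number of RSS buckets
sysctl hw.ncpu net.inet.rss.bits net.inet.rss.buckets
--- End code ---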

If RSS is enabled with the 'enabled' sysctl, the packet dispatching policy will move from ‘direct’ to ‘hybrid’. This dispatches a packet directly in the current context when allowed, and otherwise queues it on the CPU it came in on. Please note that this will increase the interrupt load as seen in ‘top -P’. This simply means that packets are being processed with the highest priority in the CPU scheduler - it does not mean the CPU is under more load than normal.
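
As a quick sanity check, the dispatch policy can also be read back at runtime (stock FreeBSD sysctl; the per-protocol Policy/Dispatch columns of 'netstat -Q' below show the effective policy per protocol):

--- Code: ---sysctl net.isr.dispatch
--- End code ---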

The correct working of netisr can be verified by running

--- Code: ---netstat -Q
--- End code ---
 


Note regarding IPS

When Suricata is running in IPS mode, Netmap is utilized to fetch packets off the line for inspection. By default, OPNsense has configured Suricata in such a way that the packet which has passed inspection will be re-injected into the host networking stack for routing/firewalling purposes. The current Suricata/Netmap implementation limits this re-injection to one thread only. Work is underway to address this issue since the new Netmap API (V14+) is now capable of increasing this thread count. Until then, no benefit is gained from RSS when using IPS.


Preliminary testing

If you’d like to test RSS on your system before the release, a pre-made kernel is available from the OPNsense pkg repository. Please set the tunables as described in this post and update using:

--- Code: ---opnsense-update -zkr 21.7.1-rss
--- End code ---

If you are doing performance tests, make sure to disable rx/tx flow control if the NIC in question supports disabling this.
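
Flow control is usually a per-driver sysctl/tunable; the exact name differs per driver and some drivers do not expose it at all. As an illustration only, the ix(4) driver exposes one per port, which can be set as a tunable:

--- Code: ---# ix(4) example: 0 disables rx/tx flow control on the first port
dev.ix.0.fc = 0
--- End code ---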

Feedback or questions regarding the use of RSS can be posted in this thread. Let me know your thoughts and whether you encounter any issues :)

Update
Please note that all tunables set in this tutorial require a reboot to apply properly.

Cheers,
Stephan

allebone:
I can probably do some testing.

Can you help me understand what net.inet.rss.bits value I should be using? I have 2 threads per CPU and 4 cores. Do I use a value of 2 or 4 for this?

--- Code: ---root@OPNsense:~ # lscpu
Architecture:            amd64
Byte Order:              Little Endian
Total CPU(s):            4
Thread(s) per core:      2
Core(s) per socket:      2
Socket(s):               1
Vendor:                  GenuineIntel
CPU family:              6
Model:                   142
Model name:              Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
Stepping:                9
L1d cache:               32K
L1i cache:               32K
L2 cache:                256K
L3 cache:                3M
Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 cflsh ds acpi mmx fxsr sse sse2 ss htt tm pbe sse3 pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline aes xsave osxsave avx f16c rdrnd fsgsbase tsc_adjust sgx bmi1 avx2 smep bmi2 erms invpcid fpcsds mpx rdseed adx smap clflushopt intel_pt syscall nx pdpe1gb rdtscp lm lahf_lm lzcnt
--- End code ---

tuto2:
Ideally, I'd like to see some testing with both hyperthreading enabled and disabled. Systems with hyperthreading usually end up with one hardware queue per physical core - as such, only half of the logical CPUs in your system can be used for interrupts.

In your case please try the value '2' if only 4 hardware queues are used, otherwise use '3' if 8 hardware queues are used.

To expand on this: the 'bits' value sets how many bits wide the resulting CPU mask is (2^bits buckets), e.g.:

(net.inet.rss.bits = 2) == 0b0011 = 3 (cores 0 - 3, thus 4 cores)
(net.inet.rss.bits = 3) == 0b0111 = 7 (8 cores)
(net.inet.rss.bits = 4) == 0b1111 = 15 (16 cores)

Cheers,
Stephan

MartB:
Enabled it on the OPNsense built-in re (Realtek) driver with my RTL8125B.
Seems to be in use and is working just fine, I guess?


--- Code: ---root@rauter:~ # netstat -Q
Configuration:
Setting                        Current        Limit
Thread count                         4            4
Default queue limit                256        10240
Dispatch policy               deferred          n/a
Threads bound to CPUs          enabled          n/a

Protocols:
Name   Proto QLimit Policy Dispatch Flags
ip         1   1000    cpu   hybrid   C--
igmp       2    256 source  default   ---
rtsock     3    256 source  default   ---
arp        4    256 source  default   ---
ether      5    256    cpu   direct   C--
ip6        6    256    cpu   hybrid   C--
ip_direct     9    256    cpu   hybrid   C--
ip6_direct    10    256    cpu   hybrid   C--

Workstreams:
WSID CPU   Name     Len WMark   Disp'd  HDisp'd   QDrops   Queued  Handled
   0   0   ip         0    24        0     6402        0   299134   305536
   0   0   igmp       0     0        0        0        0        0        0
   0   0   rtsock     0     2        0        0        0      100      100
   0   0   arp        0     0        0        0        0        0        0
   0   0   ether      0     0    21891        0        0        0    21891
   0   0   ip6        0     2        0        3        0      272      275
   0   0   ip_direct     0     0        0        0        0        0        0
   0   0   ip6_direct     0     0        0        0        0        0        0
   1   1   ip         0    10        0   222075        0   123441   345516
   1   1   igmp       0     0        0        0        0        0        0
   1   1   rtsock     0     0        0        0        0        0        0
   1   1   arp        0     1        0        0        0        1        1
   1   1   ether      0     0   674658        0        0        0   674658
   1   1   ip6        0     4        0       30        0      327      357
   1   1   ip_direct     0     0        0        0        0        0        0
   1   1   ip6_direct     0     0        0        0        0        0        0
   2   2   ip         0    14        0    79091        0   108867   187958
   2   2   igmp       0     0        0        0        0        0        0
   2   2   rtsock     0     0        0        0        0        0        0
   2   2   arp        0     1        0        0        0      105      105
   2   2   ether      0     0   420575        0        0        0   420575
   2   2   ip6        0     1        0      204        0       36      240
   2   2   ip_direct     0     0        0        0        0        0        0
   2   2   ip6_direct     0     0        0        0        0        0        0
   3   3   ip         1    13        0     5750        0   301312   307061
   3   3   igmp       0     0        0        0        0        0        0
   3   3   rtsock     0     0        0        0        0        0        0
   3   3   arp        0     0        0        0        0        0        0
   3   3   ether      0     0    25502        0        0        0    25502
   3   3   ip6        0     3        0        7        0      283      290
   3   3   ip_direct     0     0        0        0        0        0        0
   3   3   ip6_direct     0     0        0        0        0        0        0
--- End code ---

athurdent:
Thanks for the nice explanation, tuto2!

This sounds cool, I can surely test this.
I have ixl NICs for LAN and WAN (passed through to the OPNsense VM in Proxmox, recognized as Intel(R) Ethernet Controller X710 for 10GbE SFP+), connected to a 10G switch, and an old i3 2-core / 4-thread CPU (Intel(R) Core(TM) i3-7100 CPU @ 3.90GHz).
Over in the Sensei forum, mb mentioned that Sensei would also benefit from RSS when it comes to reaching 10G speeds.
As I am not using Suricata on the ixl interfaces, but I am using Sensei on LAN, will it also benefit?


--- Code: ---root@OPNsense:~ # sysctl -a | grep rss
hw.bxe.udp_rss: 0
hw.ix.enable_rss: 1
root@OPNsense:~ # dmesg | grep vectors
ixl0: Using MSI-X interrupts with 5 vectors
ixl1: Using MSI-X interrupts with 5 vectors
ixl0: Using MSI-X interrupts with 5 vectors
ixl1: Using MSI-X interrupts with 5 vectors
--- End code ---
