Performance tuning for IPS maximum performance

Started by dcol, December 08, 2017, 05:13:30 PM

I recently got an upgrade for my internet bandwidth from 200/50 Mbit to 1000/50 Mbit.

Sadly, my initial speed tests only showed 160/50 Mbit.

I quickly identified Suricata with IPS enabled as the bottleneck. I tried every combination of Hyperscan vs. Aho-Corasick, enabling Suricata on LAN (igb), LAN+WAN, and WAN (em), and every performance tuning rule described in the first post of this thread, but I still only got around 160/50 with IPS enabled.

I also noticed that the Suricata process used 100% of one CPU core during speed tests while the remaining three cores were idling.
Also, disabling most of the rules resulted in a "successful" speed test of 950/50 Mbit.

So my question is: why doesn't Suricata make use of all four cores? Why is the clock speed of a single core the bottleneck here? From what I understood reading about Suricata, it should be capable of multithreading.



Intel Pentium G4560T (2 cores, 4 threads) at 2.90 GHz + 8 GB RAM.

But apart from the clock speed, why is only one core being used by Suricata?
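
A quick way to confirm whether Suricata is actually spreading its work across cores is to watch its threads directly during a speed test (a sketch; the exact output format depends on your FreeBSD version):

top -HS                     # per-thread view; count how many Suricata worker threads are busy
ps auxwwH | grep suricata   # alternative: list every Suricata thread with its CPU share

If only one worker thread accumulates CPU time, single-core clock speed is the limit, which matches what you describe.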


February 12, 2019, 09:39:03 PM #49 Last Edit: September 25, 2021, 11:31:46 AM by Sahbi
Had some severe performance issues after enabling IPS mode, like barely saturating 50% of my ISP connection (supposed to be 250/25 Mbps), so I figured I'd chime in with some of my experiences. I'm assuming that since I have an APU4C4 with i211AT NICs, flow control is set to 3 (Full), since the NIC seems to support that according to this here datasheet. Also, I'm using speedtest.net because it's still the most popular one and at least they have decently connected servers close to me, unlike e.g. Google which goes all the way to damn Atlanta. I always used the same server, as well as the relatively new "multi" feature. I'm also running the speed tests from a computer behind OPNsense and not from the box itself. Finally, I have pretty much everything enabled at this point, including a transparent HTTPS proxy which requires me to disable hardware offloading for some networking stuff.
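
Rather than assuming the flow control default, the current setting can be read back at runtime (a sketch; adjust the interface numbers to whatever your WAN/LAN map to):

sysctl dev.igb.0.fc dev.igb.1.fc dev.igb.2.fc dev.igb.3.fc   # 3 = full flow control, 0 = disabled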

First, let's list the rulesets I have in use. Now, I'm not that familiar with OPNsense or Suricata yet, so I'm not entirely sure the data below is "clean", but it should be close enough.
root@opn:/usr/local/etc/suricata/rules # ls *.rules
OPNsense.rules emerging-icmp_info.rules
abuse.ch.feodotracker.rules emerging-imap.rules
abuse.ch.sslblacklist.rules emerging-info.rules
abuse.ch.sslipblacklist.rules emerging-malware.rules
abuse.ch.urlhaus.rules emerging-misc.rules
botcc.portgrouped.rules emerging-mobile_malware.rules
botcc.rules emerging-rpc.rules
ciarmy.rules emerging-scan.rules
compromised.rules emerging-shellcode.rules
drop.rules emerging-smtp.rules
dshield.rules emerging-sql.rules
emerging-activex.rules emerging-trojan.rules
emerging-attack_response.rules emerging-user_agents.rules
emerging-current_events.rules emerging-web_client.rules
emerging-deleted.rules emerging-web_server.rules
emerging-dns.rules emerging-web_specific_apps.rules
emerging-dos.rules emerging-worm.rules
emerging-exploit.rules opnsense.test.rules
emerging-ftp.rules opnsense.uncategorized.rules
emerging-icmp.rules

root@opn:/usr/local/etc/suricata/rules # cat *.rules | sed 's/^ *#.*//' | sed '/^ *$/d' | wc -l
   41614


The rules are divided roughly 50/50 between drop and alert actions, but I don't think that matters for performance because it has to inspect and log the traffic regardless.
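
For anyone who wants to reproduce that split, a rough count of active rules per action (a sketch, assuming the same rules directory as above):

cd /usr/local/etc/suricata/rules
grep -h '^drop'  *.rules | wc -l    # active drop rules
grep -h '^alert' *.rules | wc -l    # active alert rules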

This is before applying any of the tunables mentioned in the OP (at my speeds I don't care about decimals so I'll just round that shit):

  • Using Hyperscan mode: 10ms ping, 98 Mbps down, 25 Mbps up
  • Aho-Corasick: 12ms, 230/26
I read somewhere on these forums that Hyperscan is preferred in most cases, so I had that active, and it caused a significant performance drop compared to A-C. So this was the cause of my issues, at least for the moment. :>

After running sysctl dev.igb.<x>.fc=0 for all interfaces (no need to reboot for these, so I figured I'd just go ahead and try):

  • Hyperscan: 9ms, 115/25
  • Aho-Corasick: 10ms, 240/25
A slight improvement for both algos, with Hyperscan closing the most distance. RAM usage for both tests stayed pretty much the same; there's currently 50% in use after a day in full production. Also, after every reboot I waited for the startup beep to go off, then checked with top to see if any startup stuff was still running. Only when everything had calmed down did I proceed with the next test.
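
For reference, the runtime flow control change above can be done in one line instead of four separate calls (a sketch; adjust the interface numbers to your box):

for i in 0 1 2 3; do sysctl dev.igb.$i.fc=0; done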

Now let's try some more tunables:

### loader.conf.local

# Flow Control (FC): 0 = Disabled, 1 = Rx Pause, 2 = Tx Pause, 3 = Full FC
hw.igb.0.fc=0
hw.igb.1.fc=0
hw.igb.2.fc=0
hw.igb.3.fc=0

# Set number of queues to number of cores divided by number of ports, 0 lets FreeBSD decide (should be default)
hw.igb.num_queues=0

# Increase packet descriptors (set as 1024, 2048 or 4096 ONLY)
hw.igb.rxd="4096" # Default = 1024
hw.igb.txd="4096"
net.link.ifqmaxlen="8192" # Sum of above two (default = 50)

# Increase network efficiency (Adaptive Interrupt Moderation, should be default)
hw.igb.enable_aim=1

# Increase interrupt rate # Default = 8000
hw.igb.max_interrupt_rate="64000"

# Fast interrupt handling, allows NIC to process packets as fast as they are received (should be default)
hw.igb.enable_msix=1
hw.pci.enable_msix=1

# Unlimited packet processing
hw.igb.rx_process_limit="-1"
hw.igb.tx_process_limit="-1"

### WebGUI > System > Settings > Tunables

# Disable Energy Efficient Ethernet
dev.igb.0.eee_disabled=1
dev.igb.1.eee_disabled=1
dev.igb.2.eee_disabled=1
dev.igb.3.eee_disabled=1

# Set Flow Control
hw.igb.0.fc=0
hw.igb.1.fc=0
hw.igb.2.fc=0
hw.igb.3.fc=0

dev.igb.0.fc=0
dev.igb.1.fc=0
dev.igb.2.fc=0
dev.igb.3.fc=0

# Do not accept IPv4 fragments
net.inet.ip.maxfragpackets=0
net.inet.ip.maxfragsperpacket=0


And reboot. =]
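
To make sure the loader entries actually made it into the kernel after the reboot, something like this can be used (a sketch; kenv shows what the loader picked up, sysctl shows the live values):

kenv | grep -E 'hw.igb|ifqmaxlen'              # values read from loader.conf.local at boot
sysctl dev.igb.0.fc dev.igb.0.eee_disabled     # runtime values, repeat per interface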

RAM usage is still hovering fine and dandy around 45%.

  • Aho-Corasick: 11ms, 248/25
  • Hyperscan: 12ms, 245/25
Now one thing I also noticed while watching top -HS is that Suricata no longer takes an entire core + a bit from the second, but instead distributes its load over 3 cores with the total load being around 180% (out of 400%). It also feels like the web interface is "snappier"; the dashboard page used to take quite some time to load but it's mucho faster now.




So it seems that just disabling flow control brings some slight improvements already, but Hyperscan in particular benefits hugely from adjusting hw.igb.rxd/txd, net.link.ifqmaxlen and hw.igb.max_interrupt_rate. Apparently with newer BSDs (like 10.x onwards) there's a newer driver which reduces the number of interrupts significantly, so you can probably just set it to 16000 and get the same results. I'm routing a lot of stuff due to a complex homelab setup, so I'll just leave it at 64k for now. =] Probably worth mentioning too: my lil' APU's CPU temps have never gone over 60°C so far, and after a cold boot it starts at around 59.
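
Whether that higher interrupt ceiling is even being reached can be checked from the shell (a sketch; the rate column shows interrupts per second per queue):

vmstat -i | grep igb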

Since the difference between A-C and HS at this point is negligible and most likely just the result of tiny factors such as other services happening to check in at the time, I'm satisfied with the current settings and will end my tunables testing here. For shits and giggles I did run an iperf just now, from the same computer behind OPN to a VPS with gigabit in the same country:

$ iperf -c vps1 -p 4712 -u -t 60 -i 10 -b 1000M
------------------------------------------------------------
Client connecting to vps1, UDP port 4712
Sending 1470 byte datagrams, IPG target: 11.22 us (kalman adjust)
UDP buffer size: 9.00 KByte (default)
------------------------------------------------------------
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec  1.11 GBytes   954 Mbits/sec
[  5] 10.0-20.0 sec  1.11 GBytes   952 Mbits/sec
[  5] 20.0-30.0 sec  1.11 GBytes   954 Mbits/sec
[  5] 30.0-40.0 sec  1.11 GBytes   953 Mbits/sec
[  5] 40.0-50.0 sec  1.11 GBytes   955 Mbits/sec
[  5]  0.0-60.0 sec  6.66 GBytes   953 Mbits/sec
[  5] Sent 4864635 datagrams


Suricata takes a little less than 1 core and the temps are still around 59C. :>

March 01, 2019, 05:14:55 PM #50 Last Edit: March 01, 2019, 05:40:58 PM by juliocbc
After applying the tunables, I did some tests here, but something went wrong! :-(

My Lab hardware:
OPNsense 18.7.10_4
hw.model: Intel(R) Atom(TM) CPU  C2758  @ 2.40GHz
hw.machine: amd64
hw.ncpu: 8
16GB RAM
Intel i210AT

When I pressed ENTER to start the iperf tests, the system crashed:
client's iperf params: iperf -p 5201 -c 192.168.1.99 -u -b 10m -P 100 -d -t 60

Tracing command kernel pid 0 tid 100162 td 0xfffff8001ffb1560
sched_switch() at sched_switch+0x4aa/frame 0xfffffe0467a1daa0
mi_switch() at mi_switch+0xe5/frame 0xfffffe0467a1dad0
sleepq_wait() at sleepq_wait+0x3a/frame 0xfffffe0467a1db00
_sleep() at _sleep+0x255/frame 0xfffffe0467a1db80
taskqueue_thread_loop() at taskqueue_thread_loop+0x121/frame 0xfffffe0467a1dbb0
fork_exit() at fork_exit+0x85/frame 0xfffffe0467a1dbf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0467a1dbf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command kernel pid 0 tid 100173 td 0xfffff800099dd000
sched_switch() at sched_switch+0x4aa/frame 0xfffffe0467a54aa0
mi_switch() at mi_switch+0xe5/frame 0xfffffe0467a54ad0
sleepq_wait() at sleepq_wait+0x3a/frame 0xfffffe0467a54b00
_sleep() at _sleep+0x255/frame 0xfffffe0467a54b80
taskqueue_thread_loop() at taskqueue_thread_loop+0x121/frame 0xfffffe0467a54bb0
fork_exit() at fork_exit+0x85/frame 0xfffffe0467a54bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0467a54bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
db:0:kdb.enter.default>  capture off
db:0:kdb.enter.default>  call doadump
= 0x6
db:0:kdb.enter.default>  reset
cpu_reset: Restarting BSP
cpu_reset_proxy: Stopped CPU 7
Cloudfence Open Source Team

April 05, 2019, 01:18:05 PM #51 Last Edit: April 05, 2019, 02:56:38 PM by lrosenman
I added the em tunables (on the 19.1.4 netmap kernel), with the https://github.com/aus/pfatt bypass (using my pull requested config).

And my UPLOAD is back to ~800 Mbit, but the download side is ~600 Mbit.

This is ATT Fiber 1G/1G.

SpeedTest: https://www.lerctr.org/~ler/ST-2019-04-05-06-12-21.png
Tunables added: https://www.lerctr.org/~ler/tuneables-2019-04-05-06-13-14.png

Ideas on what I can do on the Download side (with all the netgraph fun)?

EDIT: This is with *NO* IPS/IDS running.

To follow up, Brent Cowing of Protectli sent me an i3-7100U based box and my speeds are back to 910/949.

see also:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=237072
https://github.com/HardenedBSD/hardenedBSD/issues/376

I will also have a 2nd E3845 box here this week (thanks Brent), and will be able to play without affecting my internet connection.

Quote from: lrosenman on April 09, 2019, 04:40:33 AM
To follow up, Brent Cowing of Protectli sent me an i3-7100U based box and my speeds are back to 910/949.

see also:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=237072
https://github.com/HardenedBSD/hardenedBSD/issues/376

I will also have a 2nd E3845 box here this week (thanks Brent), and will be able to play without affecting my internet connection.

Is this with IPS/IDS turned on? I get 870/950 with the igbX tunables and no IPS/IDS. When I turn on IPS/IDS, the speedtest.net download speed starts at 800-900 Mbps and slowly levels off at 100-200 Mbps. The upload speed starts at 10 Mbps and then the test errors out. I wonder if this has something to do with netgraph ...

NO, this was without IDS/IPS on.

I've not gotten the testing done yet. 

Quote from: harshw on May 06, 2019, 08:01:53 PM
Quote from: lrosenman on April 09, 2019, 04:40:33 AM
To follow up, Brent Cowing of Protectli sent me an i3-7100U based box and my speeds are back to 910/949.

see also:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=237072
https://github.com/HardenedBSD/hardenedBSD/issues/376

I will also have a 2nd E3845 box here this week (thanks Brent), and will be able to play without affecting my internet connection.

Is this with IPS/IDS turned on? I get 870/950 with the igbX tunables and no IPS/IDS. When I turn on IPS/IDS, the speedtest.net download speed starts at 800-900 Mbps and slowly levels off at 100-200 Mbps. The upload speed starts at 10 Mbps and then the test errors out. I wonder if this has something to do with netgraph ...

netgraph(4) is definitely on my list of things to look at. I suspect there is something(tm) not-kosher there. What, I'm not sure yet.

Quote from: dcol on December 08, 2017, 05:13:30 PM
I have researched and tested tunables because I experienced too many downed links and poor performance when using IPS/Inline on the WAN interface, which could no longer be ignored. This file, loader.conf.local, along with adding some system tunables in the WebGUI, has fixed this for me, so I thought I would share with the OPNsense community. Sharing is what makes an open-source project successful. Share your experiences using the info in this post. You may or may not see much performance improvement depending on your hardware, but you will see fewer dropped connections. If you have any other tunable recommendations, please share and post those experiences here. This thread is for performance tuning ideas.

The biggest impact was from the Flow Control (FC) setting. FC is a low-level mechanism (IEEE 802.3x pause frames) that pauses transmission before the data is sent. My assumption is that netmap has issues with FC, which causes the dropped connections. Recommendations from many sources, including Cisco, suggest disabling FC altogether and letting the higher levels handle the flow. There are exceptions, but these usually involve ESXi, VMware and other special applications.

I have done all my testing using an Intel i350-T4 and i340-T4, common NICs used for firewalls, in 4 different systems and, by the way, neither NIC had any performance advantage. I have tested these systems for 5 days without any downed links after the changes were made. Without these changes, every system was plagued with downed WAN links and poor performance using the default settings.

Do not use this file if you are not using an igb driver. igb combined with other drivers is ok as long as you have at least one igb NIC, and I recommend you use the igb for all WAN interfaces.

Add the file below in the '/boot' folder and call it 'loader.conf.local', right beside 'loader.conf'. I use WinSCP, in a Windows environment, as a file manager to get easy access to the folders. Don't forget to enable Secure Shell. I have tried using the 'System Tunables' in the WebGUI to add these settings; some worked and some didn't using that method, not sure why. Better to just add this file. If you're a Linux guru (I am not), use your own method to add this file.
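
If you'd rather stay on the box itself instead of using WinSCP, the file can also be created straight from the OPNsense shell (a sketch):

vi /boot/loader.conf.local                            # paste the block below, then save
echo 'hw.igb.num_queues=0' >> /boot/loader.conf.local # or append single entries non-interactively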

The two most IMPORTANT things to ensure are, first, that power management is disabled both in the OPNsense settings and in the BIOS settings of the system (thanks wefinet), and second, that flow control (IEEE 802.3x) is disabled on all ports. It is advisable not to connect an IPS interface to any device which has flow control on. Flow control should be turned off so that congestion is managed higher up in the stack.
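
One way to sanity-check the power management part is to look at the CPU frequency sysctls, where your hardware exposes them (a sketch; if dev.cpu.0.freq keeps bouncing between levels under load, powerd or a BIOS setting is still scaling the clock):

sysctl dev.cpu.0.freq dev.cpu.0.freq_levels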

Please test all tunables in a test environment before you apply to a production system.

# File starts below this line, use Copy/Paste #####################
# Check for interface specific settings and add accordingly.
# These are tunables to improve network performance on Intel igb driver NICs

# Flow Control (FC) 0=Disabled 1=Rx Pause 2=Tx Pause 3=Full FC
# This tunable must be set according to your configuration. VERY IMPORTANT!
# Set FC to 0 (<x>) on all interfaces
hw.igb.<x>.fc=0 #Also put this in System Tunables hw.igb.<x>.fc: value=0

# Set number of queues to number of cores divided by number of ports. 0 lets FreeBSD decide
hw.igb.num_queues=0

# Increase packet descriptors (set as 1024, 2048, or 4096 ONLY)
# Allows a larger number of packets to be processed.
# Use "netstat -ihw 1" in the shell and make sure the idrops stay at zero;
# if they are not zero, or the NIC has constant disconnects, lower this value.
hw.igb.rxd="4096" # For i340/i350 use 2048
hw.igb.txd="4096" # For i340/i350 use 2048
net.link.ifqmaxlen="8192" # value here equals the sum of the two above. For i340/i350 use 4096

# Increase Network efficiency
hw.igb.enable_aim=1

# Increase interrupt rate
hw.igb.max_interrupt_rate="64000"

# Network memory buffers
# run "netstat -m" in the shell and if the 'mbufs denied' and 'mbufs delayed' are 0/0/0 then this is not needed
# if not zero then keep adding 400000 until mbufs are zero
kern.ipc.nmbclusters="1000000"

# Fast interrupt handling
# Normally set by default. Use these settings to ensure it is on.
# Allows NIC to process packets as fast as they are received
hw.igb.enable_msix=1
hw.pci.enable_msix=1

# Unlimited packet processing
# Use this only if you are sure that the NICs have dedicated IRQs
# View the IRQ assignments by executing this in the shell "vmstat -i"
# A value of "-1" means unlimited packet processing
hw.igb.rx_process_limit="-1"
hw.igb.tx_process_limit="-1"
###################################################
# File ends above this line ##################################

##UPDATE 12/12/2017##
After testing I have realized that some of these settings are NOT applied via loader.conf.local and must be added via the WebGUI in System>Settings>Tunables. I have moved these from the file above to this list.
Add to Tunables

Disable Energy Efficient Ethernet - set for each igb port in your system.
If not disabled, this setting can cause link flap errors.
Set it for every igb interface in the system as per these examples:
dev.igb.0.eee_disabled: value=1
dev.igb.1.eee_disabled: value=1
dev.igb.2.eee_disabled: value=1
dev.igb.3.eee_disabled: value=1

IPv4 fragments - 0 = do not accept fragments
This is mainly needed for security; fragmentation can be used to evade packet inspection.
net.inet.ip.maxfragpackets: value=0
net.inet.ip.maxfragsperpacket: value=0

Set to 0 (<x>) for every port used by IPS
dev.igb.<x>.fc: value=0
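
A quick way to confirm these GUI tunables took effect after a reboot (a sketch; repeat the dev.igb lines per port):

sysctl dev.igb.0.eee_disabled dev.igb.0.fc
sysctl net.inet.ip.maxfragpackets net.inet.ip.maxfragsperpacket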

##UPDATE 1/16/2018##
Although the tuning in this thread so far just deals with the tunables, there are other settings that can impact IPS performance. Here are a few...

In the Intrusion Detection Settings Tab.

Promiscuous mode - to be used only when multiple interfaces or VLANs are selected in the Interfaces setting.
It is used so that IPS will capture data on all the selected interfaces. Do not enable it if you have just one interface selected; leaving it off will help with performance.

Pattern matcher: this selects the algorithm used for pattern matching and is best set by testing. Hyperscan seems to work well with Intel NICs. Try different ones and test the bandwidth with an internet speed test (see the quick check after this list).

Home networks (under the Advanced menu):
Make sure the entries match your actual local networks. You may want to change the generic 192.168.0.0/16 to your actual local network, e.g. 192.168.1.0/24.
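
The algorithm the GUI actually handed to Suricata can be double-checked from the shell (a sketch; the path to the generated suricata.yaml is assumed here, mpm-algo is the relevant key):

grep mpm-algo /usr/local/etc/suricata/suricata.yaml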

###################################################
USEFUL SHELL COMMANDS
sysctl net.inet.tcp.hostcache.list # View the current host cache stats
vmstat -i # Query total interrupts per queue
top -H -S # Watch CPU usage
dmesg | grep -i msi # Verify MSI-X is being used by the NIC
netstat -ihw 1 # Look for idrops to determine hw.igb.txd and rxd
grep <interface> /var/run/dmesg.boot # Shows useful info like netmap queue/slots
sysctl -A # Shows system variables
###################################################

Hello,

I am curious: does loader.conf.local get loaded after loader.conf? I did as instructed, but what happened was a complete slowdown. RTT and RTTd shot up, to the point of making my Internet connection unusable. I removed loader.conf.local and rebooted. The Internet was back and RTT/RTTd were back to normal.

I am going to start testing with one option in loader.conf.local and see where the connection becomes unusable. I left all the options in the Tunables section of the GUI.

Thanks,
Steve

Yes, it's loaded after loader.conf. Good idea to test them one by one :)

It looks like kern.ipc.nmbclusters="1000000" was the culprit.
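
That matches the advice in the quoted file: only raise kern.ipc.nmbclusters if the counters show actual pressure. A quick check (a sketch):

netstat -m | grep -i denied    # look for the "requests for mbufs denied" line; 0/0/0 means the default is fine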

I had performance problems while connecting a FRITZ!Box 6591 to my OPNsense box. The flow control trick works fine for me: full 1 Gbit/s throughput, versus just ~300 Mbit/s before.

But... I added the entries to the tunables (GUI) and to /boot/loader.conf.local. After a reboot, dev.igb.x.fc is set to 0, but it does not speed things up. After entering "sysctl dev.igb.x.fc=0" by hand from the console, things speed up magically. It looks like the settings are not taking effect when set from /boot/loader.conf.x ...

/boot/loader.conf.local:

### loader.conf.local

# Flow Control (FC): 0 = Disabled, 1 = Rx Pause, 2 = Tx Pause, 3 = Full FC
hw.igb.0.fc=0
hw.igb.1.fc=0
dev.igb.0.fc=0
dev.igb.1.fc=0

# Set number of queues to number of cores divided by number of ports, 0 lets FreeBSD decide (should be default)
hw.igb.num_queues=0
# Increase packet descriptors (set as 1024, 2048 or 4096 ONLY)
hw.igb.rxd="2048" # Default = 1024
hw.igb.txd="2048"
net.link.ifqmaxlen="4096" # Sum of above two (default = 50)

# Increase network efficiency (Adaptive Interrupt Moderation, should be default)
hw.igb.enable_aim=1

# Increase interrupt rate # Default = 8000
hw.igb.max_interrupt_rate="64000"

# Fast interrupt handling, allows NIC to process packets as fast as they are received (should be default)
hw.igb.enable_msix=1
hw.pci.enable_msix=1

# Unlimited packet processing
hw.igb.rx_process_limit="-1"
hw.igb.tx_process_limit="-1"



and the rest of /boot/loader.conf:

...

net.inet.ip.redirect="0"
net.inet.icmp.drop_redirect="1"
hw.igb.1.fc="0"
dev.igb.1.fc="0"
hw.igb.0.fc="0"
dev.igb.0.fc="0"

# dynamically generated console settings follow
#comconsole_speed
#boot_multicons
#boot_serial
#kern.vty
console="vidconsole"


The NIC is an i350-T2.
OPNsense is pretty new to me, and I have no idea what I am doing wrong... any help is welcome :-)