Large Alias Causing CPU spikes and ping latency

Started by CanIKipThis, February 04, 2025, 02:58:25 PM

Hey everyone,

I tracked down a problem that seems to have been there since at least 24.19.  If you have "large" alias groups, the firewall shows periodic CPU spikes and periods of erratic, elevated latency through it.  I have now seen this on 3 different firewalls, all initially configured with MaxMind GeoIP blocks.  I had configured an alias containing the US, Canada and GB.  All three firewalls behaved similarly: latency both to the firewall and through it to internet resources was very high at times.  Here are graphs showing latency to the firewall and through it to an endpoint while this condition occurred:

 Firewall A:

https://imgur.com/a/oy4HB9t

Firewall A to 1.1.1.1:

https://imgur.com/a/BKk7h8F

Here is firewall B:

https://imgur.com/a/IoSYz4K

To troubleshoot, I deleted all of the GeoIP aliases.  You can see how Firewall A responded: both ping times evened out (note the red square).

https://imgur.com/a/hx9jOj3

I even ran a control experiment where I enabled CrowdSec (which creates an alias), and you can see the latency start to creep back up (marked by the red arrow in that picture).

I checked crontab, and there is a job that runs every minute with update_tables.py in it. It seems some other people are reporting somewhat similar issues:

https://forum.opnsense.org/index.php?topic=41759.60#msg211036

Like I said, it's happening across 3 different firewalls, with 3 different hardware setups at 3 different locations.  It seems to be related to OPNsense.  As a test I swapped out OPNsense for pfSense and it did not have the same latency spikes.

Any ideas or help?


Hi,

This behavior has been present for a long time now, please see this:

https://forum.opnsense.org/index.php?topic=31662.msg153060#msg153060

I ended up with a workaround because I could not find the root of the problem.

Thanks, can you share what your workaround was?

We have this one in the pipeline for other reasons, but it could help?

# opnsense-patch https://github.com/opnsense/core/commit/81ec98007d


Cheers,
Franco

Thanks for the patch. Processing time was reduced from somewhere between 15 and 20 seconds to under 7 seconds:

root@OPNsense:/usr/local/opnsense/scripts/filter # time /usr/local/opnsense/scripts/filter/update_tables.py
{"status": "ok"}
6.810u 4.850s 0:12.08 96.5%   159+171k 0+2io 0pf+0w

As for the workaround: it depends on the presence of a temporary file called /tmp/refreshaliases
This file is created by a custom script called /opt/local/bin/refreshaliases.sh
The contents of the script are:

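#!/bin/sh

# If the installed update_tables.py is larger than 100 bytes (i.e. it is not the dummy),
# move it aside and put the dummy in its place.
if [ $(wc -c /usr/local/opnsense/scripts/filter/update_tables.py|awk '{print $1}') -gt 100 ]
then
   mv /usr/local/opnsense/scripts/filter/update_tables.py /opt/local/bin
   cp /opt/local/bin/update_tables.py_new /usr/local/opnsense/scripts/filter/update_tables.py
   # Mirror the lib directory next to the relocated script.
   /usr/local/bin/rsync -a --delete /usr/local/opnsense/scripts/filter/lib /opt/local/bin/
fi
# Proceed only if www.google.com resolves.
if [ $(drill www.google.com|grep ^www.google.com|wc -l) -ne 0 ]
then
   # Run the relocated script under a non-blocking lock, then create the marker file.
   /usr/local/bin/flock -n -E 0 -o /tmp/filter_update_tables.lock /opt/local/bin/update_tables.py > /dev/null
   touch /tmp/refreshaliases
fi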

Furthermore, a new Python script that does nothing was created, called /opt/local/bin/update_tables.py_new
The contents of the script are:

#!/usr/local/bin/python3

"""
    dummy
"""

Then a monit job was created that checks whether a config change has occurred and calls the /opt/local/bin/refreshaliases.sh script.
At boot the /opt/local/bin/refreshaliases.sh script must be run as well since the /tmp/refreshaliases file is not present at boot time.

Result: no more CPU spikes but aliases are refreshed at any config change. Hope this helps.
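
The monit configuration itself is not shown above, so here is a minimal sketch of how a "has the config changed?" check could look, written in Python. It is not the poster's actual setup. It assumes the config lives at /conf/config.xml and reuses the /opt/local/bin/refreshaliases.sh path from the post; the stamp file /tmp/refreshaliases.sha256 and all function names are hypothetical. Unlike the empty marker file above, this stamp stores a digest so a change can be detected without relying on timestamps.

```python
import hashlib
import subprocess
from pathlib import Path
from typing import Sequence

# Assumed locations; only the refresh script path comes from the post above.
CONFIG = Path("/conf/config.xml")
STAMP = Path("/tmp/refreshaliases.sha256")  # hypothetical stamp file
REFRESH = ("/opt/local/bin/refreshaliases.sh",)


def config_digest(config: Path) -> str:
    """Return the SHA-256 hex digest of the config file contents."""
    return hashlib.sha256(config.read_bytes()).hexdigest()


def needs_refresh(config: Path, stamp: Path) -> bool:
    """True if the stamp is missing (e.g. after boot) or holds a different digest."""
    if not stamp.exists():
        return True
    return stamp.read_text().strip() != config_digest(config)


def refresh_if_changed(
    config: Path = CONFIG,
    stamp: Path = STAMP,
    command: Sequence[str] = REFRESH,
) -> bool:
    """Run the refresh command once per config change; return True if it ran."""
    if not needs_refresh(config, stamp):
        return False
    subprocess.run(list(command), check=True)
    # Record the digest only after the command succeeded.
    stamp.write_text(config_digest(config) + "\n")
    return True


if __name__ == "__main__":
    refresh_if_changed()
```

Because /tmp is empty after a reboot, the first call always triggers a refresh, which matches the "must be run at boot" requirement described above.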

Apparently this still lacks a bit of context: type of box, number of aliases, total size of them?


Thanks,
Franco

Hi Franco,

You are absolutely right. Please find the answers to your questions below:

It's Protectli:
# dmidecode | grep "Product Name"|uniq
   Product Name: VP2420
According to their website: Intel Celeron® J6412 Quad Core at 2 GHz (Burst up to 2.6 GHz)

There are about 161 aliases:
# grep "alias uuid" /conf/config.xml|wc -l
     161

In total, all aliases sum up to about 5.5 million entries. The larger ones are based on IP addresses from AbuseIP, FireHOL and about 5 large GeoIP-based alias lists.

If you want me to trace anything please let me know, I will be more than happy to assist.
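
For anyone who wants the same alias count without grep, here is a rough Python equivalent. It is only a sketch: it assumes aliases appear in the config as alias elements carrying a uuid attribute (which is what the grep above matches on), and the function name is made up.

```python
import xml.etree.ElementTree as ET


def count_aliases(source) -> int:
    """Count <alias> elements that carry a uuid attribute.

    `source` is a file path or a binary file object holding the config XML.
    """
    root = ET.parse(source).getroot()
    return sum(1 for element in root.iter("alias") if "uuid" in element.attrib)


if __name__ == "__main__":
    print(count_aliases("/conf/config.xml"))
```

Unlike the grep, this counts elements rather than lines, so it is not thrown off if two aliases ever end up on the same line.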

I am happy to help as well.

Box is a Protectli VP240
Celeron J4125
Currently on 25.1

Here is my alias list (the majority are in the one GEOIP blocklist)

https://imgur.com/a/lf8ccph

For the patch, if I install it and then upgrade to 25.1.3, do I have to re-install it after each upgrade?




At first glance Celeron CPUs are underwhelming for this task. A stretch to 500k entries maybe, but I wouldn't trust it with managing more entries than this.


Cheers,
Franco

The running time is reduced by ~40% by the patch...
Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 770 up, Bufferbloat A

While that is nice, it mainly works around kernel crashes regarding pfctl reading table contents that suddenly change while reading under a lock.


Cheers,
Franco

What does this scheduled task actually do? 

Manage alias updates. Downloading, comparing, making sure the data is up to date. More or less what you would expect it to do.


Cheers,
Franco
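
To illustrate the "comparing" step in general terms, here is a small sketch of how a table update can be planned as a set difference. This is not the actual update_tables.py logic, just the generic idea; the function names are hypothetical. Entries are normalized with the standard ipaddress module so that "10.0.0.1" and "10.0.0.1/32" count as the same entry.

```python
import ipaddress
from typing import Iterable, Set, Tuple


def normalize(entries: Iterable[str]) -> Set:
    """Parse non-empty entries into network objects so equal ranges compare equal."""
    return {
        ipaddress.ip_network(entry.strip(), strict=False)
        for entry in entries
        if entry.strip()
    }


def plan_update(current: Iterable[str], desired: Iterable[str]) -> Tuple[Set, Set]:
    """Return (to_add, to_delete) that turn the current table into the desired one."""
    current_set = normalize(current)
    desired_set = normalize(desired)
    return desired_set - current_set, current_set - desired_set


if __name__ == "__main__":
    to_add, to_delete = plan_update(
        ["10.0.0.1", "192.0.2.0/24"],
        ["10.0.0.1/32", "198.51.100.0/24"],
    )
    print("add:", sorted(map(str, to_add)))
    print("delete:", sorted(map(str, to_delete)))
```

With millions of entries even a comparison like this has to parse and hash every entry on each run, which gives a feel for why a job that runs every minute is noticeable on a small CPU.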

OK, my 'amount' of aliases is a bit too much. Would it be possible to get the behavior I implemented in the workaround (using monit to trigger the alias update when a config change occurs) whenever a modification is made on the aliases screen in the UI?