High CPU load every morning at 8 AM caused by update_tables.py leads to network timeouts

Started by mokai, September 05, 2025, 06:29:56 PM

Hi,

For a couple of weeks we've been seeing irregular CPU load spikes on our primary HA firewall node. The issue occurs most frequently around **8 AM** and, as far as we can tell, it is caused by a large number of concurrent Python processes (about 20), mostly update_tables.py.
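To confirm how many update_tables.py processes actually overlap and how much CPU they use together, a `ps` snapshot can be summarized. This is a minimal sketch that assumes FreeBSD's `ps -axo pid,pcpu,args` column layout (PID, %CPU, full command line). The helper names `summarize` and `snapshot` are hypothetical.

```python
import subprocess


def summarize(ps_text: str, needle: str = "update_tables.py") -> tuple[int, float]:
    """Return (number of processes whose command line contains needle, their summed %CPU).

    Assumes the layout of `ps -axo pid,pcpu,args`: a header row, then
    PID, %CPU and the full command line on each row.
    """
    count, cpu = 0, 0.0
    for line in ps_text.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)  # pid, %cpu, rest of the command line
        if len(parts) < 3:
            continue
        _pid, pcpu, args = parts
        if needle in args:
            count += 1
            cpu += float(pcpu)
    return count, cpu


def snapshot() -> str:
    """Take one live snapshot (only works where `ps` accepts these FreeBSD-style flags)."""
    result = subprocess.run(
        ["ps", "-axo", "pid,pcpu,args"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

You could run `summarize(snapshot())` from cron a minute or two before 8 AM and again during the spike. That would show whether the count really climbs to around 20.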

This looks very similar to the issue described here:
[https://forum.opnsense.org/index.php?topic=45620.0](https://forum.opnsense.org/index.php?topic=45620.0)

**Differences in our setup:**

* We don't have huge alias lists: ~450 aliases with about 3–15 entries each (~3,000–4,000 entries total).
* Hardware: 2 × Dell R340 (Xeon E2234, 8 cores, 16 GB RAM, SSD storage).
* Network: ~15 VLANs, 1 × 40 GbE LAGG (Mellanox ConnectX-3), 1 × 10 GbE LAGG (Mellanox ConnectX-3), ~20 virtual CARP IPs.

**Symptoms:**

* When the spike occurs, latency rises sharply and the interfaces can even stop responding.
* Example:

  --- gw01 ping statistics ---
  188 packets transmitted, 179 received, 4.8% packet loss

* After 1–2 minutes, the Python processes calm down and latency returns to normal.
* The issue sometimes occurs at other times of the day, but right now **8 AM is a daily pattern**.
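The example figures are consistent: 188 - 179 = 9 lost packets, and 9 / 188 ≈ 4.8 %. One way to match packet loss to the time of day is to run short ping batches in a loop and log each summary with a timestamp. Below is a small parser for the summary line. It assumes the two common formats, `179 received` and `179 packets received`. The names `loss_percent` and `log_line` are hypothetical.

```python
import re
from datetime import datetime

# Matches e.g. "188 packets transmitted, 179 received, 4.8% packet loss"
# as well as the "179 packets received" variant.
SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def loss_percent(summary: str) -> float:
    """Recompute packet loss in percent from the transmitted/received counts."""
    m = SUMMARY.search(summary)
    if m is None:
        raise ValueError("no ping summary found")
    sent, received = int(m.group(1)), int(m.group(2))
    if sent == 0:
        raise ValueError("no packets transmitted")
    return 100.0 * (sent - received) / sent


def log_line(summary: str, when: datetime) -> str:
    """Format one timestamped log entry for later correlation with the 8 AM spike."""
    return f"{when:%Y-%m-%d %H:%M:%S} loss={loss_percent(summary):.1f}%"
```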

At this point we don't really know how to debug this further, nor can we pinpoint what exactly is triggering such a high load.
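The spike recurs at the same time every day, which makes a scheduled job a natural suspect. A quick check is to list the crontab entries whose minute and hour fields fire at 08:00. This is a minimal sketch with hypothetical helpers `field_matches` and `jobs_at`. It evaluates only the minute and hour fields and skips `@`-macros. You would feed it the text of, for example, `/etc/crontab` or root's `crontab -l` output. Where OPNsense actually places its jobs is an assumption here.

```python
def field_matches(field: str, value: int, max_value: int) -> bool:
    """True if a cron minute/hour field ('*', '*/15', '8', '6-9', '0,30') allows value."""
    for part in field.split(","):
        rng, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if rng == "*":
            lo, hi = 0, max_value
        elif "-" in rng:
            lo_text, hi_text = rng.split("-", 1)
            lo, hi = int(lo_text), int(hi_text)
        else:
            lo = int(rng)
            hi = max_value if step_text else lo  # "5/10" means 5, 15, 25, ...
        if lo <= value <= hi and (value - lo) % step == 0:
            return True
    return False


def jobs_at(crontab_text: str, hour: int, minute: int = 0) -> list[str]:
    """Return crontab lines whose minute and hour fields fire at hour:minute.

    Day-of-month, month and day-of-week are ignored, which is fine for a
    daily pattern. @-macros (@daily, @hourly, ...) are skipped.
    """
    hits = []
    for line in crontab_text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "@")):
            continue
        fields = line.split()
        if len(fields) < 6:
            continue  # environment assignments such as SHELL=/bin/sh, or malformed lines
        if field_matches(fields[0], minute, 59) and field_matches(fields[1], hour, 23):
            hits.append(line)
    return hits
```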

**Question:**
Has anyone experienced something similar or can provide hints on how to further analyze this?

Thanks!

(Version is 25.7.2, but the problem already occurred in an earlier release.)

Quote from: mokai on September 05, 2025, 06:29:56 PM
[...]
* We don't have huge alias lists: ~450 aliases with about 3–15 entries each (~3,000–4,000 entries total).

The obvious first thought is a big URL alias with a 24-hour refresh...
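To test that theory, you could list every URL-type alias along with its refresh setting. The sketch below parses the configuration XML. It assumes that aliases appear as `<alias>` elements with `name`, `type`, `updatefreq` and `content` children, and that URL-based aliases use the types `url` and `urltable`. These element names, the unit of `updatefreq`, and the `/conf/config.xml` path mentioned below are all assumptions to verify against your own config. The function name `url_aliases` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed alias layout (verify against your own configuration file):
#   <alias><name>..</name><type>urltable</type><updatefreq>1</updatefreq><content>..</content></alias>
URL_TYPES = {"url", "urltable"}


def url_aliases(config_xml: str) -> list[tuple[str, str, str, int]]:
    """Return (name, type, updatefreq, number of configured URLs) for URL-type aliases."""
    root = ET.fromstring(config_xml)
    found = []
    for alias in root.iter("alias"):  # search the whole tree, wherever aliases live
        kind = (alias.findtext("type") or "").strip()
        if kind not in URL_TYPES:
            continue
        content = alias.findtext("content") or ""
        urls = [line for line in content.splitlines() if line.strip()]
        found.append((
            (alias.findtext("name") or "").strip(),
            kind,
            (alias.findtext("updatefreq") or "").strip(),
            len(urls),
        ))
    return found
```

Something like `url_aliases(open("/conf/config.xml").read())` would then show whether any URL table is refreshed daily. The last column counts the configured URLs, not the number of addresses they download.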

Quote
* Hardware: 2 × Dell R340 (Xeon E2234, 8 cores, 16 GB RAM, SSD storage).
* Network: ~15 VLANs, 1 × 40 GbE LAGG (Mellanox ConnectX-3), 1 × 10 GbE LAGG (Mellanox ConnectX-3), ~20 virtual CARP IPs. [...]

...but it should be tough to flatten a quad-core Coffee Lake unless you're really utilizing that 40GbE. What does top show at that time? That would give you an idea of where to go next. The "Reporting: Health" graphs might give you something to look at if their granularity is sufficient, though that data isn't something you can easily post here. The "Traffic" graph could also be worth a look, if it shows anything significant.
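If catching top live at 8 AM is impractical, a small sampler can record the busiest processes during a window around the spike. This sketch makes several assumptions: FreeBSD `ps -axo pid,pcpu,args` output, a 07:55 to 08:10 window, a 5-second interval, and a hypothetical log path `/tmp/cpu_spike.log`. All of these are illustrative choices.

```python
import subprocess
import time
from datetime import datetime, time as dtime

START, END = dtime(7, 55), dtime(8, 10)  # assumed window around the spike
INTERVAL, N = 5, 10                      # seconds between samples, rows kept per sample
LOG = "/tmp/cpu_spike.log"               # hypothetical log path


def in_window(now: datetime, start: dtime = START, end: dtime = END) -> bool:
    """True if the wall-clock time of now lies inside [start, end]."""
    return start <= now.time() <= end


def busiest(ps_text: str, n: int = N) -> list[tuple[float, str]]:
    """Return the n rows with the highest %CPU from `ps -axo pid,pcpu,args` output."""
    rows = []
    for line in ps_text.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) == 3:
            rows.append((float(parts[1]), parts[2]))
    return sorted(rows, reverse=True)[:n]


def sample_once() -> None:
    """Append one timestamped top-N snapshot to LOG."""
    out = subprocess.run(
        ["ps", "-axo", "pid,pcpu,args"], capture_output=True, text=True, check=True
    ).stdout
    stamp = datetime.now().strftime("%H:%M:%S")
    with open(LOG, "a") as log:
        for pcpu, args in busiest(out):
            log.write(f"{stamp} {pcpu:5.1f} {args}\n")


def run() -> None:
    """Sample while inside the window; return once END has passed."""
    while True:
        now = datetime.now()
        if in_window(now):
            sample_once()
        elif now.time() > END:
            return
        time.sleep(INTERVAL)
```

Start `run()` on the same morning shortly before the window, for example from a one-off cron entry at 07:50. It returns once 08:10 has passed.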

I have a bigger machine with a much smaller network, so I haven't seen similar symptoms. It's unusual, at any rate.