Hi everyone,
Last night we experienced a severe outage of roughly 3-4 minutes on our OPNsense firewall: the system load spiked to over 100, the web GUI became unresponsive, and all network traffic was blocked.
System Environment:
OS: OPNsense (25.7.3_7)
Services: Multiple OpenVPN servers, a large number of aliases/tables (each with a small number of entries, <15).
Observed Symptoms:
- Load Average > 100.
- Hundreds of python3 ... update_tables.py --types authgroup processes.
- Multiple pfctl -t [ALIAS] -T replace -f /var/db/aliastables/[...].txt processes stuck in state R (running) or D (disk wait).
- Frequent ovpn_event.py triggers (add/delete/update).
The same pfctl command appeared 3 or 4 times at once for the same table, which makes me suspect a bug.
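To quantify the duplicates rather than eyeball them, a small helper along these lines could count concurrent pfctl replace calls per table. This is only a sketch: the function name and the sample lines are made up for illustration (the table name is hypothetical and the file path is left redacted), and the regex assumes the command shape shown in the symptoms above.

```python
import re
from collections import Counter

# Matches the invocation shape seen in our process list:
#   pfctl -t [ALIAS] -T replace -f /var/db/aliastables/[...].txt
PFCTL_RE = re.compile(r"pfctl\s+-t\s+(\S+)\s+-T\s+replace\b")


def duplicate_table_updates(ps_lines):
    """Return {table: count} for tables with more than one pfctl replace running."""
    counts = Counter()
    for line in ps_lines:
        match = PFCTL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return {table: n for table, n in counts.items() if n > 1}


# Hypothetical sample lines, not taken from our log.
sample = [
    "pfctl -t vpn_group_a -T replace -f /var/db/aliastables/[...].txt",
    "pfctl -t vpn_group_a -T replace -f /var/db/aliastables/[...].txt",
    "pfctl -t vpn_group_a -T replace -f /var/db/aliastables/[...].txt",
    "pfctl -t vpn_group_b -T replace -f /var/db/aliastables/[...].txt",
    "python3 ... update_tables.py --types authgroup",
]
print(duplicate_table_updates(sample))
```

On the firewall itself the input lines could come from something like `ps axww -o command`, captured while the load is high.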
We have attached a redacted log file. This is a production HA/CARP firewall in the datacenter, so this issue is giving me quite a headache.
Is this a known issue?
Why is there no flock or other locking mechanism around update_tables.py?
Are we using the wrong hardware for this?
Specs: 8 x E-2234 @ 3.6GHz, 8 GB RAM, 200 GB SSD, 1 x 40GE LAGG, 1 x 10GE LAGG, 1 x 1GE CARP
Usage: 12 VLANs, 3 WireGuard servers (200 clients), 3 OpenVPN servers (200 clients, little traffic), average traffic ~1GB/s across all interfaces
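To make the locking question concrete, here is a minimal sketch of the kind of guard I had in mind. It is not how update_tables.py works today (I have not checked its internals). The function name and lock path are hypothetical, and whether skipping a run is safe for alias updates is an assumption; a real fix might need to coalesce or queue the pending update instead of dropping it.

```python
import fcntl
import os
import tempfile


def run_exclusively(lock_path, work):
    """Run work() only if no other holder has the lock.

    Returns (True, result) if work() ran, or (False, None) if the lock
    was already held and the run was skipped.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        try:
            # Non-blocking exclusive lock: fail immediately instead of piling up.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return (False, None)
        return (True, work())
    finally:
        # Closing the descriptor also releases the lock.
        os.close(fd)


# Demo with a throwaway lock file (hypothetical path, not an OPNsense one).
demo_lock = os.path.join(tempfile.mkdtemp(), "update_tables.lock")
print(run_exclusively(demo_lock, lambda: "tables updated"))
```

With a guard like this, a burst of ovpn_event.py triggers would result in one running update plus a series of cheap skipped attempts, rather than hundreds of processes competing for the same tables.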
Thanks for any help