Since update to 20.1.7 recurring high latency situations

Started by jaghatei, June 30, 2020, 04:12:50 PM

Since the update to 20.1.7 at the end of May I am recurrently getting high latency to the OPNsense LAN interface.
A normal ping to OPNsense from the LAN is < 1 ms; when the problem occurs, ping times reach 400-500 ms.
This disturbs any real-time traffic such as VoIP or streaming, and in extreme cases makes DNS on OPNsense unreachable.
The reason is recurring high CPU load from configd and the scripts it starts: the loaded CPU core is the one the OS has bound to the affected Ethernet port, and this causes the network latency.
But I still have no clue why configd has been producing such high load since 20.1.7 when it did not before. The configuration itself did not change at all.
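To see which configd child is actually burning the CPU when it happens, a rough watcher like the one below can be left running. This is only a sketch: the 20% threshold and the 2-second poll interval are arbitrary, and it simply greps the ps output for "configd".

import subprocess, time

while True:
    # list CPU usage, pid and command line for all processes
    out = subprocess.run(["/bin/ps", "-axo", "pcpu,pid,command"],
                         capture_output=True, text=True).stdout
    for line in out.splitlines()[1:]:          # skip the header line
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        pcpu, pid, cmd = parts
        if "configd" in cmd and float(pcpu) > 20.0:
            print(time.strftime("%H:%M:%S"), pcpu, pid, cmd)
    time.sleep(2)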

It normally starts at 04:00 at night, when my PPPoE WAN is reconnected by the internal cron job "Periodic interface reset".
Normally the affected timeframe is < 5 minutes, which does not disturb anyone at night.
But every 5-10 days the high load does not stop on its own and configd needs a manual restart to recover. It also does not only start at 04:00, so it is not strictly bound to the WAN reconnect.
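For illustration, a minimal timestamped ping logger along these lines makes the high-latency windows visible in a file afterwards (192.168.1.1 is just a placeholder for the OPNsense LAN address, and the 1-second interval is arbitrary):

import re, subprocess, time

TARGET = "192.168.1.1"   # placeholder: use the OPNsense LAN address

with open("/tmp/latency.log", "a") as log:
    while True:
        # one echo request per loop, parse the reported round-trip time
        out = subprocess.run(["ping", "-c", "1", TARGET],
                             capture_output=True, text=True).stdout
        m = re.search(r"time=([\d.]+) ms", out)
        rtt = m.group(1) if m else "timeout"
        log.write("%s %s\n" % (time.strftime("%Y-%m-%dT%H:%M:%S"), rtt))
        log.flush()
        time.sleep(1)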

I already tried to limit the problem by switching off some of the traffic statistics, but this did not help.

Does anyone have an idea what it could be, or how to enable better logging for configd?
The backend log only shows configd.py repeating the same scripts until the manual restart, but no reason why:

2020-06-30T13:28:01   configd.py: message d69aa126-bca4-43bb-af64-bb60ee10c563 [filter.refresh_aliases] returned {"status": "ok"}
2020-06-30T13:27:58   configd.py: [b4353426-236d-409d-bcde-4772eaff6b6a] updating dyndns OPT1_VPNV4
2020-06-30T13:27:58   configd.py: [d69aa126-bca4-43bb-af64-bb60ee10c563] refresh url table aliases
2020-06-30T13:27:58   configd.py: OPNsense/Filter generated //usr/local/etc/filter_geoip.conf
2020-06-30T13:27:58   configd.py: OPNsense/Filter generated //usr/local/etc/filter_tables.conf
2020-06-30T13:27:57   configd.py: generate template container OPNsense/Filter


What's in the log at 04:00? Maybe some alias refresh or broken pattern updates.

On my boxes, periodic pf table updates spawn a python3 process every minute. The cron job itself is not new, but chances are that the CPU overhead caused by the script (update_tables.py) or by the Python interpreter itself is higher than in previous releases, because it went unnoticed (at least by me) for months until somewhere in the 20.1 branch.
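To get a feel for what a single run costs in CPU time, something like the sketch below could time one invocation. The script path is an assumption about where the filter scripts live on a stock install, so adjust it to your box; note this performs a real table refresh and needs root.

import resource, subprocess, time

# assumed location of the periodic table update script; verify on your install
SCRIPT = "/usr/local/opnsense/scripts/filter/update_tables.py"

start = time.monotonic()
before = resource.getrusage(resource.RUSAGE_CHILDREN)
subprocess.run(["python3", SCRIPT], check=False)       # one refresh run
after = resource.getrusage(resource.RUSAGE_CHILDREN)

cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
print("wall %.2fs, child cpu %.2fs" % (time.monotonic() - start, cpu))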

In virtualized environments, these regular CPU spikes put a lot of stress on our hypervisors, which indeed results in latency issues and lost packets (see this post).
--
Marin BERNARD
System administrator


The problem looks similar to the Netgate issue, at least from the latency symptom perspective: my statistics logging of ping replies looks similar to the ones posted in the Netgate forum.
But I cannot confirm the later discussion regarding maximum table entries - I only have 117 rules according to pfInfo.
I will give it a try and directly set the default values for "Firewall Maximum States" and "Firewall Maximum Table Entries" instead of leaving the fields empty. But it will take 1-3 weeks to confirm whether the problem is gone afterwards.
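In the meantime, the limits that those two fields end up applying, plus the current usage, can be checked with pfctl. A small sketch (pfctl must be run as root; the flags are the standard FreeBSD ones):

import subprocess

def pfctl(*args):
    # thin wrapper around /sbin/pfctl
    return subprocess.run(["/sbin/pfctl", *args],
                          capture_output=True, text=True).stdout

print(pfctl("-sm"))      # hard limits: states, table-entries, src-nodes, frags
print(pfctl("-si"))      # current state table usage and counters

# count entries per pf table to compare against the table-entries limit
for table in pfctl("-sT").split():
    entries = pfctl("-t", table, "-T", "show").splitlines()
    print(table, len(entries), "entries")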

Since I applied the workaround, the latency issue has not recurred yet.