OPNsense dying every few days on APU2

Started by ikkeT, December 18, 2024, 08:59:09 AM

Hi,

I've had this problem for several months, but now it's happening more often. OPNsense works just fine for several days, but all of a sudden home traffic starts slowing down, then I can't access it any longer and the network dies. I keep it up to date, and this is nothing new; the problem has been around for several releases. Now I'm running 24.7.11.

I just had to pull the plug and reboot. I thought I'd look around a bit. I disabled RRD collection just to make sure it's not that. No help. I run the following services at home, not much traffic:
- HAProxy (mainly traffic to a Nextcloud instance)
- dnsmasq for home gadgets
- kea dhcp
- captive portal for guest VLAN, hardly ever used.

I used to have IPv6 enabled, but after moving, the new connection only has IPv4.

So not much running. Immediately I notice some problems:

1. Flowd is eating CPU:


76462 root          1 135    0    58M    44M CPU0     0  16:38 100.00% python3.11
# ps awfux|grep 76462
root   76462 100.0  1.1  59844 44944  -  Rs   09:23   16:57.09 /usr/local/bin/python3 /usr/local/opnsense/scripts/netflow/flowd_aggregate.py (python3.11)



2. configd errors in logs

(I have never touched unbound, it's not running)

2024-12-18T09:44:55 Error configd.py [8741e584-e8e0-47d1-940e-639b0fe9a307] Script action failed with Command '/usr/local/opnsense/scripts/unbound/wrapper.py -s ' returned non-zero exit status 1. at Traceback (most recent call last): File "/usr/local/opnsense/service/modules/actions/script_output.py", line 78, in execute subprocess.check_call(script_command, env=self.config_environment, shell=True, File "/usr/local/lib/python3.11/subprocess.py", line 413, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command '/usr/local/opnsense/scripts/unbound/wrapper.py -s ' returned non-zero exit status 1.
2024-12-18T09:30:11 Error configd.py Timeout (120) executing : system diag log '20' '0' '' 'core' 'audit' 'Emergency,Alert,Critical,Error,Warning' '1734420490.461'
2024-12-18T08:55:33 Error configd.py [eb377147-ead9-4e22-b070-4066dc2a5e25] Script action failed with Command '/usr/local/opnsense/scripts/interfaces/list_macdb.py ' died with <Signals.SIGBUS: 10>. at Traceback (most recent call last): File "/usr/local/opnsense/service/modules/actions/script_output.py", line 78, in execute subprocess.check_call(script_command, env=self.config_environment, shell=True, File "/usr/local/lib/python3.11/subprocess.py", line 413, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command '/usr/local/opnsense/scripts/interfaces/list_macdb.py ' died with <Signals.SIGBUS: 10>.
2024-12-18T08:55:33 Error configd.py [47cd8873-4e90-45dd-81a7-66fa3dfee38c] Script action failed with Command '/usr/local/sbin/pluginctl -D ''' died with <Signals.SIGBUS: 10>. at Traceback (most recent call last): File "/usr/local/opnsense/service/modules/actions/script_output.py", line 78, in execute subprocess.check_call(script_command, env=self.config_environment, shell=True, File "/usr/local/lib/python3.11/subprocess.py", line 413, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command '/usr/local/sbin/pluginctl -D ''' died with <Signals.SIGBUS: 10>.
2024-12-18T08:53:14 Warning configd.py Stopping daemon.
2024-12-18T08:53:14 Error configd.py Configd disconnected while executing : interface list macdb
2024-12-18T08:52:52 Error configd.py Configd disconnected while executing : openvpn connections client,server
2024-12-18T08:52:52 Warning configd.py Stopping daemon.
2024-12-18T08:50:06 Error api no active session, user not found
2024-12-18T08:45:08 Error configd.py Timeout (120) executing : firmware remote
2024-12-18T08:43:06 Error configd.py Timeout (120) executing : firmware tiers
2024-12-18T08:41:28 Error configd.py Timeout (120) executing : firmware remote
2024-12-18T08:38:06 Error configd.py Timeout (120) executing : firmware remote
2024-12-18T08:38:05 Error configd.py Timeout (120) executing : firmware tiers
2024-12-18T08:36:05 Error configd.py Timeout (120) executing : firmware tiers
2024-12-18T08:33:04 Error configd.py Timeout (120) executing : firmware tiers
2024-12-18T08:23:11 Error configd.py Timeout (120) executing : firmware remote
2024-12-18T08:20:03 Error configd.py Timeout (120) executing : firmware tiers
2024-12-18T08:16:03 Error configd.py Timeout (120) executing : firmware tiers
2024-12-18T08:12:01 Error configd.py Timeout (120) executing : firmware tiers
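
To make the pattern in an excerpt like the one above easier to see, here is a minimal Python sketch that tallies configd log lines by failure type. The names `classify` and `tally` are made up for illustration and are not part of OPNsense. The sample lines are copied from the excerpt, with the traceback part of the SIGBUS line cut off for brevity.

```python
from collections import Counter

# Lines copied from the log excerpt; the SIGBUS entry is truncated before
# its " at Traceback ..." part to keep the sample short.
SAMPLE_LINES = [
    "2024-12-18T08:55:33 Error configd.py [47cd8873-4e90-45dd-81a7-66fa3dfee38c] "
    "Script action failed with Command '/usr/local/sbin/pluginctl -D ''' "
    "died with <Signals.SIGBUS: 10>.",
    "2024-12-18T08:53:14 Warning configd.py Stopping daemon.",
    "2024-12-18T08:53:14 Error configd.py Configd disconnected while executing : interface list macdb",
    "2024-12-18T08:45:08 Error configd.py Timeout (120) executing : firmware remote",
    "2024-12-18T08:43:06 Error configd.py Timeout (120) executing : firmware tiers",
]


def classify(line: str) -> str:
    """Return a coarse failure category for one configd log line."""
    if "Signals.SIGBUS" in line:
        return "sigbus"
    if "Timeout (" in line:
        return "timeout"
    if "Configd disconnected" in line:
        return "disconnected"
    return "other"


def tally(lines) -> Counter:
    """Count log lines per category, skipping blank lines."""
    return Counter(classify(line) for line in lines if line.strip())


if __name__ == "__main__":
    for category, count in tally(SAMPLE_LINES).most_common():
        print(f"{category}: {count}")
```

Run against the full excerpt, this would show that the timeouts start about 40 minutes before the first SIGBUS entry, which may help when comparing notes with others who see the same thing.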


3. Disk space should be OK

root@OPNsense:~ # ls -ltrh /var/crash && df -hT
total 4
-rw-r--r--  1 root wheel    5B Dec  2 21:45 minfree
Filesystem       Type     Size    Used   Avail Capacity  Mounted on
/dev/gpt/rootfs  ufs       13G    8.1G    4.3G    65%    /
devfs            devfs    1.0K      0B    1.0K     0%    /dev
tmpfs            tmpfs    2.0G    3.5M    2.0G     0%    /tmp
devfs            devfs    1.0K      0B    1.0K     0%    /var/dhcpd/dev
devfs            devfs    1.0K      0B    1.0K     0%    /var/captiveportal/zone0/dev


So the question is: what the heck is this flowd doing, and how do I disable it? Perhaps that's what is overcooking the CPU. I found an old thread about removing the interfaces from it and adding them back; I'll try that. Let's see what else is there.

I toggled the NICs off and back on in NetFlow, and also disabled the local service and cleared the NetFlow data a few times. Now I got the CPU usage down, at least for a while. Let's see if it stays that way.

Hi there,
I have a similar issue to yours. My OPNsense router stops working all of a sudden (the Internet dies and I cannot access the OPNsense GUI). It's been happening more frequently lately. To get the Internet back, I need to reboot manually.
Digging around the logs in the UI, I saw a backend error:
```
[506c11e3-fc64-4b1c-89d3-1767a6b76110] Script action failed with Command '/usr/local/opnsense/scripts/firmware/read.sh ' died with <Signals.SIGBUS: 10>. at Traceback (most recent call last): File "/usr/local/opnsense/service/modules/actions/script_output.py", line 78, in execute subprocess.check_call(script_command, env=self.config_environment, shell=True, File "/usr/local/lib/python3.11/subprocess.py", line 413, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command '/usr/local/opnsense/scripts/firmware/read.sh ' died with <Signals.SIGBUS: 10>.
```

It seems you are getting `<Signals.SIGBUS: 10>` as well, which maybe suggests corrupt memory?
Following this thread.
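
As a side note on how to read that message: the "died with <Signals.SIGBUS: ...>" wording comes from Python's `subprocess.CalledProcessError` when the child process was killed by a signal rather than exiting with an error code. The sketch below reproduces that on purpose with a throwaway child process. The helper name `run_child_killed_by_sigbus` is made up for illustration, and the signal number printed depends on the OS (10 on FreeBSD, a different number on Linux). It only shows what the log line means; it does not tell whether memory, storage, or something else is at fault.

```python
import signal
import subprocess
import sys


def run_child_killed_by_sigbus() -> subprocess.CalledProcessError:
    """Run a child Python process that sends SIGBUS to itself; return the error raised."""
    child_code = "import os, signal; os.kill(os.getpid(), signal.SIGBUS)"
    try:
        # Same call style as in the traceback: check_call raises on any failure.
        subprocess.check_call([sys.executable, "-c", child_code])
    except subprocess.CalledProcessError as err:
        return err
    raise RuntimeError("child exited normally, which is unexpected")


if __name__ == "__main__":
    err = run_child_killed_by_sigbus()
    # A negative returncode means "terminated by signal number -returncode".
    print(err.returncode == -signal.SIGBUS)
    # The string form is the same "died with <Signals.SIGBUS: N>" text seen in the log.
    print(err)
```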

Details:

OPNsense 24.7.11_2-amd64
FreeBSD 14.1-RELEASE-p6
OpenSSL 3.0.15

Intel(R) Core(TM) i3-N305 (8 cores, 8 threads) machine from AliExpress

Thanks