I have a Protectli Vault 4-port running OPNsense 22.7.11_1, with ZenArmor and the mDNS repeater installed. ZenArmor uses an Elasticsearch database hosted on my DMZ server, to which the router is directly connected (point-to-point).
About 3 or 4 months ago, the router began rebooting randomly, roughly once a day. Sometimes it would be fine for several days; other times it happened multiple times in a day, including in the middle of the night when there was next to no activity. I never got around to diagnosing the cause, because it wasn't a show-stopper and I had more pressing things to deal with. When it happened, our internet would be down for about 1-2 minutes and then come back.
Strangely, the problem went away completely for about a month after I unplugged a small external fan that was blowing air across the Protectli's passive cooling fins. I should mention that this fan had been cooling it long before the crash/reboot issues started. But sure enough, when I plugged the fan back in, the problem returned. The fan brings the CPU temperature down about 20 degrees F, from the 115-120 F range to about 95-100 F (for those more accustomed to Celsius, that's about 46-49 C vs. 35-38 C), which seems like a good thing. But for whatever reason, the router randomly crashes and reboots as a result.
I know this makes it easy to blame the hardware. But here's the thing: now that I'm digging into the problem, I'm finding that these reboots are consistently due to a kernel panic occurring while a Python 3.9 process is running, possibly related to ZenArmor. I am still collecting data, but here are kernel panic messages from three separate crashes/reboots (obtained via the dmesg command):
Fatal trap 9: general protection fault while in kernel mode
cpuid = 3; apic id = 06
instruction pointer = 0x20:0xffffffff8114c9a8
stack pointer = 0x28:0xfffffe00a022cc20
frame pointer = 0x28:0xfffffe00a022cd60
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 48339 (python3.9)
trap number = 9
panic: general protection fault
cpuid = 3
time = 1686917344
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00a022ca40
vpanic() at vpanic+0x17f/frame 0xfffffe00a022ca90
panic() at panic+0x43/frame 0xfffffe00a022caf0
trap_fatal() at trap_fatal+0x385/frame 0xfffffe00a022cb50
calltrap() at calltrap+0x8/frame 0xfffffe00a022cb50
--- trap 0x9, rip = 0xffffffff8114c9a8, rsp = 0xfffffe00a022cc20, rbp = 0xfffffe00a022cd60 ---
pmap_remove_pages() at pmap_remove_pages+0x4d8/frame 0xfffffe00a022cd60
vmspace_exit() at vmspace_exit+0x7f/frame 0xfffffe00a022cd90
exit1() at exit1+0x57f/frame 0xfffffe00a022cdf0
sys_sys_exit() at sys_sys_exit+0xd/frame 0xfffffe00a022ce00
amd64_syscall() at amd64_syscall+0x10c/frame 0xfffffe00a022cf30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00a022cf30
--- syscall (1, FreeBSD ELF64, sys_sys_exit), rip = 0x80078c0da, rsp = 0x7fffffffeb18, rbp = 0x7fffffffeb30 ---
KDB: enter: panic
Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 04
fault virtual address = 0x141172067
fault code = supervisor write data, page not present
instruction pointer = 0x20:0xffffffff8114d864
stack pointer = 0x0:0xfffffe00a02c5bb0
frame pointer = 0x0:0xfffffe00a02c5be0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 34353 (python3.9)
trap number = 12
panic: page fault
cpuid = 1
time = 1686966901
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00a02c5970
vpanic() at vpanic+0x17f/frame 0xfffffe00a02c59c0
panic() at panic+0x43/frame 0xfffffe00a02c5a20
trap_fatal() at trap_fatal+0x385/frame 0xfffffe00a02c5a80
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00a02c5ae0
calltrap() at calltrap+0x8/frame 0xfffffe00a02c5ae0
--- trap 0xc, rip = 0xffffffff8114d864, rsp = 0xfffffe00a02c5bb0, rbp = 0xfffffe00a02c5be0 ---
pmap_is_prefaultable() at pmap_is_prefaultable+0x164/frame 0xfffffe00a02c5be0
vm_fault_prefault() at vm_fault_prefault+0x112/frame 0xfffffe00a02c5c50
vm_fault() at vm_fault+0x120c/frame 0xfffffe00a02c5d70
vm_fault_trap() at vm_fault_trap+0x6d/frame 0xfffffe00a02c5dc0
trap_pfault() at trap_pfault+0x1f3/frame 0xfffffe00a02c5e20
trap() at trap+0x40a/frame 0xfffffe00a02c5f30
calltrap() at calltrap+0x8/frame 0xfffffe00a02c5f30
--- trap 0xc, rip = 0x8003cbf7d, rsp = 0x7fffffffd3f0, rbp = 0x7fffffffd410 ---
KDB: enter: panic
Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer = 0x20:0xffffffff811491ac
stack pointer = 0x0:0xfffffe00b5a92b90
frame pointer = 0x0:0xfffffe00b5a92b90
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 31806 (python3.9)
trap number = 9
panic: general protection fault
cpuid = 1
time = 1686973021
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00b5a929b0
vpanic() at vpanic+0x17f/frame 0xfffffe00b5a92a00
panic() at panic+0x43/frame 0xfffffe00b5a92a60
trap_fatal() at trap_fatal+0x385/frame 0xfffffe00b5a92ac0
calltrap() at calltrap+0x8/frame 0xfffffe00b5a92ac0
--- trap 0x9, rip = 0xffffffff811491ac, rsp = 0xfffffe00b5a92b90, rbp = 0xfffffe00b5a92b90 ---
pmap_pvh_remove() at pmap_pvh_remove+0x4c/frame 0xfffffe00b5a92b90
pmap_enter() at pmap_enter+0xd59/frame 0xfffffe00b5a92c50
vm_fault() at vm_fault+0x1016/frame 0xfffffe00b5a92d70
vm_fault_trap() at vm_fault_trap+0x6d/frame 0xfffffe00b5a92dc0
trap_pfault() at trap_pfault+0x1f3/frame 0xfffffe00b5a92e20
trap() at trap+0x40a/frame 0xfffffe00b5a92f30
calltrap() at calltrap+0x8/frame 0xfffffe00b5a92f30
--- trap 0xc, rip = 0x80049fb58, rsp = 0x7fffffffd2f0, rbp = 0x7fffffffd450 ---
KDB: enter: panic
UPDATE:
I wrote a script to log the output of several commands -- top, ps aux, dmesg, and pftop -- every 2 seconds, rotating over a 2-minute window of log files.
Doing this, I was able to identify a handful of Python processes that started 30 seconds before the logging went dark, presumably due to the kernel panic. The top output also suggests a possible memory leak.
It's interesting to note that there were 15 logs (spanning 30 seconds) in the rotation that were totally empty, meaning the script started writing to the file but produced no output. Note that the write to each log file is a backgrounded subshell that also echoes a timestamp variable from the outer shell, and even that timestamp wasn't written to those 15 empty logs.
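For reference, here is a minimal sketch of the kind of logging loop I'm describing. This is illustrative, not my exact script -- the paths, slot count, and command list are placeholders, and I've left top/pftop out of the command list since their batch-mode flags vary by platform.

```shell
#!/bin/sh
# Sketch of the rotating logger (illustrative, not the exact script).
# Each snapshot is written by a backgrounded subshell; the timestamp echo
# happens inside that subshell, so a completely empty slot file means the
# subshell produced no output at all -- which matches the 15 blank logs.

crashwatch() {
    logdir="${1:-/tmp/crashwatch}"
    cycles="${2:-0}"                # 0 = run until killed
    interval="${3:-2}"              # seconds between snapshots
    mkdir -p "$logdir"
    i=0
    while [ "$cycles" -eq 0 ] || [ "$i" -lt "$cycles" ]; do
        ts=$(date '+%Y-%m-%d %H:%M:%S')
        slot=$(( i % 60 ))          # 60 slots x 2 s = 2-minute window
        (
            echo "=== $ts ==="
            # Add "top" and "pftop" here with the batch-mode flags
            # appropriate for your platform.
            for cmd in "ps aux" "dmesg"; do
                echo "--- $cmd ---"
                sh -c "$cmd" 2>&1   # capture errors rather than aborting
            done
        ) > "$logdir/slot_$slot.log" &
        i=$(( i + 1 ))
        sleep "$interval"
    done
    wait                            # let backgrounded writers finish
}
```

Running `crashwatch /var/log/crashwatch` in the background keeps a rolling 2-minute history that survives to the next boot for post-mortem inspection.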
Here are the new Python processes that show up 30 seconds before the logging went dark. These appear in only a single 'ps aux' log:
root 40568 10.8 0.8 43020 31564 - S 07:08 0:01.21 /usr/local/bin/python3 /usr/local/opnsense/scripts/filter/update_tables.py (python3.9)
root 42497 2.9 0.4 27288 16448 - S 07:08 0:00.33 /usr/local/sensei/py_venv/bin/python3 /usr/local/opnsense/scripts/OPNsense/Sensei/periodicals.py (python3.9)
root 42708 0.0 0.3 23736 12396 - S 07:08 0:00.20 /usr/local/bin/python3 /usr/local/sbin/configctl sensei userenrich (python3.9)
root 42969 0.0 0.5 32280 22052 - R 07:08 0:00.57 /usr/local/sensei/py_venv/bin/python3 /usr/local/opnsense/scripts/OPNsense/Sensei/userenrich.py (python3.9)
root 48158 0.0 0.1 43020 3668 - R 07:08 0:00.00 /sbin/pfctl -t __opt15_network -T show (python3.9)
These Python processes show up consistently in the 'ps aux' logs:
root 8723 1.5 1.8 114752 74968 - S< Tue10 57:24.50 /usr/local/sensei/py_venv/bin/python3 /usr/local/sensei//scripts/datastore/ipdrstreamer.py (python3.9)
root 27669 0.0 0.3 23736 12016 - S Tue10 0:37.58 /usr/local/bin/python3 /usr/local/sbin/configctl -e -t 0.5 system event config_changed (python3.9)
root 30075 0.0 0.3 23972 12100 - S Tue10 0:34.03 /usr/local/bin/python3 /usr/local/opnsense/scripts/syslog/lockout_handler (python3.9)
root 33269 0.0 0.6 36256 23788 - Is Wed17 0:02.44 /usr/local/bin/python3 /usr/local/opnsense/service/configd.py (python3.9)
root 34926 0.0 0.8 66484 33788 - I Wed17 1:32.06 /usr/local/bin/python3 /usr/local/opnsense/service/configd.py console (python3.9)
root 76595 0.0 0.3 23736 12092 - S Tue19 0:32.45 /usr/local/bin/python3 /usr/local/sbin/configctl -e -t 0.5 system event config_changed (python3.9)
root 76891 0.0 0.3 23972 12176 - S Tue19 0:29.67 /usr/local/bin/python3 /usr/local/opnsense/scripts/syslog/lockout_handler (python3.9)
Here is the last 'top' output from before the logging cut out. Take note of the amount of free memory.
last pid: 89561; load averages: 0.43, 0.43, 0.39 up 2+20:18:50 07:08:31
79 processes: 1 running, 78 sleeping
CPU: 3.3% user, 0.0% nice, 2.5% system, 0.0% interrupt, 94.1% idle
Mem: 106M Active, 2334M Inact, 1054M Wired, 384M Buf, 381M Free
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
8723 root 7 20 -20 112M 73M select 0 57:25 0.68% python3.
8118 root 12 20 -20 2214M 571M nanslp 1 36:16 0.00% eastpect
8160 root 12 20 -20 2129M 482M nanslp 2 19:33 0.00% eastpect
53587 root 1 52 -20 14M 4356K wait 0 9:33 0.00% bash
48656 root 2 20 -20 845M 189M nanslp 0 4:33 0.00% eastpect
28722 root 1 52 0 13M 3008K wait 0 1:54 0.00% sh
34926 root 2 27 0 65M 33M accept 0 1:32 0.00% python3.
69189 unbound 4 20 0 103M 40M kqread 1 1:22 0.00% unbound
92515 root 1 20 0 13M 2348K kqread 2 1:16 0.00% rtsold
93925 root 1 20 0 13M 2428K select 0 1:12 0.00% rtsold
72490 root 1 20 0 12M 2324K select 3 1:06 0.00% radvd
76399 root 3 20 0 40M 12M kqread 1 0:46 0.00% syslog-n
27669 root 1 20 0 23M 12M select 3 0:38 0.00% python3.
30075 root 1 20 0 23M 12M select 3 0:34 0.00% python3.
95965 root 1 20 0 21M 6552K select 3 0:34 0.00% ntpd
76595 root 1 20 0 23M 12M select 3 0:32 0.00% python3.
76891 root 1 20 0 23M 12M select 2 0:30 0.00% python3.
50611 root 1 20 0 13M 2628K bpf 0 0:25 0.00% filterlo
For comparison, here is a normal 'top' output:
last pid: 42538; load averages: 0.36, 0.27, 0.32 up 0+15:43:46 14:21:45
80 processes: 2 running, 78 sleeping
CPU: 2.0% user, 0.0% nice, 2.2% system, 0.0% interrupt, 95.8% idle
Mem: 101M Active, 684M Inact, 962M Wired, 355M Buf, 2083M Free
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
70937 root 6 20 -20 69M 44M select 0 3:36 0.10% python3.9
88652 root 1 52 -20 14M 4376K wait 3 0:46 0.10% bash
69398 root 12 20 -20 2005M 318M nanslp 2 2:43 0.00% eastpect
68511 root 12 20 -20 2001M 287M nanslp 1 1:53 0.00% eastpect
98717 root 2 20 0 103M 53M accept 1 1:30 0.00% python3.9
5306 root 1 52 0 60M 39M piperd 2 0:43 0.00% php
51212 root 2 20 -20 851M 195M nanslp 0 0:40 0.00% eastpect
6804 root 1 52 0 13M 2832K wait 3 0:19 0.00% sh
81801 root 1 20 0 13M 2352K kqread 0 0:16 0.00% rtsold
75363 root 1 20 0 25M 15M select 2 0:14 0.00% python3.9
85012 root 1 20 0 13M 2284K select 2 0:14 0.00% rtsold
31667 root 3 20 0 42M 12M kqread 0 0:13 0.00% syslog-ng
33698 root 1 20 0 12M 2324K select 2 0:13 0.00% radvd
47169 root 1 52 0 56M 37M accept 3 0:11 0.00% php-cgi
49850 root 1 20 0 23M 12M select 1 0:08 0.00% python3.9
52451 root 1 20 0 23M 12M select 0 0:07 0.00% python3.9
21484 root 1 20 0 21M 6552K select 3 0:07 0.00% ntpd
90459 root 1 52 0 54M 35M accept 0 0:07 0.00% php-cgi
My guess is that something in one of these Python processes is causing a memory leak. I'm not sure how the fan/cooling is related, but perhaps the cooler CPU running faster contributes to the problem?
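One way to test the leak theory would be to sample the resident set size of a suspect process over time; a quick sketch (the PID, sample count, and interval are whatever fits the situation):

```shell
#!/bin/sh
# Sample a process's RSS periodically; a steady monotonic climb over hours
# would support the memory-leak theory. 'ps -o rss= -p PID' works with
# both the FreeBSD and Linux ps.
watch_rss() {
    pid="$1"
    samples="${2:-0}"       # 0 = sample until the process exits
    interval="${3:-60}"     # seconds between samples
    n=0
    while [ "$samples" -eq 0 ] || [ "$n" -lt "$samples" ]; do
        rss=$(ps -o rss= -p "$pid") || break    # stop when the process dies
        rss=$(echo $rss)                        # trim ps's padding
        echo "$(date '+%H:%M:%S') pid=$pid rss_kb=$rss"
        n=$(( n + 1 ))
        sleep "$interval"
    done
}
```

For example, `watch_rss 8723 0 300 >> /var/log/ipdrstreamer_rss.log &` would log the ipdrstreamer.py RSS every 5 minutes until it dies.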
It still looks like a hardware issue, perhaps main memory going bad?
Cheers,
Franco
I have another update:
I was looking at more detailed logs from a recent crash, and about 20 hours before the crash a sensei/zenarmor Python process, ipdrstreamer.py, suddenly started consuming 100% CPU (presumably of one core) and stayed in that high-CPU state for the entire ~20 hours leading up to the crash. Prior to that, it showed low CPU consumption like the other sensei Python processes. Digging into this more...
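To catch this kind of runaway earlier next time, a small filter over 'ps aux' output could flag any process pegged near a full core (the 90% default threshold is arbitrary):

```shell
#!/bin/sh
# Print PID, %CPU, and command for any process above a %CPU threshold.
# $3 is the %CPU column in both FreeBSD and Linux 'ps aux' output.
high_cpu() {
    threshold="${1:-90}"
    ps aux | awk -v t="$threshold" 'NR > 1 && $3 + 0 > t { print $2, $3, $11 }'
}
```

The same awk filter can be pointed at the saved 'ps aux' logs from the rotation instead of live output, which is how I'd reconstruct exactly when ipdrstreamer.py pinned the core.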