I have a Protectli Vault 4-port running OPNsense 22.7.11_1, with ZenArmor and the mDNS repeater installed. ZenArmor uses an Elasticsearch database hosted on my DMZ server, to which the router is directly connected (point-to-point).
About 3 or 4 months ago, the router began rebooting randomly, roughly once a day. Sometimes it would be fine for several days; other times it happened multiple times in a day, including in the middle of the night when there was next to no activity. I never got around to diagnosing the cause, because it wasn't a show-stopper and I had more pressing things to deal with. When it happened, our internet would be down for about 1-2 minutes and then come back.
Strangely, the problem went away completely for about a month after I unplugged a small external fan that was blowing air across the Protectli's passive cooling fins. I should mention that this fan had been cooling it long before the crash/reboot issues started. But sure enough, when I plugged the fan back in, the problem returned. The fan brings the CPU temperature down about 20 degrees F, from the 115-120 F range to about 95-100 F (for those more accustomed to Celsius, that's about 46-49 C vs. 35-38 C), which seems like a good thing. But for whatever reason, the router randomly crashes and reboots as a result.
I know this makes it easy to blame the hardware. But here's the thing: now that I'm digging into the problem, I'm finding that these reboots are consistently due to a kernel panic occurring while a Python 3.9 process is running, possibly related to ZenArmor. I am still collecting data, but here are kernel panic messages from three separate crashes/reboots (obtained via the dmesg command):
Fatal trap 9: general protection fault while in kernel mode
cpuid = 3; apic id = 06
instruction pointer = 0x20:0xffffffff8114c9a8
stack pointer = 0x28:0xfffffe00a022cc20
frame pointer = 0x28:0xfffffe00a022cd60
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 48339 (python3.9)
trap number = 9
panic: general protection fault
cpuid = 3
time = 1686917344
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00a022ca40
vpanic() at vpanic+0x17f/frame 0xfffffe00a022ca90
panic() at panic+0x43/frame 0xfffffe00a022caf0
trap_fatal() at trap_fatal+0x385/frame 0xfffffe00a022cb50
calltrap() at calltrap+0x8/frame 0xfffffe00a022cb50
--- trap 0x9, rip = 0xffffffff8114c9a8, rsp = 0xfffffe00a022cc20, rbp = 0xfffffe00a022cd60 ---
pmap_remove_pages() at pmap_remove_pages+0x4d8/frame 0xfffffe00a022cd60
vmspace_exit() at vmspace_exit+0x7f/frame 0xfffffe00a022cd90
exit1() at exit1+0x57f/frame 0xfffffe00a022cdf0
sys_sys_exit() at sys_sys_exit+0xd/frame 0xfffffe00a022ce00
amd64_syscall() at amd64_syscall+0x10c/frame 0xfffffe00a022cf30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00a022cf30
--- syscall (1, FreeBSD ELF64, sys_sys_exit), rip = 0x80078c0da, rsp = 0x7fffffffeb18, rbp = 0x7fffffffeb30 ---
KDB: enter: panic
Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 04
fault virtual address = 0x141172067
fault code = supervisor write data, page not present
instruction pointer = 0x20:0xffffffff8114d864
stack pointer = 0x0:0xfffffe00a02c5bb0
frame pointer = 0x0:0xfffffe00a02c5be0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 34353 (python3.9)
trap number = 12
panic: page fault
cpuid = 1
time = 1686966901
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00a02c5970
vpanic() at vpanic+0x17f/frame 0xfffffe00a02c59c0
panic() at panic+0x43/frame 0xfffffe00a02c5a20
trap_fatal() at trap_fatal+0x385/frame 0xfffffe00a02c5a80
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00a02c5ae0
calltrap() at calltrap+0x8/frame 0xfffffe00a02c5ae0
--- trap 0xc, rip = 0xffffffff8114d864, rsp = 0xfffffe00a02c5bb0, rbp = 0xfffffe00a02c5be0 ---
pmap_is_prefaultable() at pmap_is_prefaultable+0x164/frame 0xfffffe00a02c5be0
vm_fault_prefault() at vm_fault_prefault+0x112/frame 0xfffffe00a02c5c50
vm_fault() at vm_fault+0x120c/frame 0xfffffe00a02c5d70
vm_fault_trap() at vm_fault_trap+0x6d/frame 0xfffffe00a02c5dc0
trap_pfault() at trap_pfault+0x1f3/frame 0xfffffe00a02c5e20
trap() at trap+0x40a/frame 0xfffffe00a02c5f30
calltrap() at calltrap+0x8/frame 0xfffffe00a02c5f30
--- trap 0xc, rip = 0x8003cbf7d, rsp = 0x7fffffffd3f0, rbp = 0x7fffffffd410 ---
KDB: enter: panic
Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer = 0x20:0xffffffff811491ac
stack pointer = 0x0:0xfffffe00b5a92b90
frame pointer = 0x0:0xfffffe00b5a92b90
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 31806 (python3.9)
trap number = 9
panic: general protection fault
cpuid = 1
time = 1686973021
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00b5a929b0
vpanic() at vpanic+0x17f/frame 0xfffffe00b5a92a00
panic() at panic+0x43/frame 0xfffffe00b5a92a60
trap_fatal() at trap_fatal+0x385/frame 0xfffffe00b5a92ac0
calltrap() at calltrap+0x8/frame 0xfffffe00b5a92ac0
--- trap 0x9, rip = 0xffffffff811491ac, rsp = 0xfffffe00b5a92b90, rbp = 0xfffffe00b5a92b90 ---
pmap_pvh_remove() at pmap_pvh_remove+0x4c/frame 0xfffffe00b5a92b90
pmap_enter() at pmap_enter+0xd59/frame 0xfffffe00b5a92c50
vm_fault() at vm_fault+0x1016/frame 0xfffffe00b5a92d70
vm_fault_trap() at vm_fault_trap+0x6d/frame 0xfffffe00b5a92dc0
trap_pfault() at trap_pfault+0x1f3/frame 0xfffffe00b5a92e20
trap() at trap+0x40a/frame 0xfffffe00b5a92f30
calltrap() at calltrap+0x8/frame 0xfffffe00b5a92f30
--- trap 0xc, rip = 0x80049fb58, rsp = 0x7fffffffd2f0, rbp = 0x7fffffffd450 ---
KDB: enter: panic
UPDATE:
I wrote a script to log the output of several commands -- top, ps aux, dmesg, and pftop -- every 2 seconds, rotating over a 2-minute window of log files.
Doing this, I was able to identify a handful of Python processes that started 30 seconds before the logging went dark, presumably due to the kernel panic. The top output also suggests a possible memory leak.
It's interesting to note that there were 15 logs (spanning 30 seconds) in the rotation that were totally empty, meaning the script started writing to the file but produced no output. Note that the write to each log file is a backgrounded subshell that also echoes a timestamp variable from the outer shell, and even that timestamp wasn't written to those 15 empty logs.
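For reference, here is a minimal sketch of the kind of logging loop I'm describing. This is illustrative, not my exact script -- the paths, slot count, and command list are placeholders, and I've left top/pftop out of the command list since their batch-mode flags vary by platform.

```shell
#!/bin/sh
# Sketch of the rotating logger (illustrative, not the exact script).
# Each snapshot is written by a backgrounded subshell; the timestamp echo
# happens inside that subshell, so a completely empty slot file means the
# subshell produced no output at all -- which matches the 15 blank logs.

crashwatch() {
    logdir="${1:-/tmp/crashwatch}"
    cycles="${2:-0}"                # 0 = run until killed
    interval="${3:-2}"              # seconds between snapshots
    mkdir -p "$logdir"
    i=0
    while [ "$cycles" -eq 0 ] || [ "$i" -lt "$cycles" ]; do
        ts=$(date '+%Y-%m-%d %H:%M:%S')
        slot=$(( i % 60 ))          # 60 slots x 2 s = 2-minute window
        (
            echo "=== $ts ==="
            # Add "top" and "pftop" here with the batch-mode flags
            # appropriate for your platform.
            for cmd in "ps aux" "dmesg"; do
                echo "--- $cmd ---"
                sh -c "$cmd" 2>&1   # capture errors rather than aborting
            done
        ) > "$logdir/slot_$slot.log" &
        i=$(( i + 1 ))
        sleep "$interval"
    done
    wait                            # let backgrounded writers finish
}
```

Running `crashwatch /var/log/crashwatch` in the background keeps a rolling 2-minute history that survives to the next boot for post-mortem inspection.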
Here are the new Python processes that show up 30 seconds before the logging went dark. These appear in only a single 'ps aux' log:
root 40568 10.8 0.8 43020 31564 - S 07:08 0:01.21 /usr/local/bin/python3 /usr/local/opnsense/scripts/filter/update_tables.py (python3.9)
root 42497 2.9 0.4 27288 16448 - S 07:08 0:00.33 /usr/local/sensei/py_venv/bin/python3 /usr/local/opnsense/scripts/OPNsense/Sensei/periodicals.py (python3.9)
root 42708 0.0 0.3 23736 12396 - S 07:08 0:00.20 /usr/local/bin/python3 /usr/local/sbin/configctl sensei userenrich (python3.9)
root 42969 0.0 0.5 32280 22052 - R 07:08 0:00.57 /usr/local/sensei/py_venv/bin/python3 /usr/local/opnsense/scripts/OPNsense/Sensei/userenrich.py (python3.9)
root 48158 0.0 0.1 43020 3668 - R 07:08 0:00.00 /sbin/pfctl -t __opt15_network -T show (python3.9)
These Python processes show up consistently in the 'ps aux' logs:
root 8723 1.5 1.8 114752 74968 - S< Tue10 57:24.50 /usr/local/sensei/py_venv/bin/python3 /usr/local/sensei//scripts/datastore/ipdrstreamer.py (python3.9)
root 27669 0.0 0.3 23736 12016 - S Tue10 0:37.58 /usr/local/bin/python3 /usr/local/sbin/configctl -e -t 0.5 system event config_changed (python3.9)
root 30075 0.0 0.3 23972 12100 - S Tue10 0:34.03 /usr/local/bin/python3 /usr/local/opnsense/scripts/syslog/lockout_handler (python3.9)
root 33269 0.0 0.6 36256 23788 - Is Wed17 0:02.44 /usr/local/bin/python3 /usr/local/opnsense/service/configd.py (python3.9)
root 34926 0.0 0.8 66484 33788 - I Wed17 1:32.06 /usr/local/bin/python3 /usr/local/opnsense/service/configd.py console (python3.9)
root 76595 0.0 0.3 23736 12092 - S Tue19 0:32.45 /usr/local/bin/python3 /usr/local/sbin/configctl -e -t 0.5 system event config_changed (python3.9)
root 76891 0.0 0.3 23972 12176 - S Tue19 0:29.67 /usr/local/bin/python3 /usr/local/opnsense/scripts/syslog/lockout_handler (python3.9)
Here is the last 'top' output from before the logging cut out. Take note of the amount of free memory.
last pid: 89561; load averages: 0.43, 0.43, 0.39 up 2+20:18:50 07:08:31
79 processes: 1 running, 78 sleeping
CPU: 3.3% user, 0.0% nice, 2.5% system, 0.0% interrupt, 94.1% idle
Mem: 106M Active, 2334M Inact, 1054M Wired, 384M Buf, 381M Free
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
8723 root 7 20 -20 112M 73M select 0 57:25 0.68% python3.
8118 root 12 20 -20 2214M 571M nanslp 1 36:16 0.00% eastpect
8160 root 12 20 -20 2129M 482M nanslp 2 19:33 0.00% eastpect
53587 root 1 52 -20 14M 4356K wait 0 9:33 0.00% bash
48656 root 2 20 -20 845M 189M nanslp 0 4:33 0.00% eastpect
28722 root 1 52 0 13M 3008K wait 0 1:54 0.00% sh
34926 root 2 27 0 65M 33M accept 0 1:32 0.00% python3.
69189 unbound 4 20 0 103M 40M kqread 1 1:22 0.00% unbound
92515 root 1 20 0 13M 2348K kqread 2 1:16 0.00% rtsold
93925 root 1 20 0 13M 2428K select 0 1:12 0.00% rtsold
72490 root 1 20 0 12M 2324K select 3 1:06 0.00% radvd
76399 root 3 20 0 40M 12M kqread 1 0:46 0.00% syslog-n
27669 root 1 20 0 23M 12M select 3 0:38 0.00% python3.
30075 root 1 20 0 23M 12M select 3 0:34 0.00% python3.
95965 root 1 20 0 21M 6552K select 3 0:34 0.00% ntpd
76595 root 1 20 0 23M 12M select 3 0:32 0.00% python3.
76891 root 1 20 0 23M 12M select 2 0:30 0.00% python3.
50611 root 1 20 0 13M 2628K bpf 0 0:25 0.00% filterlo
For comparison, here is a normal 'top' output:
last pid: 42538; load averages: 0.36, 0.27, 0.32 up 0+15:43:46 14:21:45
80 processes: 2 running, 78 sleeping
CPU: 2.0% user, 0.0% nice, 2.2% system, 0.0% interrupt, 95.8% idle
Mem: 101M Active, 684M Inact, 962M Wired, 355M Buf, 2083M Free
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
70937 root 6 20 -20 69M 44M select 0 3:36 0.10% python3.9
88652 root 1 52 -20 14M 4376K wait 3 0:46 0.10% bash
69398 root 12 20 -20 2005M 318M nanslp 2 2:43 0.00% eastpect
68511 root 12 20 -20 2001M 287M nanslp 1 1:53 0.00% eastpect
98717 root 2 20 0 103M 53M accept 1 1:30 0.00% python3.9
5306 root 1 52 0 60M 39M piperd 2 0:43 0.00% php
51212 root 2 20 -20 851M 195M nanslp 0 0:40 0.00% eastpect
6804 root 1 52 0 13M 2832K wait 3 0:19 0.00% sh
81801 root 1 20 0 13M 2352K kqread 0 0:16 0.00% rtsold
75363 root 1 20 0 25M 15M select 2 0:14 0.00% python3.9
85012 root 1 20 0 13M 2284K select 2 0:14 0.00% rtsold
31667 root 3 20 0 42M 12M kqread 0 0:13 0.00% syslog-ng
33698 root 1 20 0 12M 2324K select 2 0:13 0.00% radvd
47169 root 1 52 0 56M 37M accept 3 0:11 0.00% php-cgi
49850 root 1 20 0 23M 12M select 1 0:08 0.00% python3.9
52451 root 1 20 0 23M 12M select 0 0:07 0.00% python3.9
21484 root 1 20 0 21M 6552K select 3 0:07 0.00% ntpd
90459 root 1 52 0 54M 35M accept 0 0:07 0.00% php-cgi
My guess is that something in one of these Python processes is causing a memory leak. I'm not sure how the fan/cooling is related, but perhaps the cooler CPU running faster contributes to the problem?
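One way to test the leak theory would be to sample the resident set size of a suspect process over time; a quick sketch (the PID, sample count, and interval are whatever fits the situation):

```shell
#!/bin/sh
# Sample a process's RSS periodically; a steady monotonic climb over hours
# would support the memory-leak theory. 'ps -o rss= -p PID' works with
# both the FreeBSD and Linux ps.
watch_rss() {
    pid="$1"
    samples="${2:-0}"       # 0 = sample until the process exits
    interval="${3:-60}"     # seconds between samples
    n=0
    while [ "$samples" -eq 0 ] || [ "$n" -lt "$samples" ]; do
        rss=$(ps -o rss= -p "$pid") || break    # stop when the process dies
        rss=$(echo $rss)                        # trim ps's padding
        echo "$(date '+%H:%M:%S') pid=$pid rss_kb=$rss"
        n=$(( n + 1 ))
        sleep "$interval"
    done
}
```

For example, `watch_rss 8723 0 300 >> /var/log/ipdrstreamer_rss.log &` would log the ipdrstreamer.py RSS every 5 minutes until it dies.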
It still looks like a hardware issue, perhaps main memory going bad?
Cheers,
Franco
I have another update:
I was looking at more detailed logs from a recent crash, and about 20 hours before the crash a sensei/zenarmor Python process, ipdrstreamer.py, suddenly started consuming 100% CPU (presumably of one core) and stayed in that high-CPU state for the entire ~20 hours leading up to the crash. Prior to that, it showed low CPU consumption like the other sensei Python processes. Digging into this more...
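To catch this kind of runaway earlier next time, a small filter over 'ps aux' output could flag any process pegged near a full core (the 90% default threshold is arbitrary):

```shell
#!/bin/sh
# Print PID, %CPU, and command for any process above a %CPU threshold.
# $3 is the %CPU column in both FreeBSD and Linux 'ps aux' output.
high_cpu() {
    threshold="${1:-90}"
    ps aux | awk -v t="$threshold" 'NR > 1 && $3 + 0 > t { print $2, $3, $11 }'
}
```

The same awk filter can be pointed at the saved 'ps aux' logs from the rotation instead of live output, which is how I'd reconstruct exactly when ipdrstreamer.py pinned the core.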