frequent crashes (kernel panic)

Started by neckcen, November 25, 2023, 10:11:59 PM

November 25, 2023, 10:11:59 PM Last Edit: November 27, 2023, 05:38:53 PM by neckcen
Hello,

Since upgrading to 23.7.9 on the 23rd I've experienced 4 crashes. Here is an example of such a crash:


Fatal trap 9: general protection fault while in kernel mode
cpuid = 6; apic id = 06
instruction pointer = 0x20:0xffffffff81019df4
stack pointer         = 0x28:0xfffffe00917945e0
frame pointer         = 0x28:0xfffffe0091794670
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 25976 (python3.9)
trap number = 9
panic: general protection fault
cpuid = 6
time = 1700943731
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0091794400
vpanic() at vpanic+0x151/frame 0xfffffe0091794450
panic() at panic+0x43/frame 0xfffffe00917944b0
trap_fatal() at trap_fatal+0x387/frame 0xfffffe0091794510
calltrap() at calltrap+0x8/frame 0xfffffe0091794510
--- trap 0x9, rip = 0xffffffff81019df4, rsp = 0xfffffe00917945e0, rbp = 0xfffffe0091794670 ---
vm_radix_lookup_ge() at vm_radix_lookup_ge+0x104/frame 0xfffffe0091794670
kern_proc_vmmap_resident() at kern_proc_vmmap_resident+0x12b/frame 0xfffffe00917946e0
kern_proc_vmmap_out() at kern_proc_vmmap_out+0x1ae/frame 0xfffffe0091794870
note_procstat_vmmap() at note_procstat_vmmap+0x81/frame 0xfffffe00917948c0
elf64_coredump() at elf64_coredump+0x3e8/frame 0xfffffe0091794990
sigexit() at sigexit+0xbe0/frame 0xfffffe0091794e30
postsig() at postsig+0x23c/frame 0xfffffe0091794ef0
ast() at ast+0x347/frame 0xfffffe0091794f30
doreti_ast() at doreti_ast+0x1f/frame 0x8207620d0
KDB: enter: panic


Is this a known problem? I haven't found anything on GitHub or the forum suggesting it is. How would one go about investigating what causes the crash in the first place? It appears to involve a Python process, but there are several of these running.

I've tried reinstalling the opnsense package (the only one with a 23.7.9 version) but it didn't help.

The firewall is running on a DEC2750 from Deciso. Plugins enabled are os-ddclient and os-wireguard.
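
One way to start investigating is to reduce the panic output to its call chain and timestamp. The following is a minimal sketch, not an official tool: the function name `summarize_panic` and the regular expressions are assumptions made for illustration, and the input is an excerpt of the trace quoted above.

```python
import re
from datetime import datetime, timezone

# Excerpt of the panic output quoted in the first post.
PANIC_TEXT = """\
current process = 25976 (python3.9)
time = 1700943731
KDB: stack backtrace:
vm_radix_lookup_ge() at vm_radix_lookup_ge+0x104/frame 0xfffffe0091794670
kern_proc_vmmap_resident() at kern_proc_vmmap_resident+0x12b/frame 0xfffffe00917946e0
elf64_coredump() at elf64_coredump+0x3e8/frame 0xfffffe0091794990
sigexit() at sigexit+0xbe0/frame 0xfffffe0091794e30
"""

# Hypothetical patterns for the KDB frame lines and the "time = <epoch>" line.
FRAME_RE = re.compile(r"^(\w+)\(\) at \1\+0x[0-9a-f]+/frame 0x[0-9a-f]+$", re.M)
TIME_RE = re.compile(r"^time = (\d+)$", re.M)


def summarize_panic(text):
    """Return (function names innermost first, panic time in UTC or None)."""
    frames = FRAME_RE.findall(text)
    match = TIME_RE.search(text)
    when = None
    if match:
        when = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
    return frames, when


frames, when = summarize_panic(PANIC_TEXT)
print(when.isoformat())       # panic time as an ISO 8601 UTC timestamp
print(" <- ".join(frames))    # call chain, innermost frame first
```

Read this way, the trace shows the fault in `vm_radix_lookup_ge()` being reached through `sigexit()` and `elf64_coredump()`, i.e. while the kernel was writing a core dump for a python3.9 process that was already being terminated by a signal.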

Did they start immediately after upgrading?  Was anything else changed?  Have you rolled back to the previous version and do the crashes go away?

My first instinct with random crashes is to test the hardware.

November 26, 2023, 10:16:12 PM #2 Last Edit: November 26, 2023, 10:17:49 PM by neckcen
Thank you for the reply.

QuoteDid they start immediately after upgrading?  Was anything else changed?

The first crash occurred on Thursday morning (7 am -ish). I have a cron job set to apply upgrades at 3.30 am and, according to the logs, the update to 23.7.9 was applied that night. I have not changed anything on the firewall recently. Since posting my initial message, the firewall has crashed 4 more times, which is fairly annoying to say the least.

QuoteHave you rolled back to the previous version and do the crashes go away?

I have not found the option to roll back the version, how would I proceed?

QuoteMy first instinct with random crashes is to test the hardware.

Same here, so this weekend I ran memtest for a few hours and the results were clean. SMART data for the NVMe drive is also clean. Are there any other tests you'd recommend?

I also tried disabling both plugins and booting with the previous kernel, to no avail.

Which hardware is that on?

23.7.8 and 23.7.9 use the same kernel and base. Individual packages can be downgraded with opnsense-revert -r 23.7.8 <package>, e. g. opnsense-revert -r 23.7.8 opnsense.

Cheers
Maurice

QuoteWhich hardware is that on?

A DEC2750, more specifically a unit from 2022 with 3x 1Gb Ethernet (not the most recent v2 version with 3x 2.5 Gb). I believe it uses the Netboard A10 Gen3 internally.

Quote23.7.8 and 23.7.9 use the same kernel and base. Individual packages can be downgraded with opnsense-revert -r 23.7.8 <package>, e. g. opnsense-revert -r 23.7.8 opnsense.

Thank you kindly for the command. It appears I was wrong (and probably should have checked that earlier): the crashes did not start with version 23.7.9, as the logs below show. Unsurprisingly, reverting the update did not help.


root@OPNsense:~ # grep -h "panic: general protection fault" /var/log/system/*.log | sort | uniq
<13>1 2023-10-31T18:30:57+00:00 OPNsense kernel - - [meta sequenceId="9"] panic: general protection fault
<13>1 2023-11-12T04:35:57+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-12T07:51:12+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-13T01:06:57+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-16T07:10:56+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-16T14:58:57+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-18T03:21:56+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-18T05:21:57+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-18T07:09:58+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-18T19:41:56+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-19T09:31:56+00:00 OPNsense kernel - - [meta sequenceId="5"] panic: general protection fault
<13>1 2023-11-20T02:59:56+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-20T05:56:57+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-21T12:12:56+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-21T20:01:56+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-23T06:02:57+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-23T20:34:57+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-24T08:22:58+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-24T10:51:57+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-25T03:59:58+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-25T04:30:20+00:00 OPNsense kernel - - [meta sequenceId="13"] panic: general protection fault
<13>1 2023-11-25T07:04:55+00:00 OPNsense kernel - - [meta sequenceId="5"] panic: general protection fault
<13>1 2023-11-25T08:52:01+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-25T15:43:14+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-25T18:54:55+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-25T20:23:18+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-25T21:46:43+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-25T22:31:56+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-26T06:35:55+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-26T08:56:54+00:00 OPNsense kernel - - [meta sequenceId="5"] panic: general protection fault
<13>1 2023-11-26T09:46:13+00:00 OPNsense kernel - - [meta sequenceId="13"] panic: general protection fault
<13>1 2023-11-26T13:46:54+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-26T17:12:54+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-27T05:05:56+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-27T10:28:05+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
<13>1 2023-11-27T11:21:56+00:00 OPNsense kernel - - [meta sequenceId="13"] panic: general protection fault
<13>1 2023-11-27T13:12:55+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault
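
To see whether the crashes are becoming more frequent, the log lines above can be grouped by day. This is a minimal sketch under stated assumptions: the helper name `panics_per_day` is hypothetical, the sample list holds only the first three lines of the output above, and the timestamp is assumed to be the second whitespace-separated field, as it is in these lines.

```python
from collections import Counter

# First three lines of the grep output above, used as sample input.
LOG_LINES = [
    '<13>1 2023-10-31T18:30:57+00:00 OPNsense kernel - - [meta sequenceId="9"] panic: general protection fault',
    '<13>1 2023-11-12T04:35:57+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault',
    '<13>1 2023-11-12T07:51:12+00:00 OPNsense kernel - - [meta sequenceId="12"] panic: general protection fault',
]


def panics_per_day(lines):
    """Count panic lines per calendar day (date taken from the log timestamp)."""
    counts = Counter()
    for line in lines:
        if "panic: general protection fault" not in line:
            continue  # ignore unrelated log lines
        fields = line.split()
        if len(fields) < 2:
            continue  # malformed line without a timestamp field
        day = fields[1].split("T")[0]  # "2023-11-12T04:35:57+00:00" -> "2023-11-12"
        counts[day] += 1
    return dict(sorted(counts.items()))


for day, count in panics_per_day(LOG_LINES).items():
    print(day, count)
```

On the firewall itself, the same function could be fed the lines of the files under /var/log/system/ instead of the sample list.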

It might be heat related. Try running some CPU stress tests, maybe something like Prime95?

I don't recall whether the UBCD has anything for CPU testing.

From what I could see, the CPU never went above 40°C with some light load (still way higher than my typical usage which is below 10%).

I also tried a factory reset (aka config wipe) and reinstalling base + kernel. No luck.

I have reached out to Deciso as the unit is still under warranty, and they advised trying a fresh install before sending it back (a quick, polite and professional reply too). So far, the firewall has been stable for 24h running stock 23.7, a big improvement over 5 crashes in a day. I will try upgrading and re-applying my configuration to see if the crashes come back.