Random daily Kernel Panics since 23.7.1

Started by kamikazedan, August 25, 2023, 07:05:21 PM

August 25, 2023, 07:05:21 PM Last Edit: August 28, 2023, 08:20:56 PM by kamikazedan
Anyone else having any issues since the latest updates?
System was rock solid stable, now it just reboots randomly during regular use.
It's running in a VM under UNRAID.
I've passed through 2 cores from the Intel® Pentium® Silver N6005 and 8GB RAM.

Dump header from device: /dev/vtbd0p3
  Architecture: amd64
  Architecture Version: 4
  Dump Length: 101376
  Blocksize: 512
  Compression: none
  Dumptime: 2023-08-25 17:45:25 +0100
  Hostname: OPNsense
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 13.2-RELEASE-p2 stable/23.7-n254761-4b4f06e3731 SMP
  Panic String: privileged instruction fault
  Dump Parity: 4239370554
  Bounds: 0
  Dump Status: good


[fib_algo] inet.0 (bsearch4#28) rebuild_fd_flm: switching algo to radix4_lockless
kernel trap 1 with interrupts disabled


Fatal trap 1: privileged instruction fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer = 0x20:0xffffffff81224720
stack pointer         = 0x28:0xfffffe001079d458
frame pointer         = 0x28:0xfffffe001079d540
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = resume, IOPL = 0
current process = 0 (if_io_tqg_1)
trap number = 1
panic: privileged instruction fault
cpuid = 1
time = 1692981925
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe001079d270
vpanic() at vpanic+0x151/frame 0xfffffe001079d2c0
panic() at panic+0x43/frame 0xfffffe001079d320
trap_fatal() at trap_fatal+0x387/frame 0xfffffe001079d380
calltrap() at calltrap+0x8/frame 0xfffffe001079d380
--- trap 0x1, rip = 0xffffffff81224720, rsp = 0xfffffe001079d458, rbp = 0xfffffe001079d540 ---
lapic_handle_intr() at lapic_handle_intr/frame 0xfffffe001079d540
ng_pppoe_rcvdata() at ng_pppoe_rcvdata+0x339/frame 0xfffffe001079d5d0
ng_apply_item() at ng_apply_item+0x2bf/frame 0xfffffe001079d660
ng_snd_item() at ng_snd_item+0x28e/frame 0xfffffe001079d6a0
ng_apply_item() at ng_apply_item+0x2bf/frame 0xfffffe001079d730
ng_snd_item() at ng_snd_item+0x28e/frame 0xfffffe001079d770
ng_ppp_link_xmit() at ng_ppp_link_xmit+0x124/frame 0xfffffe001079d7c0
ng_apply_item() at ng_apply_item+0x2bf/frame 0xfffffe001079d850
ng_snd_item() at ng_snd_item+0x28e/frame 0xfffffe001079d890
ng_apply_item() at ng_apply_item+0x2bf/frame 0xfffffe001079d920
ng_snd_item() at ng_snd_item+0x28e/frame 0xfffffe001079d960
ng_iface_send() at ng_iface_send+0xdf/frame 0xfffffe001079d9e0
ng_iface_output() at ng_iface_output+0xe3/frame 0xfffffe001079da20
ip_tryforward() at ip_tryforward+0x4f7/frame 0xfffffe001079dae0
ip_input() at ip_input+0x724/frame 0xfffffe001079db70
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe001079dbc0
ether_demux() at ether_demux+0x159/frame 0xfffffe001079dbf0
ether_nh_input() at ether_nh_input+0x36b/frame 0xfffffe001079dc50
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe001079dca0
ether_input() at ether_input+0x69/frame 0xfffffe001079dd00
iflib_rxeof() at iflib_rxeof+0xbcb/frame 0xfffffe001079de00
_task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe001079de40
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x15d/frame 0xfffffe001079dec0
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xc3/frame 0xfffffe001079def0
fork_exit() at fork_exit+0x7e/frame 0xfffffe001079df30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe001079df30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic

From what I can see, an interrupt killed your PPPoE processing, but I can't find any reference to such a problem in FreeBSD 13.2. It may be possible to tweak this on your UNRAID host, but I am not sure how.


Cheers,
Franco
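
The reading of the first backtrace above can be reproduced mechanically. The following is only a minimal sketch, not an official FreeBSD tool: the helper name `frames_after_trap` and the regular expression are assumptions made for illustration. It picks out the frames listed directly after the first `--- trap` marker in a KDB backtrace, which in the first dump are `lapic_handle_intr` followed by `ng_pppoe_rcvdata`.

```python
import re

# Matches a KDB frame line such as
# "ng_pppoe_rcvdata() at ng_pppoe_rcvdata+0x339/frame 0xfffffe001079d5d0"
FRAME_RE = re.compile(r"^(\w+)\(\) at ")


def frames_after_trap(backtrace: str, count: int = 2) -> list[str]:
    """Return up to `count` function names listed after the first '--- trap' marker.

    Hypothetical helper: it stops early if a second trap marker is reached.
    """
    names: list[str] = []
    seen_trap = False
    for raw in backtrace.splitlines():
        line = raw.strip()
        if line.startswith("--- trap"):
            if seen_trap:
                break  # a second marker starts a different trap context
            seen_trap = True
            continue
        if not seen_trap:
            continue  # still in the panic/trap handler frames
        match = FRAME_RE.match(line)
        if match:
            names.append(match.group(1))
            if len(names) == count:
                break
    return names


# Excerpt copied from the first dump in this thread.
SAMPLE = """\
trap_fatal() at trap_fatal+0x387/frame 0xfffffe001079d380
calltrap() at calltrap+0x8/frame 0xfffffe001079d380
--- trap 0x1, rip = 0xffffffff81224720, rsp = 0xfffffe001079d458, rbp = 0xfffffe001079d540 ---
lapic_handle_intr() at lapic_handle_intr/frame 0xfffffe001079d540
ng_pppoe_rcvdata() at ng_pppoe_rcvdata+0x339/frame 0xfffffe001079d5d0
ng_apply_item() at ng_apply_item+0x2bf/frame 0xfffffe001079d660
"""

print(frames_after_trap(SAMPLE))
```

Run against the excerpt, this lists the interrupt handler first and the PPPoE receive function second, which matches the interpretation given in the reply.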

Thank you for the response.
Today I had a different crash with the following:

Dump header from device: /dev/vtbd0p3
  Architecture: amd64
  Architecture Version: 4
  Dump Length: 72704
  Blocksize: 512
  Compression: none
  Dumptime: 2023-08-27 09:57:08 +0100
  Hostname: OPNsense
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 13.2-RELEASE-p2 stable/23.7-n254761-4b4f06e3731 SMP
  Panic String: double fault
  Dump Parity: 3156452410
  Bounds: 1
  Dump Status: good



Fatal double fault
rip 0xffffffff81115a76 rsp 0xfffffe00107a9dd0 rbp 0xfffffe00107a9dd0
rax 0x1063525043c96 rdx 0x1063500000000 rbx 0xfffff80001a60000
rcx 0 rsi 0 rdi 0xfffffe00107a9e88
r8 0xfad9c0 r9 0x80000000 r10 0xffffffff
r11 0x1 r12 0xfffff80001a60028 r13 0
r14 0x1063525043c96 r15 0x2f rflags 0x10246
cs 0x20 ss 0x28 ds 0x3b es 0x3b fs 0x13 gs 0x1b
fsbase 0x35ad7a465120 gsbase 0xffffffff82c11000 kgsbase 0
cpuid = 1; apic id = 01
timeout stopping cpus
panic: double fault
cpuid = 1
time = 1693126628
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0011e4edb0
vpanic() at vpanic+0x151/frame 0xfffffe0011e4ee00
panic() at panic+0x43/frame 0xfffffe0011e4ee60
dblfault_handler() at dblfault_handler+0x1ce/frame 0xfffffe0011e4ef20
Xdblfault() at Xdblfault+0xd7/frame 0xfffffe0011e4ef20
--- trap 0x17, rip = 0xffffffff81115a76, rsp = 0xfffffe00107a9dd0, rbp = 0xfffffe00107a9dd0 ---
acpi_cpu_c1() at acpi_cpu_c1+0x6/frame 0xfffffe00107a9dd0
acpi_cpu_idle() at acpi_cpu_idle+0x2ef/frame 0xfffffe00107a9e10
cpu_idle_acpi() at cpu_idle_acpi+0x48/frame 0xfffffe00107a9e30
cpu_idle() at cpu_idle+0x9f/frame 0xfffffe00107a9e50
sched_idletd() at sched_idletd+0x4e1/frame 0xfffffe00107a9ef0
fork_exit() at fork_exit+0x7e/frame 0xfffffe00107a9f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00107a9f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic

It seems ACPI handling is crashing constantly. Normally I'd suggest updating the BIOS, but in this case it probably only makes sense to look for host settings for the VM that might make it better (I don't know what UNRAID offers).


Cheers,
Franco

August 28, 2023, 01:14:47 PM #4 Last Edit: August 28, 2023, 01:22:07 PM by kamikazedan
Thanks for the help with this.

The VM is set up exactly as you would for FreeBSD (UNRAID has a standard VM template for FreeBSD).

I'm not sure what I could change to help the situation. I recently tried giving the VM all 4 cores on the N6005, which has not helped.

I've run multiple pfSense VMs and OPNsense VMs for years without issues until recently.

Something changed in FreeBSD 13.2, I think, but I don't see a relevant setting in UNRAID. That doesn't mean there isn't one, just that it's not in the GUI.

Last straw to grasp at: maybe a problem with suspend/resume, if applicable?


Cheers,
Franco

And another crash today.

Dump header from device: /dev/vtbd0p3
  Architecture: amd64
  Architecture Version: 4
  Dump Length: 73728
  Blocksize: 512
  Compression: none
  Dumptime: 2023-08-28 19:17:15 +0100
  Hostname: OPNsense
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 13.2-RELEASE-p2 stable/23.7-n254761-4b4f06e3731 SMP
  Panic String: page fault
  Dump Parity: 2482483055
  Bounds: 2
  Dump Status: good


Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address = 0x0
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff8122470c
stack pointer         = 0x28:0xfffffe001079d420
frame pointer         = 0x28:0xfffffe001079d600
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = resume, IOPL = 0
current process = 0 (if_io_tqg_1)
trap number = 12
panic: page fault
cpuid = 1
time = 1693246635
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe001079d1e0
vpanic() at vpanic+0x151/frame 0xfffffe001079d230
panic() at panic+0x43/frame 0xfffffe001079d290
trap_fatal() at trap_fatal+0x387/frame 0xfffffe001079d2f0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe001079d350
calltrap() at calltrap+0x8/frame 0xfffffe001079d350
--- trap 0xc, rip = 0xffffffff8122470c, rsp = 0xfffffe001079d420, rbp = 0xfffffe001079d600 ---
native_lapic_set_lvt_triggermode() at native_lapic_set_lvt_triggermode+0xdc/frame 0xfffffe001079d600
calltrap() at calltrap+0x8/frame 0xfffffe001079d600
--- trap 0x1, rip = 0xffffffff81224720, rsp = 0xfffffe001079d6d8, rbp = 0xfffffe001079d7e0 ---
lapic_handle_intr() at lapic_handle_intr/frame 0xfffffe001079d7e0
pf_test_state_udp() at pf_test_state_udp+0x130/frame 0xfffffe001079d850
pf_test() at pf_test+0xc57/frame 0xfffffe001079d9c0
pf_check_in() at pf_check_in+0x25/frame 0xfffffe001079d9e0
pfil_run_hooks() at pfil_run_hooks+0x97/frame 0xfffffe001079da20
ip_tryforward() at ip_tryforward+0x181/frame 0xfffffe001079dae0
ip_input() at ip_input+0x724/frame 0xfffffe001079db70
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe001079dbc0
ether_demux() at ether_demux+0x159/frame 0xfffffe001079dbf0
ether_nh_input() at ether_nh_input+0x36b/frame 0xfffffe001079dc50
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe001079dca0
ether_input() at ether_input+0x69/frame 0xfffffe001079dd00
iflib_rxeof() at iflib_rxeof+0xbcb/frame 0xfffffe001079de00
_task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe001079de40
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x15d/frame 0xfffffe001079dec0
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xc3/frame 0xfffffe001079def0
fork_exit() at fork_exit+0x7e/frame 0xfffffe001079df30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe001079df30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
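
The three dump headers posted in this thread can be compared side by side with a short script. This is a rough sketch under stated assumptions: the function names `parse_dump_header` and `summarize` are hypothetical, and the headers are trimmed copies of the ones above rather than output read from a dump device.

```python
def parse_dump_header(text: str) -> dict[str, str]:
    """Parse 'Key: value' lines from a text dump header into a dict."""
    fields: dict[str, str] = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def summarize(headers: list[str]) -> list[tuple[str, str]]:
    """Return (Dumptime, Panic String) for each header, in the order given."""
    parsed = [parse_dump_header(h) for h in headers]
    return [(p["Dumptime"], p["Panic String"]) for p in parsed]


# Trimmed copies of the three headers posted in this thread.
HEADERS = [
    """Dump header from device: /dev/vtbd0p3
  Dumptime: 2023-08-25 17:45:25 +0100
  Panic String: privileged instruction fault
  Dump Status: good""",
    """Dump header from device: /dev/vtbd0p3
  Dumptime: 2023-08-27 09:57:08 +0100
  Panic String: double fault
  Dump Status: good""",
    """Dump header from device: /dev/vtbd0p3
  Dumptime: 2023-08-28 19:17:15 +0100
  Panic String: page fault
  Dump Status: good""",
]

for dumptime, panic in summarize(HEADERS):
    print(f"{dumptime}  {panic}")
```

It prints one line per crash, which makes it easy to see that each of the three dumps reports a different panic string.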

August 28, 2023, 08:33:37 PM #7 Last Edit: August 28, 2023, 10:24:26 PM by kamikazedan
I've just noticed that the backend is spamming "notice" logs for configd.py.
Is the number of log entries per second it's generating normal?

I'm not sure what to do next. I'm getting daily crashes now.

Notice level logs have nothing to do with crashing the kernel.


Anything from a BSD change in the kernel
   - to a software configuration
   - to a software bug in Unraid
   - to a missing BIOS update (if any) on the HW
   - or a BIOS config change that you need for Unraid
   - or a HW issue
-- really anything is fair game here.

The Unraid and STH forums may be better places to ask or read about issues and/or tweaks required for N6005 chips/appliances.


Two things you could try:

1) run a memtest

2) on a spare drive run OPNsense on bare metal or in Proxmox



If Unraid is a must on N6005 then your best bet is a dedicated HW appliance for the firewall.