Multiple OPNsense 24.7.11_2 Crashes on Protectli Vault Pro VP2420-4

Started by BertQuodge, January 11, 2025, 09:32:34 PM

Hi

I purchased a Protectli Vault Pro VP2420-4, Crucial 32GB DDR4-3200 CL22 RAM and an Integral 512GB M.2 SATA III 2280 SSD to run OPNsense in April 2024. Since installation the system had been rock solid, with no crashes, until I upgraded to OPNsense 24.7.11_2 in December 2024. Since then I've had three OPNsense crashes, where the system reboots and recovers by itself. The crash reporter captured all three, and all three were page faults. I've removed the memory and SSD from the Protectli and re-seated them, but the crashes still occur. The Protectli is UPS-fed, and no other devices on the same UPS have reported any power issues. It sits in a cool environment and isn't near sources of EMI. The firewall isn't driven very hard; it's for home use with NUT, BGP and the DHCP server, and I only use two ports on the Protectli: WAN access and a trunk for my home network. Interestingly, all three crashes have occurred after a few days of uptime while watching videos online, two with YouTube and one with the BBC.

The OPNsense crashes are receiving a poor wife acceptance factor, so I'd appreciate any advice on how to stop The Great British Bake Off from being interrupted ;-)

The kernel panic is shown below:

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address   = 0x0
fault code      = supervisor write data, page not present
instruction pointer   = 0x20:0xffffffff82190d9c
stack pointer           = 0x28:0xffffffff82e54e00
frame pointer           = 0x28:0xffffffff82e54e30
code segment      = base 0x0, limit 0xfffff, type 0x1b
         = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags   = interrupt enabled, resume, IOPL = 0
current process      = 6 (pf purge)
rdi: fffff801e8d47d10 rsi: fffff801e8d47d10 rdx: 0000000095089b03
rcx: 0000000000000000  r8: 0000000022f0d653  r9: 0000000000000000
rax: 0000000000000000 rbx: fffff801e8d68dc0 rbp: ffffffff82e54e30
r10: 0000000000000000 r11: 00000000b9f5a6a9 r12: fffffe0106bdc000
r13: 00000000000877df r14: fffff801e8d47d10 r15: fffff80001b20000
trap number      = 12
panic: page fault
cpuid = 1
time = 1736625558
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffffff82e54af0
vpanic() at vpanic+0x131/frame 0xffffffff82e54c20
panic() at panic+0x43/frame 0xffffffff82e54c80
trap_fatal() at trap_fatal+0x40b/frame 0xffffffff82e54ce0
trap_pfault() at trap_pfault+0x46/frame 0xffffffff82e54d30
calltrap() at calltrap+0x8/frame 0xffffffff82e54d30
--- trap 0xc, rip = 0xffffffff82190d9c, rsp = 0xffffffff82e54e00, rbp = 0xffffffff82e54e30 ---
pf_detach_state() at pf_detach_state+0x5fc/frame 0xffffffff82e54e30
pf_unlink_state() at pf_unlink_state+0x290/frame 0xffffffff82e54e70
pf_purge_expired_states() at pf_purge_expired_states+0x188/frame 0xffffffff82e54ec0
pf_purge_thread() at pf_purge_thread+0x13b/frame 0xffffffff82e54ef0
fork_exit() at fork_exit+0x7f/frame 0xffffffff82e54f30
fork_trampoline() at fork_trampoline+0xe/frame 0xffffffff82e54f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
panic: page fault
FreeBSD 14.1-RELEASE-p6 stable/24.7-n267979-0d692990122 SMP

EDIT: I forgot to mention, I ran memtest64 for a few hours but no errors were found.
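Since the backtrace ends up inside pf's state handling (pf_purge_expired_states and friends), I'm also keeping an eye on the state table. For anyone following along, here's a rough way to snapshot the state table from the shell (assuming the stock pfctl that ships with OPNsense/FreeBSD; check the man page before relying on it):

```shell
# Show pf status counters, including the current number of state entries
pfctl -si

# Show configured hard limits (e.g. the maximum number of states)
pfctl -sm

# Count the current state entries directly
pfctl -ss | wc -l
```
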

Thanks!

Just had another OPNsense crash, just over a day after the last, right in the middle of watching a film with the family. The wife acceptance factor has dropped even further. OPNsense recovered and rebooted itself, though it took a while.

The RAM and SSD have been re-seated again, just in case. Memtest64 shows no issues.

I use LibreNMS to monitor my home equipment, and OPNsense had plenty of free memory and disk space and wasn't very warm at the time of the crash. The OPNsense box was near(ish) to a WiFi AP, but I moved it a few days ago in case EMI was an issue; that hasn't helped. OPNsense seemed fine until I upgraded to 24.7.11, though that could be a coincidence. I've just run "opnsense-revert -r 24.7.10 opnsense" and rebooted to see if this helps. I'm not sure whether I need to run more commands to fully revert to 24.7.10. Any suggestions would be appreciated, or the number of a good divorce lawyer ;-)
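For reference, my understanding (please correct me if I'm wrong) is that opnsense-revert only swaps individual packages, so the base system and kernel would need reverting separately via opnsense-update. Roughly, something like this (treat the flags as a sketch and check opnsense-update(8) first):

```shell
# Revert the opnsense package itself to 24.7.10
opnsense-revert -r 24.7.10 opnsense

# Revert base and kernel too (-b base, -k kernel, -r target release)
opnsense-update -bkr 24.7.10

# Reboot so the older kernel is actually the one running
shutdown -r now
```
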


Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address   = 0x37a891063000
fault code      = supervisor read data, page not present
instruction pointer   = 0x20:0xffffffff8109fa60
stack pointer           = 0x28:0xfffffe0037992430
frame pointer           = 0x28:0xfffffe0037992430
code segment      = base 0x0, limit 0xfffff, type 0x1b
         = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags   = interrupt enabled, resume, IOPL = 0
current process      = 0 (if_io_tqg_1)
rdi: 000037a891063000 rsi: fffffe0037992558 rdx: 0000000000000028

rcx: 0000000000098a7b  r8: 00000000000000ac  r9: 00000000a10c11ac
rax: 0000000000000000 rbx: fffff80001a65000 rbp: fffffe0037992430
r10: 00000000c7ae7521 r11: 0000000000000014 r12: fffffe0037992558
r13: 000037a891063000 r14: fffff8000fcc7300 r15: fffffe0106bdc000
trap number      = 12
panic: page fault
cpuid = 1
time = 1736709176
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0037992120
vpanic() at vpanic+0x131/frame 0xfffffe0037992250
panic() at panic+0x43/frame 0xfffffe00379922b0
trap_fatal() at trap_fatal+0x40b/frame 0xfffffe0037992310
trap_pfault() at trap_pfault+0x46/frame 0xfffffe0037992360
calltrap() at calltrap+0x8/frame 0xfffffe0037992360
--- trap 0xc, rip = 0xffffffff8109fa60, rsp = 0xfffffe0037992430, rbp = 0xfffffe0037992430 ---
memcmp() at memcmp+0x110/frame 0xfffffe0037992430
pf_find_state() at pf_find_state+0xc0/frame 0xfffffe0037992480
pf_test_state_icmp() at pf_test_state_icmp+0x298/frame 0xfffffe00379925e0
pf_test() at pf_test+0x112c/frame 0xfffffe0037992790
pf_check_in() at pf_check_in+0x27/frame 0xfffffe00379927b0
pfil_mbuf_in() at pfil_mbuf_in+0x38/frame 0xfffffe00379927e0
ip_input() at ip_input+0x5d5/frame 0xfffffe0037992840
netisr_dispatch_src() at netisr_dispatch_src+0x9e/frame 0xfffffe0037992890
ether_demux() at ether_demux+0x149/frame 0xfffffe00379928c0
ether_nh_input() at ether_nh_input+0x36a/frame 0xfffffe0037992920
netisr_dispatch_src() at netisr_dispatch_src+0x9e/frame 0xfffffe0037992970
ether_input() at ether_input+0x56/frame 0xfffffe00379929c0
ether_demux() at ether_demux+0x8e/frame 0xfffffe00379929f0
ng_ether_rcv_upper() at ng_ether_rcv_upper+0x8c/frame 0xfffffe0037992a10
ng_apply_item() at ng_apply_item+0x13e/frame 0xfffffe0037992ab0
ng_snd_item() at ng_snd_item+0x274/frame 0xfffffe0037992af0
ng_apply_item() at ng_apply_item+0x13e/frame 0xfffffe0037992b90
ng_snd_item() at ng_snd_item+0x274/frame 0xfffffe0037992bd0
ng_ether_input() at ng_ether_input+0x4c/frame 0xfffffe0037992c00
ether_nh_input() at ether_nh_input+0x1dc/frame 0xfffffe0037992c60
netisr_dispatch_src() at netisr_dispatch_src+0x9e/frame 0xfffffe0037992cb0
ether_input() at ether_input+0x56/frame 0xfffffe0037992d00
iflib_rxeof() at iflib_rxeof+0xc0e/frame 0xfffffe0037992e00
_task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe0037992e40
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x14e/frame 0xfffffe0037992ec0
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xc2/frame 0xfffffe0037992ef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe0037992f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0037992f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
panic: page fault
FreeBSD 14.1-RELEASE-p6 stable/24.7-n267979-0d692990122 SMP

For what it's worth, I also have a VP2420-4 that's been running 24.7.11_2 for the past two weeks and I haven't had any stability issues. Unfortunately I don't have any suggestions for you, but just mentioning this to rule out any common issue between 24.7.11_2 and this Protectli box.

Quote from: funkyd on January 13, 2025, 06:14:55 AMFor what it's worth, I also have a VP2420-4 that's been running 24.7.11_2 for the past two weeks and I haven't had any stability issues. Unfortunately I don't have any suggestions for you, but just mentioning this to rule out any common issue between 24.7.11_2 and this Protectli box.
Hi

Thanks for posting; it's great to know others aren't having issues with the VP2420-4, so it must be something specific to my setup.

Quote from: BertQuodge on January 12, 2025, 08:54:03 PMJust had another OPNsense crash, just over a day after the last, right in the middle of watching a film with the family. The wife acceptance factor has dropped even further. OPNsense recovered and rebooted itself, though it took a while.

The RAM and SSD have been re-seated again, just in case. Memtest64 shows no issues.

I use LibreNMS to monitor my home equipment, and OPNsense had plenty of free memory and disk space and wasn't very warm at the time of the crash. The OPNsense box was near(ish) to a WiFi AP, but I moved it a few days ago in case EMI was an issue; that hasn't helped. OPNsense seemed fine until I upgraded to 24.7.11, though that could be a coincidence. I've just run "opnsense-revert -r 24.7.10 opnsense" and rebooted to see if this helps. I'm not sure whether I need to run more commands to fully revert to 24.7.10. Any suggestions would be appreciated, or the number of a good divorce lawyer ;-)

I am also using OPNsense on a Protectli system (though not the exact same hardware as yours; mine is a Protectli Vault FW6A), and I also experienced random crash/reboots like you describe. In my case, I updated the kernel via "opnsense-update -fk" to get a newer, fixed one, and that stopped the random crash/reboot behaviour for me.

I've recently updated to OPNsense 24.7.12-amd64, and I hope the behaviour remains fixed.

I'm posting this in the hope of letting you know that this is probably not a hardware problem on your end.
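In case it's useful, roughly what I did was the following (the flags as I remember them: -f force, -k kernel only; double-check against opnsense-update(8) before running):

```shell
# Check which kernel is currently running before and after
uname -a

# Force-reinstall just the kernel package at the current release
opnsense-update -fk

# Reboot so the refreshed kernel takes effect
shutdown -r now
```
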