Daily Kernel page faults

Started by inittab, January 26, 2023, 09:08:59 PM

January 26, 2023, 09:08:59 PM Last Edit: February 02, 2023, 01:52:05 PM by inittab
Hello, I'm running OPNsense 22.7.11 within Proxmox 7.3-4 on a Protectli VP2420 - 4x 2.5G Port Intel® Celeron J6412.

I've been getting daily page faults that reboot the OPNsense VM. Neither my Pi-hole LXC on Proxmox nor Proxmox itself crashes or shows any issues, just the OPNsense VM.

I've attached the output from the latest crash. Any ideas what might be going on here? I'm not doing anything out of the ordinary during crashes, and they don't happen under high load or anything like that; during this latest crash I just had a YouTube video going. Any help is appreciated, thanks!

kdb_enter() at kdb_enter+0x37/frame 0xfffffe00d6c1dcf0
vpanic() at vpanic+0x1b0/frame 0xfffffe00d6c1dd40
panic() at panic+0x43/frame 0xfffffe00d6c1dda0
trap_fatal() at trap_fatal+0x385/frame 0xfffffe00d6c1de00
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00d6c1de60
calltrap() at calltrap+0x8/frame 0xfffffe00d6c1de60
--- trap 0xc, rip = 0xffffffff8115a3a6, rsp = 0xfffffe00d6c1df38, rbp = 0x7fffffffc260 ---
handle_ibrs_entry() at handle_ibrs_entry+0x6/frame 0x7fffffffc260

Maybe you want to disable IBRS. Set "hw.ibrs_disable" to "1".

You could also have a word with the hardware manufacturer. This looks fishy.
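A minimal sketch for checking after a reboot that the tunable actually took effect, assuming shell access to the firewall and that the value is read with `sysctl -n hw.ibrs_disable`. The helper names `parse_sysctl_value` and `ibrs_disabled` are hypothetical and only for illustration.

```python
import subprocess


def parse_sysctl_value(output):
    """Turn the text printed by `sysctl -n <name>` into an integer."""
    return int(output.strip())


def read_ibrs_disable():
    """Run sysctl on the firewall itself (assumes a FreeBSD-based shell)."""
    result = subprocess.run(
        ["sysctl", "-n", "hw.ibrs_disable"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ibrs_disabled(read=read_ibrs_disable):
    """Return True when hw.ibrs_disable reads back as 1."""
    return parse_sysctl_value(read()) == 1
```

Calling `ibrs_disabled()` on the box after the reboot should return True if the setting was applied under the name given above. The `read` parameter exists only so the parsing can be tried without a FreeBSD system.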


Cheers,
Franco

Thanks, I've set disable_ibrs in the tunables and will see how this goes. If that doesn't help, I'll reach out to Protectli.

Looks like the same issue: I set the tunable for disable_ibrs, rebooted, and it's still page faulting at the IBRS entry point. The new log was too big to attach, so it's at https://pastebin.com/hTTEJjUm


I'll get in touch with Protectli. Any other ideas?

I have a VP2420 as well, which had been running well on 22, but a day or two after upgrading to 23 it also had a kernel panic and rebooted (I didn't notice until later). I've attached a log file as well, though the panic itself looks different.

I submitted it through the reporter as well if that helps - thanks in advance for any ideas!

vlan02: link state changed to UP
vlan04: link state changed to UP
vlan03: link state changed to UP
vlan01: link state changed to UP
igc1: link state changed to UP
igc2: promiscuous mode enabled
panic: vm_page_free_prep: page 0xfffffe0008d96490 has references
cpuid = 2
time = 1674903900
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00c7604a90
vpanic() at vpanic+0x17f/frame 0xfffffe00c7604ae0
panic() at panic+0x43/frame 0xfffffe00c7604b40
vm_page_free_prep() at vm_page_free_prep+0x15b/frame 0xfffffe00c7604b60
vm_page_free_toq() at vm_page_free_toq+0x12/frame 0xfffffe00c7604b90
pmap_remove_pte() at pmap_remove_pte+0x1c5/frame 0xfffffe00c7604bf0
pmap_remove_ptes() at pmap_remove_ptes+0xdc/frame 0xfffffe00c7604c50
pmap_remove() at pmap_remove+0x41e/frame 0xfffffe00c7604cd0
vm_map_delete() at vm_map_delete+0x25e/frame 0xfffffe00c7604d30
vm_map_remove() at vm_map_remove+0x9e/frame 0xfffffe00c7604d60
vmspace_exit() at vmspace_exit+0xaa/frame 0xfffffe00c7604d90
exit1() at exit1+0x57f/frame 0xfffffe00c7604df0
sys_sys_exit() at sys_sys_exit+0xd/frame 0xfffffe00c7604e00
amd64_syscall() at amd64_syscall+0x10c/frame 0xfffffe00c7604f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00c7604f30
--- syscall (1, FreeBSD ELF64, sys_sys_exit), rip = 0x80078c0da, rsp = 0x7fffffffeb88, rbp = 0x7fffffffeba0 ---
KDB: enter: panic
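Since the panics in this thread look different from one another, it can help to reduce each log to its call chain and timestamp before comparing them. The following is a rough sketch, assuming the log uses the `func() at func+0x.../frame 0x...` and `time = <epoch>` layout seen above; `frame_functions` and `panic_time` are hypothetical helper names.

```python
import re
from datetime import datetime, timezone

# Matches lines such as "vpanic() at vpanic+0x17f/frame 0xfffffe00c7604ae0".
FRAME_RE = re.compile(
    r"^(?P<func>\w+)\(\) at (?P=func)\+0x[0-9a-f]+/frame 0x[0-9a-f]+$"
)
TIME_RE = re.compile(r"^time = (\d+)$", re.MULTILINE)


def frame_functions(log):
    """Return the function names of all backtrace frames, innermost first."""
    funcs = []
    for line in log.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            funcs.append(match.group("func"))
    return funcs


def panic_time(log):
    """Return the panic timestamp as a UTC datetime, or None if absent."""
    match = TIME_RE.search(log)
    if match is None:
        return None
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
```

Applied to the log above, `panic_time` turns `time = 1674903900` into 2023-01-28 11:05:00 UTC, and `frame_functions` lists the chain from `db_trace_self_wrapper` down to `fast_syscall_common`.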

So I found out what was causing this, and as usual it was user error, and a rather stupid one at that.
I had misremembered how much memory I ordered my VP2420 with, thinking I had ordered 16 GB when I had only ordered 8 GB.

I had my OPNsense VM set to 8 GB of RAM. Proxmox did not complain about this at all and happily set the VM to 8 GB of RAM. I have dropped the OPNsense VM down to 6 GB of RAM and have not had a kernel fault in 2 days, so it looks like this is resolved.
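The arithmetic behind that change can be sketched as below. This is only an illustration: the function name `vm_memory_headroom_gib` is hypothetical, and the 1 GiB reserved for the hypervisor itself is an assumed figure, not something measured on this box. The memory used by the Pi-hole container is not given here, so it is left out.

```python
def vm_memory_headroom_gib(host_ram_gib, guest_ram_gib, host_reserve_gib=1.0):
    """Memory left on the host after all guests get their full allocation.

    host_reserve_gib is an assumed allowance for the hypervisor itself.
    A negative result means the guests are overcommitted.
    """
    return host_ram_gib - host_reserve_gib - sum(guest_ram_gib)


# 8 GB host with an 8 GB guest: overcommitted.
before = vm_memory_headroom_gib(8, [8])
# 8 GB host with a 6 GB guest: some room left for the host.
after = vm_memory_headroom_gib(8, [6])
print(before, after)
```

Under these assumptions the original setup comes out negative and the 6 GB setup leaves a small positive margin, before counting any other guests.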

That's an interesting turn of events, thanks for sharing the solution. :)


Cheers,
Franco

Looks like I may have spoken too soon: I ended up getting another page fault last night, although the faulting module has now changed. Any thoughts?


db:0:kdb.enter.default>  bt
Tracing pid 41388 tid 101929 td 0xfffffe00bd063ac0
kdb_enter() at kdb_enter+0x37/frame 0xffffffff81f5f4c0
vpanic() at vpanic+0x1b0/frame 0xffffffff81f5f510
panic() at panic+0x43/frame 0xffffffff81f5f570
dblfault_handler() at dblfault_handler+0x1ce/frame 0xffffffff81f5f630
Xdblfault() at Xdblfault+0xd7/frame 0xffffffff81f5f630
--- trap 0x17, rip = 0xffffffff811352c2, rsp = 0x80107a9ce, rbp = 0x7fffffffe380 ---
Xtimerint_pti() at Xtimerint_pti+0x2/frame 0x7fffffffe380
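The trap numbers in the two traces from this VM differ (0xc earlier, 0x17 here). A small sketch for decoding them, assuming the FreeBSD amd64 trap numbering in which 9 is a general protection fault, 12 a page fault and 23 a double fault; the table is deliberately partial and `trap_name` is a hypothetical helper.

```python
import re

# Partial table, assuming FreeBSD amd64 trap numbers; other values are omitted.
TRAP_NAMES = {
    9: "general protection fault",
    12: "page fault",
    23: "double fault",
}

TRAP_RE = re.compile(r"--- trap (0x[0-9a-fA-F]+|\d+),")


def trap_name(trace_line):
    """Return (number, name) for a '--- trap N, rip = ...' line, or None."""
    match = TRAP_RE.search(trace_line)
    if match is None:
        return None
    number = int(match.group(1), 0)  # base 0 accepts both "0xc" and "12"
    return number, TRAP_NAMES.get(number, "unknown")
```

Read this way, the first trace is trap 12 and this one is trap 23, so by that numbering the latest crash is a double fault and not an ordinary page fault.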

Looks too generic as well. I'm not sure, but Proxmox as an added complication could also be a cause of this.


Cheers,
Franco

Yeah, I'll lose my Pi-hole VM, but I've considered moving OPNsense to bare metal to rule that out. I'll work on getting that done.