Kernel Panics Reboot

Started by furfix, August 29, 2024, 01:02:40 AM

September 19, 2024, 06:00:23 PM #15 Last Edit: September 19, 2024, 06:13:51 PM by franco
Looks like a different panic?

db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0149b4b8b0
vpanic() at vpanic+0x131/frame 0xfffffe0149b4b9e0
panic() at panic+0x43/frame 0xfffffe0149b4ba40
trash_ctor() at trash_ctor+0x53/frame 0xfffffe0149b4ba50
mb_ctor_pack() at mb_ctor_pack+0x3e/frame 0xfffffe0149b4ba90
item_ctor() at item_ctor+0x117/frame 0xfffffe0149b4bae0
m_getm2() at m_getm2+0x1aa/frame 0xfffffe0149b4bb50
m_uiotombuf() at m_uiotombuf+0x6f/frame 0xfffffe0149b4bbe0
uipc_sosend_dgram() at uipc_sosend_dgram+0x170/frame 0xfffffe0149b4bc70
sousrsend() at sousrsend+0x79/frame 0xfffffe0149b4bcd0
kern_sendit() at kern_sendit+0x1bc/frame 0xfffffe0149b4bd60
sendit() at sendit+0x184/frame 0xfffffe0149b4bdb0
sys_sendto() at sys_sendto+0x4d/frame 0xfffffe0149b4be00
amd64_syscall() at amd64_syscall+0x140/frame 0xfffffe0149b4bf30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0149b4bf30

> panic("Memory modified after free %p(%d) val=%lx @ %p\n", mem, size, *p, p);

Yes, well, this seems to point to memory corruption that's going on for whatever reason, apparently in UDP (which would also point to the WireGuard kernel module).

It could still be the same panic: the debug kernel adds checks like this one that panic sooner in order to catch errors earlier, but here, too, the damage was already done.

The question is whether this is inherently caused by hardware that needs to be replaced, or whether the errors go away without using WireGuard. This doesn't seem to be a prevalent issue, but it could still be a code problem.
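For context, the debug kernel's "trash" checking (the trash_ctor frame in the backtrace) works roughly like this: on free, each item is filled with a known junk pattern, and the next time the allocator hands the item out it verifies that the pattern is still intact. A simplified sketch of the idea, not the actual FreeBSD uma_dbg code:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define TRASH_PATTERN 0xdeadc0dedeadc0deULL    /* junk value written on free */

/* On free: overwrite the item so later writes to "freed" memory become visible. */
static void
trash_fill(void *mem, int size)
{
        uint64_t *p = mem;
        int i;

        for (i = 0; i < size / 8; i++)
                p[i] = TRASH_PATTERN;
}

/* On the next allocation: any word that no longer matches the pattern means
   something wrote to the item after it was freed -> panic in the real kernel. */
static void
trash_check(void *mem, int size)
{
        uint64_t *p = mem;
        int i;

        for (i = 0; i < size / 8; i++) {
                if (p[i] != TRASH_PATTERN) {
                        fprintf(stderr, "Memory modified after free %p(%d) val=%llx @ %p\n",
                            mem, size, (unsigned long long)p[i], (void *)&p[i]);
                        abort();        /* stands in for panic() */
                }
        }
}

The val=... in the panic message is simply whatever value the stray write left behind. The corruption itself happened earlier, which is why the backtrace only shows who noticed it (the mbuf allocator during sendto), not who caused it.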


Cheers,
Franco

Quote from: franco on September 19, 2024, 05:40:07 PM
@cgone can you post the panic backtrace too as a reference point?

Here is the backtrace of the last crash. The crashes do not always produce a crash dump.


db:0:kdb.enter.default>  run lockinfo
db:1:lockinfo> show locks
No such command; use "help" to list available commands
db:1:lockinfo>  show alllocks
No such command; use "help" to list available commands
db:1:lockinfo>  show lockedvnods
Locked vnodes
db:0:kdb.enter.default>  show pcpu
cpuid        = 3
dynamic pcpu = 0xfffffe009e97b080
curthread    = 0xfffff8002d0bc740: pid 23521 tid 102231 critnest 1 "Eastpect Main Event"
curpcb       = 0xfffff8002d0bcc60
fpcurthread  = 0xfffff8002d0bc740: pid 23521 "Eastpect Main Event"
idlethread   = 0xfffff80001974000: tid 100006 "idle: cpu3"
self         = 0xffffffff83a13000
curpmap      = 0xfffff801c96ad600
tssp         = 0xffffffff83a13384
rsp0         = 0xfffffe0102b8c000
kcr3         = 0x80000003ae08d4b0
ucr3         = 0x80000003ae08ccb0
scr3         = 0x3ae08ccb0
gs32p        = 0xffffffff83a13404
ldt          = 0xffffffff83a13444
tss          = 0xffffffff83a13434
curvnet      = 0
db:0:kdb.enter.default>  bt
Tracing pid 23521 tid 102231 td 0xfffff8002d0bc740
kdb_enter() at kdb_enter+0x33/frame 0xfffffe0102b8b9e0
panic() at panic+0x43/frame 0xfffffe0102b8ba40
trap_fatal() at trap_fatal+0x40b/frame 0xfffffe0102b8baa0
calltrap() at calltrap+0x8/frame 0xfffffe0102b8baa0
--- trap 0x9, rip = 0xffffffff8108cf63, rsp = 0xfffffe0102b8bb70, rbp = 0xfffffe0102b8bb70 ---
pmap_pvh_remove() at pmap_pvh_remove+0x23/frame 0xfffffe0102b8bb70
pmap_enter() at pmap_enter+0xd1e/frame 0xfffffe0102b8bc50
vm_fault() at vm_fault+0xbb7/frame 0xfffffe0102b8bd70
vm_fault_trap() at vm_fault_trap+0x4d/frame 0xfffffe0102b8bdc0
trap_pfault() at trap_pfault+0x1be/frame 0xfffffe0102b8be10
trap() at trap+0x4ab/frame 0xfffffe0102b8bf30
calltrap() at calltrap+0x8/frame 0xfffffe0102b8bf30
--- trap 0xc, rip = 0x827eed850, rsp = 0x8414947a8, rbp = 0x841494860 ---


My guess is that it is more likely a hardware fault, since the backtrace is often different and in a different thread each time.

September 20, 2024, 01:07:33 PM #17 Last Edit: September 20, 2024, 01:14:57 PM by furfix
Quote from: franco on September 19, 2024, 06:00:23 PM
Looks like a different panic?

db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0149b4b8b0
vpanic() at vpanic+0x131/frame 0xfffffe0149b4b9e0
panic() at panic+0x43/frame 0xfffffe0149b4ba40
trash_ctor() at trash_ctor+0x53/frame 0xfffffe0149b4ba50
mb_ctor_pack() at mb_ctor_pack+0x3e/frame 0xfffffe0149b4ba90
item_ctor() at item_ctor+0x117/frame 0xfffffe0149b4bae0
m_getm2() at m_getm2+0x1aa/frame 0xfffffe0149b4bb50
m_uiotombuf() at m_uiotombuf+0x6f/frame 0xfffffe0149b4bbe0
uipc_sosend_dgram() at uipc_sosend_dgram+0x170/frame 0xfffffe0149b4bc70
sousrsend() at sousrsend+0x79/frame 0xfffffe0149b4bcd0
kern_sendit() at kern_sendit+0x1bc/frame 0xfffffe0149b4bd60
sendit() at sendit+0x184/frame 0xfffffe0149b4bdb0
sys_sendto() at sys_sendto+0x4d/frame 0xfffffe0149b4be00
amd64_syscall() at amd64_syscall+0x140/frame 0xfffffe0149b4bf30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0149b4bf30

> panic("Memory modified after free %p(%d) val=%lx @ %p\n", mem, size, *p, p);

Yes, well, this seems to point to memory corruption that's going on for whatever reason, apparently in UDP (which would also point to the WireGuard kernel module).

It could still be the same panic: the debug kernel adds checks like this one that panic sooner in order to catch errors earlier, but here, too, the damage was already done.

The question is whether this is inherently caused by hardware that needs to be replaced, or whether the errors go away without using WireGuard. This doesn't seem to be a prevalent issue, but it could still be a code problem.


Cheers,
Franco

Should I maybe try reinstalling? One of the first panics was about zpool, but it never happened again; per what you are saying, it looks like it's never about the same thing :(


At the end of the boot log I still see it, though:

Timecounter "TSC-low" frequency 1593603556 Hz quality 1000
Timecounters tick every 1.000 msec
ugen0.1: <Intel XHCI root HUB> at usbus0
ixl1: Link is up, 1 Gbps Full Duplex, Requested FEC: None, Negotiated FEC: None, Autoneg: True, Flow Control: None
ixl1: link state changed to UP
debugnet_any_ifnet_update: Bad dn_init result from ixl1 (ifp 0xfffff8000326e000), ignoring.
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
uhub0 on usbus0
uhub0: <Intel XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
ugen1.1: <Intel XHCI root HUB> at usbus1
uhub1 on usbus1
uhub1: <Intel XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus1
nvme0: Allocated 64MB host memory buffer
nda0 at nvme0 bus 0 scbus0 target 0 lun 1
nda0: <CT500P3PSSD8 P9CR413 2417487F0AA6>
nda0: Serial Number X
nda0: nvme version 1.4
nda0: 476940MB (976773168 512 byte sectors)
Trying to mount root from zfs:zroot/ROOT/default []...
uhub0: 4 ports with 4 removable, self powered
uhub1: 16 ports with 16 removable, self powered
ugen1.2: <MediaTek Inc. WirelessDevice> at usbus1
pid 31 (zpool) is attempting to use unsafe AIO requests - not logging anymore


Another panic today, while heavily using a WG tunnel for a long period of time:

panic: Memory modified after free 0xfffff8015ea40800(2048) val=1ce4029760df7eac @ 0xfffff8015ea40dc0

cpuid = 3
time = 1726747522
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0149b4b8b0
vpanic() at vpanic+0x131/frame 0xfffffe0149b4b9e0
panic() at panic+0x43/frame 0xfffffe0149b4ba40
trash_ctor() at trash_ctor+0x53/frame 0xfffffe0149b4ba50
mb_ctor_pack() at mb_ctor_pack+0x3e/frame 0xfffffe0149b4ba90
item_ctor() at item_ctor+0x117/frame 0xfffffe0149b4bae0
m_getm2() at m_getm2+0x1aa/frame 0xfffffe0149b4bb50
m_uiotombuf() at m_uiotombuf+0x6f/frame 0xfffffe0149b4bbe0
uipc_sosend_dgram() at uipc_sosend_dgram+0x170/frame 0xfffffe0149b4bc70
sousrsend() at sousrsend+0x79/frame 0xfffffe0149b4bcd0
kern_sendit() at kern_sendit+0x1bc/frame 0xfffffe0149b4bd60
sendit() at sendit+0x184/frame 0xfffffe0149b4bdb0
sys_sendto() at sys_sendto+0x4d/frame 0xfffffe0149b4be00
amd64_syscall() at amd64_syscall+0x140/frame 0xfffffe0149b4bf30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0149b4bf30
--- syscall (133, FreeBSD ELF64, sendto), rip = 0x25520240c57a, rsp = 0x2551ff2eeea8, rbp = 0x2551ff2f30c0 ---
KDB: enter: panic


Going to back up the config, do a clean re-install, and remove the PCIe WiFi card that has no use on this box (just in case).

Otherwise the kids and wife will open a sev1 and escalate it :D

Reinstalled. Wish me luck :) I'll keep you posted if another panic comes my way.

The happiness was short-lived :( Ten hours after upgrading to 24.7.5, I got another panic:

Should I completely disable WireGuard? Do you think it's a hardware issue?

Fatal trap 12: page fault while in kernel mode
cpuid = 6; apic id = 18
fault virtual address = 0x30
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff810909f0
stack pointer         = 0x28:0xfffffe01549b0710
frame pointer         = 0x28:0xfffffe01549b0860
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 77999 (python3.11)
rdi: fffffe001ea27940 rsi: fffff80215ba0740 rdx: 0000000200000000
rcx: 0000000000000001  r8: 000007fffffff000  r9: 0000000000000063
rax: c6083eb6eceb6cea rbx: fffffffc00000000 rbp: fffffe01549b0860
r10: fffff80039fb8ce0 r11: fffff801dace1000 r12: 0000000000000021
r13: fffff80000000000 r14: 0000000000000000 r15: 39f7c14913149315
trap number = 12
panic: page fault
cpuid = 6
time = 1727378741
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01549b0400
vpanic() at vpanic+0x131/frame 0xfffffe01549b0530
panic() at panic+0x43/frame 0xfffffe01549b0590
trap_fatal() at trap_fatal+0x40b/frame 0xfffffe01549b05f0
trap_pfault() at trap_pfault+0x46/frame 0xfffffe01549b0640
calltrap() at calltrap+0x8/frame 0xfffffe01549b0640
--- trap 0xc, rip = 0xffffffff810909f0, rsp = 0xfffffe01549b0710, rbp = 0xfffffe01549b0860 ---
pmap_remove_pages() at pmap_remove_pages+0x5f0/frame 0xfffffe01549b0860
vmspace_exit() at vmspace_exit+0x80/frame 0xfffffe01549b0890
exit1() at exit1+0x53a/frame 0xfffffe01549b08f0
sigexit() at sigexit+0x13d/frame 0xfffffe01549b0d60
postsig() at postsig+0x23a/frame 0xfffffe01549b0e20
ast_sig() at ast_sig+0x1d7/frame 0xfffffe01549b0ed0
ast_handler() at ast_handler+0x88/frame 0xfffffe01549b0f10
ast() at ast+0x20/frame 0xfffffe01549b0f30
doreti_ast() at doreti_ast+0x1c/frame 0x87e205ef0
KDB: enter: panic
panic.txt: page fault
version.txt: FreeBSD 14.1-RELEASE-p5 stable/24.7-n267840-e62d514886a SMP

It's beginning to look more and more like a hardware issue.


Sorry,
Franco

Quote from: franco on September 27, 2024, 11:23:53 AM
It's beginning to look more and more like a hardware issue.


Sorry,
Franco

I found this:

<6>pid 77267 (python3.11), jid 0, uid 0: exited on signal 10 (no core dump - bad address)
<6>pid 77999 (python3.11), jid 0, uid 0: exited on signal 10 (no core dump - bad address)



Also in System>log files>general:
2024-09-27T13:43:31 Error opnsense /usr/local/etc/rc.newwanipv6: The command '/bin/kill -'TERM' '73423''(pid:/var/run/unbound.pid) returned exit code '1', the output was 'kill: 73423: No such process'
2024-09-27T13:40:38 Error opnsense /usr/local/sbin/pluginctl: The command '/bin/kill -'TERM' '72153''(pid:/var/run/unbound.pid) returned exit code '1', the output was 'kill: 72153: No such process'


Once the family is sleeping, I will boot the machine with memtest86 and run it. Otherwise, I don't know what else to do...
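
(For the curious: at its core a memory tester just writes known patterns across RAM and reads them back, as in the toy sketch below. The real memtest86 runs from its own boot environment, covers physical memory directly and uses far more patterns and address orders, and even then a short pass can miss a marginal DIMM.)

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy pattern test: fill a buffer, read it back, count mismatches.
   Only an illustration of the principle; dedicated testers do much more. */
static size_t
check_pattern(volatile uint64_t *buf, size_t words, uint64_t pattern)
{
        size_t i, errors = 0;

        for (i = 0; i < words; i++)
                buf[i] = pattern;
        for (i = 0; i < words; i++) {
                if (buf[i] != pattern) {
                        fprintf(stderr, "mismatch at %p: got %llx want %llx\n",
                            (void *)&buf[i], (unsigned long long)buf[i],
                            (unsigned long long)pattern);
                        errors++;
                }
        }
        return (errors);
}

int
main(void)
{
        size_t words = (256UL * 1024 * 1024) / sizeof(uint64_t);  /* 256 MiB sample */
        uint64_t *buf = malloc(words * sizeof(uint64_t));
        size_t errors = 0;

        if (buf == NULL)
                return (1);
        errors += check_pattern(buf, words, 0xAAAAAAAAAAAAAAAAULL);
        errors += check_pattern(buf, words, 0x5555555555555555ULL);
        printf("%zu error(s) found\n", errors);
        free(buf);
        return (errors != 0);
}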

imho memtest is a waste of time: frequent false negatives, and it takes LOTS of time. Order fresh RAM and see what happenz...
kind regards
chemlud
____
"The price of reliability is the pursuit of the utmost simplicity."
C.A.R. Hoare

felix eichhorn's premium cat food with the extra portion of energy

A router is not a switch - A router is not a switch - A router is not a switch - A rou....

Another panic :(

Fatal trap 9: general protection fault while in kernel mode
cpuid = 9; apic id = 22
instruction pointer = 0x20:0xffffffff8108f4ee
stack pointer         = 0x28:0xfffffe0158e9bbd0
frame pointer         = 0x28:0xfffffe0158e9bc00
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 26738 (python3.11)
rdi: fffffe001ea8a480 rsi: 000000000000000c rdx: 0000000000000024
rcx: 46ff382abf19b8e7  r8: 000007fffffff000  r9: fffff8001ac55600
rax: fffff80188830168 rbx: fffffe0017e62a28 rbp: fffffe0158e9bc00
r10: 80000003ad429425 r11: fffff80000000000 r12: ffffffff81807940
r13: 0000000000000000 r14: fffff80188830160 r15: fffffe0158e9bc60
trap number = 9
panic: general protection fault
cpuid = 9
time = 1727453132
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0158e9b910
vpanic() at vpanic+0x131/frame 0xfffffe0158e9ba40
panic() at panic+0x43/frame 0xfffffe0158e9baa0
trap_fatal() at trap_fatal+0x40b/frame 0xfffffe0158e9bb00
calltrap() at calltrap+0x8/frame 0xfffffe0158e9bb00
--- trap 0x9, rip = 0xffffffff8108f4ee, rsp = 0xfffffe0158e9bbd0, rbp = 0xfffffe0158e9bc00 ---
pmap_try_insert_pv_entry() at pmap_try_insert_pv_entry+0xbe/frame 0xfffffe0158e9bc00
pmap_copy() at pmap_copy+0x549/frame 0xfffffe0158e9bcb0
vmspace_fork() at vmspace_fork+0xc90/frame 0xfffffe0158e9bd30
fork1() at fork1+0x52e/frame 0xfffffe0158e9bda0
sys_fork() at sys_fork+0x54/frame 0xfffffe0158e9be00
amd64_syscall() at amd64_syscall+0x100/frame 0xfffffe0158e9bf30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0158e9bf30
--- syscall (0, FreeBSD ELF64, syscall), rip = 0x8262491fa, rsp = 0x838afb3f8, rbp = 0x838afb450 ---
KDB: enter: panic
panic.txt: general protection fault
version.txt: FreeBSD 14.1-RELEASE-p5 stable/24.7-n267840-e62d514886a SMP

September 29, 2024, 08:50:44 AM #25 Last Edit: September 29, 2024, 08:52:44 AM by madar2356
Hi; I'm going through the same thing. I'm running OPNsense on Proxmox and was very happy with it for 2+ months, but then it started crashing. I tried both 24.1 and 24.7, but in my case it appears that Proxmox was "gracefully" shutting down and rebooting OPNsense. The OPNsense debug output didn't indicate a kernel panic.

To fix it, I installed 24.7 and removed all additional NICs / virtual bridges, and am presently running OPNsense as a basic, simple home router: no Suricata, Zenarmor, CrowdSec, VLANs, port forwarding, WireGuard, or Proton VPN. I managed to get it running for 25 hrs, and it crashed last night. This time it was a kernel panic.

I've now installed 24.7.5 and os-cpu-microcode-amd; if it crashes after this, I'll remove the 2 memory DIMMs that I installed on July 31st. I doubt this is a memory issue for me, though, because the host system has AdGuard Home and a few basic containers working, and they are all operating fine.

If even after that I can't achieve any form of stability, I'm disheartened to say I'm gonna give pfSense CE a try.

If pfSense also crashes, then I have no option but to treat this as a hardware issue.

I've spent 4 months on my home lab, media server, and web hosting; to now see it all crash... is disheartening.

My current ISP doesn't have a 24h disconnect, but this may change in the future.
To check how well OPNsense can deal with it, I enabled the 'Periodic reset interface' cron job on the dial-in connection.

What shall I say, it panics about 50% of the time.