[SOLVED] Update to 24.7.2 results in kernel panic

Started by mroess, August 22, 2024, 08:42:18 AM

I am willing to test it.

ZFS: yes
HW: Barracuda 220a (Intel Atom D525-based)
Age: Based on CPU 10-12 years (Barracuda says it was sold new until 2016)
VM: No.
Kernel or corepackage: default


August 23, 2024, 01:44:43 PM #31 Last Edit: August 23, 2024, 01:46:52 PM by waxhead
I just want to "join the club" as well. I too got a kernel panic after upgrading to 24.7.2.


panic: page fault
cpuid = 0
time = 1724410401
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0064e3e818
vpanic() at vpanic+0x131/frame 0xfffffe0064e3e940
panic() at panic+0x43/frame 0xfffffe0064e3e9a0
trap_fatal() at trap_fatal+0x48b/frame 0xfffffe0064e3ea00
trap_pfault() at trap_pfault+0x46/frame 0xfffffe0064e3ea50
calltrap() at calltrap+0x8/frame 0xfffffe0064e3ea50
trap 0xc, rip = 0xffffffff804d7de7, rsp = 0xfffffe0064e3eb20, rbp = 0xfffffe0064e3eb40
agp_close() at agp_close+0x57/frame 0xfffffe0064e3eb40
giant_close() at giant_close+0x68/frame 0xfffffe0064e3eb98
devfs_close() at devfs_close+0x4b3/frame 0xfffffe0064e3ec00
VOP_CLOSE_APV() at VOP_CLOSE_APV+0x1d/frame 0xfffffe0064e3ec20
vn_close1() at vn_close1+0x14c/frame 0xfffffe0064e3ec90
vn_closefile() at vn_closefile+0x3d/frame 0xfffffe0064e3ece0
devfs_close_f() at devfs_close_f+0x2a/frame 0xfffffe0064e3ed18
_fdrop() at _fdrop+0x11/frame 0xfffffe0064e3ed30
closef() at closef+0x24a/frame 0xfffffe0064e3edc0
closefp_impl() at closefp_impl+0x58/frame 0xfffffe0064e3ee00
amd64_syscall() at amd64_syscall+0x100/frame 0xfffffe0064e3ef30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0064e3ef30
syscall (6, FreeBSD ELF64, close), rip = 0x3843e84152ba, rsp = 0x3843f837fd818, rbp = 0x3843f837fda0
KDB: enter: panic
[ thread pid 31 tid 100232 ]
Stopped at      kdb_enter+0x33: movq    $0,0xfd9962(%rip)
db>


This is on a real physical box with an Intel Core 2 Duo E6400 CPU at 2.13 GHz and 2 GB RAM. I have never had a kernel panic on this system before.

Luckily I had just replaced some drives in that box, so I reverted to the old ones with OPNsense 24.7_9-amd64 / FreeBSD 14.1-RELEASE-p2, OpenSSL 3.0.14, which is running happily. As soon as I switch back to the new drives, I get the kernel panic again.

I suggest that you pull this "upgrade" before more people get bitten by this bug. Best of luck finding it.

Quote from: waxhead on August 23, 2024, 01:44:43 PM
I suggest that you pull this "upgrade" before more people get bitten by this bug. Best of luck finding it.

I suggest we all agree to find the actual cause first instead of giving blanket 20/20 advice.


Thanks,
Franco

Which image types are you guys using here.. DVD or VGA?


Cheers,
Franco

I am using DVD (ISO image for DVD+R).

Quote from: franco on August 23, 2024, 02:18:18 PM
Which image types are you guys using here.. DVD or VGA?


Cheers,
Franco

Quote from: franco on August 23, 2024, 01:20:33 PM
I've been thinking how to approach this. Would someone care to test two images of 24.7.2 -- one with the actual 24.7.2 state and one with the environment var commit reverted?

I think we should do 24.7.3 next week so we need to move this along. We need a way to confirm this precisely and I guess that is the safest way.


Cheers,
Franco

Hi, I'm running the serial image. I'll be happy to test whatever is needed tonight, with whatever links and reinstalls, and happy to report back.

Quote from: mifi42 on August 23, 2024, 02:28:08 PM
I am using DVD (ISO image for DVD+R).

Quote from: franco on August 23, 2024, 02:18:18 PM
Which image types are you guys using here.. DVD or VGA?


Cheers,
Franco

I used VGA.

Quote from: franco on August 23, 2024, 02:18:18 PM
Which image types are you guys using here.. DVD or VGA?


Cheers,
Franco

I can't say 100%, but I would be surprised if I used anything other than the USB installer, i.e. the VGA image.

I have tested the update with the patch reverted. The system boots normally.

HW: Gateprotect GPO 150, Intel(R) Atom(TM) CPU D525   @ 1.80GHz, Samsung SSD, with ZFS, used serial image to install

After the first kernel panic, I did a fresh ZFS install, ran pkg update and pkg upgrade via serial, reverted the patch, and rebooted. It seems to work fine so far.
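
For anyone who wants to revert the same way: a core commit can usually be applied or backed out with opnsense-patch, since applying an already-applied patch a second time reverses it. This is only a sketch; the hash below is a placeholder, not the actual environment variable commit:

# opnsense-patch <commit-hash>
# opnsense-shell reboot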

Regards
Marian

@mroess

Sorry, I have been wildly busy today, so I couldn't finish the required images. Since you have a system in the good state, I would like to ask the following of you:

1. Install the debug kernel for 24.7.2 and reboot to activate it.

# opnsense-update -zkr dbg-24.7.2
# opnsense-shell reboot

2. Trigger the panic manually, which in theory should be:

# env ZPOOL_IMPORT_PATH=/dev zpool import -Na

3. If it panics, the system knows the debug kernel was installed and creates a /var/crash/vmcore.0 file, which is the one I need in order to hit the debugger.

4. The system should boot back up without issue, since the panic trigger was only forced temporarily.

=======

If I can view this in the debugger I can apply a kernel band-aid and issue a new kernel. This seems very hardware-specific and is likely the only possible panic.
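
In case anyone wants to look at the dump themselves, a minimal sketch, assuming the stock FreeBSD crash dump layout and that the gdb package (which ships kgdb) is available from the repository; exact paths may differ on OPNsense:

# pkg install -y gdb
# kgdb /boot/kernel/kernel /var/crash/vmcore.0
(kgdb) bt

The bt command prints the kernel backtrace at the time of the panic, which is the interesting part.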


Thanks,
Franco

PS: In fact the process would work for anyone with the issue who is sitting on 24.7 or 24.7.1 waiting for a resolution. I tested the command with truss and it really does go and poke everything in /dev, for better or worse.

 :o ::)

You'd expect this to be limited to block devices at a minimum.

For emphasis:

# sh -c "env ZPOOL_IMPORT_PATH=/dev truss zpool import -Na 2>&1" | grep '/dev'
openat(AT_FDCWD,"/dev/zfs",O_RDWR|O_EXCL|O_CLOEXEC,00) = 3 (0x3)
openat(AT_FDCWD,"/dev/zfs",O_RDWR|O_CLOEXEC,00)    = 4 (0x4)
fstatat(AT_FDCWD,"/dev",{ mode=dr-xr-xr-x ,inode=2,size=512,blksize=4096 },0x0) = 0 (0x0)
__realpathat(AT_FDCWD,"/dev","/dev",1024,0)    = 0 (0x0)
open("/dev",O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC,05413465340) = 5 (0x5)
openat(AT_FDCWD,"/dev/acpi",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/apm",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 6 (0x6)
openat(AT_FDCWD,"/dev/apmctl",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 7 (0x7)
openat(AT_FDCWD,"/dev/audit",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/auditpipe",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/bpf",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/bpf0",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 6 (0x6)
openat(AT_FDCWD,"/dev/console",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/consolectl",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/ctty",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) ERR#6 'Device not configured'
openat(AT_FDCWD,"/dev/cuau0",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) ERR#16 'Device busy'
openat(AT_FDCWD,"/dev/cuau0.lock",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 6 (0x6)
openat(AT_FDCWD,"/dev/cuau0.init",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 7 (0x7)
openat(AT_FDCWD,"/dev/devctl",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) ERR#16 'Device busy'
openat(AT_FDCWD,"/dev/devctl2",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 6 (0x6)
openat(AT_FDCWD,"/dev/devstat",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 8 (0x8)
openat(AT_FDCWD,"/dev/fido",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 6 (0x6)
openat(AT_FDCWD,"/dev/full",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 8 (0x8)
openat(AT_FDCWD,"/dev/geom.ctl",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 7 (0x7)
openat(AT_FDCWD,"/dev/hpet0",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 6 (0x6)
openat(AT_FDCWD,"/dev/io",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/kbdmux0",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) ERR#16 'Device busy'
openat(AT_FDCWD,"/dev/klog",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) ERR#16 'Device busy'
openat(AT_FDCWD,"/dev/kmem",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 6 (0x6)
openat(AT_FDCWD,"/dev/mem",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/midistat",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 6 (0x6)
openat(AT_FDCWD,"/dev/mlx5ctl",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/nda0",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 6 (0x6)
openat(AT_FDCWD,"/dev/nda0p1",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 7 (0x7)
openat(AT_FDCWD,"/dev/nda0p2",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 8 (0x8)
openat(AT_FDCWD,"/dev/music0",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/nda0p4",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 8 (0x8)
openat(AT_FDCWD,"/dev/netdump",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 9 (0x9)
openat(AT_FDCWD,"/dev/mdctl",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/netmap",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 9 (0x9)
openat(AT_FDCWD,"/dev/nda0p3",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 10 (0xa)
openat(AT_FDCWD,"/dev/null",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/nvd0",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/kbd0",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) ERR#16 'Device busy'
openat(AT_FDCWD,"/dev/nvd0p1",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 9 (0x9)
openat(AT_FDCWD,"/dev/nvd0p2",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 11 (0xb)
openat(AT_FDCWD,"/dev/nvd0p4",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 6 (0x6)
openat(AT_FDCWD,"/dev/nvd0p3",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 7 (0x7)
openat(AT_FDCWD,"/dev/nvme0",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/pass0",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) ERR#1 'Operation not permitted'
openat(AT_FDCWD,"/dev/nvme0ns1",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/pci",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 6 (0x6)
openat(AT_FDCWD,"/dev/pf",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 6 (0x6)
openat(AT_FDCWD,"/dev/pfil",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 6 (0x6)
openat(AT_FDCWD,"/dev/random",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 6 (0x6)
openat(AT_FDCWD,"/dev/sndstat",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 6 (0x6)
openat(AT_FDCWD,"/dev/speaker",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/stderr",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 6 (0x6)
openat(AT_FDCWD,"/dev/stdin",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/stdout",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/sysmouse",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/tcp_log",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 6 (0x6)
openat(AT_FDCWD,"/dev/ttyu0",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/ttyu0.init",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/ttyu0.lock",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 6 (0x6)
openat(AT_FDCWD,"/dev/ttyv0",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/ttyv1",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 6 (0x6)
openat(AT_FDCWD,"/dev/ttyv2",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/ttyv3",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/ttyv4",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 6 (0x6)
openat(AT_FDCWD,"/dev/ttyv5",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/ttyv6",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 6 (0x6)
openat(AT_FDCWD,"/dev/ttyv7",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 7 (0x7)
openat(AT_FDCWD,"/dev/ttyv8",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/ttyv9",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 6 (0x6)
openat(AT_FDCWD,"/dev/ttyva",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 7 (0x7)
openat(AT_FDCWD,"/dev/ttyvb",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/ugen0.1",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 6 (0x6)
openat(AT_FDCWD,"/dev/ufssuspend",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 7 (0x7)
openat(AT_FDCWD,"/dev/ugen0.2",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/uinput",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 6 (0x6)
openat(AT_FDCWD,"/dev/usbctl",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/urandom",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 7 (0x7)
openat(AT_FDCWD,"/dev/xpt0",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) ERR#1 'Operation not permitted'
openat(AT_FDCWD,"/dev/zero",O_RDONLY|O_NONBLOCK|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/nda0p3",O_RDONLY|O_EXCL|O_CLOEXEC,00) = 5 (0x5)
openat(AT_FDCWD,"/dev/nvd0p3",O_RDONLY|O_EXCL|O_CLOEXEC,00) = 5 (0x5)

Hmm, awesome. I just reproduced this insane zpool import behavior on XigmaNAS (the 14.1 RC version).

I guess I'd rather file a ticket there; recycling various desktop-like HW is much more common there.

Good idea, thanks! The problem is that this isn't used by default anymore, and likely nobody uses this env var. It used to be the default in the old ZFS implementation... That was the starting point of all of this.

To be honest, I don't understand the cache files for zfs/zpool, which would have been the other way to solve this.
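
Roughly, the cache file records the pool's member devices when the pool is created or imported, and later imports read that file instead of probing device nodes. A sketch only, with zroot standing in for whatever the pool is actually called:

# zpool set cachefile=/boot/zfs/zpool.cache zroot
# zpool import -aN -c /boot/zfs/zpool.cache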


Cheers,
Franco