[SOLVED] Update to 24.7.2 results in kernel panic

Started by mroess, August 22, 2024, 08:42:18 AM

I've looked at the change you referenced earlier https://github.com/opnsense/core/commit/37003d1d5793b03
but I can't see where the upstream change is, if there's one. Do you have it handy?

Well I certainly don't expect exporting an env. variable to trigger kernel panics. Used or not - if it's broken, unused and unmaintained, just nuke the code... 🤷‍♂️



Do I get this correct? "zpool import -a" scans /dev and that makes the kernel panic?

O.K., that could either be a weird broken device driver that is being touched via the device path or maybe even ZFS itself if the zpool is old enough (the current OpenZFS version has new features - and maybe new bugs).
Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 770 up, Bufferbloat A

Quote from: meyergru on August 24, 2024, 12:04:32 AM
Do I get this correct? "zpool import -a" scans /dev and that makes the kernel panic?

O.K., that could either be a weird broken device driver that is being touched via the device path

Don't know but this reference to agp driver and the agp_close() call seems to be the only one in the entire FreeBSD-src repo.

🤷‍♂️

Quote from: franco on August 23, 2024, 02:18:18 PM
Which image types are you guys using here: DVD or VGA?

I was using the VGA version for all tests.

Throwing my hat in to say I had the same issue as OP down to the hex. Did a clean install (ZFS) of 24.7.0, all's well. Restored config, all good. Update to 24.7.2 and hello page fault again.

Saw the post to try UFS instead of ZFS. Selected install via UFS or whatever the menu option was, and got a bunch of stacktraces before the machine rebooted, believe someone also had that issue.

Selected the other option that's not UFS or ZFS and ended up doing a guided UFS install basically. I forget exactly which option it is because I've been messing with (re)installing OPN for the last 3 hours testing this and I'm not about to do it again since everything is stable now.

Are you using ZFS? Not anymore, since that's why the error is appearing.

How old is the hardware/is it a VM? Bare metal, Core 2 Quad 8300, socket LGA 775, circa 2008 I believe. 4 GB RAM, one 2-port Intel PCIe NIC and two 1-port PCI NICs. 120 GB SATA SSD. Not that it matters, this is a ZFS error.

Does this panic occur due to the 24.7.2 kernel or 24.7.2 core package? I know it's difficult with the panic but we need more data points than "24.7.2 is not working" now. I don't know how I'd differentiate that during boot to be honest. Since it's the kernel panicking I'm assuming it's the 24.7.2 kernel and not core.

August 24, 2024, 04:36:19 AM #53 Last Edit: August 24, 2024, 05:25:08 AM by franco
Can't sleep. It's 4 a.m. I looked at "man agp". I looked at sporadic meaningless sys/dev/agp code refactors of the last couple of years. Found this in 2020:

https://github.com/opnsense/src/commit/4f8959b9f4bb

Hmm.

Quote from: doktornotor on August 23, 2024, 10:20:58 PM
Well I certainly don't expect exporting an env. variable to trigger kernel panics. Used or not - if it's broken, unused and unmaintained, just nuke the code... 🤷‍♂️

"And so it is, just like you said it should be"

https://github.com/opnsense/tools/commit/97f9f368b58

If anyone misses agp they can still kldload at their own peril?

https://pkg.opnsense.org/FreeBSD:14:amd64/snapshots/misc/OPNsense-24.7.2-vga-amd64.img.bz2
https://pkg.opnsense.org/FreeBSD:14:amd64/snapshots/misc/OPNsense-24.7.2-vga-amd64.img.sig

https://pkg.opnsense.org/FreeBSD:14:amd64/snapshots/misc/OPNsense-24.7.2-dvd-amd64.iso.bz2
https://pkg.opnsense.org/FreeBSD:14:amd64/snapshots/misc/OPNsense-24.7.2-dvd-amd64.iso.sig

I will replace the kernel on the mirror once someone has confirmed this with one of the images on the actual hardware that the driver attaches to. ZPOOL_IMPORT_PATH has not been reverted; only agp has been removed from the kernel so it cannot hit the bad agp_close(). If this works for the people reporting the crash we have our answer.


Cheers,
Franco

August 24, 2024, 05:09:01 AM #54 Last Edit: August 24, 2024, 05:15:13 AM by newsense
>>>>> https://github.com/opnsense/src/commit/4f8959b9f4bb  <<<<<

Just as Nostradamus predicted in the ICMP thread: "Downstream issue, use a vanilla...ooops nevermind and carry on, nothing to see here"

Hindsight is always 20-20

/FreeBSD<=4

It's a valid point raised in 2020. The man page says this was added in FreeBSD 4.1 (2000). The man page was last updated in 2007. flyboy463 said their hardware was from 2008.

The only question that remains is whether the removal of agp from the kernel is detrimental to using this hardware from 2008 or not. This is a bit of a trick question. :)


Cheers,
Franco

So I was halfway smashed on my initial post here and I am definitely toasty now, but I can say that a clean install of OPNsense 24.7.0 and a console upgrade to 24.7.2 on my R720XD, completely virtualized (KVM/QEMU), did not produce a page fault. That hardware is from 2012 with no PCI to speak of AFAIK. Hope this is of some use. I can do more testing if need be if I'm not too hungover tomorrow.

Quote from: franco on August 24, 2024, 04:36:19 AM
"And so it is, just like you said it should be"

https://github.com/opnsense/tools/commit/97f9f368b58

If anyone misses agp they can still kldload at their own peril?

Indeed looks like a good prevention so I suggested that to XigmaNAS folks as well. If the broken code is not there, it cannot be triggered.  8)

August 24, 2024, 09:41:33 AM #58 Last Edit: August 24, 2024, 09:43:54 AM by franco
@flyboy463 sorry to have come across this way here. All input is appreciated. It's a hardware-specific issue that just happens to be triggered now with ZFS/zpool-import use due to the use of an environment variable. You can probably crash the affected hardware on FreeBSD 14.1 with something as simple as

# echo > /dev/agpgart

If anyone dares to try be my guest.
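Before trying that, a non-destructive check can show whether a box exposes the node at all. The following is only a sketch: the helper name has_device_node and the AGP_NODE variable are made up for illustration, and it assumes that testing for the path's existence does not open the device and so should not reach agp_close().

```shell
#!/bin/sh
# Hypothetical helper (not part of OPNsense or FreeBSD): report whether a
# device node exists. `[ -e ]` only stats the path and does not open it.
has_device_node() {
    if [ -e "$1" ]; then
        echo "present: $1"
        return 0
    fi
    echo "absent: $1"
    return 1
}

# AGP_NODE is an illustrative override; the default is the node named above.
has_device_node "${AGP_NODE:-/dev/agpgart}" || true
```

If the node is absent, the agp driver most likely did not attach and the machine is probably not in the affected group.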

The question still remains if there is some use for the agp kernel module here WRT graphics support / VGA console. If that is the case disabling it by default may have other repercussions for users of the hardware. An alternative would be to avoid presenting the device node /dev/agpgart or fix the actual panic as suggested earlier.

I'm still positive that we should do something other than removing the environment variable for 24.7.3 since the scope of this is very narrow and mitigated by using UFS as far as I can tell.


Cheers,
Franco

August 24, 2024, 11:11:53 AM #59 Last Edit: August 24, 2024, 11:26:55 AM by doktornotor
Franco, are these hints still applicable?

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187015

If removing the driver altogether is a real issue (not convinced at all), this could mitigate the device creation, seems to me? Also, people affected here could add that from the boot prompt to see if it helps?
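For anyone who wants to try this, here is a rough sketch of what such a hint could look like. It assumes the hint in the linked report follows the usual device.hints(5) form hint.<driver>.<unit>.disabled and that the driver attaches as agp unit 0; neither has been confirmed in this thread. From the boot prompt the equivalent would be "set hint.agp.0.disabled=1" followed by "boot".

```
# /boot/device.hints style entry (sketch only)
# Assumption: the agp driver attaches as unit 0 on the affected hardware.
hint.agp.0.disabled="1"
```

Whether this actually keeps /dev/agpgart from being created on the affected machines would still need to be checked on real hardware.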