I'm getting a kernel panic after updating/reinstalling v23.1

Started by RNHurt, February 28, 2023, 09:24:32 AM

I've been running OPNSense for years and I really love it.  However, after a recent update my HD activity light was staying on and the CPU meter on the OPNSense dashboard was reading 100%.  After looking for anything obvious and turning off all the services I could, the CPU was still pegged and the HD light was still on constantly.  So I rebooted the machine; it never came back online.

After I attached a monitor to the machine I saw that it had a kernel panic[0].  While unusual, I didn't think too much of it.  However, rebooting the machine didn't resolve the issue.  So I removed all the cards, memory, etc. to see if I could get a clean boot.  Nothing helped and I continued to get a kernel panic[0].

I thought it might be a corrupted hard drive or something so I disconnected the drive and booted off of a USB thumb drive with a fresh copy of v23.1 installed on it.  The system booted just fine and ran the live version.  So I turned the machine off, reconnected the drive, rebooted and installed v23.1 on the HD.  The install worked perfectly and the machine rebooted.  Once again, I got the kernel panic[0].

My next thought was that maybe the HD was "bad".  I replaced the HD and again installed a fresh copy of v23.1.  Again, the kernel panic[0] showed up.  Arrggghhh! 

I'm running Memtest86 v6.10 right now and everything is looking good, so I don't think it's memory related.  I've replaced the HD so that's (probably) not the problem.  It seems to work fine booting from the USB flash drive (it's just slooooow) so the CPU seems to be OK.

Any thoughts on what I should do now?  I'm not very good at reading kernel panic output so I thought I would ask here.  The weird thing is that it seems to run fine from a live USB stick but not when I install it on a HD.  Maybe the HD controller is bad?  How would I test this?

BTW: after the kernel panic the machine is locked up completely.  Nothing works.  The keyboard doesn't do anything, the capslock key doesn't even light up.  Even the floppy drive light is stuck on.

Later...
Richard

I'd contact the freebsd-stable mailing list about that issue. You will need to provide all the detail you can about the specifics of your hardware.

https://lists.freebsd.org/subscription/freebsd-stable
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

I found another post on this forum that is reporting a very similar error from 9 months ago - https://forum.opnsense.org/index.php?topic=28422.msg138676#msg138676
Later...
Richard

Possibly - but that is a problem for the FreeBSD kernel developers to address. Hence my recommendation.

Quote from: pmhausen on February 28, 2023, 10:19:59 AM
Possibly - but that is a problem for the FreeBSD kernel developers to address. Hence my recommendation.

Thank you for your recommendation.  The only question I have is that the FreeBSD forums seem to really not like people asking about "derivative" OS installations[0].  Is the mailing list more receptive?

BTW: I'm going to try to install v21.7 (which is what I was running before I think) to see if that makes any difference at all.

Later...
Richard

The forum is to my knowledge considered a FreeBSD user support medium. So naturally they want to concentrate on FreeBSD and are a bit reluctant as far as derived products are concerned.

The -stable mailing list is a developer channel and a kernel panic is in my opinion a developer topic. If I am not mistaken OPNsense runs an unmodified 13.1-p6, currently.

@franco is also active on various mailing lists, so it does not seem to be a problem to me.

This is a known problem with FreeBSD and older systems.  :-[

It has to do with system mitigations for the old "Meltdown" and "Spectre" issues.  Once I added these parameters to my tunables everything worked fine.

hw.ibrs_disable=0
vm.pmap.pti=1
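
For anyone who wants to double-check their own settings, here is a minimal sketch (not an official OPNsense tool) that scans loader.conf-style `name=value` text for the two tunables above. The function names and the sample text are made up for illustration, and where that text actually lives on a given install is an assumption you would need to verify yourself.

```python
# Sketch only: check name=value text for the two tunables from this post.
# The required values are the ones quoted above; everything else is hypothetical.
REQUIRED = {"hw.ibrs_disable": "0", "vm.pmap.pti": "1"}


def parse_tunables(text):
    """Parse name=value lines, skipping blank lines and # comments."""
    found = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        found[name.strip()] = value.strip().strip('"')  # tolerate quoted values
    return found


def missing_tunables(text, required=REQUIRED):
    """Return {name: (expected, actual_or_None)} for tunables not set as required."""
    found = parse_tunables(text)
    return {
        name: (want, found.get(name))
        for name, want in required.items()
        if found.get(name) != want
    }


# Hypothetical sample: one tunable set correctly, the other set to the wrong value.
sample = 'hw.ibrs_disable="0"\nvm.pmap.pti=0\n'
for name, (want, got) in missing_tunables(sample).items():
    print(f"{name}: expected {want}, found {got}")
```

An empty result from `missing_tunables` means both values are present as posted. It only inspects text; it does not read or change anything on the running system.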


More details can be found here:
* https://github.com/opnsense/core/issues/3177
* https://forum.opnsense.org/index.php?topic=11419.msg52164#msg52164
* https://forum.opnsense.org/index.php?topic=13564.msg62529#msg62529
Later...
Richard

Sorry, late to the party... this would be the panic:

https://github.com/opnsense/src/blob/72b2aabf593569fab6d9e00f90c806facce21742/sys/x86/x86/mca.c#L1535

But since the solution was already posted there isn't any patch in that area that can be picked up.

But the question is: was this still working on 22.7?


Cheers,
Franco

Quote from: franco on March 01, 2023, 07:47:13 AM
But the question is: was this still working on 22.7?

I'm not sure I understand the question, but adjusting the tunables in OPNSense was still working on 22.7-amd64 and is currently working for me on 23.1-amd64. 

My problems started when I reset my OPNSense back to the "default" configuration without remembering that you have to update the boot parameters.  :-[  The resulting kernel panic sent me down a rabbit hole that took a couple of days to find my way out of.

It would be nice if OPNSense could tell if your hardware was susceptible to this issue and automatically add those boot parameters for you.  Like I commented before, for some reason the "live" install worked off of the USB stick but when I installed it to the HD it caused the panic.  I'm guessing the "live" install has different boot parameters or something.
Later...
Richard

Ah ok, so this was the case for at least 22.1 then when resetting the configuration. Thanks for clarifying.

> It would be nice if OPNSense could tell if your hardware was susceptible to this issue and automatically add those boot parameters for you.

I think the kernel is in a unique position to prevent a panic here. We don't have any structure to know what hardware quirks a platform has.


Cheers,
Franco