[SOLVED] Update to 21.1.3 prevents server from booting with new kernel

Started by andreaslink, March 11, 2021, 08:29:17 AM

I updated my server as usual (as I have done every time before), and since this update required a reboot, I'm stuck now. The kernel no longer boots on my hardware: the server hangs at "Probing 13 blocks devices..." and goes no further (see attached screenshot).

I'm running on bare metal; here is a hardware probe taken before the update: https://bsd-hardware.info/?probe=e161817fa3

I'm really looking for help. Any idea how I could boot from USB and restore the former kernel?
Running OPNsense on 4 core Intel Xeon E5506, 20GB RAM, 2x Broadcom NetXtreme II BCM5709, 4x Intel 82580
Ubench Single CPU: 307897 (0.39s)

Hi,

I think you should reinstall OPNsense and restore a backup of your config. That will be the fastest way to fix this.


greets

Restoring would also cost me all my historical logs, so that is my last resort. I would prefer to find the real root cause here.
Does the new kernel no longer support my hardware or BIOS?
I would prefer to go back to the former kernel, where everything was running fine, and check whether that still works.

Quote from: andreaslink on March 11, 2021, 08:43:41 AM
Restoring would also cost me all my historical logs, so that is my last resort. I would prefer to find the real root cause here.
Does the new kernel no longer support my hardware or BIOS?
I would prefer to go back to the former kernel, where everything was running fine, and check whether that still works.

I have run FreeBSD ever since 2.5 in the nineties. I have never run into any issues with kernel updates (with the same BIOS) within the same major version.

Quote: I've never run into any issues with kernel updates (with the same BIOS)

That is good to know, and until today I shared the same experience, though not over such a long period. That is why I asked how to restore the former kernel: I might also have a hardware or RAID controller issue that happened to surface only because of the reboot. I may well be connecting the wrong events while trying to track down the error.

I will certainly investigate the hardware further, but is there a way to downgrade to the former kernel by booting from USB and repairing the system manually, or does the installed system always have to boot first? It looks like the kernel is trying to boot but cannot proceed because the hard disks are missing or unreadable, I guess?

I've seen this randomly a few times over the years during hundreds of test installs and updates, but it was always a fluke related to the kernel update not reaching the disk or the file system giving up. It was never about a particular way of updating (our update sequence has been the same since 2015: untar and reboot, a.k.a. "what could possibly go wrong") or about which changes the kernel update included, as gunnarf noted.

If the partition layout is gone, it's quite hard to get your data back. You can check this with the config importer from the installer. Keep in mind, though, that the installer only recovers the /conf contents it can restore and then clears out the rest of the system if you decide to reinstall after the import.

The importer has some hints on how to mount this manually. It might be of future interest to build a "repair" function that puts the kernel and base system back and sees whether it reboots, but again only if the partition layout is not messed up beyond easy repair.

https://github.com/opnsense/core/blob/master/src/sbin/opnsense-importer#L246-L273
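
As a rough illustration of checking whether the partition layout is still visible from a live or installer shell, the sketch below parses text in the format printed by FreeBSD's `gpart show` and lists UFS partitions that could hold the root file system. It is not what opnsense-importer does. The sample output, the device name `da0`, and the helper names `parse_gpart_show` and `ufs_candidates` are all made up for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical sample in the layout printed by `gpart show`.
# The device name (da0) and all sizes are invented for illustration.
SAMPLE = """\
=>      40  41942960  da0  GPT  (20G)
        40      1024    1  freebsd-boot  (512K)
      1064  39844864    2  freebsd-ufs  (19G)
  39845928   2097072    3  freebsd-swap  (1.0G)
"""


def parse_gpart_show(text: str) -> Dict[str, List[dict]]:
    """Map each disk name to the partitions listed under it."""
    disks: Dict[str, List[dict]] = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        if fields[0] == "=>":
            # Header line: => start size provider scheme (size)
            current = fields[3]
            disks[current] = []
        elif current is not None and fields[2].isdigit():
            # Partition line: start size index type (size).
            # Free-space rows carry "-" instead of an index and are skipped.
            disks[current].append({"index": int(fields[2]), "type": fields[3]})
    return disks


def ufs_candidates(disks: Dict[str, List[dict]]) -> List[Tuple[str, int]]:
    """Return (disk, partition index) pairs whose type is freebsd-ufs."""
    return [
        (disk, part["index"])
        for disk, parts in disks.items()
        for part in parts
        if part["type"] == "freebsd-ufs"
    ]


if __name__ == "__main__":
    print(ufs_candidates(parse_gpart_show(SAMPLE)))
```

An empty result would suggest that no partitioned disk is visible to the kernel at all, which points toward the controller or the disks rather than the kernel update.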


Cheers,
Franco

Thank you, Franco. You were right, and I was finally able to solve it. The kernel could not detect the usual boot medium because my RAID (hardware RAID controller) had somehow gone out of sync during the boot, I guess, and as long as it was out of sync the SAS RAID controller kept it offline.
I connected a monitor and keyboard and restored the RAID (simple mirroring) by syncing the disks again (see screenshot). After the sync the disk was visible again and I was able to boot as expected.

So this was very probably not related to the upgrade I made, but to my hardware, a.k.a. my environment, and it kicked in by coincidence during the reboot. Thanks for the support and for helping to solve it. I am marking this one as solved.