OPNsense 25.1.5 just broke my system.

Started by Shoog, April 10, 2025, 09:56:55 PM

Hi,
Just a heads up that when I upgraded this evening to the latest 25.1.5, the system was broken on reboot. I don't know exactly how it is broken, but nothing on my home network is working. I suspect that DHCP is the issue, but I don't know for certain. I cannot access the web portal, but when I plug a monitor and keyboard into the router everything looks OK: the main WAN has an IP and my WireGuard tunnel is up.

Fortunately I have a full disk backup from Sunday, so once I remember how to uncompress it and dd it back to the main disk I should be somewhat OK, but that will have to wait until tomorrow.
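For anyone in the same spot, uncompressing an image and writing it back with dd usually comes down to one pipeline. What follows is only a minimal sketch, assuming the backup is a gzip-compressed raw disk image (the post does not say which compressor was used). restore_image is a made-up helper name, and the arguments are placeholders.

```shell
#!/bin/sh
# restore_image IMAGE_GZ TARGET
# Assumption: IMAGE_GZ is a gzip-compressed raw image of the whole disk.
restore_image() {
    image_gz=$1
    target=$2

    # Test the archive first, so a corrupt backup is caught
    # before anything is written to the target.
    gzip -t "$image_gz" || {
        echo "archive failed its integrity test: $image_gz" >&2
        return 1
    }

    # Stream the decompressed image straight onto the target.
    # The block size is spelled out in bytes (1 MiB) because dd
    # implementations differ in the size suffixes they accept.
    gzip -dc "$image_gz" | dd of="$target" bs=1048576
}
```

It would be called as restore_image /path/to/backup.img.gz /dev/TARGETDISK from a live USB, where both the path and the device name are placeholders. dd overwrites the target without asking, so the device name is worth double-checking (for example with gpart show) before running it.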

Stephen


April 10, 2025, 10:20:59 PM #2 Last Edit: April 10, 2025, 10:27:26 PM by Shoog
I never set up a captive portal.

How would I go about disabling the captive portal from the command line, just to be sure?

Just upgraded to 25.1.5 and it didn't reboot. Hooked up a monitor to see what was going on, and it can't mount root: unknown filesystem.

I took a snapshot before the upgrade, as I always do, but I can't see the usual menu that allows me to roll back to a chosen snapshot.

I'm at a mountroot> prompt. I guess it expects me to specify a filesystem; I use ZFS, but none of the ones I tried worked.

Any suggestions?


Quote from: alexdelprete on April 10, 2025, 11:54:29 PM
Just upgraded to 25.1.5 and it didn't reboot. Hooked up a monitor to see what was going on, and it can't mount root: unknown filesystem.

I took a snapshot before the upgrade, as I always do, but I can't see the usual menu that allows me to roll back to a chosen snapshot.

I'm at a mountroot> prompt. I guess it expects me to specify a filesystem; I use ZFS, but none of the ones I tried worked.

Any suggestions?



Drive full or dying, most likely. Are you sure you're not skipping over the boot menu? If that appears, you could try booting the old kernel, just in case the new kernel wasn't installed properly.

Try a fresh install, see how the drive behaves.

April 11, 2025, 01:10:30 AM #5 Last Edit: April 11, 2025, 01:13:18 AM by alexdelprete
Quote from: newsense on April 11, 2025, 12:30:12 AM
Drive full or dying, most likely. Are you sure you're not skipping over the boot menu? If that appears, you could try booting the old kernel, just in case the new kernel wasn't installed properly.

Try a fresh install, see how the drive behaves.

It's a 1 TB NVMe drive, 99% free. I've never had issues with it. The boot menu doesn't come up; I see a strange "booting /boot/kernel/kernel" text line with some hex characters. I managed to press space to get to an OK prompt where I have some commands available, but I don't know how to load the old kernel from there.



If I don't do anything and it loads the new kernel, then it stops here:



I guess I'm stuck and have to reinstall, right?



Yes, that's the path forward right now; the drive is still the unknown here.

Let's see what's going on, hopefully nothing unresolvable.
Can you please boot from a live USB (not Linux, we need FreeBSD)? Best to use the same FreeBSD version in case we need the boot code.
Boot it, drop to a shell, run "gpart show" and post the results inside code brackets.

I reinstalled from scratch and restored the config manually (I had a git backup and also manual backups of the config). I double-checked the NVMe drive and it has no issues that I can diagnose. This means that something happened during the upgrade. :(

First time in years that I've had issues with an OPNsense upgrade. I must confess that I'm now a little bit scared of the next upgrades.

After this experience, what I feel is missing from the live USB image is a recovery tool that checks (and fixes) the disk installation when facing these kinds of boot issues.

The other pain was the fact that we have a config backup, but the plugins (and their config/data) are not restored. Now I'm back on track, almost, but I still have to configure some plugins. Tailscale for some reason is not behaving properly, but I'll check later and will probably reinstall it from scratch.

Question is: to prevent this from happening in the future, and to shorten the restore cycle, what should I do? Take a full image of the drive by pulling it out of the system every once in a while? Isn't there a better way to achieve this?
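Taking a full image does not have to be complicated once the disk is readable from another system or a live USB. A rough sketch under those assumptions: backup_image is a hypothetical helper name, gzip is just one possible compressor, and the source and destination are placeholders. Imaging a disk that is mounted and in use would not give a consistent copy, which is why this assumes the system is offline.

```shell
#!/bin/sh
# backup_image SOURCE DEST_GZ
# Assumption: SOURCE is the whole disk (or an image file) and is not in use.
backup_image() {
    source_dev=$1
    dest_gz=$2

    # Read the source in 1 MiB blocks and compress on the fly.
    # Plain sh reports only gzip's exit status for this pipeline,
    # so dd's own messages on stderr are still worth reading.
    dd if="$source_dev" bs=1048576 | gzip -c > "$dest_gz" || return 1

    # Confirm the archive decompresses cleanly before trusting it.
    gzip -t "$dest_gz"
}
```

A call would look like backup_image /dev/SOURCEDISK /mnt/external/opnsense.img.gz, with both names being placeholders for whatever the live system shows.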

Quote from: alexdelprete on April 11, 2025, 01:43:11 PM
After this experience, what I feel is missing from the live USB image is a recovery tool that checks (and fixes) the disk installation when facing these kinds of boot issues.

There are no offline analysis and repair tools for ZFS.

Quote from: alexdelprete on April 11, 2025, 01:43:11 PM
The other pain was the fact that we have a config backup, but the plugins (and their config/data) are not restored. Now I'm back on track, almost, but I still have to configure some plugins. Tailscale for some reason is not behaving properly, but I'll check later and will probably reinstall it from scratch.

Question is: to prevent this from happening in the future, and to shorten the restore cycle, what should I do? Take a full image of the drive by pulling it out of the system every once in a while? Isn't there a better way to achieve this?
Here is the pitfall of modifying things outside the UI, which acts as a sort of collector of modifications for reinstallations. It also shows the advantage of running it as a virtual machine.
Even then we have to back up the hypervisor somehow, for example by taking a full image of it. Or, what takes care of it in both cases is to run it on high-availability storage, i.e. a RAID setup. Even a mirrored pair pretty much takes care of it, BUT that is of course sometimes not possible, for example when there are no spare storage ports.
Reminds me, I need to make a new image too, but that means downtime: boot to Clonezilla, clone to an external disk.

Quote from: Patrick M. Hausen on April 11, 2025, 01:59:27 PM
Quote from: alexdelprete on April 11, 2025, 01:43:11 PM
After this experience, what I feel is missing from the live USB image is a recovery tool that checks (and fixes) the disk installation when facing these kinds of boot issues.

There are no offline analysis and repair tools for ZFS.

I feared (but was kind of expecting) that this feedback was coming. Thanks Patrick.

Quote from: cookiemonster on April 11, 2025, 02:37:36 PM
Here is the pitfall of modifying things outside the UI, which acts as a sort of collector of modifications for reinstallations. It also shows the advantage of running it as a virtual machine.
Even then we have to back up the hypervisor somehow, for example by taking a full image of it. Or, what takes care of it in both cases is to run it on high-availability storage, i.e. a RAID setup. Even a mirrored pair pretty much takes care of it, BUT that is of course sometimes not possible, for example when there are no spare storage ports.
Reminds me, I need to make a new image too, but that means downtime: boot to Clonezilla, clone to an external disk.

HA storage doesn't solve the problem of an upgrade script creating issues, or an "rm -rf" on the wrong path. :)

But you have a point that will make me think over the next few days: maybe it's time to seriously consider virtualizing OPNsense. I was not in favor of it for several reasons, but considering what happened, the advantages probably outweigh the disadvantages. The ability to quickly restore a VM, in seconds, vs spending a whole night trying to recover a bare metal installation is really tempting. Thanks for the advice.


April 11, 2025, 03:13:01 PM #12 Last Edit: April 11, 2025, 03:16:06 PM by Shoog
Well my recovery image is corrupted in some way so that path is closed to me.
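A backup that turns out to be corrupt at restore time is easier to catch early if a checksum is stored next to the image when it is made and re-checked now and then. A minimal sketch of that idea follows. The helper names are invented for illustration, and the only assumption about the system is that either sha256sum (common on Linux) or sha256 (FreeBSD) is available.

```shell
#!/bin/sh
# sha256_of FILE: print only the SHA-256 digest of FILE.
sha256_of() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1" | awk '{print $1}'
    else
        # FreeBSD's sha256 prints just the digest with -q.
        sha256 -q "$1"
    fi
}

# record_checksum IMAGE: store the digest in IMAGE.sha256,
# meant to be run right after the backup is taken.
record_checksum() {
    sha256_of "$1" > "$1.sha256"
}

# verify_checksum IMAGE: succeed only if IMAGE still matches
# the digest recorded in IMAGE.sha256.
verify_checksum() {
    [ -f "$1.sha256" ] || {
        echo "no checksum recorded for $1" >&2
        return 1
    }
    [ "$(sha256_of "$1")" = "$(cat "$1.sha256")" ]
}
```

Running verify_checksum on the image after copying it to the backup disk, and again before relying on it, would show whether the file changed since it was written. It cannot tell whether the image was good in the first place.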

Going to try a factory reset and then a restore to see if it flushes out the errors. In my case I definitely think it's DHCP which is the root cause. The firewall itself is able to ping out and all interfaces are up and running, but no services on the LAN are getting IPs. A clue is that my Kodi boxes ran on for around half an hour before they died, which sort of points to DHCP dropping the connections at refresh. I cannot ping anything on the LAN.

Mighty pain in the hole since I have quite a few add-ons to reconfigure if I do a fresh install. It will take the best part of a day, but at least I have notes.

Quote from: Alessandro Del Prete on April 11, 2025, 02:56:36 PM
HA storage doesn't solve the problem of an upgrade script creating issues, or an "rm -rf" on the wrong path. :)

Yes, true of course.

@Shoog, you seem to be in a better place: only something in the config is not right, rather than the whole OS failing to boot.
Have you installed the latest hotfix?