OPNsense Forum

Archive => 23.7 Legacy Series => Topic started by: jenix on January 21, 2024, 04:53:14 PM

Title: DEC840: Extremly slow boot after 23.7.12 / 24.1 upgrade
Post by: jenix on January 21, 2024, 04:53:14 PM
EDIT: I adjusted the topic, as I now believe my initial assumption (the device freezes during boot) was wrong. Instead, the boot process is extremely slow (around 30 minutes). I suspect the culprit to be either my configuration or the config migration. Please see my latest post (https://forum.opnsense.org/index.php?topic=38266.msg188463#msg188463) for more details.

Initial Post:
So the upgrade to 23.7.12 bricks my DEC840 firewall. I can reproduce the issue as it happened 3 times in a row (initial upgrade, after the fresh installation and once again to verify).
The installation of the upgrades completes, but there seems to be an issue when syncing the filesystem during the shutdown. Console output reads:
Waiting (max 60 seconds) for system process `vnlru' to stop... done
Waiting (max 60 seconds) for system process `syncer' to stop...
Syncing disks, vnodes remaining... 6 fsync: giving up on dirty (error = 35) 0xfffff800017f73d0: type VCHR
    usecount 1, writecount 0, refcount 435 seqc users 0 rdev 0xfffff80001785000
    hold count flags ()
    flags ()
    v_object 0xfffff800017e4e70 ref 0 pages 12163 cleanbuf 432 dirtybuf 1
    lock type mntfs: EXCL by thread 0xfffffe00917dee40 (pid 16, syncer, tid 100092)
3 2 0 0 done
All buffers synced.


Then after the reboot, the system gets stuck after enabling the interfaces:
uart0: <8250 or 16450 or compatible> port 0x3f8-0x3ff irq 3 flags 0x10 on acpi0
hwpstate0: <Cool`n'Quiet 2.0> on cpu0
Timecounter "TSC" frequency 2096061312 Hz quality 1000
Timecounters tick every 1.000 msec
Trying to mount root from ufs:/dev/ada0p2 [rw]...
ugen0.1: <AMD XHCI root HUB> at usbus0
uhub0 on usbus0
uhub0: <AMD XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <TS256GMTS952T2 02J0T4GB> ACS-2 ATA SATA 3.x device
ada0: Serial Number G752440056
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 1024bytes)
ada0: Command Queueing enabled
ada0: 244198MB (500118192 512 byte sectors)
uhub0: 8 ports with 8 removable, self powered
ax1: Link is UP - 10Gbps/Full - flow control off
ax1: link state changed to UP
ax0: Link is UP - 10Gbps/Full - flow control off
ax0: link state changed to UP


After that, changes to the interfaces are still detected (showing "Link is UP" / "Link is DOWN" when plugging cables in / out), but the system never continues the boot. Both Safe Mode and Single User Mode get stuck at the same stage.

The only solution for me was to reinstall OPNsense 23.7 via USB. But as mentioned above, trying to upgrade the fresh install results in the same issue.

I'm not sure what more info I can provide to this issue. Please let me know if I can contribute logs or additional information.
As this is my primary firewall, testing upgrades is difficult (but not impossible if need be).

Any ideas what the issue is and how I can solve it?
Thanks already and kind regards

Jens
Title: Re: 23.7.12 upgrade bricks my DEC840
Post by: dmurphy on January 24, 2024, 03:49:52 AM
Are you using the "VGA" or "Serial" version of the OPNsense image to start with?

I had similar issues using the VGA image ... had to use serial, and then I didn't get those same hangs.

I already had a USB key made up for a different box with the VGA image .. didn't think twice about it, but made a tremendous difference.
Title: Re: 23.7.12 upgrade bricks my DEC840
Post by: jenix on January 24, 2024, 06:10:25 AM
Thanks for the reply. My initial installation was the one that the hardware came with (about 2 years ago), converted to community edition once the subscription ended. I then reinstalled with the serial version which gave me the observed error.
Title: Re: 23.7.12 upgrade bricks my DEC840
Post by: dmurphy on January 26, 2024, 07:58:15 PM
I wish I had a better suggestion.  The only thing I can consider is maybe to pull the SFPs out of ax0/ax1, and see if it finishes the boot.  Then possibly patch 23.7 to current if that works.

Worth a try?
Title: Re: 23.7.12 upgrade bricks my DEC840
Post by: jenix on January 27, 2024, 09:03:06 AM
Thanks, I already tried this, sadly without any luck. I don't believe the SFPs or NICs are the problem, but rather whatever OPNsense tries to initialize after them.
I suspect the dirty filesystem during the shutdown causes a corruption which prevents the system from correctly reading some files during boot and thus getting stuck. It does not crash (when I unplug the SFPs, this gets noted by the kernel and logged as "Link state changed to down"), but also does not continue to load the system.

Unfortunately, I'm not experienced enough to analyze the issue further. Is there a way to get more information about the boot process to figure out which file (if my assumption is correct) is corrupt? Is it possible to run fsck when single user mode won't boot either (e.g. from the USB install drive)? Can I update to a previous minor version (23.7.11)?

At this point, with 24.1 around the corner I suspect the best way forward is to wait for the new major version, make a fresh install and restore my config.
Title: Re: DEC840 won't boot after 23.7.12 upgrade
Post by: newsense on January 27, 2024, 08:43:33 PM
You installed 3 times, then "added more data to the disk / tried to upgrade" and had issues.

  - Whether you install 22.x, 23.1 or the upcoming 24.1 you'd be redoing the same steps above expecting a completely different result (?)

  - Installing 24.1 would actually confuse things more since there will be no readily available 24.1.x to upgrade to for a few weeks - so you'd be tempted to think the issue has been (auto)magically fixed



Most likely the SSD inside is dying, so you have two options to consider:

1) If still under warranty contact Deciso about the best path forward

2) Otherwise open the case and replace the drive
Title: Re: DEC840 won't boot after 23.7.12 upgrade
Post by: jenix on January 28, 2024, 09:13:24 AM
Thanks for the reply.
While a hardware defect with the disk is certainly possible, my assumption for now is that this is more of a software issue. What are the odds that the filesystem (which should handle disk corruption to a certain point) writes the same file to the same (corrupt) block 3 times in a row? Nevertheless, I contacted Deciso about warranty as my device is just shy of 2 years old.

As I said, I assume more of a software issue during the upgrade, like a faulty config migration. Or maybe a race condition which blocks access to a system config file and prevents it from being written correctly, so that the system fails to boot.
In this case, installing 24.1 would solve the issue as there would be no upgrade / migration steps which can fail. But without any debug information what is going on when the system hangs during boot, this is hard to tell.

Title: Re: DEC840 won't boot after 23.7.12 upgrade
Post by: jenix on February 01, 2024, 12:08:20 PM
After spending the morning testing different scenarios, I'm hoping that I have found some new information for my issue.

I got in touch with Deciso support (as my DEC840 is just barely still under warranty). They suggested updating the BIOS and trying again, which I did. Whilst my issue is still unsolved, I noticed the following behaviour:

My DEC boots fine with 23.7 (the major release version available to download). I can import my config and reboot the firewall without issue.
When I do a clean install with 24.1, the firewall also boots normally.

The issue arises when the firewall tries to boot with my production config, either after upgrading to 23.7.12 or after importing the config into a fresh 24.1 installation. Then, the boot process is extremely slow (e.g. configuring the routes takes 2-3 minutes instead of mere seconds), which initially led me to believe that my system froze. But given enough waiting time (the boot process takes around 30 minutes), the firewall manages to complete the boot process. Yet, I'm then still not able to log in and analyze further, as my HTTP or SSH access attempts time out. I once managed to log into the WebGUI while the firewall was still completing the interface configuration, but lost access shortly after.
This makes me wonder if there is an issue either with my configuration or the config migration during the update. It feels like the firewall gets so occupied with loading / migrating the configuration during boot that it struggles to run.

Now I'm at a loss as to how to proceed. I don't see a way to figure out which part of my configuration (if any) is responsible for my issues. Recreating my whole configuration from scratch in 24.1 seems pretty undesirable.
I already tried to skip non-essential configuration parts during import (e.g. IPS), but this resulted in no change. I also imported the 23.7 config into the fresh 24.1 install, instantly exported it again before reboot and looked at a diff to see what changed. But I can't see any obvious changes which hint at problematic config parts.
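The export-and-diff comparison described above can also be done per top-level section, which narrows down where an import touched the config. A minimal Python sketch, assuming both exports are well-formed XML; the function names and the file names in the usage comment are hypothetical and not part of OPNsense:

```python
import xml.etree.ElementTree as ET


def section_fingerprints(xml_text):
    """Map each top-level section tag to a canonical string of its contents."""
    root = ET.fromstring(xml_text)
    prints = {}
    for child in root:
        # Drop whitespace-only text so indentation changes do not count as differences.
        for node in child.iter():
            if node.text is not None and not node.text.strip():
                node.text = None
            if node.tail is not None and not node.tail.strip():
                node.tail = None
        # Sections that appear more than once are concatenated in document order.
        prints[child.tag] = prints.get(child.tag, "") + ET.tostring(child, encoding="unicode")
    return prints


def changed_sections(old_xml, new_xml):
    """Return the sorted tags of top-level sections that differ, or exist in only one export."""
    old = section_fingerprints(old_xml)
    new = section_fingerprints(new_xml)
    return sorted(tag for tag in old.keys() | new.keys() if old.get(tag) != new.get(tag))


# Usage (hypothetical file names):
#   with open("config-23.7.xml") as a, open("config-24.1.xml") as b:
#       print(changed_sections(a.read(), b.read()))
```

This only reports which sections changed, not what changed inside them; a regular text diff of the reported sections is still needed for the details.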

Does anyone have an idea how I can continue analyzing this issue to find out what causes it?

Thanks already.
Title: Re: DEC840: Extremly slow boot after 23.7.12 / 24.1 upgrade
Post by: newsense on February 01, 2024, 01:47:42 PM
That slowness may be DNS related, so a race condition when services need to be up but depend on DNS and DNS is not up yet. This should be easily fixable.
Title: Re: DEC840: Extremly slow boot after 23.7.12 / 24.1 upgrade
Post by: jenix on February 03, 2024, 09:13:04 AM
Thanks for getting back to me. I'm not sure if DNS is required at that early stage. But I did enable Unbound (usually I'm using a separate DNS server in my network) to make sure. It did not make a difference.

After some more testing, I can narrow the issue down somewhat:
- Everything works fine up to 23.7.11. I can upgrade to this release without issues
- Starting with 23.7.12 (both the initial release as well as the hotfixes) and 24.1, the firewall gets extremely slow as soon as it loads my config during boot (starting with the ">>> Invoking early script 'upgrade'" output on console).
- I can reproduce this behaviour consistently when upgrading from 23.7 (starting after the reboot following the upgrade), restoring my config to a fresh install of 24.1 (starting after the reboot) or loading my config while booting 24.1 from a usb drive.

As mentioned above, as soon as 23.7.12 / 24.1 loads my config, everything gets extremely slow (steps taking minutes instead of usually seconds), leading to a boot time of around 30 minutes (instead of roughly 1 minute).
When completing the boot, the firewall is not accessible via WebGUI or SSH, probably as the services are occupied (or even crashed).

Booting the firewall with the serial console attached does not show any conclusive errors. I do see errors like "Generating configuration: error in configd communication %s, see syslog for details" sometimes. But I believe them to be symptoms of the system being occupied as they do not occur on every boot.

Is there any way to increase the console output during boot? I'd like to figure out what is happening and ideally what I need to fix in my config to get 24.1 working.

Title: Re: DEC840: Extremly slow boot after 23.7.12 / 24.1 upgrade
Post by: newsense on February 03, 2024, 10:30:55 AM
With the FW up and running a screenshot might help understand what is going on.


There are two possible options:

If you have mimugmail repo, install htop - has no other dependencies iirc - and post a screenshot.

Else run top, press a, post screenshot.
Title: Re: DEC840: Extremly slow boot after 23.7.12 / 24.1 upgrade
Post by: jenix on February 10, 2024, 10:38:32 AM
I tried to debug the issue again, but was unsuccessful. The firewall simply never booted to a point where I could access the device (neither via serial console, nor via ssh or webGUI). The serial console never got to the login prompt, stopping at the fingerprint of the HTTP / HTTPS access. This may be due to a bug where the "USB-based serial" option gets reenabled after importing a config (I suspect it is not saved in the config as I could not find it and thus gets enabled during import) which disables console logins on the DEC840. SSH and webGUI simply do not respond, leading to timeouts when trying to access them.

Having wasted days trying to figure out what is going on, I gave up on further troubleshooting. I did a fresh install with 24.1, pulled a clean config export from it, copied over the most important settings from my 23.7 config file (interfaces, aliases, firewall rules, DHCP configuration, IPsec) and (more or less) successfully imported it on the clean install. Now my firewall is back up and running with 24.1, although it did not import my IPsec settings and won't recognize them after reconfiguring them. I will create a new thread for this issue.
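Copying selected settings from an old export into a clean one can be scripted rather than done by hand. A minimal Python sketch under stated assumptions: the sections are direct children of the root element, and the tag names in the usage comment are placeholders only (the real names have to be looked up in your own export). `copy_sections` is a hypothetical helper, not an OPNsense tool:

```python
import copy
import xml.etree.ElementTree as ET


def copy_sections(clean_xml, old_xml, tags):
    """Return clean_xml with the named top-level sections replaced by those from old_xml."""
    clean_root = ET.fromstring(clean_xml)
    old_root = ET.fromstring(old_xml)
    for tag in tags:
        replacements = old_root.findall(tag)
        if not replacements:
            continue  # old config lacks this section: keep the clean defaults
        existing = clean_root.findall(tag)
        # Reuse the position of the first existing section, otherwise append at the end.
        position = list(clean_root).index(existing[0]) if existing else len(clean_root)
        for node in existing:
            clean_root.remove(node)
        for offset, node in enumerate(replacements):
            clean_root.insert(position + offset, copy.deepcopy(node))
    return ET.tostring(clean_root, encoding="unicode")


# Usage (placeholder tag names, add one section at a time and test a reboot in between):
#   merged = copy_sections(clean_export, old_export, ["interfaces"])
```

Adding one section per attempt and rebooting in between makes it easier to see which part of the old config triggers the slow boot.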

After multiple attempts at importing different parts of my old config, I suspect that my IPS / Suricata config may have been the culprit behind my issues. When trying to restore the <IPS> block of my config, my firewall started to act up again during boot. But I didn't test this further to get conclusive proof for that suspicion.

Anyhow, for me this problem is solved.
Title: Re: DEC840: Extremly slow boot after 23.7.12 / 24.1 upgrade
Post by: jenix on March 10, 2024, 10:13:44 AM
I want to give another brief update about my findings, hoping someone might find them useful.

First of all, it turns out that at least some of my issue may indeed have come from a defective disk. My SSD now died completely (the infamous 'Solaris: WARNING: Pool 'zroot' has encountered an uncorrectable I/O failure and has been suspended.' error causing OPNsense to freeze, followed by I/O timeouts and device losses during a reinstall), so some file corruption seems possible.

Furthermore (as is usually the case), I guess there were multiple issues coming together:

- No USB serial access after config import: This seems to be a bug with the "Use USB-based serial ports" setting (in System -> Settings -> Administration). The setting is disabled during the install and after the first boot of the fresh install. If I import my config (from a system where the setting is disabled as well), it gets enabled. This results in the console not being available via USB after the reboot. To solve this, you can either disable the "Exclude console settings from import" setting during import, or disable the auto reboot after import, go to the settings page and apply the settings again.

- No access via SSH / HTTP after config restore: This was an issue with my Suricata config. I had enabled IPS on my LAN interface (after all, you want to detect suspicious activity inside your network as well, right?). After the import / reboot, Suricata had some errors and blocked all access to OPNsense. This was difficult to identify, as some access through the firewall was still possible (pinging the firewall and hosts in different networks worked, DNS resolution worked, but HTTP or SSH access to the firewall or beyond did not). To solve this, you can access the shell via the console and kill Suricata ('killall -TERM suricata').
If you are having similar issues after your upgrade / config import, I suggest temporarily disabling IDS/IPS as a test.
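Since ping and DNS still worked while HTTP and SSH timed out, a quick TCP probe from a LAN client helps tell a dropped connection apart from a service that is simply not running. A minimal Python sketch; `probe_tcp` is a hypothetical helper, and the address in the usage comment is a documentation placeholder for the firewall's LAN address:

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout' or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"  # host answered, but nothing is listening on that port
    except socket.timeout:
        return "timeout"  # no answer at all, e.g. packets dropped on the way
    except OSError:
        return "error"  # anything else, such as no route to host


# Usage (placeholder address):
#   for port in (22, 443):
#       print(port, probe_tcp("192.0.2.1", port))
```

A "timeout" on ports that should be open, while ping still works, would fit the picture of something in the packet path dropping the traffic; "refused" points more towards the service itself not running.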