My DEC appliances fail to reboot every once in a while?

Started by Patrick M. Hausen, July 28, 2024, 09:33:39 PM

July 28, 2024, 09:33:39 PM Last Edit: August 02, 2024, 08:09:47 PM by Patrick M. Hausen
Hi all,

today I upgraded one HA cluster built with two DEC3860 units from 21.7 (yeah, I know ...) all the way to 24.7.

The updates themselves were unspectacular and reliable. Just one release at a time, standby first, primary second.
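
For readers planning a similar multi-release jump, the order described above (one release at a time, standby first, primary second) can be sketched as a small plan generator. This is only an illustrative sketch: the function names and the node labels are hypothetical, and it assumes major releases follow the usual YY.1 / YY.7 cadence.

```python
def release_path(start, target):
    """List every major release from start to target, inclusive.

    Assumes the YY.1 / YY.7 release cadence (assumption, not taken
    from any OPNsense tooling).
    """
    def parse(version):
        year, month = (int(part) for part in version.split("."))
        if month not in (1, 7):
            raise ValueError(f"not a YY.1 or YY.7 release: {version}")
        return year, month

    year, month = parse(start)
    end = parse(target)
    if (year, month) > end:
        raise ValueError("start release is newer than target release")

    path = []
    while (year, month) <= end:
        path.append(f"{year}.{month}")
        # Step from YY.1 to YY.7, or from YY.7 to (YY+1).1
        if month == 1:
            month = 7
        else:
            year, month = year + 1, 1
    return path


def upgrade_plan(start, target, nodes=("standby", "primary")):
    """Return (node, release) steps: each release on every node in order."""
    steps = []
    for release in release_path(start, target)[1:]:
        for node in nodes:
            steps.append((node, release))
    return steps


if __name__ == "__main__":
    for node, release in upgrade_plan("21.7", "24.7"):
        print(f"upgrade {node} to {release}")
```

Going from 21.7 to 24.7 this way yields six upgrade rounds, so twelve software-initiated reboots across the two units.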

But ...

I initially planned to do this last Sunday. When going from 22.1 to 22.7 the standby unit did not come back up.
So today I drove to Frankfurt, power cycled the standby - which completed the update perfectly - and went on.

About 1/3 of all the software-initiated reboots failed. I connected serial consoles and the systems seem to hang somewhere between the EFI bootloader and the FreeBSD loader.efi, before it goes on to load the kernel and modules. A power cycle always fixes the problem.

I also observed "reboot not working" with our office DEC690 unit. There is no primary/standby setup in that location so we just pulled the plug. This time, too, the update finished and the firewall booted perfectly fine.


*phew* this is a bit of a downer, really. The majority of the firewalls I manage are an hour or two's drive away. If I cannot dare to install updates because the system might fail to boot, well ...

Any idea what is happening here? Can I expect this to be fixed now with FreeBSD 14.1? Are there any firmware updates that would help?

I would hate it if this continued to be a lottery, because I really like the Deciso appliances. Insane build quality and price/performance ratio. But I would seriously consider going back to Supermicro instead, where I have full IPMI and can power cycle the systems from afar. At least everywhere where I have an HA setup.

Serial console is nice, but without power control, remote media etc. there can always be a situation where I have to drive which I would rather avoid.

Kind regards,
Patrick
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

Hi Patrick,

Sorry to hear you're having issues, but given the time frame (21.7 to 24.7) it's difficult to predict which types of issues you might run into.

One of the things that might explain the weird EFI behavior is an earlier change in FreeBSD which required a setting in the BIOS, which is documented here: https://docs.opnsense.org/hardware/serial_connectivity.html#legacy-uart-vs-uefi-serial

One other topic that was problematic in older versions was the automatic restore of items such as Network Insight, which sometimes didn't complete at all or just took so long that people assumed the box had died in between. A reboot fixes this, but is annoying.


Best regards,

Ad

Quote from: AdSchellevis on July 29, 2024, 10:26:27 AM
One of the things that might explain the weird EFI behavior is an earlier change in FreeBSD which required a setting in the BIOS, which is documented here: https://docs.opnsense.org/hardware/serial_connectivity.html#legacy-uart-vs-uefi-serial

I was aware of that and I am pretty sure the appliances are already configured that way, but will make sure to check and report back.

Please note that our DEC695 which I always keep on a current release also occasionally fails to properly reboot.

And no, it's not a long running process. Perform maintenance in the evening, system doesn't come back up. Early morning (7 am) the next day, office still offline, power cycle, all good.

Thanks for your assistance, I'll check the BIOS settings of the DEC3860 units in the evening. Will need to reboot them - keeping fingers crossed :)

> Thanks for your assistance, I'll check the BIOS settings of the DEC3860 units in the evening. Will need to reboot them - keeping fingers crossed :)

The CMOS battery may be empty/faulty. I've had one device that was resetting to defaults which also makes the console disappear. Replacing the battery helped fix this.


Cheers,
Franco

It's not the console disappearing.

Reboot device remotely, boot process hangs. If I connect a console after the fact, nothing is visible, no reaction. If I have a console connected while rebooting it shows the beginning of the kernel load process and the ASCII animated spinner but in a frozen state. Power cycle fixes it.

Console is working perfectly fine otherwise.
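
For anyone who wants to catch such a hang without waiting until the next morning, a reboot can be followed by a simple reachability check. The following is a minimal sketch, not a fix: it only reports whether a TCP port on the firewall answers again within a deadline. The function name, the host name and the choice of port 443 are assumptions for illustration.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=600.0, interval_s=5.0,
                  connect_timeout_s=3.0):
    """Return True once a TCP connect to host:port succeeds.

    Returns False if no connect succeeds before timeout_s has elapsed.
    At least one connection attempt is always made.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means the service is accepting
            # connections again; close the socket right away.
            with socket.create_connection((host, port),
                                          timeout=connect_timeout_s):
                return True
        except OSError:
            # Refused, unreachable or timed out: treat as "not up yet".
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    # Hypothetical host name; port 443 assumes the web GUI listens there.
    if wait_for_port("fw-standby.example.net", 443, timeout_s=900):
        print("firewall is back")
    else:
        print("firewall did not come back, boot may be hanging")
```

Note that right after issuing the reboot the port may still answer until the system actually goes down, so the check should only start once the service has become unreachable. Recovering from a detected hang still needs a power cycle, for example by someone on site.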

> Reboot device remotely, boot process hangs. If I connect a console after the fact, nothing is visible, no reaction. If I have a console connected while rebooting it shows the beginning of the kernel load process and the ASCII animated spinner but in a frozen state.

That's odd because this is exactly the problem when the UART settings are wrong in the BIOS. Did this occur with 24.1 previously or is it new on 24.7? Any upgrade-related reboot will be on the 24.7 kernel already.


Cheers,
Franco

Quote from: franco on July 29, 2024, 11:31:34 AM
That's odd because this is exactly the problem when the UART settings are wrong in the BIOS. Did this occur with 24.1 previously or is it new on 24.7? Any upgrade-related reboot will be on the 24.7 kernel already.

I'll check the setting and report back. After office hours. Might have missed a unit or two when I first read about the change in the release notes.

@AdSchellevis, @franco, apologies, the first system in question had the legacy option set to 0x3F8.
I'll check all of them in the next few days, of course. Thanks for your help.

Should any system misbehave in the future, of course expect me to come back  8)

OK ... our DEC690 unit at the office seems to use coreboot instead of a traditional BIOS. This is all I get when I invoke the "nvramcui":

┌─coreboot configuration utility───────────────────────────────────────────────┐
│                                                                              │
│┌─Press F1 when done─────────────────────────────────────────────────┐        │
││ baud_rate               115200                                     │        │
││                                                                    │        │
││ interleave_chip_selects Disable                                    │        │
││                                                                    │        │
││ power_on_after_fail     Disable                                    │        │
││                                                                    │        │
││ debug_level             Emergency                                  │        │
││                                                                    │        │
││ nmi                     Disable                                    │        │
││                                                                    │        │
││ iommu                   Disable                                    │        │
││                                                                    │        │
││ ECC_memory              Disable                                    │        │
││                                                                    │        │
││ user_data               0                                          │        │
││                                                                    │        │
│└────────────────────────────────────────────────────────────────────┘        │
│                                                                              │
│                                                                              │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘


Anything to change here? This unit definitely hung a couple of times in the past year when initiating a reboot via software. Even waiting overnight did not let it finish. Power cycle - up and working again.

Any ideas?
Patrick

Hi Patrick,

The DEC[2]6XX range indeed uses a BIOS and no EFI payload; it doesn't require changes for serial console access.

On older versions (or older settings), the shutdown backup+restore hook made upgrades hang or take an awfully long time, but I think I already mentioned that.

When using functionality like Network Insight with somewhat larger databases, you likely don't want to enable the tarballs being created on shutdown and reimported on startup. You can check your settings in "System: Settings: Miscellaneous".

Best regards,

Ad