Solved: 24.7.8 rebooting every few hours

Started by zinge, November 10, 2024, 12:28:20 AM

November 10, 2024, 12:28:20 AM Last Edit: November 14, 2024, 07:38:57 PM by zinge
As of last night, when I updated to 24.7.8, my OPNsense machine has been rebooting itself every few hours with no errors or issues that I can find. Prior to the update, it had almost a month of uptime, with the last reboot being October 15. I've checked the various logs in the UI and don't see any obvious entries at level "Error" or higher that could explain the issue.

Any suggestions on what the next step is to figure out what might be causing the issue?

Hardware is a VP4630 Protectli Vault w/ 16GB RAM. I don't expect this to be a hardware issue, as it's been working untouched for several months and the intermittent reboots only started after the upgrade to 24.7.8 last night.

These are the only extra plugins I have installed:
os-adguardhome-maxit
os-ddclient
os-mdns-repeater
os-speedtest-community
os-udpbroadcastrelay

Here's an example of what the General log looks like before a reboot:
   
2024-11-09T22:34:54-08:00   Notice   kernel   Copyright (c) 1992-2023 The FreeBSD Project.   
2024-11-09T22:34:54-08:00   Notice   kernel   ---<<BOOT>>---   
2024-11-09T22:34:53-08:00   Notice   syslog-ng   syslog-ng starting up; version='4.8.1'   
2024-11-09T22:29:54-08:00   Notice   dhclient   dhclient-script: Creating resolv.conf   
2024-11-09T22:29:54-08:00   Notice   dhclient   dhclient-script: Reason RENEW on igc0 executing   
2024-11-09T22:29:54-08:00   Error   dhclient   unknown dhcp option value 0x7d   
2024-11-09T22:24:54-08:00   Notice   dhclient   dhclient-script: Creating resolv.conf   
2024-11-09T22:24:54-08:00   Notice   dhclient   dhclient-script: Reason RENEW on igc0 executing   
2024-11-09T22:24:54-08:00   Error   dhclient   unknown dhcp option value 0x7d   
2024-11-09T22:19:54-08:00   Notice   dhclient   dhclient-script: Creating resolv.conf   
2024-11-09T22:19:54-08:00   Notice   dhclient   dhclient-script: Reason RENEW on igc0 executing   
2024-11-09T22:19:54-08:00   Error   dhclient   unknown dhcp option value 0x7d   
2024-11-09T22:14:54-08:00   Notice   dhclient   dhclient-script: Creating resolv.conf   
2024-11-09T22:14:54-08:00   Notice   dhclient   dhclient-script: Reason RENEW on igc0 executing   
2024-11-09T22:14:54-08:00   Error   dhclient   unknown dhcp option value 0x7d   
2024-11-09T22:09:54-08:00   Notice   dhclient   dhclient-script: Creating resolv.conf   
2024-11-09T22:09:54-08:00   Notice   dhclient   dhclient-script: Reason RENEW on igc0 executing   
2024-11-09T22:09:54-08:00   Error   dhclient   unknown dhcp option value 0x7d   
2024-11-09T22:04:54-08:00   Notice   dhclient   dhclient-script: Creating resolv.conf   
2024-11-09T22:04:54-08:00   Notice   dhclient   dhclient-script: Reason RENEW on igc0 executing
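
For anyone wanting to check their own logs the same way, here is a rough Python sketch that scans a pasted General log for the `---<<BOOT>>---` marker and reports the last entry logged before it. The column layout (timestamp, severity, process, message) is taken from the excerpt above. The function names and the rule of skipping the syslog-ng startup line are my own assumptions, not anything OPNsense provides.

```python
import re
from datetime import datetime

# Columns as they appear in the pasted General log:
# timestamp, severity, process, message.
LINE_RE = re.compile(r"^(\S+)\s+(\w+)\s+(\S+)\s+(.*)$")
BOOT_MARKER = "---<<BOOT>>---"


def parse_log(text):
    """Return (time, severity, process, message) tuples, oldest first."""
    entries = []
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue
        stamp, severity, process, message = match.groups()
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # first column is not a timestamp, so not a log line
        entries.append((when, severity, process, message))
    return sorted(entries, key=lambda entry: entry[0])


def quiet_gaps_before_boot(entries):
    """For each boot marker, return (boot_time, last_prior_entry, gap_seconds)."""
    results = []
    for when, _severity, _process, message in entries:
        if BOOT_MARKER not in message:
            continue
        # Assumption: the syslog-ng startup line belongs to the new boot,
        # so it is not counted as activity from before the reset.
        prior = [
            e for e in entries
            if e[0] < when
            and not (e[2] == "syslog-ng" and "starting up" in e[3])
        ]
        if prior:
            last = prior[-1]
            results.append((when, last, (when - last[0]).total_seconds()))
    return results


SAMPLE = """\
2024-11-09T22:34:54-08:00   Notice   kernel   Copyright (c) 1992-2023 The FreeBSD Project.
2024-11-09T22:34:54-08:00   Notice   kernel   ---<<BOOT>>---
2024-11-09T22:34:53-08:00   Notice   syslog-ng   syslog-ng starting up; version='4.8.1'
2024-11-09T22:29:54-08:00   Notice   dhclient   dhclient-script: Creating resolv.conf
2024-11-09T22:29:54-08:00   Error   dhclient   unknown dhcp option value 0x7d
2024-11-09T22:24:54-08:00   Notice   dhclient   dhclient-script: Reason RENEW on igc0 executing
"""

for boot, last, gap in quiet_gaps_before_boot(parse_log(SAMPLE)):
    print(f"boot at {boot.isoformat()}: last entry {gap:.0f}s earlier from {last[2]}")
```

On the excerpt above, the last entry before the boot marker is a routine dhclient renewal five minutes earlier, with nothing logged in between.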

Quote from: zinge on November 10, 2024, 12:28:20 AM
Hardware is a VP4630 Protectli Vault w/ 16GB RAM.

Since most Protectli boxes have a serial console port, maybe it's worth connecting to the serial console to see if there is a kernel panic or some other error on the console prior to reboot?

Unfortunately, this model's COM port is apparently not compatible with macOS, but I'll see if I have a Windows PC lying around or can get it running in Parallels to try that:

"VP4600 Series Guide
Both of these series of Vault Pro models come with a Micro USB Serial Console Cable. You cannot use COM output with these units if your connected computer is running MacOS."

The machine was completely unresponsive and the network was down this morning, and it was very hot like it locked up on something using 100% CPU. I power cycled it and have a Windows laptop hooked up displaying the console. Hopefully I'll see some kind of error message next time it dies.

And just to confirm, there is nothing in any of the logs in the OPNsense UI between the hang and the power cycle, which was about 9 hours.

Just did it again. Hard locked up, network went down, booting in the console. Had to power cycle to get it back.

Health audit output (I believe the AdGuardHome checksum mismatch is because the repo has an older version and I used the UI to update it after installing):

***GOT REQUEST TO AUDIT HEALTH***
Currently running OPNsense 24.7.8 at Sun Nov 10 14:56:35 PST 2024
>>> Root file system: zroot/ROOT/default
>>> Check installed kernel version
Version 24.7.8 is correct.
>>> Check for missing or altered kernel files
No problems detected.
>>> Check installed base version
Version 24.7.8 is correct.
>>> Check for missing or altered base files
No problems detected.
>>> Check installed repositories
mimugmail (Priority: 5)
OPNsense (Priority: 11)
>>> Check installed plugins
os-adguardhome-maxit 1.12
os-ddclient 1.25
os-mdns-repeater 1.1_1
os-speedtest-community 0.9_5
os-udpbroadcastrelay 1.0_5
>>> Check locked packages
No locks found.
>>> Check for missing package dependencies
Checking all packages: .......... done
>>> Check for missing or altered package files
Checking all packages: ....
os-adguardhome-maxit-1.12: checksum mismatch for /usr/local/AdGuardHome/AdGuardHome
Checking all packages......... done
>>> Check for core packages consistency
Core package "opnsense" at 24.7.8 has 69 dependencies to check.
Checking packages: ...................................................................... done
***DONE***

Any files in /var/crash?

ls -l /var/crash


If you leave an SSH session open, do you see a crash message when it reboots?
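
If you'd rather script the /var/crash check, for example to run it after every unexpected boot, something like the following Python sketch lists whatever is in the directory, newest first. The function name is made up for illustration. Skipping `minfree` assumes the stock FreeBSD layout, where that file is a savecore setting rather than a crash dump.

```python
from pathlib import Path


def crash_artifacts(crash_dir="/var/crash"):
    """Return (name, size_bytes) for files in crash_dir, newest first."""
    root = Path(crash_dir)
    if not root.is_dir():
        return []
    # Assumption: "minfree" is a savecore configuration file that exists
    # even when nothing has crashed, so it is skipped here.
    files = [p for p in root.iterdir() if p.is_file() and p.name != "minfree"]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return [(p.name, p.stat().st_size) for p in files]


if __name__ == "__main__":
    found = crash_artifacts()
    if not found:
        print("no crash files found")
    for name, size in found:
        print(f"{name}\t{size} bytes")
```

An empty result after an unexpected reboot would mean no dump was written, which would be consistent with a hard lockup or power-level reset rather than a kernel panic. That alone does not prove it, though.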

I swapped in a new hard drive, did a fresh install, restored my config, and so far I'm at 6hrs uptime. Crossing my fingers that either it was a bad drive or the fresh install fixed it, but if it happens again I'll check /var/crash. I didn't try leaving an ssh session open, but I did have the console open on another laptop via micro USB and didn't get any messages. I can try switching that to an SSH session as well if it keeps happening.

There were Intel NIC driver changes in 24.7.8 - if you have a saved boot environment, you could try rolling back.

When you say saved boot environment, do you mean the snapshot button under the system options, or something else? I know about the snapshots now and will be using them before the next update, but the only backup I knew how to do before the last update was downloading a copy of my config.

I think my machine has enough resources to run Proxmox, so I may eventually look at virtualizing OPNsense to take advantage of VM snapshots and backups to external storage.

So far at 16hrs uptime 🤞

November 13, 2024, 06:20:12 PM #12 Last Edit: November 13, 2024, 06:28:11 PM by muldini
I'm running a Protectli VP4650 with coreboot and also experience(d) random reboots. I don't believe this has anything to do with OPNsense as I've seen the same with other operating systems.

Even had a replacement unit from Protectli with the same issues. By chance I figured out that the device only crashes after a warm reboot, not after a complete shutdown and power-on. It can be reproduced reliably.

An actual halt/shutdown and subsequent power on will leave the device running flawlessly. Current uptime is 3 months+.
Rebooting the device however will have it randomly crash/reboot without any logs/crash dumps.

No further debugging was done so far, but replacing the BIOS is still on my to-do list. Hope this helps.

Interesting... I'm not sure if that's my issue, or if reinstalling OPNsense and replacing the HD was the fix, but either way this seems to be solved for now. 4 days of uptime without issues since the reinstall and new drive.

About two hours ago, my OPNsense restarted for no reason; attached is the log.