Opnsense randomly (?) crashes

Started by meikel, June 02, 2026, 09:06:59 AM

Quote from: Nullman on June 03, 2026, 08:02:35 PM
It's right there in his signature.
I see signatures of users but his is empty ?!

Quote from: meikel on June 03, 2026, 08:10:22 PM
I also have to correct myself: the SSD is not running in raid1.

It's a single SSD but as a live OS is also crashing I assume it's not the SSD.
I also doubt that a faulty SSD crashes the whole machine.
Sometimes live CD/ISO images try to mount local storage automatically, and if that fails in a certain way it can cause issues too, so I would check that again just to be sure !!

- Which model is it ?
- Are there firmware updates available ?
- What does smartctl/nvmecontrol say ?
- Are there known issues with it in certain scenarios ?
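For the smartctl question, here is a minimal Python sketch of how the output could be checked by a script instead of by eye. It assumes smartmontools 7.0 or newer (for the `--json` option) and a SATA drive that reports the usual ATA attribute table. The watched attribute IDs and the device name `/dev/ada0` are my own illustrative choices, not something taken from this thread.

```python
import json
import subprocess

# Attribute IDs commonly watched on SATA drives (illustrative choice, not an official list).
WATCHED_IDS = {5, 197, 198}  # reallocated, pending, offline-uncorrectable sectors


def find_problems(report: dict) -> list[str]:
    """Return human-readable warnings for a parsed `smartctl --json -a` report."""
    problems = []
    # Overall health self-assessment; the key is absent if the drive does not report it.
    if report.get("smart_status", {}).get("passed") is False:
        problems.append("overall SMART health check FAILED")
    for attr in report.get("ata_smart_attributes", {}).get("table", []):
        name = attr.get("name", f"id {attr.get('id')}")
        # `when_failed` is non-empty once the normalized value has crossed its threshold.
        if attr.get("when_failed"):
            problems.append(f"{name}: failed ({attr['when_failed']})")
        elif attr.get("id") in WATCHED_IDS and attr.get("raw", {}).get("value", 0) > 0:
            problems.append(f"{name}: raw value {attr['raw']['value']}")
    return problems


def read_report(device: str) -> dict:
    """Run smartctl and parse its JSON output (usually needs root)."""
    # No check=True: smartctl uses its exit status as a bit mask even when it prints a report.
    result = subprocess.run(
        ["smartctl", "--json", "-a", device], capture_output=True, text=True
    )
    return json.loads(result.stdout)


# Example (device name is an assumption):
#   for line in find_problems(read_report("/dev/ada0")):
#       print(line)
```

An empty list only means that none of these particular checks fired, so it is no guarantee that the drive is healthy.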

And indeed : Memory testing should be done for at least 24 hours !!
Weird guy who likes everything Linux and *BSD on PC/Laptop/Tablet/Mobile and funny little ARM based boards :)

Today at 08:31:06 AM #16 Last Edit: Today at 08:35:38 AM by meikel Reason: added memtest info
The memtest was fine:




So I think I found the culprit:


Well I guess it was the SSD all along. I didn't think a system drive failing would have such symptoms.

Sadly the story didn't end here:

* I tried to clone the drive onto a new drive, which took way too long and didn't work in the end. The OS booted but couldn't find some files (don't ask me what exactly).
* So I took another road: Back up OPNsense via the UI and import it into a freshly installed version. Should be straightforward, right? Wrong.
* I never bothered migrating away from "deprecated" features as long as they worked, or until I got the message "This will no longer work in the next version". It's just my homelab after all, if it's working it's good enough.
* Today I learned ISC DHCP is so hard deprecated that it is not even backed up when you create a backup. It backs up the config values but does not back up ISC itself (now a plugin), from what I could tell.
* I booted the original drive (hoping it would survive as long as necessary; spoiler: it did not, I had to restart multiple times) and migrated everything to Kea according to this guide
* Now everything is working as before and I hope the device survives longer this time.

Afterthought: Shouldn't OPNsense warn me about bad SMART values? Is there any way to enable this?

Thanks to everyone for helping me, bringing ideas to the table, and giving your valuable time.

There is a SMART plugin you can install to run tests directly from OPNsense.
It comes with a widget to show the health of the latest test.

Regards,
S.
Networking is love. You may hate it, but in the end, you always come back to it.

OPNSense HW
N355 - i226-V | AQC113C | 16G | 500G - PROD

PRXMX
N5105 - i226-V | 2x8G | 512G - NODE #1
N100 - i226-V | 16G | 1T - NODE #2

Hah! Called the failing SSD!

Quote from: meikel on Today at 08:31:06 AM
* Today I learned ISC DHCP is so hard deprecated that it is not even backed up when you create a backup. It backs up the config values but does not back up ISC itself (now a plugin), from what I could tell.

I just checked a backup that I made last week, when I still had old (disabled) ISC DHCP config left behind after transitioning to Kea. That config is included in the backup XML, under the sections "dhcpd" and "dhcpdv6".

AFAIK, you do need to install the plugin (on a fresh installation) before you can restore the config for it, though.
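If you want to verify this on your own backup without reading the XML by hand, a small Python sketch like the following could do it. It assumes an unencrypted backup file and that "dhcpd" and "dhcpdv6" are direct children of the root element. The function name is just for illustration.

```python
import xml.etree.ElementTree as ET


def legacy_dhcp_sections(config_xml: str) -> dict[str, bool]:
    """Report whether the legacy ISC DHCP sections hold any settings."""
    root = ET.fromstring(config_xml)
    found = {}
    for tag in ("dhcpd", "dhcpdv6"):
        section = root.find(tag)  # direct child of the root element only
        # Count a section as present only if it has child elements;
        # an empty <dhcpd/> carries no settings.
        found[tag] = section is not None and len(section) > 0
    return found
```

Usage would be something like `legacy_dhcp_sections(Path("config-backup.xml").read_text())`, where the file name is a placeholder for whatever you downloaded from the backup page.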

That Sophos UTM firewall appliance is at least 10 years old, where did you buy it?

I'll assume the Intel SSD was new, as it only ran for 479 days and was turned on 69 times.

What's the state of the other Intel SSD in the RAID1 configuration?

Quote from: vpx on Today at 03:05:18 PM
That Sophos UTM firewall appliance is at least 10 years old, where did you buy it?
I see no problem with that as long as it works. It's a very solid machine built with high-quality components.

Quote from: vpx on Today at 03:05:18 PM
What's the state of the other Intel SSD in the RAID1 configuration?
There is no other drive and there is no RAID. He has a single SSD.

Quote from: meikel on Today at 08:31:06 AM
The memtest was fine
I am a fan of 24 hour tests, but OK :)

Quote:
So I think I found the culprit:


Well I guess it was the SSD all along. I didn't think a system drive failing would have such symptoms.
Told you! :)

It seems to be one of those models that I call "Fake Intel SSD" : https://duckduckgo.com/?q=ssdsc2bw180h6&ia=web
No wonder it failed despite being an MLC model : They had no additional spare space like the Intel Enterprise models had at the time :)
(Don't get me started on everything else that's wrong with them... LOL!)

Quote:
Sadly the story didn't end here:

* I tried to clone the drive onto a new drive, which took way too long and didn't work in the end. The OS booted but couldn't find some files (don't ask me what exactly).
That's to be expected : Defective drives are very often impossible to clone!

Quote:
* So I took another road: Back up OPNsense via the UI and import it into a freshly installed version. Should be straightforward, right? Wrong.
I am surprised you got that far considering the issues!
You would also risk having a corrupt config.xml that way IMO.

Quote:
* Today I learned ISC DHCP is so hard deprecated that it is not even backed up when you create a backup.
It backs up the config values but does not back up ISC itself (now a plugin), from what I could tell.
If you are talking about 'Static DHCP Mappings based on the MAC Address', then you need to export those into .CSV files via the web GUI and import them into Kea or Dnsmasq ;)
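If the web GUI is not reachable (for example because the old drive no longer boots), the same mappings could in principle be pulled out of a config backup with a short script. This is only a sketch. It assumes the legacy layout of one element per interface under "dhcpd", each holding "staticmap" entries with "mac", "ipaddr", "hostname" and "descr" children. The CSV column names are made up for illustration and would need adjusting to whatever the import on the target side expects.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ("mac", "ipaddr", "hostname", "descr")  # assumed child element names


def static_mappings_csv(config_xml: str) -> str:
    """Collect every <staticmap> under <dhcpd> into CSV text, one row per mapping."""
    root = ET.fromstring(config_xml)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(("interface",) + FIELDS)  # illustrative header row
    dhcpd = root.find("dhcpd")
    if dhcpd is not None:
        for iface in dhcpd:  # one child element per interface, e.g. <lan>
            for entry in iface.findall("staticmap"):
                # Missing or empty fields become empty CSV cells.
                row = [entry.findtext(name, default="") or "" for name in FIELDS]
                writer.writerow([iface.tag] + row)
    return out.getvalue()
```

Check the result against what the GUI shows before importing it anywhere, since a config saved from a failing drive may itself be damaged.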

Quote:
* Now everything is working as before and I hope the device survives longer this time
What's the storage this time ?

Quote:
So I think I found the culprit:

These results are quite confusing. The S.M.A.R.T. parameters clearly indicate that the SSD is pretty much dead. The confusing part is that the machine still crashed while running Debian live. This indicates another issue besides the dead drive.

@nero355

It's a Samsung 850 EVO 250 GB. It's used, but if this issue happens again I'm fine with downtime. I may move to a raid1 at some point, but it's not worth the hassle right now as there is no dedicated space for another SSD.

Once I knew what the issue was (very easy to find out if you don't blindfold yourself), it was quite easy to fix with a backup. Except for the DHCP migration part.