OPNsense Forum

Archive => 19.7 Legacy Series => Topic started by: sporkman on December 21, 2019, 01:56:18 am

Title: kernel panics through multiple releases
Post by: sporkman on December 21, 2019, 01:56:18 am
Hi all - I was pretty happy with opnsense as a concept. Coming from pfsense, I was a bit saddened by how that project has changed over the years, especially the move to require AES-NI CPU support in the future (which they seem to have backed off from). So opnsense looked like a good option, and the fact that you've already started the process of "cleaning house" on old code was a big deal to me.

That said, last week I moved back to pfsense. It became necessary because no matter what I did (replacing hardware, turning off "big" features like IDS/IPS, clean reinstalls, etc.) I was just getting fairly regular kernel panics. The more I watched this, the more I realized that with UFS I was getting serious data corruption each time (as shown by the built-in 'health check'), and for a time I thought perhaps that was the root of my problem - some prior release panicked once and then subsequent panics were the result of corruption in some kernel module or something. I eventually moved to ZFS using the nice bootstrapping tool provided and I still saw a few panics, the last of which left the system unbootable (panic during mountroot).

A few threads where I brought up the panics, but didn't really find any resolution, mostly me talking to myself at some point:

https://forum.opnsense.org/index.php?topic=14323.0 (configd)
https://forum.opnsense.org/index.php?topic=12267.msg68445#msg68445 (zfs install)

So I yanked the drive, put in an old drive (one that also had opnsense on it that I'd swapped out to test if the corruption was a drive failure), and installed pfsense w/the zfs install option. A week later it's still going without any panics (and thankfully aliases and dhcp static mappings are pretty easy to export/import across platforms). This is great, but I'm also on a platform that promises to obsolete my hardware with the next major release (which may not come given how much time their other Linux-based project is getting).

So what's my point in posting?

Just calling attention to the issue, giving people with similar hardware a chance to find this via Google, whatever. My gut feeling is that while HardenedBSD is great, it sees WAY less hardware than mainline FreeBSD and it's just not happy with my old Core2Duo (E7500, 2.93GHz) Dell. It reminds me of the early days of OpenBSD - secure, but as you add more protections, you end up with less stability because you're bailing out whenever you hit an unexpected condition. This is GOOD - it means your protections and correctness in following spec are working. It's bad if you have users that hit the bugs and don't have the manpower to follow up. Anyhow, I've done the "submit a bug" thing after each of these panics for the last year or so, so there's a record for anyone wanting to look at it. And I have plenty of spare drives around and a copy of my last config, so if anyone ever wants to troubleshoot with me, I have no problem flipping over to opnsense again for testing.

From my end though, I've hit a dead end - the built-in Dell diagnostics all pass, memtest86 passes, SMART passes on all drives I've tried (after a "long" self-test), pegging the CPU with benchmarkers doesn't trigger the bug, and the CPU fan is fine, so I'm not sure what else I could do.
Title: Re: kernel panics through multiple releases
Post by: bartjsmit on December 21, 2019, 08:56:34 am
Turn the Dell into a Hypervisor. OPNsense is stable as a virtual guest under ESXi and you can insulate yourself from all hardware pitfalls.

Bart...
Title: Re: kernel panics through multiple releases
Post by: Pickens on December 21, 2019, 03:43:03 pm
Are there any downsides of doing that, Bart?
Title: Re: kernel panics through multiple releases
Post by: bartjsmit on December 21, 2019, 04:48:41 pm
You'll lose a bit of your resources to the hypervisor overhead, but ESXi has a tiny footprint. Depending on your paranoia level, you may not like that ESXi is closed source and you are dependent on VMware continuing to provide a free (as in beer) licence if you want to stay current. You get an extra piece of infrastructure that needs patches.

What you gain is things like monitoring and easy roll-back of changes. E.g. I take a snapshot before each OPNsense update and only delete it once I'm sure the new version works for me.

Bart...
Title: Re: kernel panics through multiple releases
Post by: franco on December 22, 2019, 09:59:17 am
Persistent, erratic failures often point to bad hardware itself.


Cheers,
Franco
Title: Re: kernel panics through multiple releases
Post by: sporkman on December 22, 2019, 07:59:05 pm
Franco - understood, but the problem goes away with any other OS running on the hardware, so I'm just thinking it's a bug and I'm running on hardware that's not well-tested with HBSD. I'll certainly report back if in the future I see similar issues under pfsense...

I guess if I'm feeling adventurous I could try the "wrap it in a vmware bubble blanket", but that feels a little extreme.
Title: Re: kernel panics through multiple releases
Post by: packet loss on December 22, 2019, 08:39:45 pm
Start with a new installation of OPNsense. Do not import any information into it from a previously saved pfSense or OPNsense configuration file. Manually configure all your OPNsense settings.
Title: Re: kernel panics through multiple releases
Post by: sporkman on December 23, 2019, 08:19:17 am
I've done that a few times already.
Title: Re: kernel panics through multiple releases
Post by: sporkman on February 28, 2020, 08:08:05 pm
Just poking my head back in here. One day I'd like to move back to opnsense.

Been on pfsense for a few months now and no file corruption and no panics, so I'm pretty sure I'm not running into a hardware issue (especially after all the hardware testing). Probably not even really an opnsense issue, but a HardenedBSD issue. My guess is this - my hardware is older (Core2Duo era), Shawn probably has nothing like it that he tests on regularly and it's just a bug that's not been addressed that shows up on this particular hardware.