OPNsense Forum

English Forums => General Discussion => Topic started by: gratis on June 24, 2015, 05:58:18 pm

Title: backup story
Post by: gratis on June 24, 2015, 05:58:18 pm
This afternoon there was a brief power outage. The old UPS that was powering the OPNsense machine, along with a few network devices, immediately powered down. It appears that the old UPS needs a new battery, and a new battery status indicator light...

I installed a new APC UPS and rebooted OPNsense, but the only service that was running was ntpd. All other services, dhcpd, dnsmasq, apinger, and a few others like Proxy were Stopped, and would not Start. The log files of the services showed unusual error messages like missing user accounts, or misconfigured settings, although before the power failure everything was working fine.

It quickly became clear that the power outage had corrupted some files. Rather than troubleshooting a lot of random issues, I decided to simply reinstall and start fresh. Unfortunately I didn't have a backup of the configuration, but it's easy to recreate...

So, that is the story, for what it's worth. The question is, would it be possible to add an APC UPS monitoring package to the OPNsense repository? https://freshports.org/sysutils/apcupsd/
Title: Re: backup story
Post by: franco on June 24, 2015, 06:17:14 pm
Hi gratis,

thanks for sharing this. I read about this happening on pfSense/FreeBSD, but I am unsure how to address this properly. Setting the disk to "sync" mode to force flushes all the time is something that can be done locally, but is not for a standard installation and may still fail sometimes.

In short this is what happens: users and groups added to the system disappear during the crash, the system is unable to put them back.

If you have enough of the system left in a running state (no essential libraries or binaries missing, root login working), try:

# opnsense-update -f && reboot

This will reapply the kernel and base system as well as packages for 15.1.12 systems. It might not be enough, but it's worth a try. If not it may be worth looking into a real portable recovery tool instead.

Your second option is to use the installer and do an "import configuration", which will recover all your configs, backups, ssh keys, dhcpleases and captive portal databases as long as the disk isn't corrupted badly. After import, "quick and easy" install will install the system cleanly with all your settings in place.

If anyone has other suggestions or improvements I'd like to bring them to the table to find a sensible solution.


Cheers,
Franco

PS: https://github.com/opnsense/tools/issues/10
Title: Re: backup story
Post by: gratis on June 25, 2015, 05:24:14 am
Hello franco,

The sync option should work for my use case, as I am currently using a re-purposed ThinkCentre desktop with a standard HDD as a test machine. Thanks for the tip.

I was only a week into testing and didn't have a large amount of time invested in configuring OPNsense, so I simply reinstalled and started over fresh.

However, your suggestions on how to recover look promising. In retrospect I should have switched to a backup router and attempted to resolve the issue with your help, to test your suggestions. Next time...

Thanks again for your efforts.
Title: Re: backup story
Post by: cmb on June 25, 2015, 09:11:17 am
The root cause is an issue with pw not issuing fsync where it should, leaving the files in an inconsistent state, where if you have an unclean shut down within some time after touching the passwd or group file, you'll end up with them being blank, or if not using SU, with random bits of other files in /etc/.
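To make "issuing fsync where it should" concrete, here is a minimal sketch of the usual write-then-rename pattern. It is written in Python purely for illustration; it is not pw's actual code (pw is a C utility), and the function name durable_write is hypothetical. It assumes a POSIX system where a directory can be opened and fsync'ed.

```python
import os
import tempfile


def durable_write(path: str, data: str) -> None:
    """Replace `path` with `data` so a crash leaves either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temporary file in the same directory, so the rename stays on one filesystem.
    # Note: mkstemp creates it with mode 0600; a real tool would restore mode/owner.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(data)
            tmp.flush()             # user-space buffer -> kernel
            os.fsync(tmp.fileno())  # kernel buffers -> disk
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory too, so the rename itself reaches the disk.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

With this ordering, the new content is on disk before the name is switched over, which is the property that goes missing when the fsync step is skipped.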

Same happens on stock FreeBSD if it occurs shortly after running pw. One of our (pfSense) developers will get a fix for pw into FreeBSD, but for the time being, setting sync is confirmed to 100% resolve the problem. That's a reasonable solution for firewall use cases.

It's easily replicable on OPNsense and pfSense because we write the passwd and group files on every boot, which is much more than what a stock FreeBSD install would generally do. Nature of the beast for what we're doing, though we'll make it idempotent in the future in pfSense.
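A rough sketch of what "idempotent" could mean here: skip the write entirely when the file already has the desired content, so an ordinary boot never opens the vulnerable window. This is an assumption about the approach, not pfSense or OPNsense code, and write_if_changed is a hypothetical helper.

```python
import os
import tempfile


def write_if_changed(path: str, data: str) -> bool:
    """Write `data` to `path` only when the content differs; return True if written."""
    try:
        with open(path, "r") as existing:
            if existing.read() == data:
                return False  # already up to date, so nothing is rewritten
    except FileNotFoundError:
        pass  # file does not exist yet, fall through and create it
    with open(path, "w") as out:
        out.write(data)
        out.flush()
        os.fsync(out.fileno())  # push the new content to disk before returning
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "group")
        print(write_if_changed(target, "wheel:*:0:root\n"))  # True: file created
        print(write_if_changed(target, "wheel:*:0:root\n"))  # False: unchanged
```

The sketch only covers skipping redundant writes; when the content does change, the write itself still needs to be made crash-safe.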

Since setting sync, we have systems that have been through several thousand power cycles (snmpset to IP PDU scripted in a loop for days and days) immediately after writing the passwd and group files (which is what triggers the problem), and have had 0 problems.

There is a much worse problem of some sort with OPNsense that I hit the very first time I pulled the power plug on it. Clean install, boot to the console menu, yank the power plug. Completely trashed filesystem. Couldn't touch it without kernel panic (attached). The only relevant diff I saw between OPNsense, stock FreeBSD, and pfSense in that regard is you're running SU (soft updates) without J (journaling). FreeBSD 10.1 is SU+J by default, pfSense 2.2.3 and newer is SU+J by default (and pre-2.2.3 was no SU, no J). I went through several hundred power cycles without sync in tracking the root cause, only a handful of those on OPNsense as another point of comparison, but it's the only thing that ever ended up with anything worse than blank or corrupted passwd and group files. Guessing you probably want SU+J so you're out of what's probably a much less tested code path, at least if my suspicion of it being SU without J is down the right path.
Title: Re: backup story
Post by: jschellevis on June 25, 2015, 09:56:24 am
Hi Chris,

Thanks for your help and explanation.
The cause is clear now.

Any idea when your developer will have a patch?
Do you guys need any help with getting it ready for upstream?

Title: Re: backup story
Post by: franco on June 25, 2015, 11:15:06 am
gratis, apcupsd has been added and will be available with 15.7 as an optional package: https://github.com/opnsense/tools/commit/e82e6356d1aa6960413699d9bff1fe957c8cffc1
Title: Re: backup story
Post by: cmb on June 25, 2015, 12:23:34 pm
JimT is testing and reviewing his fix as time permits. A bit more work on that and we'll put it into our power cycle test harness and get it upstream once confirmed, hopefully MFCed into FreeBSD 10.2. We'll have no issue getting it in upstream once we're confident it's correct, multiple FreeBSD committers already involved.
Title: Re: backup story
Post by: jschellevis on June 25, 2015, 02:25:55 pm
Thanks Chris,

Franco has created a workaround that I am currently testing.
https://github.com/opnsense/core/commit/1b7aec7a7738a99eaa567b00adfc8b9c983dd86b
Title: Re: backup story
Post by: franco on June 26, 2015, 07:40:46 am
Chris, thanks for your help here. Unfortunately, forcing everyone to sync is suboptimal for a bug that happens in just one utility. I agree that pw must be fixed mid-term. Now, given that Jim has openly mocked my commit, which Jos confirmed is actually working, I'm not going to take anything but the real pw fix.

I feel that said negativity will keep preventing us from working together in the future.

Thanks for your understanding. Looking forward to your fix in 10-STABLE.
Title: Re: backup story
Post by: gratis on June 26, 2015, 06:06:46 pm
Looks like sharing the backup story started an interesting exchange... It's good that Chris pointed out the root of the problem, and that there are multiple solutions in the works.

Thanks for the addition franco, hopefully others will benefit from it as well...
Title: Re: backup story
Post by: franco on July 01, 2015, 03:03:11 pm
We've discussed this a couple more times internally and have come to the conclusion that this issue is not fixable, or at least not in the way it has been presented and discussed. While it's true that "sync" completely circumvents the issue, it seems that UFS has gotten a lot more error prone in FreeBSD 10 because of a yet to be discovered regression. We do not intend to switch our installs to "sync" or use journaling on top of soft updates.

FreeBSD's default for mounting is "noasync" which is a hybrid approach for writing data to disk: metadata is synchronously written, while file and directory data is held in RAM for a while, presenting the writer with a finished write even though the data is still pending a flush to the physical disk. While looking at pw and the supposed fsync() missing (fflush() really, because it doesn't use the open() function call), we've noticed that pw itself uses libc by default, so "fixing pw" does not look like a sensible route. What makes this worse now is that all other utilities using libc with the particular user/group function have this potential corruption issue too. And then, still, while digging deeper, we've found that other files do indeed corrupt as well, especially when pulling the plug on a booting system (e.g. /etc/shells), eventually rendering the installations impossible to be accessed even from a serial console.
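For readers less familiar with the two layers of buffering involved, here is a small Python sketch (an illustration only, not taken from pw or libc). Python's flush() plays the role of C's fflush(), handing data from the process to the kernel, while os.fsync() asks the kernel to commit it to the physical disk. The file name and the helper buffer_stages are hypothetical.

```python
import os
import tempfile


def buffer_stages() -> list:
    """Record the on-disk file size after write(), after flush(), and after fsync()."""
    sizes = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "shells")
        with open(path, "w", newline="\n") as f:
            f.write("/bin/sh\n")
            sizes.append(os.stat(path).st_size)  # data still in the user-space buffer
            f.flush()                            # handed over to the kernel
            sizes.append(os.stat(path).st_size)
            os.fsync(f.fileno())                 # kernel asked to commit to disk
            sizes.append(os.stat(path).st_size)
    return sizes
```

On CPython this returns [0, 8, 8]: the file looks complete to every other process right after flush(), yet only the fsync() step makes any promise about surviving a power cut. That gap between "visible" and "durable" is where the files discussed in this thread get lost.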

So now, we've put in place a mechanism that ensures the system recovers the *important* files on boot so that a login capability is ensured and a fully operational system can also recover, but -- keeping the above in mind -- any unclean shutdown may lead to a file corruption at some point, most likely rendering the system unable to recover or start due to missing configuration files or binaries. We've also made sure that the syncer kernel thread has lower timeouts (we went from 30 seconds previously to 5 seconds) for file and directory writes. The latter is a moving target as we monitor and tweak performance.
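The general idea of a boot-time recovery step can be sketched as follows. This is a hypothetical illustration in Python, not the actual OPNsense mechanism (see the commits referenced in this post for that). It assumes a known-good backup copy exists and treats a missing or zero-length file as damaged, matching the "blank" passwd and group files described earlier in the thread.

```python
import os
import shutil
import tempfile


def restore_if_damaged(path: str, backup: str) -> bool:
    """Copy `backup` over `path` when `path` is missing or zero-length."""
    try:
        damaged = os.path.getsize(path) == 0
    except FileNotFoundError:
        damaged = True
    if not damaged or not os.path.isfile(backup):
        return False  # nothing to repair, or nothing to repair it with
    shutil.copy2(backup, path)  # copy2 also carries over mode and timestamps
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        live = os.path.join(tmp, "group")
        saved = os.path.join(tmp, "group.bak")
        with open(saved, "w") as f:
            f.write("wheel:*:0:root\n")
        open(live, "w").close()                 # simulate a blanked file
        print(restore_if_damaged(live, saved))  # True: restored from backup
        print(restore_if_damaged(live, saved))  # False: file is healthy now
```

A size check like this only catches blank files; a file filled with fragments of other files would need a stronger validity test.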

In all that we do, we try to avoid adding kernel/base patches or reboots or changing the behaviour away from FreeBSD defaults. If that's a problem, please speak up now. Otherwise, enjoy the upcoming 15.7. :)

Our patches for reference:

https://github.com/opnsense/core/commit/81edf54f2afb643800becd8ce231b4d891d97c77
https://github.com/opnsense/core/commit/cc7180222db0dd67fb4857fb43fa28cf4b5c27e9
https://github.com/opnsense/core/commit/7baccb7c42dd812963a931edb2483a25269eaf7a
https://github.com/opnsense/core/commit/f45171e7851b26acbd23b48c8c30d3f35b997e71
https://github.com/opnsense/core/commit/50613a493692ca6f43eb31e52403b1f97523ecb6
Title: Re: backup story
Post by: chol on July 02, 2015, 01:15:52 am
 :) Like that:

In all that we do, we try to avoid adding kernel/base patches or reboots or changing the behaviour away from FreeBSD defaults.

OPNsense 15.7 .. steady as she goes!




Title: Re: backup story
Post by: chol on July 02, 2015, 01:16:36 am
Re: UFS SU+J sync

Data loss due to vfs buffering
 (https://forums.freebsd.org/threads/data-loss-due-to-vfs-buffering.52091/)

FreeBSD 10.1 - Deadlock on reboot with UFS tuned with SU+J
 (https://www.freebsd.org/security/advisories/FreeBSD-EN-15:05.ufs.asc)

Hang on shutdown/root unmount after FreeBSD 10.1R
 (http://mpc.lists.freebsd.bugs.narkive.com/w2kcOYzo/bug-195458-new-hang-on-shutdown-root-unmount-after-freebsd-10-1r)

Features and status of FreeBSD's Ext2 implementation - BSDCan 2014
 (http://www.bsdcan.org/2014/schedule/events/483.en.html)
Title: Re: backup story
Post by: franco on July 02, 2015, 10:35:39 am
Features and status of FreeBSD's Ext2 implementation - BSDCan 2014
 (http://www.bsdcan.org/2014/schedule/events/483.en.html)

Ad and me are happy about this in a way you can't imagine. :D
Title: Re: backup story
Post by: franco on July 03, 2015, 07:21:19 am
To be fair, here is what pfSense has to say. https://blog.pfsense.org/?p=1815

The timing is impeccable; if I didn't know any better it would look like they did not want 15.7 to have the fix. A day earlier and we would have picked it up gladly. :)
Title: Re: backup story
Post by: jschellevis on July 03, 2015, 08:00:54 am
The timing is impeccable; if I didn't know any better it would look like they did not want 15.7 to have the fix. A day earlier and we would have picked it up gladly. :)

Can't agree more... it is a pity that the whole idea seemed to be to wait until 15.7 was released.

However, nobody needs to worry, as 15.7 "Brave Badger" contains a well-tested solution/fix; no need to "sync" your disks.

And as they say, the badger is a brave creature. Quoting someone who posted a YouTube video:
Quote
"stung a thousand times and he doesn't give a #"

Let them keep stinging, we'll keep fixing and take the project into this millennium! :)
Title: Re: backup story
Post by: chol on July 03, 2015, 10:26:44 am
To be fair, here is what pfSense has to say. https://blog.pfsense.org/?p=1815

The timing is impeccable; if I didn't know any better it would look like they did not want 15.7 to have the fix. A day earlier and we would have picked it up gladly. :)
2.2 .. 2.2.1 .. 2.2.2  they seem to try to get close to our (weekly) bug fix cycle  ???   ;D

Features and status of FreeBSD's Ext2 implementation - BSDCan 2014
 (http://www.bsdcan.org/2014/schedule/events/483.en.html)

Ad and me are happy about this in a way you can't imagine. :D
O.k., I admit it, I am not really sure if I got the funny part here, but I do not take the following statement too seriously, that's for sure:
"Some free advice: If you don’t understand the system, don’t attempt to disguise your lack of knowledge with infantile rambling, and anyone who thinks ext2 is an appropriate primary filesystem for FreeBSD has questionable motives and poor taste." (pfSense Digest - filesystem corruption: closed, by Jim Thompson July 2, 2015 (https://blog.pfsense.org/?p=1815)) -- Actually this statement brought tears to my eyes... from laughing!

The timing is impeccable; if I didn't know any better it would look like they did not want 15.7 to have the fix. A day earlier and we would have picked it up gladly. :)

Can't agree more... it is a pity that the whole idea seemed to be to wait until 15.7 was released.
It's so absurd, this really doesn't need more words...  8)

Title: Re: backup story
Post by: franco on July 03, 2015, 10:48:22 am
O.k., I admit it, I am not really sure if I got the funny part here

That's natural. It refers to a conversation that Ad and I had on Wednesday, discussing the file corruption and how it affects *all* files that are not synced to the disk in time. This is especially bad on power outages that have more than one power cut: if the system was just rebooting after the outage and brought down again there are a lot of files that will have the same issue, not just /etc/passwd and friends. We were joking about using ext2 instead to fix the issue, knowing this will never be a sensible fix for (any) BSD. And then you drop the link in the forum. And then Jim picks it up to teach us a lesson. And now we're here. :)
Title: Re: backup story
Post by: chol on July 03, 2015, 12:27:33 pm
Lol - yes exactly, I put it in because of the little spicy element that comes with it, with an ironic undertone: and Jim made my day!