[SOLVED] configd not running, won't start

Started by sporkman, September 25, 2019, 11:23:38 PM

Logged in to the GUI today to see if the box is still panicking every night, and most of the data in the dashboard was blank.

In the "services" pane, I saw that "configd" was not running.  On trying to start it, this is logged in the system logs:

opnsense: /status_services.php: The command '/usr/local/etc/rc.d/configd start' returned exit code '1', the output was:

Starting configd.
Traceback (most recent call last):
  File "/usr/local/opnsense/service/configd.py", line 37, in <module>
    import logging
  File "/usr/local/lib/python3.7/logging/__init__.py", line 26, in <module>
    import sys, os, time, io, traceback, warnings, weakref, collections.abc
  File "/usr/local/lib/python3.7/traceback.py", line 5, in <module>
    import linecache
  File "/usr/local/lib/python3.7/linecache.py", line 8, in <module>
    import functools
ModuleNotFoundError: No module named 'functools'
/usr/local/etc/rc.d/configd: WARNING: failed to start configd


Any idea what that's about?

No access via ssh, I assume something is broken there as well.
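
For what it's worth, the "No module named 'functools'" at the end of that traceback points at files missing from the Python 3.7 install itself rather than at configd. A quick sanity check, assuming the stock package paths shown in the traceback above:

# ls -l /usr/local/lib/python3.7/functools.py
# python3.7 -c 'import functools'

If the file is gone or the import fails, it's the Python package that needs repair, not configd.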

Can you run this from the console?

# pkg check -da
# pkg check -sa

Assuming multiple files missing or checksum clobbering.
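
In pkg terms, -d checks for (and offers to install) missing dependencies, -s verifies the recorded checksums of every installed file, and -a runs the check over all packages, so it's the -s pass that flags files corrupted or lost on disk:

# pkg check -d -a   (dependency check)
# pkg check -s -a   (checksum check of installed files)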


Cheers,
Franco

Yep, first one is OK, second fails:


root@SporkLab:/home/sporkadmin # pkg check -da
Checking all packages: 100%
root@SporkLab:/home/sporkadmin # pkg check -sa
Checking all packages:  83%
py37-yaml-5.1: missing file /usr/local/lib/python3.7/site-packages/yaml/__pycache__/constructor.cpython-37.pyc
Checking all packages:  85%
python37-3.7.4: missing file /usr/local/lib/python3.7/__pycache__/optparse.cpython-37.opt-2.pyc
python37-3.7.4: missing file /usr/local/lib/python3.7/__pycache__/random.cpython-37.pyc
python37-3.7.4: missing file /usr/local/lib/python3.7/__pycache__/stat.cpython-37.opt-1.pyc
python37-3.7.4: missing file /usr/local/lib/python3.7/__pycache__/stat.cpython-37.opt-2.pyc
python37-3.7.4: missing file /usr/local/lib/python3.7/functools.py
python37-3.7.4: missing file /usr/local/lib/python3.7/lib2to3/pgen2/__pycache__/token.cpython-37.opt-1.pyc
Checking all packages: 100%
root@SporkLab:/home/sporkadmin

I did a "pkg install -f pkgname" for all of them with mismatches (a later run of "pkg check -sa" turned up a few more).

I assume this is corruption/data loss from the box panicking.

Was configd working after reinstalling the packages?
Twitter: banym
Mastodon: banym@bsd.network
Blog: https://www.banym.de


> I assume this is corruption/data loss from the box panicking.

Indeed. Glad the package manager still worked to recover the system. :)


Cheers,
Franco

So those panics...

That seems kind of abnormal, and it's something new to this box after migrating from pfsense.

It's an old Dell SFF, Core2Duo. If I get a clean bill of health on the RAM from memtest86+, what's next?

I did report the last two with the included "report this problem" tool; nothing stood out to me, but what do I know?

sporkman, you can submit any issues with OPNsense you encounter at the following locations:

OPNsense Issues - github:

I will, but I'm looking for some guidance on what I can do on the hardware side first.

I just ran a single pass of memtest86+ and that completed fine.

I ran "stresscpu" for about 20 minutes and nothing odd happened.

I ran the manufacturer's disk diag tools and they came up empty.

I should probably run memtest again and let it do like 10 passes or so to be sure.

After that though, my gut feeling is it's a HardenedBSD issue - perhaps my older hardware is an edge case or something. In addition to the panics, I'll occasionally see a message about python SIGSEGV'ing with an extra note from HBSD's stack protection or something.

Anyhow, some pointers before I waste time on the issue tracker would be appreciated.

And just noting that when I went out last night I let memtest run for a few hours; still no errors.

Do I open a github issue or no?

Panics are in the domain of the operating system. Some may come from bad use of the software, but ideally the OS shouldn't be prone to "denial of service" from userland. Sometimes it is, but it is still a bug in the OS to react that way.

HardenedBSD may be involved in panics, but I've rarely seen that to be the case.

The bulk of panics comes from FreeBSD base code by nature of the code base: drivers and networking.

Now, the question is what type of panic you are getting. Do you have a stack trace? This can help with looking for clues on e.g. https://bugs.freebsd.org/bugzilla/ and with isolating the crashing component. It may be a network card or some other aux hardware that doesn't crash while doing memtests, but does so reliably when you put traffic through it. Possibilities are many, so the stack trace is key.
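
If a dump was actually saved, the trace usually survives on disk too; a rough sketch, assuming the stock FreeBSD crash-dump layout (dumpdev plus savecore) that OPNsense inherits:

# ls /var/crash                (dumps and their summaries land here)
# less /var/crash/info.0       (metadata for the first saved dump; numbers increment per dump)
# less /var/crash/core.txt.0   (backtrace and message buffer, if crashinfo was able to run)

The "panic:" line plus the first few frames of the backtrace are the parts worth searching for.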

Your best bet, other than identifying the problematic piece of hardware (if any), is waiting for the next OS update. We've scheduled HardenedBSD 12.1 for OPNsense 20.1, which is quite a jump forward in the code base, so if it is a software error there's a fair chance the bug has already been fixed.


Cheers,
Franco

Is there any way for me to retrieve the stack traces that were submitted via the built-in reporting system? I sent the last two in.

Not sure if OPNsense runs the "daily" scripts like FreeBSD does, but both panics happened shortly after 3 a.m., which is when those scripts kick off on stock FreeBSD.
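
On stock FreeBSD the daily periodic(8) run is fired from /etc/crontab at 3:01 a.m., so it's easy to check whether this box has the same schedule and even to kick it off by hand to see if it reproduces the crash window - a suggestion based on plain FreeBSD behaviour, not on the crash reports themselves:

# grep periodic /etc/crontab   (stock FreeBSD schedules "periodic daily" at 3:01)
# periodic daily               (runs the daily scripts manually)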

Prior to the crash there's usually a bunch of "signal 11" crashes of configd and other applications.

Older releases have had this issue and it's a similar pattern - a few panics every night after updating, and then, over time, fewer panics, which seems odd. This started with my move from pfsense, for whatever that's worth - I know the base OS in these two firewall distros is pretty different at this point.

I think I might have found something here.

Putting together a few things:

- lots of panics, which means just repeatedly trashing the filesystem
- background fsck on "/" (does it really work?)
- "pkg check" returning bad checksums after each panic
- today's panics happening during "health check", which likely walks large parts of the filesystem
- panics come in clusters - maybe after one of the last ones fsck finally fixes something (but not everything, because UFS w/SU does not get the attention ZFS does these days)
- SU is enabled on root, which may or may not be a good idea depending on who you ask (quick check below)
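
A quick way to confirm those last two points on the box itself (standard FreeBSD tooling, nothing OPNsense-specific):

# tunefs -p /             (prints the UFS flags for root; look at the "soft updates" lines)
# mount | grep ' on / '   (soft-updates shows up in the mount options if enabled)
# sysrc background_fsck   (whether rc.conf enables background fsck at boot)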

I have a theory... One of the first panics after moving from pfsense was just random - it happens. But it mangled something, which at some point caused another panic, which then also led to more corruption of the fs and another panic, etc.

All the panics I have logged right now look like this:

panic: ufs_dirbad: /: bad dir ino 7877697 at offset 0: mangled entry
panic: ufs_dirbad: /: bad dir ino 7877697 at offset 0: mangled entry
panic: handle_written_inodeblock: Invalid link count 65535 for inodedep 0xfffff8001da83000
cpuid = 0

I wish I had access to some older ones, but these are all clearly dirty filesystem issues, no?

Going to the garage to shut down and do a fsck in single-user mode...
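
For reference, the usual recipe, assuming a UFS root and a boot into single-user mode (where / is still mounted read-only):

# fsck -f -y /   (force a full foreground check even if the filesystem is marked clean; answer yes to all repairs)
# fsck -f -y     (or let it walk every filesystem listed in /etc/fstab)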

Also, from the command line, is there any easy way to force a reinstall of the base? I'd like to be sure I don't have any corrupt files hanging around. The base OS is not pkg-ified, right?
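
My understanding - a guess to be checked against opnsense-update(8), not something verified here - is that the base system and kernel ship as OPNsense "sets" handled by opnsense-update rather than by pkg, so something along these lines should re-fetch and reinstall them:

# opnsense-update -bkf   (-b base set, -k kernel set, -f force a reinstall of the currently installed version)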