unbound died after new wanip

Started by itn3rd77, February 02, 2022, 07:36:19 AM

Previous topic - Next topic
Hi together,

after the upgrade to 22.1 I was faced with an died unbound process after new wanip for several times now.
It's obviously that a router without a running nameserver is near to useless  ;)

The only thing I found so far are the following lines inside system/latest.log


<13>1 2022-02-02T00:49:02+01:00 fw.local.home opnsense 68773 - [meta sequenceId="444"] plugins_configure newwanip (,opt2)
<13>1 2022-02-02T00:49:02+01:00 fw.local.home opnsense 68773 - [meta sequenceId="445"] plugins_configure newwanip (execute task : ntpd_configure_do())
<13>1 2022-02-02T00:49:12+01:00 fw.local.home opnsense 68773 - [meta sequenceId="446"] plugins_configure newwanip (execute task : opendns_configure_do())
<13>1 2022-02-02T00:49:12+01:00 fw.local.home opnsense 68773 - [meta sequenceId="447"] plugins_configure newwanip (execute task : openssh_configure_do(,opt2))
<13>1 2022-02-02T00:49:12+01:00 fw.local.home opnsense 68773 - [meta sequenceId="448"] plugins_configure newwanip (execute task : unbound_configure_do(,opt2))
<11>1 2022-02-02T00:49:35+01:00 fw.local.home configctl 98849 - [meta sequenceId="449"] error in configd communication  Traceback (most recent call last):   File "/usr/local/opnsense/service/configd_ctl.py", line 68, in exec_config_cmd     line = sock.recv(65536).decode() socket.timeout: timed out
<11>1 2022-02-02T00:51:41+01:00 fw.local.home opnsense 88635 - [meta sequenceId="2"] /usr/local/etc/rc.newwanipv6: The command '/usr/local/opnsense/scripts/dns/unbound_dhcpd.py --domain 'local.home'' returned exit code '1', the output was 'Unable to lock on the pidfile.'
<13>1 2022-02-02T00:51:41+01:00 fw.local.home opnsense 88635 - [meta sequenceId="3"] plugins_configure newwanip (execute task : vxlan_configure_interface())
<13>1 2022-02-02T00:51:41+01:00 fw.local.home opnsense 88635 - [meta sequenceId="4"] plugins_configure newwanip (execute task : webgui_configure_do(,wan))
<11>1 2022-02-02T00:51:41+01:00 fw.local.home opnsense 72174 - [meta sequenceId="5"] /usr/local/etc/rc.filter_configure: Ignore down inet gateways : WAN_DHCP6
<11>1 2022-02-02T00:51:41+01:00 fw.local.home opnsense 72174 - [meta sequenceId="6"] /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '100.84.0.1'
<11>1 2022-02-02T00:51:41+01:00 fw.local.home opnsense 72174 - [meta sequenceId="7"] /usr/local/etc/rc.filter_configure: Ignore down inet6 gateways : WAN_DHCP6


The error is not reproducible from the command line. There must be some special circumstances leading to the error. The really bad thing is you are left with a system near to useless as unbound is no longer running anymore after the failure  :(

Has anybody seen this or any advice how to track this down?

Best regards
Ingo

Hi Ingo,

Well, what happens when you run the command that fails?

# /usr/local/opnsense/scripts/dns/unbound_dhcpd.py --domain 'local.home'

You can also always check which part in the Unbound configuration is causing an error:

# configctl unbound check


Cheers,
Franco

Hi Franco,

thanks for picking up my issue.

Here are the results from the commands


root@fw:~ # /usr/local/opnsense/scripts/dns/unbound_dhcpd.py --domain 'local.home'
Unable to lock on the pidfile.
root@fw:~ # configctl unbound check
no errors in /var/unbound/unbound.conf


Repated the command /usr/local/opnsense/scripts/dns/unbound_dhcpd.py --domain 'local.home' several times. The result is always Unable to lock on the pidfile.

But despite the error unbound is still running this time.

I am a bit lost and helpless  :(

Best regards
Ingo

Hmm, would indicate the service is still running?

# cat /var/run/unbound_dhcpd.pid

Dann mal

# ps aux | grep PID-AUS-DER-DATEI


Danke,
Franco

Here we go ....


root@fw:~ #  cat /var/run/unbound_dhcpd.pid
11057
root@fw:~ # ps aux | grep 11057
root     11057   0.0  0.0   25416   14940  -  Ss   08:58       0:02.56 /usr/local/bin/python3 /usr/local/opnsense/scripts/dns/unbound_dhcpd.py --domain local.home (python3.8)


Quote from: franco on February 02, 2022, 09:20:18 AM
Hmm, would indicate the service is still running?

# cat /var/run/unbound_dhcpd.pid

then do a decent

# ps aux | grep pid-from-cat-command-above


Thanks,
Franco

...translation by forum-autotranslate.org... :-D
kind regards
chemlud
____
"The price of reliability is the pursuit of the utmost simplicity."
C.A.R. Hoare

felix eichhorns premium katzenfutter mit der extraportion energie

A router is not a switch - A router is not a switch - A router is not a switch - A rou....

sorry, not the first time this happened :D

Can you see if unbound-control is stuck?

# ps aux | grep unbound-control


Cheers,
Franco

Hi Franco,

nope. Just found my own grep  ;)


root@fw:~ # ps aux | grep unbound-control
root     70523   0.0  0.0   12740    2536  0  S+   09:40       0:00.00 grep unbound-control



What happens on

# kill -TERM 11057

That's what the GUI is trying to do and apparently failing.


Cheers,
Franco

Ok, process termiated.


root@fw:~ # ps aux | grep 11057
root     11057   0.0  0.0   25416   14940  -  Ss   08:58       0:03.08 /usr/local/bin/python3 /usr/local/opnsense/scripts/dns/unbound_dhcpd.py --domain local.home (python3.8)
root     71577   0.0  0.0   12740    2532  0  S+   09:55       0:00.00 grep 11057
root@fw:~ #  kill -TERM 11057
root@fw:~ # ps aux | grep 11057
root     78974   0.0  0.0   12740    2532  0  S+   09:56       0:00.00 grep 11057


Is this good or bad news?

For debugging bad, in general good :)

I'm still suspecting unbound-control might be hanging and it's for strange reasons which I'm trying to address via https://github.com/opnsense/core/commit/a9a67b42550a

Can you do me a favour and report back if this happens again so we can try the patch as a next step (but not right now).


Cheers,
Franco

The behaviour with the command  /usr/local/opnsense/scripts/dns/unbound_dhcpd.py --domain 'local.home' is reproducible. I executed the command (after the #kill -TERM 11057) and it succeeded. The next execution fails with 'Unable to lock on the pidfile'.



root@fw:~ # /usr/local/opnsense/scripts/dns/unbound_dhcpd.py --domain 'local.home'
root@fw:~ # /usr/local/opnsense/scripts/dns/unbound_dhcpd.py --domain 'local.home'
Unable to lock on the pidfile.



I'll drop you a note once this happenes again out of nowhere.

Best regards
Ingo




Hi Ingo,

That's to be expected honestly. It's already running. What I'm wondering is why a service might not terminate before restarting it. The code is always the same:

https://github.com/opnsense/core/blob/master/src/etc/inc/plugins.inc.d/unbound.inc#L407-L426

see first and last line in markup


Cheers,
Franco