[solved] Unbound service stopped after updating blacklists

Started by Nikotine, September 27, 2021, 03:18:08 PM

Previous topic - Next topic
EDIT:
It turned out that it's not the updating of dnsbl that causes a crash. Unbound is already down before, likely due to memory issues on my HP T620 plus.
END EDIT

I have a cronjob to update the Unbound blocklists every night at 3AM.
This morning I found that Unbound had stopped and I had to manually start it again.
Looking at the logs, I see the blacklists being downloaded, but then the logs suddenly stop...
I had verbosity level 3, from back when I was setting up DNS over TLS, so there's a lot of chatter, but I hope you can find what went wrong?

"2021-09-27T08:48:01 unbound[13660] daemonize unbound dhcpd watcher."
"2021-09-27T03:00:19 unbound[87221] [87221:0] debug: switching log to stderr"
"2021-09-27T03:00:19 unbound[87221] [87221:0] debug: cache memory msg=132120 rrset=132120 infra=10866 val=0"
"2021-09-27T03:00:19 unbound[87221] [87221:0] info:    4.000000    8.000000 1"
"2021-09-27T03:00:19 unbound[87221] [87221:0] info:    2.000000    4.000000 1"
"2021-09-27T03:00:19 unbound[87221] [87221:0] info:    1.000000    2.000000 1"
"2021-09-27T03:00:19 unbound[87221] [87221:0] info:    0.524288    1.000000 6"
"2021-09-27T03:00:19 unbound[87221] [87221:0] info:    0.262144    0.524288 18"
"2021-09-27T03:00:19 unbound[87221] [87221:0] info:    0.131072    0.262144 67"
"2021-09-27T03:00:19 unbound[87221] [87221:0] info:    0.065536    0.131072 144"
"2021-09-27T03:00:19 unbound[87221] [87221:0] info:    0.032768    0.065536 68"
"2021-09-27T03:00:19 unbound[87221] [87221:0] info:    0.016384    0.032768 50"
"2021-09-27T03:00:19 unbound[87221] [87221:0] info:    0.008192    0.016384 26"
"2021-09-27T03:00:19 unbound[87221] [87221:0] info:    0.004096    0.008192 2"
"2021-09-27T03:00:19 unbound[87221] [87221:0] info:    0.000256    0.000512 1"
"2021-09-27T03:00:19 unbound[87221] [87221:0] info:    0.000000    0.000001 5"
"2021-09-27T03:00:19 unbound[87221] [87221:0] info: lower(secs) upper(secs) recursions"
"2021-09-27T03:00:19 unbound[87221] [87221:0] info: [25%]=0.0392734 median[50%]=0.0851058 [75%]=0.129479"
"2021-09-27T03:00:19 unbound[87221] [87221:0] info: histogram of recursion processing times"
"2021-09-27T03:00:19 unbound[87221] [87221:0] info: average recursion processing time 0.131777 sec"
"2021-09-27T03:00:19 unbound[87221] [87221:0] info: mesh has 0 recursion states (0 with reply, 0 detached), 0 waiting replies, 390 recursion replies sent, 0 replies dropped, 0 states jostled out"
"2021-09-27T03:00:19 unbound[87221] [87221:0] info: server stats for thread 3: requestlist max 3 avg 0.193066 exceeded 0 jostled 0"
"2021-09-27T03:00:19 unbound[87221] [87221:0] info: server stats for thread 3: 5504 queries, 5114 answers from cache, 390 recursions, 1283 prefetch, 0 rejected by ip ratelimiting"
"2021-09-27T03:00:18 unbound[87221] [87221:0] debug: cache memory msg=132120 rrset=132120 infra=10866 val=0"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.524288    1.000000 1"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.262144    0.524288 11"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.131072    0.262144 46"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.065536    0.131072 122"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.032768    0.065536 57"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.016384    0.032768 51"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.008192    0.016384 28"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.004096    0.008192 2"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.002048    0.004096 1"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.000256    0.000512 1"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.000000    0.000001 4"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info: lower(secs) upper(secs) recursions"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info: [25%]=0.0308405 median[50%]=0.0752052 [75%]=0.118717"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info: histogram of recursion processing times"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info: average recursion processing time 0.085716 sec"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info: mesh has 0 recursion states (0 with reply, 0 detached), 0 waiting replies, 324 recursion replies sent, 0 replies dropped, 0 states jostled out"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info: server stats for thread 2: requestlist max 3 avg 0.195266 exceeded 0 jostled 0"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info: server stats for thread 2: 5235 queries, 4911 answers from cache, 324 recursions, 1197 prefetch, 0 rejected by ip ratelimiting"
"2021-09-27T03:00:18 unbound[87221] [87221:0] debug: cache memory msg=132120 rrset=132120 infra=10866 val=0"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.524288    1.000000 7"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.262144    0.524288 15"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.131072    0.262144 51"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.065536    0.131072 128"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.032768    0.065536 66"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.016384    0.032768 58"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.008192    0.016384 24"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.004096    0.008192 3"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.002048    0.004096 1"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.000512    0.001024 1"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.000000    0.000001 1"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info: lower(secs) upper(secs) recursions"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info: [25%]=0.0331404 median[50%]=0.077568 [75%]=0.123008"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info: histogram of recursion processing times"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info: average recursion processing time 0.100031 sec"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info: mesh has 0 recursion states (0 with reply, 0 detached), 0 waiting replies, 355 recursion replies sent, 0 replies dropped, 0 states jostled out"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info: server stats for thread 1: requestlist max 3 avg 0.212223 exceeded 0 jostled 0"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info: server stats for thread 1: 5085 queries, 4730 answers from cache, 355 recursions, 1134 prefetch, 0 rejected by ip ratelimiting"
"2021-09-27T03:00:18 unbound[87221] [87221:0] debug: cache memory msg=132120 rrset=132120 infra=10866 val=0"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    4.000000    8.000000 1"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.524288    1.000000 8"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.262144    0.524288 22"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.131072    0.262144 44"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.065536    0.131072 129"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.032768    0.065536 73"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.016384    0.032768 59"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.008192    0.016384 27"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.004096    0.008192 3"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.002048    0.004096 2"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info:    0.000000    0.000001 4"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info: lower(secs) upper(secs) recursions"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info: [25%]=0.0322126 median[50%]=0.0746806 [75%]=0.121927"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info: histogram of recursion processing times"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info: average recursion processing time 0.117378 sec"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info: mesh has 0 recursion states (0 with reply, 0 detached), 0 waiting replies, 372 recursion replies sent, 0 replies dropped, 0 states jostled out"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info: server stats for thread 0: requestlist max 3 avg 0.20524 exceeded 0 jostled 0"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info: server stats for thread 0: 5226 queries, 4854 answers from cache, 372 recursions, 1231 prefetch, 0 rejected by ip ratelimiting"
"2021-09-27T03:00:18 unbound[87221] [87221:0] info: service stopped (unbound 1.13.2)."
"2021-09-27T03:00:18 unbound[87221] [87221:0] info: control cmd:  stop"
"2021-09-27T03:00:18 unbound[87221] [87221:0] debug: new control connection from 127.0.0.1 port 51834"
"2021-09-27T03:00:17 unbound[80774] blocklist download done in 17.21 seconds (37705 records)"
"2021-09-27T03:00:17 unbound[80774] blocklist download https://perflyst.github.io/PiHoleBlocklist/SmartTV.txt (lines: 345 exclude: 0 block: 222)"


I had the exact same thing happen.  Upgraded OPNsense yesterday (didn't reboot after upgrading).  First thing this morning unbound service was stopped.  It happened immediately after the DNS blocklist update task had run via the scheduled job.  Had to manually start unbound.

I'm going to reboot router this evening, as I've noticed a few other weird errors that have started to appear in the syslog since the upgrade (swap_pages_getswapspace and lots of configctl events), and will see whether it happens tomorrow.

UPDATE:
A restart of the router seems to of cleared things up on my OPNsense instance.  Errors that were appearing in event log related to swap and configctl are no longer occurring, and the DNS blocklist update ran this morning without any issues.  Unbound restarted cleanly during the task as expected.

I didn't restart the appliance after either of the previous 2 updates, so maybe that's related somehow.

October 12, 2021, 10:10:27 AM #3 Last Edit: October 12, 2021, 01:16:49 PM by Nikotine
I had the same happen again after it tried to update the blocklists tonight.
The cronjob started updating the blocklists at 3AM.
It failed to resolve some of them.
Then the logs went quiet.

2021-10-12T03:06:01 unbound[80923] blocklist download done in 360.94 seconds (0 records)
2021-10-12T03:06:01 unbound[80923] blocklist download https://perflyst.github.io/PiHoleBlocklist/SmartTV.txt (lines: 0 exclude: 0 block: 0)
2021-10-12T03:06:01 unbound[80923] blocklist download : unable to download file from https://perflyst.github.io/PiHoleBlocklist/SmartTV.txt (error : HTTPSConnectionPool(host='perflyst.github.io', port=443): Max retries exceeded with url: /PiHoleBlocklist/SmartTV.txt (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x2d0eeaf15e0>: Failed to establish a new connection: [Errno 8] Name does not resolve')))
2021-10-12T03:04:31 unbound[80923] blocklist download https://perflyst.github.io/PiHoleBlocklist/android-tracking.txt (lines: 0 exclude: 0 block: 0)
2021-10-12T03:04:31 unbound[80923] blocklist download : unable to download file from https://perflyst.github.io/PiHoleBlocklist/android-tracking.txt (error : HTTPSConnectionPool(host='perflyst.github.io', port=443): Max retries exceeded with url: /PiHoleBlocklist/android-tracking.txt (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x2d0eeaf1790>: Failed to establish a new connection: [Errno 8] Name does not resolve')))
2021-10-12T03:03:01 unbound[80923] blocklist download https://dbl.oisd.nl/ (lines: 0 exclude: 0 block: 0)
2021-10-12T03:03:01 unbound[80923] blocklist download : unable to download file from https://dbl.oisd.nl/ (error : HTTPSConnectionPool(host='dbl.oisd.nl', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x2d0eeaf16d0>: Failed to establish a new connection: [Errno 8] Name does not resolve')))
2021-10-12T03:01:30 unbound[80923] blocklist download https://justdomains.github.io/blocklists/lists/adguarddns-justdomains.txt (lines: 0 exclude: 0 block: 0)
2021-10-12T03:01:30 unbound[80923] blocklist download : unable to download file from https://justdomains.github.io/blocklists/lists/adguarddns-justdomains.txt (error : HTTPSConnectionPool(host='justdomains.github.io', port=443): Max retries exceeded with url: /blocklists/lists/adguarddns-justdomains.txt (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x2d0eeaf17f0>: Failed to establish a new connection: [Errno 8] Name does not resolve')))
2021-10-12T03:00:00 unbound[80923] blocklist download : exclude domains matching ^(?![a-zA-Z_\d]).*|.*localhost$


This morning I woke up with Unbound down and I had to manually restart it.

In the general log, I have this:
2021-10-12T03:02:00 configctl[81123] error in configd communication Traceback (most recent call last): File "/usr/local/opnsense/service/configd_ctl.py", line 68, in exec_config_cmd line = sock.recv(65536).decode() socket.timeout: timed out

In the backend log, I have this:
2021-10-12T03:06:06 configd.py[28821] unable to sendback response [Error (1) ] for [unbound][dnsbl][None] {087185b8-8eec-400c-aeef-e66b2af7f3a3}, message was Traceback (most recent call last): File "/usr/local/opnsense/service/modules/processhandler.py", line 202, in run self.connection.sendall(('%s\n' % result).encode()) BrokenPipeError: [Errno 32] Broken pipe

This is my config:


I just did a manual update and that got an OK:
root@opnsense:/usr/local/opnsense/service/conf/actions.d # configctl unbound dnsbl
OK


No logs in unbound, except for this, which is good:
2021-10-12T10:25:30 unbound[49463] [49463:0] info: start of service (unbound 1.13.2).

based on the logs, something breaks in the connection between config_ctl and configd.
is there anything suspicious in System: Diagnostics: Activity?

> blocklist download done in 360.94 seconds (0 records)

This is longer than the backend component can wait for a qualified response (120 seconds) so I suppose the blocklist generation needs to be detached call in order to succeed. In this case it won't return any useful data I believe so there is no real reason to wait from the callers perspective (other than having to deal with partial or old data) when not waiting for the download to finish.


Cheers,
Franco

October 12, 2021, 05:38:23 PM #6 Last Edit: October 12, 2021, 05:40:23 PM by Fright
@franco
i think its just overall stats string from the end of blocklists.py
(each request waits no more then 120sec)
imho there is no correlation between this and unbound crash or socket errors?

October 12, 2021, 06:39:33 PM #7 Last Edit: October 12, 2021, 06:58:53 PM by Fright
@franco sorry, my bad
socket timeouts are associated with a non-working dns resolution (and the associated too long execution of the "action")
but do not explain the unbound crash. still not enough info (

@Nikotine
so it looks like the unbound was already dead by the time the blocklist update and your system uses the unbound as a system nameserver.
is there anything interesting in logs before the dnsbl update?

Last entries before the 3 AM's planned dnsbl update were from 3 hours before.
I hadn't realized, but apparently Unbound had been freaking out for about an hour.
It's too big to post here, so see the log here: https://pastebin.com/awRbx1Wg

I count 21 service restarts between 2240 and 2341.

So then I started scrolling through the general log and I found this gem:
2021-10-12T00:41:01 kernel pid 20424 (unbound), jid 0, uid 59, was killed: out of swap space
2021-10-12T00:39:23 monit[63615] 'opnsense.home.niek.be' mem usage of 78.9% matches resource limit [mem usage > 75.0%]
2021-10-12T00:37:23 monit[63615] 'opnsense.home.niek.be' mem usage of 78.3% matches resource limit [mem usage > 75.0%]
2021-10-12T00:35:23 monit[63615] 'opnsense.home.niek.be' mem usage of 78.5% matches resource limit [mem usage > 75.0%]
2021-10-12T00:33:22 monit[63615] 'opnsense.home.niek.be' mem usage of 78.1% matches resource limit [mem usage > 75.0%]
2021-10-12T00:31:22 monit[63615] 'opnsense.home.niek.be' mem usage of 77.5% matches resource limit [mem usage > 75.0%]


A few hours before I had enabled Suricata and I guess my system (HP T620 plus) was struggling.
Strange thing though is that currently Suricata is still enable and CPU and RAM usage are very low.
It also doesn't explain the first crash I had experience, back when I started this topic, because back then I hadn't enable Suricata yet.

But ok, I will continue to monitor if it remains stable.
Thanks for helping me debug this.



2021-10-12T00:41:01 kernel pid 20424 (unbound), jid 0, uid 59, was killed: out of swap space
2021-10-12T00:39:23 monit[63615] 'opnsense.home.niek.be' mem usage of 78.9% matches resource limit [mem usage > 75.0%]

it makes a lot of things logical
QuoteIt's too big to post here, so see the log here
asks for a password


October 13, 2021, 10:52:35 AM #12 Last Edit: October 13, 2021, 11:55:12 AM by Nikotine
Huge spikes in memory usage right after I enabled Suricata, until it crashed:


My system seems to be stable now, but not sure if I should be using Suricata on this device.
It has just 4 GB RAM.
I followed this guide and have hyperscan enabled on LAN in promiscuous mode because I have VLANs.

But perhaps I should stop posting here because actually I'm going off-topic.
Conclusion is it's not the updating of dnsbl that crashes Unbound.

EDIT: I've ordered 16 GB of RAM to give it some more breathing room.

it is difficult to say based only on the unbound logs. could you also show both the general and backend logs for the same time period?

General log: https://pastebin.com/LpgatFmg
Starting from 2319 when I enabled Suricata.
Monit started throwing alerts about memory from then on.
It then kills Unbound at 0041, Suricata at 0121 and, I hadn't noticed yet, python and php at 0151...
Finally you see me enabling Unbound and Suricata again at 0916.

Backend log doesn't go back that far anymore, I'm afraid.

Suricata is definitely not possible on this system.
I'll try again after upgrading RAM.