Unbound just hit 100% CPU on one core again...

Started by lar.hed, February 14, 2024, 03:36:50 PM

Previous topic - Next topic
i dont use any overrides. Only customization i use for Unbound is DoT.

I don't use overrides either.

I "only" use BlockList and DoT.


A DNS loop is just one thing that might explain such a hangup. What I wanted to point out is that obviously, not everybody has those problems, so it seems to be triggered by something special in your setup.

Now, that you have one thing in common, namely DoT:

I do not use it and have no problem. Did you try to disable it?
Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 440 up, Bufferbloat A+

Regarding DoT: No not yet.

And there is a reason for that: This config I am running, as I mentioned before, was running just fine until I upgraded to 23.7.x - after that upgrade this Unbound challenge has been around. Different patches has been tested, but none so far have solved this challenge. So currently, as mentioned, I have removed DHCP registration within Unbound - which, fingers crossed and all that, seems to make this just so much more stable.

So why might this be an issue for say myself and not for others - of course this is a very good question, which we still are looking for answers for. My thinking is: for some reason I think this will happen sooner or later for everybody (that are using Unbound that is), but for some reason my setup is faster (!?) to trigger this. So say that you upgrade your OPNsense as soon as something is released, or just a reboot once a week or something, this will so to speak reset the eggtimer.... And I seem to drag my feet abit around before I upgrade...

Is it possible this is realated to
  DNSSEC validators -- denial-of-service/CPU exhaustion from KeyTrap and NSEC3 vulnerabilities
  CVE: CVE-2023-50868
  CVE: CVE-2023-50387
  WWW: https://vuxml.freebsd.org/freebsd/21a854cc-cac1-11ee-b7a7-353f1e043d9a.html
which is patched in unbound 1.19.1?

Hopefully that update will find it's way to opnsense soon.

I have been running this since I posted earlier.
I cleared all my logs.

Within the settings screen under general for unbound i only have ticked;
enable unbound
enable DNSSEC support
Flush DNS Cache during reload

I have not seen that error again so far since adding the patch, its been a few days so far. Device has been pretty stable so far.

@joshndroid which of the 3 patches did you apply ?

February 17, 2024, 08:30:00 PM #37 Last Edit: February 18, 2024, 04:55:43 AM by joshndroid
I had applied the 3 from the original thread that is linked in the 23.7 unbound thread.. However there was a hot fix update so not sure if they stuck as I never reapplied them - https://forum.opnsense.org/index.php?topic=35527.msg187426#msg187426

I then applied the patch in this thread  - https://forum.opnsense.org/index.php?topic=37973.msg188912#msg188912

Regarding the switch vs direct plug discussion.  I've had many times in my career where I was sure something was completely unrelated and didn't test it until I had gone down many other rabbit holes.  Sometimes it fixed the problem, sometimes it didn't.  But switches are cheap and easy to test.  Because sometimes computers gonna computer.

I'm using DoT, DNSSEC, and DHCP registrations and I don't have any issues with Unbound.  I'm also using multiple blocklists, but not anything like HAProxy, AdGuard, etc.  Just wanted to provide some more data.

I'm still waiting for the next 100% CPU. Currently I am running DoT, DNSSEC, BlockList and NO DHCP registration.

If this works for say 14 days (I will not update for any fixes until crash or something... bad idea partly I know, but then again...), I will attach a switch (I have a rather simple D-Link somewher, it will do for this test) to see if anything changes (with DHCP re-enabled for registration). I do agree with CJ about a switch that sometimes can move the problem away...

I'm not out of ideas to test, it just takes for ever...

EDIT: And I do run ONLY IP4 - no IP6 at all.

I don't use unbound myself but a quick internet search gives comments that you should disable DNSSEC & caching - I don't know if that will help.

I prefer to run my DNS servers on the LAN and fwiw, I use Powerdns Authoritative & Recursor plus dnsdist for load balancing - never had any problems with that configuration.
Regards


Bill

I just had a strange thing happening: Unbound just stopped - I have no clue why, nothing in the log for Unbound, or in System/Log/General. My Monit setup started Unbound again so no biggi in that way, but why just stop?

Hope it was just a glitch...

Yeasterday I did a firmware update, so now I am running:
OPNsense 24.1.2-amd64
FreeBSD 13.2-RELEASE-p10
OpenSSL 3.0.13


Within less than 24h I got another 100% CPU - and my Monit setup kill Unbound and everything got back to normal. I have combined three logs: General (under System), Monit and Unbound - it is way easier to follow what happens (why is that not a built in option, to look at several logs at the same time - or is it and I just do not know how to combine logs?):
LOG Date Severity Process Line
Monit 2024-02-22T00:09:22 Informational monit 'UnboundHighCPU' status succeeded (1) -- exit: Expression Syntax.
Unbound 2024-02-22T00:07:29 Informational unbound [17657:1] info: generate keytag query _ta-4f66. NULL IN
Unbound 2024-02-22T00:07:29 Informational unbound [17657:0] info: start of service (unbound 1.19.1).
Unbound 2024-02-22T00:07:29 Notice unbound [17657:0] notice: init module 2: iterator
Unbound 2024-02-22T00:07:29 Notice unbound [17657:0] notice: init module 1: validator
Unbound 2024-02-22T00:07:22 Informational unbound [17657:0] info: dnsbl_module: blocklist loaded. length is 3687688
General 2024-02-22T00:07:18 Error opnsense /usr/local/sbin/pluginctl: The command '/bin/kill -'TERM' '59284''(pid:/var/run/unbound.pid) returned exit code '1', the output was 'kill: 59284: No such process'
General 2024-02-22T00:07:18 Notice opnsense /usr/local/sbin/pluginctl: plugins_configure unbound_start (execute task : unbound_configure_do(1))
General 2024-02-22T00:07:18 Notice opnsense /usr/local/sbin/pluginctl: plugins_configure unbound_start (1)
Monit 2024-02-22T00:07:18 Informational monit 'UnboundHighCPU' start: '/usr/local/sbin/pluginctl -c unbound_start'
Unbound 2024-02-22T00:07:18 Informational unbound [17657:0] info: dnsbl_module: updating blocklist.
Unbound 2024-02-22T00:07:18 Notice unbound [17657:0] notice: init module 0: python
Monit 2024-02-22T00:07:13 Informational monit 'UnboundHighCPU' stop: '/usr/local/opnsense/scripts/OPNsense/Monit/Unbound_Kill.sh stop'
Monit 2024-02-22T00:07:13 Informational monit 'UnboundHighCPU' trying to restart
Monit 2024-02-22T00:07:08 Error monit 'UnboundHighCPU' status failed (100) -- no output
Unbound 2024-02-22T00:03:14 Informational unbound [59284:4] info: generate keytag query _ta-4f66. NULL IN
Unbound 2024-02-22T00:03:14 Critical unbound [59284:3] fatal error: Could not initialize thread
Unbound 2024-02-22T00:03:14 Informational unbound [59284:0] info: start of service (unbound 1.19.1).
Unbound 2024-02-22T00:03:14 Informational unbound [59284:3] info: server stats for thread 3: requestlist max 0 avg 0 exceeded 0 jostled 0
Unbound 2024-02-22T00:03:14 Informational unbound [59284:3] info: server stats for thread 3: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Unbound 2024-02-22T00:03:14 Error unbound [59284:3] error: Could not set root or stub hints
Unbound 2024-02-22T00:03:14 Error unbound [59284:3] error: reading root hints /root.hints 2:6: Syntax error, could not parse the RR's type
Unbound 2024-02-22T00:03:14 Notice unbound [59284:0] notice: init module 2: iterator
Unbound 2024-02-22T00:03:14 Notice unbound [59284:0] notice: init module 1: validator
Unbound 2024-02-22T00:03:07 Informational unbound [59284:0] info: dnsbl_module: blocklist loaded. length is 3687688
General 2024-02-22T00:03:06 Notice opnsense /usr/local/etc/rc.linkup: DEVD: Ethernet detached event for opt2(igb2)
General 2024-02-22T00:03:06 Notice kernel <6>igb2: link state changed to DOWN
Unbound 2024-02-22T00:03:03 Informational unbound [59284:0] info: dnsbl_module: updating blocklist.
Unbound 2024-02-22T00:03:03 Notice unbound [59284:0] notice: init module 0: python
General 2024-02-22T00:03:00 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dns (execute task : unbound_configure_do())
General 2024-02-22T00:03:00 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dns (execute task : dnsmasq_configure_do())
General 2024-02-22T00:03:00 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dns ()
General 2024-02-22T00:03:00 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dhcp (execute task : dhcpd_dhcp_configure())
General 2024-02-22T00:03:00 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dhcp ()
General 2024-02-22T00:03:00 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure ipsec (execute task : ipsec_configure_do(,opt2))
General 2024-02-22T00:03:00 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure ipsec (,opt2)
General 2024-02-22T00:03:00 Notice opnsense /usr/local/etc/rc.linkup: ROUTING: entering configure using 'opt2'
General 2024-02-22T00:03:00 Notice kernel <6>igb2: link state changed to UP
Unbound 2024-02-22T00:03:00 Informational unbound [50504:0] info: service stopped (unbound 1.19.1).
General 2024-02-22T00:02:59 Notice opnsense /usr/local/etc/rc.linkup: DEVD: Ethernet attached event for opt2(igb2)
General 2024-02-22T00:01:38 Notice dhclient dhclient-script: Creating resolv.conf
General 2024-02-22T00:01:38 Notice dhclient dhclient-script: Reason RENEW on igb1 executing


Do NOT get confused about the roots.hint error - it is not the problem, I have showed that earlier - it is a symptom maybe? but it is not what is the challenge, it is just "a thing" that seems to happen....

Anyway, for the record: DNSSEC, BlockLists, DoT enabled and NO DHCP registration --> this kills my idea of something DHCP related. Instead I have to admit that this might very well be interface related. I have earlier written I do not think so, and part of me are still in that odd corner, however when 23.7 was release I had problems with my direct connected PC (LAN if you will) as one might recall: https://forum.opnsense.org/index.php?topic=36807.msg180191#msg180191

So it does indeed look like there is a connection with link up/down. However as I wrote earlier, this worked in 23.1 - and from 23.7 it is a bit of a challenge...

I will go and search for my DLink switch, so I can remove the link up/down from this and see if this works better. Do note however that I have for example my HomeAssistant rpi4 server direct attached to another port, and a Linux Server to yet another port. However they run 24x7 so no link up/down on them...

I'll be back  8)

hi
QuoteDo NOT get confused about the roots.hint error - it is not the problem, I have showed that earlier - it is a symptom maybe?
sorry, what make you think so? from what i see - its (one of ?) the reason
QuoteI did a firmware update
so 'use internal hints' patch is gone?


Quote from: Fright on February 22, 2024, 07:07:56 AM
hi
QuoteDo NOT get confused about the roots.hint error - it is not the problem, I have showed that earlier - it is a symptom maybe?
sorry, what make you think so? from what i see - its (one of ?) the reason
QuoteI did a firmware update
so 'use internal hints' patch is gone?

Well I can of course not say for 100% sure that the roots.hint is not the challenge - but I had this 100% CPU Unbound challenge even with that patch installed. That patch got overwritten I guess when I upgraded to latest, since Unbound was part of that firmware upgrade (ports: unbound 1.19.1) - the option that was installed with that patch got removed anyway so it is not available anymore (except if I re-install the patch of course).