Unbound crashing

Started by seed, August 22, 2023, 08:18:48 AM

Looks like the relevant parts to avoid reload had been removed in 2021... https://github.com/opnsense/core/commit/4a1bc9f8b5e65651e8

I don't see a reason not to revive this, but it's still a bit weird that this went unnoticed since then, or that new upstream bugs came to light at some point.


Cheers,
Franco

Quote from: franco on August 29, 2023, 09:21:33 PM
Sounds like more upstream bugs. Is this all on DoT? It's still rather buggy after all these years.

Is there something particular about DoT that you think is buggy? I've been running it for quite a while now and haven't noticed any issues.

This one was only just fixed in 1.18.0 (but we already had the fix in 23.7):

https://github.com/NLnetLabs/unbound/commit/52581f86447

Bottom line, it shows poor coding surfacing because ASLR is "breaking" the execution.

There were a few others over the years.


Cheers,
Franco

Unbound 1.18.0 runs fine here, nothing unusual in the logs so far.


For anyone interested it can be installed/tested now.

fetch https://pkg.opnsense.org/FreeBSD:13:amd64/snapshots/latest/All/unbound-1.18.0.pkg && pkg install unbound-1.18.0.pkg

September 11, 2023, 10:49:16 AM #34 Last Edit: September 11, 2023, 04:01:55 PM by karlson2k
Unbound (1.17.1) does not hang on OPNsense 23.7.3 when the log level is set to "Level 4" or "Level 3".

I've changed it back to "Level 1" to check the situation.

It would be nice to avoid reloading Unbound when the IP address is renewed to the same value. It looks like the Unbound cache is currently wiped every 10 minutes (the IP renewal period).
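To illustrate the idea of skipping the reload when the renewed address has not changed, here is a minimal shell sketch. It is not how OPNsense implements anything: the function name should_reload and the STATE_FILE location are hypothetical, and the commented usage line assumes the new address is available in a variable called NEW_IP.

```shell
#!/bin/sh
# Hypothetical helper, not OPNsense code: remember the last seen WAN address
# and report whether the new one differs from it.
# STATE_FILE is an assumed location chosen only for this sketch.
STATE_FILE="${STATE_FILE:-/tmp/wan_ip.last}"

should_reload() {
    new_ip="$1"
    old_ip=""
    if [ -f "$STATE_FILE" ]; then
        old_ip="$(cat "$STATE_FILE")"
    fi
    if [ "$new_ip" = "$old_ip" ]; then
        return 1    # same address as before: no reload needed
    fi
    printf '%s\n' "$new_ip" > "$STATE_FILE"
    return 0        # address changed (or first run): reload is warranted
}

# Usage sketch (NEW_IP is an assumed variable holding the renewed address):
# if should_reload "$NEW_IP"; then pluginctl dns; fi
```

With a check like this, a renewal to the same address would leave Unbound and its cache alone.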


With the "Level 1" log, Unbound again failed to restart, just like before: https://forum.opnsense.org/index.php?topic=35527.msg173533#msg173533. The minor difference is an additional error entry:
error: reading root hints /root.hints 28:37: Syntax error, could not parse the RR's class
after a similar error entry:
error: reading root hints /root.hints 2:12: Syntax error, could not parse the RR's type

An obvious workaround is to use more detailed logging. However, I don't want to wear out my SSD too quickly.

I will try with Unbound 1.18.0.


Unbound version https://pkg.opnsense.org/FreeBSD:13:amd64/snapshots/latest/All/unbound-1.18.0.pkg hasn't hung so far.
I'll try with so-reuseport: no to see whether the problem appears with multi-threaded Unbound.

1.18.0 will be in 23.7.4 on Thursday.


Cheers,
Franco

Unbound got stuck at 100% CPU again with so-reuseport: no.
I switched back to single-threaded. Several hours now without freezing.

Is this only with RSS or always?


Cheers,
Franco

I have Intel NICs and RSS is always enabled on my hardware.

I can try with RSS disabled just to check the results.

If it's RSS-related, it may be a driver-specific issue? I'm not seeing anomalies on the igb, igc and em drivers, with or without RSS enabled, on Unbound 1.17.1 or 1.18.0 with DoT and running on port 53.


This is from an APU4

root@OPNsense:~ # netstat -Q
Configuration:
Setting                        Current        Limit
Thread count                         4            4
Default queue limit                256        10240
Dispatch policy                 direct          n/a
Threads bound to CPUs          enabled          n/a

Protocols:
Name   Proto QLimit Policy Dispatch Flags
ip         1   1000    cpu   hybrid   C--
igmp       2    256 source  default   ---
rtsock     3    256 source  default   ---
arp        4    256 source  default   ---
ether      5    256    cpu   direct   C--
ip6        6   1000    cpu   hybrid   C--
ip_direct     9    256    cpu   hybrid   C--
ip6_direct    10    256    cpu   hybrid   C--

Workstreams:
WSID CPU   Name     Len WMark   Disp'd  HDisp'd   QDrops   Queued  Handled
   0   0   ip         0   200        0    77796        0   380237   458033
   0   0   igmp       0     0        0        0        0        0        0
   0   0   rtsock     0     0        0        0        0        0        0
   0   0   arp        0     0     3802        0        0        0     3802
   0   0   ether      0     0  1568827        0        0        0  1568827
   0   0   ip6        0    15        0        0        0   255964   255964
   0   0   ip_direct     0     0        0        0        0        0        0
   0   0   ip6_direct     0     0        0        0        0        0        0
   1   1   ip         0   117        0   190911        0   192458   383369
   1   1   igmp       0     0        0        0        0        0        0
   1   1   rtsock     0     6        0        0        0     3396     3396
   1   1   arp        0     0        1        0        0        0        1
   1   1   ether      0     0   441355        0        0        0   441355
   1   1   ip6        0    29        0        0        0   319858   319858
   1   1   ip_direct     0     0        0        0        0        0        0
   1   1   ip6_direct     0     0        0        0        0        0        0
   2   2   ip         0   129        0    68245        0  1618864  1687109
   2   2   igmp       0     0        0        0        0        0        0
   2   2   rtsock     0     0        0        0        0        0        0
   2   2   arp        0     0    41070        0        0        0    41070
   2   2   ether      0     0   802177        0        0        0   802177
   2   2   ip6        0    63        0    95122        0   297665   392787
   2   2   ip_direct     0     0        0        0        0        0        0
   2   2   ip6_direct     0     0        0        0        0        0        0
   3   3   ip         0   183        0   134480        0   361672   496152
   3   3   igmp       0     0        0        0        0        0        0
   3   3   rtsock     0     0        0        0        0        0        0
   3   3   arp        0     0        6        0        0        0        6
   3   3   ether      0     0   720936        0        0        0   720936
   3   3   ip6        0   264        0    81594        0   551511   633105
   3   3   ip_direct     0     0        0        0        0        0        0
   3   3   ip6_direct     0     0        0        0        0        0        0
root@OPNsense:~ # unbound-control -c /var/unbound/unbound.conf status
version: 1.18.0
verbosity: 1
threads: 4
modules: 3 [ python validator iterator ]
uptime: 64425 seconds
options: control(ssl)
unbound (pid 2627) is running...

As I wrote earlier, the issue is most likely triggered by frequent Unbound restarts (I have short-lived upstream DHCP leases, and every renewal of the WAN IP address unconditionally initiates an Unbound restart).
Having 8 threads probably also increases the probability of the freeze.

I think Unbound is freezing either at stop or at start.

Version 1.17.1 hangs quickly with so-reuseport: no (multi-threaded, according to the statistics). Without so-reuseport: no (all requests are handled by a single thread, according to the statistics) it hangs later.
Version 1.18.0 hangs with so-reuseport: no. Without it I haven't seen a freeze yet.
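For reference, the multi-threaded setup described here would look roughly like the following unbound.conf fragment. This is only a sketch: num-threads: 8 is taken from the thread count in my status output, and how the options end up in the generated configuration on OPNsense is not shown.

```
server:
    # Thread count matching the "threads: 8" status output
    num-threads: 8
    # Overrides the plugin default of "so-reuseport: yes"
    so-reuseport: no
```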

Detailed logging (level 3 and level 4) somehow prevents Unbound from freezing, so I cannot tell precisely what's triggering the issue.

The NIC driver is igb.
8 vCPU cores are available.

# unbound-control -c /var/unbound/unbound.conf status
version: 1.18.0
verbosity: 1
threads: 8
modules: 3 [ python validator iterator ]
uptime: 460 seconds
options: reuseport control(ssl)
unbound (pid 76726) is running...


The current so-reuseport: yes comes from the default plugin configuration and prevents requests from being handled by multiple threads.

To trigger the issue quickly, schedule pluginctl dns to be called every one or two minutes (or even every 15 seconds), preferably with so-reuseport: no in the Unbound configuration.
To fully reproduce my configuration, use request forwarding (I'm using a local DNSCrypt-Proxy, but I don't think it matters to Unbound where the requests are forwarded to).
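A rough way to script the repeated restarts from a shell, instead of using a scheduler, could look like the sketch below. The helper name repeat_cmd is hypothetical; only pluginctl dns and the 15-second interval come from this thread, and the run count of 240 is an arbitrary choice (about an hour at that interval).

```shell
#!/bin/sh
# Sketch: run a command a fixed number of times with a pause between runs.
# Usage: repeat_cmd COUNT INTERVAL_SECONDS COMMAND [ARGS...]
repeat_cmd() {
    count="$1"
    interval="$2"
    shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        # Report a failing run on stderr but keep going.
        "$@" || echo "run $i: command exited with status $?" >&2
        i=$((i + 1))
        # Sleep between runs, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
    return 0
}

# Example on the firewall: restart DNS every 15 seconds, 240 times.
# repeat_cmd 240 15 pluginctl dns
```

Checking unbound-control -c /var/unbound/unbound.conf status between runs should show whether Unbound is still responding.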

September 14, 2023, 12:51:29 PM #44 Last Edit: September 14, 2023, 01:05:36 PM by newsense
What happens if you ditch DNScrypt and forward to 1.1.1.1:853 instead? Can you still trigger it?

If performance is at stake, then Cloudflare is one of the fastest. DNScrypt, on the other hand, might not be the best tool here if you are using the stock one: it's quite old and in need of an update (maybe it should be removed from the plugin list?)