Spikes on recursive resolution average time in Unbound

Started by andrema2, September 16, 2020, 05:19:42 PM

September 16, 2020, 05:19:42 PM Last Edit: September 16, 2020, 09:18:33 PM by andrema2
Hi

I use Unbound in resolver mode. My unbound.conf is at the bottom of this post.

I have Grafana/InfluxDB collecting and displaying Unbound statistics.

I live in Brazil, so I know the root DNS servers are probably far from here. The average recursion time is high when the service starts and decreases as the days go by, down to about 130 ms. Then, suddenly, latency spikes and the average rises to 260 ms or more.

The latency to 8.8.8.8, which I use as the test IP in the gateway setup, is stable at about 6 ms with 0% loss, which leads me to believe this is not a WAN/network problem.

What can cause this behavior in Unbound? Is it avoidable? If so, how?

##########################
# Unbound Configuration
##########################

##
# Server configuration
##
server:
chroot: /var/unbound
username: unbound
directory: /var/unbound
pidfile: /var/run/unbound.pid
root-hints: /var/unbound/root.hints
use-syslog: yes
port: 53
verbosity: 1
extended-statistics: yes
log-queries: yes
hide-identity: yes
hide-version: yes
harden-referral-path: no
do-ip4: yes
do-ip6: no
do-udp: yes
do-tcp: yes
do-daemonize: yes
module-config: "validator iterator"
cache-max-ttl: 86400
cache-min-ttl: 0
harden-dnssec-stripped: yes
serve-expired: yes
outgoing-num-tcp: 10
incoming-num-tcp: 10
num-queries-per-thread: 4096
outgoing-range: 8192
infra-host-ttl: 900
infra-cache-numhosts: 50000
unwanted-reply-threshold: 0
jostle-timeout: 200
msg-cache-size: 50m
rrset-cache-size: 100m
num-threads: 4
msg-cache-slabs: 8
rrset-cache-slabs: 8
infra-cache-slabs: 8
key-cache-slabs: 8

auto-trust-anchor-file: /var/unbound/root.key

prefetch: yes
prefetch-key: yes
rrset-roundrobin: yes


# Interface IP(s) to bind to
interface: 0.0.0.0
interface: ::0
interface-automatic: yes



# DNS Rebinding
# For DNS Rebinding prevention
#
# All these addresses are either private or should not be routable in the global IPv4 or IPv6 internet.
#
# IPv4 Addresses
#
private-address: 0.0.0.0/8       # Broadcast address
private-address: 10.0.0.0/8
private-address: 100.64.0.0/10
private-address: 127.0.0.0/8     # Loopback Localhost
private-address: 169.254.0.0/16
private-address: 172.16.0.0/12
private-address: 192.0.2.0/24    # Documentation network TEST-NET
private-address: 192.168.0.0/16
private-address: 198.18.0.0/15   # Used for testing inter-network communications
private-address: 198.51.100.0/24 # Documentation network TEST-NET-2
private-address: 203.0.113.0/24  # Documentation network TEST-NET-3
private-address: 233.252.0.0/24  # Documentation network MCAST-TEST-NET
#
# IPv6 Addresses
#
private-address: ::1/128         # Loopback Localhost
private-address: 2001:db8::/32   # Documentation network IPv6
private-address: fc00::/8        # Unique local address (ULA) part of "fc00::/7", not defined yet
private-address: fd00::/8        # Unique local address (ULA) part of "fc00::/7", "/48" prefix group
private-address: fe80::/10       # Link-local address (LLA)


# Access lists
include: /var/unbound/access_lists.conf

# Static host entries
include: /var/unbound/host_entries.conf

# DHCP leases (if configured)
include: /var/unbound/dhcpleases.conf

# Domain overrides
include: /var/unbound/domainoverrides.conf

# Custom includes (plugins)
include: /var/unbound/etc/*.conf

# Unbound custom options
server:
private-domain: "plex.direct"




remote-control:
    control-enable: yes
    control-interface: 127.0.0.1
    control-port: 953
    server-key-file: /var/unbound/unbound_server.key
    server-cert-file: /var/unbound/unbound_server.pem
    control-key-file: /var/unbound/unbound_control.key
    control-cert-file: /var/unbound/unbound_control.pem


Thanks


Adding to the information and to the questions I have...

Is there a way to see which query caused the spike in response time? Last week I saw times of 9 seconds without any noticeable WAN outage.
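One possible approach, assuming the installed Unbound version supports the `log-replies: yes` option (it logs one line per reply, including the time to resolve): filter the log for uncached replies above a threshold. Below is a minimal Python sketch. The field layout (client IP, name, type, class, rcode, seconds, cached flag, size) follows the unbound.conf man page description, the helper name `slow_replies` is hypothetical, and the sample lines are made up, so check the pattern against real log lines first.

```python
import re

# Assumed layout of the tail of a log-replies line:
#   <client ip> <qname> <type> <class> <rcode> <seconds> <cached 0/1> <size>
REPLY_RE = re.compile(
    r"(?P<client>\S+) (?P<qname>\S+) (?P<qtype>\S+) (?P<qclass>\S+) "
    r"(?P<rcode>\S+) (?P<secs>\d+\.\d+) (?P<cached>[01]) (?P<size>\d+)\s*$"
)

def slow_replies(lines, threshold=1.0):
    """Return (seconds, qname, qtype, rcode) for uncached replies that
    took at least `threshold` seconds, slowest first."""
    hits = []
    for line in lines:
        m = REPLY_RE.search(line)
        if m is None:
            continue  # not a reply line (e.g. a log-queries line)
        secs = float(m.group("secs"))
        if m.group("cached") == "0" and secs >= threshold:
            hits.append((secs, m.group("qname"), m.group("qtype"), m.group("rcode")))
    return sorted(hits, reverse=True)

# Made-up sample lines, for illustration only.
sample = [
    "[1600287582] unbound[1234:0] info: 192.168.1.10 example.com. A IN NOERROR 0.000000 1 45",
    "[1600287590] unbound[1234:1] info: 192.168.1.10 slow.example.net. A IN NOERROR 9.012345 0 60",
    "[1600287591] unbound[1234:2] info: 192.168.1.11 example.org. AAAA IN SERVFAIL 2.500000 0 40",
]

for secs, qname, qtype, rcode in slow_replies(sample):
    print(f"{secs:9.6f}s {qname} {qtype} {rcode}")
```

In practice the lines would come from the syslog file instead of `sample`; the pattern only looks at the end of each line, so the syslog prefix should not matter.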

My average recursion time is now around 0.398 s.
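Since `extended-statistics: yes` is already set, the histogram in the output of `unbound-control stats_noreset` can show whether a handful of very slow recursions is pulling the average up while the median stays low. A rough Python sketch of that check follows; the function names are hypothetical and the sample numbers are invented, not taken from this system.

```python
def parse_stats(text):
    """Parse 'name=value' lines (as printed by unbound-control stats) into a dict of floats."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition("=")
        if not sep:
            continue
        try:
            stats[name] = float(value)
        except ValueError:
            pass  # skip non-numeric values
    return stats

def slow_share(stats, cutoff=1.0):
    """Fraction of recursive replies in histogram buckets whose lower
    bound is at or above `cutoff` seconds."""
    total = slow = 0.0
    for name, count in stats.items():
        if not name.startswith("histogram."):
            continue
        # Bucket names look like histogram.<sec>.<usec>.to.<sec>.<usec>
        parts = name.split(".")
        lower = int(parts[1]) + int(parts[2]) / 1_000_000
        total += count
        if lower >= cutoff:
            slow += count
    return slow / total if total else 0.0

# Invented sample output, for illustration only.
sample = """\
total.recursion.time.avg=0.398000
total.recursion.time.median=0.045000
histogram.000000.032768.to.000000.065536=90
histogram.000000.262144.to.000000.524288=6
histogram.000008.000000.to.000016.000000=4
"""

stats = parse_stats(sample)
print("avg   :", stats["total.recursion.time.avg"])
print("median:", stats["total.recursion.time.median"])
print("share of replies taking >= 1 s:", slow_share(stats))
```

A low median combined with a high average would suggest that a few queries (timeouts or unresponsive authoritative servers, for example) dominate the figure rather than a general slowdown.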

Hi @andrema2

Please read this document, try the CPU optimisation in particular, and report back. Please describe in detail how you applied the optimisations.

https://nlnetlabs.nl/documentation/unbound/howto-optimise/
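As a rough illustration of the CPU part of that guide: it recommends setting `num-threads` to the number of CPU cores and the four `*-slabs` options to a power of two close to `num-threads`. The Python sketch below prints such values. The helper names are hypothetical, and reading "close to" as "nearest power of two" is an assumption, not something the guide spells out.

```python
import os

def nearest_power_of_two(n):
    """Power of two closest to n (ties go to the lower one)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    lower = 1 << (n.bit_length() - 1)
    upper = lower << 1
    return lower if n - lower <= upper - n else upper

def cpu_settings(threads):
    """Suggested thread and slab options for a given thread count."""
    slabs = nearest_power_of_two(threads)
    return {
        "num-threads": threads,
        "msg-cache-slabs": slabs,
        "rrset-cache-slabs": slabs,
        "infra-cache-slabs": slabs,
        "key-cache-slabs": slabs,
    }

# Print lines that could be pasted into the server: section.
for key, value in cpu_settings(os.cpu_count() or 1).items():
    print(f"{key}: {value}")
```

For `num-threads: 4`, as in the posted configuration, this sketch gives slab values of 4, whereas the configuration currently sets them to 8.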

Kind regards.