OPNsense Forum

Archive => 24.1, 24.4 Legacy Series => Topic started by: opnfwb on May 16, 2024, 02:58:23 PM

Title: Unbound DNS reporting stops randomly - 24.1.6
Post by: opnfwb on May 16, 2024, 02:58:23 PM
I've used the DNS Reporting feature since it launched in OPNsense. Awesome feature!

However I've noticed with the latest update it is now randomly stopping for me. Anywhere between 24 to 72hrs the reporting will just cease to have any new results. I don't see anything in the Unbound log or system log that correlates to the time when the Unbound reporting stops. Is anyone else seeing this behavior?

Information on my use case:
I am not using any DNS blocklists.
I am using Quad9 DNS with DoT

Steps I've tried so far:
Restarting Unbound service, this temporarily restores reporting but it will stop again randomly
Disabling, resetting, and then re-enabling the Unbound reporting function
Title: Re: Unbound DNS reporting stops randomly - 24.1.6
Post by: sarkyscouser on May 16, 2024, 03:55:17 PM
I sometimes have issues with Unbound, but it seems to be linked to WAN/ISP issues where my ISP either gives me a new IPv6 prefix or sometimes a new IPv4 address as well.

This causes both radvd and Unbound to crash and I have monit setup to report and restart them.
Title: Re: Unbound DNS reporting stops randomly - 24.1.6
Post by: franco on May 16, 2024, 04:53:39 PM
There was a fix in Unbound 1.20.0 for a restart-related issue reported to them by a community member. It's in today's 24.1.7.


Cheers,
Franco
Title: Re: Unbound DNS reporting stops randomly - 24.1.6
Post by: DEC670airp414user on May 16, 2024, 06:25:07 PM
i noticed this as well running quad 9 in DOT.

i went back to just unbound for a few days and have not had any issues.

i am on business so i guess ill get the patch a little later .

reports are not the end of the world for me
Title: Re: Unbound DNS reporting stops randomly - 24.1.6
Post by: opnfwb on May 16, 2024, 06:33:49 PM
Quote from: franco on May 16, 2024, 04:53:39 PM
There was a fix in Unbound 1.20.0 for a restart-related issue reported to them by a community member. It's in today's 24.1.7.


Cheers,
Franco
Thanks Franco!

Quote from: DEC670airp414user on May 16, 2024, 06:25:07 PM
i noticed this as well running quad 9 in DOT.

i went back to just unbound for a few days and have not had any issues.

i am on business so i guess ill get the patch a little later .

reports are not the end of the world for me
I forgot to mention this above but I tried switching from Quad9 DoT to CloudFlare DoT, and it was still randomly stopping for me. However I did not try to fall back on standard unencrypted DNS.
Title: Re: Unbound DNS reporting stops randomly - 24.1.6
Post by: bestboy on May 18, 2024, 08:42:50 AM
I have these reporting stops in which the Unbound page is empty also with 24.1.7. Log file does not contain anything except for

error: SERVFAIL <some.redacted.domain. AAAA IN>: exceeded the maximum number of sends

Restarting the unbound service helps for a while until is breaks again.
Title: Re: Unbound DNS reporting stops randomly - 24.1.6
Post by: DEC670airp414user on May 22, 2024, 10:36:33 PM
I've moved to another tls service with only one server and I am not longer having this issue.

Has anyone tried just one server?
Title: Re: Unbound DNS reporting stops randomly - 24.1.6
Post by: Ben S on May 23, 2024, 10:32:46 AM
When the logger process gets stuck the end of the ktrace file is as below.  So it's stuck trying to get a lock which is already held, possibly by itself, from the earlier fcntl F_SETLK call?  Hard to be sure though.  But I don't know what else would hold a lock on this file, and fstat indicates that nothing else is.

This is also the only call to flock at any time over many hours however the process had not been running for 24 hours so it doesn't seem to be due to the daily export/import which the logger script does.

Could it be a bug in duckdb?

67596 python3.11 CALL  fcntl(0x9,F_SETLK,0x820ff3940)
67596 python3.11 RET   fcntl -1 errno 35 Resource temporarily unavailable
67596 python3.11 CALL  fcntl(0x9,F_GETLK,0x820ff3940)
67596 python3.11 RET   fcntl 0
67596 python3.11 CALL  fcntl(0x9,F_SETLK,0x820ff3940)
67596 python3.11 RET   fcntl 0
67596 python3.11 CALL  openat(AT_FDCWD,0x882d53840,0x100001<O_WRONLY|O_CLOEXEC>)
67596 python3.11 NAMI  "/var/unbound/data/unbound.duckdb"
67596 python3.11 RET   openat 10/0xa
67596 python3.11 CALL  flock(0xa,0x2<LOCK_EX>)
root@barium:~ # tail -20 kdump.txt
67596 python3.11 CALL  fstatat(AT_FDCWD,0x87d89bed0,0x820ff3bd0,0)
67596 python3.11 NAMI  "/var/unbound/data/unbound.duckdb"
67596 python3.11 STRU  struct stat {dev=101, ino=3445703, mode=0100640, nlink=1, uid=59, gid=59, rdev=6911328, atime=1716448568.853894000, mtime=1716448559.667945000, ctime=1716448559.667945000, birthtime=1716370706.399868000, size=6303744, blksize=32768, blocks=12416, flags=0x0 }
67596 python3.11 RET   fstatat 0
67596 python3.11 CALL  openat(AT_FDCWD,0x87d89be70,0x100002<O_RDWR|O_CLOEXEC>)
67596 python3.11 NAMI  "/var/unbound/data/unbound.duckdb"
67596 python3.11 RET   openat 9
67596 python3.11 CALL  fstat(0x9,0x820ff39e0)
67596 python3.11 STRU  struct stat {dev=101, ino=3445703, mode=0100640, nlink=1, uid=59, gid=59, rdev=6911328, atime=1716448568.853894000, mtime=1716448559.667945000, ctime=1716448559.667945000, birthtime=1716370706.399868000, size=6303744, blksize=32768, blocks=12416, flags=0x0 }
67596 python3.11 RET   fstat 0
67596 python3.11 CALL  fcntl(0x9,F_SETLK,0x820ff3940)
67596 python3.11 RET   fcntl -1 errno 35 Resource temporarily unavailable
67596 python3.11 CALL  fcntl(0x9,F_GETLK,0x820ff3940)
67596 python3.11 RET   fcntl 0
67596 python3.11 CALL  fcntl(0x9,F_SETLK,0x820ff3940)
67596 python3.11 RET   fcntl 0
67596 python3.11 CALL  openat(AT_FDCWD,0x882d53840,0x100001<O_WRONLY|O_CLOEXEC>)
67596 python3.11 NAMI  "/var/unbound/data/unbound.duckdb"
67596 python3.11 RET   openat 10/0xa
67596 python3.11 CALL  flock(0xa,0x2<LOCK_EX>)
Title: Re: Unbound DNS reporting stops randomly - 24.1.6
Post by: bestboy on May 25, 2024, 10:56:48 AM
Interesting. I can remember an instance when Unbound was entirely broken. Not only just not reporting, but also not responding. Restarting the Unbound service via UI did not work and repeatedly failed with a service timeout error.
So, I used the CLI and I did a ps aux | grep unbound and killed all Unbound processes manually. Interestingly there were multiple flock processes which I killall'ed. Once all processes were killed, I could start Unbound in the UI, everything was fine again.
So, I indeed believe there may be an edge case in which the lock handling with flock does not work. I guess there should not ever be multiple flock processes.