Messages - DenverTech

#1
Resolved, see bottom.

I've been using Tailscale for a really long time, originally with the port, then with the plugin. It worked without any issue, following the directions for rules. That was until two nights ago. I rebooted my VPS, which is a remote Tailscale node, as I've done many times before without issue. I watched the Tailscale connection come up...but nothing went through (more details below, post-testing). Since then, I've been losing my mind trying to figure out wtf changed and how to fix it. Please help!

I've reviewed other posts recently about Tailscale issues and the few that seem to match what I'm seeing were abandoned by the person asking, so no resolution was ever mentioned.

Testing info/details:
* Tailscale ACL is (and always has been) set up with ICMP allowed all:all for diagnosing scenarios just like this.
* From OPNsense to the VPS and from the VPS to OPNsense, pings work fine. I can ping either the host's actual IP or the Tailscale IP of the device.
* From the VPS side, I can ping anything on the OPNsense network. That tells me the all:all ACL is working as intended and the connection is good.
* From the OPNsense side, ONLY OPNsense itself can ping things on the VPS side. It will not route any other device's traffic to it.
* I _do_ have the rules that allow LAN to get to Tailscale. Those have not been changed in more than a year.
* I _do_ have the NAT Outbound rules for Tailscale. Those also have not been changed in more than a year.
* I'm not seeing any dropped/blocked packets in the OPNsense logs at all. Instead, I see the firewall rules passing the traffic without issue (i.e., LAN to Tailscale), but the far Tailscale devices never receive it. This looks like a NAT Outbound rule issue...but those rules are there and enabled. Unless something changed in how they're supposed to be configured, they are as they've been for a year+.
* Routes are being advertised and are approved on Tailscale (again, not changed in forever)
* The node has not expired.

LAN: 192.168.0.0/24
OPNsense: 192.168.0.1 & 172.0.0.1
Tailscale: 172.0.0.0/24
VPS Tailscale IP: 172.0.0.2

From OPNsense, I can ping 172.0.0.2.
From LAN, I can ping 172.0.0.1, but not 172.0.0.2.
From VPS, I can ping 172.0.0.1 and all of 192.168.0.0/24.

Anyone got any ideas, before I lose what's left of my hair over this surprise issue?


EDIT: Figured it out. An IoT device was apparently advertising routes that don't exist and killing a lot of the true advertised routes. Gotta go throw an IoT device against the wall a few times now.
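
For anyone hitting something similar: a rough sketch of how one might spot conflicting route advertisements before they cause trouble. This is only an illustration in Python, assuming you have already collected each node's advertised subnets by hand; the node names and subnets in the example are hypothetical, and treating "overlapping prefixes from different nodes" as the failure mode is my assumption about what the IoT device was doing.

```python
import ipaddress
from itertools import combinations


def find_overlapping_routes(advertised):
    """Return (node_a, net_a, node_b, net_b) tuples for overlapping subnets.

    `advertised` maps a node name to the list of CIDR strings it advertises.
    Only overlaps between *different* nodes are reported.
    """
    routes = [
        (node, ipaddress.ip_network(cidr))
        for node, cidrs in advertised.items()
        for cidr in cidrs
    ]
    conflicts = []
    for (node_a, net_a), (node_b, net_b) in combinations(routes, 2):
        same_family = net_a.version == net_b.version
        if node_a != node_b and same_family and net_a.overlaps(net_b):
            conflicts.append((node_a, str(net_a), node_b, str(net_b)))
    return conflicts


# Hypothetical example data, not taken from a real tailnet.
example = {
    "opnsense": ["192.168.0.0/24"],
    "vps": ["10.0.0.0/24"],
    "iot-device": ["192.168.0.0/16"],
}

for conflict in find_overlapping_routes(example):
    print("overlap:", conflict)
```

Any pair it prints is worth a look in the Tailscale admin console, since two nodes claiming the same address space is a likely candidate for traffic going to the wrong place.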
#2
Quote from: cookiemonster on December 04, 2024, 11:42:56 PM
For AdGH logging specifically there's another way. I can't really remember how long ago I did this modification but a few AdGH versions ago they introduced some additional logging capabilities by config. I update AdGH from time to time using the UI. The installation is from mimugmail's repo.
Please if you do it, check you have a version that has this capability and take notice of the schema changes. Make a backup of the config file: /usr/local/AdGuardHome/AdGuardHome.yaml before any changes.
I am on Version: v0.107.54 of AdGH.
On my config I have added/changed the following settings in the log section:
log:
  enabled: true
  file: /var/log/AdGuardHome/AdGuardHome.log
  max_backups: 30
  max_size: 10
  max_age: 31
  compress: false
  local_time: false
  verbose: false

This is very recent; before that I had:
log:
  enabled: true
  file: /var/log/AdGuardHome/AdGuardHome.log
  max_backups: 3
  max_size: 100
  max_age: 3
  compress: false
  local_time: false
  verbose: false

but I realised I wanted to rotate earlier and not have 100 MB files. I have just recently made the change so will need to keep an eye on successful rotation.
So, take a backup, make your desired changes, and restart AdGH from System: Diagnostics: Services. Read the new log file to see it starting. Keep an eye on the log files to make sure they are growing and rotating as expected.
More info on settings https://github.com/AdguardTeam/AdGuardHome/wiki/Configuration


Awesome, thank you! That should help a ton with troubleshooting. I've got it logging now and will monitor.
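
In case it helps anyone else following along, here is a minimal sketch of the "take a backup first" step plus a quick estimate of how much disk the rotation settings could use. It assumes max_size is in megabytes (the quote mentions 100 MB files) and that the worst case is the active file plus every retained backup; both function names are my own, not part of AdGH.

```python
import shutil
import time
from pathlib import Path


def backup_config(config_path, stamp=None):
    """Copy the config next to itself as <name>.<timestamp>.bak and return the new path."""
    src = Path(config_path)
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    dest = src.with_name(f"{src.name}.{stamp}.bak")
    shutil.copy2(src, dest)  # copy2 also preserves timestamps and permissions
    return dest


def worst_case_log_mb(max_backups, max_size):
    """Rough upper bound in MB: the active log file plus all retained backups."""
    return (max_backups + 1) * max_size


# Settings from the quoted post: new values first, then the previous ones.
print(worst_case_log_mb(30, 10))   # new settings
print(worst_case_log_mb(3, 100))   # previous settings

# Example call using the path from the quoted post (run on the firewall itself):
# backup_config("/usr/local/AdGuardHome/AdGuardHome.yaml")
```

By that estimate the new settings keep more, smaller files while using slightly less space overall than the old ones.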
#3
Quote from: Patrick M. Hausen on December 04, 2024, 01:43:49 PM
Quote from: DenverTech on December 04, 2024, 07:33:57 AM
Out of curiosity, is there any way to log errors with plugins (such as from console)? If I could get a log of what is exploding in AGH, that would make this a lot easier.

Nothing to find in the /usr/local/AdGuardHome/data directory?

Also, log on to OPNsense via SSH, use drill to specifically throw a request at the AdGuard Home port, observe what happens.

I checked the AGH folder, including data, but all it has is the query log. I ran queries and only the query log changes. Nothing appears to be an "informational" log.

Quote from: cookiemonster on December 04, 2024, 01:33:50 PM
there seem to be a couple of hotfixes today for 24.7.10.x. I suggest you retest after system is stable again.

Good point. I've been watching those. They don't really describe the issue I'm having, but yeah, good call.
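
For reference, the drill suggestion quoted above boils down to sending one DNS query straight at the AGH listener and seeing whether anything answers. A rough Python equivalent, using only the standard library, might look like the sketch below. The server address and port in the usage comment are placeholders I made up; use whatever AGH is actually bound to.

```python
import socket
import struct


def build_dns_query(name, query_id=0x1234):
    """Build a minimal DNS query packet for an A record of `name`."""
    # Header: id, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def probe_dns(server, port, name, timeout=3.0, query_id=0x1234):
    """Send one UDP query; return the reply's RCODE, or None if nothing usable came back."""
    query = build_dns_query(name, query_id)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts and "port unreachable"
            return None
    if len(reply) < 12:
        return None
    reply_id, flags = struct.unpack("!HH", reply[:4])
    if reply_id != query_id:
        return None
    return flags & 0x000F  # 0 means NOERROR


# Placeholder address and port, adjust to your setup:
# print(probe_dns("192.168.0.1", 53, "example.com"))
```

Running it in a loop during one of the "sites not loading" windows would show whether the listener is really dropping queries or just answering slowly.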
#4
Out of curiosity, is there any way to log errors with plugins (such as from console)? If I could get a log of what is exploding in AGH, that would make this a lot easier.
#5
I used the community plugin from maxit/mimugmail, so it's installed via the plugin. Nothing too fancy or special there.
#6
This isn't the first time I've seen this, and I haven't figured out the cause yet. Short version is that every few updates to OPNsense, AGH crashes badly and can't be restarted without wiping it out and starting fresh. I'm trying to track down the cause and resolve it once and for all. I've seen a few people post with similar issues, but they mostly just get told off (i.e., "Doesn't happen to me, so it's your problem" type of replies). I acknowledge this isn't happening to all firewalls, but I have two of them doing this...and oddly, they don't always break on the same updates.

Ok, here's the pile of info on what I'm seeing and tested, using today's crash as an example. Previous run-ins with this issue were identical:

  • AGH and OPNsense work great for months at a time. Not a single crash, failure, or error.
  • AGH is the sole DNS on the firewall. I do not use Unbound.
  • When this occurs, it's usually after an update to OPNsense or the AGH container. Either can trigger it, but the results are the same.
  • About 2hrs after the update to 24.7.10_1, I began getting alerts from my Uptime Kuma system that AGH wasn't responding. Except...it was when I checked. Appears almost as if the AGH plugin crashed and restarted.
  • About 4hrs after the update to 24.7.10_1, I began getting reports that users had sites not loading. If they refresh, it works fine. Again, sounds like AGH crashing and restarting, but my pings to it never show a drop.
  • At 5hrs, I got a bunch of panic-calls. The internet was 100% down. In fact, OPNsense had crashed. No ping, but all the lights are still on. Had to hard power-off the system. Unfortunately, this wipes my memory-logs, so I have no idea why OPNsense died. Aside from patch-reboots, it's been running for about 1yr without being fully powered off, so this is new.
  • It came back up and AGH won't start. I gave it a bit and it never started. Since there are no AGH logs, I have no idea why. A manual start runs for about 3 seconds, then stops again. These are identical symptoms to the last time this happened about 4mo ago.
  • As I was advised last time, I uninstalled AGH and reinstalled, then migrated back the yaml file. Same results. It sometimes fails to start, sometimes just crashes.
  • Again, as advised last time, I reinstalled AGH and started with a clean yaml. It starts fine and has no issues or indications of a problem. Last time I had to manually rebuild everything for no good reason. Looks like I need to again.
  • Plenty of others have had issues with updates breaking AGH, so there's definitely something going on. However, I don't know what the trigger is or what's actually breaking and would appreciate some advice, guidance, or whatnot. Starting clean with AGH every 2-3 OPNsense updates isn't viable.
  • In the interim, Unbound works fine. I'd really prefer to use AGH, but if it's going to eat its own face every few updates, that may not be an option.

Any ideas or things I may not have tried?
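
One thing I may try next time, since the service dies about 3 seconds after a manual start without leaving a log: run the binary in the foreground and capture whatever it prints before exiting. A generic sketch of that idea is below. The command in the usage comment is a placeholder, because I'm not asserting where the plugin puts the binary or which flags it needs.

```python
import subprocess


def _as_text(data):
    """Normalize captured output, which may be None, bytes, or str."""
    if data is None:
        return ""
    if isinstance(data, bytes):
        return data.decode(errors="replace")
    return data


def run_and_capture(cmd, timeout=15.0):
    """Run `cmd` in the foreground and return (exit_code, combined_output).

    exit_code is None if the process was still running at `timeout`
    (subprocess.run kills it in that case), which here would be good news.
    """
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired as exc:
        return None, _as_text(exc.stdout) + _as_text(exc.stderr)
    return done.returncode, done.stdout + done.stderr


# Placeholder command, substitute the real binary path and arguments:
# code, output = run_and_capture(["/path/to/AdGuardHome"])
# print("exit code:", code)
# print(output)
```

Stop the service first so the foreground run isn't fighting it for the same ports.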
#7
Quote from: atoll on August 10, 2024, 03:28:40 PM
That's essentially it: After the 24.7.1 update, the Adguard plugin is visible, but refuses to start.

Any ideas?

Best e.

I had the same issue. It would start and instantly stop. Reviewed the config and everything looked fine. Ended up deleting the yaml configuration file and redoing the whole configuration. Worked fine with a fresh config, so it acts like there was some setting it didn't like in old configs.
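
If you kept a copy of the old yaml, diffing it against the freshly generated one narrows down which settings differ, so they can be re-added a few at a time until it breaks. A small sketch with Python's difflib, assuming both files are plain text you can read; the file names in the usage comment are hypothetical.

```python
import difflib


def config_diff(old_text, new_text):
    """Return unified-diff lines between the old config and the fresh one."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="old.yaml",
            tofile="fresh.yaml",
            lineterm="",
        )
    )


# Hypothetical file names:
# old = open("AdGuardHome.yaml.old").read()
# new = open("AdGuardHome.yaml").read()
# print("\n".join(config_diff(old, new)))
```

Lines starting with "-" exist only in the old config and are the candidates for whatever setting the new version didn't like.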
#8
No one's had wifi-calling issues but me? I find that hard to believe. :-p
#9
Of late, we've been having loads of issues with our cell phones not getting calls (i.e., they don't ring, but then there's a voicemail waiting), or when we make calls, it's just dead air for 20-30 seconds before it finally starts ringing. The problem vanishes if we switch from wifi to the mobile network.

For the longest time, I blamed my cell carrier and/or my wifi. Changed carriers, same issue. Changed wifi, same issue. Very odd. So, with nowhere else to look, I wondered if there's something that might be causing OPNsense to drop wifi-calling connections, especially if idle for a while? For example, we leave the phone untouched for 4hrs, then try to make a call...would OPNsense have some kind of issue with re-establishing the ePDG connection? Some sort of timeout maybe?

I know I'm grasping at straws, but figured I'd reach out to the experts and see if anyone knew!

Setup:
* OPNsense (current version)
* Wifi via Ruckus AP, with ePDGs enabled. Ruckus has ruled out all possible issues on their end after way too many hours on the phone with them. Tested with a Unifi AP and had the same results, so I begrudgingly agree with Ruckus.
* Android 12/13 phones (same issue occurs for people visiting this location and using wifi)
* All devices seem to work fine/better when not on our building wifi
#10
Quote from: rdunkle84 on March 12, 2024, 05:06:46 PM
I noticed that when creating the cloudflare api token, Acme required:
Zone Resources set: Include | All zones.   This appears to be the problem.
To sum it up:
Zone | DNS | Edit
Zone Resources | Include | All Zones
Client IP (not using this field)
TTL | set a valid date range
This appears to work OK.

Tried this. Still says the domain is invalid. I've got all zones allowed and a TTL, as well as the edit permissions.
#11
Does seem to be the case! I definitely didn't mean to break the acme plugin. :D
#12
Lacking other options, I did try the Caddy plugin. No luck...but different results.

For example, it's set up so that some.sitename.com points to the handler 192.168.0.1, port 1111. I go to some.sitename.com:443 and it gives me a secure blank page. It does not forward to 192.168.0.1:1111 at all.

Progress, maybe? Still would love to know why the built-in plugin isn't working, but no one seems to want to talk about it, judging by the other threads about this. :)
#13
I really don't want to learn Caddy to fix an issue that just cropped up with the built-in system. I'll consider that a last resort.

Side-note...tested again using the global API key. Also says the domain is invalid.
#14
I've seen and read many posts about issues with Cloudflare, but have been using it without issue for about 1-2 years, using the generated API keys from CF. I use a wildcard domain and all renewals worked from 2022 until about 70 days ago. Then, mysteriously, they stopped working with the errors below. Hoping someone has some ideas on this as I've been beating my head against it for days.

Issue:

  • Starting about 70 days ago, the renewals began failing with "invalid domain" and "Error add txt for domain"
  • In the past, others have fixed this with updates (I'm current on both OPNsense and plugins) or new API keys (tried that)
  • Rebuilt all stages of the cert and issue persists
  • Tried with a single subdomain and issue persists

Tested:

  • Recreated the verification challenge, as that's where it's failing. Same errors.
  • Verified/recreated the API key permissions in case something changed on CF's end. Same errors.
  • Switched to a single subdomain, rather than wildcard. Same errors.
  • Recreated all stages of the request. Same errors.
  • Created a new API key with correct permissions. Same errors.
  • Contacted CloudFlare. They blame OPNsense, because of course they do.
  • NOTE: The API key does have zone read and dns edit permissions

See: https://github.com/acmesh-official/acme.sh/wiki/How-to-debug-acme.sh
Please add '--debug' or '--log' to check more details.
Error add txt for domain:_acme-challenge.somedomain.com
invalid domain
Adding txt value: <somestring> for domain: _acme-challenge.somedomain.com
Getting webroot for domain='*.somedomain.com'
Getting domain auth token for each domain
Single domain='*.somedomain.com'
Using CA: https://acme-v02.api.letsencrypt.org/directory
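
One way to take OPNsense out of the equation is to ask the Cloudflare API directly whether the token is valid and whether it can see the zone. My understanding (not confirmed by the log above) is that "invalid domain" usually means the DNS hook could not find a matching zone with the credentials it was given. The sketch below only uses Python's standard library; the token and zone name are placeholders.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.cloudflare.com/client/v4"


def build_request(token, path):
    """Build an authenticated GET request for a Cloudflare API path."""
    return urllib.request.Request(
        f"{API_BASE}{path}",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def verify_token_request(token):
    """Request that asks Cloudflare whether the API token is valid and active."""
    return build_request(token, "/user/tokens/verify")


def zone_lookup_request(token, zone_name):
    """Request that lists zones matching `zone_name` that the token can see."""
    query = urllib.parse.urlencode({"name": zone_name})
    return build_request(token, f"/zones?{query}")


def fetch_json(request, timeout=10.0):
    """Send the request and decode the JSON body (raises HTTPError on 4xx/5xx)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


# Placeholders, substitute your own token and zone:
# print(fetch_json(verify_token_request("YOUR_TOKEN")))
# print(fetch_json(zone_lookup_request("YOUR_TOKEN", "somedomain.com")))
```

If the zone lookup comes back with an empty result list even though the token verifies, the token's zone scope would be the first thing I'd recheck.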
#15
I agree, but would still like to figure out why I can't port redirect LAN to LAN.