Messages - pjw

#1
Just in case anyone else is following this and wasn't aware, this does appear to be fixed in one of the last 2 updates.
#2
I'm hoping to bump this since I'm continuing to have this issue, plus what appears to be a regression. I'm updated to the latest 24.7.8 release.

I had my second WAN link go down this morning, and I still had to bring down the interface in OPNsense, apply, then re-enable the interface and apply again, before it saw that the WAN link was actually up. This time I tried rebooting the Starlink just to see if that link toggle might do the trick, but still no dice. OPNsense seems to refuse to bring the link back up without manual intervention.

What seems to be a regression is that after manually toggling the interface and bringing the gateway back up, connections that are supposed to go over that gateway group do not fail back. This was the case after my initial 24.7 upgrade, and somewhere between then and now it was fixed. Now it is broken again. The only way I can fix this is to manually fail the main WAN link, or reboot my OPNsense mid-day. Neither is a great solution.

I'm hoping a dev sees this and can either confirm these are known issues, or let me know if additional information would help troubleshoot. I'm more than happy to provide anything I can if it helps get these issues under control.
#3
I have a multi-WAN setup with two uplinks (one to broadband, one to Starlink). I have rules in place to split traffic between them: home traffic to broadband, work traffic to Starlink. Works great still.

Note on all of this: this setup, with my gateway groups and all my firewall rules, had been running fine through the previous major release and into this one.

What I'm seeing, though, is that the link health check will sometimes kick in because Starlink has a hiccup, fail the link, and initiate failover in the gateway group. What doesn't happen is the recovery: whatever monitors the Starlink side for the link coming back never brings that interface up again. I have to log into the UI, toggle off the interface my Starlink is plugged into, Apply, then toggle it back on and Apply again. Then poof, the link is back and we're happy. I've tried power cycling my Starlink router (it's in bridge mode) and that has not helped.
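
For what it's worth, the toggle can also be done from a shell on the box, which saves the GUI round-trips. A rough sketch, assuming the Starlink port is igb1 (substitute your actual interface, and note I'm not certain a bare ifconfig toggle triggers exactly the same reconfiguration as the GUI Apply):

    # Hypothetical interface name; list yours with: ifconfig -a
    ifconfig igb1 down
    sleep 2
    ifconfig igb1 up
    # OPNsense's configd can also re-run interface configuration,
    # which may be closer to what the GUI Apply does (opt1 is a guess):
    configctl interface reconfigure opt1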

Worth noting that I tested bringing my broadband link down, and the same thing happens.  I have to manually toggle the port for the gateway to be brought back online.

I seem to recall that right after the 24.7 rollout, some folks were having issues getting links back up in a failover scenario. I had different problems (since resolved), so I never paid attention to it. But it does seem like there is still an issue here.

Happy to try anything or share any details of my config if anyone is willing and able to help debug.
#4
I finally figured out what is going wrong here. I ended up looking at the firewall rules themselves via the command line, and saw there was a new catch-all rule on my LAN interface that matched and directed all packets to the default gateway, which in this case is the WAN link with the higher-priority metric of my two.

Looking in the GUI, I found a new hidden sshlockout rule that seems to have been added during the upgrade; I did not have it on that interface before. The !sshlockout rule matched everything inbound from my LAN net, destined anywhere, and it sat before my rules that split traffic between WAN2 and WAN1 (work and everything else, respectively).
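
If anyone wants to check their own box for the same thing, this is roughly how I went looking; plain pfctl and the generated ruleset file, nothing exotic (the grep pattern is just what worked for me):

    # Dump the loaded pf ruleset and look for the lockout rule
    pfctl -sr | grep -i sshlockout
    # The generated ruleset source is also on disk and easier to read
    grep -in sshlockout /tmp/rules.debug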

I ended up keeping the !sshlockout rule, but modified it to a destination of LAN net as well (keeping local inbound traffic open). I don't need the sshlockout enabled, since I allow no external login inbound from a WAN interface.

Anyway, this is now working. I verified I can fail over and fail back correctly between my tier 1 and tier 2 gateways. Apologies that I didn't find this sooner, but I hope this helps anyone else with a multi-WAN setup get it working post-upgrade.
#5
Quote from: patrick3000 on August 30, 2024, 01:04:49 AM
This recent post may shed some light on this issue: https://forum.opnsense.org/index.php?topic=42552.0.

If WAN cannot ping remote hosts in 24.7, that could explain why gateway monitoring is broken.

For those of you who have 24.7 installed (as noted, I rolled back to 24.1.10 due to this problem), I would suggest manually attempting to ping from each public-facing interface (WAN, WAN2, etc.) to 8.8.8.8 or some other remote host to determine if that's the source of the problem.

Interesting. That sounds like a case where, once a downed WAN link is genuinely back up, OPNsense can't detect it and bring things back online. My situation is a bit different, I think.

My setup is two WAN uplinks, say WAN1 and WAN2.  I have two Gateway groups defined, say Group1 and Group2.  Group1 has WAN1 as the Tier 1, WAN2 as Tier 2.  Group2 has WAN2 as Tier 1, WAN1 as Tier 2.  In my firewall rules, I have something like this:

- From anywhere internally to specific destination IP (work): use Group2
- From anywhere internally to anywhere: use Group1

Then if either WAN link fails, it should fail over correctly.

What is broken for me after the upgrade is that the first rule refuses to push traffic over WAN2 even when both WAN uplinks are running just fine and reported as Up. It's almost as if the routing metric (where WAN1 is the higher priority) is being applied instead of the gateway group tiering. The only way I can get my work traffic onto WAN2 is to disable WAN1 altogether and then restart my work VPN tunnels so they stick to WAN2. Then I bring WAN1 back online, and we're good until something bounces again.
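
One check that might show where it's going wrong: as I understand it, gateway group rules compile down to route-to rules in pf, so a sketch like this should reveal which gateway the work-traffic rule actually carries (replace 192.0.2.1 with your real WAN2 gateway IP):

    # The rule for the work-destination traffic should carry a
    # route-to naming the WAN2 gateway, not the WAN1 one
    pfctl -sr | grep route-to
    # e.g. look for something like: route-to (igb1 192.0.2.1)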

That setup is what broke after the upgrade. This exact setup *did* break once on 24.1, and a subsequent update fixed it. Then 24.7 came along, and it's completely broken again.
#6
Trying a bump on this, since even after the recent 24.7.2 updates this is still not working. I don't know if it has something to do with the second WAN uplink having a higher metric or not, but this setup worked before the big upgrade and still does not. I'm hesitant to downgrade, since recreating my config if a restore doesn't work seems a bit terrifying.

I'm really open to trying any patches, dev builds, command-line hacks, anything, to try and get this working again. Any help is greatly appreciated.
#7
I just ran into this again, on 24.7.2. Unbound seemed to stop forwarding DNS requests to my ISP's nameservers (all set by DHCP, nothing manually entered). Cached entries, like google.com and various news websites, all seemed to be working fine. But I noticed, when I tried updating an OctoPi instance, that https://github.com failed to resolve. I checked multiple hosts at home, then toggled my phone onto cell only, and there it resolved fine. I restarted Unbound DNS on my OPNsense box, and all hosts in my house can now resolve GitHub.

Seems like there's still a situation where Unbound can randomly hang with no warning or indication it needs a restart.  Any other suggestions I can try, or any telemetry I can upload to help devs debug this?
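
For the next time it hangs, here's roughly how I plan to tell whether it's Unbound itself or the upstream path (drill ships with FreeBSD; the restart action is what I believe configd exposes):

    # Query through Unbound on the firewall itself
    drill github.com @127.0.0.1
    # Query the upstream directly, bypassing Unbound
    drill github.com @8.8.8.8
    # If the second works and the first doesn't, Unbound is the problem:
    configctl unbound restart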
#8
I have no workaround so far from anyone, and I've not seen any mention that a fix is being worked on.

I'm unfortunately not brave enough to try a downgrade, even though my config is backed up in a few places. I can't afford extended downtime since I work remotely. But I rely on the ability to split my traffic between both WANs, since I push my work traffic over one uplink and the rest of the house over the other. I can't do that now.

This was broken at one point in 24.1 as well, and then an update fixed it shortly before the 24.7 upgrade was released. I'm hoping a dev sees this and knows what needs to happen, and an update pops out really soon.
#9
I just performed the most recent upgrade to get up to 24.7.1. The issue remains: the multi-WAN setup just doesn't work, and the higher-metric gateway is always chosen no matter what.

I'm happy to try a patch or anything to help get this fixed.
#10
Thanks for the suggestions on some things to try tweaking. I don't necessarily care about the ISC registrations from DHCP, but I do care about the static mappings getting registered. So we'll see whether turning off ISC lease registration resolves things or not.

It's worth noting I've only seen Unbound hang/crash once, requiring manual intervention to restart it. But it was bad enough that it broke my home internet (and the wife and kids weren't thrilled). Hence this ticket, in case something jumps out to the devs.
#11
Quote from: newsense on August 04, 2024, 03:12:22 AM
I was referring to the upstream DNS you have defined in Unbound.

Thing is, the behavior you're describing can happen when using encrypted connections for DNS. The SSL connection can be dropped upstream for various reasons while Unbound still tries sending queries thinking it has a valid connection.

If this is the case there's not much to be done other than restarting Unbound and keeping an eye on the WAN link

Ah ok, sorry I misunderstood. I do not have DNS over TLS enabled in Unbound, and no other Advanced features enabled. I only have Register ISC DHCP4 Leases and Register DHCP Static Mappings enabled. For the latter, I have 9 statically defined leases and about 90 other dynamic leases.
#12
Pretty sure it's Regular DNS.  I have a screenshot of my config attached.
#13
I recently upgraded to the 24.7 release from 24.1. My Unbound DNS service stopped working today, with my local clients getting a DNS server failure when trying to resolve anything not locally cached. I restarted the Unbound DNS service from the GUI, and everything seems OK now.

I don't see anything in the log files that would indicate a problem; it just seems to have hung.

If there are any ideas for gathering more info, I'm happy to provide it. Also, if there's a way to monitor this with Monit or something similar that can then be used to restart it, I'm happy to try that out too; a sketch of what I mean is below.
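
For the record, here's the rough shape of the Monit check I have in mind, sketched as raw monitrc syntax. The drop-in path is a guess, and on OPNsense the GUI (Services => Monit) writes the equivalent config for you, so treat this only as a starting point:

    # Hypothetical drop-in path; the OPNsense Monit GUI manages this normally
    cat <<'EOF' >> /usr/local/etc/monit.d/unbound.conf
    check process unbound matching "unbound"
      start program = "/usr/local/sbin/configctl unbound start"
      stop program  = "/usr/local/sbin/configctl unbound stop"
      # Restart Unbound if a UDP DNS probe against localhost fails
      if failed host 127.0.0.1 port 53 type udp protocol dns then restart
    EOF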
#14
This is still broken. I had one of the WAN links fail overnight (this is not uncommon), and the multi-WAN setup properly failed things over to my primary. But it refuses to fail back, and is now routing 100% of traffic out of the primary, ignoring the firewall rules.
#15
Further information:

I went ahead and disabled my one "main" gateway in the settings (System => Gateways => Configuration) and applied it. I saw my secondary gateway become active, and the disabled gateway disappeared from the active gateways view. Even though it was disabled, traffic was still being routed to it no matter what. This is really confusing; it's as if the system is completely ignoring the gateway state and just routing to the one with the higher-priority metric, even when it's disabled.
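
One thing that seems to confirm what's happening under the hood (standard FreeBSD tooling, nothing OPNsense-specific): check whether the system default route still points at the disabled gateway.

    # Show the IPv4 default route; if it still lists the disabled
    # gateway's IP, the routing table never got updated
    netstat -rn -f inet | grep '^default'
    # The active route for a specific destination can be checked too
    route -n get 8.8.8.8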

I ended up trying one more thing: disabling the interface for the "main" gateway (disabling the port). After doing that and re-enabling the interface, it seems my multi-WAN is working again, for now.