Show posts

This section allows you to view all posts made by this member. Note that you can only see posts made in areas you currently have access to.

Messages - drivera

#16
I used to have a failover gateway group, but that caused a routing issue because gateway groups can't have "routeback" rules added to netfilter. That's why I changed it to how it is now. With a gateway group it would be impossible to reach the firewall via anything but the currently-active circuit, which was not the intent.

This is why I'm not using gateway groups.
#17
You can find a ZIP file with all the relevant screenshots (General Settings, Gateway page, individual gateways, Rules, NAT, Firewall Settings, and Express VPN client) here.  It's ~2.1MB so I couldn't attach it here directly.

https://drive.google.com/file/d/1DW2tnGd7UNcqZQVm7Ig6d-_gTVt0AELi/view?usp=sharing

Cheers!
#18
You asked for screenshots, but I'll do you one better: a sanitized (I hope :P) config.xml!

The configuration reflects the system's current state. Maybe you can spot what's wrong better than I.  The wiring is simple: two upstream links (one per ISP), one LAN link. I sanitized the users' passwords as well - you may have to copy that section from another working configuration.

Let me know what else I can provide to help debug this.

Thanks!
#19
Hi!

Gateway monitoring is set up using a script that basically polls netstat -rn every 0.01 seconds and records the result, reporting whenever the result varies from poll to poll.
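For anyone wanting to reproduce this, here is a minimal sketch of a poller of that kind. It is not the actual script: the function names are made up, it assumes FreeBSD-style `netstat -rn` output where the default route sits on a line starting with `default`, and it assumes a `sleep` that accepts fractional seconds.

```shell
#!/bin/sh
# Hypothetical helper: read `netstat -rn` output on stdin and print the
# gateway of the first default route, or "empty" when there is none.
default_gateway() {
    awk '$1 == "default" { print $2; found = 1; exit }
         END { if (!found) print "empty" }'
}

# Hypothetical poll loop: log a line only when the gateway differs from
# the previous poll. Runs until interrupted.
monitor_gateway() {
    interval="${1:-0.01}"   # seconds between polls (fractional sleep assumed)
    last=""
    echo "$(date '+%Y/%m/%d %H:%M:%S'): Gateway monitoring Started"
    while :; do
        current="$(netstat -rn | default_gateway)"
        if [ "$current" != "$last" ]; then
            echo "$(date '+%Y/%m/%d %H:%M:%S'): GATEWAY=[$current]"
            last="$current"
        fi
        sleep "$interval"
    done
}
```

Calling `monitor_gateway` starts the loop; `default_gateway` can also be fed a saved `netstat -rn` dump to check the parsing offline.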

The "Don't add/remove routes" box is unchecked, but "Don't pull routes" is checked, and I added <pull-filter ignore "redirect-gateway"> to the advanced configuration so it wouldn't set itself as the default gateway when coming up.

I've since checked "Don't add/remove routes" and retested, with no change in behavior.  This is the output log for the gateway monitor script:

$ ./monitor-gateway
2020/01/19 08:12:51: Gateway monitoring Started
2020/01/19 08:12:51: GATEWAY=[186.159.241.1]
<MANUALLY FORCED AN OUTAGE HERE>
2020/01/19 08:18:08: GATEWAY=[empty]
2020/01/19 08:18:08: GATEWAY=[192.168.200.1]
2020/01/19 08:18:42: GATEWAY=[empty]
2020/01/19 08:18:42: GATEWAY=[192.168.200.1]
2020/01/19 08:19:59: GATEWAY=[empty]
<MANUAL INTERVENTION VIA SAVE/APPLY-CHANGES>
2020/01/19 08:22:51: GATEWAY=[192.168.200.1]
<MANUAL TAKEDOWN OF NON-FUNCTIONAL VPN LINK>
2020/01/19 08:26:39: GATEWAY=[empty]
<MANUAL RESTART OF VPN LINK>
2020/01/19 08:27:14: GATEWAY=[192.168.200.1]


Also, when I finally fixed the induced outage, I had to manually re-enable the main gateway to get it to work again, so no auto-failback this time :(

Clearly, something is amiss here...

Cheers.
#20
Ok, new behavior: the default gateway configuration code is now considering (and configuring) non-upstream gateways for default gateway.

I have an OpenVPN connection to Express VPN (for content streaming) configured in the firewall, and all the necessary rules to use it only for my streaming devices. Clearly it's not configured as an upstream gateway since it's dependent on either of the actual two physical circuits which are marked as upstream.

I've just had a soft outage of the primary service (everything is up, but something went wrong in the ISP's network and routing is borked), and guess what?  The ExpressVPN gateway was chosen as the default gateway for the system despite the fact that the secondary gateway was still up and in good health!!!

Eventually the system righted itself without intervention, but still: this shows that the default gateway selection and configuration algorithm is broken. Surely only healthy upstream gateways should be considered as candidates for default gateway (right?).
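To make the expectation concrete, here is a small sketch of the selection rule I have in mind. It is not OPNsense's code: the input format (one gateway per line as `name upstream status priority`) and the function name are invented for illustration, and it assumes a lower priority number means a more preferred gateway.

```shell
#!/bin/sh
# Hypothetical selection rule: among gateways flagged as upstream (field 2
# is 1) and currently healthy (field 3 is "up"), print the one with the
# lowest priority number (field 4). Prints "none" if nothing qualifies.
pick_default_gateway() {
    awk '$2 == 1 && $3 == "up" && (best == "" || $4 + 0 < bestprio) {
             best = $1
             bestprio = $4 + 0
         }
         END { print (best == "" ? "none" : best) }'
}
```

With the primary down, a healthy secondary upstream and a healthy non-upstream VPN gateway as input, this rule picks the secondary and never the VPN, which is the behavior I expected during the outage.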

Cheers!
#21
Quote from: mimugmail on June 05, 2019, 07:05:39 AM
Do you have default gw switching enabled on System : Settings : General?

In my case, this setting has always been on, and I still have this issue. In fact, I just made another post about it providing a bit more info since this thread was sort of stale...
#22
Hi!

I've noticed that a few minutes after the initial failover, the default gateway configuration gets cleared even though failover had successfully occurred. The result is that routing to the internet no longer works despite there being an active, healthy secondary gateway available. I'm using multiple upstream gateways with differing priorities, and except for this glitch the configuration seems to work as intended.

The only way to recover from this is to log onto the UI, edit one of the gateways (the healthy one, for instance), save it without making any changes, and click "Apply Changes". This triggers the code that recalculates the correct gateway and fixes the configuration.

Sometimes (very often) this has to be done two or three times for it to take, and normal network functionality to be restored.  If this isn't done the gateway configuration will remain incorrect until the primary circuit returns. Obviously this defeats the purpose of any failover configuration.

However, once the primary circuit comes back to life everything returns to normal on its own.

Maybe the issue is related to the fact that the primary circuit is still online (still has an IP and the link is still UP), but it's effectively dead because some segment downstream is dead? Thus, the circuit's configured upstream gateway is down (and correctly detected as such) even though the interface isn't dead per se. Perhaps that's what's confusing the gateway calculation algorithm?

I've written a script I use to monitor the gateway configuration which I could easily enough turn into a monitoring daemon (of sorts) that could trigger the gateway calculation/reconfiguration code when it detects that the default gateway has been left empty.  However: I don't know how to do that from the O/S CLI. Any ideas?

Is there documentation anywhere regarding the scripts/commands that are available at the CLI level to invoke OPNSense functionality?

Perhaps that daemon would only trigger the "repair" when it detects that one of the (higher-priority) upstream gateways is both enabled and "down" (i.e. we're in a failover state) ... this way it would minimize interference with normal operation when everything is OK....?
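Roughly, the trigger logic would look like the sketch below. The command that actually re-runs the gateway calculation is exactly what I don't know yet, so it is left as a placeholder in `REPAIR_CMD`; the function names and the "empty"/"yes" conventions are likewise just assumptions for illustration.

```shell
#!/bin/sh
# Decide whether the repair should fire.
# $1: current default gateway as seen by the monitor ("empty" when missing)
# $2: "yes" if a higher-priority upstream gateway is enabled but down,
#     i.e. the system is in a failover state
needs_repair() {
    [ "$1" = "empty" ] && [ "$2" = "yes" ]
}

# Run the placeholder repair command only when needed, so normal
# operation is left alone.
repair_if_needed() {
    if needs_repair "$1" "$2"; then
        if [ -n "${REPAIR_CMD:-}" ]; then
            sh -c "$REPAIR_CMD"
        else
            echo "repair needed, but REPAIR_CMD is not set"
        fi
    fi
}
```

Gating on the failover state keeps the daemon from touching routing while the primary is healthy, even if the default route is briefly absent for some unrelated reason.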

Thoughts?

Thanks!
#23
Hi!

I've noticed recently that my Insight graph screen never has any significant data. I tried resetting the RRD and Netflow Data, but to no avail.  I tried manually running the flowd_aggregate service, but it didn't fix anything.

I've been looking through the logs but I'm not sure what to look for. I don't even know if the Insight data is populated from the netflow data or somewhere/thing else entirely.

Importantly: I've tried doing a data reset followed by an immediate reboot. Sometimes that worked and everything started to show as expected, but it has since stopped working.

To clarify: the "Traffic" section does show real-time traffic. It's the historical stuff that is borked, and that's what I'd like to fix.

Can you guys help me figure this out? Is there any way to fully, cleanly, atomically reset the graphing data so the engine starts gathering stuff correctly again as if from a fresh install?

Thanks!
#24
Found (I think) my solution via /usr/local/etc/rc.syshook monitor.

I'll play around with that and maybe I'll be able to figure out an easy way to fire off a DHCP release/renew for the "failed" interface.

The only question I have is how to preserve it in a backup other than manually. But that's small potatoes by comparison :)

Cheers!
#25
Hi!

In my multi-WAN setup, I have my gateways configured such that their monitoring IP is a well-known, "always up", pingable IP on the general internet.  This is important because occasionally the ISPs will have a link be up, but with no internet connectivity. Thus, monitoring an "internet" address helps me cover for that case and apply failover even though the link appears to be up.

However, I've also found another issue: when there's a connectivity hiccup - usually due to a short power outage (< 1 min) - the connection will seem to be up, but connectivity won't be restored.  This seems to be an issue with the CableModem/ISP connection itself, since OPNSense correctly detects the lack of connectivity and refuses to fail back to the primary.

The scenario is this:


  • Short power outage (< 1 min), causing connectivity over the primary circuit to disappear
  • Failover happens correctly to the secondary circuit
  • Power returns
  • The IP address/etc is still valid on the primary interface, but connectivity is still borked (OPNSense remains in "failed" state, correctly routing over the secondary circuit)
  • I manually cause a link restart on the primary circuit by power-cycling the cable modem, and everything returns to normal (fail-back to primary, etc)

The question I have regarding all the above is this: is there a way that I could somehow attach a custom script that is executed when a gateway is marked as "DOWN"? i.e. "when this interface's gateway is marked as down, flush the DHCP lease and leave it unconfigured until it comes back up on its own"

The alternative is for me to buy a USB- or Network-controllable power strip - IOT style - and through that custom script, trigger a power cycle of the Cable Modem, which ideally results in fixing everything up.
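A sketch of what such a script could look like, with loud caveats: how the hook is invoked and what arguments it receives are assumptions here (gateway name first, then an alarm flag where 1 means "down"), and `WATCHED_GATEWAY`, `RESET_CMD`, `STAMP_FILE` and `COOLDOWN` are hypothetical settings. The reset action itself (lease flush or power-strip call) is deliberately left as a placeholder command.

```shell
#!/bin/sh
# Hypothetical gateway-event handler with a cooldown, so a flapping link
# cannot power-cycle the modem over and over.
handle_gateway_event() {
    name="$1"     # gateway name, e.g. CABLE_DHCP
    alarm="$2"    # assumed convention: 1 = marked down, 0 = recovered
    stamp="${STAMP_FILE:-/tmp/gateway-reset.stamp}"
    cooldown="${COOLDOWN:-300}"

    # Ignore other gateways and recovery events.
    [ "$name" = "${WATCHED_GATEWAY:-}" ] || return 0
    [ "$alarm" = "1" ] || return 0

    now="$(date +%s)"
    last="$(cat "$stamp" 2>/dev/null || echo 0)"
    last="${last:-0}"
    if [ $((now - last)) -lt "$cooldown" ]; then
        echo "skipped: last reset was less than ${cooldown}s ago"
        return 0
    fi

    # Record the attempt, then run the placeholder reset action.
    echo "$now" > "$stamp"
    sh -c "${RESET_CMD:-echo no RESET_CMD configured}"
}
```

The cooldown matters mostly for the power-strip variant: a cable modem takes a while to come back, and the gateway will keep reporting "down" during that time.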

So...Thoughts? Ideas?
#26
Update to the required command:

$ base64 -d encrypted-config.xml | openssl enc -d -aes-256-cbc -md md5 > decrypted-config.xml

The -md md5 was missing from the previous solutions.

Remember to remove the necessary lines from (a copy of) the encrypted file first.  The openssl command will ask for the password interactively. There are parameters that can be added to include the password in the command, left as an exercise for the reader.
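For the lazy reader, a sketch of both steps. It assumes the non-payload lines in the encrypted file (wrapper and header lines, blank lines) all contain at least one character that is not valid base64, which is how the stripping works; `CONFIG_PASS` is a made-up environment variable name. The `-pass env:` form is a standard openssl option for reading the passphrase from the environment.

```shell
#!/bin/sh
# Keep only lines made up entirely of base64 characters, dropping wrapper,
# header and blank lines around the payload (assumed file layout).
strip_wrapper() {
    grep -E '^[A-Za-z0-9+/=]+$'
}

# Decrypt non-interactively, reading the passphrase from the CONFIG_PASS
# environment variable (hypothetical name) instead of the prompt.
# $1: path to the encrypted config file; decrypted XML goes to stdout.
decrypt_config() {
    strip_wrapper < "$1" | base64 -d |
        openssl enc -d -aes-256-cbc -md md5 -pass env:CONFIG_PASS
}
```

Usage would be along the lines of `CONFIG_PASS=... decrypt_config encrypted-config.xml > decrypted-config.xml`. Passing the passphrase through the environment keeps it out of the process list, unlike putting it directly on the command line.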

Cheers!
#27
I shall do that.

In other news, the problem with failover seems to be dpinger.  This is from today's outage event:

Aug 25 09:43:12 firewall dpinger: CABLE_DHCP 10.19.0.1: Alarm latency 8574us stddev 2231us loss 6%
Aug 25 09:43:12 firewall dpinger: GATEWAY ALARM: CABLE_DHCP (Addr: 10.19.0.1 Alarm: 1 RTT: 8574ms RTTd: 2231ms Loss: 6%)
Aug 25 09:48:36 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:48:37 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:48:38 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:48:39 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:48:40 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:48:41 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:48:42 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:48:43 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:48:44 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:48:45 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:48:46 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:48:47 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:48:48 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:48:49 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:48:50 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:48:52 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:48:53 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:48:54 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:48:55 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:48:56 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:48:57 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:48:58 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:48:59 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:00 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:01 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:02 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:03 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:04 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:05 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:06 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:07 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:08 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:09 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:10 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:11 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:12 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:13 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:14 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:15 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:16 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:17 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:18 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:19 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:20 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:21 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:22 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:23 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:24 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:25 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:26 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:28 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:29 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:30 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55
Aug 25 09:49:31 firewall dpinger: MASKVPN_VPNV4 10.57.0.205: sendto error: 55


It would seem that DPinger is trying to re-route everything via a VPN gateway, which isn't marked as an upstream gateway - i.e. dpinger doesn't seem to be aware of the new "gateway priority" feature ... either that, or it's simply not giving a hoot.
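For the bug report, the wall of log lines above can be boiled down with something like the sketch below. It assumes the syslog layout shown above (gateway name in the sixth whitespace-separated field, error number last); the function name is made up. For reference, errno 55 on FreeBSD is ENOBUFS ("No buffer space available").

```shell
#!/bin/sh
# Read syslog lines on stdin and print one summary line per gateway and
# errno for dpinger "sendto error" messages, with a count.
summarize_dpinger_errors() {
    awk '/dpinger:/ && /sendto error:/ {
             gw = $6        # gateway name, e.g. MASKVPN_VPNV4
             err = $NF      # errno value at the end of the line
             count[gw " error " err]++
         }
         END { for (k in count) print k ": " count[k] }' | sort
}
```

Feeding it the system log (wherever that lives on your install) gives lines of the form `MASKVPN_VPNV4 error 55: <count>`, which is easier to paste into a ticket than the raw output.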

I'll log both defects in GitHub.

Cheers!
#28
Well, I know for sure the behavior is as described vs. what I would have expected. If you think it's a bug, I'll file a report.

I was just giving the benefit of the doubt that the problem was me and a bad setting somewhere...

Thoughts?
#29
Here's another tidbit I've just discovered with this new setup: the 2nd circuit is unroutable except for the IPs it's specifically set up for - either by DHCP or manually by me.

However, when failover occurs (which it does seem to when the 1st circuit goes completely offline), everything is fine and routing works perfectly. Then it fails back cleanly.

However, while the primary circuit is up, traffic going out the 2nd circuit (i.e. for tests and diagnostics) simply dies (i.e. is never seen again) except when going to the addresses I mentioned. I'm not sure if there's a setting that I'm missing, but this used to work just fine when I was using a routing group to handle the failover.

Let me know if I should start this as its own thread, as this diverges a bit from the topic of discussion.

Cheers!
#30
It seems that the default firewall setting has a rather short limit on log size. I'm going to increase it for the future. Also: the log size field processing has a bug - I tried to set the size to 3GB (in bytes = 3,221,225,472), but when I reset the log files, the first log file was already over 100GB when I forcibly rebooted to avoid choking the disk...

I set it to 2GB-1 (2,147,483,647) and that seemed to work just fine.
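One plausible explanation, which I haven't confirmed in the code, is that the size field ends up in a signed 32-bit integer: 3GB doesn't fit, while 2GB-1 is exactly the largest value that does. The sketch below only illustrates that arithmetic; the function names are made up and it assumes the shell does 64-bit arithmetic.

```shell
#!/bin/sh
INT32_MAX=2147483647

# True if the value fits in a signed 32-bit integer.
fits_int32() {
    [ "$1" -le "$INT32_MAX" ] && [ "$1" -ge -2147483648 ]
}

# What a non-negative value becomes when truncated to signed 32 bits.
wrap_int32() {
    echo $(( ( ($1 + 2147483648) % 4294967296 ) - 2147483648 ))
}
```

Under that assumption, 3,221,225,472 would wrap to a negative number, and a size check against a negative limit could well never trigger rotation, which would match the runaway log file.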

Annoyingly, I had configured a remote syslog server to capture all these logs, but for some reason it had stopped listening and wasn't receiving anything, so even that history was boned.

I'll submit logs the next time I have an outage.

Cheers.