Running OPNsense 23.1 on four firewalls, each at a separate site with its own Internet connection. Some ISPs update and reboot their cable modems without warning.
I've configured the WAN interfaces with their Monitor IP, so dpinger should catch faults. Monit does a fair job of catching the interruptions, but it's limited to running the gateway_alert script on the "cron" schedule, every minute of the hour.
dpinger is obviously catching the faults because I can see the WAN interface down on the dashboard. From trial and error, however, OSPF doesn't seem to be noticing the faults: it's not generating LSAs for the events.
I'm interested in learning about the link fault detection methods used by OPNsense, whether they're configurable, and if they're not, what to expect. I run LLDP and CDP on my LAN, yet OPNsense doesn't seem to be talking with my other devices even though my rules allow any to any for all protocols in my LAN. If my ISPs support BFD, CDP, or others it could benefit, but it's important to know what to expect from OPNsense in order to be educated and not waste others' time.
Link faults (as in link layer) and routing daemons like FRR (routing layer) are mutually unaware of each other.
There are a couple of things that happen automatically:
1. If a link goes physically down the interface is reconfigured when it comes back up. This is the best form of recovery but also the most undesirable event to happen.
2. Dynamic connectivity such as DHCP and PPP can also detect protocol and connectivity issues and reset the connection.
3. Dpinger / gateway alarm effectively detects unrelated upstream routing issues that can happen without 1.) or 2.) triggering a recovery, and reacts accordingly (default gateway switching or gateway group tier adjustment).
4. Services like FRR can emit timeout log messages which can be used to trigger recovery events such as is being used for CARP failover... https://github.com/opnsense/plugins/issues/2091
Some people use periodic interface resets via cron in cases where 1.), 2.) and 3.) are not effective enough. As for 4.), things could be improved, but presenting applicable requirements would help.
Cheers,
Franco
Thanks. The situation encountered is that, upon 1) WAN link down, 4) frr appears as if nothing happened. Because I am not certain whether this is an frr problem or an OPNsense problem I entered issues in the respective github repositories, for frr and OPNsense. Here are the links.
https://github.com/opnsense/plugins/issues/3445
https://github.com/FRRouting/frr/issues/13597
The FRR maintainers already answered, asking me to upgrade from frr 7.5.1 to a newer release, but in OPNsense you can't choose plugin versions. So I'm at a loss for how to help the ones who are already trying to help.
Most carriers don't support BFD for consumer-grade Internet service. I end up depending on dpinger, but ICMP is not really meant to detect link faults. Anyway, the problem at hand seems to be somewhere in a gray area.
If only I knew which link fault detection protocols are used by OPNsense (I guess they would apply to all interfaces, but I'm interested in the WAN link) I could talk to my ISPs intelligently about it.
Will appreciate your time and effort.
I've made a comment on the FRR issue. Michael and I wanted to update FRR to 8, but the last time they changed/broke too much for our taste and a lot of people rely on this stuff to keep working nowadays. ;)
Cheers,
Franco
@franco Great. Will be monitoring this conversation.
Cheers
Franco,
I guess I answered my own question (above) about "link fault detection" by tinkering around with my home network. I ran ifconfig before and after disconnecting the WAN cable and saw the carrier loss detected by FreeBSD.
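A minimal sketch of that before/after comparison, assuming FreeBSD-style ifconfig output in which each interface block carries an indented "status:" line (for example "status: active" or "status: no carrier"). The interface names igb0/igb1 and the sample text are made up for illustration; this is not how OPNsense itself tracks link state.

```python
import re


def link_status(ifconfig_output: str) -> dict:
    """Map interface name -> status string from FreeBSD-style ifconfig output."""
    statuses = {}
    current = None
    for line in ifconfig_output.splitlines():
        # Interface header lines start in column 0, e.g. "igb0: flags=..."
        header = re.match(r"^([A-Za-z0-9_.]+): flags=", line)
        if header:
            current = header.group(1)
            continue
        # Detail lines are indented; we only care about "status: ..."
        detail = re.match(r"^\s+status:\s*(.+?)\s*$", line)
        if detail and current:
            statuses[current] = detail.group(1)
    return statuses


# Hypothetical sample: igb0 has its cable pulled, igb1 is connected.
SAMPLE = """\
igb0: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
\tether 00:00:5e:00:53:01
\tmedia: Ethernet autoselect
\tstatus: no carrier
igb1: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
\tether 00:00:5e:00:53:02
\tmedia: Ethernet autoselect (1000baseT <full-duplex>)
\tstatus: active
"""

if __name__ == "__main__":
    for name, status in link_status(SAMPLE).items():
        print(name, status)
```

On a real box the input would come from running ifconfig and capturing its output instead of the SAMPLE string.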
Now, I was just reviewing the WAN_GW gateway configuration (system > gateway > single) and noticed the "Far Gateway" choice, described as "This will allow the gateway to exist outside of the interface subnet". WAN_GW is using the WAN interface and is defined as Upstream Gateway, with Far Gateway unchecked.
I definitely don't know what I'm talking about here, but please hear me out. Even if I wasn't using frr, would this re-route traffic to the other firewall if WAN is down, and vice versa ?
- create another Single Gateway, this one tied to the LAN interface, with a static IP address pointing to the default gateway of my other firewall, Upstream unchecked and Far Gateway checked,
- include this new single gateway as a Tier2 gateway in the Gateway Group that already has WAN_GW defined as a Tier1 member.
- Replicate the same setup on the other firewall, reversing the gateway order in the Gateway Group.
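
The intended behaviour of the steps above can be modelled roughly as follows. This is a simplified sketch of tier-based selection in general, not OPNsense's actual gateway group code; the gateway name LAN_PEER_GW is hypothetical, and the "up" flag stands in for whatever the gateway monitor reports.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gateway:
    name: str
    tier: int   # lower number = preferred
    up: bool    # stand-in for the monitor's verdict on this gateway


def select_gateways(group: List[Gateway]) -> List[Gateway]:
    """Return the live members of the lowest-numbered tier that has any live member."""
    live = [g for g in group if g.up]
    if not live:
        return []
    best_tier = min(g.tier for g in live)
    return [g for g in live if g.tier == best_tier]


if __name__ == "__main__":
    group = [
        Gateway("WAN_GW", tier=1, up=False),      # local WAN has failed
        Gateway("LAN_PEER_GW", tier=2, up=True),  # hypothetical gateway via the other firewall
    ]
    print([g.name for g in select_gateways(group)])
```

One caveat with this model: the Tier 2 entry only says the other firewall is reachable over the LAN, not that its own WAN is working, so both firewalls could end up pointing at each other if both WANs fail.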
Thanks for your patience.
Regards
Pedro
Hi Pedro,
I'm not sure. I wouldn't use a gateway on an interface that's not supposed to reach an external router, but perhaps this works. All I'm trying to say is that this seems like an uncommon approach.
Cheers,
Franco
I know. Very unorthodox. Trying to get some hardware to test it in the lab.
Appreciate all the help!
Gracias...
By the way -- if you want to, you could use bhyve to create a Linux VM on your OPNsense and run the latest version of FRR inside it.
Here's an example of OpenWRT running on an OPNsense with bhyve:
https://forum.opnsense.org/index.php?topic=34034.0
Hmmm.... I can't find a single reference to frr in that case. Will read in more depth.
Thanks.
There's not a single reference to frr, however, you can install Linux in a VM on the OPNsense and install the latest frr there.
Quote from: franco on June 02, 2023, 02:31:26 PM
Hi Pedro,
I'm not sure. I wouldn't use a gateway on an interface that's not supposed to reach an external router, but perhaps this works. All I'm trying to say is that this seems like an uncommon approach.
Cheers,
Franco
Thanks for your prompt action to commit frr8 (https://github.com/FRRouting/frr/issues/13597). I previously suggested a very unorthodox configuration as a shortcut to the need for gateway failover/switchover. Instead of playing around with features I don't understand well I'll propose something else, more directly related with Gateway Monitoring than Link Fault Detection.
I understand that relying on someone else's technology is not the first choice for sound development. But, there are at least two methods widely used over the Internet to verify network connectivity: Microsoft's NCSI and Android's GoogleConnectivityCheck.
Would it be open to consideration to add a feature to OPNsense that would add choices to the "Monitor IP" option in the Single Gateways? The feature would be to choose between an IP address to ping (dpinger), the methods used by Microsoft's NCSI and GoogleConnectivityCheck, or to use the FQDN of your ISP's speedtest site.
I'm sure there will be legal and other reasons that will weigh in when making that choice. Hope it's something feasible.
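
To make the proposal concrete, here is a rough sketch of what an HTTP-based check of that kind could look like. It is not an existing OPNsense feature. The probe URLs and expected responses reflect my understanding of how NCSI and the Android check behave (a fixed text body, and an empty HTTP 204, respectively) and should be verified before relying on them; the function names are made up.

```python
import urllib.error
import urllib.request

# name -> (url, expected HTTP status, expected body or None). Assumed values.
PROBES = {
    "ncsi": ("http://www.msftconnecttest.com/connecttest.txt", 200, "Microsoft Connect Test"),
    "android": ("http://connectivitycheck.gstatic.com/generate_204", 204, None),
}


def probe_ok(status, body, expected_status, expected_body):
    """Pure check: does the response look like the real endpoint, not a captive portal?"""
    if status != expected_status:
        return False
    return expected_body is None or body.strip() == expected_body


def run_probe(name, timeout=5.0):
    """Fetch one probe URL and evaluate it; any network error counts as failure."""
    url, expected_status, expected_body = PROBES[name]
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read(1024).decode("utf-8", errors="replace")
            return probe_ok(resp.status, body, expected_status, expected_body)
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    for probe_name in PROBES:
        print(probe_name, run_probe(probe_name))
```

Unlike an ICMP probe, this only succeeds if DNS, routing and HTTP all work end to end, which is both its appeal and the reason a failure says less about where the fault is.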
Regards
Pedro
Quote from: kpiq on June 01, 2023, 02:50:03 PM
@franco Great. Will be monitoring this conversation.
Cheers
@franco
I just upgraded my lab with OPNsense 23.7, will be testing and observing it for a week or two before proposing the upgrade to the production firewalls that were previously not succeeding to failover.
Appreciate all the hard work, your time, and attention. Will keep you posted.
Regards
Pedro
Well, 23.7.2 absolutely killed my preparation, before I got a chance to demonstrate that the fiber disconnect will trigger default gateway changes and that those will propagate via OSPF causing a proper failover from one firewall/ISP connection to another.
After the 23.7.2 update, the gateway monitoring - with the default settings and the same Monitoring IP as before - forced my firewall to believe that the WAN link was down, intermittently. We did thorough troubleshooting: cleaned the fibers connected to the WAN port, replaced the SFP.
Was about to call the ISP to troubleshoot the circuit when I took a tcpdump and saw traffic traversing the WAN port. Decided to try disabling gateway monitoring. That patched our trouble. Would love to use Gateway Monitoring, but it broke my network. Will appreciate your help.
Do you use a far gateway? Or DHCP with assigned /32 address? There is a patch for that:
https://github.com/opnsense/core/commit/c8a5d32760
The recent work on gateway monitoring in 23.7.x made visible multiple problems that existed in the code for many years and that were almost impossible to catch/debug before.
The patch is not in 23.7.3 but it will be in 23.7.4 most likely.
Cheers,
Franco
No. We use upstream gateway and a /30 static public ip address.
Regards
Pedro
Ok, I'm talking to someone else who also has another edge case issue. I think the monitoring is working fine for the most part and the only thing that helps now is logs (system and gateways to be precise) to see if what you would expect was actually being seen. I'm not too concerned with 23.1.11 state -- it might have just been doing something by accident. Now the less we do the more these edge cases become a problem (but make the routing a lot more stable in multi-WAN failover situations).
Cheers,
Franco
Franco,
My apologies for jumping the gun. It seems we have circuit trouble with one of our Internet providers, with latency and packet loss just outside our gateway. That must be why OPNsense was bringing the WAN port down, as it should. If I recall correctly OPNsense uses latency and packet loss in the gateway monitoring calculation, right?
Thanks for your continued efforts, time, and support. It seems like now we'll have to wait to perform the fiber disconnect testing until the ISP trouble is over.
Regards
Pedro
Hi Pedro,
Yes, there are high thresholds for both packet loss and latency that when reached will mark the connection as "down" regardless of the actual disrupted link state (which is the traditional packet loss 100%). It also depends on what gateway group trigger is used. All these values can be tweaked per gateway if that helps.
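
As a rough illustration of the threshold logic described above: the sketch below classifies a gateway from measured loss and latency. The default numbers (10%/20% loss, 200 ms/500 ms latency) and the state names are assumptions for illustration only; check the per-gateway settings on your own system for the real values.

```python
def gateway_state(loss_pct, latency_ms,
                  loss_low=10.0, loss_high=20.0,
                  latency_low=200.0, latency_high=500.0):
    """Classify a gateway from measured packet loss (%) and latency (ms)."""
    # Reaching either high threshold marks the gateway down,
    # even though the link itself may still be passing traffic.
    if loss_pct >= loss_high or latency_ms >= latency_high:
        return "down"
    if loss_pct >= loss_low:
        return "loss"
    if latency_ms >= latency_low:
        return "delay"
    return "online"


if __name__ == "__main__":
    # A circuit with 25% loss is reported down although packets still get through.
    print(gateway_state(loss_pct=25.0, latency_ms=40.0))
```

Whether the intermediate "loss" and "delay" states remove a gateway from a group would then depend on the trigger chosen for that group.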
Let me know how this progresses on your end in any case.
Thanks,
Franco
Franco,
Sorry it took so long to share the results of our testing. Link fault detection in OPNsense 23.7 and up works fine, but there is something still not right with FRR/OSPF.
We finally scheduled testing for ISP fiber disconnect of one of the firewalls. Monitored the OSPF LSA notifications. Finally saw OSPF removing the external LSA record for the firewall where the fiber was disconnected.
For a minute our routing tables got adjusted to use the other firewall. I was able to ping a few Internet sites. But then OSPF started a shutdown/restart loop in the firewall where the fibers were disconnected, and OSPF started announcing its external LSA as if it were connected... that resulted in full connectivity loss (the disconnected firewall had the higher OSPF gateway metric), even when the other firewall was up.
The team decided that we will not rely on OSPF for gateway switchover. We may try CARP, and we have the Ethernet links needed, but I'm not sure CARP is meant for two firewalls which are 1,000 miles apart with a latency of 25ms.
I'm sorry this did not work as expected this time around. Maybe it was the wrong way of achieving our goal of reliable uptime through gateway redundancy.
Thanks for all your help.
Regards
Pedro
@Franco
When you get a chance, I've posted a reply to your August statements, dated December 7.
Regards
Pedro
Quote from: franco on August 31, 2023, 09:05:36 PM
Hi Pedro,
Yes, there are high thresholds for both packet loss and latency that when reached will mark the connection as "down" regardless of the actual disrupted link state (which is the traditional packet loss 100%). It also depends on what gateway group trigger is used. All these values can be tweaked per gateway if that helps.
Let me know how this progresses on your end in any case.
Thanks,
Franco