Link fault detection

Started by kpiq, June 01, 2023, 12:43:46 PM

Running OPNsense 23.1 on four firewalls, each at a separate site with its own Internet connection.  Some ISPs update and reboot their cable modems without warning.

I've configured the WAN interfaces with their Monitor IP, so dpinger should catch faults.  Monit does a fair job of catching the interruptions, but it's limited to running the gateway_alert script on the "cron" schedule, every minute of the hour.

dpinger is obviously catching the faults because I can see the WAN interface down on the dashboard.  From trial and error, though, OSPF doesn't seem to notice the faults.  It's not generating LSAs for the events.

I'm interested in learning about the link fault detection methods used by OPNsense, whether they're configurable, and if they're not, what to expect.  I run LLDP and CDP on my LAN, yet OPNsense doesn't seem to be talking with my other devices even though my rules allow any to any for all protocols in my LAN.  If my ISPs support BFD, CDP, or others, that could help, but it's important to know what to expect from OPNsense in order to be educated and not waste others' time.

Link faults (as in link layer) and routing daemons like FRR (routing layer) are mutually unaware of each other.

There are a couple of things that happen automatically:

1. If a link goes physically down the interface is reconfigured when it comes back up. This is the best form of recovery but also the most undesirable event to happen.
2. Dynamic connectivity such as DHCP and PPP can also detect protocol and connectivity issues and reset its connection.
3. Dpinger / gateway alarm effectively detects unrelated upstream routing issues that can happen without 1.) or 2.) triggering a recovery, and reacts accordingly (default gateway switching or gateway group tier adjustment).
4. Services like FRR can emit timeout log messages which can be used to trigger recovery events such as is being used for CARP failover... https://github.com/opnsense/plugins/issues/2091

Some people use periodic interface resets via cron in cases where 1.) 2.) and 3.) are not effective enough. About 4.) things could be improved but presenting applicable requirements would help.
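
For readers unfamiliar with point 3.), here is a minimal sketch of the general idea behind probe-based gateway monitoring: collect round-trip times over a window and raise an alarm when loss or latency crosses a threshold. This is not dpinger's actual implementation; the function name and the threshold values are assumptions chosen only for illustration.

```python
from typing import Optional, Sequence


def gateway_alarm(
    samples: Sequence[Optional[float]],
    loss_threshold_pct: float = 20.0,     # hypothetical threshold
    latency_threshold_ms: float = 500.0,  # hypothetical threshold
) -> bool:
    """Return True if the monitored gateway should be considered in alarm.

    `samples` holds one entry per probe sent in the window: the round-trip
    time in milliseconds, or None if no reply came back.
    """
    if not samples:
        return False  # nothing measured yet, so no basis for an alarm

    replies = [rtt for rtt in samples if rtt is not None]
    loss_pct = 100.0 * (len(samples) - len(replies)) / len(samples)

    # With zero replies the average latency is undefined; treat it as infinite.
    avg_ms = sum(replies) / len(replies) if replies else float("inf")

    return loss_pct >= loss_threshold_pct or avg_ms >= latency_threshold_ms
```

Note that a check like this only sees whether the monitor IP answers. It says nothing about the physical link, which is why it can fire (or stay quiet) independently of 1.) and 2.).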


Cheers,
Franco

June 01, 2023, 01:37:18 PM #2 Last Edit: June 01, 2023, 01:39:53 PM by kpiq
Thanks.  The situation encountered is that, upon 1) WAN link down, 4) frr appears as if nothing happened.  Because I am not certain whether this is an frr problem or an OPNsense problem I entered issues in the respective github repositories, for frr and OPNsense.  Here are the links.

https://github.com/opnsense/plugins/issues/3445
https://github.com/FRRouting/frr/issues/13597

The FRR developers already answered, asking me to upgrade from frr 7.5.1 to a newer release, but in OPNsense you can't choose plugin versions.  So I'm at a loss for helping the ones who are already trying to help.

Most carriers don't support BFD for consumer-grade Internet service.  I end up depending on dpinger, but ICMP is not really meant to detect link faults.  Anyway, the problem at hand seems to be somewhere in a gray area. 

If only I knew which link fault detection protocols are used by OPNsense (I guess they would apply to all interfaces, but I'm interested in the WAN link) I could talk to my ISPs intelligently about it. 

Will appreciate your time and effort.

I've made a comment on the FRR issue. Michael and I wanted to update FRR to 8, but the last time they changed/broke too much for our taste and a lot of people rely on this stuff to keep working nowadays. ;)


Cheers,
Franco

@franco Great.  Will be monitoring this conversation.

Cheers

June 02, 2023, 05:25:02 AM #5 Last Edit: June 02, 2023, 11:47:36 AM by kpiq
Franco,

I guess I answered my own question (above) about "link fault detection" by tinkering around with my home network.  I ran ifconfig before and after disconnecting the WAN cable and saw that FreeBSD detected the carrier loss.
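
In case it helps anyone repeat the test, here is a minimal sketch of comparing the two ifconfig captures programmatically. It relies on the `status:` line that FreeBSD's ifconfig prints for Ethernet interfaces (`active` or `no carrier`). The interface name `igb0` and the sample output are made up for illustration, and a real capture after unplugging would differ in other fields too.

```python
import re


def link_status(ifconfig_output: str) -> dict:
    """Map each interface name to its reported status string."""
    statuses = {}
    current = None
    for line in ifconfig_output.splitlines():
        # Interface header lines start in column 0, e.g. "igb0: flags=8843<...>"
        header = re.match(r"^(\S+): flags=", line)
        if header:
            current = header.group(1)
            continue
        # Status lines are indented, e.g. "\tstatus: no carrier"
        status = re.match(r"^\s+status: (.+)$", line)
        if status and current:
            statuses[current] = status.group(1).strip()
    return statuses


# Illustrative sample only, not a capture from a real firewall.
SAMPLE_UP = (
    "igb0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500\n"
    "\tmedia: Ethernet autoselect (1000baseT <full-duplex>)\n"
    "\tstatus: active\n"
)
# Simplification: only the status line is changed for the "unplugged" case.
SAMPLE_DOWN = SAMPLE_UP.replace("status: active", "status: no carrier")

if __name__ == "__main__":
    print("before:", link_status(SAMPLE_UP))
    print("after: ", link_status(SAMPLE_DOWN))
```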

Now, I was just reviewing the WAN_GW gateway configuration (system > gateway > single) and noticed the "Far Gateway" choice, described as "This will allow the gateway to exist outside of the interface subnet".  WAN_GW is using the WAN interface and is defined as Upstream Gateway, with Far Gateway unchecked.

I definitely don't know what I'm talking about here, but please hear me out.   Even if I wasn't using frr, would this re-route traffic to the other firewall if WAN is down, and vice versa ?

- Create another Single Gateway, this one tied to the LAN interface, with a static IP address pointing to the default gateway of my other firewall, Upstream unchecked and Far Gateway checked.
- Include this new single gateway as a Tier2 gateway in the Gateway Group that already has WAN_GW defined as a Tier1 member.
- Replicate the same setup on the other firewall, reversing the gateway order in the Gateway Group.
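
To make the intent of the steps above concrete, here is a simplified model of tiered failover: traffic uses the members of the lowest-numbered tier that still has a gateway up. This is only a sketch of the concept, not OPNsense's code. `WAN_GW` is the gateway named above, while `LAN_PEER_GW` is a hypothetical name for the proposed new LAN-side gateway.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gateway:
    name: str
    tier: int  # lower number = preferred
    up: bool   # as reported by gateway monitoring


def active_gateways(group: List[Gateway]) -> List[str]:
    """Return the names of the gateways that would carry traffic."""
    alive = [gw for gw in group if gw.up]
    if not alive:
        return []  # every member is down, nothing to fail over to
    best_tier = min(gw.tier for gw in alive)
    return [gw.name for gw in alive if gw.tier == best_tier]


if __name__ == "__main__":
    group = [Gateway("WAN_GW", 1, True), Gateway("LAN_PEER_GW", 2, True)]
    print(active_gateways(group))  # WAN healthy: Tier1 is used
    group[0].up = False
    print(active_gateways(group))  # WAN down: falls back to Tier2
```

The model only works if the Tier1 member is actually marked down when the WAN fails, which is the gateway monitoring question from earlier in the thread.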

Thanks for your patience.

Regards

Pedro

Hi Pedro,

I'm not sure. I wouldn't use a gateway on an interface that's not supposed to reach an external router, but perhaps this works. All I'm trying to say is that this seems like an uncommon approach.


Cheers,
Franco

I know.   Very unorthodox.  Trying to get some hardware to test it in the lab.

Appreciate all the help!

Gracias...

By the way, if you want to, you could use bhyve to create a Linux VM on your OPNsense and run the latest version of FRR in it.

Here's an example of OpenWRT running on an OPNsense with bhyve:
https://forum.opnsense.org/index.php?topic=34034.0

Hmmm.... I can't find a single reference to frr in that case.  Will read in more depth.

Thanks.

There's not a single reference to frr in it; however, you can install Linux in a VM on the OPNsense and install the latest frr there.

June 21, 2023, 04:50:23 PM #11 Last Edit: June 21, 2023, 05:22:31 PM by kpiq
Quote from: franco on June 02, 2023, 02:31:26 PM
Hi Pedro,

I'm not sure. I wouldn't use a gateway on an interface that's not supposed to reach an external router, but perhaps this works. All I'm trying to say is that this seems like an uncommon approach.


Cheers,
Franco

Thanks for your prompt action to commit frr8 (https://github.com/FRRouting/frr/issues/13597).  I previously suggested a very unorthodox configuration as a shortcut to the need for gateway failover/switchover.  Instead of playing around with features I don't understand well, I'll propose something else, more directly related to Gateway Monitoring than to Link Fault Detection.

I understand that relying on someone else's technology is not the first choice for sound development.  But, there are at least two methods widely used over the Internet to verify network connectivity:   Microsoft's NCSI and Android's GoogleConnectivityCheck.

Would it be open to consideration to add a feature to OPNsense that would add choices to the "Monitor IP" option in the Single Gateways?  The feature would be to choose between an IP address to ping (dpinger), the methods used by Microsoft's NCSI and GoogleConnectivityCheck, or to use the FQDN of your ISP's speedtest site.

I'm sure there will be legal and other reasons that will weigh in when making that choice.  Hope it's something feasible.
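
To illustrate what I mean, here is a rough sketch of how such HTTP-based checks work as I understand them: request a well-known URL and compare the response with what is expected, so a captive portal or a dead upstream shows up as a mismatch. The probe URLs and expected responses below are the commonly documented ones, but please treat them as assumptions. The function names are mine, and nothing here is an existing OPNsense feature.

```python
import urllib.request
from typing import Optional

# name -> (URL, expected HTTP status, expected body or None to ignore the body)
# Assumed values; verify against the vendors' documentation before relying on them.
PROBES = {
    "ncsi": ("http://www.msftconnecttest.com/connecttest.txt", 200, b"Microsoft Connect Test"),
    "google": ("http://connectivitycheck.gstatic.com/generate_204", 204, None),
}


def probe_ok(status: int, body: bytes, expected_status: int,
             expected_body: Optional[bytes]) -> bool:
    """Decide whether a response matches what the probe expects."""
    if status != expected_status:
        return False
    return expected_body is None or body.strip() == expected_body


def check(name: str, timeout: float = 5.0) -> bool:
    """Run one probe over the network; any error counts as 'no connectivity'."""
    url, expected_status, expected_body = PROBES[name]
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return probe_ok(resp.status, resp.read(), expected_status, expected_body)
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return False


if __name__ == "__main__":
    for probe_name in PROBES:
        print(probe_name, "reachable" if check(probe_name) else "unreachable")
```

Unlike a ping to a single Monitor IP, this exercises DNS, routing, and HTTP together, at the cost of depending on a third party's servers.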

Regards

Pedro

Quote from: kpiq on June 01, 2023, 02:50:03 PM
@franco Great.  Will be monitoring this conversation.

Cheers

@franco

I just upgraded my lab to OPNsense 23.7 and will be testing and observing it for a week or two before proposing the upgrade to the production firewalls that were previously failing to fail over.

Appreciate all the hard work, your time, and attention. Will keep you posted.

Regards

Pedro

August 30, 2023, 07:19:11 AM #13 Last Edit: August 30, 2023, 07:22:43 AM by kpiq
Well, 23.7.2 absolutely killed my preparation, before I got a chance to demonstrate that the fiber disconnect will trigger default gateway changes and that those will propagate via OSPF causing a proper failover from one firewall/ISP connection to another.


After the 23.7.2 update, the gateway monitoring - with the default settings and the same Monitoring IP as before - made my firewall believe, intermittently, that the WAN link was down.  We did thorough troubleshooting: cleaned the fibers connected to the WAN port and replaced the SFP.


Was about to call the ISP to troubleshoot the circuit when I took a tcpdump and saw traffic traversing the WAN port.  Decided to try disabling gateway monitoring.   That resolved our trouble.  Would love to use Gateway Monitoring, but it broke my network.  Will appreciate your help.

Do you use a far gateway? Or DHCP with assigned /32 address? There is a patch for that:

https://github.com/opnsense/core/commit/c8a5d32760

The recent work on gateway monitoring in 23.7.x made visible multiple problems that existed in the code for many years and that were almost impossible to catch/debug before.

The patch is not in 23.7.3 but it will be in 23.7.4 most likely.


Cheers,
Franco