Link fault detection

kpiq · August 30, 2023, 01:35:18 PM

No. We use an upstream gateway and a /30 static public IP address.

Regards

Pedro

franco · August 30, 2023, 01:41:55 PM

Ok, I'm talking to someone else who also has another edge case issue. I think the monitoring is working fine for the most part, and the only thing that helps now is logs (system and gateways, to be precise) to see whether what you would expect was actually being seen. I'm not too concerned with the 23.1.11 state -- it might have just been doing something by accident. Now, the less we do, the more these edge cases become a problem (but doing less makes the routing a lot more stable in multi-WAN failover situations).

Cheers,
Franco

kpiq · August 31, 2023, 05:30:37 PM

Franco,

My apologies for jumping the gun. It seems we have circuit trouble with one of our Internet providers, with latency and packet loss just outside our gateway. That must be why OPNsense was bringing the WAN port down, as it should. If I recall correctly OPNsense uses latency and packet loss in the gateway monitoring calculation, right?

Thanks for your continued efforts, time, and support. It seems like now we'll have to wait to perform the fiber disconnect testing until the ISP trouble is over.

Regards

Pedro

franco · August 31, 2023, 09:05:36 PM

Hi Pedro,

Yes, there are high thresholds for both packet loss and latency that, when reached, will mark the connection as "down" regardless of the actual link state (a truly disrupted link being the traditional 100% packet loss). It also depends on which gateway group trigger is used. All these values can be tweaked per gateway if that helps.
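To make the threshold behaviour concrete, here is a minimal Python sketch of the decision described above. It is not the actual OPNsense monitoring code: the names `Thresholds` and `evaluate` are hypothetical, the 500 ms and 20% values are assumed for illustration only, and gateway group triggers are not modelled at all.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Thresholds:
    # Assumed illustrative values; check the per-gateway settings for real ones.
    latency_high_ms: float = 500.0
    loss_high_pct: float = 20.0


def evaluate(latency_ms: float, loss_pct: float,
             t: Thresholds = Thresholds()) -> Tuple[str, str]:
    """Return (status, reason) for one monitoring sample."""
    if loss_pct >= 100.0:
        # The traditional case: no probe replies at all.
        return ("down", "no replies")
    if loss_pct >= t.loss_high_pct:
        # Probes still get through, but loss crossed the high threshold.
        return ("down", "packet loss")
    if latency_ms >= t.latency_high_ms:
        # Link is physically up, but latency crossed the high threshold.
        return ("down", "latency")
    return ("up", "ok")


if __name__ == "__main__":
    print(evaluate(30.0, 0.0))    # healthy sample
    print(evaluate(30.0, 25.0))   # lossy circuit, link still physically up
    print(evaluate(600.0, 5.0))   # slow circuit, link still physically up
```

The point of the sketch is only that a degraded circuit, such as one with loss and latency just outside the gateway, can be reported as "down" while the link itself is still up, and that raising the per-gateway thresholds changes that outcome.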

Let me know how this progresses on your end in any case.

Thanks,
Franco

kpiq · December 07, 2023, 08:39:53 PM

Franco,

Sorry it took so long to share the results of our testing. Link fault detection in OPNsense 23.7 and up works fine, but there is something still not right with FRR/OSPF.

We finally scheduled testing for an ISP fiber disconnect on one of the firewalls. We monitored the OSPF LSA notifications and finally saw OSPF removing the external LSA record for the firewall where the fiber was disconnected.

For a minute our routing tables got adjusted to use the other firewall. I was able to ping a few Internet sites. But then OSPF started a shutdown/restart loop in the firewall where the fibers were disconnected, and OSPF started announcing its external LSA as if it were connected... that resulted in full connectivity loss (the disconnected firewall had the higher OSPF gateway metric), even when the other firewall was up.

The team decided that we will not rely on OSPF for gateway switchover. We may try CARP, and we have the Ethernet links needed, but I'm not sure CARP is meant for two firewalls that are 1,000 miles apart with a latency of 25 ms.

I'm sorry this did not work as expected this time around. Maybe it was the wrong way of achieving our goal of reliable uptime through gateway redundancy.

Thanks for all your help.

Regards

Pedro

kpiq · December 27, 2023, 10:10:07 AM

@Franco

When you get a chance, I've posted a reply to your August statements, dated December 7.

Regards

Pedro

