I need some help figuring something out with WAN failover and gateway monitoring, because ChatGPT and I cannot figure out why this is happening.
About once a day my ISP has a little blip where I see a small amount of packet loss and increased latency.
We are talking perhaps 10-15 seconds of loss, plus an increase in latency.
Normally, OPNsense would fail over in this scenario, so I tweaked the monitor settings to try to accommodate this loss.
Last night it happened at 04:58, with my settings configured as below:
Latency Low: 200
Latency High: 2000   - This high to accommodate at least 2 seconds of latency before considering failover
Packet Loss Low: 10
Packet Loss High: 75 - This high to accommodate 75% packet loss over the 60-second time period, which equates to 45 lost probes (45 seconds) at a 1-second probe interval.
Probe Interval: 1
Time Period: 60      
Loss Interval: 20   - Immediately fail over if 20 consecutive pings are lost (20 seconds)
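For anyone checking my maths on the 75% figure, here is a rough sketch of the arithmetic. It assumes loss is simply the fraction of probes lost within the time period, which may not be exactly how dpinger averages internally; the variable names are mine, not OPNsense's.

```python
# Values taken from the settings above. The simple "fraction of probes
# lost in the window" model is an assumption, not dpinger's exact algorithm.
probe_interval_s = 1   # Probe Interval
time_period_s = 60     # Time Period
loss_high_pct = 75     # Packet Loss High

# Number of probes sent during one averaging window.
probes_in_window = time_period_s // probe_interval_s

# Probes that must be lost before the high loss threshold is reached.
lost_probes_needed = probes_in_window * loss_high_pct / 100

# With one probe per second, each lost probe is one second of outage.
outage_seconds = lost_probes_needed * probe_interval_s

print(probes_in_window, lost_probes_needed, outage_seconds)
```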
Last night, OPNsense failed over when I think it should NOT have. Logs below.
Gateway Log File:
2024-06-25T04:59:15   Notice   dpinger   ALERT: VIRGIN_DHCP (Addr: 1.1.1.1 Alarm: delay -> none RTT: 20.6 ms RTTd: 21.8 ms Loss: 1.0 %)   
2024-06-25T04:59:04   Notice   dpinger   ALERT: VIRGIN_DHCP (Addr: 1.1.1.1 Alarm: delay+loss -> delay RTT: 410.9 ms RTTd: 1240.2 ms Loss: 10.0 %)
2024-06-25T04:58:27   Notice   dpinger   ALERT: VIRGIN_DHCP (Addr: 1.1.1.1 Alarm: delay -> delay+loss RTT: 433.3 ms RTTd: 1272.6 ms Loss: 12.0 %)
2024-06-25T04:58:15   Notice   dpinger   ALERT: VIRGIN_DHCP (Addr: 1.1.1.1 Alarm: none -> delay RTT: 419.1 ms RTTd: 1263.4 ms Loss: 0.0 %)
You can see from the above that VIRGIN_DHCP never reaches the *down* status, so why did it fail over?
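To sanity-check the alarm states, here is a minimal sketch that classifies the four log samples against my thresholds. The comparison rules (strictly greater than, with the high thresholds meaning *down*) are my assumption about how the monitor decides, not something taken from the OPNsense source, and `classify` is just a name I made up.

```python
# Thresholds from the settings above.
LATENCY_LOW_MS, LATENCY_HIGH_MS = 200, 2000
LOSS_LOW_PCT, LOSS_HIGH_PCT = 10, 75


def classify(rtt_ms, loss_pct):
    """Return the assumed alarm state for one averaged sample."""
    # Assumption: breaching either high threshold marks the gateway down.
    if rtt_ms > LATENCY_HIGH_MS or loss_pct > LOSS_HIGH_PCT:
        return "down"
    # Assumption: the low thresholds only raise delay/loss alarms.
    delay = rtt_ms > LATENCY_LOW_MS
    loss = loss_pct > LOSS_LOW_PCT
    if delay and loss:
        return "delay+loss"
    if delay:
        return "delay"
    if loss:
        return "loss"
    return "none"


# (time, RTT ms, loss %) copied from the gateway log, oldest first.
samples = [
    ("04:58:15", 419.1, 0.0),
    ("04:58:27", 433.3, 12.0),
    ("04:59:04", 410.9, 10.0),
    ("04:59:15", 20.6, 1.0),
]

states = [classify(rtt, loss) for _, rtt, loss in samples]
for (time, _, _), state in zip(samples, states):
    print(time, state)
```

Under those assumptions the states come out as delay, delay+loss, delay, none, which matches the transitions in the log, and none of them is *down*.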
Backend
2024-06-25T04:58:27   Notice   configd.py   [ec6f463a-0689-49ac-9396-c5444a7e5d7d] reconfiguring routing due to gateway alarm
Again, why? It's only an alert, not a down.
Health
1030   Tue Jun 25 2024 04:58:00 GMT+0100 (British Summer Time)   3.402372   0.10887092677   0.28939024999
1031   Tue Jun 25 2024 04:59:00 GMT+0100 (British Summer Time)   12.35422725   0.35960223911   1.0485299431
The above is the total loss: 3.4 + 12.35 = 15.75, well below the configured 20.
Can anyone explain why it failed over? It looks like the alert on its own was enough to do it, in which case I need to increase my *LOW* thresholds to stop the alarm.
Regards
			
			
			
Just looking at this again, is it perhaps because I have my gateway group trigger set to "Packet Loss"?
Perhaps I should change this to "Member Down".
I can only surmise that it detected packet loss, based on the alert alone, and therefore failed over.
			
			
			
Yep, just to confirm, it was because I had the gateway group trigger set to "Packet Loss".
I wrongly assumed that when set to "Packet Loss", it would have to breach the Packet Loss *High* threshold on the monitor, but it's actually the *Low* threshold that triggers the failover.
I've now changed it to "Member Down", so it will now only fail over if the *High* thresholds (Latency/Loss) are breached, or the ping count exceeds my limit.
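For anyone else who hits this, here is a rough sketch of how I now understand the two trigger levels. The mapping of trigger to alarm states is my own reading of the behaviour I saw, not OPNsense code, and `EXCLUDING_STATES` and `fails_over` are hypothetical names.

```python
# Assumed mapping: which alarm states take a member out of the gateway
# group for each trigger level. Only the two triggers I used are covered.
EXCLUDING_STATES = {
    "Member Down": {"down"},
    "Packet Loss": {"down", "loss", "delay+loss"},
}


def fails_over(state, trigger):
    """True if the given alarm state would exclude the gateway."""
    return state in EXCLUDING_STATES[trigger]


# Alarm states from the 2024-06-25 gateway log, oldest first.
log_states = ["delay", "delay+loss", "delay", "none"]

for trigger in EXCLUDING_STATES:
    results = [fails_over(state, trigger) for state in log_states]
    print(trigger, results)
```

With "Packet Loss" the delay+loss state at 04:58:27 is enough to fail over, while with "Member Down" none of the logged states are.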
Had a blip last night, and it didn't fail over. Joy.
2024-06-26T03:18:03   Notice   dpinger   ALERT: VIRGIN_DHCP (Addr: 1.1.1.1 Alarm: delay -> none RTT: 182.6 ms RTTd: 660.6 ms Loss: 0.0 %)   
2024-06-26T03:17:06   Notice   dpinger   ALERT: VIRGIN_DHCP (Addr: 1.1.1.1 Alarm: none -> delay RTT: 1307.5 ms RTTd: 3039.4 ms Loss: 0.0 %)