Dual WAN Failover stuck

Started by proxykid, September 28, 2022, 05:45:41 PM

Hello

I've been having some issues for quite some time, since 21.7. I'm currently on the most recent version:
OPNsense 22.7.4-amd64
FreeBSD 13.1-RELEASE-p2
OpenSSL 1.1.1q 5 Jul 2022

I have 2 ISP connections, main one being WAN and backup (radio) being WAN2.

WAN is fiber optic, but the ISP sucks. Unfortunately we cannot cancel right now and have to deal with the issues. At least twice a week, around midnight, there is roughly 25% packet loss, so the link is not entirely down. Our setup correctly switches to WAN2.

This issue tends to last 1 or 2 hours. But once WAN is working correctly again and there is no more packet loss, all traffic keeps going through WAN2 instead of switching back to WAN.

I even tried setting up a cron job to reset the WAN interface around 3am.

Allow default gateway switching = OFF
GW GROUP: failover (WAN Tier 1, WAN2 Tier 2)
FIREWALL LAN Rule: !192.168.0.0/16  Gateway: failover

Is there anything we have not set up correctly, or is this an issue with OPNsense?


Dear,

sorry for hijacking the topic, but I have exactly the same problem (version OPNsense 22.7.10_2-amd64).

I tried to simulate the situation in GNS3 and couldn't reconstruct the issue. Failover and recovery did work. The network traffic was rather low - just a few pings.

Then I set up a new (virtual) environment, connected a few clients and the issue is back: Failover works as designed, but recovery does not.

Here is my analysis from last night. The trigger appears to come from /usr/local/etc/rc.syshook.d/monitor/10-dpinger:

/usr/bin/logger -t dpinger "GATEWAY ALARM: ${GATEWAY} (Addr: ${2} Alarm: ${3} RTT: ${4}us RTTd: ${5}us Loss: ${6}%)"

echo -n "Reloading filter: "
/usr/local/bin/flock -n -E 0 -o /tmp/filter_reload_gateway.lock configctl filter reload skip_alias


Gateway log:
<12>1 2022-12-22T22:55:47+01:00 OPNsense.localdomain dpinger 46446 - [meta sequenceId="1"] WAN_GWv4_1 37.209.40.1: Alarm latency 502128us stddev 312011us loss 0%
<13>1 2022-12-22T22:55:47+01:00 OPNsense.localdomain dpinger 14062 - [meta sequenceId="2"] GATEWAY ALARM: WAN_GWv4_1 (Addr: 37.209.40.1 Alarm: 1 RTT: 502128us RTTd: 312011us Loss: 0%)
<12>1 2022-12-22T22:56:11+01:00 OPNsense.localdomain dpinger 46446 - [meta sequenceId="3"] WAN_GWv4_1 37.209.40.1: Clear latency 411311us stddev 298042us loss 1%
<13>1 2022-12-22T22:56:11+01:00 OPNsense.localdomain dpinger 36717 - [meta sequenceId="4"] GATEWAY ALARM: WAN_GWv4_1 (Addr: 37.209.40.1 Alarm: 0 RTT: 411311us RTTd: 298042us Loss: 1%)


Config daemon log:
<13>1 2022-12-22T22:55:48+01:00 OPNsense.localdomain configd.py 196 - [meta sequenceId="1"] [de0c153b-b628-488d-9aca-6dbc676535d1] Reloading filter
<13>1 2022-12-22T22:55:48+01:00 OPNsense.localdomain configd.py 196 - [meta sequenceId="2"] [12ff4b1c-04a6-46ac-afda-1c78ac9be651] request pf current overall table record count and table-entries limit
<13>1 2022-12-22T22:56:11+01:00 OPNsense.localdomain configd.py 196 - [meta sequenceId="3"] [286febe5-4a77-4691-96b1-2e4c32f6d2d4] Reloading filter
<13>1 2022-12-22T22:56:12+01:00 OPNsense.localdomain configd.py 196 - [meta sequenceId="4"] [da34e743-0c02-45b8-bfe8-e03091f0cd9d] request pf current overall table record count and table-entries limit

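To line the two logs up, something like the following rough sketch may help. It assumes the syslog layout shown above (timestamp in the second field) and that "Alarm: 1" means alarm and "Alarm: 0" means clear, as the paired dpinger lines suggest. The function names alarm_transitions and reload_times are made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes the log layout quoted in this post.

# Print "timestamp gateway state" for each GATEWAY ALARM line on stdin.
alarm_transitions() {
    awk '/GATEWAY ALARM:/ {
        ts = $2
        gw = ""; alarm = ""
        for (i = 1; i < NF; i++) {
            if ($i == "ALARM:") gw = $(i + 1)      # gateway name follows "GATEWAY ALARM:"
            if ($i == "Alarm:") alarm = $(i + 1)   # 1 = alarm, 0 = clear (assumed)
        }
        print ts, gw, (alarm == "0" ? "clear" : "alarm")
    }'
}

# Print the timestamp of each "Reloading filter" line on stdin.
reload_times() {
    awk '/Reloading filter/ { print $2 }'
}
```

Feeding the gateway log into alarm_transitions and the config daemon log into reload_times should show whether every transition was followed by a filter reload within a second or so, as it was in the excerpts above.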

According to the logs, the relevant commands ran. However, the rules did not change (192.168.0.2 is the failover GW):
pass in quick on vtnet0 route-to (vtnet1 192.168.0.2) inet from any to ! <private> flags S/SA keep state label "f5a781eeb65a44a79c529c6d7ba4cbb6"

After triggering filter reload manually, the gateway changes from 192.168.0.2 to the primary GW 192.168.0.1 as expected:
pass in quick on vtnet0 route-to (vtnet1 192.168.0.1) inet from any to ! <private> flags S/SA keep state label "f5a781eeb65a44a79c529c6d7ba4cbb6"
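
To catch the stale rule without reading the whole ruleset, a small helper along these lines could be used. It is only a sketch: it assumes rule lines in the format shown above (for example from pfctl -sr, run as root), and route_to_gateways and all_route_to are hypothetical names.

```shell
#!/bin/sh
# Sketch only: parses pf rule lines in the format quoted in this post.

# Print "interface gateway" for each rule line on stdin that has a route-to clause.
route_to_gateways() {
    sed -n 's/.*route-to (\([^ ]*\) \([^)]*\)).*/\1 \2/p'
}

# Return 0 if every route-to rule on stdin uses the expected gateway, 1 otherwise.
# Note: also returns 0 when there are no route-to rules at all.
all_route_to() {
    expected="$1"
    route_to_gateways | awk -v gw="$expected" '$2 != gw { bad = 1 } END { exit (bad ? 1 : 0) }'
}

# Possible usage on the firewall (192.168.0.1 is the primary GW here):
#   pfctl -sr | all_route_to 192.168.0.1 || echo "route-to still points elsewhere"
```

Running this shortly after a "Clear" entry in the gateway log would show whether the automatic reload really left the failover gateway in place.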

Any other ideas on how to analyse this?

Best Dirk