Multi-WAN: Dpinger needs restarting after gateway outage (Workaround)

Started by xaxero, May 04, 2023, 08:30:39 AM

May 05, 2023, 09:31:31 PM #15 Last Edit: May 05, 2023, 09:33:49 PM by RedVortex
Actually, in 23.1.7_3 the problem seems worse... Unplugging and replugging the SL Ethernet cable on igb0 seems to trigger the problem every time.

I did it 3 times and the state looks like this each time now...  :-\

all icmp 100.79.101.92:7232 -> 1.1.1.1:7232       0:0
   age 00:00:36, expires in 00:00:10, 36:0 pkts, 1008:0 bytes, rule 90
   id: c4e7556400000000 creatorid: 7ac5a56d gateway: 0.0.0.0
   origif: pppoe0


Again, restarting dpinger or killing the state brings the gateway status to UP and the state back to what it should be:

all icmp 100.79.101.92:9493 -> 1.1.1.1:9493       0:0
   age 00:00:16, expires in 00:00:09, 16:16 pkts, 448:448 bytes, rule 100
   id: 15eb556400000000 creatorid: 7ac5a56d gateway: 100.64.0.1
   origif: pppoe0


Also notice the rule goes from 90 to 100. 100 is usually what I see when it works; I believe it's the default rule that allows traffic from the OPNsense box to anywhere, while 90 is the rule associated with DHCP.
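For reference, the manual workaround from the shell looks roughly like this (just a sketch: the source/monitor addresses are the ones from the state above, and the pluginctl service syntax is my assumption):

# pfctl -k 100.79.101.92 -k 1.1.1.1
# pluginctl -s dpinger restart

The first command kills the stuck ICMP state from the SL address to the monitor IP, the second restarts the dpinger instances.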

With 23.1.7_3, the SL gateway always ends up being flagged as down if I use 1.1.1.1, whether I use "Disable Host Route" or not. I tried multiple things to keep it up, but after some time it ends up failing because of the stale state, whose gateway ends up sending the packets out the pppoe0 WAN instead of SL.

So I'm dropping the idea of using 1.1.1.1 altogether for now, as this seems really problematic, likely because of the DHCP renewal on SL that maybe pushes 1.1.1.1 as a DNS server? Anyway, I'll be testing with 9.9.9.9 instead and see how it goes.

Did using an IP other than 1.1.1.1 fix it for you, @xaxero? Also, have you upgraded to 23.1.7_3 yet?

Good morning,
Changing to OpenDNS has resulted in a big improvement: 48 hours with no issues. However, SL itself has been very stable. On the second unit I simply use the SL gateway address as the monitor IP.

Note: As I am using the dual-antenna setup, I have put a second router at the front end simply to NAT the traffic, so that I have a unique gateway for each antenna, tagging the packets onto separate VLANs to our main router several decks down. Two WANs with the same gateway were problematic if we had to do a full system power cycle.
On the front-end router I disable gateway monitoring and do all the dpinger work on the main router. "Disable host route" may have helped as well.

Another slimy hack is to force all passenger traffic through the 4G-Starlink-Primary interface via a firewall rule, which bypasses dpinger completely. The more critical ship traffic goes through the gateway failover, and the worst-case scenario is that we are stuck on the VSAT until I can restart dpinger.
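In raw pf terms the idea behind that rule is roughly this (a sketch only: the passenger subnet and addresses here are made up, and in practice you would set this via a firewall rule with the Gateway option in the GUI):

pass in quick on igb1 inet from 10.10.0.0/16 to any route-to (igb0 100.64.0.1)

Matching traffic is forced out igb0 to the SL gateway regardless of the routing table, so dpinger-driven gateway switching never touches it.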

I have attached the gateway configuration of the front-end and core routers. So far it has been working well.

You can use the following to inspect host route behaviour now:

# pluginctl -r host_routes

An overlap between facilities IS possible, and the last match wins, which may break the DNS or the monitoring facility... That's why "disable host route" was added to the monitor settings; in that case the DNS host route is still active, and dpinger monitoring latches on to the interface IP anyway, so routing should be OK (unless PBR is in use, which can break that as well).


Cheers,
Franco

May 08, 2023, 08:29:42 PM #19 Last Edit: February 04, 2024, 10:31:56 PM by RedVortex
Quote from: franco on May 08, 2023, 12:01:59 PM
You can use the following to inspect host route behaviour now:

# pluginctl -r host_routes

An overlap between facilities IS possible, and the last match wins, which may break the DNS or the monitoring facility... That's why "disable host route" was added to the monitor settings; in that case the DNS host route is still active, and dpinger monitoring latches on to the interface IP anyway, so routing should be OK (unless PBR is in use, which can break that as well).

Hello franco  :)

OK, so everything remained stable (but I did not test for very long, maybe 12h) while I was using 9.9.9.9. I've now configured 1.1.1.1 again on SL, saved the gateway and then saved the interface as well to restart it.

For now I see this (everything normal and the gateway is marked UP):

root@xxxxx:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
No ALTQ support in kernel
ALTQ related functions disabled
all icmp 100.79.101.92:47540 -> 1.1.1.1:47540       0:0
   age 00:03:49, expires in 00:00:10, 225:225 pkts, 6300:6300 bytes, rule 100
   id: a7325d6400000000 creatorid: 7ac5a56d gateway: 100.64.0.1
   origif: igb0



root@xxxxx:~ # pluginctl -r host_routes
{
    "core": {
        "8.8.8.8": null,
        "8.8.4.4": null
    },
    "dpinger": {
        "8.8.4.4": "10.50.45.70",
        "1.1.1.1": "100.64.0.1",
        "2001:4860:4860::8844": "fe80::200:xxxx:xxxx:xxx%igb0",
        "149.112.112.112": "192.168.2.1",
        "2620:fe::9": "2001:470:xx:4x:x"
    }
}


10.50.45.70 is my default gateway and uses the pppoe0 interface
100.64.0.1 is SL and is used as the backup gateway on igb0
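To double-check which source address each dpinger instance actually latched on to (franco's point about the interface IP), the process list helps too (a sketch; -B is dpinger's bind-address flag per its usage output):

# ps ax | grep '[d]pinger'

Each line shows the monitor (destination) address and the bind address the instance is pinging from.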


root@xxxxx:~ # netstat -rn | head
Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
default            10.50.45.70        UGS      pppoe0
1.1.1.1            100.64.0.1         UGHS       igb0
8.8.4.4            10.50.45.70        UGHS     pppoe0
10.2.0.0/16        192.168.2.1        UGS         em0
10.50.45.70        link#16            UHS      pppoe0
34.120.255.244     link#4             UHS        igb0


After 2-3 minutes, I see the routing table lose the 1.1.1.1 host route (SL DHCP renewal, I guess), but so far everything remains functional:

root@xxxxx:~ # netstat -rn | head
Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
default            10.50.45.70        UGS      pppoe0
8.8.4.4            10.50.45.70        UGHS     pppoe0
10.2.0.0/16        192.168.2.1        UGS         em0
10.50.45.70        link#16            UHS      pppoe0
34.120.255.244     link#4             UHS        igb0
100.64.0.0/10      link#4             U          igb0

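To catch the exact moment the host route disappears across a renewal, a quick shell loop does the trick (sketch):

# while true; do date; netstat -rn | grep '^1\.1\.1\.1' || echo "host route gone"; sleep 10; done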

Everything else remains the same and the gateway is, for now, marked UP. When I get back home, I'll test the Ethernet cable pull/plug that usually seems to trigger the issue, and I'll let you know what I get then.

Hello RedVortex :)

Hmm, how about this one?

# grep -nr "1\.1\.1\.1" /var/db/dhclient.leases.*

If SL is pushing routes, it will perhaps scrub them on a renew.
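A broader variant (sketch; option names as they commonly appear in ISC dhclient lease files) would also show whether 1.1.1.1 arrives as a DNS server or as a route:

# grep -n -E 'domain-name-servers|routers' /var/db/dhclient.leases.*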


Cheers,
Franco

Hi,

FWIW, I see this too with the Multi-WAN gateway monitor:

Monitor IP / dpinger not reliable in simulated fail & failback scenarios

I can only "fix" it by restarting the Gateway service  :(

Keep in mind that some DNS servers have been known to rate-limit or block ping requests, so it looks bad but it isn't. From the OPNsense perspective the alarm has to be raised, even though it's unnecessary and disruptive.
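A quick sanity check (sketch) is to ping the monitor host directly for a while and look at the loss pattern:

# ping -c 100 1.1.1.1

Steady replies with periodic gaps point at rate limiting rather than a real outage.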


Cheers,
Franco

So I've been testing Multi-WAN gateway failover for quite a few hours now.

It does not work with the Trigger Level = "Packet Loss" option on the latest 23.x, or even going back to the latest 22.7.

Scenario: with the primary gateway's Trigger Level set to "Packet Loss", blocking downstream pings does NOT cause the gateway to be marked as down, nor the default route to be flipped to the secondary. I have to manually restart the Gateway service (then it notices).

Failback works OK.

It works OK if Trigger Level = "Member Down"; however, that is the less likely real-world scenario, since usually the ISP link is up while the internet service behind it is interrupted.
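For anyone reproducing this: one way to drop the monitor pings upstream is a pf rule along these lines on the upstream test box (a sketch, not necessarily the exact rule used here; 9.9.9.9 stands in for the monitor IP):

block drop quick inet proto icmp from any to 9.9.9.9

With that in place dpinger reports the loss, but the gateway is never flagged down when the trigger is "Packet Loss".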



See https://github.com/opnsense/core/issues/6231 -- packet loss and delay triggers have been broken inherently since the switch from apinger to dpinger; the latter never supported the lower thresholds. I'm trying to avoid dealing with dpinger for alarm decisions in 23.7 to bring back the desired behaviour; dpinger is then left to only monitor.


Cheers,
Franco

Quote from: franco on May 11, 2023, 09:26:23 AM
See https://github.com/opnsense/core/issues/6231 -- packet loss and delay triggers have been broken inherently since the switch from apinger to dpinger; the latter never supported the lower thresholds. I'm trying to avoid dealing with dpinger for alarm decisions in 23.7 to bring back the desired behaviour; dpinger is then left to only monitor.


Cheers,
Franco

Thanks, Franco. I read through the issue thread; appreciate the detail there.

What timeframe are you thinking for the fix?


It might take one more month for the final code to hit development, but as I said the plan is to have it in production for 23.7 in July (not sooner, due to considerable changes).


Cheers,
Franco

I am collating the data from this post and others. This applies to Starlink only but may be useful elsewhere. I applied the following fixes from everyone's suggestions and the gateways are stable. We are having frequent outages, as we are in laser-link territory, but the link is stable overall.

1/. WAN definition: reject leases from 192.168.100.1 (note: the gateways are on a separate router in my case).
2/. Gateway: enable "Disable host route".
3/. Use a monitor IP that is not 1.1.1.1 (in my case OpenDNS) and bind each interface to its DNS server via Settings: General.

Interfaces have been going up and down over the last 24 hours, and the gateways (so far) are behaving and the routes are changing dynamically.

Last thought: perhaps we could include httping as an option in the future, alongside dpinger. HTTP traffic gets much higher priority than ICMP.
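For example (sketch; httping is a separate FreeBSD package, not bundled with OPNsense):

# httping -c 5 -g http://www.example.com/

It reports latency and loss per HTTP request instead of per ICMP echo.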

That leaves only the question of who will write and integrate a new solution to the problem someone thought was solved a decade ago.  ;)


Cheers,
Franco