Multi WAN: dpinger needs restarting after gateway outage (workaround)

Started by xaxero, May 04, 2023, 08:30:39 AM

The problem occurred again today after an Ethernet flap on the SL side (likely a firmware update on their end).

root@xxxxx:~ # pluginctl -r host_routes
{
    "core": {
        "8.8.8.8": null,
        "8.8.4.4": null
    },
    "dpinger": {
        "8.8.4.4": "10.50.45.70",
        "1.1.1.1": "100.64.0.1",
        "2001:4860:4860::8844": "fe80::200:xxxx:xxxx:xxx%igb0",
        "149.112.112.112": "192.168.2.1",
        "2620:fe::9": "2001:470:xx:4x:x"
    }
}

root@xxxxx:~ # netstat -rn | head
Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
default            10.50.45.70        UGS      pppoe0
8.8.4.4            10.50.45.70        UGHS     pppoe0
10.2.0.0/16        192.168.2.1        UGS         em0
10.50.45.70        link#16            UHS      pppoe0
34.120.255.244     link#4             UHS        igb0
100.64.0.0/10      link#4             U          igb0

root@xxxxx:~ # grep -nr "1\.1\.1\.1" /var/db/dhclient.leases.*
/var/db/dhclient.leases.igb0:7:  option domain-name-servers 1.1.1.1,8.8.8.8;
/var/db/dhclient.leases.igb0:24:  option domain-name-servers 1.1.1.1,8.8.8.8;
/var/db/dhclient.leases.igb0:41:  option domain-name-servers 1.1.1.1,8.8.8.8;

root@xxxxx:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
No ALTQ support in kernel
ALTQ related functions disabled
all icmp 100.79.101.92:23279 -> 1.1.1.1:23279       0:0
   age 00:05:24, expires in 00:00:10, 319:148 pkts, 8932:4144 bytes, rule 90
   id: 36e4776400000003 creatorid: 7ac5a56d gateway: 0.0.0.0
   origif: igb0

root@xxxxx:~ # tcpdump -i pppoe0 icmp and host 1.1.1.1 -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on pppoe0, link-type NULL (BSD loopback), capture size 262144 bytes
22:21:08.078198 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 23279, seq 385, length 8
22:21:09.141706 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 23279, seq 386, length 8
22:21:10.205204 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 23279, seq 387, length 8
^C
3 packets captured
679 packets received by filter
0 packets dropped by kernel


After the Ethernet flap, the 1.1.1.1 route was present (likely added for the DNS server received via SL's DHCP), but the route got removed after 2-3 minutes (on the SL DHCP renewal, I think). At that point, since the dpinger pf state kept using gateway 0.0.0.0 (while the dpinger command line itself was unchanged), the gateway went down: packets were now being routed to pppoe0 (my main provider) instead of igb0 (SL), which cannot work since dpinger uses the SL source IP, and that traffic is likely dropped on my other provider.

What I would expect to happen: the state should use the SL gateway, not 0.0.0.0, regardless of what the routing table says.
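In the meantime, a manual workaround (a sketch using the addresses from my setup above; adjust the monitor IP, gateway, and source address to yours) is to re-add the host route and kill the stale ICMP state so pf re-evaluates routing for the next probe:

# re-add the monitor host route via the SL gateway
route add -host 1.1.1.1 100.64.0.1
# kill the stale state (source -> destination) so a fresh one is created
pfctl -k 100.79.101.92 -k 1.1.1.1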

Ok, let's do this then: https://github.com/opnsense/core/commit/c12e77519f164
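For reference, the change can be applied on the firewall with the opnsense-patch utility (assuming the default core repository; the shortened hash is taken from the link above):

opnsense-patch c12e77519f164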

However, in multi-WAN you really need to set a gateway for each global DNS server being used:

    "core": {
        "8.8.8.8": null,
        "8.8.4.4": null
    },


8.8.8.8 would naturally need the SL gateway and 8.8.4.4 the other WAN gateway as per:

https://docs.opnsense.org/manual/how-tos/multiwan.html#step-3-configure-dns-for-each-gateway

Perhaps even adding 1.1.1.1 as a global DNS server bound to SL would fix the current situation as well (a DNS server and its route are always enforced, unlike gateway monitoring). And from the docs you can see that coupling these facilities through the same server on the same link makes sense.
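For illustration (hypothetical output, assuming 8.8.8.8 is bound to the SL gateway and 8.8.4.4 to the PPPoE one in System: Settings: General), the "core" section would then carry the gateways instead of null:

    "core": {
        "8.8.8.8": "100.64.0.1",
        "8.8.4.4": "10.50.45.70"
    },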


Cheers,
Franco

I applied the patch; so far so good. I'll do more testing this week and let you know how it goes.

Is it just me, or do I have a feeling of "déjà vu"? I think we troubleshot something along those lines a few months ago before letting it go, after deciding there was a lot of cleanup needed around those scripts? :-) Hopefully this time we'll nail it, lol

We did, but some progress has been made on the code since then, so it's good to revisit (and debugging was kinda easy this time).

For the nameserver route drop it's probably better to aim for symmetry, or at least not undo routes that weren't even added by DNS itself. I've added the proposed change to the upcoming 23.1.8 and will circle back at some point. I already have an idea of how to pull this off.


Cheers,
Franco

I also think the problem is somewhat complicated by the fact that we use 1.1.1.1 for two things. We'll need to decide which one wins.

On SL, they push 1.1.1.1 as a DNS server. Even though I do not use their DNS (I do not allow WAN-pushed DNS to override mine), it still seems to play with the routing table. On top of that, I also use 1.1.1.1 for gateway monitoring, where you can select whether or not you want dpinger to add a route for the monitor IP. And on top of that, someone may also add 1.1.1.1 to their DNS configuration and (may or may not; I know I don't) select a gateway for it, which I think may also add a route...

So I think we may need to decide at some point what takes priority (likely based on which functionality absolutely needs its route, or something like that), or at least define an order of priority for what does what.

I mean, any provider could decide to start pushing a route or a DNS server that we are already using as a monitoring IP, and we may (or may not) have selected to add a route for the monitoring. Who would win? :-)
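A quick way to spot such a collision (a sketch reusing the commands from earlier in the thread; 1.1.1.1 stands in for whichever monitor IP you use):

# is the monitor IP also being pushed as a DNS server by any lease?
grep -h "domain-name-servers" /var/db/dhclient.leases.* | grep "1\.1\.1\.1"
# and which subsystems registered a host route for it?
pluginctl -r host_routes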

Actually, the doc suggests making each IP exclusive to the attached uplink, and that's it. Otherwise you start validating in a circle, and some of this, like DNS servers via DHCP(v6), is runtime information, which complicates the issue further.

The individual areas can already validate against double use, but throwing the host route into the routing table is sort of a black box. We only know that a route was there, but not why. Is it ours? Is it someone else's? Who knows.
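For example, route(8) can show what is installed, but not by whom:

# prints destination, gateway, flags and interface for the host route...
route -n get 1.1.1.1
# ...but nothing in the output records which subsystem added it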


Cheers,
Franco

Testing the patch went well, and I upgraded to 23.1.8 last night; so far so good.

As long as the route remains there it should work; we'll see over the next few days. SL is on igb0 with GW 100.64.0.1, monitoring IP 1.1.1.1; the main provider is on pppoe0 with GW 10.50.45.70, monitoring IP 8.8.4.4.

DNS servers (not bound to a gateway) are 8.8.8.8 and 8.8.4.4

The state shows that gateway 0.0.0.0 (meaning the routing table decides, more likely) is being used to reach 1.1.1.1, not 100.64.0.1, but I see the packets flowing through igb0 (SL), not pppoe0, which is what we want. So we're good as long as the 1.1.1.1 route remains there.
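To double-check the egress side (same kind of capture as earlier in the thread; the interface names and monitor IP are from my setup), the probes should show up on igb0 and not on pppoe0:

# echo requests to the monitor IP should appear here...
tcpdump -i igb0 icmp and host 1.1.1.1 -n
# ...and stay absent here
tcpdump -i pppoe0 icmp and host 1.1.1.1 -n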


root@xxxxx:~ # pluginctl -r host_routes
{
    "core": {
        "8.8.8.8": null,
        "8.8.4.4": null
    },
    "dpinger": {
        "8.8.4.4": "10.50.45.70",
        "1.1.1.1": "100.64.0.1",
        "2001:4860:4860::8844": "fe80::200:xxxx:xxxx:xxx%igb0",
        "149.112.112.112": "192.168.2.1",
        "2620:fe::9": "2001:470:xxx:x::x"
    }
}


root@xxxxx:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
No ALTQ support in kernel
ALTQ related functions disabled
all icmp 100.79.101.92:29707 -> 1.1.1.1:29707       0:0
   age 12:19:33, expires in 00:00:09, 43592:43538 pkts, 1220576:1219064 bytes, rule 90
   id: 3b7e706400000001 creatorid: c307077d gateway: 0.0.0.0
   origif: igb0


root@xxxxx:~ # netstat -rn | head
Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
default            10.50.45.70        UGS      pppoe0
1.1.1.1            100.64.0.1         UGHS       igb0
8.8.4.4            10.50.45.70        UGHS     pppoe0
10.2.0.0/16        192.168.2.1        UGS         em0
10.50.45.70        link#16            UHS      pppoe0
34.120.255.244     link#4             UHS        igb0

Same problem on 23.7.8.
I have 4 Gateways (FTTH, FTTC, FWA, SAT) and Starlink is the only one presenting this problem.

Quote from: Jetro on November 14, 2023, 11:01:43 PM
Same problem on 23.7.8.
I have 4 Gateways (FTTH, FTTC, FWA, SAT) and Starlink is the only one presenting this problem.

I'm curious... I also run Starlink and have not seen this issue on this version yet. But I know it sometimes takes time, and some instability on my SL gateway, for the issue to show up. The bug has not happened for me since the 23.1.8 fixes, though.

Can you check the output of pluginctl -r host_routes?

Quote from: Jetro on November 14, 2023, 11:01:43 PM
Same problem on 23.7.8.
I have 4 Gateways (FTTH, FTTC, FWA, SAT) and Starlink is the only one presenting this problem.

For me the problem had not been happening since the last 23.1.x patches on Starlink, but it started to appear again in 24.1-rc1 and is still ongoing on 24.1 final.

Here's the link to the issue in the 24.1 forum if you feel like troubleshooting it with us: https://forum.opnsense.org/index.php?topic=38603.0