OPNsense Forum

Archive => 23.1 Legacy Series => Topic started by: xaxero on May 04, 2023, 08:30:39 AM

Title: Multi WAN Dpinger needs restarting after gateway outage Workaround
Post by: xaxero on May 04, 2023, 08:30:39 AM
I have an issue that seems to be ongoing and I cannot see a fix in the forums. If this has already been resolved, apologies.

I'm using Starlink, where the WAN frequently drops out. Dpinger needs to be restarted for gateway monitoring to work again, and the routing service has to be restarted to get the default route back.

Has anyone found a fix for this yet? I have disabled sticky connections in the firewall settings.
Title: Re: Multi WAN Dpinger needs restarting after gateway outage
Post by: RedVortex on May 05, 2023, 12:34:55 AM
I've been having the same issue for quite some time now. I also have Starlink and a second WAN on PPPoE. I have IPv4 and IPv6 enabled on SL and only IPv4 on the PPPoE link.

I can usually trigger this issue easily. If I reboot OPNsense, the issue usually starts after about 2 hours; something seems to flap on the SL interface around that time. When that trigger happens, you can see the two dpinger instances (v4 and v6) quit and then get restarted by the scripts to start checking the gateways again. Usually the v6 one recovers and continues to work, but almost every time the v4 one starts failing and flags the gateway as down.

The gateway is really up and I can manually run dpinger in the command line to validate this.

So today, I decided to troubleshoot it and I discovered 2 things.

The dpinger that got restarted has all the right parameters, but its traffic is going out through the PPPoE interface instead of the SL interface (I had to run a tcpdump on the PPPoE interface in OPNsense to see this). Even though -B "SL PUBLIC IP" is there with the SL IP in it, the traffic is leaving on the wrong interface.

While this dpinger is not working, I run another one on the command line with the exact same parameters and it works fine. Checking tcpdump again for the one I started, with the exact same parameters, I see it uses the SL interface, unlike the bad one that uses the PPPoE interface. So I thought, what's going on here? What could cause one dpinger process to use one interface and another dpinger process to use another? The routes are the same, the IPs are the same, the command line is the same...
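(For reference, the kind of manual test I mean is something along these lines, with the same source IP and monitor target as the real dpinger; the full option list is trimmed here:)

# run a second dpinger in the foreground, bound to the SL public IP, pinging the monitor IP
/usr/local/bin/dpinger -f -B 100.79.101.92 -s 1s -l 2s 1.1.1.1

# and in another shell, confirm which interface the probes actually leave on
tcpdump -ni igb0 icmp and host 1.1.1.1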

Then I thought about checking the firewall states: maybe (like with UDP, for instance) something is being reused, hasn't timed out, or didn't get cleared properly when the interface flapped, and the packets are not being handled properly because of this.

And there it was, I could see one state with the bad dpinger and another state with the good dpinger.

Checking the good dpinger, I saw that the rule handling it was "let out anything from firewall host itself (force gw)", as you usually see on any dpinger state, while the bad dpinger was showing under the rule "allow access to DHCP server", which doesn't make sense. So it seemed the old dpinger was stuck on a weird rule, or something in the states wasn't right.

Without restarting dpinger (like I usually do to fix this), I only deleted the bad state in the table. As soon as I did, the packets started flowing to the SL interface as they should, stopped going to the PPPoE interface, the gateway got flagged UP within a few checks, and the state now showed the rule "let out anything from firewall host itself (force gw)" as it should.

All that being said, this looks like something bad happening during the interface flap or DHCP renewal (probably some timing issue in the scripts, or a state not being cleared). My guess is that when dpinger starts monitoring again, a bad state is either created or kept, and that makes the dpinger traffic go out the wrong interface. The state never gets a chance to expire or be reset, because the traffic keeps reusing it, so dpinger continues to flag the gateway as down: the ICMP packets are being routed to the wrong interface (PPPoE in my case) with the source IP of the SL interface.

Restarting dpinger fixes this, since the state is linked to the process (you see the process ID in the state). Deleting the state (firewall/diag/states, search for your dpinger process ID or the IP it monitors) also works: a new state gets created that routes the packets to the proper interface, which fixes the issue.
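(From the shell the same thing can be done with pfctl; 1.1.1.1 is my monitor IP and the id comes from the state output:)

# find the state belonging to the stuck dpinger (the ICMP id/port matches the dpinger PID)
pfctl -ss -vv | grep "1\.1\.1\.1" -A 3

# kill that state by id; the next probe creates a fresh, correctly routed state
pfctl -k id -k <state id from the output above>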
Title: Re: Multi WAN Dpinger needs restarting after gateway outage
Post by: RedVortex on May 05, 2023, 01:08:02 AM
Sorry, I posted a bit rapidly and made a few mistakes, and my description wasn't super clear. I edited my reply a bit, hopefully it is better :-) If not, please do not hesitate to ask for clarifications.
Title: Re: Multi WAN Dpinger needs restarting after gateway outage
Post by: RedVortex on May 05, 2023, 01:12:45 AM
Also, here are two screenshots showing the

- good dpinger process (let out anything from firewall host itself (force gw))

and the

- bad dpinger process (allow access to DHCP server)

under the Rule column in firewall/diags/states.

My monitoring IP for SL is 1.1.1.1, which makes it easy to check for the state since 1.1.1.1 is used only for monitoring the SL gateway, nothing else.
Title: Re: Multi WAN Dpinger needs restarting after gateway outage
Post by: RedVortex on May 05, 2023, 01:52:44 AM
More troubleshooting. I tried to manually flap the interface to see what happens, and I see that sometimes the state uses the SL gateway (100.64.0.1) and sometimes 0.0.0.0.

I even had 2 states at some point, pointing to 2 different gateways. Also, since SL pushes 1.1.1.1 as a DNS server via DHCP, I also end up with a route being added, but it only lasts for some time and then goes away.

SL interface is igb0 in the logs below

SL gateway flagged UP and dpinger working well. 1.1.1.1 is not in the route table.


pfctl -ss -vv | grep "1\.1\.1\.1" -A 3

all icmp 100.79.101.92:23255 -> 1.1.1.1:23255       0:0
   age 00:22:12, expires in 00:00:10, 1309:1305 pkts, 36652:36540 bytes, rule 100
   id: 3253546400000000 creatorid: 837fd2f8 gateway: 100.64.0.1
   origif: igb0


Now I disconnect and reconnect SL and wait for DHCP to get an IP, and I see this. It seems to be using the default gateway, weird... Still, dpinger works, maybe because a temporary route to 1.1.1.1 was added on the initial DHCP?


pfctl -ss -vv | grep "1\.1\.1\.1" -A 3

all icmp 100.79.101.92:54053 -> 1.1.1.1:54053       0:0
   age 00:01:41, expires in 00:00:10, 100:100 pkts, 2800:2800 bytes, rule 93
   id: 6e58546400000002 creatorid: 837fd2f8 gateway: 0.0.0.0
   origif: igb0


and this in the routing table (only the top few routes to keep this simple...)


netstat -rn | head

Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
default            10.50.45.70        UGS      pppoe0
1.1.1.1            100.64.0.1         UGHS       igb0
8.8.4.4            10.50.45.70        UGHS     pppoe0


After a minute or two, SL issues a DHCP renewal, the GW goes down temporarily for dpinger, and I see this: two different states, one on the default gateway and another one with the SL gateway.


pfctl -ss -vv | grep "1\.1\.1\.1" -A 3

all icmp 100.79.101.92:61626 -> 1.1.1.1:61626       0:0
   age 00:00:14, expires in 00:00:09, 14:14 pkts, 392:392 bytes, rule 100
   id: 7451546400000003 creatorid: 837fd2f8 gateway: 100.64.0.1
   origif: igb0
--
all icmp 100.79.101.92:54053 -> 1.1.1.1:54053       0:0
   age 00:03:33, expires in 00:00:00, 195:148 pkts, 5460:4144 bytes, rule 93
   id: 6e58546400000002 creatorid: 837fd2f8 gateway: 0.0.0.0
   origif: igb0


After some time the state using 0.0.0.0 seems to disappear, and the route to 1.1.1.1 also disappears.

SL is still marked as UP now, so for some reason the problem did not happen this time. But you can see that if something gets stuck on 0.0.0.0 (which is my main WAN, PPPoE, by default), SL's dpinger would stop working and send its packets to PPPoE instead of SL.

I'll try to reproduce the issue again later on and post the results. I'll also try to catch a pfctl output and a netstat -rn when the issue happens; if you could do the same, maybe we'll see something clearer than in the UI.
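(Something like this, left running in a shell, should be enough to catch it; the log file path is just an example:)

# record the monitor state and the 1.1.1.1 route every 30 seconds
while true; do
  date
  pfctl -ss -vv | grep "1\.1\.1\.1" -A 3
  netstat -rn | grep "1\.1\.1\.1"
  sleep 30
done >> /tmp/sl-monitor.log 2>&1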
Title: Re: Multi WAN Dpinger needs restarting after gateway outage
Post by: tracerrx on May 05, 2023, 02:43:01 AM
Also starlink and seeing this in the logs:

2023-05-04T20:40:33-04:00 Notice dhclient Creating resolv.conf
2023-05-04T20:40:33-04:00 Error dhclient unknown dhcp option value 0x52
Title: Re: Multi WAN Dpinger needs restarting after gateway outage
Post by: RedVortex on May 05, 2023, 03:58:09 AM
Quote from: tracerrx on May 05, 2023, 02:43:01 AM
Also starlink and seeing this in the logs:

2023-05-04T20:40:33-04:00 Notice dhclient Creating resolv.conf
2023-05-04T20:40:33-04:00 Error dhclient unknown dhcp option value 0x52


Yeah, that part has been there forever, likely just because dhclient doesn't support (or need) that option in the DHCP reply we get from SL: option 82 (0x52 hex). It's probably something the original SL router needs and/or supports but that a standard DHCP client doesn't use (I've replaced some values with xxxx).
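(For anyone who wants to see these options themselves, watching the DHCP exchange on the SL interface with something like this should show the same decode:)

# dump DHCP traffic on the SL interface and decode the options verbosely
tcpdump -vvv -n -i igb0 port 67 or port 68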


  Vendor-rfc1048 Extensions
    Magic Cookie 0x63825363
    DHCP-Message Option 53, length 1: ACK
    Subnet-Mask Option 1, length 4: 255.192.0.0
    Server-ID Option 54, length 4: 100.64.0.1
    Default-Gateway Option 3, length 4: 100.64.0.1
    Lease-Time Option 51, length 4: 300
    Domain-Name-Server Option 6, length 8: 1.1.1.1,8.8.8.8
    Classless-Static-Route Option 121, length 23: (192.168.100.1/32:0.0.0.0),(34.120.255.244/32:0.0.0.0),(default:100.64.0.1)
    MTU Option 26, length 2: 1500
    Agent-Information Option 82, length 24:
      Circuit-ID SubOption 1, length 4: xxxx
      Unknown SubOption 5, length 4:
0x0000:  xxxx xxxx
      Unknown SubOption 151, length 8:
0x0000:  xxxx xxxx xxxx xxxx
      Unknown SubOption 152, length 0:
    END Option 255, length 0
    PAD Option 0, length 0, occurs 28


Also reported here: https://forum.opnsense.org/index.php?topic=28391.0

Option 82 : https://en.wikipedia.org/wiki/Dynamic_Host_Configuration_Protocol#Relay_agent_information_sub-options
Title: Re: Multi WAN Dpinger needs restarting after gateway outage
Post by: RedVortex on May 05, 2023, 04:44:53 AM
I was able to trigger the dpinger issue this way.

The gateway was up and the state showed this
all icmp 100.79.101.92:34217 -> 1.1.1.1:34217       0:0
   age 00:03:13, expires in 00:00:10, 190:190 pkts, 5320:5320 bytes, rule 100
   id: ed96546400000000 creatorid: 837fd2f8 gateway: 100.64.0.1
   origif: igb0


which is normal and 1.1.1.1 was not in the routing table.

I then unplugged the RJ45 on OPNsense (igb0) and reconnected it almost right away. This triggered a DHCP request. Once the request completed, I now had the 1.1.1.1 route in the route table, and the state table showed this (notice the rule has changed from 100 to 90, and the gateway is now 0.0.0.0, i.e. the default, which would use PPPoE, which is not good):

all icmp 100.79.101.92:16758 -> 1.1.1.1:16758       0:0
   age 00:02:50, expires in 00:00:10, 168:146 pkts, 4704:4088 bytes, rule 90
   id: 6f87546400000002 creatorid: 837fd2f8 gateway: 0.0.0.0
   origif: igb0


After about 2 minutes (SL DHCP renewal time) the 1.1.1.1 route disappeared from the routing table on the DHCP renewal, but the state remained at gateway: 0.0.0.0. At that point the gateway monitoring started to fail (since the packets were now being routed to the wrong interface).

After another 2 minutes, another DHCP renewal I guess, the state changed to this (notice rule 100 now and the SL gateway, not 0.0.0.0 anymore):

all icmp 100.79.101.92:34217 -> 1.1.1.1:34217       0:0
   age 00:03:13, expires in 00:00:10, 190:190 pkts, 5320:5320 bytes, rule 100
   id: ed96546400000000 creatorid: 837fd2f8 gateway: 100.64.0.1
   origif: igb0


And the gateway monitoring went back to UP. So far it seems able to recover after some time and several renewals, even though the network never actually went down; it all looks like a mix of routing and firewall-state behaviour. So it may also remain down, depending on timing, I suppose...

I have
- "Allow DNS server list to be overridden by DHCP/PPP on WAN" unchecked in general
- "Allow default gateway switching" checked in general
- "Disable Host Route" unchecked on all gateways in gateways (Description: Do not create a dedicated host route for this monitor when it is checked).

So since I have the last setting unchecked, a route should in theory be added for the monitor. That is the case when the interface comes up on the initial DHCP, but the route seems to be removed on the next DHCP renewal, and I'm not sure why. Maybe it conflicts with the DNS (1.1.1.1), since I use the same IP (1.1.1.1) for monitoring, or something...

So I tried checking "Disable Host Route" and saved the gateway, and I now have this. Monitoring works, but I'm not sure why I see origif: pppoe0 (not SL). Checking with tcpdump, I see the ICMP queries going out on igb0 (SL) and not pppoe0, so I suppose the gateway in the state forces the traffic out the right interface. I also no longer see the 1.1.1.1 route in the routing table...

State with "Disable host route" checked in the SL gateway.

all icmp 100.79.101.92:59191 -> 1.1.1.1:59191       0:0
   age 00:00:26, expires in 00:00:10, 26:25 pkts, 728:700 bytes, rule 100
   id: 9394546400000002 creatorid: 837fd2f8 gateway: 100.64.0.1
   origif: pppoe0


Disconnecting/reconnecting the SL RJ45 ends up creating these two states:

all icmp 100.79.101.92:22967 -> 1.1.1.1:22967       0:0
   age 00:00:16, expires in 00:00:04, 1:0 pkts, 28:0 bytes, rule 90
   id: 79a6546400000000 creatorid: 837fd2f8 gateway: 0.0.0.0
   origif: pppoe0
--
all icmp 100.79.101.92:51249 -> 1.1.1.1:51249       0:0
   age 00:00:13, expires in 00:00:10, 14:14 pkts, 392:392 bytes, rule 100
   id: 9bcf546400000001 creatorid: 837fd2f8 gateway: 100.64.0.1
   origif: pppoe0


After a few seconds only this one remains

all icmp 100.79.101.92:51249 -> 1.1.1.1:51249       0:0
   age 00:00:57, expires in 00:00:09, 56:56 pkts, 1568:1568 bytes, rule 100
   id: 9bcf546400000001 creatorid: 837fd2f8 gateway: 100.64.0.1
   origif: pppoe0


I'm still unable to make the problem (dpinger flagging SL as down and keeping it down until I restart dpinger) happen, though. I'll continue trying to reproduce the issue.

At least I see something weird with the routes/states that could explain why the gateway may get flagged down at some point: the route to 1.1.1.1 disappears while the gateway in the state remains 0.0.0.0, which is what seemed to happen for about 2 minutes (SL DHCP renewal) before fixing itself.

Maybe the state should expire in this situation (there seems to be a 10 second timeout) but it never does, since dpinger keeps it alive with a ping every second? I don't know... lol
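(For anyone who wants to compare, the pf ICMP state timeouts in play can be checked like this:)

# show the ICMP state timeouts; the "expires in" value on the state comes from these
pfctl -s timeouts | grep icmp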
Title: Re: Multi WAN Dpinger needs restarting after gateway outage
Post by: RedVortex on May 05, 2023, 04:51:23 AM
@xaxero, what IP do you monitor for your SL gateway? Something that may also conflict with what we receive in the DHCP from SL?
Title: Re: Multi WAN Dpinger needs restarting after gateway outage
Post by: xaxero on May 05, 2023, 07:25:18 AM
Good Morning
  I am trying this two ways (I use two Starlink Maritime interfaces):

Name    Interface    Protocol    Priority    Gateway    Monitor IP    RTT    RTTd    Loss    Status    Description
StarlinkBackup_GWv4 (active)    StarlinkBackup    IPv4    199    192.168.192.1    100.64.0.1    40.0 ms    9.0 ms    0.0 %    Online
Starmain_VLAN_GWv4    4G_VLAN    IPv4    201    192.168.191.1    1.1.1.1    0.0 ms    0.0 ms    100.0 %    Offline    StrMain
      
For one of them I use the remote gateway IP (100.64.0.1), and this works better than 1.1.1.1, which is down every morning.
Title: Re: Multi WAN Dpinger needs restarting after gateway outage
Post by: RedVortex on May 05, 2023, 08:07:29 AM
Quote from: xaxero on May 05, 2023, 07:25:18 AM
For one of them I use the remote gateway IP (100.64.0.1), and this works better than 1.1.1.1, which is down every morning.

Ha, interesting. So you may end up in the same situation as me, since 1.1.1.1 is pushed by SL as a DNS server in their DHCP reply. If you hit the same bug I'm trying to figure out, you may end up having the gateway wrongly flagged as down.

SL pushes 2 DNS servers using DHCP, and I think this could create issues if you monitor those IPs for the gateways. They push 1.1.1.1 and 8.8.8.8. So try using something else, like 8.8.4.4 for instance, which should not be impacted by the dhcp client scripts that remove or add routes automatically as leases renew (in theory).

You could also try what I'm testing as well: enabling "Disable Host Route" in the gateway settings, so that the monitoring does not try to add a route and depend on it while the dhcp client script may want to remove it (since, I suppose, we do not use the SL DNS servers that are pushed to us). I assume you don't let WAN DHCP-learned DNS servers override the DNS servers you have likely defined manually in General settings.
Title: Re: Multi WAN Dpinger needs restarting after gateway outage
Post by: xaxero on May 05, 2023, 09:42:04 AM
OK, done as you suggested on the one interface (main). Will see how it goes.



Name    Interface    Protocol    Priority    Gateway    Monitor IP    RTT    RTTd    Loss    Status    Description
StarlinkBackup_GWv4 (active)    StarlinkBackup    IPv4    199    192.168.192.1    100.64.0.1    46.8 ms    11.7 ms    0.0 %    Online    StarBK
StarMain_VLAN_GWv4    4G_VLAN    IPv4    201    192.168.191.1    208.67.222.222    36.3 ms    7.8 ms    0.0 %    Online    StrMain
   
Title: Re: Multi WAN Dpinger needs restarting after gateway outage
Post by: aandras on May 05, 2023, 03:38:59 PM
I'm having similar problems for some weeks now. I have 2 WAN gateways. The problem comes mostly with the backup GW, but sometimes with the primary too. The gateway goes into Error status with 100% packet loss. I attached a screenshot that shows the states in the Firewall Diagnostics after the GW changed to Error status. This GW has 8.8.4.4 as its monitor IP. After I delete the entry with the state 0:0, everything goes back to normal for some hours.
Currently I have OPNsense 23.1.6-amd64; this issue came with version 23.1.5.


Title: Re: Multi WAN Dpinger needs restarting after gateway outage
Post by: RedVortex on May 05, 2023, 08:46:20 PM
Also, this fix: https://github.com/opnsense/core/issues/6544 was released in 23.1.7_3 not long ago (I am running _1).

Also, _3 contains a few other patches that could possibly impact our current issue as well; worth upgrading/testing at the very least.
Title: Re: Multi WAN Dpinger needs restarting after gateway outage
Post by: RedVortex on May 05, 2023, 09:18:27 PM
The upgrade didn't solve the issue for me.

To test, I unchecked "disable host route" again on the SL gateway so that a route is added (to replicate the issue we had). Btw, with "disable host route" checked, I have not had the problem again, so far.

So, back to testing. After unchecking the "disable host route", I unplugged and replugged the SL ethernet cable in my igb0.

Once the link came back up, the 1.1.1.1 route got added to the routing table (since this is the IP I monitor), which is expected. And the state now looked like this. Notice that the gateway is 0.0.0.0; it should normally be the SL gateway (100.64.0.1) to make sure dpinger uses that interface to monitor (ICMP ping) 1.1.1.1. So right there I knew the issue would probably trigger later on (on DHCP renewal):

all icmp 100.79.101.92:47126 -> 1.1.1.1:47126       0:0
   age 00:00:58, expires in 00:00:10, 58:57 pkts, 1624:1596 bytes, rule 90
   id: 58da556400000000 creatorid: 7ac5a56d gateway: 0.0.0.0
   origif: igb0


And the gateway was marked as UP. About 2-3 minutes later (SL DHCP renewal), the 1.1.1.1 route disappeared from the routing table and the gateway is now marked as DOWN

The state is still this

all icmp 100.79.101.92:47126 -> 1.1.1.1:47126       0:0
   age 00:12:03, expires in 00:00:10, 715:148 pkts, 20020:4144 bytes, rule 90
   id: 58da556400000000 creatorid: 7ac5a56d gateway: 0.0.0.0
   origif: igb0


And it is not recovering, and you can see that it is linked to this dpinger:

root    47126   0.0  0.0  17728  2624  -  Is   14:51      0:00.09 /usr/local/bin/dpinger -f -S -r 0 -i STARLINK_DHCP -B 100.79.101.92 -p /var/run/dpinger_STARLINK_DHCP.pid -u /var/run/dpinger_STARLINK_DHCP.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s -A 1s -D 500 -L 75 -d 0 1.1.1.1

That dpinger is not working and is flagging the gateway as down, even though the gateway is UP, because the packets are now going out the wrong interface. They should be going to igb0 (SL) but they are going to my other (default) WAN, which is pppoe0, so they will fail (100.79.101.92 is my current SL IP).

tcpdump -i pppoe0 icmp and host 1.1.1.1 -n

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on pppoe0, link-type NULL (BSD loopback), capture size 262144 bytes
15:05:28.529901 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 47126, seq 843, length 8
15:05:29.545644 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 47126, seq 844, length 8
15:05:30.553962 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 47126, seq 845, length 8


This happens because in the state the gateway is set to 0.0.0.0, which is wrong; it should be 100.64.0.1.

If I manually test it I see it works, and the latency is definitely SL's, as it would be 2-3 ms over my pppoe0 link.

ping -S 100.79.101.92 1.1.1.1
PING 1.1.1.1 (1.1.1.1) from 100.79.101.92: 56 data bytes
64 bytes from 1.1.1.1: icmp_seq=0 ttl=58 time=56.306 ms
64 bytes from 1.1.1.1: icmp_seq=1 ttl=58 time=64.790 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=58 time=47.758 ms


The state is also unable to expire and be released/relearned, since dpinger pings every second and that keeps the state alive.

If I kill or restart dpinger this will release the state and fix the issue

I'll kill the state to test it

root@xxxxx:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
all icmp 100.79.101.92:47126 -> 1.1.1.1:47126       0:0
   age 00:21:06, expires in 00:00:09, 1249:148 pkts, 34972:4144 bytes, rule 90
   id: 58da556400000000 creatorid: 7ac5a56d gateway: 0.0.0.0
   origif: igb0

root@xxxxx:~ # pfctl -k id -k 58da556400000000
killed 1 states

root@xxxxx:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
all icmp 100.79.101.92:47126 -> 1.1.1.1:47126       0:0
   age 00:00:03, expires in 00:00:10, 4:4 pkts, 112:112 bytes, rule 100
   id: f1c1556400000002 creatorid: 7ac5a56d gateway: 100.64.0.1
   origif: pppoe0


And now the gateway is back UP.

I'll re-check "disable host route" on the SL gateway, since this seems to help: it prevents the gateway in the state from ending up as 0.0.0.0, because there is never a 1.1.1.1 route when it's enabled. It's a workaround, but it seems to work for now. Using something other than 1.1.1.1 would probably also work, since the DHCP renewal would then not play with the route the way it seems to do every 3 minutes (SL DHCP renew time).
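(A quick way to check whether a monitor IP collides with a DHCP-pushed DNS server; the lease file is per interface, igb0 being SL here:)

# list the DNS servers the SL DHCP lease pushes
grep domain-name-servers /var/db/dhclient.leases.igb0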
Title: Re: Multi WAN Dpinger needs restarting after gateway outage
Post by: RedVortex on May 05, 2023, 09:31:31 PM
Actually, in 23.1.7_3 the problem seems worse... Unplugging and replugging the SL ethernet cable in igb0 triggers the problem every time, it seems.

I did it 3 times and the state looks like this each time now...  :-\

all icmp 100.79.101.92:7232 -> 1.1.1.1:7232       0:0
   age 00:00:36, expires in 00:00:10, 36:0 pkts, 1008:0 bytes, rule 90
   id: c4e7556400000000 creatorid: 7ac5a56d gateway: 0.0.0.0
   origif: pppoe0


Again, restarting dpinger or killing the state brings the gateway status back to UP and the state back to what it should be:

all icmp 100.79.101.92:9493 -> 1.1.1.1:9493       0:0
   age 00:00:16, expires in 00:00:09, 16:16 pkts, 448:448 bytes, rule 100
   id: 15eb556400000000 creatorid: 7ac5a56d gateway: 100.64.0.1
   origif: pppoe0


Also notice the rule goes from 90 to 100. 100 is usually what I see when it works; I believe it's the default rule that allows traffic from OPNsense itself to anywhere, and 90 is the rule associated with DHCP.
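(To double-check which rules those numbers map to, the numbered rule dump should show it, something like:)

# print the loaded ruleset with its rule numbers and pick out 90 and 100
pfctl -vvsr | grep -E "^@(90|100) "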
Title: Re: Multi WAN Dpinger needs restarting after gateway outage
Post by: RedVortex on May 07, 2023, 02:25:50 AM
With 23.1.7_3, the SL gateway always ends up being flagged as down if I use 1.1.1.1, whether I use "Disable Host Route" or not. I tried multiple things to keep it up, but after some time it ends up failing because of the gateway in the state, which ends up sending the packets to the pppoe0 WAN instead of SL.

So I'm dropping the idea of using 1.1.1.1 altogether for now, as it seems really problematic, likely because of the DHCP renewal on SL that pushes 1.1.1.1 as a DNS server. Anyway, I'll be testing with 9.9.9.9 instead and see how it goes.

Did using another IP than 1.1.1.1 fix it for you, @xaxero? Also, have you upgraded to 23.1.7_3 yet?
Title: Re: Multi WAN Dpinger needs restarting after gateway outage
Post by: xaxero on May 07, 2023, 09:29:10 AM
Good Morning
    Changing to OpenDNS has resulted in a big improvement: 48 hours with no issues. However, SL has been very stable. For the second unit I simply use the SL gateway address.

Note: As I am using the dual antenna setup, I have put in a second router at the front end simply to NAT the traffic, so that I have a unique gateway for each antenna, and I tag the packets onto separate VLANs to our main router several decks down. Two WANs with the same gateway were problematic if we had to do a full system power cycle.
With the front-end router I am disabling gateway monitoring and doing all the dpinger stuff on the main router. Disable host route may have helped as well.

Another slimy hack is to force all passenger traffic through the 4G-Starlink-Primary interface via the firewall so this bypasses dpinger completely. The more critical ship traffic goes through the Gateway failover and the worst case scenario is that we are stuck on the VSAT until I can restart Dpinger.

I have attached the gateway configuration of the front-end and the core routers. So far it has been working well.
Title: Re: Multi WAN Dpinger needs restarting after gateway outage
Post by: franco on May 08, 2023, 12:01:59 PM
You can use the following to inspect host route behaviour now:

# pluginctl -r host_routes

An overlap between facilities IS possible and the last match wins, which may break the DNS or the monitoring facility... That's why "disable host route" was added to the monitor settings; in that case the DNS is still active and dpinger monitoring latches on to the interface IP anyway, so routing should be ok (if no PBR is used breaking that as well).


Cheers,
Franco
Title: Re: Multi WAN Dpinger needs restarting after gateway outage
Post by: RedVortex on May 08, 2023, 08:29:42 PM
Quote from: franco on May 08, 2023, 12:01:59 PM
You can use the following to inspect host route behaviour now:

# pluginctl -r host_routes

An overlap between facilities IS possible and the last match wins, which may break the DNS or the monitoring facility... That's why "disable host route" was added to the monitor settings; in that case the DNS is still active and dpinger monitoring latches on to the interface IP anyway, so routing should be ok (if no PBR is used breaking that as well).

Hello franco  :)

Ok, so everything remained stable while I was using 9.9.9.9 (but I did not test for very long, maybe 12h). I've configured 1.1.1.1 again on SL, saved the gateway and then saved the interface as well to restart it.

For now I see this (everything normal and gateway is marked UP)

root@xxxxx:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
No ALTQ support in kernel
ALTQ related functions disabled
all icmp 100.79.101.92:47540 -> 1.1.1.1:47540       0:0
   age 00:03:49, expires in 00:00:10, 225:225 pkts, 6300:6300 bytes, rule 100
   id: a7325d6400000000 creatorid: 7ac5a56d gateway: 100.64.0.1
   origif: igb0



root@xxxxx:~ # pluginctl -r host_routes
{
    "core": {
        "8.8.8.8": null,
        "8.8.4.4": null
    },
    "dpinger": {
        "8.8.4.4": "10.50.45.70",
        "1.1.1.1": "100.64.0.1",
        "2001:4860:4860::8844": "fe80::200:xxxx:xxxx:xxx%igb0",
        "149.112.112.112": "192.168.2.1",
        "2620:fe::9": "2001:470:xx:4x:x"
    }
}


10.50.45.70 is my default gateway that uses pppoe0 interface
100.64.0.1 is SL and is used as backup gateway on igb0


root@xxxxx:~ # netstat -rn | head
Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
default            10.50.45.70        UGS      pppoe0
1.1.1.1            100.64.0.1         UGHS       igb0
8.8.4.4            10.50.45.70        UGHS     pppoe0
10.2.0.0/16        192.168.2.1        UGS         em0
10.50.45.70        link#16            UHS      pppoe0
34.120.255.244     link#4             UHS        igb0


After 2-3 minutes, I see the routing table lose 1.1.1.1 (SL DHCP renewal, I guess), but so far everything remains functional:

root@xxxxx:~ # netstat -rn | head
Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
default            10.50.45.70        UGS      pppoe0
8.8.4.4            10.50.45.70        UGHS     pppoe0
10.2.0.0/16        192.168.2.1        UGS         em0
10.50.45.70        link#16            UHS      pppoe0
34.120.255.244     link#4             UHS        igb0
100.64.0.0/10      link#4             U          igb0


Everything else remains the same and the gateway is, for now, marked UP. When I get back home, I'll test the ethernet cable pull/plug that usually seems to trigger the issue, and I'll let you know what I get then.
Title: Re: Multi WAN Dpinger needs restarting after gateway outage
Post by: franco on May 09, 2023, 11:41:30 AM
Hello RedVortex :)

Hmm, how about this one?

# grep -nr "1\.1\.1\.1" /var/db/dhclient.leases.*

If SL is pushing routes it will scrub them on a renew perhaps.


Cheers,
Franco
Title: Re: Multi WAN Dpinger needs restarting after gateway outage
Post by: lazyE on May 10, 2023, 04:40:29 AM
Hi,

FWIW, I see this too with the Multi-WAN gateway monitor.

Monitor IP / dpinger is not reliable in simulated fail & failback scenarios.

I can only "fix" it by restarting the Gateway service  :(
Title: Re: Multi WAN Dpinger needs restarting after gateway outage
Post by: franco on May 10, 2023, 09:03:49 AM
Keep in mind that some DNS servers have been known to rate-limit or block ping requests so it looks bad but it's not. From the OPNsense perspective the alarm has to be raised even though it's not necessary and disruptive.


Cheers,
Franco
Title: Re: Multi WAN Dpinger needs restarting after gateway outage
Post by: lazyE on May 11, 2023, 09:14:02 AM
So I've been testing Multi-Wan gateway failover for quite a few hours now.

It does not work with the Trigger Level = "Packet Loss" option on 23.latest, or even back on 22.7.latest.

Scenario: the primary gateway has Trigger Level = "Packet Loss" set; blocking ping downstream does NOT cause the gateway to be marked as down, nor the default route to be flipped to the secondary. I have to manually restart the Gateway service (then it notices).

Failback works ok.

It works ok if Trigger Level = "Member Down"; however, that is a less likely real-world scenario, where the ISP link is up but the internet service is interrupted.


Title: Re: Multi WAN Dpinger needs restarting after gateway outage
Post by: franco on May 11, 2023, 09:26:23 AM
See https://github.com/opnsense/core/issues/6231 -- packetloss and delay triggers have been broken inherently with the switch from apinger to dpinger. The latter never supported the lower thresholds. I'm trying to avoid dealing with dpinger for alarm decisions in 23.7 to bring back the desired behaviour and dpinger then is left to only monitor.


Cheers,
Franco
Title: Re: Multi WAN Dpinger needs restarting after gateway outage
Post by: lazyE on May 11, 2023, 10:24:15 AM
Quote from: franco on May 11, 2023, 09:26:23 AM
See https://github.com/opnsense/core/issues/6231 -- packetloss and delay triggers have been broken inherently with the switch from apinger to dpinger. The latter never supported the lower thresholds. I'm trying to avoid dealing with dpinger for alarm decisions in 23.7 to bring back the desired behaviour and dpinger then is left to only monitor.


Cheers,
Franco

Thanks Franco. I read through the issue thread. Appreciate the detail there.

What timeframe are you thinking for the fix ?

Title: Re: Multi WAN Dpinger needs restarting after gateway outage
Post by: franco on May 11, 2023, 10:34:46 AM
It might take 1 more month for the final code to hit development, but as I said the plan is to have it in production for 23.7 in July (not sooner due to considerable changes).


Cheers,
Franco
Title: Re: Multi WAN Dpinger needs restarting after gateway outage Workaround
Post by: xaxero on May 12, 2023, 06:03:27 AM
I am collating the data from this post and others. This applies to Starlink only but may be useful elsewhere. I have applied the following fixes from everyone's suggestions and the gateways are stable. We are having frequent outages as we are in laser link territory, however the link is stable overall.

1/. WAN definition: reject leases from 192.168.100.1 (note: the gateways are on a separate router in my case)
2/. Gateway: enable "Disable Host Route".
3/. Use a monitor IP that is not 1.1.1.1 (in my case OpenDNS) and bind each interface to its DNS server via Settings: General.

Interfaces have been going up and down over the last 24 hours, and the gateways (so far) are behaving and the routes are changing dynamically.
Title: Re: Multi WAN Dpinger needs restarting after gateway outage Workaround
Post by: xaxero on May 12, 2023, 06:46:18 AM
Last thought: perhaps we could include httping as an option in the future, as well as dpinger. HTTP gets much higher priority than ICMP.
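(Something in that spirit can already be approximated from the shell with curl, e.g. timing an HTTP request bound to a given interface; the URL and interface name here are just placeholders:)

# measure total HTTP response time with the request bound to a specific interface
curl --interface igb0 -o /dev/null -s -w "%{time_total}s\n" http://www.example.com/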
Title: Re: Multi WAN Dpinger needs restarting after gateway outage Workaround
Post by: franco on May 12, 2023, 09:13:12 AM
That leaves only the question of who will write and integrate a new solution for the problem someone thought was solved a decade ago.  ;)


Cheers,
Franco
Title: Re: Multi WAN Dpinger needs restarting after gateway outage Workaround
Post by: RedVortex on May 22, 2023, 04:31:39 AM
The problem occurred again today after an Ethernet flap on the SL side (likely a firmware update on their end).

root@xxxxx:~ # pluginctl -r host_routes
{
    "core": {
        "8.8.8.8": null,
        "8.8.4.4": null
    },
    "dpinger": {
        "8.8.4.4": "10.50.45.70",
        "1.1.1.1": "100.64.0.1",
        "2001:4860:4860::8844": "fe80::200:xxxx:xxxx:xxx%igb0",
        "149.112.112.112": "192.168.2.1",
        "2620:fe::9": "2001:470:xx:4x:x"
    }
}

root@xxxxx:~ # netstat -rn | head
Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
default            10.50.45.70        UGS      pppoe0
8.8.4.4            10.50.45.70        UGHS     pppoe0
10.2.0.0/16        192.168.2.1        UGS         em0
10.50.45.70        link#16            UHS      pppoe0
34.120.255.244     link#4             UHS        igb0
100.64.0.0/10      link#4             U          igb0

root@xxxxx:~ # grep -nr "1\.1\.1\.1" /var/db/dhclient.leases.*
/var/db/dhclient.leases.igb0:7:  option domain-name-servers 1.1.1.1,8.8.8.8;
/var/db/dhclient.leases.igb0:24:  option domain-name-servers 1.1.1.1,8.8.8.8;
/var/db/dhclient.leases.igb0:41:  option domain-name-servers 1.1.1.1,8.8.8.8;

root@xxxxx:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
No ALTQ support in kernel
ALTQ related functions disabled
all icmp 100.79.101.92:23279 -> 1.1.1.1:23279       0:0
   age 00:05:24, expires in 00:00:10, 319:148 pkts, 8932:4144 bytes, rule 90
   id: 36e4776400000003 creatorid: 7ac5a56d gateway: 0.0.0.0
   origif: igb0

root@xxxxx:~ # tcpdump -i pppoe0 icmp and host 1.1.1.1 -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on pppoe0, link-type NULL (BSD loopback), capture size 262144 bytes
22:21:08.078198 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 23279, seq 385, length 8
22:21:09.141706 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 23279, seq 386, length 8
22:21:10.205204 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 23279, seq 387, length 8
^C
3 packets captured
679 packets received by filter
0 packets dropped by kernel


After the Ethernet flap, the 1.1.1.1 route was present (likely added because of the DNS received from the SL DHCP), but the route got removed after 2-3 minutes (on the SL DHCP renewal, I think). At that point, since the dpinger state continued using 0.0.0.0 (but not the dpinger command line itself), the gateway went down: the packets are now being routed to pppoe0 (my main provider) instead of igb0 (SL), which cannot work, since dpinger is using the SL source IP on my other provider and the traffic is likely being dropped.

What I expect to happen: the state should use the SL gateway, not 0.0.0.0, whatever the routes are.
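(Until that is fixed, clearing the stuck state by hand, without restarting dpinger, is enough; since 1.1.1.1 is only used for monitoring here, killing by host should work too:)

# kill every state going to the monitor IP; dpinger immediately re-creates a correct one
pfctl -k 0.0.0.0/0 -k 1.1.1.1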
Title: Re: Multi WAN Dpinger needs restarting after gateway outage Workaround
Post by: franco on May 22, 2023, 08:39:35 AM
Ok, let's do this then: https://github.com/opnsense/core/commit/c12e77519f164

However, in multi-WAN you really need to set a gateway for each global DNS server being used:

    "core": {
        "8.8.8.8": null,
        "8.8.4.4": null
    },


8.8.8.8 would naturally need the SL gateway and 8.8.4.4 the other WAN gateway as per:

https://docs.opnsense.org/manual/how-tos/multiwan.html#step-3-configure-dns-for-each-gateway

Perhaps even adding 1.1.1.1 as global DNS to SL would fix the current situation as well (DNS server and route are always enforced unlike gateway monitoring). And from the docs you can see coupling these facilities through the same server on the same link makes sense.


Cheers,
Franco
Title: Re: Multi WAN Dpinger needs restarting after gateway outage Workaround
Post by: RedVortex on May 23, 2023, 06:21:00 AM
I applied the patch, so far so good, I'll do more testing this week and let you know how it goes.

Is it just me or do I have a feeling of "deja vu"? I think we troubleshot something along those lines a few months ago, before letting it go after deciding that there was a lot of cleanup necessary around those scripts :-) Hopefully this time we'll nail it, lol
Title: Re: Multi WAN Dpinger needs restarting after gateway outage Workaround
Post by: franco on May 23, 2023, 03:38:15 PM
We did, but some progress was made on the code so it's good to revisit (and debugging was kinda easy this time).

For the route drop of the nameserver it's probably better to aim for symmetry or at least not undo routes that haven't even been added (by DNS itself). I've added the proposed change to upcoming 23.1.8 and will circle back at some point. Already have an idea on how to pull this off.


Cheers,
Franco
Title: Re: Multi WAN Dpinger needs restarting after gateway outage Workaround
Post by: RedVortex on May 23, 2023, 04:23:00 PM
I also think the problem is somewhat complicated by the fact that we use 1.1.1.1 for 2 things. We'll need to decide which one wins.

On SL, they push us the 1.1.1.1 DNS. Even though I do not use their DNS (I do not allow WAN-pushed DNS to override mine), it seems to play with the routing table. On top of that, I also use 1.1.1.1 for gateway monitoring, where you can select whether you want dpinger to add routes or not for the monitoring. And on top of that, someone may also add 1.1.1.1 to their DNS configs and (may or may not; I know I don't) select a gateway for it, which I think may also add routes...

So I think we may need to decide at some point what takes priority (likely based on which functionality absolutely needs its route, or something like that), or define an order of priority for what does what.

I mean, any provider could decide to start pushing a route, a DNS server or something else that we are already using as a monitoring IP, and we may have selected (or not) to add a route for the monitoring; who would win? :-)
Title: Re: Multi WAN Dpinger needs restarting after gateway outage Workaround
Post by: franco on May 23, 2023, 04:27:12 PM
Actually, as per the doc, the suggestion is to make each IP exclusive to the attached uplink and that's it. You would be starting to validate in a circle, and some of this, like DNS servers via DHCP(v6), is runtime information, which complicates the issue further.

The individual areas can validate against double-use already, but throwing the host route into the routing table is sort of a blackbox. We only know if a route was there but not why. Is it ours? Is it someone else's? Who knows.


Cheers,
Franco
Title: Re: Multi WAN Dpinger needs restarting after gateway outage Workaround
Post by: RedVortex on May 26, 2023, 09:32:00 PM
Testing the patch went well and I upgraded to 23.1.8 last night and so far so good.

As long as the route remains there, it should work; we'll see over the next few days. SL is on igb0, its GW is 100.64.0.1 and it monitors 1.1.1.1; my main provider is on pppoe0, its GW is 10.50.45.70 and it monitors 8.8.4.4.

DNS servers (not bound to a gateway) are 8.8.8.8 and 8.8.4.4

The state shows that the default gateway 0.0.0.0 (the routing table, more likely) is being used to reach 1.1.1.1, not 100.64.0.1, but I see the packets flowing through igb0 (SL), not pppoe0, which is what we want. So we're good, as long as the 1.1.1.1 route remains there.
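(Checked with something like this, same capture as earlier but on igb0:)

# confirm the monitor probes leave on igb0 (SL) and not on pppoe0
tcpdump -ni igb0 icmp and host 1.1.1.1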


root@xxxxx:~ # pluginctl -r host_routes
{
    "core": {
        "8.8.8.8": null,
        "8.8.4.4": null
    },
    "dpinger": {
        "8.8.4.4": "10.50.45.70",
        "1.1.1.1": "100.64.0.1",
        "2001:4860:4860::8844": "fe80::200:xxxx:xxxx:xxx%igb0",
        "149.112.112.112": "192.168.2.1",
        "2620:fe::9": "2001:470:xxx:x::x"
    }
}


root@xxxxx:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
No ALTQ support in kernel
ALTQ related functions disabled
all icmp 100.79.101.92:29707 -> 1.1.1.1:29707       0:0
   age 12:19:33, expires in 00:00:09, 43592:43538 pkts, 1220576:1219064 bytes, rule 90
   id: 3b7e706400000001 creatorid: c307077d gateway: 0.0.0.0
   origif: igb0


root@xxxxx:~ # netstat -rn | head
Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
default            10.50.45.70        UGS      pppoe0
1.1.1.1            100.64.0.1         UGHS       igb0
8.8.4.4            10.50.45.70        UGHS     pppoe0
10.2.0.0/16        192.168.2.1        UGS         em0
10.50.45.70        link#16            UHS      pppoe0
34.120.255.244     link#4             UHS        igb0
Title: Re: Multi WAN Dpinger needs restarting after gateway outage Workaround
Post by: Jetro on November 14, 2023, 11:01:43 PM
Same problem on 23.7.8.
I have 4 Gateways (FTTH, FTTC, FWA, SAT) and Starlink is the only one presenting this problem.
Title: Re: Multi WAN Dpinger needs restarting after gateway outage Workaround
Post by: RedVortex on November 15, 2023, 05:59:27 PM
Quote from: Jetro on November 14, 2023, 11:01:43 PM
Same problem on 23.7.8.
I have 4 Gateways (FTTH, FTTC, FWA, SAT) and Starlink is the only one presenting this problem.

I'm curious... I also run Starlink and have not hit this issue on this version yet. But I know it sometimes takes time, and some instability, for the issue to show up on my SL gateway. Still, the bug has not happened anymore since the 23.1.8 fixes.

Can you check the output of pluginctl -r host_routes?
Title: Re: Multi WAN Dpinger needs restarting after gateway outage Workaround
Post by: RedVortex on February 04, 2024, 10:29:42 PM
Quote from: Jetro on November 14, 2023, 11:01:43 PM
Same problem on 23.7.8.
I have 4 Gateways (FTTH, FTTC, FWA, SAT) and Starlink is the only one presenting this problem.

For me the problem had stopped happening since the last 23.1.x patches on Starlink, but it started to appear again in 24.1-rc1 and is still ongoing on 24.1 final.

Here's the link to the issue in the 24.1 forum if you feel like troubleshooting it with us: https://forum.opnsense.org/index.php?topic=38603.0