
Multi WAN: Dpinger needs restarting after gateway outage (Workaround)


xaxero:
I have an issue that seems to be ongoing, and I cannot find a fix for it in the forums. If this has already been resolved, apologies.

I am using Starlink, where the WAN frequently drops out. After a dropout, dpinger needs to be restarted for gateway monitoring to work again, and the routing service needs to be restarted to get the default route back.

Has anyone found a fix for this yet? I have disabled sticky connections in the firewall settings.

RedVortex:
I've been having the same issue for quite some time now. I also have Starlink plus a second WAN on PPPoE. I have IPv4 and IPv6 enabled on SL and only IPv4 on the PPPoE link.

I can usually trigger this issue easily. If I reboot OPNsense, the issue usually starts after about 2 hours; something seems to flap on the SL interface around that point. When that trigger happens, you can see the two dpinger instances (v4 and v6) quit and then get restarted by the scripts to start checking the gateways again. The v6 one usually recovers and continues to work, but almost every time the v4 one starts failing and flags the gateway as down.

The gateway is really up, and I can run dpinger manually on the command line to validate this.
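For reference, this is roughly what a manual test run looks like. This is just a sketch: check dpinger's usage output for the exact flags on your version, substitute your own WAN address and monitor IP, and the /tmp path is only an example.


--- Code: ---# run in the foreground (-f), bound to the SL address (-B), writing
# latency/loss reports every second (-r 1) to a temporary file (-o)
# 100.79.101.92 is the SL WAN address and 1.1.1.1 the monitor IP in this example
dpinger -f -r 1 -o /tmp/dpinger-test.out -B 100.79.101.92 1.1.1.1

# in another shell, watch the reports
tail -f /tmp/dpinger-test.out

--- End code ---

If the gateway is actually reachable, this test instance reports sane latency and no loss even while the "official" dpinger keeps flagging the gateway as down.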

So today, I decided to troubleshoot it and I discovered 2 things.

The dpinger that got restarted has all the right parameters, but the traffic for this dpinger is going out through the PPPoE interface instead of the SL interface (I had to run tcpdump on the PPPoE interface in OPNsense to see this). Even though -B "SL PUBLIC IP" is there with the SL IP in it, the traffic is leaving on the wrong interface.

While this dpinger is not working, I run another one on the command line with the exact same parameters, and that one works fine. Checking tcpdump again for the one I started, with the exact same parameters, I see it uses the SL interface, unlike the bad one that uses the PPPoE interface. So I thought, what's going on here? What could cause one dpinger process to use one interface and another dpinger process to use another? The routes are the same, the IPs are the same, the command line is the same...
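In case it helps someone reproduce this, the tcpdump check is simply watching for the monitor IP on each WAN interface. The interface names and IP are the ones from my setup; adjust them to yours.


--- Code: ---# the ICMP probes should be leaving via the SL interface (igb0)...
tcpdump -ni igb0 icmp and host 1.1.1.1

# ...but while the issue is present they show up on the PPPoE interface instead
tcpdump -ni pppoe0 icmp and host 1.1.1.1

--- End code ---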

Then I thought about checking the firewall states: maybe something (as can happen with UDP, for instance) is being reused, hasn't timed out, or wasn't cleared properly when the interface flapped, and the packets are not being handled correctly because of this.

And there it was: I could see one state for the bad dpinger and another state for the good dpinger.

Checking the good dpinger, its state showed the rule handling it as "let out anything from firewall host itself (force gw)", as you usually see on any dpinger state, while the bad dpinger showed up under the rule "allow access to DHCP server", which doesn't make sense. So the old dpinger seemed to be stuck on a weird rule, or something in the states wasn't right.

Without restarting dpinger (like I usually do to fix this), I only deleted the bad state from the table. As soon as I did that, the packets started flowing out the SL interface as they should have, stopped going to the PPPoE interface, the gateway got flagged UP within a few checks, and the state's rule now showed "let out anything from firewall host itself (force gw)", as it should.

All that being said, this looks like something bad happening during the interface flap or DHCP renewal (probably some timing in the script, or a state not being cleared). My guess is that when dpinger starts monitoring, a bad state is either created or kept, which makes the dpinger traffic go out the wrong interface. Because the state keeps being reused, it never gets a chance to expire or be reset, so traffic keeps flowing to the wrong interface and dpinger continues to flag the gateway as down. The gateway gets wrongly flagged as down because the ICMP packets are being routed out the wrong interface (PPPoE in my case) with the source IP of the SL interface.

Restarting dpinger fixes this because the state is linked to the process (you can see the process ID in the state). Deleting the state (Firewall / Diagnostics / States; search for your dpinger process ID or the IP it monitors) also fixes the issue, since a new state gets created that routes the packets out the proper interface.
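For anyone who prefers the shell over Firewall / Diagnostics / States, something like this should do the same thing. It is only a sketch: pfctl's -k takes the source host first and the destination second, so substitute your own SL address and monitor IP (here I'm using the same addresses as in the outputs further down).


--- Code: ---# list the states for the monitor IP first
pfctl -ss -vv | grep "1\.1\.1\.1" -A 3

# kill the states from the SL address to the monitor IP (source, then destination)
pfctl -k 100.79.101.92 -k 1.1.1.1

--- End code ---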

RedVortex:
Sorry, I posted a bit rapidly, made a few mistakes, and my description wasn't super clear. I edited my reply a bit; hopefully it is better :-) If not, please do not hesitate to ask for clarifications.

RedVortex:
Also, here are two screenshots showing the good dpinger process (let out anything from firewall host itself (force gw)) and the bad one (allow access to DHCP server) under the rule column in Firewall / Diagnostics / States.

My monitoring IP for SL is 1.1.1.1, which makes it easy to check for the state, since 1.1.1.1 is used only for monitoring the SL gateway and nothing else.

RedVortex:
More troubleshooting. I manually flapped the interface to see what happens, and sometimes the state uses the SL gateway (100.64.0.1) and sometimes 0.0.0.0.

At one point I even had two states pointing to two different gateways. Also, since SL pushes 1.1.1.1 as a DNS server via DHCP, I also end up with a route being added, but it only lasts for a while and then goes away.

The SL interface is igb0 in the logs below.

SL gateway flagged UP and dpinger working well. 1.1.1.1 is not in the route table.


--- Code: ---pfctl -ss -vv | grep "1\.1\.1\.1" -A 3

all icmp 100.79.101.92:23255 -> 1.1.1.1:23255       0:0
   age 00:22:12, expires in 00:00:10, 1309:1305 pkts, 36652:36540 bytes, rule 100
   id: 3253546400000000 creatorid: 837fd2f8 gateway: 100.64.0.1
   origif: igb0

--- End code ---
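Side note: the rule number in the pfctl state output ("rule 100" above) can be looked up in the loaded ruleset, which also prints the rule labels you see in the UI. A sketch, and the exact label text may differ on your version:


--- Code: ---# print loaded rules with their @N numbers and labels, then search for the force gw rule
pfctl -vv -sr | grep "force gw"

--- End code ---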

Now I disconnect and reconnect SL and wait for DHCP to get an IP, and I see this: the state seems to be using the default gateway, weird... dpinger still works, maybe because a temporary route to 1.1.1.1 was added on the initial DHCP?


--- Code: ---pfctl -ss -vv | grep "1\.1\.1\.1" -A 3

all icmp 100.79.101.92:54053 -> 1.1.1.1:54053       0:0
   age 00:01:41, expires in 00:00:10, 100:100 pkts, 2800:2800 bytes, rule 93
   id: 6e58546400000002 creatorid: 837fd2f8 gateway: 0.0.0.0
   origif: igb0

--- End code ---

And this is in the routing table (only the top few routes, to keep things simple):


--- Code: ---netstat -rn | head

Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
default            10.50.45.70        UGS      pppoe0
1.1.1.1            100.64.0.1         UGHS       igb0
8.8.4.4            10.50.45.70        UGHS     pppoe0

--- End code ---

After a minute or two, SL issues a DHCP renewal, the gateway goes down temporarily for dpinger, and I see this: two different states, one using the default gateway and another one using the SL gateway.


--- Code: ---pfctl -ss -vv | grep "1\.1\.1\.1" -A 3

all icmp 100.79.101.92:61626 -> 1.1.1.1:61626       0:0
   age 00:00:14, expires in 00:00:09, 14:14 pkts, 392:392 bytes, rule 100
   id: 7451546400000003 creatorid: 837fd2f8 gateway: 100.64.0.1
   origif: igb0
--
all icmp 100.79.101.92:54053 -> 1.1.1.1:54053       0:0
   age 00:03:33, expires in 00:00:00, 195:148 pkts, 5460:4144 bytes, rule 93
   id: 6e58546400000002 creatorid: 837fd2f8 gateway: 0.0.0.0
   origif: igb0

--- End code ---

After some time, the state using 0.0.0.0 seems to disappear, and the route to 1.1.1.1 also disappears.
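For what it's worth, how long such a stale ICMP state survives is governed by pf's state timeouts, which you can inspect like this (look at the icmp.* values; I'm not assuming specific numbers here):


--- Code: ---# show the configured pf state timeouts
pfctl -s timeouts

--- End code ---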

SL is still marked as UP now, so for some reason the problem did not happen this time. But you can see that if something gets stuck on 0.0.0.0 (which is my main WAN, PPPoE, by default), it would result in SL's dpinger not working and sending its packets out PPPoE instead of SL.

I'll try to reproduce the issue again later and post the results. I'll also try to catch the pfctl output and netstat -rn when the issue happens; if you could do the same, maybe we'll see something clearer than in the UI.
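If anyone wants to capture the same data automatically while waiting for it to reproduce, a small loop like this will snapshot the relevant states and routes every few seconds. Plain sh, and the log path is just an example; pick whatever suits you.


--- Code: ---#!/bin/sh
# snapshot pf states for the monitor IP and the routing table every 10 seconds
# output goes to /root/dpinger-debug.log (example path)
while true; do
  date >> /root/dpinger-debug.log
  pfctl -ss -vv | grep "1\.1\.1\.1" -A 3 >> /root/dpinger-debug.log
  netstat -rn | head >> /root/dpinger-debug.log
  echo "---" >> /root/dpinger-debug.log
  sleep 10
done

--- End code ---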
