I have an issue that seems to be ongoing and I cannot find a fix in the forums. Apologies if this has already been resolved.
I'm using Starlink, where the WAN frequently drops out. When that happens, dpinger needs to be restarted for gateway monitoring to work again, and the routing service has to be restarted to get the default route back.
Has anyone found a fix for this yet? I have disabled sticky connections in the firewall settings.
I've been having the same issue for quite some time now. I also have Starlink plus a second WAN on PPPoE. I have IPv4 and IPv6 enabled on SL and only IPv4 on the PPPoE link.
I can usually trigger this issue easily: if I reboot OPNsense, the issue starts happening after about 2 hours; something seems to flap on the SL interface around that time. When that trigger happens, you can see the two dpinger instances (v4 and v6) quitting and then being restarted by the scripts to resume checking the gateways. The v6 one usually recovers and keeps working, but almost every time the v4 one starts failing and flags the gateway as down.
The gateway is actually up, and I can manually run dpinger from the command line to validate this.
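For reference, a check along these lines works from the shell (100.79.101.92 is my SL address and 1.1.1.1 my monitor IP, adjust for your own setup):
# grab the exact dpinger invocation used for the SL gateway
ps ax | grep "[d]pinger" | grep "1\.1\.1\.1"
# a plain ping bound to the SL address is a simpler sanity check of the same path
ping -S 100.79.101.92 1.1.1.1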
So today, I decided to troubleshoot it and I discovered 2 things.
The dpinger that got restarted has all the right parameters, but its traffic is going out through the PPPoE interface instead of the SL interface (I had to run a tcpdump on the PPPoE interface in OPNsense to see this), even though -B "SL PUBLIC IP" is there with the SL IP in it.
While this dpinger is not working, I can run another one on the command line with the exact same parameters and that one works fine. Checking tcpdump again on the one I started, with the exact same parameters, I see it uses the SL interface, unlike the bad one that uses the PPPoE interface. So I thought, what's going on here? What could cause one dpinger process to use one interface and another dpinger process to use another? The routes are the same, the IPs are the same, the command line is the same...
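A quick way to see this yourself is to capture on both WANs while dpinger runs; the echo requests should only ever show up on the SL side (igb0, pppoe0 and 1.1.1.1 are from my setup, adjust to yours):
# should show the ICMP echo requests (SL interface)
tcpdump -ni igb0 icmp and host 1.1.1.1
# should stay silent; if the requests show up here, you're hitting this bug (PPPoE interface)
tcpdump -ni pppoe0 icmp and host 1.1.1.1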
Then I thought about checking the firewall states: maybe something (like with UDP, for instance) was being reused, or hadn't timed out or been cleared properly when the interface flapped, and the packets weren't being handled properly because of this.
And there it was, I could see one state with the bad dpinger and another state with the good dpinger.
Checking the good dpinger, I saw that the rule handling it was "let out anything from firewall host itself (force gw)", as you usually see on any dpinger state, while the bad dpinger showed up under the rule "allow access to DHCP server", which doesn't make sense. So it seemed the old dpinger was stuck on a state tied to the wrong rule.
Without restarting dpinger (like I usually do to fix this), I only deleted the bad state in the table. As soon as I did, the packets started flowing out the SL interface as they should, stopped going to the PPPoE interface, the gateway got flagged UP within a few checks, and the state now showed the rule "let out anything from firewall host itself (force gw)" as it should.
All that being said, this looks like something bad happening during the interface flap or the DHCP renewal (probably a timing issue in the scripts or a state not being cleared). My guess is that when dpinger starts monitoring, a bad state is either created or kept, and that makes the dpinger traffic go out the wrong interface. The state never gets a chance to expire or be reset because it keeps being reused, so dpinger keeps flagging the gateway as down: the ICMP packets are routed out the wrong interface (PPPoE in my case) with the source IP of the SL interface.
Restarting dpinger fixes this because the state is linked to the process (you can see the process ID in the state). Deleting the state (Firewall / Diagnostics / States, search for your dpinger process ID or the IP it monitors) also fixes it: a new state gets created that routes the packets out the proper interface.
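From the shell, that workaround boils down to something like this (1.1.1.1 is my monitor IP; use the state id from your own output):
# find the stuck state and note its id
pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
# kill it by id; dpinger's next probe recreates the state with the proper gateway
pfctl -k id -k <state id>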
Sorry, I posted a bit rapidly and made a few mistakes, and my description wasn't super clear. I edited my reply a bit, hopefully it is better :-) If not, please do not hesitate to ask for clarifications.
Also, here are two screenshots from Firewall / Diagnostics / States showing the rule for the
- good dpinger process: "let out anything from firewall host itself (force gw)"
- bad dpinger process: "allow access to DHCP server"
My monitoring IP for SL is 1.1.1.1, which makes it easy to find the state, since 1.1.1.1 is only used for monitoring the SL gateway and nothing else.
More troubleshooting. I manually flapped the interface to see what happens, and I see that sometimes the state uses the SL gateway (100.64.0.1) and sometimes 0.0.0.0.
At some point I even had two states pointing to two different gateways. Also, since SL pushes 1.1.1.1 as a DNS server via DHCP, I also end up with a route being added for it, but it only lasts for some time and then goes away.
SL interface is igb0 in the logs below
SL gateway flagged UP and dpinger working well. 1.1.1.1 is not in the route table.
pfctl -ss -vv | grep "1\.1\.1\.1" -A 3
all icmp 100.79.101.92:23255 -> 1.1.1.1:23255 0:0
age 00:22:12, expires in 00:00:10, 1309:1305 pkts, 36652:36540 bytes, rule 100
id: 3253546400000000 creatorid: 837fd2f8 gateway: 100.64.0.1
origif: igb0
Now I disconnect and reconnect SL and wait for DHCP to get an IP, and I see this: it seems to be using the default gateway, weird... dpinger still works, maybe because a temporary route to 1.1.1.1 was added on the initial DHCP?
pfctl -ss -vv | grep "1\.1\.1\.1" -A 3
all icmp 100.79.101.92:54053 -> 1.1.1.1:54053 0:0
age 00:01:41, expires in 00:00:10, 100:100 pkts, 2800:2800 bytes, rule 93
id: 6e58546400000002 creatorid: 837fd2f8 gateway: 0.0.0.0
origif: igb0
and this in the routing table (only the top few routes to keep this simple...)
netstat -rn | head
Routing tables
Internet:
Destination Gateway Flags Netif Expire
default 10.50.45.70 UGS pppoe0
1.1.1.1 100.64.0.1 UGHS igb0
8.8.4.4 10.50.45.70 UGHS pppoe0
After a minute or two, SL issues a DHCP renewal, the GW goes down temporarily for dpinger, and I see this: two different states, one on the default gateway and the other on the SL gateway.
pfctl -ss -vv | grep "1\.1\.1\.1" -A 3
all icmp 100.79.101.92:61626 -> 1.1.1.1:61626 0:0
age 00:00:14, expires in 00:00:09, 14:14 pkts, 392:392 bytes, rule 100
id: 7451546400000003 creatorid: 837fd2f8 gateway: 100.64.0.1
origif: igb0
--
all icmp 100.79.101.92:54053 -> 1.1.1.1:54053 0:0
age 00:03:33, expires in 00:00:00, 195:148 pkts, 5460:4144 bytes, rule 93
id: 6e58546400000002 creatorid: 837fd2f8 gateway: 0.0.0.0
origif: igb0
After some time, the state using 0.0.0.0 disappears and the route to 1.1.1.1 disappears as well.
SL is still marked as UP now, so for some reason the problem did not happen this time. But you can see that if something gets stuck on 0.0.0.0 (which is my main WAN, PPPoE, by default), SL's dpinger would stop working and send its packets to PPPoE instead of SL.
I'll try to reproduce the issue again later and post the results, and I'll also try to catch a pfctl output and netstat -rn when the issue happens. If you could do the same, maybe we'll see something clearer than in the UI.
Also on Starlink and seeing this in the logs:
2023-05-04T20:40:33-04:00 Notice dhclient Creating resolv.conf
2023-05-04T20:40:33-04:00 Error dhclient unknown dhcp option value 0x52
Quote from: tracerrx on May 05, 2023, 02:43:01 AM
Also on Starlink and seeing this in the logs:
2023-05-04T20:40:33-04:00 Notice dhclient Creating resolv.conf
2023-05-04T20:40:33-04:00 Error dhclient unknown dhcp option value 0x52
Yeah, that part has been there forever, likely just because dhclient doesn't support (or need) this option in the DHCP reply we get from SL (option 82, 0x52 in hex). It's probably something the original SL router needs and/or supports, but it isn't used by a standard DHCP client. (I've replaced some values with xxxx.)
Vendor-rfc1048 Extensions
Magic Cookie 0x63825363
DHCP-Message Option 53, length 1: ACK
Subnet-Mask Option 1, length 4: 255.192.0.0
Server-ID Option 54, length 4: 100.64.0.1
Default-Gateway Option 3, length 4: 100.64.0.1
Lease-Time Option 51, length 4: 300
Domain-Name-Server Option 6, length 8: 1.1.1.1,8.8.8.8
Classless-Static-Route Option 121, length 23: (192.168.100.1/32:0.0.0.0),(34.120.255.244/32:0.0.0.0),(default:100.64.0.1)
MTU Option 26, length 2: 1500
Agent-Information Option 82, length 24:
Circuit-ID SubOption 1, length 4: xxxx
Unknown SubOption 5, length 4:
0x0000: xxxx xxxx
Unknown SubOption 151, length 8:
0x0000: xxxx xxxx xxxx xxxx
Unknown SubOption 152, length 0:
END Option 255, length 0
PAD Option 0, length 0, occurs 28
Also reported here: https://forum.opnsense.org/index.php?topic=28391.0
Option 82 : https://en.wikipedia.org/wiki/Dynamic_Host_Configuration_Protocol#Relay_agent_information_sub-options
I was able to trigger the dpinger issue this way.
The gateway was up and the state showed this
all icmp 100.79.101.92:34217 -> 1.1.1.1:34217 0:0
age 00:03:13, expires in 00:00:10, 190:190 pkts, 5320:5320 bytes, rule 100
id: ed96546400000000 creatorid: 837fd2f8 gateway: 100.64.0.1
origif: igb0
which is normal and 1.1.1.1 was not in the routing table.
I then unplugged the RJ45 on OPNsense (igb0) and reconnected it almost right away. This triggered a DHCP exchange. Once the lease came in, I had the 1.1.1.1 route in the routing table and the state table showed this (notice the rule has changed from 100 to 90 and the gateway is now 0.0.0.0, which is the default and would use PPPoE, which is not good):
all icmp 100.79.101.92:16758 -> 1.1.1.1:16758 0:0
age 00:02:50, expires in 00:00:10, 168:146 pkts, 4704:4088 bytes, rule 90
id: 6f87546400000002 creatorid: 837fd2f8 gateway: 0.0.0.0
origif: igb0
After about 2 mins (SL's DHCP renewal time), the 1.1.1.1 route disappeared from the routing table on renewal, but the state stayed on gateway: 0.0.0.0. At that point gateway monitoring started to fail (since the packets were being routed out the wrong interface).
After another 2 mins, another DHCP renewal I guess, the state changed to this (notice rule 100 now, and the SL gateway instead of 0.0.0.0):
all icmp 100.79.101.92:34217 -> 1.1.1.1:34217 0:0
age 00:03:13, expires in 00:00:10, 190:190 pkts, 5320:5320 bytes, rule 100
id: ed96546400000000 creatorid: 837fd2f8 gateway: 100.64.0.1
origif: igb0
And gateway monitoring went back to UP. So far it seems able to recover after some time and several renewals, even though the network never actually went down; it all looks like a mix of routing and firewall-state behaviour. So I suppose it may stay down depending on timing...
I have
- "Allow DNS server list to be overridden by DHCP/PPP on WAN" unchecked in general
- "Allow default gateway switching" checked in general
- "Disable Host Route" unchecked on all gateways in gateways (Description: Do not create a dedicated host route for this monitor when it is checked).
So since I have that last setting unchecked, a route should in theory be added for the monitor. That is the case when the interface comes up on the initial DHCP, but the route seems to be removed on the next DHCP renewal, and I'm not sure why. Maybe it conflicts with the DNS (1.1.1.1), since I use the same IP (1.1.1.1) for monitoring, or something...
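A crude way to watch whether the renewal is what removes the route is a loop like this in a spare shell (assuming the monitor IP is 1.1.1.1):
while sleep 30; do
  date
  # print the monitor host route if present, otherwise flag that it is gone
  netstat -rn | grep "^1\.1\.1\.1 " || echo "host route to 1.1.1.1 is gone"
done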
So I tried checking "Disable Host Route" and saved the gateway, and I now have this: monitoring works, but I'm not sure why I see origif: pppoe0 (not SL). Checking with tcpdump, I see the ICMP queries going out on igb0 (SL) and not pppoe0, so I suppose the gateway entry forces the traffic out the right interface. I also no longer see the 1.1.1.1 route in the routing table...
State with "Disable host route" checked in the SL gateway.
all icmp 100.79.101.92:59191 -> 1.1.1.1:59191 0:0
age 00:00:26, expires in 00:00:10, 26:25 pkts, 728:700 bytes, rule 100
id: 9394546400000002 creatorid: 837fd2f8 gateway: 100.64.0.1
origif: pppoe0
Disconnecting/reconnecting the SL RJ45 ends up creating these two states:
all icmp 100.79.101.92:22967 -> 1.1.1.1:22967 0:0
age 00:00:16, expires in 00:00:04, 1:0 pkts, 28:0 bytes, rule 90
id: 79a6546400000000 creatorid: 837fd2f8 gateway: 0.0.0.0
origif: pppoe0
--
all icmp 100.79.101.92:51249 -> 1.1.1.1:51249 0:0
age 00:00:13, expires in 00:00:10, 14:14 pkts, 392:392 bytes, rule 100
id: 9bcf546400000001 creatorid: 837fd2f8 gateway: 100.64.0.1
origif: pppoe0
After a few seconds only this one remains
all icmp 100.79.101.92:51249 -> 1.1.1.1:51249 0:0
age 00:00:57, expires in 00:00:09, 56:56 pkts, 1568:1568 bytes, rule 100
id: 9bcf546400000001 creatorid: 837fd2f8 gateway: 100.64.0.1
origif: pppoe0
I'm still unable to make the real problem happen though (dpinger flagging SL as down and keeping it down until I restart dpinger). I'll continue to try to reproduce the issue.
At least I can see something weird with the routes/states that could explain why it may get flagged down at some point: the route to 1.1.1.1 disappears while the state keeps gateway 0.0.0.0, which happened for about 2 minutes (SL DHCP renewal) and then fixed itself.
Maybe the state should expire in this situation (there seems to be a 10 second timeout) but never does, because dpinger keeps it alive by pinging every second? I don't know... lol
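For reference, the ICMP state lifetimes pf uses can be listed with the command below; I assume the 10 seconds shown as "expires in" on these states comes from one of them:
# show pf's configured ICMP state timeouts
pfctl -s timeouts | grep icmp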
@xaxero what IP do you monitor for your SL gateway ? Something that may also conflict with what we receive in the DHCP from SL ?
Good Morning
I am trying this 2 ways (I use 2 Starlink Maritime interfaces)
Name Interface Protocol Priority Gateway Monitor IP RTT RTTd Loss Status Description
StarlinkBackup_GWv4 (active) StarlinkBackup IPv4 199 192.168.192.1 100.64.0.1 40.0 ms 9.0 ms 0.0 % Online
Starmain_VLAN_GWv4 4G_VLAN IPv4 201 192.168.191.1 1.1.1.1 0.0 ms 0.0 ms 100.0 % Offline StrMain
For one of them I use the remote gateway IP (100.64.0.1) as the monitor, and this works better than 1.1.1.1, which is down every morning.
Quote from: xaxero on May 05, 2023, 07:25:18 AM
For one of them I use the remote gateway IP (100.64.0.1) as the monitor, and this works better than 1.1.1.1, which is down every morning.
Ha, interesting. So you may end up in the same situation as me, since 1.1.1.1 is pushed by SL as a DNS server in their DHCP reply. If you hit the same bug I'm trying to figure out, you may end up having the gateway wrongly flagged as down.
SL pushes two DNS servers via DHCP (1.1.1.1 and 8.8.8.8), and I think this could create issues if you monitor those IPs for the gateways. So try using something else, 8.8.4.4 for instance, which should not be affected by a DHCP client script adding or removing routes automatically on each renewal (in theory).
You could also try what I'm testing, which is enabling "Disable host route" in the gateway settings, so that gateway monitoring does not try to add a route and depend on it while the DHCP client script may want to remove it (since, I suppose, we do not use the SL DNS servers that are pushed to us, per General settings). I assume you don't let WAN DHCP-learned DNS servers override the DNS servers you've likely defined manually.
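To sanity-check a candidate monitor IP from the SL side before changing the gateway config, a source-bound ping works (ping -S forces the source address; 100.79.101.92 is my SL address and 8.8.4.4 is just an example target, substitute your own):
# loss-free replies with SL-like latency suggest the IP is fine to monitor from that WAN
ping -c 10 -S 100.79.101.92 8.8.4.4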
OK Done as you suggested on the one interface (main) Will see how it goes.
Name Interface Protocol Priority Gateway Monitor IP RTT RTTd Loss Status Description
StarlinkBackup_GWv4 (active) StarlinkBackup IPv4 199 192.168.192.1 100.64.0.1 46.8 ms 11.7 ms 0.0 % Online StarBK
StarMain_VLAN_GWv4 4G_VLAN IPv4 201 192.168.191.1 208.67.222.222 36.3 ms 7.8 ms 0.0 % Online StrMain
I'm having similar problems for some weeks now. I have 2 WAN gateways. The problem shows up mostly on the backup GW, but sometimes on the primary too. The gateway goes into Error status with 100% packet loss. I attached a screenshot that shows the states in Firewall / Diagnostics after the GW changed to Error status. This GW has 8.8.4.4 as its monitor IP. After I delete the entry with the state 0:0, everything goes back to normal for some hours.
Currently I have OPNsense 23.1.6-amd64; this issue appeared with version 23.1.5.
Also, this: https://github.com/opnsense/core/issues/6544 was released in 23.1.7_3 (I am running _1) not long ago.
_3 also contains a few other patches that could possibly impact our current issue, so it's worth upgrading/testing at the very least.
The upgrade didn't solve the issue for me.
To test, I unchecked "Disable host route" again on the SL gateway so that a route gets added (to replicate the issue we had). By the way, with "Disable host route" checked, I have not had the problem again so far.
So, back to testing. After unchecking "Disable host route", I unplugged and replugged the SL ethernet cable on my igb0.
Once the link came back up, 1.1.1.1 got added to the routing table (since this is the IP I monitor), which is expected. The state now looked like this. Notice that the gateway is 0.0.0.0; it should normally be the SL gateway (100.64.0.1) to make sure dpinger uses that interface to monitor (ICMP ping) 1.1.1.1. Right there I knew the issue would probably trigger later on (on DHCP renewal):
all icmp 100.79.101.92:47126 -> 1.1.1.1:47126 0:0
age 00:00:58, expires in 00:00:10, 58:57 pkts, 1624:1596 bytes, rule 90
id: 58da556400000000 creatorid: 7ac5a56d gateway: 0.0.0.0
origif: igb0
And the gateway was marked as UP. About 2-3 minutes later (SL DHCP renewal), the 1.1.1.1 route disappeared from the routing table and the gateway is now marked as DOWN
The state is still this
all icmp 100.79.101.92:47126 -> 1.1.1.1:47126 0:0
age 00:12:03, expires in 00:00:10, 715:148 pkts, 20020:4144 bytes, rule 90
id: 58da556400000000 creatorid: 7ac5a56d gateway: 0.0.0.0
origif: igb0
And it is not recovering, and you can see it is linked to this dpinger process:
root 47126 0.0 0.0 17728 2624 - Is 14:51 0:00.09 /usr/local/bin/dpinger -f -S -r 0 -i STARLINK_DHCP -B 100.79.101.92 -p /var/run/dpinger_STARLINK_DHCP.pid -u /var/run/dpinger_STARLINK_DHCP.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s -A 1s -D 500 -L 75 -d 0 1.1.1.1
That dpinger is not working and is flagging the gateway as down, even though the gateway is actually UP, because the packets are going out the wrong interface. They should be going out igb0 (SL), but they are going out my other (default) WAN, pppoe0, so they fail (100.79.101.92 is my current SL IP):
tcpdump -i pppoe0 icmp and host 1.1.1.1 -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on pppoe0, link-type NULL (BSD loopback), capture size 262144 bytes
15:05:28.529901 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 47126, seq 843, length 8
15:05:29.545644 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 47126, seq 844, length 8
15:05:30.553962 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 47126, seq 845, length 8
This happens because the gateway in the state is set to 0.0.0.0, which is wrong; it should be 100.64.0.1.
If I test manually it works, and the latency is clearly SL's, as it would be 2-3 ms over my pppoe0 link:
ping -S 100.79.101.92 1.1.1.1
PING 1.1.1.1 (1.1.1.1) from 100.79.101.92: 56 data bytes
64 bytes from 1.1.1.1: icmp_seq=0 ttl=58 time=56.306 ms
64 bytes from 1.1.1.1: icmp_seq=1 ttl=58 time=64.790 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=58 time=47.758 ms
The state is also unable to expire and be relearned, since dpinger pings every second and that keeps it alive.
Killing or restarting dpinger releases the state and fixes the issue.
I'll kill the state to test it:
root@xxxxx:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
all icmp 100.79.101.92:47126 -> 1.1.1.1:47126 0:0
age 00:21:06, expires in 00:00:09, 1249:148 pkts, 34972:4144 bytes, rule 90
id: 58da556400000000 creatorid: 7ac5a56d gateway: 0.0.0.0
origif: igb0
root@xxxxx:~ # pfctl -k id -k 58da556400000000
killed 1 states
root@xxxxx:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
all icmp 100.79.101.92:47126 -> 1.1.1.1:47126 0:0
age 00:00:03, expires in 00:00:10, 4:4 pkts, 112:112 bytes, rule 100
id: f1c1556400000002 creatorid: 7ac5a56d gateway: 100.64.0.1
origif: pppoe0
And now the gateway is back UP.
I'll re-check "disable host route" in the SL gateway since this seems to help as it does seem to prevent that the gateway in the state be 0.0.0.0 since there is never a 1.1.1.1 route if I do this. It's a workaround but it seems to work for now. Probably using something else than 1.1.1.1 would also work since DHCP renewal would not play with the route as it seems to be doing after 3 mins (SL DHCP renew time).
Actually, in 23.1.7_3 the problem seems worse... Unplugging and replugging the SL ethernet cable on igb0 seems to trigger the problem every single time.
I did it 3 times and the state looks like this each time now... :-\
all icmp 100.79.101.92:7232 -> 1.1.1.1:7232 0:0
age 00:00:36, expires in 00:00:10, 36:0 pkts, 1008:0 bytes, rule 90
id: c4e7556400000000 creatorid: 7ac5a56d gateway: 0.0.0.0
origif: pppoe0
Again, restarting dpinger or killing the state brings the gateway status back to UP and the state back to what it should be:
all icmp 100.79.101.92:9493 -> 1.1.1.1:9493 0:0
age 00:00:16, expires in 00:00:09, 16:16 pkts, 448:448 bytes, rule 100
id: 15eb556400000000 creatorid: 7ac5a56d gateway: 100.64.0.1
origif: pppoe0
Also notice the rule goes from 90 to 100. 100 is usually what I see when it works; I believe it's the default rule that allows traffic from OPNsense itself to anywhere, and 90 is the rule associated with DHCP.
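To double-check which rules those numbers correspond to, the verbose ruleset listing prefixes each rule with what I believe is the same index:
# list rules 90 and 100 (plus their counters on the following line)
pfctl -vv -sr | grep -A 1 -e "^@90 " -e "^@100 "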
With 23.1.7_3, the SL gateway always ends up being flagged as down if I use 1.1.1.1, whether I use "Disable Host Route" or not. I tried multiple things to keep it up, but after some time it ends up failing because the gateway in the state ends up sending the packets to the pppoe0 WAN instead of SL.
So I'm dropping the idea of using 1.1.1.1 altogether for now, as it seems really problematic, likely because of the DHCP renewal on SL that pushes 1.1.1.1 as a DNS server, maybe? Anyway, I'll test with 9.9.9.9 instead and see how it goes.
Did using an IP other than 1.1.1.1 fix it for you @xaxero? Also, have you upgraded to 23.1.7_3 yet?
Good Morning
Changing to OpenDNS has resulted in a big improvement: 48 hours with no issues. However, SL has been very stable. For the second unit I simply use the SL gateway address.
Note: As I am using the dual antenna setup, I have put a second router at the front end simply to NAT the traffic, so that I have a unique gateway for each antenna, tagging the packets onto separate VLANs down to our main router several decks below. 2 WANs with the same gateway was problematic if we had to do a full system power cycle.
With the front end router I disable gateway monitoring and do all the dpinger work on the main router. Also, "Disable host route" may have helped as well.
Another slimy hack is to force all passenger traffic through the 4G-Starlink-Primary interface via the firewall, so it bypasses dpinger completely. The more critical ship traffic goes through the gateway failover, and the worst case scenario is that we are stuck on the VSAT until I can restart dpinger.
I have attached the gateway configuration of the front end and the core routers. So far it has been working well.
You can use the following to inspect host route behaviour now:
# pluginctl -r host_routes
An overlap between facilities IS possible and the last match wins which may break DNS or monitoring facility... That's why disable host route was added to monitor settings in which case the DNS is still active and dpinger monitoring latches on to interface IP anyway so routing should be ok (if no PBR is used breaking that as well).
Cheers,
Franco
Quote from: franco on May 08, 2023, 12:01:59 PM
You can use the following to inspect host route behaviour now:
# pluginctl -r host_routes
An overlap between facilities IS possible and the last match wins which may break DNS or monitoring facility... That's why disable host route was added to monitor settings in which case the DNS is still active and dpinger monitoring latches on to interface IP anyway so routing should be ok (if no PBR is used breaking that as well).
Hello franco :)
Ok, so everything remained stable while I was using 9.9.9.9 (though I did not test for very long, maybe 12h). I've now configured 1.1.1.1 again on SL, saved the gateway, and then saved the interface as well to restart it.
For now I see this (everything normal and gateway is marked UP)
root@xxxxx:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
No ALTQ support in kernel
ALTQ related functions disabled
all icmp 100.79.101.92:47540 -> 1.1.1.1:47540 0:0
age 00:03:49, expires in 00:00:10, 225:225 pkts, 6300:6300 bytes, rule 100
id: a7325d6400000000 creatorid: 7ac5a56d gateway: 100.64.0.1
origif: igb0
root@xxxxx:~ # pluginctl -r host_routes
{
"core": {
"8.8.8.8": null,
"8.8.4.4": null
},
"dpinger": {
"8.8.4.4": "10.50.45.70",
"1.1.1.1": "100.64.0.1",
"2001:4860:4860::8844": "fe80::200:xxxx:xxxx:xxx%igb0",
"149.112.112.112": "192.168.2.1",
"2620:fe::9": "2001:470:xx:4x:x"
}
}
10.50.45.70 is my default gateway that uses pppoe0 interface
100.64.0.1 is SL and is used as backup gateway on igb0
root@xxxxx:~ # netstat -rn | head
Routing tables
Internet:
Destination Gateway Flags Netif Expire
default 10.50.45.70 UGS pppoe0
1.1.1.1 100.64.0.1 UGHS igb0
8.8.4.4 10.50.45.70 UGHS pppoe0
10.2.0.0/16 192.168.2.1 UGS em0
10.50.45.70 link#16 UHS pppoe0
34.120.255.244 link#4 UHS igb0
After 2-3 mins, the routing table loses 1.1.1.1 (SL's DHCP renewal, I guess), but so far everything remains functional:
root@xxxxx:~ # netstat -rn | head
Routing tables
Internet:
Destination Gateway Flags Netif Expire
default 10.50.45.70 UGS pppoe0
8.8.4.4 10.50.45.70 UGHS pppoe0
10.2.0.0/16 192.168.2.1 UGS em0
10.50.45.70 link#16 UHS pppoe0
34.120.255.244 link#4 UHS igb0
100.64.0.0/10 link#4 U igb0
Everything else remains the same and the gateway is, for now, marked UP. When I get back home I'll test the ethernet cable pull/plug, since that usually seems to trigger the issue, and I'll let you know what I get.
Hello RedVortex :)
Hmm, how about this one?
# grep -nr "1\.1\.1\.1" /var/db/dhclient.leases.*
If SL is pushing routes it will scrub them on a renew perhaps.
Cheers,
Franco
Hi,
FWIW, I see this too with multi-WAN gateway monitoring.
The monitor IP / dpinger is not reliable in simulated fail and failback scenarios.
I can only "fix" it by restarting the Gateway service :(
Keep in mind that some DNS servers have been known to rate-limit or block ping requests, so it looks bad but it's not. From the OPNsense perspective the alarm has to be raised, even though it may be unnecessary and disruptive.
Cheers,
Franco
So I've been testing Multi-Wan gateway failover for quite a few hours now.
It does not work with the Trigger Level = "Packet Loss" option on 23.latest, or even going back to 22.7.latest.
Scenario: primary gateway with Trigger Level = "Packet Loss" set, then block downstream ping. This does NOT cause the gateway to be marked as down, nor the default route to be flipped to the secondary. I have to manually restart the Gateway service (then it notices).
Failback works OK.
It works OK if Trigger Level = "Member Down"; however, this is a less likely real-world scenario, where the ISP link is up but internet service is interrupted.
See https://github.com/opnsense/core/issues/6231 -- packetloss and delay triggers have been broken inherently with the switch from apinger to dpinger. The latter never supported the lower thresholds. I'm trying to avoid dealing with dpinger for alarm decisions in 23.7 to bring back the desired behaviour and dpinger then is left to only monitor.
Cheers,
Franco
Quote from: franco on May 11, 2023, 09:26:23 AM
See https://github.com/opnsense/core/issues/6231 -- packetloss and delay triggers have been broken inherently with the switch from apinger to dpinger. The latter never supported the lower thresholds. I'm trying to avoid dealing with dpinger for alarm decisions in 23.7 to bring back the desired behaviour and dpinger then is left to only monitor.
Cheers,
Franco
thanks Franco. Read through the issues thread. Appreciate the detail there.
What timeframe are you thinking for the fix ?
It might take 1 more month for the final code to hit development, but as I said the plan is to have it in production for 23.7 in July (not sooner due to considerable changes).
Cheers,
Franco
I am collating the data from this post and others. This applies to Starlink only but may be useful elsewhere. I applied the following fixes from everyone's suggestions and the gateways are stable. We are having frequent outages as we are in laser link territory, but the link is stable overall.
1/. WAN definition: reject leases from 192.168.100.1 (note: the gateways are on a separate router in my case)
2/. Gateway: Disable host route.
3/. Monitor IP that is not 1.1.1.1 (in my case OpenDNS), and bind each interface to a DNS server via Settings: General.
Interfaces have been going up and down for the last 24 hours, the gateways are (so far) behaving, and the routes are changing dynamically.
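A quick way to confirm point 1/. is taking effect is to check which server handed out the lease in the dhclient lease file (substitute your own WAN interface name; igb0 is just the one used in the examples earlier in this thread):
# the server identifier should be your upstream router, not 192.168.100.1
grep -n "dhcp-server-identifier" /var/db/dhclient.leases.igb0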
Last thought - perhaps we could include httping as an option in the future alongside dpinger; HTTP traffic gets much higher priority than ICMP.
That leaves only the question of who will write and integrate a new solution for a problem someone thought was solved a decade ago. ;)
Cheers,
Franco
The problem occurred again today after an Ethernet flap on the SL side (likely a firmware update on their end).
root@xxxxx:~ # pluginctl -r host_routes
{
"core": {
"8.8.8.8": null,
"8.8.4.4": null
},
"dpinger": {
"8.8.4.4": "10.50.45.70",
"1.1.1.1": "100.64.0.1",
"2001:4860:4860::8844": "fe80::200:xxxx:xxxx:xxx%igb0",
"149.112.112.112": "192.168.2.1",
"2620:fe::9": "2001:470:xx:4x:x"
}
}
root@xxxxx:~ # netstat -rn | head
Routing tables
Internet:
Destination Gateway Flags Netif Expire
default 10.50.45.70 UGS pppoe0
8.8.4.4 10.50.45.70 UGHS pppoe0
10.2.0.0/16 192.168.2.1 UGS em0
10.50.45.70 link#16 UHS pppoe0
34.120.255.244 link#4 UHS igb0
100.64.0.0/10 link#4 U igb0
root@xxxxx:~ # grep -nr "1\.1\.1\.1" /var/db/dhclient.leases.*
/var/db/dhclient.leases.igb0:7: option domain-name-servers 1.1.1.1,8.8.8.8;
/var/db/dhclient.leases.igb0:24: option domain-name-servers 1.1.1.1,8.8.8.8;
/var/db/dhclient.leases.igb0:41: option domain-name-servers 1.1.1.1,8.8.8.8;
root@xxxxx:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
No ALTQ support in kernel
ALTQ related functions disabled
all icmp 100.79.101.92:23279 -> 1.1.1.1:23279 0:0
age 00:05:24, expires in 00:00:10, 319:148 pkts, 8932:4144 bytes, rule 90
id: 36e4776400000003 creatorid: 7ac5a56d gateway: 0.0.0.0
origif: igb0
root@xxxxx:~ # tcpdump -i pppoe0 icmp and host 1.1.1.1 -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on pppoe0, link-type NULL (BSD loopback), capture size 262144 bytes
22:21:08.078198 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 23279, seq 385, length 8
22:21:09.141706 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 23279, seq 386, length 8
22:21:10.205204 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 23279, seq 387, length 8
^C
3 packets captured
679 packets received by filter
0 packets dropped by kernel
After the Ethernet flap, the 1.1.1.1 route was present (likely added because of the DNS received from the SL DHCP), but this route got removed after 2-3 minutes (on SL's DHCP renewal, I think). At that point, since the dpinger state continued using 0.0.0.0 (but not the dpinger command line itself), the gateway went down: packets were now being routed to pppoe (my main provider) instead of igb0 (SL), which cannot work since dpinger is then using the SL source IP on my other provider and the packets are likely being dropped.
What I expect to happen: the state should use the SL gateway, not 0.0.0.0, whatever the routes are.
Ok, let's do this then: https://github.com/opnsense/core/commit/c12e77519f164
However, in multi-WAN you really need to set a gateway for each global DNS server being used:
"core": {
"8.8.8.8": null,
"8.8.4.4": null
},
8.8.8.8 would naturally need the SL gateway and 8.8.4.4 the other WAN gateway as per:
https://docs.opnsense.org/manual/how-tos/multiwan.html#step-3-configure-dns-for-each-gateway
Perhaps even adding 1.1.1.1 as a global DNS server tied to the SL gateway would fix the current situation as well (a DNS server and its route are always enforced, unlike gateway monitoring). And from the docs you can see that coupling these facilities through the same server on the same link makes sense.
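With those gateways assigned under System > Settings > General, re-running the command from earlier in the thread should presumably show them on the core entries instead of null:
# after binding a gateway to each global DNS server, 8.8.8.8 should list the SL
# gateway and 8.8.4.4 the other WAN gateway here (instead of null) - presumably
pluginctl -r host_routes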
Cheers,
Franco
I applied the patch, so far so good, I'll do more testing this week and let you know how it goes.
Is it just me, or do I have a feeling of déjà vu? I think we troubleshot something along those lines a few months ago before letting it go, after deciding there was a lot of cleanup needed around those scripts :-) Hopefully this time we'll nail it, lol
We did, but some progress was made on the code so it's good to revisit (and debugging was kinda easy this time).
For the route drop of the nameserver it's probably better to aim for symmetry or at least not undo routes that haven't even been added (by DNS itself). I've added the proposed change to upcoming 23.1.8 and will circle back at some point. Already have an idea on how to pull this off.
Cheers,
Franco
I also think the problem is somewhat complicated by the fact that we use 1.1.1.1 for two things. We'll need to decide which one wins.
On SL, they push us 1.1.1.1 as DNS. Even though I do not use their DNS (I do not allow WAN-pushed DNS to override mine), it seems to play with the routing table. On top of that, I also use 1.1.1.1 for gateway monitoring, where you can select whether you want dpinger to add a route for the monitor or not. And on top of that, someone might add 1.1.1.1 to their DNS configuration and (may or may not; I know I don't) select a gateway for it, which I think may also add routes...
So I think we may need to decide at some point what takes priority (likely based on which functionality absolutely needs its route, or something like that), or an order of priority of what does what.
I mean, any provider could decide to start pushing a route, a DNS server, or something else that we are already using as a monitoring IP, and we may (or may not) have selected to add a route for the monitoring; who would win? :-)
Actually, as per the docs, the suggestion is to make each IP exclusive to the attached uplink and that's it. Otherwise you start validating in a circle, and some of this, like DNS servers learned via DHCP(v6), is runtime information, which further complicates the issue.
The individual areas can already validate against double use, but throwing the host route into the routing table is sort of a black box. We only know that a route was there, but not why. Is it ours? Is it someone else's? Who knows.
Cheers,
Franco
Testing the patch went well and I upgraded to 23.1.8 last night and so far so good.
As long as the route remains there, it should work; we'll see over the next few days. SL is on igb0, its GW is 100.64.0.1 and it monitors IP 1.1.1.1; my main provider is on pppoe0, its GW is 10.50.45.70 and it monitors IP 8.8.4.4.
DNS servers (not bound to a gateway) are 8.8.8.8 and 8.8.4.4
The state shows the default gateway 0.0.0.0 (i.e. a plain routing-table lookup) being used to reach 1.1.1.1 rather than 100.64.0.1, but I see the packets flowing through igb0 (SL), not pppoe0, which is what we want. So we're good, as long as the 1.1.1.1 route remains there.
root@xxxxx:~ # pluginctl -r host_routes
{
"core": {
"8.8.8.8": null,
"8.8.4.4": null
},
"dpinger": {
"8.8.4.4": "10.50.45.70",
"1.1.1.1": "100.64.0.1",
"2001:4860:4860::8844": "fe80::200:xxxx:xxxx:xxx%igb0",
"149.112.112.112": "192.168.2.1",
"2620:fe::9": "2001:470:xxx:x::x"
}
}
root@xxxxx:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
No ALTQ support in kernel
ALTQ related functions disabled
all icmp 100.79.101.92:29707 -> 1.1.1.1:29707 0:0
age 12:19:33, expires in 00:00:09, 43592:43538 pkts, 1220576:1219064 bytes, rule 90
id: 3b7e706400000001 creatorid: c307077d gateway: 0.0.0.0
origif: igb0
root@xxxxx:~ # netstat -rn | head
Routing tables
Internet:
Destination Gateway Flags Netif Expire
default 10.50.45.70 UGS pppoe0
1.1.1.1 100.64.0.1 UGHS igb0
8.8.4.4 10.50.45.70 UGHS pppoe0
10.2.0.0/16 192.168.2.1 UGS em0
10.50.45.70 link#16 UHS pppoe0
34.120.255.244 link#4 UHS igb0
Same problem on 23.7.8.
I have 4 Gateways (FTTH, FTTC, FWA, SAT) and Starlink is the only one presenting this problem.
Quote from: Jetro on November 14, 2023, 11:01:43 PM
Same problem on 23.7.8.
I have 4 Gateways (FTTH, FTTC, FWA, SAT) and Starlink is the only one presenting this problem.
I'm curious... I also run Starlink and have not had this issue on this version yet. I know it sometimes takes time, and some instability on the SL gateway, for the issue to show up. But the bug has not happened anymore since the 23.1.8 fixes.
Can you check the output of
pluginctl -r host_routes
Quote from: Jetro on November 14, 2023, 11:01:43 PM
Same problem on 23.7.8.
I have 4 Gateways (FTTH, FTTC, FWA, SAT) and Starlink is the only one presenting this problem.
For me the problem had not been happening since the last 23.1.x patches on Starlink, but it started to appear again in 24.1-rc1 and is still ongoing on 24.1 final.
Here's the link to the issue in the 24.1 forum if you feel like troubleshooting it with us: https://forum.opnsense.org/index.php?topic=38603.0