Gateway issue

Started by bigops, August 31, 2020, 10:12:27 PM

August 31, 2020, 10:12:27 PM Last Edit: August 31, 2020, 10:42:48 PM by bigops
Hi
Recently I have noticed some strange behavior on OPNsense. My configuration has two internet links, with the first link given a higher priority than the secondary link. Traffic fails over to the secondary link if there is an issue with the primary link. What I have noticed recently is that once OPNsense switches to the secondary link, it never fails back to the primary link, even though the primary link has been restored and is shown as online in the GUI. What is more intriguing is that the route table lists the primary link as active, yet all traffic still takes the other link. Changing anything in the gateway configuration or rebooting OPNsense makes it switch to the correct gateway. This is new behavior that I have only noticed recently.


Sticky sessions were the culprit. Thanks.

Removing the sticky connections seems to have solved the issue. But aren't sticky connections there for a reason? Probably there is an issue where the sticky connections timer does not expire, because when this option is on I do not see a failback when the primary connection is restored.
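To make the timer hypothesis concrete, here is a toy Python model of how sticky (source tracking) entries are described in pf.conf(5): an entry is kept while any state from that source is open, and for a `src.track` timeout after the last one closes. This is only a sketch under that assumption, not OPNsense or pf code; `StickyTable`, the gateway names and the 30 second timeout are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StickyTable:
    """Toy model of source tracking: one pinned gateway per client IP."""
    src_track_timeout: float = 0.0  # seconds an entry survives after its last state closes
    entries: dict = field(default_factory=dict)  # src_ip -> [gateway, open_states, idle_since]

    def _expired(self, entry, now):
        _, open_states, idle_since = entry
        return (open_states == 0 and idle_since is not None
                and now - idle_since >= self.src_track_timeout)

    def open_state(self, src_ip, preferred_gw, now):
        """Return the gateway a new connection from src_ip would use."""
        entry = self.entries.get(src_ip)
        if entry is not None and self._expired(entry, now):
            del self.entries[src_ip]
            entry = None
        if entry is None:
            # No pinning yet: use whatever the gateway group prefers right now.
            entry = [preferred_gw, 0, None]
            self.entries[src_ip] = entry
        entry[1] += 1
        entry[2] = None
        return entry[0]

    def close_state(self, src_ip, now):
        entry = self.entries[src_ip]
        entry[1] -= 1
        if entry[1] == 0:
            entry[2] = now  # the timeout starts counting from here


table = StickyTable(src_track_timeout=30)
during_failover = table.open_state("10.0.0.5", "WAN2", now=0)   # primary is down
after_restore = table.open_state("10.0.0.5", "WAN1", now=10)    # primary is back, state still open
table.close_state("10.0.0.5", now=20)
table.close_state("10.0.0.5", now=20)
shortly_after = table.open_state("10.0.0.5", "WAN1", now=25)    # idle for only 5 s
table.close_state("10.0.0.5", now=26)
much_later = table.open_state("10.0.0.5", "WAN1", now=60)       # idle for 34 s
print(during_failover, after_restore, shortly_after, much_later)
```

In this model a client that never goes fully quiet for longer than the timeout stays pinned to the secondary link, which would look like a failback that never happens. Whether that matches what OPNsense actually does here is exactly the open question.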

After trying out the various options and observing for more than a week, I am fairly certain that failback is not working as expected. When OPNsense fails over to the lower-tier gateway, the failback does not happen even when the primary connection becomes active again without any errors. Rebooting the device always seems to correct the issue. Physically removing the primary connection also seems to trigger the failback when the connection comes back up. The issue seems to occur when the failover is caused by latency or packet loss.
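For reference, the behavior expected from a tiered gateway group can be written down as a small Python sketch: always use the lowest-tier gateway that is currently healthy. This only illustrates the expectation and is not how OPNsense implements it; the `Gateway` type, the function names and the loss and latency thresholds are hypothetical values chosen for the example.

```python
from typing import NamedTuple, Optional


class Gateway(NamedTuple):
    name: str
    tier: int            # lower number = higher priority
    loss_pct: float      # measured packet loss
    latency_ms: float    # measured round-trip time
    link_up: bool = True


def is_healthy(gw: Gateway, max_loss_pct: float = 20.0,
               max_latency_ms: float = 500.0) -> bool:
    # Thresholds are made-up example values, not OPNsense defaults.
    return gw.link_up and gw.loss_pct < max_loss_pct and gw.latency_ms < max_latency_ms


def select_gateway(gateways: list) -> Optional[str]:
    """Pick the healthy gateway with the lowest tier, or None if all are down."""
    healthy = [gw for gw in gateways if is_healthy(gw)]
    return min(healthy, key=lambda gw: gw.tier).name if healthy else None


# Primary degraded by packet loss: traffic should move to the Tier 2 link.
degraded = [Gateway("WAN1", 1, 35.0, 80.0), Gateway("WAN2", 2, 0.0, 40.0)]
# Primary healthy again: traffic should fail back to the Tier 1 link.
restored = [Gateway("WAN1", 1, 0.0, 20.0), Gateway("WAN2", 2, 0.0, 40.0)]

picked_when_degraded = select_gateway(degraded)
picked_when_restored = select_gateway(restored)
print(picked_when_degraded, picked_when_restored)
```

The second case is the one that fails in practice: the primary passes every health check again, yet traffic stays on the secondary link.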

Looking at the gateway configuration and the route table, everything seems to be fine, but the traffic just does not seem to take that route.

I am attaching a few screenshots which show the issue: the tracert from the client takes a different path than the one from the OPNsense box itself.

This issue is causing a lot of headaches. Does anyone have any suggestions?

Thanks

B

Maybe it doesn't fail back because the session is still active? That would force traffic to keep using the second tier so that connections are not disrupted again.

It seems to work fine when Tier 2 is physically disconnected and a failure is simulated. Also, the behavior does not change even if all the sessions are closed (or the client is rebooted). One additional piece of information: in this setup the clients are behind a Layer 3 device, and only a routed link is available between the Layer 3 device and OPNsense.

Is anyone else seeing this problem? I keep running into it and nothing seems to resolve it.

Thanks