Multi-WAN failover not working the second time

Started by qdrop, February 20, 2020, 02:59:35 PM

Hello everyone.

I configured my multi-WAN setup according to the official documentation: https://docs.opnsense.org/manual/how-tos/multiwan.html

I experience two very strange things though:

- By default, OPNsense selects the gateway that was added second as the active one (for its own routing table). Why is this? Routing traffic from LAN over the failover gateway group works properly, of course (respecting the tiers).

- Testing the setup results in a proper failover, and failback also works as expected. The logs show how the packet filter gets reloaded:

2020-02-20T14:53:18   configd.py: message 9ba3d1ec-b554-4b18-ad15-3cc4970e501e [filter.refresh_aliases] returned {"status": "ok"}
2020-02-20T14:53:17   configd.py: [2ee33329-0e77-40e3-8f36-e0b6cf03c3eb] updating dyndns FIB7_DHCP
2020-02-20T14:53:17   configd.py: [9ba3d1ec-b554-4b18-ad15-3cc4970e501e] refresh url table aliases
2020-02-20T14:53:17   configd.py: OPNsense/Filter generated //usr/local/etc/filter_geoip.conf
2020-02-20T14:53:17   configd.py: OPNsense/Filter generated //usr/local/etc/filter_tables.conf
2020-02-20T14:53:17   configd.py: generate template container OPNsense/Filter
2020-02-20T14:53:17   configd.py: [786980b2-b4a3-4210-9097-06c608048e04] generate template OPNsense/Filter
2020-02-20T14:53:17   configd.py: [e1f6a017-f2f6-467a-ab46-c91aefa729e9] Reloading filter
2020-02-20T14:53:00   configd.py: [93dc6a38-0e99-4e39-aad0-4e838581053e] Linkup stopping igb0

Even failback works as expected:

2020-02-20T14:54:26   configd.py: message d0b2f6c7-c23f-42b3-b01c-23ccc81afb47 [filter.refresh_aliases] returned {"status": "ok"}
2020-02-20T14:54:26   configd.py: [5750e7b4-ccfb-4eff-924f-45de2925d7bf] updating dyndns opt2
2020-02-20T14:54:26   configd.py: OPNsense/Unbound/* generated //var/unbound/root.hints
2020-02-20T14:54:26   configd.py: generate template container OPNsense/Unbound/core
2020-02-20T14:54:25   configd.py: [607d0fc8-4eb1-4c86-a1d2-53798b1ded0d] generate template OPNsense/Unbound/*
2020-02-20T14:54:25   configd.py: [d0b2f6c7-c23f-42b3-b01c-23ccc81afb47] refresh url table aliases
2020-02-20T14:54:25   configd.py: OPNsense/Filter generated //usr/local/etc/filter_geoip.conf
2020-02-20T14:54:25   configd.py: OPNsense/Filter generated //usr/local/etc/filter_tables.conf
2020-02-20T14:54:24   configd.py: generate template container OPNsense/Filter
2020-02-20T14:54:24   configd.py: [7a2318c9-4afc-42ba-a794-087cceeebf87] generate template OPNsense/Filter
2020-02-20T14:54:23   configd.py: [28df1e97-6eed-4379-a7d1-8df6e536f28a] Linkup starting igb0

But if I try to fail over a second time, the filter isn't reloaded and the connection is down for all clients, even though the secondary gateway is shown as online:

2020-02-20T14:56:36   configd.py: [6e099b0c-d769-4a25-bab1-812534b536fd] updating dyndns FIB7_DHCP
2020-02-20T14:56:20   configd.py: [baa832c1-b301-4ce4-a52f-9dbdde97be2e] Linkup stopping igb0

If I now reload the packet filter manually, the connection is established instantly:

2020-02-20T14:57:39   configd.py: message d75dd61a-11f2-4841-8716-6f2ff6a9065e [filter.refresh_aliases] returned {"status": "ok"}
2020-02-20T14:57:38   configd.py: [d75dd61a-11f2-4841-8716-6f2ff6a9065e] refresh url table aliases
2020-02-20T14:57:38   configd.py: OPNsense/Filter generated //usr/local/etc/filter_geoip.conf
2020-02-20T14:57:38   configd.py: OPNsense/Filter generated //usr/local/etc/filter_tables.conf
2020-02-20T14:57:38   configd.py: generate template container OPNsense/Filter
2020-02-20T14:57:38   configd.py: [08d3b561-be74-4bff-acd8-1ed9e0a95445] generate template OPNsense/Filter
2020-02-20T14:57:37   configd.py: [2debe712-879d-4475-96a2-7aefa3c74659] Reloading filter
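The pattern in the excerpts above (a "Linkup" event that is not followed by a filter template run) can be checked mechanically. Here is a minimal Python sketch, assuming log lines in exactly the format pasted above. The function names, the two marker strings and the 30-second window are my own choices for illustration, not anything OPNsense provides:

```python
from datetime import datetime, timedelta

# Assumed markers, taken from the log excerpts in this thread.
LINK_MARKER = "Linkup "                               # "Linkup stopping" / "Linkup starting"
FILTER_MARKER = "generate template OPNsense/Filter"   # seen on every successful reload


def parse(line):
    """Split a log line into (timestamp, message); return None if it does not match."""
    parts = line.strip().split(None, 1)
    if len(parts) != 2:
        return None
    try:
        ts = datetime.strptime(parts[0], "%Y-%m-%dT%H:%M:%S")
    except ValueError:
        return None
    return ts, parts[1]


def linkups_without_reload(lines, window=timedelta(seconds=30)):
    """Return timestamps of link events with no filter template run inside the window."""
    # The forum excerpts are newest-first, so sort into chronological order.
    events = sorted(e for e in (parse(l) for l in lines) if e is not None)
    missing = []
    for ts, msg in events:
        if LINK_MARKER not in msg:
            continue
        reloaded = any(
            FILTER_MARKER in other and ts <= t <= ts + window
            for t, other in events
        )
        if not reloaded:
            missing.append(ts)
    return missing


SAMPLE = [
    "2020-02-20T14:57:38   configd.py: [08d3b561-be74-4bff-acd8-1ed9e0a95445] generate template OPNsense/Filter",
    "2020-02-20T14:56:20   configd.py: [baa832c1-b301-4ce4-a52f-9dbdde97be2e] Linkup stopping igb0",
    "2020-02-20T14:54:24   configd.py: [7a2318c9-4afc-42ba-a794-087cceeebf87] generate template OPNsense/Filter",
    "2020-02-20T14:54:23   configd.py: [28df1e97-6eed-4379-a7d1-8df6e536f28a] Linkup starting igb0",
    "2020-02-20T14:53:17   configd.py: [786980b2-b4a3-4210-9097-06c608048e04] generate template OPNsense/Filter",
    "2020-02-20T14:53:00   configd.py: [93dc6a38-0e99-4e39-aad0-4e838581053e] Linkup stopping igb0",
]

for ts in linkups_without_reload(SAMPLE):
    print(ts.isoformat())  # prints 2020-02-20T14:56:20
```

With the lines from this thread, only the second "Linkup stopping" at 14:56:20 is flagged; the manual reload 78 seconds later falls outside the window.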

How can I overcome this issue? We're using OPNsense-compatible hardware from Thomas Krenn. I'm testing by unplugging the LAN cable from the server.

Furthermore, we enabled state-killing and disabled sticky sessions in the preferences of the packet filter.

Any help is highly appreciated.

Damn, I didn't expect OPNsense to be this buggy.

I installed OPNsense from scratch and configured only multi-WAN, exactly according to https://docs.opnsense.org/manual/how-tos/multiwan.html. It is not switching traffic; I have to reload the packet filter manually. This is highly disturbing and disappointing.
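For anyone who wants to script the manual reload as a stopgap, here is a rough Python sketch. It assumes the `configctl filter reload` action is available on the box (it appears to be what produces the "Reloading filter" line in the logs above); the function name and the injectable `run` parameter are only there for illustration and testing, and this does nothing to fix the underlying problem:

```python
import subprocess

# Assumption: configd exposes the packet filter reload as "configctl filter reload".
RELOAD_CMD = ["configctl", "filter", "reload"]


def reload_filter(run=subprocess.run):
    """Ask configd to reload the packet filter; return True if the command exits with 0."""
    result = run(RELOAD_CMD, capture_output=True, text=True, timeout=60)
    return result.returncode == 0
```

Calling `reload_filter()` as root after pulling the cable should have the same effect as reloading the filter by hand.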

Does anyone have an idea what could be going wrong in my setup? What logs or configuration would you need to help me narrow this issue down and possibly find a workaround? I'm using the most recent version of OPNsense.

Furthermore:

I'm removing the physical cable to trigger that behavior. When I just disable the gateways in the settings, everything works as expected.

But of course the link can die entirely, so this obviously has to work.

It is indeed extremely buggy in some areas. I cannot get policy routing through a VPN gateway working; something as simple as that is broken. I came back here after some years with pfSense, hoping to get WireGuard going, only to find that even simple stuff isn't working.

I've had multi-WAN (failover) working with OPNsense for a couple of years now and have found it perfectly reliable with all the versions I've used, from 17.7 to 20.1. I used the information from this link to configure a working setup. I suggest you try it, or at least compare it with your setup to look for any differences:
https://www.thomas-krenn.com/de/wiki/OPNsense_Multi_WAN



Engineers from Thomas Krenn just opened an issue: https://github.com/opnsense/core/issues/3961

It seems to be reproducible. Let's see if someone manages to debug this and offer a fix or workaround.

If it helps as a workaround: I don't use DHCP to assign IP addresses to my WAN interfaces; I use static IP addresses.