IPSec Tunnel with Dual WAN Failover GW_Group

Started by MGVaxx, May 16, 2024, 10:25:26 PM

Hello all,

Here's a scenario we are having difficulty with and would like some insight on.

We have a client site with two WAN connections from different providers for redundancy. They are configured as WAN1 and WAN2 in a failover group, Failover_GW_Group, with WAN1 set as Tier 1 and WAN2 as Tier 2, as per the official OPNsense docs. Failover works as expected: the default gateway switches from WAN1 to WAN2 on failure, and back to WAN1 when the connection is restored. We are using Default Gateway Switching.

All good so far.

However, the client also has an IPsec tunnel (legacy mode) to a cloud provider that we want to fail over when the WAN connection changes. When setting up Phase 1 for the tunnel, the Interface options are WAN1, WAN2, LAN and ANY. We can successfully establish the tunnel with either WAN1 or WAN2 selected, and it connects and passes traffic over either interface. However, it does not drop and re-establish itself when the WAN fails over. We thought ANY was the next obvious option, but with that setting the tunnel does not seem to connect at all.

We do not have control over the remote end of the tunnel, so suggestions such as building a second tunnel are not an option.

We compared the setup to a working one on pfSense and noted that its IPsec setup allows you to select the gateway group as the interface in the Phase 1 setup, whereas OPNsense does not.

We've already committed to using OPNsense for a variety of reasons and would prefer to stay with it.

Any thoughts or suggestions would be most appreciated. Otherwise, is there a way to submit this as a feature request?

If anyone has gotten a legacy IPSec tunnel to automatically switch WAN connections with a failover group configuration and has some tips, many thanks in advance. Not hoping to reinvent the wheel here, just wondering if there's something obvious we have overlooked.

Cheers,
Mike

I am also struggling with this. Can a developer please make Gateway Groups (for failover) selectable as the interface for IPsec tunnels? Thanks.

I must assume nobody has ever done this and got it working?

I just need an IPSec tunnel to tear down and re-establish itself when the default gateway changes as a result of WAN failover. The current setup doesn't seem to allow that, unless I am missing something obvious here?


The obvious answer that you are missing is that you can't do that.
The remote side has to have a "remote-peer" IP configured that it connects to. When that ISP goes down, the WAN interface with the "remote-peer" IP goes down, so any tunnels that connect to it go down as well.

The answer is to have a second tunnel configured to point to the "remote peer" IP of the second ISP's WAN interface.

That way when the first ISP / Interface goes down, and the tunnel along with it, the secondary tunnel will become the new route to your LAN subnet.

We don't have access to the remote site. It is a healthcare provider who laid out the specifics of how we are to connect to their datacenter.

The tunnel is configured using an FQDN as the remote identifier on their end, and they have whitelisted the two static IP addresses from the two ISPs we are using for WAN1 and WAN2. The tunnel can be established from either interface and works fine when manually set.

The problem is that it cannot follow the change of default gateway, because you must select one of the WAN interfaces in the Phase 1 configuration. On pfSense, there is the option to choose the gateway group as the outgoing interface, and it works as expected. OPNsense does not have that option.


What about using

VPN: IPsec: Connections

and putting the two IPs of your WAN interfaces as local IPs?
Hardware:
DEC740

December 02, 2024, 07:45:13 PM #6 Last Edit: December 02, 2024, 09:14:08 PM by ludarkstar99
I configured both remote gateway addresses in the Connections menu. However, when the SA is marked down by DPD detection, the charon daemon continues retransmitting to the same peer instead of attempting to connect to the secondary peer in the list. Eventually, it stops.

Even with the DPD action set to restart (manually added to the configuration) and MOBIKE disabled, the behavior remains the same.

Relevant log from ipsec.log:

=== IKE_SA & CHILD_SA ESTABLISHMENT
<30>1 2024-12-01T19:46:59-03:00 fw-filial.citrait.corp charon 24636 - [meta sequenceId="36"] 15[CFG] <545401c5-d36e-4cfc-ba6e-5118be9636c0|4> selected proposal: ESP:AES_CBC_128/HMAC_SHA2_256_128/NO_EXT_SEQ
<30>1 2024-12-01T19:46:59-03:00 fw-filial.citrait.corp charon 24636 - [meta sequenceId="37"] 15[IKE] <545401c5-d36e-4cfc-ba6e-5118be9636c0|4> CHILD_SA f0dc40a0-3628-464a-8631-1fa01d258ab2{4} established with SPIs c2c08ea1_i ca0b8e26_o and TS 192.168.20.0/24 === 192.168.110.0/24 192.168.120.0/24
<165>1 2024-12-01T19:46:59-03:00 fw-filial.citrait.corp charon 58459 - [meta sequenceId="38"] [UPDOWN] received up-client event for reqid 1
<165>1 2024-12-01T19:47:00-03:00 fw-filial.citrait.corp charon 59509 - [meta sequenceId="39"] [UPDOWN] received up-client event for reqid 1

=== AFTER ISP1 GOES DOWN COMES RETRANSMISSION
<30>1 2024-12-01T19:47:02-03:00 fw-filial.citrait.corp charon 24636 - [meta sequenceId="40"] 14[IKE] <545401c5-d36e-4cfc-ba6e-5118be9636c0|1> retransmit 1 of request with message ID 2
<30>1 2024-12-01T19:47:02-03:00 fw-filial.citrait.corp charon 24636 - [meta sequenceId="41"] 14[NET] <545401c5-d36e-4cfc-ba6e-5118be9636c0|1> sending packet: from 45.148.20.6[500] to 189.112.235.129[500] (80 bytes)
<30>1 2024-12-01T19:47:05-03:00 fw-filial.citrait.corp charon 24636 - [meta sequenceId="42"] 15[IKE] <545401c5-d36e-4cfc-ba6e-5118be9636c0|1> retransmit 2 of request with message ID 2
<30>1 2024-12-01T19:47:05-03:00 fw-filial.citrait.corp charon 24636 - [meta sequenceId="43"] 15[NET] <545401c5-d36e-4cfc-ba6e-5118be9636c0|1> sending packet: from 45.148.20.6[500] to 189.112.235.129[500] (80 bytes)

=== GIVING UP AFTER 2 RETRANSMITS (I've set retransmit_tries to 2)
<30>1 2024-12-01T19:47:08-03:00 fw-filial.citrait.corp charon 24636 - [meta sequenceId="44"] 15[IKE] <545401c5-d36e-4cfc-ba6e-5118be9636c0|1> giving up after 2 retransmits
<30>1 2024-12-01T19:47:08-03:00 fw-filial.citrait.corp charon 24636 - [meta sequenceId="45"] 15[IKE] <545401c5-d36e-4cfc-ba6e-5118be9636c0|1> proper IKE_SA delete failed, peer not responding

=== THEN RECONNECTED THE 1st ISP, CONNECTION ESTABLISHES AGAIN (as responder)
<30>1 2024-12-01T19:47:29-03:00 fw-filial.citrait.corp charon 24636 - [meta sequenceId="46"] 15[NET] <545401c5-d36e-4cfc-ba6e-5118be9636c0|4> received packet: from 189.112.235.129[500] to 45.148.20.6[500] (80 bytes)



I tested with both remote gateways, and the tunnel successfully connects to each. This confirms that the issue is not related to a specific remote peer.
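To make the retransmit-then-give-up pattern in the log above easier to spot across a long ipsec.log, here is a minimal Python sketch that counts retransmits per IKE_SA and flags the ones that gave up. The regular expression is only an assumption based on the charon lines quoted in this post; the function name and the shortened sample lines are made up for illustration, and other strongSwan versions may word these messages differently.

```python
import re

# Assumed line shape, taken from the excerpt in this post:
#   ... 14[IKE] <conn-uuid|1> retransmit 1 of request with message ID 2
#   ... 15[IKE] <conn-uuid|1> giving up after 2 retransmits
LINE_RE = re.compile(
    r"\[IKE\] <(?P<conn>[^|>]+)\|(?P<sa>\d+)> "
    r"(?P<event>retransmit \d+ of request|giving up after \d+ retransmits)"
)


def summarize_retransmits(lines):
    """Return {(connection, ike_sa_id): {"retransmits": n, "gave_up": bool}}."""
    summary = {}
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a retransmit or give-up line (e.g. [NET], [CFG])
        key = (match["conn"], int(match["sa"]))
        entry = summary.setdefault(key, {"retransmits": 0, "gave_up": False})
        if match["event"].startswith("retransmit"):
            entry["retransmits"] += 1
        else:
            entry["gave_up"] = True
    return summary


# Shortened copies of the log lines above (syslog prefix trimmed).
CONN = "545401c5-d36e-4cfc-ba6e-5118be9636c0"
sample = [
    f"14[IKE] <{CONN}|1> retransmit 1 of request with message ID 2",
    f"14[NET] <{CONN}|1> sending packet: from 45.148.20.6[500] to 189.112.235.129[500] (80 bytes)",
    f"15[IKE] <{CONN}|1> retransmit 2 of request with message ID 2",
    f"15[IKE] <{CONN}|1> giving up after 2 retransmits",
]
result = summarize_retransmits(sample)
```

In practice the function could be fed the lines of the real log file instead of the `sample` list; an entry with `gave_up` set and no later re-establishment is the stuck state described here.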

edit: improved clarity.
- nothing broken, nothing missing;

December 02, 2024, 08:59:17 PM #7 Last Edit: December 02, 2024, 09:15:28 PM by ludarkstar99
I have confirmed that even when I populate both local_addrs and remote_addrs fields with the respective addresses on both ends, the issue persists. Specifically, after the primary connection fails, the system enters retransmit mode, eventually reaches retransmit_tries, and freezes the connection without switching to the backup ISP.

December 02, 2024, 09:52:04 PM #8 Last Edit: December 02, 2024, 09:57:04 PM by Monviech (Cedrik)
Maybe you need default gateway switching for the OPNsense itself. It can be activated under:

System: Settings: General - at the bottom.

I would like to know if a change in the default route will still make it try to use the first IP.

That combined with DPD to force a restart of phase 1.

December 03, 2024, 04:13:25 AM #9 Last Edit: December 03, 2024, 04:17:25 AM by ludarkstar99
Hi Monviech (Cedrik),
the gateway switching trick works well if I have one ISP at HQ and two ISPs in the branch office.
When the branch office's ISP1 goes down, it reconnects to HQ using the backup ISP.
But if both sites have two ISPs and HQ's ISP1 goes down, there are a lot of retransmissions due to the DPD action. Eventually it throws away the half-open connection and keeps failing at retransmissions until it gives up.


Hi Cedrik,

That's my go-to approach for this type of setup. However, I'm exploring whether strongSwan can handle this natively for simpler management.

I keep coming back to the idea of writing a plugin that automatically switches between tunnels if the primary one goes down.
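As a rough illustration of what the decision step of such a plugin might look like, here is a minimal Python sketch. Everything in it is hypothetical: the function name, the connection names (`tun-wan1`, `tun-wan2`), the child names and the gateway-status dictionary are invented for the example, and how gateway status is read on OPNsense is left out entirely. It only builds the `swanctl --terminate --ike` and `swanctl --initiate --child` command lines and does not execute anything.

```python
def plan_tunnel_switch(active, gateways_up, tunnels):
    """Return the swanctl commands needed to move to the preferred live tunnel.

    active      -- name of the connection currently in use, or None
    gateways_up -- dict mapping a WAN name to True (up) or False (down)
    tunnels     -- list of (connection, child, wan) tuples, most preferred first
    """
    # Pick the first tunnel whose WAN gateway is reported as up.
    wanted = next((t for t in tunnels if gateways_up.get(t[2], False)), None)
    if wanted is None or wanted[0] == active:
        return []  # nothing usable, or already on the preferred tunnel

    commands = []
    if active is not None:
        # Tear down the stale IKE_SA so it does not sit in retransmit.
        commands.append(["swanctl", "--terminate", "--ike", active])
    commands.append(["swanctl", "--initiate", "--child", wanted[1]])
    return commands


# Hypothetical names for two pre-configured connections, one per WAN.
TUNNELS = [
    ("tun-wan1", "net-wan1", "WAN1"),
    ("tun-wan2", "net-wan2", "WAN2"),
]
failover = plan_tunnel_switch("tun-wan1", {"WAN1": False, "WAN2": True}, TUNNELS)
steady = plan_tunnel_switch("tun-wan1", {"WAN1": True, "WAN2": True}, TUNNELS)
```

Keeping the decision as a pure function separates it from the part that runs commands, but it does nothing about the timing gap between a gateway going down and the new tunnel coming up.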

If simpler means writing a script for it, I would rather go back to dynamic routing, since that will not cause race conditions.

Especially with this:

https://docs.opnsense.org/manual/how-tos/dynamic_routing_bfd.html#bfd-with-ospf-or-bgp

You can have the traffic swapping between tunnels in split seconds, as often as you want. Nothing gets stuck like in a script where a tunnel might be down for 5 seconds (no traffic through the tunnel itself).

BFD catches that in a split second and bam you can be rerouted back and forth.