Messages - ici

#1
Hello good people of the OPNsense community,

We encountered an interesting phenomenon, explained below, and would like to ask about the best way to handle it.

Setup:
We have two OPNsense firewalls (master and backup) with AWS Site-to-Site VPN configured. Each firewall has two IPsec tunnels to AWS (total of 4 tunnels). BGP is used for route exchange with AWS, and CARP handles the failover between master and backup OPNsense firewalls.

Issue:
After a fresh reboot of the backup OPNsense, everything works perfectly - ping from our DC to an EC2 instance in AWS flows through the VPN uninterrupted. However, at minute 30 of every hour, we lose connectivity to our AWS resources. Interestingly, while the AWS console shows status changes (DOWN then UP again) only for the backup's tunnels, our connectivity doesn't recover even after the tunnels return to UP status. All this occurs despite the master's VPN tunnels remaining consistently up and stable.

Root Cause:
We traced this to the built-in cron job "HA update and reconfigure backup" (enabled in System: Settings: Cron and running at */30). This job triggers a full reconfiguration of the backup firewall, which causes:
- IPsec tunnel renegotiation
- BGP session resets on the backup

The most puzzling aspect was that despite this occurring on the backup firewall, these service restarts were somehow affecting the routing for traffic that should have been flowing through the master.

A manual HA sync triggers the same behavior, and connectivity is only restored after rebooting the backup.
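
One way to confirm that outages line up with the */30 schedule is to log a timestamp for every lost ping and count how many fall on minute 00 or 30. This is only a rough sketch: the one-timestamp-per-line input format, the helper name count_on_schedule, and the sample data are all made up for illustration.

```sh
#!/bin/sh
# count_on_schedule is a hypothetical helper: it reads HH:MM:SS timestamps of
# lost pings on stdin and reports how many fall in minute 00 or 30,
# i.e. the minutes in which a */30 cron job fires.
count_on_schedule() {
    awk -F: '
        $2 == "00" || $2 == "30" { hit++ }
        { total++ }
        END { printf "%d of %d losses at */30\n", hit, total }
    '
}

# Made-up sample data, for illustration only.
printf '%s\n' "10:30:02" "10:30:03" "11:00:01" "11:17:45" | count_on_schedule
# prints: 3 of 4 losses at */30
```

A high share of losses on those minutes would point at the cron job rather than at the tunnels themselves.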

Resolution:
- Disabling the "HA update and reconfigure backup" cron job immediately resolved the connectivity issues.
- The VPN tunnels and routing are now stable, and the tunnels no longer show the hourly status changes.

Questions:
1. Why would the backup firewall's VPN state impact traffic when the master's tunnels were working correctly?
2. Is there a way to make HA sync less disruptive—perhaps without triggering full service restarts?
3. What's the recommended approach for handling HA synchronization in setups with active VPN tunnels on both master and backup?
#2
High availability / CARP Failback Delay Implementation
February 04, 2025, 04:47:42 PM
I want to briefly share our experience implementing Redundant AWS Site-to-Site VPN connections with failover.

When testing failover scenarios, we found:

- Disconnecting the WAN interface: Seamless failover with no ping loss
- Rebooting/shutting down the Master: Acceptable failover with ~15 lost pings (though we're still looking to improve this)

The real challenge emerged during failback, when the original Master comes back online. By default, it immediately reclaims CARP Master status on all interfaces without waiting for the IPsec tunnels or gateway connectivity to be fully established. This caused significant connectivity loss (more than 100 lost pings) across our VPN connections.

After several attempts to solve this by directly manipulating CARP states, demotion values, and interface settings (which failed because OPNsense actively manages these at a system level), we found the solution in OPNsense's built-in maintenance mode command, as mentioned in this forum post.

Using OPNsense's native configctl interface carp_set_status maintenance command worked perfectly because it's properly integrated with OPNsense's HA management system. We implemented a controlled failback process that:

1. Puts the recovering Master in maintenance mode
2. Waits for IPsec and Gateway connectivity
3. Only then allows it to reclaim Master status

I am sharing the implementation details if anyone's interested:

CARP Failback Delay Implementation

This solution implements a controlled delay when an OPNsense firewall transitions back to MASTER state, ensuring services like IPsec tunnels are fully established before taking over traffic.

Components

1. CARP Hook Script
Location: /usr/local/etc/rc.syshook.d/carp/10-carp_delay

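#!/bin/sh
# Invoked on CARP state changes: $1 = "vhid@interface", $2 = new state
vhid=${1%@*}
interface=${1#*@}
# Marker file showing that a delayed failback is already in progress
STATE_FILE="/var/run/carp_transition_state"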

if [ "${interface}" = "vtnet0" ] && [ "${2}" = "MASTER" ]; then
    if [ ! -f "$STATE_FILE" ]; then
        touch "$STATE_FILE"
        logger -t carp_delay_hook "WAN interface becoming MASTER, starting carp_delay service"
        /usr/sbin/service carp_delay start
    else
        logger -t carp_delay_hook "Transition already in progress, skipping"
    fi
else
    logger -t carp_delay_hook "Not starting carp_delay - interface: ${interface}, state: ${2}"
fi
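
The hook relies on shell parameter expansion to split its first argument. A minimal sketch of that step, assuming the argument has the vhid@interface form that the script above implies; the helper name parse_carp_arg and the sample value 1@vtnet0 are made up for illustration:

```sh
#!/bin/sh
# Split a "vhid@interface" string the same way the hook script does.
# parse_carp_arg is a hypothetical helper name, not part of OPNsense.
parse_carp_arg() {
    vhid=${1%@*}        # drop the shortest suffix matching "@*" -> VHID
    interface=${1#*@}   # drop the shortest prefix matching "*@" -> interface
    echo "${vhid} ${interface}"
}

parse_carp_arg "1@vtnet0"
# prints: 1 vtnet0
```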


2. Main Delay Service
Location: /usr/local/etc/rc.d/carp_delay

#!/bin/sh
# PROVIDE: carp_delay
# REQUIRE: NETWORKING
# KEYWORD: shutdown

. /etc/rc.subr

name="carp_delay"
rcvar="${name}_enable"
start_cmd="carp_delay_start"
stop_cmd=":"

# Default values
: ${carp_delay_seconds:="120"}
STATE_FILE="/var/run/carp_transition_state"

check_ipsec_tunnels() {
    local max_attempts=5
    local attempt=1
    local wait_time=30
    while [ $attempt -le $max_attempts ]; do
        if /usr/local/sbin/ipsec status | grep -q "INSTALLED"; then
            logger -t carp_delay "IPsec tunnels are up (attempt ${attempt}/${max_attempts})"
            return 0
        else
            logger -t carp_delay "IPsec tunnels not ready (attempt ${attempt}/${max_attempts})"
            attempt=$((attempt + 1))
            [ $attempt -le $max_attempts ] && sleep $wait_time
        fi
    done
    return 1
}

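check_gateway() {
    local ping_count=3
    # First default route in the routing table; empty if none is installed yet
    local gateway="$(netstat -rn | grep default | awk '{print $2}' | head -1)"
    if [ -z "${gateway}" ]; then
        logger -t carp_delay "No default gateway found"
        return 1
    fi
    if ping -c ${ping_count} "${gateway}" > /dev/null 2>&1; then
        logger -t carp_delay "Gateway ${gateway} is responding"
        return 0
    else
        logger -t carp_delay "Gateway ${gateway} is not responding"
        return 1
    fi
}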

carp_delay_start() {
    logger -t carp_delay "Starting CARP failback delay of ${carp_delay_seconds} seconds"
   
    # Force CARP maintenance mode
    /usr/local/sbin/configctl interface carp_set_status maintenance
    logger -t carp_delay "Enabled CARP maintenance mode"

    # Initial delay
    sleep ${carp_delay_seconds}

    # Check gateway connectivity
    if ! check_gateway; then
        logger -t carp_delay "Gateway check failed, keeping maintenance mode"
        rm -f "$STATE_FILE"
        exit 1
    fi

    # Check IPsec tunnels
    if ! check_ipsec_tunnels; then
        logger -t carp_delay "IPsec tunnel check failed, keeping maintenance mode"
        rm -f "$STATE_FILE"
        exit 1
    fi

    # Additional stabilization delay
    logger -t carp_delay "All checks passed, waiting additional 30 seconds for stabilization"
    sleep 30

    # Leave maintenance mode (the same command is issued a second time to switch it back off)
    /usr/local/sbin/configctl interface carp_set_status maintenance
   
    logger -t carp_delay "Services ready, maintenance mode disabled"
    sleep 10
    rm -f "$STATE_FILE"
}

load_rc_config $name
run_rc_command "$1"
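
Note that check_ipsec_tunnels succeeds as soon as a single line of the status output contains INSTALLED, even though each firewall carries two tunnels. A stricter variant could count the matching lines and compare against the expected number. The following is only a sketch: count_installed and tunnels_ready are hypothetical helper names, they read the status text from stdin so they can be tried without a running IPsec daemon, and whether one INSTALLED line maps to exactly one tunnel is an assumption that should be checked against the real output.

```sh
#!/bin/sh
# count_installed (hypothetical helper): print how many lines on stdin
# contain "INSTALLED". grep -c prints 0 but exits non-zero when nothing
# matches, hence the "|| true".
count_installed() {
    grep -c "INSTALLED" || true
}

# tunnels_ready EXPECTED (hypothetical helper): succeed only if at least
# EXPECTED lines on stdin report an installed SA.
tunnels_ready() {
    expected=$1
    installed=$(count_installed)
    [ "$installed" -ge "$expected" ]
}
```

On the firewall this could be used as `/usr/local/sbin/ipsec status | tunnels_ready 2` in place of the `grep -q` test.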

3. Service Configuration
Location: /etc/rc.conf.local

carp_delay_enable="YES"
carp_delay_seconds="120"
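
The carp_delay_seconds entry is optional because the service script assigns its default with the `: ${var:=value}` idiom, which only takes effect when the variable is unset or empty. A small sketch of that behaviour; effective_delay is a hypothetical helper used purely for demonstration:

```sh
#!/bin/sh
# effective_delay (hypothetical helper) mimics the default handling:
# it prints the value passed in, or 120 when the value is empty.
effective_delay() {
    carp_delay_seconds=$1
    : "${carp_delay_seconds:=120}"   # assign only if unset or empty
    echo "$carp_delay_seconds"
}

effective_delay ""      # no override      -> prints 120
effective_delay "300"   # explicit setting -> prints 300
```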

How It Works

1. When the WAN interface (vtnet0) attempts to transition to MASTER state, the hook script detects this and starts the carp_delay service.

2. The carp_delay service:
  - Immediately puts the firewall in maintenance mode using OPNsense's built-in command
  - Waits for the configured delay period (default 120 seconds)
  - Checks gateway connectivity
  - Verifies IPsec tunnels are established
  - Adds an additional 30-second stabilization period
  - Disables maintenance mode, allowing the firewall to become MASTER

3. A lock file (/var/run/carp_transition_state) prevents multiple simultaneous transitions.
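
One caveat: the hook tests for the file and then creates it in two separate steps, so two CARP events arriving at almost the same moment could in principle both pass the test. A possible hardening, shown here only as a sketch, is to use mkdir as the lock, since it creates the directory or fails in a single atomic step. The helper names and the temporary demo path are assumptions for illustration; on the firewall the lock would live under /var/run.

```sh
#!/bin/sh
# acquire_lock / release_lock are hypothetical helpers, not OPNsense commands.
# mkdir either creates the directory or fails, atomically, so only one
# caller can win the lock.
acquire_lock() {
    mkdir "$1" 2>/dev/null
}

release_lock() {
    rmdir "$1" 2>/dev/null
}

# Demo in a throwaway directory.
demo_dir=$(mktemp -d)
LOCK="${demo_dir}/carp_transition.lock"

if acquire_lock "$LOCK"; then
    echo "lock acquired, transition may start"
fi
if ! acquire_lock "$LOCK"; then
    echo "transition already in progress, skipping"
fi
release_lock "$LOCK"
```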

Tested with a reboot/shutdown of the Master and with a WAN interface disconnect.

Hope this is useful to someone.