CARP Failback Delay Implementation

Started by ici, February 04, 2025, 04:47:42 PM

I want to briefly share our experience implementing redundant AWS Site-to-Site VPN connections with failover.

When testing failover scenarios, we found:

- Disconnecting the WAN interface: Seamless failover with no ping loss
- Rebooting/shutting down the Master: Acceptable failover with ~15 lost pings (though we're still looking to improve this)

The real challenge emerged during failback, when the original Master comes back online. By default, it immediately reclaims Master CARP status on all interfaces without waiting for the IPsec tunnels or gateway connectivity to be fully established. This caused significant connectivity problems (more than 100 lost pings) across our VPN connections.

After several attempts to solve this by directly manipulating CARP states, demotion values, and interface settings (which failed because OPNsense actively manages these at a system level), we found the solution in OPNsense's built-in maintenance mode command, as mentioned in this forum post.

Using OPNsense's native configctl interface carp_set_status maintenance command worked perfectly because it's properly integrated with OPNsense's HA management system. We implemented a controlled failback process that:

1. Puts the recovering Master in maintenance mode
2. Waits for IPsec and Gateway connectivity
3. Only then allows it to reclaim Master status

I am sharing the implementation details in case anyone is interested:

CARP Failback Delay Implementation

This solution implements a controlled delay when an OPNsense firewall transitions back to MASTER state, ensuring services like IPsec tunnels are fully established before taking over traffic.

Components

1. CARP Hook Script
Location: /usr/local/etc/rc.syshook.d/carp/10-carp_delay

#!/bin/sh
vhid=${1%@*}
interface=${1#*@}
STATE_FILE="/var/run/carp_transition_state"

if [ "${interface}" = "vtnet0" ] && [ "${2}" = "MASTER" ]; then
    if [ ! -f "$STATE_FILE" ]; then
        touch "$STATE_FILE"
        logger -t carp_delay_hook "WAN interface becoming MASTER, starting carp_delay service"
        /usr/sbin/service carp_delay start
    else
        logger -t carp_delay_hook "Transition already in progress, skipping"
    fi
else
    logger -t carp_delay_hook "Not starting carp_delay - interface: ${interface}, state: ${2}"
fi
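
As the script assumes, the hook is called with the CARP subsystem as "vhid@interface" in $1 and the new state in $2. The two parameter expansions at the top split that first argument. A minimal sketch of just that step follows; the function name split_carp_arg and the sample value 1@vtnet0 are made up for illustration:

```shell
#!/bin/sh
# Split a "vhid@interface" argument the same way the hook script does.
# split_carp_arg is a hypothetical helper, not part of OPNsense.
split_carp_arg() {
    arg="$1"
    vhid="${arg%@*}"       # drop the shortest "@..." suffix, leaving the VHID
    interface="${arg#*@}"  # drop the shortest "...@" prefix, leaving the interface
    echo "${vhid} ${interface}"
}

split_carp_arg "1@vtnet0"   # prints: 1 vtnet0
```

Note that running the real hook by hand with such arguments would start the carp_delay service and put the node into maintenance mode, so try that only on a test pair.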


2. Main Delay Service
Location: /usr/local/etc/rc.d/carp_delay

#!/bin/sh
# PROVIDE: carp_delay
# REQUIRE: NETWORKING
# KEYWORD: shutdown

. /etc/rc.subr

name="carp_delay"
rcvar="${name}_enable"
start_cmd="carp_delay_start"
stop_cmd=":"

# Default values
: ${carp_delay_seconds:="120"}
STATE_FILE="/var/run/carp_transition_state"

check_ipsec_tunnels() {
    local max_attempts=5
    local attempt=1
    local wait_time=30
    while [ $attempt -le $max_attempts ]; do
        if /usr/local/sbin/ipsec status | grep -q "INSTALLED"; then
            logger -t carp_delay "IPsec tunnels are up (attempt ${attempt}/${max_attempts})"
            return 0
        else
            logger -t carp_delay "IPsec tunnels not ready (attempt ${attempt}/${max_attempts})"
            attempt=$((attempt + 1))
            [ $attempt -le $max_attempts ] && sleep $wait_time
        fi
    done
    return 1
}

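check_gateway() {
    local ping_count=3
    local gateway
    # First IPv4 default route; empty if no default route is installed yet
    gateway=$(netstat -rn -f inet | awk '$1 == "default" { print $2; exit }')
    if [ -z "${gateway}" ]; then
        logger -t carp_delay "No default gateway found"
        return 1
    fi
    if ping -c ${ping_count} "${gateway}" > /dev/null 2>&1; then
        logger -t carp_delay "Gateway ${gateway} is responding"
        return 0
    else
        logger -t carp_delay "Gateway ${gateway} is not responding"
        return 1
    fi
}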

carp_delay_start() {
    logger -t carp_delay "Starting CARP failback delay of ${carp_delay_seconds} seconds"
   
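    # Enter CARP maintenance mode. The command acts as a toggle, so this
    # assumes the node is not already in maintenance mode at this point.
    /usr/local/sbin/configctl interface carp_set_status maintenance
    logger -t carp_delay "Enabled CARP maintenance mode"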

    # Initial delay
    sleep ${carp_delay_seconds}

    # Check gateway connectivity
    if ! check_gateway; then
        logger -t carp_delay "Gateway check failed, keeping maintenance mode"
        rm -f "$STATE_FILE"
        exit 1
    fi

    # Check IPsec tunnels
    if ! check_ipsec_tunnels; then
        logger -t carp_delay "IPsec tunnel check failed, keeping maintenance mode"
        rm -f "$STATE_FILE"
        exit 1
    fi

    # Additional stabilization delay
    logger -t carp_delay "All checks passed, waiting additional 30 seconds for stabilization"
    sleep 30

    # Leave maintenance mode (the same command toggles it off again)
    /usr/local/sbin/configctl interface carp_set_status maintenance
   
    logger -t carp_delay "Services ready, maintenance mode disabled"
    sleep 10
    rm -f "$STATE_FILE"
}

load_rc_config $name
run_rc_command "$1"
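
The retry loop in check_ipsec_tunnels is a pattern that could be reused for other readiness checks. Below is a small sketch of a generic version; the helper name retry and its argument order are my own choices, not something OPNsense provides:

```shell
#!/bin/sh
# retry MAX_ATTEMPTS WAIT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
    local max_attempts="$1"
    local wait_time="$2"
    shift 2
    local attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        attempt=$((attempt + 1))
        # Sleep only when another attempt will follow
        [ "$attempt" -le "$max_attempts" ] && sleep "$wait_time"
    done
    return 1
}
```

With this, the IPsec check from the service could be written as retry 5 30 sh -c '/usr/local/sbin/ipsec status | grep -q INSTALLED', and the gateway check could be retried the same way instead of failing on the first miss.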

3. Service Configuration
Location: /etc/rc.conf.local

carp_delay_enable="YES"
carp_delay_seconds="120"

How It Works

1. When the WAN interface (vtnet0) attempts to transition to MASTER state, the hook script detects this and starts the carp_delay service.

2. The carp_delay service:
  - Immediately puts the firewall in maintenance mode using OPNsense's built-in command
  - Waits for the configured delay period (default 120 seconds)
  - Checks gateway connectivity
  - Verifies IPsec tunnels are established
  - Adds an additional 30-second stabilization period
  - Disables maintenance mode, allowing the firewall to become MASTER

3. A lock file (/var/run/carp_transition_state) prevents multiple simultaneous transitions.
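
One caveat on the lock file: the hook first tests for the file and then creates it with touch, so two hook invocations arriving almost together could both pass the test. A possible way to narrow that window is the shell's noclobber option, sketched here with a hypothetical helper name acquire_lock:

```shell
#!/bin/sh
# acquire_lock PATH
# Hypothetical helper: creates PATH and succeeds only if it did not exist yet.
acquire_lock() {
    # With noclobber (set -C) the ">" redirection refuses to overwrite an
    # existing file, so the test and the creation are a single operation.
    # The subshell keeps the option and any redirection error contained.
    ( set -C; : > "$1" ) 2>/dev/null
}
```

In the hook, if acquire_lock "$STATE_FILE"; then ... would then replace the separate [ ! -f ... ] test and touch. The service can still remove the file with rm -f as before.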

Tested after a reboot/shutdown or WAN interface disconnect.
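
For anyone repeating these tests, it helps to watch the CARP state on the interface while the delay runs. A rough sketch that extracts VHID and state from ifconfig output; the function name carp_states is made up, and it assumes the usual FreeBSD line format "carp: MASTER vhid 1 advbase 1 advskew 0":

```shell
#!/bin/sh
# carp_states: reads `ifconfig <interface>` output on stdin and prints
# "<vhid> <state>" for every CARP line. Hypothetical helper for testing.
carp_states() {
    awk '$1 == "carp:" { print $4, $2 }'
}

# On the firewall (vtnet0 is the WAN interface from the hook script):
#   ifconfig vtnet0 | carp_states
```

During a failback the recovering node should keep reporting BACKUP until the service leaves maintenance mode, and the hook and service messages can be followed in the system log under the carp_delay_hook and carp_delay tags.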

I hope this is useful to someone.