OPNsense HA CARP – Backup node never becomes MASTER, primary continuously crashes

Started by Peter_Lanser, October 14, 2025, 04:16:38 PM

Hi all,

I am running an OPNsense HA cluster with two nodes. Both are configured with CARP VIPs and a dedicated pfsync interface over a direct cable. However, I am struggling with severe failover issues:

Problem description

When I reboot the primary (router01) or put it into persistent CARP maintenance mode, the secondary (router02) never transitions to MASTER.

Instead, router02 remains stuck in BACKUP state, even though router01 is unavailable.

In addition, router01 repeatedly hard-crashes and reboots unexpectedly, both during failover tests and during normal operation.


What I already checked / tested

Hardware

- Initially used Mellanox ConnectX-5 100G → replaced with ConnectX-4 40G → exact same problems.
- Running single interfaces now (no LAGG/LACP) for CARP VIPs.
- Dedicated sync interface (direct cable) for pfsync.

CARP behavior

Logs show repeating flaps:
carp: MASTER -> BACKUP (more frequent advertisement received)
carp: BACKUP -> MASTER (master timed out)
This suggests unstable or missing CARP advertisements.
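
To put a number on the flapping, the transitions can be counted straight from the log. The following is only a minimal sketch: the function name `count_carp_transitions` is made up for illustration, and it assumes nothing about the log beyond the two message formats quoted above (any extra text between `carp:` and the state name, such as an interface prefix, is simply skipped).

```python
import re
from collections import Counter

# Matches lines such as:
#   carp: MASTER -> BACKUP (more frequent advertisement received)
# Any text between "carp:" and the old state is tolerated and ignored.
CARP_RE = re.compile(r"carp: .*?(\w+) -> (\w+) \((.*?)\)")


def count_carp_transitions(lines):
    """Count (old_state, new_state, reason) tuples found in log lines."""
    counts = Counter()
    for line in lines:
        match = CARP_RE.search(line)
        if match:
            counts[match.groups()] += 1
    return counts


if __name__ == "__main__":
    # Sample input built from the two messages quoted in this post.
    sample = [
        "carp: MASTER -> BACKUP (more frequent advertisement received)",
        "carp: BACKUP -> MASTER (master timed out)",
        "carp: MASTER -> BACKUP (more frequent advertisement received)",
    ]
    for (old, new, reason), n in count_carp_transitions(sample).most_common():
        print(f"{n}x {old} -> {new}: {reason}")
```

Feeding it the system log from each node separately should show whether both routers flap at the same rate or only one of them does.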

Running tcpdump -ni <carpdev> proto 112 often shows no CARP advertisements at all.

Settings

- Gateway monitoring disabled.
- Hardware offloading disabled (TSO, LRO, checksum).

Switching / Layer2

- CARP runs on VLAN interfaces across switches.
- IGMP snooping / multicast forwarding might play a role, but even with direct interfaces the issue persists.

Symptoms

- Router02 stays BACKUP even when router01 is down or in maintenance.
- System logs show continuous CARP MASTER/BACKUP flapping.
- Router01 continuously hard-crashes and reboots automatically during failover testing.
- High RTTs and loss of GUI access also occur during tests.

Questions

- Has anyone experienced CARP failover where the backup never becomes MASTER, even with preempt=1 and demotion=0?
- Could multicast (224.0.0.18) handling or switch features (IGMP snooping, unknown multicast filtering, storm control) be the cause?
- Are Mellanox NICs (ConnectX-4/5) known to cause CARP instability or even kernel panics/crashes under FreeBSD/OPNsense?
- Any best practices for Intel vs Mellanox NICs in OPNsense HA setups?
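
In case it helps anyone checking switch MAC tables or multicast filters for the second question: the Ethernet destination address for 224.0.0.18 follows the standard IPv4 multicast mapping (fixed prefix 01:00:5e plus the low 23 bits of the group address). A small sketch of that mapping, with `multicast_mac` being just an illustrative helper name:

```python
import ipaddress


def multicast_mac(group: str) -> str:
    """Map an IPv4 multicast group to its Ethernet MAC address.

    Standard mapping: prefix 01:00:5e followed by the low 23 bits
    of the group address.
    """
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    low23 = int(addr) & 0x7FFFFF  # keep only the low 23 bits
    octets = [
        0x01,
        0x00,
        0x5E,
        (low23 >> 16) & 0xFF,
        (low23 >> 8) & 0xFF,
        low23 & 0xFF,
    ]
    return ":".join(f"{o:02x}" for o in octets)


if __name__ == "__main__":
    print(multicast_mac("224.0.0.18"))
```

For 224.0.0.18 this gives 01:00:5e:00:00:12, which is the destination MAC a switch would have to forward for the advertisements to reach the other node.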

Thanks in advance!

Peter Lanser