CARP not preempting despite "disable preempt" not checked

Started by MaeveFirstborn, November 22, 2024, 11:09:02 PM

We have two firewalls in a CARP failover relationship. Each one has two WANs and three LANs. While troubleshooting something earlier today, we realized that CARP failover wasn't behaving the way we thought it was supposed to. We want the behavior to be such that if one of the interfaces fails - any of them - the backup takes over. More specifically, whichever firewall has the most functional interfaces should be master. I guess a better configuration in the future would be to weight specifically on the WANs, but for now we want to get preempting working in the first place.
Right now, when we kill one of the interfaces on the master, the second firewall's corresponding interface takes over as CARP master. However, ONLY that interface takes over, which is useless: if the WANs fail on firewall 1 but the LANs don't, the downstream hosts keep sending traffic to the firewall that is CARP master on the LAN - which in this case is the firewall without WAN reachability.
Is it something with these advskews?
Obvious stuff:

  • Disable pre-empt is off
  • CARP itself is working, just not in a group

We face exactly the same problem.
If we unplug the LAN from our primary, it fails over to the secondary on LAN and all the assigned VLAN interfaces. But not on the WAN interface - for the WAN the primary stays master, and so internet traffic for the clients is interrupted.
I had "Disable preemptive" unchecked on the primary and checked on the secondary... this option disappeared from the GUI with the update to 25.1, so I had to remove it from the secondary's config manually...

What OPNsense version are you running and how did you set up the interfaces? It's important that all the interfaces have the same identifier on both machines (e.g. VLAN99 - opt1, and so on).

See https://docs.opnsense.org/manual/how-tos/carp.html#setup-interfaces-basic-firewall-rules

"Make sure the interface assignments on both systems are identical! Via Interfaces ‣ Overview you can check if e.g. DMZ is opt1 on both machines. When the assignments differ you will have mixed Master and Backup IPs on both machines."
Deciso DEC740
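
The check quoted from the docs can be sketched as a small comparison. This is only an illustration: the two dictionaries are hypothetical values typed in by hand from Interfaces ‣ Overview on each node, and `assignment_mismatches` is a made-up helper, not an OPNsense tool or API.

```python
def assignment_mismatches(primary, secondary):
    """Return {identifier: (primary_desc, secondary_desc)} for every
    identifier that differs between the nodes or exists on one only."""
    mismatches = {}
    for ident in sorted(set(primary) | set(secondary)):
        a = primary.get(ident)
        b = secondary.get(ident)
        if a != b:
            mismatches[ident] = (a, b)
    return mismatches


# Hypothetical assignments (identifier -> description), not taken from the thread.
primary = {"wan": "WAN", "lan": "LAN", "opt1": "DMZ", "opt2": "VLAN99"}
secondary = {"wan": "WAN", "lan": "LAN", "opt1": "VLAN99", "opt2": "DMZ"}

for ident, (a, b) in assignment_mismatches(primary, secondary).items():
    print(f"{ident}: primary={a} secondary={b}")
```

With the sample data above, opt1 and opt2 are reported as swapped, which is the situation the docs warn about.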

We use OPNsense 25.1.3-amd64 installed on two Proxmox hypervisors. All the interface numberings and assignments are exactly the same on both machines (the primary/secondary config is generated, so the interface numbering must be the same on both).
I removed some of the CARP VIPs yesterday because the VLANs are not currently in use, but I kept the interfaces active on both machines... this should not matter, right?

Quote from: tofuSCHNITZEL on March 19, 2025, 12:22:02 AM
We face exactly the same problem.
If we unplug the LAN from our primary, it fails over to the secondary on LAN and all the assigned VLAN interfaces. But not on the WAN interface - for the WAN the primary stays master, and so internet traffic for the clients is interrupted.
I had "Disable preemptive" unchecked on the primary and checked on the secondary... this option disappeared from the GUI with the update to 25.1, so I had to remove it from the secondary's config manually...

The VHIDs have to be unique, one VHID per CARP interface. You're setting the same VHID for multiple interfaces.

Quote from: MaeveFirstborn on November 22, 2024, 11:09:02 PM
Is it something with these advskews?

The recommendation is to set advskew to 0 or 1 on the master and 100+ on the backups. Try setting it to 0 on all of the master's CARP VIPs and to 100 on all of the backup's.
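
Why these numbers matter can be shown with a little arithmetic. The sketch below assumes the usual CARP rule that a node advertises every advbase + advskew/256 seconds and that the node advertising most often wins. The helper names and the advbase of 1 are illustrative assumptions, not values from this thread, and real elections also depend on preemption and on the nodes actually hearing each other.

```python
def advert_interval(advbase, advskew):
    """Seconds between CARP advertisements: advbase + advskew/256."""
    return advbase + advskew / 256.0


def preferred_master(nodes):
    """Name of the node with the shortest advertisement interval."""
    best = min(nodes, key=lambda n: advert_interval(n["advbase"], n["advskew"]))
    return best["name"]


# Hypothetical pair using the recommended skews (advbase of 1 is assumed).
nodes = [
    {"name": "primary", "advbase": 1, "advskew": 0},
    {"name": "secondary", "advbase": 1, "advskew": 100},
]

for n in nodes:
    print(n["name"], advert_interval(n["advbase"], n["advskew"]))
print("preferred master:", preferred_master(nodes))
```

A gap of 100 leaves the backup clearly behind the master while still letting it take over within about a second and a half of the master going silent.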

Quote from: patient0 on March 19, 2025, 10:23:26 AM
The VHIDs have to be unique, one VHID per CARP interface. You're setting the same VHID for multiple interfaces.

I have more than 256 VLANs, so I cannot use a different VHID for every VLAN. Also, technically they are "unique" per interface, since the VLANs don't "see" each other... I also had a unique VHID for WAN and LAN and then the same one for every VLAN - but this did not help with the "together failover" either.

The advskew is 0 for every VIP on the primary and 100 for every VIP on the secondary.

Quote
Disable preemptive unchecked on the primary and checked on the secondary... this option disappeared from the GUI

I'm on 25.1.3 and it's still available; you have to enable 'advanced mode', as before, to see it.

Quote from: tofuSCHNITZEL on March 19, 2025, 10:40:50 AM
I have more than 256 VLANs, so I cannot use a different VHID for every VLAN. Also, technically they are "unique" per interface, since the VLANs don't "see" each other... I also had a unique VHID for WAN and LAN and then the same one for every VLAN - but this did not help with the "together failover" either.

Fair enough, it has to be unique on the interface/broadcast domain.
Just out of curiosity: you have more than 256 interfaces CARP-ed? And each VLAN has its own interface in the VM (I assume not)? What type of switch/bridge do you use on Proxmox, Linux bridge or Open vSwitch?
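
The "unique per broadcast domain" rule can be sketched as a quick sanity check. Everything here is hypothetical: the tuple layout, the domain labels and the `vhid_problems` helper are made up for illustration, and the only hard fact used is that a VHID must lie between 1 and 255. Several addresses may legitimately share one VHID on one interface, so the check only flags a VHID that shows up on more than one interface within the same layer-2 domain.

```python
from collections import defaultdict

VHID_MIN, VHID_MAX = 1, 255  # valid VHID range


def vhid_problems(vips):
    """vips: iterable of (interface, broadcast_domain, vhid) tuples.

    Returns a list of human-readable problems: out-of-range VHIDs and
    VHIDs used by more than one interface in the same broadcast domain.
    """
    problems = []
    users = defaultdict(set)
    for iface, domain, vhid in vips:
        if not VHID_MIN <= vhid <= VHID_MAX:
            problems.append(f"{iface}: VHID {vhid} out of range")
        users[(domain, vhid)].add(iface)
    for (domain, vhid), ifaces in sorted(users.items()):
        if len(ifaces) > 1:
            problems.append(
                f"VHID {vhid} reused in {domain} by {', '.join(sorted(ifaces))}"
            )
    return problems


# Hypothetical layout: VHID 3 reused on two separate VLANs is fine,
# but opt7 and opt8 end up in the same layer-2 domain and collide.
vips = [
    ("wan", "wan-l2", 1),
    ("lan", "lan-l2", 2),
    ("opt1", "vlan10", 3),
    ("opt2", "vlan20", 3),
    ("opt7", "vlan30", 4),
    ("opt8", "vlan30", 4),
]

for line in vhid_problems(vips):
    print(line)
```

Under this rule, reusing one VHID across isolated VLANs is acceptable, but VLANs that leak multicast into each other would effectively share a domain and could collide.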

Quote from: patient0 on March 19, 2025, 12:22:31 PM
I'm on 25.1.3 and it's still available; you have to enable 'advanced mode', as before, to see it.

Yes, you are right! I forgot about the "advanced" switch at the top.

Quote from: patient0 on March 19, 2025, 12:22:31 PM
Fair enough, it has to be unique on the interface/broadcast domain.
Just out of curiosity: you have more than 256 interfaces CARP-ed? And each VLAN has its own interface in the VM (I assume not)? What type of switch/bridge do you use on Proxmox, Linux bridge or Open vSwitch?

Not currently active, but planned. And of course I need a CARP VIP in every VLAN, otherwise the HA setup does not make much sense.
No, the VLANs are added in OPNsense on the main "interface". The bridge in Proxmox that contains this interface is a VLAN-aware Linux bridge, but we had our fair share of headaches with multicast "bleeding" between VLANs etc.

I realised the issue.
It was the virtualisation all along. The physical interface is in a bridge, and this bridge is assigned to a virtual NIC, so the virtual NIC never went down when the cable on the hypervisor was physically disconnected. The master OPNsense therefore never got an "interface down" event and never got demoted. I always ended up with a "split brain" situation where both the primary and the secondary became master on the LAN interfaces - and since both could still see each other on the WAN, the WAN stayed master/backup.
I solved it by assigning the NICs on the 10G card directly to the OPNsense VMs via PCIe passthrough.
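
The difference between the two setups can be sketched with a toy model. It is a simplification, not the real CARP state machine. `IFDOWN_DEMOTION` and `MAX_SKEW` are assumptions meant to mirror FreeBSD's `net.inet.carp.ifdown_demotion_factor` default and the cap on the advertised skew, and all names are made up. A node that sees one of its links go down demotes itself on every interface; a node that hears no peer on a segment declares itself master there.

```python
IFDOWN_DEMOTION = 240  # assumption: penalty added per interface seen as down
MAX_SKEW = 240         # assumption: cap on the skew a node advertises


def make_node(name, advskew, wan_up, lan_up):
    return {"name": name, "advskew": advskew,
            "link_up": {"wan": wan_up, "lan": lan_up}}


def effective_skew(node):
    """Configured advskew plus the demotion for every link seen as down."""
    down = sum(1 for up in node["link_up"].values() if not up)
    return min(node["advskew"] + down * IFDOWN_DEMOTION, MAX_SKEW)


def carp_states(a, b, reachable):
    """reachable[iface]: can the nodes hear each other on that segment?
    Ties in effective skew are not modelled."""
    result = {}
    for iface in a["link_up"]:
        states = {}
        for me, peer in ((a, b), (b, a)):
            if not me["link_up"][iface]:
                states[me["name"]] = "INIT"    # own link is down
            elif not reachable[iface] or not peer["link_up"][iface]:
                states[me["name"]] = "MASTER"  # hears nobody, takes over
            elif effective_skew(me) < effective_skew(peer):
                states[me["name"]] = "MASTER"
            else:
                states[me["name"]] = "BACKUP"
        result[iface] = states
    return result


# LAN cable pulled on the primary; the segment is cut either way.
cut = {"wan": True, "lan": False}

# Passthrough NIC: the primary sees LAN go down and demotes itself.
physical = carp_states(make_node("fw1", 0, True, False),
                       make_node("fw2", 100, True, True), cut)

# Virtual NIC behind a bridge: the link stays "up", so no demotion happens.
virtual = carp_states(make_node("fw1", 0, True, True),
                      make_node("fw2", 100, True, True), cut)

print("physical:", physical)
print("virtual: ", virtual)
```

In the passthrough case fw2 becomes master on both WAN and LAN. In the bridged case both nodes are master on LAN while fw1 stays master on WAN, which is the split brain described above.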

I'm having this same issue when configuring a second WAN for the first time. When I pull the second WAN connection, it fails over to the second firewall (typically all interfaces fail over, but several times I've seen only the downed interface transfer). Regardless of which firewall is master of the second WAN after failover, a client on the LAN stops being able to ping a client on the WAN until I run a tracert between the clients, which succeeds, after which I am able to ping again. All of my VIPs are unique and assigned to the same interfaces, and my advskew is set correctly as well. It seems like some sort of gateway/routing issue to me? I have no static routes configured on either firewall. I have the gateways configured so that the WAN1 GW is the default (I've been testing the second WAN's HA with the first WAN uninitialized, so this gateway shouldn't be used at all during this scenario). I've tried setting up gateway groups, changing priorities, etc., but I can't seem to find something that works.