Primary FW upstream not functional after successful failover

Started by mattlf, February 06, 2025, 05:07:05 PM

I have a Primary/Backup setup with CARP sitting in a rack. Each firewall's WAN is a separate connection to additional network infrastructure that I have no visibility of.

Both FWs are sitting in a /29; x.x.x.16-19 in that /29 are not mine.

Primary WAN is at x.x.x.21/29, LAN at 10.50.50.10/24
Backup WAN at x.x.x.22/29, LAN at 10.50.50.20/24
Both point at x.x.x.17/29 as their default Gateway
WAN VHID at x.x.x.20/29
LAN VHID at 10.50.50.1/24
Adv Freq (base / skew) is 1 / 0 on Primary, 1 / 100 on Backup
Primary PFSync at 10.0.0.1
Backup PFSync at 10.0.0.2
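
For anyone wanting to sanity-check a layout like the one above, here is a minimal Python sketch. It makes two assumptions that are not in the post: the masked x.x.x.16/29 is replaced by the documentation prefix 192.0.2.16/29, and the helper names are made up for illustration. It checks that the gateway, the WAN VIP and both WAN addresses fall inside the same /29, and works out the CARP advertisement interval from the base / skew values, using the usual definition of the skew as 1/256ths of a second added to the base.

```python
import ipaddress

# Placeholder prefix (assumption): the real x.x.x.16/29 is masked in the
# post, so the documentation range 192.0.2.0/24 stands in for it.
WAN_NET = ipaddress.ip_network("192.0.2.16/29")

# Host parts taken from the post, mapped onto the placeholder prefix.
WAN_HOSTS = {
    "gateway": "192.0.2.17",
    "wan_vip": "192.0.2.20",
    "primary": "192.0.2.21",
    "backup": "192.0.2.22",
}


def outside_subnet(network, hosts):
    """Return the names of hosts whose address is not inside `network`."""
    return [
        name
        for name, addr in hosts.items()
        if ipaddress.ip_address(addr) not in network
    ]


def carp_adv_interval(advbase, advskew):
    """Advertisement interval in seconds: advbase plus advskew/256."""
    return advbase + advskew / 256


print(outside_subnet(WAN_NET, WAN_HOSTS))  # names outside the /29, if any
print(carp_adv_interval(1, 0))             # Primary: base 1, skew 0
print(carp_adv_interval(1, 100))           # Backup: base 1, skew 100
```

With these values the Primary advertises slightly more often than the Backup would, which is what makes it the preferred master when both are healthy.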

Everything works fine initially: Primary syncs to Backup without problems, and I can see chatter on pfsync. On the VIPs Status page both show a CARP demotion level of 0, Primary knows it's the master of both the WAN and LAN VHIDs, and Backup knows it's the Backup.

If I simulate the Primary FW going down (pull the power out), there is a tiny outage as expected, but within 1-2 seconds the Backup FW has taken over and I can see it has become the Master.

Once the Primary FW has come back online, I can see the Backup relinquish its claim on the VHIDs and move back into a Backup state, and the Primary FW becomes the Master for both. I can confirm this by connecting to the LAN VHID at 10.50.50.1.


However, despite the Primary now being in control of both VHIDs, upstream traffic becomes unusable, similar to a network loop. LAN remains fine. If I physically remove the WAN cable from the Backup machine or power it off, the Primary quickly becomes happy again and everything's good. If the Backup then rejoins, it causes no additional interference and stays in its Backup state, waiting for another failover.

I have two suspicions. Either the Backup FW is somehow not fully relinquishing the WAN VHID back to the Primary, despite the GUI showing that it has. Or something is going on in the switches these firewalls are connected to, which I don't own and have no access to: potentially they cache the Backup FW as the holder of .20/29 at the initial failover event and, despite the Backup having relinquished the VHID once the Primary came back online, the switch hasn't noticed and is still trying to route traffic to the Backup.

I have also raised a ticket with the team that manages the upstream switches, as it sounds more likely to me that OPNsense is behaving correctly in this instance. But has anyone else experienced something like this before, or does anyone have suggestions on where to start looking to verify whether the problem is in my configuration? Thanks for any help or suggestions.


So both OPNsense firewalls are connected to a single switch or a pair of switches that implements a single broadcast domain on WAN? Do these switches transport multicast packets correctly?
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

I believe each firewall is connected to a separate Juniper switch. I don't have visibility of the model/config/capabilities, as they're not owned or managed by me.

If you have any suspicions though I can relay them to the team responsible for managing them and fill in the gaps.

The switches must be connected and support multicast communication between the two OPNsense WAN interfaces.
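
To make the multicast requirement concrete for whoever manages the switches, here is a small Python sketch. CARP advertisements are sent to the multicast group 224.0.0.18 using IP protocol 112; the function below applies the standard IPv4-multicast-to-Ethernet mapping (01:00:5e plus the low 23 bits of the group address) to show which destination MAC those frames carry. The function name is made up for illustration, and whether the Juniper switches in question filter or snoop on that address is not known from this thread.

```python
import ipaddress

CARP_GROUP = "224.0.0.18"  # multicast group CARP advertisements are sent to
CARP_IP_PROTO = 112        # IP protocol number carried by those packets


def multicast_ip_to_mac(group):
    """Map an IPv4 multicast group to its Ethernet multicast MAC address."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    low23 = int(addr) & 0x7FFFFF       # only the low 23 bits are mapped
    mac = 0x01005E000000 | low23       # fixed 01:00:5e prefix
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))


print(multicast_ip_to_mac(CARP_GROUP))
```

Frames to that address, with IP protocol 112, have to make it from one firewall's WAN port to the other's for the Backup to see the Primary's advertisements.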

Thanks Patrick. Having worked further with the guys that control the upstream stack, this does sound like the most likely cause, although they're convinced that both switches are connected and support multicast...

Just in case it assists anyone else searching for a similar problem: out of interest, and impatience, I decided to add my own L2 switch into the stack above my firewalls and connected both upstream cables and both firewalls to it. A failover test primary->backup->primary seems to work flawlessly with that additional layer. It's just not ideal, so I'll keep at it with the other team. In my mind, this test proves what you suggested.

Interestingly, I noticed that if I return to the previously connected setup and replicate the bricked state, then change the outbound rule on my primary FW from using x.x.x.20 (the WAN VHID) for everything to using x.x.x.21 instead, the upstream gateway works absolutely fine again. I want the WAN VHID, so that's not a solution for me, but it makes me suspect something like an ARP caching problem on their switches (x.x.x.20 still being routed to the backup FW despite the primary coming back up?).

I don't currently suspect OPNsense or my own config, but I'd welcome any other insights! I just want this relatively simple thing resolved.
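
A detail that may help when relaying the caching suspicion to the switch team, offered as a hedged sketch rather than a diagnosis: CARP answers for an IPv4 virtual IP with a virtual MAC derived from the VHID number, 00:00:5e:00:01:xx. If that holds here, the IP-to-MAC mapping for x.x.x.20 would stay the same across a failover, and the thing that has to move is the switch port on which that MAC was last learned. The VHID numbers are not given in this thread, so the values below are hypothetical, as is the function name.

```python
def carp_virtual_mac(vhid):
    """Return the IPv4 CARP virtual MAC address for a VHID (1-255)."""
    if not 1 <= vhid <= 255:
        raise ValueError("VHID must be between 1 and 255")
    return f"00:00:5e:00:01:{vhid:02x}"


# Hypothetical VHID numbers; substitute the ones configured on the VIPs.
for vhid in (1, 2):
    print(vhid, carp_virtual_mac(vhid))
```

Asking the other team which port each switch currently associates with that MAC, both before and after a primary->backup->primary test, could show whether the entry follows the master or stays pointing towards the Backup.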