OPNsense Forum

English Forums => High availability => Topic started by: j_s on July 20, 2022, 01:34:46 pm

Title: When should a "failover" automatically occur?
Post by: j_s on July 20, 2022, 01:34:46 pm
Hello.  I've got 3 HA systems in production use, but one lingering problem remains that I run into from time to time...

HA and failover.

Let me give an example:

#1 has all interfaces in a MASTER state for CARP and #2 has all interfaces in a BACKUP state.  #1 is my primary and #2 is my secondary.

All is working fine, then someone accidentally unplugs a network cable on #1 (I am running lagg everywhere, but sometimes failover occurs faster than the lagg can respond).  #2 (which had all CARP interfaces in a BACKUP state) changes all interfaces to MASTER.  So far so good.  Everything works and workloads are typically unaware of what just happened.

Later that day I reconnect the cable that was unplugged.

#2 remains in MASTER state for all CARP interfaces and #1 stays in BACKUP state for all interfaces (except the one that was offline, that changes to BACKUP from INIT).

I did look at the CARP traffic with tcpdump, and everything seems normal.  The CARP packets being broadcast by #2 have the proper vrids, and the priority is 100.  #1 isn't broadcasting anything and is happy with all of its CARP states in BACKUP.

So here are my questions:

1. Shouldn't #1 have taken back over as MASTER now that all of the networking is good?  (I expected it to, but it didn't)
2.  If it's not supposed to, what's the recommended procedure to fail back over to #1 being in MASTER state?  I get the feeling that "rebooting #2" isn't the most ideal situation, although I have done it in the past to solve problems like this.
3.  Since I expected that #1 would have simply broadcast itself as the higher priority and taken back over, is there a setting I could have wrong?
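
For reference, the expectation behind questions 1 and 3 can be written down as a small sketch. It assumes the FreeBSD CARP behavior where a BACKUP node takes over only when preempt is enabled and its own advertisement interval (advbase + advskew/256 seconds) is strictly shorter than the one the current MASTER advertises. The `Advert` class, the function names, and the advskew of 0 for #1 are illustrative assumptions; the 100 is the priority seen in tcpdump for #2.

```python
from dataclasses import dataclass


@dataclass
class Advert:
    """Values carried in a CARP advertisement (names are illustrative)."""
    advbase: int  # base interval in seconds
    advskew: int  # skew in 1/256 second units; lower wins


def interval(advert: Advert) -> float:
    """Advertisement interval in seconds: advbase + advskew / 256."""
    return advert.advbase + advert.advskew / 256


def backup_should_preempt(own: Advert, master: Advert, preempt_enabled: bool) -> bool:
    """Assumed rule: a BACKUP takes over only if preempt is on and it
    would advertise strictly more often than the current MASTER."""
    return preempt_enabled and interval(own) < interval(master)


primary = Advert(advbase=1, advskew=0)      # assumed value for #1
secondary = Advert(advbase=1, advskew=100)  # priority 100 seen from #2

print(backup_should_preempt(primary, secondary, preempt_enabled=True))
print(backup_should_preempt(primary, secondary, preempt_enabled=False))
```

Under these assumptions #1 would only reclaim MASTER with preempt enabled and with its effective skew below the value #2 is advertising.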

I am having this problem on 22.1.10, but I've had this happen to me on versions going back to the 21.1 series.  I have no reason to believe there's a problem with this version; it is more likely a problem with either my configuration or my understanding of the proper behavior of OPNsense HA.

Thanks for reading and hopefully clarifying my misunderstandings of HA or configuration issues.
Title: Re: When should a "failover" automatically occur?
Post by: Patrick M. Hausen on July 20, 2022, 01:47:14 pm
What is System > High Availability > Settings > Disable Preempt set to?
Title: Re: When should a "failover" automatically occur?
Post by: j_s on July 20, 2022, 02:35:24 pm
Quote from: Patrick M. Hausen on July 20, 2022, 01:47:14 pm
What is System > High Availability > Settings > Disable Preempt set to?

It is disabled on both nodes.  Sorry, I should have mentioned that in my original post.

However, I did go poking around, and after going through every page of the WebGUI, I saw this on #1 under Interfaces -> Virtual IPs -> Status:

CARP has detected a problem and this unit has been demoted to BACKUP status.
Check link status on all interfaces with configured CARP VIPs.

I double checked everything and all is good (each can ping the other and the BACKUP is receiving CARP packets and can ping the VIP).  So I'm going to presume that it saw a problem when the cable was unplugged, but hasn't figured out that everything is fine now.  I verified from #1 and #2 that I can ping the other and that #1 can ping all of the VIPs.  I also verified that every interface with a CARP VIP (which is all of them except the sync interface) is receiving CARP packets.

I did notice while writing this post that on #1 under Interfaces -> Virtual IPs -> Status it says "Current CARP demotion level = 0" while on #2 it says that it is 240.  I'm not sure if this is a hint as to the problem or not as I don't know what the "demotion level" really means.  About to go check Google and the docs.
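
A minimal sketch of how the demotion level could be read and interpreted from a shell on each node. It assumes the FreeBSD sysctls `net.inet.carp.demotion` and `net.inet.carp.preempt`, and assumes that the demotion counter is added to the configured advskew and capped at 240 before being advertised. The helper names and the sample text are made up for illustration; only the value 240 comes from the status page above.

```python
import subprocess

CARP_MAXSKEW = 240  # assumed upper bound for the advertised skew


def parse_sysctl(text: str) -> dict:
    """Parse 'name: value' lines as printed by sysctl into a dict of ints."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue
        try:
            values[name.strip()] = int(value.strip())
        except ValueError:
            continue  # skip non-numeric entries
    return values


def effective_advskew(configured: int, demotion: int) -> int:
    """Skew actually advertised: configured skew plus demotion, clamped to 0..240."""
    return max(0, min(CARP_MAXSKEW, configured + demotion))


def read_carp_sysctls() -> dict:
    """Run 'sysctl net.inet.carp' on the firewall itself (FreeBSD only)."""
    result = subprocess.run(
        ["sysctl", "net.inet.carp"], capture_output=True, text=True, check=True
    )
    return parse_sysctl(result.stdout)


# Hypothetical sample output, not captured from a real system.
sample = "net.inet.carp.preempt: 1\nnet.inet.carp.demotion: 240\n"
state = parse_sysctl(sample)
print(state["net.inet.carp.demotion"])
print(effective_advskew(0, state["net.inet.carp.demotion"]))
```

If that assumption holds, a node with a nonzero demotion level advertises a worse skew than it is configured with, so comparing the value on both nodes is a reasonable first check.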

Edit:  I did just find that an interface I created (but never actually used for any client machines) did not have the firewall rule allowing CARP traffic.  I've since added the rule to both #1 and #2.
Title: Re: When should a "failover" automatically occur?
Post by: j_s on July 21, 2022, 12:01:31 am
Just wanted to provide an update.  For reasons outside my control, the #2 box had to be power cycled.  #1 picked up and took over all workloads.  When #2 came up, it sat in a BACKUP state.

I guess this issue will be left unsolved, since I can't investigate it further now that it is gone.

Thanks to everyone that helped.
Title: Re: When should a "failover" automatically occur?
Post by: nzkiwi68 on August 12, 2022, 05:43:43 am
You can also slow down CARP fail-over.

The Advertising Frequency base value default is 1. This means a CARP broadcast message is sent once per second. You could slow this down to 2 or 3 and send CARP messages once every 2 or 3 seconds, which will slow down fail-over but be more stable in an imperfect network.

If CARP appears to fail over too quickly on the smallest network hiccup, then increasing the base value can help.
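
As a rough illustration of the trade-off, here is a small sketch of the timing. It assumes the advertisement interval is base + skew/256 seconds and that a BACKUP declares the MASTER down after about 3 × base + skew/256 seconds without an advertisement, which is my understanding of the FreeBSD implementation rather than something confirmed in this thread. The function names are illustrative.

```python
def advert_interval(advbase: int, advskew: int = 0) -> float:
    """Seconds between CARP advertisements sent by the MASTER."""
    return advbase + advskew / 256


def master_down_timeout(advbase: int, advskew: int = 0) -> float:
    """Assumed seconds a BACKUP waits without advertisements before taking over."""
    return 3 * advbase + advskew / 256


for advbase in (1, 2, 3):
    print(
        f"base={advbase}: advertise every {advert_interval(advbase):.2f}s, "
        f"backup waits about {master_down_timeout(advbase):.2f}s"
    )
```

Under these assumptions, going from a base of 1 to 3 roughly triples the time the backup tolerates silence, which is where both the extra stability and the slower fail-over come from.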