HA cluster goes haywire after 24.1.1 [solved]

Started by Hunduster, February 11, 2024, 01:21:39 PM

February 11, 2024, 01:21:39 PM Last Edit: February 11, 2024, 02:37:51 PM by Hunduster
Hello everyone,

Since updating to 24.1.1, I have had massive problems with my HA cluster. As soon as I put the second node into operation, my entire network goes down.

It doesn't matter which of the two nodes is the active one. As soon as both instances are online, I get constant outages across the entire network. According to my Unifi switch, the ports are probably all being blocked by STP. However, I was only able to see this briefly once, because I can no longer reach my Unifi controller as soon as both nodes are active.

I have already read a lot about problems with the new 24.1, but nothing about this issue yet. At the moment I don't quite know how to debug it to find out what the problem is.
So long....

The Hunduster

February 11, 2024, 02:37:36 PM #1 Last Edit: February 11, 2024, 09:04:44 PM by Hunduster
OK...bizarre

When both nodes are active, the CPU load goes up alternately on both machines. While I run continuous pings on the respective node IP and the VIP, I repeatedly have dropouts on the three IP addresses.
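
In case anyone wants to evaluate this kind of ping test more systematically, here is a minimal Python sketch that condenses ping results into dropout windows. It assumes the results have already been collected as (timestamp, ok) pairs, one list per target (node 1, node 2, VIP). The function name `dropout_windows` and the sample data are made up for illustration and are not part of OPNsense.

```python
def dropout_windows(samples):
    """Group consecutive failed pings into (first_fail, last_fail) windows.

    samples: list of (timestamp, ok) tuples, sorted by timestamp.
    """
    windows = []
    start = None      # timestamp of the first failure in the current run
    last_fail = None  # timestamp of the most recent failure in the run
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts
            last_fail = ts
        elif start is not None:
            # A successful ping closes the current failure run.
            windows.append((start, last_fail))
            start = None
    if start is not None:
        # The log ended while the target was still unreachable.
        windows.append((start, last_fail))
    return windows


# Hypothetical sample: one ping per second against the VIP.
vip_samples = [(0, True), (1, False), (2, False), (3, True), (4, False)]
print(dropout_windows(vip_samples))
```

Comparing the windows of the three addresses shows whether the dropouts hit all of them at the same time, which points to the network rather than to a single node.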

If I restart a node, it doesn't matter which one, the CPU load on the remaining node goes back down to the normal range. HA generally seems to work, as the backup becomes the master when I restart Node1. As soon as the restarted Node1 is online again, Node1 becomes the master again and everything seems to run normally for about 3 minutes. Then the game starts all over again: CPU load increases on both nodes and the network dropouts start again.

Now that I have run Wireshark, I think I have found the problem:

Obviously there was too much broadcast traffic between the subnets. In my case, the "Enable CARP Failover" checkbox in the mDNS repeater was not ticked. Now that I have activated it, the problem is gone.
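
For anyone who wants to check a capture for the same symptom: a minimal Python sketch that counts mDNS packets per second. mDNS over IPv4 goes to the multicast group 224.0.0.251 on UDP port 5353. Everything else here is an assumption for illustration: the record format (timestamp, destination IP, destination port), the function names, and the threshold of 100 packets per second, which is an arbitrary value and not a documented limit.

```python
from collections import Counter

MDNS_GROUP = "224.0.0.251"  # IPv4 mDNS multicast group
MDNS_PORT = 5353            # UDP port used by mDNS


def mdns_rate_per_second(packets):
    """Count mDNS packets per whole second.

    packets: iterable of (timestamp_seconds, dst_ip, dst_port) tuples,
    for example exported from a capture (the format is an assumption).
    """
    counts = Counter()
    for ts, dst_ip, dst_port in packets:
        if dst_ip == MDNS_GROUP and dst_port == MDNS_PORT:
            counts[int(ts)] += 1
    return dict(counts)


def storm_seconds(packets, threshold=100):
    """Return the seconds in which the mDNS packet count exceeds threshold."""
    rates = mdns_rate_per_second(packets)
    return sorted(sec for sec, n in rates.items() if n > threshold)


# Hypothetical records; a low threshold is used only to keep the demo short.
sample = [
    (0.1, MDNS_GROUP, 5353),
    (0.9, MDNS_GROUP, 5353),
    (1.5, MDNS_GROUP, 5353),
    (1.6, "10.0.0.5", 443),
]
print(storm_seconds(sample, threshold=1))
```

A rate that stays high for minutes without any client activity would fit the picture of repeated packets circulating between the two nodes.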

The STP suspicion also turned out to be wrong. The switch only ever blocked the ports of the booting node briefly and then released them again.

Quote from: Hunduster on February 11, 2024, 02:37:36 PM
Obviously there was too much broadcast traffic between the subnets. In my case, the "Enable CARP Failover" checkbox in the mDNS repeater was not ticked. Now that I have activated it, the problem is gone.
You probably created a nice broadcast storm with two active repeaters. Also bound to happen when you create a loop in the topology and one bridge/switch does not speak STP. Rarely happens with switches nowadays but the default for the FreeBSD/OPNsense bridge is "STP off". Duh!  :)
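
To illustrate why two active repeaters loop, here is a toy model in Python. It is a deliberately simplified sketch, not the actual mDNS repeater code: it assumes every active repeater re-sends each packet it sees unless it sent that packet itself, and it tracks a single original packet. The function name `simulate` is made up for this example.

```python
def simulate(repeaters, rounds):
    """Return the number of repeated packets on the wire in each round.

    Toy model: one original packet, and every active repeater re-sends
    any packet that it did not send itself.
    """
    in_flight = {None: 1}  # sender id -> packet count; None is the original host
    history = []
    for _ in range(rounds):
        next_flight = {}
        for sender, count in in_flight.items():
            for r in range(repeaters):
                if r != sender:  # a repeater ignores its own packets
                    next_flight[r] = next_flight.get(r, 0) + count
        in_flight = next_flight
        history.append(sum(next_flight.values()))
    return history


print(simulate(1, 4))  # one active repeater: [1, 0, 0, 0]
print(simulate(2, 4))  # two active repeaters: [2, 2, 2, 2]
```

With one active repeater the packet is repeated once and then dies out. With two, each one keeps repeating what the other just sent, so every single mDNS packet circulates indefinitely, and the load adds up with every new packet from a client. In this model, "Enable CARP Failover" corresponds to keeping only one repeater active at a time.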
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

Fun Fact:

I am actually very sure that I never set the check mark here ;-)

So from my point of view, something must have changed. But the whole thing makes sense, and that's why I'm not angry. The Wife Acceptance Factor has just shifted significantly today  ;D
So long....

The Hunduster