CARP Bug in 17.1 resulting in split brains or backup always "master" ???

Started by Wayne Train, June 27, 2017, 09:59:46 AM

Hi,
I use Ubiquiti EdgeMax switches and I have to enable flow control for LACP to work reliably. I bonded two switch ports to carry all VLANs on them. And yes, I also experienced split brains without a LAGG. As soon as there is more than one VLAN on a cable, the box starts behaving very confusingly when I disconnect an Ethernet cable on the LAN side. In most cases disconnecting the WAN uplink leads to a "clean" failover. Disconnecting the LAN side (LAGG or not, but with multiple VLANs) leads to a split brain.
Since I read that CARP was touched between 17.1.1 and 17.1.8 and I didn't experience these issues in 16.7, I think that something with LACP in the CARP context may be messed up in FreeBSD.
Best regards,
Wayne

Quote from: Wayne Train on June 30, 2017, 08:32:27 AM
Hi,

no, it didn't. And furthermore there seems to be another bug: after trying with the LAGG, I wanted to delete it, and the whole system crashed. I had this on both nodes before I did a clean reinstall. OPNsense detected a bug and I filed it with a short description. It was related to some errors and uncaught exceptions in the lagg_edit.php file, but I'm not a programmer...

I'm really hoping that the next minor release is coming soon, since 17.1.8 isn't really what I expected from OPNsense. 16.x was really fine, I had no issues. Until 17.1.4 everything worked fine and then it started getting really weird...

Thank you.

Seems I found a similar one ...
https://github.com/opnsense/core/issues/1715


So, I created a LAGG with 2 interfaces and, on this LAGG, 2 VLANs with CARP.
I tested every scenario, no splitbrains, but now MASTER state is always on machine 2.

I believe this has something to do with the LACP balancing because packets for VLAN88 are sent over igb1 and packets for VLAN99 are sent over igb2. Must be something like this.

After a reboot of both machines MASTER state is on machine1 again.

Did you enable fast timeouts on LAGG? This didn't work with my setup, so please don't.

Oh, OK, now I unplugged WAN, and Machine2 is MASTER for WAN and STANDBY for LAGG. Only disabling and re-enabling CARP fixes this. Hm, also when I plug one cable of LAGG, one machine is MASTER for WAN and the other MASTER for LAN (LAGG). This wasn't the case in my first test.


Ahhh, I reread your initial post.

This is not called splitbrain! Splitbrain is when both machines are in master state and you have a flapping of MACs on the switch.

What you have is a mix of master/standby on same machine.
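
To make the distinction concrete, here is a small sketch that classifies a two-node pair from its per-VIP CARP states. It is only an illustration: the function name `classify` and the dictionary layout are made up for this example and are not part of OPNsense or FreeBSD.

```python
from typing import Dict

def classify(node_a: Dict[str, str], node_b: Dict[str, str]) -> str:
    """Classify a two-node CARP pair from per-VIP states ("MASTER"/"BACKUP").

    Hypothetical helper for illustration only; the keys are VIP/interface names.
    """
    vips = set(node_a) | set(node_b)
    a_masters = {v for v in vips if node_a.get(v) == "MASTER"}
    b_masters = {v for v in vips if node_b.get(v) == "MASTER"}

    # Split brain: at least one VIP is MASTER on both nodes at the same time.
    if a_masters & b_masters:
        return "split-brain"
    # Some VIP is not MASTER anywhere.
    if (a_masters | b_masters) != vips:
        return "no-master"
    # Every VIP has exactly one MASTER, but they are spread over both nodes.
    if a_masters and b_masters:
        return "mixed"
    # One node holds MASTER for every VIP.
    return "clean"

# The situation described in this thread: MASTER for WAN on one box,
# MASTER for LAGG on the other.
print(classify({"WAN": "MASTER", "LAGG": "BACKUP"},
               {"WAN": "BACKUP", "LAGG": "MASTER"}))
```

With the states reported here (one machine MASTER for WAN, the other MASTER for LAGG) the sketch returns "mixed", not "split-brain".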

I did a reboot now, and also after the reboot M1 was MASTER on WAN and BACKUP for LAGG, the other machine vice versa. Then I pulled out the power and plugged it in again, and now M2 is MASTER for all. Strange ...

I'm cross-correlating these reports with trouble I've had with pfSense 2.3.4. Its menus allow setting up Master and Standby separately and differently for different interfaces. So I did, which worked fine in pfSense 2.3.3. But with 2.3.4 it broke and became undependable. This may be unrelated to the reports here ... or not.

Someone on the pfSense forums claimed such a setup isn't "supported," which seems a strange thing to say when (1) the menus allow it, (2) the docs don't say anything against it, and (3) it used to work. But then the OPNsense docs on CARP speak of the whole system failing over, rather than having failover work independently per interface. Having read that, and since I need to replace pfSense in short order, I asked elsewhere in these forums how CARP is designed to work on OPNsense. There's been an answer, but it was unclear.

The Decisio brochure says: "Two or more firewalls can be configured as a failover group. If one interface fails on the primary or the primary goes offline entirely, the secondary becomes active." This can be read to imply that the disconnection of a single interface on the primary server is supposed to result in the entire operation being taken over by the secondary server. Is that in theory what the back-end logic is supposed to do? Or in theory should only the VIPs of the single interface that is down be taken over by the secondary server?

As written in another thread, don't tick "Disable preempt" on either firewall, and set a tunable of net.inet.carp.senderr_demotion_factor=0 on both firewalls. Reboot and you're good.
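
For anyone wondering why that tunable matters, here is a rough model of the demotion mechanism. It is a simplified sketch, not the kernel code: it assumes (from my reading of FreeBSD's CARP) that the skew a node advertises is its configured advskew plus a global demotion counter, capped at 240, and that the counter is raised by senderr_demotion_factor after 3 consecutive failed advertisement sends. The constant and function names below are made up for the example.

```python
CARP_MAXSKEW = 240      # assumed cap on the advertised skew
SENDAD_MAX_ERRORS = 3   # assumed number of consecutive send errors before demotion

def demotion_from_send_errors(consecutive_errors: int, senderr_factor: int) -> int:
    """Demotion added once sending advertisements has failed often enough."""
    return senderr_factor if consecutive_errors >= SENDAD_MAX_ERRORS else 0

def advertised_skew(advskew: int, demotion: int) -> int:
    """Skew a node advertises: configured advskew plus demotion, capped."""
    return min(advskew + demotion, CARP_MAXSKEW)

# Primary configured with advskew 0, secondary with advskew 100 (example values).
secondary = advertised_skew(100, 0)

# Assumed default factor of 240: send errors make the primary look worse
# than the secondary, so with preempt enabled the secondary takes over.
primary_default = advertised_skew(0, demotion_from_send_errors(3, 240))
print(primary_default > secondary)

# Factor set to 0 as suggested above: send errors no longer demote the primary.
primary_tuned = advertised_skew(0, demotion_from_send_errors(3, 0))
print(primary_tuned > secondary)
```

The idea is that with the factor at 0, transient send errors on one interface (for example on a LAGG member) no longer push the whole box into a demoted state.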

No,
I'm not trunking VLAN 1 and I also ran into the issue with VLANs only, without LAGG.
Best regards,
Wayne