19.1 Legacy Series / CARP over LAGG problems
« on: May 03, 2019, 09:25:38 am »
I usually do my OPNsense upgrades by first updating the machine that is normally the backup, then disabling CARP on the master and updating it as well.
When upgrading from 19.1.2 to 19.1.6 (which requires a reboot), I found that afterwards some VHIDs went to master and some to backup (net.inet.carp.preempt=0; it should be 1, but 0 is helpful for debugging here). The VHIDs that became master are all on a LAGG interface (directly or via VLAN); the ones that remained backup are on physical interfaces. Disabling and re-enabling CARP on the master machine resolved the situation. Apparently the LAGG interface didn't receive CARP packets from the master in time during boot, so the rebooted machine assumed it needed to become master itself.
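To make the suspected timing problem concrete, here is a rough Python sketch. It assumes the commonly documented CARP master-down timeout of 3 * advbase + advskew / 256 seconds, and it uses a hypothetical parameter link_ready_after for the time the LAGG needs before it can pass traffic. It is a simplified model of my guess, not of the actual kernel code.

```python
def master_down_timeout(advbase=1, advskew=0):
    """Seconds a backup waits without hearing an advertisement.

    Assumption: the usual formula 3 * advbase + advskew / 256.
    """
    return 3 * advbase + advskew / 256


def state_after_boot(link_ready_after, advbase=1, advskew=0):
    """Hypothetical model of the boot-time race.

    link_ready_after: seconds after CARP starts its timer until the
    interface can actually receive advertisements (for a LAGG this
    would include link aggregation coming up).
    Simplification: an advertisement is assumed to arrive as soon as
    the link is ready.
    """
    if link_ready_after >= master_down_timeout(advbase, advskew):
        return "MASTER"  # timer expired before any advertisement was seen
    return "BACKUP"      # advertisement from the current master arrived in time


# A physical interface that is ready almost immediately stays backup,
# while a slow LAGG misses the window and promotes itself.
print(state_after_boot(0.5))
print(state_after_boot(5.0))
```

With preempt=0 a VHID that heard the master in time stays on backup, which would match the physical interfaces, while the VHIDs on the LAGG would already have promoted themselves.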
After my HA setup had settled and was working normally, I started upgrading the switches one by one. With one switch down, the LAGG interface still works, since only one of the two physical interfaces loses its link, but CARP seems to increase demotion based on the physical interface, not the resulting LAGG interface. To keep CARP from failing over unnecessarily (which would affect e.g. OpenVPN connections), CARP on the backup needs to be disabled temporarily.
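The difference between the behaviour I seem to observe and the behaviour I would expect can be sketched like this. The interface names igb0 and igb1 are made up, and the value 240 is my assumption for the default of net.inet.carp.ifdown_demotion_factor; both functions are illustrative models, not the real implementation.

```python
IFDOWN_DEMOTION = 240  # assumed default of net.inet.carp.ifdown_demotion_factor


def lagg_is_up(members):
    """A LAGG is usable as long as at least one member has link."""
    return any(members.values())


def demotion_per_member(members):
    """What I seem to observe: every member that is down adds demotion."""
    down = sum(1 for up in members.values() if not up)
    return IFDOWN_DEMOTION * down


def demotion_per_lagg(members):
    """What I would expect: demote only when the whole LAGG is down."""
    return 0 if lagg_is_up(members) else IFDOWN_DEMOTION


# One switch down: igb1 (hypothetical name) has lost its link.
members = {"igb0": True, "igb1": False}
print(lagg_is_up(members))
print(demotion_per_member(members))
print(demotion_per_lagg(members))
```

In the per-member model a single switch reboot already raises the demotion counter even though the LAGG is still passing traffic, which is the situation that forces me to disable CARP on the backup during switch maintenance.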
So there seem to be two issues here: CARP expecting traffic before LAGG is ready, and CARP demotion reacting to LAGG slave interfaces instead of the LAGG interface itself.