OPNsense Forum

Archive => 17.1 Legacy Series => Topic started by: Wayne Train on June 27, 2017, 09:59:46 am

Title: CARP Bug in 17.1 resulting in split brains or backup always "master" ???
Post by: Wayne Train on June 27, 2017, 09:59:46 am
Hi,

I'm experiencing a very strange issue that results in various split brains.
Most of the time, only WAN is switched over to the backup node.
When I try to resolve the split brain, I manually set the BACKUP-node to CARP MAINTENANCE MODE
and the MASTER holds all interfaces again. The strange thing is that when I leave Maintenance Mode
on BACKUP, the BACKUP-node takes over the MASTER-role again.
Furthermore, after rebooting or after a failover, the BACKUP-node remains
in the master-role, while the original MASTER is demoted to the backup-role.

I'm running a LACP-LAGG consisting of igb0 and igb1 that carries a couple of VLANs.
My switch is also configured to use LACP for the trunk.

Each VLAN is configured like this:

MASTER-Node   Virtual-IP     VHID, advertising freq. (base / skew)
10.x.x.10     10.x.x.1/24    vhid 12, freq. 1 / 0
10.x.y.10     10.x.y.1/24    vhid 24, freq. 1 / 0

BACKUP-Node   Virtual-IP     VHID, advertising freq. (base / skew)
10.x.x.20     10.x.x.1/24    vhid 12, freq. 1 / 100
10.x.y.20     10.x.y.1/24    vhid 14, freq. 1 / 100


When I'm capturing carp-packets I see the following on the LAN-Side:

Capture output of the MASTER-Node:
09:09:53.869797 IP 10.x.x.20 > 224.0.0.18: VRRPv2, Advertisement, vrid 14, prio 100, authtype none, intvl 1s, length 36
09:09:55.282945 IP 10.x.x.20 > 224.0.0.18: VRRPv2, Advertisement, vrid 14, prio 100, authtype none, intvl 1s, length 36
09:09:56.696995 IP 10.x.x.20 > 224.0.0.18: VRRPv2, Advertisement, vrid 14, prio 100, authtype none, intvl 1s, length 36

Capture output of the BACKUP-Node:
09:08:30.688149 IP 10.x.x.20 > 224.0.0.18: VRRPv2, Advertisement, vrid 14, prio 100, authtype none, intvl 1s, length 36
09:08:32.116865 IP 10.x.x.20 > 224.0.0.18: VRRPv2, Advertisement, vrid 14, prio 100, authtype none, intvl 1s, length 36
09:08:33.508241 IP 10.x.x.20 > 224.0.0.18: VRRPv2, Advertisement, vrid 14, prio 100, authtype none, intvl 1s, length 36


On the WAN-Side it looks like this:

Capture output of the MASTER-Node:
09:11:38.102897 IP WAN_BACKUP_NODE_IP > 224.0.0.18: VRRPv2, Advertisement, vrid 12, prio 100, authtype none, intvl 1s, length 36
09:11:39.504055 IP WAN_BACKUP_NODE_IP > 224.0.0.18: VRRPv2, Advertisement, vrid 12, prio 100, authtype none, intvl 1s, length 36
09:11:40.929161 IP WAN_BACKUP_NODE_IP > 224.0.0.18: VRRPv2, Advertisement, vrid 12, prio 100, authtype none, intvl 1s, length 36

Capture output of the BACKUP-Node:
09:13:43.619491 IP WAN_BACKUP_NODE_IP > 224.0.0.18: VRRPv2, Advertisement, vrid 12, prio 100, authtype none, intvl 1s, length 36
09:13:45.039772 IP WAN_BACKUP_NODE_IP > 224.0.0.18: VRRPv2, Advertisement, vrid 12, prio 100, authtype none, intvl 1s, length 36
09:13:46.431278 IP WAN_BACKUP_NODE_IP > 224.0.0.18: VRRPv2, Advertisement, vrid 12, prio 100, authtype none, intvl 1s, length 36
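
The gaps between the advertisements in these captures can be checked against the configured frequency. The following is only a minimal sketch: it assumes the usual CARP rule that a node advertises every advbase + advskew/256 seconds, and that the "prio" field shown by tcpdump (which decodes CARP as VRRPv2) carries the advskew. The function name advertisement_gaps is made up for illustration.

```python
import re
from datetime import datetime

# Three lines copied from the LAN-side capture on the MASTER-Node.
CAPTURE = """\
09:09:53.869797 IP 10.x.x.20 > 224.0.0.18: VRRPv2, Advertisement, vrid 14, prio 100, authtype none, intvl 1s, length 36
09:09:55.282945 IP 10.x.x.20 > 224.0.0.18: VRRPv2, Advertisement, vrid 14, prio 100, authtype none, intvl 1s, length 36
09:09:56.696995 IP 10.x.x.20 > 224.0.0.18: VRRPv2, Advertisement, vrid 14, prio 100, authtype none, intvl 1s, length 36
"""

LINE_RE = re.compile(
    r"^(?P<ts>\d\d:\d\d:\d\d\.\d+) IP (?P<src>\S+) > 224\.0\.0\.18: VRRPv2, "
    r"Advertisement, vrid (?P<vhid>\d+), prio (?P<skew>\d+), .*intvl (?P<base>\d+)s"
)

def advertisement_gaps(capture):
    """Return (source, vhid, expected_interval, observed_gaps) for a capture."""
    rows = [m for m in map(LINE_RE.match, capture.splitlines()) if m]
    times = [datetime.strptime(m["ts"], "%H:%M:%S.%f") for m in rows]
    # Seconds between each pair of consecutive advertisements.
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    first = rows[0]
    # Assumption: interval = advbase + advskew / 256, with "prio" read as advskew.
    expected = int(first["base"]) + int(first["skew"]) / 256
    return first["src"], int(first["vhid"]), expected, gaps

src, vhid, expected, gaps = advertisement_gaps(CAPTURE)
print(f"{src} vhid {vhid}: expected {expected:.2f}s, observed {[round(g, 2) for g in gaps]}")
```

Under these assumptions the observed gaps of roughly 1.4 seconds fit a base of 1 and a skew of 100, which are the BACKUP-Node's settings. In all four captures the only sender is the BACKUP-Node's address.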


Every Interface & VLAN has a rule to allow any traffic between the CARP-Nodes:

Action   Proto    Source              Port   Destination         Port   Gateway
Pass     IPv4 *   CARP_NODES_VLAN_X   *      CARP_NODES_VLAN_X   *      *


My "High Availability Settings" are configured like this:

MASTER (172.x.y.y = Sync-Interface-IP)
Synchronize States      YES
Synchronize Interface      SYNC-Interface
Synchronize Peer IP      172.x.y.z
Synchronize Config to IP   172.x.y.z
Remote System Username      user_name
Remote System Password      password
Users and Groups      YES
...            YES
DNS Resolver         YES


BACKUP   (172.x.y.z = Sync-Interface-IP)
Synchronize States      YES
Synchronize Interface      SYNC-Interface
Synchronize Peer IP      172.x.y.y

I left all other settings unchecked, since the help says that one should only sync
from the MASTER to the BACKUP node and not bi-directionally. So I assume this is right.
Or am I wrong?


In my logs I can only find the following entries:

Jun 23 19:03:21    kernel: carp: 12@lagg0_vlan40: MASTER -> BACKUP (more frequent advertisement received)
Jun 23 19:03:21    kernel: carp: 17@lagg0_vlan100: MASTER -> BACKUP (more frequent advertisement received)
Jun 23 19:03:21    kernel: carp: 19@lagg0_vlan20: MASTER -> BACKUP (more frequent advertisement received)
Jun 23 19:03:21    kernel: carp: 16@lagg0_vlan70: MASTER -> BACKUP (more frequent advertisement received)
Jun 23 19:03:21    kernel: carp: 15@lagg0_vlan60: MASTER -> BACKUP (more frequent advertisement received)
Jun 23 19:03:20    kernel: carp: 20@lagg0_vlan10: MASTER -> BACKUP (more frequent advertisement received)

To me it looks like the BACKUP-node is advertising more frequently than the original MASTER and therefore becomes the master.
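
To see which node ought to be the more frequent advertiser, the configured frequencies can be compared directly. This is a minimal sketch, assuming the usual CARP rule that the advertisement interval is advbase + advskew/256 seconds; the function names are hypothetical and the values are the "freq." settings from the VLAN table in this post.

```python
def advertisement_interval(advbase, advskew):
    """Seconds between CARP advertisements (assumed: advbase + advskew/256)."""
    return advbase + advskew / 256

def more_frequent(nodes):
    """Return the name of the node with the shortest advertisement interval."""
    return min(nodes, key=lambda name: advertisement_interval(*nodes[name]))

# (advbase, advskew) as configured on each node.
configured = {"MASTER": (1, 0), "BACKUP": (1, 100)}

for name, (base, skew) in configured.items():
    print(f"{name}: {advertisement_interval(base, skew):.3f} s")
print("advertises more frequently:", more_frequent(configured))
```

With the configured values the MASTER should be the more frequent advertiser. If the kernel nevertheless logs "more frequent advertisement received" on the MASTER, that would suggest it was effectively advertising with a larger skew than configured, or not advertising at all.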

I also checked the settings on the shell to see if there is any valuable information regarding CARP. As you can see, the MASTER
got demoted:

   net.inet.carp.ifdown_demotion_factor: 240
   net.inet.carp.senderr_demotion_factor: 240
   net.inet.carp.demotion: 3120
   net.inet.carp.log: 1
   net.inet.carp.preempt: 1
   net.inet.carp.allow: 1
   net.pfsync.carp_demotion_factor: 240

While on the BACKUP-node it looks like this:

   net.inet.carp.ifdown_demotion_factor: 240
   net.inet.carp.senderr_demotion_factor: 240
   net.inet.carp.demotion: 0
   net.inet.carp.log: 1
   net.inet.carp.preempt: 1
   net.inet.carp.allow: 1
   net.pfsync.carp_demotion_factor: 240
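
How the demotion counter might feed into the advertised skew can be sketched as follows. This is an assumption based on how FreeBSD's CARP code appears to work: the counter is added to the configured advskew and the sum is capped at 240. The constant and function names are illustrative only.

```python
CARP_MAXSKEW = 240  # assumed upper bound for the advertised skew

def parse_sysctl(text):
    """Turn 'name: value' lines of sysctl output into a dict of ints."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(": ")
        if sep:
            values[name] = int(value)
    return values

def effective_advskew(configured_skew, demotion):
    """Skew advertised once the demotion counter is added (assumption)."""
    return min(configured_skew + demotion, CARP_MAXSKEW)

# Values taken from the two sysctl listings in this post.
master = parse_sysctl("""
   net.inet.carp.demotion: 3120
   net.inet.carp.preempt: 1
""")
backup = parse_sysctl("""
   net.inet.carp.demotion: 0
   net.inet.carp.preempt: 1
""")

print("MASTER advertises skew", effective_advskew(0, master["net.inet.carp.demotion"]))
print("BACKUP advertises skew", effective_advskew(100, backup["net.inet.carp.demotion"]))
```

If that model holds, the MASTER's configured skew of 0 is pushed above the BACKUP's 100 for as long as the counter stays non-zero. Note also that 3120 is exactly thirteen times the demotion factor of 240.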


Another strange thing is that, when invoking "ifconfig", all my VLANs are in the carp group "groups: vlan",
while on my WAN interface "igb5" no carp group is defined. Could this be the reason for the split brains?
In some way this would explain why the VLANs and WAN fail over separately. In a correctly working
HA environment, I would expect the master to fail over completely to the backup if any of its interfaces
goes down...

I'm experiencing this issue on 17.1.1, 17.1.4 and 17.1.8 and I have really run out of ideas on how to resolve it.
Is it possible that this is a bug in FreeBSD CARP, or in the OPNsense release?
Is anyone experiencing similar issues?

Best regards,
Wayne
Title: Re: CARP Bug in 17.1 resulting in split brains or backup always "master" ???
Post by: mimugmail on June 27, 2017, 10:20:01 am
Have you tried this setup without LAGG to isolate the problem?

I'd first setup the whole thing without VLANs and without LAGG. If this works as expected I'd add VLANs. If this works as expected I'd add LAGG.

Then you'll see where exactly the error is.
Title: Re: CARP Bug in 17.1 resulting in split brains or backup always "master" ???
Post by: Wayne Train on June 27, 2017, 02:26:54 pm
I already did this, and I tried it again with a completely blank setting a few minutes ago.
The result is:

With only physical interfaces, one on the LAN and one on the WAN side, everything works well and I get no split brains.

With 1 VLAN (not on a LACP-LAGG, nor on any other LAGG), one physical NIC on the LAN and one on the WAN side, it results in split brains again. The backup node takes over the VLAN if I manually fail over by disconnecting the cable from the used port, but it fails over only for that interface. LAN and WAN remain on the original MASTER.

Furthermore, when I plug the cable back in, the BACKUP node doesn't release the IP back to the master.

I'm on Release 17.1.4 at the moment.

Best Regards
Wayne

Title: Re: CARP Bug in 17.1 resulting in split brains or backup always "master" ???
Post by: Wayne Train on June 27, 2017, 02:27:42 pm
At the moment it all looks like there are some strange VLAN issues that affect CARP's behaviour.
Title: Re: CARP Bug in 17.1 resulting in split brains or backup always "master" ???
Post by: Wayne Train on June 27, 2017, 04:17:49 pm
Ok,
I just upgraded to release 17.1.8, but the problem remains. Any ideas ?
Cheers
Wayne
Title: Re: CARP Bug in 17.1 resulting in split brains or backup always "master" ???
Post by: mimugmail on June 27, 2017, 04:32:14 pm
Just to isolate further, can you check LAGG without VLANs?
Title: Re: CARP Bug in 17.1 resulting in split brains or backup always "master" ???
Post by: Wayne Train on June 28, 2017, 02:17:24 pm
It's the same behaviour.
Do you yourself also have CARP enabled with VLANs on the LAN side?
Regards,
Wayne
Title: Re: CARP Bug in 17.1 resulting in split brains or backup always "master" ???
Post by: mimugmail on June 28, 2017, 03:08:52 pm
No, but I'll invest some time here to reproduce it once the error is clear.

So LAGG without VLANs works fine? No splitbrains?
Title: Re: CARP Bug in 17.1 resulting in split brains or backup always "master" ???
Post by: Wayne Train on June 30, 2017, 08:32:27 am
Hi,

no it didn't. And furthermore there seems to be another bug: after trying with the LAGG, I wanted to delete it, and the whole system crashed. I had this before on both nodes before I did a clean reinstall. OPNsense detected a bug and I filed it with a short description. It was related to some errors and uncaught exceptions in the lagg_edit.php file, but I'm not a programmer...

I'm really hoping that the next minor release is coming soon, since 17.1.8 isn't really what I expected from OPNsense. 16.x was really fine, I had no issues. Until 17.1.4 everything worked fine and then it started getting really weird...

Thank you.
Title: Re: CARP Bug in 17.1 resulting in split brains or backup always "master" ???
Post by: mimugmail on June 30, 2017, 09:50:17 am
Today I'm working from home, I'll try to reproduce this on Monday with some test machines.

So
- CARP with single interfaces works
- CARP with single interfaces as VLANs results in split-brain
- CARP with LAGG without VLANs results in split-brain
- CARP with LAGG with VLANs results in split-brain

Title: Re: CARP Bug in 17.1 resulting in split brains or backup always "master" ???
Post by: pingutux on July 07, 2017, 01:57:15 pm
Hello,

I can confirm this.
-> CARP with single interfaces as VLANs results in split-brain

Started with 17.1.8.

br
Title: Re: CARP Bug in 17.1 resulting in split brains or backup always "master" ???
Post by: mimugmail on July 11, 2017, 10:23:04 am
WAN VLAN (igb0)
LAN ETH (igb1)

CARP on VLAN

Works, no splitbrains.

I'll try VLAN only with just one physical IF in the next test

EDIT: There was a short mac flap of course:

*Mar  1 01:58:12.181: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0101 in vlan 60 is flapping between port Gi2/0/13 and port Gi2/0/14
*Mar  1 01:58:23.431: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0101 in vlan 60 is flapping between port Gi2/0/13 and port Gi2/0/14

Title: Re: CARP Bug in 17.1 resulting in split brains or backup always "master" ???
Post by: mimugmail on July 11, 2017, 10:56:46 am
VLAN60 / WAN / igb0 / CARP IP 192.168.10.1
VLAN99 / WAN / igb0 / CARP IP 192.168.1.1

Pulled the cable of igb0 on unit 1, unit 2 took over smoothly. Plugged it in again, I had 2 MAC flaps and a loss of 5 pings.

No splitbrain.

17.1.8



How is your switch configured?
Title: Re: CARP Bug in 17.1 resulting in split brains or backup always "master" ???
Post by: Wayne Train on July 11, 2017, 03:05:44 pm
Hi,

my trunking LAGG to the switch is configured as LACP. Both on the Firewall- and on the Switch-Side.
Flowcontrol is enabled. Otherwise LACP won't work like intended. But btw: I haven't experienced these issues in 16.7.
Therefore I expect, that it's related to a bug in 17.1.x.

This is my setup:

Switch                                                        Firewall

 47                                                               igb0
(   )======(VLANs 10-100)========(      )==(OPNSENSE)=====(WAN)
 48                                                               igb1

              V-IPs for VLAN 10-100                                                                V-IP for WAN

I wonder if it's an issue that only occurs if you have multiple VLANs on one LAGG.
Have you also tried this ?
I experienced the issue on multiple systems. All of them 17.1.x.

Best regards
Wayne
Title: Re: CARP Bug in 17.1 resulting in split brains or backup always "master" ???
Post by: mimugmail on July 11, 2017, 04:41:32 pm
Hi,

I only tested VLAN, not LAGG, I can do this tomorrow.

Don't know why flow control should influence LACP. This would mean that you can't run this setup without a switch supporting flow control?

You said you also experienced splitbrains with just VLANs and not LAGG?

Are you trunking Vlan1 (like in Catalyst)?
Title: Re: CARP Bug in 17.1 resulting in split brains or backup always "master" ???
Post by: Wayne Train on July 12, 2017, 12:23:34 pm
Hi,
I use Ubiquiti EdgeMAX switches and I have to enable flow control to get reliable LACP. I bonded two switch ports to carry all VLANs on them. And yes, I also experienced split brains without a LAGG. As soon as there is more than one VLAN on a cable, the box starts behaving very confusingly when I disconnect an Ethernet cable on the LAN side. In most cases, disconnecting the WAN uplink leads to a "clean" failover. Disconnecting the LAN side (LAGG or not, but with multiple VLANs) leads to a split brain.
Since I read that CARP was touched between 17.1.1 and 17.1.8 and I didn't experience these issues in 16.7, I think that something with LACP in the CARP context may be messed up in FreeBSD.
Best regards,
Wayne
Title: Re: CARP Bug in 17.1 resulting in split brains or backup always "master" ???
Post by: mimugmail on July 12, 2017, 04:17:39 pm
Quote from: Wayne Train on June 30, 2017, 08:32:27 am
"Hi,

no it didn't. And furthermore there seems to be another bug: after trying with the LAGG, I wanted to delete it, and the whole system crashed. I had this before on both nodes before I did a clean reinstall. OPNsense detected a bug and I filed it with a short description. It was related to some errors and uncaught exceptions in the lagg_edit.php file, but I'm not a programmer...

I'm really hoping that the next minor release is coming soon, since 17.1.8 isn't really what I expected from OPNsense. 16.x was really fine, I had no issues. Until 17.1.4 everything worked fine and then it started getting really weird...

Thank you."

Seems I found a similar one ...
https://github.com/opnsense/core/issues/1715
Title: Re: CARP Bug in 17.1 resulting in split brains or backup always "master" ???
Post by: franco on July 12, 2017, 04:52:41 pm
It seems to be a known issue on FreeBSD 11.0. We will try to get the fix into the 17.7 release.

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=218886
Title: Re: CARP Bug in 17.1 resulting in split brains or backup always "master" ???
Post by: mimugmail on July 14, 2017, 10:06:58 am
So, I created a LAGG with 2 IF's and on this LAGG 2 VLANs with CARP.
I tested every scenario, no splitbrains, but now MASTER state is always on machine 2.

I believe this has something to do with the LACP balancing because packets for VLAN88 are sent over igb1 and packets for VLAN99 are sent over igb2. Must be something like this.

After a reboot of both machines MASTER state is on machine1 again.

Did you enable fast timeouts on LAGG? This didn't work with my setup, so please don't.

Oh, OK, now I unplugged WAN, and then Machine2 is MASTER for WAN and STANDBY for LAGG. Only disabling and re-enabling CARP fixes this. Hm, also when I plug one cable of the LAGG, one is MASTER for WAN and the other MASTER for LAN (LAGG). This wasn't the case in my first test.

Title: Re: CARP Bug in 17.1 resulting in split brains or backup always "master" ???
Post by: mimugmail on July 14, 2017, 10:24:47 am
Ahhh, I reread your initial post.

This is not called splitbrain! Splitbrain is when both machines are in master state and you have a flapping of MACs on the switch.

What you have is a mix of master/standby on same machine.
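
The distinction can be made concrete with a small sketch. It assumes the per-interface CARP state of each machine is available as a simple mapping and that both machines list the same interfaces; the function and the labels it returns are illustrative, not OPNsense terminology.

```python
def classify(node1, node2):
    """Classify a two-node CARP pair from per-interface states.

    node1 and node2 map an interface name to "MASTER" or "BACKUP".
    """
    # Split brain: both machines claim MASTER for the same interface.
    if any(node1[i] == "MASTER" and node2[i] == "MASTER" for i in node1):
        return "split brain"
    # Mixed roles: a machine is MASTER on some interfaces and BACKUP on others.
    if len(set(node1.values())) > 1 or len(set(node2.values())) > 1:
        return "mixed roles"
    return "consistent"

print(classify({"WAN": "MASTER", "LAGG": "MASTER"}, {"WAN": "MASTER", "LAGG": "BACKUP"}))
print(classify({"WAN": "MASTER", "LAGG": "BACKUP"}, {"WAN": "BACKUP", "LAGG": "MASTER"}))
print(classify({"WAN": "MASTER", "LAGG": "MASTER"}, {"WAN": "BACKUP", "LAGG": "BACKUP"}))
```

The second call models the situation described in this thread: no interface has two masters, but the roles are spread across both machines.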

I did a reboot now, and also after the reboot M1 was MASTER on WAN and BACKUP for LAGG, the other machine vice versa. Then I pulled the power and plugged it in again, now M2 is MASTER for all. Strange ...
Title: Re: CARP Bug in 17.1 resulting in split brains or backup always "master" ???
Post by: whitwye on July 14, 2017, 06:21:12 pm
Cross correlating these reports with trouble I've had with pfSense 2.3.4. The menus for that allow setting up Master and Standby separately and differently for different interfaces. So I did, which worked fine in pfSense 2.3.3. But with 2.3.4 it broke, became undependable. This may be unrelated to the reports here ... or not.

Someone on the pfSense forums claimed such a setup isn't "supported," which seems a strange thing to say when (1) the menus allow it, (2) the docs don't say anything against it, and (3) it used to work. But then the OPNsense docs on CARP speak of the whole system failing over, rather than having failover work independently per interface. Having read that, and since I need to replace pfSense in short order, I asked elsewhere in these forums how CARP is designed to work on OPNsense. There's been an answer, but it was unclear.

The Decisio brochure says: "Two or more firewalls can be configured as a failover group. If one interface fails on the primary or the primary goes offline entirely, the secondary becomes active." This can be read to imply that the disconnection of a single interface on the primary server is supposed to result in the entire operation being taken over by the secondary server. Is that in theory what the back-end logic is supposed to do? Or in theory should only the VIPs of the single interface that is down be taken over by the secondary server?
Title: Re: CARP Bug in 17.1 resulting in split brains or backup always "master" ???
Post by: mimugmail on July 19, 2017, 11:03:04 am
As written in another thread: don't tick "Disable preempt" on both firewalls, and set the tunable net.inet.carp.senderr_demotion_factor=0 on both firewalls. Reboot and you're good.
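
Whether a node already matches this advice can be checked from its sysctl output. The following is a minimal sketch: the helper name is made up, and treating net.inet.carp.preempt=1 as the equivalent of an unticked "Disable preempt" is an assumption. The sample values are the ones posted at the start of this thread.

```python
RECOMMENDED = {
    "net.inet.carp.senderr_demotion_factor": 0,
    "net.inet.carp.preempt": 1,  # assumed to mirror an unticked "Disable preempt"
}

def check_tunables(sysctl_output):
    """Return {name: (current, wanted)} for every value that differs."""
    current = {}
    for line in sysctl_output.splitlines():
        name, sep, value = line.strip().partition(": ")
        if sep:
            current[name] = int(value)
    return {
        name: (current.get(name), wanted)
        for name, wanted in RECOMMENDED.items()
        if current.get(name) != wanted
    }

sample = """
net.inet.carp.senderr_demotion_factor: 240
net.inet.carp.preempt: 1
"""
print(check_tunables(sample))
```

An empty result would mean both values already match the suggestion.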
Title: Re: CARP Bug in 17.1 resulting in split brains or backup always "master" ???
Post by: Wayne Train on July 19, 2017, 03:43:10 pm
No,
I'm not trunking VLAN 1 and I also ran into the issue with VLANs only, without LAGG.
Best regards,
Wayne