OPNsense Forum
Archive => 17.1 Legacy Series => Topic started by: Wayne Train on June 27, 2017, 09:59:46 am
-
Hi,
I'm experiencing a very strange issue that results in various split-brains.
Most of the time, only WAN is switched over to the backup node.
When I try to resolve the split-brain, I manually set the BACKUP node to CARP MAINTENANCE MODE
and the MASTER holds all interfaces again. The strange thing is that when I leave maintenance mode
on BACKUP, the BACKUP node takes over the MASTER role again.
Furthermore, after a reboot or a failover, the BACKUP node remains
in the master role, while the original MASTER is demoted to the backup role.
I'm running an LACP LAGG consisting of igb0 and igb1 that carries a couple of VLANs.
My switch is also configured to use LACP for the trunk.
Each VLAN is configured like this:
MASTER-Node    Virtual-IP
10.x.x.10      10.x.x.1/24    vhid 12, freq. 1 / 0
10.x.y.10      10.x.y.1/24    vhid 24, freq. 1 / 0
BACKUP-Node    Virtual-IP
10.x.x.20      10.x.x.1/24    vhid 12, freq. 1 / 100
10.x.y.20      10.x.y.1/24    vhid 14, freq. 1 / 100
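Both nodes have to use the same VHID for a shared virtual IP, so a table like the one above is worth checking mechanically. The following is only a minimal sketch: the `VIPS` list is a hand transcription of the values as posted, and the function name is made up for illustration. As posted, the second VIP shows vhid 24 on the MASTER and vhid 14 on the BACKUP; whether that is a typo in the post or the real configuration is not known.

```python
# Hand transcription of the VIP table as posted (illustrative only):
# (node, node_ip, virtual_ip, vhid, advbase, advskew)
VIPS = [
    ("MASTER", "10.x.x.10", "10.x.x.1/24", 12, 1, 0),
    ("MASTER", "10.x.y.10", "10.x.y.1/24", 24, 1, 0),
    ("BACKUP", "10.x.x.20", "10.x.x.1/24", 12, 1, 100),
    ("BACKUP", "10.x.y.20", "10.x.y.1/24", 14, 1, 100),
]


def vhid_mismatches(vips):
    """Return {virtual_ip: {node: vhid}} for virtual IPs whose VHID differs between nodes."""
    by_vip = {}
    for node, _node_ip, virtual_ip, vhid, _advbase, _advskew in vips:
        by_vip.setdefault(virtual_ip, {})[node] = vhid
    # Keep only virtual IPs where the nodes disagree on the VHID.
    return {
        vip: nodes
        for vip, nodes in by_vip.items()
        if len(set(nodes.values())) > 1
    }


if __name__ == "__main__":
    for vip, nodes in vhid_mismatches(VIPS).items():
        print(f"{vip}: VHID differs between nodes: {nodes}")
```

An empty result would mean every shared virtual IP uses one VHID on both nodes.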
When I capture CARP packets, I see the following on the LAN side:
Capture output of the MASTER-Node:
09:09:53.869797 IP 10.x.x.20 > 224.0.0.18: VRRPv2, Advertisement, vrid 14, prio 100, authtype none, intvl 1s, length 36
09:09:55.282945 IP 10.x.x.20 > 224.0.0.18: VRRPv2, Advertisement, vrid 14, prio 100, authtype none, intvl 1s, length 36
09:09:56.696995 IP 10.x.x.20 > 224.0.0.18: VRRPv2, Advertisement, vrid 14, prio 100, authtype none, intvl 1s, length 36
Capture output of the BACKUP-Node:
09:08:30.688149 IP 10.x.x.20 > 224.0.0.18: VRRPv2, Advertisement, vrid 14, prio 100, authtype none, intvl 1s, length 36
09:08:32.116865 IP 10.x.x.20 > 224.0.0.18: VRRPv2, Advertisement, vrid 14, prio 100, authtype none, intvl 1s, length 36
09:08:33.508241 IP 10.x.x.20 > 224.0.0.18: VRRPv2, Advertisement, vrid 14, prio 100, authtype none, intvl 1s, length 36
On the WAN-Side it looks like this:
Capture output of the MASTER-Node:
09:11:38.102897 IP WAN_BACKUP_NODE_IP > 224.0.0.18: VRRPv2, Advertisement, vrid 12, prio 100, authtype none, intvl 1s, length 36
09:11:39.504055 IP WAN_BACKUP_NODE_IP > 224.0.0.18: VRRPv2, Advertisement, vrid 12, prio 100, authtype none, intvl 1s, length 36
09:11:40.929161 IP WAN_BACKUP_NODE_IP > 224.0.0.18: VRRPv2, Advertisement, vrid 12, prio 100, authtype none, intvl 1s, length 36
Capture output of the BACKUP-Node:
09:13:43.619491 IP WAN_BACKUP_NODE_IP > 224.0.0.18: VRRPv2, Advertisement, vrid 12, prio 100, authtype none, intvl 1s, length 36
09:13:45.039772 IP WAN_BACKUP_NODE_IP > 224.0.0.18: VRRPv2, Advertisement, vrid 12, prio 100, authtype none, intvl 1s, length 36
09:13:46.431278 IP WAN_BACKUP_NODE_IP > 224.0.0.18: VRRPv2, Advertisement, vrid 12, prio 100, authtype none, intvl 1s, length 36
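To see at a glance who is advertising, captures like these can be summarised per sender. The sketch below is an illustration, not an official tool: the regular expression only covers the line format shown above, and `SAMPLE` is copied from the WAN capture. tcpdump decodes CARP as VRRPv2 because both share an IP protocol number; as far as I know the "prio" field then corresponds to CARP's advskew and "intvl" to advbase.

```python
import re
from collections import Counter

# Matches only the tcpdump line format shown in the captures above.
ADVERT_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d+) IP (?P<src>\S+) > 224\.0\.0\.18: "
    r"VRRPv2, Advertisement, vrid (?P<vrid>\d+), prio (?P<prio>\d+),"
)

# Three lines copied from the WAN-side capture on the MASTER node.
SAMPLE = [
    "09:11:38.102897 IP WAN_BACKUP_NODE_IP > 224.0.0.18: VRRPv2, Advertisement, vrid 12, prio 100, authtype none, intvl 1s, length 36",
    "09:11:39.504055 IP WAN_BACKUP_NODE_IP > 224.0.0.18: VRRPv2, Advertisement, vrid 12, prio 100, authtype none, intvl 1s, length 36",
    "09:11:40.929161 IP WAN_BACKUP_NODE_IP > 224.0.0.18: VRRPv2, Advertisement, vrid 12, prio 100, authtype none, intvl 1s, length 36",
]


def summarize(lines):
    """Count advertisements per (source, vrid, prio); non-matching lines are ignored."""
    counts = Counter()
    for line in lines:
        match = ADVERT_RE.match(line.strip())
        if match:
            key = (match["src"], int(match["vrid"]), int(match["prio"]))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    for (src, vrid, prio), n in summarize(SAMPLE).items():
        print(f"{src} vrid={vrid} prio={prio}: {n} advertisements")
```

In all four captures above the only sender is the BACKUP node (10.x.x.20 or WAN_BACKUP_NODE_IP) with prio 100, so a summary like this would list a single entry per capture.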
Every interface and VLAN has a rule to allow any traffic between the CARP nodes:
Action  Proto   Source             Port  Destination        Port  Gateway
Pass    IPv4 *  CARP_NODES_VLAN_X  *     CARP_NODES_VLAN_X  *     *
My "High Availability Settings" are configured like this:
MASTER (172.x.y.y = Sync-Interface-IP)
  Synchronize States          YES
  Synchronize Interface       SYNC-Interface
  Synchronize Peer IP         172.x.y.z
  Synchronize Config to IP    172.x.y.z
  Remote System Username      user_name
  Remote System Password      password
  Users and Groups            YES
  ...                         YES
  DNS Resolver                YES
BACKUP (172.x.y.z = Sync-Interface-IP)
  Synchronize States          YES
  Synchronize Interface       SYNC-Interface
  Synchronize Peer IP         172.x.y.y
I left all other settings unchecked, since the help text says that one should only sync
from the MASTER to the BACKUP node and not bidirectionally. So I assume this is right.
Or am I wrong?
In my logs I can only find the following entries:
Jun 23 19:03:21 kernel: carp: 12@lagg0_vlan40: MASTER -> BACKUP (more frequent advertisement received)
Jun 23 19:03:21 kernel: carp: 17@lagg0_vlan100: MASTER -> BACKUP (more frequent advertisement received)
Jun 23 19:03:21 kernel: carp: 19@lagg0_vlan20: MASTER -> BACKUP (more frequent advertisement received)
Jun 23 19:03:21 kernel: carp: 16@lagg0_vlan70: MASTER -> BACKUP (more frequent advertisement received)
Jun 23 19:03:21 kernel: carp: 15@lagg0_vlan60: MASTER -> BACKUP (more frequent advertisement received)
Jun 23 19:03:20 kernel: carp: 20@lagg0_vlan10: MASTER -> BACKUP (more frequent advertisement received)
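Log entries in this format can be grouped by VHID, interface and reason to check whether all VIPs changed state together. This is a minimal sketch that assumes the exact line format shown above; `SAMPLE` is copied from the log excerpt and the function name is made up.

```python
import re

# Matches kernel CARP state-change lines in the format shown above.
CARP_LOG_RE = re.compile(
    r"kernel: carp: (?P<vhid>\d+)@(?P<ifname>\S+): "
    r"(?P<old>[A-Z]+) -> (?P<new>[A-Z]+) \((?P<reason>[^)]*)\)"
)

# Two lines copied from the log excerpt above.
SAMPLE = [
    "Jun 23 19:03:21 kernel: carp: 12@lagg0_vlan40: MASTER -> BACKUP (more frequent advertisement received)",
    "Jun 23 19:03:20 kernel: carp: 20@lagg0_vlan10: MASTER -> BACKUP (more frequent advertisement received)",
]


def parse_transitions(lines):
    """Return a list of (vhid, interface, old_state, new_state, reason) tuples."""
    transitions = []
    for line in lines:
        match = CARP_LOG_RE.search(line)
        if match:
            transitions.append(
                (int(match["vhid"]), match["ifname"], match["old"], match["new"], match["reason"])
            )
    return transitions


if __name__ == "__main__":
    for vhid, ifname, old, new, reason in parse_transitions(SAMPLE):
        print(f"vhid {vhid} on {ifname}: {old} -> {new} ({reason})")
```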
To me it looks like the BACKUP node is advertising more frequently than the original MASTER and therefore becomes the master.
I also checked the settings on the shell to see if there is any useful information regarding CARP. As you can see, the MASTER
got demoted:
net.inet.carp.ifdown_demotion_factor: 240
net.inet.carp.senderr_demotion_factor: 240
net.inet.carp.demotion: 3120
net.inet.carp.log: 1
net.inet.carp.preempt: 1
net.inet.carp.allow: 1
net.pfsync.carp_demotion_factor: 240
While on the BACKUP-node it looks like this:
net.inet.carp.ifdown_demotion_factor: 240
net.inet.carp.senderr_demotion_factor: 240
net.inet.carp.demotion: 0
net.inet.carp.log: 1
net.inet.carp.preempt: 1
net.inet.carp.allow: 1
net.pfsync.carp_demotion_factor: 240
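These values are consistent with the "more frequent advertisement received" log messages. CARP sends an advertisement every advbase + advskew/256 seconds, and the demotion counter is added to the advskew. The sketch below works through the numbers from this post; the cap of 240 on the effective advskew is my recollection of FreeBSD's CARP code and should be treated as an assumption to verify for your release, and the function names are made up.

```python
# Assumed cap on the effective advskew in FreeBSD's CARP (verify for your release).
CARP_MAXSKEW = 240


def effective_advskew(advskew, demotion):
    """Configured advskew plus the global demotion counter, limited by the assumed cap."""
    return min(advskew + demotion, CARP_MAXSKEW)


def advert_interval(advbase, advskew):
    """Advertisement interval in seconds: advbase + advskew/256."""
    return advbase + advskew / 256


# Values from the post: MASTER has freq. 1 / 0 and demotion 3120,
# BACKUP has freq. 1 / 100 and demotion 0.
master_interval = advert_interval(1, effective_advskew(0, 3120))   # 1.9375 s
backup_interval = advert_interval(1, effective_advskew(100, 0))    # 1.390625 s

# If every demotion step used a factor of 240, 3120 corresponds to 13 steps.
demotion_steps = 3120 // 240

if __name__ == "__main__":
    print(f"demoted MASTER advertises every {master_interval} s")
    print(f"BACKUP advertises every {backup_interval} s")
    print(f"demotion steps of 240: {demotion_steps}")
```

Under these assumptions the BACKUP advertises about every 1.39 s, which fits the roughly 1.4 s spacing in the captures, while the demoted MASTER would advertise only about every 1.94 s and therefore loses the election.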
Another strange thing is that "ifconfig" shows all my VLANs in the carp group "groups: vlan",
while no carp group is defined on my WAN interface "igb5". Might this be the reason for the split-brains?
In some way this would explain why the VLANs and WAN fail over separately. In a correctly working
HA environment, I would expect the master to fail over completely to the backup if any of its interfaces
goes down...
I've experienced this issue on 17.1.1, 17.1.4 and 17.1.8, and I've really run out of ideas on how to resolve it.
Is it possible that this is a bug in FreeBSD CARP or in the OPNsense release?
Is anyone else experiencing similar issues?
Best regards,
Wayne
-
Have you tried this setup without LAGG to isolate the problem?
I'd first set up the whole thing without VLANs and without LAGG. If this works as expected, I'd add VLANs. If that works as expected, I'd add LAGG.
Then you'll see where exactly the error is.
-
I already did this, and I tried it again with a completely blank setting a few minutes ago.
The result is:
With only one physical interface on the LAN side and one on the WAN side, everything works well and I get no split-brains.
With 1 VLAN (not on an LACP LAGG, nor on any LAGG), one physical NIC on the LAN side and one on the WAN side, it results in split-brains again. The BACKUP node takes over the VLAN if I manually fail over by disconnecting the cable from the port in use, but it fails over only for that interface. LAN and WAN remain on the original MASTER.
Furthermore, when I plug the cable back in, the BACKUP node doesn't release the IP back to the master.
I'm on Release 17.1.4 at the moment.
Best Regards
Wayne
-
At the moment it all looks like there are some strange VLAN issues that affect CARP's behaviour.
-
Ok,
I just upgraded to release 17.1.8, but the problem remains. Any ideas?
Cheers
Wayne
-
Just to isolate further, can you check LAGG without VLANs?
-
It's the same behaviour.
Do you yourself also have CARP enabled with VLANs on the LAN side?
Regards,
Wayne
-
No, but I'll invest time here to reproduce it once the error is clear.
So LAGG without VLANs works fine? No splitbrains?
-
Hi,
No, it didn't. Furthermore, there seems to be another bug: after trying with the LAGG, I wanted to delete it, and the whole system crashed. I had this before on both nodes, before I did a clean reinstall. OPNsense detected a bug and I filed it with a short description. It was related to some errors and uncaught exceptions in the lagg_edit.php file, but I'm not a programmer...
I'm really hoping that the next minor release is coming soon, since 17.1.8 isn't really what I expected from OPNsense. 16.x was really fine, I had no issues. Until 17.1.4 everything worked fine, and then it started getting really weird...
Thank you.
-
I'm working from home today; I'll try to reproduce this on Monday with some test machines.
So
- CARP with single interfaces works
- CARP with single interfaces as VLANs results in split-brain
- CARP with LAGG without VLANs results in split-brain
- CARP with LAGG with VLANs results in split-brain
-
Hello,
I can confirm this.
-> CARP with single interfaces as VLANs results in split-brain
Started with 17.1.8.
br
-
WAN VLAN (igb0)
LAN ETH (igb1)
CARP on VLAN
Works, no splitbrains.
I'll try VLAN only with just one physical IF in the next test
EDIT: There was a short mac flap of course:
*Mar 1 01:58:12.181: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0101 in vlan 60 is flapping between port Gi2/0/13 and port Gi2/0/14
*Mar 1 01:58:23.431: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.5e00.0101 in vlan 60 is flapping between port Gi2/0/13 and port Gi2/0/14
-
VLAN60 / WAN / igb0 / CARP IP 192.168.10.1
VLAN99 / WAN / igb0 / CARP IP 192.168.1.1
Pulled the cable of igb0 on unit 1; unit 2 took over smoothly. After plugging it in again, I had 2 MAC flaps and lost 5 pings.
No splitbrain.
17.1.8
How is your switch configured?
-
Hi,
My trunking LAGG to the switch is configured as LACP, both on the firewall side and on the switch side.
Flow control is enabled; otherwise LACP won't work as intended. By the way, I haven't experienced these issues in 16.7.
Therefore I expect that it's related to a bug in 17.1.x.
This is my setup:
Switch                         Firewall
47                             igb0
( )======(VLANs 10-100)========( )==(OPNSENSE)=====(WAN)
48                             igb1
      V-IPs for VLAN 10-100                 V-IP for WAN
I wonder if it's an issue that only occurs if you have multiple VLANs on one LAGG.
Have you also tried this?
I experienced the issue on multiple systems. All of them 17.1.x.
Best regards
Wayne
-
Hi,
I only tested VLAN, not LAGG, I can do this tomorrow.
I don't know why flow control should influence LACP. Would this mean that you can't run this setup without a switch that supports flow control?
You said you also experienced split-brains with just VLANs and no LAGG?
Are you trunking VLAN 1 (as on a Catalyst)?
-
Hi,
I use Ubiquiti EdgeMax switches and I have to enable flow control to get reliable LACP. I bonded two switch ports to carry all VLANs on them. And yes, I also experienced split-brains without a LAGG. As soon as there is more than one VLAN on a cable, the box starts behaving very confusingly when I disconnect an Ethernet cable on the LAN side. In most cases, disconnecting the WAN uplink leads to a "clean" failover. Disconnecting the LAN side (LAGG or not, but with multiple VLANs) leads to a split-brain.
Since I read that CARP was touched between 17.1.1 and 17.1.8, and I didn't experience these issues in 16.7, I think that something with LACP in the CARP context may be messed up in FreeBSD.
Best regards,
Wayne
-
Seems I found a similar one ...
https://github.com/opnsense/core/issues/1715
-
It seems to be a known issue on FreeBSD 11.0. We will try to get the fix into the 17.7 release.
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=218886
-
So, I created a LAGG with 2 IF's and on this LAGG 2 VLANs with CARP.
I tested every scenario, no splitbrains, but now MASTER state is always on machine 2.
I believe this has something to do with the LACP balancing because packets for VLAN88 are sent over igb1 and packets for VLAN99 are sent over igb2. Must be something like this.
After a reboot of both machines MASTER state is on machine1 again.
Did you enable fast timeouts on LAGG? This didn't work with my setup, so please don't.
Oh, OK, now I unplugged WAN, and Machine2 is MASTER for WAN and STANDBY for LAGG. Only disabling and re-enabling CARP fixes this. Hm, also when I plug one cable of the LAGG, one machine is MASTER for WAN and the other is MASTER for LAN (LAGG). This wasn't the case in my first test.
-
Ahhh, I reread your initial post.
This is not called splitbrain! Splitbrain is when both machines are in master state and you have a flapping of MACs on the switch.
What you have is a mix of master/standby on same machine.
I did a reboot now, and after the reboot M1 was again MASTER on WAN and BACKUP for LAGG, the other machine vice versa. Then I pulled the power and plugged it in again; now M2 is MASTER for all. Strange ...
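The distinction between a split-brain and a mixed master/standby state can be written down as a small check. This is only an illustrative sketch: the function name, the state labels and the input layout (one pair of states per VIP, first node then second node) are made up here, and the example values mirror the situation described above.

```python
def classify(states):
    """Classify a CARP pair.

    states maps a VIP name to (state_on_node1, state_on_node2).
    Returns "split-brain" if both nodes are MASTER for the same VIP,
    "mixed" if each node is MASTER for at least one VIP,
    and "consistent" otherwise.
    """
    pairs = list(states.values())
    if any(a == "MASTER" and b == "MASTER" for a, b in pairs):
        return "split-brain"
    masters_on_node1 = any(a == "MASTER" for a, _ in pairs)
    masters_on_node2 = any(b == "MASTER" for _, b in pairs)
    if masters_on_node1 and masters_on_node2:
        return "mixed"
    return "consistent"


if __name__ == "__main__":
    # Situation described above: M1 is MASTER on WAN, M2 is MASTER on the LAGG.
    print(classify({"WAN": ("MASTER", "BACKUP"), "LAGG": ("BACKUP", "MASTER")}))
```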
-
Cross correlating these reports with trouble I've had with pfSense 2.3.4. The menus for that allow setting up Master and Standby separately and differently for different interfaces. So I did, which worked fine in pfSense 2.3.3. But with 2.3.4 it broke, became undependable. This may be unrelated to the reports here ... or not.
Someone on the pfSense forums claimed such a setup isn't "supported," which seems a strange thing to say when (1) the menus allow it, (2) the docs don't say anything against it, and (3) it used to work. But then the OPNsense docs on CARP speak of the whole system failing over, rather than having failover work independently per interface. Having read that, and since I need to replace pfSense in short order, I asked elsewhere in these forums how CARP is designed to work on OPNsense. There's been an answer, but it was unclear.
The Decisio brochure says: "Two or more firewalls can be configured as a failover group. If one interface fails on the primary or the primary goes offline entirely, the secondary becomes active." This can be read to imply that the disconnection of a single interface on the primary server is supposed to result in the entire operation being taken over by the secondary server. Is that in theory what the back-end logic is supposed to do? Or in theory should only the VIPs of the single interface that is down be taken over by the secondary server?
-
As written in another thread: don't tick "Disable preempt" on either firewall, and set the tunable net.inet.carp.senderr_demotion_factor=0 on both firewalls. Reboot and you're good.
-
No,
I'm not trunking VLAN 1 and I also ran into the issue with VLANs only, without LAGG.
Best regards,
Wayne