mixed master/backup problem, force one node to stay master?

isg-ek · April 09, 2024, 01:51:34 PM

Dear all,

we have an HA pair of OPNsense firewalls: a LAN trunk interface with around 10 VLANs on both machines, plus WAN and admin interfaces on separate NICs. Both nodes have matching CARP interfaces for the VLANs, WAN, and admin.
We sync our config from node1 to node2; node1 is normally master. Sync between the pair runs over a dedicated hardware interface with a direct cable connection.

This setup has grown over the last 1 1/2 years, but worked like a charm with firmware updates, reboots, changes between master/backup mode, everything good - until this morning:
node1 had been in state "BACKUP" for a few days. We've seen this happen before, and in the past a reboot of the node brought everything back to normal. So we checked for firmware updates in the morning; it showed one minor update, no reboot required, and we installed it on node1. Then we rebooted to get rid of the "BACKUP" state. The system took quite a long time to come up again and afterwards stayed in "BACKUP" mode, with the GUI reporting "system is booting, not all services started". This lasted for around another 15 minutes, then node1 became "MASTER" for ~7 of its 15 CARP interfaces. After a few minutes we found that we had connectivity problems in some of the VLANs (some services unavailable, slow or broken internet connection) and decided to take the "safe" way: we shut node1 down and powered it off completely. Node2 has been master again since then, and everything is "fine" from the connectivity point of view. But of course, it can't stay like this.

So much for the background, now to our question :) What's the safest way to get node1 back online again, to check its log files, status, and so on? I suppose there must have been recent changes to our config that make the pair behave like this, as it has never done so before. Maybe "force" the backup node2 to stay master, even if node1 comes back online? There is this button "enter persistent CARP Maintenance mode" on the backup node2. I don't want to simply try it: I've never used it before, and if I understand it right, it should normally be used on the regular master node before a system update/reboot? Any suggestions?

thx a lot & best
Silke

Monviech (Cedrik) · April 09, 2024, 02:16:14 PM

Since the Persistent CARP Maintenance Mode is set on the primary node (https://docs.opnsense.org/manual/how-tos/carp.html#example-updating-a-carp-ha-cluster), it's possible to boot node1 without a network connection, get into the GUI, and set it there.

If you don't trust that option, there is another trick.

- Boot node1 without a connection to your switch (so point to point to your configuration laptop) and go into the GUI
- Go to System: High Availability: Settings
- Under "Services to synchronize (XMLRPC Sync)", disable the "Virtual IPs" section (so the manual changes to the CARP VIPs don't overwrite anything when you sync node1 with node2)
- On node2 (backup): go to "Interfaces: Virtual IPs: Settings", open one of the CARP VIPs, expand advanced mode, and look at the "advskew". It should be something like 100; just remember this value
- On node1 (master): go to "Interfaces: Virtual IPs: Settings", open one of the CARP VIPs, expand advanced mode, and look at the "advskew". It should be something like 0 or 1. Set it around 100 higher than on node2, so for example put 200 or 201 in there
- Save that increased value of 200+ for each of the CARP VIPs on node1

Now - node1 has advskew of 200 and node2 has advskew of 100.

If you connect node1 to your network, it advertises less frequently than node2 (a higher advskew means a longer interval between CARP advertisements), forcing node2 to stay master and node1 to stay backup.
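To make the effect of the two advskew values concrete, here is a small sketch of the interval arithmetic. CARP sends an advertisement every advbase + advskew/256 seconds, and the node that advertises most frequently wins the election. The helper name `carp_interval` is made up for this example, and advbase=1 is an assumption (the usual default); check the value in your own VIP settings.

```shell
#!/bin/sh
# carp_interval ADVBASE ADVSKEW  (hypothetical helper, not an OPNsense command)
# Prints the CARP advertisement interval in seconds: advbase + advskew/256.
carp_interval() {
    awk -v base="$1" -v skew="$2" 'BEGIN { printf "%.3f\n", base + skew / 256 }'
}

# advskew values from the steps above; advbase=1 is assumed.
echo "node2: $(carp_interval 1 100) s"   # node2: 1.391 s
echo "node1: $(carp_interval 1 200) s"   # node1: 1.781 s
```

With these numbers node2 advertises roughly every 1.4 seconds and node1 roughly every 1.8 seconds, so node2 should keep the master role as long as its advertisements actually reach node1.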

Just be careful with the XMLRPC sync: the "Virtual IPs" section has been removed from it for now, but once you enable it again, these changes will be reverted.
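To double-check which state and advskew each VIP really has (for example after touching the sync settings), the values can also be read on the console. The sketch below assumes that `ifconfig` prints one line per CARP VIP in the form `carp: STATE vhid N advbase N advskew N` under the interface it belongs to, as FreeBSD does; the function name `carp_states` is made up for this example.

```shell
#!/bin/sh
# carp_states (hypothetical helper): read `ifconfig` output on stdin and print
# one line per CARP VIP as "<interface> vhid <n> <STATE> advskew <n>".
# Assumes "carp: STATE vhid N advbase N advskew N" lines under each interface.
carp_states() {
    awk '
        /^[^ \t]/ { sub(/:$/, "", $1); ifname = $1 }   # interface header line
        $1 == "carp:" {
            vhid = "?"; skew = "?"
            for (i = 3; i < NF; i++) {
                if ($i == "vhid")    vhid = $(i + 1)
                if ($i == "advskew") skew = $(i + 1)
            }
            print ifname, "vhid", vhid, $2, "advskew", skew
        }'
}

# Usage on each node:
#   ifconfig | carp_states
```

Running it on both nodes and comparing the lists side by side shows quickly which VIPs are MASTER on which node and whether the advskew values are the ones you expect.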

Find out why there are mixed master and backup states:

On each VLAN, the nodes send CARP advertisements as multicast, using IP protocol 112 (the same protocol number VRRP uses). That's how the CARP VIPs communicate their "advbase" and "advskew" to determine, for each CARP VIP, who should be master and who should be backup.

You can see it with tcpdump:

tcpdump -i vlan0.1 proto 112

If one of your VLANs has the wrong tag somewhere on the switches, or IGMP snooping or some other feature prevents these advertisements from getting through on a VLAN, then there can be mixed master/backup states. It's the most likely cause apart from misconfiguration on the OPNsense.
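When you compare several VLANs, it can help to count who is actually sending advertisements on each of them. The sketch below only relies on the standard `IP <src> > <dst>:` part of tcpdump's text output; the function name `carp_senders` is made up for this example, and the `vlan0.1` interface in the usage line is the one from the command above.

```shell
#!/bin/sh
# carp_senders (hypothetical helper): read tcpdump text output on stdin and
# print "<packet count> <source address>" for every sender seen.
# Only the generic "IP <src> > <dst>:" portion of each line is parsed.
carp_senders() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "IP" && $(i + 2) == ">") { count[$(i + 1)]++; break }
    }
    END { for (src in count) print count[src], src }' | sort -k2
}

# Usage, capturing 20 packets on one VLAN:
#   tcpdump -n -c 20 -i vlan0.1 proto 112 | carp_senders
```

If a node that believes it is master only ever sees its own address on a VLAN while the other node claims master there too, the advertisements are probably not making it across the switch on that VLAN.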

isg-ek · April 09, 2024, 04:55:48 PM

Thx a lot for your super fast reply! We've decided on the first option: we configured the persistent CARP maintenance mode and reconnected the node. We will bring it back online, but that last part is for tomorrow :) Then we'll check both machines' config, log files, and firewall rules again.

For the debugging process, I already checked the switch port config. Both machines are on the same switch, and the port config is identical. If this switch isn't somehow broken or buggy, I guess the problem would most likely be within the OPNsense firewall rules (?). So I think once we've checked that, we'll have to go on with the tcpdumps tomorrow, and to be able to do this, I suppose we'd have to end the maintenance mode.

Another thought: as I mentioned in my first post, we know from the past that a failover to backup was sometimes triggered for node1 on all interfaces; sometimes node1 switched back to master, sometimes not. We've seen this and could resolve it, but we could never quite explain it. I wonder whether, whatever the cause in our network is, we have already had it for a while. Hope we'll be able to catch it.

Reply #1 last edited April 09, 2024, 02:22:46 PM by Monviech.
