[SOLVED] CARP switches to BACKUP frequently since 22.1.10 Upgrade

Started by juere, July 27, 2022, 12:13:01 PM

Previous topic - Next topic
Since upgrading to 22.1.10 last friday, our 2-Node OPNSense CARP Cluster switches the Master node to BACKUP for no apparent reason.
The slave node takes over on all six configured interfaces. The Master shows an error as in the attached screenshot. By klicking on "Enter persistent CARP Maintenance Mode" twice I can switch back as expected.

The cluster has been running stable with different OPNSense releases for months now, there are no signs for hardware errors.

  • has anyone else noticed CARP errors with 22.1.10, that where not there before upgrading ?
  • Is there a CARP logfile I can have a look at to find out, what's happening ?

System : Log Files : General should contain everything. Did you also reboot the machine? What was the last known working version?

For the time, the last switch happened, System -> Logfiles -> General contains


2022-07-27T01:00:53 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Resyncing OpenVPN instances for interface Virtual BMCs IP (10.254.11.1).
2022-07-27T01:00:53 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "Virtual BMCs IP (10.254.11.1) (9@lagg0_vlan11)" has resumed the state "BACKUP" for vhid 9
2022-07-27T01:00:53 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Resyncing OpenVPN instances for interface Virtual BMCs IP (10.254.11.1).
2022-07-27T01:00:53 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "Virtual BMCs IP (10.254.11.1) (9@lagg0_vlan11)" has resumed the state "MASTER" for vhid 9
2022-07-27T01:00:52 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Resyncing OpenVPN instances for interface Virtual LegacyServer IP (192.168.180.1).
2022-07-27T01:00:52 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "Virtual LegacyServer IP (192.168.180.1) (11@lagg0_vlan180)" has resumed the state "BACKUP" for vhid 11
2022-07-27T01:00:52 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Resyncing OpenVPN instances for interface Virtual Administration IP (10.254.15.1).
2022-07-27T01:00:52 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "Virtual Administration IP (10.254.15.1) (5@lagg0_vlan15)" has resumed the state "BACKUP" for vhid 5
2022-07-27T01:00:51 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Resyncing OpenVPN instances for interface Virtual WAN IP (168.119.27.6).
2022-07-27T01:00:51 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "Virtual WAN IP (168.119.27.6) (1@igb1)" has resumed the state "BACKUP" for vhid 1
2022-07-27T01:00:51 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Resyncing OpenVPN instances for interface Virtual ManagementInterfaces IP (10.254.12.1).
2022-07-27T01:00:51 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "Virtual ManagementInterfaces IP (10.254.12.1) (3@lagg0_vlan12)" has resumed the state "BACKUP" for vhid 3
2022-07-27T01:00:50 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Resyncing OpenVPN instances for interface Virtual WAN IP (168.119.27.6).
2022-07-27T01:00:50 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "Virtual WAN IP (168.119.27.6) (1@igb1)" has resumed the state "MASTER" for vhid 1
2022-07-27T01:00:49 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Resyncing OpenVPN instances for interface Virtual LegacyServer IP (192.168.180.1).
2022-07-27T01:00:49 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "Virtual LegacyServer IP (192.168.180.1) (11@lagg0_vlan180)" has resumed the state "MASTER" for vhid 11
2022-07-27T01:00:49 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Resyncing OpenVPN instances for interface Virtual Administration IP (10.254.15.1).
2022-07-27T01:00:49 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "Virtual Administration IP (10.254.15.1) (5@lagg0_vlan15)" has resumed the state "MASTER" for vhid 5
2022-07-27T01:00:48 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Resyncing OpenVPN instances for interface Virtual ManagementInterfaces IP (10.254.12.1).
2022-07-27T01:00:48 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "Virtual ManagementInterfaces IP (10.254.12.1) (3@lagg0_vlan12)" has resumed the state "MASTER" for vhid 3
2022-07-27T01:00:44 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Resyncing OpenVPN instances for interface Virtual ManagementInterfaces IP (10.254.12.1).
2022-07-27T01:00:44 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "Virtual ManagementInterfaces IP (10.254.12.1) (3@lagg0_vlan12)" has resumed the state "BACKUP" for vhid 3
2022-07-27T01:00:44 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Resyncing OpenVPN instances for interface Virtual ManagementInterfaces IP (10.254.12.1).
2022-07-27T01:00:44 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "Virtual ManagementInterfaces IP (10.254.12.1) (3@lagg0_vlan12)" has resumed the state "MASTER" for vhid 3
2022-07-27T01:00:23 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Resyncing OpenVPN instances for interface Virtual LegacyServer IP (192.168.180.1).
2022-07-27T01:00:23 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "Virtual LegacyServer IP (192.168.180.1) (11@lagg0_vlan180)" has resumed the state "BACKUP" for vhid 11
2022-07-27T01:00:22 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Resyncing OpenVPN instances for interface Virtual Administration IP (10.254.15.1).
2022-07-27T01:00:22 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "Virtual Administration IP (10.254.15.1) (5@lagg0_vlan15)" has resumed the state "BACKUP" for vhid 5
2022-07-27T01:00:21 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Resyncing OpenVPN instances for interface Virtual BMCs IP (10.254.11.1).
2022-07-27T01:00:21 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "Virtual BMCs IP (10.254.11.1) (9@lagg0_vlan11)" has resumed the state "BACKUP" for vhid 9
2022-07-27T01:00:21 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Resyncing OpenVPN instances for interface Virtual ManagementInterfaces IP (10.254.12.1).
2022-07-27T01:00:21 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "Virtual ManagementInterfaces IP (10.254.12.1) (3@lagg0_vlan12)" has resumed the state "BACKUP" for vhid 3
2022-07-27T01:00:20 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Resyncing OpenVPN instances for interface ProductionServer Virtual IP (10.254.10.1).
2022-07-27T01:00:20 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "ProductionServer Virtual IP (10.254.10.1) (7@lagg0_vlan10)" has resumed the state "BACKUP" for vhid 7
2022-07-27T01:00:20 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Resyncing OpenVPN instances for interface Virtual WAN IP (XXX.XXX.XXX.XXX).
2022-07-27T01:00:20 Error opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "Virtual WAN IP (XXX.XXX.XXX.XXX) (1@igb1)" has resumed the state "BACKUP" for vhid 1


Thats quite a lot of lines which would be significantely less if OpenVPN was only bound to the WAN Interface, as is should. It doesn't give me a clue as to what happened, though.

The last working release was 22.1.8
The last reboot was after the 22.1.10 update on friday.
I can try a reboot of the primary node, next time the switching occurs.

/var/log/dmesg.today might provide more insights, unfortunately there are no timestamps:


dmesg.today:carp: 1@igb1: MASTER -> BACKUP (more frequent advertisement received)
dmesg.today:carp: 7@lagg0_vlan10: MASTER -> BACKUP (more frequent advertisement received)
dmesg.today:carp: 11@lagg0_vlan180: MASTER -> BACKUP (more frequent advertisement received)
dmesg.today:carp: 1@igb1: INIT -> BACKUP (initialization complete)
dmesg.today:carp: demoted by -240 to 1200 (interface up)
dmesg.today:carp: demoted by -240 to 960 (pfsync bulk done)
dmesg.today:carp: 1@igb1: BACKUP -> MASTER (master timed out)
dmesg.today:carp: 1@igb1: MASTER -> BACKUP (more frequent advertisement received)
dmesg.today:carp: 9@lagg0_vlan11: BACKUP -> MASTER (master timed out)
dmesg.today:carp: 5@lagg0_vlan15: BACKUP -> MASTER (master timed out)
dmesg.today:carp: 5@lagg0_vlan15: BACKUP -> MASTER (master timed out)
dmesg.today:carp: 11@lagg0_vlan180: BACKUP -> MASTER (master timed out)
dmesg.today:carp: 3@lagg0_vlan12: BACKUP -> MASTER (master timed out)
dmesg.today:carp: 7@lagg0_vlan10: BACKUP -> MASTER (master timed out)
dmesg.today:carp: 1@igb1: BACKUP -> MASTER (master timed out)
dmesg.today:carp: demoted by -240 to 720 (send ok on lagg0_vlan15)
dmesg.today:carp: demoted by -240 to 480 (send ok on lagg0_vlan11)
dmesg.today:carp: demoted by -240 to 240 (send ok on lagg0_vlan12)
dmesg.today:carp: demoted by -240 to 0 (send ok on lagg0_vlan180)
dmesg.today:carp: demoted by 240 to 240 (send error 55 on lagg0_vlan11)
dmesg.today:carp: demoted by 240 to 480 (send error 55 on lagg0_vlan180)
dmesg.today:carp: demoted by 240 to 720 (send error 55 on lagg0_vlan15)
dmesg.today:carp: 7@lagg0_vlan10: MASTER -> BACKUP (more frequent advertisement received)
dmesg.today:carp: 1@igb1: MASTER -> BACKUP (more frequent advertisement received)
dmesg.today:carp: 3@lagg0_vlan12: MASTER -> BACKUP (more frequent advertisement received)
dmesg.today:carp: 5@lagg0_vlan15: MASTER -> BACKUP (more frequent advertisement received)
dmesg.today:carp: 11@lagg0_vlan180: MASTER -> BACKUP (more frequent advertisement received)
dmesg.today:carp: 9@lagg0_vlan11: MASTER -> BACKUP (more frequent advertisement received)
dmesg.today:carp: 7@lagg0_vlan10: BACKUP -> MASTER (master timed out)
dmesg.today:carp: 5@lagg0_vlan15: BACKUP -> MASTER (master timed out)
dmesg.today:carp: demoted by -240 to 480 (send ok on lagg0_vlan15)
dmesg.today:carp: 3@lagg0_vlan12: BACKUP -> MASTER (master timed out)
dmesg.today:carp: 7@lagg0_vlan10: MASTER -> BACKUP (more frequent advertisement received)
dmesg.today:carp: 5@lagg0_vlan15: MASTER -> BACKUP (more frequent advertisement received)
dmesg.today:carp: 9@lagg0_vlan11: BACKUP -> MASTER (master timed out)
dmesg.today:carp: 3@lagg0_vlan12: MASTER -> BACKUP (more frequent advertisement received)
dmesg.today:carp: 11@lagg0_vlan180: BACKUP -> MASTER (master timed out)
dmesg.today:carp: 9@lagg0_vlan11: MASTER -> BACKUP (more frequent advertisement received)
dmesg.today:carp: 11@lagg0_vlan180: MASTER -> BACKUP (more frequent advertisement received)
dmesg.today:carp: demoted by 240 to 720 (sysctl)
dmesg.today:carp: demoted by -720 to 0 (sysctl)
dmesg.today:carp: 11@lagg0_vlan180: BACKUP -> MASTER (preempting a slower master)
dmesg.today:carp: 9@lagg0_vlan11: BACKUP -> MASTER (preempting a slower master)
dmesg.today:carp: 7@lagg0_vlan10: BACKUP -> MASTER (preempting a slower master)
dmesg.today:carp: 1@igb1: BACKUP -> MASTER (preempting a slower master)
dmesg.today:carp: demoted by -240 to -240 (send ok on lagg0_vlan180)


but I cant still see what's happening

As it turnes out, this was a false alarm and apparently not related to the 22.1.10 update.

Immediately before this update our stacked Netgear M4300-8X8F core switch got an update to firmware release 12.0.17.7. This update turned of both fans on both switch units, the switch overheated and the described CARP behaviour was correct due to link loss. This is a bad update bug, shame on Netgear. Since turning on the fans via CLI again we did not experience any futher problems.

Sorry for the noise :)


Packet capture done on fw1 and fw2 is always your best friend.

That way you can see if the CARP messages are really arriving on fw1 and fw2.