CARP instability lately

Started by Evert, November 01, 2024, 12:26:01 PM

Hi all,

We have 2 OPNsense units, GW0 & GW1, and we're using CARP for HA. Port ax0 connects each unit to our office network. There are currently 6 VLANs in use.

GW0 is MASTER. GW1 is BACKUP.

Over the last couple of days, we have occasionally seen CARP for some of the VLANs switch over to GW1 as MASTER.

GW0:
2024-11-01T11:30:18+01:00 GW0.domain.com kernel - - [meta sequenceId="1"] <6>carp: 100@vlan0.100: MASTER -> BACKUP (more frequent advertisement received)
2024-11-01T11:30:18+01:00 GW0.domain.com opnsense-business 64683 - [meta sequenceId="2"] /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "virtual IP CONTROL (10.10.0.1) (100@vlan0.100)" has resumed the state "BACKUP" for vhid 100
2024-11-01T11:30:18+01:00 GW0.domain.com opnsense-business 65088 - [meta sequenceId="3"] /usr/local/sbin/pluginctl: plugins_configure crl (1)
2024-11-01T11:30:18+01:00 GW0.domain.com opnsense-business 65088 - [meta sequenceId="4"] /usr/local/sbin/pluginctl: plugins_configure crl (execute task : core_trust_crl(1))
2024-11-01T11:30:18+01:00 GW0.domain.com opnsense-business 65088 - [meta sequenceId="5"] /usr/local/sbin/pluginctl: plugins_configure crl (execute task : openvpn_refresh_crls(1))
2024-11-01T11:36:27+01:00 GW0.domain.com opnsense-business 60906 - [meta sequenceId="1"] /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "virtual IP GUEST (192.168.254.1) (168@vlan0.192)" has resumed the state "BACKUP" for vhid 168
2024-11-01T11:36:27+01:00 GW0.domain.com kernel - - [meta sequenceId="2"] <6>carp: 168@vlan0.192: MASTER -> BACKUP (more frequent advertisement received)
2024-11-01T11:36:28+01:00 GW0.domain.com opnsense-business 61798 - [meta sequenceId="3"] /usr/local/sbin/pluginctl: plugins_configure crl (1)
2024-11-01T11:36:28+01:00 GW0.domain.com opnsense-business 61798 - [meta sequenceId="4"] /usr/local/sbin/pluginctl: plugins_configure crl (execute task : core_trust_crl(1))
2024-11-01T11:36:28+01:00 GW0.domain.com opnsense-business 61798 - [meta sequenceId="5"] /usr/local/sbin/pluginctl: plugins_configure crl (execute task : openvpn_refresh_crls(1))
2024-11-01T11:44:14+01:00 GW0.domain.com opnsense-business 68814 - [meta sequenceId="1"] /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "virtual IP IoT (192.168.238.1) (238@vlan0.999)" has resumed the state "BACKUP" for vhid 238
2024-11-01T11:44:14+01:00 GW0.domain.com kernel - - [meta sequenceId="2"] <6>carp: 238@vlan0.999: MASTER -> BACKUP (more frequent advertisement received)
2024-11-01T11:44:14+01:00 GW0.domain.com opnsense-business 70199 - [meta sequenceId="3"] /usr/local/sbin/pluginctl: plugins_configure crl (1)
2024-11-01T11:44:14+01:00 GW0.domain.com opnsense-business 70199 - [meta sequenceId="4"] /usr/local/sbin/pluginctl: plugins_configure crl (execute task : core_trust_crl(1))
2024-11-01T11:44:14+01:00 GW0.domain.com opnsense-business 70199 - [meta sequenceId="5"] /usr/local/sbin/pluginctl: plugins_configure crl (execute task : openvpn_refresh_crls(1))


GW1:
2024-11-01T11:30:18+01:00 GW1.domain.com kernel - - [meta sequenceId="1"] <6>carp: 100@vlan0.100: BACKUP -> MASTER (master timed out)
2024-11-01T11:30:18+01:00 GW1.domain.com opnsense-business 88827 - [meta sequenceId="2"] /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "virtual IP CONTROL (10.10.0.1) (100@vlan0.100)" has resumed the state "MASTER" for vhid 100
2024-11-01T11:30:18+01:00 GW1.domain.com opnsense-business 91716 - [meta sequenceId="3"] /usr/local/sbin/pluginctl: plugins_configure crl (1)
2024-11-01T11:30:18+01:00 GW1.domain.com opnsense-business 91716 - [meta sequenceId="4"] /usr/local/sbin/pluginctl: plugins_configure crl (execute task : core_trust_crl(1))
2024-11-01T11:30:18+01:00 GW1.domain.com opnsense-business 91716 - [meta sequenceId="5"] /usr/local/sbin/pluginctl: plugins_configure crl (execute task : openvpn_refresh_crls(1))
2024-11-01T11:36:27+01:00 GW1.domain.com kernel - - [meta sequenceId="1"] <6>carp: 168@vlan0.192: BACKUP -> MASTER (master timed out)
2024-11-01T11:36:27+01:00 GW1.domain.com opnsense-business 28680 - [meta sequenceId="2"] /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "virtual IP GUEST (192.168.254.1) (168@vlan0.192)" has resumed the state "MASTER" for vhid 168
2024-11-01T11:36:28+01:00 GW1.domain.com opnsense-business 31286 - [meta sequenceId="3"] /usr/local/sbin/pluginctl: plugins_configure crl (1)
2024-11-01T11:36:28+01:00 GW1.domain.com opnsense-business 31286 - [meta sequenceId="4"] /usr/local/sbin/pluginctl: plugins_configure crl (execute task : core_trust_crl(1))
2024-11-01T11:36:28+01:00 GW1.domain.com opnsense-business 31286 - [meta sequenceId="5"] /usr/local/sbin/pluginctl: plugins_configure crl (execute task : openvpn_refresh_crls(1))
<85>1 2024-11-01T11:40:12+01:00 GW1.domain.com sudo 28873 - [meta sequenceId="1"]    evert : TTY=pts/0 ; PWD=/home/evert ; USER=root ; COMMAND=/usr/bin/su -
2024-11-01T11:44:14+01:00 GW1.domain.com opnsense-business 75137 - [meta sequenceId="1"] /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "virtual IP IoT (192.168.238.1) (238@vlan0.999)" has resumed the state "MASTER" for vhid 238
2024-11-01T11:44:14+01:00 GW1.domain.com kernel - - [meta sequenceId="2"] <6>carp: 238@vlan0.999: BACKUP -> MASTER (master timed out)
2024-11-01T11:44:14+01:00 GW1.domain.com opnsense-business 77007 - [meta sequenceId="3"] /usr/local/sbin/pluginctl: plugins_configure crl (1)
2024-11-01T11:44:14+01:00 GW1.domain.com opnsense-business 77007 - [meta sequenceId="4"] /usr/local/sbin/pluginctl: plugins_configure crl (execute task : core_trust_crl(1))
2024-11-01T11:44:14+01:00 GW1.domain.com opnsense-business 77007 - [meta sequenceId="5"] /usr/local/sbin/pluginctl: plugins_configure crl (execute task : openvpn_refresh_crls(1))


It's not always the same VLANs that go from MASTER to BACKUP, but it's never all of them.

We have never had this issue before, and there have been no hardware/config changes in a while, other than updating to 24.10BE. Whether that has anything to do with it, I don't know.

Any suggestions on where I should start looking? 🤔
--
Regards,
   Evert

November 01, 2024, 01:13:05 PM #1 Last Edit: November 01, 2024, 01:19:09 PM by Monviech
Check out dmesg to see why a demotion happens.

"More frequent advertisement" suggests there may be an issue on layer 2, such as the CARP advertisements not being received, being dropped, or being delayed by network latency.

It could also be link up/down events or pfsync issues.
Hardware:
DEC740
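
If the dmesg or syslog output gets long, a small script can pull out just the CARP state changes so they are easier to compare between the two nodes. This is only a rough sketch: the regular expression is written against the kernel lines quoted in this thread, and the function names (`carp_transitions`, `count_by_interface`) are made up for illustration.

```python
import re
from collections import defaultdict

# Written against the kernel lines quoted in this thread, e.g.
# "carp: 100@vlan0.100: MASTER -> BACKUP (more frequent advertisement received)"
# Other log formats are not handled (assumption).
CARP_RE = re.compile(
    r"carp: (?P<vhid>\d+)@(?P<ifname>\S+): "
    r"(?P<old>\w+) -> (?P<new>\w+) \((?P<reason>[^)]*)\)"
)


def carp_transitions(lines):
    """Return one dict per CARP state change found in the given log lines."""
    events = []
    for line in lines:
        match = CARP_RE.search(line)
        if match:
            events.append(match.groupdict())
    return events


def count_by_interface(events):
    """Count state changes per interface, to see which VLANs flap most."""
    counts = defaultdict(int)
    for event in events:
        counts[event["ifname"]] += 1
    return dict(counts)


# Sample input: three lines copied from the GW0 log in the first post.
sample = [
    '2024-11-01T11:30:18+01:00 GW0.domain.com kernel - - [meta sequenceId="1"] '
    '<6>carp: 100@vlan0.100: MASTER -> BACKUP (more frequent advertisement received)',
    '2024-11-01T11:30:18+01:00 GW0.domain.com opnsense-business 65088 - '
    '[meta sequenceId="3"] /usr/local/sbin/pluginctl: plugins_configure crl (1)',
    '2024-11-01T11:44:14+01:00 GW0.domain.com kernel - - [meta sequenceId="2"] '
    '<6>carp: 238@vlan0.999: MASTER -> BACKUP (more frequent advertisement received)',
]

events = carp_transitions(sample)
for event in events:
    print(event["ifname"], event["old"], "->", event["new"], "|", event["reason"])
print(count_by_interface(events))
```

Running the same script over the logs of both nodes and lining the results up by time should show whether each demotion on one side has a matching timeout on the other.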

dmesg doesn't seem to show much more. Here's the most recent occurrence, where only one VLAN switched.

GW0:
carp: 238@vlan0.999: MASTER -> BACKUP (more frequent advertisement received)

GW1:
carp: 238@vlan0.999: BACKUP -> MASTER (master timed out)
--
Regards,
   Evert

Well, that already says what happened: the CARP advertisement was either not sent or not received.

I would tcpdump (packet capture) the CARP advertisements on both sides to find out if that's true.
Hardware:
DEC740
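
Once there is a capture from each side, the interesting part is whether the advertisements ever stop arriving for a few seconds. Below is a rough sketch of how that could be checked. It assumes the capture was written as text with epoch timestamps as the first field on each line (for example with something like `tcpdump -tt -ni vlan0.999 ip proto 112`, CARP being IP protocol 112), and it assumes an advertisement interval of 1 second with a timeout of roughly three intervals. Adjust both to match the actual virtual IP settings. The helper names and the sample lines are made up for illustration; they are not real tcpdump output.

```python
def parse_epoch_lines(lines):
    """Take the first whitespace-separated field of each line as an epoch
    timestamp; lines that do not start with a number are skipped."""
    timestamps = []
    for line in lines:
        fields = line.split(None, 1)
        if not fields:
            continue
        try:
            timestamps.append(float(fields[0]))
        except ValueError:
            continue
    return timestamps


def find_gaps(timestamps, advbase=1.0, tolerance=3.0):
    """Return (previous, current, gap) tuples for consecutive advertisements
    that are more than advbase * tolerance seconds apart.

    advbase and tolerance are assumed defaults, not values from this thread.
    """
    limit = advbase * tolerance
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        if delta > limit:
            gaps.append((prev, cur, delta))
    return gaps


# Hypothetical capture text: only the leading timestamp matters here.
capture = [
    "100.0 <advertisement>",
    "101.0 <advertisement>",
    "102.0 <advertisement>",
    "listening on vlan0.999",  # non-timestamp line, skipped
    "106.5 <advertisement>",
    "107.5 <advertisement>",
]

stamps = parse_epoch_lines(capture)
for prev, cur, delta in find_gaps(stamps):
    print(f"no advertisement between {prev} and {cur} ({delta:.1f}s)")
```

If the capture on the MASTER shows no gap but the capture on the BACKUP does, the packets are being lost somewhere in between; if both show the gap, the MASTER itself stopped sending.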