OPNsense Forum

English Forums => High availability => Topic started by: Evert on November 01, 2024, 12:26:01 PM

Title: CARP instability lately
Post by: Evert on November 01, 2024, 12:26:01 PM
Hi all,

We have 2 OPNsense units, GW0 & GW1. We're using CARP for HA. Port ax0 connects each unit to our office network. There are currently 6 VLANs in use.

GW0 is MASTER. GW1 is BACKUP.

Over the last couple of days, we have occasionally seen CARP on some of the VLANs switch over to GW1 as MASTER.

GW0:
2024-11-01T11:30:18+01:00 GW0.domain.com kernel - - [meta sequenceId="1"] <6>carp: 100@vlan0.100: MASTER -> BACKUP (more frequent advertisement received)
2024-11-01T11:30:18+01:00 GW0.domain.com opnsense-business 64683 - [meta sequenceId="2"] /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "virtual IP CONTROL (10.10.0.1) (100@vlan0.100)" has resumed the state "BACKUP" for vhid 100
2024-11-01T11:30:18+01:00 GW0.domain.com opnsense-business 65088 - [meta sequenceId="3"] /usr/local/sbin/pluginctl: plugins_configure crl (1)
2024-11-01T11:30:18+01:00 GW0.domain.com opnsense-business 65088 - [meta sequenceId="4"] /usr/local/sbin/pluginctl: plugins_configure crl (execute task : core_trust_crl(1))
2024-11-01T11:30:18+01:00 GW0.domain.com opnsense-business 65088 - [meta sequenceId="5"] /usr/local/sbin/pluginctl: plugins_configure crl (execute task : openvpn_refresh_crls(1))
2024-11-01T11:36:27+01:00 GW0.domain.com opnsense-business 60906 - [meta sequenceId="1"] /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "virtual IP GUEST (192.168.254.1) (168@vlan0.192)" has resumed the state "BACKUP" for vhid 168
2024-11-01T11:36:27+01:00 GW0.domain.com kernel - - [meta sequenceId="2"] <6>carp: 168@vlan0.192: MASTER -> BACKUP (more frequent advertisement received)
2024-11-01T11:36:28+01:00 GW0.domain.com opnsense-business 61798 - [meta sequenceId="3"] /usr/local/sbin/pluginctl: plugins_configure crl (1)
2024-11-01T11:36:28+01:00 GW0.domain.com opnsense-business 61798 - [meta sequenceId="4"] /usr/local/sbin/pluginctl: plugins_configure crl (execute task : core_trust_crl(1))
2024-11-01T11:36:28+01:00 GW0.domain.com opnsense-business 61798 - [meta sequenceId="5"] /usr/local/sbin/pluginctl: plugins_configure crl (execute task : openvpn_refresh_crls(1))
2024-11-01T11:44:14+01:00 GW0.domain.com opnsense-business 68814 - [meta sequenceId="1"] /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "virtual IP IoT (192.168.238.1) (238@vlan0.999)" has resumed the state "BACKUP" for vhid 238
2024-11-01T11:44:14+01:00 GW0.domain.com kernel - - [meta sequenceId="2"] <6>carp: 238@vlan0.999: MASTER -> BACKUP (more frequent advertisement received)
2024-11-01T11:44:14+01:00 GW0.domain.com opnsense-business 70199 - [meta sequenceId="3"] /usr/local/sbin/pluginctl: plugins_configure crl (1)
2024-11-01T11:44:14+01:00 GW0.domain.com opnsense-business 70199 - [meta sequenceId="4"] /usr/local/sbin/pluginctl: plugins_configure crl (execute task : core_trust_crl(1))
2024-11-01T11:44:14+01:00 GW0.domain.com opnsense-business 70199 - [meta sequenceId="5"] /usr/local/sbin/pluginctl: plugins_configure crl (execute task : openvpn_refresh_crls(1))


GW1:
2024-11-01T11:30:18+01:00 GW1.domain.com kernel - - [meta sequenceId="1"] <6>carp: 100@vlan0.100: BACKUP -> MASTER (master timed out)
2024-11-01T11:30:18+01:00 GW1.domain.com opnsense-business 88827 - [meta sequenceId="2"] /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "virtual IP CONTROL (10.10.0.1) (100@vlan0.100)" has resumed the state "MASTER" for vhid 100
2024-11-01T11:30:18+01:00 GW1.domain.com opnsense-business 91716 - [meta sequenceId="3"] /usr/local/sbin/pluginctl: plugins_configure crl (1)
2024-11-01T11:30:18+01:00 GW1.domain.com opnsense-business 91716 - [meta sequenceId="4"] /usr/local/sbin/pluginctl: plugins_configure crl (execute task : core_trust_crl(1))
2024-11-01T11:30:18+01:00 GW1.domain.com opnsense-business 91716 - [meta sequenceId="5"] /usr/local/sbin/pluginctl: plugins_configure crl (execute task : openvpn_refresh_crls(1))
2024-11-01T11:36:27+01:00 GW1.domain.com kernel - - [meta sequenceId="1"] <6>carp: 168@vlan0.192: BACKUP -> MASTER (master timed out)
2024-11-01T11:36:27+01:00 GW1.domain.com opnsense-business 28680 - [meta sequenceId="2"] /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "virtual IP GUEST (192.168.254.1) (168@vlan0.192)" has resumed the state "MASTER" for vhid 168
2024-11-01T11:36:28+01:00 GW1.domain.com opnsense-business 31286 - [meta sequenceId="3"] /usr/local/sbin/pluginctl: plugins_configure crl (1)
2024-11-01T11:36:28+01:00 GW1.domain.com opnsense-business 31286 - [meta sequenceId="4"] /usr/local/sbin/pluginctl: plugins_configure crl (execute task : core_trust_crl(1))
2024-11-01T11:36:28+01:00 GW1.domain.com opnsense-business 31286 - [meta sequenceId="5"] /usr/local/sbin/pluginctl: plugins_configure crl (execute task : openvpn_refresh_crls(1))
2024-11-01T11:40:12+01:00 GW1.domain.com sudo 28873 - [meta sequenceId="1"]    evert : TTY=pts/0 ; PWD=/home/evert ; USER=root ; COMMAND=/usr/bin/su -
2024-11-01T11:44:14+01:00 GW1.domain.com opnsense-business 75137 - [meta sequenceId="1"] /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "virtual IP IoT (192.168.238.1) (238@vlan0.999)" has resumed the state "MASTER" for vhid 238
2024-11-01T11:44:14+01:00 GW1.domain.com kernel - - [meta sequenceId="2"] <6>carp: 238@vlan0.999: BACKUP -> MASTER (master timed out)
2024-11-01T11:44:14+01:00 GW1.domain.com opnsense-business 77007 - [meta sequenceId="3"] /usr/local/sbin/pluginctl: plugins_configure crl (1)
2024-11-01T11:44:14+01:00 GW1.domain.com opnsense-business 77007 - [meta sequenceId="4"] /usr/local/sbin/pluginctl: plugins_configure crl (execute task : core_trust_crl(1))
2024-11-01T11:44:14+01:00 GW1.domain.com opnsense-business 77007 - [meta sequenceId="5"] /usr/local/sbin/pluginctl: plugins_configure crl (execute task : openvpn_refresh_crls(1))


It's not always the same VLANs that go from MASTER to BACKUP, but it's never all of them.
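To see at a glance which VHIDs flipped and when, the kernel `carp:` lines from both units can be pulled out of the logs and grouped per VHID. The following is only a rough sketch, assuming the log format is exactly as pasted above; the function names `parse_transitions` and `group_by_vhid` are made up for illustration and are not part of OPNsense.

```python
import re
from collections import defaultdict

# Matches kernel CARP state-change lines in the syslog format quoted above:
# "<timestamp> <host> kernel - - [...] <6>carp: <vhid>@<iface>: OLD -> NEW (reason)"
CARP_LINE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<host>\S+)\s+kernel\b.*?"
    r"carp:\s+(?P<vhid>\d+)@(?P<iface>\S+):\s+"
    r"(?P<old>\w+)\s+->\s+(?P<new>\w+)\s+\((?P<reason>[^)]*)\)"
)


def parse_transitions(lines):
    """Return one dict per CARP state change; all other lines are ignored."""
    events = []
    for line in lines:
        match = CARP_LINE.search(line)
        if match:
            events.append(match.groupdict())
    return events


def group_by_vhid(events):
    """Group events by (vhid, interface) so both units' views sit side by side."""
    grouped = defaultdict(list)
    for event in events:
        grouped[(event["vhid"], event["iface"])].append(event)
    return dict(grouped)


# Two lines copied from the logs above, one from each unit.
sample = [
    '2024-11-01T11:30:18+01:00 GW0.domain.com kernel - - [meta sequenceId="1"] '
    "<6>carp: 100@vlan0.100: MASTER -> BACKUP (more frequent advertisement received)",
    '2024-11-01T11:30:18+01:00 GW1.domain.com kernel - - [meta sequenceId="1"] '
    "<6>carp: 100@vlan0.100: BACKUP -> MASTER (master timed out)",
]

for key, events in group_by_vhid(parse_transitions(sample)).items():
    for event in events:
        print(key, event["ts"], event["host"], event["old"], "->", event["new"], event["reason"])
```

Feeding it the full logs from both units would list every transition per VLAN, which makes it easier to check whether the same VHIDs keep flapping or whether it really is random.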

We have never had this issue before, and there have been no hardware/config changes in a while, other than updating to 24.10BE, but whether that has anything to do with it...?

Any suggestions on where I should start looking? 🤔
Title: Re: CARP instability lately
Post by: Monviech (Cedrik) on November 01, 2024, 01:13:05 PM
Check out dmesg to see why a demotion happens.

"More frequent advertisement" means there may be issues on layer 2, such as the CARP advertisements not being received, being dropped, or being delayed by network latency.

It could also be link up/down events or pfsync issues.
Title: Re: CARP instability lately
Post by: Evert on November 01, 2024, 03:31:07 PM
dmesg doesn't seem to give much more. Here's the most recent occurrence, where only one VLAN switched.

GW0:
carp: 238@vlan0.999: MASTER -> BACKUP (more frequent advertisement received)

GW1:
carp: 238@vlan0.999: BACKUP -> MASTER (master timed out)
Title: Re: CARP instability lately
Post by: Monviech (Cedrik) on November 01, 2024, 06:04:46 PM
Well, that already says what happened: the CARP advertisement was either not sent or not received.

I would tcpdump (packet capture) the CARP advertisements on both sides to find out if that's true.
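Once a capture exists, the timestamps can be checked for gaps long enough to make the BACKUP take over. The sketch below is only an illustration under several assumptions: the capture was taken with something like `tcpdump -tt -n -i vlan0.999 ip proto 112` (CARP uses IP protocol 112, and `-tt` prints an epoch timestamp at the start of each line), and the BACKUP runs with advbase 1 and advskew 100, which are guesses and not values from this thread. As far as I know, FreeBSD's CARP waits 3 * advbase + advskew / 256 seconds before declaring the master down. All function names here are hypothetical.

```python
def master_down_timeout(advbase=1, advskew=0):
    """Seconds a BACKUP waits without advertisements before becoming MASTER
    (assumed FreeBSD behaviour: 3 * advbase + advskew / 256)."""
    return 3 * advbase + advskew / 256


def timestamps_from_tcpdump(lines):
    """Take the leading epoch timestamp that `tcpdump -tt` prints on each line.
    Lines that do not start with a number are skipped."""
    stamps = []
    for line in lines:
        fields = line.split(maxsplit=1)
        if not fields:
            continue
        try:
            stamps.append(float(fields[0]))
        except ValueError:
            continue
    return stamps


def find_gaps(timestamps, threshold):
    """Return (previous, current, gap) for consecutive packets further apart
    than `threshold` seconds."""
    ordered = sorted(timestamps)
    return [
        (prev, cur, cur - prev)
        for prev, cur in zip(ordered, ordered[1:])
        if cur - prev > threshold
    ]


# Placeholder capture lines: only the leading timestamp matters here,
# the rest of each line is a stand-in and not real tcpdump output.
capture = [
    "1730457850.0 (packet details)",
    "1730457851.0 (packet details)",
    "1730457852.0 (packet details)",
    "1730457856.5 (packet details)",
    "1730457857.5 (packet details)",
]

timeout = master_down_timeout(advbase=1, advskew=100)  # assumed BACKUP settings
for prev, cur, gap in find_gaps(timestamps_from_tcpdump(capture), timeout):
    print(f"gap of {gap:.2f}s between {prev} and {cur} exceeds timeout {timeout:.2f}s")
```

Running this against captures from both units should show whether GW0 stopped sending advertisements (gap visible on both sides) or whether they were sent but lost on the way (gap visible only on GW1).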