Why: backup-node turned MASTER on WAN with unknown cause

Started by tt-ah, May 22, 2023, 12:12:00 PM

Hi everyone!

I saw some weird behaviour on an HA cluster last week about which I would like to consult you guys.

Environment:
It's a 2-node cluster. The main-firewall (IP: .11) ran on 22.7 and the backup-firewall (IP: .12) on 23.1.
In total there are 8 CARP instances: 1 on WAN (VIP: .10) and 7 on LAN VLANs.
The WAN VIP is used for a site-to-site IPsec tunnel.
In addition there is an alias (IP: .14) on WAN.
WAN and LAN are connected to a switch stack. The sync-interface is directly connected between the two firewalls.
The ISP gateway (IP: .9) is connected to the same switch and is being monitored by both firewalls as default gateway.
During this whole episode the switch logged no link-status events.

Config:
net.inet.carp.preempt: 1
otherwise see screenshots
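For reference, the relevant CARP tunables and the per-VIP state can be checked from the shell on OPNsense/FreeBSD; a quick diagnostic sketch (the interface name igb1 and vhid 2 are taken from the later log excerpts, not from the screenshots):

```shell
# All CARP tunables, including preempt and the demotion counter
sysctl net.inet.carp

# Per-VIP CARP state on the WAN NIC (igb1 per the later logs);
# shows state, vhid, advbase and advskew for each VIP on the interface
ifconfig igb1 | grep carp
```

With `net.inet.carp.preempt=1` a node that loses one CARP interface is supposed to demote all of its VIPs, which is why a single-interface MASTER takeover like the one described here looks wrong.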

What happened:
At 17:27:45 (see attached log extract) the backup-firewall started flapping between INIT and BACKUP on WAN. After a few seconds it became MASTER, but only on WAN; all other interfaces stayed BACKUP. Meanwhile the main-firewall still had MASTER status on all interfaces. It stayed that way until I rebooted the backup-firewall, after which everything was back to normal: the backup node was in BACKUP state on all interfaces. dmesg on the backup-firewall was clean concerning the NICs.
The IPsec tunnel was not affected at all and seemed to be held by the main-firewall.

This happened while the ISP or their carrier had IP-hash-based routing issues between 16:57 and 18:32.
This definitely affected the cluster's external availability, but internally everything should have been fine. There are no logs indicating loss of the default gateway. (The cluster's IPs were reachable while others were not; which IPs were reachable varied with the external source IP.)


Attached are the logs of the backup-firewall during the timeframe when it occurred. During the state switching there are log entries concerning OpenVPN on the backup-firewall, even though OpenVPN is not configured on these systems (only installed). It also logs the alias IP (.14) disappearing on WAN - could this cause CARP to "freak out" in this way? I don't know why it would disappear; it is still configured as before on both systems.

On the main-firewall only this was logged (apart from filter logs):


May 17 17:27:14 main-firewall configd.py[244]: [6d22285b-e647-45f5-a0a2-b7f33c75f747] system status
...
May 17 17:27:19 main-firewall configd.py[244]: [3636d608-c56b-4e4b-97cd-d4a1f49f07b8] request pfsync info
...
May 17 17:27:20 main-firewall configd.py[244]: [b7530887-f020-441a-9544-2847c8758fe9] system status
...
May 17 17:27:39 main-firewall configd.py[244]: [50cf99f7-c26e-46ed-a62c-a28676368395] system status
...
May 17 17:27:43 main-firewall configd.py[244]: [a86b37a2-e610-4420-9e69-86cc469980da] Syncing firewall
...
May 17 17:27:47 main-firewall opnsense[18207]: /usr/local/etc/rc.filter_synchronize: Filter sync successfully completed with https://192.168.254.2/xmlrpc.php.
...
May 17 17:27:50 sws-tue-gw1 devd[506]: Processing event '!system=DEVFS subsystem=CDEV type=DESTROY cdev=pts/0'



I was doing updates during the last few weeks from 20.x to 23.1, always putting the backup-firewall one step ahead of the main-firewall. This issue occurred about 2 hours after the last update.

Right now I cannot explain what might have caused the CARP issues here.
It happened right after a filter sync, but the logs don't show any dropped CARP advertisements. Also, in my understanding, a failover should transfer all interfaces, not only one. And obviously the main-firewall should not have stayed MASTER if there had been a valid cause for the failover. I don't see how the ISP/upstream issues could have caused this, even though it happened during that window.
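One way to verify whether advertisements from the master actually stop reaching the backup would be to capture CARP traffic (IP protocol 112, multicast group 224.0.0.18) on the WAN NIC while reproducing the trigger; a sketch, with igb1 assumed from the logs:

```shell
# Watch CARP advertisements on the WAN NIC while triggering a config sync.
# -ttt prints inter-packet gaps: with advbase=1 the master should emit
# roughly one advertisement per second, and a gap of about three intervals
# is what makes a backup declare "master timed out".
tcpdump -ni igb1 -ttt 'ip proto 112 and host 224.0.0.18'
```

Running this on both nodes simultaneously would also show whether the master keeps sending while the backup stops receiving, which would point at the switch path rather than at either firewall.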

Do you have an idea what might have caused this issue? I would be very grateful.

If I missed something you need in terms of configuration or description please let me know.



I know this is an old topic, but it is my own and still unanswered.

I am writing here again because the issue just happened again after I triggered a configuration sync on the main-firewall. Again the backup-firewall took over MASTER status on WAN while the main-firewall also stayed MASTER.

main-firewall is on 23.1.6
backup-firewall is on 23.1.11
(I am currently trying to update to 23.7, but this issue keeps interrupting me)

During and shortly after the config sync I see several events such as the following on the main-firewall. opt1 (igb3) is the SYNC interface between the main- and backup-firewall and is a direct connection:

2023-11-21T21:13:48 Notice opnsense /usr/local/etc/rc.linkup: ROUTING: entering configure using 'opt1'
2023-11-21T21:13:48 Notice opnsense /usr/local/etc/rc.linkup: DEVD: Ethernet attached event for opt1(igb3)
2023-11-21T21:13:43 Notice opnsense /usr/local/etc/rc.linkup: DEVD: Ethernet detached event for opt1(igb3)


On the backup-firewall I see the following:

2023-11-21T20:33:14 Notice opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Resyncing OpenVPN instances for interface VIP_WAN (xx.yy.xx.yy).
2023-11-21T20:33:14 Notice opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "VIP_WAN (xx.yy.xx.yy) (2@igb1)" has resumed the state "MASTER" for vhid 2
2023-11-21T20:33:13 Notice kernel <6>carp: 2@igb1: BACKUP -> MASTER (master timed out)
2023-11-21T20:33:13 Notice opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Resyncing OpenVPN instances for interface VIP_WAN (xx.yy.xx.yy).
2023-11-21T20:33:13 Notice opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "VIP_WAN (xx.yy.xx.yy) (2@igb1)" has resumed the state "BACKUP" for vhid 2
2023-11-21T20:33:12 Notice opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Resyncing OpenVPN instances for interface VIP_WAN (xx.yy.xx.yy).
2023-11-21T20:33:12 Notice opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "VIP_WAN (xx.yy.xx.yy) (2@igb1)" has resumed the state "INIT" for vhid 2
2023-11-21T20:33:12 Notice opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Resyncing OpenVPN instances for interface VIP_WAN (xx.yy.xx.yy).
2023-11-21T20:33:12 Notice opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "VIP_WAN (xx.yy.xx.yy) (2@igb1)" has resumed the state "BACKUP" for vhid 2
2023-11-21T20:33:11 Notice opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Resyncing OpenVPN instances for interface VIP_WAN (xx.yy.xx.yy).
2023-11-21T20:33:11 Notice opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "VIP_WAN (xx.yy.xx.yy) (2@igb1)" has resumed the state "INIT" for vhid 2
2023-11-21T20:33:11 Notice opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Resyncing OpenVPN instances for interface VIP_WAN (xx.yy.xx.yy).
2023-11-21T20:33:11 Notice opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "VIP_WAN (xx.yy.xx.yy) (2@igb1)" has resumed the state "BACKUP" for vhid 2
2023-11-21T20:33:10 Notice configctl event @ 1700595190.27 exec: system event config_changed
2023-11-21T20:33:10 Notice configctl event @ 1700595190.27 msg: Nov 21 20:33:10 sws-tue-gw2.domain.de config[28268]: config-event: new_config /conf/backup/config-1700595190.2471.xml
2023-11-21T20:33:10 Notice opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Resyncing OpenVPN instances for interface VIP_WAN (xx.yy.xx.yy).
2023-11-21T20:33:10 Notice opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "VIP_WAN (xx.yy.xx.yy) (2@igb1)" has resumed the state "INIT" for vhid 2
2023-11-21T20:33:10 Notice configctl event @ 1700595190.27 exec: system event config_changed
2023-11-21T20:33:10 Notice configctl event @ 1700595190.27 msg: Nov 21 20:33:10 sws-tue-gw2.domain.de config[28268]: config-event: new_config /conf/backup/config-1700595190.2471.xml
2023-11-21T20:33:10 Notice opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Resyncing OpenVPN instances for interface VIP_WAN (xx.yy.xx.yy).
2023-11-21T20:33:10 Notice opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "VIP_WAN (xx.yy.xx.yy) (2@igb1)" has resumed the state "INIT" for vhid 2
2023-11-21T20:33:10 Notice opnsense /xmlrpc.php: plugins_configure monitor (execute task : dpinger_configure_do())
2023-11-21T20:33:10 Notice opnsense /xmlrpc.php: plugins_configure monitor ()
2023-11-21T20:33:10 Notice opnsense /xmlrpc.php: ROUTING: keeping current inet default gateway 'xx.yy.xx.zz'
2023-11-21T20:33:10 Notice opnsense /xmlrpc.php: ROUTING: configuring inet default gateway on wan
2023-11-21T20:33:10 Notice opnsense /xmlrpc.php: ROUTING: entering configure using defaults
2023-11-21T20:33:10 Notice kernel <6>carp: 2@igb1: INIT -> BACKUP (initialization complete)
2023-11-21T20:33:10 Notice kernel <6>carp: 2@igb1: BACKUP -> INIT (hardware interface up)
2023-11-21T20:33:10 Notice kernel <6>carp: 2@igb1: INIT -> BACKUP (initialization complete)
2023-11-21T20:33:10 Notice kernel <6>carp: 2@igb1: BACKUP -> INIT (hardware interface up)
2023-11-21T20:33:10 Notice kernel <6>carp: 2@igb1: INIT -> BACKUP (initialization complete)
2023-11-21T20:33:10 Notice kernel <6>carp: 2@igb1: BACKUP -> INIT (hardware interface up)
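For context on the "master timed out" line above: a CARP node advertises every advbase + advskew/256 seconds, and a backup promotes itself once it has heard nothing for roughly three of those intervals. A sketch of that arithmetic, assuming the defaults advbase=1 and advskew=0 (the actual values are not in my excerpts):

```shell
# Assumed defaults; the real values show up in: ifconfig igb1 | grep carp
advbase=1; advskew=0

# Advertisement interval in seconds: advbase + advskew/256
interval=$(awk -v b="$advbase" -v s="$advskew" 'BEGIN { printf "%.3f", b + s/256 }')

# Master-down timeout: roughly three advertisement intervals
timeout=$(awk -v i="$interval" 'BEGIN { printf "%.3f", 3 * i }')

echo "advertisement interval: ${interval}s, master-down timeout: ${timeout}s"
```

That ~3 s window lines up with the log above: the backup flaps during the sync at 20:33:10-12 and promotes itself at 20:33:13, so it apparently missed (or could not process) advertisements for about three intervals while the config was being applied.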


I fixed it by doing a reboot of the backup-firewall.
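As a side note (this treats the symptom, not the cause): instead of a full reboot, a stray MASTER can usually be cleared by temporarily demoting CARP on that node, either via OPNsense's CARP maintenance mode in the GUI or via the demotion counter from the shell. A sketch, assuming shell access on the backup-firewall; per FreeBSD's carp(4), writes to this sysctl adjust the counter rather than set it:

```shell
# Add 240 to the demotion counter so this node loses every CARP election
sysctl net.inet.carp.demotion=240

# ...wait until the peer holds MASTER on all VIPs, then undo the adjustment
sysctl net.inet.carp.demotion=-240
```

This is less disruptive than a reboot and leaves pfsync state intact while the roles sort themselves out.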

I have the following questions:

  • What is wrong with my configuration that causes (a) the split-brain and (b) only WAN being affected? I want the MASTER state to always be concentrated on a single node.
  • What causes this failover in the first place when I trigger a configuration sync? Before the updates from 22.7 onward this was not an issue, and my configuration has not changed since.