25.1.3 CARP-based HA router stops responding on WAN interface

Started by rdol, March 13, 2025, 09:16:15 PM

My HA routers, created as VMs in the OVH private cloud, work fine except for one problem, and I don't know what to do about it.
Both VMs run 25.1.3, same config, same interfaces, names etc.
Interface   Identifier
[DMZ]       opt2
[DMZ2]      opt3
[LAN]       lan
[PFSYNC]    opt1
[VPN]       opt4
[WAN]       wan

All interfaces use the default 224.0.0.18 peer address in their virtual IP definitions, except WAN. For WAN I have to use the unicast IPv4 address of the other VM (and vice versa), with "No XMLRPC Sync" checked.

Each node has its own public IPv4 address (/26) with the correct default gateway. I use "Manual outbound NAT rule generation", and each VM initiates its own traffic from its own dedicated public IPv4 address (not incorrectly from the VIP, as described in some threads I found on this forum). Confirmed by running "curl https://ifconfig.me/ip" on both boxes.

Based on another recommendation found on this forum, I created only one CARP VIP on WAN. The other 19 public WAN IPs were created as IP aliases with a /32 subnet and the same VHID group number as the CARP VIP. This should minimize CARP traffic. There was one catch: using /26 for the IP aliases did not work, in my case they have to be /32. I believe this is the correct configuration.
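To make the addressing above concrete, here is a minimal sketch that checks a layout of this shape: one CARP VIP with a /26 mask and 19 IP aliases entered as /32 host addresses inside the same subnet. The addresses come from the 203.0.113.0/24 documentation range and are placeholders, not my real OVH block; check_alias is a hypothetical helper, not anything OPNsense provides.

```python
import ipaddress

# Placeholder addresses from the RFC 5737 documentation range (assumption),
# standing in for the real public /26.
carp_vip = ipaddress.ip_interface("203.0.113.10/26")

# The other 19 public IPs, each entered as an IP alias with a /32 mask.
aliases = [ipaddress.ip_interface(f"203.0.113.{host}/32") for host in range(11, 30)]


def check_alias(alias, vip):
    """Return True if the alias is a /32 host address inside the VIP's subnet."""
    return alias.network.prefixlen == 32 and alias.ip in vip.network


if __name__ == "__main__":
    print(len(aliases), "aliases,", "all valid:", all(check_alias(a, carp_vip) for a in aliases))
```

This only validates the addressing plan; it says nothing about whether the hosting network accepts aliases that share a VHID with the CARP VIP.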

So the CARP VIP on WAN has "No XMLRPC Sync" checked. The other 19 IP aliases have "No XMLRPC Sync" unchecked. Synchronization from the primary to the secondary VM works without any problem.

Now I want to test router failover.
1) Let's press "Enter Persistent CARP Maintenance Mode" on primary node.
2) Primary node becomes BACKUP, secondary node becomes MASTER. So far so good.
3) I'll initiate primary node's reboot while pinging WAN CARP VIP and/or any IP Alias from the Internet.
4) All pings work ... until primary node finishes reboot. Pings to dedicated WAN IPs work for both VMs but nothing replies when pinging WAN CARP VIP or IP Alias.
5) Primary node is still BACKUP, secondary node is still MASTER. But WAN communication does not work.
6) WAN communication is restored when I press "Leave Persistent CARP Maintenance Mode" on primary node.
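For anyone repeating steps 1-6, a small helper like the one below can turn a ping log into outage windows, which makes it easier to line up the loss of the VIP with the moment the primary finishes booting. This is only a sketch: the sample format (timestamp in seconds, reply received or not) and the function name outage_windows are my own assumptions, and collecting the samples is left to whatever ping tool you use.

```python
from typing import List, Tuple


def outage_windows(samples: List[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Collapse (timestamp, reply_received) samples into (start, end) outage windows.

    An outage starts at the first failed sample and ends at the next successful
    one. An outage still open at the end of the log is closed at the last timestamp.
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                      # first lost ping of a new outage
        elif ok and start is not None:
            windows.append((start, ts))     # replies are back, close the window
            start = None
    if start is not None:
        windows.append((start, samples[-1][0]))
    return windows


if __name__ == "__main__":
    # Hypothetical log: replies stop at t=2 and resume at t=4.
    log = [(0.0, True), (1.0, True), (2.0, False), (3.0, False), (4.0, True)]
    print(outage_windows(log))
```

Running it separately against the CARP VIP, an IP alias, and each node's dedicated address shows which of them drop out together.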

Do you have any idea what may be wrong, or what I should check again?

Based on this article (https://forum.opnsense.org/index.php?topic=39906.0) I planned to test the same procedure described above (steps 1-6) with an additional step:
2a) On node1 (master) - Go to "Interfaces: Virtual IPs: Settings" and look at one of the CARP Vips, expand advanced mode, look at the "advskew" - it should be something like 0 or 1. Set this around 100 higher than node2.
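As background to that step: in FreeBSD's CARP, advskew is measured in 1/256 of a second and is added to advbase to give the advertisement interval, and the node advertising more frequently is preferred as master. The sketch below only illustrates that arithmetic. It is a simplified model that ignores preemption and demotion, and the function names are hypothetical.

```python
def advert_interval(advbase: int, advskew: int) -> float:
    """Advertisement interval in seconds: advbase plus advskew/256."""
    if not 1 <= advbase <= 255:
        raise ValueError("advbase must be between 1 and 255")
    if not 0 <= advskew <= 254:
        raise ValueError("advskew must be between 0 and 254")
    return advbase + advskew / 256


def preferred_master(nodes: dict) -> str:
    """Return the name of the node with the shortest advertisement interval.

    `nodes` maps a node name to an (advbase, advskew) tuple.
    """
    return min(nodes, key=lambda name: advert_interval(*nodes[name]))


if __name__ == "__main__":
    # Raising node1's skew by 100 makes it advertise more slowly than node2.
    nodes = {"node1": (1, 100), "node2": (1, 0)}
    print(advert_interval(1, 100), preferred_master(nodes))
```

With advbase 1 on both nodes, a skew of 100 on node1 gives an interval of about 1.39 s against 1.0 s on node2, so node2 would be preferred in this model.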

The problem is I am not able to find "advskew" in 25.1.3 GUI :) Clicking "Advanced" mode shows/hides Gateway form field only. Is it expected to not see "advskew" in 25.1.3?

14 hours later and at least 3 years older ... I finally have a perfectly working HA router.
This morning I started to roll back all the changes I had made over roughly the last 7 days, and I discovered that my environment is simply incompatible with IP aliases bound to a CARP VIP.

As soon as I reconfigured all WAN IP aliases (19 in total) back to separate WAN CARP VIPs (each with its own VHID group, a /26 subnet, a unicast peer, and no XMLRPC sync), everything started to work flawlessly. I can reboot nodes and enter or leave maintenance mode as I wish, and everything behaves nicely and smoothly.
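With 19 separate CARP VIPs to enter by hand, it helps to write the plan down first so that no VHID is reused. Below is a minimal sketch that generates such a plan. The dictionary keys, the function name carp_vip_plan, and the 203.0.113.x addresses are illustrative assumptions and do not correspond to the OPNsense configuration schema or API.

```python
import ipaddress


def carp_vip_plan(addresses, prefixlen, peer, first_vhid=1):
    """Build one CARP VIP entry per address, each with its own VHID.

    `peer` is the unicast address of the other node. VHIDs must stay within 1..255.
    """
    plan = []
    for offset, addr in enumerate(addresses):
        vhid = first_vhid + offset
        if not 1 <= vhid <= 255:
            raise ValueError(f"VHID {vhid} is outside the valid range 1-255")
        plan.append({
            "address": f"{ipaddress.ip_address(addr)}/{prefixlen}",  # validates the IP
            "vhid": vhid,
            "peer": str(ipaddress.ip_address(peer)),
            "no_xmlrpc_sync": True,  # unicast peers differ per node, so entries are not synced
        })
    return plan


if __name__ == "__main__":
    # Placeholder addresses (RFC 5737 documentation range), not real ones.
    addrs = [f"203.0.113.{host}" for host in range(11, 30)]
    for entry in carp_vip_plan(addrs, 26, "203.0.113.3", first_vhid=2):
        print(entry)
```

On the second node the same plan applies with the peer address swapped to the first node's dedicated WAN IP.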