Hi,
We recently ran into exactly the same behavior described in this thread, and after a fair amount of digging I wanted to share what we found and how we worked around it.
Symptoms (same as described above)
CARP works reliably on LAN / internal networks
CARP on the WAN interface behaves inconsistently
ARP resolution looks correct
Sometimes the first ICMP packet works
Subsequent traffic is dropped or blackholed
On the switches, we observed:
ARP table is correct (VIP → CARP virtual MAC)
MAC address table never learns the CARP virtual MAC
As a result, unicast traffic to the VIP is not forwarded reliably
Why this happens (key point)
This is not really a CARP bug, but an interaction between floating L2 identities and virtualized switching.
In virtual environments (ESXi + distributed switches in our case):
CARP answers ARP requests with the correct virtual MAC (00:00:5e:00:01:XX, where XX is the VHID in hex)
However, frames sourced from that MAC are not always learned by the physical ToR switches, even with Forged Transmits, MAC Address Changes, and Promiscuous Mode enabled
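For anyone who wants to check this on their own switches, here is a minimal sketch of two helpers: one derives the virtual MAC for a given VHID, the other tests whether a MAC seen in an ARP or MAC table falls in the CARP range. The function names `carp_mac` and `is_carp_mac` are made up for illustration and are not part of any tool.

```sh
#!/bin/sh
# Print the CARP virtual MAC for a VHID (valid range 1-255).
carp_mac() {
    vhid=$1
    case "$vhid" in
        ''|*[!0-9]*) echo "carp_mac: VHID must be a number" >&2; return 1 ;;
    esac
    if [ "$vhid" -lt 1 ] || [ "$vhid" -gt 255 ]; then
        echo "carp_mac: VHID out of range (1-255)" >&2
        return 1
    fi
    # The last octet of the virtual MAC is the VHID in hex.
    printf '00:00:5e:00:01:%02x\n' "$vhid"
}

# Succeed if the given MAC is in the 00:00:5e:00:01:XX virtual range.
is_carp_mac() {
    mac=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    case "$mac" in
        00:00:5e:00:01:[0-9a-f][0-9a-f]) return 0 ;;
        *) return 1 ;;
    esac
}
```

On the firewall, something like `tcpdump -e -n -i vmx1 ether src "$(carp_mac 1)"` should then show whether frames actually leave with that source MAC (vmx1 and VHID 1 are taken from our setup, adjust as needed).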
On LAN networks, this often works because:
Traffic stays inside the hypervisor or distributed switch
The physical switch is never involved
The CARP MAC does not need to be learned upstream
On the WAN, traffic must traverse physical uplinks:
The ToR switch must learn the source MAC
The CARP virtual MAC is never learned
Result: ARP resolves, first packet may pass, steady-state traffic fails
This explains why the issue appears WAN-only and why it is so inconsistent.
Workaround / design pattern that worked reliably for us
We solved this by separating HA control-plane from data-plane identity:
Keep CARP for state and master election only
Do not use the CARP VIP for production traffic
Create a plain IP Alias (no VHID) for the production IP
Move that alias between nodes based on CARP MASTER state
This way:
The production IP always uses the physical interface MAC
The switch can learn the MAC normally
CARP still provides HA logic and state sync
WAN traffic becomes stable and predictable
We implemented this using Monit and a small script that:
Adds the alias on the CARP MASTER
Removes it on the BACKUP node
Example logic (simplified):
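```sh
# <PROD_IP> is a placeholder for the production address; vmx1 is the WAN interface.
if ifconfig | grep -q "carp: MASTER vhid 1"; then
    # This node is CARP MASTER: add the production alias.
    ifconfig vmx1 inet <PROD_IP>/24 alias
else
    # This node is BACKUP: remove the production alias.
    ifconfig vmx1 inet <PROD_IP>/24 -alias
fi
```
Monit runs this every few seconds, so failover and failback are fast.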
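A slightly more defensive variant, as a sketch only: it touches the interface only when the alias state actually needs to change, and it matches the VHID exactly so that "vhid 1" does not also match "vhid 10". It assumes the default FreeBSD `ifconfig` output format ("inet <addr> netmask ..." and "carp: MASTER vhid N ..."), and that the CARP VHID lives on the same interface as the alias. The function names and the default address 192.0.2.10 are hypothetical placeholders.

```sh
#!/bin/sh
# Hypothetical defaults; override via the environment.
IFACE="${IFACE:-vmx1}"
VHID="${VHID:-1}"
PROD_IP="${PROD_IP:-192.0.2.10}"
PREFIX="${PREFIX:-24}"

# Reads ifconfig output on stdin; succeeds if the given VHID is MASTER.
carp_is_master() {
    grep -Eq "carp: MASTER vhid $1( |\$)"
}

# Reads ifconfig output on stdin; succeeds if the address is configured.
has_addr() {
    grep -Fq "inet $1 "
}

# Add the alias on MASTER, remove it on BACKUP, do nothing if already correct.
sync_alias() {
    state=$(ifconfig "$IFACE")
    if printf '%s\n' "$state" | carp_is_master "$VHID"; then
        printf '%s\n' "$state" | has_addr "$PROD_IP" ||
            ifconfig "$IFACE" inet "$PROD_IP/$PREFIX" alias
    else
        if printf '%s\n' "$state" | has_addr "$PROD_IP"; then
            ifconfig "$IFACE" inet "$PROD_IP/$PREFIX" -alias
        fi
    fi
}
```

The script that Monit runs would then just call `sync_alias` as its last line. Because nothing is executed when the state is already correct, the script stays quiet and exits cleanly on the BACKUP node.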
Conclusion
This seems to be a general limitation when using floating virtual MACs on WAN interfaces in virtualized environments, especially when traffic must traverse physical switching.
The workaround above has been stable for us and avoids relying on a MAC address that the physical fabric never learns.
Posting this in case it helps others who run into the same issue. Happy to clarify or compare notes.