Multiple errors with CARP on SFP+ LACP LAGG with VLANs

Started by ianf, March 30, 2023, 10:53:05 AM

Hi all,

We first experienced multiple issues on OPNsense 23.1 and then switched to 22.10 (business). The issues persist there, and since 22.10 is running 22.7 under the hood, I thought I should post here:

Setup
2x DEC3850
SFP+ ports ax0,ax1 -> lagg0 - connected to D-Link DXS-3400-24SC
igb0 -> WAN
igb3 -> PFSYNC direct cable between OPNsenses

All networks on LAN side are running in VLANs, configured on lagg0. lagg0 is also assigned and enabled, but not configured.

Errors
1. (the worst one) Packets being dropped with Symbol-Errors on Switch
Our APs (Ubiquiti U6-Lite) and switches are located in vlan01.98 (lagg0, tag: 98).
Our domain controller running a RADIUS server is located in vlan01.100 (lagg0, tag: 100).
When logging in to the WiFi via WPA2 Enterprise with a RADIUS backend, I can see the following RADIUS packets (captured on both vlan01.98 and vlan01.100, packets visible on both):
- Access-Request, AP -> DC
- Access-Challenge, DC -> AP
- Access-Request, AP -> DC
- IP Fragment containing first 1480 Bytes of an Access-Challenge, DC -> AP
- missing bytes of Access-Challenge, DC -> AP

When logging the packets I can also see the packets being passed on to the next hop.
However, the only packets being logged on the DXS-3400-24SC are:
- Access-Request, AP -> DC
- Access-Challenge, DC -> AP
- Access-Request, AP -> DC
- missing bytes of Access-Challenge, DC -> AP
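
As a sanity check on the numbers: the 1480-byte first fragment in the capture is what standard IPv4 fragmentation produces at a 1500-byte MTU with a 20-byte IP header and no IP options. The sketch below only illustrates that arithmetic. The MTU, the header length and the 1600-byte payload are assumptions for the example, not values measured on this setup, and `fragment_sizes` is a hypothetical helper name.

```python
def fragment_sizes(ip_payload_len, mtu=1500, ip_header_len=20):
    """Return (offset, length, more_fragments) for each IPv4 fragment.

    Assumes a plain IPv4 header without options. Fragment data must be a
    multiple of 8 bytes for every fragment except the last one.
    """
    max_data = (mtu - ip_header_len) // 8 * 8  # 1480 for MTU 1500
    fragments = []
    offset = 0
    while offset < ip_payload_len:
        length = min(max_data, ip_payload_len - offset)
        more_fragments = offset + length < ip_payload_len
        fragments.append((offset, length, more_fragments))
        offset += length
    return fragments


# Hypothetical 1600-byte IP payload (UDP header + RADIUS Access-Challenge).
for offset, length, more in fragment_sizes(1600):
    print(f"offset={offset} length={length} more_fragments={more}")
```

With these assumptions the first fragment fills a complete 1500-byte IP packet, and that full-size fragment is the one that does not show up in the switch's list above, while the short trailing fragment does.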

The switch also reports a Symbol-Err for the dropped packet. When debugging this with D-Link support, they asked me to swap the cables and ports. However, even when I switch to the backup OPNsense, the error stays exactly the same. This behaviour is 100% reproducible across both devices. The Symbol-Err counter increases for other packets as well; this is just the only one I have been able to capture and reproduce.

In the OPNsense interface statistics I can see a number of "Errors Out" on multiple VLANs, with many on the LAGG interface.

2. (annoying) CARP seems to not be working properly
When we access the Interfaces -> Virtual IPs -> Status page, often this message is displayed:
CARP has detected a problem and this unit has been demoted to BACKUP status.
Check link status on all interfaces with configured CARP VIPs.

This happens regularly on both OPNsenses. As a result, whenever we want to update or reboot the MASTER we have to shut it down to make the BACKUP become MASTER, since Persistent CARP Maintenance Mode doesn't work.

3. (not that important) ARP error messages on Backup OPNsense
The backup OPNsense has constant arp error messages stating:
arp: 00:0e:08:17:87:63 is using my IP address 172.1.1.3 on vlan01.11!
However, when I switch the IP to e.g. 172.1.1.4 the error messages still appear, just with the new IP.
This isn't really an issue, I just thought it might have something to do with the CARP problem.
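
To see whether it is always the same MAC address and interface behind these messages, the log lines can be tallied. This is a rough sketch that assumes the messages look exactly like the line quoted above. The regular expression and the `count_conflicts` helper are my own illustration, not an OPNsense tool.

```python
import re
from collections import Counter

# Matches lines such as:
#   arp: 00:0e:08:17:87:63 is using my IP address 172.1.1.3 on vlan01.11!
ARP_CONFLICT = re.compile(
    r"arp: (?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2}) "
    r"is using my IP address (?P<ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"on (?P<iface>\S+?)!",
    re.IGNORECASE,
)


def count_conflicts(lines):
    """Count ARP conflict messages per (mac, ip, interface)."""
    counts = Counter()
    for line in lines:
        match = ARP_CONFLICT.search(line)
        if match:
            key = (match["mac"].lower(), match["ip"], match["iface"])
            counts[key] += 1
    return counts


sample = [
    "arp: 00:0e:08:17:87:63 is using my IP address 172.1.1.3 on vlan01.11!",
    "arp: 00:0e:08:17:87:63 is using my IP address 172.1.1.3 on vlan01.11!",
]
for (mac, ip, iface), n in count_conflicts(sample).items():
    print(f"{mac} claims {ip} on {iface}: {n} message(s)")
```

Feeding it the system log lines from the backup unit would show whether one MAC accounts for all of the messages or whether several devices are involved.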

I put all of these errors into one thread, as I'm not sure whether they might be relevant to one another.

Thanks for any ideas and help!

Best,
Ian