LAGG flapping at regular time intervals

Started by kei, March 25, 2024, 09:49:41 AM

Dear All,

After days of plugging and testing, I am turning to the forum to find the flaw in my setup.

The hardware is straightforward: OPNsense running on a Supermicro A1SAi-2550F, connected to a MikroTik CSS610-8P-2S+IN serving 3 WAPs and other equipment (home network).
Multiple VLANs are set up on top of the LAGG.
Upstream is 1G Fibre, connected to an Intel X520-DA2 plugged into the Supermicro board.

All equipment is on the latest BIOS and firmware versions.

The connection between the OPNsense box and the switch is an LACP LAGG. I tested several combinations (2 onboard UTP, 1 onboard UTP, 1 onboard UTP + 1 GB SFP, 1 GB SFP).
With multiple LACP legs, the LAGG drops out several times a day at random intervals.
With a single LACP leg, the dropout is fairly consistent: roughly every 5 hours, in bursts of 3.
In all cases the LAGG is rebuilt within a second, but by then the damage from the interface going down is already done.


# igb1 is single member of the LACP lagg0 group (Onboard Intel 1GB NIC, UTP cable)
2024-03-23 23:41:55 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-23 23:41:56 notice kernel: <6>lagg0: link state changed to UP
2024-03-23 23:43:26 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-23 23:43:27 notice kernel: <6>lagg0: link state changed to UP
2024-03-23 23:44:57 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-23 23:44:58 notice kernel: <6>lagg0: link state changed to UP
2024-03-24 04:42:07 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 04:42:08 notice kernel: <6>lagg0: link state changed to UP
2024-03-24 04:43:39 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 04:43:40 notice kernel: <6>lagg0: link state changed to UP
2024-03-24 04:45:10 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 04:45:11 notice kernel: <6>lagg0: link state changed to UP
2024-03-24 09:42:49 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 09:42:50 notice kernel: <6>lagg0: link state changed to UP
2024-03-24 09:44:20 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 09:44:21 notice kernel: <6>lagg0: link state changed to UP
2024-03-24 09:45:52 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 09:45:53 notice kernel: <6>lagg0: link state changed to UP

# ix1 is single member of the LACP lagg group (PCIe Intel X520-DA2 NIC, 1.25G SFP, MM fiber cable)
2024-03-24 23:16:30 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 23:16:31 notice kernel: <6>lagg0: link state changed to UP
2024-03-24 23:18:00 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 23:18:01 notice kernel: <6>lagg0: link state changed to UP
2024-03-24 23:19:32 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 23:19:33 notice kernel: <6>lagg0: link state changed to UP
2024-03-25 04:16:34 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-25 04:16:35 notice kernel: <6>lagg0: link state changed to UP
2024-03-25 04:18:05 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-25 04:18:06 notice kernel: <6>lagg0: link state changed to UP
2024-03-25 04:19:35 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-25 04:19:36 notice kernel: <6>lagg0: link state changed to UP
2024-03-25 09:17:09 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-25 09:17:10 notice kernel: <6>lagg0: link state changed to UP
2024-03-25 09:18:39 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-25 09:18:40 notice kernel: <6>lagg0: link state changed to UP
2024-03-25 09:20:10 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-25 09:20:11 notice kernel: <6>lagg0: link state changed to UP
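
One way to make the pattern in the log above visible is to compute the gaps between successive "stopped DISTRIBUTING" events. The following is a minimal Python sketch, not part of the original post; the helper name `flap_gaps` is made up for illustration, and the sample lines are copied from the igb1 log above.

```python
from datetime import datetime

MARKER = "Interface stopped DISTRIBUTING"

def flap_gaps(lines):
    """Return the gaps in seconds between successive flap events."""
    stamps = [
        datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        for line in lines
        if MARKER in line
    ]
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]

# Sample lines taken from the igb1 log; "link state changed" lines are ignored.
sample = [
    "2024-03-23 23:41:55 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping",
    "2024-03-23 23:41:56 notice kernel: <6>lagg0: link state changed to UP",
    "2024-03-23 23:43:26 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping",
    "2024-03-23 23:44:57 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping",
    "2024-03-24 04:42:07 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping",
]

gaps = flap_gaps(sample)
print(gaps)
```

On this sample the gaps inside a burst come out at 91 seconds each, and the gap to the next burst at 17830 seconds, which is just under five hours.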


Other activity of the OPNsense box before the flapping is as usual, filterlog entries, regular cron jobs. Nothing special.
Mikrotik switch unfortunately has no logs or traps.

Based on browsing previous LAGG-related posts, I have settled on:

  • OPNsense side active, Switch side passive (but tested also active/active)
  • Use flowid (but tested also without)
  • Fast timeout OFF (default)
  • l2,l3,l4 (but tested also l2 only)
  • Use strict default -- net.link.lagg.lacp.default_strict_mode: 1

This is the current config which results in the log above:

lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=4e53fbb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_UCAST,WOL_MCAST,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
ether 90:e2:ba:xx:xx:xx
laggproto lacp lagghash l2,l3,l4
lagg options:
flags=5<USE_FLOWID,USE_NUMA>
flowid_shift: 16
lagg statistics:
active ports: 1
flapping: 9
lag id: [(8000,90-E2-BA-XX-XX-XX,016B,0000,0000),
(8000,48-A9-8A-XX-XX-XX,0002,0000,0000)]
laggport: ix1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING> state=3d<ACTIVITY,AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
[(8000,90-E2-BA-XX-XX-XX,016B,8000,0002),
(8000,48-A9-8A-XX-XX-XX,0002,8000,000A)]
groups: lagg
media: Ethernet autoselect
status: active
nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>

ix1: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=4e53fbb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_UCAST,WOL_MCAST,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
ether 90:e2:ba:xx:xx:xx
media: Ethernet autoselect (1000baseSX <full-duplex>)
status: active
nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
plugged: SFP/SFP+/SFP28 1000BASE-SX (LC)
vendor: JDSU PN: PLRXPL-VI-S24-27 SN: XXXXXXXXXX DATE: 2013-08-27
module temperature: 44.05 C voltage: 3.33 Volts
lane 1: RX power: 0.26 mW (-5.88 dBm) TX bias: 5.71 mA
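
The `state=3d` value in the laggport line above is a bit field. Below is a minimal sketch of decoding it, assuming the bit order of the LACP actor/partner state byte as defined in IEEE 802.1AX (activity, timeout, aggregation, synchronization, collecting, distributing, defaulted, expired). The helper name `decode_lacp_state` is hypothetical and not part of any FreeBSD tool.

```python
# Bit 0 is the least significant bit of the state byte.
LACP_STATE_BITS = [
    "ACTIVITY", "TIMEOUT", "AGGREGATION", "SYNC",
    "COLLECTING", "DISTRIBUTING", "DEFAULTED", "EXPIRED",
]

def decode_lacp_state(value):
    """Return the names of all flags set in an LACP state byte."""
    return [name for bit, name in enumerate(LACP_STATE_BITS) if value & (1 << bit)]

flags = decode_lacp_state(0x3D)
print(flags)
```

For 0x3d this yields the same five flags that ifconfig prints. TIMEOUT is not set, which under this reading means the port advertises the long (slow) timeout, in line with fast timeout being off.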


I also set net.link.lagg.lacp.debug=1 at times but could not make sense of the output.

Wondering if anybody can shed some light on which bit to flip to make this work.
I am happy to test/change/provide logs for just about any altered config for debugging purposes.

Thank you in advance for your suggestions.

Cheers,

Kei

March 26, 2024, 02:05:49 AM #1 Last Edit: March 26, 2024, 02:13:13 AM by Seimus
Do you see any physical port flap happening at all?
Basically, does the physical port go down and up at the time of the occurrence?
If yes, does it happen before or after LACP breaks?

How is your LACP configured on both ends?
Are you by chance using LACP fast?


Those log entries seem to be about 30 s apart, which points to the default heartbeat (HB) interval for LACP. There is a high chance that one of the sides has a misconfigured LACP timer, so that the HB is missed and the LACP connection breaks. This usually happens when the two ends use different LACP timers, or when LACP fast is used on both devices between two different vendors.

Another possibility is that one of the sides is not responding within the required interval.
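
For reference, the standard LACP timer values can be written down as a small table. This is only a sketch based on the IEEE 802.1AX defaults (30 s slow and 1 s fast periodic transmission, with a timeout of three intervals); the names `LACP_TIMERS` and `partner_times_out` are illustrative, and whether the switch in this thread follows these defaults is not confirmed anywhere here.

```python
# Standard LACP timer defaults in seconds (IEEE 802.1AX).
LACP_TIMERS = {
    "slow": {"periodic": 30, "timeout": 90},
    "fast": {"periodic": 1, "timeout": 3},
}

def partner_times_out(local_rate, partner_tx_interval):
    """True if a partner sending every `partner_tx_interval` seconds
    would exceed the timeout of a side running `local_rate`."""
    return partner_tx_interval > LACP_TIMERS[local_rate]["timeout"]

# One side expects fast LACPDUs while the other keeps sending every 30 s.
print(partner_times_out("fast", 30))
# Both sides on the slow rate.
print(partner_times_out("slow", 30))
```

The first case is the mismatch described above; the second should stay up as long as no more than two consecutive LACPDUs are lost or late.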

Regards,
S.
Networking is love. You may hate it, but in the end, you always come back to it.

OPNSense HW
APU2D2 - deceased
N5105 - i226-V | Patriot 2x8G 3200 DDR4 | L 790 512G - VM HA(SOON)
N100   - i226-V | Crucial 16G  4800 DDR5 | S 980 500G - PROD

Hi Seimus,
thank you for looking at this.

There is no indication of the physical port going down in the OPNsense logs. I don't have monitoring set up on the switch, so I have no information from that side. The whole thing is in a cupboard, so no eyes on it either. I strongly doubt that the port goes down physically.

On the switch side, there is only an active/passive/static toggle, with static meaning a non-LACP LAG. There are no other options to change or tune, and the documentation says nothing about other parameters either.
I have now run one night with passive on the switch and one night with active on the switch. The behaviour is identical.

On the OPNsense side I have been cycling through the different options with absolutely no effect.
I have now settled on L2 hashing only, based on a forum post on the switch vendor's side.
LACP fast timeout is switched off on the OPNsense side; there is no such option on the switch.
I have now switched off flowid on OPNsense. It was on because of a forum post here. No change.

I have now switched on sysctl net.link.lagg.lacp.debug=1.
I am observing a lacpdu transmit/receive every 30 seconds. The delay between transmit and receive is 5-15 seconds.


2024-03-26 08:55:48 notice kernel: ix1: lacpdu transmit
2024-03-26 08:55:48 notice kernel: actor=(8000,90-E2-BA-XX-XX-XX,016B,8000,0002)
2024-03-26 08:55:48 notice kernel: actor.state=3d<ACTIVITY,AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
2024-03-26 08:55:48 notice kernel: partner=(8000,48-A9-8A-XX-XX-XX,0002,8000,000A)
2024-03-26 08:55:48 notice kernel: partner.state=3c<AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
2024-03-26 08:55:48 notice kernel: maxdelay=0
2024-03-26 08:55:54 notice kernel: ix1: lacpdu receive
2024-03-26 08:55:54 notice kernel: actor=(8000,48-A9-8A-XX-XX-XX,0002,8000,000A)
2024-03-26 08:55:54 notice kernel: actor.state=3c<AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
2024-03-26 08:55:54 notice kernel: partner=(8000,90-E2-BA-XX-XX-XX,016B,8000,0002)
2024-03-26 08:55:54 notice kernel: partner.state=3d<ACTIVITY,AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
2024-03-26 08:55:54 notice kernel: maxdelay=0
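
To track the transmit/receive delay over a longer capture, the debug output can be paired up mechanically. The following is a rough Python sketch that assumes the line format shown above; `tx_rx_delays` is a hypothetical helper, and it simply matches each "lacpdu transmit" with the next "lacpdu receive".

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def tx_rx_delays(lines):
    """Pair each 'lacpdu transmit' with the next 'lacpdu receive'
    and return the delays in seconds."""
    delays = []
    last_tx = None
    for line in lines:
        if "lacpdu transmit" in line:
            last_tx = datetime.strptime(line[:19], FMT)
        elif "lacpdu receive" in line and last_tx is not None:
            rx = datetime.strptime(line[:19], FMT)
            delays.append(int((rx - last_tx).total_seconds()))
            last_tx = None  # wait for the next transmit
    return delays

# Shortened sample taken from the debug output above.
sample = [
    "2024-03-26 08:55:48 notice kernel: ix1: lacpdu transmit",
    "2024-03-26 08:55:48 notice kernel: maxdelay=0",
    "2024-03-26 08:55:54 notice kernel: ix1: lacpdu receive",
    "2024-03-26 08:55:54 notice kernel: maxdelay=0",
]

delays = tx_rx_delays(sample)
print(delays)
```

For the excerpt above this gives a single delay of 6 seconds. Plotting the list from a multi-hour log should show whether the delay is constant or drifting.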


I will leave this log on for a couple of hours and try to find the spot around the flapping.

I also plan to take this to the vendor MikroTik.

Thanks for taking your time looking at this.
If there is any other idea, I am happy to try an alternative config.

Cheers,

Kei

I had an OPNsense connected to a MikroTik and the LAGG flapped all the time. I tried all the settings and nothing worked.

Switched to Netgear, now everything works.
Hardware:
DEC740

Quote from: Monviech on March 26, 2024, 09:17:18 AM
I had an OPNsense connected to a MikroTik and the LAGG flapped all the time. I tried all the settings and nothing worked.
Oops  :)

I just replaced my Cisco 2960-L with a Mikrotik CRS326-24G-2S+IN and noticed nothing of that sort. RouterOS 7.14.1.

Kind regards
Patrick
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)


I'll be able to add another data point in a couple of days - I will probably receive a DEC750 today. SFP+ optics still in the works at fs.com.

Hi All,

first of all, MikroTik comes with RouterOS, SwOS and SwOS lite - three completely different platforms. So it is hard to generalize across the vendor's line.
I have multiple MikroTiks and am quite satisfied. Esp. with price/performance ratio and the "does its job, nothing extra" attitude.

Back to the LACP, I have now the detailed logs with sysctl net.link.lagg.lacp.debug=1 and there is a pattern.

OPNsense initiates an lacpdu transmit at regular 30-second intervals.
The MikroTik box answers these with an ever-increasing delay, starting at 1 second and growing up to and including 30 seconds (that is a missed heartbeat).
OPNsense then retransmits and receives the (delayed) answer to the previous transmit. After two more transmissions the box and the switch are back in sync, with the switch again answering at a 1-second delay.
These cycles are a bit over 27 minutes long.
During these cycles, the lagg0 interface on the OPNsense box stays UP, i.e. there is no perceived disruption.

When OPNsense does not get a sync after the third retransmission, it stops the interface and restarts it. This is what I observed previously, and it happens every 5 hours.
It takes OPNsense and the switch multiple attempts to get back into sync. This is the threefold flapping while re-syncing.
Once in sync, the 27-minute cycles start again.
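
The numbers above allow a rough back-of-the-envelope estimate of the drift. The sketch below takes the 27-minute cycle and the 1 to 30 second delay range from the description above. It additionally assumes that the switch sends its LACPDUs on its own free-running timer rather than as replies to OPNsense, which the logs do not prove. The function name is illustrative.

```python
def drift_per_interval(cycle_s, interval_s, delay_start_s, delay_end_s):
    """Average growth of the transmit-to-receive delay per transmit interval, in seconds."""
    intervals = cycle_s / interval_s
    return (delay_end_s - delay_start_s) / intervals

drift = drift_per_interval(cycle_s=27 * 60, interval_s=30, delay_start_s=1, delay_end_s=30)
print(round(drift, 3))       # delay growth per 30 s interval
print(round(30 + drift, 3))  # implied switch period, under the assumption above
```

Under that assumption the delay grows by roughly half a second per 30-second interval, meaning the switch's timer would be running a little under 2% slow relative to the OPNsense clock.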


From the looks of it, this is a MikroTik problem (drift in answering), and I am taking it there. Obviously the observations are biased: the log timestamps come from the OPNsense clock, so 30 seconds on the box might not be 30 seconds in reality.

Thanks everybody for your support!
If there is an update from the vendor, I will post it here.

Cheers,

Kei

Quote from: kei on March 26, 2024, 11:50:10 AM
From the looks of it, this is a MikroTik problem (drift in answering), and I am taking it there. Obviously the observations are biased: the log timestamps come from the OPNsense clock, so 30 seconds on the box might not be 30 seconds in reality.
Which of their multiple OSes are you running on that particular switch?

Kind regards,
Patrick

The CSS610 comes with SwOS lite as the only option.
They also have devices with RouterOS only and devices with RouterOS and SwOS as a boot-time option, which adds to the confusion.

From what I understand RouterOS is just another Linux (which makes it unattractive to me), whereas SwOS (lite) are built in-house and are "barebone", only a UI over the bits that can be set on the underlying hardware.

As said, except for this LACP problem (in my home network), I am quite happy with them.


The CRS326-24G-2S+* come with both options. I picked RouterOS specifically because SwOS looked somewhat spooky in the bonding/lagg department. When you set 2 ports to "active" you cannot assign them to a group/bond interface; that is only possible when you set them to "static". How am I going to create more than one lagg interface, then? "active" is what you normally want for LACP. Weird.

I'm sure that SwOS would be the better option, since the MikroTik I used with RouterOS also had big performance problems with lots of small packets in high-traffic scenarios. Running a backup job with Nakivo over that MikroTik switch gave about 600 Mbit/s, and with the Netgear it got up to 5-6 Gbit/s.

Checking with iperf confirmed some kind of switching performance bottleneck with RouterOS. Since you have to create these weird bridges for VLANs, they're probably CPU-bottlenecked. The CPU of the MikroTik was always at 100% when it "switched" lots of small packets.

It could also have been a configuration issue on my side, though. I don't have time to troubleshoot switches; for me they have to work. So I only buy Netgear and Juniper now, even if they're more expensive.

@monviech Probably a less-than-optimal configuration, a model that does not support hardware-accelerated switching in RouterOS, an outdated release, or any combination thereof.

With RouterOS 7.14.1 I have only one bridge, VLAN filtering enabled, and all ports run hardware-accelerated. I would not buy one for use at work, given how utterly bad the UI is. The CLI is a bit better but not well documented.

At roughly 200€ new the device is a steal.