LAGG flapping at regular time intervals

nero355 · June 12, 2026, 11:24:32 PM

Quote from: Patrick M. Hausen on June 12, 2026, 09:59:00 PM
Update: I can confirm there might be an interoperability problem between Mikrotik devices and FreeBSD/OPNsense concerning LACP.

Since I changed the LACP timeout from 1s/fast to 30s/slow on both sides and disabled "strict" mode on the OPNsense side, the connection now seems to be stable.
On the Mikrotik side there is nothing to adjust but the timeout. I also disabled flowid explicitly on OPNsense, but if I am not mistaken that has been the default all along anyway.

I think I can confirm this :

Quote from: Seimus on June 12, 2026, 11:05:43 PM
Yeah, the fast timeout across vendors is always problematic.
This applies not only to FreeBSD and Mikrotik, but to other vendors as well.

When I used to build a lot of Linux bonding LACP links, we used to set this:

Quote
bond-lacp-rate rate
Denotes the rate of LACPDU requested from the peer.
The rate can be given as string or as numerical value.

Valid values are slow (0) and fast (1). The default is slow.

To SLOW too :)
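
The quoted man page says the rate can be given as a string or as a number. A minimal sketch in C of how such a value could be parsed; `parse_lacp_rate` and the enum names are hypothetical and not taken from ifupdown or the bonding driver:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names for illustration; not from ifupdown or the kernel. */
enum lacp_rate { LACP_RATE_SLOW = 0, LACP_RATE_FAST = 1 };

/*
 * Parse a bond-lacp-rate value given as "slow"/"fast" or "0"/"1".
 * A NULL or empty value falls back to the documented default (slow).
 * Returns 0 on success, -1 if the value is not recognised.
 */
int
parse_lacp_rate(const char *value, enum lacp_rate *out)
{
	assert(out != NULL);

	if (value == NULL || value[0] == '\0') {
		*out = LACP_RATE_SLOW;
		return (0);
	}
	if (strcmp(value, "slow") == 0 || strcmp(value, "0") == 0) {
		*out = LACP_RATE_SLOW;
		return (0);
	}
	if (strcmp(value, "fast") == 0 || strcmp(value, "1") == 0) {
		*out = LACP_RATE_FAST;
		return (0);
	}
	return (-1);
}
```

Leaving the option out entirely therefore already gives the slow rate; it only needs to be set explicitly to document the intent.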

And this :

Quote
bond-miimon interval
Denotes the MII link monitoring frequency in milliseconds.
This determines how often the link state of each slave is inspected for link failures.

A value of zero disables MII link monitoring. The default is 0.

Was always set at 100 IIRC...

Source: https://manpages.ubuntu.com/manpages/jammy/man5/interfaces-bond.5.html

Seimus · Reply #16 - Re: LAGG flapping at regular time intervals

Fast timeout is a pain in enterprise environments too.
At one point I had enough and enforced the slow timeout (30s) company-wide for cross-vendor connections.

Because the fast timeout was constantly causing things like FW switchovers and other nonsense....

And that's the reason it's in the OPNsense docs too, because I was crying to Cedrik when he was writing it :)
https://github.com/opnsense/docs/pull/610#issuecomment-2424144823

Regards,
S.

Patrick M. Hausen · Reply #17 - Re: LAGG flapping at regular time intervals

No more flapping during the night and half a day. So it seems the conservative approach is to use slow timeouts.

I vaguely remember reconfiguring all my LAGG ports to use the same settings across all devices a couple of months ago. Probably I changed OPNsense-switch to fast at that time.

Seimus · Reply #18 - Re: LAGG flapping at regular time intervals

I am not surprised you configured Fast timeout.
The 1s re-convergence vs 30s is a BIG deal.

But if it's not working as it should, it causes more trouble, because in the worst case it can cause insane micro-flaps.
I have seen outage windows of 5-15 min with Fast timeout...
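
To put numbers on the difference: as far as I know the standard's timer values, LACPDUs are sent every 1s (fast) or 30s (slow), and the partner information expires after three intervals without a PDU, so 3s versus 90s. A small sketch of that arithmetic in C; the function and macro names are made up for illustration and are not the FreeBSD ones:

```c
#include <assert.h>
#include <stdbool.h>

/* Periodic transmit intervals in seconds (IEEE 802.3ad timer values). */
#define FAST_PERIODIC_S	1
#define SLOW_PERIODIC_S	30
/* Partner info is treated as expired after this many intervals. */
#define TIMEOUT_FACTOR	3

/* Seconds without a received LACPDU before the partner info expires. */
int
lacp_expiry_seconds(bool fast)
{
	return (TIMEOUT_FACTOR * (fast ? FAST_PERIODIC_S : SLOW_PERIODIC_S));
}

/*
 * Would a gap of gap_s seconds with no LACPDUs received (for example a
 * busy or stalled control plane on the peer) expire the partner info?
 */
bool
lacp_gap_expires_partner(bool fast, int gap_s)
{
	assert(gap_s >= 0);
	return (gap_s >= lacp_expiry_seconds(fast));
}
```

With these values a 5 second hiccup on one side takes a fast-timeout port out of the aggregate, while a slow-timeout port does not even notice it.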

Regards,
S.

Patrick M. Hausen · Reply #19 - Re: LAGG flapping at regular time intervals

This home lab would of course work just as well with just one link per system. I only use LAGG for all core infrastructure because I can - and to gain experience with such setups.

In the production DC we have MLAG to catch the complete loss of a single switch. I *think* we use 30s - which is good enough for hosted web applications, IMHO. STP convergence is in the same time range. Flapping of course is a different beast altogether and somehow even worse than a complete loss of connectivity.

I could not find any documentation on the "strict" option, though. So I turned to the tried and true method of "use the source, Luke".

In net/if_lagg.c, around line 1538 we find:

			struct lacp_softc *lsc;
			struct lacp_port *lp;

			lsc = (struct lacp_softc *)sc->sc_psc;

			switch (ro->ro_opts) {
			[...]
			case LAGG_OPT_LACP_STRICT:
				lsc->lsc_strict_mode = 1;
				break;
			case -LAGG_OPT_LACP_STRICT:
				lsc->lsc_strict_mode = 0;
				break;
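
The switch above uses the sign of the option code to tell "set" from "clear". A stand-alone sketch of that idiom, with a hypothetical option value and struct (the real definitions live in net/if_lagg.h and are not reproduced here):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical option code; the real value lives in net/if_lagg.h. */
#define OPT_LACP_STRICT	0x10

/* Hypothetical stand-in for the LACP softc; only the flag we care about. */
struct lacp_opts {
	bool strict_mode;
};

/*
 * Apply one option request: the positive code sets the flag, the negated
 * code clears it, mirroring the switch quoted above.
 * Returns 0 if the code was handled, -1 otherwise.
 */
int
apply_opt(struct lacp_opts *o, int opt)
{
	assert(o != NULL);

	switch (opt) {
	case OPT_LACP_STRICT:
		o->strict_mode = true;
		return (0);
	case -OPT_LACP_STRICT:
		o->strict_mode = false;
		return (0);
	default:
		return (-1);
	}
}
```

So all the ioctl path does for this option is flip one flag; whatever "strict" means has to be found where that flag is read.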

In net/ieee8023ad_lacp.c we have a per-LACP-partner bit mask that more or less defines which variables we accept from the partner on reception:

/*
 * partner administration variables.
 * XXX should be configurable.
 */

static const struct lacp_peerinfo lacp_partner_admin_optimistic = {
	.lip_systemid = { .lsi_prio = 0xffff },
	.lip_portid = { .lpi_prio = 0xffff },
	.lip_state = LACP_STATE_SYNC | LACP_STATE_AGGREGATION |
	    LACP_STATE_COLLECTING | LACP_STATE_DISTRIBUTING,
};

static const struct lacp_peerinfo lacp_partner_admin_strict = {
	.lip_systemid = { .lsi_prio = 0xffff },
	.lip_portid = { .lpi_prio = 0xffff },
	.lip_state = 0,
};
	[...]
	if (lp->lp_lsc->lsc_strict_mode)
		lp->lp_partner = lacp_partner_admin_strict;
	else
		lp->lp_partner = lacp_partner_admin_optimistic;

The only actual code paths using that mechanism are at lines 1732 ff. and 1812 ff.:

		/*
		 * XXX Maintain legacy behavior of leaving the
		 * LACP_STATE_SYNC bit unchanged from the partner's
		 * advertisement if lsc_strict_mode is false.
		 * TODO: We should re-examine the concept of the "strict mode"
		 * to ensure it makes sense to maintain a non-strict mode.
		 */
		if (lp->lp_lsc->lsc_strict_mode)
			lp->lp_partner.lip_state |= LACP_STATE_SYNC;
[...]
static void
lacp_sm_rx_update_default_selected(struct lacp_port *lp)
{

	LACP_TRACE(lp);

	if (lp->lp_lsc->lsc_strict_mode)
		lacp_sm_rx_update_selected_from_peerinfo(lp,
		    &lacp_partner_admin_strict);
	else
		lacp_sm_rx_update_selected_from_peerinfo(lp,
		    &lacp_partner_admin_optimistic);
}

So essentially strict mode clears some of the information received from the partner because these flags are (supposedly) not part of the 802.3ad standard. Looks like more or less a no-op to me. See the "XXX" comment above.
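
To make the difference between the two default structs quoted above concrete, here is a small model in C. The bit positions follow the 802.3ad state octet as far as I know it; `default_partner_state` and `partner_allows_traffic` are hypothetical helpers, and whether the kernel uses the defaulted state in exactly this way is an assumption, not something I verified in the source:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* LACP state bits, in the bit order used by the 802.3ad state octet. */
#define ST_ACTIVITY	0x01
#define ST_TIMEOUT	0x02
#define ST_AGGREGATION	0x04
#define ST_SYNC		0x08
#define ST_COLLECTING	0x10
#define ST_DISTRIBUTING	0x20

/* Partner state assumed when no LACPDU has been received from it. */
uint8_t
default_partner_state(bool strict)
{
	uint8_t state;

	state = strict ? 0 :
	    (ST_SYNC | ST_AGGREGATION | ST_COLLECTING | ST_DISTRIBUTING);
	/* Neither default claims an active or short-timeout partner. */
	assert((state & (ST_ACTIVITY | ST_TIMEOUT)) == 0);
	return (state);
}

/* Simplified check: does the partner state alone permit traffic? */
bool
partner_allows_traffic(uint8_t state)
{
	const uint8_t need = ST_SYNC | ST_COLLECTING | ST_DISTRIBUTING;

	return ((state & need) == need);
}
```

In this simplified model a silent partner looks usable with the optimistic defaults and unusable with the strict ones, which may matter more than a no-op if the peer stops sending LACPDUs.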

I'll re-enable it and watch what happens.

Seimus · Reply #20 - Re: LAGG flapping at regular time intervals

Quote from: Patrick M. Hausen on Today at 02:35:07 PM
In the production DC we have MLAG to catch the complete loss of a single switch. I *think* we use 30s - which is good enough for hosted web applications, IMHO. STP convergence is in the same time range. Flapping of course is a different beast altogether and somehow even worse than a complete loss of connectivity.

Yeah, MEC, MLAG or VPC are the ultimate forms of deployment you want in production.
Flapping on a LAGG is always a story in itself. And you are right, it causes even worse problems than if the link would just die.

Quote from: Patrick M. Hausen on Today at 02:35:07 PM
So essentially strict mode clears some of the information received from the partner because these fields are (supposedly) not part of the 802.3ad standard. Looks like more or less a no-op to me. See the "XXX" comment above.

I'll re-enable it and watch what happens.

I always thought that the Strict mode enforces the usage of LACP within the LAGG. Meaning if both sides are not actively talking proper LACP the LAGG will not establish....

Regards,
S.

Patrick M. Hausen · Reply #21 - Re: LAGG flapping at regular time intervals

Quote from: Seimus on Today at 02:55:33 PM
Meaning if both sides are not actively talking proper LACP the LAGG will not establish....

Check the source please - possibly I am reading it wrong. I have a fair knowledge of C but no experience with these parts of the kernel code. All in all I am stuck in the 70s, Lions' Commentary and of course Minix ;-)

#15 · nero355 · June 12, 2026, 11:24:32 PM
#16 · Seimus · Today at 12:44:56 AM
#17 · Patrick M. Hausen · Today at 02:09:19 PM
#18 · Seimus · Today at 02:16:51 PM
#19 · Patrick M. Hausen · Today at 02:35:07 PM (Last Edit: Today at 02:48:48 PM by Patrick M. Hausen)
#20 · Seimus · Today at 02:55:33 PM
#21 · Patrick M. Hausen · Today at 03:13:55 PM