General Discussion / LAGG flapping at regular time intervals
« on: March 25, 2024, 09:49:41 am »
Dear All,
after days of plugging and testing I am turning to the forum for help finding the flaw in my setup.
The hardware is straightforward: OPNsense running on a Supermicro A1SAi-2550F, connected to a Mikrotik CSS610-8P-2S+IN that serves 3 WAPs and other equipment (home network).
Multiple VLANs are set up on top of the LAGG.
Upstream is 1G fibre, connected to an Intel X520-DA2 plugged into the Supermicro board.
All equipment is on the latest BIOS and firmware versions.
The connection between the OPNsense box and the switch is an LACP LAGG. I tested several member combinations (2 onboard UTP, 1 onboard UTP, 1 onboard UTP + 1 GB SFP, 1 GB SFP).
With multiple LACP legs, the LAGG drops out several times a day at random intervals.
With a single LACP leg, the dropout is quite regular: roughly every 5 hours, in bursts of 3 drops about 90 seconds apart.
In all cases the LAGG is rebuilt within a second, yet by then the damage from the interface going down is already done.
Code:
# igb1 is single member of the LACP lagg0 group (Onboard Intel 1GB NIC, UTP cable)
2024-03-23 23:41:55 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-23 23:41:56 notice kernel: <6>lagg0: link state changed to UP
2024-03-23 23:43:26 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-23 23:43:27 notice kernel: <6>lagg0: link state changed to UP
2024-03-23 23:44:57 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-23 23:44:58 notice kernel: <6>lagg0: link state changed to UP
2024-03-24 04:42:07 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 04:42:08 notice kernel: <6>lagg0: link state changed to UP
2024-03-24 04:43:39 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 04:43:40 notice kernel: <6>lagg0: link state changed to UP
2024-03-24 04:45:10 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 04:45:11 notice kernel: <6>lagg0: link state changed to UP
2024-03-24 09:42:49 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 09:42:50 notice kernel: <6>lagg0: link state changed to UP
2024-03-24 09:44:20 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 09:44:21 notice kernel: <6>lagg0: link state changed to UP
2024-03-24 09:45:52 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 09:45:53 notice kernel: <6>lagg0: link state changed to UP
# ix1 is single member of the LACP lagg0 group (PCIe Intel X520-DA2 NIC, 1.25G SFP, MM fiber cable)
2024-03-24 23:16:30 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 23:16:31 notice kernel: <6>lagg0: link state changed to UP
2024-03-24 23:18:00 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 23:18:01 notice kernel: <6>lagg0: link state changed to UP
2024-03-24 23:19:32 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 23:19:33 notice kernel: <6>lagg0: link state changed to UP
2024-03-25 04:16:34 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-25 04:16:35 notice kernel: <6>lagg0: link state changed to UP
2024-03-25 04:18:05 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-25 04:18:06 notice kernel: <6>lagg0: link state changed to UP
2024-03-25 04:19:35 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-25 04:19:36 notice kernel: <6>lagg0: link state changed to UP
2024-03-25 09:17:09 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-25 09:17:10 notice kernel: <6>lagg0: link state changed to UP
2024-03-25 09:18:39 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-25 09:18:40 notice kernel: <6>lagg0: link state changed to UP
2024-03-25 09:20:10 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-25 09:20:11 notice kernel: <6>lagg0: link state changed to UP
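To make the timing pattern in the log above easier to see, here is a minimal Python sketch that pulls out the "stopped DISTRIBUTING" lines and groups them into bursts. The helper names (`flap_times`, `group_bursts`) and the 5-minute burst threshold are assumptions made for illustration, and `SAMPLE` is just an excerpt of the igb1 log lines.

```python
import re
from datetime import datetime, timedelta

# Matches the timestamp and port name of a "stopped DISTRIBUTING" kernel line.
FLAP_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*?(\w+): Interface stopped DISTRIBUTING"
)

def flap_times(log_text):
    """Return (timestamp, port) pairs for every 'stopped DISTRIBUTING' line."""
    events = []
    for line in log_text.splitlines():
        m = FLAP_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            events.append((ts, m.group(2)))
    return events

def group_bursts(times, max_gap=timedelta(minutes=5)):
    """Group sorted timestamps; a new burst starts after a gap larger than max_gap."""
    bursts = []
    for t in times:
        if bursts and t - bursts[-1][-1] <= max_gap:
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return bursts

# Excerpt of the igb1 log; paste the full log here to analyse all events.
SAMPLE = """\
2024-03-23 23:41:55 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-23 23:41:56 notice kernel: <6>lagg0: link state changed to UP
2024-03-23 23:43:26 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-23 23:44:57 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 04:42:07 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 04:43:39 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 04:45:10 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
"""

if __name__ == "__main__":
    bursts = group_bursts([t for t, _port in flap_times(SAMPLE)])
    for i, burst in enumerate(bursts):
        gaps = [int((b - a).total_seconds()) for a, b in zip(burst, burst[1:])]
        print(f"burst {i}: start={burst[0]} events={len(burst)} gaps_s={gaps}")
    for a, b in zip(bursts, bursts[1:]):
        print("time between burst starts:", b[0] - a[0])
```

On this excerpt the gaps inside a burst come out at 91 to 92 seconds and the bursts start about 5 hours apart. The spacing inside a burst is close to the 90-second long timeout of LACP (three missed 30-second LACPDUs), though that alone does not establish the cause.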
Other activity on the OPNsense box before the flapping is as usual: filterlog entries and regular cron jobs. Nothing special.
The Mikrotik switch unfortunately offers no logs or traps.
Based on browsing previous LAGG-related posts, I have settled on the following:
- OPNsense side active, Switch side passive (but tested also active/active)
- Use flowid (but tested also without)
- Fast timeout OFF (default)
- Hash layers l2,l3,l4 (but tested also l2 only)
- Use strict default -- net.link.lagg.lacp.default_strict_mode: 1
This is the current config, which produces the log above:
Code:
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=4e53fbb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_UCAST,WOL_MCAST,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
ether 90:e2:ba:xx:xx:xx
laggproto lacp lagghash l2,l3,l4
lagg options:
flags=5<USE_FLOWID,USE_NUMA>
flowid_shift: 16
lagg statistics:
active ports: 1
flapping: 9
lag id: [(8000,90-E2-BA-XX-XX-XX,016B,0000,0000),
(8000,48-A9-8A-XX-XX-XX,0002,0000,0000)]
laggport: ix1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING> state=3d<ACTIVITY,AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
[(8000,90-E2-BA-XX-XX-XX,016B,8000,0002),
(8000,48-A9-8A-XX-XX-XX,0002,8000,000A)]
groups: lagg
media: Ethernet autoselect
status: active
nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
ix1: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=4e53fbb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_UCAST,WOL_MCAST,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
ether 90:e2:ba:xx:xx:xx
media: Ethernet autoselect (1000baseSX <full-duplex>)
status: active
nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
plugged: SFP/SFP+/SFP28 1000BASE-SX (LC)
vendor: JDSU PN: PLRXPL-VI-S24-27 SN: XXXXXXXXXX DATE: 2013-08-27
module temperature: 44.05 C voltage: 3.33 Volts
lane 1: RX power: 0.26 mW (-5.88 dBm) TX bias: 5.71 mA
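For readers comparing their own output, the `state=3d` value on the laggport line can be decoded by hand. The sketch below assumes the standard LACP actor/partner state bit layout from IEEE 802.1AX; the function name `decode_lacp_state` is made up for this example.

```python
# LACP state bits per IEEE 802.1AX, lowest bit first.
LACP_STATE_BITS = [
    (0x01, "ACTIVITY"),      # port sends LACPDUs actively
    (0x02, "TIMEOUT"),       # set = short (fast) timeout, clear = long timeout
    (0x04, "AGGREGATION"),   # link may be aggregated
    (0x08, "SYNC"),          # port is in sync with its aggregator
    (0x10, "COLLECTING"),    # receiving frames is enabled
    (0x20, "DISTRIBUTING"),  # sending frames is enabled
    (0x40, "DEFAULTED"),     # using default partner info (no LACPDU received)
    (0x80, "EXPIRED"),       # partner info has expired
]

def decode_lacp_state(value):
    """Return the names of all state bits set in an LACP state byte."""
    return [name for bit, name in LACP_STATE_BITS if value & bit]

if __name__ == "__main__":
    print(decode_lacp_state(0x3D))
```

For `0x3d` this yields ACTIVITY, AGGREGATION, SYNC, COLLECTING and DISTRIBUTING, matching the ifconfig output. The TIMEOUT bit is clear, which corresponds to the long (slow) timeout and is consistent with "Fast timeout OFF".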
I also set net.link.lagg.lacp.debug=1 at times but could not make sense of the output.
I am wondering whether anybody can shed some light on which bit to flip to make this work.
I am happy to test/change/provide logs for just about any altered config for debugging purposes.
Thank you in advance for your suggestions.
Cheers,
Kei