LAG issue

Started by firewallfun, November 28, 2024, 05:47:29 PM

December 01, 2024, 07:42:04 PM #15 Last Edit: December 01, 2024, 07:50:27 PM by firewallfun
Now back to topic - LAG issue:

While LAN now works, there are still some issues, as you can see below. On the OPNsense boxes, only one of the two ports in each LAG reaches the bndl state.

I'm including a pfSense box I also have in an LACP lagg (fast) that works 100%; that's the last LACP lagg shown in the list. It has the same config on the switch as the OPNsense boxes.

If you look at the OPNsense units, the switch lists a blank Dev ID for one port on each and even shows that port as requesting slow LACPDUs, while the working port shows fast. There is no option to mix fast and slow within a lagg pair, so I assume it is not actually requesting slow.

Master FW

Aggregate port 10:
Local information:
                                     LACP port       Oper    Port    Port
Port           Flags     State       Priority        Key     Number  State
---------------------------------------------------------------------------
Te1/0/10       FA        bndl        32768           0xa     0xa     0x3f
Te2/0/10       FA        susp        32768           0xa     0x2c    0x47

Partner information:
                         LACP port                  Oper    Port     Port
Port           Flags     Priority      Dev ID       Key     Number   State
--------------------------------------------------------------------------
Te1/0/10       FA        32768     207c.14f5.9166   0x1b2   0x7      0x3f
Te2/0/10       SP        0         0000.0000.0000   0x0     0x0      0x0
FS#show lacp summary 2


Flags:  S - Device is requesting Slow LACPDUs   F - Device is requesting Fast LACPDUs.
A - Device is in active mode.        P - Device is in passive mode.

Backup FW

Aggregate port 2:

Local information:
                                     LACP port       Oper    Port    Port
Port           Flags     State       Priority        Key     Number  State
---------------------------------------------------------------------------
Te1/0/2        FA        susp        32768           0x2     0x2     0x47
Te2/0/2        FA        bndl        32768           0x2     0x24    0x3f

Partner information:
                         LACP port                  Oper    Port     Port
Port           Flags     Priority      Dev ID       Key     Number   State
--------------------------------------------------------------------------
Te1/0/2        SP        0         0000.0000.0000   0x0     0x0      0x0
Te2/0/2        FA        32768     207c.14f5.916f   0x1d2   0x8      0x3f
FS#show lacp summary 1

Flags:  S - Device is requesting Slow LACPDUs   F - Device is requesting Fast LACPDUs.
A - Device is in active mode.        P - Device is in passive mode.


pfSense (not OPNsense) unit I already have working, with same config on switch

Aggregate port 1:

Local information:
                                     LACP port       Oper    Port    Port
Port           Flags     State       Priority        Key     Number  State
---------------------------------------------------------------------------
Te1/0/1        FA        bndl        32768           0x1     0x1     0x3f
Te2/0/18       FA        bndl        32768           0x1     0x34    0x3f

Partner information:
                         LACP port                  Oper    Port     Port
Port           Flags     Priority      Dev ID       Key     Number   State
--------------------------------------------------------------------------
Te1/0/1        FA        32768     0cc4.7aaa.fba5   0x14b   0x2      0x3f
Te2/0/18       FA        32768     0cc4.7aaa.fba5   0x14b   0x4      0x3f
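
A note on reading the "Port State" column in the tables above: the following is a minimal sketch that decodes those hex values into flag names. It assumes the switch prints the standard IEEE 802.1AX LACP state octet (bit 0 = Activity up to bit 7 = Expired); the FS output itself does not confirm this. The function name decode_lacp_state is made up for illustration.

```python
# Decode an LACP port-state octet into its flag names.
# Assumption: the switch's "Port State" column is the standard
# IEEE 802.1AX state octet, with bit 0 as the least significant bit.
LACP_STATE_BITS = [
    "Activity",         # bit 0: 1 = active LACP, 0 = passive
    "Timeout",          # bit 1: 1 = short timeout (fast), 0 = long (slow)
    "Aggregation",      # bit 2: link may be aggregated
    "Synchronization",  # bit 3: port is in sync with its aggregator
    "Collecting",       # bit 4: receiving frames on this link is enabled
    "Distributing",     # bit 5: sending frames on this link is enabled
    "Defaulted",        # bit 6: using default partner info (no LACPDU accepted)
    "Expired",          # bit 7: receive state machine is in EXPIRED
]


def decode_lacp_state(value: int) -> list[str]:
    """Return the names of all flags set in the state octet."""
    return [name for bit, name in enumerate(LACP_STATE_BITS) if value & (1 << bit)]


# State values that show up in this thread.
for state in (0x3F, 0x47, 0x3D, 0x45):
    print(f"{state:#04x}: {', '.join(decode_lacp_state(state))}")
```

Under that assumption, 0x3f on the bndl ports means a fully synchronized link with short timeout. 0x47 on the susp ports has Defaulted set and Synchronization, Collecting and Distributing clear. Defaulted means the switch is using default partner information on that port because it has not accepted an LACPDU from the other side, which would fit the all-zero Dev ID. The "SP" flags on those rows may then just be placeholder defaults rather than something OPNsense actually requested.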

And here is the running config for the LACP member interfaces on the switch - I just picked one of the two failing firewalls, as it fails the same way on both boxes:

FS#show running-config interface Te2/0/10

Building configuration...
Current configuration: 112 bytes

interface TenGigabitEthernet 2/0/10
description FW3
port-group 10 mode active
lacp short-timeout
FS#show running-config interface Te1/0/
FS#show running-config interface Te1/0/10

Building configuration...
Current configuration: 112 bytes

interface TenGigabitEthernet 1/0/10
description FW3
port-group 10 mode active
lacp short-timeout

December 01, 2024, 08:24:43 PM #17 Last Edit: December 01, 2024, 08:32:47 PM by firewallfun
And the switch clearly says that LACP is not enabled on one of the ports. So two sets of cables, on two machines - and both have the exact same problem. It must be a bonding error in the LACP setting in OPNsense (since it works on pfSense).

I have also removed the port from the LACP lagg, and then there were no issues with the member port in question; it worked just fine on its own without LACP.

Both the switch and the OPNsense box show link light / no light when I plug/unplug the cable on the port.

(5)Notifications
LACP
SUSPEND
Interface TenGigabitEthernet 1/0/2 suspended: LACP currently not enabled on the remote port.
2024-12-01 14:06:52

show lacp counters

Aggregate port 2:
Port          InPkts    OutPkts
-------------------------------
Te1/0/2        798391    1170027
Te2/0/2        945838    885832

December 01, 2024, 09:14:18 PM #18 Last Edit: December 01, 2024, 09:34:10 PM by firewallfun
I have disconnected the LACP lagg and now tested the LAN on the 2 individual ports that make up aggregate port 2. No problems, each port works perfectly on its own.

I have also double-checked that the MAC address seen for each individual port on the switch matches the one in OPNsense, so I am 100% sure the ports are physically connected the way I think.

As soon as the ports join the LACP team, only one member of the team shows up correctly.

Tried to set everything to slow, both in my switch and on the lagg0.

Aggregate port 2:

Local information:
                                     LACP port       Oper    Port    Port
Port           Flags     State       Priority        Key     Number  State
---------------------------------------------------------------------------
Te1/0/2        SA        susp        32768           0x2     0x2     0x45
Te2/0/2        SA        bndl        32768           0x2     0x24    0x3d

Partner information:
                         LACP port                  Oper    Port     Port
Port           Flags     Priority      Dev ID       Key     Number   State
--------------------------------------------------------------------------
Te1/0/2        SP        0         0000.0000.0000   0x0     0x0      0x0
Te2/0/2        SA        32768     207c.14f5.916f   0x1d2   0x8      0x3d


During reboot, when OPNsense is down, it shows this status on the switch (correctly):

Aggregate port 2:

Local information:
                                     LACP port       Oper    Port    Port
Port           Flags     State       Priority        Key     Number  State
---------------------------------------------------------------------------
Te1/0/2        SA        susp        32768           0x2     0x2     0x45
Te2/0/2        SA        susp        32768           0x2     0x24    0x45

Partner information:
                         LACP port                  Oper    Port     Port
Port           Flags     Priority      Dev ID       Key     Number   State
--------------------------------------------------------------------------
Te1/0/2        SP        0         0000.0000.0000   0x0     0x0      0x0
Te2/0/2        SP        0         0000.0000.0000   0x0     0x0      0x0



December 01, 2024, 09:36:56 PM #19 Last Edit: December 02, 2024, 09:30:38 AM by firewallfun
There must be some part of the LACP standard that is not matching here, so that one of the ports just goes into a sleep/backup state where it doesn't exchange correct data.

ifconfig


lagg0: flags=1008943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
        description: LAN (lan)
        options=4e0382b<RXCSUM,TXCSUM,VLAN_MTU,JUMBO_MTU,WOL_UCAST,WOL_MCAST,WOL_MAGIC,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,MEXTPG>
        ether 20:7c:14:f5:91:6f
        hwaddr 00:00:00:00:00:00
        inet .2 netmask 0xffffff00 broadcast ...255
        inet .1 netmask 0xffffff00 broadcast ...255 vhid 3
        laggproto lacp lagghash l2
        laggport: ix2 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: ix3 flags=0<>
        groups: lagg
        carp: MASTER vhid 3 advbase 1 advskew 0
              peer 224.0.0.18 peer6 ff02::12
        media: Ethernet autoselect
        status: active


The port in question really is up:

root@f1:~ # ifconfig ix3
ix3: flags=1008943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
        options=4e0382b<RXCSUM,TXCSUM,VLAN_MTU,JUMBO_MTU,WOL_UCAST,WOL_MCAST,WOL_MAGIC,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,MEXTPG>
        ether 20:7c:14:f5:91:6f
        hwaddr 20:7c:14:f5:91:70
        media: Ethernet autoselect (Unknown <rxpause,txpause>)
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>



TenGigabitEthernet 1/0/2                 up        1      Full     10G       fiber
TenGigabitEthernet 2/0/2                 up        1      Full     10G       fiber


root@f1:~ # tcpdump -i ix3 ether proto 0x8809
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on ix3, link-type EN10MB (Ethernet), snapshot length 262144 bytes
21:51:18.266702 LACPv1, length 110
21:51:19.351099 LACPv1, length 110
21:51:20.444063 LACPv1, length 110
21:51:21.547254 LACPv1, length 110
21:51:22.635720 LACPv1, length 110
21:51:23.725439 LACPv1, length 110
21:51:24.826664 LACPv1, length 110
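
To put a number on the capture above, here is a small sketch that computes the spacing between the LACPDUs tcpdump printed on ix3. The timestamps are copied from the output above; the function name lacpdu_intervals is made up for illustration. For reference, LACP's fast periodic rate is one LACPDU per second and the slow rate is one every 30 seconds.

```python
from datetime import datetime

# Timestamps copied verbatim from the tcpdump output on ix3.
CAPTURE = """\
21:51:18.266702 LACPv1, length 110
21:51:19.351099 LACPv1, length 110
21:51:20.444063 LACPv1, length 110
21:51:21.547254 LACPv1, length 110
21:51:22.635720 LACPv1, length 110
21:51:23.725439 LACPv1, length 110
21:51:24.826664 LACPv1, length 110
"""


def lacpdu_intervals(capture: str) -> list[float]:
    """Return the gaps in seconds between consecutive LACPv1 lines."""
    times = [
        datetime.strptime(line.split()[0], "%H:%M:%S.%f")
        for line in capture.splitlines()
        if "LACPv1" in line
    ]
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


intervals = lacpdu_intervals(CAPTURE)
mean = sum(intervals) / len(intervals)
print(f"{len(intervals)} intervals, mean {mean:.3f} s")
```

The gaps come out at roughly 1.1 s each, which looks like a single sender at the fast rate. The capture as shown does not say who that sender is: without link-level headers, frames from the switch and frames from OPNsense look the same. Running tcpdump with -e (print the link-level header) and -v (decode the LACPDU, as tcpdump itself hints) would show the source MAC and the actor/partner fields, so you can tell whether ix3 is both receiving and transmitting.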


ifconfig

NON-working OPNSense:

root@f1:~ # ifconfig lagg0
lagg0: flags=1008943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
        description: LAN (lan)
        options=4e0382b<RXCSUM,TXCSUM,VLAN_MTU,JUMBO_MTU,WOL_UCAST,WOL_MCAST,WOL_MAGIC,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,MEXTPG>
        ether 20:7c:14:f5:91:6f
        hwaddr 00:00:00:00:00:00
        inet XXX.2 netmask 0xffffff00 broadcast XX255
        inet XXX.1 netmask 0xffffff00 broadcast XX.255 vhid 3
        laggproto lacp lagghash l2,l3,l4
        laggport: ix2 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: ix3 flags=0<>
        groups: lagg
        carp: MASTER vhid 3 advbase 1 advskew 0
              peer 224.0.0.18 peer6 ff02::12
        media: Ethernet autoselect
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>


Working pfSense:

lagg0: flags=1008943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
        description: LAN
        options=4e527bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,MEXTPG>
        ether 0c:c4:7a:aa:fb:a5
        hwaddr 00:00:00:00:00:00
        inet XXX.1 netmask 0xffffff00 broadcast XXXX
        inet6 fe80::ec4:7aff:feaa:fba5%lagg0 prefixlen 64 scopeid 0xa
        laggproto lacp lagghash l2,l3,l4
        laggport: igb1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: igb3 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        groups: lagg
        media: Ethernet autoselect
        status: active
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
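
For comparing the two outputs above, a minimal sketch that pulls the laggport lines out of ifconfig text and lists the members that are not fully bundled. It only relies on the laggport line format and the ACTIVE, COLLECTING and DISTRIBUTING flag names visible in the outputs above; the function names are made up for illustration.

```python
import re

# Matches lines such as: laggport: ix2 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
LAGGPORT_RE = re.compile(r"laggport:\s+(\S+)\s+flags=[0-9a-f]+<([^>]*)>")

# A member carries traffic only when all three flags are present.
REQUIRED = {"ACTIVE", "COLLECTING", "DISTRIBUTING"}


def laggport_flags(ifconfig_text: str) -> dict[str, set[str]]:
    """Map each lagg member name to the set of flags ifconfig printed."""
    ports = {}
    for match in LAGGPORT_RE.finditer(ifconfig_text):
        name, flags = match.groups()
        ports[name] = set(flags.split(",")) if flags else set()
    return ports


def unbundled_ports(ifconfig_text: str) -> list[str]:
    """Return the members that lack at least one of the required flags."""
    return [name for name, flags in laggport_flags(ifconfig_text).items()
            if not REQUIRED <= flags]


# laggport lines copied from the two ifconfig outputs in this post.
OPNSENSE = """
        laggport: ix2 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: ix3 flags=0<>
"""
PFSENSE = """
        laggport: igb1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: igb3 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
"""

print("OPNsense unbundled:", unbundled_ports(OPNSENSE))
print("pfSense unbundled:", unbundled_ports(PFSENSE))
```

On the OPNsense box this singles out ix3, which has no flags at all, so from the firewall's point of view that port never became an active LACP member. That matches the susp state the switch reports for the same link.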


I'm getting nowhere..

I deleted this LACP lagg on the switch and changed lagg0 on OPNsense from lacp to failover.

I didn't change the member ports ix2 and ix3, though. And it works just like one would expect. I set ix3 as master (the one that was the problem in the LACP team), and pinging the LAN works without issue. When I do "ifconfig ix3 down", it fails over to ix2 after 4-5 missed pings (so not as fast as LACP would be), and goes back to ix3 afterwards, with no problem at all. But I would have preferred LACP...

Since I have the exact same issue with two physical OPNsense boxes, it must be something in the software on the OPNsense box or in the OS. Against the same switch, LACP works with pfSense.

FS switches have tended to show a lot of weird behavior with LAGG + LACP in the past; you can find reports on their forum or on Reddit.

Did you try the LAGG with LACP fast disabled on both ends and bounce the LAGG on the switch side?
Are you running the latest OS for that switch?
Did you try to reboot the switch?
Did you check for known bugs for the switch?

One can argue that because it works on pfSense and doesn't on OPNsense, it is an OPNsense issue. However, I run LAGGs with LACP on OPNsense towards a Zyxel GS1900-24E switch and I do not have such issues.

Regards,
S.
Networking is love. You may hate it, but in the end, you always come back to it.

OPNSense HW
APU2D2 - deceased
N5105 - i226-V | Patriot 2x8G 3200 DDR4 | L 790 512G - VM HA(SOON)
N100   - i226-V | Crucial 16G  4800 DDR5 | S 980 500G - PROD

When I set up my 2nd OPNsense box, I just left LACP fast disabled, which is the default, and that is also the default on the switch. So that didn't make any difference. Later I also tried moving everything over to fast and restarting the LACP interface and the relevant ports on the switch.

I have only had FS switches for 4 months (and upgraded to the latest version at that point). We replaced all our switches with them. But this is actually the first time I have had any issue with LACP. I mainly have LACP on all ports, towards Supermicro blade servers, other switches and gear, and Rocky Linux/Windows servers. It has been like a dream, until now.

It is a 24/7 environment, so I can't risk rebooting them without a solid reason.

I'll research a bit more. For now at least it works in this active/backup mode. It could also be a bug in the network driver, I guess (for my Intel SFP+ ports).


I have OPNsense running with LACP and Cisco and Mikrotik gear at the other end. Never a problem. So there is to my knowledge nothing fundamentally broken.

I also have a couple of dozen FreeBSD servers (13.3/13.4) with LACP to Cisco switches.
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

Quote from: Patrick M. Hausen on November 28, 2024, 06:27:21 PM
Like in my screen shot.

Your image is 0 bytes, btw.
Hardware: DEC3852
Version: OPNsense 24.10 Business Edition

Like Patrick M. Hausen already confirmed, LACP is working without any issues with latest OPNsense and any version I can remember (I'm using Juniper switches).

No experience with FS switches, but I noticed an inconsistency between the working and non-working examples you compare in this post: https://forum.opnsense.org/index.php?topic=44338.msg221412#msg221412 .

You've assigned your LAGG to the LAN interface of OPNsense, but it also looks like you have CARP enabled, which isn't the case (or at least doesn't look like it) in your pfSense config:

Quote
NON-working OPNSense:

...
carp: MASTER vhid 3 advbase 1 advskew 0
              peer 224.0.0.18 peer6 ff02::12
...

As you have a (low-level) LAGG interface problem and apparently also HA (CARP) in the mix, I would rule out the whole HA stuff first. In other words, set up a single vanilla OPNsense box without any bells and whistles and configure the LAGG on that (which should work). Only after that continue with any HA stuff, to keep a clear overview.