OPNsense Forum

English Forums => High availability => Topic started by: firewallfun on November 28, 2024, 05:47:29 PM

Title: LAG issue
Post by: firewallfun on November 28, 2024, 05:47:29 PM
On a dedicated box with SFP+ ports, no VLANs.

I have set up two firewalls on the same hardware, but on the 2nd one I struggled with HA and loss of the LAN connection.

I found out that the problem is that LAG doesn't work on this 2nd fw.

I removed the two ports from the LACP LAG on the switch, and also removed the LAN interface and the lagg on OPNsense. Then I activated the ports one at a time to verify that ix2 and ix3 are the correct ones. I enabled each interface one at a time, in both directions, and deleted the interface after each test. So when I added ix2 and ix3 to the lagg and attached lagg0 to LAN, I was 100% sure the cables were correct. The LAG shows green in the web GUI. I have also created an allow-all rule in the pfSense fw on this LAN interface and rebooted.

No matter what I do, it does not establish a connection. What can be wrong? The switch says "susp" on both ports in the LACP LAG there, with Dev ID  0000000.

All is working outside the lag...

It simply says this:

Interface TenGigabitEthernet 1/0/10 suspended: LACP currently not enabled on the remote port.
2024-11-28
(5)Notifications
LACP
SUSPEND
Interface TenGigabitEthernet 2/0/10 suspended: LACP currently not enabled on the remote port.
2024-11-28
Title: Re: LAG issue
Post by: Patrick M. Hausen on November 28, 2024, 06:27:21 PM
Please show the lagg0 configuration on the OPNsense side. Like in my screen shot.
Title: Re: LAG issue
Post by: firewallfun on November 28, 2024, 08:21:50 PM
Image didn't work.

But I assume it was from "Interfaces: Overview" - screen. I got a scrollable list, so wasn't easy to take picture. But here is the text version of it:


Flags   8843
Capabilities   rxcsum
txcsum
vlan_mtu
vlan_hwtagging
jumbo_mtu
vlan_hwcsum
tso4
tso6
lro
wol_ucast
wol_mcast
wol_magic
vlan_hwfilter
vlan_hwtso
netmap
rxcsum_ipv6
txcsum_ipv6
hwstats
mextpg
Options   vlan_mtu
jumbo_mtu
wol_ucast
wol_mcast
wol_magic
hwstats
mextpg
MAC Address   20:7c:14:f5:91:66 - Qotom
Supported Media   autoselect
Physical   
Device   lagg0
mtu   1500
macaddr_hw   00:00:00:00:00:00
LAGG Protocol   lacp
LAGG Hash   l2
l3
l4
LAGG Options   
flags   flowid_shift
lacp_fast_timo   16
LAGG Statistics   
active ports   flapping
0   0
Groups   lagg
Media   Ethernet autoselect
Media (Raw)   Ethernet autoselect
Status   up
Routes   10.10.10.0/24
Identifier   opt4
Description   LAN
Enabled   true
Link Type   static
addr4   10.10.10.3/24
addr6   
IPv4 Addresses   
10.10.10.3/24
VLAN Tag   
Gateways   
Driver   lagg0
Index   13
Promiscuous Listeners   0
Send Queue Length   0
Send Queue Max Length   50
Send Queue Drops   0
Type   Ethernet
Address Length   6
Header Length   14
Link State   2
vhid   0
Data Length   152
Metric   0
Line Rate   10.00 Gbit/s
Packets Received   18378
Input Errors   0
Packets Transmitted   0
Output Errors   18
Collisions   0
Bytes Received   2421158
Bytes Transmitted   0
Multicasts Received   18378
Multicasts Transmitted   0
Input Queue Drops   0
Packets for Unknown Protocol   0
Hardware Offload Capabilities   0x0
Uptime at Attach or Statistics Reset   32

I'm thinking about just starting from scratch, I have no clue what is going on. The other fw I have of same brand/model, had no issues with this at all.
Title: Re: LAG issue
Post by: Patrick M. Hausen on November 28, 2024, 08:51:22 PM
Nope, not the overview.

Interfaces > Other Types > LAGG - then open the configuration of your lagg IF.
Title: Re: LAG issue
Post by: firewallfun on November 28, 2024, 10:23:42 PM
There I have this. Attaching both assignment and the one you asked about.
Title: Re: LAG issue
Post by: Patrick M. Hausen on November 28, 2024, 10:49:31 PM
Pick the hash layers matching the policy of your switch. Most common is L2 + L3.
Title: Re: LAG issue
Post by: firewallfun on November 28, 2024, 11:17:15 PM
On my 2nd box with same config, I have this default (empty), working with LACP there.

I tried to change it now to use l2+l3, I still get this:

(5)Notifications
LACP
SUSPEND
Interface TenGigabitEthernet 1/0/10 suspended: LACP currently not enabled on the remote port.
2024-11-28 17:10:44
(5)Notifications
LACP
SUSPEND
Interface TenGigabitEthernet 2/0/10 suspended: LACP currently not enabled on the remote port.
2024-11-28 17:10:44
Title: Re: LAG issue
Post by: Patrick M. Hausen on November 28, 2024, 11:52:19 PM
Did you try slow instead of fast timeout? Are there any docs on what your switch expects? Also, did you disable all hardware offloading? Which would be the default ... disabled, that is.
Title: Re: LAG issue
Post by: firewallfun on November 29, 2024, 09:46:57 AM
All is disabled as default, haven't touched any optimization features.

Regarding slow/fast: yes. I first had it at slow on both sides, but changed to fast after a day of not getting anywhere. So I have the same settings on this LACP pair as on the other OPNsense box of the same batch/type. I struggle a bit with both HA units becoming master at the same time, so I started to believe it could be an IP conflict (because the CARP VIP would then be active in both places). But then it shouldn't work on a single LAN either, so I'm not sure about that.

I will go to the console, reset everything and maybe I will have better luck... Maybe something has gotten stuck.
Title: Re: LAG issue
Post by: Seimus on November 29, 2024, 10:20:59 AM
If you can,

provide output from your CISCO switch

Quote
show etherchannel summary

Also provide output of the lagg port configuration and the physical port configuration of the ports belonging to the LAGG on switch side.

Regards,
S.
Title: Re: LAG issue
Post by: Monviech (Cedrik) on November 29, 2024, 10:24:15 AM
If both firewalls become master for a carp vip it could be 2 things most likely:

- Both firewalls send out their VRRP advertisements, but they get lost on the way to the other firewall, either manipulated or dropped by the switch or blocked by a firewall rule
- The hashes of the vhid group are not the same on both sides. Make sure the configuration is exactly the same, especially when having more VIPs in the same vhid CARP group
Title: Re: LAG issue
Post by: firewallfun on November 29, 2024, 09:28:02 PM
The LAGG issue was kind of solved. I swapped the cables (SFP+) for a different pair and then I got a connection. I still have an issue with active/passive, where I can only unplug one of the cables for some reason. But as long as both fibers are plugged into both switches, the lagg now works (it is an FS switch with LACP).

I have vhid group 1 on the CARP WAN and vhid group 2 on the CARP LAN. Same on the second device. I have also deleted all the VIPs and synced them over, so they are identical (using multicast, so I didn't have to specify a peer IP).

I have disabled pf with pfctl -d on both firewalls. Can something still be blocking?
Title: Re: LAG issue
Post by: Monviech (Cedrik) on November 29, 2024, 10:05:14 PM
Your switch could use igmp snooping to mess with multicast.

There could also be MAC security features that block the spoofed mac addresses of vrrp packets.
Title: Re: LAG issue
Post by: firewallfun on November 30, 2024, 10:41:29 PM
Thank you for your suggestion. I have a thread here on it: https://forum.opnsense.org/index.php?topic=44226.0

It seems that, since I have a public /29 on my WAN on both devices and my ISP has routers that enable/disable each fiber at their end (participating in the /29), I can't do it like this. It is not a flat /29. I need to buy 2 new switches on the WAN side, so that the WAN interfaces can see each other, before I can connect my two OPNsense boxes to the shared WAN network.
Title: Re: LAG issue
Post by: Patrick M. Hausen on November 30, 2024, 11:03:53 PM
Re-read what I wrote in the other thread. You do not need two more switches if you already have a pair of stackable ones and a handful of free ports.

VLANs == as many virtual switches as you like as long as there are ports. That's the point of VLANs. A VLAN is a virtual unmanaged switch.
Title: Re: LAG issue
Post by: firewallfun on December 01, 2024, 07:42:04 PM
Now back to topic - LAG issue:

While LAN now works, there are some issues, as you can see below. On OPNsense, not all member ports have reached the bndl state.

For comparison I'm including a pfSense box I also have in an LACP lagg (fast) that works 100%; it is the last LACP lagg shown in the list. It has the same config on the switch as the OPNsense boxes.

If you look at one of the OPNsense units, the switch lists a blank Dev ID for one port and even shows it requesting slow LACPDUs, while the working port shows fast. There is no option to mix fast and slow within a lagg pair, so I assume it is not actually requesting slow.

Master FW

Aggregate port 10:
Local information:
                                     LACP port       Oper    Port    Port
Port           Flags     State       Priority        Key     Number  State
---------------------------------------------------------------------------
Te1/0/10       FA        bndl        32768           0xa     0xa     0x3f
Te2/0/10       FA        susp        32768           0xa     0x2c    0x47

Partner information:
                         LACP port                  Oper    Port     Port
Port           Flags     Priority      Dev ID       Key     Number   State
--------------------------------------------------------------------------
Te1/0/10       FA        32768     207c.14f5.9166   0x1b2   0x7      0x3f
Te2/0/10       SP        0         0000.0000.0000   0x0     0x0      0x0
FS#show lacp summary 2


Flags:  S - Device is requesting Slow LACPDUs   F - Device is requesting Fast LACPDUs.
A - Device is in active mode.        P - Device is in passive mode.

Backup FW

Aggregate port 2:

Local information:
                                     LACP port       Oper    Port    Port
Port           Flags     State       Priority        Key     Number  State
---------------------------------------------------------------------------
Te1/0/2        FA        susp        32768           0x2     0x2     0x47
Te2/0/2        FA        bndl        32768           0x2     0x24    0x3f

Partner information:
                         LACP port                  Oper    Port     Port
Port           Flags     Priority      Dev ID       Key     Number   State
--------------------------------------------------------------------------
Te1/0/2        SP        0         0000.0000.0000   0x0     0x0      0x0
Te2/0/2        FA        32768     207c.14f5.916f   0x1d2   0x8      0x3f
FS#show lacp summary 1

Flags:  S - Device is requesting Slow LACPDUs   F - Device is requesting Fast LACPDUs.
A - Device is in active mode.        P - Device is in passive mode.


pfSense (not OPNsense) unit I already have working, with same config on switch

Aggregate port 1:

Local information:
                                     LACP port       Oper    Port    Port
Port           Flags     State       Priority        Key     Number  State
---------------------------------------------------------------------------
Te1/0/1        FA        bndl        32768           0x1     0x1     0x3f
Te2/0/18       FA        bndl        32768           0x1     0x34    0x3f

Partner information:
                         LACP port                  Oper    Port     Port
Port           Flags     Priority      Dev ID       Key     Number   State
--------------------------------------------------------------------------
Te1/0/1        FA        32768     0cc4.7aaa.fba5   0x14b   0x2      0x3f
Te2/0/18       FA        32768     0cc4.7aaa.fba5   0x14b   0x4      0x3f
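
To read the Port State column in the tables above, each hex value can be decoded bit by bit. The following is a minimal Python sketch, assuming the standard IEEE 802.1AX layout of the LACP state octet; the function name decode_lacp_state is made up for illustration and is not part of any switch or OPNsense tooling.

```python
# Bit layout of the LACP actor/partner state octet (IEEE 802.1AX).
LACP_STATE_BITS = [
    (0x01, "activity"),         # 1 = active LACP, 0 = passive
    (0x02, "short_timeout"),    # 1 = short (fast) timeout, 0 = long (slow)
    (0x04, "aggregation"),      # link may be aggregated
    (0x08, "synchronization"),  # port is in sync with its aggregator
    (0x10, "collecting"),       # accepting incoming frames
    (0x20, "distributing"),     # sending outgoing frames
    (0x40, "defaulted"),        # using default partner info (no partner LACPDU in use)
    (0x80, "expired"),          # receive machine is in the expired state
]


def decode_lacp_state(value):
    """Return the names of all bits set in an LACP state octet."""
    return [name for mask, name in LACP_STATE_BITS if value & mask]


if __name__ == "__main__":
    # Values taken from the switch output in this thread.
    for state in (0x3F, 0x47, 0x3D, 0x45):
        print(f"0x{state:02x}: {', '.join(decode_lacp_state(state))}")
```

Under that assumption, 0x3f on the bundled ports has synchronization, collecting and distributing set, while 0x47 on the suspended ports has the "defaulted" bit set instead. That would fit the blank Dev ID in the partner table: the switch is not using any partner information received on that port.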
Title: Re: LAG issue
Post by: firewallfun on December 01, 2024, 07:55:17 PM
And here is the detailed LACP info for the LACP member interfaces on the switch - I just picked one of the two failing firewalls, as it fails the same way on both boxes:

FS#show running-config interface Te2/0/10

Building configuration...
Current configuration: 112 bytes

interface TenGigabitEthernet 2/0/10
description FW3
port-group 10 mode active
lacp short-timeout
FS#show running-config interface Te1/0/
FS#show running-config interface Te1/0/10

Building configuration...
Current configuration: 112 bytes

interface TenGigabitEthernet 1/0/10
description FW3
port-group 10 mode active
lacp short-timeout
Title: Re: LAG issue
Post by: firewallfun on December 01, 2024, 08:24:43 PM
And the switch clearly says that LACP is not enabled on one of the ports. So two sets of cables, on two machines, and both have the exact same problem. It must be a bonding error in the LACP setting in OPNsense (since it works on pfSense).

I have also disconnected the LACP-lag and no issues with the port member in question then, it worked just fine alone without LACP.

Both switch and the opnsense-box shows light/no light when I unplug/plug it into the port.

(5)Notifications
LACP
SUSPEND
Interface TenGigabitEthernet 1/0/2 suspended: LACP currently not enabled on the remote port.
2024-12-01 14:06:52

show lacp counters

Aggregate port 2:
Port          InPkts    OutPkts
-------------------------------
Te1/0/2        798391    1170027
Te2/0/2        945838    885832
Title: Re: LAG issue
Post by: firewallfun on December 01, 2024, 09:14:18 PM
I have disconnected the LACP-lag and now tested the LAN on the 2 individual ports that make up lacp 2. No problems, works perfectly one and one.

I have also double checked that the mac-address of the individual port in the switch vs the ones in opnsense is correct, so it is 100% sure it is physically connected.

As soon as joining the lacp-team, only one member of the team shows up correctly.

Tried to set everything to slow, both in my switch and on the lagg0.

Aggregate port 2:

Local information:
                                     LACP port       Oper    Port    Port
Port           Flags     State       Priority        Key     Number  State
---------------------------------------------------------------------------
Te1/0/2        SA        susp        32768           0x2     0x2     0x45
Te2/0/2        SA        bndl        32768           0x2     0x24    0x3d

Partner information:
                         LACP port                  Oper    Port     Port
Port           Flags     Priority      Dev ID       Key     Number   State
--------------------------------------------------------------------------
Te1/0/2        SP        0         0000.0000.0000   0x0     0x0      0x0
Te2/0/2        SA        32768     207c.14f5.916f   0x1d2   0x8      0x3d


During reboot, when OPNsense is down, it shows this status on the switch (correctly):

Aggregate port 2:

Local information:
                                     LACP port       Oper    Port    Port
Port           Flags     State       Priority        Key     Number  State
---------------------------------------------------------------------------
Te1/0/2        SA        susp        32768           0x2     0x2     0x45
Te2/0/2        SA        susp        32768           0x2     0x24    0x45

Partner information:
                         LACP port                  Oper    Port     Port
Port           Flags     Priority      Dev ID       Key     Number   State
--------------------------------------------------------------------------
Te1/0/2        SP        0         0000.0000.0000   0x0     0x0      0x0
Te2/0/2        SP        0         0000.0000.0000   0x0     0x0      0x0


Title: Re: LAG issue
Post by: firewallfun on December 01, 2024, 09:36:56 PM
It must be some LACP standard that is not matching here, so that one of the ports just goes into a sleep/backup state where it doesn't exchange correct data.

ifconfig


lagg0: flags=1008943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
        description: LAN (lan)
        options=4e0382b<RXCSUM,TXCSUM,VLAN_MTU,JUMBO_MTU,WOL_UCAST,WOL_MCAST,WOL_MAGIC,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,MEXTPG>
        ether 20:7c:14:f5:91:6f
        hwaddr 00:00:00:00:00:00
        inet .2 netmask 0xffffff00 broadcast ...255
        inet .1 netmask 0xffffff00 broadcast ...255 vhid 3
        laggproto lacp lagghash l2
        laggport: ix2 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: ix3 flags=0<>
        groups: lagg
        carp: MASTER vhid 3 advbase 1 advskew 0
              peer 224.0.0.18 peer6 ff02::12
        media: Ethernet autoselect
        status: active


The port in question really is up:

root@f1:~ # ifconfig ix3
ix3: flags=1008943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
        options=4e0382b<RXCSUM,TXCSUM,VLAN_MTU,JUMBO_MTU,WOL_UCAST,WOL_MCAST,WOL_MAGIC,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,MEXTPG>
        ether 20:7c:14:f5:91:6f
        hwaddr 20:7c:14:f5:91:70
        media: Ethernet autoselect (Unknown <rxpause,txpause>)
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>



TenGigabitEthernet 1/0/2                 up        1      Full     10G       fiber
TenGigabitEthernet 2/0/2                 up        1      Full     10G       fiber


root@f1:~ # tcpdump -i ix3 ether proto 0x8809
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on ix3, link-type EN10MB (Ethernet), snapshot length 262144 bytes
21:51:18.266702 LACPv1, length 110
21:51:19.351099 LACPv1, length 110
21:51:20.444063 LACPv1, length 110
21:51:21.547254 LACPv1, length 110
21:51:22.635720 LACPv1, length 110
21:51:23.725439 LACPv1, length 110
21:51:24.826664 LACPv1, length 110
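
The timestamps in the capture above can be turned into intervals to see which LACPDU rate is on the wire. This is a minimal Python sketch, assuming the default tcpdump timestamp format shown above and a capture that does not cross midnight; lacpdu_intervals is a hypothetical helper name. Intervals near 1 second correspond to the fast rate, intervals near 30 seconds to the slow rate.

```python
from datetime import datetime


def lacpdu_intervals(lines):
    """Return the gaps in seconds between consecutive LACP lines of tcpdump output."""
    stamps = []
    for line in lines:
        parts = line.split()
        # Expect lines such as "21:51:18.266702 LACPv1, length 110".
        if len(parts) >= 2 and parts[1].startswith("LACP"):
            stamps.append(datetime.strptime(parts[0], "%H:%M:%S.%f"))
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    capture = """\
21:51:18.266702 LACPv1, length 110
21:51:19.351099 LACPv1, length 110
21:51:20.444063 LACPv1, length 110
21:51:21.547254 LACPv1, length 110
""".splitlines()
    gaps = lacpdu_intervals(capture)
    print([round(gap, 3) for gap in gaps])
```

This capture does not show direction, so on its own it cannot tell whether the frames come from the switch or from OPNsense. Running tcpdump with -e prints the source MAC of each frame, which would answer that.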


ifconfig

NON-working OPNSense:

root@f1:~ # ifconfig lagg0
lagg0: flags=1008943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
        description: LAN (lan)
        options=4e0382b<RXCSUM,TXCSUM,VLAN_MTU,JUMBO_MTU,WOL_UCAST,WOL_MCAST,WOL_MAGIC,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,MEXTPG>
        ether 20:7c:14:f5:91:6f
        hwaddr 00:00:00:00:00:00
        inet XXX.2 netmask 0xffffff00 broadcast XX255
        inet XXX.1 netmask 0xffffff00 broadcast XX.255 vhid 3
        laggproto lacp lagghash l2,l3,l4
        laggport: ix2 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: ix3 flags=0<>
        groups: lagg
        carp: MASTER vhid 3 advbase 1 advskew 0
              peer 224.0.0.18 peer6 ff02::12
        media: Ethernet autoselect
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>


Working pfSense:

lagg0: flags=1008943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
        description: LAN
        options=4e527bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,MEXTPG>
        ether 0c:c4:7a:aa:fb:a5
        hwaddr 00:00:00:00:00:00
        inet XXX.1 netmask 0xffffff00 broadcast XXXX
        inet6 fe80::ec4:7aff:feaa:fba5%lagg0 prefixlen 64 scopeid 0xa
        laggproto lacp lagghash l2,l3,l4
        laggport: igb1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: igb3 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        groups: lagg
        media: Ethernet autoselect
        status: active
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
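
The relevant difference between the two ifconfig outputs above is in the laggport lines: on the working box both members show ACTIVE,COLLECTING,DISTRIBUTING, on the non-working box ix3 shows no flags at all. A minimal Python sketch for checking this automatically, assuming the ifconfig text format shown above; lagg_member_status is a made-up helper name, not an existing tool.

```python
import re

# Matches lines such as "laggport: ix2 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>".
LAGGPORT_RE = re.compile(r"laggport:\s+(\S+)\s+flags=[0-9a-fA-F]+<([^>]*)>")

# A member that is fully part of the LACP bundle carries all three flags.
REQUIRED_FLAGS = {"ACTIVE", "COLLECTING", "DISTRIBUTING"}


def lagg_member_status(ifconfig_text):
    """Map each laggport name to True if it carries all required flags."""
    status = {}
    for port, flags in LAGGPORT_RE.findall(ifconfig_text):
        names = {flag for flag in flags.split(",") if flag}
        status[port] = REQUIRED_FLAGS <= names
    return status


if __name__ == "__main__":
    text = """\
        laggproto lacp lagghash l2,l3,l4
        laggport: ix2 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: ix3 flags=0<>
"""
    for port, ok in lagg_member_status(text).items():
        print(port, "bundled" if ok else "NOT bundled")
```

The text could come from saving the output of ifconfig lagg0 to a file; how it is collected is left open here.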

Title: Re: LAG issue
Post by: firewallfun on December 02, 2024, 10:35:59 AM
I'm getting nowhere..

I deleted this LACP-lagg on the switch and changed the lagg0 on OPNsense from lacp to failover.

I didn't change the port members ix2 or ix3, though. And it works just as one would expect. I set ix3 as master (the one that was the issue in the LACP team), and pinging the LAN works without issue. When I do "ifconfig ix3 down", it fails over to ix2 after 4-5 missed pings (so not as fast as LACP would be), and back to ix3 afterwards, with no problem at all. But I would have preferred LACP...

Since I have the exact same issue with two physical OPNsense boxes, it must be something in the software on the OPNsense box or the OS. Against the same switch, LACP works with pfSense..
Title: Re: LAG issue
Post by: Seimus on December 02, 2024, 10:54:12 AM
FS switches have tended to show a lot of weird behavior with LAGG + LACP in the past; you can find it on their forum or on Reddit.

Did you try the LAGG with LACP fast disabled on both ends and bounce the LAGG on the switch side?
Are you running the latest OS for that switch?
Did you try to reboot the switch?
Did you possibly check for known bugs for the switch?

One can argue that because it works on pfSense and doesn't on OPNsense, it is an OPNsense issue. However, I run LAGGs with LACP on OPNsense towards a Zyxel GS1900-24E switch and I do not have such issues.

Regards,
S.
Title: Re: LAG issue
Post by: firewallfun on December 02, 2024, 08:58:41 PM
When I set up my 2nd OPNsense box, I just left LACP fast disabled, which is the default, and that is also the default on the switch. So that didn't bring any improvement. Later I also tried moving everything over to fast and restarting the LACP interface and the relevant ports on the switch.

I have only had FS-switches for 4 months (and upgraded to last version then). Replaced all our switches with them. But this is the first time I have had any issue with LACP actually. I mainly have lacp on all ports, against Supermicro-bladeservers and other switches/gears. And Rocky Linux/Windows-servers. It has been like a dream, until now.

It is a 24/7 environment, so I can't risk rebooting them without a solid reason.

I'll research a bit more. For now at least it works in this active/backup mode. It could also be a bug in the network driver, I guess (for my Intel SFP+ ports).

Title: Re: LAG issue
Post by: Patrick M. Hausen on December 02, 2024, 09:05:14 PM
I have OPNsense running with LACP and Cisco and Mikrotik gear at the other end. Never a problem. So there is to my knowledge nothing fundamentally broken.

I also have a couple of dozens of FreeBSD servers (13.3/13.4) with LACP to Cisco switches.
Title: Re: LAG issue
Post by: Melroy vd Berg on December 05, 2024, 08:18:44 PM
Quote from: Patrick M. Hausen on November 28, 2024, 06:27:21 PM
Like in my screen shot.

Your image is 0 bytes btw.
Title: Re: LAG issue
Post by: netnut on December 05, 2024, 11:38:26 PM
Like Patrick M. Hausen already confirmed, LACP is working without any issues with latest OPNsense and any version I can remember (I'm using Juniper switches).

No experience with FS switches, but I noticed an inconsistency in your comparison of a working and a non-working example in this post https://forum.opnsense.org/index.php?topic=44338.msg221412#msg221412 .

You've assigned your LAGG to the LAN interface of OPNsense, but it also looks like you have CARP enabled, which isn't the case (or doesn't look like it) in your pfSense config:

Quote
NON-working OPNSense:

...
carp: MASTER vhid 3 advbase 1 advskew 0
              peer 224.0.0.18 peer6 ff02::12
...

As you have a (low level) LAGG interface problem and it seems also HA (CARP), I would rule out the whole HA stuff first. In other words, try to configure a single vanilla OPNsense box without any bells and whistles and try to configure the LAGG in this setup (which should work), only after that continue with any HA stuff to keep a clear overview.