LAGG Port Errors

Started by seed, January 31, 2020, 07:04:15 PM

Introduction:

Hello,

I have already written about this issue on Twitter. When I create an LACP LAGG, errors occur on the interface.
Last year, when I tested this (19.1), my system crashed after some time. I disabled LACP and carried on. This week I tested again with the same results: errors on the LAGG, even without any cables attached.
I have seen this strange behavior even with completely different hardware (I211 Gigabit Network Connection).

I am, so to speak, a "home" user and use OPNsense to learn more about network security.
Sorry for my bad English. It is not my native language.


Hardware used:

Versions    OPNsense 20.1-amd64
FreeBSD 11.2-RELEASE-p16-HBSD
OpenSSL 1.1.1d 10 Sep 2019

Switch:

Device Information
Device Type    DGS-1210-26 Gigabit Ethernet Switch
Boot Version    1.00.010
Firmware Version    6.12.B006
Hardware Version    F1

NIC:

root@OPNsense:~ # pciconf -l -BbceVv
igb2@pci0:1:0:2:   class=0x020000 card=0x12a18086 chip=0x150e8086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82580 Gigabit Network Connection'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 32, base 0xde180000, size 524288, enabled
    bar   [1c] = type Memory, range 32, base 0xde304000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit, vector masks
    cap 11[70] = MSI-X supports 10 messages, enabled
                 Table in map 0x1c[0x0], PBA in map 0x1c[0x2000]
    cap 10[a0] = PCI-Express 2 endpoint max data 256(512) FLR NS
                 link x4(x4) speed 5.0(5.0) ASPM disabled(L0s/L1)
    ecap 0001[100] = AER 1 0 fatal 0 non-fatal 1 corrected
    ecap 0003[140] = Serial 1 001b21ffffa75bf0
    ecap 0017[1a0] = TPH Requester 1
  PCI-e errors = Correctable Error Detected
                 Unsupported Request Detected
     Corrected = Advisory Non-Fatal Error
igb3@pci0:1:0:3:   class=0x020000 card=0x12a18086 chip=0x150e8086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82580 Gigabit Network Connection'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 32, base 0xde100000, size 524288, enabled
    bar   [1c] = type Memory, range 32, base 0xde300000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit, vector masks
    cap 11[70] = MSI-X supports 10 messages, enabled
                 Table in map 0x1c[0x0], PBA in map 0x1c[0x2000]
    cap 10[a0] = PCI-Express 2 endpoint max data 256(512) FLR NS
                 link x4(x4) speed 5.0(5.0) ASPM disabled(L0s/L1)
    ecap 0001[100] = AER 1 0 fatal 0 non-fatal 1 corrected
    ecap 0003[140] = Serial 1 001b21ffffa75bf0
    ecap 0017[1a0] = TPH Requester 1
  PCI-e errors = Correctable Error Detected
                 Unsupported Request Detected
     Corrected = Advisory Non-Fatal Error


Issue:

Directly after creating the LAGG, I saw the output error counter (Oerrs) rising:

root@OPNsense:~ # netstat -i
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
lagg0  1500 <Link#11>     00:1b:21:a7:5b:f2        0     0     0        0     5     0
lagg0     - fe80::%lagg0/ fe80::21b:21ff:fe        0     -     -        2     -     -

Then I moved a VLAN interface to the NIC, sent some traffic over it, and unplugged and replugged the physical links one after another.

root@OPNsense:~ # netstat -i
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
lagg0  1500 <Link#11>     00:1b:21:a7:5b:f2     9843     0     0     4232    34     0
lagg0     - fe80::%lagg0/ fe80::21b:21ff:fe        0     -     -        2     -     -

Then I rebooted the switch and OPNsense and tested again. After the reboot I noticed that the link did not work. I had to unplug both physical cables and replug them.

This is what the error counters looked like after sending some traffic through again:

root@OPNsense:~ # netstat -i
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
lagg0  1500 <Link#11>     00:1b:21:a7:5b:f2    12385     0     0     7016    82     0
lagg0     - fe80::%lagg0/ fe80::21b:21ff:fe        0     -     -        2     -     -

root@OPNsense:~ # netstat -i
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
lagg0  1500 <Link#11>     00:1b:21:a7:5b:f2    13326     0     0     8134   135     0
lagg0     - fe80::%lagg0/ fe80::21b:21ff:fe        0     -     -        2     -     -
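
To make the growth between two such samples easier to read, the counters can be pulled out of saved `netstat -i` output and subtracted. This is only a minimal sketch: the function names `link_counters` and `oerrs_delta` and the snapshot files are made up for illustration, and it assumes the column layout shown above (Opkts and Oerrs as the third-last and second-last fields of the `<Link#N>` row).

```shell
#!/bin/sh
# Print "Opkts Oerrs" for one interface from `netstat -i` text on stdin.
# Only the <Link#N> row carries the link-layer counters.
link_counters() {
    awk -v ifc="$1" '$1 == ifc && $3 ~ /^<Link#/ { print $(NF-2), $(NF-1); exit }'
}

# Print how much Opkts and Oerrs grew between two saved snapshots.
# $1 = interface, $2 = earlier snapshot file, $3 = later snapshot file
oerrs_delta() {
    before=$(link_counters "$1" < "$2")
    after=$(link_counters "$1" < "$3")
    opkts_b=${before% *}; oerrs_b=${before#* }
    opkts_a=${after% *};  oerrs_a=${after#* }
    echo "opkts=+$((opkts_a - opkts_b)) oerrs=+$((oerrs_a - oerrs_b))"
}
```

Usage would be something like `netstat -i > before.txt`, send some traffic, `netstat -i > after.txt`, then `oerrs_delta lagg0 before.txt after.txt`. For the last two samples above this gives 1118 more output packets and 53 more output errors.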

The same thing happens with or without VLAN hardware filtering.
I do not suspect the switch of being the cause.



Screenshots:

Creating LAGG:
https://ibb.co/HqGd9s9

Configure vlan to lagg parent if:
https://ibb.co/1JCj1jR

Interface settings:
https://ibb.co/VWqg2hW

lacp switch configuration 1/2:
https://ibb.co/Ptd7fJQ

lacp switch configuration 2/2:
https://ibb.co/G0WwL6L

switch interface stats before connecting physical lacp links to opnsense:
https://ibb.co/2hj65vR

switch interface stats after connecting physical lacp links to opnsense:
https://ibb.co/dKZcyRN

I want all services to run at wire speed and therefore run this dedicated hardware configuration:

AMD Ryzen 7 9700x
ASUS Pro B650M-CT-CSM
64GB DDR5 ECC (2x KSM56E46BD8KM-32HA)
Intel XL710-BM1
Intel i350-T4
2x SSD with ZFS mirror
PiKVM for remote maintenance

private user, no business use

February 01, 2020, 07:37:15 PM #1 Last Edit: February 01, 2020, 07:42:36 PM by seed
Just to make sure that this is not a hardware-related issue, I installed Ubuntu Server 19.10 with kernel 5.3.x and created a bond (LACP) with a VLAN on it. Then I tested again (pinging) while replugging the physical links one after another.

No problem. The TX/RX error counters stayed at "0", even after rebooting a few times with and without physical cables attached.
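
For anyone who wants to repeat the Linux comparison, the error counters can be read from sysfs. This is a rough sketch, not necessarily how the counters were read in the test above: the function name `bond_errors` and the interface name `bond0` are assumptions, and the second argument exists only so that the sysfs root can be overridden.

```shell
#!/bin/sh
# Print the RX/TX error counters of a Linux network interface.
# $1 = interface name (e.g. bond0, an assumed name)
# $2 = sysfs root, defaults to /sys
bond_errors() {
    stats="${2:-/sys}/class/net/$1/statistics"
    echo "rx_errors=$(cat "$stats/rx_errors") tx_errors=$(cat "$stats/tx_errors")"
}
```

Running `bond_errors bond0` before and after replugging the links shows whether either counter moved.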

Please give me instructions to debug this.

Could it be an igb driver or kernel-related problem?