Messages - kei

#1
The CSS610 comes with SwOS Lite as the only option.
They also have devices with RouterOS only, and devices that offer RouterOS and SwOS as a boot-time choice, which adds to the confusion.

From what I understand, RouterOS is just another Linux (which makes it unattractive to me), whereas SwOS and SwOS Lite are built in-house and are bare-bones: essentially just a UI over the bits that can be set on the underlying hardware.

As said, except for this LACP problem (in my home network), I am quite happy with them.

#2
Hi All,

first of all, MikroTik ships RouterOS, SwOS and SwOS Lite - three completely different platforms - so it is hard to generalize across the vendor's line.
I have multiple MikroTik devices and am quite satisfied, especially with the price/performance ratio and the "does its job, nothing extra" attitude.

Back to LACP: I now have detailed logs with sysctl net.link.lagg.lacp.debug=1, and there is a pattern.

OPNsense initiates an lacpdu transmit at regular 30-second intervals.
The MikroTik box answers these with an ever-increasing delay, starting from 1 second and going up to and including 30 seconds (which counts as a heartbeat miss).
OPNsense then retransmits and receives the (delayed) answer to the previous transmit, but after two more transmissions the box and the switch get back into sync, with the switch again answering at a 1-second delay.
These cycles are a bit over 27 minutes long.
During these cycles the lagg0 interface on the OPNsense box stays UP, i.e. there is no perceived disruption.

When OPNsense does not get back into sync after the third retransmission, it stops and restarts the interface. This is what I observed previously, and it happens every 5 hours.
It then takes OPNsense and the switch multiple attempts to get back into sync; this is the threefold flapping seen while re-syncing.
Once in sync, the 27-minute cycles start again.
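For completeness, this is roughly how I pulled the per-PDU timing out of the debug output; the log path below is simply where my install keeps the system log, so adjust as needed:

# keep only the timestamp and direction of each LACPDU event; the growing gap
# between a "transmit" and the following "receive" is the drift described above
grep -hE 'lacpdu (transmit|receive)' /var/log/system/*.log | awk '{print $1, $2, $NF}'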


From the looks of it, this is a MikroTik problem (drift in answering), and I am taking it to them. The observations are obviously biased, though: the log entries are timestamped with the OPNsense clock, so 30 seconds on the box might not be 30 seconds in reality.

Thanks everybody for your support!
If there is an update from the vendor, I will post it here.

Cheers,

Kei

#3
Hi Seimus,
thank you for looking at this.

There is no indication in the OPNsense logs of the physical port going down. I don't have monitoring set up on the switch, so no information from that side, and the whole thing sits in a cupboard, so no eyes on it either. I strongly doubt the port is going down physically.

On the switch side there is only an active/passive/static toggle, static meaning a non-LACP LAG. There are no other options to change or tune, and the documentation says nothing about any other parameters.
I have now run one night with the switch set to passive and one night with it set to active. The behaviour is identical.

On the OPNsense side I have been dialling through the different options, with absolutely no effect.
I have now settled on L2 hashing only, based on a forum post on the switch vendor's side.
LACP fast timeout is switched off on the OPNsense side; there is no such option on the switch.
I have also switched off flowid on OPNsense now. It was on because of a forum post here; no change.
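For completeness, the knobs above should map to these FreeBSD-level commands (I set everything through the OPNsense GUI, so take this as a sketch rather than the exact commands used):

ifconfig lagg0 lagghash l2          # L2-only hashing
ifconfig lagg0 -lacp_fast_timeout   # LACP fast timeout off
ifconfig lagg0 -use_flowid          # flowid off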

I have now switched on sysctl net.link.lagg.lacp.debug=1 and am observing an lacpdu transmit/receive every 30 seconds. The delay between transmit and receive is 5-15 seconds.
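For reference, switching the debug output on (and off again) is a one-liner from a shell; to keep it across reboots I assume it can also go into /etc/sysctl.conf or the Tunables page in the GUI:

# enable LACP debug logging on the fly
sysctl net.link.lagg.lacp.debug=1
# and switch it back off once done
sysctl net.link.lagg.lacp.debug=0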


2024-03-26 08:55:48 notice kernel: ix1: lacpdu transmit
2024-03-26 08:55:48 notice kernel: actor=(8000,90-E2-BA-XX-XX-XX,016B,8000,0002)
2024-03-26 08:55:48 notice kernel: actor.state=3d<ACTIVITY,AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
2024-03-26 08:55:48 notice kernel: partner=(8000,48-A9-8A-XX-XX-XX,0002,8000,000A)
2024-03-26 08:55:48 notice kernel: partner.state=3c<AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
2024-03-26 08:55:48 notice kernel: maxdelay=0
2024-03-26 08:55:54 notice kernel: ix1: lacpdu receive
2024-03-26 08:55:54 notice kernel: actor=(8000,48-A9-8A-XX-XX-XX,0002,8000,000A)
2024-03-26 08:55:54 notice kernel: actor.state=3c<AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
2024-03-26 08:55:54 notice kernel: partner=(8000,90-E2-BA-XX-XX-XX,016B,8000,0002)
2024-03-26 08:55:54 notice kernel: partner.state=3d<ACTIVITY,AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
2024-03-26 08:55:54 notice kernel: maxdelay=0


I will leave this logging on for a couple of hours and try to find the spot around the flapping.

I also plan to take this to the vendor MikroTik.

Thanks for taking the time to look at this.
If there are any other ideas, I am happy to try an alternative config.

Cheers,

Kei
#4
Dear All,

after days of plugging and testing I am turning to the forum to find the flaw in my setup.

The hardware is straightforward: OPNsense running on a Supermicro A1SAi-2550F, connected to a MikroTik CSS610-8P-2S+IN that serves 3 WAPs and other home equipment.
Multiple VLANs are set up on top of the LAGG.
Upstream is 1G fibre, connected to an Intel X520-DA2 plugged into the Supermicro board.

All equipment is on the latest BIOS and firmware versions.

The connection between the OPNsense box and the switch is an LACP LAGG, tested with several member combinations (2 onboard UTP, 1 onboard UTP, 1 onboard UTP + 1G SFP, 1G SFP).
With multiple LACP legs, the LAGG drops out several times a day at random intervals.
With a single LACP leg, the dropout is pretty consistent at approximately 5-hour intervals, in bursts of 3.
In all cases the LAGG is rebuilt within a second, yet the damage of the interface going down has already been done.
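The roughly 5-hour cadence is easiest to see by pulling just the flap timestamps out of the system log; the path below is simply where my install keeps it, so adjust as needed:

# list only the time and interface of each flap event
grep -h 'stopped DISTRIBUTING' /var/log/system/*.log | awk '{print $1, $2, $5}'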


# igb1 is single member of the LACP lagg0 group (Onboard Intel 1GB NIC, UTP cable)
2024-03-23 23:41:55 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-23 23:41:56 notice kernel: <6>lagg0: link state changed to UP
2024-03-23 23:43:26 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-23 23:43:27 notice kernel: <6>lagg0: link state changed to UP
2024-03-23 23:44:57 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-23 23:44:58 notice kernel: <6>lagg0: link state changed to UP
2024-03-24 04:42:07 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 04:42:08 notice kernel: <6>lagg0: link state changed to UP
2024-03-24 04:43:39 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 04:43:40 notice kernel: <6>lagg0: link state changed to UP
2024-03-24 04:45:10 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 04:45:11 notice kernel: <6>lagg0: link state changed to UP
2024-03-24 09:42:49 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 09:42:50 notice kernel: <6>lagg0: link state changed to UP
2024-03-24 09:44:20 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 09:44:21 notice kernel: <6>lagg0: link state changed to UP
2024-03-24 09:45:52 notice kernel: igb1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 09:45:53 notice kernel: <6>lagg0: link state changed to UP

# ix1 is single member of the LACP lagg group (PCIe Intel X520-DA2 NIC, 1.25G SFP, MM fiber cable)
2024-03-24 23:16:30 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 23:16:31 notice kernel: <6>lagg0: link state changed to UP
2024-03-24 23:18:00 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 23:18:01 notice kernel: <6>lagg0: link state changed to UP
2024-03-24 23:19:32 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-24 23:19:33 notice kernel: <6>lagg0: link state changed to UP
2024-03-25 04:16:34 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-25 04:16:35 notice kernel: <6>lagg0: link state changed to UP
2024-03-25 04:18:05 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-25 04:18:06 notice kernel: <6>lagg0: link state changed to UP
2024-03-25 04:19:35 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-25 04:19:36 notice kernel: <6>lagg0: link state changed to UP
2024-03-25 09:17:09 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-25 09:17:10 notice kernel: <6>lagg0: link state changed to UP
2024-03-25 09:18:39 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-25 09:18:40 notice kernel: <6>lagg0: link state changed to UP
2024-03-25 09:20:10 notice kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
2024-03-25 09:20:11 notice kernel: <6>lagg0: link state changed to UP


Other activity on the OPNsense box before the flapping is as usual: filterlog entries, regular cron jobs, nothing special.
The MikroTik switch unfortunately provides no logs or traps.

Based on browsing previous LAGG-related posts, I have settled on the following (a quick way to check these from a shell is sketched after the list):

  • OPNsense side active, switch side passive (but also tested active/active)
  • Use flowid (but also tested without)
  • Fast timeout OFF (default)
  • Hash on l2,l3,l4 (but also tested l2 only)
  • Use strict default -- net.link.lagg.lacp.default_strict_mode: 1
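A quick way to check these from a shell (the ifconfig output further down was, as far as I recall, captured with the verbose flag):

# the relevant sysctl knobs
sysctl net.link.lagg.lacp.default_strict_mode net.link.lagg.lacp.debug
# verbose view of the lagg, including LACP state and the flap counter
ifconfig -v lagg0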

This is the current config, which results in the log above:

lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=4e53fbb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_UCAST,WOL_MCAST,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
        ether 90:e2:ba:xx:xx:xx
        laggproto lacp lagghash l2,l3,l4
        lagg options:
                flags=5<USE_FLOWID,USE_NUMA>
                flowid_shift: 16
        lagg statistics:
                active ports: 1
                flapping: 9
        lag id: [(8000,90-E2-BA-XX-XX-XX,016B,0000,0000),
                 (8000,48-A9-8A-XX-XX-XX,0002,0000,0000)]
        laggport: ix1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING> state=3d<ACTIVITY,AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
                 [(8000,90-E2-BA-XX-XX-XX,016B,8000,0002),
                  (8000,48-A9-8A-XX-XX-XX,0002,8000,000A)]
        groups: lagg
        media: Ethernet autoselect
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>

ix1: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=4e53fbb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_UCAST,WOL_MCAST,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
        ether 90:e2:ba:xx:xx:xx
        media: Ethernet autoselect (1000baseSX <full-duplex>)
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        plugged: SFP/SFP+/SFP28 1000BASE-SX (LC)
        vendor: JDSU PN: PLRXPL-VI-S24-27 SN: XXXXXXXXXX DATE: 2013-08-27
        module temperature: 44.05 C voltage: 3.33 Volts
        lane 1: RX power: 0.26 mW (-5.88 dBm) TX bias: 5.71 mA


I have also set net.link.lagg.lacp.debug=1 at times, but could not make sense of the output.

I am wondering whether anybody can shed some light on which bit to flip to make this work.
I am happy to test, change, or provide logs for just about any altered config for debugging purposes.

Thank you in advance for your suggestions.

Cheers,

Kei
#5
Hi,

I started with OPNsense a week ago and I am very pleased. Great work!
While setting up IPsec I needed certificates with the Subject Alternative Name extension set. The GUI provides a great screen for internal certificates with this extension; unfortunately, the resulting certificate does not contain any of the information provided.
Upgrading from 15.1 to 15.7 did not help, so I looked into the source and found the following:

/etc/inc/certs.inc starts with
define("OPEN_SSL_CONF_PATH", "/etc/ssl/openssl.cnf");

The file exists, but it is not picked up anywhere, as PHP expects this information to be passed in the OPENSSL_CONF environment variable (see http://php.net/manual/en/openssl.installation.php).

Investigation shows that OpenSSL actually picks up /usr/local/openssl/openssl.cnf.
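A quick way to confirm which directory the OpenSSL library uses for its default configuration (and, I assume, the PHP extension linked against the same library):

openssl version -d   # prints OPENSSLDIR; the default openssl.cnf lives there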

Further down, in the function cert_create, is the handling of the SAN: the code appends "_san" to $cert_type to point OpenSSL at a custom extensions section and puts the SAN value into the environment.
That environment variable is never read, however, and moreover there is no section in openssl.cnf that would react to it.

I have no build environment and do not want to set one up just for this, but I suggest the following changes:
Replace the OPEN_SSL_CONF_PATH definition with a comment on where the default openssl.cnf is picked up.

Change cert_create as follows:

        $ca_serial = ++$ca['serial'];

        $cert_type = "usr_cert";
        // When Subject Alternative Names are requested, switch to a dedicated
        // x509_extensions section (same name with postfix '_san') and pass the
        // SAN over the environment variable 'SAN' -- subjectAltName can only be
        // set via the configuration file.
        if (!empty($dn['subjectAltName'])) {
                putenv("SAN={$dn['subjectAltName']}");
                $cert_type .= '_san';
                unset($dn['subjectAltName']);
        }

        $args = array(
                "x509_extensions" => $cert_type,
                "digest_alg" => $digest_alg,
Then add a section to /usr/local/openssl/openssl.cnf after the end of the existing usr_cert section, duplicating its content and adding one line. I have removed the commented lines here.

[ usr_cert_san ]
basicConstraints=CA:FALSE
nsComment                       = "OpenSSL Generated Certificate"
subjectKeyIdentifier=hash
authorityKeyIdentifier=keyid,issuer
subjectAltName= $ENV::SAN


I have patched my live system with this and obtained proper certificates with Subject Alternative Names.
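For anybody wanting to double-check, this is how I verified the extension on the resulting certificate (the file name is just a placeholder for wherever you export the cert):

openssl x509 -in cert.pem -noout -text | grep -A1 'Subject Alternative Name'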
I guess this did not get much attention as most of us have CA infrastructure in place elsewhere.

I hope this helps and can make it into an upcoming patch.

Keep up the good work!

Cheers,

Kei