LAN NIC randomly drops out

Started by Nycix, December 16, 2024, 12:47:31 PM

Hi Folks,

Despite a somewhat bumpy road, I've been a happy user of OPNsense for the last 3 years or so.
However, I've recently started seeing a very annoying and problematic issue.
As stated in the title, my LAN NIC seems to randomly drop out for a couple of seconds (see attached image).
There seems to be no real rhyme or reason to this, except that it worked flawlessly until about the start of fall and has progressively gotten worse since.
This used to occur very sporadically, like once a week or so, but by now it's multiple times per day, if not per hour.
Today it even failed to recover. I have no idea what actually happened, but I had to walk downstairs and do a hard reset to restore connectivity.
Any ideas on how to debug or even address it?

If it helps, I bought the machine off AliExpress some time ago.
https://nl.aliexpress.com/item/1005004578935938.html
(I got the I225-V N5105 variant, not the I226)
I had another ITX PC (J1900) that I wanted to use, but it had Realtek NICs, so I bought this one.

Cheers

In some other devices using the I225, this could be solved by a BIOS setting to disable ASPM, but IDK if your device has that.
Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 440 up, Bufferbloat A+

To illustrate how often this happens, see below.

Quote from: meyergru on December 16, 2024, 01:19:55 PM
In some other devices using the I225, this could be solved by a BIOS setting to disable ASPM, but IDK if your device has that.
It has been working fine since about 1 this afternoon, but this will be the first thing I check if it misbehaves again.

Checking the logs some more, it does appear to be "bursty", if that helps.
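To put a number on "bursty", the dropout times could be grouped into bursts. Below is a minimal Python sketch of that idea. The 5-minute gap threshold and the sample timestamps are assumptions for illustration only, not values taken from the actual logs.

```python
from datetime import datetime, timedelta


def group_bursts(events, max_gap=timedelta(minutes=5)):
    """Group dropout timestamps into bursts.

    Two consecutive events belong to the same burst when they are at most
    max_gap apart. The 5-minute default is an arbitrary assumption.
    """
    bursts = []
    for ts in sorted(events):
        if bursts and ts - bursts[-1][-1] <= max_gap:
            bursts[-1].append(ts)   # close enough: extend the current burst
        else:
            bursts.append([ts])     # too far apart: start a new burst
    return bursts


# Made-up timestamps, purely to show the call; replace with real dropout times.
sample = [
    datetime(2024, 12, 16, 12, 0),
    datetime(2024, 12, 16, 12, 1),
    datetime(2024, 12, 16, 12, 3),
    datetime(2024, 12, 16, 14, 30),
    datetime(2024, 12, 16, 14, 31),
]

for burst in group_bursts(sample):
    print(burst[0], "->", burst[-1], f"({len(burst)} drops)")
```

If the bursts line up with something periodic (a cron job, a DHCP renewal, a temperature peak), that would be a hint worth following up.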

Quote from: meyergru on December 16, 2024, 01:19:55 PM
In some other devices using the I225, this could be solved by a BIOS setting to disable ASPM, but IDK if your device has that.
I just checked, and it seems ASPM is already off by default. How nice of the supplier.

I forgot to mention that I also get very short half-second "hiccups", which are very noticeable while gaming.
I put my old janky Sitecom router in place of the OPNsense router and things have been totally fine.
Apart from the lack of some (fairly) important features, that is, so it's not really a good all-in-one solution.

So it only affects this one port, i.e. LAN?
WAN port is stable?
Did you check/replace the cable on this port?
To what is this port connected?
Is the device on the other side alright?

Regards,
S.
Networking is love. You may hate it, but in the end, you always come back to it.

OPNSense HW
APU2D2 - deceased
N5105 - i226-V | Patriot 2x8G 3200 DDR4 | L 790 512G - VM HA(SOON)
N100   - i226-V | Crucial 16G  4800 DDR5 | S 980 500G - PROD

Quote from: Seimus on December 17, 2024, 01:34:08 PM
So it only affects this one port, i.e. LAN?
This seems to only affect the LAN port. I don't see any logging when it hiccups, however.

WAN port is stable?
As far as logging is concerned it's doing fine. There's not really a "good" way to check this, but I have no reason to believe the WAN port is problematic.

Did you check/replace the cable on this port?
No, because I've effectively eliminated that variable by swapping the OPNsense box for the aforementioned Sitecom router, which implies the issue is in either the hardware or the software of the OPNsense box.

To what is this port connected?
It is connected to a 5-port 2.5 Gbps switch.
https://nl.aliexpress.com/item/1005006822096018.html

Is the device on the other side alright?
As far as I can tell, yes. I just can't connect to the OPNsense box (and everything behind it, e.g. the modem and the internet) while it's crapping out. Other devices on my LAN remain accessible and are totally OK.
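One way to tell a LAN-side dropout from a WAN-side one would be to probe the router and an outside host from a LAN client and log which of the two fails. The following is only a rough Python sketch of that idea. The helper names, the example addresses and the ports are all assumptions and need adjusting to the actual network.

```python
import socket
import time
from datetime import datetime


def tcp_probe(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def classify(lan_ok, wan_ok):
    """Map the two probe results to a rough verdict."""
    if not lan_ok:
        # The router itself is unreachable, so the WAN state is unknown.
        return "lan-side problem"
    if not wan_ok:
        return "wan-side problem"
    return "ok"


def watch(lan_target, wan_target, rounds, interval=1.0, probe=tcp_probe):
    """Probe both targets `rounds` times and return (timestamp, verdict) pairs."""
    results = []
    for _ in range(rounds):
        lan_ok = probe(*lan_target)
        # Only probe the outside host when the router answered.
        wan_ok = probe(*wan_target) if lan_ok else False
        verdict = classify(lan_ok, wan_ok)
        stamp = datetime.now().isoformat(timespec="seconds")
        results.append((stamp, verdict))
        if verdict != "ok":
            print(stamp, verdict)
        time.sleep(interval)
    return results


# Example call (addresses and ports are assumptions, not taken from this thread):
# watch(("192.168.1.1", 443), ("1.1.1.1", 443), rounds=3600)
```

A "wan-side problem" here only means the outside host did not answer while the router did, so it could just as well be the modem or the ISP. It would still separate those cases from the LAN NIC dropping out.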

Right now I'm considering imaging the SSD and trying a clean install to see if it still chokes. If it starts working, it's probably a bad configuration or a botched migration from one of the countless updates.

If anyone has ideas for something specific to check, I'm all ears.

It may sound strange, but there have also been reports that 2.5 Gbit/s is problematic. You could try to set the speed to 1 Gbit/s just to verify. I would guess that your other router only has 1 Gbit/s.

December 17, 2024, 02:42:00 PM #7 Last Edit: December 17, 2024, 03:44:21 PM by Nycix
Quote from: meyergru on December 17, 2024, 02:16:39 PM
It may sound strange, but there have also been reports that 2.5 Gbit/s is problematic. You could try to set the speed to 1 Gbit/s just to verify. I would guess that your other router only has 1 Gbit/s.
Good suggestion, but sadly I've already tried that. Kinda forgot to mention that.
I'm fairly sure the cable (which is like 10 years old lol) can't handle 2.5 Gbps,
especially considering the link would randomly choke real hard and fall back to 100 Mbps until a reboot.

Edit: After restoring the backup I can no longer log in, sigh.

December 17, 2024, 10:04:09 PM #8 Last Edit: December 17, 2024, 10:09:48 PM by Nycix
OK, I've had Proxmox + DD-WRT on my bucket list for some time around the end of the year, so I guess that got pulled forward out of necessity.
Sadly, it appears that DD-WRT has the same issue. It doesn't show up in the same way (the VM just panics and restarts), but it does strongly suggest there's a hardware issue. I'll dust off my ye olde J1900 PC and see if the issue still occurs. If it doesn't, there's something really fucky with the hardware of my current router.
As it is, I see no reason to think OPNsense is to blame here (minus the "useless" logging, I suppose).