[SOLVED] Intel I226-V NICs going down randomly without obvious reason

Started by imk82, August 15, 2024, 10:18:41 PM

Hi all,

I have a pretty weird problem after using OPNsense for years with absolutely stable operation. My box is partially crashing, in the sense that no network interaction (ssh, routing, web interface access) is possible anymore. Direct access via (literally) keyboard and monitor is still possible, and nothing obvious looks wrong there.

The issue seems random so far and I was not fully able to track down when it happens. But it seems that it happens when there is load on the connection (100Mbit VDSL via PPPoE connection).

All logs I am aware of (system log, pppoe log, dmesg) are completely clean, i.e. they contain the exact same entries as before the network became inaccessible.

A reboot fixes the problem.

Does anybody have an idea, or can maybe point me in a direction for further debugging?

Thanks in advance
Robert

Any other changes in the infrastructure you can think of? Adding or removing a switch or a device that could be misbehaving and affecting the routes? I guess you have discounted any recent changes to OPN configuration.

Hi,

yes, sorry, I skipped some parts of the story, and maybe too much of the background on why I am quite sure it is a problem with the opnsense box.

The problem started after I changed my opnsense box (hardware switch to a N100 based one + switch to 24.7.1 from 24.1.10). The rest of my network equipment (modem, switches, APs) is the same and I didn't change anything there. This, combined with the fact that everything else is still accessible (Unifi hardware, other router) and that an opnsense reboot from the local console fixes it, pointed me in this direction.

But it is completely weird: I see literally no trace in the mentioned logs of a network driver crash or anything similar. I would expect any kind of hardware failure to leave traces, like complete freezes or log errors about crashing services. It is as if some firmware in the network hardware (Intel I226 NICs) is crashing and the OS does not even notice (just a theory).

Best regards
Robert


I've had strange problems in my time and had to reboot all network devices. Diagnostics were limited and inconclusive.
But that's the equivalent of switching off and back on, which I dislike.
My only idea is to run network diagnostics at the console when it happens, though it takes experience to spot what looks odd: ARP tables, routes and that sort of thing.
Sorry, not very helpful, but without visible crashes it leaves me thinking the problem is somewhere _around_ OPN.
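
To make "spot what looks odd" a bit easier, one option is to capture the same diagnostics once while everything works and once during a failure, then diff the two. This is only a rough sketch, assuming Python 3 is available on the box and that the standard FreeBSD tools `ifconfig`, `netstat` and `arp` are on the path; the command selection and all function names are my own choice, not anything OPNsense provides:

```python
import difflib
import subprocess

# Assumed command set (standard FreeBSD tools); adjust to taste.
COMMANDS = {
    "interfaces": ["ifconfig", "-a"],
    "routes": ["netstat", "-rn"],
    "arp": ["arp", "-an"],
}


def run(cmd):
    """Run one command and return its stdout, or a marker if it fails to start."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
        return result.stdout
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"<failed: {exc}>\n"


def snapshot(commands=COMMANDS, runner=run):
    """Collect the output of every command into a dict keyed by section name."""
    return {name: runner(cmd) for name, cmd in commands.items()}


def diff_snapshots(good, bad):
    """Return unified-diff lines for every section that differs between snapshots."""
    changes = {}
    for name in good:
        before = good[name].splitlines()
        after = bad.get(name, "").splitlines()
        delta = list(difflib.unified_diff(before, after, "good", "bad", lineterm=""))
        if delta:
            changes[name] = delta
    return changes
```

Call `snapshot()` once in the healthy state and keep the result (e.g. written to a file), call it again from the console during the outage, and print whatever `diff_snapshots(good, bad)` returns. An empty result would itself be a hint that the OS does not see the failure at all.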

But all the other network equipment wasn't changed, that's why I think it must be something with the box.

What I didn't try so far: when the problem happens, connecting a cable directly from my laptop to the opnsense box. If that does not work either, no other network hardware can be involved.

Are there any other logs, options or debug levels I can raise to narrow my problem down further?

Quote from: imk82 on August 16, 2024, 08:06:54 AM
Are there any other logs, options or debug levels I can raise to narrow my problem down further?

Hmmm....

Quote from: imk82 on August 15, 2024, 10:55:24 PM
The problem started after I changed my opnsense box (hardware switch to a N100

Dunno, chances are high you simply have bought a lemon.

Hi,

just to give feedback about the current state:
* did some longer memory tests -> all good
* did some longer CPU load tests -> all good
* did CPU microcode updates -> still happens
* searched for bios updates -> none available
* still not able to trigger the problem on purpose, it seems to happen randomly

During the last occurrences I did some further tests to see what is really happening:
* not every network connection breaks, only one of my three connected ones
* the PPPoE connection was still up and running (could ping hosts on the internet)
* one of my internal connections was running (next internal gateway could be pinged)
* another connection was completely unusable, but without any sign on either side of the connection (no log entries or similar)
** re-plugging the cable fixes it
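
The per-link checks above could be scripted so the dead link is identified quickly the next time it happens. A minimal sketch, assuming Python 3 on the box and FreeBSD's `ping` flags (`-c` for the packet count, `-t` for the overall timeout in seconds); the link names and peer addresses are hypothetical placeholders, not my real setup:

```python
import subprocess

# Hypothetical example: one host that is only reachable through each link.
PEERS = {
    "wan_pppoe": "198.51.100.1",
    "lan_gateway": "192.0.2.1",
    "lan_second": "192.0.2.129",
}


def ping_once(host):
    """Return True if a single echo request to host is answered (FreeBSD ping flags)."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-t", "2", host],
            capture_output=True,
            timeout=5,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def check_links(peers, probe=ping_once):
    """Probe one peer per link and return a dict of link name -> reachable."""
    return {link: probe(host) for link, host in peers.items()}


def dead_links(status):
    """Return the sorted names of all links whose peer did not answer."""
    return sorted(link for link, ok in status.items() if not ok)
```

Running `dead_links(check_links(PEERS))` from the local console (or periodically, with the result written to a log) would show whether it is always the same port that drops out.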

In summary, my current working hypothesis is that I may be hit by this bug: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=279245

Worst case, so to say..

But, if so, I am asking myself what the trigger is and why it only happens with one NIC (all are I226-V).

Does anybody have an idea how to proceed here? It is really annoying and I have no idea what to do. :-(

Thanks and best regards
Robert

Summary after much more trial and error:
The root cause of the described problem was the BIOS setting Active-State Power Management (ASPM). After setting all occurrences to "disabled", the box and especially its NICs are rock stable.

The downside is higher energy consumption, which ultimately leads to a hotter heatsink. But I think I kind of "overfixed" the problem by disabling all 9 (!) occurrences of the setting in the BIOS. I will re-enable them step by step to find the single one causing the problem.
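
Since each test round may mean days of waiting for a random failure, re-enabling half of the remaining candidates at a time (bisection) could be quicker than going one by one. This is just a sketch of that idea, not what I have actually done so far, and it assumes exactly one of the settings is the culprit. The setting names and the `is_stable_with_enabled` callback (standing in for "run the box for a while and see if a NIC drops") are hypothetical:

```python
def bisect_culprit(settings, is_stable_with_enabled):
    """Find the single setting whose re-enabling breaks stability.

    settings: list of setting names, all currently disabled.
    is_stable_with_enabled: callback taking the list of settings that are
        re-enabled for one test round and returning True if the box stayed stable.
    Returns (culprit, number_of_test_rounds).
    Assumption: exactly one setting causes the problem.
    """
    candidates = list(settings)
    rounds = 0
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        rounds += 1
        if is_stable_with_enabled(half):
            # Culprit is not in the enabled half, so it is among the rest.
            candidates = candidates[len(half):]
        else:
            # A NIC dropped: the culprit is among the settings just enabled.
            candidates = half
    return candidates[0], rounds


# Hypothetical labels for the 9 ASPM entries in the BIOS.
ASPM_SETTINGS = [f"ASPM entry {i}" for i in range(1, 10)]
```

With 9 entries this needs at most 4 test rounds instead of up to 8 when re-enabling them one at a time.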

Yes, that fixed it on my Minisforum MS-01, too. Just updated my bug report on FreeBSD bugzilla.
Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 770 up, Bufferbloat A

Thanks!

Seems I have the exact same problem!

Some more thoughts. In my experience, link issues are often caused by broken cables or power management. In the past I had similar issues with regular link loss between opnsense (Intel X553) and a fritzbox. After disabling EEE (Energy Efficient Ethernet) the problem was solved.
OPNsense 24.7.11_2-amd64

Small update on this thread: it seems Intel did an upstream fix for the issue related to I226-x Ethernet controllers, so watch out for new BIOS versions. My OEM (Shuttle) released one in July which resolved ALL stability issues (NICs going down etc.).