Periodic NIC issues (?) with Protectli Vault, Intel i226-V

Started by fornax, Today at 02:09:53 AM

I'm troubleshooting an issue that has been popping up irregularly since I deployed OPNsense on a Protectli VP3210 (both new to me). The device is set up to perform all DHCP, DNS, firewall, and routing duties for the home network behind it.

Approximately every 1-7 days the network starts acting up. The symptoms aren't always consistent, but so far have tended to fall into one of three categories:

1. Something that looks like a DNS issue. Attempts to resolve an address will usually time out on the first try, but then succeed immediately a few seconds later. If I connect to the upstream router and use the same resolver, everything is normal.

2. DHCP will stop working for some/all devices.

3. An online game I play regularly has trouble connecting to the game servers.

Regardless of the symptom, the workaround that resolves it (temporarily) is the same. Go to Interfaces -> Settings, uncheck "Disable hardware checksum offload", Apply, recheck the box, Apply again. Everything immediately starts working as it should. (This is why I assume this is a NIC issue.)

Doing some research, I see that it's not uncommon for people to have issues with the Intel i226-V NICs, something I missed when I chose the hardware. Based on what I read I've been playing with various tunables, rebooting as necessary:

dev.igc.0.fc=0
dev.igc.1.fc=0
dev.igc.0.eee_control=0
dev.igc.1.eee_control=0
net.isr.bindthreads=1
net.isr.maxthreads=-1
net.isr.dispatch=deferred
net.inet.ip.intr_queue_maxlen=3000
hw.pci.enable_aspm=0
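
One way to confirm after a reboot that values like these actually took effect is to compare them against what sysctl reports. This is only a sketch: check_tunables is a made-up helper name, and it assumes each listed tunable is readable through `sysctl -n` on the running system.

```shell
#!/bin/sh
# Hypothetical helper: read "name=value" lines on stdin and compare each
# one against the live value reported by "sysctl -n".
# SYSCTL may be pointed at another command for testing; by default the
# system sysctl binary is used.
check_tunables() {
    while IFS='=' read -r name want; do
        [ -n "$name" ] || continue          # skip blank lines
        have=$(${SYSCTL:-sysctl} -n "$name" 2>/dev/null) || have="(unavailable)"
        if [ "$have" = "$want" ]; then
            echo "ok       $name=$have"
        else
            echo "MISMATCH $name: want $want, have $have"
        fi
    done
}

# Usage on the firewall (paste the tunables between the EOF markers):
# check_tunables <<'EOF'
# dev.igc.0.fc=0
# hw.pci.enable_aspm=0
# EOF
```

A mismatch is not always a problem. net.isr.maxthreads=-1, for example, may be reported back as the resolved thread count rather than -1.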

So far nothing has made a difference. The other thing that seems to be done commonly with these NICs is to upgrade the NVM firmware, which I'll try if I have to but that's a bit intimidating. Anyone have any other ideas before I go that route?

ASPM is causing this for I226 devices and I am not aware that updating the NIC firmware fixes that.

If there is an updated BIOS for the Protectli, try that first. You can actually make the problem go away by turning ASPM off, but AFAIK you can only disable it for the whole machine under OPNsense if the BIOS does not set it selectively for your NICs.

The global setting is done by setting the tunable hw.pci.enable_aspm=0. You should probably also set dev.igc.X.eee_control=0 with X=0,1.
Intel N100, 4* I226-V, 2* 82559, 16 GByte, 500 GByte NVME, Leox LXT-010H-D

1100 down / 450 up, Bufferbloat A+

Hi there,

While having a look at this issue, I noticed a potential bug in the iflib code that makes an automatic reset after a TX hang impossible. A custom kernel that resolves this has been published (though it is likely not the final version of the patch). Would you mind installing this kernel to see whether it changes anything about the issue?

# opnsense-update -zk 26.1.10-iflib
The commit in question is https://github.com/opnsense/src/commit/8dd26e6351d72a53fab5d47a16d053d5f8648353.

If it's this issue, you should see "watchdog timeout" messages appearing in your dmesg/system log. After this, an automatic reset should recover connectivity. If this happens, can you share these logs?
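
A quick way to pull those messages out of the kernel message buffer is a case-insensitive grep. This is a minimal sketch: find_watchdog is a hypothetical helper name, and the match string is simply the "watchdog timeout" wording quoted above, so the exact message text on your system may differ.

```shell
#!/bin/sh
# Hypothetical helper: read log text on stdin and print only the lines
# that mention a watchdog timeout (case-insensitive).
find_watchdog() {
    grep -i 'watchdog timeout'
}

# Usage on the firewall:
# dmesg | find_watchdog
```

No output means no matching line was found in the input; grep exits non-zero in that case.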

Your description of the issue sounds similar to others; however, there are still a lot of gaps to fill. Most notably, do you always need manual intervention to fix the issue, or does it recover on its own? Is it always the same igc interface? What is the auto-negotiated link state at the time of failure (# ifconfig igcX)? If there's no auto-negotiation, what link speed did you set it to?

Also, and perhaps most importantly, can you share a snapshot of

# sysctl dev.igc.X (where X is the affected interface) after the failure?
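
To make it easier to grab all of this in one go while the failure is still present, something like the following could be used. It is a sketch only: snapshot_igc is a made-up helper name, and the /tmp default and the file naming are assumptions, not anything OPNsense provides.

```shell
#!/bin/sh
# Hypothetical helper: save the ifconfig and sysctl state of one igc
# interface into a timestamped text file and print the file's path.
#   $1 = interface name, e.g. igc0
#   $2 = output directory (optional, defaults to /tmp)
snapshot_igc() {
    ifname=$1
    unit=${ifname#igc}                      # igc0 -> 0
    out=${2:-/tmp}/${ifname}-$(date +%Y%m%d-%H%M%S).txt
    {
        echo "== date =="
        date
        echo "== ifconfig $ifname =="
        ifconfig "$ifname" || echo "(ifconfig failed)"
        echo "== sysctl dev.igc.$unit =="
        sysctl "dev.igc.$unit" || echo "(sysctl failed)"
    } > "$out" 2>&1
    echo "$out"
}

# Usage on the firewall, right after the failure and before any workaround:
# snapshot_igc igc0
```

Taking the snapshot before toggling anything matters, since the workaround may reset the counters and state you want to capture.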

Lastly, please do these tests with all default tunables. As far as I know, dev.igc.0.eee_control=0 will *enable* EEE.

Cheers,
Stephan

And I forgot to ask: since you mention that toggling offloading fixes it, does

# ifconfig igcX down && ifconfig igcX up

also fix it?

Cheers,
Stephan