PSA: Test kernel with Intel fixes is available for testing

Started by newsense, October 13, 2024, 12:34:39 AM

December 11, 2024, 02:39:25 PM #60 Last Edit: December 11, 2024, 02:42:18 PM by Grossartig
Thank you -- applied to both boxes half an hour ago and so far stable, but will run it for a bit longer this time. Do you want to give me _10 already so I can switch to it at my leisure a bit later? :)

EDIT: Nooo, the CARP issues just happened again. So I guess I need _8? :)

Here it is:

# opnsense-update -zkr 24.7.6_8

Thanks a lot for doing this BTW!

Thanks Franco, the pleasure is all mine. Who knows, may still just be an issue only affecting me, so I appreciate you taking time out of your busy calendar with this.

I applied _8 to both boxes and saw the same CARP issues in the logs. I'll revert back to _7 to confirm that it is gone. Then I'll probably switch forward to _8 to confirm it truly starts with that build.

So I've gone back and forth between _7 and _8 a couple of times. I only see the issue on _8, and it seems gone when I go back to _7. Smoking gun or coincidence? I'll leave it on _7 overnight and check the logs for any issues in the morning. Thanks!

Hmm, so it's the following commit:

https://github.com/opnsense/src/commit/1af69a3af3540f9f

which also seems to do something with igb...

Which Intel card is this again? Do you have EEE sysctl set?

# sysctl hw.em.eee_setting
hw.em.eee_setting: 1

Allegedly "1" means off, which is (or should be) the default. Turning EEE on ("0") could potentially introduce an instability for CARP / link activity.


Cheers,
Franco

It's the I211 network chip. See here for the complete details of my system (at bottom of that post).

Output of the EEE setting is 1 on both boxes:

root@OPNsense:~ # sysctl hw.em.eee_setting
hw.em.eee_setting: 1

root@OPNsense-Backup:~ # sysctl hw.em.eee_setting
hw.em.eee_setting: 1

Still all stable btw after leaving it on _7 overnight.
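
For readers following along, the value mapping Franco describes above can be sketched as a small shell helper. This is only an illustration: the function name `eee_state` is hypothetical, and the mapping (1 = EEE disabled, 0 = EEE enabled) is the documented meaning as stated in this thread, which later posts question in practice.

```shell
#!/bin/sh
# Interpret one line of `sysctl hw.em.eee_setting` output.
# Assumption taken from this thread: 1 = EEE disabled, 0 = EEE enabled.
# eee_state is a hypothetical helper name, not part of OPNsense or FreeBSD.
eee_state() {
    # Drop everything up to and including the last ": " to keep only the value.
    value="${1##*: }"
    case "$value" in
        1) echo "disabled" ;;
        0) echo "enabled" ;;
        *) echo "unknown" ;;
    esac
}

# Example using the output line captured above.
eee_state "hw.em.eee_setting: 1"
```

On a live box the same helper could be fed directly, for example `eee_state "$(sysctl hw.em.eee_setting)"`.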

Just for fun would you mind flipping the setting to 0?

# sysctl hw.em.eee_setting=0

I don't have a strong objection to backing this out for now, but I don't quite understand what is happening yet.


Cheers,
Franco

It appears I have to set it in loader.conf?

# sysctl hw.em.eee_setting=0
sysctl: oid 'hw.em.eee_setting' is a read only tunable
sysctl: Tunable values are set in /boot/loader.conf

It's not present in that file yet -- should I just add hw.em.eee_setting="0" to the bottom of it?

Rather add it in the UI.
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

Quote from: Patrick M. Hausen on December 12, 2024, 03:05:25 PM
Rather add it in the UI.

Good call. Made the tunable changes, brought the kernel back up to _8 (where the "issue" starts) and am rebooting both boxes now.
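
For anyone retracing this step, the tunable entry discussed above takes the form below. It is a sketch of the line itself only. As advised in this thread, on OPNsense it is better entered through the UI (System > Settings > Tunables, if memory serves) than appended to /boot/loader.conf by hand, since that file may be regenerated.

```
hw.em.eee_setting="0"
```

Because this is a boot-time tunable, it only takes effect after a reboot. Rerunning `sysctl hw.em.eee_setting` afterwards should then report the new value.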

It's been almost an hour and the issue has so far not re-appeared. I'll leave it for another hour on _8 with the EEE tunable set to 0, and then I want to bring the kernel back up to the current production version to see if it continues to be stable with this tunable in the 0 position. Unless you want me to do any other testing along the way :)

That test plan sounds good. I need to look at the Intel vendor driver to make sense of this. If the setting is reversed then there is a bug now in the driver. Pretty weird.


Cheers,
Franco

December 12, 2024, 06:27:48 PM #72 Last Edit: December 13, 2024, 01:31:01 AM by Grossartig
So I'm about three hours in to having gone from kernel 24.7.6_8 back up to the current 24.7.10. The EEE tunable is still set to 0. And no issues so far (knock on wood). Will keep it like this and monitor over the remainder of the day.

I think it's plausible that someone flipped the meaning of the tunable. 0 for Energy Efficient Ethernet OFF. 1 for EEE ON. Which would also be more intuitive.

Perhaps this also explains why the issue always took some time to materialize -- the energy efficiency mode only kicked in after some timeout? Just speculating of course.

EDIT: Several hours later, still all fine. The tunable fixed it!

Hmm, I checked the Intel driver. It seems correct, BUT:

From what I can tell, EEE was previously enabled by default; now it is forced to disabled.

Which means this is the right fix, but the issue is that igb and e1000 are different, and now the e1000 default of disabled will be impossible to change for igb, since both share the same value defaulting to off.


Cheers,
Franco

December 13, 2024, 02:57:38 PM #74 Last Edit: December 13, 2024, 03:04:41 PM by Grossartig
So if I understand you correctly, depending on network chipset, end users (like me) will have to set this tunable specifically according to their chipset.

Or OPNsense (or BSD?) could detect the chipset and set the tunable accordingly? But perhaps it's best to leave that to end users.

Also, is this maybe only impacting people using CARP (like I was)? If so, probably just a small percentage of users impacted.