25.7.2 needing hard reboot every few hours

Started by Tjh3, September 04, 2025, 05:52:35 AM

September 04, 2025, 05:52:35 AM Last Edit: September 04, 2025, 05:58:55 AM by Tjh3
So ever since I updated to 25.7.2, my OPNsense router (link to pcpartpicker page for the router build - AM4 motherboard, 256GB NVMe drive and 5600GT) just stops responding or doing anything every few hours, needing a hard reboot. I was able to grab some logs, which I'm attaching. To put times on things: the issue happened a couple of times that I remember, on 28-Aug around 17:30 local time and on 29-Aug sometime before 09:00 local time. (I removed the 'notice' and 'information' lines to keep it small enough to upload here.)

I was able to get an alternate OPNsense box I had with an older version and am currently running 25.1, but I'd really like to figure out what happened if possible. Unfortunately, the OPNsense box that had this issue is no longer recoverable (while attempting to solve the issue, I seem to have wiped the hard drive somehow), and I'm now afraid to upgrade past 25.1.

Have you tested the hardware outside of OPNsense? e.g. memtest86 and mprime. I don't have a suggestion for testing the SSD, other than checking SMART counters (offhand I don't see the device-specific counters in the OPNsense SMART utility, so I'd look from a shell).

September 04, 2025, 07:57:58 PM #2 Last Edit: September 04, 2025, 08:02:18 PM by meyergru
What exactly does "stops responding" mean? No response from the network? Or a complete system lockup?

The reason I ask is that many people see problems with ASPM enabled on Intel adapters. Mostly, this was with I226-V, but I experienced network lockups because of my 82599ES, which you have in your system. FreeBSD just is not the best when it comes to energy saving modes.

You can find if your device has ASPM enabled via: "pciconf -lbcevV" - the relevant part of the output looks like this:

ix1@pci0:1:0:1: class=0x020000 rev=0x01 hdr=0x00 vendor=0x8086 device=0x10fb subvendor=0x8086 subdevice=0x000c
    vendor     = 'Intel Corporation'
    device     = '82599ES 10-Gigabit SFI/SFP+ Network Connection'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 64, base 0x81080000, size 524288, enabled
    bar   [18] = type I/O Port, range 32, base 0x3000, size 32, enabled
    bar   [20] = type Memory, range 64, base 0x81200000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit, vector masks
    cap 11[70] = MSI-X supports 64 messages, enabled
                 Table in map 0x20[0x0], PBA in map 0x20[0x2000]
    cap 10[a0] = PCI-Express 2 endpoint max data 256(512) FLR RO NS
                 max read 512
                 link x4(x8) speed 5.0(5.0) ASPM disabled(L0s)
    cap 03[e0] = VPD
    ecap 0001[100] = AER 1 0 fatal 0 non-fatal 1 corrected
    ecap 0003[140] = Serial 1 90e3baffff00be8a
    ecap 000e[150] = ARI 1
    ecap 0010[160] = SR-IOV 1 IOV disabled, Memory Space disabled, ARI disabled
                     0 VFs configured out of 64 supported
                     First VF RID Offset 0x0180, VF RID Stride 0x0002
                     VF Device ID 0x10ed
                     Page Sizes: 4096 (enabled), 8192, 65536, 262144, 1048576, 4194304

Note the "ASPM disabled(L0s)". If yours shows something different and your BIOS does not offer a way to disable ASPM, you can add the tunable "hw.pci.enable_aspm = 0" to disable it.
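To pull that status out of a longer listing without reading every block, something like the following Python sketch could work. It assumes the output layout shown above (a device header line starting at column 0, followed by indented capability lines). The function name and the regexes are my own, and the sample is a trimmed copy of the output above.

```python
import re

# Trimmed copy of the pciconf output shown above.
SAMPLE = """\
ix1@pci0:1:0:1: class=0x020000 rev=0x01 hdr=0x00 vendor=0x8086 device=0x10fb
    device     = '82599ES 10-Gigabit SFI/SFP+ Network Connection'
    cap 10[a0] = PCI-Express 2 endpoint max data 256(512) FLR RO NS
                 max read 512
                 link x4(x8) speed 5.0(5.0) ASPM disabled(L0s)
"""

# Device header, e.g. "ix1@pci0:1:0:1: class=..."
HEADER = re.compile(r"^(\S+@pci[\d:]+):\s")
# "ASPM <status>(<states in brackets>)"
ASPM = re.compile(r"ASPM (\S+?)\(([^)]+)\)")


def aspm_by_device(text):
    """Return (selector, status, states_in_brackets) for each device with an ASPM entry."""
    results = []
    selector = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            selector = header.group(1)
            continue
        aspm = ASPM.search(line)
        if aspm and selector is not None:
            results.append((selector, aspm.group(1), aspm.group(2)))
    return results


print(aspm_by_device(SAMPLE))
# [('ix1@pci0:1:0:1', 'disabled', 'L0s')]
```

On the firewall itself the text could come from something like `subprocess.run(["pciconf", "-lbcevV"], capture_output=True, text=True).stdout` instead of the pasted sample.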
Intel N100, 4* I226-V, 2* 82559, 16 GByte, 500 GByte NVME, ZTE F6005

1100 down / 800 up, Bufferbloat A+

Quote from: meyergru on September 04, 2025, 07:57:58 PMlink x4(x8) speed 5.0(5.0) ASPM disabled(L0s)

Question,
It shows ASPM for L0s is disabled. What about the other power state L1?

The sysctl item should disable ASPM for all PCI devices.

Is the last post yours (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=279245) ?
Duly noted though, some users on that bug site say the issue they see happens under heavy load, which should not be ASPM related.

I also wonder why ASPM would ever sleep a NIC (PCIe link) in a WAN/LAN setup. The fw device itself always has traffic on the WAN port, and LAN hosts (IoT, PC, other) usually keep the LAN port busy. Too-aggressive power policies can cause the sleep issue, but I can't figure out why ASPM would cause the NIC (PCIe link) to power down. L0s should only idle one direction of the PCIe link (Tx or Rx), whereas L1 sleeps the whole link and the clocks go idle.



September 04, 2025, 09:59:17 PM #5 Last Edit: September 04, 2025, 10:18:19 PM by BrandyWine
Maybe also pull out config register info, maybe something in there is odd?

Access specific configuration registers
eg; pciconf -r pci0:0:2:0 0x10
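
For what it's worth, the ASPM bits are easy to pick out of a raw register read. Below is a minimal Python sketch, assuming the standard PCIe capability layout: Link Capabilities at capability base + 0x0C with ASPM Support in bits 11:10, and Link Control at base + 0x10 with ASPM Control in bits 1:0. The function names and the example register values are made up for illustration. For the devices quoted in this thread, whose PCI-Express capability sits at [a0], that would put Link Control at 0xb0 rather than at the 0x10 used in the example command above.

```python
# Sketch: decode the ASPM fields of the PCIe Link Capabilities and
# Link Control registers. Offsets are relative to the PCI-Express
# capability base that pciconf prints in brackets, e.g. cap 10[a0].

LINK_CAP_OFFSET = 0x0C  # Link Capabilities (32 bit)
LINK_CTL_OFFSET = 0x10  # Link Control (16 bit)

_STATES = {0b01: "L0s", 0b10: "L1", 0b11: "L0s/L1"}


def register_offsets(cap_base):
    """Return (link_cap, link_ctl) config-space offsets for a capability base."""
    return cap_base + LINK_CAP_OFFSET, cap_base + LINK_CTL_OFFSET


def aspm_supported(link_cap):
    """ASPM Support field, bits 11:10 of Link Capabilities."""
    return _STATES.get((link_cap >> 10) & 0b11, "none")


def aspm_control(link_ctl):
    """ASPM Control field, bits 1:0 of Link Control.

    A 32-bit read at the Link Control offset also carries Link Status
    in the upper half; the mask ignores it.
    """
    return _STATES.get(link_ctl & 0b11, "disabled")


# Hypothetical register values, not read from any device in this thread.
example_cap = 0x0482  # 5.0 GT/s, x8, ASPM Support = 01b
example_ctl = 0x0000  # ASPM Control = 00b

print([hex(offset) for offset in register_offsets(0xA0)])
# ['0xac', '0xb0']
print(aspm_control(example_ctl), aspm_supported(example_cap))
# disabled L0s
```

If pciconf's "ASPM x(y)" follows the same current(maximum) pattern as its link width and speed fields, which I have not checked against the source, then x would correspond to aspm_control and y to aspm_supported.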

References
https://www.intel.com/content/www/us/en/programmable/pcie-register-map/current/index.html#hdq1622511525681.html
https://www.intel.com/content/www/us/en/docs/programmable/683488/16-0/pci-express-capability-structure.html

Here's an interesting read about how ASPM being enabled in the BIOS can impact NVMe device speeds.
https://superuser.com/questions/1822809/why-does-disabling-active-power-management-in-bios-double-nvme-speed

For a fw device, disable all the ASPM / powerd stuff, in BIOS and via OS sysctl (tunables).

September 04, 2025, 10:04:06 PM #6 Last Edit: September 04, 2025, 10:06:35 PM by meyergru
Quote from: BrandyWine on September 04, 2025, 09:44:26 PM
Quote from: meyergru on September 04, 2025, 07:57:58 PMlink x4(x8) speed 5.0(5.0) ASPM disabled(L0s)

Question,
It shows ASPM for L0s is disabled. What about the other power state L1?


What do you mean with that? I disabled ASPM, so now it is off and the problems are gone.

Quote from: BrandyWine on September 04, 2025, 09:44:26 PMThe sysctl item should disable ASPM for all PCI devices.


I know. That is all FreeBSD offers if you cannot disable the setting for specific devices in BIOS. For an add-on card like the OP uses, this is even less likely than when the NIC is builtin.

Quote from: BrandyWine on September 04, 2025, 09:44:26 PMIs the last post yours (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=279245) ?
Duly noted though, some users on that bug site say the issue they see happens under heavy load, which should not be ASPM related.

I also wonder why ASPM would ever sleep a NIC (PCIe link) in a WAN/LAN setup. The fw device itself always has traffic on the WAN port, and LAN hosts (IoT, PC, other) usually keep the LAN port busy. Too-aggressive power policies can cause the sleep issue, but I can't figure out why ASPM would cause the NIC (PCIe link) to power down. L0s should only idle one direction of the PCIe link (Tx or Rx), whereas L1 sleeps the whole link and the clocks go idle.

Yes it is, as is the bug itself. And the only report of "heavy traffic" within that I225/I226-related thread is by someone that popped in and thought he had a similar problem, but with a completely different model, namely an Intel i228-LM 2.5GB, so that seems to be unrelated.

I cannot tell you why the problem appears, all I can say is:

1. I know that the ASPM problem occurred under Linux as well and that they have handled it for the I226 model in the driver - that is why I say that FreeBSD is not very good at such things.

2. I have a BIOS that can disable ASPM for the I226 ports selectively; however, I started to have problems when I switched my LAN to the 82599ES, again in periods of low activity. I remembered the ASPM problem on the I226 and applied the tunable, and hey, presto - the problem went away.

At times, I can be very pragmatic - once I find a solution to a problem, I gladly accept that it works.

September 05, 2025, 12:20:58 AM #7 Last Edit: September 05, 2025, 12:29:27 AM by BrandyWine
Quote from: meyergru on September 04, 2025, 10:04:06 PMWhat do you mean with that? I disabled ASPM, so now it is off and the problems are gone.
The specific reference from pciconf to "(L0s)" is the question.

pciconf says "ASPM disabled(L0s)"

L0s is just one power state in pcie, there is also L1.
Why does the ASPM status specifically reference "L0s"? Why would it not just say "ASPM disabled" if ASPM was 100% disabled?

From Intel PDF (https://www.intel.com/content/dam/doc/white-paper/pci-express-architecture-power-management-rev-1-1-paper.pdf)
Quote
3. Power saving opportunities during link power states
This section describes power saving opportunities during various L-states, mainly at the device level. Platform level opportunities are identified where appropriate.
3.1 Active state power management
Active state power management is the hardware capability to power-manage the PCI Express link. Only L0s and L1 are used during active state power management.
• L0s: This link state is a very low exit latency link state intended to reduce power wastage during short intervals of logical idle between link activities. L0s support is required, and, assuming L0s usage is enabled for the link, active state power management in each port's transmitter must transition the appropriate link direction to L0s after an idle period in the range of 25%-100% of the opposing port's reported L0s exit latency. Acknowledgment of this entry is implicit. The power saving opportunities during this state include, but are not limited to, most of the transceiver circuitry as well as the clock gating of at least the link layer logic. Devices must transition to L0s independently on each direction of the link. Minimizing L0s exit latency optimizes performance/power considerations. For example, innovation on the clock recovery mechanism would help to reduce the number of fast-training sequences required and hence minimize L0s exit latency. It is strongly recommended that Mobile devices optimize their implementation to minimize the number of fast-training sequences for synchronization during L0s exit.
• L1: This link state is a low exit latency link state that is intended to reduce power when the device becomes aware of a lack of outstanding requests or pending transactions. Although the PCI Express base specification rev1.0 defines L1 support to be optional, it is required for mobile platforms in order to optimize battery life and thermal design power constraints. If L1 entry is rejected, the link must transition to L0s. L1 entry policy is not mandated in the PCI Express base specification; however, to promote innovation, this document discusses a few approaches to optimizing L1 usage. The power saving opportunities during this state include, but are not limited to, shutdown of most of the transceiver circuitry, clock gating of most PCI Express architecture logic, and shutdown of the PLL

September 05, 2025, 12:32:33 AM #8 Last Edit: September 05, 2025, 01:07:14 AM by BrandyWine
@meyergru
Here's an example on FreeBSD showing ASPM disabled for both L0s and L1.

Notice the ASPM status calls out both states for disabled.
Maybe the 226 NVM does not have code to support L1, thus pciconf only reports L0s ?

powerspec 3 has four states D0 D1 D2 D3, and your controller only has D0 D3. Other controllers support all four. Supporting D3hot means L1 should be supported, but your output only shows L0s, making me think it's an NVM issue.

To support L1 for D3 it needs to be D3hot, so perhaps the example below does not support D3cold.

Edit: ok, I think we see something. Your controller is D0 D3 only, my guess is "D3" is spec D3cold (from the pdf doc in post #7). D3cold maps to L2/L3 states. So with that matrix from pdf your controller does not have support for L1 since it does not support D1 or D2 or D3hot. If it supported D3hot then I would expect pciconf for ASPM disabled to call out L1. However, D3cold is a L2 L3 mapping, and your ASPM disabled does not call that out either, which maybe indicates the NVM is only supporting L0s even though pciconf is reporting support for D0 D3.

ASPM L2/L3ready, seems that's the "wake on lan" feature ?

I wonder if when OP's device goes unresponsive if all the nic lights go dark?

Quote
# pciconf -lcv re0
re0@pci0:2:0:0: class=0x020000 rev=0x15 hdr=0x00 [...]
    vendor = 'Realtek Semiconductor Co., Ltd.'
    device = 'RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller'
    [...]
    cap 01[40] = powerspec 3 supports D0 D1 D2 D3 current D0
    cap 05[50] = MSI supports 1 message, 64 bit
    cap 10[70] = PCI-Express 2 endpoint MSI 1 max data 128(128) RO
                 max read 4096
                 link x1(x1) speed 2.5(2.5) ASPM disabled(L0s/L1)

September 05, 2025, 01:05:01 AM #9 Last Edit: September 05, 2025, 01:12:09 AM by meyergru
My understanding is that "ASPM disabled" means that the management (i.e. the switching of states) is disabled. Whatever is shown in brackets is just the current state, which may be dictated by the hardware capabilities (only guessing here).

As a matter of fact, the 82599ES is a very old design, which may only support a subset of what is available. My I226-V (which is definitely a newer design) on the same machine shows L1 as being active, with switching being disabled:

igc3@pci0:5:0:0:        class=0x020000 rev=0x04 hdr=0x00 vendor=0x8086 device=0x125c subvendor=0x8086 subdevice=0x0000
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller I226-V'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 32, base 0x80500000, size 1048576, enabled
    bar   [1c] = type Memory, range 32, base 0x80600000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit, vector masks
    cap 11[70] = MSI-X supports 5 messages, enabled
                 Table in map 0x1c[0x0], PBA in map 0x1c[0x2000]
    cap 10[a0] = PCI-Express 2 endpoint max data 256(512) FLR RO NS
                 max read 512
                 link x1(x1) speed 5.0(5.0) ASPM disabled(L1)
    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 0 corrected
    ecap 0003[140] = Serial 1 00f0b4ffff0fb6cb
    ecap 0018[1c0] = LTR 1
    ecap 001f[1f0] = Precision Time Measurement 1
    ecap 001e[1e0] = L1 PM Substates 1

And no, the 82599ES has no Wake on LAN, AFAIK. Mine is SFP+ only, but even the X540-Tx types did not have that.

September 05, 2025, 01:26:28 AM #10 Last Edit: September 05, 2025, 01:29:41 AM by BrandyWine
Quote from: meyergru on September 05, 2025, 01:05:01 AMMy understanding is, that "ASPM disabled" means that the management (i.e. the switching of states) is disabled. Whatever is shown in brackets is just the current state

I might disagree. Only the D state shows current state.

Notice in my quoted example that the ASPM disabled status shows "(L0s/L1)". If ASPM is disabled then we expect the system not to be in any ASPM state (L0s L1 L2 L3) at all, hence a PCIe L0 state when ASPM is fully disabled.

I think the reference to L state in ASPM disabled status tells us which states are disabled.

L0 is the fully powered state with no power saving; only L0s and deeper are the power-saving modes.

We were thinking nic card, but maybe it's something else on pcie that does not have ASPM disabled. Need to look at all the pci devices.

Probably, what is sure is that ASPM is disabled and that with it disabled, my problems go away.

But let's keep to the topic at hand and if disabling ASPM helps the OP...

September 05, 2025, 03:31:25 AM #12 Last Edit: September 05, 2025, 03:39:09 AM by BrandyWine
Quote from: meyergru on September 05, 2025, 01:55:06 AMProbably, what is sure is that ASPM is disabled and that with it disabled, my problems go away.

But let's keep to the topic at hand and if disabling ASPM helps the OP...
Indeed, but also need to validate the ASPM status for all the devices on pcie.
With the tunable installed, if some devices say disabled and others not, then there's a code issue somewhere.
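
A quick way to do that check across every device could look like the Python sketch below. It assumes the pciconf layout quoted earlier in this thread (header line at column 0, indented "link ... ASPM status(...)" line). The sample text is hypothetical, not taken from anyone's machine, and the function name is my own.

```python
import re

# Hypothetical sample: one device with ASPM disabled, one where it is still active.
SAMPLE = """\
igc0@pci0:2:0:0: class=0x020000 rev=0x04 hdr=0x00
    cap 10[a0] = PCI-Express 2 endpoint max data 256(512) FLR RO NS
                 link x1(x1) speed 5.0(5.0) ASPM disabled(L1)
nvme0@pci0:3:0:0: class=0x010802 rev=0x01 hdr=0x00
    cap 10[70] = PCI-Express 2 endpoint max data 256(256) FLR NS
                 link x4(x4) speed 8.0(8.0) ASPM L1(L0s/L1)
"""

HEADER = re.compile(r"^(\S+@pci[\d:]+):\s")
ASPM = re.compile(r"ASPM (\S+?)\(")


def split_by_aspm(text):
    """Group device selectors by whether the ASPM status reads 'disabled'."""
    groups = {"disabled": [], "active": []}
    selector = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            selector = header.group(1)
            continue
        status = ASPM.search(line)
        if status and selector:
            key = "disabled" if status.group(1) == "disabled" else "active"
            groups[key].append(selector)
    return groups


print(split_by_aspm(SAMPLE))
# {'disabled': ['igc0@pci0:2:0:0'], 'active': ['nvme0@pci0:3:0:0']}
```

Anything that lands in the "active" list with the tunable set would be the device to look at more closely. Devices without an ASPM entry at all are simply skipped.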

For the OP.
That MSI mobo should have ASPM features in the BIOS (or some form of power mgmt). If not, check the BIOS version or contact the maker and ask.
A question for the NIC maker: ask them what power states the card firmware supports. I think you can also query the registers for this.

September 05, 2025, 04:16:57 AM #13 Last Edit: September 05, 2025, 04:37:09 AM by BrandyWine
So, what does the reference to L0s or L1 mean? I think pciconf shows us the capabilities of that device, as disabled.

Here's mine (my mini China PC):
# pciconf -lbcevV | grep ASPM
                link x1(x1) speed 5.0(5.0) ASPM disabled(L1) Ethernet Controller I226-V
                link x1(x1) speed 5.0(5.0) ASPM disabled(L1) Ethernet Controller I226-V
                link x1(x1) speed 5.0(5.0) ASPM disabled(L1) Ethernet Controller I226-V
                link x1(x4) speed 8.0(8.0) ASPM disabled(L0s/L1) Hosin Global Electronics Patriot P300 NVMe SSD (DRAM-less)
                link x4(x8) speed 5.0(5.0) ASPM disabled(L0s) 82599ES 10-Gigabit SFI/SFP+ Network Connection
                link x4(x8) speed 5.0(5.0) ASPM disabled(L0s) 82599ES 10-Gigabit SFI/SFP+ Network Connection


I can see some recent fixes for igc in a Linux driver release; I guess the I226-V has an issue with L1.2. Is the same igc fix in the FreeBSD 14.3 release? Where's the source used by the FreeBSD project?

Aug 21 2025 "igc: fix disabling L1.2 PCI-E link substate on I226 on init"
https://github.com/torvalds/linux/commits/master/drivers/net/ethernet/intel/igc/igc_main.c

Quote
/* Disable ASPM L1.2 on I226 devices to avoid packet loss */
if (igc_is_device_id_i226(hw))
    pci_disable_link_state(pdev, PCIE_LINK_STATE_L1_2);