OPNsense Forum

English Forums => 25.1, 25.4 Production Series => Topic started by: ze0Ood0O on March 18, 2025, 11:34:09 AM

Title: 25.1 high load causing routing and services failures
Post by: ze0Ood0O on March 18, 2025, 11:34:09 AM
Good morning.  I upgraded to 25.1.2 from 24.7 a couple of weeks ago and did not notice any problems right away, but ~4 days after upgrading my firewall became unresponsive and was only intermittently routing traffic.  Investigation shows the kernel{if_io_tqg_2} thread apparently hung: it uses 100% of one core, and the load on the box gradually increases until services stop responding.

198 threads:   7 running, 177 sleeping, 14 waiting
CPU 0:  1.2% user,  0.0% nice,  1.9% system,  0.0% interrupt, 96.9% idle
CPU 1:  0.4% user,  0.0% nice,  0.8% system,  0.0% interrupt, 98.8% idle
CPU 2:  0.0% user,  0.0% nice,  100% system,  0.0% interrupt,  0.0% idle
CPU 3:  0.4% user,  0.0% nice,  1.5% system,  0.0% interrupt, 98.1% idle
Mem: 130M Active, 1069M Inact, 1172M Wired, 343M Buf, 1607M Free

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
    0 root        -60    -     0B   704K CPU2     2   5:28  99.97% kernel{if_io_tqg_2}
    2 root        -60    -     0B    64K WAIT     1   1:26   1.03% clock{clock (0)}
36345 unbound      20    0   278M   217M kqread   0   0:21   0.68% unbound{unbound}
    0 root        -60    -     0B   704K -        1   0:24   0.24% kernel{if_io_tqg_1}
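
For anyone wanting to catch this condition before the box stops responding, the per-thread rows above can be scanned for a thread pinned near 100% WCPU. This is only a minimal sketch: the sample text is copied from the output in this post, while the function name `busy_threads` and the 90% threshold are illustrative assumptions, not anything OPNsense provides.

```python
import re

# Per-thread rows copied from the top output in this post.
TOP_SAMPLE = """\
  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
    0 root        -60    -     0B   704K CPU2     2   5:28  99.97% kernel{if_io_tqg_2}
    2 root        -60    -     0B    64K WAIT     1   1:26   1.03% clock{clock (0)}
36345 unbound      20    0   278M   217M kqread   0   0:21   0.68% unbound{unbound}
    0 root        -60    -     0B   704K -        1   0:24   0.24% kernel{if_io_tqg_1}
"""

# A process row starts with a numeric PID and a user name; the WCPU
# percentage is followed by the COMMAND column, which may contain spaces.
ROW = re.compile(r"^\s*\d+\s+\S+\s.*?\s(\d+(?:\.\d+)?)%\s+(\S.*)$")


def busy_threads(top_text, threshold=90.0):
    """Return (command, wcpu) pairs whose WCPU is at or above threshold.

    The name and the default threshold are hypothetical choices for
    this sketch.
    """
    hits = []
    for line in top_text.splitlines():
        match = ROW.match(line)
        if match and float(match.group(1)) >= threshold:
            hits.append((match.group(2).strip(), float(match.group(1))))
    return hits


if __name__ == "__main__":
    for command, wcpu in busy_threads(TOP_SAMPLE):
        print(f"{command} is at {wcpu}% WCPU")
```

Header lines and the per-CPU summary lines do not start with a numeric PID, so they are skipped rather than misread as threads.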

I have not found a way to recover from this other than rebooting the firewall.  After rebooting the firewall it became unresponsive again within 2 hours.  I rebooted the firewall again and it was fine for another ~3 days and then the same problem occurred.  I upgraded to 25.1.3 as soon as it came out with hopes it would resolve my problem but it did not.

I've Googled around and not found a definitive answer, but I did find this post, https://www.reddit.com/r/PFSENSE/comments/1ags2z6/pfsense_locks_after_a_few_days_routes_traffic_but/, which describes a very similar problem, although on pfSense with different software versions and with no clear solution other than 'patched'.

I did not see this in 24.7.  Does anyone have some ideas on what I could look at next to help diagnose and resolve this?  Any help is greatly appreciated.
Title: Re: 25.1 high load causing routing and services failures
Post by: meyergru on March 18, 2025, 12:01:19 PM
Which NIC hardware?
Title: Re: 25.1 high load causing routing and services failures
Post by: ze0Ood0O on March 18, 2025, 12:14:46 PM
Quote from: meyergru on March 18, 2025, 12:01:19 PM
Which NIC hardware?
Thanks for the reply!  This firewall has the Intel(R) I210 NICs.

# sysctl -a | grep -E 'dev.(igb|ix|em).*.%desc:'
dev.igb.3.%desc: Intel(R) I210 Flashless (Copper)
dev.igb.2.%desc: Intel(R) I210 Flashless (Copper)
dev.igb.1.%desc: Intel(R) I210 Flashless (Copper)
dev.igb.0.%desc: Intel(R) I210 Flashless (Copper)
Title: Re: 25.1 high load causing routing and services failures
Post by: meyergru on March 18, 2025, 12:27:23 PM
I had hangs like that because my hardware could not handle ASPM correctly. After disabling that in the BIOS, the problem went away.
Title: Re: 25.1 high load causing routing and services failures
Post by: ze0Ood0O on March 18, 2025, 01:20:07 PM
Quote from: meyergru on March 18, 2025, 12:27:23 PM
I had hangs like that because my hardware could not handle ASPM correctly. After disabling that in the BIOS, the problem went away.
This system runs coreboot as its BIOS, and I am not sure how to disable ASPM in coreboot, so I tried to disable ASPM via the tunables section of OPNsense instead:
System -> Settings -> Tunables
Tunable: hw.pci.enable_aspm
Value: 0
Hit apply and then rebooted the firewall.  Currently trying to verify if ASPM is disabled or not.
Title: Re: 25.1 high load causing routing and services failures
Post by: ze0Ood0O on March 18, 2025, 09:36:02 PM
Disabling ASPM via the Tunables did not seem to disable ASPM on the interfaces:
igb0@pci0:1:0:0: class=0x020000 rev=0x03 hdr=0x00 vendor=0x8086 device=0x157b subvendor=0x8086 subdevice=0x0000
    vendor     = 'Intel Corporation'
    device     = 'I210 Gigabit Network Connection'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 32, base 0x91000000, size 131072, enabled
    bar   [18] = type I/O Port, range 32, base 0x1000, size 32, enabled
    bar   [1c] = type Memory, range 32, base 0x91020000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit, vector masks
    cap 11[70] = MSI-X supports 5 messages, enabled
                 Table in map 0x1c[0x0], PBA in map 0x1c[0x2000]
    cap 10[a0] = PCI-Express 2 endpoint max data 128(512) FLR RO NS
                 max read 512
                 link x1(x1) speed 2.5(2.5) ASPM L1(L0s/L1)
    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 0 corrected
    ecap 0003[140] = Serial 1 00e067ffff22f83c
    ecap 0017[1a0] = TPH Requester 1
The system hung again with that thread using 100% system CPU, so I have reinstalled 24.7 and restored from backup.  If anyone knows how to disable ASPM via coreboot or some other way in FreeBSD, I would love to try it and see if that resolves my problems so I can upgrade to 25.
Title: Re: 25.1 high load causing routing and services failures
Post by: newsense on March 18, 2025, 09:47:58 PM
Reporting -> Settings

Is RRD running? Try turning it off. Also reset the RRD and NetFlow data.
Title: Re: 25.1 high load causing routing and services failures
Post by: meyergru on March 19, 2025, 10:03:33 AM
For me, the device looks like:

igc0@pci0:1:0:0:        class=0x020000 rev=0x04 hdr=0x00 vendor=0x8086 device=0x125c subvendor=0x8086 subdevice=0x0000
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller I226-V'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 32, base 0x80a00000, size 1048576, enabled
    bar   [1c] = type Memory, range 32, base 0x80b00000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit, vector masks
    cap 11[70] = MSI-X supports 5 messages, enabled
                 Table in map 0x1c[0x0], PBA in map 0x1c[0x2000]
    cap 10[a0] = PCI-Express 2 endpoint max data 256(512) FLR RO NS
                 max read 512
                 link x1(x1) speed 5.0(5.0) ASPM disabled(L1)
    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 0 corrected
    ecap 0003[140] = Serial 1 60beb4ffff16a800
    ecap 0018[1c0] = LTR 1
    ecap 001f[1f0] = Precision Time Measurement 1
    ecap 001e[1e0] = L1 PM Substates 1

IDK how to force ASPM off, though. Did you also try dev.igb.X.eee_disabled=1 (https://forum.opnsense.org/index.php?msg=23591)?
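
The two pciconf listings in this thread differ in the ASPM field of the "link" line, and that comparison can be automated. The sketch below is a rough illustration only: it reads the value before the parentheses as the active setting and the value inside them as what the link supports, which is how the field is being read in this thread. The sample text is trimmed from the two listings above, and the helper name `aspm_states` is hypothetical.

```python
import re

# Trimmed from the two pciconf listings in this thread (igb0 from the
# affected firewall, igc0 from the machine with ASPM disabled).
PCICONF_SAMPLE = """\
igb0@pci0:1:0:0: class=0x020000 rev=0x03 hdr=0x00 vendor=0x8086 device=0x157b subvendor=0x8086 subdevice=0x0000
    cap 10[a0] = PCI-Express 2 endpoint max data 128(512) FLR RO NS
                 max read 512
                 link x1(x1) speed 2.5(2.5) ASPM L1(L0s/L1)
igc0@pci0:1:0:0:        class=0x020000 rev=0x04 hdr=0x00 vendor=0x8086 device=0x125c subvendor=0x8086 subdevice=0x0000
    cap 10[a0] = PCI-Express 2 endpoint max data 256(512) FLR RO NS
                 max read 512
                 link x1(x1) speed 5.0(5.0) ASPM disabled(L1)
"""

DEVICE = re.compile(r"^(\w+)@pci")            # e.g. "igb0@pci0:1:0:0:"
ASPM = re.compile(r"ASPM\s+(\S+?)\((\S+?)\)")  # e.g. "ASPM L1(L0s/L1)"


def aspm_states(pciconf_text):
    """Map device name -> (active ASPM setting, supported ASPM states).

    Assumes the text looks like the `pciconf -lbcevV` output shown in
    this thread; the function name is made up for this example.
    """
    states = {}
    current = None
    for line in pciconf_text.splitlines():
        device = DEVICE.match(line)
        if device:
            current = device.group(1)
            continue
        aspm = ASPM.search(line)
        if aspm and current is not None:
            states[current] = (aspm.group(1), aspm.group(2))
    return states


if __name__ == "__main__":
    for name, (active, supported) in aspm_states(PCICONF_SAMPLE).items():
        flag = "ok" if active == "disabled" else "ASPM still active"
        print(f"{name}: active={active} supported={supported} ({flag})")
```

Run against saved pciconf output after a reboot, something like this would show at a glance whether a tunable or firmware change actually altered the active setting.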
Title: Re: 25.1 high load causing routing and services failures
Post by: ze0Ood0O on March 19, 2025, 12:58:12 PM
Thanks for the suggestion. I added that and rebooted, but still got the same results:
# sysctl -a | grep igb.2 | grep eee
dev.igb.2.eee_control: 1
# pciconf -lbcevV igb2@pci0:3:0:0
igb2@pci0:3:0:0: class=0x020000 rev=0x03 hdr=0x00 vendor=0x8086 device=0x157b subvendor=0x8086 subdevice=0x0000
    vendor     = 'Intel Corporation'
    device     = 'I210 Gigabit Network Connection'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 32, base 0x91200000, size 131072, enabled
    bar   [18] = type I/O Port, range 32, base 0x3000, size 32, enabled
    bar   [1c] = type Memory, range 32, base 0x91220000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit, vector masks
    cap 11[70] = MSI-X supports 5 messages, enabled
                 Table in map 0x1c[0x0], PBA in map 0x1c[0x2000]
    cap 10[a0] = PCI-Express 2 endpoint max data 128(512) FLR RO NS
                 max read 512
                 link x1(x1) speed 2.5(2.5) ASPM L1(L0s/L1)
    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 0 corrected
    ecap 0003[140] = Serial 1 00e067ffff22f83e
    ecap 0017[1a0] = TPH Requester 1