Random crash/freeze CPU 100% [kernel{if_io_tqg_N}]

Started by angled_whacking924, April 22, 2025, 11:26:32 PM

Hi,

I'm running OPNsense 25.1.5_5-amd64 and have an issue which seems to have appeared from nowhere. Until this, the router has been running seamlessly for over a year.

Randomly, every few hours or days, the router slows down or becomes unresponsive, and the only fix is a hard reset (physically pressing the power button). Rebooting from the CLI or GUI does nothing (though it plays the little shutdown tune).

When I've managed to get onto the web GUI, I've found that [kernel{if_io_tqg_N}] is at 100% CPU load. On a few occasions more than one instance of [kernel{if_io_tqg_N}] has been at 100% CPU. N seems to change: it's been 0, 1, 2 and 3 at various times, and again this seems to be random.

Initially I thought the LAN/WAN was being flooded, but I can't see anything obvious in the logs. I've also done a packet capture and nothing stood out.

I'm leaning towards a hardware failure but wanted to see if anyone else had experienced this. I can find very little about [kernel{if_io_tqg_N}] both on here and Google.

Thanks in advance!

Still no closer to identifying what is causing this issue.

There is very little information online about what [kernel{if_io_tqg_N}] is. ChatGPT thinks it's the network queue and that this is being caused by a packet storm or loop.

Things I've tried to diagnose this.

Every piece of network hardware has been removed and re-added one by one, with no single device being the obvious culprit.

Cleared all firewall rules, no effect.

Network switch and WiFi ap factory reset, no effect.

In some cases I've managed to identify the precise time [kernel{if_io_tqg_N}] crashed; however, there is nothing obvious in the OPNsense logs.

I've done packet captures on both WAN and LAN at the time of the crash and looked at them in Wireshark. Nothing stands out, no excessive load or packet storms.

Factory reset and even fresh reinstall of opnsense, no effect.

New NIC in case it was a hardware failure, no effect.
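
For anyone wanting to double-check the "no packet storm" conclusion without eyeballing Wireshark, here is a minimal Python sketch that counts packets per second in a capture file. It assumes the capture is in classic pcap format (not pcapng); the function names and the threshold are my own, purely for illustration.

```python
import struct
from collections import Counter

PCAP_HEADER_LEN = 24   # size of the classic pcap global header
RECORD_HEADER_LEN = 16  # ts_sec, ts_usec, incl_len, orig_len


def packets_per_second(pcap_bytes):
    """Count packets per whole second in a classic (non-pcapng) capture."""
    if len(pcap_bytes) < PCAP_HEADER_LEN:
        raise ValueError("too short to be a pcap file")
    magic = struct.unpack_from("<I", pcap_bytes, 0)[0]
    if magic == 0xA1B2C3D4:
        endian = "<"   # file written in little-endian byte order
    elif magic == 0xD4C3B2A1:
        endian = ">"   # file written in big-endian byte order
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")

    counts = Counter()
    offset = PCAP_HEADER_LEN
    while offset + RECORD_HEADER_LEN <= len(pcap_bytes):
        ts_sec, _ts_usec, incl_len, _orig_len = struct.unpack_from(
            endian + "IIII", pcap_bytes, offset
        )
        counts[ts_sec] += 1
        # Skip the record header plus the captured bytes of this packet.
        offset += RECORD_HEADER_LEN + incl_len
    return counts


def busiest_seconds(counts, threshold):
    """Return (second, packet_count) pairs at or above the threshold."""
    return sorted((sec, n) for sec, n in counts.items() if n >= threshold)
```

Read the capture with open(path, "rb").read() and pass the bytes in. If busiest_seconds() comes back empty around the time of the hang, even with a fairly low threshold, that supports the view that this is not a flood.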

I feel like this issue started after one of the system updates but I can't be sure. Maybe it's a driver or firmware issue?

Does anyone have any ideas?!

Don't torture yourself, downgrade. There are numerous problems with 25. It's a broken release as is FreeBSD 13 in general. It's not worth anyone's while to try and wrestle with its problems until the dev team smartens up and stops mimicking Linux.

Ah interesting! I wasn't aware of that or the stuff with BSD. I'll do some more research around that. I only use it at home so it's not a major issue, just very irritating!

OPNsense 25.1 is based on FreeBSD 14.2 - which is about as rock solid as it gets on one hundred servers I happen to manage. Don't bother falling for this unsubstantiated BS.
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

So with that cleared up, OP, start going at the problem like any other. Narrow it down and work methodically.
Start with posting your hardware and setup, services used.
Big NOTE: If you use Realtek NICs, I'll be bailing out sharpish. Not worth the time. Use supported hardware. Using the vendor driver helps, but only so much.
Enable SSH so you can begin your diagnostics. Start with dmesg. Anything there?

Quote from: cookiemonster on May 23, 2025, 05:43:36 PM
So with that cleared up, OP, start going at the problem like any other. Narrow it down and work methodically.
Start with posting your hardware and setup, services used.
Big NOTE: If you use Realtek NICs, I'll be bailing out sharpish. Not worth the time. Use supported hardware. Using the vendor driver helps, but only so much.
Enable SSH so you can begin your diagnostics. Start with dmesg. Anything there?

Thank you, that is appreciated.
dmesg (summarised):
CPU: Intel Celeron J1900 (4 cores, no Hyper-Threading), microcode updated from 0x813 to 0x838
aesni0: No AES or SHA support.
Physical RAM: 8 GB (8192 MB) Available after boot: ~7.5 GB
ZFS versions:
Filesystem: v5
Pool: feature flags (v5000)
Boot pool mounted from zroot/ROOT/default
Disk: SanDisk U100 64GB SSD, SATA2 link speed (300 MB/s), NCQ enabled.
Intel I350 Quad Port (igb0–igb3)
ns8250: UART FCR is broken
Errors noted:
WARNING: L1 data cache covers fewer APIC IDs than a core (0 < 1)
unknown: I/O range not supported
atrtc0: <AT realtime clock> port 0x70-0x77 on acpi0
atrtc0: Warning: Couldn't map I/O.
atrtc0: registered as a time-of-day clock, resolution 1.000000s
hdaa0: hdaa_audio_as_parse: Duplicate pin 0 (27) in association 1! Disabling association.

OPNsense is running on an iGel system with a J1900 CPU, 8 GB RAM and a 64 GB SSD.

Services being used:
AdGuard Home
ISC DHCPv4 for LAN DHCP

That's basically it. It's a fresh install with just AdGuard Home installed, and all DNS goes through AdGuard Home.

---<<BOOT>>---
Copyright (c) 1992-2023 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 14.2-RELEASE-p3 stable/25.1-n269769-0381600e81a4 SMP amd64
FreeBSD clang version 18.1.6 (https://github.com/llvm/llvm-project.git llvmorg-18.1.6-0-g1118c2e05e67)
VT(vga): resolution 640x480
CPU microcode: updated from 0x813 to 0x838
CPU: Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz (2000.07-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x30678  Family=0x6  Model=0x37  Stepping=8
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x41d8e3bf<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1,SSE4.2,MOVBE,POPCNT,TSCDLT,RDRAND>
  AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
  AMD Features2=0x101<LAHF,Prefetch>
  Structured Extended Features=0x2282<TSCADJ,SMEP,ERMS,NFPUSG>
  Structured Extended Features3=0xc000400<MD_CLEAR,IBPB,STIBP>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
real memory  = 8589934592 (8192 MB)
avail memory = 7916101632 (7549 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <INSYDE INSYDE>
WARNING: L1 data cache covers fewer APIC IDs than a core (0 < 1)
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
FreeBSD/SMP: 1 package(s) x 4 core(s)
random: registering fast source Intel Secure Key RNG
random: fast provider: "Intel Secure Key RNG"
random: unblocking device.
ioapic0: MADT APIC ID 2 != hw id 1
ioapic0 <Version 2.0> irqs 0-86
Launching APs: 3 2 1
random: entropy device external interface
wlan: mac acl policy registered
kbd1 at kbdmux0
WARNING: Device "spkr" is Giant locked and may be deleted before FreeBSD 15.0.
vtvga0: <VT VGA driver>
smbios0: <System Management BIOS> at iomem 0xfe120-0xfe13e
smbios0: Version: 2.7, BCD Revision: 2.7
aesni0: No AES or SHA support.
acpi0: <INSYDE INSYDE>
acpi0: Power Button (fixed)
unknown: I/O range not supported
cpu0: <ACPI CPU> on acpi0
atrtc0: <AT realtime clock> port 0x70-0x77 on acpi0
atrtc0: Warning: Couldn't map I/O.
atrtc0: registered as a time-of-day clock, resolution 1.000000s
Event timer "RTC" frequency 32768 Hz quality 0
hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff irq 8 on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 950
Event timer "HPET" frequency 14318180 Hz quality 450
Event timer "HPET1" frequency 14318180 Hz quality 440
Event timer "HPET2" frequency 14318180 Hz quality 440
attimer0: <AT timer> port 0x40-0x43,0x50-0x53 irq 0 on acpi0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pcib0: Length mismatch for 3 range: 10efffff vs 10f00000
pci0: <ACPI PCI bus> on pcib0
vgapci0: <VGA-compatible display> port 0x2050-0x2057 mem 0x90000000-0x903fffff,0x80000000-0x8fffffff irq 16 at device 2.0 on pci0
vgapci0: Boot video device
ahci0: <AHCI SATA controller> port 0x2048-0x204f,0x205c-0x205f,0x2040-0x2047,0x2058-0x205b,0x2020-0x203f mem 0x90e18000-0x90e187ff irq 19 at device 19.0 on pci0
ahci0: AHCI v1.30 with 2 3Gbps ports, Port Multiplier not supported
ahcich0: <AHCI channel> at channel 0 on ahci0
xhci0: <Intel BayTrail USB 3.0 controller> mem 0x90e00000-0x90e0ffff irq 20 at device 20.0 on pci0
xhci0: 32 bytes context size, 64-bit DMA
xhci0: Port routing mask set to 0xffffffff
usbus0 on xhci0
usbus0: 5.0Gbps Super Speed USB v3.0
sdhci_pci0: <Intel Bay Trail eMMC 4.5 Controller> mem 0x90e17000-0x90e17fff,0x90e16000-0x90e16fff irq 23 at device 23.0 on pci0
sdhci_pci0: 1 slot(s) allocated
mmc0: <MMC/SD bus> on sdhci_pci0
pci0: <encrypt/decrypt> at device 26.0 (no driver attached)
hdac0: <Intel BayTrail HDA Controller> mem 0x90e10000-0x90e13fff irq 22 at device 27.0 on pci0
pcib1: <ACPI PCI-PCI bridge> irq 16 at device 28.0 on pci0
pci1: <ACPI PCI bus> on pcib1
igb0: <Intel(R) I350 (Copper)> mem 0x90a00000-0x90afffff,0x90b0c000-0x90b0ffff irq 16 at device 0.0 on pci1
igb0: EEPROM V1.63-0 eTrack 0x80000ae6
igb0: Using 1024 TX descriptors and 1024 RX descriptors
igb0: Using 4 RX queues 4 TX queues
igb0: Using MSI-X interrupts with 5 vectors
igb0: Ethernet address: a0:36:9f:6a:8c:4c
igb0: netmap queues/slots: TX 4/1024, RX 4/1024
igb1: <Intel(R) I350 (Copper)> mem 0x90900000-0x909fffff,0x90b08000-0x90b0bfff irq 17 at device 0.1 on pci1
igb1: EEPROM V1.63-0 eTrack 0x80000ae6
igb1: Using 1024 TX descriptors and 1024 RX descriptors
igb1: Using 4 RX queues 4 TX queues
igb1: Using MSI-X interrupts with 5 vectors
igb1: Ethernet address: a0:36:9f:6a:8c:4d
igb1: netmap queues/slots: TX 4/1024, RX 4/1024
igb2: <Intel(R) I350 (Copper)> mem 0x90800000-0x908fffff,0x90b04000-0x90b07fff irq 18 at device 0.2 on pci1
igb2: EEPROM V1.63-0 eTrack 0x80000ae6
igb2: Using 1024 TX descriptors and 1024 RX descriptors
igb2: Using 4 RX queues 4 TX queues
igb2: Using MSI-X interrupts with 5 vectors
igb2: Ethernet address: a0:36:9f:6a:8c:4e
igb2: netmap queues/slots: TX 4/1024, RX 4/1024
igb3: <Intel(R) I350 (Copper)> mem 0x90700000-0x907fffff,0x90b00000-0x90b03fff irq 19 at device 0.3 on pci1
igb3: EEPROM V1.63-0 eTrack 0x80000ae6
igb3: Using 1024 TX descriptors and 1024 RX descriptors
igb3: Using 4 RX queues 4 TX queues
igb3: Using MSI-X interrupts with 5 vectors
igb3: Ethernet address: a0:36:9f:6a:8c:4f
igb3: netmap queues/slots: TX 4/1024, RX 4/1024
pcib2: <ACPI PCI-PCI bridge> irq 17 at device 28.1 on pci0
pcib3: <ACPI PCI-PCI bridge> irq 18 at device 28.2 on pci0
pci2: <ACPI PCI bus> on pcib3
xhci1: <ASMedia ASM1042A USB 3.0 controller> mem 0x90600000-0x90607fff irq 18 at device 0.0 on pci2
xhci1: 32 bytes context size, 64-bit DMA
usbus1 on xhci1
usbus1: 5.0Gbps Super Speed USB v3.0
pcib4: <ACPI PCI-PCI bridge> irq 19 at device 28.3 on pci0
pci3: <ACPI PCI bus> on pcib4
re0: <RealTek 8168/8111 B/C/CP/D/DP/E/F/G PCIe Gigabit Ethernet> port 0x1000-0x10ff mem 0x90500000-0x90500fff,0x90400000-0x90403fff irq 19 at device 0.0 on pci3
re0: Using 1 MSI-X message
re0: ASPM disabled
re0: Chip rev. 0x2c800000
re0: MAC rev. 0x00100000
miibus0: <MII bus> on re0
rgephy0: <RTL8169S/8110S/8211 1000BASE-T media interface> PHY 1 on miibus0
rgephy0:  none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow
re0: Using defaults for TSO: 65518/35/2048
re0: Ethernet address: 00:e0:c5:4d:fe:5e
re0: netmap queues/slots: TX 1/256, RX 1/256
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
acpi_button0: <Power Button> on acpi0
acpi_button1: <Sleep Button> on acpi0
acpi_tz0: <Thermal Zone> on acpi0
acpi_syscontainer0: <System Container> on acpi0
ppc0: <Parallel port> port 0x378-0x37f irq 7 on acpi0
ppc0: SMC-like chipset (ECP/EPP/PS2/NIBBLE) in COMPATIBLE mode
ppc0: FIFO with 16/16/16 bytes threshold
ppbus0: <Parallel port bus> on ppc0
lpt0: <Printer> on ppbus0
lpt0: Interrupt-driven port
ppi0: <Parallel I/O> on ppbus0
ns8250: UART FCR is broken
ns8250: UART FCR is broken
uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
ns8250: UART FCR is broken
ns8250: UART FCR is broken
uart1: <16550 or compatible> port 0x2f8-0x2ff irq 3 on acpi0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
est0: <Enhanced SpeedStep Frequency Control> on cpu0
Timecounter "TSC" frequency 1999999494 Hz quality 1000
Timecounters tick every 1.000 msec
ugen0.1: <Intel XHCI root HUB> at usbus0
ugen1.1: <(0x1b21) XHCI root HUB> at usbus1
uhub0 on usbus0
uhub0: <Intel XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
uhub1 on usbus1
uhub1: <(0x1b21) XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus1
mmc0: No compatible cards found on bus
hdacc0: <Realtek ALC662 rev3 HDA CODEC> at cad 0 on hdac0
hdaa0: <Realtek ALC662 rev3 Audio Function Group> at nid 1 on hdacc0
hdaa0: hdaa_audio_as_parse: Duplicate pin 0 (27) in association 1! Disabling association.
pcm0: <Realtek ALC662 rev3 (Rear Analog Mic)> at nid 24 on hdaa0
hdacc1: <Intel Valleyview2 HDA CODEC> at cad 2 on hdac0
hdaa1: <Intel Valleyview2 Audio Function Group> at nid 1 on hdacc1
pcm1: <Intel Valleyview2 (HDMI/DP 8ch)> at nid 4 on hdaa1
pcm2: <Intel Valleyview2 (HDMI/DP 8ch)> at nid 5 on hdaa1
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <SanDisk SSD U100 64GB 10.56.00> ACS-2 ATA SATA 3.x device
ada0: Serial Number 130817401529
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 512bytes)
ada0: Command Queueing enabled
ada0: 61057MB (125045424 512 byte sectors)
Trying to mount root from zfs:zroot/ROOT/default []...
uhub1: 4 ports with 4 removable, self powered
uhub0: 7 ports with 7 removable, self powered
ugen1.2: <CHESEN USB Keyboard> at usbus1
ukbd0 on uhub1
ukbd0: <CHESEN USB Keyboard, class 0/0, rev 1.10/1.10, addr 1> on usbus1
kbd2 at ukbd0
ugen0.2: <vendor 0x1a40 USB 2.0 Hub> at usbus0
uhub2 on uhub0
uhub2: <vendor 0x1a40 USB 2.0 Hub, class 9/0, rev 2.00/1.11, addr 1> on usbus0
uhub2: 4 ports with 4 removable, self powered
igb0: link state changed to UP
igb1: link state changed to UP
re0: link state changed to DOWN
ichsmb0: <Intel Baytrail SMBus controller> port 0x2000-0x201f mem 0x90e15000-0x90e1501f irq 18 at device 31.3 on pci0
smbus0: <System Management Bus> on ichsmb0
uhid0 on uhub1
uhid0: <CHESEN USB Keyboard, class 0/0, rev 1.10/1.10, addr 1> on usbus1
lo0: link state changed to UP
coretemp0: <CPU On-Die Thermal Sensors> on cpu0
pflog0: permanently promiscuous mode enabled
igb1: link state changed to DOWN
igb0: link state changed to DOWN
igb1: link state changed to UP
igb0: link state changed to UP
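
A boot log this long is easy to misread, so here is a rough Python sketch that pulls out just the network interfaces: driver description, MAC, queue counts and last reported link state. The regexes are written against the line formats visible in the log above and may need adjusting for other drivers; the function names are mine. Note that the log lists an onboard Realtek re0 alongside the I350 ports, with its link down.

```python
import re

# A few lines copied from the boot log above; pass the full dmesg text instead.
SAMPLE_DMESG = """\
igb0: <Intel(R) I350 (Copper)> mem 0x90a00000-0x90afffff,0x90b0c000-0x90b0ffff irq 16 at device 0.0 on pci1
igb0: Using 4 RX queues 4 TX queues
igb0: Ethernet address: a0:36:9f:6a:8c:4c
re0: <RealTek 8168/8111 B/C/CP/D/DP/E/F/G PCIe Gigabit Ethernet> port 0x1000-0x10ff mem 0x90500000-0x90500fff,0x90400000-0x90403fff irq 19 at device 0.0 on pci3
re0: Ethernet address: 00:e0:c5:4d:fe:5e
igb0: link state changed to UP
re0: link state changed to DOWN
lo0: link state changed to UP
igb0: link state changed to DOWN
igb0: link state changed to UP
"""

ATTACH_RE = re.compile(r"^([a-z]+\d+): <([^>]+)>")
QUEUE_RE = re.compile(r"^([a-z]+\d+): Using (\d+) RX queues (\d+) TX queues")
ETHER_RE = re.compile(r"^([a-z]+\d+): Ethernet address: ([0-9a-f:]{17})")
LINK_RE = re.compile(r"^([a-z]+\d+): link state changed to (UP|DOWN)")


def summarize_nics(dmesg_text):
    """Return {device: info} for every device that reported an Ethernet address."""
    devices = {}
    for raw in dmesg_text.splitlines():
        line = raw.strip()
        if (m := ATTACH_RE.match(line)):
            # Keep the first description seen for each device.
            devices.setdefault(m[1], {}).setdefault("desc", m[2])
        elif (m := QUEUE_RE.match(line)):
            devices.setdefault(m[1], {})["queues"] = (int(m[2]), int(m[3]))
        elif (m := ETHER_RE.match(line)):
            devices.setdefault(m[1], {})["mac"] = m[2]
        elif (m := LINK_RE.match(line)):
            # Later lines overwrite earlier ones, so this ends up as the last state.
            devices.setdefault(m[1], {})["link"] = m[2]
    # Only devices with a MAC address are treated as NICs (this drops lo0, uhub0, etc.).
    return {dev: info for dev, info in devices.items() if "mac" in info}


def realtek_nics(nics):
    """List NICs whose driver description mentions Realtek."""
    return [dev for dev, info in nics.items() if "realtek" in info.get("desc", "").lower()]


if __name__ == "__main__":
    for dev, info in summarize_nics(SAMPLE_DMESG).items():
        print(dev, info)
```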

Quote from: angled_whacking924 on May 23, 2025, 06:51:43 PM
Intel I350 Quad Port (igb0–igb3)
Sweet. Intel I350. Supported hardware.
Those warnings are just that, so not something to worry about.

Quote from: angled_whacking924 on April 22, 2025, 11:26:32 PM
[kernel{if_io_tqg_N}] has been at 100% CPU.
This makes me look at the io part. Maybe the disk is beginning to fail.
dmesg is only showing boot messages, not the point when things start to go haywire. Ideally we'd like to see the dmesg messages from when the problems occur. Failing that, may I suggest testing the disk. Start with a long SMART test to look for signs of wear/decay.
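
Since the box only recovers with a hard power-off, whatever dmesg shows at the time of the hang is lost. One way around that is to append periodic snapshots to a file on disk. Below is a minimal Python sketch of that idea. The command list (dmesg, top in batch mode with system threads, vmstat -i) is my assumption of what is useful here, so check the flags against the man pages on your system; the function names and the log path in the usage note are hypothetical.

```python
import os
import subprocess
import time

# Assumed diagnostic commands; adjust to taste and verify the flags locally.
DEFAULT_COMMANDS = [
    ["dmesg"],
    ["top", "-SHb", "-d", "1"],  # batch mode, system processes, threads, one display
    ["vmstat", "-i"],            # interrupt counters
]


def run_command(cmd):
    """Run one command and return its combined output, or a failure note."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=15)
        return result.stdout + result.stderr
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"<failed: {exc}>\n"


def append_snapshot(log_path, commands=DEFAULT_COMMANDS, runner=run_command, now=time.time):
    """Append a timestamped block with the output of each command to log_path."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now()))
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"===== {stamp} =====\n")
        for cmd in commands:
            log.write(f"--- {' '.join(cmd)} ---\n")
            log.write(runner(cmd))
            log.write("\n")
        # Force the data to disk so it has a chance of surviving a hard power-off.
        log.flush()
        os.fsync(log.fileno())
```

Calling something like append_snapshot("/root/hang_snapshots.log") once a minute (from cron, for example) should leave a record of the last minute or so before the freeze. The file grows without bound, so remove the job once you have caught a hang.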

It has nothing to do with the disk. It's part of iflib(4): https://man.freebsd.org/cgi/man.cgi?query=iflib&sektion=4&manpath=freebsd-release-ports
It's probably a driver/fw issue as already stated. I'd install Linux on the box and virtualize OPNsense with KVM. If it runs well then, it's a driver issue.
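
To track which of these iflib threads is spinning and when, you could filter saved top output for them. Here is a small Python sketch, assuming the thread shows up in the form kernel{if_io_tqg_N}, optionally wrapped in square brackets as in the GUI. The sample lines are made up for illustration and are not taken from the OP's system; the function name and the 90% cut-off are my own choices.

```python
import re

# Matches e.g. "99.85% kernel{if_io_tqg_2}" or "100.00% [kernel{if_io_tqg_0}]".
THREAD_RE = re.compile(
    r"(?P<wcpu>\d+(?:\.\d+)?)%\s+\[?kernel\{(?P<name>if_io_tqg_\d+)\}\]?"
)

# Hypothetical sample in the general shape of FreeBSD top thread output.
SAMPLE_TOP = """\
    0 root        -60    -     0B  1568K CPU2     2  41:07  99.85% kernel{if_io_tqg_2}
    0 root        -60    -     0B  1568K -        0   0:12   0.30% kernel{if_io_tqg_0}
   11 root        187 ki31     0B    64K RUN      1  50:01  98.10% idle{idle: cpu1}
"""


def busy_iflib_threads(top_text, min_cpu=90.0):
    """Return (thread_name, cpu_percent) for if_io_tqg threads at or above min_cpu."""
    hits = []
    for line in top_text.splitlines():
        m = THREAD_RE.search(line)
        if m and float(m["wcpu"]) >= min_cpu:
            hits.append((m["name"], float(m["wcpu"])))
    return hits


if __name__ == "__main__":
    print(busy_iflib_threads(SAMPLE_TOP))
```

As far as I understand it, iflib runs one of these I/O task queue threads per CPU, which would fit N ranging from 0 to 3 on a four-core J1900. Knowing which N is busy, and matching it against the interface that was passing traffic at the time, may help narrow down which port or driver is involved.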

Quote from: grind on May 24, 2025, 10:41:48 AM
It has nothing to do with the disk. It's part of iflib(4): https://man.freebsd.org/cgi/man.cgi?query=iflib&sektion=4&manpath=freebsd-release-ports
It's probably a driver/fw issue as already stated. I'd install Linux on the box and virtualize OPNsense with KVM. If it runs well then, it's a driver issue.

I ran the full SMART test and nothing was flagged up.

I might give this virtual machine idea a go. I've not trawled the release notes to see if there have been any driver/fw changes, but this seems to be an issue that's come about from an update somewhere. Annoyingly, I do automatic updates, so it's hard to be precise about which one caused it.

I actually disconnected my UniFi AP and used a GL.iNet for a bit. OPNsense ran with no issues for over two days until I plugged the UniFi back in, at which point it crashed within a few hours. I thought, ah-ha, this is the source of the issue. I rebooted and removed the UniFi AP again, and it crashed again within 3 hours 😭 So glad I don't use this in a commercial environment!!

Also check the latest firmware for your Intel card.

Quote from: angled_whacking924 on May 27, 2025, 11:22:09 AM
Annoyingly, I do automatic updates, so it's hard to be precise about which one caused it.
This is precisely why I don't do this on a system like OPNsense. Automatic updates, as you write, I would maybe consider, as there shouldn't be breaking changes. But I wouldn't do automatic upgrades. I don't know whether the two can be chosen separately, so I could be misrepresenting this particular situation.
And that is not to say there is an upgrade that caused this, only a comment.

Back to this.
Quote from: angled_whacking924 on May 23, 2025, 06:51:43 PM
CPU: Intel Celeron J1900 (4 cores, no Hyper-Threading), microcode updated from 0x813 to 0x838
Maybe check for any online comments about this particular microcode revision?