Recurring Kernel Panics - Fatal trap 12: page fault while in kernel mode

Started by rafaelreisr, September 08, 2022, 03:40:07 PM

Hey everyone.

I've been wrestling with a fresh install of OPNsense in a KVM/QEMU environment. The host is Ubuntu 22.04 Jammy.
The device is a Topton N5105 Celeron with 4 Intel igc 2.5Gbit NICs and 2x4GB of DDR4.
One NIC is reserved for the Ubuntu host; the other three are passed through to the VM with IOMMU.

This has been going on for weeks. I get random kernel panics and VM reboots every 15 to 20 hours. The host is rock solid with over a week of uptime.

What I have tried so far:

1 - BIOS mode instead of UEFI (no change).
2 - Adding nopti to the kernel options on the host, since it looked like an ACPI / mitigations issue (see the GRUB sketch below this list). No change.
3 - Installing the QEMU guest agent plugin (os-qemu-guest-agent) in OPNsense. No change.
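
For reference, item 2 and the IOMMU passthrough both come down to host kernel parameters; on Ubuntu that typically means /etc/default/grub, roughly like this (the exact option set here is an assumption, merge with whatever you already have):

# /etc/default/grub on the Ubuntu host (sketch; keep your existing options)
# intel_iommu=on / iommu=pt enable the IOMMU for NIC passthrough,
# nopti disables the page-table-isolation mitigation from item 2.
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on iommu=pt nopti"

sudo update-grub      # regenerate grub.cfg
sudo reboot
cat /proc/cmdline     # confirm the options took effect after reboot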

It is similar to https://forum.opnsense.org/index.php?topic=28302.0 and https://forum.opnsense.org/index.php?topic=28422.0

I have found similar reports on bare-metal installations and on other hypervisors, so the issue seems fairly widespread.

The usual replies are that it is either a HW issue (unlikely due to it being so common), or that it should be solved in 13.1 / 22.7, which is exactly what my fresh installation is running.

Edit: Crash report https://forum.opnsense.org/index.php?action=post;quote=145955;topic=30230.0;last_msg=145955


Quote from: rafaelreisr on September 08, 2022, 03:40:07 PM
The usual replies are that it is either a HW issue (unlikely due to it being so common)

Loads of bad RAM out there. Have you tested yours? https://www.memtest86.com/
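
If taking the box offline for a full MemTest86 run is awkward, a rough in-place check from the Ubuntu host is also possible with the memtester package (the size and pass count below are just an example):

sudo apt install memtester
sudo memtester 2048M 3    # lock and test ~2 GB for 3 passes; leave headroom for the running system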

Also similar to https://forum.opnsense.org/index.php?topic=29845.0?

Are you virtualizing the VM CPU as KVM/Qemu or using host? Have you tried not passing through the network adapter and using VirtIO instead, which should handle 2.5g fine? Either of those could narrow down the issue.

Starting to suspect something in this hardware combo is giving the underlying FreeBSD base fits. If virtualizing the 2.5G NIC or the CPU (or both in combination) stops the FreeBSD kernel panics, that should point in the general direction of an answer. A RAM issue, by contrast, would be expected to affect both the host and the VM.
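
For what it's worth, in libvirt the VirtIO switch is just a change in the domain XML (virsh edit <vm-name>), roughly along these lines; the bridge name is a placeholder, not your actual config:

<!-- replace the <hostdev> passthrough entries with a bridged virtio NIC -->
<interface type='bridge'>
  <source bridge='br0'/>     <!-- host bridge name is a placeholder -->
  <model type='virtio'/>     <!-- paravirtualized NIC; plenty of headroom for 2.5G -->
</interface>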

As expected, it crashed again. I have the crash report:

System Information:
User-Agent Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.6.1 Safari/605.1.15
FreeBSD 13.1-RELEASE-p1 stable/22.7-n250224-b668033f066 SMP amd64
OPNsense 22.7.2 412c0b79c
Plugins os-dmidecode-1.1_1 os-qemu-guest-agent-1.1 os-telegraf-1.12.5 os-upnp-1.4_2 os-wireguard-1.11
Time Thu, 08 Sep 2022 14:36:24 -0300
OpenSSL 1.1.1q  5 Jul 2022
Python 3.9.13
PHP 8.0.22


dmesg.boot:
Copyright (c) 1992-2021 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 13.1-RELEASE-p1 stable/22.7-n250224-b668033f066 SMP amd64
FreeBSD clang version 13.0.0 (git@github.com:llvm/llvm-project.git llvmorg-13.0.0-0-gd7b669b3a303)
VT(efifb): resolution 1024x768
CPU: Intel(R) Celeron(R) N5105 @ 2.00GHz (1996.78-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x906c0  Family=0x6  Model=0x9c  Stepping=0
  Features=0x1f83fbff
  Features2=0xcff8a223
  AMD Features=0x28100800
  AMD Features2=0x101
  Structured Extended Features=0x21940283
  Structured Extended Features2=0x18400124
  Structured Extended Features3=0xac000400
  XSAVE Features=0xf
  IA32_ARCH_CAPS=0x6b
  AMD Extended Feature Extensions ID EBX=0x100d000
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID,VID,PostIntr
Hypervisor: Origin = "KVMKVMKVM"
real memory  = 2147483648 (2048 MB)
avail memory = 2032087040 (1937 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table:
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s)
random: registering fast source Intel Secure Key RNG
random: fast provider: "Intel Secure Key RNG"
random: unblocking device.
ioapic0  irqs 0-23
Launching APs: 1
random: entropy device external interface
wlan: mac acl policy registered
kbd1 at kbdmux0
WARNING: Device "spkr" is Giant locked and may be deleted before FreeBSD 14.0.
kvmclock0:
Timecounter "kvmclock" frequency 1000000000 Hz quality 975
kvmclock0: registered as a time-of-day clock, resolution 0.000001s
efirtc0:
efirtc0: registered as a time-of-day clock, resolution 1.000000s
smbios0:  at iomem 0x7f922000-0x7f92201e
smbios0: Version: 2.8, BCD Revision: 2.8
aesni0:
acpi0:
acpi0: Power Button (fixed)
cpu0:  on acpi0
atrtc0:  port 0x70-0x77 irq 8 on acpi0
atrtc0: registered as a time-of-day clock, resolution 1.000000s
Event timer "RTC" frequency 32768 Hz quality 0
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x608-0x60b on acpi0
pcib0:  port 0xcf8-0xcff on acpi0
pci0:  on pcib0
vgapci0:  mem 0xc0000000-0xc0ffffff,0xc2119000-0xc2119fff at device 1.0 on pci0
vgapci0: Boot video device
pcib1:  mem 0xc2118000-0xc2118fff irq 22 at device 2.0 on pci0
pci1:  on pcib1
pcib2:  mem 0xc2000000-0xc20000ff irq 22 at device 0.0 on pci1
pci2:  on pcib2
hdac0:  mem 0xc1e00000-0xc1e03fff irq 23 at device 1.0 on pci2
pcib3:  mem 0xc2117000-0xc2117fff irq 22 at device 2.1 on pci0
pci3:  on pcib3
virtio_pci0:  mem 0xc1c00000-0xc1c00fff,0x800000000-0x800003fff irq 22 at device 0.0 on pci3
pcib4:  mem 0xc2116000-0xc2116fff irq 22 at device 2.2 on pci0
pci4:  on pcib4
virtio_pci1:  mem 0xc1a00000-0xc1a00fff,0x800100000-0x800103fff irq 22 at device 0.0 on pci4
vtblk0:  on virtio_pci1
vtblk0: 40960MB (83886080 512 byte sectors)
pcib5:  mem 0xc2115000-0xc2115fff irq 22 at device 2.3 on pci0
pci5:  on pcib5
igc0:  mem 0xc1800000-0xc18fffff,0xc1900000-0xc1903fff irq 22 at device 0.0 on pci5
igc0: Using 1024 TX descriptors and 1024 RX descriptors
igc0: Using 2 RX queues 2 TX queues
igc0: Using MSI-X interrupts with 3 vectors
igc0: Ethernet address: 7c:2b:e1:13:00:5a
igc0: netmap queues/slots: TX 2/1024, RX 2/1024
pcib6:  mem 0xc2114000-0xc2114fff irq 22 at device 2.4 on pci0
pci6:  on pcib6
igc1:  mem 0xc1600000-0xc16fffff,0xc1700000-0xc1703fff irq 22 at device 0.0 on pci6
igc1: Using 1024 TX descriptors and 1024 RX descriptors
igc1: Using 2 RX queues 2 TX queues
igc1: Using MSI-X interrupts with 3 vectors
igc1: Ethernet address: 7c:2b:e1:13:00:5b
igc1: netmap queues/slots: TX 2/1024, RX 2/1024
pcib7:  mem 0xc2113000-0xc2113fff irq 22 at device 2.5 on pci0
pci7:  on pcib7
igc2:  mem 0xc1400000-0xc14fffff,0xc1500000-0xc1503fff irq 22 at device 0.0 on pci7
igc2: Using 1024 TX descriptors and 1024 RX descriptors
igc2: Using 2 RX queues 2 TX queues
igc2: Using MSI-X interrupts with 3 vectors
igc2: Ethernet address: 7c:2b:e1:13:00:5c
igc2: netmap queues/slots: TX 2/1024, RX 2/1024
pcib8:  mem 0xc2112000-0xc2112fff irq 22 at device 2.6 on pci0
pci8:  on pcib8
virtio_pci2:  mem 0x800200000-0x800203fff irq 22 at device 0.0 on pci8
vtballoon0:  on virtio_pci2
pcib9:  mem 0xc2111000-0xc2111fff irq 22 at device 2.7 on pci0
pci9:  on pcib9
xhci0:  mem 0xc1000000-0xc1003fff irq 22 at device 0.0 on pci9
xhci0: 32 bytes context size, 64-bit DMA
usbus0 on xhci0
usbus0: 5.0Gbps Super Speed USB v3.0
isab0:  at device 31.0 on pci0
isa0:  on isab0
ahci0:  port 0xe040-0xe05f mem 0xc2110000-0xc2110fff irq 16 at device 31.2 on pci0
ahci0: AHCI v1.00 with 6 1.5Gbps ports, Port Multiplier not supported
ahcich0:  at channel 0 on ahci0
ahcich1:  at channel 1 on ahci0
ahcich2:  at channel 2 on ahci0
ahcich3:  at channel 3 on ahci0
ahcich4:  at channel 4 on ahci0
ahcich5:  at channel 5 on ahci0
acpi_syscontainer0:  on acpi0
acpi_syscontainer1:  port 0xcd8-0xce3 on acpi0
acpi_syscontainer2:  port 0x620-0x62f on acpi0
acpi_syscontainer3:  port 0xcc0-0xcd7 on acpi0
uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
uart0: console (115200,n,8,1)
atkbdc0:  port 0x60,0x64 irq 1 on acpi0
atkbd0:  irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
psm0:  irq 12 on atkbdc0
psm0: [GIANT-LOCKED]
WARNING: Device "psm" is Giant locked and may be deleted before FreeBSD 14.0.
psm0: model IntelliMouse Explorer, device ID 4
attimer0:  at port 0x40 on isa0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
Timecounters tick every 10.000 msec
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
hdacc0:  at cad 0 on hdac0
hdaa0:  at nid 1 on hdacc0
pcm0:  at nid 3 and 5 on hdaa0
ugen0.1: <(0x1b36) XHCI root HUB> at usbus0
uhub0 on usbus0
uhub0: <(0x1b36) XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
Trying to mount root from zfs:zroot/ROOT/default []...
Root mount waiting for: usbus0
uhub0: 30 ports with 30 removable, self powered
Dual Console: Video Primary, Serial Secondary


/var/crash/info.0:
Dump header from device: /dev/vtbd0p3
  Architecture: amd64
  Architecture Version: 4
  Dump Length: 72192
  Blocksize: 512
  Compression: none
  Dumptime: 2022-09-08 12:51:50 -0300
  Hostname: OPNsense.localdomain
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 13.1-RELEASE-p1 stable/22.7-n250224-b668033f066 SMP
  Panic String: page fault
  Dump Parity: 3509545778
  Bounds: 0
  Dump Status: good


/var/crash/textdump.tar.0: attached due to the size restriction. Here is the interesting bit:



Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address = 0xfffffc009bed98de
fault code = supervisor write data, page not present
instruction pointer = 0x20:0xffffffff812267c0
stack pointer         = 0x28:0xfffffe0096da9b28
frame pointer         = 0x28:0xfffffe0096da9c20
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = resume, IOPL = 0
current process = 75225 (telegraf)
trap number = 12
panic: page fault
cpuid = 1
time = 1662652310
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0096da98e0
vpanic() at vpanic+0x17f/frame 0xfffffe0096da9930
panic() at panic+0x43/frame 0xfffffe0096da9990
trap_fatal() at trap_fatal+0x385/frame 0xfffffe0096da99f0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0096da9a50
calltrap() at calltrap+0x8/frame 0xfffffe0096da9a50
--- trap 0xc, rip = 0xffffffff812267c0, rsp = 0xfffffe0096da9b28, rbp = 0xfffffe0096da9c20 ---
lapic_handle_timer() at lapic_handle_timer/frame 0xfffffe0096da9c20
pmap_copy() at pmap_copy+0x561/frame 0xfffffe0096da9cc0
vmspace_fork() at vmspace_fork+0xc8a/frame 0xfffffe0096da9d40
fork1() at fork1+0x42a/frame 0xfffffe0096da9da0
sys_fork() at sys_fork+0x54/frame 0xfffffe0096da9e00
amd64_syscall() at amd64_syscall+0x10c/frame 0xfffffe0096da9f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0096da9f30
--- syscall (2, FreeBSD ELF64, sys_fork), rip = 0x485ad6, rsp = 0xc0003251c8, rbp = 0xc0003252b0 ---
KDB: enter: panic
panic.txt: page fault
version.txt: FreeBSD 13.1-RELEASE-p1 stable/22.7-n250224-b668033f066 SMP
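
For anyone who wants the full dump rather than this excerpt, the textdump is a plain tar archive on the firewall; something like this pulls it apart (the member names are the usual FreeBSD textdump files):

cd /var/crash
tar -xvf textdump.tar.0    # typically ddb.txt, msgbuf.txt, panic.txt, version.txt, config.txt
less ddb.txt               # the backtrace shown above lives here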

Quote from: bartjsmit on September 08, 2022, 04:49:59 PM
Loads of bad RAM out there. Have you tested yours? https://www.memtest86.com/

It's on the 3rd pass of memtest86 with no errors yet. I highly doubt it is RAM: the modules are brand-new Micron parts from a reputable brand, not no-name sticks. Also, the host is rock solid.



Quote from: Vesalius on September 08, 2022, 05:11:38 PM
Also, similar to https://forum.opnsense.org/index.php?topic=29845.0?

Are you virtualizing the VM CPU as KVM/Qemu or using host? Have you tried not passing through the network adapter and using VirtIO instead, which should handle 2.5g fine? Either of those could narrow down the issue.


The VM settings XML is attached in the original post. The CPU is host-passthrough. I could try emulating the CPU and NICs, although I feel it defeats the purpose of leveraging the hardware.
My best suspicion is a poorly coded BIOS; these Chinese boards could have a poor microcode implementation. Although that would presumably also show up on Linux, it seems to only cause problems on BSD.

Finally, I can also try a bare-metal install just for testing. That is not my intended deployment, since it would waste a lot of the hardware's potential, but I'd be happy to contribute toward a potential fix.

You would lose little to nothing by virtualizing the CPU and NICs, and it could be just a temporary step to troubleshoot whether the host NIC's or CPU's direct interaction with FreeBSD is the issue. It's more about systematically checking off the boxes of what might be the cause.

VirtIO on many hosts can do 10-20G of throughput and should have no issues with 2.5G.

I just installed it bare metal. I imaged and saved the previous Ubuntu / KVM installation as a backup.

I did the recommended setup steps, and I'll now leave it running for a few days and report back.

If it runs fine, we will know for sure it is virtualization-related. Then I'll move on to the suggested VM troubleshooting.

Might be worth looking at Proxmox and VMware ESXi as alternative hypervisors.

An update to the thread with the current troubleshooting status:

  • Bare-metal installation - Worked perfectly, with no crashes for 36+ hours.
  • Changed the guest VM chipset from Q35 to i440fx - Although the OPNsense and FreeBSD docs say Q35 is stable, I tried rolling back to the legacy chipset. The most important difference is that it only supports legacy PCI, not PCIe, so passthrough was done over PCI. Performance was fine in benchmarks (I could reach 2.5Gbit), but it also didn't work: the same Fatal trap 12 kernel panic as usual.
  • Restored Q35 and moved the CPU from host-passthrough to host-model (see the XML sketch below this list) - Per the KVM docs, host-model tries to enable most of the host chip's capabilities while keeping a layer of compatibility. For the N5105 (Jasper Lake), QEMU set the architecture to SnowRidge (Atom). It enabled most of the original CPU flags, and although there was a performance penalty, it was minor: I could still reach the same NIC speeds, but with about 15% higher per-core CPU usage. It didn't work, though; it panicked around the same time (16 hours in), but with a different result, trap number 1. Crash logs attached below.
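
For reference, the CPU change amounted to this swap in the domain XML (a sketch; any attributes beyond the mode are whatever virt-manager generated):

<!-- before: full passthrough of the N5105 -->
<cpu mode='host-passthrough' check='none'/>

<!-- after: host-model, which QEMU resolved to a SnowRidge-like Atom model -->
<cpu mode='host-model' check='partial'/>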

Next steps:

  • Fully emulate the CPU with a different architecture / manually enable certain flags, which will be a pain (rough sketch after this list).
  • Remove passthrough and virtualize the NICs.
The process is slow, since the crashes only occur after 15-20 hours. I'll post the results as I go.
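
For the first of those next steps, a fully emulated CPU in libvirt would look roughly like this; the model and flags below are placeholders for illustration, not a tested combination:

<cpu mode='custom' match='exact' check='partial'>
  <model fallback='forbid'>Snowridge</model>   <!-- any named QEMU model; placeholder choice -->
  <feature policy='require' name='aes'/>       <!-- example of manually enabling a flag -->
  <feature policy='disable' name='pcid'/>      <!-- example of manually masking a flag -->
</cpu>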

Quote from: bartjsmit on September 09, 2022, 07:50:37 AM
Might be worth looking at Proxmox and VMware ESXi as alternative hypervisors.

I will give Proxmox a shot afterwards, just to see if KVM/QEMU is more stable there, but I don't feel it is the best option for me. I need a Linux install to run Docker services alongside OPNsense. Proxmox does not support Docker natively, so I'd have to run two VMs (OPNsense + Ubuntu or some other distro) plus the hypervisor. That would be a lot for this machine. I'd rather have the hypervisor be a Linux distro with Docker support (like the Ubuntu host I'm currently running, which is very stable).



Proxmox is just some binaries on top of a slightly modified Debian install. In fact, you can install Debian first and then install Proxmox on top of it.

Regardless of how you choose to install initially, you can have Docker running directly on the Debian/Proxmox host as easily as getting it running on plain Debian. Most people don't, since installing Docker in a lightweight Debian/Ubuntu/Alpine LXC on Proxmox takes so few additional resources, but you can.

https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_install_proxmox_ve_on_debian
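
As a rough sketch (using Docker's generic install script; the package choice and whether you use an LXC are up to you), getting Docker onto the Debian/Proxmox host, or into a small LXC, is just:

# on the Debian-based Proxmox host, or inside a Debian/Ubuntu LXC
apt update && apt install -y curl ca-certificates
curl -fsSL https://get.docker.com | sh     # or: apt install docker.io from the Debian repos
docker run --rm hello-world                # quick smoke test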

Troubleshooting updates:

Installed Proxmox. Created the VM with exactly the same settings I had on Ubuntu (to see if I could replicate the problem on another hypervisor).

Recap: host CPU passthrough, NIC passthrough, no memory ballooning, SCSI VirtIO disks.
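
For the record, that recap corresponds to a Proxmox VM config roughly like this (a sketch; the VMID, PCI addresses and storage names are placeholders, not my actual file):

# /etc/pve/qemu-server/100.conf (sketch)
machine: q35
cores: 2
memory: 2048
cpu: host                   # host CPU passthrough
balloon: 0                  # memory ballooning disabled
scsihw: virtio-scsi-pci
scsi0: local-lvm:vm-100-disk-0,size=40G
hostpci0: 0000:03:00.0      # passed-through igc NICs (addresses are placeholders)
hostpci1: 0000:04:00.0
hostpci2: 0000:05:00.0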

It crashed as well, around 20 hours in. The VM crashes and is not rebooted automatically, which is worse behavior than on Ubuntu.

syslog from proxmox:
Sep 14 12:25:23 pve QEMU[1016]: extra data[0]: 0x0000000080000b0e
Sep 14 12:25:23 pve QEMU[1016]: extra data[1]: 0x0000000000000031
Sep 14 12:25:23 pve QEMU[1016]: extra data[2]: 0x0000000000000083
Sep 14 12:25:23 pve QEMU[1016]: extra data[3]: 0x0000000830917ff8
Sep 14 12:25:23 pve QEMU[1016]: extra data[4]: 0x0000000000000002
Sep 14 12:25:23 pve QEMU[1016]: RAX=0000000830917eb6 RBX=ffffffff81f5f0c0 RCX=00000000c0000101 RDX=00000000ffffffff
Sep 14 12:25:23 pve QEMU[1016]: RSI=0000000000000000 RDI=ffffffff81f5f0c0 RBP=ffffffff81f5f0b0 RSP=ffffffff81f5efe0
Sep 14 12:25:23 pve QEMU[1016]: R8 =000000c000c6e900 R9 =0000000000000000 R10=0000000000000000 R11=000000c000c6e900
Sep 14 12:25:23 pve QEMU[1016]: R12=ffffffffffffff99 R13=ffffffffffffff9f R14=000000c0010d5380 R15=0000000830917eb6
Sep 14 12:25:23 pve QEMU[1016]: RIP=ffffffff81133841 RFL=00010082 [--S----] CPL=0 II=0 A20=1 SMM=0 HLT=0
Sep 14 12:25:23 pve QEMU[1016]: ES =003b 0000000000000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
Sep 14 12:25:23 pve QEMU[1016]: CS =0020 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
Sep 14 12:25:23 pve QEMU[1016]: SS =0000 0000000000000000 ffffffff 00c00000
Sep 14 12:25:23 pve QEMU[1016]: DS =003b 0000000000000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
Sep 14 12:25:23 pve QEMU[1016]: FS =0013 0000000830c40130 ffffffff 00c0f300 DPL=3 DS   [-WA]
Sep 14 12:25:23 pve QEMU[1016]: GS =001b ffffffff82c10000 ffffffff 00c0f300 DPL=3 DS   [-WA]
Sep 14 12:25:23 pve QEMU[1016]: LDT=0000 0000000000000000 ffffffff 00c00000
Sep 14 12:25:23 pve QEMU[1016]: TR =0048 ffffffff82c10384 00002068 00008b00 DPL=0 TSS64-busy
Sep 14 12:25:23 pve QEMU[1016]: GDT=     ffffffff82c103ec 00000067
Sep 14 12:25:23 pve QEMU[1016]: IDT=     ffffffff81f5d690 00000fff
Sep 14 12:25:23 pve QEMU[1016]: CR0=80050033 CR2=ffffffff81133841 CR3=0000000830917eb6 CR4=003506e8
Sep 14 12:25:23 pve QEMU[1016]: DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
Sep 14 12:25:23 pve QEMU[1016]: DR6=00000000ffff0ff0 DR7=0000000000000400
Sep 14 12:25:23 pve QEMU[1016]: EFER=0000000000000d01
Sep 14 12:25:23 pve QEMU[1016]: Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??


There are no crash logs recorded on the OPNsense side. Proxmox is looking worse than Ubuntu so far.

I know it rules out containers (at least in the free version), but my ESXi VM with OPNsense has never crashed in its many years of use.

Bart...

@rafaelreisr, have you tried running the OPNsense VM, either in Ubuntu KVM/QEMU or in Proxmox, without NIC passthrough yet? A virtualized CPU and paravirtualized NICs (VirtIO) seem to be about the only combination left to try.

I've also run an OPNsense VM on Proxmox for years now without any crashes like this, as have many others on the Proxmox forum I frequent, so there is no inherent, generalized compatibility issue on the software front.

This is a common issue with those units. There isn't any hardware swapping or configuration tweaking you can do to fix it.

My suggestion right now is to upgrade the host kernel to the latest version, 5.19, and report back with your results. Maybe even try Proxmox, just to work alongside the efforts of others.

Kernel upgrade thread: https://forum.proxmox.com/threads/opt-in-linux-5-19-kernel-for-proxmox-ve-7-x-available.115090/

Main thread tracking this issue: https://forum.proxmox.com/threads/vm-freezes-irregularly.111494/
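
Per the first link, the opt-in kernel is a one-package install on PVE 7.x (assuming the standard Proxmox repositories are enabled):

apt update
apt install pve-kernel-5.19
reboot
uname -r     # confirm the running kernel after the reboot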

I do not have experience with the Topton brand, but these units look similar to those from Kettop and Qotom. I built one for a client and, after the complaints started, felt like I had installed a time bomb. Nothing I tried could make the OS happy on the Qotom.

I took the config and loaded it onto my own smaller appliance PC, assigned the interfaces, and it took right over; it has worked ever since. That saved my posterior, and I then bought a replacement appliance - both of these units are from Protectli.

These appliance and NUC-size PCs are an amazing level of overkill for a firewall, with room to stretch and scale, and they run on an external DC power supply.

For an appliance PC, I highly recommend running one thing on it, bare metal. Anything going virtual should go on a server-class machine built for throughput.

The lesson I learned was definitely a case of getting what one pays for. I state only my own experience, as objectively as possible.

For reference, I used a Kettop Mi3455P4 home router with an Intel Celeron J3455, bought from Amazon. I replaced it with Protectli FW4B and FW6D units. The price difference is overshadowed only by the great difference in quality.

Another miscellaneous reference: I use a maxed-out Dell R710 running Ubuntu Server 20.04 to stage my VMs. At any given time I can have Windows, Xubuntu, or OPNsense VMs running in QEMU with libvirtd.