PC Engines APU1 "general protection fault"

Started by bikemike, April 16, 2023, 09:08:58 PM

April 16, 2023, 09:08:58 PM Last Edit: May 26, 2023, 04:11:11 PM by bikemike
I have been running OPNsense for several weeks now after coming from pfSense.  The platform is great and I am really happy about making the switch.  However, I cannot keep OPNsense up for more than a day or two without crashing. 

Hardware: PC engines apu1d4 running BIOS Build 9/8/2014 (beta, reduced "spew level")
Processor: AMD G-T40E Processor
NIC: Realtek RTL8111E
Drive: Transcend 32GB SATA III 6Gb/s MSA370S mSATA
OPNsense Version: OPNsense 23.1.5_4-amd64

A few things I have tried:

  • hw.ibrs_disable = 1 (turns off the Spectre V2 mitigation)
  • vm.pmap.pti = 0 (turns off the Meltdown mitigation)
  • Install os-realtek-re plugin
  • Enable AMD thermal sensor (not related, but allows me to see CPU temps which are good)
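
To confirm the two tunables above actually took effect, their current values can be read back from a shell. This is only a sketch: show_tunables is a made-up helper name, and it prints "unavailable" where the kernel does not expose a name. As far as I know, vm.pmap.pti is a boot-time tunable, so a change there only shows up after a reboot.

```shell
#!/bin/sh
# Print the current value of each tunable mentioned in the list above.
# show_tunables is a hypothetical helper name, not part of OPNsense.
show_tunables() {
    for name in hw.ibrs_disable vm.pmap.pti; do
        # sysctl -n prints only the value; fall back if the name is unknown
        value=$(sysctl -n "$name" 2>/dev/null) || value="unavailable"
        echo "$name = $value"
    done
}

show_tunables
```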

I tried enabling powerd, but this processor does not support it under the current firmware.  When the crash does occur, I get the following:


Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer = 0x20:0xffffffff80cc3160
stack pointer         = 0x28:0xfffffe000798ab60
frame pointer         = 0x28:0xfffffe000798abc0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = resume, IOPL = 0
current process = 11 (idle: cpu0)
trap number = 9
panic: general protection fault
cpuid = 0
time = 1681634405
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe000798a980
vpanic() at vpanic+0x17f/frame 0xfffffe000798a9d0
panic() at panic+0x43/frame 0xfffffe000798aa30
trap_fatal() at trap_fatal+0x385/frame 0xfffffe000798aa90
calltrap() at calltrap+0x8/frame 0xfffffe000798aa90
--- trap 0x9, rip = 0xffffffff80cc3160, rsp = 0xfffffe000798ab60, rbp = 0xfffffe000798abc0 ---
callout_process() at callout_process+0x180/frame 0xfffffe000798abc0
handleevents() at handleevents+0x188/frame 0xfffffe000798ac00
timercb() at timercb+0x24e/frame 0xfffffe000798ac50
hpet_intr_single() at hpet_intr_single+0x1b3/frame 0xfffffe000798ac80
intr_event_handle() at intr_event_handle+0x92/frame 0xfffffe000798acd0
intr_execute_handlers() at intr_execute_handlers+0x4b/frame 0xfffffe000798ad00
Xapic_isr1() at Xapic_isr1+0xdc/frame 0xfffffe000798ad00
--- interrupt, rip = 0xffffffff8111b0a6, rsp = 0xfffffe000798add0, rbp = 0xfffffe000798add0 ---
acpi_cpu_c1() at acpi_cpu_c1+0x6/frame 0xfffffe000798add0
acpi_cpu_idle() at acpi_cpu_idle+0x2ef/frame 0xfffffe000798ae10
cpu_idle_acpi() at cpu_idle_acpi+0x3e/frame 0xfffffe000798ae30
cpu_idle() at cpu_idle+0x9f/frame 0xfffffe000798ae50
sched_idletd() at sched_idletd+0x4e1/frame 0xfffffe000798aef0
fork_exit() at fork_exit+0x7e/frame 0xfffffe000798af30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe000798af30
--- trap 0x7a140b8, rip = 0xffffffff80c30e8f, rsp = 0, rbp = 0xffffffff8133a258 ---
mi_startup() at mi_startup+0xdf/frame 0xffffffff8133a258
KDB: enter: panic
panic.txt: general protection fault
version.txt: FreeBSD 13.1-RELEASE-p7 stable/23.1-n250411-85724e9ce22 SMP
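
For reading a trace like the one above: the frame printed right after the "--- trap 0x9" marker is where the fault was taken. Here that is callout_process(), reached from the HPET timer interrupt while the idle thread sat in acpi_cpu_c1(). A minimal sketch for pulling that frame out of a saved panic log follows. The helper name faulting_frame is made up for illustration, and the pattern only covers trap 9 as seen in this report.

```shell
#!/bin/sh
# Print the function named on the line that follows the "--- trap 0x9," marker.
# faulting_frame is a hypothetical helper; it reads files given as arguments,
# or standard input when called without arguments.
faulting_frame() {
    awk '/^--- trap 0x9,/ {
        # the next line is the frame that was executing when the trap hit
        if ((getline line) > 0) { split(line, f, " "); print f[1] }
        exit
    }' "$@"
}

# Usage (the file name is an assumption):
#   faulting_frame panic.txt
```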


Load does not seem to play a factor.  In many cases, it's the middle of the night when the crash occurs.  I have made many crash reports, but I am not sure where to go next.  I am considering updating the APU firmware/BIOS to v4.17.0.3 next.  One crash apparently corrupted something and required a fresh install as OPNsense would not fully boot.  Obviously, this is not sustainable.
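
The "time = 1681634405" field in the panic output is seconds since the Unix epoch, so it can be converted to check when a crash actually happened. A small sketch, assuming either FreeBSD date (-r) or GNU date (-d) is available; epoch_to_utc is a made-up helper name:

```shell
#!/bin/sh
# Convert an epoch timestamp from a panic report to a readable UTC time.
# epoch_to_utc is a hypothetical helper name.
epoch_to_utc() {
    # FreeBSD date takes -r <seconds>; GNU date takes -d @<seconds>
    date -u -r "$1" '+%Y-%m-%d %H:%M:%S UTC' 2>/dev/null ||
        date -u -d "@$1" '+%Y-%m-%d %H:%M:%S UTC'
}

epoch_to_utc 1681634405   # prints 2023-04-16 08:40:05 UTC
```

Dropping the -u flag would show the time in the firewall's configured time zone instead.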

In the many, many years of running pfSense, I never once had a crash.  So, not sure why OPNsense is having issues.  Obviously, two different systems, but this kinda sucks.  Any help or insight would be greatly appreciated.

I should note, it seems many others are seeing this as well:

https://forum.opnsense.org/index.php?topic=28302.0
Set net.inet.tcp.sack.enable to 0, but this was supposed to be fixed in 22.7.5.

https://forum.opnsense.org/index.php?topic=27211.0
No resolution...

https://forum.opnsense.org/index.php?topic=31965.0
I am actually going to try a new power supply since I had a similar issue before.

https://forum.opnsense.org/index.php?topic=33239.0

https://forum.opnsense.org/index.php?topic=20599

[Mega thread on the issue but dated and maybe not relevant]
https://forum.opnsense.org/index.php?topic=11419.0

[4/22 Update]  Replaced the power supply two days ago.  System appeared stable and went nearly two days, then another crash.  Likely going to move forward with the BIOS/firmware update next.

[4/23 Update] Today everything came unglued and started core dumping.  The web interface was returning a 500, but traffic was still flowing.  Ended up having to pull the power on the OPNsense device to recover...  See attached screenshot for details.

[4/28 Update] I removed the WireGuard plugin a few days ago and OPNsense has been up since.  Pushing nearly four days now.  I had installed the plugin but never fully configured or brought up the interface.  Wondering if that was causing issues.  If several more days go by, it could be suspect.  I did have a PHP component crash which was preventing graphs on the Dashboard from loading.  Restarted all the services which brought that back.

[5/9 Update] System went four days without a crash.  Updated my other APU board to the latest BIOS, but I still need to swap it in and put it to use.  Keep submitting the crash reports :-(

Switched to my other APU1D4 board with the latest BIOS.  Let's see how this goes...

[5/11 Update]  So far the old board with the new BIOS is holding strong.  Needs to exceed 4+ days for me to feel comfortable things are stable.  Starting to wonder if maybe the other/new board has memory issues.  Was going to run it through an extensive memory check to rule that out.

[5/16 Update] System has been up over six days now which is a record.  Thinking we're stable at this point.  Either the old BIOS on the new board was the issue, or something else on that new board.  The old board with new BIOS is good.  I am tempted to upgrade the new board to the new BIOS, then put it back into use and see what happens.  If that holds stable, it was definitely the old BIOS.  Otherwise, I think the issue is resolved at this point.

[5/20 Update] System has been up for over 16 days now.  Going to consider the system stable and the issue resolved.

Coreboot 4.17.0.3 was the last version to support APU1 series, and it's light years ahead of the 2014 one you have. You have all the information on how to flash it properly on the site and on the forum.
https://pcengines.github.io

In short it should be like this:
flashrom -w apu1_v4.17.0.3.rom -p internal && pciconf -w pci0:24:0 0x6c 0x580ffe10 && opnsense-shell reboot

Other than that I don't think you have other issues to worry about.

Thanks for the reply newsense!  I am hoping the firmware update will take care of the issue (and new power supply).  I was actually going to use the following for the update:

https://github.com/pcengines/apu2-documentation/blob/master/scripts/apu_fw_updater_opnsense.sh

Looks like the flashrom command is a bit different than what you provided though.

Ah yes, the chip detection override is required on APU1, else it won't flash the ROM:

flashrom -w apu1_v4.17.0.3.rom -p internal  -c "MX25L1605A/MX25L1606E/MX25L1608E" &&  pciconf -w pci0:24:0 0x6c 0x580ffe10 && opnsense-shell reboot
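
Before writing a new image it can be worth reading the current flash contents to a file first, so there is something to go back to. The sketch below is only an illustration under a few assumptions: the backup file name is made up, the run and backup_then_flash helpers are hypothetical, and by default it only prints the commands (set DRY_RUN=0 to really execute them). The pciconf and reboot steps from the command above would still follow a successful write.

```shell
#!/bin/sh
# Sketch: read a backup of the current flash, then write the new ROM.
# BACKUP and both helper names are assumptions for illustration only.
ROM="apu1_v4.17.0.3.rom"
CHIP="MX25L1605A/MX25L1606E/MX25L1608E"
BACKUP="apu1_backup.rom"
DRY_RUN="${DRY_RUN:-1}"   # default: print the commands instead of running them

run() {
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

backup_then_flash() {
    # -r reads the chip into a file, -w writes a file to the chip
    run flashrom -r "$BACKUP" -p internal -c "$CHIP" &&
        run flashrom -w "$ROM" -p internal -c "$CHIP"
}

backup_then_flash
```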