OPNSense randomly crashes (Fatal double fault)

Started by SirUffsALot, June 12, 2022, 12:26:23 PM

June 12, 2022, 12:26:23 PM Last Edit: June 12, 2022, 09:55:25 PM by SirUffsALot
Hi,
Unfortunately, my OPNsense firewall has been crashing randomly for some time now. I cannot see a pattern: sometimes it happens after a week or two, sometimes within 24 hours, mostly during low traffic (normal web surfing).
At first I thought it might be the combination of RAM disk and firewall logs; however, the crashes continue to occur even after deactivation.
CPU temperatures seem normal.

Hardware/Configuration:
OPNsense 22.1.8_1-amd64
Sophos XG 105
Intel Atom Processor E3930 @ 1.30GHz (2 cores, 2 threads)
2048MB RAM
4x Intel I211
64GB SSD (ZFS)
No CARP or IPS in use.

Installed Plugins:

os-acme-client
os-ddclient
os-dmidecode
os-dyndns
os-git-backup
os-hw-probe
os-iperf
os-mdns-repeater
os-smart
os-telegraf
os-theme-cicada
os-udpbroadcastrelay
os-vnstat
os-wireguard (+ kmod)


Following tunables modified:

hw.ibrs_disable = 1
hw.igb.rx_process_limit = -1
hw.igb.tx_process_limit = -1
hw.mds_disable = 0
hw.pci.honor_msi_blacklist = 0
legal.intel_igb.license_ack = 1
net.inet.icmp.drop_redirect = 1
net.inet.ip.redirect = 0
vfs.zfs.arc_max = 256M
vm.pmap.pti = 0


I was able to record a crash message from the serial console. Unfortunately, I cannot post it in this message due to the character limit, but I uploaded it to my pastebin service and attached it as a file to this post.
https://paste.biocrafting.net/?ce2a1af0e2c5d868#FZUKBAbbQVpNkTaEyVsvc979ggYSfitZFNvNfZYR2njW

Does anybody have an idea what could be causing the crashes?

Best regards

It looks hardware-related:

KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0003b69db0
vpanic() at vpanic+0x17f/frame 0xfffffe0003b69e00
panic() at panic+0x43/frame 0xfffffe0003b69e60
dblfault_handler() at dblfault_handler+0x1ce/frame 0xfffffe0003b69f20
Xdblfault() at Xdblfault+0xd7/frame 0xfffffe0003b69f20
--- trap 0x17, rip = 0xffffffff8110d1d6, rsp = 0xfffffe0003571dd0, rbp = 0xfffffe0003571dd0 ---
acpi_cpu_c1() at acpi_cpu_c1+0x6/frame 0xfffffe0003571dd0
acpi_cpu_idle() at acpi_cpu_idle+0x2ef/frame 0xfffffe0003571e10
cpu_idle_acpi() at cpu_idle_acpi+0x3e/frame 0xfffffe0003571e30
cpu_idle() at cpu_idle+0x9f/frame 0xfffffe0003571e50
sched_idletd() at sched_idletd+0x4e1/frame 0xfffffe0003571ef0
fork_exit() at fork_exit+0x7e/frame 0xfffffe0003571f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0003571f30
--- trap 0x36200d0, rip = 0xffffffff80c2b91f, rsp = 0, rbp = 0xffffffff8131d1ea ---
mi_startup() at mi_startup+0xdf/frame 0xffffffff8131d1ea

Maybe a BIOS update can help here.


Cheers,
Franco

With a similar system, I get these kinds of instabilities when lower C-states are allowed. You can run 'sysctl -a | grep cx_' to find out which C-states are supported and which are in use. You can run 'sysctl hw.acpi.cpu.cx_lowest=CX', or set the corresponding tunable, to limit the lowest C-state to X. My box is only stable at C1; anything lower makes it unstable.
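
The two commands above could be wrapped roughly like this. It is only a sketch, assuming a FreeBSD-based system (such as OPNsense) with root access; cap_cstate is a made-up helper name and C1 is just an example value, so pick a state that your own cx_supported line actually lists.

```shell
# cap_cstate is a hypothetical helper, not part of OPNsense or FreeBSD.
cap_cstate() {
    target="$1"
    # Accept only names of the form C1, C2, ... C9
    case "$target" in
        C[1-9]) ;;
        *) echo "usage: cap_cstate C<n>" >&2; return 2 ;;
    esac
    # Show the supported and currently used C-states before changing anything
    sysctl -a | grep cx_
    # Needs root; applies immediately but is not persistent across reboots
    sysctl "hw.acpi.cpu.cx_lowest=${target}"
}
```

Calling 'cap_cstate C1' changes the running system only. To keep the limit after a reboot, the same name and value would have to be set as a tunable.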
Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 800 up, Bufferbloat A+

Thanks both of you for your input.

I have to see how to update the BIOS on this appliance. Sophos does not provide standalone update files; maybe the BIOS is updated automatically when XG is installed. I will have to try that.

I checked the C-states and it seems that the CPU only supports C0/C1?


root@FWOPS01DEL:~ # sysctl -a | grep cx_
hw.acpi.cpu.cx_lowest: C1
dev.cpu.1.cx_method: C1/hlt
dev.cpu.1.cx_usage_counters: 130821134
dev.cpu.1.cx_usage: 100.00% last 89us
dev.cpu.1.cx_lowest: C1
dev.cpu.1.cx_supported: C1/1/0
dev.cpu.0.cx_method: C1/hlt
dev.cpu.0.cx_usage_counters: 251320055
dev.cpu.0.cx_usage: 100.00% last 32us
dev.cpu.0.cx_lowest: C1
dev.cpu.0.cx_supported: C1/1/0

Which C-states are used is determined by the BIOS (and is sometimes configurable there); some vendors have problems with lower C-states. In your case, only C1 is supported (and used), so this should not be the problem.
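
To read this off the sysctl output quickly, something like the following could help. It is a sketch only: deepest_cstates is a made-up helper name, and it assumes that each dev.cpu.N.cx_supported line lists its entries from shallowest to deepest, separated by spaces, as in the output posted above.

```shell
# deepest_cstates is a hypothetical helper. It reads `sysctl -a` style text
# on stdin and prints the last cx_supported entry for each CPU.
deepest_cstates() {
    awk '/^dev\.cpu\.[0-9]+\.cx_supported:/ {
        split($1, key, "[.]")   # key[3] holds the CPU index
        split($NF, st, "/")     # last entry looks like C1/1/0; st[1] is its name
        print "cpu" key[3] ": " st[1]
    }'
}
```

Usage would be 'sysctl -a | deepest_cstates'. For the output posted above it reports C1 for both cpu1 and cpu0, which matches the conclusion that no lower state is available on that box.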

For the last week the system ran stably, but today it crashed again. Unfortunately, the BIOS is already at the latest version: the firewall model is EOL, and the last Sophos release, XG 17.5.17, had already been installed on it before.

But I took the opportunity to reinstall the system completely and restore the config.xml; maybe the behavior will improve. Back then, I had installed OPNsense manually on top of FreeBSD so I could use ZFS.