DEC2750 crashes every 2-3 weeks with kernel panic (wireguard kmod problem)

Started by Monviech (Cedrik), August 21, 2023, 09:13:26 AM

Hardware: OPNsense DEC2750

OPNsense 23.4.1-amd64
FreeBSD 13.1-RELEASE-p7
OpenSSL 1.1.1u 30 May 2023
Licensed until 2023-11-23


I have this crash report:
Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer = 0x20:0xffffffff80cc43eb
stack pointer         = 0x28:0xfffffe00cb82cd00
frame pointer         = 0x28:0xfffffe00cb82cd00
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = resume, IOPL = 0
current process = 0 (wg_tqg_0)
trap number = 9
panic: general protection fault
cpuid = 0
time = 1692600884
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00cb82cb20
vpanic() at vpanic+0x17f/frame 0xfffffe00cb82cb70
panic() at panic+0x43/frame 0xfffffe00cb82cbd0
trap_fatal() at trap_fatal+0x385/frame 0xfffffe00cb82cc30
calltrap() at calltrap+0x8/frame 0xfffffe00cb82cc30
--- trap 0x9, rip = 0xffffffff80cc43eb, rsp = 0xfffffe00cb82cd00, rbp = 0xfffffe00cb82cd00 ---
callout_cc_add() at callout_cc_add+0x7b/frame 0xfffffe00cb82cd00
callout_reset_sbt_on() at callout_reset_sbt_on+0x21b/frame 0xfffffe00cb82cd70
wg_deliver_out() at wg_deliver_out+0x2a1/frame 0xfffffe00cb82ce40
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x15d/frame 0xfffffe00cb82cec0
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xc2/frame 0xfffffe00cb82cef0
fork_exit() at fork_exit+0x7e/frame 0xfffffe00cb82cf30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00cb82cf30
--- trap 0x80c85274, rip = 0x2b5600000000000, rsp = 0, rbp = 0 ---
KDB: enter: panic
panic.txt: general protection fault
version.txt: FreeBSD 13.1-RELEASE-p7 stable/23.1-n250445-fb81510bd0e SMP


From my quick analysis, it seems to be related to the WireGuard kernel module:
current process = 0 (wg_tqg_0)
wg_deliver_out() at wg_deliver_out+0x2a1/frame 0xfffffe00cb82ce40
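
The same quick check can be scripted. This is only a rough sketch, assuming the crash report is available as plain text in the format pasted above; the `summarize` helper, the regular expressions, and the `wg_` prefix heuristic are my own illustration and not part of any OPNsense or FreeBSD tooling:

```python
import re
from datetime import datetime, timezone

# Excerpt of the crash report above (only the lines relevant to the analysis).
REPORT = """\
current process = 0 (wg_tqg_0)
time = 1692600884
calltrap() at calltrap+0x8/frame 0xfffffe00cb82cc30
callout_cc_add() at callout_cc_add+0x7b/frame 0xfffffe00cb82cd00
callout_reset_sbt_on() at callout_reset_sbt_on+0x21b/frame 0xfffffe00cb82cd70
wg_deliver_out() at wg_deliver_out+0x2a1/frame 0xfffffe00cb82ce40
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x15d/frame 0xfffffe00cb82cec0
"""

# One KDB backtrace frame, e.g. "wg_deliver_out() at wg_deliver_out+0x2a1/frame 0x..."
FRAME_RE = re.compile(r"^(\w+)\(\) at \w+\+0x[0-9a-f]+/frame 0x[0-9a-f]+$", re.M)
PROC_RE = re.compile(r"^current process\s*=\s*\d+ \((\S+)\)$", re.M)
TIME_RE = re.compile(r"^time = (\d+)$", re.M)


def summarize(report, prefix="wg_"):
    """Collect the panicking thread, panic time and frames matching a name prefix."""
    frames = FRAME_RE.findall(report)
    proc = PROC_RE.search(report)
    ts = TIME_RE.search(report)
    panic_time = None
    if ts:
        # "time" in the report is seconds since the Unix epoch.
        panic_time = datetime.fromtimestamp(int(ts.group(1)), tz=timezone.utc).isoformat()
    return {
        "process": proc.group(1) if proc else None,
        "panic_time_utc": panic_time,
        "frames": frames,
        # Heuristic only: a matching prefix hints at the module, it proves nothing.
        "suspect_frames": [f for f in frames if f.startswith(prefix)],
    }


if __name__ == "__main__":
    for key, value in summarize(REPORT).items():
        print(f"{key}: {value}")
```

For this report it picks out the wg_tqg_0 thread and the wg_deliver_out frame, and the timestamp decodes to 2023-08-21 06:54:44 UTC. A name match like this only narrows down where to look; it does not confirm the root cause.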

I have wireguard-kmod installed. But I would need another pair of eyes on that.

If wireguard-kmod really is the culprit, I would remove it and go back to the userspace implementation (wireguard-go).
Hardware:
DEC740

You're on the business edition not 23.1, so better update it first and see if the issue persists

Quote from: newsense on August 21, 2023, 09:25:11 AM
You're on the business edition not 23.1, so better update it first and see if the issue persists

Thanks, but:

https://forum.opnsense.org/index.php?topic=34448.msg166836#msg166836

"OPNsense business edition 23.4.1 released - This business release is based on the OPNsense 23.1.9 community version..."

I didn't upgrade to 23.4.2 Business Edition yet, my patch window is next week. The firewall is in production.

So you can find a change window to switch wg implementations outside the patching cycle, but you won't get one for emergency / security patching? o_0


That being said, a few DEC750s I know of have been running multiple wg instances (main GWs, S2S and roadwarrior) for at least a year. Most are on 23.4.2, some are still lagging on 23.4.1, and none crash.

Running Zenarmor or Suricata there by chance?

I implemented wireguard-kmod when I deployed the firewall; I didn't switch WireGuard implementations while it was already in production. And yes, I am using Suricata on it, but not Zenarmor.
It's also not ultra critical since 2 firewalls run in HA mode, so the other firewall took over when there was a crash.

I also have additional DEC695s running wireguard-kmod and they don't crash. But they don't have Suricata running (if that's the reason).

I'm sorry that the patching cycle surprises you, but I didn't see any emergency security issues being fixed in the patch notes of 23.4.2. So I'm using the regular window for that.



Looks like an upstream issue. I found one that looks a bit similar, but nobody is working on a patch: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=264115


Cheers,
Franco

The call to callout_reset_sbt_on() seems to come from the wireguard-kmod package, because I can't find that call in the in-kernel code for 23.7 (which is not in 23.4 for now). So the problem might not exist in FreeBSD 13.2 / OPNsense 23.7, but I don't know for sure.


Cheers,
Franco

Thank you this has been very helpful.

I will update the firewalls as soon as I can get a patch window and look out for further crashes. If they happen again, I will switch back to the wireguard-go implementation.

No problem. Don't expect 23.4.2 to fix this, but I also don't know whether the wireguard-kmod port was updated/fixed. It could be, but it's rare.

Now that WireGuard is in the FreeBSD kernel, much of the interest in actively supporting it via upstream seems to have gone away (who would have thought).


Cheers,
Franco

I updated to 23.4.2 today and tried to revert to wireguard-go. I removed the wireguard-kmod package, rebooted, installed the wireguard-go plugin, and rebooted again. Then I applied the WireGuard config, but I couldn't get handshakes to work at all.

Now I'm back on wireguard-kmod and hope it won't crash the firewall.

My strategy for the future will involve going back to IPsec and IKEv2.

That would be our business support answer too. WireGuard is what it is... but definitely not enterprise-ready.


Cheers,
Franco

Quote from: franco on August 25, 2023, 07:05:20 PM
That would be our business support answer too. WireGuard is what it is... but definitely not enterprise-ready.

You are right. It took some testing time in my pilot road warrior group, but I came to the same conclusion. There are too many wonky things about it, especially while also using HA with CARP. I'm definitely taking a warning like this more seriously the next time:

At this time this code is new, unvetted, possibly buggy, and should be
considered "experimental". It might contain security issues. We gladly
welcome your testing and bug reports, but do keep in mind that this code
is new, so some caution should be exercised at the moment for using it
in mission critical environments.

While this is specifically aimed at the FreeBSD kernel implementation, some of the constraints that come from trying to make it a "simple" protocol make it pretty much unusable in basic scenarios that require 2FA. You could just as easily tunnel SSH here and have 2FA...

And we were discussing a while back that HA is apparently somehow broken in the protocol layer, as the peer may keep polling a stale interface, misinterpret the other instance as being the one that is down, and keep sending traffic there (since it got a reply from....somewhere).


Cheers,
Franco