Sometimes I need to repeatedly reconnect my PPPoE connection because my ISP doesn't properly weight their gateways and I end up on one on the other side of the country.
Recently (I think since 23.7), after I do this a few times, OPNsense completely locks up and restarts. I've submitted a few crash reports but wanted to check whether anyone else here is able to reproduce this.
I have the crash log which I can upload if it'd help anyone (and is there anything other than IPs to remove from the logs?)
Welcome to the club. :(
I stayed on 22.7 until recently exactly because of this dreaded crash-and-reboot-on-PPPoE-reconnect bug. I had been running OPNsense for several years without any problem ever. Right after I finally dared to upgrade to 23.7.1 because problem reports had ceased to show up, I was promptly hit by the bug I had managed to avoid for so long.
Reading past reports, I understand that the developers have a hard time fixing this as none of them has an ISP with PPPoE.
Like you, I've sent lots of bug reports lately.
The weird thing is that sometimes my daily PPPoE reconnect by cron is successful for several days in a row, while out of the blue it crashes and reboots for the next few days. Naturally, nothing changed in my environment in the meantime.
I found a FreeBSD kernel bug (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=272319) which seems to be the culprit. Not sure though, if OPNsense can do something about it regarding the MPD5 daemon or if we have to patiently wait until this gets fixed upstream.
To test PPPoE you don't really need an ISP with PPPoE. You can set up a fully functional PPPoE server on Linux or FreeBSD and configure it like an ISP's. Then you can point the PPPoE client on the OPNsense at it and run all kinds of tests. I already did that once, though to prove something different.
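A rough sketch of such a lab setup, assuming the rp-pppoe package on a spare Linux box (interface name, addresses and session count are just placeholders):
# apt install pppoe
# pppoe-server -I eth1 -L 192.0.2.1 -R 192.0.2.100 -N 5
Here -I is the Ethernet interface facing the test client, -L the server-side address, -R the first address of the pool handed out to clients and -N the maximum number of sessions; test credentials go into /etc/ppp/pap-secrets (or chap-secrets) as usual, and the OPNsense WAN is attached to that segment as if it were the ISP.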
https://github.com/opnsense/core/issues/6650#issuecomment-1635663267
@Monviech: You are right. I was merely quoting @franco's statement (https://forum.opnsense.org/index.php?topic=12828.msg60201#msg60201). ::)
@thatso: if I read that FreeBSD issue correctly, it should only concern users of FreeBSD 14?
I also found this mailing list discussion that might relate to your problem:
https://lists.freebsd.org/archives/freebsd-net/2023-October/004104.html
Are you using IPv6 with PPPoE?
Kind regards,
Patrick
Yes, I am using IPv6 :)
edit: I've uploaded the textdump file to my original post
In the kernel panic I can see that it's caused by a CPU page fault.
The kernel functions involved seem to be:
- ether_demux (demultiplexes Ethernet frames, looks into them, sees that they're IPv6 for example, and passes them on to ip6_input)
- ip6_input (receives IPv6 packets and handles them, passing them to ip6_tryforward for example)
- ip6_tryforward (forwards IPv6 packets along the best path to their destination)
I could also see a lot of "fq-codel" messages, which show that you use traffic shaping. Maybe deactivate the traffic shaper pipes for a while and see how it goes.
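If you want to double-check that the pipes are really gone after disabling them, and assuming the shaper is the usual ipfw/dummynet one, something like this from a shell should come back empty:
# ipfw pipe show
# ipfw sched show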
At first glance it doesn't look like PPPoE itself crashed the kernel. It's one of the functions above that crashes, so maybe PPPoE calls them wrongly and that crashes the kernel. But somebody else might know better.
---------------------------------------------------------
(I use PPPoE at home with a hardware OPNsense and haven't experienced crashes yet, also with IPv6, but I have static prefixes so I can't really be compared. I will try to reconnect it a few times later to see if I can make it crash ;D)
Read the mailing list. ;)
If an IPv6 interface goes away - like when PPPoE disconnects - a certain data structure is deallocated, while occasionally another thread still tries to use it to forward a queued packet, which of course causes a crash.
Looks like this is precisely the bug hitting our OP.
Kristof Provost and friends are currently discussing how to best fix it.
@Patrick: your explanation seems to be right on the spot.
BTW: like @craig, I use PPPoE with IPv4 and an additional IPv6 /56 prefix.
The FreeBSD bug tracker says that kernels 12.0 - 13.2 (the current OPNsense kernel version) are affected and the main cause is the MPD5 daemon.
There was a lengthy discussion (https://forum.opnsense.org/index.php?topic=12828.0) about the same problem in the past; unfortunately it stopped without any final solution, apart from @schnipp preventing the crash with a modified script (https://forum.opnsense.org/index.php?topic=12828.msg69564#msg69564) of his own.
I'm not sure the current discussion on the mailing list is about this bug (I was half hoping it only affected FreeBSD 14), but ip6_tryforward() is at least suspicious enough to warrant a closer look. Let me prepare a debug kernel tomorrow so we can get a core dump.
Backtrace for easier reference:
Tracing pid 0 tid 100025 td 0xfffffe00387fc020
kdb_enter() at kdb_enter+0x37/frame 0xfffffe00e003f5f0
vpanic() at vpanic+0x182/frame 0xfffffe00e003f640
panic() at panic+0x43/frame 0xfffffe00e003f6a0
trap_fatal() at trap_fatal+0x387/frame 0xfffffe00e003f700
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00e003f760
calltrap() at calltrap+0x8/frame 0xfffffe00e003f760
--- trap 0xc, rip = 0xffffffff80ea3574, rsp = 0xfffffe00e003f830, rbp = 0xfffffe00e003f8a0 ---
ip6_tryforward() at ip6_tryforward+0x274/frame 0xfffffe00e003f8a0
ip6_input() at ip6_input+0x5e4/frame 0xfffffe00e003f980
netisr_dispatch_src() at netisr_dispatch_src+0x295/frame 0xfffffe00e003f9d0
ether_demux() at ether_demux+0x159/frame 0xfffffe00e003fa00
ng_ether_rcv_upper() at ng_ether_rcv_upper+0x8c/frame 0xfffffe00e003fa20
ng_apply_item() at ng_apply_item+0x2bf/frame 0xfffffe00e003fab0
ng_snd_item() at ng_snd_item+0x28e/frame 0xfffffe00e003faf0
ng_apply_item() at ng_apply_item+0x2bf/frame 0xfffffe00e003fb80
ng_snd_item() at ng_snd_item+0x28e/frame 0xfffffe00e003fbc0
ng_ether_input() at ng_ether_input+0x4c/frame 0xfffffe00e003fbf0
ether_nh_input() at ether_nh_input+0x1f2/frame 0xfffffe00e003fc50
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe00e003fca0
ether_input() at ether_input+0x69/frame 0xfffffe00e003fd00
iflib_rxeof() at iflib_rxeof+0xbcb/frame 0xfffffe00e003fe00
_task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe00e003fe40
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x15d/frame 0xfffffe00e003fec0
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xc3/frame 0xfffffe00e003fef0
fork_exit() at fork_exit+0x7e/frame 0xfffffe00e003ff30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00e003ff30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Cheers,
Franco
Promised kernel (make sure you are on 23.7.7 beforehand):
# opnsense-update -zkr dbg-23.7.7
# opnsense-shell reboot
After the crash and automatic reboot there should be vmcore.* files under /var/crash that I'd need to look at.
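If anyone wants a first look themselves, a backtrace can usually be pulled out of such a dump with kgdb from the gdb package; the kernel path below is an assumption and has to match the kernel that actually crashed:
# pkg install gdb
# kgdb /boot/kernel/kernel /var/crash/vmcore.0
(kgdb) bt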
The debug kernel is a little more trigger-happy with panics; for example, opening System: Tunables will crash the system. Try to avoid operating the GUI as much as possible until the crash happens, and then go back to the regular kernel:
# opnsense-update -kf
Cheers,
Franco
Installed and waiting for the next crash. Naturally, today's reconnect did not crash. ???
Fingers crossed. Thanks a lot for the help!
Cheers,
Franco
PS: @thatso I can't directly confirm your panic is the same as the OP's, so I just wanted to mention that to level expectations.
Sorry I've been away for a few days.
I installed the debug kernel last night, but after doing so OPNsense panics on boot.
I had to get things back up and running, so I used the console port to select the previous kernel - I'll try to capture the panic this evening to see if we can work around it.
I was a bit afraid of that. Building with INVARIANTS in a release still crashes it pretty reliably in unrelated places. I'm not even sure I can do the debug thing without it due to other build requirements.
Cheers,
Franco
I have just had a PPPoE crash (typical), and do have a 1.96GB vmcore.0 crash file from the "production kernel" if it would help?
Yes please. Do you have somewhere to stash it?
Cheers,
Franco
PS: Compressing it should help with size a lot.
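For example (zstd ships in the FreeBSD base system, but gzip or xz would work just as well):
# zstd -T0 /var/crash/vmcore.0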
I've popped it on WeTransfer - https://we.tl/t-QYw1eSa4pj - let me know if there are any problems.
Got it, thanks. Taking a look right away.
Cheers,
Franco
Just to also give a quick update: I had no crash on reconnect for the last three days and I don't want to provoke one so as not to change the conditions leading to the crash.
As said, sometimes the crashes happen for several days in a row and sometimes nothing happens for a week. :o
Unfortunately I'm running into this gdb issue:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=257036
I've checked all the gdb versions we had down to 22.1 and all exhibit the same behaviour, which means either the core file or the debug kernel file has a persistent issue. It could be the size of the core file, but that size by itself I wouldn't call problematic at first glance. :(
Cheers,
Franco
Is there an info.0 file still on your end? I might need that, but not sure.
I can't get useful information out of the core, e.g.:
# dmesg -M vmcore.0
dmesg: _amd64_minidump_vatop: virtual address 0x0 not minidumped
dmesg: kvm_read: invalid address (0x0)
# ps -M vmcore.0
ps: invalid address (0xffffffff82d10000)
etc.
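(If the tools are just picking up a non-matching kernel image by default, pointing them at the right one explicitly might be worth a try - the path here is an assumption:)
# dmesg -M vmcore.0 -N /boot/kernel/kernel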
Cheers,
Franco
I do - I backed up the entire folder
Dump header from device: /dev/gpt/swapfs
Architecture: amd64
Architecture Version: 2
Dump Length: 1956237312
Blocksize: 512
Compression: none
Dumptime: 2023-10-31 10:41:57 +0000
Hostname: OPNsense.home
Magic: FreeBSD Kernel Dump
Version String: FreeBSD 13.2-RELEASE-p3 stable/23.7-n254818-f155405f505 SMP
Panic String: page fault
Dump Parity: 2194897932
Bounds: 0
Dump Status: good
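For reference, FreeBSD's crashinfo(8) can also turn such a dump into a plain-text summary (core.txt.N) in the crash directory; the directory and dump number here are assumptions:
# crashinfo -d /var/crash -n 0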
And we have a winner. ;)
After 6 days it finally crashed again.
@franco: I sent a PM regarding the dump files.
This morning it crashed again (while still on 23.7.7_3) with this well-known error:
Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address = 0x10
fault code = supervisor read data, page not present
I've already submitted the full crash log.
Today I had another crash with the same error. >:(
@franco: any insights on the debug logs yet?
The past few days had daily crashes and reboots BTW.
This is also still happening for me - I've been working through disabling functionality (shaper, jumbo frames, etc.) to try and figure it out, but it's a slow process.
It does look like IPv6 is going to be my next target though, as `ip6_tryforward()` is mentioned in the trace.
Fatal trap 12: page fault while in kernel mode
cpuid = 6; apic id = 06
fault virtual address = 0x10
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80ea3764
stack pointer = 0x28:0xfffffe00e013eca0
frame pointer = 0x28:0xfffffe00e013ed10
Fatal trap 12: page fault while in kernel mode
cpuid = 5; code segment = base 0x0, limit 0xfffff, type 0x1b
apic id = 05
fault virtual address = 0x10
fault code = supervisor read data, page not present
= DPL 0, pres 1, long 1, def32 0, gran 1
instruction pointer = 0x20:0xffffffff80ea3764
processor eflags = interrupt enabled, resume, stack pointer = 0x28:0xfffffe00e0143ca0
IOPL = 0
current process = 12 (swi1: netisr 6)
trap number = 12
frame pointer = 0x28:0xfffffe00e0143d10
code segment = base 0x0, limit 0xfffff, type 0x1b
panic: page fault
cpuid = 6
time = 1700089902
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00e013ea60
vpanic() at vpanic+0x151/frame 0xfffffe00e013eab0
panic() at panic+0x43/frame 0xfffffe00e013eb10
trap_fatal() at trap_fatal+0x387/frame 0xfffffe00e013eb70
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00e013ebd0
calltrap() at calltrap+0x8/frame 0xfffffe00e013ebd0
--- trap 0xc, rip = 0xffffffff80ea3764, rsp = 0xfffffe00e013eca0, rbp = 0xfffffe00e013ed10 ---
ip6_tryforward() at ip6_tryforward+0x274/frame 0xfffffe00e013ed10
ip6_input() at ip6_input+0x5e4/frame 0xfffffe00e013edf0
swi_net() at swi_net+0x12b/frame 0xfffffe00e013ee60
ithread_loop() at ithread_loop+0x25a/frame 0xfffffe00e013eef0
fork_exit() at fork_exit+0x7e/frame 0xfffffe00e013ef30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00e013ef30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
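A blunt way to take the ip6_tryforward() path out of the equation for a short test would be to switch off IPv6 forwarding entirely - with the obvious downside that LAN clients lose IPv6 routing while it's off, and OPNsense may well switch it back on at the next reconfigure (that part is an assumption):
# sysctl net.inet6.ip6.forwarding=0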
Almost two months later ... any findings on this issue or on the crash logs I've sent @franco?
It seems this bug was finally fixed upstream (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=272319) for the v14 kernel. Any chance that this will be included in OPNsense 24.1?
The problem is still present in
FreeBSD 14.1-STABLE #2 stable/14-n267607-7e10c2d27a53: Sat May 4 08:33:15 CEST 2024 amd64
and I guess it will hit the fan for OPNsense once it moves to the FreeBSD 14 base.
There's a current issue in the upstream bugtracker (from January):
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=276294