Crashes when reconnecting PPPoE repeatedly

Started by craig, October 23, 2023, 12:00:33 PM

I was a bit afraid of that. Building with INVARIANTS in a release still crashes it pretty reliably in unrelated places. I'm not even sure I can do the debug thing without it due to other build requirements.


Cheers,
Franco
"AI has absolutely reduced the cost of creating technical debt." -- ChatGPT

I just had another PPPoE crash (typical) and have a 1.96 GB vmcore.0 crash file from the "production kernel", if that would help.

Yes please. Do you have somewhere to stash it?


Cheers,
Franco

PS: Compressing it should help with size a lot.
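
For anyone else uploading a dump, here is a minimal sketch of the compression step. It assumes plain gzip is available on the box; `compress_core` is a hypothetical helper name, not something shipped with OPNsense, and the original file is left untouched.

```shell
# Hypothetical helper: write a gzip copy next to the dump and keep the original.
# Assumes gzip is installed; the archive is integrity-checked after writing.
compress_core() {
    gzip -9 -c "$1" > "$1.gz" && gzip -t "$1.gz"
}
```

Usage would be something like `compress_core vmcore.0`, which leaves `vmcore.0.gz` beside the dump.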

I've popped it on WeTransfer - https://we.tl/t-QYw1eSa4pj. Let me know if there are any problems.

Got it, thanks. Taking a look right away.


Cheers,
Franco

Just to also give a quick update: I had no crash on reconnect for the last three days and I don't want to provoke one so as not to change the conditions leading to the crash.
As said, sometimes the crashes happen for several days in a row and sometimes nothing happens for a week.  :o

Unfortunately I'm running into this gdb issue:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=257036

I've checked all the gdb versions we had, down to 22.1, and all exhibit the same behaviour, which means either the core file or the debug kernel file has a persistent issue. It could be the size of the core file, but at first glance I wouldn't call that file size problematic. :(


Cheers,
Franco

Is there an info.0 file still on your end? I might need that, but not sure.

I can't get useful information out of the core, e.g.:

# dmesg -M vmcore.0
dmesg: _amd64_minidump_vatop: virtual address 0x0 not minidumped
dmesg: kvm_read: invalid address (0x0)

# ps -M vmcore.0
ps: invalid address (0xffffffff82d10000)

etc.


Cheers,
Franco

I do - I backed up the entire folder. Here is the info.0:

Dump header from device: /dev/gpt/swapfs
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 1956237312
  Blocksize: 512
  Compression: none
  Dumptime: 2023-10-31 10:41:57 +0000
  Hostname: OPNsense.home
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 13.2-RELEASE-p3 stable/23.7-n254818-f155405f505 SMP
  Panic String: page fault
  Dump Parity: 2194897932
  Bounds: 0
  Dump Status: good
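
As a quick sanity check that a header like this belongs to the core file being shared, the Dump Length field can be converted to decimal gigabytes and compared with the file size. This is only a sketch: `dump_length_gb` is a hypothetical helper name, and it assumes the header layout shown above.

```shell
# Hypothetical helper: print the "Dump Length" field of an info.N file in
# decimal GB (bytes / 1,000,000,000), rounded to two places.
# Reads the files given as arguments, or stdin if none are given.
dump_length_gb() {
    awk -F': *' '/^ *Dump Length:/ { printf "%.2f\n", $2 / 1000000000 }' "$@"
}
```

Run as `dump_length_gb info.0`, the header above gives 1.96, which lines up with the 1.96 GB vmcore.0 mentioned earlier.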

And we have a winner.  ;)
After 6 days it finally crashed again.

@franco: I sent a PM regarding the dump files.

This morning it crashed again (still on 23.7.7_3) with this well-known error:

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address = 0x10
fault code = supervisor read data, page not present


I've already submitted the full crash log.

Today I had another crash with the same error.  >:(

@franco: any insights on the debug logs yet?

The past few days had daily crashes and reboots BTW.

This is also still happening for me. I've been working through disabling functionality (shaper, jumbo frames, etc.) to try to figure it out, but it's a slow process.

It does look like IPv6 is going to be my next target though, as `ip6_tryforward()` is mentioned in the trace.

Fatal trap 12: page fault while in kernel mode
cpuid = 6; apic id = 06
fault virtual address = 0x10
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80ea3764
stack pointer         = 0x28:0xfffffe00e013eca0

frame pointer         = 0x28:0xfffffe00e013ed10

Fatal trap 12: page fault while in kernel mode
cpuid = 5; code segment = base 0x0, limit 0xfffff, type 0x1b
apic id = 05
fault virtual address = 0x10
fault code = supervisor read data, page not present
= DPL 0, pres 1, long 1, def32 0, gran 1
instruction pointer = 0x20:0xffffffff80ea3764
processor eflags = interrupt enabled, resume, stack pointer         = 0x28:0xfffffe00e0143ca0
IOPL = 0
current process = 12 (swi1: netisr 6)
trap number = 12
frame pointer         = 0x28:0xfffffe00e0143d10
code segment = base 0x0, limit 0xfffff, type 0x1b
panic: page fault
cpuid = 6
time = 1700089902
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00e013ea60
vpanic() at vpanic+0x151/frame 0xfffffe00e013eab0
panic() at panic+0x43/frame 0xfffffe00e013eb10
trap_fatal() at trap_fatal+0x387/frame 0xfffffe00e013eb70
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00e013ebd0
calltrap() at calltrap+0x8/frame 0xfffffe00e013ebd0
--- trap 0xc, rip = 0xffffffff80ea3764, rsp = 0xfffffe00e013eca0, rbp = 0xfffffe00e013ed10 ---
ip6_tryforward() at ip6_tryforward+0x274/frame 0xfffffe00e013ed10
ip6_input() at ip6_input+0x5e4/frame 0xfffffe00e013edf0
swi_net() at swi_net+0x12b/frame 0xfffffe00e013ee60
ithread_loop() at ithread_loop+0x25a/frame 0xfffffe00e013eef0
fork_exit() at fork_exit+0x7e/frame 0xfffffe00e013ef30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00e013ef30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
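
To compare crashes without reading every trace by hand, the frame that follows the first "--- trap" marker can be pulled out of a saved panic log. A minimal sketch, assuming the KDB backtrace format shown above; `faulting_frame` is a hypothetical helper name.

```shell
# Hypothetical helper: print the function name of the first backtrace frame
# that follows the first "--- trap" marker in a KDB stack backtrace.
# Reads the files given as arguments, or stdin if none are given.
faulting_frame() {
    awk '
        /^--- trap/ && !seen { seen = 1; next }            # first trap marker
        seen { sub(/\(\).*/, "", $1); print $1; exit }     # frame right after it
    ' "$@"
}
```

Fed the trace above, this should print ip6_tryforward, which makes it easy to check whether later crashes land in the same function.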