Crashes when reconnecting PPPoE repeatedly

franco · October 31, 2023, 11:50:59 AM

I was a bit afraid of that. Building with INVARIANTS in a release still crashes it pretty reliably in unrelated places. I'm not even sure I can do the debug thing without it due to other build requirements.

Cheers,
Franco

craig · October 31, 2023, 11:54:23 AM

I have just had a PPPoE crash (typical), and do have a 1.96GB vmcore.0 crash file from the "production kernel" if it would help?

franco · October 31, 2023, 11:58:05 AM

Yes please. Do you have somewhere to stash it?

Cheers,
Franco

franco · October 31, 2023, 11:58:30 AM

PS: Compressing it should help with size a lot.

craig · October 31, 2023, 12:12:31 PM

I've popped it on WeTransfer - https://we.tl/t-QYw1eSa4pj - let me know if there are any problems.

franco · October 31, 2023, 01:12:06 PM

Got it, thanks. Taking a look right away.

Cheers,
Franco

thatso · October 31, 2023, 02:20:50 PM

Just to also give a quick update: I had no crash on reconnect for the last three days and I don't want to provoke one so as not to change the conditions leading to the crash.
As said, sometimes the crashes happen for several days in a row and sometimes nothing happens for a week. :o

franco · October 31, 2023, 02:42:37 PM

Unfortunately I'm running into this gdb issue:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=257036

I've checked all the gdb versions we had down to 22.1 and all exhibit the same behaviour, which means either the core file or the debug kernel file has a persistent issue. It could be the size of the core file, but at first glance I wouldn't call that file size problematic. :(

Cheers,
Franco

franco · October 31, 2023, 03:00:20 PM

Is there an info.0 file still on your end? I might need that, but not sure.

I can't get useful information out of the core, e.g.:

# dmesg -M vmcore.0
dmesg: _amd64_minidump_vatop: virtual address 0x0 not minidumped
dmesg: kvm_read: invalid address (0x0)

# ps -M vmcore.0
ps: invalid address (0xffffffff82d10000)

etc.

Cheers,
Franco

craig · October 31, 2023, 04:37:59 PM

I do - I backed up the entire folder

Dump header from device: /dev/gpt/swapfs
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 1956237312
  Blocksize: 512
  Compression: none
  Dumptime: 2023-10-31 10:41:57 +0000
  Hostname: OPNsense.home
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 13.2-RELEASE-p3 stable/23.7-n254818-f155405f505 SMP
  Panic String: page fault
  Dump Parity: 2194897932
  Bounds: 0
  Dump Status: good
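
For anyone comparing several dumps, a header like the one above can be read into a dictionary with a few lines of Python. This is only a sketch: `parse_dump_header` is a hypothetical helper name, and it assumes the plain `Key: value` layout shown in this post (the sample is shortened from it).

```python
def parse_dump_header(text):
    """Parse 'Key: value' lines from a dump header into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first ': ' only, so values containing colons stay intact.
        key, sep, value = line.strip().partition(": ")
        if sep:
            fields[key] = value
    return fields


sample = """\
Dump header from device: /dev/gpt/swapfs
  Architecture: amd64
  Dump Length: 1956237312
  Compression: none
  Panic String: page fault
  Dump Status: good
"""

header = parse_dump_header(sample)
size_gb = int(header["Dump Length"]) / 1e9  # bytes to decimal gigabytes
print(f"{header['Panic String']!r}, {size_gb:.2f} GB, status {header['Dump Status']}")
```

The computed size of about 1.96 GB lines up with the vmcore.0 size mentioned earlier in the thread.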

thatso · November 04, 2023, 08:50:18 AM

And we have a winner. ;)
After 6 days it finally crashed again.

@franco: I sent a PM regarding the dump files.

thatso · November 09, 2023, 08:06:49 PM

This morning it crashed again (I was still on 23.7.7_3) with this well-known error:

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address	= 0x10
fault code		= supervisor read data, page not present
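
A fault virtual address as small as 0x10 is typically what a read through a NULL pointer at a small field offset looks like, though that is an inference and not something the log itself states. A minimal sketch of pulling the address out of the trap message and flagging that case, assuming a 4096-byte page size and using a hypothetical helper name (`fault_address`):

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages on amd64


def fault_address(trap_text):
    """Return the 'fault virtual address' from a trap message as an int, or None."""
    match = re.search(r"fault virtual address\s*=\s*(0x[0-9a-fA-F]+)", trap_text)
    return int(match.group(1), 16) if match else None


trap = """\
Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address\t= 0x10
fault code\t\t= supervisor read data, page not present
"""

addr = fault_address(trap)
# An address inside the first page suggests a NULL pointer plus a small offset.
looks_like_null_deref = addr is not None and addr < PAGE_SIZE
print(hex(addr), looks_like_null_deref)
```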

I've already submitted the full crash log.

thatso · November 11, 2023, 07:26:27 PM

Today I had another crash with the same error. >:(

@franco: any insights on the debug logs yet?

thatso · November 16, 2023, 06:25:47 PM

BTW, I've had daily crashes and reboots for the past few days.

craig · November 20, 2023, 11:25:17 AM

This is also still happening for me - I've been working through disabling functionality (shaper, jumbo frames etc) to try and figure it out, but it's a slow process.

It does look like IPv6 is going to be my next target though, as `ip6_tryforward()` is mentioned in the trace.

Fatal trap 12: page fault while in kernel mode
cpuid = 6; apic id = 06
fault virtual address	= 0x10
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80ea3764
stack pointer	        = 0x28:0xfffffe00e013eca0

frame pointer	        = 0x28:0xfffffe00e013ed10

Fatal trap 12: page fault while in kernel mode
cpuid = 5; code segment		= base 0x0, limit 0xfffff, type 0x1b
apic id = 05
fault virtual address	= 0x10
fault code		= supervisor read data, page not present
			= DPL 0, pres 1, long 1, def32 0, gran 1
instruction pointer	= 0x20:0xffffffff80ea3764
processor eflags	= interrupt enabled, resume, stack pointer	        = 0x28:0xfffffe00e0143ca0
IOPL = 0
current process		= 12 (swi1: netisr 6)
trap number		= 12
frame pointer	        = 0x28:0xfffffe00e0143d10
code segment		= base 0x0, limit 0xfffff, type 0x1b
panic: page fault
cpuid = 6
time = 1700089902
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00e013ea60
vpanic() at vpanic+0x151/frame 0xfffffe00e013eab0
panic() at panic+0x43/frame 0xfffffe00e013eb10
trap_fatal() at trap_fatal+0x387/frame 0xfffffe00e013eb70
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00e013ebd0
calltrap() at calltrap+0x8/frame 0xfffffe00e013ebd0
--- trap 0xc, rip = 0xffffffff80ea3764, rsp = 0xfffffe00e013eca0, rbp = 0xfffffe00e013ed10 ---
ip6_tryforward() at ip6_tryforward+0x274/frame 0xfffffe00e013ed10
ip6_input() at ip6_input+0x5e4/frame 0xfffffe00e013edf0
swi_net() at swi_net+0x12b/frame 0xfffffe00e013ee60
ithread_loop() at ithread_loop+0x25a/frame 0xfffffe00e013eef0
fork_exit() at fork_exit+0x7e/frame 0xfffffe00e013ef30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00e013ef30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
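
When several of these panics need comparing, the frames that were running when the trap hit can be extracted mechanically. The sketch below assumes the KDB backtrace format shown above, where the frames after the first `--- trap` marker are the interrupted call chain; `frames_after_trap` is a hypothetical helper name, and the sample is shortened from the trace in this post.

```python
import re

# Matches KDB frames such as "ip6_input() at ip6_input+0x5e4/frame 0x..."
FRAME_RE = re.compile(r"^(\w+)\(\) at \1\+(0x[0-9a-f]+)/frame")


def frames_after_trap(backtrace):
    """Return (function, offset) pairs listed after the first '--- trap' marker."""
    frames = []
    seen_trap = False
    for line in backtrace.splitlines():
        if line.startswith("--- trap"):
            if seen_trap:
                break  # a second marker ends the interrupted call chain
            seen_trap = True
            continue
        match = FRAME_RE.match(line)
        if seen_trap and match:
            frames.append((match.group(1), int(match.group(2), 16)))
    return frames


sample = """\
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00e013ebd0
calltrap() at calltrap+0x8/frame 0xfffffe00e013ebd0
--- trap 0xc, rip = 0xffffffff80ea3764, rsp = 0xfffffe00e013eca0, rbp = 0xfffffe00e013ed10 ---
ip6_tryforward() at ip6_tryforward+0x274/frame 0xfffffe00e013ed10
ip6_input() at ip6_input+0x5e4/frame 0xfffffe00e013edf0
swi_net() at swi_net+0x12b/frame 0xfffffe00e013ee60
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
"""

frames = frames_after_trap(sample)
name, offset = frames[0]  # first frame after the marker is where the fault occurred
print(f"{name}+{offset:#x}")
```

For the trace in this post the first frame is `ip6_tryforward+0x274`, which is why IPv6 forwarding is the next suspect.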
