CRASH in 19.1.8 when PPPOE refreshes

schnipp · June 20, 2019, 10:42:34 PM

We know the panic occurs while disconnecting the pppoe interface. Together with the stack trace and trap number 12 (page fault while in kernel mode), it looks to me like a race condition in the kernel that results in access through an invalid pointer.

A similar bug, reported but already fixed, pointed to the same cause (missing locks to synchronize SMP).

cpw · June 21, 2019, 04:04:53 AM

Diving into the kernel bug db, I see a couple of things that pop up.
1. I got a new, more explicit panic today:

sbsndptr: sockbuf 0xfffff800bc34e878 and mbuf 0xfffff80034f66500 clashing

2. Looking at the kernel bug reports, I found a couple of interesting ones: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=148807 and https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=218270

The latter suggests setting hw.igb.num_queues=1, so I'm going to try that. It also seems that several people in the former bug are seeing problems related to IPv6. I wonder if the combination of IPv6 that only half works and a bouncing PPPoE link might be the magic sauce that makes this crash.

schnipp · June 21, 2019, 03:19:36 PM

For testing purposes it may help to disable SMP (symmetric multiprocessing). The idea is to reduce concurrency within kernel mode.

Disable smp: loader tunable kern.smp.disabled=1
Disable specific cpu: loader tunable hint.lapic.X.disabled with "X" as the apic id of the cpu

Further details, see here
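
As a rough sketch, the two tunables above could be written as loader entries like this. The file path /boot/loader.conf.local and the APIC id 1 are assumptions for illustration and are not given in this thread. Loader tunables are only read at boot, so a reboot is needed before they take effect.

```
# /boot/loader.conf.local (assumed location for hand-written loader entries)

# Boot with a single CPU to take SMP concurrency out of the picture.
kern.smp.disabled="1"

# Alternative: disable one specific CPU instead of all secondary CPUs.
# The "1" in the name is a hypothetical APIC id; replace it with the
# APIC id of the CPU to disable, then uncomment the line.
#hint.lapic.1.disabled="1"
```

This is meant for testing only: if the panics stop with SMP disabled, that would support the race-condition theory, at the cost of running the firewall on one core.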

schnipp · June 24, 2019, 10:04:11 PM

@cpw: Did you make any progress in testing?

cpw · June 25, 2019, 03:00:16 AM

I'll be working on this the next day or so. I don't want to kill the network while others are using it ;)

cpw · June 30, 2019, 02:56:08 PM

So, a small, positive (maybe?) update on this. I set hw.igb.num_queues to 1 in the tunables section. It seems the box has remained up, over an extended period, including a full reset of the pppoe connection. I am not 100% confident yet, but this seems like a massive improvement relative to where I was previously.
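
For reference, a minimal sketch of the equivalent loader entry, assuming it is added by hand in /boot/loader.conf.local instead of through the tunables section of the GUI (the path is an assumption, not something stated in this thread):

```
# /boot/loader.conf.local (assumed location for hand-written loader entries)

# Limit the igb(4) driver to a single queue per interface,
# as suggested in FreeBSD bug 218270.
hw.igb.num_queues="1"
```

This is a boot-time tunable, so it only applies after a reboot. Running `sysctl hw.igb.num_queues` afterwards should show whether the value was picked up.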

mimugmail · June 30, 2019, 04:20:11 PM

Nice, good progress!

JDtheHutt · June 30, 2019, 08:03:18 PM

I have been experiencing this issue for a while now, and I also use PPPoE. It's been driving me crazy and I lack the technical experience to solve it myself. I thought it was due to my use of the earlier Wireguard packages, or maybe my hardware was faulty. However, I returned to OpenVPN and removed WG, and did a fresh install on top of that, and also tested my hardware and didn't see any faults occurring there, but the issue has kept occurring.

I have also set the tunable as detailed by cpw and I'll report back in a few days as to how it is going. I usually see at least one kernel panic a day, sometimes multiple, so I should know quite quickly.

JDtheHutt · July 02, 2019, 12:07:14 PM

The box stayed up for 50 hours without failure, but my PPPoE connection did not reset during that time. When I forced a reload of PPPoE, OPNsense immediately died and required a reboot. So at least I know it is due to PPPoE, but that tunable has not fixed it.

JDtheHutt · July 02, 2019, 12:08:58 PM

Quote from: schnipp on June 21, 2019, 03:19:36 PM
For testing purposes it may help to disable SMP (symmetric multiprocessing). The idea is to reduce concurrency within kernel mode.

Disable smp: loader tunable kern.smp.disabled=1
Disable specific cpu: loader tunable hint.lapic.X.disabled with "X" as the apic id of the cpu

Further details, see here

I'll try this next and report back in a few days

cpw · July 02, 2019, 04:10:24 PM

JDtheHutt, are you using Realtek NICs or Intel NICs? That tunable only affects Intel NICs.

JDtheHutt · July 02, 2019, 06:17:57 PM

I am using Intel NICs. I use a Supermicro X10SBA board and just went to their support page to confirm just in case I was mistaken.

mimugmail · July 02, 2019, 07:18:43 PM

Intel I211?

JDtheHutt · July 02, 2019, 09:16:32 PM

Quote from: mimugmail on July 02, 2019, 07:18:43 PM
Intel I211?

Intel i210AT is what is listed for them. I hope you're not about to tell me that those are borked in BSD!

mimugmail · July 02, 2019, 09:55:39 PM

No, there's a system known to stop service because of the I211, but nothing like that is known for the I210.
