CRASH in 19.1.8 when PPPoE refreshes

Started by cpw, May 23, 2019, 06:46:28 AM

I'll chime in here as well. No developer has a PPPoE connection, so when somebody tries to help, it comes down to me doing remote debugging sessions, IRC talks, GitHub discussions, and auditing and improving the interface code bottom-up.

It's definitely a problem for PPPoE. There are better options out there.

What is needed is just one person to step up and fix this for everyone. :/


Cheers,
Franco

I'm happy to help. Suggestions for how to diagnose would be greatly appreciated. I can supply whatever crash reports I get (I've been submitting them regularly through the reporter tool). Is there anything I can turn on to get a better picture of the situation, such as debugging flags?


Looks like it.


re0@pci0:1:0:0: class=0x020000 card=0x012310ec chip=0x816810ec rev=0x06 hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet
re1@pci0:6:0:0: class=0x020000 card=0x012310ec chip=0x816810ec rev=0x07 hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet
re2@pci0:7:0:0: class=0x020000 card=0x012310ec chip=0x816810ec rev=0x07 hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet


That's all 3 NICs in this box. I don't have the luxury of new hardware right now. Do I just put up with frequent outages?

I've added a couple of tunables that others have noted can have an impact:


hw.re.msi_disable 1
hw.re.msix_disable 1

They seem to have taken effect (no log message about MSIX anymore).

Let's see if the stability improves.
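
For anyone wanting to double-check that the two tunables actually reached the kernel after a reboot, here is a minimal sketch. It assumes they were applied as loader tunables and are therefore visible through FreeBSD's `kenv`; the helper name `check_tunable` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: compare a loader tunable in the kernel environment
# against the value we expect. Assumes kenv(1) is available (FreeBSD).
check_tunable() {
    name="$1"
    want="$2"
    # kenv -q suppresses warnings; a non-zero exit means the variable is unset.
    got=$(kenv -q "$name" 2>/dev/null) || got="(unset)"
    if [ "$got" = "$want" ]; then
        echo "$name: ok ($got)"
    else
        echo "$name: expected $want, got $got"
        return 1
    fi
}

# Usage (run on the firewall console):
#   check_tunable hw.re.msi_disable 1
#   check_tunable hw.re.msix_disable 1
```

A mismatch only shows that the value did not reach the kernel environment; it says nothing about whether the driver honoured it.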

Quote from: cpw on June 06, 2019, 03:14:03 AM
I've added a couple of tunables that others have noted can have an impact:


hw.re.msi_disable 1
hw.re.msix_disable 1

They seem to have taken effect (no log message about MSIX anymore).

Let's see if the stability improves.

It did not.



June 06, 2019, 06:29:07 PM #23 Last Edit: June 06, 2019, 07:29:15 PM by schnipp
Today I updated my OPNsense from version 19.1.7 to 19.1.9. After trying to re-establish the PPPoE connection (testing the PPPoE reconnect bug patch), my system also crashed.

But this only occurred when using the "disconnect/connect" buttons in the web GUI, not when re-establishing the PPPoE connection from the system console by sending the appropriate signal to the mpd5 daemon process. It looks like a kernel bug, either when removing a lock from the file descriptor of a closing socket or somewhere else in the call stack:

db:0:kdb.enter.default>  bt
Tracing pid 52650 tid 100123 td 0xfffff80012870000
sbcut_internal() at sbcut_internal+0x40/frame 0xfffffe023697a710
sbdestroy() at sbdestroy+0x28/frame 0xfffffe023697a730
sofree() at sofree+0x123/frame 0xfffffe023697a760
soclose() at soclose+0x35a/frame 0xfffffe023697a7b0
closef() at closef+0x251/frame 0xfffffe023697a840
closefp() at closefp+0x99/frame 0xfffffe023697a880
amd64_syscall() at amd64_syscall+0xa38/frame 0xfffffe023697a9b0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe023697a9b0


@cpw: what occurs when calling "kill -s USR2 <pid>" and subsequently "kill -s USR1 <pid>" on the console?
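
The console test above can be wrapped in a small sketch. It assumes mpd5's usual signal handling (USR2 closes the link, USR1 opens it again), which matches the order given above; the function name `bounce_mpd5`, the `DRY_RUN` switch and the delay argument are hypothetical and only there for illustration.

```shell
#!/bin/sh
# Hypothetical helper: bounce the PPPoE link by signalling mpd5 directly,
# bypassing the web GUI buttons.
#   $1 = pid of the mpd5 process
#   $2 = seconds to wait between close and open (default 5)
bounce_mpd5() {
    pid="$1"
    delay="${2:-5}"
    if [ -z "$pid" ]; then
        echo "usage: bounce_mpd5 <pid> [delay]" >&2
        return 2
    fi
    # With DRY_RUN=1 the commands are printed instead of executed.
    run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi
    $run kill -s USR2 "$pid" || return 1   # ask mpd5 to close the link
    $run sleep "$delay"
    $run kill -s USR1 "$pid"               # ask mpd5 to open the link again
}

# Usage (assumes a single mpd5 instance; pgrep -o picks the oldest match):
#   bounce_mpd5 "$(pgrep -o mpd5)"
```

If the box survives this but still crashes from the web GUI buttons, that would point at whatever the GUI does in addition to signalling the daemon.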

My kernel crash has always been around the rename of the temporary PPPoE interface to <pppoe0>.

The connection itself seems to work fine, but the Realtek driver falls over when that rename happens, and only after the first time (otherwise, trivially, it would never have worked at all).

It's clearly an invalid pointer somewhere in the interface driver logic, though either the PPP daemon or the driver could be causing it.

I'll try 19.1.9 now, see what happens.

Update: completely new hardware, using Intel NICs. Fundamentally the exact same crash happened last night. The PPPoE connection dropped and the box crashed during the reconnect attempt. Guess we can rule out the Realtek NICs. I wonder if Amazon will refund me my $$$


https://www.amazon.ca/gp/product/B074PK8ZVG

Happened again about 10 minutes ago. PPPoE wobbled, box crashed.

Hypothesis: the problem seems to happen because the system tries to route a packet to the now-dead PPPoE interface and crashes with a kernel segfault. Is it possible that the routing system can't cope with a dead PPPoE interface?

Latest crash report attached from a few minutes ago. Same Trap 12 error. Note: the PPPoE interface is atop igb2 now. igb0 is my LAN (with VLANs) and igb3 is the direct cable connection.

This is really strange; this device is known to work perfectly for *sense. The next step would be to test against vanilla FreeBSD. But it may be worth exchanging it.

Clearly it's not. It's crashed again this afternoon. I think the problem lies in the PPPoE somewhere. The crash is associated with the PPPoE interface resetting, due to external factors (noise on the DSL line probably).

I have no idea how to troubleshoot this, and it's really frustrating. OPNsense would be pretty much spot on, were it not for the very poor reliability (I'm averaging about 2 days of uptime, though the outages cluster).

I'm curious what "vanilla BSD" would tell you? I mean, it wouldn't be a functional router in that state. But if you have a "livecd" I can run from USB stick, I'll happily give it a try, see if I can reproduce in that state (I'm pretty sure just pulling the network cable from my DSL modem will cause the problem).