CRASH in 19.1.8 when PPPOE refreshes

franco · June 05, 2019, 09:42:49 PM

I'll chime in here as well. No developer has a PPPoE connection, so when somebody tries to help, it's just me doing remote debugging sessions, IRC talks, GitHub discussions, and auditing and improving the interface code bottom-up.

It's a problem for PPPoE for sure. There are better options out there for sure.

What is needed is just one person to step up and fix this for everyone. :/

Cheers,
Franco

cpw · June 05, 2019, 10:00:27 PM

I'm happy to help. Suggestions for how to diagnose would be greatly appreciated. I can supply whatever crash reports I get (I've been submitting them regularly through the reporter tool). Is there anything I can turn on to get a better picture of the situation, such as debugging flags?

mimugmail · June 05, 2019, 10:18:49 PM

Is this a Realtek NIC?
https://github.com/opnsense/core/issues/3227

cpw · June 06, 2019, 01:25:15 AM

Looks like it.

re0@pci0:1:0:0: class=0x020000 card=0x012310ec chip=0x816810ec rev=0x06 hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet
re1@pci0:6:0:0: class=0x020000 card=0x012310ec chip=0x816810ec rev=0x07 hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet
re2@pci0:7:0:0: class=0x020000 card=0x012310ec chip=0x816810ec rev=0x07 hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet

That's all 3 NICs in this box. I don't have the luxury of new hardware right now. Do I just put up with frequent outages?

cpw · June 06, 2019, 03:14:03 AM

I've added a couple of tunables that others have noted can have an impact:

hw.re.msi_disable 1
hw.re.msix_disable 1

They seem to have taken effect (no log message about MSIX anymore).

Let's see if the stability improves.
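
For comparison, on a plain FreeBSD install the same re(4) tunables would typically be set in the loader configuration, since they are read when the driver attaches at boot. This is only a sketch: the file path is an assumption (OPNsense normally manages tunables through its GUI), and the values simply mirror the ones listed above.

```
# /boot/loader.conf.local (assumed location for local boot-time tunables)
# Disable MSI and MSI-X interrupts for the Realtek re(4) driver.
hw.re.msi_disable="1"
hw.re.msix_disable="1"
```

After a reboot, running kenv hw.re.msi_disable on the console should print the configured value if the loader picked it up.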

cpw · June 06, 2019, 02:52:54 PM

Quote from: cpw on June 06, 2019, 03:14:03 AM
I've added a couple of tunables that others have noted can have an impact:

hw.re.msi_disable 1
hw.re.msix_disable 1

They seem to have taken effect (no log message about MSIX anymore).

Let's see if the stability improves.

It did not.

mimugmail · June 06, 2019, 03:23:21 PM

Do you run IDS/IPS or Shaper?

https://github.com/opnsense/core/issues/1481

cpw · June 06, 2019, 04:27:42 PM

No, I am not.

schnipp · June 06, 2019, 06:29:07 PM

Today I updated my OPNsense from version 19.1.7 to 19.1.9. After trying to re-establish the PPPoE connection (to test the PPPoE reconnect bug patch), my system also crashed.

But this only occurred when using the "disconnect/connect" buttons in the web GUI, not when re-establishing the PPPoE connection from the system console by sending the appropriate signal to the mpd5 daemon process. It looks like a kernel bug when removing a lock from the file descriptor of a closing socket, or a bug somewhere in the call stack:

db:0:kdb.enter.default> bt
Tracing pid 52650 tid 100123 td 0xfffff80012870000
sbcut_internal() at sbcut_internal+0x40/frame 0xfffffe023697a710
sbdestroy() at sbdestroy+0x28/frame 0xfffffe023697a730
sofree() at sofree+0x123/frame 0xfffffe023697a760
soclose() at soclose+0x35a/frame 0xfffffe023697a7b0
closef() at closef+0x251/frame 0xfffffe023697a840
closefp() at closefp+0x99/frame 0xfffffe023697a880
amd64_syscall() at amd64_syscall+0xa38/frame 0xfffffe023697a9b0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe023697a9b0

@cpw: what occurs when calling "kill -s USR2 <pid>" and subsequently "kill -s USR1 <pid>" on the console?
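
The console sequence asked about above can be wrapped in a small helper for repeated testing. This is only a sketch: it assumes mpd5 closes the link on SIGUSR2 and reopens it on SIGUSR1, as the commands above imply. The names pppoe_cycle and KILL_CMD are hypothetical and introduced here for illustration.

```sh
#!/bin/sh
# Cycle a PPPoE link by signalling mpd5 directly, bypassing the web GUI.
# Assumption: mpd5 treats SIGUSR2 as "close link" and SIGUSR1 as "open link".

# KILL_CMD is a hypothetical override so the sequence can be dry-run
# with a stand-in command instead of the real kill.
KILL_CMD="${KILL_CMD:-kill}"

# pppoe_cycle <pid> [delay_seconds]
# Sends USR2, waits, then sends USR1 to the given process.
pppoe_cycle() {
    pid="$1"
    delay="${2:-5}"
    [ -n "$pid" ] || { echo "usage: pppoe_cycle <pid> [delay]" >&2; return 1; }
    $KILL_CMD -s USR2 "$pid" || return 1   # take the link down
    sleep "$delay"                         # give the session time to close
    $KILL_CMD -s USR1 "$pid"               # bring the link back up
}
```

A possible invocation would be pppoe_cycle "$(pgrep -x mpd5)" 10, assuming exactly one mpd5 process is running. Whether the box survives the cycle from the console but not from the web GUI is the data point of interest.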

cpw · June 07, 2019, 03:43:36 PM

My kernel crash has always been around the rename of the temporary PPPoE interface to <pppoe0>.

The connection itself seems to work fine, but the Realtek driver falls over when that happens, though only after the first time (otherwise, trivially, it would never have worked at all).

It's clearly an invalid pointer somewhere in the interface driver logic, but either the PPP daemon or the driver could be causing it.

I'll try 19.1.9 now, see what happens.

cpw · June 20, 2019, 02:07:00 PM

Update: completely new hardware, using Intel NICs. Fundamentally the same crash happened last night: the PPPoE connection dropped and the box crashed during the reconnect attempt. I guess we can rule out the Realtek NICs. I wonder if Amazon will refund me my $$$.

mimugmail · June 20, 2019, 02:15:12 PM

Link to the Amazon product, please.

cpw · June 20, 2019, 03:29:14 PM

https://www.amazon.ca/gp/product/B074PK8ZVG

Happened again about 10 minutes ago. PPPoE wobbled, box crashed.

Hypothesis: the crash seems to happen because the system tries to route a packet to the now-dead PPPoE interface and hits a kernel segfault. Is it possible that the routing system can't cope with a dead PPPoE interface?

Latest crash report attached from a few minutes ago. Same Trap 12 error. Note: the PPPoE interface sits on top of igb2 now. igb0 is my LAN (with VLANs) and igb3 is the direct cable connection.

mimugmail · June 20, 2019, 08:56:56 PM

This is really strange; this device is known to work perfectly with *sense. The next step would be to test against vanilla FreeBSD, but it may be worth returning it.

cpw · June 20, 2019, 09:15:57 PM

Clearly it's not. It's crashed again this afternoon. I think the problem lies in the PPPoE somewhere. The crash is associated with the PPPoE interface resetting, due to external factors (noise on the DSL line probably).

I have no idea how to troubleshoot this, and it's really frustrating. OPNsense would be pretty much spot on, were it not for the very poor reliability (I'm averaging about 2 days of uptime, though the outages cluster).

I'm curious what "vanilla BSD" would tell you? I mean, it wouldn't be a functional router in that state. But if you have a "livecd" I can run from USB stick, I'll happily give it a try, see if I can reproduce in that state (I'm pretty sure just pulling the network cable from my DSL modem will cause the problem).
