CRASH in 19.1.8 when PPPOE refreshes

Started by cpw, May 23, 2019, 06:46:28 AM

Hi

So, I've been struggling to get my system stable - I have jury-rigged some crazy cooling to try and regulate temperature, and cannibalized an old laptop for memory sticks, but the system still crashes regularly.

Coincidentally, it always crashes right after the PPPOE interface is rebuilt (due to an ISP dropped line glitch perhaps?). The error is exactly as described in this forum thread: https://forum.opnsense.org/index.php?topic=5697.0 from a couple of years ago, and "closed with solution".

Is there a weird fundamental incompatibility with some hardware that is triggered by the PPPOE activity? That's pretty :o

Anyway, I don't know if I should file a bug, or where I should file such a bug. I've clicked the submit report button a few times to send you the full details.

One thing I noticed - it doesn't always crash the box. One time, the box got stuck in a weird state where it had no network connectivity at all, but didn't reboot or do anything. It just sat. Very odd.

I can share precise hardware details if you need them. This hardware is the ultimate botch job, but is a proof of concept before I invest real money in quality hardware.

What happens when you fix the cooling with a ventilator in front?

No change. I have two big fans; according to cputemp stats it never runs more than 10C above ambient now. It's crashed twice since, both times when pppoe refreshed.

Update: it crashed again today, as soon as the PPPOE connection reset. The correlation is exact and causal. It caused a 5-minute outage while OPNsense rebooted. This seems to be a fairly critical flaw - I would never expect a simple activity like bouncing a PPPOE interface to cause a complete fatal crash of the OS layer.

Sadly, the crash also causes corruption: all the RRD reports (netflow, health, netdata) appear to have lost their data. It also appears netflow hasn't properly restarted after the crash.

Update update: another DSL connection wobble, another 5-minute reboot of the OPNsense firewall. Is there any idea how we could fix this? I'm happy to help diagnose; it seems that all I need to do is disconnect the DSL modem temporarily. It is extremely frustrating to have repeated outages on something that's supposed to be extremely reliable, and was, until I chose OPNsense.

What does the console say? Any stack traces? What hardware is this? I ran OPNsense on so many devices, never had any issue like this.

June 04, 2019, 10:13:13 PM #6 Last Edit: June 04, 2019, 10:35:48 PM by cpw
Yup. There is always a core message, before the firewall completely restarts. I can't seem to find it in any of the log files anymore.

The hardware is "Intel(R) Atom(TM) CPU D2700 @ 2.13GHz (4 cores)". It has 8GB of memory (barely using 1G according to the dashboard).

It's an old Zotac mini PC. I've equipped it with two additional ethernet devices via a mini PCIe card. As I say, it seems to work 100% reliably, except when the PPPOE disconnects/reconnects. I have a multiwan setup, with both DSL/PPPOE and non-DSL/DHCP services upstream (each on a physical NIC) as well as a segmented LAN with 6 VLANs.

Everything works fine, except when the PPPOE restarts - due to dropped connection upstream, I believe - it's very windy today and the phone line is a little weather vulnerable.

I'll get the error report next time it crashes or trigger one manually shortly, so I can give you the exact kernel core dump message.

Note: the error is the exact same kernel crash as identified in the thread I linked at the top. Nothing in the thread has helped, however.



Can you capture the screen and make a video? You need to find the reason. My guess is that the PCI card causes BSD to crash. Then we can maybe search for tunings. If it doesn't work out at all, I see 3 options:
Replace the hardware
Replace OPNsense
Let the modem do the dial-in and put OPNsense behind it

Here's the crashdump from the last crash. It looks like it happened, again, last night.

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address = 0x188
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80d253ac
stack pointer         = 0x28:0xfffffe022f196940
frame pointer         = 0x28:0xfffffe022f196990
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (swi5: fast taskq)
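
To compare this trap with the one in the older thread, the key fields can be pulled out of the panic text mechanically. The following is a minimal Python sketch, not a definitive tool. The function names `parse_trap` and `looks_like_null_deref` are made up for illustration. Treating a fault address inside the first page as a probable NULL pointer plus a small field offset is a common heuristic, not a confirmed diagnosis of this crash.

```python
import re


def parse_trap(text):
    """Collect the 'key = value' lines of a FreeBSD fatal trap message into a dict.

    Lines without a lowercase key before the '=' (the 'Fatal trap 12: ...' header,
    or continuation lines that start with '=') are skipped.
    """
    fields = {}
    for line in text.splitlines():
        m = re.match(r"^\s*([a-z ]+?)\s*=\s*(.+?)\s*$", line)
        if m:
            fields[m.group(1)] = m.group(2)
    return fields


def looks_like_null_deref(fields, limit=0x1000):
    """Heuristic only: True if the fault address lies within the first page."""
    addr = int(fields["fault virtual address"], 16)
    return addr < limit


if __name__ == "__main__":
    sample = """Fatal trap 12: page fault while in kernel mode
fault virtual address = 0x188
fault code = supervisor read data, page not present
current process = 12 (swi5: fast taskq)
"""
    info = parse_trap(sample)
    print(info["current process"], looks_like_null_deref(info))
```

Running the same parser over the trap text from both threads makes it easy to check whether the fault address and the faulting process match.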


What would you like me to video? It crashing and rebooting? There's not much to show, it just flashes the error and then goes to the boot screen.

Why would a particular PCI card cause a complete hard crash? It seems it's specifically related to something inside the kernel, that only happens when the PPPOE restarts. The uptime was nearly 2 weeks, when the PPPOE recycled and took it out yesterday. There is a definite cause/effect relationship here, and it's not hardware, that I can tell?

Replacing hardware isn't an option right now, because $$$
Replacing software: I'm reluctantly looking for alternatives at present.
Modem handoff: I've looked into it. It looks like if I do that, I have to deal with 99 other flavours of awful that my modem "provides" as well.

No, it's a relation between (3 of 5):

- FreeBSD
- PPPoE software (mpd5)
- Hardware
- Maybe ISP sending strange packets

There are thousands of PPPoE installations out there without this problem.

Now it's time to exclude one after another to find the problematic combination.

The easiest one is installing pfSense, if this also crashes and it doesn't happen with e.g IPFire, it's FreeBSD
Next one is using different hardware, but OPN and your provider, perhaps you can borrow some piece of hardware and test. If it happens, it's not your hardware, if not, it has something to do with your hardware.

And so on ..
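
The exclusion approach above can be sketched as a simple intersection over test runs: whatever is present in every crashing setup stays a suspect. This is only an illustration in Python. The function name `common_to_crashes` and all trial data are hypothetical placeholders, not results from this thread.

```python
def common_to_crashes(trials):
    """Return the components present in every crashing trial.

    trials: iterable of (components, crashed) pairs, where components is a
    collection of labels and crashed is True if that setup crashed on a
    PPPoE reconnect.
    """
    crashing = [set(parts) for parts, crashed in trials if crashed]
    if not crashing:
        return set()
    return set.intersection(*crashing)


if __name__ == "__main__":
    # Hypothetical outcomes, purely to show the bookkeeping.
    trials = [
        ({"FreeBSD", "mpd5", "this hardware", "this ISP"}, True),
        ({"FreeBSD", "mpd5", "borrowed hardware", "this ISP"}, True),
        ({"Linux", "this hardware", "this ISP"}, False),
    ]
    print(sorted(common_to_crashes(trials)))
```

Since the problem may be a combination of factors, a non-crashing run does not clear its components by itself; it only narrows down which combinations are still worth testing.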

I run pppoe with opnsense. It happened that I wanted to change my public IP, clicked the "reload" button on the WAN interfaces overview page, and all of a sudden it crashed immediately. Running 19.1.8.

Quote from: keropiko on June 05, 2019, 02:13:22 PM
I run pppoe with opnsense. It happened that I wanted to change my public IP, clicked the "reload" button on the WAN interfaces overview page, and all of a sudden it crashed immediately. Running 19.1.8.

Then you should open a new thread with: exact hardware, whether it happens on every reload, system.log / ppps.log, and the stack trace from the crash.

Quote from: mimugmail on June 05, 2019, 02:12:15 PM
No, it's a relation between (3 of 5):

- FreeBSD
- PPPoE software (mpd5)
- Hardware
- Maybe ISP sending strange packets

There are thousands of PPPoE installations out there without this problem.

Agree that this is unusual. I don't believe it's unique, however. There are clear reports of others with this issue, spanning several years.

Quote from: mimugmail on June 05, 2019, 02:12:15 PM
Now it's time to exclude one after another to find the problematic combination.

The easiest one is installing pfSense, if this also crashes and it doesn't happen with e.g IPFire, it's FreeBSD

OK. I can probably pull that off fairly easily, assuming I can recover OPNsense configuration.

I do know that the previous Linux setup handling PPPOE (not the same hardware) never experienced this issue in several years of running.

I believe that rules out the ISP being weird?

Quote from: mimugmail on June 05, 2019, 02:12:15 PM

Next one is using different hardware, but OPN and your provider, perhaps you can borrow some piece of hardware and test. If it happens, it's not your hardware, if not, it has something to do with your hardware.

And so on ..

Are hardware compatibility problems like this prevalent in the BSD community? I mean, I'm not a fan of spending many many days troubleshooting an issue to find that I have to invest hundreds (or thousands) of dollars in a hoped-for resolution. I'm a Linux guy, and I've not seen behaviour like this since the really early days of Linux (like 1995-8 or so).

- When it spans over several years, what does that mean? Only a few cases in several years, or it still wasn't fixed by upstream? Or provider-related? Honestly, I'm unsure.

- Just install pfSense, put in user/pw and you're good. There's no real config import, so it costs time for sure. The fact that Linux on different hardware works doesn't rule out the provider, since it must be a combination of factors. You can also give IPFire a shot, it's not that hard to set up.

- I'm also not a fan of testing things for days/ages .. but when all ppl think this way the problem will not be solved (and it seems no one has dug deeper successfully). So there's also no reason for posts like "it's now on 19.1.8 and still crashes". Only the ppl affected can help to investigate since it's too hard to guess from remote, sorry.

Quote from: mimugmail on June 05, 2019, 05:05:35 PM
- When it spans over several years, what does that mean? Only a few cases in several years, or it still wasn't fixed by upstream? Or provider-related? Honestly, I'm unsure.

I doubt it's provider related. I can't see much in common between a Canadian ISP and a German ISP (the previously linked issue seemed to be German?)

Quote from: mimugmail on June 05, 2019, 05:05:35 PM
- Just install pfSense, put in user/pw and you're good. There's no real config import, so it costs time for sure. The fact that Linux on different hardware works doesn't rule out the provider, since it must be a combination of factors. You can also give IPFire a shot, it's not that hard to set up.

I'll try running them from a USB stick on the same hardware. That way I can do the test without wiping out OPNsense, hopefully.

Quote from: mimugmail on June 05, 2019, 05:05:35 PM
- I'm also not a fan of testing things for days/ages .. but when all ppl think this way the problem will not be solved (and it seems no one has dug deeper successfully). So there's also no reason for posts like "it's now on 19.1.8 and still crashes". Only the ppl affected can help to investigate since it's too hard to guess from remote, sorry.

That's why I'm still here. I'm not going to walk away immediately. But I don't have $ to invest in potential solutions right now, and I can't tolerate week-long investigation outages. Point tests, though, are quite doable.

I've seen various reports over the years of this issue happening, and people seem to have narrowed it down in the past to the 'interface rename' that occurs at the kernel level. I can see from the logs I have collected so far, that that seems to correspond to my issue as well. I suspect that "hardware" is a factor only insofar as speed of said hardware is a factor. This isn't a speedy computer. Hardware slowness often reveals hidden race conditions.