[SOLVED][Fix included in 17.7.1] PPPOE Crash

Started by jwe, August 10, 2017, 02:13:23 AM

August 10, 2017, 02:13:23 AM Last Edit: August 28, 2017, 11:44:15 AM by jwe
UPDATE 28.08.2017:
The problem has been solved and the fix is included in 17.7.1.
The cause was multiple ACs (access concentrators) on the PPPoE line, which were not handled correctly when a session
was already open.
Thank you for fixing this issue :)



UPDATE NOT-SOLVED:
After changing the BIOS setting it ran fine until I tried to rename the WAN<->PPPoE interface,
which led to the same crashes as before.

UPDATE/RESOLVED/WTF:
I could fix the problem by disabling "Deep S5 State" in the BIOS.
Going to run the system from live USB now until I am 100% sure that this is all... grrr..WTF...



=====ORIGINAL POST FROM HERE====


I have now had a few crashes using 17.7 with a PPPoE connection.

Realtek network cards;
I tried both disabling and enabling hardware VLAN tagging.

PPPoE via VLAN => crashes instantly
PPPoE without VLAN (VLAN set via switch) => may crash after some time, or when I rename the interface from opt1 to anything else...

"Crash" means the system scrolls a few thousand lines on screen for about a minute, possibly creates a crash log, and reboots, again and again, until I remove the network cable from the PPPoE port.

I have already sent in some crash reports via the reporter in the web configurator.

If I can help with anything more, please let me know.

~jwe

I can confirm this error. It's the same as in https://forum.opnsense.org/index.php?topic=5650.0. Franco was trying to reproduce the problem.




Hi guys,

I haven't been able to reproduce, but I saw something in the logs that looks suspicious. How about this patch?

https://github.com/opnsense/core/commit/065244ed

Apply with:

opnsense-patch 065244ed

Apply again to revert.


Cheers,
Franco

August 12, 2017, 11:59:26 PM #3 Last Edit: August 13, 2017, 01:42:11 AM by jwe
That didn't resolve the problem.

Applied your patch, renamed the interface => crash,
it rebooted itself, crashed again... then rebooted and showed:

Launching the init system...done.
Initalizing...
Warning: require_once(config.inc): failed to open stream: No such directory
***snip***
login: root
Login incorrect
login:


After that, I did a fresh install of 17.7 from USB.
Set up LAN, set up PPPoE (no VLAN or anything, just PPPoE on re0),
assigned the PPPoE interface,
enabled the interface => it gets an IP from DSL, then instantly crashes.
boot...crash...boot...config.inc error...

So for now I am going to use 17.1.

I will add some screenshots as soon as I can.


EDIT:
Here are some photos from the crashes:


As you can see, the crash comes instantly after the PPPoE login (which is successful, getting an IP address).

I tried to reproduce the problem on some Hyper-V VMs, but couldn't.

So the problem must have something to do with the hardware.
Like the guy in the other post, I also have a J1900 mainboard from ASRock.

Maybe this is a hint.

If there is any way to get you more details to help solve the problem, please tell me :)
For now I can't use 17.7... :(

I can also reproduce it this way:

I have a working 17.1 setup with working PPPoE.
When booting from a USB stick with 17.7 (VGA) and importing the configuration, it boots up and crashes.

What I can see in dmesg (it holds the log from 17.7 and the current 17.1 one)
is that it ends with

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x100
fault code      = supervisor read data, page not present
instruction pointer   = 0x20:0xffffffff8244b3ee
stack pointer           = 0x28:0xfffffe01de78c790
frame pointer           = 0x28:0xfffffe01de78c820
code segment      = base 0x0, limit 0xfffff, type 0x1b
         = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags   = interrupt enabled, resume, IOPL = 0
current process      = 12 (swi5: fast taskq)
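
For anyone triaging a similar dump, here is a minimal Python sketch that pulls the "key = value" fields out of the trap message above. The helper names (`parse_panic`, `looks_like_null_deref`) and the one-page threshold of 0x1000 are assumptions made for illustration, not part of any OPNsense or FreeBSD tool. A very small fault address such as 0x100 commonly suggests a NULL pointer plus a structure offset, but that is only a heuristic and no substitute for a "bt" backtrace.

```python
import re

# Excerpt of the trap message posted above.
PANIC_TEXT = """\
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x100
fault code      = supervisor read data, page not present
instruction pointer   = 0x20:0xffffffff8244b3ee
current process      = 12 (swi5: fast taskq)
"""


def parse_panic(text):
    """Collect 'key = value' lines from a trap message into a dict."""
    fields = {}
    for line in text.splitlines():
        # Key: lowercase words separated by spaces; value: rest of the line.
        match = re.match(r"^\s*([a-z][a-z ]*?)\s*=\s*(.+)$", line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def looks_like_null_deref(fields, limit=0x1000):
    """Heuristic (assumption): an address inside the first page hints at NULL + offset."""
    address = int(fields["fault virtual address"], 16)
    return address < limit


if __name__ == "__main__":
    parsed = parse_panic(PANIC_TEXT)
    print(parsed["current process"])
    print(looks_like_null_deref(parsed))
```

The instruction pointer field is kept as a raw string because resolving it to a function name needs the matching kernel symbols.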


The only thing I have seen in the changelog is some change in the Realtek drivers. Maybe this is the problem?

Any ideas?

After doing some old-school trial and error with my BIOS settings I found out that
disabling "Deep S5" in the BIOS solves the problem.

Dunno how, as I don't understand what the RTC thingy has to do with my problem. Whatever.
If I can help further analyze the root of the problem, I sure will.

For now I am happy that it works.

I will post here again once it has run for about 24 hours from the live USB with the config imported from 17.1.
If that runs well, I will try to update the installed 17.1 to 17.7.

~still WTF?!

Hi jwe,

Glad this helps, but I think there is more to do here. Problems that magically disappear tend to reappear. :/

The Realtek drivers didn't change from the 17.1.4 images till now. I think this is a dormant bug in the operating system that we trigger with our modified interface configuration code.

There is one patch that adds a new feature to PPPoE that could be a candidate, but that seems unlikely to be the problem.

There is one issue in the boot screenshots you made where a file is missing. This is already due to file system corruption caused by a panic, which is essentially like pulling the power plug: the file system can't keep a consistent state.

There are more ways to debug this, but it's really difficult to do this remotely.

One can "unscript" the crash handling so that the console prompt is able to execute commands; the "bt" command is usually the most helpful.

# ddb unscript kdb.enter.default

(cause crash)

Type "bt" and hit enter at the crash dump prompt.

We also have debug kernel support now to enrich the crash dump, which will be supported once 17.7.1 is out (the updater needs a bit of extra code).

This panic is not reproducible so far for us. We can always build test images to give "ready to use" system state to test patches or inspect the panic more closely, and we're evaluating patches that could have caused this. So far there is one likely candidate, but that didn't seem to help.

The real question is how much time you would be willing to invest testing a couple of images that we prepare to pin down the issue to a component (kernel or interface configuration code).


Thanks,
Franco


Hi Franco,

you are right,
the problem came back when I tried to rename the WAN interface that is mapped to the PPPoE.

I really want to help you (and me...) to solve the problem.

I have removed the USB boot stick and am now running 17.1.11 from the installed HDD without any problem.

If you can send me a step-by-step guide on what to do, I can invest some hours into it for sure.

I imagine, for example, you give me a USB image to run and I send you back the output (stored on the installed SSD or something?) or screenshots.

Whatever you need.

We could also do a Skype call (German is my native language).

I could be your remote hands; we just need to agree on a time frame (weekdays after 19:00 GMT+1, or on the weekend).


Yay, also German... I'll prepare two USB images till Friday to try (VGA/amd64?) and send a PM for when we could have a call if needed.


Thank you,
Franco

Quote from: franco on August 16, 2017, 02:26:02 PM
Yay, also German... I'll prepare two USB images till Friday to try (VGA/amd64?) and send a PM for when we could have a call if needed.


Thank you,
Franco

Sounds good.

I am using the vga/x64 image (via USB).
The mainboard is an ASRock Q1900M (http://www.asrock.com/mb/Intel/Q1900M/index.de.asp)
with two additional dual Realtek NICs
(that makes 5 Realtek NICs including the one on the mainboard).

Happy to test these images this coming weekend :)

August 16, 2017, 11:42:56 PM #11 Last Edit: August 16, 2017, 11:47:21 PM by odites999
You can count on me to test the images. I'm also using vga/x64 via USB and my mainboard is an ASRock Q1900DC-ITX (http://www.asrock.com/mb/intel/q1900dc-itx/). I'll also try to test the Deep S5 solution as soon as I can.

Sorry, my native language is not German... but Spanish.


Regards,

I find this peculiar... two Asrock Q1900 boards... Do you have the latest BIOS?

I'll have an image ready in a few minutes...

But my Spanish is really rusty, lo siento. :D


Cheers,
Franco

Ok, here we go:

https://pkg.opnsense.org/snapshots/OPNsense-17.7-test1-OpenSSL-vga-amd64.img.bz2

This image is based on multiple fixes for the upcoming 17.7.1. If it should panic, you can type "bt" and send a screenshot.

https://pkg.opnsense.org/snapshots/OPNsense-17.7-test2-OpenSSL-vga-amd64.img.bz2

This second image is based on the same fixes, but with the last 17.1 kernel to verify that the kernel is indeed okay.  If it should panic, you can type "bt" and send a screenshot.
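
For testers who prefer not to rely on a separate unpacking tool, the downloaded .img.bz2 files can be decompressed with a few lines of Python. This is only a sketch: the function name `decompress_image` and the chunk size are assumptions for illustration, and writing the resulting .img to a USB stick is a separate step that is not covered here.

```python
import bz2
import shutil


def decompress_image(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 file to dst_path without loading it all into memory."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # copyfileobj reads and writes in chunks of chunk_size bytes.
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_path
```

Usage would look like `decompress_image("OPNsense-17.7-test1-OpenSSL-vga-amd64.img.bz2", "test1.img")`, assuming the download sits in the current directory; the output name `test1.img` is arbitrary.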


Thanks in advance,
Franco

I have the latest BIOS for my motherboard, according to Asrock (1.60).


Regards,