OPNsense Forum

Archive => 17.7 Legacy Series => Topic started by: jwe on August 10, 2017, 02:13:23 am

Title: [SOLVED][Fix included in 17.7.1] PPPOE Crash
Post by: jwe on August 10, 2017, 02:13:23 am
UPDATE 28.08.2017:
The problem has been solved and the fix is implemented in 17.7.1.
The cause was multiple ACs on the PPPoE line, which were not handled correctly when a session
was already open.
Thank you for fixing this issue :)



UPDATE NOT-SOLVED:
After changing the BIOS setting it ran fine until i tried to rename the WAN<->PPPoE interface,
which led to the same crashing as before.

UPDATE/RESOLVED/WTF:
I could fix the problem by disabling "Deep S5 State" in BIOS
Going to run the System from live-usb now until i am 100% sure that this is all... grrr..WTF...



=====ORIGINAL POST FROM HERE====


I have now had a few crashes using 17.7 with a PPPoE connection.

Realtek network cards;
tried disabling and enabling hardware VLAN tagging.

PPPoE via VLAN => instantly crashing
PPPoE without VLAN (VLAN set via switch) => possibly crashing after some time, or when i rename the interface from opt1 to anything...

A crash means the system shows a few thousand lines scrolling on screen for about a minute, possibly creates a crash log, and reboots, again and again until i remove the network cable from the PPPoE port.

I already sent in some crash reports via the reporter in the webconfigurator.

If i can help with anything more, please let me know.

~jwe
Title: Re: PPPOE Crash
Post by: odites999 on August 10, 2017, 07:45:41 am
I confirm this error. It's the same as in https://forum.opnsense.org/index.php?topic=5650.0. Franco was trying to reproduce the problem.



Title: Re: PPPOE Crash
Post by: franco on August 10, 2017, 03:48:59 pm
Hi guys,

I haven't been able to reproduce, but I saw something in the logs that looks suspicious. How about this patch?

https://github.com/opnsense/core/commit/065244ed

Apply with:

opnsense-patch 065244ed

Apply again to revert.


Cheers,
Franco
Title: Re: PPPOE Crash
Post by: jwe on August 12, 2017, 11:59:26 pm
Didn't resolve the problem.

Applied your patch, renamed the interface => crash,
rebooted itself, crashed again... then rebooted and showed:

Code:
Launching the init system...done.
Initalizing...
Warning: require_once(config.inc): failed to open stream: No such directory
***snip***
login: root
Login incorrect
login:

After that, i reinstalled 17.7 freshly from usb.
Setup lan, setup pppoe (no vlan or so, just pppoe on re0)
assigned the pppoe
enable interface=>gets ip from dsl then instantly crashes.
boot...crash...boot...config.inc error...



so for now i am going to use the 17.1.

I will add some screenshots as soon as i can.


EDIT:
Here are some photos from the crashes:
https://ibb.co/mpTqKv
https://ibb.co/cCi1Ra
https://ibb.co/bD4VKv
https://ibb.co/mGTCXF
https://ibb.co/nyDqKv
https://ibb.co/eG1MRa
https://ibb.co/cbJ1Ra
https://ibb.co/k8x6sF

As you can see, the crash comes instantly after the PPPoE login (which is successful, getting an IP address).
Title: Re: PPPOE Crash
Post by: jwe on August 13, 2017, 05:45:01 pm
Tried to reproduce the problem on some hyper-v VMs, but can't.

So the problem must be something with the hardware.
As the guy in the other post said, i also have a J1900 mainboard from asrock.

Maybe this can be a hint for the problem.

If there is any way to get you more details to help solving the problem, please tell me :)
As for now i can't use 17.7... :(
Title: Re: PPPOE Crash
Post by: jwe on August 15, 2017, 09:48:26 pm
I can also reproduce it this way:

I have a working 17.1 Setup with working pppoe
When booting from a USB stick with 17.7 (VGA) and importing the configuration, it boots up and crashes.

What i can see in dmesg(it holds the log from 17.7 and the current 17.1 one)
is that it is ending with

Quote
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x100
fault code      = supervisor read data, page not present
instruction pointer   = 0x20:0xffffffff8244b3ee
stack pointer           = 0x28:0xfffffe01de78c790
frame pointer           = 0x28:0xfffffe01de78c820
code segment      = base 0x0, limit 0xfffff, type 0x1b
         = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags   = interrupt enabled, resume, IOPL = 0
current process      = 12 (swi5: fast taskq)


The only thing i have seen in the changelog is some change in the realtek drivers. Maybe this is the Problem?

Any ideas?
Title: Re: [Resolved,WTF!?] PPPOE Crash
Post by: jwe on August 15, 2017, 10:56:09 pm
After doing some old-school trial-and-error with my bios settings i found out that:
disabling "Deep S5" in BIOS solves the Problem.

Dunno how, as i don't understand what the RTC thingy has to do with my problem. Whatever.
If i can help further analyze the root of the problem, i sure will.

For now i am happy that it works.

I will post here again when it has run for about 24 hours from the live USB with the config imported from 17.1.
If this is running good, i will try to update the installed 17.1 to 17.7.

~still WTF?!
Title: Re: [SOLVED,WTF!?] PPPOE Crash
Post by: franco on August 16, 2017, 07:40:50 am
Hi jwe,

Glad this helps, but I think there is more to do here. Problems that magically disappear tend to reappear. :/

The Realtek drivers didn't change from the 17.1.4 images till now. I think this is a dormant bug in the operating system that we trigger with our modified interface configuration code.

There is one patch that adds a new feature to PPPoE that could be a candidate, but that seems unlikely to be the problem.

There is one issue in the boot screenshots you made where a file is missing, this is already due to corruption in the file system caused by a panic, which is essentially like pulling the power plug and the file system can't keep its consistent state.

There are more ways to debug this, but it's really difficult to do this remotely.

One can "unscript" the crash handling, so the console prompt will be able to execute commands, the "bt" command is usually the most helpful.

# ddb unscript kdb.enter.default

(cause crash)

Type "bt" and hit enter at the crash dump prompt.

We also have debug kernel support now to enrich the crash dump, which is supported when 17.7.1 is out (the updater needs a bit of extra code).

This panic is not reproducible so far for us. We can always build test images to give a "ready to use" system state to test patches or inspect the panic more closely, and we're evaluating patches that could have caused this. So far there is one likely candidate, but reverting it didn't seem to help.

The real question is how much time would you be willing to invest testing a couple of images that we prepare to pin down the issue to a component (kernel or interface configuration code)?


Thanks,
Franco

Title: Re: [SOLVED,WTF!?] PPPOE Crash
Post by: jwe on August 16, 2017, 02:19:23 pm
Hi Franco,

you are right,
the Problem came back when i tried to rename the WAN Interface that is mapped to the pppoe.

I really want to help you (and me...) to solve the problem.

i have removed the usb bootstick and i am running the 17.1.11 now from installed hdd without any problem.

If you can send me a step-by-step manual what i can do i can invest some hours into it for sure.

I imagine for example you give me an usb-image to run and send you back the output(stored on installed ssd or something?) or screenshots.

Whatever you need.

We could also start some skype call (german is my native language).

I could play your remotehands, we just need to get a timeframe(weekdays after 19:00 GMT+1) or on weekend.

Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: franco on August 16, 2017, 02:26:02 pm
Yay, also German... I'll prepare two USB images till Friday to try (VGA/amd64?) and send a PM for when we could have call if needed.


Thank you,
Franco
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: jwe on August 16, 2017, 02:29:11 pm
Quote from: franco
Yay, also German... I'll prepare two USB images till Friday to try (VGA/amd64?) and send a PM for when we could have call if needed.


Thank you,
Franco

Sounds good.

I am using the vga/x64 image (via usb).
Mainboard is Asrock Q1900M (http://www.asrock.com/mb/Intel/Q1900M/index.de.asp)
with two additional dual realtek nics
(that makes 5x Realtek nic, including the one on the mainboard).

Happy to test these images on upcoming weekend :)
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: odites999 on August 16, 2017, 11:42:56 pm
You can count on me to test the images. I'm also using vga/x64 via usb and my mainboard is Asrock Q1900DC-ITX (http://www.asrock.com/mb/intel/q1900dc-itx/). I'll also try to test the Deep S5 solution as soon as I can.

Sorry, my native language is not German... but Spanish.


Regards,
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: franco on August 17, 2017, 10:14:24 am
I find this peculiar... two Asrock Q1900 boards... Do you have the latest BIOS?

I'll have an image ready in a few minutes...

But my Spanish is really rusty, lo siento. :D


Cheers,
Franco
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: franco on August 17, 2017, 11:05:48 am
Ok, here we go:

https://pkg.opnsense.org/snapshots/OPNsense-17.7-test1-OpenSSL-vga-amd64.img.bz2

This image is based on multiple fixes for the upcoming 17.7.1. If it should panic, you can type "bt" and send a screenshot.

https://pkg.opnsense.org/snapshots/OPNsense-17.7-test2-OpenSSL-vga-amd64.img.bz2

This second image is based on the same fixes, but with the last 17.1 kernel to verify that the kernel is indeed okay.  If it should panic, you can type "bt" and send a screenshot.


Thanks in advance,
Franco
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: odites999 on August 17, 2017, 05:45:21 pm
I have the latest BIOS for my motherboard, according to Asrock (1.60).


Regards,
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: jwe on August 17, 2017, 09:52:59 pm
Here is the output from the test1-image, imported config from 17.1:
https://ibb.co/mRWKHF
https://ibb.co/iOxRxF

I also have the latest BIOS version.
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: odites999 on August 19, 2017, 12:17:56 pm
With the Test1 image, the system also crashes (same error as in the images posted by jwe).
With the Test2 image, the system does not crash.


Regards,
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: franco on August 20, 2017, 11:16:18 am
Hi guys,

Alright, that confirms it's not something we did in our code per se... It's from the import of the host-uniq patch from here:

https://reviews.freebsd.org/D9270

Can you tell me how your "service" field is filled out? If it is not filled out, what happens when you set a bogus string like "foo"?

We'd also be looking at a "|" symbol use or anything odd / non-standard about that.


Thanks so far,
Franco
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: odites999 on August 20, 2017, 04:50:45 pm
Hi franco,

My service name is (and has always been) empty. I've tried changing it to different lengths with no problem. The connection stays alive.


Regards,
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: jwe on August 20, 2017, 11:00:32 pm
Quote from: odites999
Hi franco,

My service name is (and has always been) empty. I've tried changing it to different lengths with no problem. The connection stays alive.


Regards,

I set it to "mnet" in the 17.1 install.
PPPoE then reconnects and is working as before.

But when booting from the 17.7 usb stick (Test Image 1) then i get the same error as in my screenshots after configuration import and getting the network devices ready.
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: odites999 on August 21, 2017, 07:22:01 am
Hi,

My test with service names has been with the Test2 image, the only one that does not crash in my case (empty service name).


Regards,
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: franco on August 21, 2017, 06:21:21 pm
Ok, so we need to roll back on that patch and start asking the authors for help in solving this mystery.

The older kernel will work for 17.7 if you can install it:

# opnsense-update -ikr 17.1.9 -n "17.1\/sets"
# /usr/local/etc/rc.reboot

The working image is also ok to install. The kernels are the same and upgrades will work as soon as 17.7.1 is out.


Cheers,
Franco
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: jwe on August 21, 2017, 10:00:07 pm
Quote from: franco
Ok, so we need to roll back on that patch and start asking the authors for help in solving this mystery.

The older kernel will work for 17.7 if you can install it:

# opnsense-update -ikr 17.1.9 -n "17.1\/sets"
# /usr/local/etc/rc.reboot

The working image is also ok to install. The kernels are the same and upgrades will work as soon as 17.7.1 is out.


Cheers,
Franco

Okay.. what are the next steps now?
Are you contacting the author?

Edit:
I could reproduce the problem on my main PC within a hyper-v VM running OPNsense 17.7.

I also captured the packets with wireshark, if that could help further identify the problem.
This makes the problem hardware-independent, as my main pc has an intel nic and a mainboard from gigabyte.

Edit2:

I did some research on the patch at https://reviews.freebsd.org/D9270 (as far as my knowledge goes...).

It seems like it not only added the host-uniq feature but also added some additional parsing for the PADM messages.

As far as i can see in ng_pppoe_rcvdata_ether(), the way a NULL PADI tag gets treated changed dramatically. Before, they called
Code:
CTR1(KTR_NET, "%20s: PADI w/o Service-Name",__func__);
LEAVE(ENETUNREACH);

but now they just do the newly introduced
Code:
if (tag==NULL)tag=&sntag;
which i would assume to be the problem.

Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: franco on August 22, 2017, 06:32:41 am
Good morning,

The PCAP of the trace would certainly help. Something went wrong in the ng_pppoe_rcvdata_ether() function, but from review it looked like there was nothing conceptually wrong with the &sntag change, which just wants to force a match of the service to the first service it finds. At least there was no obvious problem with it in the lines changed (that other reviewers could have caught as well).

You are right, it's not hardware dependent; that was ruled out with the test1/test2 images. It looks provider-specific and/or PPPoE server specific (setup).

We can try to revert parts of the patch to find the issue of course, I will involve the author from that review later today. :)


Cheers,
Franco
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: odites999 on August 23, 2017, 07:55:38 am
Hi!,

I have installed the test2 image (I made all the tests in "live mode") and, so far, everything is working OK.


Regards,
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: jwe on August 23, 2017, 11:24:46 pm
Ok,
i think i have identified the problem.

What i have seen in the packetcapture:
Router sends PADI (Initiation) | 585 in image
AC(1) sends PADO to Router |586
Router sends PADR to AC(1)|587
AC(1) send PADS to Router |588
=====Session is established, i also already have an ip via pppoe device=====
AC(2) sends PADO to router|589
=====Router crashes====


That's what i have seen in the wireshark log.

AC(1) and AC(2) are both sending PADOs, but the one from AC(2) is coming really late... as far as i could see, always after i have an established session with AC(1).

I can differentiate both ACs by their MAC address.

So i blocked the MAC-address from the second AC on my switch.

Et voilà, problem solved.

I have attached a screenshot of the relevant captures.

So, by the RFC, if there are multiple ACs the client should be able to switch between them.
Possible workarounds for opnsense are, in my view:
1. a diagnostic 'pppoe -A' which sends the PADI to the ether and list the possible AC's
2. a possibility to select the AC to use in the configurator (for pppoe its parameter -C)

both commands are from https://www.freebsd.org/cgi/man.cgi?query=pppoe

hope that helps.

(https://thumb.ibb.co/ergCwk/pcap_discover_problem.png) (https://ibb.co/ergCwk)


Edit:
I also noticed that AC(2) (Cisco whatever, by its MAC) has some more Vendor Specific PPPoE tags (Circuit ID and Remote ID; the Remote ID also contains my name in its value... 0o)
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: franco on August 24, 2017, 08:37:45 am
Hi jwe,

Thanks for this, I will push it to the FreeBSD bug tracker today.

We don't use the "pppoe" command, but from the looks of it you should be able to prevent the crash by filling out the appropriate AC value that you see being advertised on the wire for AC(1). That should also make the MAC block obsolete.

In the provider field, simply fill with "ac4.nue3\", the trailing backslash is important. The Host-Uniq field must be empty.


Cheers,
Franco
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: alexdupre on August 24, 2017, 08:42:20 am
Hi, I'm the author of the patch.

It'd indeed be useful to have the full ethernet dump. I've read multiple hypotheses about the cause of the issue, but few facts supporting them. The last one seems to be multiple PADO answers from different Access Concentrators. I can't rule out 100% that this may be the issue (it's not a common scenario), but in theory it should be handled correctly (second PADO is ignored, /* Multiple PADO is OK. */ in the code) and my patch doesn't touch that part, so I'd expect the same behavior on a clean FreeBSD installation.

Instead of blocking the AC via its MAC address, you can configure the PPPoE connection to just accept a specific AC-Name. Since blocking one AC seems to have fixed the issue, it'd be interesting to see if blocking the other one produces the same result. And also to know the behavior under the same conditions (multiple PADO) on a clean pppoe module.
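
As a rough illustration of what accepting only a specific AC-Name means on the wire: a PADO carries a list of type/length/value tags, and the AC-Name is one of them. The sketch below walks such a tag list in user space. The tag type values are from RFC 2516; pppoe_find_tag() and pado_matches_ac() are hypothetical helper names for this example, not functions from the pppoe module.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* PPPoE discovery tag types from RFC 2516. */
#define PPPOE_TAG_SERVICE_NAME	0x0101
#define PPPOE_TAG_AC_NAME	0x0102

/*
 * Walk the TLV tags in a PPPoE discovery payload and return a pointer to
 * the value of the first tag of the given type, or NULL if it is absent
 * or the payload is truncated. *out_len receives the value length.
 */
const uint8_t *
pppoe_find_tag(const uint8_t *payload, size_t len, uint16_t type,
    size_t *out_len)
{
	size_t off = 0;

	assert(payload != NULL && out_len != NULL);
	while (len - off >= 4) {
		/* Tag type and length are 16-bit big-endian fields. */
		uint16_t t = (uint16_t)((payload[off] << 8) | payload[off + 1]);
		uint16_t l = (uint16_t)((payload[off + 2] << 8) | payload[off + 3]);

		off += 4;
		if (l > len - off)
			return NULL;	/* truncated tag */
		if (t == type) {
			*out_len = l;
			return payload + off;
		}
		off += l;
	}
	return NULL;
}

/* Hypothetical helper: accept a PADO only if its AC-Name equals `wanted`. */
int
pado_matches_ac(const uint8_t *payload, size_t len, const char *wanted)
{
	size_t n;
	const uint8_t *v = pppoe_find_tag(payload, len, PPPOE_TAG_AC_NAME, &n);

	return v != NULL && n == strlen(wanted) && memcmp(v, wanted, n) == 0;
}
```

With a filter like this, an offer from an AC whose name differs from the configured one (for example "ac4.nue3") would simply be dropped before any session lookup happens.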
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: alexdupre on August 24, 2017, 09:34:23 am
I think I have found the issue. Try changing the following code in the pppoe_finduniq function:

Code:
        /* Cycle through all known hooks. */
        LIST_FOREACH(hook, &node->nd_hooks, hk_hooks) {
                /* Skip any nonsession hook. */
                if (NG_HOOK_PRIVATE(hook) == NULL)
                        continue;
                sp = NG_HOOK_PRIVATE(hook);
                if (sp->neg->host_uniq_len == ntohs(tag->tag_len) &&
                    bcmp(sp->neg->host_uniq.data, (const char *)(tag + 1),
                     sp->neg->host_uniq_len) == 0)
                        break;
        }

with

Code:
        /* Cycle through all known hooks. */
        LIST_FOREACH(hook, &node->nd_hooks, hk_hooks) {
                /* Skip any nonsession hook. */
                if (NG_HOOK_PRIVATE(hook) == NULL)
                        continue;
                sp = NG_HOOK_PRIVATE(hook);
                /* Skip already connected sessions. */
                if (sp->neg == NULL)
                        continue;
                if (sp->neg->host_uniq_len == ntohs(tag->tag_len) &&
                    bcmp(sp->neg->host_uniq.data, (const char *)(tag + 1),
                     sp->neg->host_uniq_len) == 0)
                        break;
        }
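
For anyone who wants to see the effect of the added check outside the kernel, here is a minimal user-space model of it. It assumes, as the new comment in the code above implies, that the negotiation pointer is NULL for sessions that are already connected. The struct and function names (neg_model, session_model, find_uniq) are made up for this sketch and only mirror the shape of the kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Negotiation state; in this model it is NULL once a session is up. */
struct neg_model {
	unsigned char host_uniq[8];
	size_t host_uniq_len;
};

struct session_model {
	struct neg_model *neg;	/* NULL for an already connected session */
};

/*
 * Return the index of the session that is still negotiating and whose
 * Host-Uniq value matches, or -1 if there is none.
 */
int
find_uniq(const struct session_model *s, size_t n,
    const unsigned char *uniq, size_t uniq_len)
{
	assert(s != NULL && uniq != NULL);
	for (size_t i = 0; i < n; i++) {
		/* Skip already connected sessions. Without this check the
		 * comparison below would read through a NULL pointer. */
		if (s[i].neg == NULL)
			continue;
		if (s[i].neg->host_uniq_len == uniq_len &&
		    memcmp(s[i].neg->host_uniq, uniq, uniq_len) == 0)
			return (int)i;
	}
	return -1;
}
```

In the model, a late PADO that arrives when the only session is already connected finds no match and is ignored, instead of faulting on the first entry.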

Franco, can you create an image or a module with the above change so that jwe can test it, please?
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: franco on August 25, 2017, 08:44:28 am
Hi Alex,

Great to see you here, thank you. I'll prepare an image tonight and we let you know how that goes. :)


Cheers,
Franco
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: jwe on August 25, 2017, 12:31:49 pm
Quote from: franco
In the provider field, simply fill with "ac4.nue3\", the trailing backslash is important. The Host-Uniq field must be empty.

I don't have the provider field in my web interface. Is there an option to make it visible?
(Looked in Interfaces/Point-to-Point/Devices(Interface type=pppoe))

Quote from: alexdupre
Since blocking one AC seems to have fixed the issue, it'd be interesting to see if blocking the another one produces the same result.

It seems like the 2nd AC is only showing up after the session to the first AC is established.
This might be because the original router from my ISP is building a second pppoe session for voice and i assume that the second AC is for the voice connection. So i cant test using only the second AC as it is not responding to my PADI. It sends its PADO always about 1 second after the session with AC(1) is established.
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: franco on August 25, 2017, 02:45:34 pm
I'll have the new test image up and running this evening... The check that Alex proposed seems to be a very likely fix so you don't necessarily have to try something else for the moment. :)

FWIW, the "provider" input field is under your [WAN] interface configuration, oddly enough not under the PPPoE device itself. I've been meaning to hollow out the device settings so that only the bare minimum is there and everything else can be configured from the interface configuration itself.


Cheers,
Franco
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: franco on August 26, 2017, 10:05:40 am
The new image is here:

https://pkg.opnsense.org/snapshots/OPNsense-17.7-test3-OpenSSL-vga-amd64.img.bz2


Cheers,
Franco
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: jwe on August 26, 2017, 10:55:10 pm
Quote from: franco
The new image is here:

https://pkg.opnsense.org/snapshots/OPNsense-17.7-test3-OpenSSL-vga-amd64.img.bz2


Cheers,
Franco

Hi Franco,

this seems to work, but as i did not always see the other AC,
and i can't load the img into my hyper-v to watch network traffic via wireshark, it would be nice if someone else could confirm that this works.

Or, if you have some time for it, you could create the test image as an iso...

Hyper-v really sucks, not being able to load img files :(
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: knoppo on August 26, 2017, 11:30:40 pm
I'd be happy to test it, too. But I need a serial image for my apu2.
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: franco on August 27, 2017, 09:58:16 am
For APU: https://pkg.opnsense.org/snapshots/OPNsense-17.7-test3-OpenSSL-serial-amd64.img.bz2
For VM: https://pkg.opnsense.org/snapshots/OPNsense-17.7-test3-OpenSSL-dvd-amd64.iso.bz2
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: knoppo on August 27, 2017, 12:34:12 pm
Works great for me!  :D Thanks guys!

For everybody else stumbling upon this (and google):
My apu2 was stuck in a bootloop (rebooting after configuring interfaces).
When I unplugged the pppoe cable it booted fine but re-plugging it caused the whole system to crash.
I've done this with `tail -f /var/log/syslog` where I just found a "kernel panic 12" hexdump.

The test3 image fixed it.
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: odites999 on August 28, 2017, 07:45:58 am
Hi,

I confirm that everything is working fine here with the test3 image. Thanks guys!!


Regards,
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: alexdupre on August 28, 2017, 08:29:28 am
Thanks for testing. I'll update the patch and push for merging it in standard FreeBSD installation.
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: franco on August 28, 2017, 11:23:18 am
Otherwise confirmed as well, thanks again to everyone for the help in tracking this down and Alex for the quick fix.

This will be in 17.7.1 this week. :)


Cheers,
Franco
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: jwe on August 28, 2017, 11:36:40 am
Very nice!

Thanks for all the help, patches and images.

Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: franco on August 28, 2017, 11:40:48 am
jwe, will you mark it solved? Thanks!
Title: Re: [NOT-SOLVED,WTF!?] PPPOE Crash
Post by: jwe on August 28, 2017, 11:44:58 am
Quote from: franco
jwe, will you mark it solved? Thanks!

done  :)
Title: Re: [SOLVED][Fix included in 17.7.1] PPPOE Crash
Post by: franco on August 28, 2017, 12:08:39 pm
:)
Title: Re: [SOLVED][Fix included in 17.7.1] PPPOE Crash
Post by: mitsos on September 04, 2017, 06:29:27 pm
Updating from 17.1 still wants to go to 17.7 first. Is this fix included in 17.7? If not, then the firewall will crash and will not be able to update. How to work around this so that 17.1 machines can be updated?
Title: Re: [SOLVED][Fix included in 17.7.1] PPPOE Crash
Post by: franco on September 04, 2017, 06:46:23 pm
Hello,

Lookie who is here! I hope you are doing fine these days! :D

The upgrade path was moved to 17.7.1 for this particular reason, but the 17.1 GUI can't know that, so it shows the one that it is pointing to, but the mirror has the symlink switched.


Cheers,
Franco
Title: Re: [SOLVED][Fix included in 17.7.1] PPPOE Crash
Post by: JDtheHutt on September 05, 2017, 05:27:13 pm
I moved from pfsense straight to opnsense 17.7.  I have to say that while I like it in general, this bug has been a complete nightmare.  It has been so unstable that I have been experiencing crashes and reboots so constantly that I can barely get past logging into the GUI before it goes down.  I live in an area where I have 0 mobile signal so without some means to access my wired internet I am a bit screwed.  After a lot of perseverance I managed to get it to update to 17.7.1 before it crashed again and now it is running rock solid.  I think you need to consider removing 17.7 as the provided download and put 17.7.1 up there instead, or plaster some kind of massive warning across the 17.7 download that it has a serious flaw and that users need to immediately by any means update to 17.7.1 before it experiences a crash.
Title: Re: [SOLVED][Fix included in 17.7.1] PPPOE Crash
Post by: tillsense on September 05, 2017, 06:35:29 pm
Quote from: JDtheHutt
...I think you need to consider removing 17.7 as the provided download and put 17.7.1 up there instead...

@franco
100% ACK

cheers
till
Title: Re: [SOLVED][Fix included in 17.7.1] PPPOE Crash
Post by: franco on September 05, 2017, 06:51:47 pm
Hi guys,

We agree. Yet we have more changes that we would like to see in images, so we are one or two 17.7.x releases away. It's a bet of sorts... And in the meantime 17.1 works too. :)

End of September is realistic.


Cheers,
Franco
Title: Re: [SOLVED][Fix included in 17.7.1] PPPOE Crash
Post by: JDtheHutt on September 05, 2017, 07:56:11 pm
Would it be best just to pull 17.7 as a downloadable or upgradeable option then?  Only allow 17.1 to be downloaded by default and any upgrades from that within the system to skip right past to 17.7.1?  Because 17.7 itself does not seem to merit being defined as a stable production ready system.  I know it's hard to rollback on something as big as a release, but that's better than having users tearing their hair out because the system won't stay up even long enough to perform an upgrade.  You'll lose people otherwise.  With me having no mobile access and unable to get 17.7 to stay up, I was on the verge of just returning to pfsense and not looking back, which would be a shame as I think opnsense is a great system other than for that fault.  Now I'm on 17.7.1 I am rock steady, not dropped whatsoever since then.
Title: Re: [SOLVED][Fix included in 17.7.1] PPPOE Crash
Post by: mitsos on September 06, 2017, 03:28:53 pm
@franco The fact that I don't post doesn't mean I'm not here, it means the product has been rock solid for my use cases :-). This is the first "serious" issue that I have seen, and it did not even apply to our entire fleet of appliances. Only appliances that were behind bridged modems were actually affected by this, because PPPoE is handled by opnsense, instead of PPPoE being handled by the modem if it is in routing mode.

Tried the 17.1.x to 17.7 upgrade, upgraded smoothly, although it did take a while to come back and gave me a slight scare there :-). It then found the 17.7.1 upgrade and sailed through that as well  ;D

@JDtheHutt every piece of software has bugs, and this bug, as far as I understand it, wasn't in opnsense but upstream. I've been running opnsense since day 0. I actually had to reconfigure one of our routers because it was upgraded from a 32bit machine to a 64bit machine and wanted to "start fresh" with it. I had to skim through the old configuration; it had been *that* long since I had to fiddle with it that I forgot how it was configured (subnets/interfaces). So far, since day 0, it was always use>update>use, didn't notice anything serious. By the time I updated, the serious issues had already been pulled from the mirrors (eg VLANs). And no, I'm not one of those "IT experts" that never update, I try to update everything (from servers down to clients' access points) every week. YMMV of course, because there is always that one person that will say "but I upgraded and now everything is broken! how did you miss that?"  :o
Title: Re: [SOLVED][Fix included in 17.7.1] PPPOE Crash
Post by: JDtheHutt on September 09, 2017, 03:46:04 am
@Franco Unfortunately, this fault does not appear to be fixed in 17.7.1. Since I installed it, the last 48 hours it has run without any issue. Then it died again, same behaviour as before. If I boot with my WAN cable connected then it fails to boot. Without the WAN cable I can boot but the second I insert it, the whole system goes offline. Please let me know if there are any specific logs you want for this, however with no working means to access the internet other than walking down the street to get mobile signal, or trying 17.1 instead or rolling back to pfsense entirely, it might take me a while.
Title: Re: [SOLVED][Fix included in 17.7.1] PPPOE Crash
Post by: franco on September 09, 2017, 12:25:28 pm
It's unlikely to be the same issue, given the multiple confirmations that the issue was solved.

Do you have a crash report?


Cheers,
Franco
Title: Re: [SOLVED][Fix included in 17.7.1] PPPOE Crash
Post by: JDtheHutt on September 09, 2017, 05:23:12 pm
I'll take a look when I get home from this shift. However, I did notice that each time it rebooted and I logged in, the usual crash report notification which I have seen at the top on previous occurrences was not present. Is there a specific place these are logged? I can manually grab them and send over to you.
Title: Re: [SOLVED][Fix included in 17.7.1] PPPOE Crash
Post by: franco on September 09, 2017, 05:30:27 pm
If nothing shows up it's not a proper crash. Does uptime reset a.k.a. forced reboot? Also check the System: Log File page for clues.
Title: Re: [SOLVED][Fix included in 17.7.1] PPPOE Crash
Post by: JDtheHutt on September 10, 2017, 06:36:04 pm
@Franco I had a bit of a look around.  To provide some context for my setup, I was using a fresh install of 17.7 which was immediately upgraded to 17.7.1.  I use IVPN and have configured the VPN service for that as per their current pfsense instructions, as they seem to match OPNsense and have been working fine.  Other than using some third party DNS and setting some static IPs for devices on my network, nothing else is changed.

The first oddity is that even with IPv6 disabled across all settings, telling it to use IPv4 even if IPv6 is available, putting block rules on IPv6 traffic etc, all my devices are still showing as being assigned IPv6 addresses at the client side.  At the server end I don't see any IPv6 addresses assigned and the DHCP Server for IPv6 is disabled, yet all devices are still receiving addresses and I can see IPv6 traffic constantly being blocked by my firewall block rule for it.

The system as configured at first runs fine, but after a period of seemingly random time, though always by 48 hours it seems, everything drops, all devices fall off the network and show no IPv4 addresses, yet they still possess IPv6 addresses.  I cannot ping anything, my clients show that their connections are down.  I can still see that they are connected physically as they are detecting a potential 1000Mbit/s link, but no actual connectivity is available.  Going manually to the OPNsense box and checking output, it is just hung, no response.

Rebooting, if the WAN cable is left in, the system fails to boot, reports config errors and just hangs on startup.  Removing the WAN cable, it boots and everything works perfectly with no further drops, though all devices still receive IPv6 addresses.  Plugging the WAN cable back in results in an immediate failure again and everything drops off the network.

One thing I noticed is that OPNsense keeps switching my default gateway back to my WAN rather than being on my VPN gateway.  All outbound traffic is configured to use the VPN gateway anyway but I have additionally set the VPN gateway as default and set the system to only use the default and not to fallback to any other gateway if the VPN gateway goes down.

Setting the default gateway back to the VPN gateway lets me plug the WAN cable back in and no drops then.  However, DNS is completely broken and nothing resolves whatsoever.  Due to this, my VPN cannot establish as it cannot recognise the hostname in order to bring it up.  I tried setting separate DNS options for every gateway on the system but no success.  Manually entering in the IP for my VPN hostname, I can then establish the VPN and I can ping external addresses again but DNS is still completely dead, whether I use my manually defined DNS services or whether I select to allow my ISP DNS to override my locally defined services.

I actually made a backup of my configuration when I first set it all up and it was working, but restoring that does not work.  Resetting to factory default and then restoring the configuration also does not work.  I have to do a complete disk wipe and reinstall from USB to get it running and it then goes through all of the above again.

I have had to decide to fully fallback to using pfsense 2.4.0-RC as I have an angry wife, two children and a mother-in-law here who have demanded I give them their internet back, so I am afraid I now do not have an OPNsense box available for further testing.  I can confirm that the pfsense install is working perfectly and has so far not experienced the same issues as on OPNsense, though it has not yet been running for 48 hours.  To also add, the same settings on pfsense result in no IPv6 activity at all and none of my devices receive IPv6 addresses.  I am sorry I can't provide more information than that currently, but I hope it helps and I would like to give OPNsense another go, maybe when they go to visit relatives for the week and I am left behind due to work!  Let me know if you need any other details which I may have forgotten and I will see if I can add them.  Thanks.