[SOLVED][Fix included in 17.7.1] PPPOE Crash

Started by jwe, August 10, 2017, 02:13:23 AM

Previous topic - Next topic
Here is the output from the test1-image, imported config from 17.1:


I also have the latest BIOS version.

With the Test1 image, the system also crashes (same error than in images posted by jwe).
With the Test2 image, the system does not crash.


Regards,

Hi guys,

Alright, that confirms it's not something we did in our code per se... It's from the import of the host-uniq patch from here:

https://reviews.freebsd.org/D9270

Can you tell me how your "service" field is filled out? If it is not filled out, what happens when you set a bogus string like "foo"?

We'd also be looking at a "|" symbol use or anything odd / non-standard about that.


Thanks so far,
Franco

August 20, 2017, 04:50:45 PM #18 Last Edit: August 20, 2017, 07:23:50 PM by odites999
Hi franco,

My service name is (and has always been) empty. I've tried changing it to different lengths with no problem. The connection stays alive.


Regards,

Quote from: odites999 on August 20, 2017, 04:50:45 PM
Hi franco,

My service name is (and has always been) empty. I've tried changing it to different lengths with no problem. The connection stays alive.


Regards,

I set it to "mnet" in the 17.1 install.
PPPoE then reconnects and is working as before.

But when booting from the 17.7 usb stick (Test Image 1) then i get the same error as in my screenshots after configuration import and getting the network devices ready.

Hi,

My test with system names has been with the Test2 image, the only one that does not crash, in my case (empty system name).


Regards,

Ok, so we need to roll back on that patch and start asking the authors for help in solving this mystery.

The older kernel will work for 17.7 if you can install it:

# opnsense-update -ikr 17.1.9 -n "17.1\/sets"
# /usr/local/etc/rc.reboot

The working image is also ok to install. The kernels are the same and upgrades will work as soon as 17.7.1 is out.


Cheers,
Franco

August 21, 2017, 10:00:07 PM #22 Last Edit: August 21, 2017, 11:12:41 PM by jwe
Quote from: franco on August 21, 2017, 06:21:21 PM
Ok, so we need to roll back on that patch and start asking the authors for help in solving this mystery.

The older kernel will work for 17.7 if you can install it:

# opnsense-update -ikr 17.1.9 -n "17.1\/sets"
# /usr/local/etc/rc.reboot

The working image is also ok to install. The kernels are the same and upgrades will work as soon as 17.7.1 is out.


Cheers,
Franco

Okay.. what are the next steps now?
Are you contacting the author?

Edit:
I could reproduce the problem on my main PC within a hyper-v VM running pfsense 17.7

I also captured the packages with wireshark if it could help futher identify the problem.
But this makes the problem hardware-independent, as my main pc has an intel nic, mobo from gigabyte

Edit2:

I did some research on the patch at https://reviews.freebsd.org/D9270 (as far as my knowledge is going...)-

It seems like it not only patched/added the host-uniq feature but also added some additional parsing for the PADM messages.

As i can see in ng_pppoe_rcvdata_ether(), the way how a NULL PADI tag gets treated changed dramatically, before they called

CTR1(KTR_NET, "%20s: PADI w/o Service-Name",__func__);
LEAVE(ENETUNREACH);


but now they just to some new indroduced

if (tag==NULL)tag=&sntag;

which i would assume to be the problem.


Good morning,

The PCAP of the trace would certainly help. Something went wrong in the ng_pppoe_rcvdata_ether() function, but from review it looked like there was nothing conceptually wrong with the &sntag change which just wants to force a match of the service to the first service it finds. At least there was no obvious problem with it in the lines changes (that other reviewers could have caught as well).

You are right it's not hardware independent, that was ruled out with the test1/test2 images. It looks provider-specific and/or PPPoE server specific (setup).

We can try to revert parts of the patch to find the issue of course, I will involve the author from that review later today. :)


Cheers,
Franco

August 23, 2017, 07:55:38 AM #24 Last Edit: August 23, 2017, 12:23:15 PM by odites999
Hi!,

I have installed the test2 image (I made all the tests in "live mode" and, so far, everything is working OK.


Regards,

August 23, 2017, 11:24:46 PM #25 Last Edit: August 23, 2017, 11:28:49 PM by jwe
Ok,
i think i have identified the problem.

What i have seen in the packetcapture:
Router sends PADI (Offer) | 585 in image
AC(1) sends PADO to Router |586
Router sends PADR to AC(1)|587
AC(1) send PADS to Router |588
=====Session is established, i also already have an ip via pppoe device=====
AC(2) sends PADO to router|589
=====Router crashes====


Thats what i have seen in the wireshark log.

AC(1) and AC(2) are sending the PADR's but the one from AC(2) is coming really late... as far as i could seen always after i have an established session with AC(1)

I can differntiate both AC's by their MAC-address.

So i blocked the MAC-address from the second AC on my switch.

Et voilĂ , problem solved.

I have attached a screenshot of the relevant captures.

so. by the RFC, if there are multiple AC's the client should be able to switch between them.
Possible workarounds for opnsense are, in my view:
1. a diagnostic 'pppoe -A' which sends the PADI to the ether and list the possible AC's
2. a possibility to select the AC to use in the configurator (for pppoe its parameter -C)

both commands are from https://www.freebsd.org/cgi/man.cgi?query=pppoe

hope that helps.




Edit:
I also noticed that the AC(2) (Cisco whatever by its MAC) has some more Vendor Specific PPPoE Tags(Cirguit ID and Remote ID, the Remote ID also contains my name in its value... 0o)

August 24, 2017, 08:37:45 AM #26 Last Edit: August 24, 2017, 08:55:54 AM by franco
Hi jwe,

Thanks for this, I will push it to the FreeBSD bug tracker today.

We don't use the "pppoe" command, but from the looks of it you should be able to prevent the crash by filling out the appropriate AC value that you see being advertised on the wire for AC(1). That should also make the MAC block obsolete.

In the provider field, simply fill with "ac4.nue3\", the trailing backslash is important. The Host-Uniq field must be empty.


Cheers,
Franco

Hi, I'm the author of the patch.

It'd be indeed useful to have the full ethernet dump. I've read multiple hypothesis about the cause of the issue, but few facts supporting them. The last one seems to be multiple PADO answers from different Access Concentrators. I don't exclude at 100% that this may be the issue (it's not a common scenario), but in theory it should be handled correctly (second PADO is ignored, /* Multiple PADO is OK. */ in the code) and my patch doesn't touch that part, so I'd expect the same behavior on a clean FreeBSD installation.

Instead of blocking the AC via its MAC address, you can configure the PPPoE connection to just accept a specific AC-Name. Since blocking one AC seems to have fixed the issue, it'd be interesting to see if blocking the another one produces the same result. And also to know the behavior with the same conditions (multiple PADO) on a clean pppoe module.

I think to have found the issue. Try changing the following code in pppoe_finduniq function:


        /* Cycle through all known hooks. */
        LIST_FOREACH(hook, &node->nd_hooks, hk_hooks) {
                /* Skip any nonsession hook. */
                if (NG_HOOK_PRIVATE(hook) == NULL)
                        continue;
                sp = NG_HOOK_PRIVATE(hook);
                if (sp->neg->host_uniq_len == ntohs(tag->tag_len) &&
                    bcmp(sp->neg->host_uniq.data, (const char *)(tag + 1),
                     sp->neg->host_uniq_len) == 0)
                        break;
        }


with


        /* Cycle through all known hooks. */
        LIST_FOREACH(hook, &node->nd_hooks, hk_hooks) {
                /* Skip any nonsession hook. */
                if (NG_HOOK_PRIVATE(hook) == NULL)
                        continue;
                sp = NG_HOOK_PRIVATE(hook);
                /* Skip already connected sessions. */
                if (sp->neg == NULL)
                        continue;
                if (sp->neg->host_uniq_len == ntohs(tag->tag_len) &&
                    bcmp(sp->neg->host_uniq.data, (const char *)(tag + 1),
                     sp->neg->host_uniq_len) == 0)
                        break;
        }


Franco, can you create an image or a module with the above change so that jwe can test it, please?

Hi Alex,

Great to see you here, thank you. I'll prepare an image tonight and we let you know how that goes. :)


Cheers,
Franco