[isolated: see #91] PPPoE reconnect loop

elektroinside · February 18, 2018, 08:26:22 PM

Well, good to know anyway that custom MTU/MSS with certain PPPoE links are not a match made in heaven :P
Hopefully, the others with the same loops have similar configs and clearing them will fix the issues for their link as well.

theq86 · February 19, 2018, 07:50:25 AM

Unfortunately, no.

I never set custom values for MTU/MSS and I am regularly facing those problems.

Since 17.7 my IPv6 connection has been breaking after the 24h reconnect. I worked around it by setting up an automated reboot. But now, since a clean 18.1 install, I also get those infinite pppoe loops.
And often, out of the blue my pppoe connection stops working.

elektroinside · February 19, 2018, 12:20:07 PM

Hmm... sorry to hear this.
Do you happen to remember what custom settings you have set for the WAN?
Maybe you could try reconfiguring PPPoE, this is how I found out about my problem. It was painfully slow, but I managed to find the issue. I set things up step by step and after each step, I tried to reconnect.

schnipp · February 19, 2018, 10:33:36 PM

Quote from: schnipp on February 16, 2018, 06:22:28 PM
yes, I have some logs. When I am back in my control center :D (beginning of next week) I can post the wireshark logs of both scenarios (reconnect issue and fresh reboot) I have taken so far.

Here are my wireshark logs. I monitored the DSL interface of my DSL modem with pppoe packet forwarding enabled. In this context that is nearly the same as a DSL modem in bridge mode.

1. Wireshark log after rebooting the machine (pppoe_dial_after_reboot_vcc0.jpg)

After rebooting the machine everything works fine with a stable DSL connection. However, we can see some strange behaviour (black colored packets) in this trace. The Opnsense sends multiple interleaved PADI requests even though the ISP has already sent an offer (PADO). This situation should not occur but could be caused by multiple pppoe daemon (mpd5) instances running at the same time or some issues in the daemon's FST. But after configuration has finished everything works fine and the Opnsense immediately replies to every "LCP Echo Request" (packet nr. 30 ff.).

2. Wireshark log after connection dropped (pppoe_dial_after_reconnect_vcc0.jpg)

After the connection drops and the system tries to reconnect to the ISP, the whole configuration process looks the same as in the scenario above, but without the strange black colored packets. The strange behaviour which leads to the reconnection loops is shown in line 27. The ISP again sends "LCP Echo Request" packets, which are not answered by the Opnsense. After the third lost packet the ISP thinks the pppoe client is not alive anymore and sends a termination request, which is oddly answered by the Opnsense. After finishing the termination procedure the system tries to reconnect again (PADI request) and so forth.

Maybe the reason could be a timing issue (caused by a race condition?) or other issues in the FST. But this is only an uncertain assumption.
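The keepalive behaviour seen in the second trace can be sketched roughly as follows. This is only a minimal model of what the ISP side appears to do, not code from mpd5 or from any ISP equipment. The function name `echo_keepalive` is hypothetical, the threshold of three comes from the trace, and resetting the counter on an answered request is an assumption.

```python
def echo_keepalive(replies_seen, max_missed=3):
    """Return the index of the echo request at which the peer gives up, or None.

    replies_seen: one boolean per LCP Echo Request sent by the peer;
    True means an Echo Reply arrived before the next request.
    max_missed: consecutive unanswered requests tolerated (3 in the trace).
    """
    missed = 0
    for index, answered in enumerate(replies_seen):
        if answered:
            missed = 0  # assumption: any reply resets the counter
            continue
        missed += 1
        if missed >= max_missed:
            return index  # peer would send a termination request here
    return None


# After reboot: every request is answered, so the session stays up.
print(echo_keepalive([True] * 10))
# After reconnect: nothing is answered, so the peer gives up at the third request.
print(echo_keepalive([False, False, False]))
```

In this model a single reply in time would be enough to keep the session alive, which is why the missing echo replies are the interesting part of the trace.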

schnipp · February 19, 2018, 10:40:01 PM

Quote from: elektroinside on February 14, 2018, 11:38:28 PM
Perhaps this helps you as well, until this is fixed:
https://forum.opnsense.org/index.php?topic=7316.0

Thanks, I will check this out in the next few days. In the other thread you mentioned that the system will reboot if the pppoe interface goes down and never gets a new IPv4 address. In the reconnection loop of my system I do get a new IPv4 address, but only for about 30 seconds.

elektroinside · February 20, 2018, 07:19:03 AM

You're welcome :)
I've updated the script, had some design flaws :P

schnipp · February 21, 2018, 02:52:00 PM

I did some more investigation in this topic and increased the logging of the mpd daemon to get some more information of the ISP's LCP echo probing. I found out that the daemon processes echo request packets and itself claims to send out corresponding reply packets. Unfortunately, the echo reply packets are not seen on the WAN interface :-(

marjohn56 · February 21, 2018, 03:25:55 PM

Interesting that elektroinside's problems disappeared when he rebuilt his system from scratch.

elektroinside · February 21, 2018, 04:59:46 PM

Quote from: marjohn56 on February 21, 2018, 03:25:55 PM
Interesting that elektroinside's problems disappeared when he rebuilt his system from scratch.

It wasn't the rebuild that fixed my loop, it just helped me find its source :)
I started adding major features to the new build and I had no loops while doing that. I lost IPv6, but no loops. Then I imported my previous backup and the loops reappeared. This made me wonder what in that backup triggered the loop. Then I started deleting/disabling/uninstalling stuff until I found that in my case, the custom MTU/MSS was the triggering factor... I don't necessarily think that this is the only trigger, as I had those custom MTU/MSS values while rebuilding the box (if I remember correctly), but I had no loops, not until the import. So I think that the MTU/MSS is just one part of a combination of factors that eventually causes the loop. Eliminating this one factor was enough for me, but might not be for others.

mimugmail · February 21, 2018, 06:16:55 PM

Quote from: schnipp on February 21, 2018, 02:52:00 PM
I did some more investigation in this topic and increased the logging of the mpd daemon to get some more information of the ISP's LCP echo probing. I found out that the daemon processes echo request packets and itself claims to send out corresponding reply packets. Unfortunately, the echo reply packets are not seen on the WAN interface :-(

IPS enabled?

schnipp · February 21, 2018, 07:58:50 PM

Quote from: mimugmail on February 21, 2018, 06:16:55 PM

IPS enabled?

Currently, I am running a nearly plain Opnsense system in testing mode. IPS is not yet installed. Only a few plugins like Dyndns and Arp-scan are used. IDS/IPS and Webproxy filtering are tasks for the future, after the system has stabilized.

schnipp · February 21, 2018, 08:37:22 PM

People affected by this issue, can you please post the NIC model and driver (incl. its version) you are using? Thanks.

schnipp · February 28, 2018, 08:47:48 PM

I investigated a little bit more to figure out the reason for the reconnection loops. What we already know: in some cases LCP echo request packets sent by the ISP seem not to be answered by the Opnsense. After three unanswered request packets my ISP thinks the PPP endpoint is not alive anymore and drops the PPPoE connection.

Initially after reboot everything works fine, but after an interruption of the connection (e.g. the 24h reconnect initiated by the ISP) or some time in between, echo reply packets aren't seen anymore on the network interface. I downloaded the source code of mpd5 (the PPPoE daemon) and compiled it with some modifications for debugging (due to missing gdb). The daemon successfully receives echo request packets and immediately sends out an appropriate response (the sendto() function returns successfully without an error code).

So, it looks like the daemon itself works fine. But I have to check whether the packets are sent over the correct network link. Furthermore, by snooping a netgraph node, I can see the echo reply packets sent by the daemon. But during the reconnection loop, echo reply packets are delayed and, oddly, seen as a bunch of three packets. So the ISP won't receive the responses in time.
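To illustrate why a delayed bunch of replies does not help, here is a small sketch. All timing values are hypothetical and were not taken from the traces. The helper name `count_on_time` and the rule that a reply must arrive within one echo interval are assumptions made for illustration.

```python
def count_on_time(request_times, reply_times, interval):
    """Count replies that arrive within `interval` seconds of their request."""
    return sum(
        1
        for sent, received in zip(request_times, reply_times)
        if 0 <= received - sent < interval
    )


interval = 10.0                       # hypothetical echo interval in seconds
requests = [0.0, 10.0, 20.0]          # three echo requests from the ISP

prompt_replies = [0.01, 10.01, 20.01]  # behaviour after a clean reboot
batched_replies = [30.5, 30.5, 30.5]   # all three replies leave as one late bunch

print(count_on_time(requests, prompt_replies, interval))
print(count_on_time(requests, batched_replies, interval))
```

With prompt replies all three requests count as answered. With the batched replies none of them do, even though the daemon did send three replies.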

Packets are sent out via b0@mpd32168-lso (see netgraph in the attachment), and I tapped the netgraph at mpd32168-wan_link0-lt.

The next steps will be tapping more nodes in the netgraph and studying the log files of the mpd5 daemon.

nallar · March 06, 2018, 02:24:51 PM

I had a reconnect loop issue a while back where the modem interface would go up and down repeatedly.

I think there's a bug in rc.linkup after this commit:

https://github.com/opnsense/core/commit/fdc754e4261d333878549d1f43c980ae23a5f9ed

A static IPv4 address with V6 not configured will call interface_configure. Previously the empty($ip6addr) check would consider that to be a static address so it would not call interface_configure.

My modem interface has only a V4 static address. Giving it a static V6 address resolved the problem.
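The change in behaviour described above can be modelled like this. It is a rough Python sketch of the decision as described in this post, not the actual PHP from rc.linkup. The function names and the `treat_empty_v6_as_static` flag are hypothetical and only exist to compare the old and new behaviour side by side.

```python
import ipaddress


def is_static(value, version):
    """True if `value` is a literal IP address of the given version."""
    try:
        return ipaddress.ip_address(value).version == version
    except ValueError:
        return False  # empty string, "dhcp", "pppoe", etc.


def reconfigures_on_linkup(ipaddr, ip6addr, treat_empty_v6_as_static):
    """Model whether a link-up event would trigger interface_configure.

    treat_empty_v6_as_static=True models the older behaviour described
    above (an unconfigured V6 address counts as static); False models the
    behaviour after the commit.
    """
    v4_static = is_static(ipaddr, 4)
    v6_static = is_static(ip6addr, 6) or (
        treat_empty_v6_as_static and not ip6addr
    )
    return not (v4_static and v6_static)


# Static V4 only, V6 not configured (the modem interface in this post):
print(reconfigures_on_linkup("192.0.2.1", "", treat_empty_v6_as_static=True))
print(reconfigures_on_linkup("192.0.2.1", "", treat_empty_v6_as_static=False))
# Workaround: also giving the interface a static V6 address.
print(reconfigures_on_linkup("192.0.2.1", "2001:db8::1", treat_empty_v6_as_static=False))
```

The addresses are documentation examples. In this model the V4-only interface is only reconfigured on link-up with the newer check, and adding a static V6 address avoids the reconfiguration again.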

franco · March 06, 2018, 03:14:07 PM

Very nice analysis, can you try https://github.com/opnsense/core/commit/267a086dc ?

# opnsense-patch 267a086dc

Thanks,
Franco
