PPPoE drop/disconnect, requires a reboot to fix

Started by iMx, November 14, 2023, 05:26:43 PM

Previous topic - Next topic
November 14, 2023, 05:26:43 PM Last Edit: November 14, 2023, 05:50:34 PM by iMx
Hi there,

Appreciate any thoughts on this ....

When PPPoE (running on a VLAN) disconnects, the only way for me to recover it is to restart the firewall.  The /var/logs/ppps files just shows:

"<30>1 2023-11-14T16:16:04+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="581"] Multi-link PPP daemon for FreeBSD
<30>1 2023-11-14T16:16:04+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="582"]
<30>1 2023-11-14T16:16:04+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="583"] process 1291 started, version 5.9
<30>1 2023-11-14T16:16:04+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="584"] web: web is not running
<30>1 2023-11-14T16:16:04+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="585"] [wan] Bundle: Interface ng1 created
<30>1 2023-11-14T16:16:04+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="586"] [undefined] GetSystemIfaceMTU: SIOCGIFMTU failed: Device not configured
<30>1 2023-11-14T16:16:04+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="587"] [wan_link0] Link: OPEN event
<30>1 2023-11-14T16:16:04+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="588"] [wan_link0] LCP: Open event
<30>1 2023-11-14T16:16:04+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="589"] [wan_link0] LCP: state change Initial --> Starting
<30>1 2023-11-14T16:16:04+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="590"] [wan_link0] LCP: LayerStart
<30>1 2023-11-14T16:16:04+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="591"] [wan_link0] PPPoE: Set PPP-Max-Payload to '1500'
<30>1 2023-11-14T16:16:04+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="592"] [wan_link0] PPPoE: Connecting to ''
<30>1 2023-11-14T16:16:13+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="593"] [wan_link0] PPPoE connection timeout after 9 seconds
<30>1 2023-11-14T16:16:13+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="594"] [wan_link0] Link: DOWN event
<30>1 2023-11-14T16:16:13+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="595"] [wan_link0] LCP: Down event
<30>1 2023-11-14T16:16:13+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="596"] [wan_link0] Link: reconnection attempt 1 in 4 seconds
<30>1 2023-11-14T16:16:17+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="597"] [wan_link0] Link: reconnection attempt 1
<30>1 2023-11-14T16:16:17+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="598"] [wan_link0] PPPoE: Set PPP-Max-Payload to '1500'
<30>1 2023-11-14T16:16:17+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="599"] [wan_link0] PPPoE: Connecting to ''
<30>1 2023-11-14T16:16:26+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="600"] [wan_link0] PPPoE connection timeout after 9 seconds
<30>1 2023-11-14T16:16:26+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="601"] [wan_link0] Link: DOWN event
<30>1 2023-11-14T16:16:26+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="602"] [wan_link0] LCP: Down event
<30>1 2023-11-14T16:16:26+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="603"] [wan_link0] Link: reconnection attempt 2 in 4 seconds
<30>1 2023-11-14T16:16:30+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="604"] [wan_link0] Link: reconnection attempt 2
<30>1 2023-11-14T16:16:30+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="605"] [wan_link0] PPPoE: Set PPP-Max-Payload to '1500'
<30>1 2023-11-14T16:16:30+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="606"] [wan_link0] PPPoE: Connecting to ''
<30>1 2023-11-14T16:16:39+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="607"] [wan_link0] PPPoE connection timeout after 9 seconds
<30>1 2023-11-14T16:16:39+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="608"] [wan_link0] Link: DOWN event
<30>1 2023-11-14T16:16:39+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="609"] [wan_link0] LCP: Down event
<30>1 2023-11-14T16:16:39+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="610"] [wan_link0] Link: reconnection attempt 3 in 1 seconds
<30>1 2023-11-14T16:16:40+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="611"] [wan_link0] Link: reconnection attempt 3
<30>1 2023-11-14T16:16:40+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="612"] [wan_link0] PPPoE: Set PPP-Max-Payload to '1500'
<30>1 2023-11-14T16:16:40+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="613"] [wan_link0] PPPoE: Connecting to ''"


... over and over. 

It doesn't matter if I try to reload/reconnect the PPPoE interface, kill mpd5, reload all services, nothing will bring it back, other than a reboot. 

- All Intel NICS.
- em0 and the VLAN interfaces are up
- tcpdumping on the em0 interface just shows:

listening on em0, link-type EN10MB (Ethernet), capture size 262144 bytes
16:24:59.901275 PPPoE PADI [Host-Uniq 0x8038224300F8FFFF] [Service-Name] [PPP-Max-Payload 0x05DC]
16:25:01.900485 PPPoE PADI [Host-Uniq 0x8038224300F8FFFF] [Service-Name] [PPP-Max-Payload 0x05DC]
16:25:05.900490 PPPoE PADI [Host-Uniq 0x8038224300F8FFFF] [Service-Name] [PPP-Max-Payload 0x05DC]
16:25:09.902668 PPPoE PADI [Host-Uniq 0x40DB564300F8FFFF] [Service-Name] [PPP-Max-Payload 0x05DC]
16:25:11.902485 PPPoE PADI [Host-Uniq 0x40DB564300F8FFFF] [Service-Name] [PPP-Max-Payload 0x05DC]


I've tried:

- Disconnect/reload PPPoE via the UI
- Killing mpd5
- Enabling/disabling promiscuous (although I don't believe this is required for me...)
- Enabling/disabling 'prevent interface removal'
- Tried leaving MTU default, rather than 1508 on the PPPoE (for 1500 calculated)
- Tried changing the physical WAN port on the firewall
- Tried unplugging and re-plugging the ethernet cable (to force link loss, recovery)
- Reset all tuneables

Is there anything that is generated differently, for example a 'fake' MAC or some form of ID (Host-Uniq?), that is different after a reboot but not different after a reload of PPPoE? 

I'm at the stage now, other than trying to source different hardware, where I'm wondering if it's the ISP that's dropping the connection and blocking reconnects - for whatever reason - but after a reboot, the ID/MAC appears different so succeeds?!? Clutching at straws now!

... guess next step is to hook up the ISP router again and run this in the DMZ behind it and see if it still occurs.

November 14, 2023, 05:33:49 PM #1 Last Edit: November 14, 2023, 09:22:58 PM by iMx
Interestingly...

If I move the cable from em0, to a spare port, then reassign the VLAN to that interface (which takes PPPoE with it), reload all services, PPPoE then comes up again without the reboot.

If I move the cable back to em0, reassign the VLAN to that em0, reload all services, PPPoE still does not come up on the original interface.

... at a loss.

EDIT: Although this only worked once, trying it again to the same previously working interface doesnt work.

Around 15 hours whilst plugged into the ISP router, with the ISP router performing PPPoE, no PPPoE drops.  Have even ramped up usage, in case this was a factor.

Looks like there have potentially been similar general unresolved PPPoE issues on FreeBSD 'since Adam was a lad'.  Where nothing appears to resolve it, other than a reboot of the box.

Might have to try/build a Linux device to act as a PPPoE bridge/half-bridge and stick that in front of opnsense.

the Max-Payload is a bit weird.. have you configured an MTU on the parent interface? If so, try removing it

I tried both :)

Due to the problems in early 23.7.x it was needed for a 1500 MTU on PPPoE, until the changes around 23.7.7 (from memory) removed the need to set MTU on the physical and VLAN and it now just works it out.

But, I think its correct?  HEX converted to decimal:

0x05DC -> 1500 ?

May-Payload is RFC 4638.

November 16, 2023, 12:53:38 PM #5 Last Edit: November 16, 2023, 12:59:48 PM by iMx
ISP router performing PPPoE still stable for almost 48 hours.

Current thoughts for things to try:

- Move the VLAN tagging to a switch, just to rule this out and have a bog-standard interface in opnsense
- This seems intriguing

https://forums.freebsd.org/threads/override-mpd-pppoe-client-timeout.90413/

...  although for the above point, I would assume it would fail to connect all the time not just randomly disconnect/be unable to reconnect without a reboot.

- Stick a bridging device between the ISP router and the ONT and try to work out what it does differently (times, Host-Uniq, etc)

November 16, 2023, 06:23:12 PM #6 Last Edit: November 16, 2023, 06:33:19 PM by iMx
Switch in between opnsense and the ONT might be the next thing to try:

https://forum.netgate.com/topic/180061/lcp-no-reply-to-echo-requests/21?_=1700153939159&lang=en-GB

... no update since the OP posted that it certainly improved things.

EEE is disabled by default anyway, so it's not that:

hw.em.eee_setting: 1

       hw.em.eee_setting
       Disable or enable Energy Efficient Ethernet.  Default  1 (dis-
       abled).


Source: https://man.freebsd.org/cgi/man.cgi?em(4)

Still running with the ISP router performing PPPoE, connection has been 'up' constantly for this period.

Came across the below, wondering if this has reared its head again:

https://github.com/opnsense/core/issues/2267

... symptoms seem very similar.

For my own benefit:

12:30 GMT+0 - removed the ISP router, after 1 week of up time, now running with opnsense performing VLAN tagging and PPPoE again.

Using following tunables:

net.isr.bindthreads: 1
net.isr.maxthreads: -1
dev.em.*.fc: 0


Dynamic DNS: Interface IP check method

If the problem reoccurs, next step I will try putting a switch between opnsense and the ONT.

Didn't last very long - dropped at 13:14 GMT+0

<30>1 2023-11-21T13:13:53+00:00 fw.localdomain ppp 28779 - [meta sequenceId="1"] [wan_link0] LCP: no reply to 1 echo request(s)
<30>1 2023-11-21T13:14:03+00:00 fw.localdomain ppp 28779 - [meta sequenceId="2"] [wan_link0] LCP: no reply to 2 echo request(s)
<30>1 2023-11-21T13:14:13+00:00 fw.localdomain ppp 28779 - [meta sequenceId="3"] [wan_link0] LCP: no reply to 3 echo request(s)
<30>1 2023-11-21T13:14:23+00:00 fw.localdomain ppp 28779 - [meta sequenceId="4"] [wan_link0] LCP: no reply to 4 echo request(s)
<30>1 2023-11-21T13:14:33+00:00 fw.localdomain ppp 28779 - [meta sequenceId="5"] [wan_link0] LCP: no reply to 5 echo request(s)
<30>1 2023-11-21T13:14:33+00:00 fw.localdomain ppp 28779 - [meta sequenceId="6"] [wan_link0] LCP: peer not responding to echo requests
<30>1 2023-11-21T13:14:33+00:00 fw.localdomain ppp 28779 - [meta sequenceId="7"] [wan_link0] LCP: state change Opened --> Stopping
<30>1 2023-11-21T13:14:33+00:00 fw.localdomain ppp 28779 - [meta sequenceId="8"] [wan_link0] Link: Leave bundle "wan"
<30>1 2023-11-21T13:14:33+00:00 fw.localdomain ppp 28779 - [meta sequenceId="9"] [wan] Bundle: Status update: up 0 links, total bandwidth 9600 bps


Unable to reconnect, without a reboot.

November 21, 2023, 02:41:58 PM #10 Last Edit: November 21, 2023, 03:09:34 PM by iMx
Dropped again, 13:41 - got a reply a short time before:

tcpdump -i vlan01 pppoes and ppp proto 0xc021
13:39:42.307124 PPPoE  [ses 0x594] LCP, Echo-Request (0x09), id 39, length 10
13:39:42.307235 PPPoE  [ses 0x594] LCP, Echo-Reply (0x0a), id 39, length 10
13:40:13.507118 PPPoE  [ses 0x594] LCP, Echo-Request (0x09), id 40, length 10
13:40:13.507240 PPPoE  [ses 0x594] LCP, Echo-Reply (0x0a), id 40, length 10
-- dropped here --
13:40:52.374228 PPPoE  [ses 0x594] LCP, Echo-Request (0x09), id 1, length 10
13:41:02.438086 PPPoE  [ses 0x594] LCP, Echo-Request (0x09), id 2, length 10
13:41:12.441570 PPPoE  [ses 0x594] LCP, Echo-Request (0x09), id 3, length 10
13:41:22.446509 PPPoE  [ses 0x594] LCP, Echo-Request (0x09), id 4, length 10



<30>1 2023-11-21T13:17:42+00:00 fw.localdomain ppp 73374 - [meta sequenceId="76"] [wan] IFACE: Rename interface ng0 to pppoe0
<30>1 2023-11-21T13:41:02+00:00 fw.localdomain ppp 73374 - [meta sequenceId="1"] [wan_link0] LCP: no reply to 1 echo request(s)
<30>1 2023-11-21T13:41:12+00:00 fw.localdomain ppp 73374 - [meta sequenceId="2"] [wan_link0] LCP: no reply to 2 echo request(s)
<30>1 2023-11-21T13:41:22+00:00 fw.localdomain ppp 73374 - [meta sequenceId="3"] [wan_link0] LCP: no reply to 3 echo request(s)
<30>1 2023-11-21T13:41:32+00:00 fw.localdomain ppp 73374 - [meta sequenceId="4"] [wan_link0] LCP: no reply to 4 echo request(s)
<30>1 2023-11-21T13:41:42+00:00 fw.localdomain ppp 73374 - [meta sequenceId="5"] [wan_link0] LCP: no reply to 5 echo request(s)
<30>1 2023-11-21T13:41:42+00:00 fw.localdomain ppp 73374 - [meta sequenceId="6"] [wan_link0] LCP: peer not responding to echo requests
<30>1 2023-11-21T13:41:42+00:00 fw.localdomain ppp 73374 - [meta sequenceId="7"] [wan_link0] LCP: state change Opened --> Stopping



Back to ISP router, will order a dumb £10 switch to put between the ONT/opnsense.

November 22, 2023, 04:34:29 PM #11 Last Edit: November 22, 2023, 04:36:50 PM by iMx
Added a cheap, unmanaged TP-Link TL-SG105S switch in-between opnsense and the ONT. 

Quite a few reports on the pfsense forums and reddit that this helped/fixed similar issues for others.

VLAN, PPPoE et al now being performed on opnsense again.

November 22, 2023, 04:50:43 PM #12 Last Edit: November 22, 2023, 05:00:07 PM by iMx
Still dropped in about 10 minutes.

Have plugged the ISP router WAN port into a computer and wireshark-ed the PPPoE discovery.

Going to try mimicking the Host-Uniq tag - this does not change on the ISP router, between reboots, restarts, etc.  On opnsense/mpd5 it seems to.

Still dropped, after about 20 minutes.

At a loss - connection is absolutely rock solid, when using the ISP router and opnsense in the DMZ (no PPPoE).

Might have to try 'the other one', for something to compare to :-/

Have moved VLAN tagging to a managed switch. 

PPPoE/em0 on opnsense now on a native VLAN port on the switch (no VLAN tagging in opnsense) with the tagged switch port connected to the ONT.