OPNsense Forum

English Forums => 23.7 Legacy Series => Topic started by: iMx on November 14, 2023, 05:26:43 pm

Title: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 14, 2023, 05:26:43 pm
Hi there,

Appreciate any thoughts on this ....

When PPPoE (running on a VLAN) disconnects, the only way for me to recover it is to restart the firewall.  The /var/log/ppps files just show:

Code: [Select]
"<30>1 2023-11-14T16:16:04+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="581"] Multi-link PPP daemon for FreeBSD
<30>1 2023-11-14T16:16:04+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="582"]
<30>1 2023-11-14T16:16:04+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="583"] process 1291 started, version 5.9
<30>1 2023-11-14T16:16:04+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="584"] web: web is not running
<30>1 2023-11-14T16:16:04+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="585"] [wan] Bundle: Interface ng1 created
<30>1 2023-11-14T16:16:04+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="586"] [undefined] GetSystemIfaceMTU: SIOCGIFMTU failed: Device not configured
<30>1 2023-11-14T16:16:04+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="587"] [wan_link0] Link: OPEN event
<30>1 2023-11-14T16:16:04+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="588"] [wan_link0] LCP: Open event
<30>1 2023-11-14T16:16:04+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="589"] [wan_link0] LCP: state change Initial --> Starting
<30>1 2023-11-14T16:16:04+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="590"] [wan_link0] LCP: LayerStart
<30>1 2023-11-14T16:16:04+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="591"] [wan_link0] PPPoE: Set PPP-Max-Payload to '1500'
<30>1 2023-11-14T16:16:04+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="592"] [wan_link0] PPPoE: Connecting to ''
<30>1 2023-11-14T16:16:13+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="593"] [wan_link0] PPPoE connection timeout after 9 seconds
<30>1 2023-11-14T16:16:13+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="594"] [wan_link0] Link: DOWN event
<30>1 2023-11-14T16:16:13+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="595"] [wan_link0] LCP: Down event
<30>1 2023-11-14T16:16:13+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="596"] [wan_link0] Link: reconnection attempt 1 in 4 seconds
<30>1 2023-11-14T16:16:17+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="597"] [wan_link0] Link: reconnection attempt 1
<30>1 2023-11-14T16:16:17+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="598"] [wan_link0] PPPoE: Set PPP-Max-Payload to '1500'
<30>1 2023-11-14T16:16:17+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="599"] [wan_link0] PPPoE: Connecting to ''
<30>1 2023-11-14T16:16:26+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="600"] [wan_link0] PPPoE connection timeout after 9 seconds
<30>1 2023-11-14T16:16:26+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="601"] [wan_link0] Link: DOWN event
<30>1 2023-11-14T16:16:26+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="602"] [wan_link0] LCP: Down event
<30>1 2023-11-14T16:16:26+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="603"] [wan_link0] Link: reconnection attempt 2 in 4 seconds
<30>1 2023-11-14T16:16:30+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="604"] [wan_link0] Link: reconnection attempt 2
<30>1 2023-11-14T16:16:30+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="605"] [wan_link0] PPPoE: Set PPP-Max-Payload to '1500'
<30>1 2023-11-14T16:16:30+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="606"] [wan_link0] PPPoE: Connecting to ''
<30>1 2023-11-14T16:16:39+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="607"] [wan_link0] PPPoE connection timeout after 9 seconds
<30>1 2023-11-14T16:16:39+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="608"] [wan_link0] Link: DOWN event
<30>1 2023-11-14T16:16:39+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="609"] [wan_link0] LCP: Down event
<30>1 2023-11-14T16:16:39+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="610"] [wan_link0] Link: reconnection attempt 3 in 1 seconds
<30>1 2023-11-14T16:16:40+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="611"] [wan_link0] Link: reconnection attempt 3
<30>1 2023-11-14T16:16:40+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="612"] [wan_link0] PPPoE: Set PPP-Max-Payload to '1500'
<30>1 2023-11-14T16:16:40+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="613"] [wan_link0] PPPoE: Connecting to ''"

... over and over. 
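
To put numbers on the retry loop, a rough Python sketch that pairs each "Connecting to" line with the following "connection timeout" line and reports how long each attempt lasted. The regex and the function name are my own, written against the log format pasted above; they are not part of mpd5 or opnsense.

```python
import re
from datetime import datetime

# Matches the syslog-style lines from the ppps log above (pattern is an assumption
# based on the pasted sample, not an official format definition).
LINE_RE = re.compile(
    r'^"?<\d+>1 (?P<ts>\S+) \S+ ppp \d+ - \[meta sequenceId="\d+"\] (?P<msg>.*)$'
)

def attempt_durations(lines):
    """Return the duration in seconds of each PPPoE attempt that ended in a timeout."""
    durations = []
    started = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines without a message or in another format
        ts = datetime.fromisoformat(m.group("ts"))
        msg = m.group("msg")
        if "PPPoE: Connecting to" in msg:
            started = ts
        elif "PPPoE connection timeout" in msg and started is not None:
            durations.append((ts - started).total_seconds())
            started = None
    return durations

SAMPLE = """\
<30>1 2023-11-14T16:16:04+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="592"] [wan_link0] PPPoE: Connecting to ''
<30>1 2023-11-14T16:16:13+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="593"] [wan_link0] PPPoE connection timeout after 9 seconds
<30>1 2023-11-14T16:16:17+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="599"] [wan_link0] PPPoE: Connecting to ''
<30>1 2023-11-14T16:16:26+00:00 fw00.localdomain ppp 1291 - [meta sequenceId="600"] [wan_link0] PPPoE connection timeout after 9 seconds
""".splitlines()

print(attempt_durations(SAMPLE))  # [9.0, 9.0]
```

Every attempt runs the full 9 seconds without an answer, which fits the tcpdump further down showing only outgoing PADI frames.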

It doesn't matter if I try to reload/reconnect the PPPoE interface, kill mpd5, reload all services, nothing will bring it back, other than a reboot. 

- All Intel NICs.
- em0 and the VLAN interfaces are up
- tcpdumping on the em0 interface just shows:

Code: [Select]
listening on em0, link-type EN10MB (Ethernet), capture size 262144 bytes
16:24:59.901275 PPPoE PADI [Host-Uniq 0x8038224300F8FFFF] [Service-Name] [PPP-Max-Payload 0x05DC]
16:25:01.900485 PPPoE PADI [Host-Uniq 0x8038224300F8FFFF] [Service-Name] [PPP-Max-Payload 0x05DC]
16:25:05.900490 PPPoE PADI [Host-Uniq 0x8038224300F8FFFF] [Service-Name] [PPP-Max-Payload 0x05DC]
16:25:09.902668 PPPoE PADI [Host-Uniq 0x40DB564300F8FFFF] [Service-Name] [PPP-Max-Payload 0x05DC]
16:25:11.902485 PPPoE PADI [Host-Uniq 0x40DB564300F8FFFF] [Service-Name] [PPP-Max-Payload 0x05DC]

I've tried:

- Disconnect/reload PPPoE via the UI
- Killing mpd5
- Enabling/disabling promiscuous (although I don't believe this is required for me...)
- Enabling/disabling 'prevent interface removal'
- Tried leaving MTU default, rather than 1508 on the PPPoE (for 1500 calculated)
- Tried changing the physical WAN port on the firewall
- Tried unplugging and re-plugging the ethernet cable (to force link loss, recovery)
- Reset all tunables

Is there anything that is generated differently, for example a 'fake' MAC or some form of ID (Host-Uniq?), that is different after a reboot but not different after a reload of PPPoE? 

I'm at the stage now, other than trying to source different hardware, where I'm wondering if it's the ISP that's dropping the connection and blocking reconnects - for whatever reason - but after a reboot, the ID/MAC appears different so succeeds?!? Clutching at straws now!

... guess next step is to hook up the ISP router again and run this in the DMZ behind it and see if it still occurs.
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 14, 2023, 05:33:49 pm
Interestingly...

If I move the cable from em0, to a spare port, then reassign the VLAN to that interface (which takes PPPoE with it), reload all services, PPPoE then comes up again without the reboot.

If I move the cable back to em0, reassign the VLAN to that em0, reload all services, PPPoE still does not come up on the original interface.

... at a loss.

EDIT: Although this only worked once, trying it again to the same previously working interface doesn't work.
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 15, 2023, 10:56:46 am
Around 15 hours whilst plugged into the ISP router, with the ISP router performing PPPoE, no PPPoE drops.  Have even ramped up usage, in case this was a factor.

Looks like there have potentially been similar general unresolved PPPoE issues on FreeBSD 'since Adam was a lad'.  Where nothing appears to resolve it, other than a reboot of the box.

Might have to try/build a Linux device to act as a PPPoE bridge/half-bridge and stick that in front of opnsense.
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: tuto2 on November 15, 2023, 05:01:28 pm
the Max-Payload is a bit weird.. have you configured an MTU on the parent interface? If so, try removing it
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 15, 2023, 07:02:19 pm
I tried both :)

Due to the problems in early 23.7.x, setting it was needed to get a 1500 MTU on PPPoE, until the changes around 23.7.7 (from memory) removed the need to set the MTU on the physical and VLAN interfaces; it now just works it out.

But, I think it's correct?  HEX converted to decimal:

 0x05DC -> 1500 ?

PPP-Max-Payload is defined in RFC 4638.
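
To double-check the arithmetic, a quick Python sketch (the names are mine, purely illustrative). It converts the PPP-Max-Payload value from the tcpdump output to decimal and shows where the 1508 on the parent comes from, assuming the usual 8 bytes of PPPoE overhead (6-byte PPPoE header plus 2-byte PPP protocol ID):

```python
# Assumed standard PPPoE encapsulation overhead.
PPPOE_HEADER_BYTES = 6
PPP_PROTOCOL_ID_BYTES = 2
PPPOE_OVERHEAD = PPPOE_HEADER_BYTES + PPP_PROTOCOL_ID_BYTES  # 8 bytes

def pppoe_payload_mtu(parent_mtu):
    """MTU left for IP packets inside PPPoE, given the parent interface MTU."""
    return parent_mtu - PPPOE_OVERHEAD

# Value advertised in the PADI frames above.
max_payload = int("0x05DC", 16)

print(max_payload)              # 1500
print(pppoe_payload_mtu(1508))  # 1500, the "1500 calculated" case
print(pppoe_payload_mtu(1500))  # 1492, the classic PPPoE MTU without RFC 4638
```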
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 16, 2023, 12:53:38 pm
ISP router performing PPPoE still stable for almost 48 hours.

Current thoughts for things to try:

- Move the VLAN tagging to a switch, just to rule this out and have a bog-standard interface in opnsense
- This seems intriguing

https://forums.freebsd.org/threads/override-mpd-pppoe-client-timeout.90413/

...  although for the above point, I would assume it would fail to connect all the time not just randomly disconnect/be unable to reconnect without a reboot.

- Stick a bridging device between the ISP router and the ONT and try to work out what it does differently (times, Host-Uniq, etc)
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 16, 2023, 06:23:12 pm
Switch in between opnsense and the ONT might be the next thing to try:

https://forum.netgate.com/topic/180061/lcp-no-reply-to-echo-requests/21?_=1700153939159&lang=en-GB

... no update since the OP posted that it certainly improved things.

EEE is disabled by default anyway, so it's not that:

Code: [Select]
hw.em.eee_setting: 1
Code: [Select]
       hw.em.eee_setting
       Disable or enable Energy Efficient Ethernet.  Default 1 (disabled).

Source: https://man.freebsd.org/cgi/man.cgi?em(4)
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 21, 2023, 11:04:17 am
Still running with the ISP router performing PPPoE, connection has been 'up' constantly for this period.

Came across the below, wondering if this has reared its head again:

https://github.com/opnsense/core/issues/2267

... symptoms seem very similar.
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 21, 2023, 01:57:56 pm
For my own benefit:

12:30 GMT+0 - removed the ISP router, after 1 week of up time, now running with opnsense performing VLAN tagging and PPPoE again.

Using following tunables:

Code: [Select]
net.isr.bindthreads: 1
net.isr.maxthreads: -1
dev.em.*.fc: 0

Dynamic DNS: Interface IP check method

If the problem reoccurs, next step I will try putting a switch between opnsense and the ONT.
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 21, 2023, 02:16:49 pm
Didn't last very long - dropped at 13:14 GMT+0

Code: [Select]
<30>1 2023-11-21T13:13:53+00:00 fw.localdomain ppp 28779 - [meta sequenceId="1"] [wan_link0] LCP: no reply to 1 echo request(s)
<30>1 2023-11-21T13:14:03+00:00 fw.localdomain ppp 28779 - [meta sequenceId="2"] [wan_link0] LCP: no reply to 2 echo request(s)
<30>1 2023-11-21T13:14:13+00:00 fw.localdomain ppp 28779 - [meta sequenceId="3"] [wan_link0] LCP: no reply to 3 echo request(s)
<30>1 2023-11-21T13:14:23+00:00 fw.localdomain ppp 28779 - [meta sequenceId="4"] [wan_link0] LCP: no reply to 4 echo request(s)
<30>1 2023-11-21T13:14:33+00:00 fw.localdomain ppp 28779 - [meta sequenceId="5"] [wan_link0] LCP: no reply to 5 echo request(s)
<30>1 2023-11-21T13:14:33+00:00 fw.localdomain ppp 28779 - [meta sequenceId="6"] [wan_link0] LCP: peer not responding to echo requests
<30>1 2023-11-21T13:14:33+00:00 fw.localdomain ppp 28779 - [meta sequenceId="7"] [wan_link0] LCP: state change Opened --> Stopping
<30>1 2023-11-21T13:14:33+00:00 fw.localdomain ppp 28779 - [meta sequenceId="8"] [wan_link0] Link: Leave bundle "wan"
<30>1 2023-11-21T13:14:33+00:00 fw.localdomain ppp 28779 - [meta sequenceId="9"] [wan] Bundle: Status update: up 0 links, total bandwidth 9600 bps

Unable to reconnect, without a reboot.
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 21, 2023, 02:41:58 pm
Dropped again, 13:41 - got a reply a short time before:

Code: [Select]
tcpdump -i vlan01 pppoes and ppp proto 0xc021
13:39:42.307124 PPPoE  [ses 0x594] LCP, Echo-Request (0x09), id 39, length 10
13:39:42.307235 PPPoE  [ses 0x594] LCP, Echo-Reply (0x0a), id 39, length 10
13:40:13.507118 PPPoE  [ses 0x594] LCP, Echo-Request (0x09), id 40, length 10
13:40:13.507240 PPPoE  [ses 0x594] LCP, Echo-Reply (0x0a), id 40, length 10
-- dropped here --
13:40:52.374228 PPPoE  [ses 0x594] LCP, Echo-Request (0x09), id 1, length 10
13:41:02.438086 PPPoE  [ses 0x594] LCP, Echo-Request (0x09), id 2, length 10
13:41:12.441570 PPPoE  [ses 0x594] LCP, Echo-Request (0x09), id 3, length 10
13:41:22.446509 PPPoE  [ses 0x594] LCP, Echo-Request (0x09), id 4, length 10

Code: [Select]
<30>1 2023-11-21T13:17:42+00:00 fw.localdomain ppp 73374 - [meta sequenceId="76"] [wan] IFACE: Rename interface ng0 to pppoe0
<30>1 2023-11-21T13:41:02+00:00 fw.localdomain ppp 73374 - [meta sequenceId="1"] [wan_link0] LCP: no reply to 1 echo request(s)
<30>1 2023-11-21T13:41:12+00:00 fw.localdomain ppp 73374 - [meta sequenceId="2"] [wan_link0] LCP: no reply to 2 echo request(s)
<30>1 2023-11-21T13:41:22+00:00 fw.localdomain ppp 73374 - [meta sequenceId="3"] [wan_link0] LCP: no reply to 3 echo request(s)
<30>1 2023-11-21T13:41:32+00:00 fw.localdomain ppp 73374 - [meta sequenceId="4"] [wan_link0] LCP: no reply to 4 echo request(s)
<30>1 2023-11-21T13:41:42+00:00 fw.localdomain ppp 73374 - [meta sequenceId="5"] [wan_link0] LCP: no reply to 5 echo request(s)
<30>1 2023-11-21T13:41:42+00:00 fw.localdomain ppp 73374 - [meta sequenceId="6"] [wan_link0] LCP: peer not responding to echo requests
<30>1 2023-11-21T13:41:42+00:00 fw.localdomain ppp 73374 - [meta sequenceId="7"] [wan_link0] LCP: state change Opened --> Stopping


Back to ISP router, will order a dumb £10 switch to put between the ONT/opnsense.
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 22, 2023, 04:34:29 pm
Added a cheap, unmanaged TP-Link TL-SG105S switch in-between opnsense and the ONT. 

Quite a few reports on the pfsense forums and reddit that this helped/fixed similar issues for others.

VLAN, PPPoE et al now being performed on opnsense again.
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 22, 2023, 04:50:43 pm
Still dropped in about 10 minutes.

Have plugged the ISP router WAN port into a computer and wireshark-ed the PPPoE discovery.

Going to try mimicking the Host-Uniq tag - this does not change on the ISP router, between reboots, restarts, etc.  On opnsense/mpd5 it seems to.
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 22, 2023, 05:11:28 pm
Still dropped, after about 20 minutes.

At a loss - connection is absolutely rock solid, when using the ISP router and opnsense in the DMZ (no PPPoE).

Might have to try 'the other one', for something to compare to :-/
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 22, 2023, 05:27:03 pm
Have moved VLAN tagging to a managed switch. 

PPPoE/em0 on opnsense now on a native VLAN port on the switch (no VLAN tagging in opnsense) with the tagged switch port connected to the ONT.
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 22, 2023, 06:09:18 pm
Was looking promising - got to about 40 minutes, with VLAN tagging on the switch - but then died/dropped.

Back to ISP router performing PPPoE and opnsense in the DMZ.
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: sja1440 on November 23, 2023, 11:21:16 am
I have had to resolve a variety of pppoe issues recently on my own Intel NIC based systems. My ISP requires me to access the ONT via pppoe over a vlan. I also 'spoof' my system MAC as seen by the ONT.  The symptoms that I have seen are:
(1) Frequent flapping of pppoe over vlan (seen in the system log) even though there were no entries in the pppoe logs. I could reproduce this problem by simply doing a speed test.
(2) pppoe over vlan becomes unusable after restart - I always needed to reconfigure the pppoe interface to get it working again.

It is possible that your problem is completely unrelated to mine, but just in case, here is how I need to configure my system in order to remove all my problems:
*ensure that powerd is active
*ensure that tunables net.isr.dispatch=deferred and net.isr.maxthreads=<number of cores> (and rebooted)
*create an interface for the raw underlying NIC (igc..). I call it underWAN
*assign your own MAC to underWAN (and NOT to the interface to the pppoe device)
*if set, remove promiscuous mode from the interface to the pppoe device. (if you don't then flapping always occurs)
*set promiscuous mode on underWAN (i.e the direct interface to the igc device)

I need to use "promiscuous mode" on the igc because I am using vlans and also because I am spoofing the MAC.
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 23, 2023, 12:19:06 pm
Thanks for your input!

I think the only 'thing' I haven't tried from your list - other than MAC spoofing, which I don't think should be required as my ISP does permit/support 3rd party routers without - is:

net.isr.dispatch=deferred

The box is powerful enough and can handle the connection with plenty of overhead.

Researching the problem, pretty much every type of firewall/router seems to have its own share of weird PPPoE issues - from OpenWRT on certain hardware, pfsense, even Asus/Asus Merlin routers.

Will certainly give your settings a whirl, I'm pretty much out of options now - I've worked through my earlier thoughts.

I am more or less resigned to having to keep the ISP router in place for now - which is particularly annoying, as they have a crazy ICMP limit through their router. 

Or, perhaps rebuilding the box with something like IPFire/Linux, or even vanilla FreeBSD/OpenBSD and experimenting with the user-land ppp rather than mpd5.  Not sure if I could easily test with user-land ppp rather than mpd5 on opnsense as it is.
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: sja1440 on November 23, 2023, 12:39:48 pm
Do you set "Promiscuous Mode"? And, more importantly, where do you set it?
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 23, 2023, 12:41:44 pm
I tried with and without, on physical, vlan and pppoe - no change.

My symptoms are very similar to a bug report from 18.x/19.x - absolutely nothing will bring the interface back when it drops, other than a reboot:

https://github.com/opnsense/core/issues/2267

All other NICs on the same card, remain operational.  So it's not a 'hard' lock up of all interfaces.

The difference being, I don't see indications in the packet dump that the remote end hasn't received the LCP replies.  The connection just drops, mpd5 then starts its own LCP echo requests at 10 second intervals, when it doesn't receive a frame from the peer.
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: sja1440 on November 23, 2023, 01:07:45 pm
Have you tried comparing the following different packet captures:
*captures taken  using OPNsense diagnostics on each of pppoe, vlan, and the underlying NIC interface
*a capture taken from a monitoring switch placed between the ONT and the OPNsense box

I ask because I see that in your log extracts, OPNsense is claiming that, at a certain point, it never sees replies from your ISP (via the ONT). I suppose the ISP is sending the responses, so the question is, why isn't OPNsense seeing them?

Doing this helped me identify the causes of my particular problems
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 23, 2023, 01:24:06 pm
MPD only sends echo requests when it hasn't received a frame from the peer for a set time: with the opnsense default settings that is 10 seconds, up to a maximum of 60 seconds.

The initial echo requests come from the remote end, which my end replies to.  They're sent from the remote end, every 30 seconds - my end replied about 15 seconds before the connection dropped and is not exhibiting signs that it didn't receive them.

The 5 at the end are opnsense: as it hasn't heard from the peer, because the connection has already dropped, it starts sending its own, up to a maximum of 60 seconds, before it considers the link 'dead'.

Quote
set link keep-alive seconds max
This command enables the sending of LCP echo packets on the link. The first echo packet is sent after seconds seconds of quiet time (i.e., no frames received from the peer on that link). After seconds more seconds, another echo request is sent. If after max seconds of doing this no echo reply has been received yet, the link is brought down.
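
As a rough illustration of that description, a simplified Python model of the timeline (my own sketch, not mpd5 code). It assumes the 10/60 values mentioned above and that both timers are measured from the last frame received from the peer; that second point is my assumption, chosen because it matches the timing in the logs.

```python
def keepalive_timeline(seconds, max_seconds):
    """Return (echo_offsets, down_offset), in seconds since the last frame from the peer.

    Simplified model: one echo request every `seconds` of silence,
    link declared dead once `max_seconds` have passed without any reply.
    """
    echo_offsets = list(range(seconds, max_seconds, seconds))
    return echo_offsets, max_seconds

echoes, down = keepalive_timeline(10, 60)
print(echoes)  # [10, 20, 30, 40, 50]
print(down)    # 60
```

Five unanswered echo requests followed by the link going down is the same pattern as the "no reply to 5 echo request(s)" / "peer not responding" lines in the logs.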
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: sja1440 on November 23, 2023, 06:44:43 pm
That's the connection being pulled down.

But why didn't it come up again?  From the pppoe logs I only see that the pppoe connection attempt timed out after around 10 seconds.

From your capture, do you see the PADI being broadcast by your box, and are some unicast PADOs then received from your ISP in response?

[Edited for clarification]
By capture I mean a capture made with something like a monitoring switch between the ONT and the OPNsense box. Such a capture will show the vlan encapsulation - I can't see it in the capture of your first post, which suggests that was made on the stack after OPNsense had removed the vlan encapsulation.

When troubleshooting one of my issues, on a capture made on the pppoe device I saw exactly what you saw: repeated PADI's but no PADO.  I only saw the PADO's (and realised what my problem was) by doing the capture on a monitoring switch.

Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 24, 2023, 09:14:22 am
Quote
> But why didn't it come up again?  From the pppoe logs I only see that the pppoe connection attempt timed out after around 10 seconds.

Fair point, mirror/span with an intermediary switch should help to remove some guess work.

Absolutely nothing - so far - will bring the interface back 'up', i.e it simply cannot reconnect, after it drops.  Other than a reboot.  It's like something is blocking, or hanging.

I found an OpenWRT post where an ISP was basically leaking packets meant for another customer, that basically caused very similar issues - pppoe/ppp was processing these erroneous packets and then locking up.

Anyway, this morning I am trying:

- Promiscuous mode on the physical interface (WAN_PHYS), as per your config
- I noticed that Shared Forwarding was still enabled, even though I'd already removed the Shaper rules for earlier troubleshooting.
- Tunables are still all defaults
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: sja1440 on November 24, 2023, 09:35:36 am
Given that your ISP router doesn't have the problem, it seems reasonable to assume that the ISP is indeed sending the PADO's. If so, this would mean that OPNsense is ignoring them for some reason. The monitor switch capture might help to understand why.
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 24, 2023, 10:22:03 am
Well, it has made it to an hour of uptime, which it hadn't managed in recent days of testing.

I am beginning to wonder if this is a Shared Forwarding problem - which I 'have' to have enabled, to use shaping, as I use Policy Based Routing heavily - as I had previously tried with Promiscuous on the physical port :-/

I have 2 other opnsense setups, which have also seen - I am now beginning to suspect - something similar, where the NIC just hangs or doesn't 'see' packets.  The other 2 setups, do not use PPPoE but do/did have Shared Forwarding enabled - one of the deployments only saw random crash/WAN hang, under semi load, after upgrading from 22.7 -> 23.7 ... and had been rock solid for years prior.

Fingers crossed things will run for a few days, then I can start to narrow down Shared Forwarding vs Promiscuous on the physical.
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: sja1440 on November 24, 2023, 11:34:29 am
I too use shaping and so have "Shared forwarding" set.

I still suggest that your best bet for understanding what is going on is to get a capture from a monitoring switch between the ONT and your OPNsense.
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 24, 2023, 11:41:15 am
Dropped again - arse.

Quote
> I still suggest that your best bet for understanding what is going on is to get a capture from a monitoring switch between the ONT and your OPNsense.

I hear you, I do, I'm not disagreeing, but at the moment that's not possible without a lot of faff.  However, my symptoms do appear to be different from yours.

But anyway, I added the below:

Quote
*ensure that tunables net.isr.dispatch=deferred and net.isr.maxthreads=<number of cores> (and rebooted)
*assign your own MAC to underWAN (and NOT to the interface to the pppoe device)

You just 'force assigned' the physical (underWAN) MAC back to itself on the physical interface?  Or by 'own MAC' you mean the own MAC of the ISP router?

Given the VLAN tagging, however, I wouldn't expect the ISP to see the physical MAC - only the VLAN and/or PPPoE?
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: sja1440 on November 24, 2023, 01:19:19 pm
Here is the way I have configured the relevant interfaces:
Code: [Select]
Interfaces->underWAN
    device=igc4
    block private networks=unset
    block bogon networks=unset
    MAC address=(my desired MAC)
    Promiscuous mode=yes
    MTU=empty
    MSS=empty

Interfaces->Other Types->VLAN->vlan0.0835
    parent=igc4 [underWAN]
    vlan tag=835
    Vlan priority=0 (default)
    Edit vlan=auto

Interfaces->Point-to-Point->Devices->pppoe0
    Link interface=vlan0.0835
    Username=<my username>
    Password=<my password>
    Service name=empty
    Host-Uniq=empty
    Local IP (vlan0.0835)=not set
    Gateway=not set
    Advanced options=all default

Interfaces->WAN
    Device=pppoe0
    block private networks=unset
    block bogon networks=unset
    MAC address=empty
    Promiscuous mode=unset
    MTU=empty
    MSS=empty

I spoof the MAC  to reduce the amount of information I give to my ISP regarding my choice of hardware. My ISP allows me to choose whatever router I like. I choose a MAC from the locally administered range.
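
For anyone wanting to do the same, a small Python sketch of what "locally administered" means in practice (helper names are made up for illustration). In the first octet, bit 0x02 set marks the address as locally administered and bit 0x01 clear keeps it unicast:

```python
import random

def is_local_unicast_mac(mac):
    """True if the MAC is locally administered (0x02 set) and unicast (0x01 clear)."""
    first_octet = int(mac.split(":")[0], 16)
    return bool(first_octet & 0x02) and not (first_octet & 0x01)

def random_local_mac(rng=None):
    """Generate a random locally administered, unicast MAC address."""
    rng = rng or random.Random()
    first = (rng.randrange(256) | 0x02) & 0xFE  # force local bit on, multicast bit off
    rest = [rng.randrange(256) for _ in range(5)]
    return ":".join(f"{b:02x}" for b in [first] + rest)

print(is_local_unicast_mac("02:00:00:00:00:01"))  # True
print(is_local_unicast_mac("00:1b:21:00:00:01"))  # False, vendor-assigned range
print(random_local_mac())
```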
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 24, 2023, 01:37:02 pm
Much obliged - only difference from my last drop/crash and now, is:

- Forcing a MAC on the physical WAN
- Setting dispatch to deferred

... I can't see how these would make a difference, based on the documentation, but hey.... nothing to lose :)
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 24, 2023, 02:50:24 pm
So far so good - looking hopeful.

I was experimenting with RSS at one point and I note:

Code: [Select]
If RSS is enabled with the 'enabled' sysctl, the packet dispatching policy will move from ‘direct’ to ‘hybrid’.

But I decided to revert back to the defaults of NOT using RSS - and whilst I didn't start seeing PPPoE problems immediately, it's possibly within a week of reverting.

If moving dispatch away from 'direct' is what resolves this problem, and it's looking likely that you're right, then when I had RSS enabled it would have switched to 'hybrid', potentially having the same or a similar effect as 'deferred'.

....but at a loss why this should completely break PPPoE on Direct, until the box is rebooted. 

Documentation on dispatch states 'impacted hardware' and 'performance improvements' - but my hardware has plenty of overhead, so I assumed this wasn't needed, and I didn't want to 'tweak settings for the sake of tweaking'!
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 24, 2023, 06:38:51 pm
Died again - after about 5-6 hours.

I give up for now, will stick with the ISP router and/or rebuild with something else to test.

Appreciate you taking the time to share your experiences/config.
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 28, 2023, 10:12:36 am
For the last few days, the issue seems to have stopped - but still none the wiser, really, as to the cause - final thoughts for now and recent changes:

- 'systat -vmstat 1'

I noticed that the interrupts for the one em0 queue (as I understand it, PPPoE will only use 1 queue) were reaching 5000-6000, even at a fairly low load of 90-100Mbps (both directions concurrently); the default limit seems to be 8000.  I don't know if these scale in a linear fashion based on throughput/packets, but I added:

Code: [Select]
hw.em.max_interrupt_rate: 32000
hw.em.rx_process_limit: -1

- Made sure all CPU scaling/PowerD was disabled and rebooted, difference is negligible anyway as the load is fairly consistent on this firewall, albeit with peak time increases

- No MAC spoofing
- No promiscuous
- MTU 1508 on PPPoE (for 1500 calculated)
- Disabled Flow Control on all interfaces
- Increased Codel flows/limits, these are the defaults so I don't set them on the shaper pipe, no changes to queue on the pipe (let it manage dynamically) and no changes to Quantum (1514, 1500 MTU +14 seems to be the best for high bandwidth, general use)

Code: [Select]
# Some suggestions this should be equal to at least maximum sessions/states, i.e flows, I believe. 
# Firewall regularly has at least 2000 sessions/state entries, so to allow for bursting
net.inet.ip.dummynet.fqcodel.flows: 8192
# The default hard size limit (in unit of packet) of all queues managed by an instance of the scheduler.
# This is the absolute upper limit permitted
net.inet.ip.dummynet.fqcodel.limit: 20480

- Increased the other interface queues:

Code: [Select]
net.inet.ip.intr_queue_maxlen: 2048
net.isr.defaultqlimit: 2048
net.link.ifqmaxlen: 2048 # Set to sum of RX/TX NIC descriptors; default 1024 descriptors
net.route.netisr_maxqlen: 2048

- Whilst TSO was (should have been) already disabled, based on opnsense defaults, also added the below:

Code: [Select]
net.inet.tcp.tso: 0

- Bind threads to CPUs and limit the number of threads (4 in my case):

Code: [Select]
net.isr.bindthreads: 1
net.isr.maxthreads: -1

I have no idea if:

- Something happened at the ISP, ONT upgrades, etc that caused the initial problems. Although it did not impact the ISP router PPPoE if that is/was the case.
- Interrupt/packet processing limit
- Something 'left over' from PowerD and/or CPU scaling, that was unloaded after a reboot/disable

I also did some reading into NetGraph, as:

Quote
"The trick part is that after PPPoE session is established, mpd5 does not process its traffic as it goes completely in-kernel"

.. for me, MPD made the connection successfully but it would drop at random points thereafter - and then refuse to connect until rebooted.

I have also noticed that it takes 6 seconds for the PPPoE connection to get a response from the ISP, then the connection completes in a further 2 seconds:

Code: [Select]
<30>1 2023-11-28T07:50:25+00:00 Firewall.localdomain ppp 75232 - [meta sequenceId="12"] [wan_link0] PPPoE: Connecting to ''
<30>1 2023-11-28T07:50:31+00:00 Firewall.localdomain ppp 75232 - [meta sequenceId="1"] PPPoE: rec'd ACNAME "XXXXX-XXX-C1"
.......
<30>1 2023-11-28T07:50:32+00:00 Firewall.localdomain ppp 75232 - [meta sequenceId="70"] [wan] IFACE: Rename interface ng0 to pppoe0

... and I believe it times out after 9 seconds by default, so a slow responding PPPoE server could contribute, but then I would have expected it to never connect initially (which was not the case).

Not using RSS in sysctl (em NICs seem to use this anyway, based on 'systat -vmstat' output (?) and documentation) or deferred dispatch. Whether or not this is still the case, most documentation I found says dispatch should be left on direct if possible, especially if there is shaping.
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 29, 2023, 11:10:56 am
Spoke too soon, happened again, albeit after a few days - then happened twice in a few hours.

I installed the Intel em drivers - 7.7.8 - then noticed the below in the log, when it happened again:

Code: [Select]
em0: Watchdog timeout Queue[0]-- resetting
em0: Watchdog timeout Queue[0]-- resetting
em0: Watchdog timeout Queue[0]-- resetting
em0: Watchdog timeout Queue[0]-- resetting
em0: Watchdog timeout Queue[0]-- resetting
em0: Watchdog timeout Queue[0]-- resetting
em0: Watchdog timeout Queue[0]-- resetting
em0: Watchdog timeout Queue[0]-- resetting
em0: Watchdog timeout Queue[0]-- resetting
em0: Watchdog timeout Queue[0]-- resetting
em0: Watchdog timeout Queue[0]-- resetting
em0: Watchdog timeout Queue[0]-- resetting
em0: Watchdog timeout Queue[0]-- resetting
em0: Watchdog timeout Queue[0]-- resetting
em0: Watchdog timeout Queue[0]-- resetting
em0: Watchdog timeout Queue[0]-- resetting

It is only ever em0, and seemingly only queue 0, which is where PPPoE traffic ends up (because PPPoE uses a single queue?).

A bit more Googling turned up reports from others using the 82574L on Linux, FreeBSD, etc.  One fix on Linux was to force MSI instead of MSI-X.

Decided to try disabling all MSI/MSI-X, to go back to legacy IRQ:

Code: [Select]
hw.pci.enable_msix="0"
hw.pci.enable_msi="0"

Now monitoring...and waiting .... again... If this helps, will then try with just MSI-X disabled.

EDIT: Also just spotted this, seems to happen at boot, something else to look into:

Code: [Select]
WARNING: attempt to domain_add(netgraph) after domainfinalize()
ng0: changing name to 'pppoe0'

EDIT 2: Other things to consider when re-enabling MSI-X, if running solely on legacy IRQ resolves the issue:

- Disabling Hyper-threading, to stop interrupts bouncing between threads
- machdep.disable_msix_migration=1

machdep.disable_msix_migration: Disable migration of MSI-X interrupts between CPUs
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: johndchch on November 29, 2023, 08:02:02 pm
can you switch out the 82574 for something newer?  I had stability issues with the em driver and an 82576 - went away going to something which is more current ( igb, ixgbe etc )
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on November 30, 2023, 08:16:14 am
Unfortunately not, 8 ports on an integrated board.  Not without replacing the entire thing.

However, this problem does only seem to show when using PPPoE - having another device perform PPPoE (ISP router, Mikrotik) there are no issues with the box/board generally performing DHCP/static IP assignment.

On the plus side, it has now made it to almost 24 hours running PPPoE - one of only a handful of times it has done so - since disabling both MSI/MSI-X.
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on December 02, 2023, 09:00:55 am
Three days, no PPPoE drops, since disabling MSI/MSIX.  Might leave it this way over Christmas/NY, to see if it continues to prove itself.

Box has 8 ports, I'm only using the first 4 - so it doesn't pose much/any of a problem that the IRQ is shared with the last 4 ports

Code: [Select]
vmstat -i
interrupt                          total       rate
irq16: em0 em4+                998150914       3871
irq17: em1 em5                    592811          2
irq18: em2 em6+                311699977       1209
irq19: em3 em7+                  4029536         16
irq22: hdac0                           9          0
irq23: ehci1                      386863          2
cpu0:timer                     277593929       1077
cpu1:timer                     150527204        584
cpu2:timer                     176130400        683
cpu3:timer                     142581455        553
Total                         2061693098       7996
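
A quick way to list only the shared lines from that output, as a sketch. It assumes the `vmstat -i` column layout shown above (device names, then total and rate), and `shared_irqs` is just an illustrative name. As far as I understand it, a trailing + means more handlers share the line than fit in the name:

```shell
# Print only the IRQ lines that carry more than one device.
# Reads `vmstat -i` output on stdin.
shared_irqs() {
    awk '/^irq[0-9]+:/ {
        ndev = NF - 3                      # fields between "irqN:" and total/rate
        if (ndev > 1 || $(NF - 2) ~ /\+$/) {
            line = $1
            for (i = 2; i <= NF - 2; i++) line = line " " $i
            print line
        }
    }'
}
```

Usage would be `vmstat -i | shared_irqs`. With the output above, it lists irq16 to irq19 and skips hdac0, ehci1 and the timers.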

For my benefit as much as anything, next steps I think are:

- Make sure anything 'unused' is disabled in the BIOS, inbuilt audio, etc.
- I did see another report some time ago, on a FreeBSD mailing list, where they recommended unloading USB from the kernel to prevent an em(4) oddity/crash/hang and then re-testing.  I would prefer to avoid that for recoverability reasons, but I can at least make sure USB Legacy is disabled, USB3.0 if it has it, etc.

Then:

- Re-enable both MSIX/MSI
- Disable MSI-X migrations: machdep.disable_msix_migration=1

machdep.disable_msix_migration: Disable migration of MSI-X interrupts between CPUs

If problems still occur, then:

- Try disabling HyperThreading (don't tell Theo)

If problems still occur:

- Move back to just MSI enabled, MSI-X disabled

If problems STILL occur:

- Then we just leave it on legacy IRQ and move on with life!

Or:

- Re-enable the FreeBSD Intel driver, see if it provides the option to disable MSIX on 1 interface em0 (where PPPoE resides) only.

I have only ever been able to reproduce this presumed em(4) lock-up with MSI/MSIX enabled and when using PPPoE. With regular, non-PPPoE interfaces the box has gone a couple of months without issues.

As PPPoE moves in-kernel after MPD makes the connection, this may suggest some driver/kernel conflict or problem?

Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on December 04, 2023, 02:41:20 pm
ISP had an unrelated outage this morning, so they ruined my PPPoE uptime - just over 5 days, this box running PPPoE had never made it that far previously.

Used this as an opportunity to:

- Disable everything ASPM (Active State Power Management) in the BIOS. pciconf confirms disabled:

Code: [Select]
pciconf -lcv | grep -i asp
                 link x1(x1) speed 2.5(5.0) ASPM disabled(L0s/L1)
                 link x1(x1) speed 2.5(5.0) ASPM disabled(L0s/L1)
                 link x1(x1) speed 2.5(5.0) ASPM disabled(L0s/L1)
                 link x1(x1) speed 2.5(5.0) ASPM disabled(L0s/L1)
                 link x1(x1) speed 2.5(5.0) ASPM disabled(L0s/L1)
                 link x1(x1) speed 2.5(5.0) ASPM disabled(L0s/L1)
                 link x1(x1) speed 2.5(5.0) ASPM disabled(L0s/L1)
                 link x1(x1) speed 2.5(5.0) ASPM disabled(L0s/L1)
                 link x1(x1) speed 2.5(2.5) ASPM disabled(L0s/L1)
                 link x1(x1) speed 2.5(2.5) ASPM disabled(L0s/L1)
                 link x1(x1) speed 2.5(2.5) ASPM disabled(L0s/L1)
                 link x1(x1) speed 2.5(2.5) ASPM disabled(L0s/L1)
                 link x1(x1) speed 2.5(2.5) ASPM disabled(L0s/L1)
                 link x1(x1) speed 2.5(2.5) ASPM disabled(L0s/L1)
                 link x1(x1) speed 2.5(2.5) ASPM disabled(L0s/L1)
                 link x1(x1) speed 2.5(2.5) ASPM disabled(L0s/L1)

- Disable all the various C states deeper than C1 (C2, C6, C7, etc.), which shouldn't have been used anyway
- Disabled Azalia (on board audio)
- Set 'Legacy USB' to 'Auto' (legacy USB will then only be enabled, at boot, if there is a keyboard, mouse, etc plugged in)
- Disable USB Mass Storage
- Re-enabled MSI, left MSI-X disabled:

Code: [Select]
hw.pci.enable_msix: 0
hw.pci.enable_msi: 1

- Still using the Intel drivers

By default, using legacy IRQ or MSI means the cards/interfaces only use a single queue - MSI-X by default enables multiple queues ... this aspect could also be something to look into further.
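
The three combinations tried so far can be summarised in a small helper, purely as an illustration. It assumes the driver picks the most capable mode the two tunables allow, which matches the behaviour in this thread but is a simplification. `irq_mode` is a hypothetical name:

```shell
# Map the two PCI tunables to the interrupt mode a NIC would be expected to use.
# $1 = value of hw.pci.enable_msi, $2 = value of hw.pci.enable_msix
irq_mode() {
    if [ "$2" = "1" ]; then
        echo "MSI-X (multiple queues by default)"
    elif [ "$1" = "1" ]; then
        echo "MSI (single queue)"
    else
        echo "legacy IRQ (single queue, line may be shared)"
    fi
}
```

On the firewall, this could be called as `irq_mode "$(sysctl -n hw.pci.enable_msi)" "$(sysctl -n hw.pci.enable_msix)"`. With the values above, it reports MSI.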

Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on December 10, 2023, 11:00:19 am
6 days, no PPPoE drops/flaps, running with MSIX disabled and MSI enabled.

Probably after Christmas/NY, I will try re-enabling MSIX - to see if the various BIOS changes made have had any impact.

However, my hunch is that running on MSI (i.e. 1 queue per NIC) is the likely fix.  In which case, trying to disable MSIX on purely the PPPoE physical interface (the only interface I see issues with) might be the way forward.

A bit confused with the below, but something to consider regardless:

Code: [Select]
machdep.disable_msix_migration: Disable migration of MSI-X interrupts between CPUs

Wondering whether setting 'net.isr.bindthreads: 1' (as I have it at the moment and had with MSIX enabled originally) should not already be doing something similar to the former?!

Code: [Select]
net.isr.bindthreads: Bind netisr threads to CPUs.

...ISR being interrupt service routine.

Although, the former would presumably apply to all MSIX - the latter, just the network side of things?!
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on January 10, 2024, 02:46:37 pm
PPPoE has now been 'up' continuously for 700 hours, ~30 days.

Will re-enable MSIX within the next week to see if the problems recur. If not, it is possible that some of the various BIOS settings noted above were the root cause.
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: AllesMangoBrudi on January 30, 2024, 04:59:26 pm
Thanks iMx for your thorough tests and explanations, you know what you're doing!

I have a very similar issue and only a reboot fixes it. I tried following your approach and wondered: How did you disable msi-x for the pppoe device? I understand you didn't disable it globally?
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: iMx on January 31, 2024, 01:23:58 pm
I haven't yet re-enabled MSIX - I disabled it for everything, globally, as this was the only option when using Intel's own drivers:

Code: [Select]
hw.pci.enable_msix: 0
hw.pci.enable_msi: 1

I wanted to 'prove' stability, which I seem to have now done. Current PPPoE uptime:

Code: [Select]
Uptime 1194:41:36

I just haven't gotten around to re-enabling MSIX yet, to see whether the various other BIOS changes I made had any impact (ASPM, legacy USB, etc.).

If you run:

Code: [Select]
sysctl -A | grep msix

... you might be able to see if you can disable it for just the physical interface that PPPoE runs on.  I can't recall if the FreeBSD drivers had the ability to disable it per interface; it possibly varies depending on the NIC.
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: elclay on February 09, 2024, 01:43:43 am
Finding the solution to this has been difficult, congratulations for being consistent, I abandoned it a long time ago. By the way, specifically how did you deactivate MSIX?
Title: Re: PPPoE drop/disconnect, requires a reboot to fix
Post by: phoenix on February 09, 2024, 09:56:10 am
The answer to your question is in the post above yours. ;)