PPPoE drop/disconnect, requires a reboot to fix

Started by iMx, November 14, 2023, 05:26:43 PM

Was looking promising - got to about 40 minutes, with VLAN tagging on the switch - but then died/dropped.

Back to ISP router performing PPPoE and opnsense in the DMZ.

I have had to resolve a variety of pppoe issues recently on my own Intel NIC based systems. My ISP requires me to access the ONT via pppoe over a vlan. I also 'spoof' my system MAC as seen by the ONT.  The symptoms that I have seen are:
(1) Frequent flapping of pppoe over vlan (seen in the system log) even though there were no entries in the pppoe logs. I could reproduce this problem by simply doing a speed test.
(2) pppoe over vlan becomes unusable after restart - I always needed to reconfigure the pppoe interface to get it working again.

It is possible that your problem is completely unrelated to mine, but just in case, here is how I need to configure my system in order to remove all my problems:
*ensure that powerd is active
*ensure that the tunables net.isr.dispatch=deferred and net.isr.maxthreads=<number of cores> are set (and reboot afterwards)
*create an interface for the raw underlying NIC (igc..). I call it underWAN
*assign your own MAC to underWAN (and NOT to the interface to the pppoe device)
*if set, remove promiscuous mode from the interface to the pppoe device. (if you don't then flapping always occurs)
*set promiscuous mode on underWAN (i.e. the direct interface to the igc device)

I need to use "promiscuous mode" on the igc because I am using vlans and also because I am spoofing the MAC.
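For anyone wanting to double-check the two tunables from the list above after a reboot, here is a rough Python sketch. It only parses text in the `name: value` form that FreeBSD's `sysctl` prints; the function names `parse_sysctl` and `check_tunables` are my own invention for illustration and are not part of OPNsense.

```python
def parse_sysctl(text):
    """Parse `sysctl` output lines of the form 'name: value' into a dict."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            values[name.strip()] = value.strip()
    return values


def check_tunables(text):
    """Return {name: (current, wanted)} for every tunable that does not match.

    Assumption: 'number of cores' is taken from hw.ncpu in the same output.
    """
    values = parse_sysctl(text)
    wanted = {
        "net.isr.dispatch": "deferred",
        "net.isr.maxthreads": values.get("hw.ncpu"),
    }
    mismatches = {}
    for name, want in wanted.items():
        current = values.get(name)
        # A missing wanted value (no hw.ncpu in the input) is reported too.
        if want is None or current != want:
            mismatches[name] = (current, want)
    return mismatches
```

Feed it the output of `sysctl hw.ncpu net.isr.dispatch net.isr.maxthreads`; an empty result means both values match the list above.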

Thanks for your input!

I think the only 'thing' I haven't tried from your list - other than MAC spoofing, which I don't think should be required as my ISP does permit/support 3rd party routers without - is:

net.isr.dispatch=deferred

The box is powerful enough and can handle the connection with plenty of overhead.

Researching the problem, pretty much every type of firewall/router seems to have its own share of weird PPPoE issues - from OpenWRT on certain hardware, pfsense, even Asus/Asus Merlin routers.

Will certainly give your settings a whirl, I'm pretty much out of options now - I've worked through my earlier thoughts.

I am more or less resigned to having to keep the ISP router in place for now - which is particularly annoying, as they have a crazy ICMP limit through their router. 

Or, perhaps rebuilding the box with something like IPFire/Linux, or even vanilla FreeBSD/OpenBSD and experimenting with the user-land ppp rather than mpd5.  Not sure if I could easily test with user-land ppp rather than mpd5 on opnsense as it is.

Do you set "Promiscuous Mode"? And, more importantly, where do you set it?

November 23, 2023, 12:41:44 PM #19 Last Edit: November 23, 2023, 12:44:52 PM by iMx
I tried with and without, on physical, vlan and pppoe - no change.

My symptoms are very similar to a bug report from 18.x/19.x - absolutely nothing will bring the interface back when it drops, other than a reboot:

https://github.com/opnsense/core/issues/2267

All other NICs on the same card, remain operational.  So it's not a 'hard' lock up of all interfaces.

The difference being, I don't see indications in the packet dump that the remote end hasn't received the LCP replies.  The connection just drops, mpd5 then starts its own LCP echo requests at 10 second intervals, when it doesn't receive a frame from the peer.

Have you tried comparing the following different packet captures:
*captures taken using OPNsense diagnostics on each of pppoe, vlan, and the underlying NIC interface
*a capture taken from a monitoring switch placed between the ONT and the OPNsense box

I ask because I see that in your log extracts, OPNsense is claiming that, at a certain point, it never sees replies from your ISP (via the ONT). I suppose the ISP is sending the responses, so the question is, why isn't OPNsense seeing them?

Doing this helped me identify the causes of my particular problems.

November 23, 2023, 01:24:06 PM #21 Last Edit: November 23, 2023, 01:29:36 PM by iMx
MPD only sends echo requests when it hasn't received a frame from the peer; with the opnsense default settings that means after 10 seconds of quiet, up to a maximum of 60 seconds.

The initial echo requests come from the remote end, which my end replies to.  They're sent from the remote end, every 30 seconds - my end replied about 15 seconds before the connection dropped and is not exhibiting signs that it didn't receive them.

The 5 at the end are from opnsense: it hasn't heard from the peer, because the connection has already dropped, so it starts sending its own, up to a maximum of 60 seconds, before it considers the link 'dead'.

Quote
set link keep-alive seconds max
This command enables the sending of LCP echo packets on the link. The first echo packet is sent after seconds seconds of quiet time (i.e., no frames received from the peer on that link). After seconds more seconds, another echo request is sent. If after max seconds of doing this no echo reply has been received yet, the link is brought down.
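To make the timing in that description concrete, here is a small Python sketch of the schedule. It is only my reading of the quoted text, not mpd5's actual code: I am assuming that both `seconds` and `max` are counted from the last frame received from the peer, and the function name `keepalive_schedule` is made up for illustration.

```python
def keepalive_schedule(seconds, max_seconds):
    """Return (echo_times, down_time), both in seconds since the last frame
    received from the peer.

    Assumption: an echo request goes out every `seconds` of continued quiet,
    and the link is brought down once `max_seconds` have passed with no reply.
    """
    if seconds <= 0 or max_seconds <= 0:
        raise ValueError("seconds and max_seconds must be positive")
    echo_times = list(range(seconds, max_seconds, seconds))
    return echo_times, max_seconds


echoes, down_at = keepalive_schedule(10, 60)
print(echoes, down_at)
```

With 10 and 60 this gives echo requests at 10, 20, 30, 40 and 50 seconds and the link going down at 60, which under this reading would line up with the 5 echo requests at the end of the capture.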

November 23, 2023, 06:44:43 PM #22 Last Edit: November 23, 2023, 07:38:09 PM by sja1440
That's the connection being pulled down.

But why didn't it come up again?  From the pppoe logs I only see that the pppoe connection attempt timed out after around 10 seconds.

From your capture, do you see the PADI being broadcast by your box, and are any unicast PADOs then received from your ISP in response?

[Edited for clarification]
By capture I mean a capture made with something like a monitoring switch between the ONT and the OPNsense box. Such a capture will show the vlan encapsulation - I can't see it in the capture of your first post, which suggests that it was made on the stack after OPNsense had removed the vlan encapsulation.

When troubleshooting one of my issues, on a capture made on the pppoe device I saw exactly what you saw: repeated PADI's but no PADO.  I only saw the PADO's (and realised what my problem was) by doing the capture on a monitoring switch.
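If it helps when going through a monitoring-switch capture, below is a minimal Python sketch that labels a raw Ethernet frame as PADI, PADO and so on, and reports the 802.1Q vlan id if the frame is tagged. The EtherType and code values are the standard PPPoE discovery ones; the function name `classify_frame` is hypothetical, and getting the raw frames out of a capture file is left out here.

```python
import struct

ETH_VLAN = 0x8100             # 802.1Q tag
ETH_PPPOE_DISCOVERY = 0x8863  # PPPoE discovery stage
PPPOE_CODES = {0x09: "PADI", 0x07: "PADO", 0x19: "PADR", 0x65: "PADS", 0xA7: "PADT"}


def classify_frame(frame):
    """Return (vlan_id or None, discovery packet name or None)."""
    if len(frame) < 14:
        return None, None
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    offset = 14
    vlan_id = None
    if ethertype == ETH_VLAN:
        if len(frame) < 18:
            return None, None
        # Tag control information, then the real EtherType.
        tci, ethertype = struct.unpack_from("!HH", frame, 14)
        vlan_id = tci & 0x0FFF
        offset = 18
    # The PPPoE header is 6 bytes: ver/type, code, session id, length.
    if ethertype != ETH_PPPOE_DISCOVERY or len(frame) < offset + 6:
        return vlan_id, None
    _ver_type, code = struct.unpack_from("!BB", frame, offset)
    return vlan_id, PPPOE_CODES.get(code)
```

On a capture taken at the switch, a tagged PADI should come back with the vlan id set, whereas the same frame captured on the pppoe or vlan device would show no tag. A run of PADIs with no PADO at the switch would point at the ISP side; PADOs visible at the switch but not on the box would point at the box.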


Quote from: sja1440 on November 23, 2023, 06:44:43 PM
But why didn't it come up again?  From the pppoe logs I only see that the pppoe connection attempt timed out after around 10 seconds.

Fair point, mirror/span with an intermediary switch should help to remove some guess work.

Absolutely nothing - so far - will bring the interface back 'up', i.e. it simply cannot reconnect after it drops, other than a reboot.  It's like something is blocking, or hanging.

I found an OpenWRT post where an ISP was basically leaking packets meant for another customer, that basically caused very similar issues - pppoe/ppp was processing these erroneous packets and then locking up.

Anyway, this morning I am trying:

- Promiscuous mode on the physical interface (WAN_PHYS), as per your config
- I noticed that Shared Forwarding was still enabled, even though I'd already removed the Shaper rules for earlier troubleshooting.
- Tunables are still all defaults

Given that your ISP router doesn't have the problem, it seems reasonable to assume that the ISP is indeed sending the PADO's. If so, this would mean that OPNsense is ignoring them for some reason. The monitor switch capture might help to understand why.

Well, it has made it to an hour of uptime - which, in recent days' testing, it hadn't.

I am beginning to wonder if this is a Shared Forwarding problem - which I 'have' to have enabled, to use shaping, as I use Policy Based Routing heavily - as I had previously tried with Promiscuous on the physical port :-/

I have 2 other opnsense setups, which have also seen - I am now beginning to suspect - something similar, where the NIC just hangs or doesn't 'see' packets.  The other 2 setups, do not use PPPoE but do/did have Shared Forwarding enabled - one of the deployments only saw random crash/WAN hang, under semi load, after upgrading from 22.7 -> 23.7 ... and had been rock solid for years prior.

Fingers crossed things will run for a few days, then I can start to narrow down Shared Forwarding vs Promiscuous on the physical.

I too use shaping and so have "Shared forwarding" set.

I still suggest that your best bet for understanding what is going on is to get a capture from a monitoring switch between the ONT and your OPNsense.

November 24, 2023, 11:41:15 AM #27 Last Edit: November 24, 2023, 01:05:00 PM by iMx
Dropped again - arse.

Quote
I still suggest that your best bet for understanding what is going on is to get a capture from a monitoring switch between the ONT and your OPNsense.

I hear you, I do, I'm not disagreeing, but at the moment that's not possible without a lot of faff.  However, my symptoms do appear to be different from yours.

But anyway, I added the below:

Quote
*ensure that tunables net.isr.dispatch=deferred and net.isr.maxthreads=<number of cores> (and rebooted)
*assign your own MAC to underWAN (and NOT to the interface to the pppoe device)

You just 'force assigned' the physical (underWAN) MAC back to itself on the physical interface?  Or by 'own MAC' do you mean the MAC of the ISP router?

Given the VLAN tagging however, I wouldn't expect the ISP to see the physical MAC - only the VLAN and/or PPPoE?

Here is the way I have configured the relevant interfaces:

Interfaces->underWAN
    device=igc4
    block private networks=unset
    block bogon networks=unset
    MAC address=(my desired MAC)
    Promiscuous mode=yes
    MTU=empty
    MSS=empty

Interfaces->Other Types->VLAN->vlan0.0835
    parent=igc4 [underWAN]
    vlan tag=835
    Vlan priority=0 (default)
    Edit vlan=auto

Interfaces->Point-to-Point->Devices->pppoe0
    Link interface=vlan0.0835
    Username=<my username>
    Password=<my password>
    Service name=empty
    Host-Uniq=empty
    Local IP (vlan0.0835)=not set
    Gateway=not set
    Advanced options=all default

Interfaces->WAN
    Device=pppoe0
    block private networks=unset
    block bogon networks=unset
    MAC address=empty
    Promiscuous mode=unset
    MTU=empty
    MSS=empty


I spoof the MAC  to reduce the amount of information I give to my ISP regarding my choice of hardware. My ISP allows me to choose whatever router I like. I choose a MAC from the locally administered range.
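In case it is useful, this is roughly how a MAC from the locally administered range can be generated and checked in Python. A locally administered unicast address has bit 0x02 set and bit 0x01 clear in its first octet. The function names are made up for illustration; this is just a sketch, not how OPNsense itself handles the field.

```python
import secrets


def random_local_mac():
    """Generate a random locally administered, unicast MAC address."""
    octets = bytearray(secrets.token_bytes(6))
    # Set the locally-administered bit (0x02), clear the multicast bit (0x01).
    octets[0] = (octets[0] | 0x02) & 0xFE
    return ":".join(f"{b:02x}" for b in octets)


def is_locally_administered_unicast(mac):
    """Check the two low-order flag bits of the first octet."""
    first = int(mac.split(":")[0], 16)
    return bool(first & 0x02) and not (first & 0x01)


print(random_local_mac())
```

The result goes in the "MAC address" field of underWAN, as in the configuration above.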

Much obliged - the only difference between my last drop/crash and now is:

- Forcing a MAC on the physical WAN
- Setting dispatch to deferred

... I can't see how these would make a difference, based on the documentation, but hey.... nothing to lose :)