Intermittent upload collapse to ~0 Mbps on PPPoE — fixed by pfctl -F states — OP

Started by dare, Today at 12:49:48 PM

Hello everyone,

I am new to this forum and have no formal education in networking — I am just an enthusiast who enjoys running my own secure and independent home network rather than relying on ISP-provided equipment. I have done my best to research and document this issue thoroughly before posting, but I may have missed something obvious or described things incorrectly. I ask for your patience and understanding, and I welcome any corrections or guidance.
Thank you in advance.

Environment:

OPNsense 26.1.8_5, FreeBSD 14.3
Protectli VP2430 (Intel I226-V NICs, hw.igc.msix_disable=1)
PPPoE WAN (FTTH, single WAN, no multi-WAN)
Firewall rules on Rules [new] (migrated from legacy in 26.1)
Traffic shaper enabled (ipfw/dummynet, pipes with FQ-CoDel)
Tunables: net.isr.dispatch=deferred, net.isr.maxthreads=4

Symptom:
Intermittently, upload drops to 0.05–0.15 Mbps while download remains at ~850 Mbps. Speedtest and the gateway monitor (dpinger, monitor IP 8.8.8.8) both report 5–10% packet loss. All devices are affected simultaneously, and each episode lasts from minutes to hours.
Suspected root cause:
pfctl -F states immediately and consistently fixes the fault. 404 states were cleared. This points to corrupted or stuck pf states.
What also fixes it:

Firewall rule reload (as a side effect of flushing states)
Full reboot (sometimes, not always)

Note: pfctl -F states alone is sufficient; no rule reload is needed.

What does NOT fix it:

PPPoE session restart
Unbound DNS restart
Restarting individual services

Diagnostics during a fault:
netstat -i -I igc2 — zero Ierrs, Idrop, Oerrs throughout. Local NIC is clean.
pfctl -s info — zero state-limit, zero memory errors, ~400 active states, no insert failures.
tcpdump -n -i pflog0 during fault — pf is actively blocking outbound SYN packets to legitimate destinations including Netflix (45.57.x.x) and Google (192.178.x.x). These IPs are NOT in any blocklist table (confirmed with pfctl -t MALWARE_LISTS -T test).
tcpdump -n -i pppoe1 during fault — download traffic flowing normally, outbound upload connections not being established.
Gateway monitor shows 5% packet loss to 8.8.8.8 during fault.
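
For anyone wanting to compare a faulty state table against a healthy one, here is a minimal sketch that saves the diagnostics listed above to a timestamped directory before the states are flushed. The function name capture_pf_snapshot and the file layout are made up for illustration; only standard pfctl and tcpdump options are used. The -e flag makes tcpdump print the pf rule number for each logged packet, and the capture waits until 100 logged packets have been seen.

```sh
#!/bin/sh
# Sketch only: capture_pf_snapshot is a hypothetical helper, not an OPNsense tool.
# Run as root while the fault is active, BEFORE running "pfctl -F states".
capture_pf_snapshot() {
    # Output directory: <base>/pf-fault-YYYYmmdd-HHMMSS (base defaults to /tmp)
    outdir="${1:-/tmp}/pf-fault-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$outdir" || return 1

    pfctl -s info       > "$outdir/info.txt"   2>&1   # counters and state totals
    pfctl -s memory     > "$outdir/memory.txt" 2>&1   # configured hard limits
    pfctl -vv -s states > "$outdir/states.txt" 2>&1   # full state table with ages
    pfctl -vv -s rules  > "$outdir/rules.txt"  2>&1   # rules with @N numbers

    # Logged packets with link-level header, which includes the matching rule number
    tcpdump -n -e -c 100 -i pflog0 > "$outdir/pflog.txt" 2>&1

    # Print the directory so the caller knows where the files went
    echo "$outdir"
}
```

Usage would be something like capture_pf_snapshot /root during a fault, followed by pfctl -F states. The rule number shown in pflog.txt can then be looked up against the @N entries in rules.txt to see which rule is blocking the outbound SYN packets.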
What does NOT appear to be the cause:

Physical NIC or cable (zero errors during faults)
ISP/modem (replaced by ISP, fiber signal confirmed at -15 dBm)
pf state table exhaustion (only ~400 states, no limit hits)
MALWARE_LISTS URL table (tested with pfctl -T test, no legitimate IPs matched)
PPPoE session drops (no PPPoE events in logs during faults)
Cabling to the switches (all cables tested, no faults)

Question:
Why would pf states become corrupted/stuck on a single PPPoE WAN, causing legitimate outbound SYN packets to be blocked? Is there a known interaction between PPPoE, the new Rules [new] firewall system, and pf state tracking in 26.1 that could cause this? Is there a proper fix beyond periodically flushing states?

I have been dealing with this for weeks now and my ISP's support are probably sick of me :)

Did you try disabling the traffic shaping? Depending on how you configure that, it can have detrimental effects.

Also, I226 NICs are known to freeze occasionally when power saving is enabled, so you should probably disable ASPM.
Intel N100, 4* I226-V, 2* 82559, 16 GByte, 500 GByte NVME, ZTE F6005

1100 down / 450 up, Bufferbloat A+

Thank you for the suggestions. I tested both.

Traffic shaping:
Disabling the shaper did not prevent the fault from occurring.

ASPM:
This turned out to be the key lead. After investigating further, I identified that my device is a Protectli VP2430 running coreboot v0.9.0, and Protectli has an official Technical Service Bulletin (TSB-2025-001) documenting an ASPM-related i226-V network performance degradation issue on exactly this firmware version. The TSB was issued for the VP2440 but the VP2430 uses identical hardware (same Intel i226-V NICs, same coreboot v0.9.0).

The hw.pci.enable_aspm=0 tunable does not fully resolve the issue — Protectli explicitly notes this in the TSB, as coreboot v0.9.0 enables ASPM at the firmware level before the OS loads and FreeBSD cannot override it. The pciconf output confirms ASPM L1 remains active despite the tunable being set.
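
For anyone who wants to run the same check, here is a small sketch that pulls the active ASPM state per device out of pciconf output. The helper name aspm_report is made up, and the parsing assumes the usual FreeBSD layout in which the PCI-Express capability prints a line such as "link x1(x1) speed 5.0(5.0) ASPM L1(L0s/L1)", where the value before the parentheses is the active setting and the value inside is what the device supports. If your pciconf prints this differently, the awk pattern will need adjusting.

```sh
#!/bin/sh
# Sketch only: aspm_report is a hypothetical helper that reads "pciconf -lc"
# output on stdin and prints "<device> <active ASPM state>" per PCIe link.
aspm_report() {
    awk '
        # Device header lines look like "igc0@pci0:1:0:0: class=..."
        /^[a-z_]+[0-9]+@pci/ { split($1, a, "@"); dev = a[1] }
        {
            # Find the ASPM keyword and take the field that follows it
            for (i = 1; i < NF; i++)
                if ($i == "ASPM") {
                    state = $(i + 1)
                    sub(/\(.*/, "", state)   # drop the "(supported)" part
                    print dev, state
                }
        }
    '
}
```

Run as root, pciconf -lc | aspm_report would print one line per device. A line reading L1 or L0s/L1 for an igc device means ASPM is still active on that link despite the tunable; disabled means it is off.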

The permanent fix for the VP2440 is a coreboot firmware update (v0.9.1-rc3). I have contacted Protectli asking whether an equivalent fix exists for the VP2430 and am awaiting their response.

But the question remains: why did this bug not manifest in the five months I have been running this same Protectli? It was not a problem after the update to OPNsense 26.1 either. It only started after my ISP reduced my upload from 150 Mbps to 50 Mbps.