Losing WAN connection periodically

Started by jstarta, August 21, 2025, 09:38:14 PM

The command is kldstat, but I think most of the common NIC drivers are statically linked.
Intel N100, 4* I226-V, 2* 82559, 16 GByte, 500 GByte NVME, ZTE F6005

1100 down / 800 up, Bufferbloat A+

I had this exact issue recently, and I spent over a month trying to figure it out.

In the end, I did many things in a desperate attempt to get reliable connectivity back, so I can't pinpoint one thing, but the sum of things was this:

1. Deleted all the VLANs I had recently created, going back to a single default network. Probably didn't help, but it was a variable.
2. Switched from Cloudflare to Quad9. PingTool showed Cloudflare frequently becoming unresponsive. When pinging the OPNsense WAN interface, the default gateway, and Cloudflare / Quad9 on both IPv4 and IPv6, the point of disconnect most commonly showed up between my WAN interface and the default gateway, but Cloudflare was still unresponsive more often than Quad9.
3. Had a technician install a splitter between the street and the cable modem, as the signal was "coming in too hot".

I believe #3 had the most to do with signal quality and reliability, and I'm going to go back and test #1 as soon as I get time.

FYI, I searched all over the place for a good Ping program that would run simultaneous pings to multiple addresses and log the results.  PingTool (ping-tool.com) was the best I could find for free.  It doesn't do fancy graphs, but it does keep a running table of results, and it will email you if it sees one of your targets go down or come back up for a specified time period.
Minisforum UN100D, N100, 8GB, 256GB SSD

August 27, 2025, 07:58:36 PM #17 Last Edit: August 27, 2025, 08:16:12 PM by BrandyWine
Quote from: jstarta on August 27, 2025, 11:28:25 AM
Not sure if this is normal, but there are a lot of leases:

root@OPNsense:~ # cat /var/db/dhclient.leases.igc1
lease {
  interface "igc1";
  fixed-address AAA.BBB.CC1.132;
  option subnet-mask 255.255.252.0;
  option routers AAA.BBB.CC0.1;
  option domain-name-servers XXX.YYY.ZZZ.142,XXX.YYY.ZZZ.242;
  option host-name "opnsense";
  option dhcp-lease-time 1800;
  option dhcp-message-type 5;
  option dhcp-server-identifier AAA.BBB.CC0.1;
  renew 3 2025/8/27 07:24:18;
  rebind 3 2025/8/27 07:35:33;
  expire 3 2025/8/27 07:39:18;



The list of leases is just one lease that keeps getting renewed; the system keeps a historical record each time the DHCP iface comes up.
The lease time is 30 min (option dhcp-lease-time 1800, normally handed out by the server), and the client sends its renew request at 15 min, half way through the lease time.
Seems a bit short, and not sure if that's causing the issue.
However, DHCP renewals for the most part are just noise and should not cause an iface to bounce in any way.
What type of DHCP device is connected to your WAN iface?
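
For reference, the renew and rebind times in the quoted lease line up with the usual DHCP defaults. A minimal sketch, assuming the server sent no explicit renewal/rebinding time options (none appear in the lease), so the client falls back to T1 = 50% and T2 = 87.5% of the lease time. The function name is made up for illustration:

```python
from datetime import datetime, timedelta

def lease_timers(expire, lease_seconds):
    """Derive lease start, renew (T1) and rebind (T2) times from the expiry."""
    start = expire - timedelta(seconds=lease_seconds)
    renew = start + timedelta(seconds=lease_seconds * 0.5)     # T1: 50% of lease
    rebind = start + timedelta(seconds=lease_seconds * 0.875)  # T2: 87.5% of lease
    return start, renew, rebind

# Values taken from the quoted lease: expire 07:39:18, dhcp-lease-time 1800.
expire = datetime(2025, 8, 27, 7, 39, 18)
start, renew, rebind = lease_timers(expire, 1800)
print(renew.time(), rebind.time())  # 07:24:18 07:35:33
```

Both computed values match the renew and rebind lines in the lease file, so the 15 min renew cycle is simply what a 30 min lease produces.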

August 27, 2025, 08:20:08 PM #18 Last Edit: August 27, 2025, 09:30:36 PM by BrandyWine
Quote from: meyergru on August 27, 2025, 12:19:40 PM
The command is kldstat, but I think most of the common NIC drivers are statically linked.
We need to distinguish between linking and loading.

kldstat lists dynamically loaded kernel modules (KLDs), the stuff outside the compiled kernel.
Statically linked usually means it's in-tree and compiled into the kernel, so it won't show up as its own .ko entry.

To save on kernel size, I would prefer the in-tree stuff to mostly live as dynamic KLDs, so that only what's needed gets loaded during init.
It's nice to have a big almost-monolithic kernel, but then not so nice due to size. Pros & cons.
But no security device would ever run fully monolithic.

modprobe is Linux-only, though. On FreeBSD, kldstat -v is perhaps the better check, since it also lists the modules compiled into the kernel file.

August 27, 2025, 08:31:26 PM #19 Last Edit: August 27, 2025, 08:41:06 PM by BrandyWine
Quote from: jstarta on August 27, 2025, 12:13:46 PM
How do I confirm that the igc driver is loaded correctly?
Well, loaded "correctly" will be hard to know.
It's either loaded into kernel or it's not.
Your iface list in post #9 shows igc, yes? That's the driver the kernel uses for the Intel i226-V controller.
With other lower numbered intel controllers we find igb being used.

The correct driver is loaded. I suspect that's not a place to be looking, igc from kernel build works good for i226-V.

Also, when you say "losing WAN", how so? Is it only connections that rely on DNS, or is all IP traffic dead?
Maybe turn off "allow dhcp to override system set DNS settings" in case your ISP DNS is flaky, and set your fw DNS to something like 9.9.9.11.
The logs will clearly show if the WAN iface bounced.
Try a continuous ping from the fw to something outside, see if it ever drops off.
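
If you save the output of that continuous ping to a file, the drop-offs can be pulled out afterwards. A rough sketch, assuming the usual `icmp_seq=N` field in each reply line; the sample text below is made up, not from the OP's box:

```python
import re

# Matches the sequence number in each reply line of ping output.
SEQ_RE = re.compile(r"icmp_seq=(\d+)")

def find_gaps(ping_output):
    """Return (first_missing, last_missing) ranges of unanswered sequence numbers."""
    seqs = [int(m.group(1)) for m in SEQ_RE.finditer(ping_output)]
    gaps = []
    for prev, cur in zip(seqs, seqs[1:]):
        # A jump of more than one means the replies in between never arrived.
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps

# Hypothetical sample: replies 3 to 6 are missing.
sample = """\
64 bytes from 9.9.9.11: icmp_seq=0 ttl=57 time=11.2 ms
64 bytes from 9.9.9.11: icmp_seq=1 ttl=57 time=10.9 ms
64 bytes from 9.9.9.11: icmp_seq=2 ttl=57 time=11.4 ms
64 bytes from 9.9.9.11: icmp_seq=7 ttl=57 time=12.0 ms
64 bytes from 9.9.9.11: icmp_seq=8 ttl=57 time=11.1 ms
"""
print(find_gaps(sample))  # [(3, 6)]
```

It only sees gaps that are followed by a later reply, so an outage still in progress when the capture ends will not show up, and it does not handle the sequence counter wrapping around.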

Quote from: allenlook on August 27, 2025, 04:26:55 PM
3. Had a technician install a splitter between the street and the cable modem, as the signal was "coming in too hot".
Your scenario is different from the OP's. You are on cable, they are on Ethernet.
The OP should ping from the OPNsense box itself, not from the LAN, to eliminate the FreeBSD+FW quirks.
If there is any variance in ping results between the LAN and OPNsense, then the problem is within the router. Otherwise it is between the GW interface and the provider's infrastructure.
A good test would be to use an alternative router and ping from it, then compare. Temporarily use any non-FreeBSD router distro.

August 27, 2025, 09:31:50 PM #21 Last Edit: August 27, 2025, 09:44:00 PM by jstarta Reason: Add a note
$ kldstat
Id Refs Address                Size Name
 1   71 0xffffffff80200000  216dad8 kernel
 2    1 0xffffffff8236e000    16650 if_lagg.ko
 3    2 0xffffffff82385000     3558 if_infiniband.ko
 4    1 0xffffffff82389000     ed60 if_bridge.ko
 5    2 0xffffffff82398000     8990 bridgestp.ko
 6    1 0xffffffff823a2000    1e280 opensolaris.ko
 7    1 0xffffffff823c1000    11a78 pfsync.ko
 8    3 0xffffffff823d3000    908a0 pf.ko
 9    1 0xffffffff82464000     3c10 pflog.ko
10    1 0xffffffff832ce000     aa30 if_gre.ko
11    1 0xffffffff832d9000     4be0 if_enc.ko
12    1 0xffffffff832de000     fb90 carp.ko
13    1 0xffffffff832ee000   5e9300 zfs.ko
14    1 0xffffffff84510000    b4270 if_iwlwifi.ko
15    1 0xffffffff845c5000     3378 lindebugfs.ko
16    1 0xffffffff845c9000     d200 rtsx.ko
17    1 0xffffffff845d7000     4250 ichsmb.ko
18    1 0xffffffff845dc000     2178 smbus.ko
19    1 0xffffffff845df000     3390 acpi_wmi.ko
20    1 0xffffffff845e3000     5640 ng_ubt.ko
21    4 0xffffffff845e9000     abb8 netgraph.ko
22    3 0xffffffff845f4000     a250 ng_hci.ko
23    2 0xffffffff845ff000     2670 ng_bluetooth.ko
24    1 0xffffffff84602000    2f5c0 if_wg.ko
25    1 0xffffffff84632000     4850 nullfs.ko
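
The listing above can be checked mechanically. A small sketch, assuming the five-column layout shown (Id, Refs, Address, Size, Name); the function name is made up for illustration. No if_igc.ko appears even though the box has igc interfaces, which fits the earlier point that the driver is compiled into the kernel rather than loaded as a module:

```python
def loaded_modules(kldstat_output):
    """Return the Name column from kldstat output, skipping the header row."""
    names = []
    for line in kldstat_output.splitlines():
        fields = line.split()
        # Data rows have five columns and start with a numeric Id.
        if len(fields) == 5 and fields[0].isdigit():
            names.append(fields[4])
    return names

# Header plus three rows copied from the listing above.
sample = """\
Id Refs Address                Size Name
 1   71 0xffffffff80200000  216dad8 kernel
 8    3 0xffffffff823d3000    908a0 pf.ko
24    1 0xffffffff84602000    2f5c0 if_wg.ko
"""
names = loaded_modules(sample)
print(names)                 # ['kernel', 'pf.ko', 'if_wg.ko']
print("if_igc.ko" in names)  # False
```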

Yep, I don't think it's a driver issue specifically. I have already disabled "Allow DNS server list to be overridden by DHCP/PPP on WAN" as well.

When it drops out, it's just 100% packet loss. Next time it happens, I'll try to capture as many different types of logs as I can.

What sort of logs should I be capturing to help us identify the root cause?

Quick edit: I've set up a ping on OPNsense to my remote VPS, and I have it pinging back as well so I can monitor traffic in both directions.

Also, just wanted to quickly thank everybody for your help so far - it's been fantastic, I'm learning a lot. Hopefully we can get to the bottom of it, as there are a few others that also have issues.

For brevity's sake, here are the tunables I've added so far:

hw.pci.enable_aspm = 0
hw.em.smart_pwr_down = 0
hw.pci.do_power_nodriver = 0
hw.pci.do_power_suspend = 0
net.link.ether.inet.max_age = 120
dev.igc.0.fc = 0
dev.igc.1.fc = 0
hw.igc.eee_setting = 0
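
One way to confirm the tunables actually took effect is to compare the list against what `sysctl` reports for the same names. A sketch only: the `actual` text below is a made-up example of `name: value` output, not from this system, and a loader tunable that is not exposed through sysctl would simply show up as missing (`None`):

```python
def parse_pairs(text, sep):
    """Parse 'name<sep>value' lines into a dict, ignoring lines without sep."""
    pairs = {}
    for line in text.splitlines():
        if sep in line:
            name, value = line.split(sep, 1)
            pairs[name.strip()] = value.strip()
    return pairs

def mismatches(wanted, actual):
    """Map each tunable that is missing or different to (wanted, actual)."""
    return {
        name: (value, actual.get(name))
        for name, value in wanted.items()
        if actual.get(name) != value
    }

# Three of the tunables from the list above.
wanted = parse_pairs("""\
hw.pci.enable_aspm = 0
dev.igc.0.fc = 0
dev.igc.1.fc = 0
""", "=")

# Hypothetical sysctl output; real values will differ.
actual = parse_pairs("""\
hw.pci.enable_aspm: 0
dev.igc.0.fc: 0
dev.igc.1.fc: 3
""", ":")

print(mismatches(wanted, actual))  # {'dev.igc.1.fc': ('0', '3')}
```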

August 28, 2025, 03:49:36 AM #24 Last Edit: August 28, 2025, 04:04:49 AM by BrandyWine
Do you have a /var/log/messages file? If so you can cat or grep that file looking for entries related to igc or interfaces. State changes should be logged.
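
As a rough illustration of what to look for: the FreeBSD kernel logs interface transitions as "<ifname>: link state changed to UP" or "DOWN". A sketch that filters those lines for one interface, assuming you have the log text from wherever your install keeps it; the sample lines and timestamps are invented:

```python
import re

# Kernel message emitted when an interface changes link state.
LINK_RE = re.compile(r"(\w+): link state changed to (UP|DOWN)")

def link_events(log_text, ifname):
    """Return (line, state) pairs for link transitions on one interface."""
    events = []
    for line in log_text.splitlines():
        m = LINK_RE.search(line)
        if m and m.group(1) == ifname:
            events.append((line, m.group(2)))
    return events

# Hypothetical log excerpt, not from the OP's system.
sample = """\
Aug 28 03:12:01 OPNsense kernel: igc1: link state changed to DOWN
Aug 28 03:12:02 OPNsense kernel: igc0: link state changed to UP
Aug 28 03:12:05 OPNsense kernel: igc1: link state changed to UP
"""
for line, state in link_events(sample, "igc1"):
    print(state, "<-", line)
```

If the WAN drops out and nothing like this shows up for the WAN interface around that time, the link itself probably stayed up and the problem is further along.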

I also suspect it's not related to any power or sleep settings: the WAN iface is always active just from the fw itself doing stuff, and the fw never actually drops into a sleep power state.

Interface hardware seems OK, so we need to look elsewhere. A DHCP issue is a DHCP issue, not a hardware issue, etc. I don't suspect DHCP either.
I did mean to ask earlier: in your DHCP client file, is the provided IP always the same, or did it change?

When you say "100% packet loss", what tool is used to derive that? Ping using IP? Other?

Another thing to look at is "arp -a". Find the entries on your WAN igc interface and note the MAC addresses. Your Intel WAN iface should be the MAC that starts with 00:e0:b4, so you want the other entry, the one with the expiry timer (usually at the top of the list); this is your DFG, aka the ISP's IP and MAC on the WAN side. Keep running the command and watch the timer count down. When it reaches zero, check that the entry is renewed with an IP and MAC address quickly; any delay here would cause 100% packet loss.
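
To make that easier to watch, the arp output can be reduced to just the non-permanent entries on the WAN interface. A sketch, assuming the usual FreeBSD line layout ("? (IP) at MAC on IFACE expires in N seconds [ethernet]"); the addresses and MACs in the sample are placeholders, not real ones:

```python
import re

# One arp -a line: IP, MAC (or "(incomplete)"), interface, then timer or state.
ARP_RE = re.compile(
    r"\((?P<ip>[^)]+)\) at (?P<mac>\S+) on (?P<iface>\S+) "
    r"(?:expires in (?P<ttl>\d+) seconds|(?P<state>permanent|expired))"
)

def dynamic_entries(arp_output, iface):
    """Return (ip, mac, seconds_left) for non-permanent entries on one interface."""
    entries = []
    for m in ARP_RE.finditer(arp_output):
        if m.group("iface") == iface and m.group("state") != "permanent":
            ttl = m.group("ttl")
            # seconds_left is None when the entry has already expired.
            entries.append((m.group("ip"), m.group("mac"), int(ttl) if ttl else None))
    return entries

# Hypothetical output: the gateway entry has a timer, the fw's own are permanent.
sample = """\
? (203.0.113.1) at 02:00:00:00:00:01 on igc1 expires in 37 seconds [ethernet]
? (203.0.113.132) at 00:e0:b4:00:00:01 on igc1 permanent [ethernet]
? (192.168.1.1) at 00:e0:b4:00:00:02 on igc0 permanent [ethernet]
"""
print(dynamic_entries(sample, "igc1"))  # [('203.0.113.1', '02:00:00:00:00:01', 37)]
```

An entry whose MAC shows as "(incomplete)" for more than a moment after the timer runs out would be the delay described above.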






Quote from: BrandyWine on August 28, 2025, 03:49:36 AM
Do you have a /var/log/messages file? If so you can cat or grep that file looking for entries related to igc or interfaces. State changes should be logged.

I also suspect it's not related to any power or sleep settings: the WAN iface is always active just from the fw itself doing stuff, and the fw never actually drops into a sleep power state.

Interface hardware seems OK, so we need to look elsewhere. A DHCP issue is a DHCP issue, not a hardware issue, etc. I don't suspect DHCP either.
I did mean to ask earlier: in your DHCP client file, is the provided IP always the same, or did it change?

When you say "100% packet loss", what tool is used to derive that? Ping using IP? Other?

Another thing to look at is "arp -a". Find the entries on your WAN igc interface and note the MAC addresses. Your Intel WAN iface should be the MAC that starts with 00:e0:b4, so you want the other entry, the one with the expiry timer (usually at the top of the list); this is your DFG, aka the ISP's IP and MAC on the WAN side. Keep running the command and watch the timer count down. When it reaches zero, check that the entry is renewed with an IP and MAC address quickly; any delay here would cause 100% packet loss.

There was no /var/log/messages file, unfortunately. The Gateways page showed Loss: 100%.
The provided IP address is always the same. I'll keep watching that 'arp -a' output; I had a look a few times and it always seemed to refresh in the last 5 seconds or so.


So it is from the router. Swap it for anything else that is not based on FreeBSD and compare. If the packet loss persists, then kick your provider in the ribs.

Quote from: Jyling on August 28, 2025, 03:39:10 PM
So it is from the router. Swap it for anything else that is not based on FreeBSD and compare. If the packet loss persists, then kick your provider in the ribs.
I would place a small switch on the WAN side (so there's no need to take out the fw), then plug in a laptop or something just for a short period. The ISP should hand out more than 1 IP. Run a continuous ping to something outside and see what happens.

Quote from: BrandyWine on August 28, 2025, 06:51:38 PM
The ISP should hand out more than 1 IP
Good luck with that, in most scenarios.

I've been unable to get to the bottom of the issues unfortunately, so I've put it in a VM under Proxmox. Took a bit of doing because Unbound and dnsmasq are the defaults now. I didn't want to just restore from backup, so I didn't bring across any of the weird nonsense I had done on my previous install when trying to get stuff working.

I'll let everybody know how things go - I really wish I could have figured it out but it was getting on my nerves constantly having to restart stuff.