Random loss of WAN connectivity

Started by 9ck, July 22, 2025, 04:14:21 PM

July 22, 2025, 04:14:21 PM Last Edit: July 22, 2025, 04:24:48 PM by 9ck
Hi forum
I've been trying to identify why I sometimes lose my WAN connection. I've ruled out my ISP. I lose WAN connectivity on both WiFi and LAN, but I can still access everything locally (OPNsense keeps running). After a reboot of OPNsense the WAN connection is usually back. I suspect it has something to do with our company PCs running VPN connections and with the Unbound DNS setup in OPNsense, but I'm in over my head here. I've shared system logs with Copilot, which has been working on a reply since yesterday (12 logs).

I run OPNsense on a dedicated machine (Protectli) with nothing else on it. My main switch is a Unifi USW Pro24PoE, with a Unifi USWPro24 and a Unifi FlexMini connected to it. Three Unifi APs are connected to the main switch. All DNS and DHCP is handled by OPNsense, with Unbound DNS enabled and "locked down" so it will not forward any other DNS requests; it is set up to use Quad9. The LAN is split into several VLANs.

Here are some of the things I notice in the system log:
2025-07-21T14:23:58 Warning opnsense /usr/local/etc/rc.linkup: radvd_configure_do(auto) found no suitable IPv6 address on lan(igc1)
...
2025-07-21T14:23:57 Critical dhclient exiting.
2025-07-21T14:23:57 Error dhclient connection closed
2025-07-21T14:23:57 Warning opnsense /usr/local/etc/rc.linkup: radvd_configure_do(auto) found no suitable IPv6 address on lan(igc1)
2025-07-21T14:23:57 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dhcp (execute task : radvd_configure_dhcp(,inet6,[lan]))
2025-07-21T14:23:57 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dhcp (execute task : dhcpd_dhcp_configure(,inet6,[lan]))
2025-07-21T14:23:57 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dhcp (,inet6,[lan])
2025-07-21T14:23:57 Notice opnsense /usr/local/etc/rc.linkup: DEVD: Ethernet detached event for wan(igc0)
2025-07-21T14:23:56 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure newwanip:rfc2136 (,[wan])
2025-07-21T14:23:55 Notice opnsense /usr/local/etc/rc.newwanip: plugins_configure newwanip (execute task : wireguard_sync())
2025-07-21T14:23:55 Notice opnsense /usr/local/etc/rc.newwanip: plugins_configure newwanip (execute task : webgui_configure_do(,[wan]))
2025-07-21T14:23:55 Notice opnsense /usr/local/etc/rc.newwanip: plugins_configure newwanip (execute task : vxlan_configure_do())
2025-07-21T14:23:55 Notice opnsense /usr/local/etc/rc.newwanip: plugins_configure newwanip (execute task : unbound_configure_do(,[wan]))
2025-07-21T14:23:55 Notice opnsense /usr/local/etc/rc.newwanip: plugins_configure newwanip (execute task : openssh_configure_do(,[wan]))
2025-07-21T14:23:55 Notice opnsense /usr/local/etc/rc.newwanip: plugins_configure newwanip (execute task : opendns_configure_do())
2025-07-21T14:23:55 Notice opnsense /usr/local/etc/rc.newwanip: plugins_configure newwanip (execute task : ntpd_configure_do())
2025-07-21T14:23:55 Notice opnsense /usr/local/etc/rc.newwanip: plugins_configure newwanip (execute task : dhcrelay_configure_if(,[wan],inet))
2025-07-21T14:23:55 Notice opnsense /usr/local/etc/rc.newwanip: plugins_configure newwanip (,[wan],inet)
...
2025-07-21T14:23:09 Error opnsense /usr/local/etc/rc.linkup: The command '/bin/kill -'TERM' '83515''(pid:/var/run/dhclient.igc0.pid)  returned exit code '1', the output was 'kill: 83515: No such process'
2025-07-21T14:23:09 Notice opnsense /usr/local/etc/rc.linkup: DEVD: Ethernet attached event for wan(igc0)
2025-07-21T14:23:09 Notice kernel <6>igc0: link state changed to UP
2025-07-21T14:23:09 Error opnsense /usr/local/etc/rc.linkup: The command '/bin/kill -'TERM' '83515''(pid:/var/run/dhclient.igc0.pid)  returned exit code '1', the output was 'kill: 83515: No such process'
2025-07-21T14:23:09 Warning opnsense /usr/local/etc/rc.linkup: radvd_configure_do(auto) found no suitable IPv6 address on lan(igc1)
...
2025-07-21T14:23:06 Error opnsense /usr/local/etc/rc.linkup: The command '/sbin/dhclient -c '/var/etc/dhclient_wan.conf' -p '/var/run/dhclient.igc0.pid' 'igc0'' returned exit code '1', the output was 'igc0: no link .............. giving up'
2025-07-21T14:23:06 Notice kernel <6>igc0: link state changed to DOWN
2025-07-21T14:23:06 Notice kernel <6>igc0: link state changed to UP
2025-07-21T14:23:02 Notice kernel <6>igc0: link state changed to DOWN
2025-07-21T14:23:02 Notice kernel <6>igc0: link state changed to UP
2025-07-21T14:22:55 Error opnsense /usr/local/etc/rc.linkup: The command '/bin/kill -'TERM' '70234''(pid:/var/run/dhclient.igc0.pid)  returned exit code '1', the output was 'kill: 70234: No such process'
2025-07-21T14:22:55 Notice opnsense /usr/local/etc/rc.linkup: DEVD: Ethernet attached event for wan(igc0)
2025-07-21T14:22:55 Error opnsense /usr/local/etc/rc.linkup: The command '/bin/kill -'TERM' '70234''(pid:/var/run/dhclient.igc0.pid)  returned exit code '1', the output was 'kill: 70234: No such process'
2025-07-21T14:22:55 Warning opnsense /usr/local/etc/rc.linkup: radvd_configure_do(auto) found no suitable IPv6 address on lan(igc1)
...
2025-07-21T14:22:00 Notice dhclient dhclient-script: Reason REBOOT on igc0 executing
2025-07-21T14:21:59 Notice kernel <6>igc0: link state changed to UP
2025-07-21T14:21:58 Error dhclient send_packet: Network is down
2025-07-21T14:21:57 Error dhclient send_packet: Network is down
2025-07-21T14:21:56 Notice kernel <6>igc0: link state changed to DOWN
2025-07-21T14:21:56 Notice kernel <6>igc0: link state changed to UP
2025-07-21T14:21:55 Error dhclient send_packet: Network is down
2025-07-21T14:21:53 Error dhclient send_packet: Network is down
2025-07-21T14:21:52 Notice kernel <6>igc0: link state changed to DOWN
2025-07-21T14:21:52 Notice dhclient dhclient-script: Reason PREINIT on igc0 executing
2025-07-21T14:21:51 Notice kernel <6>igc0: link state changed to UP
2025-07-21T14:21:45 Notice kernel <6>igc0: link state changed to DOWN
2025-07-21T14:21:44 Notice kernel <6>igc0: link state changed to UP
2025-07-21T14:21:41 Notice kernel <6>igc0: link state changed to DOWN
2025-07-21T14:21:41 Notice kernel <6>igc0: link state changed to UP
2025-07-21T14:21:41 Notice opnsense /usr/local/etc/rc.linkup: DEVD: Ethernet attached event for wan(igc0)
2025-07-21T14:21:29 Critical dhclient exiting.
2025-07-21T14:21:29 Error dhclient connection closed
2025-07-21T14:21:29 Warning opnsense /usr/local/etc/rc.linkup: radvd_configure_do(auto) found no suitable IPv6 address on lan(igc1)
2025-07-21T14:21:28 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dhcp (execute task : radvd_configure_dhcp(,inet6,[lan]))
2025-07-21T14:21:28 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dhcp (execute task : dhcpd_dhcp_configure(,inet6,[lan]))
2025-07-21T14:21:28 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure dhcp (,inet6,[lan])
2025-07-21T14:21:28 Notice opnsense /usr/local/etc/rc.linkup: DEVD: Ethernet detached event for wan(igc0)
2025-07-21T14:21:28 Notice opnsense /usr/local/etc/rc.linkup: plugins_configure newwanip:rfc2136 (,[wan])
2025-07-21T14:21:27 Notice kernel <6>igc0: link state changed to DOWN

Since I've used Copilot with varying success in the past, I would really appreciate it if someone could "qualify" the reply I got. Here is the breakdown from Copilot. I realize it's not fair to ask for a specific answer given my lack of insight, but most of you probably have an opinion on Copilot's reply, and maybe someone has been in a similar situation.

🔍 What the Logs Reveal (Crash Patterns)
Across nearly every incident:
  • WAN Interface Flapping (igc0): Link goes DOWN → UP in rapid succession, often within seconds.
  • DHCP Struggles: dhclient fails to send packets (Network is down) just as interface recovery begins.
  • Plugin Overload: rc.linkup and rc.newwanip trigger multiple plugin reconfigurations (Unbound, VPN, DNS, Monitor) before the link is stable.
  • Resolver Turmoil: Unbound is caught mid-crash—trying to forward DNS while upstream hasn't recovered.

And in several logs:
  • Client Triggers Detected: Either from your company PC or Teams—DNS requests or TCP resets right as WAN begins recovering.

🧠 Root Cause Themes
Root Cause | Frequency | Example Trigger
Link-State Flapping | High | igc0: link state changed to DOWN
DHCP Reinitialization Failure | High | send_packet: Network is down
Plugin Execution Race | Medium | plugins_configure dns, ipsec
Early Client DNS Requests | Medium | Resets from PC during bootup
Unbound Restart Mid-Recovery | Medium | unbound_configure_do()

✅ What You Can Do
1. Tame the Interface
  • In OPNsense, go to Interfaces → WAN → Advanced
  • Set Link Up Delay to ~5 seconds to avoid plugin races
  • Enable Prevent interface flapping logic if available
2. Stagger Plugin Execution
  • Use System → Settings → Tunables to defer DNS resolver and VPN plugin execution until WAN is confirmed UP
  • Consider custom script delays for Unbound reloads
3. Isolate Company Devices
  • Create a separate VLAN for corporate PCs
  • In Unbound, disable DNS over TCP for that VLAN (use port override to reject encrypted queries if needed)
4. Fortify DHCP Recovery
  • Add retry-delay and initial-delay tweaks to dhclient.conf (can stabilize lease recovery after link events)
5. Monitor Client Behavior
  • Packet capture during bootup—look for TCP resets or malformed DNS requests
  • Sandbox MS Teams: test with and without the app running to compare crash likelihood

💡 Next Steps
If you want, I can build:
  • A custom Unbound config for resilient DNS per VLAN
  • A shell script to re-sequence plugin executions
  • A diagnostic dashboard to catch crash patterns early

I'd appreciate any help.

Obviously, your WAN link is flapping. If you are correct in "ruling out your ISP", I guess this does not happen when you use another router.

Quoting AI advice without any knowledge of your topology is not very helpful. BTW: is your OpnSense behind a router, or does it build up the connection by itself over a modem/ONT?

Since Protectlis usually have no hardware problems and I guess that the pure link is not the culprit here (since it looks like an Intel NIC), without further hints I would guess that your ISP is one of those who terminate the connection when they see illegal traffic (like RFC1918 IPs as source addresses which have not been NATed). Such packets may occur when your local devices use addresses that should only be visible over the VPN and/or your routes are set incorrectly.

You could try to pinpoint that by creating a WAN outbound block rule (this is one of the rare occasions they are useful) with RFC1918 as source.
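
Roughly, such a rule corresponds to something like the pf rule sketched below (illustration only; you create the actual rule in Firewall: Rules: WAN with direction "out" and logging enabled, and igc0 is assumed to be your WAN NIC):

# What the GUI rule amounts to in pf syntax (do not paste this anywhere, create it in the GUI):
#   block out log quick on igc0 inet from { 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16 } to any
# With logging enabled, you can watch matches live from a shell on the pflog interface:
tcpdump -nei pflog0 'src net 10.0.0.0/8 or src net 172.16.0.0/12 or src net 192.168.0.0/16'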
Intel N100, 4* I226-V, 2* 82559, 16 GByte, 500 GByte NVME, ZTE F6005

1100 down / 800 up, Bufferbloat A+

My OPNsense is behind a modem/router provided by my ISP. The ISP-provided router has been set to bridge mode. They do not detect any issues on their side when I lose my WAN connection.

Do you mean the physical cable and sockets with "the pure link" (you'll have to excuse me, but English isn't my first language)?

I was sure that I was only allowing non-RFC1918 traffic to go to the WAN, but going through my rules I do indeed see that on VLAN2SEC I allow everything (*) going to WAN. This is the VLAN where I have our company PCs (the ones using the company VPN service, which is out of my control). Could this be my issue?



This is the principle used on my other interfaces.

July 22, 2025, 05:43:27 PM #4 Last Edit: July 22, 2025, 05:45:34 PM by 9ck
Quote from: meyergru on July 22, 2025, 04:45:14 PM
You could try to pinpoint that by creating a WAN outbound block rule (this is one of the rare occasions they are useful) with RFC1918 as source.

Not sure I understand this correctly. Wouldn't such a rule block all my outbound traffic? What would I look for? Would this reveal whether it's my company VPN IP address that is causing issues?

July 22, 2025, 06:19:24 PM #5 Last Edit: July 22, 2025, 06:21:23 PM by meyergru
No, the blocking rule would block only packets with a source IP within RFC 1918 going to "any", but in the WAN "out" direction. You can even log that rule to see if it matches.

Normally, your LAN packets (which are RFC 1918, too) would be rewritten via NAT to originate from your WAN IP, so such a rule would not apply to this kind of legitimate traffic.

Illegal traffic can occur when any of your clients uses RFC1918 IPs that should normally be routed over your VPN but by mistake reach your OpnSense, because it is your default gateway. OpnSense will then send those packets via its own default gateway (at the ISP). If they are not NATed (which they probably are not, because no such rule exists), they will leave your OpnSense with their private source address. The ISP router cannot handle these packets, because it knows they could never be answered by anybody (being RFC1918 and thus not routeable on the internet).

Then, it is the ISP's choice to drop such packets. However, some ISPs think this is a hacking attempt and drop your whole connection.
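
If you want to check directly whether un-NATed RFC1918 packets actually leave your box, a quick capture on the WAN NIC shows them too (just a sketch; igc0 is assumed to be your WAN interface, and incoming noise from the ISP side may also match):

# Packets on the WAN NIC that still carry a private source address,
# i.e. traffic that was not rewritten by outbound NAT:
tcpdump -ni igc0 'src net 10.0.0.0/8 or src net 172.16.0.0/12 or src net 192.168.0.0/16'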
Intel N100, 4* I226-V, 2* 82559, 16 GByte, 500 GByte NVME, ZTE F6005

1100 down / 800 up, Bufferbloat A+

July 22, 2025, 06:33:25 PM #6 Last Edit: July 22, 2025, 06:37:15 PM by 9ck
Thanks for the explanation, I'll try to digest it. I've set up the rule (and also fixed the mistake I had in the VLAN2SEC rule). I guess I'll have to wait and see if something shows up in the logs or if this does the trick. I hope I've understood the outbound block rule correctly.

Could this also be the reason that I lose my connection to my LAN via Wireguard after 24h or so (from outside my LAN, obviously)?

Since you did not check the "Log" box, nothing will be logged.

A drop after 24h can be caused by a forced reconnect from your ISP; this is common in Germany, for example.
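
Once logging is on, matches show up in Firewall: Log Files: Live View, or you can grep the on-disk log from a shell (a sketch; the path below is the usual OPNsense location and may differ on your version):

# Show recent block entries from the firewall log files:
grep -i block /var/log/filter/filter_*.log | tail -n 20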
Intel N100, 4* I226-V, 2* 82559, 16 GByte, 500 GByte NVME, ZTE F6005

1100 down / 800 up, Bufferbloat A+

July 22, 2025, 06:50:11 PM #8 Last Edit: July 22, 2025, 07:03:16 PM by 9ck
Quote from: meyergru on July 22, 2025, 06:42:12 PM
Since you did not check the "Log" box, nothing will be logged.
A drop after 24h can be because of a forced reconnect by your ISP, this is common in Germany, e.g.
Ahh... I thought everything would show up in Log Files > Live View; probably not the best place to try to track things.
After the drop I cannot reconnect to my LAN via Wireguard. Would that be the case if it was a forced reconnect by the ISP? Sorry if I'm not being informative enough.
EDIT: Don't waste time on my Wireguard issue. I just recalled that the local machine I was trying to access had crashed while I was away. I need to do more thorough testing in order to give you the correct picture.

Probably, if your IP changes during the process and the other side does not detect that it should reconnect - Wireguard does not do that by default. There is a cron job in OpnSense to detect stale connections.
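
You can also check from a shell whether the tunnel has gone stale (a sketch; the instance name wg0 is an assumption, use whatever your WireGuard interface is called):

# Last completed handshake per peer as a Unix timestamp (0 = never).
# If this stops advancing after your WAN IP changes, the peer is still using the old endpoint.
wg show wg0 latest-handshakes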
Intel N100, 4* I226-V, 2* 82559, 16 GByte, 500 GByte NVME, ZTE F6005

1100 down / 800 up, Bufferbloat A+

July 22, 2025, 07:53:18 PM #10 Last Edit: July 22, 2025, 08:03:16 PM by 9ck
Looking into this, I see that I've enabled another cron job that restarts Wireguard every 6 hours in order to refresh the public IP. Would you recommend I keep this? I have a dynamic IP address; in reality it is only renegotiated if the connection is down for 3 hours or more. I wonder if I've set this up correctly. Shouldn't it be the Dynamic DNS settings that I refresh instead?

Any recommendation on how often I should run the job that renews the DNS for Wireguard?