[Solved] WAN/DHCP go down until web GUI is accessed

Started by HASNOGAME, September 18, 2022, 04:12:04 AM

September 18, 2022, 04:12:04 AM Last Edit: October 11, 2022, 03:50:32 AM by HASNOGAME
I have been experiencing a recurring issue with OPNsense where the WAN becomes unreachable and LAN devices stop receiving DHCP addresses. The odd part is that when I set a static IP on an endpoint and access the web GUI, everything immediately comes back up. If the web GUI is not accessed, DHCP takes 1-3 hours to come back up on its own.

I use Unbound DNS as our DNS resolver, and OPNsense also acts as our DHCP server (Services -> DHCPv4). Our WAN pulls its IP via DHCP from a bridged cable modem. I have IPv6 disabled for both WAN and LAN in OPNsense.

These outages occur roughly every 3 days:

Ping to my WAN gateway:


HTTPS test to Google:



Logs for the outage today between 18:17 and 18:20 EST:

System -> Log Files -> General


2022-09-17T22:51:05 Notice php plugins_configure dhcp (execute task : dhcpd_dhcp_configure())
2022-09-17T22:51:05 Notice php plugins_configure dhcp ()
2022-09-17T22:51:05 Notice configctl event @ 1663455065.08 exec: system event config_changed
2022-09-17T22:51:05 Notice configctl event @ 1663455065.08 msg: Sep 17 22:51:05 OPNsense.home.local config[62336]: [2022-09-17T18:51:05-04:00][INFO] config-event: new_config /conf/backup/config-1663455065.0518.xml
2022-09-17T22:51:05 Notice php plugins_configure dns (execute task : unbound_configure_do())
2022-09-17T22:51:05 Notice php plugins_configure dns (execute task : dnsmasq_configure_do())
2022-09-17T22:51:05 Notice php plugins_configure dns ()
2022-09-17T22:50:37 Notice php plugins_configure dhcp (execute task : dhcpd_dhcp_configure())
2022-09-17T22:50:37 Notice php plugins_configure dhcp ()
2022-09-17T22:50:36 Notice configctl event @ 1663455036.27 exec: system event config_changed
2022-09-17T22:50:36 Notice configctl event @ 1663455036.27 msg: Sep 17 22:50:36 OPNsense.home.local config[81874]: [2022-09-17T18:50:36-04:00][INFO] config-event: new_config /conf/backup/config-1663455036.2397.xml
2022-09-17T22:50:36 Notice php plugins_configure dns (execute task : unbound_configure_do())
2022-09-17T22:50:36 Notice php plugins_configure dns (execute task : dnsmasq_configure_do())
2022-09-17T22:50:36 Notice php plugins_configure dns ()
2022-09-17T22:49:45 Notice php plugins_configure dhcp (execute task : dhcpd_dhcp_configure())
2022-09-17T22:49:45 Notice php plugins_configure dhcp ()
2022-09-17T22:49:45 Notice php plugins_configure dhcp ()
2022-09-17T22:49:45 Notice configctl event @ 1663454985.11 exec: system event config_changed
2022-09-17T22:49:45 Notice configctl event @ 1663454985.11 msg: Sep 17 22:49:45 OPNsense.[redacted].local config[86961]: [2022-09-17T22:49:45+00:00][INFO] config-event: new_config /conf/backup/config-1663454985.0785.xml
2022-09-17T22:49:45 Notice php plugins_configure dns (execute task : unbound_configure_do())
2022-09-17T22:49:45 Notice php plugins_configure dns (execute task : dnsmasq_configure_do())
2022-09-17T22:49:45 Notice php plugins_configure dns ()
2022-09-17T21:33:04 Notice dhclient Creating resolv.conf
2022-09-17T20:54:14 Notice dhclient Creating resolv.conf
2022-09-17T18:47:10 Notice configctl event @ 1663440429.79 exec: system event config_changed
2022-09-17T18:47:10 Notice configctl event @ 1663440429.79 msg: Sep 17 18:47:09 OPNsense.home.local config[86961]: [2022-09-17T18:47:09+00:00][INFO] config-event: new_config /conf/backup/config-1663440429.7553.xml
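
When lining these entries up with the outage window, it may help to note that the leading timestamps appear to be UTC, while several embedded config-event stamps carry a -04:00 offset (for example, the 22:51:05 entry embeds 18:51:05-04:00). Below is a minimal Python sketch of the conversion. It assumes the reported 18:17 to 18:20 window is in that same UTC-4 local time; the helper name `to_utc` is only illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: local time is UTC-4, inferred from the -04:00 offsets
# embedded in some of the log lines above.
LOCAL = timezone(timedelta(hours=-4))

def to_utc(local_stamp: str) -> datetime:
    """Treat a naive 'YYYY-MM-DDTHH:MM:SS' stamp as local time and convert it to UTC."""
    naive = datetime.fromisoformat(local_stamp)
    return naive.replace(tzinfo=LOCAL).astimezone(timezone.utc)

window_start = to_utc("2022-09-17T18:17:00")
window_end = to_utc("2022-09-17T18:20:00")
print(window_start.isoformat(), window_end.isoformat())
# prints: 2022-09-17T22:17:00+00:00 2022-09-17T22:20:00+00:00
```

Under that assumption, the entries to compare against the outage are the ones stamped around 22:17 to 22:20, not 18:17 to 18:20.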


System -> Log Files -> Backend

2022-09-17T22:44:35 Notice configd.py [e328d179-8560-4c17-8868-bb308cfec681] Retrieve firmware product info
2022-09-17T22:44:34 Notice configd.py [62df1859-d2a8-4892-a26d-de64c8e59fc9] list gateway status
2022-09-17T22:17:03 Notice configd.py [83ca9362-e14a-4a40-b0cd-e8138af4c636] Retrieve firmware product info
2022-09-17T22:17:02 Notice configd.py [95f5a369-7e68-4059-a1ca-1e13ed721bb6] list gateway status


Services -> Unbound DNS -> Log File

2022-09-17T22:49:45 Informational unbound [3680:0] info: server stats for thread 0: requestlist max 113 avg 18.9338 exceeded 0 jostled 0
2022-09-17T22:49:45 Informational unbound [3680:0] info: server stats for thread 0: 13787 queries, 3564 answers from cache, 10223 recursions, 0 prefetch, 0 rejected by ip ratelimiting
2022-09-17T22:49:45 Informational unbound [3680:0] info: service stopped (unbound 1.16.2).
2022-09-17T22:49:45 Warning unbound PTR record already exists for [redacted].[redacted].[redacted](x.x.x.x)
2022-09-16T15:25:22 Informational unbound [3680:0] info: start of service (unbound 1.16.2).


Services -> DHCPv4 -> Log Files just has a lot of DHCPACK, DHCPDISCOVER, and DHCPREQUEST entries. Nothing out of the ordinary.

Specs:

OPNsense 22.7.4-amd64
FreeBSD 13.1-RELEASE-p2
OpenSSL 1.1.1q 5 Jul 2022
Intel(R) Xeon(R) CPU W3550 @ 3.07GHz (4 cores, 8 threads)
NIC: Intel Pro 1000 2-Port


root@OPNsense:~ # sysctl -a | grep -E 'dev.(igb|ix|em).*.%desc:'
dev.em.1.%desc: Intel(R) PRO/1000 PT 82571EB/82571GB (Copper)
dev.em.0.%desc: Intel(R) PRO/1000 PT 82571EB/82571GB (Copper)


The server is installed bare-metal.


WAN Gateway:


WAN Interface:


LAN Interface:


Interfaces Settings:


DHCPv4 (no advanced settings modified):


Unbound DNS:


I can't find much in the logs to determine what exactly is happening. Can anyone offer any advice as to what I can check to better diagnose this issue? I am also happy to provide additional details as needed.

Any assistance would be greatly appreciated!

Basic question: have you done a full restart of the system?
OPNsense 24.7.7 running on:
Dell Optiplex 3050
Intel I5-7600 @ 3.5Ghz (4 Cores)
Intel I350-T4 Nic
8G DDR4
256G SSD

Quote from: axsdenied on September 18, 2022, 07:29:05 PM
Basic question: have you done a full restart of the system?

Yes, I have tried multiple times, to no avail.

I don't see anything irregular in your config, which has me starting to look at drivers and hardware. In this case it looks like you're using Intel cards, which are generally pretty stable on BSD, with a few exceptions for newer chipsets.

Assuming your hardware is OK (i.e. no memory issues), I'd start swapping out cables. What is your physical network config? Any switches, hubs, or anything like that?

Quote from: axsdenied on September 19, 2022, 12:59:54 AM
I don't see anything irregular in your config, which has me starting to look at drivers and hardware. In this case it looks like you're using Intel cards, which are generally pretty stable on BSD, with a few exceptions for newer chipsets.

Assuming your hardware is OK (i.e. no memory issues), I'd start swapping out cables. What is your physical network config? Any switches, hubs, or anything like that?

Glad to hear my config is okay at least. Here is a simplified network diagram. All links use VLAN 1 unless otherwise specified:



I used to have the WAN plugged directly into the modem, but this issue started during that time. Moving it onto an L2 VLAN didn't resolve the issue, so I know it isn't the modem or firewall interfaces losing their physical link.

The RT-AC78U is a device I use as an L2 switch; the radios and routing functions on it are completely disabled and only the LAN ports are in use.

@HASNOGAME
Hi. Looking at the network diagram, I would pay attention to the "core switch tp-link 16 port".

Quote from: Fright on September 19, 2022, 04:45:39 PM
@HASNOGAME
Hi. Looking at the network diagram, I would pay attention to the "core switch tp-link 16 port".

Hi Fright, thank you for your reply.

The core switch is also L2-only. Is there anything I should pay attention to in particular?

OK, it looks like you're using VLAN 99 to isolate a connection through the switch to the cable modem, but we have to assume the switch configuration is correct and functioning normally (i.e. not a trunk). I'm also not sure whether this config causes any oddities with OPNsense.

Is there any way to just physically plug the cable modem directly into OPNsense, at least for troubleshooting?

Quote from: axsdenied on September 19, 2022, 05:29:33 PM
OK, it looks like you're using VLAN 99 to isolate a connection through the switch to the cable modem, but we have to assume the switch configuration is correct and functioning normally (i.e. not a trunk). I'm also not sure whether this config causes any oddities with OPNsense.

Is there any way to just physically plug the cable modem directly into OPNsense, at least for troubleshooting?

No problem, it is now plugged in directly from the OPNsense WAN to the modem. I will mention that I had this same issue prior to implementing VLAN 99.

Quote: The core switch is also L2-only. Is there anything I should pay attention to in particular?
I would try to look at logs, alerts, and stats.
I have not worked with this manufacturer often, but my impressions are not the best...
Just for example:
https://community.tp-link.com/en/business/forum/topic/152767

September 19, 2022, 05:41:31 PM #10 Last Edit: September 19, 2022, 05:47:25 PM by HASNOGAME
Quote from: Fright on September 19, 2022, 05:32:59 PM
Quote: The core switch is also L2-only. Is there anything I should pay attention to in particular?
I would try to look at logs, alerts, and stats.
I have not worked with this manufacturer often, but my impressions are not the best...
Just for example:
https://community.tp-link.com/en/business/forum/topic/152767

Unfortunately the switch does not have logs. It is basically a glorified L2-only switch. Its model number is TL-SG1016PE v2.

I wonder if the switch is unrelated to the issue at hand, though. As I mentioned before, the WAN used to be directly connected to the modem and would still go down every 3 days or so.

Quick question: where are the tests that show the failure being run from? If not from the OPNsense box, are you able to run those same tests from OPNsense?


Quote from: axsdenied on September 19, 2022, 05:52:07 PM
Quick question: where are the tests that show the failure being run from? If not from the OPNsense box, are you able to run those same tests from OPNsense?

Those reports are from an Uptime Kuma Docker container I have running on a separate server. Maybe I could use Monit? The documentation doesn't shed much light on how to set it up for this specific issue, however...

If you're checking historically, you can look in the "Reporting" section; the tab you're looking for is "Quality". That will log latency and packet loss.

In real time, if you're experiencing the issue, simply ping from the OPNsense box directly, either via the command line/SSH or using "Interfaces: Diagnostics: Ping". You could also check the ARP table and see if anything has changed there.
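
Because an outage may be over before anyone can log in, one option is to record the ping result and ARP table automatically. The following is only a rough Python sketch of that idea, not a tested OPNsense tool. It assumes `ping -c 1` and `arp -an` are available on the box; the function names and the injectable `run` parameter are illustrative choices made so the logic can be exercised without touching the network.

```python
import subprocess
from datetime import datetime, timezone

def ping_ok(host, run=subprocess.run):
    """Return True if one ICMP echo request to host gets a reply."""
    try:
        result = run(["ping", "-c", "1", host],
                     capture_output=True, text=True, timeout=5)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

def arp_snapshot(run=subprocess.run):
    """Return the ARP table as text, as printed by `arp -an`."""
    return run(["arp", "-an"], capture_output=True, text=True).stdout

def check_once(host, run=subprocess.run):
    """Probe host once; on failure return a UTC-stamped record with the ARP table."""
    if ping_ok(host, run):
        return None
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return f"{stamp} ping to {host} failed\n{arp_snapshot(run)}"
```

Calling `check_once("192.0.2.1")` (a placeholder address standing in for the WAN gateway) every 30 seconds or so, and appending any non-`None` result to a file, would leave a record that can be compared with a snapshot taken after the connection recovers.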

Quote from: axsdenied on September 19, 2022, 10:58:52 PM
If you're checking historically, you can look in the "Reporting" section; the tab you're looking for is "Quality". That will log latency and packet loss.

In real time, if you're experiencing the issue, simply ping from the OPNsense box directly, either via the command line/SSH or using "Interfaces: Diagnostics: Ping". You could also check the ARP table and see if anything has changed there.

Experiencing an outage now. I can ping the LAN gateway of the OPNsense box but cannot SSH to it. All ports show physically up on the switch, the modem shows a link light on the port which goes to the OPNsense box, and both the WAN and LAN interfaces on the OPNsense box have visible blinking link lights.

Changing the patch cables for the WAN/LAN ports and pinging the firewall have not brought it back up. The moment I opened the web interface, however, internet access was restored, and I can now SSH to the OPNsense box.