Intermittent Internet Connectivity Issue

Started by trusmith, April 10, 2023, 11:26:56 AM

Hi Everyone.

I am new to OPNsense and have migrated to it after using pfSense for 2+ years.

My setup is a simple Dual WAN scenario:
1. ISP1 interface gets a dynamic Public IP with the ISP1's modem in bridged mode.
2. ISP2 interface gets a NAT'ed private IP via ISP2's modem+router combo (bridged mode is not allowed on it). I have currently set it to a static private IP to avoid certain issues that arise from DHCP lease renewal on the WAN side.

I have followed the official guides from OPNsense to set up the dual WAN. My problem is that every once in a while my clients lose internet connectivity, sometimes for short and sometimes for really long intervals (a couple of hours).

I am using the Round Robin with Stickiness option in the GW Group.

I tried troubleshooting by monitoring the firewall logs and such, but I am unable to pinpoint the issue exactly. I know this is not a technically rigorous statement to make, but my OPNsense box is set up with the same configuration my pfSense box had, and this problem was not there in pfSense, FYI.

Also, apologies in advance if I've missed adding or attaching something right off the bat. If the good folks of this community can advise what's needed, I'll post that as well. Currently I've attached my LAN rules.

Basically I'm looking for any pointers on how I can further troubleshoot my issue, or on general mistakes people end up making in such scenarios that I should avoid.

Thanks a lot,


Hi everyone,

Any help or even the correct direction on this would greatly help me.

TIA.

Does everyone behind the firewall lose internet connectivity? When connectivity drops, can the firewall itself reach the internet, for example by logging into the console and pinging google? Does this happen if you use only one ISP, without multi-WAN?


Hi,

Thanks for agreeing to assist. Please allow me to answer your questions in points:

1. Only a few random clients behind the firewall lose connectivity.
2. When connectivity drops, the firewall itself shows no sign of lost connectivity. Unbound still logs DNS queries coming in from clients and the responses going back, including responses to the clients whose browsers show a 'Connection Timeout' error. The firewall can also update pkg during that time and can ping 8.8.8.8 using 'Auto' interface selection. However, the randomly affected clients fail to ping 8.8.8.8.
3. As it happens, just today I created a Failover GW group in place of the load-balanced one, with the ISP that has the much more stable connection as Tier 1, and voilà, my issue seems to have disappeared!

Does this mean that the firewall is having trouble quickly switching clients from one GW to another in a load-balanced scenario? If so, how could I verify that and be sure? I remember that with pfSense (and the same ISPs) the switch used to be almost immediate.
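One way to reason about the question above is a toy model of sticky round robin. This is only a sketch under loud assumptions: the gateway names `WAN1_GW` and `WAN2_GW` and the client addresses are hypothetical, and the real firewall balances per connection with source tracking rather than by the order in which clients first appear. The point it illustrates is that, with stickiness, each client stays pinned to one gateway, so a gateway that is degraded but still counted as online would affect only the clients pinned to it.

```python
from itertools import cycle


def assign_sticky_round_robin(clients, gateways):
    """Pin each new client to the next gateway in rotation.

    A client that already has an entry keeps it (the "sticky" part).
    Simplified model; not how the firewall is actually implemented.
    """
    table = {}
    rotation = cycle(gateways)
    for client in clients:
        if client not in table:
            table[client] = next(rotation)
    return table


def affected_clients(table, degraded_gateway):
    """Return the clients currently pinned to the degraded gateway."""
    return sorted(c for c, gw in table.items() if gw == degraded_gateway)


# Hypothetical LAN clients and gateway names, for illustration only.
clients = [f"192.168.1.{n}" for n in range(10, 16)]
table = assign_sticky_round_robin(clients, ["WAN1_GW", "WAN2_GW"])

# If WAN1_GW is lossy but still "online", only these clients would suffer.
print(affected_clients(table, "WAN1_GW"))
```

If this model is anywhere near what is happening, the affected clients at any moment should all map to the same gateway, which is something the firewall's state table could confirm or rule out.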

I have attached my current LAN rules which seem to be working fine. Please do advise though if there are any obvious silly mistakes there.

Thanks & Regards,

1. Having a few random clients disconnect feels strange to me. Could one of these ISPs be having connectivity issues? How are you monitoring the ISP links to make sure they are online and available? It could be that they are going down for some reason and OPNsense isn't aware of it.

2. Be careful here, you are talking about two different things: DNS and connectivity. Try to isolate them. If it's DNS, this might not be connectivity related, and vice versa. Narrowing down which one is failing will help you know where to look for your problem.

3. This comment again makes me lean towards my comment on 1. I wonder if you just have a single ISP giving you some problems and it's coincidental to your switch from pfSense to OPNsense.
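Point 2 above (DNS versus connectivity) can be checked separately from an affected client. The following is a minimal sketch, not a definitive test: the hostname `example.com` and the TCP target `8.8.8.8` port 53 are assumptions chosen for illustration, and the name lookup may be answered from a local cache on some systems.

```python
import socket


def dns_ok(hostname="example.com"):
    """Return True if the system resolver can resolve the hostname."""
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def tcp_ok(ip="8.8.8.8", port=53, timeout=3.0):
    """Return True if a TCP connection to a raw IP succeeds (no DNS involved)."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refusals and unreachable routes
        return False


def diagnose(dns, tcp):
    """Turn the two check results into a hint about where to look."""
    if dns and tcp:
        return "both OK"
    if tcp:
        return "DNS problem: raw IP reachable, name lookup failing"
    if dns:
        return "connectivity problem: names resolve, raw IP unreachable"
    return "both failing: check routing/gateway first"
```

Running `print(diagnose(dns_ok(), tcp_ok()))` on a client while it is affected, and again on the firewall, should show which of the two is actually broken.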


1. GWs are monitored through monitor IPs: Google DNS on WAN1, Cloudflare DNS on WAN2. The trigger condition is high latency or packet loss. WAN1 is the one that shows slight packet loss every once in a while (albeit under the defined threshold) but stays 'online'.
2. Okay, my original issue is that clients lose internet connectivity intermittently, and I'll stick to that. But the fact remains that DNS behaves the way I described: while the firewall can not only ping the DNS servers but also resolve queries for its own traffic, the randomly affected clients can do neither at that time. If it makes life easier, we can ignore that.
3. I've had this ISP for years with pfSense. The issue surfaced only after my migration to OPNsense. It could very well be coincidental, yes, but I observed it for a good 3-4 days until I finally switched to a Failover group, which immediately solved it. It might just as well not be coincidental. I'd be interested to know what tests / exercises / logs I could dig into to be sure.
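To make point 1 concrete, here is a rough sketch of how threshold-based gateway monitoring can leave a lossy link marked as online. The threshold values and state names below are assumptions for illustration only, not values read from my config; the real ones are whatever is set on each gateway.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Illustrative values only (assumed); check the actual gateway settings.
    loss_warn: float = 10.0      # percent packet loss
    loss_down: float = 20.0
    latency_warn: float = 200.0  # milliseconds
    latency_down: float = 500.0


def gateway_state(loss_pct, latency_ms, t=Thresholds()):
    """Classify a gateway from its measured loss and latency."""
    if loss_pct >= t.loss_down or latency_ms >= t.latency_down:
        return "down"
    if loss_pct >= t.loss_warn or latency_ms >= t.latency_warn:
        return "warning"
    return "online"


# 5% loss is below both loss thresholds, so the link still counts as online
# and stays in the load-balancing rotation.
print(gateway_state(5.0, 40.0))
```

Under these assumed numbers, a WAN with occasional loss below the threshold never leaves the group, which would be consistent with WAN1 staying 'online' while some clients still have trouble.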

Thanks & Regards,