OPNsense Forum

Archive => 21.1 Legacy Series => Topic started by: GeoffW on February 23, 2021, 11:54:34 am

Title: Intermittent and transient network errors
Post by: GeoffW on February 23, 2021, 11:54:34 am
A few days ago I set up OPNsense v21.1 (and updated to v21.1.1) on a virtual machine (VMware), to experiment with as a replacement for a pfSense installation.  Lots to like after I got my head around the things that are different, but I've got intermittent network errors that I was not getting with the other firewall.

The actual errors vary, but mostly they seem to be either ERR_SSL_PROTOCOL_ERROR, or sometimes ERR_CONNECTION_RESET.

They sometimes manifest in the browser as a page not loading, but more often it is just some of the images on the page that fail to load.  I've attached a screen capture of a browser console log after loading a news website.  (Ignore the last item in the list, that's a real/persistent error.)  Simply reloading the page (or right-click "Load Image") is enough to have the page/image load properly - so the errors are intermittent and transient.

The nature of the problem makes it really hard to analyse - especially since you can't just use a previous page to verify any change, and there is no guarantee the next page/site will see the problem this time around.
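
One way to make an intermittent fault like this measurable is to fetch the same URLs repeatedly and count how each attempt ends. This is a minimal sketch, assuming Python 3 on a LAN client; `fetch`, `classify` and `probe` are hypothetical names for illustration, and the labels only roughly correspond to the browser's ERR_SSL_PROTOCOL_ERROR and ERR_CONNECTION_RESET.

```python
import ssl
import urllib.error
import urllib.request
from collections import Counter


def fetch(url, timeout=10.0):
    """Download the URL completely; raises on any network or TLS failure."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()


def classify(exc):
    """Map an exception to a coarse label (assumed categories, not browser codes)."""
    if isinstance(exc, urllib.error.HTTPError):
        return "http_error"
    # urllib wraps many low-level failures in URLError; unwrap to the cause.
    reason = getattr(exc, "reason", None)
    if isinstance(exc, urllib.error.URLError) and isinstance(reason, BaseException):
        exc = reason
    if isinstance(exc, ssl.SSLError):
        return "ssl_error"
    if isinstance(exc, ConnectionResetError):
        return "connection_reset"
    if isinstance(exc, TimeoutError):
        return "timeout"
    return "other_error"


def probe(urls, rounds=5, fetcher=fetch):
    """Fetch every URL `rounds` times and count the outcomes."""
    counts = Counter()
    for _ in range(rounds):
        for url in urls:
            try:
                fetcher(url)
                counts["ok"] += 1
            except Exception as exc:
                counts[classify(exc)] += 1
    return counts
```

Running something like `probe(["https://example.com/"], rounds=50)` behind each firewall in turn would give a failure rate that can be compared, rather than relying on whichever page happens to break.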

I am using Unbound DNS with secure DNS (CloudFlare), but I can't really see that being the problem.  I have disabled IPv6 (I have no use for it on such a simple network so prefer to remove the complication).

I have Captive Portal turned on, but no proxy.

This is a very small network.  Just half-a-dozen machines connecting out (and some more devices that are intentionally being blocked by the Captive Portal).

Can anyone offer any suggestions on what I might try to resolve this?  I've lost the last three days to my experiments with no change, so I could really use some hints.

Thanks.
Title: Re: Intermittent and transient network errors
Post by: GeoffW on February 23, 2021, 12:08:53 pm
Just had another glitch on another site that pretty much proves it's not DNS.  A GIF started loading and actually started playing, before suddenly blinking out and being replaced with a broken-link icon.  The console showed it had received an ERR_CONNECTION_RESET error.  There are no network errors reported on the interfaces, and the Internet link has been at least as stable as it was before I put in this firewall.

And speaking of broken-link icons, this very page with its list of emojis above the edit area has approximately half of them displayed as broken link icons.  I attach two more screen captures, the display and the console showing the errors.
Title: Re: Intermittent and transient network errors
Post by: GeoffW on February 24, 2021, 01:21:46 am
I'm all out of ideas, so I've switched back to the old firewall and the protocol errors and connection resets have stopped.  I think the throughput is a bit slower, but at least it's working.  I really like (most of) the interface and reporting features of OPNsense, but not at the expense of core network function.

Maybe I'll come back and try again when you're up to Zealous Zorilla or something.
Title: Re: Intermittent and transient network errors
Post by: opnfwb on February 24, 2021, 02:45:07 am
How do you have the Unbound secure DNS configured for Cloudflare? OPNsense is a little different from pfSense when it comes to getting a full DoT implementation; you'll need to use Custom Options.

Are there any in/out interface errors when viewing LAN/WAN interfaces by selecting Interfaces/Overview within the OPNsense GUI?
Title: Re: Intermittent and transient network errors
Post by: GeoffW on February 24, 2021, 03:03:57 am
Thanks very much for your response.

For Unbound DNS setup I followed the instructions I found here: https://www.dnsknowledge.com/unbound/opnsense-set-up-and-configure-dns-over-tls-dot/ (including checking the resulting dot.conf file).

As I understand it, with 21.1 you no longer need the Custom Options.  The instructions seemed to work (confirmed traffic on 853 and none on 53).  At one stage I had both CloudFlare and Google DNS servers defined, but then I read that mixed DNS sources can lead to SSL protocol errors (although I don't think those articles were talking about this particular situation; I think they were talking about DNS on the browser/client being in conflict with the DNS reported by the network).


None of the interfaces show any errors (both sides reporting 0/0 for in/out errors).  Which I suppose makes my subject line here a bit ambiguous.  The errors are not hard network errors, but something higher level.  Also no blocks or rejects get reported on the Firewall (which I thought might happen after the connection reset errors, but no).

So the only errors (so far) are observed on browser clients (mostly Firefox and Vivaldi, Vivaldi is a Chromium based browser, running on Windows 10 20H2) - although this includes video and music player components that get interrupted with connection resets.
Title: Re: Intermittent and transient network errors
Post by: ingof on February 24, 2021, 04:13:14 am
Do you have WAN and LAN properly isolated, or are they on the same Ethernet segment?
Title: Re: Intermittent and transient network errors
Post by: GeoffW on February 24, 2021, 05:06:36 am
You had me wondering for a moment :) so I double checked.  The WAN side definitely only has the one cable and that plugs into the interface for this firewall.

The WAN side is a bit messy (but doesn't change between firewalls).  I'm remote and use a 4G wireless (from Telstra in Australia) broadband connection.  The actual router with the SIM is a small "Hotspot" device, but because I'm not actually mobile I plug it into a NetGear Cradle (with external antenna) that acts as separate router.  So, roughly:

Code:
   [ LAN ]-----[OPNsense]-------[ Cradle ]-------[ Hotspot ]-------[whatever-Telstra-has]-----{internet}
           LAN         10.x.x.x/24       10.y.y.y/24        10.z.z.z/?
          subnet

The Cradle does have WiFi that guests use.  I have a separate WiFi on the LAN side of the firewall for my own devices.  There is no cross-over using WiFi.

And like I said, none of the WAN side changes.  I'm just sliding out the pfSense VM and sliding in the OPNsense VM.  So I doubt if the mess is relevant and probably just distracts.


Thanks for your response.  If you see something in that mess that might make a difference peculiar to OPNsense then let me know.  Note that I did try ticking/unticking that "Disable Reply-To" Multi-WAN option after reading posts here, but it didn't seem to make any difference either way.
Title: Re: Intermittent and transient network errors
Post by: opnfwb on February 24, 2021, 06:14:25 am
More info here about DoT with cert validation. https://www.ctrl.blog/entry/unbound-tls-forwarding.html

Unfortunately the OPNsense GUI doesn't offer the domain name function to allow cert validation at this time. If you want a fully secure DoT setup, you'll need something like this in your custom settings (be sure to remove the duplicate references in the Miscellaneous section):

Code:
# TLS Config
tls-cert-bundle: "/etc/ssl/cert.pem"
# Forwarding Config
forward-zone:
  name: "."
  forward-tls-upstream: yes
  forward-addr: 2620:fe::9@853#dns9.quad9.net
  forward-addr: 9.9.9.9@853#dns9.quad9.net
  forward-addr: 149.112.112.9@853#dns9.quad9.net

You can modify that to taste for whichever DoT provider you want to use.

If the WAN interface is on a network with private IP ranges (192.x, 172.x, 10.x, etc.), I would also suggest going to Interfaces/WAN and unchecking the block private networks and block bogon networks options.

Try those two things and see if it helps?
Title: Re: Intermittent and transient network errors
Post by: GeoffW on February 24, 2021, 06:48:51 am
Thanks very much for that explanation regarding more secure DoT.  I will give that a try.

I actually like having the block private networks still ticked, since the point is to protect the LAN side even from guests whose devices I do not necessarily trust ... still, I guess I could try it and just see (no guests here now anyway).  Probably this evening, I'll let you know.
Title: Re: Intermittent and transient network errors
Post by: opnfwb on February 24, 2021, 07:59:56 am
If the WAN is issuing a routed public IP, no need to uncheck those two options. However, if it's putting you on a private IP range and double NAT'ing you, I would try unchecking them just to verify.

What does the packet loss look like on the WAN side? Turn on gateway monitoring and set a remote IP for your preferred DNS (1.1.1.1). Let it run for an hour or two and see what the packet loss looks like?
Title: Re: Intermittent and transient network errors
Post by: opnfwb on February 24, 2021, 08:01:55 am
I just realized my gateway monitoring suggestion probably wasn't very clear. Attaching a screenshot to help better show what I'm on about.
Title: Re: Intermittent and transient network errors
Post by: GeoffW on February 24, 2021, 09:20:06 am
Thanks heaps for the suggestions and extra detail.

After those changes (directly editing dot.conf with the domain name, turning off blocking of private/bogon networks on WAN, setting up monitoring of the gateway using IP 1.1.1.1), the results are:

I still get SSL protocol and connection reset errors (intermittently and temporarily) as first described.

I saw a lot of packet loss on the gateway monitor during startup (1..4%), and I am seeing some intermittent (but much smaller) bouts of packet loss since.  Still no errors reported on the interface itself.

I've only been running a few minutes after the change so far, so I'll leave it a while longer to see how the packet loss is reported.
Title: Re: Intermittent and transient network errors
Post by: GeoffW on February 24, 2021, 01:46:57 pm
None of the changes I've tried removed the intermittent SSL protocol errors nor the connection reset errors.

Monitoring 1.1.1.1 from the Gateway "Monitor IP" setting showed intermittent periods of packet loss between 1% and 4%.  The RTT was sitting up around 160ms.

Disclaimer: I don't know anything about the following website except that it claims to do what I needed for this test, use at your own risk.  The Packet Loss Test website (https://packetlosstest.com) also reports packet loss between 1% and 4% across several tests, and has average latency sitting up around 150ms with a lot of calls stretching up above 200ms.


Switching back to my old pfSense firewall, the errors go away as usual.  I set Gateway monitoring to look at 1.1.1.1 (as for OPNsense above) and it reported an RTT of approximately 50ms (on an otherwise idle network, it goes up during speed tests etc.), and I have not seen packet loss go above 0.0%.  Also, the Packet Loss Test website lost only a single packet over several tests and reports latency between 50ms and 80ms across several tests.
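
To compare the two firewalls on identical terms, the summary printed at the end of a `ping` run can be parsed and recorded for each one. This is a minimal sketch, assuming the FreeBSD-style ("packets received", "round-trip") or Linux-style ("received", "rtt") summary lines; `parse_ping_summary` is a hypothetical helper, not part of either firewall.

```python
import re

# Matches "100 packets transmitted, 96 packets received" (FreeBSD/macOS)
# and "100 packets transmitted, 96 received" (Linux).
LOSS_RE = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")
# Matches "round-trip min/avg/max/stddev = a/b/c/d ms" and "rtt min/avg/max/mdev = ...".
RTT_RE = re.compile(r"(?:round-trip|rtt) min/avg/max/\w+ = ([\d.]+)/([\d.]+)/([\d.]+)/")


def parse_ping_summary(text):
    """Extract packet counts, loss percentage and average RTT from ping output."""
    loss = LOSS_RE.search(text)
    if loss is None:
        raise ValueError("no ping summary found")
    sent, received = int(loss.group(1)), int(loss.group(2))
    loss_pct = 100.0 * (sent - received) / sent if sent else 0.0
    rtt = RTT_RE.search(text)
    avg_rtt_ms = float(rtt.group(2)) if rtt else None
    return {
        "sent": sent,
        "received": received,
        "loss_pct": loss_pct,
        "avg_rtt_ms": avg_rtt_ms,
    }
```

Feeding it the output of the same long ping to 1.1.1.1, taken once behind each firewall, gives two directly comparable records of loss and average RTT.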

...

So something's going on with the WAN connection under OPNsense, but I don't seem to be any closer to working out what.  I thought TCP generally arranged for automatic re-transmission of lost packets, so unless the packet loss rate is very high it should normally be transparent other than the performance loss (no errors like those that I am seeing).


Thank you very much opnfwb and ingof for your suggestions.  I am open to more, but may be slower responding because I really have to catch up on some work that I've been ignoring while trying to get this going.
Title: Re: Intermittent and transient network errors
Post by: opnfwb on February 24, 2021, 03:31:51 pm
That packet loss is definitely going to cause some issues. Up to 4% is pretty bad, you'd notice that on VOIP calls or anything that relied on UDP traffic.

Sorry, I'm not sure what else to suggest due to all of the variables here. Since we're dealing with VMs, both would need to be the same (same NICs, number of CPUs, etc.) to rule out any VM hardware influences. Which hypervisor are you using? I've run OPNsense on both ESXi and Hyper-V, so maybe we can compare notes if you're using either one of those.

Can you try a traceroute to the same IP (1.1.1.1) from the pfSense and OPNsense firewalls? Do they both have the same number of hops? Is one of them seeing more latency/loss on a certain hop than another? Do both VMs get the same WAN IP address, or does it change each time the firewalls are switched?

Due to the unusual nature of the WAN setup in this thread, do you need to do any MAC address cloning that would need to be set up in OPNsense? Just trying to think of any other variables that may cause an issue.

Lastly, you've mentioned you're editing a .conf, which I would assume to be unbound.conf? I would not recommend editing this file directly. Instead, apply changes or customization through the OPNsense GUI. You can run into issues where your custom changes get overwritten if you adjust something in the GUI and hit 'save'. Better to keep all changes in the GUI so that they all re-apply every time a tweak is made. At this point I don't think your issue is DNS but, if we can fix the packet loss, you'd want to make sure the Unbound stuff is squared away too to give you a consistent experience.
Title: Re: Intermittent and transient network errors
Post by: bitman on February 24, 2021, 09:05:09 pm
4G modem as WAN / broadband

Maybe have a play with the MTU size?
Title: Re: Intermittent and transient network errors
Post by: GeoffW on February 26, 2021, 08:37:07 am
Regarding the conf file: at first I directly edited /var/unbound/etc/dot.conf but then discovered that was being overwritten (presumably from the GUI config support).  So I browsed the unbound.conf file and found out I could drop my own .conf into /var/unbound/etc/ and it would be picked up, which is what I had done.


I'm running these VMs under VMware Workstation v16 (currently over a Windows 10 host).  I first set this up with pfSense years ago, intending it just for evaluation purposes, but it was convenient and seemed to work stably (and the performance boost over my dedicated IPCop firewall of the time was impressive) and so I've kept it that way (over various OS versions and hardware).  The WAN network adapter is disabled on the host OS, and only selected into the firewall VMs (Bridged).  The LAN adapter is shared with the host in Bridged mode.

I can't see it being a network adapter issue, it's the same adapter definition for all interfaces on both machines ("e1000") and I have not overridden MTU or similar details so they should be the same.  The pfSense firewall VM is running with a single processor and 1GB of RAM (and runs Squid, SquidGuard and even ntopng, so it's really putting in).  The OPNsense firewall has been given 2 processors and 4GB of RAM (no proxy or anything else exciting, so it should be kicking back cooling its heels).

I didn't need any MAC cloning.  I had already configured the Cradle to give me an area for static IPv4 addressing, which is what I've used on the firewalls - and they both use the same address.  This means I can only have one running at a time, but that's true anyway thanks to DHCP etc.  As expected, traceroute shows no difference, 11 hops, identical path.


After all the mucking about, it seemed a good idea to revert back to an early snapshot, which I did.  This had just DHCP and Captive Portal and used ordinary DNS.  This was still getting packet loss - sometimes up to 9%!  I disabled Captive Portal and packet loss seemed to subside (but was still happening), but this may have been just coincidence.

That didn't resolve the issue so I returned to my more fully configured snapshot and updated it to 21.1.2.  I rather like the idea of the "Audit now" options, I did both a security and a health audit - all reported okay.  Then I rebooted to be sure.  Re-tested and the same problem persists.

A few times today my pfSense firewall has reported some packet loss (1-2%) and some long latency times (network is obviously busy, download was slower but uploads still fast), but this has not resulted in the same protocol errors or connection resets, things just went slower - which is what I expect.


I think it's time I accepted defeat.  There's obviously something about this set up that OPNsense doesn't like.  I'll keep the VM around to try again with a future update.

Thanks everyone for your input.
Title: Re: Intermittent and transient network errors
Post by: youngman on February 26, 2021, 05:54:38 pm
I had similar loss issues a while back and it came down to MTU as someone posted earlier. Just had to put an override number in at the WAN interface and it was all good. No idea why it couldn't auto detect and correct the MTU... I suspect it was ISP related.

If you are monitoring the gateway, are your tolerances set too tightly - causing it to restart itself intermittently?

System: Gateways: Single --> Advanced (perhaps temporarily disable monitoring just to eliminate that possibility?)
Title: Re: Intermittent and transient network errors
Post by: GeoffW on February 27, 2021, 07:31:50 am
I had been reluctant to "play with" the MTU size because I'm not enough of an expert to know the consequences of my choices ... but inputting the default value of 1500 seemed safe and easy enough, so I did that on both WAN and LAN interfaces.  No change.

I also tried disabling the gateway monitor.  No change.


Today I was experimenting with pfBlockerNG on my pfSense firewall and I see that when it blocks via DNSBL the result is sometimes an ERR_SSL_PROTOCOL_ERROR.  Of course the difference is that a page refresh in this case keeps blocking persistently.  So the problem on OPNsense is not any blocking rules (because refresh will load things that previously failed), but it does show that DNS issues could result in at least one of the errors I am seeing, although I am less clear how a DNS issue could explain the interrupted GIF loads I saw (presumably an ERR_CONNECTION_RESET).
Title: Re: Intermittent and transient network errors
Post by: thowe on February 27, 2021, 09:45:24 am
An MTU of 1500 may be the standard in many Ethernet scenarios. However, especially when transmitting via PPPoE or other tunnels, a lower MTU can be more efficient, since the packets are otherwise fragmented. Depending on the protocol, fragmentation either costs performance or prevents connections altogether.

That said, I don't think the MTU is the main problem here (if it is a problem at all).
Title: Re: Intermittent and transient network errors
Post by: youngman on February 28, 2021, 02:47:48 am
Not suggesting that MTU size is 100% the issue but with a 3G modem I vaguely recall being forced down to ~1370ish to prevent fragmentation. 4G may be similar? Look up MTU ping test - it isn't hard to confirm an appropriate size.

Some programs do not handle fragmentation well (e.g. in my experience Steam will simply refuse to connect to their game controller), others may be unaffected - giving the impression of intermittent errors.
Title: Re: Intermittent and transient network errors
Post by: GeoffW on March 01, 2021, 12:34:33 am
The MTU ping test is going to get exactly the same value as what operating system PMTUD gets more dynamically (for TCP anyway).  And the problems I am seeing are on normal browser page loads, no exciting games involved ... but never let it be said I didn't try. :)

The MTU ping test let through blocks of no more than 1432 bytes.  According to the articles I found I could add 28 bytes to that to get 1460 as the appropriate MTU, but then I thought: if I'm doing this let's push the issue and use 1432.  I could not find an article that was explicit about whether the MTU on a firewall needed to be set on both interfaces, but I assumed that would be best (for MTU to be a problem here we're assuming OPNsense is screwing up the fragmentation process, so let's make sure it never has to fragment by never seeing a big frame).

Changed the MTU on LAN and WAN and rebooted ("Apply Changes" does not actually update the MTU according to the "Overview" page).  Verified the change had taken (MTU ping test would no longer accept 1432, but would accept 1400 - and presumably 1404 - which matches expectations).
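
The arithmetic behind the ping test can be sketched in Python. It assumes plain IPv4 with no IP options (a 20-byte IP header) plus the 8-byte ICMP echo header, which is where the 28 bytes come from. The function names are made up for illustration.

```python
IPV4_HEADER_BYTES = 20  # assumes no IP options
ICMP_HEADER_BYTES = 8   # ICMP echo request header
OVERHEAD_BYTES = IPV4_HEADER_BYTES + ICMP_HEADER_BYTES  # 28


def mtu_from_ping_payload(payload_bytes):
    """Path MTU implied by the largest unfragmented ping payload."""
    return payload_bytes + OVERHEAD_BYTES


def max_ping_payload(mtu_bytes):
    """Largest ping payload that fits in one packet at the given MTU."""
    return mtu_bytes - OVERHEAD_BYTES


print(mtu_from_ping_payload(1432))  # 1460, the MTU the articles would suggest
print(max_ping_payload(1432))       # 1404, the limit once the MTU is set to 1432
```

This matches the verification above: with the interface MTU set to 1432, a 1432-byte payload no longer fits, while 1400 and 1404 do.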

Then tested for the problem:  Still getting some packet loss, but more importantly, I am still getting the same sorts of inconsistent network errors as first described.  (I say "more importantly" because I believe the packet loss is a symptom of the underlying problem, not a cause; as noted earlier, I sometimes see packet loss on the old firewall, especially when the network is very busy, but it only makes things slower, it doesn't cause these weird transient errors.)

Thanks for your suggestion.  It was worth a shot, but I think I have now excluded MTU as the culprit here.