Intermittent and transient network errors

Started by GeoffW, February 23, 2021, 11:54:34 AM

A few days ago I set up OPNsense v21.1 (and updated to v21.1.1) on a virtual machine (VMware), to experiment with as a replacement for a pfSense installation.  Lots to like after I got my head around the things that are different, but I've got intermittent network errors that I was not getting with the other firewall.

The actual errors vary, but mostly they seem to be either ERR_SSL_PROTOCOL_ERROR, or sometimes ERR_CONNECTION_RESET.

They sometimes manifest in the browser as a page not loading, but more often it is just some of the images on the page that fail to load.  I've attached a screen capture of a browser console log after loading a news website.  (Ignore the last item in the list, that's a real/persistent error.)  Simply reloading the page (or right-click "Load Image") is enough to have the page/image load properly - so the errors are intermittent and transient.

The nature of the problem makes it really hard to analyse - especially since you can't just use a previous page to verify any change, and there is no guarantee the next page/site will see the problem this time around.

I am using Unbound DNS with secure DNS (CloudFlare), but I can't really see that being the problem.  I have disabled IPv6 (I have no use for it on such a simple network so prefer to remove the complication).

I have Captive Portal turned on, but no proxy.

This is a very small network.  Just half-a-dozen machines connecting out (and some more devices that are intentionally being blocked by the Captive Portal).

Can anyone offer any suggestions on what I might try to resolve this?  I've lost the last three days to my experiments with no change, so I could really use some hints.

Thanks.

Just had another glitch on another site that pretty much proves it's not DNS.  A GIF started loading and actually started playing, before suddenly blinking out and being replaced with a broken-link icon.  The console showed it had received an ERR_CONNECTION_RESET error.  There are no network errors reported on the interfaces, the Internet link has been at least as stable as it was before I put in this firewall.

And speaking of broken-link icons, this very page with its list of emojis above the edit area has approximately half of them displayed as broken link icons.  I attach two more screen captures, the display and the console showing the errors.

I'm all out of ideas, so I've switched back to the old firewall and the protocol errors and connection resets have stopped.  I think the throughput is a bit slower, but at least it's working.  I really like (most of) the interface and reporting features of OPNsense, but not at the expense of core network function.

Maybe I'll come back and try again when you're up to Zealous Zorilla or something.

How do you have the Unbound secure DNS configured for cloudflare? OPNsense is a little different than pfSense when it comes to getting a full DoT implementation, you'll need to use Custom Options.

Are there any in/out interface errors when viewing LAN/WAN interfaces by selecting Interfaces/Overview within the OPNsense GUI?

Thanks very much for your response.

For Unbound DNS setup I followed the instructions I found here: https://www.dnsknowledge.com/unbound/opnsense-set-up-and-configure-dns-over-tls-dot/  (including checking the resulting dot.conf file).

As I understand it, with 21.1 you no longer need the Custom Options.  The instructions seemed to work (confirmed traffic on 853 and none on 53).  At one stage I had both CloudFlare and Google DNS servers defined, but then I read that mixed DNS sources can lead to SSL protocol errors (although I don't think those articles were talking about this particular situation; I think they were talking about DNS on the browser/client being in conflict with the DNS reported by the network).


None of the interfaces show any errors (both sides reporting 0/0 for in/out errors).  Which I suppose makes my subject line here a bit ambiguous.  The errors are not hard network errors, but something higher level.  Also no blocks or rejects get reported on the Firewall (which I thought might happen after the connection reset errors, but no).

So the only errors (so far) are observed on browser clients (mostly Firefox and Vivaldi, a Chromium-based browser, running on Windows 10 20H2) - although this includes video and music player components that get interrupted with connection resets.

Do you have WAN and LAN properly isolated, or are they on the same Ethernet segment?

You had me wondering for a moment :) so I double checked.  The WAN side definitely only has the one cable and that plugs into the interface for this firewall.

The WAN side is a bit messy (but doesn't change between firewalls).  I'm remote and use a 4G wireless (from Telstra in Australia) broadband connection.  The actual router with the SIM is a small "Hotspot" device, but because I'm not actually mobile I plug it into a NetGear Cradle (with external antenna) that acts as a separate router.  So, roughly:


   [ LAN ]-----[OPNsense]-------[ Cradle ]-------[ Hotspot ]-------[whatever-Telstra-has]-----{internet}
           LAN         10.x.x.x/24       10.y.y.y/24        10.z.z.z/?
          subnet


The Cradle does have WiFi that guests use.  I have a separate WiFi on the LAN side of the firewall for my own devices.  There is no cross-over using WiFi.

And like I said, none of the WAN side changes.  I'm just sliding out the pfSense VM and sliding in the OPNsense VM.  So I doubt if the mess is relevant and probably just distracts.


Thanks for your response.  If you see something in that mess that might make a difference peculiar to OPNsense then let me know.  Note that I did try ticking/unticking that "Disable Reply-To" Multi-WAN option after reading posts here, but it didn't seem to make any difference either way.

More info here about DoT with cert validation. https://www.ctrl.blog/entry/unbound-tls-forwarding.html

Unfortunately the OPNsense GUI doesn't offer the domain name function to allow cert validation at this time. If you want a fully secure DoT setup, you'll need something like this in your custom settings (be sure to remove the duplicate references in the Miscellaneous section):

# TLS Config
tls-cert-bundle: "/etc/ssl/cert.pem"
# Forwarding Config
forward-zone:
    name: "."
    forward-tls-upstream: yes
    forward-addr: 2620:fe::9@853#dns9.quad9.net
    forward-addr: 9.9.9.9@853#dns9.quad9.net
    forward-addr: 149.112.112.9@853#dns9.quad9.net


You can modify that to taste for whichever DoT provider you want to use.
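For anyone adapting those forward-addr lines to another provider, the value packs three things into one string: the server IP, an optional port after "@", and an optional TLS name after "#" that the certificate is checked against. A minimal Python sketch that splits one such value apart is shown here; the ForwardAddr type and parse_forward_addr function are hypothetical helper names for illustration and are not part of Unbound or OPNsense.

```python
import ipaddress
from typing import NamedTuple, Optional


class ForwardAddr(NamedTuple):
    """Pieces of one forward-addr value (hypothetical helper type)."""
    ip: str
    port: Optional[int]      # None when no "@port" was given
    tls_name: Optional[str]  # None when no "#name" was given


def parse_forward_addr(value: str) -> ForwardAddr:
    """Split a value of the form IP[@port][#tls-name] into its parts."""
    # The "#name" part comes last, so peel it off first.
    addr_part, _, tls_name = value.partition("#")
    # IPv6 addresses contain ":" but never "@", so "@" is a safe separator.
    ip_text, _, port_text = addr_part.partition("@")
    ip = ipaddress.ip_address(ip_text.strip())  # ValueError on a malformed IP
    port = int(port_text) if port_text else None
    return ForwardAddr(str(ip), port, tls_name.strip() or None)


if __name__ == "__main__":
    for line in ("2620:fe::9@853#dns9.quad9.net", "9.9.9.9@853#dns9.quad9.net"):
        print(parse_forward_addr(line))
```

If the TLS name comes back as None for a line you expected to be validated, the "#name" suffix is missing and that upstream would not have a name to check its certificate against.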

If the WAN interface is on a network with private IP ranges (192.168.x, 172.16.x to 172.31.x, 10.x, etc.), I would also suggest going to Interfaces/WAN and unchecking the block private/block bogon networks options.

Try those two things and see if it helps?

Thanks very much for that explanation regarding more secure DoT.  I will give that a try.

I actually like having the block private networks still ticked, since the point is to protect the LAN side even from guests whose devices I do not necessarily trust ... still, I guess I could try it and just see (no guests here now anyway).  Probably this evening, I'll let you know.

If the WAN is issuing a routed public IP, no need to uncheck those two options. However, if it's putting you on a private IP range and double NAT'ing you, I would try unchecking them just to verify.

What does the packet loss look like on the WAN side? Turn on gateway monitoring and set a remote IP for your preferred DNS (1.1.1.1). Let it run for an hour or two and see what the packet loss looks like?
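To make the numbers from such a monitoring run easier to compare between the two firewalls, the loss percentage is simply lost probes divided by probes sent, and the RTT is averaged over the probes that did get a reply. A minimal Python sketch of that arithmetic follows; summarize_probes is a hypothetical helper, and the sample data is made up for illustration rather than taken from this thread.

```python
from statistics import mean
from typing import Optional, Sequence


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize probe results; None marks a probe that got no reply."""
    if not rtts_ms:
        raise ValueError("need at least one probe")
    replies = [r for r in rtts_ms if r is not None]
    lost = len(rtts_ms) - len(replies)
    return {
        "sent": len(rtts_ms),
        "lost": lost,
        "loss_pct": 100.0 * lost / len(rtts_ms),
        # Average only over probes that were answered.
        "avg_rtt_ms": mean(replies) if replies else None,
    }


if __name__ == "__main__":
    # Made-up sample: 96 replies at 50 ms and 4 probes with no reply.
    samples = [50.0] * 96 + [None] * 4
    print(summarize_probes(samples))
```

Running the same number of probes against the same target from each firewall keeps the two summaries directly comparable.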

I just realized my gateway monitoring suggestion probably wasn't very clear. Attaching a screenshot to help better show what I'm on about.

Thanks heaps for the suggestions and extra detail.

After those changes (directly editing dot.conf with the domain name, turning off blocking of private/bogon networks on the WAN, and setting up gateway monitoring using IP 1.1.1.1), the results are:

I still get SSL protocol and connection reset errors (intermittently and temporarily) as first described.

I saw a lot of packet loss on the gateway monitor during startup (1..4%), and I am seeing some intermittent (but much smaller) bouts of packet loss since.  Still no errors reported on the interface itself.

I've only been running a few minutes after the change so far, so I'll leave it a while longer to see how the packet loss is reported.

None of the changes I've tried removed the intermittent SSL protocol errors nor the connection reset errors.

Monitoring 1.1.1.1 from the Gateway "Monitor IP" setting showed intermittent periods of packet loss between 1% and 4%.  The RTT was sitting up around 160ms.

Disclaimer: I don't know anything about the following website except that it claims to do what I needed for this test, so use it at your own risk.  The Packet Loss Test website (https://packetlosstest.com) also reports packet loss between 1% and 4% across several tests, and has average latency sitting up around 150ms with a lot of calls stretching up above 200ms.


Switching back to my old pfSense firewall, the errors go away as usual.  I set Gateway monitoring to look at 1.1.1.1 (as for OPNsense above) and it reported an RTT of approximately 50ms (on an otherwise idle network, it goes up during speed tests etc.), and I have not seen packet loss go above 0.0%.  Also, the Packet Loss Test website lost only a single packet over several tests and reports latency between 50ms and 80ms across several tests.

...

So something's going on with the WAN connection under OPNsense, but I don't seem to be any closer to working out what.  I thought TCP generally arranged for automatic re-transmission of lost packets, so unless the packet loss rate is very high it should normally be transparent other than the performance loss (no errors like those that I am seeing).
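The reasoning about re-transmission can be put into rough numbers. As a back-of-the-envelope sketch, and assuming each packet is lost independently at the measured rate (an assumption that bursty loss on a wireless link may well violate), the chance that a segment and all of its retransmissions are lost is the loss rate raised to the number of attempts. The function name below is hypothetical.

```python
def all_attempts_lost(loss_rate: float, retransmissions: int) -> float:
    """Probability that the original send and every retransmission are lost.

    Assumes independent losses at a fixed rate, which is a simplification.
    """
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be between 0 and 1")
    if retransmissions < 0:
        raise ValueError("retransmissions must not be negative")
    return loss_rate ** (retransmissions + 1)


if __name__ == "__main__":
    for retries in range(4):
        p = all_attempts_lost(0.04, retries)
        print(f"4% loss, {retries} retransmissions: {p:.2e}")
```

Under that assumption, 4% random loss on its own would rarely exhaust TCP's retries, which is consistent with the suspicion that the resets have some other cause.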


Thank you very much opnfwb and ingof for your suggestions.  I am open to more, but may be slower responding because I really have to catch up on some work that I've been ignoring while trying to get this going.

That packet loss is definitely going to cause some issues. Up to 4% is pretty bad, you'd notice that on VOIP calls or anything that relied on UDP traffic.

Sorry, I'm not sure what else to suggest due to all of the variables here. Since we're dealing with VMs, both would need to be the same (same NICs, number of CPUs, etc.) to rule out any VM hardware influences. Which hypervisor are you using? I've run OPNsense on both ESXi and Hyper-V, so maybe we can compare notes if you're using either one of those.

Can you try a traceroute to the same IP (1.1.1.1) from the pfSense and OPNsense firewalls? Do they both have the same number of hops? Is one of them seeing more latency/loss on a certain hop than another? Do both VMs get the same WAN IP address, or does it change each time the firewalls are switched?

Due to the unusual nature of the WAN setup in this thread, do you need to do any MAC address cloning that would need to be set up in OPNsense? Just trying to think of any other variables that may cause an issue.

Lastly, you've mentioned you're editing a .conf, which I would assume to be unbound.conf? I would not recommend editing this file directly. Instead, apply changes or customization through the OPNsense GUI. You can run into issues where your custom changes may get overwritten if you adjust something in the GUI and hit 'save'. Better to keep all changes in the GUI so that they all re-apply every time a tweak is made. At this point I don't think your issue is DNS but, if we can fix the packet loss, you'd want to make sure the Unbound stuff is squared away too to give you a consistent experience.

4G modem as WAN / broadband

Maybe have a play with the MTU size?
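One way to experiment with the MTU is to send pings with the Don't Fragment bit set and find the largest size that still gets through. For IPv4 the ping payload is the MTU minus 28 bytes (20 for the IP header, 8 for the ICMP header), so a 1500-byte MTU corresponds to a 1472-byte payload. A minimal Python sketch of the search is shown here; the probe callback is a stand-in you would have to supply yourself (for example by wrapping a ping command), and the function names are hypothetical.

```python
from typing import Callable

IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def ping_payload_for_mtu(mtu: int) -> int:
    """Ping payload size whose IPv4 packet is exactly mtu bytes long."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def largest_working_mtu(probe: Callable[[int], bool],
                        low: int = 576, high: int = 1500) -> int:
    """Binary-search the largest MTU in [low, high] that passes unfragmented.

    probe(payload) must return True when a ping of that payload size, sent
    with Don't Fragment set, gets a reply. Typical manual equivalents are
    `ping -D -s <payload> <host>` on FreeBSD or `ping -f -l <payload> <host>`
    on Windows.
    """
    if not probe(ping_payload_for_mtu(low)):
        raise RuntimeError("even the lowest MTU failed; check connectivity")
    while low < high:
        mid = (low + high + 1) // 2
        if probe(ping_payload_for_mtu(mid)):
            low = mid        # mid works, so the answer is mid or larger
        else:
            high = mid - 1   # mid fails, so the answer is smaller
    return low


if __name__ == "__main__":
    # Simulated path that only passes payloads up to 1400 bytes.
    print(largest_working_mtu(lambda payload: payload <= 1400))
```

If the value found on the path is lower than the MTU configured on the WAN interface, lowering the interface MTU to match would be the thing to try.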