Odd Issue With 22.1.3+

Started by gumbi2400, June 07, 2022, 07:23:30 PM

Hello all. Long-time lurker, first-time poster. I've been running OPNsense on a tiny dual-NIC box I picked up on eBay. It has two Realtek 1G NICs (I know, I know). I've also installed the os-realtek-re package for them, and they've been working fine up until now.

I've noticed the strangest thing when upgrading past 22.1.2: it seems like I lose a good half of the internet. After a LOT of investigation, it seems that SSL handshakes on some, but not all, sites just time out. I don't see anything in the logs about the connection being blocked. I've done packet dumps and nothing seems to happen there; the traffic just never makes it through. Downgrading back to 22.1.2 fixes the problem.

As a previous troubleshooting step, I removed the os-realtek-re package. Without it I can update successfully, and everything works for a while before I receive the dreaded watchdog timeouts.

I'm really struggling to troubleshoot this one, so any advice would be greatly appreciated. As I'm sure you can imagine, it's difficult to get packet dumps, logs, etc. when I can't access the internet to post them. But happy to do my best to provide any more useful information. Thanks!

Quote from: gumbi2400 on June 07, 2022, 07:23:30 PM
I've noticed the strangest thing when upgrading past 22.1.2: it seems like I lose a good half of the internet. After a LOT of investigation, it seems that SSL handshakes on some, but not all, sites just time out. I don't see anything in the logs about the connection being blocked. I've done packet dumps and nothing seems to happen there; the traffic just never makes it through. Downgrading back to 22.1.2 fixes the problem.

I too use a box I got on Amazon with Realtek adapters. I only had watchdog errors prior to using the manufacturer-based drivers. Now running strong for over 60 days with no errors :)

Clarifying question: when you say you lose half of the internet, do you mean you lose half of the bandwidth you're capable of, or that you can't resolve half of the sites? If the latter, are you able to visit by IP? Is this a DNS issue? What symptoms are you seeing?
OPNsense 24.7.7 running on:
Dell Optiplex 3050
Intel I5-7600 @ 3.5Ghz (4 Cores)
Intel I350-T4 Nic
8G DDR4
256G SSD

Apologies, I realise that wasn't particularly clear. DNS is fine; it is purely some SSL connections. Connectivity all the way through on L3 is fine. Some websites are fine (Google and Amazon, for example); for others, including the opnsense.org site, the handshake starts but never gets a response and times out.

It took me quite a while to track down that it even was an issue with the router. On downgrading back everything continues to work completely as expected.

My setup is pretty flat: I connect to my ISP using PPPoE, and internally there is just a flat network without any VLANs. Not much in the way of firewall rules other than the defaults either. I'm completely stumped.
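
For anyone trying to reproduce the symptom described above (TCP connects, but the TLS handshake never completes), here is a minimal Python sketch that separates the cases per host. The function name `probe_tls` and the status strings are made up for illustration; it only uses the standard `socket` and `ssl` modules.

```python
import socket
import ssl


def probe_tls(host, port=443, timeout=5.0):
    """Return a short status string describing how far a TLS connection got."""
    # Step 1: plain TCP connect. A failure here points at routing, DNS or
    # filtering rather than at the handshake itself.
    try:
        raw = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        return f"tcp_failed: {exc}"

    # Step 2: TLS handshake on the connected socket, with the same timeout.
    ctx = ssl.create_default_context()
    try:
        with raw:
            raw.settimeout(timeout)
            with ctx.wrap_socket(raw, server_hostname=host) as tls:
                return f"ok: {tls.version()}"
    except socket.timeout:
        # TCP is up, but the handshake got no (complete) answer in time.
        return "handshake_timeout"
    except (ssl.SSLError, OSError) as exc:
        return f"tls_error: {exc}"
```

Calling `probe_tls("opnsense.org")` and `probe_tls("google.com")` from a client behind the router should show whether the affected sites fail at the TCP stage or only during the handshake.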

The changelog for 22.1.3 shows nothing that could remotely explain this behaviour.

Normally, SSL has nothing to do with Opnsense unless you use it as a forced proxy. That is, considering you said everything else on L3 works fine. SSL traffic should be in no sense special.

If only some sites are affected, I would try IPv4 vs. IPv6 first (although both google.com and opnsense.org offer both). Also, ssllabs.com shows that opnsense.org's SSL settings are tighter than google.com's (TLS > 1.1), so maybe you use a browser or a plugin that enforces some TLS rules (like strict CAA) - however, that would not explain why it works with Opnsense 22.1.2.

A remote possibility would be a hardware defect (like a memory error) that only affects you in a specific memory layout that is triggered by 22.1.3 only. Do you see packet loss? If that occurs with SSL traffic, it might prevent connections.
Intel N100, 4* I226-V, 2* 82559, 16 GByte, 500 GByte NVME, ZTE F6005

1100 down / 800 up, Bufferbloat A+

Quote from: meyergru on June 08, 2022, 10:59:18 AM
The changelog for 22.1.3 shows nothing that could remotely explain this behaviour.

Normally, SSL has nothing to do with Opnsense unless you use it as a forced proxy. That is, considering you said everything else on L3 works fine. SSL traffic should be in no sense special.

If only some sites are affected, I would try IPv4 vs. IPv6 first (although both google.com and opnsense.org offer both). Also, ssllabs.com shows that opnsense.org's SSL settings are tighter than google.com's (TLS > 1.1), so maybe you use a browser or a plugin that enforces some TLS rules (like strict CAA) - however, that would not explain why it works with Opnsense 22.1.2.

A remote possibility would be a hardware defect (like a memory error) that only affects you in a specific memory layout that is triggered by 22.1.3 only. Do you see packet loss? If that occurs with SSL traffic, it might prevent connections.


This is exactly why it is so odd. The router itself should have NOTHING to do with SSL connections specifically. This is why it took me so long to notice in the first place. It's definitely not a browser issue as it affects all devices on the network. And again as soon as it's downgraded back to 22.1.2, the problem disappears and everything appears to function normally.

If I didn't know any better I would think that it would possibly be a firewall rule, but I don't have anything outside of the default rules in place.

Are there any other data points for later 22.1.x? Looking at the changelog this one was amended later in 22.1.6

> o interfaces: do not update VIPs on dynamic address changes

And the bigger question: are you using VIPs on your WAN perhaps?



Cheers,
Franco

Quote from: franco on June 08, 2022, 01:12:56 PM
Are there any other data points for later 22.1.x? Looking at the changelog this one was amended later in 22.1.6

> o interfaces: do not update VIPs on dynamic address changes

And the bigger question: are you using VIPs on your WAN perhaps?


No VIPs on the WAN. I can confirm it is a setting somewhere though! A complete wipe and rebuild from scratch, and everything seems to work. I'm slowly re-implementing my previous settings and testing along the way, but so far so good!

Update: I found the problem!

So this is an odd one. The issue is... MTU! My ISP uses PPPoE over an FTTH connection. By default they support a PPPoE MTU of up to 1500 (i.e. the line MTU is 1508). This is the setting I've had on the interface for ages. After the update, this no longer works. Setting the MTU back down to 1500 (so the PPPoE connection is 1492) works.

I'm not sure why this would be affected by a version change, but here we are. I can replicate the issue by increasing the MTU on the WAN PPPoE interface. When I do this, HTTPS handshakes never complete.
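
The MTU numbers above follow from the 8 bytes that PPPoE adds to each frame. As a rough sketch of the arithmetic (the function names are made up for illustration, and the header sizes assume no IP or TCP options):

```python
# Header sizes in bytes; standard minimums, no IP or TCP options assumed.
PPPOE_OVERHEAD = 8   # 6-byte PPPoE header + 2-byte PPP protocol ID
IPV4_HEADER = 20
IPV6_HEADER = 40
TCP_HEADER = 20


def pppoe_mtu(ethernet_mtu: int) -> int:
    """MTU left for IP packets once PPPoE encapsulation is subtracted."""
    return ethernet_mtu - PPPOE_OVERHEAD


def tcp_mss(ip_mtu: int, ipv6: bool = False) -> int:
    """Largest TCP payload that fits in one IP packet of the given MTU."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return ip_mtu - ip_header - TCP_HEADER


for line_mtu in (1508, 1500):
    ppp = pppoe_mtu(line_mtu)
    print(f"line MTU {line_mtu}: PPPoE MTU {ppp}, "
          f"TCP MSS {tcp_mss(ppp)} (IPv4) / {tcp_mss(ppp, ipv6=True)} (IPv6)")
```

One plausible reading of the symptom, not confirmed in this thread: small packets still fit either way, while the full-size segments a server sends during the TLS handshake are the first ones to hit the limit when the configured MTU is larger than what the path actually carries.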

I had one like that catch me once, when I had not enabled ICMPv6 on the WAN interface, thus rendering PMTUD ineffective for IPv6. Once I enabled that, SSL connections worked again over IPv6 as well. That is why I suggested trying IPv4 vs. IPv6.

FWIW - Good to have it sorted out.
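
When PMTUD cannot be relied on, the usable packet size on a path can be narrowed down by hand. Below is a small sketch of that search in Python. The `fits` callback is hypothetical: it stands in for whatever probe you use (for example a ping with the don't-fragment bit set; the flags differ between operating systems, so they are not shown here). The sketch also assumes that every size below the limit passes.

```python
from typing import Callable


def largest_passing_size(fits: Callable[[int], bool],
                         low: int = 576, high: int = 1508) -> int:
    """Binary-search the largest packet size in [low, high] that passes.

    `fits(size)` must return True if a packet of `size` bytes gets through
    unfragmented, and is assumed to be True for every size up to the limit.
    """
    if not fits(low):
        raise ValueError(f"even {low} bytes does not pass")
    while low < high:
        mid = (low + high + 1) // 2   # round up so the loop always advances
        if fits(mid):
            low = mid                 # mid passes: the limit is mid or above
        else:
            high = mid - 1            # mid fails: the limit is below mid
    return low
```

With a probe that behaves like a 1492-byte path, i.e. `fits = lambda size: size <= 1492`, the search settles on 1492 after about ten probes.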