OpenVPN connections keep dropping

Started by woo, October 28, 2016, 03:52:05 PM

October 28, 2016, 03:52:05 PM Last Edit: November 03, 2016, 02:42:54 PM by woo
Hi all,
(This is not directly related to the OPNsense code itself, just a service provided by an OPNsense box, but there are people here who know OpenVPN and can probably help me, I'm sure.)
I've now got about 50 regular users on my OPNsense OpenVPN concentrator, and I keep getting complaints that connections are dropping out, mostly around the 1 hour mark.
The log always shows the same picture.. a slew of messages "openvpn[78997]: hans/191.19.25.210:63081 TLS Error: local/remote TLS keys are out of sync: [AF_INET]191.19.25.210:63081 [1]", followed by one "openvpn[78997]: hans/191.19.25.210:63081 [hans] Inactivity timeout (--ping-restart), restarting"
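To get numbers out of the log rather than eyeballing it, the two messages quoted above can be counted per user. This is only a minimal sketch: the regular expressions are built from the two quoted lines, the function name `summarize` is made up for illustration, and real log lines with timestamps or other prefixes may need adjusted patterns.

```python
import re
from collections import Counter

# Patterns are derived from the two log messages quoted in the post.
# Assumption: every relevant line contains "openvpn[PID]: user/ip:port ...".
TLS_ERR = re.compile(
    r"openvpn\[\d+\]: (?P<user>[^/\s]+)/(?P<ip>[\d.]+):(?P<port>\d+) "
    r"TLS Error: local/remote TLS keys are out of sync"
)
TIMEOUT = re.compile(
    r"openvpn\[\d+\]: (?P<user>[^/\s]+)/(?P<ip>[\d.]+):(?P<port>\d+) "
    r"\[[^\]]+\] Inactivity timeout \(--ping-restart\), restarting"
)


def summarize(lines):
    """Count TLS key errors and ping-restart timeouts per user."""
    tls_errors = Counter()
    timeouts = Counter()
    for line in lines:
        m = TLS_ERR.search(line)
        if m:
            tls_errors[m.group("user")] += 1
            continue
        m = TIMEOUT.search(line)
        if m:
            timeouts[m.group("user")] += 1
    return tls_errors, timeouts


# The sample lines are the messages quoted in the post.
sample = [
    "openvpn[78997]: hans/191.19.25.210:63081 TLS Error: local/remote TLS keys are out of sync: [AF_INET]191.19.25.210:63081 [1]",
    "openvpn[78997]: hans/191.19.25.210:63081 TLS Error: local/remote TLS keys are out of sync: [AF_INET]191.19.25.210:63081 [1]",
    "openvpn[78997]: hans/191.19.25.210:63081 [hans] Inactivity timeout (--ping-restart), restarting",
]
tls_errors, timeouts = summarize(sample)
print(dict(tls_errors), dict(timeouts))
```

Feeding the full log through `summarize` should show whether every restart is preceded by a run of key errors, or whether some users only ever hit the timeout.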

All web research I've done points to this message relating to firewall config issues, but then the connection shouldn't even be able to be established in the first place.
To me, it looks like some part of the keepalive packets either cannot be sent or do not arrive.. but I failed to find any details of what the keepalive actually consists of, and which firewall rules I might need to permit it.
Also, it does not seem to match up from a time perspective.. my server has "keepalive 10 30" set, which should kill the session much sooner than one hour, if it really was keepalive related.
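For reference, the OpenVPN manual describes `keepalive` as a helper that expands into `ping` and `ping-restart`, with the server-side restart timeout doubled. The sketch below just spells out that expansion for "keepalive 10 30"; the function name `expand_keepalive` is made up for illustration and is not part of OpenVPN.

```python
def expand_keepalive(interval, timeout, server_mode):
    """Return the directives that `keepalive interval timeout` stands for.

    Follows the expansion described in the OpenVPN manual: in server mode
    the local ping-restart is twice the timeout, and the plain values are
    pushed to the clients.
    """
    if server_mode:
        return [
            f"ping {interval}",
            f"ping-restart {timeout * 2}",
            f'push "ping {interval}"',
            f'push "ping-restart {timeout}"',
        ]
    return [f"ping {interval}", f"ping-restart {timeout}"]


# Expansion of the setting mentioned in the post.
for directive in expand_keepalive(10, 30, server_mode=True):
    print(directive)
```

With these values a peer that stops answering would be restarted after 30 to 60 seconds of silence, which supports the point that a pure keepalive problem should show up long before the one hour mark.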

I've switched users from UDP to TCP connection mode, with no difference. I've played with the numbers in the keepalive settings, also no change. I can't really just sniff packets on all interfaces for hours, hoping to catch the one that makes trouble, either...

I'm running out of ideas how to debug this further.. so if anyone can provide enlightenment, I'd be really grateful.

Regards
~woo

Nobody got any idea how I could dig into that issue further?

50 concurrent users may cause some load. What hardware are you using? Any crypto off-load in the CPU or otherwise?

Bart...

Like Bart said, it could be hardware related.
We have 25 users behind an A10 firewall with SSD, and we noticed some CPU load.
DEC4240 – OPNsense Owner

The behaviour is the same, whether it's 3 people logged in, or 50.
CPU load is below 20%, using crypto offloading on a current-gen Xeon.
I'm pretty sure that some handshake packets are dropped somewhere, but I don't know where, or how to sniff it out without digging through all the encrypted packets..

You either have a very beefy piece of hardware to use a Xeon, or you are running OPNsense as a VM. Do you have more platform details please? There are some hypervisor/NIC model/NIC driver combos that have issues with OPNsense and its underlying FreeBSD OS.

Bart...

yeah, the OPNsense is currently the only VM on our new ESXi 5.5 host. I'm using the Intel E1000 emulated network device, via the 'em' driver, which is what VMware recommends for FreeBSD.
Generally, networking works fine on that box.. no troubles with throughput or packet loss or anything at all, just these weird VPN disconnects.

Is your host up to 5.5 U3? Have you tried vmxnet3 (if_vmx in FreeBSD) instead? Are you using the official VMware tools, or open-vm-tools?

You could also try VMDirectPath I/O for the WAN connection, if the host has some spare NICs.

Bart...

My host is 5.5 on the most current patch level. I'm using whatever vmtools came with the OPNsense iso, which looks like the official ones. Not much of a fan of switching the interface type now.. I'm semi in production with that box already, and that idea smells of downtime.

Yes, I agree that you need to consider downtime to swap interfaces. Not an awful lot you can do safely while in production without having a fail-over firewall, either through CARP or secondary routing by your clients.

Any mileage in creating a pre-production environment?

Bart...

I've now switched the e1000 card for a vmxnet card, but I don't see any difference. Will keep an eye on it for the next few days..

No change.. still getting the same errors with the vmxnet as well.
(and I drowned in other projects for the last few weeks, so couldn't investigate this any further).
I'm still having the impression that the keepalive packets are getting lost somewhere, triggering the session restart. (which of course has to fail as the OTP has changed in the meantime, so the cached credentials are useless).
I'll create a second server instance without OTP to see whether at least the automatic session restart works around this problem, that'll buy me some time to get at the original cause.
My "keepalive packets lost" feeling is also reinforced by the problem _seeming_ not to occur for users who have the "redirect gateway" option pushed to their client.. or those users just don't complain.
Kinda annoys me having to debug in production... and lacking the time to do that properly.

I've now run some statistics on the logs and the reports from my users.. and there's a weird accumulation in certain connection durations. Most users get disconnected either roughly around 33 minutes or 63 minutes..
I don't have any information about the OSes those users run (commonly Windows 7, 8 or 10), but could there be any reasons that TLS sessions expire/fail to rekey after certain times?
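One thing worth checking against those numbers: OpenVPN's documented default for `--reneg-sec` is 3600 seconds, so a data-channel key renegotiation is due roughly every 60 minutes unless configured otherwise. Whether that explains the clusters here is only a guess. The sketch below buckets session durations and flags the ones that ended shortly after a renegotiation would have been due. The sample durations are made up for illustration and are not taken from real logs, and the helper names and the 5-minute slack are arbitrary choices.

```python
from collections import Counter

RENEG_SEC_DEFAULT = 3600  # OpenVPN's documented default for --reneg-sec


def bucket_durations(durations_sec, bucket_min=5):
    """Group session durations (seconds) into buckets of `bucket_min` minutes."""
    buckets = Counter()
    for d in durations_sec:
        start = int(d // 60 // bucket_min) * bucket_min
        buckets[start] += 1
    return dict(sorted(buckets.items()))


def near_renegotiation(duration_sec, reneg_sec=RENEG_SEC_DEFAULT, slack_sec=300):
    """True if a session ended within `slack_sec` after a renegotiation was due."""
    if duration_sec < reneg_sec:
        return False
    return (duration_sec % reneg_sec) <= slack_sec


# Made-up durations in seconds, for illustration only.
sample = [1980, 2010, 3780, 3790, 3805, 900]
print(bucket_durations(sample))
print([near_renegotiation(d) for d in sample])
```

If most of the drops around the 63 minute mark are flagged and the 33 minute ones are not, that would suggest two separate causes rather than one.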

This may come from using TOTP if you are using it.

January 09, 2017, 07:50:44 PM #14 Last Edit: January 09, 2017, 07:53:27 PM by minime
Interesting, I was just heading to this forum as I'm at a loss as to what else I could do to get a properly working OpenVPN connection.

I am using an i5-6200U, which is usually not at its limit at all (it can saturate 350 Mbps over OpenVPN), but I can't get my system to keep the connection up. I have to reconnect to get it working again (it often seems that I am still connected, but in fact the connection is already lost), which is not a deal breaker, but I wonder why I can't get it to work properly.

I tried a lot of "keepalive" variations and followed a lot of different advice found via Google, and now I am wondering whether I am the only one. It seems I am not...

Who gets a stable connection working and with what settings?