[SOLVED] IPsec and TCP flows

Started by Yordan Yordanov, April 05, 2015, 04:44:34 PM

April 05, 2015, 04:44:34 PM Last Edit: April 21, 2015, 03:00:02 PM by franco
The system is running version 15.1.8.3-c6240d38f (amd64). I have configured three interfaces: one LAN and two ISP lines. Currently a rule sends all traffic out over the first line only, which has a public static IPv4 address. Outbound NAT is set to automatic mode.

It seemed to work fine until I tried to set up several IPsec tunnels. Most of them connected, although the interface shows them as disconnected, but that is a known issue. The problem is that all the VPN connections are very unstable. When pinging remote hosts, no packets are lost at all. However, when I log on using Remote Desktop, the connection is lost every 30-35 seconds and takes about 20 seconds to reconnect. The tunnel itself does not get disconnected: after my Remote Desktop session stops responding, I continue to receive ICMP echo replies. I have not tested with UDP traffic as I don't have an application that uses UDP. Additionally, RDP connections made directly over the Internet work fine. This is what I have tested so far:

1. Changing the IKE version - the tunnels do not connect. Only one tunnel connects, the one where the other side is running pfSense, which supports IKEv2. However, the issue persists with IKEv2 too.
2. Disabling ISP balancing (I had previously configured ISP balancing but disabled it to troubleshoot the issue), enabling only ISP Failover to alternate line. The issue persists.
3. Setting Prefer older IPSec SAs. The issue persists.
4. Setting Do not install LAN SPD - the option unchecks itself after saving and reloading the page. The issue persists.
5. Setting Enable TCP MSS clamping on VPN traffic - tried with 1200 and 1400 bytes, the issue still persists.
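
For reference on step 5, here is a rough sketch of the packet-size arithmetic behind MSS clamping. The cipher parameters (16-byte block and IV, 12-byte ICV, roughly what AES-CBC with HMAC-SHA1-96 would give), IPv4 tunnel mode without NAT-T, and a 1500-byte MTU are all assumptions for illustration, not values taken from the tunnels in this thread.

```python
def esp_outer_size(mss, block=16, iv=16, icv=12):
    """Size in bytes of the outer IPv4 packet carrying one full-sized TCP segment.

    Assumes IPv4, ESP tunnel mode and no NAT-T. The block, iv and icv
    defaults are assumptions, not values from this thread.
    """
    # TCP payload + TCP header + inner IPv4 header
    inner = mss + 20 + 20
    # Payload plus the 2-byte ESP trailer, rounded up to the cipher block size
    padded = -(-(inner + 2) // block) * block
    # Outer IPv4 header + ESP header + IV + ciphertext + ICV
    return 20 + 8 + iv + padded + icv


for mss in (1200, 1400):
    size = esp_outer_size(mss)
    verdict = "fits" if size <= 1500 else "exceeds a 1500-byte MTU"
    print(f"MSS {mss}: outer packet {size} bytes, {verdict}")
```

Under these assumptions an MSS of 1200 fits comfortably, while 1400 lands slightly above 1500 bytes. The packets in the capture below are all far smaller than that, so fragmentation seems an unlikely cause here.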

I also did a tcpdump for one of the tunnels during which I just typed some text in Notepad on the remote computer which looks like this:


16:49:37.542263 IP my.side > their.side: ESP(spi=0xc3028c6c,seq=0xa2d), length 84
16:49:37.554964 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xa92), length 92
16:49:37.575476 IP my.side > their.side: ESP(spi=0xc3028c6c,seq=0xa2e), length 84
16:49:37.586123 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xa93), length 92
16:49:37.607720 IP my.side > their.side: ESP(spi=0xc3028c6c,seq=0xa2f), length 84
16:49:37.617368 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xa94), length 92
16:49:37.641175 IP my.side > their.side: ESP(spi=0xc3028c6c,seq=0xa30), length 84
16:49:37.648702 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xa95), length 92
16:49:37.674312 IP my.side > their.side: ESP(spi=0xc3028c6c,seq=0xa31), length 84
16:49:37.680109 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xa96), length 92
16:49:37.707601 IP my.side > their.side: ESP(spi=0xc3028c6c,seq=0xa32), length 84
16:49:37.711110 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xa97), length 92
16:49:37.739768 IP my.side > their.side: ESP(spi=0xc3028c6c,seq=0xa33), length 84
16:49:37.742396 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xa98), length 92
16:49:37.773296 IP my.side > their.side: ESP(spi=0xc3028c6c,seq=0xa34), length 84
16:49:37.789533 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xa99), length 156
16:49:37.806428 IP my.side > their.side: ESP(spi=0xc3028c6c,seq=0xa35), length 84
16:49:37.820509 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xa9a), length 132
16:49:37.839801 IP my.side > their.side: ESP(spi=0xc3028c6c,seq=0xa36), length 84
16:49:37.851763 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xa9b), length 100
16:49:37.872443 IP my.side > their.side: ESP(spi=0xc3028c6c,seq=0xa37), length 84
16:49:37.883013 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xa9c), length 92
16:49:37.905261 IP my.side > their.side: ESP(spi=0xc3028c6c,seq=0xa38), length 84
16:49:37.914280 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xa9d), length 92
16:49:37.938347 IP my.side > their.side: ESP(spi=0xc3028c6c,seq=0xa39), length 84
16:49:37.945486 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xa9e), length 92
16:49:37.971371 IP my.side > their.side: ESP(spi=0xc3028c6c,seq=0xa3a), length 84


When the RDP session stopped responding, this is what I captured:


16:49:37.976871 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xa9f), length 92
16:49:38.101861 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xaa0), length 92
16:49:38.195728 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xaa1), length 116
16:49:38.852116 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xaa2), length 116
16:49:39.133557 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xaa3), length 372
16:49:39.289521 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xaa4), length 84
16:49:40.055345 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xaa5), length 412
16:49:40.133263 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xaa6), length 100
16:49:41.133313 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xaa7), length 92
16:49:41.398770 IP my.side > their.side: ESP(spi=0xc3028c6c,seq=0xa3b), length 84
16:49:41.399912 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xaa8), length 76


So my side stops responding for a period, but I don't know why. The line quality is excellent: when I plug it into another router there are no issues, and the VPN connections are established successfully and operate normally. However, I have to replace the old router with OPNsense.
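
The same observation can be checked mechanically. The following is a minimal sketch, assuming the tcpdump text format shown above; the helper names (`parse`, `largest_gap`) are made up for this example. It reports, per direction, the longest pause between consecutive ESP packets and how many sequence numbers are missing.

```python
import re
from datetime import datetime

# Matches lines such as:
# 16:49:37.542263 IP my.side > their.side: ESP(spi=0xc3028c6c,seq=0xa2d), length 84
LINE_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}\.\d{6}) IP (\S+) > (\S+): "
    r"ESP\(spi=0x([0-9a-f]+),seq=0x([0-9a-f]+)\), length (\d+)$"
)


def parse(lines):
    """Yield (timestamp, source, seq) for every ESP line that matches."""
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%H:%M:%S.%f")
            yield ts, m.group(2), int(m.group(5), 16)


def largest_gap(lines, source):
    """Return (longest pause in seconds, count of skipped sequence numbers)."""
    packets = [(ts, seq) for ts, src, seq in parse(lines) if src == source]
    gap, missing = 0.0, 0
    for (t0, s0), (t1, s1) in zip(packets, packets[1:]):
        gap = max(gap, (t1 - t0).total_seconds())
        missing += s1 - s0 - 1
    return gap, missing


# Tail of the first capture plus the whole second capture from this post.
capture = """\
16:49:37.938347 IP my.side > their.side: ESP(spi=0xc3028c6c,seq=0xa39), length 84
16:49:37.945486 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xa9e), length 92
16:49:37.971371 IP my.side > their.side: ESP(spi=0xc3028c6c,seq=0xa3a), length 84
16:49:37.976871 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xa9f), length 92
16:49:38.101861 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xaa0), length 92
16:49:38.195728 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xaa1), length 116
16:49:38.852116 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xaa2), length 116
16:49:39.133557 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xaa3), length 372
16:49:39.289521 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xaa4), length 84
16:49:40.055345 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xaa5), length 412
16:49:40.133263 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xaa6), length 100
16:49:41.133313 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xaa7), length 92
16:49:41.398770 IP my.side > their.side: ESP(spi=0xc3028c6c,seq=0xa3b), length 84
16:49:41.399912 IP their.side > my.side: ESP(spi=0xcda44f21,seq=0xaa8), length 76
""".splitlines()

for source in ("my.side", "their.side"):
    gap, missing = largest_gap(capture, source)
    print(f"{source}: largest gap {gap:.3f} s, missing sequence numbers: {missing}")
```

On the lines above this shows a pause of about 3.4 seconds in the outbound direction with no skipped sequence numbers in either direction. That would be consistent with my side simply not sending ESP packets for a while, rather than packets being lost on the line.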

I am very frustrated by this issue as I have been trying to work it out for weeks without result. Could someone help me with this? Maybe I am missing something.

Is it already unstable with 2 active tunnels? I would like to reproduce your issue on a set of machines on my side, but I'm looking for a method to trigger it as fast as possible.
Can you deliver a stripped down version of a config.xml with the issue, but without your personal data? Then I will look into it next week to see what's going on by reproducing it on our side with some freshly installed machines.

April 05, 2015, 08:28:08 PM #2 Last Edit: April 06, 2015, 03:42:36 PM by Yordan Yordanov
Out of 7 configured tunnels, 5 were active at the time of testing and they all experience this issue. I haven't tested with only one tunnel, but I don't think that should matter. By config.xml, do you mean that I can extract the VPN connection profile somehow and send it to you? Or should I just prepare a file with the VPN parameters so that you can test with the same Phase 1/2 parameters? Or maybe the whole configuration of the device? Thanks for looking into this problem!

By the way, this is the device if it matters.

Can you also test with just one tunnel? If that fails as well, it's a little easier to reproduce on my side.

If possible, I would really like a config that shows the issue with as few options enabled as possible, so we can concentrate on it by installing the config on some fresh machines (the same kind, actually).
You can backup your full configuration using the backup feature found in /diag_backup.php.

Because it's a regular text file, you can strip your personal information from the file before sending it over.

Maybe it's best if you email me the configuration directly (ad at the project domain), just to be sure we're not posting any harmful data on the forum. (Posting part of the config.xml on the forum is also fine by me, but then be sure to replace all the external IPs and passwords in it.)
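
If editing the file by hand is tedious, something like the following sketch could do a first pass. The tag names in `SECRET_TAGS` and in the sample document are assumptions for illustration and should be checked against the actual config.xml. The sketch also only looks at element text, not attributes, so the result still needs a manual review before sharing.

```python
import ipaddress
import re
import xml.etree.ElementTree as ET

# Hypothetical tag names; verify them against your own config.xml.
SECRET_TAGS = {"password", "pre-shared-key"}

# Anything that looks like a dotted-quad IPv4 address.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def _mask_public(match):
    """Replace public addresses with a documentation address, keep private ones."""
    text = match.group(0)
    try:
        addr = ipaddress.ip_address(text)
    except ValueError:
        return text  # looked like an address but is not a valid one
    # Note: version strings shaped like addresses get rewritten too.
    return text if addr.is_private else "203.0.113.1"


def redact(xml_text):
    """Return xml_text with secrets blanked and public IPv4 addresses masked."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if not elem.text:
            continue
        if elem.tag in SECRET_TAGS:
            elem.text = "REDACTED"
        else:
            elem.text = IPV4_RE.sub(_mask_public, elem.text)
    return ET.tostring(root, encoding="unicode")


# Made-up sample, not a real OPNsense configuration.
sample = (
    "<opnsense><ipsec><phase1>"
    "<remote-gateway>8.8.8.8</remote-gateway>"
    "<pre-shared-key>s3cret</pre-shared-key>"
    "</phase1></ipsec>"
    "<lan><ipaddr>192.168.1.1</ipaddr></lan></opnsense>"
)
print(redact(sample))
```

Private LAN addresses are left alone on purpose, since they are usually harmless and help whoever is reproducing the setup.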

All right, I think I'll be able to test that in the next 2 days as I need to do that outside of business hours. I'll report back when I'm ready.

I tested with only one IPsec tunnel (the other 6 configured but disabled) and the issue still persists. I have sent the configuration to Ad per e-mail as requested. If anybody else wishes to test, I can provide it, just send me a PM.

I've been using an IPsec connection between my home office and the office since 15.1.6 without any issue. What's your endpoint? Maybe this isn't really an IPsec issue but a load balancing / NAT issue?

The problem is on my side for sure, as this happens with each of the 5 tunnels I tested. The endpoints are different devices and the connection is OK when I swap the OPNsense router for another one. So it may not be the IPsec component that is at fault but the configuration as a whole.

Unfortunately I was not able to reproduce the issue with your config file. We have, however, fixed the status page and the non-functioning option "Do not install LAN SPD".

I tested with 2 OPNsense firewalls connected to each other with a direct cable, one of them running your config filled in with my own IP addresses and secrets. I was not able to test with multiple gateways.
To send traffic I used an SSH session to copy some files and to connect to a machine behind the other firewall.

Maybe your issue has something to do with the 2 WAN connections and the routing of packets. If you have the time, you might test again with only one WAN connection enabled.
IPsec itself doesn't really seem to be the issue, and your firewall rules don't explain such strange behaviour either, as far as I could see on my box.



Thanks for testing it, I'll remove one of the WAN interfaces and the associated firewall rules and see if it helps. If it doesn't, I'm restoring factory defaults and starting from scratch with one WAN and one tunnel.

Removing one of the WANs didn't change anything. What is more, I restored factory defaults and used the wizard to configure the LAN interface and one of the WANs. Then I created one IPsec tunnel and the issue is still there. :( So Multi-WAN is not causing it. It's not only Remote Desktop: I tried to copy some files using SMB (Windows File Sharing) and the transfer doesn't start at all (network error). I noticed that the issue is caused by packets not being sent TO the other endpoint. I watched a clock ticking in an RDP session, and the second hand kept moving while I was unable to do anything in the session, after which it just reconnects. This repeats every 30 seconds.

Now I'm taking the device with me at home and will test a tunnel to the old router that OPNsense is supposed to replace. If someone wants to help me further troubleshoot this, I'm ready to record all the steps in a video to show what I am doing and what exactly happens.

Strange, this setup really sounds quite straightforward. If you record your steps I will certainly take a look at it. 
What version of pfSense is used on the other box you connect to?

The other endpoint was running:

2.2.1-RELEASE (i386)
built on Fri Mar 13 08:16:53 CDT 2015
FreeBSD 10.1-RELEASE-p6

However it happens regardless of the other device - we have Cisco ASA, Lancom, Linksys and Cisco RVS 4000.
I tested yesterday on another site (using a completely different Internet connection) by building the configuration from scratch. I established the VPN from my home to the office (which runs Linksys RV082). The issue occurs exactly in the same manner. This time I also tried SSH and it's the same experience, the only difference being that SSH can't overcome the problem and doesn't reconnect, so I get "Software caused connection abort" after about 30-40 seconds. On Friday I'll record everything in a video and get back.

Today I installed OPNsense on a desktop PC with 2 Ethernet cards and tested my VPN using the exact same parameters (and ISP lines). The issue DOES NOT occur. I used the x86 image however and I see that the device is running x64. I'll check whether the Pentium 4 CPU I used supports 64-bit to test with it. I'd like to try this before recording the video.

Ok, just let me know if I can do anything.
If you have the opportunity to let me test from my side to your office using the same type of machine that might also be an option (just send me an email).
I tested with the 64bit version using 2 OPNsense installs and the same hardware. But if there's any weirdness going on with your specific machine we should be able to find it.

One last question, have you tried using one of the other network ports for your connection? It's probably not the issue, but you never know.