IPSEC traffic stalling after 20.7.1 upgrade

Started by Andreas_, September 01, 2020, 03:52:20 PM

We have an OPNsense installation (CARP pair) that ran 20.1.3 until recently, with 4 IPsec peers (2x IKEv1, 2x IKEv2) and some 20 tunnels defined. This ran flawlessly until I upgraded those machines to 20.7.1. Since then, the tunnels stop passing traffic after a while, until a reconnect is forced on the tunnel. Strangely, all logging looks normal on both sides of the tunnel, even when the tunnel traffic has stalled (IKE/ISAKMP traffic continues, but there are no more ESP packets).

The situation differs a little between peers, and sometimes one peer has stable phases before going bad again after a while, but none is 100% fine. It takes anywhere from a few seconds to a few minutes until the tunnels stall; more traffic seems to speed up the failure.

I reinstalled one firewall with 20.1, and now we have stable performance again. The backup machine is in maintenance mode and still 20.7.1 (with syslog-ng fixed).

When reviewing the updates that happened between 20.1.3 and 20.7.1, strongswan was upgraded from 5.8.2 to 5.8.4 (in April), and the kernel from 11.2 to 12.1. Since IKE/ISAKMP traffic seems normal, I'd suspect some issue in the kernel/pf, but I'm out of ideas on how to narrow down the cause further.

Any thoughts on this?
Regards,
Andreas



This is strange. Does it only happen between OPNsense peers? Are there any logs when it stops sending traffic?

No hint in the logs on either side of the tunnel (I have already raised some log levels).

And when you do a packet capture

a) do you see packets in ipsec / enc0 interface?
b) do you see encrypted ESP packets on WAN interface leaving?

I did a packet capture on WAN, no more ESP packets visible.
Didn't capture on other interfaces.

I did an after-hour test, after upgrading the fw to 20.7.2.

The tunnel traffic still stalls after a while (it did so after about 100MB inbound traffic).

When pinging a remote host, I see ICMP on enc0 entering the tunnel, a corresponding outgoing ESP packet on wan, but no returning packet; there's still communication on port 500, with no anomalies (afaics) in the log.

Switching back to the downgraded fw, same config (synced from the 20.7 machine) works flawlessly.
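
The one-directional ESP symptom described above (outgoing ESP on WAN, nothing returning) can be checked from a text capture. Below is a minimal Python sketch, assuming the capture was taken with something like `tcpdump -n -i <wan> esp` and that each ESP line has the usual `IP SRC > DST: ESP(spi=0x...,seq=0x...)` shape. The function names and the example addresses are hypothetical and not taken from this thread.

```python
import re
from collections import Counter

# Assumed tcpdump -n line shape (not confirmed by the posts above):
#   "<time> IP SRC > DST: ESP(spi=0x...,seq=0x...), length N"
ESP_LINE = re.compile(
    r"IP (?P<src>\d+\.\d+\.\d+\.\d+) > (?P<dst>\d+\.\d+\.\d+\.\d+): "
    r"ESP\(spi=(?P<spi>0x[0-9a-fA-F]+)"
)


def count_esp_by_direction(lines, local_ip):
    """Count ESP packets leaving ("out") and arriving at ("in") local_ip."""
    counts = Counter()
    for line in lines:
        match = ESP_LINE.search(line)
        if match is None:
            continue  # not an ESP packet line, e.g. IKE on port 500
        if match.group("src") == local_ip:
            counts["out"] += 1
        elif match.group("dst") == local_ip:
            counts["in"] += 1
    return counts


def looks_stalled(counts):
    """True when ESP is sent but no ESP comes back in the captured window."""
    return counts["out"] > 0 and counts["in"] == 0
```

Feeding it the lines of a capture taken while pinging a remote host, together with the firewall's own WAN address, gives a quick yes/no on whether return ESP traffic is missing. It says nothing about why the return traffic is missing.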


September 08, 2020, 04:18:03 PM #9 Last Edit: September 08, 2020, 04:57:38 PM by fraenki
I seem to be facing a similar issue. After upgrading from 20.1.4 to 20.7.2, IPsec phase 2 tunnels randomly stall (IKEv1, tunnel mode, IPv4). Only restarting strongswan seems to fix this issue (temporarily).

I should add that NOT ALL tunnels will stall AT ONCE. It seems to start with some tunnels, and other tunnels will follow after some time.

From my perspective the tunnels look perfectly fine:

- "ipsec statusall" shows all tunnels are in working condition and established
- tcpdump shows incoming/outgoing traffic with correct SPI IDs
- "setkey -DP" looks good too

I plan to test again with an older version of strongswan (5.8.3, which was included in 20.1.4). I also plan to test with strongswan 5.9. But this will take some time, because I need to build these packages for FreeBSD 12.1 first.

I've also come across this bug report:
https://wiki.strongswan.org/issues/2315
I wondered whether the workaround mentioned there, which was introduced in strongswan 5.8.3, could be related.
EDIT: This was the version that was included in 20.1.4, so it is probably unrelated.

Quote:
The tunnel traffic still stalls after a while (it did so after about 100MB inbound traffic).

Unfortunately, this is not the case here. I was able to transfer hundreds of MB, but it did not cause the traffic to stall.


Regards
- Frank

Quote from: fraenki on September 08, 2020, 04:18:03 PM

I plan to test again with an older version of strongswan (the one that was included in 20.1.4; need to find out the version number).

This won't work since the ABI changed.

Quote from: mimugmail on September 08, 2020, 04:19:43 PM
This won't work since the ABI changed.

I will build the package for FreeBSD 12.1 manually :)


September 08, 2020, 05:06:09 PM #12 Last Edit: September 08, 2020, 05:19:06 PM by fraenki
After talking to mimugmail I've switched the IPsec tunnel mode from IKEv1 to IKEv2. Let's see if this changes anything.

EDIT: Even with IKEv2 the traffic stalled, but this time strongswan recognized the error and restarted the tunnel connection:

Sep  8 17:03:05 charon[62985]: 05[IKE] <con5|1> giving up after 10 path probings
Sep  8 17:03:05 charon[62985]: 05[IKE] <con5|1> restarting CHILD_SA con5


EDIT 2: With IKEv2 the issue is easier for me to reproduce, and I can now confirm that the traffic stalls after transferring ~200-400 MB.
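
To see how often this happens and on which connections, the two charon messages quoted above can be pulled out of a log file. This is a rough Python sketch that matches only the wording of those two lines; the function name and event labels are made up for illustration, and the timestamp is assumed to be the 15-character syslog prefix seen in the excerpt.

```python
import re

# Connection tag as it appears in the quoted lines, e.g. "<con5|1>".
CONN = r"<(?P<conn>[^|>]+)\|\d+>"

# Patterns built from the two charon messages quoted in this thread.
GIVE_UP = re.compile(CONN + r" giving up after (?P<probes>\d+) path probings")
RESTART = re.compile(CONN + r" restarting CHILD_SA (?P<child>\S+)")


def find_stall_events(lines):
    """Return (event, connection, timestamp) tuples for matching log lines."""
    events = []
    for line in lines:
        # Assumption: classic syslog timestamp, e.g. "Sep  8 17:03:05".
        stamp = line[:15]
        match = GIVE_UP.search(line)
        if match:
            events.append(("probe_failed", match.group("conn"), stamp))
            continue
        match = RESTART.search(line)
        if match:
            events.append(("child_restart", match.group("conn"), stamp))
    return events
```

Running it over the log after a longer transfer shows whether every stall is followed by a CHILD_SA restart, or whether some connections stay stalled without strongswan noticing.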

September 08, 2020, 05:36:34 PM #13 Last Edit: September 09, 2020, 12:42:18 AM by fraenki
Quote:
I plan to test again with an older version of strongswan (5.8.3, which was included in 20.1.4). I also plan to test with strongswan 5.9. But this will take some time, because I need to build these packages for FreeBSD 12.1 first.

I've manually built packages for strongswan 5.8.3 and 5.9.0 on FreeBSD and tested them on OPNsense 20.7.2. Unfortunately this did not improve the situation. It's probably a change in FreeBSD 12.x that causes this issue.

I need some screenshots of Phase1 and Phase2.
In my first test with IKEv2, running overnight, 57075 pings were transmitted and only 32 were lost.

I'll now switch to IKEv1