Hello,
I have a route-based IPsec tunnel between two OPNsense firewalls, opnFW1 and opnFW2, both running:
OPNsense 20.7.5-amd64
FreeBSD 12.1-RELEASE-p10-HBSD
OpenSSL 1.1.1h 22 Sep 2020
After approximately a week of uptime, without any configuration changes on either end, I'm getting the following errors in opnFW1's /var/log/ipsec.log, and of course the IPsec tunnel is not working:
Mar 22 21:43:47 opnFW1 charon[16547]: 11[KNL] creating acquire job for policy 192.168.1.10/32 === 192.168.1.1/32 with reqid {1000}
Mar 22 21:43:47 opnFW1 charon[16547]: 07[CFG] trap not found, unable to acquire reqid 1000
Mar 22 21:44:19 opnFW1 charon[16547]: 07[KNL] creating acquire job for policy 192.168.1.10/32 === 192.168.1.1/32 with reqid {1000}
Mar 22 21:44:19 opnFW1 charon[16547]: 11[CFG] trap not found, unable to acquire reqid 1000
The ipsec logical interface on opnFW1 is ipsec1000:
root@opnFW1:~ # ifconfig ipsec1000
ipsec1000: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> metric 0 mtu 1400
        tunnel inet 192.168.1.10 --> 192.168.1.1
        inet6 fe80::1a5a:58ff:fe10:13a0%ipsec1000 prefixlen 64 scopeid 0x13
        inet 172.16.1.10 --> 172.16.1.1 netmask 0xffffffff
        groups: ipsec
        reqid: 1000
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
From opnFW1 I can successfully ping opnFW2's "underlay" IP address, 192.168.1.1, but I can't ping the "overlay" IP, 172.16.1.1:
root@opnFW1:~ # ping -c 2 192.168.1.1
PING 192.168.1.1 (192.168.1.1): 56 data bytes
64 bytes from 192.168.1.1: icmp_seq=0 ttl=64 time=7.266 ms
64 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=3.638 ms
--- 192.168.1.1 ping statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 3.638/5.452/7.266/1.814 ms
root@opnFW1:~ # ping -c 2 172.16.1.1
PING 172.16.1.1 (172.16.1.1): 56 data bytes
--- 172.16.1.1 ping statistics ---
2 packets transmitted, 0 packets received, 100.0% packet loss
The ipsec configuration on opnFW1 is:
root@opnFW1:/usr/local/etc # cat ipsec.conf
# This file is automatically generated. Do not edit
config setup
  uniqueids = yes
conn con1
  aggressive = no
  fragmentation = yes
  keyexchange = ikev2
  mobike = yes
  reauth = yes
  rekey = yes
  forceencaps = no
  installpolicy = no
  dpdaction = restart
  dpddelay = 10s
  dpdtimeout = 60s
  left = 192.168.1.10
  right = 192.168.1.1
  leftid = 192.168.1.10
  ikelifetime = 28800s
  lifetime = 3600s
  ike = aes256gcm16-sha512-ecp512bp!
  leftauth = psk
  rightauth = psk
  rightid = 192.168.1.1
  reqid = 1000
  rightsubnet = 0.0.0.0/0
  leftsubnet = 0.0.0.0/0
  esp = aes256gcm16-sha512-ecp512bp!
  auto = start
Given the configuration above, IPsec should rely on DPD (dpdaction = restart).
On the other side, opnFW2, the logs I'm getting are:
root@opnFW2:/var/log # clog ipsec.log | grep 192.168.1.
Mar 22 13:38:37 opnFW2 charon[41296]: 05[KNL] creating acquire job for policy 192.168.1.1/32 === 192.168.1.10/32 with reqid {9000}
Mar 22 13:39:09 opnFW2 charon[41296]: 02[KNL] creating acquire job for policy 192.168.1.1/32 === 192.168.1.10/32 with reqid {9000}
Mar 22 13:41:20 opnFW2 charon[41296]: 07[KNL] creating acquire job for policy 192.168.1.1/32 === 192.168.1.10/32 with reqid {9000}
Mar 22 13:41:52 opnFW2 charon[41296]: 14[KNL] creating acquire job for policy 192.168.1.1/32 === 192.168.1.10/32 with reqid {9000}
Mar 22 13:42:25 opnFW2 charon[41296]: 15[KNL] creating acquire job for policy 192.168.1.1/32 === 192.168.1.10/32 with reqid {9000}
IPsec logical interface on opnFW2 is ipsec9000:
root@opnFW2:/var/log # ifconfig ipsec9000
ipsec9000: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> metric 0 mtu 1400
        tunnel inet 192.168.1.1 --> 192.168.1.10
        inet6 fe80::1e72:1dff:feb6:c703%ipsec9000 prefixlen 64 scopeid 0x25
        inet 172.16.1.1 --> 172.16.1.10 netmask 0xffffffff
        groups: ipsec
        reqid: 9000
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
The ping tests are identical: from opnFW2 I can ping 192.168.1.10 and cannot ping 172.16.1.10
root@opnFW2:/var/log # ping 192.168.1.10
PING 192.168.1.10 (192.168.1.10): 56 data bytes
64 bytes from 192.168.1.10: icmp_seq=0 ttl=64 time=7.893 ms
64 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=7.310 ms
64 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=7.990 ms
^C
--- 192.168.1.10 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 7.310/7.731/7.990/0.300 ms
root@opnFW2:/var/log # ping -c 2 172.16.1.10
PING 172.16.1.10 (172.16.1.10): 56 data bytes
--- 172.16.1.10 ping statistics ---
2 packets transmitted, 0 packets received, 100.0% packet loss
root@opnFW2:/var/log #
IPSec config on opnFW2 related to that tunnel is:
cat /usr/local/etc/ipsec.conf
config setup
  uniqueids = yes
conn con9
  aggressive = no
  fragmentation = yes
  keyexchange = ikev2
  mobike = yes
  reauth = yes
  rekey = yes
  forceencaps = no
  installpolicy = no
  dpdaction = restart
  dpddelay = 10s
  dpdtimeout = 60s
  left = 192.168.1.1
  right = 192.168.1.10
  leftid = 192.168.1.1
  ikelifetime = 28800s
  lifetime = 3600s
  ike = aes256gcm16-sha512-ecp512bp!
  leftauth = psk
  rightauth = psk
  rightid = 192.168.1.10
  reqid = 9000
  rightsubnet = 0.0.0.0/0
  leftsubnet = 0.0.0.0/0
  esp = aes256gcm16-sha512-ecp512bp!
  auto = start
And the funniest thing is that if I restart the strongswan service (/usr/local/etc/rc.d/strongswan onerestart) on opnFW1 (the one logging "unable to acquire reqid"), the issue disappears and everything starts working again, until the next time it stops.
Any ideas or comments are highly appreciated!
I intentionally haven't restored the connectivity this time, so I can provide additional outputs/logs if required.
Regards,
Plamen
Another observation: there's no UDP traffic between opnFW1 and opnFW2 on the transport interface.
Neither of them is trying to initiate phase 1.
There is nothing in the firewall logs either, which makes me believe that IKE_SA_INIT is not being generated by either end. It's just stuck, although the "Start immediately" option is selected for phase 1 and DPD with restart is configured on both firewalls.
Currently there are other IPsec VTIs on opnFW2 which are working fine (although some of them were in the same stuck state in the past), and I can't find out why it's not generating an IKE_SA_INIT packet for that specific peer.
Any ideas on how I should proceed with the troubleshooting? Is there a meaningful IPsec debug level I should increase?
As I wrote in the first post, if I restart the strongswan service the issue is resolved, but it happens again after a few days.
Any workarounds? I'm starting to think of some kind of ugly script in the crontab, or of using the monit service, to restart IPsec when it hangs again.
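The crontab workaround mentioned above could look roughly like the following sketch. It assumes the overlay peer address 172.16.1.1 and the restart command from the first post; the variable names, the function names and the `run` argument are hypothetical, and the FreeBSD meaning of `ping -t` (overall timeout in seconds) is assumed. It only treats the symptom, not the cause.

```sh
#!/bin/sh
# Hypothetical watchdog sketch: restart strongSwan when the overlay peer
# stops answering pings. All names below are illustrative assumptions.
OVERLAY_PEER="${OVERLAY_PEER:-172.16.1.1}"   # overlay address of the remote VTI
PING_CMD="${PING_CMD:-ping -c 2 -t 5}"       # FreeBSD ping: 2 probes, 5 s overall timeout
RESTART_CMD="${RESTART_CMD:-/usr/local/etc/rc.d/strongswan onerestart}"

# Succeeds when the peer answers the ping probe.
tunnel_is_up() {
    $PING_CMD "$1" >/dev/null 2>&1
}

# Prints the decision and restarts the service only when the peer is unreachable.
watchdog() {
    if tunnel_is_up "$1"; then
        echo "tunnel to $1 is up"
    else
        echo "tunnel to $1 is down, restarting strongSwan"
        $RESTART_CMD
    fi
}

# Only act when called explicitly, e.g. from cron with the argument "run".
if [ "${1:-}" = "run" ]; then
    watchdog "$OVERLAY_PEER"
fi
```

Saved under a path of your choice and called with `run` from a cron entry every few minutes, it restarts the service only while the overlay ping fails. Note that it would also restart strongSwan repeatedly during a genuine underlay outage.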
Looks like the IPSec re-connection issue is not because of "trap not found, unable to acquire reqid 1000"
During my workaround tests in a lab environment I was able to reproduce the issue. As I expected, it happens when underlay connectivity is lost for a longer period of time.
During the connectivity loss, IKE packets are retransmitted 5 times before charon gives up:
Mar 24 19:30:10 FW3 charon[90790]: 14[IKE] <con1|1> giving up after 5 retransmits
Mar 24 19:30:10 FW3 charon[90790]: 14[IKE] <con1|1> restarting CHILD_SA con1
....
Mar 24 19:32:55 FW3 charon[90790]: 12[IKE] <con1|2> giving up after 5 retransmits
Mar 24 19:32:55 FW3 charon[90790]: 12[IKE] <con1|2> peer not responding, trying again (2/3)
....
Mar 24 19:35:40 FW3 charon[90790]: 08[IKE] <con1|2> giving up after 5 retransmits
Mar 24 19:35:40 FW3 charon[90790]: 08[IKE] <con1|2> peer not responding, trying again (3/3)
.....
Mar 24 19:38:25 FW3 charon[90790]: 05[IKE] <con1|2> giving up after 5 retransmits
Mar 24 19:38:25 FW3 charon[90790]: 05[IKE] <con1|2> establishing IKE_SA failed, peer not responding
So the question is: how can I change that behavior and force IPsec to keep trying to connect?
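The give-up sequence above ends with a distinctive final message, so one rough way to check whether charon has stopped retrying is to count that message in the log. A minimal sketch, assuming log lines in the format shown above; the function name `count_giveups` is hypothetical:

```sh
#!/bin/sh
# Hypothetical helper: count how often charon abandoned IKE_SA establishment.
# Reads log lines on stdin and prints the number of matching lines.
count_giveups() {
    # grep -c exits non-zero when nothing matches; "|| true" keeps the status clean
    grep -c 'establishing IKE_SA failed, peer not responding' || true
}
```

On a system with circular logs it could be fed as `clog /var/log/ipsec.log | count_giveups`; a non-zero count means at least one connection attempt was abandoned since the log started.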
And let me reply to myself again - the missing keyword here is "keyingtries"
https://wiki.strongswan.org/projects/strongswan/wiki/connsection
Quote
keyingtries = 3 | <number> | %forever
how many attempts (a positive integer or %forever) should be made to negotiate a connection, or a replacement
for one, before giving up (default 3). The value %forever means 'never give up'. Relevant only locally, other end need
not agree on it.
And the issue raised back in 2020 -
https://github.com/opnsense/core/issues/4204
Based on https://github.com/opnsense/core/issues/4204, it seems that no one is interested in having a persistent IPsec connection....
Hi,
One could have that impression. I have been tunneling with Linux/openswan and pfSense for a long time. Now I am digging into OPNsense IPsec, still frustrated.
First lesson: never use policy-based, choose route-based IPsec (1). I am using a lab infrastructure with several APUs (pcengines) and some Supermicro/Celeron firewalls as test machines. At the moment I am taking advantage of the cold weather to set up, test, discard and start over...
In the end I will see whether I can handle IPsec in a reliable way, switch to OpenVPN, or not use OPNsense for site-to-site tunneling.
Don't give up!
Uwe
(1) https://weberblog.net/route-vs-policy-based-vpn-tunnels/
Quote from: wurmloch on March 25, 2021, 09:29:29 PM
First lesson: never use policy-based, choose route-based IPsec. [...]
Please note there is a limitation in FreeBSD with pf that you can't use NAT with route-based IPsec. No matter if using OPNsense or pfSense.
Quote from: pmladenov on March 24, 2021, 11:03:44 PM
And let me reply to myself again - the missing keyword here is "keyingtries"
[...]
And the issue raised back in 2020 -
https://github.com/opnsense/core/issues/4204
Yep, it was me complaining, but this only happens on unreliable WANs. For these areas I switched to OpenVPN instead of IPsec, but I'd also like to diagnose further if you are still interested. When I see a couple of replies in a thread I usually don't look at it, since I assume another guy is already helping out ;D
Since you already fiddled with the .conf and CLI, can you grab your generated ipsec.conf, search for the affected con, add keyingtries=%forever and put this in a .conf file in the include folder? Then remove the IPsec tunnel from the UI and restart IPsec. Is it then stable enough?
I can always reopen the issue, but it needs more voices to make progress, since changing things in such a sensitive area is always risky.
Thanks for hacking on :)
Quote
Since you already fiddled with the .conf and CLI, can you grab your generated ipsec.conf, search for the affected con, add keyingtries=%forever and put this in a .conf file in the include folder. [...]
Thanks mimugmail,
I did the same and created the following config file:
cat /usr/local/etc/ipsec.opnsense.d/never-give-up.conf
conn %default
  keyingtries = %forever
and restarted the service. It works like a charm.
A standard use case is a WAN/Internet outage for a longer period of time (for instance, it fails during the weekend and is restored on Monday). With the default keyingtries value, a manual service restart is needed, or a full device restart if the device is completely unreachable remotely and only non-technical people are at the site (which may be even worse if there's no one there).
Quote
Please note there is a limitation in FreeBSD with pf that you can't use NAT with route-based IPsec. No matter if using OPNsense or pfSense.
Could you please elaborate a little bit more on that? Is it that you can't use NAT with the ipsecXXXX (VTI) interfaces, or at all?
Another missing feature with VTI IPsec (although not as critical as NAT) is DHCP relay. The DHCP relay daemon simply can't be bound to the VTI interface. It seems that's also the case with pfSense:
https://redmine.pfsense.org/issues/10904
Regards,
Plamen
I asked for the issue to be reopened.
I tried `-1` in `keyingtries` to get the equivalent of `%forever` in swanctl.conf with the current OPNsense production version, but the GUI tells me `-1` is not allowed.
Can anyone point me to the correct setup for site-to-site IPsec connections so that they keep trying after any WAN failure?
99999
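For the swanctl.conf question above: in strongSwan's swanctl.conf syntax, `keyingtries` is a per-connection option and `0` means retry indefinitely, so neither `-1` nor `%forever` is written there. A minimal fragment as a sketch; the connection name `con1` is a placeholder, and whether the OPNsense GUI accepts `0` in that field is an assumption that has not been verified here:

```
connections {
    con1 {
        # 0 = keep retrying the initial connection attempt indefinitely
        keyingtries = 0
    }
}
```

A large finite value such as the one suggested above is an alternative if the GUI rejects `0`.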