Messages - pmladenov

#1
Quote from: lilsense on October 09, 2021, 10:07:41 PM
Hi,
    Just add some comments reading up on this. I am not sure you are quite familiar to know what OPNSense really is... It's a firewall first and not a router.

Now on some things for others to know in regards to DHCP. It is mandatory for any organization to have a local DHCP. No one under any circumstances ought to use DHCP relay across a WAN interface EVEN if it's L2 ethernet (YOU DON'T OWN THE WIRE!!!) To add to this, I can create an AToM (Any Transport over MPLS) across Frame/ATM/ISDN and give you a L2 ethernet. So, NO! you got no clue what you got... LOL

As for other Routing features, if anything not working, please file your complaints to https://frrouting.org/.

you do have one or two nice complaints, but all others... Meh!


edit: just outta curiosity, what did you end up replacing OPNSense with?

Hi lilsense,

First of all thanks for the comment.

A few points on what you've mentioned. DHCP relay - do you really think that feature should not work? I come from the enterprise and service provider segments, and in almost 20 years I haven't seen a vendor where DHCP relay doesn't work. Technically it is simple: listen for broadcast DHCP messages and forward them to a unicast destination (based on the configured unicast IP and the routing table) and vice versa (and, of course, slightly modify a few DHCP fields in the packet).
Regarding the design decision to rely on a central DHCP server in the HQ and not on a local one - it really depends. Yes, I agree it is not wise for all hosts in a branch office to lose their IP addresses and be unable to communicate locally because of a WAN failure. But why do you believe this is the case here? As I said, the setup I'm working on is quite simple, so simple that all hosts (literally a few) in each remote location are configured with static addressing. The DHCP relay was intended to be used only in the deployment phase (PXE).

Regarding the firewall vs routing platform - I'm trying to accomplish really simple things which have been widely available for the last 20 years in open source implementations like Zebra/Quagga/FRR! I'm not talking about MPLS and its applications like L2VPNs, L3VPNs, TE, DiffServ Aware TE, AToM, CsC, Inter-AS options (like A,B,C,AB). I'm not talking about segment routing. I'm not talking about DMVPN or vendor specific solutions (GetVPN, FlexVPN, etc.).
I'm not talking about the various SD-WAN solutions that each vendor (including firewall vendors) provides nowadays. I'm not talking about really specific BGP and OSPF features like MP-BGP transporting IPv6 over IPv4 AFI or vice versa, selective BGP next-hop address tracking, BGP confederations, BGP as-path ignore or relax, complex multi-area OSPF with stub and NSSA areas, OSPF Forward Address tricks, OSPF Type7-to-Type5 translator selection, OSPFv3 Authentication Trailer and so on. All I'm talking about is CORE FUNCTIONALITY which is not missing from FRR (yes, it's there, it was there in Quagga, it was there even in Zebra in 2004), but it is missing in the OpnSense FrontEnd. Tell me a single firewall vendor which doesn't support IBGP Route Reflector? (Whether using a firewall as a BGP RR is a good design is a completely different story.) A year ago it was not even possible to change the keepalive/holdtime timers for the BGP session via the GUI and we had to rely on the peer device to do it!

Going back to the firewall functionality - the only reason there are fewer complaints about it in the list above is that the setup I'm using is really, really simple. And actually it cannot evolve into a more complex one, definitely not with OpnSense. I can't imagine dealing with more than 100 rules and more than 5 interfaces! Imagine I need to move a rule to the middle or rearrange the rules a little bit.... I'm pretty sure that editing pf.conf with vi will be much faster than trying to accomplish this with the Front-End. And of course - you can still generate a completely invalid rule with the Front-End and try to insert it (the same thing I can do with the text editor, right?). I'm also grateful that I don't need to play with NATs in that deployment.

What did I end up replacing OPNSense with - I haven't replaced it yet, and the only reason for that is the small number of changes expected after the go-live.
My thought process when we selected OpnSense was to check the support for the required features - does it support DHCP relay - Yes (is it working for me - NO, but that's not written anywhere), does it support dynamic routing - Yes (is it working for me as it is - NO, and again that's not documented), does it support HA - Yes (does it support active-active - No, again not documented), does it support VLANs - Yes (do I need to interrupt the traffic when I add an additional VLAN - Yes, and again that's missing from the documentation).

For future low cost projects I would definitely prefer a fresh Linux based distribution with iproute2, frr, iptables/nftables. At least I'll be able to achieve the routing part. For the firewall part - I'll probably go with OpenBSD (in case I can't make the HA work as I want in Linux). Will it have a shiny web based UI - No, it won't. Will it work - Yes, it will. Will I have to spend time debugging Front-End and Back-End interaction just to generate a simple config file for a 3rd party open source product, usually with good documentation and tons of examples on the Internet - No.




#2
Hello,

With almost a year of experience with OpnSense, trying to accomplish the simplest enterprise setup ever (a headquarters with 2 OpnSense boxes for HA and ~10 small remote branches with a single OpnSense box each, all single-homed to one service provider providing an L2 Ethernet service to the HQ), I would like to provide my negative feedback, mainly for folks who may consider it for similar future projects.

Overall I'm disappointed.
I'm disappointed with, but not limited to:

  • HA setup - who the hell would think of a scale-out solution and implement active-active or even active-active-active...? (Keep in mind that OpnSense is based on the FreeBSD implementation of PF and CARP, which is slightly different from the OpenBSD implementation; OpenBSD has better HA features)
  • IPSec implementation (...if your branch office loses connectivity for a day - don't expect IPSec to re-establish if you haven't modified the config files and rely only on the FrontEnd UI)
  • Lack of an ECMP feature .... yes, you can't have 2 routes (static or dynamic) for the same destination towards 2 different gateways, so you cannot do traffic load-sharing
  • Lack of ability to fine tune the PF firewall rules....unless you want to spend a week debugging ...php scripts
  • Impossibility of adding an additional VLAN without bouncing the main physical interface, even though it is an LACP bond - "well, the interruption is really short...and we're not adding vlans every day, are we?"
  • Jumbo frames? Yes, it's supported, and it even works, but only if you enable it at the very beginning. Of course if the jumbo frame MTU requirement comes a little bit late and you have already deployed your vlan subinterfaces - you'll need to start over with a fresh install
  • We are not considering a DHCP relay for a remote/branch office pointing to the DHCP server located in the Headquarters over a VTI IPSec interface, are we? Yes, the buggy daemon doesn't like the ipsecXXX interface type and cannot even start
  • Dynamic Routing using FRR - Yes, but don't expect to use the FrontEnd UI, unless you want your BGP daemon to restart each time you make a small modification like adding a new BGP neighbor, or even modifying a simple route-map. OSPFv2/v3 is no different. More advanced routing setups like BGP Route-Reflector or Route-Server - forget it, these may be implemented in the UI in 2122 if we're lucky (and yes, they won't work as expected)
  • Multicast routing? PIM? IGMPv1/2/3? .... Forget it all. Moreover, the highly limited igmp-proxy doesn't like lagg interface types (and even vlans in older versions), so it can't bind and start at all, and even if you can make it work somehow - don't expect any functionality to replicate the multicast state between 2 HA OpnSense devices...and again, who is looking for HA?
  • VRF functionality, at least for the management traffic - no, not supported by BSD (although JunOS has been running on top of FreeBSD for centuries). So, be careful what you're doing with routing, PF, and badly documented check boxes in the UI, just to avoid locking yourself out
  • Lack of documentation? Really? Try to understand something as simple as how the static routing "gateway" concept works and you'll understand what I mean. During that time, expect your statically configured default gateway to suddenly start pointing to a different interface or to be overwritten by a dynamic protocol

I understand that some of the limitations listed above are not related to OpnSense itself, but to FreeBSD or some 3rd party plugins.
I also understand that the product is mainly for home users bragging about having "c00l f1r3w@ll @home", but it's far, far, far from an enterprise ready solution.
I would firmly say it's not even being developed with a "more than 2 OpnSense boxes in a network" mindset.

On the positive side - I would say - if you are not expecting to change anything after the initial deployment and have all the requirements in advance (and know the caveats listed above) - it works and is stable.

What else am I missing here? What is your experience?
#3
Exactly the same problem ... I did exactly the same tests as the guy above (setting the MTU to 9000, 1500, 3000, 8999....). That's ridiculous, do I need a fresh install just because the requirement for jumbo MTU came a little bit late?

Of course ifconfig if_vlan mtu 9000 works, but not the buggy FrontEnd or BackEnd.... I'm starting to consider installing a native FreeBSD, or probably OpenBSD (because of the better pf implementation there, which gives the option of active-active HA setups with both CARP and PF session replication) and to stop using that piece of buggy scripts.
#4
What about - "pass out log from {any} to {any} keep state allow-opts label "1232f88e5fac29a32501e3f051020cac" # let out anything from firewall host itself" rule?

What's the best way to modify it (in my case I need "keep state ( sloppy )")?

P.S.
Found a solution for myself (of course this will go away after any upgrade...)
root@OPNsense1:/usr/local/etc/inc # diff filter.lib.inc filter.lib.inc.org
542c542
<         array('direction' => 'out', 'statetype' => 'sloppy', 'allowopts' => true,
---
>         array('direction' => 'out', 'statetype' => 'keep', 'allowopts' => true,
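
Since this edit is lost on every upgrade, one option is to re-apply it with a small script. The following is only a minimal sketch, assuming the rule line keeps exactly the shape shown in the diff above; the function name apply_sloppy is hypothetical, and the file to patch would be /usr/local/etc/inc/filter.lib.inc.

```python
import re

# Matches only the array shown in the diff above:
# direction 'out', statetype 'keep', allowopts true.
PATTERN = re.compile(
    r"('direction'\s*=>\s*'out',\s*'statetype'\s*=>\s*)'keep'"
    r"(,\s*'allowopts'\s*=>\s*true)"
)


def apply_sloppy(source: str) -> tuple[str, int]:
    """Return (patched_source, number_of_replacements).

    Running it on already patched text changes nothing, so it is safe
    to call after every upgrade.
    """
    return PATTERN.subn(r"\1'sloppy'\2", source)


sample = (
    "        array('direction' => 'out', 'statetype' => 'keep', "
    "'allowopts' => true,\n"
)
patched, changed = apply_sloppy(sample)
print(changed)
print(patched, end="")
```

If the upstream file changes shape, the pattern simply stops matching and the replacement count is 0, which is a reasonable signal to review the patch by hand.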


#5
High availability / Active-Active HA tunning
March 30, 2021, 02:11:51 PM
Hello,

Currently I have a HA setup acting as active/active with 2 nodes and pfsync between them (unicast) and pure routing (BGP), without relying on CARP at the moment.

Although I have tested all possible kinds of session asymmetry (for instance TCP SYN via FW1, TCP SYN+ACK via FW2, and all other variations) and everything seems to work well in the LAB, that's not the case outside of the testing environment.
With real traffic (with a low number of sessions, < 1000) I'm getting complaints about TCP retransmits, which seem to happen when there are asymmetrical flows.
I suspect it is related to some kind of pfsync timers (preventing timely synchronization between both firewall nodes).

I've read pfsync(4) and ifconfig(8) for both FreeBSD and OpenBSD several times, however I can't fully understand the following concepts:

1) pfsync defer option - from the OpenBSD pfsync man page, but nothing in the FreeBSD pfsync man page:

Quote
Where more than one firewall might actively handle packets, e.g. with certain ospfd, bgpd or carp(4) configurations, it is beneficial to defer transmission of the initial packet of a connection. The pfsync state insert message is sent immediately; the packet is queued until either this message is acknowledged by another system, or a timeout has expired. This behaviour is enabled with the defer parameter to ifconfig.

So in simple words - what happens after FW1 receives a TCP SYN segment and that traffic is allowed by the PF rulebase (and we expect that the SYN+ACK segment will be returned via FW2), with and without the defer option enabled?

2) pfsync maxupd option, by default set to 128. 

Quote
The pfsync interface will attempt to collapse multiple state updates into a single packet where possible. The maximum number of times a single state can be updated before a pfsync packet will be sent out is controlled by the maxupd parameter to ifconfig (see ifconfig and the example below for more details). The sending out of a pfsync packet will be delayed by a maximum of one second.

Does it make sense to decrease that parameter to avoid waiting for up to one second before sending pfsync packets to the peer?

3) net.pfsync.pfsync_buckets

Quote
The number of pfsync buckets. This affects the performance and memory tradeoff. Defaults to twice the number of CPUs. Change only if benchmarks show this helps on your workload.

Any idea what I should monitor, and how, to set this properly?


P.S.1 Just went back to the pcap files - for almost all TCP sessions (with a few exceptions) - the segment with the SYN flag was retransmitted 1 second after the first SYN was sent.
So we have:

(1) Client ----------SYN ---------> FW1 ------------------> Server
                                                    |
                                                 pfsync
                                                    |
(2) Client <------------------------- FW2 <-- SYN+ACK----- Server

It seems that FW2 in (2) is denying the SYN+ACK sent from the Server in response to the Client, probably because it hasn't yet received the SYN-SENT state from FW1.

P.S.2 - Confirmed - the returned SYN+ACK segment (from Server to Client) is dropped by FW2. It arrives just before the state is replicated from FW1 to FW2. I tried with and without the defer option on the pfsync0 interface on both FWs and don't see any change in the behavior. Probably the queuing of the initial packet is not working?
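
To make the race explicit, here is a toy timeline model in Python. It is only a sketch of the behaviour described in the quoted man page text, not of the actual pfsync code: all numbers are hypothetical, the acknowledgement is assumed to take one more one-way delay after the state reaches FW2, and defer_timeout_ms is an assumed parameter rather than a documented value.

```python
def syn_ack_dropped(sync_delay_ms: float, server_rtt_ms: float,
                    defer: bool = False,
                    defer_timeout_ms: float = 10.0) -> bool:
    """Toy timeline; t=0 is when FW1 creates the state for the SYN.

    sync_delay_ms    : time until FW2 has installed the replicated state
    server_rtt_ms    : time from FW1 forwarding the SYN until the SYN+ACK
                       reaches FW2
    defer_timeout_ms : hypothetical upper bound on how long FW1 holds the SYN
    """
    if defer:
        # FW1 holds the SYN until the peer acknowledges the state insert
        # (assumed: one more one-way delay) or the timeout expires.
        ack_arrives_ms = 2 * sync_delay_ms
        syn_forwarded_ms = min(ack_arrives_ms, defer_timeout_ms)
    else:
        syn_forwarded_ms = 0.0
    syn_ack_at_fw2_ms = syn_forwarded_ms + server_rtt_ms
    # FW2 drops the SYN+ACK if it shows up before the state does.
    return syn_ack_at_fw2_ms < sync_delay_ms


# Fast LAN server, slow state replication: dropped without defer.
print(syn_ack_dropped(sync_delay_ms=50, server_rtt_ms=1))
# Replication faster than the defer timeout: defer closes the race.
print(syn_ack_dropped(sync_delay_ms=3, server_rtt_ms=1, defer=True))
# Replication slower than the defer timeout: defer does not help.
print(syn_ack_dropped(sync_delay_ms=50, server_rtt_ms=1, defer=True))
```

In this model defer only helps when the state reaches the peer before the hold timeout expires, which would be one possible explanation for seeing no difference with defer enabled. That is a hypothesis to check against the pcaps, not a diagnosis.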
#6
Some outputs from my lab (FW3 with router ID: 10.30.10.1, peering with 10.70.1.1, area 0.0.0.3 via VTI - ipsec1000 configured with OSPF P2P network type)

FW3.localdomain# sh ip ospf
OSPF Routing Process, Router ID: 10.30.10.1
Supports only single TOS (TOS0) routes
This implementation conforms to RFC2328
RFC1583Compatibility flag is disabled
OpaqueCapability flag is disabled
Initial SPF scheduling delay 0 millisec(s)
Minimum hold time between consecutive SPFs 50 millisec(s)
Maximum hold time between consecutive SPFs 5000 millisec(s)
Hold time multiplier is currently 1
SPF algorithm last executed 12.320s ago
Last SPF duration 51 usecs
SPF timer is inactive
LSA minimum interval 5000 msecs
LSA minimum arrival 1000 msecs
Write Multiplier set to 20
Refresh timer 10 secs
Number of external LSA 2. Checksum Sum 0x00017d78
Number of opaque AS LSA 0. Checksum Sum 0x00000000
Number of areas attached to this router: 1
Area ID: 0.0.0.3
   Shortcutting mode: Default, S-bit consensus: no
   Number of interfaces in this area: Total: 4, Active: 4
   Number of fully adjacent neighbors in this area: 1
   Area has no authentication
   Number of full virtual adjacencies going through this area: 0
   SPF algorithm executed 4 times
   Number of LSA 7
   Number of router LSA 2. Checksum Sum 0x000097f1
   Number of network LSA 0. Checksum Sum 0x00000000
   Number of summary LSA 4. Checksum Sum 0x00027f98
   Number of ASBR summary LSA 1. Checksum Sum 0x00002622
   Number of NSSA LSA 0. Checksum Sum 0x00000000
   Number of opaque link LSA 0. Checksum Sum 0x00000000
   Number of opaque area LSA 0. Checksum Sum 0x00000000


The P2P adjacency (not sure why FRR displays it with DROther; there shouldn't be any DR/BDR/DROther election on a P2P network type):

FW3.localdomain# show ip ospf neighbor

Neighbor ID     Pri State           Dead Time Address         Interface                        RXmtL RqstL DBsmL
10.70.1.1         1 Full/DROther       3.780s 172.16.31.1     ipsec1000:172.16.31.2                0     0     0


OSPF VTI interface details with manually changed hello/dead intervals (not important here). Note the "This interface is UNNUMBERED" line:

FW3.localdomain# show ip ospf interface ipsec1000
ipsec1000 is up
  ifindex 12, MTU 1400 bytes, BW 0 Mbit <UP,POINTOPOINT,RUNNING,MULTICAST>
  This interface is UNNUMBERED, Area 0.0.0.3
  MTU mismatch detection: enabled
  Router ID 10.30.10.1, Network Type POINTOPOINT, Cost: 10
  Transmit Delay is 1 sec, State Point-To-Point, Priority 1
  No backup designated router on this network
  Multicast group memberships: OSPFAllRouters
  Timer intervals configured, Hello 1s, Dead 4s, Wait 4s, Retransmit 5
    Hello due in 0.567s
  Neighbor Count is 1, Adjacent neighbor count is 1

FW3.localdomain#



FW3's OSPF router LSA (type 1): there are 2 stub networks, 10.30.10.0/24 and 10.30.20.0/24, and a P2P VTI network with a strange (Link Data) Router Interface address: 0.0.0.12 (no idea how 0.0.0.12 was derived here):

FW3.localdomain# show ip ospf database router 10.30.10.1

       OSPF Router with ID (10.30.10.1)


                Router Link States (Area 0.0.0.3)

  LS age: 44
  Options: 0x2  : *|-|-|-|-|-|E|-
  LS Flags: 0x3
  Flags: 0x0
  LS Type: router-LSA
  Link State ID: 10.30.10.1
  Advertising Router: 10.30.10.1
  LS Seq Number: 80000007
  Checksum: 0x4ddb
  Length: 60

   Number of Links: 3

    Link connected to: Stub Network
     (Link ID) Net: 10.30.10.0
     (Link Data) Network Mask: 255.255.255.0
      Number of TOS metrics: 0
       TOS 0 Metric: 100

    Link connected to: Stub Network
     (Link ID) Net: 10.30.20.0
     (Link Data) Network Mask: 255.255.255.0
      Number of TOS metrics: 0
       TOS 0 Metric: 100

    Link connected to: another Router (point-to-point)
     (Link ID) Neighboring Router ID: 10.70.1.1
     (Link Data) Router Interface address: 0.0.0.12
      Number of TOS metrics: 0
       TOS 0 Metric: 10


FW3.localdomain#
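
A note on the 0.0.0.12 value: RFC 2328 specifies that for an unnumbered point-to-point link the Link Data field of the router-LSA carries the interface's ifIndex instead of an IP address. The outputs in this post report ifindex 12 for ipsec1000, so 0.0.0.12 looks like that index rendered as a dotted quad. A minimal check in Python (the function name is hypothetical):

```python
import ipaddress


def link_data_for_unnumbered(ifindex: int) -> str:
    """Render an interface index the way a 32-bit Link Data field is printed."""
    return str(ipaddress.IPv4Address(ifindex))


# ifindex 12 is what "show ip ospf interface ipsec1000" reports.
print(link_data_for_unnumbered(12))
```

So the value appears to be expected behaviour for an interface that OSPF treats as unnumbered, rather than a corrupted address.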


ipsec1000 interface on FW3 is:
FW3.localdomain# sh int ipsec1000
Interface ipsec1000 is up, line protocol is up
  Link ups:       1    last: 2021/03/26 13:05:43.47
  Link downs:     0    last: (never)
  vrf: default
  index 12 metric 1 mtu 1400 speed 0
  flags: <UP,POINTOPOINT,RUNNING,MULTICAST>
  Type: Unknown
  inet 172.16.31.2/32 peer 172.16.31.1/32 unnumbered
  inet6 fe80::250:56ff:fe27:b201/64
  Interface Type Other
    input packets 181916, bytes 11272315, dropped 0, multicast packets 0
    input errors 0
    output packets 197212, bytes 12238119, multicast packets 0
    output errors 0
    collisions 0
FW3.localdomain#


The interface is locally connected, but in the routing table it is the "peer-ip" of the tunnel that is presented:

FW3.localdomain# sh ip route connected
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued route, r - rejected route
....
C>* 172.16.31.1/32 [0/1] is directly connected, ipsec1000, 00:02:29



The VTI interface from BSD:
root@FW3:~ # ifconfig ipsec1000
ipsec1000: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> metric 0 mtu 1400
        tunnel inet 10.64.123.3 --> 10.64.123.1
        inet6 fe80::250:56ff:fe27:b201%ipsec1000 prefixlen 64 scopeid 0xc
        inet 172.16.31.2 --> 172.16.31.1 netmask 0xffffffff
        groups: ipsec
        reqid: 1000
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
root@FW3:~ #


So the bottom line is - with "redistribute connected" on FW3, you're going to redistribute 172.16.31.1/32, which lives on the other end of the tunnel (the other FW), and not your own IP address - 172.16.31.2/32
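
To see at a glance which address is local and which is the peer, the point-to-point inet line of the ifconfig output can be parsed. A small Python sketch, assuming the FreeBSD output format shown above (the function name p2p_addresses is hypothetical):

```python
import ipaddress
import re

# Matches "inet <local> --> <peer> netmask 0x........" but not the
# "tunnel inet" or "inet6" lines.
INET_P2P = re.compile(
    r"^\s*inet (\S+) --> (\S+) netmask (0x[0-9a-fA-F]{8})", re.M
)


def p2p_addresses(ifconfig_output: str):
    """Return (local, peer) as ip_interface objects, or None if not found."""
    m = INET_P2P.search(ifconfig_output)
    if m is None:
        return None
    local, peer, mask = m.groups()
    prefixlen = bin(int(mask, 16)).count("1")
    return (ipaddress.ip_interface(f"{local}/{prefixlen}"),
            ipaddress.ip_interface(f"{peer}/{prefixlen}"))


sample = """ipsec1000: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> metric 0 mtu 1400
        tunnel inet 10.64.123.3 --> 10.64.123.1
        inet6 fe80::250:56ff:fe27:b201%ipsec1000 prefixlen 64 scopeid 0xc
        inet 172.16.31.2 --> 172.16.31.1 netmask 0xffffffff
"""
local, peer = p2p_addresses(sample)
print("local:", local)
print("peer: ", peer)
```

Comparing the peer value with the connected route shown by FRR confirms that it is the remote /32 that ends up in the table, so any static route or filter meant for the firewall's own tunnel address has to name the local /32 explicitly.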
#7
Quote
Yep it was me complaining, but this only happens on unreliable WANs. For these areas I switched to OpenVPN based IPsec, but I'd also like to diagnose further if you still interested. When I see couple of replies in a thread I usually dont look at it since I guess already another guys is helping out  ;D
Since you already fiddled with the .conf and CLI, can you grab your generated ipsec.conf, search for the affected con, add keyingtries=%forever and put this in a .conf file in the include folder. Then remove the ipsec from UI and restart IPsec. Is it then stable enough?

I can always reopen the issue, but it needs more voices to make progress since changing things in such a sensible area is always risky.

Thanks for hacking on

Thanks mimugmail,

I did the same and created the following config file:

cat /usr/local/etc/ipsec.opnsense.d/never-give-up.conf
conn %default
keyingtries = %forever


and restarted the service. It works like a charm.
A standard use case - a WAN/Internet outage for a longer period of time (for instance, it fails during the weekend and is restored on Monday). With the default keyingtries value, a manual service restart will be needed, or a full device restart if the device is completely unreachable remotely and there are only non-technical people at the site (which may be even worse if there's no one there).


Quote
Please note there is a limitation in FreeBSD with pf that you can't use NAT with route-based IPsec. No matter if using OPNsense or pfSense.
Could you please elaborate a little more on that? Is it that you can't use NAT with the ipsecXXXX (VTI) interfaces, or at all?
Another missing feature with VTI IPSec (although not as critical as NAT) is DHCP relay. The DHCP relay daemon simply can't bind to the VTI interface. That seems to be the case with pfSense as well:
https://redmine.pfsense.org/issues/10904


Regards,
Plamen
#8
Hi Renaud,

If I understand your question correctly - you're trying to find out why the VTI interface (ipsec1000 for example) is not seen as a route and propagated via the OSPF.

This is because OSPF sees that type of VTI interface as an "IP unnumbered" one. The funniest thing is what happens when you try to redistribute it into OSPF (or even BGP). For instance, assume that the following ipsec1000 VTI is on FW1 and you redistribute it via OSPF:

ipsec1000: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> metric 0 mtu 1400
        tunnel inet 10.1.1.1 --> 10.1.1.2
        inet6 fe80::250:56ff:fe2d:b801%ipsec1000 prefixlen 64 scopeid 0xc
        inet 172.16.41.1 --> 172.16.41.2 netmask 0xffffffff
        groups: ipsec
        reqid: 1000
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>

The assumption here is that 172.16.41.1/32 is advertised as OSPF E2 route originated from FW1...however the reality is FW1 is originating 172.16.41.2/32 (which should be the other end of the tunnel). 
I have no idea why it behaves that way.

If that's the case, I guess you need that route so that traffic originated locally by the FW (like NTP, DNS, etc., using the VTI IP address as source IP) can find its way back. Personally, I fixed that problem using static routes...

Regards,
Plamen
#9
Based on https://github.com/opnsense/core/issues/4204 it seems that no one is interested in having a persistent IPsec connection....
#10
And let me reply to myself again - the missing keyword here is "keyingtries"

https://wiki.strongswan.org/projects/strongswan/wiki/connsection
Quote
keyingtries = 3 | <number> | %forever

how many attempts (a positive integer or %forever) should be made to negotiate a connection, or a replacement
for one, before giving up (default 3). The value %forever means 'never give up'. Relevant only locally, other end need
not agree on it.

And the issue raised back in 2020 -
https://github.com/opnsense/core/issues/4204
#11
It looks like the IPSec re-connection issue is not caused by "trap not found, unable to acquire reqid 1000".

During my workaround tests in a lab environment I was able to reproduce the issue. As I expected, that's happening when there is an underlay connectivity loss for a longer period of time.

During the connectivity loss, IKE packets are retransmitted 5 times before each attempt is abandoned, and after 3 attempts charon gives up completely:

Mar 24 19:30:10 FW3 charon[90790]: 14[IKE] <con1|1> giving up after 5 retransmits
Mar 24 19:30:10 FW3 charon[90790]: 14[IKE] <con1|1> restarting CHILD_SA con1
....
Mar 24 19:32:55 FW3 charon[90790]: 12[IKE] <con1|2> giving up after 5 retransmits
Mar 24 19:32:55 FW3 charon[90790]: 12[IKE] <con1|2> peer not responding, trying again (2/3)
....
Mar 24 19:35:40 FW3 charon[90790]: 08[IKE] <con1|2> giving up after 5 retransmits
Mar 24 19:35:40 FW3 charon[90790]: 08[IKE] <con1|2> peer not responding, trying again (3/3)
.....
Mar 24 19:38:25 FW3 charon[90790]: 05[IKE] <con1|2> giving up after 5 retransmits
Mar 24 19:38:25 FW3 charon[90790]: 05[IKE] <con1|2> establishing IKE_SA failed, peer not responding



So the question is: how can I change that behavior and force IPSec to keep trying to connect?
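
The 165-second spacing between the "giving up after 5 retransmits" lines above matches strongSwan's exponential retransmission backoff. As a sketch, assuming the documented charon defaults (retransmit_timeout 4.0 s, retransmit_base 1.8, retransmit_tries 5) and keyingtries 3 as seen in the "(3/3)" log lines, the time until the daemon stops trying can be estimated like this (function names are hypothetical):

```python
def retransmit_schedule(timeout: float = 4.0, base: float = 1.8,
                        tries: int = 5) -> list[float]:
    """Waits in seconds after the initial send and after each retransmit."""
    return [timeout * base ** n for n in range(tries + 1)]


def seconds_until_give_up(keyingtries: int = 3, **kwargs) -> float:
    """Total time before the daemon stops trying to set up the IKE_SA."""
    return keyingtries * sum(retransmit_schedule(**kwargs))


per_attempt = sum(retransmit_schedule())
print([round(w) for w in retransmit_schedule()])
print(round(per_attempt))
print(round(seconds_until_give_up()))
```

Under these assumptions one attempt lasts about 165 seconds and three attempts about 495 seconds, which fits the timestamps in the log (19:30:10, 19:32:55, 19:35:40, 19:38:25). Any underlay outage longer than roughly eight minutes therefore leaves the tunnel down until something restarts it.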
#12
Any workarounds? I'm starting to think of some kind of ugly script in the crontab, or of using the monit service to restart IPSec when it hangs again.
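
For reference, the "ugly script" idea could look roughly like the Python sketch below. It is an assumption-laden illustration, not a recommended fix: it uses the overlay address and the restart command that appear elsewhere in this thread, the function name check_and_restart is hypothetical, and restarting the service also bounces every other tunnel on the box.

```python
import subprocess

# Overlay address of the peer and the restart command, both taken from
# the outputs posted in this thread.
PING = ["ping", "-c", "2", "172.16.1.1"]
RESTART = ["/usr/local/etc/rc.d/strongswan", "onerestart"]


def check_and_restart(run=subprocess.run) -> bool:
    """Restart strongSwan if the overlay peer does not answer.

    Returns True if a restart was triggered. The run parameter exists
    only so the logic can be exercised without touching the system.
    """
    if run(PING, capture_output=True).returncode == 0:
        return False
    run(RESTART, capture_output=True)
    return True
```

Called from cron every few minutes, this would at least bound the outage. It does nothing about the root cause.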
#13
Nothing in the firewall logs either, which makes me believe that IKE_SA_INIT is not being generated by either end. It's just stuck, although the "Start immediately" option is selected for phase 1 and DPD with restart is enabled on both firewalls.

Currently on opnFW2 there are other IPSec VTIs which are working fine (however some of them were in the same stuck state in the past) and I can't find out why it's not generating IKE_SA_INIT packet for that specific peer.

Any ideas how should I proceed with the troubleshooting? Any meaningful ipsec debug level increase?

As I wrote in the first post - if I restart the strongswan service the issue is resolved, but it happens again after a few days.
#14
Another observation - there's no UDP traffic between opnFW1 and opnFW2 on the transport interface.
Neither of them is trying to initiate phase 1.
#15
Hello,


I have a routed-mode IPSec tunnel between 2 opnsense FWs, opnFW1 and opnFW2, running:

OPNsense 20.7.5-amd64
FreeBSD 12.1-RELEASE-p10-HBSD
OpenSSL 1.1.1h 22 Sep 2020

After approximately a week of uptime, without any configuration changes on either end, I'm getting the following error in opnFW1's /var/log/ipsec.log and of course the IPSec is not working....

Mar 22 21:43:47 opnFW1 charon[16547]: 11[KNL] creating acquire job for policy 192.168.1.10/32 === 192.168.1.1/32 with reqid {1000}
Mar 22 21:43:47 opnFW1 charon[16547]: 07[CFG] trap not found, unable to acquire reqid 1000
Mar 22 21:44:19 opnFW1 charon[16547]: 07[KNL] creating acquire job for policy 192.168.1.10/32 === 192.168.1.1/32 with reqid {1000}
Mar 22 21:44:19 opnFW1 charon[16547]: 11[CFG] trap not found, unable to acquire reqid 1000
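
When the log is long, it helps to count how often this error occurs per reqid. A small Python sketch, assuming the charon log line format shown above (the function name count_trap_errors is hypothetical):

```python
import re
from collections import Counter

TRAP = re.compile(
    r"charon\[\d+\]: \d+\[CFG\] trap not found, "
    r"unable to acquire reqid (\d+)"
)


def count_trap_errors(log_text: str) -> Counter:
    """Count 'trap not found' errors per reqid in ipsec.log text."""
    return Counter(int(m.group(1)) for m in TRAP.finditer(log_text))


sample = """\
Mar 22 21:43:47 opnFW1 charon[16547]: 11[KNL] creating acquire job for policy 192.168.1.10/32 === 192.168.1.1/32 with reqid {1000}
Mar 22 21:43:47 opnFW1 charon[16547]: 07[CFG] trap not found, unable to acquire reqid 1000
Mar 22 21:44:19 opnFW1 charon[16547]: 07[KNL] creating acquire job for policy 192.168.1.10/32 === 192.168.1.1/32 with reqid {1000}
Mar 22 21:44:19 opnFW1 charon[16547]: 11[CFG] trap not found, unable to acquire reqid 1000
"""
print(count_trap_errors(sample))
```

The reqid maps directly to the ipsecNNNN interface name, so the counter shows which tunnels are affected.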


The ipsec logical interface on opnFW1 is ipsec1000:
root@opnFW1:~ # ifconfig ipsec1000
ipsec1000: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> metric 0 mtu 1400
        tunnel inet 192.168.1.10 --> 192.168.1.1
        inet6 fe80::1a5a:58ff:fe10:13a0%ipsec1000 prefixlen 64 scopeid 0x13
        inet 172.16.1.10 --> 172.16.1.1 netmask 0xffffffff
        groups: ipsec
        reqid: 1000
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>



From opnFW1 I can successfully ping opnFW2 "underlay" IP address - 192.168.1.1, however I can't ping the "overlay" IP - 172.16.1.1

root@opnFW1:~ # ping -c 2 192.168.1.1
PING 192.168.1.1 (192.168.1.1): 56 data bytes
64 bytes from 192.168.1.1: icmp_seq=0 ttl=64 time=7.266 ms
64 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=3.638 ms

--- 192.168.1.1 ping statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 3.638/5.452/7.266/1.814 ms
root@opnFW1:~ # ping -c 2 172.16.1.1
PING 172.16.1.1 (172.16.1.1): 56 data bytes

--- 172.16.1.1 ping statistics ---
2 packets transmitted, 0 packets received, 100.0% packet loss


The ipsec configuration on opnFW1 is:
root@opnFW1:/usr/local/etc # cat ipsec.conf
# This file is automatically generated. Do not edit
config setup
  uniqueids = yes

conn con1
  aggressive = no
  fragmentation = yes
  keyexchange = ikev2
  mobike = yes
  reauth = yes
  rekey = yes
  forceencaps = no
  installpolicy = no

  dpdaction = restart
  dpddelay = 10s
  dpdtimeout = 60s

  left = 192.168.1.10
  right = 192.168.1.1

  leftid = 192.168.1.10
  ikelifetime = 28800s
  lifetime = 3600s
  ike = aes256gcm16-sha512-ecp512bp!
  leftauth = psk
  rightauth = psk
  rightid = 192.168.1.1
  reqid = 1000
  rightsubnet = 0.0.0.0/0
  leftsubnet = 0.0.0.0/0
  esp = aes256gcm16-sha512-ecp512bp!
  auto = start


From the configuration above - that IPSec should rely on DPD.

On the other side, opnFW2, the logs I'm getting are:

root@opnFW2:/var/log # clog ipsec.log | grep 192.168.1.
Mar 22 13:38:37 opnFW2 charon[41296]: 05[KNL] creating acquire job for policy 192.168.1.1/32 === 192.168.1.10/32 with reqid {9000}
Mar 22 13:39:09 opnFW2 charon[41296]: 02[KNL] creating acquire job for policy 192.168.1.1/32 === 192.168.1.10/32 with reqid {9000}
Mar 22 13:41:20 opnFW2 charon[41296]: 07[KNL] creating acquire job for policy 192.168.1.1/32 === 192.168.1.10/32 with reqid {9000}
Mar 22 13:41:52 opnFW2 charon[41296]: 14[KNL] creating acquire job for policy 192.168.1.1/32 === 192.168.1.10/32 with reqid {9000}
Mar 22 13:42:25 opnFW2 charon[41296]: 15[KNL] creating acquire job for policy 192.168.1.1/32 === 192.168.1.10/32 with reqid {9000}


IPsec logical interface on opnFW2 is ipsec9000:

root@opnFW2:/var/log # ifconfig ipsec9000
ipsec9000: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> metric 0 mtu 1400
        tunnel inet 192.168.1.1 --> 192.168.1.10
        inet6 fe80::1e72:1dff:feb6:c703%ipsec9000 prefixlen 64 scopeid 0x25
        inet 172.16.1.1 --> 172.16.1.10 netmask 0xffffffff
        groups: ipsec
        reqid: 9000
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>

The ping tests are identical: from opnFW2 I can ping 192.168.1.10 and cannot ping 172.16.1.10

root@opnFW2:/var/log # ping 192.168.1.10
PING 192.168.1.10 (192.168.1.10): 56 data bytes
64 bytes from 192.168.1.10: icmp_seq=0 ttl=64 time=7.893 ms
64 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=7.310 ms
64 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=7.990 ms
^C
--- 192.168.1.10 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 7.310/7.731/7.990/0.300 ms
root@opnFW2:/var/log # ping -c 2 172.16.1.10
PING 172.16.1.10 (172.16.1.10): 56 data bytes

--- 172.16.1.10 ping statistics ---
2 packets transmitted, 0 packets received, 100.0% packet loss
root@opnFW2:/var/log #


IPSec config on opnFW2 related to that tunnel is:

cat /usr/local/etc/ipsec.conf

config setup
  uniqueids = yes

conn con9
  aggressive = no
  fragmentation = yes
  keyexchange = ikev2
  mobike = yes
  reauth = yes
  rekey = yes
  forceencaps = no
  installpolicy = no

  dpdaction = restart
  dpddelay = 10s
  dpdtimeout = 60s

  left = 192.168.1.1
  right = 192.168.1.10

  leftid = 192.168.1.1
  ikelifetime = 28800s
  lifetime = 3600s
  ike = aes256gcm16-sha512-ecp512bp!
  leftauth = psk
  rightauth = psk
  rightid = 192.168.1.10
  reqid = 9000
  rightsubnet = 0.0.0.0/0
  leftsubnet = 0.0.0.0/0
  esp = aes256gcm16-sha512-ecp512bp!
  auto = start



And the funniest thing is that if I restart the strongswan service (/usr/local/etc/rc.d/strongswan onerestart) on opnFW1 (the one with the "unable to acquire reqid" logs), the issue disappears and everything starts working again....until the next time it stops.....

Any ideas or comments are highly appreciated!
I have intentionally not restored the connectivity this time, so I can provide any additional outputs/logs if required.

Regards,
Plamen