VPN connectivity issues caused by "HA update and reconfigure backup" cron job

Started by ici, February 13, 2025, 01:27:17 PM

Previous topic - Next topic
Hello good people of the OPNsense community,

We encountered an interesting phenomenon explained bellow and would like to inquire on the best way to handle it.

Setup:
We have two OPNsense firewalls (master and backup) with AWS Site-to-Site VPN configured. Each firewall has two IPsec tunnels to AWS (total of 4 tunnels). BGP is used for route exchange with AWS, and CARP handles the failover between master and backup OPNsense firewalls.

Issue:
After a fresh reboot of the backup OPNsense, everything works perfectly - ping from our DC to an EC2 instance in AWS flows through the VPN uninterrupted. However, at minute 30 of every hour, we lose connectivity to our AWS resources. Interestingly, while the AWS console shows status changes (DOWN then UP again) only for the backup's tunnels, our connectivity doesn't recover even after the tunnels return to UP status. All this occurs despite the master's VPN tunnels remaining consistently up and stable.

Root Cause:
We traced this to the built-in cron job "HA update and reconfigure backup" (enabled in System: Settings: Cron and running at */30). This job triggers a full reconfiguration of the backup firewall, which causes:
- IPsec tunnel renegotiation
- BGP session resets on the backup

The most puzzling aspect was that despite this occurring on the backup firewall, these service restarts were somehow affecting the routing for traffic that should have been flowing through the master.

If we perform a manual HA sync the same behavior is observed and a reboot of the backup is required for the connectivity to be restored.

Resolution:
- Disabling the "HA update and reconfigure backup" cron job immediately resolved the connectivity issues.
- The VPN tunnels and routing remain stable now, and no longer experiencing the hourly status changes.

Questions:
1. Why would the backup firewall's VPN state impact traffic when the master's tunnels were working correctly?
2. Is there a way to make HA sync less disruptive—perhaps without triggering full service restarts?
3. What's the recommended approach for handling HA synchronization in setups with active VPN tunnels on both master and backup?