Hello,
After updating several devices from 24.7.12 to 25.7.10, the following error occurs with the os-frr plugin:
After failover to the slave, it takes approximately 2 minutes until the connection to the endpoints via WireGuard and OpenVPN is restored. Oddly, the IPsec tunnels are not affected. Without the os-frr plugin activated, everything works perfectly. Simply activating os-frr is enough to trigger the problem; BGP doesn't even need to be enabled.
The same problem occurs when failing back to the master.
According to the log:
After BACKUP -> MASTER, os-frr (zebra) starts, and then configctl reports a configd communication error with a timeout after approximately 2 minutes. Only after that are the remaining CARP interfaces activated in /usr/local/etc/rc.syshook.d/carp/20-openvpn.
What could be causing this error? I haven't found anything relevant in the log!
Hardware used: Deciso
Logs:
(newest entry first)
2026-01-12T09:00:14  Notice  opnsense  /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "CARP WAN FW PORT 102 (185.120.61.102) (102@ax1)" has resumed the state "MASTER" for vhid 102
2026-01-12T09:00:14  Error  configctl  error in configd communication Traceback (most recent call last): File "/usr/local/sbin/configctl", line 65, in exec_config_cmd line = sock.recv(65536).decode() ^^^^^^^^^^^^^^^^ TimeoutError: timed out
2026-01-12T08:58:15  Notice  watchfrr  [KWE5Q-QNGFC] all daemons up, doing startup-complete notify
2026-01-12T08:58:15  Notice  watchfrr  [QDG3Y-BY5TN] zebra state -> up : connect succeeded
2026-01-12T08:58:15  Notice  watchfrr  [QDG3Y-BY5TN] mgmtd state -> up : connect succeeded
2026-01-12T08:58:15  Notice  opnsense  /usr/local/sbin/pluginctl: plugins_configure crl (execute task : openvpn_refresh_crls(1))
2026-01-12T08:58:15  Notice  watchfrr  [T83RR-8SM5G] watchfrr 10.5.0 starting: vty@0
2026-01-12T08:58:14  Notice  opnsense  /usr/local/sbin/pluginctl: plugins_configure crl (execute task : core_trust_crl(1))
2026-01-12T08:58:14  Notice  opnsense  /usr/local/sbin/pluginctl: plugins_configure crl (1)
2026-01-12T08:58:14  Notice  kernel  <6>[144370] carp: 110@vlan02: BACKUP -> MASTER (preempting a slower master)
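To put a number on the "approximately 2 minutes": the gap between watchfrr reporting all daemons up and the configctl timeout can be computed from the two timestamps in the log above. A minimal Python sketch; the function name gap_seconds is my own, and only the two timestamps come from the log.

```python
from datetime import datetime

def gap_seconds(start: str, end: str) -> int:
    """Return the number of whole seconds between two ISO-style log timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())

# Timestamps copied from the log entries above.
frr_up = "2026-01-12T08:58:15"           # watchfrr: all daemons up
configd_timeout = "2026-01-12T09:00:14"  # configctl: TimeoutError

print(gap_seconds(frr_up, configd_timeout))  # 119
```

So the second CARP hook message (vhid 102) is logged 119 seconds after FRR came up, in the same second as the configctl timeout.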
What would be a minimum configuration to reproduce?
os-frr enabled, with or without CARP failover activated?
At least one WireGuard tunnel, also with "Depend on CARP" activated?
Then the symptom is that the WireGuard tunnel takes 2 minutes to fail over?
WireGuard and OpenVPN (legacy) have "Depend on CARP" activated, and so does os-frr.
More facts (pairs are master:slave):
Tested on various devices with CARP, same behavior.
Pair 1, FRR not activated: failover okay.
- IPsec site-to-site tunnel
- OpenVPN legacy site-to-site client
- WireGuard site-to-site tunnel
Pair 1, only change is activating FRR: failover times out.
- IPsec site-to-site tunnel: no timeout
- OpenVPN legacy site-to-site client: timeout
- WireGuard site-to-site tunnel: timeout
With OpenVPN legacy it is very clear: in the connection status, all information about the tunnels is missing.
After the timeout ( File "/usr/local/sbin/configctl", line 65, in exec_config_cmd line = sock.recv(65536).decode() ^^^^^^^^^^^^^^^^ TimeoutError: timed out) you can see the information and you can also ping the remote.
WireGuard status: after 2 minutes you can ping the remote.
Pair 2, FRR not activated: failover okay.
- IPsec site-to-site tunnel
- OpenVPN instance server, site-to-site, TAP L2 bridge (tunnel moved from legacy to instance for testing; see the comment below between the ******* markers)
- WireGuard site-to-site tunnel
Pair 2, only change is activating FRR: failover times out.
- IPsec site-to-site tunnel: no timeout
- OpenVPN instance server, site-to-site, TAP L2 bridge: timeout (tunnel moved from legacy to instance for testing)
- WireGuard site-to-site tunnel: timeout
Same error in the logs: ( File "/usr/local/sbin/configctl", line 65, in exec_config_cmd line = sock.recv(65536).decode() ^^^^^^^^^^^^^^^^ TimeoutError: timed out)
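As far as I can tell, the traceback itself only says that a socket read gave up waiting for a reply; it does not say why configd was busy. The generic mechanism can be sketched in Python as follows. This is not the configctl source: the function name, the command string, and the use of socketpair are my own stand-ins for illustration.

```python
import socket

def ask_with_timeout(timeout_s: float) -> str:
    """Send a request to a peer that never answers and report what recv() does."""
    # socketpair() stands in for the client/daemon connection (assumption).
    client, server = socket.socketpair()
    try:
        client.settimeout(timeout_s)        # give up after timeout_s seconds
        client.sendall(b"some command\n")   # hypothetical request
        try:
            # The peer is busy and never replies, so recv() blocks until the timeout.
            return client.recv(65536).decode()
        except socket.timeout:
            return "timed out"
    finally:
        client.close()
        server.close()

print(ask_with_timeout(0.2))  # timed out
```

In other words, the error would be a symptom of something blocking for about 2 minutes, not necessarily the cause.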
I have the problem on 16 pairs (master:slave). I have rolled all of them back to 24.7.12 except 2 pairs, which stay on 25.7.10 for further investigation.
*******
OpenVPN instance TAP L2 bridge (without FRR)
After switching the OpenVPN tunnel (server) from legacy to an instance with TAP L2, interface and bridge, the failover only works partially. After switching to the slave, no connection is established, even after a longer waiting time. It's not possible to connect to the deactivated master, but if you kill it on the master, you can see that the client reconnects to the slave. Even when the master is activated again, this doesn't always work immediately.
With legacy it runs without any trouble.
*********
Can you be more precise about this:
24.7.12 to 25.7.10: there are two major upgrades here (24.7 -> 25.1 -> 25.7).
If that is really the case, it's very hard to find the exact version where it stopped working.
To bisect this, you can do incremental updates by going to:
- "System - Firmware - Settings"
- enable "advanced mode"
- Flavour "(custom)"
25.7/MINT/25.7.x/latest
There, slowly increment the version:
25.1/MINT/25.1.1/latest
25.1/MINT/25.1.2/latest
...
You don't need every minor upgrade; just try to bisect where it starts happening. That would help a lot.
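The bisecting idea, sketched in Python under some assumptions: the version list below is only illustrative (not the real release list), and in practice is_bad would be a manual test (install that version, trigger a CARP failover, check whether the 2 minute delay shows up) instead of a function.

```python
from typing import Callable, Sequence

def first_bad(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first version for which is_bad() is True.

    Assumes versions are ordered oldest to newest, the first one is
    known good and the last one is known bad.
    """
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid   # problem already present, look at older versions
        else:
            lo = mid   # still fine, look at newer versions
    return versions[hi]

# Illustrative list only; check the mirror for the versions that really exist.
versions = ["24.7.12", "25.1.1", "25.1.2", "25.1.6", "25.7.1", "25.7.5", "25.7.10"]

# Simulated test result: pretend everything from index 4 on shows the delay.
print(first_bad(versions, lambda v: versions.index(v) >= 4))  # 25.7.1
```

With 7 candidate versions this needs at most 3 test installs instead of 6.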
Okay, I understand.
I need to find a time slot where I can downgrade to 24.7.12 and then upgrade step by step to the higher versions.
Unfortunately, some changes have already been made to the configuration, as changes were also made at the remote site.
I'll get back to you with more information.