CARP OS-FRR timeout after upgrade to rel 25.7.10

Started by rkam, January 12, 2026, 10:59:33 AM

Hello,

After updating several devices from 24.7.12 to 25.7.10, the following error occurs with the os-frr plugin:

After failover to the slave, it takes approximately 2 minutes until the connection to the endpoints via WireGuard and OpenVPN is restored. Oddly, the IPsec tunnels are not affected. With the os-frr plugin disabled, everything works perfectly. Simply enabling os-frr is enough to trigger the error; BGP doesn't even need to be enabled.

The same problem occurs when reverting to the master server.

According to the log:

After BACKUP -> MASTER, os-frr (zebra) starts, and then configctl reports a configd communication error after a timeout of approximately 2 minutes. Only after that are the remaining CARP interfaces activated by /usr/local/etc/rc.syshook.d/carp/20-openvpn.

What could be causing this error? I haven't found anything relevant in the log!

Hardware used: Deciso

Logs:

2026-01-12T09:00:14  Notice  opnsense   /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "CARP WAN FW PORT 102 (185.120.61.102) (102@ax1)" has resumed the state "MASTER" for vhid 102
2026-01-12T09:00:14  Error   configctl  error in configd communication Traceback (most recent call last): File "/usr/local/sbin/configctl", line 65, in exec_config_cmd line = sock.recv(65536).decode() ^^^^^^^^^^^^^^^^ TimeoutError: timed out
2026-01-12T08:58:15  Notice  watchfrr   [KWE5Q-QNGFC] all daemons up, doing startup-complete notify
2026-01-12T08:58:15  Notice  watchfrr   [QDG3Y-BY5TN] zebra state -> up : connect succeeded
2026-01-12T08:58:15  Notice  watchfrr   [QDG3Y-BY5TN] mgmtd state -> up : connect succeeded
2026-01-12T08:58:15  Notice  opnsense   /usr/local/sbin/pluginctl: plugins_configure crl (execute task : openvpn_refresh_crls(1))
2026-01-12T08:58:15  Notice  watchfrr   [T83RR-8SM5G] watchfrr 10.5.0 starting: vty@0
2026-01-12T08:58:14  Notice  opnsense   /usr/local/sbin/pluginctl: plugins_configure crl (execute task : core_trust_crl(1))
2026-01-12T08:58:14  Notice  opnsense   /usr/local/sbin/pluginctl: plugins_configure crl (1)
2026-01-12T08:58:14  Notice  kernel     <6>[144370] carp: 110@vlan02: BACKUP -> MASTER (preempting a slower master)

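To put a number on the stall, the gap between the kernel's BACKUP -> MASTER message and the configctl timeout can be computed from the two timestamps in the log. This is only a small sketch; the variable names are made up for illustration, and the timestamps are copied from the excerpt above.

```python
from datetime import datetime

# Timestamps copied from the log excerpt (variable names are illustrative only).
carp_transition = datetime.fromisoformat("2026-01-12T08:58:14")    # kernel: BACKUP -> MASTER
configctl_timeout = datetime.fromisoformat("2026-01-12T09:00:14")  # configctl: TimeoutError

# Difference between the two events, in seconds.
stall_seconds = (configctl_timeout - carp_transition).total_seconds()
print(f"stall between CARP transition and configctl timeout: {stall_seconds:.0f} s")
```

The result lines up with the "approximately 2 minutes" observed for the WireGuard and OpenVPN tunnels.
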
What would be a minimum configuration to reproduce?

os-frr enabled, with or without CARP failover activated?
At least one WireGuard tunnel, also with "Depend on CARP" activated?

Then the symptom is that the WireGuard tunnel takes 2 minutes to fail over?
Hardware:
DEC740

WireGuard and OpenVPN (legacy), both with "Depend on CARP" activated, plus os-frr.

More facts:

(pairs: Master:Slave)

Tested on various devices with CARP; same behavior everywhere.

Pair 1, FRR not activated: failover OK.

IPsec site-to-site tunnel
OpenVPN legacy site-to-site client
WireGuard site-to-site tunnel

Pair 1, only FRR activated: failover times out.

IPsec site-to-site tunnel: no timeout
OpenVPN legacy site-to-site client: timeout
WireGuard site-to-site tunnel: timeout

With OpenVPN legacy this is very visible: in the connection status, all information about the tunnels is missing.

After the timeout (File "/usr/local/sbin/configctl", line 65, in exec_config_cmd line = sock.recv(65536).decode() ^^^^^^^^^^^^^^^^ TimeoutError: timed out), the information is shown and you can also ping the remote.

WireGuard status: after 2 minutes you can ping the remote.

** Pair 2 **

Pair 2, FRR not activated: failover OK.

IPsec site-to-site tunnel
OpenVPN instance, server, site-to-site, TAP, L2 bridge (tunnel moved from legacy to instance for testing; see the comment marked ******* below)
WireGuard site-to-site tunnel


Pair 2, only FRR activated: failover times out.

IPsec site-to-site tunnel: no timeout
OpenVPN instance, server, site-to-site, TAP, L2 bridge: timeout (tunnel moved from legacy to instance for testing)
WireGuard site-to-site tunnel: timeout

Same error in the logs: (File "/usr/local/sbin/configctl", line 65, in exec_config_cmd line = sock.recv(65536).decode() ^^^^^^^^^^^^^^^^ TimeoutError: timed out)
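
For context, the traceback is what Python raises when a socket read with a timeout gets no answer from the other side. The sketch below reproduces only that mechanism with a local socket pair. It is not the configctl source: the function name, the request text and the 0.2 s timeout are all made up for illustration, and the real timeout value used by configctl is not shown in this thread.

```python
import socket

def recv_with_timeout(timeout_s: float) -> str:
    """Send a request to a peer that never answers and wait for a reply."""
    # socketpair() returns two connected sockets; "peer" stands in for a
    # service that is too busy to respond (assumption for this demo).
    client, peer = socket.socketpair()
    try:
        client.settimeout(timeout_s)
        client.sendall(b"hypothetical request\n")
        try:
            # Same call shape as in the traceback: blocks until data or timeout.
            return client.recv(65536).decode()
        except socket.timeout as exc:  # alias of TimeoutError on Python 3.10+
            return f"TimeoutError: {exc}"
    finally:
        client.close()
        peer.close()

result = recv_with_timeout(0.2)
print(result)
```

If that is what happens here, the interesting question is which configd action is blocking for the whole 2 minutes after os-frr starts, not the timeout itself.
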

I have this problem on 16 pairs (Master:Slave). I have rolled all of them back to 24.7.12, except for 2 pairs that remain on 25.7.10 for further investigation.


*******
OpenVPN instance, TAP L2 bridge (without FRR)

After switching the OpenVPN tunnel (server) from legacy to an instance (TAP L2 with interface and bridge), failover only works partially. After switching to the slave, no connection is established, even after a longer waiting time. It's not possible to connect to the deactivated master, but if you kill it on the master, you can see that the client reconnects to the slave. Even when the master is activated, this doesn't always work immediately.

In legacy mode it runs without any trouble.

*******

Can you be precise with this:

24.7.12 to 25.7.10, there are two major upgrades here (24.7 -> 25.1 -> 25.7).

If that is really true, it's very hard to find the exact version where it stopped working.

To bisect this, you can do incremental updates by going to:
- "System - Firmware - Settings"
- enable "advanced mode"
- Flavour "(custom)": 25.7/MINT/25.7.x/latest

Here, slowly increment the versions:

25.1/MINT/25.1.1/latest
25.1/MINT/25.1.2/latest
...

You don't need every minor upgrade, just try to bisect where it happens, that would help a lot.
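
A rough sketch of how such a bisection could be organized. The flavour pattern is inferred from the examples above; the candidate list, the helper names and the simulated test result are placeholders, not real findings. In practice the "is it bad?" answer comes from installing the version and doing a manual CARP failover test.

```python
from typing import Callable, Sequence

def flavour(version: str) -> str:
    """Build the custom flavour string, e.g. '25.1/MINT/25.1.2/latest'.
    The pattern is inferred from the examples in this thread."""
    series = ".".join(version.split(".")[:2])
    return f"{series}/MINT/{version}/latest"

def first_bad(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest bad version using binary search.
    Assumes versions[0] is good, versions[-1] is bad, and that once a
    version is bad, all later ones are bad too."""
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid
        else:
            lo = mid
    return versions[hi]

# Placeholder list: fill in the versions that are actually installable.
candidates = ["24.7.12", "25.1.1", "25.1.2", "25.1.3", "25.7.1", "25.7.10"]

# Simulated outcome for demonstration only, NOT a real test result.
def simulated_failover_is_slow(version: str) -> bool:
    print(f"test with flavour {flavour(version)}")
    return candidates.index(version) >= 3

culprit = first_bad(candidates, simulated_failover_is_slow)
print("first version showing the timeout (simulated):", culprit)
```

With a handful of candidate versions this needs only two or three failover tests instead of one per release.
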

Okay, I understand.

I need to find a time slot where I can downgrade to 24.7.12 and then upgrade step by step to the higher versions.
Unfortunately, some changes have already been made to the configuration, as changes were also made to the remote site.
I'll get back to you with more information.