Hello,
After updating several devices from 24.7.12 to 25.7.10, the following error occurs with the os-frr plugin:
After failover to the slave, it takes approximately 2 minutes until the connections to the endpoints via WireGuard and OpenVPN are restored. Oddly, the IPsec tunnels are not affected. Without the os-frr plugin activated, everything works perfectly. Simply activating os-frr is enough to trigger the error; BGP doesn't even need to be enabled.
The same problem occurs when reverting to the master server.
According to the log:
After BACKUP -> MASTER, os-frr (zebra) starts, and then there is a configd error with a timeout of approximately 2 minutes. Only after that are the remaining CARP interfaces activated in /usr/local/etc/rc.syshook.d/carp/20-openvpn.
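For reference, a minimal way to look at the pieces involved from the shell (the hook path is the one from the log; the grep filter is only an illustration):
ls /usr/local/etc/rc.syshook.d/carp/
opnsense-log configd | grep -i timeout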
What could be causing this error? I haven't found anything relevant in the log!
Hardware used: Deciso
Logs:
2026-01-12T09:00:14  Notice  opnsense  /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "CARP WAN FW PORT 102 (185.120.61.102) (102@ax1)" has resumed the state "MASTER" for vhid 102
2026-01-12T09:00:14  Error  configctl  error in configd communication Traceback (most recent call last): File "/usr/local/sbin/configctl", line 65, in exec_config_cmd line = sock.recv(65536).decode() ^^^^^^^^^^^^^^^^ TimeoutError: timed out
2026-01-12T08:58:15  Notice  watchfrr  [KWE5Q-QNGFC] all daemons up, doing startup-complete notify
2026-01-12T08:58:15  Notice  watchfrr  [QDG3Y-BY5TN] zebra state -> up : connect succeeded
2026-01-12T08:58:15  Notice  watchfrr  [QDG3Y-BY5TN] mgmtd state -> up : connect succeeded
2026-01-12T08:58:15  Notice  opnsense  /usr/local/sbin/pluginctl: plugins_configure crl (execute task : openvpn_refresh_crls(1))
2026-01-12T08:58:15  Notice  watchfrr  [T83RR-8SM5G] watchfrr 10.5.0 starting: vty@0
2026-01-12T08:58:14  Notice  opnsense  /usr/local/sbin/pluginctl: plugins_configure crl (execute task : core_trust_crl(1))
2026-01-12T08:58:14  Notice  opnsense  /usr/local/sbin/pluginctl: plugins_configure crl (1)
2026-01-12T08:58:14  Notice  kernel  <6>[144370] carp: 110@vlan02: BACKUP -> MASTER (preempting a slower master)
What would be a minimal configuration to reproduce this?
os-frr enabled? With or without CARP failover activated?
At least one WireGuard tunnel? Also with "Depend on CARP" activated?
Then the symptom is that the WireGuard tunnel takes 2 minutes to fail over?
WireGuard and OpenVPN Legacy both have "Depend on CARP" activated, and os-frr is enabled as well.
More facts (pairs: master:slave):
Tested on various devices with CARP, same behavior.

Pair 1, without frr activated: failover okay.
- IPsec site-to-site tunnel
- OpenVPN Legacy site-to-site client
- WireGuard site-to-site tunnel

Pair 1, with only frr activated: failover times out.
- IPsec site-to-site tunnel: no timeout
- OpenVPN Legacy site-to-site client: timeout
- WireGuard site-to-site tunnel: timeout

In OpenVPN Legacy it is very clear: in the connection status, all information about the tunnels is missing. Only after the timeout (File "/usr/local/sbin/configctl", line 65, in exec_config_cmd line = sock.recv(65536).decode() ^^^^^^^^^^^^^^^^ TimeoutError: timed out) does the information appear, and then you can also ping the remote side.
WireGuard status: after 2 minutes you can ping the remote side.
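One way to verify the recovery from the shell, assuming wireguard-tools is installed (the address below is only a placeholder for the remote tunnel peer): wg show prints the latest handshake per peer, which makes the roughly 2-minute gap visible.
wg show
ping -c 3 192.0.2.1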
** Pair 2 **

Pair 2, without frr activated: failover okay.
- IPsec site-to-site tunnel
- OpenVPN Instance server, site-to-site, TAP bridge L2 (tunnel moved from legacy to instance for testing / see comment below ******)
- WireGuard site-to-site tunnel

Pair 2, with only frr activated: failover times out.
- IPsec site-to-site tunnel: no timeout
- OpenVPN Instance server, site-to-site, TAP bridge L2: timeout (tunnel moved from legacy to instance for testing)
- WireGuard site-to-site tunnel: timeout

Same error in the logs: (File "/usr/local/sbin/configctl", line 65, in exec_config_cmd line = sock.recv(65536).decode() ^^^^^^^^^^^^^^^^ TimeoutError: timed out)
I have the problem with 16 pairs (master:slave); I have rolled all of them back to 24.7.12, except for 2 pairs that stay on 25.7.10 for further investigation.
*******
OpenVPN Instance TAP L2 bridge (without FRR)
After switching the OpenVPN tunnel (server) from legacy to an instance with TAP L2, interface and bridge, the failover only works partially. After switching to the slave, no connection is established, even after a longer waiting time. It's not possible to connect to the deactivated master either, but if you kill the session on the master, you can see that the client reconnects to the slave. Even when the master is activated again, this doesn't always work immediately.
With legacy this runs without any trouble.
*********
Can you be precise with this:
From 24.7.12 to 25.7.10 there are two major upgrades (24.7 -> 25.1 -> 25.7).
If that is really the case, it's very hard to find the exact version where it stopped working.
To bisect this, you can do incremental updates by going to:
- "System - Firmware - Settings"
- enable "advanced mode"
- Flavour "(custom)"
25.7/MINT/25.7.x/latest
Here slowly increment the versions.
25.1/MINT/25.1.1/latest
25.1/MINT/25.1.2/latest
...
You don't need every minor upgrade, just try to bisect where it happens, that would help a lot.
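Between steps you can confirm the version that is actually running on the shell, e.g.:
opnsense-version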
Okay, I understand.
I need to find a time slot where I can downgrade to 24.7.12 and then go step by step to the higher versions.
Unfortunately, some changes have already been made to the configuration in the meantime, as changes were also made to the remote site.
I'll get back to you with more information.
Short info.
Config:
- IPsec legacy site-to-site tunnel
- OpenVPN legacy site-to-site tunnel, TAP L2 bridge
- WireGuard site-to-site tunnel
- os-frr activated, BGP not activated (see the vtysh check below)
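A quick way to double-check that FRR is running while BGP stays unconfigured is to query the FRR shell directly (vtysh comes with the FRR package; these are only illustrative checks):
vtysh -c "show daemons"
vtysh -c "show running-config"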
Migrated from 24.7.12 to 25.1.1: failover behavior okay, no error message.
Migrated from 25.1.1 to 25.1.4: failover behavior okay, no error message.
Migrated from 25.1.4 to 25.1.12: failover behavior okay, no error message.
Next step: go to 25.7.1.
One more question: how many intermediate steps should I take starting from 25.7.x?
Just go all the way to the last available minor update. If you don't have an issue, continue; if you have an issue, roll back and go half the distance. That's how I bisect when there are issues.
Okay, an update:
After updating from 25.1.12 to 25.7.1 (I updated only the slave for an easier rollback), the previously described timeout error occurs, as can be clearly seen here.
2026-01-14T08:58:08  Notice  opnsense  /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "CARP Vlan_206 (10.10.21.4) (110@vlan02)" has resumed the state "MASTER" for vhid 110
2026-01-14T08:58:08  Error  configctl  error in configd communication Traceback (most recent call last): File "/usr/local/sbin/configctl", line 65, in exec_config_cmd line = sock.recv(65536).decode() ^^^^^^^^^^^^^^^^ TimeoutError: timed out
2026-01-14T08:56:09  Notice  watchfrr  [KWE5Q-QNGFC] all daemons up, doing startup-complete notify
2026-01-14T08:56:09  Notice  watchfrr  [QDG3Y-BY5TN] zebra state -> up : connect succeeded
2026-01-14T08:56:09  Notice  watchfrr  [QDG3Y-BY5TN] mgmtd state -> up : connect succeeded
2026-01-14T08:56:08  Notice  opnsense  /usr/local/sbin/pluginctl: plugins_configure crl (execute task : openvpn_refresh_crls(1))
2026-01-14T08:56:08  Notice  watchfrr  [T83RR-8SM5G] watchfrr 10.4 starting: vty@0
2026-01-14T08:56:08  Notice  opnsense  /usr/local/sbin/pluginctl: plugins_configure crl (execute task : core_trust_crl(1))
2026-01-14T08:56:08  Notice  opnsense  /usr/local/sbin/pluginctl: plugins_configure crl (1)
2026-01-14T08:56:08  Notice  kernel  <6>[433] carp: 100@ax1: BACKUP -> MASTER (preempting a slower master)
That would be the exact point where the switch from frr 8 to frr 10 was made:
https://forum.opnsense.org/index.php?topic=48072.0
https://github.com/opnsense/plugins/blob/3af383e1e05b3a6831f7ed1f3d75ed0b17a77756/net/frr/pkg-descr#L45-L51
Unsure what could be the cause though; this has been running in production CARP setups for a while now, and I know of no other currently open issues.
It could be a rare issue specific to your configuration (i.e. having multiple VPN implementations activated and depending on CARP at the same time, combined with the dynamic routing plugin, even if BGP is not activated).
I need the BGP; I only mentioned it because it had no effect with or without BGP. I was trying to narrow down the error that way.
How do we proceed from here, and will there be a solution?
I need the exact configd call that timed out.
Can you search for that in the ssh shell via:
opnsense-log configd
after triggering that issue?
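The output can get long, so narrowing it down to the failing request helps, for example (the grep pattern is only an illustration):
opnsense-log configd | grep -iE 'timeout|error'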
It looks like on your affected device this configd call stalls:
request ifconfig
It's this action: https://github.com/opnsense/core/blob/55f34d8feb7a1b2b9af1e24ed46e6029fdaf3455/src/opnsense/service/conf/actions.d/actions_interface.conf#L95
Can you execute this manually?
configctl interface list ifconfig
If this hangs, also try a normal ifconfig:
ifconfig
Try with the frr plugin enabled and disabled, and see if it makes a difference.
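A minimal way to make a stall visible is to time the calls with the shell's time built-in (only a suggestion to quantify the delay):
time configctl interface list ifconfig
time ifconfig -a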
configctl interface list ifconfig worked.
No change in behavior.
We will continue looking into this when 26.1 is out, because if it's somehow fixed there, we don't need to chase it right now.
Thanks for the info, then we'll wait for version 26.x
Hello, I think we found something.
https://github.com/opnsense/plugins/pull/5160
Can you try the following patch on the affected firewalls? It will only apply to the latest FRR version though (which means you have to be on >25.7.10 when you test).
# opnsense-patch https://github.com/opnsense/plugins/commit/d27619990739424db4e0aaa266c2392eeb7abe57
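As far as I know, opnsense-patch applies patches reversibly, so running the exact same command a second time should revert the change if you need to roll it back.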
This patch will be in 26.1:
https://github.com/opnsense/plugins/commit/d2024adcdcef47df3915305ee1013d6a2f81d0ca
I have now tested the patch on version 25.7.10 with different Deciso models, and the error no longer occurs.
I will then test it on 25.7.11_2.
Thanks again for your support.
Wasn't it this one? https://github.com/opnsense/plugins/commit/2cc2215bb
If so we're hotfixing this for the last update of 25.7.11_x shortly after 26.1 is out this week.
Cheers,
Franco