23.1.5_2 FRR with "Enable CARP Failover" error buffer_flush_available...

Started by nzkiwi68, March 30, 2023, 08:54:59 AM

Immediately after upgrading to 23.1.5_2, FRR began restarting repeatedly and routing became unstable.

My diagnosis is that something breaks when "Enable CARP Failover" is selected in the config.

Routing > Diagnostics > Log

2023-03-30T18:09:42 Error bgpd [EC 100663299] buffer_flush_available: write error on fd 2: Bad file descriptor
2023-03-30T18:09:42 Error bgpd [EC 100663299] buffer_flush_available: write error on fd 2: Bad file descriptor
2023-03-30T18:09:31 Warning zebra [EC 4043309122] Client 'bfd' encountered an error and is shutting down.
2023-03-30T18:09:31 Warning zebra [EC 4043309122] Client 'bgp' encountered an error and is shutting down.
2023-03-30T18:09:31 Warning zebra [EC 4043309122] Client 'vnc' encountered an error and is shutting down.
2023-03-30T18:09:22 Error bgpd [EC 100663299] buffer_flush_available: write error on fd 2: Bad file descriptor
2023-03-30T18:09:22 Error bgpd [EC 100663299] buffer_flush_available: write error on fd 2: Bad file descriptor
2023-03-30T18:09:21 Warning zebra [EC 4043309122] Client 'bfd' encountered an error and is shutting down.
2023-03-30T18:09:21 Warning zebra [EC 4043309122] Client 'bgp' encountered an error and is shutting down.
2023-03-30T18:09:21 Warning zebra [EC 4043309122] Client 'vnc' encountered an error and is shutting down.
2023-03-30T18:07:58 Error bgpd [EC 100663299] buffer_flush_available: write error on fd 2: Bad file descriptor
2023-03-30T18:07:58 Error bgpd [EC 100663299] buffer_flush_available: write error on fd 2: Bad file descriptor
2023-03-30T18:07:57 Warning zebra [EC 4043309122] Client 'bfd' encountered an error and is shutting down.
2023-03-30T18:07:57 Warning zebra [EC 4043309122] Client 'bgp' encountered an error and is shutting down.
2023-03-30T18:07:57 Warning zebra [EC 4043309122] Client 'vnc' encountered an error and is shutting down.
2023-03-30T18:07:37 Error bgpd [EC 100663299] buffer_flush_available: write error on fd 2: Bad file descriptor
2023-03-30T18:07:37 Error bgpd [EC 100663299] buffer_flush_available: write error on fd 2: Bad file descriptor
2023-03-30T18:07:33 Warning zebra [EC 4043309122] Client 'bfd' encountered an error and is shutting down.
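
In case it helps anyone comparing their own logs: a minimal Python sketch that counts entries per daemon and error code and lists the timestamps at which zebra reported a client shutting down. The line layout is assumed from the excerpt above (timestamp, severity, daemon, `[EC n]`, message). The names `summarize` and `SAMPLE` are made up for illustration and are not part of FRR or OPNsense.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt above:
# <timestamp> <severity> <daemon> [EC <code>] <message>
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<severity>\w+)\s+(?P<daemon>\w+)\s+"
    r"\[EC (?P<code>\d+)\]\s+(?P<msg>.*)$"
)


def summarize(log_text):
    """Count entries per (daemon, error code) and collect the distinct
    timestamps at which zebra logged a client shutting down."""
    counts = Counter()
    shutdown_times = set()
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip blank lines or lines in another format
        counts[(m["daemon"], m["code"])] += 1
        if m["daemon"] == "zebra" and "shutting down" in m["msg"]:
            shutdown_times.add(m["ts"])
    return counts, sorted(shutdown_times)


# A few lines copied from the log above.
SAMPLE = """\
2023-03-30T18:09:42 Error bgpd [EC 100663299] buffer_flush_available: write error on fd 2: Bad file descriptor
2023-03-30T18:09:31 Warning zebra [EC 4043309122] Client 'bfd' encountered an error and is shutting down.
2023-03-30T18:09:31 Warning zebra [EC 4043309122] Client 'bgp' encountered an error and is shutting down.
2023-03-30T18:09:21 Warning zebra [EC 4043309122] Client 'bfd' encountered an error and is shutting down.
"""

if __name__ == "__main__":
    counts, times = summarize(SAMPLE)
    for (daemon, code), n in counts.items():
        print(daemon, code, n)
    print("zebra client shutdowns at:", times)
```

The distinct shutdown timestamps give a rough picture of how often the daemons are cycling, which is easier to read than the raw log.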




Yes.

I have turned off CARP failover and FRR is stable, no starting and stopping.


23.1.x something... how can I tell you precisely?

Sorry, I'm not sure how to tell exactly what version this firewall was running before the upgrade.

And did you try reverting os-frr and frr7 to a version from 23.1? Maybe that would help.

Do you encounter any failovers (not related to frr)?

No other failures observed.

I will try tonight after work hours to revert FRR to an earlier version and then report back.

Thanks.

Sorry for my very slow reply.

I rolled back through the whole 23.1 series in order, rebooting and testing each version, and the bad FRR behavior continued. I can only conclude that I upgraded from a version below 23.1, which I didn't note at the time.

The good news is WireGuard with the latest updates is stable with FRR on and FRR not following CARP.
I have gone on to upgrade a number of firewalls and firewall clusters using WireGuard with FRR using BGP over multi WAN and deselected "FRR Enable CARP Failover" and everything is working as expected.

WireGuard continues to need a custom CARP script to stop WireGuard on the backup firewall, but, with the custom CARP script running, WireGuard with FRR and multi WAN using BGP is working great.



Yes.
BGP only.

Building configuration...

Current configuration:
!
frr version 7.5.1
frr defaults traditional
hostname byyfw1.localdomain
log syslog informational
!
router bgp 65525
no bgp ebgp-requires-policy
no bgp default ipv4-unicast
bgp graceful-restart
neighbor 172.27.3.4 remote-as 65524
neighbor 172.27.3.4 bfd
neighbor 172.27.3.4 update-source wg4
neighbor 172.27.3.104 remote-as 65524
neighbor 172.27.3.104 bfd
neighbor 172.27.3.104 update-source wg5
neighbor 172.27.5.1 remote-as 65521
neighbor 172.27.5.1 bfd
neighbor 172.27.5.1 update-source wg1
neighbor 172.27.5.101 remote-as 65521
neighbor 172.27.5.101 bfd
neighbor 172.27.5.101 update-source wg2
!
address-family ipv4 unicast
  redistribute kernel
  redistribute connected
  redistribute static
  neighbor 172.27.3.4 activate
  neighbor 172.27.3.4 next-hop-self
  neighbor 172.27.3.4 prefix-list byy-xxy-prefix-out out
  neighbor 172.27.3.4 route-map prefer-wan1 in
  neighbor 172.27.3.104 activate
  neighbor 172.27.3.104 next-hop-self
  neighbor 172.27.3.104 prefix-list byy-xxy-prefix-out out
  neighbor 172.27.5.1 activate
  neighbor 172.27.5.1 next-hop-self
  neighbor 172.27.5.1 prefix-list byy-onn-prefix-out out
  neighbor 172.27.5.1 route-map prefer-wan1 in
  neighbor 172.27.5.101 activate
  neighbor 172.27.5.101 next-hop-self
  neighbor 172.27.5.101 prefix-list byy-onn-prefix-out out
exit-address-family
!
address-family ipv6 unicast
  redistribute kernel
  redistribute connected
  redistribute static
exit-address-family
!
ip prefix-list byy-onn-prefix-out seq 12 permit 10.5.55.0/24
ip prefix-list byy-onn-prefix-out seq 13 permit 10.5.80.0/24
ip prefix-list byy-onn-prefix-out seq 14 permit 10.5.50.0/24
ip prefix-list byy-onn-prefix-out seq 11 permit 10.5.45.0/24
ip prefix-list byy-onn-prefix-out seq 10 permit 192.168.5.0/24
ip prefix-list byy-xxy-prefix-out seq 20 permit 192.168.5.0/24
ip prefix-list byy-xxy-prefix-out seq 21 permit 10.5.80.0/24
!
route-map prefer-wan1 permit 10
set local-preference 300
!
line vty
!
bfd
peer 172.27.5.1
!
peer 172.27.3.4
!
peer 172.27.5.101
!
peer 172.27.3.104
!
!
end



I expect so.

But we are not running IPsec for the site-to-site VPN, we are running WireGuard, and the problem is FRR errors that were not there before, not a WireGuard fault.

BTW - why WireGuard?
Because WireGuard comes up so quickly. If I use IPsec for site-to-site VPN with multi WAN and clustered firewalls and power off site A fw1, site A fw2 takes over, but IPsec takes ages: it can be as long as 2 minutes before IPsec actually sets up on fw2 and starts passing traffic. It's also really bad if site A fw2 is the master and you power on site A fw1: once fw1 comes up, becomes the CARP master and takes over, the VPN is down for far too long with IPsec.

That's my experience across pfSense, OPNsense and multiple customers. Fail-over for IPsec is too slow.

WireGuard, on the other hand, is so fast: within 3-4 pings the tunnel is up and running on fw2 and routing is working.

Telnet sessions are not broken during a clustered firewall fail-over with WireGuard, whereas 100% of Telnet and RDS sessions break during an IPsec clustered firewall fail-over. Hence I am a big WireGuard fan.

WireGuard now needs a decent CARP fail-over script and, ideally, the ability to follow a single interface for CARP fail-over. In 99% of deployments I would have WireGuard follow CARP on the LAN interface only: because VPNs (normally) link site A LAN to site B LAN, I'm only interested in starting WireGuard on the firewall that has the LAN interface as CARP master.
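
To make the "follow the LAN interface only" idea concrete, here is a rough Python sketch of the decision logic such a hook could use. It is not the script I am running. Several things in it are assumptions you should verify on your own firewall: that the hook is called with two arguments, a subsystem of the form `<vhid>@<interface>` and a state such as `MASTER` or `BACKUP`; that `configctl wireguard start` and `configctl wireguard stop` control the service; and the interface name `igb1`, which is purely hypothetical.

```python
import subprocess
import sys

WATCHED_INTERFACE = "igb1"  # hypothetical LAN interface name, change to yours

# Assumed service commands; check that they exist on your system first.
COMMANDS = {
    "MASTER": ["configctl", "wireguard", "start"],
    "BACKUP": ["configctl", "wireguard", "stop"],
}


def plan(subsystem, state, watched=WATCHED_INTERFACE):
    """Return the command to run for a CARP event, or None to ignore it.

    subsystem is assumed to look like "<vhid>@<interface>", e.g. "1@igb1".
    """
    _, _, interface = subsystem.partition("@")
    if interface != watched:
        return None  # event on an interface we do not follow
    return COMMANDS.get(state)  # None for INIT or any unknown state


if __name__ == "__main__":
    # Only act when called with exactly the two expected arguments.
    if len(sys.argv) == 3:
        command = plan(sys.argv[1], sys.argv[2])
        if command is not None:
            subprocess.run(command, check=False)
```

With this logic, CARP events on the WAN interfaces are ignored, so WireGuard only starts on the firewall that holds the LAN interface as CARP master and stops when that firewall drops to backup.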