<Solved>HA failover user connection interrupted

Started by i.schmidt, September 04, 2023, 01:28:43 PM

September 04, 2023, 01:28:43 PM Last Edit: September 12, 2023, 11:36:25 AM by i.schmidt
Hi all

We use two OPNsense firewalls in an HA setup, with CARP, state table sync and config sync (manual).
OPNsense version 23.4.2

When I want to update, I activate "Persistent CARP Maintenance Mode" to switch to the secondary device. This works quite flawlessly except for two things:

1. User connections between devices seem to get interrupted
We use devices for recording working time. They are connected to a server via a TCP/IP connection. The server actively monitors this connection to prevent manipulation, and every slight disruption causes it to regard the device as offline.
When I switch to the secondary firewall, ALL of these devices can still be pinged and stay connected to the network, but they lose their server connection. So I guess the secondary firewall might not know the state of these connections.

How can I test and analyse this?

2. How to: Hot failover of WAN connection
How can I implement an automatic handover of PPPoE connections to the secondary firewall? Assigning a CARP IP to these connections does not seem to work. I could not get it working, and frankly I found the information about WAN failover a bit confusing and unclear. Maybe someone can help me out?

September 04, 2023, 02:46:56 PM #1 Last Edit: September 04, 2023, 02:53:14 PM by Monviech
Did you make sure that the protocol pfsync isn't blocked by the firewall default deny?
On the interface that sends and receives the pfsync packets, you have to create a firewall rule that allows protocol pfsync.

https://docs.opnsense.org/manual/how-tos/carp.html#terminology

Also, it's best to leave "Synchronize Peer IP" in System: High Availability: Settings: General settings empty on both firewalls. As the help text says, "The default is directed multicast", which goes to 224.0.0.240 and works best in my opinion.

You can troubleshoot it by opening an SSH shell on both firewalls and running tcpdump on your pfsync interface. You can see the states getting exchanged in clear text. You can also look at Firewall / Diagnostics / States.
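
One rough way to check whether a particular connection's state reaches the secondary is to dump the state table on both nodes and filter for the client's address. The sketch below assumes `pfctl -ss` prints one state per line; the helper name `states_for_host` and the address 192.0.2.10 are placeholders, not taken from this setup.

```shell
#!/bin/sh
# Filter a state table dump (one state per line on stdin) for one host.
# The helper name is hypothetical.
states_for_host() {
    # -F: fixed string; -w: whole-word match, so 192.0.2.1 does not match 192.0.2.10
    grep -Fw -- "$1"
}

# Usage on each firewall (needs root); 192.0.2.10 is a placeholder address:
#   pfctl -ss | states_for_host 192.0.2.10
#
# Watching pfsync traffic on the sync interface (pfsync is IP protocol 249);
# replace <sync_if> with your interface name:
#   tcpdump -n -i <sync_if> ip proto 249
```

If the state shows up on the primary but never on the secondary, the sync path is the place to look.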
Hardware:
DEC740

Thanks!
pfsync runs on a dedicated interface, which has an "allow everything" rule.

I will check the 2 other points tomorrow.

Soooooo, thanks very much for the suggestions. Yesterday I got sidetracked @work, but now I found something that might be suspicious.

Config summary:

  • We use a hardware interface called vtnet0 for pfsync. On this interface, there is an "allow everything everywhere" rule.
  • vtnet0 is connected via direct cable connection to the equivalent interface on the secondary device, no switch involved.
  • IP on that interface is 10.0.0.1 and on secondary it is 10.0.0.2.
  • On the primary, synchronize peer is 10.0.0.2 and on secondary, sync peer is 10.0.0.1.

I did a packet capture on pfsync0 via tcpdump -i pfsync0
I immediately noticed that a whole lot more packets are captured on the primary device than on the secondary.
This doesn't make sense, because packets outgoing on the primary should also be captured incoming on the secondary and vice versa. The packet count should therefore be equal on both devices, right?

So I captured to a pcap file to analyse it better, but WTH?
"The file "pfsync.pcap" contains record data that Wireshark doesn't support. (pcap: network type 246 unknown or unsupported)"
So I'm a bit stuck on detailed analysis.
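
For context, network type 246 is the pcap link type for pfsync (LINKTYPE_PFSYNC), which is what a capture on the pfsync0 pseudo-interface produces. A small sketch for checking which link type a capture file carries, assuming a classic little-endian pcap file (not pcapng); the function name `pcap_linktype` is made up:

```shell
#!/bin/sh
# Print the link-layer type from a classic pcap file header.
# Assumes a little-endian file, as written by tcpdump on amd64.
pcap_linktype() {
    # The "network" field is the 4 bytes at offset 20 of the global header.
    set -- $(od -An -tu1 -j20 -N4 "$1")
    echo $(( $1 + $2 * 256 + $3 * 65536 + $4 * 16777216 ))
}

# Usage:
#   pcap_linktype pfsync.pcap
#
# Possible workarounds (untested against this setup):
#   tcpdump -n -r pfsync.pcap                            # read the file on the firewall itself
#   tcpdump -n -i vtnet0 -w sync-eth.pcap ip proto 249   # capture on the physical sync NIC instead
```

A capture taken on vtnet0 is written as ordinary Ethernet frames, so Wireshark should at least open it; how much of the pfsync payload it decodes is a separate question.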

vtnet looks suspiciously like a virtualised setup. Maybe your vSwitch configuration is to blame?
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

Yep, OPNsense runs in a VM on that machine. Proxmox is the host OS. This makes it a heck of a lot easier to back up, restore, update and handle.

It's the only VM on that host, because this machine is dedicated for firewall for obvious reasons.

Sadly, I can't pass this interface through directly into the VM, because it is one of the two onboard network ports. I would lose access to the host if I did this.
The onboard controller is a Broadcom NetXtreme BCM5720 2-port Gigabit Ethernet.

The other network ports for WAN and VLAN are dedicated hardware and passed through.
For completeness: there are no firewall rules applied to the VM on the host system (Proxmox supports firewalling, but the service is deactivated).

Is it likely that the virtualization layer is the culprit though?

I'm currently thinking about how I can test this... live... without making a mess during office hours ::)
LOL I could plug in a USB network adapter, pass that through and see what happens.  :o ;D

I think I found the issue.

When we first set up the HA pair, OPNsense was on version 21.x something, or even version 20.
We updated as versions rolled along, and at some point I fiddled too much with the secondary device and had to set it up from scratch.
Somewhere between major updates, the naming scheme for devices seems to have changed. So our primary device has interfaces with the old naming scheme, while the secondary device has these network devices set up with the new names.
pfsync is clearly unable to assign the synced connections to the appropriate interfaces.
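
A quick way to spot such a mismatch is to compare the interface names both nodes report. This is only a sketch: the function name `iface_diff` is invented, and it takes two whitespace-separated name lists such as the output of `ifconfig -l` on each firewall.

```shell
#!/bin/sh
# Compare two whitespace-separated interface name lists and print
# the names that exist on only one side. Function name is hypothetical.
iface_diff() {
    tmp_a=$(mktemp)
    tmp_b=$(mktemp)
    # One name per line, sorted, as comm expects sorted input.
    printf '%s\n' $1 | sort -u > "$tmp_a"
    printf '%s\n' $2 | sort -u > "$tmp_b"
    comm -23 "$tmp_a" "$tmp_b" | sed 's/^/only on primary: /'
    comm -13 "$tmp_a" "$tmp_b" | sed 's/^/only on secondary: /'
    rm -f "$tmp_a" "$tmp_b"
}

# Usage from the primary, assuming root SSH access to the secondary at 10.0.0.2:
#   iface_diff "$(ifconfig -l)" "$(ssh root@10.0.0.2 ifconfig -l)"
```

Any name listed as present on only one node is a candidate for states that cannot be matched on the other side.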

So, how can I change the names of the interfaces? Do I have to delete every interface, create a new one and reassign it? Will I be able to keep all the firewall rules?

That's really nasty  :'(

Make a copy of /conf/config.xml, edit it, and import the edited file.
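
The editing step could be scripted roughly like this. It is a sketch under the assumption that device names appear as `<if>…</if>` elements in the config (check your own file first, since a name may also show up in other elements); the function name and the example device names are hypothetical.

```shell
#!/bin/sh
# Write a copy of a config file with one device name replaced inside
# <if>...</if> elements. The original file is left untouched.
# Note: the old name is used as a regex, so a "." in it matches any character.
rename_if_device() {
    old=$1 new=$2 src=$3 dst=$4
    sed "s|<if>${old}</if>|<if>${new}</if>|g" "$src" > "$dst"
}

# Usage (old_name/new_name are placeholders for the real device names):
#   cp /conf/config.xml /root/config.xml.orig
#   rename_if_device old_name new_name /root/config.xml.orig /root/config.xml.edited
#   diff /root/config.xml.orig /root/config.xml.edited   # review before importing
```

Reviewing the diff before importing makes it easy to see exactly which lines changed.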

I have successfully reconfigured all the interfaces via the config file.
It looks like the states are now synced properly.

Thanks!