[SOLVED] 24.7 PFSync State synchronization not working

Started by Styx13, August 11, 2024, 09:31:29 PM

Hello,

I have OPNsense configured with HA. CARP works fine, with no issues.
However, pfsync does not seem to work properly: when I switch to the backup (or back to the primary), all my established connections die. (I tested with an SSH connection to a host behind both firewalls; it hangs and is then reset when the CARP switch PRIMARY => BACKUP happens.)

This is not my first HA setup. I have been running OPNsense with HA for 4 years and it has been working very well (both CARP and pfsync, with seamless transition to the backup without losing any connection).

So maybe I did something wrong in this new setup, and I may need another pair of eyes to look at it and figure out what is wrong.

For 24.7, I did a fresh install.

Both the primary and backup are VMs, just like my previous setup with 24.1 (that previous setup worked fine for years, starting with 20.7 and upgraded all the way to 24.1).

(Important note: my previous 24.1 setup is no longer running. I shut down and have now deleted those VMs, so only the new setup exists.)

One difference with my new system is that the primary VM uses PCI passthrough for the 10Gb port (LAN - ix0) and the 1Gb port (WAN - igb0).

The backup VM uses VirtIO adapters for both (vtnet0 & vtnet1).

So on both sides, I created failover LAGG interfaces (with a single port in each) and configured lagg0 for LAN and lagg1 for WAN so that the interface names match on both sides, which is important for state syncing as indicated in the docs.

Then, on top of the LAN LAGG interface (lagg0), I created a bunch of VLANs, as this port is a trunk port with several tagged VLANs.
That part of the setup (VLANs) is identical to my previous one (with 24.1), where all my networks are connected to the firewall via a single port and tagged VLANs.

So I end up with multiple lagg0_vlanXX VLAN interfaces, all of which are assigned, and I made sure that the optXX assignments match on both sides (primary and backup). For example, on both sides, lagg0_vlan10 is opt1, lagg0_vlan20 is opt2, etc.
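The matching requirement above can be checked mechanically rather than by eye. This is only a minimal Python sketch: the two dictionaries are typed in by hand from each node's interface assignments (the three entries shown are the ones mentioned in this thread), and the function name `assignment_mismatches` is hypothetical, not part of OPNsense.

```python
def assignment_mismatches(primary, backup):
    """Return {logical_name: (primary_device, backup_device)} for every
    logical interface whose underlying device differs between the nodes."""
    names = sorted(set(primary) | set(backup))
    return {
        name: (primary.get(name), backup.get(name))
        for name in names
        if primary.get(name) != backup.get(name)
    }


# Assignments as described in the post; copied by hand, not read from a node.
primary = {"opt1": "lagg0_vlan10", "opt2": "lagg0_vlan20", "opt7": "lagg0_vlan99"}
backup = {"opt1": "lagg0_vlan10", "opt2": "lagg0_vlan20", "opt7": "lagg0_vlan99"}

print(assignment_mismatches(primary, backup))  # {}
```

An empty result means every optXX points at the same device name on both nodes; any entry in the result names an assignment to fix before looking further.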

I have a dedicated VLAN for PFSYNC (VLAN99 - assigned to opt7 on both sides), which is also used by Kea DHCP for peer traffic.
On the primary, that interface is configured with IP 10.99.0.251/24.
On the backup, that interface is configured with IP 10.99.0.252/24.

The firewall rules for the PFSYNC interface are:

     Protocol     Source                     Port  Destination    Port         Gateway  Schedule   Description 
pass IPv4 PFSYNC  VLAN99_PFSYNC net          *     This Firewall  *            *        *          Allow pfSync traffic 
pass IPv4 TCP     VLAN99_PFSYNC net          *     This Firewall  443 (HTTPS)  *        *          Allow HTTPS traffic for config synchronization 
pass IPv4 TCP     VLAN99_PFSYNC net          *     This Firewall  8001         *        *          Allow Kea DHCP HA Peer traffic


System: High Availability: Settings - On the primary node:

Synchronize States: checked
Synchronize Interface: VLAN99_PFSYNC
Sync Compatibility: OPNsense 24.7 or above
Synchronize Peer IP: 10.99.0.252
Synchronize Config: 10.99.0.252
Remote System Username: <the username of my backup node>
Remote System Password: <the password of my backup node>
Services to synchronize (XMLRPC Sync): Aliases, Certificates, Dashboard, Firewall Categories, Firewall Groups, Firewall Log Templates, Firewall Rules, Firewall Schedules, IPsec, Kea DHCP, NAT, Network Time, Unbound DNS, Virtual IPs


System: High Availability: Settings - On the secondary node:

Synchronize States: checked
Synchronize Interface: VLAN99_PFSYNC
Sync Compatibility: OPNsense 24.7 or above
Synchronize Peer IP: 10.99.0.251
(fields that are not indicated are either empty or default value)

System: High Availability: Status -  On the primary node:
<showing the backup firewall version and services, all green, and synchronization of configuration works fine>

System: High Availability: Status -  On the backup node:
The backup firewall is not accessible or not configured.



When I look at the Firewall: Diagnostics: States on both nodes, I can see a "similar" number of states: ~1700 on primary and ~1500 on backup.

But if I switch from primary to backup (by enabling Persistent CARP Maintenance Mode on the primary), any established connections (like SSH) hang and die. Also, when I then compare the states in Firewall: Diagnostics: States on both nodes, the primary shows ~500 states and the backup shows ~2200 states.

So something must be wrong somewhere, but I cannot figure out what. Is there a log or some other place where I can see more details about pfsync activity and make sure it is working as expected?
Let me know if you need more information.

Thank you

I did a short packet capture on my pfsync interface on both the primary and the standby, and I can see the pfsync traffic going from primary to standby.

PRIMARY

Interface                     Timestamp                     SRC                  DST                 output
VLAN99_PFSYNC lagg0_vlan99    2024-08-12 19:31:02.532333    RE:DA:CT:ED:##:32    RE:DA:CT:ED:##:7b    IPv4, length 1488: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1454 insert count 2 update compressed count 7 delete compressed count 23 eof count 1
VLAN99_PFSYNC lagg0_vlan99    2024-08-12 19:31:02.622583    RE:DA:CT:ED:##:32    RE:DA:CT:ED:##:7b    IPv4, length 1416: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1382 insert count 2 update compressed count 6 delete compressed count 24 eof count 1
VLAN99_PFSYNC lagg0_vlan99    2024-08-12 19:31:02.843370    RE:DA:CT:ED:##:32    RE:DA:CT:ED:##:7b    IPv4, length 230: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 196 update compressed count 2 eof count 1
VLAN99_PFSYNC lagg0_vlan99    2024-08-12 19:31:02.974409    RE:DA:CT:ED:##:32    RE:DA:CT:ED:##:7b    IPv4, length 1429: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1395 insert count 3 update compressed count 3 delete compressed count 23 eof count 1
VLAN99_PFSYNC lagg0_vlan99    2024-08-12 19:31:03.020705    RE:DA:CT:ED:##:32    RE:DA:CT:ED:##:7b    IPv4, length 146: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 112 update compressed count 1 eof count 1
VLAN99_PFSYNC lagg0_vlan99    2024-08-12 19:31:03.099974    RE:DA:CT:ED:##:32    RE:DA:CT:ED:##:7b    IPv4, length 1369: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1335 insert count 3 update compressed count 2 delete compressed count 25 eof count 1
VLAN99_PFSYNC lagg0_vlan99    2024-08-12 19:31:03.416010    RE:DA:CT:ED:##:32    RE:DA:CT:ED:##:7b    IPv4, length 1208: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1174 insert count 2 update compressed count 7 eof count 1
VLAN99_PFSYNC lagg0_vlan99    2024-08-12 19:31:03.486072    RE:DA:CT:ED:##:32    RE:DA:CT:ED:##:7b    IPv4, length 1124: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1090 insert count 2 update compressed count 6 eof count 1
VLAN99_PFSYNC lagg0_vlan99    2024-08-12 19:31:03.529206    RE:DA:CT:ED:##:32    RE:DA:CT:ED:##:7b    IPv4, length 1040: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1006 insert count 2 update compressed count 5 eof count 1
VLAN99_PFSYNC lagg0_vlan99    2024-08-12 19:31:03.638759    RE:DA:CT:ED:##:32    RE:DA:CT:ED:##:7b    IPv4, length 1208: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1174 insert count 2 update compressed count 7 eof count 1
VLAN99_PFSYNC lagg0_vlan99    2024-08-12 19:31:03.807782    RE:DA:CT:ED:##:32    RE:DA:CT:ED:##:7b    IPv4, length 1233: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1199 insert count 3 update compressed count 4 eof count 1
VLAN99_PFSYNC lagg0_vlan99    2024-08-12 19:31:03.870700    RE:DA:CT:ED:##:32    RE:DA:CT:ED:##:7b    IPv4, length 1317: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1283 insert count 3 update compressed count 5 eof count 1
VLAN99_PFSYNC lagg0_vlan99    2024-08-12 19:31:04.109522    RE:DA:CT:ED:##:32    RE:DA:CT:ED:##:7b    IPv4, length 1292: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1258 insert count 2 update compressed count 8 eof count 1


STANDBY

Interface                     Timestamp                     SRC                  DST                 output
VLAN99_PFSYNC lagg0_vlan99    2024-08-12 19:31:02.533114    RE:DA:CT:ED:##:32    RE:DA:CT:ED:##:7b    IPv4, length 1488: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1454 insert count 2 update compressed count 7 delete compressed count 23 eof count 1
VLAN99_PFSYNC lagg0_vlan99    2024-08-12 19:31:02.623336    RE:DA:CT:ED:##:32    RE:DA:CT:ED:##:7b    IPv4, length 1416: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1382 insert count 2 update compressed count 6 delete compressed count 24 eof count 1
VLAN99_PFSYNC lagg0_vlan99    2024-08-12 19:31:02.844388    RE:DA:CT:ED:##:32    RE:DA:CT:ED:##:7b    IPv4, length 230: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 196 update compressed count 2 eof count 1
VLAN99_PFSYNC lagg0_vlan99    2024-08-12 19:31:02.975642    RE:DA:CT:ED:##:32    RE:DA:CT:ED:##:7b    IPv4, length 1429: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1395 insert count 3 update compressed count 3 delete compressed count 23 eof count 1
VLAN99_PFSYNC lagg0_vlan99    2024-08-12 19:31:03.021657    RE:DA:CT:ED:##:32    RE:DA:CT:ED:##:7b    IPv4, length 146: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 112 update compressed count 1 eof count 1
VLAN99_PFSYNC lagg0_vlan99    2024-08-12 19:31:03.101193    RE:DA:CT:ED:##:32    RE:DA:CT:ED:##:7b    IPv4, length 1369: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1335 insert count 3 update compressed count 2 delete compressed count 25 eof count 1
VLAN99_PFSYNC lagg0_vlan99    2024-08-12 19:31:03.417133    RE:DA:CT:ED:##:32    RE:DA:CT:ED:##:7b    IPv4, length 1208: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1174 insert count 2 update compressed count 7 eof count 1
VLAN99_PFSYNC lagg0_vlan99    2024-08-12 19:31:03.487132    RE:DA:CT:ED:##:32    RE:DA:CT:ED:##:7b    IPv4, length 1124: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1090 insert count 2 update compressed count 6 eof count 1
VLAN99_PFSYNC lagg0_vlan99    2024-08-12 19:31:03.530410    RE:DA:CT:ED:##:32    RE:DA:CT:ED:##:7b    IPv4, length 1040: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1006 insert count 2 update compressed count 5 eof count 1
VLAN99_PFSYNC lagg0_vlan99    2024-08-12 19:31:03.639878    RE:DA:CT:ED:##:32    RE:DA:CT:ED:##:7b    IPv4, length 1208: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1174 insert count 2 update compressed count 7 eof count 1
VLAN99_PFSYNC lagg0_vlan99    2024-08-12 19:31:03.808977    RE:DA:CT:ED:##:32    RE:DA:CT:ED:##:7b    IPv4, length 1233: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1199 insert count 3 update compressed count 4 eof count 1
VLAN99_PFSYNC lagg0_vlan99    2024-08-12 19:31:03.871860    RE:DA:CT:ED:##:32    RE:DA:CT:ED:##:7b    IPv4, length 1317: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1283 insert count 3 update compressed count 5 eof count 1
VLAN99_PFSYNC lagg0_vlan99    2024-08-12 19:31:04.110615    RE:DA:CT:ED:##:32    RE:DA:CT:ED:##:7b    IPv4, length 1292: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1258 insert count 2 update compressed count 8 eof count 1


So pfsync packets are going through normally.
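A capture like the ones above can tell more than "packets arrive" if the lines are tallied by direction and by pfsync action. The following is a minimal Python sketch, assuming the text format shown in the capture output above; the name `summarize` and both regular expressions are illustrative and not part of any OPNsense tool. The two sample lines are shortened copies of lines from the primary's capture.

```python
import re
from collections import Counter

# Matches "<src> > <dst>: PFSYNCv<N> len <bytes> <actions...>"
PFSYNC_LINE = re.compile(
    r"(?P<src>\d+\.\d+\.\d+\.\d+) > (?P<dst>\d+\.\d+\.\d+\.\d+): "
    r"PFSYNCv\d+ len \d+ (?P<actions>.*)"
)
# Matches each "<action> count <n>" pair seen in the capture output.
ACTION = re.compile(r"(insert|update compressed|delete compressed|eof) count (\d+)")


def summarize(lines):
    """Count packets per (src, dst) pair and sum the per-action counters."""
    directions = Counter()
    actions = Counter()
    for line in lines:
        match = PFSYNC_LINE.search(line)
        if match is None:
            continue  # not a pfsync line in the expected format
        directions[(match["src"], match["dst"])] += 1
        for name, count in ACTION.findall(match["actions"]):
            actions[name] += int(count)
    return directions, actions


sample = [
    "IPv4, length 1488: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1454 "
    "insert count 2 update compressed count 7 delete compressed count 23 eof count 1",
    "IPv4, length 230: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 196 "
    "update compressed count 2 eof count 1",
]
directions, actions = summarize(sample)
print(dict(directions))  # {('10.99.0.251', '10.99.0.252'): 2}
print(dict(actions))     # {'insert': 2, 'update compressed': 9, 'delete compressed': 23, 'eof': 2}
```

In the excerpts posted here, every packet goes from 10.99.0.251 to 10.99.0.252. A one-sided tally does not prove a fault on its own, but it may be a cue to double-check the peer setting on the quieter node.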

Still wondering what's causing this issue.

So, I eventually figured out the problem, and it was of course on the user/admin side...
On the standby node, I had not typed the IP address correctly for the pfsync Synchronize Peer IP: I had typed 10.90.0.251 instead of 10.99.0.251.  :-[

Now that this is fixed, everything is working as expected.
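
For anyone hitting the same symptom, a typo like this can be caught by checking that the configured peer IP actually lies inside the sync interface's own subnet. This is a minimal Python sketch using only the standard `ipaddress` module; the function name `peer_on_sync_network` is hypothetical, and the addresses are the ones from this thread, typed in by hand rather than read from the firewall.

```python
import ipaddress


def peer_on_sync_network(local_interface, peer_ip):
    """Return True if peer_ip is inside the local sync interface's subnet
    and is not the local address itself."""
    iface = ipaddress.ip_interface(local_interface)
    peer = ipaddress.ip_address(peer_ip)
    return peer in iface.network and peer != iface.ip


# Standby node: sync interface address as seen in the capture, then the two peer values.
print(peer_on_sync_network("10.99.0.252/24", "10.99.0.251"))  # True
print(peer_on_sync_network("10.99.0.252/24", "10.90.0.251"))  # False
```

A `False` result does not say which of the two values is wrong, only that the interface address and the peer IP disagree about the network.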