Hello,
I have OPNsense configured with HA, CARP works fine, no issues with it.
However, pfsync does not seem to work properly: when I switch to the backup (or back to the primary), all my established connections die. (I tested with an SSH connection to a host behind both firewalls; it hangs and is then reset when the CARP switch PRIMARY=>BACKUP happens.)
This is not my first HA setup. I have been running OPNsense with HA for 4 years and it has been working very well (both CARP and pfsync, with seamless transition to the backup without losing any connection).
So maybe I did something wrong in this new setup, and I may need another pair of eyes to look at it and figure out what is wrong.
For 24.7, I did a fresh install.
Both the primary and backup are VMs, just like my previous 24.1 setup (which worked fine for years, starting with 20.7 and upgraded all the way to 24.1).
(Important note: my previous 24.1 setup is not running anymore. I shut down and have now deleted those VMs, so only the new setup exists.)
One difference with my new system is that the VM for the primary uses PCI passthrough for the 10Gb port (LAN - ix0) and the 1Gb port (WAN - igb0).
The backup VM uses VirtIO adapters for both (vtnet0 & vtnet1).
So on both sides, I created failover LAGG interfaces (with a single port in each) and configured lagg0 for LAN and lagg1 for WAN, so that the interface names match on both sides; this is important for state syncing, as indicated in the docs.
Then on top of the LAN LAGG interface (lagg0) I created a bunch of VLANs as this port is a trunk port with several tagged VLANs.
That part of the setup (VLANs) is identical to my previous one (with 24.1), where all my networks are connected to the firewall via a single port and tagged VLANs.
So I end up with multiple lagg0_vlanXX VLAN interfaces, which are assigned, and I made sure that the optXX assignments match on both sides (primary and backup). For example, on both sides, lagg0_vlan10 is opt1, lagg0_vlan20 is opt2, etc.
I have a dedicated VLAN for PFSYNC (VLAN99 - assigned to opt7 on both sides) which is also used by KEA DHCP for peer traffic.
On the primary that interface is configured with IP 10.99.0.251/24
On the backup that interface is configured with IP 10.99.0.252/24
The firewall rules for the PFSYNC interface are:
Action | Protocol | Source | Port | Destination | Port | Gateway | Schedule | Description
pass | IPv4 PFSYNC | VLAN99_PFSYNC net | * | This Firewall | * | * | * | Allow pfSync traffic
pass | IPv4 TCP | VLAN99_PFSYNC net | * | This Firewall | 443 (HTTPS) | * | * | Allow HTTPS traffic for config synchronization
pass | IPv4 TCP | VLAN99_PFSYNC net | * | This Firewall | 8001 | * | * | Allow Kea DHCP HA Peer traffic
System: High Availability: Settings - On the primary node:
Synchronize States: | checked |
Synchronize Interface: | VLAN99_PFSYNC |
Sync Compatibility: | OPNsense 24.7 or above |
Synchronize Peer IP: | 10.99.0.252 |
Synchronize Config: | 10.99.0.252 |
Remote System Username: | <the username of my backup node> |
Remote System Password: | <the password of my backup node> |
Services to synchronize (XMLRPC Sync): | Aliases, Certificates, Dashboard, Firewall Categories, Firewall Groups, Firewall Log Templates, Firewall Rules, Firewall Schedules, IPsec, Kea DHCP, NAT, Network Time, Unbound DNS, Virtual IPs |
System: High Availability: Settings - On the secondary node:
Synchronize States: | checked |
Synchronize Interface: | VLAN99_PFSYNC |
Sync Compatibility: | OPNsense 24.7 or above |
Synchronize Peer IP: | 10.99.0.251 |
(Fields that are not indicated are either empty or at their default value.)
System: High Availability: Status - On the primary node: <showing the backup firewall version and services, all green, and synchronization of configuration works fine>
System: High Availability: Status - On the backup node: The backup firewall is not accessible or not configured.
When I look at the Firewall: Diagnostics: States on both nodes, I can see a "similar" number of states: ~1700 on primary and ~1500 on backup.
But if I switch from primary to backup (by enabling Persistent CARP Maintenance Mode on the primary), any established connection (like SSH) hangs and dies. Also, when I then compare the states in Firewall: Diagnostics: States on both nodes, the primary shows ~500 states and the backup shows ~2200.
So something must be wrong somewhere, but I cannot figure out what. Is there a log or other place where I can see more details about pfsync activity and make sure it is working as expected?
Let me know if you need more information.
Thank you
I did a short packet capture on my pfsync interface on both the primary and the standby, and I can see the pfsync traffic going from primary to standby.
PRIMARY
Interface Timestamp SRC DST output
VLAN99_PFSYNC lagg0_vlan99 2024-08-12 19:31:02.532333 RE:DA:CT:ED:##:32 RE:DA:CT:ED:##:7b IPv4, length 1488: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1454 insert count 2 update compressed count 7 delete compressed count 23 eof count 1
VLAN99_PFSYNC lagg0_vlan99 2024-08-12 19:31:02.622583 RE:DA:CT:ED:##:32 RE:DA:CT:ED:##:7b IPv4, length 1416: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1382 insert count 2 update compressed count 6 delete compressed count 24 eof count 1
VLAN99_PFSYNC lagg0_vlan99 2024-08-12 19:31:02.843370 RE:DA:CT:ED:##:32 RE:DA:CT:ED:##:7b IPv4, length 230: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 196 update compressed count 2 eof count 1
VLAN99_PFSYNC lagg0_vlan99 2024-08-12 19:31:02.974409 RE:DA:CT:ED:##:32 RE:DA:CT:ED:##:7b IPv4, length 1429: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1395 insert count 3 update compressed count 3 delete compressed count 23 eof count 1
VLAN99_PFSYNC lagg0_vlan99 2024-08-12 19:31:03.020705 RE:DA:CT:ED:##:32 RE:DA:CT:ED:##:7b IPv4, length 146: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 112 update compressed count 1 eof count 1
VLAN99_PFSYNC lagg0_vlan99 2024-08-12 19:31:03.099974 RE:DA:CT:ED:##:32 RE:DA:CT:ED:##:7b IPv4, length 1369: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1335 insert count 3 update compressed count 2 delete compressed count 25 eof count 1
VLAN99_PFSYNC lagg0_vlan99 2024-08-12 19:31:03.416010 RE:DA:CT:ED:##:32 RE:DA:CT:ED:##:7b IPv4, length 1208: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1174 insert count 2 update compressed count 7 eof count 1
VLAN99_PFSYNC lagg0_vlan99 2024-08-12 19:31:03.486072 RE:DA:CT:ED:##:32 RE:DA:CT:ED:##:7b IPv4, length 1124: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1090 insert count 2 update compressed count 6 eof count 1
VLAN99_PFSYNC lagg0_vlan99 2024-08-12 19:31:03.529206 RE:DA:CT:ED:##:32 RE:DA:CT:ED:##:7b IPv4, length 1040: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1006 insert count 2 update compressed count 5 eof count 1
VLAN99_PFSYNC lagg0_vlan99 2024-08-12 19:31:03.638759 RE:DA:CT:ED:##:32 RE:DA:CT:ED:##:7b IPv4, length 1208: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1174 insert count 2 update compressed count 7 eof count 1
VLAN99_PFSYNC lagg0_vlan99 2024-08-12 19:31:03.807782 RE:DA:CT:ED:##:32 RE:DA:CT:ED:##:7b IPv4, length 1233: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1199 insert count 3 update compressed count 4 eof count 1
VLAN99_PFSYNC lagg0_vlan99 2024-08-12 19:31:03.870700 RE:DA:CT:ED:##:32 RE:DA:CT:ED:##:7b IPv4, length 1317: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1283 insert count 3 update compressed count 5 eof count 1
VLAN99_PFSYNC lagg0_vlan99 2024-08-12 19:31:04.109522 RE:DA:CT:ED:##:32 RE:DA:CT:ED:##:7b IPv4, length 1292: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1258 insert count 2 update compressed count 8 eof count 1
STANDBY
Interface Timestamp SRC DST output
VLAN99_PFSYNC lagg0_vlan99 2024-08-12 19:31:02.533114 RE:DA:CT:ED:##:32 RE:DA:CT:ED:##:7b IPv4, length 1488: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1454 insert count 2 update compressed count 7 delete compressed count 23 eof count 1
VLAN99_PFSYNC lagg0_vlan99 2024-08-12 19:31:02.623336 RE:DA:CT:ED:##:32 RE:DA:CT:ED:##:7b IPv4, length 1416: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1382 insert count 2 update compressed count 6 delete compressed count 24 eof count 1
VLAN99_PFSYNC lagg0_vlan99 2024-08-12 19:31:02.844388 RE:DA:CT:ED:##:32 RE:DA:CT:ED:##:7b IPv4, length 230: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 196 update compressed count 2 eof count 1
VLAN99_PFSYNC lagg0_vlan99 2024-08-12 19:31:02.975642 RE:DA:CT:ED:##:32 RE:DA:CT:ED:##:7b IPv4, length 1429: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1395 insert count 3 update compressed count 3 delete compressed count 23 eof count 1
VLAN99_PFSYNC lagg0_vlan99 2024-08-12 19:31:03.021657 RE:DA:CT:ED:##:32 RE:DA:CT:ED:##:7b IPv4, length 146: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 112 update compressed count 1 eof count 1
VLAN99_PFSYNC lagg0_vlan99 2024-08-12 19:31:03.101193 RE:DA:CT:ED:##:32 RE:DA:CT:ED:##:7b IPv4, length 1369: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1335 insert count 3 update compressed count 2 delete compressed count 25 eof count 1
VLAN99_PFSYNC lagg0_vlan99 2024-08-12 19:31:03.417133 RE:DA:CT:ED:##:32 RE:DA:CT:ED:##:7b IPv4, length 1208: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1174 insert count 2 update compressed count 7 eof count 1
VLAN99_PFSYNC lagg0_vlan99 2024-08-12 19:31:03.487132 RE:DA:CT:ED:##:32 RE:DA:CT:ED:##:7b IPv4, length 1124: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1090 insert count 2 update compressed count 6 eof count 1
VLAN99_PFSYNC lagg0_vlan99 2024-08-12 19:31:03.530410 RE:DA:CT:ED:##:32 RE:DA:CT:ED:##:7b IPv4, length 1040: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1006 insert count 2 update compressed count 5 eof count 1
VLAN99_PFSYNC lagg0_vlan99 2024-08-12 19:31:03.639878 RE:DA:CT:ED:##:32 RE:DA:CT:ED:##:7b IPv4, length 1208: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1174 insert count 2 update compressed count 7 eof count 1
VLAN99_PFSYNC lagg0_vlan99 2024-08-12 19:31:03.808977 RE:DA:CT:ED:##:32 RE:DA:CT:ED:##:7b IPv4, length 1233: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1199 insert count 3 update compressed count 4 eof count 1
VLAN99_PFSYNC lagg0_vlan99 2024-08-12 19:31:03.871860 RE:DA:CT:ED:##:32 RE:DA:CT:ED:##:7b IPv4, length 1317: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1283 insert count 3 update compressed count 5 eof count 1
VLAN99_PFSYNC lagg0_vlan99 2024-08-12 19:31:04.110615 RE:DA:CT:ED:##:32 RE:DA:CT:ED:##:7b IPv4, length 1292: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1258 insert count 2 update compressed count 8 eof count 1
So pfsync packets are going through normally.
Still wondering what's causing this issue.
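For anyone who wants more than eyeballing the capture, here is a minimal Python sketch (not an OPNsense tool) that tallies pfsync packets per direction and sums the action counts. The names `summarize_pfsync`, `LINE_RE` and `ACTION_RE` are made up for illustration, and the two sample lines are trimmed copies of lines from the capture above. It assumes the text format shown in that capture and nothing else.

```python
import re
from collections import Counter

# Matches "SRC > DST: PFSYNCv<N> len <N> <actions...>" as it appears in the capture text.
LINE_RE = re.compile(
    r"(\d+\.\d+\.\d+\.\d+) > (\d+\.\d+\.\d+\.\d+): PFSYNCv\d+ len \d+ (.*)"
)
# Matches pairs such as "insert count 2" or "update compressed count 7".
ACTION_RE = re.compile(r"([a-z]+(?: [a-z]+)*) count (\d+)")


def summarize_pfsync(lines):
    """Return (packets per (src, dst) direction, summed counts per action)."""
    directions = Counter()
    actions = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a pfsync line in the expected format
        src, dst, tail = match.groups()
        directions[(src, dst)] += 1
        for name, count in ACTION_RE.findall(tail):
            actions[name] += int(count)
    return directions, actions


sample = [
    "IPv4, length 1488: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 1454 "
    "insert count 2 update compressed count 7 delete compressed count 23 eof count 1",
    "IPv4, length 230: 10.99.0.251 > 10.99.0.252: PFSYNCv5 len 196 "
    "update compressed count 2 eof count 1",
]

directions, actions = summarize_pfsync(sample)
for (src, dst), packets in directions.items():
    print(f"{src} -> {dst}: {packets} packets")
for name, total in actions.items():
    print(f"{name}: {total}")
```

Grouping by direction makes it easier to spot whether traffic flows both ways. In the excerpt above, every line is 10.99.0.251 > 10.99.0.252 and none goes the other way.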
So, I eventually figured out the problem, and it was of course on the user / admin side ...
On the standby node, I did not type the IP address correctly in the pfsync Synchronize Peer IP field: I had typed 10.90.0.251 instead of 10.99.0.251 ... :-[
Now that this is fixed, everything is working as expected.
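In hindsight, a small sanity check would have caught this. The following is only a sketch in Python using the standard `ipaddress` module; the function name `check_sync_peers` and the dictionary layout are hypothetical, the sync interface addresses are the ones seen in the packet capture (with the /24 mask mentioned earlier), and the standby peer value is the mistyped one.

```python
import ipaddress


def check_sync_peers(nodes):
    """Return a list of problems; an empty list means every peer IP lines up."""
    problems = []
    for name, cfg in nodes.items():
        iface = ipaddress.ip_interface(cfg["sync_iface"])
        peer = ipaddress.ip_address(cfg["peer_ip"])
        # Sync interface addresses of all the other nodes.
        others = {
            ipaddress.ip_interface(other["sync_iface"]).ip
            for other_name, other in nodes.items()
            if other_name != name
        }
        if peer not in iface.network:
            problems.append(f"{name}: peer {peer} is outside {iface.network}")
        if peer not in others:
            problems.append(f"{name}: peer {peer} is not the other node's sync address")
    return problems


# Values as described in this thread; the standby entry contains the typo.
nodes = {
    "primary": {"sync_iface": "10.99.0.251/24", "peer_ip": "10.99.0.252"},
    "standby": {"sync_iface": "10.99.0.252/24", "peer_ip": "10.90.0.251"},
}

problems = check_sync_peers(nodes)
for problem in problems:
    print(problem)
```

With the standby peer corrected to 10.99.0.251, the function returns an empty list.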