Cannot get pfsync to work: pfsync bulk fail

Started by WRMSR, October 11, 2019, 11:37:15 PM

October 11, 2019, 11:37:15 PM Last Edit: October 15, 2019, 10:10:09 AM by WRMSR
Hello forum,

the setup first:

Node #1 is on version 19.7 (will upgrade on Monday), node #2 is on 19.7.4.

Each node has two X520 dual-port cards. The upstream connection is connected to one port of one card on every node.
The LAN VLANs are on a failover lagg with two ports on every node (e.g. lagg0_vlan50) connected to two switches. This lagg has MTU set to 9000.

CARP is working so far. The only problem is that I can't get pfsync to work, which causes the cluster to fail over periodically because of the CARP demotion changes triggered by the failing pfsync bulk update:

carp: demoted by -240 to 0 (pfsync bulk fail)
carp: 1@lagg0_vlan12: BACKUP -> MASTER (preempting a slower master)
carp: 1@lagg0_vlan10: BACKUP -> MASTER (preempting a slower master)
carp: 1@lagg0_vlan50: BACKUP -> MASTER (preempting a slower master)
carp: 1@lagg0_vlan11: BACKUP -> MASTER (preempting a slower master)
carp: 10@ix0: BACKUP -> MASTER (preempting a slower master)
carp: demoted by 240 to 240 (pfsync bulk start)
carp: 1@lagg0_vlan12: MASTER -> BACKUP (more frequent advertisement received)
ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan12: 3
carp: 1@lagg0_vlan50: MASTER -> BACKUP (more frequent advertisement received)
ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan50: 3
carp: 1@lagg0_vlan11: MASTER -> BACKUP (more frequent advertisement received)
ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan11: 3
carp: 1@lagg0_vlan10: MASTER -> BACKUP (more frequent advertisement received)
ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
carp: 10@ix0: MASTER -> BACKUP (more frequent advertisement received)
ifa_maintain_loopback_route: deletion failed for interface ix0: 3
carp: demoted by -240 to 0 (pfsync bulk fail)
carp: 10@ix0: BACKUP -> MASTER (master timed out)
carp: 1@lagg0_vlan10: BACKUP -> MASTER (master timed out)
carp: 1@lagg0_vlan12: BACKUP -> MASTER (master timed out)
carp: 1@lagg0_vlan11: BACKUP -> MASTER (master timed out)
carp: 1@lagg0_vlan50: BACKUP -> MASTER (master timed out)
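
To see how often the flapping happens, log lines like the ones above can be tallied. The following is only a rough sketch: the function name summarize_carp is made up for illustration, and it assumes the kernel messages are fed in on stdin in the same format as shown in this post.

```shell
#!/bin/sh
# Hypothetical helper: summarize CARP demotion events and state
# transitions from kernel log text read on stdin.
summarize_carp() {
    awk '
        # e.g. "carp: demoted by -240 to 0 (pfsync bulk fail)"
        /carp: demoted by/ {
            reason = $0
            sub(/^.*\(/, "", reason)   # drop everything up to "("
            sub(/\).*$/, "", reason)   # drop ")" and anything after it
            demotions[reason]++
            next
        }
        # e.g. "carp: 1@lagg0_vlan12: BACKUP -> MASTER (...)"
        /carp: .*BACKUP -> MASTER/ { to_master++ }
        /carp: .*MASTER -> BACKUP/ { to_backup++ }
        END {
            printf "to MASTER: %d\n", to_master
            printf "to BACKUP: %d\n", to_backup
            # one line per demotion reason; order is not guaranteed
            for (r in demotions) printf "%s: %d\n", r, demotions[r]
        }
    '
}
```

Something like `dmesg | summarize_carp` would be one way to run it, assuming the messages are still in the kernel message buffer.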


VLAN 5 (on lagg0) is the sync VLAN; node1's IP is 10.49.5.1 and node2's IP is 10.49.5.2. These are entered in the web UI. All other interfaces have CARP IPs, with .1 being the virtual IP and .2 and .3 being the real machine IPs.

Relevant ifconfig of node1:

root@fw01:/var/log # ifconfig lagg0_vlan5
lagg0_vlan5: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
        options=600703<RXCSUM,TXCSUM,TSO4,TSO6,LRO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether f8:f2:1e:71:5d:94
        inet6 fe80::faf2:1eff:fe71:5d94%lagg0_vlan5 prefixlen 64 scopeid 0xd
        inet 10.49.5.1 netmask 0xffffff00 broadcast 10.49.5.255
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        vlan: 5 vlanpcp: 0 parent interface: lagg0
        groups: vlan
root@fw01:/var/log # ifconfig pfsync0
pfsync0: flags=41<UP,RUNNING> metric 0 mtu 9000
        groups: pfsync
        pfsync: syncdev: lagg0_vlan5 syncpeer: 10.49.5.2 maxupd: 128 defer: off


Reducing the MTU of the pfsync interface, moving it to the WAN interface (to test with a "raw" physical interface instead of a lagg/VLAN), and switching between unicast and multicast make no difference. Both nodes simply refuse to send out firewall state changes. This is noticeable because the web UI's "pfsync nodes" display doesn't line up and tcpdump shows no related traffic. Firewall rules are set up to allow any traffic on the sync and LAN interfaces.
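
For anyone wanting to repeat the tcpdump check, here is a minimal sketch. pfsync uses IP protocol 249, so the capture can be filtered on that number. The function name pfsync_watch is just an illustrative wrapper, not something that ships with the system, and it needs to run as root on the firewall.

```shell
#!/bin/sh
# Hypothetical wrapper: watch for pfsync packets (IP protocol 249)
# on the given interface. Needs root; stop it with Ctrl-C.
pfsync_watch() {
    # -n: no name resolution, -i: interface to capture on
    tcpdump -n -i "$1" ip proto 249
}
```

On the setup described here that would be `pfsync_watch lagg0_vlan5` on each node. If nothing shows up on either side while states are being created, the nodes are not sending sync traffic at all.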

Thanks in advance.

Please show screenshots of the HA/pfsync setting. I suspect there's something wrong there.

October 14, 2019, 01:57:16 PM #2 Last Edit: October 14, 2019, 01:59:34 PM by WRMSR
Hi,

after troubleshooting over the weekend, I think I've found the issue.

The cause was "VLAN Hardware Filtering" (Interfaces -> Settings) being set to enabled, which kept pfsync from sending out packets. After setting this to "Leave default" and rebooting the node, it finally started to send out pfsync packets.
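
A quick way to check from the shell whether hardware VLAN filtering is currently active on a port is to look for VLAN_HWFILTER in the options line of its ifconfig output. This is only a sketch: the helper name has_vlan_hwfilter is made up, and which physical ports to check (ix0 below) is an assumption, since the lagg member ports are not listed in this thread.

```shell
#!/bin/sh
# Hypothetical helper: reads `ifconfig <iface>` output on stdin and
# succeeds (exit 0) if VLAN_HWFILTER is listed among the enabled options.
has_vlan_hwfilter() {
    grep 'options=' | grep -qw 'VLAN_HWFILTER'
}
```

Usage would be along the lines of `ifconfig ix0 | has_vlan_hwfilter && echo "VLAN hardware filtering is on"`. FreeBSD's ifconfig also accepts `-vlanhwfilter` to switch the feature off on an interface at runtime, but the web UI setting is presumably what gets applied again after a reboot.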

October 15, 2019, 10:13:01 AM #3 Last Edit: October 15, 2019, 10:16:16 AM by WRMSR
Hi,

after upgrading both nodes to OPNsense 19.7.5_5, the sync stopped again. I can't see any pfsync packets on the VLAN via tcpdump.

I've attached screenshots of both nodes for the HA configuration in the web UI.

Edit:

Here is the ifconfig for node1:

root@fw01:~ # ifconfig pfsync0
pfsync0: flags=41<UP,RUNNING> metric 0 mtu 9000
        groups: pfsync
        pfsync: syncdev: lagg0_vlan5 syncpeer: 10.49.5.2 maxupd: 128 defer: off
root@fw01:~ # ifconfig lagg0_vlan5
lagg0_vlan5: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
        options=600703<RXCSUM,TXCSUM,TSO4,TSO6,LRO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether f8:f2:1e:71:5d:94
        inet6 fe80::faf2:1eff:fe71:5d94%lagg0_vlan5 prefixlen 64 scopeid 0xd
        inet 10.49.5.1 netmask 0xffffff00 broadcast 10.49.5.255
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        vlan: 5 vlanpcp: 0 parent interface: lagg0
        groups: vlan


node2:

root@fw02:~ # ifconfig pfsync0
pfsync0: flags=41<UP,RUNNING> metric 0 mtu 9000
        groups: pfsync
        pfsync: syncdev: lagg0_vlan5 syncpeer: 10.49.5.1 maxupd: 128 defer: off
root@fw02:~ # ifconfig lagg0_vlan5
lagg0_vlan5: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
        options=600703<RXCSUM,TXCSUM,TSO4,TSO6,LRO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether f8:f2:1e:71:63:a8
        inet6 fe80::faf2:1eff:fe71:63a8%lagg0_vlan5 prefixlen 64 scopeid 0xd
        inet 10.49.5.2 netmask 0xffffff00 broadcast 10.49.5.255
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        vlan: 5 vlanpcp: 0 parent interface: lagg0
        groups: vlan