Help with HA setup

Started by Serius, May 28, 2020, 01:46:02 PM

May 28, 2020, 01:46:02 PM Last Edit: May 29, 2020, 01:02:02 AM by Serius
Hope someone can help me with this.
I had OPNsense running in a VM under ESXi for some time. I didn't like losing the network during server maintenance, so I bought an i3 NUC. While the NUC was on its way, I modified the existing installation to match the new hardware.
I had three VLANs, each on its own virtual adapter, and the WAN on a dedicated passthrough adapter. I changed this to a single trunk adapter in a router-on-a-stick configuration with four VLANs.
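For reference, this is roughly what that router-on-a-stick layout boils down to at the FreeBSD level underneath OPNsense (I set the VLANs up in the GUI; the parent NIC name, tags and addresses below are just placeholders, and OPNsense gives the interfaces names like em0_vlan10 instead):

# one physical trunk NIC carrying all networks as tagged VLANs (placeholder names/tags)
ifconfig vlan10 create vlan 10 vlandev em0
ifconfig vlan50 create vlan 50 vlandev em0
# the firewall then routes between the VLAN interfaces, e.g.:
ifconfig vlan10 inet 192.168.10.1/24 up
ifconfig vlan50 inet 192.168.50.1/24 up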

When I received the NUC I installed OPNsense on it and restored a backup from the VM. Then I thought I could keep the VM running and configure CARP HA between the two.
I followed this: https://www.thomas-krenn.com/en/wiki/OPNsense_HA_Cluster_configuration
and this: https://docs.opnsense.org/manual/how-tos/carp.html

I basically followed those instructions, but created a new VLAN interface (also configured on the switch) for the pfsync interface.
[As the NUC only has one network adapter (for now), I could not get the "mysterious and undocumented" LAGG workaround to allow state sync, so I left state synchronization deactivated.
I configured XMLRPC Sync.]
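(For anyone wondering what state sync actually does under the hood: when it is enabled, OPNsense essentially ends up with the equivalent of the following on the sync interface. The interface name and peer address here are placeholders for my sync VLAN; this is only a sketch of what the GUI setting maps to, not something that needs typing by hand.)

# bind pfsync to the dedicated sync VLAN and replicate states to the peer (placeholder names/IP)
ifconfig pfsync0 syncdev em0_vlan99 syncpeer 192.168.99.2 up
# verify what pfsync is currently bound to
ifconfig pfsync0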

I fully configured the LAGG interface and HA settings.
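In case it helps someone in the same single-NIC situation, a one-port LAGG at the FreeBSD level is basically just this (placeholder NIC name and protocol; I did it from the GUI, this only sketches the idea of parking the physical port under lagg0 so more ports can be added later):

# create a lagg and add the single physical NIC to it (placeholder names; with lacp the
# switch port must be in a matching LACP group, failover/loadbalance are alternatives)
ifconfig lagg0 create
ifconfig lagg0 laggproto lacp laggport em0 up
# the VLANs then hang off lagg0 instead of em0, hence lagg0_vlan** in later logs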

Also, as I have several VLANs, where the documentation says to create a firewall rule to allow CARP, I created a group containing all the intranet + WAN VLAN interfaces and added the rule there.

As I have more than one interface, I created the virtual IPs with increasing VHID groups (WAN 1 / LAN 3 / TLN 4 / IOT 5).
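For completeness, those VIPs boil down to something like the following on the master node (addresses, password and interface names are placeholders; advskew is 0 on the master and higher on the backup, and the GUI takes care of all of this):

# one CARP alias per interface, each with its own VHID (placeholder values)
ifconfig em0_vlan1  vhid 3 advskew 0 pass mysecret alias 192.168.0.1/24
ifconfig em0_vlan10 vhid 4 advskew 0 pass mysecret alias 192.168.10.1/24
# quick check of who is MASTER/BACKUP for each VIP
ifconfig | grep carp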

The network was operational, but I have the following problems:

  • Where the documentation says to create a firewall rule for the pfsync interface, "as it is a direct cable"... mine is not a direct cable, and I didn't know how to create that rule, since the documentation doesn't show it either. I created a regular allow-all rule.

  • My firewall log is full of impossible block hits, e.g. on interface TLN it shows HOST1:80 contacting HOST2:34234, when HOST1 is not on TLN but on IOT, and looking at the ports the traffic direction seems inverted. This doesn't affect normal usage.

  • Failover doesn't work. I shut down the NUC and the network goes down. Not only that: before, I could manage the switch by setting a fixed IP and attaching my PC to the management VLAN, but now that fails too. (Could some ARP configuration in the switch be messing this up?)

  • Trying to run XMLRPC Sync fails, saying that the backup is not present. I can ping the backup, and I can ping the master from the VM. I can see the allowed port 80 connection in the FW2 logs.

  • The HA -> Status page also says that there is no communication with the backup node. FW1 seems to hang for a while, but FW2 shows the error immediately.

  • After a day, all the appliances in the TLN VLAN (trusted) stopped working. DHCP for this interface had been deactivated during configuration (only this one), so I enabled and started it, but it doesn't work and keeps writing this to the log:
    (It does the same on both FWs)

2020-05-28T12:58:16 dhcpd: DHCPDISCOVER from 04:b1:67:1b:d1:62 via em0_vlan10: not responding (recovering)
2020-05-28T12:57:59 dhcpd: DHCPDISCOVER from 04:b1:67:1b:d1:62 via em0_vlan10: not responding (recovering)
2020-05-28T12:57:51 dhcpd: DHCPDISCOVER from 04:b1:67:1b:d1:62 via em0_vlan10: not responding (recovering)
2020-05-28T12:57:46 dhcpd: DHCPDISCOVER from 04:b1:67:1b:d1:62 via em0_vlan10: not responding (recovering)
2020-05-28T12:57:41 dhcpd: DHCPDISCOVER from 04:b1:67:1b:d1:62 via em0_vlan10: not responding (recovering)
2020-05-28T12:57:36 dhcpd: DHCPDISCOVER from 04:b1:67:1b:d1:62 via em0_vlan10: not responding (recovering)
2020-05-28T12:57:19 dhcpd: DHCPDISCOVER from 04:b1:67:1b:d1:62 via em0_vlan10: not responding (recovering)
2020-05-28T12:57:11 dhcpd: DHCPDISCOVER from 04:b1:67:1b:d1:62 via em0_vlan10: not responding (recovering)
2020-05-28T12:57:07 dhcpd: DHCPDISCOVER from 04:b1:67:1b:d1:62 via em0_vlan10: not responding (recovering)
2020-05-28T12:57:02 dhcpd: DHCPDISCOVER from 04:b1:67:1b:d1:62 via em0_vlan10: not responding (recovering)
2020-05-28T12:56:57 dhcpd: failover peer dhcp_lan: I move from startup to communications-interrupted
2020-05-28T12:56:57 dhcpd: failover peer dhcp_opt1: I move from startup to communications-interrupted
2020-05-28T12:56:57 dhcpd: failover peer dhcp_opt2: I move from startup to recover
2020-05-28T12:56:42 dhcpd: Server starting service.
2020-05-28T12:56:42 dhcpd: failover peer dhcp_lan: I move from communications-interrupted to startup
2020-05-28T12:56:42 dhcpd: failover peer dhcp_opt1: I move from communications-interrupted to startup
2020-05-28T12:56:42 dhcpd: failover peer dhcp_opt2: I move from recover to startup
2020-05-28T12:56:42 dhcpd: Sending on   Socket/fallback/fallback-net
2020-05-28T12:56:42 dhcpd: Sending on   BPF/em0_vlan1/f4:4d:30:6a:fb:9c/192.168.0.0/24
2020-05-28T12:56:42 dhcpd: Listening on BPF/em0_vlan1/f4:4d:30:6a:fb:9c/192.168.0.0/24
2020-05-28T12:56:42 dhcpd: Sending on   BPF/em0_vlan50/f4:4d:30:6a:fb:9c/192.168.50.0/24
2020-05-28T12:56:42 dhcpd: Listening on BPF/em0_vlan50/f4:4d:30:6a:fb:9c/192.168.50.0/24
2020-05-28T12:56:42 dhcpd: Sending on   BPF/em0_vlan10/f4:4d:30:6a:fb:9c/192.168.10.0/24
2020-05-28T12:56:42 dhcpd: Listening on BPF/em0_vlan10/f4:4d:30:6a:fb:9c/192.168.10.0/24
2020-05-28T12:56:42 dhcpd: Wrote 150 leases to leases file.
2020-05-28T12:56:42 dhcpd: Wrote 0 new dynamic host decls to leases file.
2020-05-28T12:56:42 dhcpd: Wrote 0 deleted host decls to leases file.
2020-05-28T12:56:42 dhcpd: For info, please visit https://www.isc.org/software/dhcp/
2020-05-28T12:56:42 dhcpd: All rights reserved.
2020-05-28T12:56:42 dhcpd: Copyright 2004-2020 Internet Systems Consortium.
2020-05-28T12:56:42 dhcpd: Internet Systems Consortium DHCP Server 4.4.2
2020-05-28T12:56:42 dhcpd: PID file: /var/run/dhcpd.pid
2020-05-28T12:56:42 dhcpd: Database file: /var/db/dhcpd.leases
2020-05-28T12:56:42 dhcpd: Config file: /etc/dhcpd.conf
2020-05-28T12:56:42 dhcpd: For info, please visit https://www.isc.org/software/dhcp/
2020-05-28T12:56:42 dhcpd: All rights reserved.
2020-05-28T12:56:42 dhcpd: Copyright 2004-2020 Internet Systems Consortium.
2020-05-28T12:56:42 dhcpd: Internet Systems Consortium DHCP Server 4.4.2

Note: Output from before LAGG implementation. Now lagg0_vlan**.

DHCP on the other two interfaces works as expected; I can see successful DHCP negotiations for them in the log.
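Those "failover peer dhcp_lan / dhcp_opt*" lines come from ISC dhcpd's failover protocol, which OPNsense configures automatically once a failover peer IP is set on the DHCP page. A few shell checks I have been using to see whether the two daemons can actually talk (peer IP and port are placeholders; the real port is whatever ended up in the generated dhcpd.conf):

# which addresses/ports is dhcpd listening on?
sockstat -4 -l | grep dhcpd
# can the peer's failover port be reached from this box? (placeholder IP/port)
nc -vz 192.168.99.2 519
# the failover state is also recorded in the leases database
grep -A3 'failover peer' /var/db/dhcpd.leases | head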

Thank you very much.

EDIT: I can see this block hit on FW2 when I restart pfsync on the master:
BLOCK   wan      May 29 00:43:12   192.168.8.11   224.0.0.240   pfsync   Block private networks from WAN

This is an automatic rule. I removed it from the interface and added a pfsync allow rule, but it had no effect on HA.
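To see whether pfsync and CARP traffic is actually flowing on the right interface (and not leaking out via WAN like in that block entry), this is the kind of check that can be run from the shell (the sync interface name is a placeholder; 240 and 112 are the IP protocol numbers of pfsync and CARP, with multicast groups 224.0.0.240 and 224.0.0.18 respectively):

# watch pfsync (proto 240) and CARP (proto 112) on the sync VLAN (placeholder name)
tcpdump -ni em0_vlan99 'ip proto 240 or ip proto 112'
# confirm the manually added rules actually made it into the loaded ruleset
pfctl -sr | grep -Ei 'carp|pfsync'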

I see the same issue. With the second node down, the node that is still alive doesn't take over.

2020-06-06T01:55:59   dhcpd: DHCPDISCOVER from ac:63:be:61:31:4a via vtnet0: not responding (recovering)
2020-06-06T01:55:50   dhcpd: DHCPDISCOVER from ac:63:be:61:31:4a via vtnet0: not responding (recovering)
2020-06-06T01:55:47   dhcpd: DHCPDISCOVER from ac:63:be:61:31:4a via vtnet0: not responding (recovering)
2020-06-06T01:55:28   dhcpd: failover peer dhcp_lan: host down
2020-06-06T01:54:26   dhcpd: DHCPACK on 10.2.0.203 to 3c:a9:f4:85:4e:44 via vtnet0
2020-06-06T01:54:26   dhcpd: DHCPREQUEST for 10.2.0.203 from 3c:a9:f4:85:4e:44 via vtnet0
2020-06-06T01:54:02   dhcpd: DHCPDISCOVER from aa:c8:aa:14:5f:0e via vtnet0: not responding (recovering)
2020-06-06T01:53:58   dhcpd: failover peer dhcp_lan: host down
2020-06-06T01:53:45   dhcpd: DHCPDISCOVER from aa:c8:aa:14:5f:0e via vtnet0: not responding (recovering)
2020-06-06T01:53:37   dhcpd: DHCPDISCOVER from aa:c8:aa:14:5f:0e via vtnet0: not responding (recovering)

In the leases view I see:

My State = recover
Peer State = unknown-state

CARP status is MASTER, and the VIP's mask matches the base subnet.
VIP setup: vhid 1, advertising frequency (base / skew) 1 / 0.
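For what it's worth, the CARP side of this can also be checked from the shell on both nodes (generic FreeBSD commands, nothing specific to my setup; vtnet0 as in the log above):

# per-VIP CARP state; one node should report MASTER, the other BACKUP
ifconfig vtnet0 | grep carp
# demotion counter (non-zero means this node is deliberately backing off) and preemption setting
sysctl net.inet.carp.demotion net.inet.carp.preempt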

OPNsense 20.1.7-amd64

Any hints?

June 07, 2020, 06:23:48 PM #2 Last Edit: June 07, 2020, 06:46:06 PM by Serius
For the past 10 days I've been trying to make CARP work, without success.

I've got pfsync, XMLRPC sync and DHCP somewhat working, but there are still plenty of issues:

  • DHCP shows almost all the IP leases as reserved. Devices get high-numbered IPs, when they get one at all. Is this normal? I've tried erasing the dhcpd.leases files and rebooting both firewalls, without effect. If the backup is offline, DHCP doesn't work -> "dhcpd: DHCPDISCOVER from xxxxx not responding (recovering)"
  • The firewall goes crazy: after putting both instances online, the backup firewall's log fills with blocks for traffic that is configured to pass. The main firewall seems OK, but from time to time it also logs several of those blocks. Every minute or so all communication on the network breaks; if left alone it comes back after 2-3 minutes, then breaks again, and so on... Listening to music from my server gets interesting. (See the quick checks right after this list.)
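The quick checks mentioned above, run on both firewalls while traffic is broken (generic commands, no assumptions about the setup beyond CARP and pf being in use):

# do both nodes claim MASTER at the same time? That would explain the flapping.
ifconfig | grep 'carp:'
# compare the state-table size on master and backup
pfctl -si | grep -i 'current entries'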

The CARP status shows normal.
I've tried to build a configuration from the config examples but it doesn't make a difference.
Is High availability broken?

If someone is willing to take a look, I could post my configurations, stripping passwords and such.

I do not know the current status, but perhaps it gives you something further to research: I seem to recall reading some time ago about problems with CARP on a LAGG. But perhaps that was only on older versions and has since been fixed; I'm afraid I cannot recall.

I can say, however, that I run a CARP cluster in the DC, without LAGG, and do not see any problems - so I don't believe it is broken.

I have been trying to get HA to work for a while but got stuck. See my post: https://forum.opnsense.org/index.php?topic=16782.0

I used the same reference documentation as you, so possibly there is a fault there.

Since you are using ESXi on (at least) one node, the following link could be interesting if you haven't made the special ESXi configuration yet: https://medium.com/@glmdev/how-to-set-up-virtualized-pfsense-on-vmware-esxi-6-x-2c2861b25931

It got me from nothing at all to something that kind of works - if you ignore the lack of DNS :-)