Updating firmware on a HA pair, and no outbound comms from secondary unit

Started by RussM, February 05, 2025, 08:30:04 PM

I am new to OPNsense, and so far I am pleased.  I have been building up a HA pair of DEC3842 appliances in a lab environment prior to actual deployment, which will replace an old Cisco firewall at one of my employer's facilities.  The appliances are running 24.10.1 (Business Edition), and now that 24.10.2 is out, I wish to update to that.  I am attempting to follow this guidance at https://docs.opnsense.org/manual/how-tos/carp.html:

Updating a CARP HA Cluster

Running a redundant Active/Passive cluster leads to the expectation to have zero downtime. To keep the downtime at a minimum when running updates just follow these steps:

  • Update your secondary unit and wait until it is online again
  • On your primary unit go to Interfaces ‣ Virtual IPs ‣ Status and click Enter Persistent CARP Maintenance Mode
  • Your secondary unit is now MASTER, check if all services like DHCP, VPN, NAT are working correctly
  • If you ensured the update was fine, update your primary unit and hit Leave Persistent CARP Maintenance Mode

I am assuming that no prior actions are needed, such as disabling CARP on the secondary unit first.  But the issue I have is that checking firmware status fails on the secondary unit...

***GOT REQUEST TO CHECK FOR UPDATES***
Currently running OPNsense 24.10.1 (amd64) at Wed Feb  5 13:58:05 EST 2025
Strict TLS 1.3 and CRL checking is enabled.
Fetching subscription information, please wait... fetch: transfer timed out
Fetching changelog information, please wait... fetch: transfer timed out
Updating OPNsense repository catalogue...
pkg: https://opnsense-update.deciso.com/${SUBSCRIPTION}/FreeBSD:14:amd64/24.10/latest/meta.txz: No address record
repository OPNsense has no meta file, using default settings

Subsequent diagnostics reveal that I cannot ping or nslookup in the outside direction from the secondary unit. Ping to an internal address and nslookup pointed at an internal DNS server do work.  A reddit post about a similar issue suggested enabling the Deny Service Binding option on the outside CARP Virtual IP interfaces, but that did not help.

I have no doubt I could get the updates installed by alternate methods, but that will be disruptive; this is not an issue while in a lab environment, but after these have been deployed, I'd really like to be able to do this as seamlessly as possible.

Any suggestions?

Both units need their own static IP address, default gateway, and working DNS. DNS can be provided by e.g. Unbound running on the firewall with 127.0.0.1 configured as the DNS server, or by just keeping the default, which will use the local resolver anyway.

CARP addresses come on top of that. So you need at least a static /29 on WAN for a proper HA setup. Conveniently that accommodates 6 addresses - 3 for the provider gateway in a redundant 2-chassis plus HA fashion, and 3 for your two firewalls plus CARP.
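
As a quick check of the /29 arithmetic, here is a minimal Python sketch using the standard ipaddress module. The 203.0.113.32/29 prefix is a placeholder from the documentation range, chosen only because its last octets line up with the obfuscated addresses in this thread; it is not the real WAN subnet.

```python
import ipaddress

# Placeholder WAN subnet (RFC 5737 documentation range), not the poster's real prefix.
wan = ipaddress.ip_network("203.0.113.32/29")

# hosts() excludes the network and broadcast addresses of the subnet.
usable = list(wan.hosts())

print(len(usable))            # number of assignable addresses
print(usable[0], usable[-1])  # first and last assignable address
```

This prints 6 and then 203.0.113.33 203.0.113.38, which is the room needed for up to three provider-side addresses plus two firewall addresses and one CARP VIP.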
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

I have all that configured as part of the fundamental HA setup.  I can ping & access the secondary unit GUI by both its internal and external "real" IP addresses regardless of the HA state (master or backup), and also by the inside and outside CARP virtual IPs when it is Master, which indicates to me that the essential network settings are correct. I do have direct access to the secondary unit by actual configured IP address from other subnets both internally and externally, so this does not appear to be a gateway or other routing issue.

As added info... I just shut down the primary unit, making the secondary unit Master.  Ping and nslookup to external points then worked, and the firmware update check worked.  After I brought the primary unit back online, ping, nslookup, and consequently the firmware update check failed again.

Really, this all boils down to determining and resolving why the secondary unit cannot ping or do DNS lookups outbound while it is in the backup/secondary state.



Check what the backup system thinks it should be using as a recursive DNS server.

cat /etc/resolv.conf

Quote from: Patrick M. Hausen on February 06, 2025, 12:08:21 AM
Check what the backup system thinks it should be using as a recursive DNS server.

cat /etc/resolv.conf

It's not a DNS issue.  I cannot ping any external IP address, not even the outside gateway address, while the secondary is in standby. No DNS query is involved.

ifconfig
netstat -rn

please.

Annotated netstat outputs below.  I do not see anything out of sorts.... some routes present on the primary unit are not present on the secondary, which I expected.  I have obfuscated the OUTSIDE IP addresses, as these are the actual public IPs which will be used when deployed.  Currently, this lab setup is behind another firewall, so using public IPs is no harm/no foul ;)

ACTIVE/MASTER

netstat -rn
Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
default            ###.###.###.33     UGS         ax0 // OUTSIDE GATEWAY [upstream router in lab]
1.1.1.1            ###.###.###.33     UGHS        ax0 // System DNS entry
8.8.8.8            ###.###.###.33     UGHS        ax0 // System DNS entry
10.175.175.0/30    link#4             U          igc3 // HASYNC subnet
10.175.175.1       link#7             UHS         lo0 // HASYNC interface IP
127.0.0.1          link#7             UH          lo0
172.16.1.0/24      172.30.1.10        UGS         ax1 // static route to an internal network via L3 switch
172.18.0.0/23      172.30.1.10        UGS         ax1 // static route to an internal network via L3 switch
172.30.1.0/24      link#6             U           ax1 // INSIDE subnet
172.30.1.1         link#7             UHS         lo0 // INSIDE interface CARP Virtual IP
172.30.1.2         link#7             UHS         lo0 // INSIDE interface IP
192.168.1.0/24     link#1             U          igc0 // LOCALMGMT [default LAN (kept this default in place, will use for emergency/OOB access)]
192.168.1.1        link#7             UHS         lo0 // LOCALMGMT [default LAN (kept this default in place, will use for emergency/OOB access)]
###.###.###.32/29  link#5             U           ax0 // OUTSIDE subnet
###.###.###.34     link#7             UHS         lo0 // OUTSIDE interface CARP Virtual IP
###.###.###.35     link#7             UHS         lo0 // OUTSIDE interface IP
###.###.###.37     link#7             UHS         lo0 // 1:1 NAT to an internal test server

STANDBY/BACKUP

netstat -rn
Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
default            ###.###.###.33     UGS         ax0  // OUTSIDE GATEWAY [upstream router in lab]
1.1.1.1            ###.###.###.33     UGHS        ax0  // System DNS entry
8.8.8.8            ###.###.###.33     UGHS        ax0  // System DNS entry
10.175.175.0/30    link#4             U          igc3  // HASYNC subnet
10.175.175.2       link#7             UHS         lo0  // HASYNC interface IP
127.0.0.1          link#7             UH          lo0
172.16.1.0/24      172.30.1.10        UGS         ax1  // static route to an internal network via L3 switch
172.18.0.0/23      172.30.1.10        UGS         ax1  // static route to an internal network via L3 switch
172.30.1.0/24      link#6             U           ax1  // INSIDE subnet
                                                       // INSIDE interface CARP Virtual IP not present in route table 
172.30.1.3         link#7             UHS         lo0  // INSIDE interface IP
192.168.1.0/24     link#1             U          igc0  // LOCALMGMT [default LAN (kept this default in place, will use for emergency/OOB access)]
192.168.1.1        link#7             UHS         lo0  // LOCALMGMT [default LAN (kept this default in place, will use for emergency/OOB access)]
###.###.###.32/29  link#5             U           ax0  // OUTSIDE subnet
                                                       // OUTSIDE interface CARP Virtual IP not present in route table 
###.###.###.36     link#7             UHS         lo0  // OUTSIDE interface IP
     


'diff' on the ifconfig outputs shows them to be identical, except for expected differences: MAC addresses, IP addresses, and CARP states.  Attached as files.

And thank you for your assistance!

Layer 3 looks perfectly ok. I was looking for a missing static IP address in the uplink subnet on the secondary or something like that. Not the case here.

You wrote you cannot ping the gateway from the secondary firewall. Can you ping the primary from the secondary? Assuming you are allowing ICMP echo in on every interface which IMHO you should. Ping is an indispensable debugging tool.

How are the devices wired? I mean network topology, intermediate switch(es) if present etc.

ping from primary unit to the outside gateway address: pass
ping from primary unit to secondary unit's outside IP: pass
ping from primary unit to the inside gateway address: pass

ping from secondary unit to primary unit's outside IP: fail
ping from secondary unit to the outside CARP VIP: fail

And for good measure, I did the same for the INSIDE interface:

ping from primary unit to the inside gateway address: pass
ping from primary unit to secondary unit's inside IP: pass
ping from secondary unit to primary unit's inside IP: pass
ping from secondary unit to the inside CARP VIP: pass

This just confirms my previous findings.... there are no IP unicast comms in the outbound direction from the secondary appliance.  I am using the default multicast (224.0.0.18) for CARP, so it seems that is working. 

Hmmm.. while writing the above about multicast, I had an idea... I changed the CARP peer IPs in both units to be the actual outside IP address of the other unit, then did HA/CARP testing - that all still works.


Hi,
just a thought:
Is there a NAT rule on the second machine that converts outgoing traffic to the virtual CARP address, which is not active on the second machine?

Quote from: Ralf Kirmis on February 07, 2025, 09:32:44 AM
Is there a NAT rule on the second machine that converts outgoing traffic to the virtual CARP address, which is not active on the second machine?

Great thought!

Quote from: Ralf Kirmis on February 07, 2025, 09:32:44 AM
Hi,
just a thought:
Is there a NAT rule on the second machine that converts outgoing traffic to the virtual CARP address, which is not active on the second machine?

All NAT rules are identical.

I am at the point where I am seriously considering blowing the secondary unit away, cloning the primary to it, and then converting it back to being the secondary.

Quote from: RussM on February 07, 2025, 05:28:39 PM
All NAT rules are identical.

Please show your outbound NAT rules. If you use source = any and NAT everything to the CARP address that would explain your observed behaviour.

You, Kind Sir, are a lifesaver.  I had configured outbound NAT in accordance with the Configuring CARP documentation, so I had automatic rules disabled and one manually-defined rule:

OUTSIDE/any/*/*/*/outside VIP

Your post made me realize what I needed to do.  I added a new rule above that one so it is matched first:

OUTSIDE/This Firewall/*/*/*/OUTSIDE address

I then tested outbound comms from the secondary unit and confirmed that ping, nslookup, and then firmware status & update checks all worked. I then ssh'd into a couple of machines on the main network (upstream of the OPNsense pair) and verified that the source NAT address is in fact the OUTSIDE Virtual IP.

It seems like defining that rule should be specified as a requirement in the HA/CARP docs.  Without that rule, the instructions in the Updating a CARP HA Cluster section in the Configuring CARP doc will not work... it was trying to follow that procedure that got me going down this rabbit hole.
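
For anyone who wants to see why rule order matters here, the following is a minimal Python sketch of first-match outbound NAT. It is a toy model, not OPNsense or pf code: the NatRule class, the "any" / "this_firewall" labels, and the 203.0.113.x addresses are all placeholders I made up for illustration (the real addresses are obfuscated above).

```python
from dataclasses import dataclass

# Placeholder addresses (RFC 5737 documentation range), standing in for the obfuscated ones.
VIP = "203.0.113.34"        # OUTSIDE CARP VIP, answered only by the current MASTER
PRIMARY = "203.0.113.35"    # primary's own OUTSIDE address
SECONDARY = "203.0.113.36"  # secondary's own OUTSIDE address


@dataclass
class NatRule:
    source: str        # "any" or "this_firewall" (hypothetical labels)
    translate_to: str  # "vip" or "interface_address"


def translated_source(rules, origin, packet_src, own_addr):
    """Return the source address after outbound NAT; the first matching rule wins."""
    for rule in rules:
        if rule.source in ("any", origin):
            return VIP if rule.translate_to == "vip" else own_addr
    return packet_src  # no rule matched, packet leaves untranslated


def reply_arrives_at(translated_src, master_addr):
    """Replies to the VIP land on the MASTER; replies to a real address land on its owner."""
    return master_addr if translated_src == VIP else translated_src


original_rules = [NatRule("any", "vip")]
fixed_rules = [NatRule("this_firewall", "interface_address")] + original_rules

# The secondary (BACKUP) sends a ping from its own OUTSIDE address while the primary is MASTER.
for name, rules in (("original", original_rules), ("fixed", fixed_rules)):
    src = translated_source(rules, "this_firewall", SECONDARY, SECONDARY)
    dest = reply_arrives_at(src, master_addr=PRIMARY)
    print(f"{name}: source {src}, reply arrives at {dest}")
```

With only the original rule, the secondary's own traffic leaves with the VIP as its source, so the reply goes to the primary and the secondary never sees it. With the This Firewall rule placed first, the reply comes back to the secondary, while traffic from internal hosts still falls through to the second rule and is translated to the VIP.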

Alternatively, instead of:

OUTSIDE/any/*/*/*/outside VIP

this one:

OUTSIDE/internal networks/*/*/*/outside VIP
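
A rough Python sketch of why this variant also works, assuming "internal networks" stands for a list or alias covering the INSIDE subnet and the two statically routed networks from the routing table above (that selection is my assumption, and 203.0.113.36 is a placeholder for the secondary's obfuscated OUTSIDE address):

```python
import ipaddress

# Assumed contents of "internal networks", taken from the routing table in this thread.
INTERNAL_NETWORKS = [
    ipaddress.ip_network(n)
    for n in ("172.30.1.0/24", "172.16.1.0/24", "172.18.0.0/23")
]


def matches_internal(src):
    """True if the source address falls inside any of the internal networks."""
    addr = ipaddress.ip_address(src)
    return any(addr in net for net in INTERNAL_NETWORKS)


print(matches_internal("172.16.1.50"))   # an internal host
print(matches_internal("203.0.113.36"))  # the secondary's own OUTSIDE address (placeholder)
```

The first call prints True and the second prints False: traffic the firewall itself sends from its OUTSIDE address does not match the rule, so it is not translated to the VIP and no separate This Firewall rule is needed.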