HA The backup firewall is not accessible (check user credentials)

Started by klosz007, February 22, 2025, 10:04:08 PM

Hi,

I spent the whole afternoon and evening trying to upgrade my HA cluster from 24.7.12_4 to 25.1.

Everything was done by the book:
- upgrade standby from 24.7.12_4 to 25.1
- wait until it comes online and verify state
- put primary into persistent CARP maintenance mode
- upgrade primary from 24.7.12_4 to 25.1
- wait until it comes online and verify state
- disable CARP maintenance mode

While the standby is already at 25.1 and the primary is still at 24.7.12_4, the HA status page on the primary node is still fine: it displays the status of services on the secondary node (and says the standby runs 25.1).
Once the primary is upgraded to 25.1 too, the HA status page greets me with a yellow message: "The backup firewall is not accessible (check user credentials)".

Tried upgrading from 25.1 to 25.1.1 - same problem.

Then I restored VMs from snapshots to 24.7 and tried to upgrade to 25.1 via 'full reinstall from ISO' + config restore (same order - standby node first, then primary one) - same problem just after upgrading primary node to 25.1.

Eventually I surrendered and restored both VMs from snapshots to return to 24.7.12_4.

What has changed in 25.1? What's going on?
Does the sync user require some new privileges in 25.1?

Thanks,
Zbyszek



If you run automatic configuration synchronization via cron job, disable that before doing the upgrade.

Do not sync the configuration manually or automatically until both nodes have the same version.

Try to see if that helps.
Hardware:
DEC740

Hi,

I'm not using automatic synchronization. However, I was not sure whether I had pushed the config from primary to standby in the middle of the upgrade.
So I repeated the upgrade once more, this time making sure the config was not pushed in the middle of the upgrade.

Unfortunately, it's the same story again: when only the standby is upgraded, everything is fine; once the primary is upgraded too, it can no longer contact the standby.
CARP is working fine; only the config synchronization is broken.

I tried setting up a brand new account for sync (I use a dedicated account) - no improvement. I tried using the root account for sync - same thing.
From the CLI on the primary instance I verified (telnet to the standby IP, port 8443) that I can reach the web interface of the standby instance over the subnet used for pfsync and xmlrpc.
Note: I'm using a dedicated subnet for pfsync and config replication between the OPNsense instances, as recommended.
The web interface on both nodes runs on the non-standard port 8443.
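
For anyone who wants to repeat the telnet check described above without telnet, here is a minimal Python sketch. It only tests whether a TCP connection to the peer's WebUI port can be opened. The address 10.0.0.2 is a made-up placeholder for the standby's IP on the sync subnet, and the function name is hypothetical, not part of OPNsense.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection performs the TCP handshake and raises OSError
        # (refused, unreachable, timed out) when the port cannot be reached.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Placeholder values: replace with the standby's sync-subnet IP and WebUI port.
    standby_ip = "10.0.0.2"
    webui_port = 8443
    state = "reachable" if tcp_port_open(standby_ip, webui_port) else "NOT reachable"
    print(f"{standby_ip}:{webui_port} is {state}")
```

A successful result only confirms the network path and the listening port. It says nothing about whether the HA sync itself will attempt or complete a request.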

I've currently run out of ideas... Most painful major version upgrade ever :-(

Any idea how to troubleshoot this? Which logs should I look into?
What has changed in terms of HA sync operation between 24.7 and 25.1?
Are there any new requirements for the account used for sync, or new firewall rules required between HA nodes?




But can you still log into the Web Interfaces of both firewalls with the same user you would use for the HA sync, after doing the upgrade?

I want to know if the scope is only related to the HA sync or if there are WebGUI login issues in general, e.g., with the root account.

What has changed is that the user manager has been rewritten to mvc.

I don't know if this is the same issue, but I am trying to set up high availability for the first time and am stuck with this error as well. Both firewalls are at version 25.1.1, both have the same user/password pair in the admins group with privileges on all pages, and ping works between both firewalls on the pfSync interface. Still, I get "HA The backup firewall is not accessible (check user credentials)" as soon as I try to sync the configuration.

Web UI listening on all interfaces as literally "recommended"?
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

Quote from: Patrick M. Hausen on February 24, 2025, 04:55:57 PM
Web UI listening on all interfaces as literally "recommended"?

Yes, just rechecked, on both firewalls (System->settings->administration->Listen Interfaces).

"verify peer" was on - apparently is on by default? - and I had not generated a letsencrypt certificate for the second firewall.
PEBKAC
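
A rough way to see in advance whether certificate verification against the peer would pass is to attempt a verified TLS handshake from the other node. The sketch below uses Python's standard `ssl` module with the system trust store. Whether OPNsense's "verify peer" option checks against the same trust store and with the same hostname rules is an assumption here, and the function name and the address 10.0.0.2 are placeholders.

```python
import socket
import ssl


def peer_cert_verifies(host: str, port: int, timeout: float = 3.0):
    """Try a verified TLS handshake; return (ok, reason)."""
    # Default context: certificate and hostname checks against the system CAs.
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host):
                return True, "certificate verified"
    except ssl.SSLCertVerificationError as exc:
        # Typical for self-signed certificates or a name/IP mismatch.
        return False, f"certificate verification failed: {exc.verify_message}"
    except (ssl.SSLError, OSError) as exc:
        return False, f"connection or TLS error: {exc}"


if __name__ == "__main__":
    # Placeholder values: the peer's address as entered in the HA settings.
    ok, reason = peer_cert_verifies("10.0.0.2", 443)
    print(ok, reason)
```

If this reports a verification failure, the peer most likely still presents a self-signed certificate or one that does not cover the name or IP being used.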

Quote from: Monviech (Cedrik) on February 23, 2025, 05:34:22 PM
But can you still log into the Web Interfaces of both firewalls with the same user you would use for the HA sync, after doing the upgrade?

Yes, I can log in to the WebUI with the 'hasync' user (that's the username I use for xmlrpc sync; it has admin privileges) on both OPNsense instances, before and after the upgrade.
When only the standby instance is upgraded to 25.1, the primary (still at 24.7) can still access the secondary (and correctly says the secondary is already on 25.1).
Once the primary is upgraded to 25.1, contact between the two nodes is gone instantly.

I started checking the firewall rules on the "PFSYNC" interface (dedicated VLAN, used for pfSync and xmlrpc) and changed the to/from addresses from "PFSYNC net" to 'any' (hoping it had something to do with firewall rules) - no improvement.



Quote from: Patrick M. Hausen on February 24, 2025, 04:55:57 PM
Web UI listening on all interfaces as literally "recommended"?

Yes. Albeit I'm not sure if primary node will be able to access secondary's UI (port 8443) over each possible interface/VLAN (I have a few). There might not be firewall rules for that on each interface/VLAN.

The primary node can definitely access the WebUI of the secondary node over the 'PFSYNC' interface (used by pfSync and XMLRPC) - this subnet has only two nodes in it (primary and secondary) and all IPv4 traffic is allowed. Telnet from the primary node to the secondary node's IP address in the 'PFSYNC' subnet, port 8443, works.

All of that has worked smoothly for about the last two years, since I implemented HA here (by converting from a single node). Until now :-(

This might not affect all HA users, but I think something was broken by the changes made in 25.1.

Quote from: jbernardo on February 24, 2025, 05:17:25 PM
"verify peer" was on - apparently is on by default? - and I had not generated a letsencrypt certificate for the second firewall.
PEBKAC

In my case this 'new' option is not enabled by default after upgrade to 25.1, still it's not working.
Tried to enable it - not better.

Ok, the issue is 100% in the primary node and something related to 25.1.

I reverted back to 24.7 and set up packet capture on primary node, on the 'PFSYNC' interface with filter for remote node's IP and WebUI port (8443).
Then entered HA status page. Went back to packet capture, stopped it and looked into it.
There's a lot of traffic on that interface from primary node to secondary's node IP/WebUI port and back (as expected).

So then switched to VM clones that were upgraded to 25.1 (only this way I can quickly experiment with two versions on Proxmox on ZFS).
And tried the same thing again on 'PFSYNC' interface.
Then there's no traffic outgoing from primary to standby node's IP/port 8443. Nothing. Zero.

So I removed the filters and tried the capture again - a lot of PFSYNC protocol traffic (as expected), so the capture itself works, but nothing matches the filter (remote node's IP / port 8443).

I switched the HA settings to use my 'LAN' interface for XMLRPC instead, and set up a similar capture on the 'LAN' interface for the secondary node's 'LAN' IP / port 8443.
Again zero outgoing traffic.

Long story short - primary node is not even trying to talk to remote node's IP over WebGUI port, no matter which interface is used (so no surprise nothing's displayed in HA Status page).


Are you using unicast or multicast for pfsync?

OK, I found the issue... I do not remember when I switched the WebUI port to 8443 (from 443), but apparently at that time it was required to enter the remote node's IP and port ('IP:8443') as the 'Synchronize Config' IP address to make HA work. Otherwise I would not have added the port there.

It seems the port is no longer needed, at least as of 24.7. But 24.7 did not complain about an invalid IP address format or anything else; it simply continued to work.

In 25.1 this format no longer works (and there is still no complaint about an invalid format in the text field).
Once I switched to just the IP address in 'Synchronize Config', it still works in 24.7, and 25.1 works as well.
It's simply no longer necessary to provide a port. And if one is provided, it breaks HA in 25.1.
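
Since the GUI does not flag the 'IP:port' form, a quick sanity check of the value outside OPNsense can help. The following is only a sketch of such a check in Python. It does not reproduce how OPNsense parses the field (that is not known from this thread); the function name and the sample addresses are hypothetical.

```python
import ipaddress


def classify_sync_target(value: str) -> str:
    """Classify a 'Synchronize Config' style value.

    Returns "bare-ip", "ip-with-port" or "other" (e.g. a hostname).
    """
    value = value.strip()
    try:
        ipaddress.ip_address(value)  # accepts IPv4 and IPv6 literals
        return "bare-ip"
    except ValueError:
        pass

    # Not a plain IP: look for a trailing ":<digits>" after an IP literal.
    host, sep, port = value.rpartition(":")
    if sep and port.isdigit():
        try:
            ipaddress.ip_address(host.strip("[]"))  # allow "[v6]:port"
            return "ip-with-port"
        except ValueError:
            pass
    return "other"


if __name__ == "__main__":
    # Sample values are placeholders, not addresses from this thread.
    for candidate in ("10.0.0.2", "10.0.0.2:8443", "[fd00::2]:8443"):
        print(candidate, "->", classify_sync_target(candidate))
```

According to the finding above, a value classified as "ip-with-port" is the form that stops working in 25.1, so it should be reduced to the bare IP.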