High availability sync appears to have stopped working but CARP still fine

Started by sesquipedality, September 23, 2022, 05:37:58 PM

Previous topic - Next topic
As of relatively recently, my HA setup has stopped working.  When I try to access "High Availability" -> "Status" I get an error message:

    The backup firewall is not accessible or not configured.

CARP is still working fine, which is why I hadn't noticed and can't say when this began.  There are no firewall rules preventing traffic on the direct ethernet link between the two firewalls.  Can anyone suggest how I might investigate / fix this?

Thanks

Still looking for some help on this.  Even being pointed at where I might find some useful diagnostic logs as to why the link is not operative would be a help.

Quote from: sesquipedality on September 23, 2022, 05:37:58 PM
There are no firewall rules preventing traffic on the direct ethernet link between the two firewalls.
But is there a firewall rule allowing all traffic on the direct ethernet link?

OPNsense like any reasonable firewall is "default deny".
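
If you want to check that from a console shell, something like this will do (igb2 is just an example name for the dedicated HA link, substitute your own interface):

# pfctl -sr | grep igb2
# tcpdump -ni igb2

The first shows the rules pf has actually loaded for that interface; the second lets you watch the link while the primary retries the sync.
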
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

Thanks for the suggestion.  Yes, there is.  This is a previously working config that appears to have stopped working at some point.  I did have to reinstall the primary server at one point and did so using the USB stick config transfer method.  No passwords have changed.  The problem is that the diagnostic message I'm getting is so non-specific as to leave me lost as to how to even investigate what's not working.

IP address and credentials of the secondary as configured on the master are definitely OK?
Web UI on the secondary enabled on the dedicated HA network?
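
You can also test from the primary's console whether the secondary's web GUI answers on the HA link at all (address and port here are examples; use the sync address and GUI port you actually configured):

# nc -vz 192.0.2.2 443

If that cannot connect, the sync will fail no matter what credentials are set.
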
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

I am seeing this same issue after upgrading from 22.1 to 22.7.6. It actually looks like everything is still working and failover works; it's just something with the sync.

This seems to be related to this issue here:
https://forum.opnsense.org/index.php?topic=29521.0

I was able to reproduce the issue by rolling back to snapshots I had; it happens every time I upgrade to 22.7.6.

I rolled back again (to 22.1.10) and then upgraded again, and everything was still broken, but the error changed from the parsing error mentioned in the other post to "host down".

I disabled and re-enabled the interfaces on both the OPNsense and VMware sides, and now everything is working on 22.7.6.
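
For anyone who would rather bounce the interface from the shell than from the GUI, something along these lines should do it (vmx1 is only an example name for the sync NIC):

# ifconfig vmx1 down
# ifconfig vmx1 up

plus disconnecting and reconnecting the corresponding network adapter on the VMware side.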

Quote from: sesquipedality on October 21, 2022, 04:28:22 PM
Thanks for the suggestion.  Yes, there is.  This is a previously working config that appears to have stopped working at some point.  I did have to reinstall the primary server at one point and did so using the USB stick config transfer method.  No passwords have changed.  The problem is that the diagnostic message I'm getting is so non-specific as to leave me lost as to how to even investigate what's not working.

Log into the console and run this:
# /usr/local/etc/rc.filter_synchronize

What's the output?

Sorry for the delayed reply - got busy with other stuff and this got put on the back burner.

The output is:

root@<host>:~ # /usr/local/etc/rc.filter_synchronize
send >>>
Host: 192.168.66.4
User-Agent: XML_RPC
Content-Type: text/xml
Content-Length: 117
Authorization: Basic cm9vdDpQaWJqSXBzSUxwVEFmNHlZOTZ4Uw==
<?xml version="1.0"?>
<methodCall>
<methodName>opnsense.firmware_version</methodName>
<params>
</params></methodCall>received >>>
error >>>
fetch error. remote host down?root@fenchurch:~ # send >>>
Missing name for redirect.
<methodName>opnsense.firmware_version</methodName>
<params>
</params></methodCall>received >>>
error >>>
fetch error. remote host down?


This did enable me to discover that I wasn't able to traceroute/ping the backup interface from the main interface.  I went through all my firewall rules to try to work out what was wrong, and the only difference I could find was that, for entirely inexplicable reasons, some automatic outbound NAT rules were being generated for the backbone interface on the primary router (perhaps because the primary router is configured to route over the backbone if its primary internet connection goes down).  These rules appeared even after outbound NAT was manually disabled for that interface, and I confirmed that the problem persisted with the outbound rules disabled.
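
(For anyone chasing the same thing: the outbound NAT rules actually loaded into pf can be checked from the console with, for example,

# pfctl -sn | grep igb2

where igb2 stands in for the backbone interface.  That makes it easy to confirm whether a stray automatic rule is still active after disabling outbound NAT in the GUI.)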

In any event, having been through all that and disabled and re-enabled the gateways, I am now at a point where ping, ssh and http over the backbone are working again, and so sync is back up and running.  Subsequent runs of the sync script are not producing an error, and my sync menu is now back.  I do not know, and probably never will, why traceroute over the backbone worked from the secondary but not from the primary router.  Thanks for your help with this.  I do wish routers were a little less "black box" sometimes.