Messages - klosz007

#1
I 'resolved' this by reconfiguring the OVPN servers to allow multiple connections per VPN user (i.e. enabling the duplicate-cn option) and giving up on binding the OVPN clients to the CARP VIP. This way both OVPN clients can stay up on both HA nodes (primary and secondary).
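
For reference, on a plain OpenVPN server that is a one-line change (just a sketch; in OPNsense it corresponds to the equivalent checkbox in the server instance settings):

# allow multiple simultaneous sessions using the same client certificate/CN
duplicate-cn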

But the issue is certainly there, and it does not affect OVPN clients only.
An attempt to synchronize HA settings causes ALL services on the secondary node to be started, whether they are supposed to run or not.

I'm surprised no one noticed it and responded.
This is not how it worked in 24.7 and before.

For example, this also affects the iperf3 service.
By default the iperf service is not started unless an iperf instance is enabled/created on the Interfaces/Diagnostics/iperf page.
An attempt to synchronize the HA config spins up iperf on the secondary node, even though it is not running on the primary node and is not configured to run on the secondary one.



#2
Hi,

Preface:
I have a persistently connected OVPN client instance configured to my customer's network.
I want it to be highly available, so the client config is replicated between the HA nodes.
The remote OVPN server allows only one connection per user account, and I have only one account there, so only one client can be connected at a time (on either the primary or the secondary node).

I achieved that by configuring the client to follow the WAN CARP VIP.
By default the WAN CARP VIP is present on the primary node, so the OVPN client runs on the primary node and is shut down on the secondary.
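
As a side note, you can check from the shell which node currently holds the VIP (vtnet0 is just an example interface name here):

# prints MASTER on the node holding the VIP, BACKUP on the other one
ifconfig vtnet0 | grep carp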

Issue:
That worked perfectly up to 24.7.
In 25.1 it initially seemed to work well too: if the CARP VIP moves to the standby node, the client on the primary node shuts down and spins up on the secondary node, and so on. So far so good.

Unfortunately something's broken in 25.1:

If the VIP is on the primary node and I go to HA->Status on the primary node and click "Synchronize and restart services", then HA tries to (re)start the OVPN client on the secondary node (even though the VIP has not moved there).
Weirdly, this does not happen instantly but after a while.

With two OVPN clients spun up (on the primary and the secondary node), both trying to use the same account while the remote OVPN server allows only one connection per account, weird things start to happen: the clients randomly and periodically connect and disconnect, etc.

What fixes it is manually stopping the OVPN client on the standby node, or moving the VIP to the secondary node and back (by entering and leaving persistent CARP maintenance mode).

Any idea what is wrong? Could this be a bug in 25.1?

Actually I have two client instances (to two different customers) and both behave exactly the same way.
#3
True. It's for EFI+UFS installs with /boot/efi mounted only.
That is how all my other (non-OPNsense) FreeBSD installs are configured, so the fix is directly applicable to OPNsense in exactly that configuration.
Sorry for not mentioning that before.

BIOS installs may require a different fix.
The underlying problem is still the same: FreeBSD 14.x does not update the bootloader even though a newer version is available in /boot.
For UFS it's a rather cosmetic problem. I was not aware it may impact ZFS installs when the pool gets upgraded.
#4
25.1, 25.4 Production Series / Loader needs to be updated
February 24, 2025, 11:45:23 PM
Hi,

If you noticed this warning on the bootloader screen, it's a known and mostly harmless bug in FreeBSD 14: it does not upgrade the EFI bootloader in the EFI partition.
I just noticed it now on OPNsense, and I had been searching for quite some time for how to fix it on my FreeBSD 14.x installs that were upgraded from 13.x.

The fix is easy from the shell:
cp /boot/loader.efi /boot/efi/efi/freebsd/loader.efi
cp /boot/loader.efi /boot/efi/efi/boot/bootx64.efi

Then reboot and the warning is gone.
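
If you want to double-check, a minimal sanity check (assuming the ESP is mounted at /boot/efi as above) is:

# confirm the ESP is mounted where we expect
mount | grep /boot/efi
# after the copy, the staged and installed loaders should be identical
cmp /boot/loader.efi /boot/efi/efi/freebsd/loader.efi && echo loader is current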
#5
OK, I found the issue... I do not remember when I switched the WebUI port to 8443 (from 443), but apparently at that time it was required to enter the remote node's IP and port ('IP:8443') as the 'Synchronize Config' IP address to make HA work. Otherwise I would not have added the port there.

It seems the port is no longer needed, at least in 24.7. But 24.7 did not complain about the invalid format of the IP address; it just continued to work.

In 25.1 this format no longer works (yet there is still no complaint about the invalid format in the text field).
Once I switched to just the IP address in 'Synchronize Config', it still works in 24.7, and 25.1 then works as well.
It is simply no longer necessary to provide a port, and if one is provided, it breaks HA in 25.1.

#6
OK, the issue is 100% on the primary node and related to something in 25.1.

I reverted to 24.7 and set up a packet capture on the primary node, on the 'PFSYNC' interface, with a filter for the remote node's IP and WebUI port (8443).
Then I opened the HA status page, went back to the packet capture, stopped it and looked into it.
There's a lot of traffic on that interface from the primary node to the secondary node's IP/WebUI port and back (as expected).

So I then switched to VM clones that were upgraded to 25.1 (only this way can I quickly experiment with two versions on Proxmox on ZFS)
and tried the same thing again on the 'PFSYNC' interface.
This time there is no outgoing traffic from the primary to the standby node's IP/port 8443. Nothing. Zero.

So I removed the filters and tried the capture again: a lot of PFSYNC protocol traffic (as expected), so the capture itself works, but nothing matching the filter (remote node's IP / port 8443).
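
For the record, the capture I ran is roughly equivalent to this from the shell (the interface name and peer IP are placeholders for my setup):

# watch XMLRPC traffic towards the standby's WebUI port on the PFSYNC interface
tcpdump -ni vtnet2 host 192.168.100.2 and tcp port 8443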

I switched the HA settings to use my 'LAN' interface for XMLRPC instead, and set up a similar capture on the 'LAN' interface for the secondary node's 'LAN' IP/port 8443.
Again, zero outgoing traffic.

Long story short: the primary node is not even trying to talk to the remote node's IP over the WebGUI port, no matter which interface is used (so no surprise that nothing's displayed on the HA Status page).

#7
Quote from: jbernardo on February 24, 2025, 05:17:25 PM
"verify peer" was on - apparently is on by default? - and I had not generated a letsencrypt certificate for the second firewall.
PEBKAC

In my case this 'new' option is not enabled by default after the upgrade to 25.1, and still it's not working.
I tried enabling it; no better.
#8
Quote from: Patrick M. Hausen on February 24, 2025, 04:55:57 PM
Web UI listening on all interfaces as literally "recommended"?

Yes, albeit I'm not sure the primary node can access the secondary's UI (port 8443) over every possible interface/VLAN (I have a few); there might not be firewall rules for that on each of them.

The primary node can definitely access the WebUI of the secondary node over the 'PFSYNC' interface (used by pfSync and XMLRPC): this subnet has only two nodes in it (primary and secondary) and all IPv4 traffic is allowed. Telnet from the primary node to the secondary node's IP address in the 'PFSYNC' subnet on port 8443 works.

All of that has worked smoothly for the last two years or so, since I implemented HA here (by converting from a single node). Until now :-(

This might not affect all HA users, but I think something was broken by the changes made in 25.1.
#9
Quote from: Monviech (Cedrik) on February 23, 2025, 05:34:22 PM
But can you still log into the Web Interfaces of both firewalls with the same user you would use for the HA sync, after doing the upgrade?

Yes, I can log in to the WebUI with the 'hasync' user (that's my username used for XMLRPC sync; it has admin privileges) on both OPNsense instances, before and after the upgrade.
When only the standby instance is upgraded to 25.1, the primary (still at 24.7) can still access the secondary (and correctly reports that the secondary is already on 25.1).
Once the primary is upgraded to 25.1, contact between the two nodes is instantly gone.

I started checking the firewall rules on the "PFSYNC" interface (a dedicated VLAN used for pfSync and XMLRPC) and changed the to/from addresses from "PFSYNC net" to 'any' (hoping it had something to do with firewall rules): no improvement.


#10
Hi,

I'm not using automatic synchronization, but I was not sure whether I had tried to push the config from the primary to the standby in the middle of the upgrade.
So I repeated the upgrade once again, making sure this time that the config was not pushed mid-upgrade.

Unfortunately it's the same story again: when only the standby is upgraded, everything is fine; once the primary is upgraded too, it can no longer contact the standby.
CARP is working fine; something is wrong with the config synchronization only.

I tried setting up a brand new account for sync (I use a dedicated account): no improvement. I tried using the root account for sync: same thing.
From the CLI on the primary instance I verified (telnet to the standby IP / port 8443) that I was able to contact the web interface of the standby instance over the subnet used for pfSync and XMLRPC.
Note: I'm using a dedicated subnet for pfSync and config replication between the OPNsense instances, as recommended.
The web interface on both nodes runs on the non-standard port 8443.
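
The check itself was nothing fancy, something along these lines (the IP is a placeholder for my standby's address in the pfSync subnet):

# from the primary's shell: a successful connect means the WebUI port is reachable
telnet 192.168.100.2 8443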

I've currently run out of ideas... The most painful major version upgrade ever :-(

Any idea how to troubleshoot this? Which logs should I look into?
What has changed in terms of HA sync operation between 24.7 and 25.1?
Are there any new requirements for the account used for sync, or new firewall rules required between the HA nodes?



#11
Hi,

I spent the whole afternoon and evening trying to update my HA cluster from 24.7.12_4 to 25.1.

Everything done by the book:
- upgrade the standby from 24.7.12_4 to 25.1
- wait until it comes back online and verify its state
- put the primary into persistent CARP maintenance mode
- upgrade the primary from 24.7.12_4 to 25.1
- wait until it comes back online and verify its state
- disable CARP maintenance mode

With the standby already at 25.1 and the primary still at 24.7.12_4, the HA status page on the primary node is still fine: it displays the status of the services on the secondary node (and says the standby runs 25.1).
Once the primary gets upgraded to 25.1 too, the HA status page welcomes me with a yellow message: "The backup firewall is not accessible (check user credentials)".

I tried upgrading from 25.1 to 25.1.1: same problem.

Then I restored the VMs from snapshots back to 24.7 and tried to upgrade to 25.1 via a full reinstall from ISO plus config restore (same order: standby node first, then the primary): same problem right after upgrading the primary node to 25.1.

Eventually I surrendered and restored both VMs from snapshots to return to 24.7.12_4.

What has changed in 25.1? What's going on?
Does the sync user require some new privileges in 25.1?

Thanks,
Zbyszek


#12
OK, I know what you mean. I run the latest version of OPNsense (24.7.6) with paravirtualized vtnet adapters (on PVE) and I do not have this issue. When I unplug the physical network cable to the DSL modem, the OPNsense interface stays connected (the plug symbol stays green), packet loss rises in the gateway stats, and that green dot changes first to orange, then to red. I do not think the issue has anything to do with whether you use paravirtualized or physical/redirected NICs in your VM. I cannot recreate it here, so I cannot help any further, but I believe it is some kind of misconfiguration somewhere.
#13
Yes, the link state of a physical NIC connected to a Linux bridge does not propagate to the (para)virtualized (not virtual :-) NICs. That would be undesirable in most server applications, which is where we use KVM: you want the VMs to always be able to talk to each other, even if the physical NIC link goes down. You can always simulate link down on a virtualized NIC if needed, by setting the link-down option for the given NIC in the VM's options, as sketched below.
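
On Proxmox/KVM that is the link_down flag of the net device, roughly like this (a sketch only; the VM ID, model, MAC and bridge must match your existing NIC definition):

# force link down on net0 of VM 100; set link_down=0 to bring it back up
qm set 100 -net0 virtio=BC:24:11:00:00:01,bridge=vmbr0,link_down=1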

Such propagation makes sense in desktop virtualization though. I'm not sure if VMware Workstation has such an option; VirtualBox has it for sure.

I'm trying to find the place in the dashboard you are talking about...
Packet loss is a gateway statistic, not a network interface one, and it will be reported correctly: when the physical NIC link goes down, you are not able to ping your test IP even if the link on the virtualized NIC is still up. Interfaces do not have a packet-loss statistic as far as I can see; they have packets in/out, bytes in/out and errors.
#14
I'm using AdGuard at home, but I'm doing it differently: my AdGuard runs on a separate VM (not on OPNsense).
I put AdGuard in front of OPNsense, i.e. DHCP clients receive AdGuard's IP address as the primary DNS server (OPNsense's IP is the secondary, as a backup in case AdGuard fails) and talk to AdGuard, which then forwards all (already filtered) queries to OPNsense.
In OPNsense I use DNS over TLS to Cloudflare's servers, so DNS queries leave over the WAN encrypted and my cable operator cannot that easily see what I'm browsing.

I tried using Unbound in resolver mode once, but it did not work well with my dual-WAN setup.
When the primary WAN (DSL) was operational (99% of the time), the resolver worked fine.
When DSL failed and traffic went through the backup link (LTE modem), the resolver malfunctioned:
a resolver sends a lot of queries out to resolve a single (uncached) name, and that was taking too much time over the LTE/cellular link, so I got a lot of DNS timeouts.
I could not find a way to use resolver mode on the DSL link and forwarder mode on LTE, so eventually I switched from resolver to forwarder mode, and now Unbound works fine over the (slow) backup WAN link too.

Then I implemented DNS over TLS to enhance my privacy in forwarder mode.
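
Under the hood, forwarder mode with DNS over TLS boils down to an Unbound forward-zone with TLS upstreams, roughly like this (a sketch of the effective config, not what OPNsense literally generates):

server:
  # CA bundle used to validate the upstream's TLS certificate
  tls-cert-bundle: /etc/ssl/cert.pem
forward-zone:
  name: "."
  forward-tls-upstream: yes
  forward-addr: 1.1.1.1@853#cloudflare-dns.com
  forward-addr: 1.0.0.1@853#cloudflare-dns.com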
#15
So by "virtual" you mean VirtIO paravirtualized NICs plugged into a Linux bridge on the KVM hypervisor side? In that case I guess you meant "vtnet", not "vnet"? (That's how these are visible in FreeBSD, and thus in OPNsense.) I use these on my PVE OPNsense VMs.

I have never paid attention to how they behave in OPNsense, but my understanding is that it's the same as with PCIe passthrough with SR-IOV: the driver always reports their link state as up and does not replicate the state of the physical NIC connected to the underlying Linux bridge. So they will always show up as green in OPNsense, unless you forcibly/manually put the link down on such a virtualized NIC in the KVM VM's configuration.

I'm surprised there were any changes around 24.7.6 in this regard.

Gateway state reporting will still work fine, because if the physical link dies, you are not able to ping the gateway anymore (assuming you are monitoring a gateway IP behind the physical cable and not, for example, another VM's IP).

If the VirtIO link went down whenever the physical link of the NIC connected to the bridge went down, you would lose the ability to talk to other VMs on the same bridge whenever the physical link drops. I'm guessing in most cases that is undesirable.

I have seen an option to replicate the physical NIC link state to the virtualized NIC in standalone desktop hypervisors (e.g. VirtualBox), but not in KVM.
With KVM, if you want the physical link state replicated to the NIC in the VM, my best guess would be to stay with PCIe passthrough and effectively reassign the physical NIC from the hypervisor to the VM.