Disclaimer
This is not a true highly available solution, but in my testing it automatically recovers my virtual firewall within about 5 minutes of a host failure. For me that is perfectly acceptable. Sometimes I lose remote access to my network (via WireGuard) when some strange bug takes down my firewall host, which can be a problem if I don't have physical access to reboot it. I'm pretty sure I've traced that issue to sketchy hard drives in my surplus eBay Proxmox nodes. I have just replaced them with brand new drives, so hopefully the root cause is fixed, but in any case, as currently configured, the Proxmox HA tools can recover my firewall (and with it my VPN) to an online node in the event of a node failure while I'm far from home. There are still multiple single points of failure in my setup, but I'm OK with that since I don't want to pay for a second public IP address from a second ISP. I've been looking for a way to do this for over a year and stumbled across the solution by accident while rebuilding my Proxmox cluster. In hindsight it's obviously simple, but somehow I hadn't put the right combination of keywords into Google to get pointed in the right direction. Hopefully sharing this will give someone in a similar position the right ideas to start working on their own solution.

Hardware
AT&T BGW320
- Running in passthrough mode and connected to a Mikrotik CRS309 switch
- Side note, this device is awful in every way
Mikrotik CRS309
- Connected to BGW320
- Tags the incoming WAN with a VLAN ID; in this case I'm using 99
- Ports 1-3 are each connected to the same port of the Supermicro NIC in each Proxmox host; if you do not use the same port this probably will not work
Proxmox Hosts
- 3x identical Lenovo M920q, each with a Supermicro AOC-STGN-i2S network card
- Proxmox 8.3 installed using ZFS automatic partitions from the GUI install on a single 1TB drive

Setup
Using your favorite tutorial or guide, set up VLAN tagging for the incoming WAN on your switch, then configure your virtual OPNsense firewall in Proxmox. Its WAN will need to be configured to use the VLAN that the switch is assigning to the incoming WAN traffic. Both of these steps have been covered by people much smarter with networking than I am, so I won't discuss them in detail here.
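
If you prefer doing the Proxmox side from a script, here is a minimal sketch using the third-party proxmoxer Python library (pip install proxmoxer requests) that tags the OPNsense VM's WAN NIC with the WAN VLAN on the bridge. The hostname, credentials, VMID 100, bridge vmbr0 and NIC slot net1 are placeholders for my assumptions about a typical setup, and tagging on the Proxmox bridge instead of inside OPNsense is just one way to do it.

```python
# Minimal sketch using proxmoxer; hostname, credentials, VMID 100,
# bridge vmbr0 and NIC slot net1 are placeholders, adjust to your setup.
from proxmoxer import ProxmoxAPI

pve = ProxmoxAPI("pve1.example.lan", user="root@pam",
                 password="CHANGE_ME", verify_ssl=False)

# (Re)define the VM's second NIC as the WAN interface, tagged with
# VLAN 99 on the bridge so the guest only sees the tagged WAN traffic.
pve.nodes("pve1").qemu(100).config.put(net1="virtio,bridge=vmbr0,tag=99")
```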

If you haven't already, cluster your Proxmox nodes; three are needed for quorum.
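
If you want to sanity-check quorum from a script instead of the GUI, something like this proxmoxer sketch reads roughly the same information the Datacenter summary (or pvecm status) shows. Hostname and credentials are placeholders.

```python
# Quick quorum check via the Proxmox API (proxmoxer); placeholders as before.
from proxmoxer import ProxmoxAPI

pve = ProxmoxAPI("pve1.example.lan", user="root@pam",
                 password="CHANGE_ME", verify_ssl=False)

status = pve.cluster.status.get()
cluster = next(e for e in status if e["type"] == "cluster")
nodes = [e for e in status if e["type"] == "node"]
online = sum(1 for n in nodes if n.get("online"))
print(f"cluster {cluster['name']}: quorate={cluster['quorate']}, "
      f"{online}/{len(nodes)} nodes online")
```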

Select your "OPNsense VM" -> "Replication" -> "Add". Configure the replication job to each remaining node in the cluster, this will ensure that there is a local copy of the VM drive on each node in the event it needs to start the container. In my case I have 3 nodes so there will be 2 replication jobs. I arbitrarily chose to replicate every 2 hours, meaning at worst I'll lose the last 2 hours if the VM is recovered right before a scheduled replication job. During my initial setup I frequently recovered my firewall without issue from snapshots that were over 24 hours old when I broke something so I think this is fine. If anyone knows better I'd love to learn why this is a bad option.

Under Datacenter -> HA -> Groups, create a group with all 3 nodes and nofailback selected. Selecting nofailback keeps the VM from immediately migrating back to the original node if it comes back online.
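
For reference, here is the equivalent API call as a proxmoxer sketch; the group name fw-ha and the node names are placeholders:

```python
# Sketch of creating the HA group via the API (proxmoxer).
from proxmoxer import ProxmoxAPI

pve = ProxmoxAPI("pve1.example.lan", user="root@pam",
                 password="CHANGE_ME", verify_ssl=False)

# nofailback=1 stops the VM from migrating back as soon as the
# failed node rejoins the cluster.
pve.cluster.ha.groups.post(group="fw-ha", nodes="pve1,pve2,pve3",
                           nofailback=1)
```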

Select "Datacenter" -> "HA" -> "Add" and configure your replication job. I used MaxRestart=3 and MaxRelocate=2, if I understand the documentation this will attempt to start the VM 3 times on the node chosen by the Proxmox HA tools, if that fails it will relocate to another node up to 2 more times with 3 restarts per relocation event.

Test the HA recovery. The easiest way I found to do this is to simply unplug the connection between the switch and the Proxmox node to force it offline. In my case the cluster recovers and has a new OPNsense VM up and running in about 5 minutes. I've only tested this with ZFS; I think it should work with other Proxmox storage options, but I'll leave further testing to someone else.
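
To put a number on the outage, I'd run something like this little ping loop from a machine that stays up during the test; the firewall's LAN address is a placeholder:

```python
# Rough downtime timer for the failover test: ping the firewall's LAN
# address once a second and report how long it was unreachable.
# Uses the Linux iputils ping; stop with Ctrl+C.
import subprocess
import time

FIREWALL_IP = "192.168.1.1"  # placeholder LAN address of the OPNsense VM

def is_up(ip):
    # Single ICMP echo with a 1-second timeout.
    return subprocess.run(["ping", "-c", "1", "-W", "1", ip],
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL).returncode == 0

down_since = None
while True:
    if is_up(FIREWALL_IP):
        if down_since is not None:
            print(f"firewall back after {time.time() - down_since:.0f}s down")
            down_since = None
    elif down_since is None:
        down_since = time.time()
        print("firewall unreachable, starting timer")
    time.sleep(1)
```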

A more appropriate solution would be shared storage designed for clusters, something like Ceph, but I'm not quite ready for that, and in the meantime this seems to work great.