These days, many folks run OpnSense under a virtualisation host such as Proxmox.
This configuration has its own pitfalls, which is why I wrote this guide. The first part covers the common settings needed; the second part deals with a setup where the virtualisation host is deployed remotely (e.g. in a datacenter) and holds other VMs besides OpnSense.
RAM, CPU and system

Use at least 8 GBytes, better 16 GBytes, of RAM and do not enable ballooning. Although OpnSense does not need that much RAM, it can be beneficial in case you put /var/log in RAM (see below).
Obviously, you should use the "host" CPU type in order not to sacrifice performance to emulation. However, you should not install the microcode update packages in OpnSense - they would be useless anyway. Instead, install the appropriate microcode packages on the virtualisation host.
That being said, just for good measure, set the tunables "hw.ibrs_disable=1" and "vm.pmap.pti=0". This will avoid performance bottlenecks caused by the Spectre and Meltdown mitigations. I trust the other VMs in my setup, but YMMV...
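For reference, a minimal sketch from the OpnSense shell - hw.ibrs_disable is writable at runtime, while vm.pmap.pti only takes effect at boot, so persist both via System: Settings: Tunables:

# takes effect immediately:
sysctl hw.ibrs_disable=1
# boot-time only: set vm.pmap.pti=0 under System: Settings: Tunables,
# reboot, then verify it reads 0:
sysctl vm.pmap.pti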
The system architecture is arbitrary, as OpnSense can boot both in legacy (BIOS) or UEFI mode.
Filesystem peculiarities

First off, when you create an OpnSense VM, which file system should you choose? If you have Proxmox, it will likely use ZFS, so you need to choose between UFS and ZFS for OpnSense itself. Although it is often said that ZFS on top of ZFS adds a little overhead, I would use it regardless, simply because UFS fails more often. Besides, OpnSense does not stress the filesystem much anyway (unless you use excessive logging, RRD or Netflow).
32 GBytes is the minimum I would recommend for the disk size. It may be difficult to increase it later on.
After a while, you will notice that the space you have allocated for the OpnSense disk grows to 100% usage, even though, within OpnSense, the disk may be mostly unused. That is a side-effect of the copy-on-write nature of ZFS: writing logs, RRD data and other statistics always allocates new blocks, and the old blocks never get released to the underlying (virtual) block device.
That is, unless the ZFS "autotrim" feature is set manually. You can either set this via the OpnSense CLI with "zpool set autotrim=on zroot" or, better, add a daily cron job to do this (System: Settings: Cron) with "zroot" as parameter.
You can trim your zpool once via CLI with "zpool trim zroot".
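Put together, a quick sequence for the OpnSense shell ("zroot" is the installer's default pool name):

# enable automatic trimming, run a one-off trim and watch its progress
zpool set autotrim=on zroot
zpool trim zroot
zpool status -t zroot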
That being said, you should always avoid filling up the disk space with verbose logging. If you do not need to keep your logs, you can also put them on a RAM disk (System: Settings: Miscellaneous).
Network "hardware"With modern FreeBSD, there should not be any more discussion about pass-through vs. emulated VTNET adapters: the latter are often faster. This is because Linux drivers are often more optimized than the FreeBSD ones. There are exceptions to the rule, but not many.
In some situations, you basically have no choice but to use vtnet anyway, e.g.:
- If FreeBSD has no driver for your NIC hardware
- If the adapter must be bridged, e.g. in a datacenter with a single-NIC machine
Also, some FreeBSD drivers are known to have caused problems in the past, e.g. for RealTek NICs. By using vtnet, you rely on the often better Linux drivers for such chips.
With vtnet, you should make sure that hardware checksumming is off ("hw.vtnet.csum_disable=1"). This is the default on new OpnSense installations anyway, because of a FreeBSD interoperability bug with KVM (https://forum.opnsense.org/index.php?msg=216918). Note, however, that this setting is slower than hardware offloading (https://forum.opnsense.org/index.php?topic=45870.0), which you will notice at very high speeds, especially on weak hardware.
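To check whether an existing installation has this workaround active, you can query the tunable from the OpnSense shell:

# 1 = checksum offloading disabled (the safe default on new installations)
sysctl hw.vtnet.csum_disable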
You can also enable multiqueue on the VM NIC interfaces, especially if you have multiple threads active. There is no need to enable this in OpnSense.
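On the Proxmox side, multiqueue is a property of the virtual NIC; a hedged example via the CLI, with a hypothetical VM ID of 100 (match the queue count to the VM's vCPU count):

# set 4 queues on net0; note that omitting the MAC makes Proxmox generate a new one
qm set 100 --net0 virtio,bridge=vmbr0,queues=4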
For some Broadcom adapters, it is necessary to disable GRO by using:
iface enp2s0f0np0 inet manual
    up ethtool --offload $IFACE generic-receive-offload off
See: https://forum.opnsense.org/index.php?msg=233131, https://help.ovhcloud.com/csm/en-dedicated-servers-proxmox-network-troubleshoot?id=kb_article_view&sysparm_article=KB0066095 and https://www.thomas-krenn.com/de/wiki/Broadcom_P2100G_schlechte_Netzwerk_Performance_innerhalb_Docker.
When you use bridging with vtnet, there is a known Linux bug with IPv6 multicasting (https://forum.proxmox.com/threads/ipv6-neighbor-solicitation-not-forwarded-to-vm.96758/) that breaks IPv6 after a few minutes. It can be avoided by disabling multicast snooping in /etc/network/interfaces of the Proxmox host, like:
auto vmbr0
iface vmbr0 inet manual
    bridge-ports eth0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
    bridge-mcsnoop 0
If you plan to enlarge your MTU size (https://forum.opnsense.org/index.php?topic=45658) on VirtIO network interfaces, note that you must do so on the Proxmox bridge device first.
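As an illustration (bridge name and VM ID are hypothetical), the order would be:

# 1. /etc/network/interfaces on the Proxmox host: add "mtu 9000" to the vmbr1 stanza
# 2. then let the VM NIC inherit the bridge MTU (mtu=1 is Proxmox shorthand for that):
qm set 100 --net0 virtio,bridge=vmbr1,mtu=1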
Also, you should probably disable the firewall checkbox for the network interfaces of the OpnSense VM.
Guest utilities

In order to be able to control and monitor OpnSense from the VM host, you can install the os-qemu-guest-agent plugin.
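You can install it via System: Firmware: Plugins or from the shell; do not forget to enable the agent option on the Proxmox side as well (hypothetical VM ID 100):

# on OpnSense:
pkg install os-qemu-guest-agent
# on the Proxmox host:
qm set 100 --agent enabled=1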
Problems with rolling back

One of the main advantages of using a virtualisation platform is that you can roll back your installation.
There are two problems with this:
1. DHCP leases that have been handed out since the last roll back are still known to the client devices, but not to the OpnSense VM. Usually, this will not cause IP conflicts, but DNS for the affected devices may be off in the interim.
2. If you switch back and forth, you can cause problems with backups done via os-backup-git. This plugin keeps track of revisions both in the OpnSense VM and in the backup repository. If the two disagree about the correct revision of the backup, subsequent backups will fail. Basically, you will need to set up the backup again with a new, empty repository.
If you want to avoid such problems, you can roll back single packages with opnsense-revert (https://docs.opnsense.org/manual/opnsense_tools.html#opnsense-revert).
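For example, reverting a single plugin (the package name is illustrative) looks like this:

# re-fetch and reinstall the pristine package from the current release
opnsense-revert os-backup-git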
TL;DR

- Have at least 8 GBytes of RAM, non-ballooning
- Use "host" type CPU and disable Spectre and Meltdown mitigations
- Use ZFS, dummy
- Keep 20% free space
- Add a trim job to your zpool
- Use vtnet, unless you have a good reason not to
- Check if hardware checksumming is off on OpnSense
- Disable multicast snooping and Proxmox firewall
- Install os-qemu-guest-agent plugin
That is all for now, recommendations welcome!
Caveat emptor: This is unfinished!

Setup for OpnSense and Proxmox for a datacenter

A frequently used variant is to work with two bridges on Proxmox:
- vmbr0 as a bridge to which Proxmox itself, the OpnSense WAN interface and VMs with a separate IP can connect (even if you don't use it)
- vmbr1 as a LAN or separated VLANs from which all VMs, OpnSense and Proxmox can be managed via VPN
That means you probably need two IPv4s for this setup. You should also get at least a /56 IPv6 prefix, which you need for SLAAC on up to 256 different subnets.
While it is possible to have just one IPv4 for both OpnSense and Proxmox, I would advise against it. You would have to use a port-forward on Proxmox, which results in an RFC1918 WAN IPv4 for OpnSense, which in turn has implications on NAT reflection that you would not want to deal with.
The configuration then looks something like this:
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

iface lo inet6 loopback

auto eth0
iface eth0 inet manual

iface eth0 inet6 manual

auto vmbr0
iface vmbr0 inet static
    address x.y.z.86/32
    gateway x.y.z.65
    bridge-ports eth0
    bridge-stp off
    bridge-fd 0
    bridge-mcsnoop 0
    post-up echo 1 > /proc/sys/net/ipv4/ip_forward
    post-up echo 1 > /proc/sys/net/ipv4/conf/eth0/proxy_arp
    post-up echo 1 > /proc/sys/net/ipv6/conf/eth0/forwarding
    #up ip route add x.y.z.76/32 dev vmbr0
    #up ip route add x.y.z.77/32 dev vmbr0
#Proxmox WAN Bridge

iface vmbr0 inet6 static
    address 2a01:x:y:z:5423::15/80
    address 2a01:x:y:z:87::2/80
    address 2a01:x:y:z:88::2/80
    address 2a01:x:y:z:89::2/80
    address 2a01:x:y:z:172::2/80
    gateway fe80::1
    post-up ip -6 route add 2a01:x:y:f600::/64 via 2a01:x:y:z:172::1

auto vmbr1
iface vmbr1 inet static
    address 192.168.123.2/24
    bridge-ports none
    bridge-stp off
    bridge-fd 0
    bridge-mcsnoop 0
    post-up ip route add 192.168.0.0/16 via 192.168.123.1 dev vmbr1
#LAN bridge

iface vmbr1 inet6 static

source /etc/network/interfaces.d/*
This includes:
- x.y.z.86: the main Proxmox IP, with x.y.z.65 as gateway,
- x.y.z.87: the WAN IPv4 of OpnSense,
- x.y.z.88 and x.y.z.89: additional IPs on vmbr0. These use x.y.z.86 as gateway so that your MAC is not visible to the ISP. Hetzner, for example, would need virtual MACs for this.
- 192.168.123.2: the LAN IP for Proxmox so that it can be reached via VPN. The route is set so that the VPN responses are also routed via OpnSense and not to the default gateway.
IPv6 is a little more complex:
2a01:x:y:z:: is a /64 prefix that you can get from your ISP, for example. It is further subdivided with /80 into:
- 2a01:x:y:z:1234::/80 for vmbr0, with 2a01:x:y:z:1234::15/128 as external IPv6 for Proxmox.
- 2a01:x:y:z:172::15/128 as point-to-point IPv6 in vmbr1 for the OpnSense WAN with 2a01:x:y:z:172::1/128.
- 2a01:x:y:z:124::/80 as a subnet for vmbr1, namely as an IPv6 LAN for the OpnSense.
The OpnSense thus manages your LAN with 192.168.123.1/24 and can do DHCPv4 there. It is the gateway and DNS server and does NAT to the Internet via its WAN address x.y.z.87. It can also serve as the IPv6 gateway with 2a01:x:y:z:123::1/64.
VMs would have to get a static IPv6 or be served via SLAAC, which only works with a whole /64 subnet. The /56 prefix, 2a01:x:y:rr00::/56, is used for this: it can be split into individual /64 prefixes on the OpnSense and distributed to the LAN(s) via SLAAC (e.g. Hetzner offers such a prefix for a one-off fee of €15).
You can use the additional IPs, but you don't have to. These "directly connected" VMs could, for example, also use IPv6 in 2a01:x:y:rr00::/64.
Some more points

1. You can/should close the Proxmox ports, at least for IPv4, of course, but you can still keep them accessible via IPv6. This means you can access the Proxmox even without OpnSense running. There is hardly any risk if nobody knows the external IPv6, as port scans on IPv6 are practically infeasible. But be careful: entries in the DNS could be visible and every ACME certificate is exposed, so if you do, only use wildcards!
2. I would also set up a client VM that is located exclusively in the LAN, has a graphical interface and a browser, and is always running. As long as the Proxmox works and its GUI is accessible via port 8006, you have a VM with LAN access and a browser. This also applies if the OpnSense is messed up and no VPN is currently working. The call chain is then: browser -> https://[2a01:x:y:z:1234::15]:8006, open a console to the client VM there, and from within it access https://192.168.123.1/ (the OpnSense LAN IP) with its browser.
3. Be careful with (asymmetric) routes! Proxmox, for example, has several interfaces, so it is important to set the routes correctly where necessary. Note that I have not set an IPv6 address on vmbr1, because it is only intended to be used for access via VPN over the LAN. However, if the OpnSense sends router advertisements on the LAN interface, you quickly have an alternative route on Proxmox...
4. You can use fe80::1/64 as virtual IPs for any (V)LAN interface on OpnSense. That way, you can set fe80::1 as IPv6 gateway for the VMs.
VPN

Which VPN you use to access the LAN or VLANs behind your OpnSense is up to your preference. I use Wireguard site-to-site.
There are tutorials on how to do this, but as an outline (a config sketch follows the list):
- Choose a port to make the connection and open it.
- Set up the Wireguard instance to listen on that port.
- Connect a peer by setting the secrets.
- Allow the VPN traffic (but wisely!)
- Check the routes if you cannot reach the other side.
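As a minimal sketch of what such a site-to-site peering boils down to (keys, port and subnets are illustrative; OpnSense generates the equivalent of this from the GUI settings):

[Interface]
# local instance, listening on the port you opened
PrivateKey = <local-private-key>
ListenPort = 51820

[Peer]
# remote site; AllowedIPs doubles as the routing entry for the tunnel
PublicKey = <remote-public-key>
AllowedIPs = 10.10.10.2/32, 192.168.123.0/24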
VLAN setup

In order to isolate traffic between the VMs, you can also choose to make vmbr1 VLAN-aware. In that case, you will have to assign each VM a separate VLAN, define VLAN interfaces on OpnSense and carve up small portions of the RFC1918 LAN network, using at least 2 IPv4s per VLAN for OpnSense and the specific VM.
You can do the same with your IPv6 range, because you have 256 IPv6 prefixes - so each VM can have its own /64 range and could even use IPv6 privacy extensions.
Since OpnSense is the main router for everything, you will still be able to access each VM via the VPN by using rules for the surrounding RFC1918 network.
Reverse proxies

If you want to make use of your OpnSense's capabilities, you will have to place your VMs behind it anyway. If you are like me and want to save on the cost of additional IPv4s, you can make use of a reverse proxy.
On HAProxy vs. Caddy (there is a discussion about this starting here (https://forum.opnsense.org/index.php?topic=38714.msg217354#msg217354)):
Quote: Today I took the opportunity to try out Caddy reverse proxy instead of HAproxy, mostly because of a very specific problem with HAproxy...
I must say I reverted after trying it thoroughly. My 2 cents on this are as follows:
- Caddy is suited to home setups and inexperienced users. HAproxy is much more complex.
- For example, the certificate setup is much easier, because you just have to specify the domain and it just works (tm).
- However, if you have more than just one domain, Caddy setup gets a little tedious:
* you have to create one domain/certificate plus an HTTP backend for each domain, which includes creating separate ones for www.domain.de and domain.de. You cannot combine certificates for multiple domains unless they are subdomains.
* You do not have much control over what type of certificate(s) are created - you cannot specify strength or ECC vs. RSA (much less both) and I have not found a means to control whether ZeroSSL or LetsEncrypt is used.
* The ciphers being employed cannot be controlled easily - or, for TLS 1.3, at all. That results in a suboptimal ssllabs.com score, because 128-bit ciphers are allowed. This cannot be changed because of Go limitations.
* You cannot use more than one type of DNS-01 verification if you use wildcard domains.
* The Auto HTTPS feature looks nice at first, but it uses a 308 instead of a 301 code, which breaks some monitoring and can only be modified via custom include files.
So, if you just want to reverse-proxy some services in your home network, go with Caddy. For an OpnSense guarding your internet site with several services/domains, stay with HAproxy.
There are nice tutorials for both HAproxy (https://forum.opnsense.org/index.php?topic=23339.0) and Caddy (https://forum.opnsense.org/index.php?topic=38714.0), so use them for reference.
A few words on security

Web applications are inherently unsafe - even more so when they handle infrastructure, as is the case with both Proxmox and OpnSense. If you expose their interfaces on the open internet, even with 2FA enabled, you are waiting for an accident to happen.
Basically, you have these choices to protect the web interfaces:
a. Change default ports
b. Use a VPN
c. Hide behind a non-exposed DNS name (either via IPv6 only or via a reverse proxy)
Variant a. is becoming more and more useless: I had around 30,000 invalid login attempts on a non-default SSH port in just one month!
While I always recommend variant b., you will have to rely on a working OpnSense for it. That is why I have a hot standby available that can be booted instead of the normal OpnSense instance in case I bork its configuration.
But even for that you need access to your Proxmox - and how do you get that without a working OpnSense?
The answer cannot be a reverse proxy either, because that will also run on your OpnSense.
That is why I recommend using an IPv6-only fallback. This is possible because an interface can have more than one IPv6 address, so you can use a separate address just for specific services like SSH.
If you have a /56 or /64 IPv6 prefix, the number of potential IPs is so huge that port scanning is infeasible. However, there are some pitfalls to this:
1. You must use a really random address, not one that could be guessed easily (see the sketch after this list).
2. Beware of outbound connections via IPv6: usually, they will give away your IPv6 - unless you use IPv6 privacy extensions (see below).
3. If you want to make that address easier to remember for yourself, you can use a DNS entry, but check that zone transfers of your domain are really disabled and do not use guessable names like "pve.yourdomain.com", "opnsense.yourdomain.com" or "proxmox.yourdomain.com".
4. Also, keep in mind that if you issue certificates for that domain name, almost EVERY certificate gets published because of certificate transparency (https://certificate.transparency.dev/). So, use wildcard certificates!
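For point 1, a random host part for your /64 prefix can be generated like this (illustrative):

# four random 16-bit groups, to be appended to your prefix, e.g. 2a01:x:y:z:<output>
printf '%s:%s:%s:%s\n' $(openssl rand -hex 2) $(openssl rand -hex 2) $(openssl rand -hex 2) $(openssl rand -hex 2)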
You can do likewise for your VMs:
- For LXC containers, the network configuration is kept in /etc/network/interfaces, but it gets re-created from the LXC definition. Alas, you can only set one IPv6 (or use DHCPv6 or SLAAC). That is no problem if the container sits behind OpnSense and is reverse-proxied via IPv4 only, since then the container's IPv6 can be used just for SSH, if you configure OpnSense to let it through. For IPv6 privacy, add this to /etc/sysctl.conf:
net.ipv6.conf.eth0.autoconf=1
net.ipv6.conf.eth0.accept_ra=1
net.ipv6.conf.all.use_tempaddr=2
net.ipv6.conf.default.use_tempaddr=2
net.ipv6.conf.eth0.use_tempaddr=2
- For Linux VMs with an old-style configuration, you can change /etc/network/interfaces. For new-style configurations using cloud-init with netplan, you can create an override for /etc/netplan/50-cloud-init.yaml, like /etc/netplan/61-ipv6-privacy.yaml with this content (using SLAAC / radvd):
network:
  version: 2
  renderer: networkd
  ethernets:
    eth0:
      accept-ra: true
      ipv6-privacy: true
By using /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with this content: "network: {config: disabled}", you can also disable the overwriting of the network configuration via cloud-init altogether and configure netplan yourself.
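That is, the override file consists of just this one line:

# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg
network: {config: disabled}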
Many thanks for these "best practices".
I plan to deploy a 2nd OPNsense on Proxmox, so it will be helpful.
Regards,
S.
I could have used this a few weeks ago. ;)
I'm a bit surprised by the ZFS on ZFS recommendation, as well as the one regarding passthrough vs bridges.
They seem to go against other recommendations I had found at the time (home network guy?).
At least I can test the 2nd one. I guess I'll learn how to move my configuration to another VM/host in the process...
How about a paragraph on firewalls (Proxmox's and OPNsense's) and potential conflicts between the two?
Thanks!
Since OpnSense writes only a few logfiles (and even those should be reduced to a minimum anyway to avoid running out of space), the performance impact of ZFS under ZFS is negligible. Most of it comes from double compression, which is ineffective and could just as well be disabled on OpnSense.
Most recommendations on NIC passthrough come from the past; vtnet is much better these days. You might get a performance benefit on >= 10 GBit hardware - that is, IFF it is supported under FreeBSD. Some people have to resort to running under Proxmox precisely because their NICs are badly supported (or not at all).
There are lots of recommendations that were valid in the past, like "do not mix tagged and untagged VLANs" - I had no problems with that, whatsoever.
There are no conflicts with the PVE firewall unless you enable it both at the datacenter level and for the OpnSense VM. BTW: the initial default is off for the datacenter. If you need it for other VMs (and why should you, as they are probably behind your OpnSense anyway?) or for the PVE host itself, you should disable it for your OpnSense VM - but that goes without saying.
The real impact of using vtnet is mostly limited to the IPv6 multicast and the hardware offloading problems.
An idea here, maybe it's stupid, maybe not, but...
What if this is included into the Official OPNsense docs?
Currently the docs do not have any guide on how to deploy OPNsense on Proxmox. It's easy to spin up OPNsense in Proxmox, but "best practices" are another thing.
Would it be beneficial for the people to have something like that in the Official docs?
Regards,
S.
Quote from: Seimus on November 24, 2024, 05:35:43 PM
An idea here, maybe it's stupid, maybe not, but...
What if this is included into the Official OPNsense docs?
Currently the docs do not have any guide on how to deploy OPNsense on Proxmox. It's easy to spin up OPNsense in Proxmox, but "best practices" are another thing.
Would it be beneficial for the people to have something like that in the Official docs?
Regards,
S.
This is well above the know-how of most people. I doubt many people run a datacenter-level OPNsense with the VMs on the same server at home to this degree.
Good dive though, much appreciated 👌 Now I have to rebuild everything... again 😒
A few questions...
1) Do you enable the Spectre option for Intel or AMD cpus in Proxmox VM definition?
2) Do you activate AES for HW acceleration in Proxmox VM definition?
3) Host CPU type? Where is this located?
4) If I choose ZFS for OPNsense VM should I define 2 disks for resiliency in Proxmox VM definition?
1. As explained here (https://docs.opnsense.org/troubleshooting/hardening.html), there are two settings:
PTI is something that can only be done on the host anyway. Whether you enable IBRS depends on whether you expect your other VMs to try to attack your OpnSense. In other words: Do you use virtualisation to separate VMs like in a datacenter, or do you want to use your hardware for other things in your homelab? Since there is a huge performance penalty, I would not use that mitigation in a homelab. In a datacenter, I would not virtualize OpnSense anyway, so no, I would not use those mitigations.
2. Sure. That goes without saying, because "host" CPU type does that anyway.
3. CPU host type - see attachment.
4. No. ZFS features like RAID-Z1 can only effectively be used on the VM host. If the host storage fails, having two separate disk files does not actually help. ZFS is, like I describe, only there to have snapshots within the OpnSense itself. You can use ZFS snapshots on the Proxmox host instead, but I still would not trust UFS under FreeBSD anyway, so the choice is purely for filesystem stability reasons. That does not get any better by using mirroring.
Thank you for the great guide, and explanation of settings!
I am one of those strange people with Proxmox running OPNsense in a DC. I currently don't have the rack space, or the budget to get a dedicated device for OPNsense, but that is on the list of things to do. I have been having some intermittent issues with my VMs and will try this and see if it helps.
I do have one question, however. When doing some research I ended up looking at multiqueue, what that is and whether it may help. Networking is admittedly my weakest aspect in computers (well, networking other than layer 1 - I do hardware all day). As I understand it, when using VirtIO (same as vtnet, correct?) it only supports one RX/TX queue, so the guest can only receive or send one packet at a time (oversimplified, trying to keep it short and concise). Now, with modern hardware, NICs can essentially make a packet queue for each CPU core (or vCore). Will setting a Multiqueue value in Proxmox have any benefit? If yes, I would assume it should be set to the number of cores the OPNsense VM has?
Thank you again for the great guide!
There is an explanation of this here: https://pve.proxmox.com/pve-docs/chapter-qm.html#qm_network_device
Short answer: It enables multiple queues for a network card that are distributed over multiple CPU threads, which can have benefits if you have high loads induced by a big number of clients. AFAIU, you will have to enable this in the OpnSense VM guest, too. I never tried it and YMMV depending on the actual hardware (and also on driver support for vtnet in OpnSense).
Note that when you change network settings in Proxmox while OpnSense is running, your connection drops and may need a reboot to get back online.
Great article.
I was curious: in my VM under Proxmox, I have 32GB RAM [ballooning off], and Proxmox shows 31/32 RAM used in red, but the OPNsense GUI shows 1.4% 900M/3200M. Is this a concern or just Proxmox not registering it correctly?
Take a look at "top" in your OpnSense VM - you will find that ~95% of memory is "wired" by FreeBSD. Part of this is that all free memory is used for ARC cache. Proxmox shows this all as used memory.
Excellent write-up, thank you.
One question, and a possible suggestion:
For a home user who already has a firewall appliance and wants to add a Proxmox node for app hosting, is there a need to virtualize OPNsense (besides having a convenient backup for the main router)? Does it avoid VM/CT traffic having to traverse the network for inter-VLAN routing?
Regarding ZFS-on-ZFS, it seems that ZFS sync is the prominent contributor to write amplification and SSD wear without a dedicated SLOG device (source: https://www.youtube.com/watch?v=V7V3kmJDHTA). Assuming the host is already protected with backup power and good ZFS hygiene, might it make sense to disable ZFS sync on the guest?
I am not promoting the use of a virtualised OpnSense at all, even less so for situations where a physical firewall is possible. I only use it in cloud-based setups to save a second physical instance.
That being said, I can understand when someone says they already have a Proxmox instance and want to run OpnSense on that to save power.
As for ZFS: OpnSense does not produce that high of a write load that I think this would matter, but YMMV. When I use SSDs on a Proxmox host, I know I must use enterprise-grade SSDs anyway, regardless of the type of guest.
Quote from: OPNenthu on February 09, 2025, 07:36:57 PM
write amplification and SSD wear without a dedicated SLOG device (source: https://www.youtube.com/watch?v=V7V3kmJDHTA)
This statement is just plain wrong.
An SLOG vdev
- is not a write cache
- will not reduce a single write operation to the data vdevs
- is in normal operation only ever written to and never read
Normal ZFS operation is sync for metadata and async for data. Async meaning collected in a transaction group in memory which is flushed to disk every 5 seconds.
Kind regards,
Patrick
Thank you for the correction-
Quote from: Patrick M. Hausen on February 10, 2025, 12:01:54 AM
Normal ZFS operation is sync for metadata and async for data.
I take from this that even metadata does not get written to disk more than once. I believe that you know what you're talking about on this subject so I take your word, but the video I linked makes a contradictory claim at 06:05.
I'm paraphrasing, but he claims that for a single-disk scenario (such as mine) ZFS sync writes data (or metadata, technically) twice: once for the log, and once for the commit. He presents some measurements that seem to corroborate the claim although I can't verify it.
My thinking is that modest home labs might be running 1L / mini PCs with very limited storage options so maybe there was a potential pitfall to be avoided here.
Oh, I'm sorry. Yes, synchronous writes are written twice. But they are the exception, not the rule.
If you use any consumer SSD storage option for Proxmox, you are waiting for an accident to happen anyway. Many home users may use things like Plex or Home Assistant or have a Docker instance running as VMs, and those generate far more write load than OpnSense ever will.
Suffice it to say that you can reduce the write load by a huge amount just by enabling "Use memory file system for /tmp", disabling Netflow and RRD data collection, and avoiding excessive firewall logging (or shipping the logs to an external syslog server). Also, the metadata flushes have been reduced in OpnSense from every 30s to every 5 minutes as of 23.7 (https://forum.opnsense.org/index.php?msg=195970). In the linked thread, there is some discussion of the actual induced write load. I used up ~50% of my first NVME disk's life on a brand-new DEC750 within one year - but that is totally understandable when you think about it and has nothing to do with ZFS-on-ZFS.
P.S.: There are some really bad videos about ZFS out there, like this one (https://www.youtube.com/watch?v=V7V3kmJDHTA), which I just commented on:
Quote: Good intention, alas, badly executed. You should have looked at the actual hardware information instead of relying on what the Linux kernel thinks it did (i.e. use smartctl instead of /proc/diskstats).
The problem with your recommendation of ashift=9 is that Linux shows fewer writes, but in reality, most SSDs use a physical blocksize of >=128 KBytes. By reducing the blocksize to 512, you actually write the same 128K block multiple times. In order to really minimize the writes to the drive, you should enlarge the ashift to 17 instead of reducing it to 9.
P.P.S.: My NVME drives show a usage of 2 and 4% respectively after ~2 years of use in Proxmox. At that rate, I can still use them for another 48 years, which is probably well beyond their MTTF. Back when SSDs became popular, it was rumored that they could not be used for databases because of their limited write endurance. A friend of mine used some enterprise-grade SATA SSDs for a 10 TByte weather database that was being written to by thousands of clients, and the SSDs were still only at 40% after 5 years of 24/7 use.
Quote from: meyergru on February 10, 2025, 09:30:28 AM
most SSDs use a physical blocksize of >=128 KBytes
I've not seen a block size that large, but then again I only have consumer drives. All of mine (a few Samsungs, a Kingston, and an SK Hynix currently) report 512 bytes in S.M.A.R.T tools:
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 1
1 - 4096 0 0
I completely agree that disks will last a long time regardless, but I thought we should at least be aware of the possible compounding effects of block writes in a ZFS-on-ZFS scenario and factor that into any cost projections. Unless I'm mistaken about how virtualization works, whatever inefficiencies ZFS has would be doubled in ZFS-on-ZFS.
I thought this was common knowledge: I am talking about the real, physical block size (aka erase block size (https://spdk.io/doc/ssd_internals.html)) of the underlying NAND flash, not the logical one that is being reported over an API that wants to be backwards-compatible with spinning disks. The mere fact that you can change that logical blocksize should make it clear that it has nothing to do with reality.
It was basically the same with the 4K block size, which was invented for spinning disks in order to reduce gap overhead; most spinning disks also allowed for a backwards-compatible 512-byte sector size, because many OSes could not handle 4K at that time.
Basically, 512 bytes and 4K are a mere convention nowadays.
About the overhead: The video I linked, which was making false assumptions about the block sizes, shows that the write amplification was basically nonexistent after the ashift was "optimized". This goes to show that basically, for any write of data blocks, there will be a write of metadata like checksums. On normal ZFS, this will almost always be evened out by compression, but not on ZFS-on-ZFS, because the outer layer cannot compress any further. So, yes, there is a little overhead, and for SSDs, this write amplification will be worse with small writes. Then again, that is true for pure ZFS as well.
With the projected MTTFs of decently overprovisioned SSDs being much longer than the time to potential failure from other causes, that should not be much of a problem. At least not one that would make me recommend switching off the very features that ZFS stands for, namely by disabling ZFS sync.
Quote from: meyergru on February 11, 2025, 09:47:14 AM
I am talking about the real, physical block size (aka erase block size (https://spdk.io/doc/ssd_internals.html)) of the underlying NAND flash, not the logical one that is being reported over an API that wants to be backwards-compatible to spinning disks.
Got it, thanks for that. The link doesn't work for me, but I found some alternate sources.
Sadly it seems that the erase block size is not reported in userspace tools and unless it's published by the SSD manufacturer it is guesswork. I think that's reason enough to not worry about ashift tuning, then.
I do not change the default of ashift=12, either. However, something you can do is avoid any SSDs that are not explicitly noted to have a RAM cache - even some "pro" drives do not have one. With a RAM cache, the drive can delay the block erase until the whole block, or at least more than a minuscule part of it, must be written, thus avoiding many unnecessary writes even for small logical block writes.
This is something Deciso did not take into account with their choice of the Transcend TS256GMTE652T2 in the DEC750 line, resulting in this:
# smartctl -a /dev/nvme0
smartctl 7.4 2023-08-01 r5530 [FreeBSD 14.2-RELEASE amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: TS256GMTE652T2
Serial Number: G956480208
Firmware Version: 52B9T7OA
PCI Vendor/Subsystem ID: 0x1d79
IEEE OUI Identifier: 0x000000
Controller ID: 1
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 256,060,514,304 [256 GB]
Namespace 1 Utilization: 37,854,445,568 [37.8 GB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Wed Feb 12 10:42:37 2025 CET
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0f): S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size: 32 Pages
Warning Comp. Temp. Threshold: 85 Celsius
Critical Comp. Temp. Threshold: 90 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 9.00W - - 0 0 0 0 0 0
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 48 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 80%
Data Units Read: 2,278,992 [1.16 TB]
Data Units Written: 157,783,961 [80.7 TB]
Host Read Commands: 79,558,036
Host Write Commands: 3,553,960,590
Controller Busy Time: 58,190
Power Cycles: 88
Power On Hours: 17,318
Unsafe Shutdowns: 44
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
No Self-tests Logged
As you can see, the drive has only 20% life left at only 2 years (17318 hours) of use.
This is interesting as well:
Quote from: crankshaft on December 29, 2024, 12:42:46 PM
Finally, after 2 weeks of testing just about every tunable possible, I found the solution:
iface enp1s0f0np0 inet manual
    pre-up ethtool --offload enp1s0f0np0 generic-receive-offload off
Generic Receive Offload (GRO)
- GRO is a network optimization feature that allows the NIC to combine multiple incoming packets into larger ones before passing them to the kernel.
- This reduces CPU overhead by decreasing the number of packets the kernel processes.
- It is particularly useful in high-throughput environments as it optimizes performance.
GRO may cause issues in certain scenarios, such as:
1. Poor network performance due to packet reordering or handling issues in virtualized environments.
2. Debugging network traffic where unaltered packets are required (e.g., using `tcpdump` or `Wireshark`).
3. Compatibility issues with some software or specific network setups.
This is OVH Advance Server with Broadcom BCM57502 NetXtreme-E.
Hope this will save somebody else a lot of wasted time.
Regards,
S.
Did you try this?
I'm currently just moving over to OPNsense from pfSense and not finished yet, so I can't comment, but I've always had higher latency than I'd expect.
It is extremely simple to virtualize OPNsense in Proxmox. I did it in my recent setup using PCI passthrough and virtualization in Proxmox. OPNsense works great; here is a step-by-step guide to install OPNsense on Proxmox.
It would be good to know more about this GRO setting
I've just finished my setup (at least ported from pfSense; finished!) and am pleased to see multi-queue is just a case of setting it on the host, as outlined here: https://forum.opnsense.org/index.php?topic=33700.0
@amjid: Your setup is different by using pass-through. This has several disadvantages:
1. You need additional ports (at least 3 in total), which is often a no-go in environments where you want this on rented hardware in a datacenter - they often have only one physical interface which has to be shared (i.e. bridged) across OpnSense and Proxmox.
2. Some people use Proxmox for the sole reason of using their badly-supported Realtek NICs, because the Linux drivers are way better than the FreeBSD ones. By using pass-through, you use the FreeBSD drivers again, so this will work just as badly as on FreeBSD alone.
@wrongly1686: Usually, you do not need to change the GRO setting. This problem will only show on certain high-end Broadcom adapters.
I will repeat my message from here (https://forum.opnsense.org/index.php?msg=233131):
Quote: Interesting. Seems like a NIC-specific problem. OVH now has that in their FAQs: https://help.ovhcloud.com/csm/en-dedicated-servers-proxmox-network-troubleshoot?id=kb_article_view&sysparm_article=KB0066095
This was detected even earlier: https://www.thomas-krenn.com/de/wiki/Broadcom_P2100G_schlechte_Netzwerk_Performance_innerhalb_Docker
Nevertheless, I added it above.
And I did mention multiqueue, didn't I?
Apologies, you did.
I just didn't think it could ever be so easy after giving up on pfsense!
Quote from: meyergru on March 27, 2025, 08:45:19 AMAnd I did mention multiqueue, didn't I?
I think it may be worth adding something about CPU affinity/CPU units.
I'm moving all my setup around at the moment, but I noticed that my RTTs have recently shot up on my gateways, making my networking feel slow. They've effectively doubled.
I'm keeping an eye on this, but putting OPNsense CPU units up to 10,000 and Adsense to 8,000 brought them straight back down.
I do wonder if there is some way for Proxmox to prioritise bridging, also.