KVM with ZFS-on-ZFS

Started by TheJohnny470, July 10, 2025, 06:31:39 PM

I have been running OPNsense on my home server for a while now. OPNsense is virtualized with KVM/QEMU and gets 2 vCores, 2 GB RAM, a 20 GB VirtIO disk, and a dual-port Gigabit Intel PCIe NIC passed through.
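Roughly, the VM definition boils down to something like this (just a sketch; the image path and PCI addresses are placeholders, not my actual values):

    qemu-system-x86_64 \
      -machine q35,accel=kvm -cpu host \
      -smp 2 -m 2048 \
      -drive file=/var/lib/libvirt/images/opnsense.qcow2,if=virtio,cache=none \
      -device vfio-pci,host=01:00.0 \
      -device vfio-pci,host=01:00.1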

I have a big issue with ZFS-on-ZFS: the main data array consists of three 2 TB HDDs in a RAID-Z1 configuration, and the OPNsense VM lives on top of it.
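On the host side the layering is roughly this (pool and dataset names are just examples):

    # host pool: three 2 TB disks in RAID-Z1
    zpool create tank raidz1 /dev/sda /dev/sdb /dev/sdc
    # 20 GB volume handed to the guest, which then builds its own ZFS pool inside it
    zfs create -V 20G tank/vm/opnsense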

The issue:

Weekly scrubs on the main data array eventually accumulate checksum errors, and the OPNsense VM completely fails. It is always the OPNsense VM's media that the main array complains about, and it eventually shows up as the source of unrecoverable errors.
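For context, this is how I check things after a scrub (pool name is an example; I'm not pasting the full output here):

    zpool status -v tank     # lists the datasets/files with unrecoverable errors
    zpool events -v | tail   # recent per-device checksum/IO error events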

What I tried:
* Using ZVols
* Using a regular qcow2 image
* Disabling all logs/metrics (unless there's something I'm missing)
* Moving the swap partition off the ZFS array and onto the boot SSD, as well as enabling auto-trim (testing in progress; sketch below)
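The swap/auto-trim test looks roughly like this (assuming the boot SSD also carries a ZFS pool; names are placeholders, and a plain partition on the SSD would do the same job if it isn't ZFS):

    # enable automatic TRIM on the SSD-backed pool
    zpool set autotrim=on ssdpool
    # 8 GB volume on the SSD, attached to the VM as a second VirtIO disk for swap
    zfs create -V 8G ssdpool/vm/opnsense-swap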

Although I have close to a decade of professional experience with Linux, I'm pretty much a beginner when it comes to the BSDs.

Bare-metal runs NixOS if that's relevant.

As far as I know it *should* work, though I would not recommend it. Running ZFS on ZFS defeats "thin" provisioning due to the copy-on-write nature of both layers. I always run UFS (or ext4 for Linux) inside hypervisors and use the hypervisor's snapshot capability for rollbacks and backups.
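For example, with the guest disk backed by a ZFS dataset or zvol on the host, a rollback point is just a snapshot on the host side (names are examples; shut the guest down first so the UFS/ext4 image is consistent):

    zfs snapshot tank/vm/opnsense@pre-upgrade
    # if the upgrade goes sideways:
    zfs rollback tank/vm/opnsense@pre-upgrade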

It might be that your 2 GB of memory is too small.
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

Any drawbacks to running OPNsense on UFS compared to ZFS? I remember reading that ZFS is the "recommended" filesystem for OPNsense, but I understand this recommendation assumes OPNsense is running on bare metal. And yes, I normally don't do ZFS on anything but bare metal, but I went with the OPNsense default here.

"Power loss" i.e. in your case terminating the VM without proper shutdown might lead to FS corruption in UFS and is almost impossible with ZFS. That's why it's the default now.

Does your hypervisor have the resources to increase memory to 4 or even 8 GB? 2 GB *is* tight.
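With libvirt that is a quick change (domain name is an example; it takes effect on the next start of the guest):

    virsh setmaxmem opnsense 4GiB --config
    virsh setmem opnsense 4GiB --config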
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

Hmm... UFS is a no-go then, since the server is currently susceptible to brownouts/blackouts with no UPS (working on that).

I think I can bump that VM to 4 GB; I retired a couple of VMs a while ago, so I should have the memory to spare.

Would OPNsense being tight on memory cause the checksum errors? Because I suspected this might be the case, I'm now testing with an 8 GB swap on the boot SSD, separate from the ZFS array.

Any system running ZFS that is too tight on memory might have problems in that regard. Also check the caching on the host/virtual disk side.
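For example, with libvirt you can see which cache mode the virtual disk actually uses and then set it deliberately instead of relying on the default (domain name is an example):

    # show the <driver> element of each disk; cache='none' bypasses the host page
    # cache, cache='writeback'/'writethrough' go through it
    virsh dumpxml opnsense | grep -A3 "<driver"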
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

I regularly do ZFS-on-ZFS on Proxmox and have no problems with that.

However, even with standard ZFS on Linux or Proxmox, I found that there is a problem with current ZFS implementations when combined with a SATA queue depth > 1:

The problem is that ZFS can time out under high write load with the default (> 1) setting: imagine a queue that the device firmware may execute in arbitrary order. This is perfect in theory, yet there is no firm maximum execution time if, under high load, every time a request finishes, a new one can take its place and be executed first.

Under these conditions some requests may "never" get executed, and at some point ZFS flags a timeout. This often happens during scrubs, especially if other write operations (like backups) are running concurrently.

I have only seen this with spinning rust, and I always set /sys/block/sdX/device/queue_depth to 1 for all SATA devices. While this can theoretically hurt performance, in practice it does not, because ZFS clusters its writes together anyway.
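A minimal way to apply that (run as root; not persistent across reboots, so hook it into a udev rule or boot script if you want it permanent):

    # force queue depth 1 on every disk exposed as sdX
    for q in /sys/block/sd*/device/queue_depth; do
        echo 1 > "$q"
    done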
Intel N100, 4* I226-V, 2* 82559, 16 GByte, 500 GByte NVME, ZTE F6005

1100 down / 800 up, Bufferbloat A+