[HOWTO] OpnSense under virtualisation (Proxmox et al.)

Started by meyergru, November 21, 2024, 10:43:58 AM

Quote from: OPNenthu on February 09, 2025, 07:36:57 PM: write amplification and SSD wear without a dedicated SLOG device (source: https://www.youtube.com/watch?v=V7V3kmJDHTA)

This statement is just plain wrong.

An SLOG vdev

- is not a write cache
- will not reduce a single write operation to the data vdevs
- is in normal operation only ever written to and never read

Normal ZFS operation is sync for metadata and async for data. Async means that writes are collected in a transaction group in memory, which is flushed to disk every 5 seconds.
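For reference, you can inspect both of these on the OPNsense/FreeBSD side (a minimal sketch; "zroot" is the usual default pool name and may differ on your system):

# per-dataset sync policy: standard, always or disabled
zfs get sync zroot

# transaction group flush interval in seconds
sysctl vfs.zfs.txg.timeout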

Kind regards,
Patrick
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

Thank you for the correction.

Quote from: Patrick M. Hausen on February 10, 2025, 12:01:54 AM: Normal ZFS operation is sync for metadata and async for data.

I take from this that even metadata does not get written to disk more than once. I believe that you know what you're talking about on this subject, so I take your word for it, but the video I linked makes a contradictory claim at 06:05.

I'm paraphrasing, but he claims that for a single-disk scenario (such as mine) ZFS sync writes data (or metadata, technically) twice: once for the log, and once for the commit.  He presents some measurements that seem to corroborate the claim although I can't verify it.

My thinking is that modest home labs might be running 1L / mini PCs with very limited storage options so maybe there was a potential pitfall to be avoided here.


Oh, I'm sorry. Yes, synchronous writes are written twice. But they are the exception, not the rule.
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

February 10, 2025, 09:30:28 AM #18 Last Edit: February 10, 2025, 11:40:42 AM by meyergru
If you use any consumer SSD storage option for Proxmox, you are waiting for an accident to happen anyway. Many home users run things like Plex, Home Assistant or a Docker instance as VMs, and those workloads add a considerable write load of their own.

Suffice it to say that you can reduce the write load by a huge amount just by enabling "Use memory file system for /tmp" and disabling Netflow and RRD data collection, along with excessive firewall logging (use an external syslog server instead). Also, from 23.7 on, the metadata flushes in OpnSense have been reduced from every 30 seconds to every 5 minutes. In the linked thread, there is some discussion of the actual induced write load. I used up ~50% of my first NVMe disk's life on a brand-new DEC750 within one year, but that is entirely expected once you think about it and has nothing to do with ZFS-on-ZFS.
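If you want to quantify the write load on your own box, something along these lines should do (a sketch; the pool and device names are assumptions):

# live write statistics per vdev, refreshed every 5 seconds
zpool iostat -v zroot 5

# what the drive itself has counted over its lifetime
smartctl -a /dev/nvme0 | grep -E "Data Units Written|Percentage Used"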

P.S.: There are some really bad videos about ZFS out there, like this one, which I just commented on:

Quote: Good intention, alas, badly executed. You should have looked at the actual hardware information instead of relying on what the Linux kernel thinks it did (i.e. use smartctl instead of /proc/diskstats).

The problem with your recommendation of ashift=9 is that Linux shows fewer writes, but in reality, most SSDs use a physical block size of >=128 KBytes. By reducing the block size to 512 bytes, you actually write the same 128K block multiple times. To really minimize the writes to the drive, you would have to enlarge the ashift to 17 instead of reducing it to 9.
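If you want to check or set the ashift yourself, here is a minimal sketch (pool and device names are examples, not taken from the video):

# show the ashift an existing pool was actually created with
zdb -C zroot | grep ashift

# create a new pool with an explicit ashift (2^12 = 4K blocks)
zpool create -o ashift=12 tank /dev/nvme0n1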

P.P.S.: My NVMe drives show a usage of 2% and 4%, respectively, after ~2 years of use in Proxmox. At that rate, I can still use them for another 48 years, which is probably well beyond their MTTF. Back when SSDs became popular, it was rumored that they could not be used for databases because of their limited write endurance. A friend of mine used some enterprise-grade SATA SSDs for a 10 TByte weather database that was being written to by thousands of clients, and the SSDs were still only at 40% after 5 years of 24/7 use.
Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 770 up, Bufferbloat A

Quote from: meyergru on February 10, 2025, 09:30:28 AM: most SSDs use a physical blocksize of >=128 KBytes

I've not seen a block size that large, but then again, I only have consumer drives. All of mine (a few Samsungs, a Kingston, and an SK Hynix currently) report 512 bytes in S.M.A.R.T. tools:

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         1
 1 -    4096       0         0

I completely agree that disks will last a long time regardless, but I thought we should at least be aware of the possible compounding effects of block writes in a ZFS-on-ZFS scenario and factor that into any cost projections. Unless I'm mistaken about how virtualization works, whatever inefficiencies ZFS has would be doubled in ZFS-on-ZFS.

I thought this was common knowledge: I am talking about the real, physical block size (aka erase block size) of the underlying NAND flash, not the logical one that is reported over an API that wants to be backwards-compatible with spinning disks. The mere fact that you can change that logical block size should make it clear that it has nothing to do with reality.

It was basically the same with the 4K block size, which was introduced for spinning disks in order to reduce gap overhead; most spinning disks still allowed for a backwards-compatible 512-byte sector size, because many OSes could not handle 4K at the time.

Basically, 512 bytes and 4K are a mere convention nowadays.

About the overhead: the video I linked, which made false assumptions about the block sizes, shows that the write amplification was basically nonexistent after the ashift was "optimized". This goes to show that for basically any write of data blocks, there will also be a write of metadata like checksums. On plain ZFS, this is almost always evened out by compression, but not on ZFS-on-ZFS, because the outer layer cannot compress the data any further. So yes, there is a little overhead, and for SSDs, this write amplification will be worse with small writes. Then again, that is true for pure ZFS as well.
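You can see whether compression is doing that work for you with a one-liner (a sketch; the dataset name is an assumption for a typical Proxmox zvol backing a VM disk, whose guest data will usually not compress much further):

zfs get compression,compressratio rpool/data/vm-100-disk-0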

With the projected lifetimes of decently overprovisioned SSDs being much longer than the time until they fail for other reasons, that should not be much of a problem. At least it is not one for which I would recommend switching off the very features that ZFS stands for, namely by disabling ZFS sync.
Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 770 up, Bufferbloat A

Quote from: meyergru on February 11, 2025, 09:47:14 AM: I am talking about the real, physical block size (aka erase block size) of the underlying NAND flash, not the logical one that is reported over an API that wants to be backwards-compatible with spinning disks.

Got it, thanks for that.  The link doesn't work for me, but I found some alternate sources.

Sadly it seems that the erase block size is not reported in userspace tools and unless it's published by the SSD manufacturer it is guesswork.  I think that's reason enough to not worry about ashift tuning, then.

I do not change the default of ashift=12, either. However, something you can do is avoid any SSDs that are not explicitly noted to have a RAM cache (even some "pro" drives do not have one). With a RAM cache, the drive can delay the block erase until the whole block, or at least more than a minuscule part of it, must be written, thus avoiding many unnecessary writes even for small logical block writes.

This is something Deciso did not take into account with their choice of the Transcend TS256GMTE652T2 in the DEC750 line, resulting in this:

# smartctl -a /dev/nvme0
smartctl 7.4 2023-08-01 r5530 [FreeBSD 14.2-RELEASE amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       TS256GMTE652T2
Serial Number:                      G956480208
Firmware Version:                   52B9T7OA
PCI Vendor/Subsystem ID:            0x1d79
IEEE OUI Identifier:                0x000000
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          256,060,514,304 [256 GB]
Namespace 1 Utilization:            37,854,445,568 [37.8 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Wed Feb 12 10:42:37 2025 CET
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     90 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.00W       -        -    0  0  0  0        0       0

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        48 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    80%
Data Units Read:                    2,278,992 [1.16 TB]
Data Units Written:                 157,783,961 [80.7 TB]
Host Read Commands:                 79,558,036
Host Write Commands:                3,553,960,590
Controller Busy Time:               58,190
Power Cycles:                       88
Power On Hours:                     17,318
Unsafe Shutdowns:                   44
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
No Self-tests Logged

As you can see, the drive has only 20% of its life left after only 2 years (17,318 hours) of use.
Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 770 up, Bufferbloat A

This is interesting as well:

Quote from: crankshaft on December 29, 2024, 12:42:46 PM: Finally, after 2 weeks of testing just about every tunable possible I found the solution:


iface enp1s0f0np0 inet manual
  pre-up ethtool --offload enp1s0f0np0 generic-receive-offload off

Generic Receive Offload (GRO)
  - GRO is a network optimization feature that allows the NIC to combine multiple incoming packets into larger ones before passing them to the kernel.
  - This reduces CPU overhead by decreasing the number of packets the kernel processes.
  - It is particularly useful in high-throughput environments as it optimizes performance.


GRO may cause issues in certain scenarios, such as:

1. Poor network performance due to packet reordering or handling issues in virtualized environments.
2. Debugging network traffic where unaltered packets are required (e.g., using `tcpdump` or `Wireshark`).
3. Compatibility issues with some software or specific network setups.

This is an OVH Advance server with a Broadcom BCM57502 NetXtreme-E.

Hope this will save somebody else a lot of wasted time.
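
To see whether GRO is currently active before making the change persistent, something like this should work on the Proxmox host (the interface name is taken from the quote above and will differ on other systems):

# show the current offload state
ethtool --show-offload enp1s0f0np0 | grep generic-receive-offload

# toggle it at runtime for a quick test
ethtool --offload enp1s0f0np0 generic-receive-offload off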



Regards,
S.
Networking is love. You may hate it, but in the end, you always come back to it.

OPNSense HW
APU2D2 - deceased
N5105 - i226-V | Patriot 2x8G 3200 DDR4 | L 790 512G - VM HA(SOON)
N100   - i226-V | Crucial 16G  4800 DDR5 | S 980 500G - PROD

Did you try this?

I'm currently moving over to OPNsense from pfSense and am not finished yet, so I can't comment, but I have always had higher latency than I would expect.

It is extremely simple to virtualize OPNsense in Proxmox. I did it in my recent setup using PCI passthrough. OPNsense works great; here is a step-by-step guide to install OPNsense on Proxmox.

It would be good to know more about this GRO setting

I've just finished my setup (at least the port from pfSense is finished) and am pleased to see that multi-queue is just a case of setting it on the host, as outlined here: https://forum.opnsense.org/index.php?topic=33700.0
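
For completeness, a minimal sketch of what that host-side setting looks like (VM ID, MAC address, bridge and queue count are examples, not taken from the linked thread):

# set the number of virtio queues for the VM's NIC on the Proxmox host
qm set 100 --net0 virtio=BC:24:11:AA:BB:CC,bridge=vmbr0,queues=4

# or edit /etc/pve/qemu-server/100.conf directly:
# net0: virtio=BC:24:11:AA:BB:CC,bridge=vmbr0,queues=4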

@amjid: Your setup is different because it uses pass-through. This has several disadvantages:

1. You need additional ports (at least 3 in total), which is often a no-go on rented hardware in a datacenter: such machines often have only one physical interface, which has to be shared (i.e. bridged) across OpnSense and Proxmox - a minimal example of such a bridge follows below.

2. Some people use Proxmox for the sole reason that their badly supported Realtek NICs work much better with the Linux drivers than with FreeBSD's. By using pass-through, you are back on the FreeBSD drivers, so things will work just as badly as on bare-metal FreeBSD.
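
For reference, a shared bridge on the Proxmox side looks roughly like this in /etc/network/interfaces (interface name and addresses are placeholders; the OpnSense WAN vNIC is then attached to vmbr0):

auto vmbr0
iface vmbr0 inet static
  address 203.0.113.10/24
  gateway 203.0.113.1
  bridge-ports enp1s0
  bridge-stp off
  bridge-fd 0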
Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 770 up, Bufferbloat A

@wrongly1686: Usually, you do not need to change the GRO setting. This problem only shows up on certain high-end Broadcom adapters.
I will repeat my message from here:

Quote: Interesting. Seems like a NIC-specific problem. OVH now has that in their FAQs: https://help.ovhcloud.com/csm/en-dedicated-servers-proxmox-network-troubleshoot?id=kb_article_view&sysparm_article=KB0066095

This was detected even earlier: https://www.thomas-krenn.com/de/wiki/Broadcom_P2100G_schlechte_Netzwerk_Performance_innerhalb_Docker

Nevertheless, I added it above.

And I did mention multiqueue, didn't I?
Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 770 up, Bufferbloat A

Quote from: meyergru on March 27, 2025, 08:45:19 AM: And I did mention multiqueue, didn't I?

Apologies, you did.

I just didn't think it could ever be so easy after giving up on pfSense!