I recently happened to look at the SMART data of the SSD in my OPNsense machine and noticed that the total-writes and life-left values were a bit surprising, considering how long the machine has been operational.
The machine has been in use for almost a year now as my primary home firewall, so no extravagant use cases. This is what the drive's SMART data reports:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x0032 100 100 000 Old_age Always - 100
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 8272
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 22
148 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
149 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
167 Write_Protect_Mode 0x0000 100 100 000 Old_age Offline - 0
168 SATA_Phy_Error_Count 0x0012 100 100 000 Old_age Always - 0
169 Bad_Block_Rate 0x0000 100 100 000 Old_age Offline - 0
170 Bad_Blk_Ct_Erl/Lat 0x0000 100 100 010 Old_age Offline - 0/0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 MaxAvgErase_Ct 0x0000 100 100 000 Old_age Offline - 0
181 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
182 Erase_Fail_Count 0x0000 100 100 000 Old_age Offline - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
192 Unsafe_Shutdown_Count 0x0012 100 100 000 Old_age Always - 17
194 Temperature_Celsius 0x0022 037 056 000 Old_age Always - 37 (Min/Max 16/56)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
199 SATA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
218 CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
231 SSD_Life_Left 0x0000 088 088 000 Old_age Offline - 88
233 Flash_Writes_GiB 0x0032 100 100 000 Old_age Always - 1248
241 Lifetime_Writes_GiB 0x0032 100 100 000 Old_age Always - 9423
242 Lifetime_Reads_GiB 0x0032 100 100 000 Old_age Always - 54
244 Average_Erase_Count 0x0000 100 100 000 Old_age Offline - 124
245 Max_Erase_Count 0x0000 100 100 000 Old_age Offline - 171
246 Total_Erase_Count 0x0000 100 100 000 Old_age Offline - 168104
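(The above is smartctl attribute output; for anyone wanting to pull the same numbers on their own OPNsense box, something like the following should do it, with the device node adjusted for your system:)

# smartctl -A /dev/ada0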
I was surprised by the amount of writes, as well as by how far the wear has progressed in just a year. Now I'm wondering whether these numbers are in line with what can be expected, or whether there is something wrong with my setup.
About a week ago, when I first noticed this, the life-left reading was at 89. I didn't think to take note of the other figures then, but after I updated OPNsense last weekend, I wrote the numbers down:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x0032 100 100 000 Old_age Always - 100
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 8167
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 22
148 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
149 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
167 Write_Protect_Mode 0x0000 100 100 000 Old_age Offline - 0
168 SATA_Phy_Error_Count 0x0012 100 100 000 Old_age Always - 0
169 Bad_Block_Rate 0x0000 100 100 000 Old_age Offline - 0
170 Bad_Blk_Ct_Erl/Lat 0x0000 100 100 010 Old_age Offline - 0/0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 MaxAvgErase_Ct 0x0000 100 100 000 Old_age Offline - 0
181 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
182 Erase_Fail_Count 0x0000 100 100 000 Old_age Offline - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
192 Unsafe_Shutdown_Count 0x0012 100 100 000 Old_age Always - 17
194 Temperature_Celsius 0x0022 034 056 000 Old_age Always - 34 (Min/Max 16/56)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
199 SATA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
218 CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
231 SSD_Life_Left 0x0000 088 088 000 Old_age Offline - 88
233 Flash_Writes_GiB 0x0032 100 100 000 Old_age Always - 1233
241 Lifetime_Writes_GiB 0x0032 100 100 000 Old_age Always - 9343
242 Lifetime_Reads_GiB 0x0032 100 100 000 Old_age Always - 54
244 Average_Erase_Count 0x0000 100 100 000 Old_age Offline - 123
245 Max_Erase_Count 0x0000 100 100 000 Old_age Offline - 171
246 Total_Erase_Count 0x0000 100 100 000 Old_age Offline - 166475
Comparing these with the readings from today above, a bit over four days has seen 80 GiB written, which sounds a bit on the high side to me.
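For reference, the deltas between the two snapshots, plus the lifetime average, work out roughly as follows:

9423 GiB - 9343 GiB = 80 GiB written in 8272 h - 8167 h = 105 h (~4.4 days)
80 GiB / 4.4 days   ~ 18 GiB/day, or roughly 6.5 TiB/year at that rate
9423 GiB / 8272 h   ~ 27 GiB/day averaged over the drive's whole service life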
From what I can tell, flowd_aggregate.py is almost solely responsible for the writes:
# top -m io -o write -b
last pid: 40420; load averages: 0.06, 0.11, 0.08 up 4+21:53:06 21:17:36
53 processes: 1 running, 52 sleeping
CPU: 1.0% user, 0.0% nice, 0.6% system, 0.3% interrupt, 98.1% idle
Mem: 150M Active, 2887M Inact, 782M Wired, 430M Buf, 4200M Free
Swap: 8192M Total, 8192M Free
PID USERNAME VCSW IVCSW READ WRITE FAULT TOTAL PERCENT COMMAND
699 root 1393772 458763 840 1273223 0 1274063 99.58% python3.8
35311 dhcpd 853298 1329 33 2949 33 3015 0.24% dhcpd
96277 dhcpd 861423 2637 0 588 0 588 0.05% dhcpd
99676 root 3756 82 70 416 51 537 0.04% radiusd
72210 root 1400080 16412 1 229 0 230 0.02% syslog-ng
1549 root 1741 16 0 227 0 227 0.02% dhcpleases6
5872 _flowd 200493 1613 4 159 0 163 0.01% flowd
90495 _dhcp 54483 115 0 14 0 14 0.00% dhclient
49557 root 8415 649 0 8 0 8 0.00% radvd
73189 root 825674 3873 0 0 0 0 0.00% python3.8
422 root 4 3 0 0 0 0 0.00% python3.8
96103 root 6 0 0 0 0 0 0.00% rtsold
424 root 171305 3810 215 0 6 221 0.02% python3.8
# ps -a -p 699
PID TT STAT TIME COMMAND
699 - Ss 143:34.61 /usr/local/bin/python3 /usr/local/opnsense/scripts/netflow/flowd_aggregate.py (python3.8)
Is this to be expected? Or is something wrong with my config, or even with flowd_aggregate.py itself? Could something be done to reduce the writing, apart from disabling NetFlow completely?
Admittedly this is not a high-quality SSD (KINGSTON SA400S37240G, not my choice), which might be a contributing factor to the quickly diminishing life-left number.
Any thoughts?
NetFlow writes information about all flows on all active interfaces SOMEWHERE. That's the point. If you don't want the wear on your local disk, you need to write to an external destination.
Yes, obviously the data needs to be written. What I was after was more whether this amount of writing is to be expected (judging from the answer - yes?), and if it is, whether it can be reduced somehow without disabling NetFlow (or using external storage, as no such thing is available here).
Of course, I could perhaps add an HDD as a second drive for this and for logging in general, but I've no clue whether OPNsense supports this without resorting to sorcery.
The amount of writing NetFlow does is roughly proportional to the amount of traffic passing through your firewall, not counting edge cases like opening a single connection (flow), transferring a couple of terabytes through it, and not doing much else.
You can of course add a second disk, though you will have to use the command line to do so.
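Roughly, the steps on a stock FreeBSD/OPNsense install would look something like this (a sketch only, assuming the new disk shows up as ada1; double-check the device name first, and ideally do the copy in single-user mode so nothing is writing to /var):

# geom disk list                      (confirm which device is the new disk)
# gpart create -s GPT ada1            (create a GPT partition table)
# gpart add -t freebsd-ufs -a 1m ada1 (add a UFS partition)
# newfs -U /dev/ada1p1                (create the filesystem with soft updates)
# mount /dev/ada1p1 /mnt
# cp -a /var/. /mnt/                  (copy the existing /var contents over)
# echo "/dev/ada1p1 /var ufs rw 2 2" >> /etc/fstab
# reboot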
I have the same "problem". My DEC750 is only half a year old and has already eaten up 5% of the disk lifetime (and that is not a consumer-grade SSD, for that matter).
Setting /var to memory is not a good solution, since that also affects other logs. /tmp is quite another thing, because that keeps only really temporary data and can safely be put into a RAM disk.
From a quick glance, there is a /var/log/flowd.log which gets rotated 10 times at 10 MByte each, and a Python aggregation script which parses the log into an SQLite history database in /var/netflow.
The aggregation is already optimised in that it commits only every 100,000 records. So it seems the logging of every flow into /var/log/flowd.log is the culprit. If the mechanism were changed to keep the current flowd.log in /tmp, one would have the choice of keeping it in memory. You do not lose much if that file is lost on a power outage, and on reboot one could do a forced rotation to persist it to permanent storage.
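For anyone who wants to verify this on their own box, the files in question can be inspected directly (paths as described above; adjust if your install differs):

# ls -lh /var/log/flowd.log*
# du -sh /var/netflow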
It is not as easy as making /var/log/flowd.log a symlink, however, because of the file rotation. It would be easier if all flowd log files were in a separate directory, like /var/log/flowd/flowd*.log.
Maybe we should open a feature request for this. The code itself was from Ad Schellevis.
I was planning on looking into the /var usage further this weekend, now that I might have actual time on my hands, but it sounds like you've already done the legwork.
For what it's worth, that sounds like a good idea to me. Having the most write-intensive part on a RAM drive, synced to permanent storage on boot (or maybe also periodically, with a configurable interval?), would probably dramatically reduce the SSD wear. I'd second that feature request.
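Purely to illustrate the periodic-sync idea: nothing like this exists in OPNsense today, and the paths and interval below are made up, but a flush could in principle be as simple as a cron entry copying the RAM-disk directory back to persistent storage:

# /etc/crontab entry (hypothetical paths, hourly interval just as an example)
# minute hour mday month wday who   command
0        *    *    *     *    root  cp -a /tmp/flowd/. /var/log/flowd-archive/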
The simplest and quickest solution would probably be to get a separate HDD as suggested, and just have it mounted as /var. Not the most elegant solution though, and in my case I'm not sure the computer (some oldish Dell USFF desktop) can even fit an additional drive. And I know many who use even smaller computers.
(Apologies for the necro.)
Curious if there are any solutions similar to this available for me?
I'm running on a small NUC with a large amount of RAM and a big NVMe SSD.
In the meantime, it is no longer the whole of /var that gets put on a RAM disk, but only /var/log, if you configure it.
So /tmp and /var/log can be put on a RAM disk, which certainly eases the load on your disk. Also, you can choose to lower the log level of many services.
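On OPNsense the GUI options handle this for you, but for reference, on a plain FreeBSD system the equivalent would be tmpfs entries in /etc/fstab along these lines (the sizes here are only examples; verify afterwards with "mount | grep tmpfs"):

tmpfs   /tmp       tmpfs   rw,mode=1777,size=512m   0   0
tmpfs   /var/log   tmpfs   rw,size=512m             0   0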