OPNsense Forum

English Forums => Hardware and Performance => Topic started by: nichiren on June 29, 2022, 08:57:04 pm

Title: OPNSense and SSDs, expected wear with normal use
Post by: nichiren on June 29, 2022, 08:57:04 pm
I just recently happened to look at the SMART data of the SSD in my OPNsense machine and noticed that the total-writes and life-left values were a bit surprising, considering how long the machine has been operational.

The machine has been in use for almost a year now as my primary home firewall, so no extravagant use cases. This is what the drive's SMART data reports:
Code: [Select]
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   100   100   000    Old_age   Always       -       100
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       8272
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       22
148 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
149 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
167 Write_Protect_Mode      0x0000   100   100   000    Old_age   Offline      -       0
168 SATA_Phy_Error_Count    0x0012   100   100   000    Old_age   Always       -       0
169 Bad_Block_Rate          0x0000   100   100   000    Old_age   Offline      -       0
170 Bad_Blk_Ct_Erl/Lat      0x0000   100   100   010    Old_age   Offline      -       0/0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 MaxAvgErase_Ct          0x0000   100   100   000    Old_age   Offline      -       0
181 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0000   100   100   000    Old_age   Offline      -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Unsafe_Shutdown_Count   0x0012   100   100   000    Old_age   Always       -       17
194 Temperature_Celsius     0x0022   037   056   000    Old_age   Always       -       37 (Min/Max 16/56)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
199 SATA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
218 CRC_Error_Count         0x0032   100   100   000    Old_age   Always       -       0
231 SSD_Life_Left           0x0000   088   088   000    Old_age   Offline      -       88
233 Flash_Writes_GiB        0x0032   100   100   000    Old_age   Always       -       1248
241 Lifetime_Writes_GiB     0x0032   100   100   000    Old_age   Always       -       9423
242 Lifetime_Reads_GiB      0x0032   100   100   000    Old_age   Always       -       54
244 Average_Erase_Count     0x0000   100   100   000    Old_age   Offline      -       124
245 Max_Erase_Count         0x0000   100   100   000    Old_age   Offline      -       171
246 Total_Erase_Count       0x0000   100   100   000    Old_age   Offline      -       168104

I was surprised by the amount of writes, and by how far the wear has progressed in just a year. Now I'm wondering whether these numbers are in line with what can be expected, or whether there is something wrong with my setup.
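
In case anyone wants to compare with their own drive: the dump above is plain smartmontools output, so something along these lines should reproduce it (the device name ada0 is just an example, adjust it to your disk):
Code: [Select]
# print the vendor-specific SMART attribute table for the first SATA disk
smartctl -A /dev/ada0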

About a week ago, when I first noticed it, the life-left reading was at 89. I didn't think to note the other figures at the time, but after I updated OPNsense last weekend, I took the numbers down:
Code: [Select]
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   100   100   000    Old_age   Always       -       100
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       8167
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       22
148 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
149 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
167 Write_Protect_Mode      0x0000   100   100   000    Old_age   Offline      -       0
168 SATA_Phy_Error_Count    0x0012   100   100   000    Old_age   Always       -       0
169 Bad_Block_Rate          0x0000   100   100   000    Old_age   Offline      -       0
170 Bad_Blk_Ct_Erl/Lat      0x0000   100   100   010    Old_age   Offline      -       0/0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 MaxAvgErase_Ct          0x0000   100   100   000    Old_age   Offline      -       0
181 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0000   100   100   000    Old_age   Offline      -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Unsafe_Shutdown_Count   0x0012   100   100   000    Old_age   Always       -       17
194 Temperature_Celsius     0x0022   034   056   000    Old_age   Always       -       34 (Min/Max 16/56)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
199 SATA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
218 CRC_Error_Count         0x0032   100   100   000    Old_age   Always       -       0
231 SSD_Life_Left           0x0000   088   088   000    Old_age   Offline      -       88
233 Flash_Writes_GiB        0x0032   100   100   000    Old_age   Always       -       1233
241 Lifetime_Writes_GiB     0x0032   100   100   000    Old_age   Always       -       9343
242 Lifetime_Reads_GiB      0x0032   100   100   000    Old_age   Always       -       54
244 Average_Erase_Count     0x0000   100   100   000    Old_age   Offline      -       123
245 Max_Erase_Count         0x0000   100   100   000    Old_age   Offline      -       171
246 Total_Erase_Count       0x0000   100   100   000    Old_age   Offline      -       166475

Comparing these with the readings from today (further up), about 80 GiB have been written in a bit over four days, which sounds a bit on the high side to me.
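
Back of the envelope, using only the deltas between the two dumps (so a rough estimate, nothing authoritative):
Code: [Select]
# 105 power-on hours and 80 GiB of host writes between the two dumps
awk 'BEGIN {
  hours = 8272 - 8167   # delta of Power_On_Hours
  gib   = 9423 - 9343   # delta of Lifetime_Writes_GiB
  printf "%.1f GiB written per day\n", gib / (hours / 24)
  # SSD_Life_Left has dropped 12 points (100 -> 88) over 8272 hours in total
  printf "~%.1f years until it hits 0%% at this pace\n", (8272 / 12) * 88 / 24 / 365
}'

That works out to roughly 18 GiB per day, with the wear indicator reaching zero in about seven more years at the current rate.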

From what I can tell, flowd_aggregate.py is almost solely responsible for the writes:
Code: [Select]
# top -m io -o write -b
last pid: 40420;  load averages:  0.06,  0.11,  0.08  up 4+21:53:06    21:17:36
53 processes:  1 running, 52 sleeping
CPU:  1.0% user,  0.0% nice,  0.6% system,  0.3% interrupt, 98.1% idle
Mem: 150M Active, 2887M Inact, 782M Wired, 430M Buf, 4200M Free
Swap: 8192M Total, 8192M Free

  PID USERNAME     VCSW  IVCSW   READ  WRITE  FAULT  TOTAL PERCENT COMMAND
  699 root      1393772 458763    840 1273223      0 1274063  99.58% python3.8
35311 dhcpd     853298   1329     33   2949     33   3015   0.24% dhcpd
96277 dhcpd     861423   2637      0    588      0    588   0.05% dhcpd
99676 root        3756     82     70    416     51    537   0.04% radiusd
72210 root      1400080  16412      1    229      0    230   0.02% syslog-ng
 1549 root        1741     16      0    227      0    227   0.02% dhcpleases6
 5872 _flowd    200493   1613      4    159      0    163   0.01% flowd
90495 _dhcp      54483    115      0     14      0     14   0.00% dhclient
49557 root        8415    649      0      8      0      8   0.00% radvd
73189 root      825674   3873      0      0      0      0   0.00% python3.8
  422 root           4      3      0      0      0      0   0.00% python3.8
96103 root           6      0      0      0      0      0   0.00% rtsold
  424 root      171305   3810    215      0      6    221   0.02% python3.8

# ps -a -p 699
PID TT  STAT      TIME COMMAND
699  -  Ss   143:34.61 /usr/local/bin/python3 /usr/local/opnsense/scripts/netflow/flowd_aggregate.py (python3.8)

Is this amount of writing to be expected? Or is something wrong with my config, or even with flowd_aggregate.py itself? Could something be done to reduce the writing, apart from disabling NetFlow completely?

Admittedly, this is not a high-quality SSD (a KINGSTON SA400S37240G, not my choice), which might be a contributing factor to the quickly diminishing life-left number.

Any thoughts?
Title: Re: OPNSense and SSDs, expected wear with normal use
Post by: Patrick M. Hausen on June 29, 2022, 10:04:26 pm
Netflow writes information about all flows on all active interfaces SOMEWHERE. That's the point. If you don't want the wear on your local disk, you need to write to an external destination.
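
For example (just a sketch, and it assumes you have another box available with the nfdump package on it): add that host as a destination in the NetFlow settings (Reporting: NetFlow) and run a collector there, something like:
Code: [Select]
# on the remote host: listen for NetFlow on UDP 2055 and store the
# flow files under /data/netflow (check nfcapd(1) for your version's flags)
nfcapd -D -l /data/netflow -p 2055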
Title: Re: OPNSense and SSDs, expected wear with normal use
Post by: nichiren on June 30, 2022, 04:56:15 pm
Yes, obviously the data needs to be written. What I'm after is more whether this amount of writing is to be expected (judging from the answer: yes?) and, if it is, whether it can be reduced somehow without disabling Netflow or writing to external storage (no such thing is available to me).

Of course, I could perhaps add an HDD as a second drive for this and for logging in general, but I've no clue whether OPNsense supports that without resorting to sorcery.
Title: Re: OPNSense and SSDs, expected wear with normal use
Post by: Patrick M. Hausen on June 30, 2022, 05:37:59 pm
The amount of writing Netflow does is roughly proportional to the amount of traffic passing through your firewall, not counting edge cases like opening a single connection (flow), transferring a couple of terabytes through it, and not doing much else.

You can of course add a second disk, though you will have to use the command line to do so.
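
Roughly like this, from memory, so treat it as a sketch and double-check the device name before running anything. It assumes the new disk shows up as ada1 and that you want it mounted on /var/log:
Code: [Select]
# partition and format the new disk -- this DESTROYS whatever is on ada1
gpart create -s gpt ada1
gpart add -t freebsd-ufs -l varlog ada1
newfs -U /dev/ada1p1

# mount it over /var/log (best done with logging services stopped)
mount /dev/ada1p1 /var/log

# make the mount permanent
echo '/dev/ada1p1 /var/log ufs rw 2 2' >> /etc/fstab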
Title: Re: OPNSense and SSDs, expected wear with normal use
Post by: meyergru on June 30, 2022, 06:46:32 pm
I have the same "problem". My DEC750 is only half a year old and has already eaten up 5% of the disk lifetime (and that is not a consumer-grade SSD, for that matter).

Setting /var to memory is not a good solution, since that also affects other logs. /tmp is quite another matter, because it only holds genuinely temporary data and can safely be put on a RAM disk.

From a quick glance, there is a /var/log/flowd.log, which gets rotated (10 files of 10 MByte each), and a Python aggregation script that parses the log into an SQLite history database in /var/netflow.
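
You can see the sizes involved on your own box with something like this (same paths as above):
Code: [Select]
# current and rotated flowd logs
ls -lh /var/log/flowd.log*
# aggregated history databases
du -sh /var/netflow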

The aggregation is already optimised in that it commits only every 100,000 records, so it seems the logging of every packet into /var/log/flowd.log is the culprit. If the mechanism were changed to keep the current flowd.log in /tmp, one would have the choice of keeping it in memory. You would not lose much if that file got lost in a power outage, and on reboot one could do a forced rotation to persist it to permanent storage.

It is not as easy as making /var/log/flowd.log a symlink, however, because of the file rotation. It would be easier if all flowd log files were in a separate directory, like /var/log/flowd/flowd*.log.

Maybe we should open a feature request for this. The code itself was from Ad Schellevis.
Title: Re: OPNSense and SSDs, expected wear with normal use
Post by: nichiren on July 01, 2022, 08:25:18 pm
I was planning to look into the /var usage further this weekend, now that I might have some actual time on my hands, but it sounds like you've already done the legwork.

For what it's worth, that sounds like a good idea to me. Having the most write-intensive part on a RAM drive, synced to permanent storage on boot (or maybe also periodically, with a configurable interval?), would probably reduce the SSD wear dramatically. I'd second that feature request.

The simplest and quickest solution would probably be to get a separate HDD as suggested and just have it mounted as /var. Not the most elegant solution, though, and in my case I'm not sure the computer (an oldish Dell USFF desktop) can even fit an additional drive. And I know many people who use even smaller machines.
Title: Re: OPNSense and SSDs, expected wear with normal use
Post by: 0x on September 17, 2023, 06:58:57 pm
( apologies for the necro )

Curious whether there are any solutions along these lines available to me?

I'm running on a small NUC with a large amount of RAM and a big NVMe SSD.
Title: Re: OPNSense and SSDs, expected wear with normal use
Post by: meyergru on September 17, 2023, 08:01:27 pm
In the meantime, it is no longer the whole of /var that gets put on a RAM disk, but only /var/log, if you configure it.

So /tmp and /var/log can be put on a RAM disk, which certainly eases the load on your disk. Also, you can choose to lower the log level of many services.
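
If I remember correctly, both options live under System: Settings: Miscellaneous these days (the exact wording has changed between versions). Once enabled, you can check that the RAM disks are actually in place with something like:
Code: [Select]
# tmpfs entries appear in the mount table once the RAM disks are active
mount | grep tmpfs
df -h /tmp /var/log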