SSD Drive failure at 10,000 writes? Thoughts?

Started by greghi, October 25, 2021, 03:09:29 AM

OK, I was looking around the web and came across an article about pfSense arguing that using an SSD in a firewall setup might not be a good idea. The author's reasoning is that the firewall (OPNsense/pfSense) writes data just about every second (depending on usage, size of network, etc.) and that SSDs are a poor choice for high write volumes because they will fail at around 10,000 writes. Any thoughts or opinions? I won't post the link here since it's actually on a pfSense forum (unless I am allowed to do this), but the article was interesting and it did make some sense. It was written 10 years ago, but the underlying premise still applies today.
Greg

The relevant number for the write endurance of an SSD is the "TBW" or "Terabytes Written". For a typical SSD as one might use in an embedded device, like the Transcend mSATA SSD 370S in e.g. the 128 GB size, this number is 360 TB.

Datasheet here:
https://www.transcend-info.com/Products/No-632

So while the concern is fundamentally valid, the number of 10,000 you got from the pfSense forum is simply several orders of magnitude too low.

You can get the amount of writes an SSD has done with smartctl -a /dev/ada0, if ada0 is your device name.

The numbers to look for, in the case of one of my devices that has been operational for more than a year, are:
Remaining_Lifetime_Perc: 100
TLC_Writes_32MiB: 50568


Which means the SSD has done about 1.5 TiB of writes in units of 32 MiB "cells", and the expected remaining lifetime is at 100%, which means it has not yet reallocated any cells from the reserved area to replace failed ones. And it essentially has no clue about the remaining lifetime, because that depends on future writes. The number will start to go down once the device shows some wear.

HTH,
Patrick
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

Thanks Patrick, very useful indeed, but I didn't understand how to interpret this:
Remaining_Lifetime_Perc: 97
TLC_Writes_32MiB: 432542


How do you calculate the TiB of writes and the expected remaining lifetime?

Tia.

32 MiB x 432542 ≈ 13.2 TiB.
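
For anyone who wants to redo this conversion, here is a minimal Python sketch. It assumes the raw TLC_Writes_32MiB counter really counts units of 32 MiB, as the attribute name suggests; the function name is made up for illustration. Keep in mind that TiB (1024^4 bytes) and TB (1000^4 bytes) differ by roughly 10%, which matters when comparing against a TBW rating given in TB.

```python
MIB = 1024 ** 2   # bytes in one mebibyte
TIB = 1024 ** 4   # bytes in one tebibyte (binary)
TB = 1000 ** 4    # bytes in one terabyte (decimal, as used in TBW ratings)


def writes_from_32mib_units(raw_value: int) -> tuple[float, float]:
    """Convert a raw counter in 32 MiB units to (TiB, TB).

    Assumption: one counter step equals exactly 32 MiB written.
    """
    total_bytes = raw_value * 32 * MIB
    return total_bytes / TIB, total_bytes / TB


if __name__ == "__main__":
    # The two raw values posted in this thread.
    for raw in (50568, 432542):
        tib, tb = writes_from_32mib_units(raw)
        print(f"raw={raw}: {tib:.2f} TiB = {tb:.2f} TB")
```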

Expected remaining lifetime: can only be calculated by monitoring Remaining_Lifetime_Perc over time. Currently your SSD has used 3% of the reserve cells to replace failed ones. Monitor how long it takes to go from 3% to 4% to 5%, then estimate when you will reach 90% ...
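
The monitoring approach above can be sketched as a simple linear extrapolation. This is only an illustration: it assumes wear progresses at a constant rate, the function name is hypothetical, and the first observation date (1 May 2020, 100%) is an assumed starting point since only "May 2020" is mentioned in the thread.

```python
from datetime import date, timedelta


def estimate_date_reaching(observations, threshold_pct):
    """Extrapolate when the remaining-life percentage hits threshold_pct.

    observations: list of (date, percent_remaining), oldest first.
    Returns None if the value has not dropped yet (nothing to extrapolate).
    """
    (first_day, first_pct), (last_day, last_pct) = observations[0], observations[-1]
    elapsed_days = (last_day - first_day).days
    drop = first_pct - last_pct
    if elapsed_days <= 0 or drop <= 0:
        return None
    rate_per_day = drop / elapsed_days            # percentage points lost per day
    days_left = (last_pct - threshold_pct) / rate_per_day
    return last_day + timedelta(days=round(days_left))


if __name__ == "__main__":
    # Assumed readings: 100% when installed, 97% on the date of this post.
    readings = [(date(2020, 5, 1), 100), (date(2021, 10, 25), 97)]
    print("Estimated date of reaching 90%:", estimate_date_reaching(readings, 90))
```

With more readings over time, the first and last points give a better estimate than any two adjacent ones, because the percentage only changes in whole steps.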

Oh I see, thanks: I have the exact same drive - Transcend 370S 128GB - and if I understood correctly, this drive will 'die' when it reaches 360 Terabytes of data written ?

It's been online 24/7 since May 2020 and I am now at about 13.2 TiB, so I understand there is no precise formula to calculate how long it will last...

Thanks.

It's guaranteed to last at least 360 TB. How fast it fails afterwards and in which way precisely again depends ...

German magazine c't had done a "let's write some SSDs to death" test and found that most last way longer than the guaranteed TBW.

Plus most of the time that warranty is combined with a time period, so e.g. for a particular Samsung drive it's 600 TBW or 5 years, whichever comes first.
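
To put the TBW figure into perspective, here is a rough Python sketch of how long it would take to reach the rated TBW at a constant write rate, and how a combined "TBW or years" warranty plays out. The inputs are assumptions for illustration: about 14.5 TB written (432542 x 32 MiB expressed in decimal TB) over roughly 1.5 years, and the function names are hypothetical.

```python
def years_to_reach_tbw(written_tb: float, elapsed_years: float, rated_tbw_tb: float) -> float:
    """Years from now until the rated TBW is reached, assuming a constant write rate."""
    rate_tb_per_year = written_tb / elapsed_years
    return (rated_tbw_tb - written_tb) / rate_tb_per_year


def warranty_years_left(written_tb: float, elapsed_years: float,
                        rated_tbw_tb: float, warranty_years: float) -> float:
    """Remaining warranty when it ends at the TBW limit or the time limit, whichever comes first."""
    by_writes = years_to_reach_tbw(written_tb, elapsed_years, rated_tbw_tb)
    by_time = warranty_years - elapsed_years
    return max(0.0, min(by_writes, by_time))


if __name__ == "__main__":
    # Assumed usage: ~14.5 TB in ~1.5 years on a drive rated for 360 TBW.
    print(f"Years until 360 TBW: {years_to_reach_tbw(14.5, 1.5, 360.0):.1f}")
    # Same assumed usage against a "600 TBW or 5 years" style warranty.
    print(f"Warranty years left: {warranty_years_left(14.5, 1.5, 600.0, 5.0):.1f}")
```

At firewall-style write rates, the time limit of such a warranty will typically be reached long before the TBW limit.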

Quote from: pmhausen on October 25, 2021, 10:45:59 AM
32 MiB x 432542 ≈ 13.2 TiB.

Expected remaining lifetime: can only be calculated by monitoring Remaining_Lifetime_Perc over time. Currently your SSD has used 3% of the reserve cells to replace failed ones. Monitor how long it takes to go from 3% to 4% to 5%, then estimate when you will reach 90% ...

I could be wrong, but I don't think this percentage flags failed cells. It just calculates the number of erase cycles remaining before EOL. This is explained on Crucial's website, but maybe other vendors calculate this differently (Attribute 202...)

I've been monitoring mine and after 2.5 years it is showing 73% remaining. I just switched to tmpfs for both tmp and var to alleviate some of the wear. Hopefully that helps, but either way it's not a worry as I'll probably have different hardware in 10 years. BTW, the SMART data for power-on hours is way off for some reason (below).

https://www.crucial.com/support/articles-faq-ssd/ssds-and-smart-data

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       12278
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       13
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   073   073   000    Old_age   Always       -       419
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       4
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       26
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   051   030   000    Old_age   Always       -       49 (Min/Max 0/70)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_ECC_Cnt 0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   073   073   001    Old_age   Offline      -       27
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       60992914549
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       1027826061
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       674187293
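
A table like the one above can be picked apart with a few lines of Python. This is a sketch only: the parser name is made up, it expects the ten-column attribute layout shown above, and the conversion of Total_LBAs_Written to terabytes assumes 512-byte LBAs, which is not stated anywhere in the output.

```python
def parse_smart_attributes(text: str) -> dict:
    """Map attribute name -> (normalized VALUE, RAW_VALUE string)."""
    attrs = {}
    for line in text.splitlines():
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        fields = line.split(None, 9)
        if len(fields) == 10 and fields[0].isdigit():
            attrs[fields[1]] = (int(fields[3]), fields[9].strip())
    return attrs


SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
173 Ave_Block-Erase_Count   0x0032   073   073   000    Old_age   Always       -       419
202 Percent_Lifetime_Remain 0x0030   073   073   001    Old_age   Offline      -       27
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       60992914549
"""

if __name__ == "__main__":
    attrs = parse_smart_attributes(SAMPLE)
    remaining, used_raw = attrs["Percent_Lifetime_Remain"]
    print(f"Life remaining: {remaining}% (raw value, percent used: {used_raw})")
    # Assumption: one LBA is 512 bytes.
    written_tb = int(attrs["Total_LBAs_Written"][1]) * 512 / 1000 ** 4
    print(f"Host writes: {written_tb:.2f} TB")
```

On this drive the normalized VALUE column counts down from 100 while the raw value counts up, which is the same ambiguity that causes confusion with other vendors' attributes.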
HP T730/AMD  RX-427BB/8GB/500GB SSD
HP NC365T 4-PORT

TL;DR You probably won't manage to kill a modern SSD by writing to it in a more or less normal scenario. Certainly not by logging.

I agree with the consensus that the SSD lifespan is not a concern for most firewall use cases. Here are the stats on my cheapo 120GB SATA SSD that has been running OPNsense non-stop for 2.3 years.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   000   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       20426
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       161
148 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
149 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
167 Write_Protect_Mode      0x0000   100   100   000    Old_age   Offline      -       0
168 SATA_Phy_Error_Count    0x0012   100   100   000    Old_age   Always       -       0
169 Bad_Block_Rate          0x0000   100   100   000    Old_age   Offline      -       5
170 Bad_Blk_Ct_Erl/Lat      0x0000   100   100   010    Old_age   Offline      -       0/13
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 MaxAvgErase_Ct          0x0000   100   100   000    Old_age   Offline      -       149 (Average 118)
181 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0000   100   100   000    Old_age   Offline      -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Unsafe_Shutdown_Count   0x0012   100   100   000    Old_age   Always       -       59
194 Temperature_Celsius     0x0022   073   069   000    Old_age   Always       -       27 (Min/Max 22/31)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
199 SATA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
218 CRC_Error_Count         0x0032   100   100   000    Old_age   Always       -       1
231 SSD_Life_Left           0x0000   012   012   000    Old_age   Offline      -       88
233 Flash_Writes_GiB        0x0032   100   100   000    Old_age   Always       -       7601
241 Lifetime_Writes_GiB     0x0032   100   100   000    Old_age   Always       -       12530
242 Lifetime_Reads_GiB      0x0032   100   100   000    Old_age   Always       -       122
244 Average_Erase_Count     0x0000   100   100   000    Old_age   Offline      -       118
245 Max_Erase_Count         0x0000   100   100   000    Old_age   Offline      -       149
246 Total_Erase_Count       0x0000   100   100   000    Old_age   Offline      -       1301864


At 88% life remaining, I'm using roughly 5% of the SSD life every year. At this rate I'd have another 16 years remaining. And this is on a very cheap Kingston 120GB SATA SSD. A higher capacity and higher end SSD would be able to balance writes more effectively and would likely have an even greater lifespan for this use case. Plus, the SSD is faster, silent, and uses less power than a traditional spinning disk.

Quote
At 88% life remaining, I'm using roughly 5% of the SSD life every year. At this rate I'd have another 16 years remaining.

Actually you have 12% remaining lol.

Quote from: gpb on November 14, 2021, 06:49:29 PM
Quote
At 88% life remaining, I'm using roughly 5% of the SSD life every year. At this rate I'd have another 16 years remaining.

Actually you have 12% remaining lol.

:o Are you sure about that? I've watched it slowly tick down from the high 90s to where it is now, in the high 80s, after 2+ years.

Quote from: opnfwb on November 15, 2021, 01:58:20 AM
Quote from: gpb on November 14, 2021, 06:49:29 PM
Quote
At 88% life remaining, I'm using roughly 5% of the SSD life every year. At this rate I'd have another 16 years remaining.

Actually you have 12% remaining lol.

:o Are you sure about that? I've watched it slowly tick down from the high 90s to where it is now, in the high 80s, after 2+ years.

Well, it would be reversed from how mine reads. It seems different brands use different formats, and no, I'm not sure. If you've been watching it tick down, you're fine. Cheers!   ;)

I think the issue here is probably smartctl not reporting the value title in the same way as the manufacturer. I was curious enough about this that I quickly pulled the drive and ran the manufacturer's diag tool on it. In my case, this is a Kingston SSD.

smartctl reports the value as 'SSD_Life_Left', whereas Kingston actually lists it as "SSD Wear Indicator" and shows the wear at 12%, with an estimated remaining life of 88%.

The swapped ID titles in smartctl don't make this any easier to decipher. However, it looks like the drive has a long life ahead of it (fingers crossed ;) ).

Excellent!  Sorry for the scare.   :)