One SSD has failed: what's next?

Started by hushcoden, February 26, 2024, 07:02:22 PM

I'm running 23.7.12_5 installed on two Transcend 128GB SSDs (ZFS), one 2.5" SATA and one mSATA. Looking at my dashboard (SMART Status), I noticed that one SSD has disappeared. I suppose that means one drive has failed, am I correct?

How do I find out which one has failed?

Tia.

Serial numbers?
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

The one I see is ada0 (see attachment). What check do I have to perform?

Open the case and check the serial numbers on the devices. Only way to tell. Sorry for having been so terse, I thought that was evident.

Quote from: Patrick M. Hausen on February 26, 2024, 07:15:39 PM
Open the case and check the serial numbers on the devices. Only way to tell. Sorry for having been so terse, I thought that was evident.
Np :-) I can definitely open the case, but how do I tell which one has failed? Is the number on the SMART Status widget the actual serial number of one of the SSDs?

February 26, 2024, 07:18:27 PM #5 Last Edit: February 26, 2024, 07:22:50 PM by Patrick M. Hausen
The one with the serial number shown in your screenshot is the working one. The other one is the failed one.
There are stickers with serial numbers on the devices!
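
The serial number can also be read from the command line before opening the case, which makes it easier to match a sticker to a device name. A minimal sketch, assuming a POSIX sh (start `sh` first if your login shell is csh/tcsh) and assuming `smartctl -i` prints a `Serial Number:` line for the drive; `serial_of` is a hypothetical helper name, not part of smartmontools:

```shell
# Hypothetical helper: extract the serial number from `smartctl -i` output on stdin.
# Assumes a line of the form "Serial Number:    <serial>".
serial_of() {
    awk -F': *' '/^Serial Number:/ { print $2 }'
}

# Usage (device names as seen in this thread):
#   for d in ada0 ada1; do
#       printf '%s: ' "$d"; smartctl -i "/dev/$d" | serial_of
#   done
```

A device that has dropped off the bus will not answer at all, so the serial that is missing from this list is the one to look for on the stickers.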

Quote from: hushcoden on February 26, 2024, 07:17:23 PM
Is the number on the SMART Status widget the actual serial number of one of the SSDs?
Yes, of course. It wouldn't make much sense otherwise.  ;)


You gotta be kidding me: after switching the device off, checking the serial numbers and switching it back on, I now see the mSATA drive too (that was the one that disappeared) - what happened??  :o

Is there anything I can check via CLI at all?

If the SSD is healthy, it was probably a BIOS glitch while initializing the drives.

Check the SMART data on the drive and see if there's anything unusual.

February 26, 2024, 08:18:29 PM #9 Last Edit: February 26, 2024, 08:23:05 PM by hushcoden
smartctl 7.4 2023-08-01 r5530 [FreeBSD 13.2-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED


smartctl 7.4 2023-08-01 r5530 [FreeBSD 13.2-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
No Errors Logged

"overall-health self-assessment" isn't worth anything. Perform a long selftest on both devices, check for results tomorrow.

smartctl -t long /dev/ada0
smartctl -t long /dev/ada1


To check the results:

smartctl -l selftest /dev/ada0
smartctl -l selftest /dev/ada1


Do not power cycle the device while the test is running.
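
To check the result without reading the whole log by eye, the newest entry (the `# 1` row) can be tested for the "Completed without error" status. This is a rough sketch, assuming a POSIX sh and the self-test log layout that smartctl prints for ATA drives; `latest_selftest_ok` is a hypothetical helper name:

```shell
# Hypothetical helper: succeed only if the newest self-test entry ("# 1")
# in `smartctl -l selftest` output (read from stdin) completed without error.
latest_selftest_ok() {
    awk '
        /^# *1[[:space:]]/ { found = 1; ok = ($0 ~ /Completed without error/) }
        END { if (found && ok) exit 0; exit 1 }
    '
}

# Usage:
#   smartctl -l selftest /dev/ada0 | latest_selftest_ok && echo "ada0: last self-test passed"
```

The helper also fails when no `# 1` row is present at all, so an empty log is not mistaken for a pass.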

It seems it's all good, but still I can't understand why that drive disappeared from the dashboard, in the first place...  ???

root@hush:/home/hush # smartctl -l selftest /dev/ada0
smartctl 7.4 2023-08-01 r5530 [FreeBSD 13.2-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       247         -


root@hush:/home/hush # smartctl -l selftest /dev/ada1
smartctl 7.4 2023-08-01 r5530 [FreeBSD 13.2-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       100         -
# 2  Short offline       Completed without error       00%       100         -


What does LifeTime(hours) mean?

Tia.

No idea, sorry. SSDs do maintain a wear indicator based on their specified TBW value. I don't know what the output in this particular line is referring to.

To read that wear value:
# for NVMe - counter goes from 0 for factory new to 100
/usr/local/sbin/smartctl -A /dev/nvmeN | fgrep 'Percentage Used:'
# for SATA - counter goes from 100 for factory new down to 0
/usr/local/sbin/smartctl -A /dev/adaN | fgrep 'Wear_Leveling_Count'
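
As a rough guide to reading the SATA line: assuming the usual `smartctl -A` attribute table layout (ID, name, flag, VALUE, WORST, THRESH, type, updated, when-failed, RAW_VALUE), the fourth column is the normalized value that counts down from 100. The raw value in the last column is vendor-specific, so any interpretation of it is a guess. A minimal sketch for a POSIX sh, where `wear_summary` is a hypothetical helper name:

```shell
# Hypothetical helper: summarize the Wear_Leveling_Count row of `smartctl -A`
# output read from stdin. Assumes the standard 10-column ATA attribute table.
wear_summary() {
    awk '$2 == "Wear_Leveling_Count" {
        printf "normalized=%s worst=%s threshold=%s raw=%s\n", $4, $5, $6, $10
    }'
}

# Usage:
#   /usr/local/sbin/smartctl -A /dev/adaN | wear_summary
```

A normalized value still at 100 would indicate that the drive reports no measurable wear yet.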


Now check your `zpool status`, perform a `zpool scrub`, and, when all is fine, run a `zpool clear` if necessary.
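
That sequence might look roughly like the sketch below. It assumes a POSIX sh and the pool name `zroot`, which is the one that appears in the `zpool status` output posted in this thread. `pool_states` is a hypothetical helper that only condenses the status output:

```shell
# Hypothetical helper: print "<pool> <state>" for each pool found in
# `zpool status` output read from stdin.
pool_states() {
    awk '$1 == "pool:" { pool = $2 } $1 == "state:" { print pool, $2 }'
}

# Possible sequence (pool name zroot assumed):
#   zpool status | pool_states   # a healthy mirror should report "zroot ONLINE"
#   zpool scrub zroot            # starts the scrub; it runs in the background
#   zpool status zroot           # re-check once the scrub has finished
#   zpool clear zroot            # only if stale error counters remain after a clean scrub
```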

I've had drives with questionable SATA cables in the past. Drive "fails", you spend a bunch of time fooling around. Slide the server out on the rails and check again, drive now works. Wiggle cables, drive goes away again.

I would probably make sure all the cables are seated properly, and maybe replace the SATA cable on that drive.

February 28, 2024, 06:16:48 PM #14 Last Edit: February 28, 2024, 06:19:22 PM by hushcoden
Yes, when I opened the device, I did replace the SATA cable :-)

And these are the wear values:

root@hush:/home/hush # /usr/local/sbin/smartctl -A /dev/ada0 | fgrep 'Wear_Leveling_Count'
177 Wear_Leveling_Count     0x0000   100   100   000    Old_age   Offline      -       197


root@hush:/home/hush # /usr/local/sbin/smartctl -A /dev/ada1 | fgrep 'Wear_Leveling_Count'
177 Wear_Leveling_Count     0x0000   100   100   000    Old_age   Offline      -       828


I'm afraid I have no clue what those numbers mean.

root@hush:/home/hush # zpool status
  pool: zroot
state: ONLINE
  scan: resilvered 529M in 00:00:02 with 0 errors on Mon Feb 26 19:03:35 2024
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada1p4  ONLINE       0     0     0
            ada0p4  ONLINE       0     0     0

errors: No known data errors