OPNsense Forum

English Forums => General Discussion => Topic started by: hushcoden on February 26, 2024, 07:02:22 pm

Title: One SSD has failed: what's next?
Post by: hushcoden on February 26, 2024, 07:02:22 pm
I'm running 23.7.12_5 installed on two Transcend 128GB SSDs (ZFS), one 2.5" SATA and one mSATA. Looking at the SMART Status widget on my dashboard, I've noticed that one SSD has disappeared. I suppose that means one drive has failed, am I correct?

How do I find out which one has failed?

Tia.
Title: Re: One SSD has failed: what's next?
Post by: Patrick M. Hausen on February 26, 2024, 07:08:42 pm
Serial numbers?
Title: Re: One SSD has failed: what's next?
Post by: hushcoden on February 26, 2024, 07:10:48 pm
The one I see is ada0 - see attachment - what check do I have to perform?
Title: Re: One SSD has failed: what's next?
Post by: Patrick M. Hausen on February 26, 2024, 07:15:39 pm
Open the case and check the serial numbers on the devices. Only way to tell. Sorry for having been so terse, I thought that was evident.
Title: Re: One SSD has failed: what's next?
Post by: hushcoden on February 26, 2024, 07:17:23 pm
Quote from: Patrick M. Hausen on February 26, 2024, 07:15:39 pm
Open the case and check the serial numbers on the devices. Only way to tell. Sorry for having been so terse, I thought that was evident.

Np :-) I can definitely open the case, but how do I tell which one has failed? Is the number on the SMART Status widget the actual serial number of one of the SSDs?
Title: Re: One SSD has failed: what's next?
Post by: Patrick M. Hausen on February 26, 2024, 07:18:27 pm
The one with the serial number shown in your screenshot is the working one. The other one is the failed one.
There are stickers with serial numbers on the devices!

Quote from: hushcoden on February 26, 2024, 07:17:23 pm
Is the number on the SMART Status widget the actual serial number of one of the SSD?

Yes of course, wouldn't make much sense, otherwise.  ;)
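
The serial number can also be read per device from the shell, which makes it easy to compare against the stickers. This is only a minimal sketch: `serial_of` is a helper name made up for this example, and the device names /dev/ada0 and /dev/ada1 are assumptions. It does nothing more than filter the `Serial Number:` line that `smartctl -i` prints.

```shell
#!/bin/sh
# serial_of: hypothetical helper. Reads `smartctl -i` output on stdin
# and prints only the value of the "Serial Number:" line.
serial_of() {
    awk -F': *' '/^Serial Number:/ { print $2; exit }'
}

# Usage on the firewall, as root (device names are assumptions):
#   for d in /dev/ada0 /dev/ada1; do
#       printf '%s: ' "$d"
#       smartctl -i "$d" | serial_of
#   done
```

A drive that is not currently attached will not be listed this way, so the serial that is missing from the output belongs to the drive to inspect.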
Title: Re: One SSD has failed: what's next?
Post by: hushcoden on February 26, 2024, 07:19:59 pm
Gotcha  :P
Title: Re: One SSD has failed: what's next?
Post by: hushcoden on February 26, 2024, 08:11:35 pm
You gotta be kidding me: after switching the device off, checking the serial numbers and switching it back on, I now see the mSATA drive too (that was the one that disappeared) - what happened??  :o

Is there anything I can check via CLI at all?
Title: Re: One SSD has failed: what's next?
Post by: newsense on February 26, 2024, 08:15:35 pm
If the SSD is healthy, it was probably a BIOS glitch while initializing the drives.

Check the SMART data on the drive and see if there's anything unusual.
Title: Re: One SSD has failed: what's next?
Post by: hushcoden on February 26, 2024, 08:18:29 pm
Code: [Select]
smartctl 7.4 2023-08-01 r5530 [FreeBSD 13.2-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

Code: [Select]
smartctl 7.4 2023-08-01 r5530 [FreeBSD 13.2-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
No Errors Logged
Title: Re: One SSD has failed: what's next?
Post by: Patrick M. Hausen on February 26, 2024, 08:59:34 pm
"overall-health self-assessment" isn't worth anything. Perform a long selftest on both devices, check for results tomorrow.

Code: [Select]
smartctl -t long /dev/ada0
smartctl -t long /dev/ada1

To check the results:

Code: [Select]
smartctl -l selftest /dev/ada0
smartctl -l selftest /dev/ada1

Do not power cycle the device while the test is running.
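
If you would rather script the check than read the log by eye, something along these lines may help. It is a sketch under assumptions: `latest_selftest_ok` is a made-up helper name, and it relies only on the log layout in which the most recent test is the entry numbered `# 1` and a clean run reads "Completed without error".

```shell
#!/bin/sh
# latest_selftest_ok: hypothetical helper. Reads `smartctl -l selftest`
# output on stdin and succeeds only if the most recent entry ("# 1")
# reports "Completed without error". Anything else, including an empty
# log, gives a non-zero exit status.
latest_selftest_ok() {
    grep -E '^# *1 ' | grep -q 'Completed without error'
}

# Usage, as root (device names are assumptions):
#   for d in /dev/ada0 /dev/ada1; do
#       if smartctl -l selftest "$d" | latest_selftest_ok; then
#           echo "$d: last self-test passed"
#       else
#           echo "$d: check the self-test log manually"
#       fi
#   done
```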
Title: Re: One SSD has failed: what's next?
Post by: hushcoden on February 28, 2024, 02:26:38 pm
It seems all good, but I still can't understand why that drive disappeared from the dashboard in the first place...  ???

Code: [Select]
root@hush:/home/hush # smartctl -l selftest /dev/ada0
smartctl 7.4 2023-08-01 r5530 [FreeBSD 13.2-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       247         -

Code: [Select]
root@hush:/home/hush # smartctl -l selftest /dev/ada1
smartctl 7.4 2023-08-01 r5530 [FreeBSD 13.2-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       100         -
# 2  Short offline       Completed without error       00%       100         -

What does LifeTime (hours) mean?

Tia.
Title: Re: One SSD has failed: what's next?
Post by: Patrick M. Hausen on February 28, 2024, 02:39:15 pm
No idea, sorry. SSDs do maintain a wear indicator based on their specified TBW value. I don't know what the output in this particular line is referring to.

To read that wear value:
Code: [Select]
# for NVME - counter goes from 0 for factory new to 100
/usr/local/sbin/smartctl -A /dev/nvmeN | fgrep 'Percentage Used:'
# for SATA - counter goes from 100 for factory new down to 0
/usr/local/sbin/smartctl -A /dev/adaN | fgrep 'Wear_Leveling_Count'

Now check your `zpool status`, run a `zpool scrub`, and once everything is fine, run a `zpool clear` if necessary.
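
Spelled out, that sequence might look like the sketch below. The pool name `zroot` is an assumption (use whatever `zpool list` shows), and `pool_state` is a made-up helper that only extracts the `state:` line from `zpool status`.

```shell
#!/bin/sh
# Sketch only. POOL is an assumption; substitute the name `zpool list` shows.
POOL="${POOL:-zroot}"

# pool_state: hypothetical helper. Reads `zpool status` output on stdin
# and prints the value of the first "state:" line (e.g. ONLINE, DEGRADED).
pool_state() {
    awk '$1 == "state:" { print $2; exit }'
}

# Typical sequence, run as root:
#   zpool status "$POOL" | pool_state   # current pool health
#   zpool scrub "$POOL"                 # start a scrub (runs in the background)
#   zpool status "$POOL"                # scrub progress and READ/WRITE/CKSUM counters
#   zpool clear "$POOL"                 # only if old error counters are still shown
```

The scrub reads and verifies every block on both mirror members, so it is the ZFS-level counterpart to the SMART long self-test.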
Title: Re: One SSD has failed: what's next?
Post by: Greg_E on February 28, 2024, 05:54:57 pm
I've had drives with questionable SATA cables in the past. Drive "fails", you spend a bunch of time fooling around. Slide the server out on the rails and check again, drive now works. Wiggle cables, drive goes away again.

I would probably make sure all the cables are seated properly, and maybe replace the SATA cable on that drive.
Title: Re: One SSD has failed: what's next?
Post by: hushcoden on February 28, 2024, 06:16:48 pm
Yes, when I opened the device, I did replace the SATA cable :-)

And these are the wear values:

Code: [Select]
root@hush:/home/hush # /usr/local/sbin/smartctl -A /dev/ada0 | fgrep 'Wear_Leveling_Count'
177 Wear_Leveling_Count     0x0000   100   100   000    Old_age   Offline      -       197

Code: [Select]
root@hush:/home/hush # /usr/local/sbin/smartctl -A /dev/ada1 | fgrep 'Wear_Leveling_Count'
177 Wear_Leveling_Count     0x0000   100   100   000    Old_age   Offline      -       828

I'm afraid I have no clue what those numbers mean?

Code: [Select]
root@hush:/home/hush # zpool status
  pool: zroot
 state: ONLINE
  scan: resilvered 529M in 00:00:02 with 0 errors on Mon Feb 26 19:03:35 2024
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada1p4  ONLINE       0     0     0
            ada0p4  ONLINE       0     0     0

errors: No known data errors
Title: Re: One SSD has failed: what's next?
Post by: Patrick M. Hausen on February 28, 2024, 06:24:53 pm
I should have included a way to keep the top line  ;)

With your values:
Code: [Select]
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
177 Wear_Leveling_Count     0x0000   100   100   000    Old_age   Offline      -       197
177 Wear_Leveling_Count     0x0000   100   100   000    Old_age   Offline      -       828

So the value is "100", which for a SATA SSD means "practically factory new".
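
One possible way to keep the top line is to match the header row as well as the attribute. This is a minimal sketch: `wear_with_header` is a made-up helper name, and it assumes the header row contains the text `ATTRIBUTE_NAME`, as in the output above.

```shell
#!/bin/sh
# wear_with_header: hypothetical helper. Reads `smartctl -A` output on
# stdin and keeps the column header row plus the Wear_Leveling_Count row.
wear_with_header() {
    grep -E 'ATTRIBUTE_NAME|Wear_Leveling_Count'
}

# Usage, as root (device name is an assumption):
#   /usr/local/sbin/smartctl -A /dev/ada0 | wear_with_header
```

With the header in place it is easier to see that "100" sits in the VALUE column, while 197 and 828 are the RAW_VALUE column.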
Title: Re: One SSD has failed: what's next?
Post by: hushcoden on February 28, 2024, 07:06:55 pm
Happy days, then, many thanks for your support !