OPNsense Forum
English Forums => General Discussion => Topic started by: hushcoden on February 26, 2024, 07:02:22 pm
-
I'm running 23.7.12_5 installed on two Transcend 128GB SSDs (ZFS), one 2.5" SATA and one mSATA. Looking at the SMART Status widget on my dashboard, I've noticed that one SSD has disappeared. I suppose that means one drive has failed, am I correct?
How do I find out which one has failed?
Tia.
-
Serial numbers?
-
The one I see is ada0 (see attachment). What check do I have to perform?
-
Open the case and check the serial numbers on the devices. Only way to tell. Sorry for having been so terse, I thought that was evident.
-
Np :-) I can definitely open the case, but how do I tell which one has failed? Is the number on the SMART Status widget the actual serial number of one of the SSDs?
-
The one with the serial number shown in your screenshot is the working one. The other one is the failed one.
There are stickers with serial numbers on the devices!
"Is the number on the SMART Status widget the actual serial number of one of the SSDs?"
Yes, of course. It wouldn't make much sense otherwise. ;)
-
Gotcha :P
-
You gotta be kidding me: after switching the device off, checking the serial numbers and switching it back on, I now see the mSATA drive too (that was the one that disappeared) - what happened?? :o
Is there anything I can check via CLI at all?
-
If the SSD is healthy, it was probably a BIOS glitch while initializing the drives.
Check the SMART data on the drive and see if there's anything unusual.
-
smartctl 7.4 2023-08-01 r5530 [FreeBSD 13.2-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
smartctl 7.4 2023-08-01 r5530 [FreeBSD 13.2-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
No Errors Logged
-
"overall-health self-assessment" isn't worth anything. Perform a long selftest on both devices, check for results tomorrow.
smartctl -t long /dev/ada0
smartctl -t long /dev/ada1
To check the results:
smartctl -l selftest /dev/ada0
smartctl -l selftest /dev/ada1
Do not power cycle the device while the test is running.
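For scripting the result check, a minimal sketch in POSIX sh follows. It assumes the log format printed by `smartctl -l selftest`, where each logged test is a row starting with `#`. The function name `selftest_ok` is made up for illustration and is not part of smartmontools. If your root shell is csh, run it from `sh`.

```sh
# Hypothetical helper: reads `smartctl -l selftest` output on stdin.
# Exit status is 0 only if at least one test is logged and every
# logged test row reads "Completed without error".
selftest_ok() {
    awk '
        /^#/ { rows++; if ($0 !~ /Completed without error/) bad++ }
        END  { if (rows > 0 && bad == 0) exit 0; exit 1 }
    '
}
```

Possible usage: `smartctl -l selftest /dev/ada0 | selftest_ok && echo "ada0: all logged self-tests passed"`. A test that is still running or was aborted also makes the check fail, so a non-zero status only means "look at the log", not necessarily "drive is bad".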
-
It all seems good, but I still can't understand why that drive disappeared from the dashboard in the first place... ???
root@hush:/home/hush # smartctl -l selftest /dev/ada0
smartctl 7.4 2023-08-01 r5530 [FreeBSD 13.2-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       247         -
root@hush:/home/hush # smartctl -l selftest /dev/ada1
smartctl 7.4 2023-08-01 r5530 [FreeBSD 13.2-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       100         -
# 2  Short offline       Completed without error       00%       100         -
What does LifeTime(hours) mean?
Tia.
-
No idea, sorry. SSDs do maintain a wear indicator based on their specified TBW value. I don't know what the output in this particular line is referring to.
To read that wear value:
# for NVMe - counter goes from 0 for factory new to 100
/usr/local/sbin/smartctl -A /dev/nvmeN | fgrep 'Percentage Used:'
# for SATA - counter goes from 100 for factory new down to 0
/usr/local/sbin/smartctl -A /dev/adaN | fgrep 'Wear_Leveling_Count'
Now check your `zpool status`, run a `zpool scrub`, and once everything is fine, run a `zpool clear` if necessary.
-
I've had drives with questionable SATA cables in the past. Drive "fails", you spend a bunch of time fooling around. Slide the server out on the rails and check again, drive now works. Wiggle cables, drive goes away again.
I would probably make sure all the cables are seated properly, and maybe replace the SATA cable on that drive.
-
Yes, when I opened the device, I did replace the SATA cable :-)
And these are the wear values:
root@hush:/home/hush # /usr/local/sbin/smartctl -A /dev/ada0 | fgrep 'Wear_Leveling_Count'
177 Wear_Leveling_Count 0x0000 100 100 000 Old_age Offline - 197
root@hush:/home/hush # /usr/local/sbin/smartctl -A /dev/ada1 | fgrep 'Wear_Leveling_Count'
177 Wear_Leveling_Count 0x0000 100 100 000 Old_age Offline - 828
I'm afraid I have no clue what those numbers mean.
root@hush:/home/hush # zpool status
  pool: zroot
 state: ONLINE
  scan: resilvered 529M in 00:00:02 with 0 errors on Mon Feb 26 19:03:35 2024
config:
        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada1p4  ONLINE       0     0     0
            ada0p4  ONLINE       0     0     0
errors: No known data errors
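To decide quickly whether a `zpool clear` is needed at all, the output above could be checked with a small sh filter. This is only a sketch: the function name `pool_clean` is invented for illustration, and it assumes the usual `zpool status` layout with a `state:` line and a `NAME STATE READ WRITE CKSUM` table. `zpool status -x` is a built-in alternative that only reports pools with problems.

```sh
# Hypothetical helper: reads `zpool status` output on stdin.
# Exit status is 0 only if every pool state is ONLINE and all
# READ/WRITE/CKSUM counters in the config table are 0.
pool_clean() {
    awk '
        $1 == "state:"  { seen = 1; if ($2 != "ONLINE") bad = 1 }
        $1 == "NAME"    { in_cfg = 1; next }   # start of the device table
        $1 == "errors:" { in_cfg = 0 }         # end of the device table
        in_cfg && NF >= 5 && ($3 != 0 || $4 != 0 || $5 != 0) { bad = 1 }
        END { if (seen && !bad) exit 0; exit 1 }
    '
}
```

Possible usage: `zpool status zroot | pool_clean || echo "zroot needs a closer look"`. The pool shown above would pass, since its state is ONLINE and all counters are 0.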
-
I should have included a way to keep the top line ;)
With your values:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
177 Wear_Leveling_Count     0x0000   100   100   000    Old_age   Offline      -       197
177 Wear_Leveling_Count     0x0000   100   100   000    Old_age   Offline      -       828
So the value is "100", which for a SATA SSD means "practically factory new".
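One way to keep the top line is to match the header as well as the attribute row. A minimal sketch in POSIX sh, assuming the `smartctl -A` table layout shown here and the attribute name `Wear_Leveling_Count` that these drives report (other vendors may use a different name); the function names `wear_rows` and `wear_value` are invented for illustration.

```sh
# Hypothetical helpers: both read `smartctl -A` output on stdin.

# Print the column header plus the wear attribute row.
wear_rows() {
    grep -E '^ID#|Wear_Leveling_Count'
}

# Print only the normalized VALUE column (4th field) of the wear row.
wear_value() {
    awk '$2 == "Wear_Leveling_Count" { print $4 }'
}
```

Possible usage: `/usr/local/sbin/smartctl -A /dev/ada0 | wear_rows` for the readable table, or `/usr/local/sbin/smartctl -A /dev/ada0 | wear_value` to get just the number for a script.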
-
Happy days, then. Many thanks for your support!