This is the second installment of severe weather borking my OPNsense box; the first installment can be found here (https://forum.opnsense.org/index.php?topic=47332.0) if interested.
I'm not sure if the power issue that took out my PSU also took out one of the SSDs in my ZFS mirror, or if I broke the pool when I accidentally disconnected one of the drives during the PSU install. I say the second part because on my first boot after the PSU install the system didn't come up; I checked my connections and noticed one of my drives was disconnected. That leads me to think this drive already had an issue of some sort.
Regardless, below is my current zpool status.
zpool status
  pool: zroot
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            ada0p4  FAULTED      0     0     0  corrupted data
            ada0p4  ONLINE       0     0     0

errors: No known data errors
- Is there a chance this is repairable with the current disks? Anything I can try before replacing it?
- If I need to replace the disk (which looks likely), is the guide linked in that zpool status good enough to get me back online, or is there an external guide that might help a ZFS noob?
Edit: I also just noticed my device names are identical; shouldn't those be different?
You have two devices named ada0p4, or is there a typo?
camcontrol devlist?
gpart show?
Please.
Indeed, they should be different. Usually, you could just remove the defective disk from the mirror and then add a new device in.
You should try zpool status -L -P first to see what has happened there. It is probably risky to remove ada0p4 from the pool while both members show the same name; I have never seen such a thing.
Is /dev/ada1p4 available? Perhaps you can add it first and it will automagically take over from a hot-spare status to replace the faulted device.
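To check whether the OS sees a second disk at all, you can run (standard FreeBSD tools; I am assuming the missing disk would enumerate as ada1):
camcontrol devlist   # lists every disk the kernel currently sees
ls /dev/ada1*        # ada1p4 must exist before it can be attached
gpart show ada1      # shows whether the partition table survived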
Quote from: Patrick M. Hausen on May 22, 2025, 03:53:17 PM
You have two devices named ada0p4, or is there a typo?
camcontrol devlist?
gpart show?
Please.
camcontrol devlist
<SanDisk SSD PLUS 240GB UF4500RL> at scbus2 target 0 lun 0 (pass0,ada0)
<AHCI SGPIO Enclosure 2.00 0001> at scbus6 target 0 lun 0 (ses0,pass1)
gpart show
=>        40  468877232  ada0  GPT  (224G)
          40     532480     1  efi  (260M)
      532520       1024     2  freebsd-boot  (512K)
      533544        984        - free -  (492K)
      534528   16777216     3  freebsd-swap  (8.0G)
    17311744  451563520     4  freebsd-zfs  (215G)
   468875264       2008        - free -  (1.0M)
There for sure used to be an ada1 before all this. Not sure of its state; it's definitely connected, but maybe it's totally failed? Still, it seems odd for the ZFS config to list the same device twice.
Quote from: meyergru on May 22, 2025, 03:54:22 PM
Indeed, they should be different. Usually, you could just remove the defective disk from the mirror and then add a new device in.
You should try zpool status -L -P first to see what has happened there. It is probably risky to remove ada0p4 from the pool while both members show the same name; I have never seen such a thing.
Is /dev/ada1p4 available? Perhaps you can add it first and it will automagically take over from a hot-spare status to replace the faulted device.
zpool status -L -P
  pool: zroot
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
config:

        NAME             STATE     READ WRITE CKSUM
        zroot            DEGRADED     0     0     0
          mirror-0       DEGRADED     0     0     0
            /dev/ada0p4  FAULTED      0     0     0  corrupted data
            /dev/ada0p4  ONLINE       0     0     0

errors: No known data errors
Output looks similar; such a bizarre thing. /dev/ada1p4 doesn't appear to exist.
Try
zpool status -g
please.
The idea is to detach the faulted drive using the GUID, then run a scrub to make sure the remaining data is healthy, then power down the system at some convenient time and watch whether the former ada1 comes back when you power on again. If it doesn't, you will probably need to replace it. We can help you re-attach a new disk to the mirror and also copy the partitions necessary to boot from either disk.
Kind regards,
Patrick
Quote from: Patrick M. Hausen on May 22, 2025, 09:08:38 PM
Try
zpool status -g
please.
The idea is to detach the faulted drive using the GUID, then run a scrub to make sure the remaining data is healthy, then power down the system at some convenient time and watch whether the former ada1 comes back when you power on again. If it doesn't, you will probably need to replace it. We can help you re-attach a new disk to the mirror and also copy the partitions necessary to boot from either disk.
Kind regards,
Patrick
Will do, thanks. I just received my replacement PSU, so I'll be scheduling some downtime to swap that in. I'll double-check to make 100% sure that disk is properly connected and report back once I replace it.
The
zpool status -g
should, as mentioned, output a status display like so:
        NAME                      STATE     READ WRITE CKSUM
        zroot                     ONLINE       0     0     0
          16341520380093765778    ONLINE       0     0     0
            15099387462321339363  ONLINE       0     0     0
            8131296105030086590   ONLINE       0     0     0
Then you can try:
zpool detach zroot <guid of broken one>
zpool scrub zroot
to get back to a consistent state as a first step.
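You can then watch the scrub and confirm the pool is clean with (standard ZFS commands, nothing specific to your setup):
zpool status zroot   # shows scrub progress and any errors found
zpool status -x      # prints "all pools are healthy" when everything is fine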
Quote from: Patrick M. Hausen on May 22, 2025, 09:40:44 PM
The
zpool status -g
should, as mentioned, output a status display like so:
        NAME                      STATE     READ WRITE CKSUM
        zroot                     ONLINE       0     0     0
          16341520380093765778    ONLINE       0     0     0
            15099387462321339363  ONLINE       0     0     0
            8131296105030086590   ONLINE       0     0     0
Then you can try:
zpool detach zroot <guid of broken one>
zpool scrub zroot
to get back to a consistent state as a first step.
I missed that part of your post. Here is my output. So I need to detach the one ending in ...8775? Do I do this before physically doing anything to the disks?
zpool status -g
  pool: zroot
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
config:

        NAME                      STATE     READ WRITE CKSUM
        zroot                     DEGRADED     0     0     0
          4730808242311169367     DEGRADED     0     0     0
            4142472898976008775   FAULTED      0     0     0  corrupted data
            15730135158837676855  ONLINE       0     0     0

errors: No known data errors
Quote from: FullyBorked on May 22, 2025, 10:28:46 PM
So I need to detach the one ending in ...8775? Do I do this before physically doing anything to the disks?
Yes, and I would.
No hard guarantees, though - sorry. Have a config backup just in case. I am advising to the best of my knowledge.
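Concretely, with the GUIDs from your output that would be (a sketch; double-check the GUID against your own zpool status -g before running):
zpool detach zroot 4142472898976008775   # remove the faulted member by its GUID
zpool scrub zroot                        # then verify the surviving disk's data
And if you want an extra safety net beyond the GUI backup, the OPNsense configuration lives in a single file you can copy aside:
cp /conf/config.xml /root/config-$(date +%Y%m%d).xml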
Quote from: Patrick M. Hausen on May 22, 2025, 10:35:13 PM
Quote from: FullyBorked on May 22, 2025, 10:28:46 PM
So I need to detach the one ending in ...8775? Do I do this before physically doing anything to the disks?
Yes, and I would.
No hard guarantees, though - sorry. Have a config backup just in case. I am advising to the best of my knowledge.
I assume this is just making things cleaner before removal of the failed disk?
Quote from: FullyBorked on May 22, 2025, 10:48:55 PM
I assume this is just making things cleaner before removal of the failed disk?
That is my intention, yes. The duplicate device name is ... weird. The GUIDs are ZFS's internal references, so they should always be the "source of truth".
Quote from: Patrick M. Hausen on May 22, 2025, 10:50:22 PM
Quote from: FullyBorked on May 22, 2025, 10:48:55 PM
I assume this is just making things cleaner before removal of the failed disk?
That is my intention, yes. The duplicate device name is ... weird. The GUIDs are ZFS's internal references, so they should always be the "source of truth".
Is it even possible that I somehow created a mirror on the same disk instead of two? My SMART monitoring widget 100% had a 0 and a 1, but that doesn't mean the ZFS pool did.
Nope. Definitely not. Unless your
zpool status -g
output lists two identical GUIDs - which I have never never never seen. I'd consider that impossible, but I might be wrong. That would mean something about the pool's internal data structure is severely broken and I would do a config export and reinstall.
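If you want to see what ZFS itself has recorded on the disk, you can dump the on-disk labels; they contain the pool GUID and the vdev GUIDs (zdb ships with the base ZFS tools, so this should work on OPNsense too):
zdb -l /dev/ada0p4   # prints the ZFS labels, including pool_guid and the vdev children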
Finally got the PSU swapped for its permanent replacement, the failed SSD removed, and a fresh one installed.
camcontrol devlist
<SanDisk SSD PLUS 240GB UF4500RL> at scbus0 target 0 lun 0 (pass0,ada0)
<SanDisk SSD PLUS 240GB UF4500RL> at scbus2 target 0 lun 0 (pass1,ada1)
<AHCI SGPIO Enclosure 2.00 0001> at scbus6 target 0 lun 0 (ses0,pass2)
The detach and scrub were successful; here is the pool's current state.
zpool status -g
  pool: zroot
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:08:05 with 0 errors on Fri May 23 17:33:06 2025
config:

        NAME                   STATE     READ WRITE CKSUM
        zroot                  ONLINE       0     0     0
          4730808242311169367  ONLINE       0     0     0

errors: No known data errors
Would someone mind helping me understand how to add the replacement disk into the mirror? I'm assuming the "replace" command won't work now since we removed the failed disk.
# zpool attach {pool name} {existing disk} {new disk}
From your current setup of one disk, its GUID is 4730808242311169367, so that is the existing device you attach the new disk to.
You could run zpool status again (without -g) to see the current disk's /dev/adaXX name, then take the new disk's name from dmesg or camcontrol devlist.