OPNsense Forum

English Forums => 25.1, 25.4 Production Series => Topic started by: FullyBorked on May 22, 2025, 03:24:22 PM

Title: Degraded zpool after failed PSU
Post by: FullyBorked on May 22, 2025, 03:24:22 PM
This is the second installment of severe weather borking my OPNsense box; the first installment can be found here (https://forum.opnsense.org/index.php?topic=47332.0) if you're interested.

I'm not sure if the power issue that took out my PSU also took out one of the SSDs in my ZFS mirror, or if I broke the pool when I accidentally disconnected one of the drives during the PSU install.  I say the second part because on my first boot after the PSU install the system didn't come up; I checked my connections and noticed one of my drives was disconnected.  That leads me to think this drive already had an issue of some sort.

Regardless, below is my current zpool status.

zpool status
  pool: zroot
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            ada0p4  FAULTED      0     0     0  corrupted data
            ada0p4  ONLINE       0     0     0

errors: No known data errors


Edit: I also just noticed that my device names are identical. Shouldn't those be different?
Title: Re: Degraded zpool after failed PSU
Post by: Patrick M. Hausen on May 22, 2025, 03:53:17 PM
You have two devices named ada0p4, or is there a typo?

camcontrol devlist?
gpart show?

Please.
Title: Re: Degraded zpool after failed PSU
Post by: meyergru on May 22, 2025, 03:54:22 PM
Indeed, they should be different. Usually, you could just remove the defective disk from the mirror and then add a new device in.

You should try zpool status -L -P first to see what has happened there. Removing ada0p4 from the pool is probably risky while both members carry the same name; I have never seen such a thing.

Is /dev/ada1p4 available? Perhaps you can add it first and it will automagically take over from a hot-spare status to replace the faulted device.
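
The action line in the status output already points at zpool replace; if /dev/ada1p4 does show up again, something along these lines might work (the faulted member's GUID from zpool status -g is a placeholder here):

zpool replace zroot <guid-of-faulted-member> /dev/ada1p4

That tells ZFS to rebuild the faulted side of the mirror onto the new device.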
Title: Re: Degraded zpool after failed PSU
Post by: FullyBorked on May 22, 2025, 03:57:39 PM
Quote from: Patrick M. Hausen on May 22, 2025, 03:53:17 PMYou have two devices named ada0p4, or is there a typo?

camcontrol devlist?
gpart show?

Please.

camcontrol devlist
<SanDisk SSD PLUS 240GB UF4500RL>  at scbus2 target 0 lun 0 (pass0,ada0)
<AHCI SGPIO Enclosure 2.00 0001>   at scbus6 target 0 lun 0 (ses0,pass1)

gpart show
=>       40  468877232  ada0  GPT  (224G)
         40     532480     1  efi  (260M)
     532520       1024     2  freebsd-boot  (512K)
     533544        984        - free -  (492K)
     534528   16777216     3  freebsd-swap  (8.0G)
   17311744  451563520     4  freebsd-zfs  (215G)
  468875264       2008        - free -  (1.0M)

There was definitely an ada1 before all this.  Not sure of its state; it's definitely connected, but maybe it has totally failed?  Either way, that ZFS config listing the same device twice seems odd.
Title: Re: Degraded zpool after failed PSU
Post by: FullyBorked on May 22, 2025, 04:01:19 PM
Quote from: meyergru on May 22, 2025, 03:54:22 PMIndeed, they should be different. Usually, you could just remove the defective disk from the mirror and then add a new device in.

You should try zpool status -L -P first to see what has happened there. Removing ada0p4 from the pool is probably risky while both members carry the same name; I have never seen such a thing.

Is /dev/ada1p4 available? Perhaps you can add it first and it will automagically take over from a hot-spare status to replace the faulted device.


zpool status -L -P
  pool: zroot
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
config:

        NAME             STATE     READ WRITE CKSUM
        zroot            DEGRADED     0     0     0
          mirror-0       DEGRADED     0     0     0
            /dev/ada0p4  FAULTED      0     0     0  corrupted data
            /dev/ada0p4  ONLINE       0     0     0

errors: No known data errors

The output looks similar; such a bizarre thing.  /dev/ada1p4 doesn't appear to exist.
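
For reference, neither of these shows an ada1 either:

ls /dev/ada*
geom disk list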
Title: Re: Degraded zpool after failed PSU
Post by: Patrick M. Hausen on May 22, 2025, 09:08:38 PM
Try

zpool status -g
please.

The idea is to detach the faulted drive using the GUID, then run a scrub to make sure the remaining data is healthy, then power down the system at some convenient time and watch whether the former ada1 comes back when powering on again. If it doesn't, you will probably need to replace it. We can help you re-attach a new disk to the mirror and also copy the partitions necessary to boot from either disk.

Kind regards,
Patrick
Title: Re: Degraded zpool after failed PSU
Post by: FullyBorked on May 22, 2025, 09:22:55 PM
Quote from: Patrick M. Hausen on May 22, 2025, 09:08:38 PMTry

zpool status -g
please.

The idea is to detach the faulted drive using the GUID, then run a scrub to make sure the remaining data is healthy, then power down the system at some convenient time and watch whether the former ada1 comes back when powering on again. If it doesn't, you will probably need to replace it. We can help you re-attach a new disk to the mirror and also copy the partitions necessary to boot from either disk.

Kind regards,
Patrick

Will do, thanks. I just received my replacement PSU, so I'll be scheduling some downtime to swap that in. I'll double-check to make 100% sure that disk is properly connected and report back once I replace it.
Title: Re: Degraded zpool after failed PSU
Post by: Patrick M. Hausen on May 22, 2025, 09:40:44 PM
The

zpool status -g
should, as mentioned, output a status display like so:

NAME                      STATE     READ WRITE CKSUM
zroot                     ONLINE       0     0     0
  16341520380093765778    ONLINE       0     0     0
    15099387462321339363  ONLINE       0     0     0
    8131296105030086590   ONLINE       0     0     0

Then you can try:

zpool detach zroot <guid of broken one>
zpool scrub zroot

to get back to a consistent state as a first step.
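
You can watch the scrub with a plain

zpool status zroot

and look at the scan: line; when it's done it should read "scrub repaired 0B ... with 0 errors" if the remaining disk is healthy.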
Title: Re: Degraded zpool after failed PSU
Post by: FullyBorked on May 22, 2025, 10:28:46 PM
Quote from: Patrick M. Hausen on May 22, 2025, 09:40:44 PMThe

zpool status -g
should, as mentioned, output a status display like so:

NAME                      STATE     READ WRITE CKSUM
zroot                     ONLINE       0     0     0
  16341520380093765778    ONLINE       0     0     0
    15099387462321339363  ONLINE       0     0     0
    8131296105030086590   ONLINE       0     0     0

Then you can try:

zpool detach zroot <guid of broken one>
zpool scrub zroot

to get back to a consistent state as a first step.

I missed that part of your question.  Here is my output.  So I need to detach the one ending in ...8775?  Do I do this before physically doing anything to the disks?

zpool status -g
  pool: zroot
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
config:

        NAME                      STATE     READ WRITE CKSUM
        zroot                     DEGRADED     0     0     0
          4730808242311169367     DEGRADED     0     0     0
            4142472898976008775   FAULTED      0     0     0  corrupted data
            15730135158837676855  ONLINE       0     0     0

errors: No known data errors
Title: Re: Degraded zpool after failed PSU
Post by: Patrick M. Hausen on May 22, 2025, 10:35:13 PM
Quote from: FullyBorked on May 22, 2025, 10:28:46 PMSo I need to detach the one ending in ...8775?  Do I do this before physically doing anything to the disks?

Yes, and I would.

No hard guarantees, though - sorry. Have a config backup just in case. I am advising to the best of my knowledge.
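
With the GUIDs from your output that would be, concretely (double-check the GUID against your own zpool status -g before running it):

zpool detach zroot 4142472898976008775
zpool scrub zroot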
Title: Re: Degraded zpool after failed PSU
Post by: FullyBorked on May 22, 2025, 10:48:55 PM
Quote from: Patrick M. Hausen on May 22, 2025, 10:35:13 PM
Quote from: FullyBorked on May 22, 2025, 10:28:46 PMSo I need to detach the one ending in ...8775?  Do I do this before physically doing anything to the disks?

Yes, and I would.

No hard guarantees, though - sorry. Have a config backup just in case. I am advising to the best of my knowledge.

I assume this is just making things cleaner before removing the failed disk?
Title: Re: Degraded zpool after failed PSU
Post by: Patrick M. Hausen on May 22, 2025, 10:50:22 PM
Quote from: FullyBorked on May 22, 2025, 10:48:55 PMI assume this is just making things cleaner before removing the failed disk?

That is my intention, yes. The duplicate device name is ... weird. The GUIDs are ZFS' internal references, so they should always be the "source of truth".
Title: Re: Degraded zpool after failed PSU
Post by: FullyBorked on May 22, 2025, 10:57:19 PM
Quote from: Patrick M. Hausen on May 22, 2025, 10:50:22 PM
Quote from: FullyBorked on May 22, 2025, 10:48:55 PMI assume this is just making things cleaner before removing the failed disk?

That is my intention, yes. The duplicate device name is ... weird. The GUIDs are ZFS' internal references, so they should always be the "source of truth".

Is it even possible that I somehow created a mirror on the same disk instead of two?  My SMART monitoring widget 100% showed a 0 and a 1, but that doesn't mean the ZFS pool did.
Title: Re: Degraded zpool after failed PSU
Post by: Patrick M. Hausen on May 22, 2025, 11:09:47 PM
Nope. Definitely not. Unless your

zpool status -g

output lists two identical GUIDs, which I have never, never, never seen. I'd consider that impossible, but I might be wrong. That would mean something about the pool's internal data structures is severely broken, and I would do a config export and reinstall.
Title: Re: Degraded zpool after failed PSU
Post by: FullyBorked on May 24, 2025, 12:08:39 AM
Finally got the permanent replacement PSU installed, the failed SSD removed, and a fresh one put in.

camcontrol devlist
<SanDisk SSD PLUS 240GB UF4500RL>  at scbus0 target 0 lun 0 (pass0,ada0)
<SanDisk SSD PLUS 240GB UF4500RL>  at scbus2 target 0 lun 0 (pass1,ada1)
<AHCI SGPIO Enclosure 2.00 0001>   at scbus6 target 0 lun 0 (ses0,pass2)


The detach and scrub were successful; here is the pool's current state.

zpool status -g
  pool: zroot
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:08:05 with 0 errors on Fri May 23 17:33:06 2025
config:

        NAME                   STATE     READ WRITE CKSUM
        zroot                  ONLINE       0     0     0
          4730808242311169367  ONLINE       0     0     0

errors: No known data errors

Would someone mind helping me understand how to add the replacement disk into the mirror?  I assume the "replace" command won't work now since we removed the failed disk.
Title: Re: Degraded zpool after failed PSU
Post by: cookiemonster on May 24, 2025, 02:13:50 AM
zpool attach {pool name} {existing disk} {new disk} is the command. From your current setup of one disk, the existing member's ID is 4730808242311169367, so the device you attach has to be the new one.
You could run zpool status (without -g) again to show the disks as /dev/adaXX (in your case) to identify the current one, then take the new one from dmesg.
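
Since zroot is the boot pool, the new disk also needs the partition layout and boot loaders copied over before the attach (the "copy the partitions necessary to boot from either disk" part Patrick mentioned). A sketch, assuming the survivor is ada0 and the new disk came up as ada1 as your camcontrol output suggests; double-check the device names before running any of it:

# clone the GPT layout from the surviving disk onto the new one
gpart backup ada0 | gpart restore -F ada1

# copy the EFI system partition (p1 in your layout) so the box can boot from either disk
dd if=/dev/ada0p1 of=/dev/ada1p1 bs=1m

# write the legacy boot code into the freebsd-boot partition (index 2)
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 2 ada1

# attach the new ZFS partition next to the existing one, turning the pool back into a mirror
zpool attach zroot ada0p4 ada1p4

zpool status will then show the resilver; once it finishes, the mirror is whole again.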