Degraded zpool after failed PSU

Started by FullyBorked, May 22, 2025, 03:24:22 PM

#zpool attach {pool name} {new disk} but from your current setup of one disk, its ID is 4730808242311169367, so it has to be the new one.
So you could run zpool status again to show the disks as /dev/adaX (in your case) and identify the current one, then take the new one from dmesg.

Please post a
zpool status
gpart show

Kind regards,
Patrick
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

Quote from: Patrick M. Hausen on May 24, 2025, 10:56:21 AM
Please post a
zpool status
gpart show

Kind regards,
Patrick

zpool status
  pool: zroot
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:08:05 with 0 errors on Fri May 23 17:33:06 2025
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          ada1p4    ONLINE       0     0     0

errors: No known data errors

gpart show
=>       40  468877232  ada1  GPT  (224G)
         40     532480     1  efi  (260M)
     532520       1024     2  freebsd-boot  (512K)
     533544        984        - free -  (492K)
     534528   16777216     3  freebsd-swap  (8.0G)
   17311744  451563520     4  freebsd-zfs  (215G)
  468875264       2008        - free -  (1.0M)

Weird the device name changed again, now it's ada1p4 vs ada0p4 when it was broken.  I'm so confused by that naming.

May 24, 2025, 03:20:45 PM #18 Last Edit: May 24, 2025, 03:23:24 PM by meyergru
The device names are simply numbered in the order the devices are detected, which is why they are unreliable. It looks like ada0 is present, but the disk is not yet partitioned, which is why it does not come up with "gpart show".

Given that your old device is now ada1, there must be an ada0 device. You can make sure using "camcontrol devlist".
Intel N100, 4* I226-V, 2* 82559, 16 GByte, 500 GByte NVME, ZTE F6005

1100 down / 800 up, Bufferbloat A+
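
The reasoning above (a disk that camcontrol lists but "gpart show" does not has no partition table yet) could be sketched as a small shell helper. This is only an illustration: the function name unpartitioned_disks and the two input files are made up for the example, and it assumes the output formats posted in this thread.

```sh
#!/bin/sh
# Hypothetical helper: list ada disks that appear in `camcontrol devlist`
# output but have no "=>" table header in `gpart show` output.
# $1 = file with saved `camcontrol devlist` output
# $2 = file with saved `gpart show` output
unpartitioned_disks() {
    # Device names sit in the trailing "(pass0,ada0)" part of each devlist line.
    sed -n 's/.*[(,]\(ada[0-9][0-9]*\)[,)].*/\1/p' "$1" | sort -u |
    while read -r disk; do
        # gpart prints one "=> start size NAME scheme (size)" line per table.
        if ! awk -v d="$disk" '$1 == "=>" && $4 == d { found = 1 } END { exit !found }' "$2"; then
            echo "$disk"
        fi
    done
}
```

Saving the two outputs with "camcontrol devlist > devlist.txt" and "gpart show > gpart.txt" and then running "unpartitioned_disks devlist.txt gpart.txt" should name the disk that still needs a partition table.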

Quote from: meyergru on May 24, 2025, 03:20:45 PM
The device names are simply numbered in the order the devices are detected, which is why they are unreliable. It looks like ada0 is present, but the disk is not yet partitioned, which is why it does not come up with "gpart show".

Given that your old device is now ada1, there must be an ada0 device. You can make sure using "camcontrol devlist".



<SanDisk SSD PLUS 240GB UF4500RL>  at scbus0 target 0 lun 0 (pass0,ada0)
<SanDisk SSD PLUS 240GB UF4500RL>  at scbus2 target 0 lun 0 (pass1,ada1)
<AHCI SGPIO Enclosure 2.00 0001>   at scbus6 target 0 lun 0 (ses0,pass2)

So I just need to run this?
zpool attach zroot ada1

No. Because the disks are simply enumerated, your old disk may well have been ada0 before, but it is now detected after the new disk. The new disk therefore takes the name ada0 and the old disk is now ada1.

Therefore, you have to attach ada0, not ada1 now.

Quote from: meyergru on May 24, 2025, 03:34:04 PM
No. Because the disks are simply enumerated, your old disk may well have been ada0 before, but it is now detected after the new disk. The new disk therefore takes the name ada0 and the old disk is now ada1.

Therefore, you have to attach ada0, not ada1 now.

That's confusing as all heck. Alright let me see what it does, I guess it won't let me attach one that's already attached anyway. 

zpool attach zroot ada0
missing <new_device> specification
usage:
        attach [-fsw] [-o property=value] <pool> <device> <new-device>

I'm not sure what this is looking for, does it need this?

zpool attach zroot ada1 ada0

So this seemed to work:

zpool attach zroot ada1p4 ada0

Shows resilvering now, but the naming looks weird:

zpool status
  pool: zroot
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat May 24 09:55:56 2025
        137G / 137G scanned, 3.30G / 137G issued at 113M/s
        3.34G resilvered, 2.41% done, 00:20:13 to go
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada1p4  ONLINE       0     0     0
            ada0    ONLINE       0     0     0  (resilvering)

errors: No known data errors

Nooooo!

That's exactly how not to do it!

Your system will not be able to boot when ada1 fails.

Why can't you guys wait for me to react to the post with the info I asked for. Good grief!

Quote from: Patrick M. Hausen on May 24, 2025, 04:07:18 PM
Nooooo!

That's exactly how not to do it!

Your system will not be able to boot when ada1 fails.

Why can't you guys wait for me to react to the post with the info I asked for. Good grief!

Ugh, sorry, EVERY guide I've read showed that was the process to rebuild. 

I'm disappointed in ZFS, thought it was going to be a huge value add. But I'm really thinking about taking all my stuff back to using a classic raid controller. ZFS documentation is poor and confusing and covered in trip wires and mines.

May 24, 2025, 04:19:58 PM #27 Last Edit: May 24, 2025, 04:23:17 PM by Patrick M. Hausen
Quote from: FullyBorked on May 24, 2025, 04:13:36 PM
Ugh, sorry, EVERY guide I've read showed that was the process to rebuild.

I'm disappointed in ZFS, thought it was going to be a huge value add. But I'm really thinking about taking all my stuff back to using a classic raid controller. ZFS documentation is poor and confusing and covered in trip wires and mines.

No FreeBSD specific guide recommends using ZFS on an entire disk without a partition table!

ZFS is the best thing since sliced bread, the most robust file system existing and you just need to follow proper procedures. This involves respecting the FreeBSD boot process and the specific partition setup necessary.

I already wrote out the whole procedure but somehow that post is lost. Seems like the forum does not like working in two tabs in parallel.

Give me a couple of minutes, I'll write it again.

May 24, 2025, 04:38:59 PM #28 Last Edit: May 24, 2025, 05:10:46 PM by Patrick M. Hausen
Quote from: FullyBorked on May 24, 2025, 03:06:39 PM
zpool status
  pool: zroot
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:08:05 with 0 errors on Fri May 23 17:33:06 2025
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          ada1p4    ONLINE       0     0     0

errors: No known data errors

gpart show
=>       40  468877232  ada1  GPT  (224G)
         40     532480     1  efi  (260M)
     532520       1024     2  freebsd-boot  (512K)
     533544        984        - free -  (492K)
     534528   16777216     3  freebsd-swap  (8.0G)
   17311744  451563520     4  freebsd-zfs  (215G)
  468875264       2008        - free -  (1.0M)

Weird the device name changed again, now it's ada1p4 vs ada0p4 when it was broken.  I'm so confused by that naming.

OK. This is the state that I wanted and with which I intended to guide you through the recovery process step by step. Now that your zpool is a bit messed up let's fix that first.

I assume
zpool status
results in "ada1p4" and "ada0" as the mirror disks?

We need to remove that ada0:
zpool detach zroot ada0


Now we take a breath and grab a coffee ... about those device names ...

FreeBSD enumerates the devices by some "hardware order" inherent in the drive, the PCIe bus, whatnot. Starting with 0.

So initially you had ada0 and ada1. Fine. Then ada0 failed. You removed it and rebooted. With only a single drive now present what was formerly ada1 is now ada0. It starts at 0. Always.

Then you inserted a factory new drive in the "first" (whatever that means) hardware position. After another boot that one is now "first" and becomes ada0 and what was initially ada1, then ada0, is now ada1 again.

FreeBSD just counts.
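
To make the renumbering concrete, here is a toy sketch of "just counting". It is a simplification and not how the kernel really probes hardware: the function name enumerate and the drive labels are invented for the illustration.

```sh
#!/bin/sh
# Illustration only: number the drives that are present, in detection order,
# starting at 0. Real enumeration depends on controller and bus order.
# Arguments: one label per detected drive, in detection order.
enumerate() {
    n=0
    for drive in "$@"; do
        echo "ada$n=$drive"
        n=$((n + 1))
    done
}

enumerate failed-disk old-disk   # both original drives present
enumerate old-disk               # failed drive removed
enumerate new-disk old-disk      # new drive sits in the "first" position
```

The surviving drive shows up as ada1, then ada0, then ada1 again, although nothing about the drive itself changed.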


Now the boot process. For a PC system to be able to boot there needs to be a partition table and, depending on the system, either legacy ("BIOS") or EFI boot code in a matching partition. When you install stock FreeBSD you can pick which to install. OPNsense installs both, so as not to bother the user with questions they cannot answer and to always be able to boot, even if you replace your hardware and move your drive from e.g. a legacy system to an EFI system.

You can see that in your "gpart show" output. An EFI partition followed by a freebsd-boot (legacy) partition. Followed by swap and ZFS. ZFS must go into a partition of type freebsd-zfs, never to the whole disk.

You need the "boot thingies" on both disks, because you want to be able to boot off either of them in case one fails.
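
If you want to pull the partition names out of "gpart show" instead of reading them off by eye, a rough sketch could look like this. The function name partitions_of_type is made up, and it assumes a GPT disk and the output format shown in this thread.

```sh
#!/bin/sh
# Hypothetical helper: read `gpart show` output on stdin and print the device
# name (disk + "p" + index) of every partition of the given type.
# $1 = partition type, e.g. efi, freebsd-boot, freebsd-swap, freebsd-zfs
partitions_of_type() {
    awk -v type="$1" '
        $1 == "=>"   { disk = $4; next }     # table header: remember the disk name
        $4 == type   { print disk "p" $3 }   # partition line: start size index type
    '
}
```

For example, "gpart show | partitions_of_type freebsd-zfs" should print the partition that belongs in the zpool, and running it with efi and freebsd-boot shows which disks carry boot partitions.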


So now, if that removal of ada0 succeeded, we first create a partition table. The easiest way in case of identical drives is to copy it from the good one to the new one:
gpart backup ada1 | gpart restore ada0

Should the "new" drive not be entirely new, the above command may fail, because gpart checks whether a partition table is already present. In that case you can add the "-F" flag to the "gpart restore" command. The check is just a reasonable safety measure. But since your new drive never had a partition table, it should go well without "-F".

You can then check with
gpart show
that now both drives are partitioned the same.
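
The "partitioned the same" check can also be scripted. The following is a sketch under the assumption that "gpart show" prints the format seen earlier in this thread; layout_of and same_layout are hypothetical names.

```sh
#!/bin/sh
# Print one disk's table from `gpart show` output on stdin, without the disk
# name and without the human-readable sizes, so two disks can be compared.
# $1 = disk name, e.g. ada0
layout_of() {
    awk -v disk="$1" '
        $1 == "=>" { keep = ($4 == disk); if (keep) print $2, $3, $5; next }
        NF == 0    { keep = 0; next }
        keep       { print $1, $2, $3, $4 }
    '
}

# Succeed only if both disks are listed in the saved output and match.
# $1 = file with saved `gpart show` output, $2 and $3 = disk names
same_layout() {
    a=$(layout_of "$2" < "$1")
    b=$(layout_of "$3" < "$1")
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

Usage would be something like "gpart show > gpart.txt" followed by "same_layout gpart.txt ada0 ada1 && echo layouts match".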


Now that we have a ZFS partition to keep our zpool data we can attach that to the mirror:
zpool attach zroot ada1p4 ada0p4

Didn't it appear odd to have "ada1p4" but just "ada0" without a partition when you did it the first time?

Anyway the zpool should now be resilvering and be done in no time as you can check with
zpool status
again.
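
A quick way to spot the "ada0 without a partition" situation in "zpool status" output is sketched below. The function name whole_disk_members is invented, and the list of driver prefixes is an assumption that covers common FreeBSD disk names, not an exhaustive rule.

```sh
#!/bin/sh
# Hypothetical check: read `zpool status` output on stdin and print pool
# members that are whole disks such as "ada0" rather than partitions such
# as "ada0p4". Device lines have at least NAME STATE READ WRITE CKSUM.
whole_disk_members() {
    awk '$1 ~ /^(ada|da|nda|nvd)[0-9]+$/ && NF >= 5 { print $1 }'
}
```

Run as "zpool status zroot | whole_disk_members". No output means every member it recognizes is a partition.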

Good? Next step, copy that boot code.


We copy both the EFI and the legacy partitions from ada1 to their respective counterparts on ada0:
# copy EFI boot
dd if=/dev/ada1p1 of=/dev/ada0p1 bs=1m

# copy legacy boot
dd if=/dev/ada1p2 of=/dev/ada0p2 bs=1m
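
If you like to double check such copies, the dd step can be wrapped with a comparison. This is a sketch, not part of the procedure above: copy_and_verify is a made-up name, and it assumes source and target partitions have the same size, which they do after the "gpart backup | gpart restore" step.

```sh
#!/bin/sh
# Hypothetical helper: copy one partition (or file) onto another with dd and
# then compare the result byte for byte.
# bs=1048576 is 1 MiB, the same block size as "bs=1m" on FreeBSD.
# $1 = source, $2 = target
copy_and_verify() {
    src=$1
    dst=$2
    dd if="$src" of="$dst" bs=1048576 2>/dev/null || return 1
    # cmp -s is silent and returns 0 only if both inputs are identical.
    cmp -s "$src" "$dst"
}
```

With the device names from this thread that would be "copy_and_verify /dev/ada1p1 /dev/ada0p1" and "copy_and_verify /dev/ada1p2 /dev/ada0p2". Check the names against your own "gpart show" output first, since dd overwrites the target.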


That's it. Grab a beer. You have a redundant bootable system again. If you want redundant swap, too, which I recommend, we can do that in another round after your system is healthy again.


Kind regards,
Patrick

Quote from: Patrick M. Hausen on May 24, 2025, 04:38:59 PM
Quote from: FullyBorked on May 24, 2025, 03:06:39 PM
zpool status
  pool: zroot
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:08:05 with 0 errors on Fri May 23 17:33:06 2025
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          ada1p4    ONLINE       0     0     0

errors: No known data errors

gpart show
=>       40  468877232  ada1  GPT  (224G)
         40     532480     1  efi  (260M)
     532520       1024     2  freebsd-boot  (512K)
     533544        984        - free -  (492K)
     534528   16777216     3  freebsd-swap  (8.0G)
   17311744  451563520     4  freebsd-zfs  (215G)
  468875264       2008        - free -  (1.0M)

Weird the device name changed again, now it's ada1p4 vs ada0p4 when it was broken.  I'm so confused by that naming.

OK. This is the state that I wanted and with which I intended to guide you through the recovery process step by step. Now that your zpool is a bit messed up let's fix that first.

I assume
zpool status
results in "ada1p4" and "ada0" as the mirror disks?

We need to remove that ada0:
zpool detach zroot ada0


Now we take a breath and grab a coffee ... about those device names ...

FreeBSD enumerates the devices by some "hardware order" inherent in the drive, the PCIe bus, whatnot. Starting with 0.

So initially you had ada0 and ada1. Fine. Then ada0 failed. You removed it and rebooted. With only a single drive now present what was formerly ada1 is now ada0. It starts at 0. Always.

Then you inserted a factory new drive in the "first" (whatever that means) hardware position. After another boot that one is now "first" and becomes ada0 and what was initially ada1, then ada0, is now ada1 again.

FreeBSD just counts.


Now the boot process. For a PC system to be able to boot there needs to be a partition table and, depending on the system, either legacy ("BIOS") or EFI boot code in a matching partition. When you install stock FreeBSD you can pick which to install. OPNsense installs both, so as not to bother the user with questions they cannot answer and to always be able to boot, even if you replace your hardware and move your drive from e.g. a legacy system to an EFI system.

You can see that in your "gpart show" output. An EFI partition followed by a freebsd-boot (legacy) partition. Followed by swap and ZFS. ZFS must go into a partition of type freebsd-zfs, never to the whole disk.

You need the "boot thingies" on both disks, because you want to be able to boot off either of them in case one fails.


So now, if that removal of ada0 succeeded, we first create a partition table. The easiest way in case of identical drives is to copy it from the good one to the new one:
gpart backup ada1 | gpart restore ada0

Should the "new" drive not be entirely new, the above command may fail, because gpart checks whether a partition table is already present. In that case you can add the "-F" flag to the "gpart restore" command. The check is just a reasonable safety measure. But since your new drive never had a partition table, it should go well without "-F".

You can then check with
gpart show
that now both drives are partitioned the same.


Now that we have a ZFS partition to keep our zpool data we can attach that to the mirror:
zpool attach zroot ada1p4 ada0p4

Didn't it appear odd to have "ada1p4" but just "ada0" without a partition when you did it the first time?

Anyway the zpool should now be resilvering and be done in no time as you can check with
zpool status
again.

Good? Next step, copy that boot code.


We copy both the EFI and the legacy partitions from ada1 to their respective counterparts on ada0:
# copy EFI boot
dd if=/dev/ada1p1 of=/dev/ada0p1 bs=1m

# copy legacy boot
dd if=/dev/ada1p2 of=/dev/ada0p2 bs=1m


That's it. Grab a beer. You have a redundant bootable system again. If you want redundant swap, too, which I recommend, we can do that in another round after your system is healthy again.


Kind regards,
Patrick

ok, resilvering again, hopefully correctly this time.

zpool status
  pool: zroot
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat May 24 10:50:06 2025
        137G / 137G scanned, 2.20G / 137G issued at 119M/s
        2.24G resilvered, 1.61% done, 00:19:21 to go
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada1p4  ONLINE       0     0     0
            ada0p4  ONLINE       0     0     0  (resilvering)

errors: No known data errors

Once that finishes (I assume I should wait till the resilver has finished) I'll copy the boot stuff.
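
For anyone who wants to watch the resilver from a script rather than rerunning "zpool status" by hand, here is a rough sketch that pulls the percentage out of the status text. The function name resilver_percent is made up, and it relies on the wording of the status output shown above.

```sh
#!/bin/sh
# Hypothetical helper: read `zpool status` output on stdin and print the
# resilver percentage, or nothing if no resilver is in progress.
resilver_percent() {
    awk '
        /resilver in progress/ { active = 1 }
        active && /% done/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) { sub(/%$/, "", $i); print $i; exit }
        }
    '
}
```

Something like "zpool status zroot | resilver_percent" prints the current percentage while the resilver runs and nothing once it is done.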