warning pool has encountered an uncorrectable io error suspended

Started by thomas.sec, September 28, 2022, 11:04:29 AM

Hello,
We are facing the same issue as https://forum.opnsense.org/index.php?topic=26661.0 with version 22.7.4-amd64. OPNsense runs for about 20 hours and then stops responding.

Error messages:
- Solaris: warning pool has encountered an uncorrectable io error suspended
- the console shows several CAM errors: device not ready, AHCI reset and CAM timeout...
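
For readers who want to collect console messages of this kind before the box hangs, here is a minimal sketch. The function name `disk_error_lines` is made up for illustration, and the patterns are assumptions based on the wording above (FreeBSD normally prints `CAM status:` lines and names AHCI channels `ahcichN`); adjust them to what your console actually shows.

```shell
#!/bin/sh
# Hypothetical helper: reads kernel messages on stdin and keeps only the
# lines that look like disk/controller trouble.
# Assumption: the patterns below cover the CAM/AHCI messages described in
# the post; they are not an exhaustive list.
disk_error_lines() {
    grep -Ei 'CAM status|ahcich[0-9]+|not ready|time ?out'
}
```

It could be used as `dmesg | disk_error_lines`. Once the pool is suspended nothing more can be written to disk, so the last messages may only be visible on the console itself.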

It really looks like a hardware problem, but we tested our disk with a long smartctl test and it reported no errors.

Do you know how to resolve this issue?

Cheers

Those zfs errors indicate either a bad disk or a bad connection to the drive. What SMART tests did you run? I would try reseating the cables connecting the disk for a start.
- Jim

I ran this one: smartctl -t long /dev/ada0. zroot is on /dev/ada0p4 (there is only one drive per host).
I will try reseating the cables. But I run two servers in HA, and I got the error on both of them, at random times.
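
A note for readers: `smartctl -t long` only starts the test; the result has to be read back afterwards from the self-test log with `smartctl -l selftest /dev/ada0`. A minimal sketch for summarising that log, assuming the usual ATA layout where each entry starts with `#` and a clean run contains "Completed without error" (the helper name `selftest_summary` is made up for illustration):

```shell
#!/bin/sh
# Hypothetical helper: reads `smartctl -l selftest` output on stdin and
# counts log entries that finished cleanly versus everything else
# (failed, aborted, interrupted or still in progress).
# Assumption: ATA-style log entries start with "#".
selftest_summary() {
    awk '
        /^#/ {
            if ($0 ~ /Completed without error/) ok++
            else other++
        }
        END { printf "ok=%d other=%d\n", ok, other }
    '
}
```

It could be used as `smartctl -l selftest /dev/ada0 | selftest_summary`. A clean self-test log is not proof of a healthy drive or connection, so it is worth reading alongside `smartctl -A /dev/ada0` and the console messages.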


What is the hardware you are using for the servers? Also, brand and model of the drives? Are you using a raid controller?
- Jim

I got the problem again this morning.

Hardware is custom:

No RAID (AHCI SATA)

Base Board Information
        Manufacturer: MSI
        Product Name: H81M-E34 (MS-7817)

CPU x1
        Version: Intel(R) Core(TM) i3-4170 CPU @ 3.70GHz
        Voltage: 1.2 V
        External Clock: 100 MHz
        Max Speed: 3800 MHz
        Current Speed: 3700 MHz

Memory Device x2
        Size: 4 GB
        Type: DDR3
        Type Detail: Synchronous
        Speed: 1333 MT/s
        Manufacturer: 0420
        Rank: 1
        Configured Memory Speed: 1333 MT/s
        Minimum Voltage: 1.35 V
        Maximum Voltage: 1.5 V
        Configured Voltage: 1.5 V

Disk x1
SSD 64G


Just before the host crashed, I saw ZFS errors... They are gone after a reboot. I need to reinstall:

  pool: zroot
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:00:06 with 0 errors on Mon Sep 26 14:48:34 2022
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          ada0p4    ONLINE       0 4.22G     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        <metadata>:<0x8416>
        <metadata>:<0x34>
        <metadata>:<0x43>
        <metadata>:<0x46>
        zroot/ROOT/default:<0x0>
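
For readers: the telling detail in the output above is the WRITE column for ada0p4, which is non-zero while the pool still reports ONLINE. A minimal sketch for pulling such rows out of `zpool status`, assuming the standard five-column config table (NAME STATE READ WRITE CKSUM); the helper name `nonzero_errors` is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: reads `zpool status` output on stdin and prints every
# pool or vdev row whose READ, WRITE or CKSUM counter is not "0".
# Assumption: config rows have at least five columns
# (NAME STATE READ WRITE CKSUM) and the table ends at the "errors:" line.
nonzero_errors() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        /^errors:/                    { in_config = 0 }
        in_config && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
            print $1, "read=" $3, "write=" $4, "cksum=" $5
        }
    '
}
```

It could be used as `zpool status zroot | nonzero_errors`; for the output above it would report only the ada0p4 row.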



My guess is the disk is bad and needs replacing. You can try a reinstall, but I expect the problems to return.
- Jim

Hello,

I just want to thank you: I replaced both disks and the error is gone.
Strangely, those disks show no problems under another OS...

Everything is running well now, after losing a day to the option Firewall/Settings/Advanced/Disable reply-to  ;D

Glad to hear you got things working.

Another OS may not care about disk errors...until you experience data loss/corruption...
- Jim