OPNsense Forum

Archive => 22.1 Legacy Series => Topic started by: senser on January 31, 2022, 02:59:58 pm

Title: ZFS suspended zroot (apu2 coreboot 4.15)
Post by: senser on January 31, 2022, 02:59:58 pm
So some of my devices suddenly lost their internet connection today around 12 o'clock while others were still working fine (fresh 22.1 ZFS install, with config importer, on an apu2 since yesterday). The serial console gave me this:

Code: [Select]
FreeBSD/amd64 (mrqu.freifunk) (ttyu0)

login: vvd
Solaris: WARNING: Pool 'zroot' has encountered an uncorrectable I/O failure and has been suspended.

Nothing else. I could not log in. I had to pull the power to reboot. Everything seems to be running fine again.
I could not find anything in the logs except maybe this:

Code: [Select]
2022-01-30T11:59:26 Error configctl unable to connect to configd socket (@/var/run/configd.socket)
Is ZFS considered experimental? :) Can I do some FS check or something?
Title: Re: ZFS suspended zroot
Post by: Patrick M. Hausen on January 31, 2022, 03:17:44 pm
ZFS is not considered experimental. It is the most stable and reliable filesystem in existence for most. It is a memory hog, though. How much RAM do you have?

And the fsck is called a scrub in ZFS terminology:
Code: [Select]
zpool scrub zroot
zpool status zroot
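
To turn the status output into a quick pass/fail check, a small wrapper along these lines might help. This is only a sketch: `pool_health` is a made-up helper name, not part of ZFS or OPNsense, and it assumes the usual `state:` and `errors:` lines that `zpool status` prints.

```shell
#!/bin/sh
# pool_health (hypothetical helper): read `zpool status <pool>` output on stdin,
# print "<state>; <errors line>" and return 0 only for a clean ONLINE pool.
pool_health() {
    awk '
        $1 == "state:"  { state = $2 }
        $1 == "errors:" { sub(/^ *errors: */, ""); errors = $0 }
        END {
            if (state == "") { print "unknown"; exit 2 }
            printf "%s; %s\n", state, errors
            # exit status 0 = healthy, 1 = anything else
            exit !(state == "ONLINE" && errors == "No known data errors")
        }
    '
}

# Usage on the firewall (pool name zroot as in this thread):
#   zpool status zroot | pool_health
```

Keep in mind that a scrub only verifies data that is already on disk, so a clean result does not by itself rule out intermittent I/O timeouts.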
Title: Re: ZFS suspended zroot
Post by: franco on January 31, 2022, 03:44:31 pm
> Had to pull the power for reboot. Seems everything is running fine again.

Obviously and luckily it was not UFS. ;)


Cheers,
Franco
Title: Re: ZFS suspended zroot
Post by: senser on January 31, 2022, 03:58:16 pm
The scrub did not report any errors.
I have checked the health graphs for excessive mem usage but they looked OK (48% free of 4GB).
There are a lot more processes running with ZFS compared to UFS. But I think that is expected.
Title: Re: ZFS suspended zroot
Post by: senser on February 06, 2022, 10:34:56 am
So I still have this issue. WebUI and serial/ssh become unavailable while the network continues to work (more or less). Getting this after a reboot on the serial console:
Code: [Select]
ahcich0: Timeout on slot 6 port 0
ahcich0: is 00000000 cs 00000100 ss 000001c0 rs 000001c0 tfd 40 serr 00000000 cmd 0040e717
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 a0 78 68 40 04 00 00 01 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command, 3 more tries remain
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Timeout on slot 9 port 0
ahcich0: is 00000000 cs 00000200 ss 00000000 rs 00000200 tfd 00 serr 00000000 cmd 0040e917
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Retrying command, 0 more tries remain
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Timeout on slot 10 port 0
ahcich0: is 00000000 cs 00000400 ss 00000000 rs 00000400 tfd 00 serr 00000000 cmd 0040ea17
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Timeout on slot 11 port 0
ahcich0: is 00000000 cs 00000800 ss 00000000 rs 00000800 tfd 00 serr 00000000 cmd 0040eb17
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retry was blocked
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <Phison SSEP064GTMC0-S91 S9FM02.5> s/n 16165E0641182 detached
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Timeout on slot 12 port 0
ahcich0: is 00000000 cs 00001000 ss 00000000 rs 00001000 tfd 00 serr 00000000 cmd 0040ec17
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Retrying command, 0 more tries remain
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Timeout on slot 13 port 0
ahcich0: is 00000000 cs 00002000 ss 00000000 rs 00002000 tfd 00 serr 00000000 cmd 0040ed17
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Poll timeout on slot 14 port 0
ahcich0: is 00000000 cs 00004000 ss 00000000 rs 00004000 tfd 00 serr 00000000 cmd 0040ee17
(aprobe0:ahcich0:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
ahcich0: Timeout on slot 15 port 0
ahcich0: is 00000000 cs 00008000 ss 00000000 rs 00008000 tfd 00 serr 00000000 cmd 0040ef17
(ada0:ahcich0:0:0:0): SETFEATURES ENABLE RCACHE. ACB: ef aa 00 00 00 40 00 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Poll timeout on slot 16 port 0
ahcich0: is 00000000 cs 00010000 ss 00000000 rs 00010000 tfd 00 serr 00000000 cmd 0040f017
(aprobe0:ahcich0:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
ahcich0: Timeout on slot 17 port 0
ahcich0: is 00000000 cs 00020000 ss 00000000 rs 00020000 tfd 00 serr 00000000 cmd 0040f117
(ada0:ahcich0:0:0:0): SETFEATURES ENABLE WCACHE. ACB: ef 02 00 00 00 40 00 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Poll timeout on slot 18 port 0
ahcich0: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd 00 serr 00000000 cmd 0040f217
(aprobe0:ahcich0:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
ahcich0: Timeout on slot 19 port 0
ahcich0: is 00000000 cs 00380000 ss 00380000 rs 00380000 tfd 00 serr 00000000 cmd 0040f317
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 a0 78 68 40 04 00 00 01 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 c0 b8 62 6a 40 04 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Unconditionally Re-queue Request
(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 28 78 ec 9f 40 05 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Unconditionally Re-queue Request
(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Poll timeout on slot 22 port 0
ahcich0: is 00000000 cs 00400000 ss 00000000 rs 00400000 tfd 00 serr 00000000 cmd 0040f617
(aprobe0:ahcich0:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted

Does anyone know what this error means?

My problem is that I updated the firmware (apu2) and switched to ZFS while upgrading to 22.1, all at the same time. Should I first revert the firmware? Revert to UFS? Do I need a new mSATA SSD? The same setup worked fine before all these updates were made. Hmm
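
To compare runs before and after a firmware or disk change, it may help to condense a log like the one above into a few counters. A minimal sketch, assuming the message formats shown in this post; `cam_summary` is a hypothetical helper name and the patterns are taken only from the lines quoted here:

```shell
#!/bin/sh
# cam_summary (hypothetical helper): read kernel messages on stdin and count
# the AHCI/CAM events that show up in the log quoted above.
cam_summary() {
    awk '
        /: (Poll timeout|Timeout) on slot/   { timeouts++ }
        /CAM status: Command timeout/        { cam_timeouts++ }
        /AHCI reset: device not ready/       { resets++ }
        / detached$/                         { detached++ }
        END {
            printf "slot timeouts: %d\n", timeouts
            printf "CAM command timeouts: %d\n", cam_timeouts
            printf "failed AHCI resets: %d\n", resets
            printf "detach events: %d\n", detached
        }
    '
}

# Usage (reads the kernel message buffer):
#   dmesg | cam_summary
```

Saving the output together with the uptime at the moment of failure gives something concrete to compare between coreboot 4.14 and 4.15, or between the old and a new SSD.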
Title: Re: ZFS suspended zroot (apu2 coreboot 4.15)
Post by: senser on February 06, 2022, 09:37:29 pm
Reverted from coreboot 4.15 to 4.14. Issue did not show up since then.
But because it happened kind of randomly (sometimes it was fine for like 12h before the issue appeared) it needs some more testing…
Title: Re: ZFS suspended zroot (apu2 coreboot 4.15)
Post by: franco on February 07, 2022, 08:11:14 am
I also found this https://www.reddit.com/r/zfs/comments/j55uot/pool_io_is_currently_suspended/g7qxytg/?utm_source=reddit&utm_medium=web2x&context=3

IMO the disk could be failing. The old BIOS may be able to work around it better.


Cheers,
Franco
Title: Re: ZFS suspended zroot (apu2 coreboot 4.15)
Post by: senser on February 07, 2022, 09:43:49 am
I think the SSD is fine. It is a 64 GB model and has only ever been used in the apu2 running pfSense/OPNsense (full installs). It never failed before. The SMART status is good.
Since the downgrade to coreboot 4.14 the issue is gone (20h uptime). I think I need to report this to the coreboot folks.
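
For anyone who wants to repeat the SMART check, something like the following could be used. It is a sketch only: `smart_verdict` is a hypothetical helper, the device name `/dev/ada0` is taken from the kernel log earlier in this thread, and it assumes the overall-health line that smartctl prints for ATA devices.

```shell
#!/bin/sh
# smart_verdict (hypothetical helper): read `smartctl -H <device>` output on
# stdin, print the overall verdict and return 0 only if it is PASSED.
smart_verdict() {
    verdict=$(sed -n 's/^SMART overall-health self-assessment test result: *//p' | head -n 1)
    [ -n "$verdict" ] || { echo "no verdict found"; return 2; }
    echo "$verdict"
    [ "$verdict" = "PASSED" ]
}

# Usage on the firewall:
#   smartctl -H /dev/ada0 | smart_verdict
# A long self-test and its result log can be requested with:
#   smartctl -t long /dev/ada0
#   smartctl -l selftest /dev/ada0
```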
Title: Re: ZFS suspended zroot (apu2 coreboot 4.15)
Post by: senser on February 13, 2022, 06:25:43 pm
After a long(er) period with the older coreboot version I just got the same error again. I have ordered a new mSATA SSD.
What is the best way to migrate to the new SSD?
Boot the live system with the old SSD, use the importer, hot-swap the SSD, then install?
Title: Re: ZFS suspended zroot (apu2 coreboot 4.15)
Post by: thomas.sec on September 28, 2022, 08:51:47 am
Hello,
We are facing the same issue. OPNsense works for ~20h and then stops responding.

- Solaris: warning pool has encountered an uncorrectable io error suspended
- the console shows some CAM errors: device not ready, AHCI reset and CAM timeout...

It really looks like a hardware problem, but our disk was tested with a long smartctl test and there were no errors.

Have you resolved this? How? :-)

Thanks
Title: Re: ZFS suspended zroot (apu2 coreboot 4.15)
Post by: Christian on October 22, 2022, 01:19:01 pm
I am seeing the same issue, also on an apu2 but with coreboot v4.17.0.3 and OPNsense v21.7.6. My uptime is slightly higher, around 36 h.

According to smartctl the SSD is fine, no errors at all. Scrubbing the pool shows no errors. Only thing I can think of is to swap out the SSD regardless.

@senser @thomas.sec Was there any progress on this since your posts?
Title: Re: ZFS suspended zroot (apu2 coreboot 4.15)
Post by: quako on December 21, 2022, 11:57:15 am
> Reverted from coreboot 4.15 to 4.14. Issue did not show up since then.
> But because it happened kind of randomly (sometimes it was fine for like 12h before the issue appeared) it needs some more testing…

Hello guys, I know this is the OPNsense forum, but I have a very similar issue, as you can see in the photo I attached. I had the same error today on pfSense Plus 22.05 on an HP t730 thin client. I have an HDMI cable connected to a display. It had been running for about 3 days. At the time of the error the network wasn't working: my PC had an APIPA address and Wi-Fi was offline. I had to unplug and replug the power cable to restart the pfSense box. After the restart it works and the S.M.A.R.T. status is 'passed'.
(https://imgur.com/a/N1phX6h)
EDIT: I don't know why it isn't showing the link to imgur, so here it is: https://imgur.com/a/N1phX6h