ZFS suspended zroot (apu2 coreboot 4.15)

Started by senser, January 31, 2022, 02:59:58 PM

January 31, 2022, 02:59:58 PM Last Edit: February 06, 2022, 09:32:33 PM by senser
So some of my devices suddenly lost their internet connection today around 12 o'clock, while others were still working fine. This is a fresh 22.1 ZFS install (using the config importer) on an apu2, running since yesterday. The serial console gave me this:


FreeBSD/amd64 (mrqu.freifunk) (ttyu0)

login: vvd
Solaris: WARNING: Pool 'zroot' has encountered an uncorrectable I/O failure and has been suspended.


Nothing else. I could not log in. I had to pull the power to reboot. Everything seems to be running fine again.
I could not find anything in the logs except maybe this:

2022-01-30T11:59:26 Error configctl unable to connect to configd socket (@/var/run/configd.socket)

Is ZFS considered experimental? :) Can I do some FS check or something?

ZFS is not considered experimental. It is the most stable and reliable filesystem in existence for most. It is a memory hog, though. How much RAM do you have?

And the fsck is called a scrub in ZFS terminology:
zpool scrub zroot
zpool status zroot
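
For anyone who wants to run this check from a script (cron, for example), here is a minimal sketch. The helper name pool_health_summary is made up for illustration and is not part of ZFS or OPNsense. It only assumes that `zpool status -x` prints a line containing "all pools are healthy" or "is healthy" when nothing is wrong.

```sh
#!/bin/sh
# Hypothetical helper (not a ZFS or OPNsense tool): reads the output of
# `zpool status -x` on stdin and prints a one-word verdict.
pool_health_summary() {
    # `zpool status -x` prints "all pools are healthy" (no pool given) or
    # "pool 'NAME' is healthy" (pool given) when there is nothing to report.
    if grep -q -e "all pools are healthy" -e "is healthy"; then
        echo "healthy"
    else
        echo "needs attention"
    fi
}

# Intended use on the firewall itself (needs root and the zpool CLI):
#   zpool status -x zroot | pool_health_summary
#   zpool status -v zroot    # scrub progress plus any files with errors
```

If a pool was suspended and the disk is reachable again, `zpool clear zroot` is the command meant to resume it, although with a suspended root pool you may not get a shell to run it from.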

> Had to pull the power for reboot. Seems everything is running fine again.

Obviously and luckily it was not UFS. ;)


Cheers,
Franco

The scrub did not report any errors.
I have checked the health graphs for excessive memory usage, but they looked OK (48% of 4 GB free).
There are a lot more processes running with ZFS compared to UFS. But I think that is expected.

So I still have this issue. WebUI and serial/ssh become unavailable while the network continues to work (more or less). Getting this after a reboot on the serial console:
ahcich0: Timeout on slot 6 port 0
ahcich0: is 00000000 cs 00000100 ss 000001c0 rs 000001c0 tfd 40 serr 00000000 cmd 0040e717
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 a0 78 68 40 04 00 00 01 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command, 3 more tries remain
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Timeout on slot 9 port 0
ahcich0: is 00000000 cs 00000200 ss 00000000 rs 00000200 tfd 00 serr 00000000 cmd 0040e917
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Retrying command, 0 more tries remain
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Timeout on slot 10 port 0
ahcich0: is 00000000 cs 00000400 ss 00000000 rs 00000400 tfd 00 serr 00000000 cmd 0040ea17
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Timeout on slot 11 port 0
ahcich0: is 00000000 cs 00000800 ss 00000000 rs 00000800 tfd 00 serr 00000000 cmd 0040eb17
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retry was blocked
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <Phison SSEP064GTMC0-S91 S9FM02.5> s/n 16165E0641182 detached
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Timeout on slot 12 port 0
ahcich0: is 00000000 cs 00001000 ss 00000000 rs 00001000 tfd 00 serr 00000000 cmd 0040ec17
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Retrying command, 0 more tries remain
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Timeout on slot 13 port 0
ahcich0: is 00000000 cs 00002000 ss 00000000 rs 00002000 tfd 00 serr 00000000 cmd 0040ed17
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Poll timeout on slot 14 port 0
ahcich0: is 00000000 cs 00004000 ss 00000000 rs 00004000 tfd 00 serr 00000000 cmd 0040ee17
(aprobe0:ahcich0:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
ahcich0: Timeout on slot 15 port 0
ahcich0: is 00000000 cs 00008000 ss 00000000 rs 00008000 tfd 00 serr 00000000 cmd 0040ef17
(ada0:ahcich0:0:0:0): SETFEATURES ENABLE RCACHE. ACB: ef aa 00 00 00 40 00 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Poll timeout on slot 16 port 0
ahcich0: is 00000000 cs 00010000 ss 00000000 rs 00010000 tfd 00 serr 00000000 cmd 0040f017
(aprobe0:ahcich0:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
ahcich0: Timeout on slot 17 port 0
ahcich0: is 00000000 cs 00020000 ss 00000000 rs 00020000 tfd 00 serr 00000000 cmd 0040f117
(ada0:ahcich0:0:0:0): SETFEATURES ENABLE WCACHE. ACB: ef 02 00 00 00 40 00 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Poll timeout on slot 18 port 0
ahcich0: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd 00 serr 00000000 cmd 0040f217
(aprobe0:ahcich0:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
ahcich0: Timeout on slot 19 port 0
ahcich0: is 00000000 cs 00380000 ss 00380000 rs 00380000 tfd 00 serr 00000000 cmd 0040f317
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 a0 78 68 40 04 00 00 01 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 c0 b8 62 6a 40 04 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Unconditionally Re-queue Request
(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 28 78 ec 9f 40 05 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Unconditionally Re-queue Request
(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Poll timeout on slot 22 port 0
ahcich0: is 00000000 cs 00400000 ss 00000000 rs 00400000 tfd 00 serr 00000000 cmd 0040f617
(aprobe0:ahcich0:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted


Does anyone know what this error means?
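
On the "what does it mean" question: the tfd value in the ahcich0 lines is the AHCI task file data, and its low byte mirrors the ATA status register. Below is a small sketch to decode it, assuming the standard ATA status bits (BSY 0x80, DRDY 0x40, DF 0x20, DRQ 0x08, ERR 0x01). decode_tfd is a hypothetical helper name, not a FreeBSD tool.

```sh
#!/bin/sh
# Hypothetical helper: decode the status byte of an AHCI "tfd" value
# (given as hex digits without the 0x prefix, as printed in the log).
decode_tfd() {
    val=$(( 0x$1 ))
    out=""
    [ $(( val & 0x80 )) -ne 0 ] && out="$out BSY"    # device busy
    [ $(( val & 0x40 )) -ne 0 ] && out="$out DRDY"   # device ready
    [ $(( val & 0x20 )) -ne 0 ] && out="$out DF"     # device fault
    [ $(( val & 0x08 )) -ne 0 ] && out="$out DRQ"    # data request
    [ $(( val & 0x01 )) -ne 0 ] && out="$out ERR"    # error
    # Strip the leading space; prints an empty line if none of these bits are set.
    echo "${out# }"
}

# Examples with values from the log:
#   decode_tfd 00000080
#   decode_tfd 40
```

`decode_tfd 00000080` prints BSY, so "AHCI reset: device not ready after 31000ms (tfd = 00000080)" reads as the drive still reporting busy 31 seconds after the reset, i.e. it stopped answering on the SATA link. The log alone does not say whether the SSD, the power supply, the slot, or the firmware is to blame.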

My problem is that I updated the firmware (apu2) and switched to ZFS while upgrading to 22.1, all at the same time. Should I first revert the firmware? Revert to UFS? Do I need a new mSATA SSD? The same setup worked fine before all these updates were made. Hmm

Reverted from coreboot 4.15 to 4.14. Issue did not show up since then.
But because it happened kind of randomly (sometimes it was fine for like 12h before the issue appeared) it needs some more testing...


I think the SSD is fine. It is a 64 GB drive and has only been used in the apu2 running pfSense/OPNsense (full installs). It never failed before. SMART status is good.
Since the downgrade to coreboot 4.14 the issue is gone (20h uptime). I think I need to report this to the coreboot folks.

After a long(er) period with the older coreboot version I just got the same error again. I have ordered a new mSATA SSD.
What is the best way to migrate to the new SSD?
Boot a live system with the old SSD, use the importer, hot-swap the SSD, then install?

Hello,
We are facing the same issue. OPNsense works for ~20h and then stops responding.

- Solaris: WARNING: pool has encountered an uncorrectable I/O failure and has been suspended
- the console shows some CAM errors: device not ready, AHCI reset, and CAM timeouts...
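
To compare machines, it can help to boil a captured serial console log down to a few counts. A minimal sketch, assuming the console output was saved to a text file. summarize_cam_log is a hypothetical helper, and the patterns are just the message strings from the log posted earlier in this thread.

```sh
#!/bin/sh
# Hypothetical helper: count the disk-related events in a saved console log.
# Usage: summarize_cam_log /path/to/console.log
summarize_cam_log() {
    log="$1"
    # grep -c prints the number of matching lines (0 if there are none).
    printf 'command timeouts: %s\n' "$(grep -c 'CAM status: Command timeout' "$log")"
    printf 'failed AHCI resets: %s\n' "$(grep -c 'AHCI reset: device not ready' "$log")"
    printf 'disk detach events: %s\n' "$(grep -c ' detached' "$log")"
}
```

Many failed resets followed by a detach event mean the disk dropped off the bus entirely, which would fit ZFS suspending the pool.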

It really looks like a hardware problem, but our disk was tested with a long smartctl test and there were no errors.

Have you resolved this? How? :-)

Thanks

October 22, 2022, 01:19:01 PM #10 Last Edit: October 22, 2022, 01:21:13 PM by Christian
I am seeing the same issue, also on an apu2 but with coreboot v4.17.0.3 and OPNsense v21.7.6. My uptime is slightly higher, around 36 h.

According to smartctl the SSD is fine, no errors at all. Scrubbing the pool shows no errors. Only thing I can think of is to swap out the SSD regardless.

@senser @thomas.sec Was there any progress on this since your posts?

Quote from: senser on February 06, 2022, 09:37:29 PM
Reverted from coreboot 4.15 to 4.14. Issue did not show up since then.
But because it happened kind of randomly (sometimes it was fine for like 12h before the issue appeared) it needs some more testing...

Hello guys, I know this is the OPNsense forum, but I have a very similar issue, as you can see in the photo I attached. I had the same error today on pfSense Plus 22.05 on an HP t730 thin client. I have an HDMI cable connected to a display. It had been running for about 3 days. At the time of the error the network wasn't working: my PC had an APIPA address and Wi-Fi was offline. I had to unplug and replug the power cable to restart the pfSense box. After the restart it works, and the S.M.A.R.T. status is 'passed'.

EDIT: I don't know why it isn't showing the link to imgur, so here it is: https://imgur.com/a/N1phX6h