Monitoring your ZFS root using monit

Started by redbull666, February 26, 2022, 09:30:52 AM

I modified a ZFS monitoring script a bit and use it on OPNsense. It will monitor your "zroot" ZFS pool if you have installed OPNsense on ZFS (you should, ZFS is amazing).

First, copy this script to your OPNsense install (I have it in /root) and make sure it's executable.

#! /bin/sh
#
## ZFS health check script for monit.
## Original script from:
## Calomel.org
##     https://calomel.org/zfs_health_check_script.html
#

# Parameters

maxCapacity=$1 # in percentages

usage="Usage: $0 maxCapacityInPercentages\n"

if [ ! "${maxCapacity}" ]; then
  printf "Missing arguments\n"
  printf "${usage}"
  exit 1
fi

# Output for monit user interface

printf "==== ZPOOL STATUS ====\n"
printf "%s" "$(/sbin/zpool status)"
printf "\n\n==== ZPOOL LIST ====\n"
printf "%s\n" "$(/sbin/zpool list)"


# Health - Check if all zfs volumes are in good condition. We are looking for
# any keyword signifying a degraded or broken array.

condition=$(/sbin/zpool status | grep -E 'DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED|FAIL|DESTROYED|corrupt|cannot|unrecover')

if [ "${condition}" ]; then
  printf "\n==== ERROR ====\n"
  printf "One of the pools is in one of these statuses: DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED|FAIL|DESTROYED|corrupt|cannot|unrecover!\n"
  printf "%s\n" "$condition"
  exit 1
fi


# Capacity - Make sure the pool capacity is below 80% for best performance. The
# percentage really depends on how large your volume is. If you have a 128GB
# SSD then 80% is reasonable. If you have a 60TB raid-z2 array then you can
# probably set the warning closer to 95%.
#
# ZFS uses a copy-on-write scheme. The file system writes new data to
# sequential free blocks first and when the uberblock has been updated the new
# inode pointers become valid. This method is true only when the pool has
# enough free sequential blocks. If the pool is at capacity and space limited,
# ZFS will have to write blocks randomly. This means ZFS can not create an
# optimal set of sequential writes and write performance is severely impacted.

capacity=$(/sbin/zpool list -H -o capacity | cut -d'%' -f1)

for line in ${capacity}
  do
    if [ "$line" -ge "$maxCapacity" ]; then
      printf "\n==== ERROR ====\n"
      printf "One of the pools has reached its max capacity!\n"
      exit 1
    fi
  done


# Errors - Check the columns for READ, WRITE and CKSUM (checksum) drive errors
# on all volumes and all drives using "zpool status". If any non-zero errors
# are reported an email will be sent out. You should then look to replace the
# faulty drive and run "zpool scrub" on the affected volume after resilvering.

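# Compare each counter on its own: joining the three columns and filtering on
# "000" would hide counts such as 10 0 0, which join to "1000".
errors=$(/sbin/zpool status | grep ONLINE | grep -v state | awk 'NF >= 5 && ($3 != 0 || $4 != 0 || $5 != 0) {print $1, $3, $4, $5}')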

if [ "${errors}" ]; then
  printf "\n==== ERROR ====\n"
  printf "One of the pools contains errors!\n"
  printf "%s\n" "$errors"
  exit 1
fi

# Finish - If we made it here then everything is fine
exit 0
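
The counter check is the part that is easiest to get subtly wrong, so it can help to try it against saved "zpool status" output before trusting it. A minimal sketch, not part of the script above: the function name find_device_errors is made up for illustration, and it reads the status text from standard input instead of calling zpool itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script): print every pool,
# vdev or disk line that is ONLINE but has a non-zero READ, WRITE or CKSUM
# counter. Reads "zpool status" output on stdin.
find_device_errors() {
  # NF >= 5 skips the "state: ONLINE" summary line, which has only 2 fields.
  awk 'NF >= 5 && $2 == "ONLINE" && ($3 != 0 || $4 != 0 || $5 != 0) {
         print $1, $3, $4, $5
       }'
}

# Example use on a live system:
#   /sbin/zpool status | find_device_errors
```

No output means all counters are zero; any line printed names the device followed by its three counters.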


Then add a new service to your monit configuration in OPNsense. The "80" is a parameter for one of the alerts, specifically triggering when the pool is 80% full. Of course the script will also trigger on serious issues, such as a degraded pool if one of the disks in your mirror is offline.
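
To see what the "80" does, here is a small sketch of the same comparison that also names the pool that tripped the limit. It is an illustration only: the function name pools_over_limit is made up, and it reads "name capacity" lines from standard input so it can be tried without a pool.

```shell
#!/bin/sh
# Hypothetical helper: read "name<TAB>capacity%" lines, as printed by
#   /sbin/zpool list -H -o name,capacity
# and print every pool at or above the limit passed as $1.
pools_over_limit() {
  limit=$1
  while read -r name cap; do
    cap=${cap%\%}                      # strip the trailing percent sign
    if [ "$cap" -ge "$limit" ]; then
      printf '%s is at %s%%\n' "$name" "$cap"
    fi
  done
}

# Example use on a live system:
#   /sbin/zpool list -H -o name,capacity | pools_over_limit 80
```

A pool at exactly 80% counts as over the limit here, the same as the "-ge" test in the script.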



That's it, assuming you have configured monit correctly to send emails, for example I am using:


December 21, 2022, 04:54:12 PM #1 Last Edit: December 28, 2022, 05:38:27 PM by dcol
I get 'Status Failed' on the status page.

Here is the log
Error   monit   'zfs_monit' failed to execute '/usr/local/bin/ZFS_monit.sh 80' -- No such file or directory

I assure you the file is there and I set the permissions to 755. I am running a CPU temp check the same way in Monit and it works fine.

I am disabling this service until a resolution is available.

Thanks @redbull666 for the script!

Can confirm that the script works like a charm.

Quote from: dcol on December 21, 2022, 04:54:12 PM
Error   monit   'zfs_monit' failed to execute '/usr/local/bin/ZFS_monit.sh 80' -- No such file or directory

Are you sure the file is named ZFS_monit.sh (and not zfs_monit.sh)? Maybe just move it to /root and see if it works there?

January 13, 2023, 10:39:33 PM #3 Last Edit: January 13, 2023, 11:16:16 PM by dcol
Moved it to /root/zfs_monit.sh and the file name is now zfs_monit.sh.
The service is set up exactly like the example.
Getting 'zfs_monit' failed to execute '/root/zfs_monit.sh 80' -- No such file or directory
Here are my settings - I disabled the service check for now because it gives the above error
What am I missing? Permissions are set to 755. Are there any Service Tests Settings needed?
I am running another custom test in Monit and it works fine.
After some testing I have determined there is an error in the script. The script is looking for something that does not exist. # sudo /sbin/zpool seems to work fine

Can you login via SSH to your OPNsense, issue the following command and post the result?

ls -lart /root/zfs_monit.sh

January 13, 2023, 11:21:25 PM #5 Last Edit: January 13, 2023, 11:26:56 PM by dcol
here is the result

root@firewall:~ # ls -lart /root/zfs_monit.sh
-rwxr-xr-x  1 root  wheel  2590 Dec 21 08:46 /root/zfs_monit.sh
root@firewall:~ #

The problem seems to be something in the script itself. It finds the command just fine, but that error shows that something in the script cannot be found.

January 13, 2023, 11:27:18 PM #6 Last Edit: January 13, 2023, 11:29:27 PM by SWEETGOOD
If you are connected via SSH you could try to run the script from there:

/root/zfs_monit.sh 80

Does that work?

Please also check if the encoding of the script is correct. You can do that with this command:

cat /root/zfs_monit.sh

The script output should look exactly like in the first post of the thread starter.
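
When running it by hand, the exit status is worth checking too, since the script ends with "exit 0" when everything is fine and "exit 1" on any problem. A small sketch for that; the function name show_exit_status is made up, and the assumption that the monit service test is based on this exit status is mine, not something shown in the thread.

```shell
#!/bin/sh
# Hypothetical helper: run the given command, print its exit status and
# return that same status (0 = healthy, anything else = problem found).
show_exit_status() {
  "$@"
  status=$?
  printf 'exit status: %s\n' "$status"
  return "$status"
}

# Example use:
#   show_exit_status /root/zfs_monit.sh 80
```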

January 13, 2023, 11:28:37 PM #7 Last Edit: January 13, 2023, 11:48:08 PM by dcol
Tried that, got the same error. It's in the script. I tried copying the script to the file again in case I missed some code, still doesn't work. There is an error in the script.

I ran cat /root/zfs_monit.sh and the console displays the script itself. Also tried removing all the comments in case of a syntax issue. didn't work. I am not a programmer so I can't tell where the coding issues are.

By the way, running OPNsense 22.7.10_2-amd64. Maybe I am missing a plugin with needed files? I have no plugins installed and running a default configuration. I am running ZFS. zpool status works.

Tried on second OPNsense installation with same results.


No, the script doesn't involve any plugin. The error message is clear and normally only shows up if either:

- the file and folder cannot be found (like the message says)
- a binary file has been compiled for a different architecture (my experience and not applicable here)

So this issue is very strange.
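
A third cause may be worth ruling out, although nothing in this thread confirms it: if the script was pasted through a Windows editor, its first line can end in a carriage return. The system then looks for an interpreter literally named "/bin/sh" plus that invisible character, which does not exist, and reports a missing file even though the script itself is there. A small sketch to test for that; the function name first_line_has_cr and the output file name in the comment are made up.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the first line of the file given
# as $1 ends with a carriage return, i.e. has a DOS/Windows line ending.
first_line_has_cr() {
  cr=$(printf '\r')
  case "$(head -n 1 "$1")" in
    *"$cr") return 0 ;;
    *)      return 1 ;;
  esac
}

# Example use:
#   first_line_has_cr /root/zfs_monit.sh && echo "CRLF line endings found"
# If they are found, write a cleaned copy (file name is only an example):
#   tr -d '\r' < /root/zfs_monit.sh > /root/zfs_monit.fixed.sh
```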

Let's try it step by step:

cd /root
echo '#!/bin/sh' > test.sh
echo 'echo SH EXECUTION TEST' >> test.sh
chmod +x test.sh
./test.sh

Does it show "SH EXECUTION TEST" at the end?

January 14, 2023, 12:10:50 AM #9 Last Edit: January 14, 2023, 12:17:21 AM by dcol
Here is the result

root@firewall:~ # cd /root
root@firewall:~ # echo '#!/bin/sh' > test.sh
/bin/sh: Event not found.
root@firewall:~ # echo 'echo SH EXECUTION TEST' >> test.sh
root@firewall:~ # chmod +x test.sh
root@firewall:~ # /test.sh
/test.sh: Command not found.
root@firewall:~ #

If a file is not found you get the message 'Command not found'
The message from the zfs_monit.sh logs is 'No such file or directory'. That tells me there is something in the script it cannot find. It appears like a message from running the script.

It could not issue the most important command so the "file preparation" was not successful:

echo '#!/bin/sh' > test.sh
/bin/sh: Event not found.

Could you please edit the file with a tool like vi or nano (the first one is available on OPNsense) and make sure its content is:

#!/bin/sh
echo SH EXECUTION TEST

Try to execute it again afterwards. Also, the script was never actually executed because the . at the beginning was missing: you typed /test.sh instead of ./test.sh. That's why it tells you "Command not found".
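
The "Event not found" message looks like csh/tcsh history expansion: in those shells the ! in #!/bin/sh is expanded even inside single quotes, so the first echo never wrote anything. A sketch of one way around it, assuming the login shell really is csh or tcsh (the message suggests it, but the thread does not say so): let printf produce the ! from its octal escape \041 instead of typing it.

```shell
# Create test.sh in the current directory without typing a literal "!".
# \041 is the octal escape for "!"; printf expands it and the \n escapes.
printf '#\041/bin/sh\necho SH EXECUTION TEST\n' > test.sh
chmod +x test.sh
./test.sh
```

The same printf line also works unchanged in /bin/sh, so it does not matter which shell it is pasted into.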

January 14, 2023, 12:19:07 AM #11 Last Edit: January 14, 2023, 12:23:29 AM by dcol
Yes the file contains 'echo SH EXECUTION TEST'

root@firewall:~ # ./test.sh
SH EXECUTION TEST
root@firewall:~ #

Then I tried

root@firewall:~ # ./zfs_monit.sh
./zfs_monit.sh: Command not found.



Does it also contain

#!/bin/sh

in the first line of the file? It's important that the file contains both lines, as I stated in my previous post.

January 14, 2023, 12:25:14 AM #13 Last Edit: January 14, 2023, 12:38:01 AM by dcol
The test.sh file only contains 'echo SH EXECUTION TEST', no #!/bin/sh
If I add the '#!/bin/sh' I get same result
root@firewall:~ # ./test.sh
SH EXECUTION TEST
Same results with and without '#!/bin/sh'
Maybe can't recognize file type?

I also tried '#! /bin/sh' as in the original monit script above

zfs_monit.sh as it is now
#!/bin/sh
#
## ZFS health check script for monit.
## Original script from:
## Calomel.org
##     https://calomel.org/zfs_health_check_script.html
#
................... and the rest of the file

The space between the #! and the /bin/sh doesn't matter.

This line is important because it tells the system which interpreter should run the following lines. So it must stay at the top of every executable shell script.

Nevertheless, I would recommend adding the script from the thread starter block by block to my test file and executing it after every step. This might lead you to the line which causes the script to fail.

Please note that you need to add the if and the for blocks completely. So it's OK to just copy the line

maxCapacity=$1 # in percentages

for testing but the if blocks like

if [ ! "${maxCapacity}" ]; then
  printf "Missing arguments\n"
  printf "${usage}"
  exit 1
fi


and also the for block need to be added as a whole.

And indeed you can skip all comments.
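
A quicker way to narrow this down than pasting blocks by hand, offered as a suggestion and not something tried in this thread: "sh -n" parses a script without running it, and "sh -x" prints each command as it is executed. The helper name check_syntax below is made up.

```shell
#!/bin/sh
# Hypothetical helper: parse the script given as $1 with "sh -n", which
# reads the commands but does not execute them, and report the result.
check_syntax() {
  if sh -n "$1" 2>/dev/null; then
    printf 'syntax OK: %s\n' "$1"
  else
    printf 'syntax error: %s\n' "$1"
    return 1
  fi
}

# Example use, followed by a traced run that shows the last command reached:
#   check_syntax /root/zfs_monit.sh
#   sh -x /root/zfs_monit.sh 80
```

Both commands pass the file to sh directly, so they still work when the first line of the script is damaged.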