21.7 Legacy Series / QEMU-Guest-Agent causes lots of zombie-processes, OPNSense stalls
« on: November 21, 2021, 11:30:57 pm »
I'm running half a dozen virtual OPNSense gateways (21.7.4) under oVirt using the QEMU guest agent.
The plugin qemu-ga became available within the 21.1 release.
After roughly 60 days without a reboot these gateways stall: the web GUI and the console both show "cannot fork" errors, and it is impossible to log in via console or SSH.
To find the culprit, I rebooted an affected OPNSense gateway from oVirt using qemu-ga and ran
Code:
ssh root@mygateway ps uxawj
very quickly during reboot to show the process list.
There were several thousand zombie processes, started on various dates, all with PPID 8995:
Code:
...
root 182 0.0 0.0 0 0 - Z Fri05 0:00.01 <defunct> 8995 8995 8995 0
root 192 0.0 0.0 0 0 - Z 4Nov21 0:00.00 <defunct> 8995 8995 8995 0
root 215 0.0 0.0 0 0 - Z Fri04 0:00.00 <defunct> 8995 8995 8995 0
root 249 0.0 0.0 0 0 - Z 13Nov21 0:00.00 <defunct> 8995 8995 8995 0
root 253 0.0 0.0 0 0 - Z 3Nov21 0:00.00 <defunct> 8995 8995 8995 0
...
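To see which parent is leaking zombies without reading thousands of lines, a listing like the one above can be aggregated per PPID. This is only a minimal sketch: the function name zombies_by_parent is made up for illustration, and it assumes a ps that supports the stat and ppid output keywords.

```shell
#!/bin/sh
# Hypothetical helper (not part of OPNsense or qemu-ga): reads "STAT PPID"
# pairs on stdin and prints "<zombie count> <ppid>", highest count first.
zombies_by_parent() {
    awk '$1 ~ /^Z/ { count[$2]++ }
         END { for (p in count) print count[p], p }' | sort -rn
}

# Usage on a live system (the trailing "=" suppresses the header line):
#   ps -ax -o stat= -o ppid= | zombies_by_parent
```

On an affected gateway the first output line should name the leaking parent, which was PID 8995 in this case.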
Luckily, the process with PID 8995 was still alive:
Code:
root 8995 0.0 0.4 17236 4052 - Ss 29Oct21 2:48.09 /usr/local/bin/q 1 8995 8995 0
The output is truncated, but on this OPNSense the only binary beginning with "q" in /usr/local/bin is qemu-ga.
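The truncation could probably have been avoided by asking ps for just the command column at unlimited width. A small sketch, where full_command is an illustrative name and the -ww and "command=" behaviour is assumed to match the ps shipped on FreeBSD:

```shell
#!/bin/sh
# Hypothetical helper: print the untruncated command line of one PID.
# -ww lifts the column limit; "command=" prints the column without a header.
full_command() {
    ps -ww -o command= -p "$1"
}

# Example with the PID from this post:
#   full_command 8995
```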
I think the problem is described for FreeBSD on GitHub:
https://github.com/aborche/qemu-guest-agent/issues/17
and there already seems to be a fix for it:
https://github.com/aborche/qemu-guest-agent/commit/71edc56b1476bf6c45d1d461bbfa9fe987a8974e
Unfortunately, this fix hasn't made it into FreeBSD, and therefore not into OPNSense, yet.
I was able to work around the problem on my systems by disabling the guest-get-fsinfo RPC call in the OPNSense configuration (see attachment).
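For setups where the agent's flags are set by hand rather than through the plugin settings, upstream qemu-ga offers a comparable switch on its command line. This is a sketch under assumptions: it presumes the FreeBSD build accepts the upstream -b/--blacklist option (newer upstream releases rename the long form to --block-rpcs), and it says nothing about how the OPNSense plugin itself passes the setting to the daemon.

```shell
# Assumption: upstream qemu-ga option -b takes a comma-separated list of RPCs
# to disable; passing "?" prints the RPC names this build knows about.
qemu-ga -b '?'

# Start the agent with only guest-get-fsinfo disabled (other flags omitted).
qemu-ga -b guest-get-fsinfo
```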
Posting this in case somebody else runs into the same problem.