OPNsense Forum

English Forums => General Discussion => Topic started by: maitops on December 06, 2024, 03:42:12 PM

Title: HyperV CPU hvevent goes to 100%
Post by: maitops on December 06, 2024, 03:42:12 PM
Hi, I have a critical issue that sends the CPU to 100% and locks up my router until it is rebooted.

I'm running an OPNsense 24.7.9_1 VM.
The hypervisor is Hyper-V on Windows Server 2022, fully up to date.

Sometimes a kernel hvevent thread uses 100% of one CPU until the VM is rebooted. This paralyses everything scheduled on that CPU and puts our production in downtime.
There is no pattern to when these incidents occur; they can happen after 12 hours of uptime or after 15 days.
All integration services were disabled in Hyper-V just in case; the problem is still there.
dmesg doesn't show anything in particular, and the affected thread can be hvevent1 or hvevent3.
The VM is Gen2 on Hyper-V with everything at defaults, except the network adapter settings required for CARP; Secure Boot is disabled.

Basically, the output of the command "top -aHST" shows one of the [kernel{hvevent1}] threads at 100% WCPU; we have also seen [kernel{hvevent3}] do it, and maybe hvevent0 and hvevent2 do the same thing too.
The hvevent thread seems to run continuously until the reboot.

We also tried to disable all the Hyper-V integration services at the FreeBSD kernel-module level, but we couldn't find how.
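From what I understand, the Hyper-V drivers are compiled into the FreeBSD/OPNsense kernel rather than shipped as loadable modules, which would explain why there is nothing to kldunload. A rough way to check this, if I read kldstat right:

# kldstat -v | grep -iE 'hv|vmbus'   # Hyper-V drivers linked into the kernel file itself
# sysctl dev.vmbus                   # should describe the vmbus device if it is active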

What tests/logs should I provide to better understand what is really happening?
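For the next occurrence, here is what I plan to capture before rebooting, assuming the shell still responds (untested during an actual event, so treat it as a sketch):

# procstat -kk 0 | grep hvevent   # kernel stacks of the hvevent threads, to see where the spinning one sits
# vmstat -i                       # run twice a few seconds apart to spot a racing interrupt counter
# dmesg | grep -i vmbus           # any vmbus channel errors around the time of the event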

Thank you in advance
Title: Re: HyperV CPU hvevent goes to 100%
Post by: maitops on December 11, 2024, 08:34:31 AM
For example, this morning the problem came back. Here is what "top -aHSTb" returned:

# top -aHSTb
last pid: 86105;  load averages:  5.04,  4.98,  4.96  up 0+21:13:18    08:32:40
299 threads:   9 running, 281 sleeping, 9 waiting
CPU:  1.3% user,  0.0% nice,  2.0% system,  0.0% interrupt, 96.7% idle
Mem: 80M Active, 393M Inact, 1765M Wired, 56K Buf, 1606M Free
ARC: 1079M Total, 128M MFU, 798M MRU, 4345K Anon, 19M Header, 129M Other
     827M Compressed, 2265M Uncompressed, 2.74:1 Ratio
Swap: 8192M Total, 8192M Free

   THR USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
100003 root        187 ki31     0B    64K CPU0     0  20.8H 100.00% [idle{idle: cpu0}]
100006 root        187 ki31     0B    64K RUN      3  20.8H 100.00% [idle{idle: cpu3}]
100005 root        187 ki31     0B    64K CPU2     2  20.8H 100.00% [idle{idle: cpu2}]
100103 root        -64    -     0B  1744K CPU1     1  87:53 100.00% [kernel{hvevent1}]
100004 root        187 ki31     0B    64K RUN      1  19.4H   0.00% [idle{idle: cpu1}]
100892 www          20    0   196M    41M kqread   3  14:02   0.00% /usr/local/sbin/haproxy -q -f /usr/local/etc/haproxy.conf -p /var/run/haproxy.pid{haproxy}
100795 www          20    0   196M    41M kqread   3  13:54   0.00% /usr/local/sbin/haproxy -q -f /usr/local/etc/haproxy.conf -p /var/run/haproxy.pid{haproxy}
100894 www          20    0   196M    41M kqread   3  13:52   0.00% /usr/local/sbin/haproxy -q -f /usr/local/etc/haproxy.conf -p /var/run/haproxy.pid{haproxy}
100893 www          20    0   196M    41M kqread   2  13:47   0.00% /usr/local/sbin/haproxy -q -f /usr/local/etc/haproxy.conf -p /var/run/haproxy.pid{haproxy}
100328 root         20    0    86M    60M nanslp   2   3:09   0.00% /usr/local/bin/php /usr/local/opnsense/scripts/routes/gateway_watcher.php interface routes alarm
100101 root        -64    -     0B  1744K -        0   2:56   0.00% [kernel{hvevent0}]
100107 root        -64    -     0B  1744K -        3   2:31   0.00% [kernel{hvevent3}]
100105 root        -64    -     0B  1744K -        2   2:29   0.00% [kernel{hvevent2}]
100114 root        -64    -     0B  1744K -        3   0:50   0.00% [kernel{hn1 tx0}]
100111 root        -64    -     0B  1744K -        2   0:39   0.00% [kernel{hn0 tx0}]
100093 root        -60    -     0B    96K WAIT     0   0:24   0.00% [intr{swi1: pfsync}]
100272 root         20    0    13M  2736K bpf      2   0:22   0.00% /usr/local/sbin/filterlog -i pflog0 -p /var/run/filterlog.pid
100037 root        -60    -     0B    64K WAIT     0   0:21   0.00% [clock{clock (0)}]
Title: Re: HyperV CPU hvevent goes to 100%
Post by: maitops on December 16, 2024, 10:50:46 AM
Hi, we got the problem again (on both routers of the CARP group). We have tried disabling checkpoints in Hyper-V and updating to 24.7.10_2.

We are going to roll back to 24.1.10, because the problem seems to date from the update from 24.1 to 24.7, probably because of the underlying FreeBSD upgrade.
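For the record, this is how we compare the kernel and userland versions before and after the rollback (freebsd-version is part of the base system):

# opnsense-version
# freebsd-version -kru   # installed kernel, running kernel, userland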
Title: Re: HyperV CPU hvevent goes to 100%
Post by: cookiemonster on December 16, 2024, 11:01:35 AM
Just a thought here.
Quote from: maitops on December 06, 2024, 03:42:12 PM[kernel{hvevent1}] at 100% WCPU, we have also [kernel{hvevent3}]
suggests, as I'm sure you can also see, a kernel event that has not completed processing in adequate time. That processing, being in the kernel, needs to interact, I imagine, with the hypervisor. So it might be a problem between the kernel (FreeBSD) and the hypervisor (Hyper-V).
Therefore you could take it to the FreeBSD group, but they will typically request a test with a plain FreeBSD kernel instead of an OPNsense one.
A suggestion, if you can at all: change the hypervisor and re-test. Hyper-V has, to my knowledge, never been a good bedfellow of FreeBSD.
Title: Re: HyperV CPU hvevent goes to 100%
Post by: maitops on December 17, 2024, 08:01:42 AM
Quote from: cookiemonster on December 16, 2024, 11:01:35 AMJust a thought here.
Quote from: maitops on December 06, 2024, 03:42:12 PM[kernel{hvevent1}] at 100% WCPU, we have also [kernel{hvevent3}]
suggests, as I'm sure you can also see, a kernel event that has not completed processing in adequate time. That processing, being in the kernel, needs to interact, I imagine, with the hypervisor. So it might be a problem between the kernel (FreeBSD) and the hypervisor (Hyper-V).
Therefore you could take it to the FreeBSD group, but they will typically request a test with a plain FreeBSD kernel instead of an OPNsense one.
A suggestion, if you can at all: change the hypervisor and re-test. Hyper-V has, to my knowledge, never been a good bedfellow of FreeBSD.


Thanks for replying. Sadly, we don't know how to reproduce the bug, so we can't really simulate it in a non-production environment with another hypervisor...

We went back to 24.1 even though it's deprecated, to confirm whether the problem is related to the upgrade. We have another OPNsense active/passive pair with HAProxy for the dev environment; the bug has never appeared there, probably because the traffic is very low.
Title: Re: HyperV CPU hvevent goes to 100%
Post by: ctr_techniker on February 05, 2025, 05:46:05 PM
It seems we are encountering the same problem.
I cannot confirm the top output because our OPNsense instances freeze completely, but we have seen the same crashes with OPNsense 24.7 and 25.1 on Microsoft Hyper-V on Server 2019, with OPNsense as a Gen2 VM with 2 cores, 2 GB of memory and a 20 GB disk. On the hypervisor we can confirm the CPU is stuck at 100% when it has crashed. Interestingly, most of the time when a firewall crashes, traffic still passes through it to the systems behind; the firewall itself also still answers pings, but there is no reaction to SSH, the web interface does not load, and the console is completely frozen.
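Since the frozen state still answers ping but refuses SSH, we now probe for exactly that combination from an external host. A rough sketch, with a placeholder address:

#!/bin/sh
# Hypothetical external probe: ping answered but SSH port closed = suspected freeze.
FW=192.0.2.1                                 # firewall address (placeholder)
ping -c 1 "$FW" > /dev/null 2>&1 || exit 0   # fully down: a different alert path handles that
nc -z -w 5 "$FW" 22 > /dev/null 2>&1 || \
    echo "$FW answers ping but SSH is down - possible freeze"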

The same happens to our pfSense 2.7.2 firewalls: many of them (currently ~50 firewalls on 2.7.2) have already crashed once since the update to 2.7 with FreeBSD 14. As we planned to migrate all of our ~300 firewalls to OPNsense, we deployed our first production ones just two weeks ago, and they crashed within a week. No such crashes were ever seen on our test firewalls, so it might have something to do with production traffic.

At the moment we are trying to find a way to trigger it, but no success so far. The production firewalls, pfSense and OPNsense alike, crash randomly: sometimes they crash, we reboot them and an hour later they crash again; sometimes they don't crash for one or two months, and some haven't crashed at all in half a year.
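One thing we still want to try is pushing heavy synthetic load through a test firewall, for example with iperf3 between two hosts on opposite sides of it (hostnames made up):

# on a host behind the firewall:      iperf3 -s
# on a host in front of the firewall: iperf3 -c inside-host -P 8 -t 3600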
Title: Re: HyperV CPU hvevent goes to 100%
Post by: maitops on April 05, 2025, 12:44:53 PM
I tried UFS + a fixed-size VHDX on a fresh install of 25.1.4 with my configuration imported.
(Someone from the pfSense community thought it was related to ZFS: https://forum.netgate.com/topic/190927/pfsense-2-7-2-in-hyper-v-freezing-with-no-crash-report-after-reboot/29)
The issue is still there.

root@proxy01:~ # top -aHSTn
last pid: 42090;  load averages:  3.50,  1.88,  0.94  up 0+14:51:36    12:33:11
252 threads:   13 running, 226 sleeping, 13 waiting
CPU:  1.5% user,  0.0% nice,  0.4% system,  0.0% interrupt, 98.1% idle
Mem: 135M Active, 1875M Inact, 1813M Wired, 1100M Buf, 7823M Free
Swap: 8192M Total, 8192M Free

   THR USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
100004 root        187 ki31     0B   128K RUN      1 872:01 100.00% [idle{idle: cpu1}]
100009 root        187 ki31     0B   128K CPU6     6 871:57 100.00% [idle{idle: cpu6}]
100003 root        187 ki31     0B   128K CPU0     0 871:36 100.00% [idle{idle: cpu0}]
100007 root        187 ki31     0B   128K CPU4     4 871:35 100.00% [idle{idle: cpu4}]
100005 root        187 ki31     0B   128K CPU2     2 871:33 100.00% [idle{idle: cpu2}]
100116 root        -64    -     0B  1472K CPU7     7   4:51 100.00% [kernel{hvevent7}]
100008 root        187 ki31     0B   128K RUN      5 871:22  98.97% [idle{idle: cpu5}]
100006 root        187 ki31     0B   128K CPU3     3 871:28  96.97% [idle{idle: cpu3}]
101332 www          21    0   537M    59M kqread   2  11:31   3.96% /usr/local/sbin/haproxy -q -f /usr/local/etc/haproxy.conf -p /var/run/haproxy.pid{haproxy}
100902 www          21    0   537M    59M kqread   4  12:29   1.95% /usr/local/sbin/haproxy -q -f /usr/local/etc/haproxy.conf -p /var/run/haproxy.pid{haproxy}
101333 www          23    0   537M    59M kqread   5  11:17   1.95% /usr/local/sbin/haproxy -q -f /usr/local/etc/haproxy.conf -p /var/run/haproxy.pid{haproxy}
101334 www          21    0   537M    59M CPU1     1  10:57   1.95% /usr/local/sbin/haproxy -q -f /usr/local/etc/haproxy.conf -p /var/run/haproxy.pid{haproxy}
100010 root        187 ki31     0B   128K RUN      7 867:24   0.00% [idle{idle: cpu7}]
100805 root         20    0    81M    52M nanslp   5   3:09   0.00% /usr/local/bin/php /usr/local/opnsense/scripts/routes/gateway_watcher.php interface routes alarm
100102 root        -64    -     0B  1472K -        0   1:14   0.00% [kernel{hvevent0}]
100104 root        -64    -     0B  1472K -        1   0:59   0.00% [kernel{hvevent1}]
100114 root        -64    -     0B  1472K -        6   0:56   0.00% [kernel{hvevent6}]
100106 root        -64    -     0B  1472K -        2   0:56   0.00% [kernel{hvevent2}]

root@proxy01:~ # kldstat | grep zfs
root@proxy01:~ # top | grep arc


As you can see, hvevent7 is at 100%, there is no ARC line in the top output, and no ZFS module is loaded.
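For completeness, the root filesystem type can be double-checked as well; the root mount line should now say "ufs" rather than "zfs":

root@proxy01:~ # mount | grep ' on / '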