Hyper-V CPU hvevent goes to 100%

Started by maitops, December 06, 2024, 03:42:12 PM

Hi, I have a critical issue that sends the CPU to 100% and leaves my router stuck until it is rebooted.

I'm running an OPNsense 24.7.9_1 VM.
The hypervisor is Hyper-V on Windows Server 2022, fully up to date.

Sometimes a kernel hvevent thread uses 100% of one CPU until the VM is rebooted. This paralyses everything scheduled on that CPU and puts our production into downtime.
There is no pattern to when it happens; it can occur after 12 hours of uptime or after 15 days.
All integration services were disabled on the Hyper-V side just in case, but the problem is still there.
dmesg doesn't show anything unusual, and the affected thread can be hvevent1 or hvevent3.
The VM is Generation 2 on Hyper-V with everything at defaults, except the network card settings needed for CARP; Secure Boot is disabled.

Basically, running "top -aHST" shows one of the kernel threads, e.g. [kernel{hvevent1}], at 100% WCPU; we have also seen [kernel{hvevent3}], and maybe 0 and 2 do the same thing too.
The hvevent thread seems to spin continuously until the reboot.

We also tried to disable all the Hyper-V integration services at the kernel-module level in FreeBSD, but we couldn't find how.
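
For what it's worth, the Hyper-V support seems to be compiled into the stock FreeBSD/OPNsense kernel rather than loaded as separate modules, which is probably why we found nothing to unload. The commands below are only a sketch of how to check what is active with standard FreeBSD tooling; the exact sysctl names exposed may vary:

# kldstat                                    # loaded kernel modules; the Hyper-V drivers are normally built into the kernel itself
# sysctl -a | grep -i vmbus                  # any VMBus-related state/tunables that are exposed
# dmesg | grep -iE 'hyperv|vmbus|hn[0-9]'    # what the guest detected at boot (VMBus, hn network devices, IC services)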

What tests/logs should I provide to better understand what is really happening?
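
For instance, would a kernel stack trace of the spinning thread help? If I understand the FreeBSD tooling correctly, something along these lines could be captured the next time it happens (a sketch, assuming the hvevent threads run under the kernel process, PID 0):

# procstat -kk 0 | grep hvevent    # kernel stack traces of the hvevent kernel threads
# vmstat -i                        # interrupt counters, to check whether something is storming
# dmesg                            # anything logged around the time the thread starts spinning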

Thank you in advance

For example, this morning the problem came back; here is what "top -aHSTb" returned:

# top -aHSTb
last pid: 86105;  load averages:  5.04,  4.98,  4.96  up 0+21:13:18    08:32:40
299 threads:   9 running, 281 sleeping, 9 waiting
CPU:  1.3% user,  0.0% nice,  2.0% system,  0.0% interrupt, 96.7% idle
Mem: 80M Active, 393M Inact, 1765M Wired, 56K Buf, 1606M Free
ARC: 1079M Total, 128M MFU, 798M MRU, 4345K Anon, 19M Header, 129M Other
     827M Compressed, 2265M Uncompressed, 2.74:1 Ratio
Swap: 8192M Total, 8192M Free

   THR USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
100003 root        187 ki31     0B    64K CPU0     0  20.8H 100.00% [idle{idle: cpu0}]
100006 root        187 ki31     0B    64K RUN      3  20.8H 100.00% [idle{idle: cpu3}]
100005 root        187 ki31     0B    64K CPU2     2  20.8H 100.00% [idle{idle: cpu2}]
100103 root        -64    -     0B  1744K CPU1     1  87:53 100.00% [kernel{hvevent1}]
100004 root        187 ki31     0B    64K RUN      1  19.4H   0.00% [idle{idle: cpu1}]
100892 www          20    0   196M    41M kqread   3  14:02   0.00% /usr/local/sbin/haproxy -q -f /usr/local/etc/haproxy.conf -p /var/run/haproxy.pid{haproxy}
100795 www          20    0   196M    41M kqread   3  13:54   0.00% /usr/local/sbin/haproxy -q -f /usr/local/etc/haproxy.conf -p /var/run/haproxy.pid{haproxy}
100894 www          20    0   196M    41M kqread   3  13:52   0.00% /usr/local/sbin/haproxy -q -f /usr/local/etc/haproxy.conf -p /var/run/haproxy.pid{haproxy}
100893 www          20    0   196M    41M kqread   2  13:47   0.00% /usr/local/sbin/haproxy -q -f /usr/local/etc/haproxy.conf -p /var/run/haproxy.pid{haproxy}
100328 root         20    0    86M    60M nanslp   2   3:09   0.00% /usr/local/bin/php /usr/local/opnsense/scripts/routes/gateway_watcher.php interface routes alarm
100101 root        -64    -     0B  1744K -        0   2:56   0.00% [kernel{hvevent0}]
100107 root        -64    -     0B  1744K -        3   2:31   0.00% [kernel{hvevent3}]
100105 root        -64    -     0B  1744K -        2   2:29   0.00% [kernel{hvevent2}]
100114 root        -64    -     0B  1744K -        3   0:50   0.00% [kernel{hn1 tx0}]
100111 root        -64    -     0B  1744K -        2   0:39   0.00% [kernel{hn0 tx0}]
100093 root        -60    -     0B    96K WAIT     0   0:24   0.00% [intr{swi1: pfsync}]
100272 root         20    0    13M  2736K bpf      2   0:22   0.00% /usr/local/sbin/filterlog -i pflog0 -p /var/run/filterlog.pid
100037 root        -60    -     0B    64K WAIT     0   0:21   0.00% [clock{clock (0)}]

Hi, we got the problem again (on both routers of the CARP group). We tried disabling checkpoints in Hyper-V and updating to 24.7.10_2.

We are going to roll back to 24.1.10, because the problem seems to be related to the upgrade from 24.1 to 24.7, probably because of the FreeBSD upgrade.

Just a thought here.
Quote from: maitops on December 06, 2024, 03:42:12 PM
[kernel{hvevent1}] at 100% WCPU, we have also [kernel{hvevent3}]
suggests, as I'm sure you can also see, a kernel event that is not completing its processing in adequate time. Since that processing happens in the kernel, I imagine it has to interact with the hypervisor, so it might be a problem between the kernel (FreeBSD) and the hypervisor (Hyper-V).
Therefore you could take it to the FreeBSD group, but they will typically request a test with a plain FreeBSD kernel instead of an OPNsense one.
My suggestion, if you can at all: change hypervisor and re-test. Hyper-V has, to my knowledge, never been a good bedfellow of FreeBSD.

Quote from: cookiemonster on December 16, 2024, 11:01:35 AM
Just a thought here.
Quote from: maitops on December 06, 2024, 03:42:12 PM
[kernel{hvevent1}] at 100% WCPU, we have also [kernel{hvevent3}]
suggests, as I'm sure you can also see, a kernel event that is not completing its processing in adequate time. Since that processing happens in the kernel, I imagine it has to interact with the hypervisor, so it might be a problem between the kernel (FreeBSD) and the hypervisor (Hyper-V).
Therefore you could take it to the FreeBSD group, but they will typically request a test with a plain FreeBSD kernel instead of an OPNsense one.
My suggestion, if you can at all: change hypervisor and re-test. Hyper-V has, to my knowledge, never been a good bedfellow of FreeBSD.


Thanks for replying. Sadly, we don't know how to reproduce the bug, so we can't really simulate it in a non-production environment with another hypervisor...

We went back to 24.1 even though it's deprecated, to confirm whether the problem is related to that. We have another OPNsense active/passive pair with HAProxy for the dev environment; the bug has never appeared there, probably because the traffic is very low.