Proxmox + N6005/N5105 + OPNsense + Frequent VM crashing/rebooting

Started by LiFE1688, February 09, 2023, 03:50:49 PM

I am currently testing the following mini PC boxes:
N6005 (6x i226) CW-NW11v2
N5105 (4x i226) CW-FMI01v5
J4125 (4x i226) J4125-4L-i226

The J4125 works flawlessly, so this is the last time I will mention it in this post.

The N5105 and N6005 both suffer frequent VM (OPNsense) reboots or crashes.
The BIOS settings on all the mini PCs are at their defaults.

People on the Proxmox side suggest disabling C-states in the BIOS, but mine are already disabled by default.
The default Turbo Boost frequency in the BIOS is 3300MHz for the N6005 and 2900MHz for the N5105.
Both the N6005 and N5105 have a base frequency of 2000MHz.

In the Proxmox shell, using the CLI command
watch "lscpu | grep MHz"

N6005 will be constantly at 3300MHz
N5105 will be constantly at 2900MHz

This does not look like the "correct" state: these are boost frequencies, which the CPU should only hold for short periods of time.
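To see what each core is doing rather than a single summary line, the governor and current frequency can be read per core from sysfs. This is a minimal sketch, assuming the kernel exposes the cpufreq interface under /sys/devices/system/cpu; show_cpufreq is a hypothetical helper name, not part of Proxmox.

```shell
# Print each core's cpufreq governor and current frequency.
# scaling_cur_freq is reported in kHz, so divide by 1000 for MHz.
show_cpufreq() {
    base="${1:-/sys/devices/system/cpu}"
    found=0
    for dir in "$base"/cpu[0-9]*/cpufreq; do
        # Skip cores (or unmatched globs) without readable cpufreq files.
        [ -r "$dir/scaling_governor" ] || continue
        [ -r "$dir/scaling_cur_freq" ] || continue
        found=1
        core=$(basename "$(dirname "$dir")")
        gov=$(cat "$dir/scaling_governor")
        khz=$(cat "$dir/scaling_cur_freq")
        echo "$core: governor=$gov freq=$((khz / 1000)) MHz"
    done
    [ "$found" -eq 1 ] || echo "no cpufreq information under $base"
}

show_cpufreq
```

If every core reports the boost frequency while the host is idle, that matches the behaviour described above.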

Solution 1:
Go to the BIOS and disable Max Turbo Boost.
Both the N6005 and N5105 will then run constantly at their 2000MHz base frequency instead of their boost frequencies.
The CPUs will never use their Turbo Boost frequencies.

Solution 2:
This method is not recommended by the people at Proxmox, who prefer to keep the CPU frequency at its maximum at all times (something about stability).
I am currently testing this method.

Change governor to "ondemand" instead of "performance"

In Proxmox shell use the cli command
echo "ondemand" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

You can watch the CPU frequency, refreshed every 2 seconds, with the command
watch "lscpu | grep MHz"
or take a one-off reading with
grep "cpu MHz" /proc/cpuinfo

To set a cron job that changes the governor to "ondemand" on every reboot, run
crontab -e
If this is your first time editing your crontab, choose your editor, then add
@reboot echo "ondemand" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

I chose ondemand instead of powersave because powersave seems to stick at 800MHz and does not ramp up under load.
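Before writing a governor name into scaling_governor, it may be worth checking that the kernel offers it at all. Which governors are listed depends on the cpufreq driver in use; as far as I know, intel_pstate in active mode only lists performance and powersave. A minimal sketch, where has_governor is a hypothetical helper:

```shell
# has_governor WANTED "LIST": succeed if WANTED is one of the
# space-separated governor names in LIST.
has_governor() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

avail_file=/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
if [ -r "$avail_file" ]; then
    avail=$(cat "$avail_file")
    if has_governor ondemand "$avail"; then
        echo "ondemand is available"
    else
        echo "ondemand is not offered; available: $avail"
    fi
else
    echo "cpufreq governor list not found"
fi
```

The check only reads cpu0, on the assumption that all cores use the same driver.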

Again, I am just starting to test this, and so far I have had no issues. I will continue to monitor this and post if there are any further crashes.

The above method does not work!
After stress testing the connection for 2 hours, the VM (OPNsense) crashed.

Currently testing with the following combination:

1. Opt-in Kernel 6.1
apt update
apt install pve-kernel-6.1

2. Add the following to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub (I am using XFS with UEFI), then run update-grub
intel_idle.max_cstate=1 processor.max_cstate=1
update-grub

3. Update Intel-Microcode to 3.20220510.1
Edit /etc/apt/sources.list and add non-free to
deb http://ftp.debian.org/debian bullseye main contrib non-free
deb http://ftp.debian.org/debian bullseye-updates main contrib non-free
deb http://security.debian.org bullseye-security main contrib non-free

Then install the update with
apt update
apt install intel-microcode

Then remove the non-free from sources.list because you don't need it anymore, and reboot.
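After the reboot, the first two steps can be confirmed from the running system. The sketch below prints the running kernel version and checks /proc/cmdline for the two C-state parameters; cmdline_has is a hypothetical helper, and the parameter names are the ones given in step 2.

```shell
# cmdline_has PARAM "CMDLINE": succeed if PARAM appears as a whole
# word in the space-separated CMDLINE string.
cmdline_has() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

echo "running kernel: $(uname -r)"

# /proc/cmdline holds the parameters the running kernel was booted with.
cmdline=$(cat /proc/cmdline 2>/dev/null)
for param in intel_idle.max_cstate=1 processor.max_cstate=1; do
    if cmdline_has "$param" "$cmdline"; then
        echo "active:  $param"
    else
        echo "missing: $param"
    fi
done
```

A "missing" line after update-grub and a reboot would suggest the host is not actually booting through the GRUB configuration that was edited.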

The VM (OPNsense) has now been running for 19 hours without any crashes, and during that time I have been stressing it with heavy internet activity.

I do not know which change is working; it could be any one of them or the combination of all 3. I will revise this if the VM doesn't crash after a week with a clean install.

I was testing an N5105 this year as well, and observed VM crashes. It turned out the system had RAM stability issues.
If you have not yet tested stability, you might run mprime and a memchecker for a day each.

Thanks @freejack,

Yes, the N5105 and N6005 are separate systems, each with its own RAM. Both sets are brand new and tested with the new memtest86+. As an added precaution, I also do a Prime95 burn-in, making sure no threads fail, before tinkering with them.

Anyway, updating the Intel microcode seems to be working. The weird quirks that I had noticed before and dismissed are completely gone, and the VM (OPNsense) in Proxmox has been running for roughly 48 hours without crashing. If it works for a week, I will try to get someone to pass the message to CWWK so that, hopefully, they will include the update in the BIOS and release it.

Hi guys, really interested in this. Are there any further updates? I run a unit with an N5105 and 4 NICs as well and experience occasional crashes (every 4-5 days). The VM does not recover automatically and I have to hard reset the host every time.

The N5105 and similar CPUs are known for this. You need to update your microcode to at least revision 0x24000024.

See https://forum.opnsense.org/index.php?topic=33239.0
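To check a host against the revision mentioned above, the loaded microcode can be read from /proc/cpuinfo and compared numerically. This is a minimal sketch, assuming the kernel reports a "microcode" line as it normally does on x86; microcode_at_least is a hypothetical helper name.

```shell
# microcode_at_least CURRENT REQUIRED: both arguments are hex strings
# such as 0x24000024; succeed if CURRENT >= REQUIRED.
microcode_at_least() {
    [ "$(( $1 ))" -ge "$(( $2 ))" ]
}

required=0x24000024
# Take the revision from the first "microcode" line in /proc/cpuinfo.
current=$(awk -F': ' '/^microcode/ { print $2; exit }' /proc/cpuinfo 2>/dev/null)

if [ -z "$current" ]; then
    echo "microcode revision not reported"
elif microcode_at_least "$current" "$required"; then
    echo "microcode $current is at or above $required"
else
    echo "microcode $current is older than $required"
fi
```

Run it on the Proxmox host itself rather than inside the VM, since a guest may not report the host's real revision.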