os-vmware problems with ESXi 8

Started by Neuer_User, April 04, 2023, 10:21:39 AM

April 18, 2023, 04:05:16 PM #15 Last Edit: April 18, 2023, 04:11:33 PM by benyamin
Just having a look over your logs, I'm still seeing references to em0, an Intel adapter.

References to "unsupported partitions" are likely red herrings.

I would recommend a fresh install per my last post.

EDIT: You may also need to exclude your tinc0 interface. I would set up any required VLANs as new VMXNET3 NICs and do the tagging at the vSwitch. My thinking is that RPC heartbeats are going out of one of these interfaces, but I could well be clutching at straws...

Thanks for the approach. I will give it a try and see what the result is. As a very first test, I will now let opnsense run without the vmware-tools. While it looks as if vmware-tools are the culprit, it is possible that the "heartbeat" messages are not the cause but only a side effect of the hard reset. So far, I haven't let the machine run without the tools for more than six days, so the first test is really to let it run for a longer time. If no reset happens, I can try to set up a parallel test VM like you suggested.

In case the VM also resets without the tools, I guess I only have limited options. Yes, I could set up a new opnsense installation, just to see if the installation could be somehow damaged (which I currently do not believe), and if that doesn't help, I will probably go the proxmox route.

Quote from: benyamin on April 18, 2023, 04:05:16 PM
Just having a look over your logs, I'm still seeing references to em0, an Intel adapter.
Are you sure? I think you see the "em0_vlan" interfaces. They kept their old names, but they are just logical vlan interfaces reassigned to the "vm-physical" VMXNET3 nics.
Quote
References to "unsupported partitions" are likely red herrings.
These "unsupported partitions" are the tmpfs RAM disks used by opnsense for /tmp, /var/log and others. I also saw these references. It is possible to exclude them from vmware-tools access with some config options. But then that problem should affect every opnsense install on vmware. (opnsense does not install a vmware-tool.conf file.)
Quote
I would recommend a fresh install per my last post.
Yeah, I might give that a try.
Quote
EDIT: You may also need to exclude your tinc0 interface. I would set up any required VLANs as new VMXNET3 NICs and do the tagging at the vSwitch. My thinking is that RPC heartbeats are going out of one of these interfaces, but I could well be clutching at straws...
Well, I guess with a new install I would probably not configure it at all, just to see if the resets also happen. But in the end, of course, I need the tinc interface (as that is my VPN). As for the heartbeats going out on a wrong interface: anything is possible, but it does indeed sound a bit like clutching at straws, as there are clear network routes in the system, none of which overlap in any way. Additionally, it would be strange for only a few heartbeats to go the wrong way after some undefined timespan.

Now, let's see how the system behaves without the os-vmware package.

It also occurred to me that @Supermule could certainly be right in that this is CPU related. Whilst the Jasper Lake CPUs meet the minimum requirements for ESXi 8, I did remember seeing something about intermittent crashing and freezing here a couple of weeks back and also on the Proxmox forums too for many months and was reminded of this when it popped up again today.

This could certainly be related to what you are observing if ESXi is also affected in some similar way. A freeze could potentially cause a heartbeat timeout, which might then be detected by the tools. The tools would then request a CPU reset, which I guess is the appropriate thing to do when a freeze is detected.
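To make that reasoning concrete, here is a minimal toy model of a heartbeat watchdog in Python. It is only a sketch of the general idea and not how vmware-tools or ESXi actually implement it: the class name, the `timeout_s` threshold and the one-second check interval are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatWatchdog:
    """Toy heartbeat monitor; NOT VMware's actual logic."""
    timeout_s: float          # hypothetical silence threshold before a reset is requested
    last_beat_s: float = 0.0  # timestamp of the most recent heartbeat seen

    def beat(self, now_s: float) -> None:
        # Record that the guest reported in at this time.
        self.last_beat_s = now_s

    def should_reset(self, now_s: float) -> bool:
        # A reset is requested once the guest has been silent for too long.
        return (now_s - self.last_beat_s) > self.timeout_s


def first_reset_time(beat_times, timeout_s, check_interval_s=1.0, end_s=60.0):
    """Return the first check time at which a reset would be requested, or None."""
    watchdog = HeartbeatWatchdog(timeout_s=timeout_s)
    pending = sorted(beat_times)
    idx = 0
    now = 0.0
    while now <= end_s:
        # Deliver every heartbeat that arrived up to the current check time.
        while idx < len(pending) and pending[idx] <= now:
            watchdog.beat(pending[idx])
            idx += 1
        if watchdog.should_reset(now):
            return now
        now += check_interval_s
    return None


# Guest beats every second, but freezes between t=10 s and t=30 s.
frozen = [float(t) for t in range(0, 11)] + [float(t) for t in range(30, 61)]
print(first_reset_time(frozen, timeout_s=5.0))   # short threshold: reset requested
print(first_reset_time(frozen, timeout_s=30.0))  # long threshold: freeze is tolerated
```

In this model the same 20-second freeze triggers a reset with a short threshold but goes unnoticed with a long one, so a reset in the logs would be a symptom of the freeze rather than its cause.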

The point in the Proxmox thread where the issue seems to be solved is here. The microcode update is revision 0x24000024, which is available in this release. The relevant advisory appears to be INTEL-SA-00767, although I'm unsure as to why or how that fixes the problem on Proxmox, but perhaps ESXi is also affected in some way too.
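If you do end up on a Linux-based host such as Proxmox, a quick way to see whether that revision is loaded is to read the "microcode" field from /proc/cpuinfo. The sketch below assumes that field is present (it is on x86 Linux) and that a numerically higher revision is newer for the same CPU; the function names are made up for illustration. It does not apply to ESXi itself, which has no /proc/cpuinfo.

```python
import re

FIXED_REVISION = 0x24000024  # microcode revision cited in the thread


def microcode_revisions(cpuinfo_text: str) -> set:
    """Collect the distinct microcode revisions in /proc/cpuinfo-style text."""
    matches = re.findall(
        r"^microcode\s*:\s*(0x[0-9a-fA-F]+)\s*$",
        cpuinfo_text,
        flags=re.MULTILINE,
    )
    return {int(m, 16) for m in matches}


def has_fixed_microcode(cpuinfo_text: str, minimum: int = FIXED_REVISION) -> bool:
    """True if every listed core reports at least the given revision.

    Assumption: revisions for the same CPU model can be compared numerically.
    """
    revisions = microcode_revisions(cpuinfo_text)
    return bool(revisions) and min(revisions) >= minimum


if __name__ == "__main__":
    # Only meaningful on an x86 Linux host.
    with open("/proc/cpuinfo", encoding="utf-8") as fh:
        print(has_fixed_microcode(fh.read()))
```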

I think it unlikely HUNSN will release a BIOS update containing updated CPU microcode (you could certainly ask them), so you might need to consider another mechanism to load it. Given a solution for Proxmox seems to be worked out, perhaps running it up on Proxmox might be the best path forward.

Having said that, I also see that ESXi 8.0 Update 1 (8.0.1) was just released. I didn't see any mention of Intel microcode updates. I'd recommend installing it regardless, as your solution might be in there somewhere anyway...

Wow, indeed! That is exactly my configuration and a VERY similar experience. I wouldn't have thought that a microcode issue could result in VM-only freezes.

I will definitely update the microcode and also esxi to version 8.0.1. I might also give the BIOS update a try. But the first thing is to complete the current test, i.e. running without vmware-tools. At the moment opnsense has been running for 2 days and 5 hours without issues. That's definitely much better than my last tests with tools installed, but it is not conclusive, as I also once had a test with tools installed that ran for 6 days.

But I will wait at least two weeks to see whether the machine freezes or shows any other irregularities. And then I'm gonna do the updates.

And thanks a lot for your help. That is so much appreciated!!!

Short Update: After about three days I had the same reset. Same entries in vmware.log. I was even working on the PC when it happened.

So, that clarifies a couple of things: 1.) It is NOT the vmware-tools, although with vmware-tools it seems to happen more frequently. 2.) It could indeed be the CPU microcode.

So, as a next test I added the new firmware to esxi 8.0 (not yet updated to 8.0.1). I also reactivated vmware-tools. Let's see how long it runs...

I suspect that the tools were prompting the reset following timeouts caused by freezes; whereas when the tools are not installed/running a more significant freeze needs to occur, and that likely happens less frequently.

I saw your post in the other topic re the microcode update. I hope it helps...!