Very big problem when snapshotting OPNsense

Started by maurotb, August 13, 2020, 10:05:06 AM

Hi,
on my ESXi 7 host, when I take a snapshot of OPNsense to back it up with Veeam, the network stops working properly:
it becomes very, very slow, and ping from the network to OPNsense is 1/2 sec.
Sometimes only the LAN interface is slow, sometimes only the WAN interface, sometimes both.
To restore operation I need to reboot OPNsense.
I have tried installing open-vm-tools, with no success.
top does not show any performance issue: interrupt is 100% free and CPU load is 0%.

Try to snapshot only the disk, not the memory and VM state.
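
As a rough sketch of what a disk-only snapshot could look like when triggered by hand from the ESXi shell: the helper below only builds the argument list for `vim-cmd vmsvc/snapshot.create`, assuming its positional arguments are VM ID, snapshot name, description, includeMemory and quiesced, in that order (please verify against the usage text on your own host). The function name, the VM ID 42 and the snapshot name are made up for illustration.

```python
from typing import List


def disk_only_snapshot_cmd(vmid: int, name: str, description: str = "") -> List[str]:
    """Build a vim-cmd call for a snapshot without memory and without quiescing.

    Assumed argument order (check on your host):
    vmid, snapshotName, description, includeMemory, quiesced.
    """
    include_memory = "0"  # 0 = do not capture RAM / VM state
    quiesced = "0"        # 0 = do not ask the guest tools to quiesce
    return [
        "vim-cmd", "vmsvc/snapshot.create",
        str(vmid), name, description, include_memory, quiesced,
    ]


if __name__ == "__main__":
    # Hypothetical VM ID and snapshot name; print the command instead of running it.
    print(" ".join(repr(a) for a in disk_only_snapshot_cmd(42, "pre-update")))
```

Backup tools take their own snapshots, so this only applies to snapshots you create manually.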
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

Yes, with a disk-only snapshot it works,
but this is a big problem in a real production farm environment...
Any workaround?

I would recommend backing up the config XML together with a snapshot that contains the disk. In the worst case you will have to install updates and reboot.
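
A minimal sketch of the config half of that, assuming you can run Python somewhere with access to a copy of the configuration file (on OPNsense it lives at /conf/config.xml). The function name and the backup directory are hypothetical.

```python
import shutil
import time
from pathlib import Path


def backup_config(src: Path, dest_dir: Path) -> Path:
    """Copy the config file to dest_dir under a timestamped name and return the new path."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = dest_dir / f"config-{stamp}.xml"
    shutil.copy2(src, target)  # copy2 also preserves the modification time
    return target


if __name__ == "__main__":
    # Hypothetical destination directory; adjust to your own backup location.
    print(backup_config(Path("/conf/config.xml"), Path("/root/config-backups")))
```

Taking this copy right before the disk snapshot keeps the two in step, so a rollback plus config restore ends up at a known state.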


OK, but in a real professional environment, where the lowest possible downtime is necessary, this is not good...
I should be able to take a snapshot of OPNsense, do my work (updates, modifications, etc.) and be able to go back in the shortest possible time in case of problems...
Also, with a disk-only snapshot I only have a crash-consistent image, which is not good...
I have done various tests but I can't solve it. With a normal FreeBSD distribution or with pfSense this doesn't happen; I think it's something related to the vmxnet3 driver used on so...

You will need to boot the VM anyway after rolling back because the entire firewall state, system clock and whatnot will all be completely off.

Just roll back to the disk based snapshot, boot, you will be up and running in "no" time.

Anything else would need support from VMware or FreeBSD or both. I don't know if installing open-vm-tools will be sufficient for that.

No, even with the tools the problem is present. From what I can see it is an obvious kernel or module problem ...

It is true that you can go back from the disk snapshot, but it is also true that it is only crash-consistent, so there could be problems with the filesystem.

I still have the same issue with 20.7.7 and ESX 7.0.1.

Any ideas how this could be solved other than just snapshotting the disk?

Quote from: scream on December 18, 2020, 09:05:51 PM
I still have the same issue with 20.7.7 and ESX 7.0.1.

Any ideas how this could be solved other than just snapshotting the disk?

It's important to understand that OPNsense is not a "server", it's a networking and security appliance. It does not operate the same as a Server guest, so it shouldn't be handled the same way.

The state of the CPU and memory is essentially irrelevant since attempting to recover it provides zero value. If the focus is on things like vMotion, state tables and what-not are all going to get dropped when you move the appliance anyhow. A snapshot of the disk is all that's necessary.

If the goal is to provide "100% uptime" through various updates and downgrades of code, multiple nodes are required with heartbeat and sync, and most aspects of state tables don't get replicated. Other appliances in the same space work exactly the same way.

Sorry... it worked exactly this way for many years. It has stopped working, and now after every snapshot (with RAM) or every vMotion the VM has high latency on one or multiple vNICs and is unusable until a manual reboot over the console. This was not the case before.

As long as the Elasticsearch database for Sensei is on the same system, it isn't enough to get just the disk image.

I was a pfSense user for more than 10 years... I switched to OPNsense around 2 years ago and it was working without any issue until a few weeks/months ago. Since then every snapshot and every vMotion breaks the whole system.

And you are absolutely sure this is not a VMware issue? It would be trivial to pin this to a particular version breaking it on both possible sides, no?


Cheers,
Franco

Quote from: franco on December 18, 2020, 11:25:30 PM
And you are absolutely sure this is not a VMware issue? It would be trivial to pin this to a particular version breaking it on both possible sides, no?

Is there anything in life you can be 100% sure of? ;)

For me, all issues began with the update from 20.1.9 to 20.7. First OPNsense kept crashing because of Sensei and the vmx bug with netmap. After this was resolved I had major performance degradation (985 Mbit/s on 20.1 -> ~300 Mbit/s on 20.7) when Sensei was running. I have now waited a long time, since August, and the performance is still not great, but I can live with it for now. During that time no VMware update was done at all.

In the last few days I upgraded everything to the latest ESX 7.0.1 because I wanted to see if the snapshot issue had been solved by VMware. But this is not the case.

The problem:

Every time I take a snapshot incl. RAM, or sometimes when I do a vMotion (compute only) of the running machine, the vmx interface becomes nearly completely unresponsive. It isn't always the same interface. Sometimes it is WAN (vmx0), sometimes it is LAN (vmx1). During the snapshot process the VM is not available at all (for around 60-90 seconds, I guess).

The last time it occurred on the WAN interface I had time and was able to log in to the WebUI (as LAN wasn't affected).
I saw that WAN -> Gateway (which is just the next router, i.e. the locally connected hop) had very high latency (I saw values around 600-2500 ms) and packet loss between 60-100%.
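
To catch this state automatically after a snapshot or vMotion, something like the following could be pointed at the output of `ping` against the gateway. It is only a sketch: it assumes the usual summary lines ("N% packet loss" and "min/avg/max/... = a/b/c/d ms"), the thresholds are arbitrary, and the sample text and the address 192.0.2.1 are made up, not captured from my system.

```python
import re
from typing import Optional, Tuple

LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")
RTT_RE = re.compile(r"= *([\d.]+)/([\d.]+)/([\d.]+)/")

# Made-up sample in the style of a ping summary (not real measurements).
SAMPLE = """\
--- 192.0.2.1 ping statistics ---
10 packets transmitted, 4 packets received, 60.0% packet loss
round-trip min/avg/max/stddev = 612.345/1500.123/2500.456/700.000 ms
"""


def parse_ping_summary(output: str) -> Tuple[float, Optional[float]]:
    """Return (packet loss in percent, average RTT in ms or None if no replies)."""
    loss = LOSS_RE.search(output)
    if loss is None:
        raise ValueError("no packet loss line found")
    rtt = RTT_RE.search(output)
    avg_ms = float(rtt.group(2)) if rtt else None
    return float(loss.group(1)), avg_ms


def looks_degraded(output: str, max_loss_pct: float = 5.0,
                   max_avg_ms: float = 100.0) -> bool:
    """True if loss or average latency exceeds the (arbitrary) thresholds."""
    loss_pct, avg_ms = parse_ping_summary(output)
    if avg_ms is None:  # no RTT line means no replies came back at all
        return True
    return loss_pct > max_loss_pct or avg_ms > max_avg_ms


if __name__ == "__main__":
    print(parse_ping_summary(SAMPLE), looks_degraded(SAMPLE))
```

Running this once per interface (WAN gateway, a LAN host) would at least show which vmx interface got hit each time.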

This also happens if Sensei and Elasticsearch are disabled, so they shouldn't have any impact.

I haven't found a way to recover from that state except rebooting the whole OPNsense VM. After a reboot everything works as expected until the next snapshot/vMotion.

As I was using VMware's DRS feature in the past, there were a lot of vMotions of the running OPNsense VM and I never had any issues like that.

Is there a way to restore a 20.7 config backup to a 20.1 VM so I can easily test whether it occurs on 20.1 now too?

If vmxnet3 is related to this, have you tried switching to e1000 emulation to see if the issue can be avoided? As far as I understand vmxnet3 with Netmap is likely not the best deal in FreeBSD 12 so far.


Cheers,
Franco

E1000 doesn't exist anymore. Should I try E1000e instead?
Last time I tried, performance was really bad too. But this was some months ago.

It's probably still bad :(

For now I am out of ideas. My only thought here is that this might be related to the Netmap/vmxnet3 combo, and I wonder if it works again when you do not use Sensei/IDS. At least it would help narrow the work area.


Thanks for your help so far,
Franco