OPNsense Forum

English Forums => General Discussion => Topic started by: maurotb on August 13, 2020, 10:05:06 am

Title: Very big problem to snapshot Opnsense
Post by: maurotb on August 13, 2020, 10:05:06 am
Hi,
on my ESXi 7 host, when I take a snapshot of OPNsense to back it up with Veeam, the network stops working properly:
it becomes very, very slow, and ping from the network to OPNsense takes 1/2 sec.
Sometimes only the LAN interface is slow, sometimes only the WAN interface, sometimes both.
To restore operation I need to reboot OPNsense.
I have tried installing open-vm-tools, with no success.
top does not show any performance issue: interrupts are 100% free and CPU load is 0%.
Title: Re: Very big problem to snapshot Opnsense
Post by: Patrick M. Hausen on August 13, 2020, 10:17:57 am
Try to snapshot only the disk, not the memory and VM state.
Title: Re: Very big problem to snapshot Opnsense
Post by: maurotb on August 13, 2020, 11:08:56 am
Yes, with a disk-only snapshot it works,
but this is a big problem in a real production farm environment...
Any workaround?
Title: Re: Very big problem to snapshot Opnsense
Post by: fabian on August 13, 2020, 04:52:16 pm
I would recommend backing up the config XML together with a state that contains the disk. In the worst case you will have to install updates and reboot.
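
A minimal sketch of the config backup suggested above, assuming the configuration lives at `/conf/config.xml` (the usual OPNsense location). The function name `backup_config` and the default backup directory are hypothetical and chosen only for illustration; this is not OPNsense tooling.

```shell
#!/bin/sh
# Sketch only: copy the OPNsense configuration to a timestamped backup file.
# backup_config and the default destination directory are hypothetical names.

backup_config() {
    src="${1:-/conf/config.xml}"            # assumed config location
    dest_dir="${2:-/root/config-backups}"   # assumed backup directory
    stamp="$(date +%Y%m%d-%H%M%S)"

    # Refuse to continue if the source file cannot be read.
    [ -r "$src" ] || { echo "cannot read $src" >&2; return 1; }

    mkdir -p "$dest_dir" || return 1
    cp -p "$src" "$dest_dir/config-$stamp.xml" || return 1

    # Print the path of the new backup so callers can log or copy it elsewhere.
    echo "$dest_dir/config-$stamp.xml"
}
```

Calling `backup_config` with no arguments right before taking the disk-only snapshot would keep a copy of the configuration next to it; copying the printed file off the VM is left to whatever backup tooling is already in place.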
Title: Re: Very big problem to snapshot Opnsense
Post by: maurotb on August 13, 2020, 05:06:52 pm

Ok, but in a real professional environment, where it is necessary to have the lowest downtime, this is not good ...
I should be able to take a snapshot of opnsense, do activities (updates, modifications, etc.) and be able to go back in the shortest possible time in case of problems...
Also, with a disk-only snapshot I have only a crash-consistent image, which is not good...
I have done various tests but I can't solve it. With a normal FreeBSD distribution or with pfSense this doesn't happen; I think it's something related to the vmxnet3 driver used on so...
Title: Re: Very big problem to snapshot Opnsense
Post by: Patrick M. Hausen on August 13, 2020, 07:57:02 pm
You will need to boot the VM anyway after rolling back because the entire firewall state, system clock and whatnot will all be completely off.

Just roll back to the disk based snapshot, boot, you will be up and running in "no" time.

Anything else would need support from VMware or FreeBSD or both. I don't know if installing open-vm-tools will be sufficient for that.
Title: Re: Very big problem to snapshot Opnsense
Post by: maurotb on August 13, 2020, 09:58:51 pm
No, even with the tools the problem is present. From what I can see it is an obvious kernel or module problem ...

It is true that you can go back from the disk snapshot, but it is also true that it is crash consistent so there could be problems on the filesystem
Title: Re: Very big problem to snapshot Opnsense
Post by: scream on December 18, 2020, 09:05:51 pm
I still have the same issue with 20.7.7 and ESX 7.0.1.

Any ideas how this could be solved other than by snapshotting only the disk?
Title: Re: Very big problem to snapshot Opnsense
Post by: ember1205 on December 18, 2020, 10:39:10 pm
It's important to understand that OPNsense is not a "server", it's a networking and security appliance. It does not operate the same as a Server guest, so it shouldn't be handled the same way.

The state of the CPU and memory is essentially irrelevant since attempting to recover it provides zero value. If the focus is on things like vMotion, state tables and what-not are all going to get dropped when you move the appliance anyhow. A snapshot of the disk is all that's necessary.

If the goal is to provide "100% uptime" through various updates and downgrades of code, multiple nodes are required with heartbeat and sync, and most aspects of state tables don't get replicated. Other appliances in the same space work exactly the same way.
Title: Re: Very big problem to snapshot Opnsense
Post by: scream on December 18, 2020, 10:47:27 pm
Sorry... it was working exactly this way for many years. It has stopped working, and now after every snapshot (with RAM) or every vMotion the VM has high latency on one or multiple vNICs and is unusable until a manual reboot over the console. This was not the case before.

As long as the Elasticsearch database for Sensei is on the same system, it isn't enough to get just the disk image.

I was a pfSense user for more than 10 years... switched to OPNsense around 2 years ago and it was working without any issue until a few weeks/months ago. Since then every snapshot and every vMotion breaks the whole system.
Title: Re: Very big problem to snapshot Opnsense
Post by: franco on December 18, 2020, 11:25:30 pm
And you are absolutely sure this is not a VMware issue? It would be trivial to pin this to a particular version breaking it on both possible sides, no?


Cheers,
Franco
Title: Re: Very big problem to snapshot Opnsense
Post by: scream on December 20, 2020, 12:50:11 pm
Is there anything in life you can be 100% sure? ;)

For me, all issues began with the update from 20.1.9 to 20.7. First OPNsense kept crashing because of Sensei and the vmx bug with netmap. After this was resolved I had major performance degradation (985 Mbit/s on 20.1 -> ~300 Mbit/s on 20.7) when Sensei was running. I have now waited a long time, since August, and the performance is not really good, but I can live with it for now. During that time no VMware update was done at all.

In the last few days I upgraded everything to the latest ESX 7.0.1, as I wanted to see whether the snapshot issue had been solved by VMware. But this is not the case.

The problem:

Every time I take a snapshot incl. RAM, or sometimes when I do a vMotion (compute only) of the running machine, the vmx interface becomes nearly completely unresponsive. It isn't always the same interface. Sometimes it is WAN (vmx0), sometimes it is LAN (vmx1). During the snapshot process the VM is not available at all (for around 60-90 seconds, I guess).

The last time it occurred on the WAN interface I had time and was able to log in to the WebUI (as LAN wasn't affected).
I saw that WAN -> Gateway (which is just the next router -> locally connected hop) had very high latency (I saw values around 600-2500 ms) and packet loss between 60-100%.
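
The symptom described above (high latency and packet loss towards the directly connected gateway) could be checked from the console with a small helper. This is only a sketch: the function name `ping_is_degraded` and its default thresholds are hypothetical, and it assumes the usual `ping` summary lines ("... packet loss" and "... min/avg/max ...").

```shell
#!/bin/sh
# Sketch only: flag a degraded link from ping's summary output.
# ping_is_degraded and its default thresholds are hypothetical.

# Reads ping output on stdin. Returns 0 (degraded) if packet loss or the
# average round-trip time exceeds the given limits, 1 otherwise.
ping_is_degraded() {
    max_loss="${1:-10}"      # allowed packet loss in percent
    max_avg_ms="${2:-100}"   # allowed average round-trip time in ms
    awk -v max_loss="$max_loss" -v max_avg="$max_avg_ms" '
        /packet loss/ {
            # The loss figure is the field that ends in "%".
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) { loss = $i; sub(/%/, "", loss) }
        }
        /min\/avg\/max/ {
            # Line looks like "... = min/avg/max/stddev ms"; take the 2nd value.
            split($0, halves, "= ")
            split(halves[2], rtt, "/")
            avg = rtt[2]
        }
        END {
            exit !((loss + 0 > max_loss) || (avg + 0 > max_avg))
        }
    '
}
```

A possible use would be `ping -c 10 GATEWAY | ping_is_degraded 10 100 && echo "link looks degraded"`, with `GATEWAY` replaced by the next-hop address. Note that input without any summary lines is treated as "not degraded", so the helper is no substitute for looking at the raw output.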

This happens also if sensei and elasticsearch is disabled, so this shouldn't have any impact.

I didn't find a way to recover from that state except to reboot the whole OPNsense VM. After a reboot everything works as expected until the next snapshot/vMotion.

As I was using VMware's DRS feature in the past, there were a lot of vMotions of the running OPNsense VM and I never had any issues like that.

Is there a way to restore a 20.7 config backup to a 20.1 VM so I can easily test whether it occurs on 20.1 now too?
Title: Re: Very big problem to snapshot Opnsense
Post by: franco on December 20, 2020, 08:53:50 pm
If vmxnet3 is related to this, have you tried switching to e1000 emulation to see if the issue can be avoided? As far as I understand vmxnet3 with Netmap is likely not the best deal in FreeBSD 12 so far.


Cheers,
Franco
Title: Re: Very big problem to snapshot Opnsense
Post by: scream on December 20, 2020, 09:10:13 pm
E1000 doesn't exist anymore. Should I try E1000e instead?
Last time I tried, performance was really bad too. But this was some months ago.
Title: Re: Very big problem to snapshot Opnsense
Post by: franco on December 20, 2020, 09:12:38 pm
It's probably still bad :(

For now I am out of ideas. My only thought here is that this might be related to Netmap/vmxnet3 combo and I wonder if that works again if you do not use Sensei/IDS. At least it would help narrow the work area.


Thanks for your help so far,
Franco
Title: Re: Very big problem to snapshot Opnsense
Post by: scream on December 20, 2020, 09:14:55 pm
I will do some tests without Sensei (will uninstall it). And with E1000e too. So we will see if there is any difference.

Can I just restore the config backup from 20.7 to a fresh 20.1 install, just to test if it still works with the old version?
Title: Re: Very big problem to snapshot Opnsense
Post by: muchacha_grande on December 21, 2020, 04:05:29 pm
I started using pf... 8 years ago and migrated to opn... 3 years ago. Always virtualized on VMware ESXi 5.0, 5.1, 6.0, 6.5, 6.7 and 7.0.1 by now.
From the old pf... times, it was clear that it's better to use the e1000 adapter than vmxnet3. I forgot the reasons why, but I just got used to building my VMs with e1000.
I'm currently on OPN 20.7.7 under ESXi 7.0.1 and I can make snapshots with RAM without problems.
Anyway, I agree with what @ember1205 said. For me it is not important to snapshot the router's RAM, as it leaves you with an inconsistent state table in case you have to revert to the snapshot.
For me e1000 is the best option for an OPNsense VM.
Title: Re: Very big problem to snapshot Opnsense
Post by: scream on December 21, 2020, 07:33:43 pm
So I switched to E1000e now.
Snapshots are working great with this... but...

IPTV over multicast doesn't work properly with E1000e. It works for around 3-5 minutes and then the stream freezes. I have no idea how to solve this. When I switch the interface back to vmx3, multicast works normally.
Title: Re: Very big problem to snapshot Opnsense
Post by: franco on December 21, 2020, 09:14:20 pm
I read a bit about vmxnet3 and it seems that it also was moved over to iflib in FreeBSD 12 like Intel drivers causing a number of headaches and suboptimal performance. We have a couple more patches including things for vmxnet3 in our master branch for 21.1 which could help with this, although I'm not overly enthusiastic given the nature and lack of interest in providing fixes since the iflib conversion ("hit and run" comes to mind to be honest).


Cheers,
Franco
Title: Re: Very big problem to snapshot Opnsense
Post by: scream on December 21, 2020, 09:53:10 pm
I think this is exactly the main issue.

The problem for me began in August with the update from 20.1.9 to 20.7.
First the whole FW was crashing because of the netmap driver issue with vmx NICs.
Once this was solved, there were massive performance issues. While I was able to reach 995 Mbps on 20.1.9 with Sensei on, with 20.7 I got around ~300 Mbps.

So 20.1.9 was based on FreeBSD 11.2. I think there wasn't an iflib driver for vmx in place?
20.7 is based on FreeBSD 12.1, and every release since then has had these issues.

I really don't know much about programming. But it looks exactly as you say. Fire and forget.

Is there already a way to upgrade to the 21.1 dev version to check whether the issues were resolved in this release?

Title: Re: Very big problem to snapshot Opnsense
Post by: scream on December 22, 2020, 02:42:28 pm
I've upgraded to the development branch.

Code:
OPNsense 21.1.a_272-amd64
FreeBSD 12.1-RELEASE-p11-HBSD
OpenSSL 1.1.1i 8 Dec 2020

I did some performance tests for the vmx3 interfaces.

Server1 <-> vmx0 <-> vmx1 <-> Server2

Sensei is running on vmx0 AND vmx1 interfaces.

I reach 1.12 Gbit/s as a 5-minute average with iperf3:
Code:
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-300.00 sec  39.2 GBytes  1.12 Gbits/sec    0             sender
[  5]   0.00-300.00 sec  39.2 GBytes  1.12 Gbits/sec                  receiver

As a side note, there is other traffic on the same interfaces while testing, but the results look good so far.
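
As a quick plausibility check of the iperf3 summary above, the bitrate can be recomputed from the transferred volume and the duration. This is a sketch under the assumption that iperf3 reports "GBytes" in binary units (2^30 bytes) and "Gbits/sec" in decimal units (10^9 bits); the function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: recompute an iperf3 bitrate from transfer size and duration.
# Assumes GBytes = 2^30 bytes and Gbits/sec = 10^9 bits per second.
gbytes_to_gbits_per_sec() {
    awk -v gbytes="$1" -v seconds="$2" 'BEGIN {
        bits = gbytes * 1073741824 * 8          # bytes -> bits
        printf "%.2f\n", bits / seconds / 1000000000
    }'
}

# Values taken from the summary line: 39.2 GBytes in 300 seconds.
gbytes_to_gbits_per_sec 39.2 300   # prints 1.12
```

The result matches the reported 1.12 Gbits/sec, which is consistent with the unit assumption. The exact iperf3 invocation is not given in the post; something like `iperf3 -c SERVER -t 300` would be one way to get a 300-second interval.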

So it looks better than before. For my environment this is probably the same speed as with the 20.1.9 version of OPNsense.

Edit:
But it doesn't help with snapshots that include RAM. In this case the interface is unstable and not usable anymore.

A simple
Code:
ifconfig vmx0 down && ifconfig vmx0 up
for each interface brings it back to work again.

Any idea how to find the cause of this? Which logs could be relevant to track that issue down?
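
The down/up workaround from this post, applied to several interfaces in turn, might be sketched as follows. The function name `bounce_interfaces` and the `IFCONFIG` override are hypothetical and only there to allow a dry run; the interface names are the ones mentioned in the thread.

```shell
#!/bin/sh
# Sketch only: apply the down/up workaround to several interfaces in turn.
# bounce_interfaces and the IFCONFIG override are hypothetical names.

IFCONFIG="${IFCONFIG:-ifconfig}"   # set IFCONFIG=echo for a dry run

bounce_interfaces() {
    rc=0
    for ifname in "$@"; do
        # Same command pair as in the post, once per interface.
        "$IFCONFIG" "$ifname" down && "$IFCONFIG" "$ifname" up || rc=1
    done
    # Non-zero if any interface could not be cycled.
    return "$rc"
}
```

After a snapshot, `bounce_interfaces vmx0 vmx1` would then be run as root, preferably from the VM console, since cycling the interface that carries an SSH session will likely drop that session.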
Title: Re: Very big problem to snapshot Opnsense
Post by: scream on December 29, 2020, 01:41:36 pm
@franco
I opened a bug report on the FreeBSD Bugzilla (ID 252265), as I can reproduce the snapshot issue on a fresh FreeBSD install too.
Title: Re: Very big problem to snapshot Opnsense
Post by: Rajstopy on February 08, 2021, 01:07:57 pm
Hi there,

Looks like I'm having the same issue on my ESXi 6.7 U2 + OPNsense 20.7.7  :-\ Snapshots lead to NIC breakdown. I must admit I'm a little bit fed up: I'm running a bunch of services in my environment, but OPNsense accounts for about 90% of my concerns  >:(

Cheers
Title: Re: Very big problem to snapshot Opnsense
Post by: franco on February 08, 2021, 01:20:51 pm
No activity on bug report so far... https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252265
Title: Re: Very big problem to snapshot Opnsense
Post by: scream on February 08, 2021, 02:34:32 pm
I saw that too :( Any idea how to make them "more aware" of this issue? As this occurs on all of the FreeBSD releases, I think a lot of people are running into this, since ESXi and FreeBSD are both common.
Title: Re: Very big problem to snapshot Opnsense
Post by: hunter86_bg on February 08, 2021, 07:54:28 pm
If I used OPNsense for prod, I would have CARPed it. Snapshotting the memory is quite pointless.
Title: Re: Very big problem to snapshot Opnsense
Post by: scream on February 08, 2021, 08:19:51 pm
It just doesn't matter, as this issue occurs on ANY FreeBSD 12.1-RELEASE and not just OPNsense, as described in the issue linked above.

So it may be pointless for your use cases, but this may not be true for everyone else.
Title: Re: Very big problem to snapshot Opnsense
Post by: Patrick M. Hausen on February 08, 2021, 09:50:58 pm
What precisely do you gain by snapshotting memory? I run my own data center, roughly a hundred machines, thousands of containers and VMs, and I never did that. I don't see the point.
Title: Re: Very big problem to snapshot Opnsense
Post by: hunter86_bg on February 09, 2021, 05:56:04 am
Well, you fail over and then snapshot and patch + reboot -> zero downtime. Once you are happy with the upgrade -> failover again and repeat.

There is '0' usefulness in snapshotting the memory.