Very big problem to snapshot Opnsense

Started by maurotb, August 13, 2020, 10:05:06 AM

December 20, 2020, 09:14:55 PM #15 Last Edit: December 20, 2020, 09:20:14 PM by scream
I will run some tests without Sensei (I'll uninstall it), and with E1000e too, so we will see if there is any difference.

Can I just restore the config backup from 20.7 to a fresh 20.1 install, just to test if it still works with the old version?

I've been using pf... for 8 years and migrated to opn... 3 years ago. Always virtualized on VMware ESXi 5.0, 5.1, 6.0, 6.5, 6.7 and now 7.0.1.
Back in the old pf... days, it was clear that the e1000 adapter was a better choice than vmxnet3. I forgot the reasons why, but I just got used to building my VMs with e1000.
I'm currently on OPNsense 20.7.7 under ESXi 7.0.1 and I can take snapshots with RAM without problems.
Anyway, I agree with what @ember1205 said. For me it is not important to snapshot the router's RAM, as it leaves you with an inconsistent state table if you have to revert to the snapshot.
For me e1000 is the best option for an OPNsense VM.

So I switched to E1000e now.
Snapshots are working great with this... but...

IPTV over multicast doesn't work properly with E1000e. It works for around 3-5 minutes and then the stream freezes. I have no idea how to solve this. When I switch the interface back to vmxnet3, multicast works normally.

I read a bit about vmxnet3 and it seems that it was also moved over to iflib in FreeBSD 12, like the Intel drivers, causing a number of headaches and suboptimal performance. We have a couple more patches including things for vmxnet3 in our master branch for 21.1 which could help with this, although I'm not overly enthusiastic given the nature and lack of interest in providing fixes since the iflib conversion ("hit and run" comes to mind to be honest).


Cheers,
Franco

I think this is exactly the main issue.

For me the problem began in August with the update from 20.1.9 to 20.7.
First the whole firewall was crashing because of a netmap driver issue with vmx NICs.
Once this was solved, there were massive performance issues. I was able to reach 995 Mbps on 20.1.9 with Sensei on; with 20.7 I got around 300 Mbps.

So 20.1.9 was based on FreeBSD 11.2. I think there wasn't an iflib driver for vmx in place then?
20.7 is based on FreeBSD 12.1, and every release since then has had these issues.

I really don't know much about programming, but it looks exactly as you say. Fire and forget.

Is there already a way to upgrade to the 21.1 dev version to check whether the issues are resolved in that release?


December 22, 2020, 02:42:28 PM #20 Last Edit: December 22, 2020, 02:52:29 PM by scream
Quote from: franco on December 21, 2020, 09:14:20 PM
We have a couple more patches including things for vmxnet3 in our master branch for 21.1 which could help with this, although I'm not overly enthusiastic given the nature and lack of interest in providing fixes since the iflib conversion ("hit and run" comes to mind to be honest).

I've upgraded to the development branch.

OPNsense 21.1.a_272-amd64
FreeBSD 12.1-RELEASE-p11-HBSD
OpenSSL 1.1.1i 8 Dec 2020


I did some performance tests for the vmxnet3 interfaces.

Server1 <-> vmx0 <-> vmx1 <-> Server2

Sensei is running on vmx0 AND vmx1 interfaces.

I reach 1.12 Gbit/s as a 5-minute average with iperf3:

- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-300.00 sec  39.2 GBytes  1.12 Gbits/sec    0             sender
[  5]   0.00-300.00 sec  39.2 GBytes  1.12 Gbits/sec                  receiver
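
As a rough sanity check on the figures above, the reported bitrate can be recomputed from the transfer size and the duration. This is only a sketch: the helper name gbytes_to_gbits is made up for illustration, and it assumes iperf3 counts GBytes in units of 1024^3 bytes and Gbits in units of 10^9 bits, which is consistent with the numbers shown.

```shell
#!/bin/sh
# gbytes_to_gbits (hypothetical helper): convert a transfer size in GBytes
# over a duration in seconds into an average rate in Gbit/s.
# Assumption: 1 GByte = 1024^3 bytes, 1 Gbit = 10^9 bits.
gbytes_to_gbits() {
    awk -v gb="$1" -v secs="$2" 'BEGIN {
        bits = gb * 1073741824 * 8      # bytes -> bits
        printf "%.2f\n", bits / secs / 1e9
    }'
}
```

With the values from the run above, gbytes_to_gbits 39.2 300 prints 1.12, matching the reported 1.12 Gbits/sec.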


As a side note, there is other traffic on the same interfaces while testing, but the results look good so far.

So it looks better than before. For my environment this is probably the same speed as with OPNsense 20.1.9.

Edit:
But it doesn't help with snapshots with RAM. In this case the interface is unstable and no longer usable.

A simple ifconfig vmx0 down && ifconfig vmx0 up for each interface brings it back to work again.
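
The per-interface workaround could be wrapped in a small script along these lines. It is a minimal sketch, not a tested fix: the function names and the DRY_RUN variable are made up for illustration, and it assumes the affected interfaces are all named vmxN.

```shell
#!/bin/sh
# Sketch: bring every vmx interface down and back up after a snapshot.
# Set DRY_RUN=1 to print the commands instead of running them.

# list_vmx: keep only the vmxN names from the interface names passed in.
list_vmx() {
    for ifname in "$@"; do
        case "$ifname" in
            vmx[0-9]*) echo "$ifname" ;;
        esac
    done
}

# bounce_if: take one interface down and bring it back up.
bounce_if() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "ifconfig $1 down && ifconfig $1 up"
    else
        ifconfig "$1" down && ifconfig "$1" up
    fi
}

# bounce_all_vmx: bounce every vmx interface among the given names.
bounce_all_vmx() {
    for ifname in $(list_vmx "$@"); do
        bounce_if "$ifname"
    done
}
```

On FreeBSD, ifconfig -l lists all interface names on one line, so running bounce_all_vmx $(ifconfig -l) as root would cover every vmx interface. Running it with DRY_RUN=1 first shows which commands would be executed.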

Any idea how to find the cause of this? Which logs could be relevant to track this issue down?

@franco
I opened a bug report on the FreeBSD Bugzilla (ID 252265), as I can reproduce the issue by snapshotting the VM on a fresh FreeBSD install too.

Hi there,

Looks like I'm having the same issue on my ESXi 6.7 U2 + OPNsense 20.7.7  :-\ Snapshots lead to NIC breakdown. I must admit I'm a little bit fed up: I'm running a bunch of services in my environment, but OPNsense accounts for about 90% of my concerns  >:(

Cheers


Quote from: franco on February 08, 2021, 01:20:51 PM
No activity on bug report so far... https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252265

I saw that too :( Any idea how to make them "more aware" of this issue? As this occurs on all of the FreeBSD releases, I think a lot of people are running into this, since the combination of ESXi and FreeBSD is common.

If I used OPNsense for prod, I would have CARPed it. Snapshotting the memory is quite pointless.

Quote from: hunter86_bg on February 08, 2021, 07:54:28 PM
If I used OPNsense for prod, I would have CARPed it. Snapshotting the memory is quite pointless.

It just doesn't matter, as this issue occurs on ANY FreeBSD 12.1-RELEASE and not just OPNsense, as described in the issue linked above.

So it may be pointless for your use cases, but this may not be true for everyone else.

What precisely do you gain by snapshotting memory? I run my own data center, roughly a hundred machines, thousands of containers and VMs, and I never did that. I don't see the point.

February 09, 2021, 05:56:04 AM #28 Last Edit: February 09, 2021, 05:58:22 AM by hunter86_bg
Quote from: scream on February 08, 2021, 08:19:51 PM
Quote from: hunter86_bg on February 08, 2021, 07:54:28 PM
If I used OPNsense for prod, I would have CARPed it. Snapshotting the memory is quite pointless.

It just doesn't matter, as this issue occurs on ANY FreeBSD 12.1-RELEASE and not just OPNsense, as described in the issue linked above.

So it may be pointless for your use cases, but this may not be true for everyone else.

Well, you fail over and then snapshot and patch + reboot -> zero downtime. Once you are happy with the upgrade -> failover again and repeat.

There is zero usefulness in snapshotting the memory.