OPNsense Forum

Archive => 18.1 Legacy Series => Topic started by: usr1324 on May 03, 2018, 11:01:11 am

Title: weird issue with opnsense unresponsive over vmware
Post by: usr1324 on May 03, 2018, 11:01:11 am
Hello OPNsense community,

I have had a weird issue twice, on two different OPNsense versions (17.7.8 and 18.1.6). I searched for the issue in relation to FreeBSD and couldn’t find anything similar, so please bear with me.

I have an OPNsense VM running on VMware ESX 5.5 (on an old Xeon machine, old enough to lack AES-NI). This machine is configured as a firewall/NAT/OpenVPN gateway with two interfaces and runs some additional services. The ESX host only runs two VMs and is not overloaded by any means (and when the issue happens I don’t see anything weird in the other VM's resource usage). The OPNsense VM has plenty of resources (4GB of RAM, one vCPU, 40GB of disk space with less than 10% used).

Now here is the issue, which has happened twice: one day, with no warning, connectivity becomes very slow. I know this could be blamed on an external problem, but nothing indicates that this is the case. It gets slow for people using any resource (OpenVPN, SSH, NAT). For example, I tried to copy logs to a machine on the LAN and also to a machine on the WAN side (internet), and the scp just slowed to a halt. At the same time the graphs indicate high latency, I get disconnections, and in the system graphs (CPU, memory) there are long sections without records, as if the machine was completely unresponsive for 5 to 10 minutes. This happened several times until I decided to reboot. So the machine is either completely unresponsive or, when it is responsive, all traffic is very slow.

In the ESX logs and in esxtop, nothing indicates resource overload or any other issue that could explain the unresponsiveness. The dmesg output only shows resets of the WAN interface because apinger could not ping its gateway, but nothing indicating why the slowdown / unresponsiveness occurs.

As I couldn’t copy the logs off the machine, I simply did a "cp -rf /var/log /var/log.$date" to preserve them. After that I rebooted the VM and everything started working again, as if there had been no issue in the first place.
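
For anyone wanting to do the same before a reboot, the snapshot step can be sketched as a small shell function. This is only a sketch: snapshot_logs is a made-up helper name (not an OPNsense tool), and the timestamp format and the use of "cp -Rp" instead of "cp -rf" are assumptions chosen to keep file times intact.

```sh
#!/bin/sh
# Sketch only: snapshot_logs is a hypothetical helper, not part of OPNsense.
# It copies a log directory to a timestamped sibling directory and prints
# the name of the copy, so the logs survive a reboot for later inspection.
snapshot_logs() {
    src="${1:-/var/log}"            # directory to preserve (default is an assumption)
    src="${src%/}"                  # drop one trailing slash, if present
    stamp="$(date +%Y%m%d-%H%M%S)"  # e.g. 20180503-110111
    dest="${src}.${stamp}"
    # -R copies recursively, -p preserves modification times and modes
    cp -Rp "$src" "$dest" && printf '%s\n' "$dest"
}
```

Running "snapshot_logs /var/log" as root would then leave a copy such as /var/log.<timestamp> next to the original, which can be copied off the box once the network is usable again.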

Yes, it's an old ESX and the underlying hardware is reasonably old, but I just don’t have any explanation for why this happens months apart, on two different kernel major versions, or why the issue stopped immediately after the reboot.

Has anybody seen this? Any thoughts? I might migrate this gateway to bare metal for a few weeks just to take the ESX out of the equation, but I don’t even know if I can justify that. I googled a bit for "vmware unresponsive" + "opnsense/pfsense/freebsd" with no joy.