opnsense freezes and needs reboot

Started by mgrue, August 18, 2020, 09:14:34 AM

I have the following setup:
- opnsense 20.1 running for months without any problem in a VMware vSphere (ESXi 6.7) VM
- Rather plain config without IDS/IPS or any special addons (Plugins os-net-snmp, os-vmware, os-dyndns)
- VM has 2 vCPUs / 1 GB RAM / 9 vNICs (VMXNET 3) / VMware Tools installed
- Average load 0.4 / Between 30-40% Memory utilisation after boot
- WAN connection is PPPoE with 175 Mb down / 40 Mb up (IPv4/IPv6)

Now I upgraded to 20.7 and subsequently to 20.7.1. The problem is that the system stops forwarding packets after 24 to 72 hours. When this 'freeze' happens, the symptoms are as follows:
- No packets forwarded at all
- WebUI or SSH login not possible
- The only option is to use the VMware console to get to the command line interface
- 'Restart all services' rarely helps
- Typically a reboot helps
- in some cases the WAN connection is reporting packet loss and long round trip times after reboot,
  the only chance to heal that issue is another reboot (sometimes two times in a row)
- No log entries that would indicate a problem to me

I cannot see the root of the problems. Therefore I have no clue what I can do. Any help is highly appreciated.

P.S.: As a temporary mitigation I will set up a cron-based nightly reboot.

Thanks,
Martin

Hi Martin,

Are any of the resources spiking in the ESXi monitoring tab leading up to the crash?

What about storage? (max IOPS/throughput)

Bart...

Quote from: bartjsmit on August 18, 2020, 03:09:04 PM
Are any of the resources spiking in the ESXi monitoring tab leading up to the crash?
What about storage? (max IOPS/throughput)

As I'm not using vCenter I don't have past metrics available, and the ESXi web interface only has data from the last hour. But I am monitoring overall CPU utilisation of the ESXi host through SNMP, and I can say that there is nothing obvious to see there over the last few days. I don't monitor any further metrics yet. The ESXi datastore is on a local SSD inside the host and should be capable enough. There is a second VM on the host which experiences no problems at all.

Update:
A daily reboot at 5 AM mitigates the problem: the system doesn't freeze anymore (i.e. it keeps routing packets between the different networks/interfaces). But the WAN latency occasionally goes up directly after the reboot (RTT > 800ms with high packet loss).

Rebooting again one or two times fixes the problem and everything is back to the normal 7 to 8ms RTT. Very strange.


I tried a fresh install of 20.7, which worked at first but then froze 2 minutes after booting. I have now reverted to 20.1.9, which works as expected. I will try upgrading to 20.7 again a few minor releases from now.

I'm having the same problem on some QOTOM hardware.  2-3 days after a reboot the whole thing locks up and stops passing traffic.  Guess it must be an issue with the new version.  Does anyone know how/where I can snag an older firmware version?

Quote from: loganx1121 on August 24, 2020, 06:39:28 PM
I'm having the same problem on some QOTOM hardware.  2-3 days after a reboot the whole thing locks up and stops passing traffic.  Guess it must be an issue with the new version.  Does anyone know how/where I can snag an older firmware version?


Running two Qotoms here with zero issues, pretty basic systems with no intrusion detection, but one does run ntopng. If you've not tried it, you might want to swap out the SSD; if that has a problem it can cause the system to freeze.
OPNsense 25.7a - Qotom Q355G4 - ISP - Squirrel 1Gbps.


August 27, 2020, 04:13:43 PM #7 Last Edit: August 31, 2020, 10:54:07 AM by mgrue
When I downgraded from 20.7.1 to 20.1.9_1 my system locked up after 24 hours or so. That was strange because 20.1 was stable and had months of uptime before. I tried to investigate further and found the setting 'VLAN Hardware Filtering', which is turned on by default starting with 20.7 (according to the docs). When I took my latest config back from 20.7 to 20.1 I kept it turned on, and the system froze.

I switched this setting to disabled and my 20.1 instance has been running happily again for about 3 days. I will monitor uptime and, if it stays stable, upgrade to 20.7 again with VLAN Hardware Filtering disabled, since that feature seems to be a bad idea in conjunction with VMXNET3 network interfaces on VMware ESXi.

EDIT: After 7 days of uptime everything is still working smoothly. Will re-upgrade to 20.7 soon.

Update: with 20.7.2 I retried the version - now with 'VLAN hardware filtering' turned off. Unfortunately the system freezes again within 48h of uptime. I'm back on 20.1 again which is stable on my vSphere host.

Have you tried turning all offloads off?
-rxcsum -txcsum -tso4 -tso6 -lro -rxcsum6 -txcsum6 -vlanhwcsum -vlanhwtso
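
As a rough illustration of the suggestion above, the flags can be passed to ifconfig in one call. This is only a sketch: the interface name vmx0 is taken from later in the thread, the helper names are made up for the example, and a change made this way is probably not persistent (OPNsense is likely to reapply its own settings when it reconfigures the interface).

```python
import shlex
import subprocess

# Flags exactly as listed in the post above; in ifconfig a leading '-'
# switches the named capability off.
OFFLOAD_FLAGS = ["-rxcsum", "-txcsum", "-tso4", "-tso6", "-lro",
                 "-rxcsum6", "-txcsum6", "-vlanhwcsum", "-vlanhwtso"]


def build_command(interface):
    """Return the ifconfig argument list for one interface (hypothetical helper)."""
    return ["ifconfig", interface, *OFFLOAD_FLAGS]


def disable_offloads(interface, dry_run=True):
    """Return the command as a string; only execute it when dry_run is False."""
    cmd = build_command(interface)
    if not dry_run:
        # Needs root on the firewall; raises CalledProcessError on failure.
        subprocess.run(cmd, check=True)
    return shlex.join(cmd)


# Dry run: just show what would be executed for vmx0.
print(disable_offloads("vmx0"))
```

The dry run is the default on purpose, so the command can be reviewed before anything is changed on a live firewall.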

September 14, 2020, 12:35:02 PM #10 Last Edit: September 14, 2020, 12:36:41 PM by mgrue
Yes, I have disabled all Hardware Offloading and VLAN Hardware filtering options in Interfaces -> Settings.

Based on the latest posts and the FreeBSD Bugzilla, it seems that FreeBSD 12 has some issues with the vmx driver.
Could you share the ifconfig output for one of the vmx interfaces?

September 15, 2020, 07:38:17 AM #12 Last Edit: September 15, 2020, 07:42:14 AM by mgrue
vmx0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=98<VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM>
        ether 00:0c:29:2c:ec:cd
        hwaddr 00:0c:29:2c:ec:cd
        inet 192.168.179.1 netmask 0xffffff00 broadcast 192.168.179.255
        inet6 fe80::20c:29ff:fe2c:eccd%vmx0 prefixlen 64 scopeid 0x1
        inet6 2003:dd:2f26:6004:20c:29ff:fe2c:eccd prefixlen 64
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active

Edit: This is the ifconfig output from opnsense 20.1.9_1

I have found these links based on your comment:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=236999
https://www.freebsd.org/security/advisories/FreeBSD-EN-20:16.vmx.asc

Fixed in 12.1-RELEASE-p8. But I'm not sure if this really addresses my problem, because the bug only occurs when TSO is enabled (and TSO is disabled in the opnsense GUI). Is this what you meant?

Quote: This is the ifconfig output from opnsense 20.1.9_1
I'm sure everything is fine in 20.1.
But what about 20.7?

Quote: But I'm not sure if this really addresses my problem
I'm not sure either, just trying to guess.
But "disabled in the opnsense GUI" is not the same as actually disabled.
Various drivers may not allow features to be disabled.
E.g. your ifconfig output on 20.1 shows that vlanhwtag is enabled, although the interfaces.lib.inc script tries to disable it if disablevlanhwfilter is set.
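
To check what is actually enabled rather than what the GUI says, one could parse the options= line of the ifconfig output. The sketch below does that for the output posted earlier in the thread. The set of capability names is an assumption based on how FreeBSD's ifconfig usually prints them, and the function names are made up for the example.

```python
import re

# Capability names as FreeBSD ifconfig prints them for the offload features
# discussed in this thread (assumed list, extend as needed).
OFFLOAD_CAPS = {
    "RXCSUM", "TXCSUM", "TSO4", "TSO6", "LRO",
    "RXCSUM_IPV6", "TXCSUM_IPV6",
    "VLAN_HWTAGGING", "VLAN_HWCSUM", "VLAN_HWTSO", "VLAN_HWFILTER",
}


def enabled_caps(ifconfig_text):
    """Return the capability names from the interface's 'options=' line."""
    # Anchor at line start so the 'nd6 options=' line is not matched.
    match = re.search(r"^[ \t]*options=[0-9a-fA-F]+<([^>]*)>",
                      ifconfig_text, re.MULTILINE)
    if match is None:
        return set()
    return {name for name in match.group(1).split(",") if name}


def offloads_still_on(ifconfig_text):
    """Return the offload capabilities that are still enabled, sorted."""
    return sorted(enabled_caps(ifconfig_text) & OFFLOAD_CAPS)


# Relevant lines from the vmx0 output posted above.
sample = """vmx0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=98<VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM>
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
"""
print(offloads_still_on(sample))  # ['VLAN_HWCSUM', 'VLAN_HWTAGGING']
```

Running the same check against the ifconfig output from 20.7 would show whether the driver honours the GUI settings there.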