OPNsense Forum

Archive => 18.1 Legacy Series => Topic started by: tmp on March 13, 2018, 08:50:05 pm

Title: em0 watchdog timeout -- resetting / no traffic is being routed
Post by: tmp on March 13, 2018, 08:50:05 pm
Dear community,

I'm facing a strange problem that has already been reported in this thread https://forum.opnsense.org/index.php?topic=7145.0. In my case, OPNsense does not run virtualized.

My setup:

HP Elitedesk 705 G1
AMD A8-6500b
8 GB RAM (2x4 Dual channel)
Intel EXPI9402PT Pro Dual 1000 (pciex)

My config is quite basic:
em0-> LAN (192.168.0.x) static
em1-> WAN (192.168.1.x) dhcp (connected to plastic crap cable router-> can't be changed)

Services I'm running:
Squid (transparent setup, SSL-Inspection enabled but only for filtering domains, shallalist as blocklist)
Suricata (in IDS-Mode, not IPS, Rules: ET-P2P, ET-Tor, ET-Malware)
100 users

Everything else is in default configuration.

When put in production, the firewall works as it should for a few hours. After a few hours under higher load (about 100 Mbit/s routed through WAN), internet browsing becomes slow and, a few minutes later, completely inaccessible. Routing between LAN and WAN breaks down completely at this point. CPU and RAM load are always acceptable.
In this situation, I can still access the web interface, but can't ping out to WAN (even from the box itself).
On the attached LCD I can see (even without being logged in to the machine) the following output:

em0: watchdog timeout - reset.
(and some statistical data about packets ->if needed I'll take a screenshot)


I already tried:
- disabling hardware offloading in the interface settings (no change)
- completely reinstalling and reconfiguring OPNsense
- disabling Squid


None of these steps has helped so far. I want to get this working, because I prefer OPNsense and am quite happy with it - great work, guys!
Do you have any idea what I can do to get this working? From what I have found in various searches, it looks to me like a driver issue with the NIC.

Kind regards

tmp





Title: Re: em0 watchdog timeout -- resetting / no traffic is being routed
Post by: tmp on March 14, 2018, 07:02:57 pm
Some update:

I've found another thread describing the same problem with an Intel NIC:

https://forum.opnsense.org/index.php?topic=4918.0


I updated the BIOS to the latest version and disabled all power saving features in BIOS but the problem persists.
Any help is appreciated.
Title: Re: em0 watchdog timeout -- resetting / no traffic is being routed
Post by: opnfwb on March 17, 2018, 12:55:28 pm
There are a few things that you can check to rule out a faulty NIC. I have used dual-port Intel EM cards with the 82571EB chipset and have had very good reliability from them with a little bit of tweaking, which I'll outline below. From your description it seems like you're running the same card, so hopefully this helps you get up and running with stability.

First, let's make sure that the NIC has its own IRQ and is not sharing it with another device.
Run this command and let us know the results:
Quote
vmstat -i
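
On a box with many devices the vmstat -i output can be long. The following is a minimal sh/awk sketch for picking out the lines where an em interface shares an IRQ. The function name shared_em_irqs is made up for this example. It assumes that a shared line lists several bare device names after "irqN:" (or ends a name with "+" when the list is cut short), and it ignores entries containing a colon, such as the per-queue vectors some drivers register under MSI/MSI-X.

```shell
#!/bin/sh
# Hypothetical helper: reads `vmstat -i` output on stdin and prints the
# interrupt lines on which an em(4) device appears to share its IRQ.
# Assumption: the last two columns are "total" and "rate"; everything
# between "irqN:" and those columns is the list of device names.
shared_em_irqs() {
    awk '
    /^irq[0-9]+:/ {
        ndev = 0; has_em = 0; plus = 0
        for (i = 2; i <= NF - 2; i++) {
            f = $i
            # A trailing "+" is taken to mean that more handlers share
            # this IRQ than the name field could show.
            if (sub(/\+$/, "", f)) plus = 1
            # Count only bare device names such as em0 or uhci2;
            # names with a colon (e.g. per-queue vectors) are skipped.
            if (f ~ /^[a-z_]+[0-9]+$/) {
                ndev++
                if (f ~ /^em[0-9]+$/) has_em = 1
            }
        }
        if (has_em && (ndev > 1 || plus)) print
    }'
}
```

Usage would be "vmstat -i | shared_em_irqs". No output means no shared em IRQ was detected under the assumptions above; it is still worth reading the full vmstat -i output once by hand.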

You can also try applying some EM driver-specific tuning variables to help improve the performance and stability of EM series NICs. I've documented these settings below if you want to try them; you will need a reboot to fully apply all of them once you've saved them. I have included lines for a dual-port EM config. If you have more ports, you will need to adjust some of the values below to match your system.

In /boot/loader.conf.local (you may need to create this file if it isn't already present):

Code:
hw.em.num_queues=0
hw.em.txd="2048"
hw.em.rxd="2048"
net.link.ifqmaxlen="4096"
hw.em.enable_msix=1
hw.pci.enable_msix=1
dev.em.0.fc=0
dev.em.1.fc=0
hw.em.rx_process_limit="-1"
hw.em.tx_process_limit="-1"

In the WebGUI under System/Settings/Tunables, add one entry for each of the following:

Code:
dev.em.0.eee_control: 0
dev.em.1.eee_control: 0
dev.em.0.fc: 0
dev.em.1.fc: 0

It's worth noting that I've only used these settings with Intel cards on Intel-based systems with Intel chipsets/CPUs. This should not matter; however, I haven't tried any of these tweaks on AMD-based systems with their different chipsets. Depending on BIOS settings and various hardware differences, some of these settings may need some adjustment to fit your environment. Give them a try and let us know the results. You may find that you'll need to set the MSI-X variables to zero (disabled), depending on how the chipset in the router prefers to handle interrupts.
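
For reference, the MSI-X-disabled variant mentioned above would only change the two MSI-X lines in /boot/loader.conf.local. This is a sketch of that one change, not a tested recommendation. All other lines stay as in the earlier listing, and a reboot is needed for it to take effect:

```
# Variant with MSI-X disabled; all other lines stay as listed earlier.
hw.em.enable_msix=0
# Note: this one is global and affects all PCI devices, not only em.
hw.pci.enable_msix=0
```

If the behaviour changes with MSI-X disabled, that would suggest interrupt handling is involved rather than the card itself.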
Title: Re: em0 watchdog timeout -- resetting / no traffic is being routed
Post by: bringha on March 17, 2018, 03:01:49 pm
Hi

at least in my case (https://forum.opnsense.org/index.php?topic=4918.0) it turned out in the end to be a hardware issue with the board, which could only be fixed by sending the board back to Supermicro as an RMA (see https://forum.opnsense.org/index.php?topic=5869.msg25622#msg25622)

See also here for a somewhat more in-depth description: https://forum.opnsense.org/index.php?topic=5063.0

Since then, it has been rock solid. Before that, I also experimented a lot with the settings in /boot/loader.conf.local, with no lasting success

Br br
Title: Re: em0 watchdog timeout -- resetting / no traffic is being routed
Post by: tmp on March 18, 2018, 08:28:06 pm
Thanks a lot for the detailed description of the tunables and the hint for hardware-testing. I will try all your suggestions tomorrow when I'm back at work and will report back if something helped!
Title: Re: em0 watchdog timeout -- resetting / no traffic is being routed
Post by: tmp on April 22, 2018, 01:26:00 pm
An update after a long time spent trying different tweaks, BIOS updates, firmware updates, several configurations from scratch and so on.

First of all, thanks to opnfwb for the detailed tunables, which helped to get more throughput.

The solution is kind of "weird" but works. I've put two "dumb" gigabit switches in place, one in front of em0 and one in front of em1:

LAN (Zyxel managed switch)-> dumb switch -> OPNsense

WAN (FritzBox, branded by "Unitymedia", a German cable ISP)-> dumb switch -> OPNsense


And the problems are solved. If I connect the firewall's NICs directly, the stability problems occur again. I'm really curious how this solved the problem.