NIC freeze, firewall stops (FUNCTION:murmur3_32_hash32 CALLERS:pf_find_state_all)

Started by netmd1234, October 04, 2019, 01:45:37 PM

I am asking for help because this situation is driving me crazy.
The system is:
OPNsense 19.7.4_1, Supermicro X9SCI-LN4F, Xeon E3-1220 v2, 8 GB RAM, 256 GB SSD, 1 x Intel X520 10 GbE card, 2 x built-in Intel 82574L. Interfaces: ix0 = WAN, ix1 = LAN, em0 and em1 = other (little or no traffic).
Max system interrupts 60k, max context switches 110k, average interrupt load 17%, max traffic 960M, max packets 90k/60k.
The machine acts as a router for a large number of users (fq_pie, DNS).

At irregular intervals (sometimes after 30 minutes, sometimes after 5-6 hours) it hangs in the following way:
Interfaces that are assigned in OPNsense stop responding (even those with little or no traffic), while interfaces that are present
in the system but not assigned in OPNsense keep working. This happens with the out-of-the-box tunables as well as with various changes.
The same occurred with Intel 82576 cards.

Log from the last minutes before the freeze:
Oct  4 11:34:40 OPNsense kernel: ix0: link state changed to DOWN
Oct  4 11:34:41 OPNsense opnsense: /usr/local/etc/rc.linkup: Hotplug event detected for WAN(wan) but ignoring since interface is configured with static IP (89.186.2.22 ::)
Oct  4 11:34:41 OPNsense kernel: ix1: link state changed to DOWN
Oct  4 11:34:41 OPNsense opnsense: /usr/local/etc/rc.linkup: Hotplug event detected for LAN(lan) but ignoring since interface is configured with static IP (192.168.110.1 ::)
Oct  4 11:34:42 OPNsense kernel: em0: link state changed to DOWN
Oct  4 11:36:21 OPNsense configctl: error in configd communication  Traceback (most recent call last):   File "/usr/local/opnsense/service/configd_ctl.py", line 67, in exec_config_cmd     line = sock.recv(65536).decode() socket.timeout: timed out
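
To see how closely the link-down events cluster, the kernel lines above can be pulled out of the log programmatically. The following is only a minimal sketch: it assumes the syslog format shown in this post, the function name link_events is made up for illustration, and the year has to be supplied because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Matches kernel lines such as:
#   Oct  4 11:34:40 OPNsense kernel: ix0: link state changed to DOWN
LINK_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"(?P<ifname>\w+): link state changed to (?P<state>UP|DOWN)$"
)


def link_events(lines, year=2019):
    """Return (timestamp, interface, state) for every link state change line."""
    events = []
    for line in lines:
        m = LINK_RE.match(line.strip())
        if m:
            # The year is an assumption passed in by the caller.
            ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            events.append((ts, m["ifname"], m["state"]))
    return events


# The three kernel lines quoted in the post above.
SAMPLE = [
    "Oct  4 11:34:40 OPNsense kernel: ix0: link state changed to DOWN",
    "Oct  4 11:34:41 OPNsense kernel: ix1: link state changed to DOWN",
    "Oct  4 11:34:42 OPNsense kernel: em0: link state changed to DOWN",
]

if __name__ == "__main__":
    events = link_events(SAMPLE)
    for ts, ifname, state in events:
        print(ts.isoformat(), ifname, state)
    spread = (events[-1][0] - events[0][0]).total_seconds()
    print("seconds between first and last event:", spread)
```

Run against a full log, this makes it easy to check whether all assigned interfaces drop within the same few seconds each time, as they do in the excerpt.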

All logs were collected over IPMI after the freeze. Running sysctl -a or pfctl -si hung the system completely; only a power reset was possible.
Selecting option 11 (Reload all services) in the OPNsense menu hangs at "Configuring loopback interfaces...".
All interfaces not assigned in OPNsense keep working after the freeze/hang.

https://pastebin.com/u/netmd123

Please help me...

 :-\

That is strange. Are you on the latest BIOS/UEFI version, to be sure it is not a known bug?
Do you have serial console or VGA output that shows a panic or similar messages?

Have you tweaked or modified sysctls?

Twitter: banym
Mastodon: banym@bsd.network
Blog: https://www.banym.de

Quote from: banym on October 04, 2019, 06:20:07 PM
:-\

That is strange. Are you on the latest BIOS/UEFI version, to be sure it is not a known bug?
Do you have serial console or VGA output that shows a panic or similar messages?

Have you tweaked or modified sysctls?
1. I have the latest 19.7.4, installed from the VGA or DVD image (I don't remember which). The mainboard BIOS is the latest.
2. What do you mean by "known bug"?
3. VGA output over IPMI: there is no kernel panic message or anything similar. The only message that sometimes pops up is: Bump sched buckets to 256 (was 0)
4. I tested with the default sysctl settings and with various tweaks from the internet (mostly aimed at a large number of packets), and there was no difference. Maybe I set something in sysctl that breaks the system, but it should be OK with the out-of-the-box settings?

Original settings sysctl:
https://pastebin.com/EBqxg5vN

Tweaked settings sysctl:
https://pastebin.com/CZSK4srj
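
When comparing two dumps like these, it helps to list only the settings that differ. The sketch below is one way to do that; it assumes each dump has one setting per line in either "name: value" (sysctl -a style) or "name=value" form, ignores multi-line values, and uses made-up function names. The sample values are hypothetical and are not taken from the pastes.

```python
def parse_sysctl(text):
    """Parse 'name: value' or 'name=value' lines into a dict.

    Lines without a recognizable separator (for example continuation
    lines of multi-line values) are skipped.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        for sep in (": ", "="):
            if sep in line:
                name, value = line.split(sep, 1)
                settings[name.strip()] = value.strip()
                break
    return settings


def diff_sysctl(original, tweaked):
    """Return {name: (old, new)} for settings that differ; None means absent."""
    changed = {}
    for name in sorted(set(original) | set(tweaked)):
        before = original.get(name)
        after = tweaked.get(name)
        if before != after:
            changed[name] = (before, after)
    return changed


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    original = parse_sysctl("kern.ipc.maxsockbuf: 2097152\nnet.inet.ip.forwarding: 1\n")
    tweaked = parse_sysctl("kern.ipc.maxsockbuf=16777216\nnet.inet.ip.forwarding=1\n")
    for name, (old, new) in diff_sysctl(original, tweaked).items():
        print(f"{name}: {old} -> {new}")
```

With the differences listed, the tweaks can be reverted one group at a time to see which one, if any, affects stability.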


I would reset all tweaks to default and test whether the system is stable again.
If it is, you know that one of the changed options has a bad side effect.

That's how I would try to find the problem.

October 04, 2019, 10:17:55 PM #4 Last Edit: October 05, 2019, 08:01:52 AM by netmd1234
Quote from: banym on October 04, 2019, 09:27:39 PM
I would reset all tweaks to default and test whether the system is stable again.
If it is, you know that one of the changed options has a bad side effect.

That's how I would try to find the problem.

I tested with the original sysctl settings: the router hung after a few minutes.

November 03, 2019, 08:04:35 AM #5 Last Edit: November 03, 2019, 10:12:14 AM by netmd1234
Strange thing: when I change something in the firewall, e.g. the advanced option "Bypass firewall rules for traffic on the same interface", the system hangs within a few seconds to several minutes (one core at 100% interrupt).
The same happens when I change other firewall options, add or change a rule, or change logging options: the system hangs. I will add that traffic ranges from about 300 Mbit/s at minimum to 1500 Mbit/s at prime time.

Hi,

When the router hangs as I described above (sometimes every few minutes, sometimes
every few hours or days, and always when I add or change a rule or change something
in the firewall advanced settings),
running pmcstat -TS inst_retired.any_p -w1 gives this result:

%SAMP  IMAGE   FUNCTION           CALLERS
 35.5  kernel  murmur3_32_hash32  pf_find_state_all
...



What does it mean?
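
As background on the function named in that output: the sample says that about a third of the sampled instructions were retired in murmur3_32_hash32, reached from pf_find_state_all. MurmurHash3 is a general-purpose hash, and a packet filter would typically use such a hash to turn a connection key into a hash-table bucket during state lookup. The sketch below is a generic 32-bit MurmurHash3 in Python, not the FreeBSD kernel code; the key layout, seed and table size are assumptions chosen purely for illustration, and it does not by itself explain the hang.

```python
import socket
import struct


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Generic 32-bit MurmurHash3 over a byte string (illustrative, not kernel code)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    mask = 0xFFFFFFFF

    def rotl(x, r):
        return ((x << r) | (x >> (32 - r))) & mask

    h = seed & mask
    nblocks = len(data) // 4
    for i in range(nblocks):
        # Mix one little-endian 32-bit block into the running hash.
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & mask
        k = rotl(k, 15)
        k = (k * c2) & mask
        h ^= k
        h = rotl(h, 13)
        h = (h * 5 + 0xE6546B64) & mask

    tail = data[4 * nblocks:]
    if tail:
        # Up to three leftover bytes are mixed in without the final rotate/add.
        k = int.from_bytes(tail, "little")
        k = (k * c1) & mask
        k = rotl(k, 15)
        k = (k * c2) & mask
        h ^= k

    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


def example_key(src_ip, src_port, dst_ip, dst_port) -> bytes:
    """Hypothetical connection key: two IPv4 addresses plus two ports."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!HH", src_port, dst_port))


def bucket_for(key: bytes, seed: int, table_size: int) -> int:
    """Map a key to a bucket index in a table of the given (assumed) size."""
    return murmur3_32(key, seed) % table_size


if __name__ == "__main__":
    key = example_key("192.0.2.1", 40000, "198.51.100.7", 443)
    print("bucket:", bucket_for(key, seed=1234, table_size=4096))
```

The hash is computed once per lookup, so on a busy router it is expected to show up in a profile; whether 35.5% is abnormal depends on what the rest of the sample looks like.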

Hello,

Just a hunch, but it might be worth a try. Do you know the temperature of your network card when it freezes?
I have an HP NC364T (with two 82571EBs on it) and have experienced something similar: the NIC shut itself off at random times, locking me out of the system. It drove me crazy as well. Then came the stroke of genius: these cards were designed for servers with plenty of cooling, but my box is passively cooled, so the card shuts itself down when it runs too hot. (In the summer I have to use extra cooling on the box to keep the NIC running.) I feel this might be your problem as well.

This may not be your ultimate issue, but I fought with HOTPLUG events and link state up/down changes on my WAN interface for many weeks over the summer. No rhyme or reason. Sometimes days with no drops, sometimes minutes. I tried everything: cables, ports on my cable gateway, and I even rebuilt my OPNsense box, going from a laptop with one onboard and one USB NIC to a Dell SFF with a dual-port Intel card. The USB NIC just HAD to be the issue... but the issue remained, even with the completely new hardware.

In the end, the issue had nothing to do with OPNsense at all. It was a failing Ethernet port on my cable gateway. I replaced it with a new cable modem, and all problems disappeared.

Error free with the new cable modem.