crashing a few times per week, help me diagnose?

Started by butters, April 21, 2021, 08:18:00 PM

April 21, 2021, 08:18:00 PM Last Edit: April 21, 2021, 08:22:05 PM by butters
Hello!

Almost once a day, my OPNsense 21.1.4 installation crashes and becomes 100% unresponsive. I have to pull the power plug and reboot to get it working again. I'm new to OPNsense and FreeBSD, so I'm a little bit clueless as to how to chase down the problem. Help?

This is running on a Protectli FW4B. From my latest crash report:

My goal is, of course, to stop these crashes and get this thing to just run smoothly. Appreciate any ideas, suggestions, or guidance. Thanks

With that hardware I'd look at these two possibilities.

1. Overheating.  Is there sufficient space around it for cooling?  Nothing on top?  Is the heatsink up?  Try aiming a fan of some sort at the unit, see if that makes a difference.

2. Bad RAM.  Download memtest86, write it to a USB drive, and let it run for a couple of hours.  Any reported RAM fault is cause to swap the RAM out.

Thanks! I'm also suspicious of a hardware problem.

Quote from: SnowGhost on April 23, 2021, 01:45:10 AM
1. Overheating.  Is there sufficient space around it for cooling?  Nothing on top?  Is the heatsink up?  Try aiming a fan of some sort at the unit, see if that makes a difference.
There's plenty of air space around the unit, though maybe this weekend I'll rig up a fan to blow onto it directly. My dashboard shows that the CPUs are idling around 50 deg C (at the most).

Quote from: SnowGhost on April 23, 2021, 01:45:10 AM
2. Bad RAM.  Download memtest86, write it to a USB drive, and let it run for a couple of hours.  Any reported RAM fault is cause to swap the RAM out.
I agree that the RAM could be to blame. Yesterday I picked up a new 8 GB stick (different brand) and installed it. No issues as of yet, but I think running memtest is a wise idea and I'll do that this weekend.

I'll report back with any updates. Thanks again for weighing in! If anyone else has any suggestions - please feel free to share.

Hi,
Same issue for me, since upgrading to 21.1.4.

I've got an AMD 5350 CPU, 16 GB RAM, an SSD, an integrated Realtek Ethernet port, and a PCIe Intel i350.
All interfaces on the i350 go down after about 12 hours (it seems to be cyclic), but the Realtek interface stays up.
I can still use the Realtek interface to access OPNsense.

I need to reboot manually to get it back for another 12 hours.

I had a heat problem a month ago, but I put a 140mm fan in front of the Intel i350 and the problem went away.
For now the temperature never gets above 32°.

Please find some logs that were collected after the i350 went down. For information, the network went down on 04/23 at about 23:44.
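
One way to check whether the roughly 12-hour cycle is real would be to note each time the i350 drops and compare the gaps. A minimal Python sketch, assuming the timestamps are collected by hand or from your own logs. The function names and the sample times at the bottom are made up for illustration and are not taken from the logs mentioned above:

```python
from datetime import datetime, timedelta


def intervals_between(events):
    """Return the gaps between consecutive event times, oldest first."""
    ordered = sorted(events)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def looks_periodic(events, period, tolerance):
    """True if every gap between events is within `tolerance` of `period`.

    Needs at least two events; a single event has no gap to compare.
    """
    gaps = intervals_between(events)
    return bool(gaps) and all(abs(gap - period) <= tolerance for gap in gaps)


if __name__ == "__main__":
    # Hypothetical link-down times, for illustration only.
    link_down = [
        datetime(2021, 4, 22, 11, 40),
        datetime(2021, 4, 22, 23, 42),
        datetime(2021, 4, 23, 11, 45),
        datetime(2021, 4, 23, 23, 44),
    ]
    for gap in intervals_between(link_down):
        print("gap:", gap)
    print(
        "periodic around 12h:",
        looks_periodic(link_down, timedelta(hours=12), timedelta(minutes=15)),
    )
```

If the gaps cluster tightly around one value, something timer-driven (a scheduled job, a lease renewal) may be a more likely suspect than heat, which tends to track load and room temperature instead.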

Quick update: I'm cautiously optimistic that enabling powerd has resolved the issue for me.

Since I turned on powerd (System > Settings > Miscellaneous) about 3 days ago, I haven't had a crash. And that's in my un-airconditioned house during the current heat wave here in Southern California.

Weirdly, the CPU core temps are now up to the mid 60s (Celsius), which is still very safely within their max temp of 90. I originally enabled powerd thinking that maybe it would cool down the system by reducing CPU freq and voltage.

So while the cause(s) of the crashing are still unknown, I'm hopeful that maybe I stumbled upon a solution? Will report back with further updates.
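
For anyone who would rather log core temperatures than eyeball the dashboard, here is a minimal Python sketch. It assumes the box exposes per-core readings through `sysctl dev.cpu` in the form `dev.cpu.0.temperature: 50.0C` (what FreeBSD's coretemp/amdtemp drivers normally print; adjust the pattern if yours differs). The 90 C limit is the max temp quoted above; the 15 C warning margin and all function names are my own assumptions:

```python
import re
import subprocess

# Matches lines such as "dev.cpu.0.temperature: 50.0C".
# The format is an assumption based on FreeBSD's coretemp/amdtemp output.
TEMP_LINE = re.compile(r"^dev\.cpu\.(\d+)\.temperature:\s*(-?\d+(?:\.\d+)?)C$")

MAX_TEMP_C = 90.0  # max core temp quoted in this thread
MARGIN_C = 15.0    # hypothetical warning margin; pick your own


def parse_core_temps(sysctl_output):
    """Return {core_index: temp_celsius} parsed from `sysctl dev.cpu` text."""
    temps = {}
    for line in sysctl_output.splitlines():
        match = TEMP_LINE.match(line.strip())
        if match:
            temps[int(match.group(1))] = float(match.group(2))
    return temps


def cores_near_limit(temps, max_temp=MAX_TEMP_C, margin=MARGIN_C):
    """Return sorted core indices running within `margin` degrees of `max_temp`."""
    return sorted(core for core, temp in temps.items() if temp >= max_temp - margin)


def read_sysctl():
    """Run `sysctl dev.cpu` on the firewall itself and return its text output."""
    result = subprocess.run(
        ["sysctl", "dev.cpu"], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    temps = parse_core_temps(read_sysctl())
    for core, temp in sorted(temps.items()):
        print(f"core {core}: {temp:.1f} C")
    hot = cores_near_limit(temps)
    if hot:
        print("cores close to the limit:", hot)
```

Run from cron every minute or so and appended to a file, the output would show whether a crash was preceded by a temperature climb, which the dashboard alone can't tell you after the fact.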

Thanks for the tips!

For me, powerd was already enabled  :-[

But I found something strange: after manually restarting my OPNsense, the ntpd service doesn't seem to start. I started it manually by clicking on the arrow, and then OPNsense went down again... I will look into it further.

May 03, 2021, 11:02:37 PM #6 Last Edit: May 03, 2021, 11:09:08 PM by mrk45k
You can try grounding the case properly.
I have had the same problems.
I changed some Ethernet cables (to ones without shielded plugs). After that my firewall crashed randomly, with no logs written.
With a monitor attached via HDMI for troubleshooting, it did not crash.
Then I found out that the HDMI connection was giving me a good ground.

In the end I grounded my case properly, and since then my firewall has been stable.

I really appreciate everyone's insight and suggestions. Unfortunately, after about 8 days of uptime, the crashing has resurfaced with a vengeance and it's time to throw in the towel :(

Originally thinking that it was a hardware issue, here's a summary of everything I tried:

  • three different brand-new drives: Transcend, then Dogfish, and now Kingston
  • two different memory sticks: Patriot and now PNY
  • replaced the entire FW4B unit itself, thanks to Protectli's great customer service
  • added a USB-powered fan to cool the unit's heat sinks
  • switched to an electrical outlet which I confirmed had a good ground
  • powerd off and on
  • many, many permutations of OPNsense's settings

Unfortunately, I can't spend the rest of my life debugging this issue, so I've "gone to the dark side" and am testing out pfSense right now. At least this will give me a good indication of whether it's software or hardware. I'm bummed because:

  • I was never able to fully diagnose and resolve this issue, and
  • there are some features of OPNsense that I like a lot better, especially the UI

I'll report back if I find that pfSense exhibits the same problem. Thanks again.

If you have the same problem with pfSense, the power supply might be the problem. It might be sensitive to "noise" on the grid. I would suggest an "online" UPS, since that will output nice, clean power if it's working properly.

After about two days with pfSense, the crashes continue :-[. Sometimes I'll get a kernel panic, and sometimes it just randomly reboots itself without any error logging. So at this point I can confidently say that it's the hardware. Lame!

Quote from: kpwn on May 20, 2021, 06:09:02 PM
If you have the same problem with pfSense, the power supply might be the problem. It might be sensitive to "noise" on the grid. I would suggest an "online" UPS, since that will output nice, clean power if it's working properly.
Thanks for the tip! I will check that out.

In case anyone is curious, my replacement will be a Zotac CI329 nano. It has some bells and whistles that I don't really need, and it only has 2 Ethernet jacks, but otherwise it looks like a promising piece of hardware.

I'll report back on this thread once I get it up and running (it will arrive tomorrow).

Up and running with the new hardware, and no crashes in over 48 hours  ;D  Plus, the CPU temps have barely crested 50 C and the unit doesn't feel hot to the touch (like the FW4B did).

Everyone raves about Protectli, but after my experience I feel like their hardware may not be all it's cracked up to be. For about the same price point, I think the Zotac is a higher quality machine.

Thanks again for everyone's assistance! Hopefully this thread will help out someone else in the future.