OPNsense Forum

Archive => 15.7 Legacy Series => Topic started by: russoj88 on August 30, 2015, 04:14:05 am

Title: [SOLVED] Watchdog timeout -- resetting
Post by: russoj88 on August 30, 2015, 04:14:05 am
Hi, I am getting the message "Watchdog timeout -- resetting" pretty often (about once a minute).  I just installed 15.7.11 onto the hardware listed below.  The timeout message is happening on em0 (LAN).  The network completely drops out for a few seconds each time.  It happens often enough that I can't do a speed test.  I have a 75/75 connection here.  I was able to reach that using pfSense.

I have em1 as the WAN, em0, igb0, igb1 as LAN on a bridge.  This is 15.7.11 with LibreSSL.

SUPERMICRO MBD-X9SBAA-F-O http://www.supermicro.com/products/motherboard/ATOM/X9/X9SBAA-F.cfm
8GB ECC RAM
120GB SSD
Intel PWLA8492MT PRO/1000 MT http://www.intel.com/content/www/us/en/ethernet-products/gigabit-server-adapters/pro-1000-mt-dp.html

I'm not sure how to diagnose.  Let me know if I can get any info to help.
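
For anyone trying to quantify the problem first, one rough way is to count how often each interface logs the message. This is only a sketch: the function name `count_watchdog_resets` is made up for illustration, and it assumes the kernel log line contains the interface name followed by a colon and then the text "Watchdog timeout" (for example `em0: Watchdog timeout -- resetting`), which may differ between driver versions.

```shell
#!/bin/sh
# Hypothetical helper: count "Watchdog timeout" messages per interface.
# Reads kernel log text on stdin and prints "<interface> <count>" lines.
# Assumption: the interface name (letters followed by digits, e.g. em0)
# appears directly before ": Watchdog timeout" on each matching line.
count_watchdog_resets() {
    awk '
        match($0, /[a-z]+[0-9]+: Watchdog timeout/) {
            ifname = substr($0, RSTART, RLENGTH)  # e.g. "em0: Watchdog timeout"
            sub(/:.*/, "", ifname)                # keep only "em0"
            count[ifname]++
        }
        END { for (ifname in count) print ifname, count[ifname] }
    ' | sort
}

# Possible use on the firewall itself (not executed here):
#   dmesg | count_watchdog_resets
#   count_watchdog_resets < /var/log/messages
```

If only one interface shows up in the counts, that points toward that port, its slot, or its cabling rather than the system as a whole.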

I'm going to try the latest HardenedBSD build now to see if the same issues occur.

EDIT: Added OPNsense version.
Title: Re: Watchdog timeout -- resetting
Post by: AdSchellevis on August 30, 2015, 11:11:54 am
Could you also try this with the testing version of FreeBSD 10.2?
It's probably an issue in the Intel driver of FreeBSD 10.1.

For notes on how to install, see this posting:

https://forum.opnsense.org/index.php?topic=1302.0

Title: Re: Watchdog timeout -- resetting
Post by: russoj88 on August 30, 2015, 10:23:03 pm
I wasn't able to get the HardenedBSD image to work.  I get to the point where it puts me in a prompt "mountroot>".   Maybe I'm writing it to the USB incorrectly?

Code:
gzcat FILE.img.xz | dd of=/dev/da0 bs=64k && sync
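
On the command above: the decompressor that matches a `.xz` file is `xzcat`; whether `gzcat` can read xz data depends on the platform. As a hedged sketch, the helper below picks a decompressor from the file extension and only prints the resulting pipeline (a dry run) so it can be checked before anything is written to a disk. The function names `pick_decompressor` and `image_write_command` are hypothetical, and `FILE.img.xz` and `/dev/da0` are simply the placeholders from the post.

```shell
#!/bin/sh
# Hypothetical helper: choose a decompressor based on the image file extension.
pick_decompressor() {
    case "$1" in
        *.xz)  echo "xzcat" ;;
        *.gz)  echo "gzcat" ;;
        *.bz2) echo "bzcat" ;;
        *)     echo "cat" ;;     # assume the image is not compressed
    esac
}

# Print (do not run) the pipeline for writing an image to a device.
# $1 = image file, $2 = target device
image_write_command() {
    printf '%s %s | dd of=%s bs=64k && sync\n' \
        "$(pick_decompressor "$1")" "$1" "$2"
}

# Dry run with the placeholders from the post.
image_write_command FILE.img.xz /dev/da0
```

Printing the command first also gives a chance to double-check that the target device really is the USB stick before running it.
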
With the 10.2 upgrade, I am getting the same timeout issue.

pfSense 2.2.4 is giving me the same issue.  I've been bouncing back and forth between 2 machines for my router.  It must've been working on the other one (an old Dell OptiPlex).  OPNsense was working there too.
Title: Re: Watchdog timeout -- resetting
Post by: Solaris17 on August 30, 2015, 11:30:35 pm
Hm, have you tried running memtest (http://www.memtest.org/#downiso)? I'm not actually fantastic at networking, but I'm very good with hardware. If BSD is indicating watchdog errors, then the fault may actually lie with overall system stability. The previous edition working could have simply been coincidence. Can you take this machine offline to run extended diagnostics? Are you able to get any thermal measurements from the CPU? Does SMART indicate drive failure?
Title: Re: Watchdog timeout -- resetting
Post by: russoj88 on August 31, 2015, 02:51:06 pm
I'm starting to think that it could be a hardware issue with the NIC.  I have this setup in a 1U case (link below).  I'm using SuperMicro's 90 degree PCI riser.

I noticed the case itself was really hot, so I popped the cover off and the card was too hot to touch.  The heatsink on the CPU was cool.  I've been running a room fan into it (uncovered).  This seems to lessen the errors, but definitely does not stop it.  Is it possible I already burned the card up?

Tonight I'll put the card in the other machine and see if I still get the timeouts and if it's running hot.  I will try memtest/SMART/temp tests before switching over as well.

Thanks for the help.

http://www.supermicro.com/products/chassis/1U/504/SC504-203.cfm
Title: Re: Watchdog timeout -- resetting
Post by: weust on August 31, 2015, 09:07:59 pm
I have a similar type case, except backwards.
Superserver with an Atom C2758F in it, four SSDs (two SATA, one SATA-DOM and a PCI-E 4X card).
Only has one fan, next to the one in the PSU, and keeps everything in normal temperature ranges.

If that NIC really is getting so hot, it's the card and not the case/cooling.
The NIC doesn't have a heatsink, let alone active cooling, as far as I can tell?
Then it shouldn't overheat.

Title: Re: Watchdog timeout -- resetting
Post by: russoj88 on September 01, 2015, 03:09:29 am
Correct, it has no heat sink.
Title: Re: Watchdog timeout -- resetting
Post by: russoj88 on September 01, 2015, 04:10:04 am
The NIC is working in the other machine.  It is still running very hot.

The CPU temps were in the high 30's.

I'm pretty sure this is just a hardware issue.  I even had trouble trying to start up memtest.  I'm going to flash the BIOS and then start from scratch again.

Thanks for everyone's help.  If this actually starts working again, I'll post it.
Title: Re: Watchdog timeout -- resetting
Post by: Solaris17 on September 01, 2015, 04:23:54 am
Good catch. Depending on the config, too hot to touch may be totally within specification. Even the power regulators for your CPU are rated far above what you can physically touch. That doesn't rule it out as a contributing factor, but if the card works in another system then that's a good indication it's not the issue.

Can you perhaps try the card in a bench machine or another desktop system with that 90º riser card? It's possible there is a damaged trace or the riser is simply faulty (fails continuity).

Otherwise I would be interested in memtest results and a BIOS reset, which a flash should do for you depending on the board. That way we may be able to rule out power saving or other features interrupting the bus in an odd way.
Title: Re: Watchdog timeout -- resetting
Post by: russoj88 on September 04, 2015, 02:01:54 am
Sorry for late reply.

I flashed the BIOS (was already on the latest), but no improvement.

I'm running memtest right now (win for IPMI).  It should complete a pass within 80 minutes and I'll update when it's done.

While that's running, I think I can get the riser on the board to test.
Title: Re: Watchdog timeout -- resetting
Post by: russoj88 on September 04, 2015, 03:35:46 am
Memtest reported no errors.

It was easier to run the card without the riser than to try the riser in a different machine.  Unfortunately, same results.
Title: Re: Watchdog timeout -- resetting
Post by: russoj88 on September 04, 2015, 03:41:43 am
The PCI slot on the motherboard with the timeouts is 3.3v.  The card is PCI-X, but from what I've read, this card should work in any PCI slot, but at a slower speed.

http://forums.untangle.com/hardware/31398-urgent-intel-pwla8494mt-pro-quad-port.html

Is that incorrect?
Title: Re: Watchdog timeout -- resetting
Post by: Solaris17 on September 04, 2015, 03:55:31 am
It is and isn't incorrect. The good news is there isn't any real way to mess it up. Why? Because (A) 5V PCI slots are on specialty motherboards and (B) 5V PCI slots are keyed.

http://www.intel.com/support/network/adapter/1000mtquad/sb/cs-009537.htm

It still sounds to me like a card failure. Are you using all 4 slots at the moment? Could you perchance test it with a single or dual port PCI card to see if it also has issues?
Title: Re: Watchdog timeout -- resetting
Post by: russoj88 on September 04, 2015, 03:16:31 pm
The card I have is a dual port and I'm using both.

I think the only other PCI card I have is a USB 2.0 card, but I can give it a shot.

I'm probably going to end up getting a switch and using the two ports on the motherboard.
Title: Re: Watchdog timeout -- resetting
Post by: Solaris17 on September 05, 2015, 05:47:30 pm
That's probably what I would do. If system stress tests seem to be going fine, then the only real root cause is the card itself, especially if it is exhibiting the same problems in another machine.
Title: Re: Watchdog timeout -- resetting
Post by: russoj88 on September 09, 2015, 02:21:42 am
I think the card is ok.  It works in the other machine.

I ran the card without the riser in the box with errors and still got them, so I think the riser is ok too.

I picked up a switch and started using the ports on the motherboard for WAN/LAN and everything's been good for a while.  I'm guessing the PCI slot is somehow broken.
Title: Re: [SOLVED] Watchdog timeout -- resetting
Post by: Solaris17 on September 09, 2015, 06:27:17 am
Ah OK, I misunderstood. I thought the card was having issues in the other machine as well. It does seem like the slot, or there is some kind of northbridge issue. I would keep an eye on the machine over the next several days.
Title: Re: [SOLVED] Watchdog timeout -- resetting
Post by: UKEE93 on October 25, 2015, 03:25:15 am
It seems there is a patch that should fix this issue, but it hasn't made it into FreeBSD yet.

https://reviews.freebsd.org/D3192
Title: Re: [SOLVED] Watchdog timeout -- resetting
Post by: Supermule on October 25, 2015, 07:33:13 am
How is this related to storage/NFS?
Title: Re: [SOLVED] Watchdog timeout -- resetting
Post by: franco on October 25, 2015, 11:34:26 am
Fixes have been pushed to 11-CURRENT but not 10-STABLE. Should have been there 2 months ago, but networking maintenance in 10-STABLE is flaky. I'm sorry.

I know that Shawn will try to build OPNsense/HardenedBSD on top of 11-CURRENT soon. That'll certainly help.
Title: Re: [SOLVED] Watchdog timeout -- resetting
Post by: UKEE93 on October 25, 2015, 05:03:09 pm
You have no reason to be sorry.  I appreciate the information.  I'm very new and raw to any of this (pfSense, OPNsense, FreeBSD, etc.) and just decided I needed a new project, so I picked up a new SuperMicro Intel N3700 board with four Intel I210-AT gigabit ports on it for a router-in-a-box build.  If I could get past the watchdog timeouts that randomly occur (under higher traffic load), it would be great.  It seems, from Google, that this all started with FreeBSD 10.1 or so.

When trying pfSense, I saw the driver was 2.4.0.  Intel has an updated driver, 2.4.3, but I don't know if that will fix anything or not (and I don't yet know how to compile it and insert it into a running system; I may have to install FreeBSD in a VM, compile it there, and then transport it over).  I was hoping FreeBSD 10.2 would have the committed changes, but it doesn't appear to.

Again, thanks for the help. :)
Title: Re: [SOLVED] Watchdog timeout -- resetting
Post by: UKEE93 on October 25, 2015, 05:16:22 pm
Quote from Supermule: "How is this related to storage/NFS?"

To be honest, I don't know.  I followed the trail through Google and it ended at that commit.  If you read through the page at the link, there are various stages of testing and trials.