[SOLVED] Watchdog timeout -- resetting

Started by russoj88, August 30, 2015, 04:14:05 AM

August 30, 2015, 04:14:05 AM Last Edit: September 09, 2015, 02:23:11 AM by russoj88
Hi, I am getting the message "Watchdog timeout -- resetting" pretty often (about once a minute).  I just installed 15.7.11 onto the hardware listed below.  The timeout message is happening on em0 (LAN).  The network completely drops out for a few seconds each time.  It happens often enough that I can't do a speed test.  I have a 75/75 connection here.  I was able to reach that using pfSense.

I have em1 as the WAN, with em0, igb0, and igb1 bridged together as the LAN.  This is 15.7.11 with LibreSSL.

SUPERMICRO MBD-X9SBAA-F-O http://www.supermicro.com/products/motherboard/ATOM/X9/X9SBAA-F.cfm
8GB ECC RAM
120GB SSD
Intel PWLA8492MT PRO/1000 MT http://www.intel.com/content/www/us/en/ethernet-products/gigabit-server-adapters/pro-1000-mt-dp.html

I'm not sure how to diagnose.  Let me know if I can get any info to help.

I'm going to try the latest HardenedBSD build now to see if the same issues occur.

EDIT: Added OPNsense version.
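
One way to gather information for a problem like this is to count how often each interface logs the reset. A minimal sketch, assuming the kernel line looks like "em0: Watchdog timeout -- resetting" (interface name directly before the message) and that the messages have been saved to a plain-text file; the helper name count_watchdog_timeouts is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: count "Watchdog timeout" lines per interface.
# Assumes the interface name (e.g. "em0:") is the word immediately
# before "Watchdog" on each matching line.
count_watchdog_timeouts() {
    awk '
        /Watchdog timeout/ {
            for (i = 2; i <= NF; i++) {
                if ($i == "Watchdog") {
                    ifname = $(i - 1)
                    sub(/:$/, "", ifname)   # strip the trailing colon
                    count[ifname]++
                    break
                }
            }
        }
        END {
            for (name in count) print count[name], name
        }
    ' "$1" | sort -k2
}

# Example (not run here):
#   dmesg > /tmp/dmesg.txt
#   count_watchdog_timeouts /tmp/dmesg.txt
```

Each output line is a count followed by the interface name, which makes it easy to see whether only em0 is affected or the onboard ports are too.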

Could you also try this with the testing version of FreeBSD 10.2?
It's probably an issue in the Intel driver of FreeBSD 10.1.

For notes on how to install, see this posting:

https://forum.opnsense.org/index.php?topic=1302.0


I wasn't able to get the HardenedBSD image to work.  It boots to the point where it drops me at a "mountroot>" prompt.  Maybe I'm writing it to the USB drive incorrectly?

gzcat FILE.img.xz | dd of=/dev/da0 bs=64k && sync
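
Whether gzcat can read an .xz image depends on the gzip implementation installed, so the decompression step is one thing worth ruling out. A small sketch that picks the decompressor from the file suffix; the helper name decompressor_for is made up for illustration, and the file and device names are the ones from the command above:

```shell
#!/bin/sh
# Hypothetical helper: print the decompressor that matches an image suffix.
decompressor_for() {
    case "$1" in
        *.xz)  echo "xzcat" ;;
        *.gz)  echo "gzcat" ;;
        *.bz2) echo "bzcat" ;;
        *)     echo "cat" ;;     # assume the image is not compressed
    esac
}

# Example (not run here; writing to the wrong device destroys its data):
#   img=FILE.img.xz
#   "$(decompressor_for "$img")" "$img" | dd of=/dev/da0 bs=64k && sync
```

This only removes one possible cause; a "mountroot>" prompt can have other causes that this sketch does not address.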

With the 10.2 upgrade, I am getting the same timeout issue.

pfSense 2.2.4 is giving me the same issue.  I've been bouncing back and forth between 2 machines for my router, so it must have been the other one (an old Dell OptiPlex) where pfSense was working.  OPNsense was working there too.

Hm, have you tried running a memtest? I'm not actually fantastic at networking, but I'm very good with hardware. If BSD is reporting watchdog errors, then the fault may actually lie with overall system stability. The previous version working could simply have been coincidence. Can you take this machine offline to run extended diagnostics? Are you able to get any thermal measurements from the CPU? Does SMART indicate drive failure?

Quote from: Solaris17 on August 30, 2015, 11:30:35 PM

I'm starting to think that it could be a hardware issue with the NIC.  I have this set up in a 1U case (link below).  I'm using Supermicro's 90-degree PCI riser.

I noticed the case itself was really hot, so I popped the cover off and the card was too hot to touch.  The heatsink on the CPU was cool.  I've been running a room fan into it (uncovered).  This seems to lessen the errors, but definitely does not stop them.  Is it possible I already burned the card up?

Tonight I'll put the card in the other machine and see if I still get the timeouts and if it's running hot.  I will try memtest/SMART/temp tests before switching over as well.

Thanks for the help.

http://www.supermicro.com/products/chassis/1U/504/SC504-203.cfm

I have a similar type of case, except backwards.
It's a SuperServer with an Atom C2758F in it and four SSDs (two SATA, one SATA-DOM, and one PCI-E x4 card).
It only has one fan, besides the one in the PSU, and it keeps everything in normal temperature ranges.

If that NIC really is getting that hot, the problem is the card and not the case or cooling.
As far as I can tell, the NIC doesn't have a heatsink, let alone active cooling.
In that case it shouldn't overheat.

Quote from: weust on August 31, 2015, 09:07:59 PM

Correct, it has no heat sink.

The NIC is working in the other machine.  It is still running very hot.

The CPU temps were in the high 30s.

I'm pretty sure this is just a hardware issue.  I even had trouble trying to start up memtest.  I'm going to flash the BIOS and then start from scratch again.

Thanks for everyone's help.  If this actually starts working again, I'll post it.

Good catch. Depending on the configuration, too hot to touch may be totally within specification. Even the power regulators for your CPU are rated far above what you can physically touch. That doesn't rule out heat as a contributing factor, but if the card works in another system then that's a good indication it's not the issue.

Can you perhaps try the card in a bench machine or another desktop system with that 90º riser card? It's possible there is a damaged trace or the riser is simply faulty (fails continuity).

Otherwise I would be interested in the memtest results and a BIOS reset, which a flash should do for you depending on the board. That way we may be able to rule out power saving or other features interrupting the bus in an odd way.

Quote from: Solaris17 on September 01, 2015, 04:23:54 AM

Sorry for late reply.

I flashed the BIOS (was already on the latest), but no improvement.

I'm running memtest right now (win for IPMI).  It should complete a pass within 80 minutes and I'll update when it's done.

While that's running, I think I can get the riser on the board to test.

Memtest reported no errors.

It was easier to run the card without the riser than to try the riser in a different machine.  Unfortunately, same results.

The PCI slot on the motherboard with the timeouts is 3.3v.  The card is PCI-X, but from what I've read it should work in any PCI slot, just at a slower speed.

http://forums.untangle.com/hardware/31398-urgent-intel-pwla8494mt-pro-quad-port.html

Is that incorrect?
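
To put a rough number on "slower": a sketch of the peak bus arithmetic, assuming the motherboard slot is conventional 32-bit PCI at 33 MHz (the thread only establishes that it is 3.3v) and comparing it with 64-bit PCI-X at 133 MHz. The function name pci_peak_mbit is illustrative, and real throughput is lower than these theoretical peaks.

```shell
#!/bin/sh
# Theoretical peak of a parallel PCI bus in Mbit/s:
# bus width in bits times clock in MHz (one transfer per clock).
pci_peak_mbit() {
    echo $(( $1 * $2 ))
}

pci_peak_mbit 32 33     # conventional PCI (assumed slot type), prints 1056
pci_peak_mbit 64 133    # PCI-X at 133 MHz, prints 8512
```

Under that assumption the slot peaks at roughly 1 Gbit/s shared by both ports, far below PCI-X but well above a 75/75 connection, so the slower bus would not by itself be expected to explain the drops.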

It is and isn't incorrect. The good news is that there isn't any real way to mess it up. Why? Because (a) 5v PCI slots are on specialty motherboards and (b) 5v PCI slots are keyed.

http://www.intel.com/support/network/adapter/1000mtquad/sb/cs-009537.htm

It still sounds to me like a card failure. Are you using all 4 ports at the moment? Could you perchance test it with a single or dual port PCI card to see if it also has issues?

The card I have is a dual port and I'm using both.

I think the only other PCI card I have is a USB 2.0 card, but I can give it a shot.

I'm probably going to end up getting a switch and using the two ports on the motherboard.

That's probably what I would do. If the system stress tests seem to be going fine, then the only real root cause left is the card itself, especially if it is exhibiting the same problems in another machine.