[SOLVED] Zotac nano ci323 LAN Drops after a few days

Started by brady1408, January 02, 2017, 02:59:20 AM

January 02, 2017, 02:59:20 AM Last Edit: February 15, 2017, 10:00:05 PM by franco
I'm not sure what is causing this; I am wondering if it is just the NICs in this device. Please let me know your thoughts. Every couple of days I lose the ability to connect to the LAN IP and the logs fill with the following.

Once this starts happening I have to reboot the machine to get it back.

Jan 2 01:00:49   configd.py: [c7e56997-aa8f-4f94-9c4f-2c070f67ab76] updating dyndns lan
Jan 2 01:00:46   opnsense: /usr/local/etc/rc.linkup: HOTPLUG: Configuring interface lan
Jan 2 01:00:46   opnsense: /usr/local/etc/rc.linkup: DEVD Ethernet attached event for lan
Jan 2 01:00:46   configd.py: [7f6a2423-4cdc-49d5-8181-881fc54f12e0] Linkup starting re0
Jan 2 01:00:46   devd: Executing '/usr/local/opnsense/service/configd_ctl.py interface linkup start re0'
Jan 2 01:00:46   kernel: re0: link state changed to UP
Jan 2 01:00:42   opnsense: /usr/local/etc/rc.linkup: DEVD Ethernet detached event for lan
Jan 2 01:00:42   configd.py: [e29d0834-cbfb-4aff-9c82-5bf2df12a232] Linkup stopping re0
Jan 2 01:00:41   devd: Executing '/usr/local/opnsense/service/configd_ctl.py interface linkup stop re0'
Jan 2 01:00:41   kernel: re0: link state changed to DOWN
Jan 2 01:00:41   kernel: re0: watchdog timeout
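To see how often this happens and on which NIC, the log lines above can be tallied per interface. A minimal sketch, assuming the log text is piped in on stdin; the function name summarize_re_events is made up for illustration, and the patterns are taken from the lines posted above:

```shell
#!/bin/sh
# Count "watchdog timeout" and "link state changed to DOWN" events per re(4)
# interface. Reads log text on stdin, prints one summary line per interface.
# summarize_re_events is a hypothetical helper name, not an OPNsense tool.
summarize_re_events() {
    awk '
        match($0, /(^| )re[0-9]+: (watchdog timeout|link state changed to DOWN)/) {
            hit = substr($0, RSTART, RLENGTH)
            sub(/^ /, "", hit)        # drop the leading space, if any
            iface = hit
            sub(/:.*/, "", iface)     # keep only the interface name, e.g. re0
            seen[iface] = 1
            if (hit ~ /watchdog/) { timeouts[iface]++ } else { downs[iface]++ }
        }
        END {
            for (i in seen)
                printf "%s watchdog_timeouts=%d link_down=%d\n", i, timeouts[i], downs[i]
        }
    ' | sort
}
```

For example, `dmesg | summarize_re_events` should work on the kernel message buffer, since the pattern does not depend on the "kernel:" prefix.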

Same here - afaik a known issue. Disabling all HW acceleration helps extend the time before failure.

Hi guys,

A watchdog timeout points to a hardware lockup. re(4) is not very good in general. Maybe it's better when we switch to FreeBSD 11.0 with 17.1, but in general migrating to better NICs is the best and (ironically) cheapest solution in the long run.

It could also be temperature concerns, malfunctioning lines / cables, etc.

The main question, though, is what you are using the box for. How much traffic are you pushing? The closer you get to the edge of the specification, the more visible such cases can be.


Cheers,
Franco

Hi franco,

thank you for the reply.

Since the box does not have any expansion capabilities it will be hard to replace the NICs (USB would be possible).

To replicate the failure you just need to push about 30 MB/s (megabytes), and starting at about 20 GB of overall traffic one or both NICs fail. dmesg tells you that they eventually come up again, but they do not forward any traffic after the first failure.
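For anyone trying to reproduce this, the two numbers above give a rough idea of how long a load test has to run. A small sketch of the arithmetic, assuming decimal units (1 GB = 1000 MB) and a steady rate; the function name seconds_to_transfer is made up for illustration:

```shell
#!/bin/sh
# Rough time needed to move a given amount of traffic at a steady rate.
# $1 = total traffic in GB, $2 = rate in MB/s (decimal units, integer division).
# seconds_to_transfer is a hypothetical helper name.
seconds_to_transfer() {
    echo $(( $1 * 1000 / $2 ))
}

# 20 GB at 30 MB/s: prints the number of seconds to run the test for.
seconds_to_transfer 20 30
```

That works out to roughly 11 minutes, so a load test (iperf or a large download) of about that length at 30 MB/s should be enough to reach the 20 GB mark.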

Disabling the acceleration helps - even if you do not use the feature e.g.:

If I disable VLAN HW acceleration in the options, it only disables the HW feature on my LAN side (I only use VLANs on my LAN side - it might be a bug, but I did not test VLANs on the WAN side). If I disable the acceleration manually, it takes much more time and traffic before the NICs fail.
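The post does not say how the acceleration was disabled manually. One plausible way on FreeBSD is to turn off the offload capabilities with ifconfig. A minimal sketch, assuming "acceleration" means checksum, TSO and VLAN hardware offload, and that the NICs are re0 and re1; it only prints the commands (dry run) so they can be checked before running them as root:

```shell
#!/bin/sh
# Build the ifconfig command that turns off hardware offload on one NIC.
# Assumption: these capability flags are what "acceleration" refers to here.
OFFLOAD_FLAGS="-rxcsum -txcsum -tso -vlanhwtag -vlanhwcsum -vlanhwtso"

# disable_offload_cmd is a hypothetical helper name; $1 = interface, e.g. re0
disable_offload_cmd() {
    echo "ifconfig $1 $OFFLOAD_FLAGS"
}

# Dry run: print the commands for both NICs instead of executing them.
for nic in re0 re1; do
    disable_offload_cmd "$nic"
done
```

Settings changed this way at runtime do not survive a reboot, so they would have to be reapplied after every restart.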

Btw. using Windows as OS (just a test installation) the NICs do not fail even under much heavier load.

If it helps to grant access to the box for debugging purposes I can add your public key next week (I won't make it earlier).

Cheers
Martin

Hi Martin,

Ok, let me rephrase: re(4) drivers on BSD are difficult. I checked the source code for fixes in newer versions and did not find a single one. This is not going to be fixed. ;(

If Windows works, maybe that's an option for the box, or a Linux there. But I don't know the state of it, maybe IPFire can do what you expect without issues.


Sorry,
Franco

Same here.
kernel: re0: watchdog timeout endlessly, and only a power cycle helps.
Seems to be a problem with the re(4) driver and/or the Zotac hardware, or a combination thereof.

At the same time the system itself is responsive, you can log in from terminal, the error is logged, etc. And it never seems to be re1 for us, no matter which is assigned to LAN or WAN.

The most baffling thing is that it's totally random. We had one crashing regularly at one office. At the same time we have two ci323 working in different locations: one of them has never crashed in two months, and the other one crashed several times over a week and now has not done it in more than a month or so. It does not seem to be related to traffic intensity.

Generally, disabling HW acceleration etc. did not seem to help us. Temperature is also not an issue (cooled room, well below 20C). As the box might work for days, it's pretty hard to diagnose.

After a couple of weeks we decided to replace ci323 with a different mini-pc that has 2 intel NICs and have had no trouble since.

Hi Franco,

Windows is not an option ;)

I already ordered this box with Intel NICs: http://www.shuttle.eu/products/slim/ds68u/

Would you be able to debug the problem if I send you the hardware? Or maybe someone else?

Cheers
Martin

Hi Martin,

That's a very kind offer, but I cannot allocate time for this between OPNsense and my day job.

If someone else wants to take you up on this, that would be great. We can also try to reach more people on Twitter with this request if you want. :)


Cheers,
Franco

Hi Franco,

sure!

If that works and we find someone who could take care of the problem and try to fix the bug(s) that would be great.

Cheers
Martin

Hi Franco,

any news on this topic? I sent you a private message with some details.

Cheers
Martin

I have such a CI323 box too. In about six months I never had any problems... except once when I ran an iperf test between two boxes, one on the LAN side and one on the WAN side. I had to power the box off and on again to make it run (I didn't have a monitor attached).

If I can perform some tests for you to gather log files, just ask!

If you just use the box at home, it happens only under high load (over 100 Mbit/s).

As soon as I start using iperf or fast downloads, the problems appear.


Removing re(4) from the kernel is not an easy task as the kernel gets reverted on firmware updates.

Let me look into this and provide an in-kernel driver test based on Realtek's version 1.92...

https://github.com/opnsense/src/issues/15

If this works and the world doesn't end we can consider a full switch.


Cheers,
Franco

So, here's your code branch for the original re(4) driver from Realtek, with a slight adaptation for FreeBSD 11.0.

https://github.com/opnsense/src/commits/re

It builds fine but I haven't tested it yet. If anybody wants to try... it will apply on any 17.1 for amd64 including prereleases:

# opnsense-update -kr 17.1-re
# /usr/local/etc/rc.reboot

Caveats: UNTESTED, amd64 only and netmap is not in native mode, only emulated.


Cheers,
Franco