WAN link gone sometimes (igb driver, I211 nics), ifconfig d/u fixes it

Started by Werner Fischer, July 17, 2017, 03:54:41 PM

Previous topic - Next topic
(UPDATE #1: the issue has finally been fixed through a BIOS/UEFI-Firmware update, see posting #44 in this thread for details: https://forum.opnsense.org/index.php?topic=5511.msg28681#msg28681)

(UPDATE #2: after many days of tests, the problem came up again - see https://forum.opnsense.org/index.php?topic=5511.msg28781#msg28781 for details)

Hi all,

I'm trying to analyze a strange issue: sometimes (very rare, I was only able to reproduce the issue 2 times), the WAN link goes away. From a laptop behind the OPNsense firewall I am not able to ping the WAN's default gateway anymore. I still can access the OPNsense system.

Details:

Have you ever seen an issue like this?
Do you have any hints what I could do to further analyze the issue?

Thanks in advance and best regards,
Werner

I also experience the same problem, my OPNsense is a VM (on ESXi) with Intel NICs and I've never been able to find anything about troubleshooting this problem (I've previously posted this on the forum without a response. I thought it was a rare occurrence until I added a cron job to test the connection every minute and this is the (abbreviated) result:

2017-07-04 15:23:15 - WAN interface Restarted on OPNsense
2017-07-04 15:24:15 - WAN interface Restarted on OPNsense
2017-07-06 03:46:22 - WAN interface Restarted on OPNsense
2017-07-06 04:17:22 - WAN interface Restarted on OPNsense
2017-07-07 17:41:22 - WAN interface Restarted on OPNsense
2017-07-09 06:35:22 - WAN interface Restarted on OPNsense
2017-07-09 07:06:22 - WAN interface Restarted on OPNsense
2017-07-10 20:30:22 - WAN interface Restarted on OPNsense
2017-07-12 09:24:22 - WAN interface Restarted on OPNsense
2017-07-12 09:55:22 - WAN interface Restarted on OPNsense
2017-07-12 12:53:22 - WAN interface Restarted on OPNsense
2017-07-12 12:55:15 - WAN interface Restarted on OPNsense
2017-07-12 12:56:15 - WAN interface Restarted on OPNsense


I thought that apinger might restart the interface but that appears not to be the case. :( This hasn't always happened but I can't remember when it did start other than it seems to be recent, as in sometime this year.
Regards


Bill

Thank you for reporting your experiences, Bill.

So your cronjob checks the availability of the WAN interface, and does then a restart in case the interface is not available, right?

Could you maybe post your cronjob-script?

Sure, here's the script I use:

#!/bin/sh

# -q quiet
# -c nb of pings to perform

ping -q -c5 [your_wan_gateway] > /dev/null 2>&1  <<-- obviously your wan gateway IP

if [ $? -eq 0 ]
  then
      echo "ok"
  else
# When we restart the NIC we also need to run a dhclient to get our (fixed) IP address:

            /etc/rc.d/netif restart vmx0 > /dev/null 2>&1 ; dhclient vmx0 > /dev/null 2>&1

    echo "$(date '+%Y-%m-%d %H:%M:%S') - WAN interface Restarted on" $(hostname -s) >> /usr/home/restart_wan.log

fi


I put the script (and the log file) in /usr/home then modify the crontab to add this line at the end:

*       *       *       *       *       (/usr/home/restart_wan) > /dev/nullI can't really take any credit for that (I found a similar script on the internet) as I'm not very experienced in scripting and it can probably be improved but it works for me.
Regards


Bill

Thank you for your script.

Today morning, the error happened again:

  • I have switched on the OPNsense system (having igb9 connected as DHCP client for the WAN uplink, WAN gateway being 10.1.102.1)
  • At "05:45:52 UTC" I have connected my laptop on igb0 as a client (getting an IP address via DHCP from OPNsense)
  • For about 2-5 minutes, I was able to use the internet on the laptop. Then I noticed that I cannot ping the default gateway 10.1.102.1 of the WAN uplink)
  • Then I SSH'ed into OPNsense at "05:53:40 UTC". Here follows the shortened log of my session:


root@OPNsense:~ # date
Wed Jul 19 05:53:53 UTC 2017
root@OPNsense:~ # ifconfig
igb0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=4400b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,TXCSUM_IPV6>
ether 00:30:18:cd:e8:54
inet 192.168.1.1 netmask 0xffffff00 broadcast 192.168.1.255
inet6 fe80::1:1%igb0 prefixlen 64 scopeid 0x1
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
media: Ethernet autoselect (1000baseT <full-duplex>)
status: active
igb1: flags=8c02<BROADCAST,OACTIVE,SIMPLEX,MULTICAST> metric 0 mtu 1500
[...]
igb9: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=4400b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,TXCSUM_IPV6>
ether 00:30:18:cd:ec:63
inet6 fe80::230:18ff:fecd:ec63%igb9 prefixlen 64 scopeid 0xa
inet 10.1.102.55 netmask 0xffffff00 broadcast 10.1.102.255
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
media: Ethernet autoselect (1000baseT <full-duplex>)
status: active
[...]
root@OPNsense:~ # arp -a
? (10.1.102.1) at 4c:5e:0c:4b:23:30 on igb9 expires in 1084 seconds [ethernet]
? (10.1.102.55) at 00:30:18:cd:ec:63 on igb9 permanent [ethernet]
? (192.168.1.1) at 00:30:18:cd:e8:54 on igb0 permanent [ethernet]
? (192.168.1.100) at f0:de:f1:f3:17:88 on igb0 expires in 1136 seconds [ethernet]
root@OPNsense:~ # time arp -a
? (10.1.102.1) at 4c:5e:0c:4b:23:30 on igb9 expires in 1016 seconds [ethernet]
? (10.1.102.55) at 00:30:18:cd:ec:63 on igb9 permanent [ethernet]
? (192.168.1.1) at 00:30:18:cd:e8:54 on igb0 permanent [ethernet]
? (192.168.1.100) at f0:de:f1:f3:17:88 on igb0 expires in 1172 seconds [ethernet]
0.000u 0.004s 0:23.29 0.0% 0+0k 0+0io 0pf+0w
root@OPNsense:~ # clog /var/log/system.log | tail -n 150
[...]
Jul 19 05:37:41 OPNsense kernel: uhub1: 4 ports with 4 removable, self powered
Jul 19 05:37:41 OPNsense kernel: igb9: link state changed to UP
Jul 19 05:37:41 OPNsense kernel: aesni0: No AESNI support.
Jul 19 05:37:42 OPNsense kernel: done.
Jul 19 05:37:42 OPNsense kernel: igb9: link state changed to DOWN
Jul 19 05:37:42 OPNsense sshlockout[12834]: sshlockout/webConfigurator v3.0 starting up
Jul 19 05:37:42 OPNsense configd.py: [3a2068ae-8494-4ad6-9476-7ef4d08a0ce5] Linkup stopping igb9
Jul 19 05:37:46 OPNsense kernel: igb9: link state changed to UP
Jul 19 05:37:46 OPNsense configd.py: [f7d9bfd1-e05b-40c2-b52b-fd08a8c054c3] Linkup starting igb9
Jul 19 05:37:49 OPNsense kernel: done.
Jul 19 05:37:49 OPNsense kernel: pflog0: promiscuous mode enabled
Jul 19 05:37:50 OPNsense kernel: ...done.
Jul 19 05:37:50 OPNsense kernel: done.
Jul 19 05:37:50 OPNsense sshd[48047]: Server listening on :: port 22.
Jul 19 05:37:50 OPNsense sshd[48047]: Server listening on 0.0.0.0 port 22.
[...]
Jul 19 05:38:02 OPNsense sshlockout[70807]: sshlockout/webConfigurator v3.0 starting up
Jul 19 05:38:02 OPNsense kernel: OK
Jul 19 05:38:04 OPNsense kernel:
Jul 19 05:45:52 OPNsense kernel: igb0: link state changed to UP
Jul 19 05:45:52 OPNsense configd.py: [ac2de28c-7b2a-4fa0-9f27-2e8536f2c95d] Linkup starting igb0
Jul 19 05:45:53 OPNsense opnsense: /usr/local/etc/rc.linkup: DEVD Ethernet attached event for lan
Jul 19 05:45:53 OPNsense opnsense: /usr/local/etc/rc.linkup: HOTPLUG: Configuring interface lan
Jul 19 05:45:56 OPNsense configd.py: [2355f16d-4b08-4c1d-85db-cc50b95f937e] updating dyndns lan
Jul 19 05:45:56 OPNsense configd.py: [6072a8dd-cf07-4c0d-aea2-c0b32445b557] updating rfc2136 lan
Jul 19 05:53:40 OPNsense sshd[30826]: Postponed keyboard-interactive for root from 192.168.1.100 port 35188 ssh2 [preauth]
Jul 19 05:53:44 OPNsense opnsense: user 'root' authenticated successfully
Jul 19 05:53:44 OPNsense sshd[30826]: Postponed keyboard-interactive/pam for root from 192.168.1.100 port 35188 ssh2 [preauth]
Jul 19 05:53:44 OPNsense sshd[30826]: Accepted keyboard-interactive/pam for root from 192.168.1.100 port 35188 ssh2
root@OPNsense:~ # date
Wed Jul 19 06:05:53 UTC 2017
root@OPNsense:~ # time arp -a
? (10.1.102.1) at 4c:5e:0c:4b:23:30 on igb9 expires in 204 seconds [ethernet]
? (10.1.102.55) at 00:30:18:cd:ec:63 on igb9 permanent [ethernet]
? (192.168.1.1) at 00:30:18:cd:e8:54 on igb0 permanent [ethernet]
? (192.168.1.100) at f0:de:f1:f3:17:88 on igb0 expires in 1101 seconds [ethernet]
0.000u 0.004s 0:23.33 0.0% 0+0k 0+0io 0pf+0w
root@OPNsense:~ # date
Wed Jul 19 06:11:33 UTC 2017
root@OPNsense:~ # ping 10.1.102.1
PING 10.1.102.1 (10.1.102.1): 56 data bytes
^C
--- 10.1.102.1 ping statistics ---
5 packets transmitted, 0 packets received, 100.0% packet loss
root@OPNsense:~ # time arp -a
? (10.1.102.1) at (incomplete) on igb9 expired [ethernet]
? (10.1.102.55) at 00:30:18:cd:ec:63 on igb9 permanent [ethernet]
? (192.168.1.1) at 00:30:18:cd:e8:54 on igb0 permanent [ethernet]
? (192.168.1.100) at f0:de:f1:f3:17:88 on igb0 expires in 1074 seconds [ethernet]
0.000u 0.004s 0:23.59 0.0% 0+0k 0+0io 0pf+0w


My findings:

  • Interface igb9 is still up.
  • I cannot ping the default gateway of the WAN (10.1.102.1) anymore.
  • Executing "arp -a" takes about 23 seconds (because of trying to resolve IP addresses to hostnames, I have missed to use the "-n" option of arp)
  • I did a tcpdump for igb9 on the OPNsense system when I executed a ping. I only see the broadcasts being sent, but no answer (see attachment).

Does anybody have an idea or hint what I could further execute to analyze this issue? (I keep the OPNsystem in this state, as it is not easy to trigger the issue).

Hi Werner,

Which one is the WAN interface, igb9 or igb0?

WAN is DHCP, right?

What do the following commands yield?

# ps aux | grep dhclient
# netstat -nr | grep default

Does this bring it back?

# killall dhclient
# dhclient igbX

If not, does this?

# /usr/local/etc/rc.newwanip wan


Cheers,
Franco


Hi Franco,

thanks a lot for your friendly help.

WAN is igb9 an WAN is DHCP.

Unfortunately, the commands did not help to get the connectivity again. But it's surprising to me, that executing dhclient again shows a "DHCPACK". Here is my log:


root@OPNsense:~ # ps aux | grep dhclient
root   15548   0.0  0.0 1076296  2844  -  Is   05:37     0:00.00 dhclient: igb9 [priv] (dhclient)
_dhcp  24493   0.0  0.0 1076296  2908  -  Is   05:37     0:00.00 dhclient: igb9 (dhclient)
root   59205   0.0  0.0 1080488  2856  0  S+   08:53     0:00.01 grep dhclient
root@OPNsense:~ # netstat -nr | grep default
default            10.1.102.1         UGS        igb9
root@OPNsense:~ # ping 10.1.102.1
PING 10.1.102.1 (10.1.102.1): 56 data bytes
^C
--- 10.1.102.1 ping statistics ---
3 packets transmitted, 0 packets received, 100.0% packet loss
root@OPNsense:~ # date
Wed Jul 19 08:54:46 UTC 2017
root@OPNsense:~ # killall dhclient
root@OPNsense:~ # ps aux | grep dhclient
root   77072   0.0  0.0 1080488  2860  0  S+   08:55     0:00.00 grep dhclient
root@OPNsense:~ # date
Wed Jul 19 08:55:16 UTC 2017
root@OPNsense:~ # dhclient igb9
DHCPREQUEST on igb9 to 255.255.255.255 port 67
DHCPACK from 10.1.102.5
bound to 10.1.102.55 -- renewal in 21600 seconds.
root@OPNsense:~ # date
Wed Jul 19 08:55:32 UTC 2017
root@OPNsense:~ # ping 10.1.102.1
PING 10.1.102.1 (10.1.102.1): 56 data bytes
^C
--- 10.1.102.1 ping statistics ---
5 packets transmitted, 0 packets received, 100.0% packet loss
root@OPNsense:~ # date
Wed Jul 19 08:56:08 UTC 2017
root@OPNsense:~ # arp -a -n
? (10.1.102.1) at (incomplete) on igb9 expired [ethernet]
? (10.1.102.55) at 00:30:18:cd:ec:63 on igb9 permanent [ethernet]
? (192.168.1.1) at 00:30:18:cd:e8:54 on igb0 permanent [ethernet]
? (192.168.1.101) at f0:de:f1:ab:ce:49 on igb0 expires in 1011 seconds [ethernet]
root@OPNsense:~ # date
Wed Jul 19 08:56:36 UTC 2017
root@OPNsense:~ # /usr/local/etc/rc.newwanip wan
root@OPNsense:~ # date
Wed Jul 19 08:57:09 UTC 2017
root@OPNsense:~ # ping 10.1.102.1
PING 10.1.102.1 (10.1.102.1): 56 data bytes
^C
--- 10.1.102.1 ping statistics ---
3 packets transmitted, 0 packets received, 100.0% packet loss
root@OPNsense:~ # date
Wed Jul 19 08:57:25 UTC 2017
root@OPNsense:~ #


In the past, an "ifconfig igb9 down" followed by an "ifconfig igb9 up" brought the connectivity back up again, but I keep the system in the current state in case that you have any other ideas.

Should I do the steps again while recording a tcpdump?

Thanks,
Werner


I had an similar Problem with
an another NIC on a embedded Device.

.. it can be possible, the Power Management
/Energy Saver Mode of your Intel NIC works not correctly.

Disable for the i211 GB Nic:

* the Energy Efficient Ethernet Saver Mode
* and WOL (WakeOnLan) for this Interface.

Test it manually with ethtool ...


Use:

Disable Energy-Efficient Ethernet Mode:

    ethtool --set-eee <your nic device> eee off

Disable WakeOnLan:

   ethtool --s <your nic device> wol d


If it works ,
put the modifications to load @ boot

in

/etc/rc.local





Sincely
Wilbo.

Thank you wilbolinux for your suggestion to check the Energy Efficient Ethernet settings.

I have searched for ethtool, but it seems that this tool is not available for FreeBSD.

Has anybody an idea how the Energy Efficient Ethernet settings can be checked for igb devices under FreeBSD/OPNsense?

Hi Werner,

Uhh, I do remember this now...

sysctl  -a | grep dev.igb...eee_disabled

Set it to 1 for each interface to disable EEE (Energy Efficient Ethernet) as this may cause up/down issues.

Apologies, I took this from an old e-mail thread, it's not on the forum so hard to find.


Cheers,
Franco

Hi Franco,

thank you so much for your fast and confident answer! This makes me think very positive, that we can solve this issue :)

I will test this and let you know within the upcoming days whether I was able to fix my issue by disabling EEE.

Thanks again,
Werner

Hi Franco,

I was able to reproduce the problem with OPNsense 17.7 (after I have updated the server mentioned above from 17.1 to 17.7).

I have then disabled Energy-Efficient Ethernet for all 10 NICs of the system (setting the tunables from "dev.igb.0.eee_disabled" to "dev.igb.9.eee_disabled" to "1"). After that, I have had no problems anymore.

I have documented the fix in our wiki here (in German): https://www.thomas-krenn.com/de/wiki/OPNsense_igb_EEE_Funktion_deaktivieren

In case that against my expectation the problem arises again, I would post this here - I hope this will not be necessary  ;)

So thanks again for your help,
best regards,
Werner

Hi all,

i think Energy Efficient Ethernet (eee) should be disabled by default in opnsense.

cheers
till

It seems to only affect a tiny fraction of igb devices. And since the number of igb devices is device-specific, we'd have to write detection code as well or alternatively hardcode the EEE default in the kernel. Unless we narrow these chipsets down to look at the full picture, I am not sure what to do.


Cheers,
Franco

Hi again,

unfortunately I got the problem now again, although I have set eee_disabled to 1 for all of the 10 NICs.

The error occurred about 3-4 minutes after I have booted the firewall system. I was able to use the Internet from my laptop behind the firewall, after 3-4 minutes I was not able to ping the WAN's link default gateway anymore:


root@OPNsense:~ # ping 10.1.102.1
PING 10.1.102.1 (10.1.102.1): 56 data bytes
^C
--- 10.1.102.1 ping statistics ---
4 packets transmitted, 0 packets received, 100.0% packet loss
root@OPNsense:~ # sysctl -a | grep -i eee
options IEEE80211_SUPPORT_MESH
options IEEE80211_AMPDU_AGE
options IEEE80211_DEBUG
z0xfffff8000baeee80 [label="r0w0e0"];
z0xfffff8000baeee80 -> z0xfffff8000521d300;
z0xfffff8000bb5be00 -> z0xfffff8000baeee80;
<consumer id="0xfffff8000baeee80">
hw.bxe.autogreeen: 0
hw.em.eee_setting: 1
dev.igb.9.eee_disabled: 1
dev.igb.8.eee_disabled: 1
dev.igb.7.eee_disabled: 1
dev.igb.6.eee_disabled: 1
dev.igb.5.eee_disabled: 1
dev.igb.4.eee_disabled: 1
dev.igb.3.eee_disabled: 1
dev.igb.2.eee_disabled: 1
dev.igb.1.eee_disabled: 1
dev.igb.0.eee_disabled: 1
root@OPNsense:~ #


Does anybody have any other hints what the root cause for this issue could be?

PS: I will now start some tests in parallel on another system from the same board manufacturer http://www.jetwaycomputer.com/JBC385F551.html I will keep you updated how the tests run there.