OPNsense Forum

Archive => 17.1 Legacy Series => Topic started by: bringha on April 26, 2017, 08:25:41 am

Title: Systems suddenly HALTS - Stability issue since 17.1
Post by: bringha on April 26, 2017, 08:25:41 am
Hello together,

Since migrating to 17.1, my system simply HALTS out of the blue about every 3-4 weeks. No logs, no message on the console, no crash report. I have seen a few others who, judging from their posts, are in a similar situation (perhaps with a different root cause), so I would like to ask how we could analyze this further. (Are there any additional debug options I could activate?)

I observe random service failures in three classes. They do not necessarily correlate, but they could.

The following is observed when these events happen:

1.) The Dashboard suddenly no longer shows any valid IPv6 addresses on the interfaces, and IPv6 to the outside does not work ... radvd is still running according to the process table, but Wireshark no longer shows any RA packets on the interfaces.

2.) WAN interface down:
Code:
Apr 25 22:21:00 OPNsense opnsense: /usr/local/etc/rc.filter_configure: Could not find IPv6 gateway for interface(wan).
Apr 25 22:21:00 OPNsense opnsense: /usr/local/etc/rc.filter_configure: Could not find IPv6 gateway for interface(wan).
Apr 25 22:22:59 OPNsense kernel: igb1: Watchdog timeout -- resetting
Apr 25 22:22:59 OPNsense kernel: igb1: Queue(73303552) tdh = 718230942, hw tdt = 38012041
Apr 25 22:22:59 OPNsense kernel: igb1: TX(73303552) desc avail = 0,Next TX to Clean = 0
Apr 25 22:22:59 OPNsense kernel: igb1: link state changed to DOWN
Apr 25 22:22:59 OPNsense configd.py: [b7a4b9b8-7077-471d-b36b-88b921d53a20] Linkup stopping igb1
Apr 25 22:22:59 OPNsense opnsense: /usr/local/etc/rc.linkup: DEVD Ethernet detached event for wan
Apr 25 22:23:00 OPNsense UNKNOWN[10427]: Exiting, sigterm or sigint received.
Apr 25 22:23:00 OPNsense UNKNOWN[10427]: sending stop adverts
Apr 25 22:23:00 OPNsense UNKNOWN[10427]: removing /var/run/radvd.pid
Apr 25 22:23:03 OPNsense kernel: igb1: link state changed to UP
Apr 25 22:23:03 OPNsense configd.py: [897d7623-e7c8-4e2d-bd7f-b1b2c1497d6e] Linkup starting igb1
Apr 25 22:23:03 OPNsense opnsense: /usr/local/etc/rc.linkup: DEVD Ethernet attached event for wan
Apr 25 22:23:03 OPNsense opnsense: /usr/local/etc/rc.linkup: HOTPLUG: Configuring interface wan
Apr 25 22:23:03 OPNsense opnsense: /usr/local/etc/rc.linkup: Accept router advertisements on interface igb1
Apr 25 22:23:03 OPNsense opnsense: /usr/local/etc/rc.linkup: ROUTING: setting IPv4 default route to 192.168.2.1
Apr 25 22:23:04 OPNsense opnsense: /usr/local/etc/rc.linkup: The command '/sbin/route delete -inet 'default'' returned exit code '1', the output was 'route: route has not been found delete net default fib 0: not in table'
Apr 25 22:23:07 OPNsense opnsense: /usr/local/etc/rc.newwanipv6: rc.newwanipv6: Informational is starting igb1.
Apr 25 22:23:07 OPNsense opnsense: /usr/local/etc/rc.newwanipv6: rc.newwanipv6: on (IP address: fe80::217:3fff:febe:a21d) (interface: wan) (real interface: igb1).
Apr 25 22:23:10 OPNsense opnsense: /usr/local/etc/rc.newwanipv6: The command '/sbin/route delete -host 192.168.2.1' returned exit code '1', the output was 'route: route has not been found delete host 192.168.2.1 fib 0: not in table'
Apr 25 22:23:10 OPNsense opnsense: /usr/local/etc/rc.newwanipv6: The command '/sbin/route delete -host 8.8.8.8' returned exit code '1', the output was 'route: route has not been found delete host 8.8.8.8 fib 0: not in table'
Apr 25 22:23:10 OPNsense opnsense: /usr/local/etc/rc.newwanipv6: ROUTING: setting IPv4 default route to 192.168.2.1
Apr 25 22:23:10 OPNsense configd.py: [8d2bfedd-ecf1-403d-a361-aa63f27621ec] updating dyndns GW_WAN
Apr 25 22:23:11 OPNsense configd.py: [557139d3-e4f7-4dc3-b655-7e38388b6976] updating rfc2136 GW_WAN
Apr 25 22:23:11 OPNsense configd.py: [b519c996-1d03-49aa-9049-361ca6457316] Restarting ipsec tunnels
Apr 25 22:23:11 OPNsense configd.py: [6274acfb-b837-43fe-b148-788e960e93db] Restarting OpenVPN tunnels/interfaces GW_WAN
Apr 25 22:23:11 OPNsense configd.py: [4a6b68ba-78e4-4bb1-946e-847ee4f42255] Reloading filter
Apr 25 22:23:12 OPNsense configd.py: [f476d51c-71d5-47ef-93f6-2280408e2197] updating dyndns wan
Apr 25 22:23:13 OPNsense configd.py: [06014b09-7e86-4fb2-aa08-4e7ef3f51abe] updating rfc2136 wan
Apr 25 22:23:14 OPNsense opnsense: /usr/local/etc/rc.filter_configure: Could not find IPv6 gateway for interface(wan).
Apr 25 22:23:14 OPNsense opnsense: /usr/local/etc/rc.filter_configure: Could not find IPv6 gateway for interface(wan).
Apr 25 22:23:15 OPNsense configd.py: [9c8020e5-c3aa-44a3-91c2-6e4858bf2214] Reloading filter
Apr 25 22:23:18 OPNsense opnsense: /usr/local/etc/rc.filter_configure: Could not find IPv6 gateway for interface(wan).
Apr 25 22:23:18 OPNsense opnsense: /usr/local/etc/rc.filter_configure: Could not find IPv6 gateway for interface(wan).
Here too, WAN (igb1) had to be restarted manually; no traffic was possible (neither IPv4 nor IPv6).

3.) System HALT: No logs or console messages
No interaction is possible anymore (ssh, console, BMC console) and there is no system activity (no flashing disk LED); the BMC heartbeat still works. A simple hard restart gets everything working again until the next time.

I would like to analyze this further but cannot figure out why the router advertisements stop or what triggers the watchdog for the WAN interface. (Although the log shows igb1 restarting, it does NOT pass traffic afterwards and needs to be restarted once more manually.)
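
One way to narrow this down might be to pull the watchdog and link-state events out of the system log and measure how long the link stayed down each time. The following is only a minimal sketch, assuming syslog lines in the format quoted above. The names `parse_events` and `downtimes` are hypothetical, the regex only matches the two kernel messages shown in this post, and the year has to be supplied because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Matches kernel lines such as:
#   Apr 25 22:22:59 OPNsense kernel: igb1: Watchdog timeout -- resetting
#   Apr 25 22:22:59 OPNsense kernel: igb1: link state changed to DOWN
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"(?P<nic>\w+):\s+"
    r"(?P<msg>Watchdog timeout|link state changed to (?:UP|DOWN))"
)


def parse_events(lines, year=2017):
    """Return (timestamp, nic, message) tuples for the matching lines.

    `year` is an assumption: syslog timestamps omit it.
    """
    events = []
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue
        ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
        events.append((ts, m.group("nic"), m.group("msg")))
    return events


def downtimes(events, nic):
    """Return the seconds between each DOWN and the next UP for one NIC."""
    result = []
    down_since = None
    for ts, name, msg in events:
        if name != nic:
            continue
        if msg.endswith("DOWN"):
            down_since = ts
        elif msg.endswith("UP") and down_since is not None:
            result.append((ts - down_since).total_seconds())
            down_since = None
    return result
```

Called as `downtimes(parse_events(log_lines), "igb1")` on the log text, this lists each outage duration; comparing the event timestamps with other log entries may hint at what precedes a watchdog timeout.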

I also checked memory consumption over time and the CPU load pattern: no findings.

I am running out of ideas on how to analyze this further. Does anybody have any further ideas? Looking forward to your replies.

Br br

PS: This is also why I want to get this solved (https://forum.opnsense.org/index.php?topic=5019.0): to have a fully working console ...
Title: Re: Systems suddenly HALTS - Stability issue since 17.1
Post by: franco on April 26, 2017, 09:12:39 am
It looks like an igb issue on FreeBSD 11.0. I don't have high hopes for fixes in 11.1; nothing of importance is being committed. The igb driver for FreeBSD 12.0, which is 1-2 years away, looks very different now, so there is no easy way to backport it. It seems a bit like history is repeating itself, with a similar networking rework to the one that went on for 10 -> 11. :/
Title: Re: Systems suddenly HALTS - Stability issue since 17.1
Post by: bringha on April 26, 2017, 09:33:58 am
Hi Franco,

This is really bad news! Does this mean that 12.0 will not be available in OPNsense for a year or so?! And this definitely limits xxSense usage in production environments, right?

Intel NICs have one of the biggest market shares in firewall HW. The fact that the re driver is also still in a stabilization phase makes the situation even more limited ...

The igb driver: is it a contribution from Intel or FreeBSD community work? In either case, might getting Intel involved be helpful?!?

Br br

Title: Re: Systems suddenly HALTS - Stability issue since 17.1
Post by: bringha on April 28, 2017, 04:56:49 pm
Hello,

Now, for the first time, I could get output on the console when my WAN interface suddenly went down (see attachment). Is there a way to fine-tune some config parameters to at least reduce the impact, if it cannot be solved?

Br br
Title: Re: Systems suddenly HALTS - Stability issue since 17.1
Post by: bringha on April 29, 2017, 10:24:01 am
Hello together,

After a long night of testing, chats and communications, there is some progress: the error can now be provoked/reproduced.

I am most probably affected by this: https://forum.pfsense.org/index.php?topic=98230.0 ::). A motherboard replacement with the new Supermicro HW revision is being ordered.

I will give an update once I have it and have done the necessary stability verification.

Br br