WAN interface flapping with 22.1.2

Started by foxmanb, March 03, 2022, 01:45:18 PM

> Either way, shouldn't the Opnsense GUI prevent you from using overlapping DNS/Gateway Monitors to prevent this?  And why did this work on 21.x and not in 22.x if it's always been the case?

In all the years this code has existed it never prevented that, and it never attempted to create visibility for the situation either. There are four potential sources for static host routes, which can all overlap.

22.1 cleaned up some of the undefined behaviour regarding which route wins, which caused issues with people's setups. Formerly the last one configured won; now the system tries to deduplicate the host routes created for DNS servers and shows them in the GUI (Interfaces: Overview).

There is, however, still no larger picture or any structure in place that ties together static routes, ISP DNS servers, manual DNS servers and gateway monitor routes. The amount of work owed to the initial lack of design is the main reason for that. At least now in 22.1.x we have a new tool, "ifctl", that registers DNS information from ISPs persistently, and the dynamic address scripts no longer try to flood the system with the routes they have just received; that is handled by the main DNS reload code in an orderly fashion, but it still has no relation to implied gateway host routes and static routes set by the user.
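To cross-check what the GUI shows under Interfaces: Overview, a minimal sketch on the shell (the DNS server address below is only an example):

# Show which interface/gateway the host route to a DNS server currently points at
route -n get 75.75.75.75
# Full IPv4 routing table, to spot stray host routes left over from earlier configurations
netstat -rn -f inet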


Cheers,
Franco

June 14, 2022, 10:09:46 AM #136 Last Edit: June 14, 2022, 11:02:52 AM by Davesworld
Quote from: tracerrx on June 14, 2022, 12:43:27 AM
I had this DNS overlap on one device originally, and fixing it definitely made the problem better; however, I still got flapping on the WAN every 2-3 days until I replaced the driver. All of my primary WANs are Comcast (some residential DHCP, others business static), so it's possible that Comcast has made a change to their systems that's sending something funny.

I would think that if there were a driver issue, the interface would fail and never come back up on its own. I would want to see a kernel log entry that points to the driver causing anything. A driver failure is not easy, if even possible, to recover from without a reboot. Drivers just don't automatically reload and bring the interface back up, as far as I know.
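For reference, a minimal way to check for that on a box that flapped (a real driver failure should leave messages here, not just plain up/down notices):

# Kernel messages from the e1000-family drivers since boot
dmesg | grep -iE '^(em|igb)[0-9]+:'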

Rather than throwing things at the problem, I created a diff between the in-kernel source and the latest BSD driver from Intel, which is not that new either. The trouble is that the diff file is 1 MB in size and I cannot attach it here. There is more in common between the two than there isn't.

Here is a snip that even contains the command I used:

diff -Naur /home/dave/src-release/13.1.0/sys/dev/e1000/e1000_80003es2lan.c /home/dave/em-7.7.8/src/e1000_80003es2lan.c
--- /home/dave/src-release/13.1.0/sys/dev/e1000/e1000_80003es2lan.c   2022-05-11 16:59:24.000000000 -0700
+++ /home/dave/em-7.7.8/src/e1000_80003es2lan.c   2020-04-08 08:13:17.000000000 -0700
@@ -1,32 +1,31 @@
/******************************************************************************
-  SPDX-License-Identifier: BSD-3-Clause

-  Copyright (c) 2001-2020, Intel Corporation
+  Copyright (c) 2001-2019, Intel Corporation
   All rights reserved.

Yes, I know that the current kernel is 13.0, but the Intel driver code even in 13.1 is dated 2020, and that's just the copyright; the driver itself goes back much further.

The file is much too long to paste in here. If I could get the attachment size limit raised to 1 MB I could attach it. The + and - lines are what is added to and removed from the old source code in this instance to make the new driver. The in-kernel driver is older than I thought, so there should be no new issues with it. Even the next version newer than the in-kernel one was released in 2016.
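As a workaround for the size limit, the diff could also be compressed before uploading; a sketch, assuming xz is available on the same box (paths copied from the command above):

# Compress the full driver diff so it fits under a typical attachment limit
diff -Naur /home/dave/src-release/13.1.0/sys/dev/e1000 /home/dave/em-7.7.8/src | xz -9 > em-driver-diff.patch.xz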

If we want to throw diffs around maybe start with the most obvious:

Our stable/22.1 branch differences against the main FreeBSD branch with the latest and greatest code for the em(4) driver:

% git diff --stat upstream/main sys/dev/e1000 
sys/dev/e1000/e1000_phy.c |  2 +-
sys/dev/e1000/em_txrx.c   | 13 ++++++++-----
sys/dev/e1000/if_em.c     | 32 +++++++++++++++-----------------
sys/dev/e1000/igb_txrx.c  | 21 ++++++++++++---------
4 files changed, 36 insertions(+), 32 deletions(-)

As such I doubt that the current driver situation gets much better than what we have with FreeBSD 13 right now and additional driver updates even from Intel are out of the question for direct release inclusion (kmod packages can be used but that's all there is).
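For anyone who wants to reproduce that comparison locally, a rough sketch (assuming the public opnsense/src and freebsd/freebsd-src GitHub mirrors):

# Clone the OPNsense kernel source and add upstream FreeBSD as a second remote
git clone -b stable/22.1 https://github.com/opnsense/src
cd src
git remote add upstream https://github.com/freebsd/freebsd-src
git fetch upstream main
# Same comparison as above, limited to the e1000/em(4) driver
git diff --stat upstream/main sys/dev/e1000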


Cheers,
Franco

June 14, 2022, 07:00:45 PM #138 Last Edit: June 14, 2022, 10:21:21 PM by Davesworld
Quote from: franco on June 14, 2022, 11:19:55 AM
If we want to throw diffs around maybe start with the most obvious:

Our stable/22.1 branch differences against the main FreeBSD branch with the latest and greatest code for the em(4) driver:

% git diff --stat upstream/main sys/dev/e1000 
sys/dev/e1000/e1000_phy.c |  2 +-
sys/dev/e1000/em_txrx.c   | 13 ++++++++-----
sys/dev/e1000/if_em.c     | 32 +++++++++++++++-----------------
sys/dev/e1000/igb_txrx.c  | 21 ++++++++++++---------
4 files changed, 36 insertions(+), 32 deletions(-)

As such I doubt that the current driver situation gets much better than what we have with FreeBSD 13 right now and additional driver updates even from Intel are out of the question for direct release inclusion (kmod packages can be used but that's all there is).


Cheers,
Franco

This is the most meaningful diff, thanks for that; I hadn't pulled them from git and run a diff on them myself. Has there ever been any consideration of using deltas? I know there are good reasons for doing a full download and good reasons for using a delta, but a delta can only upgrade from the version immediately preceding the update it provides. Just curious.

As far as Intel goes, they have not updated their out-of-kernel driver source in two years, and there is probably no need to. Nobody has shown me a log that points to the Intel or Broadcom driver (flapping was also reported with Broadcom) when their WAN flaps, and even if the driver were indicated, I just do not see how an interface could bring itself back up once the kernel throws a driver error for that device, so the driver would be the last place I would have looked. I've never seen a NIC module recover the hardware after a kernel error without recycling power to the NIC, which we have no way to do on a running system. All I have seen is igb up and igb down, or em up and em down, with zero kernel driver logs from anyone. My case was simply overlap caused by misconfiguration that is now rightfully caught by the upgraded system, and the WAN hasn't cycled a single time since.
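If anyone who still sees the flapping wants to check this, a sketch of what I would look for (grep patterns are only examples, the raw output is just as useful):

# System log around the flap: link transitions vs. actual driver complaints (watchdog resets etc.)
opnsense-log system | grep -iE 'link state|watchdog'
# Kernel messages since boot for the NIC in question (interface name is just an example)
dmesg | grep -i '^igb0:'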


Quote from: franco on June 14, 2022, 08:03:16 AM
> 22.1 cleaned up some of the undefined behaviour regarding which route wins, which caused issues with people's setups. Formerly the last one configured won; now the system tries to deduplicate the host routes created for DNS servers and shows them in the GUI (Interfaces: Overview).

> At least now in 22.1.x we have a new tool, "ifctl", that registers DNS information from ISPs persistently, and the dynamic address scripts no longer try to flood the system with the routes they have just received; that is handled by the main DNS reload code in an orderly fashion, but it still has no relation to implied gateway host routes and static routes set by the user.

Given the lack of meaningful consistency amongst those reporting this flapping issue, along with the fact these problems were not reported prior to 22.1.x, is it possible that the above-referenced changes are the root cause? If so, and should it be straightforward to back them out with a patch, I'll gladly be a guinea pig.

You can test any 22.1.x and the initial 22.1 to see if something changed there (opnsense-revert). If not, that would point to the FreeBSD 13 kernel. To rule out changes made after 21.7.8 but before the 22.1 release, use 21.7.8 and switch to the development version, which is the same core as 22.1 without the FreeBSD 13 kernel.
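As a rough sketch of the revert procedure (the version strings are only examples, pick whichever points you want to bisect):

# Drop the core package back to the initial 22.1 release, reboot, and watch for flapping
opnsense-revert -r 22.1 opnsense
# If that is stable, move forward one point release at a time and retest after each step
opnsense-revert -r 22.1.1 opnsense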

I still suspect the kernel has a hand in this, which makes it difficult to nail it down to a single change/component.

Going backwards on complex changes such as the DNS registration behaviour is not easily possible due to the larger code changes involved, but it is also not relevant given the ways to pin this down to a clear confirmation (is it core or is it kernel?).


Cheers,
Franco

Quote from: franco on June 15, 2022, 09:32:43 PM

I still suspect the kernel has a hand in this, which makes it difficult to nail it down to a single change/component.

Going backwards on complex changes such as the DNS registration behaviour is not easily possible due to the larger code changes involved, but it is also not relevant given the ways to pin this down to a clear confirmation (is it core or is it kernel?).


Cheers,
Franco

Well, the kernel is temporary. I wonder who else besides me had this issue caused purely by DNS overlap and is no longer flapping now? I've been solid since I got rid of the overlap. I also wonder whether those who used compiled out-of-kernel drivers and say it stopped flapping did not also make configuration changes that may have actually fixed it by removing the overlap. The other thing that raises eyebrows is that only that one interface was involved, not the others, even though many people have igb NICs on all interfaces and they all use that same driver module.

> Well, the kernel is temporary.

This doesn't make any sense.


Cheers,
Franco

June 16, 2022, 06:23:03 PM #143 Last Edit: June 16, 2022, 06:26:09 PM by subivoodoo
Hi everybody,

My case could probably help here???

- IDP with IPS mode on + MAC spoofing
- 2 different systems, the issue is reproducible every time (also with a fresh, clean OPNsense install):
* Intel igc driver (on a testing VM with NIC passthrough)
or
* Intel ixl driver (on my live firewall hardware)

My testing results:
- ixl driver compiled from Intel's newest source => still flapping
- tested 22.7.pre3 => still flapping
- opnsense-revert -r 22.1.1 opnsense => FIXED my issues!!!
- changed from Intel to Realtek NIC => FIXED my issues on all OPNsense versions!!!

Other known working workarounds:
- Remove MAC spoofing
or
- Disable IPS mode

So my guesses:
It isn't an explicit driver issue, but it depends on certain NICs and/or drivers plus the kernel, or on changes after 22.1.1... so some complex combination  :(
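To document exactly which driver binds to which NIC in reports like this, a quick sketch using stock FreeBSD tooling:

# List PCI devices with the attached driver instance (igc0, ixl0, ...) and the vendor/device strings
pciconf -lv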

Regards

Thanks for adding data points! The only changes in 22.1.2 that seem to be relevant at first glance are:

o interfaces: simplify device destroy code https://github.com/opnsense/core/commit/84cd38adb558
o interfaces: avoid use legacy_get_interface_addresses() in MAC address read https://github.com/opnsense/core/commit/13388839e7e

But upon inspection it doesn't look like these could change the rules of MAC address assignments in terms of making links flap. Second opinions?
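For anyone on 22.1.2 or later who wants to rule these two commits out: as far as I know opnsense-patch can fetch a core commit by hash, and applying the same patch a second time reverses it (treat that as an assumption and keep a configuration backup handy):

# Apply (or, where 22.1.2 already contains them, re-apply to reverse) the two commits in question
opnsense-patch 84cd38adb558
opnsense-patch 13388839e7e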


Cheers,
Franco

Hi Franco,

I can guarantee you 100% that with the settings (see documentation screenshots from my testing VM attached) everything worked fine up to and including 22.1.1. From 22.1.2 on it just doesn't work anymore on any of my Intel NICs with IPS + MAC spoofing.

I can also do/send you more logs if needed... the issue is easily reproducible for me in the test VM.

Greetings

We're definitely missing some sort of logging information from "opnsense-log system" and "opnsense-log gateways" at the time of the link events. Some script has to be responsible for, or at least reacting to, the link up, which makes this worse than before.
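Something along these lines, captured right after a link event, would be the interesting part (the grep is only there to narrow the output, the raw logs are fine too):

# System log entries around the time of the link event
opnsense-log system | grep -i 'link'
# Gateway/dpinger events for the same window
opnsense-log gateways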


Cheers,
Franco

Hi again,

Logs after IPS enabled attached, Gateway log is/was empty.

Regards

Logs attached; ignore the DHCP error, that is a known issue with Starlink currently. Here is a Comcast (with static IP) flap... no MAC spoofing, IDS enabled on LAN, IPS disabled.


2022-06-14T20:52:01-04:00 Notice /update_tables.py remove old alias __automatic_3a953935_0
2022-06-14T20:51:39-04:00 Error opnsense /usr/local/etc/rc.filter_configure: ROUTING: creating /tmp/igb0_defaultgw using 'REDACTED'
2022-06-14T20:51:39-04:00 Error opnsense /usr/local/etc/rc.filter_configure: ROUTING: removing /tmp/igb3_defaultgw
2022-06-14T20:51:32-04:00 Notice dhclient Creating resolv.conf
2022-06-14T20:51:32-04:00 Error dhclient unknown dhcp option value 0x52
2022-06-14T20:51:24-04:00 Error opnsense /usr/local/etc/rc.filter_configure: Ignore down inet6 gateways : WAN_Comcast_GWv4
2022-06-14T20:51:24-04:00 Error opnsense /usr/local/etc/rc.filter_configure: ROUTING: creating /tmp/igb3_defaultgw using '100.64.0.1'
2022-06-14T20:51:24-04:00 Error opnsense /usr/local/etc/rc.filter_configure: ROUTING: removing /tmp/igb0_defaultgw
2022-06-14T20:51:24-04:00 Error opnsense /usr/local/etc/rc.filter_configure: Ignore down inet gateways : WAN_Comcast_GWv4


2022-06-14T20:51:36-04:00 Notice dpinger GATEWAY ALARM: WAN_Comcast_GWv4 (Addr: 75.75.75.75 Alarm: 0 RTT: 20292us RTTd: 2949us Loss: 10%)
2022-06-14T20:51:36-04:00 Warning dpinger WAN_Comcast_GWv4 75.75.75.75: Clear latency 20292us stddev 2949us loss 10%
2022-06-14T20:51:23-04:00 Notice dpinger GATEWAY ALARM: WAN_Comcast_GWv4 (Addr: 75.75.75.75 Alarm: 1 RTT: 20080us RTTd: 2568us Loss: 11%)
2022-06-14T20:51:23-04:00 Warning dpinger WAN_Comcast_GWv4 75.75.75.75: Alarm latency 20080us stddev 2568us loss 11%

Quote from: franco on June 16, 2022, 08:36:05 AM
> Well, the kernel is temporary.

This doesn't make any sense.


Cheers,
Franco

The version.