Occasional interface flapping on all interfaces

Started by FullyBorked, May 15, 2023, 03:34:36 PM

May 15, 2023, 03:34:36 PM Last Edit: May 15, 2023, 09:07:53 PM by FullyBorked
I've been struggling for a while now with seemingly random flapping of all of my interfaces that sometimes lasts 10-15 minutes and then clears on its own.  I initially thought it was related to the firewall, as all traffic gets blocked during these events.  I don't see any other events in the logs: even at debug level the first entry is the "DEVD attached" event, and after that the log just fills with that and related interface events until the episode is over.  Nothing is logged before it.

I do have a few plugins, so maybe one of those is to blame?  I run Zenarmor on my internal interfaces, Suricata on my DMZ, and Crowdsec on my WAN.  Zenarmor and Suricata do not monitor the same physical interfaces, so there shouldn't be overlap.

I'm getting to my wits' end on solving this. My wife and I work from home, and having the network go down at random, sometimes inopportune, times is starting to cause some tension.

I've attached a few screenshots to add some color, I'd be glad to add anything else that might be helpful just let me know.

Edit: Adding dmesg output; this flapping goes on for pages and pages.  I do see a few entries of eastpect exiting on signal 11.  Not sure what that is, but it makes me wonder if Zenarmor is triggering this somehow or if it's just struggling with the interface flapping.



pid 59481 (eastpect), jid 0, uid 0: exited on signal 11
ix0: link state changed to DOWN
ix0_vlan10: link state changed to DOWN
ix0_vlan11: link state changed to DOWN
ix0_vlan12: link state changed to DOWN
ix0_vlan13: link state changed to DOWN
ix0: link state changed to UP
ix0_vlan10: link state changed to UP
ix0_vlan11: link state changed to UP
ix0_vlan12: link state changed to UP
ix0_vlan13: link state changed to UP
ix0: link state changed to DOWN
ix0_vlan10: link state changed to DOWN
ix0_vlan11: link state changed to DOWN
ix0_vlan12: link state changed to DOWN
ix0_vlan13: link state changed to DOWN
ix0: link state changed to UP
ix0_vlan10: link state changed to UP
ix0_vlan11: link state changed to UP
ix0_vlan12: link state changed to UP
ix0_vlan13: link state changed to UP
193.368040 [ 851] iflib_netmap_config       txr 4 rxr 4 txd 2048 rxd 2048 rbufsz 2048
193.368304 [ 851] iflib_netmap_config       txr 4 rxr 4 txd 2048 rxd 2048 rbufsz 2048
ix0: link state changed to DOWN
ix0_vlan10: link state changed to DOWN
ix0_vlan11: link state changed to DOWN
ix0_vlan12: link state changed to DOWN
ix0_vlan13: link state changed to DOWN
ix0: link state changed to UP
ix0_vlan10: link state changed to UP
ix0_vlan11: link state changed to UP
ix0_vlan12: link state changed to UP
ix0_vlan13: link state changed to UP
ix0: link state changed to DOWN
ix0_vlan10: link state changed to DOWN
ix0_vlan11: link state changed to DOWN
ix0_vlan12: link state changed to DOWN
ix0_vlan13: link state changed to DOWN
ix0: link state changed to UP
ix0_vlan10: link state changed to UP
ix0_vlan11: link state changed to UP
ix0_vlan12: link state changed to UP
ix0_vlan13: link state changed to UP
pid 48117 (eastpect), jid 0, uid 0: exited on signal 11
ix0: link state changed to DOWN
ix0_vlan10: link state changed to DOWN
ix0_vlan11: link state changed to DOWN
ix0_vlan12: link state changed to DOWN
ix0_vlan13: link state changed to DOWN
ix0: link state changed to UP
ix0_vlan10: link state changed to UP
ix0_vlan11: link state changed to UP
ix0_vlan12: link state changed to UP
ix0_vlan13: link state changed to UP
ix0: link state changed to DOWN
ix0_vlan10: link state changed to DOWN
ix0_vlan11: link state changed to DOWN
ix0_vlan12: link state changed to DOWN
ix0_vlan13: link state changed to DOWN
ix0: link state changed to UP
ix0_vlan10: link state changed to UP
ix0_vlan11: link state changed to UP
ix0_vlan12: link state changed to UP
ix0_vlan13: link state changed to UP
223.756165 [ 851] iflib_netmap_config       txr 4 rxr 4 txd 2048 rxd 2048 rbufsz 2048
223.756223 [ 851] iflib_netmap_config       txr 4 rxr 4 txd 2048 rxd 2048 rbufsz 2048
ix0: link state changed to DOWN
ix0_vlan10: link state changed to DOWN
ix0_vlan11: link state changed to DOWN
ix0_vlan12: link state changed to DOWN
ix0_vlan13: link state changed to DOWN
ix0: link state changed to UP
ix0_vlan10: link state changed to UP
ix0_vlan11: link state changed to UP
ix0_vlan12: link state changed to UP
ix0_vlan13: link state changed to UP
ix0: link state changed to DOWN
ix0_vlan10: link state changed to DOWN
ix0_vlan11: link state changed to DOWN
ix0_vlan12: link state changed to DOWN
ix0_vlan13: link state changed to DOWN
ix0: link state changed to UP
ix0_vlan10: link state changed to UP
ix0_vlan11: link state changed to UP
ix0_vlan12: link state changed to UP
ix0_vlan13: link state changed to UP


A few more screenshots (second post due to size limit...)

Since I have Zenarmor on my Wireguard interface, any chance this is just a bug with the emulated netmap driver?  Any thoughts on better logging or another place I can look for more detail?

As far as I know that is unfortunate behaviour of the adapter detaching to enter or exit netmap mode. Should be the same for native mode.

We were discussing this for the previous project but it didn't match the scope back then.

Depending on how Murat and team view this as an issue to tackle we might start another netmap improvement round. But I'm just theorizing here.


Cheers,
Franco

Quote from: franco on May 17, 2023, 09:00:25 PM
As far as I know that is unfortunate behaviour of the adapter detaching to enter or exit netmap mode. Should be the same for native mode.

We were discussing this for the previous project but it didn't match the scope back then.

Depending on how Murat and team view this as an issue to tackle we might start another netmap improvement round. But I'm just theorizing here.


Cheers,
Franco

Do I understand this to mean this is expected behavior when a service using netmap restarts or has an issue? 

Yes, if the netmap process exits, all devices are moved back into non-netmap mode, which toggles link-down because some hardware flags are set back to defaults. This is actually netmap trying to disable hardware features in order to be able to read packets correctly.


Cheers,
Franco

Quote from: franco on May 17, 2023, 09:26:37 PM
Yes, if the netmap process exits, all devices are moved back into non-netmap mode, which toggles link-down because some hardware flags are set back to defaults. This is actually netmap trying to disable hardware features in order to be able to read packets correctly.


Cheers,
Franco

I guess the real question in my instance is determining what is causing netmap mode changes.  I would assume a change would only cause an up down up, not flapping for 2-3 minutes. 

Well it starts with

> pid 59481 (eastpect), jid 0, uid 0: exited on signal 11

crashing/exiting and the rest is driver and interface related. A lot of interfaces might take more time being reconfigured making it more likely to cause a cascade of this down/up sequence if that takes too long to process properly.

eastpect restarting might not help either: as it tries to get back into netmap mode, it causes another down/up at the same time.


Cheers,
Franco
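
One way to check the "crash first, flapping after" reading against a saved dmesg dump is to attribute each link DOWN event to the most recent crash line. The following is only a rough sketch: the two regular expressions are modelled on the lines pasted in this thread, the function name flaps_per_crash is made up for illustration, and the sample is a shortened copy of the output above rather than a full capture.

```python
import re
from collections import Counter

# Patterns modelled on the dmesg lines quoted in this thread (an assumption:
# other drivers or releases may word these messages differently).
LINK_RE = re.compile(r"^(\S+): link state changed to (UP|DOWN)$")
CRASH_RE = re.compile(
    r"^pid (\d+) \((\S+)\), jid \d+, uid \d+: exited on signal (\d+)$"
)


def flaps_per_crash(lines):
    """Return [(pid, process, signal, Counter of DOWN events per interface)].

    Every DOWN event is charged to the most recent crash line seen before it.
    DOWN events that appear before any crash line are ignored.
    """
    episodes = []
    for raw in lines:
        line = raw.strip()
        crash = CRASH_RE.match(line)
        if crash:
            episodes.append(
                (int(crash.group(1)), crash.group(2), int(crash.group(3)), Counter())
            )
            continue
        link = LINK_RE.match(line)
        if link and link.group(2) == "DOWN" and episodes:
            episodes[-1][3][link.group(1)] += 1
    return episodes


# Shortened sample in the same shape as the dmesg output posted above.
sample = """\
pid 59481 (eastpect), jid 0, uid 0: exited on signal 11
ix0: link state changed to DOWN
ix0_vlan10: link state changed to DOWN
ix0: link state changed to UP
ix0_vlan10: link state changed to UP
ix0: link state changed to DOWN
ix0_vlan10: link state changed to DOWN
ix0: link state changed to UP
ix0_vlan10: link state changed to UP
pid 48117 (eastpect), jid 0, uid 0: exited on signal 11
ix0: link state changed to DOWN
ix0: link state changed to UP
""".splitlines()

episodes = flaps_per_crash(sample)
for pid, proc, sig, downs in episodes:
    print(pid, proc, "signal", sig, dict(downs))
```

If every burst of DOWN events turns out to sit behind a crash line, that supports the crash being the trigger; DOWN events with no preceding crash would point somewhere else.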

This problem is starting to wear on me; does anyone have any other thoughts on how to track this down?  Currently sitting here waiting on my vlans to come back online, watching the unbound_dhcp service flap, watching interfaces flap.  I don't know what's happening.


I did run across this post, https://forum.opnsense.org/index.php?topic=26583.0, seems to be nearly identical to my issue.  So is the fix to just not run applications needing netmap (i.e., zenarmor, suricata) on interfaces that have vlans trunked?  Are there any tunables or best practices to improve this? 

As a test I'm going to remove my wireguard interface so that the emulated driver isn't being used; that way only the native driver is being utilized.  I don't know that it will fix anything, but it will help narrow down why this became so frequent of late.  I'm not sure if it was the implementation of wireguard or the latest update, but it's become a nearly daily occurrence now.

Well... removing wireguard to try and use only the native netmap driver didn't correct the issue.  I guess let's try using emulated for everything, maybe the emulated driver fixes will help me here. 

Doubt anyone is following along, but I've now gone the longest period of time in ages without any interface flapping after moving fully to the emulated driver.  I'm up roughly a week now, where before I rarely made it 24 hours.  Will continue to monitor, but so far this looks promising.


Finally flapped again, but it's much rarer on the emulated driver.  I do see the "eastpect ... exited on signal 11" lines wrapped up in the latest flapping in dmesg.  Still can't tell if the flapping is caused by Zenarmor crashing or restarting, or if the flapping is causing the Zenarmor service to suffer.

Here is a snip of the log.


228.835337 [1173] generic_netmap_attach     Emulated adapter for wg0 created (prev was NULL)
228.835361 [1078] generic_netmap_dtor       Emulated netmap adapter for wg0 destroyed
228.835421 [1173] generic_netmap_attach     Emulated adapter for wg0 created (prev was NULL)
228.957825 [ 321] generic_netmap_register   Emulated adapter for wg0 activated
228.960569 [1173] generic_netmap_attach     Emulated adapter for ix0 created (prev was ix0)
228.960590 [1070] generic_netmap_dtor       Native netmap adapter for ix0 restored
228.960607 [1078] generic_netmap_dtor       Emulated netmap adapter for ix0 destroyed
228.960708 [1173] generic_netmap_attach     Emulated adapter for ix0 created (prev was ix0)
228.960797 [ 321] generic_netmap_register   Emulated adapter for ix0 activated
283.117371 [1173] generic_netmap_attach     Emulated adapter for igb1 created (prev was igb1)
283.117394 [1070] generic_netmap_dtor       Native netmap adapter for igb1 restored
283.117411 [1078] generic_netmap_dtor       Emulated netmap adapter for igb1 destroyed
283.118795 [1173] generic_netmap_attach     Emulated adapter for igb1 created (prev was igb1)
283.118850 [ 321] generic_netmap_register   Emulated adapter for igb1 activated
pid 19676 (eastpect), jid 0, uid 0: exited on signal 11
101.698911 [ 296] generic_netmap_unregister Emulated adapter for ix0 deactivated
101.698959 [1070] generic_netmap_dtor       Native netmap adapter for ix0 restored
101.698976 [1078] generic_netmap_dtor       Emulated netmap adapter for ix0 destroyed
115.720185 [1173] generic_netmap_attach     Emulated adapter for ix0 created (prev was ix0)
115.720288 [1070] generic_netmap_dtor       Native netmap adapter for ix0 restored
115.720371 [1078] generic_netmap_dtor       Emulated netmap adapter for ix0 destroyed
115.720605 [1173] generic_netmap_attach     Emulated adapter for ix0 created (prev was ix0)
115.721066 [ 321] generic_netmap_register   Emulated adapter for ix0 activated
pid 25846 (eastpect), jid 0, uid 0: exited on signal 11
393.730419 [ 296] generic_netmap_unregister Emulated adapter for ix0 deactivated
393.733183 [1070] generic_netmap_dtor       Native netmap adapter for ix0 restored
393.733200 [1078] generic_netmap_dtor       Emulated netmap adapter for ix0 destroyed
407.026824 [1173] generic_netmap_attach     Emulated adapter for ix0 created (prev was ix0)
407.026928 [1070] generic_netmap_dtor       Native netmap adapter for ix0 restored
407.027042 [1078] generic_netmap_dtor       Emulated netmap adapter for ix0 destroyed
407.027289 [1173] generic_netmap_attach     Emulated adapter for ix0 created (prev was ix0)
407.027683 [ 321] generic_netmap_register   Emulated adapter for ix0 activated
pid 46025 (eastpect), jid 0, uid 0: exited on signal 11
894.655312 [ 296] generic_netmap_unregister Emulated adapter for ix0 deactivated
894.655371 [1070] generic_netmap_dtor       Native netmap adapter for ix0 restored
894.655390 [1078] generic_netmap_dtor       Emulated netmap adapter for ix0 destroyed
907.750498 [1173] generic_netmap_attach     Emulated adapter for ix0 created (prev was ix0)
907.750606 [1070] generic_netmap_dtor       Native netmap adapter for ix0 restored
907.750691 [1078] generic_netmap_dtor       Emulated netmap adapter for ix0 destroyed
907.750924 [1173] generic_netmap_attach     Emulated adapter for ix0 created (prev was ix0)
907.751373 [ 321] generic_netmap_register   Emulated adapter for ix0 activated
igb2: link state changed to DOWN
igb2: link state changed to UP
igb2: link state changed to DOWN
igb2: link state changed to UP
igb2: link state changed to DOWN
igb2: link state changed to UP
ix1: link state changed to DOWN
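
To put a number on how long ix0 stays out of netmap mode after each eastpect exit in the snip above, the "deactivated" and "activated" lines can be paired up. This is a hedged sketch, not a tested tool: it assumes the leading number on the netmap lines is a seconds counter that wraps at 1000 (my understanding of the netmap log prefix, not something confirmed in this thread), so a gap longer than that would be reported incorrectly. The function name restart_gaps and the wrap parameter are made up for illustration.

```python
import re

# Matches only the activate/deactivate lines of the emulated adapter, in the
# format seen in the log snip above.
EVENT_RE = re.compile(
    r"^(\d+\.\d+) \[\s*\d+\] \S+\s+Emulated adapter for (\S+) (activated|deactivated)$"
)


def restart_gaps(lines, wrap=1000.0):
    """Return [(interface, seconds)] between a 'deactivated' line and the
    next 'activated' line for the same interface.

    ASSUMPTION: the timestamp wraps at `wrap` seconds, so the difference is
    taken modulo `wrap`.
    """
    pending = {}  # interface -> timestamp of its last 'deactivated' line
    gaps = []
    for raw in lines:
        m = EVENT_RE.match(raw.strip())
        if not m:
            continue
        stamp, ifname, state = float(m.group(1)), m.group(2), m.group(3)
        if state == "deactivated":
            pending[ifname] = stamp
        elif ifname in pending:
            gaps.append((ifname, round((stamp - pending.pop(ifname)) % wrap, 3)))
    return gaps


# Lines copied from the first crash in the snip above.
sample = [
    "pid 19676 (eastpect), jid 0, uid 0: exited on signal 11",
    "101.698911 [ 296] generic_netmap_unregister Emulated adapter for ix0 deactivated",
    "101.698959 [1070] generic_netmap_dtor       Native netmap adapter for ix0 restored",
    "115.720605 [1173] generic_netmap_attach     Emulated adapter for ix0 created (prev was ix0)",
    "115.721066 [ 321] generic_netmap_register   Emulated adapter for ix0 activated",
]

gaps = restart_gaps(sample)
print(gaps)
```

Under that wrap assumption, the three eastpect exits in the snip each leave ix0 outside netmap mode for roughly 13 to 14 seconds before the emulated adapter is activated again.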


Had some arp issue for the WAN in the log that I haven't seen before; not sure if it's related or not.  Lots of the below line spammed in dmesg.


arpresolve: can't allocate llinfo for X.X.X.X on ix1