arp timeouts on VLAN cause connection interruptions

Started by devilkin, November 30, 2020, 03:14:50 PM

Hi,

I've been fighting with an intermittent problem: at intervals, my laptop (wifi) is no longer able to access anything that requires my OPNsense box for routing. Traffic inside the same subnet keeps working, and traffic from a server on the same vlan out over the OPNsense box also keeps working.

At the same time, an mp3 stream which is playing on my chromecast on another VLAN keeps playing without issues.

Tests have concluded that:

  • no wifi issue: connectivity inside the vlan works, chromecast keeps playing
  • no wired networking issue: everything on the wired network keeps working, and my AP is connected to the same core switch in the end
  • no endpoint problem: I can still connect to everything that doesn't require the OPNsense box

So, it seems this has to be *something* on the opnsense box. Since I can connect via backhaul (ssh to server in same vlan, and from there i can connect wired to the opnsense box), I decided to check some things, and I found out that when my laptop's ARP entry expires, my connectivity drops.


thor.home.lan (192.168.xx.xx) at (incomplete) on igb1_vlan134 expired [vlan]


It then takes between 10 secs and several minutes before it starts working again... and the ARP entry is once again filled. Countdown timer 1200secs, and it's reproducible.
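A minimal sketch for spotting this state quickly, assuming the `arp -an` output format shown in this thread (the function name `incomplete_arp` is made up for illustration). It reads the ARP table text on stdin and prints the IP of every entry that is currently unresolved:

```shell
# Print the IP of every ARP entry that is currently unresolved.
# Reads `arp -an` style output on stdin, so it works on live output
# or on a saved snapshot of the table.
incomplete_arp() {
    # Field 2 is "(ip)", field 4 is either a MAC address or "(incomplete)".
    awk '$4 == "(incomplete)" { gsub(/[()]/, "", $2); print $2 }'
}

# Example use on the firewall itself:
#   arp -an | incomplete_arp
```

If the laptop's address shows up here while the laptop is online and in use, the firewall is sending ARP requests that are not being answered (or the answers are not reaching it).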

The host itself has a static dhcp entry in dhcpv4, but I don't think that matters much.

Anybody any idea how I could solve this?

I can reproduce this on multiple wireless clients. Really odd.

I've been playing with the hardware TSO/LRO/CRC and VLAN support, switching that off, but nothing changed.

Please describe the full communication path and the components (including os version) involved. Normally, the timeout of an arp table entry is reset when it is used.
OPNsense 24.7.11_2-amd64

In this case:

* Linux laptop(s) running debian unstable/testing (kernel 5.9.0)
* switches are unifi switch 8's, firmware 4.3.21.11325
* access points are unifi ap-ac-pro's, firmware 4.3.21.11325
* OPNsense is 20.7.5

Flow is linux laptop <-> AP <-> switch <-> opnsense <-> internet (or other vlan).

So, in this scenario your laptop must have an arp entry for the opnsense box to communicate across the subnet. The entry should never expire as long as you continuously exchange ip packets with the opnsense.


  • What happens if you add a static arp entry for testing?
  • You can do a packet capture to identify possible communication issues
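
A rough sketch of such a capture, run on the OPNsense shell and limited to ARP traffic for one client. The function name `watch_arp` and the example arguments are placeholders; use the interface and client IP that show up in your own ARP table:

```shell
# Watch ARP traffic for one host on one interface.
#   -n  do not resolve names
#   -e  print the link-level header, so source/destination MACs are visible
watch_arp() {
    iface="$1"   # interface from the ARP table, e.g. a VLAN interface
    host="$2"    # IP address of the affected client
    tcpdump -n -e -i "$iface" arp and host "$host"
}

# Example use (placeholder values):
#   watch_arp em1 192.168.2.13
```

During an outage, "who-has" requests from the firewall with no matching "is-at" reply would suggest the replies are being lost somewhere between the client and the firewall rather than inside OPNsense.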

Static ARP: problem solved. Not really an acceptable 'fix' though... strange thing is that before I used to have an Unifi USG, and that was replaced with the OPNsense box.

I've also been playing with some Unifi configuration, and now the problem has 'disappeared'... so I'm going to have to wait what happens.

I came across
https://community.ui.com/questions/LAN-to-WLAN-ARP-Issue-with-UAP-AC-LR/9b1b3060-2950-4bb8-b4ec-eaf4442d75bb?page=1
https://forum.netgate.com/topic/157090/periodic-drops

which seem like the thing I was seeing - and I have the same hardware in play :/

If there are no malformed arp responses from the opnsense, it seems to be a firmware issue of the AP

Is there an easy way to pick out malformed arp reqs/replies?

I don't think so. Have you observed such malformed packets (e.g. in a packet capture)?

Not that I can see. I just don't see the traffic *at all* coming into opnsense when i have timeouts

Did you check arp communication with packet dump and wireshark?

I think I am having the same issue. I will try adding a static arp entry on my ubuntu machine connected via wifi and report back.
Mo

In the end it was the access point causing corruption.

Since I'm running the latest beta 5.53 on my APs, the problems seem to have vanished...



My AP is not unifi but will check if there are any fw updates.

I had this happen again, and I had already set a static arp entry on my client side. This did not fix it. However I logged into opnsense via another machine and observed the following:

? (192.168.2.13) at (incomplete) on em1 expired [ethernet]
? (192.168.2.66) at 28:a0:2b:3c:f3:8c on em1 expires in 1166 seconds [ethernet]
? (192.168.2.2) at 00:e0:67:21:e6:07 on em1 permanent [ethernet]
? (192.168.2.192) at c0:f8:da:21:e2:ac on em1 expires in 1189 seconds [ethernet]
? (192.168.2.160) at 74:ac:b9:e0:05:7a on em1 expires in 1150 seconds [ethernet]
? (192.168.2.6) at 52:54:00:36:76:f0 on em1 expires in 1193 seconds [ethernet]
? (192.168.2.4) at 00:d8:61:03:45:cd on em1 expires in 870 seconds [ethernet]
? (192.168.2.58) at 74:81:14:b5:32:86 on em1 expires in 1171 seconds [ethernet]
? (192.168.2.56) at 04:69:f8:31:eb:e3 on em1 expires in 1178 seconds [ethernet]
? (192.168.2.185) at ec:8e:b5:04:dd:8e on em1 expires in 649 seconds [ethernet]
? (192.168.2.63) at 52:54:00:e8:36:5d on em1 expires in 1180 seconds [ethernet]
? (192.168.2.50) at c0:9a:d0:c7:5c:22 on em1 expires in 1118 seconds [ethernet]
? (192.168.2.82) at 58:d3:49:2c:f0:cf on em1 expires in 1150 seconds [ethernet]
? (192.168.2.83) at 58:d3:49:02:31:33 on em1 expires in 1180 seconds [ethernet]
? (192.168.2.16) at 52:54:00:a8:0d:05 on em1 expires in 1028 seconds [ethernet]
? (192.168.2.80) at 58:d3:49:23:0e:01 on em1 expires in 1180 seconds [ethernet]
? (192.168.2.81) at 58:d3:49:22:25:74 on em1 expires in 1151 seconds [ethernet]
? (192.168.2.54) at 64:0b:d7:ee:0e:51 on em1 expires in 1173 seconds [ethernet]
? (192.168.2.22) at 52:54:00:17:48:d0 on em1 expires in 1172 seconds [ethernet]
? (192.168.2.183) at 9c:8e:cd:26:be:56 on em1 expires in 1186 seconds [ethernet]
? (192.168.2.55) at 64:0b:d7:eb:fc:e6 on em1 expires in 1197 seconds [ethernet]
? (192.168.2.21) at 9c:b6:54:be:3e:60 on em1 expires in 1199 seconds [ethernet]
? (192.168.2.181) at 74:ee:2a:5f:c0:60 on em1 expires in 1118 seconds [ethernet]
root@OPNsense:~ # arp -s 192.168.2.13 98:83:89:8A:4F:83
root@OPNsense:~ # arp -a -n
? (192.168.2.13) at 98:83:89:8a:4f:83 on em1 permanent [ethernet]
? (192.168.2.66) at 28:a0:2b:3c:f3:8c on em1 expires in 1176 seconds [ethernet]
? (192.168.2.2) at 00:e0:67:21:e6:07 on em1 permanent [ethernet]
? (192.168.2.192) at c0:f8:da:21:e2:ac on em1 expires in 1150 seconds [ethernet]
? (192.168.2.160) at 74:ac:b9:e0:05:7a on em1 expires in 1173 seconds [ethernet]
? (192.168.2.6) at 52:54:00:36:76:f0 on em1 expires in 1154 seconds [ethernet]
? (192.168.2.4) at 00:d8:61:03:45:cd on em1 expires in 831 seconds [ethernet]
? (192.168.2.58) at 74:81:14:b5:32:86 on em1 expires in 1200 seconds [ethernet]
? (192.168.2.56) at 04:69:f8:31:eb:e3 on em1 expires in 1199 seconds [ethernet]
? (192.168.2.185) at ec:8e:b5:04:dd:8e on em1 expires in 610 seconds [ethernet]
? (192.168.2.63) at 52:54:00:e8:36:5d on em1 expires in 1141 seconds [ethernet]
? (192.168.2.50) at c0:9a:d0:c7:5c:22 on em1 expires in 1169 seconds [ethernet]
? (192.168.2.82) at 58:d3:49:2c:f0:cf on em1 expires in 1111 seconds [ethernet]
? (192.168.2.83) at 58:d3:49:02:31:33 on em1 expires in 1141 seconds [ethernet]
? (192.168.2.16) at 52:54:00:a8:0d:05 on em1 expires in 989 seconds [ethernet]
? (192.168.2.80) at 58:d3:49:23:0e:01 on em1 expires in 1141 seconds [ethernet]
? (192.168.2.81) at 58:d3:49:22:25:74 on em1 expires in 1112 seconds [ethernet]
? (192.168.2.54) at 64:0b:d7:ee:0e:51 on em1 expires in 1164 seconds [ethernet]
? (192.168.2.22) at 52:54:00:17:48:d0 on em1 expires in 1184 seconds [ethernet]
? (192.168.2.183) at 9c:8e:cd:26:be:56 on em1 expires in 1147 seconds [ethernet]
? (192.168.2.55) at 64:0b:d7:eb:fc:e6 on em1 expires in 1158 seconds [ethernet]
? (192.168.2.21) at 9c:b6:54:be:3e:60 on em1 expires in 1199 seconds [ethernet]
? (192.168.2.181) at 74:ee:2a:5f:c0:60 on em1 expires in 1167 seconds [ethernet]
root@OPNsense:~ #


The moment this line was entered: root@OPNsense:~ # arp -s 192.168.2.13 98:83:89:8A:4F:83 everything immediately fixed itself on the client side, so it seems that opnsense is somehow losing/not getting the arp entry from the client.

I am unsure how to troubleshoot this. Adding a static entry on opnsense has fixed the problem and it has not reappeared so far. I don't think it will, since running this command fixed the problem so immediately and obviously.
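
One possible way to narrow it down, sketched under the assumption that the `arp -an` output format matches the listings above: log a timestamp every time the client's entry changes state, then line those times up with a packet capture or the AP logs. The names `arp_state` and `watch_arp_state` are made up for this example, and the 5 second poll interval is arbitrary.

```shell
# Classify one IP in `arp -an` output read from stdin.
# Prints one of: incomplete, permanent, dynamic, missing.
arp_state() {
    awk -v ip="($1)" '
        $2 == ip {
            if ($4 == "(incomplete)") s = "incomplete"
            else if ($0 ~ / permanent /) s = "permanent"
            else s = "dynamic"
        }
        END { print (s == "" ? "missing" : s) }'
}

# Poll the live ARP table and print a timestamped line on every
# state change. Runs until interrupted with Ctrl-C.
watch_arp_state() {
    ip="$1"
    last=""
    while :; do
        now=$(arp -an | arp_state "$ip")
        [ "$now" != "$last" ] && echo "$(date '+%H:%M:%S') $ip $now"
        last="$now"
        sleep 5
    done
}

# Example use on the firewall:
#   watch_arp_state 192.168.2.13
```

Note that this only produces useful output while the entry is dynamic; with a static entry in place the state stays "permanent" and the underlying loss of ARP replies is hidden.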

What should I do? Just always add a static arp entry for affected clients or is there a better way?

Pete