Losing WAN every 15 minutes [SOLVED, ntopng Network Discovery...]

Started by iorx, August 11, 2024, 10:06:05 AM

Hi!

Need some help to get on the right track here I think.

Losing WAN connection every 15 minutes, as the title says.

I was in contact with the ISP, and they had a theory that matches the times of the events: :00, :15, :30, :45.
The WAN connection uses a static IP address, and according to them they use ARP to assign it, which happens every 15 minutes.
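
To check whether the drops really line up with :00, :15, :30 and :45, the loss times from a ping log can be compared against the nearest quarter hour. This is only a minimal sketch: the timestamps in `sample` are made up for illustration, and the helper names and the 30-second tolerance are assumptions, not anything from the ISP.

```python
from datetime import datetime

def quarter_hour_offsets(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Signed distance in seconds from each timestamp to the nearest quarter hour."""
    offsets = []
    for ts in timestamps:
        t = datetime.strptime(ts, fmt)
        # Seconds elapsed since the last :00/:15/:30/:45 mark.
        secs = (t.minute % 15) * 60 + t.second
        # Past the halfway point, the next quarter hour is closer.
        if secs > 450:
            secs -= 900
        offsets.append(secs)
    return offsets

def looks_periodic(timestamps, tolerance=30):
    """True if every timestamp is within `tolerance` seconds of a quarter hour."""
    return all(abs(o) <= tolerance for o in quarter_hour_offsets(timestamps))

# Hypothetical loss times, for illustration only.
sample = [
    "2024-08-11 10:00:04",
    "2024-08-11 10:15:02",
    "2024-08-11 10:29:58",
    "2024-08-11 10:45:07",
]
print(quarter_hour_offsets(sample))  # [4, 2, -2, 7]
print(looks_periodic(sample))        # True
```

If the offsets drift steadily instead of staying near zero, the cause is more likely a fixed-interval timer than something scheduled on the wall clock.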

They also talked about a setting for ignoring non-matching ARP (?!). This is where I have a hard time understanding what's happening.

Don't know if it's worth anything, but I attached a copy/paste of Wireshark's packet list filtered on ARP and DHCP.
(_ws.col.protocol == "DHCP" || _ws.col.protocol == "ARP"  || _ws.col.protocol == "BOOTP")

Everything is updated to the latest available version as of now.
Running OPNsense as a VM in Hyper-V.

I'm not an expert so no idea if my solution is related to your problem.

One of my onboard NICs went bad and I had this exact problem. I installed a PCIe two-NIC card and that solved my problem.

If you have a spare NIC, might want to try it.

Quote from: iorx on August 11, 2024, 10:06:05 AM
Hi!

Need some help to get on the right track here I think.

Losing WAN connection every 15 minutes, as the title says.


I am having the same issue

Quote from: EASC Support on August 11, 2024, 08:34:46 PM
Quote from: iorx on August 11, 2024, 10:06:05 AM
Hi!

Need some help to get on the right track here I think.

Losing WAN connection every 15 minutes, as the title says.


I am having the same issue

"Great" that I'm not alone :) So, maybe this needs some eyes from a developer perspective.
I'm not sure when it started, though. Do you know when it started for you? I think it was 24.7.1, but I'm not sure.

Can it be related to this kernel-thingy maybe, 24.7 / 24.7.1?
https://forum.opnsense.org/index.php?topic=42066.0

It's all rather vague. Schedules are executed every 15 minutes if used. DHCP renew timings could be 15 minutes. But I don't see any log to support either theory.


Cheers,
Franco

Hi!

Thank you for the response.

I've been in contact with the ISP again, and their recommendation was to lower the ARP timeout. I have set net.link.ether.inet.max_age=120. The problem persists.

On their recommendation, as a troubleshooting step, I was asked to test from another device.
I now have a Windows Server 2022 machine on one of their other static IP addresses to monitor what happens every 15 minutes.

As I understand it, this is a connection type called "Stadsfiber" (Swedish), which translates to "City Fiber". It's a "virtual" "thingy", as multiple ISPs can provide services through this fiber.
If the customer, as in this case, wants static IP addresses, they provide them and use ARP to update and assign them. OPNsense is configured with one of these assigned static addresses.

The problem first showed up on 24.7, with no problem before that, I think. I upgraded all the way to 24.7.1 in one sweep, so I'm not certain whether it was 24.7 or 24.7.1 that introduced the problem.

Please tell me if I can provide more info.

UPDATE while writing this: this is more than strange.
If I have the other machine, a Windows Server 2022 in this case, active on one of their other static addresses, the problem goes away?!? No more 15-minute losses...

I'm working with the support at the ISP right now to try to figure out what's going on.

Hi iorx!
Buy a normal router, configure it according to your network (LAN, WAN), and connect it to your network.
Reinstall OPNsense, but an older version, on your current OPNsense server. Configure it according to your network. Test.

best regards
(Med vänliga hälsningar)
Hugo Cortes V.

Hi and thank you for a somewhat creative suggestion.

I have no problem with this way of running it. I have multiple installations at various sites, all running virtualized (and have done so for years). This site started acting like this with 24.7 or 24.7.1, so I wanted to reach out here to find a solution or to point out a potential problem.
All other sites are running 24.7.1 now but are not on this kind of "multiple ISPs on the same fiber" solution. And they work. ICMP traceroute is broken, though.

To add, and why I thought this may be related: tracert using ICMP is not working either. I referred to and asked about this in another thread here, where people had problems with tracert in the latest version of the BSD kernel/OPNsense.

UPDATE: I forgot that I had a snapshot of the virtual instance from 24.7_9, so I reverted to that.
Traceroute works again, but I think I still see the 15-minute losses.

I've just set up a previous version, 24.1, in parallel on a different WAN IP and am checking whether it experiences the same drop/loss. If that version works, I'll switch to it and wait for a possible fix for 24.7.

Will circle back here with the results.

(everything is done remotely for this site, so I need to be cautious with how I proceed...  :) )

Hi again!

Looks like the problem started with 24.7.

The ISP and fiber deliverer returned with this:
"They could see from their end that your device does not respond to ARP probes sent from their side every 840 seconds."


So something is up with the latest update; this is my low-level amateur conclusion. Any chance that the underlying ARP handling is borked in the latest updates?
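
One detail about the figure quoted by the ISP: 840 seconds is 14 minutes, not 15. As a small arithmetic sketch (assuming the probes really fire at a strict 840-second interval and that the first one lands exactly on a quarter hour, neither of which the ISP confirmed), the probe times would slide one minute earlier against the :00/:15/:30/:45 marks each round:

```python
from math import gcd

PROBE_INTERVAL = 840  # seconds, the figure quoted by the ISP
QUARTER_HOUR = 900    # seconds between :00, :15, :30 and :45

def probe_offsets(count, start=0):
    """Seconds past the most recent quarter hour for each of `count` probes."""
    return [(start + i * PROBE_INTERVAL) % QUARTER_HOUR for i in range(count)]

# Time until a strict 840 s timer lines up with a quarter hour again.
realign = PROBE_INTERVAL * QUARTER_HOUR // gcd(PROBE_INTERVAL, QUARTER_HOUR)

print(PROBE_INTERVAL / 60)  # 14.0
print(probe_offsets(5))     # [0, 840, 780, 720, 660]
print(realign / 3600)       # 3.5
```

So losses that stay on the quarter hour and a probe that runs every 840 seconds do not obviously share the same timer. This does not say which side is responsible; it only suggests comparing exact loss times against the probe times in a capture.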

I checked

https://www.freebsd.org/releases/14.0R/relnotes/
https://www.freebsd.org/releases/14.1R/relnotes/

But nothing came up. A lot of code shifted. I'm not sure where to look.

A packet capture might tell us if it ever responds to ARP on WAN or stops at some point...


Cheers,
Franco
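
For the packet capture suggested above, the question is whether every ARP request for the WAN address gets a reply. Below is a minimal sketch that parses raw Ethernet frames with the standard library only. It assumes the frames have already been read from a capture as `(timestamp, bytes)` pairs (reading the pcap file itself is not shown), that they are untagged (no VLAN header), and the addresses and the `build_arp` helper are placeholders used only to fabricate demo frames.

```python
import socket
import struct

ARP_ETHERTYPE = 0x0806

def parse_arp(frame):
    """Return ARP fields from an untagged Ethernet frame, or None if it is not IPv4 ARP."""
    if len(frame) < 42:
        return None
    if struct.unpack("!H", frame[12:14])[0] != ARP_ETHERTYPE:
        return None
    htype, ptype, hlen, plen, oper = struct.unpack("!HHBBH", frame[14:22])
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return None
    sha, spa, tha, tpa = struct.unpack("!6s4s6s4s", frame[22:42])
    return {
        "op": oper,  # 1 = request, 2 = reply
        "sender_mac": ":".join(f"{b:02x}" for b in sha),
        "sender_ip": socket.inet_ntoa(spa),
        "target_ip": socket.inet_ntoa(tpa),
    }

def unanswered_requests(packets, our_ip, timeout=1.0):
    """Timestamps of ARP requests for our_ip with no reply from our_ip within `timeout` seconds."""
    pending, missed = [], []
    for ts, frame in packets:
        arp = parse_arp(frame)
        if arp is None:
            continue
        # Requests older than the timeout count as missed.
        while pending and ts - pending[0] > timeout:
            missed.append(pending.pop(0))
        if arp["op"] == 1 and arp["target_ip"] == our_ip:
            pending.append(ts)
        elif arp["op"] == 2 and arp["sender_ip"] == our_ip:
            pending.clear()  # one reply answers all outstanding requests
    missed.extend(pending)   # still unanswered when the capture ended
    return missed

def build_arp(op, sender_mac, sender_ip, target_ip, target_mac="00:00:00:00:00:00"):
    """Hypothetical helper: fabricate a 42-byte ARP frame for the demo below."""
    mac = lambda m: bytes.fromhex(m.replace(":", ""))
    dst = b"\xff" * 6 if op == 1 else mac(target_mac)
    eth = dst + mac(sender_mac) + struct.pack("!H", ARP_ETHERTYPE)
    body = struct.pack("!HHBBH", 1, 0x0800, 6, 4, op)
    body += mac(sender_mac) + socket.inet_aton(sender_ip)
    body += mac(target_mac) + socket.inet_aton(target_ip)
    return eth + body

# Placeholder addresses (documentation range), not the real WAN or gateway.
WAN, GW = "192.0.2.220", "192.0.2.194"
WAN_MAC, GW_MAC = "02:00:00:00:00:01", "02:00:00:00:00:02"
demo = [
    (0.0, build_arp(1, GW_MAC, GW, WAN)),
    (0.001, build_arp(2, WAN_MAC, WAN, GW, GW_MAC)),
    (840.0, build_arp(1, GW_MAC, GW, WAN)),  # no reply follows
    (1680.0, build_arp(1, GW_MAC, GW, WAN)),
    (1680.002, build_arp(2, WAN_MAC, WAN, GW, GW_MAC)),
]
print(unanswered_requests(demo, WAN))  # [840.0]
```

Any timestamp in the result marks a probe that went unanswered, which can then be compared with the times the WAN drops.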

arp -na | grep <$waniface>

Do you see something there?

Quote from: doktornotor on August 15, 2024, 10:01:58 AM
arp -na | grep <$waniface>

Do you see something there?

arp -na | grep hn1
? (xxx.yyy.zzz.220) at 00:15:5d:01:99:01 on hn1 permanent [ethernet]
? (xxx.yyy.zzz.194) at 00:00:5e:00:01:5a on hn1 expires in 122 seconds [ethernet]
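
If the table needs to be watched over time, for example to see whether the gateway entry disappears or changes MAC around the drops, the output can be parsed with a small script. This is a sketch based only on the two line shapes shown above ("permanent" and "expires in N seconds"); other forms such as incomplete entries are simply skipped, and the function name is made up.

```python
import re

# Matches lines like:
# ? (192.0.2.1) at 00:11:22:33:44:55 on hn1 expires in 122 seconds [ethernet]
ARP_LINE = re.compile(
    r"^\? \((?P<ip>[^)]+)\) at (?P<mac>[0-9a-f:]+) on (?P<iface>\S+) "
    r"(?:(?P<permanent>permanent)|expires in (?P<expires>\d+) seconds)"
)

def parse_arp_table(text):
    """Parse `arp -na` output into a list of dicts; unrecognized lines are ignored."""
    entries = []
    for line in text.splitlines():
        m = ARP_LINE.match(line.strip())
        if not m:
            continue
        entries.append({
            "ip": m["ip"],
            "mac": m["mac"],
            "iface": m["iface"],
            "permanent": m["permanent"] is not None,
            "expires": int(m["expires"]) if m["expires"] else None,
        })
    return entries

# The two lines from the output above, with the redacted addresses kept as-is.
sample = """\
? (xxx.yyy.zzz.220) at 00:15:5d:01:99:01 on hn1 permanent [ethernet]
? (xxx.yyy.zzz.194) at 00:00:5e:00:01:5a on hn1 expires in 122 seconds [ethernet]
"""
for entry in parse_arp_table(sample):
    print(entry["ip"], entry["mac"], entry["permanent"], entry["expires"])
```

Running this from cron or a loop and logging the result with a timestamp would show whether the gateway entry is still present at the moment the WAN drops.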

Looks just fine, the first one should be your WAN and the other the ISP gateway... Maybe the Hyper-V switch does not act well with whatever ARP they are monitoring. Shrug...

The presumption you make about the ISP and WAN IP is correct.

Yeah, I know there are some sceptical thoughts about the VM/Hyper-V thingy here  :o :). But the problem was introduced recently. The ISP claims no changes on their side. The only change I've made is the upgrade to 24.7, and I had no problem like this before.


Quote from: franco on August 15, 2024, 09:51:04 AM
I checked

https://www.freebsd.org/releases/14.0R/relnotes/
https://www.freebsd.org/releases/14.1R/relnotes/

But nothing came up. A lot of code shifted. I'm not sure where to look.

A packet capture might tell us if it ever responds to ARP on WAN or stops at some point...


Cheers,
Franco

I got a capture of the event, but I'm not sure if it's safe to attach it here. Can I DM you the file, maybe?
In the first post there is a txt file filtered on DHCP and ARP, showing what it looked like in Wireshark.