[SOLVED] WAN stops working randomly?

Started by Dowd, June 03, 2024, 06:39:22 PM

June 03, 2024, 06:39:22 PM Last Edit: June 07, 2024, 02:44:12 AM by Dowd
I am currently running memtest86 to rule out bad RAM, and I will also do a fresh install of OPNSense without Proxmox to see if Proxmox is the culprit. In the meantime, I am hoping someone has ideas I can add to my to-do list.

Full disclosure: prior to this new system I had a Lenovo RS140 server that also ran Proxmox + OPNSense and did not experience any sudden disconnects on the WAN. I also used a small Ubiquiti EdgeRouter and experienced no disconnects with it either.

I am having an issue where OPNSense randomly stops reaching anything over the WAN. It gets its DHCP IP from Verizon and works, then after several hours it just stops working. The IP is still there, but I can't ping anything outside the LAN network: the WAN gateway, 1.1.1.1, etc. None of it works. If I go to Overview, navigate to the WAN interface, and hit the reload button, it works again immediately, but then it breaks again in a few hours.

I am running OPNSense 24.1, everything is up to date. My specs are

i9-9900k
128GB RAM (4GB allocated to OPNSense)
E3C246D4U2-2T Motherboard (BIOS and Firmware up to date)
2 x Intel X550-AT2 RJ45

LAN works fine and I am not seeing any issues connecting there; it is solely the WAN that disconnects. I have tried replacing the cable, same result.

EDIT: Latest updates:

  • Memtest86 test passed after 24 hours.
  • Static Routes do exist during the timeout, before the timeout, after the timeout
  • Proxmox does not report any interface issues via dmesg | grep eno1 (the WAN interface)
  • I reloaded the WAN at 10:25 AM. At approx. 12:20 PM (~2 hours later), the WAN died again.
  • Setting net.link.ether.inet.max_age did not make a difference; it still dies after 2 hours.
  • Clean install of 24.1 and clean install of 23.1 made no difference.
  • Logs show <3>arp: d0:50:99:f6:xx:xx is using my IP address <WANIP> on vtnet0! However, that MAC address does not belong to me: my MAC addresses start with d0:50:99:fc, not d0:50:99:f6.

EDIT: I believe I found the solution for future reference, documented in the comment here - https://forum.opnsense.org/index.php?topic=40856.msg200606#msg200606

Next time, check under System > Routes > Status whether you have a default route set.

Some devices (modems, etc.) stop responding if they have not received an ARP request for a couple of minutes. The ARP cache lifetime of BSD-based routers (such as OPNSense) is longer than that.

Try adding net.link.ether.inet.max_age=120 to tunables, which forces the router to re-arp every two minutes and often solves this issue.
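For anyone wanting to confirm the tunable actually took effect, here is a minimal sketch. It assumes a FreeBSD-based shell on the firewall where `sysctl` exposes `net.link.ether.inet.max_age`; the helper name `arp_max_age_check` is made up for illustration.

```shell
# Hypothetical helper: compare the ARP cache lifetime against a target value.
# Assumes a FreeBSD-based host. On other systems the sysctl does not exist,
# and the function reports that instead of failing.
arp_max_age_check() {
    target="${1:-120}"
    current="$(sysctl -n net.link.ether.inet.max_age 2>/dev/null)" || current=""
    if [ -z "$current" ]; then
        echo "net.link.ether.inet.max_age not available on this system"
    elif [ "$current" -le "$target" ]; then
        echo "ok: max_age=$current"
    else
        echo "max_age=$current, expected <= $target"
    fi
}

# Usage on the firewall:
#   arp_max_age_check 120
# To change the value until the next reboot (as root):
#   sysctl net.link.ether.inet.max_age=120
```

A value set with `sysctl` at runtime does not survive a reboot, which is why the suggestion is to add it under tunables.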

Topton 4 x i225-v (Core i5-1135G7 * 32GB * 512SSD)
Xfinity Gigabit (1.2G Down * 200M Up)

Quote from: bestboy on June 03, 2024, 09:06:19 PM
Next time, check under System > Routes > Status whether you have a default route set.

I checked: the routes exist during the timeout and after the timeout, and there's no change. Shown below.

Quote from: LOTRouter on June 03, 2024, 09:51:58 PM
Some devices (modems, etc.) stop responding if they have not received an ARP request for a couple of minutes. The ARP cache lifetime of BSD-based routers (such as OPNSense) is longer than that.

Try adding net.link.ether.inet.max_age=120 to tunables, which forces the router to re-arp every two minutes and often solves this issue.

I changed it, it still dies in 2 hours. It dies in 2 hours on the dot apparently.

What do the logs say? This seems like a DHCP lease expiring on WAN and not renewing until the interface is force reloaded.

Is there anything in System/Log Files/General around the timeframe that the WAN link becomes unresponsive?
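One way to test the lease theory is to compare the lease timers with the time the WAN dies. The sketch below assumes the WAN uses the stock FreeBSD `dhclient`, which normally keeps its leases in `/var/db/dhclient.leases.<interface>`; that path and the helper name `lease_summary` are assumptions, not something confirmed in this thread.

```shell
# Hypothetical helper: print the lease time and the renew/rebind/expire
# timestamps of the most recent lease block in a dhclient lease file.
lease_summary() {
    awk '
        /^lease/ { buf = "" }    # a new lease block starts: forget the older one
        /dhcp-lease-time|^[ \t]*(renew|rebind|expire) / {
            line = $0
            sub(/^[ \t]+/, "", line)     # drop indentation
            sub(/;[ \t]*$/, "", line)    # drop trailing semicolon
            buf = buf line "\n"
        }
        END { printf "%s", buf }
    ' "$1"
}

# Usage on the firewall (interface name taken from the log line, path assumed):
#   lease_summary /var/db/dhclient.leases.vtnet0
```

If the lease time turns out to be about two hours, that would line up with the WAN dying two hours after each reload; if it is much longer, the lease is probably not the trigger.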

June 05, 2024, 08:27:06 PM #6 Last Edit: June 05, 2024, 08:32:26 PM by Dowd
Quote from: opnfwb on June 05, 2024, 05:30:15 AM
What do the logs say? This seems like a DHCP lease expiring on WAN and not renewing until the interface is force reloaded.

Is there anything in System/Log Files/General around the timeframe that the WAN link becomes unresponsive?

There is! I found it yesterday, but I am not sure how to handle it. I tried dealing with Verizon about it, but no luck.

2024-06-05T17:51:46 Notice kernel <3>arp: d0:50:99:f6:xx:xx is using my IP address xx.xx.127.59 on vtnet0!


The MAC address resolves to ASRock as the vendor, which is the brand of the motherboard I am using, but it does not belong to the physical NIC in Proxmox, nor is it the MAC address of the virtual NIC that's allocated to OPNSense. I am not sure where this MAC address came from or why it keeps taking my OPNSense IP address on renewals.
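To make the comparison less error-prone than eyeballing it, the conflicting MAC can be pulled out of the log line and reduced to its vendor prefix. This is only a sketch: the helper names `conflict_mac` and `oui_of` are invented here, and the pattern accepts `x` because the address in the pasted log line is masked.

```shell
# Hypothetical helper: read log lines on stdin and print the MAC from any
# "arp: <mac> is using my IP address" message.
conflict_mac() {
    sed -n 's/.*arp: \([0-9A-Fa-fx:]*\) is using my IP address.*/\1/p'
}

# Hypothetical helper: first three octets of a MAC (the vendor prefix, OUI).
oui_of() {
    printf '%s\n' "$1" | cut -d: -f1-3
}

# Usage with the line from the log above:
#   line='... kernel <3>arp: d0:50:99:f6:xx:xx is using my IP address xx.xx.127.59 on vtnet0!'
#   mac="$(printf '%s\n' "$line" | conflict_mac)"
#   oui_of "$mac"
```

A matching vendor prefix only says the address comes from the same manufacturer's range; it does not identify which device or port is using it.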

Is the MAC from your modem?
I have the same error on my box (with different MAC/IP) with MAC from my Starlink Router since changing to 24.x, however, this is not true. Starlink Router does hold the MAC but _not_ the IP!?
It is not affecting me in any case as far as I can see, though.

Quote from: apunkt on June 05, 2024, 09:04:08 PM
Is the MAC from your modem?
I have the same error on my box (with different MAC/IP) with MAC from my Starlink Router since changing to 24.x, however, this is not true. Starlink Router does hold the MAC but _not_ the IP!?
It is not affecting me in any case as far as I can see, though.

Thanks for the reply. I don't have any middle-man modem; my wire is Cat6 going from the Verizon ONT to the Proxmox box that runs the OPNSense VM. At this stage, I am considering reinstalling Proxmox from scratch and reinstalling everything, since I don't know where this MAC address is coming from.

How is proxmox configured for the OPNsense NICs? Are you using direct hardware passthrough or is there a virtual switch with virtual interfaces assigned to the OPNsense VM?

If it's using a virtual switch, is there another device on the OPNsense WAN side that is 'stealing' the DHCP address?

June 05, 2024, 11:26:53 PM #10 Last Edit: June 05, 2024, 11:29:32 PM by Dowd
Quote from: opnfwb on June 05, 2024, 10:54:29 PM
How is proxmox configured for the OPNsense NICs? Are you using direct hardware passthrough or is there a virtual switch with virtual interfaces assigned to the OPNsense VM?

If it's using a virtual switch, is there another device on the OPNsense WAN side that is 'stealing' the DHCP address?

I have two physical ports on Proxmox, eno1 and eno2 (a third is IPMI, and I have not plugged anything into it currently). eno1 is wired to the ONT; eno2 is connected to a Cisco switch where all my physical devices are connected. I have two virtual interfaces, vmbr0 and vmbr1: vmbr0 is bridged to eno1 (the ONT box), and vmbr1 is bridged to eno2 (the Cisco switch) and also runs the Proxmox interface.

For the OPNSense VM I have it use vmbr0 and vmbr1. The vmbr0 interface is the WAN (since it's bridged to the ONT box) and vmbr1 is the LAN (since it's bridged to the Cisco switch interface).

I am attaching a screenshot for reference of the proxmox interfaces and the opnsense vm interfaces. (Ignore the fact it says 23.1, I was testing with 23.1 to see if it behaved differently than 24.1)

There is only one other VM and it is under the vmbr1 interface. Nothing else runs on the vmbr0 interface apart from opnsense.
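One thing that might narrow it down is asking the Linux bridge which port it learned the phantom MAC on. The sketch below parses the output of `bridge fdb show` (part of iproute2) on the Proxmox host; the helper name `fdb_ports_for_mac` is made up, and the port names in the comments are only what would be expected on a typical Proxmox setup.

```shell
# Hypothetical helper: read `bridge fdb show` output on stdin and print the
# port(s) on which the given MAC address appears.
fdb_ports_for_mac() {
    awk -v mac="$1" '
        tolower($1) == tolower(mac) && $2 == "dev" { print $3 }
    ' | sort -u
}

# Usage on the Proxmox host (as root), with the full unmasked MAC:
#   bridge fdb show br vmbr0 | fdb_ports_for_mac d0:50:99:f6:xx:xx
# Seeing eno1 would suggest the frames arrive on the physical port facing the
# ONT; a tap interface would point at a VM attached to the bridge.
```

Learned entries age out of the bridge table, so this is best run shortly after the arp message shows up in the log.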

That all seems logical and correct.

The phantom MAC that is stealing the WAN IP: does that MAC coincide with the IPMI interface? I'd want to be extra sure that it isn't somehow popping up on the NIC that Proxmox is using for the WAN bridge and causing all kinds of issues. I know this is apples/oranges, but some newer HP server hardware allows the lights-out functionality to "move" around to different NICs, so I'm not sure if ASRock has a similar function or not.

The other odd thing that might be worth checking is that newer mainboard BIOSes have an option for "install drivers" or some such feature. This usually involves the UEFI firmware pulling an IP address and trying to stage drivers through to the OS. If your mainboard has such a feature, I would suggest turning it off and seeing whether it is also using that eno1 NIC for those attempts. This is a long shot, but I'm just thinking out loud here, trying to help narrow down possibilities.


Quote from: opnfwb on June 06, 2024, 12:09:13 AM
That all seems logical and correct.

The phantom MAC that is stealing the WAN IP: does that MAC coincide with the IPMI interface? I'd want to be extra sure that it isn't somehow popping up on the NIC that Proxmox is using for the WAN bridge and causing all kinds of issues. I know this is apples/oranges, but some newer HP server hardware allows the lights-out functionality to "move" around to different NICs, so I'm not sure if ASRock has a similar function or not.

The other odd thing that might be worth checking is that newer mainboard BIOSes have an option for "install drivers" or some such feature. This usually involves the UEFI firmware pulling an IP address and trying to stage drivers through to the OS. If your mainboard has such a feature, I would suggest turning it off and seeing whether it is also using that eno1 NIC for those attempts. This is a long shot, but I'm just thinking out loud here, trying to help narrow down possibilities.

It doesn't. I ran ip -a on the Proxmox machine and none of the MAC addresses match or come close. The only ones with d0:50:99 are the two eno interfaces, and they don't match the one in the log. The IPMI interface starts with 7e:89, so it's not the same vendor. It's really puzzling where this MAC address is coming from.

Quote from: opnfwb on June 06, 2024, 12:09:13 AM
I know this is apples/oranges, but some newer HP server hardware allows the lights-out functionality to "move" around to different NICs, so I'm not sure if ASRock has a similar function or not.

I am not aware of such a feature, but I will also check the BIOS for any NIC features that could be on and make sure they are off. Unfortunately I am not sure what you mean by lights-out functionality; it's not a term I am familiar with.

Also, if Proxmox is the OS on the host, list the interfaces: $ ip a

Can you check the networking setup of your other VM? If the MAC appears in the list there, then it could be the VM causing the clash. Maybe some software there could also be worth looking at.

Quote from: cookiemonster on June 06, 2024, 12:20:09 AM
Also, if Proxmox is the OS on the host, list the interfaces: $ ip a

Can you check the networking setup of your other VM? If the MAC appears in the list there, then it could be the VM causing the clash. Maybe some software there could also be worth looking at.

Sorry I meant ip a before instead of ip -a. Force of habit. Same result.

I ran ip a on the other VM. It has one interface and it's a virtual one; the MAC address doesn't match. The other interfaces are all related to k8s/docker and also do not match.
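With many k8s/docker interfaces in the list, a scripted check may be easier than reading through ip a by hand. This is a rough sketch that reads the one-line-per-interface output of `ip -o link show`; the helper names `macs_from_ip_link` and `has_mac_prefix` are invented for illustration.

```shell
# Hypothetical helper: read `ip -o link show` output on stdin and print
# "<interface> <mac>" for every Ethernet interface.
macs_from_ip_link() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "link/ether") {
                name = $2
                sub(/:$/, "", name)    # strip the trailing colon from the name
                print name, $(i + 1)
            }
    }'
}

# Hypothetical helper: keep only lines whose MAC starts with the given prefix.
# Exits non-zero when nothing matches.
has_mac_prefix() {
    grep -i -- " $1"
}

# Usage on the Proxmox host and inside each VM:
#   ip -o link show | macs_from_ip_link | has_mac_prefix d0:50:99:f6
```

No output on the host and in every VM would back up the conclusion that the address is not coming from any interface the OS can see.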

The only idea I have so far is to recheck the BIOS for any weird networking that is being done and then reinstall the OS from scratch with OPNSense being the only VM for 24 hours before I start doing anything else. At this stage I am out of ideas since I am baffled.