[SOLVED] Random network drops on WAN

Started by Baez, December 18, 2024, 06:46:02 AM

December 18, 2024, 06:46:02 AM Last Edit: February 18, 2025, 01:31:21 AM by Baez Reason: mark as resolved
This issue has been really perplexing. It's an OPNsense 24.7 bare-metal installation on a new box with an Intel X710-BM2 NIC installed. ixl0 is WAN and ixl1 is LAN. Internet access comes through a fiber line to a Nokia GPON plugged into an Adtran 854-v6 router.

Initially, after about 2 days of uptime, I started seeing random "Page not found" errors while browsing. Then a few seconds later, the page would work as expected. Pinging would then show this:

❯ ping cloudflare.com
PING cloudflare.com (104.16.132.229): 56 data bytes
64 bytes from 104.16.132.229: icmp_seq=0 ttl=58 time=3.580 ms
64 bytes from 104.16.132.229: icmp_seq=1 ttl=58 time=3.743 ms
64 bytes from 104.16.132.229: icmp_seq=2 ttl=58 time=4.048 ms
64 bytes from 104.16.132.229: icmp_seq=3 ttl=58 time=4.007 ms
64 bytes from 104.16.132.229: icmp_seq=4 ttl=58 time=3.517 ms
64 bytes from 104.16.132.229: icmp_seq=5 ttl=58 time=4.015 ms
64 bytes from 104.16.132.229: icmp_seq=6 ttl=58 time=3.692 ms
64 bytes from 104.16.132.229: icmp_seq=7 ttl=58 time=3.825 ms
64 bytes from 104.16.132.229: icmp_seq=8 ttl=58 time=3.690 ms
64 bytes from 104.16.132.229: icmp_seq=9 ttl=58 time=3.876 ms
64 bytes from 104.16.132.229: icmp_seq=10 ttl=58 time=3.835 ms
64 bytes from 104.16.132.229: icmp_seq=11 ttl=58 time=3.588 ms
64 bytes from 104.16.132.229: icmp_seq=12 ttl=58 time=4.006 ms
64 bytes from 104.16.132.229: icmp_seq=13 ttl=58 time=3.575 ms
64 bytes from 104.16.132.229: icmp_seq=14 ttl=58 time=3.700 ms
Request timeout for icmp_seq 15
Request timeout for icmp_seq 16
Request timeout for icmp_seq 17
Request timeout for icmp_seq 18
64 bytes from 104.16.132.229: icmp_seq=19 ttl=58 time=3.447 ms
64 bytes from 104.16.132.229: icmp_seq=20 ttl=58 time=3.572 ms
64 bytes from 104.16.132.229: icmp_seq=21 ttl=58 time=3.996 ms
64 bytes from 104.16.132.229: icmp_seq=22 ttl=58 time=3.658 ms
64 bytes from 104.16.132.229: icmp_seq=23 ttl=58 time=3.275 ms
64 bytes from 104.16.132.229: icmp_seq=24 ttl=58 time=3.561 ms
64 bytes from 104.16.132.229: icmp_seq=25 ttl=58 time=3.640 ms
64 bytes from 104.16.132.229: icmp_seq=26 ttl=58 time=3.782 ms
64 bytes from 104.16.132.229: icmp_seq=27 ttl=58 time=3.657 ms
64 bytes from 104.16.132.229: icmp_seq=28 ttl=58 time=3.647 ms
64 bytes from 104.16.132.229: icmp_seq=29 ttl=58 time=3.482 ms
64 bytes from 104.16.132.229: icmp_seq=30 ttl=58 time=3.561 ms
64 bytes from 104.16.132.229: icmp_seq=31 ttl=58 time=3.874 ms
64 bytes from 104.16.132.229: icmp_seq=32 ttl=58 time=3.583 ms
64 bytes from 104.16.132.229: icmp_seq=33 ttl=58 time=3.548 ms
64 bytes from 104.16.132.229: icmp_seq=34 ttl=58 time=3.722 ms
64 bytes from 104.16.132.229: icmp_seq=35 ttl=58 time=3.674 ms
64 bytes from 104.16.132.229: icmp_seq=36 ttl=58 time=4.147 ms
64 bytes from 104.16.132.229: icmp_seq=37 ttl=58 time=3.761 ms
64 bytes from 104.16.132.229: icmp_seq=38 ttl=58 time=3.583 ms
64 bytes from 104.16.132.229: icmp_seq=39 ttl=58 time=3.526 ms
Request timeout for icmp_seq 40
Request timeout for icmp_seq 41
Request timeout for icmp_seq 42
Request timeout for icmp_seq 43
Request timeout for icmp_seq 44
Request timeout for icmp_seq 45
64 bytes from 104.16.132.229: icmp_seq=46 ttl=58 time=3.183 ms
64 bytes from 104.16.132.229: icmp_seq=47 ttl=58 time=3.276 ms

Despite changing a multitude of settings, I could not work out why this was happening. A few things I tried the first time:

  • Resetting all firewall settings
  • Deleting and recreating the WAN interface
  • Restarting the server
  • Restarting the Adtran (which at the time was supplying internet via DHCP through DMZ)

I finally gave in, decided it must have been a bad installation, and reset to factory defaults. WAN immediately worked without drops.

Two days later the exact same issue came back. I then updated to the latest BIOS and updated the X710's firmware from 9.00 to 9.90 using Intel's FreeBSD tool. I reset to factory defaults again and hoped this would solve it.

Fast forward 3 days and the same problem reared its ugly head again. In the meantime, I called my ISP to switch the Adtran to bridge mode, so the WAN interface now connects over PPPoE.

And here we are today, still with the same issue. Some notes I've made during this endeavour that might help:

  • The longest drop so far has been 12 seconds, and the connection has come back on its own every time
  • The drops are irregular but frequent, with at most 30 seconds of uptime before the next packet loss
  • Pinging an IP results in the same issue, so it doesn't appear to be DNS
  • All local connections are stable, pinging the gateway/DNS server at 192.168.1.1 is always up, same for another device on LAN
  • Pinging external hosts directly from the OPNsense server results in the same timeouts, so I've ruled out everything except the OPNsense server and the Adtran
  • While the Adtran was not in bridge, pinging on a machine plugged into a LAN port on the Adtran showed no issues, connection was stable
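
For anyone wanting to put numbers on notes like these, here is a minimal sketch that groups the timeouts in saved ping output into bursts. It assumes the BSD/macOS-style output format shown in the post and ping's default one-second interval, so each lost reply is roughly one second of outage. The function name `outage_bursts` and the sample text are only for illustration.

```python
import re

# Patterns match the ping output format shown earlier in the post.
SEQ_OK = re.compile(r"icmp_seq=(\d+)")
SEQ_LOST = re.compile(r"Request timeout for icmp_seq (\d+)")


def outage_bursts(ping_output):
    """Return (first_lost_seq, lost_count) for each run of consecutive timeouts."""
    bursts = []
    start = None  # sequence number where the current burst began
    count = 0     # timeouts seen in the current burst
    for line in ping_output.splitlines():
        lost = SEQ_LOST.search(line)
        if lost:
            if start is None:
                start = int(lost.group(1))
                count = 0
            count += 1
        elif SEQ_OK.search(line) and start is not None:
            # A reply arrived, so the burst is over.
            bursts.append((start, count))
            start = None
    if start is not None:
        # Output ended while still in an outage.
        bursts.append((start, count))
    return bursts


# Shortened excerpt of the output from the post.
SAMPLE = """\
64 bytes from 104.16.132.229: icmp_seq=14 ttl=58 time=3.700 ms
Request timeout for icmp_seq 15
Request timeout for icmp_seq 16
Request timeout for icmp_seq 17
Request timeout for icmp_seq 18
64 bytes from 104.16.132.229: icmp_seq=19 ttl=58 time=3.447 ms
64 bytes from 104.16.132.229: icmp_seq=39 ttl=58 time=3.526 ms
Request timeout for icmp_seq 40
Request timeout for icmp_seq 41
Request timeout for icmp_seq 42
Request timeout for icmp_seq 43
Request timeout for icmp_seq 44
Request timeout for icmp_seq 45
64 bytes from 104.16.132.229: icmp_seq=46 ttl=58 time=3.183 ms
"""

print(outage_bursts(SAMPLE))  # [(15, 4), (40, 6)]
```

Running this over a longer capture gives the burst lengths and the gaps between them, which makes it easier to compare before and after a change such as swapping the NIC.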

Any help to get to the root of this problem would be greatly appreciated.

What leads you to believe the problem is on your end? Did you check with other sites?

Here's what it looks like on my end:
PING cloudflare.com (104.16.132.229): 56 data bytes
64 bytes from 104.16.132.229: icmp_seq=0 ttl=58 time=17.415 ms
64 bytes from 104.16.132.229: icmp_seq=1 ttl=58 time=19.508 ms
64 bytes from 104.16.132.229: icmp_seq=2 ttl=58 time=18.514 ms
64 bytes from 104.16.132.229: icmp_seq=3 ttl=58 time=17.107 ms
64 bytes from 104.16.132.229: icmp_seq=4 ttl=58 time=17.424 ms
64 bytes from 104.16.132.229: icmp_seq=5 ttl=58 time=17.326 ms
64 bytes from 104.16.132.229: icmp_seq=6 ttl=58 time=18.613 ms
64 bytes from 104.16.132.229: icmp_seq=7 ttl=58 time=18.646 ms
64 bytes from 104.16.132.229: icmp_seq=8 ttl=58 time=18.685 ms
64 bytes from 104.16.132.229: icmp_seq=9 ttl=58 time=22.029 ms
64 bytes from 104.16.132.229: icmp_seq=10 ttl=58 time=20.799 ms
64 bytes from 104.16.132.229: icmp_seq=11 ttl=58 time=18.936 ms
64 bytes from 104.16.132.229: icmp_seq=12 ttl=58 time=22.456 ms
64 bytes from 104.16.132.229: icmp_seq=13 ttl=58 time=18.974 ms
Request timeout for icmp_seq 14
64 bytes from 104.16.132.229: icmp_seq=15 ttl=58 time=21.874 ms
64 bytes from 104.16.132.229: icmp_seq=16 ttl=58 time=19.129 ms

Not nearly as many problems as in your test, but with an occasional 300 ms response time instead of timeouts.

December 18, 2024, 05:16:02 PM #2 Last Edit: December 19, 2024, 02:46:36 AM by Baez
Quote from: mooh on December 18, 2024, 11:00:09 AM
What leads you to believe the problem is on your end? Did you check with other sites?

Not nearly as many problems as in your test, but with an occasional 300 ms response time instead of timeouts.

Hey mooh, it's not only Cloudflare, but all external connections stop working for that period of time.

It occurred again today. I was at least able to get things back to normal by turning the OPNsense box off, restarting the Adtran, and turning the OPNsense box back on once the Adtran had booted.

I am having a very similar experience with OPNsense right now.

Very, very frustrating. It's just enough of a drop to regularly sever my SSH connection.

I thought for sure it was my Starlink connection, but it isn't. If I run my router by itself, not connected to OPNsense, it's essentially rock solid.

The moment I connect OPNsense, everything goes FUBAR. I can run ping 1.1.1.1 -t and the command will keep running, drop 5-10 pings (just enough to sever an SSH connection), and pick right back up again.

Has anyone figured this out yet? I am basically banging my head against a wall trying to track this down.

Moderately long shot, assuming Ethernet links:

In System: Log Files: General, do you have any entries like this:

2025-01-05T17:01:35-06:00   Notice   kernel   <6>arp: 47.190.83.190 moved from 2e:21:72:1a:39:83 to 90:6c:ac:89:be:8c on bridge0

These can indicate the presence of an ARP proxy (in my case it's my provider gateway at 2e:21:72:1a:39:83) stealin' your connections. Sufficient traffic will un-steal them, as above. I ended up defining static ARP entries for all of my stuff exposed to the proxy.

Also, check Interfaces: Diagnostics: ARP Table for anomalous entries.
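
If the log is long, a small script can pull out those entries. This is only a sketch: it assumes the message format of the sample line above and that you have copied or exported the log text yourself (no log file path is assumed here). The name `find_arp_moves` is made up for the example.

```python
import re

# Matches kernel messages like the sample above:
#   arp: <ip> moved from <old mac> to <new mac> on <interface>
ARP_MOVED = re.compile(
    r"arp: (?P<ip>\S+) moved from (?P<old_mac>[0-9a-f:]{17}) "
    r"to (?P<new_mac>[0-9a-f:]{17}) on (?P<iface>\S+)",
    re.IGNORECASE,
)


def find_arp_moves(log_text):
    """Return one dict per 'arp: ... moved from ... to ...' line in the log text."""
    moves = []
    for line in log_text.splitlines():
        match = ARP_MOVED.search(line)
        if match:
            moves.append(match.groupdict())
    return moves


LOG_LINE = (
    "2025-01-05T17:01:35-06:00   Notice   kernel   <6>arp: 47.190.83.190 "
    "moved from 2e:21:72:1a:39:83 to 90:6c:ac:89:be:8c on bridge0"
)

for move in find_arp_moves(LOG_LINE):
    print(move["ip"], move["old_mac"], "->", move["new_mac"], "on", move["iface"])
```

An IP that keeps flipping between the same two MAC addresses is the pattern to look for.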

Wanted to follow up on this in case anyone else has the same problem in the future. For me, the issue was at the hardware level, and the Intel X710-BM2 NIC was the culprit.

If I used only ixl1 for LAN and plugged the WAN into the motherboard's Ethernet port, the problem disappeared entirely. I've since switched to an X520 NIC and the issue has not come back.

There is still a chance it was a driver problem rather than faulty hardware, but either way it was tied to the X710 NIC.