Random network drops on WAN

Started by Baez, December 18, 2024, 06:46:02 AM

December 18, 2024, 06:46:02 AM Last Edit: December 18, 2024, 07:05:26 AM by Baez
This issue has been really perplexing. I'm running a bare-metal OPNsense 24.7 installation on a new box with an Intel X710-BM2 NIC; ixl0 is WAN and ixl1 is LAN. Internet access comes through a fiber line to a Nokia GPON plugged into an Adtran 854-v6 router.

Initially, after about 2 days of uptime, I started seeing random "Page not found" errors while browsing. Then a few seconds later, the page would work as expected. Pinging would then show this:

❯ ping cloudflare.com
PING cloudflare.com (104.16.132.229): 56 data bytes
64 bytes from 104.16.132.229: icmp_seq=0 ttl=58 time=3.580 ms
64 bytes from 104.16.132.229: icmp_seq=1 ttl=58 time=3.743 ms
64 bytes from 104.16.132.229: icmp_seq=2 ttl=58 time=4.048 ms
64 bytes from 104.16.132.229: icmp_seq=3 ttl=58 time=4.007 ms
64 bytes from 104.16.132.229: icmp_seq=4 ttl=58 time=3.517 ms
64 bytes from 104.16.132.229: icmp_seq=5 ttl=58 time=4.015 ms
64 bytes from 104.16.132.229: icmp_seq=6 ttl=58 time=3.692 ms
64 bytes from 104.16.132.229: icmp_seq=7 ttl=58 time=3.825 ms
64 bytes from 104.16.132.229: icmp_seq=8 ttl=58 time=3.690 ms
64 bytes from 104.16.132.229: icmp_seq=9 ttl=58 time=3.876 ms
64 bytes from 104.16.132.229: icmp_seq=10 ttl=58 time=3.835 ms
64 bytes from 104.16.132.229: icmp_seq=11 ttl=58 time=3.588 ms
64 bytes from 104.16.132.229: icmp_seq=12 ttl=58 time=4.006 ms
64 bytes from 104.16.132.229: icmp_seq=13 ttl=58 time=3.575 ms
64 bytes from 104.16.132.229: icmp_seq=14 ttl=58 time=3.700 ms
Request timeout for icmp_seq 15
Request timeout for icmp_seq 16
Request timeout for icmp_seq 17
Request timeout for icmp_seq 18
64 bytes from 104.16.132.229: icmp_seq=19 ttl=58 time=3.447 ms
64 bytes from 104.16.132.229: icmp_seq=20 ttl=58 time=3.572 ms
64 bytes from 104.16.132.229: icmp_seq=21 ttl=58 time=3.996 ms
64 bytes from 104.16.132.229: icmp_seq=22 ttl=58 time=3.658 ms
64 bytes from 104.16.132.229: icmp_seq=23 ttl=58 time=3.275 ms
64 bytes from 104.16.132.229: icmp_seq=24 ttl=58 time=3.561 ms
64 bytes from 104.16.132.229: icmp_seq=25 ttl=58 time=3.640 ms
64 bytes from 104.16.132.229: icmp_seq=26 ttl=58 time=3.782 ms
64 bytes from 104.16.132.229: icmp_seq=27 ttl=58 time=3.657 ms
64 bytes from 104.16.132.229: icmp_seq=28 ttl=58 time=3.647 ms
64 bytes from 104.16.132.229: icmp_seq=29 ttl=58 time=3.482 ms
64 bytes from 104.16.132.229: icmp_seq=30 ttl=58 time=3.561 ms
64 bytes from 104.16.132.229: icmp_seq=31 ttl=58 time=3.874 ms
64 bytes from 104.16.132.229: icmp_seq=32 ttl=58 time=3.583 ms
64 bytes from 104.16.132.229: icmp_seq=33 ttl=58 time=3.548 ms
64 bytes from 104.16.132.229: icmp_seq=34 ttl=58 time=3.722 ms
64 bytes from 104.16.132.229: icmp_seq=35 ttl=58 time=3.674 ms
64 bytes from 104.16.132.229: icmp_seq=36 ttl=58 time=4.147 ms
64 bytes from 104.16.132.229: icmp_seq=37 ttl=58 time=3.761 ms
64 bytes from 104.16.132.229: icmp_seq=38 ttl=58 time=3.583 ms
64 bytes from 104.16.132.229: icmp_seq=39 ttl=58 time=3.526 ms
Request timeout for icmp_seq 40
Request timeout for icmp_seq 41
Request timeout for icmp_seq 42
Request timeout for icmp_seq 43
Request timeout for icmp_seq 44
Request timeout for icmp_seq 45
64 bytes from 104.16.132.229: icmp_seq=46 ttl=58 time=3.183 ms
64 bytes from 104.16.132.229: icmp_seq=47 ttl=58 time=3.276 ms
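
To put numbers on the gaps in a capture like the one above, the timeouts can be grouped into bursts. The following is a minimal Python sketch, not part of the original troubleshooting: the function name `loss_bursts` and the regular expressions are illustrative assumptions, and it only understands the `Request timeout for icmp_seq N` style of line shown here. The sample is an excerpt of the output above.

```python
import re

# A reply line contains "icmp_seq=<n>"; a timeout line contains "icmp_seq <n>".
REPLY = re.compile(r"icmp_seq=(\d+)")
TIMEOUT = re.compile(r"Request timeout for icmp_seq (\d+)")

# Excerpt of the ping output above (replies between seq 19 and 39 omitted).
SAMPLE = """\
64 bytes from 104.16.132.229: icmp_seq=14 ttl=58 time=3.700 ms
Request timeout for icmp_seq 15
Request timeout for icmp_seq 16
Request timeout for icmp_seq 17
Request timeout for icmp_seq 18
64 bytes from 104.16.132.229: icmp_seq=19 ttl=58 time=3.447 ms
64 bytes from 104.16.132.229: icmp_seq=39 ttl=58 time=3.526 ms
Request timeout for icmp_seq 40
Request timeout for icmp_seq 41
Request timeout for icmp_seq 42
Request timeout for icmp_seq 43
Request timeout for icmp_seq 44
Request timeout for icmp_seq 45
64 bytes from 104.16.132.229: icmp_seq=46 ttl=58 time=3.183 ms
"""


def loss_bursts(ping_output):
    """Return (first_lost_seq, last_lost_seq) for each run of consecutive timeouts."""
    bursts = []
    start = last = None
    for line in ping_output.splitlines():
        lost = TIMEOUT.search(line)
        if lost:
            seq = int(lost.group(1))
            if start is None:
                start = seq
            last = seq
        elif REPLY.search(line) and start is not None:
            # A reply ends the current burst.
            bursts.append((start, last))
            start = last = None
    if start is not None:
        # Output ended while packets were still being lost.
        bursts.append((start, last))
    return bursts


for first, last in loss_bursts(SAMPLE):
    print(first, last, last - first + 1)
# prints:
# 15 18 4
# 40 45 6
```

With ping's default one-second interval, those two bursts correspond to roughly 4 and 6 seconds without replies.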

Despite changing a multitude of settings, I could not work out why this was happening. A few things I tried the first time:

  • Resetting all firewall settings
  • Deleting and recreating the WAN interface
  • Restarting the server
  • Restarting the Adtran (which at the time was supplying internet via DHCP through DMZ)

I finally gave in, decided it must have been a bad installation, and reset to factory defaults. WAN immediately worked without drops.

Two days later the exact same issue arose. I then updated to the latest BIOS and updated the X710's firmware from 9.00 to 9.90 using Intel's FreeBSD tool. I reset to factory defaults again and hoped this would solve it.

Fast forward 3 days and the same problem reared its ugly head again. In the meantime, I had called my ISP to switch the Adtran to bridge mode, so the WAN interface now connects over PPPoE.

And here we are today, still with the same issue. Some notes I've made during this endeavour that might help:

  • Each drop has lasted at most 12 seconds, and the connection has recovered on its own every time
  • The drops come at irregular intervals, but there has never been more than about 30 seconds of uptime before the next packet loss
  • Pinging an IP results in the same issue, so it doesn't appear to be DNS
  • All local connections are stable: pinging the gateway/DNS server at 192.168.1.1 never drops, and neither does pinging another device on the LAN
  • Pinging external hosts directly from the OPNsense server results in the same timeouts, so I've eliminated everything but the OPNsense server and the Adtran
  • While the Adtran was not in bridge mode, pinging from a machine plugged into one of its LAN ports showed no issues; the connection was stable
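
The comparison in these notes (LAN targets stay up while external targets time out) can be logged automatically, so that each drop gets a timestamp to match against the system logs. This is a rough Python sketch under stated assumptions: `ping_once`, `classify`, and `monitor` are hypothetical helper names, the state labels are arbitrary, and it shells out to the system `ping` with only the `-c 1` flag, relying on Python's own subprocess timeout rather than any platform-specific ping timeout flag.

```python
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=2.0):
    """Send one echo request via the system ping; True if a reply arrived in time."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # enforced by Python, not by ping itself
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def classify(gateway_ok, external_ok):
    """Label one probe round (labels are illustrative, not standard terms)."""
    if gateway_ok and external_ok:
        return "ok"
    if gateway_ok:
        return "upstream"      # LAN path fine, external target unreachable
    if external_ok:
        return "gateway-only"  # only the probe to the gateway was lost
    return "local"             # nothing answered, not even the gateway


def monitor(gateway, external, rounds, interval_s=1.0, probe=ping_once):
    """Probe both hosts `rounds` times; return (timestamp, state) for non-ok rounds."""
    events = []
    for _ in range(rounds):
        state = classify(probe(gateway), probe(external))
        if state != "ok":
            stamp = datetime.now().isoformat(timespec="seconds")
            events.append((stamp, state))
            print(stamp, state)
        time.sleep(interval_s)
    return events
```

Run from a LAN client, `monitor("192.168.1.1", "104.16.132.229", rounds=600)` would probe the two addresses from this thread for roughly ten minutes and print a line only when one of them fails to answer. The `probe` parameter exists so the loop can be exercised with a stand-in function instead of real pings.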

Any help to get to the root of this problem would be greatly appreciated.

What leads you to believe the problem is on your end? Did you check with other sites?

Here's what it looks like on my end:
PING cloudflare.com (104.16.132.229): 56 data bytes
64 bytes from 104.16.132.229: icmp_seq=0 ttl=58 time=17.415 ms
64 bytes from 104.16.132.229: icmp_seq=1 ttl=58 time=19.508 ms
64 bytes from 104.16.132.229: icmp_seq=2 ttl=58 time=18.514 ms
64 bytes from 104.16.132.229: icmp_seq=3 ttl=58 time=17.107 ms
64 bytes from 104.16.132.229: icmp_seq=4 ttl=58 time=17.424 ms
64 bytes from 104.16.132.229: icmp_seq=5 ttl=58 time=17.326 ms
64 bytes from 104.16.132.229: icmp_seq=6 ttl=58 time=18.613 ms
64 bytes from 104.16.132.229: icmp_seq=7 ttl=58 time=18.646 ms
64 bytes from 104.16.132.229: icmp_seq=8 ttl=58 time=18.685 ms
64 bytes from 104.16.132.229: icmp_seq=9 ttl=58 time=22.029 ms
64 bytes from 104.16.132.229: icmp_seq=10 ttl=58 time=20.799 ms
64 bytes from 104.16.132.229: icmp_seq=11 ttl=58 time=18.936 ms
64 bytes from 104.16.132.229: icmp_seq=12 ttl=58 time=22.456 ms
64 bytes from 104.16.132.229: icmp_seq=13 ttl=58 time=18.974 ms
Request timeout for icmp_seq 14
64 bytes from 104.16.132.229: icmp_seq=15 ttl=58 time=21.874 ms
64 bytes from 104.16.132.229: icmp_seq=16 ttl=58 time=19.129 ms

Not nearly as many probs as in your test, but with the occasional 300msec response time instead of timeouts.

December 18, 2024, 05:16:02 PM #2 Last Edit: December 19, 2024, 02:46:36 AM by Baez
Quote from: mooh on December 18, 2024, 11:00:09 AM
What leads you to believe the problem is on your end? Did you check with other sites?

Not nearly as many probs as in your test, but with the occasional 300msec response time instead of timeouts.

Hey mooh, it's not only Cloudflare; all external connections stop working for that period of time.

It occurred again today. I was at least able to get it back to normal by turning the OPNsense box off, restarting the Adtran, and turning the OPNsense box back on once the Adtran had booted.