[Solved] Loss of internet/WAN gateway packet loss after 20 sec

Started by sknr, February 29, 2024, 01:04:01 AM

Hello,

My OPNsense adventures continue. My previous ask for help ended up being user error, so I hope I'm not doing something dumb again!

I have been running OPNsense for about a week now, and I've got a very odd one today: when I got home from work, my OPNsense box seemed to have stopped working. Gateway monitoring was reporting 100% packet loss, and I had no internet access.

My setup is fairly simple, with my WAN on igc0, LAN on igc1 and WLAN on igc2. I've configured basic firewall rules allowing both the LAN and WLAN networks to access the internet but not each other. I've also been running DNS over TLS using Unbound, NATing any port 53 traffic to Unbound and using Cloudflare on port 853 for my outbound DNS queries. In addition, I decided to run IDS/IPS on my WAN port and Zenarmor on my LAN/WLAN ports. This was all working fine for a couple of days, and I even added SQM/QoS to improve my "bufferbloat".

I initially thought this might be a service interruption from my ISP. But I noticed that if I rebooted my ONT or reconnected the CAT6a cable to the ONT, with a ping command left running in the OPNsense shell, I would get a valid connection to the internet (or at least to 8.8.8.8) for about 20-30 seconds before the connection went down again. I then went through the process of disabling extra services like IDS/IPS and Zenarmor, and even removing my DNS over TLS configuration, followed by multiple reboots and power cycles of both my ONT and OPNsense. All to no avail: the same issue persisted. Re-plug the WAN connection, get 15-30 seconds of success, then the gateway monitor reports packet loss slowly building up from 10%, 15%, 20%, etc. to 100%, and then nothing works again.
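
As an aside on why the loss figure creeps up rather than jumping straight to 100%: assuming the gateway monitor reports loss over a sliding window of recent probes (an assumption here, not something confirmed from the OPNsense source), a link that dies abruptly would still show a gradual climb. A minimal Python sketch of that arithmetic, where `rolling_loss` and the window of 20 probes are made-up illustration values:

```python
from collections import deque


def rolling_loss(results, window=20):
    """Return the loss percentage after each probe, computed over the
    most recent `window` probes (True = reply received, False = lost).

    `window` is a hypothetical value chosen for illustration only.
    """
    recent = deque(maxlen=window)
    percentages = []
    for ok in results:
        recent.append(ok)
        lost = sum(1 for r in recent if not r)
        percentages.append(100.0 * lost / len(recent))
    return percentages


# 20 good probes, then the link dies completely.
history = [True] * 20 + [False] * 20
loss = rolling_loss(history)
```

With this window, the reported loss moves up in 5% steps per lost probe and only reaches 100% once every probe in the window has failed, which would match the "10%, 15%, 20%" pattern if the monitor works this way.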

While checking the firewall live logs I could see that all incoming WAN traffic was being blocked by "Default deny / state violation", but my firewall was still sending traffic out. The pings to 8.8.8.8 were still getting responses in the OPNsense shell, even while the live log showed 8.8.8.8:53 being blocked on "WAN in".

As the next step I connected a display and keyboard and did a "reset to factory settings", but that didn't work either. As a last resort I did a clean install from the ISO image on a USB stick and assigned the usual WAN/LAN interfaces, and still didn't get a network connection, despite that having worked when I first started.

Luckily I have an old EdgeRouter-lite-3 sitting around. I managed to get that re-configured and up and running, and it now seems to give me an internet connection, which leads me to think that either:

a) the ISP has decided to block my OPNsense machine (MAC address), or
b) it was a complete fluke that I got OPNsense running last week, and now something fundamental is missing that I'm not aware of

The appliance running OPNsense is an Intel N100 based HUNSN box with 16 GB RAM, a 250 GB SSD and 4 x Intel i226-V NICs, and I'm running the latest release of OPNsense.

My ISP provides me with a static IP address which is assigned via DHCP Option 82. I can see it get assigned on my WAN interface before I lose my connection to packet loss.

Any tips or advice is much appreciated!


Before you do much else I would contact the ISP, 2nd tier tech support or higher, and confirm their policy on customer-provided equipment. If they're blocking you, not a lot of sense tearing your hair out to try to figure out what's wrong on your side.

Yeah I've already raised a ticket with my ISP, they are fairly open with customers running their own equipment after the ONT. They even recently traded out my "standard" ONT for a Nokia XS-010G-Q, which requires a router to function.

Still waiting for the ISP's tech support to confirm if they see anything unusual on my connection.

I can only think of three reasons to go from a working state to a non-working state after some days.
First is that you had settings uncommitted to config, like not saving a rule or working from live media.
Second is a hardware problem.
Third is enablement of services that overpower the machine.
Otherwise I can't see this happening.
I lean to the third. I'd suggest disabling one of the two, probably Suricata IPS if you are not protecting something specific with it. The thinking is that those two will use a big chunk of resources; maybe the system started swapping and then killed a core service it needed.

I am of course just guessing, but based on what you shared, traffic is reaching your WAN interface and is then dropped/blocked. So I personally do not think that the ISP is blocking you, as you would probably not see any traffic at all in that case.
Are you sure that you are not behind some carrier-grade NAT and that your WAN port did not get an address from a private-use IP range? That would trigger the default rules.

I'd like to believe that my ISP isn't blocking me either, but it also seems odd that a fresh install of OPNsense is somehow blocking all of my WAN traffic, especially as it was working fine for 3-4 days. My CPU utilisation hasn't gone above 40% and my memory usage is around 10-15%, so there are no real "red flags" suggesting that services are pushing my Intel N100 to its limits. CPU temp is also around 27-30 degrees at both idle and load. The IP provided by my ISP is 185.96.xxx.xxx, so it's not in the CGNAT range, and I specifically asked my ISP before install to confirm that there was no CGNAT.
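
For anyone wanting to double-check the CGNAT question on their own WAN address, here is a minimal Python sketch using only the standard `ipaddress` module. It assumes "CGNAT range" means the RFC 6598 shared address space (100.64.0.0/10) and "private" means the RFC 1918 blocks; `classify_wan_address` is a made-up helper name, and the sample addresses are placeholders, not anyone's real WAN IP.

```python
import ipaddress

# RFC 6598 shared address space, commonly used for carrier-grade NAT.
CGNAT = ipaddress.ip_network("100.64.0.0/10")

# RFC 1918 private-use blocks.
RFC1918 = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def classify_wan_address(addr):
    """Classify an IPv4 address as 'cgnat', 'private' or 'public'."""
    ip = ipaddress.ip_address(addr)
    if ip in CGNAT:
        return "cgnat"
    if any(ip in net for net in RFC1918):
        return "private"
    return "public"


# No address starting with 185.96. can fall inside the CGNAT block.
overlaps_cgnat = ipaddress.ip_network("185.96.0.0/16").overlaps(CGNAT)
```

Since 185.96.0.0/16 does not overlap 100.64.0.0/10, any 185.96.x.x address comes out as "public" here, which is consistent with the ISP's answer.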

My search for an answer continues!

Quote from: cookiemonster on February 29, 2024, 03:45:40 PM
I can only think of three reasons to go from a working state to a non-working state after some days.
First is that you had settings uncommitted to config, like not saving a rule or working from live media.
Second is a hardware problem.
Third is enablement of services that overpower the machine.
Otherwise I can't see this happening.
I lean to the third. I'd suggest disabling one of the two, probably Suricata IPS if you are not protecting something specific with it. The thinking is that those two will use a big chunk of resources; maybe the system started swapping and then killed a core service it needed.

Hello again cookiemonster!

AFAIK I performed the reduction of services as usual, and currently I'm running a clean install from the ISO image with nothing enabled other than the bog-standard services. CPU is trickling in at 1-4% util, RAM at 4%, MBUF util at 1%, disk usage at 1%, and temps at 27 degrees. The odd part is that for about 20 seconds after reconnecting the WAN link I get network access again, which baffles me!

You need to be methodical and specific so we can help you. "I get network access.." doesn't help much. From OPN, or from a client? Is this a wireless or LAN client, etc.? Right now, everything being OK until something fails again doesn't sound much like it is an OPN thing.

Quote from: cookiemonster on February 29, 2024, 10:21:10 PM
You need to be methodical and specific so we can help you. "I get network access.." doesn't help much. From OPN, or from a client? Is this a wireless or LAN client, etc.? Right now, everything being OK until something fails again doesn't sound much like it is an OPN thing.

To clarify, things are definitely not OK. My test has been to use the OPNsense CLI: option "11" to restart all the services, then "8" to enter the shell and run "ping 8.8.8.8". After about 10-20 successful replies it fails. If I repeat the process of reloading all services, I can successfully get another batch of ping replies before they fail again.

Not sure if it's the best way of testing, but it implies that my OPNsense can access the WAN and get a response from Google for a bit before things get blocked again. Not really sure if it's an OPNsense thing or my ISP somehow blocking my connection after a while. Still waiting for someone from my ISP's tech support to take a look at my ticket.
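
A slightly more repeatable version of this test is to timestamp each probe and record how long the link stays up after a reload. The following is a rough Python sketch under a few assumptions: it shells out to the system `ping` with only the `-c 1` flag, and `probe`, `seconds_until_outage`, `watch` and the "three failures in a row" threshold are all made-up names and values for illustration.

```python
import subprocess
import time


def probe(host, timeout_s=2.0):
    """Send one echo request via the system ping; True if it was answered."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def seconds_until_outage(samples, fail_run=3):
    """Given (timestamp, ok) pairs, return the time from the first sample
    to the start of the first run of `fail_run` consecutive failures,
    or None if no such run occurs."""
    start = None
    run_start = None
    run_len = 0
    for ts, ok in samples:
        if start is None:
            start = ts
        if ok:
            run_len = 0
            run_start = None
            continue
        if run_len == 0:
            run_start = ts
        run_len += 1
        if run_len >= fail_run:
            return run_start - start
    return None


def watch(host, interval_s=1.0, max_samples=120):
    """Probe `host` once per interval and report how long it stayed reachable."""
    samples = []
    for _ in range(max_samples):
        samples.append((time.monotonic(), probe(host)))
        elapsed = seconds_until_outage(samples)
        if elapsed is not None:
            return elapsed
        time.sleep(interval_s)
    return None
```

Calling `watch("8.8.8.8")` right after reloading services, and again against the WAN gateway address, would give comparable numbers for each attempt instead of an eyeballed "about 20 seconds".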

I guess what seems odd is that if I swap out my OPNsense box for an old Ubiquiti EdgeRouter, my access to the internet seems to work, albeit after having to wait a while for the WAN DHCP address to sort itself out.

Just as a quick update, I can literally SSH into OPNsense from a client on the LAN network, type "11" to reload all services, then my internet works on my client and on OPNsense, after about 15, maybe 25 seconds, my internet connection is lost and my ping to 8.8.8.8 fails. Then I just jump back on the OPNsense SSH shell, type "11" again to reload all services, and my internet/pings to 8.8.8.8 start working again, until it fails again, until I reload all services again, etc....

I've tried my best to look through as many logs on OPNsense that I can find in debug mode, and nothing is throwing any errors that I can see.

My next step is to spend tomorrow morning trying to get through to my ISP's tech support, to see if they can confirm whether there's something funky going on with my service.

If you speak with ISP tech support, I expect they will eventually confirm what you already have: that the problem occurs only with OPN or the machine running it, so it is not on their side, unless they require some mechanism to use their network (the option 82 you mention) and/or authentication details, VLAN IDs, etc. that aren't yet set in OPN. What I mean is that if it works with another device, or with theirs, they would not normally spend a lot of time helping you diagnose it. Hopefully it will go well though.
I'd start by looking for clues in your WAN DHCP logs. Sorry, not much else to suggest if logs aren't helping.

Quote from: cookiemonster on March 01, 2024, 12:37:55 PM
If you speak with ISP tech support, I expect they will eventually confirm what you already have: that the problem occurs only with OPN or the machine running it, so it is not on their side, unless they require some mechanism to use their network (the option 82 you mention) and/or authentication details, VLAN IDs, etc. that aren't yet set in OPN. What I mean is that if it works with another device, or with theirs, they would not normally spend a lot of time helping you diagnose it. Hopefully it will go well though.
I'd start by looking for clues in your WAN DHCP logs. Sorry, not much else to suggest if logs aren't helping.

Yeah, I just heard back from the ISP and they have confirmed that there are no flags or issues on their side. From what I can tell DHCP is working fine, as my WAN interface is getting the static IP address that the ISP has confirmed is assigned to my service. That is all I can really ask from them, as I am running my own hardware post-ONT. On their side nothing needs to be configured in terms of VLAN or additional authentication, as DHCP option 82 somehow takes care of all that (I need to dive into what exactly option 82 might need as a valid response).

I've tried reinstalling OPNsense again, this time wiping the M.2 drive to remove any residual issues that might somehow have persisted, but that didn't help either. I'm wondering if there is some sort of hardware fault causing the NICs to just drop packets or stop working. I've tried forcing the NICs to 1 Gbps, 2.5 Gbps and even 100 Mbps just to see if that changes anything, but so far the results are the same: after reloading all the services I'm online for about 20 seconds before it dies again  :'(

But I'll dive into DHCP Option 82/RFC 3046 for now and see if that might be the solution to this problem!
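
For reference while digging into this: RFC 3046 defines option 82 as the Relay Agent Information option, a container of sub-options, of which sub-option 1 (Agent Circuit ID) and sub-option 2 (Agent Remote ID) are the ones the RFC itself defines. As the RFC describes it, the option is added by a relay agent on the provider side rather than by the client, so it may well not appear in a capture taken on the WAN interface at all. Below is a minimal Python sketch of how the bytes are laid out, assuming you already have the raw options field of a DHCP packet (everything after the magic cookie); the function names and the sample bytes are made up for illustration.

```python
PAD = 0
END = 255
RELAY_AGENT_INFO = 82

# Sub-options defined in RFC 3046 itself.
SUBOPTION_NAMES = {1: "circuit-id", 2: "remote-id"}


def parse_dhcp_options(data):
    """Split a raw DHCP options field into {option_code: value_bytes}."""
    options = {}
    i = 0
    while i < len(data):
        code = data[i]
        if code == END:
            break
        if code == PAD:  # single-byte padding, no length field
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError("truncated option header")
        length = data[i + 1]
        value = data[i + 2 : i + 2 + length]
        if len(value) != length:
            raise ValueError("truncated option value")
        # Repeated codes are concatenated, as long options may be split.
        options[code] = options.get(code, b"") + value
        i += 2 + length
    return options


def parse_relay_agent_info(value):
    """Split an option 82 value into its sub-options, keyed by name."""
    subs = {}
    i = 0
    while i + 2 <= len(value):
        sub, length = value[i], value[i + 1]
        payload = value[i + 2 : i + 2 + length]
        if len(payload) != length:
            raise ValueError("truncated sub-option")
        subs[SUBOPTION_NAMES.get(sub, "sub-%d" % sub)] = payload
        i += 2 + length
    return subs


# Hypothetical sample: message type ACK, then option 82 with two sub-options.
sample = bytes([53, 1, 5, 82, 10, 1, 4]) + b"eth0" + bytes([2, 2]) + b"ab" + bytes([END])
opts = parse_dhcp_options(sample)
relay = parse_relay_agent_info(opts[RELAY_AGENT_INFO])
```

If option 82 never shows up in packets seen by the firewall, that would suggest the client has nothing to "respond with" and the check happens entirely on the ISP's side of the ONT.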

I wonder if you're getting hit with the Intel 2.5G NIC issues.  Try putting a switch between OPNsense and the ONT.

There's some more info regarding the i225/i226 NICs in this thread. https://forum.opnsense.org/index.php?topic=38055.0

Quote from: CJ on March 01, 2024, 04:59:03 PM
I wonder if you're getting hit with the Intel 2.5G NIC issues.  Try putting a switch between OPNsense and the ONT.

There's some more info regarding the i225/i226 NICs in this thread. https://forum.opnsense.org/index.php?topic=38055.0

Yeah, I saw that thread a while ago when I was trying to figure out why my i226 NIC wasn't auto-negotiating with my previous ONT, but recently the ISP gave me a "business version" which auto-negotiates fine to 2.5G. I have tried running link speeds all the way down to 10 Mbps and the same issue persists. I've also put a simple L2 switch between the ONT and OPNsense to see if that helped; unfortunately it doesn't.

I've even narrowed things down to the point where I can get my internet working (either a ping to 1.1.1.1/8.8.8.8 from OPNsense or even google.com from a client web browser) just by pressing Save and Apply on the WAN interface settings page. So I no longer have to "reload services" via SSH to get momentary internet access.

As another check, I've been running tcpdump -i igc0 -v and looking at the DHCP request and ACK messages from my ISP. Without knowing too much about DHCP option 82, I cannot see anything glaringly wrong in the tcpdump output, but funnily enough I can see the ARP request from my WAN gateway showing a Cisco MAC address, which I assume is one of the ISP's routers.

During my testing I also stopped using 1.1.1.1/8.8.8.8 to check whether I had internet access and went back to checking whether I could reach my WAN gateway IP, and that fails in the same way as the checks against 1.1.1.1/8.8.8.8. So I can only assume that something is wrong with my OPNsense, despite multiple re-installs, unless I've got something funky going on with my hardware... I'm not quite sure how to validate that other than installing a simple Linux distro on my appliance and seeing if the NICs start dropping packets after 20 seconds...

Just another update on this situation: I've just connected my OPNsense box to my old ISP's home gateway/ONT, which isn't in bridge mode, etc., and my network connection seems to be stable.... In hindsight I should probably have tested that earlier, but here we are.

I guess this narrows things down to a point where there is some issue with compatibility between OPNsense and my ISP.

To clarify, this was working for about a week with no issues, and all of a sudden things are broken. My new ISP provides me with a static IP which is not under cgnat, and as far as they can see, there should be nothing wrong with my service.

Could it be down to some sort of problem with DHCP option 82 authentication? I.e. it acknowledges my DHCP request but then goes "hang on, you haven't responded with option 82 info" and drops my connection? I'll have to hop on the phone with my ISP to see what more they can investigate.