OPNsense needs periodic reboot since updated to 24.7.9_1-amd64

Started by bongo, November 23, 2024, 02:40:37 PM

Log in via SSH and run "top" ...
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

Quote from: Patrick M. Hausen on December 02, 2024, 09:01:51 PM
Log in via SSH and run "top" ...

At least in my case, my firewall does not respond to SSH or its web UI.

Rather, it seems to try but eventually fails with a timeout. Whatever my firewall is stuck doing seemingly clogs it up so badly that I can't SSH into it.

If it happens to me again I will try to connect to it using serial.

Can we stop piling "I have the same issue" on a reporter who said he uses re0/ue0 and observes WAN link failures? How do I know? Because we have a bug tracker.

https://github.com/opnsense/core/issues/8098


Cheers,
Franco

Quote from: franco on December 02, 2024, 09:35:17 PM
Can we stop piling "I have the same issue" on a reporter who said he uses re0/ue0 and observes WAN link failures? How do I know? Because we have a bug tracker.

https://github.com/opnsense/core/issues/8098


Cheers,
Franco

I apologize. If my issue reappears, I will try to capture logs and go to GitHub instead.

I think I found the reason why my setup ran stable for 5 days before the issue popped up again, with the uplink always failing after a few hours:
During these 5 days, I had my uplink connected to a switch that only supports 100M. After that time, I was confident that everything was working fine again, so I removed all the unneeded stuff, and the uplink ran at 1G again.
Then I had the issue again.
I tried to force my uplink to 100M in the OPNsense settings, but this did not help. Now I have added the switch again to get the link down to 100M, and it has been stable for almost 2 days.
The only strange thing is that I did not have this issue before updating to the latest version of OPNsense.
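
Side note for anyone comparing a forced 100M setting against a real 100M link: the negotiated speed appears on the `media:` line of `ifconfig` output. Below is a minimal Python sketch that pulls the active speed out of such text. The sample strings are hypothetical and only follow the usual shape of FreeBSD `ifconfig` output, and the helper name `active_speed_mbit` is made up for illustration.

```python
import re

# Hypothetical samples shaped like FreeBSD `ifconfig re0` output.
# With autoselect, the negotiated media is shown in parentheses.
SAMPLE_AUTO = """\
re0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    ether 00:11:22:33:44:55
    media: Ethernet autoselect (1000baseT <full-duplex>)
    status: active
"""

SAMPLE_FORCED = """\
re0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    ether 00:11:22:33:44:55
    media: Ethernet 100baseTX <full-duplex>
    status: active
"""


def active_speed_mbit(ifconfig_text):
    """Return the speed in Mbit/s from the first media line, or None."""
    for line in ifconfig_text.splitlines():
        line = line.strip()
        if not line.startswith("media:"):
            continue
        # Take the last NNNbase token: if a parenthesised active media
        # is present it comes last, otherwise this is the configured one.
        speeds = re.findall(r"(\d+)base", line)
        return int(speeds[-1]) if speeds else None
    return None
```

Feeding it the real `ifconfig` output for the uplink interface would show whether the link is actually running at 100 or 1000. Media names that do not follow the plain `NNNbase` pattern are not handled by this sketch.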

I do believe the burst speed kills the NIC driver, causing it to drop out and lose the link. For some users this has been the case for as long as I can remember. It circles back to the discussion that this particular hardware is not a good fit in our case.


Cheers,
Franco

I plan to replace my uplink with either an Intel 82571 or an i350-based NIC. Can I expect this to solve the issue?
Thanks!

em(4) driver should cover both devices and should be fine. Just for reference, what hardware does this run on?


Cheers,
Franco

I continue to experience issues where the dashboard items do not load and the WebGUI overall hangs until I reboot every few days or so.
My version is OPNsense 24.7.9_1-amd64

I cannot even get the firmware status or check for updates when it is in this state, nor can I get past the SSH login prompt. It is as though it does not like my password...

But after a forced reboot, everything is back to normal for a few days.

Is this the proper post or should I make a new post?

I went ahead and forced a shutdown of OPNsense (running on a Protectli J3710) and then updated to 24.7.10_1.

However, the dashboard widget content failures are already starting again, and this is usually the prelude to the WebGUI hangs and the inability to log in via SSH.


For the record, I do not have re0/ue0 as Franco noted.

Any suggestions?

I have similar symptoms occasionally.

If you can, log in via SSH and run `pftop`. I have something running on my LAN (it's a Docker container but I still need to spend the time to narrow down which one) which seems to hold connections open, but only sometimes.

Earlier this evening, after the 24.7.10_1 update, I couldn't SSH into the box anymore, but I happened to have a serial cable connected. I was able to run `pftop` and see that there were 15000+ states open. Shortly after that, the box kernel panicked, spewed a load of debug output via serial and then rebooted itself. After the reboot, I stopped the Docker daemon on my home NAS, and the number of states is currently hovering around 2000. I'm going to start the Docker daemon again and see if the problem comes back. If it does, I need to figure out which container it is, because all I can see on OPNsense is the source IP, which just comes back as the NAS because of how Docker networking works by default.

In your case, it sounds like maybe you also have too many states open, so your box gets to the point that for some reason it can't accept new connections or the DDoS protection (syncookies) is coming into play, or something like that.

edit - I started the Docker daemon. Within a few seconds, `pftop` was showing 7500 states, so the amount tripled.
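
For anyone wanting to break a state count like this down by host: `pfctl -ss` prints the state table as text, and that text can be tallied per source. Here is a minimal Python sketch under some loud assumptions. The sample lines are hypothetical and only follow the general shape of `pfctl -ss` output (the exact columns can differ between versions), it handles IPv4 only, NAT-translated entries get no special treatment, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical sample in the general shape of `pfctl -ss` output.
SAMPLE = """\
all tcp 192.168.1.20:51514 -> 93.184.216.34:443       ESTABLISHED:ESTABLISHED
all tcp 192.168.1.20:51515 -> 93.184.216.34:443       TIME_WAIT:TIME_WAIT
all tcp 192.168.1.20:51600 -> 1.1.1.1:853       ESTABLISHED:ESTABLISHED
all udp 192.168.1.31:40000 -> 9.9.9.9:53       MULTIPLE:SINGLE
all tcp 192.168.1.1:22 <- 192.168.1.31:60122       ESTABLISHED:ESTABLISHED
"""

# Matches an IPv4 address followed by a port and captures the address only.
ENDPOINT = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3}):\d+")


def parse_states(text):
    """Yield (source, destination) host pairs, one per state line."""
    for line in text.splitlines():
        hosts = ENDPOINT.findall(line)
        if len(hosts) < 2:
            continue  # not a line this sketch understands
        # Assumption: with "<-" the initiator is listed last,
        # with "->" it is listed first.
        if " <- " in line:
            yield hosts[-1], hosts[0]
        else:
            yield hosts[0], hosts[-1]


def states_per_source(text):
    """Count states per source host."""
    return Counter(src for src, _ in parse_states(text))


def destinations_for(text, source):
    """Count states per destination host for one source host."""
    return Counter(dst for src, dst in parse_states(text) if src == source)
```

With Docker's default networking every container shows up under the NAS address, so the per-source count alone will not name the container. Looking at `destinations_for(...)` for the NAS address might still narrow it down, since different containers tend to talk to different remote hosts.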

Please start a new thread for your issue; it has nothing to do with this one.

Quote from: slackadelic on November 26, 2024, 04:49:29 PM
This is an Intel NIC that's been running great for quite a few years.  Didn't have this particular issue back in the summer, and folks are correct: the last update is about when I started noticing the issue.
I'm continuing to look at logs when it happens to see if I can sort out what is going on, but so far nothing stands out.

After some more observation and testing, the issue discussed here does not seem to apply to my Intel setup.  I'm pretty sure my ISP did something; I'm not sure what, but I will keep an eye out in case the issue persists.

So far, I'm stable. 

Quote from: franco on December 03, 2024, 09:25:27 PM
em(4) driver should cover both devices and should be fine. Just for reference, what hardware does this run on?


It's an ASRock J3455M PC mainboard (with a Realtek onboard NIC, re0, which I have used for the uplink so far).
In each of the 3 PCIe slots, I have a NIC used for one of the LANs.
When I built the machine a few years ago, I chose a different LAN card for each slot so that I could run tests if I ran into issues with one of the cards.
Unfortunately, all 3 cards are now in use for LANs, which is why I attached a USB NIC for my current tests.

I now plan to replace one of the cards with an Intel dual-port NIC, so that I have a spare NIC again.