Constant lockups/crashes

Started by falsifyable_entity, December 08, 2024, 10:33:36 PM

Realtek NICs by any chance? What services, setup and hardware are involved (switches, modems, VLANs, etc.)? Otherwise we're still guessing.
I've had to resort to this in the past when I've had devices seizing up: setting up something external to monitor the device, or at least to act as a remote syslog target.

Intel I225-V NIC, as mentioned in the OG post, and no VLANs. The only hardware involved besides the OPN box is an OpenWrt router that acts only as a dumb switch and an AP; it does nothing else.
The only services I am running are NTP, DHCP, and DNS (Unbound with blocklists), nothing more.
I also ran across the idea of remote log storage and set it up on a Raspberry Pi 4, but it just had the exact same logs as the ones stored locally; nothing of note was ever logged to it no matter how many times OPN crashed. My guess is the crash is so severe it kills the logging alongside everything else.
Since I haven't been able to get anything useful out of a remote log device, I removed it.
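When the logs stay silent, one thing that can still be recovered is the time of each freeze. A minimal sketch, assuming a cron job on the firewall appends `date +%s` to a file once a minute (the file path and the `find_gaps` helper name are made up for illustration): any pause longer than the interval shows up as a gap.

```shell
#!/bin/sh
# Hypothetical helper: reads one Unix timestamp per line on stdin (oldest first)
# and reports every pause longer than the given number of seconds.
find_gaps() {
    threshold="$1"
    awk -v max="$threshold" '
        NR > 1 && ($1 - prev) > max {
            printf "gap of %d s between %d and %d\n", $1 - prev, prev, $1
        }
        { prev = $1 }
    '
}

# Assumed heartbeat, run from cron once a minute:
#   date +%s >> /var/log/heartbeat.log
# After a freeze and reboot:
#   find_gaps 90 < /var/log/heartbeat.log
```

The reported gap includes the time the box stayed frozen plus the reboot, so only the start of the gap says when the freeze happened.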

I missed the NIC in the original post, apologies.
Remove blocklists from Unbound? They can overwhelm some systems.
What about htop or top? Is anything there consuming the CPU cycles? That could be monitored externally, but it is a bit of an adventure to set up.
The thinking is that there's nothing wrong with the services or setup, but "something" is chewing up your CPU cycles and not releasing them. Logs won't show that.
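A rough sketch of what that monitoring could look like: a small loop that appends timestamped snapshots of a command's output to a file, so the last snapshot before a freeze shows what was using the CPU. The `log_snapshots` name, the log path, and the address are assumptions, and the `top` flags in the usage comment are the FreeBSD ones (`-b` batch, `-d` number of displays); check `man top` on your box before relying on them.

```shell
#!/bin/sh
# Hypothetical helper: append COUNT timestamped snapshots of a command's output
# to LOGFILE, pausing INTERVAL seconds between snapshots.
# Usage: log_snapshots LOGFILE COUNT INTERVAL command [args...]
log_snapshots() {
    logfile="$1"; count="$2"; interval="$3"
    shift 3
    i=0
    while [ "$i" -lt "$count" ]; do
        {
            echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
            "$@"
        } >> "$logfile" 2>&1
        i=$((i + 1))
        # Pause between snapshots, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
    return 0
}

# On the firewall itself, one snapshot every 5 seconds for an hour:
#   log_snapshots /root/top.log 720 5 top -b -d 1
# Or from another machine, so the log survives the freeze (address is an assumption):
#   log_snapshots top.log 720 5 ssh root@192.168.1.1 top -b -d 1
```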

I doubt it's the CPU; the somewhat overpriced Deciso DEC677 has a worse CPU than I have in that box, and they basically guarantee it will always work.
Even when I load websites with idiotic amounts of iframes and media that load from all kinds of domains, the CPU usage barely reaches 40%, and I haven't seen it go above 60% ever; even iperf (with all HW offload disabled) does not push the CPU past 60%. As for RAM, I have never seen it use more than 3G of the 8 available, and on 24.1.10 it was fine with an even longer blocklist (I reduced it, then tried disabling it thinking it might help, before turning off Unbound altogether).

I doubt a random app will prove anything here.

A better avenue might be to run the debug kernel and see if anything comes up in /var/crash after the next freeze.

Also, if you have any power management features in the BIOS, it would be best to disable them for now.



opnsense-update -zkr dbg-24.7.10

opnsense-shell reboot

@newsense
Applied your changes, awaiting results.

December 12, 2024, 11:50:46 AM #21 Last Edit: December 12, 2024, 11:55:50 AM by colourcode
Same problem on my Intel 305 running Proxmox, since 1 or 2 versions before the web GUI rebuild (and before that a few years' hiatus from OPNsense).

Sometimes SSH and HTTPS can take minute(s) to load, without any traffic to speak of being routed / blocked / scanned. It does seem to be faster when browsing to the IP address in general.

1. Saturating the 1 Gbps link with Steam / web tests @ 3-7% CPU usage. (attach 1)

2. OPNsense practically dying as soon as I open the web GUI while Steam / web tests are running (attach 2)

3. Nearly saturating the 1 Gbps link with Linux ISOs over WireGuard / selective routing @ around 40% CPU

Mine always works fine if I don't connect to the web GUI. Start a download with the web GUI open and CPU usage hits 100% about the time the page shows, which again can take minutes.

Ran OPNsense on a fitlet2 and qotom g7i7 for years without any problems to speak of. This device should handle the same network with ease compared to those two.

Another OPNsense instance w/ DNS and certificates, running for years on another Proxmox server, never had this problem. Although the firewall is shut down on there, it does have Unbound with a blocklist, so that's probably not related(?).

Have reinstalled the VM more than once. Tried different tunables; all yield the same problem. WAN is DHCP. I'm not finding information on whether I'm supposed to assign and enable the vtnet parent interfaces on this version; maybe that's the problem?

I have disabled everything except the DNS service, geoblock (1 rule w/ 1 country in it), selective routing (but problem happens on non-routed hosts). I initially thought it was due to netflow but disabling it made no difference. Most firewall rules are not logging anything. Reinstalled webgui and netflow.

The same prox that hosts opnsense has an untangle setup with rulesets that allow me to "switch" rather seamlessly. There are no issues whatsoever when untangle is running.

Running the debug kernel now. For some reason the debug kernel crashes LESS than the normal one, but it does freeze nonetheless. Unfortunately, after a few crashes this is what /var/crash looks like:

root@Sense:/var/crash # ls -la
total 10
drwxr-x---   2 root wheel  3 Dec  4 23:42 .
drwxr-xr-x  28 root wheel 28 Dec  2 20:45 ..
-rw-r--r--   1 root wheel  5 Dec  2 20:45 minfree
root@Sense:/var/crash # cat minfree
2048
root@Sense:/var/crash #

December 13, 2024, 03:58:57 AM #23 Last Edit: December 13, 2024, 04:02:53 AM by newsense
Don't run the debug kernel anymore; there's a known issue there. Interesting that you don't seem affected.

opnsense-update -kr 24.7.10 && opnsense-shell reboot
From your post above, however, it is clear you have no kernel crashes: whatever freezes the machine is not causing kernel dumps.
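For anyone repeating this check after later freezes, a small sketch that ignores the `minfree` bookkeeping file and only reports actual crash artifacts. The `list_crash_dumps` name is made up, and the file name patterns are the ones FreeBSD's savecore(8) and crashinfo(8) normally produce, assumed here to apply to OPNsense as well.

```shell
#!/bin/sh
# Hypothetical helper: list crash artifacts in a directory (default /var/crash).
# Returns 0 if at least one is found, 1 if the directory holds none.
list_crash_dumps() {
    dir="${1:-/var/crash}"
    found=1
    for f in "$dir"/vmcore.* "$dir"/info.* "$dir"/textdump.tar.* "$dir"/core.txt.*; do
        # An unmatched glob stays literal, so check the file really exists.
        if [ -e "$f" ]; then
            echo "$f"
            found=0
        fi
    done
    return "$found"
}

# Example:
#   list_crash_dumps || echo "no kernel dumps in /var/crash"
```

An empty result after a freeze points the same way as the listing above: the machine hangs without the kernel panicking.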


You didn't answer my question about power mgmt features enabled in the BIOS...

Quote from: newsense on December 13, 2024, 03:58:57 AM
You didn't answer my question about power mgmt features enabled in the BIOS...

Sorry, my bad, there are no power saving related settings in the BIOS, the only one I would consider close is auto boot when power is supplied.

Quote from: falsifyable_entity on December 13, 2024, 11:48:40 AM
Sorry, my bad, there are no power saving related settings in the BIOS, the only one I would consider close is auto boot when power is supplied.

Are you running plenty of VLANs?

Could be completely unrelated problems, but mine seems to be semi-remediated.

Noticed my web GUI log was LOADED with dead/dying sessions. I'm running plenty of VLANs and using my normal FQDN for access, so I'm guessing it picked different IPs or similar, which could've been the reason the GUI loaded so damn slowly (everywhere).

  • I put a 10 minute session timeout in settings > administration. Default 240 min.
  • Added a dns host override entry outside of the search domain.

It still completely shits the bed when working with a lot of traffic, but it seems to only happen on the dashboard now. It doesn't really seem to be related to netflow either, as it happens without traffic charts running. But I'm much too stupid to find the actual cause. The GUI is snappy again in most other areas, even during higher load.

Quote from: colourcode on December 15, 2024, 08:49:10 PM
Noticed my webgui log was LOADED with dead/dying sessions. Running plenty of vlans and using my normal FQDN for access, guessing it chose different IPs or similar which could've been the reason gui loaded so damn slow (everywhere).

I guess this is only tangential to the main topic of this thread but the problem of "opnsense.mydomain.lan" resolving to a dozen or so IP addresses can easily be remedied:

- Services > Unbound DNS > General > Do not register system A/AAAA records [X]
- Services > Unbound DNS > Overrides - add the single address you want to use for management
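
To see whether the name really resolves to a pile of addresses, before and after the change, the output of `host` can be counted. A minimal sketch, assuming `host` prints its usual "has address" / "has IPv6 address" lines; `count_addresses` is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: count the distinct addresses in `host` output read from stdin.
count_addresses() {
    awk '/ has address / || / has IPv6 address / { print $NF }' | sort -u | wc -l | tr -d ' '
}

# Example:
#   host opnsense.mydomain.lan | count_addresses
```

With the two settings above in place, the count should drop to the single management address.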

HTH,
Patrick

Quote from: colourcode on December 15, 2024, 08:49:10 PM
Are you running plenty of VLANS?

Nope, not even one VLAN; besides Unbound I have pretty much nothing going on.

Quote from: falsifyable_entity on December 16, 2024, 11:35:44 AM
Nope, not even one VLAN, besides Unbound I have pretty much nothing going on

It does sound pretty much exactly like the problem I'm having, though. I still have the problem, but the initial super long load times are for the most part gone.

Does it happen if you don't have the web GUI open at all? Mine never stalls when I'm SSH'd into it, but as soon as I open the GUI (dashboard) it immediately pegs all cores at 100% with PHP.

Mind checking over SSH with top -P? Start top, download a Steam game, start a speed test etc., then open the dashboard and see if you can reproduce it that way. Assuming it's not fully borked without even doing anything.

I can use the GUI fine as long as I'm using Spotify / YouTube and browsing the net, but any heavy load and it's game over.
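
If watching top live is awkward while reproducing this, a one-shot summary of CPU use per command name does a similar job. A small sketch, assuming `ps -A -o pcpu= -o comm=` prints one "percent name" pair per process; `sum_cpu_by_command` is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads "pcpu command" lines on stdin and prints the total
# CPU percentage per command name, highest first.
sum_cpu_by_command() {
    LC_ALL=C awk '
        NF >= 2 {
            cpu = $1
            $1 = ""              # drop the percentage, keep the command name
            sub(/^ +/, "")
            total[$0] += cpu
        }
        END { for (name in total) printf "%.1f %s\n", total[name], name }
    ' | LC_ALL=C sort -rn
}

# Example, run while the dashboard is loading:
#   ps -A -o pcpu= -o comm= | sum_cpu_by_command | head -n 5
```

If PHP is what eats the cores, its processes should add up to the top line as soon as the dashboard opens.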


Quote from: colourcode on December 16, 2024, 02:19:43 PM
Does it happen if you don't have the webgui open at all?

It happens regardless of what I do or how much traffic is going through; it's basically random for all intents and purposes. Sometimes it happens with literally 0 load, sometimes when I am actively doing something. It does not matter whether I have an SSH session or the web dashboard open.