Constant lockups/crashes

Started by falsifyable_entity, December 08, 2024, 10:33:36 PM

Running on a physical system:

Celeron(R) J4125 CPU @ 2.00GHz (4 cores, 4 threads)
8 GB of RAM
256 GB NVMe SSD
Intel Ethernet Controller I225-V


All hardware has been tested and is functional.

The system locks up: the web UI and SSH stop responding. It keeps routing fine, but DNS also dies. There is absolutely nothing of note in the logs. This only started happening after I updated from 24.1.10 to the 24.7 branch. Before the update it was stable.

I have run a memtest and an fsck, and both came up green. I also tried re-applying all the imported settings manually. I have no other ideas about what to do.

I have exhausted all ideas I had to diagnose this. Can someone with more brains than me help?

Not enough details. Crystal balls need more info to diagnose perfectly running systems that suddenly freeze in certain areas only.

While you explain which services you are running, what is the output of:

ls -ltrh /var/crash && df -hT

Here's the output of the command:

# ls -ltrh /var/crash && df -hT
total 1
-rw-r--r--  1 root wheel    5B Dec  2 20:45 minfree
Filesystem                 Type       Size    Used   Avail Capacity  Mounted on
zroot/ROOT/default         zfs        221G    1.6G    219G     1%    /
devfs                      devfs      1.0K      0B    1.0K     0%    /dev
/dev/gpt/efiboot0          msdosfs    260M    1.3M    259M     1%    /boot/efi
zroot/tmp                  zfs        219G    200K    219G     0%    /tmp
zroot/var/log              zfs        219G    117M    219G     0%    /var/log
zroot                      zfs        219G     96K    219G     0%    /zroot
zroot/var/audit            zfs        219G     96K    219G     0%    /var/audit
zroot/home                 zfs        219G     96K    219G     0%    /home
zroot/usr/src              zfs        219G     96K    219G     0%    /usr/src
zroot/usr/ports            zfs        219G     96K    219G     0%    /usr/ports
zroot/var/tmp              zfs        219G     10M    219G     0%    /var/tmp
zroot/var/crash            zfs        219G     96K    219G     0%    /var/crash
zroot/var/mail             zfs        219G     96K    219G     0%    /var/mail
devfs                      devfs      1.0K      0B    1.0K     0%    /var/dhcpd/dev
devfs                      devfs      1.0K      0B    1.0K     0%    /var/unbound/dev
/usr/local/lib/python3.11  nullfs     221G    1.6G    219G     1%    /var/unbound/usr/local/lib/python3.11
/lib                       nullfs     221G    1.6G    219G     1%    /var/unbound/lib


The only service of note I am running is Unbound with blocklists. That's it; I cut everything else out while trying to isolate this issue.
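
A note on reading that listing: /var/crash holds only minfree, which is the free-space threshold file read by savecore(8), so no kernel dump was saved. A hard freeze that never panics would not leave one anyway. A minimal sketch for checking this in one step; the function name list_crash_dumps is made up for illustration:

```shell
# List files in a crash directory other than the minfree threshold file.
# An empty result means nothing was saved there; anything else found
# (for example vmcore.N or info.N files) was written by savecore(8).
list_crash_dumps() {
    dir=${1:-/var/crash}
    find "$dir" -type f ! -name minfree
}
```

On a box like the one above, running list_crash_dumps with no argument should print nothing.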

Nothing weird in dmesg either?

This still seems to be a HW issue, most likely RAM or power related.

You could start adding services back one by one, watching temps and top for any clues as to what could be causing your issues.
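
Since the box accepts no input once it freezes, watching temps and load by hand only helps if something writes them down beforehand. A rough sketch of a heartbeat logger, assuming FreeBSD's sysctl names; heartbeat_line, heartbeat_loop, the log path, and the interval are all made up for illustration, and dev.cpu.0.temperature is only present when a thermal driver such as coretemp(4) is loaded:

```shell
# Build one status line: timestamp, load averages, CPU 0 temperature.
# Both sysctl reads fall back to "n/a" if the name is not available.
heartbeat_line() {
    now=$(date '+%Y-%m-%dT%H:%M:%S')
    load=$(sysctl -n vm.loadavg 2>/dev/null || echo "n/a")
    temp=$(sysctl -n dev.cpu.0.temperature 2>/dev/null || echo "n/a")
    printf '%s load=%s temp=%s\n' "$now" "$load" "$temp"
}

# Append a line every INTERVAL seconds. The log path and interval are
# hypothetical defaults, not OPNsense settings.
heartbeat_loop() {
    log=${1:-/var/log/heartbeat.log}
    interval=${2:-10}
    while :; do
        heartbeat_line >> "$log"
        sleep "$interval"
    done
}
```

Started as `heartbeat_loop /var/log/heartbeat.log 10 &`, the last line in the file after a reboot gives an approximate freeze time and the state shortly before it. The final few seconds may not have been flushed to disk before a hard freeze.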

I tested this machine under Linux for 2 days straight with the 'stress' command running, and it was solid as a rock. I also ran memtest and the memory tested green. That's why I am utterly baffled by this.

Dmesg also shows absolutely nothing of note, but I cannot check it after the freeze because the machine does not accept any interaction: not from plugged-in peripherals, not over SSH, and not over serial.

I also noticed that the wireless interface (the machine has one, but I never used it) has disappeared. The last time I remember seeing it was on 24.1.10.

Please double- and triple-check the threads about a troublesome kernel on 24.7.10. They are all from the last week, so they should be easy to spot. You'd want to be sure you are on the correct one.

Quote from: falsifyable_entity on December 09, 2024, 03:58:45 AM
The only service of note I am running is Unbound with blocklists. That's it; I cut everything else out while trying to isolate this issue.

The quadruple checking already happened, no need to scare people into issues they don't have :)

I'm not trying to scare anyone, and I would be remiss not to point out those potential issues. Where in this thread has it been checked? I can't see it, unless you know something the rest of us don't. Otherwise, how do you know they have the correct kernel?

# uname -v
FreeBSD 14.1-RELEASE-p6 stable/24.7-n267981-8375762712f SMP


This issue began months ago, the day I moved off 24.1.10. I figured an update would just fix it, but unfortunately the issue persisted and got really annoying, hence this thread.
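
For comparing against the kernel threads mentioned earlier, the relevant part of that `uname -v` line is the third field, which carries the branch and the commit hash. A small sketch that pulls the two apart; the function name kernel_ident is hypothetical, and the field layout is assumed from the single line of output shown above:

```shell
# Print the branch and commit hash from a `uname -v` line.
# Assumes the third field looks like <branch>-n<count>-<hash>; a branch
# name that itself contains "-n" would be cut short.
kernel_ident() {
    ident=$(printf '%s\n' "$1" | awk '{print $3}')
    printf 'branch=%s hash=%s\n' "${ident%%-n*}" "${ident##*-}"
}
```

On a live system it would be called as `kernel_ident "$(uname -v)"`. It only extracts the values; which hash counts as the correct one still has to come from the release notes or the threads in question.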

OK, that is unfortunate, because it would have been easier to identify differences between old and new right after the major upgrade.
The only thing left now is to diagnose from scratch.
Different log files to look through: https://docs.opnsense.org/manual/logging_services.html
Then either disable services or enable them one by one until you can reproduce the problem. Basically a process of elimination.
Describing your setup would also help.

There's absolutely nothing of note in the log files listed on the linked page: no errors, everything in order.
I already tried disabling services, but the problem seems unrelated to which services I run. Even with DNS, DHCP, and NTP disabled and the box basically doing nothing, it still froze the same way.

This doesn't sound like a software issue. Any chance you could test another PSU?

This box has an external 12 V, 5 A PSU brick, and I did swap it for another one I had from a different device. It hasn't changed the behavior at all.
Also, the 2-day Linux test was conducted on the original power brick, and as I already mentioned, it ran fine for far longer than OPNsense ever has without freezing.