lots of python3.9 core dumps, then reboots

Started by bebef, August 14, 2023, 01:17:35 PM

Hi everyone,

23.7 seems quite unstable for me. The other day I had to restart Unbound, and today I see a lot of python3.9 core dumps, followed by a kernel panic
kernel - - [meta sequenceId="2"] panic: ffs_blkfree_cg: freeing free block
and then a reboot.

I never noticed unstable behaviour while on 23.1.

Any ideas?

Cheers


I'm seeing it with PHP as well.

Quote from: franco on August 14, 2023, 01:23:42 PM
UFS? Disk dying?

Yup, UFS. Checked the disk, SMART says everything is OK. No reallocated sectors etc.

I wonder if it might be RAM instead? I mean, still could be the disk though.

"ffs" here stands for fast file system. If the RAM is damaged there's no definitive area where it crashes and it usually crashes harder, but eventually it could also destabilize the file system.


Cheers,
Franco

I just reinstalled from a USB device and so far I'm not seeing the errors any more. Makes me wonder if it really was the disk, but the problem in https://forum.opnsense.org/index.php?topic=35404.0 sounds quite familiar...

And there come the core dumps again :(

pid 93942 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
pid 95393 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
pid 96661 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
pid 2310 (cc), jid 0, uid 0: exited on signal 6 (core dumped)
pid 432 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
pid 17313 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
pid 19572 (cc), jid 0, uid 0: exited on signal 11 (core dumped)
pid 18855 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
pid 21265 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
pid 23996 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
pid 24997 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
pid 25505 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
pid 28272 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
pid 33503 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
pid 35356 (cc), jid 0, uid 0: exited on signal 6 (core dumped)
pid 34514 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
pid 38852 (python3.9), jid 0, uid 0: exited on signal 10 (core dumped)
pid 41862 (python3.9), jid 0, uid 0: exited on signal 10 (core dumped)
pid 43078 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
pid 43889 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
pid 45957 (python3.9), jid 0, uid 0: exited on signal 10 (core dumped)
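
For anyone triaging a similar log, here is a small Python sketch that tallies the crash lines by process and signal. The function names (`tally_crashes`, `format_tally`) are made up for illustration, the regex assumes exactly the message format shown above, and the signal names are the FreeBSD meanings of 6, 10 and 11.

```python
import re
from collections import Counter

# FreeBSD signal numbers that appear in the log above (see signal(3)).
SIGNAL_NAMES = {6: "SIGABRT", 10: "SIGBUS", 11: "SIGSEGV"}

# Assumed format: "pid N (name), jid N, uid N: exited on signal N (core dumped)"
LINE_RE = re.compile(
    r"pid (?P<pid>\d+) \((?P<proc>[^)]+)\), jid \d+, uid \d+: "
    r"exited on signal (?P<sig>\d+)"
)


def tally_crashes(lines):
    """Count crash messages per (process name, signal number)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:  # silently skip lines that are not crash messages
            counts[(match.group("proc"), int(match.group("sig")))] += 1
    return counts


def format_tally(counts):
    """Return one text row per (process, signal), most frequent first."""
    rows = []
    for (proc, sig), n in counts.most_common():
        name = SIGNAL_NAMES.get(sig, f"signal {sig}")
        rows.append(f"{proc:<12} {name:<8} {n}")
    return rows


# Hypothetical usage with three lines copied from the log above.
sample = [
    "pid 93942 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)",
    "pid 2310 (cc), jid 0, uid 0: exited on signal 6 (core dumped)",
    "pid 38852 (python3.9), jid 0, uid 0: exited on signal 10 (core dumped)",
]
for row in format_tally(tally_crashes(sample)):
    print(row)
```

Counting the full log above by hand gives 15 SIGSEGV and 3 SIGBUS for python3.9, plus 2 SIGABRT and 1 SIGSEGV for cc. Crashes spread over unrelated binaries with different signals tend to point at hardware rather than one buggy program, though that is a rule of thumb and not proof.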

Can you replace the RAM? Or try a BIOS reset?

RAM is one of the next things to replace, I guess.

What would you expect from a BIOS reset, though?

Seems that it really was the RAM. Replaced it, haven't seen any core dumps since.

In the future, you can run a version of memtest to confirm it's the RAM before just replacing it.

Well, technically I could, I assume.

However, you should consider the following: The OPNsense box is the only machine that I have access to that is capable of testing the RAM. A full memtest would mean a few hours of downtime, essentially leaving me without network for that period of time. On top of that, RAM is very cheap. The spare part cost me less than a case of my favourite beer.  ;)

So yeah, you can test RAM before exchanging it, and "blindly" replacing it might not have fixed the underlying issue anyway. Still, there can be good reasons to take the chance and replace an untested part. Best case (my case), it fixes the issue; worst case, you spend a few bucks for nothing. Compare that to the "test case" of running patterns over the RAM for hours with no network whatsoever.  ;)

I was mainly pointing that out so that others who run into similar issues would know. A lot of people don't know how to do hardware testing, or even that it's something they can easily do.

Well, I am getting pretty much identical issues on my new OPNsense box too. It has only been happening for the last couple of weeks, since I did the upgrade.

Just out of interest, what box/processor etc. are you using, in case there is some commonality in processor, memory size, etc.? Mine is an N5105 with 16GB RAM. It's six weeks old, but that does not guarantee that there isn't a hardware fault.


Out of interest, this seems to be when it started. Interesting that the failing script is related to NUT, which ought to have the ability to shut down the PC too. I'm going to disable NUT and see if this helps.