24.7 CPU Temps

Started by ProximusAl, July 26, 2024, 03:28:26 PM

Now that you mention the widget, you are right: the widget shows a higher temp. I assumed everyone was talking about temps in RRD or in the CLI.

Franco, yes, fetching it that way would be better, because the widget gives somewhat misleading info.

Regards,
S.
Networking is love. You may hate it, but in the end, you always come back to it.

OPNSense HW
APU2D2 - deceased
N5105 - i226-V | Patriot 2x8G 3200 DDR4 | L 790 512G - VM HA(SOON)
N100   - i226-V | Crucial 16G  4800 DDR5 | S 980 500G - PROD

Admittedly, reporting these temperature spikes is a bit counter-productive when you think about it: you fetch them when they happen, and they happen because you fetch them. ;)


Cheers,
Franco

July 27, 2024, 07:44:04 PM #17 Last Edit: July 27, 2024, 07:53:23 PM by MenschAergereDichNicht
Quote from: franco on July 27, 2024, 10:18:20 AM
We may have to write a small tool to fetch the temperatures in the background away from the GUI so when the GUI query comes in it reads the actual value, not the one while the CPU is busy processing the user request..?

Cheers,
Franco

Maybe you could just use the RRD data if it is available.

And if I understood your problem description correctly, it might be a good idea to use "sysctl dev.cpu" instead of "sysctl -a" for the RRD data.

> Maybe you could just use the RRD data if it is available.

Not a great plan because the RRD backend needs a full rewrite.

> And if I understood your problem description correctly, it might be a good idea to use "sysctl dev.cpu" instead of "sysctl -a" for the RRD data.

RRD is not even using sysctl so not understood. It's actually using a tool that really really needs to be removed for the same reason that RRD backend needs a full rewrite.

Just trying to give a perspective. Guessing problems into open source is a bit taxing from a dev point of view because now it's not enough to be open it needs to be explained constantly...


Cheers,
Franco

Hearing about rewrites and seeing some of them raises a question: in the long run, do you plan a complete graphical rewrite and overhaul of all aspects of OPNsense? :)

Regards,
S.

Was initially hoping it would take 10 years so we would be almost done, but it's safe to say it may take up to 5 more years.

This includes API/MVC for everything user-facing as well as full privilege separation for the GUI.


Cheers,
Franco

Great!

It is actually awesome to hear you are still doing this. No matter the time frame, this is great news :)

Thanks Franco.

Regards,
S.

Hi Franco,

I guess I could try harder to explain myself.

> Guessing problems into open source is a bit taxing from a dev point of view because now it's not enough to be open it needs to be explained constantly

First of all, I understand that it is sometimes tiring to explain things over and over. In my case I actually think you have a point, because I *could* look into the sources to get some insight or be more precise with my statements. But you can't really expect this from every random person.

> RRD is not even using sysctl so not understood. It's actually using a tool that really really needs to be removed for the same reason that RRD backend needs a full rewrite.

Now what I tried to express was that it is probably a good idea to use one common source of truth for such data in general (because it is irritating to see different values inside the GUI). And because I didn't know that RRD was in need of refactoring, I thought that source of truth could be the RRD. But you can abstract that away if you like.

Similar to your idea

> We may have to write a small tool to fetch the temperatures in the background away from the GUI so when the GUI query comes in it reads the actual value, not the one while the CPU is busy processing the user request

The important thing is that there shouldn't be several ways to gather the data (tool and RRD) but only one (the tool), and the other consumers (widget, RRD) should ask the tool service for the values (to avoid different results and unnecessary load).

The second part, about how to read the actual values (sysctl dev.cpu), was meant to illustrate that it would be nice to use a more lightweight method for the central data collector.
I compared that to "sysctl -a" in this context because I compared the RRD values to the output of the command line calls, and "sysctl -a" was close to the RRD values in my case. Therefore I assumed that it is using this command or at least something similar.
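
The single-collector idea described above could look roughly like the following. This is only a minimal sketch under assumptions, not how OPNsense does it: CACHE_FILE, TEMP_CMD, sample_temps and read_cached_temps are hypothetical names, and the 10 second interval in the comment is arbitrary.

```shell
#!/bin/sh
# Sketch: one sampler writes temperatures to a cache file; all consumers read the file.
# CACHE_FILE and TEMP_CMD are hypothetical names, not OPNsense settings.
CACHE_FILE="${CACHE_FILE:-/tmp/cpu_temps.cache}"
# Reading only the dev.cpu subtree avoids walking the whole sysctl tree.
TEMP_CMD="${TEMP_CMD:-sysctl dev.cpu}"

# Take one sample and publish it atomically so readers never see a partial file.
sample_temps() {
    tmp="${CACHE_FILE}.$$"
    if $TEMP_CMD 2>/dev/null | grep temperature > "$tmp"; then
        mv "$tmp" "$CACHE_FILE"
    else
        # No temperature lines found: keep the previous cache, drop the temp file.
        rm -f "$tmp"
        return 1
    fi
}

# Consumers (widget, RRD, CLI) call this instead of querying sysctl themselves.
read_cached_temps() {
    cat "$CACHE_FILE"
}

# Background loop, e.g. started once at boot (interval is an arbitrary assumption):
#   while :; do sample_temps; sleep 10; done &
```

Because every consumer reads the same file, the widget and RRD would show the same value, and a GUI request would no longer trigger a fresh reading while the CPU is busy serving that request.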


Greetings,
Stefan

With the updated widget in 24.7.1, is there a way (perhaps by editing the js file) to exclude the temperatures for CPU1, CPU2, etc.? In many cases, they will all be the same so it takes up screen real-estate to show several identical values.
OPNsense 24.7.7-amd64 on APU2E4 using ZFS

We will possibly add an option to average across all common sensors in the widget. It's the best of both worlds without trying to do it automatically, which failed because temps from CPUs that report separate temperatures could still match from one reading to the next, making the data set jumpy in terms of how many sources it actually has.

As far as temps reading goes here is my take: if we say the GUI temp is wrong we have to assume the idle test temp is wrong as well. The real temp is somewhere in the middle, so the question is how many checks per second do we need to make to get the correct average under light load... because I think the temperatures are closer together under higher load anyway.
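
For illustration, averaging across the per-core sensors can be sketched in a few lines of shell. This is not the widget code (which is JavaScript) and avg_cpu_temp is a hypothetical name; it only assumes the sysctl output formats shown elsewhere in this thread.

```shell
# Average all dev.cpu.N.temperature readings from sysctl-style input on stdin.
# Accepts both "name: 31.0C" and "name=31.0C" lines; other sensors are ignored.
avg_cpu_temp() {
    awk -F'[:=]' '
        /^dev\.cpu\.[0-9]+\.temperature/ {
            val = $2
            gsub(/[ C]/, "", val)   # strip the leading space and the unit
            sum += val
            n++
        }
        END { if (n) printf "%.1f\n", sum / n }
    '
}
```

Usage on a FreeBSD box would be something like `sysctl dev.cpu | avg_cpu_temp`. Calling it several times per second and averaging the results again would be one way to approach the "how many checks per second" question.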



Cheers,
Franco

so taking the OPNsense UI and such out of the picture, i can't really explain this weirdness. both reads in the same command line on an idle system:

# sysctl hw.acpi.thermal.tz0.temperature dev.cpu.{0,1,2,3}.temperature && sysctl -e `sysctl -aN | grep temperature`
hw.acpi.thermal.tz0.temperature: 27.9C
dev.cpu.0.temperature: 31.0C
dev.cpu.1.temperature: 31.0C
dev.cpu.2.temperature: 31.0C
dev.cpu.3.temperature: 31.0C
hw.acpi.thermal.tz0.temperature=27.9C
dev.cpu.3.temperature=40.0C
dev.cpu.2.temperature=40.0C
dev.cpu.1.temperature=40.0C
dev.cpu.0.temperature=40.0C


if i look at the sysctl directly, the temps are much lower, and they are similar to what i see when i boot the same machine with debian. the UI gets its temps from /usr/local/opnsense/scripts/system/temperature.sh, which does
sysctl -e `sysctl -aN | grep temperature`

for some reason sysctls with the same names return different values, and it's not that the commands themselves cause the CPU temps to rise...

From what we have learned today, this is the observation of heat not being able to leave the CPU quickly enough for whatever reason. It feels counter-productive to report a lower reading just because the CPU reading is lower during idle. It is the temperature the CPU has at the time of the reading.


Cheers,
Franco

trying this simple shell script:

#!/usr/local/bin/bash
sysctl dev.cpu.{0,1,2,3}.temperature hw.acpi.thermal.tz0.temperature
sysctl -e `sysctl -aN | grep temperature`


now run it super fast:

gnu-watch -n0.1 /root/temps.sh


Result:

Every 0.1s: /root/temps.sh

dev.cpu.0.temperature: 32.0C
dev.cpu.1.temperature: 32.0C
dev.cpu.2.temperature: 32.0C
dev.cpu.3.temperature: 32.0C
hw.acpi.thermal.tz0.temperature: 27.9C
hw.acpi.thermal.tz0.temperature=27.9C
dev.cpu.3.temperature=43.0C
dev.cpu.2.temperature=43.0C
dev.cpu.1.temperature=43.0C
dev.cpu.0.temperature=43.0C


there is clearly a difference between

sysctl dev.cpu.{0,1,2,3}.temperature hw.acpi.thermal.tz0.temperature

and

sysctl -e `sysctl -aN | grep temperature`


which seems like it may be a bug in sysctl? unless the subprocess of

`sysctl -aN | grep temperature`

can cause the CPU to spike 10+ degrees C, which seems unlikely. and if it did, why doesn't it affect the other sysctl when running at 0.1s intervals?
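
To make the gap between the two read methods easier to track over time, the per-sensor difference can be computed from two captured outputs. A rough sketch, assuming one file holds "name: value" lines and the other holds "name=value" lines as in the results above; temp_delta and the file names in the usage note are hypothetical.

```shell
# Print the per-sensor difference (second file minus first file) in degrees C.
# $1: output of the direct read, $2: output of the sysctl -e read.
temp_delta() {
    awk -F'[:=]' '
        /temperature/ {
            name = $1
            val = $2
            gsub(/[ C]/, "", val)                    # strip space and unit
            if (NR == FNR) { first[name] = val; next }   # still reading file 1
            if (name in first) printf "%s %.1f\n", name, val - first[name]
        }
    ' "$1" "$2"
}
```

For example, redirect the first sysctl line of temps.sh to direct.txt and the second to walk.txt, then run `temp_delta direct.txt walk.txt`. It only reports the difference and says nothing about which of the two readings is the "right" one.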

Why should the numbers lie? Idle vs. busy should yield a temperature difference, no? Assuming the reading is wrong seems futile... software bug? hardware bug? Not on our end then, we just read it. ;)


Cheers,
Franco

i certainly agree that the differences don't really matter, for sure. but it's not because of idle or CPU activity, it seems like a bug. the behavior existed in 24.1 and still exists in 24.7, and i see it on both intel and amd cpus, so it seems like a sysctl bug: very subtle, but also not critical