24.7 CPU Temps

Started by ProximusAl, July 26, 2024, 03:28:26 PM

No issue with the temps themselves. I'm using an N100 and they are where I expect them to be.

However, there is a formatting issue in the dashboard widget.


August 13, 2024, 08:48:28 PM #31 Last Edit: August 13, 2024, 08:55:21 PM by yahyoh
I still think the discrepancy in the temperature readings makes no sense lol.

Aren't the temps supposed to be read directly from the sensor on the CPU? I'm not much of a BSD nerd, but I never faced such an issue on Windows or Linux.

Plus, a 10-15 °C difference sounds like way too much. Furthermore, the external body of my N5105 (which acts as a heat sink) doesn't even feel hot enough to indicate that the CPU is really in the 50s °C.

https://i.imgur.com/W5UCYKo.mp4


root@OPNsense:~ # sysctl -a | grep temperature
hw.acpi.thermal.tz0.temperature: 47.1C
dev.cpu.3.temperature: 46.0C
dev.cpu.2.temperature: 46.0C
dev.cpu.1.temperature: 45.0C
dev.cpu.0.temperature: 45.0C
root@OPNsense:~ # sysctl -a | grep temperature
hw.acpi.thermal.tz0.temperature: 47.1C
dev.cpu.3.temperature: 48.0C
dev.cpu.2.temperature: 47.0C
dev.cpu.1.temperature: 46.0C
dev.cpu.0.temperature: 45.0C
root@OPNsense:~ # sysctl -a | grep temperature
hw.acpi.thermal.tz0.temperature: 47.1C
dev.cpu.3.temperature: 47.0C
dev.cpu.2.temperature: 46.0C
dev.cpu.1.temperature: 45.0C
dev.cpu.0.temperature: 47.0C
root@OPNsense:~ # sysctl -a | grep temperature
hw.acpi.thermal.tz0.temperature: 47.1C
dev.cpu.3.temperature: 49.0C
dev.cpu.2.temperature: 48.0C
dev.cpu.1.temperature: 47.0C
dev.cpu.0.temperature: 46.0C
root@OPNsense:~ #
root@OPNsense:~ # sysctl -a | grep temperature
hw.acpi.thermal.tz0.temperature: 47.1C
dev.cpu.3.temperature: 48.0C
dev.cpu.2.temperature: 46.0C
dev.cpu.1.temperature: 46.0C
dev.cpu.0.temperature: 46.0C
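
Repeated readouts like the ones above are easier to compare once they are collected per sensor. The following is a minimal sketch, not an OPNsense tool: it assumes the output has been pasted into a string, and the names `parse_temps`, `summarize`, and `SAMPLE` are made up for illustration. Prompt lines are ignored because they do not match the pattern.

```python
import re
from collections import defaultdict

# Matches lines such as "dev.cpu.3.temperature: 46.0C".
TEMP_LINE = re.compile(r"^(\S+\.temperature):\s*(-?\d+(?:\.\d+)?)C\s*$")


def parse_temps(text):
    """Collect every reading per sensor name from pasted sysctl output."""
    readings = defaultdict(list)
    for line in text.splitlines():
        match = TEMP_LINE.match(line.strip())
        if match:
            readings[match.group(1)].append(float(match.group(2)))
    return dict(readings)


def summarize(readings):
    """Return {sensor: (min, max, mean)} for each sensor."""
    return {
        name: (min(vals), max(vals), sum(vals) / len(vals))
        for name, vals in readings.items()
    }


# Shortened excerpt of the readouts above (two of the five runs, two cores).
SAMPLE = """\
hw.acpi.thermal.tz0.temperature: 47.1C
dev.cpu.3.temperature: 46.0C
dev.cpu.0.temperature: 45.0C
hw.acpi.thermal.tz0.temperature: 47.1C
dev.cpu.3.temperature: 48.0C
dev.cpu.0.temperature: 45.0C
"""

if __name__ == "__main__":
    for name, (lo, hi, mean) in sorted(summarize(parse_temps(SAMPLE)).items()):
        print(f"{name}: min={lo:.1f} max={hi:.1f} mean={mean:.1f}")
```

Feeding it all five runs would show how far each core moves between back-to-back reads on an otherwise idle box.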


Hi @all,

I'm new here and facing the same problem: I'm using an N100 Mini-PC which I updated last evening from 24.1 to 24.7. Although the update process seemed to work fine, the GUI shows much higher (~10°C) temperatures than the CLI output:


markus@opnsense:~ % sysctl -a | grep temperature && sysctl dev.cpu | grep temperature
hw.acpi.thermal.tz0.temperature: 27.9C
dev.cpu.3.temperature: 61.0C
dev.cpu.2.temperature: 59.0C
dev.cpu.1.temperature: 58.0C
dev.cpu.0.temperature: 57.0C
dev.cpu.3.temperature: 58.0C
dev.cpu.2.temperature: 58.0C
dev.cpu.1.temperature: 57.0C
dev.cpu.0.temperature: 57.0C




Also tried this solution, but nothing seems to have changed:
https://forum.opnsense.org/index.php?topic=42323.0


August 27, 2024, 11:55:11 AM #33 Last Edit: August 27, 2024, 12:00:07 PM by meyergru
@maxus and @yahyoh: Yes, we know all that. You probably should re-read the thread.

Franco already explained in detail what is going on:

The difference between the (current) GUI query and a query from the CLI is that during the processing of the dashboard widgets (which include the temperature readouts), the CPU is being used, which in turn heats it up, resulting in an increased reading. You could probably reduce the difference by de-selecting all but the CPU temperature widget.

The granularity of temperature readings on modern CPUs is so fine that this matters, because the sensors now reside on the CPU die itself. The temperature can jump a few degrees in a few microseconds.

Franco also told you that this could only be fixed if the time of readout is shifted from the point in time that the GUI processes the widgets (so a background process is probably needed which decouples this).
Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 800 up, Bufferbloat A+
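
The decoupling described in the post above (a background process takes the reading, the dashboard only displays it) could look roughly like the sketch below. This is an illustration under stated assumptions, not how OPNsense implements or plans to implement it: `TemperatureSampler` and `read_cpu0_temp` are hypothetical names, and the reader assumes a FreeBSD system where `sysctl -n dev.cpu.0.temperature` prints a value such as `46.0C`.

```python
import subprocess
import threading
import time


def read_cpu0_temp():
    """Read one core temperature via sysctl (assumes FreeBSD with coretemp)."""
    out = subprocess.run(
        ["sysctl", "-n", "dev.cpu.0.temperature"],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip().rstrip("C"))


class TemperatureSampler:
    """Samples on its own schedule; readers only ever see the cached value."""

    def __init__(self, read_fn, interval=5.0):
        self._read_fn = read_fn
        self._interval = interval
        self._lock = threading.Lock()
        self._latest = None  # (timestamp, value), or None before the first sample
        self._stop = threading.Event()
        self._thread = None

    def sample_once(self):
        """Take one reading and cache it together with the time it was taken."""
        value = self._read_fn()
        with self._lock:
            self._latest = (time.time(), value)
        return value

    def _run(self):
        while not self._stop.is_set():
            try:
                self.sample_once()
            except Exception:
                pass  # keep the last good value if a read fails
            self._stop.wait(self._interval)

    def start(self):
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()

    def latest(self):
        """What a dashboard request would call: no sensor read happens here."""
        with self._lock:
            return self._latest
```

With `TemperatureSampler(read_cpu0_temp).start()` running, a widget calling `latest()` gets a value that was sampled independently of the page load, so the work of rendering the dashboard no longer coincides with the moment of measurement.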

Quote from: meyergru on August 27, 2024, 11:55:11 AM
@maxus and @yahyoh: Yes, we know all that. You probably should re-read the thread.

Franco already explained in detail what is going on:

The difference between the (current) GUI query and a query from the CLI is that during the processing of the dashboard widgets (which include the temperature readouts), the CPU is being used, which in turn heats it up, resulting in an increased reading. You could probably reduce the difference by de-selecting all but the CPU temperature widget.

The granularity of temperature readings on modern CPUs is so fine that this matters, because the sensors now reside on the CPU die itself. The temperature can jump a few degrees in a few microseconds.

Franco also told you that this could only be fixed if the time of readout is shifted from the point in time that the GUI processes the widgets (so a background process is probably needed which decouples this).

Hi @meyergru,

Thank you for explaining it again :)
Sorry to ask, but were the temperature queries solved differently in the old GUI (24.1)?

No. There were exactly the same complaints about something being "too high" when compared to not staring at the dashboard. Like here.


Quote from: doktornotor on August 27, 2024, 01:08:15 PM
No. There were exactly the same complaints about something being "too high" when compared to not staring at the dashboard. Like here.

Hi @doktornotor,

Thank you for the information. It's not the case that I stare at the dashboard all day long ;)
This is my CPU overview before and after the upgrade. You can see that the "User" and "System" processes were previously in the milli range and are now 1 and 2 digits respectively. Why? Is it possible that this is why the temperature has risen?



I also ran several widgets (including the CPU temp graph) in the old GUI and never experienced such temperature spikes (even while staring at the dashboard).

I also removed all widgets except for the CPU temp display this afternoon. In the widget itself it looks like the temperature is no longer rising as high as before (~70-80°C), but nothing really changes in the RRD diagram.

I am certainly no expert, but the phenomenon only occurred after the update from 24.1 to 24.7. So I wonder what has changed? I just want to understand it.

Thank you.

Quote from: maxus on August 27, 2024, 09:19:29 PM
but the phenomenon only occurred after the update from 24.1 to 24.7.

Well that simply is not true, as documented by the ticket I linked. Whatever, it shows data returned by the on-die sensors and as read and provided by the kernel. I don't really know why people want to see incorrect readings just because they don't like the data shown. Anyway, this is the current dumpspace of these complaints..

Maybe part of the difference is from the fact that widget evaluation has changed because of the structural changes (like order of evaluation or complexity of other widgets).

Maybe you have an RRD database update / maintenance running after the upgrade that caused CPU spikes.

Whatever the reason, there is a GitHub issue, and it will probably be addressed if no issues exist that have higher priority (of which I know some...).

Quote from: doktornotor on August 27, 2024, 09:30:10 PM
Well that simply is not true, as documented by the ticket I linked.
Whatever, it shows data returned by the on-die sensors and as read and provided by the kernel. I don't really know why people want to see incorrect readings just because they don't like the data shown. Anyway, this is the current dumpspace of these complaints..

And I wonder if my questions are perhaps really so misleading:
- Why the change in the processes (as in the picture in the last post)?
- Why the increase in temperatures after the update?
- The link you mentioned starts with October 2023. My discrepancies became visible with yesterday's update from 24.1 to 24.7.

If someone tells me: "Yes, we have changed something here and there (e.g. widget) so that the correct temperatures are now logged and more processes are also running", then that's totally ok, because that would be an answer.

I am quite sure that the problem you are talking about (Github link) has been occurring for some time. That doesn't mean that this is the case for me. I have already written that I have not changed anything in the widgets (e.g. number) or in OPNsense itself, nor have I changed my behavior (e.g. viewing the dashboard continuously for 24 hours), but only that I have done the system update. Something must have changed since the update, otherwise we wouldn't be discussing it here.

But maybe I just don't understand it...

Quote from: meyergru on August 27, 2024, 09:55:54 PM
Maybe part of the difference is from the fact that widget evaluation has changed because of the structural changes (like order of evaluation or complexity of other widgets).

Maybe you have an RRD database update / maintenance running after the upgrade that caused CPU spikes.

Whatever the reason, there is a GitHub issue, and it will probably be addressed if no issues exist that have higher priority (of which I know some...).

Thank you @meyergru for your answer.

Regarding the RRD database update / maintenance: How long could that really take? The temperature in the RRD hasn't changed since the update.
Do you think "Reset RRD Data" would change something (someone mentioned it before)?

D'accord that there are definitely bigger problems than this  ;D
My N100 mini PC wasn't really "cold" even before, and when another 10 degrees are added on top (according to the RRD), you start to worry at first.

August 27, 2024, 10:46:24 PM #40 Last Edit: August 27, 2024, 10:59:24 PM by meyergru
RRD databases do reconstruct sometimes after a reboot. I have experienced CPU spikes and 100% load after reboots as well. Of course that will raise temps, so resetting/repairing RRD databases often helps.

Even with all things equal: There is a discrepancy between a CLI readout of temps and a GUI inspection, because there is a lot more going on in the GUI. And because of structural changes with widgets, there may even be differences between old and new widgets.

That is not a sign of some kind of defect, but probably we need a different approach here to be compatible with the old reporting.

Also, the 10 degrees more are real, but there is no need to worry. You just need to understand how to interpret the reported temperature: these higher temps are only spikes, which may have been there all along without you noticing them.
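
One way to tell short spikes from a sustained rise in a series of readings is to compare the raw maximum with a rolling median and to check how long the temperature stays above some level. The sketch below is illustrative only: the sample values are invented, the 55 °C threshold is arbitrary, and `rolling_median` and `longest_run_above` are hypothetical helper names.

```python
from statistics import median


def rolling_median(values, window=5):
    """Median over a trailing window; short spikes barely move it."""
    out = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        out.append(median(values[start:i + 1]))
    return out


def longest_run_above(values, threshold):
    """Length of the longest consecutive run of samples above threshold."""
    best = run = 0
    for v in values:
        run = run + 1 if v > threshold else 0
        best = max(best, run)
    return best


# Invented readings in °C: mostly idle, with two isolated one-sample spikes.
temps = [46, 47, 58, 46, 47, 46, 59, 47, 46, 46]

if __name__ == "__main__":
    print("raw max:", max(temps))
    print("max of rolling median:", max(rolling_median(temps)))
    print("longest run above 55:", longest_run_above(temps, 55))
```

If the rolling median stays near the idle value and the longest run above the threshold is a single sample, the high readings are spikes. A sustained increase would raise the median as well.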

maxus,

I'm having the exact same issue and observations, and I do not believe the higher temperatures are strictly the result of how they're being reported as explained by others. Instead, as you inferred, I think the temperature is the result of higher CPU utilization.

Under Reporting -> Health, my 'user' and 'system' processor utilization is showing 25 times higher under 24.7.x than under 24.1. The number of existing processes is still the same, though. Logically, this explains why I am also seeing temperatures that are 10 to 15 degrees higher and peaking into the maximum range of what my processor's specs allow.

I think this is worth investigating and patching before anyone has hardware failure as a result.

Quote from: irrenarzt on August 28, 2024, 12:22:42 AM
I think this is worth investigating and patching before anyone has hardware failure as a result.

This is getting borderline absurd. Get a better cooling system if you have such concerns.

Quote from: doktornotor on August 28, 2024, 12:25:43 AM
Quote from: irrenarzt on August 28, 2024, 12:22:42 AM
I think this is worth investigating and patching before anyone has hardware failure as a result.

This is getting borderline absurd. Get a better cooling system if you have such concerns.

My cooling system wasn't a problem in 24.1.

My CPU utilization increasing for no apparent reason in 24.7 and producing more heat is though.

Why be so condescending about a problem that is so easily observable and graphed in the system history?

Your cooling system is not a problem with 24.7 either. The only problem here is in people's heads. Take off the heatsink of your CPU. Nothing will happen. It will underclock itself to the point of being unusable. Eventually it will shut down. That's all. Nothing will burn. No flames. Nothing will be damaged.

Additionally, I would suggest reading your CPU's thermal specs before making claims such as CPUs being damaged when run at 60C.

Sheesh. Perhaps removing the widget would be the best course of action here.