Temperature: Dashboard Temps differ massively from CLI

Started by fastboot, December 01, 2024, 02:15:53 PM

@OPNenthu

I have a Protectli, but as mentioned, it's a different model. The VP6000 series is almost brand new and ships with two fans.
My 6630 behaves completely differently with Linux installed and lm-sensors in use. In my case the difference is around 30-40°C compared with the output I get from the dashboard.
Even the output of "sysctl dev.cpu | grep temperature" is far away from these peaks.

https://protectli.com/wp-content/uploads/2024/07/VP6630-Datasheet-20240628.pdf
On page 8 you can see the mainboard. #28 is the slot for the NVMe (I'm using an INTENSO SSD with SLC). In my case there is an additional heatsink with a thermal pad mounted on it.

On top of that, I got a replacement part from Protectli. For the first 36-48 hours the dashboard showed lower values than the other machine. After that it also reached 80-82°C on the dashboard. So to summarize: both devices show the same behavior after ~2+ days.

My ambient temperature is monitored by several sensors, to name a few: BME680, BME280 and some others.

The NVME is monitored as well:
E.g
Temperature:                        36 Celsius

Room_Temperature right now:  21,52 °C (increasing)
This temperature is also far from having any impact on the temperature of the FW. In a high-performance PC build the NVMe temperatures are similar, and there the heatsink is "MASSIVE" (Gigabyte X670 Aorus Master).

Let's see how it goes in the summer :D

Ha! Found a data point: Protectli FW4B

root@opnsense:~ # sysctl dev.cpu.0.temperature
dev.cpu.0.temperature: 50.0C
root@opnsense:~ # sysctl -a | grep dev.cpu.0.temperature
dev.cpu.0.temperature: 54.0C
root@opnsense:~ # configctl system sysctl values dev.cpu
[...] 51.0C

Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

December 05, 2024, 11:01:45 AM #47 Last Edit: December 05, 2024, 11:07:43 AM by meyergru
My N5105 china box (by Topton, the darker one from this test, dmidecode shows CW-N6000, so Changwang/CWWK is the manufacturer):


[root@OPNsense.jmg]# sysctl dev.cpu.0.temperature
dev.cpu.0.temperature: 43.0C
[root@OPNsense.jmg]# sysctl dev.cpu | fgrep dev.cpu.0.temperature
dev.cpu.0.temperature: 44.0C
[root@OPNsense.jmg]# sysctl -a | fgrep dev.cpu.0.temperature
dev.cpu.0.temperature: 46.0C
[root@OPNsense.jmg]# configctl system sysctl values dev.cpu
...
"dev.cpu.0.temperature":"46.0C"
...


I always check the thermal paste on those boxes if I see fluctuating temps and fix it.

Quote from: OPNenthu on December 05, 2024, 10:31:57 AM
There was a proposal on GitHub to save the list of sensors from 'sysctl -a' at startup, as a one-time call.  From then on it would be possible to call 'sysctl' on them periodically.  It was not well received.

I know, I made that proposal. What I wanted to stress is that the "easy (reduction) method" Franco prefers cannot find all potential sensors.
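
For illustration, the one-time discovery idea could look roughly like the sketch below. It is not the code from the GitHub proposal or anything OPNsense ships; the function names `discover_temperature_oids` and `build_poll_command` are made up here, and the only assumption about the format is that readings appear as `name.temperature: 12.3C`, as in the outputs posted in this thread.

```python
import re

# Matches lines such as "dev.cpu.0.temperature: 46.0C".
TEMPERATURE_LINE = re.compile(r"^([\w.]+\.temperature):\s*-?\d+(?:\.\d+)?C$")


def discover_temperature_oids(sysctl_a_output):
    """Return the names of all temperature OIDs found in `sysctl -a` output."""
    oids = []
    for line in sysctl_a_output.splitlines():
        match = TEMPERATURE_LINE.match(line.strip())
        if match:
            oids.append(match.group(1))
    return oids


def build_poll_command(oids):
    """Build the cheap periodic call: sysctl with the cached names only."""
    return ["sysctl"] + list(oids)


# Sample text in the format seen in this thread (values are illustrative).
startup_output = """\
dev.cpu.0.temperature: 46.0C
dev.cpu.0.coretemp.tjmax: 105.0C
hw.acpi.thermal.tz0.temperature: 27.9C
"""
cached = discover_temperature_oids(startup_output)
print(build_poll_command(cached))
```

The expensive full walk happens once; every later poll only asks for the names in `cached`. The obvious limitation is that sensors appearing after startup would be missed.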
Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 440 up, Bufferbloat A+

All I tried to achieve back then was to not lose track of the idea that the temperature widget should show the relevant temperature sensors that might be active in the system. We can dial back the scope of the widget to dev.cpu, but first we need to make sure the lookup is sensible in what it tries to achieve: "show more accurate heat readings for the average case without spinning the CPU too much so that it skews the reading".


Cheers,
Franco

December 05, 2024, 11:27:54 AM #49 Last Edit: December 05, 2024, 11:31:39 AM by meyergru
For now my impression is that the usual case is a 2-3°C delta, and up to 15°C in cases where heat transfer is problematic.

I wonder if it is better to keep the old way of doing it and explain to users that if they observe a big difference, they should inspect their cooling  ;)

Quote from: meyergru on December 05, 2024, 11:27:54 AM
For now my impression is that the usual case is a 2-3°C delta, and up to 15°C in cases where heat transfer is problematic.

I wonder if it is better to keep the old way of doing it and explain to users that if they observe a big difference, they should inspect their cooling  ;)

Maybe it's just me, maybe I am in the wrong mood at this moment. But sometimes I have the impression you think that other users are stupid.

In this regard I can only speak for myself. I know precisely what I am doing. I know my hardware, and I know my tools. Where I don't, I put in the time and effort to gain a deep knowledge of the things I work with.

But to keep it very short: there is no issue with the cooling in my devices. If there were, it would have been fixed already.

December 05, 2024, 01:15:14 PM #51 Last Edit: December 05, 2024, 01:19:57 PM by _tribal_
Quote from: OPNenthu on December 05, 2024, 10:31:57 AM

Maybe we need a 5105 owners thread to see if anyone else is having similar quick temperature transitions.
I have exactly the same behavior on my N5105 since upgrading to 24.x.  >:( All my questions were answered with assurances that I was looking at the temperature the wrong way and that everything is correct now.  :'( I have given up trying to explain anything to the developers and just subtract 10-15 degrees in my head when I look at the temperature graphs ::)

Quote from: meyergru on December 05, 2024, 11:27:54 AM
For now my impression is that the usual case is a 2-3°C delta, and up to 15°C in cases where heat transfer is problematic.
If only. Unfortunately the difference fluctuates, but more often than not it is around +10 degrees. There are no problems with heat transfer; the system passed a stress test lasting more than a week.

Maybe it is my inability to explain this more clearly, so for the last time:

When heat transfer is bad because of poor thermal paste or too small a contact patch, you will see short spikes in CPU die temperature, because the heat cannot be soaked up immediately by the mass of the case or heatsink. After a while the heat WILL be transferred anyway, because there is still no vacuum; it is just a bad transfer medium causing the delay.

Thus, short bursts of CPU activity will heat up the die with bad heat transfer much faster. That is exactly what is happening with the current measurement method.

It says nothing about long term stability under stress. If you put a continuous load on the CPU, the resulting maximum temperature will not even be higher with bad transfer (with the same power limit and thermal capacity of the case/heatsink, of course), so you cannot compare those.

I have 4 of those china boxes, of which 2 had this problem and 2 did not. After fixing it, I see spikes of 2-3 degrees on all of these systems during measurements, as does Patrick.
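
The burst argument above can be illustrated with a toy two-node thermal model (die and heatsink, joined by a contact resistance). Every number below is an invented assumption rather than a measurement of any of these boxes, and the sketch only looks at a short burst, not at sustained load.

```python
def die_temp_after_burst(r_contact, power_w=10.0, burst_s=2.0, dt=0.001,
                         c_die=0.5, c_sink=200.0, r_ambient=2.0,
                         t_ambient=25.0):
    """Explicit-Euler simulation; returns the die temperature (°C) at the end of the burst.

    r_contact / r_ambient are thermal resistances in K/W,
    c_die / c_sink are heat capacities in J/K (all assumed values).
    """
    t_die = t_sink = t_ambient  # start from a cold, idle system
    for _ in range(round(burst_s / dt)):
        q_contact = (t_die - t_sink) / r_contact      # W flowing die -> heatsink
        q_ambient = (t_sink - t_ambient) / r_ambient  # W flowing heatsink -> air
        t_die += dt * (power_w - q_contact) / c_die
        t_sink += dt * (q_contact - q_ambient) / c_sink
    return t_die


good_paste = die_temp_after_burst(r_contact=0.2)
bad_paste = die_temp_after_burst(r_contact=2.0)
print(f"good contact: {good_paste:.1f} °C, bad contact: {bad_paste:.1f} °C")
```

With the same 2-second burst, the die behind the poorer contact ends up markedly hotter, because the heatsink has barely begun to absorb the heat. A reading taken right after a burst of activity would catch exactly that spike.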

I know exactly what you mean. But the "box" was serviced on arrival and the thermal paste was replaced with Arctic Cooling MX4, which is more than enough in this case. And if everything were as you describe, there would have been a difference on the old version too, i.e. the average temperature would be higher, but the mentioned increase happened exclusively after switching to the new way of temperature reading. For me it is actually not that critical, but I am not the only one who has noticed this behavior. I fully understand that I have no right to demand anything from a free product and that the developers will do as they see fit in any case; it's just not very convenient to keep a correction of 7-10 degrees in your head when you look at the temperature graph. That's all.

Quote from: fastboot on December 05, 2024, 10:39:30 AM
For the first 36-48 hours the dashboard showed lower values than the other machine. After that it also reached 80-82°C on the dashboard. So to summarize: both devices show the same behavior after ~2+ days.
Yours is a different CPU architecture (Core i3) and includes fans, so we are comparing apples and oranges. However, I wanted to mention something I learned about my own Vault (1410) from the support chat. The support tech said that given enough time, and if the Vault is generating more heat than can be dissipated into the environment, it will regulate itself to 60C.

I don't see that happening for me (my idle temps are in the 40-45C range, as confirmed by both Linux and FreeBSD utilities), but in your case maybe some kind of regulation like this is at work, which would explain why your temps settle at higher values after 2 days. The differences you mentioned in your post seem high to me, but I'm not familiar enough with your particular model to say whether they are typical. Maybe it's worth opening a ticket to see what they say.

Quote from: meyergru on December 05, 2024, 11:27:54 AM
I wonder if it is better to keep the old way of doing it and explain to users that if they observe a big difference, they should inspect their cooling  ;)
I don't know if this is good advice for those of us with active warranties.  Doing surgery to check and re-paste sounds like a potential way to void it.

Quote from: _tribal_ on December 05, 2024, 06:31:04 PM
[...] but the mentioned increase happened exclusively after switching to the new way of temperature reading.

I wonder if the change is in FreeBSD, because there is one report of high temperature using the other firewall

The casual observer might look at this and come away with the impression that this is a characteristic of these devices, but Protectli disagree.  To quote the support tech I spoke to:

Quote
Last year we got a bunch of complains all fairly close together so we set up LM Sensors in another VM and saw a 10C skew from what OPNsense reports. [...] we saw reports of it happening on many other brands and it happens on all of ours so we chalked it up to "the way OPNsense does things".

So the issue is not vendor specific, and the reports are clustered in time (if this is to be believed).

I only started using OPNsense with 24.7 myself, so I didn't get to experience the "before" and "after" effect.

December 06, 2024, 02:03:45 PM #56 Last Edit: December 06, 2024, 02:11:06 PM by MenschAergereDichNicht
System: Protectli VP2420 (Celeron J6412)

Widget Temperature: 56°C
Reporting: ~50°C
sysctl dev.cpu.0.temperature: dev.cpu.0.temperature: 42.0C
sysctl -a | grep dev.cpu.0.temperature: dev.cpu.0.temperature: 52.0C

I am using the hwp_state driver and not powerd. Don't know if this is relevant.

While we are having fun posting temperatures, on a passively-cooled Yanling 6-port i7 box I see right now:
sysctl dev.cpu.0.temperature = 44°C
sysctl -a | grep dev.cpu.0.temperature = 55°C
GUI : 57-61°C for CPU 7 to 0, Zones A & B at 65°C

All is within specification, the ambient is maintained at a ceiling of 27-28°C (26 at measurement), so I have no particular interest in which one is "right" although I trust the first one. The differences in measurement are certainly there, from the method and from the GUI.
Deciso DEC697
+crowdsec +wireguard

December 07, 2024, 03:32:57 AM #58 Last Edit: December 07, 2024, 03:36:41 AM by OPNenthu
Slightly longer test with a Linux live USB.  'lm_sensors' output collected every 5 seconds:


$ dnf install lm_sensors
$ watch -n 5 sensors | tee --append data.txt


By default Fedora 41 has 'firewalld' service and Gnome running.  Not much else going on for the first 10 min. so that I could get a baseline.

At the ~10 min. mark, I did some light workload activities... launching and configuring Firefox, running online speedtest, launching LibreOffice.

The results (attached) show that both things are true:

1. The baseline temperature is substantially less than reported by 'sysctl -a'
2. The temperature rises sharply on any kind of burst activity

A subjective observation I made:  the box feels a lot cooler to the touch running under Linux.  The baseline is even lower than reported by 'sysctl dev.cpu.0.temperature' in FreeBSD (although this could be due to all the services running in OPNsense.)

If anyone wants to try, feel free to modify the python script attached for your particular 'lm_sensors' output.  It will be useful for me to compare notes, especially with similar devices.
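
The attached script is not reproduced in this thread. As a rough stand-in, a parser for the log could look like the sketch below. It assumes coretemp-style lines such as `Core 0:  +45.0°C  (high = ...)`; the function name `parse_core_temps` and the sample text are made up for illustration, so adjust the pattern to your own 'sensors' output.

```python
import re

# Label, colon, signed temperature in °C; anything after that (limits) is ignored.
CORE_LINE = re.compile(r"(Core \d+|Package id \d+):\s*([+-]?\d+(?:\.\d+)?)\s*°C")


def parse_core_temps(sensors_text):
    """Map each CPU label to a list of readings, in the order they were logged."""
    readings = {}
    for line in sensors_text.splitlines():
        match = CORE_LINE.search(line)
        if match:
            readings.setdefault(match.group(1), []).append(float(match.group(2)))
    return readings


# Invented sample: two consecutive snapshots appended to the same file.
sample_log = """\
coretemp-isa-0000
Package id 0:  +41.0°C  (high = +105.0°C, crit = +105.0°C)
Core 0:        +40.0°C  (high = +105.0°C, crit = +105.0°C)
Core 1:        +41.0°C  (high = +105.0°C, crit = +105.0°C)
Package id 0:  +58.0°C  (high = +105.0°C, crit = +105.0°C)
Core 0:        +57.0°C  (high = +105.0°C, crit = +105.0°C)
Core 1:        +55.0°C  (high = +105.0°C, crit = +105.0°C)
"""
for label, temps in parse_core_temps(sample_log).items():
    print(label, "min", min(temps), "max", max(temps))
```

Keeping every reading per label, rather than only the latest, makes it easy to see both the baseline and the burst peaks from one log file.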

I tried these two commands in my system's shell and the dev.cpu one does not return any results.  Just throwing it out there, please don't break all installs.

root@OPNsense:~ # sysctl dev.cpu | grep temperature | sort
root@OPNsense:~ # sysctl -a | grep temperature | sort
hw.acpi.thermal.tz0.temperature: 27.9C
hw.acpi.thermal.tz1.temperature: 29.9C
root@OPNsense:~ #
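
A widget scoped to dev.cpu would show nothing on a system like this one. One conceivable safeguard, sketched here with a hypothetical `pick_sensors` helper and not taken from OPNsense, is to fall back to the ACPI thermal zones when no dev.cpu temperature exists:

```python
def parse_readings(sysctl_output):
    """Parse 'name: 27.9C' lines into {name: degrees_celsius}."""
    readings = {}
    for line in sysctl_output.splitlines():
        name, sep, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if sep and name.endswith(".temperature") and value.endswith("C"):
            readings[name] = float(value[:-1])
    return readings


def pick_sensors(readings):
    """Prefer per-core sensors; fall back to ACPI thermal zones if there are none."""
    cpu = {k: v for k, v in readings.items() if k.startswith("dev.cpu.")}
    if cpu:
        return cpu
    return {k: v for k, v in readings.items() if k.startswith("hw.acpi.thermal.")}


# Output copied from the post above: no dev.cpu sensors at all.
acpi_only = parse_readings(
    "hw.acpi.thermal.tz0.temperature: 27.9C\n"
    "hw.acpi.thermal.tz1.temperature: 29.9C\n"
)
print(pick_sensors(acpi_only))
```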