No temperature data on Thermal Sensor widget after upgrade to 20.7.1

Started by lore.phoenix, August 17, 2020, 02:32:51 PM

Quote from: franco on August 26, 2020, 05:12:02 PM
Maybe syslog-ng isn't running at all so the messages try to find a way "home". ;)


Hence my first thought about syslog-ng, but more of a corruption event happening at the same time.
OPNsense 24.7 - Qotom Q355G4 - ISP - Squirrel 1Gbps.

Team Rebellion Member

Yeah, nice catch actually. That would explain why temperature reading is fine. I checked earlier today.

Something else interfering with the reads, probably causing numerous subtle unrelated issues on the side.


Cheers,
Franco

I'll leave my APU running on test for a while and see if it plays up. No patches, just a straight update to 20.7.1. If it is syslog-ng, it should show up soon enough.
OPNsense 24.7 - Qotom Q355G4 - ISP - Squirrel 1Gbps.

Team Rebellion Member

Sorry, @marjohn56. Thanks for testing. I had to go to bed eventually. I've picked it up again and taken a closer look at things. Here's my analysis:

The sysctrl_output.txt I attached previously may not have been entirely helpful. Not sure if @fruit's would have been more helpful. I didn't check. Anyhoo...

The sysctl dump contained a value for kern.msgbuf with a whole lot of junk, including what seems to be web request logs. I removed them because they contained a bunch of IPs and domain names which I didn't want to go through and sanitise. Those logs include requests (presumably from the dashboard) to /widgets/api/get.php?load=system%2Ctraffic%2Ctemperature%2Cgateway%2Cinterfaces...

<118>2020-08-18T22:38:44.491556+10:00 firewall.xxxx lighttpd 79007 - - xxxx firewall.xxxx - [18/Aug/2020:22:38:44 +1000] "GET /widgets/api/get.php?load=system%2Ctraffic%2Ctemperature%2Cgateway%2Cinterfaces&_=1597753466133 HTTP/1.1" 200 8670 "http://firewall.xxxx/index.php" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36"
<118>2020-08-18T22:38:51.924012+10:00 firewall.xxxx lighttpd 79007 - - xxxx firewall.xxxx - [18/Aug/2020:22:38:51 +1000] "GET /widgets/api/get.php?load=system%2Ctraffic%2Ctemperature%2Cgateway%2Cinterfaces&_=1597753466134 HTTP/1.1" 200 8671 "http://firewall.xxxx/index.php" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36"
<118>2020-08-18T22:38:59.269693+10:00 firewall.xxxx lighttpd 79007 - - xxxx firewall.xxxx - [18/Aug/2020:22:38:59 +1000] "GET /widgets/api/get.php?load=system%2Ctraffic%2Ctemperature%2Cgateway%2Cinterfaces&_=1597753466135 HTTP/1.1" 200 8670 "http://firewall.xxxx/index.php" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36"


When the temperature widget API does a grep for "temperature", it picks up all these lines as well as the correct temperature values. When the widget tries to update the progress bars, it starts looking for elements with weird names and jQuery spits the dummy. Hence, this error:

Syntax error, unrecognized expression: #thermal_sensors_widget_&lt;118&gt;2020-08-18T22

I have no idea why kern.msgbuf is there or if/why it's different to what it used to be. But that's definitely the issue (for me, at least). I would suggest tightening the grep expression to exclude the log lines and just return the values we're actually looking for.
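
A minimal sketch of what a tighter match could look like. This is an illustration only and not necessarily the change made in the pull request. It assumes every genuine reading is a whole line of the form `name.temperature: 52.0C`, as in the outputs posted in this thread; the function name `filter_temps` is made up for the example.

```sh
#!/bin/sh
# Hypothetical helper: keep only lines that are, in their entirety,
# "<oid ending in .temperature>: <number>C".
# Assumption: real readings always look like the ones posted in this thread.
filter_temps() {
    grep -E '^[A-Za-z0-9_.]+\.temperature: -?[0-9]+(\.[0-9]+)?C$'
}

# Intended use on the firewall (not executed here):
#   /sbin/sysctl -a | filter_temps
```

Because the pattern is anchored at both ends, a logged web request that merely contains the word "temperature" somewhere in its URL should no longer match.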

I've never created a PR for an open source project before, so I thought I'd give it a crack (even for a single character change  ;)).

https://github.com/opnsense/core/pull/4300

Nigelw,
Nice catch. I also have a "kern.msgbuf" section with a handful of lines like these in it (note the "temperature" string embedded in there):

<118>2020-08-26T01:47:36.967822-07:00 OPNsense.kurort lighttpd 61178 - - 172.16.1.106 172.16.1.1 - [26/Aug/2020:01:47:36 -0700] "GET /widgets/api/get.php?load=traffic%2Cinterfaces%2Cgateway%2Csystem%2Ctemperature&_=1598431650218 HTTP/1.1" 200 11921 "https://172.16.1.1/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0"


This is probably causing the widget to fail, even while this works:
root@OPNsense:/tmp # /sbin/sysctl -a | grep temperature:
hw.acpi.thermal.tz1.temperature: 29.9C
hw.acpi.thermal.tz0.temperature: 27.9C
dev.cpu.3.temperature: 53.0C
dev.cpu.2.temperature: 53.0C
dev.cpu.1.temperature: 52.0C
dev.cpu.0.temperature: 52.0C

root@OPNsense:/tmp #
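
To check whether a given box is affected, one option is to list the lines that a loose match on "temperature" returns but that are not well-formed readings. The sketch below assumes readings look like `name.temperature: 52.0C`; the function name `stray_temp_lines` is hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: print lines that a loose "grep temperature" would
# return but that are NOT well-formed readings (e.g. logged web requests
# sitting in kern.msgbuf).
stray_temp_lines() {
    grep 'temperature' |
        grep -Ev '^[A-Za-z0-9_.]+\.temperature: -?[0-9]+(\.[0-9]+)?C$'
}

# Intended use on the firewall (not executed here):
#   /sbin/sysctl -a | stray_temp_lines
```

If this prints nothing, the loose match is only returning real sensor values on that system. Any output is a candidate for the kind of line that trips the widget.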
ProtectLi FW6 | Intel i3-7100U CPU @ 2.40GHz (4 cores) | 8GB RAM | 120GB SSD
Prod Release Train.

I hadn't intended posting again this soon but thought this might help in some way. Of course it may not help at all, but I may not have much time to get back here for a while.

I did a couple of reboots last night (an extra one just for good measure ;)) and all is looking good. I have temps in the GUI and graphs, and this output:
/sbin/sysctl -a | grep temperature
dev.cpu.1.temperature: 55.1C
dev.cpu.0.temperature: 55.1C

None of those extraneous lines.
The output of /sbin/sysctl -a is also cleaner, but it still contains some lighttpd lines.

I am still concerned that I may have memory/disk issues so will be keeping a close eye on things

@fruit - Thanks for the update. We are now aware that extraneous data appears to be returned when sysctl -a is called. This is the root of the problem and we're trying to find out why. Why it's not happening on every system is also a bit of a mystery.
OPNsense 24.7 - Qotom Q355G4 - ISP - Squirrel 1Gbps.

Team Rebellion Member

running:

sysctl kern.msgbuf_clear=1

... on the shell/console clears it, and the temperatures INSTANTLY start showing up on the dashboard.
ProtectLi FW6 | Intel i3-7100U CPU @ 2.40GHz (4 cores) | 8GB RAM | 120GB SSD
Prod Release Train.

Thanks Xelas, question is, why is there junk in there to start with. Might be a 12.1 issue that needs to be resolved, but that call before any other calls using sysctl would be a good idea
OPNsense 24.7 - Qotom Q355G4 - ISP - Squirrel 1Gbps.

Team Rebellion Member

From what I can tell, the last few KB from dmesg get saved in there whenever there is a system crash to aid in debugging, and this has been the behavior of FreeBSD since at least 9.0. Why this has suddenly become an issue now is a mystery. Perhaps we were lucky and nothing containing the text "temperature" made it in there before to break the widget. Who knows.

I found this article from 2011 explaining how to clear kern.msgbuf:
https://www.gnutoolbox.com/clearing-dmesg-logs/
ProtectLi FW6 | Intel i3-7100U CPU @ 2.40GHz (4 cores) | 8GB RAM | 120GB SSD
Prod Release Train.

This makes it more likely that the problem is/was with syslog-ng. Franco has patched this and it's in for the 20.7.2 release, however the patch is available now, so it could be a good idea to run this:


# pkg add -f https://pkg.opnsense.org/FreeBSD:12:amd64/20.7/misc/syslog-ng327-3.27.1_2.txz
OPNsense 24.7 - Qotom Q355G4 - ISP - Squirrel 1Gbps.

Team Rebellion Member

Quote from: marjohn56 on August 27, 2020, 10:02:17 AM
This makes it more likely that the problem is/was with syslog-ng. Franco has patched this and it's in for the 20.7.2 release, however the patch is available now, so it could be a good idea to run this:

# pkg add -f https://pkg.opnsense.org/FreeBSD:12:amd64/20.7/misc/syslog-ng327-3.27.1_2.txz

In case it helps others, can someone please confirm that, from 20.7.1, the changes suggested in the console after this message:
syslog-ng is now installed!  To replace FreeBSD's standard syslogd
(/usr/sbin/syslogd), complete these steps:
are not required? I tried them last night, something broke and it froze, so I reverted but left syslog-ng327-3.27.1_2.txz in. After a reboot, all has been well since.

Ignore that waffle..  :)


@nigelw and I have spent a few hours looking at better methods of parsing the temperature info from sysctl. We have a few runners, so that will also be happening.
OPNsense 24.7 - Qotom Q355G4 - ISP - Squirrel 1Gbps.

Team Rebellion Member

Quote from: marjohn56 on August 27, 2020, 09:31:24 AM
Thanks Xelas, question is, why is there junk in there to start with. Might be a 12.1 issue that needs to be resolved, but that call before any other calls using sysctl would be a good idea

If you are parsing sysctl output, then you can never be sure that kern.msgbuf does not contain a string that will trip your parser. IMHO, unless that info is truly useful for crash diagnosis, I'd clear it with a "sysctl kern.msgbuf_clear=1" before parsing it.

In any case, looks like you got it. I just wanted to get this thought out of my head. :-)
You guys are doing an amazing job staying on top of all of the issues and bugs. OPNsense is a very complicated project!
ProtectLi FW6 | Intel i3-7100U CPU @ 2.40GHz (4 cores) | 8GB RAM | 120GB SSD
Prod Release Train.

Yes, but I personally don't really want to kill that data; it could be useful for debug purposes. The solution is tighter sysctl requests and parsing of some form. There are several options in play. The devs will make their choice, and that should see the end of this issue once and for all.
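
One way to picture "tighter sysctl requests and parsing", purely as a sketch and not the option the devs settled on: ask sysctl only for the subtrees that held readings in this thread (dev.cpu and hw.acpi.thermal) instead of using -a, so kern.msgbuf is never read and stays available for debugging. The function name `read_temps` and the `SYSCTL_BIN` variable are made up for illustration, and other hardware may expose sensors under different names.

```sh
#!/bin/sh
# Hypothetical sketch: query only the subtrees that held temperature
# readings in this thread, then print "sensor value" pairs.
# SYSCTL_BIN is an illustrative variable so the command can be swapped out.
SYSCTL_BIN=${SYSCTL_BIN:-/sbin/sysctl}

read_temps() {
    # Errors for a subtree that does not exist on this hardware
    # (e.g. no ACPI thermal zones) are discarded.
    "$SYSCTL_BIN" dev.cpu hw.acpi.thermal 2>/dev/null |
        sed -En 's/^([A-Za-z0-9_.]+)\.temperature: (-?[0-9]+(\.[0-9]+)?)C$/\1 \2/p'
}

# Intended use on the firewall (not executed here):
#   read_temps
```

Anything that does not match the expected shape is dropped instead of being passed on to the widget.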
OPNsense 24.7 - Qotom Q355G4 - ISP - Squirrel 1Gbps.

Team Rebellion Member