22.1 beta - 100% CPU - How to resolve?

Started by mjalafoo, January 13, 2022, 10:02:26 AM

Hi All,

I've been having a high CPU load issue since 21.7.4. It started after the upgrade.
I tried upgrading to 21.7.7, and the problem carried over. Last night I upgraded to 22.1 beta and the problem carried over too. The situation seems to be getting worse: the GUI takes a long time to respond, and the SSH session terminates during the login process.

It is worth noting that not all pages in the GUI are slow. For example, the login page and the main dashboard take forever, but the configuration pages for Suricata and Firmware update respond much faster.

I checked the activity of the services, and what consumes the CPU is not consistent. Sometimes it is the Python scripts for Suricata, sometimes it is just PHP. I stopped Suricata and disabled its configuration.

Any guidance on how to resolve this?

Note: the unit has been running for over a year with the same configuration, updated and patched consistently. It is a J1800 with 4GB RAM. Currently, Squid, Pf, Captive Portal, DHCP, Syslog, OpenVPN and WebGUI are enabled. No external plugins are installed. Suricata is disabled.

22.1 beta is a bit misleading if you have been having persistent trouble since 21.7.4. That the problem survived each upgrade indicates it most likely lies with the configuration, not the operating system running it.

Is it a VM or hardware?


Cheers,
Franco

It is hardware. 4 Ports Micro Firewall appliance.

Maybe it's throttling itself making it seem to use 100% CPU when it doesn't? Have you looked into changing powerd settings?


Cheers,
Franco

Thanks for your reply. But I think there is an issue with the config, as the boot sequence takes longer than 30 minutes to complete.

Attached are snapshots of the boot sequence.

How would I change powerd settings?

Potential hardware problems, hard to tell with screenshots. Can you please post the text in quotes instead?

Well, the disk seems damaged for one thing, not sure if beyond repair. The other captures look normal. A broken disk could cause slowness.


Cheers,
Franco

Looks like a hardware problem. I have another box that I will rebuild using the same config. Then I will flash the original box and check if the problem persists.

I will post the updates.

So, I used a fresh box (an exact match for the hardware that has the 100% CPU load) and flashed it to 22.1 RC.

After the fresh build, the box behaves normally: reboot is quick, and the WebGUI responds at normal speed.

I then loaded the backup configuration from the misbehaving box. The first reboot (after loading the config) is taking no less than 20 minutes to complete the boot sequence.

It is definitely something to do with the config and not with the hardware. It is also definitely something that surfaced with the recent OS changes.

If anyone can give me access to an OS version older than 21.x, I can flash my test box and load the config to check if it has the same behavior.

I will also reflash the test box and build it manually, without loading the config from backup, to check what triggers the CPU load.

Any ideas are welcome.

I was tracking down some odd boot behavior recently.  I was able to look in /var/log/system/ and scan the system log there to see where the delays were occurring.  (Your screenshots are too small for me to read.)
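For anyone wanting to do the same scan without reading the whole log by eye, here is a rough sketch that reports the longest pauses between consecutive log lines. It assumes every line carries an ISO 8601 timestamp like 2022-01-13T10:02:26 (the zone offset is ignored); if your log uses another timestamp format, the pattern needs adjusting. The function name largest_gaps is just something made up for this example.

```python
import re
from datetime import datetime

# Assumption: each log line contains an ISO 8601 timestamp such as
# 2022-01-13T10:02:26 (any zone offset after it is ignored).
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")


def largest_gaps(lines, top=5):
    """Return the `top` largest (seconds, line_before, line_after) gaps."""
    gaps = []
    previous_time = None
    previous_line = None
    for line in lines:
        match = TIMESTAMP.search(line)
        if match is None:
            continue  # skip lines without a recognizable timestamp
        current_time = datetime.strptime(match.group(0), "%Y-%m-%dT%H:%M:%S")
        if previous_time is not None:
            seconds = (current_time - previous_time).total_seconds()
            gaps.append((seconds, previous_line.rstrip(), line.rstrip()))
        previous_time, previous_line = current_time, line
    # Largest pause first, so the slow boot steps show up at the top.
    gaps.sort(key=lambda gap: gap[0], reverse=True)
    return gaps[:top]
```

Pass it an open file from /var/log/system/ (a file object iterates line by line) and print the returned tuples; the "before" line of the largest gap is usually the step that was hanging.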

About the 100% CPU, I've had that happen after upgrading between versions.  It seemed there was a duplicate python task (maybe something like config.py, can't remember for sure) and once it was KILLed things started behaving.  However, this was for updates that did not require a reboot, so it is probably unrelated (and a reboot would correct the problem).  You can use something like top in the terminal to see what's going on, or htop, which I prefer, but you have to add that manually via:

pkg add https://pkg.freebsd.org/FreeBSD:13:amd64/quarterly/All/htop-3.1.2.txz
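If you want to check for the kind of duplicate task I mentioned, a small sketch like the one below can list command lines that show up more than once. It only parses text in the shape that `ps -axo pid,command` prints (a header line, then PID and command per line); the function name repeated_commands is made up for this example. Keep in mind that some daemons legitimately run several identical workers, so a hit is a hint, not proof.

```python
from collections import Counter


def repeated_commands(ps_output):
    """Return {command: count} for command lines that appear more than once.

    Expects text shaped like the output of `ps -axo pid,command`:
    a header line, then one "PID COMMAND..." line per process.
    """
    counts = Counter()
    for line in ps_output.splitlines()[1:]:  # skip the header line
        parts = line.split(None, 1)  # split the PID from the rest
        if len(parts) == 2:
            counts[parts[1].strip()] += 1
    return {command: n for command, n in counts.items() if n > 1}
```

Feed it the captured output of ps and look at anything Python-related in the result before deciding whether to kill it.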

Quote: You can use something like top in the terminal to see

System: Diagnostics: Activity   ;)

Thanks for the replies all.

In the Diagnostics activity view, there seems to be no single item that is the main culprit for the load on the 4 CPUs. Sometimes it is PHP, sometimes Python scripts, etc. One thing that is common is that the top problematic activity contributes 80-90% of the load on the 4 CPUs.

In the test box today, from the console, I have the following log:
sonewconn: pcb 0xfffff80080bda800 (local:/tmp/php-fastcgi.socket-1): Listen queue overflow: 193 already in queue awaiting acceptance
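
As far as I understand it, this message means connections to the PHP FastCGI socket are arriving faster than they are being accepted, which would match the slow GUI. To see how often it happens and how deep the queue gets, a sketch like this can tally the messages per socket. The pattern is written against the exact line above; the function name summarize_overflows is my own.

```python
import re
from collections import defaultdict

# Pattern modelled on the console line quoted above.
OVERFLOW = re.compile(
    r"sonewconn: pcb \S+ \((?P<socket>[^)]+)\): "
    r"Listen queue overflow: (?P<queued>\d+) already in queue"
)


def summarize_overflows(lines):
    """Map each socket to (number of overflow messages, largest queue seen)."""
    summary = defaultdict(lambda: [0, 0])
    for line in lines:
        match = OVERFLOW.search(line)
        if match is None:
            continue  # not a listen queue overflow message
        entry = summary[match.group("socket")]
        entry[0] += 1
        entry[1] = max(entry[1], int(match.group("queued")))
    return {socket: tuple(counts) for socket, counts in summary.items()}
```

It takes any iterable of log lines, so saved console output or an open log file both work.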

Can you try disabling Netflow if it's enabled?

Quote from: franco on January 13, 2022, 10:38:17 AM
Maybe it's throttling itself making it seem to use 100% CPU when it doesn't? Have you looked into changing powerd settings?

Before chasing ghosts, please make sure to set your powerd settings in a way that the system can't throttle its CPU to MAKE IT SEEM that CPU is 100%.  ::)


Cheers,
Franco

So, a little update.

It is not Powerd and not Netflow. Netflow is disabled.

I analyzed the config file and figured out that IDS alerts are loaded even though Suricata is disabled. The list is huge, and it seems it's loading this entire list and churning through it.

So I flashed the test box and started loading the configuration section by section. The moment I load "OpnSense Additions", the 100% CPU load problem reappears.
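
In case it helps someone else spot a bloated section, here is a rough sketch of how the config file can be sized up section by section. It only assumes the backup is a well-formed XML file; the function name section_sizes and the default depth are my own choices, not anything provided by OPNsense.

```python
import xml.etree.ElementTree as ET


def section_sizes(root, depth=2):
    """Return (path, serialized size in bytes) pairs, largest first.

    Covers every element from the children of `root` down to `depth`
    levels below it.
    """
    sizes = []

    def walk(element, path, level):
        for child in element:
            child_path = f"{path}/{child.tag}"
            # ET.tostring returns bytes, so len() is the serialized size.
            sizes.append((child_path, len(ET.tostring(child))))
            if level < depth:
                walk(child, child_path, level + 1)

    walk(root, root.tag, 1)
    sizes.sort(key=lambda item: item[1], reverse=True)
    return sizes
```

Calling section_sizes(ET.parse(path).getroot())[:10] on a downloaded backup (path being wherever you saved it) lists the ten largest sections; an oversized IDS section should stand out near the top.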


I flashed the box one more time, but this time cleaned the backup config by inserting a clean IDS section.

Once rebooted, the OS operates normally and the rest of the config seems to be intact.

The question remains: why did the IDS config remain in place even though Suricata is disabled? In fact, I tried reinstalling Suricata in an effort to remove the residue from the production box, without luck.
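
For reference, this kind of swap can be scripted instead of edited by hand. The sketch below copies one section from a clean config (for example one exported from a fresh install) into the backup. The function name replace_section is made up, and the location of the IDS section (tag IDS under a parent named OPNsense) is an assumption you should verify against your own file. Work on a copy of the backup, never the only one.

```python
import copy
import xml.etree.ElementTree as ET


def replace_section(backup_root, clean_root, parent_path, tag):
    """Swap `tag` under `parent_path` in backup_root for the clean copy.

    Returns True if the section was replaced, False if it was missing
    from either tree (in which case nothing is changed).
    """
    backup_parent = backup_root.find(parent_path)
    clean_parent = clean_root.find(parent_path)
    if backup_parent is None or clean_parent is None:
        return False
    old = backup_parent.find(tag)
    new = clean_parent.find(tag)
    if old is None or new is None:
        return False
    # Keep the section at its original position among its siblings.
    index = list(backup_parent).index(old)
    backup_parent.remove(old)
    backup_parent.insert(index, copy.deepcopy(new))
    return True
```

A typical use would be to parse both files with ET.parse, call replace_section(backup.getroot(), clean.getroot(), "OPNsense", "IDS"), and write the result to a new file with backup.write(...) before restoring it.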