Re: Random Frequent CPU Spikes and Page Faults [Almost Resolved]

Started by dpsguard, August 01, 2023, 04:41:49 AM

Hello All,

I have a simple lab setup with two Supermicro servers (Xeon CPU, 4 cores, 12 GB of RAM, Intel X520-DA2 card) running in HA. For stress testing I used two more servers, one on the LAN side and one on the WAN segment, and ran iPerf2 with 200 parallel streams (each stream set to 50 Mbps) between them through the firewall. I can get very close to 10Gig sustained, even for 30 minutes.
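For reference, the test invocation looked roughly like this (the server address is just an example from my lab, and the per-stream -b cap needs a reasonably recent iperf2 build):

iperf -s                                        # on the WAN-side server
iperf -c 192.168.2.10 -P 200 -b 50M -t 1800     # on the LAN-side client: 200 streams, 50 Mbps each, 30 minutes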

I had installed 23.1, then updated to 23.1.11 and finally 23.7 in an attempt to resolve the sudden short bursts of CPU spikes (sometimes reaching 67%, often 40%, and generally 20% every few seconds; the rest of the time the CPU sits at 0 to 2%). When I run the iPerf stress test and saturate the pipe, CPU is normally close to 50%, and when these peaks strike it reaches 70 to 100%.

Running top -P with a 1-second delay and system processes included, or vmstat 1, shows much less total CPU than the GUI reports under the CPU usage widget, so I am not sure whether this is a cosmetic bug and actual CPU usage is much lower than the GUI shows. While running the iPerf stress test I can also put a test laptop through this firewall to the Internet and play a 4K / 2160p video with no issues (the player clearly holds no more than a couple of seconds of buffered content, so if there were real issues I would see it stall), yet a continuous ping to Google DNS starts incurring lots of loss while the video keeps playing fine.
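For anyone who wants to compare against the GUI widget, the invocations I am looking at are roughly:

top -P -S -s 1    # per-CPU view, kernel/system processes included, 1-second refresh
vmstat 1          # 1-second samples of page faults, interrupts, context switches and the CPU split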

I have tried various things, including enabling/disabling hardware offload, PowerD, and enabling/disabling both consoles, and these random pulses every few seconds to few minutes keep showing up.
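(For the record, this is roughly how I check the current state between toggles; ix0 is just an example interface name, and dev.cpu.0.freq is only present when the cpufreq driver is attached:)

ifconfig ix0 | grep -i options    # offload features currently enabled on the NIC
service powerd onestatus          # whether powerd is actually running, even if not enabled in rc.conf
sysctl dev.cpu.0.freq             # current CPU frequency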

The firewalls have no security features, logging, or NetFlow enabled; my setup is just a simple NAT router.

The CPU spikes happen even with no traffic. Performance tests in my limited lab setup don't seem to be impacted, other than the ping losses. I will also run a real-time audio/video test through the firewall to see whether that gets affected. But I do believe I am not the only one seeing this issue: there are a couple of short threads in the community that mention CPU spikes, but those are all on boxes with small CPUs.

top does not show any particular process consuming resources when the peaks come (or a few seconds before, assuming top averages over the last few seconds). For sure it is "system" that accounts for single-digit usage even when the GUI shows a 67% CPU peak, but I have not been able to find which process (or thread of a process) is responsible with any utility.

Another issue I find is that when I run vmstat 1 (or even with a 5-second interval), the page fault (flt) column shows values like 54K, 3K, 7K, and most of the time a smaller number. I even removed swap and rebooted, but with no swap set aside, why would there be any paging in and out and the resulting page faults? Memory remains mostly underutilized, so there is no need for swap, yet vmstat shows large page fault counts. Could this be bad memory causing page faults and thus CPU spikes (the RAM is new UDIMMs)? But it happens on both HA firewalls.
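(As a sanity check, something like the following should show whether any of those faults ever touch swap; these are standard FreeBSD tools, nothing OPNsense-specific:)

vmstat 1 5       # flt counts all faults, including soft faults served from RAM; only pi/po indicate paging to/from disk
swapinfo -h      # shows whether any swap is configured or in use at all
sysctl vm.stats.vm.v_swappgsin vm.stats.vm.v_swappgsout    # cumulative pages swapped in/out since boot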

Any help will be appreciated.


I used another physical machine and installed 23.1, and without any configuration or optimization I have the same CPU peak issues.

The page faults are also in the thousands, same as on the other two machines. So it looks like either memory management in FreeBSD is misbehaving or OPNsense is somehow unable to use the memory and has to fetch code from disk frequently. This machine has a different processor, different motherboard, and different BIOS, but the same 10Gig X520-DA2 card. I am not sure whether it is a card issue, but for this test I am not using the 10Gig card and have only assigned two 1Gig LOM interfaces for WAN and LAN.

At least it rules out any configuration or tunable issue.

If someone could check their own setup for CPU usage peaks/triangles, and share the output of "vmstat 1" to see whether the page fault (flt) column is in the thousands, that would be appreciated.

I also pulled the 10Gig card out, leaving just the two Intel 1Gig motherboard ports, and the CPU peaks as well as the page faults remain as before.

I installed Debian on the same server (using another SSD) and did not see CPU utilization beyond 4% summed across all cores in htop. vmstat 1 also showed no page faults in the thousands, just low numbers, though the output format is a little different from FreeBSD's.

I then downloaded the latest pfSense (2.7.0) and installed it on the same server. It uses FreeBSD 14, so it is not an apples-to-apples comparison, but with pfSense I still see page faults in the thousands, so this does look like FreeBSD memory management behavior (and maybe a non-issue anyway). However, the CPU did not jump and stayed within 8% summed across all cores.

So this is likely a small bug or minor issue that should be brought to the notice of @franco, and I hope it will get addressed or a workaround provided.

Thanks

Since FreeBSD 14 is more or less a development version, I don't have much hope this will be fixed quickly or that patches are already available. It might be hardware-specific, but I'm unsure what that would be.

The rule of thumb for FreeBSD fixes is to give it a lot of time and then some more... :/


Cheers,
Franco

Thank you so much @franco for looking into this and for your advice. I agree the upstream issues are beyond you and we just have to wait. I assume that applies to the memory management / page faults, and that you also see this on any other hardware you can test with (I also tested with a Chinese network appliance that has an Intel i5 and 4GB RAM). Clearly this is FreeBSD, and I am not the only one seeing these page faults in the thousands.

But can we conclude that the CPU spikes are also related to FreeBSD 13? Many others might not be noticing it: if their normal CPU shows up on the graph as a band much higher than the near-zero I see, the peaks would blend in. In my case it is mostly 0 to 2% normal CPU, then suddenly a 20% surge, then a wait of a few seconds to a couple of minutes, then another one at maybe 63%, then 42%, then 30%; they are all over the place.

At least with the graph showing the sum total of all cores, the 63% surge will be on one core while the other cores remain available to service other traffic (hopefully SMP works well), so few users should be impacted, if any.

I'm seeing this with my firewall too. I'm on a J455. I noticed the CPU spikes, and the CPU is running hotter. It looks like there's an issue with system interrupts that is new since my upgrade; they seldom showed up before the upgrade.


August 04, 2023, 01:13:06 AM #7 Last Edit: August 05, 2023, 03:21:01 AM by dpsguard
@JustMeHere I also had the same issues with 23.1 and 23.1.11, and now with 23.7. If you had no such surges before 23.7, that will be interesting. I am assuming the CPU surges might just be an issue in FreeBSD 13 that was taken care of in FreeBSD 14 (the same hardware does not show these CPU surges with pfSense 2.7, which uses FreeBSD 14). I am waiting for @franco to confirm this behaviour on other hardware as well, since I believe the pfSense test rules out issues with my hardware.

The heat in your case could be because Celeron processors are meant for brief bursty traffic, while Xeon processors are built for servers that withstand sustained, continuous high traffic loads. When I stress test my server, the CPU temperature does of course go up to around 62 deg C after half an hour of 10Gig iPerf testing; your box is small, with a Celeron J processor, and may not be able to handle the load you are subjecting it to, hence getting hotter. Other than that, there should not be much reason for CPU surges or heat when moving from 23.1 to 23.7. Sure, there will be bug fixes, some new features, and better hardware support, so I am thinking you may just have a higher amount of traffic than before.

@dpsguard. The graph I posted shows the reboot from the upgrade and the change in CPU activity. There was no change in actual workload. I have also posted the graph of the CPU heat. Not sure what has changed, but the CPU is definitely busier in the latest release. I think this is affecting throughput. I know I have a weak CPU in this box, but it should be overkill for a firewall. This is a simple home network.

I just ran some speed tests and network load is making a much bigger difference to CPU load than it used to.

The gaps in the graphs I've posted are from the system upgrade. The load on the router was the same before and after.

August 04, 2023, 02:30:28 AM #9 Last Edit: August 05, 2023, 03:20:20 AM by dpsguard
I had not seen the graph before. Your box is definitely busy. I am sure that if the latest OS has this issue, many will soon report it. You may want to reference this post in the latest 23.7 thread for wider comments. The latest code might be impacting Celeron processors, and yes, your CPU is more than enough for a home use case.

Hi @franco, I did a fresh install of 22.7 and I still have the same issues seen with 23.1 and 23.7. I again tested with pfSense 2.7 and I don't have the CPU spike issues there. So it must be something to do with FreeBSD 13, unless there is another reason. Normally I would not be worried, but when I run continuous pings to the firewall interfaces, or through the firewall to Google DNS, the pings also drop whenever the CPU surges.

I have tried this on three different hardware platforms and with two different 10Gig NICs. Anything you can recommend, or a patch that could resolve this issue, would be highly appreciated. Thanks

I'm running 23.7; here are the results of "vmstat 1" on my J3455 system with Intel IGB NICs and a 120GB SATA SSD.

The system was idle during this sampling, with just minor internet traffic (email, spotify, youtube).

procs    memory    page                      disks     faults       cpu
r  b  w  avm  fre  flt  re  pi  po   fr   sr ad0 pa0   in   sy   cs us sy id
0  0  0 514G  13G 7.2K   0   0  21 7.8K   30 113   0  165 5.3K 1.2K  3  3 94
0  0  0 514G  13G  22K   0   0  35  26K   30   0   0   39 7.8K 1.0K  3  3 94
0  0  0 514G  13G  875   0   0   0 1.0K   30   0   0   25 1.2K  290  1  0 99
2  0  0 514G  13G  874   0   0   0 1.0K   30  17   0   49 1.5K  417  3  1 97
0  0  0 514G  13G  874   0   0   0 1.0K   30   0   0   27 1.2K  286  0  1 99
1  0  0 514G  13G 1.6K   0   0   0 1.4K   30 144   0  177 1.6K 1.1K  9  1 90
1  0  0 514G  13G  881   0   0   0 1.1K   27   0   0   45 1.2K  310 25  1 74
1  0  0 514G  13G  878   0   0   0 1.1K   33   0   0   25 1.2K  299 25  1 74
1  0  0 514G  13G 1.2K   0   0   0 1.0K   27   0   0   38 1.6K  324 26  1 74
0  0  0 514G  13G 1.1K   0   0   0 1.1K   33  81   0  127 2.3K  941  3  0 96
0  0  0 514G  13G  874   0   0   0 1.0K   30 139   0  182 1.2K 1.1K  1  2 98
0  0  0 514G  13G  876   0   0   0 1.0K   30   0   0   34 1.2K  281  0  1 99
0  0  0 514G  13G  875   0   0   0 1.0K   30   0   0   28 1.2K  293  0  0 99
0  0  0 514G  13G  879   0   0   0 1.1K   30  17   0   75 1.5K  505  2  1 97
0  0  0 514G  13G  877   0   0   0 1.1K   27   0   0   27 1.2K  307  0  0 99
0  0  0 514G  13G  875   0   0   0 1.0K   30 104   0  142 1.2K  968  0  2 98
0  0  0 514G  13G  874   0   0   0 1.0K   27   0   0   35 1.2K  293  0  1 99
0  0  0 514G  13G  878   0   0   0 1.0K   30   0   0   39 1.2K  379  0  1 99
0  0  0 514G  13G 1.0K   0   0   0 1.3K   30   0   0   66 1.4K  423  0  1 99
0  0  0 514G  13G  876   0   0   0 1.0K   30   0   0   23 1.2K  294  1  2 98
2  0  0 514G  13G 5.8K   0   0   0 1.4K   31  83   0  106 5.9K  765 14  3 83
2  0  0 514G  13G  16K   0   0   0 8.1K   36   0   0   33  12K 3.5K 23  4 73
0  0  0 514G  13G  24K   0   0   0  23K   47   0   0   25 3.2K 1.3K 15  3 82
0  0  0 514G  13G  878   0   0   0 1.1K   30  17   0   53 1.5K  430  3  2 95
0  0  0 514G  13G  874   0   0   0 1.0K   30 119   0  168 1.2K 1.0K  0  1 99
0  0  0 514G  13G  876   0   0   0 1.0K   30   0   0   25 1.2K  276  0  1 99
0  0  0 514G  13G  881   0   0   0 1.1K   30   0   0   27 1.2K  291  0  0 99
0  0  0 514G  13G  877   0   0   0 1.0K   30   0   0   35 1.2K  316  0  1 99
0  0  0 514G  13G  873   0   0   0 1.0K   30   0   0   36 1.2K  338  0  1 99
0  0  0 514G  13G  878   0   0   0 1.1K   30  74   0  105 1.2K  809  0  1 99

Thanks @opnfwb for sharing the results from your box. I can see that you also have lots of page faults (in the thousands), so I will rule out an issue with my setup and assume that this is how memory management functions in FreeBSD.

I can also see some CPU surges in your vmstat output. If you could run one simple test, it would help me further diagnose my issue.

I have now tested on four different hardware platforms and I get the same issues on all of them. The last one is a J3855U with 8GB of RAM, a 32GB SSD, and Intel 1Gig NICs. I am now running pfSense on this small appliance and doing a continuous ping to the LAN interface with a packet size of 20000 bytes, and I have zero loss so far in over 4000 pings (pfSense has a wireless AP in front of it and my laptop is connected via Wi-Fi while pinging). Can you do a ping test and see if you have any drops, please? ping 192.168.1.1 -t -l 20000
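(That is Windows ping syntax; a rough equivalent from a Linux client against the same LAN address would be:)

ping -s 20000 192.168.1.1    # -s sets the ICMP payload size in bytes; add -c 4000 to stop after 4000 echoes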

If this were an OPNsense issue, many more people would have complained (when I repeat the same ping test with the 20K packet size, I have 6% loss), but I have seemingly ruled out my hardware, so it must be FreeBSD 13. pfSense uses FreeBSD 14 and I don't have the issue there. Maybe I should test with an older release of OPNsense, but I can only find 21.1 available to download, which hopefully uses FreeBSD 12.x. Or it could be that FreeBSD 13 has some driver issue with Supermicro Xeon platforms. Thanks again

I'm running the ping now, but why such a large packet size? Won't it just fragment? My LAN MTU is only 1500, as I presume most others' are. I'm not sure of the purpose of such a large packet size for testing.

I have an older OPNsense 19.1.4 image from years ago; I can install it in my lab and do a quick vmstat check there too.

I was curious whether you are using UFS or ZFS for your OPNsense installs? The one I posted above with the high page faults is ZFS. I figured I would try an identical setup in my lab but with UFS and see if that makes a difference for some reason. I highly doubt it, but I'm just trying to rule out potential factors.
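(If it helps the comparison, the root mount line shows which filesystem a given install is on; a quick check is something like:)

mount | head -1    # prints "... on / (zfs, ...)" for a ZFS root or "... on / (ufs, ...)" for a UFS root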

I purposely used a 20K packet size because each packet gets fragmented into many packets and thus generates some traffic through the firewall. This is a better way to test for connectivity issues. I have even tested with a 64K size at times.
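Roughly, assuming a standard 1500-byte MTU and no IP options, each 20000-byte echo works out to:

20000 bytes of ICMP payload + 8 bytes ICMP header = 20008 bytes to carry
1500-byte MTU - 20-byte IP header = 1480 bytes of payload per fragment
20008 / 1480 ≈ 13.5, so 14 IP fragments per request (and again per reply)

A single lost fragment means the whole echo fails, which is what makes the test so sensitive.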

And yes, I tested both OPNsense and pfSense with ZFS. In my case, normal CPU utilization is 0 to 2% and the CPU bursts show up as triangle-shaped spikes. Do you see anything like that? Around those times, pings drop.

Thanks for your help in troubleshooting the issue; if we can provide some data to @franco, it may help resolve this. The issue might be silently affecting many others by adding bursts of latency/jitter and dropping some real-time traffic such as calls. Other than this, as I explained earlier, I can pump sustained 10Gig traffic through for half an hour (the maximum I tested) and everything keeps working fine. The only commonality between the four hardware platforms tested is that they all use Intel NICs. Two are Supermicro SuperServers (X11 and X8), one is a Chinese box (Hunsn), and the fourth is a Taiwanese box (iBase).