OPNsense Forum

English Forums => Hardware and Performance => Topic started by: dpsguard on August 01, 2023, 04:41:49 AM

Title: Re: Random Frequent CPU Spikes and Page Faults [Almost Resolved]
Post by: dpsguard on August 01, 2023, 04:41:49 AM
Hello All,

I have a simple lab setup with two Supermicro servers in HA, each with a 4-core Xeon CPU, 12GB of RAM and an Intel X520-DA2 card. I have done stress testing using 2 more servers, one on the LAN side and one on the WAN segment, running iPerf2 with 200 parallel streams (each stream set to 50Mbps) between these two servers thru the firewall, and I am able to get very close to 10Gig, even when running for 30 minutes.

I had installed 23.1, then updated to 23.1.11 and finally to 23.7 in an attempt to resolve the sudden short bursts of CPU spikes (sometimes reaching 67%, many times 40% and generally 20% every few seconds; the rest of the time, CPU is around 0 to 2%). When I do the iPerf stress test, saturating the pipe, CPU is normally close to 50%, and when these peaks strike, it reaches 70 to 100%.

Running top -P with a 1 sec delay and system processes included, or vmstat 1, shows much less total CPU than what the GUI shows under the CPU usage widget. So I am not sure if this is a cosmetic bug and actual CPU usage is much less than what the GUI shows. While running the iPerf stress test, I can also add a test laptop going thru this firewall to the Internet and play a 4K / 2160p video with no issues (the player clearly shows not more than a couple of seconds of buffered content, so if I had real issues, I would see it stall), but a continuous ping to Google DNS starts incurring lots of loss while the video stays fine.

I have tried various things, including enabling / disabling the hardware offload, PowerD, enabling / disabling both consoles etc., and these random pulses every few seconds to few minutes keep showing up.

The firewalls have no security features, or logs or Netflow etc enabled. Just a simple NAT router is my setup.

The CPU spikes happen at no traffic. But performance tests in my limited lab setup don't seem to be impacted, other than ping losses. I will test doing a real time audio / video test thru this test also to see if that gets impacted. But I do believe I am not the only one seeing this issue. There are couple of short threads in community that talk about CPU spikes but they all have small CPU processors.

Top does not show any process that is consuming RAM when the peaks come (or a few seconds before, assuming top uses some averaging over the last few seconds). And for sure it is system that consumes some RAM in single digits even when the peak 67% CPU happens in the GUI. But what that process (or thread of a process) is, I have not been able to find with any utility.

Another issue I find is that when I run vmstat 1 (repeating every second, or even every 5 seconds), the page flt column shows values like 54K, 3K, 7K, and most of the time a single-digit number or so. I even removed swap and rebooted, but with no swap set aside, why would there be any paging in and out and resulting page faults? Memory remains mostly underutilized, so there is no need for swap, yet large page flt counts show up. Could this be bad memory that results in page faults and thus CPU spikes (this was new UDIMM Buffered), even though it happens on both HA firewalls?

Any help will be appreciated.

Title: Re: Random Frequent CPU Spikes and Page Faults
Post by: dpsguard on August 02, 2023, 09:57:30 PM
I used another physical machine and installed 23.1 and without any configuration or optimization, I have the same issues with CPU peaks.

Also, page flts are the same, in kilos, as on the other two machines. So it looks like memory management in FreeBSD is messed up, or OPNsense somehow is not able to use the memory and then needs to fetch some code frequently from disk. This machine has a different processor, different motherboard and different BIOS, but has the same 10Gig X520-DA2 card. So I am not sure if it is an issue with this card, but for this test I am not using the 10Gig card and have only assigned two 1Gig LOM interfaces for WAN and LAN.

At least it rules out any configuration or tunable issue.

If someone can validate on their setup for CPU usage peaks / triangles, as well as output of "vmstat 1" to look for Page flt to be in Kilos, that will be appreciated.
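
For anyone who wants to compare numbers rather than eyeball them, here is a minimal sketch that pulls the flt column out of saved "vmstat 1" output. It assumes the FreeBSD column layout shown in this thread and treats the K/M/G suffixes as powers of 1000, which is an assumption on my part; the function names are made up for illustration.

```python
import re

# Assumption: vmstat's humanized suffixes are decimal (K = 1000).
SUFFIX = {"K": 1_000, "M": 1_000_000, "G": 1_000_000_000}


def parse_count(token):
    """Convert a vmstat token such as '875', '7.2K' or '22K' to an integer."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMG]?)", token)
    if match is None:
        raise ValueError(f"unrecognised vmstat value: {token!r}")
    number, suffix = match.groups()
    return round(float(number) * SUFFIX.get(suffix, 1))


def flt_samples(vmstat_text):
    """Return the page fault count of every data row in vmstat output."""
    flt_index = None
    samples = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        if "flt" in fields:
            # Column header row; remember where the flt column sits.
            flt_index = fields.index("flt")
            continue
        if flt_index is None or len(fields) <= flt_index:
            continue
        try:
            samples.append(parse_count(fields[flt_index]))
        except ValueError:
            # Repeated group header lines ("procs memory page ...") land here.
            continue
    return samples


def count_above(samples, threshold=1000):
    """Number of one-second samples with at least `threshold` page faults."""
    return sum(1 for value in samples if value >= threshold)
```

Save the output with something like "vmstat 1 > vmstat.txt", read the file into a string, and pass it to flt_samples(); count_above() then tells you how many samples were "in kilos".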
Title: Re: Random Frequent CPU Spikes and Page Faults
Post by: dpsguard on August 02, 2023, 10:44:14 PM
And I pulled the 10G card out leaving just the two Intel 1Gig motherboard ports in and CPU peaks as well as page faults remain as before.
Title: Re: Random Frequent CPU Spikes and Page Faults
Post by: dpsguard on August 03, 2023, 12:31:31 AM
Installed Debian on the same server (used another SSD) and I did not see any CPU utilization beyond 4% as a sum of all cores when using htop. vmstat 1 also showed no page faults in kilos, just some low numbers, but the format is a little different than in FreeBSD.

Then I downloaded the latest pfSense 2.7.0 and installed it on the same server. This uses FreeBSD 14, so it is not an apples to apples comparison, but with pfSense I still have page faults in kilos, so this is definitely a FreeBSD memory management issue (maybe it is a non-issue anyway). But the CPU did not jump and stayed within 8% as a sum total of all cores.

So likely a small bug or minor issue that should be brought to notice of @franco and hope this will get addressed or a workaround provided.

Thanks
Title: Re: Random Frequent CPU Spikes and Page Faults
Post by: franco on August 03, 2023, 09:30:52 AM
Since FreeBSD 14 is more or less development version I don't have much hope this will be quickly fixed or that patches are already available. It might be hardware specific but unsure what that would be.

The rule of thumb for FreeBSD fixes is give it a lot of time and then some more... :/


Cheers,
Franco
Title: Re: Random Frequent CPU Spikes and Page Faults
Post by: dpsguard on August 03, 2023, 01:33:41 PM
Thank you so much @franco for looking into this and for your advice. I agree the upstream things are beyond you and we just have to wait. I assume that is for the memory management / page faults, and I assume you also see this on any other hardware that you can test with (I also tested it with a Chinese network appliance that has an Intel i5 and 4GB RAM). Clearly this is FreeBSD and I am not the only one seeing these page faults in kilos.

But can we conclude that the CPU spikes are then also related to FreeBSD 13? Many others might not be noticing them if, on their graph, normal CPU shows up as a band much higher than the 0 that shows in my case. In my case, it is mostly 0 to 2% normal CPU and then suddenly a 20% surge, then a wait of a few seconds to a couple of minutes, and then another one that could be 63%, then 42%, then 30%, and they are all over the place.

At least with graph showing sum-total of all cores, the 63% surge will be on one core, but other cores are available and servicing other traffic (hopefully SMP works well) and this way few users might get impacted, if they had to be.
Title: Re: Random Frequent CPU Spikes and Page Faults
Post by: JustMeHere on August 03, 2023, 07:12:52 PM
I'm seeing this with my firewall too.  I'm on a J455.  I noticed the CPU spikes.  The CPU is running hotter. It looks like there's an issue with system interrupts that is new since my upgrade.  They seldom showed up before the upgrade. 

Title: Re: Random Frequent CPU Spikes and Page Faults
Post by: dpsguard on August 04, 2023, 01:13:06 AM
@JustMeHere I also had the same issues with 23.1 or with 23.1.11 and now with 23.7. If you had no such surges in 23, then that will be interesting. I am assuming that the CPU surges might just be some issue with FreeBSD 13, which they might have taken care of in FreeBSD 14 (as my same hardware does not see these CPU surges with pfSense 2.7, which uses FreeBSD 14). I am waiting for @franco to confirm this behaviour with other hardware as well, since I believe I have ruled out issues with my hardware with the pfSense test.

And heat in your case could be because the Celeron processors are for brief bursty traffic, while Xeon processors are for servers to withstand a sustained / continuous high traffic load. When I stress test my server, yes of course the CPU temperature goes up to like 62 deg C in half an hour of 10Gig iPerf testing. Your box will be small, with a Celeron J processor that may not be able to handle the load you may be subjecting it to, hence getting hotter. Other than that, there should not be much reason for CPU surges or heat when you move from 23.1 to 23.7. Sure there will be bug fixes and some new features and better support for hardware, so I am thinking you may just have a higher amount of traffic than before.
Title: Re: Random Frequent CPU Spikes and Page Faults
Post by: JustMeHere on August 04, 2023, 01:25:11 AM
@dpsguard.  The graph I posted shows the reboot from the upgrade and the change in CPU activity.  There was no change in actual work load.  I have also posted the graph of the CPU heat.  Not sure what has changed, but the CPU is definitely busier in the latest release.  I think this is affecting server throughput.   I know I have a weak CPU in this box, but it should be overkill for a firewall.  This is a simple home network.

I just ran some speed tests and network load is making a much bigger difference to CPU load than it used to.

The gaps in the graphs I've posted are from the system upgrade.  The load on the router was the same before and after.
Title: Re: Random Frequent CPU Spikes and Page Faults
Post by: dpsguard on August 04, 2023, 02:30:28 AM
I had not seen the graph before. Definitely your box is busy. I am sure if the latest OS has this issue, soon many will report this. You may want to reference this post in the latest 23.7 thread for wider comments. The latest code might be impacting Celeron processors, and yes, your CPU is more than enough for a home use case.
Title: Re: Random Frequent CPU Spikes and Page Faults
Post by: dpsguard on August 05, 2023, 03:26:28 AM
Hi @franco, I did a fresh install with 22.7 and I still have the issues that are also seen with 23.1 and 23.7. And I again tested with pfSense 2.7 and I don't have the CPU spike issues. So it must be something to do with FreeBSD 13, unless there could be another reason. Normally I would not be worried, but when I am doing continuous pings to the firewall interfaces, or thru the firewall to Google DNS, the pings also drop when the CPU surges.

And I have tried this on three different hardware platforms and with two different 10Gig NICs.  Anything you can recommend, or a patch that could resolve this issue, will be highly appreciated.  Thanks
Title: Re: Random Frequent CPU Spikes and Page Faults
Post by: opnfwb on August 05, 2023, 06:23:12 AM
Running 23.7 and here are the results of a "vmstat 1" on my J3455 system with Intel IGB NICs and a 120GB SATA SSD.

The system was idle during this sampling, with just minor internet traffic (email, spotify, youtube).

procs    memory    page                      disks     faults       cpu
r  b  w  avm  fre  flt  re  pi  po   fr   sr ad0 pa0   in   sy   cs us sy id
0  0  0 514G  13G 7.2K   0   0  21 7.8K   30 113   0  165 5.3K 1.2K  3  3 94
0  0  0 514G  13G  22K   0   0  35  26K   30   0   0   39 7.8K 1.0K  3  3 94
0  0  0 514G  13G  875   0   0   0 1.0K   30   0   0   25 1.2K  290  1  0 99
2  0  0 514G  13G  874   0   0   0 1.0K   30  17   0   49 1.5K  417  3  1 97
0  0  0 514G  13G  874   0   0   0 1.0K   30   0   0   27 1.2K  286  0  1 99
1  0  0 514G  13G 1.6K   0   0   0 1.4K   30 144   0  177 1.6K 1.1K  9  1 90
1  0  0 514G  13G  881   0   0   0 1.1K   27   0   0   45 1.2K  310 25  1 74
1  0  0 514G  13G  878   0   0   0 1.1K   33   0   0   25 1.2K  299 25  1 74
1  0  0 514G  13G 1.2K   0   0   0 1.0K   27   0   0   38 1.6K  324 26  1 74
0  0  0 514G  13G 1.1K   0   0   0 1.1K   33  81   0  127 2.3K  941  3  0 96
0  0  0 514G  13G  874   0   0   0 1.0K   30 139   0  182 1.2K 1.1K  1  2 98
0  0  0 514G  13G  876   0   0   0 1.0K   30   0   0   34 1.2K  281  0  1 99
0  0  0 514G  13G  875   0   0   0 1.0K   30   0   0   28 1.2K  293  0  0 99
0  0  0 514G  13G  879   0   0   0 1.1K   30  17   0   75 1.5K  505  2  1 97
0  0  0 514G  13G  877   0   0   0 1.1K   27   0   0   27 1.2K  307  0  0 99
0  0  0 514G  13G  875   0   0   0 1.0K   30 104   0  142 1.2K  968  0  2 98
0  0  0 514G  13G  874   0   0   0 1.0K   27   0   0   35 1.2K  293  0  1 99
0  0  0 514G  13G  878   0   0   0 1.0K   30   0   0   39 1.2K  379  0  1 99
0  0  0 514G  13G 1.0K   0   0   0 1.3K   30   0   0   66 1.4K  423  0  1 99
0  0  0 514G  13G  876   0   0   0 1.0K   30   0   0   23 1.2K  294  1  2 98
2  0  0 514G  13G 5.8K   0   0   0 1.4K   31  83   0  106 5.9K  765 14  3 83
2  0  0 514G  13G  16K   0   0   0 8.1K   36   0   0   33  12K 3.5K 23  4 73
0  0  0 514G  13G  24K   0   0   0  23K   47   0   0   25 3.2K 1.3K 15  3 82
0  0  0 514G  13G  878   0   0   0 1.1K   30  17   0   53 1.5K  430  3  2 95
0  0  0 514G  13G  874   0   0   0 1.0K   30 119   0  168 1.2K 1.0K  0  1 99
0  0  0 514G  13G  876   0   0   0 1.0K   30   0   0   25 1.2K  276  0  1 99
0  0  0 514G  13G  881   0   0   0 1.1K   30   0   0   27 1.2K  291  0  0 99
0  0  0 514G  13G  877   0   0   0 1.0K   30   0   0   35 1.2K  316  0  1 99
0  0  0 514G  13G  873   0   0   0 1.0K   30   0   0   36 1.2K  338  0  1 99
0  0  0 514G  13G  878   0   0   0 1.1K   30  74   0  105 1.2K  809  0  1 99
Title: Re: Random Frequent CPU Spikes and Page Faults
Post by: dpsguard on August 05, 2023, 05:59:47 PM
Thanks @opnfwb for sharing the results from your box. I can see that you also have lots of page faults (in kilos), so I will rule out an issue with my setup and assume that this is how memory management functions in FreeBSD.

I can also see some CPU surges in your vmstat output. If you can do a simple test, it will help me further diagnose my issues.

I have now tested on 4 different hardware platforms and I get the same issues on all. The last one I did is on a J3855U with 8GB of RAM, a 32GB SSD and Intel 1Gig NICs. I am now running pfSense on this small appliance and doing a continuous ping to the LAN interface with a packet size of 20000 bytes, and I have zero loss so far in over 4000 pings (and pfSense has a wireless AP in front of it and my laptop is connected via Wi-Fi while pinging).  Can you do a ping test and see if you have any ping drops please? ping 192.168.1.1 -t -l 20000

If this was an OPNsense issue, then so many would have complained (when I repeat the same ping test with 20K packet size, I have 6% loss), but I have seemingly ruled out my hardware, so it must be FreeBSD 13. pfSense is using FreeBSD 14 and I don't have the issue. Maybe I should test with some old release of OPNsense, but I can only find 21.1 available to download, and hopefully that uses FreeBSD 12.x.  Or it could be that FreeBSD 13 has some driver issues working with SuperMicro Xeon processors.  Thanks again
Title: Re: Random Frequent CPU Spikes and Page Faults
Post by: opnfwb on August 05, 2023, 06:54:58 PM
I'm running the ping now but, why such a large packet size? Won't this just fragment? My LAN MTU is only 1500, as I would presume most others are. I'm not sure the purpose of such a large packet size for testing?

I have an older OPNsense 19.1.4 image from years ago, I can install it in my LAB and do a quick vmstat check there too.

I was curious if you were using UFS or ZFS for your OPNsense installs? The one I posted above with the high page faults is ZFS. I figured I would try an identical setup in my LAB but with UFS and see if this made a difference for some reason? I highly doubt it but I'm just trying to rule out potential factors.
Title: Re: Random Frequent CPU Spikes and Page Faults
Post by: dpsguard on August 05, 2023, 07:43:06 PM
I purposely used 20K packet size as each packet will get fragmented into many packets and thus generates some traffic going thru the firewall. This is a better way to test any connectivity issues. I even have tested with 64K size sometime.
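
To put numbers on the fragmentation: a minimal sketch, assuming IPv4 with a 20-byte header (no options), an 8-byte ICMP echo header and the 1500-byte MTU mentioned above. The helper names are made up for illustration, and the loss estimate additionally assumes fragments are dropped independently of each other, which bursty loss during a CPU spike may well violate.

```python
import math

IP_HEADER = 20    # bytes, IPv4 header without options (assumption)
ICMP_HEADER = 8   # bytes, ICMP echo header


def fragment_count(ping_payload, mtu=1500):
    """Number of IP fragments needed to carry one ICMP echo of this size."""
    # Fragment data must be a multiple of 8 bytes (except the last fragment).
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil((ping_payload + ICMP_HEADER) / per_fragment)


def per_packet_loss(ping_loss, fragments_per_ping):
    """Per-fragment loss rate that would explain the observed ping loss.

    A ping only succeeds if every fragment of the request and of the
    reply arrives, so 2 * fragments_per_ping packets must all survive.
    Assumes independent losses.
    """
    packets = 2 * fragments_per_ping
    return 1 - (1 - ping_loss) ** (1 / packets)


fragments = fragment_count(20000)
print(f"fragments per 20000-byte echo: {fragments}")
print(f"implied per-fragment loss at 6% ping loss: "
      f"{per_packet_loss(0.06, fragments):.4%}")
```

With a 20000-byte payload this gives 14 fragments per echo request, and the reply is fragmented the same way, so a single lost fragment in either direction counts as a lost ping. That is why large pings show loss much more readily than normal-sized ones.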

And yes, I tested both OPNsense and pfSense with ZFS. In my case, normal CPU utilization is 0 to 2%, and some CPU bursts generate triangles of spikes on the graph. Do you see anything like that? And around this time, pings drop.

Thanks for your help in troubleshooting the issue; if it can provide some data to @franco, it may help resolve this. This issue might be silently affecting many others in the form of added bursts of latency / jitter and dropped real-time traffic, calls etc. Other than this issue, as I explained earlier, I can pump in sustained 10Gig traffic even for half an hour (the max I tested) and everything keeps working fine. The only commonality between the 4 different hardware platforms tested is that all use Intel NICs. Two servers are SuperMicro Superservers (X11 and X8), one is a Chinese box (Hunsn) and the other is a Taiwanese box (iBase).
Title: Re: Random Frequent CPU Spikes and Page Faults
Post by: opnfwb on August 05, 2023, 09:11:12 PM
So I'm still goofing around with this, I actually find this quite interesting.

I've been using OPNsense for years and occasionally I'll switch to the "other" pf brand just to compare them. I have a Netstat VM on my internal LAN that pings outside hosts and measures latency and I keep the data stored for weeks at a time. I have two HDDs in my J3455 router, so I simply swap the cable from one to the other and I can boot a different router OS. Between OPNsense and pfSense, I can see no discernible difference when running sustained pings to outside hosts.

If you are seeing your gateway drop or latency spikes, to me that's quite unusual. If you've isolated this just to OPNsense there has to be some odd variable that you're hitting. Are you doing any other custom settings? Maybe some NIC tuning? Processor power management? Just trying to think of some odd variable that might be introducing latency or jitter in this setup.
Title: Re: Random Frequent CPU Spikes and Page Faults
Post by: dpsguard on August 05, 2023, 10:36:24 PM
I have pretty much default configuration other than interface IP addressing, HA and a management interface also in the mix. Here is how the pings show up on my firewall.

(https://i.postimg.cc/cr2jNVtB/Pings1.png) (https://postimg.cc/cr2jNVtB)

(https://i.postimg.cc/R6ZGtRn0/Pings2.png) (https://postimg.cc/R6ZGtRn0)

And here is the output of vmstat at 1 second interval

procs    memory    page                      disks     faults       cpu
r  b  w  avm  fre  flt  re  pi  po   fr   sr md98 ad0   in   sy   cs us sy id
0  0  0 848M  11G  70K   0   0   0  45K   67   0   0   19 100K  31K  4  5 91
1  0  0 896M  11G  13K   0   0   0 9.2K   54   0  72  161 6.3K 3.6K  2  5 93
0  0  0 847M  11G  57K   0   0   0  35K   61   0   0   20  96K  31K  2  9 89
0  0  0 844M  11G  28K   0   0   0  30K   61   0   0   48  13K 3.1K  5  7 88
0  0  0 847M  11G 5.7K   0   0   0 5.3K   60   0   0   11 2.8K 2.5K  1  2 97
1  0  0 855M  11G  56K   0   0   0  31K   60   0   0   13  95K  31K  2  6 91
1  0  0 844M  11G  15K   0   0   0  16K   67   0  63  155 6.4K 3.7K  2  1 96
2  0  0 889M  11G  43K   0   0   0  42K   57   0   0   42  27K 7.0K  4 10 86
0  0  0 844M  11G  53K   0   0   0  30K   68   0   0   25  87K  28K  4  9 87
0  0  0 848M  11G 5.8K   0   0   0 5.3K   54   0   0   11 2.8K 2.3K  0  3 97
0  0  0 844M  11G  71K   0   0   0  31K   60   0   0   15  96K  31K  3  7 91
0  0  0 847M  11G  17K   0   0   0  17K   55   0  69  186 7.0K 3.6K  3  3 94
0  0  0 844M  11G  13K   0   0   0  14K   60   0   0   19 6.8K 2.7K  2  8 90
0  0  0 847M  11G  98K   0   0   0  59K   61   0   0   32 107K  32K  6 13 81
0  0  0 844M  11G 5.5K   0   0   0 5.6K   60   0   0    8 2.8K 2.4K  1  2 97
1  0  0 891M  11G  10K   0   0   0 9.0K   54   0   0   34 6.3K 2.7K  1  5 94
1  0  0 848M  11G  76K   0   0   0  37K   60   0  66  159  95K  32K  3  3 94
0  0  0 896M  11G  13K   0   0   0 9.1K   62   0   0   15 6.5K 2.6K  2  7 91
0  0  0 847M  11G  97K   0   0   0  63K   63   0   0   30 107K  32K  7 14 79
0  0  0 844M  11G 5.5K   0   0   0 5.6K   66   0   0   34 2.8K 2.6K  0  3 97
0  0  0 847M  11G 1.7K   0   0   0 1.6K   60   0   0    9 1.2K 2.4K  0  2 97
0  0  0 844M  11G  57K   0   0   0  31K   54   0  71  166  96K  32K  3  6 91
0  0  0 846M  11G  22K   0   0   0  22K   61   0   0   39 9.1K 2.7K  3  3 93
0  0  0 851M  11G  43K   0   0   0  46K   63   0   0   29  22K 3.6K  7 19 74
0  0  0 848M  11G  51K   0   0   0  24K   60   0   0   13  91K  31K  1  6 93
0  0  0 844M  11G 1.4K   0   0   0 1.8K   60   0   0   10 1.2K 2.4K  0  2 97
0  0  0 846M  11G  71K   0   0   0  31K   60   0  66  184  96K  33K  2  7 90
3  0  0 974M  11G  36K   0   0   0  25K   61   0   0   21  14K 2.6K  6  4 91
2  0  0 885M  11G  23K   0   0   0  33K   63   0   0   24  12K 3.0K  4 12 83
0  0  0 847M  11G  57K   0   0   0  33K   60   0   0   12  95K  31K  3  6 91
0  0  0 844M  11G 1.4K   0   0   0 1.8K   54   0   0   32 1.2K 2.5K  0  2 98
2  0  0 889M  11G  54K   0   0   0  30K   60   0  71  157  62K  21K  2  6 92
2  0  0 941M  11G  56K   0   0   0  31K   61   0   0   24  49K  14K  6  7 87
1  0  0 846M  11G  16K   0   0   0  28K   56   0   0   23  10K 3.1K  3 11 86
0  0  0 844M  11G  61K   0   0   0  35K   60   0   0   37  97K  31K  2  8 91
0  0  0 847M  11G 1.7K   0   0   0 1.6K   60   0   0   10 1.2K 2.3K  0  2 97

Thanks

Title: Re: Random Frequent CPU Spikes and Page Faults
Post by: dpsguard on August 05, 2023, 10:42:05 PM
and I have just default config with LAN and WAN interface and no HA etc on other two boxes that I tried. In all, I have similar issues of CPU spikes.
Title: Re: Random Frequent CPU Spikes and Page Faults
Post by: opnfwb on August 05, 2023, 10:51:31 PM
I understand on the CPU spikes. For instance, I use NetFlow on my home LAN (very useful little built in tool) and it does a background stat collection every 60-90 seconds and this spikes the CPU. But when this happens, the LAN gateway and ping monitors are not impacted, there is no discernible change in ping or network responsiveness for outbound connections.

Are your CPU spikes related to bandwidth usage, for instance when the CPU rises is this due to a spike or a burst in traffic? I'm just trying to better understand if this CPU spike is causing a latency/jitter on an idle line, or if it's due to some traffic kicking in.

I setup 3 identical VMs on my VMware host, OPNsense 23.7ZFS, OPNsense 23.7UFS, and pfSense 2.7ZFS. I'm still collecting all of the vmstat totals from each VM but I'll post them here shortly. Then I'll try an old OPNsense 19.1.4 image and just see?
Title: Re: Random Frequent CPU Spikes and Page Faults
Post by: dpsguard on August 05, 2023, 11:23:00 PM
I had ruled out netflow related issues as I had seen similar issue few days ago when doing some internet searches (link below, which produces similar spikes as I see)

https://github.com/opnsense/core/issues/5046

I have almost zero traffic, no videos, just reading discussion forums on my laptop connected thru OPNsense. Nothing else is on this box. So the spikes remain with or without traffic. Sure, with traffic added, the CPU graph grass level goes up, so the spikes ride on top and push the CPU sometimes closer to 90%. I have run out of options. Of course the impact will be lesser if I don't have any GUI session open. However, with the GUI off and using speedtest (CLI), a couple of pings still drop during the test.

With pfsense, I don't get to see this type of CPU graph, so it is possible that it does not show the peaks and averages it out in its CPU bar on home page. So my tests could be flawed comparing the two OS.
Title: Re: Random Frequent CPU Spikes and Page Faults
Post by: opnfwb on August 06, 2023, 01:03:38 AM
All VMs idling during these samples.
None of the admin web interfaces were logged in to or in use during these samples.

All these VMs are hosted on VMware ESXi, 7.0.3, 21930508. All VMs have the same VM hardware version and each has 2 vCPU, 2GB of RAM, and a Paravirtual SCSI HDD. All VMs have 2x VMXNET3 adapters assigned. All VMs had their package version of OpenVM tools installed, and all VMs also had their vnstat package installed (on the pfSense VM this package is called Traffic Totals but it uses vnstat).

For this sampling all VMs have these tunables:
hw.ibrs_disable = 1
vm.pmap.pti = 0


OPNsense 23.7 ZFS:
procs    memory    page                      disks     faults       cpu
r  b  w  avm  fre  flt  re  pi  po   fr   sr da0 cd0   in   sy   cs us sy id
0  0  0 2.0T 455M 1.4K   0   0   1 1.5K   13   0   0   12 1.5K  248  1  0 99
0  0  0 2.0T 454M  610   0   0   0  760   10   0   0    5 1.2K  188  0  1 99
0  0  0 2.0T 454M  613   0   0   0  758   11   0   0    2 1.2K  183  0  0 100
0  0  0 2.0T 454M  614   0   0   0  765   10  68   0   69 1.2K  584  0  1 98
0  0  0 2.0T 454M  612   0   0   0  754   10   0   0    4 1.2K  176  0  0 100
0  0  0 2.0T 454M  617   0   0   0  756   10   0   0    2 1.2K  186  0  1 99
0  0  0 2.0T 454M  611   0   0   0  755   11   0   0    2 1.2K  185  0  0 100
0  0  0 2.0T 454M  614   0   0   0  753   10   0   0    4 1.2K  189  0  0 100
0  0  0 2.0T 454M  618   0   0   0  759   10   0   0    2 1.2K  192  0  0 100
0  0  0 2.0T 454M  615   0   0   0  760   11   0   0    2 1.2K  180  0  0 100
0  0  0 2.0T 454M  612   0   0   0  754   10   0   0    4 1.2K  197  0  0 100
0  0  0 2.0T 454M  611   0   0   0  756   11   0   0    2 1.2K  181  0  0 100
0  0  0 2.0T 454M  613   0   0   0  757   10   0   0    2 1.2K  176  0  0 100
0  0  0 2.0T 454M  613   0   0   0  754   10   0   0    4 1.2K  193  0  0 100
0  0  0 2.0T 454M  610   0   0   0  755   11   0   0    2 1.2K  174  1  0 99
0  0  0 2.0T 454M  613   0   0   0  761   10   0   0    2 1.2K  173  0  0 100
0  0  0 2.0T 454M  612   0   0   0  753   10   0   0    4 1.2K  195  0  0 100
0  0  0 2.0T 454M  615   0   0   0  759   11   0   0    2 1.2K  191  1  0 99
0  0  0 2.0T 454M  621   0   0   0  765   10   0   0    2 1.3K  188  0  0 100
0  0  0 2.0T 454M  612   0   0   0  755   11   0   0    4 1.3K  188  0  0 100
1  0  0 2.0T 454M  615   0   0   0  760   10   0   0    2 1.2K  181  0  0 100


OPNsense 23.7 UFS:
procs    memory    page                      disks     faults       cpu
r  b  w  avm  fre  flt  re  pi  po   fr   sr da0 cd0   in   sy   cs us sy id
0  0  0 2.0T 1.4G 4.4K   0  21   1 4.6K   58   0   0   53 8.1K  582  2  1 97
0  0  0 2.0T 1.4G  633   0   0   0  774   66   0   0    3 1.1K  169  0  0 100
2  0  0 2.0T 1.4G  27K   0   5   0 9.6K   66  15   0   16  15K 1.0K 22  4 73
1  0  0 2.0T 1.4G  703   0   0   0 9.2K   60   1   0   16 1.2K  210  0  1 99
0  0  0 2.0T 1.4G  638   0   0   0  775   66   0   0    2 1.3K  170  1  0 99
1  0  0 2.0T 1.4G  631   0   0   0  777   60   0   0    3 1.2K  169  0  0 100
0  0  0 2.0T 1.4G  633   0   0   0  778   66   0   0    3 1.1K  168  0  0 100
0  0  0 2.0T 1.4G  634   0   0   0  774   60   0   0    2 1.1K  164  1  0 99
0  0  0 2.0T 1.4G  631   0   0   0  777   66   0   0   13 1.2K  202  0  0 100
0  0  0 2.0T 1.4G  630   0   0   0  775   60   0   0    2 1.1K  164  0  0 99
0  0  0 2.0T 1.4G  632   0   0   0  771   66   0   0    2 1.1K  164  0  0 100
0  0  0 2.0T 1.4G  632   0   0   0  776   60   0   0    4 1.2K  183  0  1 99
0  0  0 2.0T 1.4G  629   0   0   0  775   60   0   0    2 1.1K  164  0  0 100
0  0  0 2.0T 1.4G  628   0   0   0  770   66   0   0   11 1.1K  185  0  1 99
0  0  0 2.0T 1.4G  631   0   0   0  773   60   0   0    4 1.2K  182  0  0 100
0  0  0 2.0T 1.4G  627   0   0   0  771   60   0   0    2 1.1K  167  0  0 100
0  0  0 2.0T 1.4G  634   0   0   0  779   66   0   0    2 1.1K  161  0  0 100
1  0  0 2.0T 1.4G  632   0   0   0  775   60   3   0   15 1.2K  200  0  0 100
1  0  0 2.0T 1.4G  633   0   0   0  771   66   0   0    2 1.1K  185  0  0 100
1  0  0 2.0T 1.4G  629   0   0   0  783   66  82   0   85 1.2K  506  0  0 99
1  0  0 2.0T 1.4G  627   0   0   0  767   60   9   0   13 1.1K  210  0  0 100


pfSense 2.7 ZFS:
procs    memory    page                      disks     faults       cpu
r  b  w  avm  fre  flt  re  pi  po   fr   sr da0 cd0   in   sy   cs us sy id
0  0  0 514G 1.5G  611   0   0   1  663    6   0   0   10  476  178  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    6  313  155  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    3  291  149  0  0 100
0  0  0 514G 1.5G    1   0   0   0    0    5   0   0    3  326  144  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    6   0   0    5  294  158  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    3  322  154  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    3  306  144  0  0 100
0  0  0 514G 1.5G    1   0   0   0    0    5   0   0    5  316  160  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    3  273  141  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    3  320  149  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    5  403  156  0  0 100
0  0  0 514G 1.5G    6   0   0   0    0    6   0   0    2  316  147  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    3  295  142  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    5  309  155  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    3  326  158  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    3  299  146  0  0 100
0  0  0 514G 1.5G    1   0   0   0    0    6   0   0    5  319  164  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    3  329  155  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    4  306  157  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    5  310  155  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    6   0   0    3  307  153  0  0 100
Title: Re: Random Frequent CPU Spikes and Page Faults
Post by: dpsguard on August 06, 2023, 01:18:27 AM
Thanks @opnfwb for your great help trying to troubleshoot my issues.

Looks like with 2GB RAM, you have essentially no page faults. I have 12GB RAM. Maybe OPNsense has issues managing memory, so I may try reducing the RAM. I have already tried removing the 10Gig card just in case of any driver issues (I did not apply any driver myself; whatever is part of the OPNsense OS detects these cards correctly). There must be some process that is firing up the CPU, probably for a split second every so often, which vmstat or top are not able to catch.
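
One way to look for split-second bursts that one-second tools average away is to sample the per-core tick counters more often. Below is a minimal sketch, assuming the FreeBSD sysctl kern.cp_times reports five counters per core in the order user, nice, system, interrupt, idle. The function names, the 0.2 second interval and the 20% threshold are arbitrary choices for illustration, not anything OPNsense ships.

```python
import subprocess
import time

STATES = 5   # user, nice, system, interrupt, idle (assumed order)
IDLE = 4     # index of the idle counter within each core's group


def busy_percent(prev, cur):
    """Per-core busy percentage between two kern.cp_times snapshots."""
    result = []
    for start in range(0, len(cur), STATES):
        delta = [c - p for p, c in zip(prev[start:start + STATES],
                                       cur[start:start + STATES])]
        total = sum(delta)
        busy = total - delta[IDLE]
        result.append(100.0 * busy / total if total else 0.0)
    return result


def read_cp_times():
    """Read the raw per-core tick counters (FreeBSD only)."""
    out = subprocess.run(["sysctl", "-n", "kern.cp_times"],
                         capture_output=True, text=True, check=True).stdout
    return [int(value) for value in out.split()]


def watch(interval=0.2, threshold=20.0):
    """Print a line whenever any core exceeds `threshold` percent busy."""
    prev = read_cp_times()
    while True:
        time.sleep(interval)
        cur = read_cp_times()
        for core, pct in enumerate(busy_percent(prev, cur)):
            if pct >= threshold:
                print(f"{time.strftime('%H:%M:%S')} core {core}: {pct:.0f}% busy")
        prev = cur
```

Calling watch() from a shell on the firewall prints only the intervals where a core crosses the threshold. Two caveats: the counters advance in kernel clock ticks, so very short intervals give coarse percentages, and spawning sysctl on every sample adds a little load of its own. This shows which core is busy and when, not which process is responsible.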

Just did a factory reset again and no relief.
Title: Re: Random Frequent CPU Spikes and Page Faults
Post by: opnfwb on August 06, 2023, 03:10:10 AM
I do think you've stumbled on to something interesting here. It's obvious from my observations that there is definitely a higher page fault occurrence in OPNsense compared to an identically configured pfSense 2.7 VM (same hardware, same resources, same tunables, same packages installed).

However, what I'm not sure about is if the anomalies you've found are directly contributing to the problem that you're seeing.

If I run iperf tests on any of these firewall VMs I get virtually identical throughput with all of them, OPNsense and pfSense. The entire time I'm running the test I see a small spike in latency on the firewall VM that is pushing the traffic, usually 2-4ms. I don't get any dropped packets and once the iperf test stops, everything returns to normal.

So it would seem that even though the OPNsense VMs do all exhibit substantially more page faults than the pfSense VM, it doesn't appear to be impacting overall throughput in my testing. And none of them seem to have an issue with dropping pings even under high load. I'm running iperf through each of the firewalls. I use a traffic generator on the WAN side and on the LAN side to make the firewall route the traffic through both of its interfaces. Obviously with my VMs, these are all virtual interfaces (VMXNET3), so it's still possible there's a hardware issue with one of the cards you are using, but you've said you are seeing the ping spikes/packet loss on multiple different systems with varied hardware.
Title: Re: Random Frequent CPU Spikes and Page Faults
Post by: dpsguard on August 06, 2023, 04:10:21 AM
Yes, the page faults definitely are higher in opnsense and I was thinking this could be something to do with the FreeBSD version 13 used in opnsense.

My test setup is all physical. I have the firewall-under-test attached to my main firewall LAN (and main firewall is pfsense running 2.6). Thus the WAN segment of the test firewall is my local LAN. This allows me to add an iPerf server on my LAN (I like using iperf2 over iperf3 as I can have a large number of parallel streams and make use of multiple CPU cores) and then run iperf2 clients on the LAN side of the firewall under test.

I am able to get very high throughput repeatedly (I set -t 600 to 1800 and -P 100 or more), thus flooding the firewall under test, and then I also launch internet bound traffic from a desktop PC playing 4K video at max resolution (I have a 100/10 Meg pair bonded DSL on two pairs of phone line, so I need to test with a local iPerf server). I don't see any hiccups in the video playback while I am doing iPerf continuous testing (both uploads from the client and, with the reverse option, downloads from the server). But I do see significant ping packet loss when running iPerf from LAN to WAN side of the firewall under test. Ping is stateless and thus sensitive to congestion and the resulting loss, while YouTube video uses tcp, which can tolerate some loss, so I don't notice any issues there. I have yet to test some real-time traffic like a whatsapp / facetime audio / video call thru this firewall to see if the ping losses manifest as actual call drops or pixelation of the picture.
Title: Re: Random Frequent CPU Spikes and Page Faults [Almost Resolved]
Post by: dpsguard on August 07, 2023, 01:33:06 AM
including @franco for his information.

I might be close to resolving this. Here is what I have done so far to get the firewall working acceptably. Before this, I had peaks even reaching 75%, and for sure when that hit, a parallel continuous ping from the test laptop (to 1.1.1.1) would lose a couple of pings.

top -SHz 20

pressed s to set delay to 1 second and then shift +s to include system / kernel processes, to show total of 20 processes using CPU.

Here I could see the unbound process hogging some cycles. I was not even sure what unbound is, but this DNS proxy service was enabled by default. I unchecked it under the services section and then manually specified 8.8.8.8 and 1.1.1.1 as the DNS servers to use for the system and for the DHCP scope.

I could see CPU peaks then not exceeding 12% and generally 5 to 7%, after watching for ten minutes (no traffic at this time).

Then I logged off the GUI to kill php-cgi processes chewing up CPU. Once in a while DHCPv6 was also showing up. Again this was default enabled. I unchecked it and made sure to disable IPv6 under interfaces.

I launched a 4K / 2160p60Hz res video (Flying over Norway, to generate some traffic) and also maintained a continuous ping to 1.1.1.1, and in addition started an iPerf client to a WAN segment based iPerf2 server. I used -P50 and -i 1, -t 600 to keep the firewall somewhat busy. Then I fired up the GUI again and CPU peaks were now under 20%. Clearly the GUI adds its own demand to paint the CPU graph etc, but watching the top output over the SSH console, it generally remains below 10% utilized. I still see ping loss, which could be for various reasons; in particular the firewall might be treating ping as lower priority than the normal traffic when it is flooded with traffic. But clearly, with about 900Mbps average being downloaded or uploaded via iPerf thru the firewall, the situation seems to be better overall with the changes I made.
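
For anyone who wants to put a number on the ping loss seen during an iPerf run instead of eyeballing it, here is a minimal Python sketch. It assumes the summary line that FreeBSD, macOS and Linux ping print at the end of a run; the function name and the sample text are made up for illustration and are not measurements from this thread.

```python
import re

# Matches the counters in the ping summary line, e.g.
# FreeBSD/macOS: "600 packets transmitted, 588 packets received, 2.0% packet loss"
# Linux:         "600 packets transmitted, 588 received, 2% packet loss, time ..."
SUMMARY_RE = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def parse_ping_summary(text):
    """Return (sent, received, loss_percent) from ping output, or None if no summary is found."""
    match = SUMMARY_RE.search(text)
    if match is None:
        return None
    sent, received = int(match.group(1)), int(match.group(2))
    # Recompute the loss from the counters rather than trusting the rounded figure.
    loss = 100.0 * (sent - received) / sent if sent else 0.0
    return sent, received, loss


# Illustrative sample only (hypothetical numbers).
sample = """--- 1.1.1.1 ping statistics ---
600 packets transmitted, 588 packets received, 2.0% packet loss
"""

if __name__ == "__main__":
    print(parse_ping_summary(sample))
```

Running the same ping once with iPerf idle and once under load, then comparing the two loss figures, gives a rough before/after comparison for each change made on the firewall.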

Over next few days, when I get chance, I will do some more stress testing with two 10Gig machines to act as iPerf client and server. For now, I have also stressed all CPU cores by issuing the following (4 times for my 4 cores) and this makes all cores almost at 99% and my pings were still going thru and my iPerf testing was also going on.

yes > /dev/null &

and then when done

killall yes

reference  https://forum.netgate.com/topic/171454/stress-ng-install/4

Title: Re: Random Frequent CPU Spikes and Page Faults [Almost Resolved]
Post by: opnfwb on August 07, 2023, 03:39:51 AM
The Unbound spikes you're seeing are likely due to OPNsense's Unbound Reporting feature. It's a very powerful and useful feature but it does some background stats collection every 30 seconds or so, and during this time there's a small CPU spike while it processes the stats.

You can turn it off in Reporting/Settings and uncheck the "Unbound DNS Reporting" section to see if this stops the Unbound CPU usage that you're noticing.
Title: Re: Random Frequent CPU Spikes and Page Faults [Almost Resolved]
Post by: dpsguard on August 07, 2023, 12:48:44 PM
Thanks @opnfwb. I will test and report back later today.
Title: Re: Random Frequent CPU Spikes and Page Faults [Almost Resolved]
Post by: dpsguard on August 08, 2023, 12:31:42 AM
Hi @opnfwb. I tested again. The unbound reporting was already unchecked.

Definitely most of the spikes that still show up (many are gone, especially the tall ones) seem to be from the GUI (php-cgi), although watching iostat or top etc does not show any process(es) consuming anywhere close to the CPU surge. And that is largely gone if I log out and close the GUI tab.

Then I looked into the output of "top -m io" and that showed two interesting usages. Syslog-ng will toggle back and forth between 0 and 100% IO and similarly python3.9 doing the same. Since I don't log anything or send anything out to a logging server, I disabled the service by editing the /etc/rc.conf with syslog_ng_enable="NO".

This of course removed the syslog related IO, but I don't understand why python3.9 would switch back and forth between 100% and 0% every second. I would like to remove that bottleneck also and would appreciate any tips to resolve this. Thanks so much
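
A note on reading the "top -m io" figures: as far as I understand FreeBSD top, the percent column in io mode is each process's share of all I/O operations seen in that refresh interval, not a measure of how busy the disk is. Treat that as an assumption to check against the top(1) man page. Under that assumption, a process that does the only write in an otherwise idle second shows as 100%, and 0% in the next second when nothing happens. The Python sketch below just illustrates the arithmetic; the function name and the counts are hypothetical.

```python
def io_share_percent(per_process_ops):
    """Map each process to its share (in percent) of all I/O operations in one interval."""
    total = sum(per_process_ops.values())
    if total == 0:
        # Nothing did any I/O in this interval, so every share is zero.
        return {name: 0.0 for name in per_process_ops}
    return {name: 100.0 * ops / total for name, ops in per_process_ops.items()}


# Hypothetical one-second intervals on an idle box (made-up counts).
quiet_second = io_share_percent({"python3.9": 0, "syslog-ng": 0})
single_write = io_share_percent({"python3.9": 1, "syslog-ng": 0})

if __name__ == "__main__":
    print(quiet_second)
    print(single_write)
```

If this reading is right, a process toggling between 0% and 100% on an idle system may simply be doing one small periodic write, so the absolute READ/WRITE counts are the more useful columns to watch.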
Title: Re: Random Frequent CPU Spikes and Page Faults [Almost Resolved]
Post by: dpsguard on August 08, 2023, 01:08:43 AM
Further look showed that 100% IO for Python was coming from Captive Portal (I set up a page for simple terms and conditions, no accounting, no authentication etc). So when I shutdown captive portal service, then that issue gets resolved.

However I need captive portal. The script is cp-background-process.py. There must be something in this script to keep python generating so much IO.

Thanks
Title: Re: Random Frequent CPU Spikes and Page Faults [Almost Resolved]
Post by: zz00mm on August 08, 2023, 05:31:27 AM
Look at the information provided here.
https://bsd44.blogspot.com/2004/12/vmstat.html

Looks like faults is nothing but interrupts, so a high number shows a busy system.

Faults:
The faults section shows system faults. Faults, in this case, aren't bad, they're just received system traps and interrupts.

in Shows the number of system interrupts (IRQ requests) the system received in the last five seconds.

sy Shows the number of system calls in the last five seconds.

cs Gives the number of context switches, or times the CPU changed from doing one thing to doing another.
Title: Re: Random Frequent CPU Spikes and Page Faults [Almost Resolved]
Post by: dpsguard on August 08, 2023, 01:02:34 PM
Thanks @zz00mm for your advice. However in my case, the system is not busy at all; it is idle with essentially no traffic. And I get that page faults are not bad, but there is no use of any swap either and overall RAM use is a fraction of what is available. My bigger concern now is the constant interruption by python.
Title: Re: Random Frequent CPU Spikes and Page Faults [Almost Resolved]
Post by: opnfwb on August 08, 2023, 11:55:18 PM
I finally got around to sampling the old OPNsense 19.1.4 image I had available. This is quite old, based on FreeBSD 11 so I don't think this is a relevant comparison at this point but I'm posting the results anyway. It does show noticeably lower faults than the current versions and it has the same hardware config and the same plugins installed (vnStat and vmware tools).

This vmstat sample was taken with the system effectively idle, just passing minimal gateway monitor ping traffic.

OPNsense 19.1.4
procs  memory       page                    disks     faults         cpu
r b w  avm   fre   flt  re  pi  po    fr   sr da0 cd0   in    sy    cs us sy id
0 0 0 2.0T  1.6G     7   0   0   0     0   42   0   0    7   418   151  0  0 100
1 0 0 2.0T  1.6G     0   0   0   0     0   43   0   0    6   233   130  0  0 100
0 0 0 2.0T  1.6G     0   0   0   0     0   42   0   0    3   194   117  0  0 100
0 0 0 2.0T  1.6G     3   0   0   0     0   42   0   0    9   235   149  0  0 100
0 0 0 2.0T  1.6G     1   0   0   0     0   42   0   0    3   223   121  0  0 100
4 0 0 2.0T  1.6G     0   0   0   0     0   42   0   0    7   206   127  0  0 99
1 0 0 2.0T  1.6G     0   0   0   0     0   42   0   0    3   222   125  0  0 100
0 0 0 2.0T  1.6G     0   0   0   0     0   42   0   0    6   209   134  0  0 100
0 0 0 2.0T  1.6G     4   0   0   0     8   42   5   0   12   256   165  0  0 100
2 0 0 2.0T  1.6G     1   0   0   0     0   42   0   0    8   227   141  0  0 100
0 0 0 2.0T  1.6G     0   0   0   0     0   42   0   0    4   248   122  0  0 100
0 0 0 2.0T  1.6G     0   0   0   0     0   42   0   0    6   218   132  0  0 100
2 0 0 2.0T  1.6G     0   0   0   0     0   84   0   0    5   209   123  0  1 99
0 0 0 2.0T  1.6G     2   0   0   0     0   42   0   0    3   202   121  0  0 100
0 0 0 2.0T  1.6G     0   0   0   0     0   42   1   0    7   237   148  0  0 99
0 0 0 2.0T  1.6G     0   0   0   0     0   42   0   0    4   204   127  0  0 100
0 0 0 2.0T  1.6G 11701   0   0  11 11041   42   0   0    5  8359   485  2  3 95
0 0 0 2.0T  1.6G 23435   0   0  16 23041   44   0   0    7  7984   548  3  5 92
0 0 0 2.0T  1.6G     2   0   0   0     0   42   1   0    5   247   137  0  0 100
1 0 0 2.0T  1.6G     0   0   0   2     0   42   2   0    5   211   132  0  0 100
0 0 0 2.0T  1.6G     0   0   0   0     0   42   0   0    6   224   136  0  0 100
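
To pick the burst rows out of a longer vmstat capture like the one above, a small Python sketch along these lines could help. It assumes the column layout shown in this sample (flt is the sixth field, id is the last field, both plain integers); the function name and the threshold of 1000 faults are arbitrary choices for illustration.

```python
def find_fault_spikes(vmstat_text, threshold=1000):
    """Return (flt, idle) pairs for vmstat rows whose page fault count reaches the threshold."""
    spikes = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        # Data rows start with the run-queue count; the two header rows do not.
        if len(fields) < 6 or not fields[0].isdigit():
            continue
        flt = int(fields[5])    # "flt" column: page faults in the interval
        idle = int(fields[-1])  # "id" column: CPU idle percentage
        if flt >= threshold:
            spikes.append((flt, idle))
    return spikes


# Four rows copied from the 19.1.4 sample above, with the header lines.
SAMPLE = """procs  memory       page                    disks     faults         cpu
r b w  avm   fre   flt  re  pi  po    fr   sr da0 cd0   in    sy    cs us sy id
0 0 0 2.0T  1.6G     0   0   0   0     0   42   0   0    4   204   127  0  0 100
0 0 0 2.0T  1.6G 11701   0   0  11 11041   42   0   0    5  8359   485  2  3 95
0 0 0 2.0T  1.6G 23435   0   0  16 23041   44   0   0    7  7984   548  3  5 92
0 0 0 2.0T  1.6G     2   0   0   0     0   42   1   0    5   247   137  0  0 100
"""

if __name__ == "__main__":
    for flt, idle in find_fault_spikes(SAMPLE):
        print(f"flt={flt} idle={idle}%")
```

Feeding it the saved output of "vmstat 1" from each firewall makes it easier to compare how often the bursts occur and how much idle CPU they cost.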
Title: Re: Random Frequent CPU Spikes and Page Faults [Almost Resolved]
Post by: dpsguard on August 09, 2023, 01:17:14 AM
Thanks again @opnfwb for your continued interest and help with my situation. I can see the same type of page faults in your 19.1.4 tests as I saw with the latest FreeBSD 14 based pfsense version.

I am also dealing with another issue described here at

https://forum.opnsense.org/index.php?topic=35288.0

I may also try installing version 19.1.4 to see if it resolves both of my issues. My needs are simple, just a NAT router with Captive portal for Guests to accept terms and then allowed to go to Internet.
Title: Re: Random Frequent CPU Spikes and Page Faults [Almost Resolved]
Post by: opnfwb on August 11, 2023, 05:49:16 AM
I would highly recommend avoiding the older installs of any firewall distro. They aren't security maintained and will only become more vulnerable over time. In this case we're talking about falling back to something two major OS revisions behind with no future support.

I just did this as an interesting baseline to see if I saw the same vmstat results (I don't) compared to newer versions. Beyond that, I wouldn't seriously consider still running something this old and I wouldn't recommend it to anyone either. Just my 2c on it but there you have it.
Title: Re: Random Frequent CPU Spikes and Page Faults [Almost Resolved]
Post by: dpsguard on August 12, 2023, 05:21:11 PM
Thanks @opnfwb for your advice. I agree with what you said.

Title: Re: Random Frequent CPU Spikes and Page Faults [Almost Resolved]
Post by: dpsguard on August 12, 2023, 05:36:13 PM
@opnfwb can you please also review my post below in case you have any experience with similar issues? Thanks

https://forum.opnsense.org/index.php?topic=35375.0
Title: Re: Random Frequent CPU Spikes and Page Faults [Almost Resolved]
Post by: libertasfox on August 16, 2023, 10:35:06 PM
Hey All,

I am jumping on the proc at 100% bandwagon with this last upgrade as well.  I was having no issues with the FW until I upgraded to 23.7.  The latency of the UI is painful and sometimes it fails to even load.  I'm not having any bandwidth issues as of now but the proc and core temps are running consistently high.  I'm also running Zenarmor and with their UI upgrade I wonder if this is having an effect??  Anyway, just wanted to add to the list of end users who are having this issue and hope the folks at OPNsense are actively looking into this.
Title: Re: Random Frequent CPU Spikes and Page Faults [Almost Resolved]
Post by: dpsguard on August 16, 2023, 11:04:44 PM
In my case, I seem to have lowered the CPU spikes to seemingly acceptable levels. I am running the latest version and I never had 100% CPU issues; rather, CPU was mostly very low with sudden spikes that would sometimes go to 70%.

What I found was that under Firewall/Settings/Advanced, Firewall Optimization was set to aggressive. I changed it to normal. Further I removed all the widgets on dashboard and then also if I don't use the GUI, then repeating the tests that I was doing, I don't see the ping drops from time to time (which I attributed to CPU peaks).

Only issue that I have (or my misunderstanding) is around Python Captive Portal background / housekeeping script showing as 100% IOPS every couple of seconds.