Hello All,
I have a simple lab setup with two Super Micro servers (Xeon CPU, 4 cores, 12GB of RAM, Intel X520-DA2 card) running in HA. For stress testing I used two more servers, one on the LAN side and one on the WAN segment, and ran iPerf2 with 200 parallel streams (each stream set to 50Mbps) between them through the firewall; I am able to get very close to 10Gig sustained for as long as 30 minutes.
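For reference, the iPerf2 invocation was roughly along these lines (a sketch; the server address is a placeholder, and I am approximating the 50Mbps-per-stream limit with UDP's -b flag):
iperf -s -u                                            # on the server behind the WAN segment
iperf -c <server-ip> -u -b 50M -P 200 -t 1800 -i 10    # 200 parallel 50Mbps streams for 30 minutes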
I had installed 23.1, then updated to 23.1.11 and finally to 23.7 in an attempt to resolve sudden short bursts of CPU usage (sometimes reaching 67%, often 40%, and generally 20% every few seconds; the rest of the time the CPU sits at 0 to 2%). During an iPerf stress test that saturates the pipe, the CPU is normally close to 50%, and when these peaks strike it reaches 70 to 100%.
Running top -P with a 1-second delay and system processes included, or vmstat 1, shows much less total CPU than the GUI shows in the CPU usage widget, so I am not sure whether this is a cosmetic bug and the actual CPU usage is much lower than the GUI reports. While running the iPerf stress test, I can also add a test laptop going through this firewall to the Internet and play a 4K / 2160p video with no issues (the player shows no more than a couple of seconds of buffered content, so if there were a real problem I would see it stall), but a continuous ping to Google DNS starts incurring lots of loss while the video keeps playing fine.
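(For clarity, the monitoring commands I am referring to were roughly the following; this is a sketch of my invocation, assuming the stock FreeBSD top where -P shows per-CPU usage, -S includes system processes and -s sets the delay.)
top -P -S -s 1
vmstat 1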
I have tried various things, including enabling / disabling hardware offload, PowerD, and both consoles, and these random pulses every few seconds to a few minutes keep showing up.
The firewalls have no security features, logging, or NetFlow enabled. My setup is just a simple NAT router.
The CPU spikes happen with no traffic. Performance tests in my limited lab setup don't seem to be impacted, other than the ping losses. I will also run a real-time audio / video test through the firewall to see whether that gets impacted. But I don't believe I am the only one seeing this issue; there are a couple of short threads in the community that talk about CPU spikes, though those are all on small CPUs.
Top does not show any process consuming noticeable resources when the peaks come (or in the few seconds before, assuming top averages over the last few seconds). At most it is the system/kernel sitting in the single digits even when a 67% peak shows in the GUI. Which process (or thread of a process) is responsible, I have not been able to find with any utility.
Another issue I find is that when I run vmstat 1 (or even at a 5-second interval), the page fault (flt) column shows values like 54K, 3K, 7K, and most of the time something much smaller. I even removed the swap partition and rebooted, but with no swap set aside, why would there be any paging in and out and the resulting page faults? Memory remains mostly underutilized, so there is no need for swap, yet the page fault counts are large. Could this be bad memory causing page faults and thus the CPU spikes (the memory is new unbuffered UDIMMs), even though it happens on both HA firewalls?
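For what it is worth, this is roughly how I confirmed that swap is not involved (a sketch using the stock FreeBSD tools):
swapinfo -h                   # lists configured swap devices (none in my case)
vmstat -s | grep -i fault     # cumulative fault counters since boot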
Any help will be appreciated.
I used another physical machine and installed 23.1, and without any configuration or optimization I see the same CPU peak issues.
Also, the page faults are in the thousands, the same as on the other two machines. So it looks like either memory management in FreeBSD is misbehaving or OPNsense is somehow unable to use the memory and needs to fetch some code frequently from disk. This machine has a different processor, different motherboard, and different BIOS, but has the same 10Gig X520-DA2 card. I am not sure whether it is a card issue, but for this test I am not using the 10Gig card and have only assigned the two 1Gig LOM interfaces for WAN and LAN.
At least it rules out any configuration or tunable issue.
If someone can check their own setup for CPU usage peaks / triangles, as well as the output of "vmstat 1" to see whether the page fault column is in the thousands, that would be appreciated.
I also pulled the 10G card out, leaving just the two Intel 1Gig motherboard ports, and the CPU peaks as well as the page faults remain as before.
I installed Debian on the same server (using another SSD) and did not see CPU utilization beyond 4% as a sum of all cores in htop. vmstat 1 also showed no page faults in the thousands, just low numbers, although the output format is a little different than on FreeBSD.
Then I downloaded the latest pfSense, 2.7.0, and installed it on the same server. It uses FreeBSD 14, so it is not an apples-to-apples comparison, but with pfSense I still see page faults in the thousands, so this does look like FreeBSD memory management behaviour (maybe it is a non-issue anyway). However, the CPU did not jump around and stayed within 8% summed across all cores.
So this is likely a small bug or minor issue that should be brought to the notice of @franco; I hope it will get addressed or a workaround provided.
Thanks
Since FreeBSD 14 is more or less a development version, I don't have much hope that this will be quickly fixed or that patches are already available. It might be hardware specific, but I am unsure what that would be.
The rule of thumb for FreeBSD fixes is give it a lot of time and then some more... :/
Cheers,
Franco
Thank you so much @franco for looking into this and for your advice. I agree the upstream issues are beyond you and we just have to wait. I assume that applies to the memory management / page faults, and I assume you also see this on any other hardware you can test with (I also tested it on a Chinese network appliance that has an Intel i5 and 4GB RAM). Clearly this is FreeBSD, and I am not the only one seeing these page faults in the thousands.
But can we conclude that the CPU spikes are also related to FreeBSD 13? Many others might not be noticing them: if their normal CPU shows up on the graph as a band much higher than the near-zero baseline I see, the peaks would be far less obvious. In my case it is mostly 0 to 2% normal CPU, then suddenly a 20% surge, then a few seconds to a couple of minutes of quiet, and then another one at 63%, then 42%, then 30%; they are all over the place.
At least with the graph showing the sum total across all cores, a 63% surge will likely land on one core while the other cores remain available to service other traffic (hopefully SMP works well), so few users would actually be impacted, if any.
I'm seeing this with my firewall too. I'm on a J455. I noticed the CPU spikes. The CPU is running hotter. It looks like there's an issue with system interrupts that is new since my upgrade. They seldom showed up before the upgrade.
@JustMeHere I had the same issues with 23.1 and with 23.1.11, and now with 23.7. If you had no such surges before 23.7, that would be interesting. I am assuming the CPU surges might just be an issue with FreeBSD 13 that has been taken care of in FreeBSD 14 (the same hardware does not show these CPU surges with pfSense 2.7, which uses FreeBSD 14). I am waiting for @franco to confirm this behaviour on other hardware as well, since I believe the pfSense test has ruled out issues with my hardware.
And the heat in your case could be because Celeron processors are meant for brief bursty traffic, while Xeon processors are built for servers and can withstand sustained / continuous high traffic load. When I stress test my server, the CPU temperature of course climbs to around 62 deg C after half an hour of 10Gig iPerf testing; your box is a small one with a Celeron J processor and may not be able to handle the load you are subjecting it to, hence running hotter. Other than that, there should not be much reason for CPU surges or heat when moving from 23.1 to 23.7. Sure, there will be bug fixes, some new features, and better hardware support, so I am thinking you may simply have a higher amount of traffic than before.
@dpsguard. The graph I posted shows the reboot from the upgrade and the change in CPU activity. There was no change in the actual workload. I have also posted the graph of the CPU temperature. Not sure what has changed, but the CPU is definitely busier in the latest release. I think this is affecting server throughput. I know I have a weak CPU in this box, but it should be overkill for a firewall; this is a simple home network.
I just ran some speed tests, and network load is making a much bigger difference to CPU load than it used to.
The gaps in the graphs I've posted are from the system upgrade. The load on the router was the same before and after.
I had not seen the graph before. Your box is definitely busy. I am sure that if the latest OS has this issue, many more will report it soon. You may want to reference this post in the latest 23.7 thread for wider comments. The latest code might be impacting Celeron processors, and yes, your CPU is more than enough for a home use case.
Hi @franco, I did a fresh install of 22.7 and I still have the same issues seen with 23.1 and 23.7. I tested again with pfSense 2.7 and I don't have the CPU spike issue there. So it must be something to do with FreeBSD 13, unless there is another explanation. Normally I would not be worried, but when I run continuous pings to the firewall interfaces, or through the firewall to Google DNS, the pings drop whenever the CPU surges.
And I have tried this on three different hardware platforms and with two different 10Gig NICs. Anything you can recommend, or a patch that could resolve this issue, would be highly appreciated. Thanks
Running 23.7 and here are the results of a "vmstat 1" on my J3455 system with Intel IGB NICs and a 120GB SATA SSD.
The system was idle during this sampling, with just minor internet traffic (email, spotify, youtube).
procs memory page disks faults cpu
r b w avm fre flt re pi po fr sr ad0 pa0 in sy cs us sy id
0 0 0 514G 13G 7.2K 0 0 21 7.8K 30 113 0 165 5.3K 1.2K 3 3 94
0 0 0 514G 13G 22K 0 0 35 26K 30 0 0 39 7.8K 1.0K 3 3 94
0 0 0 514G 13G 875 0 0 0 1.0K 30 0 0 25 1.2K 290 1 0 99
2 0 0 514G 13G 874 0 0 0 1.0K 30 17 0 49 1.5K 417 3 1 97
0 0 0 514G 13G 874 0 0 0 1.0K 30 0 0 27 1.2K 286 0 1 99
1 0 0 514G 13G 1.6K 0 0 0 1.4K 30 144 0 177 1.6K 1.1K 9 1 90
1 0 0 514G 13G 881 0 0 0 1.1K 27 0 0 45 1.2K 310 25 1 74
1 0 0 514G 13G 878 0 0 0 1.1K 33 0 0 25 1.2K 299 25 1 74
1 0 0 514G 13G 1.2K 0 0 0 1.0K 27 0 0 38 1.6K 324 26 1 74
0 0 0 514G 13G 1.1K 0 0 0 1.1K 33 81 0 127 2.3K 941 3 0 96
0 0 0 514G 13G 874 0 0 0 1.0K 30 139 0 182 1.2K 1.1K 1 2 98
0 0 0 514G 13G 876 0 0 0 1.0K 30 0 0 34 1.2K 281 0 1 99
0 0 0 514G 13G 875 0 0 0 1.0K 30 0 0 28 1.2K 293 0 0 99
0 0 0 514G 13G 879 0 0 0 1.1K 30 17 0 75 1.5K 505 2 1 97
0 0 0 514G 13G 877 0 0 0 1.1K 27 0 0 27 1.2K 307 0 0 99
0 0 0 514G 13G 875 0 0 0 1.0K 30 104 0 142 1.2K 968 0 2 98
0 0 0 514G 13G 874 0 0 0 1.0K 27 0 0 35 1.2K 293 0 1 99
0 0 0 514G 13G 878 0 0 0 1.0K 30 0 0 39 1.2K 379 0 1 99
0 0 0 514G 13G 1.0K 0 0 0 1.3K 30 0 0 66 1.4K 423 0 1 99
0 0 0 514G 13G 876 0 0 0 1.0K 30 0 0 23 1.2K 294 1 2 98
2 0 0 514G 13G 5.8K 0 0 0 1.4K 31 83 0 106 5.9K 765 14 3 83
2 0 0 514G 13G 16K 0 0 0 8.1K 36 0 0 33 12K 3.5K 23 4 73
0 0 0 514G 13G 24K 0 0 0 23K 47 0 0 25 3.2K 1.3K 15 3 82
0 0 0 514G 13G 878 0 0 0 1.1K 30 17 0 53 1.5K 430 3 2 95
0 0 0 514G 13G 874 0 0 0 1.0K 30 119 0 168 1.2K 1.0K 0 1 99
0 0 0 514G 13G 876 0 0 0 1.0K 30 0 0 25 1.2K 276 0 1 99
0 0 0 514G 13G 881 0 0 0 1.1K 30 0 0 27 1.2K 291 0 0 99
0 0 0 514G 13G 877 0 0 0 1.0K 30 0 0 35 1.2K 316 0 1 99
0 0 0 514G 13G 873 0 0 0 1.0K 30 0 0 36 1.2K 338 0 1 99
0 0 0 514G 13G 878 0 0 0 1.1K 30 74 0 105 1.2K 809 0 1 99
Thanks @opnfwb for sharing the results from your box. I can see that you also have lots of page faults (in the thousands), so I will rule out an issue with my setup and assume this is just how memory management works in FreeBSD.
I can also see some CPU surges in your vmstat output. A simple test, if you can run it, would help me further diagnose my issue.
I have now tested on 4 different hardware platforms and I get the same issues on all of them. The latest one is a J3855U with 8GB of RAM, a 32GB SSD, and Intel 1Gig NICs. I am now running pfSense on this small appliance and doing a continuous ping to the LAN interface with a packet size of 20000 bytes, and I have zero loss so far in over 4000 pings (pfSense has a wireless AP in front of it and my laptop is connected via Wi-Fi while pinging). Can you please do a ping test and see if you have any ping drops? ping 192.168.1.1 -t -l 20000
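(That is Windows ping syntax; from a Linux client the rough equivalent, assuming the stock iputils ping, would be:)
ping -s 20000 192.168.1.1    # -s sets the payload size; it runs continuously until interrupted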
If this were an OPNsense issue, many more people would have complained (when I repeat the same ping test with a 20K packet size through OPNsense, I get 6% loss), but I have seemingly ruled out my hardware, so it must be FreeBSD 13. pfSense uses FreeBSD 14 and I don't have the issue there. Maybe I should test with some old release of OPNsense, but I can only find 21.1 available to download, and hopefully that uses FreeBSD 12.x. Or it could be that FreeBSD 13 has some driver issue with SuperMicro Xeon platforms. Thanks again
I'm running the ping now, but why such a large packet size? Won't this just fragment? My LAN MTU is only 1500, as I would presume most others' are. I'm not sure what the purpose of such a large packet size is for testing.
I have an older OPNsense 19.1.4 image from years ago; I can install it in my lab and do a quick vmstat check there too.
I was curious whether you were using UFS or ZFS for your OPNsense installs? The one I posted above with the high page faults is ZFS. I figured I would try an identical setup in my lab but with UFS and see if it makes a difference for some reason. I highly doubt it, but I'm just trying to rule out potential factors.
I purposely used a 20K packet size because each ping gets fragmented into many packets and thus generates some traffic going through the firewall. This is a better way to test for connectivity issues. I have even tested with a 64K size at times.
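Roughly, with a 1500-byte MTU each fragment carries at most 1480 bytes of the ICMP message, so a 20000-byte ping (20008 bytes with the ICMP header) becomes about ceil(20008 / 1480) = 14 fragments in each direction, and losing any single fragment makes the whole ping fail.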
And yes, I tested both OPNsense and pfSense with ZFS. In my case, normal CPU utilization is 0 to 2%, and the CPU bursts produce triangle-shaped spikes on the graph. Do you see anything like that? Pings drop around the same time.
Thanks for your help troubleshooting the issue; if it provides some data to @franco, it may help get this resolved. This issue might be silently affecting many others in the form of bursts of latency / jitter and dropped real-time traffic (calls etc.). Other than this issue, as I explained earlier, I can pump sustained 10Gig traffic through for half an hour (the maximum I tested) and everything keeps working fine. The only commonality between the 4 hardware platforms tested is that they all use Intel NICs. Two are SuperMicro Superservers (X11 and X8), one is a Chinese box (Hunsn), and the other is a Taiwanese box (iBase).
So I'm still goofing around with this, I actually find this quite interesting.
I've been using OPNsense for years and occasionally I'll switch to the "other" pf brand just to compare them. I have a Netstat VM on my internal LAN that pings outside hosts and measures latency, and I keep the data stored for weeks at a time. I have two HDDs in my J3455 router, so I simply swap the cable from one to the other and I can boot a different router OS. Between OPNsense and pfSense, I can see no discernible difference when running sustained pings to outside hosts.
If you are seeing your gateway drop or latency spikes, to me that's quite unusual. If you've isolated this to just OPNsense, there has to be some odd variable that you're hitting. Are you using any other custom settings? Maybe some NIC tuning? Processor power management? Just trying to think of some odd variable that might be introducing latency or jitter in this setup.
I have pretty much the default configuration other than interface IP addressing, HA, and a management interface also in the mix. Here is how the pings show up on my firewall.
(https://i.postimg.cc/cr2jNVtB/Pings1.png) (https://postimg.cc/cr2jNVtB)
(https://i.postimg.cc/R6ZGtRn0/Pings2.png) (https://postimg.cc/R6ZGtRn0)
And here is the output of vmstat at 1 second interval
procs memory page disks faults cpu
r b w avm fre flt re pi po fr sr md98 ad0 in sy cs us sy id
0 0 0 848M 11G 70K 0 0 0 45K 67 0 0 19 100K 31K 4 5 91
1 0 0 896M 11G 13K 0 0 0 9.2K 54 0 72 161 6.3K 3.6K 2 5 93
0 0 0 847M 11G 57K 0 0 0 35K 61 0 0 20 96K 31K 2 9 89
0 0 0 844M 11G 28K 0 0 0 30K 61 0 0 48 13K 3.1K 5 7 88
0 0 0 847M 11G 5.7K 0 0 0 5.3K 60 0 0 11 2.8K 2.5K 1 2 97
1 0 0 855M 11G 56K 0 0 0 31K 60 0 0 13 95K 31K 2 6 91
1 0 0 844M 11G 15K 0 0 0 16K 67 0 63 155 6.4K 3.7K 2 1 96
2 0 0 889M 11G 43K 0 0 0 42K 57 0 0 42 27K 7.0K 4 10 86
0 0 0 844M 11G 53K 0 0 0 30K 68 0 0 25 87K 28K 4 9 87
0 0 0 848M 11G 5.8K 0 0 0 5.3K 54 0 0 11 2.8K 2.3K 0 3 97
0 0 0 844M 11G 71K 0 0 0 31K 60 0 0 15 96K 31K 3 7 91
0 0 0 847M 11G 17K 0 0 0 17K 55 0 69 186 7.0K 3.6K 3 3 94
0 0 0 844M 11G 13K 0 0 0 14K 60 0 0 19 6.8K 2.7K 2 8 90
0 0 0 847M 11G 98K 0 0 0 59K 61 0 0 32 107K 32K 6 13 81
0 0 0 844M 11G 5.5K 0 0 0 5.6K 60 0 0 8 2.8K 2.4K 1 2 97
1 0 0 891M 11G 10K 0 0 0 9.0K 54 0 0 34 6.3K 2.7K 1 5 94
1 0 0 848M 11G 76K 0 0 0 37K 60 0 66 159 95K 32K 3 3 94
0 0 0 896M 11G 13K 0 0 0 9.1K 62 0 0 15 6.5K 2.6K 2 7 91
0 0 0 847M 11G 97K 0 0 0 63K 63 0 0 30 107K 32K 7 14 79
0 0 0 844M 11G 5.5K 0 0 0 5.6K 66 0 0 34 2.8K 2.6K 0 3 97
0 0 0 847M 11G 1.7K 0 0 0 1.6K 60 0 0 9 1.2K 2.4K 0 2 97
0 0 0 844M 11G 57K 0 0 0 31K 54 0 71 166 96K 32K 3 6 91
0 0 0 846M 11G 22K 0 0 0 22K 61 0 0 39 9.1K 2.7K 3 3 93
0 0 0 851M 11G 43K 0 0 0 46K 63 0 0 29 22K 3.6K 7 19 74
0 0 0 848M 11G 51K 0 0 0 24K 60 0 0 13 91K 31K 1 6 93
0 0 0 844M 11G 1.4K 0 0 0 1.8K 60 0 0 10 1.2K 2.4K 0 2 97
0 0 0 846M 11G 71K 0 0 0 31K 60 0 66 184 96K 33K 2 7 90
3 0 0 974M 11G 36K 0 0 0 25K 61 0 0 21 14K 2.6K 6 4 91
2 0 0 885M 11G 23K 0 0 0 33K 63 0 0 24 12K 3.0K 4 12 83
0 0 0 847M 11G 57K 0 0 0 33K 60 0 0 12 95K 31K 3 6 91
0 0 0 844M 11G 1.4K 0 0 0 1.8K 54 0 0 32 1.2K 2.5K 0 2 98
2 0 0 889M 11G 54K 0 0 0 30K 60 0 71 157 62K 21K 2 6 92
2 0 0 941M 11G 56K 0 0 0 31K 61 0 0 24 49K 14K 6 7 87
1 0 0 846M 11G 16K 0 0 0 28K 56 0 0 23 10K 3.1K 3 11 86
0 0 0 844M 11G 61K 0 0 0 35K 60 0 0 37 97K 31K 2 8 91
0 0 0 847M 11G 1.7K 0 0 0 1.6K 60 0 0 10 1.2K 2.3K 0 2 97
Thanks
And I have just the default config with LAN and WAN interfaces and no HA etc. on the other two boxes that I tried. On all of them I see similar CPU spikes.
I understand about the CPU spikes. For instance, I use NetFlow on my home LAN (a very useful little built-in tool) and it does a background stats collection every 60-90 seconds that spikes the CPU. But when this happens, the LAN gateway and ping monitors are not impacted; there is no discernible change in ping or network responsiveness for outbound connections.
Are your CPU spikes related to bandwidth usage, i.e. when the CPU rises, is it due to a spike or burst in traffic? I'm just trying to better understand whether the CPU spike is causing latency/jitter on an idle line, or whether it's due to some traffic kicking in.
I set up 3 identical VMs on my VMware host: OPNsense 23.7 ZFS, OPNsense 23.7 UFS, and pfSense 2.7 ZFS. I'm still collecting the vmstat output from each VM but I'll post it here shortly. Then I'll try an old OPNsense 19.1.4 image and see.
I had ruled out NetFlow-related issues, as I came across a similar report a few days ago while doing some internet searches (link below; it describes spikes similar to what I see):
https://github.com/opnsense/core/issues/5046
I have almost zero traffic, no videos, just reading discussion forums on my laptop connected through OPNsense, and nothing else on this box. So the spikes remain with or without traffic. Sure, with traffic added, the baseline of the CPU graph rises, so the spikes ride on top and sometimes push the CPU closer to 90%. I have run out of options. Of course the impact is smaller if I don't have any GUI session open. However, even with the GUI off and running a CLI speedtest, a couple of pings are still lost during the test.
With pfSense, I don't get this type of CPU graph, so it is possible that it simply doesn't show the peaks and averages them out in the CPU bar on its home page. So my comparison between the two OSes could be flawed.
All VMs idling during these samples.
None of the admin web interfaces were logged in to or in use during these samples.
All these VMs are hosted on VMware ESXi 7.0.3, build 21930508. All VMs have the same VM hardware version and each has 2 vCPUs, 2GB of RAM, and a Paravirtual SCSI HDD. All VMs have 2x VMXNET3 adapters assigned. All VMs had the packaged open-vm-tools installed, and all also had the vnstat package installed (on the pfSense VM this package is called Traffic Totals, but it uses vnstat).
For this sampling all VMs have these tunables:
hw.ibrs_disable = 1
vm.pmap.pti = 0
OPNsense 23.7 ZFS:
procs memory page disks faults cpu
r b w avm fre flt re pi po fr sr da0 cd0 in sy cs us sy id
0 0 0 2.0T 455M 1.4K 0 0 1 1.5K 13 0 0 12 1.5K 248 1 0 99
0 0 0 2.0T 454M 610 0 0 0 760 10 0 0 5 1.2K 188 0 1 99
0 0 0 2.0T 454M 613 0 0 0 758 11 0 0 2 1.2K 183 0 0 100
0 0 0 2.0T 454M 614 0 0 0 765 10 68 0 69 1.2K 584 0 1 98
0 0 0 2.0T 454M 612 0 0 0 754 10 0 0 4 1.2K 176 0 0 100
0 0 0 2.0T 454M 617 0 0 0 756 10 0 0 2 1.2K 186 0 1 99
0 0 0 2.0T 454M 611 0 0 0 755 11 0 0 2 1.2K 185 0 0 100
0 0 0 2.0T 454M 614 0 0 0 753 10 0 0 4 1.2K 189 0 0 100
0 0 0 2.0T 454M 618 0 0 0 759 10 0 0 2 1.2K 192 0 0 100
0 0 0 2.0T 454M 615 0 0 0 760 11 0 0 2 1.2K 180 0 0 100
0 0 0 2.0T 454M 612 0 0 0 754 10 0 0 4 1.2K 197 0 0 100
0 0 0 2.0T 454M 611 0 0 0 756 11 0 0 2 1.2K 181 0 0 100
0 0 0 2.0T 454M 613 0 0 0 757 10 0 0 2 1.2K 176 0 0 100
0 0 0 2.0T 454M 613 0 0 0 754 10 0 0 4 1.2K 193 0 0 100
0 0 0 2.0T 454M 610 0 0 0 755 11 0 0 2 1.2K 174 1 0 99
0 0 0 2.0T 454M 613 0 0 0 761 10 0 0 2 1.2K 173 0 0 100
0 0 0 2.0T 454M 612 0 0 0 753 10 0 0 4 1.2K 195 0 0 100
0 0 0 2.0T 454M 615 0 0 0 759 11 0 0 2 1.2K 191 1 0 99
0 0 0 2.0T 454M 621 0 0 0 765 10 0 0 2 1.3K 188 0 0 100
0 0 0 2.0T 454M 612 0 0 0 755 11 0 0 4 1.3K 188 0 0 100
1 0 0 2.0T 454M 615 0 0 0 760 10 0 0 2 1.2K 181 0 0 100
OPNsense 23.7 UFS:
procs memory page disks faults cpu
r b w avm fre flt re pi po fr sr da0 cd0 in sy cs us sy id
0 0 0 2.0T 1.4G 4.4K 0 21 1 4.6K 58 0 0 53 8.1K 582 2 1 97
0 0 0 2.0T 1.4G 633 0 0 0 774 66 0 0 3 1.1K 169 0 0 100
2 0 0 2.0T 1.4G 27K 0 5 0 9.6K 66 15 0 16 15K 1.0K 22 4 73
1 0 0 2.0T 1.4G 703 0 0 0 9.2K 60 1 0 16 1.2K 210 0 1 99
0 0 0 2.0T 1.4G 638 0 0 0 775 66 0 0 2 1.3K 170 1 0 99
1 0 0 2.0T 1.4G 631 0 0 0 777 60 0 0 3 1.2K 169 0 0 100
0 0 0 2.0T 1.4G 633 0 0 0 778 66 0 0 3 1.1K 168 0 0 100
0 0 0 2.0T 1.4G 634 0 0 0 774 60 0 0 2 1.1K 164 1 0 99
0 0 0 2.0T 1.4G 631 0 0 0 777 66 0 0 13 1.2K 202 0 0 100
0 0 0 2.0T 1.4G 630 0 0 0 775 60 0 0 2 1.1K 164 0 0 99
0 0 0 2.0T 1.4G 632 0 0 0 771 66 0 0 2 1.1K 164 0 0 100
0 0 0 2.0T 1.4G 632 0 0 0 776 60 0 0 4 1.2K 183 0 1 99
0 0 0 2.0T 1.4G 629 0 0 0 775 60 0 0 2 1.1K 164 0 0 100
0 0 0 2.0T 1.4G 628 0 0 0 770 66 0 0 11 1.1K 185 0 1 99
0 0 0 2.0T 1.4G 631 0 0 0 773 60 0 0 4 1.2K 182 0 0 100
0 0 0 2.0T 1.4G 627 0 0 0 771 60 0 0 2 1.1K 167 0 0 100
0 0 0 2.0T 1.4G 634 0 0 0 779 66 0 0 2 1.1K 161 0 0 100
1 0 0 2.0T 1.4G 632 0 0 0 775 60 3 0 15 1.2K 200 0 0 100
1 0 0 2.0T 1.4G 633 0 0 0 771 66 0 0 2 1.1K 185 0 0 100
1 0 0 2.0T 1.4G 629 0 0 0 783 66 82 0 85 1.2K 506 0 0 99
1 0 0 2.0T 1.4G 627 0 0 0 767 60 9 0 13 1.1K 210 0 0 100
pfSense 2.7 ZFS:
procs memory page disks faults cpu
r b w avm fre flt re pi po fr sr da0 cd0 in sy cs us sy id
0 0 0 514G 1.5G 611 0 0 1 663 6 0 0 10 476 178 0 0 100
0 0 0 514G 1.5G 0 0 0 0 0 5 0 0 6 313 155 0 0 100
0 0 0 514G 1.5G 0 0 0 0 0 5 0 0 3 291 149 0 0 100
0 0 0 514G 1.5G 1 0 0 0 0 5 0 0 3 326 144 0 0 100
0 0 0 514G 1.5G 0 0 0 0 0 6 0 0 5 294 158 0 0 100
0 0 0 514G 1.5G 0 0 0 0 0 5 0 0 3 322 154 0 0 100
0 0 0 514G 1.5G 0 0 0 0 0 5 0 0 3 306 144 0 0 100
0 0 0 514G 1.5G 1 0 0 0 0 5 0 0 5 316 160 0 0 100
0 0 0 514G 1.5G 0 0 0 0 0 5 0 0 3 273 141 0 0 100
0 0 0 514G 1.5G 0 0 0 0 0 5 0 0 3 320 149 0 0 100
0 0 0 514G 1.5G 0 0 0 0 0 5 0 0 5 403 156 0 0 100
0 0 0 514G 1.5G 6 0 0 0 0 6 0 0 2 316 147 0 0 100
0 0 0 514G 1.5G 0 0 0 0 0 5 0 0 3 295 142 0 0 100
0 0 0 514G 1.5G 0 0 0 0 0 5 0 0 5 309 155 0 0 100
0 0 0 514G 1.5G 0 0 0 0 0 5 0 0 3 326 158 0 0 100
0 0 0 514G 1.5G 0 0 0 0 0 5 0 0 3 299 146 0 0 100
0 0 0 514G 1.5G 1 0 0 0 0 6 0 0 5 319 164 0 0 100
0 0 0 514G 1.5G 0 0 0 0 0 5 0 0 3 329 155 0 0 100
0 0 0 514G 1.5G 0 0 0 0 0 5 0 0 4 306 157 0 0 100
0 0 0 514G 1.5G 0 0 0 0 0 5 0 0 5 310 155 0 0 100
0 0 0 514G 1.5G 0 0 0 0 0 6 0 0 3 307 153 0 0 100
Thanks @opnfwb for your great help trying to troubleshoot my issues.
Looks like with 2GB RAM you have essentially no page faults; I have 12GB RAM. Maybe OPNsense has issues managing memory, so I may try reducing the RAM. I have already tried removing the 10Gig card in case of a driver issue (I did not install any driver myself; whatever is part of the OPNsense OS detects these cards correctly). There must be some process firing up the CPU, probably for a split second every so often, that vmstat or top are not able to catch.
Just did a factory reset again and no relief.
I do think you've stumbled on to something interesting here. It's obvious from my observations that there is definitely a higher page fault occurrence in OPNsense compared to an identically configured pfSense 2.7 VM (same hardware, same resources, same tunables, same packages installed).
However, what I'm not sure about is if the anomalies you've found are directly contributing to the problem that you're seeing.
If I run iperf tests on any of these firewall VMs I get virtually identical throughput with all of them, OPNsense and pfSense. The entire time I'm running the test I see a small spike in latency on the firewall VM that is pushing the traffic, usually 2-4ms. I don't get any dropped packets and once the iperf test stops, everything returns to normal.
So it would seem that even though the OPNsense VMs do all exhibit substantially more page faults than the pfSense VM, it doesn't appear to be impacting overall throughput in my testing. And none of them seem to have an issue with dropping pings even under high load. I'm running iperf through each of the firewalls, with a traffic generator on the WAN side and on the LAN side, to make the firewall route the traffic through both of its interfaces. Obviously with my VMs these are all virtual interfaces (VMXNET3), so it's still possible there's a hardware issue with one of the cards you are using, but you've said you are seeing the ping spikes/packet loss on multiple different systems with varied hardware.
Yes, the page faults are definitely higher in OPNsense, and I was thinking this could be something to do with the FreeBSD 13 base used in OPNsense.
My test setup is all physical. I have the firewall-under-test attached to my main firewall's LAN (the main firewall is pfSense 2.6), so the WAN segment of the test firewall is my local LAN. This lets me put an iPerf server on my LAN (I prefer iperf2 over iperf3 because I can run a large number of parallel streams and utilize multiple CPU cores) and run iperf2 clients on the LAN side of the firewall under test.
I am able to get very high throughput repeatedly (I set -t 600 to 1800 and -P 100 or more), flooding the firewall under test, and I also launch internet-bound traffic from a desktop PC playing 4K at max resolution (I have a 100/10 Meg pair-bonded DSL connection on two phone lines, so I need to test with a local iPerf server). I don't see any hiccups in the video playback while the iPerf test is running (either uploads from the client or, with the direction reversed, downloads from the server). But I do see significant ping packet loss when running iPerf from the LAN to the WAN side of the firewall under test. Ping is stateless and thus sensitive to congestion and some consequent loss, while YouTube video uses TCP, which tolerates some loss without me noticing any issues. I have yet to test real-time traffic like a WhatsApp / FaceTime audio / video call through this firewall to see whether the ping losses manifest as actual call drops or pixelation.
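The client side of that test looks roughly like this (a sketch; <iperf-server> is a placeholder, and -r / -d are the classic iperf2 options for running the reverse direction individually or both directions at once):
iperf -c <iperf-server> -P 100 -t 1800 -i 10       # LAN-to-WAN upload through the firewall
iperf -c <iperf-server> -P 100 -t 1800 -i 10 -r    # add -r (or -d) to also exercise the download path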
including @franco for his information.
I might be close to resolving this. Here is what I have done so far to get the firewall working acceptably. Before this, I had peaks reaching as high as 75%, and whenever one hit, a parallel continuous ping from the test laptop (to 1.1.1.1) would lose a couple of packets.
top -SHz 20
I pressed 's' to set the delay to 1 second and then Shift+S to include system / kernel processes, showing the top 20 processes using CPU.
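(The non-interactive equivalent would be roughly top -SHz -s 1 20, assuming the stock FreeBSD top where -s sets the refresh delay.)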
Here I could see the unbound process hogging some cycles. I was not even sure what Unbound was, but this DNS resolver service is enabled by default. I unchecked it under the Services section and then manually specified 8.8.8.8 and 1.1.1.1 as the DNS servers for the system and for the DHCP scope.
After watching for ten minutes (no traffic at this time), the CPU peaks no longer exceeded 12% and were generally 5 to 7%.
Then I logged out of the GUI to stop the php-cgi processes chewing up CPU. Once in a while DHCPv6 was also showing up; again, this is enabled by default, so I unchecked it and made sure to disable IPv6 under the interfaces.
I launched a 4K / 2160p60 video ("Flying over Norway", to generate some traffic), maintained a continuous ping to 1.1.1.1, and in addition started an iPerf client against an iPerf2 server on the WAN segment. I used -P 50, -i 1 and -t 600 to keep the firewall somewhat busy. Then I opened the GUI again and the CPU peaks were now under 20%. Clearly the GUI adds its own load to paint the CPU graph etc., but watching the top output over the SSH console, the CPU generally stays below 10% utilized. I still see ping loss, which could have various causes; in particular, the firewall might treat pings as lowest priority relative to normal traffic when it is flooded. But with about 900Mbps average being downloaded or uploaded via iPerf through the firewall, the situation seems overall better with the changes I made.
Over the next few days, when I get a chance, I will do some more stress testing with two 10Gig machines acting as iPerf client and server. For now, I have also stressed all CPU cores by issuing the following (4 times, once per core); this pushes all cores to nearly 99%, and my pings were still getting through while the iPerf test kept running.
yes > /dev/null &
and then when done
killall yes
reference https://forum.netgate.com/topic/171454/stress-ng-install/4
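If anyone wants to repeat this in one go, a rough sketch (assuming a POSIX shell and the stock FreeBSD sysctl):
# start one busy loop per CPU core reported by the kernel
for i in $(seq 1 $(sysctl -n hw.ncpu)); do yes > /dev/null & done
# when finished, stop them all
killall yes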
The Unbound spikes you're seeing are likely due to OPNsense's Unbound Reporting feature. It's a very powerful and useful feature, but it does some background stats collection every 30 seconds or so, and during this time there's a small CPU spike while it processes the stats.
You can turn it off under Reporting/Settings by unchecking the "Unbound DNS Reporting" option, to see if this stops the Unbound CPU usage you're noticing.
Thanks @opnfwb. I will test and report back later today.
Hi @opnfwb. I tested again. The unbound reporting was already unchecked.
Most of the spikes that still show up (many are gone, especially the tall ones) seem to come from the GUI (php-cgi), although watching iostat or top etc. does not show any process consuming anywhere close to the CPU surge. And they are largely gone if I log out and close the GUI tab.
Then I looked at the output of "top -m io" and it showed two interesting things: syslog-ng toggles back and forth between 0 and 100% IO, and python3.9 does the same. Since I don't log anything or send anything to a remote logging server, I disabled the service by editing /etc/rc.conf with syslog_ng_enable="NO".
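In case it helps anyone, the change was roughly the following (a sketch; on OPNsense services are normally managed through the GUI and configd, so a direct rc.conf edit may not be the supported route and may not survive updates):
sysrc syslog_ng_enable="NO"    # persist the setting in /etc/rc.conf
service syslog-ng stop         # stop the running daemon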
This of course removed the syslog-related IO, but I don't understand why python3.9 switches back and forth between 100% and 0% every second. I would like to remove that bottleneck as well and would appreciate any tips on resolving it. Thanks so much
A further look showed that the 100% IO for Python was coming from the Captive Portal (I set up a page with simple terms and conditions, no accounting, no authentication etc.). When I shut down the captive portal service, that issue goes away.
However, I need the captive portal. The script is cp-background-process.py; there must be something in it that keeps Python generating so much IO.
Thanks
Look at the information provided here.
https://bsd44.blogspot.com/2004/12/vmstat.html
Looks like the faults shown there are just traps and interrupts, so a high number simply indicates a busy system.
Faults:
The faults section shows system faults. Faults, in this case, aren't bad, they're just received system traps and interrupts.
in Shows the number of system interrupts (IRQ requests) the system received in the last five seconds.
sy Shows the number of system calls in the last five seconds.
cs Gives the number of context switches, or times the CPU changed from doing one thing to doing another.
Thanks @zz00mm for your advice. However, in my case the system is not busy at all; it is idle with essentially no traffic. And I get that page faults are not bad, but there is no swap in use either, and overall RAM use is a fraction of what is available. My bigger concern now is the constant interruption by Python.
I finally got around to sampling the old OPNsense 19.1.4 image I had available. This is quite old, based on FreeBSD 11, so I don't think it's a relevant comparison at this point, but I'm posting the results anyway. It does show noticeably lower faults than the current versions, and it has the same hardware config and the same plugins installed (vnStat and VMware tools).
This vmstat sample was taken with the system effectively idle, just passing minimal gateway monitor ping traffic.
OPNsense 19.1.4
procs memory page disks faults cpu
r b w avm fre flt re pi po fr sr da0 cd0 in sy cs us sy id
0 0 0 2.0T 1.6G 7 0 0 0 0 42 0 0 7 418 151 0 0 100
1 0 0 2.0T 1.6G 0 0 0 0 0 43 0 0 6 233 130 0 0 100
0 0 0 2.0T 1.6G 0 0 0 0 0 42 0 0 3 194 117 0 0 100
0 0 0 2.0T 1.6G 3 0 0 0 0 42 0 0 9 235 149 0 0 100
0 0 0 2.0T 1.6G 1 0 0 0 0 42 0 0 3 223 121 0 0 100
4 0 0 2.0T 1.6G 0 0 0 0 0 42 0 0 7 206 127 0 0 99
1 0 0 2.0T 1.6G 0 0 0 0 0 42 0 0 3 222 125 0 0 100
0 0 0 2.0T 1.6G 0 0 0 0 0 42 0 0 6 209 134 0 0 100
0 0 0 2.0T 1.6G 4 0 0 0 8 42 5 0 12 256 165 0 0 100
2 0 0 2.0T 1.6G 1 0 0 0 0 42 0 0 8 227 141 0 0 100
0 0 0 2.0T 1.6G 0 0 0 0 0 42 0 0 4 248 122 0 0 100
0 0 0 2.0T 1.6G 0 0 0 0 0 42 0 0 6 218 132 0 0 100
2 0 0 2.0T 1.6G 0 0 0 0 0 84 0 0 5 209 123 0 1 99
0 0 0 2.0T 1.6G 2 0 0 0 0 42 0 0 3 202 121 0 0 100
0 0 0 2.0T 1.6G 0 0 0 0 0 42 1 0 7 237 148 0 0 99
0 0 0 2.0T 1.6G 0 0 0 0 0 42 0 0 4 204 127 0 0 100
0 0 0 2.0T 1.6G 11701 0 0 11 11041 42 0 0 5 8359 485 2 3 95
0 0 0 2.0T 1.6G 23435 0 0 16 23041 44 0 0 7 7984 548 3 5 92
0 0 0 2.0T 1.6G 2 0 0 0 0 42 1 0 5 247 137 0 0 100
1 0 0 2.0T 1.6G 0 0 0 2 0 42 2 0 5 211 132 0 0 100
0 0 0 2.0T 1.6G 0 0 0 0 0 42 0 0 6 224 136 0 0 100
Thanks again @opnfwb for your continued interest and help with my situation. I can see the same kind of page fault behaviour in your 19.1.4 test as I saw with the latest FreeBSD 14-based pfSense version.
I am also dealing with another issue, described here:
https://forum.opnsense.org/index.php?topic=35288.0
I may also try installing version 19.1.4 to see if it resolves both of my issues. My needs are simple: just a NAT router with a captive portal for guests to accept the terms and then be allowed out to the Internet.
I would highly recommend avoiding the older installs of any firewall distro. They aren't security maintained and will only become more vulnerable over time. In this case we're talking about falling back to something two major OS revisions behind with no future support.
I just did this as an interesting baseline to see if I saw the same vmstat results (I don't) compared to newer versions. Beyond that, I wouldn't seriously consider still running something this old and I wouldn't recommend it to anyone either. Just my 2c on it, but there you have it.
Thanks @opnfwb for your advice. I agree with what you said.
@opnfwb can you please also review my post below in case you have any experience with similar issues? Thanks
https://forum.opnsense.org/index.php?topic=35375.0
Hey All,
I am jumping on the 'CPU at 100%' bandwagon with this last upgrade as well. I was having no issues with the firewall until I upgraded to 23.7. The latency of the UI is painful and it sometimes fails to load at all. I'm not having any bandwidth problems as of now, but the processor and core temps are running consistently high. I'm also running Zenarmor, and with their UI upgrade I wonder if that is having an effect? Anyway, I just wanted to add to the list of end users having this issue and hope the folks at OPNsense are actively looking into it.
In my case, I seem to have lowered the CPU spikes to acceptable levels. I am running the latest version, and I never had 100% CPU issues; rather, the CPU was mostly very low with sudden spikes that sometimes reached 70%.
What I found was that under Firewall/Settings/Advanced, Firewall Optimization was set to aggressive; I changed it to normal. I also removed all the widgets from the dashboard, and if I additionally stay out of the GUI while repeating my earlier tests, I no longer see the occasional ping drops (which I had attributed to the CPU peaks).
The only issue I still have (or perhaps misunderstand) is the Python captive portal background / housekeeping script showing 100% IO every couple of seconds.