Re: Random Frequent CPU Spikes and Page Faults [Almost Resolved]

Started by dpsguard, August 01, 2023, 04:41:49 AM

So I'm still goofing around with this; I actually find it quite interesting.

I've been using OPNsense for years, and occasionally I'll switch to the "other" pf brand just to compare them. I have a Netstat VM on my internal LAN that pings outside hosts and measures latency, and I keep the data for weeks at a time. I have two HDDs in my J3455 router, so I simply swap the cable from one to the other and can boot a different router OS. Between OPNsense and pfSense, I can see no discernible difference when running sustained pings to outside hosts.

If you are seeing your gateway drop or latency spikes, that's quite unusual to me. If you've isolated this to OPNsense alone, there has to be some odd variable that you're hitting. Do you have any other custom settings? Maybe some NIC tuning? Processor power management? I'm just trying to think of some odd variable that might be introducing latency or jitter into this setup.

August 05, 2023, 10:36:24 PM #16 Last Edit: August 05, 2023, 10:39:57 PM by dpsguard
I have pretty much a default configuration other than interface IP addressing, HA, and a management interface also in the mix. Here is how the pings show up on my firewall.

[gateway ping graph screenshots omitted]

And here is the output of vmstat at a 1-second interval:

procs    memory    page                      disks     faults       cpu
r  b  w  avm  fre  flt  re  pi  po   fr   sr md98 ad0   in   sy   cs us sy id
0  0  0 848M  11G  70K   0   0   0  45K   67   0   0   19 100K  31K  4  5 91
1  0  0 896M  11G  13K   0   0   0 9.2K   54   0  72  161 6.3K 3.6K  2  5 93
0  0  0 847M  11G  57K   0   0   0  35K   61   0   0   20  96K  31K  2  9 89
0  0  0 844M  11G  28K   0   0   0  30K   61   0   0   48  13K 3.1K  5  7 88
0  0  0 847M  11G 5.7K   0   0   0 5.3K   60   0   0   11 2.8K 2.5K  1  2 97
1  0  0 855M  11G  56K   0   0   0  31K   60   0   0   13  95K  31K  2  6 91
1  0  0 844M  11G  15K   0   0   0  16K   67   0  63  155 6.4K 3.7K  2  1 96
2  0  0 889M  11G  43K   0   0   0  42K   57   0   0   42  27K 7.0K  4 10 86
0  0  0 844M  11G  53K   0   0   0  30K   68   0   0   25  87K  28K  4  9 87
0  0  0 848M  11G 5.8K   0   0   0 5.3K   54   0   0   11 2.8K 2.3K  0  3 97
0  0  0 844M  11G  71K   0   0   0  31K   60   0   0   15  96K  31K  3  7 91
0  0  0 847M  11G  17K   0   0   0  17K   55   0  69  186 7.0K 3.6K  3  3 94
0  0  0 844M  11G  13K   0   0   0  14K   60   0   0   19 6.8K 2.7K  2  8 90
0  0  0 847M  11G  98K   0   0   0  59K   61   0   0   32 107K  32K  6 13 81
0  0  0 844M  11G 5.5K   0   0   0 5.6K   60   0   0    8 2.8K 2.4K  1  2 97
1  0  0 891M  11G  10K   0   0   0 9.0K   54   0   0   34 6.3K 2.7K  1  5 94
1  0  0 848M  11G  76K   0   0   0  37K   60   0  66  159  95K  32K  3  3 94
0  0  0 896M  11G  13K   0   0   0 9.1K   62   0   0   15 6.5K 2.6K  2  7 91
0  0  0 847M  11G  97K   0   0   0  63K   63   0   0   30 107K  32K  7 14 79
0  0  0 844M  11G 5.5K   0   0   0 5.6K   66   0   0   34 2.8K 2.6K  0  3 97
0  0  0 847M  11G 1.7K   0   0   0 1.6K   60   0   0    9 1.2K 2.4K  0  2 97
0  0  0 844M  11G  57K   0   0   0  31K   54   0  71  166  96K  32K  3  6 91
0  0  0 846M  11G  22K   0   0   0  22K   61   0   0   39 9.1K 2.7K  3  3 93
0  0  0 851M  11G  43K   0   0   0  46K   63   0   0   29  22K 3.6K  7 19 74
0  0  0 848M  11G  51K   0   0   0  24K   60   0   0   13  91K  31K  1  6 93
0  0  0 844M  11G 1.4K   0   0   0 1.8K   60   0   0   10 1.2K 2.4K  0  2 97
0  0  0 846M  11G  71K   0   0   0  31K   60   0  66  184  96K  33K  2  7 90
3  0  0 974M  11G  36K   0   0   0  25K   61   0   0   21  14K 2.6K  6  4 91
2  0  0 885M  11G  23K   0   0   0  33K   63   0   0   24  12K 3.0K  4 12 83
0  0  0 847M  11G  57K   0   0   0  33K   60   0   0   12  95K  31K  3  6 91
0  0  0 844M  11G 1.4K   0   0   0 1.8K   54   0   0   32 1.2K 2.5K  0  2 98
2  0  0 889M  11G  54K   0   0   0  30K   60   0  71  157  62K  21K  2  6 92
2  0  0 941M  11G  56K   0   0   0  31K   61   0   0   24  49K  14K  6  7 87
1  0  0 846M  11G  16K   0   0   0  28K   56   0   0   23  10K 3.1K  3 11 86
0  0  0 844M  11G  61K   0   0   0  35K   60   0   0   37  97K  31K  2  8 91
0  0  0 847M  11G 1.7K   0   0   0 1.6K   60   0   0   10 1.2K 2.3K  0  2 97

Thanks


And I have just a default config with LAN and WAN interfaces, no HA etc., on the other two boxes that I tried. On all of them, I see similar CPU spikes.

I understand about the CPU spikes. For instance, I use NetFlow on my home LAN (a very useful little built-in tool) and it does a background stats collection every 60-90 seconds, which spikes the CPU. But when this happens, the LAN gateway and ping monitors are not impacted; there is no discernible change in ping or network responsiveness for outbound connections.

Are your CPU spikes related to bandwidth usage? For instance, when the CPU rises, is it due to a spike or a burst in traffic? I'm just trying to better understand whether this CPU spike is causing latency/jitter on an idle line, or whether it only shows up when some traffic kicks in.

I set up 3 identical VMs on my VMware host: OPNsense 23.7 ZFS, OPNsense 23.7 UFS, and pfSense 2.7 ZFS. I'm still collecting the vmstat totals from each VM, but I'll post them here shortly. Then I'll try an old OPNsense 19.1.4 image and see.

I had ruled out netflow-related issues, as I had seen a similar issue reported a few days ago while doing some internet searches (link below; it produces spikes similar to what I see).

https://github.com/opnsense/core/issues/5046

I have almost zero traffic, no videos, just reading discussion forums on my laptop connected through OPNsense. Nothing else is on this box, so the spikes remain with or without traffic. Sure, with traffic added, the "grass level" of the CPU graph goes up, so the spikes ride on top and sometimes push the CPU closer to 90%. I have run out of options. Of course the impact is smaller if I don't have any GUI session open. However, even with the GUI off and running speedtest (CLI), a couple of pings still drop during the test.

With pfSense, I don't get this type of CPU graph, so it is possible that it simply doesn't show the peaks and averages them out in the CPU bar on its home page. So my comparison of the two OSes could be flawed.

August 06, 2023, 01:03:38 AM #20 Last Edit: August 06, 2023, 01:05:25 AM by opnfwb
All VMs idling during these samples.
None of the admin web interfaces were logged in to or in use during these samples.

All these VMs are hosted on VMware ESXi, 7.0.3, 21930508. All VMs have the same VM hardware version, and each has 2 vCPU, 2GB of RAM, and a Paravirtual SCSI HDD. All VMs have 2x VMXNET3 adapters assigned. All VMs had the packaged version of open-vm-tools installed, and all VMs also had the vnstat package installed (on the pfSense VM this package is called Traffic Totals, but it uses vnstat).

For this sampling all VMs have these tunables:
hw.ibrs_disable = 1
vm.pmap.pti = 0
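
To confirm the tunables are actually in effect on each VM, a quick check from a shell works (note vm.pmap.pti is a boot-time tunable, so a change only takes effect after a reboot):

# show the current values of both mitigation-related tunables
sysctl hw.ibrs_disable vm.pmap.pti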


OPNsense 23.7 ZFS:
procs    memory    page                      disks     faults       cpu
r  b  w  avm  fre  flt  re  pi  po   fr   sr da0 cd0   in   sy   cs us sy id
0  0  0 2.0T 455M 1.4K   0   0   1 1.5K   13   0   0   12 1.5K  248  1  0 99
0  0  0 2.0T 454M  610   0   0   0  760   10   0   0    5 1.2K  188  0  1 99
0  0  0 2.0T 454M  613   0   0   0  758   11   0   0    2 1.2K  183  0  0 100
0  0  0 2.0T 454M  614   0   0   0  765   10  68   0   69 1.2K  584  0  1 98
0  0  0 2.0T 454M  612   0   0   0  754   10   0   0    4 1.2K  176  0  0 100
0  0  0 2.0T 454M  617   0   0   0  756   10   0   0    2 1.2K  186  0  1 99
0  0  0 2.0T 454M  611   0   0   0  755   11   0   0    2 1.2K  185  0  0 100
0  0  0 2.0T 454M  614   0   0   0  753   10   0   0    4 1.2K  189  0  0 100
0  0  0 2.0T 454M  618   0   0   0  759   10   0   0    2 1.2K  192  0  0 100
0  0  0 2.0T 454M  615   0   0   0  760   11   0   0    2 1.2K  180  0  0 100
0  0  0 2.0T 454M  612   0   0   0  754   10   0   0    4 1.2K  197  0  0 100
0  0  0 2.0T 454M  611   0   0   0  756   11   0   0    2 1.2K  181  0  0 100
0  0  0 2.0T 454M  613   0   0   0  757   10   0   0    2 1.2K  176  0  0 100
0  0  0 2.0T 454M  613   0   0   0  754   10   0   0    4 1.2K  193  0  0 100
0  0  0 2.0T 454M  610   0   0   0  755   11   0   0    2 1.2K  174  1  0 99
0  0  0 2.0T 454M  613   0   0   0  761   10   0   0    2 1.2K  173  0  0 100
0  0  0 2.0T 454M  612   0   0   0  753   10   0   0    4 1.2K  195  0  0 100
0  0  0 2.0T 454M  615   0   0   0  759   11   0   0    2 1.2K  191  1  0 99
0  0  0 2.0T 454M  621   0   0   0  765   10   0   0    2 1.3K  188  0  0 100
0  0  0 2.0T 454M  612   0   0   0  755   11   0   0    4 1.3K  188  0  0 100
1  0  0 2.0T 454M  615   0   0   0  760   10   0   0    2 1.2K  181  0  0 100


OPNsense 23.7 UFS:
procs    memory    page                      disks     faults       cpu
r  b  w  avm  fre  flt  re  pi  po   fr   sr da0 cd0   in   sy   cs us sy id
0  0  0 2.0T 1.4G 4.4K   0  21   1 4.6K   58   0   0   53 8.1K  582  2  1 97
0  0  0 2.0T 1.4G  633   0   0   0  774   66   0   0    3 1.1K  169  0  0 100
2  0  0 2.0T 1.4G  27K   0   5   0 9.6K   66  15   0   16  15K 1.0K 22  4 73
1  0  0 2.0T 1.4G  703   0   0   0 9.2K   60   1   0   16 1.2K  210  0  1 99
0  0  0 2.0T 1.4G  638   0   0   0  775   66   0   0    2 1.3K  170  1  0 99
1  0  0 2.0T 1.4G  631   0   0   0  777   60   0   0    3 1.2K  169  0  0 100
0  0  0 2.0T 1.4G  633   0   0   0  778   66   0   0    3 1.1K  168  0  0 100
0  0  0 2.0T 1.4G  634   0   0   0  774   60   0   0    2 1.1K  164  1  0 99
0  0  0 2.0T 1.4G  631   0   0   0  777   66   0   0   13 1.2K  202  0  0 100
0  0  0 2.0T 1.4G  630   0   0   0  775   60   0   0    2 1.1K  164  0  0 99
0  0  0 2.0T 1.4G  632   0   0   0  771   66   0   0    2 1.1K  164  0  0 100
0  0  0 2.0T 1.4G  632   0   0   0  776   60   0   0    4 1.2K  183  0  1 99
0  0  0 2.0T 1.4G  629   0   0   0  775   60   0   0    2 1.1K  164  0  0 100
0  0  0 2.0T 1.4G  628   0   0   0  770   66   0   0   11 1.1K  185  0  1 99
0  0  0 2.0T 1.4G  631   0   0   0  773   60   0   0    4 1.2K  182  0  0 100
0  0  0 2.0T 1.4G  627   0   0   0  771   60   0   0    2 1.1K  167  0  0 100
0  0  0 2.0T 1.4G  634   0   0   0  779   66   0   0    2 1.1K  161  0  0 100
1  0  0 2.0T 1.4G  632   0   0   0  775   60   3   0   15 1.2K  200  0  0 100
1  0  0 2.0T 1.4G  633   0   0   0  771   66   0   0    2 1.1K  185  0  0 100
1  0  0 2.0T 1.4G  629   0   0   0  783   66  82   0   85 1.2K  506  0  0 99
1  0  0 2.0T 1.4G  627   0   0   0  767   60   9   0   13 1.1K  210  0  0 100


pfSense 2.7 ZFS:
procs    memory    page                      disks     faults       cpu
r  b  w  avm  fre  flt  re  pi  po   fr   sr da0 cd0   in   sy   cs us sy id
0  0  0 514G 1.5G  611   0   0   1  663    6   0   0   10  476  178  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    6  313  155  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    3  291  149  0  0 100
0  0  0 514G 1.5G    1   0   0   0    0    5   0   0    3  326  144  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    6   0   0    5  294  158  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    3  322  154  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    3  306  144  0  0 100
0  0  0 514G 1.5G    1   0   0   0    0    5   0   0    5  316  160  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    3  273  141  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    3  320  149  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    5  403  156  0  0 100
0  0  0 514G 1.5G    6   0   0   0    0    6   0   0    2  316  147  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    3  295  142  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    5  309  155  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    3  326  158  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    3  299  146  0  0 100
0  0  0 514G 1.5G    1   0   0   0    0    6   0   0    5  319  164  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    3  329  155  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    4  306  157  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    5   0   0    5  310  155  0  0 100
0  0  0 514G 1.5G    0   0   0   0    0    6   0   0    3  307  153  0  0 100

Thanks @opnfwb for your great help trying to troubleshoot my issues.

Looks like with 2GB RAM you have essentially no page faults. I have 12GB RAM. Maybe OPNsense has issues managing memory, so I may try reducing the RAM. I have already tried removing the 10Gig card just in case of any driver issues (I did not install any driver myself; whatever is part of the OPNsense OS detects these cards correctly). There must be some process that is firing up the CPU, probably for a split second every so often, which vmstat or top are not able to catch.

Just did a factory reset again and no relief.

I do think you've stumbled onto something interesting here. From my observations, there is definitely a higher page fault occurrence in OPNsense compared to an identically configured pfSense 2.7 VM (same hardware, same resources, same tunables, same packages installed).

However, what I'm not sure about is whether the anomalies you've found are directly contributing to the problem that you're seeing.

If I run iperf tests on any of these firewall VMs I get virtually identical throughput with all of them, OPNsense and pfSense. The entire time I'm running the test I see a small spike in latency on the firewall VM that is pushing the traffic, usually 2-4ms. I don't get any dropped packets and once the iperf test stops, everything returns to normal.

So it would seem that even though the OPNsense VMs do all exhibit substantially more page faults than the pfSense VM, it doesn't appear to be impacting overall throughput in my testing. And none of them seem to have an issue with dropping pings, even under high load. I'm running iperf through each of the firewalls: I use a traffic generator on the WAN side and on the LAN side to make the firewall route the traffic through both of its interfaces. Obviously with my VMs these are all virtual interfaces (VMXNET3), so it's still possible there's a hardware issue with one of the cards you are using, but you've said you are seeing the ping spikes/packet loss on multiple different systems with varied hardware.

Yes, the page faults definitely are higher in OPNsense, and I was thinking this could have something to do with the FreeBSD 13 base used in OPNsense.

My test setup is all physical. I have the firewall under test attached to my main firewall's LAN (the main firewall is pfSense running 2.6), so the WAN segment of the test firewall is my local LAN. This lets me put an iPerf server on my LAN (I prefer iperf2 over iperf3, since I can run a large number of parallel streams and utilize multiple CPU cores) and then run iperf2 clients on the LAN side of the firewall under test.

I am able to get very high throughput repeatedly (I set -t 600 to 1800 and -P 100 or more), flooding the firewall under test, and then I also launch internet-bound traffic through a desktop PC playing a 4K video at max resolution (I have 100/10 Meg pair-bonded DSL on two phone lines, so I need to test with a local iPerf server). I don't see any hiccups in the video playback while the continuous iPerf testing is running (either uploads from the client, or downloads from the server with the reverse option). But I do see significant ping packet loss when running iPerf from the LAN to the WAN side of the firewall under test. Ping is stateless and thus sensitive to congestion and the resulting loss, while YouTube video uses TCP, which tolerates some loss without me noticing any issues. I have yet to test some real-time traffic like a WhatsApp / FaceTime audio or video call through this firewall to see whether the ping losses manifest as actual call drops or pixelation.
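
For reference, the iperf2 commands I'm describing look roughly like this (the server address is just an example for the WAN-side host):

# on the WAN-side host, start the iperf2 server
iperf -s

# on a LAN-side client behind the firewall under test
iperf -c 192.168.1.10 -P 100 -t 1800 -i 1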

Including @franco for his information.

I might be close to resolving this. Here is what I have done so far to get the firewall working acceptably. Before this, I had peaks reaching even 75%, and for sure when that hit, a parallel continuous ping from the test laptop (to 1.1.1.1) would lose a couple of packets.

top -SHz 20

I pressed 's' to set the delay to 1 second and then Shift+S to toggle system/kernel processes, showing a total of 20 processes using CPU.
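
If it helps anyone reproduce this, roughly the same view can be captured non-interactively (assuming I'm reading the top(1) flags correctly):

# batch mode, include system/kernel threads, skip the idle process,
# 1-second delay, two displays, top 20 entries
top -b -SHz -s 1 -d 2 20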

Here I could see the unbound process hogging some cycles. I was not even sure what unbound is, but this DNS resolver service is enabled by default. I unchecked it under the Services section and then manually specified 8.8.8.8 and 1.1.1.1 as the DNS servers to use for the system and for the DHCP scope.
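
To double-check that the daemon really stopped after unchecking it in the GUI, something like this from a shell should return nothing:

# list any running unbound processes by full command line
pgrep -lf unbound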

After watching for ten minutes (no traffic at this time), I could see the CPU peaks no longer exceeding 12%, and generally staying at 5 to 7%.

Then I logged out of the GUI to kill the php-cgi processes chewing up CPU. Once in a while DHCPv6 was also showing up. Again, this was enabled by default; I unchecked it and made sure to disable IPv6 under Interfaces.

I launched a 4K / 2160p60 video (Flying over Norway, to generate some traffic), maintained a continuous ping to 1.1.1.1, and in addition started an iPerf client to an iPerf2 server on the WAN segment. I used -P 50, -i 1, and -t 600 to keep the firewall somewhat busy. Then I fired up the GUI again, and the CPU peaks were now under 20%. Clearly the GUI adds its own demand to paint the CPU graph etc., but watching the top output over SSH, the box generally remains below 10% utilized. I still see ping loss, which could be for various reasons; in particular, the firewall might be treating ICMP as lowest priority when it is flooded with traffic. But clearly, with about 900Mbps average being downloaded or uploaded via iPerf through the firewall, the situation seems overall better with the changes I made.

Over the next few days, when I get a chance, I will do some more stress testing with two 10Gig machines acting as iPerf client and server. For now, I have also stressed all CPU cores by issuing the following (4 times, once per core on my 4-core box); this takes all cores to almost 99%, and my pings were still going through and my iPerf testing kept running.

yes > /dev/null &

and then when done

killall yes

Reference: https://forum.netgate.com/topic/171454/stress-ng-install/4
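
As a side note, instead of typing the command four times, a small sh loop can spawn one busy loop per core (assuming 4 cores); killall cleans them all up the same way:

# spawn one CPU burner per core (adjust the list to match your core count)
for i in 1 2 3 4; do yes > /dev/null & done

# stop them all when finished
killall yes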


The Unbound spikes you're seeing are likely due to OPNsense's Unbound Reporting feature. It's a very powerful and useful feature, but it does some background stats collection every 30 seconds or so, and during this time there's a small CPU spike while it processes the stats.

You can turn it off under Reporting/Settings by unchecking the "Unbound DNS Reporting" option, to see if this stops the Unbound CPU usage that you're noticing.

Thanks @opnfwb. I will test and report back later today.

Hi @opnfwb. I tested again. The unbound reporting was already unchecked.

Definitely most of the spikes that still show up (many are gone, especially the tall ones) seem to be from the GUI (php-cgi), although watching iostat or top etc. does not show any process consuming anywhere close to the CPU surge. And the spikes are largely gone if I log out and close the GUI tab.

Then I looked at the output of "top -m io" and that showed two interesting consumers: syslog-ng toggles back and forth between 0 and 100% IO, and python3.9 does the same. Since I don't log anything or send anything out to a logging server, I disabled the service by editing /etc/rc.conf with syslog_ng_enable="NO".
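
For anyone wanting to do the same, an equivalent one-liner (using the same rc variable name) should be:

# persist the setting in /etc/rc.conf without editing the file by hand
sysrc syslog_ng_enable="NO"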

This of course removed the syslog-related IO, but I don't understand why python3.9 would switch back and forth between 100% and 0% every second. I would like to remove that bottleneck as well and would appreciate any tips on resolving it. Thanks so much.

A further look showed that the 100% IO for Python was coming from the Captive Portal (I set up a page with simple terms and conditions, no accounting, no authentication, etc.). When I shut down the captive portal service, that issue went away.

However, I need the captive portal. The script is cp-background-process.py. There must be something in this script that keeps Python generating so much IO.
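
In case it helps narrow this down, this is roughly how I'd confirm which script the busy python3.9 process is running and what it has open (the PID below is just an example):

# sort the per-process IO view by total operations to spot the busy PID
top -m io -o total

# show the full command line and the open file descriptors for that PID
ps -o pid,command -p 12345
procstat -f 12345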

Thanks

Look at the information provided here.
https://bsd44.blogspot.com/2004/12/vmstat.html

Looks like "faults" here just means traps and interrupts, so a high number indicates a busy system.

Faults:
The faults section shows system faults. Faults, in this case, aren't bad, they're just received system traps and interrupts.

in - Shows the number of system interrupts (IRQ requests) the system received in the last five seconds.
sy - Shows the number of system calls in the last five seconds.
cs - Gives the number of context switches, or times the CPU changed from doing one thing to doing another.
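
For reference, the samples posted earlier in this thread can be reproduced with plain vmstat, e.g.:

# one sample per second, 20 samples
vmstat 1 20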