OPNsense Forum
English Forums => Hardware and Performance => Topic started by: dejhost on April 29, 2021, 06:49:16 pm
-
Hi all!
I am worried about the CPU load in OPNsense, which averages about 60%, even when there is little traffic going on.
"top" tells me that two sqlite3 processes use about 50% of my CPU. It shouldn't be like that, should it?
I would be very grateful if someone could help me troubleshoot this. Thanks in advance.
dejhost
---
My Setup:
- OPNsense v. 21.1.5-amd64, FreeBSD 12.1-RELEASE-p16-HBSD, OpenSSL 1.1.1k 25 Mar 2021.
- OPNsense is a VM within Proxmox, PVE Manager Version pve-manager/6.4-4/337d6701.
- Hardware: Lenovo Thinkserver (https://support.lenovo.com/cy/en/solutions/pd014761), with CPU(s) 8 x Intel(R) Xeon(R) CPU E3-1240 V2 @ 3.40GHz.
- The OPNsense VM has 2 GB of RAM and 4 cores of type SandyBridge. The last 4 CPU flags in Proxmox are disabled - please see the image attached.
-
Adding a screenshot of system resources, including CPU-load...
-
You can run
top -aSCHIP
and it will show the COMMAND output of the process hogging CPU. That should give us more hints as to what started these processes and maybe would point to a particular package or setting.
-
Please see the screenshot below.
Of course, now that the doctor is here, the pain vanishes: CPU is down to 25%. I am not aware of any changes in my network that could cause this. I will keep an eye on it and run the command again when the CPU load rises.
-
If you re-run 'top -aSCHIP', you can hit 'o' while top is running. This gives you the option to sort by a category; type in 'time' and hit enter. This shows current CPU usage and also the processes that have accumulated the most CPU time while the firewall has been running. I suspect we will see some sql-related components in there.
If you can, try to expand your terminal window to see more of the COMMAND section. This can be helpful to determine which component was associated with the high sql usage. The interrupt and system usage also seem quite high in that screenshot. Is this system pushing a lot of traffic?
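If hitting keys interactively is awkward (e.g. over a flaky SSH session), FreeBSD's top can also sort non-interactively via the -o flag. A small sketch of the invocation used in this thread, with the flag meanings noted as comments (field name 'time' per the top(1) manual):

```shell
# -a: show full command lines     -S: include system processes
# -C: raw (non-weighted) CPU %    -H: show individual threads
# -I: hide idle processes         -P: per-CPU usage summary
# -o time: sort by cumulative CPU time instead of current usage
top -aSCHIP -o time

# Batch mode is handy for capturing output to paste into the forum:
top -aSCHIP -o time -b > /tmp/top-by-time.txt
```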
-
Find 2 screenshots below. The first one is with netflow disabled, the second has netflow enabled.
Is this system pushing a lot of traffic?
I uploaded 100 GB to Nextcloud yesterday (using Nextcloud's web interface), from a PC outside the LAN. Other than that, it was mostly just browsing or streaming some music. I am inexperienced when it comes to networking, but I don't believe this should cause a higher CPU load. It's a home network with a very limited number of users.
-
Second image...
-
I think there are some performance issues with the VM hardware (either the wrong virtual hardware or a bottleneck on the physical hardware). Unfortunately I don't use Proxmox so I can't provide much direct help.
Seeing high IRQ usage on both the NIC and the ATA storage is odd, especially on the storage. What kind of chipset is specified for the storage (IDE, SATA, VirtIO passthrough)? The em1 NIC (E1000 chipset?) also seems to be causing a lot of interrupts for a system that is lightly used.
What is the underlying physical storage that this VM sits on? If that is queuing it could cause high wait times while the VM tries to commit writes.
You can also run 'vmstat -i' to get an output of the interrupts used on the firewall. I suspect running this would show high usage on the EM NICs, which may not be abnormal. But it would be interesting to see if any ATA/IDE interrupts are in there too.
Sorry I don't have more specifics for you but this seems like a VM tuning/hardware issue at this point if the system isn't pushing much traffic.
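To check whether the VM is actually waiting on the backing storage as described above, FreeBSD's gstat shows per-device busy percentages and queue lengths in real time. A sketch (run inside the OPNsense VM; device names such as ada0 will differ per system):

```shell
# -p: only show physical providers (whole disks), not partitions.
# A %busy near 100, or a persistently non-zero L(q) queue column,
# suggests the guest is waiting on the hypervisor's storage layer.
gstat -p

# One-shot batch sample (2-second interval) for pasting into a post:
gstat -pb -I 2s
```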
-
subscribe...
These tips and suggestions should be part of the official docs page, under the troubleshooting section, because what is there currently is less than practical, no matter who is reading it.
-
I really appreciate that you are spending so much time on this.
See the screenshot attached.
em1 runs over a Cat5e cable with an RJ45 connector (1000 Mbit) to a router from my ISP, which switches over to fiber there:
Status up
DHCP
up
MAC address
MTU 1500
IPv4 address
IPv4 gateway
IPv6 link-local
DNS servers 109.247.114.4
92.220.228.70
Media 1000baseT <full-duplex>
In/out packets 156306529 / 156761362 (206.97 GB / 27.24 GB)
In/out packets (pass) 156101589 / 156761362 (206.96 GB / 27.24 GB)
In/out packets (block) 3403990 / 0 (200 KB / 0 bytes)
In/out errors 0 / 0
Collisions 0
Interrupts
irq device total rate
irq11 em1:irq0++ 147233016 877
root@OPNsense:~ # vmstat -i
interrupt total rate
irq1: atkbd0 199 0
irq10: em0:irq0 126321288 743
irq11: em1:irq0++ 147666702 869
irq14: ata0 72743347 428
cpu0:timer 9987002 59
cpu1:timer 7230780 43
cpu2:timer 10365342 61
cpu3:timer 6961074 41
Total 381275734 2243
The underlying physical storage... I'm unsure what you want to know... it's a local SSD on a SCSI controller.
-
Thanks for the additional screenshot. I would try to change the chipset to something newer. The i440FX is a very old chipset (mid-1990s). This is likely why we're seeing such limited IRQ sharing (each EM NIC sits on a single legacy IRQ and the ATA storage is on IRQ 14). Ideally we would get multiple IRQs per NIC so the load could spread across all of the processor threads. A single IRQ per NIC is less than ideal here and will likely cause a bottleneck on a single core.
I would also change the SCSI controller to something more traditionally supported. VirtIO isn't always well supported on BSD, so if you can choose SATA AHCI or something similar, I would try that and see how it goes.
In regards to the underlying storage, yes I was referring to what kind of physical drive it is using. Ideally it would be an SSD or at least a RAID array to avoid any IO queuing. If it's just OPNsense only, IO probably isn't an issue. But if this is a share host with multiple VMs on the same datastore, one VM could be monopolizing the storage component and causing the others to wait (queue) their writes.
Also a side note, you may want to scrub your IP from the posted info. Usually not a big deal but I want to mention just in case.
EDIT: On second thought, maybe leave the VirtIO storage in place and just pick a newer chipset. I think the issue here is the old hardware emulation combined with the new VirtIO hardware. Perhaps also change the VM NICs to VirtIO and retest? I know in the VMware world, the E1000 NICs are the last resort due to their poor performance relative to VMware's paravirtual equivalent (VMXNET3).
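On Proxmox, the chipset and NIC changes suggested above can be made in the GUI or with the qm CLI on the host. A minimal sketch (VM ID 100 is a placeholder, and the vmbr0 bridge name is an assumption; keep whatever bridge your net0 currently uses):

```shell
# Shut the guest down cleanly before changing the machine type
qm shutdown 100

# Switch the emulated chipset from the default i440fx to q35
# (q35 provides native PCIe emulation)
qm set 100 --machine q35

# Replace the emulated E1000 NIC with a paravirtual VirtIO NIC
# (bridge name vmbr0 is an assumption - match your existing config)
qm set 100 --net0 virtio,bridge=vmbr0

qm start 100
```

Note that changing the NIC model changes the interface name inside OPNsense (e.g. em0 becomes vtnet0), so interface assignments may need to be redone from the console afterwards.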
-
Hi.
The only alternative I have to i440fx is q35, which is even older, so I'll stick with i440fx for now.
I changed the network adapter to VMware vmxnet3 and will monitor whether this helps. If the network is more reliable now, we'll know for sure what caused it. I will report back to you.
Thanks again!
-
The Q35 is several generations newer (and has native PCIe support). I would highly recommend using that chipset emulation if it is available.
-
Right, will do! Actually, for my TrueNAS-VM as well :-)
I totally agree with Ricardo: we should document your input somehow so it's easier to find. I'd be happy to contribute if you instruct me.
-
I am resurrecting this performance thread because I am hitting this issue. Reading through the thread, I don't think there was a resolution. Not sure I will get one, but I thought I would ask. I have an ASUS X99 motherboard with 24 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz. I have an Intel i350 quad-port card and an IBM M1015 SAS adapter flashed to IT mode. I have a 6-disk pool configured as ZFS RAID 10. The system runs well and gets good performance.
I noticed the same issue as the original poster: OPNsense runs as a VM under Proxmox, and when I enable Netflow, sqlite3 pegs the CPU. I have tried different CPU and disk controller settings and nothing works. I have another, almost identical setup: a SuperMicro motherboard with 8 x Intel(R) Xeon(R) CPU E3-1286L v3 @ 3.20GHz (1 socket) and the same RAID controller, but with an Intel X550 NIC. That system has no issues with the CPU spiking. I added the top outputs from the good and bad systems. If I turn off Netflow, there are no more spikes on the bad system. Any ideas?
Here is the bad system:
last pid: 54806; load averages: 1.22, 1.24, 1.05 up 0+00:26:40 10:32:48
56 processes: 2 running, 54 sleeping
CPU: 12.1% user, 0.0% nice, 51.6% system, 1.0% interrupt, 35.4% idle
Mem: 285M Active, 2953M Inact, 5160K Laundry, 526M Wired, 304M Buf, 123M Free
Swap: 8192M Total, 8192M Free
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
56332 root 1 102 0 18M 6848K CPU1 1 24:44 98.09% sqlite3
290 root 1 52 0 80M 36M accept 1 0:09 1.23% python3.
57397 root 1 52 0 66M 31M accept 1 0:01 1.09% php-cgi
27172 root 1 20 0 14M 3592K CPU0 0 0:05 0.53% top
46649 root 1 20 0 60M 29M accept 1 0:01 0.44% php-cgi
18196 root 1 20 0 60M 27M accept 0 0:00 0.37% php-cgi
63469 root 1 20 0 79M 39M nanslp 1 0:05 0.29% php
20087 root 4 20 0 43M 11M kqread 1 0:01 0.11% syslog-n
92109 root 1 20 0 23M 12M select 0 0:01 0.05% python3.
10100 root 1 20 0 22M 8164K kqread 0 0:00 0.05% lighttpd
21925 root 1 20 0 25M 15M select 1 0:01 0.02% python3.
89645 root 4 52 0 13M 2528K uwait 1 0:00 0.01% dpinger
99645 root 4 52 0 13M 2528K uwait 1 0:00 0.01% dpinger
91164 root 1 20 0 23M 12M select 0 0:00 0.01% python3.
16606 unbound 2 20 0 64M 32M kqread 0 0:01 0.01% unbound
24880 dhcpd 1 20 0 25M 8160K select 0 0:00 0.01% dhcpd
38410 root 1 22 0 13M 2488K wait 1 0:00 0.01% sh
last pid: 10041; load averages: 1.24, 1.23, 1.08 up 0+00:28:42 10:34:50
199 threads: 6 running, 166 sleeping, 27 waiting
CPU 0: 11.3% user, 0.0% nice, 39.7% system, 0.8% interrupt, 48.2% idle
CPU 1: 10.9% user, 0.0% nice, 54.5% system, 1.6% interrupt, 33.1% idle
Mem: 274M Active, 2974M Inact, 5132K Laundry, 525M Wired, 303M Buf, 116M Free
Swap: 8192M Total, 8192M Free
PID USERNAME PRI NICE SIZE RES STATE C TIME CPU COMMAND
56332 root 103 0 18M 6848K RUN 0 26:42 97.69% /usr/local/b
11 root 155 ki31 0B 32K RUN 0 13:11 47.86% [idle{idle:
11 root 155 ki31 0B 32K RUN 1 13:02 33.67% [idle{idle:
0 root -16 - 0B 528K swapin 0 0:16 0.00% [kernel{swap
12 root -60 - 0B 432K WAIT 1 0:14 0.81% [intr{swi4:
12 root -76 - 0B 432K WAIT 0 0:08 0.74% [intr{swi0:
4 root -16 - 0B 32K RUN 0 0:06 0.17% [cam{doneq0}
63469 root 20 0 79M 37M nanslp 0 0:05 0.28% /usr/local/b
12 root -88 - 0B 432K WAIT 1 0:02 0.17% [intr{irq32:
12 root -88 - 0B 432K WAIT 0 0:02 0.11% [intr{irq31:
46649 root 20 0 60M 29M accept 0 0:02 0.43% /usr/local/b
290 root 52 0 80M 36M accept 0 0:01 0.13% /usr/local/b
7 root -16 - 0B 48K psleep 1 0:01 0.02% [pagedaemon{
5 root -16 - 0B 16K pftm 0 0:01 0.03% [pf purge]
0 root -76 - 0B 528K - 1 0:01 0.01% [kernel{if_c
21925 root 20 0 25M 15M select 1 0:01 0.02% /usr/local/b
The good box:
last pid: 78271; load averages: 0.46, 0.37, 0.34 up 0+15:18:28 10:39:08
197 threads: 3 running, 164 sleeping, 30 waiting
CPU 0: 10.6% user, 0.0% nice, 5.1% system, 0.8% interrupt, 83.5% idle
CPU 1: 15.3% user, 0.0% nice, 5.5% system, 1.6% interrupt, 77.6% idle
Mem: 95M Active, 2357M Inact, 58M Laundry, 712M Wired, 391M Buf, 774M Free
Swap: 8192M Total, 4500K Used, 8187M Free
PID USERNAME PRI NICE SIZE RES STATE C TIME CPU COMMAND
11 root 155 ki31 0B 32K CPU0 0 880:05 84.73% [idle{idle: cpu0}]
11 root 155 ki31 0B 32K RUN 1 877:28 78.89% [idle{idle: cpu1}]
85899 root 39 0 54M 26M select 0 11:51 26.94% /usr/local/bin/python3 /usr/local/opnsense/scripts/netflow/flowd_aggregate.py (python3.9)
0 root -76 - 0B 560K - 0 7:36 3.34% [kernel{wg_tqg_0}]
12 root -60 - 0B 480K WAIT 0 6:27 0.68% [intr{swi4: clock (0)}]
12 root -92 - 0B 480K WAIT 1 5:33 1.85% [intr{irq30: virtio_pci4}]
0 root -76 - 0B 560K - 1 4:09 1.79% [kernel{wg_tqg_1}]
30271 root 20 0 79M 44M nanslp 1 2:31 0.28% /usr/local/bin/php /usr/local/opnsense/scripts/routes/gateway_watcher.php interface routes alarm
4 root -16 - 0B 32K - 0 0:50 0.53% [cam{doneq0}]
0 root -92 - 0B 560K - 0 0:37 0.05% [kernel{dummynet}]
17 root 16 - 0B 16K syncer 0 0:33 0.00% [syncer]
5 root -16 - 0B 16K pftm 1 0:24 0.08% [pf purge]
6 root -16 - 0B 16K - 1 0:13 0.05% [rand_harvestq]
7 root -16 - 0B 48K psleep 0 0:11 0.02% [pagedaemon{dom0}]
74591 unbound 20 0 1421M 943M kqread 0 0:11 0.01% /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound}
65902 root 20 0 23M 11M select 1 0:10 0.01% /usr/local/bin/python3 /usr/local/opnsense/scripts/syslog/lockout_handler (python3.9)
65605 root 20 0 23M 11M select 0 0:09 0.04% /usr/local/bin/python3 /usr/local/sbin/configctl -e -t 0.5 system event config_changed (python3.9)
12 root -88 - 0B 480K WAIT 1 0:07 0.19% [intr{irq28: virtio_pci1}]
39250 root 20 0 17M 2104K nanslp 1 0:04 0.01% /usr/local/bin/dpinger -f -S -r 0 -i WG_PIA_GW -B 10.6.207.112 -p /var/run/dpinger_WG_PIA_GW.pid -u /var/run/dpinger_WG_PIA_GW.sock -s 1s -l 4s -t 60s -d 1 10.6.128.1{dpinger}
55727 root 20 0 13M 2104K bpf 1 0:04 0.02% /usr/local/sbin/filterlog -i pflog0 -p /var/run/filterlog.pid
85129 root 20 0 25M 15M select 0 0:04 0.02% /usr/local/bin/python3 /usr/local/opnsense/scripts/dhcp/unbound_watcher.py --domain photis.net (python3.9)
6752 root 20 0 39M 14M kqread 1 0:04 0.01% /usr/local/sbin/syslog-ng -f /usr/local/etc/syslog-ng.conf -p /var/run/syslog-ng.pid{syslog-ng}
85402 root 20 0 22M 11M kqread 1 0:03 0.00% /usr/local/sbin/lighttpd -f /var/etc/lighty-webConfigurator.conf
9 root 20 - 0B 64K sdflus 0 0:03 0.06% [bufdaemon{/ worker}]
75708 root 23 0 13M 2164K wait 1 0:03 0.01% /bin/sh /var/db/rrd/updaterrd.sh
39250 root 20 0 17M 2104K accept 0 0:03 0.00% /usr/local/bin/dpinger -f -S -r 0 -i WG_PIA_GW -B 10.6.207.112 -p /var/run/dpinger_WG_PIA_GW.pid -u /var/run/dpinger_WG_PIA_GW.sock -s 1s -l 4s -t 60s -d 1 10.6.128.1{dpinger}
28286 dhcpd 20 0 25M 12M select 1 0:02 0.01% /usr/local/sbin/dhcpd -user dhcpd -group dhcpd -chroot /var/dhcpd -cf /etc/dhcpd.conf -pf /var/run/dhcpd.pid vtnet0_vlan100 vtnet0_vlan200
39250 root 20 0 17M 2104K sbwait 0 0:02 0.00% /usr/local/bin/dpinger -f -S -r 0 -i WG_PIA_GW -B 10.6.207.112 -p /var/run/dpinger_WG_PIA_GW.pid -u /var/run/dpinger_WG_PIA_GW.sock -s 1s -l 4s -t 60s -d 1 10.6.128.1{dpinger}
12 root -72 - 0B 480K WAIT 0 0:02 0.00% [intr{swi1: pfsync}]
86877 _flowd 20 0 12M 1812K select 1 0:01 0.00% flowd: net (flowd)
97592 root 20 0 13M 2208K nanslp 1 0:01 0.01% /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP -B 71.184.134.32 -p /var/run/dpinger_WAN_DHCP.pid -u /var/run/dpinger_WAN_DHCP.sock -s 1s -l 4s -t 60s -d 1 71.184.134.1{dpinger}
69260 root 20 0 23M 8288K select 1 0:01 0.01% /usr/local/sbin/ntpd -g -c /var/etc/ntpd.conf{ntpd}
9 root 21 - 0B 64K psleep 1 0:01 0.00% [bufdaemon{bufdaemon}]
97592 root 20 0 13M 2208K accept 0 0:01 0.00% /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP -B 71.184.134.32 -p /var/run/dpinger_WAN_DHCP.pid -u /var/run/dpinger_WAN_DHCP.sock -s 1s -l 4s -t 60s -d 1 71.184.134.1{dpinger}
9 root 20 - 0B 64K - 1 0:01 0.00% [bufdaemon{bufspacedaemon-0}]
9 root 20 - 0B 64K - 1 0:01 0.01% [bufdaemon{bufspacedaemon-1}]
16 root -16 - 0B 16K vlruwt 0 0:01 0.00% [vnlru]
7 root -16 - 0B 48K umarcl 0 0:01 0.00% [pagedaemon{uma}]
12 root -72 - 0B 480K WAIT 1 0:01 0.00% [intr{swi1: netisr 0}]
84024 nobody 20 0 12M 1556K sbwait 1 0:00 0.00% /usr/local/bin/samplicate -s 127.0.0.1 -p 2055 127.0.0.1/2056
97592 root 20 0 13M 2208K sbwait 0 0:00 0.00% /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP -B 71.184.134.32 -p /var/run/dpinger_WAN_DHCP.pid -u /var/run/dpinger_WAN_DHCP.sock -s 1s -l 4s -t 60s -d 1 71.184.134.1{dpinger}
80186 root -16 - 0B 32K sleep 1 0:00 0.00% [ng_queue{ng_queue0}]
0 root -92 - 0B 560K - 0 0:00 0.00% [kernel{vtnet0 txq 0}]
0 root -76 - 0B 560K - 1 0:00 0.00% [kernel{softirq_1}]
48406 root 20 0 19M 9664K select 1 0:00 0.01% sshd: root@pts/0 (sshd)
71735 root 20 0 14M 3480K CPU1 1 0:00 0.05% top -aSCHIP
6752 root 20 0 39M 14M kqread 0 0:00 0.06% /usr/local/sbin/syslog-ng -f /usr/local/etc/syslog-ng.conf -p /var/run/syslog-ng.pid{syslog-ng}
78271 root 23 0 12M 1700K nanslp 0 0:00 0.04% sleep 1
vmstat -i on the bad system:
root@crawford:~ # vmstat -i
interrupt total rate
irq1: atkbd0 2 0
irq4: uart0 40926 19
irq20: virtio_pci0 2070 1
cpu0:timer 351580 164
cpu1:timer 274393 128
irq25: igb0:rxq0 25481 12
irq26: igb0:rxq1 22513 10
irq27: igb0:aq 3 0
irq31: virtio_pci3 184943 86
irq32: ahci0 43443 20
irq34: virtio_pci1 1 0
irq36: virtio_pci2 1 0
Total 945356 440
vmstat -i on the good system:
root@crawford:~ # vmstat -i
interrupt total rate
irq1: atkbd0 2 0
irq4: uart0 2833 0
irq20: virtio_pci0 26438 0
cpu0:timer 5711126 103
cpu1:timer 1021849 18
irq28: virtio_pci1 384911 7
irq30: virtio_pci4 6396264 115
irq31: virtio_pci4 348 0
irq32: ahci0 632325 11
irq34: virtio_pci2 61 0
irq36: virtio_pci3 1 0
Total 14176158 256
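When sqlite3 pegs a core only while Netflow is enabled, it is usually the flowd aggregation job churning through large or damaged Netflow databases. A few diagnostic steps worth trying; the /var/netflow path is my assumption of where OPNsense keeps these databases, so verify it on your box:

```shell
# Inspect the Netflow aggregation databases; very large or
# constantly growing files here often correlate with sqlite3 spikes
ls -lh /var/netflow/

# See which files the busy sqlite3 process actually has open
# (replace 56332 with the PID shown in your top output)
fstat -p 56332
```

If the databases look bloated or corrupt, Netflow data can usually be reset from the GUI under Reporting -> Settings (the exact button label varies by version), after which the aggregation starts fresh.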
-
Is there a reason you set the KVM CPU model to "SandyBridge" and not "IvyBridge", "kvm64" or "host"?
Sandy Bridge would be the E3-1240 v1. E3-1240 v2 is Ivy Bridge.
https://ark.intel.com/content/www/us/en/ark/products/52273/intel-xeon-processor-e3-1240-8m-cache-3-30-ghz.html
https://ark.intel.com/content/www/us/en/ark/products/65730/intel-xeon-processor-e3-1240-v2-8m-cache-3-40-ghz.html
-
Both the bad and good systems are set to host; pictures attached.
-
Oh, I didn't see that this thread was that old; I thought I was talking to the OP.
But did you try turning off NUMA on the first machine?
A member of the Proxmox forum said this could actually cost performance on a single-socket board.
https://forum.proxmox.com/threads/numa-1-on-single-socket-systems.136935/
-
I've tried it off and on, no difference. I can turn it off.
-
Is "Intel Virtualization Technology" (VT-x) activated in the X99 BIOS?
-
Yes. The cpuinfo flags:
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts vnmi md_clear flush_l1d
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs