Hello OPNsense,
I have two DEC2752 units configured in HA that are being used for a new remote office network build. These two units were purchased less than a year ago and the configuration on them is quite basic: no IPv6 and no Unbound/dnsmasq configuration. I have an IPsec VPN connection from the VIP addresses to my primary site. DHCP/DNS (at this time) is handled at my primary DC.
So the OPNsense firewall cluster is just acting as a secure gateway with a site-to-site tunnel. Nothing fancy.
Last week, I tried to connect to both units. The master (10.103.0.1) responds fine. The secondary (10.103.0.2) is sluggish, and its web GUI fails to load properly.
The IPSec VPN tunnel is functional.
When I started investigating what was going on with the secondary unit, I saw a flood of errors on the CLI:
root@FW02:~ # swap_pager: out of swap space
swp_pager_getswapspace(10): failed
swp_pager_getswapspace(3): failed
swap_pager: out of swap space
swp_pager_getswapspace(4): failed
swap_pager: out of swap space
swp_pager_getswapspace(1): failed
swp_pager_getswapspace(4): failed
swp_pager_getswapspace(6): failed
swap_pager: out of swap space
swp_pager_getswapspace(1): failed
swap_pager: out of swap space
swp_pager_getswapspace(2): failed
swap_pager: out of swap space
swp_pager_getswapspace(22): failed
swp_pager_getswapspace(20): failed
If I reboot the secondary firewall, the OS loading process seems slow. Once I get in and press 8 for CLI, I don't have much time before it starts to bog down.
Running df -h, I get:
root@FW02:~ # df -h
Filesystem Size Used Avail Capacity Mounted on
zroot/ROOT/default 222G 10G 212G 5% /
devfs 1.0K 0B 1.0K 0% /dev
/dev/gpt/efifs 256M 645K 255M 0% /boot/efi
zroot/tmp 212G 224K 212G 0% /tmp
zroot 212G 96K 212G 0% /zroot
zroot/var/log 212G 164M 212G 0% /var/log
zroot/var/audit 212G 96K 212G 0% /var/audit
zroot/usr/home 212G 96K 212G 0% /usr/home
zroot/usr/ports 212G 96K 212G 0% /usr/ports
zroot/usr/src 212G 96K 212G 0% /usr/src
zroot/var/crash 212G 96K 212G 0% /var/crash
zroot/var/mail 212G 144K 212G 0% /var/mail
zroot/var/tmp 212G 96K 212G 0% /var/tmp
devfs 1.0K 0B 1.0K 0% /var/dhcpd/dev
When I look at top -o res, I see high swap usage:
root@FW02:~ # top -o res
last pid: 13953; load averages: 4.39, 2.72, 1.50 up 0+00:10:36 15:54:20
63 processes: 25 running, 38 sleeping
CPU: 53.4% user, 0.0% nice, 44.4% system, 2.2% interrupt, 0.0% idle
Mem: 5721M Active, 702M Inact, 195M Laundry, 754M Wired, 2056K Buf, 493M Free
ARC: 257M Total, 184M MFU, 66M MRU, 610K Anon, 1329K Header, 5222K Other
224M Compressed, 324M Uncompressed, 1.44:1 Ratio
Swap: 8418M Total, 6101M Used, 2317M Free, 72% Inuse, 314M In
swap_pager: out of swap space
swp_pager_getswapspace(10): failed
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
85082ager_getswapspace(48: faile 824M 555M RUN 2 0:04 32.18% php
7616pager: out of swap42paceile1069M 490M RUN 2 0:14 30.59% php-cgi
5271ager_getswapspace(48): faile488M 393M CPU0 0 0:02 24.03% php-cgi
28393 root 1 21 0 548M 392M select 2 0:15 0.01% php-cgi
5177 root 1 48 0 494M 391M RUN 1 0:02 31.91% php
9260 root 1 24 0 584M 386M select 2 0:06 0.00% php-cgi
91350 root 1 50 0 516M 368M RUN 0 0:03 6.38% php
2079 root 1 42 0 440M 338M RUN 2 0:03 43.99% php-cgi
6068 root 1 24 0 538M 291M RUN 3 0:02 9.70% php-cgi
2260 root 1 44 0 751M 270M RUN 3 0:09 9.23% php-cgi
63351 root 1 24 0 726M 266M RUN 3 0:03 7.69% php
28926 root 1 20 0 634M 236M select 3 0:16 0.00% php-cgi
8504 root 1 20 0 584M 236M select 2 0:06 0.00% php-cgi
1583 root 1 24 0 792M 224M CPU1 1 0:07 23.80% php-cgi
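To see how much resident memory each command holds in total, the rows of a top listing like the one above can be summed per command. This is a minimal sketch in Python, not an OPNsense tool: the helper names (to_bytes, res_by_command) are made up for illustration, the sample rows are copied from the ungarbled lines of the output above, and it assumes the 12-column layout PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND.

```python
import re
from collections import defaultdict

# A few rows copied from the top output above (only lines that are not
# overwritten by kernel messages).
TOP_ROWS = """\
28393 root 1 21 0 548M 392M select 2 0:15 0.01% php-cgi
5177 root 1 48 0 494M 391M RUN 1 0:02 31.91% php
9260 root 1 24 0 584M 386M select 2 0:06 0.00% php-cgi
91350 root 1 50 0 516M 368M RUN 0 0:03 6.38% php
"""

UNITS = {"B": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def to_bytes(field):
    """Convert a top size field such as '392M' or '4096B' to bytes."""
    match = re.fullmatch(r"(\d+)([BKMG])", field)
    if match is None:
        raise ValueError(f"unrecognised size field: {field!r}")
    return int(match.group(1)) * UNITS[match.group(2)]


def res_by_command(text):
    """Sum the RES column (7th field) per COMMAND (12th field)."""
    totals = defaultdict(int)
    for line in text.splitlines():
        cols = line.split()
        # Skip headers, summary lines and anything that is not a process row.
        if len(cols) < 12 or not cols[0].isdigit():
            continue
        totals[cols[11]] += to_bytes(cols[6])
    return dict(totals)


if __name__ == "__main__":
    for command, total in sorted(res_by_command(TOP_ROWS).items()):
        print(f"{command}: {total // 1024 ** 2}M resident")
```

Note that RES only counts pages currently in RAM. On a box that is already deep into swap, the SIZE column (6th field) may give a better picture of what the php and php-cgi processes are asking for.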
I have tried to clean up some logs that I had in /var/log and reboot but that didn't help.
These are the only packages I have installed:
root@FW02:~ # pkg info | grep os-
os-OPNBEcore-1.7_3 OPNsense Business Edition add-ons
os-OPNcentral-1.12_2 OPNsense central management
os-dmidecode-1.2 Display hardware information on the dashboard
os-etpro-telemetry-1.8 ET Pro Telemetry Edition
What I'm struggling to understand is why my primary unit would be working just fine while my secondary has this issue. I have been evaluating OPNsense for use at our remote site(s), but my configuration seems quite light.
I do enable logging on my firewall rules but I don't have many rules at all.
I have a total of 11 VHIDs. On my primary unit at this time, swap usage is 0.0%, memory used is 1047 MB (ARC 1103 MB), and disk utilization is 1%.
When checking my boot environments, I see that bectl list shows default as 10.4G.
root@FW02:~ # bectl list
BE Active Mountpoint Space Created
default NR / 10.4G 2025-04-17 09:23
When I compare that to my primary/active unit, it shows 1.29G.
My concern here is that this is some kind of hardware failure, but I'm not sure how to confirm or check that.
The web interface is so unresponsive that I can't even go in and create a fresh backup; the backups page won't load. I should have a recent backup, but I'm pointing this out to show how locked up the interface is.
I can't recall what was done in the past 1-3 weeks, but it wouldn't be much. These firewalls are waiting for me to rebuild a new IPsec VPN connection from my primary location, so to my knowledge I haven't made any recent configuration changes to them.
I've captured screenshots and I can get further logs from the POST sequence and OS bootup if it helps.
Thank you,
I've been able to get into SSH on the secondary unit and running top -aSH shows very high CPU usage on [idle{idle: cpu0}] to [idle{idle: cpu3}]
It will fluctuate between 60-90% on idle CPU.
339 threads: 6 running, 308 sleeping, 25 waiting
CPU: 1.0% user, 0.0% nice, 2.6% system, 0.5% interrupt, 95.9% idle
Mem: 5466M Active, 4188K Inact, 1599M Laundry, 753M Wired, 2056K Buf, 45M Free
ARC: 248M Total, 191M MFU, 41M MRU, 5741K Anon, 1550K Header, 9298K Other
199M Compressed, 350M Uncompressed, 1.76:1 Ratio
Swap: 8418M Total, 5057M Used, 3361M Free, 60% Inuse, 3160K In, 16M Out
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
11 root 187 ki31 0B 64K CPU0 0 212:59 98.46% [idle{idle: cpu0}]
11 root 187 ki31 0B 64K RUN 1 213:02 97.30% [idle{idle: cpu1}]
11 root 187 ki31 0B 64K RUN 2 213:18 97.11% [idle{idle: cpu2}]
11 root 187 ki31 0B 64K CPU3 3 213:02 96.48% [idle{idle: cpu3}]
2 root -60 - 0B 64K WAIT 2 1:59 2.19% [clock{clock (0)}]
16 root -16 - 0B 16K psleep 3 0:53 2.08% [vmdaemon]
9 root -16 - 0B 48K swbufa 0 1:11 1.93% [pagedaemon{laundry: dom0}]
26862 root 20 0 15M 2748K CPU2 2 0:00 0.66% top -aSH
9 root -16 - 0B 48K CPU1 1 3:31 0.53% [pagedaemon{dom0}]
12 root -64 - 0B 336K WAIT 1 0:10 0.40% [intr{irq61: nvme0:io1}]
64127 root 24 0 984M 249M swread 3 0:17 0.33% /usr/local/bin/php-cgi
65172 root 20 0 746M 81M swread 0 0:14 0.28% /usr/local/bin/php-cgi
95245 root 20 0 538M 115M swread 1 0:04 0.27% /usr/local/bin/php-cgi
69176 root 21 0 634M 168M swread 3 0:19 0.27% /usr/local/bin/php-cgi
12 root -64 - 0B 336K WAIT 3 0:10 0.25% [intr{irq63: nvme0:io3}]
Here is what I had when I ran top -o size
root@FW02:~ # top -o size
last pid: 9419; load averages: 4.04, 4.38, 4.72 up 0+04:14:12 12:56:00
76 processes: 1 running, 71 sleeping, 1 zombie, 3 waiting
CPU: 38.8% user, 0.0% nice, 6.3% system, 0.7% interrupt, 54.1% idle
Mem: 4471M Active, 1060K Inact, 2614M Laundry, 740M Wired, 2056K Buf, 40M Free
ARC: 248M Total, 193M MFU, 44M MRU, 45K Anon, 1576K Header, 9401K Other
205M Compressed, 364M Uncompressed, 1.78:1 Ratio
Swap: 8418M Total, 8407M Used, 11M Free, 99% Inuse, 15M In, 49M Out
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
64127 root 1 20 0 874M 67M swread 2 0:34 0.14% php-cgi
90300 root 1 20 0 824M 110M pfault 2 0:05 0.22% php
84811 root 1 20 0 824M 81M pfault 1 0:05 0.08% php
9889 root 1 20 0 824M 55M pfault 2 0:07 0.24% php
8548 root 1 20 0 824M 49M swread 3 0:07 0.27% php
10724 root 1 20 0 824M 25M swread 3 0:06 0.21% php
67711 root 1 20 0 824M 13M swread 0 0:07 0.40% php
17207 root 1 20 0 746M 76M select 0 0:34 0.00% php-cgi
65172 root 1 20 0 746M 69M swread 2 0:24 0.14% php-cgi
70166 root 1 23 0 746M 4096B lockf 1 0:33 0.00% <php-cgi>
70079 root 1 24 0 746M 4096B lockf 1 0:31 0.00% <php-cgi>
395 root 19 21 0 741M 4096B WAIT 2 0:51 0.00% <python3.11>
54338 root 1 20 0 726M 110M pfault 0 0:05 0.22% php
95464 root 1 20 0 726M 75M swread 0 0:04 0.14% php
6824 root 1 20 0 726M 64M swread 1 0:05 0.14% php
49842 root 1 20 0 726M 46M swread 0 0:04 0.19% php
67322 root 1 28 0 635M 296M lockf 0 0:37 2.03% php-cgi
68272 root 1 28 0 634M 236M lockf 1 0:43 0.00% php-cgi
69176 root 1 21 0 634M 219M select 3 0:33 0.00% php-cgi
68381 root 1 20 0 634M 84M select 3 0:31 0.01% php-cgi
64196 root 1 20 0 634M 129M pfault 2 0:27 0.21% php-cgi
66839 root 1 20 0 634M 32M select 1 0:34 0.00% php-cgi
18065 root 1 20 0 538M 380M select 1 0:34 0.00% php-cgi
69630 root 1 20 0 538M 5988K select 3 0:29 0.00% php-cgi
68121 root 1 21 0 538M 4096B lockf 3 0:29 0.00% <php-cgi>
59949 root 1 20 0 538M 356M select 3 0:25 0.00% php-cgi
2236 root 1 20 0 538M 317M select 3 0:23 0.00% php-cgi
8262 root 1 20 0 538M 198M select 3 0:18 0.00% php-cgi
71237 root 1 20 0 538M 169M select 1 0:27 0.00% php-cgi
71329 root 1 20 0 538M 26M select 1 0:26 0.00% php-cgi
95245 root 1 24 0 538M 4096B WAIT 1 0:19 0.00% <php-cgi>
4833 root 1 36 0 510M 407M pfault 3 0:02 16.10% php
98709 root 1 34 0 490M 396M pfault 0 0:02 17.35% php
4317 root 1 36 0 490M 395M pfault 1 0:02 18.56% php
7701 root 1 40 0 450M 369M pfault 3 0:02 33.57% php
8680 root 1 34 0 446M 356M pfault 0 0:01 12.93% php
8837 root 1 36 0 440M 359M pfault 3 0:01 16.98% php
7792 root 1 36 0 438M 357M pfault 1 0:01 19.34% php
7077 root 1 36 0 438M 356M pfault 1 0:01 18.72% php
52571 root 17 68 0 97M 3236K sigwai 3 0:00 0.00% charon
3712 root 1 29 0 70M 39M pfault 2 0:01 4.29% python3.11
16540 root 1 20 0 53M 4096B wait 2 0:00 0.00% <php-cgi>
62847 root 1 20 0 53M 4096B wait 0 0:00 0.00% <php-cgi>
62672 root 1 68 0 53M 4096B wait 2 0:00 0.00% <php-cgi>
63410 root 1 68 0 53M 4096B wait 1 0:00 0.00% <php-cgi>
31731 root 3 20 0 49M 2564K kqread 2 0:05 0.03% syslog-ng
17415 root 1 20 0 41M 3612K nanslp 3 3:28 0.01% python3.11
393 root 1 68 0 35M 4096B wait 2 0:00 0.00% <python3.11>
53436 root 1 20 0 28M 4408K select 1 0:01 0.01% python3.11
53217 root 1 20 0 27M 3788K select 3 0:00 0.02% python3.11
20171 root 4 68 0 26M 1360K uwait 2 0:02 0.02% dpinger
31675 root 1 68 0 24M 4096B wait 3 0:00 0.00% <syslog-ng>
57565 root 2 20 0 24M 3144K select 1 0:02 0.01% ntpd
61730 root 1 20 0 23M 3228K kqread 0 0:02 0.00% lighttpd
35568 root 1 20 0 20M 2064K select 3 0:00 0.02% sshd-session
451 root 1 20 0 20M 1072K select 3 0:00 0.00% sshd-session
86829 root 1 20 0 20M 2536K select 0 0:00 0.00% sshd
88249 root 1 20 0 17M 4096B pause 3 0:00 0.00% <csh>
56438 root 1 20 0 15M 2040K CPU3 3 0:00 0.10% top
757 root 1 20 0 15M 408K select 1 0:00 0.00% devd
65025 root 1 68 0 14M 1000K ttyin 2 0:00 0.00% sh
36343 root 1 20 0 14M 4096B wait 2 0:00 0.00% <sh>
64828 root 1 56 0 14M 4096B wait 0 0:00 0.00% <login>
197 root 1 20 0 14M 1416K piperd 3 0:00 0.11% cron
16433 root 1 20 0 14M 1300K bpf 3 0:00 0.03% filterlog
2237 root 1 20 0 14M 4096B wait 2 0:00 0.00% <flock>
72462 root 1 20 0 14M 1304K kqread 2 0:00 0.00% tail
74613 root 1 20 0 14M 1296K select 0 0:00 0.00% tail
15949 root 1 20 0 14M 4096B WAIT 3 0:00 0.00% <cron>
52141 root 1 68 0 13M 4096B kqread 0 0:00 0.00% <daemon>
88865 root 1 68 0 13M 4096B kqread 2 0:00 0.00% <daemon>
61055 root 1 20 0 13M 1072K select 3 0:01 0.01% powerd
40897 _flowd 1 20 0 13M 980K select 2 0:00 0.00% flowd
40882 root 1 68 0 13M 4096B sbwait 1 0:00 0.00% <flowd>
IDLE definition: 1. not working or being used
Regards
Joel.
EDIT: ROOT CAUSE IDENTIFIED.
As I was working and gathering details to submit a ticket with OPNsense support, I tried to export both backup configs to send to support.
FW#1 backup config was 103 KB
FW#2 backup config was 31,384 KB
I looked once more to find out why it was so large, so I opened both configs and noticed thousands of additional lines in the FW#2 backup, all of which showed:
<cert uuid="f9d19239-67c1-43c6-87c0-d69a73899149">
<refid>69e6546f96ee4</refid>
<descr>Web GUI TLS certificate</descr>
I had 35k lines total in my FW#2 config.
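For anyone who wants to check their own export for the same symptom, counting the cert entries by description makes the duplication obvious. The sketch below is only an illustration under a few assumptions: the function name and the sample XML are made up, the uuid and refid values in the sample are placeholders, and it assumes every certificate appears as a cert element with a descr child, as in the excerpt above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny stand-in for an exported config; the uuid/refid values are placeholders.
SAMPLE = """<opnsense>
  <cert uuid="00000000-0000-0000-0000-000000000001">
    <refid>aaa1</refid>
    <descr>Web GUI TLS certificate</descr>
  </cert>
  <cert uuid="00000000-0000-0000-0000-000000000002">
    <refid>aaa2</refid>
    <descr>Web GUI TLS certificate</descr>
  </cert>
  <cert uuid="00000000-0000-0000-0000-000000000003">
    <refid>aaa3</refid>
    <descr>Some other certificate</descr>
  </cert>
</opnsense>"""


def count_cert_descriptions(xml_text):
    """Return a Counter of <descr> values across all <cert> elements."""
    root = ET.fromstring(xml_text)
    return Counter(
        (cert.findtext("descr") or "").strip() for cert in root.iter("cert")
    )


if __name__ == "__main__":
    # For a real export, read the downloaded file instead of SAMPLE, e.g.
    # open("config-backup.xml").read() (the filename is an assumption).
    for descr, count in count_cert_descriptions(SAMPLE).most_common():
        print(f"{count:6d}  {descr}")
```

A description that shows up thousands of times, as "Web GUI TLS certificate" did here, is a strong hint that something is regenerating the certificate on a loop.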
So I started to explore why I was constantly getting a new self-signed GUI TLS certificate, which led me to look into the sync and cron job for HA.
I noticed that my FW#1 had a CRON job that was enabled for "HA UPDATE AND RECONFIGURE BACKUP" that was running with the following settings:
Min = *
Hour = *
Day of Month = *
Months = *
Days of week = *
I must have incorrectly removed my daily sync (usually at 4am) and left asterisks in every field instead, which makes the job run every minute.
I set Min = 0 and Hour = 4, saved, and spent the next 2 hours deleting 4800 Web GUI TLS certificates that were not active.
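The difference between the two schedules is easy to underestimate, so here is a small Python sketch that counts how often a job fires per day. It is deliberately limited: the helper names are made up, it only understands "*" or a single number in the minute and hour fields, and it assumes the day-of-month, month and day-of-week fields are all "*" as in the settings above.

```python
def expand(field, low, high):
    """Expand a cron field that is either '*' or a single number."""
    if field == "*":
        return set(range(low, high + 1))
    value = int(field)
    if not low <= value <= high:
        raise ValueError(f"{value} is outside {low}-{high}")
    return {value}


def runs_per_day(minute, hour):
    """Number of times per day a job with these minute/hour fields fires."""
    return len(expand(minute, 0, 59)) * len(expand(hour, 0, 23))


if __name__ == "__main__":
    print("Min=* Hour=* ->", runs_per_day("*", "*"), "runs per day")
    print("Min=0 Hour=4 ->", runs_per_day("0", "4"), "run per day")
```

With asterisks everywhere the job fires 1440 times a day instead of once. If each run added one certificate (an assumption, not something confirmed above), 4800 certificates would correspond to a bit over three days of every-minute syncs.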
Since doing so, my firewall cluster has been operating without issues and I'm back to normal operations.
I thought I'd share this here in case anybody else messes up their cron job in an HA sync and encounters similar issues.