We have quite a few OPNsense boxes deployed, mostly in primary/secondary (master/slave) HA pairs.
For the past year, on versions 24.1.9, 24.7.2 and undoubtedly others we've tested, we've had systematic random freezes on FW2, the slave. Only on the SLAVE/SECONDARY.
The frequency is around 60-70 days of uptime, sometimes much less.
What is a freeze?
- Ping still responds.
- SSH impossible: the prompt is offered but no connection completes. From LAN, from the HA link: SSH does not respond from anywhere.
- Web impossible: the login page loads, but no session is established.
- We continue to receive routing syslogs and DHCP lease logs. No logs from any other process.
- We have a console port on the FW: at login:/password:, the connection never completes.
- We have sometimes left a console open. Typing reboot + Enter does nothing; it doesn't reboot.
- Of course, all other processes stop working: IPsec, OpenVPN, etc.
- After a reboot, everything is OK. No particular log.
- Pre-freeze, monitoring shows nothing: no abnormal CPU, network, or disk activity. These are slave firewalls, so they do almost nothing.
All OS processes seem to be frozen.
How do we recover?
Electrically disconnect and reconnect the firewall.
What's our configuration?
- 1 WAN port with 2 VLANs (FTTO + FTTH)
- 1 LAN port with multiple local VLANs
- 1 HA port connected directly to the second unit
We use:
- CARP for HA
- Gateway groups
- IPsec client
- OpenVPN client
- DHCP server
- Unbound DNS, sometimes
- Captive portal, sometimes
In short, nothing extraordinary. Resources are well dimensioned: no RAM, disk or CPU problems.
What's our hardware?
- CPU: Intel(R) Atom(TM) E3845 @ 1.91 GHz
- NICs: 3x Intel WGI211AT (i211)
- Installed on mSATA and/or eMMC (sometimes mSATA, sometimes eMMC, sometimes a ZFS mirror across eMMC + mSATA)
What have we tried?
- Read every freeze report we could find for FreeBSD/pfSense/OPNsense
- Enabled the watchdog: same thing, it does nothing
- Disabled C-states: nothing
- Changed hardware: there is no hardware problem; we see this on more than 10 pairs of firewalls with different hardware/installs/BIOS
- ZFS vs UFS: same
- Different BIOS versions: same
- Disabled OPNsense optimisations
- Disabled the watchdog
What help do we need?
- Any ideas?
- How can we get more logs? How do we put OPNsense into HYPER-verbose mode?
- How can we force a crash/reboot in the event of a freeze?
- We don't have easy access to our FWs, so we'd like to be able to crash OPNsense via the console. I've tried this:
https://gist.github.com/xiangchu0/5eda63b3c5234ce4eb48ca9deb1d0090#how-to-panic-on-demand-when-system-is-freezed
But alas, it doesn't work. So we're forced to request an electrical reboot to recover the firewall. Any ideas for crashing OPNsense from the console port?
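A hedged sketch of what we'd try for panic-on-demand: FreeBSD can drop into the kernel debugger from a serial console if the kernel was built with `options KDB`/`options DDB` (true for GENERIC; whether the stock OPNsense kernel includes it is an assumption to verify):

```shell
# /boot/loader.conf -- sketch, assuming the kernel has KDB/DDB compiled in
debug.kdb.alt_break_to_debugger="1"  # serial sequence CR ~ Ctrl-B enters DDB
debug.debugger_on_panic="1"
```

During a freeze, typing Enter ~ Ctrl-B on the serial console should land in DDB, where `panic` forces a reboot (or `call doadump` then `reset`, with `dumpdev` configured in rc.conf, to also get a crash dump to analyze). If even the break sequence is ignored, the hang is below the kernel's interrupt handling, which would itself be a useful datapoint.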
Thanks in advance, and thanks to anyone who can come up with a brilliant idea for this problem, which has been incomprehensible for over a year. Given the time spent on it, Santa Claus won't forget whoever finds us a fix!
Hello,
So, a new freeze on FW2 of an HA cluster. Version: 24.7.8.
On this version the symptom is slightly different: the router is frozen, but I can reach the web interface, which then gives me nothing.
I can log in, but "Lobby: Dashboard" shows no information. No commands work.
e.g. if I trigger a reboot, it never goes through.
The server responds to ping, but an SSH connection doesn't respond.
On the console, same thing: login:, then I type the password and the session never opens.
The only solution is to disconnect it electrically.
That's more than 20 identical cases now, on different hardware and OPNsense versions.
I'm asking Father Christmas for an idea, for help!
Hello,
Just FYI, we continue to see freezes on multiple FWs.
I was able to bring some FWs back for testing. I ran I/O benchmarks, RAM tests (ubench -m + memtest86) and CPU tests (ubench -c)... in a loop, for a week.
No hardware problems were found.
We updated the BIOS; still no problems found.
I also heated the FWs up in very hot environments. Only 1 crashed, but that was a real crash, i.e. nothing responded at all, unlike our problem where ping and routing keep working during the freeze. That's a different issue, a thermal one, hence a hardware one.
With the Zabbix agent, we created a command key named "Panic" => "sysctl debug.kdb.panic=1". When a FW froze, we tried to execute the command via the Zabbix agent: it doesn't work. :(
We also tested a reboot from the API, but that didn't work either.
Via the API, the first attempt returns an OK status, but the reboot never happens. The second attempt takes longer, the status is still OK, but it still doesn't reboot.
When a FW is frozen, you can authenticate via the web interface, but executing a reboot does nothing, and a second web authentication no longer works.
Our only remedy is to restart the FW electrically.
Any idea how to identify the issue, or how to reboot/restart a FW once it has frozen?
Maybe a wild guess, but did you consider a possible brownout?
Are the primary and secondary FWs on the same electrical circuit?
Can you try powering FW2 from the socket used by FW1?
Regards,
S.
Hello,
Our intuition is that we have a problem with Intel SpeedShift/SpeedStep.
All FWs with minimal activity are experiencing the problem: the backup FW2s.
We are trying to disable ALL power-saving features in the BIOS.
At the OS level, power control is disabled:
root@fw_qua_sr1_f2:~ # sysctl -a | grep hwp
kern.hwpmc.softevents: 16
kern.features.hwpmc_hooks: 1
debug.hwpstate_pstate_limit: 0
debug.hwpstate_verify: 0
debug.hwpstate_verbose: 0
machdep.hwpstate_pkg_ctrl: 1
root@FW1:~ # sysctl -a | grep -i dev.cpu
dev.cpufreq.3.freq_driver: hwpstate_intel3
dev.cpufreq.3.%parent: cpu3
dev.cpufreq.3.%pnpinfo:
dev.cpufreq.3.%location:
dev.cpufreq.3.%driver: cpufreq
dev.cpufreq.3.%desc:
dev.cpufreq.2.freq_driver: hwpstate_intel2
dev.cpufreq.2.%parent: cpu2
dev.cpufreq.2.%pnpinfo:
dev.cpufreq.2.%location:
dev.cpufreq.2.%driver: cpufreq
dev.cpufreq.2.%desc:
dev.cpufreq.1.freq_driver: hwpstate_intel1
dev.cpufreq.1.%parent: cpu1
dev.cpufreq.1.%pnpinfo:
dev.cpufreq.1.%location:
dev.cpufreq.1.%driver: cpufreq
dev.cpufreq.1.%desc:
dev.cpufreq.0.freq_driver: hwpstate_intel0
dev.cpufreq.0.%parent: cpu0
dev.cpufreq.0.%pnpinfo:
dev.cpufreq.0.%location:
dev.cpufreq.0.%driver: cpufreq
dev.cpufreq.0.%desc:
dev.cpufreq.%parent:
dev.cpu.3.cx_method: C1/mwait/hwc C2/mwait/hwc C3/mwait/hwc
dev.cpu.3.cx_usage_counters: 31919645 0 0
dev.cpu.3.cx_usage: 100.00% 0.00% 0.00% last 28us
dev.cpu.3.cx_lowest: C1
dev.cpu.3.cx_supported: C1/1/1 C2/2/253 C3/3/1048
dev.cpu.3.freq_levels: 1996/-1
dev.cpu.3.freq: 2366
dev.cpu.3.%parent: acpi0
dev.cpu.3.%pnpinfo: _HID=none _UID=0 _CID=none
dev.cpu.3.%location: handle=\_SB_.PR03
dev.cpu.3.%driver: cpu
dev.cpu.3.%desc: ACPI CPU
dev.cpu.2.cx_method: C1/mwait/hwc C2/mwait/hwc C3/mwait/hwc
dev.cpu.2.cx_usage_counters: 37773627 0 0
dev.cpu.2.cx_usage: 100.00% 0.00% 0.00% last 633us
dev.cpu.2.cx_lowest: C1
dev.cpu.2.cx_supported: C1/1/1 C2/2/253 C3/3/1048
dev.cpu.2.freq_levels: 1996/-1
dev.cpu.2.freq: 2454
dev.cpu.2.%parent: acpi0
dev.cpu.2.%pnpinfo: _HID=none _UID=0 _CID=none
dev.cpu.2.%location: handle=\_SB_.PR02
dev.cpu.2.%driver: cpu
dev.cpu.2.%desc: ACPI CPU
dev.cpu.1.cx_method: C1/mwait/hwc C2/mwait/hwc C3/mwait/hwc
dev.cpu.1.cx_usage_counters: 132676551 0 0
dev.cpu.1.cx_usage: 100.00% 0.00% 0.00% last 63us
dev.cpu.1.cx_lowest: C1
dev.cpu.1.cx_supported: C1/1/1 C2/2/253 C3/3/1048
dev.cpu.1.freq_levels: 1996/-1
dev.cpu.1.freq: 2422
dev.cpu.1.%parent: acpi0
dev.cpu.1.%pnpinfo: _HID=none _UID=0 _CID=none
dev.cpu.1.%location: handle=\_SB_.PR01
dev.cpu.1.%driver: cpu
dev.cpu.1.%desc: ACPI CPU
dev.cpu.0.cx_method: C1/mwait/hwc C2/mwait/hwc C3/mwait/hwc
dev.cpu.0.cx_usage_counters: 223484647 0 0
dev.cpu.0.cx_usage: 100.00% 0.00% 0.00% last 93us
dev.cpu.0.cx_lowest: C1
dev.cpu.0.cx_supported: C1/1/1 C2/2/253 C3/3/1048
dev.cpu.0.freq_levels: 1996/-1
dev.cpu.0.freq: 2466
dev.cpu.0.%parent: acpi0
dev.cpu.0.%pnpinfo: _HID=none _UID=0 _CID=none
dev.cpu.0.%location: handle=\_SB_.PR00
dev.cpu.0.%driver: cpu
dev.cpu.0.%desc: ACPI CPU
dev.cpu.%parent:
(https://nextcloud.pci-conseil.net/index.php/s/ZPjmgMoCiW6brWk)
Any idea how to check that no energy-saving mechanism is being used?
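To complement the BIOS work, these are the OS-side tunables we'd check and pin; a sketch (values to set via System > Settings > Tunables in OPNsense; the cx_lowest line matches what the output above already shows):

```shell
# /boot/loader.conf
hint.p4tcc.0.disabled="1"          # disable thermal-control (TCC) throttling
hint.acpi_throttle.0.disabled="1"  # disable ACPI clock throttling
machdep.hwpstate_pkg_ctrl="0"      # per-core rather than package P-state control
# /etc/sysctl.conf
hw.acpi.cpu.cx_lowest=C1           # never enter C-states deeper than C1
```

After a reboot, `sysctl dev.cpu.0.cx_usage dev.cpu.0.freq` should show 100% C1 and a stable frequency, as the dump above already does.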
Thanks,
In response to Seimus:
We have quite a few FWs deployed at different sites, almost all of them on quality UPS units. We have already tried changing the power supplies, alas without success.
This problem occurs at several different sites, so this doesn't seem to be the explanation.
You are right, I wrongly assumed this was all the same location.
From the command output you can see the CPUs are not going into lower C-states; they stay at C1.
Anyway, to disable Intel Speed Shift:
Check if it's running:
https://docs.freebsd.org/en/books/handbook/config/#hwpstate_intel
Disable:
https://forums.freebsd.org/threads/cpu-speed-throttled-hwpstate_intel.91590/
Regards,
S.
Hello,
So we are trying to disable Intel EIST (Intel SpeedStep) in the BIOS, where it's enabled by default.
We are rebuilding a specific BIOS for our hardware and will try it shortly.
But is it possible to disable it at the OS/kernel level?
root@xxxxx:~ # dmesg | grep -i speedStep -A2
est0: <Enhanced SpeedStep Frequency Control> on cpu0
Timecounter "TSC" frequency 1916666258 Hz quality 1000
Timecounters tick every 1.000 msec
thanks
regards,
I will try:
echo 'hint.est.0.disabled="1"' >> /boot/loader.conf
and reboot.
Maybe that will be enough.
After the reboot:
root@*****:~ # dmesg | grep -i -A3 est0
est0: <Enhanced SpeedStep Frequency Control> on cpu0
Timecounter "TSC" frequency 1916666258 Hz quality 1000
Timecounters tick every 1.000 msec
It doesn't work. :(
Hello,
With the old BIOS:
root@*****-FW1:~ # sysctl -a|grep cpufreq
device cpufreq
debug.cpufreq.verbose: 0
debug.cpufreq.lowest: 0
dev.cpufreq.3.freq_driver: est3
dev.cpufreq.3.%parent: cpu3
dev.cpufreq.3.%pnpinfo:
dev.cpufreq.3.%location:
dev.cpufreq.3.%driver: cpufreq
dev.cpufreq.3.%desc:
dev.cpufreq.2.freq_driver: est2
dev.cpufreq.2.%parent: cpu2
dev.cpufreq.2.%pnpinfo:
dev.cpufreq.2.%location:
dev.cpufreq.2.%driver: cpufreq
dev.cpufreq.2.%desc:
dev.cpufreq.1.freq_driver: est1
dev.cpufreq.1.%parent: cpu1
dev.cpufreq.1.%pnpinfo:
dev.cpufreq.1.%location:
dev.cpufreq.1.%driver: cpufreq
dev.cpufreq.1.%desc:
dev.cpufreq.0.freq_driver: est0
dev.cpufreq.0.%parent: cpu0
dev.cpufreq.0.%pnpinfo:
dev.cpufreq.0.%location:
dev.cpufreq.0.%driver: cpufreq
dev.cpufreq.0.%desc:
dev.cpufreq.%parent:
root@****-FW1:~ # sysctl -a | grep 'est:'
kern.vm_guest: none
vfs.nfs.realign_test: 0
vfs.nfsd.request_space_used_highest: 0
net.inet.ip.broadcast_lowest: 0
debug.cpufreq.lowest: 0
hw.acpi.cpu.cx_lowest: C1
dev.cpu.3.cx_lowest: C1
dev.cpu.2.cx_lowest: C1
dev.cpu.1.cx_lowest: C1
dev.cpu.0.cx_lowest: C1
With our fresh new BIOS:
root@xxxxx-FW2:~ # sysctl -a|grep cpufreq
device cpufreq
debug.cpufreq.verbose: 0
debug.cpufreq.lowest:
root@xxxxxx:~ # sysctl -a | grep 'est:'
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 7e000000173f
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 7e000000173f
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 7e000000173f
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 7e000000173f
kern.vm_guest: none
vfs.nfs.realign_test: 0
vfs.nfsd.request_space_used_highest: 0
net.inet.ip.broadcast_lowest: 0
debug.cpufreq.lowest: 0
hw.acpi.cpu.cx_lowest: C1
dev.cpu.3.cx_lowest: C1
dev.cpu.2.cx_lowest: C1
dev.cpu.1.cx_lowest: C1
dev.cpu.0.cx_lowest: C1
Now we need to flash the FWs that froze and see what happens! We'd be happy not to get any more customer calls about this problem. :D
"... The frequency is around 60-70 days of uptime. "
Can you list all the processes that use CPU?
Is there any process that writes to a log file continuously and uninterrupted, keeping the CPU "busy" with no time to handle anything besides the ping service?
Could it be related to memory capacity? If 2 GB is not enough, try 4/8 GB and see if that helps.
Regards,
Richard
What are the processes?
A few:
root@xxx:~ # ps auxwww | grep -v '\['
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
root 62354 0.2 0.6 75836 46272 - S Mon11 37:30.26 /usr/local/bin/php /usr/local/opnsense/scripts/routes/gateway_watcher.php interface routes alarm
root 66338 0.1 0.1 19716 9164 - Ss Mon12 5:38.00 /usr/local/sbin/openvpn --config /var/etc/openvpn/client4.conf
root 1 0.0 0.0 11376 1024 - SLs Mon11 0:00.87 /sbin/init
root 285 0.0 0.2 27104 15384 - Is Mon11 0:00.42 /usr/local/bin/python3 /usr/local/opnsense/service/configd.py (python3.11)
root 287 0.0 0.5 91092 44544 - S Mon11 1:06.86 /usr/local/bin/python3 /usr/local/opnsense/service/configd.py console (python3.11)
root 425 0.0 0.0 13488 2864 - S Mon12 1:42.35 /bin/sh /var/db/rrd/updaterrd.sh
root 803 0.0 0.0 11552 1600 - Ss Mon11 0:00.47 /sbin/devd
root 855 0.0 0.2 26364 14776 - S 22:00 0:06.00 /usr/local/bin/python3 /usr/local/sbin/configctl -e -t 0.5 system event config_changed (python3.11)
root 860 0.0 0.2 27388 15080 - S 22:00 0:06.04 /usr/local/bin/python3 /usr/local/opnsense/scripts/syslog/lockout_handler (python3.11)
root 5940 0.0 0.2 12720 12824 - S<s Mon12 0:03.20 /usr/sbin/watchdogd
root 7943 0.0 0.4 61640 32536 - S 09:12 0:00.33 /usr/local/bin/php-cgi
root 11356 0.0 0.1 23456 9908 - I Mon11 0:00.00 /usr/local/sbin/syslog-ng -f /usr/local/etc/syslog-ng.conf -p /var/run/syslog-ng.pid
root 11585 0.0 0.4 69488 28824 - Ss Mon11 15:11.03 /usr/local/sbin/syslog-ng -f /usr/local/etc/syslog-ng.conf -p /var/run/syslog-ng.pid
zabbix 17287 0.0 0.1 23364 9360 - I Mon12 0:00.00 /usr/local/sbin/zabbix_agentd -c /usr/local/etc/zabbix_agentd.conf
root 29187 0.0 0.1 22804 11024 - S Mon11 1:18.17 /usr/local/sbin/lighttpd -f /var/etc/lighty-webConfigurator.conf
root 29396 0.0 0.3 57784 26296 - Is Mon11 0:00.04 /usr/local/bin/php-cgi
root 29795 0.0 0.3 57784 26624 - Is Mon11 0:00.06 /usr/local/bin/php-cgi
dhcpd 35032 0.0 0.1 25516 10572 - Ss Mon11 1:40.23 /usr/local/sbin/dhcpd -user dhcpd -group dhcpd -chroot /var/dhcpd -cf /etc/dhcpd.conf -pf /var/run/dhcpd.pid igb0 vlan01 vlan09 vlan04 vlan010 vlan07 vlan03
root 44106 0.0 0.4 61640 32744 - I 08:15 0:00.80 /usr/local/bin/php-cgi
root 47083 0.0 0.1 19860 9496 - S 09:07 0:00.18 sshd-session: root@pts/0 (sshd-session)
root 49563 0.0 0.1 14684 4404 - S Mon12 0:28.95 /usr/local/sbin/lighttpd -f /var/etc/lighttpd-api-dispatcher.conf
www 51810 0.0 0.1 22788 10140 - S Mon12 1:42.44 /usr/local/sbin/lighttpd -f /var/etc/lighttpd-cp-zone-0.conf
root 52094 0.0 0.2 30460 19160 - Ss Mon12 15:48.38 /usr/local/bin/python3 /usr/local/opnsense/scripts/OPNsense/CaptivePortal/cp-background-process.py start (python3.11)
root 58447 0.0 0.4 61640 32988 - S Fri13 0:01.48 /usr/local/bin/php-cgi
root 59872 0.0 0.0 13340 2652 - Is Mon11 0:56.44 /usr/local/bin/dpinger -f -S -r 0 -i GW_Freebox -B 192.168.222.252 -p /var/run/dpinger_GW_Freebox.pid -u /var/run/dpinger_GW_Freebox.sock -s 1s -l 4s -t 60s -d 1 1.1.1.1
root 60772 0.0 0.0 17568 2732 - Is Mon11 0:49.90 /usr/local/bin/dpinger -f -S -r 0 -i GW_Livebox_1 -B 192.168.10.252 -p /var/run/dpinger_GW_Livebox_1.pid -u /var/run/dpinger_GW_Livebox_1.sock -s 1s -l 4s -t 60s -d 1 9.9.9.9
root 61445 0.0 0.0 13340 2652 - Is Mon11 0:49.10 /usr/local/bin/dpinger -f -S -r 0 -i GW_Livebox_2 -B 192.168.11.252 -p /var/run/dpinger_GW_Livebox_2.pid -u /var/run/dpinger_GW_Livebox_2.sock -s 1s -l 4s -t 60s -d 1 8.8.8.8
root 67609 0.0 0.1 23112 8096 - Ss Mon12 0:35.71 /usr/local/sbin/ntpd -g -c /var/etc/ntpd.conf
root 70984 0.0 0.4 61640 32328 - I 09:12 0:00.14 /usr/local/bin/php-cgi
root 77465 0.0 0.1 19716 9316 - Ss Mon11 12:53.02 /usr/local/sbin/openvpn --config /var/etc/openvpn/instance-1e2b2a71-fc49-4d58-9cd1-bf82e41a0455.conf
root 78283 0.0 0.0 12816 2756 - Ss Mon12 0:03.41 /usr/sbin/cron -s
root 84238 0.0 0.4 61832 32872 - S 09:08 0:00.97 /usr/local/bin/php-cgi
root 88504 0.0 0.4 61512 31968 - S 09:08 0:00.07 /usr/local/bin/php-cgi
root 90045 0.0 0.0 13008 3060 - Ss Wed09 4:31.21 /usr/local/sbin/filterlog -i pflog0 -p /var/run/filterlog.pid
unbound 91146 0.0 14.3 1539156 1173868 - Ss Mon12 12:21.96 /usr/local/sbin/unbound -c /var/unbound/unbound.conf
root 91229 0.0 0.0 12620 2308 - SC 09:15 0:00.00 sleep 1
root 1443 0.0 0.0 12744 2488 u0 Ss+ Mon12 0:09.76 /usr/libexec/getty 3wire.115200 ttyu0
root 47139 0.0 0.0 13488 2996 0 Is 09:07 0:00.00 /bin/sh /usr/local/sbin/opnsense-shell
root 58254 0.0 0.0 13784 4012 0 S 09:07 0:00.02 /bin/csh
root 91355 0.0 0.0 13340 3312 0 R+ 09:15 0:00.00 ps auxwww
What are the latest files created?
find /var/log/ -mtime 0 | grep -v '\->'
root@****:~ # find /var/log/ -ctime 0 -o -mtime 0 -o -ctime 1 -o -mtime 1 -type f | grep -v '\->'
/var/log/
/var/log/utx.lastlogin
/var/log/audit
/var/log/audit/audit_20250113.log
/var/log/audit/audit_20250112.log
/var/log/audit/latest.log
/var/log/system
/var/log/system/system_20250113.log
/var/log/system/latest.log
/var/log/system/system_20250112.log
/var/log/utx.log
/var/log/openvpn
/var/log/openvpn/openvpn_20250113.log
/var/log/openvpn/latest.log
/var/log/openvpn/openvpn_20250112.log
/var/log/firewall
/var/log/firewall/latest.log
/var/log/ipfw.today
/var/log/kea
/var/log/kea/latest.log
/var/log/ipfw.yesterday
/var/log/pkg
/var/log/pkg/latest.log
/var/log/routing
/var/log/routing/latest.log
/var/log/lighttpd
/var/log/lighttpd/lighttpd_20250113.log
/var/log/lighttpd/latest.log
/var/log/pf.yesterday
/var/log/pf.today
/var/log/configd
/var/log/configd/latest.log
/var/log/configd/configd_20250112.log
/var/log/configd/configd_20250113.log
/var/log/portalauth
/var/log/portalauth/latest.log
/var/log/gateways
/var/log/gateways/latest.log
/var/log/filter
/var/log/filter/filter_20250112.log
/var/log/filter/latest.log
/var/log/filter/filter_20250113.log
/var/log/dhcpd
/var/log/dhcpd/latest.log
/var/log/dhcpd/dhcpd_20250113.log
/var/log/dhcpd/dhcpd_20250112.log
/var/log/zabbix
/var/log/zabbix/zabbix_agentd.log.old
/var/log/zabbix/zabbix_agentd.log
/var/log/ntpd
/var/log/ntpd/latest.log
/var/log/resolver
/var/log/resolver/resolver_20250112.log
/var/log/resolver/resolver_20250113.log
/var/log/resolver/latest.log
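A side note on the `find` invocation above: the implicit AND between primaries binds tighter than `-o`, so `-type f` only applied to the last `-mtime 1` term. A version with explicit grouping:

```shell
# Regular files created or modified within the last two days under /var/log
find /var/log/ -type f \( -mtime -2 -o -ctime -2 \)
```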
Nothing extraordinary; this is a stock OPNsense with no particular modifications.
How do you put the kernel into hyper-debug, hyper-verbose mode? Log analysis gives us nothing at the time of the freeze: no particular activity, no log to work from.
Our remote syslog continues to receive filter and DHCP logs from the FW during the freeze. Packet forwarding keeps running during the freeze, but nothing else works.
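For a chattier kernel, a sketch of the loader knobs we know of (generic FreeBSD, not OPNsense-specific):

```shell
# /boot/loader.conf
boot_verbose="YES"         # same as typing `boot -v` at the loader prompt
verbose_loading="YES"      # announce each module as it loads
kern.msgbuf_size="262144"  # bigger kernel message buffer, more dmesg history
```

This only raises boot-time and kernel verbosity; it won't log anything once the machine stops scheduling, so a kernel-debugger entry from the serial console is probably the more useful avenue for the freeze itself.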
Hello,
We tried powerd enabled with the performance profile.
It seemed better, but no: today we lost 2 x FW2 at 2 different sites. Max uptime 17 days...
The freeze: GUI present but can't connect; console present, but no connection possible... frozen!
We will try 2 things:
- Install the Intel microcode update (OPNsense plugin) =>
https://www.thomas-krenn.com/en/wiki/Update_Intel_Microcode_on_FreeBSD
- Unplug the USB serial cable and leave both FW1 and FW2 alone, without a USB serial cable.
Wait and see.
the nightmare continues :(
Hello,
This weekend we lost 5 x FW2! It's not a crash, it's a freeze.
"Enhanced SpeedStep Frequency Control" is not the cause; our home-made BIOS froze as well...
New plan:
-> Intel microcode
-> remove the USB<->serial cable
After that we will change hardware, because from our point of view it's the last thing we haven't tested.
If anyone has an idea for increasing verbosity, for improving our chances of seeing what happens when it freezes, I'd love to hear it. If a special kernel has to be recompiled for that, we'll do it.
A man in the middle of a nightmare :)
Thanks,
Hello,
We continue to have waves of FW2 freezes, and the occasional FW1. We have finally found a log:
LOG :
<7>sonewconn: pcb 0xfffff8009edd8a80 (0.0.0.0:10050 (proto 6)): Listen queue overflow: 193 already in queue awaiting acceptance (1 occurrences), euid 122, rgid 122, jail 0
ChatGPT tells us interesting things:
https://chatgpt.com/share/67adb533-7a24-8004-8ed5-46bf3afcb995
Should we do what ChatGPT suggests and increase the listen queue?
root@xxxFW2:~ # sysctl -w kern.ipc.somaxconn=1024
kern.ipc.somaxconn: 128 -> 1024
root@xxxxxxFW2:~ # sysctl -w net.inet.tcp.syncache.count
net.inet.tcp.syncache.count: 0
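One note: `sysctl -w` does not survive a reboot, so if we keep the bump it should go in as a persistent tunable (per OPNsense convention, System > Settings > Tunables):

```shell
# Persistent equivalent of the sysctl -w above (OPNsense tunable / sysctl.conf)
kern.ipc.somaxconn=1024
# Caveat: this is only the system-wide cap; each daemon still requests its own
# backlog via listen(2), so the listener on :10050 (zabbix_agentd) must ask
# for more than the default 128 to actually benefit.
```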
We don't know if these logs are a cause or a consequence, but either way we're going to try, because we've got nothing else! A light at the end of the tunnel, perhaps?
If an expert passes by, I'd love to hear from them :)
thanks,
regards,
You may want to cross-check it on the FreeBSD forums:
https://forums.freebsd.org/threads/listen-queue-overflow.66845/
https://forums.freebsd.org/threads/listen-queue-overflow.74098/
P.S. Also, if you're going to adjust any tunables, do it via the GUI, not the CLI.
Regards,
S.
Hello,
The investigation continues!
A new FW2 has frozen even with kern.ipc.somaxconn=1024,
AND
the FWs with Zabbix monitoring disabled stay alive!
That's not proof yet, but the investigation is tightening around the Zabbix agent.
In fact, after analysis, we see freezes on the FWs with Zabbix agent command failures.
Example Zabbix key:
"timeout -s 9 10 "ping -c 4 -S 192.168.10.252 8.8.4.4 | grep 'packet loss' | awk '{print $7}' | tr -d '%'""
It's badly coded, but it does the job.
Across the fleet, we were able to determine that the sudden FW freezes corresponded to the growth of our monitoring and, above all, to the presence of command execution failures.
There are more Zabbix command failures on FW2 because, for example, the IPsec links are only established on FW1. This would explain why FW2 freezes and FW1 does not.
The FWs that freeze are the FWs with lots of logs like the following:
2025-02-21T15:00:02 41206 Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep 'packet loss' | awk '{print $7}' | tr -d '%'": Timeout while executing a shell script.
2025-02-21T15:00:00 40564 Failed to execute command "timeout -s 9 10 ping -c 1 -S 10.66.255.5 10.4.0.1 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.
2025-02-21T14:59:59 41001 Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.
2025-02-21T14:59:39 40564 Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.10.10 192.168.10.9 | grep -q '0.0% packet loss' ; echo $?": Timeout while executing a shell script.
2025-02-21T14:59:02 41206 Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep 'packet loss' | awk '{print $7}' | tr -d '%'": Timeout while executing a shell script.
2025-02-21T14:59:00 40564 Failed to execute command "timeout -s 9 10 ping -c 1 -S 10.66.255.5 10.4.0.1 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.
2025-02-21T14:58:59 41001 Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.
2025-02-21T14:58:39 41206 Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.10.10 192.168.10.9 | grep -q '0.0% packet loss' ; echo $?": Timeout while executing a shell script.
2025-02-21T14:58:02 41001 Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep 'packet loss' | awk '{print $7}' | tr -d '%'": Timeout while executing a shell script.
2025-02-21T14:58:00 41206 Failed to execute command "timeout -s 9 10 ping -c 1 -S 10.66.255.5 10.4.0.1 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.
2025-02-21T14:57:59 40564 Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.
2025-02-21T14:57:39 40564 Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.10.10 192.168.10.9 | grep -q '0.0% packet loss' ; echo $?": Timeout while executing a shell script.
2025-02-21T14:57:02 41001 Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep 'packet loss' | awk '{print $7}' | tr -d '%'": Timeout while executing a shell script.
2025-02-21T14:57:00 41206 Failed to execute command "timeout -s 9 10 ping -c 1 -S 10.66.255.5 10.4.0.1 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.
2025-02-21T14:56:59 41001 Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.
2025-02-21T14:56:39 40564 Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.10.10 192.168.10.9 | grep -q '0.0% packet loss' ; echo $?": Timeout while executing a shell script.
2025-02-21T14:56:02 40564 Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep 'packet loss' | awk '{print $7}' | tr -d '%'": Timeout while executing a shell script.
2025-02-21T14:56:00 41206 Failed to execute command "timeout -s 9 10 ping -c 1 -S 10.66.255.5 10.4.0.1 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.
So: next week we're going to set up an FW with 50 Zabbix keys generating lots of errors, in an attempt to re-trigger this random freeze.
As always, if anyone has a great idea, I'll take it :D
thanks,
Any chance these pings are started more frequently than every 10 seconds (when the timeout will finally kill them)? A self-made "fork bomb" that crashes the firewall?
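One way to test that hypothesis (a portable sketch; `count_pings` is a helper name of our own, not an existing tool):

```shell
#!/bin/sh
# count_pings: how many `ping` processes are alive right now
# (pgrep -c and -x exist on both FreeBSD and Linux).
count_pings() { pgrep -c -x ping 2>/dev/null; }

# Sample once per second: a steadily rising count would mean the checks are
# launched faster than `timeout` reaps them, i.e. a slow-motion fork bomb.
for i in 1 2 3 4 5; do
  echo "$(date +%T) ping-procs=$(count_pings)"
  sleep 1
done
```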
Hello,
This weekend we lost another 3 x FW2! All the FW2s we no longer monitor are still UP!
Well, we have confirmation that it's our Zabbix monitoring that's causing the FW2s to freeze.
We have the same settings on the FW1s, which do not freeze.
The only difference is that the ping checks over IPsec are timed out by the command and/or by the Zabbix timeout.
No, there's no command overload: when you look, there aren't many Zabbix processes, no RAM saturation, no CPU saturation...
In short, I'm still amazed at how easy it is to freeze a FreeBSD.
Is the freeze simply a kill of process 1 (init)? If so, how could that happen?
In any case, FreeBSD can apparently be broken remotely while staying at user level!
To be continued.
Hello,
New idea:
On our test FWs, I set CPU limits via rctl.
With a constrained Zabbix process, we'll see whether it freezes again or not :D
ps auxwww | grep zabbix_agentd
zabbix 43003 0.0 0.2 25632 9548 - I 10:56 0:00.01 /usr/local/sbin/zabbix_agentd -c /usr/local/etc/zabbix_agentd.conf
zabbix 43373 0.0 0.2 25632 9920 - S 10:56 0:00.17 zabbix_agentd: collector [idle 1 sec] (zabbix_agentd)
zabbix 43903 0.0 0.2 25892 10088 - S 10:56 0:00.41 zabbix_agentd: listener #1 [processing request] (zabbix_agentd)
zabbix 43976 0.0 0.2 25892 10104 - S 10:56 0:00.54 zabbix_agentd: listener #2 [processing request] (zabbix_agentd)
zabbix 44354 0.0 0.2 25892 10092 - S 10:56 0:00.53 zabbix_agentd: listener #3 [processing request] (zabbix_agentd)
zabbix 44437 0.0 0.2 25892 9992 - S 10:56 0:00.09 zabbix_agentd: active checks #1 [idle 1 sec] (zabbix_agentd)
root 36378 0.0 0.1 13744 2388 1 S+ 11:13 0:00.00 grep zabbix_agentd
root@TestZabbix01:~ # rctl
process:43003:pcpu:deny=50
process:43003:memoryuse:deny=1073741824
user:zabbix:memoryuse:deny=1073741824
user:zabbix:pcpu:deny=50
=> Max 1G RAM and 50% CPU :D
wait and see
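For the record, a sketch of making those caps persistent with the stock FreeBSD machinery (rctl.conf(5)); the values mirror the `rctl` output above:

```shell
# /boot/loader.conf -- RACCT/RCTL accounting must be enabled
# (already true here, since `rctl` answered above)
kern.racct.enable=1
# /etc/rctl.conf -- rules loaded at boot by rc.d/rctl
user:zabbix:memoryuse:deny=1073741824
user:zabbix:pcpu:deny=50
```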
Hello,
Good news: we were able to re-trigger the freeze in the lab, after 4-5 days!
However, setting limits on Zabbix didn't help; it still froze despite the caps.
Well, I'm out of ideas! Does anyone have one?
Otherwise: no monitoring via Zabbix = no problem!
:D
Hello,
FYI => https://support.zabbix.com/browse/ZBX-26145
But it continues to freeze. The commands in use:
UserParameter=ipsec.status,timeout -s 9 10 ping -c 1 -S 192.168.1.1 192.168.10.25 | grep -q '0.0% packet loss' ; echo $?
UserParameter=latence.lien1,timeout -s 9 10 ping -c 1 -S 192.168.103.100 9.9.7.2 | grep round-trip | cut -d= -f2 | cut -d/ -f2
UserParameter=latence.lien2,timeout -s 9 10 ping -c 1 -S 192.168.1.1 10.4.0.1 | grep round-trip | cut -d= -f2 | cut -d/ -f2
UserParameter=latence.ipsec,timeout -s 9 10 ping -c 1 -S 192.168.1.1 192.168.10.25 | grep round-trip | cut -d= -f2 | cut -d/ -f2
UserParameter=packetloss.ipsec,timeout -s 9 10 ping -c 1 -S 192.168.1.1 192.168.10.25 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
UserParameter=packetloss.lien2,timeout -s 9 10 ping -c 1 -S 192.168.1.1 10.4.0.1 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
UserParameter=packetloss.lien1,timeout -s 9 10 ping -c 1 -S 192.168.103.100 9.9.7.2 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
UserParameter=opnsense.version,opnsense-version
UserParameter=states.total,sudo pfctl -si | grep current | awk '{print $3}'
UserParameter=states.max,sudo pfctl -sm | grep states | awk '{print $4}'
UserParameter=states.ftth,timeout -s 9 10 ping -c 1 -S 192.168.103.100 8.8.5.5 | grep -q '0.0% packet loss' ; echo $?
UserParameter=state.ftto,timeout -s 9 10 ping -c 1 -S 192.168.1.1 8.8.5.5 | grep -q '0.0% packet loss' ; echo $?
UserParameter=state.priseconnecte,timeout -s 9 10 ping -c 1 172.17.99.45 > /dev/null ; echo $?
Which of these commands freezes the FreeBSD OS?
regards,
Hello,
We have a culprit:
We tried 2 different configurations.
After 7 days, one of the two FWs froze.
355,361c355,361
< UserParameter=ipsec.status,ping -c 1 -S 192.168.1.2 192.168.10.25 | grep -q '0.0% packet loss' ; echo $?
< UserParameter=latence.lien,ping -c 1 -S 192.168.101.50 9.9.7.5 | grep round-trip | cut -d= -f2 | cut -d/ -f2
< UserParameter=latence.lien2,ping -c 1 -S 192.168.1.2 10.4.0.1 | grep round-trip | cut -d= -f2 | cut -d/ -f2
< UserParameter=latence.ipsec,ping -c 1 -S 192.168.1.2 192.168.10.25 | grep round-trip | cut -d= -f2 | cut -d/ -f2
< UserParameter=packetloss.ipsec,ping -c 1 -S 192.168.1.2 192.168.10.25 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
< UserParameter=packetloss.lien2,ping -c 1 -S 192.168.1.2 10.4.0.1 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
< UserParameter=packetloss.lien1,ping -c 1 -S 192.168.101.50 9.9.7.5 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
---
> UserParameter=ipsec.status,timeout -s 9 10 ping -c 1 -S 192.168.1.1 192.168.10.25 | grep -q '0.0% packet loss' ; echo $?
> UserParameter=latence.lien1,timeout -s 9 10 ping -c 1 -S 192.168.103.100 9.9.7.2 | grep round-trip | cut -d= -f2 | cut -d/ -f2
> UserParameter=latence.lien2,timeout -s 9 10 ping -c 1 -S 192.168.1.1 10.4.0.1 | grep round-trip | cut -d= -f2 | cut -d/ -f2
> UserParameter=latence.ipsec,timeout -s 9 10 ping -c 1 -S 192.168.1.1 192.168.10.25 | grep round-trip | cut -d= -f2 | cut -d/ -f2
> UserParameter=packetloss.ipsec,timeout -s 9 10 ping -c 1 -S 192.168.1.1 192.168.10.25 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
> UserParameter=packetloss.lien2,timeout -s 9 10 ping -c 1 -S 192.168.1.1 10.4.0.1 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
> UserParameter=packetloss.lien1,timeout -s 9 10 ping -c 1 -S 192.168.103.100 9.9.7.2 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
365,367c365,367
< UserParameter=state.ftth,ping -c 1 -S 192.168.101.50 8.8.5.5 | grep -q '0.0% packet loss' ; echo $?
< UserParameter=state.ftto,ping -c 1 -S 192.168.1.2 8.8.5.5 | grep -q '0.0% packet loss' ; echo $?
< UserParameter=state.priseconnecte,ping -c 1 172.17.99.45 > /dev/null ; echo $?
---
> UserParameter=states.ftth,timeout -s 9 10 ping -c 1 -S 192.168.103.100 8.8.5.5 | grep -q '0.0% packet loss' ; echo $?
> UserParameter=state.ftto,timeout -s 9 10 ping -c 1 -S 192.168.1.1 8.8.5.5 | grep -q '0.0% packet loss' ; echo $?
> UserParameter=state.priseconnecte,timeout -s 9 10 ping -c 1 172.17.99.45 > /dev/null ; echo $?
It's the configuration with "timeout -s 9 10 ..." that froze.
So why can the timeout command freeze a FreeBSD?
How do you explain this?
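If the SIGKILLed pipeline really is the trigger, a hedged workaround sketch: drop the external `timeout -s 9` and let FreeBSD's ping enforce its own deadline via `-t`, so nothing in the pipeline ever receives an uncatchable KILL (addresses and key names are the examples from this thread):

```shell
# ping -t 5: ping exits by itself after 5 seconds, no external timeout needed
UserParameter=latence.lien2,ping -c 1 -t 5 -S 192.168.1.1 10.4.0.1 | grep round-trip | cut -d= -f2 | cut -d/ -f2
UserParameter=ipsec.status,ping -c 1 -t 5 -S 192.168.1.1 192.168.10.25 | grep -q '0.0% packet loss' ; echo $?
```

Whether this avoids the freeze is exactly what the A/B diff above suggests, but it is only a sketch, untested on our side.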
Same old, same old; I'm posting for the record, because no one has been very interested in this investigation :D
So, maybe this will help someone :D
regards,