Messages - ppucci

#1
Hello,

we have a guilty party:

We tried two different configurations (diff below):
After 7 days, one of the two FWs had frozen.


355,361c355,361
< UserParameter=ipsec.status,ping -c 1 -S 192.168.1.2 192.168.10.25 | grep -q '0.0% packet loss' ; echo $?
< UserParameter=latence.lien,ping -c 1 -S 192.168.101.50 9.9.7.5 | grep round-trip | cut -d= -f2 | cut -d/ -f2
< UserParameter=latence.lien2,ping -c 1 -S 192.168.1.2 10.4.0.1 | grep round-trip | cut -d= -f2 | cut -d/ -f2
< UserParameter=latence.ipsec,ping -c 1 -S 192.168.1.2 192.168.10.25 | grep round-trip | cut -d= -f2 | cut -d/ -f2
< UserParameter=packetloss.ipsec,ping -c 1 -S 192.168.1.2 192.168.10.25 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
< UserParameter=packetloss.lien2,ping -c 1 -S 192.168.1.2 10.4.0.1 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
< UserParameter=packetloss.lien1,ping -c 1 -S 192.168.101.50 9.9.7.5 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
---
> UserParameter=ipsec.status,timeout -s 9 10 ping -c 1 -S 192.168.1.1 192.168.10.25 | grep -q '0.0% packet loss' ; echo $?
> UserParameter=latence.lien1,timeout -s 9 10 ping -c 1 -S 192.168.103.100 9.9.7.2 | grep round-trip | cut -d= -f2 | cut -d/ -f2
> UserParameter=latence.lien2,timeout -s 9 10 ping -c 1 -S 192.168.1.1 10.4.0.1 | grep round-trip | cut -d= -f2 | cut -d/ -f2
> UserParameter=latence.ipsec,timeout -s 9 10 ping -c 1 -S 192.168.1.1 192.168.10.25 | grep round-trip | cut -d= -f2 | cut -d/ -f2
> UserParameter=packetloss.ipsec,timeout -s 9 10 ping -c 1 -S 192.168.1.1 192.168.10.25 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
> UserParameter=packetloss.lien2,timeout -s 9 10 ping -c 1 -S 192.168.1.1 10.4.0.1 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
> UserParameter=packetloss.lien1,timeout -s 9 10 ping -c 1 -S 192.168.103.100 9.9.7.2 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
365,367c365,367
< UserParameter=state.ftth,ping -c 1 -S 192.168.101.50 8.8.5.5 | grep -q '0.0% packet loss' ; echo $?
< UserParameter=state.ftto,ping -c 1 -S 192.168.1.2 8.8.5.5 | grep -q '0.0% packet loss' ; echo $?
< UserParameter=state.priseconnecte,ping -c 1 172.17.99.45 > /dev/null ; echo $?
---
> UserParameter=states.ftth,timeout -s 9 10 ping -c 1 -S 192.168.103.100 8.8.5.5 | grep -q '0.0% packet loss' ; echo $?
> UserParameter=state.ftto,timeout -s 9 10 ping -c 1 -S 192.168.1.1 8.8.5.5 | grep -q '0.0% packet loss' ; echo $?
> UserParameter=state.priseconnecte,timeout -s 9 10 ping -c 1 172.17.99.45 > /dev/null ; echo $?

The one that froze is the configuration with "timeout -s 9 10 ...".

So, why can the timeout command freeze a FreeBSD box?

How do you explain this?
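For anyone who wants to poke at this outside of Zabbix: the keys can be run by hand exactly as the agent does (via sh -c). Note that timeout only wraps ping, not the grep on the right-hand side of the pipe. A rough test sketch, using one of our source addresses and a blackholed TEST-NET target (192.0.2.1) so the ping really does hang:

# run one key by hand the way the agent would (sh -c):
sh -c "timeout -s 9 10 ping -c 1 -S 192.168.1.1 192.0.2.1 | grep -q '0.0% packet loss' ; echo \$?"

# variant that bounds the whole pipeline, not only ping:
timeout -s 9 10 sh -c "ping -c 1 -S 192.168.1.1 192.0.2.1 | grep -q '0.0% packet loss'" ; echo $?

# from another shell, see what is left hanging while it runs:
ps auxwww | grep -E '[p]ing|[t]imeout'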

Same old, same old, I'm posting for the record, since no one has been very interested in this investigation :D

so, maybe it will help someone :D

regards,

#2
Hello,

FYI : => https://support.zabbix.com/browse/ZBX-26145

But it continues to freeze. Commands used:

UserParameter=ipsec.status,timeout -s 9 10 ping -c 1 -S 192.168.1.1 192.168.10.25 | grep -q '0.0% packet loss' ; echo $?
UserParameter=latence.lien1,timeout -s 9 10 ping -c 1 -S 192.168.103.100 9.9.7.2 | grep round-trip | cut -d= -f2 | cut -d/ -f2
UserParameter=latence.lien2,timeout -s 9 10 ping -c 1 -S 192.168.1.1 10.4.0.1 | grep round-trip | cut -d= -f2 | cut -d/ -f2
UserParameter=latence.ipsec,timeout -s 9 10 ping -c 1 -S 192.168.1.1 192.168.10.25 | grep round-trip | cut -d= -f2 | cut -d/ -f2
UserParameter=packetloss.ipsec,timeout -s 9 10 ping -c 1 -S 192.168.1.1 192.168.10.25 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
UserParameter=packetloss.lien2,timeout -s 9 10 ping -c 1 -S 192.168.1.1 10.4.0.1 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
UserParameter=packetloss.lien1,timeout -s 9 10 ping -c 1 -S 192.168.103.100 9.9.7.2 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
UserParameter=opnsense.version,opnsense-version
UserParameter=states.total,sudo pfctl -si | grep current | awk '{print $3}'
UserParameter=states.max,sudo pfctl -sm | grep states | awk '{print $4}'
UserParameter=states.ftth,timeout -s 9 10 ping -c 1 -S 192.168.103.100 8.8.5.5 | grep -q '0.0% packet loss' ; echo $?
UserParameter=state.ftto,timeout -s 9 10 ping -c 1 -S 192.168.1.1 8.8.5.5 | grep -q '0.0% packet loss' ; echo $?
UserParameter=state.priseconnecte,timeout -s 9 10 ping -c 1 172.17.99.45 > /dev/null ; echo $?
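
As an aside, FreeBSD's ping can enforce its own overall deadline with -t (seconds), so in principle the keys could be written without the external timeout wrapper, keeping the agent-side Timeout= above that value. A sketch of one key rewritten that way (not what we run in production):

UserParameter=ipsec.status,ping -c 1 -t 5 -S 192.168.1.1 192.168.10.25 | grep -q '0.0% packet loss' ; echo $?

# zabbix_agentd.conf, agent-side budget for UserParameter execution:
Timeout=10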

Which command freezes the FreeBSD OS?

regards,
#3
Hello,

Good news: we were able to re-freeze a FW in the lab after 4-5 days.

However, setting limits on Zabbix didn't help: it still froze despite the limits.

Well, I'm out of ideas! Does anyone have one?

otherwise, no supervision via zabbix = no problem!

:D
#4
Hello,

New idea:

On our test FWs, I set CPU and memory limits via rctl.
With a constrained Zabbix process, we'll see whether it freezes again or not :D

ps auxwww | grep zabbix_agentd
zabbix  43003   0.0  0.2  25632  9548  -  I    10:56    0:00.01 /usr/local/sbin/zabbix_agentd -c /usr/local/etc/zabbix_agentd.conf
zabbix  43373   0.0  0.2  25632  9920  -  S    10:56    0:00.17 zabbix_agentd: collector [idle 1 sec] (zabbix_agentd)
zabbix  43903   0.0  0.2  25892 10088  -  S    10:56    0:00.41 zabbix_agentd: listener #1 [processing request] (zabbix_agentd)
zabbix  43976   0.0  0.2  25892 10104  -  S    10:56    0:00.54 zabbix_agentd: listener #2 [processing request] (zabbix_agentd)
zabbix  44354   0.0  0.2  25892 10092  -  S    10:56    0:00.53 zabbix_agentd: listener #3 [processing request] (zabbix_agentd)
zabbix  44437   0.0  0.2  25892  9992  -  S    10:56    0:00.09 zabbix_agentd: active checks #1 [idle 1 sec] (zabbix_agentd)
root    36378   0.0  0.1  13744  2388  1  S+   11:13    0:00.00 grep zabbix_agentd

root@TestZabbix01:~ # rctl
process:43003:pcpu:deny=50
process:43003:memoryuse:deny=1073741824
user:zabbix:memoryuse:deny=1073741824
user:zabbix:pcpu:deny=50

=> Max 1G RAM and 50% CPU :D
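
To make these limits survive a reboot, the plan is roughly this (a sketch; it assumes the kernel has RACCT/RCTL compiled in, which stock GENERIC does but disabled by default):

# enable resource accounting at boot:
echo 'kern.racct.enable=1' >> /boot/loader.conf

# persistent rules, applied at startup from rctl.conf(5):
cat >> /etc/rctl.conf <<'EOF'
user:zabbix:memoryuse:deny=1073741824
user:zabbix:pcpu:deny=50
EOF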

wait and see
#5
Hello,

This weekend, we lost another 3 x FW2! All the FW2s we no longer supervise are still UP!

Well, we have confirmation that it's our Zabbix monitoring that is causing the FW2s to freeze.
We have the same settings on the FW1s, which do not freeze.

The only difference is that the ping checks over IPsec time out, killed by the timeout command and/or by the Zabbix timeout.

And no, there's no command overload when you look at it: not many Zabbix processes, no RAM saturation, no CPU saturation...
In short, I'm still amazed at how easy it is to freeze a FreeBSD.

Is the freeze simply a kill of process 1 (init)? If so, how could that happen?

In any case, FreeBSD can apparently be broken remotely while staying at user level!

To be continued.
#6
Hello,

The investigation continues!

A new FW2 has frozen even with kern.ipc.somaxconn=1024 set,
AND
the FWs with Zabbix monitoring disabled stay alive!

So that wasn't it, but the investigation is tightening around the Zabbix agent.

In fact, after analysis, we see freezes on the FWs that have Zabbix agent failures.

zabbix key example:
"timeout -s 9 10 "ping -c 4 -S 192.168.10.252 8.8.4.4 | grep 'packet loss' | awk '{print $7}' | tr -d '%'""

It's badly coded, but it does the job.

Across the whole fleet, we were able to determine that the sudden FW freezes correlated with the growth of our monitoring and, above all, with the presence of command-execution failures.

There are more Zabbix command failures on the FW2s because, for example, the IPsec tunnels are only established on the FW1s. This would explain why the FW2s freeze and not the FW1s.

The FWs that freeze are the ones with a lot of log entries like these:

2025-02-21T15:00:02       41206   Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep 'packet loss' | awk '{print $7}' | tr -d '%'": Timeout while executing a shell script.   
2025-02-21T15:00:00       40564   Failed to execute command "timeout -s 9 10 ping -c 1 -S 10.66.255.5 10.4.0.1 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.   
2025-02-21T14:59:59       41001   Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.   
2025-02-21T14:59:39       40564   Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.10.10 192.168.10.9 | grep -q '0.0% packet loss' ; echo $?": Timeout while executing a shell script.   
2025-02-21T14:59:02       41206   Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep 'packet loss' | awk '{print $7}' | tr -d '%'": Timeout while executing a shell script.   
2025-02-21T14:59:00       40564   Failed to execute command "timeout -s 9 10 ping -c 1 -S 10.66.255.5 10.4.0.1 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.   
2025-02-21T14:58:59       41001   Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.   
2025-02-21T14:58:39       41206   Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.10.10 192.168.10.9 | grep -q '0.0% packet loss' ; echo $?": Timeout while executing a shell script.   
2025-02-21T14:58:02       41001   Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep 'packet loss' | awk '{print $7}' | tr -d '%'": Timeout while executing a shell script.   
2025-02-21T14:58:00       41206   Failed to execute command "timeout -s 9 10 ping -c 1 -S 10.66.255.5 10.4.0.1 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.   
2025-02-21T14:57:59       40564   Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.   
2025-02-21T14:57:39       40564   Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.10.10 192.168.10.9 | grep -q '0.0% packet loss' ; echo $?": Timeout while executing a shell script.   
2025-02-21T14:57:02       41001   Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep 'packet loss' | awk '{print $7}' | tr -d '%'": Timeout while executing a shell script.   
2025-02-21T14:57:00       41206   Failed to execute command "timeout -s 9 10 ping -c 1 -S 10.66.255.5 10.4.0.1 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.   
2025-02-21T14:56:59       41001   Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.   
2025-02-21T14:56:39       40564   Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.10.10 192.168.10.9 | grep -q '0.0% packet loss' ; echo $?": Timeout while executing a shell script.   
2025-02-21T14:56:02       40564   Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep 'packet loss' | awk '{print $7}' | tr -d '%'": Timeout while executing a shell script.   
2025-02-21T14:56:00       41206   Failed to execute command "timeout -s 9 10 ping -c 1 -S 10.66.255.5 10.4.0.1 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.

So:

Next week, we're going to simulate an FW with 50 zabbix keys generating lots of errors in an attempt to re-trigger this random freeze.
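
If anyone wants to reproduce along with us, the idea is roughly this throwaway sketch (it assumes an Include= directive in zabbix_agentd.conf pointing at /usr/local/etc/zabbix_agentd.d, and the server still has to poll the generated keys):

# generate 50 keys that always hit the 10 s SIGKILL timeout
# (192.0.2.0/24 is TEST-NET-1, so the pings never get an answer):
: > /usr/local/etc/zabbix_agentd.d/stress_keys.conf
i=1
while [ "$i" -le 50 ]; do
    echo "UserParameter=stress.timeout${i},timeout -s 9 10 ping -c 1 192.0.2.${i} | grep -q '0.0% packet loss' ; echo \$?" \
        >> /usr/local/etc/zabbix_agentd.d/stress_keys.conf
    i=$((i + 1))
done

# reload the agent (rc script name from the FreeBSD package; adjust if needed):
service zabbix_agentd restart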

Still the same, if anyone has a great idea, I'll take it :D


thanks,
#7
Hello,

We continue to have waves of FW2 freezes, and the occasional FW1. We found a log entry:

LOG :

<7>sonewconn: pcb 0xfffff8009edd8a80 (0.0.0.0:10050 (proto 6)): Listen queue overflow: 193 already in queue awaiting acceptance (1 occurrences), euid 122, rgid 122, jail 0

ChatGPT tells us some interesting things:

https://chatgpt.com/share/67adb533-7a24-8004-8ed5-46bf3afcb995

Should I do what ChatGPT suggests and increase the listen queue?

root@xxxFW2:~ # sysctl -w kern.ipc.somaxconn=1024
kern.ipc.somaxconn: 128 -> 1024
root@xxxxxxFW2:~ # sysctl -w net.inet.tcp.syncache.count
net.inet.tcp.syncache.count: 0
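
And to make it persistent, plus keep an eye on the agent's listen queue (a sketch; on OPNsense the GUI tunables page may be the preferred place for the sysctl, plain FreeBSD would use /etc/sysctl.conf):

# persist the larger backlog across reboots:
echo 'kern.ipc.somaxconn=1024' >> /etc/sysctl.conf

# watch the zabbix_agentd listen queue (port 10050) for further overflows:
netstat -Lan | grep 10050

# if the agent version supports it, its own backlog can be raised too in
# zabbix_agentd.conf (check the documentation for your version):
# ListenBacklog=1024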

We don't know if these logs are a cause or a consequence, but either way, we're going to try because we've got nothing else! A light at the end of the tunnel, perhaps?

If an expert comes by, I'd like to hear from him :)

thanks,

regards,
#8
Hello,

This weekend, we lost 5 x FW2! It's not a crash, it's a freeze.

"Enhanced SpeedStep Frequency Control" is not the cause: the units with our home-made BIOS froze as well...

New plan:

-> Intel microcode
-> USB<->serial cable

After that, we will change the hardware, because from our point of view it's the last thing we haven't tested.

If anyone has an idea for increasing verbosity, for improving our chances of seeing what happens when it freezes, I'd love to hear it. If a custom kernel needs to be recompiled for that, we'll do it.

A man in the middle of a nightmare :)

Thanks,


 
#9
Hello,

We tried enabling powerd with the performance profile.

It's better, but still not it: today we lost 2 x FW2 on 2 different sites. Max uptime: 17 days...

The freeze: the GUI is there but you can't log in, the console is there but no connection is possible... frozen!


We will try 2 things (see the sketch after this list):

- Install the Intel microcode update (OPNsense plugin) =>
  https://www.thomas-krenn.com/en/wiki/Update_Intel_Microcode_on_FreeBSD
- Unplug the USB serial cable and leave both FW1 and FW2 alone without it.
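
For the microcode part, on plain FreeBSD it boils down to something like this (a sketch; the package is called devcpu-data or cpu-microcode depending on the release, and the OPNsense plugin should end up doing the equivalent):

# install the CPU microcode images:
pkg install devcpu-data

# load the update early at boot via the loader:
cat >> /boot/loader.conf <<'EOF'
cpu_microcode_load="YES"
cpu_microcode_name="/boot/firmware/intel-ucode.bin"
EOF

# after reboot, check whether the microcode revision changed:
grep -i microcode /var/run/dmesg.boot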

wait and see.

the nightmare continues :(
#10
What processes are running?

Just a few:

root@xxx:~ # ps auxwww | grep -v '\['
USER      PID  %CPU %MEM     VSZ     RSS TT  STAT STARTED        TIME COMMAND
root    62354   0.2  0.6   75836   46272  -  S    Mon11      37:30.26 /usr/local/bin/php /usr/local/opnsense/scripts/routes/gateway_watcher.php interface routes alarm
root    66338   0.1  0.1   19716    9164  -  Ss   Mon12       5:38.00 /usr/local/sbin/openvpn --config /var/etc/openvpn/client4.conf
root        1   0.0  0.0   11376    1024  -  SLs  Mon11       0:00.87 /sbin/init
root      285   0.0  0.2   27104   15384  -  Is   Mon11       0:00.42 /usr/local/bin/python3 /usr/local/opnsense/service/configd.py (python3.11)
root      287   0.0  0.5   91092   44544  -  S    Mon11       1:06.86 /usr/local/bin/python3 /usr/local/opnsense/service/configd.py console (python3.11)
root      425   0.0  0.0   13488    2864  -  S    Mon12       1:42.35 /bin/sh /var/db/rrd/updaterrd.sh
root      803   0.0  0.0   11552    1600  -  Ss   Mon11       0:00.47 /sbin/devd
root      855   0.0  0.2   26364   14776  -  S    22:00       0:06.00 /usr/local/bin/python3 /usr/local/sbin/configctl -e -t 0.5 system event config_changed (python3.11)
root      860   0.0  0.2   27388   15080  -  S    22:00       0:06.04 /usr/local/bin/python3 /usr/local/opnsense/scripts/syslog/lockout_handler (python3.11)
root     5940   0.0  0.2   12720   12824  -  S<s  Mon12       0:03.20 /usr/sbin/watchdogd
root     7943   0.0  0.4   61640   32536  -  S    09:12       0:00.33 /usr/local/bin/php-cgi
root    11356   0.0  0.1   23456    9908  -  I    Mon11       0:00.00 /usr/local/sbin/syslog-ng -f /usr/local/etc/syslog-ng.conf -p /var/run/syslog-ng.pid
root    11585   0.0  0.4   69488   28824  -  Ss   Mon11      15:11.03 /usr/local/sbin/syslog-ng -f /usr/local/etc/syslog-ng.conf -p /var/run/syslog-ng.pid
zabbix  17287   0.0  0.1   23364    9360  -  I    Mon12       0:00.00 /usr/local/sbin/zabbix_agentd -c /usr/local/etc/zabbix_agentd.conf
root    29187   0.0  0.1   22804   11024  -  S    Mon11       1:18.17 /usr/local/sbin/lighttpd -f /var/etc/lighty-webConfigurator.conf
root    29396   0.0  0.3   57784   26296  -  Is   Mon11       0:00.04 /usr/local/bin/php-cgi
root    29795   0.0  0.3   57784   26624  -  Is   Mon11       0:00.06 /usr/local/bin/php-cgi
dhcpd   35032   0.0  0.1   25516   10572  -  Ss   Mon11       1:40.23 /usr/local/sbin/dhcpd -user dhcpd -group dhcpd -chroot /var/dhcpd -cf /etc/dhcpd.conf -pf /var/run/dhcpd.pid igb0 vlan01 vlan09 vlan04 vlan010 vlan07 vlan03
root    44106   0.0  0.4   61640   32744  -  I    08:15       0:00.80 /usr/local/bin/php-cgi
root    47083   0.0  0.1   19860    9496  -  S    09:07       0:00.18 sshd-session: root@pts/0 (sshd-session)
root    49563   0.0  0.1   14684    4404  -  S    Mon12       0:28.95 /usr/local/sbin/lighttpd -f /var/etc/lighttpd-api-dispatcher.conf
www     51810   0.0  0.1   22788   10140  -  S    Mon12       1:42.44 /usr/local/sbin/lighttpd -f /var/etc/lighttpd-cp-zone-0.conf
root    52094   0.0  0.2   30460   19160  -  Ss   Mon12      15:48.38 /usr/local/bin/python3 /usr/local/opnsense/scripts/OPNsense/CaptivePortal/cp-background-process.py start (python3.11)
root    58447   0.0  0.4   61640   32988  -  S    Fri13       0:01.48 /usr/local/bin/php-cgi
root    59872   0.0  0.0   13340    2652  -  Is   Mon11       0:56.44 /usr/local/bin/dpinger -f -S -r 0 -i GW_Freebox -B 192.168.222.252 -p /var/run/dpinger_GW_Freebox.pid -u /var/run/dpinger_GW_Freebox.sock -s 1s -l 4s -t 60s -d 1 1.1.1.1
root    60772   0.0  0.0   17568    2732  -  Is   Mon11       0:49.90 /usr/local/bin/dpinger -f -S -r 0 -i GW_Livebox_1 -B 192.168.10.252 -p /var/run/dpinger_GW_Livebox_1.pid -u /var/run/dpinger_GW_Livebox_1.sock -s 1s -l 4s -t 60s -d 1 9.9.9.9
root    61445   0.0  0.0   13340    2652  -  Is   Mon11       0:49.10 /usr/local/bin/dpinger -f -S -r 0 -i GW_Livebox_2 -B 192.168.11.252 -p /var/run/dpinger_GW_Livebox_2.pid -u /var/run/dpinger_GW_Livebox_2.sock -s 1s -l 4s -t 60s -d 1 8.8.8.8
root    67609   0.0  0.1   23112    8096  -  Ss   Mon12       0:35.71 /usr/local/sbin/ntpd -g -c /var/etc/ntpd.conf
root    70984   0.0  0.4   61640   32328  -  I    09:12       0:00.14 /usr/local/bin/php-cgi
root    77465   0.0  0.1   19716    9316  -  Ss   Mon11      12:53.02 /usr/local/sbin/openvpn --config /var/etc/openvpn/instance-1e2b2a71-fc49-4d58-9cd1-bf82e41a0455.conf
root    78283   0.0  0.0   12816    2756  -  Ss   Mon12       0:03.41 /usr/sbin/cron -s
root    84238   0.0  0.4   61832   32872  -  S    09:08       0:00.97 /usr/local/bin/php-cgi
root    88504   0.0  0.4   61512   31968  -  S    09:08       0:00.07 /usr/local/bin/php-cgi
root    90045   0.0  0.0   13008    3060  -  Ss   Wed09       4:31.21 /usr/local/sbin/filterlog -i pflog0 -p /var/run/filterlog.pid
unbound 91146   0.0 14.3 1539156 1173868  -  Ss   Mon12      12:21.96 /usr/local/sbin/unbound -c /var/unbound/unbound.conf
root    91229   0.0  0.0   12620    2308  -  SC   09:15       0:00.00 sleep 1
root     1443   0.0  0.0   12744    2488 u0  Ss+  Mon12       0:09.76 /usr/libexec/getty 3wire.115200 ttyu0
root    47139   0.0  0.0   13488    2996  0  Is   09:07       0:00.00 /bin/sh /usr/local/sbin/opnsense-shell
root    58254   0.0  0.0   13784    4012  0  S    09:07       0:00.02 /bin/csh
root    91355   0.0  0.0   13340    3312  0  R+   09:15       0:00.00 ps auxwww

What are the latest files created?

find /var/log/ -mtime 0 | grep -v '\->'

root@****:~ # find /var/log/ -ctime 0 -o -mtime 0 -o -ctime 1 -o -mtime 1 -type f | grep -v '\->'
/var/log/
/var/log/utx.lastlogin
/var/log/audit
/var/log/audit/audit_20250113.log
/var/log/audit/audit_20250112.log
/var/log/audit/latest.log
/var/log/system
/var/log/system/system_20250113.log
/var/log/system/latest.log
/var/log/system/system_20250112.log
/var/log/utx.log
/var/log/openvpn
/var/log/openvpn/openvpn_20250113.log
/var/log/openvpn/latest.log
/var/log/openvpn/openvpn_20250112.log
/var/log/firewall
/var/log/firewall/latest.log
/var/log/ipfw.today
/var/log/kea
/var/log/kea/latest.log
/var/log/ipfw.yesterday
/var/log/pkg
/var/log/pkg/latest.log
/var/log/routing
/var/log/routing/latest.log
/var/log/lighttpd
/var/log/lighttpd/lighttpd_20250113.log
/var/log/lighttpd/latest.log
/var/log/pf.yesterday
/var/log/pf.today
/var/log/configd
/var/log/configd/latest.log
/var/log/configd/configd_20250112.log
/var/log/configd/configd_20250113.log
/var/log/portalauth
/var/log/portalauth/latest.log
/var/log/gateways
/var/log/gateways/latest.log
/var/log/filter
/var/log/filter/filter_20250112.log
/var/log/filter/latest.log
/var/log/filter/filter_20250113.log
/var/log/dhcpd
/var/log/dhcpd/latest.log
/var/log/dhcpd/dhcpd_20250113.log
/var/log/dhcpd/dhcpd_20250112.log
/var/log/zabbix
/var/log/zabbix/zabbix_agentd.log.old
/var/log/zabbix/zabbix_agentd.log
/var/log/ntpd
/var/log/ntpd/latest.log
/var/log/resolver
/var/log/resolver/resolver_20250112.log
/var/log/resolver/resolver_20250113.log
/var/log/resolver/latest.log

Nothing extraordinary; we're on a stock OPNsense with no particular modifications.

How do you put the kernel into a hyper-debug, hyper-verbose mode? From the log analysis we have nothing at the time of the freeze: no particular activity, no log entry that explains it.
Our syslog server keeps receiving filter and DHCP logs from the FW during the freeze. Packet forwarding keeps running during the freeze, but nothing else works.
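
In the meantime, here is what I'm considering in order to catch the next freeze (a sketch; it needs a dump device, a serial console, and a kernel with KDB/DDB, which stock GENERIC has):

# more verbose kernel boot messages:
echo 'boot_verbose="YES"' >> /boot/loader.conf

# take a kernel crash dump on panic so there is something to analyse afterwards:
sysrc dumpdev=AUTO
sysrc dumpdir=/var/crash

# allow a serial BREAK (or the CR ~ ^b alternate sequence) to enter the kernel debugger:
sysctl debug.kdb.break_to_debugger=1
sysctl debug.kdb.alt_break_to_debugger=1

# during a freeze, if the console still reacts at all, force a panic + dump
# that can later be examined with kgdb:
# sysctl debug.kdb.panic=1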
#11
Hello,

With the old BIOS:

root@*****-FW1:~ # sysctl -a|grep cpufreq
device  cpufreq
debug.cpufreq.verbose: 0
debug.cpufreq.lowest: 0
dev.cpufreq.3.freq_driver: est3
dev.cpufreq.3.%parent: cpu3
dev.cpufreq.3.%pnpinfo:
dev.cpufreq.3.%location:
dev.cpufreq.3.%driver: cpufreq
dev.cpufreq.3.%desc:
dev.cpufreq.2.freq_driver: est2
dev.cpufreq.2.%parent: cpu2
dev.cpufreq.2.%pnpinfo:
dev.cpufreq.2.%location:
dev.cpufreq.2.%driver: cpufreq
dev.cpufreq.2.%desc:
dev.cpufreq.1.freq_driver: est1
dev.cpufreq.1.%parent: cpu1
dev.cpufreq.1.%pnpinfo:
dev.cpufreq.1.%location:
dev.cpufreq.1.%driver: cpufreq
dev.cpufreq.1.%desc:
dev.cpufreq.0.freq_driver: est0
dev.cpufreq.0.%parent: cpu0
dev.cpufreq.0.%pnpinfo:
dev.cpufreq.0.%location:
dev.cpufreq.0.%driver: cpufreq
dev.cpufreq.0.%desc:
dev.cpufreq.%parent:

root@****-FW1:~ # sysctl -a | grep 'est:'
kern.vm_guest: none
vfs.nfs.realign_test: 0
vfs.nfsd.request_space_used_highest: 0
net.inet.ip.broadcast_lowest: 0
debug.cpufreq.lowest: 0
hw.acpi.cpu.cx_lowest: C1
dev.cpu.3.cx_lowest: C1
dev.cpu.2.cx_lowest: C1
dev.cpu.1.cx_lowest: C1
dev.cpu.0.cx_lowest: C1

With our fresh new BIOS:

root@xxxxx-FW2:~ # sysctl -a|grep cpufreq
device  cpufreq
debug.cpufreq.verbose: 0
debug.cpufreq.lowest:

root@xxxxxx:~ # sysctl -a | grep 'est:'
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 7e000000173f
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 7e000000173f
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 7e000000173f
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 7e000000173f
kern.vm_guest: none
vfs.nfs.realign_test: 0
vfs.nfsd.request_space_used_highest: 0
net.inet.ip.broadcast_lowest: 0
debug.cpufreq.lowest: 0
hw.acpi.cpu.cx_lowest: C1
dev.cpu.3.cx_lowest: C1
dev.cpu.2.cx_lowest: C1
dev.cpu.1.cx_lowest: C1
dev.cpu.0.cx_lowest: C1

Now we need to re-flash the FWs that froze and see what happens! I'd be happy not to get any more customer calls about this problem. :D
#12
I will try:

echo 'hint.est.0.disabled="1"' >> /boot/loader.conf

and reboot.

Maybe that will do it.

after reboot :

root@*****:~ # dmesg | grep -i -A3 est0
est0: <Enhanced SpeedStep Frequency Control> on cpu0
Timecounter "TSC" frequency 1916666258 Hz quality 1000
Timecounters tick every 1.000 msec

It doesn't work. :(
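
Before giving up on the hint, I'll at least check that the tunable actually reached the kernel and set it for every core, not just est0 (a diagnostic sketch):

# confirm the loader really passed the hint to the kernel:
kenv | grep hint.est

# hint.est.0.disabled is already in loader.conf; add the other cores too:
cat >> /boot/loader.conf <<'EOF'
hint.est.1.disabled="1"
hint.est.2.disabled="1"
hint.est.3.disabled="1"
EOF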
#13
Hello,

So we'll try to disable Intel EIST (SpeedStep) in the BIOS, where it's enabled by default.

We are rebuilding a specific BIOS for our hardware and will quickly try it.

But is it possible to disable it at the OS/kernel level?

root@xxxxx:~ # dmesg | grep -i speedStep -A2
est0: <Enhanced SpeedStep Frequency Control> on cpu0
Timecounter "TSC" frequency 1916666258 Hz quality 1000
Timecounters tick every 1.000 msec

thanks


regards,
#14
In response to Seimus :

We have quite a few FWs deployed at different sites, almost all of them installed with quality inverters. We have already tried to change the power supplies. Alas, without success.
This problem occurs on several different sites.
This doesn't seem to be a good idea.
#15
Hello,

Our intuition is that we have a problem with Intel Speed Shift / SpeedStep.
All the FWs with minimal activity are experiencing the problem: the backup FW2s.

Should we disable ALL power-saving features in the BIOS?

At the OS level, power control is disabled:

root@fw_qua_sr1_f2:~ # sysctl -a | grep hwp
kern.hwpmc.softevents: 16
kern.features.hwpmc_hooks: 1
debug.hwpstate_pstate_limit: 0
debug.hwpstate_verify: 0
debug.hwpstate_verbose: 0
machdep.hwpstate_pkg_ctrl: 1


root@FW1:~ # sysctl -a | grep -i dev.cpu
dev.cpufreq.3.freq_driver: hwpstate_intel3
dev.cpufreq.3.%parent: cpu3
dev.cpufreq.3.%pnpinfo:
dev.cpufreq.3.%location:
dev.cpufreq.3.%driver: cpufreq
dev.cpufreq.3.%desc:
dev.cpufreq.2.freq_driver: hwpstate_intel2
dev.cpufreq.2.%parent: cpu2
dev.cpufreq.2.%pnpinfo:
dev.cpufreq.2.%location:
dev.cpufreq.2.%driver: cpufreq
dev.cpufreq.2.%desc:
dev.cpufreq.1.freq_driver: hwpstate_intel1
dev.cpufreq.1.%parent: cpu1
dev.cpufreq.1.%pnpinfo:
dev.cpufreq.1.%location:
dev.cpufreq.1.%driver: cpufreq
dev.cpufreq.1.%desc:
dev.cpufreq.0.freq_driver: hwpstate_intel0
dev.cpufreq.0.%parent: cpu0
dev.cpufreq.0.%pnpinfo:
dev.cpufreq.0.%location:
dev.cpufreq.0.%driver: cpufreq
dev.cpufreq.0.%desc:
dev.cpufreq.%parent:
dev.cpu.3.cx_method: C1/mwait/hwc C2/mwait/hwc C3/mwait/hwc
dev.cpu.3.cx_usage_counters: 31919645 0 0
dev.cpu.3.cx_usage: 100.00% 0.00% 0.00% last 28us
dev.cpu.3.cx_lowest: C1
dev.cpu.3.cx_supported: C1/1/1 C2/2/253 C3/3/1048
dev.cpu.3.freq_levels: 1996/-1
dev.cpu.3.freq: 2366
dev.cpu.3.%parent: acpi0
dev.cpu.3.%pnpinfo: _HID=none _UID=0 _CID=none
dev.cpu.3.%location: handle=\_SB_.PR03
dev.cpu.3.%driver: cpu
dev.cpu.3.%desc: ACPI CPU
dev.cpu.2.cx_method: C1/mwait/hwc C2/mwait/hwc C3/mwait/hwc
dev.cpu.2.cx_usage_counters: 37773627 0 0
dev.cpu.2.cx_usage: 100.00% 0.00% 0.00% last 633us
dev.cpu.2.cx_lowest: C1
dev.cpu.2.cx_supported: C1/1/1 C2/2/253 C3/3/1048
dev.cpu.2.freq_levels: 1996/-1
dev.cpu.2.freq: 2454
dev.cpu.2.%parent: acpi0
dev.cpu.2.%pnpinfo: _HID=none _UID=0 _CID=none
dev.cpu.2.%location: handle=\_SB_.PR02
dev.cpu.2.%driver: cpu
dev.cpu.2.%desc: ACPI CPU
dev.cpu.1.cx_method: C1/mwait/hwc C2/mwait/hwc C3/mwait/hwc
dev.cpu.1.cx_usage_counters: 132676551 0 0
dev.cpu.1.cx_usage: 100.00% 0.00% 0.00% last 63us
dev.cpu.1.cx_lowest: C1
dev.cpu.1.cx_supported: C1/1/1 C2/2/253 C3/3/1048
dev.cpu.1.freq_levels: 1996/-1
dev.cpu.1.freq: 2422
dev.cpu.1.%parent: acpi0
dev.cpu.1.%pnpinfo: _HID=none _UID=0 _CID=none
dev.cpu.1.%location: handle=\_SB_.PR01
dev.cpu.1.%driver: cpu
dev.cpu.1.%desc: ACPI CPU
dev.cpu.0.cx_method: C1/mwait/hwc C2/mwait/hwc C3/mwait/hwc
dev.cpu.0.cx_usage_counters: 223484647 0 0
dev.cpu.0.cx_usage: 100.00% 0.00% 0.00% last 93us
dev.cpu.0.cx_lowest: C1
dev.cpu.0.cx_supported: C1/1/1 C2/2/253 C3/3/1048
dev.cpu.0.freq_levels: 1996/-1
dev.cpu.0.freq: 2466
dev.cpu.0.%parent: acpi0
dev.cpu.0.%pnpinfo: _HID=none _UID=0 _CID=none
dev.cpu.0.%location: handle=\_SB_.PR00
dev.cpu.0.%driver: cpu
dev.cpu.0.%desc: ACPI CPU
dev.cpu.%parent:



Any idea how to check that no energy-saving mechanism is being used?
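
For the record, here is what I plan to check (a sketch; the exact OIDs depend on the driver and BIOS):

# which idle method the kernel uses and which are available (mwait / hlt / acpi):
sysctl machdep.idle machdep.idle_available

# deepest C-state the CPUs are allowed to enter (C1 = no deep sleep):
sysctl hw.acpi.cpu.cx_lowest dev.cpu.0.cx_lowest

# current vs. advertised frequency levels and the attached cpufreq driver:
sysctl dev.cpu.0.freq dev.cpu.0.freq_levels dev.cpufreq.0.freq_driver

# make the cpufreq driver log every frequency transition:
sysctl debug.cpufreq.verbose=1

# check whether powerd is running and with which flags:
ps ax | grep '[p]owerd'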

Thanks,