Messages - ppucci

#1
Hello,

With the old BIOS:

root@*****-FW1:~ # sysctl -a|grep cpufreq
device  cpufreq
debug.cpufreq.verbose: 0
debug.cpufreq.lowest: 0
dev.cpufreq.3.freq_driver: est3
dev.cpufreq.3.%parent: cpu3
dev.cpufreq.3.%pnpinfo:
dev.cpufreq.3.%location:
dev.cpufreq.3.%driver: cpufreq
dev.cpufreq.3.%desc:
dev.cpufreq.2.freq_driver: est2
dev.cpufreq.2.%parent: cpu2
dev.cpufreq.2.%pnpinfo:
dev.cpufreq.2.%location:
dev.cpufreq.2.%driver: cpufreq
dev.cpufreq.2.%desc:
dev.cpufreq.1.freq_driver: est1
dev.cpufreq.1.%parent: cpu1
dev.cpufreq.1.%pnpinfo:
dev.cpufreq.1.%location:
dev.cpufreq.1.%driver: cpufreq
dev.cpufreq.1.%desc:
dev.cpufreq.0.freq_driver: est0
dev.cpufreq.0.%parent: cpu0
dev.cpufreq.0.%pnpinfo:
dev.cpufreq.0.%location:
dev.cpufreq.0.%driver: cpufreq
dev.cpufreq.0.%desc:
dev.cpufreq.%parent:

root@****-FW1:~ # sysctl -a | grep 'est:'
kern.vm_guest: none
vfs.nfs.realign_test: 0
vfs.nfsd.request_space_used_highest: 0
net.inet.ip.broadcast_lowest: 0
debug.cpufreq.lowest: 0
hw.acpi.cpu.cx_lowest: C1
dev.cpu.3.cx_lowest: C1
dev.cpu.2.cx_lowest: C1
dev.cpu.1.cx_lowest: C1
dev.cpu.0.cx_lowest: C1

With our freshly built BIOS:

root@xxxxx-FW2:~ # sysctl -a|grep cpufreq
device  cpufreq
debug.cpufreq.verbose: 0
debug.cpufreq.lowest:

root@xxxxxx:~ # sysctl -a | grep 'est:'
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 7e000000173f
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 7e000000173f
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 7e000000173f
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 7e000000173f
kern.vm_guest: none
vfs.nfs.realign_test: 0
vfs.nfsd.request_space_used_highest: 0
net.inet.ip.broadcast_lowest: 0
debug.cpufreq.lowest: 0
hw.acpi.cpu.cx_lowest: C1
dev.cpu.3.cx_lowest: C1
dev.cpu.2.cx_lowest: C1
dev.cpu.1.cx_lowest: C1
dev.cpu.0.cx_lowest: C1

Now we need to flash the FWs that froze and see what happens! We'd be happy not to get any more customer calls about this problem. :D
#2
I will try :

echo 'hint.est.0.disabled="1"' >> /boot/loader.conf

and reboot

Maybe this will be enough.

after reboot :

root@*****:~ # dmesg | grep -i -A3 est0
est0: <Enhanced SpeedStep Frequency Control> on cpu0
Timecounter "TSC" frequency 1916666258 Hz quality 1000
Timecounters tick every 1.000 msec

It doesn't work. :(
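
One thing that may be worth ruling out is whether the hint ever reached the kernel. As far as I understand, OPNsense regenerates /boot/loader.conf, so a line appended by hand can get lost; /boot/loader.conf.local is the file I would assume survives, but treat that path as an assumption. A minimal sketch (the helper name find_hint is made up for illustration) that lists which loader files really set a given hint:

```shell
#!/bin/sh
# Sketch only: find_hint is a hypothetical helper, not an OPNsense tool.
# find_hint NAME FILE...
#   Prints "FILE: line" for every non-comment line that sets NAME.
find_hint() {
    name=$1
    shift
    for f in "$@"; do
        # Silently skip files that do not exist or are unreadable.
        [ -r "$f" ] || continue
        awk -v n="$name" -v f="$f" '
            /^[[:space:]]*#/      { next }              # ignore comment lines
            index($0, n "=") == 1 { print f ": " $0 }   # line starts with NAME=
        ' "$f"
    done
    return 0
}

# Usage on the firewall (file paths are assumptions):
#   find_hint hint.est.0.disabled /boot/loader.conf /boot/loader.conf.local
#   kenv -q hint.est.0.disabled    # value present in the kernel environment, if any
```

If kenv prints nothing after the reboot, the hint was not passed to the kernel at all, which would be a different problem from the driver ignoring it.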
#3
Hello,

So we are trying to disable Intel EIST (Intel_SpeedStep) in the BIOS, where it is enabled by default.

We are rebuilding a specific BIOS for our hardware and will quickly try it.

But is it possible to disable it in the OS kernel?

root@xxxxx:~ # dmesg | grep -i speedStep -A2
est0: <Enhanced SpeedStep Frequency Control> on cpu0
Timecounter "TSC" frequency 1916666258 Hz quality 1000
Timecounters tick every 1.000 msec

thanks


regards,
#4
In response to Seimus :

We have quite a few FWs deployed at different sites, almost all of them installed behind quality UPS units (inverters). We have already tried changing the power supplies, alas without success.
This problem occurs on several different sites.
So this doesn't seem to be the right lead.
#5
Hello,

Our intuition is that we have a problem with Intel Speed Shift/SpeedStep.
All the FWs experiencing the problem have minimal activity: they are the backup FW2s.

Should we try to disable ALL power-saving mechanisms in the BIOS?

On the OS, power control is disabled:

root@fw_qua_sr1_f2:~ # sysctl -a | grep hwp
kern.hwpmc.softevents: 16
kern.features.hwpmc_hooks: 1
debug.hwpstate_pstate_limit: 0
debug.hwpstate_verify: 0
debug.hwpstate_verbose: 0
machdep.hwpstate_pkg_ctrl: 1


root@FW1:~ # sysctl -a | grep -i dev.cpu
dev.cpufreq.3.freq_driver: hwpstate_intel3
dev.cpufreq.3.%parent: cpu3
dev.cpufreq.3.%pnpinfo:
dev.cpufreq.3.%location:
dev.cpufreq.3.%driver: cpufreq
dev.cpufreq.3.%desc:
dev.cpufreq.2.freq_driver: hwpstate_intel2
dev.cpufreq.2.%parent: cpu2
dev.cpufreq.2.%pnpinfo:
dev.cpufreq.2.%location:
dev.cpufreq.2.%driver: cpufreq
dev.cpufreq.2.%desc:
dev.cpufreq.1.freq_driver: hwpstate_intel1
dev.cpufreq.1.%parent: cpu1
dev.cpufreq.1.%pnpinfo:
dev.cpufreq.1.%location:
dev.cpufreq.1.%driver: cpufreq
dev.cpufreq.1.%desc:
dev.cpufreq.0.freq_driver: hwpstate_intel0
dev.cpufreq.0.%parent: cpu0
dev.cpufreq.0.%pnpinfo:
dev.cpufreq.0.%location:
dev.cpufreq.0.%driver: cpufreq
dev.cpufreq.0.%desc:
dev.cpufreq.%parent:
dev.cpu.3.cx_method: C1/mwait/hwc C2/mwait/hwc C3/mwait/hwc
dev.cpu.3.cx_usage_counters: 31919645 0 0
dev.cpu.3.cx_usage: 100.00% 0.00% 0.00% last 28us
dev.cpu.3.cx_lowest: C1
dev.cpu.3.cx_supported: C1/1/1 C2/2/253 C3/3/1048
dev.cpu.3.freq_levels: 1996/-1
dev.cpu.3.freq: 2366
dev.cpu.3.%parent: acpi0
dev.cpu.3.%pnpinfo: _HID=none _UID=0 _CID=none
dev.cpu.3.%location: handle=\_SB_.PR03
dev.cpu.3.%driver: cpu
dev.cpu.3.%desc: ACPI CPU
dev.cpu.2.cx_method: C1/mwait/hwc C2/mwait/hwc C3/mwait/hwc
dev.cpu.2.cx_usage_counters: 37773627 0 0
dev.cpu.2.cx_usage: 100.00% 0.00% 0.00% last 633us
dev.cpu.2.cx_lowest: C1
dev.cpu.2.cx_supported: C1/1/1 C2/2/253 C3/3/1048
dev.cpu.2.freq_levels: 1996/-1
dev.cpu.2.freq: 2454
dev.cpu.2.%parent: acpi0
dev.cpu.2.%pnpinfo: _HID=none _UID=0 _CID=none
dev.cpu.2.%location: handle=\_SB_.PR02
dev.cpu.2.%driver: cpu
dev.cpu.2.%desc: ACPI CPU
dev.cpu.1.cx_method: C1/mwait/hwc C2/mwait/hwc C3/mwait/hwc
dev.cpu.1.cx_usage_counters: 132676551 0 0
dev.cpu.1.cx_usage: 100.00% 0.00% 0.00% last 63us
dev.cpu.1.cx_lowest: C1
dev.cpu.1.cx_supported: C1/1/1 C2/2/253 C3/3/1048
dev.cpu.1.freq_levels: 1996/-1
dev.cpu.1.freq: 2422
dev.cpu.1.%parent: acpi0
dev.cpu.1.%pnpinfo: _HID=none _UID=0 _CID=none
dev.cpu.1.%location: handle=\_SB_.PR01
dev.cpu.1.%driver: cpu
dev.cpu.1.%desc: ACPI CPU
dev.cpu.0.cx_method: C1/mwait/hwc C2/mwait/hwc C3/mwait/hwc
dev.cpu.0.cx_usage_counters: 223484647 0 0
dev.cpu.0.cx_usage: 100.00% 0.00% 0.00% last 93us
dev.cpu.0.cx_lowest: C1
dev.cpu.0.cx_supported: C1/1/1 C2/2/253 C3/3/1048
dev.cpu.0.freq_levels: 1996/-1
dev.cpu.0.freq: 2466
dev.cpu.0.%parent: acpi0
dev.cpu.0.%pnpinfo: _HID=none _UID=0 _CID=none
dev.cpu.0.%location: handle=\_SB_.PR00
dev.cpu.0.%driver: cpu
dev.cpu.0.%desc: ACPI CPU
dev.cpu.%parent:



Any idea how to check that no energy-saving mechanism is being used?
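
A possible way to read the output above at a glance is to condense it to one line per CPU. In that output freq_driver is hwpstate_intel, which is FreeBSD's Intel Speed Shift (HWP) driver, so hardware-managed P-states appear to be active even though only C1 is ever entered. A minimal sketch, where power_summary is a made-up helper name and the only input assumed is the sysctl text format shown above:

```shell
#!/bin/sh
# Sketch only: power_summary is a hypothetical helper.
# Reads `sysctl dev.cpu dev.cpufreq` style lines on stdin and prints, per CPU,
# the frequency driver, the deepest allowed C-state, and whether any state
# deeper than C1 has a non-zero residency.
power_summary() {
    awk '
        /^dev\.cpufreq\.[0-9]+\.freq_driver:/ {
            split($1, p, "."); n = p[3] + 0
            driver[n] = $2; seen[n] = 1
            if (n > max) max = n
        }
        /^dev\.cpu\.[0-9]+\.cx_lowest:/ {
            split($1, p, "."); n = p[3] + 0
            lowest[n] = $2; seen[n] = 1
            if (n > max) max = n
        }
        # Example value: "100.00% 0.00% 0.00% last 28us" ($2 is C1, $3.. are deeper)
        /^dev\.cpu\.[0-9]+\.cx_usage:/ {
            split($1, p, "."); n = p[3] + 0
            deep[n] = "no"; seen[n] = 1
            if (n > max) max = n
            for (i = 3; i <= NF && $i ~ /%$/; i++)
                if ($i + 0 > 0) deep[n] = "yes"
        }
        END {
            for (n = 0; n <= max; n++) {
                if (!(n in seen)) continue
                printf "cpu%d driver=%s cx_lowest=%s deeper_than_C1_used=%s\n", n,
                    (n in driver) ? driver[n] : "none",
                    (n in lowest) ? lowest[n] : "unknown",
                    (n in deep)   ? deep[n]   : "unknown"
            }
        }
    '
}

# Usage on the firewall:
#   sysctl dev.cpu dev.cpufreq | power_summary
```

A driver other than "none" means some frequency-scaling driver is still attached, so this only shows what the OS reports; it cannot prove that the firmware itself does no power management.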

Thanks,
#6
Hello,

Just FYI, we continue to have freezes on multiple FWs.

I was able to bring some FWs back for testing. I ran I/O, RAM (ubench -m + memtest86) and CPU (ubench -c) benchmarks in a loop for 1 week.
No hardware problems were found.

We updated the BIOS, and no problems were found.

I also ran the FWs in very hot environments. Only 1 crashed, and that was a real crash: nothing responded at all, unlike our problem, where ping and routing keep working during the freeze. That is a different issue: a thermal one, and therefore a hardware one.

With the Zabbix agent, we created a command key named "Panic" => "sysctl debug.kdb.panic=1". When an FW froze, we tried to execute the command via the Zabbix agent, but it didn't work. :(

We tested a reboot from the API, but that didn't work either.
Via the API, the first attempt returns an OK status, but the FW does not reboot. The second attempt takes longer and still returns OK, but again the FW does not reboot.
When an FW is frozen, you can authenticate via the web interface, but executing a reboot does nothing, and a second web authentication fails.

Our only remedy is to power-cycle the FW.

Any ideas on how to identify the issue, or how to reboot/restart an FW once it has frozen?
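
One idea that might be worth testing is a small resident self-check that is already running before the freeze and triggers the same panic sysctl once a trivial command stops completing. This is only a sketch under assumptions: probe, check_once, PROBE_TIMEOUT and MAX_FAILS are names invented here, the probe command is arbitrary, and it cannot help if the shell itself can no longer fork new processes during the freeze.

```shell
#!/bin/sh
# Sketch only: all names below are hypothetical, nothing here ships with OPNsense.
PROBE_TIMEOUT=${PROBE_TIMEOUT:-10}   # seconds a probe may take
MAX_FAILS=${MAX_FAILS:-3}            # consecutive failures before giving up
fails=0

# probe CMD...: succeeds only if CMD exits 0 within PROBE_TIMEOUT seconds.
probe() {
    timeout "$PROBE_TIMEOUT" "$@" >/dev/null 2>&1
}

# check_once CMD...: runs one probe and updates the failure counter.
# Returns non-zero once MAX_FAILS consecutive probes have failed.
check_once() {
    if probe "$@"; then
        fails=0
    else
        fails=$((fails + 1))
    fi
    [ "$fails" -lt "$MAX_FAILS" ]
}

# Usage (left commented out on purpose, because the last line panics the box):
#   while check_once ls /var/run; do sleep 60; done
#   sysctl debug.kdb.panic=1
```

The threshold of consecutive failures is there so that a single slow probe on a busy box does not trigger a panic.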
#7
Hello,

So, a new freeze on the FW2 of an HA cluster. Version: 24.7.8

On this version, the symptom is a bit different: the router is frozen, but I can connect to the web interface, which shows me nothing.

I can log in, but "Lobby: Dashboard" shows no information. No commands work.

E.g. if I try a reboot, it never starts.

The server responds to ping, but an SSH connection gets no response.
On the console, same thing: login:, then I type in the password and the session never opens.
The only solution is to power-cycle it.

That's more than 20 identical cases now, on different hardware and OPNsense versions.

I'm asking Father Christmas for an idea, for help!
#8
We have quite a few OPNsense boxes deployed, mostly in primary/secondary (master/slave) configurations.

For the past year, on versions 24.1.9, 24.7.2 and no doubt others we've tested, we've had recurring random freezes on FW2 (the slave FW). Only on the SLAVE/SECONDARY.

A freeze typically occurs after around 60-70 days of uptime. Sometimes much less.

What is a freeze?

- Ping responds.
- SSH is impossible: the prompt appears, but no session is established. SSH fails from the LAN, from the HA link, from everywhere.
- Web is impossible: the login page loads, but logging in fails.
- We continue to receive routing syslogs + DHCP lease logs. No logs from other processes.
- We have a console port on the FW: login: ..., password: ..., then the login never completes.
- We have sometimes left a console session open. Typing reboot + Enter does nothing; it doesn't reboot.
- Of course, all other processes stop working: IPsec, OpenVPN, etc.
- After a reboot, everything is OK. No particular log entries.
- Before the freeze, monitoring shows nothing: no abnormal CPU, network or disk activity. These are slave firewalls, so they do nothing.

All OS processes seem to be frozen.

How to solve the problem?

Power-cycle the firewall (disconnect and reconnect the power).

What's our configuration?

1 WAN port with 2 x VLAN (FTTO + FTTH)
1 LAN port with multiple local VLANs
1 HA port connected directly to the second firewall
We use:
- CARPs for HA.
- Gateway group
- IPSEC client
- OpenVPN client
- DHCP server
- Unbound DNS sometimes
- captive portal sometimes

In short, nothing extraordinary. Resources are well sized: no RAM, disk or CPU problems.

What's our hardware?

CPU: Intel(R) Atom(TM) CPU E3845 @ 1.91GHz
NICs: 3x Intel WGI211AT ports
Installed on mSATA and/or eMMC: sometimes mSATA, sometimes eMMC, sometimes a ZFS mirror across eMMC + mSATA

What have we tried?

- read all the freeze cases on the internet for FreeBSD/pfSense/OPNsense...
- enabling the watchdog: same thing, it does nothing.
- disabling C-states: no change.
- changing hardware: there's no hardware problem. We see this on more than 10 pairs of firewalls with different hardware/install/BIOS.
- ZFS vs UFS => same
- Different BIOS versions => same
- disabling the OPNsense optimisations
- disabling the watchdog

What help do we need?

- Any ideas?
- How to get more logs? How can we put OPNsense into a HYPER-verbose mode?
- How to crash/force a reboot in the event of a freeze?
- We don't have physical access to our FWs, so we'd like to be able to crash OPNsense via the console. I've tried this:
https://gist.github.com/xiangchu0/5eda63b3c5234ce4eb48ca9deb1d0090#how-to-panic-on-demand-when-system-is-freezed
But alas, it doesn't work. So we're forced to request a power cycle in order to recover the firewall. Any ideas on how to crash OPNsense from the console port?

Thank you, and thanks to anyone who can come up with a brilliant idea to solve this problem, which has been incomprehensible for over a year. Given the time spent on it, Santa Claus won't forget whoever finds us a fix!