Random freeze on secondary/slave firewalls

Started by ppucci, December 16, 2024, 11:23:14 AM

You may want to cross check it on FreeBSD forum

https://forums.freebsd.org/threads/listen-queue-overflow.66845/
https://forums.freebsd.org/threads/listen-queue-overflow.74098/

P.S. Also, if you are going to adjust any tunable, do it via the GUI, not the CLI.

Regards,
S.
Networking is love. You may hate it, but in the end, you always come back to it.

OPNSense HW
APU2D2 - deceased
N5105 - i226-V | Patriot 2x8G 3200 DDR4 | L 790 512G - VM HA(SOON)
N100   - i226-V | Crucial 16G  4800 DDR5 | S 980 500G - PROD

Hello,

The investigation continues!

Another FW2 froze with the parameter kern.ipc.somaxconn=1024 set,
AND
the FWs with Zabbix monitoring disabled stay alive!


So no, the tunable is not the fix, but the investigation is closing in on the Zabbix agent.

After analysis, the freezes occur on FWs where the Zabbix agent reports command failures.

Zabbix key example:
"timeout -s 9 10 "ping -c 4 -S 192.168.10.252 8.8.4.4 | grep 'packet loss' | awk '{print $7}' | tr -d '%'""

It's badly coded, but it does the job.
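
For what it's worth, the grep/awk/cut/tr chains in these keys could be folded into two small helpers. This is only a sketch: the function names `loss_pct` and `rtt_avg` are made up for illustration, and it assumes the FreeBSD ping summary format ("... 0.0% packet loss" and "round-trip min/avg/max/stddev = ..."). To use them from a UserParameter they would have to live in a script file that the key calls.

```shell
#!/bin/sh
# Hypothetical helpers (names are not from the thread): parse the summary
# lines of FreeBSD-style ping output read from stdin.

# Print the packet-loss percentage without the trailing '%'.
loss_pct() {
    awk '/packet loss/ { sub(/%/, "", $7); print $7 }'
}

# Print the average round-trip time taken from the
# "round-trip min/avg/max/stddev = a/b/c/d ms" line.
rtt_avg() {
    awk -F'[=/]' '/round-trip/ { gsub(/ /, "", $6); print $6 }'
}

# Example: ping -c 1 -S 192.168.10.252 8.8.4.4 | loss_pct
```

Both helpers print nothing when ping produces no summary line, which is the same behaviour as the original pipelines.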

Overall, we were able to determine that the sudden freezes of the FWs coincided with the increase in our monitoring and, above all, with the presence of command execution failures.

There are more Zabbix command failures on FW2 because, for example, the IPsec tunnels are only up on FW1. This would explain why FW2 freezes and FW1 does not.

The FWs that freeze are the ones with many log entries like the following:

2025-02-21T15:00:02       41206   Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep 'packet loss' | awk '{print $7}' | tr -d '%'": Timeout while executing a shell script.   
2025-02-21T15:00:00       40564   Failed to execute command "timeout -s 9 10 ping -c 1 -S 10.66.255.5 10.4.0.1 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.   
2025-02-21T14:59:59       41001   Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.   
2025-02-21T14:59:39       40564   Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.10.10 192.168.10.9 | grep -q '0.0% packet loss' ; echo $?": Timeout while executing a shell script.   
2025-02-21T14:59:02       41206   Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep 'packet loss' | awk '{print $7}' | tr -d '%'": Timeout while executing a shell script.   
2025-02-21T14:59:00       40564   Failed to execute command "timeout -s 9 10 ping -c 1 -S 10.66.255.5 10.4.0.1 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.   
2025-02-21T14:58:59       41001   Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.   
2025-02-21T14:58:39       41206   Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.10.10 192.168.10.9 | grep -q '0.0% packet loss' ; echo $?": Timeout while executing a shell script.   
2025-02-21T14:58:02       41001   Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep 'packet loss' | awk '{print $7}' | tr -d '%'": Timeout while executing a shell script.   
2025-02-21T14:58:00       41206   Failed to execute command "timeout -s 9 10 ping -c 1 -S 10.66.255.5 10.4.0.1 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.   
2025-02-21T14:57:59       40564   Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.   
2025-02-21T14:57:39       40564   Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.10.10 192.168.10.9 | grep -q '0.0% packet loss' ; echo $?": Timeout while executing a shell script.   
2025-02-21T14:57:02       41001   Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep 'packet loss' | awk '{print $7}' | tr -d '%'": Timeout while executing a shell script.   
2025-02-21T14:57:00       41206   Failed to execute command "timeout -s 9 10 ping -c 1 -S 10.66.255.5 10.4.0.1 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.   
2025-02-21T14:56:59       41001   Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.   
2025-02-21T14:56:39       40564   Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.10.10 192.168.10.9 | grep -q '0.0% packet loss' ; echo $?": Timeout while executing a shell script.   
2025-02-21T14:56:02       40564   Failed to execute command "timeout -s 9 10 ping -c 1 -S 192.168.1.254 8.8.4.4 | grep 'packet loss' | awk '{print $7}' | tr -d '%'": Timeout while executing a shell script.   
2025-02-21T14:56:00       41206   Failed to execute command "timeout -s 9 10 ping -c 1 -S 10.66.255.5 10.4.0.1 | grep round-trip | cut -d= -f2 | cut -d/ -f2": Timeout while executing a shell script.
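
To compare firewalls, the failures above could be tallied per minute. A minimal sketch, assuming the log lines look exactly like the ones pasted here (ISO timestamp in the first column); the function name `timeouts_per_minute` is made up for illustration, and the real agent log file may use a different layout.

```shell
#!/bin/sh
# Hypothetical helper: count log lines that report a shell-script timeout,
# grouped by minute. Reads log lines on stdin and assumes the first field
# is a timestamp such as 2025-02-21T15:00:02.
timeouts_per_minute() {
    awk '/Timeout while executing a shell script/ { c[substr($1, 1, 16)]++ }
         END { for (m in c) print m, c[m] }' | sort
}
```

Feeding it the excerpt saved to a file (`timeouts_per_minute < excerpt.log`, file name hypothetical) prints one line per minute with the number of timed-out commands.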

So:

Next week, we're going to simulate a FW with 50 Zabbix keys generating lots of errors, in an attempt to re-trigger this random freeze.

As before, if anyone has a great idea, I'll take it :D


Thanks,

Any chance these pings are started more frequently than every 10 seconds (which is when the timeout finally kills them)? A self-made "fork bomb" that crashes the firewall?
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)
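
One way to check the pile-up question above would be to sample how many ping and timeout processes exist at a given moment. A rough sketch: the helper name `count_comm` is made up, and it assumes `ps -A -o comm=` prints one command name per line.

```shell
#!/bin/sh
# Hypothetical helper: count lines on stdin whose first field is exactly $1.
# Intended to be fed the output of `ps -A -o comm=`.
count_comm() {
    awk -v name="$1" '$1 == name { n++ } END { print n + 0 }'
}

# Example, sampling once per second until interrupted:
#   while :; do
#       printf '%s ping=%s timeout=%s\n' "$(date +%T)" \
#           "$(ps -A -o comm= | count_comm ping)" \
#           "$(ps -A -o comm= | count_comm timeout)"
#       sleep 1
#   done
```

If the counts keep growing between samples, the checks are being started faster than they finish; if they stay small, the pile-up theory is unlikely.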

Hello,

This weekend, we lost another 3 FW2s! All the FW2s we no longer monitor are still up!

So we have confirmation that our Zabbix monitoring is what causes the FW2s to freeze.
The FW1s have the same settings and do not freeze.

The only difference is that on the FW2s the ping checks over IPsec time out, killed by the timeout command and/or by the Zabbix timeout.

No, there's no command overload as far as we can see: there aren't many Zabbix processes, no RAM saturation, no CPU saturation...
In short, I'm still amazed at how easy it is to freeze a FreeBSD system.

Is the freeze simply process 1 (init) being killed? If so, how can this happen?

In any case, it seems FreeBSD can be frozen remotely without ever leaving user level!

To be continued.

March 03, 2025, 12:18:05 PM #19 Last Edit: March 03, 2025, 12:19:38 PM by ppucci
Hello,

New idea:

On our test FWs, I set CPU and memory limits via rctl.
With the Zabbix process constrained, we'll see whether it freezes again or not :D

ps auxwww | grep zabbix_agentd
zabbix  43003   0.0  0.2  25632  9548  -  I    10:56    0:00.01 /usr/local/sbin/zabbix_agentd -c /usr/local/etc/zabbix_agentd.conf
zabbix  43373   0.0  0.2  25632  9920  -  S    10:56    0:00.17 zabbix_agentd: collector [idle 1 sec] (zabbix_agentd)
zabbix  43903   0.0  0.2  25892 10088  -  S    10:56    0:00.41 zabbix_agentd: listener #1 [processing request] (zabbix_agentd)
zabbix  43976   0.0  0.2  25892 10104  -  S    10:56    0:00.54 zabbix_agentd: listener #2 [processing request] (zabbix_agentd)
zabbix  44354   0.0  0.2  25892 10092  -  S    10:56    0:00.53 zabbix_agentd: listener #3 [processing request] (zabbix_agentd)
zabbix  44437   0.0  0.2  25892  9992  -  S    10:56    0:00.09 zabbix_agentd: active checks #1 [idle 1 sec] (zabbix_agentd)
root    36378   0.0  0.1  13744  2388  1  S+   11:13    0:00.00 grep zabbix_agentd

root@TestZabbix01:~ # rctl
process:43003:pcpu:deny=50
process:43003:memoryuse:deny=1073741824
user:zabbix:memoryuse:deny=1073741824
user:zabbix:pcpu:deny=50

=> Max 1 GB RAM and 50% CPU :D

wait and see
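
For anyone who wants to reproduce the per-user limits listed above, rules of this form can be added with `rctl -a`. A small sketch, assuming resource accounting is enabled (kern.racct.enable=1 in /boot/loader.conf); the function name `user_limit_rules` is made up for illustration, and the function only prints the rules so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print rctl rules that deny
# memory use above $2 bytes and CPU use above $3 percent for user $1.
user_limit_rules() {
    printf 'user:%s:memoryuse:deny=%s\n' "$1" "$2"
    printf 'user:%s:pcpu:deny=%s\n' "$1" "$3"
}

# Example (run as root on FreeBSD):
#   user_limit_rules zabbix 1073741824 50 |
#       while read -r rule; do rctl -a "$rule"; done
```

Running `rctl` with no arguments afterwards should list the rules, as in the output above.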

Hello,

Good news: we were able to reproduce the FW freeze in the lab, after 4-5 days.

However, setting limits on Zabbix didn't help. It still froze despite the limits.

Well, I'm out of ideas! Does anyone have one?

Otherwise, no monitoring via Zabbix = no problem!

:D

Hello,

FYI: https://support.zabbix.com/browse/ZBX-26145

But it continues to freeze. The commands in use:

UserParameter=ipsec.status,timeout -s 9 10 ping -c 1 -S 192.168.1.1 192.168.10.25 | grep -q '0.0% packet loss' ; echo $?
UserParameter=latence.lien1,timeout -s 9 10 ping -c 1 -S 192.168.103.100 9.9.7.2 | grep round-trip | cut -d= -f2 | cut -d/ -f2
UserParameter=latence.lien2,timeout -s 9 10 ping -c 1 -S 192.168.1.1 10.4.0.1 | grep round-trip | cut -d= -f2 | cut -d/ -f2
UserParameter=latence.ipsec,timeout -s 9 10 ping -c 1 -S 192.168.1.1 192.168.10.25 | grep round-trip | cut -d= -f2 | cut -d/ -f2
UserParameter=packetloss.ipsec,timeout -s 9 10 ping -c 1 -S 192.168.1.1 192.168.10.25 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
UserParameter=packetloss.lien2,timeout -s 9 10 ping -c 1 -S 192.168.1.1 10.4.0.1 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
UserParameter=packetloss.lien1,timeout -s 9 10 ping -c 1 -S 192.168.103.100 9.9.7.2 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
UserParameter=opnsense.version,opnsense-version
UserParameter=states.total,sudo pfctl -si | grep current | awk '{print $3}'
UserParameter=states.max,sudo pfctl -sm | grep states | awk '{print $4}'
UserParameter=states.ftth,timeout -s 9 10 ping -c 1 -S 192.168.103.100 8.8.5.5 | grep -q '0.0% packet loss' ; echo $?
UserParameter=state.ftto,timeout -s 9 10 ping -c 1 -S 192.168.1.1 8.8.5.5 | grep -q '0.0% packet loss' ; echo $?
UserParameter=state.priseconnecte,timeout -s 9 10 ping -c 1 172.17.99.45 > /dev/null ; echo $?

Which command freezes the FreeBSD OS?

regards,

Hello,

We have a culprit:

We tried 2 different configurations.
After 7 days, one of the two FWs froze.


355,361c355,361
< UserParameter=ipsec.status,ping -c 1 -S 192.168.1.2 192.168.10.25 | grep -q '0.0% packet loss' ; echo $?
< UserParameter=latence.lien,ping -c 1 -S 192.168.101.50 9.9.7.5 | grep round-trip | cut -d= -f2 | cut -d/ -f2
< UserParameter=latence.lien2,ping -c 1 -S 192.168.1.2 10.4.0.1 | grep round-trip | cut -d= -f2 | cut -d/ -f2
< UserParameter=latence.ipsec,ping -c 1 -S 192.168.1.2 192.168.10.25 | grep round-trip | cut -d= -f2 | cut -d/ -f2
< UserParameter=packetloss.ipsec,ping -c 1 -S 192.168.1.2 192.168.10.25 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
< UserParameter=packetloss.lien2,ping -c 1 -S 192.168.1.2 10.4.0.1 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
< UserParameter=packetloss.lien1,ping -c 1 -S 192.168.101.50 9.9.7.5 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
---
> UserParameter=ipsec.status,timeout -s 9 10 ping -c 1 -S 192.168.1.1 192.168.10.25 | grep -q '0.0% packet loss' ; echo $?
> UserParameter=latence.lien1,timeout -s 9 10 ping -c 1 -S 192.168.103.100 9.9.7.2 | grep round-trip | cut -d= -f2 | cut -d/ -f2
> UserParameter=latence.lien2,timeout -s 9 10 ping -c 1 -S 192.168.1.1 10.4.0.1 | grep round-trip | cut -d= -f2 | cut -d/ -f2
> UserParameter=latence.ipsec,timeout -s 9 10 ping -c 1 -S 192.168.1.1 192.168.10.25 | grep round-trip | cut -d= -f2 | cut -d/ -f2
> UserParameter=packetloss.ipsec,timeout -s 9 10 ping -c 1 -S 192.168.1.1 192.168.10.25 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
> UserParameter=packetloss.lien2,timeout -s 9 10 ping -c 1 -S 192.168.1.1 10.4.0.1 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
> UserParameter=packetloss.lien1,timeout -s 9 10 ping -c 1 -S 192.168.103.100 9.9.7.2 | grep 'packet loss' | awk '{print $7}' | tr -d '%'
365,367c365,367
< UserParameter=state.ftth,ping -c 1 -S 192.168.101.50 8.8.5.5 | grep -q '0.0% packet loss' ; echo $?
< UserParameter=state.ftto,ping -c 1 -S 192.168.1.2 8.8.5.5 | grep -q '0.0% packet loss' ; echo $?
< UserParameter=state.priseconnecte,ping -c 1 172.17.99.45 > /dev/null ; echo $?
---
> UserParameter=states.ftth,timeout -s 9 10 ping -c 1 -S 192.168.103.100 8.8.5.5 | grep -q '0.0% packet loss' ; echo $?
> UserParameter=state.ftto,timeout -s 9 10 ping -c 1 -S 192.168.1.1 8.8.5.5 | grep -q '0.0% packet loss' ; echo $?
> UserParameter=state.priseconnecte,timeout -s 9 10 ping -c 1 172.17.99.45 > /dev/null ; echo $?

It's the configuration with "timeout -s 9 10 ..." that froze.

So, why can the timeout command freeze a FreeBSD system?

How do you explain this?

Same as ever, I'm posting for the cause, because no one has been very interested in this investigation :D

So, maybe :D

regards,