States Killed Before Expiration

Started by Gianks, July 16, 2021, 01:36:28 PM

July 16, 2021, 01:36:28 PM Last Edit: July 16, 2021, 02:24:26 PM by Gianks
Hi,
we are having trouble with UDP connections getting closed prematurely.
After checking with pftop, we suspect a serious bug in the timeouts, at least for UDP: states with more than 13 minutes left before expiry are suddenly removed for no apparent reason.

OPNsense is already running with the firewall optimization set to conservative, but the problem remains: states really are dropped, and applications using SIP services have to reconnect all the time.

And regardless of the setting, there is no reason why the timer should start at 1 minute, jump to 15 after a few packets (another inexplicable behaviour: it seems conservative is still trying to 'optimize' somehow, though nobody asked), and then be disregarded less than 2 minutes later, with the state forgotten completely and the following packets blocked!

Should I file a bug on GitHub?
Thanks a lot

Hi, it's difficult to discuss without details (UDP timeouts, app settings, etc.)

July 17, 2021, 04:26:30 PM #2 Last Edit: July 17, 2021, 07:05:17 PM by Gianks
Hi, thanks for the answer, but I don't think the application in use is relevant (btw, it's a SIP PBX trunk).
If we agree that UDP is basically stateless (without additional packet inspection), the application protocol cannot affect firewall states.
The only thing I did not state clearly enough, I guess, is that these connections are idle for approximately 6 minutes before each exchange. No drop occurs if there is continuous traffic, which is why I am pointing to the expiration timer.

So... what is causing a UDP state to be dropped before its expiration, given an almost empty state table (less than 2% full)?
OPNsense does not provide a per-rule state timeout, but pfSense does, and after testing it yesterday I can tell the issue occurs with both. In both products you can sometimes see the timer far from expiry and the state still gets dropped for no apparent reason.

As i said it's my belief this is a BUG but i don't know enough about pf to do additional testing.

July 18, 2021, 07:59:16 AM #3 Last Edit: July 18, 2021, 09:06:09 AM by Fright
hi
Quote: i don't think it has any relevance the application in use (btw, it's a SIP PBX TRUNK)
Do you manage the registration expiry time on your PBX, and what value is set?
Quote: If we agree that UDP is basically stateless (without additional packet inspection) the application protocol cannot affect firewall states.
In theory. But states are still used for UDP, ICMP, etc. in order to control the translation.
Quote: The only thing i did not state clearly enough i guess is that these connections are idle for approximately 6 minutes before each exchange, no drop occurs if there is continuous traffic and this is why i am pointing to the expiration timer
Yep, that's the key imho. You can try increasing the UDP timeouts, decreasing the trunk registration expiry value, or both. The goal is for the trunk to be refreshed more often than the UDP state times out. It is better to agree the expiry time with the SIP provider, since some do not like too-frequent re-registration, and the trunk may then go down because registration is refused.
If 6 min is enough for a stable connection, imho you can start with { udp.first 300, udp.single 150, udp.multiple 400 } timeouts. That is why I asked about the current timeout settings.
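
A minimal sketch (Python, purely illustrative) of the arithmetic behind those numbers. It assumes a state simply expires once it has been idle for longer than its timeout; the 360 s idle gap is the ~6 minutes mentioned above, and the helper name `survives_idle` is made up for this example:

```python
# Timeouts suggested in this post, in seconds, keyed by pf option name.
proposed = {"udp.first": 300, "udp.single": 150, "udp.multiple": 400}

# Idle gap between exchanges reported in the thread (~6 minutes).
idle_gap = 6 * 60


def survives_idle(timeout_s: int, idle_s: int) -> bool:
    """True if a state idle for idle_s seconds has not yet reached timeout_s.

    Assumption: plain fixed timeouts, no adaptive scaling.
    """
    return idle_s < timeout_s


# Seconds of headroom left after the idle gap (negative = already expired).
margins = {name: timeout - idle_gap for name, timeout in proposed.items()}

for name, timeout in proposed.items():
    verdict = "survives" if survives_idle(timeout, idle_gap) else "expires"
    print(f"{name:13s} {timeout:4d}s  {verdict} (margin {margins[name]:+d}s)")
```

With these numbers only udp.multiple outlasts a 360 s gap (by 40 s), so the suggestion relies on the state being tracked under udp.multiple, i.e. both sides having sent packets.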

Hi,
Quote: do you manage the registration expire time on your pbx and what value is set?
Wish I could! Our PBX connects to the telecom provider, which is easy to piss off (fail2ban style). For the sake of completeness, I must say that Asterisk will not re-register before the server-assigned timeout without manual intervention. Qualifying the client on a regular basis (the normal workaround in such cases) is not an option our telecom allows. We are screwed :o

Quote: in theory. but to control the translation, states are used for udp, icmp etc.
Agreed. As far as I understand, a new connection where just a single packet exchange has occurred has a shorter timeout (at least on Linux), which is raised once a subsequent exchange completes (for the sake of this discussion I am simplifying to 'exchanges').
But... this still does not explain why a state with 13 minutes of timeout left (pftop) gets cleared before its time!
Once the timer has been reset from 2 minutes to 15, why is 2 (empirically determined, 15-13) still relevant?

I guess my point is to understand what is going on rather than to find a workaround, because imho this is not the desired/expected behaviour of pf (and therefore of OPNsense and others based on it). Otherwise we should redefine what timeout means! ;D

@Gianks
hm. can't reproduce this behavior yet
tested on 20.7.7_1:
set optimization to conservative
connected to udp-openvpn (MULTIPLE:MULTIPLE state appears in pftop)
wait 5 min and disconnect (also tried stopping the server):
it takes a full 15 minutes to expire for me