Hi!
Some time ago we started noticing that the service behind one of our OPNsense firewalls was becoming briefly and unpredictably inaccessible. Diagnosis showed that some connections were being interrupted, and this correlated with a growing "state-mismatch" pfctl counter.
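For reference, we read the counter straight from pfctl (a minimal example; pfctl needs root and the exact output formatting can differ between versions):

    # show pf status/counters and pick out the state-mismatch line
    pfctl -si | grep state-mismatch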
Since then we have been monitoring this counter on almost all of our firewalls. I'm now experiencing the issue for the fourth time, on a fourth OPNsense instance, so I think there's a pattern.
First we notice sporadic counter increases, then the Zabbix triggers based on it fire more and more often, and finally the service owner starts complaining because his clients notice the behaviour.
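In case it helps anyone reproduce the monitoring, this is roughly how we feed the counter into Zabbix (a sketch; the item key name is our own, and it assumes a sudoers entry that lets the agent run pfctl):

    # zabbix_agentd.conf snippet (key name is arbitrary)
    UserParameter=pf.state_mismatch,sudo pfctl -si | awk '/state-mismatch/ {print $2}'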
In 2 of the 4 cases, changing "Firewall Optimization" from Normal to Aggressive has fixed the problem so far; in the other 2 it only delayed the point at which the problem showed up again.
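For the record, as far as I can tell that GUI option maps to pf's optimization profiles, and the timeouts actually in effect can be checked with pfctl (a sketch; the exact values depend on the pf version):

    # pf.conf equivalent of the GUI setting (generated by OPNsense)
    set optimization aggressive

    # show the state timeouts currently in effect
    pfctl -st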
The change mentioned above was introduced into our configs because the first case was really eye-opening.
One client was generating lots of HTTP requests to the service from his Azure cloud environment, sometimes very frequently, and a traffic dump showed that the TCP source ports of those requests cycled through a range of only 10 different values. The ports were therefore reused while old FIN_WAIT sessions were still present in the state tables of our firewalls. With the same src/dst IP and src/dst port, new SYN packets matched the old sessions; the packets were dropped and the state-mismatch counter grew. That's our conclusion.
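This is also easy to confirm from the state table itself (a sketch; the address below is just a placeholder for the client's IP):

    # list states for the client's source IP and look for lingering half-closed entries
    pfctl -ss | grep 203.0.113.10 | grep FIN_WAIT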
In the other 3 cases we did not find any such explanation for the firewall's behaviour.
Does that ring a bell with anyone, especially the OPNsense developers: why does this happen, and why doesn't aggressive mode fix it consistently?
Interesting. I've never fixed an issue through more aggressive state retirement; it has always been the opposite. It'd be nice if there were better diagnostics and control (per-rule or per-protocol state lifetime). My old Fortigate had this, and it was useful to modify the DNS session lifetime (to 15 s for clients, 90 s for servers; the default was 30 s), as some remote servers would send a few packets after a minute or so idle. Discarding these stray packets seemed to have no functional impact, but you never can tell, and the logs were annoying.
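For what it's worth, plain pf rule syntax does allow that kind of per-rule tuning; OPNsense just doesn't expose it in the GUI. A rough sketch of what I had on the Fortigate, expressed in pf.conf terms ($dns_servers is a placeholder macro):

    # shorter state lifetime for client DNS lookups
    pass out quick proto udp to any port 53 keep state (udp.first 15, udp.single 15, udp.multiple 15)
    # longer lifetime for our own DNS servers answering slow resolvers
    pass in quick proto udp to $dns_servers port 53 keep state (udp.first 90, udp.multiple 90)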
I see ~6000 state mismatches over 32 days. I don't notice an impact... but that's no surprise. The log view in OPNsense is pretty terrifyingly bad - I may dig into specifics, assuming mismatches are logged (I've never seen one offhand). (Searching the logs from the GUI gets me an occasional result after several minutes. Not helpful.)
Can you check/alter the client source port range?
No, I have no influence on the configuration on the client's side.
I'm just wondering if there's been a recent change in the default network stack of some OSes, or in a CSP, or something like that, that affects TCP session management...
We were forced to migrate some publicly available services from virtual OPNsense instances to Fortigate clusters, which do not exhibit the problem.
Quote from: JakubJB on June 13, 2025, 09:49:24 AM
[...] I'm just wondering if there's been a recent change in the default network stack of some OSes, or in a CSP, or something like that, that affects TCP session management...
Sounds like a misconfiguration to me. But it'd be an odd one.
OPNsense does have a bunch of per-rule advanced settings, but I don't see one that would work for you. pf itself appears to allow a complete set of timeout settings per rule, which might help you out if they were exposed in the GUI. I tried entering a specific option in "State timeout" ("tcp.finwait 5"), but the GUI expects a plain integer. Perhaps that setting only maps to "tcp.established", which is unhelpful. I may request an enhancement for this, as per-rule control can occasionally be handy.
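For illustration, this is what I was hoping to express, written as raw pf.conf syntax rather than through the GUI (a sketch; $wan_if and $service_ip are placeholder macros):

    # retire half-closed and closing states for this service much sooner than the defaults
    pass in on $wan_if proto tcp to $service_ip port 443 keep state (tcp.finwait 5, tcp.closing 30)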
Thanks a lot for your interest in the topic. We'll keep observing other services with similar characteristics: lots of requests, many clients, and bursty traffic.