[BUG] inc/filter.inc: check for 'kill_states' / state flush on ruleset update

Started by mfedv, July 14, 2020, 03:36:54 PM

Previous topic - Next topic
With no changes yet being made to "Firewall / Settings / Advanced":
if gateway monitoring is active, and any monitored gateway is down,
opnsense will flush all pfctl states ("/sbin/pfctl -Fs") whenever
firewall rulesets are updated. Such updates happen e.g. when a gateway
goes down and /usr/local/etc/rc.syshook.d/monitor/10-dpinger invokes
"configctl filter reload". But this also happens when firewall rules are
changed using the admin GUI.

Killed sessions include the current admin sessions to the firewall via
web GUI and ssh.

After the flush, TCP ACK packets from the browser (in reply to GUI
output) are no longer part of an established session and get dropped by
"Default deny rule".

The admin GUI web request actually succeeds in installing the new
firewall rules, but the GUI HTTP response will time out and the browser
will display a corresponding error message.

Perhaps this is even related to some instances of the famous "slow GUI"
problem seen with opnsense clusters?

Firewall / Settings / Advanced:
  Gateway Monitoring
    Kill states     (default: checked)
                    Disable State Killing on Gateway Failure

                    Help text: The monitoring process will flush states
                    for a gateway that goes down if this box is not
                    checked. Check this box to disable this behavior.

The corresponding config value "kill_states" has a slightly confusing
name: kill_states=1 means states should _not_ be killed on gateway
failures.

The help text also is misleading: "flush states for a gateway" sounds
as if only states using the failed gateway were involved, while
actually each and every pfctl connection state will get flushed.

So far I have seen three possible values for "kill_states" in
/conf/config.xml:

    a) default after install (20.1 ISO)
        + upgrades to 20.1.8_1:    <kill_states/>

    b) after unchecking:           ... no kill_states entry at all ...

    c) after checking again:       <kill_states>1</kill_states>



Bug 1) misinterpreting "kill_states" default value

/usr/local/etc/inc/filter.inc:

    128 function filter_delete_states_for_down_gateways()
    129 {
    ...
    145     if ($any_gateway_down == true) {
    146         mwexec("/sbin/pfctl -Fs");
    147     }
    148 }

    ...

    211 function filter_configure_sync($verbose = false, $flush_states = false, $load_aliases = true)
    212 {
    ...
    561     if (empty($config['system']['kill_states'])) {
    562         filter_delete_states_for_down_gateways();
    563     }
    ...
    570     if ($flush_states) {
    571         mwexec('/sbin/pfctl -Fs');
    572     }
    ...
    590     unlock($filterlck);
    591 }



line 561 works for case c) ("<kill_states>1</kill_states>"), but not for
case a) ("<kill_states/>"). Perhaps array_key_exists() would be better?



Bug 2) flushing state on ruleset update

lines 561-563 happen to be in the same codepath for "gateway down" and
for "admin ruleset update"; they ought to be in the "gateway down"
codepath only.

Function "filter_configure_sync()" already has an explicit parameter
"flush_states", which is used e.g. from /usr/local/etc/rc.newwanip
(reacting to an updated WAN ip address).

I think lines 561-563 should be removed from
/usr/local/etc/inc/filter.inc, and be moved to
/usr/local/etc/rc.syshook.d/monitor/10-dpinger instead.


Regards
Matthias Ferdinand

Hi Matthias,

I think your misreading the code here, filter_delete_states_for_down_gateways() is only called when kill_states is empty (either not set, non existing or 0). 

https://github.com/opnsense/core/blob/86c40469dbd0641608e26f395259ba87c385d68c/src/etc/inc/filter.inc#L566-L568

The $any_gateway_down variable is true when there is at least one gateway that is marked down, but only evaluated when kill_states is empty (since not being called).

This will evaluate as being true in php when not set, 0 or non existing.

if (empty($config['system']['kill_states'])) {


An example (just copy it in a .php file and execute):

$tmp = ['xx' => [] ];
if (empty($tmp['xx']['non_existing'])){
    echo "yeay, empty\n";
}


The help text could probably be more descriptive, such as:

Quote
Help text: The monitoring process will flush states
                    when a gateway goes down if this box is not
                    checked. Check this box to disable this behavior.

The slow gui issue can indeed be caused by reset states, certainly on unstable networks triggering filter reloads quite often due to all sorts of other technical issues.

Quote
I think lines 561-563 should be removed from
/usr/local/etc/inc/filter.inc, and be moved to
/usr/local/etc/rc.syshook.d/monitor/10-dpinger instead.

I don't think it should be moved, but probably should be made conditional if $flush_states is allowed, since it doesn't take that into account it will also reset when no explicit flush is  requested. In which case the configd call might need a parameter to pass the option as well (haven't look into it, but sounds logical)

Best regards,

Ad

Hi Ad,

thanks for replying.

Yes, exactly: "only evaluated when kill_states is empty".

It is just not the right condition to use here.

Instead it should be evaluated if and only if the attribute does not
exist at all (!array_key_exists()).

Perhaps the actual bug lies in the fact that the default config contains
"<kill_states/>" (variant a) instead of "<kill_states>1</kill_states>"
(variant c).
But with all the existing installations, this cannot be easily rolled
back and inc/filter.inc should be modified to correctly handle variant
a).

In Firewall / Settings / Advanced, variant a) counts as "checked", i.e.
the same as kill_states=1.

In lib/filter.inc the same variant currently counts as "empty", i.e. 
"not checked", so states get flushed when they shouldn't.

Unchecking "kill_states" in Firewall / Settings / Advanced completely
removes the "kill_states" attribute, it does not just leave it empty.
And only with this (attribute-does-not-exist) setting state flushes
should happen.

The old ui page seems to depend on isset() indeed, but writes an actual value when set:

https://github.com/opnsense/core/blob/f3b5c0e8f7d28acd921940d8d11e4185e13043a7/src/www/system_advanced_firewall.php#L54

https://github.com/opnsense/core/blob/f3b5c0e8f7d28acd921940d8d11e4185e13043a7/src/www/system_advanced_firewall.php#L205-L209

We shouldn't try to fix the consumers in this case, since we use the same construction on multiple places , but changing the default config and changing the initial isset() is fine by me.

If you open a PR on GitHub we'll take a look. Since it's only an initial config issue (if advanced settings have never been written, the impact should be relatively low).

Hi Ad,

> If you open a PR on GitHub we'll take a look.
Alas, I don't have a github account. Tried signing up again to no avail.
([1]).

> We shouldn't try to fix the consumers in this case, since we use the
> same construction on multiple places , but changing the default config
> and changing the initial isset() is fine by me.
"changing the default config": yes, but existing configs (when
containing "<kill_states/>") will need to be modified too, otherwise
existing installations would continue with incorrect behaviour.

"the initial isset()": not sure what this means

And then there is still "Bug 2) flushing state on ruleset update".

Best regards
Matthias Ferdinand

[1]: https://forum.opnsense.org/index.php?topic=16220.msg74460#msg74460


Hi Ad,

thank you, that was quick, I'm impressed!

Just one thing:
> o remove <kill_states/> from our default config, since it was
> evaluated as empty (feature enabled), we might as well remove the
> option to reach the same effect.
It does indeed achieve the same effect.
I just think this to be an unpleasant and surprising effect to have
enabled by default.

Regards
Matthias