Fritzbox (IP-Client mode) SIP + NAT + dynamic IP Change

Started by schnipp, May 25, 2018, 04:42:23 PM

pfSense always does it the way you describe, but despite this there is still the override that was introduced in 18.1.9, so I think there is a problem with only killing NAT states, or it simply does not solve the issue for persistent connections.


Cheers,
Franco

I tested the patch; clearing the states works well. But now internal connections between interfaces are also reset. So, whenever the WAN connection breaks (e.g. DSL signal loss, 24h reconnect, ...), internal connections are reset as well (e.g. remote backups etc. will fail).
OPNsense 24.7.11_2-amd64

Quote from: schnipp on June 03, 2018, 07:52:40 PM
[...] But now, internal connections between interfaces will also be reset.

When I updated OPNsense (to v18.7) a few days ago, I got caught in the PPPoE reconnect loop again because I forgot to reinstall my workaround. This time it was not possible to install the workaround, because the reconnect loop triggered killing of the NAT states, which also kept dropping my internal SSH session  :'(

August 16, 2018, 09:58:37 PM #18 Last Edit: August 16, 2018, 10:03:08 PM by karl047
@schnipp: Exactly, that was my problem too: the PPPoE reconnect loop, which I also had with 18.7.1_3.
I reported earlier that PPPoE reconnection was stable only on 18.1.9, where it worked without any problem; I tested it over a couple of days with more than 10 reconnects a day. With the later 18.1 updates and with 18.7 it no longer works.

For your Fritzbox, after the dynamic IP on the WAN interface changes, have you tried the option
"Reset all states when a dynamic IP address changes" in Firewall -> Settings -> Advanced? It is the last option there.
You can simply check it and leave the option "Disable State Killing on Gateway Failure" checked. That solved the issue for me, and my Fritzbox keeps working fine after the IP address changes (I mean, of course, the IP address assigned by my ISP).

@karl047: I think there is a misunderstanding. The issue with the reconnect loop triggered state killing every 30 seconds (which also dropped LAN connections like SSH to Opnsense console). So it was hard to apply my workaround for the reconnect loop issue.

My wish is to limit state killing only to WAN connections. If I understand correctly, this will be a lot of work in scripting etc.

Sorry for pulling this old thread out again.

Historically, I had enabled the setting "Reset all states when a dynamic IP address changes" to avoid any stale states which would lead to problems with my VoIP setup.

I completely missed that I no longer need this setting: since OPNsense 21.1, WAN IP address changes are detected by the rc.newwanip script, and states of outdated WAN IPs are removed from the state table.

The beauty of the code changes in the rc.newwanip script is that LAN connections now stay alive on WAN IP address changes, while state entries originating from the old WAN IP are still killed.

However, I am asking myself whether the respective code snippet in rc.newwanip should be amended with mwexec('/sbin/pfctl -k 0.0.0.0/0 -k ' . $cacheip); to ensure that all state entries destined for the old WAN IP are killed as well. The code snippet would then look as follows:


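// act only if a valid cached WAN IP exists and differs from the new one
if (is_ipaddr($cacheip) && $ip != $cacheip && !isset($config['system']['ip_change_kill_states'])) {
        log_error("IP address change detected, killing states of old ip $cacheip");
        // existing: kill states whose source address is the old WAN IP
        mwexec('/sbin/pfctl -k ' . $cacheip);
        // proposed addition: kill states whose destination address is the old WAN IP
        mwexec('/sbin/pfctl -k 0.0.0.0/0 -k ' . $cacheip);
}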



During my testing I ran into one unsightly issue. States are not killed when the files with the cached IP addresses have been deleted from /var/db. Now one might wonder whether that can happen. Unfortunately, the answer is yes. The files are deleted once the pppoe interface is removed (maybe due to pulling out the WAN network cable or when clicking pppoe disconnect in the GUI).

For that reason I would like to suggest that the cache files remain untouched when the pppoe interface is removed.

@glasi:
Thanks for your investigation. My first tests also show that the following two commands are sufficient during a dynamic IP change on the WAN interface:

mwexec('/sbin/pfctl -k ' . $cacheip);
mwexec('/sbin/pfctl -k 0.0.0.0/0 -k ' . $cacheip);


OPNsense should follow this workflow whenever a dynamic IP change occurs on the WAN interface, no matter what triggered the change (e.g. loss of link, PPP link-down event sent by the ISP, rejected renewal of a DHCP lease etc.):


Draft:

  • WAN link comes up
  • Gather new IPv4 WAN address (Note: At this point IP address is not assigned to the interface)
  • In case IP address has changed,...

    • ...replace all firewall rules depending on the outdated WAN IP address with new ones
    • ...update static routes depending on the outdated WAN IP address (if there are any)
    • ...kill all sockets depending on the outdated WAN IP address (to inform running processes)
    • ...reset all dynamic states depending on the outdated WAN IP address (do not perform a global state reset!)
  • Assign new IP address to the WAN interface
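
The state-reset step of the draft could be sketched in shell roughly as follows. This is only a sketch under assumptions: the function name plan_state_kill is made up for illustration, it prints the pfctl commands instead of running them, and it takes the cached and the new address as plain arguments rather than reading the cache files from /var/db.

```shell
#!/bin/sh
# Hypothetical helper (not part of OPNsense): print the pfctl commands that
# would reset only the states tied to the outdated WAN IP.
# $1 = cached (old) WAN IP, $2 = newly gathered WAN IP
plan_state_kill() {
    cached_ip="$1"
    new_ip="$2"

    # No cached address known: nothing can be targeted.
    [ -n "$cached_ip" ] || return 0
    # Address unchanged: leave the state table alone.
    [ "$cached_ip" != "$new_ip" ] || return 0

    # States originating from the old WAN IP
    echo "/sbin/pfctl -k $cached_ip"
    # States destined for the old WAN IP
    echo "/sbin/pfctl -k 0.0.0.0/0 -k $cached_ip"
}
```

Calling plan_state_kill 83.135.92.15 82.207.219.175 prints the two kill commands for the old address; calling it with two identical addresses prints nothing, so LAN-to-LAN states are never touched.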

@glasi @schnipp hi!
Quote: mwexec('/sbin/pfctl -k 0.0.0.0/0 -k ' . $cacheip);
imho although it seems logical to me in terms of freeing up space in the states table, from a practical point of view: what are the chances that such a packet will appear on the interface after a dynamic ip change?)

Quote: Opnsense should proceed with the following work flow in case a dynamic IP change
hmm. may I ask for a comment?
3.1. ip address is explicitly specified in the rules?
3.2. static routes on dynamic interface?
3.3. can you give an example please?
3.4. isn't that happening now? (mwexec('/sbin/pfctl -k ' . $cacheip);)
the problem is that the 'pfctl -k' can kill the state by the source or target ip. but not by mapped address
so actually we have to parse the 'pfctl -ss' output and kill states by id?
or run some custom script with "pfctl -k internal_client_ip -k target_server_ip"?
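
A rough sketch of what such a custom script could look like, in case it helps the discussion. It assumes the `pfctl -ss` line format shown elsewhere in this thread (translated address first, original address in parentheses), IPv4 only, and the function name stale_nat_pairs is invented here. It only parses text from stdin and prints "internal_client_ip target_server_ip" pairs; it does not kill anything by itself.

```shell
#!/bin/sh
# Hypothetical helper: read `pfctl -ss` style lines on stdin and print
# "internal_ip destination_ip" for every NAT state whose translated source
# address equals the given (old) WAN IP. IPv4 only.
stale_nat_pairs() {
    old_ip="$1"
    awk -v old="$old_ip" '
        # NAT state lines look like:
        #   all udp 83.135.92.15:40401 (10.1.0.102:55555) -> 8.8.8.8:53  MULTIPLE:MULTIPLE
        $4 ~ /^\(.*\)$/ && $5 == "->" {
            split($3, xlat, ":")        # translated address:port
            if (xlat[1] != old) next    # state uses another address, skip it
            inner = $4
            gsub(/[()]/, "", inner)     # strip the parentheses
            split(inner, src, ":")      # original (internal) address:port
            split($6, dst, ":")         # destination address:port
            print src[1], dst[1]
        }
    '
}
```

The pairs could then be fed to "pfctl -k internal_client_ip -k target_server_ip" in a loop. Note that this would kill every state between that client and that server, not just the one with the outdated mapping.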

imho it is worth remembering that we force opnsense to do what it should not do.
if the application claims to be NAT-aware then it should take care of such situations
(for example, if the PBX maintains states with frequent keepalive\heartbeats, then there should also be configurable delays available (for example. "send keepalive packets every 'n' seconds. if the answer is not received then try 'k' times with 'm' sec interval, then keep silent for 'p' seconds and initiate new registration"))

Quote from: Fright on October 10, 2021, 05:57:18 PM
@glasi @schnipp hi!
Quote: mwexec('/sbin/pfctl -k 0.0.0.0/0 -k ' . $cacheip);
imho although it seems logical to me in terms of freeing up space in the states table, from a practical point of view: what are the chances that such a packet will appear on the interface after a dynamic ip change?)

In theory no packets with the old IPv4 address will arrive, but from a practical perspective there is no general answer. However, you can treat this old IPv4 address like an additional (temporary) bogon.

Quote from: Fright on October 10, 2021, 05:57:18 PM
[...]. may I ask for a comment?
3.1. ip address is explicitly specified in the rules?
3.2. static routes on dynamic interface?
3.3. can you give an example please?
3.4. isn't that happening now? (mwexec('/sbin/pfctl -k ' . $cacheip);)
the problem is that the 'pfctl -k' can kill the state by the source or target ip. but not by mapped address
so actually we have to parse the 'pfctl -ss' output and kill states by id?
or run some custom script with "pfctl -k internal_client_ip -k target_server_ip"?

3.1: There can be firewall or forwarding rules on the WAN interface that include the dynamic IPv4 address and therefore need to be updated.
3.2: Possibly yes. Static routes bound to the WAN interface internally need an update of the gateway IP address (ok, the latter is dynamic  :) ).
3.3: This step is optional, but it shortens the time until local processes on OPNsense run into a timeout.
3.4: Yes, it's happening now, but it can be extended by the second command, as glasi already mentioned  :). Killing the states identified by the old WAN IP address is sufficient.


Quote from: Fright on October 10, 2021, 05:57:18 PM
imho it is worth remembering that we force opnsense to do what it should not do.

I don't think so. SOHO DSL routers do the same.

Quote from: Fright on October 10, 2021, 05:57:18 PM
if the application claims to be NAT-aware then it should take care of such situations
(for example, if the PBX maintains states with frequent keepalive\heartbeats, then there should also be configurable delays available (for example. "send keepalive packets every 'n' seconds. if the answer is not received then try 'k' times with 'm' sec interval, then keep silent for 'p' seconds and initiate new registration"))

NAT-awareness of applications does not matter because, as I already mentioned, the issue resides at OSI layers 3 and 4. Packets sent by the applications (including keep-alives at application level) reset the timeout counter of outdated NAPT entries, which then will never be deleted.

the discussion is becoming more theoretical, but I hope this is ok)
Quote: you can treat this old IPv4 address like an additional (temporary) bogon
do not agree. it is just an old address: new connections to it will not be allowed, and old ones will expire according to the timeout settings. in scenarios with high load, freeing up space in the states table is probably logical, but nothing more IMHO
Quote: 3.1: There can be firewall or forwarding rules on the WAN interface which integrate the dynamic IPv4
but the rules contain parentheses for the interface address. so there is no need for additional actions (like reloading rules) to pick up a new address?
Quote: 3.2: Possibly yes. Static routes bound to the WAN interface
spooky config )
Quote: 3.3: This step is optional but shortens the time that local processes on the Opnsense run into a timeout
shouldn't the local process connect to the localhost?
Quote: I don't think so. SOHO DSL routers do the same.
in my opinion it's the same thing: router devs have to work around others' issues because of:
Quote: the issue resides at OSI layer 3 and 4
and higher levels too. if the app uses a stateless proto then flow control falls on the application.
(in tcp it's somewhat simpler)
if the connection for some reason stops working, what is the point of persistently and rapidly knocking on a broken door? the application should provide different actions in response to events (especially since it is considered nat-aware and it is this awareness that creates problems)
but of course this is more a question of terminology.

imho there may be another workaround for this: make a very low timeout for the nat-rule and add a separate rdr-rule for incoming packets



October 12, 2021, 06:18:09 PM #25 Last Edit: October 12, 2021, 06:29:29 PM by schnipp
Quote from: Fright on October 10, 2021, 07:57:44 PM
the discussion is becoming more theoretical, but I hope this is ok)

Ok, let's shorten the discussion: try it yourself if you still don't believe me  :)

Preparation:

  • Disable the Option "Dynamic state reset" in Opnsense (Firewall -> Settings -> Advanced)
  • Comment out the line "mwexec('/sbin/pfctl -k ' . $cacheip);" in file "/usr/local/etc/rc.newwanip"
  • Ensure by firewall rule that the client can reach the DNS servers 8.8.8.8 and 8.8.4.4 (without redirection to Opnsense itself)
  • Ensure you have a Linux client in place

Testing: ('$' means execute the following command on the command line)

  • Opnsense: Ensure there is no NAT state table entry for IPv4 8.8.8.8 and 8.8.4.4

    • $ pfctl -s states -v | grep -A 1 "8\.8\.[48]\.[48]"
    • result should be empty

  • Client: Send a DNS query

    • $ dig @8.8.8.8 -b 0.0.0.0#55555 www.heise.de (repeat this several times to raise the NAT state table timer, so that we are not rushed during testing)
    • result should be a valid DNS answer

  • Opnsense: Check NAT state table

    • $ pfctl -s states -v | grep -A 1 "8\.8\.[48]\.[48]"
    • result should be similar to:
      Quote
              No ALTQ support in kernel
              ALTQ related functions disabled
              all udp 8.8.8.8:53 <- 10.1.0.102:55555       MULTIPLE:MULTIPLE
              age 00:02:46, expires in 00:14:59, 11:11 pkts, 891:935 bytes, rule 199
              all udp 83.135.92.15:40401 (10.1.0.102:55555) -> 8.8.8.8:53       MULTIPLE:MULTIPLE
              age 00:02:46, expires in 00:14:59, 11:11 pkts, 891:935 bytes, rule 100

  • Opnsense: Renew WAN IP address of PPPoE connection and check that the IPv4 address has changed

    • $ ifconfig
    • $ kill -s USR2 `pgrep mpd5`
    • $ kill -s USR1 `pgrep mpd5`
    • $ ifconfig

  • Opnsense: Check NAT state table

    • $ pfctl -s states -v | grep -A 1 "8\.8\.[48]\.[48]"
    • result should be similar to (note that the old, now outdated WAN IP is still in the table):
      Quote
              No ALTQ support in kernel
              ALTQ related functions disabled
              all udp 8.8.8.8:53 <- 10.1.0.102:55555       MULTIPLE:MULTIPLE
              age 00:07:20, expires in 00:13:12, 14:14 pkts, 1134:1190 bytes, rule 199
              all udp 83.135.92.15:40401 (10.1.0.102:55555) -> 8.8.8.8:53       MULTIPLE:MULTIPLE
              age 00:07:20, expires in 00:13:12, 14:14 pkts, 1134:1190 bytes, rule 100

  • Client: Send a DNS query

    • $ dig @8.8.4.4 -b 0.0.0.0#55555 www.heise.de (repeat this several times to raise the NAT state table timer, so that we are not rushed during testing)
    • result should be a valid DNS answer

  • Client: Send a DNS query

    • $ dig @8.8.8.8 -b 0.0.0.0#55555 www.heise.de
    • result: DNS times out

  • Opnsense: Check NAT state table

    • $ pfctl -s states -v | grep -A 1 "8\.8\.[48]\.[48]"
    • result should be similar to (note that the old WAN IP 83.135.92.15 is still in the table, but only for 8.8.8.8; the state for 8.8.4.4 uses the new WAN IP 82.207.219.175):
      Quote
              No ALTQ support in kernel
              ALTQ related functions disabled
              all udp 8.8.8.8:53 <- 10.1.0.102:55555       MULTIPLE:MULTIPLE
              age 00:09:39, expires in 00:14:22, 17:14 pkts, 1377:1190 bytes, rule 199
              all udp 83.135.92.15:40401 (10.1.0.102:55555) -> 8.8.8.8:53       MULTIPLE:MULTIPLE
              age 00:09:39, expires in 00:14:22, 17:14 pkts, 1377:1190 bytes, rule 100
              --
              all udp 8.8.4.4:53 <- 10.1.0.102:55555       MULTIPLE:MULTIPLE
              age 00:01:24, expires in 00:14:08, 5:5 pkts, 405:425 bytes, rule 201
              all udp 82.207.219.175:15111 (10.1.0.102:55555) -> 8.8.4.4:53       MULTIPLE:MULTIPLE
              age 00:01:24, expires in 00:14:08, 5:5 pkts, 405:425 bytes, rule 100

Result:
The DNS query to 8.8.8.8 times out, because it hits the outdated NAT state table entry and the request will be translated to the wrong (outdated) source IPv4 address. These packets will be dropped by the ISP. As long as the client fires DNS requests to 8.8.8.8 the existing NAT state table entry will NEVER expire, because the timer resets to its starting value on every DNS query.

After manually deleting the outdated NAT state table entry ($ pfctl -k 0.0.0.0/0 -k 8.8.8.8 ), DNS queries to 8.8.8.8 will be successfully answered :-)

For this reason, deleting the outdated NAT state table entries is essential after the WAN IP has changed.
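
To make the "never expires" argument concrete, here is a small toy simulation in shell. It is not pf code, just a model under assumptions: one state, a fixed timeout that every matching packet resets to its full value, and one-second steps. The function name state_expiry is invented, and the 900 second timeout in the usage note is only read off the "expires in 00:14:59" lines above.

```shell
#!/bin/sh
# Toy model of a single UDP state. Every matching packet resets the expiry
# timer to its full value. Prints the second at which the state expires,
# or "never" if it survives the whole observation window.
# $1 = timeout, $2 = packet interval (0 = no packets), $3 = window; all in seconds
state_expiry() {
    timeout="$1"
    interval="$2"
    window="$3"
    remaining="$timeout"
    t=0
    while [ "$t" -lt "$window" ]; do
        t=$((t + 1))
        remaining=$((remaining - 1))
        # A packet that arrives while the state still exists refreshes it.
        if [ "$interval" -gt 0 ] && [ $((t % interval)) -eq 0 ] && [ "$remaining" -gt 0 ]; then
            remaining="$timeout"
        fi
        if [ "$remaining" -le 0 ]; then
            echo "$t"
            return 0
        fi
    done
    echo "never"
}
```

With a 900 second timeout and one query per minute (state_expiry 900 60 3600) the model reports "never" for a one hour window, while without any traffic (state_expiry 900 0 3600) the state is gone after 900 seconds. That is the whole problem: as long as the client keeps sending, the outdated entry keeps being refreshed.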

Edit:
- I have added an excerpt of a wireshark packet capture

Quote: try yourself if you still don't believe me

I don't think I said anywhere that I don't believe you or that pf doesn't work the way you describe  ;)
everything is exactly like that.

I said that pf works as it should (from my point of view). and the application should take into account the possible change of the external address (or other circumstances), especially if it is supposed to be nat-aware (and intentionally keeps the state alive)
the example with dns, by the way, is quite indicative. Any dns client allows several DNS resolvers to be specified, and this removes the need to solve the states "problem" on the firewall side (and this despite the fact that the dns client does not intentionally send requests to keep the state alive).
so I am talking only about two things: a correct application should provide settings for working with a nat-device with a dynamic address IMHO, and there may be ways to solve the "state issues" without resetting (and i cannot test this assumption since my PBXs work on static addresses)

@Fright: It looks like you haven't understood the demonstrated DNS example, and to my mind you lack basic knowledge of the TCP/IP stack and NAT. It has nothing to do with specifying multiple DNS resolvers or changing any of them. The discussion around SIP and DNS illustrates, by way of example, possible issues affecting any TCP/UDP communication in that context. As far as the states issue is concerned, internal devices do not need to be NAT-aware.

It does not make sense to discuss this topic further. It's a fact that OPNsense has a bug in managing the NAT states. I'll raise a GitHub ticket so that this issue gets solved.


Quote from: schnipp on October 12, 2021, 06:18:09 PM
Testing: ('$' means execute the following command on the command line)

  • [...]

Result:
The DNS query to 8.8.8.8 times out, because it hits the outdated NAT state table entry and the request will be translated to the wrong (outdated) source IPv4 address. These packets will be dropped by the ISP. As long as the client fires DNS requests to 8.8.8.8 the existing NAT state table entry will NEVER expire, because the timer resets to its starting value on every DNS query.

After manually deleting the outdated NAT state table entry ($ pfctl -k 0.0.0.0/0 -k 8.8.8.8 ), DNS queries to 8.8.8.8 will be successfully answered :-)

For this reason, deleting the outdated NAT state table entries is essential after the WAN IP has changed.

Thanks for the example, which illustrates the problem well.

I agree that it basically affects all TCP / UDP communication.

IMHO, invalid entries should always be cleaned out of the state table when the IP address changes. OPNsense actually does this quite well already. For 100% perfection, however, every entry that still references the old IP address should be removed from the state table.

I therefore suggest that we add the following line to the rc.newwanip script, as already mentioned:

mwexec('/sbin/pfctl -k 0.0.0.0/0 -k ' . $cacheip);

With this additional line we really don't break anything.