Stale NAT states on WAN-IP change prevent SIP re-register

Started by Steph1corn, January 27, 2023, 02:44:27 PM

January 27, 2023, 02:44:27 PM Last Edit: January 28, 2023, 03:48:46 PM by Steph1corn
Migrated to 23.1 yesterday, good job guys btw!

My OPNSense is connected via a VDSL line (and a second line, so it's a multi-WAN setup, if that's relevant) with a public IPv4 address on the WAN side; the IP changes every 24 hours.

Behind the OPNSense box there is a FreePBX (Asterisk) registering SIP trunks to Telekom and Sipgate. That's working well. In principle.

As soon as the WAN-IP changes, Asterisk is unable to re-register.

Checking with pfctl -s state -vv | grep FreePBX-IP | grep :5060 confirms that the states still refer to the old, now invalid public IP address, so the SIP UDP packets carry the wrong source IP and the answers never come back.
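For anyone who wants to repeat this check, here is a minimal sketch of the same filter as a reusable shell function. The function name stale_sip_states and the sample lines are illustrative assumptions: the addresses come from documentation ranges, and real pfctl -s state output on your box may be formatted differently. The function only filters text on stdin, so it can be tried without touching the firewall.

```shell
#!/bin/sh
# Hypothetical helper: print state lines that mention both the PBX address
# and port 5060. Reads state-table text on stdin, e.g. from `pfctl -s state`.
stale_sip_states() {
    pbx_ip=$1
    # -F: treat the address as a fixed string so the dots are not regex wildcards
    grep -F -- "$pbx_ip" | grep -F -- ':5060'
}

# Made-up sample input, only roughly in the shape of pf state lines.
sample_states() {
    cat <<'EOF'
all udp 203.0.113.5:5060 (192.168.1.10:5060) -> 198.51.100.7:5060 MULTIPLE:MULTIPLE
all tcp 192.168.1.20:44321 -> 198.51.100.9:443 ESTABLISHED:ESTABLISHED
all udp 192.168.1.10:123 -> 198.51.100.8:123 MULTIPLE:SINGLE
EOF
}

# Only the first sample line mentions both the PBX address and :5060.
sample_states | stale_sip_states 192.168.1.10
```

On a live system the input would come from pfctl -s state instead of sample_states. Note that a fixed-string match on 192.168.1.10 would also match 192.168.1.100, so a stricter pattern may be needed on larger networks.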

A similar problem was already mentioned in https://forum.opnsense.org/index.php?topic=10385.msg47423#msg47423

The solution mentioned there was to tick the box for 'Dynamic State Reset', however this box is missing in 23.1 ('firewall: remove deprecated "Dynamic state reset" mechanic').

The only solution atm for me is to manually flush all states using 'Reset state table' from the web interface. After that the SIP channels are immediately re-registered.

Any advice is welcome - and as a workaround, is there an easy way to execute the flush command automatically on every WAN IP change?

Thanks for help, Stephan

Update: It seems that only the Sipgate channels are affected; the DTAG channels re-register immediately.
Update 2: Found https://github.com/opnsense/core/issues/2414. According to that, the problem should be already fixed?
Update 3: See also bug report https://github.com/opnsense/core/issues/4652


I also migrated the firewall from version 22.7.11_1 to 23.1_6 yesterday and can confirm the problem. This morning my Fritzbox showed interrupted SIP connectivity. I did some research and found out that an old problem has reappeared. After the 24-hour disconnection forced by the ISP, VoIP connections are still based on the now invalid WAN IP.

I know the entry in the changelog, 'firewall: remove deprecated "Dynamic state reset" mechanic'. But I was not aware that this was apparently dropped without replacement.

I had extensive discussions in the past as to why entries in stateful firewall and NAT tables had to be removed when changing the WAN IP (e.g. here). Now the functionality (dynamic state reset) seems to be gone. If the removal without replacement was deliberately chosen, this does not speak for the quality of the Opnsense project. So, please reintegrate the dynamic state reset functionality as it is important for many users who are affected by 24h disconnects with non-persistent WAN IP.

Lately I've had to hope several times that an update of the Opnsense won't destroy any functions of my standard setup again. Unfortunately, it happened again  :(

Thanks.
OPNsense 24.7.11_2-amd64

@Steph1corn
can you test please if
opnsense-patch -a kulikov-a 6dbad67
helps?
(I think you could also try raising retry_interval a bit.)

Guys, seriously? Every time the same dilemma that something doesn't work after an update. Does anyone here do a code review and testing before releasing the changes? Doesn't look like it!

Due to recent changes in rc.newwanip and rc.newwanipv6 stale states won't be deleted from the state table any longer.

The problem lies in the changed code in lines 61-78 of the rc.newwanip script (analogous in rc.newwanipv6):

$ip = get_interface_ip($interface);

$cacheip_file = "/tmp/{$device}_oldip";

if (!is_ipaddr($ip)) {
    /* remove previously cached IP since it is gone */
    @unlink($cacheip_file);

    /*
     * Take care of OpenVPN and similar if you generate the event
     * to reconfigure an interface.  OpenVPN might be in tap(4)
     * mode and not have an IP address.
     */
    if (substr($device, 0, 4) != 'ovpn') {
        log_msg("Failed to detect IP for {$interface_descr}[{$interface}]", LOG_WARNING);
        return;
    }
}


get_interface_ip won't return any IP when there is a forced disconnect by the provider. Hence, if (!is_ipaddr($ip)) will be TRUE, the cacheip_file will be deleted and the script will exit.

The script won't reach code lines 166-181 where stale states will be flushed from the state table.
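To make the sequence easier to follow, here is a simplified shell model of that control flow. It is not the real rc.newwanip: the function name newwanip_event, the temporary cache file and the assumption that a flush only happens when a cached address exists and differs from the new one are all illustrative.

```shell
#!/bin/sh
# Simplified model of the control flow described above (not the real script).
CACHE_FILE=$(mktemp)     # stands in for /tmp/{$device}_oldip
: > "$CACHE_FILE"

newwanip_event() {
    ip=$1
    cached=$(cat "$CACHE_FILE")
    if [ -z "$ip" ]; then
        # No address detected: the cache is dropped and the handler returns early.
        : > "$CACHE_FILE"
        echo "no-ip: cache dropped, returned early"
        return 0
    fi
    # Assumption: states are flushed only if a cached address exists and differs.
    if [ -n "$cached" ] && [ "$cached" != "$ip" ]; then
        echo "flush states for $cached"
    else
        echo "no flush"
    fi
    printf '%s\n' "$ip" > "$CACHE_FILE"
}

newwanip_event 203.0.113.5    # first address: nothing cached yet
newwanip_event ""             # forced disconnect by the provider
newwanip_event 203.0.113.99   # new address, but the old one was already forgotten
```

In this model the third event never reports a flush for 203.0.113.5, because the disconnect event in between erased the only record of that address.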

Perhaps you want to share your expertise in the multiple months such code is in development in the future. ;)

The trouble here is that we can't forever cache the IP, but still want to wait out if the IP actually cycles or how long the disconnect happened. I understand the implications, but from experience the old code also had its pitfalls with forever-caching the previous IP.

Not sure how to proceed. I don't see a possible solution within this thread either. Complaining is easy.


Cheers,
Franco

Quote from: franco on February 13, 2023, 08:07:09 AM
Complaining is easy.
Yes, you're right. Sorry for my harsh words.

Can't we just - before we delete the cached IP - kill the old states like we do when the IP address changes? It should look like this then:


        $ip = get_interface_ip($interface);

        $cacheip_file = "/tmp/{$device}_oldip";

        if (!is_ipaddr($ip)) {

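            /* assumption: $cacheip is not set yet at this point, so read it from the cache file first */
            $cacheip = trim(@file_get_contents($cacheip_file));

            if (is_ipaddr($cacheip)) {
                /* kill all states destined for and originating from the cached IP before removing the cached IP */
                mwexecf('/sbin/pfctl -k 0.0.0.0/0 -k %s', $cacheip);
                mwexecf('/sbin/pfctl -k %s', $cacheip);
            }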


            /* remove previously cached IP since it is gone */
            @unlink($cacheip_file);

            /*
             * Take care of OpenVPN and similar if you generate the event
             * to reconfigure an interface.  OpenVPN might be in tap(4)
             * mode and not have an IP address.
             */
            if (substr($device, 0, 4) != 'ovpn') {
                log_msg("Failed to detect IP for {$interface_descr}[{$interface}]", LOG_WARNING);
                return;
            }
        }

Quote from: franco on February 13, 2023, 08:07:09 AM
The trouble here is that we can't forever cache the IP, but still want to wait out if the IP actually cycles or how long the disconnect happened. I understand the implications, but from experience the old code also had its pitfalls with forever-caching the previous IP.

I don't think it is an unsolvable issue, since SOHO routers face similar challenges. The IP does not need to be cached forever.

Quote from: franco on February 13, 2023, 08:07:09 AM
Not sure how to proceed. I don't see a possible solution within this thread either. Complaining is easy.

First we should identify all scenarios of obtaining an IP address when establishing a PPPoE connection (e.g. IPCP, IPv6CP, DHCP, SLAAC etc.). Here are two ideas as a rough sketch for IPv4 assigned via IPCP (further investigation needed):

A (basic):

  1. When closing the PPPoE connection, enumerate all IPv4 addresses assigned to the PPPoE interface
  2. Keep the enumerated IP addresses in the background (e.g. memory, file etc.)
  3. Remove the enumerated IPv4 addresses from the interface
  4. Kill all state table entries (firewall, NAT) related to the enumerated IPv4 addresses
  5. Allow re-connection

B (extended):

  1. When closing the PPPoE connection, enumerate all IPv4 addresses assigned to the PPPoE interface
  2. Keep the enumerated IP addresses in the background (e.g. memory, file etc.)
  3. Start a timer with a predefined time: if it expires, jump to step 6
  4. Based on an external (or MPD5-internal) trigger: try to re-establish the PPPoE connection, stop the timer if successful
  5. Remove the newly obtained IP addresses from the list created in step 2
  6. Remove the IPv4 addresses remaining in the list from the interface
  7. Kill all state table entries (firewall, NAT) related to the IPv4 addresses remaining in the list

Compared to scenario "A", scenario "B" is more sophisticated and needs synchronization of parallel tasks to prevent race conditions. But scenario "B" allows open TCP/UDP connections to continue if the new IP addresses are the same as the ones previously assigned.
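The core of scenario "B" (drop the addresses that came back, kill states only for the rest) can be sketched as a set difference over two address lists. This is only a sketch under assumptions: the helper name stale_addresses and the one-address-per-line file format are hypothetical, and the timer, the mpd5 trigger and the actual pfctl calls are left out.

```shell
#!/bin/sh
# Hypothetical helper: print addresses from the old list that are NOT in the new list.
stale_addresses() {
    old_list=$1   # file, one address per line, saved before the disconnect
    new_list=$2   # file, one address per line, seen after the reconnect
    if [ ! -s "$new_list" ]; then
        # Nothing came back (e.g. timer expired): every old address is stale.
        cat "$old_list"
        return 0
    fi
    # -v invert match, -x whole-line match, -F fixed strings, -f patterns from file
    grep -vxFf "$new_list" "$old_list" || true
}

old=$(mktemp)
new=$(mktemp)
printf '%s\n' 203.0.113.5 203.0.113.6 > "$old"
printf '%s\n' 203.0.113.6 > "$new"

# Only echo the command here; a real script would run pfctl instead.
for ip in $(stale_addresses "$old" "$new"); do
    echo "would run: pfctl -k $ip"
done
```

With the sample lists above, only 203.0.113.5 is reported, so connections bound to the address that was reassigned would stay untouched.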
OPNsense 24.7.11_2-amd64

To be frank A) is already what we are doing. The devil is in the details.

For B) time is indeed loosely coupled. If we cache an IP that comes back after 2 hours from the provider with a different IP, we don't need to flush. It doesn't matter much, but what is a cached IP if we don't keep the last state of an interface, which can also mean address-less because rc.newwanip saw it transition correctly? It's really a two-level cache flush system we may need, but the complexity will end up causing other issues. I'm looking for a simpler solution at the moment.

If a watchdog is required for mpd5 to reconnect I'm afraid someone with a PPPoE connection and the required knowledge of the system will have to implement that. That person isn't me.


Cheers,
Franco

What is doing the job for me atm is a small bash script that checks via cron every 5 minutes if the respective IP (pppoe2 here) has changed. In that case, all states related to the old IP are killed.

Not super sophisticated and might miss some edge cases (e.g. if pppoe2 is down and there is no IP, maybe I'll add that case later), but at least for me it works and allows the SIP channels to re-register (here with debug output).


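#!/usr/local/bin/bash

# Current IPv4 address of pppoe2 (match "inet" exactly so "inet6" lines are skipped)
current_ip=$(ifconfig pppoe2 | awk '$1 == "inet" { print $2; exit }')

# Address seen on the previous run (stays empty on the very first run)
cached_ip=""
[[ -r /opt/cached_ip ]] && read -r cached_ip < /opt/cached_ip

if [[ "$current_ip" == "$cached_ip" ]]; then
    echo "IPs identical"
else
    # Kill all states originating from the old address
    [[ -n "$cached_ip" ]] && pfctl -k "$cached_ip"
    echo "$current_ip" > /opt/cached_ip
    echo "States killed"
fi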
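For the edge case mentioned above (pppoe2 down, no IP), one possible way to structure the decision is sketched below. The function name wan_ip_action and the three outcomes are assumptions for illustration, not part of the script above. The idea is to keep the cached address while the link is down, so the flush still happens once a new address shows up.

```shell
#!/bin/sh
# Hypothetical decision helper for a cron check like the one above.
# Prints one of: keep, wait, flush
wan_ip_action() {
    current=$1   # address currently on the interface, may be empty
    cached=$2    # address remembered from the previous run, may be empty
    if [ -z "$current" ]; then
        echo wait      # link is down: leave the cache file alone until an IP is back
    elif [ -z "$cached" ] || [ "$current" = "$cached" ]; then
        echo keep      # first run or unchanged address: nothing to kill
    else
        echo flush     # address changed: states for the cached address are stale
    fi
}

# Example: the address changed between two runs.
wan_ip_action 203.0.113.9 203.0.113.5
```

A caller would run pfctl -k on the cached address and update the cache file only for flush, write the cache file for keep on the first run, and do nothing for wait.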


Cheers,
Stephan