Releases >= 24.7.1 have stopped outgoing VoIP - an MTU issue

Started by fbantgat7, September 01, 2024, 05:29:15 PM

Previous topic - Next topic
September 03, 2024, 03:22:16 PM #15 Last Edit: September 03, 2024, 03:47:27 PM by BoodahsFever
Quote from: chemlud on September 03, 2024, 02:21:41 PM
Have here several Gigasets, no special gymnastics needed to make them work. Just some FW rules to allow the IPs of the provider and done here...

Exactly, and this is how it is supposed to work. Normally you should not need static translations. I have a few OPNsense boxes installed at large sites without needing to do anything special to make phones work.

I hope I explain this right.


Outbound calls:


In a line like this in the message body of an INVITE, the phone announces which port it would like to receive RTP on (the firewall doesn't do anything with this):

m=audio 52154 RTP/AVP 8 0 9 101

Likewise, in the subsequent 200 OK message the other party announces which port it wants to receive RTP on:

m=audio 58130 RTP/AVP 8 101

This is also part of codec negotiation.

After the 200 OK and some other messages, the phone starts sending RTP to the port announced in the 200 OK, from the port it announced itself in the first INVITE. OPNsense should create the PAT/NAT translation and firewall pinhole that the other party can return the RTP through, because that stream is being sent and allowed. No proxy is required on the edge side of things.

RTP packet Source port 52154  Destination port 58130
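If you want to watch this negotiation on the firewall itself, a capture along these lines should do (interface name is just an example; the RTP ports are the ones from the messages above):

~ # tcpdump -ni pppoe0 -s 0 -vv udp port 5060
~ # tcpdump -ni pppoe0 udp port 52154 or udp port 58130

The first capture shows the INVITE/200 OK with their m=audio lines, the second whether RTP actually flows in both directions on the negotiated ports.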

Inbound Calls:

For INBOUND calls the server sends the INVITE to the address and port (randomised by the edge firewall/router) held in the SIP REGISTER cache table maintained by the operator's SBC. The principle is the same, but this time the SIP server announces its port in the INVITE and the client announces its port in the 200 OK. The client again starts the RTP stream, and so on.

The proxy in this case sits on the operator's side, and I call this an SBC (Session Border Controller).
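If you want to see the translation and pinhole that the registration keeps open, the state table can be inspected from the OPNsense shell, for example (assuming the phone signals from port 5060):

~ # pfctl -ss | grep ':5060'

A long-lived UDP state from the phone towards the provider's SBC is what lets the inbound INVITE reach the phone on the (randomised) translated port.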

The only time I can imagine you would need a static NAT translation is when you don't use the SIP REGISTER mechanism and the SIP server uses static registrations instead.

Just search the forums for "VoIP" and "static port" and see how many people have had problems that were fixed by it (me being one of them, and I know others who use Fritzboxen).

I also know how SIP is supposed to handle things and how a firewall "should" do it. As a matter of fact, I run my own OpenSIPS server. But SIP is a very old standard and things have evolved quite a bit since I started using it in the early 90s.

I think you have three choices to make things work even in corner cases:

1. Do not allow arbitrary port translations (aka enable "static port") on outbound NAT (see the sketch after this list)
2. Employ a SIP proxy that monitors and rewrites SIP messages according to the translations in use
3. Use IPv6
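For option 1, the OPNsense GUI way is Firewall > NAT > Outbound (hybrid rule generation) with "Static port" ticked for the phone's traffic. At the pf level that roughly corresponds to a rule like the following sketch (interface and address are placeholders, not taken from this thread):

nat on pppoe0 inet proto udp from 192.168.1.20 to any -> (pppoe0) static-port

With static-port, pf keeps the phone's original source ports (5060 for SIP, the announced port for RTP) instead of rewriting them to random ones.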

Other than that, maybe you are lucky and your equipment and/or VoIP provider works without using one of the "guaranteed" ways above.

At least I am quite sure that using "static port" will do no harm whatsoever. If you have an explanation or a solution for the OP on why his outgoing SIP calls do not work, go ahead.

Up to this point, I still think that the outbound NAT rules are wrong one way or another. Maybe the DNS entries do not map to the suspected IPs, maybe it is even worse and he is behind CGNAT or other type of double NAT.

Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 770 up, Bufferbloat A

Quote from: meyergru on September 03, 2024, 03:56:46 PM
At least I am quite sure that using "static port" will do no harm whatsoever. If you have an explanation or a solution for the OP on why his outgoing SIP calls do not work, go ahead.

Up to this point, I still think that the outbound NAT rules are wrong one way or another. Maybe the DNS entries do not map to the suspected IPs, maybe it is even worse and he is behind CGNAT or other type of double NAT.

I agree it won't harm. I'm just saying that in a B2BUA setup with symmetric signalling, which is pretty common these days, it normally shouldn't be needed. Removing the NAT config could also remove errors in his config.

I agree it is probably some misconfiguration, and that could also be a misconfiguration of the phone. If he is behind CGNAT or double NAT, I wish him the best of luck. I hope he has IPv6 working in that case.

Thank you guys for all your help!  :)

TL;DR: In my case the problem was with the way the recent kernel changes affected how the MTU is processed. Just to confirm, in my case I did not need "static port" or any additional outbound NAT beyond the rules created automatically when setting up port forwarding. This is because the phones and the ISP's SIP gateway are 'smart' enough to honour the SIP/RTP ports requested by the phones and use them reliably.

Since I started looking at this problem, I've spent some time checking firewall rules and trying all the different settings meyergru suggested, including "static port" for outbound NAT.  I tried setting a Hybrid Outbound rule with:

Destination address: any

and when it made no difference with:

Destination port: the ISP's SIP/RTP addresses.

Still no success. I also removed port translation as recommended and took the opportunity to tidy up the addresses in my aliases for the ISP's SIP/RTP gateway. Without a usable VoIP setup I was heading back to the 24.7 firmware, but as a last thing before calling it a day I decided late yesterday to try IPv6, thinking this would let me do away with NAT and port-forwarding complications, or potential configuration errors. At this point I discovered that with IPv6 the phone registration would fail altogether! So instead of the IPv4 issue of just outgoing calls failing, with IPv6 the whole lot stopped working. I suspected an OPNsense MTU regression (it has happened before), but ran out of time.

This morning I checked the WAN MTU and, sure enough, things looked different:

igb0: flags=1008843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1508
        options=4802028<VLAN_MTU,JUMBO_MTU,WOL_MAGIC,HWSTATS,MEXTPG>
[snip ...]
pppoe0: flags=10088d1<UP,POINTOPOINT,RUNNING,NOARP,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1508
        description: WAN (wan)


Then I saw BoodahsFever's post (thank you!), mentioning the INVITE packet size and how it can grow as ever more RTP codecs are added to it. So I changed the phone settings to advertise only two codecs suitable for my ISP, which reduced a 1512-byte WAN packet down to 1376 bytes, and from then on outgoing calls became possible again.   :D
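For illustration only (payload types taken from the m=audio example earlier): with just two codecs advertised, the audio section of the SDP shrinks to something like this, and every dropped codec also drops its rtpmap/fmtp lines, which is where the size saving comes from:

m=audio 52154 RTP/AVP 8 101
a=rtpmap:8 PCMA/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-16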

NOTE: Some months/years ago I had to keep setting the MTU of igb0 manually to 1508 using ifconfig, so that the WAN pppoe0 MTU would carry a full IPv4 Ethernet payload of 1500. Now ifconfig displays the Ethernet MTU as 1508, but ping tests indicate it is still 1500.
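For reference, that manual workaround on the console was simply along the lines of the following (and it does not persist across reboots):

~ # ifconfig igb0 mtu 1508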

Either way, should packet fragmentation break VoIP?

Something in the 24.7.x changes caused my specific problem, but more applications/devices may be lurking out there, similarly affected. Is this something to be addressed in a future update?

Quote from: meyergru on September 03, 2024, 03:56:46 PM
Up to this point, I still think that the outbound NAT rules are wrong one way or another. Maybe the DNS entries do not map to the suspected IPs, maybe it is even worse and he is behind CGNAT or other type of double NAT.
Perhaps my reasoning was wrong, but I was thinking that the only thing that changed to cause the phones to stop calling out was the OPNsense firmware update. It could be that the update allowed hidden errors in my configuration to show up, and/or perhaps something else was clashing with the latest firmware changes. As it happened, it was an MTU issue causing fragmentation.

Either way, thank you meyergru for your config suggestions; they helped me improve and tidy up my rules.

Interesting, and in two ways:

1. I had never imagined that a SIP INVITE or REGISTER could be that big... so BoodahsFever was right!

2. I also use MTU 1508 over PPPoE with a VLAN in order to get a net 1500 MTU, and ifconfig on my pppoe0 shows 1508. The underlying physical interface igc0 (on which the VLAN sits) even has 1512.

I configured all of this manually before 24.7 and checked via a PMTU tool from here: https://www.baeldung.com/linux/maximum-transmission-unit-mtu-ip that my actual MTU really is 1500 bytes. I have to say that when I first tried this, I noticed that the presumed "automatic" MTU inheritance never worked for me over pppoe0 -> vlanX -> igc0.

I re-checked just now and the settings are still intact. If you rely on the automagic, that could have changed in the meantime. You probably would have noticed problems with some websites that have no PMTU discovery as well.
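For anyone wanting to verify the effective IPv4 WAN MTU from the OPNsense shell (as root), a don't-fragment ping pair like this should do it (1472 bytes of payload + 8 bytes ICMP header + 20 bytes IPv4 header = 1500, so the first should go through and the second should fail with "Message too long" if the MTU really is 1500):

~ # ping -4 -c 1 -D -s 1472 www.google.com
~ # ping -4 -c 1 -D -s 1473 www.google.com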
Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 770 up, Bufferbloat A

September 04, 2024, 08:41:50 AM #21 Last Edit: September 04, 2024, 02:18:34 PM by BoodahsFever
Quote from: fbantgat7 on September 03, 2024, 05:56:25 PM

Either way, should packet fragmentation break VoIP?


That depends on the operator. Some allow packet fragmentation on access realms, others don't. Most operators handle packets larger than 1500 bytes with the advice to use TCP instead of UDP. Of course, they should then allow TCP.

Oracle (Acme Packet) SBCs now ship with the advice to use TCP for SIP, period, as INVITEs keep growing bigger these days.
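For what it's worth, on most phones and PBXes switching SIP to TCP just means selecting TCP as the transport, or adding the standard transport parameter to the SIP URI (RFC 3261); the account and server name below are made up:

sip:0123456789@sip.provider.example;transport=tcp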

Quote
I had never imagined that a SIP INVITE or REGISTER could be that big.

SIP REGISTER messages usually don't get very big. That's why fbantgat7's phones were still registering, which caused me to suspect an MTU issue.

Quote
You probably would have noticed problems with some websites that have no PMTU discovery as well.

I think this affects UDP only, but I can't be sure.


No, OPNsense is installed on bare metal and its igb0 NIC is connected directly to the ONT with a CAT6 cable.

I've had an on-and-off MTU configuration issue with OPNsense for years now. It used to be the case that when I set the WAN's PPPoE MTU in the GUI to 1508, this would push the MTU of igb0 to 1508, allowing 8 bytes for the PPPoE header + session ID and leaving 1500 bytes for the payload through the WAN. Unlike meyergru I don't use a VLAN, but if I did, the VLAN tag would add another 4 bytes, increasing the frame size to 1500+8+4=1512.
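Restating that frame budget for both cases (nothing new, just the arithmetic in one place):

1500 (IP payload) + 8 (PPPoE)                  = 1508 -> MTU needed on the physical NIC without a VLAN
1500 (IP payload) + 8 (PPPoE) + 4 (802.1Q tag) = 1512 -> MTU needed on the physical NIC with a VLAN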

At some point the WAN interface settings GUI no longer worked this way, so I would set the MTU of igb0 manually to 1508 on the console, which would provide a 1500 MTU for the WAN interface. Annoyingly, I had to repeat this operation after each update/upgrade of the OPNsense firmware - the setting would not stick over a reboot.

For the last few months, up to and including 24.7, the MTU setting had started to stick, so I didn't have to look into (re)setting it manually.

Anyway, after I saw that the phone would also fail to work over IPv6 and spotted errors in the phone logs mentioning something about no PDU received, I thought it must be an MTU issue again. I checked with ifconfig and the Interfaces GUI details and was surprised to see pppoe0 showing an MTU of 1508 instead of 1500. The igb0 link also showed MTU 1508, as it should. They couldn't both be right. However, ping tests on the router showed the WAN MTU was indeed 1500:

~ # ping -6 -c 1 -D -s 1452 google.com
PING(1500=40+8+1452 bytes) 2001:XXXXXXXXXXXXXX --> 2a00:1450:4009:827::200e
1460 bytes from 2a00:1450:4009:827::200e, icmp_seq=0 hlim=120 time=5.871 ms

~ # ping -6 -c 1 -D -s 1453 google.com
PING(1501=40+8+1453 bytes) 2001:XXXXXXXXXXXXXX --> 2a00:1450:4009:827::200e
ping: sendmsg: Message too long
ping: wrote google.com 1461 chars, ret=-1

The above ping tests are no different when run with firmware 24.7, although with 24.7 the pppoe0 MTU is displayed correctly as 1500 not 1508.

Here is what I fail to understand:

Despite the displayed MTU size anomaly, ping tests indicate an effective WAN MTU of 1500 bytes with both 24.7 and 24.7.x - so why does a larger INVITE datagram go through fine on 24.7, but fail to do so with 24.7.x? What else is at play here?

September 04, 2024, 02:30:08 PM #24 Last Edit: September 04, 2024, 03:13:40 PM by BoodahsFever
Yeah, something seems off with either MTU or packet fragmentation, but it is hard to pinpoint.

At home I run OPNsense as a VM inside Proxmox. I know, not recommended and all that, but I find it convenient.

Before, I was connected via a cable ISP with the cable modem in bridged mode. No PPPoE, just Ethernet with an MTU of 1500 and DHCP/SLAAC for IPv4/IPv6. The WAN-facing interface was also virtual (VirtIO driver). Never any problems with packet fragments. I work for a telco, so I can monitor our SBCs, and we accept fragmented packets. So packets larger than 1500 bytes are known to arrive and be processed by our SBC.

After that, fibre was connected at my place, so I changed ISP to make use of it. Now it is PPPoE and my settings are the same as meyergru's. When rebooting OPNsense or Proxmox I used to get your issues as well. Then I would reboot the ONT and everything started working again. I could see the packets leaving OPNsense, but they never arrived at the SBC; there was no way to see where the packets were dropped. I thought it was maybe an auto-negotiation issue between Proxmox and the ONT. Note that this also involved UDP packets larger than 1500 bytes.

After a while I was fed up with that, so I set up PCI passthrough in Proxmox, letting OPNsense manage the WAN interface, and the problem is gone. Everything works as expected again. It seems weird to me, as the operating system is responsible for MTU and packet fragmentation (L3) since the OS controls the IP stack; the network driver should only be responsible for L2.

I am having a hard time wrapping my brain around this.

I'm on release 24.7.4_1 now, but the same RTP fragmentation problem remains.   :(

I noticed something with ping, though I can't recall if it was the same on 24.7. When I ping using IPv4 to check the MTU size, it complains about the packet size (unless I use sudo):
~ $ ping -4 -c 1 -D -s 1472 www.google.com
ping: packet size too large: 1472 > 56: Operation not permitted


However, no such problem with IPv6:
~ $ ping -6 -c 1 -D -s 1452 www.google.com
PING(1500=40+8+1452 bytes) 2001:XXXXXXXXXXX --> 2a00:1450:4009:81f::2004
1460 bytes from 2a00:1450:4009:81f::2004, icmp_seq=0 hlim=120 time=6.121 ms

--- www.google.com ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 6.121/6.121/6.121/0.000 ms


Could someone please confirm whether the ping packet size is meant to have this constraint on IPv4 only, or whether it is a result of the later releases?