Messages - mattlf

#1
Thanks for the reply @muchacha_grande. The VIP was already defined, though, and inbound + non-SIP outbound traffic was working correctly, so that wasn't it.

I have since upgraded to version 25.1.6, captured some outbound SIP packets, and the issue seems resolved...

I cannot see anything in the changelog that would obviously be related, but seeing as it was working correctly on ~25.1.3, not on 25.1.5, and is now working correctly on 25.1.6 (without me changing any configuration), I'm going to assume this was a bug introduced in .4 or .5. I'd recommend an update if anyone is encountering the same issue on those versions.
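
For anyone comparing notes, a quick way to confirm exactly which release a box is on is from the shell (if memory serves, the changelog itself is under System > Firmware > Changelog in the GUI):

opnsense-version
# prints something like: OPNsense 25.1.6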

Would love confirmation for my own sanity that it was a bug and not just me doing something silly, but in any case it looks to be working fine on 25.1.6.
#2
25.1, 25.4 Production Series / NAT issues with SIP
April 30, 2025, 02:11:23 PM
Currently on OPNsense 25.1.5_5, recently upgraded from 25.1.3 or 25.1.2, where the issue didn't exist. I've checked my audit log and no other configuration has changed since.

I currently cannot make outbound calls via my PBX server (the IPs mentioned below are faked). I use Hybrid Outbound NAT generation.

I have my OPNsense firewall GW at 50.0.0.1
Default traffic goes outbound via 50.0.0.1 (the auto NAT rule)
My PBX server is locally at 10.0.0.100
I NAT all inbound traffic for the PBX server to 50.0.0.5, all works fine
I have an outbound NAT rule to translate all traffic from 10.0.0.100 to 50.0.0.5; the rule is super simple and looks like:

Interface: WAN
Protocol: any
Source: 10.0.0.100
Source port: any
Destination: any
Destination port: any
Translation / target: 50.0.0.5
Static port: yes

placed at the top of outbound rules
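
For reference, if I've read the generated ruleset right, that entry should end up in pf as roughly the following (em0 standing in for whatever the WAN interface actually is), and pfctl can confirm it's loaded and sitting above the automatic rule:

# rough pf equivalent of the manual rule above (em0 = WAN placeholder)
nat on em0 inet from 10.0.0.100 to any -> 50.0.0.5 static-port

# check what's actually loaded and in which order (first match wins for nat rules)
pfctl -s nat | grep 50.0.0.5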

If I run a public IP check from the FreePBX server (e.g. curl ifconfig.me), I can verify it's being correctly NAT'd publicly to 50.0.0.5.

If I manually send some packets from the PBX machine to another remote server via the port it typically uses (5060) and monitor with tcpdump, the source IP is correctly 50.0.0.5.
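
Roughly what I mean by that test, in case anyone wants to reproduce it (203.0.113.10 is just a placeholder test host, em0 stands in for the WAN interface, and the netcat flags are OpenBSD-style):

# on the PBX (10.0.0.100): send a throwaway UDP packet from source port 5060
echo test | nc -u -p 5060 -w 1 203.0.113.10 5060

# on the firewall (or the remote box): watch what source address goes out
tcpdump -ni em0 udp port 5060
# here the packets show up with source 50.0.0.5, as expected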

When performing an outbound call from the PBX machine to our 3rd-party SIP trunk provider, a packet capture via OPNsense's Diagnostics (attached) shows the traffic going out via the default GW IP, 50.0.0.1, not 50.0.0.5, which it should be translated to.

I've also tried adding a catch-all outbound rule for the 3rd-party IP, forcing anything going to it to use 50.0.0.5 as well, but that also didn't work.
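
One more thing I've been poking at: the live state table, since (as far as I understand it) an outbound NAT rule is only consulted when a new state is created, so an older state could keep using the old translation. 198.51.100.25 below is a placeholder for the trunk provider's IP:

# which translation is the SIP flow actually using right now?
pfctl -s state | grep 5060
# the address outside the parentheses is what the provider sees, e.g.
#   udp 50.0.0.1:5060 (10.0.0.100:5060) -> 198.51.100.25:5060  MULTIPLE:MULTIPLE

# if a stale state is pinning the old translation, kill states from the PBX
pfctl -k 10.0.0.100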

I'm not currently in a position to restore an earlier version of 25.1.x to confirm if it's related, but it's my best guess at the moment. Would anyone have any other ideas to check?

Thanks for any help.


#3
Thanks Patrick. Working further with the guys who control the upstream stack, this does sound like the most likely explanation, although they're convinced that both switches are connected and support multicast...

Just in case it assists anyone else searching for a similar problem: out of interest, and impatience, I decided to add my own L2 switch into the stack above my firewalls, connected both upstream cables and both firewalls to it, and a failover test primary->backup->primary works flawlessly with that additional layer. It's just not ideal, so I'll keep at it with the other team. In my mind this test proves what you suggested.

Interestingly, I noticed that if I return to the previously connected setup and replicate the bricked state again, then disable my outbound rule on the primary FW that makes everything use x.x.x.20 (the WAN VHID) and instead use x.x.x.21, the upstream gateway works absolutely fine again. I want the WAN VHID, so that's not a solution for me, but it makes me suspect something like an ARP caching problem on their switches (x.x.x.20 still being routed to the backup FW despite the primary coming back up?).
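
One way I'm thinking of checking that suspicion from my side, since I can't see their switches: after a failback, sniff the Backup's WAN port and see whether traffic for the VHID address is still being delivered there (em0 stands in for the real WAN interface; tcpdump's default promiscuous mode should show the frames even though the Backup isn't answering for the VIP):

# on the Backup FW, after the Primary has taken MASTER back
tcpdump -ni em0 host x.x.x.20
# if packets for x.x.x.20 keep landing here, the upstream switch is still
# forwarding the CARP MAC towards the Backup's port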

I don't currently suspect it's OPNsense or me writing bad config, but I'm happy to hear any other insights! I just want this relatively simple thing resolved.

#4
I believe each firewall is connected to a separate Juniper switch; I don't have visibility of the model/config/capabilities as they're not owned or managed by me.

If you have any suspicions though I can relay them to the team responsible for managing them and fill in the gaps.
#5
I have a Primary/Backup setup with CARP sitting in a rack; each WAN is a separate connection to additional network infra I do not have visibility of.

Both FWs are sitting in a /29 (x.x.x.16/29); x.x.x.16-19 are not mine

Primary WAN is at x.x.x.21/29, LAN at 10.50.50.10/24
Backup WAN at x.x.x.22/29, LAN at 10.50.50.20/24
Both point at x.x.x.17/29 as their default Gateway
WAN VHID at x.x.x.20/29
LAN VHID at 10.50.50.1/24
Advertising frequency (base/skew) is 1/0 on the Primary, 1/100 on the Backup
Primary PFSync at 10.0.0.1
Backup PFSync at 10.0.0.2

Everything works fine initially: the Primary syncs to the Backup fine, and I can see chatter on pfsync. Viewing the VIPs status page, I can see both at a CARP demotion level of 0; the Primary knows it's the MASTER of both the WAN and LAN VHIDs, and the Backup knows it's the BACKUP.
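
For what it's worth, the same status can be read from the shell, which is how I've been double-checking the GUI (em0/em1 stand in for the real WAN/LAN interfaces):

# CARP state per interface; on the Primary I expect MASTER on both
ifconfig em0 | grep carp    # e.g. carp: MASTER vhid 1 advbase 1 advskew 0
ifconfig em1 | grep carp

# the demotion counter the VIPs status page reports
sysctl net.inet.carp.demotion    # 0 when healthy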

If I simulate the Primary FW going down (pull the power out), there's a tiny outage as expected, but within 1-2 seconds the Backup has taken over and I can see the Backup FW has become the MASTER.

Once the Primary FW has come back online, I can see the Backup relinquish its claim on the VHIDs and move back into a BACKUP state, and the Primary FW becomes the MASTER for both; I can confirm this by connecting to the LAN VHID at 10.50.50.1.


However, despite the Primary now being in control of both VHIDs, upstream traffic becomes unusable, similar to a network loop. LAN remains fine. If I physically remove the WAN cable from the Backup machine or power it off, the Primary quickly becomes happy again and everything's good; if the Backup then rejoins, it causes no additional interference and stays in its BACKUP state waiting for another failover.

My 2 suspicions: either the Backup FW is somehow not fully relinquishing the WAN VHID back to the Primary, despite it looking like it has via the GUI; or there is something going on in the non-owned switches these firewalls are connected to that I don't have access to, potentially something like them caching the Backup FW for .20/29 at the initial failover event, so that despite the VHID having been relinquished back to the Primary as it comes back online, the switch hasn't noticed and is still trying to route traffic to the Backup.
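
One thing I'm planning to capture to narrow that down: when a CARP member takes MASTER back it should announce itself upstream (gratuitous ARP plus the ongoing CARP advertisements), so watching the Primary's WAN during a failback should show whether that announcement actually goes out. em0 stands in for the real WAN interface, and the MAC filter assumes the WAN VHID is 1 (the CARP virtual MAC is 00:00:5e:00:01:<vhid>):

# on the Primary's WAN during a failback: ARP plus CARP advertisements (IP protocol 112)
tcpdump -eni em0 'arp or ip proto 112'

# optionally narrow it down to the virtual MAC itself
tcpdump -eni em0 ether host 00:00:5e:00:01:01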

I have raised a ticket with the team that manages the upstream switches as well, as it sounds to me more likely that OPNsense is behaving correctly in this instance, but has anyone else experienced something like this before, or does anyone have suggestions of where to start looking to verify whether the problem is perhaps in my configuration? Thanks for any help or suggestions.

#6
After more googling I found the issue on the plugins repo on GitHub, https://github.com/opnsense/plugins/issues/2550, for anyone else looking.
#7
Hi,

We're on the latest version, 21.7.3_1, and finding that when we issue/renew SSL certificates through the ACME Client they're successfully generated, but they're still being linked to an expired root certificate (see attachment). Does anyone know if there's another step we need to take to remedy this? Thanks for any help.
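
In case it helps with diagnosis, this is roughly how we've been checking which chain is actually being served (fw.example.com:443 is a placeholder for the host/port the certificate is deployed on):

# show the full certificate chain the server presents
openssl s_client -connect fw.example.com:443 -servername fw.example.com -showcerts </dev/null
# the "Certificate chain" section lists each issuer, so the expired root
# shows up here if it's being included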