Hi everyone,
I have a setup where I am passing Unbound DNS over TLS traffic through two VPN tunnels as part of implementing failover. Half the resolver traffic is getting dropped, and I want to see if I can update my firewall or maybe NAT rules to not drop that traffic. I can probably live with the setup without that fix, but that solution is inefficient and I think asking the question is a good learning opportunity.
More Detailed Specification of the Problem
I have traffic going out through the tunnel A interface that was meant for VPN tunnel B. The traffic goes to one of two DNS resolver addresses that my VPN provider uses for DNS-over-TLS (DoT) requests (Mullvad provides several options: https://mullvad.net/en/help/dns-over-https-and-dns-over-tls). That is, when I take a packet capture of the WireGuard instance associated with interface "A_WG" (i.e., WireGuard instance A), I can see DoT TCP SYN packets with a source address of VPN tunnel B and destination addresses of either 194.242.2.3 or 194.242.2.2 (I see both). I never see corresponding acknowledgement packets, so the traffic goes out but nothing comes back. I would like the DoT traffic for VPN tunnel B to be allowed to flow through VPN tunnel A as well as B. If that's not possible, I would like to change my settings to drop the superfluous DNS queries while preserving failover, passing DNS traffic through only one of the VPN tunnels until it fails.
Background
I'm an OPNSense newbie setting up VPN failover using Mullvad VPN with WireGuard. DNS resolution is through Unbound, and I'm pushing the Unbound traffic out through the VPN tunnels. I'm also using DNS over TLS. The primary reason is that when I send DNS requests in the clear, Unbound's queries to the root servers travel from the VPN tunnel endpoints to the root servers, and the root servers reject them (a known problem (https://forum.netgate.com/topic/144430/unbound-queries-to-root-server-via-vpn-being-refused-but-work-when-via-wan)). So I'm sending traffic using DNS over TLS to Mullvad's DNS servers instead. (I wonder if there's a better approach but for now I want to focus on the technical question in this post.)
I've gotten lots of information from forum posts along the way. My main sources have been Schnerring's tutorial (https://schnerring.net/blog/opnsense-baseline-guide-with-vpn-guest-and-vlan-support/) from 2021 and a more recent video guide (https://whatsnewandrew.com/always-on-vpn/) covering the basics of setting up WireGuard connections in OPNSense. For failover, I'm using Christian McDonald's video (https://www.youtube.com/watch?v=wYe7FzZ_0X8) from 2021 on setting up failover in PfSense. I also referred to a more recent forum post (https://forum.opnsense.org/index.php?topic=39061.0) describing how to pass DNS out through the VPN tunnels in a failover scenario, and followed all the steps there to get a minimum viable product for failover.
The Setup
Let's say I have two Mullvad WireGuard VPN interfaces, which I'll call "A_WG" and "B_WG". I've properly created the peer entries and WireGuard instances, established a handshake, and then associated the two WireGuard VPN interfaces with those instances. I've then successfully created two corresponding gateways, which I'll call "A_WG_GW" and "B_WG_GW". I'm using the endpoints of the gateways as monitors: I need the monitors to know when a gateway goes down, for failover. Also, in System --> Routes --> Configuration I have configured static routes to those endpoints through my WAN gateway, so that the monitors work properly. I can then confirm the gateways are up and working properly.
With those gateways up and showing reasonable-looking RTTs (e.g., more than 1ms), I then created an interface group for the two interfaces that I'll call "A_B_WG". I also created a gateway group for the two VPN gateways, assigning "A_WG_GW" to be Tier 1 and "B_WG_GW" to be Tier 2. I'll call the gateway group "A_B_WG_GW". Both of these gateways are configured as far gateways and as upstream gateways, with a priority of 245 for "A_WG_GW" and 250 for "B_WG_GW" (and a priority of 255 for the WAN). (See this post (https://forum.opnsense.org/index.php?topic=35170.0) for the justification for doing that.)
Next I set up DNS over TLS in Unbound, selecting two different Mullvad DNS server addresses (194.242.2.2 and 194.242.2.3) and using Server Port 853 for both in Unbound --> DNS over TLS. I then created a static route to 194.242.2.3 through "A_WG_GW" and a static route to 194.242.2.2 through "B_WG_GW" in System --> Routes.
(The DNS addresses are listed here (https://mullvad.net/en/help/dns-over-https-and-dns-over-tls), and the rationale for associating a different DNS server with each gateway and creating static routes to each DNS server through one of the two gateways is described here (https://forum.opnsense.org/index.php?topic=39061.0) in Step 12.)
Next, in Services --> Unbound DNS --> General --> [select "Advanced mode"] --> Outgoing Network Interfaces I selected "A_WG" and "B_WG" and de-selected all other interfaces, including the WAN. Finally, regarding NAT and firewall rules, I redirect outgoing DNS queries from my VLANs to Unbound, and then I have a rule that takes _outgoing_ traffic from local services on the firewall and routes it through the gateway group, "A_B_WG_GW".
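To make the intent of the two resolver static routes concrete, here is a minimal Python sketch of a destination-only route lookup. It is a simplified model and not OPNsense code; the function name gateway_for is hypothetical, and only the two routes described above are included. Note that the lookup never looks at the source address of the packet.

```python
import ipaddress

# Static routes from the setup above: one DoT resolver per VPN gateway.
STATIC_ROUTES = {
    ipaddress.ip_network("194.242.2.3/32"): "A_WG_GW",
    ipaddress.ip_network("194.242.2.2/32"): "B_WG_GW",
}


def gateway_for(destination):
    """Return the gateway of the most specific matching static route, or None.

    Hypothetical helper: models a plain destination-based lookup only.
    """
    addr = ipaddress.ip_address(destination)
    matches = [net for net in STATIC_ROUTES if addr in net]
    if not matches:
        return None
    # Longest prefix wins, as in an ordinary routing table.
    best = max(matches, key=lambda net: net.prefixlen)
    return STATIC_ROUTES[best]


for resolver in ("194.242.2.3", "194.242.2.2"):
    print(resolver, "->", gateway_for(resolver))
```

Because the result depends only on the destination, static routes like these cannot by themselves tie a given source address to a given tunnel.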
The output of pfctl -sr | grep "wg1" confirms that all outbound traffic from the "B_WG" interface goes through the "A_WG_GW" gateway. That is, the output contains the following rule:
pass out log route-to (wg1 [IP address of "A_WG_GW"]) inet from (wg2) to ! ([Interface group A_B_WG]:network) flags S/SA keep state allow-opts
(I generated that rule with the following firewall floating rule configuration in OPNSense:
    Action: Pass
    Disabled: checked
    Interface: DON'T SELECT ANY
    Quick: NO
    Direction: out
    TCP/IP Version: IPv4
    Protocol: any
    Source: "B_WGQ address"
    Destination: !"B_WGQ net"
    Log: yes
    Category: DNS
    Description: "Route local services on router through B_WGQ_GW"
    Gateway: "B_WGQ_GW"
    Allow options: checked (need to show advanced features in the GUI)
    Leave everything else at its default.)
My understanding of what's going wrong
I can see four kinds of TCP SYN packets getting sent out (when both gateways are online):
1. Packets with a source IP address of the "A_WG" tunnel and a destination of 194.242.2.3 (DNS resolution works)
2. Packets with a source IP address of the "A_WG" tunnel and a destination of 194.242.2.2 (DNS resolution works)
3. Packets with a source IP address of the "B_WG" tunnel and a destination of 194.242.2.3 (no response)
4. Packets with a source IP address of the "B_WG" tunnel and a destination of 194.242.2.2 (no response)
Currently failover works, but half of all my DNS resolution requests time out and so my current configuration would be inefficient in the long run.
I attached a sample packet capture.
Here's what I infer is happening:
1. My selecting both "A_WG" and "B_WG" under "Outgoing Network Interfaces" in Unbound must explain why packets are going out both tunnels.
2. Packets of type #3 and #4 above are those that are sent out the B_WG tunnel but get intercepted when they leave the "wg2" instance and are routed through the A_WG_GW gateway instead. That's why they show up in the "A_WG" tunnel. (Note: the Live View log improperly labels those packets as sent out the "B_WG" interface -- in fact they're only sent _towards_ the "B_WG" gateway and then re-routed before they get there. That looks to me like a bug in how the logs are displayed, since the packet captures show no traffic at all for the "B_WG" WireGuard instance, but that's a post for somewhere else.)
3. Prior to viewing the packet captures, my initial intuition was that the static routes I configured would ensure that among packets of type 1 and 2 above, Unbound would only generate packets of type 1 (with the "A_WG" tunnel address for the source address), and among packets of type 3 and 4 above, Unbound would only generate packets of type 4 (with the "B_WG" tunnel address for the source address). I speculate that the reason I also see packets of types 2 and 3 is that Unbound attempts to send packets to any DNS server for which it is configured for DNS over TLS, using any outgoing network interface, and the routing table has no impact on those settings. I could follow up by reading the OPNSense source code but haven't done so.
4. The reason I'm not seeing a response to packets of types 3 and 4 pushed through A_WG_GW could be that Mullvad only returns traffic through a tunnel when the destination address is that tunnel's own IP address. It could also be that my firewall rules are blocking the traffic.
I therefore set my firewall to allow all traffic from 194.242.2.3 or 194.242.2.2 regardless of its destination address, for the A_WG interface. That didn't work, so now I'm stuck and writing this post.
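The four SYN types and the speculation in point 3 can be sketched in a few lines of Python. This is only a model of the observations above, under the assumption that Unbound pairs any selected outgoing interface with any configured DoT server; the function name syn_types is hypothetical, and the outcomes are the ones seen in the capture, not something the code derives.

```python
from itertools import product

SOURCES = ["A_WG", "B_WG"]                   # Unbound outgoing network interfaces
RESOLVERS = ["194.242.2.3", "194.242.2.2"]   # DoT servers configured in Unbound


def syn_types():
    """List every (source tunnel, resolver) pair with the observed outcome."""
    rows = []
    for source, resolver in product(SOURCES, RESOLVERS):
        # Observed in the capture: only SYNs sourced from tunnel A were answered.
        outcome = "DNS resolution works" if source == "A_WG" else "no response"
        rows.append((source, resolver, outcome))
    return rows


for number, (source, resolver, outcome) in enumerate(syn_types(), start=1):
    print(f"type {number}: {source} -> {resolver}: {outcome}")
```

Two of the four combinations fail, which matches the observation that about half of the DNS requests time out.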
What should I try next, to get the DoT traffic for VPN tunnel B to flow through VPN tunnel A as well as B? I'm happy to provide other packet captures, logs, firewall / NAT rules, etc., and thought this was enough information to start. Thanks very much!
			
Solved! Hooray, here's a solution for posterity.
First, the Mullvad support team confirmed that traffic sent out tunnel A with a source IP address that is the address of the interface for tunnel B will go out, but return traffic will get dropped by the VPN endpoint for tunnel A. (That is, they didn't test it but thought that statement is true.) So, the problem was reduced to one of getting Unbound DNS to stop sending traffic intended for Tunnel B out of Tunnel A, and vice versa. 
The solution I found was to go into Services --> Unbound DNS --> General --> [click 'Advanced Mode'] and uncheck all outgoing interfaces, which changes the value of the outgoing interface setting to "All". That change hands the operating system the decision about which interface to use for traffic leaving Unbound. That's what I want, because the operating system sends traffic out the upstream gateway that has the highest priority, whereas previously Unbound was blindly trying both interfaces regardless of which one had priority. Next I went to System --> Gateways --> Configuration and confirmed that the gateways for my two VPN tunnels were both marked as "upstream gateways" and that they were of a higher priority than the WAN. Then I went to System --> Settings --> General and checked the box for "Allow default gateway switching." I clicked "Apply" where relevant on all these pages and may have also disabled and re-enabled WireGuard. (Feel free to restart Unbound as well.)
At that point failover was working great! DNS over TLS traffic and HTTPS traffic all went out Tunnel A when both WireGuard Interface A and WireGuard Interface B were enabled and the gateways were active. Disabling Interface A sent DNS over TLS and HTTPS traffic out Tunnel B, with no Tunnel A traffic also getting pushed through Tunnel B. Disabling Interface B as well then caused DNS over TLS to get sent straight from the firewall out through the WAN; setting up a killswitch for that with a firewall rule shouldn't be hard as a next step. HTTPS traffic was already blocked by the fact that I was forcing all such traffic out the VPN gateway group -- so with both tunnels down, that traffic wasn't getting sent out. Re-enabling the VPN interfaces also worked as expected.
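The failover behavior described above can be modeled roughly as "use the online upstream gateway with the best priority". The Python sketch below is a simplified model, not OPNsense's actual selection logic. The name WAN_GW, the Gateway class and pick_default are hypothetical, and the WAN is assumed to be an upstream gateway too. The priority values are the ones from my setup, where a lower number means a higher priority.

```python
from dataclasses import dataclass


@dataclass
class Gateway:
    name: str
    priority: int   # lower number = higher priority
    upstream: bool
    online: bool


def pick_default(gateways):
    """Return the name of the preferred online upstream gateway, or None."""
    candidates = [g for g in gateways if g.upstream and g.online]
    if not candidates:
        return None
    return min(candidates, key=lambda g: g.priority).name


def scenario(a_online, b_online):
    """Build the three gateways from the post with the given tunnel states."""
    return [
        Gateway("A_WG_GW", 245, True, a_online),
        Gateway("B_WG_GW", 250, True, b_online),
        Gateway("WAN_GW", 255, True, True),  # hypothetical name for the WAN gateway
    ]


print(pick_default(scenario(True, True)))    # both tunnels up
print(pick_default(scenario(False, True)))   # tunnel A disabled
print(pick_default(scenario(False, False)))  # both tunnels disabled
```

The last case is the one that needs a killswitch: with both tunnels down, the model falls through to the WAN, just as the DNS over TLS traffic did.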
So that's the answer. Also, to verify all this: with the new changes there are no entries for outgoing interfaces in /var/unbound/unbound.conf, whereas previously I saw entries for the interface addresses of both Tunnel A and Tunnel B.
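For anyone who wants to script that check, here is a minimal Python sketch that lists the outgoing-interface entries in an Unbound configuration. It assumes the entries use Unbound's outgoing-interface: option; the helper name outgoing_interfaces is hypothetical, and the addresses in the sample text are placeholders, not my real tunnel addresses.

```python
def outgoing_interfaces(conf_text):
    """Return the addresses of all outgoing-interface entries in conf_text."""
    addresses = []
    for line in conf_text.splitlines():
        # Drop comments and surrounding whitespace before matching.
        line = line.split("#", 1)[0].strip()
        if line.startswith("outgoing-interface:"):
            addresses.append(line.split(":", 1)[1].strip())
    return addresses


# Placeholder sample resembling the "before" state (addresses are made up).
SAMPLE_BEFORE = """\
server:
    port: 53
    outgoing-interface: 10.64.0.2
    outgoing-interface: 10.64.0.3
"""

# Placeholder sample resembling the "after" state: no entries at all.
SAMPLE_AFTER = """\
server:
    port: 53
"""

print(outgoing_interfaces(SAMPLE_BEFORE))
print(outgoing_interfaces(SAMPLE_AFTER))
```

On the firewall itself, the same function could be fed the contents of /var/unbound/unbound.conf; an empty list would correspond to the "All" setting.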
This forum post (https://forum.netgate.com/topic/152314/interface-groups-vs-lagg-multi-wan-dns-streaming-service-problems) answered my question.
			
			
			
Also, I can confirm that Michael Schnerring's tutorial (https://schnerring.net/blog/use-custom-dns-servers-with-mullvad-and-any-wireguard-client/) for running Unbound traffic through the Mullvad VPN tunnels without using query forwarding works as of this posting. Now the DNS "leaks" (https://schnerring.net/blog/opnsense-baseline-guide-with-vpn-guest-and-vlan-support/#vlan20_vpn-dns-leak-test) match the public IP addresses of the VPN tunnels.