WireGuard Kill Switch Fails / Requires State Reset - Traffic Exits WG Interface

Started by DoomDude, April 09, 2025, 05:07:21 AM

Hello OPNsense Community,

I'm seeking help with a persistent WireGuard kill switch issue on OPNsense 25.1.4_1 where standard methods are failing, likely due to traffic bypassing the intended logic.

Please bear with me (I'm a noob).

1. Objective:
Route all traffic from my LAN (10.0.10.0/24) through a NordVPN WireGuard tunnel (wg0) with a reliable, automatic kill switch that blocks LAN internet access if the NordVPN gateway object is disabled or down.

2. Environment:

    OPNsense: 25.1.4_1 (amd64) running in VirtualBox on a Windows 11 host.
    Host NICs: Realtek 2.5GbE (WAN bridge), Intel I211 Gigabit (LAN bridge), USB WiFi (Alfa bridge).
    OPNsense Interfaces:
        WAN (em0): Bridged to Realtek, IP 10.0.0.20/24 (via DHCP from 10.0.0.1).
        LAN (em1): Bridged to Intel, Static IP 10.0.10.1/24.
        WAN_NordLynx (wg0): WireGuard interface, IP 10.5.0.2/10.
    Laptop Client: Static IP 10.0.10.11, Gateway 10.0.10.1, connected to host Intel NIC port. IPv6 disabled on adapter. Hamachi persistent route (25.0.0.1) found and removed.
    OPNsense Gateways:
        WAN_GW: Default (10.0.0.1).
        NordVPN: Uses WAN_NordLynx interface, GW IP 10.5.0.1, Monitor IP 103.86.96.100, Monitoring Enabled, Kill states when down Checked, Disable host route Checked. Status page shows it as Online when enabled.
    WireGuard Instance (wg0): Disable Routes Checked. MTU 1420. Gateway field within instance settings is blank.
    NAT: Hybrid Outbound NAT with rule for WAN_NordLynx interface, LAN net source, Interface address translation. (Verified correct).
    Advanced Firewall Settings: Confirmed to be OPNsense defaults (Firewall Optimization: normal, Sticky connections: checked, etc.).

3. Primary Method Attempted: Tag + Floating Rule

    LAN Rule: Pass | Quick ✓ | IPv4 | Source: LAN net | Dest: any | Gateway: NordVPN | Log ✓ | Advanced -> Set local tag: NORDVPN_TRAFFIC. Positioned correctly below anti-lockout, no "Default Allow LAN->Any" rule below it.
    Floating Rule (Kill Switch): Block | Quick ✓ | Interface: WAN | Direction: out | IPv4 | Proto: any | Src: any | Dest: any | Log ✓ | Advanced -> Match local tag: NORDVPN_TRAFFIC. Positioned correctly below default firewall-out rules.
    WAN_NordLynx Rule: Pass | Quick ✓ | IPv4 | Direction: in | Proto: any | Src: any | Dest: any.
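
For readers following along, the intent of the Tag + Floating Rule method above can be sketched as a toy model. This is only an illustration in Python, not pf or OPNsense code; the class and function names are hypothetical, and only the tag name NORDVPN_TRAFFIC and the interface names come from the rules listed above.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    src: str
    dst: str
    tags: set = field(default_factory=set)


def lan_rule(pkt: Packet, egress_if: str) -> str:
    """Model of the LAN pass rule: set the local tag, then hand the packet
    to whichever egress interface routing selects (hypothetical helper)."""
    pkt.tags.add("NORDVPN_TRAFFIC")
    return egress_if


def floating_kill_switch(pkt: Packet, out_if: str) -> str:
    """Model of the floating rule: block only tagged packets leaving WAN."""
    if out_if == "WAN" and "NORDVPN_TRAFFIC" in pkt.tags:
        return "block"
    return "pass"


# A tagged packet leaving through the tunnel is never seen by the WAN rule.
via_tunnel = Packet("10.0.10.11", "1.1.1.1")
verdict_tunnel = floating_kill_switch(via_tunnel, lan_rule(via_tunnel, "wg0"))

# The same tagged packet is blocked only if it actually tries to leave WAN.
via_wan = Packet("10.0.10.11", "1.1.1.1")
verdict_wan = floating_kill_switch(via_wan, lan_rule(via_wan, "WAN"))
```

The point of the sketch is that the floating rule can only act on packets that attempt to egress WAN; anything that still leaves via wg0 is outside its reach.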

4. Observed Behavior & Diagnostics (Problem)

    (Kill Switch Failure): When the NordVPN gateway object is Disabled, the laptop ping (10.0.10.11 -> 1.1.1.1) continues to succeed.
    (Manual Reset Required): Only after manually resetting states (Firewall -> Diagnostics -> States) does the ping stop (kill switch engages).
    (Recovery Failure): When the NordVPN gateway object is re-enabled, the ping remains stopped.
    (Manual Reset Required for Recovery): Only after manually resetting states again does the ping resume (though this failed in the very last test).
    Evidence of Bypass Path:
        State Table: Shows that ICMP 10.0.10.11 -> 1.1.1.1 does match the LAN rule (Route LAN traffic via NordVPN). It also shows an anomalous second state: icmp 10.5.0.2 -> ... 1.1.1.1, Rule: let out anything from firewall host itself.
        Packet Capture wg0: Shows ICMP echo requests/replies (10.5.0.2 <-> 1.1.1.1) egressing/ingressing wg0 while the NordVPN gateway object is disabled.
        Packet Capture WAN: Shows no corresponding ICMP traffic leaving WAN (em0).
    Routing/Log Anomalies:
        System -> Routes -> Status shows host route 10.5.0.1 via wg0 present when WG service is up.
        WireGuard logs show setting inet interface route to 10.5.0.1 via wg0 despite Disable Routes being checked in the GUI.
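
One reading that would be consistent with the four observations at the top of this section is that the egress decision is stored with the firewall state when the state is created, and that rules are only re-evaluated once that state is gone. The following is a minimal sketch of that idea, not a description of pf internals. It assumes (unconfirmed in this thread) that with the gateway disabled the LAN rule still passes traffic but without policy routing, so new flows follow the default route to WAN. All names are hypothetical.

```python
class ToyFirewall:
    """Toy stateful filter: rules run only when no state exists for a flow."""

    def __init__(self):
        self.gateway_enabled = True
        # (src, dst) -> egress interface chosen when the state was created
        self.states = {}

    def _evaluate_rules(self, src, dst):
        # Assumption: gateway disabled means no policy route, so the flow
        # falls back to the default route (WAN).
        return "wg0" if self.gateway_enabled else "WAN"

    def send(self, src, dst):
        key = (src, dst)
        if key not in self.states:
            self.states[key] = self._evaluate_rules(src, dst)
        egress = self.states[key]
        # The floating kill switch blocks tagged traffic leaving WAN.
        return "blocked" if egress == "WAN" else "passed via " + egress

    def reset_states(self):
        self.states.clear()


fw = ToyFirewall()
flow = ("10.0.10.11", "1.1.1.1")

results = [fw.send(*flow)]      # state created while the gateway is enabled
fw.gateway_enabled = False
results.append(fw.send(*flow))  # old state still decides the path
fw.reset_states()
results.append(fw.send(*flow))  # rules re-evaluated, kill switch engages
fw.gateway_enabled = True
results.append(fw.send(*flow))  # stale state keeps the flow blocked
fw.reset_states()
results.append(fw.send(*flow))  # rules re-evaluated, tunnel path restored
```

In this model the ping keeps working after the gateway is disabled, stops after a state reset, stays stopped after re-enabling, and resumes after a second reset, which mirrors the sequence reported above. Whether this is what actually happens on the box would still need to be confirmed.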

5. Other Methods Tried (All Failed to Provide Automatic Kill Switch)

    Gateway Group: Using a Gateway Group (NordVPN Tier 1, Trigger Member Down) in the LAN rule (tag removed, float disabled) still allowed ping when NordVPN gateway was disabled.
    Explicit LAN Block: Adding a Block Quick LAN net -> any rule immediately below the LAN VPN rule still allowed ping when NordVPN gateway was disabled.
    Manual Route Deletion: route delete 10.5.0.1 did not stop the bypassing ping.

6. Host Interaction Notes:

    Cycling the host PC's Intel NIC (bridged to OPNsense LAN) temporarily stopped the bypass but broke basic LAN connectivity (laptop couldn't ping gateway) until the OPNsense VM was restarted. Host PC also couldn't access OPNsense GUI (10.0.10.1) unless the laptop had an active link to the bridged NIC.

7. Analysis Text Provided by External AI:

    An external analysis confirmed "State Killing on Gateway Failure" is an OPNsense/pfSense feature triggered by gateway monitoring (dpinger) detecting a down event. It also highlighted a known limitation where states are often not automatically cleared on gateway recovery, explaining the need for manual state resets to restore connectivity. However, it didn't fully explain the failure of the kill switch to engage automatically when the gateway goes down in this specific scenario.

8. Current Status & Core Problem:

    The configuration is currently reverted to the Tag + Floating Rule method (Section 3).
    The kill switch only engages reliably after a manual state reset. Connectivity only resumes reliably after a manual state reset. Automatic operation fails.
    Evidence strongly suggests that when the NordVPN gateway object is disabled, traffic matching the policy route rule is incorrectly routed directly out the wg0 interface (using the 10.5.0.1 via wg0 route), bypassing the WAN interface floating rule and subsequent LAN rules. This seems linked to an override possibly caused by the WG interface route or a state handling anomaly.

9. Question for the Forum:

    Is this behavior (traffic exiting WG interface directly despite disabled policy route gateway, bypassing kill switch rules, requiring state resets for block/recovery) a known issue, edge case, or potential bug, possibly related to WG integration, state handling, or the VM environment?
    Why would the 10.5.0.1 via wg0 route seemingly be added/used despite Disable Routes being checked?
    Are there any alternative configurations or system tunables that could force the intended kill switch behavior automatically without requiring manual state resets in this scenario?

Any insights or suggestions would be greatly appreciated. I can provide specific config file snippets or further details if needed.

Thanks in advance.
choppity chop bang bang

State kill on down was implemented in 25.1.4 via https://github.com/opnsense/core/commit/6bb6d3a843

But to be honest I don't know if it works when you disable the gateway manually. If it doesn't it's something to consider in this scope to fix.


Cheers,
Franco

Quote from: franco on April 09, 2025, 07:41:03 AM
State kill on down was implemented in 25.1.4 via https://github.com/opnsense/core/commit/6bb6d3a843

But to be honest I don't know if it works when you disable the gateway manually. If it doesn't it's something to consider in this scope to fix.

Hi Franco,

Thank you very much for your reply and for confirming the implementation details of Kill states when down in 25.1.4.

Your insight regarding the trigger mechanism (monitored failure vs. manual GUI disable) makes sense and aligns with our observation that manual state resets were needed to engage the kill switch during testing when manually disabling the gateway. Similarly, the point raised in the external analysis (and matching general pfSense/OPNsense behavior) about states not automatically clearing on gateway recovery also perfectly explains why manual state resets were needed to restore connectivity after re-enabling the gateway. We accept these aspects related to state reset triggers and recovery limitations.

However, the primary issue we're struggling with seems to be a different, more fundamental problem that prevents any automatic kill switch method (Tag+Float, Gateway Group, Explicit LAN Block) from working reliably:

Even when the NordVPN gateway object is disabled OR shows as Offline in Status -> Gateways, traffic matching the LAN policy routing rule incorrectly exits directly via the wg0 interface, bypassing the intended failover path to WAN and all kill switch rules.

We have confirmed this bypass with the following evidence:

    Packet Capture (wg0): Shows ICMP echo requests (Source: 10.5.0.2) and replies successfully traversing wg0 while the NordVPN gateway object is disabled.
    Packet Capture (WAN): Shows no corresponding ICMP traffic attempting to leave via the default WAN (em0) interface.
    Gateway Status on Boot: On a fresh boot, System -> Gateways -> Status shows NordVPN as Offline (likely due to WG startup delay vs. dpinger), yet during this time, laptop traffic is already routing via the VPN (confirmed via external IP check) and pings to 1.1.1.1 succeed, proving the bypass happens regardless of the monitored gateway state (Offline or Disabled). The status only corrects to Online after a manual disable/re-enable toggle of the gateway object.
    Routing Anomaly: Despite Disable Routes being checked in the WireGuard Instance settings, System -> Routes -> Status consistently shows a host route 10.5.0.1 via wg0 when the WG service is running, and WireGuard logs show setting inet interface route to 10.5.0.1 via wg0. Manually deleting this route via route delete 10.5.0.1 did not stop the traffic bypass.
    State Table Anomaly: When the bypass occurs, the state table shows the initial packet hitting the correct LAN rule (Route... via NordVPN), but also an odd second state (icmp 10.5.0.2 -> ... 1.1.1.1 Rule: let out anything from firewall host itself), suggesting internal reprocessing.

It appears OPNsense is using the 10.5.0.1 via wg0 interface route to send policy-routed traffic directly out wg0 as long as the interface is up, completely ignoring the disabled/offline status of the logical gateway object (NordVPN) or gateway group (NordVPN_Group) specified in the firewall rule. This prevents the traffic from ever hitting the WAN interface where the floating kill switch rule resides, or being blocked by subsequent LAN rules or gateway group failover logic.

Core Questions Remaining:

    Is this behavior (policy-routed traffic for a disabled/offline gateway exiting directly via the gateway's UP interface) expected, or does it indicate a bug/misconfiguration?
    Why does the 10.5.0.1 via wg0 route appear to be added/used despite Disable Routes being checked in the WG Instance settings?
    Are there any known workarounds or settings to prevent this direct egress via wg0 when the associated logical gateway is down, ensuring traffic correctly fails over towards the default route path (where a kill switch rule can evaluate it) or is simply dropped?

Any further insight specifically into this routing bypass via wg0 would be immensely helpful.

(Setup: OPNsense 25.1.4_1, VirtualBox on Win10 Host, WG Client to NordVPN)

I'm trying to keep the scope small but I seem to have failed at that. I also checked the code and at first glance disabling the gateway should force it down so the kill state should work, but feel free to challenge me on this.


Cheers,
Franco

Quote from: franco on April 09, 2025, 07:14:40 PM
I'm trying to keep the scope small but I seem to have failed at that. I also checked the code and at first glance disabling the gateway should force it down so the kill state should work, but feel free to challenge me on this.

Hi Franco,

Thank you again for the feedback. We understand that Kill states when down might not reliably trigger on manual gateway disable actions. While that explains the need for manual state resets to enforce blocking after the fact, our primary issue seems to be that the traffic doesn't even attempt the path where the kill switch rule lives, making the state killing trigger somewhat irrelevant to the initial failure.

To demonstrate, we performed the following test using the standard Tag + Floating Rule configuration (detailed fully in my original post, currently active):

Test Steps & Evidence:

    VPN Enabled: NordVPN gateway object enabled and confirmed Online in Status. Continuous ping 10.0.10.11 -> 1.1.1.1 runs successfully.
        Evidence: Live Log (Screenshot_1.png) confirms ping packets Pass via LAN rule Route LAN traffic via NordVPN. State Table (Screenshot_2.png) shows expected NATted outbound state 10.5.0.2 -> 1.1.1.1 and return state.

    NordVPN Gateway Manually Disabled: The gateway object was Disabled via System -> Gateways -> Configuration -> Edit -> Check Disable -> Save -> Apply.
        Observation: The continuous ping 10.0.10.11 -> 1.1.1.1 continued to succeed indefinitely without manual intervention (visually confirmed in Screenshot_5.png foreground).
        Evidence: Gateway Configuration page confirmed NordVPN object was Disabled (Screenshot_3.png context).

    Diagnostics While Kill Switch Failing (Gateway Disabled, Ping Succeeding):
        Live Log (action=pass): Confirmed ping packets were still being logged as PASSED by the LAN rule Route LAN traffic via NordVPN (Screenshot_4.png, Screenshot_5.png background), despite this rule pointing to the now-disabled NordVPN gateway.
        Live Log (action=block): Showed NO blocks for the ping traffic by the Kill Switch block for NordVPN floating rule (Screenshot_6.png).
        Packet Capture WAN (em0): Capture filtered for ICMP 1.1.1.1 was EMPTY (OPN1_CAPTURE2.jpg). The traffic was not attempting to leave via WAN.
        Packet Capture wg0: Capture filtered for ICMP 1.1.1.1 showed continuous successful echo requests/replies (10.5.0.2 <-> 1.1.1.1) egressing/ingressing directly via the wg0 interface (Screenshot_7.png). This proves the bypass path.
        State Table: Showed the initial state matching the LAN rule (Route...NordVPN) plus the anomalous second state (icmp 10.5.0.2 -> ...).

    Manual State Reset: Clicked Reset state table (Firewall -> Diagnostics -> States).
        Observation: The continuous ping 10.0.10.11 -> 1.1.1.1 stopped immediately (visually confirmed in Screenshot_8.png).
        Evidence: Live Log (action=block) then showed ICMP packets being blocked by the Kill Switch block for NordVPN floating rule (Screenshot_10.png), confirming the rule works only after the state reset forces traffic to attempt the WAN path.

Conclusion from Test:

The evidence clearly shows that when the policy route target (NordVPN gateway object) is disabled, traffic matching the rule is not failing over to the default route (where the WAN kill switch rule lives). Instead, it seems OPNsense internally routes the traffic directly out the associated wg0 interface (as shown by the wg0 capture), effectively bypassing the gateway's disabled status and all kill switch logic until states are manually flushed.

This seems linked to the persistent 10.5.0.1 via wg0 host route which appears in netstat -rn (and WG logs) even though Disable Routes is checked in the WG Instance settings.
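
To make the route check repeatable, the netstat -rn output can be filtered per interface. Below is a small Python sketch that assumes the usual FreeBSD column order (Destination, Gateway, Flags, Netif). The sample text is illustrative only, not a capture from this system, and the link#7 value in it is hypothetical.

```python
def routes_on_interface(netstat_output: str, netif: str):
    """Return (destination, gateway, flags) for each route bound to netif."""
    routes = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expect at least: Destination Gateway Flags Netif
        if len(fields) >= 4 and fields[3] == netif:
            routes.append((fields[0], fields[1], fields[2]))
    return routes


# Illustrative sample only; values are placeholders, not real output.
SAMPLE = """\
Destination        Gateway            Flags     Netif Expire
default            10.0.0.1           UGS         em0
10.0.10.0/24       link#2             U           em1
10.5.0.1           link#7             UH          wg0
"""

wg_routes = routes_on_interface(SAMPLE, "wg0")
has_host_route = any(dest == "10.5.0.1" for dest, _, _ in wg_routes)
```

Running the same filter before and after restarting the WireGuard service would show whether the 10.5.0.1 host route is re-added each time.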

So, while Kill states when down might not trigger on manual disable, the bigger issue seems to be that the traffic isn't even reaching the point where state killing (or the floating block rule) on the correct failover path (WAN) can occur, due to this apparent routing override via wg0.

Is this direct egress via wg0 for traffic policy-routed to a disabled gateway expected? Could the handling of the 10.5.0.1 via wg0 route be involved?

Thanks for looking into this!


https://imgur.com/a/AYsj1vj


Remembering this and gateway-disable-or-not: the way to force connections away from a gateway is not to disable it. Use the "force down" monitoring checkbox in the gateway settings instead. This is how it always worked.  ;)


Cheers,
Franco