Hey folks,
I think I may have found a malfunction in OPNSense (I am running 24.7.3_1), but wanted to validate with you first.
Background: I have successfully configured my system with 3 VPN tunnels pointing at three different ProtonVPN locations for policy-based routing on three of my subnets. I also configured a 'Home' WireGuard tunnel so I can connect to my firewall when I am on the road. Lastly, I configured an s2s WireGuard tunnel with another router I manage in a different location (the two sites have non-overlapping LANs).
The fun part: I tested all these 5 tunnels and they all worked as intended, once configured and for many days since then...until I rebooted the firewall (clean reboot from the GUI).
Observations:
- After reboot, only 2 out of the 5 tunnels came back alive, and interestingly not always the same 2 across several reboots.
- I tried to disable and re-enable wireguard in the GUI and from the terminal, but only the 2 tunnels that came alive at boot restarted.
- I disabled all the instances and peers (in the GUI), I rebooted the firewall, and I manually re-enabled each one of the 5 tunnels (instances and peers), but again only 2 of them (not always the same ones) came back alive.
- the only messages I can find in the vpn logs concerning the tunnels that do not come up look similar to the following: " wireguard instance HomeWireguard (wg2) can not reconfigure without stopping it first."
Are you aware of what could be causing this? Has someone else experienced something similar, or have good leads for me to follow?
Thank you in advance.
At the moment we are discussing strange issues with WireGuard startup failing here:
https://github.com/opnsense/core/issues/7364
The locking may not be airtight on instance start. But it may also be a different issue like DNS or tunnel in a tunnel. Some of these things are hard to tell from "wireguard is not starting".
Cheers,
Franco
@franco, thank you for sharing the github thread, I am following it. The problem manifests differently, but it seems indeed related.
What is the most efficient way to help with debugging this? My config is long but not necessarily complex, I am happy to provide details on any area you would like to dive in.
Let's see what happens with the latest patch first:
https://github.com/opnsense/core/commit/d0806969
# opnsense-patch d0806969
Reboot for a clean test to see what happens with the tunnels.
The patch is harmless either way so it either stays the same or gets better.
Cheers,
Franco
Thank you, I did try the patch in the past couple of days and it did not work. I tried rebooting, and disabling and re-enabling all WireGuard instances from both the GUI and the terminal, individually and as a group.
Only 3 out of 5 tunnels now work (with or without the patch), which almost always happen to be the 3 protonvpn ones. The HomeWireguard and the s2s tunnels seem to never come up anymore.
I reverted the patch, and I experience the same behavior.
There seems to be a reproducible race condition when multiple WireGuard tunnels are started, which blocks startup after 2-3 tunnels come up.
What should I do next to investigate this further and provide details to help the resolution?
Using FQDNs could be a factor. Tunnels in tunnels another. I'm not sure. It's a complex world.
There is another patch we tried to triage this with, but it shouldn't change things for you, although it's harmless to try as well:
https://github.com/opnsense/core/commit/dd1c2e19e548
# opnsense-patch dd1c2e19e548
Cheers,
Franco
Hi,
I am investigating a similar issue myself. In my case it seems to be connected to the modem connection (modem-ISP).
Can you try to:
1 restart only the modem:
1a the connections returned?
1b only the connections that already were UP before the restart, returned?
2 handshake / gateways
2a the handshake is performed for all the connections or only the ones UP
2b status of the gateways before and after the restart.
3 connection
3a do you have a DSL/cable connection or through a 4g/5g modem?
3b the modem is also a router? or it is only a modem/ configured in bridge mode?
Possible partial solution (after restarting the modem): run a traceroute, or a traceroute and a ping in parallel, to the gateway address (e.g. 10.2.0.1). This should somehow restore pinging to the gateway.
Quote from: franco on September 10, 2024, 08:04:09 AM
Using FQDNs could be a factor. Tunnels in tunnels another. I'm not sure. It's a complex world.
There is another patch we tried to triage this with, but it shouldn't change things for you, although it's harmless to try as well:
https://github.com/opnsense/core/commit/dd1c2e19e548
# opnsense-patch dd1c2e19e548
I do use FQDNs in the HomeWireguard and s2s tunnel configs to point at my router's IP. This worked well for several weeks before this behavior started showing up, so I am not sure what changed... I do not know enough to speculate, but it almost feels like my WireGuard configuration somehow got corrupted. Is that even possible, and if so, would it explain this behavior?
Anyway, many thanks for the patch. I tried it with the following tests.
- First test: after applying the patch and rebooting, only one of the three ProtonVPN tunnels came up with a proper handshake. The other 4 tunnels did not (even though instances and peers were all marked as "enabled" in the GUI).
- Second test: I disabled them all, disabled WireGuard, and rebooted, and I had the same situation.
- Third test: I stopped every instance with configctl wireguard stop XXX, then rebooted, and once the router came back I re-enabled all the VPN tunnels. This brought me back to having the 3 ProtonVPN tunnels working (i.e. proper handshake), while the remaining 2, the HomeWireguard tunnel (for access from the road) and the s2s tunnel (to connect geographically separated LANs), were not working.
Not sure if this is helpful, but in the logs I see these 2 "error" messages related to the s2s tunnel:
- /usr/local/opnsense/scripts/Wireguard/wg-service-control.php: The command '/sbin/ifconfig 'wg4' 'inet' '10.2.2.1/24' alias' returned exit code '1', the output was 'ifconfig: ioctl (SIOCAIFADDR): File exists'
- /usr/local/opnsense/scripts/Wireguard/wg-service-control.php: The command '/sbin/route -q -n add -'inet' '10.2.2.2/32' -interface 'wg4'' returned exit code '1', the output was ''
and I see this "notice" message for every interface I have associated with my 5 tunnels:
- wireguard instance HomeWireguard (wg2) can not reconfigure without stopping it first.
Quote
1 restart only the modem:
1a the connections returned?
1b only the connections that already were UP before the restart, returned?
My router is connected by Ethernet directly to the FIOS ONT, so I do not really have a modem. When I reboot the router, the WireGuard tunnels that were previously working (i.e. with a proper handshake) typically (but not always) come back after the reboot and handshake correctly.
Quote
2 handshake / gateways
2a the handshake is performed for all the connections or only the ones UP
2b status of the gateways before and after the restart.
Not sure if I understand what "UP" means. I tend to use "UP" to mean a functioning WireGuard tunnel, i.e. one whose handshake shows as successful on the status page. If by "UP" you mean instances and peers that are *enabled* in the GUI, then the answer is "no": all 5 of my instances are *enabled*, but only a subset completes the handshake successfully.
Gateways status before and after restart are green.
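For anyone who wants to compare "enabled" against "actually handshaking" from the shell rather than the status page, here is a minimal Python sketch. It assumes the output layout of `wg show all latest-handshakes` (interface, peer public key, and Unix timestamp separated by tabs, with 0 meaning no handshake yet). The function names and the 180-second freshness threshold are illustrative assumptions, not anything OPNsense itself uses.

```python
import subprocess
import time


def wg_output():
    """Run `wg show all latest-handshakes` and return its raw text (needs root)."""
    return subprocess.run(
        ["wg", "show", "all", "latest-handshakes"],
        capture_output=True, text=True, check=True,
    ).stdout


def parse_latest_handshakes(text, now=None, max_age=180):
    """Map each interface to True if any of its peers has a recent handshake.

    Expected line layout: interface <TAB> public-key <TAB> unix-timestamp.
    max_age (seconds) is an arbitrary threshold chosen for this sketch.
    """
    now = time.time() if now is None else now
    status = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        iface, _peer, stamp = fields
        try:
            stamp = int(stamp)
        except ValueError:
            continue  # skip lines without a numeric timestamp
        alive = stamp > 0 and (now - stamp) <= max_age
        # An interface counts as alive if at least one peer is alive.
        status[iface] = status.get(iface, False) or alive
    return status
```

Calling `parse_latest_handshakes(wg_output())` on the firewall would give a dictionary such as one entry per wgN interface. Note that an idle tunnel without keepalive can legitimately show an old handshake, so a False here is a hint to look closer, not proof of a failure.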
Quote
3 connection
3a do you have a DSL/cable connection or through a 4g/5g modem?
3b the modem is also a router? or it is only a modem/ configured in bridge mode?
I have fiber, which goes into the ONT and connects via ethernet to my OPNSense router.
No modem, only the OPNSense router, no bridge mode.
/usr/local/opnsense/scripts/Wireguard/wg-service-control.php: The command '/sbin/ifconfig 'wg4' 'inet' '10.2.2.1/24' alias' returned exit code '1', the output was 'ifconfig: ioctl (SIOCAIFADDR): File exists'
/usr/local/opnsense/scripts/Wireguard/wg-service-control.php: The command '/sbin/route -q -n add -'inet' '10.2.2.2/32' -interface 'wg4'' returned exit code '1', the output was ''
Both are symptoms of misconfigurations either in overlapping subnets or old configurations like IPv4 mode for assigned interfaces or adding VIPs (when you can easily add several tunnel addresses in the wireguard instance).
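A quick way to look for the overlap case is to list every address configured on the tunnels, assigned interfaces and VIPs, then check the resulting networks against each other. The sketch below uses Python's standard `ipaddress` module. Only `10.2.2.1/24` comes from the log lines above; the function name and any other entries you pass in are hypothetical and have to be filled in from your own configuration.

```python
import ipaddress
from itertools import combinations


def find_overlaps(addresses):
    """Return pairs of labels whose networks overlap.

    `addresses` maps a free-form label to an interface address in CIDR
    notation, e.g. {"wg4": "10.2.2.1/24"}.
    """
    nets = {
        name: ipaddress.ip_interface(cidr).network
        for name, cidr in addresses.items()
    }
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]
```

For example, passing the wg4 tunnel address together with a VIP inside the same /24 would report that pair, which matches the kind of duplicate assignment that makes ifconfig answer "File exists".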
About the FQDNs all bets are off as this pertains to internal DNS behaviour and connectivity to the source of those FQDNs as they are resolved during bootup.
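To see whether name resolution is a factor, one option is to test the endpoint host names right after boot, before looking at the tunnels themselves. This is a rough sketch using only Python's standard `socket` module. The function names are made up for illustration, and port 51820 is just the conventional WireGuard default, assumed here and not taken from this thread.

```python
import socket


def resolves(host, port=51820):
    """Return True if `host` resolves to at least one address right now."""
    try:
        return bool(socket.getaddrinfo(host, port, type=socket.SOCK_DGRAM))
    except socket.gaierror:
        return False


def unresolved(endpoints):
    """Return the endpoint host names that currently fail to resolve."""
    return [host for host in endpoints if not resolves(host)]
```

Running `unresolved([...])` with the FQDNs used in the peer configs shortly after a reboot, and again a minute later, would show whether DNS is simply not ready when the instances start.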
Cheers,
Franco
After the latest update (and maybe a few specific firewall rules) I no longer lose any of the 3 VPN connections, and if one does drop, it is restored automatically after some time.
Thank you opnsense team! ;D
I have to correct myself, after the latest update
OPNsense 24.7.5-amd64
FreeBSD 14.1-RELEASE-p5
OpenSSL 3.0.15
I am back to the initial situation :(
Thank you, franco. I realized from your comment that I had something misconfigured. I fixed the apparent misconfiguration, but I still could not solve the issue.
In order to get back operational with the network, I decided to rebuild my firewall. As I rebuilt it, I learned about snapshots (yep...I did not know about the super convenient snapshots feature ::)), and I used them to mark any major successful step I made in my conf.
I am now at the point where I have all my tunnels (1 s2s, 3 protonVPNs, 1 road-warrior for remote maintenance ) up and running and working solid (as intended). :)
What I learned was not necessarily how to fix the initial situation, which remains a mystery. My best guess is that I either messed up my conf elsewhere (although I checked it what feels like a million times), or my configuration got corrupted somehow by one of the recent firmware updates (just speculation to save my ego, I have no proof of that). I did learn, though, an easier way to revert to a working config.
This thread is therefore not SOLVED, as I rebuilt from scratch, but since my system is working now, I will go ahead and mark it as OBSOLETE.