I have one opnsense installation (out of 3) where Wireguard is disconnecting every 10 days or so. The peer is showing as "stale". The only way to reestablish the link is by rebooting. These options have failed:
- disabling and reenabling
- shell: /usr/local/opnsense/scripts/Wireguard/wg-service-control.php stop/start xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
On the same hardware, WireGuard was working flawlessly under pfSense. The other side of the tunnel did not change when I switched from pfSense to OPNsense.
At the times this is happening, there is no log entry in /var/log/wireguard/*
I can see my restart attempts in the log, to no avail:
<37>1 2025-04-24T01:54:58+02:00 gwtsb.tonstudiobeusch.ch wireguard 56813 - [meta sequenceId="1"] wireguard instance WGHD (wg0) stopped
<37>1 2025-04-24T01:54:58+02:00 gwtsb.tonstudiobeusch.ch wireguard 58680 - [meta sequenceId="2"] /usr/local/opnsense/scripts/Wireguard/wg-service-control.php: ROUTING: entering configure using opt1
<37>1 2025-04-24T01:54:58+02:00 gwtsb.tonstudiobeusch.ch wireguard 58680 - [meta sequenceId="3"] /usr/local/opnsense/scripts/Wireguard/wg-service-control.php: plugins_configure monitor (,[GWHD])
<37>1 2025-04-24T01:54:58+02:00 gwtsb.tonstudiobeusch.ch wireguard 58680 - [meta sequenceId="4"] /usr/local/opnsense/scripts/Wireguard/wg-service-control.php: plugins_configure monitor (execute task : dpinger_configure_do(,[GWHD]))
<37>1 2025-04-24T01:54:58+02:00 gwtsb.tonstudiobeusch.ch wireguard 58680 - [meta sequenceId="5"] wireguard instance WGHD (wg0) started
I have not found any other log entries. Is there a way to debug this?
Hi, same situation here. I have seen that a reset of the modem is also enough (my modem is in bridge mode, so OPNsense sees the change of the public IP).
It seems it requires a push to re-initiate something.
This is expected because it's how WireGuard works.
Read about it in this discussion:
https://github.com/opnsense/docs/pull/691
Stale does not mean anything is wrong.
If you need a true connected, session-based VPN, use IPsec or OpenVPN.
Hi Cedrik, I already have a keepalive interval of 25 seconds. Should I reduce it in order to keep the tunnel active?
I use WireGuard only for outgoing connections. When the peer becomes stale and there is a new request, shouldn't it become active again?
Read the "NAT and Firewall Traversal Persistence" section.
https://www.wireguard.com/quickstart/
If WireGuard does not have matching traffic, it does not send anything.
It's triggered by matching traffic.
There is an option to "Renew DNS for WireGuard on stale connections" in System/Cron. Try it.
Hi,
@Bob.Dig - that option is already active, once every hour, but it has no effect.
@Cedrik - once the peer is in stale status and I try to force some traffic through it, the state does not change. Shouldn't it go online when there is traffic?
The stale status depends on when the last handshake happened. The code checks whether it was less than 5 minutes ago, and if that's the case it assumes the peer is online.
If traffic happens, the peer should change state because a new handshake should happen.
Look at the WireGuard diagnostics page; there is a value in seconds showing how long ago the last handshake was.
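The check described above can be sketched in a few lines of Python. This is only an illustration, not the OPNsense code: it assumes the 5-minute threshold mentioned in the post, reads the text printed by `wg show <interface> latest-handshakes` (one public key and one Unix timestamp per line, with 0 meaning no handshake yet), and the status labels and function names are made up for the example.

```python
import time

# Threshold taken from the explanation above; an assumption, not read from OPNsense.
STALE_AFTER_SECONDS = 5 * 60


def parse_latest_handshakes(output):
    """Parse `wg show <iface> latest-handshakes` text into {public_key: unix_time}."""
    handshakes = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        public_key, timestamp = parts
        handshakes[public_key] = int(timestamp)
    return handshakes


def peer_status(last_handshake, now=None):
    """Classify a peer by the age of its last handshake (labels are illustrative)."""
    now = time.time() if now is None else now
    if last_handshake == 0:
        return "offline"  # wg reports 0 when no handshake has happened yet
    age = now - last_handshake
    return "online" if age < STALE_AFTER_SECONDS else "stale"
```

Feeding it the output captured on the firewall shell shows at a glance which peers have not completed a handshake within the threshold.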
In this case, 1200 seconds. But shouldn't forcing traffic through the tunnel trigger a new handshake?
Additional info:
- restarting WireGuard does not trigger a new handshake;
- restarting OPNsense does not trigger a new handshake;
- restarting the modem (in bridge mode) changes the WAN IP and triggers a new handshake.
Just to complete the information above: after point 3, to get the gateway of that WireGuard connection pinging the monitoring IP correctly again, I have to run a traceroute from my PC or from OPNsense towards the gateway IP.
@Monviech thank you for the feedback and for your time. I would like to ask an expert (you) for some more assistance.
I am actually very frustrated: this is the last thing I need to fix so that I no longer have to check the firewall every day (and have more time for other things), and I have been struggling with it for at least a year.
The abnormal behaviors are:
• Peers become stale even while they are in use. I have three connections, managed through 3 gateways with different priorities. Even the one actually in use became stale (with ongoing connections).
• The two backup connections (not used because the first one is online) behave differently: one can become stale after 2-3 days while the other can last for weeks.
• Even if I try to force traffic through the stale gateway, it never returns online. If I restart the peer or the WireGuard service, the peer is marked as offline. Even if I restart OPNsense, it doesn't return online.
• The only way to get a new handshake with the WireGuard endpoint is to change the WAN address (restart the modem). After the restart the peer is online but the gateway is not; I have to perform a traceroute towards the gateway to bring it online again.
Do you have any suggestion on how to solve this issue?
Thank you
Use monit and ping through your tunnels and let it send you an email if the ping fails.
That way you can see if something is actually wrong.
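As a rough illustration of the idea (this is not monit itself, just a small script a scheduler could run), the sketch below pings a target with the tunnel address as source and reports success or failure. It assumes the FreeBSD `ping` options `-c` (count) and `-S` (source address); the function names are hypothetical, and the addresses you pass in are whatever applies to your tunnel.

```python
import subprocess


def build_ping_command(target, source_address, count=3):
    """Build a FreeBSD-style ping command; -S sets the source address of the probe."""
    return ["ping", "-c", str(count), "-S", source_address, target]


def tunnel_is_up(target, source_address, run=subprocess.run):
    """Return True if ping exits with status 0, meaning at least one reply arrived."""
    result = run(build_ping_command(target, source_address), capture_output=True)
    return result.returncode == 0
```

The `run` parameter exists only so the check can be exercised without sending real packets; in normal use the default `subprocess.run` is fine.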
Unfortunately, I already know that I would have to correct the situation at least once every two days. Even if it is not a solution, can you suggest a way to analyze this situation further?
Wireguard does not have many ways to analyze it out of the box.
If it fails, there's most likely an issue in communication between the endpoint and the peer.
This can be firewall rules, firewall states, dynamic IPs, CGNAT, provider issues, DNS issues, etc.
OK, the fun "must go on", but OPNsense is fantastic and I want to find a solution rather than a patch.
What is the best log to check to investigate the problems described so far? I am not a technician and usually use the live view of the firewall, but I don't think it is suitable for this kind of investigation.
I created dedicated firewall rules just to generate log records, but I can't see anything in the live view. Is it a good idea to download the .csv from the "plain view" page and analyze it in Excel?
The best troubleshooting tool is a packet capture. If a tunnel fails, use tcpdump from the shell or the packet capture in the web GUI and see what happens to the WireGuard packets.
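For the tcpdump route, a small helper can assemble the command line so the filter is easy to get right. This is a sketch under assumptions: the interface name, endpoint IP and UDP port are placeholders you must replace with your own values (51820 is only the common WireGuard default), and the helper name is made up. The flags used are standard tcpdump options: `-n` (no name resolution), `-i` (interface) and `-c` (stop after N packets).

```python
import shlex


def build_capture_command(interface, endpoint_ip, port, packet_count=50):
    """Build a tcpdump command that shows only UDP traffic to/from one WireGuard endpoint."""
    bpf_filter = f"udp and host {endpoint_ip} and port {port}"
    return ["tcpdump", "-n", "-i", interface, "-c", str(packet_count), bpf_filter]


if __name__ == "__main__":
    # Placeholder values: WAN interface, documentation-range IP, default WireGuard port.
    print(shlex.join(build_capture_command("igb0", "203.0.113.10", 51820)))
```

Running the capture on the WAN interface while the peer is stale shows whether handshake packets leave the firewall and whether anything comes back.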
Thank you! I will try this as well.
It clearly depends on which side initiates the WireGuard connection: although a site-to-site connection can potentially be opened from either side, it might not actually work the way you think it does.
Say, for example, the other side is behind CG-NAT. In that case, it can initiate a connection as a client, but never act as a server.
In this situation, even if you detect stale connections on your side, you cannot "repair" the connection by restarting WireGuard. Thus, the "stale detection" via the cron job should preferably be done on both sides, and at a short interval to keep interruptions small.
I am in a position like this myself (I am not behind CG-NAT, but my peers are). Thus, the stale detection on my side, although enabled, would not help.
@meyergru the 3 connections are 3 different VPNs, all towards Proton VPN servers, so the connection should always be initiated from my side only.
The modem is an LTE modem and the mobile operator uses CG-NAT (the IP I receive starts with 10.xx.xx.xx), BUT the strange thing is that the 2 backup connections become stale after very different amounts of time and in random order.
Following the OPNsense documentation I set the "Keepalive interval" to 25 seconds, but changing it to a lower or higher value doesn't change the result.
It seems that some route or firewall state changes, or the system drops something: right now the third VPN is online, but its gateway monitoring is offline:
- the VPN instance is pingable;
- the endpoint IP is pingable;
- the gateway is NOT pingable.
Looking at the live view log, I see the first ping going out through the correct VPN gateway, but no reply is recorded.
Even if I restart OPNsense, nothing changes. If I traceroute to the gateway IP, it starts pinging again and returns online.
If the privacy-vpn-servers or your LTE-connection are overloaded, there is nothing OPNsense can do...
@Bob.Dig - I already tested this as well: connecting to the same server from a mobile phone and a computer is possible. The load on that server is also not high (according to the Proton app).
Waiting a couple of days for everything to return to normal wasn't successful.
UPDATE:
I found the following situation:
- 1 VPN peer down (the second of three)
- 2 VPN gateways down because of monitoring (the 2nd and 3rd)
I disabled monitoring on the 2nd and 3rd VPN gateways, and within 5 minutes everything was online again. The strange thing is that the ping time of the primary VPN gateway was more than double what it was before.
When I restored gateway monitoring, both the 2nd and 3rd VPN gateways were marked as offline.
Could the gateway monitoring system have some difficulty managing more than one VPN gateway?
SECOND UPDATE:
While I was looking more closely at the live view of the firewall, I found this:
(https://i.postimg.cc/0bKvX3Rp/screeen.jpg) (https://postimg.cc/0bKvX3Rp)
When everything works well, I see the initial ping request (and not the reply) going through the corresponding WireGuard gateway.
Suddenly I stop seeing it, and instead I start to see the returning reply trying to enter through the WireGuard gateway with the highest priority in use at that moment.
The monitoring IP is now the same as the gateway IP. Before that I tried using an external IP and creating a dedicated route plus firewall rule to always send the initial ping request through the correct WireGuard VPN, but the same behavior occurred (even if I didn't see it as clearly in the log as this time).
The routes status shows this (2.1.1.1 is the proper gateway for that monitoring IP):
(https://i.postimg.cc/QVrnZyTQ/state.jpg) (https://postimg.cc/QVrnZyTQ)
Question: are the routing rules taken into account for these ping queries?
Also, this happens with or without the "Disable route" flag set on each WireGuard gateway.
Hi again, trying to push this topic.
This week I disabled gateway monitoring and the system seems to be working much better (previously the gateway monitoring went offline even when the WireGuard peers were online and the handshake was done properly).
If you look at my previous message, it seems that at a certain moment the ping replies of the backup gateways try to return through the main WireGuard gateway (even though I created a dedicated route from the gateway to the monitor IP). -> Any suggestion on how to avoid this?
Personally, I think there is an issue with properly updating and keeping the firewall states or routes used by the "gateway monitoring" service. I also tried a cron job to restart the WireGuard service every 24h, but it had no effect.
What I can confirm, instead, is that once every 7-10 days the backup peers are marked as stale: the odd thing is that to bring them back online I have to restart the modem (and change the public IP seen on the WAN port). Restarting everything else has no effect.
Later I'll send a request to Proton to find out whether they somehow limit new handshakes from an IP that let the connection become stale.
Thank you for your time and help!
EDIT: just a quick extra piece of info. I re-enabled monitoring on all three WireGuard gateways and only the main one started working; to get the others working I have to perform a traceroute towards each gateway. What is this creating or updating: firewall states or routing rules?
OK, after several tests I think there is a bug in the gateway monitor application when:
- multiple gateways go through WireGuard VPNs
- the WAN gateway is not the primary gateway
It seems that, even with the keepalive option, the firewall state between the gateway instance and the monitor IP is dropped. This leads to the gateway being marked as offline; later the WireGuard peer is marked as stale.
If the gateway is offline but the peer is still online, you can bring the gateway back online by doing one of the following:
- traceroute to the monitor IP
- traceroute to the gateway IP
- restart the modem and obtain a different WAN IP
If the gateway is offline and the VPN peer is stale/offline, the only way to bring it back online is to restart the modem and get a new IP on the WAN.
So far I think I have tried everything.
@Monviech my question is: is this enough to have someone from the OPNsense team look at it, or should I open a bug report on GitHub?
I am fully available to perform any test that would be useful.
Thank you.
Well, you can always try a GitHub report. But if the setup is very hard to replicate and not something a lot of people have, it might not get much traction, as it takes a lot of effort to track down edge cases.
I understand: if it is a very rare configuration, it is not worth spending time on it.
Leaving aside the other customizations, I just route all outgoing traffic through a VPN (for privacy reasons) and added two backup VPN connections.
I thought it would be a common configuration among privacy-oriented people.
Thank you for your time, I will try to open a report on GitHub.
Hello, topic solved.
Solution:
- create a route that forces the connection to the VPN IP (provided in the configuration file) to go through the WAN;
- set an external IP as the monitoring IP (not the internal IP of the VPN);
- create a route that forces traffic to the monitoring IP to go through the correct VPN gateway (one route for each VPN connection);
and here is the problem that gave me so much headache:
- DO NOT create firewall rules (in the floating rules section) to try to monitor the routes towards the monitoring IP.
Finally the gateway monitors always work or come back online, and the handshakes are restored when a connection goes stale.
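To make the route layout in the solution easier to follow, here is a small Python sketch that lists the routes for each VPN and sanity-checks the monitoring IPs. Everything in it is hypothetical: the gateway names, the addresses, and the assumption that "VPN IP" means the server address from the provider's configuration file. It does not touch OPNsense; it only models the plan.

```python
import ipaddress
from collections import namedtuple

# All names and addresses used with this type are hypothetical placeholders.
VpnGateway = namedtuple("VpnGateway", "name endpoint_ip tunnel_network monitor_ip")


def planned_routes(gateways, wan_gateway="WAN_GW"):
    """Return (destination, gateway) pairs for the routes listed in the solution."""
    routes = []
    for gw in gateways:
        routes.append((gw.endpoint_ip + "/32", wan_gateway))  # VPN server reached via WAN
        routes.append((gw.monitor_ip + "/32", gw.name))       # monitor IP via its own VPN
    return routes


def plan_problems(gateways):
    """Flag monitor IPs that are internal to the tunnel or shared between gateways."""
    problems = []
    owner_by_monitor = {}
    for gw in gateways:
        monitor = ipaddress.ip_address(gw.monitor_ip)
        if monitor in ipaddress.ip_network(gw.tunnel_network):
            problems.append(gw.name + ": monitor IP is inside the tunnel network")
        if gw.monitor_ip in owner_by_monitor:
            problems.append(
                gw.name + ": monitor IP already used by " + owner_by_monitor[gw.monitor_ip]
            )
        else:
            owner_by_monitor[gw.monitor_ip] = gw.name
    return problems
```

Each VPN needs its own monitoring IP here, because a single host route can only point at one gateway.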