Stale peers in Wireguard, v2

Started by fst, April 24, 2025, 11:13:43 AM

I have one opnsense installation (out of 3) where Wireguard is disconnecting every 10 days or so. The peer is showing as "stale". The only way to reestablish the link is by rebooting. These options have failed:
- disabling and reenabling
- shell: /usr/local/opnsense/scripts/Wireguard/wg-service-control.php stop/start xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

On the same hardware, WireGuard was working flawlessly under pfSense. The other side of the tunnel has not changed since I switched from pfSense to OPNsense.
At the times this is happening, there is no log entry in /var/log/wireguard/*
I can see my restart attempts in the log, to no avail:
<37>1 2025-04-24T01:54:58+02:00 gwtsb.tonstudiobeusch.ch wireguard 56813 - [meta sequenceId="1"] wireguard instance WGHD (wg0) stopped
<37>1 2025-04-24T01:54:58+02:00 gwtsb.tonstudiobeusch.ch wireguard 58680 - [meta sequenceId="2"] /usr/local/opnsense/scripts/Wireguard/wg-service-control.php: ROUTING: entering configure using opt1
<37>1 2025-04-24T01:54:58+02:00 gwtsb.tonstudiobeusch.ch wireguard 58680 - [meta sequenceId="3"] /usr/local/opnsense/scripts/Wireguard/wg-service-control.php: plugins_configure monitor (,[GWHD])
<37>1 2025-04-24T01:54:58+02:00 gwtsb.tonstudiobeusch.ch wireguard 58680 - [meta sequenceId="4"] /usr/local/opnsense/scripts/Wireguard/wg-service-control.php: plugins_configure monitor (execute task : dpinger_configure_do(,[GWHD]))
<37>1 2025-04-24T01:54:58+02:00 gwtsb.tonstudiobeusch.ch wireguard 58680 - [meta sequenceId="5"] wireguard instance WGHD (wg0) started

I have not found any other log entries. Is there a way to debug this?

Hi, same situation here. I have seen that a reset of the modem is also enough (my modem is in bridge mode, so OPNsense sees the change of the public IP).

It seems it requires a push to re-initiate something.

This is expected because it's how WireGuard works.

Read about it in this discussion:

https://github.com/opnsense/docs/pull/691

Stale does not mean anything is wrong.

If you need a truly connected, session-based VPN, use IPsec or OpenVPN.
Hardware:
DEC740

Hi Cedrik, I already have a keepalive interval of 25 seconds. Should I reduce it to keep the peer active?

I use WireGuard only for outgoing connections. When the peer becomes stale and a request comes in, should it become active again?

Read the "NAT and Firewall Traversal Persistence" section.

https://www.wireguard.com/quickstart/

If WireGuard does not have matching traffic, it does not send anything.

It's triggered by matching traffic.
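
Since the question above is about the keepalive interval, it may help to confirm that the running interface actually has it applied. The following is a minimal sketch, assuming the `wg` tool is available in the shell and that the interface is called `wg0` (as in the log earlier in the thread). It parses the tab-separated output of `wg show wg0 dump`, where the eighth field of each peer line is the persistent keepalive in seconds, or `off`. The function names are made up for illustration.

```python
import subprocess
from typing import Dict, Optional


def parse_keepalives(dump_output: str) -> Dict[str, Optional[int]]:
    """Map each peer public key to its keepalive in seconds (None if off)."""
    result: Dict[str, Optional[int]] = {}
    lines = dump_output.strip().splitlines()
    # The first line of `wg show <iface> dump` describes the interface itself.
    for line in lines[1:]:
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # not a complete peer line, skip it
        public_key, keepalive = fields[0], fields[7]
        result[public_key] = None if keepalive == "off" else int(keepalive)
    return result


def read_keepalives(interface: str = "wg0") -> Dict[str, Optional[int]]:
    """Run `wg show <interface> dump` and parse it (needs root on the firewall)."""
    completed = subprocess.run(
        ["wg", "show", interface, "dump"],
        capture_output=True, text=True, check=True,
    )
    return parse_keepalives(completed.stdout)
```

A peer that reports `None` here sends nothing while idle, which matches the behaviour described above; a peer that reports 25 should be sending a keepalive packet roughly every 25 seconds.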

There is an option to "Renew DNS for WireGuard on stale connections" in System/Cron. Try it.
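
How that cron job works internally is not shown in this thread. As a manual check in the same spirit, here is a rough sketch that compares the endpoint address WireGuard is currently using (one `ip:port` value from `wg show wg0 endpoints`) with what the endpoint hostname resolves to right now. The function names are hypothetical, and the check assumes the peer is configured with a hostname rather than a fixed IP.

```python
import ipaddress
import socket
from typing import Iterable, Set, Tuple


def split_endpoint(endpoint: str) -> Tuple[str, int]:
    """Split '203.0.113.5:51820' or '[2001:db8::1]:51820' into (host, port)."""
    host, _, port = endpoint.rpartition(":")
    return host.strip("[]"), int(port)


def resolve_all(hostname: str, port: int) -> Set[str]:
    """Return every address the hostname currently resolves to."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_DGRAM)
    return {info[4][0] for info in infos}


def endpoint_outdated(current_endpoint: str, resolved: Iterable[str]) -> bool:
    """True if the address in use is not among the freshly resolved ones."""
    host, _ = split_endpoint(current_endpoint)
    in_use = ipaddress.ip_address(host)
    # Compare parsed addresses so different IPv6 spellings still match.
    return in_use not in {ipaddress.ip_address(addr) for addr in resolved}
```

If `endpoint_outdated(...)` returns True while the peer is stale, the tunnel is probably still pointing at an old address of the remote side; if it returns False, DNS is unlikely to be the cause.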

Hi,

@Bob.Dig - that option is already active, running once an hour, but it has no effect.

@Cedrik - once the peer is in stale status and I try to force some traffic through it, the state does not change. Shouldn't it come back online when there is traffic?

The stale status depends on when the last handshake happened. The code checks whether it was less than 5 minutes ago, and if that's the case, assumes the peer is online.

If traffic happens, the peer should change state because a new handshake should happen.

Look at the WireGuard diagnostics page; it shows how many seconds ago the last handshake happened.
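
The same number can be read from the shell. This is a minimal sketch, not the actual OPNsense code: it takes the output of `wg show wg0 latest-handshakes` (one line per peer: public key, a tab, and a Unix timestamp, with 0 meaning no handshake yet) and applies the 5 minute threshold described above. The function names and the constant are illustrative.

```python
import time
from typing import Dict, List, Optional

STALE_AFTER_SECONDS = 5 * 60  # threshold as described in the post above


def handshake_ages(output: str, now: Optional[float] = None) -> Dict[str, Optional[float]]:
    """Map each peer key to seconds since its last handshake (None if never)."""
    now = time.time() if now is None else now
    ages: Dict[str, Optional[float]] = {}
    for line in output.strip().splitlines():
        public_key, _, stamp_text = line.partition("\t")
        stamp = int(stamp_text)
        # A timestamp of 0 means this peer has not completed any handshake.
        ages[public_key] = None if stamp == 0 else now - stamp
    return ages


def stale_peers(ages: Dict[str, Optional[float]],
                threshold: float = STALE_AFTER_SECONDS) -> List[str]:
    """Return the peers whose last handshake is missing or too old."""
    return [key for key, age in ages.items() if age is None or age >= threshold]
```

Logging these ages once a minute shows exactly when handshakes stop, which can then be lined up with other events such as a WAN IP change.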

May 05, 2025, 07:44:03 PM #8 Last Edit: May 05, 2025, 08:28:43 PM by FredFresh
In this case, 1200 seconds. But shouldn't forcing traffic through it trigger a new handshake?

Additional info:
- restarting WireGuard does not trigger a new handshake;
- restarting OPNsense does not trigger a new handshake;
- restarting the modem (in bridge mode) changes the WAN IP and triggers a new handshake.

To complete the information above: after point 3, to get the gateway of that WireGuard connection pinging its monitoring IP correctly again, I have to run a traceroute from my PC or from OPNsense towards the gateway IP.

@Monviech thank you for the feedback and for the time you spent helping. I would like to ask for some more assistance from an expert (you).

I am actually very frustrated, as this is the last thing I need to fix so that I no longer have to check the firewall every day (and have more time for other things), and I have been struggling with it for at least a year.

The abnormal behaviors are:
•   Peers become stale even while they are in use. I have three connections, managed through 3 gateways with different priorities. Even the one actually in use became stale (with ongoing connections).
•   The two backup connections (not used because the first one is online) behave differently: one can become stale after 2-3 days, while the other can last for weeks.
•   Even if I try to force traffic through the stale gateway, it never returns online. If I restart the peer or the WireGuard service, the peer is marked as offline. Even if I restart OPNsense, it doesn't return online.
•   The only way to get a new handshake with the WireGuard endpoint is to change the WAN address (restart the modem). After the restart, the peer is online but the gateway is not; I have to run a traceroute towards the gateway to bring it online again.

Please, do you have any suggestions on how to solve this issue?

Thank you

Use monit and ping through your tunnels and let it send you an email if the ping fails.

That way you can see if something is actually wrong.
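
This is not a monit configuration, just a minimal sketch of the kind of check a monitoring tool or cron job could run. It assumes FreeBSD's `ping`, where `-S` sets the source address (so the probe leaves through the tunnel) and `-c` sets the packet count. The addresses and function names are placeholders.

```python
import subprocess
from typing import Callable, List, Optional


def build_ping_command(target: str, source: Optional[str] = None,
                       count: int = 3) -> List[str]:
    """Build a ping command line; -S is the FreeBSD source-address flag."""
    command = ["ping", "-c", str(count)]
    if source:
        command += ["-S", source]  # on Linux this would be -I instead
    command.append(target)
    return command


def tunnel_reachable(target: str, source: Optional[str] = None,
                     runner: Callable = subprocess.run) -> bool:
    """Return True if ping exits with status 0, i.e. got at least one reply."""
    result = runner(build_ping_command(target, source),
                    capture_output=True, text=True)
    return result.returncode == 0
```

For example, `tunnel_reachable("10.0.0.1", source="10.0.0.2")` with the remote and local tunnel addresses filled in; a wrapper can then exit non-zero or send a notification when it returns False.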

Unfortunately, I already know that I would have to correct the situation at least once every two days. Even if it is not a solution, can you suggest a way to analyze this situation further?

WireGuard does not offer many ways to analyze this out of the box.

If it fails, there's most likely a communication issue between the endpoint and the peer.

This can be caused by firewall rules, firewall states, dynamic IPs, CGNAT, provider issues, DNS issues, etc.

OK, the fun "must go on", but OPNsense is fantastic and I want to find a solution rather than a patch.

What is the best log to check to investigate the problems described so far? I am not a technician and I usually use the live view of the firewall, but I don't think it is suitable for this kind of investigation.

I created dedicated firewall rules just to generate log records, but I can't see anything in the live view. Is it a good idea to download the .csv from the "plain view" page and process it in Excel?

The best troubleshooting tool is a packet capture. If a tunnel fails, use tcpdump from the shell or the packet capture in the web GUI and see what happens to the WireGuard packets.
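
When reading such a capture, the first byte of each WireGuard UDP payload gives the message type: 1 is a handshake initiation, 2 a handshake response, 3 a cookie reply and 4 transport data. The sketch below only classifies payloads that have already been extracted from a capture; getting the bytes out of tcpdump or the GUI is left out, and the function names are illustrative.

```python
from typing import Iterable

MESSAGE_TYPES = {
    1: "handshake initiation",
    2: "handshake response",
    3: "cookie reply",
    4: "transport data",
}


def classify(payload: bytes) -> str:
    """Name a WireGuard message by the first byte of its UDP payload."""
    if not payload:
        return "empty"
    return MESSAGE_TYPES.get(payload[0], "unknown")


def handshake_unanswered(sent: Iterable[bytes], received: Iterable[bytes]) -> bool:
    """True if initiations went out but no handshake response came back."""
    initiations = sum(1 for p in sent if classify(p) == "handshake initiation")
    got_response = any(classify(p) == "handshake response" for p in received)
    return initiations > 0 and not got_response
```

If initiations leave the WAN interface and nothing comes back, the problem is likely upstream (provider, CGNAT, or the remote side); if no initiations leave at all, the local side is not even trying.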