Stale peers in Wireguard, v2

Started by fst, April 24, 2025, 11:13:43 AM

Thank you! I will also try it this way.

May 11, 2025, 05:25:16 PM #16 Last Edit: May 11, 2025, 05:26:48 PM by meyergru
It clearly depends on which side initiates the WireGuard connection: although a site-to-site connection can potentially be opened from either side, it might not actually work the way you think it does.

Say, for example, the other side is behind CG-NAT. In that case, it can initiate a connection as a client, but never act as a server.

In this situation, even if you detect stale connections on your side, you cannot "repair" the connection by restarting WireGuard. Thus, the "stale detection" via the cron job should preferably run on both sides, and at a short interval, to keep interruptions small.
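
As a rough illustration of what such a stale check could look like (this is not the OPNsense cron job itself), the sketch below parses the output of `wg show <interface> latest-handshakes`, which prints one line per peer with the public key and the Unix timestamp of the last handshake. The function name `find_stale_peers` and the 180-second threshold are assumptions made for this example.

```python
import time

# Assumption: a peer counts as stale after 180 seconds without a handshake.
STALE_AFTER_SECONDS = 180


def find_stale_peers(handshake_output, now=None, max_age=STALE_AFTER_SECONDS):
    """Return the public keys of peers whose last handshake is too old.

    handshake_output is the text printed by
    `wg show <interface> latest-handshakes`: one line per peer in the form
    "<public-key><TAB><unix-timestamp>". A timestamp of 0 means the peer has
    never completed a handshake, so it is reported as stale as well.
    """
    if now is None:
        now = time.time()
    stale = []
    for line in handshake_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        public_key = parts[0]
        try:
            timestamp = int(parts[1])
        except ValueError:
            continue  # skip lines without a numeric timestamp
        if timestamp == 0 or now - timestamp > max_age:
            stale.append(public_key)
    return stale
```

A cron job on each side could feed the command output into this function and restart the tunnel only when the returned list is not empty. As noted above, this only helps on the side that is able to initiate the connection.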

I am in a position like this myself (I am not behind CG-NAT, but my peers are). Thus, the stale detection on my side, although enabled, would not help.
Intel N100, 4* I226-V, 2* 82559, 16 GByte, 500 GByte NVME, ZTE F6005

1100 down / 800 up, Bufferbloat A+

May 11, 2025, 07:41:39 PM #17 Last Edit: May 11, 2025, 08:10:41 PM by FredFresh
@meyergru the 3 connections are 3 different VPNs, all towards Proton VPN servers. So the connection should always be initiated only from my side.

The modem is an LTE modem and the mobile operator uses CG-NAT (the IP I receive starts with 10.xx.xx.xx), BUT the strange thing is that the 2 backup connections become stale after very different amounts of time and in random order.

According to the OPNsense documentation I set the "Keepalive interval" to 25 seconds, but changing it to a lower or higher value doesn't change the result.


It seems that some route/firewall state changes or the system drops something: now I have the third VPN online, but its gateway monitoring is offline:
the VPN instance is pingable;
the endpoint IP is pingable;
the gateway is NOT pingable.

Looking at the live view log, I see the first ping going out through the correct VPN gateway, but no reply is recorded.
Even if I restart OPNsense, nothing changes. If I traceroute to the gateway IP, the ping starts working again and the gateway returns online.

May 11, 2025, 07:47:45 PM #18 Last Edit: May 11, 2025, 07:53:14 PM by Bob.Dig
If the privacy-vpn-servers or your LTE-connection are overloaded, there is nothing OPNsense can do...

May 11, 2025, 07:53:53 PM #19 Last Edit: May 12, 2025, 07:42:32 AM by FredFresh
@Bob.Dig - this road was also already tested: connecting to the same server from a mobile phone and a computer is possible. Also, the load on that server is not high (according to the Proton app).
Waiting a couple of days for everything to return to normal wasn't successful.

UPDATE:
I found the following situation:
1 VPN peer down (the second of three)
2 VPN gateways down because of monitoring (2nd and 3rd)

I disabled monitoring on the 2nd and 3rd VPN gateways, and within 5 minutes everything was online again. The strange thing is that the ping of the primary VPN gateway was more than double what it was before.

When I restored the "gateway monitoring", both the 2nd and 3rd VPN gateways were marked as offline.

Could the gateway monitoring system have some difficulty managing more than one VPN gateway?

SECOND UPDATE:
While I was looking more closely at the live view, I found this:


When everything works well, I see the initial ping request (but not the reply) going out through the corresponding WireGuard gateway.
Suddenly I stop seeing it, and instead I start to see the returning replies trying to enter through the WireGuard gateway that has the highest priority at that moment.

Now the monitoring IP is the same as the gateway IP, but before that I tried using an external IP and creating a dedicated route plus firewall rule to always send the initial ping request through the correct WireGuard VPN. The same behavior happened (even if I didn't see it in the log as clearly as this time).

The Routes status tells me this (2.1.1.1 is the proper gateway for that monitoring IP):


Question: are the routing rules taken into account for these ping queries or not?
Also, this happens with or without the "Disable route" flag set on each WireGuard gateway.

May 18, 2025, 11:25:33 AM #20 Last Edit: May 18, 2025, 11:28:46 AM by FredFresh
Hi again, trying to push this topic.

This week I disabled the gateway monitoring and the system seems to be working much better (previously the monitoring marked the gateways as offline even when the WireGuard peers were online and the handshake had completed properly).

If you have a look at my previous message, it seems that at a certain moment the ping replies of the backup gateways try to return through the main WireGuard gateway (even though I created a dedicated route from the gateway to the monitor IP). -> Any suggestion on how to avoid this?

Personally, I think there is an issue with properly updating and keeping the firewall states or routes used by the "gateway monitoring" service. I also tried the cron job to restart the WireGuard service every 24h, but it had no effect.

What I can confirm, instead, is that once every 7-10 days the backup peers are marked as stale: the odd thing is that to bring them back online I have to restart the modem (and change the public IP seen on the WAN port). Restarting everything else has no effect.

Later I'll send a request to Proton to understand whether they somehow limit new handshakes from an IP that let the connection become stale.

Thank you for your time and help!

EDIT: just a quick extra piece of info. I re-enabled the monitoring on all three WireGuard gateways and only the main one started working; to get the others working I have to perform a traceroute towards each gateway. What is this creating/updating: firewall states or routing rules?

OK, after several tests I think there is a bug in the gateway monitor application when:
- multiple gateways go through WireGuard VPNs
- the WAN gateway is not the primary gateway

It seems that, even with the keepalive option, the firewall state between the gateway instance and the monitor IP is dropped. This leads to the gateway being marked as offline; later the WireGuard peer is marked as stale.

If the gateway is offline but the peer is still online, you can bring the gateway back online in one of these ways:
- traceroute to the monitor IP, OR
- traceroute to the gateway IP, OR
- restart the modem and obtain a different WAN IP

If the gateway is offline and the VPN peer is stale/offline, the only way to bring it back online is to restart the modem and get a new IP on the WAN.

So far I think I tried everything.

@Monviech  my question is: is this enough for someone from the OPNsense team to have a look at it, or should I open a bug report on GitHub?

I am fully available to perform any test that would be useful.

Thank you.

Well, you can always try going for a GitHub report, but if the setup is very hard to replicate and not something a lot of people have, it might not get much traction, as it takes a lot of effort to track edge cases down.
Hardware:
DEC740

I understand; if it is a very rare configuration, it is not worth spending time on it.

Leaving aside the other customizations, I just send all outbound traffic through a VPN (for privacy reasons) and added two backup VPN connections.

I thought it would be a common configuration among privacy-oriented people.

Thank you for your time, I will try to open a report on GitHub.

Hello, topic solved.

Solution:
- create a route to force the connection towards the VPN endpoint IP (provided in the configuration file) to go through the WAN;
- set an external IP as the monitoring IP (not the internal IP of the VPN);
- create a route to force the monitoring traffic towards that IP to go through the correct VPN gateway (one for each VPN connection);

and here is the problem that gave me so much headache:
- DO NOT create firewall rules (in the floating rules section) to try to steer the monitoring traffic towards the monitoring IP.
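
The monitoring part of the solution can be sanity-checked with a small script. The sketch below is only an illustration under assumptions: the gateway names and IPs are made up, `check_monitoring_plan` is a hypothetical helper, and the "one distinct monitor IP per gateway" rule is an inference from the fact that a static route for a given destination can only point at one gateway. It uses Python's standard `ipaddress` module to flag monitor IPs that are internal (private) addresses.

```python
import ipaddress


def check_monitoring_plan(gateways):
    """Check a mapping of gateway name -> monitor IP and return a list of problems.

    Flags monitor IPs that are private (internal) addresses and monitor IPs
    shared by more than one gateway, since one static route per monitor IP
    can only send that traffic through a single gateway.
    """
    problems = []
    seen = {}  # monitor IP -> first gateway that uses it
    for name, monitor_ip in gateways.items():
        address = ipaddress.ip_address(monitor_ip)
        if address.is_private:
            problems.append(f"{name}: monitor IP {monitor_ip} is internal")
        if monitor_ip in seen:
            problems.append(
                f"{name}: monitor IP {monitor_ip} already used by {seen[monitor_ip]}"
            )
        else:
            seen[monitor_ip] = name
    return problems


# Hypothetical plan: VPN2 uses an internal IP, VPN3 reuses VPN1's monitor IP.
plan = {"VPN1": "1.1.1.1", "VPN2": "10.2.0.1", "VPN3": "1.1.1.1"}
for problem in check_monitoring_plan(plan):
    print(problem)
```

An empty result means every gateway has its own external monitor IP, so one static route per monitor IP can be created without conflicts.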

Finally, the monitoring always works or comes back online, and the handshakes are restored when a connection goes stale.