OPNsense Forum

Archive => 21.1 Legacy Series => Topic started by: talopensense on March 28, 2021, 07:09:08 pm

Title: How to troubleshoot wireguard tunnels and the effect of packet loss?
Post by: talopensense on March 28, 2021, 07:09:08 pm
Hi all,

I have wireguard-go implemented in multiple OPNsense instances running 21.1.3 and 21.1.3_3.
When there is zero packet loss, there is no issue, but even small packet loss seems to affect WG stability disproportionately.
Is there a way to understand why my WireGuard connections show 50% packet loss when a ping to that tunnel's endpoint destination shows less than 2%?
The problem I experience is that forwarding over the WG tunnels stops completely for several seconds, if not minutes, with no clear reason, while pings to the endpoint keep going.
I am not sure where to look for more details.
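One way to make the discrepancy measurable is to run the same loss test against the peer's public endpoint (the underlay) and against a host that is only reachable through the tunnel, then compare the two figures. A minimal sketch; the addresses are placeholders, and the awk step simply parses ping's summary line:

```shell
#!/bin/sh
# Compare loss on the underlay vs. through the tunnel.
# 203.0.113.10 and 172.27.77.1 are placeholder addresses: substitute
# the peer's public endpoint and a host behind the tunnel.
measure_loss() {
  # 100 probes, 0.2 s apart; extract the loss figure from the summary.
  ping -c 100 -i 0.2 "$1" 2>/dev/null |
    awk -F', ' '/packet loss/ {print $3}'
}
# Real use:
#   measure_loss 203.0.113.10   # underlay
#   measure_loss 172.27.77.1    # across the tunnel
# The parsing step alone, demonstrated on a canned summary line:
printf '100 packets transmitted, 98 packets received, 2.0%% packet loss\n' |
  awk -F', ' '/packet loss/ {print $3}'
# -> 2.0% packet loss
```

If the underlay stays around 2% while the in-tunnel figure spikes far higher during a stall, that at least pins the problem to the tunnel rather than the path.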

I have checked the firewall logs. I see this recurring message each time the WG tunnels stop passing traffic:
Code: [Select]
pflog0: promiscuous mode enabled
pflog0: promiscuous mode disabled

I have checked the interface counters for any major errors, but everything is clear (i.e. no errors):
Code: [Select]
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
igb0   1500 <Link#1>      00:90:0b:44:6d:02  2062082     0     0   638152     0     0
igb0      - fe80::%igb0/6 fe80::290:bff:fe4        0     -     -        1     -     -
igb0      - 174.112.148.0 174.112.148.15        1692     -     -     5616     -     -
igb1   1500 <Link#2>      00:90:0b:44:6d:03   654939     0     0  1635305     0     0
igb1      - 172.27.0.0/22 172.27.0.252         28938     -     -    35430     -     -
igb1      - fe80::%igb1/6 fe80::290:bff:fe4      110     -     -      112     -     -
igb1      - 172.27.0.0/22 172.27.0.254           316     -     -        0     -     -
igb2   1500 <Link#3>      00:90:0b:44:6d:04    10003     0     0    11472     0     0
igb2      - fe80::%igb2/6 fe80::290:bff:fe4        0     -     -        1     -     -
igb2      - 192.168.1.0/2 192.168.1.3           8559     -     -     3241     -     -
igb3*  1500 <Link#4>      00:90:0b:44:6d:05        0     0     0        0     0     0
igb4*  1500 <Link#5>      00:90:0b:44:6d:06        0     0     0        0     0     0
igb5*  1500 <Link#6>      00:90:0b:44:6d:07        0     0     0        0     0     0
enc0*  1536 <Link#7>      enc0                     0     0     0        0     0     0
lo0   16384 <Link#8>      lo0                  16749     0     0    16749     0     0
lo0       - ::1/128       ::1                      0     -     -        0     -     -
lo0       - fe80::%lo0/64 fe80::1%lo0              0     -     -        0     -     -
lo0       - 127.0.0.0/8   127.0.0.1            16746     -     -    16749     -     -
pflog 33160 <Link#9>      pflog0                   0     0     0    60670     0     0
pfsyn  1500 <Link#10>     pfsync0                  0     0     0        0     0     0
ovpnc  1500 <Link#11>     ovpnc1                9272     0     0    12888     0     0
ovpnc     - fe80::%ovpnc1 fe80::290:bff:fe4        0     -     -        1     -     -
ovpnc     - 10.8.0.0/24   10.8.0.8               693     -     -        0     -     -
wg0    1420 <Link#12>     wg0                    727     0     0     4073     0     0
wg0       - 172.27.252.0/ 172.27.252.1          2244     -     -     2647     -     -
wg1    1420 <Link#13>     wg1                    409     0     0      933     0     0
wg1       - 172.27.77.0/2 172.27.77.254            0     -     -        0     -     -


I disabled schedules, since they interrupted traffic even more, with the message above (pf being reloaded) popping up regularly; I am trying to minimize the moving pieces to help with troubleshooting.

I use a remote LibreNMS instance that reaches private IP addresses behind that OPNsense instance over the WG tunnel that is experiencing packet loss. It is just a bit difficult to correlate that kind of packet loss with the behaviour I observe on OPNsense, which looks like something in the way OPNsense 'reacts' to WG tunnels. I cannot say the tunnels are going down, since I don't believe WG tunnels are stateful - I might be wrong here; if they are, all I can observe is the handshake.

On the WG tunnels, I have a persistent keepalive of 2 seconds, which I reduced to 1 just to see if that would make a difference.
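For reference, that keepalive corresponds to WireGuard's PersistentKeepalive peer option. A sketch of the stanza with placeholder key/endpoint/network values; note that the upstream documentation suggests 25 seconds for NAT traversal, so 1-2 seconds is unusually aggressive and generates a near-constant stream of probe packets:

```ini
[Peer]
# Placeholder values for illustration only.
PublicKey = <peer-public-key>
Endpoint = peer.example.org:51820
AllowedIPs = 172.27.77.0/24
# Seconds between keepalive probes when idle; 25 is the commonly
# recommended value, while 1-2 means near-constant probing.
PersistentKeepalive = 25
```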

I have attached a few RRD graphs showing the impact of the connectivity loss: LibreNMS cannot query SNMP because of the severe connectivity issue through WG. To rule out LibreNMS itself, I can confirm the same instance reaches other destinations on the internet with zero gaps and zero loss. I am definitely aware that packet loss is not good and should be addressed; it is the exponential effect on WG that surprises me and is a bit hard to understand, given my limited experience troubleshooting WG and the behaviour of the Go implementation on OPNsense. I don't think it is a problem as such; it is more about how to make the cause more evident.

For reference, see the attached graphs:

Normal monitoring report of memory utilization with no packet loss (no gaps):
20210327-WG-troubleshooting-no-packetloss-2_result.jpg

How it has started:
20210327-WG-troubleshooting-packetloss-2_result.jpg

How it is going:
20210327-WG-troubleshooting-packetloss-1_result.jpg

Meanwhile, packet loss towards major destinations is minimal, and, on a residential connection, the ISP is not really interested in doing much beyond modem/router restarts. It is a longer story, but they activated OFDMA on the upstream channel last November, which triggered a significant issue for many people. They disabled it in February and are now apparently trying again. So it will be a long battle, but I wanted to see if I could get a bit further with traces from WG to confirm the effect of packet loss, beyond running pings on top of the WG tunnels. The real question is what happens in WG when I see the traffic stopping for such a long period while the underlying connectivity experiences nowhere near as much packet loss.

As for what I can verify myself: I have the widget showing the last handshake for each WG tunnel, and I look at the log, where I can clearly see when WG comes up or goes down (if I disable WG), but during the episodes where connectivity is affected there is nothing at all in the logs.
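The widget's handshake timestamp can also be read from the shell, which is handy for correlating with the stall windows. A small sketch, assuming the stock `wg` tool (`wg show wg0 latest-handshakes` prints one line per peer: public key, a tab, and a Unix timestamp, with 0 meaning never):

```shell
#!/bin/sh
# Print the age of the last handshake per peer.
# Reads "wg show <if> latest-handshakes" output on stdin; an optional
# first argument overrides "now" (used below for a deterministic demo).
handshake_age() {
  now=${1:-$(date +%s)}
  while read -r peer ts; do
    if [ "$ts" -eq 0 ]; then
      echo "$peer: no handshake yet"
    else
      echo "$peer: $((now - ts)) s ago"
    fi
  done
}
# Real use (needs root):
#   wg show wg0 latest-handshakes | handshake_age
# Demo on a canned line (fake key, handshake at t=910, "now" at t=1000):
printf 'PEERKEY\t910\n' | handshake_age 1000
# -> PEERKEY: 90 s ago
```

Polling this every few seconds during a stall would show whether handshakes keep succeeding while forwarding is stopped, or whether the handshake itself is stalling.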

Any ideas would be welcome to help me progress in understanding WG in OPNsense a bit better.