Zerotier and odd performance issues...

Started by j_s, August 26, 2022, 09:33:44 PM

I've always been relatively disappointed in the zerotier VPN performance, even though we've been using it for almost a year.  I'd always thought ZT was "just slow" until I did some googling and more testing.  A few websites rated ZT's performance as nearly identical to IPsec and other VPNs.  When I saw some people doing 400Mb/sec, I started questioning whether my performance is expected or whether something is actually wrong on my end.  So I started digging deeper...

I have 2 different ZT networks and different sites.  All are using Opnsense and all are on the latest versions.  I've chosen 2 sites close to each other by latency (and geographically) that I can do testing on that are not using HA (High Availability adds another layer of complexity, so I'm trying to forgo dealing with that until I can get non-HA sites working better).

Let's pretend I have 2 sites.  Site one has router-A and desktop-A (it's a business internet account, but at a home) with internet speeds of 500Mb down and 35Mb up.  Site two has router-B and Server-B with internet speeds of 1Gb down and 1Gb up (this is also a home, but has fiber to the home).

My sites have a config file that is like this:

{
   "physical": {
      "192.168.0.0/16": { "blacklist": true },
      "10.0.0.0/16": { "blacklist": true }
   },
   "settings": {
      "primaryPort": 9993,
      "portMappingEnabled": false,
      "allowSecondaryPort": false,
      "allowTcpFallbackRelay": false
   }
}
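
For anyone double-checking what those two "blacklist" entries actually cover, here is a minimal Python sketch using the standard ipaddress module.  The prefixes are copied from the config above; the candidate addresses are made-up examples, not addresses from my sites.

```python
import ipaddress

# Prefixes copied from the "physical" section of the config above.
BLACKLISTED = ["192.168.0.0/16", "10.0.0.0/16"]


def is_blacklisted(addr: str, prefixes=BLACKLISTED) -> bool:
    """Return True if addr falls inside any of the given prefixes."""
    ip = ipaddress.ip_address(addr)
    return any(ip in ipaddress.ip_network(p) for p in prefixes)


# Hypothetical addresses, chosen only to illustrate the prefix boundaries.
for candidate in ["192.168.1.10", "10.0.5.1", "10.1.0.1", "172.16.0.1"]:
    print(candidate, is_blacklisted(candidate))
```

Note that 10.0.0.0/16 only covers 10.0.0.0 through 10.0.255.255, so an address such as 10.1.0.1 is not matched by either entry.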


I then have the appropriate firewall rules to allow ZeroTier traffic on port 9993.

If I do iperf3 from router-A to router-B, I get 36Mbit/sec.  That saturates the uplink at site A so I'm okay with that.

If I do iperf3 from router-B to router-A, I get 312Mb/sec.  I'm okay with that speed since it is 14ms between the two routers.  More would be great, but I'm gonna take this issue as a bunch of baby steps instead of being upset that it doesn't saturate 1Gb.  (I'm expecting to need some very large buffers if I really want to do 1Gb with 14ms of latency.  I'm no expert at adjusting buffers on opnsense, and I haven't found a link that would teach me enough to figure it out on my own, so I'm gonna work on the issues I do understand first.)
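
To put a rough number on "very large buffers", here is a small sketch of the bandwidth-delay product (how much data has to be in flight to keep a link full).  It assumes the 14ms figure is a round-trip time; if it is one-way latency, the results double.

```python
def bdp_bytes(bandwidth_bits_per_sec: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the path."""
    return bandwidth_bits_per_sec * rtt_seconds / 8


RTT = 0.014  # assumption: 14ms is the round-trip time between the routers

# Rates taken from the tests in this post, plus the 1Gb target.
for label, rate_bps in [("1Gb target", 1e9), ("443Mb direct", 443e6), ("312Mb over ZT", 312e6)]:
    print(f"{label}: about {bdp_bytes(rate_bps, RTT) / 1e6:.2f} MB in flight")
```

Under that assumption, filling 1Gb at 14ms needs roughly 1.75 MB in flight per connection, so the TCP windows and socket buffers on both ends would have to allow at least that much.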

For "the lolz" I did do iperf3 from router-B to router-A using the external IPs to compare against the VPN.  I got 443Mb/sec.

To summarize thus far:
1.  I can saturate the upstream at site A.
2.  Going from router-B to router-A, I get 443Mb without the VPN and 312Mb over the VPN.

If I then connect to an SMB share on Server-B (TrueNAS) and start downloading a file to Desktop-A (Windows 10), I get 9-12MB/sec, fluctuating (I'll just call it a 100Mb link for simplicity).  That's about 1/3 of what I got with iperf3 from router-B to router-A.  So I did a bunch of things here because I wanted to see exactly what was going wrong:
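
Since the SMB numbers are in megabytes and the iperf3 numbers are in megabits, here is the conversion spelled out as a quick sketch.  All figures come from the measurements above.

```python
def mbytes_to_mbits(mbytes_per_sec: float) -> float:
    """Convert megabytes/sec to megabits/sec (8 bits per byte)."""
    return mbytes_per_sec * 8


VPN_ROUTER_TO_ROUTER_MBPS = 312  # iperf3, router-B to router-A over the VPN

for smb_rate in (9, 12):
    mbits = mbytes_to_mbits(smb_rate)
    share = mbits / VPN_ROUTER_TO_ROUTER_MBPS
    print(f"{smb_rate} MB/sec = {mbits:.0f} Mb/sec ({share:.0%} of the router-to-router rate)")
```

So 9-12MB/sec is 72-96Mb/sec, which is a bit under the "100Mb" shorthand and somewhat less than a third of the router-to-router result.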

1.  I did a packet capture on Desktop-A using wireshark.  Things go smoothly for the smb transfer, but every 1 second or so I get inundated with a crapload of TCP DUP ACKs.  I'll get 100 to 120 DUP ACKs in less than 2ms.  At the same time I'll get a dozen or so TCP out of orders and a fast retransmit.  Then another second of what looks like normal SMB traffic, followed by another huge group of dup acks, out of order packets, and fast retransmits.
2.  I did an iperf3 test from Server-B to Desktop-A.  I get 9-12MB/sec with 5-50 retransmits almost every second, per iperf3.  SMB's and iperf3's performance seem to be in line with each other, so that means I probably don't have an SMB problem.  I also checked with Wireshark and I see the same dup acks and such as I mentioned in #1 above.
3.  I did iperf3 from Desktop-A to Router-A and vice versa and I got 931Mb/sec (basically a 1Gb connection).
4.  I did an iperf3 from Server-B to Router-B and vice versa and I got 930Mb/sec (basically a 1Gb connection).
5.  So I did iperf3 from Router-B to Desktop-A and I get about 100Mb/sec with 5-20 retransmits every second from iperf3 output.
6.  I did iperf3 from Server-B to Router-A and I got about 100Mb/sec with 5-20 retransmits every second from iperf3 output.
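
For anyone wanting to compare runs like these without eyeballing the console output, here is a rough Python sketch that summarizes throughput and retransmits from a report saved with iperf3's JSON option (-J).  The field names ("intervals", "sum", "end", "sum_sent", "retransmits") are what I understand the JSON report to contain; treat them as assumptions and check them against your iperf3 version.  Some builds may not report retransmits at all, so the code tolerates a missing field.

```python
import json


def load_report(path: str) -> dict:
    """Load a report previously saved with something like: iperf3 -c HOST -J > run.json"""
    with open(path) as f:
        return json.load(f)


def summarize_iperf3(report: dict) -> dict:
    """Summarize average throughput and retransmits from a parsed iperf3 JSON report."""
    per_interval = []
    for interval in report.get("intervals", []):
        s = interval["sum"]
        # "retransmits" may be absent depending on platform/build, hence .get().
        per_interval.append((s["bits_per_second"] / 1e6, s.get("retransmits")))

    end = report.get("end", {}).get("sum_sent", {})
    return {
        "avg_mbps": end.get("bits_per_second", 0.0) / 1e6,
        "total_retransmits": end.get("retransmits"),
        "interval_count": len(per_interval),
        "intervals_with_retransmits": sum(1 for _, r in per_interval if r),
    }
```

Running summarize_iperf3(load_report("run.json")) on each of the six tests would make it easier to see which paths show retransmits in nearly every interval and which show none.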

So it seems if I go from router to router directly, all is good.  I get good speeds and no retransmits.  But as soon as I want to go from a desktop or server on one site to anything on the other (including the router on the other side), performance and reliability take a nasty nosedive.
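
Here is the same observation as a small sketch: the site-B-to-site-A results over the VPN, grouped by whether a non-router host is involved.  The numbers are the approximate figures from my tests above; the grouping is only a way to lay out the pattern, not a diagnosis.

```python
ROUTERS = {"router-A", "router-B"}

# (source, destination, approx Mb/sec) for site B -> site A tests over the VPN.
VPN_TESTS = [
    ("router-B", "router-A", 312),
    ("Server-B", "Desktop-A", 100),
    ("router-B", "Desktop-A", 100),
    ("Server-B", "router-A", 100),
]


def involves_lan_host(src: str, dst: str) -> bool:
    """True if at least one endpoint is a desktop/server rather than a router."""
    return not {src, dst} <= ROUTERS


def group_results(tests):
    """Split results into router-only paths and paths that involve a LAN host."""
    groups = {"router_only": [], "with_lan_host": []}
    for src, dst, mbps in tests:
        key = "with_lan_host" if involves_lan_host(src, dst) else "router_only"
        groups[key].append((src, dst, mbps))
    return groups


for name, rows in group_results(VPN_TESTS).items():
    print(name, rows)
```

Every test with a desktop or server on either end lands in the slow group, while the only router-to-router test is roughly three times faster.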

To put it another way, I can go from router to router just fine, but if I want to actually use the VPN in any useful way using other servers and desktops, it's unreliable and slow.

Any ideas where to even start to investigate this issue? 

I did try emailing the zerotier plugin maintainer to start a conversation more than a week ago, but he didn't respond.  I'd really rather not bother him with additional emails unless I can really prove this is a zerotier issue.

As this is affecting business functionality, the company is open to the idea of paying for someone to troubleshoot and identify the issue.  But I don't know where to start.  Is it an Opnsense problem and I need an Opnsense expert?  Is it a Zerotier problem?  If so, is it the plugin itself or is it the zerotier code?  Is it just a configuration problem on my part?  I know that the Zerotier documentation at https://docs.zerotier.com/devices/opnsense makes it pretty clear that Zerotier, Inc. doesn't maintain the opnsense implementation.  From some of the "official" posts in the Zerotier forums I get the impression Zerotier, Inc. has the attitude of "we don't do opnsense, so if it works, great, and if it doesn't, don't talk to us about it either".

Thanks to whoever read this all the way to the end.  I realize this is a lot to swallow.

Wow.  I'm surprised nobody had anything to try or ideas.  Did I break the hive mind?  LOL!

I'm gonna do more testing later this week on this, so I may have more information as time progresses.

Afraid to say my experience with Zerotier on Opnsense has been nothing short of baffling. It's maddeningly inconsistent with whether it will come online or not, requires multiple reboots for the service to stay online, peers go to RELAY for no apparent reason, etc.
I don't think this is actually an Opnsense issue; I suspect that BSD is a second-class citizen when it comes to Zerotier development.

September 18, 2022, 09:33:37 AM #3 Last Edit: September 18, 2022, 09:35:18 AM by j_s
I won't disagree with you as I've read plenty of people having that experience.  However, my experience has been pretty good, except for this issue.  I did a lot of homework and tried to incorporate as many "lessons learned" from other people as I could before we started using it at all.  Let alone deciding to rely on it for "production".

But I have no direction to go with this particular issue.  I did try emailing the package maintainer per the plugin details, but that was more than a month ago with no response at all.

I'm a little disappointed because I have no idea where to go with this issue.  What if I had found some serious significant security issue?  I honestly have no idea where I'd go as I've already used the first avenue to correct the issue.

At this point, since we're paying for the Zerotier service, I'm going to try reaching out to them next week to see what they think of this.

Overall, 5MB/sec between sites is not going to work for us, so we'll ultimately have to find another solution if we cannot fix this issue.

Edit:  When I originally started using Zerotier, I was planning to put together a guide (and a friend who's a Youtuber wanted to do a video guide) on this.  I thought that could help solve a lot of people's problems with Zerotier by having some kind of well-documented guide or video.  But with the significance of the performance issue I don't know that I could really recommend this to anyone.