General Discussion / cloning speed issues with Azure DevOps Repo
« on: October 24, 2024, 04:07:07 pm »
I know this is a long shot, but any help would be appreciated.
I'm out of ideas; we have already tried everything on the Azure DevOps side, and I still want to restart the whole setup.
TL;DR
We're seeing slow cloning speeds (~6 Mbit/s) from Azure DevOps when going through an LACP LAGG on OPNsense in our production environment. Testing shows that bypassing OPNsense or using a different (local) uplink improves speeds significantly. The issue seems specific to this setup: public repos clone at the expected speeds. Replicating the setup in a similar test environment did not reproduce the issue. Potential factors include the LAGG configuration, cascaded NAT, or something specific to Azure DevOps.
#### Problem description
We're encountering slow cloning speeds (around 6 Mbit/s) when cloning our private repository from Azure DevOps in our production environment. The setup consists of an axge LACP LAGG on an OPNsense DEC2750 appliance acting as the gateway for machines in a VLAN-trunked subnet, connecting several Proxmox PVE servers over SFP+, with a FritzBox 5G as the cascaded upstream router. This setup is far from optimal, but for now we have to work with it. The problems seemingly started when we moved to the LACP LAGG, but that might be completely unrelated. Switching is done by two HP 5900AF 48XG switches stacked via IRF.
#### Production environment testing
- LAGG trunk performance: peer-to-peer (PVE-to-PVE) bandwidth tests with iperf show the expected ~10 Gbit/s. When traffic passes through, or is tested directly against, the OPNsense appliance, throughput drops to around 2 Gbit/s single-threaded. I read that single-threaded iperf tests are not very realistic, so I ran multi-threaded tests, which cap at around 5 Gbit/s.
- Build servers: cloning on both Windows and Linux virtual machines via the OPNsense LAGG setup results in the slow speeds mentioned above. However, moving the machine to a different, exclusively switched subnet improves the speed by a factor of 10, and a dedicated uplink to the FritzBox gives the expected cloning speeds. Interestingly, cloning a public repository (e.g., Wireshark) is also much faster from anywhere in the setup (a small timing sketch for comparing the two clones follows below). Speed tests on all machines (VMs, hosts, OPNsense itself) reach the speeds they should, i.e. the maximum provided by the FritzBox.
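To make that clone comparison repeatable, something like the following could be run from one of the build VMs. This is only a minimal sketch, assuming git and Python 3 are available; the Azure DevOps URL is a placeholder, and Wireshark's public GitLab repo is used as the reference:

```python
# Rough clone-throughput comparison: time a clone of the private Azure DevOps
# repo against a public reference repo from the same machine, so both paths
# can be compared under identical network conditions.
# NOTE: the Azure DevOps URL below is a placeholder, not a real repo.
import shutil
import subprocess
import tempfile
import time
from pathlib import Path

REPOS = {
    "private-azure": "https://dev.azure.com/ORG/PROJECT/_git/REPO",    # placeholder
    "public-reference": "https://gitlab.com/wireshark/wireshark.git",  # public repo
}

def timed_clone(name: str, url: str) -> None:
    target = Path(tempfile.mkdtemp(prefix=f"clone-{name}-"))
    start = time.monotonic()
    subprocess.run(["git", "clone", "--quiet", url, str(target)], check=True)
    elapsed = time.monotonic() - start
    # Bytes on disk are only a rough proxy for the volume actually transferred.
    size_mb = sum(f.stat().st_size for f in target.rglob("*") if f.is_file()) / 1e6
    print(f"{name}: {size_mb:.0f} MB in {elapsed:.0f} s "
          f"(~{size_mb * 8 / elapsed:.1f} Mbit/s)")
    shutil.rmtree(target, ignore_errors=True)

for name, url in REPOS.items():
    timed_clone(name, url)
```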
#### Test environment insights
To investigate, we replicated the setup in a test environment, both with a single switch and with a stack, but the issue could not be reproduced consistently. The uplink was also behind cascaded NAT. I played with tunables and every LACP setting available on the switch, OPNsense and the PVE host: no difference at all, except for roughly 5% more throughput via OPNsense after some tunable changes.
- LAGG bandwidth: iperf testing shows that a single thread on the LAGG trunk via OPNsense reaches around 2 Gbit/s, while multi-threading improves throughput to about 6 Gbit/s (see the iperf sketch after this list).
- Routing performance: tests showed slightly better performance when routing across VLANs than when staying within a VLAN, but bandwidth is still capped, likely due to hardware or configuration limitations. That seems odd to me.
- Cloning was as fast as expected, no matter which VLAN, machine or repo.
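For anyone wanting to reproduce the iperf numbers, this is the kind of single- vs multi-stream comparison we ran. A minimal sketch, assuming iperf3 is installed on the client and `iperf3 -s` is running on a host on the far side of the OPNsense LAGG; the target IP is a placeholder:

```python
# Single- vs multi-stream TCP throughput through the OPNsense gateway, to make
# the ~2 Gbit/s vs ~5-6 Gbit/s iperf numbers repeatable.
# Assumes iperf3 on the client and `iperf3 -s` on the target host.
import json
import subprocess

TARGET = "10.0.20.10"  # placeholder: a host on the far side of the OPNsense LAGG

def iperf_gbit(streams: int, seconds: int = 10) -> float:
    result = subprocess.run(
        ["iperf3", "-c", TARGET, "-P", str(streams), "-t", str(seconds), "-J"],
        capture_output=True, text=True, check=True,
    )
    data = json.loads(result.stdout)
    # Receiver-side sum over all parallel streams, in bits per second.
    return data["end"]["sum_received"]["bits_per_second"] / 1e9

for streams in (1, 4, 8):
    print(f"{streams} stream(s): {iperf_gbit(streams):.2f} Gbit/s")
```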
#### Conclusion
- I can confirm the uplink "problems" of the OPNsense appliance. People online say a single iperf thread won't saturate a 10 Gbit/s uplink, yet the PVE hosts manage it somehow, and switching seems less performant than routing (??) -> hardware?
- These uplink "problems" don't seem to directly affect cloning speed; at least, that was not observed in the test environment.
- Tuning OPNsense, or rather FreeBSD, via system tunables seems to be necessary on non-OPNsense hardware with 10GbE, as described by some people (see https://calomel.org/freebsd_network_tuning.html) -> not sure how/if this is necessary, or to what extent, on OPNsense hardware (a small tunable-snapshot sketch follows after this list).
- I can't reproduce the cloning issues when cascading NAT or traversing the LAGG or VLAN trunks -> some extremely weird 5G uplink issue?
- Other (public-internet) repos can be cloned with reasonable bandwidth -> an Azure DevOps issue, a repo issue, see above?
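As a next step I'd like to diff the relevant tunables between the production and test appliances. A minimal sketch to be run on the OPNsense boxes themselves; the OID list is just an assumed starting point, loosely based on the calomel guide, not a tuning recommendation:

```python
# Snapshot a handful of FreeBSD network tunables so the production and test
# OPNsense appliances can be diffed side by side.
import subprocess

TUNABLES = [
    "kern.ipc.maxsockbuf",
    "net.inet.tcp.sendbuf_max",
    "net.inet.tcp.recvbuf_max",
    "net.inet.tcp.sendspace",
    "net.inet.tcp.recvspace",
    "net.inet.tcp.mssdflt",
]

for oid in TUNABLES:
    out = subprocess.run(["sysctl", "-n", oid], capture_output=True, text=True)
    print(f"{oid} = {out.stdout.strip() or 'not available'}")
```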