
Making git work better on Windows

In a previous blog post I explained how you can substantially improve the performance of git on Windows by updating the underlying SSH implementation. This performance improvement is very worthwhile in a standard Unix-style git setup where access to the git repository goes over ssh as the transport layer. For a regular development workstation, this update works fine as long as you remember to check and possibly update the ssh binaries after every git update.

I’ve since run into a couple of other issues that are connected to using OpenSSH on Windows, especially in the context of a Jenkins CI system.

Accessing multiple git repositories via OpenSSH can cause problems on Windows

I’ve seen this a lot on a Jenkins system I administer.

When Jenkins is executing a longer-running git operation like a clone or large update, it can also check for updates on another project. During the check, you’ll suddenly see an “unrecognised host” message pop up on the console you’re running Jenkins from, asking you to confirm the host key fingerprint for the git server it uses all the time. What’s happening behind the scenes is that the first ssh process is locking .ssh/known_hosts, and the second ssh process can’t check the host key due to the lock, so it treats the host as unknown.

This problem occurs if you’re using OpenSSH on Windows to access your git server. PuTTY/Pageant is the recommended setup, but I personally prefer OpenSSH because when it works, it is as seamless as on a Unix machine. OK, the real reason is that I tend to forget to start pageant and load its keys, but we don’t need to talk about that here.

One workaround that is often suggested for this issue is to turn off the host key check and use /dev/null as the “storage” for known_hosts. I don’t like that approach much as it feels wrong to me – why add security by insisting on ssh as a transport and then turn off said security? You end up with a somewhat performance-challenged git on Windows and not much security to show for it.
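For reference, that workaround usually looks something like the following entry in your .ssh/config – the host name is a placeholder, and I’m showing it mainly so you can recognise it, not because I recommend it:

# Disable host key verification for this server - NOT recommended
Host gitserver.example.com
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null

Both options are standard OpenSSH client settings; together they stop ssh from recording or verifying the server’s host key, which incidentally also removes the known_hosts lock contention.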

Another workaround improves performance, gets rid of the parallel access issue and isn’t much less safe.

Use http/https transport for git on Windows

Yes, I know that git is “supposed” to use ssh, but using http/https access on Windows just works better. I use the two terms interchangeably here, even though my general preference would be to just use https. If you have to access the server over the public Internet and it contains confidential information, I’d probably still use ssh, but I’d also question why you’re not accessing it over a VPN tunnel. But I digress.

The big advantage of using http for git on Windows is that it works better than ssh simply by virtue of not being a “foreign object” in the world of Windows. There is also the bonus that clones and large updates tend to be faster, even compared to a git installation with updated OpenSSH binaries. As an aside, when I tested the OpenSSH version that ships with Git for Windows against PuTTY/Pageant, the speeds were roughly the same, so you’ll see the performance improvements no matter which ssh transport you use.

As a bonus, it also gets rid of the problematic race condition that is triggered by the locking of known_hosts.
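Switching an existing working copy over is a single command – the URL below is obviously a placeholder for your own server:

git remote set-url origin https://gitserver.example.com/project.git

New clones simply use the https URL directly; everything else about your workflow stays the same.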

It’s not all roses, though, as it requires some additional setup on the part of your git admin. Especially if you use a tool like gitolite for access control, the fact that you end up with two paths in and out of your repository (ssh and http) means that you essentially have to manage two types of access control, as the http transport needs its own. Even with the additional setup cost, in my experience offering both access methods is worth it if you’re dealing with repositories that are a few hundred megabytes or even gigabytes in size. It still takes a fair amount of time to shovel a large unbundled git repo across the wire this way, but you’ll be drinking less coffee while waiting for it to finish.

Improving the performance of Git for Windows

Admittedly I’m not the biggest fan of git – I prefer Mercurial – but we’re using it at work and it does a good job as a DVCS. However, we’re mostly a Windows shop and the out-of-the-box performance of Git for Windows is anything but stellar when you are using ssh as the transport. That’s not too much of a bother with most of our repos, but we have a couple of fairly big ones and clone performance with those matters.

I finally got fed up with the performance after noticing that cloning a large repository from the Linux git server to a FreeBSD box was over an order of magnitude faster than cloning the same repository onto a Windows machine. So I decided to start digging around for a solution.

Clone performance left a lot to be desired with either PuTTY or the bundled OpenSSH as the SSH client. I finally settled on using OpenSSH as I find it easier to deal with multiple keys. Well, it might just be easier if you’re a Unixhead.

My search led me to this discussion, which implies that the problem lies with the versions of OpenSSH and OpenSSL that come prepackaged with Git for Windows; they are rather out of date. Now, I had come across this discussion before and as a result attempted to build my own SSH binary that included the high-performance ssh patches, but even after I got those to build using cygwin, I never managed to get it to work with Git for Windows. Turns out I was missing a crucial detail: the Git for Windows binaries ignore the PATH variable when they look for their OpenSSH binaries and only look in their local directory. After re-reading the above discussion, it turned out that the easiest way to get Git for Windows to recognise the new ssh binaries is to simply overwrite the ones bundled with the Git for Windows installer.

*** Bad hack alert ***

The simple recipe to improve the Git performance on Windows when using a git+ssh server is thus:

  • Install Git for Windows and configure it to use OpenSSH
  • Install the latest MinGW system. You only need the base system and its OpenSSH package. The OpenSSH and OpenSSL binaries that come with this installation are much newer than the ones you get with the Git for Windows installer.
  • Copy the new SSH binaries into your git installation directory. You will need local administrator rights for this as the binaries reside under “Program Files” (or “Program Files (x86)” if you’re on a 64-bit OS). The binaries you need to copy are msys-crypto-1.0.0.dll, msys-ssl-1.0.0.dll, ssh-add.exe, ssh-agent.exe, ssh-keygen.exe, ssh-keyscan.exe and ssh.exe; example copy commands follow below.
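From an elevated command prompt, the copy step might look something like this – the source and destination paths are examples and will depend on where MinGW and Git for Windows ended up on your machine:

copy C:\MinGW\msys\1.0\bin\msys-crypto-1.0.0.dll "C:\Program Files (x86)\Git\bin"
copy C:\MinGW\msys\1.0\bin\msys-ssl-1.0.0.dll "C:\Program Files (x86)\Git\bin"
copy C:\MinGW\msys\1.0\bin\ssh*.exe "C:\Program Files (x86)\Git\bin"

Make a backup of the original binaries first, and remember that the next Git for Windows update will most likely overwrite your replacements, so expect to repeat this step.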

After the above modifications, the clone performance on my Windows machine went from 1.5–2.5 MB/s to 10–28 MB/s, depending on which part of the repository it was processing. That’s obviously a major speedup for cloning, but another nice side effect is that this change results in noticeable performance improvements for pretty much all operations that involve a remote git repository. They all feel much snappier.

Sometimes, std::set just doesn’t cut it from a performance point of view

A piece of code I recently worked with required data structures that hold unique, sorted data elements. The requirement for the data being both sorted and unique came from it being fed into std::set_intersection(), so using a std::set seemed an obvious way of fulfilling these requirements. The code did fulfill all the requirements, but I found the performance somewhat wanting in this particular implementation (Visual Studio 2008 with the standard library implementation shipped by Microsoft). The main problem was that this code is extremely critical to the performance of the application and it simply wasn’t fast enough.

Pointing the profiler at the code quickly suggested that the cost of destroying the std::set was rather out of proportion to the cost of the rest of the code. To make matters worse, the std::set in question was used as a temporary accumulator and was thus created and destroyed very often. The cost was in the destruction of the elements of the std::set, so the obvious technique – keep the std::set around but clear() its contents – did not yield any improvement in the overall runtime. The reason is that red-black trees (the underlying implementation of this – and most – std::sets) are very expensive to tear down: you have to traverse the whole tree and delete it node by node, even if the data held in the nodes is a POD. Clearly, a different approach was needed.

A post on gamedev.net suggested that I wasn’t the first person to run into this issue, nor did it appear to be specific to my platform. The post also suggested a technique that should work for me. Basically, std::set provides three guarantees – the contents are sorted, the keys are unique, and the lookup speed for a given key is O(log n). As I mentioned in the introduction, I only really needed two of the three guarantees, namely unique elements in sorted order; std::set_intersection() expects input iterators as its parameters, so despite its name it is not tied to a std::set, even though that’s the obvious initial choice of data structure.

In this particular case – a container accumulating a varying number of PODs, that eventually get fed into std::set_intersection – I could make use of the fact that at the point of accumulation of the data, I basically didn’t care if the data was sorted and unique. As long as I could ensure the data in the container fulfilled both criteria before calling std::set_intersection, I should be fine.

The simplest and cheapest approach in terms of CPU time I was able to come up with was to accumulate the data in a std::vector of PODs, which is cheap to destroy – much as described in the above thread on gamedev.net. I then took care of the sorted-and-unique requirements just before feeding it into std::set_intersection:

#include <algorithm>
#include <vector>
#include <stdint.h> // uint32_t; <cstdint> on newer compilers

std::vector<uint32_t> accumulator;

... accumulate lots of data ...

// Sort first - std::unique only collapses adjacent duplicates.
std::sort(accumulator.begin(), accumulator.end());
// std::unique shuffles the duplicates to the back and returns the new
// logical end; erase() then actually shrinks the vector.
accumulator.erase(std::unique(accumulator.begin(),
                              accumulator.end()),
                  accumulator.end());

Note the call to accumulator.erase() – std::unique doesn’t actually remove the ‘superfluous’ elements from the std::vector, it just returns an iterator to the new logical end of the container. In my code I couldn’t make use of that iterator directly, so I had to actually shrink the std::vector to contain only the elements of interest. Nevertheless, changing over from a real std::set to the ‘fake’ one resulted in a speed increase of about 2x-3x, which was very welcome.
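To round off the picture, here is a minimal sketch of the final step – the function and variable names are mine, not from the original code base – showing two such sorted, uniquified vectors being fed into std::set_intersection():

#include <algorithm>
#include <iterator>
#include <vector>
#include <stdint.h> // uint32_t

// Both inputs must already be sorted and duplicate-free - exactly the
// state the sort/unique/erase sequence above leaves them in.
std::vector<uint32_t> intersect(const std::vector<uint32_t>& lhs,
                                const std::vector<uint32_t>& rhs)
{
    std::vector<uint32_t> result;
    std::set_intersection(lhs.begin(), lhs.end(),
                          rhs.begin(), rhs.end(),
                          std::back_inserter(result));
    return result;
}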

Basically, if you need a std::set, you need a std::set – just keep its costs in mind when the container in question only has a short lifetime. I don’t advocate ripping out all sets from your application and replacing them with sorted, uniquified std::vectors. However, in cases like the one described above, when you don’t need the full functionality of a std::set, a std::vector can be a very reasonable alternative, especially once you have identified the std::set as a bottleneck. Yes, you could probably also get O(log n) lookup speed in the sorted and uniquified vector by using std::binary_search, but that’s getting a little too close to declaring “the only C++ data structure I’m familiar with is a std::vector”. Using the appropriate data structure (like a std::set) communicates additional information to the readers of your code; workarounds like the one above are not that obvious and tend to obscure your code somewhat.
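For completeness, such a lookup in the sorted, uniquified vector would look something like this – again a sketch with a made-up value, not code from the application in question:

// Requires 'accumulator' to be sorted; O(log n), like std::set::find().
bool found = std::binary_search(accumulator.begin(),
                                accumulator.end(),
                                42u);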