Why I’m strongly opposed to modifying third-party library code

I encounter this on a fairly regular basis – a project uses a third-party library and either there is a bug in the library that we can’t seem to avoid hitting, or a feature is missing or not 100% ideal for our use case.

Especially when dealing with an open source library, at this point someone will inevitably suggest that we have the source, so we should just fix/hack/modify the library and get on with our lives. I’m massively opposed to that approach, with essentially one exception I’ll mention towards the end.

So, why am I so opposed to changing a third-party library, even if I have the code and the rights to do so?

It’s very simple – it adds to the maintenance headache. If I suddenly find myself having to change outside code, I have to be able to maintain it going forward. That means:

  • I suddenly need a place to keep the modified library in source control so I can maintain my patches going forward.
  • Updating the library becomes an exercise in merging my local changes with the upstream changes, and should the author make substantial structural changes, it’s not even clear whether the new version can easily be integrated into the patched one at all. You might well end up locked out of future updates because the effort involved in porting your changes forward is bigger than you can reasonably accommodate.
  • You end up owning another piece of code that, by rights, you really shouldn’t. Your maintenance effort goes up unnecessarily because all of a sudden you’re not only responsible for bugs in your own code, but also for bugs in third-party library code that isn’t so third-party anymore.

There is one exception I’d make to the rule of not touching third-party code: you’ve talked to the maintainer/owner of the library, agreed that what you’re trying to do is actually beneficial to everybody involved and, most importantly, the owner is willing to take your change and integrate it into their next release. Most open source projects will be more than happy to accept contributions this way as long as they fit in with their vision of how the code should work. The same goes for commercial libraries, too – just talk to them beforehand to make sure the problem isn’t a misunderstanding of the code at your end.

Under those particular circumstances, yes, I’d accept that it’s OK to change third-party code. Any other reason, don’t do it.

Butbutbutbut, what if I absolutely have to change an important third-party library because its API doesn’t fit our model anymore?

No, you don’t, sorry. Either replace the library or write a shim that takes care of your needs, because I don’t want to have a discussion with you in a few years’ time as to why we can’t upgrade a library that we decided we needed to hack and are now tied to an old code base that’s potentially buggy and full of security holes.
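
To give a rough idea of what I mean by a shim, here is a minimal sketch. The thirdparty::Logger class and its open()/write() calls are made-up stand-ins for whatever API no longer fits your model – the point is that the rest of your code only ever talks to your own wrapper:

// our_logger.h – the only header the rest of our code includes.
#include <string>
#include "thirdparty/logger.h"   // hypothetical third-party header

class OurLogger {
public:
    explicit OurLogger(const std::string& path)
    {
        impl_.open(path);             // adapt the library API here, in one place
    }

    void log(const std::string& message)
    {
        impl_.write(message + "\n");  // the rest of the code never calls the library directly
    }

private:
    thirdparty::Logger impl_;         // the third-party type never leaks out
};

If the library ever has to be replaced, or its API changes in the next release, this wrapper is the only piece of code that needs to change.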

Large file handling in Emacs revisited, with a quick look at VLF

I recently blogged about installing a 64-bit build of Emacs for Windows because I was dealing with a bunch of large and very large files.

While the 64-bit build definitely handled the really large files much better than the 32-bit build, there were still some performance issues. The main advantage of the 64-bit build was that I could finally load a couple of files that I wasn’t able to load at all in the 32-bit build, but opening them still severely tested my patience.

The other problem that I didn’t think about initially was that if you happen to have desktop-save-mode turned on, Emacs will of course reload the large files every time you restart it, unless you remember to close them before shutting down. That tends to take a considerable amount of time, because either build of Emacs is rather slow when it comes to loading files in the hundreds-of-megabytes-or-bigger category.

A commenter on the previous post kindly suggested using VLF for large files, which is what I’ve been doing instead. I keep forgetting to turn on the integration that should save me from having to remember to manually kick off vlf every time I want to open a large file, but other than that it just works.

Highly recommended if you need to deal with large files from time to time.

Dr Dobb’s – the end of an era

I grew up as a software developer on a steady diet of Dr Dobb’s magazines. I was hooked the first time I came across an issue of the magazine as a student in the university library, and for most of my career I was a subscriber, until the print magazine was cancelled. I was sad to read this morning that after 38 years of publication, first in print and then on the web, the online edition has now met the same fate.

I probably learned more about real world software development from reading articles in the magazine than from the courses I took at university. Heck, even if the articles initially went completely over my head, I learned from them. I remember reading the series of articles about 386BSD and being strangely fascinated by them, even though at the time I had trouble understanding parts of them. In a strange “what goes around comes around” fashion, I’ve been using FreeBSD – which is a direct descendant of 386BSD – for almost 20 years now.

If Dr Dobb’s hadn’t opened my eyes to the strange and wonderful world of software development with all its facets, I doubt I’d be where I am today, still fascinated by computers and programming, and occasionally still infuriated by both.

RIP, Dr Dobb’s. Thanks for the ride.

Image of the title page of the first issue courtesy of Wikipedia.


Making git work better on Windows

In a previous blog post I explained how you can substantially improve the performance of git on Windows by updating the underlying SSH implementation. This performance improvement is very worthwhile in a standard Unix-style git setup where access to the git repository is done using ssh as the transport layer. For a regular development workstation, this update works fine as long as you remember that you need to check and possibly update the ssh binaries after every git update.

I’ve since run into a couple of other issues that are connected to using OpenSSH on Windows, especially in the context of a Jenkins CI system.

Accessing multiple git repositories via OpenSSH can cause problems on Windows

I’ve seen this a lot on a Jenkins system I administer.

When Jenkins is executing a longer-running git operation like a clone or large update, it can also check for updates on another project. During that check, you’ll suddenly see an “unrecognised host” message pop up on the console you’re running Jenkins from, asking you to confirm the host fingerprint/key for the git server it uses all the time. What’s happening behind the scenes is that the first ssh process is locking .ssh/known_hosts and the second ssh process suddenly can’t check the host key due to the lock.

This problem occurs if you’re using OpenSSH on Windows to access your git server. PuTTY/Pageant is the recommended setup, but I personally prefer using OpenSSH because when it is working, it’s as seamless as it is on a Unix machine. OK, the real reason is that I tend to forget to start pageant and load its keys, but we don’t need to talk about that here.

One workaround that is often suggested for this issue is to turn off the key check and use /dev/null as the “storage” for known_hosts. I don’t personally like that approach much as it feels wrong to me – why add security by insisting on ssh as a transport and then turn that security off? You end up with a somewhat performance-challenged git on Windows and not much in the way of security to show for it.

Another workaround improves performance, gets rid of the parallel access issue and isn’t much less safe.

Use http/https transport for git on Windows

Yes, I know that git is “supposed” to use ssh, but using http/https access on Windows just works better. I’m using the two interchangeably even though my general preference would be to just use https. If you have to access the server over the public Internet and it contains confidential information, I’d probably still use ssh, but I’d also question why you’re not accessing it over a VPN tunnel. But I digress.

The big advantage of using http for git on Windows is that it works better than ssh simply by virtue of not being a “foreign object” in the world of Windows. There is also the bonus that clones and large updates tend to be faster, even compared to a git installation with updated OpenSSH binaries. As an aside, when I tested the OpenSSH version that ships with git for Windows against PuTTY/Pageant, the speeds were roughly the same, so you’ll see those performance improvements no matter which ssh transport you use.

As a bonus, it also gets rid of the problematic race condition that is triggered by the locking of known_hosts.

It’s not all roses, though, as it requires some additional setup on the part of your git admin. Especially if you use a tool like gitolite for access control, having two paths in and out of your repository (ssh and http) means you essentially have to manage two sets of access control, as the http transport needs its own. Even with the additional setup cost, in my experience offering both access methods is worth it if you’re dealing with repositories that are a few hundred megabytes or even gigabytes in size. It still takes a fair amount of time to shovel a large unbundled git repo across the wire this way, but you’ll be drinking less coffee while waiting for it to finish.

Checking C++ library versions during build time

In my previous post, I discussed various strategies for managing third party libraries. In this post I’ll discuss a couple of techniques you can use to ensure that a specific version of your source code will get compiled with the correct version of the required libraries.

Yes, you can rely on your package management tools to always deliver you the correct versions. If you’re a little more paranoid and/or have spent way too much time debugging problems stemming from mixing the wrong libraries, you may want to continue reading.

Suggestion #1 – use C++ compile time assertions to check library versions

This suggestion only works if your libraries have version numbers that are easily accessible as compile-time constants. You can use something like BOOST_STATIC_ASSERT or C++11’s static_assert to do a compile-time check of the version number against your expected version number. If the check fails, it breaks the compilation, so you get an immediate hint that there might be a problem.

The code for this could look something like this example:

First, in the header file the version number constant is defined:

...
const int libgbrmpzyyxx_version = 0x123;
...

The header file or source file pulling in all the version headers then checks that it’s pulled in the correct version:

#include "boost/static_assert.hpp"
#include "lib_to_check"

BOOST_STATIC_ASSERT(libgbrmpzyyxx_version == 0x123);

If you are so inclined and are using C++11, you can replace the BOOST_STATIC_ASSERT with the standard static_assert.
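
Using the same made-up version constant as above, that check would look like this:

static_assert(libgbrmpzyyxx_version == 0x123,
              "libgbrmpzyyxx has an unexpected version, expected 0x123");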

My suggested approach would be to have a single header file in each module of your project that pulls in all relevant #include files from all required libraries. This file should also contain all the checks necessary to determine if the libraries that got pulled in have the correct version numbers. This way, having a compilation error in a single, well-named file (call it ‘libchecks.H’, for example) should immediately suggest to you that a library needs updating. If you keep the naming schema consistent, a quick glance at the error message should provoke the right sort of “don’t make me think – I know what the problem is already” type response.
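
A rough sketch of what such a file could look like, with made-up library names and version constants:

// libchecks.H – pulls in the version headers of every third-party library
// this module uses and breaks the build if any of them has the wrong version.
#include "boost/static_assert.hpp"
#include "libgbrmpzyyxx/version.h"   // hypothetical version headers
#include "libgrmblfx/version.h"

BOOST_STATIC_ASSERT(libgbrmpzyyxx_version == 0x123);
BOOST_STATIC_ASSERT(libgrmblfx_version == 0x042);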

Suggestion #2 – use link failures to indicate library versioning problems

This is a variation of suggestion #1, only that instead of using a compile time check along the lines of BOOST_STATIC_ASSERT, your library contains a version-specific symbol which your code references. Obviously, if the code references a symbol that doesn’t exist in the library, the linker will fail and you’ll get a message which is relatively easy to parse for a human and still pinpoints the problem. The advantage of this method is that it works across languages – you can use it in plain C code when linking against a library that is implemented in C++, for example, or in C extension modules built for dynamic languages. Its main downside is that in case of a version mismatch, the build fails a lot later in the process and gives you the same information you could have had with suggestion #1, only three cups of coffee later. That said, if your project builds fast enough, the difference in elapsed time between suggestions #1 and #2 might be negligible. On the other hand, if your build takes hours or days to complete, you really should try to make suggestion #1 work for you.

This suggestion relies on the fact that somewhere in the library code, a constant is defined that is externally visible, for example:

...
// 'extern' gives the constant external linkage in C++, so it actually
// shows up as a symbol in the library that the linker has to resolve.
extern const int libgbrmpzyyxx_1_23_version = 0;
...

And somewhere in your code, you try to access the above constant simply by referencing it:

extern const int libgbrmpzyyxx_1_23_version;   // declaration, normally provided by the library header
const int test_lib_version = libgbrmpzyyxx_1_23_version;

Suggestion #3 – use runtime checks

Sometimes, the only way to work out if you are using the right version of a library is a runtime check. This is unfortunate, especially if you have long build times, but if your library returns, say, a version string, this is the earliest point at which you can check that your project linked with or loaded the correct version. If you work a lot with shared libraries that are loaded dynamically at runtime, this is a worthwhile check anyway to ensure that both your build and runtime environments are consistent. If anything, I would consider this an additional check to complement the ones described in suggestions #1 & #2. It also has the advantage that you can leave the check in the code you ship and thus detect a potential misconfiguration at the client end a lot more easily.
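
A minimal sketch of such a runtime check, assuming the library exposes a hypothetical function that returns its version string:

#include <stdexcept>
#include <string>

// Hypothetical accessor the library is assumed to export.
extern "C" const char* libgbrmpzyyxx_version_string();

void check_libgbrmpzyyxx_version()
{
    const std::string expected = "1.23";
    const std::string actual = libgbrmpzyyxx_version_string();
    if (actual != expected) {
        throw std::runtime_error("libgbrmpzyyxx version mismatch: expected "
                                 + expected + ", got " + actual);
    }
}

Calling this once during startup – or from a unit test – surfaces a mismatch immediately instead of as a subtle bug much later.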

Conclusion

I personally prefer suggestion #1 as I want to ensure the build fails as early as possible. Suggestion #2 works well when you can’t use boost for whatever reason and don’t have a C++11 compiler, but otherwise I personally would not use it. Suggestion #3 is something you use when you need it, but if you do, at least try to cover the relevant cases in your unit tests so your QA team doesn’t have to find out manually whether every component is using the correct library version.

Managing third party libraries in C++ projects

Every reasonably sized C++ project these days will use some third party libraries. Some of them like boost are viewed as extensions of the standard libraries that no sane developer would want to be without. Then there is whatever GUI toolkit your project uses, possibly another toolkit to deal with data access, the ACE libraries, etc etc. You get the picture.

Somehow, these third party libraries have to be integrated into your development and build process in such a way that they don’t become major stumbling blocks. I’ll describe a few approaches that I have encountered in the multitude of projects I’ve been part of, and discuss both their advantages and their problems.

All the developers download and install the libraries themselves

This is what I call the “good luck with that” approach. You’ll probably end up documenting the various third party libraries and which versions of which library were used in which release on the internal wiki and as long as everybody can still download the appropriate versions, everything works. Kinda.

The problems start to rear their ugly heads when someone has to build an older version of your project and can’t find a copy of the right library version anymore, when someone forgets to update the CI server, or – my favourite – when the “bleeding edge” member of the team starts tracking the latest releases and randomly checks in “fixes” needed to build with newer versions of the library that nobody else is using. Oh, and someone else missed the standup and the discussion that everybody needs to update libgrmblfx to a newer, but not current, version and is now having a hard time figuring out why their build is broken.

Whichever way you look at it, this approach is an exercise in controlled chaos. It works most of the time and you can usually get away with it in smaller and/or short-term projects, but you’re always teetering on the edge of the Abyss Of Massive Headaches.

What’s the problem? Just check third party libraries into your version control repository!

This is the tried and tested approach. It works well if you are using a centralized VCS/CM system that just checks out a copy of the source. Think CVS, Subversion, Perforce and the like. Most of these systems are able to handle binaries well in addition to “just” managing source code. You can easily check in pre-built versions of your third party libraries. Yes, the checkouts may be a little on the slow side when a library is updated but in most cases, that’s an event that occurs every few months. In a lot of teams I used to work in, the libraries would be updated in the main development branches after every release and then kept stable until the next release unless extremely important fixes required further updates. This model works well overall and generally keeps things reasonably stable, which is what you want for a productive team because you don’t want to fight your tools. Third party libraries are tools – never forget that.

The big downside to this approach shows up when you are using a DVCS like git or Mercurial. Both will happily ingest large “source” trees containing pre-built third-party libraries, but these things can be pretty big even when compressed. A pre-built current boost takes up several gigabytes of disk space, depending on your build configurations and whether you’re building 32-bit and 64-bit versions at the same time. Assuming a fairly agile release frequency, you’re not going to miss many releases, so you’ll be adding those several gigabytes to the repository every six months or so. Over the course of a few years, you will end up with a pretty large repository that will take your local developers half an hour to an hour to clone. Your remote developers will either have to mirror the repository – which has its own set of challenges if it has to be a two-way mirror – or will find themselves resorting to overnight clones and hoping nothing breaks during the clone. Yes, there are workarounds like Mercurial’s Largefiles extension and git-annex, and they’re certainly workable if you plan for them from the beginning.

The one big upside of this approach is that it is extremely easy to reproduce the exact combination of source code and third party libraries that go into each and every release provided an appropriate release branching or release tagging strategy is used. You also don’t need to maintain multiple repositories of different types like you have to in the approach I’ll discuss next.

Handle third party libraries using a separate package management tool

I admit I’m hugely biased towards this approach when working with a team that is using a DVCS. It keeps the large binaries out of the source code repository and puts them into a repository managed by a tool that was designed for the express purpose of managing binary packages. Typical examples would be NuGet, ivy and similar tools. What they all have in common is that they use a repository format that is optimized for storing large binary packages, usually in a compressed format. They also make it easy to pull a specific version of a package out of the repository and put it into an appropriate place in your source tree or anywhere else on your hard drive.

Instead of containing the whole third party library, your source control system contains a configuration file or two that specifies which versions of which third party libraries are needed to build whichever version of your project. You obviously need to hook these tools into your build process to ensure that the correct third party libraries get pulled in during build time.

The downside of these tools is that you get to maintain and back up yet another repository, one that needs to be treated as having an immutable history, just like most regular VCS/DVCS repositories. This requires additional discipline to ensure nobody touches a package once it’s been made part of the overall build process – if you need to put in a patch, the correct way is to rev the package so you are able to reproduce the correct state of the source tree and its third party libraries at any given time.

TL;DR – how should I manage my third party libraries?

If you’re using a centralised version control system, checking in the binaries into the VCS is fine. Yes, the admin might yell at you for taking up precious space, but it’s easy, reasonably fast and simplifies release management.

If you are using a DVCS, use a separate package management tool, either a third party one or one that you roll yourself. Just make sure you keep the third party libraries in your own internal repository so you’re not at somebody else’s mercy when they decide to suddenly delete one of the libraries you’re using.