Revision control

Posted by: Kim N. Lesmer on 17.11.2007

Updated 10.10.2009: Some minor updates in the text, especially with regard to binary files.

In connection with some corrupted files in CVS I decided that it was time to take a look at some other revision control systems. This article is a summary of my three-day "journey" into the land of revision control.

Revision control

Revision control (also known as version control (VCS), source control, or source code management (SCM)) is the management of multiple revisions of the same unit of information.

A revision control system is a set of tools that allows you to keep track of the development of files. The files could be software source code, articles, or even images.

The best way to understand a revision control system is to think of a log or a diary. Let's say you are developing software and you want to keep track of the changes you make to the files over time. Each time you change something, or each time you add or remove a piece of code, you make a note in the diary about the changes. You write about why you made the changes and how you did it, you write about where in the source code the changes were made, and you also note how to revert those changes.

Let's imagine that your project grew and began to involve more people. To still keep track of changes you make an online version of the diary, and everybody who is working on the project commits changes to the files and writes about it in the diary.

That's basically how a revision control system works, except that it's more complicated and it does it much better.

A revision control system is like a log or diary, but it's more than that: a revision control system takes charge of the files in a project, and the system can merge different changes made by different people.

The biggest problem when a lot of people are working on the same files, whether it's software or writing a book, is to keep the different changes that people make from messing things up. If four people are working on the same file, how do you keep that file intact, and how do you fit those pieces together? You do that by using a revision control system.

A revision control system can help you keep track of things, and it can help you merge different commits into a single file, but the system can't decide on your behalf, so you often have to figure out how to deal with merge conflicts yourself. But that's not the topic of this article.

A revision control system can also revert changes and it can work like a nice backup system, which it does in my personal case.

I don't have much experience cooperating in large development teams at the moment, but I do have experience using different revision control systems on a more personal and private level.

My story

I have always loved programming, and I have developed a lot of different applications, both as a professional and as a hobbyist. In both cases I used to keep my files backed up on CDs, and I used to keep small notes about important stuff.

After doing that for many years I suddenly stumbled upon the revision control system called CVS. I learned about CVS via OpenBSD, and after setting the CVS system up, I never looked back. With CVS I could easily keep a log of every single change I made to a file, and at the same time CVS worked as a backup system.

Another great advantage was that I could keep all my important files on a single server, and then pull those files down onto both my laptop and my desktop. If I made a change to a file using my laptop, and I committed that change to CVS, I could run an update on my desktop and it would grab those changes from the CVS server, and voilà, everything was in sync.

After running with CVS for about three years, I one day needed to pull down a set of files which I had added to CVS at the beginning of its use. The files were compressed using tar and gzip. Somehow I had missed the information that CVS needs to be told not to mess with binary files. My lack of knowledge resulted in corrupted files.
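
For reference, CVS has to be told explicitly that a file is binary so that it doesn't do keyword expansion and line-ending conversion on it. A minimal sketch of what I should have done (the filename is just an example):

$ cvs add -kb backup.tar.gz
$ cvs commit -m "Added archive as a binary file."

A file that has already been committed as text can be switched to binary handling afterwards with cvs admin -kb.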

After the incident I decided to take a deeper look into CVS, and I decided to also take a look at other solutions.

I had always been happy running with CVS, but that was because I didn't know about other systems at the time. One thing that did annoy me was CVS's lack of support for renaming files. If you need to rename a file that has been committed to CVS, you have to make a copy of that file, delete the original and then commit the renamed version as a new file, thus losing all the history of that file in the log.
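
In practice the rename dance looks roughly like this (the filenames are just examples):

$ cp old_name.c new_name.c
$ rm old_name.c
$ cvs remove old_name.c
$ cvs add new_name.c
$ cvs commit -m "Renamed old_name.c to new_name.c."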

CVS's handling of moving files around is also a big mess, and in most cases the easiest way to deal with it is to manually remove files on the server and then recommit from the client.

Because CVS is a centralized system it keeps small pieces of administrative information in each directory of the working copy. When your project has a lot of subdirectories, CVS keeps its own CVS directory in each of those as well.

Sometimes you need to move a lot of stuff around, but you don't want to move the CVS directories as well.
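
If you only need a copy of the tree without the CVS administrative directories, something like rsync can filter them out; a small sketch, with example paths:

$ rsync -a --exclude=CVS project/ /tmp/project-without-cvs/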

All in all I have always been happy with CVS, but I have also been annoyed with the above mentioned problems.

I finally decided to take a good, hard look at other solutions, and this article is about my experiences with those systems, the tests I made, and my final decision about which system I would end up using and why.

One thing is for sure, after running the tests, I will never return to CVS.

Different types of systems

My main requirement for a new revision control system was that it had to be Open Source. Apart from that I was pretty open to new ways of doing things.

First things first: I quickly discovered that there are different kinds of revision control systems.

  • Centralized system
  • Decentralized system
  • Distributed system

I am not going into a lot of detail about each kind of system, but I am going to address each one briefly.

A centralized system uses a centralized model, where all the revision control functions are performed on a shared server. If two developers try to change the same file at the same time, then without some method of managing access the developers may end up overwriting each other's work. Centralized revision control systems solve this problem with one of two different "source management models": file locking and version merging.

A decentralized system and a distributed system cover more or less the same ground; they both take a peer-to-peer approach, as opposed to the client-server approach of centralized systems. Rather than a single, central repository on which clients synchronize, each peer's working copy of the codebase is a bona fide repository. Synchronization is conducted by exchanging patches (change-sets) from peer to peer.

A centralized RCS requires that you are able to connect to the server whenever you want to do version control work. This can be a bit of a problem if your server is on some other machine on the Internet and you are not online. Or, worse yet, you are on the Internet but the server is down.

Decentralized revision control systems deal with this problem by keeping branches on the same machine as the client. This allows the user to save his changes (commit) whenever he wants - even if he is offline. The user only needs Internet access when he wants to access the changes in someone else's branch.

The main differences from centralized systems

  • Each developer works on his own local repository.
  • Repositories are cloned by anyone, and are often cloned many times.
  • There may be many "central" repositories. Access control lists are not employed. Instead, code from disparate repositories is merged based on a web of trust, i.e., historical merit or quality of changes.
  • Lieutenants are project members who have the power to dynamically decide which branches to merge.
  • The network is not involved in most operations.

Advantages vs centralized

  • Allows users to work productively even when not connected to a network.
  • Makes most operations much faster since no network is involved.
  • Allows participation in projects without requiring permission from project authorities, and thus arguably fosters a culture of meritocracy instead of requiring "committer" status.
  • Allows private work, so you can use your revision control system even for early drafts you don't want to publish.
  • Avoids relying on a single physical machine; a server disk crash is a non-event with distributed revision control.

Disadvantages vs centralized

I personally find none. The "disadvantages" that some people describe stem from a personal, habitual point of view rather than from a technical one.

Most people are used to a centralized system, and out of habit they don't want things to change.

One argument is that a distributed system can end up with a person as the central point of control, rather than a server, but in my opinion that would only happen because of a lack of proper organization.

Even for my own personal use of a revision control system I find a decentralized or distributed system a joy to work with. The ability to make local commits even when you are offline is great, and I often find myself making use of that option.

The test

I must start by pointing out that the tests I made can in no way compare to the real day-to-day work done by many of the big projects, like work on the Linux kernel or the FreeBSD kernel. I am only a single person, and my tests reflect my daily usage of a revision control system. I develop small pieces of code, and I write articles and books, but I don't have to cooperate with a lot of people doing that.

The main advantages that I gain from using a revision control system are backup, documentation (the log), the ability to revert back to prior releases of software that I have made, and using the exact same files on different computers without having to copy data back and forth.

First I made a list of systems to test, and then I did a sorting process in which I divided the different systems into groups. All non-Open Source systems were removed from the list. Next I removed the systems that aren't in active development. Then I looked at each system's documentation and online community. I also demanded a simple setup running over OpenSSH. From the documentation I decided which systems looked the most user-friendly regarding the commands. Last but not least, I asked a lot of questions on the different IRC channels related to each system.

When I was done sorting, I ended up with the following systems to test: BZR, Git, SVN, Darcs and Mercurial (hg). Subversion (SVN) is the only centralized system in my test.

I installed each system on my Debian GNU/Linux revision server, and tested the system using one of my desktops, which also runs Debian GNU/Linux.

On each system I did the following tests, among other things:

  • Added a lot of files, including big binary files, separated into different directory structures.
  • Committed changes, reverted changes and generally used the system.
  • Did some renaming and reorganization of files.
  • Tried to integrate (merge) two completely separate and unrelated repositories into one.
  • Did some comparison with CVS because that's what I am used to.

On each system I had to mess around a bit with file permissions on the initial directory; I haven't shown that in the test.

BZR

Bazaar (formerly Bazaar-NG) is a distributed revision control system sponsored by Canonical Ltd., designed to make it easier for anyone to contribute to open source software projects. As of 2007, the best known user of Bazaar is the Ubuntu project.

The development team's focus is on ease of use, accuracy and flexibility. Branching and merging upstream code is designed to be very easy, with focus on users being productive with just a few commands. Bazaar can be used by a single developer working on multiple branches of local content, or by teams collaborating across a network.

Bazaar is written in the Python programming language, with packages for major Linux distributions, Mac OS X and Windows. Released under the GNU General Public License, Bazaar is free software.

Installation

Installing BZR and getting it up and running was very easy.

# apt-get install bzr
$ bzr whoami "John Doe <john@example.com>"
$ mkdir /bzr
$ cd /bzr
$ bzr init
$ touch my_file.txt
$ bzr add
$ bzr commit -m "Initial commit."
			

Checkout of a project via ssh:

$ bzr checkout bzr+ssh://myserver/bzr
			

Result

I liked BZR right away, and I found it well documented. The system is user-friendly, but compared to both Git and Mercurial it is very slow. Committing a lot of files took forever.

I didn't find any compelling reason to use BZR compared to the other systems, except that compared to SVN it is decentralized.

BZR failed my test on merging the two different repositories into one. Perhaps I did something wrong, but I tried several times, and I tried with different merge options, but I ended up with the same result each time.

I like the fact that BZR only has one directory in the root of the repository: .bzr

BZR got fourth place on my winners list.

Git

Git is a distributed revision control system created by Linus Torvalds.

Git's design was inspired by BitKeeper and Monotone. Git was originally designed only as a low-level engine that others could use to write front ends to. However, the core Git project has since become a complete revision control system that is usable directly. Several high-profile software projects now use Git for revision control, most notably the Linux kernel.

Installation

On the server:

# apt-get install git-cvs
$ git config --global user.name "Your Name Comes Here"
$ git config --global user.email you@yourdomain.example.com
$ mkdir /git
$ cd /git
$ git init
$ touch my_file.txt
$ git add .
$ git commit
			

Clone a project via ssh:

$ git clone ssh://[user@]host.xz[:port]/path/to/repo.git/
			

Result

I find Git very cool. It's almost as fast as Mercurial and it is very user-friendly. The documentation is not that great, and several times I had to get help on the git IRC channel to get the information that I needed - this is bad.

Update September 2010: This is no longer the case. Git documentation is really great.

Git's manpages are not much better. There are a few git commands (such as log) that take arguments that other git commands also accept. Sometimes this fact isn't documented and you are left guessing what the full range of accepted arguments is.

I like the fact that Git only has one directory in the root of the repository: .git

Git handles binary files correctly (it detects binary files using the same heuristic that diff and other GNU tools use, but you can mark a given file as binary or text using gitattributes). Whether a file is binary does not matter to Git's libXdiff deltification. Git can send or generate binary patches. Unfortunately, depending on the binary file and how much its representation changes due to a small underlying change, Git can fail to detect a file rename or file copy (detection is based on a similarity score).

A great feature of Git is that you can force other file types to be treated as binary by adding a .gitattributes file to your repository. This file contains a list of patterns, followed by attributes to be applied to files matching those patterns. By adding .gitattributes to the repository, all cloned repositories will pick this up as well.

For example, if you want all *.pdf files to be treated as binary files you can have this line in .gitattributes:

			*.pdf -crlf -diff -merge
			

This means that all files with the .pdf extension will not have carriage return/line feed translations done, won't be diffed and merges will result in conflicts leaving the original file untouched.

A great annoyance with Git is that the repository requires periodic optimization with git-gc. IMHO if something like this needs to be done periodically, the tool should just find a way to do it automatically. Otherwise, I am going to forget to do it and get frustrated when things are slow.

Linus Torvalds started a thread about people being unaware of the importance of optimizing a git repository: http://kerneltrap.org/mailarchive/git/2007/9/5/257021, notice the answers too.

I don't like having to optimize a repository periodically, and I find it time consuming (but that's just me).
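
For the record, the manual optimization is a single command run from inside the repository; git gc --aggressive is a slower but more thorough variant:

$ git gc
$ git gc --aggressive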

Update September 2010: Some git commands automatically run git-gc now.

As stated elsewhere (link provided below): “Git and Mercurial both turn in good numbers but make an interesting trade-off between speed and repository size. Mercurial is fast with both adds and modifications, and keeps repository growth under control at the same time. Git is also fast, but its repository grows very quickly with modified files until you repack - and those repacks can be very slow. But the packed repository is much smaller than Mercurial's.”

Personally I don't need my repository to be repacked, and if I forget to do so on my old laptop, I could easily imagine getting into trouble running a command that requires caching.

Git ended up in second place on my winners list.

Update September 2010: I mainly use Git now, but both Git and Mercurial are really great revision control systems. Git does however have some advantages, IMHO. One such advantage is that it is much more powerful.

Darcs

Installation

# apt-get install darcs
$ mkdir /darcs
$ cd /darcs
$ darcs init
$ touch my_file.txt
$ darcs add -r *
$ darcs record -am "Initial commit."
			

Checkout a project via ssh:

$ darcs get sftp://john@darcs.net:10022/foo/bar
			

Result

Darcs is a distributed revision control system by David Roundy that was designed to replace traditional centralized source control systems such as CVS and Subversion.

Darcs is written in Haskell and, among other tools, it uses QuickCheck. Many of its commands are interactive, allowing users to commit changes or pull specific files selectively. This feature is designed to encourage more specificity in patches. As a result of this interactivity, Darcs has fewer distinct commands than many comparable revision control systems.

I must admit that I didn't like Darcs, but that wasn't because I found the system bad. I just didn't feel at home.

Darcs uses a completely different command structure than CVS and it takes some getting used to.

There are several benefits to the way Darcs works. Each developer essentially has his own private branch and can check in anything to that branch without affecting others. Developers can also send each other patches without affecting the main repository. Darcs supports sending patches via email, which eliminates the need for a publicly accessible server.

Darcs used to have some significant bugs, but I don't know if it still has these. The most severe of them was "the Conflict bug" - an exponential blowup in time needed to perform conflict resolution during merges, reaching into the hours and days for "large" repositories. A redesign of the repository format and wide-ranging changes in the codebase are planned in order to fix this bug, and work on this was planned to start in Spring 2007.

Darcs may be a great revision control system, but I just didn't like it.

Darcs ended up last on my winners list.

Subversion (SVN)

Subversion (SVN) is a version control system (VCS) initiated in 2000 by CollabNet Inc.

Projects using Subversion include the Apache Software Foundation, KDE, GNOME, Free Pascal, GCC, Python, Ruby, Sakai, Samba, and Mono. SourceForge.net and Tigris.org also provide Subversion hosting for their open source projects, and the Google Code and BountySource systems use it exclusively. Subversion is also finding adoption in the corporate world.

The goal of the Subversion project is to build a version control system that is a compelling replacement for CVS in the open source community. Subversion is meant to be a better CVS, so it has most of CVS's features. Generally, Subversion's interface to a particular feature is similar to CVS's, except where there's a compelling reason to do otherwise.

Installation

On the server:

# apt-get install subversion subversion-tools
# svnadmin create /svn
			

On the client:

$ su
# apt-get install subversion subversion-tools
# exit
$ cd ~
$ svn checkout svn+ssh://myserver/svn	
			

Result

I really liked SVN, and because it is a centralized system like CVS, I felt right at home. In my personal opinion SVN is like a better, improved version of CVS. It's CVS with the good stuff, without the bad stuff, and with some better stuff.

I had no problems in any of my tests, and merging two completely different repositories was very easy, though not as easy as with Mercurial.

A thing I liked about both SVN and Mercurial is their support for keyword expansion, and the way SVN handles it on a per-file basis is really cool. Another cool thing is the possibility to check out single subdirectories of a repository.
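
As an illustration, keyword expansion in SVN is enabled per file by setting a property, and a single subdirectory can be checked out directly; the filename and path below are just examples:

$ svn propset svn:keywords "Id Date Author" main.c
$ svn commit -m "Enabled keyword expansion on main.c."
$ svn checkout svn+ssh://myserver/svn/some/subdirectory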

SVN has really good documentation in the Subversion book.

That said, because of the centralized nature of SVN I dislike the system in general. If I had to use a centralized system my choice would definitely be SVN.

For a period of time SVN was in second place on my winners list, but after working more with Git, SVN landed in third place.

Mercurial (hg)

Mercurial is written in Python, with a binary diff implementation written in C. Mercurial is primarily a command line program. All its commands begin with hg, a reference to the chemical symbol for mercury.

Its major goals include high performance and scalability, serverless, fully distributed collaborative development, robust handling of both plain text and binary files, and advanced branching and merging capabilities, while remaining conceptually simple. It includes an integrated web interface.

The creator and lead developer of Mercurial is Matt Mackall. The full source code is available under the terms of the GNU General Public License, making Mercurial free software. Like Git and Monotone, Mercurial uses SHA-1 hashes to identify revisions.

A lot of great projects use Mercurial, among them Sun, FreeBSD kernel development, the Mozilla Foundation, Google Code and many more.

Installation

You can edit your preferences in .hgrc

# apt-get install mercurial
$ mkdir /hg
$ cd /hg
$ hg init
$ touch my_file.txt
$ hg addremove
$ hg commit
			

Cloning a repository via ssh:

$ hg clone ssh://myserver//hg
			

Result

I will start by saying that Mercurial is the winner of my test. Mercurial is a charm to work with. It is very fast, it is very well documented, and it is very user-friendly.

In all the different tests I did, including merging two completely independent repositories, Mercurial was the easiest and most logical system to use.

The only thing I didn't like about Mercurial is that it keeps your personal configuration file ~/.hgrc outside of the repository. The file doesn't contain any important stuff and it is perfectly possible to work without ever touching it, but once you have set up the file, you don't want to lose it, especially if you work with keyword expansion on a per-file basis. Each filename has to go into that file.

On the other hand Mercurial supports personal setup of the keywords, which means that I can have both the keyword Date and Dato do the same thing, where "Dato" is Danish for "Date".

If you need keyword expansion in Mercurial you have to download a file called keyword.py and set things up correctly. It's quite easy, but I still think that this function should be supported natively.

Mercurial generally makes no assumptions about file contents. Thus, most things in Mercurial work fine with any type of file. The exceptions are commands like diff, export, and annotate, that work well on files intended to be read by humans, and merge, where processing binary files makes very little sense at all. The question naturally arises, what is a binary file anyway? It turns out there's really no good answer to this question, so Mercurial uses the same heuristic that programs like diff(1) use. The test is simply if there are any NUL bytes in a file.

For diff, export, and annotate, this will get things right almost all of the time and it will not attempt to process files it thinks are binary. If necessary, you can force these commands to treat files as text with -a.
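
For example, to diff a file that Mercurial would otherwise treat as binary (the filename is hypothetical):

$ hg diff -a notes.dat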

Merging is another matter. The actual merging of individual files in Mercurial is handled entirely by external programs and Mercurial doesn't pretend to tell these programs what files they can and cannot merge. The example mergetools.hgrc currently makes no attempt to do anything special for various file types, but it could easily be extended to do so. But precisely what you would want to do with these files will depend on the specific file type and your project needs.

If you need to merge binaries, you need a tool which can handle binary merges. Joachim Eibl's kdiff3 ships a Qt4 version (on Windows called "kdiff3-QT4.exe") which recognizes binary files. Pressing "cancel" and "do not save" leaves you with the version of the file you currently have in the filesystem.
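
One way to wire this up, as far as I can tell, is Mercurial's [merge-tools] and [merge-patterns] sections in .hgrc; a rough sketch, where the tool and the patterns are only examples:

[merge-tools]
kdiff3.executable = kdiff3
[merge-patterns]
**.doc = kdiff3
**.bin = internal:fail

Here files matching **.doc are handed to kdiff3, while files matching **.bin simply fail the merge and are left for you to resolve by hand.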

I haven't found anything to complain about with Mercurial, and Mercurial is the winner of my winners list. Mercurial is my new revision control system of choice.

My .hgrc file

I ended up with a .hgrc that looked something like this:

[ui]
username = Kim N. Lesmer
[extensions]
hgk = /usr/share/python-support/mercurial/hgext/hgk.py
keyword = /usr/share/python-support/mercurial/hgext/keyword.py
[keyword]
**.php =
**.txt =
**.cpp =
**.java =
**.css =
[keywordmaps]
RCSFile = {file|basename},v
Author = {author|email}
Forfatter = {author|email}
Header = {root}/{file},v {node|short} {date|utcdate} {author|user}
Source = {root}/{file},v
Date = {date|utcdate}
Dato = {date|utcdate}
Id = {file|basename},v {node|short} {date|utcdate} {author|user}
Revision = {node|short}
				

The winners list

  1. Mercurial
  2. Git
  3. SVN
  4. BZR
  5. Darcs

Merging two non-related repositories

I never got it to work with BZR, and it took me a long time to figure out how to do things right with Git.

Merging on Mercurial

On the client:

$ cd mercurial-first
$ hg pull --force ssh://webserver//mercurial-second
pulling from ssh://webserver//mercurial-second
searching for changes
warning: repository is unrelated
adding changesets
adding manifests
adding file changes
added 1 changesets with 2 changes to 2 files (+1 heads)
(run 'hg heads' to see heads, 'hg merge' to merge)
$ hg merge
2 files updated, 0 files merged, 0 files removed, 0 files unresolved
			

Success! If you do a $ hg log, the log contains both repositories' entries, as if they had belonged together from the beginning.
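
One detail worth remembering: hg merge only merges the working directory, so the result still has to be committed afterwards:

$ hg commit -m "Merged the two repositories."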

Merging on SVN

Merging on SVN involves the server, which I don't like.

On the server:

# svnadmin dump subversion2 > subversion2_dump.txt
# svnadmin load /subversion1 < /subversion2_dump.txt				
			

On the client:

$ rm -rf subversion*
$ svn checkout svn+ssh://webserver/subversion1				
			

Very simple and easy, but I just don't like having to deal with the server. The need for server interaction comes from the centralized system design.
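
If you would rather keep the imported repository in its own subdirectory instead of mixing the two trees, svnadmin load can, as far as I know, be pointed at a parent directory; the directory name below is just an example, and it has to exist in the repository first:

# svn mkdir -m "Directory for the second project." file:///subversion1/project2
# svnadmin load /subversion1 --parent-dir project2 < /subversion2_dump.txt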

Merging on BZR

Merging on BZR wasn't possible and I got the following error:

bzr: ERROR: Branches have no common ancestor, and no merge base revision was specified.
			

I am pretty sure that I made some mistake, but nonetheless it should be quite easy to achieve with a decentralized system. I read the documentation on merging, but maybe I missed some points.

Merging on Git

Merging on Git was just as easy as with Mercurial, but finding out how to do it was difficult.

On the client:

$ cd git
$ git pull ssh://webserver/another_git_repo
$ git push
			

I expected a separate merge command, but the merging process automatically follows the pull command.
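
If you prefer the explicit two-step version, the pull above should, as far as I understand, be equivalent to a fetch followed by a merge of the fetched head:

$ git fetch ssh://webserver/another_git_repo
$ git merge FETCH_HEAD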

Conclusion

I know this article hasn't been an in-depth analysis of every aspect of the different systems, and I haven't really elaborated on the details of each test I did, but I still wanted to write about it, at least to be able to come back and read this once I have (maybe) forgotten why I chose Mercurial.

This article was mainly written for personal benefit, but I still hope that someone else might benefit from it.

My choice of Mercurial was more a personal matter of "taste" than a matter of technicality. Any of the systems could fulfill my needs perfectly, especially since I am used to running with CVS. Every single one of these systems is better than CVS.

I liked Mercurial, Git and SVN, but Mercurial is just nicer to work with - IMHO.