current concepts in version control systems - arxiv · 2018. 3. 14. · current concepts in version...

23
Current Concepts in Version Control Systems Petr Baudiˇ s * 2009-09-11 Abstract We give the reader a comprehensive overview of the state of the Version Control software engi- neering field, describing and analysing the con- cepts, architectural approaches and methods re- searched and included in the currently widely used version control systems and propose some possible future research directions. Preface This paper has been originally written in May 2008 as an adaptation of my Bachelor thesis and I have been meaning to get back to it ever since, to add in some missing information, especially on some of my research ideas, describing the patch algebra more formally and detailing on the more interesting chapters of historical de- velopment in the field. Unfortunately, it is not likely that I will ever get the time to polish the paper up to my full content; in the meantime, it is slowly getting obsolete and I believe it can have a good value even as it stands, especially as an introductory material for people getting into this area. Maybe I have missed some of the latest developments; I ask the reader for tolerance of the omissions, and my fellow researches and authors for fol- lowups on this work that would fill in the gaps and explain ongoing new ideas in the field. Petr Baudiˇ s 1 Introduction Version control plays an important role in most of the creative processes involving a computer. Ranging from simple undo/redo capabilities of most office tools to complicated version control supporting branching and diverse collaboration * The work on this paper was in part sponsored by Novell (SUSE Labs). of thousands of developers on huge projects. This work will focus on the latter end of the scale, describing the modern approaches for ad- vanced version control with emphasis on soft- ware development. We will compare various sys- tems and techniques, but not judge their general suitability as the set of requirements for version control can vary widely. We will not focus on integration of version control in the source con- figuration management systems 1 since SCM is an extremely wide area and many other papers already cover SCM in general. [3][15][70] This work aims to fill the sore need for an introduction-level but comprehensive summary of the current state of art in version control the- ory and practice. We hope that it will help new developers entering the field to quickly under- stand all the concepts and ideas in use. Since the author is (was) a Git developer, we focus some- what on the Git version control system in our examples, but we try to be comprehensive and cover other systems fairly as much as is within our power and knowledge. We also describe for the first time some tech- niques so far present only as source code in various systems without any explicit documen- tation and present some of our original work. One problem in the field of version control is that since most innovation and research hap- pens not in the academia or even classical in- dustry, but the open community, very few of the results get published in the form of a scientific paper or technical report and many of the con- cepts are known only from mailing list posts, random writeups scattered around the web or merely source code of the free software tools. Thus, another aim of this work is tie up all these sources and provide an anchor point for future researchers. 1 Source Configuration Management covers areas like version control, build management, bug tracking or de- velopment process management. 1 arXiv:1405.3496v1 [cs.SE] 14 May 2014

Upload: others

Post on 27-Sep-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Current Concepts in Version Control Systems - arXiv · 2018. 3. 14. · Current Concepts in Version Control Systems Petr Baudi s 2009-09-11 Abstract We give the reader a comprehensive

Current Concepts in Version Control Systems

Petr Baudis∗

2009-09-11

Abstract

We give the reader a comprehensive overview ofthe state of the Version Control software engi-neering field, describing and analysing the con-cepts, architectural approaches and methods re-searched and included in the currently widelyused version control systems and propose somepossible future research directions.

Preface

This paper has been originally written in May2008 as an adaptation of my Bachelor thesis andI have been meaning to get back to it ever since,to add in some missing information, especiallyon some of my research ideas, describing thepatch algebra more formally and detailing onthe more interesting chapters of historical de-velopment in the field.

Unfortunately, it is not likely that I will everget the time to polish the paper up to my fullcontent; in the meantime, it is slowly gettingobsolete and I believe it can have a good valueeven as it stands, especially as an introductorymaterial for people getting into this area. MaybeI have missed some of the latest developments;I ask the reader for tolerance of the omissions,and my fellow researches and authors for fol-lowups on this work that would fill in the gapsand explain ongoing new ideas in the field.— Petr Baudis

1 Introduction

Version control plays an important role in mostof the creative processes involving a computer.Ranging from simple undo/redo capabilities ofmost office tools to complicated version controlsupporting branching and diverse collaboration

∗The work on this paper was in part sponsored byNovell (SUSE Labs).

of thousands of developers on huge projects.This work will focus on the latter end of thescale, describing the modern approaches for ad-vanced version control with emphasis on soft-ware development. We will compare various sys-tems and techniques, but not judge their generalsuitability as the set of requirements for versioncontrol can vary widely. We will not focus onintegration of version control in the source con-figuration management systems1 since SCM isan extremely wide area and many other papersalready cover SCM in general. [3] [15] [70]

This work aims to fill the sore need for anintroduction-level but comprehensive summaryof the current state of art in version control the-ory and practice. We hope that it will help newdevelopers entering the field to quickly under-stand all the concepts and ideas in use. Since theauthor is (was) a Git developer, we focus some-what on the Git version control system in ourexamples, but we try to be comprehensive andcover other systems fairly as much as is withinour power and knowledge.

We also describe for the first time some tech-niques so far present only as source code invarious systems without any explicit documen-tation and present some of our original work.One problem in the field of version control isthat since most innovation and research hap-pens not in the academia or even classical in-dustry, but the open community, very few of theresults get published in the form of a scientificpaper or technical report and many of the con-cepts are known only from mailing list posts,random writeups scattered around the web ormerely source code of the free software tools.Thus, another aim of this work is tie up all thesesources and provide an anchor point for futureresearchers.

1Source Configuration Management covers areas likeversion control, build management, bug tracking or de-velopment process management.

1

arX

iv:1

405.

3496

v1 [

cs.S

E]

14

May

201

4

Page 2: Current Concepts in Version Control Systems - arXiv · 2018. 3. 14. · Current Concepts in Version Control Systems Petr Baudi s 2009-09-11 Abstract We give the reader a comprehensive

2 BASIC DESIGN 2

2 Basic Design

The basic task of version control system is topreserve history of a collection of files (project)so that the user can compare two historical ver-sions of the content easily, return to an olderversion or review the list of changes between twoversions.

2.1 Development Process Model

We will focus on version control systems that arebased on the distributed development paradigm.While the concept dates many years back, dis-tributed development has only recently seen ma-jor adoption and so far mostly only in the areaof open source development.

The classical approach to version control is tohave a single central server2 storing the projecthistory; the client can have the current versioncached locally, and possibly also (parts of) thehistory, but any operation involving change inthe history (e.g. committing new version) re-quires cooperation with the central server.

This centralized development approach hastwo major shortcomings: first, the developerhas to have network access during development,which is problematic in the era of mobile com-puting — i.e., when developing during travel orfixing problems at the customer site. Second, thecentral repository access implies some inherentbureaucracy overhead — the developer has togain appropriate commit permissions before be-ing able to version-control any of his work, de-velopment of third-party patches is difficult andsince the repository is shared, the developer isusually restricted in experimentation and creat-ing short-lived branches, often resorting to usingsome kind of manual version control during reg-ular workflow and using the VCS only on officialoccassions.

On the other hand, since all the develop-ment is tracked at a central place, this modelmay pose advantages to corporate shops becausework and performance of all developers can betracked easily.3

In the distributed development model, a copyof the project created by version control has

2Possibly with several read-only mirrors for replica-tion, as in the BSD ports setup.

3In case of in-company development using distributedversion control, this can be alleviated by setting appro-priate policy for the developers to push their work todesignated central repositories in regular intervals.

full version tracking capabilities and not onlyis all the history usually copied over (pulled),but it is also possible to commit new changeslocally; later, they can be sent (pushed) to an-other repository in order to be shared with otherdevelopers and users. This approach effectivelymakes every copy a stand-alone repository withstand-alone branches; nevertheless, most sys-tems make it easy to emulate the centralizedversion control model as a special-case at leastfor the read-only users.

2.2 Snapshot-Oriented History

There are two possible approaches to modelingthe history. The first-class object is usually theparticular version4 as a state (snapshot) of thecontent at a given moment, while the changesare deduced only by comparing two versions.However there is also a competing approachwhich focuses on the changes instead: in thatcase, the important thing when marking newversion is not the new state of the content, butthe difference between the old and new state;a particular version is then regarded as a cer-tain combination of changes.5

2.2.1 Branches

With regard to the snapshot-oriented historymodel, it may suffice in the trivial case to merelyrepresent the succeeding versions as a list withthe versions ordered chronologically. But overtime it may become feasible to work on severalversions of the content in parallel, while alsorecording further modifications using the ver-sion control system; the development has effec-tively been split to several variants (branches).

Handling branches is an obvious feature of aversion control system as the necessity to branchthe development is frequent for many projectstargetted at wider audience. Commonly, asideof the main development branch another branchis kept based on the latest public release of theproject and it is dedicated to bugfixes for thisrelease.

Moreover, branches have many other uses aswell — it may be feasible for individual develop-ers or teams to store work in progress yet unsuit-

4We use the terms version, revision and commit inter-changeably in this paper based on the context, accordingto usual practice.

5Note that the conceptual model may have no relationto actual storage, which can well use deltas even withinsnapshot-oriented systems.

Page 3: Current Concepts in Version Control Systems - arXiv · 2018. 3. 14. · Current Concepts in Version Control Systems Petr Baudi s 2009-09-11 Abstract We give the reader a comprehensive

2 BASIC DESIGN 3

able for the main development branch in sepa-rate branches, or even for a developer to saveincomplete modifications during short-time re-assignment to a different task or when movingto another machine. The prerequisite for thisis that creating a separate branch must haveonly minimal overhead, both technical (timeand space required to create a new branch)and social (it must be possible to easily removethe branch too; also, the unimportant branchesshould not cloud the view of the important ma-jor ones).

The ability to properly record and manipu-late branches nevertheless becomes truly vitalas the development gets distributed, since in ef-fect every clone of the project becomes a sep-arate branch. If the system aims to fully andcomfortably support the distributed develop-ment model, it must make sure that there areno hurdles for creating many new branches; fur-thermore, there can be no central registry ofbranches, thus the branch-related data struc-tures must allow for branches to grow from dif-ferent points in history with full independenceand dealing with naming conflicts,6 shall theyever meet again in some repository.

Aside of the obvious operations with branches(referring to them, comparing them, storingthem in a reasonable manner, switching betweenthem), perhaps the technically most challengingone is merging — combining changes made sepa-rately in two branches in a single revision. In thepast, only the most crude methods for mergingtwo branches were available, but again with theonset of distributed version control this aspecthas grown in importance (a merge essentiallyhappens every time a developer exchanges mod-ifications with the outside world). We will ex-plain the challenges associated with merging insection 6; one important consideration we shallmention now is that many of the merging tech-niques require to find the latest point in historycommon to both of the to-be-merged branches(or, more generally, which changes are not com-mon to the history of both branches).

Aside of merging, another desirable opera-tion is so-called cherry-picking: taking individ-ual changes from one branch to another with-out merging them completely. While most mod-ern systems support this in a rudimentary way,the operation is not natural in snapshot-oriented

6In independent repositories, branches and revisionscan be given conflicting names, as will be covered in de-tail later.

systems and can result in trouble when mergingthe branches later, both in duplicated changesstored in the history and increased merge con-flicts rate.7

2.2.2 Data Structures

Given that the development may fork at somepoint in history, we cannot use simple lists any-more to adequately represent this event but wehave to model the history as a directed treeinstead, with succeeding versions connected byedges; the node representing a fork point (notonly one, but several branches grow from thenode) has multiple children.

However, when we take merging into con-sideration, it becomes necessary to also repre-sent the past merges between branches if we areto properly support repeated merges and pre-serve detailed history of both branches acrossthe merge. The commonly used method is touse a more general directed acyclic graph (DAG;figure 2 several pages later) to represent the his-tory; compared to a directed tree, DAG lifts therestriction for a node to have only a single par-ent. Thus, versions representing merges of sev-eral branches have the last (“head”) versions ofall the merged branches as their parents. Theacyclic property makes sure that a commit stillcannot be its own ancestor, thus keeping thehistory sane (time travel and quantum effectsnotwithstanding).

An interesting problem to note is how to actu-ally define a branch. In case of trees, the branchis defined naturally by the tree structure, how-ever in a DAG this becomes less obvious. Oneapproach is to record branch of each revisionexplicitly (e.g. Mercurial), another is to give upon sorting out revisions to explicit branches al-together (e.g. Git) — in that case, a “branch”is merely a named pointer to the commit at thehead of the branch, with no clear distinction ofwhich ancestors belong to the same branch; itmight be tempting to keep branch correspon-dence through fixed parent nodes ordering inmerge nodes (“first parent comes from the samebranch”), however this approach has caveats asdescribed in 6.1.

7Git provides a git-rerere tool for automatically re-solving merge conflicts that have been resolved in a cer-tain way before, thus somewhat reducing this problem.

Page 4: Current Concepts in Version Control Systems - arXiv · 2018. 3. 14. · Current Concepts in Version Control Systems Petr Baudi s 2009-09-11 Abstract We give the reader a comprehensive

2 BASIC DESIGN 4

2.3 Changes-Oriented History

Alternatively to snapshot-oriented history, a ra-dically different approach is to focus on thechanges as the first-class object — the user cre-ates and manipulates not a particular versionof the project, but instead a particular change;a “version” is then described simply by givencombination of changes recorded. The major ad-vantage is that merging operations may be muchmore flexible, and inherently more data is pre-served as the changes themselves are recordedand do not have to be deduced a posteriori fromchecking the difference between two snapshots.

The main representative of this class is Darcs(see 3.7) with its main trait of the powerful for-malisation patch algebra (see 6.6) which makesit possible to easily perform merging of two ver-sions by adding the changes missing in the otherversion and reordering the changes in a well de-fined way.

In order to make full use of the smart mergingtechniques like patch algebra, in some cases theoperation time can degrade exponentially to thenumber of revisions in the current implementa-tions [21]; making the algorithms more efficientis subject of active research, however at least un-til very recently, practicality of changes-orientedhistory model was severely limited and this ap-proach was not suitable for large histories; weare not aware of performance improvement stud-ies for the recent Darcs improvements and it isuncertain if these complexity issues are inherentto the whole changes-oriented model.

2.4 Objects Identification

The most visible (and diverse) aspect of the his-tory models is the problem of identifying thehistory objects (snapshots or changes, depend-ing on the model).

The old RCS and CVS tools use a simplescheme where every version is identified by astring of dot-separated numbers; the first num-ber is usually 1, the second number is the ver-sion number on the main branch. If it is re-quired to identify a version on a different branch,BASE VERSION.BRANCH.VERSION scheme is used— BRANCH is the number of the branch forkedfrom the base version and VERSION is the se-quence number of the given version on thebranch; this scheme can be used recursively toidentify versions on branches of branches etc.

However, we shall remember that in a dis-

tributed system, we cannot see all the usedidentifiers in all repositories (since they can beon disconnected systems), thus we meet withspecial problems when having to simultanouslyassign new identifiers in different disconnectedrepositories. Eventually, we are met with threeconflicting requirements8 [44][6]:

• Uniqueness: The identifier is guaranteed tobe unique and always identifies only a singlehistory object in a given repository. (Notethat this requirement is frequently slightlycompromised by only hoping that the iden-tifier is unique with a very high probabil-ity.) 9

• Simplicity: The identifier is easy to use —reasonably short and preferrably informa-tive to the user (e.g. giving some informa-tion on ordering of the revisions).

• Stability: The identifier does not changeover time and between different reposito-ries.

The requirements are of conflicting nature —we can generally always pick only two and sacri-fice the third. Different version control systemsmake different choices here: for example, Git,Mercurial and Monotone sacrifice some usabilityand assigns every commit a 40-digit SHA1 hash[57] depending on the current content and his-tory of the version.10 To alleviate the UI prob-lem somewhat, the systems allow shortcuttingthe long hashes to only few leading digits, aslong as they match uniquely.

8For clarity, in this context we use “simple” instead of“memorable” and “stable” instead of “global” presentedin the literature. The latter correspondence is somewhatloose and given under the assumption that it does notmake sense to use central authority for assigning identi-fiers in case of distributed version control — thus, eitherthe identifiers are globally-meaningful and stable, or theyhave only local meaning and thus inter-repository colli-sions are inevitable and relabeling in that case necessary.

9Here we mean uniqueness guaranteed by technicalmeans, preferrably provably secure. E.g., GNU Arch doesuse identifiers that are “unique”, but only by policy asone part of the id is user’s e-mail address — these areunique, but there is no technical safeguard that differentusers will not input the same e-mail address; this ap-proach requires trusting the other repository databaseto contain correct content.

10This also technically sacrifies some uniqueness, butthe probability of collision of two hashes is so small thatit is deemed to be unimportant for practical purposes.This has been disputed by [5] [34], however that paperhas been criticized by the version control [40] and cryp-tographic [14] community.

Page 5: Current Concepts in Version Control Systems - arXiv · 2018. 3. 14. · Current Concepts in Version Control Systems Petr Baudi s 2009-09-11 Abstract We give the reader a comprehensive

3 CURRENT SYSTEMS 5

BitKeeper and Bazaar11 sacrifice the stabil-ity and use RCS-style identifiers that are uniqueonly within a single repository and can shiftduring merges. Darcs sacrifies the uniquenessand uses user-assigned identifiers for individualchanges (however, the identifiers are encouragedto be rather elaborate, so collision probabilityshould not be very high).

2.5 Content Movement Tracking

One interesting problem in version control isfollowing particular content across file renames,copies, etc. Practically all systems offer a way tolist all revisions changing given file or set of files,however things become more troublesome whena file is renamed or copied or when we want tocheck history of finer-grained objects than wholefiles. Most systems support so-called “annotatedview” of a file, for each line showing the committhat introduced it in its current version. How-ever, this operation is frequently very expensiveand on many occassions the line-level granular-ity is not suitable either.

While old systems like CVS do not supporttracking file renames at all and the new filesstart with fresh history unless the repository ismanually modified,12 most newer systems sup-port file renames, usually by providing a spe-cial command that records the movement — lesscommon approach used e.g. by GNU Arch is toautodetect movement by tracking unique tagswithin the objects (strings within files or hid-den files in directories).

In that case, showing history of the new filewill usually automatically switch to the older filename at the point of the rename, and there isalso a possibility of cross-rename merges when afile has been renamed between the two revisions.However, this carries e.g. the danger of makingthe merging operation complexity proportionalto project history, something many systems seekto avoid. Also, as the support gets naturally ex-tended to recording not only renames but alsofile copies, it becomes much less obvious whichfile instances to merge; another natural exten-

11Technically, Bazaar internally uses globally uniqueidentifiers, but RCS-style identifiers are the primary wayof user interaction.

12The modification usually involves manually copyingthe file under new name within the repository database.However, this presents problems when the working treeis seeked to an older point in the history or in case ofbranches.

sion to record file combinations makes the situ-ation even more challenging.13

A somewhat controversial alternative ap-proach has been taken by Git, where file re-names and copying is not explicitly recordedwith the rationale that Git aims to track contenton a more fine-grained level and hard-codingrenames information would pollute the historywith potentially bogus information. [25] Instead,Git can heuristically detect the events by com-puting similarity indexes between files in thecompared revisions,14 and it provides a so-called“pick-axe” mechanism to show changes addingor removing a particular string. Thus, whenfeeding pick-axe with some chunk of code, it willshow the commits that introduced it to the files,as well as commits that removed it from files, ef-fectively giving the user a rudimental ability totrace the code history.15

The problem of showing history of particularparts of files and following semantic units acrossfiles is still open and will in our opinion benefitfrom further research (as elaborated in 7.1).

3 Current Systems

We have already mentioned several version con-trol systems when describing basic design as-pects; now, we shall briefly describe the mostwidespread and interesting ones in more detail.We will focus on systems that are freely availableand in use in the open-source community, andeven so not go through all of them but insteadhighlight the most influential ones. We will notcover most proprietary systems like ClearCaseor SourceSafe in detail since we feel they gener-ally do not bring much innovation into the ver-sion control field itself (focusing more on pro-cess management, integration with other soft-ware etc.) and detailed information is hard togather from open sources.

Up to CVS, the evolution was rather slow, butthen with growing need for distributed versioncontrol, many short-lived projects sprang up

13The underlying design of recording renames varies —for example Subversion records renames as copy+deleteoperations, which can bring some inherent problems e.g.for using the rename information during merges. [50]

14The similarity index is computed by hashing line-based file chunks of the files and comparing how manyof the hashes exist in both files. [29]

15An obvious idea for an extension is to have the GUIfrontends integrate mouse text selection with pickaxe,further simplifying the process.

Page 6: Current Concepts in Version Control Systems - arXiv · 2018. 3. 14. · Current Concepts in Version Control Systems Petr Baudi s 2009-09-11 Abstract We give the reader a comprehensive

3 CURRENT SYSTEMS 6

and explored various approaches; now the fieldhas rather stabilized again, with three majorsystems currently used: Bazaar, Git and Mer-curial.

3.1 The Two Grandfathers

The first true version control ever in use wasSCCS (Source Code Control System) [63], in-troduced in 1972 at Bell Labs and distributedcommercially. It uses the weave file format(4.2.5) to track individual files seprately andprovides only a rather arcane user interface. Al-though its use in the industry is very rare nowa-days, the weave storage format has resurfacedrecently.

A competing system RCS (Revision ControlSystem) [54] published in 1985 has seen muchmore success, even though it also operates onlyon isolated files. It uses a delta sequence file for-mat (4.2.4) and the rapid adoption rate com-pared to SCCS is likely to be accounted mainlyto much more user-friendly interface (thoughit is still rather rudimentary by current stan-dards). RCS is still sometimes used in presentfor version-controlling individual files, e.g. whenmaintaining system configuration. However, itsconcepts and file format has seen even muchmore widespread adoption thanks to CVS.

3.2 Concurrent Version System

CVS [17] is essentially an RCS spinoff, bringingtwo major extra capabilities: ability to version-control multiple files at once and support forparallel work, including network support. It hasseen extremely widespread adoption and haslong remained the de-facto standard in versioncontrol, at least among the free systems.

However, CVS is directly based on RCS, andit is ridden with inherited artifacts of individ-ual file revision control — the need for elabo-rate locking, difficult handling of branching andmerging and lack of commits atomicity: whencommitted change spans multiple files, the com-mit will be stored for each file separately andthere is no reliable way to reconstruct the wholechange.16

3.3 Modern Centralized Systems

Perforce [49] is a widely commercially used

16However, relatively reliable heuristics are imple-mented, e.g. by the cvsps project. [18]

proprietary version control system that providesrather advanced version control features, thoughit keeps within the centralized model. Com-pared to CVS, it supports atomic commits (ty-ing changes in multiple files together), trackingfile renames and more comfortable branchingand merging support (branches are presented asseparate paths in the repository and repeatedmerges are supported).

In the open-source version control realm,SVN [59] has been developed as a direct CVSreplacement and currently appears to remainthe final step in the evolution of centralized ver-sion control. It shares many of the Perforce fea-tures, though many workflows are different. Itis seeing rapid adoption [58] and basically de-throned CVS as the default choice of centralizedversion control for open-source projects.

3.4 Older Distributed Systems

BitKeeper [11] is another proprietary versioncontrol system worth mentioning, since it pi-oneered the distributed version control field17

and also has relatively well-understood capabil-ities and internal working as it was for long timeavailable for free usage by open source projectsand some prominent projects like the Linux Ker-nel used it.

BitKeeper supports atomic commits (thoughfiles have revisions tracked individually as well),generic DAG history and the basic distributedversion workflows;18 however, each repositorycontains only one branch. RCS-style revisionnumbers are used, same revisions can have dif-ferent numbers in different repositories and re-vision numbers on branches can shift duringmerges. Internally, BitKeeper uses the weave fileformat and makes full use of its annotation capa-bilities in the user interface and during merges(see 6.4). It is still actively developed and re-portedly commercially popular.

With regard to the history, we shall also men-tion GNU Arch [32] as the first open-sourcedistributed version control system that has seenwider usage, however it suffered by bad perfor-mance and difficult user interface; the projecthas been basically discontinued (though still for-mally maintained), but Bazaar (see below) can

17The concept of distributed version control has beenexplored before by e.g. Aegis, but BitKeeper was the firstsystem to bring it to widespread use.

18That is, the pull/push operations and good mergingsupport.

Page 7: Current Concepts in Version Control Systems - arXiv · 2018. 3. 14. · Current Concepts in Version Control Systems Petr Baudi s 2009-09-11 Abstract We give the reader a comprehensive

3 CURRENT SYSTEMS 7

trace back its ancestry to GNU Arch to a certaindegree; Darcs has been also somewhat inspiredby its changeset-oriented design.

3.5 Monotone

Another important milestone is Monotone [39]with its design built around a graph of ob-jects identified by hash numbers computed fromtheir contents19 serving as direct inspiration forGit and Mercurial. Monotone itself puts strongemphasis on security and authentication, usingRSA certificates during communication, etc. Itis actively developed but does not see as wideadoption as the three systems below, mainly dueto performance issues.

Git and Mercurial build upon the Mono-tone object model, which is generally builtas follows (Monotone implementation detailsnotwithstanding): There are three main objectclasses — files (or blobs), manifests (or directorytrees) and revisions (or commits/changesets).All objects are referenced by large-size hash oftheir content that should have guarantees akinto cryptographical hashes (currently, SHA1 isused in all popular systems). File objects con-tain raw contents of particular file versions.Manifest objects contain listing of all files anddirectories20 at a particular point of history,with references to objects with appropriate con-tent. Revision objects form the history DAG asa subgraph of the general object graph by refer-encing the parent revision (or revisions, in caseof merge) and the corresponding manifest ob-ject. Usually, log message and authorship infor-mation is also part of the revision object.

The reader should notice that this schemeis cascading — by changing a single file, newmanifest object with unique id will be gener-ated (the file hash inside the manifest objectchanged, thus manifest object hash will changetoo), and this will in turn result in a revision ob-ject with unique id as well. Even committing thesame content with the same log message in dif-ferent branches will not create an id clash sincethe parent revision hash will be different withinboth objects. Tagging can be realized by sim-ply creating named reference to either the re-

19Technically, the first system using hashes as identi-fiers is OpenCM [46], but this system has never reallytaken off as far as we can tell.

20Git uses one object per directory instead; there isa trade-off between object reuse and dereferencing over-head in this choice.

vision object or another object containing thereference and also some tag description, possi-bly digitally signed by the project maintainer toconfirm that the tagged revision is an official re-lease; thanks to the usage of cryptographically-strong hashes for references, the cryptographictrust extends to the complete revision content.

3.6 The Three Rulers

In spring of 2005 a rapid series of events re-sulted in termination of the permission to useBitKeeper for free software projects, provid-ing an immediate impulse for improvement inthe area of open-source version control sys-tems, especially to the Linux Kernel commu-nity — almost in parallel, the Git and Mercurialprojects were started. Independently, CanonicalLtd. (the vendor of popular Linux distributionUbuntu) started the Bazaar21 project aroundthe same time.

All three systems provide atomic commits,store history in generic DAG and support ba-sic distributed version control workflows.

Git [24] [25] re-uses the generalized Mono-tone object model; by default it stores each ob-ject in a separate file in the database, thus hav-ing a very simple and robust but inefficient datamodel;22 however, extremely efficient storage us-ing Git Packs (4.3.2) is also provided. In its userinterface, Git emphasizes (but does not makemandatory) the rather unique concept of in-dex, providing a staging area above the work-ing tree for composing the next commit.23 Gitis written in C (and a mix of shell and Perl

21At that time, it has been called BazaarNG — a suc-cessor to an original Baz(aar) project which was itself afork of GNU Arch.

22This scheme precludes need for any locking what-soever, but there is no compression in use and usingmany small files means inherent overhead on the filesys-tem level.

23Originally a low-level mechanism, the popularity ofindex usage among developers elected it to a first-classuser interface concept. If users choose to use it, theymanually add files (or even just chunks of changes) fromthe working tree to the index at various points and thenthe index is committed, regardless of working tree state;thus, the user may modify a file, add it to index and thenbefore committing modify it again for example with thechanges required to compile it locally — the commit op-eration will use the older version of the file. When explic-itly recording all files to commit in the index, greater dis-cipline of making appropriately fine-grained commits canbe encouraged. Also, non-trivial conflicts are recorded inthe index, making for a natural way of resolving themby simply adding the desired resolution to the index.

Page 8: Current Concepts in Version Control Systems - arXiv · 2018. 3. 14. · Current Concepts in Version Control Systems Petr Baudi s 2009-09-11 Abstract We give the reader a comprehensive

4 STORAGE MODELS 8

scripts) and is tightly tailored to POSIX sys-tems, which historically meant portability prob-lems especially with regard to Windows. It is de-signed as a UNIX tool, providing large numberof high- and low-level commands, and it is tradi-tionally extended by tools calling the commandsexternally.

Mercurial [37] [68] has very similar objectmodel to Git and Monotone, but uses the revlogformat (4.3.3) for efficient storage. It is writ-ten mainly in Python and thus does not suf-fer from the portability problems of Git; sinceit uses an interpreted language, the main ex-tension mechanism is writing tightly integratedplugins. Performance-sensitive portions are im-plemented in C.

While both of the systems above always em-phasized their robust data models and perfor-mance, Bazaar’s [9] main focus is simple andeasy user interface; though internally using elab-orate unique revision identifiers, it presents pri-marily the RCS-like revision numbers to theuser, and its user interface has direct supportfor some common workflows like emulation ofthe centralized version control setup (checkouts)or Patch Queue Manager, an advanced al-ternative to all developers pushing to a centralrepository.24 Bazaar is also written in Python,making the same tradeoffs as Mercurial.

3.7 Darcs

Darcs [19] is a somewhat unusual specimen inthe arena of version control as it focuses onchanges as the first-class objects instead of trees;any particular version of the repository is merelya combination of patches on an empty tree, andmanipulating the patches is the primary focus.

Darcs supports single branch per repositoryand names the patches based on user input; par-ticular combinations of patches can be tagged tomark specific versions of the tree.25 Darcs doesnot organize the patches in a particular graph,they are structured more like a stack, thougha graph could be imagined based on interde-pendencies between the patches; merges of sev-eral patches create auxiliary patches represent-ing the merges.

24Developers send their pull requests or actual patchesto a public mailing list, where they can be acknowledgedor vetoed by others and then a robot automatically picksthem up, checks if they pass the required testsuites andeventually merges them to the main branch. [47]

25In fact, a Darcs tag is itself a patch that is howeverempty and depends on all patches currently applied.

A particularly interesting feature of Darcs isprecisely how it handles patch merges — it usesa formal system called patch algebra (6.6) to de-scribe dependencies between patches and basicoperations on the patch stack required to prop-erly merge two sets of patches (branches).

Darcs is written in Haskell, which is some-what exotic choice, but is well portable. Unfor-tunately, as far as we are aware it suffers from se-vere performance problems (as discussed in 2.3)when used on large repositories with long his-tories; more efficient merging algorithms as wellas proving formal correctness of the underlyingpatch theory is subject of ongoing research.

4 Storage Models

The storage model of a system can be quite dif-ferent from the logical design in order to ac-comodate for practical considerations of spaceand time effectivity. Here we shall explore pos-sible approaches to permanent storage of thedata and metadata tracked by version control.All version control systems differ in implemen-tation details and various individual tweaks; wewill not try to be exhaustive here and will focusonly on the most widespread and most interest-ing ideas.

First, we will dwell on the currently popularalgorithms to compare two objects and createa delta: complete technical description of theirdifferences.26 Second, we shall elaborate on pos-sible ways of representing an individual delta aswell as whole single-file history, then we will lookat the approaches on organization of the wholerepository.

4.1 Delta Algorithms

There is a certain variety of delta algorithms inuse, and there can even be multiple algorithmsused within a single system — some algorithmsprovide suitable and sensible output for humanreview while others are focused on finding thesmallest set of differences and providing minimalrequired output for storage.

4.1.1 Common Techniques

First, we shall describe several common tech-niques used when dealing with deltas. The

26E.g. a series of insert–copy commands or a “diff”tool output text.

Page 9: Current Concepts in Version Control Systems - arXiv · 2018. 3. 14. · Current Concepts in Version Control Systems Petr Baudi s 2009-09-11 Abstract We give the reader a comprehensive

4 STORAGE MODELS 9

first technique worth mentioning is used to ourknowledge by all commonly used systems newerthan CVS: delta combination [45] [68]. When ap-plying a chain of deltas, they are first mergedto a single large delta, and only then applied.This reduces the number of data copies fromO(deltas) to O(1) and takes advantage of cachelocality, dramatically increasing performance ofapplying delta chains.

Another interesting technique used most no-tably in Subversion is using skip-deltas whenstoring delta chains. [45] In order to limit chainlengths, the deltas aren’t always created againstneighboring revision but instead construct a bi-nary tree alike structure — thus, reconstruct-ing particular revision does not require O(revs)deltas but only O(log revs). However, whenadding new revisions, parts of the tree or thewhole tree needs to be reconstructed, and thedeltas are usually larger. Newer systems usuallysolve the problem of too long delta chains bysimply inserting the whole revision text whenthe chain becomes too long or large in size.

4.1.2 Myers’ Longest Common Subse-quence

Perhaps the most widely used algorithm wasproposed by Eugene Myers in [7] and is basedon recursively finding the longest subsequenceof common lines in the list of lines of the com-pared objects; it is used most notably by theGNU diff tool [33], but also as the user-visiblediff algorithm of most systems. The main ideaof the algorithm is to perform two breadth-firstsearches through the space of possible edita-tions (line adds/removals/keeps), starting fromthe beginning and end of the object in paral-lel; when the two searches meet, we have foundthe shortest editation sequence. The algorithmhas quadratic computational complexity, so it isunsuitable for very large inputs.

As noted in [68], while this algorithm is quitesuitable for human consumption and does pro-duce optimal longest common subsequences, itdoes not produce optimal binary deltas sinceit weights additions, changes and removals allthe same, while binary deltas need to actuallyrecord only added data and removed data doesnot need to be spelled out.27

27However, this should be easy to work around by ad-justing the breadth-first search to add 0-length edges atthe beginning of the queue and 1-length edges at the end.

4.1.3 Patience Diff

An interesting variation of the above algo-rithm was recently introduced as the default diffmethod to the Bazaar system, devised by BramCohen. [10] A common problem with using theMyers’ algorithm is that the common subsec-tions search misaligns additions of parentheseson separate lines or other simple syntactic con-structs, making it harder to view by humans.

The main idea in this algorithm is to “takethe longest common subsequence on lines whichoccur exactly once on both sides, then recursebetween lines which got matched on that pass.”[61] Also, it achieves better computational com-plexity by using Patience Sorting [36] to find thelongest common subsequence.28

4.1.4 BDiff

The bdiff algorithm [68] improving the Ratcliff–Obershelp pattern recognition [48] comes fromthe Python difflib library [53] and is used byMercurial, both for delta storage and provid-ing difference listing to the user. In contrast tolooking for the longest common subsequence, itsearches for the longest common continuous sub-string within the objects and recursively in theparts preceding/succeeding it. It is quadraticlike the Myers algorithm, but has been mea-sured [68] to have better average performanceand output more human-friendly deltas.

4.1.5 XDelta

The tool xdelta [72] [73] popularized use of theRabin Fingerprinting technique [23] for deltageneration: it is effectively a simple modifica-tion of the Rabin–Karp substring matching al-gorithm, inspired by the work on rsync [22].Subversion [60], Monotone [41] and Git [28] usethis variation for internal storage — generat-ing one-way binary diffs between arbitrary non-textual blobs.

The algorithm produces Copy and Insert in-structions on the output and depends on fasthashing. The first object is divided into smallblocks29 and each block is hashed and added to

28The referenced paper describes finding the longestincreasing subsequence. To find longest common subse-quence, a sequence of lines corresponding to one objectis created, with the elements being line numbers of thesame lines in the other object; lines that are present morethan once or never are ignored in the search.

29The window size is 16 bytes in Git; small power of2 is recommended in general.

Page 10: Current Concepts in Version Control Systems - arXiv · 2018. 3. 14. · Current Concepts in Version Control Systems Petr Baudi s 2009-09-11 Abstract We give the reader a comprehensive

4 STORAGE MODELS 10

a hash table (proportional to input size). Then,a running hash of the same-sized window is be-ing computed on the second file; the momenta match is hit, all found blocks of the first ob-ject are stretched to largest possible matches tothe current position in the second object andthe Copy instruction is generated for the largestmatch. Insert instruction is generated for un-matched data in second object.

This algorithm is very fast (having much bet-ter computational complexity compared to My-ers’ — O(n) vs O(n2)) and produces efficientdiffs for arbitrary data, however it is entirelyunsuitable for human consumption.

A similar scheme is also used in zdelta [74]with reportedly better efficiency and with codereleased under a more permissive licence. ZDeltareuses the LZ77 algorithm [4] for finding thematching parts and Huffman codes for encod-ing the result.

4.2 Delta Formats

4.2.1 Trivial Approach

Not really a delta format by itself, the mostnaive approach to store history is to simply storeevery version of every object separately, possiblyusing some compression method (not taking ad-vantage of historical context of the object). Thisis the obvious format used for “poor man’s ver-sion control” methods like archiving compressedsnapshots of the whole project at regular in-tervals and backing them up. However, perhapssurprisingly this method does see usage in somemodern systems as well — namely, it is used asthe basic in-flight storage format in Git.

4.2.2 Unified Diff

The unified diff format [69] — albeit not usedin the changes storage itself — is the de-factostandard for interchange of plaintext changes,being the usual format used in “patch” files.30

For each file, the diff contains all the changechunks — the areas of the file where changes oc-cured. Each chunk consists of the chunk headerlocalizing the chunk in both versions of the file(starting line number and line length of thechunk), context lines immediately surroundingthe changed lines (preceded by a space) and the

30Sometimes, a very similar format called “contextdiff” is also used, however unified diffs are much morecommon.

change itself: added lines preceded by a + signand removed lines by a - sign; modified lines arewritten as a +/- line pair. Unidentified lines inthe diff are ignored, which allows version con-trol systems to include custom metadata in thediff.31

This format has number of advantages: it iseasy to generate, it is easy to inspect by a humanfor the purpose of change review (including eas-ily tweaking details of the change in the unifieddiff itself), it is easy to apply automatically32

and very importantly, it is possible to apply thedifferences to another version than the one thediff was created from. This is allowed by thepresence of the context lines: even if the start-ing line numbers in the chunk header do notfit the context, the program can automaticallysearch the vincinity of the area for the contextlines and apply the patch fuzzily. At the sametime, the context lines provide a way to verifythat the change is still applicable as-is to thetarget version.

We should also mention a Git extension tothe unified diff for showing diff representationof merges — combined diff [26] (see figure 1 foran example). Instead of a single column withthe ± marks, multiple columns are included,one per merge parent, allowing concise descrip-tion of which lines differ against which parents.Furthermore, by excluding hunks that containonly changes coming from one of the parents,we get a diff describing only changes actuallyintroduced by the merge itself — manual reso-lutions of merge conflicts.33

4.2.3 VCDIFF (RFC3284)

The VCDIFF format [64] deserves a brief men-tion, being the official standardized way to ex-change binary deltas over the network, beingspecified by RFC3284. However, to our knowl-edge no widely used version control system ac-tually uses this format for anything, though itis the default output generated by the xdeltatool (4.1.5). The main reason is that it is eas-ier and faster for individual systems to use theirnative delta formats even for data interchangeand that the VCDIFF format is rather baroque,

31Commonly, informational-only notes about the com-pared revisions are inserted. However, e.g. Git uses thisto extend the diffs with file renames/copies informationand can make use of the metadata when applying thepatches.

32UNIX provides a standard patch(1) tool. [52]33Conflicts are further explained in 6.2.

Page 11: Current Concepts in Version Control Systems - arXiv · 2018. 3. 14. · Current Concepts in Version Control Systems Petr Baudi s 2009-09-11 Abstract We give the reader a comprehensive

4 STORAGE MODELS 11

diff --cc git-am.sh

index 75886a8,4dce87b..b48096e

--- a/git-am.sh

+++ b/git-am.sh

@@@ -9,9 -8,9 +9,9 @@@ git-am [options] <mbox>|<Maildir>..

git-am [options] --skip

-d,dotest= use <dir> and not .dotest

+d,dotest= (removed -- do not use)

i,interactive run interactively

- b,binary pass --allo-binary-replacement to git-apply

+ b,binary pass --allow-binary-replacement to git-apply

3,3way allow fall back on 3way merging if needed

Figure 1: Combined diff example. [26] The diff and index lines are Git-inserted metadata, andthe trailing text at the @@@ line is only to give the reader a semantic context of the hunk.

implementing it is not trivial and it is generallynot as compact as the native protocols.

4.2.4 RCS Delta

RCS uses the simplest representation of changessequence that also spilled over to CVS and heav-ily influenced the SVN BerkeleyDB schema.34

[54] The representation is in text format andis optimized for quick access to the latest revi-sion, which is stored at the top of the file andalways in full. Older revisions are then storedeach as a delta against the succeeding one. Incase of branches, the delta direction is reversedand newer revisions are represented by forwarddelta — thus, to get a tip of a non-trunk branch,delta sequence all the way from the latest trunkrevision to branch base and then back to thebranch tip must be applied.

The delta itself is simply composed by lineseach containing an add a or delete d command,followed by line number in base data and num-ber of lines, with the lines to insert by a follow-ing the command immediately.

Aside of very slow reconstruction of non-trunkbranch tips and old revisions in general, the ma-jor issue with this format is the requirement torewrite the whole file in case of commit, mak-ing the repository database much prone to lockcontention and data corruption and making cre-ation of new revision quite an expensive opera-tion.

34SVN FSFS format is quite different and perhapsmost resembles the Mercurial revlogs (4.3.3). We chosenot to describe SVN formats in detail since we deemthem not particularly interesting in themselves.

4.2.5 Weave

The weave format introduced by SCCS [63] usedto be rather obscure for long time, but cur-rently many version control developers are fa-miliar with it due to its prominent usage by Bit-Keeper and some newly developed merging al-gorithms. Still, it has not seen widespread adop-tion.35

Instead of storing each revision separately,this format intersperses all revisions, listing allthe lines that ever appeared in the file togetherwith the list of revisions they appear in. Then,retrieving any revision means simply extractingall the lines that have the revision in their set.

This format shares many disadvantages withthe RCS delta format, especially with regard tolocking and data corruption potential. In addi-tion, retrieving any revision is uniformly pro-portional to history size. On the other hand,the line annotation36 is available very cheaply(compared to systems using delta-based stor-age where retrieving this involves walking all thehistory) and having per-line history informationenables some interesting merging methods to beused (see 6.4).

35The Bazaar project actually used it for brief periodbut then switched to a more traditional delta chain rep-resentation for the benefits of append-only data struc-ture. [13]

36Information about who, when and in which revisionadded a particular line.

Page 12: Current Concepts in Version Control Systems - arXiv · 2018. 3. 14. · Current Concepts in Version Control Systems Petr Baudi s 2009-09-11 Abstract We give the reader a comprehensive

4 STORAGE MODELS 12

4.3 Repository Database Formats

4.3.1 Trivial Approach

The trivial approach is to simply use separatefiles for appropriately fine-grained objects. Thisusage is very common in many systems; it is theapproach taken by CVS and Mercurial whereevery file in the project directory has its historystored in corresponding file in the repository. Itis also the basic storage format of Git, whereeach object (commit, tree or blob — particu-lar state of a version-tracked file) is stored in aspearate file.

4.3.2 Git Packs

As mentioned above, Git simply stores its first-class objects directly on the disk as its primaryformat. However, as explained earlier this is veryinefficient for long-term storage; for that pur-pose, Git Packs offer a storage format showingextreme size effectivity in practice.37

A Git Pack [26] consists of a pack file andindex file. The index file provides a simple two-level index of the objects keyed by the objectid. The pack file contains the compressed Gitobjects of all basic types — commit, tree andblob — but can also accomodate an extra objecttype delta:

The pack-specific delta objects represent agiven object by describing binary differencesagainst another object of the same type,38 asyielded by the xdelta algorithm (see 4.1.5). Intheory, any object of the same type will do asa base, but Git sorts the objects using a spe-cial heuristics and then considers n followingobjects in the list as delta candidates,39 pick-ing the smallest delta generated.

The list heuristics is the core of the Git Packefficiency. [16] [28] The important criteria arethat blob objects are sorted primarily by the file-name they were first reached though, with thelast characters being the most significant ones.(Thus, all .c files are grouped together, then allfiles with the same name in different directories,etc.) The secondary sorting order is by object

37Git also uses the same format for native networkdata transfer.

38Or, of course, another delta object. However, thedelta chain is upper-bounded, by default by the value of10.

39By default, the delta window is n = 10, but com-monly, values like 50 and rarely even 100 are used forgenerating ultra-dense packfiles.

size, largest objects first — thus, deltas are gen-erated from larger objects to smaller ones, al-lowing more efficient representation.

Note that the delta order is in almost no wayrelated to physical ordering of the objects inthe pack file. There are no theoretical restric-tions on that, however Git orders them basedon breadth-first search on the graph of all ob-jects starting from the top commit objects onthe defined branches. In other words, the objectsnecessary for recreating the latest commits arelumped together at the beginning of the packfile. This particular ordering greatly optimizesthe I/O operations locality for the most com-mon I/O patterns.

One specific exception in the physical order-ing is that in case of delta objects, the base ob-ject is stored before the delta object; this en-sures locality when reconstructing delta chains.In practice, this rule does not reorder the ob-jects significantly — as the Linus’ Law states:“Files grow” [16]; thus, the base objects (largerfiles) tend to come before delta objects (smallerfiles) anyway and while it is not enforced, thedelta objects tend to become “backward deltas”most of the time.

Git Packs use no locking mechanism — theyare not created on the fly, but only at certainpoints in time during a “garbage collection” op-eration over the repository, from loose one-per-object files created in the repository by regularusage.40

4.3.3 Mercurial Revlogs

Another popular data structure are so-calledrevlogs used by Mercurial. [68] [38] While Gitpacks are designed around heuristics, guaranteevery little about worst-case performance and areoptimized for work with hot cache,41 revlogs canguarantee good worst-case seek performance (atthe cost of less flexibility for improving averageperformance) and are tailored for cold cache ac-cess.42 Unfortunately, while there are some ca-sual benchmarks available on the web, detailedcomparison with clearly defined cache status toconfirm these expectations is yet to be pub-lished.

40One exception is when new commits are pulled bythe native protocol, which itself is in the Git Pack for-mat; the data is then saved directly as a pack.

41That is, when the user uses the system continuouslyfor longer time and most of the relevant data is cachedin memory.

42With “cold cache”, most of the data are still on disk.

Page 13: Current Concepts in Version Control Systems - arXiv · 2018. 3. 14. · Current Concepts in Version Control Systems Petr Baudi s 2009-09-11 Abstract We give the reader a comprehensive

5 USER INTERACTION 13

Mercurial uses separate revlogs for eachtracked file and two extra revlogs for the man-ifests and changesets. Each revlog consists ofan index file for revision lookup43 and data filecontaining a hunk per revision, every hunk be-ing either a full revision in verbatim or a deltaagainst the previous revision — the total size ofthe delta chain is limited by a small multiple ofthe revision size.44 In case of the revision beingrepresented by a delta, the index will containpointer to the base revision so that all the nec-essary data can be read from the file in a singleswipe. Revisions from different branches are allstored in a single revlog, but when forming deltachains only the order of recording the revisionsmatters, not their branch.

When recording new revision in a revlog, therevlog is locked for writing and a new hunk isappended to the data file and then recorded inthe index. Since the revlogs are append-only, nolocking is required for readers and it is a matterof simply truncating the file to recover from aninterrupted write. Furthermore, when workingwith the whole tree (both reading and writing),all the revlog files are visited in a fixed ordercompatible with the standard system files or-dering; this way, the filesystem can maintain themost desired on-disk layout and using standardsystem tools to copy a repository will result ina good on-disk layout as well.

5 User Interaction

In the development of new generation versioncontrol tools, the importance of good user inter-face has been underestimated in the past, whileit tends to be one of the most important fac-tors for users when deciding which system touse; good user interface is therefore one of mainconsiderations in the development of currentlywidely used tools.

Traditionally, all of the tools providecommand-line interface, which tends to be themain way of interaction with the system fromthe side of the developers. However, manyauthors also explore ways to interact with thesystem efficiently using GUI, both for actuallyoperating the system and merely exploringthe history. Another separate area are web

43A slightly modified RevlogNG format used by newerMercurial versions supports interleaving of index infor-mation and data hunks.

44Mercurial 1.0 uses the value of 2.

interfaces for inspection of repositories.

5.1 Command-line Interfaces

One important conisderation with thecommand-line interfaces is to keep in mindthat users are actually going to type manycommands — e.g., GNU Arch requires usageof extremely long revision identifiers, and forsystems using hashes for object identification,it is important to allow shortcutting mentionedearlier.

Another considerable point is to design theinterface carefully and logically around a mini-mal set of commands, also hiding implementa-tion details whenever possible (e.g., by tryingto intelligently default to a suitable merge al-gorithm instead of requiring the user to decideon one). Git provides currently about 140 com-mands, while large portion of them is in factalmost never interesting to end users and is use-ful only as helper commands within scripts orother commands.45

Note that even regarding traditionalcommand-line interfaces, there can be in-teresting variations — for example, Git extendsthe basic edit–commit workflow with the op-tional Git index concept, presenting a stagingarea inbetween the working tree and nextcommit; see 3.6 for details.

User interface can take very unusual forms;a specific one is integration of version controlwithin the filesystem interface. In that case, ac-cessing a particular revision of a file is usu-ally performed by simply opening a specificallycrafted interface. This concept goes as far backas to the VMS operating system where thefilesystem interface had builtin support for fileversioning. [71] A configuration managementsystem with version control capabilities Vestaprovides primarily filesystem-based access to in-dividual revisions (furthermore with O(1) ac-cess guarantees). [65] There also exist alterna-tive filesystem-based interfaces for more tradi-tional version control systems as well (though

45To simplify the usage of Git, we created a frontendCogito that used lowlevel Git interface to create an easy-to-use version control tool carefully designed to simplifymany workflows and present a simple and consistent setof commands to the user, with consideration for usersused to older systems like CVS or Subversion. Since mostfeatures provided by Cogito have been adopted or obso-leted by upstream Git development, Cogito is not main-tained anymore (but we still consider the Git UI some-what lacking).

Page 14: Current Concepts in Version Control Systems - arXiv · 2018. 3. 14. · Current Concepts in Version Control Systems Petr Baudi s 2009-09-11 Abstract We give the reader a comprehensive

5 USER INTERACTION 14

mostly in early development stages), e.g. Bzr-FS [12] or GitFS [31].

5.2 Basic GUI Tools

There are several types of GUI tools. Themost obvious candidates are all-in tools cov-ering most of the system functionality, oftenreusing other helper applications (such as his-tory browsers) — most systems have tools likethat, e.g. WinCVS, git-gui, Bzr-Gtk, etc.

An interesting approach is to integrate theversion control system with basic system shellsor file managers (like the Windows Explorer orKonqueror); version-controlled files include thestate in their icon (clean, modified, containingmerge conflicts, ...) and all versioning operationsare available through the context menu. Theclassical example being TortoiseCVS, cloneslike TortoiseSVN, TortoiseBzr, Tortoise-Git, TortoiseHg, etc. also exist. Similarly, ver-sion control capabilities are often integratedwith IDEs like Eclipse.

Another family of tools are simple GUIhelpers designed merely to be used in concertwith the classical commandline controls; this hasbeen mostly pioneered by BitKeeper, providingcitool, mergetool etc. to allow to perform oper-ations like reviewing to-be-committed state ormerging files more comfortably. Currently, e.g.Git frontends like git-gui or tig provide a wayto control to-be-committed content down to theper-hunk level of diffs and there exist system-agnostic tools for graphical review of diffs (xxd-iff, kdiff3, ...) and even merging assistance (es-pecially meld).

Visual UI tools do not need to be alwaysgraphical. Many developers prefer exclusivelyterminal work and sometimes it may be neces-sary to work on a remote server over ssh ses-sion; tig is a good example of terminal tool thatprovides ASCII-art46 history diagrams and fron-tends many of Git’s functionality (including in-dex management down to diff-hunk level), with-out requiring any graphics rendering.

5.3 History Browsing

With the advent of distributed version control,it became much more important to properlyvisualize the project history — suddenly it is

46Graphics constructed in text mode from standardletters and symbols.

not a simple tree of commits, but a more com-plex graph with possibly many branchings andmerges and the relationships between revisionsmay not be immediately obvious to the user.

The first popular history visualizer were Bit-Keeper’s revtool and Monotone-viz [42] whichfocuses on visualising the revision graph, offer-ing it as the main visual object to the user; click-ing on nodes will display revision details. DuringGit infancy, the tool was adopted as Git-viz, butit did not see widespread usage; however, similartool is being used in Bazaar as part of BzrTools.

Instead, a much denser history presentationgained popularity in the Git community withina tool gitk. Its primary visual objects are therevisions themselves, with the first line of thecommit message (traditionally regarded as kindof “subject”) and authorship information shownfor each commit; the graph is still shown, how-ever only in narrow area at the margin of thescreen, with nodes aligned to the line structureof commits. Reused in other Git user interfaces(both graphical ones and tig), this way of visu-alization became popular in the Mercurial com-munity as well (hgk), Monotone even producessimilar kind of output in ASCII art with its na-tive mtn log command; there also exist projectslike git-browser which use AJAX technologiesto bring this interface to the web.

5.4 Web Interfaces

Important part of user interface of a version con-trol system is a web interface — most projectshave some kind of web interface available so thatpeople wanting to look at the source and the his-tory do not have to install the particular versioncontrol tool and download the whole project.Good interface also should not assume users’ fa-miliarity with the particular system, since it getsmuch more diverse audience than other tools ofthe system.

Historically, cvsweb, viewcvs and similartools for SVN focused on primarily presentingthe tree of files, with support for inspecting in-dividual history of files when needed. While sim-ilar approaches are also present in some web in-terfaces for distributed version control systems(e.g. GitHub), the paradigm shift to primarilypresenting the project history is visible here aswell, in the widely used gitweb interface andthe hgweb clone. On the project front page,the user is not presented with root directory ofthe file tree, but instead with the summary of

Page 15: Current Concepts in Version Control Systems - arXiv · 2018. 3. 14. · Current Concepts in Version Control Systems Petr Baudi s 2009-09-11 Abstract We give the reader a comprehensive

6 MERGING METHODS 15

the latest commits, available branches and tags,etc. — browsing the tree is of course possible,but not the primary activity anymore.

Something of a rarity, the Gist service47 usesmodern web technologies to allow users buildwhole repositories from scratch from within theweb browser; the user is presented by a text-areawhere they can paste content (akin to variouspastebin services), but further modifications arethen stored in git revision history, it is possi-ble to add multiple files and eventually pull thehistory. It is unclear to us how often it is actu-ally used as anything more than a trivial one-offpastebin, however.

6 Merging Methods

One of the major past challenges with systemsencouraging massive branching was devising apowerful merging mechanism that would makethe merging process sufficiently easy, quick androbust for frequent usage. This ability is crucial,otherwise people will become reluctant to createbranches and the main strength of distributedversion control will dwindle. On the other hand,practice shows that merge algorithms should besimple enough to be easily predictable, even ifthe tradeoff would be to require manual inter-ventions by the user more frequently — commonproblem of algorithms not based on three-waymerge is that while they may generate less con-flicts, their merge resolutions may be relativelyunintuitive and some of their properties are con-troversial and unfamiliar to the users.

In this section, we will describe the mostcommon merge strategies (three-way merge andrecursive merge) as well as some more ad-vanced theoretical developments (mark-merge,PseudoCDV merge). We will also consider thepatch algebra based way of merging in changeset-oriented systems.

There is a great number of considerationswhen designing and comparing merging algo-rithms and we cannot cover them all here; werecommend the pages dedicated to merging al-gorithms in the RevCtrl wiki [55] for detailedstudy of all the aspects.

6.1 Fast-forward Merge

First, we should notice what happens in the re-vision DAG when a “blank” merge is actually

47Part of the GitHub cloud.

0

1.1 2.1 3.1

1.2 3.2

2.3

2.2

3.3

2.4

Figure 2: A history DAG example

performed: consider merging revisions 1.2 and2.4 as shown in the figure 2. On the branch 1,no new revisions appeared since its last mergewith 2.2, thus merging the 2.4 revision will ef-fectively merely copy this revision over to thebranch 1. A system might create a merge-node1.3 anyway, but this may not be desirable sincemany useless merge-nodes will be created in theextremely common scenario of users only track-ing some project in read-only fashion: every timethey update their copy, no local revisions will befound locally but the merge will create a merge-node revision to represent the update anyway.

To avoid this problem, a “virtual” mergestrategy of fast-forwarding is used; in case oneof the merged branches contains no new revi-sions since the last merge, it is simply repointedto the head commit of the other branch. Thus,in our scenario, branch 1 is simply repointed tothe revision 2.4! Then, when committing a newrevision to branch 1, 2.4 will be treated as thenew fork point.

This technique allows branches of the DAGto converge whenever possible, however it alsobreaks commit parents ordering — after fast-forwarding to a different branch, merges withthe original branch will appear “flipped”. Theusers either cannot rely on the parents ordering,or they have to constrain the system usage by apolicy to disallow scenarios that would provokea fast-forward.48

48Fast-forwarding can frequently occur in tight push-pull loops performed concurrently on a shared central

Page 16: Current Concepts in Version Control Systems - arXiv · 2018. 3. 14. · Current Concepts in Version Control Systems Petr Baudi s 2009-09-11 Abstract We give the reader a comprehensive

6 MERGING METHODS 16

6.2 Three-way Merge

Three-way merge49 is the most common methodof merging in use. Assume merging two versionnodes x and y into merged version m. Let b bethe merge base — the “nearest common ances-tor” of x and y. Then, using difference func-tion50 ∆: (text1, text2) → delta (and delta ap-plying function ∆−1: (text1, delta)→ text2):

dx = ∆(b, x)

dy = ∆(b, y)

d = dx ∪ dy

m = ∆−1(b, d)

Simply the combination of differences be-tween base and both merged versions are ap-plied to the base version again and the result isthe merged version. The problematic part hereis determining d — if dx and dy concern discreteportions of the base version, combining them istrivial.

However, when both change the same partof base version, each in a different way, theycannot be combined in a simple way. A mergeconflict arises, only the non-conflicting parts ofd are applied and for the rest the user is re-quired to choose between mx = ∆−1(b, dx) andmy = ∆−1(b, dy) or combine mx and my manu-ally — usually the system inserts both variantsto the file, separated by visible one-line mark-ers.51

In general, in single-shot usage52 the algo-rithm can work on any b merge base version.

repository. One way to avoid them is to tell the devel-opers not to merge the changes their peers have done inparallel and pushed earlier, but instead to rebase theirown changes on top of the already-pushed ones: this op-eration rewrites the history locally, sequencing two com-mits performed in parallel.

49Unfortunately, we were not able to find out who in-troduced the technique of three-way merging, currentlycommonly known and widely used; the earliest referencewe were able to find is in the RCS paper. [54]

50See 4.1 for the tour over various functions.51In some cases, the system can still resolve simple

conflicts, for example if they concern only whitespacecharacters or perhaps even when merging a simple refor-matting change with a semantic change. Putting moreinteligence into conflict resolution is one of the activeareas of merge research. However, silently mis-mergingincompatible changes can have dangerous consequences,so making the resolution more intelligent at the risk ofincreasing error rate can be very harmful.

52When the algorithm is applied on two arbitrary ver-sions without historical context. In a generic DAG, se-lecting inappropriate b within repeated merges can leadto information loss, as described below.

However, the choice of a particular b has ma-jor impact on the size of dx and dy, affectingthe conflicting portion of d. Thus, b should bechosen intelligently, usually by taking the leastcommon ancestor (LCA).

In case of a tree, LCA can be simply describedas the node nearest to the tree root on a non-oriented path from x to y, and it will be alwaysunique as follows from tree properties. However,this approach is satisfying only for the first-timemerge of the branches of x and y — if a mergebetween the branches is performed repeatedly,the same b will always get chosen, even thoughthe previous merge outcomes would be muchmore reasonable choices as the size of dx and dycan be assumed to generally increase over time.

This is the main motivation behind makingthe history a DAG instead of a tree; whenrecording the m version, both x and y aremarked as its ancestors instead of just one ofthem. When determining the LCA53 of descen-dands of x and y later, m can then be used in-stead of the original b. Thus, for example forthe graph in figure 2 earlier, the base of the 2.4merge of 2.3 and 3.3 would be 2.2.

However, in this case we lose the guarantee ofb uniqueness, since multiple discrete paths be-tween x and y can exist: to deal with this prob-lem, recursive merge extension of the algorithmhas been devised.

0

X0 Y0

X1 Y1

merge

Figure 3: A criss-cross history graph example

6.3 Recursive Merge

Consider the criss-cross merge scenario [56]shown in figure 3. In this case, b can take the

53In case of a DAG, we prune paths that are supersetsof other paths from our consideration.

Page 17: Current Concepts in Version Control Systems - arXiv · 2018. 3. 14. · Current Concepts in Version Control Systems Petr Baudi s 2009-09-11 Abstract We give the reader a comprehensive

6 MERGING METHODS 17

values of both X0 and Y0, however taking anyof these can lose data from the other branch de-pending on the nature of the merge — considerconflicting change in X0 and Y0, in X1 resolvedby taking X0 version exclusively and in Y1 bytaking Y0. Then, for b = X0, Y0 version will becleanly merged and vice versa — clearly, conflictshould be generated instead.

This particular problem and its more compli-cated variations led some researchers to aban-don the three-way merge altogether and focuson other merge methods instead, some of themdescribed below. However, the problem can beat least partially solved within the context ofthree-way merge as well, though the practicalimpact has to be yet evaluated carefully.

A basic rule should be that the system, detect-ing multiple possible merge bases, does neverrandomly select one but instead asks the userfor intervention, chooses a proper merge base orpossibly intelligently constructs one.

One naive approach we devised is to takeLCA(X1, Y1) recursively and repeat until asingle revision comes out. The practicality ofthis approach is dependent on the developmentstrategy adopted by a particular project — if thecriss-cross merge is not repeated for too long anda common base for both branches appears oncein a while, the LCA will stay in a reasonabledistance from the merged versions and conflictswould generally stay within reasonable scale.However, for long-term criss-cross merges thisapproach degenerates to something similar to atree-based history, and it still leaves some moreadvanced criss-cross problems unsolved [1].54

An important enhancement implemented byFredrik N. Kuivinen is to actually perform themerges themselves recursively: [27] [56]

|B| = 1 : b(B) = B0

|B| = 2 : b(B) = M(LCA(B0, B1), B0, B1)

M(B, x, y) = ∆−1(b(B), x ∪ y)

m(x, y) = M(LCA(x, y), x, y)

That is, in the example above:

m = ∆−1(∆−1(0, X0 ∪ Y0), X1 ∪ Y1)

It is easily visible that this works properly incase of no conflicts, and reduces conflict rate by

54It must be also considered that if the merge gener-ates unnecessarily large conflicts, risk of human-inducedmismerge raises considerably.

working on much finer grain level. In case of con-flict, the main idea of the algorithm is to simplyleave the conflict markers in place when usingthe result as a base for further merges. Thismeans that earlier conflicts are properly prop-agated as well as conflicting changes in newerrevisions. However, some rare edge cases are stillmishandled as described in [56].

6.4 Precise Codeville Merge

A weave-based merge algorithm Precise Codev-ille Merge [62] has been devised by the Codev-ille version control system.55 It later turned outthat Codeville probably independently inventeda merging method very similar to what Bit-Keeper uses internally.

Instead of comparing the to-be-merged revi-sions to a base revision, the algorithm directlyperforms a two-way merge between the two re-visions and then uses the weave-embedded his-tory information to decide which side of eachconflicting hunk is to be chosen.

Consider vi(r) being the number of statechanges (addition or removal; also called gen-eration count) of line i at revision r. For eachrevision, have a list of line chunks: lx, ly. Then,the weave of the file is iterated line by line:

• if a line is not in either of the revisions,nothing is done

• if a line is found to be only in one of therevisions, it is appended to appropriate linechunk list lx or ly and revision precedenceflag px is set if vi(x) > vi(y), or py if vi(y) >vi(x) (obviously, vi(x) 6= vi(y))

• if a line is found to be in both revisions,either:

1. px and py are unset and this is part ofcommon lines block and appended tothe output

2. only px or py is set and this is non-conflicting change (then, lx or ly re-spectively is appended)

3. both are set and a conflict block fromlx and ly chunks is generated; in thelatter two cases, lx ← ly ← ∅ andpx ← py ← 0.

55The system is not otherwise covered here since it hasnever seen wider usage and is not developed anymore; itcan be probably said that its main purpose ended up tobe to research various merging algorithms.

Page 18: Current Concepts in Version Control Systems - arXiv · 2018. 3. 14. · Current Concepts in Version Control Systems Petr Baudi s 2009-09-11 Abstract We give the reader a comprehensive

7 FUTURE RESEARCH 18

The advantage of this algorithm is the absenceof the need to walk the revision graph at all(provided that the system uses the right stor-age format, which can be difficult while keep-ing the desirable append-only properties of themost common ones), simplicity and handling ofmost of the problematic edge cases of the three-way merges. Also, this algorithm handles cherry-picking naturally well compared to three-waymerge.

The disadvantage is a requirement for theweave structure potentially expensive to em-ulate if the system uses different storagemethod,56 and somewhat controversial usage ofthe vi criterion — it is not immediately obviousthat lines with more state changes should alwayswin against these with less. Due to the funda-mental difference in operation to the much morewidely used three-way merge, in cases it choosesdifferent result its operation may be confusing tothe users.

6.5 Mark Merge

The mark merge (or *-merge) algorithm [35] issomewhat unusual. First, all the other mergetechniques described here are designed to mergevectors (file content), while this one is a scalarmerge algorithm (for example for directoryitems — the executability flag or file name whenmerging renames). Second, mark merge has aprecisely specified user model and formal proofthat the algorithm matches the user model.

The aim of mark merge is to define the mostsensible way of merging scalar values while giv-ing foremost priority to explicit user choices.First, a mark means explicit decision by the useron the scalar value; marked revision is such thatuser explicitly set the value in this revision. Letus have version r, then ∗(r) shall be set of min-imal marked ancestors of r — that is, takinggraph of r ancestry and reducing it to markedrevisions, set of revisions with no descendants.Then, when merging versions x and y with val-ues vx and vy:

• if vx = vy, the value shall be also the mergeresult

• if all (x)∗ revisions are ancestors of y, vyshall be merge result

56Vesta currently uses this merge method and con-structs weaves on-demand; scalability of this method wasnot studied, however. [51]

• if all (y)∗ revisions are ancestors of x, vxshall be merge result

• otherwise, throw a conflict

For detailed formal user model and proof ofcorrectness please refer to [35].

6.6 Patch Algebra

In Darcs (3.7), a particularly elegant formal sys-tem patch algebra has been developed for merg-ing two sets of patches in a changeset-orientedversion control system. [20]

The main idea of patch algebra is to pro-vide natural and formally-backed semantics formerging within such a system.57 All patches aredefined to have a set of dependencies on otherpatches58, inversions, and independent patchesare allowed to commute on the stack. Whenmerging two sets, independent patches are ap-propriately commuted (sometimes requiring useof inversion patches to work around fuzzy appli-cations) and for conflicting patches, user reso-lution is requested; the merge operation is thenrecorded in a special merger patch.59

Given the way the to-be-merged mergerpatches are unwound, it is believed that themerge behaviour with regard to the pathologi-cal cases like criss-cross merge are similar to therecursive merge algorithm or better, since Darcscan make use even of intermediate changes be-tween the two merged revisions and their basethat the three-way merge approach cannot see.[56] [8] However, we are not aware of any workformally analysing and rigorously comparing theproperties of Darcs merges with the other ap-proaches.

7 Future Research

On the technical side, some of the existingsystems are not portable enough to be usable

57But please note that while Darcs makes an attemptfor formal description, it is still lacking in many aspectsboth regarding precise definitions and many proofs miss-ing. We must however mention a relatively little-knowntheoretical work [2] inspired by Darcs formalism and at-tempting to rebuild it on sound foundations.

58The dependencies are either logical as explicitlyspecified by the user, or syntactic — a patch is dependingon all patches lines of which it touches — as autodetectedby the system.

59The actual mechanisms used to preserve all the de-sirable properties of the patch stack are rather compli-cated — please refer to [20] for full description.

Page 19: Current Concepts in Version Control Systems - arXiv · 2018. 3. 14. · Current Concepts in Version Control Systems Petr Baudi s 2009-09-11 Abstract We give the reader a comprehensive

7 FUTURE RESEARCH 19

equally well in different operating systems (es-pecially with regard to Microsoft Windows vs.Linux and other UNIX systems) and the userinterfaces are still lacking in many aspects, aswell as documentation.

Regarding the theoretical side, we believethree main areas need research most urgently— content movement tracking, management ofthird party changes and more precise formalspecification.

7.1 Fine-grained Content Move-ment Tracking

More fine-grained tracking of changes is requiredto improve merging algorithms, accounting forchanges and following history of arbitrary partsof content; as noted in 2.5, some of the systemsprovide way to follow file history across renames,Git introduces the pickaxe mechanism, but bet-ter infrastructure is required for practical usageespecially during merging and allowing visualrepresentation and usage.

7.2 Third-party Changes Tracking

The current distributed architectures still arenot flexible enough to accomodate for fully dis-tributed development in open-source projects.When a third-party developer publishes somechanges and the upstream developers want tomerge it only with certain reservations, thethird-party developer can either do the requiredchanges on top of his previous changes, leav-ing the history in a certain degree of disar-ray and making his proposed chances increas-ingly harder to review, or rewrite the history ofhis changes, which creates problems for otherswho have already accessed the published versionsince the rewrite essentially created a fresh newbranch unrelated to the older one.

Similar problems arise when the third-partychanges need to be ported to new upstream ver-sion, or when a distributor wants to maintainseries of changes in their version of the prod-uct; using the traditional merge mechanisms, itquickly becomes difficult to track the changesagainst upstream cleanly as the ability to gen-erate a diff of single change against latest up-stream version can be crucial.

Thus, many people make use of the dis-tributed nature by only maintaining the changeslocally and still using patches for changes inter-change; or, they use the version control system

for tracking upstream and some special frontendon top of that60 for tracking their changes inform of stack — this has user interface issuesand it is problematic to publish the repositoriessince the history is changing constantly. Anotheralternative (often used by open-source softwaredistributors when maintaining their packages) isto version-control the patches in their diff formdirectly.

Using patches for changes interchange canhave some benefits with regard e.g. to code re-view practices, but frequently it is merely extraoverhead and it incurs extra load on the third-party developers. A new design that would allowseamless external changes tracking needs to bedeveloped.

7.2.1 TopGit

We have created TopGit [66], a layer over Gitthat aims to solve the task of maintaining third-party patches properly, allowing for fully dis-tributed development and proper history track-ing of all the changes.

We start off from the concept of topicbranches, popular Git technique of having a spe-cialized branch for development of each indepen-dent change, then merging them all together;this is merely a “design pattern”, needing nospecial tool support. However, we extend this intwo ways:

1. We create a directed acyclic graph from thetopic branches,61 giving each branch a listof dependencies (other branches).

2. We store metadata for each topic branchwithin its file tree: the branch descrip-tion and authorship information within/.topmsg file and the list of branches it de-pends on inside /.topdeps.62

The topic branches can depend on other topicbranches, but also on regular Git branches. Typ-ically, in the main set of third-party patches,each patch will have its own TopGit branchdepending on a Git branch tracking the up-stream development, then possibly some third-party patches will have dependencies on someof the TopGit branches if they further extend

60E.g. mq for Mercurial, StGIT for Git61Thus, conceptually one level higher than the DAG

of the commits.62These two files are excluded from merges of depen-

dencies.

Page 20: Current Concepts in Version Control Systems - arXiv · 2018. 3. 14. · Current Concepts in Version Control Systems Petr Baudi s 2009-09-11 Abstract We give the reader a comprehensive

8 CONCLUSION 20

the third-party changes, and at the top of thegraph will be usually at least one so-called stag-ing branch that introduces no changes on its own(as opposed to content branches) but ties all thethird-party patches together by listing them allin its dependencies.

TopGit provides tools for creating topicbranches, querying them, pulling topic branchupdates and merging them with local work, butmost importantly a command for updating thebranches, which recursively updates all the de-pendencies of a branch, then merges them in.

The project reached basic practical usabilityand at the time of this writing is used for main-tenance of some Debian packages, but its userinterface leaves a lot to be desired and some ofthe functionality is not fully implemented.

Especially problematic is the task of remov-ing a dependency of a patch in a way that pre-serves the history, makes it possible to easily re-add the dependency later and does not spreadthe dependency removal to further dependingpatches that also have the removed patch as adependency. [67] Another problem is inventinga user interface that makes history explorationeasy, since traditional commit browsers presentexceedingly complex trees.

7.3 Formal Analysis

Precise semantics of current version control sys-tems are defined only loosely and the lack offormal analysis is visible especially in the areaof merging algorithms. In order to get a trulyreliable and dependable system with regards tomerging “wild” branches exhibiting pathologi-cal behaviour, the precise semantics of variousmerges should be specified formally; many of thecurrently widely used merging methods lack de-tailed formal analysis and their behaviour in cor-ner cases is only guessed.

There has been ongoing development in thisarea [20] [2], but it appears to be mostly stallednow as the recursive three-way merge and sim-ilar algorithms appear to work well enough incommon cases and thus there is not much incen-tive to formally describe their behaviour in cor-ner cases. However, we believe that such a for-mal foundation would also help development ofnewer changeset-oriented systems, in turn mak-ing progress with the third-party changes track-ing problem.

8 Conclusion

We have gone through the most important con-cepts, algorithms and structures used in currentversion control systems, and put them in thecontext of practical usage; we hope this gave thereader a comprehensive idea about the area andwill help to put his further research in concretetopics into the broad context.

Among the open source projects, the new gen-eration of distributed version control systemshas registered massive uptake; while there arestill projects using CVS and many projects areusing or even migrating to SVN, large portion ofthe most high-profile open source projects withmany commits per day and code base of mil-lions of lines of code [43] is using one of thedistributed systems described above.

The wide adoption and scale of use clearlyindicates that the current systems have reachedappropriate scalability and reliability levels. Thehigh number of user interface tools in devel-opment also makes it more comfortable to usethese systems and smoothens up the learningcurve that has been traditionally very steep fordistributed version control systems. However, asdescribed in section 7, many desirable featuresare still unimplemented and support for someworkflows is still very lacking.

8.1 Acknowledgements

This article is based on Bachelor thesis I wroteat the Faculty of Math and Physics, CharlesUniversity. I would like to thank my thesisadvisor Martin Mares for guidance and BramCohen, Piet Delport, Junio C. Hamano, GregHudson, Uwe Kleine-Konig, Tom Lord, MattMackall, Larry McVoy, Kenneth C. Schalk,Nathaniel Smith, Pavel Surynek and ZookoWilcox–O’Hearn for their review and helpful in-put.

References

[1] Smith N.: 3-way merge considered harmful,[email protected], 2005-05-01.

[2] Loh A., Swierstra W., Leijen D.: APrincipled Approach to Version Control,http://people.cs.uu.nl/andres/

VersionControl.html

Page 21: Current Concepts in Version Control Systems - arXiv · 2018. 3. 14. · Current Concepts in Version Control Systems Petr Baudi s 2009-09-11 Abstract We give the reader a comprehensive

REFERENCES 21

[3] Asklund U., Bendix L.: A study ofconfiguration management in open sourcesoftware projects, Software, IEEEProceedings 149 (2002) 40–46.

[4] Ziv J., Lepmep A.: A universal algorithmfor data compression, IEEE Transactionson Information Theory, 23(3):337-343,1977.

[5] Henson V.: An Analysis ofCompare-by-hash, USENIX Proceedings ofHotOS IX (2003).

[6] Stiegler M.: An Introduction to PetnameSystems, http://www.skyhunter.com/marcs/petnames/IntroPetNames.html

[7] Myers E.: An O(ND) Difference Algorithmand its Variations, Algorithmica 1 (1986)251–266.

[8] Wilcox-O’Hearn Z. B.: badmerge –concrete version with good semantics,https://zooko.com/badmerge/

concrete-good-semantics.html

[9] Bentley A., Pool M., et al.: Bazaar,http://www.bazaar-vcs.org/

[10] Cohen B., et al.: Bazaar Source Code,bzrlib/patiencediff.py,http://www.bazaar-vcs.org/Download

[11] McVoy L., et al.: BitKeeper,http://www.bitkeeper.com/

[12] Korn M., et al.: Bzr Virtual Filesystem,https://launchpad.net/bzr-fs/

[13] Bentley A., Pool M., et al.: Bzr WeaveFormat, http://bazaar-vcs.org/BzrWeaveFormat

[14] Black J.: Compare-by-Hash: A ReasonedAnalysis, USENIX Annual TechnicalConference–Systems and ExperienceTrack, USENIX 2006, Boston, 2006.

[15] Dart S.: Concepts in configurationmanagement systems, Proceedings of the3rd international workshop on Softwareconfiguration management, Trondheim,1991, ISBN 0-897914-429-5.

[16] Torvalds L.: Concerning Git’s PackingHeuristics, http://www.kernel.org/pub/software/scm/git/docs/technical/

pack-heuristics.txt

[17] Grune D., Berliner B., et al.: CVS,http://www.nongnu.org/cvs/

[18] Mansfield D., et al.: CVSps,http://www.cobite.com/cvsps/

[19] Roundy D.: Darcs, http://darcs.net/

[20] Roundy D.: Darcs manual,http://darcs.net/manual/

[21] Darcs Community: Darcs Wiki: ConflictsFAQ, http://wiki.darcs.net/DarcsWiki/ConflictsFAQ

[22] Tridgell A.: Efficient Algorithms forSorting and Synchronization, PhD thesis,Australian National University, 1999.

[23] Rabin M. O.: Fingerprinting By RandomPolynomials, Technical Report TR-15-81,Center for Research in ComputerTechnology, Harvard University, 1981.

[24] Torvalds L., Hamano J. C., et al.: Git,http://git.or.cz/

[25] Hamano J. C., et al.: Git — A StupidContent Tracker, OLS 2006 ProceedingsVolume 1, Ottawa, 2006.

[26] Greaves D., Hamano J. C., Torvalds L., etal.: Git Documentation,http://www.kernel.org/pub/software/

scm/git/docs/

[27] Kuivinen F., et al.: Git Source Code,builtin-merge-recursive.c, http://www.kernel.org/pub/software/scm/git/

[28] Pitre N., et al.: Git Source Code,diff-delta.c, http://www.kernel.org/pub/software/scm/git/

[29] Hamano J. C., et al.: Git Source Code,diffcore-delta.c, http://www.kernel.org/pub/software/scm/git/

[30] Torvalds L., et al.: Git Source Code,pack-objects.c, http://www.kernel.org/pub/software/scm/git/

[31] Blank M., et al.: gitfs, http://www.sfgoth.com/~mitch/linux/gitfs/

[32] Lord T., et al.: GNU Arch, http://www.gnu.org/software/gnu-arch/

[33] Eggert P., et al.: GNU Diffutils, http://www.gnu.org/software/diffutils/

Page 22: Current Concepts in Version Control Systems - arXiv · 2018. 3. 14. · Current Concepts in Version Control Systems Petr Baudi s 2009-09-11 Abstract We give the reader a comprehensive

REFERENCES 22

[34] Henson V., Henderson R., et al.:Guidelines for Using Compare-by-Hash,http://infohost.nmt.edu/~val/

review/hash2.html

[35] Brownawell T., Smith N.: Improvementsto *-merge, [email protected],2005-08-30.

[36] Aldous D., Diaconis P., et al.: LongestIncreasing Subsequences: From PatienceSorting to the Baik-Dieft-JohanssonTheorem, Bull. Amer. Math. Soc. 36(1999) 413—432.

[37] Mackall M., et al.: Mercurial,http://www.selenic.com/mercurial/

[38] Mackall M., et al.: Mercurial Source Code,mercurial/revlog.py, http://www.selenic.com/mercurial/release/

[39] Hoare G., Smith N., et al.: Monotone,http://monotone.ca/

[40] Hoare G., et al.: MonotoneDocumentation: Hash Integrity,http://monotone.ca/docs/

Hash-Integrity.html

[41] Hoare G., et al.: Monotone FAQ,http://www.venge.net/mtn-wiki/FAQ

[42] Andrieu O., et al.: Monotone-viz, http://oandrieu.nerim.net/monotone-viz/

[43] Wheeler D.: More Than a Gigabuck:Estimating GNU/Linux’s Size,http://www.dwheeler.com/sloc

[44] Wilcox–O’Hearn Z. B.: Names:Decentralized, Secure, Human-Meaningful:Choose Two,https://zooko.com/distnames.html

[45] Hudson G.: Notes on keeping versionhistories of files, http://web.mit.edu/ghudson/thoughts/file-versioning

[46] Shapiro J. S., et al.: OpenCM,http://www.opencm.org/

[47] Collins R., et al.: Patch Queue Manager,http:

//bazaar-vcs.org/PatchQueueManager

[48] Ratcliff J. W., Metzener D. E., et al.:Pattern Matching: The Gestalt Approach,Dr. Dobb’s Journal, July 1988, p. 46.

[49] Seiwald C., et al.: Perforce,http://www.perforce.com/

[50] Hudson G., Personal conversation onSubversion

[51] Schalk K. C., Personal conversation onVesta

[52] IEEE: POSIX 1003.1 2004 Edition, 2004.

[53] Python Community: Python difflib —helpers for producing deltas,http://docs.python.org/lib/

module-difflib.html

[54] Tichy W.: RCS: a system for versioncontrol, Software—Practice andExperience, 15 (7): 637–654, July 1985.

[55] Research Community: RevCtrl Wiki,http://revctrl.org/

[56] Smith N., Baudis P., Cohen B. et al.:RevCtrl Wiki: Criss Cross Merge,http://revctrl.org/CrissCrossMerge

[57] National Institute of Standards andTechnology: Secure Hash Standard (SHS),FIPS180-02, August 2002.

[58] E-Soft Inc.: Security Space Survey ofPublic Subversion DAV Servers,http://subversion.tigris.org/

svn-dav-securityspace-survey.html

[59] Blandy J., Fogel K., et al.: Subversion,http://subversion.tigris.org/

[60] Berlin D.: SVNDIFF version 1,[email protected], 2005-10-21

[61] Cohen B.: The diff problem has beensolved, http://bramcohen.livejournal.com/37690.html

[62] Cohen B.: The new Codeville mergealgorithm, [email protected],2005-05-05.

[63] Rochkind M. J.: The Source Code ControlSystem, IEEE Transactions on SoftwareEngineering SE-1:4 (Dec. 1975), 364–370.

[64] Korn D., et al.: The VCDIFF GenericDifferencing and Compression DataFormat, RFC3284.

Page 23: Current Concepts in Version Control Systems - arXiv · 2018. 3. 14. · Current Concepts in Version Control Systems Petr Baudi s 2009-09-11 Abstract We give the reader a comprehensive

REFERENCES 23

[65] Heydon A., Levin R., Mann T., Yu Y.:The Vesta Approach to SoftwareConfiguration Management, Compaq SRCResearch Report 168, 2001.

[66] Baudis P., et al.: TopGit,http://repo.or.cz/w/topgit.git

[67] Baudis P.: TopGit: how to deal withupstream inclusion, [email protected],2008-09-14.

[68] Mackall M.: Towards a Better SCM:Revlog and Mercurial, OLS 2006Proceedings Volume 2, Ottawa, 2006.

[69] Davison W.: Unified context diff tools,comp.source.misc Volume 14, 1990.

[70] Conradi R., Westfechtel B.: Versionmodels for software configurationmanagement, ACM Computing Surveys30 (1998) 232–282.

[71] McCoy K.: VMS File System Internals,Digital Press, 1990, ISBN 1-55558-056-4.

[72] MacDonald J.: XDelta,http://xdelta.org/

[73] MacDonald J.: Versioned File Archiving,Compression, and Distribution, UCBerkeley, 1999.

[74] Trendafilov D., Memon N., Suel T.: zdelta:An Efficient Delta Compression Tool,Technical Report TR-CIS-2002-02,Polytechnic University Brooklyn, 2002.