                       Meta-CVS --- A Directory Structure
                             Versioning Layer Over 
                        The Concurrent Versions System.

                                  Kaz Kylheku
                     Originally published January 25, 2002
                      Edited and Revised February 3, 2004


           "Directory versioning is a Hard Problem" -- Subversion FAQ

              "Any problem in computer science can be solved with
                 another layer of indirection" -- David Wheeler


  Abstract

    This is Meta-CVS, Meta-(Concurrent Versions System), a front end for
    CVS.  It supports the concurrent and independent versioning of
    files, as well as a directory structure, by several people.  I have
    been using it for a few weeks now, mostly just to version the
    Meta-CVS sources themselves.  It uses the cvs program in such a way that
    you can not only version the file contents, but you can move and
    rename files.  These changes are committed to the repository, and
    can be picked up by an update, which will incorporate them by
    rearranging the working copy accordingly. There can be conflicting
    parallel changes to the structure, which can be resolved like any
    other conflict.  It is all Lisp.


  Contents

    1. Introduction
    2. Data Representation Overview
        2.1 File Mapping Example
        2.2 Symbolic Links
        2.3 Synchronization
        2.4 Partial Sandboxes
    3. Surprising Advantages
        3.1 File Adding Conflicts
        3.2 File Removal Conflicts
        3.3 Diffing and Patching


  1. Introduction

      The software known as CVS has been in existence since the year
      1986, when its first version, consisting of shell scripts acting
      as a front end to RCS commands, was posted to Usenet by Dick
      Grune.  Over the next fifteen years, CVS was turned into a C
      program, enhanced and debugged. But in its present form, version
      1.11, it still has annoying quirks and some serious limitations.
      
      One of the biggest limitations of CVS is that it does not treat
      the directory structure of a module as a versioned object. Meta-CVS
      solves this problem not by intruding in any way into the
      well-debugged and time-tested CVS code, but by introducing a layer
      of indirection.  Meta-CVS retains the fundamental capabilities of
      CVS: the ability to branch and merge, to work in parallel, to work
      over a variety of network transports and so on. CVS worked as a
      front end for RCS; similarly, Meta-CVS is a front end for CVS.

      It turns out that Meta-CVS solves a few other infelicities in CVS
      as well. A few tricky scenarios that cause grief in CVS are no
      longer problems in Meta-CVS, such as: two developers concurrently
      adding the same file, or one developer removing a file that
      another is working on.

      Meta-CVS works by creating a special representation of the
      versioned file tree, and this special representation is what is
      stored in CVS. Thus the naive direct mapping between the versioned
      tree and the tree in the repository is avoided.

      The aim of this paper is to document this simple representation
      and explain how it supports the directory versioning operation.


  2. Data Representation Overview

      In order to obtain, from CVS, the ability to perform parallel
      version control over any object, it is necessary to represent that
      object as a text file. This is a given. CVS can effectively handle
      only text input in its merging and conflict identification
      algorithms. A critical non-functional constraint in the
      requirements of Meta-CVS is that CVS is not to be modified in any
      way; nobody should have to install new CVS code on a client or
      server machine to use Meta-CVS. Moreover, the CVS code is fragile C
      that has been debugged for over a decade (and counting).

      To treat the file structure as a versioned entity, therefore, it
      is necessary to represent it as a text file. What structure should
      that text file have?

      Firstly, it would be highly desirable if small changes, such as
      renaming a few files, gave rise to small differences. Moreover,
      a single change should affect at most one or two lines in the
      text file.  This property would allow for parallel changes with
      minimal conflicts.  The text file representation should also be
      human readable and editable, because humans will have to resolve
      conflicts in it.

      Secondly, a file must somehow retain its identity and CVS history
      when its path name changes. This means that we must never change
      the name of the file, at least not the name which is known to CVS.

      Meta-CVS represents the file structure of a project as a simple
      entity called a ``file mapping''. The file mapping associates path
      names with a flat database of files.  Both the mapping and the
      files are stored in CVS. The files have machine-generated names;
      only through the mapping are they given actual names as they
      appear to the users. The files, under the names by which CVS
      knows them, are called ``F- files''.

      Meta-CVS manipulates the mapping as a simple data structure in the
      Lisp language. Lisp has a built-in parser and formatter for
      reading a printed representation of a list object and producing a
      printed representation. Thus the text file format for the Meta-CVS
      mapping is simply a file containing a Lisp association list, with
      special care taken to print each element of the association on a
      separate line of text, and maintaining a consistently sorted order.
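
      The layout just described can be illustrated with a minimal
      sketch in Python (Meta-CVS itself is written in Lisp). The
      function name format_map, the shortened IDs, the choice to sort
      by F- name, and the one-line-per-entry whitespace are all
      assumptions made for illustration; string escaping is ignored.
      The sketch shows that renaming one file changes exactly one
      line of the printed mapping.

```python
def format_map(entries):
    """Print a mapping as a Lisp-style association list, one entry per line.

    `entries` is a list of (f_name, path) pairs. Sorting by f_name
    (an assumption) keeps the line order stable when paths change.
    """
    lines = ['("%s" "%s")' % (f_name, path) for f_name, path in sorted(entries)]
    return "(" + "\n ".join(lines) + ")\n"


# Hypothetical, shortened IDs for illustration only.
old = [("MCVS/F-123D", "README"), ("MCVS/F-1C43.c", "src/foo.c")]
new = [("MCVS/F-123D", "README"), ("MCVS/F-1C43.c", "src/bar.c")]

old_lines = format_map(old).splitlines()
new_lines = format_map(new).splitlines()

# Indices of the lines that differ between the two printed mappings.
changed = [i for i, (a, b) in enumerate(zip(old_lines, new_lines)) if a != b]
print(changed)
```

      Because each entry occupies its own line and the order does not
      depend on the path, two people renaming different files touch
      different lines, which a line-based merge can combine.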

      The separation of the directory structure from a flat file database
      is nothing new; separation of a directory service from a flat file
      service is a common theme in the design of filesystems.
      
      Meta-CVS imitates the UNIX filesystem to the extent that a
      restore command hunts down ``unlinked'' files and places them
      into a lost+found directory under cryptic names derived from
      their IDs, which behave analogously to inode numbers!
      In UNIX, a file can remain in use while it is deleted from
      the directory structure, and so it is in Meta-CVS. A file
      removal in Meta-CVS is a non-destructive unlinking from the directory
      structure.

  2.1 File Mapping Example

      Suppose that some project 'foo' consists of these files:

        foo/README
        foo/inc/foo.h
        foo/src/Makefile
        foo/src/foo.c
      
      What does a Meta-CVS representation look like? This is best
      understood in terms of the working copy checked out from CVS via
      Meta-CVS, which contains these things:

        foo/MCVS/CVS/Entries
        foo/MCVS/CVS/... other CVS metadata ...

        foo/MCVS/F-123D61C8FE942733281D2B08C15CD438
        foo/MCVS/F-156CAB88D4EEE703E8C4B4146B5094E2.h
        foo/MCVS/F-15EA9689ACF749C314CE6FC5255DC4B0
        foo/MCVS/F-1C43C940D8745CAA78752C1206316B55.c
        foo/MCVS/MAP
        foo/MCVS/MAP-LOCAL

        foo/README
        foo/inc/foo.h
        foo/src/Makefile
        foo/src/foo.c    

      There is a subdirectory called MCVS, which contains a CVS
      subdirectory. This MCVS subdirectory is in fact the CVS
      ``sandbox''. Everything else under foo constitutes the working
      files. Thus every Meta-CVS working copy is just an ordinary file tree,
      except that the top level directory contains an MCVS subdirectory
      with interesting contents.

      What are these files under MCVS? There are some files with cryptic
      names like F-123D...438. Then there are two files MAP and
      MAP-LOCAL. 

      Firstly, it should be understood that the F- files and MAP are
      versioned in CVS. On the other hand, MAP-LOCAL is a file that is
      not known to CVS, but important to the functioning of Meta-CVS.

      The four F- files are the actual CVS representations of
      foo/README, foo/inc/foo.h, foo/src/Makefile and foo/src/foo.c.

      What establishes the relationship between the F- names and the
      human readable paths is the association list in the MAP file,
      which, in the early versions of Meta-CVS, looked like this:

        (("MCVS/F-123D61C8FE942733281D2B08C15CD438"
          "README")
         ("MCVS/F-156CAB88D4EEE703E8C4B4146B5094E2.h"
          "inc/foo.h")
         ("MCVS/F-15EA9689ACF749C314CE6FC5255DC4B0" 
          "src/Makefile")
         ("MCVS/F-1C43C940D8745CAA78752C1206316B55.c" 
          "src/foo.c"))

      The MAP-LOCAL file, upon checkout, is simply an exact copy of MAP.
      The purpose of MAP-LOCAL is to keep track of the actual mapping
      that exists in the user's checked out copy. When an update
      operation is performed, it may incorporate changes from the
      repository into MAP, causing the MAP to no longer reflect the
      local file structure. In fact MAP can at that point contain
      unresolved conflicts, so that it is not usable by Meta-CVS,
      requiring manual intervention. The MAP-LOCAL copy, however,
      remains untouched and consistent.

      Because Meta-CVS maintains a local copy of the mapping, the
      Meta-CVS update operation can compute the differences between the
      new mapping coming from the repository and the local mapping.
      These differences can then be translated into
      filesystem-rearranging actions that change the shape of the
      working copy to bring it up to date. Then MAP and MAP-LOCAL are
      once again identical.
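
      The comparison of the two mappings can be sketched as follows,
      in Python rather than the Lisp of Meta-CVS. The function name
      plan_rearrangement, the action names and the shortened IDs are
      hypothetical, and each mapping is simplified to a dictionary
      from F- name to path. A real implementation must also order the
      actions safely (for instance when two files swap names), which
      this sketch does not attempt.

```python
def plan_rearrangement(local_map, new_map):
    """Compare two mappings (F- name -> path) and list the needed actions."""
    actions = []
    for f_name, old_path in sorted(local_map.items()):
        if f_name not in new_map:
            # The file left the mapping: remove it from the working tree.
            actions.append(("unlink", old_path))
        elif new_map[f_name] != old_path:
            # Same file, different path: it was moved or renamed.
            actions.append(("move", old_path, new_map[f_name]))
    for f_name, new_path in sorted(new_map.items()):
        if f_name not in local_map:
            # A file that is new to this working copy.
            actions.append(("link", f_name, new_path))
    return actions


# Hypothetical, shortened IDs for illustration only.
local_map = {"F-AAAA": "README", "F-BBBB.c": "src/foo.c", "F-CCCC": "src/Makefile"}
new_map = {"F-AAAA": "README", "F-BBBB.c": "lib/foo.c", "F-DDDD.h": "inc/foo.h"}

for action in plan_rearrangement(local_map, new_map):
    print(action)
```

      Keying the comparison on the F- name rather than on the path is
      what lets a rename show up as a move instead of an unrelated
      removal and addition.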

      This rearranging is the heart of the Meta-CVS system. Everything
      else is largely just manipulations of the mappings. For example,
      renaming a file is simple. Open up MCVS/MAP in a text editor, and
      change a path (taking care not to create a duplicate, or otherwise
      corrupt the mapping). Then save it and run mcvs update.
      Meta-CVS will propagate the change you made by physically
      relocating that file. If you like what you have done, simply
      commit. You can commit at the CVS level within the MCVS
      directory. But of course, a Meta-CVS file renaming operation is
      provided, and so is a commit operation, which in addition to
      running CVS also ensures that the F- files are properly
      synchronized with their unfolded counterparts.


  2.2 Symbolic Links

      In August 2002, support for symbolic links was added to Meta-CVS,
      and the format of the mapping became more complicated to reflect
      that. The syntax was extended to allow for different kinds of
      entries, as well as future extensibility. Each entry now has
      a Lisp keyword symbol in its first position which identifies 
      its type. The rest of the list specifies the type-specific properties.
      Currently, there are two types of entries: :FILE and :SYMLINK.

      Right around that time, support for versioned property lists was
      also added.

      The previous section's example now looks like this:

        ((:FILE
          "MCVS/F-123D61C8FE942733281D2B08C15CD438" 
          "README")
         (:FILE
          "MCVS/F-156CAB88D4EEE703E8C4B4146B5094E2.h" 
          "inc/foo.h")
         (:FILE
          "MCVS/F-15EA9689ACF749C314CE6FC5255DC4B0" 
          "src/Makefile")
         (:FILE
          "MCVS/F-1C43C940D8745CAA78752C1206316B55.c" 
          "src/foo.c"))

      Executable files have additional material after the path name.
      Symbolic links look like this:

        (:SYMLINK
         "S-DF03GA1200347CF1935509371F8C1765"
         "src/foo.c"
         "../foo.c")

      which asserts the existence of a symbolic link called src/foo.c
      whose target is ../foo.c.

      Both currently supported map entries have an ID string in the second
      position, and an object path in the third position. The syntax
      varies after that.

      Incidentally, Meta-CVS continues to recognize and parse the old
      format. Once the mapping object is read from the MAP file, the
      abstract syntax tree is examined to determine whether it conforms
      to the old syntax or the new. The old syntax is no longer in use;
      it survives only in old versions of MAP stored in the repository
      of Meta-CVS itself.
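
      A reader that accepts both syntaxes might look roughly like the
      following Python sketch. It assumes the MAP file has already
      been parsed into nested lists, with Lisp keywords represented
      as strings such as ":FILE". The function name normalize_entry,
      the dictionary layout and the per-entry dispatch are
      illustrative assumptions, not a description of the Meta-CVS
      code.

```python
def normalize_entry(entry):
    """Return a dict describing one map entry, accepting old and new syntax.

    Old syntax: [f_name, path]
    New syntax: [":FILE", f_name, path, ...] or [":SYMLINK", id, path, target]
    """
    head = entry[0]
    if head == ":FILE":
        # Anything after the path (e.g. executable info) is ignored here.
        return {"type": "file", "id": entry[1], "path": entry[2]}
    if head == ":SYMLINK":
        return {"type": "symlink", "id": entry[1], "path": entry[2],
                "target": entry[3]}
    if len(entry) == 2:
        # Old format: no keyword, just the F- name and the path.
        return {"type": "file", "id": head, "path": entry[1]}
    raise ValueError("unrecognized map entry: %r" % (entry,))


# Hypothetical, shortened IDs for illustration only.
print(normalize_entry(["MCVS/F-123D", "README"]))
print(normalize_entry([":SYMLINK", "S-DF03", "src/foo.c", "../foo.c"]))
```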


  2.3 Synchronization

      The next problem to tackle is how to establish the correspondence
      between the F- files and the working files. Meta-CVS does this in a
      platform-specific way, namely by relying on Unix hard links.

      When Meta-CVS checks out a sandbox, it creates hard links, so that
      an F- file and its corresponding working file are in fact the same
      filesystem object. Thus ``unpacking'' the F- files through the
      mapping does not require the mass duplication of file data,
      only the creation of directories and links.

      The problem is that some operations ``break'' this hard link
      connection by unlinking a file and overwriting it with a new one
      that has the same name. The CVS update operation does this, for
      instance. If cvs up creates a new F- file, that file is no longer
      connected with the working file.

      To keep the two synchronized, Meta-CVS performs a synchronization
      operation. This operation sweeps over the file map, and repairs
      any broken links. If either of the two files is missing, then a
      link is created. If both are present, but are distinct objects,
      then the one with the most recent modification timestamp
      supersedes; the other is unlinked and replaced with a link to the
      newer one.
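
      For a single pair of files, the repair rule above can be
      sketched in Python using the standard os module. The function
      name synchronize and its return strings are hypothetical, and
      letting the F- file win when the timestamps are equal is an
      assumption; the text above does not say how ties are broken.

```python
import os


def synchronize(f_file, work_file):
    """Make f_file and work_file two names for the same filesystem object."""
    have_f = os.path.exists(f_file)
    have_w = os.path.exists(work_file)
    if not have_f and not have_w:
        return "missing"
    if have_f and not have_w:
        # Create any missing directories, then link the working name.
        os.makedirs(os.path.dirname(work_file) or ".", exist_ok=True)
        os.link(f_file, work_file)
        return "linked"
    if have_w and not have_f:
        os.link(work_file, f_file)
        return "linked"
    if os.path.samefile(f_file, work_file):
        return "in-sync"
    # Both exist as distinct objects: the most recently modified one wins.
    if os.path.getmtime(f_file) >= os.path.getmtime(work_file):
        newer, older = f_file, work_file
    else:
        newer, older = work_file, f_file
    os.unlink(older)
    os.link(newer, older)
    return "relinked"
```

      A full synchronization would call such a routine once for every
      file entry in the mapping.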

      A synchronization must be done before any operation which can
      cause a file to be moved, removed, or to be committed to the CVS
      repository. In all these situations, the F- files must have 
      the correct contents. 

      The Meta-CVS update operation must perform synchronization twice:
      before the CVS update to ensure that the F- files carry all of the
      local changes; then after the CVS update to make sure that any
      newly incorporated changes propagate back to the working copy.

      The current behavior of Meta-CVS is more subtle than the above
      description implies. The synchronization does not process the
      entire MAP for commands that operate only on a subtree; instead,
      only the entries corresponding to that subtree are extracted from
      the mapping and processed. Secondly, the synchronization is
      direction-sensitive.
      For instance, before a CVS commit, it makes sense to synchronize
      from the tree to the CVS sandbox, not in the opposite direction.
      Immediately after a commit, it makes sense to push in the opposite
      direction, in case CVS modified the committed files (for instance
      by altering keyword expansions).

  2.4 Partial Sandboxes

      Sometimes it is desirable to pull out just a subtree of a larger
      project from a repository. How can this be done in a version
      control system that represents the whole tree as a versioned object?
      Wanting to check out part of the tree seems roughly equivalent to wanting
      to check out half of a file.

      Meta-CVS solves this problem by supporting the concept of a partial
      sandbox. This is a checkout that has the full mapping in the CVS
      sandbox, but only one subtree unfolded into working files.
      A local file called DISPLACED is written which contains
      the relative pathname of the root of the subtree that is checked out.
      For example if the testcases/optimization subdirectory of 
      the x-compiler project is checked out, then the DISPLACED file
      contains the path testcases/optimization.

      All of the algorithms in Meta-CVS are aware of the DISPLACED path,
      and properly translate between the /abstract/ paths contained in
      the mapping and the shorter, /real/ paths in the sandbox tree.
      This translation is a no-op when there is no DISPLACED file---that
      is, when the sandbox is a full one.
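
      The translation between abstract and real paths can be sketched
      in Python as below. The function names and the file name loop.c
      are hypothetical; returning None for a path outside the
      checked-out subtree is a choice made for this sketch.

```python
import posixpath


def abstract_to_real(abstract_path, displaced=None):
    """Map a path from the mapping to a sandbox path, or None if invisible."""
    if not displaced:
        return abstract_path          # full sandbox: no translation
    prefix = displaced.rstrip("/") + "/"
    if abstract_path.startswith(prefix):
        return abstract_path[len(prefix):]
    return None                       # outside the checked-out subtree


def real_to_abstract(real_path, displaced=None):
    """Map a sandbox path back to the path recorded in the mapping."""
    if not displaced:
        return real_path
    return posixpath.join(displaced, real_path)


displaced = "testcases/optimization"
print(abstract_to_real("testcases/optimization/loop.c", displaced))
print(real_to_abstract("loop.c", displaced))
```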

      Partial sandboxes behave robustly with respect to mapping changes
      arriving from the repository. If another user commits a change
      that moves a file from some currently invisible part of the tree
      into the visible subtree, this works properly. The opposite
      direction works likewise.

      Partial sandboxes are used by the grab command to store a new
      external drop into just a subtree of the project on a particular
      branch or the trunk. To do this, the grab command's already
      contorted algorithm had to be infused with translations between
      abstract and real paths.


  3. Surprising Advantages

      The Meta-CVS representation brings with it a few advantages which
      were not immediately obvious in the design stages, but came to
      light during development. In addition to the lack of directory
      structure versioning, CVS has a few other infelicities which go
      away under Meta-CVS. Moreover, the capability to version
      directory structure brings in a new concern. Free software
      developers use patches to communicate code changes to each other.
      The traditional tools for producing and applying patches, like
      CVS, do not handle directory versioning. Meta-CVS has some answers
      to these problems.

  3.1 File Adding Conflicts

      In CVS, it can happen that two (or more) developers working on the
      same module add a file to the same directory, and all use the
      same file name. The first developer commits the file, and then
      problems occur for the subsequent developers who try to commit.
      CVS complains that the file was independently added by a second
      party, and does not allow the commit to proceed.

      In Meta-CVS, this cannot happen. Meta-CVS recognizes that if two
      people add a file, it is not the same file. Names do not determine
      equivalence, semantics does! When a file is added to Meta-CVS, an
      F- file is created to represent it. That F- file name contains a
      randomly chosen 128-bit number, expressed in hexadecimal.  It is
      extremely unlikely that two such numbers will collide, so in
      practice, one will ``never'' see the aforementioned CVS error
      message.
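
      Generating such a name can be sketched in Python as follows.
      The function name new_f_name is hypothetical, and carrying the
      original file's suffix over to the F- name is inferred from the
      example names shown earlier (F-...h, F-...c); how Meta-CVS
      itself obtains its random bits is not described here.

```python
import os
import secrets


def new_f_name(path):
    """Invent an F- file name for a newly added file at `path`."""
    random_id = secrets.token_hex(16).upper()   # 16 bytes = 128 bits
    suffix = os.path.splitext(path)[1]          # e.g. ".c", or "" if none
    return "F-" + random_id + suffix


print(new_f_name("src/foo.c"))
print(new_f_name("README"))
```

      Two developers who each add src/foo.c therefore obtain two
      different F- names, so CVS never sees the same file added twice.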

      Instead, what will happen when developers choose the same path
      name for a file is that either a conflict will arise in the MAP
      file, which will have to be resolved, or else the mapping will
      contain a duplicate path name, which can be detected by Meta-CVS
      as an error which, again, the users must resolve. Each file is a
      separate object with its own version history; that two objects
      accidentally map to the same name is a minor, correctable problem.
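
      Detecting a duplicate path name in a merged mapping is a simple
      check, sketched here in Python. The function name
      find_duplicate_paths and the shortened IDs are hypothetical, and
      the mapping is simplified to a list of (ID, path) pairs.

```python
from collections import defaultdict


def find_duplicate_paths(entries):
    """Return {path: [ids...]} for every path claimed by more than one ID."""
    by_path = defaultdict(list)
    for obj_id, path in entries:
        by_path[path].append(obj_id)
    return {p: ids for p, ids in by_path.items() if len(ids) > 1}


# Two developers independently added src/util.c (hypothetical IDs).
entries = [
    ("F-AAAA", "README"),
    ("F-BBBB.c", "src/util.c"),
    ("F-CCCC.c", "src/util.c"),
]
print(find_duplicate_paths(entries))
```

      Resolving the error amounts to editing one of the two paths in
      the mapping; neither file's history is affected.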

  3.2 File Removal Conflicts

      CVS does not behave very well when one developer deletes a file,
      via cvs remove, and another tries to continue committing changes.

      This is really just an instance of the classic problem of
      computing the object lifetimes, translated to the domain of
      version control.

      The cleanest solution to the problem of computing object lifetimes
      is garbage collection, which ensures that as long as an object can
      still be used, it persists, and thereafter, it is automatically
      removed when the system finds it necessary or convenient to do so.

      It turns out that Meta-CVS supports a kind of garbage collection
      concept. When a file is removed, it does not have to be subject to
      ``cvs remove''. It only has to be removed from the file mapping,
      but the F- file can remain unremoved. What this means is that the
      F- file continues to be checked out, so it occupies bandwidth and
      space. What happens if a user has outstanding changes, and
      performs a Meta-CVS update which removes the file? The link
      synchronization ensures that the outstanding changes are
      transferred to the F- file before the removal. So the changes are
      not lost!  It is possible to manually restore that F- file in the
      MAP to give it a ``new lease on life''. This is analogous to
      sifting through garbage, to salvage it by making it reachable
      again. And, of course, changes to the F- file itself can be committed to
      CVS whether or not it is reentered into the map.

      The space problem can be dealt with by a Meta-CVS ``garbage
      collection'' routine that can be invoked administratively.  This
      will sweep through the F- files, identify any which have no
      mapping, and ``cvs remove'' these.
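
      The identification step of such a sweep can be sketched in
      Python. The function name unmapped_f_files and the shortened
      IDs are hypothetical, and names are compared without the MCVS/
      prefix that appears in the MAP file; issuing the actual
      ``cvs remove'' is left out.

```python
def unmapped_f_files(sandbox_names, mapped_names):
    """Return the F- files present in the sandbox but absent from the map."""
    mapped = set(mapped_names)
    return sorted(name for name in sandbox_names
                  if name.startswith("F-") and name not in mapped)


# Hypothetical contents of an MCVS directory and of its mapping.
sandbox_names = ["CVS", "F-AAAA", "F-BBBB.c", "MAP", "MAP-LOCAL"]
mapped_names = ["F-AAAA"]
print(unmapped_f_files(sandbox_names, mapped_names))
```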

  3.3 Diffing and Patching

      Another surprising advantage of Meta-CVS is that it addresses the
      problem of distributing patches which patch the file system
      structure as well as contents.

      The F- and MAP files in fact constitute an interchange format for
      the distribution of program source which, in principle, amplifies
      the capabilities of any change management tools that are based on
      flat files.

      A developer can obtain a copy of a project in Meta-CVS form, then
      work on making changes, including the renaming of paths. These
      changes are represented in a new Meta-CVS file set. A diff is
      computed between the new and the old. Someone with a copy of the
      original can patch it, to reproduce the changes.  All that is
      needed is the Meta-CVS software to realize the rearrangements.
