How Arch Works
To understand what a device is good for, it is often helpful to take it apart and examine how it works. Here's an introduction to how Arch does what it does.
This page contains a brief dissection of Arch -- a high-level description of how Arch works. The intention is to give technically savvy readers deeper insight into Arch by helping them to develop some intuitions about how Arch is implemented.
There are two parts and, depending on your interests, you might want to skip the first. These are:
Performance Principles
Before examining the details of the Arch implementation directly, it will be helpful to name a few performance principles that shape the overall design:
Minimize Server-side Computation
Arch allows multiple users to access a single archive over a network. Consequently, although it also supports serverless, local operation, Arch must support a client-server mode of operation.
We wish Arch to scale in number of concurrent users accessing an archive as far as possible. For example, if a popular project commits an important bug-fix, there may be thousands or tens of thousands of users who want to update their local sources after that commit.
Consequently, the design of Arch aims to
minimize server-side computation
minimize input-output bandwidth requirements
The primary strategy for doing so is to
maximize the client-side share
of unavoidable computation
Disk Space is Cheap
Most revision control systems in widespread use today were conceived or designed at least a decade and in some cases two decades ago. Even some contemporary designs follow design patterns that were established that long ago.
Systems designed with that older perspective place a very high premium on disk space. Although they work hard to be fast, nevertheless they are apt to make trade-offs that consume I/O bandwidth and CPU cycles in favor of consuming disk space. Even worse, these trade-offs often add to the implementation complexity of these systems.
These days, such trade-offs are already somewhat anachronistic and, within just a few years, such trade-offs will be completely absurd. Multiple gigabyte disks can be had for under $200USD and, before long, we'll have terabyte disks in such price ranges. Storage at those scales is more than ample for revision control purposes.
The lesson?
disk space is very cheap
If the storage costs of an extravagant configuration of Arch
work out to a few 10s of dollars per programmer, per year, our
disk space usage is not excessive.
Bandwidth is Expensive
Unfortunately, there is for the forseeable future still a sharp constraint about just how much disk space we can touch for a given arch operation: bandwidth remains expensive, especially relative to disk capacity.
I/O Bandwidth costs are imposed at more than one layer. On-device cache space limits, seek times, transfer times, OS cache space limits, and CPU cache space limits, and main-memory speeds all constrain the utility of passing lots of data back and forth to disk.
i/o bandwidth is very expensive
Networks are Slow
A fast, local network connection can easily have lower latency and higher bandwidth than a local disk. Nevertheless, the expected case for most Arch operations over a network is that they pass over the Internet or a comparable network. Minimizing roundtrips and the quantity of data transfered is critical.
for Arch:
1) network bandwidth is more expensive than disk bandwidth
2) each network roundtrip may as well be a punch in the face
A Layered Architecture
And now we turn to the implementation structure of Arch itself. Arch is a layered architecture and it's easiest to understand by starting at the lowest level and working upwards.
Tree Inventories
The lowest level concept in Arch is that of a tree inventory.
A tree inventory is a listing of all of the files and directories in a tree, with two fields per file or directory. These fields are:
1. a physical name -- the location of a file relative to the root
of the source tree. For example: ./src/include/foo.h
2. an inventory id -- an identifier naming the file. This identifier remains constant even if the physical name of the file changes.
In other words, if a file is renamed, its physical name changes, but its inventory id remains the same. This allows Arch to recognize that a file has been renamed by comparing two versions of the physical name for a given inventory id. Here is a fragment of a directory listing showing physical names and inventory ids:
physical name: inventory id: links/Makefile.in i_x9034AKLJn3 links/make-links.sh.in i_lk3n8908nNx links/remove-links.sh.in i_lk3nn90lkjs
How are inventory ids associated with files? How does arch know what the inventory id of a given file is? Three mechanisms are provided:
1. inventory ids can be stored within a file For example, an id can be stored in a comment in a C source file. This mechanism is in some sense the most convenient, but is only suitable for text files, and only for text files that can tolerate such comments.
2. inventory ids can be stored in separate files The inventory id
for a file foo can be stored in ./.arch-ids/foo.id
. Special commands
must be used to rename such a file (so that the .id file is renamed
at the same time) but this mechanism can be used with any type of
file.
3. inventory ids can be omitted Users can choose to not assign an id to a given file in which case its physical name serves double-duty as its inventory id. Arch can not recognize when such a file is renamed (since the inventory id will change in the process) but for some situtations, this is an acceptable trade-off for the convenience of not having to set-up ids.
Changesets
Traditional diff
is most familiar as a way to show the differences
between two versions of a single file. Given a command line option,
diff
can be recursively applied to all of the same-named files
within a tree.
Traditional patch
is familiar as a way to take the output of
diff
, a description of changes made to a file, and apply those
changes to some different file.
The next layer of Arch takes diff and patch one step further by taking advantage of inventory ids. The arch commands mkpatch/dopatch (aka make-changeset/apply-changeset) produce and apply a whole-tree diff-like account of differences, taking into account renames, adds, and deletes. Additionally, these arch commands handle symbolic links, binary files, and file permissions.
For example, as we'll see below, when you commit a set of changes, that operation records the mkpatch output of your modified tree compared to the unmodified version you "checked out" from an archive.
Namespace and Archive
The combination of mkpatch/dopatch and inventory ids allows for a kind
of "by hand" approach to revision control in the diff/patch
style. Programmers could cooperate by exchanging and applying
changesets created by mkpatch much as they can cooperate by creating
and exchanging patch sets made by diff
.
The next layer of arch, the namespace and archive layer, essentially just automates that by hand process. It provides a mechanism for naming (with globally unique, human-readable names) the changesets programmers create, and a mechanism for publishing changesets and retrieving published changesets.
Global Revision Names
The arch namespace is organized around the concept of a development line -- a sequence of revisions ordinarilly beginning with a complete tree called a base revision and proceding with a number of successive changeset revisions, each defined relative to its predecessort by a changeset.
For example, a programmer might initially import the source for an implementation of tar creating a base revision named:
tar--mainline--1.0--base-0
After fixing a bug, the programmer can commit the changeset describing the bugfix. The changeset and the tree resulting from applying it are known as:
tar--mainline--1.0--patch-1
Additional work and more commits would yield:
tar--mainline--1.0--patch-1 tar--mainline--1.0--patch-2 tar--mainline--1.0--patch-3 ... etc.
This development line is situated within a particular arch archive which itself has a name. Prefixing the archive name to a revision name yields a globally unique name for the revision:
lord@emf.net--gnu-archive-2004/tar--mainline--1.0--patch-3
Tags are named in the same way and function comparably to base revisions. For example, instead of importing a new source tree to create
tar--mainline--1.0--base-0
the programmer might have defined that revision to be a tag revision -- more or less an an alias within the namespace. We often use notation like:
tar--mainline--1.0--base-0 => tar--experimental--1.0--patch-4
to mean "the base-0 revision in the mainline branch of tar is a tag revision, tagging the patch-4 revision in the experimental branch".
Archive Storage
The most basic role of an arch archive, therefore, is to store base revisions, commit revisions, and tag revisions in a way that (1) associates them with their name; (2) allows programmers to query and retrieve revisions and the changesets and other data associated with them.
Internally to arch, the storage mechanism is isolated from the rest of the implementation by an abstraction barrier. Entirely new storage managers are easy to install.
In practice, only a very simple storage manager is used and experience has shown this stone-simple approach to work well. In particular, archives are stored as ordinary files: imported base revisions are stored as a compressed tar file of the entire source tree; changeset revisions created by commit are stored as a compressed tar file containing just the changeset; tag revisions are stored as a small text file containing the name of the tagged revision.
The stone-simple approach has some enourmous advantages and is one are where the performance principles enumerated above come into play. Some example advantages:
A server is merely a generic file server The only capability needed for an arch server is the ability to create new files and directories and retrieve existing ones. Consequently, an unmodified FTP, SFTP, or WebDAV server is, already, an arch archive server. Nearly no computation takes place server-side and a given server can easily support very large numbers of concurrent connections.
Network traffic is minimized Imports and commits ship to the server compressed tar files of just the essential data of the transition. Subsequent retrievals simply read that data back. As a result, basic operations require few round-trips and minimal bandwidth.
Replication is easy New files are added to archives with every commit, tag, or import transaction, but old files are never removed or modified. Consequently, it is trivial to replicate and incrementally update read-only mirrors of any archive. Arch has a built-in facility for this and programs such as rsync can be configured to do the job more efficiently.
Archive Caching, Revision Libraries and Other Speedups
The archive storage layer has many advantages but it also creates a performance problem that needs to be solved:
Simple Archive Formats Can be Slow
Consider how, given just the archive storage mechanism described above, a particular revision can be retrieved. Let's suppose that we want to get a copy of the tree for:
tar--mainline--1.0--patch-4
Absent any other mechanism, we would build that tree with the following steps:
1. retrieve the base revision -- fetch the compressed tar file of
the base-0
revision from an archive and unpack it.
2. retrieve and apply the first changeset -- fetch the compressed
tar file of the changeset for patch-1
, unpack it, and apply the
changeset to the base-0
tree. Now we have a patch-1
tree.
3. retrieve and apply successive changesets -- continue applying
changesets until we lllreach the desired revision (patch-4
).
That's conceptually simple, and it's a direct consequence of our
choice of "dumb server" storage management, but it is
problematic. For example, suppose that instead of wanting the
patch-4
revision we wanted the patch-140
revision. Instead of having
to apply four changesets to the base revision, we'd have to apply
140 -- which would be prohibitively expensive.
Worse, in the case of wanting the patch-140
revision, we may find
ourselves violating some of our performance principles. After all,
even compressed, those 140 changesets that we'll fetch from the
server may turn out to be larger than just a complete copy of the
patch-140
tree would be. By not having the server compute that
complete tree, we may be trading network costs for server costs.
Arch solves these problems with three mechanisms: archive cached revisions, archive mirrors, and client-side revision libraries.
Archive Cached Revisions
One way to reduce the amount of changeset application and network bandwidth associated with building a revision is to apply our performance principle that disk space is very cheap.
Specifically, we augment the archive format to permit
(user-directed) caching of pre-built revisions, stored as compressed
tar files. Continuing the earlier example, a user may have asked
that the revision patch-130
be cached. That means that in addition
to containing a compressed tar file of the changeset that describes
how patch-130
differs from patch-129
the archive will contain a
compressed tar-file of a pre-built copy of patch-130
.
Now, if a user asks for a copy of the patch-140
tree, rather than
starting at base-0
then fetching and applying 140 changesets, arch
will fetch the full-tree copy of patch-130
then fetch and apply only
10 changesets to produce a patch-140
tree.
In keeping with our goals of client-side computation and dumb servers, when an archive cached revision is created it is constructed on some client, bundled as a tar file, and stored on the server.
Which revisions should be archive cached? One rational, approximately time-and-bandwidth optimal policy would be to space cached revisions such that the sum of the sizes of changesets between any two cached revisions is about the same as the size as the tar file containing the cached revision. Spacing them further apart would lead to a situation where sometimes the total size of changesets fetched to build a revision exceeds what could have been fetched with denser caching. Spacing them closer together will cause arch to sometimes fetch a cached revision when it would have been faster to fetch several changesets instead. Note that applying this policy approximately doubles the size of an archive -- which is quite accessible since archives are relatively compact to begin with and, after all, disk space is very cheap.
Other caching policies make good sense in some situations, though. Therefore, we leave it to users to explicitly cache revisions -- a process that can trivially be automated to implement whatever policy a user chooses.
Archive Mirrors
A second technique for saving on network costs is to use archive mirrors. Arch contains a facility for constructing and incrementally updating local, read-only copies of remote archives. When such a mirror is used, each changeset or base revision from the remote archive is fetched exactly once and, thereafter, access to it is entirely local. In our example, this would still (by default) mean applying 140 changesets -- but at least those changesets would reside on a local disk.
Archive caching and archive mirroring interact in a favorable way. Although mirrors are generally read-only, users may archive-cache a revision in a mirror. For example, one reasonable set-up is to publish a heavily read archive on the Internet, but archive-cache nothing there. Instead, encourage users to mirror from that public archive rather than use it directly, and to archive-cache according to their needs in their local copies.
Revision Libraries
The final technique for speeding up arch operations is entirely a client-side optimization. As with archive mirrors and archive caching, it trades-off disk space for speed.
The technique provides for something called a revisision library. A revision library is a read-only collection of pre-built whole-tree copies of revisions, stored more or less as ordinary trees.
A special feature of revision libraries is that unmodified files are
shared between revisions via hard-links. For example, suppose that a
revision library contains patch-139
and we wish to add patch-140
to
the library. The addition happens in two steps: First, a hard-linked
copy of patch-139
is created. All of the files in this copy are
shared with the original patch-139
-- they are the hard-links to the
two files. Second, the changeset defining patch-140
is applied to
the hard-linked copy. The changeset application process "breaks
hard-links" for files it modifies. Thus, the end result is a
patch-140
tree in which files that are unmodified since patch-139
are shared with the patch-139
tree via hard-links, and modified
files are private to the patch-140
tree.
This is a reasonably space-efficient way to store many, many revisions in space which is a fraction of the sum of their individual sizes.
The revision library mechanism is designed to be configurable as a cache or memo. Revisions can be automatically added to a library, on demand, or they can be explicitly added. It is a trivial matter for users to implement a "pruning" policy to discard revisions that are no longer needed.
When a revision library is present, many arch operations become
very, very fast. For example, suppose again that we want a patch-140
tree and that initially the revision library contains
patch-139
. Arch (suitably configured) will proceed by adding
patch-140
to the library (by creating a hard-linked copy of 139 and
applying one changeset to it) and then recursively copying the
resulting tree from the library to the user's work area. If the
library already contains patch-140
, then only the recursive copy is
necessary.
Often the tools that a user uses to work on their checked-out trees are well-behaved with respect to hard-links. Emacs and vi, for example, do not modify the files you edit --- they write fresh copies of those files containing your modifications and thus they break hard-links to those files.
Arch can take advantage of the hard-link-safety of common tools. When checking-out a tree, you can ask arch to not recursively copy the tree form a revision library but, instead, to make your working copy by hard-linking to the revision library. This turns out to be a blindingly fast way to check out a tree. (And, arch contains integrity checking features for revision libraries. Should your tools fail to be hard-link-safe after all, arch will notice and protect you from any resulting revision library corruption.)
Speedups Summarized
The basic archive storage model of arch is simple, space-efficient, and often but not always network- and time-efficient. In some cases, however, without additional mechanisms, it would lead to gross network- and time-inefficiencies.
We eliminate those gross inefficiences with three mechanisms, archive caching, archive mirroring, and revision libraries, which have in common that that they trade disk-space and client-side computation for better performance overall. A pleasant property of this approach is that it requires _no_ server-side computation and thus, excellent speeds can be achieved without sacrificing the scalability or arch in the slightest.
Patch Logs and Merging
We've seen so far how arch allows users to commit import, changeset, and tag revisions. We've seen that these are assigned globally unique names and examined how they are stored in archives. We've seen how revisions are retrieved from archives and looked at some of the speed-up mechanisms that exist to make retrieval fast.
Now we turn to the question of merging. Suppose, for example, that two users are each working on their own development line of a single project. Naturally, they may want to share work: each user will want to combine his changes with changes made by the other. How does arch handle this?
To understand the answer, we must first understand patch logs and then we can turn to history sensitive merging.
Patch Logs
Whenever a user creates any new revision he must provide a short description of the revision. The description, formatted similarly to an email message, contains a header that provides a brief summary of the revision, and a body that can contain details.
Arch maintains a special directory, called {arch}
in the top level
of each project tree. This directory contains the patch log entries
that pertain to the tree.
For example, when a tree is first imported, the user writes a patch
log entry and that is stored in a file under {arch}
, indexed by the
name of the revision created by the import. The tar bundle
containing the imported tree contains that patch log file.
Similarly, when a user commits or tags a revision, the patch log entry for that new revision is treated like a newly added file -- it appears in the associated changeset.
This implies, for one thing, that a given revision contains patch
log files for its ancestors in its development line. The tree for
patch-140
contains patch log files for base-0
, patch-1
,
.... through patch-140
itself.
When patch log files are added to a revision, arch adds additional, automatically computed headers. For example, one header names all newly added files in the revision. Another header reports what files have been deleted.
Conveniently, patch log files can be processed and summarized to produce a conventional ChangeLog file. In fact, arch has facilities to do so automatically.
History Sensitive Merging
Consider, then, what happens if a user wants to add to his tree the changes assocated with a revision made in a different development line. For example, suppose that I have checked out a copy of
tar--mainline--1.0--patch-140
and I note that a revision in a different branch, namely
tar--bugfixes--1.0--patch-2
is a changeset revision and that the changes it makes fix a bug I'd like fixed in mainline.
Arch can accomplish this quite easily simply by fetching the changeset for the bugfix and applying it to my patch-140 tree. That changeset treats the patch log entry for the bugfix revision as a newly added file. Therefore, after applying the changeset, my (now modified) tree for
tar--mainline--1.0--patch-140
will contain a patch log entry for
tar--bugfixes--1.0--patch-2
That's handy in and of itself. For example, if I don't remember whether the bug-fix changes have been applied to my tree, I can use arch to query the patch log files and look for a record of it having been applied.
It's also handy for smart merging. For example, in my mainline tree, I can ask arch:
% tla whats-missing tla--bugfixes--1.0
and it will respond with something like:
patch-1 patch-3 patch-4
indicating that I have not applied bugfix patches 1, 3, and 4 -- but have applied bugfix patch 2.
Using the summary lines from log messages, arch can report:
% tla --summary whats-missing tla--bugfixes--1.0 patch-1 fix the help message patch-3 fix buffer overflow (bug #238) patch-4 correct exit status (bug #14)
and a command such as:
% tla whats-missing tla--bugfixes--1.0 | tla replay --list -
will apply just the missing patches but none that I've already applied.
Merging isn't limited to just the changesets created by commit. Indeed, arch can compute the differences between any two revisions and apply those to my tree as a form of merging. Intelligently deciding which two revisions to compare is a tricky problem with no one solution -- arch provides a variety of solutions, most of which examine patch-logs to help with the decision process. This general mechanism is the basis for specific history-sensitive merge commands such as star-merge.
And the Rest is Easy
The above summarizes the essense of what could be dubbed core arch. Although many details have been omitted or glossed over, hopefully you now have some intuition about how arch works.
Many commands are built on top of the basic facilities outlined above. In all cases, they boil down to simple combinations of the above ideas: changeset computation and application, building revisions based on archive and revision library data, and so forth.