I think a little background on the author is in order. I am a physicist, and think like a physicist. The proofs and theorems given here are what I would call ``physicist'' proofs and theorems, which is to say that while the proofs may not be rigorous, they are practical, and the theorems are intended to give physical insight. It would be great to have a mathematician work on this, but I am not a mathematician, and don't care for math.
From the beginning of this theory, which originated as the result of a series of email discussions with Tom Lord, I have looked at patches as being analogous to the operators of quantum mechanics. I include in this appendix footnotes explaining the theory of patches in terms of the theory of quantum mechanics. I know that for most people this won't help at all, but many of my friends (and as I write this all three of darcs' users) are physicists, and this will be helpful to them. To non-physicists, perhaps it will provide some insight into how at least this physicist thinks.
A patch describes a change to the tree. It could be either a primitive patch (such as a file add/remove, a directory rename, or a hunk replacement within a file), or a composite patch describing many such changes. Every patch type must satisfy the conditions described in this appendix. The theory of patches is independent of the data which the patches manipulate, which is what makes it both powerful and useful, as it provides a framework upon which one can build a revision control system in a sane manner.
Although in a sense, the defining property of any patch is that it can be applied to a certain tree, and thus make a certain change, this change does not wholly define the patch. A patch is defined by a representation, together with a set of rules for how it behaves (which it has in common with its patch type). The representation of a patch defines what change that particular patch makes, and must be defined in the context of a specific tree. The theory of patches is a theory of the many ways one can change the representation of a patch to place it in the context of a different tree. The patch itself is not changed, since it describes a single change, which must be the same regardless of its representationA.1.
So how does one define a tree, or the context of a patch? The simplest way to define a tree is as the result of a series of patches applied to the empty treeA.2. Thus, the context of a patch consists of the set of patches that precede it.
Hunks are an example of a complex filepatch. A hunk is a set of lines of a text file to be replaced by a different set of lines. Either of these sets may be empty, which would mean a deletion or insertion of lines.
Although most filepatches will be hunks, darcs is clever enough to support
other types of changes as well. A ``token replace'' patch replaces all
instances of a given token with some other version. A token, here, is
defined by a regular expression, which must be of the simple [a-z...] type,
indicating which characters are allowed in a token, with all other
characters acting as delimiters. For example, a C identifier would be a
token with the flag [A-Za-z_0-9]
.
What makes the token replace patch special is the fact that a token replace can be merged with almost any ordinary hunk, giving exactly what you would want. For example, you might want to change the patch type TokReplace to TokenReplace (if you decided that saving two characters of space was stupid). If you did this using hunks, it would modify every line where TokReplace occurred, and quite likely provoke a conflict with another patch modifying those lines. On the other hand, if you did this using a token replace patch, the only change that it could conflict with would be if someone else had used the token ``TokenReplace'' in their patch rather than TokReplace--and that actually would be a real conflict!
The simplest relationship between two patches is that of ``sequential''
patches, which means that the context of the second patch (the one on the
left) consists of the first patch (on the right) plus the context of the
first patch. The composition of two patches (which is also a patch) refers
to the patch which is formed by first applying one and then the other. The
composition of two patches, and
is represented as
,
where
is to be applied first, then
A.3
There is one other very useful relationship that two patches can have,
which is to be parallel patches, which means that the two patches have an
identical context (i.e. their representation applies to identical trees).
This is represented by
. Of course, two patches may also
have no simple relationship to one another. In that case, if you want to
do something with them, you'll have to manipulate them with respect to
other patches until they are either in sequence or in parallel.
The most fundamental and simple property of patches is that they must be
invertible. The inverse of a patch is described by: . In the
darcs implementation, the inverse is required to be computable from
knowledge of the patch only, without knowledge of its context, but that
(although convenient) is not required by the theory of patches.
Composite patches are made up of a series of patches intended to be applied sequentially. They are represented by a list of patches, with the first patch in the list being applied first.
The first way (of only two) to change the context of a patch is by commutation, which is the process of changing the order of two sequential patches.
The constraint that any two compatible patches (patches which can successfully be applied to the same tree) can be merged is actually quite difficult to apply. The above merge constraints also imply that the result of a series of merges must be independent of the order of the merges. So I'm putting a whole section here for the interested to see what algorithms I use to actually perform the merges (as this is pretty close to being the most difficult part of the code).
The first case is that in which the two merges don't actually conflict, but don't trivially merge either (e.g. hunk patches on the same file, where the line number has to be shifted as they are merged). This kind of merge can actually be very elegantly dealt with using only commutation and inversion.
There is a handy little theorem which is immensely useful when trying to merge two patches.
Of course, there are patches that actually conflict, meaning a merge where
the two patches truly cannot both be applied (e.g. trying to create a file
and a directory with the same name). We deal with this case by creating a
special kind of patch to support the merge, which we will call a
``merger''. Basically, a merger is a patch that contains the two patches
that conflicted, and instructs darcs basically to resolve the conflict. By
construction a merger will satisfy the commutation property (see
Definition ) that characterizes all merges. Moreover the
merger's properties are what makes the order of merges unimportant (which
is a rather critical property for darcs as a whole).
The job of a merger is basically to undo the two conflicting patches, and then apply some sort of a ``resolution'' of the two instead. In the case of two conflicting hunks, this will look much like what CVS does, where it inserts both versions into the file. In general, of course, the two conflicting patches may both be mergers themselves, in which case the situation is considerably more complicated.
Much of the merger code depends on a routine which recreates from a single merger the entire sequence of patches which led up to that merger (this is, of course, assuming that this is the complicated general case of a merger of mergers of mergers). This ``unwind'' procedure is rather complicated, but absolutely critical to the merger code, as without it we wouldn't even be able to undo the effects of the patches involved in the merger, since we wouldn't know what patches were all involved in it.
Basically, unwind takes a merger such as
M( M(A,B), M(A,M(C,D)))From which it recreates a merge history:
C A M(A,B) M( M(A,B), M(A,M(C,D)))(For the curious, yes I can easily unwind this merger in my head [and on paper can unwind insanely more complex mergers]--that's what comes of working for a few months on an algorithm.) Let's start with a simple unwinding. The merger
M(A,B)
simply means that two patches
(A
and B
) conflicted, and of the two of them A
is
first in the history. The last two patches in the unwinding of any merger
are always just this easy. So this unwinds to:
A M(A,B)What about a merger of mergers? How about
M(A,M(C,D))
. In this case
we know the two most recent patches are:
A M(A,M(C,D))But obviously the unwinding isn't complete, since we don't yet see where
C
and D
came from. In this case we take the unwinding of
M(C,D)
and drop its latest patch (which is M(C,D)
itself) and
place that at the beginning of our patch train:
C A M(A,M(C,D))As we look at
M( M(A,B), M(A,M(C,D)))
, we consider the unwindings of
each of its subpatches:
C A A M(A,B) M(A,M(C,D))As we did with
M(A,M(C,D))
, we'll drop the first patch on the
right and insert the first patch on the left. That moves us up to the two
A
's. Since these agree, we can use just one of them (they
``should'' agree). That leaves us with the C
which goes first.
The catch is that things don't always turn out this easily. There is no
guarantee that the two A
's would come out at the same time, and if
they didn't, we'd have to rearrange things until they did. Or if there was
no way to rearrange things so that they would agree, we have to go on to
plan B, which I will explain now.
Consider the case of M( M(A,B), M(C,D))
. We can easily unwind the
two subpatches
A C M(A,B) M(C,D)Now we need to reconcile the
A
and C
. How do we do this?
Well, as usual, the solution is to use the most wonderful
Theorem A
and C
could
either one be the last patch applied before M(A,B)
or
M(C,D)
. So we can find C'
using
C' A M(A,B) M( M(A,B), M(C,D) )There is a bit more complexity to the unwinding process (mostly having to do with cases where you have deeper nesting), but I think the general principles that are followed are pretty much included in the above discussion.
It can sometimes be handy to have a canonical representation of a given patch. We achieve this by defining a canonical form for each patch type, and a function ``canonize'' which takes a patch and puts it into canonical form. This routine is used by the diff function to create an optimal patch (based on an LCS algorithm) from a simple hunk describing the old and new version of a file. Note that canonization may fail, if the patch is internally inconsistent.
A simpler, faster (and more generally useful) cousin of canonize is the coalescing function. This takes two sequential patches, and tries to turn them into one patch. This function is used to deal with ``split'' patches, which are created when the commutation of a primitive patch can only be represented by a composite patch. In this case the resulting composite patch must return to the original primitive patch when the commutation is reversed, which a split patch accomplishes by trying to coalesce its contents each time it is commuted.
A file patch is a patch which only modifies a single file. There are some rules which can be made about file patches in general, which makes them a handy class. For example, commutation of two filepatches is trivial if they modify different files. There is an exception when one of the files has a name ending with ``-conflict'', in which case it may not commute with a file having the same name, but without the ``-conflict.'' If they happen to modify the same file, we'll have to check whether or not they commute.
There is another handy function, which primarily affects file patches (although it can also affect other patches, such as rename patches or dir add/remove patches), which is the submerge-in-directory function. This function changes the patch to act on a patch within a subdirectory rather than in the current directory, and is useful when performing the recursive diff.
The hunk is the simplest patch that has a commuting pattern in which the commuted patches differ from the originals (rather than simple success or failure). This makes commuting or merging two hunks a tad tedious. Hunks, of course, can be coalesced if they have any overlap. Note that coalesce code doesn't check if the two patches are conflicting. If you are coalescing two conflicting hunks, you've already got a bug somewhere.
One of the most important pieces of code is the canonization of a hunk,
which is where the ``diff'' algorithm is performed. This algorithm begins
with chopping off the identical beginnings and endings of the old and new
hunks. This isn't strictly necessary, but is a good idea, since this
process is , while the primary diff algorithm is something
considerably more painful than that... actually the head would be dealt
with all right, but with more space complexity. I think it's more
efficient to just chop the head and tail off first.
There are a couple of simple constraints on the routine which determines how to resolve two conflicting patches (which is called `glump'). These must be satisfied in order that the result of a series of merges is always independent of their order. Firstly, the output of glump cannot change when the order of the two conflicting patches is switched. If it did, then commuting the merger could change the resulting patch, which would be bad. Secondly, the result of the merge of three (or more) conflicting patches cannot depend on the order in which the merges are performed.
The conflict resolution code (glump) begins by ``unravelling'' the merger into a set of sequences of patches. Each sequence of patches corresponds to one non-conflicted patch that got merged together with the others. The result of the unravelling of a series of merges must obviously be independent of the order in which those merges are performed. This unravelling code (which uses the unwind code mentioned above) uses probably the second most complicated algorithm. Fortunately, if we can successfully unravel the merger, almost any function of the unravelled merger satisfies the two constraints mentioned above that the conflict resolution code must satisfy.
Of course, in order to store our patches in a file, we'll have to save them as some sort of strings. The convention is that each patch string will end with a newline, but on parsing we skip any amount of whitespace between patches.
{ <put patches here> (indented two) }
( <put patches here> (indented two) )
hunk FILE LINE# -LINE ... +LINE ...
Replace a token with a new token. Note that this format means that whitespace must not be allowed within a token. If you know of a practical application of whitespace within a token, let me know and I may change this.
replace FILENAME [REGEX] OLD NEW
Modify a binary file
binary FILENAME oldhex *HEXHEXHEX ... newhex *HEXHEXHEX ...
addfile filename
rmfile filename
move oldname newname
changepref prefname oldval newval
adddir filename
rmdir filename
merger MERGERVERSION <first patch> <second patch>
conflict <CONFLICTING PATCH SEQUENCE> with <OLDER PATCH SEQUENCE> tcilfnoc
Named patches are displayed as a ``patch id'' which is in square brackets, followed by a patch. Optionally, after the patch id (but before the patch itself) can come a list of dependencies surrounded by angle brackets. Each dependency consists of a patch id.
darcs-stable 2007-06-16