.\" .\" Copyright 2004 Michael B. Allen .\" .TH diff 3m "Apr 29, 2005" "libmba-0.9.1" "MBA Library Functions" .SH NAME diff \- compute a shortest edit script (SES) given two sequences .SH SYNOPSIS .nf .B #include .sp .B typedef const void *(*idx_fn)(const void *s, int idx, void *context); .B typedef int (*cmp_fn)(const void *e1, const void *e2, void *context); .sp .B typedef enum { .B DIFF_MATCH = 1, .B DIFF_DELETE, .B DIFF_INSERT .B } diff_op; .sp .B struct diff_edit { .B short op; /* DIFF_MATCH, DIFF_DELETE or DIFF_INSERT */ .B int off; /* off into a if MATCH or DELETE, b if INSERT */ .B int len; .B }; .B .sp .BI "int diff(const void *" a ", .BI " int " aoff ", .BI " int " n ", .BI " const void *" b ", .BI " int " boff ", .BI " int " m ", .BI " idx_fn " idx ", .BI " cmp_fn " cmp ", .BI " void *" context ", .BI " int " dmax ", .BI " struct varray *" ses ", .BI " int *" sn ", .BI " struct varray *" buf "); .br .fi .SH DESCRIPTION The .BR "diff" (3m) module will compute theshortest edit script(SES) of two sequences. This algorithm is perhaps best known as the one used in GNU .BR "diff" (1) although GNU .B diff employs additional optimizations specific to line oriented input such as source code files whereas this implementation is more generic. Formally, this implementation of the SES solution uses the dynamic programming algorithm described by Myers [1] with the Hirschberg linear space refinement. The objective is to compute the minimum set of edit operations necessary to transform a sequence A of length N into B of length M. This can be performed in O(N+M*D^2) expected time where D is theedit distance(number of elements deleted and inserted to transform A into B). Thus the algorithm is particularly fast and uses less space if the difference between sequences is small. The interface is generic such that sequences can be anything that can be indexed and compared with user supplied functions including arrays of structures, linked lists, arrays of pointers to strings in a file, etc. .sp [1] E. Myers, ``An O(ND) Difference Algorithm and Its Variations,'' Algorithmica 1, 2 (1986), 251-266. http://www.cs.arizona.edu/people/gene/PAPERS/diff.ps .PP .TP .B diff The .BR "diff" (3m) function computes the minimumedit distancebetween the sequences .I a (from .I aoff for length .IR "n" ) and .I b (from .I boff for length .IR "m" ) and optionally collects theedit scriptnecessary to transform .I a into .I b storing the result in the .B varray identified by the .I ses parameter. The .I idx paremeter is a pointer to an .B idx_fn that returns the element at the numeric index in a sequence. The .I cmp parameter is a pointer to a .B cmp_fn function that returns zero if the two elements .I e1 and .I e2 are equal and non-zero otherwise. The supplied .I context parameter will be passed directly to both functions with each invokation. If .I idx and .I cmp are .B NULL .I a and .I b will be compared as continuous sequences of .I unsigned char (i.e. raw memory diff). .sp If the .I ses parameter is not .B NULL it must be a .BR "varray" (3m) with a .I membsize of .IR "sizeof(struct diff_edit)" . Each .I struct diff_edit element in the .BR "varray" (3m) starting from 0 will be populated with the .IR "op" , .IR "off" , and .I len that together constitute the edit script. The number of .I struct diff_edit elements in the edit script is written to the integer pointed to by the .I sn parameter. If the .I ses or .I sn parameter is .BR "NULL" , the edit script will not be collected. .sp If the .I dmax parameter is not 0, the calculation will stop as soon as it is determined that the edit distance of the two sequences equals or exceeds the specified value. A value of 0 indicates that there is no limit. .sp If the .I buf parameter is not .B NULL it must be a .BR "varray" (3m) with .I membsize of .I sizeof(int) and will be used as temporary storage for the dynamic programming tables. If .I buf is .B NULL storage will be temporarily allocated and freed with .BR "malloc" (3) and .BR "free" (3). .SH RETURNS .TP .B diff The .BR "diff" (3m) function returns the edit distance between the two sequences .I a and .I b or .I dmax if the edit distance has been determined to meet or exceed the specified .I dmax value. If an error occurs -1 is returned and .B errno is set to indicate the error.