diff - compute a shortest edit script (SES) given two sequences
#include <mba/diff.h>
typedef const void *(*idx_fn)(const void *s, int idx, void *context);
typedef int (*cmp_fn)(const void *e1, const void *e2, void *context);
typedef enum {
DIFF_MATCH = 1,
DIFF_DELETE,
DIFF_INSERT
} diff_op;
struct diff_edit {
short op; /* DIFF_MATCH, DIFF_DELETE or DIFF_INSERT */
int off; /* off into a if MATCH or DELETE, b if INSERT */
int len;
};
int diff(const void *a,
int aoff,
int n,
const void *b,
int boff,
int m,
idx_fn idx,
cmp_fn cmp,
void *context,
int dmax,
struct varray *ses,
int *sn,
struct varray *buf);
The diff(3m) module will compute theshortest edit script(SES) of two
sequences. This algorithm is perhaps best known as the one used in GNU
diff(1) although GNU diff employs additional optimizations
specific to line oriented input such as source code files whereas this
implementation is more generic. Formally, this implementation of the SES
solution uses the dynamic programming algorithm described by Myers [1] with
the Hirschberg linear space refinement. The objective is to compute the
minimum set of edit operations necessary to transform a sequence A of length N
into B of length M. This can be performed in O(N+M*D^2) expected time where D
is theedit distance(number of elements deleted and inserted to transform A
into B). Thus the algorithm is particularly fast and uses less space if the
difference between sequences is small. The interface is generic such that
sequences can be anything that can be indexed and compared with user supplied
functions including arrays of structures, linked lists, arrays of pointers to
strings in a file, etc.
[1] E. Myers, ``An O(ND) Difference Algorithm and Its
Variations,'' Algorithmica 1, 2 (1986), 251-266.
http://www.cs.arizona.edu/people/gene/PAPERS/diff.ps
- diff
- The diff(3m) function computes the minimumedit distancebetween the
sequences a (from aoff for length n) and b
(from boff for length m) and optionally collects theedit
scriptnecessary to transform a into b storing the result in
the varray identified by the ses parameter. The idx
paremeter is a pointer to an idx_fn that returns the element at the
numeric index in a sequence. The cmp parameter is a pointer to a
cmp_fn function that returns zero if the two elements e1 and
e2 are equal and non-zero otherwise. The supplied context
parameter will be passed directly to both functions with each invokation.
If idx and cmp are NULL a and b will be
compared as continuous sequences of unsigned char (i.e. raw memory
diff).
If the ses parameter is not NULL it must be a
varray(3m) with a membsize of sizeof(struct
diff_edit). Each struct diff_edit element in the
varray(3m) starting from 0 will be populated with the op,
off, and len that together constitute the edit script. The
number of struct diff_edit elements in the edit script is written
to the integer pointed to by the sn parameter. If the ses
or sn parameter is NULL, the edit script will not be
collected.
If the dmax parameter is not 0, the calculation will
stop as soon as it is determined that the edit distance of the two
sequences equals or exceeds the specified value. A value of 0 indicates
that there is no limit.
If the buf parameter is not NULL it must be a
varray(3m) with membsize of sizeof(int) and will be
used as temporary storage for the dynamic programming tables. If
buf is NULL storage will be temporarily allocated and
freed with malloc(3) and free(3).
- diff
- The diff(3m) function returns the edit distance between the two
sequences a and b or dmax if the edit distance has
been determined to meet or exceed the specified dmax value. If an
error occurs -1 is returned and errno is set to indicate the
error.