Algorithm::Diff(3pm)

1Algorithm::Diff(3)    User Contributed Perl Documentation   Algorithm::Diff(3)
2
3
4

NAME

6       Algorithm::Diff - Compute `intelligent' differences between two files /
7       lists
8

SYNOPSIS

10           require Algorithm::Diff;
11
12           # This example produces traditional 'diff' output:
13
14           my $diff = Algorithm::Diff->new( \@seq1, \@seq2 );
15
16           $diff->Base( 1 );   # Return line numbers, not indices
17           while(  $diff->Next()  ) {
18               next   if  $diff->Same();
19               my $sep = '';
20               if(  ! $diff->Items(2)  ) {
21                   printf "%d,%dd%d\n",
22                       $diff->Get(qw( Min1 Max1 Max2 ));
23               } elsif(  ! $diff->Items(1)  ) {
24                   printf "%da%d,%d\n",
25                       $diff->Get(qw( Max1 Min2 Max2 ));
26               } else {
27                   $sep = "---\n";
28                   printf "%d,%dc%d,%d\n",
29                       $diff->Get(qw( Min1 Max1 Min2 Max2 ));
30               }
31               print "< $_"   for  $diff->Items(1);
32               print $sep;
33               print "> $_"   for  $diff->Items(2);
34           }
35
36           # Alternate interfaces:
37
38           use Algorithm::Diff qw(
39               LCS LCS_length LCSidx
40               diff sdiff compact_diff
41               traverse_sequences traverse_balanced );
42
43           @lcs    = LCS( \@seq1, \@seq2 );
44           $lcsref = LCS( \@seq1, \@seq2 );
45           $count  = LCS_length( \@seq1, \@seq2 );
46
47           ( $seq1idxref, $seq2idxref ) = LCSidx( \@seq1, \@seq2 );
48
49           # Complicated interfaces:
50
51           @diffs  = diff( \@seq1, \@seq2 );
52
53           @sdiffs = sdiff( \@seq1, \@seq2 );
54
55           @cdiffs = compact_diff( \@seq1, \@seq2 );
56
57           traverse_sequences(
58               \@seq1,
59               \@seq2,
60               {   MATCH     => \&callback1,
61                   DISCARD_A => \&callback2,
62                   DISCARD_B => \&callback3,
63               },
64               \&key_generator,
65               @extra_args,
66           );
67
68           traverse_balanced(
69               \@seq1,
70               \@seq2,
71               {   MATCH     => \&callback1,
72                   DISCARD_A => \&callback2,
73                   DISCARD_B => \&callback3,
74                   CHANGE    => \&callback4,
75               },
76               \&key_generator,
77               @extra_args,
78           );
79

INTRODUCTION

81       (by Mark-Jason Dominus)
82
83       I once read an article written by the authors of "diff"; they said that
84       they worked very hard on the algorithm until they found the right one.
85
86       I think what they ended up using (and I hope someone will correct me,
87       because I am not very confident about this) was the `longest common
88       subsequence' method.  In the LCS problem, you have two sequences of
89       items:
90
91           a b c d f g h j q z
92
93           a b c d e f g i j k r x y z
94
95       and you want to find the longest sequence of items that is present in
96       both original sequences in the same order.  That is, you want to find a
97       new sequence S which can be obtained from the first sequence by delet‐
98       ing some items, and from the secend sequence by deleting other items.
99       You also want S to be as long as possible.  In this case S is
100
101           a b c d f g j z
102
103       From there it's only a small step to get diff-like output:
104
105           e   h i   k   q r x y
106           +   - +   +   - + + +
107
108       This module solves the LCS problem.  It also includes a canned function
109       to generate "diff"-like output.
110
111       It might seem from the example above that the LCS of two sequences is
112       always pretty obvious, but that's not always the case, especially when
113       the two sequences have many repeated elements.  For example, consider
114
115           a x b y c z p d q
116           a b c a x b y c z
117
118       A naive approach might start by matching up the "a" and "b" that appear
119       at the beginning of each sequence, like this:
120
121           a x b y c         z p d q
122           a   b   c a b y c z
123
124       This finds the common subsequence "a b c z".  But actually, the LCS is
125       "a x b y c z":
126
127                 a x b y c z p d q
128           a b c a x b y c z
129
130       or
131
132           a       x b y c z p d q
133           a b c a x b y c z
134

USAGE

136       (See also the README file and several example scripts include with this
137       module.)
138
139       This module now provides an object-oriented interface that uses less
140       memory and is easier to use than most of the previous procedural inter‐
141       faces.  It also still provides several exportable functions.  We'll
142       deal with these in ascending order of difficulty:  "LCS", "LCS_length",
143       "LCSidx", OO interface, "prepare", "diff", "sdiff", "tra‐
144       verse_sequences", and "traverse_balanced".
145
146       "LCS"
147
148       Given references to two lists of items, LCS returns an array containing
149       their longest common subsequence.  In scalar context, it returns a ref‐
150       erence to such a list.
151
152           @lcs    = LCS( \@seq1, \@seq2 );
153           $lcsref = LCS( \@seq1, \@seq2 );
154
155       "LCS" may be passed an optional third parameter; this is a CODE refer‐
156       ence to a key generation function.  See "KEY GENERATION FUNCTIONS".
157
158           @lcs    = LCS( \@seq1, \@seq2, \&keyGen, @args );
159           $lcsref = LCS( \@seq1, \@seq2, \&keyGen, @args );
160
161       Additional parameters, if any, will be passed to the key generation
162       routine.
163
164       "LCS_length"
165
166       This is just like "LCS" except it only returns the length of the long‐
167       est common subsequence.  This provides a performance gain of about 9%
168       compared to "LCS".
169
170       "LCSidx"
171
172       Like "LCS" except it returns references to two arrays.  The first array
173       contains the indices into @seq1 where the LCS items are located.  The
174       second array contains the indices into @seq2 where the LCS items are
175       located.
176
177       Therefore, the following three lists will contain the same values:
178
179           my( $idx1, $idx2 ) = LCSidx( \@seq1, \@seq2 );
180           my @list1 = @seq1[ @$idx1 ];
181           my @list2 = @seq2[ @$idx2 ];
182           my @list3 = LCS( \@seq1, \@seq2 );
183
184       "new"
185
186           $diff = Algorithm::Diffs->new( \@seq1, \@seq2 );
187           $diff = Algorithm::Diffs->new( \@seq1, \@seq2, \%opts );
188
189       "new" computes the smallest set of additions and deletions necessary to
190       turn the first sequence into the second and compactly records them in
191       the object.
192
193       You use the object to iterate over hunks, where each hunk represents a
194       contiguous section of items which should be added, deleted, replaced,
195       or left unchanged.
196
197           The following summary of all of the methods looks a lot like Perl
198           code but some of the symbols have different meanings:
199
200               [ ]     Encloses optional arguments
201               :       Is followed by the default value for an optional argument
202               ⎪       Separates alternate return results
203
204           Method summary:
205
206               $obj        = Algorithm::Diff->new( \@seq1, \@seq2, [ \%opts ] );
207               $pos        = $obj->Next(  [ $count : 1 ] );
208               $revPos     = $obj->Prev(  [ $count : 1 ] );
209               $obj        = $obj->Reset( [ $pos : 0 ] );
210               $copy       = $obj->Copy(  [ $pos, [ $newBase ] ] );
211               $oldBase    = $obj->Base(  [ $newBase ] );
212
213           Note that all of the following methods "die" if used on an object
214           that is "reset" (not currently pointing at any hunk).
215
216               $bits       = $obj->Diff(  );
217               @items⎪$cnt = $obj->Same(  );
218               @items⎪$cnt = $obj->Items( $seqNum );
219               @idxs ⎪$cnt = $obj->Range( $seqNum, [ $base ] );
220               $minIdx     = $obj->Min(   $seqNum, [ $base ] );
221               $maxIdx     = $obj->Max(   $seqNum, [ $base ] );
222               @values     = $obj->Get(   @names );
223
224           Passing in "undef" for an optional argument is always treated the
225           same as if no argument were passed in.
226
227       "Next"
228               $pos = $diff->Next();    # Move forward 1 hunk
229               $pos = $diff->Next( 2 ); # Move forward 2 hunks
230               $pos = $diff->Next(-5);  # Move backward 5 hunks
231
232           "Next" moves the object to point at the next hunk.  The object
233           starts out "reset", which means it isn't pointing at any hunk.  If
234           the object is reset, then "Next()" moves to the first hunk.
235
236           "Next" returns a true value iff the move didn't go past the last
237           hunk.  So Next(0) will return true iff the object is not reset.
238
239           Actually, "Next" returns the object's new position, which is a num‐
240           ber between 1 and the number of hunks (inclusive), or returns a
241           false value.
242
243       "Prev"
244           "Prev($N)" is almost identical to "Next(-$N)"; it moves to the $Nth
245           previous hunk.  On a 'reset' object, "Prev()" [and "Next(-1)"] move
246           to the last hunk.
247
248           The position returned by "Prev" is relative to the end of the
249           hunks; -1 for the last hunk, -2 for the second-to-last, etc.
250
251       "Reset"
252               $diff->Reset();     # Reset the object's position
253               $diff->Reset($pos); # Move to the specified hunk
254               $diff->Reset(1);    # Move to the first hunk
255               $diff->Reset(-1);   # Move to the last hunk
256
257           "Reset" returns the object, so, for example, you could use
258           "$diff->Reset()->Next(-1)" to get the number of hunks.
259
260       "Copy"
261               $copy = $diff->Copy( $newPos, $newBase );
262
263           "Copy" returns a copy of the object.  The copy and the orignal
264           object share most of their data, so making copies takes very little
265           memory.  The copy maintains its own position (separate from the
266           original), which is the main purpose of copies.  It also maintains
267           its own base.
268
269           By default, the copy's position starts out the same as the original
270           object's position.  But "Copy" takes an optional first argument to
271           set the new position, so the following three snippets are equiva‐
272           lent:
273
274               $copy = $diff->Copy($pos);
275
276               $copy = $diff->Copy();
277               $copy->Reset($pos);
278
279               $copy = $diff->Copy()->Reset($pos);
280
281           "Copy" takes an optional second argument to set the base for the
282           copy.  If you wish to change the base of the copy but leave the
283           position the same as in the original, here are two equivalent ways:
284
285               $copy = $diff->Copy();
286               $copy->Base( 0 );
287
288               $copy = $diff->Copy(undef,0);
289
290           Here are two equivalent way to get a "reset" copy:
291
292               $copy = $diff->Copy(0);
293
294               $copy = $diff->Copy()->Reset();
295
296       "Diff"
297               $bits = $obj->Diff();
298
299           "Diff" returns a true value iff the current hunk contains items
300           that are different between the two sequences.  It actually returns
301           one of the follow 4 values:
302
303           3   "3==(1⎪2)".  This hunk contains items from @seq1 and the items
304               from @seq2 that should replace them.  Both sequence 1 and 2
305               contain changed items so both the 1 and 2 bits are set.
306
307           2   This hunk only contains items from @seq2 that should be
308               inserted (not items from @seq1).  Only sequence 2 contains
309               changed items so only the 2 bit is set.
310
311           1   This hunk only contains items from @seq1 that should be deleted
312               (not items from @seq2).  Only sequence 1 contains changed items
313               so only the 1 bit is set.
314
315           0   This means that the items in this hunk are the same in both
316               sequences.  Neither sequence 1 nor 2 contain changed items so
317               neither the 1 nor the 2 bits are set.
318
319       "Same"
320           "Same" returns a true value iff the current hunk contains items
321           that are the same in both sequences.  It actually returns the list
322           of items if they are the same or an emty list if they aren't.  In a
323           scalar context, it returns the size of the list.
324
325       "Items"
326               $count = $diff->Items(2);
327               @items = $diff->Items($seqNum);
328
329           "Items" returns the (number of) items from the specified sequence
330           that are part of the current hunk.
331
332           If the current hunk contains only insertions, then
333           "$diff->Items(1)" will return an empty list (0 in a scalar conext).
334           If the current hunk contains only deletions, then "$diff->Items(2)"
335           will return an empty list (0 in a scalar conext).
336
337           If the hunk contains replacements, then both "$diff->Items(1)" and
338           "$diff->Items(2)" will return different, non-empty lists.
339
340           Otherwise, the hunk contains identical items and all of the follow‐
341           ing will return the same lists:
342
343               @items = $diff->Items(1);
344               @items = $diff->Items(2);
345               @items = $diff->Same();
346
347       "Range"
348               $count = $diff->Range( $seqNum );
349               @indices = $diff->Range( $seqNum );
350               @indices = $diff->Range( $seqNum, $base );
351
352           "Range" is like "Items" except that it returns a list of indices to
353           the items rather than the items themselves.  By default, the index
354           of the first item (in each sequence) is 0 but this can be changed
355           by calling the "Base" method.  So, by default, the following two
356           snippets return the same lists:
357
358               @list = $diff->Items(2);
359               @list = @seq2[ $diff->Range(2) ];
360
361           You can also specify the base to use as the second argument.  So
362           the following two snippets always return the same lists:
363
364               @list = $diff->Items(1);
365               @list = @seq1[ $diff->Range(1,0) ];
366
367       "Base"
368               $curBase = $diff->Base();
369               $oldBase = $diff->Base($newBase);
370
371           "Base" sets and/or returns the current base (usually 0 or 1) that
372           is used when you request range information.  The base defaults to 0
373           so that range information is returned as array indices.  You can
374           set the base to 1 if you want to report traditional line numbers
375           instead.
376
377       "Min"
378               $min1 = $diff->Min(1);
379               $min = $diff->Min( $seqNum, $base );
380
381           "Min" returns the first value that "Range" would return (given the
382           same arguments) or returns "undef" if "Range" would return an empty
383           list.
384
385       "Max"
386           "Max" returns the last value that "Range" would return or "undef".
387
388       "Get"
389               ( $n, $x, $r ) = $diff->Get(qw( min1 max1 range1 ));
390               @values = $diff->Get(qw( 0min2 1max2 range2 same base ));
391
392           "Get" returns one or more scalar values.  You pass in a list of the
393           names of the values you want returned.  Each name must match one of
394           the following regexes:
395
396               /^(-?\d+)?(min⎪max)[12]$/i
397               /^(range[12]⎪same⎪diff⎪base)$/i
398
399           The 1 or 2 after a name says which sequence you want the informa‐
400           tion for (and where allowed, it is required).  The optional number
401           before "min" or "max" is the base to use.  So the following equali‐
402           ties hold:
403
404               $diff->Get('min1') == $diff->Min(1)
405               $diff->Get('0min2') == $diff->Min(2,0)
406
407           Using "Get" in a scalar context when you've passed in more than one
408           name is a fatal error ("die" is called).
409
410       "prepare"
411
412       Given a reference to a list of items, "prepare" returns a reference to
413       a hash which can be used when comparing this sequence to other
414       sequences with "LCS" or "LCS_length".
415
416           $prep = prepare( \@seq1 );
417           for $i ( 0 .. 10_000 )
418           {
419               @lcs = LCS( $prep, $seq[$i] );
420               # do something useful with @lcs
421           }
422
423       "prepare" may be passed an optional third parameter; this is a CODE
424       reference to a key generation function.  See "KEY GENERATION FUNC‐
425       TIONS".
426
427           $prep = prepare( \@seq1, \&keyGen );
428           for $i ( 0 .. 10_000 )
429           {
430               @lcs = LCS( $seq[$i], $prep, \&keyGen );
431               # do something useful with @lcs
432           }
433
434       Using "prepare" provides a performance gain of about 50% when calling
435       LCS many times compared with not preparing.
436
437       "diff"
438
439           @diffs     = diff( \@seq1, \@seq2 );
440           $diffs_ref = diff( \@seq1, \@seq2 );
441
442       "diff" computes the smallest set of additions and deletions necessary
443       to turn the first sequence into the second, and returns a description
444       of these changes.  The description is a list of hunks; each hunk repre‐
445       sents a contiguous section of items which should be added, deleted, or
446       replaced.  (Hunks containing unchanged items are not included.)
447
448       The return value of "diff" is a list of hunks, or, in scalar context, a
449       reference to such a list.  If there are no differences, the list will
450       be empty.
451
452       Here is an example.  Calling "diff" for the following two sequences:
453
454           a b c e h j l m n p
455           b c d e f j k l m r s t
456
457       would produce the following list:
458
459           (
460             [ [ '-', 0, 'a' ] ],
461
462             [ [ '+', 2, 'd' ] ],
463
464             [ [ '-', 4, 'h' ],
465               [ '+', 4, 'f' ] ],
466
467             [ [ '+', 6, 'k' ] ],
468
469             [ [ '-',  8, 'n' ],
470               [ '-',  9, 'p' ],
471               [ '+',  9, 'r' ],
472               [ '+', 10, 's' ],
473               [ '+', 11, 't' ] ],
474           )
475
476       There are five hunks here.  The first hunk says that the "a" at posi‐
477       tion 0 of the first sequence should be deleted ("-").  The second hunk
478       says that the "d" at position 2 of the second sequence should be
479       inserted ("+").  The third hunk says that the "h" at position 4 of the
480       first sequence should be removed and replaced with the "f" from posi‐
481       tion 4 of the second sequence.  And so on.
482
483       "diff" may be passed an optional third parameter; this is a CODE refer‐
484       ence to a key generation function.  See "KEY GENERATION FUNCTIONS".
485
486       Additional parameters, if any, will be passed to the key generation
487       routine.
488
489       "sdiff"
490
491           @sdiffs     = sdiff( \@seq1, \@seq2 );
492           $sdiffs_ref = sdiff( \@seq1, \@seq2 );
493
494       "sdiff" computes all necessary components to show two sequences and
495       their minimized differences side by side, just like the Unix-utility
496       sdiff does:
497
498           same             same
499           before     ⎪     after
500           old        <     -
501           -          >     new
502
503       It returns a list of array refs, each pointing to an array of display
504       instructions. In scalar context it returns a reference to such a list.
505       If there are no differences, the list will have one entry per item,
506       each indicating that the item was unchanged.
507
508       Display instructions consist of three elements: A modifier indicator
509       ("+": Element added, "-": Element removed, "u": Element unmodified,
510       "c": Element changed) and the value of the old and new elements, to be
511       displayed side-by-side.
512
513       An "sdiff" of the following two sequences:
514
515           a b c e h j l m n p
516           b c d e f j k l m r s t
517
518       results in
519
520           ( [ '-', 'a', ''  ],
521             [ 'u', 'b', 'b' ],
522             [ 'u', 'c', 'c' ],
523             [ '+', '',  'd' ],
524             [ 'u', 'e', 'e' ],
525             [ 'c', 'h', 'f' ],
526             [ 'u', 'j', 'j' ],
527             [ '+', '',  'k' ],
528             [ 'u', 'l', 'l' ],
529             [ 'u', 'm', 'm' ],
530             [ 'c', 'n', 'r' ],
531             [ 'c', 'p', 's' ],
532             [ '+', '',  't' ],
533           )
534
535       "sdiff" may be passed an optional third parameter; this is a CODE ref‐
536       erence to a key generation function.  See "KEY GENERATION FUNCTIONS".
537
538       Additional parameters, if any, will be passed to the key generation
539       routine.
540
541       "compact_diff"
542
543       "compact_diff" is much like "sdiff" except it returns a much more com‐
544       pact description consisting of just one flat list of indices.  An exam‐
545       ple helps explain the format:
546
547           my @a = qw( a b c   e  h j   l m n p      );
548           my @b = qw(   b c d e f  j k l m    r s t );
549           @cdiff = compact_diff( \@a, \@b );
550           # Returns:
551           #   @a      @b       @a       @b
552           #  start   start   values   values
553           (    0,      0,   #       =
554                0,      0,   #    a  !
555                1,      0,   #  b c  =  b c
556                3,      2,   #       !  d
557                3,      3,   #    e  =  e
558                4,      4,   #    f  !  h
559                5,      5,   #    j  =  j
560                6,      6,   #       !  k
561                6,      7,   #  l m  =  l m
562                8,      9,   #  n p  !  r s t
563               10,     12,   #
564           );
565
566       The 0th, 2nd, 4th, etc. entries are all indices into @seq1 (@a in the
567       above example) indicating where a hunk begins.  The 1st, 3rd, 5th, etc.
568       entries are all indices into @seq2 (@b in the above example) indicating
569       where the same hunk begins.
570
571       So each pair of indices (except the last pair) describes where a hunk
572       begins (in each sequence).  Since each hunk must end at the item just
573       before the item that starts the next hunk, the next pair of indices can
574       be used to determine where the hunk ends.
575
576       So, the first 4 entries (0..3) describe the first hunk.  Entries 0 and
577       1 describe where the first hunk begins (and so are always both 0).
578       Entries 2 and 3 describe where the next hunk begins, so subtracting 1
579       from each tells us where the first hunk ends.  That is, the first hunk
580       contains items $diff[0] through "$diff[2] - 1" of the first sequence
581       and contains items $diff[1] through "$diff[3] - 1" of the second
582       sequence.
583
584       In other words, the first hunk consists of the following two lists of
585       items:
586
587                      #  1st pair     2nd pair
588                      # of indices   of indices
589           @list1 = @a[ $cdiff[0] .. $cdiff[2]-1 ];
590           @list2 = @b[ $cdiff[1] .. $cdiff[3]-1 ];
591                      # Hunk start   Hunk end
592
593       Note that the hunks will always alternate between those that are part
594       of the LCS (those that contain unchanged items) and those that contain
595       changes.  This means that all we need to be told is whether the first
596       hunk is a 'same' or 'diff' hunk and we can determine which of the other
597       hunks contain 'same' items or 'diff' items.
598
599       By convention, we always make the first hunk contain unchanged items.
600       So the 1st, 3rd, 5th, etc. hunks (all odd-numbered hunks if you start
601       counting from 1) all contain unchanged items.  And the 2nd, 4th, 6th,
602       etc. hunks (all even-numbered hunks if you start counting from 1) all
603       contain changed items.
604
605       Since @a and @b don't begin with the same value, the first hunk in our
606       example is empty (otherwise we'd violate the above convention).  Note
607       that the first 4 index values in our example are all zero.  Plug these
608       values into our previous code block and we get:
609
610           @hunk1a = @a[ 0 .. 0-1 ];
611           @hunk1b = @b[ 0 .. 0-1 ];
612
613       And "0..-1" returns the empty list.
614
615       Move down one pair of indices (2..5) and we get the offset ranges for
616       the second hunk, which contains changed items.
617
618       Since @diff[2..5] contains (0,0,1,0) in our example, the second hunk
619       consists of these two lists of items:
620
621               @hunk2a = @a[ $cdiff[2] .. $cdiff[4]-1 ];
622               @hunk2b = @b[ $cdiff[3] .. $cdiff[5]-1 ];
623           # or
624               @hunk2a = @a[ 0 .. 1-1 ];
625               @hunk2b = @b[ 0 .. 0-1 ];
626           # or
627               @hunk2a = @a[ 0 .. 0 ];
628               @hunk2b = @b[ 0 .. -1 ];
629           # or
630               @hunk2a = ( 'a' );
631               @hunk2b = ( );
632
633       That is, we would delete item 0 ('a') from @a.
634
635       Since @diff[4..7] contains (1,0,3,2) in our example, the third hunk
636       consists of these two lists of items:
637
638               @hunk3a = @a[ $cdiff[4] .. $cdiff[6]-1 ];
639               @hunk3a = @b[ $cdiff[5] .. $cdiff[7]-1 ];
640           # or
641               @hunk3a = @a[ 1 .. 3-1 ];
642               @hunk3a = @b[ 0 .. 2-1 ];
643           # or
644               @hunk3a = @a[ 1 .. 2 ];
645               @hunk3a = @b[ 0 .. 1 ];
646           # or
647               @hunk3a = qw( b c );
648               @hunk3a = qw( b c );
649
650       Note that this third hunk contains unchanged items as our convention
651       demands.
652
653       You can continue this process until you reach the last two indices,
654       which will always be the number of items in each sequence.  This is
655       required so that subtracting one from each will give you the indices to
656       the last items in each sequence.
657
658       "traverse_sequences"
659
660       "traverse_sequences" used to be the most general facility provided by
661       this module (the new OO interface is more powerful and much easier to
662       use).
663
664       Imagine that there are two arrows.  Arrow A points to an element of
665       sequence A, and arrow B points to an element of the sequence B.  Ini‐
666       tially, the arrows point to the first elements of the respective
667       sequences.  "traverse_sequences" will advance the arrows through the
668       sequences one element at a time, calling an appropriate user-specified
669       callback function before each advance.  It willadvance the arrows in
670       such a way that if there are equal elements $A[$i] and $B[$j] which are
671       equal and which are part of the LCS, there will be some moment during
672       the execution of "traverse_sequences" when arrow A is pointing to
673       $A[$i] and arrow B is pointing to $B[$j].  When this happens, "tra‐
674       verse_sequences" will call the "MATCH" callback function and then it
675       will advance both arrows.
676
677       Otherwise, one of the arrows is pointing to an element of its sequence
678       that is not part of the LCS.  "traverse_sequences" will advance that
679       arrow and will call the "DISCARD_A" or the "DISCARD_B" callback,
680       depending on which arrow it advanced.  If both arrows point to elements
681       that are not part of the LCS, then "traverse_sequences" will advance
682       one of them and call the appropriate callback, but it is not specified
683       which it will call.
684
685       The arguments to "traverse_sequences" are the two sequences to tra‐
686       verse, and a hash which specifies the callback functions, like this:
687
688           traverse_sequences(
689               \@seq1, \@seq2,
690               {   MATCH => $callback_1,
691                   DISCARD_A => $callback_2,
692                   DISCARD_B => $callback_3,
693               }
694           );
695
696       Callbacks for MATCH, DISCARD_A, and DISCARD_B are invoked with at least
697       the indices of the two arrows as their arguments.  They are not
698       expected to return any values.  If a callback is omitted from the ta‐
699       ble, it is not called.
700
701       Callbacks for A_FINISHED and B_FINISHED are invoked with at least the
702       corresponding index in A or B.
703
704       If arrow A reaches the end of its sequence, before arrow B does, "tra‐
705       verse_sequences" will call the "A_FINISHED" callback when it advances
706       arrow B, if there is such a function; if not it will call "DISCARD_B"
707       instead.  Similarly if arrow B finishes first.  "traverse_sequences"
708       returns when both arrows are at the ends of their respective sequences.
709       It returns true on success and false on failure.  At present there is
710       no way to fail.
711
712       "traverse_sequences" may be passed an optional fourth parameter; this
713       is a CODE reference to a key generation function.  See "KEY GENERATION
714       FUNCTIONS".
715
716       Additional parameters, if any, will be passed to the key generation
717       function.
718
719       If you want to pass additional parameters to your callbacks, but don't
720       need a custom key generation function, you can get the default by pass‐
721       ing undef:
722
723           traverse_sequences(
724               \@seq1, \@seq2,
725               {   MATCH => $callback_1,
726                   DISCARD_A => $callback_2,
727                   DISCARD_B => $callback_3,
728               },
729               undef,     # default key-gen
730               $myArgument1,
731               $myArgument2,
732               $myArgument3,
733           );
734
735       "traverse_sequences" does not have a useful return value; you are
736       expected to plug in the appropriate behavior with the callback func‐
737       tions.
738
739       "traverse_balanced"
740
741       "traverse_balanced" is an alternative to "traverse_sequences". It uses
742       a different algorithm to iterate through the entries in the computed
743       LCS. Instead of sticking to one side and showing element changes as
744       insertions and deletions only, it will jump back and forth between the
745       two sequences and report changes occurring as deletions on one side
746       followed immediatly by an insertion on the other side.
747
748       In addition to the "DISCARD_A", "DISCARD_B", and "MATCH" callbacks sup‐
749       ported by "traverse_sequences", "traverse_balanced" supports a "CHANGE"
750       callback indicating that one element got "replaced" by another:
751
752           traverse_balanced(
753               \@seq1, \@seq2,
754               {   MATCH => $callback_1,
755                   DISCARD_A => $callback_2,
756                   DISCARD_B => $callback_3,
757                   CHANGE    => $callback_4,
758               }
759           );
760
761       If no "CHANGE" callback is specified, "traverse_balanced" will map
762       "CHANGE" events to "DISCARD_A" and "DISCARD_B" actions, therefore
763       resulting in a similar behaviour as "traverse_sequences" with different
764       order of events.
765
766       "traverse_balanced" might be a bit slower than "traverse_sequences",
767       noticable only while processing huge amounts of data.
768
769       The "sdiff" function of this module is implemented as call to "tra‐
770       verse_balanced".
771
772       "traverse_balanced" does not have a useful return value; you are
773       expected to plug in the appropriate behavior with the callback func‐
774       tions.
775

KEY GENERATION FUNCTIONS

777       Most of the functions accept an optional extra parameter.  This is a
778       CODE reference to a key generating (hashing) function that should
779       return a string that uniquely identifies a given element.  It should be
780       the case that if two elements are to be considered equal, their keys
781       should be the same (and the other way around).  If no key generation
782       function is provided, the key will be the element as a string.
783
784       By default, comparisons will use "eq" and elements will be turned into
785       keys using the default stringizing operator '""'.
786
787       Where this is important is when you're comparing something other than
788       strings.  If it is the case that you have multiple different objects
789       that should be considered to be equal, you should supply a key genera‐
790       tion function. Otherwise, you have to make sure that your arrays con‐
791       tain unique references.
792
793       For instance, consider this example:
794
795           package Person;
796
797           sub new
798           {
799               my $package = shift;
800               return bless { name => '', ssn => '', @_ }, $package;
801           }
802
803           sub clone
804           {
805               my $old = shift;
806               my $new = bless { %$old }, ref($old);
807           }
808
809           sub hash
810           {
811               return shift()->{'ssn'};
812           }
813
814           my $person1 = Person->new( name => 'Joe', ssn => '123-45-6789' );
815           my $person2 = Person->new( name => 'Mary', ssn => '123-47-0000' );
816           my $person3 = Person->new( name => 'Pete', ssn => '999-45-2222' );
817           my $person4 = Person->new( name => 'Peggy', ssn => '123-45-9999' );
818           my $person5 = Person->new( name => 'Frank', ssn => '000-45-9999' );
819
820       If you did this:
821
822           my $array1 = [ $person1, $person2, $person4 ];
823           my $array2 = [ $person1, $person3, $person4, $person5 ];
824           Algorithm::Diff::diff( $array1, $array2 );
825
826       everything would work out OK (each of the objects would be converted
827       into a string like "Person=HASH(0x82425b0)" for comparison).
828
829       But if you did this:
830
831           my $array1 = [ $person1, $person2, $person4 ];
832           my $array2 = [ $person1, $person3, $person4->clone(), $person5 ];
833           Algorithm::Diff::diff( $array1, $array2 );
834
835       $person4 and $person4->clone() (which have the same name and SSN) would
836       be seen as different objects. If you wanted them to be considered
837       equivalent, you would have to pass in a key generation function:
838
839           my $array1 = [ $person1, $person2, $person4 ];
840           my $array2 = [ $person1, $person3, $person4->clone(), $person5 ];
841           Algorithm::Diff::diff( $array1, $array2, \&Person::hash );
842
843       This would use the 'ssn' field in each Person as a comparison key, and
844       so would consider $person4 and $person4->clone() as equal.
845
846       You may also pass additional parameters to the key generation function
847       if you wish.
848

ERROR CHECKING

850       If you pass these routines a non-reference and they expect a reference,
851       they will die with a message.
852

AUTHOR

854       This version released by Tye McQueen (http://perlmonks.org/?node=tye).
855

LICENSE

857       Parts Copyright (c) 2000-2004 Ned Konz.  All rights reserved.  Parts by
858       Tye McQueen.
859
860       This program is free software; you can redistribute it and/or modify it
861       under the same terms as Perl.
862

MAILING LIST

864       Mark-Jason still maintains a mailing list.  To join a low-volume mail‐
865       ing list for announcements related to diff and Algorithm::Diff, send an
866       empty mail message to mjd-perl-diff-request@plover.com.
867

CREDITS

869       Versions through 0.59 (and much of this documentation) were written by:
870
871       Mark-Jason Dominus, mjd-perl-diff@plover.com
872
873       This version borrows some documentation and routine names from
874       Mark-Jason's, but Diff.pm's code was completely replaced.
875
876       This code was adapted from the Smalltalk code of Mario Wolczko
877       <mario@wolczko.com>, which is available at
878       ftp://st.cs.uiuc.edu/pub/Smalltalk/MANCHESTER/manchester/4.0/diff.st
879
880       "sdiff" and "traverse_balanced" were written by Mike Schilli
881       <m@perlmeister.com>.
882
883       The algorithm is that described in A Fast Algorithm for Computing Long‐
884       est Common Subsequences, CACM, vol.20, no.5, pp.350-353, May 1977, with
885       a few minor improvements to improve the speed.
886
887       Much work was done by Ned Konz (perl@bike-nomad.com).
888
889       The OO interface and some other changes are by Tye McQueen.
890
891
892
893perl v5.8.8                       2006-07-31                Algorithm::Diff(3)