Algorithm::Diff(3pm)

1Algorithm::Diff(3)    User Contributed Perl Documentation   Algorithm::Diff(3)
2
3
4

NAME

6       Algorithm::Diff - Compute `intelligent' differences between two files /
7       lists
8

SYNOPSIS

10           require Algorithm::Diff;
11
12           # This example produces traditional 'diff' output:
13
14           my $diff = Algorithm::Diff->new( \@seq1, \@seq2 );
15
16           $diff->Base( 1 );   # Return line numbers, not indices
17           while(  $diff->Next()  ) {
18               next   if  $diff->Same();
19               my $sep = '';
20               if(  ! $diff->Items(2)  ) {
21                   printf "%d,%dd%d\n",
22                       $diff->Get(qw( Min1 Max1 Max2 ));
23               } elsif(  ! $diff->Items(1)  ) {
24                   printf "%da%d,%d\n",
25                       $diff->Get(qw( Max1 Min2 Max2 ));
26               } else {
27                   $sep = "---\n";
28                   printf "%d,%dc%d,%d\n",
29                       $diff->Get(qw( Min1 Max1 Min2 Max2 ));
30               }
31               print "< $_"   for  $diff->Items(1);
32               print $sep;
33               print "> $_"   for  $diff->Items(2);
34           }
35
36
37           # Alternate interfaces:
38
39           use Algorithm::Diff qw(
40               LCS LCS_length LCSidx
41               diff sdiff compact_diff
42               traverse_sequences traverse_balanced );
43
44           @lcs    = LCS( \@seq1, \@seq2 );
45           $lcsref = LCS( \@seq1, \@seq2 );
46           $count  = LCS_length( \@seq1, \@seq2 );
47
48           ( $seq1idxref, $seq2idxref ) = LCSidx( \@seq1, \@seq2 );
49
50
51           # Complicated interfaces:
52
53           @diffs  = diff( \@seq1, \@seq2 );
54
55           @sdiffs = sdiff( \@seq1, \@seq2 );
56
57           @cdiffs = compact_diff( \@seq1, \@seq2 );
58
59           traverse_sequences(
60               \@seq1,
61               \@seq2,
62               {   MATCH     => \&callback1,
63                   DISCARD_A => \&callback2,
64                   DISCARD_B => \&callback3,
65               },
66               \&key_generator,
67               @extra_args,
68           );
69
70           traverse_balanced(
71               \@seq1,
72               \@seq2,
73               {   MATCH     => \&callback1,
74                   DISCARD_A => \&callback2,
75                   DISCARD_B => \&callback3,
76                   CHANGE    => \&callback4,
77               },
78               \&key_generator,
79               @extra_args,
80           );
81

INTRODUCTION

83       (by Mark-Jason Dominus)
84
85       I once read an article written by the authors of "diff"; they said that
86       they worked very hard on the algorithm until they found the right one.
87
88       I think what they ended up using (and I hope someone will correct me,
89       because I am not very confident about this) was the `longest common
90       subsequence' method.  In the LCS problem, you have two sequences of
91       items:
92
93           a b c d f g h j q z
94
95           a b c d e f g i j k r x y z
96
97       and you want to find the longest sequence of items that is present in
98       both original sequences in the same order.  That is, you want to find a
99       new sequence S which can be obtained from the first sequence by
100       deleting some items, and from the second sequence by deleting other
101       items.  You also want S to be as long as possible.  In this case S is
102
103           a b c d f g j z
104
105       From there it's only a small step to get diff-like output:
106
107           e   h i   k   q r x y
108           +   - +   +   - + + +
109
110       This module solves the LCS problem.  It also includes a canned function
111       to generate "diff"-like output.
112
113       It might seem from the example above that the LCS of two sequences is
114       always pretty obvious, but that's not always the case, especially when
115       the two sequences have many repeated elements.  For example, consider
116
117           a x b y c z p d q
118           a b c a x b y c z
119
120       A naive approach might start by matching up the "a" and "b" that appear
121       at the beginning of each sequence, like this:
122
123           a x b y c         z p d q
124           a   b   c a b y c z
125
126       This finds the common subsequence "a b c z".  But actually, the LCS is
127       "a x b y c z":
128
129                 a x b y c z p d q
130           a b c a x b y c z
131
132       or
133
134           a       x b y c z p d q
135           a b c a x b y c z
136

USAGE

138       (See also the README file and several example scripts include with this
139       module.)
140
141       This module now provides an object-oriented interface that uses less
142       memory and is easier to use than most of the previous procedural
143       interfaces.  It also still provides several exportable functions.
144       We'll deal with these in ascending order of difficulty:  "LCS",
145       "LCS_length", "LCSidx", OO interface, "prepare", "diff", "sdiff",
146       "traverse_sequences", and "traverse_balanced".
147
148   "LCS"
149       Given references to two lists of items, LCS returns an array containing
150       their longest common subsequence.  In scalar context, it returns a
151       reference to such a list.
152
153           @lcs    = LCS( \@seq1, \@seq2 );
154           $lcsref = LCS( \@seq1, \@seq2 );
155
156       "LCS" may be passed an optional third parameter; this is a CODE
157       reference to a key generation function.  See "KEY GENERATION
158       FUNCTIONS".
159
160           @lcs    = LCS( \@seq1, \@seq2, \&keyGen, @args );
161           $lcsref = LCS( \@seq1, \@seq2, \&keyGen, @args );
162
163       Additional parameters, if any, will be passed to the key generation
164       routine.
165
166   "LCS_length"
167       This is just like "LCS" except it only returns the length of the
168       longest common subsequence.  This provides a performance gain of about
169       9% compared to "LCS".
170
171   "LCSidx"
172       Like "LCS" except it returns references to two arrays.  The first array
173       contains the indices into @seq1 where the LCS items are located.  The
174       second array contains the indices into @seq2 where the LCS items are
175       located.
176
177       Therefore, the following three lists will contain the same values:
178
179           my( $idx1, $idx2 ) = LCSidx( \@seq1, \@seq2 );
180           my @list1 = @seq1[ @$idx1 ];
181           my @list2 = @seq2[ @$idx2 ];
182           my @list3 = LCS( \@seq1, \@seq2 );
183
184   "new"
185           $diff = Algorithm::Diff->new( \@seq1, \@seq2 );
186           $diff = Algorithm::Diff->new( \@seq1, \@seq2, \%opts );
187
188       "new" computes the smallest set of additions and deletions necessary to
189       turn the first sequence into the second and compactly records them in
190       the object.
191
192       You use the object to iterate over hunks, where each hunk represents a
193       contiguous section of items which should be added, deleted, replaced,
194       or left unchanged.
195
196       The following summary of all of the methods looks a lot like Perl code
197       but some of the symbols have different meanings:
198
199           [ ]     Encloses optional arguments
200           :       Is followed by the default value for an optional argument
201           |       Separates alternate return results
202
203       Method summary:
204
205           $obj        = Algorithm::Diff->new( \@seq1, \@seq2, [ \%opts ] );
206           $pos        = $obj->Next(  [ $count : 1 ] );
207           $revPos     = $obj->Prev(  [ $count : 1 ] );
208           $obj        = $obj->Reset( [ $pos : 0 ] );
209           $copy       = $obj->Copy(  [ $pos, [ $newBase ] ] );
210           $oldBase    = $obj->Base(  [ $newBase ] );
211
212       Note that all of the following methods "die" if used on an object that
213       is "reset" (not currently pointing at any hunk).
214
215           $bits       = $obj->Diff(  );
216           @items|$cnt = $obj->Same(  );
217           @items|$cnt = $obj->Items( $seqNum );
218           @idxs |$cnt = $obj->Range( $seqNum, [ $base ] );
219           $minIdx     = $obj->Min(   $seqNum, [ $base ] );
220           $maxIdx     = $obj->Max(   $seqNum, [ $base ] );
221           @values     = $obj->Get(   @names );
222
223       Passing in "undef" for an optional argument is always treated the same
224       as if no argument were passed in.
225
226       "Next"
227               $pos = $diff->Next();    # Move forward 1 hunk
228               $pos = $diff->Next( 2 ); # Move forward 2 hunks
229               $pos = $diff->Next(-5);  # Move backward 5 hunks
230
231           "Next" moves the object to point at the next hunk.  The object
232           starts out "reset", which means it isn't pointing at any hunk.  If
233           the object is reset, then "Next()" moves to the first hunk.
234
235           "Next" returns a true value iff the move didn't go past the last
236           hunk.  So Next(0) will return true iff the object is not reset.
237
238           Actually, "Next" returns the object's new position, which is a
239           number between 1 and the number of hunks (inclusive), or returns a
240           false value.
241
242       "Prev"
243           "Prev($N)" is almost identical to "Next(-$N)"; it moves to the $Nth
244           previous hunk.  On a 'reset' object, "Prev()" [and "Next(-1)"] move
245           to the last hunk.
246
247           The position returned by "Prev" is relative to the end of the
248           hunks; -1 for the last hunk, -2 for the second-to-last, etc.
249
250       "Reset"
251               $diff->Reset();     # Reset the object's position
252               $diff->Reset($pos); # Move to the specified hunk
253               $diff->Reset(1);    # Move to the first hunk
254               $diff->Reset(-1);   # Move to the last hunk
255
256           "Reset" returns the object, so, for example, you could use
257           "$diff->Reset()->Next(-1)" to get the number of hunks.
258
259       "Copy"
260               $copy = $diff->Copy( $newPos, $newBase );
261
262           "Copy" returns a copy of the object.  The copy and the original
263           object share most of their data, so making copies takes very little
264           memory.  The copy maintains its own position (separate from the
265           original), which is the main purpose of copies.  It also maintains
266           its own base.
267
268           By default, the copy's position starts out the same as the original
269           object's position.  But "Copy" takes an optional first argument to
270           set the new position, so the following three snippets are
271           equivalent:
272
273               $copy = $diff->Copy($pos);
274
275               $copy = $diff->Copy();
276               $copy->Reset($pos);
277
278               $copy = $diff->Copy()->Reset($pos);
279
280           "Copy" takes an optional second argument to set the base for the
281           copy.  If you wish to change the base of the copy but leave the
282           position the same as in the original, here are two equivalent ways:
283
284               $copy = $diff->Copy();
285               $copy->Base( 0 );
286
287               $copy = $diff->Copy(undef,0);
288
289           Here are two equivalent way to get a "reset" copy:
290
291               $copy = $diff->Copy(0);
292
293               $copy = $diff->Copy()->Reset();
294
295       "Diff"
296               $bits = $obj->Diff();
297
298           "Diff" returns a true value iff the current hunk contains items
299           that are different between the two sequences.  It actually returns
300           one of the follow 4 values:
301
302           3   "3==(1|2)".  This hunk contains items from @seq1 and the items
303               from @seq2 that should replace them.  Both sequence 1 and 2
304               contain changed items so both the 1 and 2 bits are set.
305
306           2   This hunk only contains items from @seq2 that should be
307               inserted (not items from @seq1).  Only sequence 2 contains
308               changed items so only the 2 bit is set.
309
310           1   This hunk only contains items from @seq1 that should be deleted
311               (not items from @seq2).  Only sequence 1 contains changed items
312               so only the 1 bit is set.
313
314           0   This means that the items in this hunk are the same in both
315               sequences.  Neither sequence 1 nor 2 contain changed items so
316               neither the 1 nor the 2 bits are set.
317
318       "Same"
319           "Same" returns a true value iff the current hunk contains items
320           that are the same in both sequences.  It actually returns the list
321           of items if they are the same or an empty list if they aren't.  In
322           a scalar context, it returns the size of the list.
323
324       "Items"
325               $count = $diff->Items(2);
326               @items = $diff->Items($seqNum);
327
328           "Items" returns the (number of) items from the specified sequence
329           that are part of the current hunk.
330
331           If the current hunk contains only insertions, then
332           "$diff->Items(1)" will return an empty list (0 in a scalar
333           context).  If the current hunk contains only deletions, then
334           "$diff->Items(2)" will return an empty list (0 in a scalar
335           context).
336
337           If the hunk contains replacements, then both "$diff->Items(1)" and
338           "$diff->Items(2)" will return different, non-empty lists.
339
340           Otherwise, the hunk contains identical items and all of the
341           following will return the same lists:
342
343               @items = $diff->Items(1);
344               @items = $diff->Items(2);
345               @items = $diff->Same();
346
347       "Range"
348               $count = $diff->Range( $seqNum );
349               @indices = $diff->Range( $seqNum );
350               @indices = $diff->Range( $seqNum, $base );
351
352           "Range" is like "Items" except that it returns a list of indices to
353           the items rather than the items themselves.  By default, the index
354           of the first item (in each sequence) is 0 but this can be changed
355           by calling the "Base" method.  So, by default, the following two
356           snippets return the same lists:
357
358               @list = $diff->Items(2);
359               @list = @seq2[ $diff->Range(2) ];
360
361           You can also specify the base to use as the second argument.  So
362           the following two snippets always return the same lists:
363
364               @list = $diff->Items(1);
365               @list = @seq1[ $diff->Range(1,0) ];
366
367       "Base"
368               $curBase = $diff->Base();
369               $oldBase = $diff->Base($newBase);
370
371           "Base" sets and/or returns the current base (usually 0 or 1) that
372           is used when you request range information.  The base defaults to 0
373           so that range information is returned as array indices.  You can
374           set the base to 1 if you want to report traditional line numbers
375           instead.
376
377       "Min"
378               $min1 = $diff->Min(1);
379               $min = $diff->Min( $seqNum, $base );
380
381           "Min" returns the first value that "Range" would return (given the
382           same arguments) or returns "undef" if "Range" would return an empty
383           list.
384
385       "Max"
386           "Max" returns the last value that "Range" would return or "undef".
387
388       "Get"
389               ( $n, $x, $r ) = $diff->Get(qw( min1 max1 range1 ));
390               @values = $diff->Get(qw( 0min2 1max2 range2 same base ));
391
392           "Get" returns one or more scalar values.  You pass in a list of the
393           names of the values you want returned.  Each name must match one of
394           the following regexes:
395
396               /^(-?\d+)?(min|max)[12]$/i
397               /^(range[12]|same|diff|base)$/i
398
399           The 1 or 2 after a name says which sequence you want the
400           information for (and where allowed, it is required).  The optional
401           number before "min" or "max" is the base to use.  So the following
402           equalities hold:
403
404               $diff->Get('min1') == $diff->Min(1)
405               $diff->Get('0min2') == $diff->Min(2,0)
406
407           Using "Get" in a scalar context when you've passed in more than one
408           name is a fatal error ("die" is called).
409
410   "prepare"
411       Given a reference to a list of items, "prepare" returns a reference to
412       a hash which can be used when comparing this sequence to other
413       sequences with "LCS" or "LCS_length".
414
415           $prep = prepare( \@seq1 );
416           for $i ( 0 .. 10_000 )
417           {
418               @lcs = LCS( $prep, $seq[$i] );
419               # do something useful with @lcs
420           }
421
422       "prepare" may be passed an optional third parameter; this is a CODE
423       reference to a key generation function.  See "KEY GENERATION
424       FUNCTIONS".
425
426           $prep = prepare( \@seq1, \&keyGen );
427           for $i ( 0 .. 10_000 )
428           {
429               @lcs = LCS( $seq[$i], $prep, \&keyGen );
430               # do something useful with @lcs
431           }
432
433       Using "prepare" provides a performance gain of about 50% when calling
434       LCS many times compared with not preparing.
435
436   "diff"
437           @diffs     = diff( \@seq1, \@seq2 );
438           $diffs_ref = diff( \@seq1, \@seq2 );
439
440       "diff" computes the smallest set of additions and deletions necessary
441       to turn the first sequence into the second, and returns a description
442       of these changes.  The description is a list of hunks; each hunk
443       represents a contiguous section of items which should be added,
444       deleted, or replaced.  (Hunks containing unchanged items are not
445       included.)
446
447       The return value of "diff" is a list of hunks, or, in scalar context, a
448       reference to such a list.  If there are no differences, the list will
449       be empty.
450
451       Here is an example.  Calling "diff" for the following two sequences:
452
453           a b c e h j l m n p
454           b c d e f j k l m r s t
455
456       would produce the following list:
457
458           (
459             [ [ '-', 0, 'a' ] ],
460
461             [ [ '+', 2, 'd' ] ],
462
463             [ [ '-', 4, 'h' ],
464               [ '+', 4, 'f' ] ],
465
466             [ [ '+', 6, 'k' ] ],
467
468             [ [ '-',  8, 'n' ],
469               [ '-',  9, 'p' ],
470               [ '+',  9, 'r' ],
471               [ '+', 10, 's' ],
472               [ '+', 11, 't' ] ],
473           )
474
475       There are five hunks here.  The first hunk says that the "a" at
476       position 0 of the first sequence should be deleted ("-").  The second
477       hunk says that the "d" at position 2 of the second sequence should be
478       inserted ("+").  The third hunk says that the "h" at position 4 of the
479       first sequence should be removed and replaced with the "f" from
480       position 4 of the second sequence.  And so on.
481
482       "diff" may be passed an optional third parameter; this is a CODE
483       reference to a key generation function.  See "KEY GENERATION
484       FUNCTIONS".
485
486       Additional parameters, if any, will be passed to the key generation
487       routine.
488
489   "sdiff"
490           @sdiffs     = sdiff( \@seq1, \@seq2 );
491           $sdiffs_ref = sdiff( \@seq1, \@seq2 );
492
493       "sdiff" computes all necessary components to show two sequences and
494       their minimized differences side by side, just like the Unix-utility
495       sdiff does:
496
497           same             same
498           before     |     after
499           old        <     -
500           -          >     new
501
502       It returns a list of array refs, each pointing to an array of display
503       instructions. In scalar context it returns a reference to such a list.
504       If there are no differences, the list will have one entry per item,
505       each indicating that the item was unchanged.
506
507       Display instructions consist of three elements: A modifier indicator
508       ("+": Element added, "-": Element removed, "u": Element unmodified,
509       "c": Element changed) and the value of the old and new elements, to be
510       displayed side-by-side.
511
512       An "sdiff" of the following two sequences:
513
514           a b c e h j l m n p
515           b c d e f j k l m r s t
516
517       results in
518
519           ( [ '-', 'a', ''  ],
520             [ 'u', 'b', 'b' ],
521             [ 'u', 'c', 'c' ],
522             [ '+', '',  'd' ],
523             [ 'u', 'e', 'e' ],
524             [ 'c', 'h', 'f' ],
525             [ 'u', 'j', 'j' ],
526             [ '+', '',  'k' ],
527             [ 'u', 'l', 'l' ],
528             [ 'u', 'm', 'm' ],
529             [ 'c', 'n', 'r' ],
530             [ 'c', 'p', 's' ],
531             [ '+', '',  't' ],
532           )
533
534       "sdiff" may be passed an optional third parameter; this is a CODE
535       reference to a key generation function.  See "KEY GENERATION
536       FUNCTIONS".
537
538       Additional parameters, if any, will be passed to the key generation
539       routine.
540
541   "compact_diff"
542       "compact_diff" is much like "sdiff" except it returns a much more
543       compact description consisting of just one flat list of indices.  An
544       example helps explain the format:
545
546           my @a = qw( a b c   e  h j   l m n p      );
547           my @b = qw(   b c d e f  j k l m    r s t );
548           @cdiff = compact_diff( \@a, \@b );
549           # Returns:
550           #   @a      @b       @a       @b
551           #  start   start   values   values
552           (    0,      0,   #       =
553                0,      0,   #    a  !
554                1,      0,   #  b c  =  b c
555                3,      2,   #       !  d
556                3,      3,   #    e  =  e
557                4,      4,   #    f  !  h
558                5,      5,   #    j  =  j
559                6,      6,   #       !  k
560                6,      7,   #  l m  =  l m
561                8,      9,   #  n p  !  r s t
562               10,     12,   #
563           );
564
565       The 0th, 2nd, 4th, etc. entries are all indices into @seq1 (@a in the
566       above example) indicating where a hunk begins.  The 1st, 3rd, 5th, etc.
567       entries are all indices into @seq2 (@b in the above example) indicating
568       where the same hunk begins.
569
570       So each pair of indices (except the last pair) describes where a hunk
571       begins (in each sequence).  Since each hunk must end at the item just
572       before the item that starts the next hunk, the next pair of indices can
573       be used to determine where the hunk ends.
574
575       So, the first 4 entries (0..3) describe the first hunk.  Entries 0 and
576       1 describe where the first hunk begins (and so are always both 0).
577       Entries 2 and 3 describe where the next hunk begins, so subtracting 1
578       from each tells us where the first hunk ends.  That is, the first hunk
579       contains items $diff[0] through "$diff[2] - 1" of the first sequence
580       and contains items $diff[1] through "$diff[3] - 1" of the second
581       sequence.
582
583       In other words, the first hunk consists of the following two lists of
584       items:
585
586                      #  1st pair     2nd pair
587                      # of indices   of indices
588           @list1 = @a[ $cdiff[0] .. $cdiff[2]-1 ];
589           @list2 = @b[ $cdiff[1] .. $cdiff[3]-1 ];
590                      # Hunk start   Hunk end
591
592       Note that the hunks will always alternate between those that are part
593       of the LCS (those that contain unchanged items) and those that contain
594       changes.  This means that all we need to be told is whether the first
595       hunk is a 'same' or 'diff' hunk and we can determine which of the other
596       hunks contain 'same' items or 'diff' items.
597
598       By convention, we always make the first hunk contain unchanged items.
599       So the 1st, 3rd, 5th, etc. hunks (all odd-numbered hunks if you start
600       counting from 1) all contain unchanged items.  And the 2nd, 4th, 6th,
601       etc. hunks (all even-numbered hunks if you start counting from 1) all
602       contain changed items.
603
604       Since @a and @b don't begin with the same value, the first hunk in our
605       example is empty (otherwise we'd violate the above convention).  Note
606       that the first 4 index values in our example are all zero.  Plug these
607       values into our previous code block and we get:
608
609           @hunk1a = @a[ 0 .. 0-1 ];
610           @hunk1b = @b[ 0 .. 0-1 ];
611
612       And "0..-1" returns the empty list.
613
614       Move down one pair of indices (2..5) and we get the offset ranges for
615       the second hunk, which contains changed items.
616
617       Since @diff[2..5] contains (0,0,1,0) in our example, the second hunk
618       consists of these two lists of items:
619
620               @hunk2a = @a[ $cdiff[2] .. $cdiff[4]-1 ];
621               @hunk2b = @b[ $cdiff[3] .. $cdiff[5]-1 ];
622           # or
623               @hunk2a = @a[ 0 .. 1-1 ];
624               @hunk2b = @b[ 0 .. 0-1 ];
625           # or
626               @hunk2a = @a[ 0 .. 0 ];
627               @hunk2b = @b[ 0 .. -1 ];
628           # or
629               @hunk2a = ( 'a' );
630               @hunk2b = ( );
631
632       That is, we would delete item 0 ('a') from @a.
633
634       Since @diff[4..7] contains (1,0,3,2) in our example, the third hunk
635       consists of these two lists of items:
636
637               @hunk3a = @a[ $cdiff[4] .. $cdiff[6]-1 ];
638               @hunk3a = @b[ $cdiff[5] .. $cdiff[7]-1 ];
639           # or
640               @hunk3a = @a[ 1 .. 3-1 ];
641               @hunk3a = @b[ 0 .. 2-1 ];
642           # or
643               @hunk3a = @a[ 1 .. 2 ];
644               @hunk3a = @b[ 0 .. 1 ];
645           # or
646               @hunk3a = qw( b c );
647               @hunk3a = qw( b c );
648
649       Note that this third hunk contains unchanged items as our convention
650       demands.
651
652       You can continue this process until you reach the last two indices,
653       which will always be the number of items in each sequence.  This is
654       required so that subtracting one from each will give you the indices to
655       the last items in each sequence.
656
657   "traverse_sequences"
658       "traverse_sequences" used to be the most general facility provided by
659       this module (the new OO interface is more powerful and much easier to
660       use).
661
662       Imagine that there are two arrows.  Arrow A points to an element of
663       sequence A, and arrow B points to an element of the sequence B.
664       Initially, the arrows point to the first elements of the respective
665       sequences.  "traverse_sequences" will advance the arrows through the
666       sequences one element at a time, calling an appropriate user-specified
667       callback function before each advance.  It will advance the arrows in
668       such a way that if there are equal elements $A[$i] and $B[$j] which are
669       equal and which are part of the LCS, there will be some moment during
670       the execution of "traverse_sequences" when arrow A is pointing to
671       $A[$i] and arrow B is pointing to $B[$j].  When this happens,
672       "traverse_sequences" will call the "MATCH" callback function and then
673       it will advance both arrows.
674
675       Otherwise, one of the arrows is pointing to an element of its sequence
676       that is not part of the LCS.  "traverse_sequences" will advance that
677       arrow and will call the "DISCARD_A" or the "DISCARD_B" callback,
678       depending on which arrow it advanced.  If both arrows point to elements
679       that are not part of the LCS, then "traverse_sequences" will advance
680       one of them and call the appropriate callback, but it is not specified
681       which it will call.
682
683       The arguments to "traverse_sequences" are the two sequences to
684       traverse, and a hash which specifies the callback functions, like this:
685
686           traverse_sequences(
687               \@seq1, \@seq2,
688               {   MATCH => $callback_1,
689                   DISCARD_A => $callback_2,
690                   DISCARD_B => $callback_3,
691               }
692           );
693
694       Callbacks for MATCH, DISCARD_A, and DISCARD_B are invoked with at least
695       the indices of the two arrows as their arguments.  They are not
696       expected to return any values.  If a callback is omitted from the
697       table, it is not called.
698
699       Callbacks for A_FINISHED and B_FINISHED are invoked with at least the
700       corresponding index in A or B.
701
702       If arrow A reaches the end of its sequence, before arrow B does,
703       "traverse_sequences" will call the "A_FINISHED" callback when it
704       advances arrow B, if there is such a function; if not it will call
705       "DISCARD_B" instead.  Similarly if arrow B finishes first.
706       "traverse_sequences" returns when both arrows are at the ends of their
707       respective sequences.  It returns true on success and false on failure.
708       At present there is no way to fail.
709
710       "traverse_sequences" may be passed an optional fourth parameter; this
711       is a CODE reference to a key generation function.  See "KEY GENERATION
712       FUNCTIONS".
713
714       Additional parameters, if any, will be passed to the key generation
715       function.
716
717       If you want to pass additional parameters to your callbacks, but don't
718       need a custom key generation function, you can get the default by
719       passing undef:
720
721           traverse_sequences(
722               \@seq1, \@seq2,
723               {   MATCH => $callback_1,
724                   DISCARD_A => $callback_2,
725                   DISCARD_B => $callback_3,
726               },
727               undef,     # default key-gen
728               $myArgument1,
729               $myArgument2,
730               $myArgument3,
731           );
732
733       "traverse_sequences" does not have a useful return value; you are
734       expected to plug in the appropriate behavior with the callback
735       functions.
736
737   "traverse_balanced"
738       "traverse_balanced" is an alternative to "traverse_sequences". It uses
739       a different algorithm to iterate through the entries in the computed
740       LCS. Instead of sticking to one side and showing element changes as
741       insertions and deletions only, it will jump back and forth between the
742       two sequences and report changes occurring as deletions on one side
743       followed immediately by an insertion on the other side.
744
745       In addition to the "DISCARD_A", "DISCARD_B", and "MATCH" callbacks
746       supported by "traverse_sequences", "traverse_balanced" supports a
747       "CHANGE" callback indicating that one element got "replaced" by
748       another:
749
750           traverse_balanced(
751               \@seq1, \@seq2,
752               {   MATCH => $callback_1,
753                   DISCARD_A => $callback_2,
754                   DISCARD_B => $callback_3,
755                   CHANGE    => $callback_4,
756               }
757           );
758
759       If no "CHANGE" callback is specified, "traverse_balanced" will map
760       "CHANGE" events to "DISCARD_A" and "DISCARD_B" actions, therefore
761       resulting in a similar behaviour as "traverse_sequences" with different
762       order of events.
763
764       "traverse_balanced" might be a bit slower than "traverse_sequences",
765       noticeable only while processing huge amounts of data.
766
767       The "sdiff" function of this module is implemented as call to
768       "traverse_balanced".
769
770       "traverse_balanced" does not have a useful return value; you are
771       expected to plug in the appropriate behavior with the callback
772       functions.
773

KEY GENERATION FUNCTIONS

775       Most of the functions accept an optional extra parameter.  This is a
776       CODE reference to a key generating (hashing) function that should
777       return a string that uniquely identifies a given element.  It should be
778       the case that if two elements are to be considered equal, their keys
779       should be the same (and the other way around).  If no key generation
780       function is provided, the key will be the element as a string.
781
782       By default, comparisons will use "eq" and elements will be turned into
783       keys using the default stringizing operator '""'.
784
785       Where this is important is when you're comparing something other than
786       strings.  If it is the case that you have multiple different objects
787       that should be considered to be equal, you should supply a key
788       generation function. Otherwise, you have to make sure that your arrays
789       contain unique references.
790
791       For instance, consider this example:
792
793           package Person;
794
795           sub new
796           {
797               my $package = shift;
798               return bless { name => '', ssn => '', @_ }, $package;
799           }
800
801           sub clone
802           {
803               my $old = shift;
804               my $new = bless { %$old }, ref($old);
805           }
806
807           sub hash
808           {
809               return shift()->{'ssn'};
810           }
811
812           my $person1 = Person->new( name => 'Joe', ssn => '123-45-6789' );
813           my $person2 = Person->new( name => 'Mary', ssn => '123-47-0000' );
814           my $person3 = Person->new( name => 'Pete', ssn => '999-45-2222' );
815           my $person4 = Person->new( name => 'Peggy', ssn => '123-45-9999' );
816           my $person5 = Person->new( name => 'Frank', ssn => '000-45-9999' );
817
818       If you did this:
819
820           my $array1 = [ $person1, $person2, $person4 ];
821           my $array2 = [ $person1, $person3, $person4, $person5 ];
822           Algorithm::Diff::diff( $array1, $array2 );
823
824       everything would work out OK (each of the objects would be converted
825       into a string like "Person=HASH(0x82425b0)" for comparison).
826
827       But if you did this:
828
829           my $array1 = [ $person1, $person2, $person4 ];
830           my $array2 = [ $person1, $person3, $person4->clone(), $person5 ];
831           Algorithm::Diff::diff( $array1, $array2 );
832
833       $person4 and $person4->clone() (which have the same name and SSN) would
834       be seen as different objects. If you wanted them to be considered
835       equivalent, you would have to pass in a key generation function:
836
837           my $array1 = [ $person1, $person2, $person4 ];
838           my $array2 = [ $person1, $person3, $person4->clone(), $person5 ];
839           Algorithm::Diff::diff( $array1, $array2, \&Person::hash );
840
841       This would use the 'ssn' field in each Person as a comparison key, and
842       so would consider $person4 and $person4->clone() as equal.
843
844       You may also pass additional parameters to the key generation function
845       if you wish.
846

ERROR CHECKING

848       If you pass these routines a non-reference and they expect a reference,
849       they will die with a message.
850

AUTHOR

852       This version released by Tye McQueen (http://perlmonks.org/?node=tye).
853

LICENSE

855       Parts Copyright (c) 2000-2004 Ned Konz.  All rights reserved.  Parts by
856       Tye McQueen.
857
858       This program is free software; you can redistribute it and/or modify it
859       under the same terms as Perl.
860

MAILING LIST

862       Mark-Jason still maintains a mailing list.  To join a low-volume
863       mailing list for announcements related to diff and Algorithm::Diff,
864       send an empty mail message to mjd-perl-diff-request@plover.com.
865

CREDITS

867       Versions through 0.59 (and much of this documentation) were written by:
868
869       Mark-Jason Dominus
870
871       This version borrows some documentation and routine names from Mark-
872       Jason's, but Diff.pm's code was completely replaced.
873
874       This code was adapted from the Smalltalk code of Mario Wolczko
875       <mario@wolczko.com>, which is available at
876       ftp://st.cs.uiuc.edu/pub/Smalltalk/MANCHESTER/manchester/4.0/diff.st
877
878       "sdiff" and "traverse_balanced" were written by Mike Schilli
879       <m@perlmeister.com>.
880
881       The algorithm is that described in A Fast Algorithm for Computing
882       Longest Common Subsequences, CACM, vol.20, no.5, pp.350-353, May 1977,
883       with a few minor improvements to improve the speed.
884
885       Much work was done by Ned Konz (perl@bike-nomad.com).
886
887       The OO interface and some other changes are by Tye McQueen.
888
889
890
891perl v5.34.0                      2022-01-20                Algorithm::Diff(3)