1Algorithm::Diff(3) User Contributed Perl Documentation Algorithm::Diff(3)
2
3
4
6 Algorithm::Diff - Compute `intelligent' differences between two files /
7 lists
8
10 require Algorithm::Diff;
11
12 # This example produces traditional 'diff' output:
13
14 my $diff = Algorithm::Diff->new( \@seq1, \@seq2 );
15
16 $diff->Base( 1 ); # Return line numbers, not indices
17 while( $diff->Next() ) {
18 next if $diff->Same();
19 my $sep = '';
20 if( ! $diff->Items(2) ) {
21 printf "%d,%dd%d\n",
22 $diff->Get(qw( Min1 Max1 Max2 ));
23 } elsif( ! $diff->Items(1) ) {
24 printf "%da%d,%d\n",
25 $diff->Get(qw( Max1 Min2 Max2 ));
26 } else {
27 $sep = "---\n";
28 printf "%d,%dc%d,%d\n",
29 $diff->Get(qw( Min1 Max1 Min2 Max2 ));
30 }
31 print "< $_" for $diff->Items(1);
32 print $sep;
33 print "> $_" for $diff->Items(2);
34 }
35
36
37 # Alternate interfaces:
38
39 use Algorithm::Diff qw(
40 LCS LCS_length LCSidx
41 diff sdiff compact_diff
42 traverse_sequences traverse_balanced );
43
44 @lcs = LCS( \@seq1, \@seq2 );
45 $lcsref = LCS( \@seq1, \@seq2 );
46 $count = LCS_length( \@seq1, \@seq2 );
47
48 ( $seq1idxref, $seq2idxref ) = LCSidx( \@seq1, \@seq2 );
49
50
51 # Complicated interfaces:
52
53 @diffs = diff( \@seq1, \@seq2 );
54
55 @sdiffs = sdiff( \@seq1, \@seq2 );
56
57 @cdiffs = compact_diff( \@seq1, \@seq2 );
58
59 traverse_sequences(
60 \@seq1,
61 \@seq2,
62 { MATCH => \&callback1,
63 DISCARD_A => \&callback2,
64 DISCARD_B => \&callback3,
65 },
66 \&key_generator,
67 @extra_args,
68 );
69
70 traverse_balanced(
71 \@seq1,
72 \@seq2,
73 { MATCH => \&callback1,
74 DISCARD_A => \&callback2,
75 DISCARD_B => \&callback3,
76 CHANGE => \&callback4,
77 },
78 \&key_generator,
79 @extra_args,
80 );
81
83 (by Mark-Jason Dominus)
84
85 I once read an article written by the authors of "diff"; they said that
86 they worked very hard on the algorithm until they found the right one.
87
88 I think what they ended up using (and I hope someone will correct me,
89 because I am not very confident about this) was the `longest common
90 subsequence' method. In the LCS problem, you have two sequences of
91 items:
92
93 a b c d f g h j q z
94
95 a b c d e f g i j k r x y z
96
97 and you want to find the longest sequence of items that is present in
98 both original sequences in the same order. That is, you want to find a
99 new sequence S which can be obtained from the first sequence by
100 deleting some items, and from the second sequence by deleting other
101 items. You also want S to be as long as possible. In this case S is
102
103 a b c d f g j z
104
105 From there it's only a small step to get diff-like output:
106
107 e h i k q r x y
108 + - + + - + + +
109
110 This module solves the LCS problem. It also includes a canned function
111 to generate "diff"-like output.
112
113 It might seem from the example above that the LCS of two sequences is
114 always pretty obvious, but that's not always the case, especially when
115 the two sequences have many repeated elements. For example, consider
116
117 a x b y c z p d q
118 a b c a x b y c z
119
120 A naive approach might start by matching up the "a" and "b" that appear
121 at the beginning of each sequence, like this:
122
123 a x b y c z p d q
124 a b c a b y c z
125
126 This finds the common subsequence "a b c z". But actually, the LCS is
127 "a x b y c z":
128
129 a x b y c z p d q
130 a b c a x b y c z
131
132 or
133
134 a x b y c z p d q
135 a b c a x b y c z
136
138 (See also the README file and several example scripts include with this
139 module.)
140
141 This module now provides an object-oriented interface that uses less
142 memory and is easier to use than most of the previous procedural
143 interfaces. It also still provides several exportable functions.
144 We'll deal with these in ascending order of difficulty: "LCS",
145 "LCS_length", "LCSidx", OO interface, "prepare", "diff", "sdiff",
146 "traverse_sequences", and "traverse_balanced".
147
148 "LCS"
149 Given references to two lists of items, LCS returns an array containing
150 their longest common subsequence. In scalar context, it returns a
151 reference to such a list.
152
153 @lcs = LCS( \@seq1, \@seq2 );
154 $lcsref = LCS( \@seq1, \@seq2 );
155
156 "LCS" may be passed an optional third parameter; this is a CODE
157 reference to a key generation function. See "KEY GENERATION
158 FUNCTIONS".
159
160 @lcs = LCS( \@seq1, \@seq2, \&keyGen, @args );
161 $lcsref = LCS( \@seq1, \@seq2, \&keyGen, @args );
162
163 Additional parameters, if any, will be passed to the key generation
164 routine.
165
166 "LCS_length"
167 This is just like "LCS" except it only returns the length of the
168 longest common subsequence. This provides a performance gain of about
169 9% compared to "LCS".
170
171 "LCSidx"
172 Like "LCS" except it returns references to two arrays. The first array
173 contains the indices into @seq1 where the LCS items are located. The
174 second array contains the indices into @seq2 where the LCS items are
175 located.
176
177 Therefore, the following three lists will contain the same values:
178
179 my( $idx1, $idx2 ) = LCSidx( \@seq1, \@seq2 );
180 my @list1 = @seq1[ @$idx1 ];
181 my @list2 = @seq2[ @$idx2 ];
182 my @list3 = LCS( \@seq1, \@seq2 );
183
184 "new"
185 $diff = Algorithm::Diffs->new( \@seq1, \@seq2 );
186 $diff = Algorithm::Diffs->new( \@seq1, \@seq2, \%opts );
187
188 "new" computes the smallest set of additions and deletions necessary to
189 turn the first sequence into the second and compactly records them in
190 the object.
191
192 You use the object to iterate over hunks, where each hunk represents a
193 contiguous section of items which should be added, deleted, replaced,
194 or left unchanged.
195
196 The following summary of all of the methods looks a lot like Perl
197 code but some of the symbols have different meanings:
198
199 [ ] Encloses optional arguments
200 : Is followed by the default value for an optional argument
201 | Separates alternate return results
202
203 Method summary:
204
205 $obj = Algorithm::Diff->new( \@seq1, \@seq2, [ \%opts ] );
206 $pos = $obj->Next( [ $count : 1 ] );
207 $revPos = $obj->Prev( [ $count : 1 ] );
208 $obj = $obj->Reset( [ $pos : 0 ] );
209 $copy = $obj->Copy( [ $pos, [ $newBase ] ] );
210 $oldBase = $obj->Base( [ $newBase ] );
211
212 Note that all of the following methods "die" if used on an object
213 that is "reset" (not currently pointing at any hunk).
214
215 $bits = $obj->Diff( );
216 @items|$cnt = $obj->Same( );
217 @items|$cnt = $obj->Items( $seqNum );
218 @idxs |$cnt = $obj->Range( $seqNum, [ $base ] );
219 $minIdx = $obj->Min( $seqNum, [ $base ] );
220 $maxIdx = $obj->Max( $seqNum, [ $base ] );
221 @values = $obj->Get( @names );
222
223 Passing in "undef" for an optional argument is always treated the
224 same as if no argument were passed in.
225
226 "Next"
227
228 $pos = $diff->Next(); # Move forward 1 hunk
229 $pos = $diff->Next( 2 ); # Move forward 2 hunks
230 $pos = $diff->Next(-5); # Move backward 5 hunks
231
232 "Next" moves the object to point at the next hunk. The object
233 starts out "reset", which means it isn't pointing at any hunk. If
234 the object is reset, then "Next()" moves to the first hunk.
235
236 "Next" returns a true value iff the move didn't go past the last
237 hunk. So Next(0) will return true iff the object is not reset.
238
239 Actually, "Next" returns the object's new position, which is a
240 number between 1 and the number of hunks (inclusive), or returns a
241 false value.
242
243 "Prev"
244
245 "Prev($N)" is almost identical to "Next(-$N)"; it moves to the $Nth
246 previous hunk. On a 'reset' object, "Prev()" [and "Next(-1)"] move
247 to the last hunk.
248
249 The position returned by "Prev" is relative to the end of the
250 hunks; -1 for the last hunk, -2 for the second-to-last, etc.
251
252 "Reset"
253
254 $diff->Reset(); # Reset the object's position
255 $diff->Reset($pos); # Move to the specified hunk
256 $diff->Reset(1); # Move to the first hunk
257 $diff->Reset(-1); # Move to the last hunk
258
259 "Reset" returns the object, so, for example, you could use
260 "$diff->Reset()->Next(-1)" to get the number of hunks.
261
262 "Copy"
263
264 $copy = $diff->Copy( $newPos, $newBase );
265
266 "Copy" returns a copy of the object. The copy and the original
267 object share most of their data, so making copies takes very little
268 memory. The copy maintains its own position (separate from the
269 original), which is the main purpose of copies. It also maintains
270 its own base.
271
272 By default, the copy's position starts out the same as the original
273 object's position. But "Copy" takes an optional first argument to
274 set the new position, so the following three snippets are
275 equivalent:
276
277 $copy = $diff->Copy($pos);
278
279 $copy = $diff->Copy();
280 $copy->Reset($pos);
281
282 $copy = $diff->Copy()->Reset($pos);
283
284 "Copy" takes an optional second argument to set the base for the
285 copy. If you wish to change the base of the copy but leave the
286 position the same as in the original, here are two equivalent ways:
287
288 $copy = $diff->Copy();
289 $copy->Base( 0 );
290
291 $copy = $diff->Copy(undef,0);
292
293 Here are two equivalent way to get a "reset" copy:
294
295 $copy = $diff->Copy(0);
296
297 $copy = $diff->Copy()->Reset();
298
299 "Diff"
300
301 $bits = $obj->Diff();
302
303 "Diff" returns a true value iff the current hunk contains items
304 that are different between the two sequences. It actually returns
305 one of the follow 4 values:
306
307 3 "3==(1|2)". This hunk contains items from @seq1 and the items
308 from @seq2 that should replace them. Both sequence 1 and 2
309 contain changed items so both the 1 and 2 bits are set.
310
311 2 This hunk only contains items from @seq2 that should be
312 inserted (not items from @seq1). Only sequence 2 contains
313 changed items so only the 2 bit is set.
314
315 1 This hunk only contains items from @seq1 that should be deleted
316 (not items from @seq2). Only sequence 1 contains changed items
317 so only the 1 bit is set.
318
319 0 This means that the items in this hunk are the same in both
320 sequences. Neither sequence 1 nor 2 contain changed items so
321 neither the 1 nor the 2 bits are set.
322
323 "Same"
324
325 "Same" returns a true value iff the current hunk contains items
326 that are the same in both sequences. It actually returns the list
327 of items if they are the same or an empty list if they aren't. In
328 a scalar context, it returns the size of the list.
329
330 "Items"
331
332 $count = $diff->Items(2);
333 @items = $diff->Items($seqNum);
334
335 "Items" returns the (number of) items from the specified sequence
336 that are part of the current hunk.
337
338 If the current hunk contains only insertions, then
339 "$diff->Items(1)" will return an empty list (0 in a scalar
340 context). If the current hunk contains only deletions, then
341 "$diff->Items(2)" will return an empty list (0 in a scalar
342 context).
343
344 If the hunk contains replacements, then both "$diff->Items(1)" and
345 "$diff->Items(2)" will return different, non-empty lists.
346
347 Otherwise, the hunk contains identical items and all of the
348 following will return the same lists:
349
350 @items = $diff->Items(1);
351 @items = $diff->Items(2);
352 @items = $diff->Same();
353
354 "Range"
355
356 $count = $diff->Range( $seqNum );
357 @indices = $diff->Range( $seqNum );
358 @indices = $diff->Range( $seqNum, $base );
359
360 "Range" is like "Items" except that it returns a list of indices to
361 the items rather than the items themselves. By default, the index
362 of the first item (in each sequence) is 0 but this can be changed
363 by calling the "Base" method. So, by default, the following two
364 snippets return the same lists:
365
366 @list = $diff->Items(2);
367 @list = @seq2[ $diff->Range(2) ];
368
369 You can also specify the base to use as the second argument. So
370 the following two snippets always return the same lists:
371
372 @list = $diff->Items(1);
373 @list = @seq1[ $diff->Range(1,0) ];
374
375 "Base"
376
377 $curBase = $diff->Base();
378 $oldBase = $diff->Base($newBase);
379
380 "Base" sets and/or returns the current base (usually 0 or 1) that
381 is used when you request range information. The base defaults to 0
382 so that range information is returned as array indices. You can
383 set the base to 1 if you want to report traditional line numbers
384 instead.
385
386 "Min"
387
388 $min1 = $diff->Min(1);
389 $min = $diff->Min( $seqNum, $base );
390
391 "Min" returns the first value that "Range" would return (given the
392 same arguments) or returns "undef" if "Range" would return an empty
393 list.
394
395 "Max"
396
397 "Max" returns the last value that "Range" would return or "undef".
398
399 "Get"
400
401 ( $n, $x, $r ) = $diff->Get(qw( min1 max1 range1 ));
402 @values = $diff->Get(qw( 0min2 1max2 range2 same base ));
403
404 "Get" returns one or more scalar values. You pass in a list of the
405 names of the values you want returned. Each name must match one of
406 the following regexes:
407
408 /^(-?\d+)?(min|max)[12]$/i
409 /^(range[12]|same|diff|base)$/i
410
411 The 1 or 2 after a name says which sequence you want the
412 information for (and where allowed, it is required). The optional
413 number before "min" or "max" is the base to use. So the following
414 equalities hold:
415
416 $diff->Get('min1') == $diff->Min(1)
417 $diff->Get('0min2') == $diff->Min(2,0)
418
419 Using "Get" in a scalar context when you've passed in more than one
420 name is a fatal error ("die" is called).
421
422 "prepare"
423 Given a reference to a list of items, "prepare" returns a reference to
424 a hash which can be used when comparing this sequence to other
425 sequences with "LCS" or "LCS_length".
426
427 $prep = prepare( \@seq1 );
428 for $i ( 0 .. 10_000 )
429 {
430 @lcs = LCS( $prep, $seq[$i] );
431 # do something useful with @lcs
432 }
433
434 "prepare" may be passed an optional third parameter; this is a CODE
435 reference to a key generation function. See "KEY GENERATION
436 FUNCTIONS".
437
438 $prep = prepare( \@seq1, \&keyGen );
439 for $i ( 0 .. 10_000 )
440 {
441 @lcs = LCS( $seq[$i], $prep, \&keyGen );
442 # do something useful with @lcs
443 }
444
445 Using "prepare" provides a performance gain of about 50% when calling
446 LCS many times compared with not preparing.
447
448 "diff"
449 @diffs = diff( \@seq1, \@seq2 );
450 $diffs_ref = diff( \@seq1, \@seq2 );
451
452 "diff" computes the smallest set of additions and deletions necessary
453 to turn the first sequence into the second, and returns a description
454 of these changes. The description is a list of hunks; each hunk
455 represents a contiguous section of items which should be added,
456 deleted, or replaced. (Hunks containing unchanged items are not
457 included.)
458
459 The return value of "diff" is a list of hunks, or, in scalar context, a
460 reference to such a list. If there are no differences, the list will
461 be empty.
462
463 Here is an example. Calling "diff" for the following two sequences:
464
465 a b c e h j l m n p
466 b c d e f j k l m r s t
467
468 would produce the following list:
469
470 (
471 [ [ '-', 0, 'a' ] ],
472
473 [ [ '+', 2, 'd' ] ],
474
475 [ [ '-', 4, 'h' ],
476 [ '+', 4, 'f' ] ],
477
478 [ [ '+', 6, 'k' ] ],
479
480 [ [ '-', 8, 'n' ],
481 [ '-', 9, 'p' ],
482 [ '+', 9, 'r' ],
483 [ '+', 10, 's' ],
484 [ '+', 11, 't' ] ],
485 )
486
487 There are five hunks here. The first hunk says that the "a" at
488 position 0 of the first sequence should be deleted ("-"). The second
489 hunk says that the "d" at position 2 of the second sequence should be
490 inserted ("+"). The third hunk says that the "h" at position 4 of the
491 first sequence should be removed and replaced with the "f" from
492 position 4 of the second sequence. And so on.
493
494 "diff" may be passed an optional third parameter; this is a CODE
495 reference to a key generation function. See "KEY GENERATION
496 FUNCTIONS".
497
498 Additional parameters, if any, will be passed to the key generation
499 routine.
500
501 "sdiff"
502 @sdiffs = sdiff( \@seq1, \@seq2 );
503 $sdiffs_ref = sdiff( \@seq1, \@seq2 );
504
505 "sdiff" computes all necessary components to show two sequences and
506 their minimized differences side by side, just like the Unix-utility
507 sdiff does:
508
509 same same
510 before | after
511 old < -
512 - > new
513
514 It returns a list of array refs, each pointing to an array of display
515 instructions. In scalar context it returns a reference to such a list.
516 If there are no differences, the list will have one entry per item,
517 each indicating that the item was unchanged.
518
519 Display instructions consist of three elements: A modifier indicator
520 ("+": Element added, "-": Element removed, "u": Element unmodified,
521 "c": Element changed) and the value of the old and new elements, to be
522 displayed side-by-side.
523
524 An "sdiff" of the following two sequences:
525
526 a b c e h j l m n p
527 b c d e f j k l m r s t
528
529 results in
530
531 ( [ '-', 'a', '' ],
532 [ 'u', 'b', 'b' ],
533 [ 'u', 'c', 'c' ],
534 [ '+', '', 'd' ],
535 [ 'u', 'e', 'e' ],
536 [ 'c', 'h', 'f' ],
537 [ 'u', 'j', 'j' ],
538 [ '+', '', 'k' ],
539 [ 'u', 'l', 'l' ],
540 [ 'u', 'm', 'm' ],
541 [ 'c', 'n', 'r' ],
542 [ 'c', 'p', 's' ],
543 [ '+', '', 't' ],
544 )
545
546 "sdiff" may be passed an optional third parameter; this is a CODE
547 reference to a key generation function. See "KEY GENERATION
548 FUNCTIONS".
549
550 Additional parameters, if any, will be passed to the key generation
551 routine.
552
553 "compact_diff"
554 "compact_diff" is much like "sdiff" except it returns a much more
555 compact description consisting of just one flat list of indices. An
556 example helps explain the format:
557
558 my @a = qw( a b c e h j l m n p );
559 my @b = qw( b c d e f j k l m r s t );
560 @cdiff = compact_diff( \@a, \@b );
561 # Returns:
562 # @a @b @a @b
563 # start start values values
564 ( 0, 0, # =
565 0, 0, # a !
566 1, 0, # b c = b c
567 3, 2, # ! d
568 3, 3, # e = e
569 4, 4, # f ! h
570 5, 5, # j = j
571 6, 6, # ! k
572 6, 7, # l m = l m
573 8, 9, # n p ! r s t
574 10, 12, #
575 );
576
577 The 0th, 2nd, 4th, etc. entries are all indices into @seq1 (@a in the
578 above example) indicating where a hunk begins. The 1st, 3rd, 5th, etc.
579 entries are all indices into @seq2 (@b in the above example) indicating
580 where the same hunk begins.
581
582 So each pair of indices (except the last pair) describes where a hunk
583 begins (in each sequence). Since each hunk must end at the item just
584 before the item that starts the next hunk, the next pair of indices can
585 be used to determine where the hunk ends.
586
587 So, the first 4 entries (0..3) describe the first hunk. Entries 0 and
588 1 describe where the first hunk begins (and so are always both 0).
589 Entries 2 and 3 describe where the next hunk begins, so subtracting 1
590 from each tells us where the first hunk ends. That is, the first hunk
591 contains items $diff[0] through "$diff[2] - 1" of the first sequence
592 and contains items $diff[1] through "$diff[3] - 1" of the second
593 sequence.
594
595 In other words, the first hunk consists of the following two lists of
596 items:
597
598 # 1st pair 2nd pair
599 # of indices of indices
600 @list1 = @a[ $cdiff[0] .. $cdiff[2]-1 ];
601 @list2 = @b[ $cdiff[1] .. $cdiff[3]-1 ];
602 # Hunk start Hunk end
603
604 Note that the hunks will always alternate between those that are part
605 of the LCS (those that contain unchanged items) and those that contain
606 changes. This means that all we need to be told is whether the first
607 hunk is a 'same' or 'diff' hunk and we can determine which of the other
608 hunks contain 'same' items or 'diff' items.
609
610 By convention, we always make the first hunk contain unchanged items.
611 So the 1st, 3rd, 5th, etc. hunks (all odd-numbered hunks if you start
612 counting from 1) all contain unchanged items. And the 2nd, 4th, 6th,
613 etc. hunks (all even-numbered hunks if you start counting from 1) all
614 contain changed items.
615
616 Since @a and @b don't begin with the same value, the first hunk in our
617 example is empty (otherwise we'd violate the above convention). Note
618 that the first 4 index values in our example are all zero. Plug these
619 values into our previous code block and we get:
620
621 @hunk1a = @a[ 0 .. 0-1 ];
622 @hunk1b = @b[ 0 .. 0-1 ];
623
624 And "0..-1" returns the empty list.
625
626 Move down one pair of indices (2..5) and we get the offset ranges for
627 the second hunk, which contains changed items.
628
629 Since @diff[2..5] contains (0,0,1,0) in our example, the second hunk
630 consists of these two lists of items:
631
632 @hunk2a = @a[ $cdiff[2] .. $cdiff[4]-1 ];
633 @hunk2b = @b[ $cdiff[3] .. $cdiff[5]-1 ];
634 # or
635 @hunk2a = @a[ 0 .. 1-1 ];
636 @hunk2b = @b[ 0 .. 0-1 ];
637 # or
638 @hunk2a = @a[ 0 .. 0 ];
639 @hunk2b = @b[ 0 .. -1 ];
640 # or
641 @hunk2a = ( 'a' );
642 @hunk2b = ( );
643
644 That is, we would delete item 0 ('a') from @a.
645
646 Since @diff[4..7] contains (1,0,3,2) in our example, the third hunk
647 consists of these two lists of items:
648
649 @hunk3a = @a[ $cdiff[4] .. $cdiff[6]-1 ];
650 @hunk3a = @b[ $cdiff[5] .. $cdiff[7]-1 ];
651 # or
652 @hunk3a = @a[ 1 .. 3-1 ];
653 @hunk3a = @b[ 0 .. 2-1 ];
654 # or
655 @hunk3a = @a[ 1 .. 2 ];
656 @hunk3a = @b[ 0 .. 1 ];
657 # or
658 @hunk3a = qw( b c );
659 @hunk3a = qw( b c );
660
661 Note that this third hunk contains unchanged items as our convention
662 demands.
663
664 You can continue this process until you reach the last two indices,
665 which will always be the number of items in each sequence. This is
666 required so that subtracting one from each will give you the indices to
667 the last items in each sequence.
668
669 "traverse_sequences"
670 "traverse_sequences" used to be the most general facility provided by
671 this module (the new OO interface is more powerful and much easier to
672 use).
673
674 Imagine that there are two arrows. Arrow A points to an element of
675 sequence A, and arrow B points to an element of the sequence B.
676 Initially, the arrows point to the first elements of the respective
677 sequences. "traverse_sequences" will advance the arrows through the
678 sequences one element at a time, calling an appropriate user-specified
679 callback function before each advance. It will advance the arrows in
680 such a way that if there are equal elements $A[$i] and $B[$j] which are
681 equal and which are part of the LCS, there will be some moment during
682 the execution of "traverse_sequences" when arrow A is pointing to
683 $A[$i] and arrow B is pointing to $B[$j]. When this happens,
684 "traverse_sequences" will call the "MATCH" callback function and then
685 it will advance both arrows.
686
687 Otherwise, one of the arrows is pointing to an element of its sequence
688 that is not part of the LCS. "traverse_sequences" will advance that
689 arrow and will call the "DISCARD_A" or the "DISCARD_B" callback,
690 depending on which arrow it advanced. If both arrows point to elements
691 that are not part of the LCS, then "traverse_sequences" will advance
692 one of them and call the appropriate callback, but it is not specified
693 which it will call.
694
695 The arguments to "traverse_sequences" are the two sequences to
696 traverse, and a hash which specifies the callback functions, like this:
697
698 traverse_sequences(
699 \@seq1, \@seq2,
700 { MATCH => $callback_1,
701 DISCARD_A => $callback_2,
702 DISCARD_B => $callback_3,
703 }
704 );
705
706 Callbacks for MATCH, DISCARD_A, and DISCARD_B are invoked with at least
707 the indices of the two arrows as their arguments. They are not
708 expected to return any values. If a callback is omitted from the
709 table, it is not called.
710
711 Callbacks for A_FINISHED and B_FINISHED are invoked with at least the
712 corresponding index in A or B.
713
714 If arrow A reaches the end of its sequence, before arrow B does,
715 "traverse_sequences" will call the "A_FINISHED" callback when it
716 advances arrow B, if there is such a function; if not it will call
717 "DISCARD_B" instead. Similarly if arrow B finishes first.
718 "traverse_sequences" returns when both arrows are at the ends of their
719 respective sequences. It returns true on success and false on failure.
720 At present there is no way to fail.
721
722 "traverse_sequences" may be passed an optional fourth parameter; this
723 is a CODE reference to a key generation function. See "KEY GENERATION
724 FUNCTIONS".
725
726 Additional parameters, if any, will be passed to the key generation
727 function.
728
729 If you want to pass additional parameters to your callbacks, but don't
730 need a custom key generation function, you can get the default by
731 passing undef:
732
733 traverse_sequences(
734 \@seq1, \@seq2,
735 { MATCH => $callback_1,
736 DISCARD_A => $callback_2,
737 DISCARD_B => $callback_3,
738 },
739 undef, # default key-gen
740 $myArgument1,
741 $myArgument2,
742 $myArgument3,
743 );
744
745 "traverse_sequences" does not have a useful return value; you are
746 expected to plug in the appropriate behavior with the callback
747 functions.
748
749 "traverse_balanced"
750 "traverse_balanced" is an alternative to "traverse_sequences". It uses
751 a different algorithm to iterate through the entries in the computed
752 LCS. Instead of sticking to one side and showing element changes as
753 insertions and deletions only, it will jump back and forth between the
754 two sequences and report changes occurring as deletions on one side
755 followed immediately by an insertion on the other side.
756
757 In addition to the "DISCARD_A", "DISCARD_B", and "MATCH" callbacks
758 supported by "traverse_sequences", "traverse_balanced" supports a
759 "CHANGE" callback indicating that one element got "replaced" by
760 another:
761
762 traverse_balanced(
763 \@seq1, \@seq2,
764 { MATCH => $callback_1,
765 DISCARD_A => $callback_2,
766 DISCARD_B => $callback_3,
767 CHANGE => $callback_4,
768 }
769 );
770
771 If no "CHANGE" callback is specified, "traverse_balanced" will map
772 "CHANGE" events to "DISCARD_A" and "DISCARD_B" actions, therefore
773 resulting in a similar behaviour as "traverse_sequences" with different
774 order of events.
775
776 "traverse_balanced" might be a bit slower than "traverse_sequences",
777 noticeable only while processing huge amounts of data.
778
779 The "sdiff" function of this module is implemented as call to
780 "traverse_balanced".
781
782 "traverse_balanced" does not have a useful return value; you are
783 expected to plug in the appropriate behavior with the callback
784 functions.
785
787 Most of the functions accept an optional extra parameter. This is a
788 CODE reference to a key generating (hashing) function that should
789 return a string that uniquely identifies a given element. It should be
790 the case that if two elements are to be considered equal, their keys
791 should be the same (and the other way around). If no key generation
792 function is provided, the key will be the element as a string.
793
794 By default, comparisons will use "eq" and elements will be turned into
795 keys using the default stringizing operator '""'.
796
797 Where this is important is when you're comparing something other than
798 strings. If it is the case that you have multiple different objects
799 that should be considered to be equal, you should supply a key
800 generation function. Otherwise, you have to make sure that your arrays
801 contain unique references.
802
803 For instance, consider this example:
804
805 package Person;
806
807 sub new
808 {
809 my $package = shift;
810 return bless { name => '', ssn => '', @_ }, $package;
811 }
812
813 sub clone
814 {
815 my $old = shift;
816 my $new = bless { %$old }, ref($old);
817 }
818
819 sub hash
820 {
821 return shift()->{'ssn'};
822 }
823
824 my $person1 = Person->new( name => 'Joe', ssn => '123-45-6789' );
825 my $person2 = Person->new( name => 'Mary', ssn => '123-47-0000' );
826 my $person3 = Person->new( name => 'Pete', ssn => '999-45-2222' );
827 my $person4 = Person->new( name => 'Peggy', ssn => '123-45-9999' );
828 my $person5 = Person->new( name => 'Frank', ssn => '000-45-9999' );
829
830 If you did this:
831
832 my $array1 = [ $person1, $person2, $person4 ];
833 my $array2 = [ $person1, $person3, $person4, $person5 ];
834 Algorithm::Diff::diff( $array1, $array2 );
835
836 everything would work out OK (each of the objects would be converted
837 into a string like "Person=HASH(0x82425b0)" for comparison).
838
839 But if you did this:
840
841 my $array1 = [ $person1, $person2, $person4 ];
842 my $array2 = [ $person1, $person3, $person4->clone(), $person5 ];
843 Algorithm::Diff::diff( $array1, $array2 );
844
845 $person4 and $person4->clone() (which have the same name and SSN) would
846 be seen as different objects. If you wanted them to be considered
847 equivalent, you would have to pass in a key generation function:
848
849 my $array1 = [ $person1, $person2, $person4 ];
850 my $array2 = [ $person1, $person3, $person4->clone(), $person5 ];
851 Algorithm::Diff::diff( $array1, $array2, \&Person::hash );
852
853 This would use the 'ssn' field in each Person as a comparison key, and
854 so would consider $person4 and $person4->clone() as equal.
855
856 You may also pass additional parameters to the key generation function
857 if you wish.
858
860 If you pass these routines a non-reference and they expect a reference,
861 they will die with a message.
862
864 This version released by Tye McQueen (http://perlmonks.org/?node=tye).
865
867 Parts Copyright (c) 2000-2004 Ned Konz. All rights reserved. Parts by
868 Tye McQueen.
869
870 This program is free software; you can redistribute it and/or modify it
871 under the same terms as Perl.
872
874 Mark-Jason still maintains a mailing list. To join a low-volume
875 mailing list for announcements related to diff and Algorithm::Diff,
876 send an empty mail message to mjd-perl-diff-request@plover.com.
877
879 Versions through 0.59 (and much of this documentation) were written by:
880
881 Mark-Jason Dominus, mjd-perl-diff@plover.com
882
883 This version borrows some documentation and routine names from Mark-
884 Jason's, but Diff.pm's code was completely replaced.
885
886 This code was adapted from the Smalltalk code of Mario Wolczko
887 <mario@wolczko.com>, which is available at
888 ftp://st.cs.uiuc.edu/pub/Smalltalk/MANCHESTER/manchester/4.0/diff.st
889
890 "sdiff" and "traverse_balanced" were written by Mike Schilli
891 <m@perlmeister.com>.
892
893 The algorithm is that described in A Fast Algorithm for Computing
894 Longest Common Subsequences, CACM, vol.20, no.5, pp.350-353, May 1977,
895 with a few minor improvements to improve the speed.
896
897 Much work was done by Ned Konz (perl@bike-nomad.com).
898
899 The OO interface and some other changes are by Tye McQueen.
900
902 Hey! The above document had some coding errors, which are explained
903 below:
904
905 Around line 989:
906 You can't have =items (as at line 1021) unless the first thing
907 after the =over is an =item
908
909 Around line 1108:
910 Expected text after =item, not a number
911
912 Around line 1114:
913 Expected text after =item, not a number
914
915 Around line 1120:
916 Expected text after =item, not a number
917
918
919
920perl v5.28.1 2014-11-26 Algorithm::Diff(3)