1Algorithm::Diff(3) User Contributed Perl Documentation Algorithm::Diff(3)
2
3
4
6 Algorithm::Diff - Compute `intelligent' differences between two files /
7 lists
8
10 require Algorithm::Diff;
11
12 # This example produces traditional 'diff' output:
13
14 my $diff = Algorithm::Diff->new( \@seq1, \@seq2 );
15
16 $diff->Base( 1 ); # Return line numbers, not indices
17 while( $diff->Next() ) {
18 next if $diff->Same();
19 my $sep = '';
20 if( ! $diff->Items(2) ) {
21 printf "%d,%dd%d\n",
22 $diff->Get(qw( Min1 Max1 Max2 ));
23 } elsif( ! $diff->Items(1) ) {
24 printf "%da%d,%d\n",
25 $diff->Get(qw( Max1 Min2 Max2 ));
26 } else {
27 $sep = "---\n";
28 printf "%d,%dc%d,%d\n",
29 $diff->Get(qw( Min1 Max1 Min2 Max2 ));
30 }
31 print "< $_" for $diff->Items(1);
32 print $sep;
33 print "> $_" for $diff->Items(2);
34 }
35
36 # Alternate interfaces:
37
38 use Algorithm::Diff qw(
39 LCS LCS_length LCSidx
40 diff sdiff compact_diff
41 traverse_sequences traverse_balanced );
42
43 @lcs = LCS( \@seq1, \@seq2 );
44 $lcsref = LCS( \@seq1, \@seq2 );
45 $count = LCS_length( \@seq1, \@seq2 );
46
47 ( $seq1idxref, $seq2idxref ) = LCSidx( \@seq1, \@seq2 );
48
49 # Complicated interfaces:
50
51 @diffs = diff( \@seq1, \@seq2 );
52
53 @sdiffs = sdiff( \@seq1, \@seq2 );
54
55 @cdiffs = compact_diff( \@seq1, \@seq2 );
56
57 traverse_sequences(
58 \@seq1,
59 \@seq2,
60 { MATCH => \&callback1,
61 DISCARD_A => \&callback2,
62 DISCARD_B => \&callback3,
63 },
64 \&key_generator,
65 @extra_args,
66 );
67
68 traverse_balanced(
69 \@seq1,
70 \@seq2,
71 { MATCH => \&callback1,
72 DISCARD_A => \&callback2,
73 DISCARD_B => \&callback3,
74 CHANGE => \&callback4,
75 },
76 \&key_generator,
77 @extra_args,
78 );
79
81 (by Mark-Jason Dominus)
82
83 I once read an article written by the authors of "diff"; they said that
84 they worked very hard on the algorithm until they found the right one.
85
86 I think what they ended up using (and I hope someone will correct me,
87 because I am not very confident about this) was the `longest common
88 subsequence' method. In the LCS problem, you have two sequences of
89 items:
90
91 a b c d f g h j q z
92
93 a b c d e f g i j k r x y z
94
95 and you want to find the longest sequence of items that is present in
96 both original sequences in the same order. That is, you want to find a
97 new sequence S which can be obtained from the first sequence by delet‐
98 ing some items, and from the secend sequence by deleting other items.
99 You also want S to be as long as possible. In this case S is
100
101 a b c d f g j z
102
103 From there it's only a small step to get diff-like output:
104
105 e h i k q r x y
106 + - + + - + + +
107
108 This module solves the LCS problem. It also includes a canned function
109 to generate "diff"-like output.
110
111 It might seem from the example above that the LCS of two sequences is
112 always pretty obvious, but that's not always the case, especially when
113 the two sequences have many repeated elements. For example, consider
114
115 a x b y c z p d q
116 a b c a x b y c z
117
118 A naive approach might start by matching up the "a" and "b" that appear
119 at the beginning of each sequence, like this:
120
121 a x b y c z p d q
122 a b c a b y c z
123
124 This finds the common subsequence "a b c z". But actually, the LCS is
125 "a x b y c z":
126
127 a x b y c z p d q
128 a b c a x b y c z
129
130 or
131
132 a x b y c z p d q
133 a b c a x b y c z
134
136 (See also the README file and several example scripts include with this
137 module.)
138
139 This module now provides an object-oriented interface that uses less
140 memory and is easier to use than most of the previous procedural inter‐
141 faces. It also still provides several exportable functions. We'll
142 deal with these in ascending order of difficulty: "LCS", "LCS_length",
143 "LCSidx", OO interface, "prepare", "diff", "sdiff", "tra‐
144 verse_sequences", and "traverse_balanced".
145
146 "LCS"
147
148 Given references to two lists of items, LCS returns an array containing
149 their longest common subsequence. In scalar context, it returns a ref‐
150 erence to such a list.
151
152 @lcs = LCS( \@seq1, \@seq2 );
153 $lcsref = LCS( \@seq1, \@seq2 );
154
155 "LCS" may be passed an optional third parameter; this is a CODE refer‐
156 ence to a key generation function. See "KEY GENERATION FUNCTIONS".
157
158 @lcs = LCS( \@seq1, \@seq2, \&keyGen, @args );
159 $lcsref = LCS( \@seq1, \@seq2, \&keyGen, @args );
160
161 Additional parameters, if any, will be passed to the key generation
162 routine.
163
164 "LCS_length"
165
166 This is just like "LCS" except it only returns the length of the long‐
167 est common subsequence. This provides a performance gain of about 9%
168 compared to "LCS".
169
170 "LCSidx"
171
172 Like "LCS" except it returns references to two arrays. The first array
173 contains the indices into @seq1 where the LCS items are located. The
174 second array contains the indices into @seq2 where the LCS items are
175 located.
176
177 Therefore, the following three lists will contain the same values:
178
179 my( $idx1, $idx2 ) = LCSidx( \@seq1, \@seq2 );
180 my @list1 = @seq1[ @$idx1 ];
181 my @list2 = @seq2[ @$idx2 ];
182 my @list3 = LCS( \@seq1, \@seq2 );
183
184 "new"
185
186 $diff = Algorithm::Diffs->new( \@seq1, \@seq2 );
187 $diff = Algorithm::Diffs->new( \@seq1, \@seq2, \%opts );
188
189 "new" computes the smallest set of additions and deletions necessary to
190 turn the first sequence into the second and compactly records them in
191 the object.
192
193 You use the object to iterate over hunks, where each hunk represents a
194 contiguous section of items which should be added, deleted, replaced,
195 or left unchanged.
196
197 The following summary of all of the methods looks a lot like Perl
198 code but some of the symbols have different meanings:
199
200 [ ] Encloses optional arguments
201 : Is followed by the default value for an optional argument
202 ⎪ Separates alternate return results
203
204 Method summary:
205
206 $obj = Algorithm::Diff->new( \@seq1, \@seq2, [ \%opts ] );
207 $pos = $obj->Next( [ $count : 1 ] );
208 $revPos = $obj->Prev( [ $count : 1 ] );
209 $obj = $obj->Reset( [ $pos : 0 ] );
210 $copy = $obj->Copy( [ $pos, [ $newBase ] ] );
211 $oldBase = $obj->Base( [ $newBase ] );
212
213 Note that all of the following methods "die" if used on an object
214 that is "reset" (not currently pointing at any hunk).
215
216 $bits = $obj->Diff( );
217 @items⎪$cnt = $obj->Same( );
218 @items⎪$cnt = $obj->Items( $seqNum );
219 @idxs ⎪$cnt = $obj->Range( $seqNum, [ $base ] );
220 $minIdx = $obj->Min( $seqNum, [ $base ] );
221 $maxIdx = $obj->Max( $seqNum, [ $base ] );
222 @values = $obj->Get( @names );
223
224 Passing in "undef" for an optional argument is always treated the
225 same as if no argument were passed in.
226
227 "Next"
228 $pos = $diff->Next(); # Move forward 1 hunk
229 $pos = $diff->Next( 2 ); # Move forward 2 hunks
230 $pos = $diff->Next(-5); # Move backward 5 hunks
231
232 "Next" moves the object to point at the next hunk. The object
233 starts out "reset", which means it isn't pointing at any hunk. If
234 the object is reset, then "Next()" moves to the first hunk.
235
236 "Next" returns a true value iff the move didn't go past the last
237 hunk. So Next(0) will return true iff the object is not reset.
238
239 Actually, "Next" returns the object's new position, which is a num‐
240 ber between 1 and the number of hunks (inclusive), or returns a
241 false value.
242
243 "Prev"
244 "Prev($N)" is almost identical to "Next(-$N)"; it moves to the $Nth
245 previous hunk. On a 'reset' object, "Prev()" [and "Next(-1)"] move
246 to the last hunk.
247
248 The position returned by "Prev" is relative to the end of the
249 hunks; -1 for the last hunk, -2 for the second-to-last, etc.
250
251 "Reset"
252 $diff->Reset(); # Reset the object's position
253 $diff->Reset($pos); # Move to the specified hunk
254 $diff->Reset(1); # Move to the first hunk
255 $diff->Reset(-1); # Move to the last hunk
256
257 "Reset" returns the object, so, for example, you could use
258 "$diff->Reset()->Next(-1)" to get the number of hunks.
259
260 "Copy"
261 $copy = $diff->Copy( $newPos, $newBase );
262
263 "Copy" returns a copy of the object. The copy and the orignal
264 object share most of their data, so making copies takes very little
265 memory. The copy maintains its own position (separate from the
266 original), which is the main purpose of copies. It also maintains
267 its own base.
268
269 By default, the copy's position starts out the same as the original
270 object's position. But "Copy" takes an optional first argument to
271 set the new position, so the following three snippets are equiva‐
272 lent:
273
274 $copy = $diff->Copy($pos);
275
276 $copy = $diff->Copy();
277 $copy->Reset($pos);
278
279 $copy = $diff->Copy()->Reset($pos);
280
281 "Copy" takes an optional second argument to set the base for the
282 copy. If you wish to change the base of the copy but leave the
283 position the same as in the original, here are two equivalent ways:
284
285 $copy = $diff->Copy();
286 $copy->Base( 0 );
287
288 $copy = $diff->Copy(undef,0);
289
290 Here are two equivalent way to get a "reset" copy:
291
292 $copy = $diff->Copy(0);
293
294 $copy = $diff->Copy()->Reset();
295
296 "Diff"
297 $bits = $obj->Diff();
298
299 "Diff" returns a true value iff the current hunk contains items
300 that are different between the two sequences. It actually returns
301 one of the follow 4 values:
302
303 3 "3==(1⎪2)". This hunk contains items from @seq1 and the items
304 from @seq2 that should replace them. Both sequence 1 and 2
305 contain changed items so both the 1 and 2 bits are set.
306
307 2 This hunk only contains items from @seq2 that should be
308 inserted (not items from @seq1). Only sequence 2 contains
309 changed items so only the 2 bit is set.
310
311 1 This hunk only contains items from @seq1 that should be deleted
312 (not items from @seq2). Only sequence 1 contains changed items
313 so only the 1 bit is set.
314
315 0 This means that the items in this hunk are the same in both
316 sequences. Neither sequence 1 nor 2 contain changed items so
317 neither the 1 nor the 2 bits are set.
318
319 "Same"
320 "Same" returns a true value iff the current hunk contains items
321 that are the same in both sequences. It actually returns the list
322 of items if they are the same or an emty list if they aren't. In a
323 scalar context, it returns the size of the list.
324
325 "Items"
326 $count = $diff->Items(2);
327 @items = $diff->Items($seqNum);
328
329 "Items" returns the (number of) items from the specified sequence
330 that are part of the current hunk.
331
332 If the current hunk contains only insertions, then
333 "$diff->Items(1)" will return an empty list (0 in a scalar conext).
334 If the current hunk contains only deletions, then "$diff->Items(2)"
335 will return an empty list (0 in a scalar conext).
336
337 If the hunk contains replacements, then both "$diff->Items(1)" and
338 "$diff->Items(2)" will return different, non-empty lists.
339
340 Otherwise, the hunk contains identical items and all of the follow‐
341 ing will return the same lists:
342
343 @items = $diff->Items(1);
344 @items = $diff->Items(2);
345 @items = $diff->Same();
346
347 "Range"
348 $count = $diff->Range( $seqNum );
349 @indices = $diff->Range( $seqNum );
350 @indices = $diff->Range( $seqNum, $base );
351
352 "Range" is like "Items" except that it returns a list of indices to
353 the items rather than the items themselves. By default, the index
354 of the first item (in each sequence) is 0 but this can be changed
355 by calling the "Base" method. So, by default, the following two
356 snippets return the same lists:
357
358 @list = $diff->Items(2);
359 @list = @seq2[ $diff->Range(2) ];
360
361 You can also specify the base to use as the second argument. So
362 the following two snippets always return the same lists:
363
364 @list = $diff->Items(1);
365 @list = @seq1[ $diff->Range(1,0) ];
366
367 "Base"
368 $curBase = $diff->Base();
369 $oldBase = $diff->Base($newBase);
370
371 "Base" sets and/or returns the current base (usually 0 or 1) that
372 is used when you request range information. The base defaults to 0
373 so that range information is returned as array indices. You can
374 set the base to 1 if you want to report traditional line numbers
375 instead.
376
377 "Min"
378 $min1 = $diff->Min(1);
379 $min = $diff->Min( $seqNum, $base );
380
381 "Min" returns the first value that "Range" would return (given the
382 same arguments) or returns "undef" if "Range" would return an empty
383 list.
384
385 "Max"
386 "Max" returns the last value that "Range" would return or "undef".
387
388 "Get"
389 ( $n, $x, $r ) = $diff->Get(qw( min1 max1 range1 ));
390 @values = $diff->Get(qw( 0min2 1max2 range2 same base ));
391
392 "Get" returns one or more scalar values. You pass in a list of the
393 names of the values you want returned. Each name must match one of
394 the following regexes:
395
396 /^(-?\d+)?(min⎪max)[12]$/i
397 /^(range[12]⎪same⎪diff⎪base)$/i
398
399 The 1 or 2 after a name says which sequence you want the informa‐
400 tion for (and where allowed, it is required). The optional number
401 before "min" or "max" is the base to use. So the following equali‐
402 ties hold:
403
404 $diff->Get('min1') == $diff->Min(1)
405 $diff->Get('0min2') == $diff->Min(2,0)
406
407 Using "Get" in a scalar context when you've passed in more than one
408 name is a fatal error ("die" is called).
409
410 "prepare"
411
412 Given a reference to a list of items, "prepare" returns a reference to
413 a hash which can be used when comparing this sequence to other
414 sequences with "LCS" or "LCS_length".
415
416 $prep = prepare( \@seq1 );
417 for $i ( 0 .. 10_000 )
418 {
419 @lcs = LCS( $prep, $seq[$i] );
420 # do something useful with @lcs
421 }
422
423 "prepare" may be passed an optional third parameter; this is a CODE
424 reference to a key generation function. See "KEY GENERATION FUNC‐
425 TIONS".
426
427 $prep = prepare( \@seq1, \&keyGen );
428 for $i ( 0 .. 10_000 )
429 {
430 @lcs = LCS( $seq[$i], $prep, \&keyGen );
431 # do something useful with @lcs
432 }
433
434 Using "prepare" provides a performance gain of about 50% when calling
435 LCS many times compared with not preparing.
436
437 "diff"
438
439 @diffs = diff( \@seq1, \@seq2 );
440 $diffs_ref = diff( \@seq1, \@seq2 );
441
442 "diff" computes the smallest set of additions and deletions necessary
443 to turn the first sequence into the second, and returns a description
444 of these changes. The description is a list of hunks; each hunk repre‐
445 sents a contiguous section of items which should be added, deleted, or
446 replaced. (Hunks containing unchanged items are not included.)
447
448 The return value of "diff" is a list of hunks, or, in scalar context, a
449 reference to such a list. If there are no differences, the list will
450 be empty.
451
452 Here is an example. Calling "diff" for the following two sequences:
453
454 a b c e h j l m n p
455 b c d e f j k l m r s t
456
457 would produce the following list:
458
459 (
460 [ [ '-', 0, 'a' ] ],
461
462 [ [ '+', 2, 'd' ] ],
463
464 [ [ '-', 4, 'h' ],
465 [ '+', 4, 'f' ] ],
466
467 [ [ '+', 6, 'k' ] ],
468
469 [ [ '-', 8, 'n' ],
470 [ '-', 9, 'p' ],
471 [ '+', 9, 'r' ],
472 [ '+', 10, 's' ],
473 [ '+', 11, 't' ] ],
474 )
475
476 There are five hunks here. The first hunk says that the "a" at posi‐
477 tion 0 of the first sequence should be deleted ("-"). The second hunk
478 says that the "d" at position 2 of the second sequence should be
479 inserted ("+"). The third hunk says that the "h" at position 4 of the
480 first sequence should be removed and replaced with the "f" from posi‐
481 tion 4 of the second sequence. And so on.
482
483 "diff" may be passed an optional third parameter; this is a CODE refer‐
484 ence to a key generation function. See "KEY GENERATION FUNCTIONS".
485
486 Additional parameters, if any, will be passed to the key generation
487 routine.
488
489 "sdiff"
490
491 @sdiffs = sdiff( \@seq1, \@seq2 );
492 $sdiffs_ref = sdiff( \@seq1, \@seq2 );
493
494 "sdiff" computes all necessary components to show two sequences and
495 their minimized differences side by side, just like the Unix-utility
496 sdiff does:
497
498 same same
499 before ⎪ after
500 old < -
501 - > new
502
503 It returns a list of array refs, each pointing to an array of display
504 instructions. In scalar context it returns a reference to such a list.
505 If there are no differences, the list will have one entry per item,
506 each indicating that the item was unchanged.
507
508 Display instructions consist of three elements: A modifier indicator
509 ("+": Element added, "-": Element removed, "u": Element unmodified,
510 "c": Element changed) and the value of the old and new elements, to be
511 displayed side-by-side.
512
513 An "sdiff" of the following two sequences:
514
515 a b c e h j l m n p
516 b c d e f j k l m r s t
517
518 results in
519
520 ( [ '-', 'a', '' ],
521 [ 'u', 'b', 'b' ],
522 [ 'u', 'c', 'c' ],
523 [ '+', '', 'd' ],
524 [ 'u', 'e', 'e' ],
525 [ 'c', 'h', 'f' ],
526 [ 'u', 'j', 'j' ],
527 [ '+', '', 'k' ],
528 [ 'u', 'l', 'l' ],
529 [ 'u', 'm', 'm' ],
530 [ 'c', 'n', 'r' ],
531 [ 'c', 'p', 's' ],
532 [ '+', '', 't' ],
533 )
534
535 "sdiff" may be passed an optional third parameter; this is a CODE ref‐
536 erence to a key generation function. See "KEY GENERATION FUNCTIONS".
537
538 Additional parameters, if any, will be passed to the key generation
539 routine.
540
541 "compact_diff"
542
543 "compact_diff" is much like "sdiff" except it returns a much more com‐
544 pact description consisting of just one flat list of indices. An exam‐
545 ple helps explain the format:
546
547 my @a = qw( a b c e h j l m n p );
548 my @b = qw( b c d e f j k l m r s t );
549 @cdiff = compact_diff( \@a, \@b );
550 # Returns:
551 # @a @b @a @b
552 # start start values values
553 ( 0, 0, # =
554 0, 0, # a !
555 1, 0, # b c = b c
556 3, 2, # ! d
557 3, 3, # e = e
558 4, 4, # f ! h
559 5, 5, # j = j
560 6, 6, # ! k
561 6, 7, # l m = l m
562 8, 9, # n p ! r s t
563 10, 12, #
564 );
565
566 The 0th, 2nd, 4th, etc. entries are all indices into @seq1 (@a in the
567 above example) indicating where a hunk begins. The 1st, 3rd, 5th, etc.
568 entries are all indices into @seq2 (@b in the above example) indicating
569 where the same hunk begins.
570
571 So each pair of indices (except the last pair) describes where a hunk
572 begins (in each sequence). Since each hunk must end at the item just
573 before the item that starts the next hunk, the next pair of indices can
574 be used to determine where the hunk ends.
575
576 So, the first 4 entries (0..3) describe the first hunk. Entries 0 and
577 1 describe where the first hunk begins (and so are always both 0).
578 Entries 2 and 3 describe where the next hunk begins, so subtracting 1
579 from each tells us where the first hunk ends. That is, the first hunk
580 contains items $diff[0] through "$diff[2] - 1" of the first sequence
581 and contains items $diff[1] through "$diff[3] - 1" of the second
582 sequence.
583
584 In other words, the first hunk consists of the following two lists of
585 items:
586
587 # 1st pair 2nd pair
588 # of indices of indices
589 @list1 = @a[ $cdiff[0] .. $cdiff[2]-1 ];
590 @list2 = @b[ $cdiff[1] .. $cdiff[3]-1 ];
591 # Hunk start Hunk end
592
593 Note that the hunks will always alternate between those that are part
594 of the LCS (those that contain unchanged items) and those that contain
595 changes. This means that all we need to be told is whether the first
596 hunk is a 'same' or 'diff' hunk and we can determine which of the other
597 hunks contain 'same' items or 'diff' items.
598
599 By convention, we always make the first hunk contain unchanged items.
600 So the 1st, 3rd, 5th, etc. hunks (all odd-numbered hunks if you start
601 counting from 1) all contain unchanged items. And the 2nd, 4th, 6th,
602 etc. hunks (all even-numbered hunks if you start counting from 1) all
603 contain changed items.
604
605 Since @a and @b don't begin with the same value, the first hunk in our
606 example is empty (otherwise we'd violate the above convention). Note
607 that the first 4 index values in our example are all zero. Plug these
608 values into our previous code block and we get:
609
610 @hunk1a = @a[ 0 .. 0-1 ];
611 @hunk1b = @b[ 0 .. 0-1 ];
612
613 And "0..-1" returns the empty list.
614
615 Move down one pair of indices (2..5) and we get the offset ranges for
616 the second hunk, which contains changed items.
617
618 Since @diff[2..5] contains (0,0,1,0) in our example, the second hunk
619 consists of these two lists of items:
620
621 @hunk2a = @a[ $cdiff[2] .. $cdiff[4]-1 ];
622 @hunk2b = @b[ $cdiff[3] .. $cdiff[5]-1 ];
623 # or
624 @hunk2a = @a[ 0 .. 1-1 ];
625 @hunk2b = @b[ 0 .. 0-1 ];
626 # or
627 @hunk2a = @a[ 0 .. 0 ];
628 @hunk2b = @b[ 0 .. -1 ];
629 # or
630 @hunk2a = ( 'a' );
631 @hunk2b = ( );
632
633 That is, we would delete item 0 ('a') from @a.
634
635 Since @diff[4..7] contains (1,0,3,2) in our example, the third hunk
636 consists of these two lists of items:
637
638 @hunk3a = @a[ $cdiff[4] .. $cdiff[6]-1 ];
639 @hunk3a = @b[ $cdiff[5] .. $cdiff[7]-1 ];
640 # or
641 @hunk3a = @a[ 1 .. 3-1 ];
642 @hunk3a = @b[ 0 .. 2-1 ];
643 # or
644 @hunk3a = @a[ 1 .. 2 ];
645 @hunk3a = @b[ 0 .. 1 ];
646 # or
647 @hunk3a = qw( b c );
648 @hunk3a = qw( b c );
649
650 Note that this third hunk contains unchanged items as our convention
651 demands.
652
653 You can continue this process until you reach the last two indices,
654 which will always be the number of items in each sequence. This is
655 required so that subtracting one from each will give you the indices to
656 the last items in each sequence.
657
658 "traverse_sequences"
659
660 "traverse_sequences" used to be the most general facility provided by
661 this module (the new OO interface is more powerful and much easier to
662 use).
663
664 Imagine that there are two arrows. Arrow A points to an element of
665 sequence A, and arrow B points to an element of the sequence B. Ini‐
666 tially, the arrows point to the first elements of the respective
667 sequences. "traverse_sequences" will advance the arrows through the
668 sequences one element at a time, calling an appropriate user-specified
669 callback function before each advance. It willadvance the arrows in
670 such a way that if there are equal elements $A[$i] and $B[$j] which are
671 equal and which are part of the LCS, there will be some moment during
672 the execution of "traverse_sequences" when arrow A is pointing to
673 $A[$i] and arrow B is pointing to $B[$j]. When this happens, "tra‐
674 verse_sequences" will call the "MATCH" callback function and then it
675 will advance both arrows.
676
677 Otherwise, one of the arrows is pointing to an element of its sequence
678 that is not part of the LCS. "traverse_sequences" will advance that
679 arrow and will call the "DISCARD_A" or the "DISCARD_B" callback,
680 depending on which arrow it advanced. If both arrows point to elements
681 that are not part of the LCS, then "traverse_sequences" will advance
682 one of them and call the appropriate callback, but it is not specified
683 which it will call.
684
685 The arguments to "traverse_sequences" are the two sequences to tra‐
686 verse, and a hash which specifies the callback functions, like this:
687
688 traverse_sequences(
689 \@seq1, \@seq2,
690 { MATCH => $callback_1,
691 DISCARD_A => $callback_2,
692 DISCARD_B => $callback_3,
693 }
694 );
695
696 Callbacks for MATCH, DISCARD_A, and DISCARD_B are invoked with at least
697 the indices of the two arrows as their arguments. They are not
698 expected to return any values. If a callback is omitted from the ta‐
699 ble, it is not called.
700
701 Callbacks for A_FINISHED and B_FINISHED are invoked with at least the
702 corresponding index in A or B.
703
704 If arrow A reaches the end of its sequence, before arrow B does, "tra‐
705 verse_sequences" will call the "A_FINISHED" callback when it advances
706 arrow B, if there is such a function; if not it will call "DISCARD_B"
707 instead. Similarly if arrow B finishes first. "traverse_sequences"
708 returns when both arrows are at the ends of their respective sequences.
709 It returns true on success and false on failure. At present there is
710 no way to fail.
711
712 "traverse_sequences" may be passed an optional fourth parameter; this
713 is a CODE reference to a key generation function. See "KEY GENERATION
714 FUNCTIONS".
715
716 Additional parameters, if any, will be passed to the key generation
717 function.
718
719 If you want to pass additional parameters to your callbacks, but don't
720 need a custom key generation function, you can get the default by pass‐
721 ing undef:
722
723 traverse_sequences(
724 \@seq1, \@seq2,
725 { MATCH => $callback_1,
726 DISCARD_A => $callback_2,
727 DISCARD_B => $callback_3,
728 },
729 undef, # default key-gen
730 $myArgument1,
731 $myArgument2,
732 $myArgument3,
733 );
734
735 "traverse_sequences" does not have a useful return value; you are
736 expected to plug in the appropriate behavior with the callback func‐
737 tions.
738
739 "traverse_balanced"
740
741 "traverse_balanced" is an alternative to "traverse_sequences". It uses
742 a different algorithm to iterate through the entries in the computed
743 LCS. Instead of sticking to one side and showing element changes as
744 insertions and deletions only, it will jump back and forth between the
745 two sequences and report changes occurring as deletions on one side
746 followed immediatly by an insertion on the other side.
747
748 In addition to the "DISCARD_A", "DISCARD_B", and "MATCH" callbacks sup‐
749 ported by "traverse_sequences", "traverse_balanced" supports a "CHANGE"
750 callback indicating that one element got "replaced" by another:
751
752 traverse_balanced(
753 \@seq1, \@seq2,
754 { MATCH => $callback_1,
755 DISCARD_A => $callback_2,
756 DISCARD_B => $callback_3,
757 CHANGE => $callback_4,
758 }
759 );
760
761 If no "CHANGE" callback is specified, "traverse_balanced" will map
762 "CHANGE" events to "DISCARD_A" and "DISCARD_B" actions, therefore
763 resulting in a similar behaviour as "traverse_sequences" with different
764 order of events.
765
766 "traverse_balanced" might be a bit slower than "traverse_sequences",
767 noticable only while processing huge amounts of data.
768
769 The "sdiff" function of this module is implemented as call to "tra‐
770 verse_balanced".
771
772 "traverse_balanced" does not have a useful return value; you are
773 expected to plug in the appropriate behavior with the callback func‐
774 tions.
775
777 Most of the functions accept an optional extra parameter. This is a
778 CODE reference to a key generating (hashing) function that should
779 return a string that uniquely identifies a given element. It should be
780 the case that if two elements are to be considered equal, their keys
781 should be the same (and the other way around). If no key generation
782 function is provided, the key will be the element as a string.
783
784 By default, comparisons will use "eq" and elements will be turned into
785 keys using the default stringizing operator '""'.
786
787 Where this is important is when you're comparing something other than
788 strings. If it is the case that you have multiple different objects
789 that should be considered to be equal, you should supply a key genera‐
790 tion function. Otherwise, you have to make sure that your arrays con‐
791 tain unique references.
792
793 For instance, consider this example:
794
795 package Person;
796
797 sub new
798 {
799 my $package = shift;
800 return bless { name => '', ssn => '', @_ }, $package;
801 }
802
803 sub clone
804 {
805 my $old = shift;
806 my $new = bless { %$old }, ref($old);
807 }
808
809 sub hash
810 {
811 return shift()->{'ssn'};
812 }
813
814 my $person1 = Person->new( name => 'Joe', ssn => '123-45-6789' );
815 my $person2 = Person->new( name => 'Mary', ssn => '123-47-0000' );
816 my $person3 = Person->new( name => 'Pete', ssn => '999-45-2222' );
817 my $person4 = Person->new( name => 'Peggy', ssn => '123-45-9999' );
818 my $person5 = Person->new( name => 'Frank', ssn => '000-45-9999' );
819
820 If you did this:
821
822 my $array1 = [ $person1, $person2, $person4 ];
823 my $array2 = [ $person1, $person3, $person4, $person5 ];
824 Algorithm::Diff::diff( $array1, $array2 );
825
826 everything would work out OK (each of the objects would be converted
827 into a string like "Person=HASH(0x82425b0)" for comparison).
828
829 But if you did this:
830
831 my $array1 = [ $person1, $person2, $person4 ];
832 my $array2 = [ $person1, $person3, $person4->clone(), $person5 ];
833 Algorithm::Diff::diff( $array1, $array2 );
834
835 $person4 and $person4->clone() (which have the same name and SSN) would
836 be seen as different objects. If you wanted them to be considered
837 equivalent, you would have to pass in a key generation function:
838
839 my $array1 = [ $person1, $person2, $person4 ];
840 my $array2 = [ $person1, $person3, $person4->clone(), $person5 ];
841 Algorithm::Diff::diff( $array1, $array2, \&Person::hash );
842
843 This would use the 'ssn' field in each Person as a comparison key, and
844 so would consider $person4 and $person4->clone() as equal.
845
846 You may also pass additional parameters to the key generation function
847 if you wish.
848
850 If you pass these routines a non-reference and they expect a reference,
851 they will die with a message.
852
854 This version released by Tye McQueen (http://perlmonks.org/?node=tye).
855
857 Parts Copyright (c) 2000-2004 Ned Konz. All rights reserved. Parts by
858 Tye McQueen.
859
860 This program is free software; you can redistribute it and/or modify it
861 under the same terms as Perl.
862
864 Mark-Jason still maintains a mailing list. To join a low-volume mail‐
865 ing list for announcements related to diff and Algorithm::Diff, send an
866 empty mail message to mjd-perl-diff-request@plover.com.
867
869 Versions through 0.59 (and much of this documentation) were written by:
870
871 Mark-Jason Dominus, mjd-perl-diff@plover.com
872
873 This version borrows some documentation and routine names from
874 Mark-Jason's, but Diff.pm's code was completely replaced.
875
876 This code was adapted from the Smalltalk code of Mario Wolczko
877 <mario@wolczko.com>, which is available at
878 ftp://st.cs.uiuc.edu/pub/Smalltalk/MANCHESTER/manchester/4.0/diff.st
879
880 "sdiff" and "traverse_balanced" were written by Mike Schilli
881 <m@perlmeister.com>.
882
883 The algorithm is that described in A Fast Algorithm for Computing Long‐
884 est Common Subsequences, CACM, vol.20, no.5, pp.350-353, May 1977, with
885 a few minor improvements to improve the speed.
886
887 Much work was done by Ned Konz (perl@bike-nomad.com).
888
889 The OO interface and some other changes are by Tye McQueen.
890
891
892
893perl v5.8.8 2006-07-31 Algorithm::Diff(3)