1Text::Fuzzy(3) User Contributed Perl Documentation Text::Fuzzy(3)
2
3
4
6 Text::Fuzzy - Partial string matching using edit distances
7
9 use Text::Fuzzy;
10 my $tf = Text::Fuzzy->new ('boboon');
11 print "Distance is ", $tf->distance ('babboon'), "\n";
12 my @words = qw/the quick brown fox jumped over the lazy dog/;
13 my $nearest = $tf->nearestv (\@words);
14 print "Nearest array entry is $nearest\n";
15
16 produces output
17
18 Distance is 2
19 Nearest array entry is brown
20
21 (This example is included as synopsis.pl
22 <https://fastapi.metacpan.org/source/BKB/Text-
23 Fuzzy-0.29/examples/synopsis.pl> in the distribution.)
24
26 This documents version 0.29 of Text::Fuzzy corresponding to git commit
27 0e5db1fbccbc7d7518ec39a95d43a5b20166a727
28 <https://github.com/benkasminbullock/Text-
29 Fuzzy/commit/0e5db1fbccbc7d7518ec39a95d43a5b20166a727> released on Thu
30 Dec 10 14:38:31 2020 +0900.
31
33 This module calculates edit distances between words, and searches
34 arrays and files to find the nearest entry by edit distance. It handles
35 both byte strings and character strings (strings containing Unicode),
36 treating each Unicode character as a single entity.
37
38 use Text::Fuzzy;
39 use utf8;
40 my $tf = Text::Fuzzy->new ('あいうえお☺');
41 print $tf->distance ('うえお☺'), "\n";
42
43 produces output
44
45 2
46
47 (This example is included as unicode.pl
48 <https://fastapi.metacpan.org/source/BKB/Text-
49 Fuzzy-0.29/examples/unicode.pl> in the distribution.)
50
51 The default edit distance is the Levenshtein one, which counts each
52 addition ("cat" -> "cart"), substitution ("cat" -> "cut"), and deletion
53 ("carp" -> "cap") as one unit. The Damerau-Levenshtein edit distance,
54 which also allows transpositions ("salt" -> "slat") may also be
55 selected with the "transpositions_ok" method or the "trans" option.
56
57 This module is particularly suited to searching for the nearest match
58 to a term over a list of words, using the "nearestv" or "nearest"
59 methods. It studies the target string to be matched (the first argument
60 to "new") to build information to rapidly reject mismatches in a list.
61 Since computing the Levenshtein and Damerau-Levenshtein edit distances
62 with the Wagner-Fischer algorithm is computationally expensive, the
63 module offers a boost in performance for searching for a string in a
64 list of words.
65
67 new
68 my $tf = Text::Fuzzy->new ('bibbety bobbety boo');
69
70 Create a new Text::Fuzzy object from the supplied word.
71
72 The following parameters may be supplied to new:
73
74 max
75 my $tf = Text::Fuzzy->new ('Cinderella', max => 3);
76
77 This option affects the behaviour of "nearestv" and "nearest"
78 methods. When searching over an array, this sets the maximum edit
79 distance allowed for a word to be considered a "near match". For
80 example, with
81
82 my $tf = Text::Fuzzy->new ('Cinderella');
83 $tf->set_max_distance (3);
84
85 when using "nearest", 'Cinder' will not be considered a match, but
86 'derella' will.
87
88 To switch off the maximum distance, and allow all words to be
89 considered, you can set "max" to be a negative value:
90
91 my $tf = Text::Fuzzy->new ('Cinderella', max => -1);
92
93 Note that this is the default, so there is hardly any point
94 specifying it, except if you want to make self-documenting code, or
95 you're worried that the module's default behaviour may suddenly
96 change.
97
98 Setting "max" to zero makes $tf only match exactly.
99
100 The method "set_max_distance" does the same thing as this
101 parameter.
102
103 no_exact
104 my $tf = Text::Fuzzy->new ('slipper', no_exact => 1);
105
106 This parameter switches on rejection of exact matches, in the same
107 way as the method "no_exact":
108
109 my $tf = Text::Fuzzy->new ('slipper');
110 $tf->no_exact (1);
111
112 This is useful for the case of scanning an array which contains the
113 search term itself, when we are interested in near matches only.
114 For example, if we have a dictionary of words and we need to find
115 near matches for a word which is in the dictionary.
116
117 trans
118 my $tf = Text::Fuzzy->new ('glass', trans => 1);
119
120 This switches on transpositions, in other words it uses the
121 Damerau-Levenshtein edit distance rather than the Levenshtein edit
122 distance. The method "transpositions_ok" has the same effect as
123 this.
124
125 distance
126 my $dist = $tf->distance ($word);
127
128 This method's return value is the edit distance to $word from the word
129 used to create the object in "new".
130
131 use Text::Fuzzy;
132 my $cat = Text::Fuzzy->new ('cat');
133 print $cat->distance ('cut'), "\n";
134 print $cat->distance ('cart'), "\n";
135 print $cat->distance ('catamaran'), "\n";
136 use utf8;
137 print $cat->distance ('γάτος'), "\n";
138
139 produces output
140
141 1
142 1
143 6
144 5
145
146 (This example is included as distance.pl
147 <https://fastapi.metacpan.org/source/BKB/Text-
148 Fuzzy-0.29/examples/distance.pl> in the distribution.)
149
150 To know which edits are used to convert the words, use
151 "distance_edits".
152
153 nearestv
154 my $nearest_word = $tf->nearestv (\@words);
155 my @nearest_words = $tf->nearestv (\@words);
156
157 Returns the value in @words which has the nearest distance to the value
158 given to $tf in "new". In array context, it returns a list of the
159 nearest values.
160
161 use Text::Fuzzy;
162 my @words = (qw/who where what when why/);
163 my $tf = Text::Fuzzy->new ('whammo');
164 my @nearest = $tf->nearestv (\@words);
165 print "@nearest\n";
166
167 produces output
168
169 who what
170
171 (This example is included as nearestv.pl
172 <https://fastapi.metacpan.org/source/BKB/Text-
173 Fuzzy-0.29/examples/nearestv.pl> in the distribution.)
174
175 The behaviour of the match can be controlled with "no_exact" and
176 "set_max_distance" in exactly the same way as "nearest".
177
178 This is a convenient wrapper around the "nearest" function. "nearest"
179 is annoying to use, because it only returns array offsets, and also
180 error-prone due to having to check to distinguish the first element of
181 the array from an undefined value using "defined".
182
183 This method was added in version 0.18 of Text::Fuzzy.
184
185 nearest
186 my $index = $tf->nearest (\@words);
187 my $nearest_word = $words[$index];
188
189 Given an array reference, this returns a number, the index of the
190 nearest element in the array @words to the argument to "new". Having
191 found the nearest match you then need to look up the value in the
192 array, as in $nearest_word above.
193
194 It is possible to set a maximum edit distance, beyond which entries are
195 rejected, using "set_max_distance" or the "max" parameter to "new". In
196 this case, if none of the elements of @words are less than the maximum
197 distance away from the word, $index is the undefined value, so when
198 setting a maximum distance, it is necessary to check the return value
199 of index using "defined".
200
201 use Text::Fuzzy;
202 my $tf = Text::Fuzzy->new ('calamari', max => 1);
203 my @words = qw/Have you ever kissed in the moonlight
204 In the grand and glorious
205 Gay notorious
206 South American Way?/;
207 my $index = $tf->nearest (\@words);
208 if (defined $index) {
209 printf "Found at $index, distance was %d.\n",
210 $tf->last_distance ();
211 }
212 else {
213 print "Not found anywhere.\n";
214 }
215
216 produces output
217
218 Not found anywhere.
219
220 (This example is included as check-return.pl
221 <https://fastapi.metacpan.org/source/BKB/Text-
222 Fuzzy-0.29/examples/check-return.pl> in the distribution.)
223
224 If there is more than one word with the same edit distance in @words,
225 this returns the last one found, unless it is an exact match, in which
226 case it returns the first one found. To get all matches, call it in
227 array context:
228
229 my @nearest = $tf->nearest (\@words);
230
231 In array context, if there are no matches within the minimum distance,
232 "nearest" returns an empty list. If there is one or more match, it
233 returns the array offset of it or them, not the value itself.
234
235 use Text::Fuzzy;
236
237 my @funky_words = qw/nice funky rice gibbon lice graeme garden/;
238 my $tf = Text::Fuzzy->new ('dice');
239 my @nearest = $tf->nearest (\@funky_words);
240
241 print "The nearest words are ";
242 print join ", ", (map {$funky_words[$_]} @nearest);
243 printf ", distance %d.\n", $tf->last_distance ();
244
245 produces output
246
247 The nearest words are nice, rice, lice, distance 1.
248
249 (This example is included as list-context.pl
250 <https://fastapi.metacpan.org/source/BKB/Text-Fuzzy-0.29/examples/list-
251 context.pl> in the distribution.)
252
253 last_distance
254 my $last_distance = $tf->last_distance ();
255
256 The distance from the previous match's closest match. This is used in
257 conjunction with "nearest" or "nearestv" to find the edit distance to
258 the previous match.
259
260 use Text::Fuzzy;
261 my @words = (qw/who where what when why/);
262 my $tf = Text::Fuzzy->new ('whammo');
263 my @nearest = $tf->nearestv (\@words);
264 print "@nearest\n";
265 print $tf->last_distance (), "\n";
266 # Prints 3, the number of edits needed to turn "whammo" into "who"
267 # (delete a, m, m) or into "what" (replace m with t, delete m, delete
268 # o).
269
270 produces output
271
272 who what
273 3
274
275 (This example is included as last-distance.pl
276 <https://fastapi.metacpan.org/source/BKB/Text-Fuzzy-0.29/examples/last-
277 distance.pl> in the distribution.)
278
279 set_max_distance
280 # Set the max distance.
281 $tf->set_max_distance (3);
282
283 Set the maximum edit distance of $tf. Set the maximum distance to a low
284 value to improve the speed of searches over lists with "nearest", or to
285 reject unlikely matches. When searching for a near match, anything with
286 an edit distance of a value over the maximum is rejected without
287 computing the exact distance. To compute exact distances, call this
288 method without an argument:
289
290 $tf->set_max_distance ();
291
292 The maximum edit distance is switched off, and whatever the nearest
293 match is is accepted. A negative value also switches it off:
294
295 $tf->set_max_distance (-1);
296
297 The object created by "new" has no maximum distance unless specified by
298 the user.
299
300 use Text::Fuzzy;
301 my $tf = Text::Fuzzy->new ('nopqrstuvwxyz');
302 # Prints 13, the correct value.
303 print $tf->distance ('abcdefghijklm'), "\n";
304 $tf->set_max_distance (10);
305 # Prints 11, one more than the maximum distance, because the search
306 # stopped when the distance was exceeded.
307 print $tf->distance ('abcdefghijklm'), "\n";
308
309 produces output
310
311 13
312 11
313
314 (This example is included as max-dist.pl
315 <https://fastapi.metacpan.org/source/BKB/Text-Fuzzy-0.29/examples/max-
316 dist.pl> in the distribution.)
317
318 Setting the maximum distance is a way to make a search more rapid. For
319 example if you are searching over a dictionary of 100,000 or a million
320 words, and only need close matches, you can more rapidly reject
321 unwanted matches by setting the maximum distance to a lower value.
322 Calculating Levenshtein distance is an O(n^2) algorithm in the lengths
323 of the words, so even a small increase in the maximum permitted
324 distance means a much larger amount of work for the computer to do.
325 With the maximum distance set, the computer can give up calculating
326 more quickly with bad matches.
327
328 transpositions_ok
329 $tf->transpositions_ok (1);
330
331 A true value in the argument changes the type of edit distance used to
332 allow transpositions, such as "clam" and "calm". Initially
333 transpositions are not allowed, giving the Levenshtein edit distance.
334 If transpositions are used, the edit distance becomes the Damerau-
335 Levenshtein edit distance. A false value disallows transpositions:
336
337 $tf->transpositions_ok (0);
338
339 no_exact
340 $tf->no_exact (1);
341
342 This is a flag to "nearest" which makes it ignore exact matches. For
343 example,
344
345 use Text::Fuzzy;
346
347 my @words = qw/bibbity bobbity boo/;
348 for my $word (@words) {
349 my $tf = Text::Fuzzy->new ($word);
350 $tf->no_exact (0);
351 my $nearest1 = $tf->nearest (\@words);
352 print "With exact, nearest to $word is $words[$nearest1]\n";
353 # Make "$word" not match itself.
354 $tf->no_exact (1);
355 my $nearest2 = $tf->nearest (\@words);
356 print "Without exact, nearest to $word is $words[$nearest2]\n";
357 }
358
359 produces output
360
361 With exact, nearest to bibbity is bibbity
362 Without exact, nearest to bibbity is bobbity
363 With exact, nearest to bobbity is bobbity
364 Without exact, nearest to bobbity is bibbity
365 With exact, nearest to boo is boo
366 Without exact, nearest to boo is bobbity
367
368 (This example is included as no-exact.pl
369 <https://fastapi.metacpan.org/source/BKB/Text-Fuzzy-0.29/examples/no-
370 exact.pl> in the distribution.)
371
372 This is for the case of searching over an array which contains the
373 searched-for item itself.
374
375 scan_file
376 my $nearest = $tf->scan_file ('/usr/share/dict/words');
377
378 Scan a file to find the nearest match to the word used in "new". This
379 assumes that the file contains lines of text separated by newlines, and
380 finds the closest match in the file. Its return value is a string
381 rather than a line number. It cannot return an array of values. It does
382 not currently support Unicode-encoded files.
383
385 These functions do not require a "Text::Fuzzy" object.
386
387 distance_edits
388 my ($distance, $edits) = distance_edits ('before', 'after');
389
390 This returns the edit distance between the two arguments, and the edits
391 necessary to transform the first one into the second one. $Edits is a
392 string containing the four letters k, r, d, and i, for "keep",
393 "replace", "delete", and "insert" respectively. For example, for
394 "piece" and "peace", $edits contains "krrkk" for "keep, replace,
395 replace, keep, keep".
396
397 use Text::Fuzzy 'distance_edits';
398 my @words = (qw/who where what when why/);
399 my $tf = Text::Fuzzy->new ('whammo');
400 my @nearest = $tf->nearestv (\@words);
401 print "@nearest\n";
402 # Prints "who what"
403 print $tf->last_distance (), "\n";
404 # Prints 3, the number of edits needed to turn "whammo" into "who"
405 # (delete a, m, m) or into "what" (replace m with t, delete m, delete
406 # o).
407 my ($distance, $edits) = distance_edits ('whammo', 'who');
408 print "$edits\n";
409 # Prints kkdddk, keep w, keep h, delete a, delete m, delete m, keep o.
410
411 produces output
412
413 who what
414 3
415 kkdddk
416
417 (This example is included as distance-edits.pl
418 <https://fastapi.metacpan.org/source/BKB/Text-
419 Fuzzy-0.29/examples/distance-edits.pl> in the distribution.)
420
421 This does not handle transpositions. Unlike the rest of the module,
422 this is pure Perl rather than XS, and not optimized for speed. The edit
423 distance search within "nearest" is optimized for speed, and hence
424 discards its record of edits used to get the result.
425
426 fuzzy_index
427 my ($offset, $edits, $distance) = fuzzy_index ($needle, $haystack);
428
429 Searches for $needle in $haystack using fuzzy matching.
430
431 Return value is the offest of the closest match found, the edits
432 necessary on $needle to make it into the matching text, and the
433 Levenshtein edit distance between the matching part of $haystack and
434 $needle.
435
436 For the algorithm used, see
437
438 <http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/>
439
440 This is implemented in Perl not C, and it's slow due to lots of
441 debugging code. Please expect the interface and internals to change.
442
444 This section gives extended examples of the use of the module to solve
445 practical problems.
446
447 misspelt-web-page.cgi
448 The file examples/misspelt-web-page.cgi
449 <https://fastapi.metacpan.org/source/BKB/Text-
450 Fuzzy-0.29/examples/misspelt-web-page.cgi> is an example of a CGI
451 script which does something similar to the Apache mod_speling module,
452 offering spelling corrections for mistyped URLs and sending the user to
453 a correct page.
454
455 use Text::Fuzzy;
456
457 # The directory of files served by the web server.
458
459 my $web_root = '/usr/local/www/data';
460
461 # If the query is "http://www.example.com/abc/xyz.html", $path_info is
462 # "abc/xyz.html".
463
464 my $path_info = $ENV{REQUEST_URI};
465
466 if (! defined $path_info) {
467 fail ("No path info");
468 }
469
470 if ($0 =~ /$path_info/) {
471 fail ("Don't redirect to self");
472 }
473
474 # This is the list of files under the main page.
475
476 my @allfiles = get_all_files ($web_root, '');
477
478 # This is our spelling search engine.
479
480 my $tf = Text::Fuzzy->new ($path_info);
481
482 my $nearest = $tf->nearest (\@allfiles, max => 5);
483
484 if (defined $nearest) {
485 redirect ($allfiles[$nearest]);
486 }
487 else {
488 fail ("Nothing like $path_info was found");
489 }
490 exit;
491
492 # Read all the files under "$root/$dir". This is recursive. The return
493 # value is an array containing all files found.
494
495 sub get_all_files
496 {
497 my ($root, $dir) = @_;
498 my @allfiles;
499 my $full_dir = "$root/$dir";
500 if (! -d $full_dir) {
501 fail ("$full_dir is not a directory");
502 }
503 opendir DIR, $full_dir or fail ("Can't open directory $full_dir: $!");
504 my @files = grep !/^\./, readdir DIR;
505 closedir DIR or fail ("Can't close $full_dir: $!");
506 for my $file (@files) {
507 my $dir_file = "$dir/$file";
508 my $full_file = "$root/$dir_file";
509 if (-d $full_file) {
510 push @allfiles, get_all_files ($root, $dir_file);
511 }
512 else {
513 push @allfiles, $dir_file;
514 }
515 }
516 return @allfiles;
517 }
518
519 # Print a "permanent redirect" to the respelt name, then exit.
520
521 sub redirect
522 {
523 my ($url) = @_;
524 print <<EOF;
525 Status: 301
526 Location: $url
527
528 EOF
529 exit;
530 }
531
532 # Print an error message for the sake of the requester, and print a
533 # message to the error log, then exit.
534
535 sub fail
536 {
537 my ($error) = @_;
538 print <<EOF;
539 Content-Type: text/plain
540
541 $error
542 EOF
543 # Add the name of the program and the time to the error message,
544 # otherwise the error log will get awfully confusing-looking.
545 my $time = scalar gmtime ();
546 print STDERR "$0: $time: $error\n";
547 exit;
548 }
549
550 See also <http://www.lemoda.net/perl/perl-mod-speling/> for how to set
551 up .htaccess to use the script.
552
553 spell-check.pl
554 The file examples/spell-check.pl
555 <https://fastapi.metacpan.org/source/BKB/Text-
556 Fuzzy-0.29/examples/spell-check.pl> is a spell checker. It uses a
557 dictionary of words specified by a command-line option "-d":
558
559 spell-check.pl -d /usr/dict/words file1.txt file2.txt
560
561 It prints out any words which look like spelling mistakes, using the
562 dictionary.
563
564 use Getopt::Long;
565 use Text::Fuzzy;
566 use Lingua::EN::PluralToSingular 'to_singular';
567
568 # The location of the Unix dictionary.
569 my $dict = '/usr/share/dict/words';
570
571 # Default maximum edit distance. Five is quite a big number for a
572 # spelling mistake.
573 my $max = 5;
574
575 GetOptions (
576 "dict=s" => \$dict,
577 "max=i" => \$max,
578 );
579
580 my @words;
581 my %words;
582 my $min_length = 4;
583 read_dictionary ($dict, \@words, \%words);
584 # Known mistakes, don't repeat.
585 my %known;
586 # Spell-check each file on the command line.
587 for my $file (@ARGV) {
588 open my $input, "<", $file or die "Can't open $file: $!";
589 while (<$input>) {
590 my @line = split /[^a-z']+/i, $_;
591 for my $word (@line) {
592 # Remove leading/trailing apostrophes.
593 $word =~ s/^'|'$//g;
594 my $clean_word = to_singular (lc $word);
595 $clean_word =~ s/'s$//;
596 if ($words{$clean_word} || $words{$word}) {
597 # It is in the dictionary.
598 next;
599 }
600 if (length $word < $min_length) {
601 # Very short words are ignored.
602 next;
603 }
604 if ($word eq uc $word) {
605 # Acronym like BBC, IRA, etc.
606 next;
607 }
608 if ($known{$clean_word}) {
609 # This word was already given to the user.
610 next;
611 }
612 if ($clean_word =~ /(.*)ed$/ || $clean_word =~ /(.*)ing/) {
613 my $stem = $1;
614 if ($words{$stem} || $words{"${stem}e"}) {
615 # Past/gerund of $stem/${stem}e
616 next;
617 }
618 # Test for doubled end consonants,
619 # e.g. "submitted"/"submit".
620 if ($stem =~ /([bcdfghjklmnpqrstvwxz])\1/) {
621 $stem =~ s/$1$//;
622 if ($words{$stem}) {
623 # Past/gerund of $stem/${stem}e
624 next;
625 }
626 }
627 }
628 my $tf = Text::Fuzzy->new ($clean_word, max => $max);
629 my $nearest = $tf->nearest (\@words);
630 # We have set a maximum distance to search for, so we need
631 # to check whether $nearest is defined.
632 if (defined $nearest) {
633 my $correction = $words[$nearest];
634 print "$file:$.: '$word' may be '$correction'.\n";
635 $known{$clean_word} = $correction;
636 }
637 else {
638 print "$file:$.: $word may be a spelling mistake.\n";
639 $known{$clean_word} = 1;
640 }
641 }
642 }
643 close $input or die $!;
644 }
645
646 exit;
647
648 sub read_dictionary
649 {
650 my ($dict, $words_array, $words_hash) = @_;
651 open my $din, "<", $dict or die "Can't open dictionary $dict: $!";
652 my @words;
653 while (<$din>) {
654 chomp;
655 push @words, $_;
656 }
657 close $din or die $!;
658 # Apostrophe words
659
660 my @apo = qw/
661
662 let's I'll you'll he'll she'll they'll we'll I'm
663 you're he's she's it's we're they're I've they've
664 you've we've one's isn't aren't doesn't don't
665 won't wouldn't I'd you'd he'd we'd they'd
666 shouldn't couldn't didn't can't
667
668 /;
669
670 # Irregular past participles.
671 my @pp = qw/became/;
672
673 push @words, @apo, @pp;
674 for (@words) {
675 push @$words_array, lc $_;
676 $words_hash->{$_} = 1;
677 $words_hash->{lc $_} = 1;
678 }
679 }
680
681 Because the usual Unix dictionary doesn't have plurals, it uses
682 Lingua::EN::PluralToSingular, to convert nouns into singular forms.
683 Unfortunately it still misses past participles and past tenses of
684 verbs.
685
686 extract-kana.pl
687 The file examples/extract-kana.pl
688 <https://fastapi.metacpan.org/source/BKB/Text-
689 Fuzzy-0.29/examples/extract-kana.pl> extracts the kana entries from
690 "edict", a freely-available Japanese to English electronic dictionary,
691 and does some fuzzy searches on them. It requires a local copy of the
692 file to run. This script demonstrates the use of Unicode searches with
693 Text::Fuzzy.
694
695 use Lingua::JA::Moji ':all';
696 use Text::Fuzzy;
697 use utf8;
698 my $infile = '/home/ben/data/edrdg/edict';
699 open my $in, "<:encoding(EUC-JP)", $infile or die $!;
700 my @kana;
701 while (<$in>) {
702 my $kana;
703 if (/\[(\p{InKana}+)\]/) {
704 $kana = $1;
705 }
706 elsif (/^(\p{InKana}+)/) {
707 $kana = $1;
708 }
709 if ($kana) {
710 $kana = kana2katakana ($kana);
711 push @kana, $kana;
712 }
713 }
714 printf "Starting fuzzy searches over %d lines.\n", scalar @kana;
715 search ('ウオソウコ');
716 search ('アイウエオカキクケコバビブベボハヒフヘホ');
717 search ('アルベルトアインシュタイン');
718 search ('バババブ');
719 search ('バババブアルベルト');
720 exit;
721
722 sub search
723 {
724 my ($silly) = @_;
725 my $max = 10;
726 my $search = Text::Fuzzy->new ($silly, max => $max);
727 my $n = $search->nearest (\@kana);
728 if (defined $n) {
729 printf "$silly nearest is $kana[$n] (distance %d)\n",
730 $search->last_distance ();
731 }
732 else {
733 printf "Nothing like '$silly' was found within $max edits.\n";
734 }
735 }
736
737 Lingua::JA::Gairaigo::Fuzzy
738 The module Lingua::JA::Gairaigo::Fuzzy tries to determine whether two
739 Japanese loanwords are the same word or not.
740
741 CPAN::Nearest
742 The module CPAN::Nearest offers a search over the titles of CPAN
743 modules using a fuzzy search to get the nearest match.
744
746 This module has no dependencies on other modules.
747
749 Reporting a bug
750 There is a bug tracker for the module at
751 <https://github.com/benkasminbullock/Text-Fuzzy/issues>.
752
753 Testing
754 The CPAN tester results are at
755 <http://www.cpantesters.org/distro/T/Text-Fuzzy.html>. The ActiveState
756 tester results are at <http://code.activestate.com/ppm/Text-Fuzzy/>.
757
759 The following methods are for benchmarking the module and checking its
760 correctness.
761
762 no_alphabet
763 $tf->no_alphabet (1);
764
765 This turns off alphabetizing of the string. Alphabetizing is a filter
766 where the intersection of all the characters in the two strings is
767 computed, and if the alphabetical difference of the two strings is
768 greater than the maximum distance, the match is rejected without
769 applying the dynamic programming algorithm. This increases speed,
770 because the dynamic programming algorithm is slow.
771
772 The alphabetizing should not ever reject anything which is a legitimate
773 match, and it should make the program run faster in almost every case.
774 The only envisaged uses of switching this off are checking that the
775 algorithm is working correctly, and benchmarking performance.
776
777 get_trans
778 my $trans_ok = $tf->get_trans ();
779
780 This returns the value set by "transpositions_ok".
781
782 unicode_length
783 my $length = $tf->unicode_length ();
784
785 This returns the length in characters (not bytes) of the string used in
786 "new". If the string is not marked as Unicode, it returns the undefined
787 value. In the following, $l1 should be equal to $l2.
788
789 use utf8;
790 my $word = 'ⅅⅆⅇⅈⅉ';
791 my $l1 = length $word;
792 my $tf = Text::Fuzzy->new ($word);
793 my $l2 = $tf->unicode_length ();
794
795 ualphabet_rejections
796 my $rejected = $tf->ualphabet_rejections ();
797
798 After running "nearest" over an array, this returns the number of
799 entries of the array which were rejected using only the Unicode
800 alphabet. Its value is reset to zero each time "nearest" is called.
801
802 alphabet_rejections
803 my $rejected = $tf->alphabet_rejections ();
804
805 After running "nearest" over an array, this returns the number of
806 entries of the array which were rejected using only the non-Unicode
807 alphabet. Its value is reset to zero each time "nearest" is called.
808
809 length_rejections
810 my $rejected = $tf->length_rejections ();
811
812 After running "nearest" over an array, this returns the number of
813 entries of the array which were rejected because the length difference
814 between them and the target string was larger than the maximum distance
815 allowed.
816
817 get_max_distance
818 # Get the maximum edit distance.
819 print "The max distance is ", $tf->get_max_distance (), "\n";
820
821 Get the maximum edit distance of $tf. The maximum distance may be set
822 with "set_max_distance".
823
825 Other CPAN modules
826 Similar modules on CPAN include the following.
827
828 String::Approx
829 Approximate matching (fuzzy matching) using the Levenshtein edit
830 distance. As a bonus, if you don't have a headache, you can get one
831 easily trying to make head or tail out of this module's
832 documentation.
833
834 Text::EditTranscript
835 Determine the edit transcript between two strings. This is similar
836 to what you get from "distance_edits" in this module.
837
838 Text::Fuzzy::PP
839 This is Nick Logan's Pure Perl version of this module.
840
841 Text::Levenshtein::Damerau
842 Levenshtein-Damerau edit distance.
843
844 Text::Levenshtein
845 Text::Levenshtein::Flexible
846 XS Levenshtein distance calculation with bounds and adjustable
847 costs (so the cost of deletion can be more than the cost of
848 addition, etc.) See also Text::WagnerFischer for a pure-Perl
849 module which also allows altered costs.
850
851 Text::Levenshtein::XS
852 An XS implementation of the Levenshtein edit distance. It claims to
853 be a drop-in replacement for Text::LevenshteinXS which does Unicode
854 correctly.
855
856 Text::LevenshteinXS
857 An XS implementation of the Levenshtein edit distance. Does not do
858 Unicode very well. See
859 <https://rt.cpan.org/Public/Bug/Display.html?id=36685>.
860
861 Text::Levenshtein::Edlib
862 A wrapper around the edlib library that computes Levenshtein edit
863 distance and optimal alignment path for a pair of strings.
864
865 Tree::BK
866 Structure for efficient fuzzy matching.
867
868 Text::Brew
869 An implementation of the Brew edit distance.
870
871 Text::WagnerFischer
872 Implements the Wagner-Fischer algorithm to calculate edit
873 distances. This is generalised version of the Levenshtein edit
874 distance. See also Text::Levenshtein::Flexible for an XS version.
875
876 Bencher::Scenario::LevenshteinModules
877 Some benchmarks of various modules including this one.
878
879 Text::JaroWinkler
880 Another text similarity measure.
881
882 About the algorithms
883 This section contains some blog posts which I found useful in
884 understanding the algorithms.
885
886 Fuzzy substring matching with Levenshtein distance in Python
887 <http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-
888 with-levenshtein-distance-in-python/> by Ryan Ginstrom explains the
889 Levenshtein algorithm and its use in substring matching.
890
891 Damerau-Levenshtein Edit Distance Explained
892 <https://www.lemoda.net/text-fuzzy/damerau-levenshtein/index.html> by
893 James M. Jensen II explains the Damerau-Levenshtein edit distance (the
894 algorithm used with "transpositions_ok").
895
896 I recommend steering fairly clear of the Wikipedia articles on these
897 things, which are very poorly written indeed.
898
899 References
900 Here are the original research papers by the algorithms' discoverers.
901
902 Damerau
903 Damerau, Fred J. (March 1964), "A technique for computer detection
904 and correction of spelling errors", Communications of the ACM, ACM,
905 7 (3): 171–176, doi:10.1145/363958.363994
906
907 Levenshtein
908 Levenshtein, Vladimir I. (February 1966), "Binary codes capable of
909 correcting deletions, insertions, and reversals", Soviet Physics
910 Doklady, 10 (8): 707–710
911
912 Wagner and Fischer
913 R. Wagner and M. Fischer (1974), "The string to string correction
914 problem", Journal of the ACM, 21:168-178, doi:10.1145/321796.321811
915
917 0.26
918 • A bug was fixed where an input string may be overwritten in
919 code like
920
921 my $tf = Text::Fuzzy->new ($x);
922 $tf->distance ($y);
923
924 if $x is a plain ASCII string and $y is a Unicode string.
925
926 • Links were added to the examples, and the outputs of the
927 examples were added as part of the documentation.
928
929 0.28
930 • The transposition code (the implementation of the Damerau-
931 Levenshtein distance) was completely rewritten to make it more
932 efficient. The back-indexing of strings to find transpositions
933 was changed so that the index of the object's string (the first
934 argument to "new") is preserved from one query to the next. A
935 useless indexing of characters in the other string was removed.
936 The structure used to hold the characters was changed from an
937 unsorted allocated linked list to a sorted array in the case of
938 Unicode strings, and a 256 character array in the case of non-
939 Unicode strings. The oddly-named variables were renamed to more
940 meaningful names.
941
942 • The transposition code now allows for a maximum distance to be
943 set, beyond which no further matches will be allowed.
944
946 The edit distance including transpositions was contributed by Nick
947 Logan (UGEXE). (This code was largely rewritten in version "0.28", so
948 Nick Logan can no longer be held responsible for the Text::Fuzzy
949 module's failings.) Some of the tests in t/trans.t are taken from the
950 Text::Levenshtein::Damerau::XS module. Nils Boeffel reported a bug
951 where strings may be overwritten in version 0.25.
952
954 Ben Bullock, <bkb@cpan.org>
955
957 This package and associated files are copyright (C) 2012-2020 Ben
958 Bullock.
959
960 You can use, copy, modify and redistribute this package and associated
961 files under the Perl Artistic Licence or the GNU General Public
962 Licence.
963
964
965
966perl v5.36.0 2022-07-22 Text::Fuzzy(3)