1Text::Fuzzy(3)        User Contributed Perl Documentation       Text::Fuzzy(3)
2
3
4

NAME

6       Text::Fuzzy - Partial string matching using edit distances
7

SYNOPSIS

9           use Text::Fuzzy;
10           my $tf = Text::Fuzzy->new ('boboon');
11           print "Distance is ", $tf->distance ('babboon'), "\n";
12           my @words = qw/the quick brown fox jumped over the lazy dog/;
13           my $nearest = $tf->nearestv (\@words);
14           print "Nearest array entry is $nearest\n";
15
16       produces output
17
18           Distance is 2
19           Nearest array entry is brown
20
21       (This example is included as synopsis.pl
22       <https://fastapi.metacpan.org/source/BKB/Text-
23       Fuzzy-0.29/examples/synopsis.pl> in the distribution.)
24

VERSION

26       This documents version 0.29 of Text::Fuzzy corresponding to git commit
27       0e5db1fbccbc7d7518ec39a95d43a5b20166a727
28       <https://github.com/benkasminbullock/Text-
29       Fuzzy/commit/0e5db1fbccbc7d7518ec39a95d43a5b20166a727> released on Thu
30       Dec 10 14:38:31 2020 +0900.
31

DESCRIPTION

33       This module calculates edit distances between words, and searches
34       arrays and files to find the nearest entry by edit distance. It handles
35       both byte strings and character strings (strings containing Unicode),
36       treating each Unicode character as a single entity.
37
38           use Text::Fuzzy;
39           use utf8;
40           my $tf = Text::Fuzzy->new ('あいうえお☺');
41           print $tf->distance ('うえお☺'), "\n";
42
43       produces output
44
45           2
46
47       (This example is included as unicode.pl
48       <https://fastapi.metacpan.org/source/BKB/Text-
49       Fuzzy-0.29/examples/unicode.pl> in the distribution.)
50
51       The default edit distance is the Levenshtein one, which counts each
52       addition ("cat" -> "cart"), substitution ("cat" -> "cut"), and deletion
53       ("carp" -> "cap") as one unit. The Damerau-Levenshtein edit distance,
54       which also allows transpositions ("salt" -> "slat") may also be
55       selected with the "transpositions_ok" method or the "trans" option.
56
57       This module is particularly suited to searching for the nearest match
58       to a term over a list of words, using the "nearestv" or "nearest"
59       methods. It studies the target string to be matched (the first argument
60       to "new") to build information to rapidly reject mismatches in a list.
61       Since computing the Levenshtein and Damerau-Levenshtein edit distances
62       with the Wagner-Fischer algorithm is computationally expensive, the
63       module offers a boost in performance for searching for a string in a
64       list of words.
65

METHODS

67   new
68           my $tf = Text::Fuzzy->new ('bibbety bobbety boo');
69
70       Create a new Text::Fuzzy object from the supplied word.
71
72       The following parameters may be supplied to new:
73
74       max
75               my $tf = Text::Fuzzy->new ('Cinderella', max => 3);
76
77           This option affects the behaviour of "nearestv" and "nearest"
78           methods. When searching over an array, this sets the maximum edit
79           distance allowed for a word to be considered a "near match". For
80           example, with
81
82               my $tf = Text::Fuzzy->new ('Cinderella');
83               $tf->set_max_distance (3);
84
85           when using "nearest", 'Cinder' will not be considered a match, but
86           'derella' will.
87
88           To switch off the maximum distance, and allow all words to be
89           considered, you can set "max" to be a negative value:
90
91               my $tf = Text::Fuzzy->new ('Cinderella', max => -1);
92
93           Note that this is the default, so there is hardly any point
94           specifying it, except if you want to make self-documenting code, or
95           you're worried that the module's default behaviour may suddenly
96           change.
97
98           Setting "max" to zero makes $tf only match exactly.
99
100           The method "set_max_distance" does the same thing as this
101           parameter.
102
103       no_exact
104               my $tf = Text::Fuzzy->new ('slipper', no_exact => 1);
105
106           This parameter switches on rejection of exact matches, in the same
107           way as the method "no_exact":
108
109               my $tf = Text::Fuzzy->new ('slipper');
110               $tf->no_exact (1);
111
112           This is useful for the case of scanning an array which contains the
113           search term itself, when we are interested in near matches only.
114           For example, if we have a dictionary of words and we need to find
115           near matches for a word which is in the dictionary.
116
117       trans
118               my $tf = Text::Fuzzy->new ('glass', trans => 1);
119
120           This switches on transpositions, in other words it uses the
121           Damerau-Levenshtein edit distance rather than the Levenshtein edit
122           distance. The method "transpositions_ok" has the same effect as
123           this.
124
125   distance
126           my $dist = $tf->distance ($word);
127
128       This method's return value is the edit distance to $word from the word
129       used to create the object in "new".
130
131           use Text::Fuzzy;
132           my $cat = Text::Fuzzy->new ('cat');
133           print $cat->distance ('cut'), "\n";
134           print $cat->distance ('cart'), "\n";
135           print $cat->distance ('catamaran'), "\n";
136           use utf8;
137           print $cat->distance ('γάτος'), "\n";
138
139       produces output
140
141           1
142           1
143           6
144           5
145
146       (This example is included as distance.pl
147       <https://fastapi.metacpan.org/source/BKB/Text-
148       Fuzzy-0.29/examples/distance.pl> in the distribution.)
149
150       To know which edits are used to convert the words, use
151       "distance_edits".
152
153   nearestv
154           my $nearest_word = $tf->nearestv (\@words);
155           my @nearest_words = $tf->nearestv (\@words);
156
157       Returns the value in @words which has the nearest distance to the value
158       given to $tf in "new". In array context, it returns a list of the
159       nearest values.
160
161           use Text::Fuzzy;
162           my @words = (qw/who where what when why/);
163           my $tf = Text::Fuzzy->new ('whammo');
164           my @nearest = $tf->nearestv (\@words);
165           print "@nearest\n";
166
167       produces output
168
169           who what
170
171       (This example is included as nearestv.pl
172       <https://fastapi.metacpan.org/source/BKB/Text-
173       Fuzzy-0.29/examples/nearestv.pl> in the distribution.)
174
175       The behaviour of the match can be controlled with "no_exact" and
176       "set_max_distance" in exactly the same way as "nearest".
177
178       This is a convenient wrapper around the "nearest" function. "nearest"
179       is annoying to use, because it only returns array offsets, and also
180       error-prone due to having to check to distinguish the first element of
181       the array from an undefined value using "defined".
182
183       This method was added in version 0.18 of Text::Fuzzy.
184
185   nearest
186           my $index = $tf->nearest (\@words);
187           my $nearest_word = $words[$index];
188
189       Given an array reference, this returns a number, the index of the
190       nearest element in the array @words to the argument to "new". Having
191       found the nearest match you then need to look up the value in the
192       array, as in $nearest_word above.
193
194       It is possible to set a maximum edit distance, beyond which entries are
195       rejected, using "set_max_distance" or the "max" parameter to "new".  In
196       this case, if none of the elements of @words are less than the maximum
197       distance away from the word, $index is the undefined value, so when
198       setting a maximum distance, it is necessary to check the return value
199       of index using "defined".
200
201           use Text::Fuzzy;
202           my $tf = Text::Fuzzy->new ('calamari', max => 1);
203           my @words = qw/Have you ever kissed in the moonlight
204                          In the grand and glorious
205                          Gay notorious
206                          South American Way?/;
207           my $index = $tf->nearest (\@words);
208           if (defined $index) {
209               printf "Found at $index, distance was %d.\n",
210               $tf->last_distance ();
211           }
212           else {
213               print "Not found anywhere.\n";
214           }
215
216       produces output
217
218           Not found anywhere.
219
220       (This example is included as check-return.pl
221       <https://fastapi.metacpan.org/source/BKB/Text-
222       Fuzzy-0.29/examples/check-return.pl> in the distribution.)
223
224       If there is more than one word with the same edit distance in @words,
225       this returns the last one found, unless it is an exact match, in which
226       case it returns the first one found. To get all matches, call it in
227       array context:
228
229           my @nearest = $tf->nearest (\@words);
230
231       In array context, if there are no matches within the minimum distance,
232       "nearest" returns an empty list. If there is one or more match, it
233       returns the array offset of it or them, not the value itself.
234
235           use Text::Fuzzy;
236
237           my @funky_words = qw/nice funky rice gibbon lice graeme garden/;
238           my $tf = Text::Fuzzy->new ('dice');
239           my @nearest = $tf->nearest (\@funky_words);
240
241           print "The nearest words are ";
242           print join ", ", (map {$funky_words[$_]} @nearest);
243           printf ", distance %d.\n", $tf->last_distance ();
244
245       produces output
246
247           The nearest words are nice, rice, lice, distance 1.
248
249       (This example is included as list-context.pl
250       <https://fastapi.metacpan.org/source/BKB/Text-Fuzzy-0.29/examples/list-
251       context.pl> in the distribution.)
252
253   last_distance
254           my $last_distance = $tf->last_distance ();
255
256       The distance from the previous match's closest match. This is used in
257       conjunction with "nearest" or "nearestv" to find the edit distance to
258       the previous match.
259
260           use Text::Fuzzy;
261           my @words = (qw/who where what when why/);
262           my $tf = Text::Fuzzy->new ('whammo');
263           my @nearest = $tf->nearestv (\@words);
264           print "@nearest\n";
265           print $tf->last_distance (), "\n";
266           # Prints 3, the number of edits needed to turn "whammo" into "who"
267           # (delete a, m, m) or into "what" (replace m with t, delete m, delete
268           # o).
269
270       produces output
271
272           who what
273           3
274
275       (This example is included as last-distance.pl
276       <https://fastapi.metacpan.org/source/BKB/Text-Fuzzy-0.29/examples/last-
277       distance.pl> in the distribution.)
278
279   set_max_distance
280           # Set the max distance.
281           $tf->set_max_distance (3);
282
283       Set the maximum edit distance of $tf. Set the maximum distance to a low
284       value to improve the speed of searches over lists with "nearest", or to
285       reject unlikely matches. When searching for a near match, anything with
286       an edit distance of a value over the maximum is rejected without
287       computing the exact distance. To compute exact distances, call this
288       method without an argument:
289
290           $tf->set_max_distance ();
291
292       The maximum edit distance is switched off, and whatever the nearest
293       match is is accepted. A negative value also switches it off:
294
295           $tf->set_max_distance (-1);
296
297       The object created by "new" has no maximum distance unless specified by
298       the user.
299
300           use Text::Fuzzy;
301           my $tf = Text::Fuzzy->new ('nopqrstuvwxyz');
302           # Prints 13, the correct value.
303           print $tf->distance ('abcdefghijklm'), "\n";
304           $tf->set_max_distance (10);
305           # Prints 11, one more than the maximum distance, because the search
306           # stopped when the distance was exceeded.
307           print $tf->distance ('abcdefghijklm'), "\n";
308
309       produces output
310
311           13
312           11
313
314       (This example is included as max-dist.pl
315       <https://fastapi.metacpan.org/source/BKB/Text-Fuzzy-0.29/examples/max-
316       dist.pl> in the distribution.)
317
318       Setting the maximum distance is a way to make a search more rapid. For
319       example if you are searching over a dictionary of 100,000 or a million
320       words, and only need close matches, you can more rapidly reject
321       unwanted matches by setting the maximum distance to a lower value.
322       Calculating Levenshtein distance is an O(n^2) algorithm in the lengths
323       of the words, so even a small increase in the maximum permitted
324       distance means a much larger amount of work for the computer to do.
325       With the maximum distance set, the computer can give up calculating
326       more quickly with bad matches.
327
328   transpositions_ok
329           $tf->transpositions_ok (1);
330
331       A true value in the argument changes the type of edit distance used to
332       allow transpositions, such as "clam" and "calm". Initially
333       transpositions are not allowed, giving the Levenshtein edit distance.
334       If transpositions are used, the edit distance becomes the Damerau-
335       Levenshtein edit distance. A false value disallows transpositions:
336
337           $tf->transpositions_ok (0);
338
339   no_exact
340           $tf->no_exact (1);
341
342       This is a flag to "nearest" which makes it ignore exact matches. For
343       example,
344
345           use Text::Fuzzy;
346
347           my @words = qw/bibbity bobbity boo/;
348           for my $word (@words) {
349               my $tf = Text::Fuzzy->new ($word);
350               $tf->no_exact (0);
351               my $nearest1 = $tf->nearest (\@words);
352               print "With exact, nearest to $word is $words[$nearest1]\n";
353               # Make "$word" not match itself.
354               $tf->no_exact (1);
355               my $nearest2 = $tf->nearest (\@words);
356               print "Without exact, nearest to $word is $words[$nearest2]\n";
357           }
358
359       produces output
360
361           With exact, nearest to bibbity is bibbity
362           Without exact, nearest to bibbity is bobbity
363           With exact, nearest to bobbity is bobbity
364           Without exact, nearest to bobbity is bibbity
365           With exact, nearest to boo is boo
366           Without exact, nearest to boo is bobbity
367
368       (This example is included as no-exact.pl
369       <https://fastapi.metacpan.org/source/BKB/Text-Fuzzy-0.29/examples/no-
370       exact.pl> in the distribution.)
371
372       This is for the case of searching over an array which contains the
373       searched-for item itself.
374
375   scan_file
376           my $nearest = $tf->scan_file ('/usr/share/dict/words');
377
378       Scan a file to find the nearest match to the word used in "new". This
379       assumes that the file contains lines of text separated by newlines, and
380       finds the closest match in the file. Its return value is a string
381       rather than a line number. It cannot return an array of values. It does
382       not currently support Unicode-encoded files.
383

FUNCTIONS

385       These functions do not require a "Text::Fuzzy" object.
386
387   distance_edits
388           my ($distance, $edits) = distance_edits ('before', 'after');
389
390       This returns the edit distance between the two arguments, and the edits
391       necessary to transform the first one into the second one. $Edits is a
392       string containing the four letters k, r, d, and i, for "keep",
393       "replace", "delete", and "insert" respectively. For example, for
394       "piece" and "peace", $edits contains "krrkk" for "keep, replace,
395       replace, keep, keep".
396
397           use Text::Fuzzy 'distance_edits';
398           my @words = (qw/who where what when why/);
399           my $tf = Text::Fuzzy->new ('whammo');
400           my @nearest = $tf->nearestv (\@words);
401           print "@nearest\n";
402           # Prints "who what"
403           print $tf->last_distance (), "\n";
404           # Prints 3, the number of edits needed to turn "whammo" into "who"
405           # (delete a, m, m) or into "what" (replace m with t, delete m, delete
406           # o).
407           my ($distance, $edits) = distance_edits ('whammo', 'who');
408           print "$edits\n";
409           # Prints kkdddk, keep w, keep h, delete a, delete m, delete m, keep o.
410
411       produces output
412
413           who what
414           3
415           kkdddk
416
417       (This example is included as distance-edits.pl
418       <https://fastapi.metacpan.org/source/BKB/Text-
419       Fuzzy-0.29/examples/distance-edits.pl> in the distribution.)
420
421       This does not handle transpositions. Unlike the rest of the module,
422       this is pure Perl rather than XS, and not optimized for speed. The edit
423       distance search within "nearest" is optimized for speed, and hence
424       discards its record of edits used to get the result.
425
426   fuzzy_index
427           my ($offset, $edits, $distance) = fuzzy_index ($needle, $haystack);
428
429       Searches for $needle in $haystack using fuzzy matching.
430
431       Return value is the offest of the closest match found, the edits
432       necessary on $needle to make it into the matching text, and the
433       Levenshtein edit distance between the matching part of $haystack and
434       $needle.
435
436       For the algorithm used, see
437
438       <http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/>
439
440       This is implemented in Perl not C, and it's slow due to lots of
441       debugging code. Please expect the interface and internals to change.
442

EXAMPLES

444       This section gives extended examples of the use of the module to solve
445       practical problems.
446
447   misspelt-web-page.cgi
448       The file examples/misspelt-web-page.cgi
449       <https://fastapi.metacpan.org/source/BKB/Text-
450       Fuzzy-0.29/examples/misspelt-web-page.cgi> is an example of a CGI
451       script which does something similar to the Apache mod_speling module,
452       offering spelling corrections for mistyped URLs and sending the user to
453       a correct page.
454
455           use Text::Fuzzy;
456
457           # The directory of files served by the web server.
458
459           my $web_root = '/usr/local/www/data';
460
461           # If the query is "http://www.example.com/abc/xyz.html", $path_info is
462           # "abc/xyz.html".
463
464           my $path_info = $ENV{REQUEST_URI};
465
466           if (! defined $path_info) {
467               fail ("No path info");
468           }
469
470           if ($0 =~ /$path_info/) {
471               fail ("Don't redirect to self");
472           }
473
474           # This is the list of files under the main page.
475
476           my @allfiles = get_all_files ($web_root, '');
477
478           # This is our spelling search engine.
479
480           my $tf = Text::Fuzzy->new ($path_info);
481
482           my $nearest = $tf->nearest (\@allfiles, max => 5);
483
484           if (defined $nearest) {
485               redirect ($allfiles[$nearest]);
486           }
487           else {
488               fail ("Nothing like $path_info was found");
489           }
490           exit;
491
492           # Read all the files under "$root/$dir". This is recursive. The return
493           # value is an array containing all files found.
494
495           sub get_all_files
496           {
497               my ($root, $dir) = @_;
498               my @allfiles;
499               my $full_dir = "$root/$dir";
500               if (! -d $full_dir) {
501                   fail ("$full_dir is not a directory");
502               }
503               opendir DIR, $full_dir or fail ("Can't open directory $full_dir: $!");
504               my @files = grep !/^\./, readdir DIR;
505               closedir DIR or fail ("Can't close $full_dir: $!");
506               for my $file (@files) {
507                   my $dir_file = "$dir/$file";
508                   my $full_file = "$root/$dir_file";
509                   if (-d $full_file) {
510                       push @allfiles, get_all_files ($root, $dir_file);
511                   }
512                   else {
513                       push @allfiles, $dir_file;
514                   }
515               }
516               return @allfiles;
517           }
518
519           # Print a "permanent redirect" to the respelt name, then exit.
520
521           sub redirect
522           {
523               my ($url) = @_;
524               print <<EOF;
525           Status: 301
526           Location: $url
527
528           EOF
529               exit;
530           }
531
532           # Print an error message for the sake of the requester, and print a
533           # message to the error log, then exit.
534
535           sub fail
536           {
537               my ($error) = @_;
538               print <<EOF;
539           Content-Type: text/plain
540
541           $error
542           EOF
543               # Add the name of the program and the time to the error message,
544               # otherwise the error log will get awfully confusing-looking.
545               my $time = scalar gmtime ();
546               print STDERR "$0: $time: $error\n";
547               exit;
548           }
549
550       See also <http://www.lemoda.net/perl/perl-mod-speling/> for how to set
551       up .htaccess to use the script.
552
553   spell-check.pl
554       The file examples/spell-check.pl
555       <https://fastapi.metacpan.org/source/BKB/Text-
556       Fuzzy-0.29/examples/spell-check.pl> is a spell checker. It uses a
557       dictionary of words specified by a command-line option "-d":
558
559           spell-check.pl -d /usr/dict/words file1.txt file2.txt
560
561       It prints out any words which look like spelling mistakes, using the
562       dictionary.
563
564           use Getopt::Long;
565           use Text::Fuzzy;
566           use Lingua::EN::PluralToSingular 'to_singular';
567
568           # The location of the Unix dictionary.
569           my $dict = '/usr/share/dict/words';
570
571           # Default maximum edit distance. Five is quite a big number for a
572           # spelling mistake.
573           my $max = 5;
574
575           GetOptions (
576               "dict=s" => \$dict,
577               "max=i" => \$max,
578           );
579
580           my @words;
581           my %words;
582           my $min_length = 4;
583           read_dictionary ($dict, \@words, \%words);
584           # Known mistakes, don't repeat.
585           my %known;
586           # Spell-check each file on the command line.
587           for my $file (@ARGV) {
588               open my $input, "<", $file or die "Can't open $file: $!";
589               while (<$input>) {
590                   my @line = split /[^a-z']+/i, $_;
591                   for my $word (@line) {
592                       # Remove leading/trailing apostrophes.
593                       $word =~ s/^'|'$//g;
594                       my $clean_word = to_singular (lc $word);
595                       $clean_word =~ s/'s$//;
596                       if ($words{$clean_word} || $words{$word}) {
597                           # It is in the dictionary.
598                           next;
599                       }
600                       if (length $word < $min_length) {
601                           # Very short words are ignored.
602                           next;
603                       }
604                       if ($word eq uc $word) {
605                           # Acronym like BBC, IRA, etc.
606                           next;
607                       }
608                       if ($known{$clean_word}) {
609                           # This word was already given to the user.
610                           next;
611                       }
612                       if ($clean_word =~ /(.*)ed$/ || $clean_word =~ /(.*)ing/) {
613                           my $stem = $1;
614                           if ($words{$stem} || $words{"${stem}e"}) {
615                               # Past/gerund of $stem/${stem}e
616                               next;
617                           }
618                           # Test for doubled end consonants,
619                           # e.g. "submitted"/"submit".
620                           if ($stem =~ /([bcdfghjklmnpqrstvwxz])\1/) {
621                               $stem =~ s/$1$//;
622                               if ($words{$stem}) {
623                                   # Past/gerund of $stem/${stem}e
624                                   next;
625                               }
626                           }
627                       }
628                       my $tf = Text::Fuzzy->new ($clean_word, max => $max);
629                       my $nearest = $tf->nearest (\@words);
630                       # We have set a maximum distance to search for, so we need
631                       # to check whether $nearest is defined.
632                       if (defined $nearest) {
633                           my $correction = $words[$nearest];
634                           print "$file:$.: '$word' may be '$correction'.\n";
635                           $known{$clean_word} = $correction;
636                       }
637                       else {
638                           print "$file:$.: $word may be a spelling mistake.\n";
639                           $known{$clean_word} = 1;
640                       }
641                   }
642               }
643               close $input or die $!;
644           }
645
646           exit;
647
648           sub read_dictionary
649           {
650               my ($dict, $words_array, $words_hash) = @_;
651               open my $din, "<", $dict or die "Can't open dictionary $dict: $!";
652               my @words;
653               while (<$din>) {
654                   chomp;
655                   push @words, $_;
656               }
657               close $din or die $!;
658               # Apostrophe words
659
660               my @apo = qw/
661
662                               let's I'll you'll he'll she'll they'll we'll I'm
663                               you're he's she's it's we're they're I've they've
664                               you've we've one's isn't aren't doesn't don't
665                               won't wouldn't I'd you'd he'd we'd they'd
666                               shouldn't couldn't didn't can't
667
668                           /;
669
670               # Irregular past participles.
671               my @pp = qw/became/;
672
673               push @words, @apo, @pp;
674               for (@words) {
675                   push @$words_array, lc $_;
676                   $words_hash->{$_} = 1;
677                   $words_hash->{lc $_} = 1;
678               }
679           }
680
681       Because the usual Unix dictionary doesn't have plurals, it uses
682       Lingua::EN::PluralToSingular, to convert nouns into singular forms.
683       Unfortunately it still misses past participles and past tenses of
684       verbs.
685
686   extract-kana.pl
687       The file examples/extract-kana.pl
688       <https://fastapi.metacpan.org/source/BKB/Text-
689       Fuzzy-0.29/examples/extract-kana.pl> extracts the kana entries from
690       "edict", a freely-available Japanese to English electronic dictionary,
691       and does some fuzzy searches on them. It requires a local copy of the
692       file to run. This script demonstrates the use of Unicode searches with
693       Text::Fuzzy.
694
695           use Lingua::JA::Moji ':all';
696           use Text::Fuzzy;
697           use utf8;
698           my $infile = '/home/ben/data/edrdg/edict';
699           open my $in, "<:encoding(EUC-JP)", $infile or die $!;
700           my @kana;
701           while (<$in>) {
702               my $kana;
703               if (/\[(\p{InKana}+)\]/) {
704                   $kana = $1;
705               }
706               elsif (/^(\p{InKana}+)/) {
707                   $kana = $1;
708               }
709               if ($kana) {
710                   $kana = kana2katakana ($kana);
711                   push @kana, $kana;
712               }
713           }
714           printf "Starting fuzzy searches over %d lines.\n", scalar @kana;
715           search ('ウオソウコ');
716           search ('アイウエオカキクケコバビブベボハヒフヘホ');
717           search ('アルベルトアインシュタイン');
718           search ('バババブ');
719           search ('バババブアルベルト');
720           exit;
721
722           sub search
723           {
724               my ($silly) = @_;
725               my $max = 10;
726               my $search = Text::Fuzzy->new ($silly, max => $max);
727               my $n = $search->nearest (\@kana);
728               if (defined $n) {
729                   printf "$silly nearest is $kana[$n] (distance %d)\n",
730                       $search->last_distance ();
731               }
732               else {
733                   printf "Nothing like '$silly' was found within $max edits.\n";
734               }
735           }
736
737   Lingua::JA::Gairaigo::Fuzzy
738       The module Lingua::JA::Gairaigo::Fuzzy tries to determine whether two
739       Japanese loanwords are the same word or not.
740
741   CPAN::Nearest
742       The module CPAN::Nearest offers a search over the titles of CPAN
743       modules using a fuzzy search to get the nearest match.
744

DEPENDENCIES

746       This module has no dependencies on other modules.
747

SUPPORT

749   Reporting a bug
750       There is a bug tracker for the module at
751       <https://github.com/benkasminbullock/Text-Fuzzy/issues>.
752
753   Testing
754       The CPAN tester results are at
755       <http://www.cpantesters.org/distro/T/Text-Fuzzy.html>. The ActiveState
756       tester results are at <http://code.activestate.com/ppm/Text-Fuzzy/>.
757

PRIVATE METHODS

759       The following methods are for benchmarking the module and checking its
760       correctness.
761
762   no_alphabet
763           $tf->no_alphabet (1);
764
765       This turns off alphabetizing of the string. Alphabetizing is a filter
766       where the intersection of all the characters in the two strings is
767       computed, and if the alphabetical difference of the two strings is
768       greater than the maximum distance, the match is rejected without
769       applying the dynamic programming algorithm. This increases speed,
770       because the dynamic programming algorithm is slow.
771
772       The alphabetizing should not ever reject anything which is a legitimate
773       match, and it should make the program run faster in almost every case.
774       The only envisaged uses of switching this off are checking that the
775       algorithm is working correctly, and benchmarking performance.
776
777   get_trans
778           my $trans_ok = $tf->get_trans ();
779
780       This returns the value set by "transpositions_ok".
781
782   unicode_length
783           my $length = $tf->unicode_length ();
784
785       This returns the length in characters (not bytes) of the string used in
786       "new". If the string is not marked as Unicode, it returns the undefined
787       value. In the following, $l1 should be equal to $l2.
788
789           use utf8;
790           my $word = 'ⅅⅆⅇⅈⅉ';
791           my $l1 = length $word;
792           my $tf = Text::Fuzzy->new ($word);
793           my $l2 = $tf->unicode_length ();
794
795   ualphabet_rejections
796           my $rejected = $tf->ualphabet_rejections ();
797
798       After running "nearest" over an array, this returns the number of
799       entries of the array which were rejected using only the Unicode
800       alphabet. Its value is reset to zero each time "nearest" is called.
801
802   alphabet_rejections
803           my $rejected = $tf->alphabet_rejections ();
804
805       After running "nearest" over an array, this returns the number of
806       entries of the array which were rejected using only the non-Unicode
807       alphabet. Its value is reset to zero each time "nearest" is called.
808
809   length_rejections
810           my $rejected = $tf->length_rejections ();
811
812       After running "nearest" over an array, this returns the number of
813       entries of the array which were rejected because the length difference
814       between them and the target string was larger than the maximum distance
815       allowed.
816
817   get_max_distance
818           # Get the maximum edit distance.
819           print "The max distance is ", $tf->get_max_distance (), "\n";
820
821       Get the maximum edit distance of $tf. The maximum distance may be set
822       with "set_max_distance".
823

SEE ALSO

825   Other CPAN modules
826       Similar modules on CPAN include the following.
827
828       String::Approx
829           Approximate matching (fuzzy matching) using the Levenshtein edit
830           distance. As a bonus, if you don't have a headache, you can get one
831           easily trying to make head or tail out of this module's
832           documentation.
833
834       Text::EditTranscript
835           Determine the edit transcript between two strings. This is similar
836           to what you get from "distance_edits" in this module.
837
838       Text::Fuzzy::PP
839           This is Nick Logan's Pure Perl version of this module.
840
841       Text::Levenshtein::Damerau
842           Levenshtein-Damerau edit distance.
843
844       Text::Levenshtein
845       Text::Levenshtein::Flexible
846           XS Levenshtein distance calculation with bounds and adjustable
847           costs (so the cost of deletion can be more than the cost of
848           addition, etc.)  See also Text::WagnerFischer for a pure-Perl
849           module which also allows altered costs.
850
851       Text::Levenshtein::XS
852           An XS implementation of the Levenshtein edit distance. It claims to
853           be a drop-in replacement for Text::LevenshteinXS which does Unicode
854           correctly.
855
856       Text::LevenshteinXS
857           An XS implementation of the Levenshtein edit distance. Does not do
858           Unicode very well. See
859           <https://rt.cpan.org/Public/Bug/Display.html?id=36685>.
860
861       Text::Levenshtein::Edlib
862           A wrapper around the edlib library that computes Levenshtein edit
863           distance and optimal alignment path for a pair of strings.
864
865       Tree::BK
866           Structure for efficient fuzzy matching.
867
868       Text::Brew
869           An implementation of the Brew edit distance.
870
871       Text::WagnerFischer
872           Implements the Wagner-Fischer algorithm to calculate edit
873           distances. This is generalised version of the Levenshtein edit
874           distance. See also Text::Levenshtein::Flexible for an XS version.
875
876       Bencher::Scenario::LevenshteinModules
877           Some benchmarks of various modules including this one.
878
879       Text::JaroWinkler
880           Another text similarity measure.
881
882   About the algorithms
883       This section contains some blog posts which I found useful in
884       understanding the algorithms.
885
886       Fuzzy substring matching with Levenshtein distance in Python
887       <http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-
888       with-levenshtein-distance-in-python/> by Ryan Ginstrom explains the
889       Levenshtein algorithm and its use in substring matching.
890
891       Damerau-Levenshtein Edit Distance Explained
892       <https://www.lemoda.net/text-fuzzy/damerau-levenshtein/index.html> by
893       James M. Jensen II explains the Damerau-Levenshtein edit distance (the
894       algorithm used with "transpositions_ok").
895
896       I recommend steering fairly clear of the Wikipedia articles on these
897       things, which are very poorly written indeed.
898
899   References
900       Here are the original research papers by the algorithms' discoverers.
901
902       Damerau
903           Damerau, Fred J. (March 1964), "A technique for computer detection
904           and correction of spelling errors", Communications of the ACM, ACM,
905           7 (3): 171–176, doi:10.1145/363958.363994
906
907       Levenshtein
908           Levenshtein, Vladimir I. (February 1966), "Binary codes capable of
909           correcting deletions, insertions, and reversals", Soviet Physics
910           Doklady, 10 (8): 707–710
911
912       Wagner and Fischer
913           R. Wagner and M. Fischer (1974), "The string to string correction
914           problem", Journal of the ACM, 21:168-178, doi:10.1145/321796.321811
915

HISTORY

917       0.26
918           •   A bug was fixed where an input string may be overwritten in
919               code like
920
921                   my $tf = Text::Fuzzy->new ($x);
922                   $tf->distance ($y);
923
924               if $x is a plain ASCII string and $y is a Unicode string.
925
926           •   Links were added to the examples, and the outputs of the
927               examples were added as part of the documentation.
928
929       0.28
930           •   The transposition code (the implementation of the Damerau-
931               Levenshtein distance) was completely rewritten to make it more
932               efficient. The back-indexing of strings to find transpositions
933               was changed so that the index of the object's string (the first
934               argument to "new") is preserved from one query to the next. A
935               useless indexing of characters in the other string was removed.
936               The structure used to hold the characters was changed from an
937               unsorted allocated linked list to a sorted array in the case of
938               Unicode strings, and a 256 character array in the case of non-
939               Unicode strings. The oddly-named variables were renamed to more
940               meaningful names.
941
942           •   The transposition code now allows for a maximum distance to be
943               set, beyond which no further matches will be allowed.
944

ACKNOWLEDGEMENTS

946       The edit distance including transpositions was contributed by Nick
947       Logan (UGEXE). (This code was largely rewritten in version "0.28", so
948       Nick Logan can no longer be held responsible for the Text::Fuzzy
949       module's failings.) Some of the tests in t/trans.t are taken from the
950       Text::Levenshtein::Damerau::XS module. Nils Boeffel reported a bug
951       where strings may be overwritten in version 0.25.
952

AUTHOR

954       Ben Bullock, <bkb@cpan.org>
955
957       This package and associated files are copyright (C) 2012-2020 Ben
958       Bullock.
959
960       You can use, copy, modify and redistribute this package and associated
961       files under the Perl Artistic Licence or the GNU General Public
962       Licence.
963
964
965
966perl v5.38.0                      2023-07-21                    Text::Fuzzy(3)
Impressum