String::Approx(3pm)

1Approx(3)             User Contributed Perl Documentation            Approx(3)
2
3
4

NAME

6       String::Approx - Perl extension for approximate matching (fuzzy match‐
7       ing)
8

SYNOPSIS

10         use String::Approx 'amatch';
11
12         print if amatch("foobar");
13
14         my @matches = amatch("xyzzy", @inputs);
15
16         my @catches = amatch("plugh", ['2'], @inputs);
17

DESCRIPTION

19       String::Approx lets you match and substitute strings approximately.
20       With this you can emulate errors: typing errorrs, speling errors,
21       closely related vocabularies (colour color), genetic mutations (GAG
22       ACT), abbreviations (McScot, MacScot).
23
24       NOTE: String::Approx suits the task of string matching, not string com‐
25       parison, and it works for strings, not for text.
26
27       If you want to compare strings for similarity, you probably just want
28       the Levenshtein edit distance (explained below), the Text::Levenshtein
29       and Text::LevenshteinXS modules in CPAN.  See also Text::WagnerFischer
30       and Text::PhraseDistance.  (There are functions for this in
31       String::Approx, e.g. adist(), but their results sometimes differ from
32       the bare Levenshtein et al.)
33
34       If you want to compare things like text or source code, consisting of
35       words or tokens and phrases and sentences, or expressions and state‐
36       ments, you should probably use some other tool than String::Approx,
37       like for example the standard UNIX diff(1) tool, or the Algorithm::Diff
38       module from CPAN.
39
40       The measure of approximateness is the Levenshtein edit distance.  It is
41       the total number of "edits": insertions,
42
43               word world
44
45       deletions,
46
47               monkey money
48
49       and substitutions
50
51               sun fun
52
53       required to transform a string to another string.  For example, to
54       transform "lead" into "gold", you need three edits:
55
56               lead gead goad gold
57
58       The edit distance of "lead" and "gold" is therefore three, or 75%.
59
60       String::Approx uses the Levenshtein edit distance (tLed) as its mea‐
61       sure, but String::Approx is not well-suited for comparing the tLeds of
62       strings, in other words, if you want a "fuzzy eq", see above.
63       Strings::Approx is more like regular expressions or index(), it finds
64       substrings that are close matches.
65

MATCH

67               use String::Approx 'amatch';
68
69               $matched     = amatch("pattern")
70               $matched     = amatch("pattern", [ modifiers ])
71
72               $any_matched = amatch("pattern", @inputs)
73               $any_matched = amatch("pattern", [ modifiers ], @inputs)
74
75               @match       = amatch("pattern")
76               @match       = amatch("pattern", [ modifiers ])
77
78               @matches     = amatch("pattern", @inputs)
79               @matches     = amatch("pattern", [ modifiers ], @inputs)
80
81       Match pattern approximately.  In list context return the matched
82       @inputs.  If no inputs are given, match against the $_.  In scalar con‐
83       text return true if any of the inputs match, false if none match.
84
85       Notice that the pattern is a string.  Not a regular expression.  None
86       of the regular expression notations (^, ., *, and so on) work.  They
87       are characters just like the others.  Note-on-note: some limited form
88       of "regular expressionism" is planned in future: for example character
89       classes ([abc]) and any-chars (.).  But that feature will be turned on
90       by a special modifier (just a guess: "r"), so there should be no back‐
91       ward compatibility problem.
92
93       Notice also that matching is not symmetric.  The inputs are matched
94       against the pattern, not the other way round.  In other words: the pat‐
95       tern can be a substring, a submatch, of an input element.  An input
96       element is always a superstring of the pattern.
97
98       MODIFIERS
99
100       With the modifiers you can control the amount of approximateness and
101       certain other control variables.  The modifiers are one or more
102       strings, for example "i", within a string optionally separated by
103       whitespace.  The modifiers are inside an anonymous array: the [ ] in
104       the syntax are not notational, they really do mean [ ], for example [
105       "i", "2" ].  ["2 i"] would be identical.
106
107       The implicit default approximateness is 10%, rounded up.  In other
108       words: every tenth character in the pattern may be an error, an edit.
109       You can explicitly set the maximum approximateness by supplying a modi‐
110       fier like
111
112               number
113               number%
114
115       Examples: "3", "15%".
116
117       Note that "0%" is not rounded up, it is equal to 0.
118
119       Using a similar syntax you can separately control the maximum number of
120       insertions, deletions, and substitutions by prefixing the numbers with
121       I, D, or S, like this:
122
123               Inumber
124               Inumber%
125               Dnumber
126               Dnumber%
127               Snumber
128               Snumber%
129
130       Examples: "I2", "D20%", "S0".
131
132       You can ignore case ("A" becames equal to "a" and vice versa) by adding
133       the "i" modifier.
134
135       For example
136
137               [ "i 25%", "S0" ]
138
139       means ignore case, allow every fourth character to be "an edit", but
140       allow no substitutions.  (See NOTES about disallowing substitutions or
141       insertions.)
142
143       NOTE: setting "I0 D0 S0" is not equivalent to using index().  If you
144       want to use index(), use index().
145

SUBSTITUTE

147               use String::Approx 'asubstitute';
148
149               @substituted = asubstitute("pattern", "replacement")
150               @substituted = asubstitute("pattern", "replacement", @inputs)
151               @substituted = asubstitute("pattern", "replacement", [ modifiers ])
152               @substituted = asubstitute("pattern", "replacement",
153                                          [ modifiers ], @inputs)
154
155       Substitute approximate pattern with replacement and return as a list
156       <copies> of @inputs, the substitutions having been made on the elements
157       that did match the pattern.  If no inputs are given, substitute in the
158       $_.  The replacement can contain magic strings $&, $`, $' that stand
159       for the matched string, the string before it, and the string after it,
160       respectively.  All the other arguments are as in "amatch()", plus one
161       additional modifier, "g" which means substitute globally (all the
162       matches in an element and not just the first one, as is the default).
163
164       See "BAD NEWS" about the unfortunate stinginess of "asubstitute()".
165

INDEX

167               use String::Approx 'aindex';
168
169               $index   = aindex("pattern")
170               @indices = aindex("pattern", @inputs)
171               $index   = aindex("pattern", [ modifiers ])
172               @indices = aindex("pattern", [ modifiers ], @inputs)
173
174       Like "amatch()" but returns the index/indices at which the pattern
175       matches approximately.  In list context and if @inputs are used,
176       returns a list of indices, one index for each input element.  If
177       there's no approximate match, "-1" is returned as the index.
178
179       NOTE: if there is character repetition (e.g. "aa") either in the pat‐
180       tern or in the text, the returned index might start "too early".  This
181       is consistent with the goal of the module of matching "as early as pos‐
182       sible", just like regular expressions (that there might be a "less
183       approximate" match starting later is of somewhat irrelevant).
184
185       There's also backwards-scanning "arindex()".
186

SLICE

188               use String::Approx 'aslice';
189
190               ($index, $size)   = aslice("pattern")
191               ([$i0, $s0], ...) = aslice("pattern", @inputs)
192               ($index, $size)   = aslice("pattern", [ modifiers ])
193               ([$i0, $s0], ...) = aslice("pattern", [ modifiers ], @inputs)
194
195       Like "aindex()" but returns also the size (length) of the match.  If
196       the match fails, returns an empty list (when matching against $_) or an
197       empty anonymous list corresponding to the particular input.
198
199       NOTE: size of the match will very probably be something you did not
200       expect (such as longer than the pattern, or a negative number).  This
201       may or may not be fixed in future releases. Also the beginning of the
202       match may vary from the expected as with aindex(), see above.
203
204       If the modifier
205
206               "minimal_distance"
207
208       is used, the minimal possible edit distance is returned as the third
209       element:
210
211               ($index, $size, $distance) = aslice("pattern", [ modifiers ])
212               ([$i0, $s0, $d0], ...)     = aslice("pattern", [ modifiers ], @inputs)
213

DISTANCE

215               use String::Approx 'adist';
216
217               $dist = adist("pattern", $input);
218               @dist = adist("pattern", @input);
219
220       Return the edit distance or distances between the pattern and the input
221       or inputs.  Zero edit distance means exact match.  (Remember that the
222       match can 'float' in the inputs, the match is a substring match.)  If
223       the pattern is longer than the input or inputs, the returned distance
224       or distances is or are negative.
225
226               use String::Approx 'adistr';
227
228               $dist = adistr("pattern", $input);
229               @dist = adistr("pattern", @inputs);
230
231       Return the relative edit distance or distances between the pattern and
232       the input or inputs.  Zero relative edit distance means exact match,
233       one means completely different.  (Remember that the match can 'float'
234       in the inputs, the match is a substring match.)  If the pattern is
235       longer than the input or inputs, the returned distance or distances is
236       or are negative.
237
238       You can use adist() or adistr() to sort the inputs according to their
239       approximateness:
240
241               my %d;
242               @d{@inputs} = map { abs } adistr("pattern", @inputs);
243               my @d = sort { $d{$a} <=> $d{$b} } @inputs;
244
245       Now @d contains the inputs, the most like "pattern" first.
246

CONTROLLING THE CACHE

248       "String::Approx" maintains a LU (least-used) cache that holds the
249       'matching engines' for each instance of a pattern+modifiers.  The cache
250       is intended to help the case where you match a small set of patterns
251       against a large set of string.  However, the more engines you cache the
252       more you eat memory.  If you have a lot of different patterns or if you
253       have a lot of memory to burn, you may want to control the cache your‐
254       self.  For example, allowing a larger cache consumes more memory but
255       probably runs a little bit faster since the cache fills (and needs
256       flushing) less often.
257
258       The cache has two parameters: max and purge.  The first one is the max‐
259       imum size of the cache and the second one is the cache flushing ratio:
260       when the number of cache entries exceeds max, max times purge cache
261       entries are flushed.  The default values are 1000 and 0.75, respec‐
262       tively, which means that when the 1001st entry would be cached, 750
263       least used entries will be removed from the cache.  To access the
264       parameters you can use the calls
265
266               $now_max = String::Approx::cache_max();
267               String::Approx::cache_max($new_max);
268
269               $now_purge = String::Approx::cache_purge();
270               String::Approx::cache_purge($new_purge);
271
272               $limit = String::Approx::cache_n_purge();
273
274       To be honest, there are actually two caches: the first one is used far
275       the patterns with no modifiers, the second one for the patterns with
276       pattern modifiers.  Using the standard parameters you will therefore
277       actually cache up to 2000 entries.  The above calls control both caches
278       for the same price.
279
280       To disable caching completely use
281
282               String::Approx::cache_disable();
283
284       Note that this doesn't flush any possibly existing cache entries, to do
285       that use
286
287               String::Approx::cache_flush_all();
288

NOTES

290       Because matching is by substrings, not by whole strings, insertions and
291       substitutions produce often very similar results: "abcde" matches
292       "axbcde" either by insertion or substitution of "x".
293
294       The maximum edit distance is also the maximum number of edits.  That
295       is, the "I2" in
296
297               amatch("abcd", ["I2"])
298
299       is useless because the maximum edit distance is (implicitly) 1.  You
300       may have meant to say
301
302               amatch("abcd", ["2D1S1"])
303
304       or something like that.
305
306       If you want to simulate transposes
307
308               feet fete
309
310       you need to allow at least edit distance of two because in terms of our
311       edit primitives a transpose is first one deletion and then one inser‐
312       tion.
313
314       TEXT POSITION
315
316       The starting and ending positions of matching, substituting, indexing,
317       or slicing can be changed from the beginning and end of the input(s) to
318       some other positions by using either or both of the modifiers
319
320               "initial_position=24"
321               "final_position=42"
322
323       or the both the modifiers
324
325               "initial_position=24"
326               "position_range=10"
327
328       By setting the "position_range" to be zero you can limit (anchor) the
329       operation to happen only once (if a match is possible) at the position.
330

VERSION

332       Major release 3.
333

CHANGES FROM VERSION 2

335       GOOD NEWS
336
337       The version 3 is 2-3 times faster than version 2
338       No pattern length limitation
339           The algorithm is independent on the pattern length: its time com‐
340           plexity is O(kn), where k is the number of edits and n the length
341           of the text (input).  The preprocessing of the pattern will of
342           course take some O(m) (m being the pattern length) time, but
343           "amatch()" and "asubstitute()" cache the result of this preprocess‐
344           ing so that it is done only once per pattern.
345
346       BAD NEWS
347
348       You do need a C compiler to install the module
349           Perl's regular expressions are no more used; instead a faster and
350           more scalable algorithm written in C is used.
351
352       "asubstitute()" is now always stingy
353           The string matched and substituted is now always stingy, as short
354           as possible.  It used to be as long as possible.  This is an unfor‐
355           tunate change stemming from switching the matching algorithm.
356           Example: with edit distance of two and substituting for "word" from
357           "cork" and "wool" previously did match "cork" and "wool".  Now it
358           does match "or" and "wo".  As little as possible, or, in other
359           words, with as much approximateness, as many edits, as possible.
360           Because there is no need to match the "c" of "cork", it is not
361           matched.
362
363       no more "aregex()" because regular expressions are no more used
364       no more "compat1" for String::Approx version 1 compatibility
365

ACKNOWLEDGEMENTS

367       The following people have provided valuable test cases, documentation
368       clarifications, and other feedback:
369
370       Jared August, Arthur Bergman, Anirvan Chatterjee, Steve A. Chervitz,
371       Aldo Calpini, David Curiel, Teun van den Dool, Alberto Fontaneda, Rob
372       Fugina, Dmitrij Frishman, Lars Gregersen, Kevin Greiner, B. Elijah
373       Griffin, Mike Hanafey, Mitch Helle, Ricky Houghton, 'idallen', Helmut
374       Jarausch, Damian Keefe, Ben Kennedy, Craig Kelley, Franz Kirsch, Dag
375       Kristian, Mark Land, J. D. Laub, John P. Linderman, Tim Maher, Juha
376       Muilu, Sergey Novoselov, Andy Oram, Ji Y Park, Eric Promislow, Nikolaus
377       Rath, Stefan Ram, Slaven Rezic, Dag Kristian Rognlien, Stewart Russell,
378       Slaven Rezic, Chris Rosin, Pasha Sadri, Ilya Sandler, Bob J.A. Schijve‐
379       naars, Ross Smith, Frank Tobin, Greg Ward, Rich Williams, Rick Wise.
380
381       The matching algorithm was developed by Udi Manber, Sun Wu, and Burra
382       Gopal in the Department of Computer Science, University of Arizona.
383

AUTHOR

385       Jarkko Hietaniemi <jhi@iki.fi>
386

COPYRIGHT AND LICENSE

388       Copyright 2001-2005 by Jarkko Hietaniemi
389
390       This library is free software; you can redistribute it and/or modify it
391       under the same terms as Perl itself.
392
393       Furthermore: no warranties or obligations of any kind are given, and
394       the separate file COPYRIGHT must be included intact in all copies and
395       derived materials.
396
397
398
399perl v5.8.8                       2006-04-09                         Approx(3)