String::Approx(3pm)

1Approx(3)             User Contributed Perl Documentation            Approx(3)
2
3
4

NAME

6       String::Approx - Perl extension for approximate matching (fuzzy
7       matching)
8

SYNOPSIS

10         use String::Approx 'amatch';
11
12         print if amatch("foobar");
13
14         my @matches = amatch("xyzzy", @inputs);
15
16         my @catches = amatch("plugh", ['2'], @inputs);
17

DESCRIPTION

19       String::Approx lets you match and substitute strings approximately.
20       With this you can emulate errors: typing errorrs, speling errors,
21       closely related vocabularies (colour color), genetic mutations (GAG
22       ACT), abbreviations (McScot, MacScot).
23
24       NOTE: String::Approx suits the task of string matching, not string
25       comparison, and it works for strings, not for text.
26
27       If you want to compare strings for similarity, you probably just want
28       the Levenshtein edit distance (explained below), the Text::Levenshtein
29       and Text::LevenshteinXS modules in CPAN.  See also Text::WagnerFischer
30       and Text::PhraseDistance.  (There are functions for this in
31       String::Approx, e.g. adist(), but their results sometimes differ from
32       the bare Levenshtein et al.)
33
34       If you want to compare things like text or source code, consisting of
35       words or tokens and phrases and sentences, or expressions and
36       statements, you should probably use some other tool than
37       String::Approx, like for example the standard UNIX diff(1) tool, or the
38       Algorithm::Diff module from CPAN.
39
40       The measure of approximateness is the Levenshtein edit distance.  It is
41       the total number of "edits": insertions,
42
43               word world
44
45       deletions,
46
47               monkey money
48
49       and substitutions
50
51               sun fun
52
53       required to transform a string to another string.  For example, to
54       transform "lead" into "gold", you need three edits:
55
56               lead gead goad gold
57
58       The edit distance of "lead" and "gold" is therefore three, or 75%.
59
60       String::Approx uses the Levenshtein edit distance as its measure, but
61       String::Approx is not well-suited for comparing strings of different
62       length, in other words, if you want a "fuzzy eq", see above.
63       String::Approx is more like regular expressions or index(), it finds
64       substrings that are close matches.>
65

MATCH

67               use String::Approx 'amatch';
68
69               $matched     = amatch("pattern")
70               $matched     = amatch("pattern", [ modifiers ])
71
72               $any_matched = amatch("pattern", @inputs)
73               $any_matched = amatch("pattern", [ modifiers ], @inputs)
74
75               @match       = amatch("pattern")
76               @match       = amatch("pattern", [ modifiers ])
77
78               @matches     = amatch("pattern", @inputs)
79               @matches     = amatch("pattern", [ modifiers ], @inputs)
80
81       Match pattern approximately.  In list context return the matched
82       @inputs.  If no inputs are given, match against the $_.  In scalar
83       context return true if any of the inputs match, false if none match.
84
85       Notice that the pattern is a string.  Not a regular expression.  None
86       of the regular expression notations (^, ., *, and so on) work.  They
87       are characters just like the others.  Note-on-note: some limited form
88       of "regular expressionism" is planned in future: for example character
89       classes ([abc]) and any-chars (.).  But that feature will be turned on
90       by a special modifier (just a guess: "r"), so there should be no
91       backward compatibility problem.
92
93       Notice also that matching is not symmetric.  The inputs are matched
94       against the pattern, not the other way round.  In other words: the
95       pattern can be a substring, a submatch, of an input element.  An input
96       element is always a superstring of the pattern.
97
98   MODIFIERS
99       With the modifiers you can control the amount of approximateness and
100       certain other control variables.  The modifiers are one or more
101       strings, for example "i", within a string optionally separated by
102       whitespace.  The modifiers are inside an anonymous array: the [ ] in
103       the syntax are not notational, they really do mean [ ], for example [
104       "i", "2" ].  ["2 i"] would be identical.
105
106       The implicit default approximateness is 10%, rounded up.  In other
107       words: every tenth character in the pattern may be an error, an edit.
108       You can explicitly set the maximum approximateness by supplying a
109       modifier like
110
111               number
112               number%
113
114       Examples: "3", "15%".
115
116       Note that "0%" is not rounded up, it is equal to 0.
117
118       Using a similar syntax you can separately control the maximum number of
119       insertions, deletions, and substitutions by prefixing the numbers with
120       I, D, or S, like this:
121
122               Inumber
123               Inumber%
124               Dnumber
125               Dnumber%
126               Snumber
127               Snumber%
128
129       Examples: "I2", "D20%", "S0".
130
131       You can ignore case ("A" becames equal to "a" and vice versa) by adding
132       the "i" modifier.
133
134       For example
135
136               [ "i 25%", "S0" ]
137
138       means ignore case, allow every fourth character to be "an edit", but
139       allow no substitutions.  (See NOTES about disallowing substitutions or
140       insertions.)
141
142       NOTE: setting "I0 D0 S0" is not equivalent to using index().  If you
143       want to use index(), use index().
144

SUBSTITUTE

146               use String::Approx 'asubstitute';
147
148               @substituted = asubstitute("pattern", "replacement")
149               @substituted = asubstitute("pattern", "replacement", @inputs)
150               @substituted = asubstitute("pattern", "replacement", [ modifiers ])
151               @substituted = asubstitute("pattern", "replacement",
152                                          [ modifiers ], @inputs)
153
154       Substitute approximate pattern with replacement and return as a list
155       <copies> of @inputs, the substitutions having been made on the elements
156       that did match the pattern.  If no inputs are given, substitute in the
157       $_.  The replacement can contain magic strings $&, $`, $' that stand
158       for the matched string, the string before it, and the string after it,
159       respectively.  All the other arguments are as in "amatch()", plus one
160       additional modifier, "g" which means substitute globally (all the
161       matches in an element and not just the first one, as is the default).
162
163       See "BAD NEWS" about the unfortunate stinginess of "asubstitute()".
164

INDEX

166               use String::Approx 'aindex';
167
168               $index   = aindex("pattern")
169               @indices = aindex("pattern", @inputs)
170               $index   = aindex("pattern", [ modifiers ])
171               @indices = aindex("pattern", [ modifiers ], @inputs)
172
173       Like "amatch()" but returns the index/indices at which the pattern
174       matches approximately.  In list context and if @inputs are used,
175       returns a list of indices, one index for each input element.  If
176       there's no approximate match, "-1" is returned as the index.
177
178       NOTE: if there is character repetition (e.g. "aa") either in the
179       pattern or in the text, the returned index might start "too early".
180       This is consistent with the goal of the module of matching "as early as
181       possible", just like regular expressions (that there might be a "less
182       approximate" match starting later is of somewhat irrelevant).
183
184       There's also backwards-scanning "arindex()".
185

SLICE

187               use String::Approx 'aslice';
188
189               ($index, $size)   = aslice("pattern")
190               ([$i0, $s0], ...) = aslice("pattern", @inputs)
191               ($index, $size)   = aslice("pattern", [ modifiers ])
192               ([$i0, $s0], ...) = aslice("pattern", [ modifiers ], @inputs)
193
194       Like "aindex()" but returns also the size (length) of the match.  If
195       the match fails, returns an empty list (when matching against $_) or an
196       empty anonymous list corresponding to the particular input.
197
198       NOTE: size of the match will very probably be something you did not
199       expect (such as longer than the pattern, or a negative number).  This
200       may or may not be fixed in future releases. Also the beginning of the
201       match may vary from the expected as with aindex(), see above.
202
203       If the modifier
204
205               "minimal_distance"
206
207       is used, the minimal possible edit distance is returned as the third
208       element:
209
210               ($index, $size, $distance) = aslice("pattern", [ modifiers ])
211               ([$i0, $s0, $d0], ...)     = aslice("pattern", [ modifiers ], @inputs)
212

DISTANCE

214               use String::Approx 'adist';
215
216               $dist = adist("pattern", $input);
217               @dist = adist("pattern", @input);
218
219       Return the edit distance or distances between the pattern and the input
220       or inputs.  Zero edit distance means exact match.  (Remember that the
221       match can 'float' in the inputs, the match is a substring match.)  If
222       the pattern is longer than the input or inputs, the returned distance
223       or distances is or are negative.
224
225               use String::Approx 'adistr';
226
227               $dist = adistr("pattern", $input);
228               @dist = adistr("pattern", @inputs);
229
230       Return the relative edit distance or distances between the pattern and
231       the input or inputs.  Zero relative edit distance means exact match,
232       one means completely different.  (Remember that the match can 'float'
233       in the inputs, the match is a substring match.)  If the pattern is
234       longer than the input or inputs, the returned distance or distances is
235       or are negative.
236
237       You can use adist() or adistr() to sort the inputs according to their
238       approximateness:
239
240               my %d;
241               @d{@inputs} = map { abs } adistr("pattern", @inputs);
242               my @d = sort { $d{$a} <=> $d{$b} } @inputs;
243
244       Now @d contains the inputs, the most like "pattern" first.
245

CONTROLLING THE CACHE

247       "String::Approx" maintains a LU (least-used) cache that holds the
248       'matching engines' for each instance of a pattern+modifiers.  The cache
249       is intended to help the case where you match a small set of patterns
250       against a large set of string.  However, the more engines you cache the
251       more you eat memory.  If you have a lot of different patterns or if you
252       have a lot of memory to burn, you may want to control the cache
253       yourself.  For example, allowing a larger cache consumes more memory
254       but probably runs a little bit faster since the cache fills (and needs
255       flushing) less often.
256
257       The cache has two parameters: max and purge.  The first one is the
258       maximum size of the cache and the second one is the cache flushing
259       ratio: when the number of cache entries exceeds max, max times purge
260       cache entries are flushed.  The default values are 1000 and 0.75,
261       respectively, which means that when the 1001st entry would be cached,
262       750 least used entries will be removed from the cache.  To access the
263       parameters you can use the calls
264
265               $now_max = String::Approx::cache_max();
266               String::Approx::cache_max($new_max);
267
268               $now_purge = String::Approx::cache_purge();
269               String::Approx::cache_purge($new_purge);
270
271               $limit = String::Approx::cache_n_purge();
272
273       To be honest, there are actually two caches: the first one is used far
274       the patterns with no modifiers, the second one for the patterns with
275       pattern modifiers.  Using the standard parameters you will therefore
276       actually cache up to 2000 entries.  The above calls control both caches
277       for the same price.
278
279       To disable caching completely use
280
281               String::Approx::cache_disable();
282
283       Note that this doesn't flush any possibly existing cache entries, to do
284       that use
285
286               String::Approx::cache_flush_all();
287

NOTES

289       Because matching is by substrings, not by whole strings, insertions and
290       substitutions produce often very similar results: "abcde" matches
291       "axbcde" either by insertion or substitution of "x".
292
293       The maximum edit distance is also the maximum number of edits.  That
294       is, the "I2" in
295
296               amatch("abcd", ["I2"])
297
298       is useless because the maximum edit distance is (implicitly) 1.  You
299       may have meant to say
300
301               amatch("abcd", ["2D1S1"])
302
303       or something like that.
304
305       If you want to simulate transposes
306
307               feet fete
308
309       you need to allow at least edit distance of two because in terms of our
310       edit primitives a transpose is first one deletion and then one
311       insertion.
312
313   TEXT POSITION
314       The starting and ending positions of matching, substituting, indexing,
315       or slicing can be changed from the beginning and end of the input(s) to
316       some other positions by using either or both of the modifiers
317
318               "initial_position=24"
319               "final_position=42"
320
321       or the both the modifiers
322
323               "initial_position=24"
324               "position_range=10"
325
326       By setting the "position_range" to be zero you can limit (anchor) the
327       operation to happen only once (if a match is possible) at the position.
328

VERSION

330       Major release 3.
331

CHANGES FROM VERSION 2

333   GOOD NEWS
334       The version 3 is 2-3 times faster than version 2
335       No pattern length limitation
336           The algorithm is independent on the pattern length: its time
337           complexity is O(kn), where k is the number of edits and n the
338           length of the text (input).  The preprocessing of the pattern will
339           of course take some O(m) (m being the pattern length) time, but
340           "amatch()" and "asubstitute()" cache the result of this
341           preprocessing so that it is done only once per pattern.
342
343   BAD NEWS
344       You do need a C compiler to install the module
345           Perl's regular expressions are no more used; instead a faster and
346           more scalable algorithm written in C is used.
347
348       "asubstitute()" is now always stingy
349           The string matched and substituted is now always stingy, as short
350           as possible.  It used to be as long as possible.  This is an
351           unfortunate change stemming from switching the matching algorithm.
352           Example: with edit distance of two and substituting for "word" from
353           "cork" and "wool" previously did match "cork" and "wool".  Now it
354           does match "or" and "wo".  As little as possible, or, in other
355           words, with as much approximateness, as many edits, as possible.
356           Because there is no need to match the "c" of "cork", it is not
357           matched.
358
359       no more "aregex()" because regular expressions are no more used
360       no more "compat1" for String::Approx version 1 compatibility
361

ACKNOWLEDGEMENTS

363       The following people have provided valuable test cases, documentation
364       clarifications, and other feedback:
365
366       Jared August, Arthur Bergman, Anirvan Chatterjee, Steve A. Chervitz,
367       Aldo Calpini, David Curiel, Teun van den Dool, Alberto Fontaneda, Rob
368       Fugina, Dmitrij Frishman, Lars Gregersen, Kevin Greiner, B. Elijah
369       Griffin, Mike Hanafey, Mitch Helle, Ricky Houghton, 'idallen', Helmut
370       Jarausch, Damian Keefe, Ben Kennedy, Craig Kelley, Franz Kirsch, Dag
371       Kristian, Mark Land, J. D. Laub, John P. Linderman, Tim Maher, Juha
372       Muilu, Sergey Novoselov, Andy Oram, Ji Y Park, Eric Promislow, Nikolaus
373       Rath, Stefan Ram, Slaven Rezic, Dag Kristian Rognlien, Stewart Russell,
374       Slaven Rezic, Chris Rosin, Pasha Sadri, Ilya Sandler, Bob J.A.
375       Schijvenaars, Ross Smith, Frank Tobin, Greg Ward, Rich Williams, Rick
376       Wise.
377
378       The matching algorithm was developed by Udi Manber, Sun Wu, and Burra
379       Gopal in the Department of Computer Science, University of Arizona.
380

AUTHOR

382       Jarkko Hietaniemi <jhi@iki.fi>
383

COPYRIGHT AND LICENSE

385       Copyright 2001-2013 by Jarkko Hietaniemi
386
387       This library is free software; you can redistribute it and/or modify
388       under either the terms of the Artistic License 2.0, or the GNU Library
389       General Public License, Version 2.  See the files Artistic and LGPL for
390       more details.
391
392       Furthermore: no warranties or obligations of any kind are given, and
393       the separate file COPYRIGHT must be included intact in all copies and
394       derived materials.
395
396
397
398perl v5.30.0                      2019-07-26                         Approx(3)