1Approx(3) User Contributed Perl Documentation Approx(3)
2
3
4
6 String::Approx - Perl extension for approximate matching (fuzzy match‐
7 ing)
8
10 use String::Approx 'amatch';
11
12 print if amatch("foobar");
13
14 my @matches = amatch("xyzzy", @inputs);
15
16 my @catches = amatch("plugh", ['2'], @inputs);
17
19 String::Approx lets you match and substitute strings approximately.
20 With this you can emulate errors: typing errorrs, speling errors,
21 closely related vocabularies (colour color), genetic mutations (GAG
22 ACT), abbreviations (McScot, MacScot).
23
24 NOTE: String::Approx suits the task of string matching, not string com‐
25 parison, and it works for strings, not for text.
26
27 If you want to compare strings for similarity, you probably just want
28 the Levenshtein edit distance (explained below), the Text::Levenshtein
29 and Text::LevenshteinXS modules in CPAN. See also Text::WagnerFischer
30 and Text::PhraseDistance. (There are functions for this in
31 String::Approx, e.g. adist(), but their results sometimes differ from
32 the bare Levenshtein et al.)
33
34 If you want to compare things like text or source code, consisting of
35 words or tokens and phrases and sentences, or expressions and state‐
36 ments, you should probably use some other tool than String::Approx,
37 like for example the standard UNIX diff(1) tool, or the Algorithm::Diff
38 module from CPAN.
39
40 The measure of approximateness is the Levenshtein edit distance. It is
41 the total number of "edits": insertions,
42
43 word world
44
45 deletions,
46
47 monkey money
48
49 and substitutions
50
51 sun fun
52
53 required to transform a string to another string. For example, to
54 transform "lead" into "gold", you need three edits:
55
56 lead gead goad gold
57
58 The edit distance of "lead" and "gold" is therefore three, or 75%.
59
60 String::Approx uses the Levenshtein edit distance (tLed) as its mea‐
61 sure, but String::Approx is not well-suited for comparing the tLeds of
62 strings, in other words, if you want a "fuzzy eq", see above.
63 Strings::Approx is more like regular expressions or index(), it finds
64 substrings that are close matches.
65
67 use String::Approx 'amatch';
68
69 $matched = amatch("pattern")
70 $matched = amatch("pattern", [ modifiers ])
71
72 $any_matched = amatch("pattern", @inputs)
73 $any_matched = amatch("pattern", [ modifiers ], @inputs)
74
75 @match = amatch("pattern")
76 @match = amatch("pattern", [ modifiers ])
77
78 @matches = amatch("pattern", @inputs)
79 @matches = amatch("pattern", [ modifiers ], @inputs)
80
81 Match pattern approximately. In list context return the matched
82 @inputs. If no inputs are given, match against the $_. In scalar con‐
83 text return true if any of the inputs match, false if none match.
84
85 Notice that the pattern is a string. Not a regular expression. None
86 of the regular expression notations (^, ., *, and so on) work. They
87 are characters just like the others. Note-on-note: some limited form
88 of "regular expressionism" is planned in future: for example character
89 classes ([abc]) and any-chars (.). But that feature will be turned on
90 by a special modifier (just a guess: "r"), so there should be no back‐
91 ward compatibility problem.
92
93 Notice also that matching is not symmetric. The inputs are matched
94 against the pattern, not the other way round. In other words: the pat‐
95 tern can be a substring, a submatch, of an input element. An input
96 element is always a superstring of the pattern.
97
98 MODIFIERS
99
100 With the modifiers you can control the amount of approximateness and
101 certain other control variables. The modifiers are one or more
102 strings, for example "i", within a string optionally separated by
103 whitespace. The modifiers are inside an anonymous array: the [ ] in
104 the syntax are not notational, they really do mean [ ], for example [
105 "i", "2" ]. ["2 i"] would be identical.
106
107 The implicit default approximateness is 10%, rounded up. In other
108 words: every tenth character in the pattern may be an error, an edit.
109 You can explicitly set the maximum approximateness by supplying a modi‐
110 fier like
111
112 number
113 number%
114
115 Examples: "3", "15%".
116
117 Note that "0%" is not rounded up, it is equal to 0.
118
119 Using a similar syntax you can separately control the maximum number of
120 insertions, deletions, and substitutions by prefixing the numbers with
121 I, D, or S, like this:
122
123 Inumber
124 Inumber%
125 Dnumber
126 Dnumber%
127 Snumber
128 Snumber%
129
130 Examples: "I2", "D20%", "S0".
131
132 You can ignore case ("A" becames equal to "a" and vice versa) by adding
133 the "i" modifier.
134
135 For example
136
137 [ "i 25%", "S0" ]
138
139 means ignore case, allow every fourth character to be "an edit", but
140 allow no substitutions. (See NOTES about disallowing substitutions or
141 insertions.)
142
143 NOTE: setting "I0 D0 S0" is not equivalent to using index(). If you
144 want to use index(), use index().
145
147 use String::Approx 'asubstitute';
148
149 @substituted = asubstitute("pattern", "replacement")
150 @substituted = asubstitute("pattern", "replacement", @inputs)
151 @substituted = asubstitute("pattern", "replacement", [ modifiers ])
152 @substituted = asubstitute("pattern", "replacement",
153 [ modifiers ], @inputs)
154
155 Substitute approximate pattern with replacement and return as a list
156 <copies> of @inputs, the substitutions having been made on the elements
157 that did match the pattern. If no inputs are given, substitute in the
158 $_. The replacement can contain magic strings $&, $`, $' that stand
159 for the matched string, the string before it, and the string after it,
160 respectively. All the other arguments are as in "amatch()", plus one
161 additional modifier, "g" which means substitute globally (all the
162 matches in an element and not just the first one, as is the default).
163
164 See "BAD NEWS" about the unfortunate stinginess of "asubstitute()".
165
167 use String::Approx 'aindex';
168
169 $index = aindex("pattern")
170 @indices = aindex("pattern", @inputs)
171 $index = aindex("pattern", [ modifiers ])
172 @indices = aindex("pattern", [ modifiers ], @inputs)
173
174 Like "amatch()" but returns the index/indices at which the pattern
175 matches approximately. In list context and if @inputs are used,
176 returns a list of indices, one index for each input element. If
177 there's no approximate match, "-1" is returned as the index.
178
179 NOTE: if there is character repetition (e.g. "aa") either in the pat‐
180 tern or in the text, the returned index might start "too early". This
181 is consistent with the goal of the module of matching "as early as pos‐
182 sible", just like regular expressions (that there might be a "less
183 approximate" match starting later is of somewhat irrelevant).
184
185 There's also backwards-scanning "arindex()".
186
188 use String::Approx 'aslice';
189
190 ($index, $size) = aslice("pattern")
191 ([$i0, $s0], ...) = aslice("pattern", @inputs)
192 ($index, $size) = aslice("pattern", [ modifiers ])
193 ([$i0, $s0], ...) = aslice("pattern", [ modifiers ], @inputs)
194
195 Like "aindex()" but returns also the size (length) of the match. If
196 the match fails, returns an empty list (when matching against $_) or an
197 empty anonymous list corresponding to the particular input.
198
199 NOTE: size of the match will very probably be something you did not
200 expect (such as longer than the pattern, or a negative number). This
201 may or may not be fixed in future releases. Also the beginning of the
202 match may vary from the expected as with aindex(), see above.
203
204 If the modifier
205
206 "minimal_distance"
207
208 is used, the minimal possible edit distance is returned as the third
209 element:
210
211 ($index, $size, $distance) = aslice("pattern", [ modifiers ])
212 ([$i0, $s0, $d0], ...) = aslice("pattern", [ modifiers ], @inputs)
213
215 use String::Approx 'adist';
216
217 $dist = adist("pattern", $input);
218 @dist = adist("pattern", @input);
219
220 Return the edit distance or distances between the pattern and the input
221 or inputs. Zero edit distance means exact match. (Remember that the
222 match can 'float' in the inputs, the match is a substring match.) If
223 the pattern is longer than the input or inputs, the returned distance
224 or distances is or are negative.
225
226 use String::Approx 'adistr';
227
228 $dist = adistr("pattern", $input);
229 @dist = adistr("pattern", @inputs);
230
231 Return the relative edit distance or distances between the pattern and
232 the input or inputs. Zero relative edit distance means exact match,
233 one means completely different. (Remember that the match can 'float'
234 in the inputs, the match is a substring match.) If the pattern is
235 longer than the input or inputs, the returned distance or distances is
236 or are negative.
237
238 You can use adist() or adistr() to sort the inputs according to their
239 approximateness:
240
241 my %d;
242 @d{@inputs} = map { abs } adistr("pattern", @inputs);
243 my @d = sort { $d{$a} <=> $d{$b} } @inputs;
244
245 Now @d contains the inputs, the most like "pattern" first.
246
248 "String::Approx" maintains a LU (least-used) cache that holds the
249 'matching engines' for each instance of a pattern+modifiers. The cache
250 is intended to help the case where you match a small set of patterns
251 against a large set of string. However, the more engines you cache the
252 more you eat memory. If you have a lot of different patterns or if you
253 have a lot of memory to burn, you may want to control the cache your‐
254 self. For example, allowing a larger cache consumes more memory but
255 probably runs a little bit faster since the cache fills (and needs
256 flushing) less often.
257
258 The cache has two parameters: max and purge. The first one is the max‐
259 imum size of the cache and the second one is the cache flushing ratio:
260 when the number of cache entries exceeds max, max times purge cache
261 entries are flushed. The default values are 1000 and 0.75, respec‐
262 tively, which means that when the 1001st entry would be cached, 750
263 least used entries will be removed from the cache. To access the
264 parameters you can use the calls
265
266 $now_max = String::Approx::cache_max();
267 String::Approx::cache_max($new_max);
268
269 $now_purge = String::Approx::cache_purge();
270 String::Approx::cache_purge($new_purge);
271
272 $limit = String::Approx::cache_n_purge();
273
274 To be honest, there are actually two caches: the first one is used far
275 the patterns with no modifiers, the second one for the patterns with
276 pattern modifiers. Using the standard parameters you will therefore
277 actually cache up to 2000 entries. The above calls control both caches
278 for the same price.
279
280 To disable caching completely use
281
282 String::Approx::cache_disable();
283
284 Note that this doesn't flush any possibly existing cache entries, to do
285 that use
286
287 String::Approx::cache_flush_all();
288
290 Because matching is by substrings, not by whole strings, insertions and
291 substitutions produce often very similar results: "abcde" matches
292 "axbcde" either by insertion or substitution of "x".
293
294 The maximum edit distance is also the maximum number of edits. That
295 is, the "I2" in
296
297 amatch("abcd", ["I2"])
298
299 is useless because the maximum edit distance is (implicitly) 1. You
300 may have meant to say
301
302 amatch("abcd", ["2D1S1"])
303
304 or something like that.
305
306 If you want to simulate transposes
307
308 feet fete
309
310 you need to allow at least edit distance of two because in terms of our
311 edit primitives a transpose is first one deletion and then one inser‐
312 tion.
313
314 TEXT POSITION
315
316 The starting and ending positions of matching, substituting, indexing,
317 or slicing can be changed from the beginning and end of the input(s) to
318 some other positions by using either or both of the modifiers
319
320 "initial_position=24"
321 "final_position=42"
322
323 or the both the modifiers
324
325 "initial_position=24"
326 "position_range=10"
327
328 By setting the "position_range" to be zero you can limit (anchor) the
329 operation to happen only once (if a match is possible) at the position.
330
332 Major release 3.
333
335 GOOD NEWS
336
337 The version 3 is 2-3 times faster than version 2
338 No pattern length limitation
339 The algorithm is independent on the pattern length: its time com‐
340 plexity is O(kn), where k is the number of edits and n the length
341 of the text (input). The preprocessing of the pattern will of
342 course take some O(m) (m being the pattern length) time, but
343 "amatch()" and "asubstitute()" cache the result of this preprocess‐
344 ing so that it is done only once per pattern.
345
346 BAD NEWS
347
348 You do need a C compiler to install the module
349 Perl's regular expressions are no more used; instead a faster and
350 more scalable algorithm written in C is used.
351
352 "asubstitute()" is now always stingy
353 The string matched and substituted is now always stingy, as short
354 as possible. It used to be as long as possible. This is an unfor‐
355 tunate change stemming from switching the matching algorithm.
356 Example: with edit distance of two and substituting for "word" from
357 "cork" and "wool" previously did match "cork" and "wool". Now it
358 does match "or" and "wo". As little as possible, or, in other
359 words, with as much approximateness, as many edits, as possible.
360 Because there is no need to match the "c" of "cork", it is not
361 matched.
362
363 no more "aregex()" because regular expressions are no more used
364 no more "compat1" for String::Approx version 1 compatibility
365
367 The following people have provided valuable test cases, documentation
368 clarifications, and other feedback:
369
370 Jared August, Arthur Bergman, Anirvan Chatterjee, Steve A. Chervitz,
371 Aldo Calpini, David Curiel, Teun van den Dool, Alberto Fontaneda, Rob
372 Fugina, Dmitrij Frishman, Lars Gregersen, Kevin Greiner, B. Elijah
373 Griffin, Mike Hanafey, Mitch Helle, Ricky Houghton, 'idallen', Helmut
374 Jarausch, Damian Keefe, Ben Kennedy, Craig Kelley, Franz Kirsch, Dag
375 Kristian, Mark Land, J. D. Laub, John P. Linderman, Tim Maher, Juha
376 Muilu, Sergey Novoselov, Andy Oram, Ji Y Park, Eric Promislow, Nikolaus
377 Rath, Stefan Ram, Slaven Rezic, Dag Kristian Rognlien, Stewart Russell,
378 Slaven Rezic, Chris Rosin, Pasha Sadri, Ilya Sandler, Bob J.A. Schijve‐
379 naars, Ross Smith, Frank Tobin, Greg Ward, Rich Williams, Rick Wise.
380
381 The matching algorithm was developed by Udi Manber, Sun Wu, and Burra
382 Gopal in the Department of Computer Science, University of Arizona.
383
385 Jarkko Hietaniemi <jhi@iki.fi>
386
388 Copyright 2001-2005 by Jarkko Hietaniemi
389
390 This library is free software; you can redistribute it and/or modify it
391 under the same terms as Perl itself.
392
393 Furthermore: no warranties or obligations of any kind are given, and
394 the separate file COPYRIGHT must be included intact in all copies and
395 derived materials.
396
397
398
399perl v5.8.8 2006-04-09 Approx(3)