1Approx(3) User Contributed Perl Documentation Approx(3)
2
3
4
6 String::Approx - Perl extension for approximate matching (fuzzy
7 matching)
8
10 use String::Approx 'amatch';
11
12 print if amatch("foobar");
13
14 my @matches = amatch("xyzzy", @inputs);
15
16 my @catches = amatch("plugh", ['2'], @inputs);
17
19 String::Approx lets you match and substitute strings approximately.
20 With this you can emulate errors: typing errorrs, speling errors,
21 closely related vocabularies (colour color), genetic mutations (GAG
22 ACT), abbreviations (McScot, MacScot).
23
24 NOTE: String::Approx suits the task of string matching, not string
25 comparison, and it works for strings, not for text.
26
27 If you want to compare strings for similarity, you probably just want
28 the Levenshtein edit distance (explained below), the Text::Levenshtein
29 and Text::LevenshteinXS modules in CPAN. See also Text::WagnerFischer
30 and Text::PhraseDistance. (There are functions for this in
31 String::Approx, e.g. adist(), but their results sometimes differ from
32 the bare Levenshtein et al.)
33
34 If you want to compare things like text or source code, consisting of
35 words or tokens and phrases and sentences, or expressions and
36 statements, you should probably use some other tool than
37 String::Approx, like for example the standard UNIX diff(1) tool, or the
38 Algorithm::Diff module from CPAN.
39
40 The measure of approximateness is the Levenshtein edit distance. It is
41 the total number of "edits": insertions,
42
43 word world
44
45 deletions,
46
47 monkey money
48
49 and substitutions
50
51 sun fun
52
53 required to transform a string to another string. For example, to
54 transform "lead" into "gold", you need three edits:
55
56 lead gead goad gold
57
58 The edit distance of "lead" and "gold" is therefore three, or 75%.
59
60 String::Approx uses the Levenshtein edit distance as its measure, but
61 String::Approx is not well-suited for comparing strings of different
62 length, in other words, if you want a "fuzzy eq", see above.
63 String::Approx is more like regular expressions or index(), it finds
64 substrings that are close matches.>
65
67 use String::Approx 'amatch';
68
69 $matched = amatch("pattern")
70 $matched = amatch("pattern", [ modifiers ])
71
72 $any_matched = amatch("pattern", @inputs)
73 $any_matched = amatch("pattern", [ modifiers ], @inputs)
74
75 @match = amatch("pattern")
76 @match = amatch("pattern", [ modifiers ])
77
78 @matches = amatch("pattern", @inputs)
79 @matches = amatch("pattern", [ modifiers ], @inputs)
80
81 Match pattern approximately. In list context return the matched
82 @inputs. If no inputs are given, match against the $_. In scalar
83 context return true if any of the inputs match, false if none match.
84
85 Notice that the pattern is a string. Not a regular expression. None
86 of the regular expression notations (^, ., *, and so on) work. They
87 are characters just like the others. Note-on-note: some limited form
88 of "regular expressionism" is planned in future: for example character
89 classes ([abc]) and any-chars (.). But that feature will be turned on
90 by a special modifier (just a guess: "r"), so there should be no
91 backward compatibility problem.
92
93 Notice also that matching is not symmetric. The inputs are matched
94 against the pattern, not the other way round. In other words: the
95 pattern can be a substring, a submatch, of an input element. An input
96 element is always a superstring of the pattern.
97
98 MODIFIERS
99 With the modifiers you can control the amount of approximateness and
100 certain other control variables. The modifiers are one or more
101 strings, for example "i", within a string optionally separated by
102 whitespace. The modifiers are inside an anonymous array: the [ ] in
103 the syntax are not notational, they really do mean [ ], for example [
104 "i", "2" ]. ["2 i"] would be identical.
105
106 The implicit default approximateness is 10%, rounded up. In other
107 words: every tenth character in the pattern may be an error, an edit.
108 You can explicitly set the maximum approximateness by supplying a
109 modifier like
110
111 number
112 number%
113
114 Examples: "3", "15%".
115
116 Note that "0%" is not rounded up, it is equal to 0.
117
118 Using a similar syntax you can separately control the maximum number of
119 insertions, deletions, and substitutions by prefixing the numbers with
120 I, D, or S, like this:
121
122 Inumber
123 Inumber%
124 Dnumber
125 Dnumber%
126 Snumber
127 Snumber%
128
129 Examples: "I2", "D20%", "S0".
130
131 You can ignore case ("A" becames equal to "a" and vice versa) by adding
132 the "i" modifier.
133
134 For example
135
136 [ "i 25%", "S0" ]
137
138 means ignore case, allow every fourth character to be "an edit", but
139 allow no substitutions. (See NOTES about disallowing substitutions or
140 insertions.)
141
142 NOTE: setting "I0 D0 S0" is not equivalent to using index(). If you
143 want to use index(), use index().
144
146 use String::Approx 'asubstitute';
147
148 @substituted = asubstitute("pattern", "replacement")
149 @substituted = asubstitute("pattern", "replacement", @inputs)
150 @substituted = asubstitute("pattern", "replacement", [ modifiers ])
151 @substituted = asubstitute("pattern", "replacement",
152 [ modifiers ], @inputs)
153
154 Substitute approximate pattern with replacement and return as a list
155 <copies> of @inputs, the substitutions having been made on the elements
156 that did match the pattern. If no inputs are given, substitute in the
157 $_. The replacement can contain magic strings $&, $`, $' that stand
158 for the matched string, the string before it, and the string after it,
159 respectively. All the other arguments are as in "amatch()", plus one
160 additional modifier, "g" which means substitute globally (all the
161 matches in an element and not just the first one, as is the default).
162
163 See "BAD NEWS" about the unfortunate stinginess of "asubstitute()".
164
166 use String::Approx 'aindex';
167
168 $index = aindex("pattern")
169 @indices = aindex("pattern", @inputs)
170 $index = aindex("pattern", [ modifiers ])
171 @indices = aindex("pattern", [ modifiers ], @inputs)
172
173 Like "amatch()" but returns the index/indices at which the pattern
174 matches approximately. In list context and if @inputs are used,
175 returns a list of indices, one index for each input element. If
176 there's no approximate match, "-1" is returned as the index.
177
178 NOTE: if there is character repetition (e.g. "aa") either in the
179 pattern or in the text, the returned index might start "too early".
180 This is consistent with the goal of the module of matching "as early as
181 possible", just like regular expressions (that there might be a "less
182 approximate" match starting later is of somewhat irrelevant).
183
184 There's also backwards-scanning "arindex()".
185
187 use String::Approx 'aslice';
188
189 ($index, $size) = aslice("pattern")
190 ([$i0, $s0], ...) = aslice("pattern", @inputs)
191 ($index, $size) = aslice("pattern", [ modifiers ])
192 ([$i0, $s0], ...) = aslice("pattern", [ modifiers ], @inputs)
193
194 Like "aindex()" but returns also the size (length) of the match. If
195 the match fails, returns an empty list (when matching against $_) or an
196 empty anonymous list corresponding to the particular input.
197
198 NOTE: size of the match will very probably be something you did not
199 expect (such as longer than the pattern, or a negative number). This
200 may or may not be fixed in future releases. Also the beginning of the
201 match may vary from the expected as with aindex(), see above.
202
203 If the modifier
204
205 "minimal_distance"
206
207 is used, the minimal possible edit distance is returned as the third
208 element:
209
210 ($index, $size, $distance) = aslice("pattern", [ modifiers ])
211 ([$i0, $s0, $d0], ...) = aslice("pattern", [ modifiers ], @inputs)
212
214 use String::Approx 'adist';
215
216 $dist = adist("pattern", $input);
217 @dist = adist("pattern", @input);
218
219 Return the edit distance or distances between the pattern and the input
220 or inputs. Zero edit distance means exact match. (Remember that the
221 match can 'float' in the inputs, the match is a substring match.) If
222 the pattern is longer than the input or inputs, the returned distance
223 or distances is or are negative.
224
225 use String::Approx 'adistr';
226
227 $dist = adistr("pattern", $input);
228 @dist = adistr("pattern", @inputs);
229
230 Return the relative edit distance or distances between the pattern and
231 the input or inputs. Zero relative edit distance means exact match,
232 one means completely different. (Remember that the match can 'float'
233 in the inputs, the match is a substring match.) If the pattern is
234 longer than the input or inputs, the returned distance or distances is
235 or are negative.
236
237 You can use adist() or adistr() to sort the inputs according to their
238 approximateness:
239
240 my %d;
241 @d{@inputs} = map { abs } adistr("pattern", @inputs);
242 my @d = sort { $d{$a} <=> $d{$b} } @inputs;
243
244 Now @d contains the inputs, the most like "pattern" first.
245
247 "String::Approx" maintains a LU (least-used) cache that holds the
248 'matching engines' for each instance of a pattern+modifiers. The cache
249 is intended to help the case where you match a small set of patterns
250 against a large set of string. However, the more engines you cache the
251 more you eat memory. If you have a lot of different patterns or if you
252 have a lot of memory to burn, you may want to control the cache
253 yourself. For example, allowing a larger cache consumes more memory
254 but probably runs a little bit faster since the cache fills (and needs
255 flushing) less often.
256
257 The cache has two parameters: max and purge. The first one is the
258 maximum size of the cache and the second one is the cache flushing
259 ratio: when the number of cache entries exceeds max, max times purge
260 cache entries are flushed. The default values are 1000 and 0.75,
261 respectively, which means that when the 1001st entry would be cached,
262 750 least used entries will be removed from the cache. To access the
263 parameters you can use the calls
264
265 $now_max = String::Approx::cache_max();
266 String::Approx::cache_max($new_max);
267
268 $now_purge = String::Approx::cache_purge();
269 String::Approx::cache_purge($new_purge);
270
271 $limit = String::Approx::cache_n_purge();
272
273 To be honest, there are actually two caches: the first one is used far
274 the patterns with no modifiers, the second one for the patterns with
275 pattern modifiers. Using the standard parameters you will therefore
276 actually cache up to 2000 entries. The above calls control both caches
277 for the same price.
278
279 To disable caching completely use
280
281 String::Approx::cache_disable();
282
283 Note that this doesn't flush any possibly existing cache entries, to do
284 that use
285
286 String::Approx::cache_flush_all();
287
289 Because matching is by substrings, not by whole strings, insertions and
290 substitutions produce often very similar results: "abcde" matches
291 "axbcde" either by insertion or substitution of "x".
292
293 The maximum edit distance is also the maximum number of edits. That
294 is, the "I2" in
295
296 amatch("abcd", ["I2"])
297
298 is useless because the maximum edit distance is (implicitly) 1. You
299 may have meant to say
300
301 amatch("abcd", ["2D1S1"])
302
303 or something like that.
304
305 If you want to simulate transposes
306
307 feet fete
308
309 you need to allow at least edit distance of two because in terms of our
310 edit primitives a transpose is first one deletion and then one
311 insertion.
312
313 TEXT POSITION
314 The starting and ending positions of matching, substituting, indexing,
315 or slicing can be changed from the beginning and end of the input(s) to
316 some other positions by using either or both of the modifiers
317
318 "initial_position=24"
319 "final_position=42"
320
321 or the both the modifiers
322
323 "initial_position=24"
324 "position_range=10"
325
326 By setting the "position_range" to be zero you can limit (anchor) the
327 operation to happen only once (if a match is possible) at the position.
328
330 Major release 3.
331
333 GOOD NEWS
334 The version 3 is 2-3 times faster than version 2
335 No pattern length limitation
336 The algorithm is independent on the pattern length: its time
337 complexity is O(kn), where k is the number of edits and n the
338 length of the text (input). The preprocessing of the pattern will
339 of course take some O(m) (m being the pattern length) time, but
340 "amatch()" and "asubstitute()" cache the result of this
341 preprocessing so that it is done only once per pattern.
342
343 BAD NEWS
344 You do need a C compiler to install the module
345 Perl's regular expressions are no more used; instead a faster and
346 more scalable algorithm written in C is used.
347
348 "asubstitute()" is now always stingy
349 The string matched and substituted is now always stingy, as short
350 as possible. It used to be as long as possible. This is an
351 unfortunate change stemming from switching the matching algorithm.
352 Example: with edit distance of two and substituting for "word" from
353 "cork" and "wool" previously did match "cork" and "wool". Now it
354 does match "or" and "wo". As little as possible, or, in other
355 words, with as much approximateness, as many edits, as possible.
356 Because there is no need to match the "c" of "cork", it is not
357 matched.
358
359 no more "aregex()" because regular expressions are no more used
360 no more "compat1" for String::Approx version 1 compatibility
361
363 The following people have provided valuable test cases, documentation
364 clarifications, and other feedback:
365
366 Jared August, Arthur Bergman, Anirvan Chatterjee, Steve A. Chervitz,
367 Aldo Calpini, David Curiel, Teun van den Dool, Alberto Fontaneda, Rob
368 Fugina, Dmitrij Frishman, Lars Gregersen, Kevin Greiner, B. Elijah
369 Griffin, Mike Hanafey, Mitch Helle, Ricky Houghton, 'idallen', Helmut
370 Jarausch, Damian Keefe, Ben Kennedy, Craig Kelley, Franz Kirsch, Dag
371 Kristian, Mark Land, J. D. Laub, John P. Linderman, Tim Maher, Juha
372 Muilu, Sergey Novoselov, Andy Oram, Ji Y Park, Eric Promislow, Nikolaus
373 Rath, Stefan Ram, Slaven Rezic, Dag Kristian Rognlien, Stewart Russell,
374 Slaven Rezic, Chris Rosin, Pasha Sadri, Ilya Sandler, Bob J.A.
375 Schijvenaars, Ross Smith, Frank Tobin, Greg Ward, Rich Williams, Rick
376 Wise.
377
378 The matching algorithm was developed by Udi Manber, Sun Wu, and Burra
379 Gopal in the Department of Computer Science, University of Arizona.
380
382 Jarkko Hietaniemi <jhi@iki.fi>
383
385 Copyright 2001-2013 by Jarkko Hietaniemi
386
387 This library is free software; you can redistribute it and/or modify
388 under either the terms of the Artistic License 2.0, or the GNU Library
389 General Public License, Version 2. See the files Artistic and LGPL for
390 more details.
391
392 Furthermore: no warranties or obligations of any kind are given, and
393 the separate file COPYRIGHT must be included intact in all copies and
394 derived materials.
395
396
397
398perl v5.32.1 2021-01-27 Approx(3)