1PCRE2PERFORM(3) Library Functions Manual PCRE2PERFORM(3)
2
3
4
6 PCRE2 - Perl-compatible regular expressions (revised API)
7
9
10 Two aspects of performance are discussed below: memory usage and pro‐
11 cessing time. The way you express your pattern as a regular expression
12 can affect both of them.
13
15
16 Patterns are compiled by PCRE2 into a reasonably efficient interpretive
17 code, so that most simple patterns do not use much memory for storing
18 the compiled version. However, there is one case where the memory usage
19 of a compiled pattern can be unexpectedly large. If a parenthesized
20 group has a quantifier with a minimum greater than 1 and/or a limited
21 maximum, the whole group is repeated in the compiled code. For example,
22 the pattern
23
24 (abc|def){2,4}
25
26 is compiled as if it were
27
28 (abc|def)(abc|def)((abc|def)(abc|def)?)?
29
30 (Technical aside: It is done this way so that backtrack points within
31 each of the repetitions can be independently maintained.)
32
33 For regular expressions whose quantifiers use only small numbers, this
34 is not usually a problem. However, if the numbers are large, and par‐
35 ticularly if such repetitions are nested, the memory usage can become
36 an embarrassment. For example, the very simple pattern
37
38 ((ab){1,1000}c){1,3}
39
40 uses over 50KiB when compiled using the 8-bit library. When PCRE2 is
41 compiled with its default internal pointer size of two bytes, the size
42 limit on a compiled pattern is 65535 code units in the 8-bit and 16-bit
43 libraries, and this is reached with the above pattern if the outer rep‐
44 etition is increased from 3 to 4. PCRE2 can be compiled to use larger
45 internal pointers and thus handle larger compiled patterns, but it is
46 better to try to rewrite your pattern to use less memory if you can.
47
48 One way of reducing the memory usage for such patterns is to make use
49 of PCRE2's "subroutine" facility. Re-writing the above pattern as
50
51 ((ab)(?2){0,999}c)(?1){0,2}
52
53 reduces the memory requirements to around 16KiB, and indeed it remains
54 under 20KiB even with the outer repetition increased to 100. However,
55 this kind of pattern is not always exactly equivalent, because any cap‐
56 tures within subroutine calls are lost when the subroutine completes.
57 If this is not a problem, this kind of rewriting will allow you to
58 process patterns that PCRE2 cannot otherwise handle. The matching per‐
59 formance of the two different versions of the pattern are roughly the
60 same. (This applies from release 10.30 - things were different in ear‐
61 lier releases.)
62
64
65 From release 10.30, the interpretive (non-JIT) version of pcre2_match()
66 uses very little system stack at run time. In earlier releases recur‐
67 sive function calls could use a great deal of stack, and this could
68 cause problems, but this usage has been eliminated. Backtracking posi‐
69 tions are now explicitly remembered in memory frames controlled by the
70 code.
71
72 The size of each frame depends on the size of pointer variables and the
73 number of capturing parenthesized groups in the pattern being matched.
74 On a 64-bit system the frame size for a pattern with no captures is 128
75 bytes. For each capturing group the size increases by 16 bytes.
76
77 Until release 10.41, an initial 20KiB frames vector was allocated on
78 the system stack, but this still caused some issues for multi-thread
79 applications where each thread has a very small stack. From release
80 10.41 backtracking memory frames are always held in heap memory. An
81 initial heap allocation is obtained the first time any match data block
82 is passed to pcre2_match(). This is remembered with the match data
83 block and re-used if that block is used for another match. It is freed
84 when the match data block itself is freed.
85
86 The size of the initial block is the larger of 20KiB or ten times the
87 pattern's frame size, unless the heap limit is less than this, in which
88 case the heap limit is used. If the initial block proves to be too
89 small during matching, it is replaced by a larger block, subject to the
90 heap limit. The heap limit is checked only when a new block is to be
91 allocated. Reducing the heap limit between calls to pcre2_match() with
92 the same match data block does not affect the saved block.
93
94 In contrast to pcre2_match(), pcre2_dfa_match() does use recursive
95 function calls, but only for processing atomic groups, lookaround as‐
96 sertions, and recursion within the pattern. The original version of the
97 code used to allocate quite large internal workspace vectors on the
98 stack, which caused some problems for some patterns in environments
99 with small stacks. From release 10.32 the code for pcre2_dfa_match()
100 has been re-factored to use heap memory when necessary for internal
101 workspace when recursing, though recursive function calls are still
102 used.
103
104 The "match depth" parameter can be used to limit the depth of function
105 recursion, and the "match heap" parameter to limit heap memory in
106 pcre2_dfa_match().
107
109
110 Certain items in regular expression patterns are processed more effi‐
111 ciently than others. It is more efficient to use a character class like
112 [aeiou] than a set of single-character alternatives such as
113 (a|e|i|o|u). In general, the simplest construction that provides the
114 required behaviour is usually the most efficient. Jeffrey Friedl's book
115 contains a lot of useful general discussion about optimizing regular
116 expressions for efficient performance. This document contains a few ob‐
117 servations about PCRE2.
118
119 Using Unicode character properties (the \p, \P, and \X escapes) is
120 slow, because PCRE2 has to use a multi-stage table lookup whenever it
121 needs a character's property. If you can find an alternative pattern
122 that does not use character properties, it will probably be faster.
123
124 By default, the escape sequences \b, \d, \s, and \w, and the POSIX
125 character classes such as [:alpha:] do not use Unicode properties,
126 partly for backwards compatibility, and partly for performance reasons.
127 However, you can set the PCRE2_UCP option or start the pattern with
128 (*UCP) if you want Unicode character properties to be used. This can
129 double the matching time for items such as \d, when matched with
130 pcre2_match(); the performance loss is less with a DFA matching func‐
131 tion, and in both cases there is not much difference for \b.
132
133 When a pattern begins with .* not in atomic parentheses, nor in paren‐
134 theses that are the subject of a backreference, and the PCRE2_DOTALL
135 option is set, the pattern is implicitly anchored by PCRE2, since it
136 can match only at the start of a subject string. If the pattern has
137 multiple top-level branches, they must all be anchorable. The optimiza‐
138 tion can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is au‐
139 tomatically disabled if the pattern contains (*PRUNE) or (*SKIP).
140
141 If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization, be‐
142 cause the dot metacharacter does not then match a newline, and if the
143 subject string contains newlines, the pattern may match from the char‐
144 acter immediately following one of them instead of from the very start.
145 For example, the pattern
146
147 .*second
148
149 matches the subject "first\nand second" (where \n stands for a newline
150 character), with the match starting at the seventh character. In order
151 to do this, PCRE2 has to retry the match starting after every newline
152 in the subject.
153
154 If you are using such a pattern with subject strings that do not con‐
155 tain newlines, the best performance is obtained by setting
156 PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate ex‐
157 plicit anchoring. That saves PCRE2 from having to scan along the sub‐
158 ject looking for a newline to restart at.
159
160 Beware of patterns that contain nested indefinite repeats. These can
161 take a long time to run when applied to a string that does not match.
162 Consider the pattern fragment
163
164 ^(a+)*
165
166 This can match "aaaa" in 16 different ways, and this number increases
167 very rapidly as the string gets longer. (The * repeat can match 0, 1,
168 2, 3, or 4 times, and for each of those cases other than 0 or 4, the +
169 repeats can match different numbers of times.) When the remainder of
170 the pattern is such that the entire match is going to fail, PCRE2 has
171 in principle to try every possible variation, and this can take an ex‐
172 tremely long time, even for relatively short strings.
173
174 An optimization catches some of the more simple cases such as
175
176 (a+)*b
177
178 where a literal character follows. Before embarking on the standard
179 matching procedure, PCRE2 checks that there is a "b" later in the sub‐
180 ject string, and if there is not, it fails the match immediately. How‐
181 ever, when there is no following literal this optimization cannot be
182 used. You can see the difference by comparing the behaviour of
183
184 (a+)*\d
185
186 with the pattern above. The former gives a failure almost instantly
187 when applied to a whole line of "a" characters, whereas the latter
188 takes an appreciable time with strings longer than about 20 characters.
189
190 In many cases, the solution to this kind of performance issue is to use
191 an atomic group or a possessive quantifier. This can often reduce mem‐
192 ory requirements as well. As another example, consider this pattern:
193
194 ([^<]|<(?!inet))+
195
196 It matches from wherever it starts until it encounters "<inet" or the
197 end of the data, and is the kind of pattern that might be used when
198 processing an XML file. Each iteration of the outer parentheses matches
199 either one character that is not "<" or a "<" that is not followed by
200 "inet". However, each time a parenthesis is processed, a backtracking
201 position is passed, so this formulation uses a memory frame for each
202 matched character. For a long string, a lot of memory is required. Con‐
203 sider now this rewritten pattern, which matches exactly the same
204 strings:
205
206 ([^<]++|<(?!inet))+
207
208 This runs much faster, because sequences of characters that do not con‐
209 tain "<" are "swallowed" in one item inside the parentheses, and a pos‐
210 sessive quantifier is used to stop any backtracking into the runs of
211 non-"<" characters. This version also uses a lot less memory because
212 entry to a new set of parentheses happens only when a "<" character
213 that is not followed by "inet" is encountered (and we assume this is
214 relatively rare).
215
216 This example shows that one way of optimizing performance when matching
217 long subject strings is to write repeated parenthesized subpatterns to
218 match more than one character whenever possible.
219
220 SETTING RESOURCE LIMITS
221
222 You can set limits on the amount of processing that takes place when
223 matching, and on the amount of heap memory that is used. The default
224 values of the limits are very large, and unlikely ever to operate. They
225 can be changed when PCRE2 is built, and they can also be set when
226 pcre2_match() or pcre2_dfa_match() is called. For details of these in‐
227 terfaces, see the pcre2build documentation and the section entitled
228 "The match context" in the pcre2api documentation.
229
230 The pcre2test test program has a modifier called "find_limits" which,
231 if applied to a subject line, causes it to find the smallest limits
232 that allow a pattern to match. This is done by repeatedly matching with
233 different limits.
234
236
237 Philip Hazel
238 Retired from University Computing Service
239 Cambridge, England.
240
242
243 Last updated: 27 July 2022
244 Copyright (c) 1997-2022 University of Cambridge.
245
246
247
248PCRE2 10.41 27 July 2022 PCRE2PERFORM(3)