1PCRE2PERFORM(3)            Library Functions Manual            PCRE2PERFORM(3)
2
3
4

NAME

6       PCRE2 - Perl-compatible regular expressions (revised API)
7

PCRE2 PERFORMANCE

9
10       Two  aspects  of performance are discussed below: memory usage and pro‐
11       cessing time. The way you express your pattern as a regular  expression
12       can affect both of them.
13

COMPILED PATTERN MEMORY USAGE

15
16       Patterns are compiled by PCRE2 into a reasonably efficient interpretive
17       code, so that most simple patterns do not use much memory  for  storing
18       the compiled version. However, there is one case where the memory usage
19       of a compiled pattern can be unexpectedly  large.  If  a  parenthesized
20       group  has  a quantifier with a minimum greater than 1 and/or a limited
21       maximum, the whole group is repeated in the compiled code. For example,
22       the pattern
23
24         (abc|def){2,4}
25
26       is compiled as if it were
27
28         (abc|def)(abc|def)((abc|def)(abc|def)?)?
29
30       (Technical  aside:  It is done this way so that backtrack points within
31       each of the repetitions can be independently maintained.)
32
33       For regular expressions whose quantifiers use only small numbers,  this
34       is  not  usually a problem. However, if the numbers are large, and par‐
35       ticularly if such repetitions are nested, the memory usage  can  become
36       an embarrassment. For example, the very simple pattern
37
38         ((ab){1,1000}c){1,3}
39
40       uses  over  50KiB  when compiled using the 8-bit library. When PCRE2 is
41       compiled with its default internal pointer size of two bytes, the  size
42       limit on a compiled pattern is 65535 code units in the 8-bit and 16-bit
43       libraries, and this is reached with the above pattern if the outer rep‐
44       etition  is  increased from 3 to 4. PCRE2 can be compiled to use larger
45       internal pointers and thus handle larger compiled patterns, but  it  is
46       better to try to rewrite your pattern to use less memory if you can.
47
48       One  way  of reducing the memory usage for such patterns is to make use
49       of PCRE2's "subroutine" facility. Re-writing the above pattern as
50
51         ((ab)(?2){0,999}c)(?1){0,2}
52
53       reduces the memory requirements to around 16KiB, and indeed it  remains
54       under  20KiB  even with the outer repetition increased to 100. However,
55       this kind of pattern is not always exactly equivalent, because any cap‐
56       tures  within  subroutine calls are lost when the subroutine completes.
57       If this is not a problem, this kind of  rewriting  will  allow  you  to
58       process  patterns that PCRE2 cannot otherwise handle. The matching per‐
59       formance of the two different versions of the pattern are  roughly  the
60       same.  (This applies from release 10.30 - things were different in ear‐
61       lier releases.)
62

STACK AND HEAP USAGE AT RUN TIME

64
65       From release 10.30, the interpretive (non-JIT) version of pcre2_match()
66       uses  very  little system stack at run time. In earlier releases recur‐
67       sive function calls could use a great deal of  stack,  and  this  could
68       cause  problems, but this usage has been eliminated. Backtracking posi‐
69       tions are now explicitly remembered in memory frames controlled by  the
70       code.
71
72       The size of each frame depends on the size of pointer variables and the
73       number of capturing parenthesized groups in the pattern being  matched.
74       On a 64-bit system the frame size for a pattern with no captures is 128
75       bytes. For each capturing group the size increases by 16 bytes.
76
77       Until release 10.41, an initial 20KiB frames vector  was  allocated  on
78       the  system  stack,  but this still caused some issues for multi-thread
79       applications where each thread has a very  small  stack.  From  release
80       10.41  backtracking  memory  frames  are always held in heap memory. An
81       initial heap allocation is obtained the first time any match data block
82       is  passed  to  pcre2_match().  This  is remembered with the match data
83       block and re-used if that block is used for another match. It is  freed
84       when the match data block itself is freed.
85
86       The  size  of the initial block is the larger of 20KiB or ten times the
87       pattern's frame size, unless the heap limit is less than this, in which
88       case  the  heap  limit  is  used. If the initial block proves to be too
89       small during matching, it is replaced by a larger block, subject to the
90       heap  limit.  The  heap limit is checked only when a new block is to be
91       allocated. Reducing the heap limit between calls to pcre2_match()  with
92       the same match data block does not affect the saved block.
93
94       In  contrast  to  pcre2_match(),  pcre2_dfa_match()  does use recursive
95       function calls, but only for processing atomic groups,  lookaround  as‐
96       sertions, and recursion within the pattern. The original version of the
97       code used to allocate quite large internal  workspace  vectors  on  the
98       stack,  which  caused  some  problems for some patterns in environments
99       with small stacks. From release 10.32 the  code  for  pcre2_dfa_match()
100       has  been  re-factored  to  use heap memory when necessary for internal
101       workspace when recursing, though recursive  function  calls  are  still
102       used.
103
104       The  "match depth" parameter can be used to limit the depth of function
105       recursion, and the "match heap"  parameter  to  limit  heap  memory  in
106       pcre2_dfa_match().
107

PROCESSING TIME

109
110       Certain  items  in regular expression patterns are processed more effi‐
111       ciently than others. It is more efficient to use a character class like
112       [aeiou]   than   a   set   of  single-character  alternatives  such  as
113       (a|e|i|o|u). In general, the simplest construction  that  provides  the
114       required behaviour is usually the most efficient. Jeffrey Friedl's book
115       contains a lot of useful general discussion  about  optimizing  regular
116       expressions for efficient performance. This document contains a few ob‐
117       servations about PCRE2.
118
119       Using Unicode character properties (the \p,  \P,  and  \X  escapes)  is
120       slow,  because  PCRE2 has to use a multi-stage table lookup whenever it
121       needs a character's property. If you can find  an  alternative  pattern
122       that does not use character properties, it will probably be faster.
123
124       By  default,  the  escape  sequences  \b, \d, \s, and \w, and the POSIX
125       character classes such as [:alpha:]  do  not  use  Unicode  properties,
126       partly for backwards compatibility, and partly for performance reasons.
127       However, you can set the PCRE2_UCP option or  start  the  pattern  with
128       (*UCP)  if  you  want Unicode character properties to be used. This can
129       double the matching time for  items  such  as  \d,  when  matched  with
130       pcre2_match();  the  performance loss is less with a DFA matching func‐
131       tion, and in both cases there is not much difference for \b.
132
133       When a pattern begins with .* not in atomic parentheses, nor in  paren‐
134       theses  that  are  the subject of a backreference, and the PCRE2_DOTALL
135       option is set, the pattern is implicitly anchored by  PCRE2,  since  it
136       can  match  only  at  the start of a subject string. If the pattern has
137       multiple top-level branches, they must all be anchorable. The optimiza‐
138       tion  can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is au‐
139       tomatically disabled if the pattern contains (*PRUNE) or (*SKIP).
140
141       If PCRE2_DOTALL is not set, PCRE2 cannot make  this  optimization,  be‐
142       cause  the  dot metacharacter does not then match a newline, and if the
143       subject string contains newlines, the pattern may match from the  char‐
144       acter immediately following one of them instead of from the very start.
145       For example, the pattern
146
147         .*second
148
149       matches the subject "first\nand second" (where \n stands for a  newline
150       character),  with the match starting at the seventh character. In order
151       to do this, PCRE2 has to retry the match starting after  every  newline
152       in the subject.
153
154       If  you  are using such a pattern with subject strings that do not con‐
155       tain  newlines,  the  best   performance   is   obtained   by   setting
156       PCRE2_DOTALL,  or starting the pattern with ^.* or ^.*? to indicate ex‐
157       plicit anchoring. That saves PCRE2 from having to scan along  the  sub‐
158       ject looking for a newline to restart at.
159
160       Beware  of  patterns  that contain nested indefinite repeats. These can
161       take a long time to run when applied to a string that does  not  match.
162       Consider the pattern fragment
163
164         ^(a+)*
165
166       This  can  match "aaaa" in 16 different ways, and this number increases
167       very rapidly as the string gets longer. (The * repeat can match  0,  1,
168       2,  3, or 4 times, and for each of those cases other than 0 or 4, the +
169       repeats can match different numbers of times.) When  the  remainder  of
170       the  pattern  is such that the entire match is going to fail, PCRE2 has
171       in principle to try every possible variation, and this can take an  ex‐
172       tremely long time, even for relatively short strings.
173
174       An optimization catches some of the more simple cases such as
175
176         (a+)*b
177
178       where  a  literal  character  follows. Before embarking on the standard
179       matching procedure, PCRE2 checks that there is a "b" later in the  sub‐
180       ject  string, and if there is not, it fails the match immediately. How‐
181       ever, when there is no following literal this  optimization  cannot  be
182       used. You can see the difference by comparing the behaviour of
183
184         (a+)*\d
185
186       with  the  pattern  above.  The former gives a failure almost instantly
187       when applied to a whole line of  "a"  characters,  whereas  the  latter
188       takes an appreciable time with strings longer than about 20 characters.
189
190       In many cases, the solution to this kind of performance issue is to use
191       an atomic group or a possessive quantifier. This can often reduce  mem‐
192       ory requirements as well. As another example, consider this pattern:
193
194         ([^<]|<(?!inet))+
195
196       It  matches  from wherever it starts until it encounters "<inet" or the
197       end of the data, and is the kind of pattern that  might  be  used  when
198       processing an XML file. Each iteration of the outer parentheses matches
199       either one character that is not "<" or a "<" that is not  followed  by
200       "inet".  However,  each time a parenthesis is processed, a backtracking
201       position is passed, so this formulation uses a memory  frame  for  each
202       matched character. For a long string, a lot of memory is required. Con‐
203       sider now this  rewritten  pattern,  which  matches  exactly  the  same
204       strings:
205
206         ([^<]++|<(?!inet))+
207
208       This runs much faster, because sequences of characters that do not con‐
209       tain "<" are "swallowed" in one item inside the parentheses, and a pos‐
210       sessive  quantifier  is  used to stop any backtracking into the runs of
211       non-"<" characters. This version also uses a lot  less  memory  because
212       entry  to  a  new  set of parentheses happens only when a "<" character
213       that is not followed by "inet" is encountered (and we  assume  this  is
214       relatively rare).
215
216       This example shows that one way of optimizing performance when matching
217       long subject strings is to write repeated parenthesized subpatterns  to
218       match more than one character whenever possible.
219
220   SETTING RESOURCE LIMITS
221
222       You  can  set  limits on the amount of processing that takes place when
223       matching, and on the amount of heap memory that is  used.  The  default
224       values of the limits are very large, and unlikely ever to operate. They
225       can be changed when PCRE2 is built, and  they  can  also  be  set  when
226       pcre2_match()  or pcre2_dfa_match() is called. For details of these in‐
227       terfaces, see the pcre2build documentation  and  the  section  entitled
228       "The match context" in the pcre2api documentation.
229
230       The  pcre2test  test program has a modifier called "find_limits" which,
231       if applied to a subject line, causes it to  find  the  smallest  limits
232       that allow a pattern to match. This is done by repeatedly matching with
233       different limits.
234

AUTHOR

236
237       Philip Hazel
238       Retired from University Computing Service
239       Cambridge, England.
240

REVISION

242
243       Last updated: 27 July 2022
244       Copyright (c) 1997-2022 University of Cambridge.
245
246
247
248PCRE2 10.41                      27 July 2022                  PCRE2PERFORM(3)
Impressum