pcre2perform(3)

1PCRE2PERFORM(3)            Library Functions Manual            PCRE2PERFORM(3)
2
3
4

NAME

6       PCRE2 - Perl-compatible regular expressions (revised API)
7

PCRE2 PERFORMANCE

9
10       Two  aspects  of performance are discussed below: memory usage and pro‐
11       cessing time. The way you express your pattern as a regular  expression
12       can affect both of them.
13

COMPILED PATTERN MEMORY USAGE

15
16       Patterns are compiled by PCRE2 into a reasonably efficient interpretive
17       code, so that most simple patterns do not use much memory  for  storing
18       the compiled version. However, there is one case where the memory usage
19       of a compiled pattern can be unexpectedly  large.  If  a  parenthesized
20       subpattern has a quantifier with a minimum greater than 1 and/or a lim‐
21       ited maximum, the whole subpattern is repeated in  the  compiled  code.
22       For example, the pattern
23
24         (abc|def){2,4}
25
26       is compiled as if it were
27
28         (abc|def)(abc|def)((abc|def)(abc|def)?)?
29
30       (Technical  aside:  It is done this way so that backtrack points within
31       each of the repetitions can be independently maintained.)
32
33       For regular expressions whose quantifiers use only small numbers,  this
34       is  not  usually a problem. However, if the numbers are large, and par‐
35       ticularly if such repetitions are nested, the memory usage  can  become
36       an embarrassment. For example, the very simple pattern
37
38         ((ab){1,1000}c){1,3}
39
40       uses  over  50KiB  when compiled using the 8-bit library. When PCRE2 is
41       compiled with its default internal pointer size of two bytes, the  size
42       limit on a compiled pattern is 65535 code units in the 8-bit and 16-bit
43       libraries, and this is reached with the above pattern if the outer rep‐
44       etition  is  increased from 3 to 4. PCRE2 can be compiled to use larger
45       internal pointers and thus handle larger compiled patterns, but  it  is
46       better to try to rewrite your pattern to use less memory if you can.
47
48       One  way  of reducing the memory usage for such patterns is to make use
49       of PCRE2's "subroutine" facility. Re-writing the above pattern as
50
51         ((ab)(?2){0,999}c)(?1){0,2}
52
53       reduces the memory requirements to around 16KiB, and indeed it  remains
54       under  20KiB  even with the outer repetition increased to 100. However,
55       this kind of pattern is not always exactly equivalent, because any cap‐
56       tures  within  subroutine calls are lost when the subroutine completes.
57       If this is not a problem, this kind of  rewriting  will  allow  you  to
58       process  patterns that PCRE2 cannot otherwise handle. The matching per‐
59       formance of the two different versions of the pattern are  roughly  the
60       same.  (This applies from release 10.30 - things were different in ear‐
61       lier releases.)
62

STACK AND HEAP USAGE AT RUN TIME

64
65       From release 10.30, the interpretive (non-JIT) version of pcre2_match()
66       uses  very  little system stack at run time. In earlier releases recur‐
67       sive function calls could use a great deal of  stack,  and  this  could
68       cause  problems, but this usage has been eliminated. Backtracking posi‐
69       tions are now explicitly remembered in memory frames controlled by  the
70       code.  An  initial  20KiB  vector  of frames is allocated on the system
71       stack (enough for about 100 frames for small patterns), but if this  is
72       insufficient,  heap  memory  is  used. The amount of heap memory can be
73       limited; if the limit is set to zero, only the initial stack vector  is
74       used.  Rewriting patterns to be time-efficient, as described below, may
75       also reduce the memory requirements.
76
77       In contrast to  pcre2_match(),  pcre2_dfa_match()  does  use  recursive
78       function  calls,  but  only  for  processing  atomic groups, lookaround
79       assertions, and recursion within the pattern. The original  version  of
80       the code used to allocate quite large internal workspace vectors on the
81       stack, which caused some problems for  some  patterns  in  environments
82       with  small  stacks.  From release 10.32 the code for pcre2_dfa_match()
83       has been re-factored to use heap memory  when  necessary  for  internal
84       workspace  when  recursing,  though  recursive function calls are still
85       used.
86
87       The "match depth" parameter can be used to limit the depth of  function
88       recursion,  and  the  "match  heap"  parameter  to limit heap memory in
89       pcre2_dfa_match().
90

PROCESSING TIME

92
93       Certain items in regular expression patterns are processed  more  effi‐
94       ciently than others. It is more efficient to use a character class like
95       [aeiou]  than  a  set  of   single-character   alternatives   such   as
96       (a|e|i|o|u).  In  general,  the simplest construction that provides the
97       required behaviour is usually the most efficient. Jeffrey Friedl's book
98       contains  a  lot  of useful general discussion about optimizing regular
99       expressions for efficient performance. This  document  contains  a  few
100       observations about PCRE2.
101
102       Using  Unicode  character  properties  (the  \p, \P, and \X escapes) is
103       slow, because PCRE2 has to use a multi-stage table lookup  whenever  it
104       needs  a  character's  property. If you can find an alternative pattern
105       that does not use character properties, it will probably be faster.
106
107       By default, the escape sequences \b, \d, \s,  and  \w,  and  the  POSIX
108       character  classes  such  as  [:alpha:]  do not use Unicode properties,
109       partly for backwards compatibility, and partly for performance reasons.
110       However,  you  can  set  the PCRE2_UCP option or start the pattern with
111       (*UCP) if you want Unicode character properties to be  used.  This  can
112       double  the  matching  time  for  items  such  as \d, when matched with
113       pcre2_match(); the performance loss is less with a DFA  matching  func‐
114       tion, and in both cases there is not much difference for \b.
115
116       When  a pattern begins with .* not in atomic parentheses, nor in paren‐
117       theses that are the subject of a backreference,  and  the  PCRE2_DOTALL
118       option  is  set,  the pattern is implicitly anchored by PCRE2, since it
119       can match only at the start of a subject string.  If  the  pattern  has
120       multiple top-level branches, they must all be anchorable. The optimiza‐
121       tion can be disabled by  the  PCRE2_NO_DOTSTAR_ANCHOR  option,  and  is
122       automatically disabled if the pattern contains (*PRUNE) or (*SKIP).
123
124       If  PCRE2_DOTALL  is  not  set,  PCRE2  cannot  make this optimization,
125       because the dot metacharacter does not then match a newline, and if the
126       subject  string contains newlines, the pattern may match from the char‐
127       acter immediately following one of them instead of from the very start.
128       For example, the pattern
129
130         .*second
131
132       matches  the subject "first\nand second" (where \n stands for a newline
133       character), with the match starting at the seventh character. In  order
134       to  do  this, PCRE2 has to retry the match starting after every newline
135       in the subject.
136
137       If you are using such a pattern with subject strings that do  not  con‐
138       tain   newlines,   the   best   performance   is  obtained  by  setting
139       PCRE2_DOTALL, or starting the pattern with  ^.*  or  ^.*?  to  indicate
140       explicit anchoring. That saves PCRE2 from having to scan along the sub‐
141       ject looking for a newline to restart at.
142
143       Beware of patterns that contain nested indefinite  repeats.  These  can
144       take  a  long time to run when applied to a string that does not match.
145       Consider the pattern fragment
146
147         ^(a+)*
148
149       This can match "aaaa" in 16 different ways, and this  number  increases
150       very  rapidly  as the string gets longer. (The * repeat can match 0, 1,
151       2, 3, or 4 times, and for each of those cases other than 0 or 4, the  +
152       repeats  can  match  different numbers of times.) When the remainder of
153       the pattern is such that the entire match is going to fail,  PCRE2  has
154       in  principle  to  try  every  possible variation, and this can take an
155       extremely long time, even for relatively short strings.
156
157       An optimization catches some of the more simple cases such as
158
159         (a+)*b
160
161       where a literal character follows. Before  embarking  on  the  standard
162       matching  procedure, PCRE2 checks that there is a "b" later in the sub‐
163       ject string, and if there is not, it fails the match immediately.  How‐
164       ever,  when  there  is no following literal this optimization cannot be
165       used. You can see the difference by comparing the behaviour of
166
167         (a+)*\d
168
169       with the pattern above. The former gives  a  failure  almost  instantly
170       when  applied  to  a  whole  line of "a" characters, whereas the latter
171       takes an appreciable time with strings longer than about 20 characters.
172
173       In many cases, the solution to this kind of performance issue is to use
174       an  atomic group or a possessive quantifier. This can often reduce mem‐
175       ory requirements as well. As another example, consider this pattern:
176
177         ([^<]|<(?!inet))+
178
179       It matches from wherever it starts until it encounters "<inet"  or  the
180       end  of  the  data,  and is the kind of pattern that might be used when
181       processing an XML file. Each iteration of the outer parentheses matches
182       either  one  character that is not "<" or a "<" that is not followed by
183       "inet". However, each time a parenthesis is processed,  a  backtracking
184       position  is  passed,  so this formulation uses a memory frame for each
185       matched character. For a long string, a lot of memory is required. Con‐
186       sider  now  this  rewritten  pattern,  which  matches  exactly the same
187       strings:
188
189         ([^<]++|<(?!inet))+
190
191       This runs much faster, because sequences of characters that do not con‐
192       tain "<" are "swallowed" in one item inside the parentheses, and a pos‐
193       sessive quantifier is used to stop any backtracking into  the  runs  of
194       non-"<"  characters.  This  version also uses a lot less memory because
195       entry to a new set of parentheses happens only  when  a  "<"  character
196       that  is  not  followed by "inet" is encountered (and we assume this is
197       relatively rare).
198
199       This example shows that one way of optimizing performance when matching
200       long  subject strings is to write repeated parenthesized subpatterns to
201       match more than one character whenever possible.
202
203   SETTING RESOURCE LIMITS
204
205       You can set limits on the amount of processing that  takes  place  when
206       matching,  and  on  the amount of heap memory that is used. The default
207       values of the limits are very large, and unlikely ever to operate. They
208       can  be  changed  when  PCRE2  is  built, and they can also be set when
209       pcre2_match() or pcre2_dfa_match() is  called.  For  details  of  these
210       interfaces,  see  the pcre2build documentation and the section entitled
211       "The match context" in the pcre2api documentation.
212
213       The pcre2test test program has a modifier called  "find_limits"  which,
214       if  applied  to  a  subject line, causes it to find the smallest limits
215       that allow a pattern to match. This is done by repeatedly matching with
216       different limits.
217

AUTHOR

219
220       Philip Hazel
221       University Computing Service
222       Cambridge, England.
223

REVISION

225
226       Last updated: 25 April 2018
227       Copyright (c) 1997-2018 University of Cambridge.
228
229
230
231PCRE2 10.32                      25 April 2018                 PCRE2PERFORM(3)