PCRE2PATTERN(3) Library Functions Manual PCRE2PATTERN(3)

NAME
PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 REGULAR EXPRESSION DETAILS

The syntax and semantics of the regular expressions that are supported
by PCRE2 are described in detail below. There is a quick-reference syn‐
tax summary in the pcre2syntax page. PCRE2 tries to match Perl syntax
and semantics as closely as it can. PCRE2 also supports some alterna‐
tive regular expression syntax (which does not conflict with the Perl
syntax) in order to provide some compatibility with regular expressions
in Python, .NET, and Oniguruma.

Perl's regular expressions are described in its own documentation, and
regular expressions in general are covered in a number of books, some
of which have copious examples. Jeffrey Friedl's "Mastering Regular
Expressions", published by O'Reilly, covers regular expressions in
great detail. This description of PCRE2's regular expressions is
intended as reference material.

This document discusses the patterns that are supported by PCRE2 when
its main matching function, pcre2_match(), is used. PCRE2 also has an
alternative matching function, pcre2_dfa_match(), which matches using a
different algorithm that is not Perl-compatible. Some of the features
discussed below are not available when DFA matching is used. The advan‐
tages and disadvantages of the alternative function, and how it differs
from the normal function, are discussed in the pcre2matching page.

SPECIAL START-OF-PATTERN ITEMS

A number of options that can be passed to pcre2_compile() can also be
set by special items at the start of a pattern. These are not Perl-com‐
patible, but are provided to make these options accessible to pattern
writers who are not able to change the program that processes the pat‐
tern. Any number of these items may appear, but they must all be
together right at the start of the pattern string, and the letters must
be in upper case.

UTF support

In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32
can be specified for the 32-bit library, in which case it constrains
the character values to valid Unicode code points. To process UTF
strings, PCRE2 must be built to include Unicode support (which is the
default). When using UTF strings you must either call the compiling
function with the PCRE2_UTF option, or the pattern must start with the
special sequence (*UTF), which is equivalent to setting the relevant
option. How setting a UTF mode affects pattern matching is mentioned in
several places below. There is also a summary of features in the
pcre2unicode page.

Some applications that allow their users to supply patterns may wish to
restrict them to non-UTF data for security reasons. If the
PCRE2_NEVER_UTF option is passed to pcre2_compile(), (*UTF) is not
allowed, and its appearance in a pattern causes an error.

Unicode property support

Another special sequence that may appear at the start of a pattern is
(*UCP). This has the same effect as setting the PCRE2_UCP option: it
causes sequences such as \d and \w to use Unicode properties to deter‐
mine character types, instead of recognizing only characters with codes
less than 256 via a lookup table.

Some applications that allow their users to supply patterns may wish to
restrict them for security reasons. If the PCRE2_NEVER_UCP option is
passed to pcre2_compile(), (*UCP) is not allowed, and its appearance in
a pattern causes an error.

Locking out empty string matching

Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same
effect as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option
to whichever matching function is subsequently called to match the pat‐
tern. These options lock out the matching of empty strings, either
entirely, or only at the start of the subject.

Disabling auto-possessification

If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as
setting the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from making
quantifiers possessive when what follows cannot match the repeated
item. For example, by default a+b is treated as a++b. For more details,
see the pcre2api documentation.

Disabling start-up optimizations

If a pattern starts with (*NO_START_OPT), it has the same effect as
setting the PCRE2_NO_START_OPTIMIZE option. This disables several opti‐
mizations for quickly reaching "no match" results. For more details,
see the pcre2api documentation.

Disabling automatic anchoring

If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect
as setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimiza‐
tions that apply to patterns whose top-level branches all start with .*
(match any number of arbitrary characters). For more details, see the
pcre2api documentation.

Disabling JIT compilation

If a pattern that starts with (*NO_JIT) is successfully compiled, an
attempt by the application to apply the JIT optimization by calling
pcre2_jit_compile() is ignored.

Setting match resource limits

The pcre2_match() function contains a counter that is incremented every
time it goes round its main loop. The caller of pcre2_match() can set a
limit on this counter, which therefore limits the amount of computing
resource used for a match. The maximum depth of nested backtracking can
also be limited; this indirectly restricts the amount of heap memory
that is used, but there is also an explicit memory limit that can be
set.

These facilities are provided to catch runaway matches that are pro‐
voked by patterns with huge matching trees (a typical example is a pat‐
tern with nested unlimited repeats applied to a long string that does
not match). When one of these limits is reached, pcre2_match() gives an
error return. The limits can also be set by items at the start of the
pattern of the form

  (*LIMIT_HEAP=d)
  (*LIMIT_MATCH=d)
  (*LIMIT_DEPTH=d)

where d is any number of decimal digits. However, the value of the set‐
ting must be less than the value set (or defaulted) by the caller of
pcre2_match() for it to have any effect. In other words, the pattern
writer can lower the limits set by the programmer, but not raise them.
If there is more than one setting of one of these limits, the lower
value is used. The heap limit is specified in kibibytes (units of 1024
bytes).
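
As an illustration of the lowering rule, the following Python sketch
models how leading limit items combine with the caller's settings. It is
not PCRE2's own code: the function name effective_limits and the
dictionary keys are hypothetical, and for brevity it scans only limit
items that appear consecutively at the very start of the pattern.

```python
import re

# Hypothetical model of the rule described above; not part of PCRE2.
_LIMIT_ITEM = re.compile(r"\(\*LIMIT_(HEAP|MATCH|DEPTH|RECURSION)=(\d+)\)")


def effective_limits(pattern, caller_limits):
    """Return the limits in force after reading leading (*LIMIT_...=d) items.

    caller_limits maps "HEAP", "MATCH" and "DEPTH" to the values set (or
    defaulted) by the caller of pcre2_match().
    """
    limits = dict(caller_limits)
    pos = 0
    while True:
        m = _LIMIT_ITEM.match(pattern, pos)
        if m is None:
            return limits
        # LIMIT_RECURSION is the pre-10.30 name for LIMIT_DEPTH.
        name = "DEPTH" if m.group(1) == "RECURSION" else m.group(1)
        # A pattern item can lower a limit but never raise it, and with
        # repeated settings the lower value wins.
        limits[name] = min(limits[name], int(m.group(2)))
        pos = m.end()
```

With caller limits of 1000 for both the match and depth limits, a
pattern starting (*LIMIT_MATCH=500)(*LIMIT_DEPTH=5000) ends up with a
match limit of 500 and an unchanged depth limit of 1000.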

Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This
name is still recognized for backwards compatibility.

The heap limit applies only when the pcre2_match() or pcre2_dfa_match()
interpreters are used for matching. It does not apply to JIT. The match
limit is used (but in a different way) when JIT is being used, or when
pcre2_dfa_match() is called, to limit computing resource usage by those
matching functions. The depth limit is ignored by JIT but is relevant
for DFA matching, which uses function recursion for recursions within
the pattern and for lookaround assertions and atomic groups. In this
case, the depth limit controls the depth of such recursion.

Newline conventions

PCRE2 supports six different conventions for indicating line breaks in
strings: a single CR (carriage return) character, a single LF (line‐
feed) character, the two-character sequence CRLF, any of the three pre‐
ceding, any Unicode newline sequence, or the NUL character (binary
zero). The pcre2api page has further discussion about newlines, and
shows how to set the newline convention when calling pcre2_compile().

It is also possible to specify a newline convention by starting a pat‐
tern string with one of the following sequences:

  (*CR)        carriage return
  (*LF)        linefeed
  (*CRLF)      carriage return, followed by linefeed
  (*ANYCRLF)   any of the three above
  (*ANY)       all Unicode newline sequences
  (*NUL)       the NUL character (binary zero)

These override the default and the options given to the compiling func‐
tion. For example, on a Unix system where LF is the default newline
sequence, the pattern

  (*CR)a.b

changes the convention to CR. That pattern matches "a\nb" because LF is
no longer a newline. If more than one of these settings is present, the
last one is used.
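
The "last one is used" rule and the (*CR)a.b example can be modelled
with a short Python sketch. The names newline_convention and dot_matches
are hypothetical, the scan handles only newline items that appear
consecutively at the start of the pattern, and dot_matches covers only
the two single-character conventions CR and LF with PCRE2_DOTALL unset.

```python
import re

# Hypothetical model; only the six newline items listed above are recognized.
_NEWLINE_ITEM = re.compile(r"\(\*(CRLF|CR|LF|ANYCRLF|ANY|NUL)\)")

# The character that "." refuses to match under each single-character
# convention (other conventions are outside the scope of this sketch).
_DOT_EXCLUDES = {"CR": "\r", "LF": "\n"}


def newline_convention(pattern, default="LF"):
    """Return (convention, remaining pattern); the last leading item wins."""
    convention = default
    pos = 0
    while True:
        m = _NEWLINE_ITEM.match(pattern, pos)
        if m is None:
            return convention, pattern[pos:]
        convention = m.group(1)
        pos = m.end()


def dot_matches(ch, convention):
    """Model "." for the CR and LF conventions when PCRE2_DOTALL is not set."""
    return ch != _DOT_EXCLUDES[convention]
```

Under this model, newline_convention("(*CR)a.b") yields the CR
convention, and dot_matches("\n", "CR") is true, which is why the
pattern matches "a\nb".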

The newline convention affects where the circumflex and dollar asser‐
tions are true. It also affects the interpretation of the dot metachar‐
acter when PCRE2_DOTALL is not set, and the behaviour of \N when not
followed by an opening brace. However, it does not affect what the \R
escape sequence matches. By default, this is any Unicode newline
sequence, for Perl compatibility. However, this can be changed; see the
next section and the description of \R in the section entitled "Newline
sequences" below. A change of \R setting can be combined with a change
of newline convention.

Specifying what \R matches

It is possible to restrict \R to match only CR, LF, or CRLF (instead of
the complete set of Unicode line endings) by setting the option
PCRE2_BSR_ANYCRLF at compile time. This effect can also be achieved by
starting a pattern with (*BSR_ANYCRLF). For completeness, (*BSR_UNI‐
CODE) is also recognized, corresponding to PCRE2_BSR_UNICODE.

EBCDIC CHARACTER CODES

PCRE2 can be compiled to run in an environment that uses EBCDIC as its
character code instead of ASCII or Unicode (typically a mainframe sys‐
tem). In the sections below, character code values are ASCII or Uni‐
code; in an EBCDIC environment these characters may have different code
values, and there are no code points greater than 255.

CHARACTERS AND METACHARACTERS

A regular expression is a pattern that is matched against a subject
string from left to right. Most characters stand for themselves in a
pattern, and match the corresponding characters in the subject. As a
trivial example, the pattern

  The quick brown fox

matches a portion of a subject string that is identical to itself. When
caseless matching is specified (the PCRE2_CASELESS option), letters are
matched independently of case.

The power of regular expressions comes from the ability to include
alternatives and repetitions in the pattern. These are encoded in the
pattern by the use of metacharacters, which do not stand for themselves
but instead are interpreted in some special way.

There are two different sets of metacharacters: those that are recog‐
nized anywhere in the pattern except within square brackets, and those
that are recognized within square brackets. Outside square brackets,
the metacharacters are as follows:

  \      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
         also "possessive quantifier"
  {      start min/max quantifier

Part of a pattern that is in square brackets is called a "character
class". In a character class the only metacharacters are:

  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (only if followed by POSIX
           syntax)
  ]      terminates the character class

The following sections describe the use of each of the metacharacters.

BACKSLASH

The backslash character has several uses. Firstly, if it is followed by
a character that is not a number or a letter, it takes away any special
meaning that character may have. This use of backslash as an escape
character applies both inside and outside character classes.

For example, if you want to match a * character, you must write \* in
the pattern. This escaping action applies whether or not the following
character would otherwise be interpreted as a metacharacter, so it is
always safe to precede a non-alphanumeric with backslash to specify
that it stands for itself. In particular, if you want to match a back‐
slash, you write \\.
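
Because it is always safe to put a backslash before a non-alphanumeric,
an application can turn arbitrary text into a literal pattern
mechanically. The Python sketch below shows one way to do this; the
helper name quote_literal is hypothetical, and the sketch deliberately
leaves characters above 127 alone.

```python
def quote_literal(text):
    """Backslash-escape every ASCII character that is not a letter or digit.

    Illustrative helper only; it relies on the rule that a backslash
    followed by a non-alphanumeric always stands for that character.
    """
    out = []
    for ch in text:
        if ord(ch) < 128 and not ch.isalnum():
            out.append("\\")
        out.append(ch)
    return "".join(out)
```

For example, quote_literal("a*b") returns the pattern text a\*b, and a
single backslash becomes \\.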

In a UTF mode, only ASCII numbers and letters have any special meaning
after a backslash. All other characters (in particular, those whose
code points are greater than 127) are treated as literals.

If a pattern is compiled with the PCRE2_EXTENDED option, most white
space in the pattern (other than in a character class), and characters
between a # outside a character class and the next newline, inclusive,
are ignored. An escaping backslash can be used to include a white space
or # character as part of the pattern.

If you want to remove the special meaning from a sequence of charac‐
ters, you can do so by putting them between \Q and \E. This is differ‐
ent from Perl in that $ and @ are handled as literals in \Q...\E
sequences in PCRE2, whereas in Perl, $ and @ cause variable interpola‐
tion. Also, Perl does "double-quotish backslash interpolation" on any
backslashes between \Q and \E which, its documentation says, "may lead
to confusing results". PCRE2 treats a backslash between \Q and \E just
like any other character. Note the following examples:

  Pattern            PCRE2 matches   Perl matches

  \Qabc$xyz\E        abc$xyz         abc followed by the
                                       contents of $xyz
  \Qabc\$xyz\E       abc\$xyz        abc\$xyz
  \Qabc\E\$\Qxyz\E   abc$xyz         abc$xyz
  \QA\B\E            A\B             A\B
  \Q\\E              \               \\E

The \Q...\E sequence is recognized both inside and outside character
classes. An isolated \E that is not preceded by \Q is ignored. If \Q
is not followed by \E later in the pattern, the literal interpretation
continues to the end of the pattern (that is, \E is assumed at the
end). If the isolated \Q is inside a character class, this causes an
error, because the character class is not terminated by a closing
square bracket.
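
The PCRE2 column of the table can be reproduced with a small Python
model of the \Q...\E rules. This is a sketch, not PCRE2 code: the name
literal_text is hypothetical, it assumes the unquoted parts of the
pattern contain no metacharacters, and it rejects any escape other than
a backslash followed by a non-alphanumeric.

```python
def literal_text(pattern):
    # Return the literal string matched by a pattern made only of \Q...\E
    # runs, plain characters, and backslash-escaped punctuation.
    out = []
    i = 0
    quoting = False
    while i < len(pattern):
        two = pattern[i:i + 2]
        if quoting:
            if two == "\\E":
                quoting = False
                i += 2
            else:
                # Between \Q and \E a backslash is an ordinary character.
                out.append(pattern[i])
                i += 1
        elif two == "\\Q":
            quoting = True
            i += 2
        elif two == "\\E":
            i += 2  # an isolated \E is ignored
        elif pattern[i] == "\\":
            if len(two) < 2 or two[1].isalnum():
                raise ValueError("escape not handled by this sketch")
            out.append(two[1])
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    # Reaching the end while still quoting behaves as if \E were assumed.
    return "".join(out)
```

Each row of the table can be checked against this function; for
instance, the last row, \Q\\E, yields a single backslash.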

Non-printing characters

A second use of backslash provides a way of encoding non-printing char‐
acters in patterns in a visible manner. There is no restriction on the
appearance of non-printing characters in a pattern, but when a pattern
is being prepared by text editing, it is often easier to use one of the
following escape sequences than the binary character it represents. In
an ASCII or Unicode environment, these escapes are as follows:

  \a            alarm, that is, the BEL character (hex 07)
  \cx           "control-x", where x is any printable ASCII character
  \e            escape (hex 1B)
  \f            form feed (hex 0C)
  \n            linefeed (hex 0A)
  \r            carriage return (hex 0D)
  \t            tab (hex 09)
  \0dd          character with octal code 0dd
  \ddd          character with octal code ddd, or backreference
  \o{ddd..}     character with octal code ddd..
  \xhh          character with hex code hh
  \x{hhh..}     character with hex code hhh..
  \N{U+hhh..}   character with Unicode hex code point hhh..
  \uhhhh        character with hex code hhhh (when PCRE2_ALT_BSUX is set)

The \N{U+hhh..} escape sequence is recognized only when the PCRE2_UTF
option is set, that is, when PCRE2 is operating in a Unicode mode. Perl
also uses \N{name} to specify characters by Unicode name; PCRE2 does
not support this. Note that when \N is not followed by an opening
brace (curly bracket) it has an entirely different meaning, matching
any character that is not a newline.

The precise effect of \cx on ASCII characters is as follows: if x is a
lower case letter, it is converted to upper case. Then bit 6 of the
character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
(A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
hex 7B (; is 3B). If the code unit following \c has a value less than
32 or greater than 126, a compile-time error occurs.
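
The \cx arithmetic for an ASCII environment can be written out as a few
lines of Python. This is an illustrative sketch of the rule just
described; the function name control_escape is hypothetical.

```python
def control_escape(x):
    """Return the character code that \\cx produces in an ASCII environment."""
    code = ord(x)
    if code < 32 or code > 126:
        raise ValueError("\\c must be followed by a printable ASCII character")
    if "a" <= x <= "z":
        code = ord(x.upper())  # lower case letters are converted first
    return code ^ 0x40         # invert bit 6 (hex 40)
```

This gives hex 01 for both \cA and \ca, hex 3B for \c{, hex 7B for \c;,
and hex 7F (DEL) for \c?.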

When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported.
\a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values.
The \c escape is processed as specified for Perl in the perlebcdic doc‐
ument. The only characters that are allowed after \c are A-Z, a-z, or
one of @, [, \, ], ^, _, or ?. Any other character provokes a compile-
time error. The sequence \c@ encodes character code 0; after \c the
letters (in either case) encode characters 1-26 (hex 01 to hex 1A); [,
\, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c?
becomes either 255 (hex FF) or 95 (hex 5F).

Thus, apart from \c?, these escapes generate the same character code
values as they do in an ASCII environment, though the meanings of the
values mostly differ. For example, \cG always generates code value 7,
which is BEL in ASCII but DEL in EBCDIC.

The sequence \c? generates DEL (127, hex 7F) in an ASCII environment,
but because 127 is not a control character in EBCDIC, Perl makes it
generate the APC character. Unfortunately, there are several variants
of EBCDIC. In most of them the APC character has the value 255 (hex
FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If
certain other characters have POSIX-BC values, PCRE2 makes \c? generate
95; otherwise it generates 255.

After \0 up to two further octal digits are read. If there are fewer
than two digits, just those that are present are used. Thus the
sequence \0\x\015 specifies two binary zeros followed by a CR character
(code value 13). Make sure you supply two digits after the initial zero
if the pattern character that follows is itself an octal digit.

The escape \o must be followed by a sequence of octal digits, enclosed
in braces. An error occurs if this is not the case. This escape is a
recent addition to Perl; it provides a way of specifying character code
points as octal numbers greater than 0777, and it also allows octal
numbers and backreferences to be unambiguously specified.

For greater clarity and unambiguity, it is best to avoid following \ by
a digit greater than zero. Instead, use \o{} or \x{} to specify numeri‐
cal character code points, and \g{} to specify backreferences. The fol‐
lowing paragraphs describe the old, ambiguous syntax.
387 The handling of a backslash followed by a digit other than 0 is compli‐
388 cated, and Perl has changed over time, causing PCRE2 also to change.
389
390 Outside a character class, PCRE2 reads the digit and any following dig‐
391 its as a decimal number. If the number is less than 10, begins with the
392 digit 8 or 9, or if there are at least that many previous capturing
393 left parentheses in the expression, the entire sequence is taken as a
394 backreference. A description of how this works is given later, follow‐
395 ing the discussion of parenthesized subpatterns. Otherwise, up to
396 three octal digits are read to form a character code.
397
398 Inside a character class, PCRE2 handles \8 and \9 as the literal char‐
399 acters "8" and "9", and otherwise reads up to three octal digits fol‐
400 lowing the backslash, using them to generate a data character. Any sub‐
401 sequent digits stand for themselves. For example, outside a character
402 class:
403
404 \040 is another way of writing an ASCII space
405 \40 is the same, provided there are fewer than 40
406 previous capturing subpatterns
407 \7 is always a backreference
408 \11 might be a backreference, or another way of
409 writing a tab
410 \011 is always a tab
411 \0113 is a tab followed by the character "3"
412 \113 might be a backreference, otherwise the
413 character with octal code 113
414 \377 might be a backreference, otherwise
415 the value 255 (decimal)
416 \81 is always a backreference
417
418 Note that octal values of 100 or greater that are specified using this
419 syntax must not be introduced by a leading zero, because no more than
420 three octal digits are ever read.
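
The decision procedure for a backslash followed by a non-zero digit,
outside a character class, can be summarized in a short Python sketch.
The function name classify_digit_escape and its return convention are
hypothetical; the digits argument is the whole run of decimal digits
that follows the backslash, and group_count is the number of previous
capturing left parentheses. The \0 forms in the table follow a separate
rule and are not covered, and the sketch does not check the resulting
code against the limits of the current mode.

```python
OCTAL_DIGITS = "01234567"


def classify_digit_escape(digits, group_count):
    """Model the old, ambiguous \\ddd syntax outside a character class.

    Returns (kind, value, rest): kind is "backreference" or "character",
    value is the group number or character code, and rest holds any
    trailing digits that stand for themselves.
    """
    number = int(digits)
    if number < 10 or digits[0] in "89" or number <= group_count:
        return ("backreference", number, "")
    # Otherwise read up to three octal digits as a character code.
    n = 0
    while n < 3 and n < len(digits) and digits[n] in OCTAL_DIGITS:
        n += 1
    return ("character", int(digits[:n], 8), digits[n:])
```

With no capturing groups, classify_digit_escape("11", 0) reports a
character with code 9 (a tab), whereas with eleven groups the same
input is a backreference to group 11.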

By default, after \x that is not followed by {, from zero to two hexa‐
decimal digits are read (letters can be in upper or lower case). Any
number of hexadecimal digits may appear between \x{ and }. If a charac‐
ter other than a hexadecimal digit appears between \x{ and }, or if
there is no terminating }, an error occurs.

If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as
just described only when it is followed by two hexadecimal digits. Oth‐
erwise, it matches a literal "x" character. In this mode, support for
code points greater than 255 is provided by \u, which must be followed
by four hexadecimal digits; otherwise it matches a literal "u" charac‐
ter.

Characters whose value is less than 256 can be defined by either of the
two syntaxes for \x (or by \u in PCRE2_ALT_BSUX mode). There is no dif‐
ference in the way they are handled. For example, \xdc is exactly the
same as \x{dc} (or \u00dc in PCRE2_ALT_BSUX mode).
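
The two interpretations of \x can be compared side by side in a Python
sketch. The function name parse_x_escape and its return convention are
hypothetical. The sketch also rejects empty braces, which is an
assumption that the text above does not state, and in PCRE2_ALT_BSUX
mode it models only the two-digit form.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def parse_x_escape(rest, alt_bsux=False):
    """Interpret the text that follows "\\x" in a pattern.

    Returns (code, consumed), where consumed counts characters of rest.
    In PCRE2_ALT_BSUX mode, (ord("x"), 0) stands for a literal "x".
    """
    if alt_bsux:
        if len(rest) >= 2 and rest[0] in HEX_DIGITS and rest[1] in HEX_DIGITS:
            return (int(rest[:2], 16), 2)
        return (ord("x"), 0)
    if rest.startswith("{"):
        end = rest.find("}")
        body = rest[1:end] if end != -1 else ""
        # Rejecting an empty body is an assumption made by this sketch.
        if end == -1 or not body or any(c not in HEX_DIGITS for c in body):
            raise ValueError("malformed \\x{...}")
        return (int(body, 16), end + 1)
    # Default mode without braces: from zero to two hexadecimal digits.
    n = 0
    while n < 2 and n < len(rest) and rest[n] in HEX_DIGITS:
        n += 1
    return (int(rest[:n], 16) if n else 0, n)
```

Both parse_x_escape("dc") and parse_x_escape("{dc}") produce code point
0xDC. With alt_bsux=True, "4z" is not two hexadecimal digits, so the
result is a literal "x".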

Constraints on character values

Characters that are specified using octal or hexadecimal numbers are
limited to certain values, as follows:

  8-bit non-UTF mode    no greater than 0xff
  16-bit non-UTF mode   no greater than 0xffff
  32-bit non-UTF mode   no greater than 0xffffffff
  All UTF modes         no greater than 0x10ffff and a valid code point

Invalid Unicode code points are all those in the range 0xd800 to 0xdfff
(the so-called "surrogate" code points). The check for these can be
disabled by the caller of pcre2_compile() by setting the option
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only in
UTF-8 and UTF-32 modes, because these values are not representable in
UTF-16.
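
These constraints amount to a range check plus a surrogate check, as the
following Python sketch shows. The function names and the
allow_surrogates parameter (standing in for
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES) are hypothetical.

```python
def max_escape_value(width, utf):
    """Largest value an octal or hexadecimal escape may specify."""
    if utf:
        return 0x10FFFF
    return {8: 0xFF, 16: 0xFFFF, 32: 0xFFFFFFFF}[width]


def escape_value_allowed(value, width, utf, allow_surrogates=False):
    """Model the table above for a code unit width of 8, 16 or 32."""
    if value > max_escape_value(width, utf):
        return False
    if utf and 0xD800 <= value <= 0xDFFF:
        # The surrogate check can be disabled only in UTF-8 and UTF-32.
        return allow_surrogates and width in (8, 32)
    return True
```

For example, 0x100 is rejected in 8-bit non-UTF mode but accepted in
UTF-8 mode, and 0xD800 is rejected in every UTF mode unless the
surrogate check has been disabled.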

Escape sequences in character classes

All the sequences that define a single character value can be used both
inside and outside character classes. In addition, inside a character
class, \b is interpreted as the backspace character (hex 08).

When not followed by an opening brace, \N is not allowed in a character
class. \B, \R, and \X are not special inside a character class. Like
other unrecognized alphabetic escape sequences, they cause an error.
Outside a character class, these sequences have different meanings.

Unsupported escape sequences

In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its
string handler and used to modify the case of following characters. By
default, PCRE2 does not support these escape sequences. However, if the
PCRE2_ALT_BSUX option is set, \U matches a "U" character, and \u can be
used to define a character by code point, as described above.

Absolute and relative backreferences

The sequence \g followed by a signed or unsigned number, optionally
enclosed in braces, is an absolute or relative backreference. A named
backreference can be coded as \g{name}. Backreferences are discussed
later, following the discussion of parenthesized subpatterns.

Absolute and relative subroutine calls

For compatibility with Oniguruma, the non-Perl syntax \g followed by a
name or a number enclosed either in angle brackets or single quotes, is
an alternative syntax for referencing a subpattern as a "subroutine".
Details are discussed later. Note that \g{...} (Perl syntax) and
\g<...> (Oniguruma syntax) are not synonymous. The former is a backref‐
erence; the latter is a subroutine call.

Generic character types

Another use of backslash is for specifying generic character types:

  \d     any decimal digit
  \D     any character that is not a decimal digit
  \h     any horizontal white space character
  \H     any character that is not a horizontal white space character
  \N     any character that is not a newline
  \s     any white space character
  \S     any character that is not a white space character
  \v     any vertical white space character
  \V     any character that is not a vertical white space character
  \w     any "word" character
  \W     any "non-word" character

The \N escape sequence has the same meaning as the "." metacharacter
when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change
the meaning of \N. Note that when \N is followed by an opening brace it
has a different meaning. See the section entitled "Non-printing charac‐
ters" above for details. Perl also uses \N{name} to specify characters
by Unicode name; PCRE2 does not support this.

Each pair of lower and upper case escape sequences partitions the com‐
plete set of characters into two disjoint sets. Any given character
matches one, and only one, of each pair. The sequences can appear both
inside and outside character classes. They each match one character of
the appropriate type. If the current matching point is at the end of
the subject string, all of them fail, because there is no character to
match.

The default \s characters are HT (9), LF (10), VT (11), FF (12), CR
(13), and space (32), which are defined as white space in the "C"
locale. This list may vary if locale-specific matching is taking place.
For example, in some locales the "non-breaking space" character (\xA0)
is recognized as white space, and in others the VT character is not.

A "word" character is an underscore or any character that is a letter
or digit. By default, the definition of letters and digits is con‐
trolled by PCRE2's low-valued character tables, and may vary if locale-
specific matching is taking place (see "Locale support" in the pcre2api
page). For example, in a French locale such as "fr_FR" in Unix-like
systems, or "french" in Windows, some character codes greater than 127
are used for accented letters, and these are then matched by \w. The
use of locales with Unicode is discouraged.

By default, characters whose code points are greater than 127 never
match \d, \s, or \w, and always match \D, \S, and \W, although this may
be different for characters in the range 128-255 when locale-specific
matching is happening. These escape sequences retain their original
meanings from before Unicode support was available, mainly for effi‐
ciency reasons. If the PCRE2_UCP option is set, the behaviour is
changed so that Unicode properties are used to determine character
types, as follows:

  \d     any character that matches \p{Nd} (decimal digit)
  \s     any character that matches \p{Z} or \h or \v
  \w     any character that matches \p{L} or \p{N}, plus underscore

The upper case escapes match the inverse sets of characters. Note that
\d matches only decimal digits, whereas \w matches any Unicode digit,
as well as any Unicode letter, and underscore. Note also that PCRE2_UCP
affects \b and \B, because they are defined in terms of \w and \W.
Matching these sequences is noticeably slower when PCRE2_UCP is set.

The sequences \h, \H, \v, and \V, in contrast to the other sequences,
which match only ASCII characters by default, always match a specific
list of code points, whether or not PCRE2_UCP is set. The horizontal
space characters are:

  U+0009     Horizontal tab (HT)
  U+0020     Space
  U+00A0     Non-break space
  U+1680     Ogham space mark
  U+180E     Mongolian vowel separator
  U+2000     En quad
  U+2001     Em quad
  U+2002     En space
  U+2003     Em space
  U+2004     Three-per-em space
  U+2005     Four-per-em space
  U+2006     Six-per-em space
  U+2007     Figure space
  U+2008     Punctuation space
  U+2009     Thin space
  U+200A     Hair space
  U+202F     Narrow no-break space
  U+205F     Medium mathematical space
  U+3000     Ideographic space

The vertical space characters are:

  U+000A     Linefeed (LF)
  U+000B     Vertical tab (VT)
  U+000C     Form feed (FF)
  U+000D     Carriage return (CR)
  U+0085     Next line (NEL)
  U+2028     Line separator
  U+2029     Paragraph separator

In 8-bit, non-UTF-8 mode, only the characters with code points less
than 256 are relevant.
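
The PCRE2_UCP definitions of \d, \s, and \w, together with the two
fixed lists above, can be approximated in Python using the standard
unicodedata module. This is only a sketch: the function names are
hypothetical, and Python's Unicode tables may correspond to a different
Unicode version from the one a given PCRE2 build uses.

```python
import unicodedata

# The fixed \h and \v lists given above (U+2000 to U+200A are contiguous).
H_SPACE = {0x09, 0x20, 0xA0, 0x1680, 0x180E, *range(0x2000, 0x200B),
           0x202F, 0x205F, 0x3000}
V_SPACE = {0x0A, 0x0B, 0x0C, 0x0D, 0x85, 0x2028, 0x2029}


def ucp_digit(ch):
    """\\d under PCRE2_UCP: general category Nd."""
    return unicodedata.category(ch) == "Nd"


def ucp_space(ch):
    """\\s under PCRE2_UCP: any Z category, or a member of \\h or \\v."""
    return (unicodedata.category(ch).startswith("Z")
            or ord(ch) in H_SPACE or ord(ch) in V_SPACE)


def ucp_word(ch):
    """\\w under PCRE2_UCP: any L or N category, plus underscore."""
    return ch == "_" or unicodedata.category(ch)[0] in "LN"
```

The difference between \d and \w shows up with a character such as
U+00B2 (superscript two), whose general category is No: ucp_word accepts
it but ucp_digit does not.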

Newline sequences

Outside a character class, by default, the escape sequence \R matches
any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent
to the following:

  (?>\r\n|\n|\x0b|\f|\r|\x85)

This is an example of an "atomic group", details of which are given
below. This particular group matches either the two-character sequence
CR followed by LF, or one of the single characters LF (linefeed,
U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car‐
riage return, U+000D), or NEL (next line, U+0085). Because this is an
atomic group, the two-character sequence is treated as a single unit
that cannot be split.

In other modes, two additional characters whose code points are greater
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa‐
rator, U+2029). Unicode support is not needed for these characters to
be recognized.

It is possible to restrict \R to match only CR, LF, or CRLF (instead of
the complete set of Unicode line endings) by setting the option
PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation for "back‐
slash R".) This can be made the default when PCRE2 is built; if this is
the case, the other behaviour can be requested via the PCRE2_BSR_UNI‐
CODE option. It is also possible to specify these settings by starting
a pattern string with one of the following sequences:

  (*BSR_ANYCRLF)   CR, LF, or CRLF only
  (*BSR_UNICODE)   any Unicode newline sequence

These override the default and the options given to the compiling func‐
tion. Note that these special settings, which are not Perl-compatible,
are recognized only at the very start of a pattern, and that they must
be in upper case. If more than one of them is present, the last one is
used. They can be combined with a change of newline convention; for
example, a pattern can start with:

  (*ANY)(*BSR_ANYCRLF)

They can also be combined with the (*UTF) or (*UCP) special sequences.
Inside a character class, \R is treated as an unrecognized escape
sequence, and causes an error.
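
The behaviour of \R described in this section can be modelled without a
regular expression engine. In the Python sketch below, the function name
match_bsr and its parameters are hypothetical: anycrlf stands for the
PCRE2_BSR_ANYCRLF setting, and eight_bit_non_utf selects the mode in
which LS and PS are not included.

```python
def match_bsr(subject, pos=0, anycrlf=False, eight_bit_non_utf=False):
    """Return how many characters \\R consumes at pos, or 0 for no match."""
    if subject.startswith("\r\n", pos):
        # CRLF is taken as a single unit; as in the atomic group, there
        # is no fallback to matching the CR on its own.
        return 2
    if pos >= len(subject):
        return 0
    singles = "\n\r" if anycrlf else "\n\x0b\f\r\x85"
    if not anycrlf and not eight_bit_non_utf:
        singles += "\u2028\u2029"  # LS and PS are added in the other modes
    return 1 if subject[pos] in singles else 0
```

Under this model a vertical tab is matched by default, but not when
anycrlf is set, and CRLF always counts as one two-character match.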

Unicode character properties

When PCRE2 is built with Unicode support (the default), three addi‐
tional escape sequences that match characters with specific properties
are available. In 8-bit non-UTF-8 mode, these sequences are of course
limited to testing characters whose code points are less than 256, but
they do work in this mode. In 32-bit non-UTF mode, code points greater
than 0x10ffff (the Unicode limit) may be encountered. These are all
treated as being in the Common script and with an unassigned type. The
extra escape sequences are:

  \p{xx}   a character with the xx property
  \P{xx}   a character without the xx property
  \X       a Unicode extended grapheme cluster

The property names represented by xx above are limited to the Unicode
script names, the general category properties, "Any", which matches any
character (including newline), and some special PCRE2 properties
(described in the next section). Other Perl properties such as "InMu‐
sicalSymbols" are not supported by PCRE2. Note that \P{Any} does not
match any characters, so always causes a match failure.

Sets of Unicode characters are defined as belonging to certain scripts.
A character from one of these sets can be matched using a script name.
For example:

  \p{Greek}
  \P{Han}

Those that are not part of an identified script are lumped together as
"Common". The current list of scripts is:
671
672 Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Bali‐
673 nese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi,
674 Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Caucasian_Alba‐
675 nian, Chakma, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot,
676 Cyrillic, Deseret, Devanagari, Dogra, Duployan, Egyptian_Hieroglyphs,
677 Elbasan, Ethiopic, Georgian, Glagolitic, Gothic, Grantha, Greek,
678 Gujarati, Gunjala_Gondi, Gurmukhi, Han, Hangul, Hanifi_Rohingya,
679 Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited,
680 Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kan‐
681 nada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao,
682 Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Maha‐
683 jani, Makasar, Malayalam, Mandaic, Manichaean, Marchen, Masaram_Gondi,
684 Medefaidrin, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive,
685 Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Multani, Myanmar,
686 Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham, Ol_Chiki, Old_Hungar‐
687 ian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, Old_Sog‐
688 dian, Old_South_Arabian, Old_Turkic, Oriya, Osage, Osmanya,
689 Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician,
690 Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Sha‐
691 vian, Siddham, SignWriting, Sinhala, Sogdian, Sora_Sompeng, Soyombo,
692 Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham,
693 Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana, Thai, Tibetan, Tifi‐
694 nagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi, Zanabazar_Square.
695
696 Each character has exactly one Unicode general category property, spec‐
697 ified by a two-letter abbreviation. For compatibility with Perl, nega‐
698 tion can be specified by including a circumflex between the opening
699 brace and the property name. For example, \p{^Lu} is the same as
700 \P{Lu}.
701
702 If only one letter is specified with \p or \P, it includes all the gen‐
703 eral category properties that start with that letter. In this case, in
704 the absence of negation, the curly brackets in the escape sequence are
705 optional; these two examples have the same effect:
706
707 \p{L}
708 \pL
709
710 The following general category property codes are supported:
711
712 C Other
713 Cc Control
714 Cf Format
715 Cn Unassigned
716 Co Private use
717 Cs Surrogate
718
719 L Letter
720 Ll Lower case letter
721 Lm Modifier letter
722 Lo Other letter
723 Lt Title case letter
724 Lu Upper case letter
725
726 M Mark
727 Mc Spacing mark
728 Me Enclosing mark
729 Mn Non-spacing mark
730
731 N Number
732 Nd Decimal number
733 Nl Letter number
734 No Other number
735
736 P Punctuation
737 Pc Connector punctuation
738 Pd Dash punctuation
739 Pe Close punctuation
740 Pf Final punctuation
741 Pi Initial punctuation
742 Po Other punctuation
743 Ps Open punctuation
744
745 S Symbol
746 Sc Currency symbol
747 Sk Modifier symbol
748 Sm Mathematical symbol
749 So Other symbol
750
751 Z Separator
752 Zl Line separator
753 Zp Paragraph separator
754 Zs Space separator
755
756 The special property L& is also supported: it matches a character that
757 has the Lu, Ll, or Lt property, in other words, a letter that is not
758 classified as a modifier or "other".
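
The relationship between one-letter codes, two-letter codes, and L& can
be sketched with Python's standard unicodedata module. This is only an
illustration and does not call PCRE2; the helper names has_property and
lacks_property are hypothetical, and Python's Unicode tables may be at
a different Unicode version from the one PCRE2 was built with.

```python
import unicodedata

def has_property(ch, prop):
    # Rough emulation of \p{prop} for general category codes only.
    cat = unicodedata.category(ch)        # always two letters, e.g. "Lu"
    if prop == "L&":                      # special property: Lu, Ll, or Lt
        return cat in ("Lu", "Ll", "Lt")
    if len(prop) == 1:                    # one letter covers every code
        return cat.startswith(prop)       # that starts with that letter
    return cat == prop

def lacks_property(ch, prop):
    # Rough emulation of \P{prop}, which is the same as \p{^prop}.
    return not has_property(ch, prop)
```

Under these assumptions, has_property("a", "L") is true while
has_property("a", "Lu") is false, which mirrors the difference between
\pL and \p{Lu}.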
759
760 The Cs (Surrogate) property applies only to characters in the range
761 U+D800 to U+DFFF. Such characters are not valid in Unicode strings and
762 so cannot be tested by PCRE2, unless UTF validity checking has been
763 turned off (see the discussion of PCRE2_NO_UTF_CHECK in the pcre2api
764 page). Perl does not support the Cs property.
765
766 The long synonyms for property names that Perl supports (such as
767 \p{Letter}) are not supported by PCRE2, nor is it permitted to prefix
768 any of these properties with "Is".
769
770 No character that is in the Unicode table has the Cn (unassigned) prop‐
771 erty. Instead, this property is assumed for any code point that is not
772 in the Unicode table.
773
774 Specifying caseless matching does not affect these escape sequences.
775 For example, \p{Lu} always matches only upper case letters. This is
776 different from the behaviour of current versions of Perl.
777
778 Matching characters by Unicode property is not fast, because PCRE2 has
779 to do a multistage table lookup in order to find a character's prop‐
780 erty. That is why the traditional escape sequences such as \d and \w do
781 not use Unicode properties in PCRE2 by default, though you can make
782 them do so by setting the PCRE2_UCP option or by starting the pattern
783 with (*UCP).
784
785 Extended grapheme clusters
786
787 The \X escape matches any number of Unicode characters that form an
788 "extended grapheme cluster", and treats the sequence as an atomic group
789 (see below). Unicode supports various kinds of composite character by
790 giving each character a grapheme breaking property, and having rules
791 that use these properties to define the boundaries of extended grapheme
792 clusters. The rules are defined in Unicode Standard Annex 29, "Unicode
793 Text Segmentation". Unicode 11.0.0 abandoned the use of some previous
794 properties that had been used for emojis. Instead it introduced vari‐
795 ous emoji-specific properties. PCRE2 uses only the Extended Picto‐
796 graphic property.
797
798 \X always matches at least one character. Then it decides whether to
799 add additional characters according to the following rules for ending a
800 cluster:
801
802 1. End at the end of the subject string.
803
804 2. Do not end between CR and LF; otherwise end after any control char‐
805 acter.
806
807 3. Do not break Hangul (a Korean script) syllable sequences. Hangul
808 characters are of five types: L, V, T, LV, and LVT. An L character may
809 be followed by an L, V, LV, or LVT character; an LV or V character may
810 be followed by a V or T character; an LVT or T character may be followed
811 only by a T character.
812
813 4. Do not end before extending characters or spacing marks or the
814 "zero-width joiner" character. Characters with the "mark" property
815 always have the "extend" grapheme breaking property.
816
817 5. Do not end after prepend characters.
818
819 6. Do not break within emoji modifier sequences or emoji zwj sequences.
820 That is, do not break between characters with the Extended_Pictographic
821 property. Extend and ZWJ characters are allowed between the charac‐
822 ters.
823
824 7. Do not break within emoji flag sequences. That is, do not break
825 between regional indicator (RI) characters if there are an odd number
826 of RI characters before the break point.
827
828 8. Otherwise, end the cluster.
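
Rules 3 and 7 are the most mechanical in this list and can be sketched
in a few lines of Python. This only illustrates the two rules as stated
above and is not PCRE2's implementation. All names are hypothetical,
the caller is assumed to know each character's Hangul syllable type
already, and split_ri_run() assumes its argument consists solely of
regional indicator characters.

```python
# Rule 3: the Hangul syllable types that may follow each type
# without ending the cluster.
HANGUL_MAY_FOLLOW = {
    "L":   {"L", "V", "LV", "LVT"},
    "LV":  {"V", "T"},
    "V":   {"V", "T"},
    "LVT": {"T"},
    "T":   {"T"},
}

def hangul_no_break(prev_type, next_type):
    # True when the cluster must not end between the two characters.
    return next_type in HANGUL_MAY_FOLLOW.get(prev_type, set())

def is_regional_indicator(ch):
    # Regional indicator symbols occupy U+1F1E6 to U+1F1FF.
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF

def split_ri_run(run):
    # Rule 7: there is no break when an odd number of RI characters
    # precede the break point, so a run of RIs divides into pairs,
    # with a single RI left over if the run has odd length.
    return [run[i:i + 2] for i in range(0, len(run), 2)]
```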
829
830 PCRE2's additional properties
831
832 As well as the standard Unicode properties described above, PCRE2 sup‐
833 ports four more that make it possible to convert traditional escape
834 sequences such as \w and \s to use Unicode properties. PCRE2 uses these
835 non-standard, non-Perl properties internally when PCRE2_UCP is set.
836 However, they may also be used explicitly. These properties are:
837
838 Xan Any alphanumeric character
839 Xps Any POSIX space character
840 Xsp Any Perl space character
841 Xwd Any Perl "word" character
842
843 Xan matches characters that have either the L (letter) or the N (num‐
844 ber) property. Xps matches the characters tab, linefeed, vertical tab,
845 form feed, or carriage return, and any other character that has the Z
846 (separator) property. Xsp is the same as Xps; in PCRE1 it used to
847 exclude vertical tab, for Perl compatibility, but Perl changed. Xwd
848 matches the same characters as Xan, plus underscore.
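
The definitions above translate directly into predicates. The following
Python sketch follows the wording of this section rather than PCRE2's
source code; the function names are hypothetical, and Python's Unicode
tables may differ in version from those built into PCRE2.

```python
import unicodedata

def is_xan(ch):
    # Xan: any character with the L (letter) or N (number) property.
    return unicodedata.category(ch)[0] in "LN"

def is_xps(ch):
    # Xps: tab, linefeed, vertical tab, form feed, carriage return,
    # or any character with the Z (separator) property.
    return ch in "\t\n\x0b\x0c\r" or unicodedata.category(ch)[0] == "Z"

# Xsp is described as matching the same characters as Xps.
is_xsp = is_xps

def is_xwd(ch):
    # Xwd: the Xan characters plus underscore.
    return ch == "_" or is_xan(ch)
```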
849
850 There is another non-standard property, Xuc, which matches any charac‐
851 ter that can be represented by a Universal Character Name in C++ and
852 other programming languages. These are the characters $, @, ` (grave
853 accent), and all characters with Unicode code points greater than or
854 equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that
855 most base (ASCII) characters are excluded. (Universal Character Names
856 are of the form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.
857 Note that the Xuc property does not match these sequences but the char‐
858 acters that they represent.)
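
Read literally, the Xuc definition is a simple test on the code point.
The Python sketch below implements that reading; the function name
is_xuc is hypothetical and the code is not taken from PCRE2.

```python
def is_xuc(ch):
    # The three ASCII characters that are explicitly included.
    if ch in ("$", "@", "`"):
        return True
    cp = ord(ch)
    # Everything from U+00A0 upwards, except the surrogates.
    return cp >= 0xA0 and not (0xD800 <= cp <= 0xDFFF)
```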
859
860 Resetting the match start
861
862 In normal use, the escape sequence \K causes any previously matched
863 characters not to be included in the final matched sequence that is
864 returned. For example, the pattern:
865
866 foo\Kbar
867
868 matches "foobar", but reports that it has matched "bar". \K does not
869 interact with anchoring in any way. The pattern:
870
871 ^foo\Kbar
872
873 matches only when the subject begins with "foobar" (in single line
874 mode), though it again reports the matched string as "bar". This fea‐
875 ture is similar to a lookbehind assertion (described below). However,
876 in this case, the part of the subject before the real match does not
877 have to be of fixed length, as lookbehind assertions do. The use of \K
878 does not interfere with the setting of captured substrings. For exam‐
879 ple, when the pattern
880
881 (foo)\Kbar
882
883 matches "foobar", the first substring is still set to "foo".
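
The contrast with lookbehind can be seen in an engine that has no \K.
Python's re module is such an engine, and it is used here only as a
stand-in: it is not PCRE2, and the variable names are illustrative. A
fixed-length prefix can be written as a lookbehind, but a
variable-length prefix is rejected by that module, so a capturing group
is used to pick out the text that \K would have reported.

```python
import re

# Fixed-length prefix: the lookbehind reports only "bar",
# much as foo\Kbar would.
m = re.search(r"(?<=foo)bar", "foobar")
lookbehind_span = m.span()

# Variable-length prefix: this module refuses to compile the
# lookbehind ...
try:
    re.compile(r"(?<=fo+)bar")
    variable_lookbehind_ok = True
except re.error:
    variable_lookbehind_ok = False

# ... so a capturing group selects the part after the prefix,
# which is the text that fo+\Kbar would report.
m2 = re.search(r"fo+(bar)", "foooobar")
kept = (m2.group(1), m2.span(1))
```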
884
885 Perl documents that the use of \K within assertions is "not well
886 defined". In PCRE2, \K is acted upon when it occurs inside positive
887 assertions, but is ignored in negative assertions. Note that when a
888 pattern such as (?=ab\K) matches, the reported start of the match can
889 be greater than the end of the match. Using \K in a lookbehind asser‐
890 tion at the start of a pattern can also lead to odd effects. For exam‐
891 ple, consider this pattern:
892
893 (?<=\Kfoo)bar
894
895 If the subject is "foobar", a call to pcre2_match() with a starting
896 offset of 3 succeeds and reports the matching string as "foobar", that
897 is, the start of the reported match is earlier than where the match
898 started.
899
900 Simple assertions
901
902 The final use of backslash is for certain simple assertions. An asser‐
903 tion specifies a condition that has to be met at a particular point in
904 a match, without consuming any characters from the subject string. The
905 use of subpatterns for more complicated assertions is described below.
906 The backslashed assertions are:
907
908 \b matches at a word boundary
909 \B matches when not at a word boundary
910 \A matches at the start of the subject
911 \Z matches at the end of the subject
912 also matches before a newline at the end of the subject
913 \z matches only at the end of the subject
914 \G matches at the first matching position in the subject
915
916 Inside a character class, \b has a different meaning; it matches the
917 backspace character. If any other of these assertions appears in a
918 character class, an "invalid escape sequence" error is generated.
919
920 A word boundary is a position in the subject string where the current
921 character and the previous character do not both match \w or \W (i.e.
922 one matches \w and the other matches \W), or the start or end of the
923 string if the first or last character matches \w, respectively. In a
924 UTF mode, the meanings of \w and \W can be changed by setting the
925 PCRE2_UCP option. When this is done, it also affects \b and \B. Neither
926 PCRE2 nor Perl has a separate "start of word" or "end of word" metase‐
927 quence. However, whatever follows \b normally determines which it is.
928 For example, the fragment \ba matches "a" at the start of a word.
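
The \ba fragment can be tried in any engine with the same definition of
a word boundary. The sketch below uses Python's re module as a stand-in
for PCRE2; the re.ASCII flag restricts \w to ASCII letters, digits, and
underscore, which is assumed here to approximate PCRE2 without
PCRE2_UCP.

```python
import re

text = "an apple, banana"

# "a" at the start of a word: whatever follows \b decides which
# kind of boundary is meant.
word_initial = [m.start() for m in re.finditer(r"\ba", text, re.ASCII)]

# \B is the complement: "a" that is not at a word boundary.
word_internal = [m.start() for m in re.finditer(r"\Ba", text, re.ASCII)]
```

Here word_initial holds the offsets of the "a" in "an" and in "apple",
and word_internal holds the offsets of the three "a"s inside "banana".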
929
930 The \A, \Z, and \z assertions differ from the traditional circumflex
931 and dollar (described in the next section) in that they only ever match
932 at the very start and end of the subject string, whatever options are
933 set. Thus, they are independent of multiline mode. These three asser‐
934 tions are not affected by the PCRE2_NOTBOL or PCRE2_NOTEOL options,
935 which affect only the behaviour of the circumflex and dollar metachar‐
936 acters. However, if the startoffset argument of pcre2_match() is non-
937 zero, indicating that matching is to start at a point other than the
938 beginning of the subject, \A can never match. The difference between
939 \Z and \z is that \Z matches before a newline at the end of the string
940 as well as at the very end, whereas \z matches only at the end.
941
942 The \G assertion is true only when the current matching position is at
943 the start point of the matching process, as specified by the startoff‐
944 set argument of pcre2_match(). It differs from \A when the value of
945 startoffset is non-zero. By calling pcre2_match() multiple times with
946 appropriate arguments, you can mimic Perl's /g option, and it is in
947 this kind of implementation that \G can be useful.
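
The shape of such a loop is sketched below in Python. Python's re
module has no \G, but Pattern.match(string, pos) succeeds only when the
match begins exactly at pos, which is assumed here to play the same
role as a pattern in which every alternative starts with \G, run with a
non-zero startoffset. The token pattern and the function name are
illustrative only.

```python
import re

# Every alternative matches at least one character, so the loop
# always makes progress.
TOKEN = re.compile(r"\d+|[a-z]+|\s+")

def tokenize(subject):
    tokens = []
    pos = 0
    while pos < len(subject):
        # Anchored at pos, like a \G-anchored pattern at startoffset.
        m = TOKEN.match(subject, pos)
        if m is None:
            raise ValueError("no token at offset %d" % pos)
        tokens.append(m.group())
        pos = m.end()        # the next match starts where this one ended
    return tokens
```

Because each match is anchored at the end of the previous one, an
unrecognized character stops the scan instead of being skipped.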
948
949 Note, however, that PCRE2's implementation of \G, being true at the
950 starting character of the matching process, is subtly different from
951 Perl's, which defines it as true at the end of the previous match. In
952 Perl, these can be different when the previously matched string was
953 empty. Because PCRE2 does just one match at a time, it cannot reproduce
954 this behaviour.
955
956 If all the alternatives of a pattern begin with \G, the expression is
957 anchored to the starting match position, and the "anchored" flag is set
958 in the compiled regular expression.
959
961
962 The circumflex and dollar metacharacters are zero-width assertions.
963 That is, they test for a particular condition being true without con‐
964 suming any characters from the subject string. These two metacharacters
965 are concerned with matching the starts and ends of lines. If the new‐
966 line convention is set so that only the two-character sequence CRLF is
967 recognized as a newline, isolated CR and LF characters are treated as
968 ordinary data characters, and are not recognized as newlines.
969
970 Outside a character class, in the default matching mode, the circumflex
971 character is an assertion that is true only if the current matching
972 point is at the start of the subject string. If the startoffset argu‐
973 ment of pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum‐
974 flex can never match if the PCRE2_MULTILINE option is unset. Inside a
975 character class, circumflex has an entirely different meaning (see
976 below).
977
978 Circumflex need not be the first character of the pattern if a number
979 of alternatives are involved, but it should be the first thing in each
980 alternative in which it appears if the pattern is ever to match that
981 branch. If all possible alternatives start with a circumflex, that is,
982 if the pattern is constrained to match only at the start of the sub‐
983 ject, it is said to be an "anchored" pattern. (There are also other
984 constructs that can cause a pattern to be anchored.)
985
986 The dollar character is an assertion that is true only if the current
987 matching point is at the end of the subject string, or immediately
988 before a newline at the end of the string (by default), unless
989 PCRE2_NOTEOL is set. Note, however, that it does not actually match the
990 newline. Dollar need not be the last character of the pattern if a num‐
991 ber of alternatives are involved, but it should be the last item in any
992 branch in which it appears. Dollar has no special meaning in a charac‐
993 ter class.
994
995 The meaning of dollar can be changed so that it matches only at the
996 very end of the string, by setting the PCRE2_DOLLAR_ENDONLY option at
997 compile time. This does not affect the \Z assertion.
998
999 The meanings of the circumflex and dollar metacharacters are changed if
1000 the PCRE2_MULTILINE option is set. When this is the case, a dollar
1001 character matches before any newlines in the string, as well as at the
1002 very end, and a circumflex matches immediately after internal newlines
1003 as well as at the start of the subject string. It does not match after
1004 a newline that ends the string, for compatibility with Perl. However,
1005 this can be changed by setting the PCRE2_ALT_CIRCUMFLEX option.
1006
1007 For example, the pattern /^abc$/ matches the subject string "def\nabc"
1008 (where \n represents a newline) in multiline mode, but not otherwise.
1009 Consequently, patterns that are anchored in single line mode because
1010 all branches start with ^ are not anchored in multiline mode, and a
1011 match for circumflex is possible when the startoffset argument of
1012 pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option is ignored
1013 if PCRE2_MULTILINE is set.
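
The /^abc$/ example can be reproduced in Python's re module, which is
used here as a stand-in for PCRE2. It treats only LF as a newline and
differs from PCRE2 in some corner cases, so the sketch covers just the
behaviour described above.

```python
import re

subject = "def\nabc"

# Single line mode: circumflex matches only at the very start.
single = re.search(r"^abc$", subject)

# Multiline mode: circumflex also matches after the internal newline.
multi = re.search(r"^abc$", subject, re.MULTILINE)

# By default, dollar also matches just before a newline that ends
# the subject, without matching the newline itself.
before_final_newline = re.search(r"abc$", "abc\n")
```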
1014
1015 When the newline convention (see "Newline conventions" below) recog‐
1016 nizes the two-character sequence CRLF as a newline, this is preferred,
1017 even if the single characters CR and LF are also recognized as new‐
1018 lines. For example, if the newline convention is "any", a multiline
1019 mode circumflex matches before "xyz" in the string "abc\r\nxyz" rather
1020 than after CR, even though CR on its own is a valid newline. (It also
1021 matches at the very start of the string, of course.)
1022
1023 Note that the sequences \A, \Z, and \z can be used to match the start
1024 and end of the subject in both modes, and if all branches of a pattern
1025 start with \A it is always anchored, whether or not PCRE2_MULTILINE is
1026 set.
1027
1029
1030 Outside a character class, a dot in the pattern matches any one charac‐
1031 ter in the subject string except (by default) a character that signi‐
1032 fies the end of a line.
1033
1034 When a line ending is defined as a single character, dot never matches
1035 that character; when the two-character sequence CRLF is used, dot does
1036 not match CR if it is immediately followed by LF, but otherwise it
1037 matches all characters (including isolated CRs and LFs). When any Uni‐
1038 code line endings are being recognized, dot does not match CR or LF or
1039 any of the other line ending characters.
1040
1041 The behaviour of dot with regard to newlines can be changed. If the
1042 PCRE2_DOTALL option is set, a dot matches any one character, without
1043 exception. If the two-character sequence CRLF is present in the sub‐
1044 ject string, it takes two dots to match it.
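
The effect of a "dotall" setting can be sketched with Python's re
module, whose re.DOTALL flag is assumed here to correspond to
PCRE2_DOTALL. Python recognizes only LF as a line ending for this
purpose, so without the flag its dot matches CR, unlike PCRE2 when CR
is part of the newline convention.

```python
import re

# Without "dotall", dot does not match the newline.
no_dotall = re.fullmatch(r"a.c", "a\nc")

# With "dotall", dot matches any one character.
with_dotall = re.fullmatch(r"a.c", "a\nc", re.DOTALL)

# CRLF is two characters, so it takes two dots to match it.
crlf_two_dots = re.fullmatch(r"..", "\r\n", re.DOTALL)
crlf_one_dot = re.fullmatch(r".", "\r\n", re.DOTALL)
```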
1045
1046 The handling of dot is entirely independent of the handling of circum‐
1047 flex and dollar, the only relationship being that they both involve
1048 newlines. Dot has no special meaning in a character class.
1049
1050 The escape sequence \N when not followed by an opening brace behaves
1051 like a dot, except that it is not affected by the PCRE2_DOTALL option.
1052 In other words, it matches any character except one that signifies the
1053 end of a line.
1054
1055 When \N is followed by an opening brace it has a different meaning. See
1056 the section entitled "Non-printing characters" above for details. Perl
1057 also uses \N{name} to specify characters by Unicode name; PCRE2 does
1058 not support this.
1059
1061
1062 Outside a character class, the escape sequence \C matches any one code
1063 unit, whether or not a UTF mode is set. In the 8-bit library, one code
1064 unit is one byte; in the 16-bit library it is a 16-bit unit; in the
1065 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
1066 line-ending characters. The feature is provided in Perl in order to
1067 match individual bytes in UTF-8 mode, but it is unclear how it can use‐
1068 fully be used.
1069
1070 Because \C breaks up characters into individual code units, matching
1071 one unit with \C in UTF-8 or UTF-16 mode means that the rest of the
1072 string may start with a malformed UTF character. This has undefined
1073 results, because PCRE2 assumes that it is matching character by charac‐
1074 ter in a valid UTF string (by default it checks the subject string's
1075 validity at the start of processing unless the PCRE2_NO_UTF_CHECK
1076 option is used).
1077
1078 An application can lock out the use of \C by setting the
1079 PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also
1080 possible to build PCRE2 with the use of \C permanently disabled.
1081
1082 PCRE2 does not allow \C to appear in lookbehind assertions (described
1083 below) in UTF-8 or UTF-16 modes, because this would make it impossible
1084 to calculate the length of the lookbehind. Neither the alternative
1085 matching function pcre2_dfa_match() nor the JIT optimizer supports \C in
1086 these UTF modes. The former gives a match-time error; the latter fails
1087 to optimize and so the match is always run using the interpreter.
1088
1089 In the 32-bit library, however, \C is always supported (when not
1090 explicitly locked out) because it always matches a single code unit,
1091 whether or not UTF-32 is specified.
1092
1093 In general, the \C escape sequence is best avoided. However, one way of
1094 using it that avoids the problem of malformed UTF-8 or UTF-16 charac‐
1095 ters is to use a lookahead to check the length of the next character,
1096 as in this pattern, which could be used with a UTF-8 string (ignore
1097 white space and line breaks):
1098
1099 (?| (?=[\x00-\x7f])(\C) |
1100 (?=[\x80-\x{7ff}])(\C)(\C) |
1101 (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
1102 (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
1103
1104 In this example, a group that starts with (?| resets the capturing
1105 parentheses numbers in each alternative (see "Duplicate Subpattern Num‐
1106 bers" below). The assertions at the start of each branch check the next
1107 UTF-8 character for values whose encoding uses 1, 2, 3, or 4 bytes,
1108 respectively. The character's individual bytes are then captured by the
1109 appropriate number of \C groups.
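
The four code point ranges tested by the lookaheads correspond to the
four possible UTF-8 encoding lengths. The Python sketch below checks
that correspondence against Python's own UTF-8 encoder. Python has no
\C, so the code units are obtained by encoding the character; both
function names are hypothetical.

```python
def utf8_length_by_range(ch):
    # The same ranges as the four lookahead assertions.
    cp = ord(ch)
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4

def code_units(ch):
    # The individual bytes that the \C groups would capture.
    return list(ch.encode("utf-8"))
```

For the euro sign U+20AC, for example, the third range applies and
three code units are produced.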
1110
1112
1113 An opening square bracket introduces a character class, terminated by a
1114 closing square bracket. A closing square bracket on its own is not spe‐
1115 cial by default. If a closing square bracket is required as a member
1116 of the class, it should be the first data character in the class (after
1117 an initial circumflex, if present) or escaped with a backslash. This
1118 means that, by default, an empty class cannot be defined. However, if
1119 the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at
1120 the start does end the (empty) class.
1121
1122 A character class matches a single character in the subject. A matched
1123 character must be in the set of characters defined by the class, unless
1124 the first character in the class definition is a circumflex, in which
1125 case the subject character must not be in the set defined by the class.
1126 If a circumflex is actually required as a member of the class, ensure
1127 it is not the first character, or escape it with a backslash.
1128
1129 For example, the character class [aeiou] matches any lower case vowel,
1130 while [^aeiou] matches any character that is not a lower case vowel.
1131 Note that a circumflex is just a convenient notation for specifying the
1132 characters that are in the class by enumerating those that are not. A
1133 class that starts with a circumflex is not an assertion; it still con‐
1134 sumes a character from the subject string, and therefore it fails if
1135 the current pointer is at the end of the string.
1136
1137 Characters in a class may be specified by their code points using \o,
1138 \x, or \N{U+hh..} in the usual way. When caseless matching is set, any
1139 letters in a class represent both their upper case and lower case ver‐
1140 sions, so for example, a caseless [aeiou] matches "A" as well as "a",
1141 and a caseless [^aeiou] does not match "A", whereas a caseful version
1142 would.
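
These cases can be checked with Python's re module, which is assumed to
treat simple classes in the same way as PCRE2 for these ASCII examples.
The variable names are illustrative.

```python
import re

caseful_vowel = re.fullmatch(r"[aeiou]", "A")
caseless_vowel = re.fullmatch(r"[aeiou]", "A", re.IGNORECASE)

caseful_negated = re.fullmatch(r"[^aeiou]", "A")
caseless_negated = re.fullmatch(r"[^aeiou]", "A", re.IGNORECASE)

# A negated class is not an assertion: it must consume a character,
# so it fails when the subject has run out.
at_end_of_subject = re.search(r"x[^aeiou]", "x")
```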
1143
1144 Characters that might indicate line breaks are never treated in any
1145 special way when matching character classes, whatever line-ending
1146 sequence is in use, and whatever setting of the PCRE2_DOTALL and
1147 PCRE2_MULTILINE options is used. A class such as [^a] always matches
1148 one of these characters.
1149
1150 The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
1151 \S, \v, \V, \w, and \W may appear in a character class, and add the
1152 characters that they match to the class. For example, [\dABCDEF]
1153 matches any hexadecimal digit. In UTF modes, the PCRE2_UCP option
1154 affects the meanings of \d, \s, \w and their upper case partners, just
1155 as it does when they appear outside a character class, as described in
1156 the section entitled "Generic character types" above. The escape
1157 sequence \b has a different meaning inside a character class; it
1158 matches the backspace character. The sequences \B, \R, and \X are not
1159 special inside a character class. Like any other unrecognized escape
1160 sequences, they cause an error. The same is true for \N when not fol‐
1161 lowed by an opening brace.
1162
1163 The minus (hyphen) character can be used to specify a range of charac‐
1164 ters in a character class. For example, [d-m] matches any letter
1165 between d and m, inclusive. If a minus character is required in a
1166 class, it must be escaped with a backslash or appear in a position
1167 where it cannot be interpreted as indicating a range, typically as the
1168 first or last character in the class, or immediately after a range. For
1169 example, [b-d-z] matches letters in the range b to d, a hyphen charac‐
1170 ter, or z.
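
The [b-d-z] example behaves the same way in Python's re module, which
is used below as a stand-in for PCRE2. The second pattern shows the
escaped form of a literal hyphen.

```python
import re

# Range b to d, then a literal hyphen, then z.
after_range = re.findall(r"[b-d-z]", "abcdez-y")

# An escaped hyphen is always a literal.
escaped_hyphen = re.findall(r"[a\-z]", "a-mz")
```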
1171
1172 Perl treats a hyphen as a literal if it appears before or after a POSIX
1173 class (see below) or before or after a character type escape such as
1174 \d or \H. However, unless the hyphen is the last character in the
1175 class, Perl outputs a warning in its warning mode, as this is most
1176 likely a user error. As PCRE2 has no facility for warning, an error is
1177 given in these cases.
1178
1179 It is not possible to have the literal character "]" as the end charac‐
1180 ter of a range. A pattern such as [W-]46] is interpreted as a class of
1181 two characters ("W" and "-") followed by a literal string "46]", so it
1182 would match "W46]" or "-46]". However, if the "]" is escaped with a
1183 backslash it is interpreted as the end of range, so [W-\]46] is inter‐
1184 preted as a class containing a range followed by two other characters.
1185 The octal or hexadecimal representation of "]" can also be used to end
1186 a range.
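
Both readings can be confirmed with Python's re module, which happens
to parse these two classes in the way described above. It is used only
as a stand-in for PCRE2.

```python
import re

# "W" or "-", followed by the literal string "46]".
unescaped = re.compile(r"[W-]46]")

# The range W to ], plus the characters "4" and "6".
escaped = re.compile(r"[W-\]46]")

unescaped_hits = [s for s in ("W46]", "-46]", "X46]") if unescaped.fullmatch(s)]
escaped_hits = [c for c in "WZ]46-" if escaped.fullmatch(c)]
```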
1187
1188 Ranges normally include all code points between the start and end char‐
1189 acters, inclusive. They can also be used for code points specified
1190 numerically, for example [\000-\037]. Ranges can include any characters
1191 that are valid for the current mode. In any UTF mode, the so-called
1192 "surrogate" characters (those whose code points lie between 0xd800 and
1193 0xdfff inclusive) may not be specified explicitly by default (the
1194 PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables this check). How‐
1195 ever, ranges such as [\x{d7ff}-\x{e000}], which include the surrogates,
1196 are always permitted.
1197
1198 There is a special case in EBCDIC environments for ranges whose end
1199 points are both specified as literal letters in the same case. For com‐
1200 patibility with Perl, EBCDIC code points within the range that are not
1201 letters are omitted. For example, [h-k] matches only four characters,
1202 even though the codes for h and k are 0x88 and 0x92, a range of 11 code
1203 points. However, if the range is specified numerically, for example,
1204 [\x88-\x92] or [h-\x92], all code points are included.
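
The arithmetic behind this example can be checked with Python's cp037
codec. Code page 037 is an assumption made for the illustration: it is
one common EBCDIC variant, and other variants may place different
characters at the non-letter positions.

```python
import string

lo = "h".encode("cp037")[0]     # code for h in this EBCDIC code page
hi = "k".encode("cp037")[0]     # code for k

# Every code point in the numeric range, inclusive.
code_points = list(range(lo, hi + 1))

# Only the unaccented lower case letters within that range, which is
# what a literal [h-k] is described as matching.
letters = [c for c in bytes(code_points).decode("cp037")
           if c in string.ascii_lowercase]
```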
1205
1206 If a range that includes letters is used when caseless matching is set,
1207 it matches the letters in either case. For example, [W-c] is equivalent
1208 to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if
1209 character tables for a French locale are in use, [\xc8-\xcb] matches
1210 accented E characters in both cases.
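
The [W-c] equivalence can be verified over the ASCII range with
Python's re module, used as a stand-in for PCRE2. The re.ASCII flag is
assumed to keep case folding to ASCII letters, and the expected set is
built by hand from the characters listed above.

```python
import re

ascii_chars = [chr(c) for c in range(128)]

# Characters matched by a caseless [W-c].
matched = {c for c in ascii_chars
           if re.fullmatch(r"[W-c]", c, re.IGNORECASE | re.ASCII)}

# The explicit class from the text, in both cases.
explicit = set("[]\\^_`wxyzabc")
expected = explicit | {c.upper() for c in explicit}
```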
1211
1212 A circumflex can conveniently be used with the upper case character
1213 types to specify a more restricted set of characters than the matching
1214 lower case type. For example, the class [^\W_] matches any letter or
1215 digit, but not underscore, whereas [\w] includes underscore. A positive
1216 character class should be read as "something OR something OR ..." and a
1217 negative class as "NOT something AND NOT something AND NOT ...".
1218
1219 The only metacharacters that are recognized in character classes are
1220 backslash, hyphen (only where it can be interpreted as specifying a
1221 range), circumflex (only at the start), opening square bracket (only
1222 when it can be interpreted as introducing a POSIX class name, or for a
1223 special compatibility feature - see the next two sections), and the
1224 terminating closing square bracket. However, escaping other non-
1225 alphanumeric characters does no harm.
1226
1228
1229 Perl supports the POSIX notation for character classes. This uses names
1230 enclosed by [: and :] within the enclosing square brackets. PCRE2 also
1231 supports this notation. For example,
1232
1233 [01[:alpha:]%]
1234
1235 matches "0", "1", any alphabetic character, or "%". The supported class
1236 names are:
1237
1238 alnum letters and digits
1239 alpha letters
1240 ascii character codes 0 - 127
1241 blank space or tab only
1242 cntrl control characters
1243 digit decimal digits (same as \d)
1244 graph printing characters, excluding space
1245 lower lower case letters
1246 print printing characters, including space
1247 punct printing characters, excluding letters and digits and space
1248 space white space (the same as \s from PCRE2 8.34)
1249 upper upper case letters
1250 word "word" characters (same as \w)
1251 xdigit hexadecimal digits
1252
1253 The default "space" characters are HT (9), LF (10), VT (11), FF (12),
1254 CR (13), and space (32). If locale-specific matching is taking place,
1255 the list of space characters may be different; there may be fewer or
1256 more of them. "Space" and \s match the same set of characters.
1257
1258 The name "word" is a Perl extension, and "blank" is a GNU extension
1259 from Perl 5.8. Another Perl extension is negation, which is indicated
1260 by a ^ character after the colon. For example,
1261
1262 [12[:^digit:]]
1263
1264 matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the
1265 POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
1266 these are not supported, and an error is given if they are encountered.
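
For ASCII subjects and the default tables, the class names can be
modelled as plain sets of characters. The Python sketch below is an
approximation based on the table above, not PCRE2's character tables.
The names POSIX_ASCII and posix_class are hypothetical, and negation is
computed within the ASCII range only, whereas a negated class in PCRE2
also matches characters above 127.

```python
import string

POSIX_ASCII = {
    "alnum":  set(string.ascii_letters + string.digits),
    "alpha":  set(string.ascii_letters),
    "ascii":  {chr(c) for c in range(128)},
    "blank":  set(" \t"),
    "cntrl":  {chr(c) for c in range(32)} | {chr(127)},
    "digit":  set(string.digits),
    "graph":  {chr(c) for c in range(33, 127)},
    "lower":  set(string.ascii_lowercase),
    "print":  {chr(c) for c in range(32, 127)},
    "punct":  set(string.punctuation),
    "space":  set("\t\n\x0b\x0c\r "),
    "upper":  set(string.ascii_uppercase),
    "word":   set(string.ascii_letters + string.digits + "_"),
    "xdigit": set(string.hexdigits),
}

def posix_class(name):
    # A leading ^ negates the named class (ASCII range only here).
    if name.startswith("^"):
        return POSIX_ASCII["ascii"] - POSIX_ASCII[name[1:]]
    return POSIX_ASCII[name]

# [01[:alpha:]%]
example = set("01%") | posix_class("alpha")

# [12[:^digit:]]
negated = set("12") | posix_class("^digit")
```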
1267
1268 By default, characters with values greater than 127 do not match any of
1269 the POSIX character classes, although this may be different for charac‐
1270 ters in the range 128-255 when locale-specific matching is happening.
1271 However, if the PCRE2_UCP option is passed to pcre2_compile(), some of
1272 the classes are changed so that Unicode character properties are used.
1273 This is achieved by replacing certain POSIX classes with other
1274 sequences, as follows:
1275
1276 [:alnum:] becomes \p{Xan}
1277 [:alpha:] becomes \p{L}
1278 [:blank:] becomes \h
1279 [:cntrl:] becomes \p{Cc}
1280 [:digit:] becomes \p{Nd}
1281 [:lower:] becomes \p{Ll}
1282 [:space:] becomes \p{Xps}
1283 [:upper:] becomes \p{Lu}
1284 [:word:] becomes \p{Xwd}
1285
1286 Negated versions, such as [:^alpha:] use \P instead of \p. Three other
1287 POSIX classes are handled specially in UCP mode:
1288
1289 [:graph:] This matches characters that have glyphs that mark the page
1290 when printed. In Unicode property terms, it matches all char‐
1291 acters with the L, M, N, P, S, or Cf properties, except for:
1292
1293 U+061C Arabic Letter Mark
1294 U+180E Mongolian Vowel Separator
1295 U+2066 - U+2069 Various "isolate"s
1296
1297
1298 [:print:] This matches the same characters as [:graph:] plus space
1299 characters that are not controls, that is, characters with
1300 the Zs property.
1301
1302 [:punct:] This matches all characters that have the Unicode P (punctua‐
1303 tion) property, plus those characters with code points less
1304 than 256 that have the S (Symbol) property.
1305
1306 The other POSIX classes are unchanged, and match only characters with
1307 code points less than 256.
1308
1310
1311 In the POSIX.2 compliant library that was included in 4.4BSD Unix, the
1312 ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word"
1313 and "end of word". PCRE2 treats these items as follows:
1314
1315 [[:<:]] is converted to \b(?=\w)
1316 [[:>:]] is converted to \b(?<=\w)
1317
1318 Only these exact character sequences are recognized. A sequence such as
1319 [a[:<:]b] provokes an error for an unrecognized POSIX class name. This
1320 support is not compatible with Perl. It is provided to help migrations
1321 from other environments, and is best not used in any new patterns. Note
1322 that \b matches at the start and the end of a word (see "Simple asser‐
1323 tions" above), and in a Perl-style pattern the preceding or following
1324 character normally shows which is wanted, without the need for the
1325 assertions that are used above in order to give exactly the POSIX be‐
1326 haviour.
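
The two conversions above can be tried in any engine that supports \b and lookaround. The following is a minimal sketch that uses Python's re module as a stand-in. That is an assumption of the example: re is not PCRE2, but for plain ASCII text it handles \b, (?=\w) and (?<=\w) in the same way. The subject string is made up for illustration.

```python
import re

subject = "hi there"

# [[:<:]] is converted to \b(?=\w): a boundary followed by a word character.
word_starts = [m.start() for m in re.finditer(r"\b(?=\w)", subject)]

# [[:>:]] is converted to \b(?<=\w): a boundary preceded by a word character.
word_ends = [m.start() for m in re.finditer(r"\b(?<=\w)", subject)]

print(word_starts)  # [0, 3]
print(word_ends)    # [2, 8]
```

A plain \b would report all four positions; the added assertions are what separate "start of word" from "end of word".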
1327
1329
1330 Vertical bar characters are used to separate alternative patterns. For
1331 example, the pattern
1332
1333 gilbert|sullivan
1334
1335 matches either "gilbert" or "sullivan". Any number of alternatives may
1336 appear, and an empty alternative is permitted (matching the empty
1337 string). The matching process tries each alternative in turn, from left
1338 to right, and the first one that succeeds is used. If the alternatives
1339 are within a subpattern (defined below), "succeeds" means matching the
1340 rest of the main pattern as well as the alternative in the subpattern.
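
The left-to-right rule can be seen in a short experiment. This sketch uses Python's re module rather than PCRE2 (an assumption of the example; both are backtracking matchers and agree on these simple cases). The patterns a|ab and (a|ab)c are illustrative and do not come from the text above.

```python
import re

# Either alternative can match.
either = re.fullmatch(r"gilbert|sullivan", "sullivan") is not None

# Alternatives are tried left to right; the first that succeeds is used.
first = re.match(r"a|ab", "ab").group()          # "a", not "ab"

# Inside a subpattern, "succeeds" includes the rest of the pattern:
# "a" alone cannot be followed by "c", so the second alternative is used.
inner = re.match(r"(a|ab)c", "abc").group(1)     # "ab"

print(either, first, inner)   # True a ab
```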
1341
1343
1344 The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
1345 PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options
1346 can be changed from within the pattern by a sequence of letters
1347 enclosed between "(?" and ")". These options are Perl-compatible, and
1348 are described in detail in the pcre2api documentation. The option let‐
1349 ters are:
1350
1351 i for PCRE2_CASELESS
1352 m for PCRE2_MULTILINE
1353 n for PCRE2_NO_AUTO_CAPTURE
1354 s for PCRE2_DOTALL
1355 x for PCRE2_EXTENDED
1356 xx for PCRE2_EXTENDED_MORE
1357
1358 For example, (?im) sets caseless, multiline matching. It is also possi‐
1359 ble to unset these options by preceding the relevant letters with a
1360 hyphen, for example (?-im). The two "extended" options are not indepen‐
1361 dent; unsetting either one cancels the effects of both of them.
1362
1363 A combined setting and unsetting such as (?im-sx), which sets
1364 PCRE2_CASELESS and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and
1365 PCRE2_EXTENDED, is also permitted. Only one hyphen may appear in the
1366 options string. If a letter appears both before and after the hyphen,
1367 the option is unset. An empty options setting "(?)" is allowed. Need‐
1368 less to say, it has no effect.
1369
1370 If the first character following (? is a circumflex, it causes all of
1371 the above options to be unset. Thus, (?^) is equivalent to (?-imnsx).
1372 Letters may follow the circumflex to cause some options to be re-
1373 instated, but a hyphen may not appear.
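
The rules for combining set and unset letters can be summarized as a small model. This is a toy sketch written for this page, not PCRE2's parser: the function names are hypothetical, options are represented simply as a set of letters, and unknown letters are not diagnosed.

```python
PERL_LETTERS = {"i", "m", "n", "s", "x", "xx"}   # the options unset by (?^)
EXTENDED = {"x", "xx"}                            # not independent


def tokens(letters):
    """Split an option string into letters, treating "xx" as one token."""
    out, i = [], 0
    while i < len(letters):
        step = 2 if letters.startswith("xx", i) else 1
        out.append(letters[i:i + step])
        i += step
    return out


def apply_setting(active, spec):
    """Return the options active after "(?" + spec + ")" (toy model)."""
    active = set(active)
    if spec.startswith("^"):
        if "-" in spec:
            raise ValueError("a hyphen may not follow a circumflex")
        active -= PERL_LETTERS          # J and U are left alone
        to_set, to_unset = spec[1:], ""
    else:
        if spec.count("-") > 1:
            raise ValueError("only one hyphen may appear")
        to_set, _, to_unset = spec.partition("-")
    active |= set(tokens(to_set))
    for letter in tokens(to_unset):     # unsetting wins over setting
        active -= EXTENDED if letter in EXTENDED else {letter}
    return active


print(sorted(apply_setting({"s", "x"}, "im-sx")))   # ['i', 'm']
print(sorted(apply_setting({"i", "s"}, "^m")))      # ['m']
```

Letters after the hyphen are processed last, which is why a letter that appears on both sides ends up unset.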
1374
1375 The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be
1376 changed in the same way as the Perl-compatible options by using the
1377 characters J and U respectively. However, these are not unset by (?^).
1378
1379 When one of these option changes occurs at top level (that is, not
1380 inside subpattern parentheses), the change applies to the remainder of
1381 the pattern that follows. An option change within a subpattern (see
1382 below for a description of subpatterns) affects only that part of the
1383 subpattern that follows it, so
1384
1385 (a(?i)b)c
1386
1387 matches abc and aBc and no other strings (assuming PCRE2_CASELESS is
1388 not used). By this means, options can be made to have different set‐
1389 tings in different parts of the pattern. Any changes made in one alter‐
1390 native do carry on into subsequent branches within the same subpattern.
1391 For example,
1392
1393 (a(?i)b|c)
1394
1395 matches "ab", "aB", "c", and "C", even though when matching "C" the
1396 first branch is abandoned before the option setting. This is because
1397 the effects of option settings happen at compile time. There would be
1398 some very weird behaviour otherwise.
1399
1400 As a convenient shorthand, if any option settings are required at the
1401 start of a non-capturing subpattern (see the next section), the option
1402 letters may appear between the "?" and the ":". Thus the two patterns
1403
1404 (?i:saturday|sunday)
1405 (?:(?i)saturday|sunday)
1406
1407 match exactly the same set of strings.
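
The scoped form is easy to experiment with. The sketch below uses Python's re module, which is an assumption of the example and not PCRE2; recent versions of re accept only the (?i:...) spelling in the middle of a pattern, so the bare (?i) variant is not shown. The pattern a(?i:b)c is illustrative.

```python
import re

# (?i:...) limits caseless matching to the group it opens.
days = re.compile(r"(?i:saturday|sunday)")
scoped = re.compile(r"a(?i:b)c")

print(days.fullmatch("SUNDAY") is not None)    # True
print(scoped.fullmatch("aBc") is not None)     # True
print(scoped.fullmatch("ABc") is not None)     # False
```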
1408
1409 Note: There are other PCRE2-specific options that can be set by the
1410 application when the compiling function is called. The pattern can con‐
1411 tain special leading sequences such as (*CRLF) to override what the
1412 application has set or what has been defaulted. Details are given in
1413 the section entitled "Newline sequences" above. There are also the
1414 (*UTF) and (*UCP) leading sequences that can be used to set UTF and
1415 Unicode property modes; they are equivalent to setting the PCRE2_UTF
1416 and PCRE2_UCP options, respectively. However, the application can set
1417 the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP options, which lock out the use
1418 of the (*UTF) and (*UCP) sequences.
1419
1421
1422 Subpatterns are delimited by parentheses (round brackets), which can be
1423 nested. Turning part of a pattern into a subpattern does two things:
1424
1425 1. It localizes a set of alternatives. For example, the pattern
1426
1427 cat(aract|erpillar|)
1428
1429 matches "cataract", "caterpillar", or "cat". Without the parentheses,
1430 it would match "cataract", "erpillar" or an empty string.
1431
1432 2. It sets up the subpattern as a capturing subpattern. This means
1433 that, when the whole pattern matches, the portion of the subject string
1434 that matched the subpattern is passed back to the caller, separately
1435 from the portion that matched the whole pattern. (This applies only to
1436 the traditional matching function; the DFA matching function does not
1437 support capturing.)
1438
1439 Opening parentheses are counted from left to right (starting from 1) to
1440 obtain numbers for the capturing subpatterns. For example, if the
1441 string "the red king" is matched against the pattern
1442
1443 the ((red|white) (king|queen))
1444
1445 the captured substrings are "red king", "red", and "king", and are num‐
1446 bered 1, 2, and 3, respectively.
1447
1448 The fact that plain parentheses fulfil two functions is not always
1449 helpful. There are often times when a grouping subpattern is required
1450 without a capturing requirement. If an opening parenthesis is followed
1451 by a question mark and a colon, the subpattern does not do any captur‐
1452 ing, and is not counted when computing the number of any subsequent
1453 capturing subpatterns. For example, if the string "the white queen" is
1454 matched against the pattern
1455
1456 the ((?:red|white) (king|queen))
1457
1458 the captured substrings are "white queen" and "queen", and are numbered
1459 1 and 2. The maximum number of capturing subpatterns is 65535.
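
The two numbering examples above can be reproduced directly. This sketch uses Python's re module as a stand-in for PCRE2 (an assumption of the example; the two agree on how plain and (?:...) parentheses are counted).

```python
import re

# Opening parentheses are counted left to right, starting from 1.
m = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
print(m.groups())        # ('red king', 'red', 'king')

# (?:...) groups without capturing and is skipped when numbering.
m2 = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")
print(m2.groups())       # ('white queen', 'queen')
```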
1460
1461 As a convenient shorthand, if any option settings are required at the
1462 start of a non-capturing subpattern, the option letters may appear
1463 between the "?" and the ":". Thus the two patterns
1464
1465 (?i:saturday|sunday)
1466 (?:(?i)saturday|sunday)
1467
1468 match exactly the same set of strings. Because alternative branches are
1469 tried from left to right, and options are not reset until the end of
1470 the subpattern is reached, an option setting in one branch does affect
1471 subsequent branches, so the above patterns match "SUNDAY" as well as
1472 "Saturday".
1473
1475
1476 Perl 5.10 introduced a feature whereby each alternative in a subpattern
1477 uses the same numbers for its capturing parentheses. Such a subpattern
1478 starts with (?| and is itself a non-capturing subpattern. For example,
1479 consider this pattern:
1480
1481 (?|(Sat)ur|(Sun))day
1482
1483 Because the two alternatives are inside a (?| group, both sets of cap‐
1484 turing parentheses are numbered one. Thus, when the pattern matches,
1485 you can look at captured substring number one, whichever alternative
1486 matched. This construct is useful when you want to capture part, but
1487 not all, of one of a number of alternatives. Inside a (?| group, paren‐
1488 theses are numbered as usual, but the number is reset at the start of
1489 each branch. The numbers of any capturing parentheses that follow the
1490 subpattern start after the highest number used in any branch. The fol‐
1491 lowing example is taken from the Perl documentation. The numbers under‐
1492 neath show in which buffer the captured content will be stored.
1493
1494 # before ---------------branch-reset----------- after
1495 / ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
1496 # 1 2 2 3 2 3 4
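
The numbering rule can be modelled in a few lines. The function below is a hypothetical helper written for this page, not part of PCRE2, and it is deliberately simplified: it assumes a well-formed pattern and ignores escapes, character classes, and named groups. It only counts plain capturing parentheses and resets the count at each branch of a (?| group.

```python
def number_groups(pattern):
    """Return the group number given to each plain capturing parenthesis."""
    numbers = []
    stack = []   # None for ordinary groups, [start, highest] for (?| groups
    count = 0
    i = 0
    while i < len(pattern):
        if pattern.startswith("(?|", i):
            stack.append([count, count])
            i += 3
            continue
        ch = pattern[i]
        if ch == "(":
            if not pattern.startswith("(?", i):
                count += 1
                numbers.append(count)
            stack.append(None)
        elif ch == "|" and stack and stack[-1] is not None:
            frame = stack[-1]
            frame[1] = max(frame[1], count)   # remember the highest number
            count = frame[0]                  # reset for the next branch
        elif ch == ")":
            frame = stack.pop()
            if frame is not None:
                count = max(frame[1], count)  # continue after the highest
        i += 1
    return numbers


print(number_groups("(a)(?|x(y)z|(p(q)r)|(t)u(v))(z)"))
# [1, 2, 2, 3, 2, 3, 4]
```

The printed list agrees with the numbers shown under the Perl example.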
1497
1498 A backreference to a numbered subpattern uses the most recent value
1499 that is set for that number by any subpattern. The following pattern
1500 matches "abcabc" or "defdef":
1501
1502 /(?|(abc)|(def))\1/
1503
1504 In contrast, a subroutine call to a numbered subpattern always refers
1505 to the first one in the pattern with the given number. The following
1506 pattern matches "abcabc" or "defabc":
1507
1508 /(?|(abc)|(def))(?1)/
1509
1510 A relative reference such as (?-1) is no different: it is just a conve‐
1511 nient way of computing an absolute group number.
1512
1513 If a condition test for whether a subpattern has matched refers to a
1514 non-unique number, the test is true if any of the subpatterns with that
1515 number have matched.
1516
1517 An alternative approach to using this "branch reset" feature is to use
1518 duplicate named subpatterns, as described in the next section.
1519
1521
1522 Identifying capturing parentheses by number is simple, but it can be
1523 very hard to keep track of the numbers in complicated patterns. Fur‐
1524 thermore, if an expression is modified, the numbers may change. To help
1525 with this difficulty, PCRE2 supports the naming of capturing subpat‐
1526 terns. This feature was not added to Perl until release 5.10. Python
1527 had the feature earlier, and PCRE1 introduced it at release 4.0, using
1528 the Python syntax. PCRE2 supports both the Perl and the Python syntax.
1529
1530 In PCRE2, a capturing subpattern can be named in one of three ways:
1531 (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python.
1532 Names consist of up to 32 alphanumeric characters and underscores, but
1533 must start with a non-digit. References to capturing parentheses from
1534 other parts of the pattern, such as backreferences, recursion, and con‐
1535 ditions, can all be made by name as well as by number.
1536
1537 Named capturing parentheses are allocated numbers as well as names,
1538 exactly as if the names were not present. In both PCRE2 and Perl, cap‐
1539 turing subpatterns are primarily identified by numbers; any names are
1540 just aliases for these numbers. The PCRE2 API provides function calls
1541 for extracting the complete name-to-number translation table from a
1542 compiled pattern, as well as convenience functions for extracting cap‐
1543 tured substrings by name.
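
The idea that names are aliases for numbers can be illustrated with the Python spelling (?P<name>...), which PCRE2 also accepts. The sketch uses Python's re module, not the PCRE2 API; its groupindex attribute plays the role of a name-to-number table here. The date pattern and subject are made up for illustration.

```python
import re

# Two named groups and one unnamed group; all three are numbered.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})")

# Names are aliases for ordinary group numbers.
print(dict(date.groupindex))     # {'year': 1, 'month': 2}

m = date.fullmatch("2024-05-17")
print(m.group("month"), m.group(2), m.group(3))   # 05 05 17
```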
1544
1545 Warning: When more than one subpattern has the same number, as
1546 described in the previous section, a name given to one of them applies
1547 to all of them. Perl allows identically numbered subpatterns to have
1548 different names. Consider this pattern, where there are two capturing
1549 subpatterns, both numbered 1:
1550
1551 (?|(?<AA>aa)|(?<BB>bb))
1552
1553 Perl allows this, with both names AA and BB as aliases of group 1.
1554 Thus, after a successful match, both names yield the same value (either
1555 "aa" or "bb").
1556
1557 In an attempt to reduce confusion, PCRE2 does not allow the same group
1558 number to be associated with more than one name. The example above pro‐
1559 vokes a compile-time error. However, there is still scope for confu‐
1560 sion. Consider this pattern:
1561
1562 (?|(?<AA>aa)|(bb))
1563
1564 Although the second subpattern number 1 is not explicitly named, the
1565 name AA is still an alias for subpattern 1. Whether the pattern matches
1566 "aa" or "bb", a reference by name to group AA yields the matched
1567 string.
1568
1569 By default, a name must be unique within a pattern, except that dupli‐
1570 cate names are permitted for subpatterns with the same number, for
1571 example:
1572
1573 (?|(?<AA>aa)|(?<AA>bb))
1574
1575 The duplicate name constraint can be disabled by setting the PCRE2_DUP‐
1576 NAMES option at compile time, or by the use of (?J) within the pattern.
1577 Duplicate names can be useful for patterns where only one instance of
1578 the named parentheses can match. Suppose you want to match the name of
1579 a weekday, either as a 3-letter abbreviation or as the full name, and
1580 in both cases you want to extract the abbreviation. This pattern
1581 (ignoring the line breaks) does the job:
1582
1583 (?<DN>Mon|Fri|Sun)(?:day)?|
1584 (?<DN>Tue)(?:sday)?|
1585 (?<DN>Wed)(?:nesday)?|
1586 (?<DN>Thu)(?:rsday)?|
1587 (?<DN>Sat)(?:urday)?
1588
1589 There are five capturing subpatterns, but only one is ever set after a
1590 match. The convenience functions for extracting the data by name
1591 return the substring for the first (and in this example, the only)
1592 subpattern of that name that matched. This saves searching to find
1593 which numbered subpattern it was. (An alternative way of solving this
1594 problem is to use a "branch reset" subpattern, as described in the pre‐
1595 vious section.)
1596
1597 If you make a backreference to a non-unique named subpattern from else‐
1598 where in the pattern, the subpatterns to which the name refers are
1599 checked in the order in which they appear in the overall pattern. The
1600 first one that is set is used for the reference. For example, this pat‐
1601 tern matches both "foofoo" and "barbar" but not "foobar" or "barfoo":
1602
1603 (?:(?<n>foo)|(?<n>bar))\k<n>
1604
1605
1606 If you make a subroutine call to a non-unique named subpattern, the one
1607 that corresponds to the first occurrence of the name is used. In the
1608 absence of duplicate numbers this is the one with the lowest number.
1609
1610 If you use a named reference in a condition test (see the section about
1611 conditions below), either to check whether a subpattern has matched, or
1612 to check for recursion, all subpatterns with the same name are tested.
1613 If the condition is true for any one of them, the overall condition is
1614 true. This is the same behaviour as testing by number. For further
1615 details of the interfaces for handling named subpatterns, see the
1616 pcre2api documentation.
1617
1619
1620 Repetition is specified by quantifiers, which can follow any of the
1621 following items:
1622
1623 a literal data character
1624 the dot metacharacter
1625 the \C escape sequence
1626 the \X escape sequence
1627 the \R escape sequence
1628 an escape such as \d or \pL that matches a single character
1629 a character class
1630 a backreference
1631 a parenthesized subpattern (including most assertions)
1632 a subroutine call to a subpattern (recursive or otherwise)
1633
1634 The general repetition quantifier specifies a minimum and maximum num‐
1635 ber of permitted matches, by giving the two numbers in curly brackets
1636 (braces), separated by a comma. The numbers must be less than 65536,
1637 and the first must be less than or equal to the second. For example:
1638
1639 z{2,4}
1640
1641 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
1642 special character. If the second number is omitted, but the comma is
1643 present, there is no upper limit; if the second number and the comma
1644 are both omitted, the quantifier specifies an exact number of required
1645 matches. Thus
1646
1647 [aeiou]{3,}
1648
1649 matches at least 3 successive vowels, but may match many more, whereas
1650
1651 \d{8}
1652
1653 matches exactly 8 digits. An opening curly bracket that appears in a
1654 position where a quantifier is not allowed, or one that does not match
1655 the syntax of a quantifier, is taken as a literal character. For exam‐
1656 ple, {,6} is not a quantifier, but a literal string of four characters.
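
The bounded quantifiers above can be checked quickly. This sketch uses Python's re module as a stand-in for PCRE2, and matches() is a hypothetical helper. The {,6} case is deliberately left out, because Python treats {,6} as a quantifier with a minimum of zero, unlike the behaviour described here.

```python
import re


def matches(pattern, subject):
    """True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None


print([s for s in ("z", "zz", "zzzz", "zzzzz") if matches(r"z{2,4}", s)])
# ['zz', 'zzzz']
print(matches(r"[aeiou]{3,}", "aeiouaeiou"))   # True: no upper limit
print(matches(r"\d{8}", "1234567"))            # False: exactly 8 required
print(matches(r"ab}", "ab}"))                  # True: a lone } is literal
```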
1657
1658 In UTF modes, quantifiers apply to characters rather than to individual
1659 code units. Thus, for example, \x{100}{2} matches two characters, each
1660 of which is represented by a two-byte sequence in a UTF-8 string. Simi‐
1661 larly, \X{3} matches three Unicode extended grapheme clusters, each of
1662 which may be several code units long (and they may be of different
1663 lengths).
1664
1665 The quantifier {0} is permitted, causing the expression to behave as if
1666 the previous item and the quantifier were not present. This may be use‐
1667 ful for subpatterns that are referenced as subroutines from elsewhere
1668 in the pattern (but see also the section entitled "Defining subpatterns
1669 for use by reference only" below). Items other than subpatterns that
1670 have a {0} quantifier are omitted from the compiled pattern.
1671
1672 For convenience, the three most common quantifiers have single-charac‐
1673 ter abbreviations:
1674
1675 * is equivalent to {0,}
1676 + is equivalent to {1,}
1677 ? is equivalent to {0,1}
1678
1679 It is possible to construct infinite loops by following a subpattern
1680 that can match no characters with a quantifier that has no upper limit,
1681 for example:
1682
1683 (a?)*
1684
1685 Earlier versions of Perl and PCRE1 used to give an error at compile
1686 time for such patterns. However, because there are cases where this can
1687 be useful, such patterns are now accepted, but if any repetition of the
1688 subpattern does in fact match no characters, the loop is forcibly bro‐
1689 ken.
1690
1691 By default, the quantifiers are "greedy", that is, they match as much
1692 as possible (up to the maximum number of permitted times), without
1693 causing the rest of the pattern to fail. The classic example of where
1694 this gives problems is in trying to match comments in C programs. These
1695 appear between /* and */ and within the comment, individual * and /
1696 characters may appear. An attempt to match C comments by applying the
1697 pattern
1698
1699 /\*.*\*/
1700
1701 to the string
1702
1703 /* first comment */ not comment /* second comment */
1704
1705 fails, because it matches the entire string owing to the greediness of
1706 the .* item.
1707
1708 If a quantifier is followed by a question mark, it ceases to be greedy,
1709 and instead matches the minimum number of times possible, so the pat‐
1710 tern
1711
1712 /\*.*?\*/
1713
1714 does the right thing with the C comments. The meaning of the various
1715 quantifiers is not otherwise changed, just the preferred number of
1716 matches. Do not confuse this use of question mark with its use as a
1717 quantifier in its own right. Because it has two uses, it can sometimes
1718 appear doubled, as in
1719
1720 \d??\d
1721
1722 which matches one digit by preference, but can match two if that is the
1723 only way the rest of the pattern matches.
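
The difference between the greedy and minimizing forms shows up clearly on the comment example. The sketch below uses Python's re module as a stand-in for PCRE2 (an assumption of the example; greedy and lazy quantifiers behave the same way in both for these patterns). The pattern \d??\dx is an illustrative variant.

```python
import re

source = "/* first comment */ not comment /* second comment */"

greedy = re.findall(r"/\*.*\*/", source)
lazy = re.findall(r"/\*.*?\*/", source)

print(len(greedy))  # 1: a single match spanning the whole string
print(lazy)         # ['/* first comment */', '/* second comment */']

# \d??\d prefers one digit, but takes two if the rest requires it.
one = re.match(r"\d??\d", "12").group()       # '1'
two = re.match(r"\d??\dx", "12x").group()     # '12x'
print(one, two)
```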
1724
1725 If the PCRE2_UNGREEDY option is set (an option that is not available in
1726 Perl), the quantifiers are not greedy by default, but individual ones
1727 can be made greedy by following them with a question mark. In other
1728 words, it inverts the default behaviour.
1729
1730 When a parenthesized subpattern is quantified with a minimum repeat
1731 count that is greater than 1 or with a limited maximum, more memory is
1732 required for the compiled pattern, in proportion to the size of the
1733 minimum or maximum.
1734
1735 If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option
1736 (equivalent to Perl's /s) is set, thus allowing the dot to match new‐
1737 lines, the pattern is implicitly anchored, because whatever follows
1738 will be tried against every character position in the subject string,
1739 so there is no point in retrying the overall match at any position
1740 after the first. PCRE2 normally treats such a pattern as though it were
1741 preceded by \A.
1742
1743 In cases where it is known that the subject string contains no new‐
1744 lines, it is worth setting PCRE2_DOTALL in order to obtain this opti‐
1745 mization, or alternatively, using ^ to indicate anchoring explicitly.
1746
1747 However, there are some cases where the optimization cannot be used.
1748 When .* is inside capturing parentheses that are the subject of a
1749 backreference elsewhere in the pattern, a match at the start may fail
1750 where a later one succeeds. Consider, for example:
1751
1752 (.*)abc\1
1753
1754 If the subject is "xyz123abc123" the match point is the fourth charac‐
1755 ter. For this reason, such a pattern is not implicitly anchored.
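
The backreference case can be checked with a short experiment. The sketch uses Python's re module as a stand-in for PCRE2, which is an assumption of the example; re.match stands for an attempt that is tied to the start of the subject, and re.search for the normal unanchored search.

```python
import re

pattern = r"(.*)abc\1"
subject = "xyz123abc123"

# An attempt tied to the start of the subject fails ...
anchored = re.match(pattern, subject, re.DOTALL)

# ... but a later starting position succeeds.
found = re.search(pattern, subject, re.DOTALL)

print(anchored)                        # None
print(found.start(), found.group(1))   # 3 123
```

Offset 3 is the fourth character, which is why treating the pattern as anchored would give the wrong answer.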
1756
1757 Another case where implicit anchoring is not applied is when the lead‐
1758 ing .* is inside an atomic group. Once again, a match at the start may
1759 fail where a later one succeeds. Consider this pattern:
1760
1761 (?>.*?a)b
1762
1763 It matches "ab" in the subject "aab". The use of the backtracking con‐
1764 trol verbs (*PRUNE) and (*SKIP) also disables this optimization, and
1765 there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
1766
1767 When a capturing subpattern is repeated, the value captured is the sub‐
1768 string that matched the final iteration. For example, after
1769
1770 (tweedle[dume]{3}\s*)+
1771
1772 has matched "tweedledum tweedledee" the value of the captured substring
1773 is "tweedledee". However, if there are nested capturing subpatterns,
1774 the corresponding captured values may have been set in previous itera‐
1775 tions. For example, after
1776
1777 (a|(b))+
1778
1779 matches "aba" the value of the second captured substring is "b".
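
Both examples can be reproduced with a short script. The sketch uses Python's re module as a stand-in for PCRE2; this is an assumption of the example, and engines are known to differ on whether nested captures survive later iterations, although re agrees with the behaviour described here.

```python
import re

# The outer group keeps only what its final iteration matched.
m = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
print(m.group(1))    # tweedledee

# The nested group keeps the value set in an earlier iteration.
n = re.fullmatch(r"(a|(b))+", "aba")
print(n.group(1), n.group(2))   # a b
```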
1780
1782
1783 With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
1784 repetition, failure of what follows normally causes the repeated item
1785 to be re-evaluated to see if a different number of repeats allows the
1786 rest of the pattern to match. Sometimes it is useful to prevent this,
1787 either to change the nature of the match, or to cause it to fail earlier
1788 than it otherwise might, when the author of the pattern knows there is
1789 no point in carrying on.
1790
1791 Consider, for example, the pattern \d+foo when applied to the subject
1792 line
1793
1794 123456bar
1795
1796 After matching all 6 digits and then failing to match "foo", the normal
1797 action of the matcher is to try again with only 5 digits matching the
1798 \d+ item, and then with 4, and so on, before ultimately failing.
1799 "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
1800 the means for specifying that once a subpattern has matched, it is not
1801 to be re-evaluated in this way.
1802
1803 If we use atomic grouping for the previous example, the matcher gives
1804 up immediately on failing to match "foo" the first time. The notation
1805 is a kind of special parenthesis, starting with (?> as in this example:
1806
1807 (?>\d+)foo
1808
1809 This kind of parenthesis "locks up" the part of the pattern it con‐
1810 tains once it has matched, and a failure further into the pattern is
1811 prevented from backtracking into it. Backtracking past it to previous
1812 items, however, works as normal.
1813
1814 An alternative description is that a subpattern of this type matches
1815 exactly the string of characters that an identical standalone pattern
1816 would match, if anchored at the current point in the subject string.
1817
1818 Atomic grouping subpatterns are not capturing subpatterns. Simple cases
1819 such as the above example can be thought of as a maximizing repeat that
1820 must swallow everything it can. So, while both \d+ and \d+? are pre‐
1821 pared to adjust the number of digits they match in order to make the
1822 rest of the pattern match, (?>\d+) can only match an entire sequence of
1823 digits.
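
The "swallow everything" behaviour can change the result of a match, not only its speed. The sketch below uses Python's re module, which gained (?>...) only in version 3.11, so it emulates the atomic group (?>a+) with the different but well-known construct (?=(a+))\1: the lookahead finds the longest run, and the backreference then consumes exactly that text. Both the emulation and the pattern a+ab are assumptions of this example, not PCRE2 features described above.

```python
import re

subject = "aaab"

# Ordinary repeat: a+ gives back one "a" so that "ab" can match.
normal = re.search(r"a+ab", subject)

# Emulated (?>a+)ab: the lookahead captures the longest run of "a"s and
# \1 consumes exactly that text, with no chance to give anything back.
atomic = re.search(r"(?=(a+))\1ab", subject)

print(normal.group())   # aaab
print(atomic)           # None
```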
1824
1825 Atomic groups in general can of course contain arbitrarily complicated
1826 subpatterns, and can be nested. However, when the subpattern for an
1827 atomic group is just a single repeated item, as in the example above, a
1828 simpler notation, called a "possessive quantifier" can be used. This
1829 consists of an additional + character following a quantifier. Using
1830 this notation, the previous example can be rewritten as
1831
1832 \d++foo
1833
1834 Note that a possessive quantifier can be used with an entire group, for
1835 example:
1836
1837 (abc|xyz){2,3}+
1838
1839 Possessive quantifiers are always greedy; the setting of the
1840 PCRE2_UNGREEDY option is ignored. They are a convenient notation for
1841 the simpler forms of atomic group. However, there is no difference in
1842 the meaning of a possessive quantifier and the equivalent atomic group,
1843 though there may be a performance difference; possessive quantifiers
1844 should be slightly faster.
1845
1846 The possessive quantifier syntax is an extension to the Perl 5.8 syn‐
1847 tax. Jeffrey Friedl originated the idea (and the name) in the first
1848 edition of his book. Mike McCloskey liked it, so implemented it when he
1849 built Sun's Java package, and PCRE1 copied it from there. It ultimately
1850 found its way into Perl at release 5.10.
1851
1852 PCRE2 has an optimization that automatically "possessifies" certain
1853 simple pattern constructs. For example, the sequence A+B is treated as
1854 A++B because there is no point in backtracking into a sequence of A's
1855 when B must follow. This feature can be disabled by the PCRE2_NO_AUTO‐
1856 POSSESS option, or starting the pattern with (*NO_AUTO_POSSESS).
1857
1858 When a pattern contains an unlimited repeat inside a subpattern that
1859 can itself be repeated an unlimited number of times, the use of an
1860 atomic group is the only way to avoid some failing matches taking a
1861 very long time indeed. The pattern
1862
1863 (\D+|<\d+>)*[!?]
1864
1865 matches an unlimited number of substrings that either consist of non-
1866 digits, or digits enclosed in <>, followed by either ! or ?. When it
1867 matches, it runs quickly. However, if it is applied to
1868
1869 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1870
1871 it takes a long time before reporting failure. This is because the
1872 string can be divided between the internal \D+ repeat and the external
1873 * repeat in a large number of ways, and all have to be tried. (The
1874 example uses [!?] rather than a single character at the end, because
1875 both PCRE2 and Perl have an optimization that allows for fast failure
1876 when a single character is used. They remember the last single charac‐
1877 ter that is required for a match, and fail early if it is not present
1878 in the string.) If the pattern is changed so that it uses an atomic
1879 group, like this:
1880
1881 ((?>\D+)|<\d+>)*[!?]
1882
1883 sequences of non-digits cannot be broken, and failure happens quickly.
1884
1886
1887 Outside a character class, a backslash followed by a digit greater than
1888 0 (and possibly further digits) is a backreference to a capturing sub‐
1889 pattern earlier (that is, to its left) in the pattern, provided there
1890 have been that many previous capturing left parentheses.
1891
1892 However, if the decimal number following the backslash is less than 8,
1893 it is always taken as a backreference, and causes an error only if
1894 there are not that many capturing left parentheses in the entire pat‐
1895 tern. In other words, the parentheses that are referenced need not be
1896 to the left of the reference for numbers less than 8. A "forward back‐
1897 reference" of this type can make sense when a repetition is involved
1898 and the subpattern to the right has participated in an earlier itera‐
1899 tion.
1900
1901 It is not possible to have a numerical "forward backreference" to a
1902 subpattern whose number is 8 or more using this syntax because a
1903 sequence such as \50 is interpreted as a character defined in octal.
1904 See the subsection entitled "Non-printing characters" above for further
1905 details of the handling of digits following a backslash. There is no
1906 such problem when named parentheses are used. A backreference to any
1907 subpattern is possible using named parentheses (see below).
1908
1909 Another way of avoiding the ambiguity inherent in the use of digits
1910 following a backslash is to use the \g escape sequence. This escape
1911 must be followed by a signed or unsigned number, optionally enclosed in
1912 braces. These examples are all identical:
1913
1914 (ring), \1
1915 (ring), \g1
1916 (ring), \g{1}
1917
1918 An unsigned number specifies an absolute reference without the ambigu‐
1919 ity that is present in the older syntax. It is also useful when literal
1920 digits follow the reference. A signed number is a relative reference.
1921 Consider this example:
1922
1923 (abc(def)ghi)\g{-1}
1924
1925 The sequence \g{-1} is a reference to the most recently started captur‐
1926 ing subpattern before \g, that is, it is equivalent to \2 in this exam‐
1927 ple. Similarly, \g{-2} would be equivalent to \1. The use of relative
1928 references can be helpful in long patterns, and also in patterns that
1929 are created by joining together fragments that contain references
1930 within themselves.
1931
1932 The sequence \g{+1} is a reference to the next capturing subpattern.
1933 This kind of forward reference can be useful in patterns that repeat.
1934 Perl does not support the use of + in this way.
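
A relative reference is only arithmetic on group numbers. The helper below is a hypothetical illustration written for this page, not a PCRE2 function; it assumes the caller already knows how many capturing parentheses were opened to the left of the \g item, and it rejects a zero offset as a choice of this sketch.

```python
def absolute_group(opened_so_far, offset):
    """Convert a signed \\g offset into an absolute group number.

    opened_so_far is the number of capturing parentheses opened to the
    left of the \\g item; offset is the signed number, e.g. -1 or +1.
    """
    if offset == 0:
        raise ValueError("a relative reference needs a non-zero offset")
    if offset < 0:
        number = opened_so_far + offset + 1
    else:
        number = opened_so_far + offset
    if number < 1:
        raise ValueError("reference points before the first group")
    return number


# In (abc(def)ghi)\g{-1}, two groups are open to the left of \g.
print(absolute_group(2, -1))   # 2
print(absolute_group(2, -2))   # 1
print(absolute_group(2, +1))   # 3
```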
1935
1936 A backreference matches whatever actually matched the capturing subpat‐
1937 tern in the current subject string, rather than anything matching the
1938 subpattern itself (see "Subpatterns as subroutines" below for a way of
1939 doing that). So the pattern
1940
1941 (sens|respons)e and \1ibility
1942
1943 matches "sense and sensibility" and "response and responsibility", but
1944 not "sense and responsibility". If caseful matching is in force at the
1945 time of the backreference, the case of letters is relevant. For exam‐
1946 ple,
1947
1948 ((?i)rah)\s+\1
1949
1950 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
1951 original capturing subpattern is matched caselessly.
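
Both points can be checked with a short script. The sketch uses Python's re module as a stand-in for PCRE2, which is an assumption of the example. Because recent versions of re reject a bare (?i) in the middle of a pattern, the second pattern uses the scoped form ((?i:rah)), which limits caseless matching to the group in the same way.

```python
import re

pair = re.compile(r"(sens|respons)e and \1ibility")
print(pair.fullmatch("sense and sensibility") is not None)       # True
print(pair.fullmatch("sense and responsibility") is not None)    # False

# The group is matched caselessly, but \1 is compared casefully.
rah = re.compile(r"((?i:rah))\s+\1")
accepted = [s for s in ("rah rah", "RAH RAH", "RAH rah") if rah.fullmatch(s)]
print(accepted)   # ['rah rah', 'RAH RAH']
```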
1952
1953 There are several different ways of writing backreferences to named
1954 subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
1955 \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's
1956 unified backreference syntax, in which \g can be used for both numeric
1957 and named references, is also supported. We could rewrite the above
1958 example in any of the following ways:
1959
1960 (?<p1>(?i)rah)\s+\k<p1>
1961 (?'p1'(?i)rah)\s+\k{p1}
1962 (?P<p1>(?i)rah)\s+(?P=p1)
1963 (?<p1>(?i)rah)\s+\g{p1}
1964
1965 A subpattern that is referenced by name may appear in the pattern
1966 before or after the reference.
1967
1968 There may be more than one backreference to the same subpattern. If a
1969 subpattern has not actually been used in a particular match, any back‐
1970 references to it always fail by default. For example, the pattern
1971
1972 (a|(bc))\2
1973
1974 always fails if it starts to match "a" rather than "bc". However, if
1975 the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backref‐
1976 erence to an unset value matches an empty string.
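
The default behaviour for an unset group can be observed as follows. The sketch uses Python's re module as a stand-in for PCRE2 (an assumption of the example); it shows only the default case, not the effect of PCRE2_MATCH_UNSET_BACKREF. The subjects are made up for illustration.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 took part in the match, so \2 can match "bc" again.
set_case = pattern.fullmatch("bcbc")

# The first alternative leaves group 2 unset, so \2 always fails.
unset_case = pattern.search("aa")

print(set_case is not None)   # True
print(unset_case)             # None
```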
1977
1978 Because there may be many capturing parentheses in a pattern, all dig‐
1979 its following a backslash are taken as part of a potential backrefer‐
1980 ence number. If the pattern continues with a digit character, some
1981 delimiter must be used to terminate the backreference. If the
1982 PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, this can be white
1983 space. Otherwise, the \g{ syntax or an empty comment (see "Comments"
1984 below) can be used.
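
The white space delimiter can be sketched with Python's re module, where the re.VERBOSE flag plays the role of PCRE2_EXTENDED (an assumption made for illustration). Python does not accept \g{...} inside a pattern, so only the white space form is shown; the name DELIMITED is hypothetical.

```python
import re

# In verbose mode unescaped white space in the pattern is ignored, so the
# space after \1 ends the backreference number and "2" is a literal digit.
DELIMITED = re.compile(r"(a) \1 2", re.VERBOSE)

repeated = DELIMITED.fullmatch("aa2")
not_repeated = DELIMITED.fullmatch("a2")
```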
1985
1986 Recursive backreferences
1987
1988 A backreference that occurs inside the parentheses to which it refers
1989 fails when the subpattern is first used, so, for example, (a\1) never
1990 matches. However, such references can be useful inside repeated sub‐
1991 patterns. For example, the pattern
1992
1993 (a|b\1)+
1994
1995 matches any number of "a"s and also "aba", "ababbaa" etc. At each iter‐
1996 ation of the subpattern, the backreference matches the character string
1997 corresponding to the previous iteration. In order for this to work, the
1998 pattern must be such that the first iteration does not need to match
1999 the backreference. This can be done using alternation, as in the exam‐
2000 ple above, or by a quantifier with a minimum of zero.
2001
2002 Backreferences of this type cause the group that they reference to be
2003 treated as an atomic group. Once the whole group has been matched, a
2004 subsequent matching failure cannot cause backtracking into the middle
2005 of the group.
2006
2008
2009 An assertion is a test on the characters following or preceding the
2010 current matching point that does not consume any characters. The simple
2011 assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described
2012 above.
2013
2014 More complicated assertions are coded as subpatterns. There are two
2015 kinds: those that look ahead of the current position in the subject
2016 string, and those that look behind it, and in each case an assertion
2017 may be positive (must succeed for matching to continue) or negative
2018 (must not succeed for matching to continue). An assertion subpattern is
2019 matched in the normal way, except that, when matching continues after a
2020 successful assertion, the matching position in the subject string is as
2021 it was before the assertion was processed.
2022
2023 Assertion subpatterns are not capturing subpatterns. If an assertion
2024 contains capturing subpatterns within it, these are counted for the
2025 purposes of numbering the capturing subpatterns in the whole pattern.
2026 Within each branch of an assertion, locally captured substrings may be
2027 referenced in the usual way. For example, a sequence such as (.)\g{-1}
2028 can be used to check that two adjacent characters are the same.
2029
2030 When a branch within an assertion fails to match, any substrings that
2031 were captured are discarded (as happens with any pattern branch that
2032 fails to match). A negative assertion succeeds only when all its
2033 branches fail to match; this means that no captured substrings are ever
2034 retained after a successful negative assertion. When an assertion con‐
2035 tains a matching branch, what happens depends on the type of assertion.
2036
2037 For a positive assertion, internally captured substrings in the suc‐
2038 cessful branch are retained, and matching continues with the next pat‐
2039 tern item after the assertion. For a negative assertion, a matching
2040 branch means that the assertion has failed. If the assertion is being
2041 used as a condition in a conditional subpattern (see below), captured
2042 substrings are retained, because matching continues with the "no"
2043 branch of the condition. For other failing negative assertions, control
2044 passes to the previous backtracking point, thus discarding any captured
2045 strings within the assertion.
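
The difference between positive and negative assertions in their treatment of captured substrings can be seen in a short sketch. Python's re module is used as a stand-in for PCRE2 (an assumption), with \1 in place of the relative form \g{-1}, which Python does not accept; the variable names are hypothetical.

```python
import re

# Positive lookahead: the capture made inside the assertion is retained,
# and the assertion itself consumes no characters.
doubled = re.search(r"(?=(.)\1)", "abccd")

# Negative lookahead: it succeeds only because its branch fails to match,
# so group 1 is left unset afterwards.
negated = re.search(r"(?!(a))b", "b")
```

Here doubled is an empty match at offset 2 with "c" in group 1, whereas negated matches "b" with group 1 unset.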
2046
2047 For compatibility with Perl, most assertion subpatterns may be
2048 repeated; though it makes no sense to assert the same thing several
2049 times, the side effect of capturing parentheses may occasionally be
2049 useful. However, an assertion that forms the condition for a
2050 conditional subpattern may not be quantified. In practice, for other
2051 assertions, there are only three cases:
2053
2054 (1) If the quantifier is {0}, the assertion is never obeyed during
2055 matching. However, it may contain internal capturing parenthesized
2056 groups that are called from elsewhere via the subroutine mechanism.
2057
2057 (2) If the quantifier is {0,n} where n is greater than zero, it is treated
2059 as if it were {0,1}. At run time, the rest of the pattern match is
2060 tried with and without the assertion, the order depending on the greed‐
2061 iness of the quantifier.
2062
2063 (3) If the minimum repetition is greater than zero, the quantifier is
2064 ignored. The assertion is obeyed just once when encountered during
2065 matching.
2066
2067 Lookahead assertions
2068
2069 Lookahead assertions start with (?= for positive assertions and (?! for
2070 negative assertions. For example,
2071
2072 \w+(?=;)
2073
2074 matches a word followed by a semicolon, but does not include the semi‐
2075 colon in the match, and
2076
2077 foo(?!bar)
2078
2079 matches any occurrence of "foo" that is not followed by "bar". Note
2080 that the apparently similar pattern
2081
2082 (?!foo)bar
2083
2084 does not find an occurrence of "bar" that is preceded by something
2085 other than "foo"; it finds any occurrence of "bar" whatsoever, because
2086 the assertion (?!foo) is always true when the next three characters are
2087 "bar". A lookbehind assertion is needed to achieve the other effect.
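
The three lookahead patterns above can be compared on one subject string. This sketch uses Python's re module as a stand-in for PCRE2 (an assumption); the subject string and variable names are hypothetical.

```python
import re

subject = "foobar foobaz; bar"

# \w+(?=;) matches the word before the semicolon, without the semicolon.
word = re.search(r"\w+(?=;)", subject)

# foo(?!bar) skips the "foo" that is followed by "bar".
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", subject)]

# (?!foo)bar still finds the "bar" inside "foobar", because the assertion
# examines the text starting where "bar" starts, not the text before it.
bar_starts = [m.start() for m in re.finditer(r"(?!foo)bar", subject)]
```

The last pattern reports both occurrences of "bar", at offsets 3 and 15, which shows that the leading negative lookahead has no filtering effect.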
2088
2089 If you want to force a matching failure at some point in a pattern, the
2090 most convenient way to do it is with (?!) because an empty string
2091 always matches, so an assertion that requires there not to be an empty
2092 string must always fail. The backtracking control verb (*FAIL) or (*F)
2093 is a synonym for (?!).
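
A forced failure is mostly useful for making one alternative unusable. The sketch below uses Python's re module as a stand-in for PCRE2 (an assumption); Python has no (*FAIL) verb, so only the (?!) form is shown, and the variable names are hypothetical.

```python
import re

# The first alternative matches "ab" and then hits (?!), which can never
# succeed, so the engine has to fall back to the second alternative.
forced = re.fullmatch(r"ab(?!)|a(b)", "ab")

# On its own, (?!) fails at every position of every subject.
never = re.search(r"(?!)", "anything")
```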
2094
2095 Lookbehind assertions
2096
2097 Lookbehind assertions start with (?<= for positive assertions and (?<!
2098 for negative assertions. For example,
2099
2100 (?<!foo)bar
2101
2102 does find an occurrence of "bar" that is not preceded by "foo". The
2103 contents of a lookbehind assertion are restricted such that all the
2104 strings it matches must have a fixed length. However, if there are sev‐
2105 eral top-level alternatives, they do not all have to have the same
2106 fixed length. Thus
2107
2108 (?<=bullock|donkey)
2109
2110 is permitted, but
2111
2112 (?<!dogs?|cats?)
2113
2114 causes an error at compile time. Branches that match different length
2115 strings are permitted only at the top level of a lookbehind assertion.
2116 This is an extension compared with Perl, which requires all branches to
2117 match the same length of string. An assertion such as
2118
2119 (?<=ab(c|de))
2120
2121 is not permitted, because its single top-level branch can match two
2122 different lengths, but it is acceptable to PCRE2 if rewritten to use
2123 two top-level branches:
2124
2125 (?<=abc|abde)
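
A minimal sketch of these rules follows, using Python's re module as a stand-in (an assumption). Python is stricter than PCRE2 here: like Perl as described above, it requires every branch of a lookbehind to have the same length, so the (?<=bullock|donkey) form is not shown. The helper name compiles is hypothetical.

```python
import re

# Negative lookbehind: "bar" counts only when "foo" does not precede it.
starts = [m.start() for m in re.finditer(r"(?<!foo)bar", "foobar bazbar")]


def compiles(pattern):
    """Return True if Python's re accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# A branch that can match strings of different lengths is rejected when
# the pattern is compiled.
variable_branch_ok = compiles(r"(?<!dogs?|cats?)")

# Two top-level branches, each of fixed length, are accepted.
fixed_branches_ok = compiles(r"(?<=abcd|abde)")
```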
2126
2127 In some cases, the escape sequence \K (see above) can be used instead
2128 of a lookbehind assertion to get round the fixed-length restriction.
2129
2130 The implementation of lookbehind assertions is, for each alternative,
2131 to temporarily move the current position back by the fixed length and
2132 then try to match. If there are insufficient characters before the cur‐
2133 rent position, the assertion fails.
2134
2135 In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which
2136 matches a single code unit even in a UTF mode) to appear in lookbehind
2137 assertions, because it makes it impossible to calculate the length of
2138 the lookbehind. The \X and \R escapes, which can match different num‐
2139 bers of code units, are never permitted in lookbehinds.
2140
2141 "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
2142 lookbehinds, as long as the subpattern matches a fixed-length string.
2143 However, recursion, that is, a "subroutine" call into a group that is
2144 already active, is not supported.
2145
2146 Perl does not support backreferences in lookbehinds. PCRE2 does support
2147 them, but only if certain conditions are met. The
2148 PCRE2_MATCH_UNSET_BACKREF option must not be set, there must be no use
2149 of (?| in the pattern (it creates duplicate subpattern numbers), and if
2150 the backreference is by name, the name must be unique. Of course, the
2151 referenced subpattern must itself be of fixed length. The following
2152 pattern matches words containing at least two characters that begin and
2153 end with the same character:
2154
2155 \b(\w)\w++(?<=\1)
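
The same idea can be sketched in Python's re module, which also accepts fixed-length group references in lookbehinds (Python 3.5 or later). Two assumptions apply: Python stands in for PCRE2, and \w+ with a trailing \b replaces the possessive \w++, because possessive quantifiers need Python 3.11 or later. The names SAME_ENDS and words are hypothetical.

```python
import re

# Group 1 captures the first character of a word; the lookbehind then
# checks that the character just matched at the end of the word is the
# same as the captured one.
SAME_ENDS = re.compile(r"\b(\w)\w+(?<=\1)\b")

words = [m.group(0) for m in SAME_ENDS.finditer("level stats hello aa x")]
```

The single-character word "x" is not reported, because the pattern requires at least two characters.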
2156
2157 Possessive quantifiers can be used in conjunction with lookbehind
2158 assertions to specify efficient matching of fixed-length strings at the
2159 end of subject strings. Consider a simple pattern such as
2160
2161 abcd$
2162
2163 when applied to a long string that does not match. Because matching
2164 proceeds from left to right, PCRE2 will look for each "a" in the sub‐
2165 ject and then see if what follows matches the rest of the pattern. If
2166 the pattern is specified as
2167
2168 ^.*abcd$
2169
2170 the initial .* matches the entire string at first, but when this fails
2171 (because there is no following "a"), it backtracks to match all but the
2172 last character, then all but the last two characters, and so on. Once
2173 again the search for "a" covers the entire string, from right to left,
2174 so we are no better off. However, if the pattern is written as
2175
2176 ^.*+(?<=abcd)
2177
2178 there can be no backtracking for the .*+ item because of the possessive
2179 quantifier; it can match only the entire string. The subsequent lookbe‐
2180 hind assertion does a single test on the last four characters. If it
2181 fails, the match fails immediately. For long strings, this approach
2182 makes a significant difference to the processing time.
2183
2184 Using multiple assertions
2185
2186 Several assertions (of any sort) may occur in succession. For example,
2187
2188 (?<=\d{3})(?<!999)foo
2189
2190 matches "foo" preceded by three digits that are not "999". Notice that
2191 each of the assertions is applied independently at the same point in
2192 the subject string. First there is a check that the previous three
2193 characters are all digits, and then there is a check that the same
2193 three characters are not "999". This pattern does not match "foo"
2194 preceded by six characters, the first three of which are digits and the
2195 last three of which are not "999". For example, it doesn't match
2196 "123abcfoo". A pattern to do that is
2198
2199 (?<=\d{3}...)(?<!999)foo
2200
2201 This time the first assertion looks at the preceding six characters,
2202 checking that the first three are digits, and then the second assertion
2203 checks that the preceding three characters are not "999".
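
The two patterns can be compared side by side. This is a sketch using Python's re module as a stand-in for PCRE2 (an assumption); the names THREE_BACK, SIX_BACK and found are hypothetical.

```python
import re

# Both assertions look at the same three preceding characters.
THREE_BACK = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks at six preceding characters, the second
# at the last three of them.
SIX_BACK = re.compile(r"(?<=\d{3}...)(?<!999)foo")


def found(pattern, subject):
    """Return True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None


three_back_results = [found(THREE_BACK, s) for s in ("123foo", "999foo", "123abcfoo")]
six_back_results = [found(SIX_BACK, s) for s in ("123abcfoo", "123999foo")]
```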
2204
2205 Assertions can be nested in any combination. For example,
2206
2207 (?<=(?<!foo)bar)baz
2208
2209 matches an occurrence of "baz" that is preceded by "bar" which in turn
2210 is not preceded by "foo", while
2211
2212 (?<=\d{3}(?!999)...)foo
2213
2214 is another pattern that matches "foo" preceded by three digits and any
2215 three characters that are not "999".
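
Nested assertions can be tried in the same way. The sketch below again uses Python's re module as a stand-in for PCRE2 (an assumption), with hypothetical names.

```python
import re

# A lookbehind that itself contains a negative lookbehind.
BAR_NOT_AFTER_FOO = re.compile(r"(?<=(?<!foo)bar)baz")

# A lookbehind that contains a negative lookahead, applied after the
# three digits have been matched.
NESTED_LOOKAHEAD = re.compile(r"(?<=\d{3}(?!999)...)foo")

plain_bar = BAR_NOT_AFTER_FOO.search("xyzbarbaz")
bar_after_foo = BAR_NOT_AFTER_FOO.search("foobarbaz")
letters_between = NESTED_LOOKAHEAD.search("123abcfoo")
nines_between = NESTED_LOOKAHEAD.search("123999foo")
```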
2216
2218
2219 It is possible to cause the matching process to obey a subpattern con‐
2220 ditionally or to choose between two alternative subpatterns, depending
2221 on the result of an assertion, or whether a specific capturing subpat‐
2222 tern has already been matched. The two possible forms of conditional
2223 subpattern are:
2224
2225 (?(condition)yes-pattern)
2226 (?(condition)yes-pattern|no-pattern)
2227
2228 If the condition is satisfied, the yes-pattern is used; otherwise the
2229 no-pattern (if present) is used. An absent no-pattern is equivalent to
2230 an empty string (it always matches). If there are more than two alter‐
2231 natives in the subpattern, a compile-time error occurs. Each of the two
2232 alternatives may itself contain nested subpatterns of any form, includ‐
2233 ing conditional subpatterns; the restriction to two alternatives
2234 applies only at the level of the condition. This pattern fragment is an
2235 example where the alternatives are complex:
2236
2237 (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
2238
2239
2240 There are five kinds of condition: references to subpatterns, refer‐
2241 ences to recursion, two pseudo-conditions called DEFINE and VERSION,
2242 and assertions.
2243
2244 Checking for a used subpattern by number
2245
2246 If the text between the parentheses consists of a sequence of digits,
2247 the condition is true if a capturing subpattern of that number has pre‐
2248 viously matched. If there is more than one capturing subpattern with
2249 the same number (see the earlier section about duplicate subpattern
2250 numbers), the condition is true if any of them have matched. An alter‐
2251 native notation is to precede the digits with a plus or minus sign. In
2252 this case, the subpattern number is relative rather than absolute. The
2253 most recently opened parentheses can be referenced by (?(-1), the next
2254 most recent by (?(-2), and so on. Inside loops it can also make sense
2255 to refer to subsequent groups. The next parentheses to be opened can be
2256 referenced as (?(+1), and so on. (The value zero in any of these forms
2257 is not used; it provokes a compile-time error.)
2258
2259 Consider the following pattern, which contains non-significant white
2260 space to make it more readable (assume the PCRE2_EXTENDED option) and
2261 to divide it into three parts for ease of discussion:
2262
2263 ( \( )? [^()]+ (?(1) \) )
2264
2265 The first part matches an optional opening parenthesis, and if that
2266 character is present, sets it as the first captured substring. The sec‐
2267 ond part matches one or more characters that are not parentheses. The
2268 third part is a conditional subpattern that tests whether or not the
2269 first set of parentheses matched. If they did, that is, if the subject
2270 started with an opening parenthesis, the condition is true, and so the
2271 yes-pattern is executed and a closing parenthesis is required.
2272 Otherwise, since no-pattern is not present, the subpattern matches nothing.
2273 In other words, this pattern matches a sequence of non-parentheses,
2274 optionally enclosed in parentheses.
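
This pattern can be tried unchanged in Python's re module, which is used here as a stand-in for PCRE2 with re.VERBOSE in the role of PCRE2_EXTENDED (both assumptions). Python does not accept the relative form (?(-1), so only the absolute reference is shown; the names are hypothetical.

```python
import re

# The white space that separates the three parts is ignored in verbose mode.
OPTIONAL_PARENS = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)


def accepts(subject):
    """Return True if the whole subject matches the pattern."""
    return OPTIONAL_PARENS.fullmatch(subject) is not None


results = {s: accepts(s) for s in ("(abc)", "abc", "(abc", "abc)")}
```

Only the first two subjects are accepted: a closing parenthesis is required exactly when an opening one was captured.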
2275
2276 If you were embedding this pattern in a larger one, you could use a
2277 relative reference:
2278
2279 ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
2280
2281 This makes the fragment independent of the parentheses in the larger
2282 pattern.
2283
2284 Checking for a used subpattern by name
2285
2286 Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
2287 used subpattern by name. For compatibility with earlier versions of
2288 PCRE1, which had this facility before Perl, the syntax (?(name)...) is
2289 also recognized. Note, however, that undelimited names consisting of
2290 the letter R followed by digits are ambiguous (see the following sec‐
2291 tion).
2292
2293 Rewriting the above example to use a named subpattern gives this:
2294
2295 (?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
2296
2297 If the name used in a condition of this kind is a duplicate, the test
2298 is applied to all subpatterns of the same name, and is true if any one
2299 of them has matched.
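
A sketch of the named form follows, using Python's re module as a stand-in for PCRE2 (an assumption). Python accepts only its own (?P<name>...) spelling for the group and the undelimited (?(name)...) spelling for the condition, both of which are among the forms described in this document; the names NAMED and accepts are hypothetical.

```python
import re

# re.VERBOSE plays the role of PCRE2_EXTENDED.
NAMED = re.compile(r"(?P<OPEN> \( )? [^()]+ (?(OPEN) \) )", re.VERBOSE)


def accepts(subject):
    """Return True if the whole subject matches the pattern."""
    return NAMED.fullmatch(subject) is not None


with_parens = NAMED.fullmatch("(abc)")
without_parens = NAMED.fullmatch("abc")
```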
2300
2301 Checking for pattern recursion
2302
2303 "Recursion" in this sense refers to any subroutine-like call from one
2304 part of the pattern to another, whether or not it is actually recur‐
2305 sive. See the sections entitled "Recursive patterns" and "Subpatterns
2306 as subroutines" below for details of recursion and subpattern calls.
2307
2308 If a condition is the string (R), and there is no subpattern with the
2309 name R, the condition is true if matching is currently in a recursion
2310 or subroutine call to the whole pattern or any subpattern. If digits
2311 follow the letter R, and there is no subpattern with that name, the
2312 condition is true if the most recent call is into a subpattern with the
2313 given number, which must exist somewhere in the overall pattern. This
2314 is a contrived example that is equivalent to a+b:
2315
2316 ((?(R1)a+|(?1)b))
2317
2318 However, in both cases, if there is a subpattern with a matching name,
2319 the condition tests for its being set, as described in the section
2320 above, instead of testing for recursion. For example, creating a group
2321 with the name R1 by adding (?<R1>) to the above pattern completely
2322 changes its meaning.
2323
2324 If a name preceded by ampersand follows the letter R, for example:
2325
2326 (?(R&name)...)
2327
2328 the condition is true if the most recent recursion is into a subpattern
2329 of that name (which must exist within the pattern).
2330
2331 This condition does not check the entire recursion stack. It tests only
2332 the current level. If the name used in a condition of this kind is a
2333 duplicate, the test is applied to all subpatterns of the same name, and
2334 is true if any one of them is the most recent recursion.
2335
2336 At "top level", all these recursion test conditions are false.
2337
2338 Defining subpatterns for use by reference only
2339
2340 If the condition is the string (DEFINE), the condition is always false,
2341 even if there is a group with the name DEFINE. In this case, there may
2342 be only one alternative in the subpattern. It is always skipped if con‐
2343 trol reaches this point in the pattern; the idea of DEFINE is that it
2344 can be used to define subroutines that can be referenced from else‐
2345 where. (The use of subroutines is described below.) For example, a pat‐
2346 tern to match an IPv4 address such as "192.168.23.245" could be written
2347 like this (ignore white space and line breaks):
2348
2349 (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
2350 \b (?&byte) (\.(?&byte)){3} \b
2351
2352 The first part of the pattern is a DEFINE group inside which another
2353 group named "byte" is defined. This matches an individual component of
2354 an IPv4 address (a number less than 256). When matching takes place,
2355 this part of the pattern is skipped because DEFINE acts like a false
2356 condition. The rest of the pattern uses references to the named group
2357 to match the four dot-separated components of an IPv4 address, insist‐
2358 ing on a word boundary at each end.
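
Engines without DEFINE can approximate this reuse by building the pattern from a shared fragment. The sketch below does that in Python, whose re module has neither DEFINE nor (?&name); it is a swapped-in technique for illustration, not the PCRE2 mechanism, and the names BYTE and IPV4 are hypothetical.

```python
import re

# One component of an IPv4 address, copied from the "byte" group above.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# The fragment is interpolated wherever the PCRE2 pattern writes (?&byte).
# {{3}} is the f-string spelling of the quantifier {3}.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")

valid = IPV4.fullmatch("192.168.23.245")
too_large = IPV4.fullmatch("256.1.1.1")
embedded = IPV4.search("host 10.0.0.1 up")
```

Unlike a DEFINE group, the interpolated fragment is compiled four times, so this approach trades pattern size for portability.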
2359
2360 Checking the PCRE2 version
2361
2362 Programs that link with a PCRE2 library can check the version by call‐
2363 ing pcre2_config() with appropriate arguments. Users of applications
2364 that do not have access to the underlying code cannot do this. A spe‐
2365 cial "condition" called VERSION exists to allow such users to discover
2366 which version of PCRE2 they are dealing with by using this condition to
2367 match a string such as "yesno". VERSION must be followed either by "="
2368 or ">=" and a version number. For example:
2369
2370 (?(VERSION>=10.4)yes|no)
2371
2372 This pattern matches "yes" if the PCRE2 version is greater than or equal to
2373 10.4, or "no" otherwise. The fractional part of the version number may
2374 not contain more than two digits.
2375
2376 Assertion conditions
2377
2378 If the condition is not in any of the above formats, it must be an
2379 assertion. This may be a positive or negative lookahead or lookbehind
2380 assertion. Consider this pattern, again containing non-significant
2381 white space, and with the two alternatives on the second line:
2382
2383 (?(?=[^a-z]*[a-z])
2384 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
2385
2386 The condition is a positive lookahead assertion that matches an
2387 optional sequence of non-letters followed by a letter. In other words,
2388 it tests for the presence of at least one letter in the subject. If a
2389 letter is found, the subject is matched against the first alternative;
2390 otherwise it is matched against the second. This pattern matches
2391 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
2392 letters and dd are digits.
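
The effect of an assertion condition can be reproduced in engines that lack it by attaching the assertion to the first alternative and its negation to the second. The sketch below does this in Python's re module, which does not accept an assertion as a condition; it is an equivalent formulation for illustration, not PCRE2 syntax, and the name DATE_LIKE is hypothetical.

```python
import re

# (?(?=A) X | Y) is rewritten as (?=A) X | (?!A) Y.
DATE_LIKE = re.compile(
    r"(?=[^a-z]*[a-z])\d{2}-[a-z]{3}-\d{2}"
    r"|(?![^a-z]*[a-z])\d{2}-\d{2}-\d{2}"
)

with_letters = DATE_LIKE.fullmatch("12-abc-34")
digits_only = DATE_LIKE.fullmatch("12-34-56")
wrong_shape = DATE_LIKE.fullmatch("12-ab-34")
```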
2393
2394 When an assertion that is a condition contains capturing subpatterns,
2395 any capturing that occurs in a matching branch is retained afterwards,
2396 for both positive and negative assertions, because matching always con‐
2397 tinues after the assertion, whether it succeeds or fails. (Compare non-
2398 conditional assertions, when captures are retained only for positive
2399 assertions that succeed.)
2400
2402
2403 There are two ways of including comments in patterns that are processed
2404 by PCRE2. In both cases, the start of the comment must not be in a
2405 character class, nor in the middle of any other sequence of related
2406 characters such as (?: or a subpattern name or number. The characters
2407 that make up a comment play no part in the pattern matching.
2408
2409 The sequence (?# marks the start of a comment that continues up to the
2410 next closing parenthesis. Nested parentheses are not permitted. If the
2411 PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped #
2412 character also introduces a comment, which in this case continues to
2413 immediately after the next newline character or character sequence in
2414 the pattern. Which characters are interpreted as newlines is controlled
2415 by an option passed to the compiling function or by a special sequence
2416 at the start of the pattern, as described in the section entitled "New‐
2417 line conventions" above. Note that the end of this type of comment is a
2418 literal newline sequence in the pattern; escape sequences that happen
2419 to represent a newline do not count. For example, consider this pattern
2420 when PCRE2_EXTENDED is set, and the default newline convention (a sin‐
2421 gle linefeed character) is in force:
2422
2423 abc #comment \n still comment
2424
2425 On encountering the # character, pcre2_compile() skips along, looking
2426 for a newline in the pattern. The sequence \n is still literal at this
2427 stage, so it does not terminate the comment. Only an actual character
2428 with the code value 0x0a (the default newline) does so.
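
Both kinds of comment, and the distinction between an escape sequence and a real newline, can be sketched in Python's re module, where re.VERBOSE stands in for PCRE2_EXTENDED (an assumption; the variable names are hypothetical).

```python
import re

# (?#...) comments are available without any option.
inline = re.fullmatch(r"a(?#this text is ignored)b", "ab")

# In a raw string, \n is a backslash followed by "n", not a newline
# character, so the comment runs to the end of the pattern.
escaped = re.fullmatch(r"abc #comment \n still comment", "abc", re.VERBOSE)

# This (non-raw) string contains a real linefeed, which ends the comment,
# so "def" is part of the pattern again.
real_newline = re.fullmatch("abc #comment \n def", "abcdef", re.VERBOSE)
too_short = re.fullmatch("abc #comment \n def", "abc", re.VERBOSE)
```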
2429
2431
2432 Consider the problem of matching a string in parentheses, allowing for
2433 unlimited nested parentheses. Without the use of recursion, the best
2434 that can be done is to use a pattern that matches up to some fixed
2435 depth of nesting. It is not possible to handle an arbitrary nesting
2436 depth.
2437
2438 For some time, Perl has provided a facility that allows regular expres‐
2439 sions to recurse (amongst other things). It does this by interpolating
2440 Perl code in the expression at run time, and the code can refer to the
2441 expression itself. A Perl pattern using code interpolation to solve the
2442 parentheses problem can be created like this:
2443
2444 $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
2445
2446 The (?p{...}) item interpolates Perl code at run time, and in this case
2447 refers recursively to the pattern in which it appears.
2448
2449 Obviously, PCRE2 cannot support the interpolation of Perl code.
2450 Instead, it supports special syntax for recursion of the entire pat‐
2451 tern, and also for individual subpattern recursion. After its introduc‐
2452 tion in PCRE1 and Python, this kind of recursion was subsequently
2453 introduced into Perl at release 5.10.
2454
2455 A special item that consists of (? followed by a number greater than
2456 zero and a closing parenthesis is a recursive subroutine call of the
2457 subpattern of the given number, provided that it occurs inside that
2458 subpattern. (If not, it is a non-recursive subroutine call, which is
2459 described in the next section.) The special item (?R) or (?0) is a
2460 recursive call of the entire regular expression.
2461
2462 This PCRE2 pattern solves the nested parentheses problem (assume the
2463 PCRE2_EXTENDED option is set so that white space is ignored):
2464
2465 \( ( [^()]++ | (?R) )* \)
2466
2467 First it matches an opening parenthesis. Then it matches any number of
2468 substrings which can either be a sequence of non-parentheses, or a
2469 recursive match of the pattern itself (that is, a correctly parenthe‐
2470 sized substring). Finally there is a closing parenthesis. Note the use
2471 of a possessive quantifier to avoid backtracking into sequences of non-
2472 parentheses.
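
The structure of this pattern can be made explicit by writing it out as a recursive function. The sketch below is a hand-written Python equivalent of the pattern when it is applied at one fixed starting position; it is not how PCRE2 implements recursion, and the name match_parens is hypothetical.

```python
def match_parens(subject, pos=0):
    """Return the index just past a balanced "(...)" at pos, or -1."""
    if pos >= len(subject) or subject[pos] != "(":
        return -1  # the leading \( fails
    pos += 1
    while pos < len(subject) and subject[pos] != ")":
        if subject[pos] == "(":
            # The (?R) alternative: match a nested, balanced substring.
            pos = match_parens(subject, pos)
            if pos == -1:
                return -1
        else:
            # One character of the [^()]++ alternative.
            pos += 1
    if pos < len(subject):
        return pos + 1  # the closing \)
    return -1


balanced_end = match_parens("(ab(cd)ef)")
unbalanced_end = match_parens("(a(b)")
```

The function never revisits characters it has already consumed, which corresponds to the effect of the possessive quantifier in the pattern.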
2473
2474 If this were part of a larger pattern, you would not want to recurse
2475 the entire pattern, so instead you could use this:
2476
2477 ( \( ( [^()]++ | (?1) )* \) )
2478
2479 We have put the pattern into parentheses, and caused the recursion to
2480 refer to them instead of the whole pattern.
2481
2482 In a larger pattern, keeping track of parenthesis numbers can be
2483 tricky. This is made easier by the use of relative references. Instead
2484 of (?1) in the pattern above you can write (?-2) to refer to the second
2485 most recently opened parentheses preceding the recursion. In other
2486 words, a negative number counts capturing parentheses leftwards from
2487 the point at which it is encountered.
2488
2489 Be aware, however, that if duplicate subpattern numbers are in use,
2490 relative references refer to the earliest subpattern with the appropriate
2491 number. Consider, for example:
2492
2493 (?|(a)|(b)) (c) (?-2)
2494
2495 The first two capturing groups (a) and (b) are both numbered 1, and
2496 group (c) is number 2. When the reference (?-2) is encountered, the
2497 second most recently opened parentheses has the number 1, but it is the
2498 first such group (the (a) group) to which the recursion refers. This
2499 would be the same if an absolute reference (?1) was used. In other
2500 words, relative references are just a shorthand for computing a group
2501 number.
2502
2503 It is also possible to refer to subsequently opened parentheses, by
2504 writing references such as (?+2). However, these cannot be recursive
2505 because the reference is not inside the parentheses that are refer‐
2506 enced. They are always non-recursive subroutine calls, as described in
2507 the next section.
2508
2509 An alternative approach is to use named parentheses. The Perl syntax
2510 for this is (?&name); PCRE1's earlier syntax (?P>name) is also sup‐
2511 ported. We could rewrite the above example as follows:
2512
2513 (?<pn> \( ( [^()]++ | (?&pn) )* \) )
2514
2515 If there is more than one subpattern with the same name, the earliest
2516 one is used.
2517
2518 The example pattern that we have been looking at contains nested unlim‐
2519 ited repeats, and so the use of a possessive quantifier for matching
2520 strings of non-parentheses is important when applying the pattern to
2521 strings that do not match. For example, when this pattern is applied to
2522
2523 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
2524
2525 it yields "no match" quickly. However, if a possessive quantifier is
2526 not used, the match runs for a very long time indeed because there are
2527 so many different ways the + and * repeats can carve up the subject,
2528 and all have to be tested before failure can be reported.
2529
2530 At the end of a match, the values of capturing parentheses are those
2531 from the outermost level. If you want to obtain intermediate values, a
2532 callout function can be used (see below and the pcre2callout documenta‐
2533 tion). If the pattern above is matched against
2534
2535 (ab(cd)ef)
2536
2537 the value for the inner capturing parentheses (numbered 2) is "ef",
2538 which is the last value taken on at the top level. If a capturing sub‐
2539 pattern is not matched at the top level, its final captured value is
2540 unset, even if it was (temporarily) set at a deeper level during the
2541 matching process.
2542
2543 Do not confuse the (?R) item with the condition (R), which tests for
2544 recursion. Consider this pattern, which matches text in angle brack‐
2545 ets, allowing for arbitrary nesting. Only digits are allowed in nested
2546 brackets (that is, when recursing), whereas any characters are permit‐
2547 ted at the outer level.
2548
2549 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
2550
2551 In this pattern, (?(R) is the start of a conditional subpattern, with
2552 two different alternatives for the recursive and non-recursive cases.
2553 The (?R) item is the actual recursive call.
2554
2555 Differences in recursion processing between PCRE2 and Perl
2556
2557 Some former differences between PCRE2 and Perl no longer exist.
2558
2559 Before release 10.30, recursion processing in PCRE2 differed from Perl
2560 in that a recursive subpattern call was always treated as an atomic
2561 group. That is, once it had matched some of the subject string, it was
2562 never re-entered, even if it contained untried alternatives and there
2563 was a subsequent matching failure. (Historical note: PCRE implemented
2564 recursion before Perl did.)
2565
2566 Starting with release 10.30, recursive subroutine calls are no longer
2567 treated as atomic. That is, they can be re-entered to try unused alter‐
2568 natives if there is a matching failure later in the pattern. This is
2569 now compatible with the way Perl works. If you want a subroutine call
2570 to be atomic, you must explicitly enclose it in an atomic group.
2571
2572 Supporting backtracking into recursions simplifies certain types of
2573 recursive pattern. For example, this pattern matches palindromic
2574 strings:
2575
2576 ^((.)(?1)\2|.?)$
2577
2578 The second branch in the group matches a single central character in
2579 the palindrome when there are an odd number of characters, or nothing
2580 when there are an even number of characters, but in order to work it
2581 has to be able to try the second case when the rest of the pattern
2582 match fails. If you want to match typical palindromic phrases, the pat‐
2583 tern has to ignore all non-word characters, which can be done like
2584 this:
2585
2586 ^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
2587
2588 If run with the PCRE2_CASELESS option, this pattern matches phrases
2589 such as "A man, a plan, a canal: Panama!". Note the use of the posses‐
2590 sive quantifier *+ to avoid backtracking into sequences of non-word
2591 characters. Without this, PCRE2 takes a great deal longer (ten times or
2592 more) to match typical phrases, and Perl takes so long that you think
2593 it has gone into a loop.
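
The two branches of the palindrome pattern correspond to the two cases of a simple recursive function. The following Python sketch is an illustration of that logic only, not of PCRE2's matching process; it ignores the fact that a dot does not match a newline by default, and the function names are hypothetical.

```python
import re


def is_palindrome(text):
    # Second branch, .? : zero or one character is always a palindrome.
    if len(text) <= 1:
        return True
    # First branch, (.)(?1)\2 : the outer characters must be equal and the
    # text between them must itself be a palindrome.
    return text[0] == text[-1] and is_palindrome(text[1:-1])


def is_palindromic_phrase(phrase):
    # Dropping non-word characters and folding case stands in for the
    # \W*+ items and the PCRE2_CASELESS option.
    return is_palindrome(re.sub(r"\W+", "", phrase).lower())


phrase_ok = is_palindromic_phrase("A man, a plan, a canal: Panama!")
```

The function finds the middle of the string by arithmetic, whereas the pattern has to find it by backtracking into the recursion, which is why non-atomic recursion matters for the regular expression.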
2594
2595 Another way in which PCRE2 and Perl used to differ in their recursion
2596 processing is in the handling of captured values. Formerly in Perl,
2597 when a subpattern was called recursively or as a subroutine (see the
2598 next section), it had no access to any values that were captured out‐
2599 side the recursion, whereas in PCRE2 these values can be referenced.
2600 Consider this pattern:
2601
2602 ^(.)(\1|a(?2))
2603
2604 This pattern matches "bab". The first capturing parentheses match "b",
2605 then in the second group, when the backreference \1 fails to match "b",
2606 the second alternative matches "a" and then recurses. In the recursion,
2607 \1 does now match "b" and so the whole match succeeds. This match used
2608 to fail in Perl, but in later versions (5.024 was tested) it now works.
2609
2611
2612 If the syntax for a recursive subpattern call (either by number or by
2613 name) is used outside the parentheses to which it refers, it operates a
2614 bit like a subroutine in a programming language. More accurately, PCRE2
2615 treats the referenced subpattern as an independent subpattern which it
2616 tries to match at the current matching position. The called subpattern
2617 may be defined before or after the reference. A numbered reference can
2618 be absolute or relative, as in these examples:
2619
2620 (...(absolute)...)...(?2)...
2621 (...(relative)...)...(?-1)...
2622 (...(?+1)...(relative)...
2623
2624 An earlier example pointed out that the pattern
2625
2626 (sens|respons)e and \1ibility
2627
2628 matches "sense and sensibility" and "response and responsibility", but
2629 not "sense and responsibility". If instead the pattern
2630
2631 (sens|respons)e and (?1)ibility
2632
2633 is used, it does match "sense and responsibility" as well as the other
2634 two strings. Another example is given in the discussion of DEFINE
2635 above.
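
The difference between the two patterns can be tried out in Python.
Python's re module has no (?1) syntax, so this sketch stands in for the
subroutine call by writing the group's pattern out a second time; for
this non-recursive pattern that is assumed to be equivalent. The
variable names are chosen for the example only.

```python
import re

GROUP = r"sens|respons"

# Backreference: \1 must repeat the text that group 1 actually matched.
backref = re.compile(rf"({GROUP})e and \1ibility")

# Stand-in for (?1): the group's pattern is matched afresh, so it may
# pick a different alternative from the one group 1 used.
subroutine_like = re.compile(rf"({GROUP})e and (?:{GROUP})ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]
results = {
    s: (bool(backref.fullmatch(s)), bool(subroutine_like.fullmatch(s)))
    for s in subjects
}
for subject, (with_backref, with_call) in results.items():
    print(subject, with_backref, with_call)
```

Only the third subject distinguishes the two forms: the backreference
version rejects it and the subroutine-like version accepts it.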
2636
2637 Like recursions, subroutine calls used to be treated as atomic, but
2638 this changed at PCRE2 release 10.30, so backtracking into subroutine
2639 calls can now occur. However, any capturing parentheses that are set
2640 during the subroutine call revert to their previous values afterwards.
2641
2642 Processing options such as case-independence are fixed when a subpat‐
2643 tern is defined, so if it is used as a subroutine, such options cannot
2644 be changed for different calls. For example, consider this pattern:
2645
2646 (abc)(?i:(?-1))
2647
2648 It matches "abcabc". It does not match "abcABC" because the change of
2649 processing option does not affect the called subpattern.
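
A rough Python analogue of this behaviour is shown below. Python's re
module cannot call a group as a subroutine, so the called subpattern is
written out by hand with its case-sensitive setting restored by
(?-i:...), which requires Python 3.6 or later. The second pattern shows
what would happen if the option did leak into the call; both names are
illustrative.

```python
import re

# What PCRE2 does: the called group keeps the options it was defined
# with, so the second "abc" is still matched case-sensitively.
called_as_defined = re.compile(r"(abc)(?i:(?-i:abc))")

# What would happen if the (?i: ...) setting applied to the call.
if_option_leaked = re.compile(r"(abc)(?i:abc)")

print(bool(called_as_defined.fullmatch("abcabc")))  # True
print(bool(called_as_defined.fullmatch("abcABC")))  # False
print(bool(if_option_leaked.fullmatch("abcABC")))   # True
```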
2650
2651 The behaviour of backtracking control verbs in subpatterns when called
2652 as subroutines is described in the section entitled "Backtracking verbs
2653 in subroutines" below.
2654
2656
2657 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
2658 name or a number enclosed either in angle brackets or single quotes, is
2659 an alternative syntax for referencing a subpattern as a subroutine,
2660 possibly recursively. Here are two of the examples used above, rewrit‐
2661 ten using this syntax:
2662
2663 (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
2664 (sens|respons)e and \g'1'ibility
2665
2666 PCRE2 supports an extension to Oniguruma: if a number is preceded by a
2667 plus or a minus sign it is taken as a relative reference. For example:
2668
2669 (abc)(?i:\g<-1>)
2670
2671 Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
2672 synonymous. The former is a backreference; the latter is a subroutine
2673 call.
2674
2676
2677 Perl has a feature whereby using the sequence (?{...}) causes arbitrary
2678 Perl code to be obeyed in the middle of matching a regular expression.
2679 This makes it possible, amongst other things, to extract different sub‐
2680 strings that match the same pair of parentheses when there is a repeti‐
2681 tion.
2682
2683 PCRE2 provides a similar feature, but of course it cannot obey arbi‐
2684 trary Perl code. The feature is called "callout". The caller of PCRE2
2685 provides an external function by putting its entry point in a match
2686 context using the function pcre2_set_callout(), and then passing that
2687 context to pcre2_match() or pcre2_dfa_match(). If no match context is
2688 passed, or if the callout entry point is set to NULL, callouts are dis‐
2689 abled.
2690
2691 Within a regular expression, (?C<arg>) indicates a point at which the
2692 external function is to be called. There are two kinds of callout:
2693 those with a numerical argument and those with a string argument. (?C)
2694 on its own with no argument is treated as (?C0). A numerical argument
2695 allows the application to distinguish between different callouts.
2696 String arguments were added for release 10.20 to make it possible for
2697 script languages that use PCRE2 to embed short scripts within patterns
2698 in a similar way to Perl.
2699
2700 During matching, when PCRE2 reaches a callout point, the external func‐
2701 tion is called. It is provided with the number or string argument of
2702 the callout, the position in the pattern, and one item of data that is
2703 also set in the match block. The callout function may cause matching to
2704 proceed, to backtrack, or to fail.
2705
2706 By default, PCRE2 implements a number of optimizations at matching
2707 time, and one side-effect is that sometimes callouts are skipped. If
2708 you need all possible callouts to happen, you need to set options that
2709 disable the relevant optimizations. More details, including a complete
2710 description of the programming interface to the callout function, are
2711 given in the pcre2callout documentation.
2712
2713 Callouts with numerical arguments
2714
2715 If you just want to have a means of identifying different callout
2716 points, put a number less than 256 after the letter C. For example,
2717 this pattern has two callout points:
2718
2719 (?C1)abc(?C2)def
2720
2721 If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical
2722 callouts are automatically installed before each item in the pattern.
2723 They are all numbered 255. If there is a conditional group in the pat‐
2724 tern whose condition is an assertion, an additional callout is inserted
2725 just before the condition. An explicit callout may also be set at this
2726 position, as in this example:
2727
2728 (?(?C9)(?=a)abc|def)
2729
2730 Note that this applies only to assertion conditions, not to other types
2731 of condition.
2732
2733 Callouts with string arguments
2734
2735 A delimited string may be used instead of a number as a callout argu‐
2736 ment. The starting delimiter must be one of ` ' " ^ % # $ { and the
2737 ending delimiter is the same as the start, except for {, where the end‐
2738 ing delimiter is }. If the ending delimiter is needed within the
2739 string, it must be doubled. For example:
2740
2741 (?C'ab ''c'' d')xyz(?C{any text})pqr
2742
2743 The doubling is removed before the string is passed to the callout
2744 function.
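
The delimiter rules can be made concrete with a small parser. This is
a Python sketch of the rules as stated above, not PCRE2's own code;
parse_callout_string and CLOSERS are names invented for the example,
and error handling is reduced to ValueError.

```python
# Each permitted starting delimiter and its ending delimiter.
CLOSERS = {"`": "`", "'": "'", '"': '"', "^": "^",
           "%": "%", "#": "#", "$": "$", "{": "}"}


def parse_callout_string(pattern, pos):
    # Parse a string callout that starts with "(?C" at pattern[pos].
    # Returns (argument, offset just past the closing parenthesis).
    if not pattern.startswith("(?C", pos):
        raise ValueError("not a callout")
    i = pos + 3
    if i >= len(pattern) or pattern[i] not in CLOSERS:
        raise ValueError("no string delimiter")
    closer = CLOSERS[pattern[i]]
    i += 1
    chars = []
    while i < len(pattern):
        ch = pattern[i]
        if ch == closer:
            if pattern.startswith(closer, i + 1):
                chars.append(closer)  # doubled delimiter: one literal copy
                i += 2
                continue
            if pattern.startswith(")", i + 1):
                return "".join(chars), i + 2
            raise ValueError("ending delimiter not followed by ')'")
        chars.append(ch)
        i += 1
    raise ValueError("unterminated callout string")


example = "(?C'ab ''c'' d')xyz(?C{any text})pqr"
first, next_pos = parse_callout_string(example, 0)
second, _ = parse_callout_string(example, example.index("(?C", next_pos))
print(repr(first), repr(second))  # "ab 'c' d" 'any text'
```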
2745
2747
2748 There are a number of special "Backtracking Control Verbs" (to use
2749 Perl's terminology) that modify the behaviour of backtracking during
2750 matching. They are generally of the form (*VERB) or (*VERB:NAME). Some
2751 verbs take either form, possibly behaving differently depending on
2752 whether or not a name is present.
2753
2754 By default, for compatibility with Perl, a name is any sequence of
2755 characters that does not include a closing parenthesis. The name is not
2756 processed in any way, and it is not possible to include a closing
2757 parenthesis in the name. This can be changed by setting the
2758 PCRE2_ALT_VERBNAMES option, but the result is no longer Perl-compati‐
2759 ble.
2760
2761 When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to
2762 verb names and only an unescaped closing parenthesis terminates the
2763 name. However, the only backslash items that are permitted are \Q, \E,
2764 and sequences such as \x{100} that define character code points. Char‐
2765 acter type escapes such as \d are faulted.
2766
2767 A closing parenthesis can be included in a name either as \) or between
2768 \Q and \E. In addition to backslash processing, if the PCRE2_EXTENDED
2769 or PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb
2770 names is skipped, and #-comments are recognized, exactly as in the rest
2771 of the pattern. PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect
2772 verb names unless PCRE2_ALT_VERBNAMES is also set.
2773
2774 The maximum length of a name is 255 in the 8-bit library and 65535 in
2775 the 16-bit and 32-bit libraries. If the name is empty, that is, if the
2776 closing parenthesis immediately follows the colon, the effect is as if
2777 the colon were not there. Any number of these verbs may occur in a pat‐
2778 tern.
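
As an illustration of the default (Perl-compatible) naming rules, the
Python sketch below splits a single verb item into its verb and name.
It models only the default case, without PCRE2_ALT_VERBNAMES, and does
not check the length limit; split_verb is a hypothetical helper, not a
PCRE2 function.

```python
import re

# Default rule: the name is everything after the colon up to the first
# closing parenthesis, taken literally.
VERB_ITEM = re.compile(r"\(\*([A-Z_]*)(?::([^)]*))?\)")


def split_verb(item):
    # Returns (verb, name); name is None when absent or empty.
    m = VERB_ITEM.fullmatch(item)
    if m is None:
        raise ValueError("not a single verb item")
    verb = m.group(1) or "MARK"   # (*:NAME) is shorthand for (*MARK:NAME)
    name = m.group(2) or None     # an empty name acts as if no colon
    if verb == "MARK" and name is None:
        raise ValueError("(*MARK) always needs a name")
    return verb, name


print(split_verb("(*PRUNE:here)"))  # ('PRUNE', 'here')
print(split_verb("(*COMMIT:)"))     # ('COMMIT', None)
print(split_verb("(*:A)"))          # ('MARK', 'A')
```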
2779
2780 Since these verbs are specifically related to backtracking, most of
2781 them can be used only when the pattern is to be matched using the tra‐
2782 ditional matching function, because that uses a backtracking algorithm.
2783 With the exception of (*FAIL), which behaves like a failing negative
2784 assertion, the backtracking control verbs cause an error if encountered
2785 by the DFA matching function.
2786
2787 The behaviour of these verbs in repeated groups, assertions, and in
2788 subpatterns called as subroutines (whether or not recursively) is docu‐
2789 mented below.
2790
2791 Optimizations that affect backtracking verbs
2792
2793 PCRE2 contains some optimizations that are used to speed up matching by
2794 running some checks at the start of each match attempt. For example, it
2795 may know the minimum length of matching subject, or that a particular
2796 character must be present. When one of these optimizations bypasses the
2797 running of a match, any included backtracking verbs will not, of
2798 course, be processed. You can suppress the start-of-match optimizations
2799 by setting the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com‐
2800 pile(), or by starting the pattern with (*NO_START_OPT). There is more
2801 discussion of this option in the section entitled "Compiling a pattern"
2802 in the pcre2api documentation.
2803
2804 Experiments with Perl suggest that it too has similar optimizations,
2805 and like PCRE2, turning them off can change the result of a match.

2806
2807 Verbs that act immediately
2808
2809 The following verbs act as soon as they are encountered.
2810
2811 (*ACCEPT) or (*ACCEPT:NAME)
2812
2813 This verb causes the match to end successfully, skipping the remainder
2814 of the pattern. However, when it is inside a subpattern that is called
2815 as a subroutine, only that subpattern is ended successfully. Matching
2816 then continues at the outer level. If (*ACCEPT) is triggered in a posi‐
2817 tive assertion, the assertion succeeds; in a negative assertion, the
2818 assertion fails.
2819
2820 If (*ACCEPT) is inside capturing parentheses, the data so far is cap‐
2821 tured. For example:
2822
2823 A((?:A|B(*ACCEPT)|C)D)
2824
2825 This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap‐
2826 tured by the outer parentheses.
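
Python's re module has no (*ACCEPT), but for this particular pattern
the effect can be imitated by expanding the alternatives by hand, since
the verb sits at the end of the pattern: the "B" branch simply ends the
match without requiring "D". This expansion is an assumption that holds
for this example only, not a general translation.

```python
import re

# Hand expansion of A((?:A|B(*ACCEPT)|C)D): the B branch stops early.
accept_like = re.compile(r"A(AD|B|CD)")

for subject in ["AB", "AAD", "ACD", "ABD"]:
    m = accept_like.match(subject)
    print(subject, m.group(0), m.group(1))
```

Note that "ABD" yields the match "AB" with "B" captured, because the
match ends as soon as the accepting branch is taken.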
2827
2828 (*FAIL) or (*FAIL:NAME)
2829
2830 This verb causes a matching failure, forcing backtracking to occur. It
2831 may be abbreviated to (*F). It is equivalent to (?!) but easier to
2832 read. The Perl documentation notes that it is probably useful only when
2833 combined with (?{}) or (??{}). Those are, of course, Perl features that
2834 are not present in PCRE2. The nearest equivalent is the callout fea‐
2835 ture, as for example in this pattern:
2836
2837 a+(?C)(*FAIL)
2838
2839 A match with the string "aaaa" always fails, but the callout is taken
2840 before each backtrack happens (in this example, 10 times).
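
The figure of 10 can be checked with a small model of the backtracking
involved. The Python sketch below does not use PCRE2; it assumes that
no optimization skips any callout and simply enumerates every way a+
can match at every starting point. count_callouts is an invented name.

```python
def count_callouts(subject):
    # Count how often (?C) is reached while a+(?C)(*FAIL) is tried.
    callouts = 0
    for start in range(len(subject) + 1):  # unanchored: every start point
        # Greedy a+ first takes the longest run of "a" from this start.
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # Each way of matching a+ (longest first, at least one "a")
        # reaches the callout, after which (*FAIL) forces a backtrack.
        for _length in range(run, 0, -1):
            callouts += 1
    return callouts


print(count_callouts("aaaa"))  # 10
```

The total is 4 + 3 + 2 + 1: four ways to match a+ from the first
character, three from the second, and so on.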
2841
2842 (*ACCEPT:NAME) and (*FAIL:NAME) behave exactly the same as
2843 (*MARK:NAME)(*ACCEPT) and (*MARK:NAME)(*FAIL), respectively.
2844
2845 Recording which path was taken
2846
2847 There is one verb whose main purpose is to track how a match was
2848 arrived at, though it also has a secondary use in conjunction with
2849 advancing the match starting point (see (*SKIP) below).
2850
2851 (*MARK:NAME) or (*:NAME)
2852
2853 A name is always required with this verb. There may be as many
2854 instances of (*MARK) as you like in a pattern, and their names do not
2855 have to be unique.
2856
2857 When a match succeeds, the name of the last-encountered (*MARK:NAME) on
2858 the matching path is passed back to the caller as described in the sec‐
2859 tion entitled "Other information about the match" in the pcre2api docu‐
2860 mentation. This applies to all instances of (*MARK), including those
2861 inside assertions and atomic groups. (There are differences in those
2862 cases when (*MARK) is used in conjunction with (*SKIP) as described
2863 below.)
2864
2865 As well as (*MARK), the (*COMMIT), (*PRUNE) and (*THEN) verbs may have
2866 associated NAME arguments. Whichever is last on the matching path is
2867 passed back. See below for more details of these other verbs.
2868
2869 Here is an example of pcre2test output, where the "mark" modifier
2870 requests the retrieval and outputting of (*MARK) data:
2871
2872 re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
2873 data> XY
2874 0: XY
2875 MK: A
2876 data> XZ
2877 0: XZ
2878 MK: B
2879
2880 The (*MARK) name is tagged with "MK:" in this output, and in this exam‐
2881 ple it indicates which of the two alternatives matched. This is a more
2882 efficient way of obtaining this information than putting each alterna‐
2883 tive in its own capturing parentheses.
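
For comparison, the capturing-parentheses approach mentioned above
looks like this in Python, whose re module has no (*MARK). Each
alternative gets its own named group and the lastgroup attribute of the
match object reports which one matched. The helper name is invented for
the example.

```python
import re

# One named group per alternative, standing in for the two marks.
by_group = re.compile(r"(?P<A>XY)|(?P<B>XZ)")


def which_alternative(subject):
    # Returns (matched text, name of the alternative), or None.
    m = by_group.search(subject)
    return None if m is None else (m.group(0), m.lastgroup)


print(which_alternative("XY"))  # ('XY', 'A')
print(which_alternative("XZ"))  # ('XZ', 'B')
```

Unlike (*MARK), this technique reports nothing after a failed match.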
2884
2885 If a verb with a name is encountered in a positive assertion that is
2886 true, the name is recorded and passed back if it is the last-encoun‐
2887 tered. This does not happen for negative assertions or failing positive
2888 assertions.
2889
2890 After a partial match or a failed match, the last encountered name in
2891 the entire match process is returned. For example:
2892
2893 re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
2894 data> XP
2895 No match, mark = B
2896
2897 Note that in this unanchored example the mark is retained from the
2898 match attempt that started at the letter "X" in the subject. Subsequent
2899 match attempts starting at "P" and then with an empty string do not get
2900 as far as the (*MARK) item, but nevertheless do not reset it.
2901
2902 If you are interested in (*MARK) values after failed matches, you
2903 should probably set the PCRE2_NO_START_OPTIMIZE option (see above) to
2904 ensure that the match is always attempted.
2905
2906 Verbs that act after backtracking
2907
2908 The following verbs do nothing when they are encountered. Matching con‐
2909 tinues with what follows, but if there is a subsequent match failure,
2910 causing a backtrack to the verb, a failure is forced. That is, back‐
2911 tracking cannot pass to the left of the verb. However, when one of
2912 these verbs appears inside an atomic group or in a lookaround assertion
2913 that is true, its effect is confined to that group, because once the
2914 group has been matched, there is never any backtracking into it. Back‐
2915 tracking from beyond an assertion or an atomic group ignores the entire
2916 group, and seeks a preceding backtracking point.
2917
2918 These verbs differ in exactly what kind of failure occurs when back‐
2919 tracking reaches them. The behaviour described below is what happens
2920 when the verb is not in a subroutine or an assertion. Subsequent sec‐
2921 tions cover these special cases.
2922
2923 (*COMMIT) or (*COMMIT:NAME)
2924
2925 This verb causes the whole match to fail outright if there is a later
2926 matching failure that causes backtracking to reach it. Even if the pat‐
2927 tern is unanchored, no further attempts to find a match by advancing
2928 the starting point take place. If (*COMMIT) is the only backtracking
2929 verb that is encountered, once it has been passed pcre2_match() is com‐
2930 mitted to finding a match at the current starting point, or not at all.
2931 For example:
2932
2933 a+(*COMMIT)b
2934
2935 This matches "xxaab" but not "aacaab". It can be thought of as a kind
2936 of dynamic anchor, or "I've started, so I must finish."
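
The "dynamic anchor" behaviour can be traced with a toy model. The
Python function below imitates the steps for a+(*COMMIT)b only; it is
not PCRE2 code, it ignores start-of-match optimizations, and
match_commit is a name made up for this sketch.

```python
def match_commit(subject):
    # Toy model of a+(*COMMIT)b; returns (start, end) or None.
    for start in range(len(subject)):
        # a+ : greedy run of "a" at this starting point.
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end == start:
            # a+ failed before (*COMMIT) was reached, so the normal
            # advance to the next starting point still happens.
            continue
        # (*COMMIT) has now been passed.
        if subject.startswith("b", end):
            return (start, end + 1)
        # "b" failed: the first backtrack lands on (*COMMIT), which
        # fails the whole match with no further starting points.
        return None
    return None


print(match_commit("xxaab"))   # (2, 5)
print(match_commit("aacaab"))  # None
```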
2937
2938 The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COM‐
2939 MIT). It is like (*MARK:NAME) in that the name is remembered for pass‐
2940 ing back to the caller. However, (*SKIP:NAME) searches only for names
2941 set with (*MARK), ignoring those set by (*COMMIT), (*PRUNE) and
2942 (*THEN).
2943
2944 If there is more than one backtracking verb in a pattern, a different
2945 one that follows (*COMMIT) may be triggered first, so merely passing
2946 (*COMMIT) during a match does not always guarantee that a match must be
2947 at this starting point.
2948
2949 Note that (*COMMIT) at the start of a pattern is not the same as an
2950 anchor, unless PCRE2's start-of-match optimizations are turned off, as
2951 shown in this output from pcre2test:
2952
2953 re> /(*COMMIT)abc/
2954 data> xyzabc
2955 0: abc
2956 data>
2957 re> /(*COMMIT)abc/no_start_optimize
2958 data> xyzabc
2959 No match
2960
2961 For the first pattern, PCRE2 knows that any match must start with "a",
2962 so the optimization skips along the subject to "a" before applying the
2963 pattern to the first set of data. The match attempt then succeeds. The
2964 second pattern disables the optimization that skips along to the first
2965 character. The pattern is now applied starting at "x", and so the
2966 (*COMMIT) causes the match to fail without trying any other starting
2967 points.
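
The two runs can be imitated with a toy model in which the only
start-of-match optimization is a scan for the first "a". This is a
Python sketch under that assumption; the real optimizations in PCRE2
are more varied, and search_commit_abc is an invented name.

```python
def search_commit_abc(subject, start_optimize=True):
    # Toy model of /(*COMMIT)abc/; returns (start, end) or None.
    if start_optimize:
        # Optimization: skip along the subject to the first "a".
        start = subject.find("a")
        if start < 0:
            return None
    else:
        start = 0
    # (*COMMIT) is passed at once, so only this starting point is tried.
    if subject.startswith("abc", start):
        return (start, start + 3)
    return None


print(search_commit_abc("xyzabc"))                        # (3, 6)
print(search_commit_abc("xyzabc", start_optimize=False))  # None
```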
2968
2969 (*PRUNE) or (*PRUNE:NAME)
2970
2971 This verb causes the match to fail at the current starting position in
2972 the subject if there is a later matching failure that causes backtrack‐
2973 ing to reach it. If the pattern is unanchored, the normal "bumpalong"
2974 advance to the next starting character then happens. Backtracking can
2975 occur as usual to the left of (*PRUNE), before it is reached, or when
2976 matching to the right of (*PRUNE), but if there is no match to the
2977 right, backtracking cannot cross (*PRUNE). In simple cases, the use of
2978 (*PRUNE) is just an alternative to an atomic group or possessive quan‐
2979 tifier, but there are some uses of (*PRUNE) that cannot be expressed in
2980 any other way. In an anchored pattern (*PRUNE) has the same effect as
2981 (*COMMIT).
2982
2983 The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
2984 It is like (*MARK:NAME) in that the name is remembered for passing back
2985 to the caller. However, (*SKIP:NAME) searches only for names set with
2986 (*MARK), ignoring those set by (*COMMIT), (*PRUNE) or (*THEN).
2987
2988 (*SKIP)
2989
2990 This verb, when given without a name, is like (*PRUNE), except that if
2991 the pattern is unanchored, the "bumpalong" advance is not to the next
2992 character, but to the position in the subject where (*SKIP) was encoun‐
2993 tered. (*SKIP) signifies that whatever text was matched leading up to
2994 it cannot be part of a successful match if there is a later mismatch.
2995 Consider:
2996
2997 a+(*SKIP)b
2998
2999 If the subject is "aaaac...", after the first match attempt fails
3000 (starting at the first character in the string), the starting point
3001 skips on to start the next attempt at "c". Note that a possessive quan‐
3002 tifier does not have the same effect as this example; although it would
3003 suppress backtracking during the first match attempt, the second
3004 attempt would start at the second character instead of skipping on to
3005 "c".
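
The difference in starting points can be listed with a small model.
The Python sketch below compares a+(*SKIP)b with the possessive a++b;
it is a hand-written imitation that ignores start-of-match
optimizations, and start_points is a name chosen for the example.

```python
def start_points(subject, use_skip):
    # Returns (offsets at which a match attempt started, match or None)
    # for a toy model of a+(*SKIP)b (use_skip=True) or a++b (False).
    tried = []
    start = 0
    while start <= len(subject):
        tried.append(start)
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end > start and subject.startswith("b", end):
            return tried, (start, end + 1)
        if use_skip and end > start:
            start = end   # (*SKIP): resume where the verb was reached
        else:
            start += 1    # normal bumpalong by one character
    return tried, None


print(start_points("aaaac", use_skip=True))   # ([0, 4, 5], None)
print(start_points("aaaac", use_skip=False))  # ([0, 1, 2, 3, 4, 5], None)
```

Offset 4 is the "c"; the final attempt at offset 5 is the one made
against the empty string at the end of the subject.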
3006
3007 (*SKIP:NAME)
3008
3009 When (*SKIP) has an associated name, its behaviour is modified. When
3010 such a (*SKIP) is triggered, the previous path through the pattern is
3011 searched for the most recent (*MARK) that has the same name. If one is
3012 found, the "bumpalong" advance is to the subject position that corre‐
3013 sponds to that (*MARK) instead of to where (*SKIP) was encountered. If
3014 no (*MARK) with a matching name is found, the (*SKIP) is ignored.
3015
3016 The search for a (*MARK) name uses the normal backtracking mechanism,
3017 which means that it does not see (*MARK) settings that are inside
3018 atomic groups or assertions, because they are never re-entered by back‐
3019 tracking. Compare the following pcre2test examples:
3020
3021 re> /a(?>(*MARK:X))(*SKIP:X)(*F)|(.)/
3022 data> abc
3023 0: a
3024 1: a
3025 data>
3026 re> /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/
3027 data> abc
3028 0: b
3029 1: b
3030
3031 In the first example, the (*MARK) setting is in an atomic group, so it
3032 is not seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored.
3033 This allows the second branch of the pattern to be tried at the first
3034 character position. In the second example, the (*MARK) setting is not
3035 in an atomic group. This allows (*SKIP:X) to find the (*MARK) when it
3036 backtracks, and this causes a new matching attempt to start at the sec‐
3037 ond character. This time, the (*MARK) is never seen because "a" does
3038 not match "b", so the matcher immediately jumps to the second branch of
3039 the pattern.
3040
3041 Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It
3042 ignores names that are set by (*COMMIT:NAME), (*PRUNE:NAME) or
3043 (*THEN:NAME).
3044
3045 (*THEN) or (*THEN:NAME)
3046
3047 This verb causes a skip to the next innermost alternative when back‐
3048 tracking reaches it. That is, it cancels any further backtracking
3049 within the current alternative. Its name comes from the observation
3050 that it can be used for a pattern-based if-then-else block:
3051
3052 ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
3053
3054 If the COND1 pattern matches, FOO is tried (and possibly further items
3055 after the end of the group if FOO succeeds); on failure, the matcher
3056 skips to the second alternative and tries COND2, without backtracking
3057 into COND1. If that succeeds and BAR fails, COND3 is tried. If subse‐
3058 quently BAZ fails, there are no more alternatives, so there is a back‐
3059 track to whatever came before the entire group. If (*THEN) is not
3060 inside an alternation, it acts like (*PRUNE).
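
The if-then-else reading can be modelled as follows. This Python
sketch treats each alternative as a (COND, BODY) pair of ordinary re
patterns and ignores anything that follows the group; it is a
simplified model, not a translation of (*THEN), and match_then_group is
an invented name.

```python
import re


def match_then_group(alternatives, subject, pos=0):
    # alternatives: list of (cond, body) patterns, each pair standing
    # for COND (*THEN) BODY. Returns (index used, end offset) or None.
    for index, (cond, body) in enumerate(alternatives):
        cond_match = re.compile(cond).match(subject, pos)
        if cond_match is None:
            continue  # condition failed: try the next alternative
        # (*THEN) has been passed: only the first way COND matched is
        # kept, there is no backtracking into it.
        body_match = re.compile(body).match(subject, cond_match.end())
        if body_match is not None:
            return index, body_match.end()
        # BODY failed: the backtrack reaches (*THEN), which skips
        # straight to the next alternative.
    return None


alts = [(r"a+", r"ab"), (r"a", r"aab")]
print(match_then_group(alts, "aaab"))        # (1, 4)
print(bool(re.match(r"a+ab", "aaab")))       # True
```

In the first alternative a+ takes "aaa" and "ab" then fails; an
ordinary a+ab would backtrack into a+ and succeed, whereas the model
moves on to the second alternative.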
3061
3062 The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN).
3063 It is like (*MARK:NAME) in that the name is remembered for passing back
3064 to the caller. However, (*SKIP:NAME) searches only for names set with
3065 (*MARK), ignoring those set by (*COMMIT), (*PRUNE) and (*THEN).
3066
3067 A subpattern that does not contain a | character is just a part of the
3068 enclosing alternative; it is not a nested alternation with only one
3069 alternative. The effect of (*THEN) extends beyond such a subpattern to
3070 the enclosing alternative. Consider this pattern, where A, B, etc. are
3071 complex pattern fragments that do not contain any | characters at this
3072 level:
3073
3074 A (B(*THEN)C) | D
3075
3076 If A and B are matched, but there is a failure in C, matching does not
3077 backtrack into A; instead it moves to the next alternative, that is, D.
3078 However, if the subpattern containing (*THEN) is given an alternative,
3079 it behaves differently:
3080
3081 A (B(*THEN)C | (*FAIL)) | D
3082
3083 The effect of (*THEN) is now confined to the inner subpattern. After a
3084 failure in C, matching moves to (*FAIL), which causes the whole subpat‐
3085 tern to fail because there are no more alternatives to try. In this
3086 case, matching does now backtrack into A.
3087
3088 Note that a conditional subpattern is not considered as having two
3089 alternatives, because only one is ever used. In other words, the |
3090 character in a conditional subpattern has a different meaning. Ignoring
3091 white space, consider:
3092
3093 ^.*? (?(?=a) a | b(*THEN)c )
3094
3095 If the subject is "ba", this pattern does not match. Because .*? is
3096 ungreedy, it initially matches zero characters. The condition (?=a)
3097 then fails, the character "b" is matched, but "c" is not. At this
3098 point, matching does not backtrack to .*? as might perhaps be expected
3099 from the presence of the | character. The conditional subpattern is
3100 part of the single alternative that comprises the whole pattern, and so
3101 the match fails. (If there was a backtrack into .*?, allowing it to
3102 match "b", the match would succeed.)
3103
3104 The verbs just described provide four different "strengths" of control
3105 when subsequent matching fails. (*THEN) is the weakest, carrying on the
3106 match at the next alternative. (*PRUNE) comes next, failing the match
3107 at the current starting position, but allowing an advance to the next
3108 character (for an unanchored pattern). (*SKIP) is similar, except that
3109 the advance may be more than one character. (*COMMIT) is the strongest,
3110 causing the entire match to fail.
3111
3112 More than one backtracking verb
3113
3114 If more than one backtracking verb is present in a pattern, the one
3115 that is backtracked onto first acts. For example, consider this pat‐
3116 tern, where A, B, etc. are complex pattern fragments:
3117
3118 (A(*COMMIT)B(*THEN)C|ABD)
3119
3120 If A matches but B fails, the backtrack to (*COMMIT) causes the entire
3121 match to fail. However, if A and B match, but C fails, the backtrack to
3122 (*THEN) causes the next alternative (ABD) to be tried. This behaviour
3123 is consistent, but is not always the same as Perl's. It means that if
3124 two or more backtracking verbs appear in succession, all but the last
3125 of them have no effect. Consider this example:
3126
3127 ...(*COMMIT)(*PRUNE)...
3128
3129 If there is a matching failure to the right, backtracking onto (*PRUNE)
3130 causes it to be triggered, and its action is taken. There can never be
3131 a backtrack onto (*COMMIT).
3132
3133 Backtracking verbs in repeated groups
3134
3135 PCRE2 sometimes differs from Perl in its handling of backtracking verbs
3136 in repeated groups. For example, consider:
3137
3138 /(a(*COMMIT)b)+ac/
3139
3140 If the subject is "abac", Perl matches unless its optimizations are
3141 disabled, but PCRE2 always fails because the (*COMMIT) in the second
3142 repeat of the group acts.
3143
3144 Backtracking verbs in assertions
3145
3146 (*FAIL) in any assertion has its normal effect: it forces an immediate
3147 backtrack. The behaviour of the other backtracking verbs depends on
3148 whether the assertion is standalone or acting as the condition
3149 in a conditional subpattern.
3150
3151 (*ACCEPT) in a standalone positive assertion causes the assertion to
3152 succeed without any further processing; captured strings and a (*MARK)
3153 name (if set) are retained. In a standalone negative assertion,
3154 (*ACCEPT) causes the assertion to fail without any further processing;
3155 captured substrings and any (*MARK) name are discarded.
3156
3157 If the assertion is a condition, (*ACCEPT) causes the condition to be
3158 true for a positive assertion and false for a negative one; captured
3159 substrings are retained in both cases.
3160
3161 The remaining verbs act only when a later failure causes a backtrack to
3162 reach them. This means that their effect is confined to the assertion,
3163 because lookaround assertions are atomic. A backtrack that occurs after
3164 an assertion is complete does not jump back into the assertion. Note in
3165 particular that a (*MARK) name that is set in an assertion is not
3166 "seen" by an instance of (*SKIP:NAME) later in the pattern.
3167
3168 The effect of (*THEN) is not allowed to escape beyond an assertion. If
3169 there are no more branches to try, (*THEN) causes a positive assertion
3170 to be false, and a negative assertion to be true.
3171
3172 The other backtracking verbs are not treated specially if they appear
3173 in a standalone positive assertion. In a conditional positive asser‐
3174 tion, backtracking (from within the assertion) into (*COMMIT), (*SKIP),
3175 or (*PRUNE) causes the condition to be false. However, for both stand‐
3176 alone and conditional negative assertions, backtracking into (*COMMIT),
3177 (*SKIP), or (*PRUNE) causes the assertion to be true, without consider‐
3178 ing any further alternative branches.
3179
3180 Backtracking verbs in subroutines
3181
3182 These behaviours occur whether or not the subpattern is called recur‐
3183 sively.
3184
3185 (*ACCEPT) in a subpattern called as a subroutine causes the subroutine
3186 match to succeed without any further processing. Matching then contin‐
3187 ues after the subroutine call. Perl documents this behaviour. Perl's
3188 treatment of the other verbs in subroutines is different in some cases.
3189
3190 (*FAIL) in a subpattern called as a subroutine has its normal effect:
3191 it forces an immediate backtrack.
3192
3193 (*COMMIT), (*SKIP), and (*PRUNE) cause the subroutine match to fail
3194 when triggered by being backtracked to in a subpattern called as a sub‐
3195 routine. There is then a backtrack at the outer level.
3196
3197 (*THEN), when triggered, skips to the next alternative in the innermost
3198 enclosing group within the subpattern that has alternatives (its normal
3199 behaviour). However, if there is no such group within the subroutine
3200 subpattern, the subroutine match fails and there is a backtrack at the
3201 outer level.
3202
3204
3205 pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3),
3206 pcre2(3).
3207
3209
3210 Philip Hazel
3211 University Computing Service
3212 Cambridge, England.
3213
3215
3216 Last updated: 04 September 2018
3217 Copyright (c) 1997-2018 University of Cambridge.
3218
3219
3220
3221PCRE2 10.32 04 September 2018 PCRE2PATTERN(3)