PCRE2PATTERN(3)         Library Functions Manual         PCRE2PATTERN(3)

NAME

PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 REGULAR EXPRESSION DETAILS

The syntax and semantics of the regular expressions that are supported
by PCRE2 are described in detail below. There is a quick-reference
syntax summary in the pcre2syntax page. PCRE2 tries to match Perl syntax
and semantics as closely as it can. PCRE2 also supports some alternative
regular expression syntax (which does not conflict with the Perl syntax)
in order to provide some compatibility with regular expressions in
Python, .NET, and Oniguruma.

Perl's regular expressions are described in its own documentation, and
regular expressions in general are covered in a number of books, some of
which have copious examples. Jeffrey Friedl's "Mastering Regular
Expressions", published by O'Reilly, covers regular expressions in great
detail. This description of PCRE2's regular expressions is intended as
reference material.

This document discusses the patterns that are supported by PCRE2 when
its main matching function, pcre2_match(), is used. PCRE2 also has an
alternative matching function, pcre2_dfa_match(), which matches using a
different algorithm that is not Perl-compatible. Some of the features
discussed below are not available when DFA matching is used. The
advantages and disadvantages of the alternative function, and how it
differs from the normal function, are discussed in the pcre2matching
page.

SPECIAL START-OF-PATTERN ITEMS

A number of options that can be passed to pcre2_compile() can also be
set by special items at the start of a pattern. These are not
Perl-compatible, but are provided to make these options accessible to
pattern writers who are not able to change the program that processes
the pattern. Any number of these items may appear, but they must all be
together right at the start of the pattern string, and the letters must
be in upper case.

UTF support

In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32
can be specified for the 32-bit library, in which case it constrains the
character values to valid Unicode code points. To process UTF strings,
PCRE2 must be built to include Unicode support (which is the default).
When using UTF strings you must either call the compiling function with
the PCRE2_UTF option, or the pattern must start with the special
sequence (*UTF), which is equivalent to setting the relevant option. How
setting a UTF mode affects pattern matching is mentioned in several
places below. There is also a summary of features in the pcre2unicode
page.

Some applications that allow their users to supply patterns may wish to
restrict them to non-UTF data for security reasons. If the
PCRE2_NEVER_UTF option is passed to pcre2_compile(), (*UTF) is not
allowed, and its appearance in a pattern causes an error.

Unicode property support

Another special sequence that may appear at the start of a pattern is
(*UCP). This has the same effect as setting the PCRE2_UCP option: it
causes sequences such as \d and \w to use Unicode properties to
determine character types, instead of recognizing only characters with
codes less than 256 via a lookup table.

Some applications that allow their users to supply patterns may wish to
restrict them for security reasons. If the PCRE2_NEVER_UCP option is
passed to pcre2_compile(), (*UCP) is not allowed, and its appearance in
a pattern causes an error.

Locking out empty string matching

Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same
effect as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option to
whichever matching function is subsequently called to match the pattern.
These options lock out the matching of empty strings, either entirely,
or only at the start of the subject.

Disabling auto-possessification

If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as
setting the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from making
quantifiers possessive when what follows cannot match the repeated item.
For example, by default a+b is treated as a++b. For more details, see
the pcre2api documentation.

Disabling start-up optimizations

If a pattern starts with (*NO_START_OPT), it has the same effect as
setting the PCRE2_NO_START_OPTIMIZE option. This disables several
optimizations for quickly reaching "no match" results. For more details,
see the pcre2api documentation.

Disabling automatic anchoring

If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect as
setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimizations
that apply to patterns whose top-level branches all start with .* (match
any number of arbitrary characters). For more details, see the pcre2api
documentation.

Disabling JIT compilation

If a pattern that starts with (*NO_JIT) is successfully compiled, an
attempt by the application to apply the JIT optimization by calling
pcre2_jit_compile() is ignored.

Setting match and recursion limits

The caller of pcre2_match() can set a limit on the number of times the
internal match() function is called and on the maximum depth of
recursive calls. These facilities are provided to catch runaway matches
that are provoked by patterns with huge matching trees (a typical
example is a pattern with nested unlimited repeats) and to avoid running
out of system stack by too much recursion. When one of these limits is
reached, pcre2_match() gives an error return. The limits can also be set
by items at the start of the pattern of the form

  (*LIMIT_MATCH=d)
  (*LIMIT_RECURSION=d)

where d is any number of decimal digits. However, the value of the
setting must be less than the value set (or defaulted) by the caller of
pcre2_match() for it to have any effect. In other words, the pattern
writer can lower the limits set by the programmer, but not raise them.
If there is more than one setting of one of these limits, the lower
value is used.
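
The lower-only rule can be pictured with a small Python sketch. It is
only a model of the rule stated above, not PCRE2 code; the function name
effective_limit and the numbers used with it are invented for the
illustration.

```python
def effective_limit(caller_limit, pattern_limits):
    """Combine the caller's limit with (*LIMIT_...=d) values from a pattern.

    caller_limit   -- the value set (or defaulted) by the caller
    pattern_limits -- the d values of the items, in pattern order
    """
    limit = caller_limit
    for value in pattern_limits:
        # A pattern setting only takes effect when it is lower than the
        # limit currently in force, so the lowest value always wins.
        if value < limit:
            limit = value
    return limit
```

With a caller limit of 1000, a pattern item asking for 5000 leaves the
limit at 1000, while items asking for 500 and then 200 reduce it to 200.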

The match limit is used (but in a different way) when JIT is being used,
but it is not relevant, and is ignored, when matching with
pcre2_dfa_match(). However, the recursion limit is relevant for DFA
matching, which does use some function recursion, in particular, for
recursions within the pattern.

Newline conventions

PCRE2 supports five different conventions for indicating line breaks in
strings: a single CR (carriage return) character, a single LF (linefeed)
character, the two-character sequence CRLF, any of the three preceding,
or any Unicode newline sequence. The pcre2api page has further
discussion about newlines, and shows how to set the newline convention
when calling pcre2_compile().

It is also possible to specify a newline convention by starting a
pattern string with one of the following five sequences:

  (*CR)        carriage return
  (*LF)        linefeed
  (*CRLF)      carriage return, followed by linefeed
  (*ANYCRLF)   any of the three above
  (*ANY)       all Unicode newline sequences

These override the default and the options given to the compiling
function. For example, on a Unix system where LF is the default newline
sequence, the pattern

  (*CR)a.b

changes the convention to CR. That pattern matches "a\nb" because LF is
no longer a newline. If more than one of these settings is present, the
last one is used.

The newline convention affects where the circumflex and dollar
assertions are true. It also affects the interpretation of the dot
metacharacter when PCRE2_DOTALL is not set, and the behaviour of \N.
However, it does not affect what the \R escape sequence matches. By
default, this is any Unicode newline sequence, for Perl compatibility.
However, this can be changed; see the description of \R in the section
entitled "Newline sequences" below. A change of \R setting can be
combined with a change of newline convention.

Specifying what \R matches

It is possible to restrict \R to match only CR, LF, or CRLF (instead of
the complete set of Unicode line endings) by setting the option
PCRE2_BSR_ANYCRLF at compile time. This effect can also be achieved by
starting a pattern with (*BSR_ANYCRLF). For completeness,
(*BSR_UNICODE) is also recognized, corresponding to PCRE2_BSR_UNICODE.
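
To make the placement rule concrete, here is a minimal Python sketch
that splits the leading items off a pattern string. It is an
illustration only, not PCRE2's parser: the function name
split_leading_items is hypothetical, and the sketch knows only the item
names described in this section.

```python
import re

# Item names taken from the section above. Anything else of the form
# (*NAME) is left in the pattern untouched.
FLAG_ITEMS = {
    "UTF", "UCP", "NOTEMPTY", "NOTEMPTY_ATSTART", "NO_AUTO_POSSESS",
    "NO_START_OPT", "NO_DOTSTAR_ANCHOR", "NO_JIT",
    "CR", "LF", "CRLF", "ANYCRLF", "ANY", "BSR_ANYCRLF", "BSR_UNICODE",
}
VALUE_ITEMS = {"LIMIT_MATCH", "LIMIT_RECURSION"}

# Upper case letters only, with an optional =d value of decimal digits.
_ITEM = re.compile(r"\(\*([A-Z_]+)(?:=([0-9]+))?\)")

def split_leading_items(pattern):
    """Return (items, rest); items is a list of (name, value) pairs."""
    items, pos = [], 0
    while True:
        m = _ITEM.match(pattern, pos)
        if m is None:
            break
        name, digits = m.group(1), m.group(2)
        if digits is None and name in FLAG_ITEMS:
            items.append((name, None))
        elif digits is not None and name in VALUE_ITEMS:
            items.append((name, int(digits)))
        else:
            break               # not a start-of-pattern item
        pos = m.end()
    return items, pattern[pos:]
```

Because scanning stops at the first thing that is not a recognized item,
an item written in lower case or placed later in the pattern is simply
returned as part of the remaining pattern text.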

EBCDIC CHARACTER CODES

PCRE2 can be compiled to run in an environment that uses EBCDIC as its
character code rather than ASCII or Unicode (typically a mainframe
system). In the sections below, character code values are ASCII or
Unicode; in an EBCDIC environment these characters may have different
code values, and there are no code points greater than 255.

CHARACTERS AND METACHARACTERS

A regular expression is a pattern that is matched against a subject
string from left to right. Most characters stand for themselves in a
pattern, and match the corresponding characters in the subject. As a
trivial example, the pattern

  The quick brown fox

matches a portion of a subject string that is identical to itself. When
caseless matching is specified (the PCRE2_CASELESS option), letters are
matched independently of case.

The power of regular expressions comes from the ability to include
alternatives and repetitions in the pattern. These are encoded in the
pattern by the use of metacharacters, which do not stand for themselves
but instead are interpreted in some special way.

There are two different sets of metacharacters: those that are
recognized anywhere in the pattern except within square brackets, and
those that are recognized within square brackets. Outside square
brackets, the metacharacters are as follows:

  \      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
         also "possessive quantifier"
  {      start min/max quantifier

Part of a pattern that is in square brackets is called a "character
class". In a character class the only metacharacters are:

  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (only if followed by POSIX
           syntax)
  ]      terminates the character class

The following sections describe the use of each of the metacharacters.

BACKSLASH

The backslash character has several uses. Firstly, if it is followed by
a character that is not a number or a letter, it takes away any special
meaning that character may have. This use of backslash as an escape
character applies both inside and outside character classes.

For example, if you want to match a * character, you write \* in the
pattern. This escaping action applies whether or not the following
character would otherwise be interpreted as a metacharacter, so it is
always safe to precede a non-alphanumeric with backslash to specify that
it stands for itself. In particular, if you want to match a backslash,
you write \\.

In a UTF mode, only ASCII numbers and letters have any special meaning
after a backslash. All other characters (in particular, those whose
codepoints are greater than 127) are treated as literals.

If a pattern is compiled with the PCRE2_EXTENDED option, most white
space in the pattern (other than in a character class), and characters
between a # outside a character class and the next newline, inclusive,
are ignored. An escaping backslash can be used to include a white space
or # character as part of the pattern.

If you want to remove the special meaning from a sequence of characters,
you can do so by putting them between \Q and \E. This is different from
Perl in that $ and @ are handled as literals in \Q...\E sequences in
PCRE2, whereas in Perl, $ and @ cause variable interpolation. Note the
following examples:

  Pattern            PCRE2 matches   Perl matches

  \Qabc$xyz\E        abc$xyz         abc followed by the
                                       contents of $xyz
  \Qabc\$xyz\E       abc\$xyz        abc\$xyz
  \Qabc\E\$\Qxyz\E   abc$xyz         abc$xyz

The \Q...\E sequence is recognized both inside and outside character
classes. An isolated \E that is not preceded by \Q is ignored. If \Q is
not followed by \E later in the pattern, the literal interpretation
continues to the end of the pattern (that is, \E is assumed at the end).
If the isolated \Q is inside a character class, this causes an error,
because the character class is not terminated.
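
The \Q...\E rules can be modelled in a few lines of Python. The sketch
below is not how PCRE2 implements quoting; it rewrites a pattern so that
each quoted character is escaped individually, which lets Python's own
re module stand in for checking the "PCRE2 matches" column of the table.
The name expand_quoting is made up for this example, and character
classes are not treated specially.

```python
import re

def expand_quoting(pattern):
    """Rewrite \\Q...\\E runs as individually escaped literal characters."""
    out = []
    i, n = 0, len(pattern)
    quoting = False
    while i < n:
        two = pattern[i:i + 2]
        if quoting:
            if two == r"\E":
                quoting = False         # end of the literal run
                i += 2
            else:
                out.append(re.escape(pattern[i]))
                i += 1
        elif two == r"\Q":
            quoting = True              # start of a literal run
            i += 2
        elif two == r"\E":
            i += 2                      # isolated \E is ignored
        elif pattern[i] == "\\" and i + 1 < n:
            out.append(two)             # ordinary escape, copied as is
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    # If \Q was never closed, everything up to the end was taken
    # literally, which is the same as assuming \E at the end.
    return "".join(out)
```

For instance, re.fullmatch(expand_quoting(r"\Qabc$xyz\E"), "abc$xyz")
succeeds, because the dollar sign inside the quoted run has lost its
special meaning.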

Non-printing characters

A second use of backslash provides a way of encoding non-printing
characters in patterns in a visible manner. There is no restriction on
the appearance of non-printing characters in a pattern, but when a
pattern is being prepared by text editing, it is often easier to use one
of the following escape sequences than the binary character it
represents. In an ASCII or Unicode environment, these escapes are as
follows:

  \a          alarm, that is, the BEL character (hex 07)
  \cx         "control-x", where x is any printable ASCII character
  \e          escape (hex 1B)
  \f          form feed (hex 0C)
  \n          linefeed (hex 0A)
  \r          carriage return (hex 0D)
  \t          tab (hex 09)
  \0dd        character with octal code 0dd
  \ddd        character with octal code ddd, or back reference
  \o{ddd..}   character with octal code ddd..
  \xhh        character with hex code hh
  \x{hhh..}   character with hex code hhh.. (default mode)
  \uhhhh      character with hex code hhhh (when PCRE2_ALT_BSUX is set)

The precise effect of \cx on ASCII characters is as follows: if x is a
lower case letter, it is converted to upper case. Then bit 6 of the
character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
(A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
hex 7B (; is 3B). If the code unit following \c has a value less than 32
or greater than 126, a compile-time error occurs.
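
The arithmetic is short enough to write out. This Python sketch follows
the ASCII rule exactly as stated in the preceding paragraph; the
function name control_code is an invention for the example, and a
ValueError stands in for the compile-time error.

```python
def control_code(x):
    """Return the character code produced by \\cx in an ASCII environment."""
    code = ord(x)
    if code < 32 or code > 126:
        raise ValueError("the character after \\c must be printable ASCII")
    if "a" <= x <= "z":
        code -= 32              # convert a lower case letter to upper case
    return code ^ 0x40          # invert bit 6 (hex 40)
```

So control_code("A") is 0x01, control_code("z") is 0x1A,
control_code("{") is 0x3B, and control_code(";") is 0x7B.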

When PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t
generate the appropriate EBCDIC code values. The \c escape is processed
as specified for Perl in the perlebcdic document. The only characters
that are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or
?. Any other character provokes a compile-time error. The sequence \c@
encodes character code 0; after \c the letters (in either case) encode
characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters
27-31 (hex 1B to hex 1F), and \c? becomes either 255 (hex FF) or 95 (hex
5F).

Thus, apart from \c?, these escapes generate the same character code
values as they do in an ASCII environment, though the meanings of the
values mostly differ. For example, \cG always generates code value 7,
which is BEL in ASCII but DEL in EBCDIC.

The sequence \c? generates DEL (127, hex 7F) in an ASCII environment,
but because 127 is not a control character in EBCDIC, Perl makes it
generate the APC character. Unfortunately, there are several variants of
EBCDIC. In most of them the APC character has the value 255 (hex FF),
but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If certain
other characters have POSIX-BC values, PCRE2 makes \c? generate 95;
otherwise it generates 255.

After \0 up to two further octal digits are read. If there are fewer
than two digits, just those that are present are used. Thus the sequence
\0\x\015 specifies two binary zeros followed by a CR character (code
value 13). Make sure you supply two digits after the initial zero if the
pattern character that follows is itself an octal digit.

The escape \o must be followed by a sequence of octal digits, enclosed
in braces. An error occurs if this is not the case. This escape is a
recent addition to Perl; it provides a way of specifying character code
points as octal numbers greater than 0777, and it also allows octal
numbers and back references to be unambiguously specified.

For greater clarity and unambiguity, it is best to avoid following \ by
a digit greater than zero. Instead, use \o{} or \x{} to specify
character numbers, and \g{} to specify back references. The following
paragraphs describe the old, ambiguous syntax.

The handling of a backslash followed by a digit other than 0 is
complicated, and Perl has changed over time, causing PCRE2 also to
change.

Outside a character class, PCRE2 reads the digit and any following
digits as a decimal number. If the number is less than 10, begins with
the digit 8 or 9, or if there are at least that many previous capturing
left parentheses in the expression, the entire sequence is taken as a
back reference. A description of how this works is given later,
following the discussion of parenthesized subpatterns. Otherwise, up to
three octal digits are read to form a character code.

Inside a character class, PCRE2 handles \8 and \9 as the literal
characters "8" and "9", and otherwise reads up to three octal digits
following the backslash, using them to generate a data character. Any
subsequent digits stand for themselves. For example, outside a character
class:

  \040   is another way of writing an ASCII space
  \40    is the same, provided there are fewer than 40
           previous capturing subpatterns
  \7     is always a back reference
  \11    might be a back reference, or another way of
           writing a tab
  \011   is always a tab
  \0113  is a tab followed by the character "3"
  \113   might be a back reference, otherwise the
           character with octal code 113
  \377   might be a back reference, otherwise
           the value 255 (decimal)
  \81    is always a back reference

Note that octal values of 100 or greater that are specified using this
syntax must not be introduced by a leading zero, because no more than
three octal digits are ever read.
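
The decision procedure for digits outside a character class can be
sketched in Python as follows. This is a model of the rule as worded
above, not PCRE2's source; classify_escape and read_octal are
hypothetical names, the caller is assumed to pass the complete run of
digits that follows the backslash, and previous_groups is the number of
capturing left parentheses seen so far.

```python
OCTAL_DIGITS = "01234567"

def read_octal(text, max_digits):
    """Read up to max_digits leading octal digits; return (code, rest)."""
    n = 0
    while n < max_digits and n < len(text) and text[n] in OCTAL_DIGITS:
        n += 1
    return int(text[:n], 8), text[n:]

def classify_escape(digits, previous_groups):
    """Interpret the digit string following a backslash (outside a class).

    Returns ("backref", number) or ("char", code, trailing_digits), where
    trailing_digits are digits that stand for themselves.
    """
    if digits[0] != "0":
        number = int(digits)    # the digits read as a decimal number
        if number < 10 or digits[0] in "89" or previous_groups >= number:
            return ("backref", number)
    # A leading 0 plus up to two more digits, or up to three octal
    # digits in the non-back-reference case.
    code, rest = read_octal(digits, 3)
    return ("char", code, rest)
```

For example, classify_escape("11", 0) gives ("char", 9, ""), a tab,
whereas classify_escape("11", 11) gives ("backref", 11).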

By default, after \x that is not followed by {, from zero to two
hexadecimal digits are read (letters can be in upper or lower case). Any
number of hexadecimal digits may appear between \x{ and }. If a
character other than a hexadecimal digit appears between \x{ and }, or
if there is no terminating }, an error occurs.

If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as just
described only when it is followed by two hexadecimal digits. Otherwise,
it matches a literal "x" character. In this mode, support for code
points greater than 255 is provided by \u, which must be followed by
four hexadecimal digits; otherwise it matches a literal "u" character.

Characters whose value is less than 256 can be defined by either of the
two syntaxes for \x (or by \u in PCRE2_ALT_BSUX mode). There is no
difference in the way they are handled. For example, \xdc is exactly the
same as \x{dc} (or \u00dc in PCRE2_ALT_BSUX mode).

Constraints on character values

Characters that are specified using octal or hexadecimal numbers are
limited to certain values, as follows:

  8-bit non-UTF mode    less than 0x100
  8-bit UTF-8 mode      no greater than 0x10ffff and a valid codepoint
  16-bit non-UTF mode   less than 0x10000
  16-bit UTF-16 mode    no greater than 0x10ffff and a valid codepoint
  32-bit non-UTF mode   less than 0x100000000
  32-bit UTF-32 mode    no greater than 0x10ffff and a valid codepoint

Invalid Unicode codepoints are those in the range 0xd800 to 0xdfff (the
so-called "surrogate" codepoints).
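
As a worked form of these limits, the Python sketch below checks a
numeric value for a given code unit width and UTF setting. It is a
sketch under stated assumptions rather than PCRE2's own check: the name
is_allowed_value is hypothetical, 0x10ffff itself is accepted because
it is the highest Unicode code point, and only the surrogate range is
treated as invalid.

```python
def is_allowed_value(value, width, utf):
    """Check a character value for a library width of 8, 16, or 32 bits."""
    if utf:
        # Assumption of this sketch: surrogates are the only excluded
        # code points, and 0x10FFFF is the inclusive upper bound.
        if 0xD800 <= value <= 0xDFFF:
            return False
        return 0 <= value <= 0x10FFFF
    # Non-UTF modes: the value must fit in one code unit.
    return 0 <= value < (1 << width)
```

Note that in a UTF mode the bound does not depend on the code unit
width, so 0x100 is acceptable in 8-bit UTF-8 mode but not in 8-bit
non-UTF mode.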

Escape sequences in character classes

All the sequences that define a single character value can be used both
inside and outside character classes. In addition, inside a character
class, \b is interpreted as the backspace character (hex 08).

\N is not allowed in a character class. \B, \R, and \X are not special
inside a character class. Like other unrecognized alphabetic escape
sequences, they cause an error. Outside a character class, these
sequences have different meanings.

Unsupported escape sequences

In Perl, the sequences \l, \L, \u, and \U are recognized by its string
handler and used to modify the case of following characters. By default,
PCRE2 does not support these escape sequences. However, if the
PCRE2_ALT_BSUX option is set, \U matches a "U" character, and \u can be
used to define a character by code point, as described in the previous
section.

Absolute and relative back references

The sequence \g followed by a signed or unsigned number, optionally
enclosed in braces, is an absolute or relative back reference. A named
back reference can be coded as \g{name}. Back references are discussed
later, following the discussion of parenthesized subpatterns.

Absolute and relative subroutine calls

For compatibility with Oniguruma, the non-Perl syntax \g followed by a
name or a number enclosed either in angle brackets or single quotes, is
an alternative syntax for referencing a subpattern as a "subroutine".
Details are discussed later. Note that \g{...} (Perl syntax) and \g<...>
(Oniguruma syntax) are not synonymous. The former is a back reference;
the latter is a subroutine call.

Generic character types

Another use of backslash is for specifying generic character types:

  \d     any decimal digit
  \D     any character that is not a decimal digit
  \h     any horizontal white space character
  \H     any character that is not a horizontal white space character
  \s     any white space character
  \S     any character that is not a white space character
  \v     any vertical white space character
  \V     any character that is not a vertical white space character
  \w     any "word" character
  \W     any "non-word" character

There is also the single sequence \N, which matches a non-newline
character. This is the same as the "." metacharacter when PCRE2_DOTALL
is not set. Perl also uses \N to match characters by name; PCRE2 does
not support this.

Each pair of lower and upper case escape sequences partitions the
complete set of characters into two disjoint sets. Any given character
matches one, and only one, of each pair. The sequences can appear both
inside and outside character classes. They each match one character of
the appropriate type. If the current matching point is at the end of the
subject string, all of them fail, because there is no character to
match.

The default \s characters are HT (9), LF (10), VT (11), FF (12), CR
(13), and space (32), which are defined as white space in the "C"
locale. This list may vary if locale-specific matching is taking place.
For example, in some locales the "non-breaking space" character (\xA0)
is recognized as white space, and in others the VT character is not.

A "word" character is an underscore or any character that is a letter or
digit. By default, the definition of letters and digits is controlled by
PCRE2's low-valued character tables, and may vary if locale-specific
matching is taking place (see "Locale support" in the pcre2api page).
For example, in a French locale such as "fr_FR" in Unix-like systems, or
"french" in Windows, some character codes greater than 127 are used for
accented letters, and these are then matched by \w. The use of locales
with Unicode is discouraged.

By default, characters whose code points are greater than 127 never
match \d, \s, or \w, and always match \D, \S, and \W, although this may
be different for characters in the range 128-255 when locale-specific
matching is happening. These escape sequences retain their original
meanings from before Unicode support was available, mainly for
efficiency reasons. If the PCRE2_UCP option is set, the behaviour is
changed so that Unicode properties are used to determine character
types, as follows:

  \d  any character that matches \p{Nd} (decimal digit)
  \s  any character that matches \p{Z} or \h or \v
  \w  any character that matches \p{L} or \p{N}, plus underscore

The upper case escapes match the inverse sets of characters. Note that
\d matches only decimal digits, whereas \w matches any Unicode digit, as
well as any Unicode letter, and underscore. Note also that PCRE2_UCP
affects \b and \B because they are defined in terms of \w and \W.
Matching these sequences is noticeably slower when PCRE2_UCP is set.
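
The difference between \d and \w under PCRE2_UCP can be seen with
Python's unicodedata module standing in for PCRE2's property tables.
This is an approximation for illustration: the helper names ucp_digit
and ucp_word are invented here, and Python's Unicode data may be from a
different Unicode version than the one a given PCRE2 build uses.

```python
import unicodedata

def ucp_digit(ch):
    """Model of \\d with PCRE2_UCP set: general category Nd."""
    return unicodedata.category(ch) == "Nd"

def ucp_word(ch):
    """Model of \\w with PCRE2_UCP set: any L or N category, or underscore."""
    return ch == "_" or unicodedata.category(ch)[0] in "LN"
```

SUPERSCRIPT TWO (U+00B2) has category No, so in this model it satisfies
ucp_word but not ucp_digit, whereas ARABIC-INDIC DIGIT THREE (U+0663),
category Nd, satisfies both.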

The sequences \h, \H, \v, and \V, in contrast to the other sequences,
which match only ASCII characters by default, always match a specific
list of code points, whether or not PCRE2_UCP is set. The horizontal
space characters are:

  U+0009     Horizontal tab (HT)
  U+0020     Space
  U+00A0     Non-break space
  U+1680     Ogham space mark
  U+180E     Mongolian vowel separator
  U+2000     En quad
  U+2001     Em quad
  U+2002     En space
  U+2003     Em space
  U+2004     Three-per-em space
  U+2005     Four-per-em space
  U+2006     Six-per-em space
  U+2007     Figure space
  U+2008     Punctuation space
  U+2009     Thin space
  U+200A     Hair space
  U+202F     Narrow no-break space
  U+205F     Medium mathematical space
  U+3000     Ideographic space

The vertical space characters are:

  U+000A     Linefeed (LF)
  U+000B     Vertical tab (VT)
  U+000C     Form feed (FF)
  U+000D     Carriage return (CR)
  U+0085     Next line (NEL)
  U+2028     Line separator
  U+2029     Paragraph separator

In 8-bit, non-UTF-8 mode, only the characters with code points less than
256 are relevant.
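
Because these two lists are fixed, they translate directly into lookup
sets. The Python sketch below simply transcribes the code points listed
above; the names HSPACE, VSPACE, is_hspace and is_vspace are chosen for
the example and are not PCRE2 identifiers.

```python
# Code points matched by \h, transcribed from the list above.
HSPACE = frozenset(
    [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E]
    + list(range(0x2000, 0x200B))       # U+2000 through U+200A
    + [0x202F, 0x205F, 0x3000]
)

# Code points matched by \v, transcribed from the list above.
VSPACE = frozenset([0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029])

def is_hspace(code_point):
    """True if \\h would match this code point; \\H is the inverse."""
    return code_point in HSPACE

def is_vspace(code_point):
    """True if \\v would match this code point; \\V is the inverse."""
    return code_point in VSPACE
```

The two sets are disjoint: no code point is both horizontal and vertical
white space.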

Newline sequences

Outside a character class, by default, the escape sequence \R matches
any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent
to the following:

  (?>\r\n|\n|\x0b|\f|\r|\x85)

This is an example of an "atomic group", details of which are given
below. This particular group matches either the two-character sequence
CR followed by LF, or one of the single characters LF (linefeed,
U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (carriage
return, U+000D), or NEL (next line, U+0085). Because this is an atomic
group, the two-character sequence is treated as a single unit that
cannot be split.

In other modes, two additional characters whose codepoints are greater
than 255 are added: LS (line separator, U+2028) and PS (paragraph
separator, U+2029). Unicode support is not needed for these characters
to be recognized.

It is possible to restrict \R to match only CR, LF, or CRLF (instead of
the complete set of Unicode line endings) by setting the option
PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation for
"backslash R".) This can be made the default when PCRE2 is built; if
this is the case, the other behaviour can be requested via the
PCRE2_BSR_UNICODE option. It is also possible to specify these settings
by starting a pattern string with one of the following sequences:

  (*BSR_ANYCRLF)   CR, LF, or CRLF only
  (*BSR_UNICODE)   any Unicode newline sequence

These override the default and the options given to the compiling
function. Note that these special settings, which are not
Perl-compatible, are recognized only at the very start of a pattern, and
that they must be in upper case. If more than one of them is present,
the last one is used. They can be combined with a change of newline
convention; for example, a pattern can start with:

  (*ANY)(*BSR_ANYCRLF)

They can also be combined with the (*UTF) or (*UCP) special sequences.
Inside a character class, \R is treated as an unrecognized escape
sequence, and causes an error.
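
The behaviour of \R can be modelled without a regular expression engine.
The following Python sketch returns how many characters \R would consume
at a given position. It is only an illustration: match_backslash_r is a
hypothetical name, the bsr_anycrlf parameter stands in for
PCRE2_BSR_ANYCRLF, and the default set includes U+2028 and U+2029, so it
corresponds to the modes other than 8-bit non-UTF-8.

```python
ANYCRLF_SINGLES = {"\r", "\n"}
UNICODE_SINGLES = {"\n", "\x0b", "\f", "\r", "\x85", "\u2028", "\u2029"}

def match_backslash_r(subject, pos, bsr_anycrlf=False):
    """Return the number of characters \\R consumes at pos, or 0."""
    if subject.startswith("\r\n", pos):
        # CR followed by LF is taken as one unit. Returning a single
        # answer models the atomic group: there is no fallback to
        # matching the CR on its own.
        return 2
    singles = ANYCRLF_SINGLES if bsr_anycrlf else UNICODE_SINGLES
    if pos < len(subject) and subject[pos] in singles:
        return 1
    return 0
```

With the restricted setting, match_backslash_r("\x85", 0,
bsr_anycrlf=True) returns 0, because NEL is no longer in the set.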

Unicode character properties

When PCRE2 is built with Unicode support (the default), three additional
escape sequences that match characters with specific properties are
available. In 8-bit non-UTF-8 mode, these sequences are of course
limited to testing characters whose codepoints are less than 256, but
they do work in this mode. The extra escape sequences are:

  \p{xx}   a character with the xx property
  \P{xx}   a character without the xx property
  \X       a Unicode extended grapheme cluster

The property names represented by xx above are limited to the Unicode
script names, the general category properties, "Any", which matches any
character (including newline), and some special PCRE2 properties
(described in the next section). Other Perl properties such as
"InMusicalSymbols" are not supported by PCRE2. Note that \P{Any} does
not match any characters, so always causes a match failure.

Sets of Unicode characters are defined as belonging to certain scripts.
A character from one of these sets can be matched using a script name.
For example:

  \p{Greek}
  \P{Han}

Those that are not part of an identified script are lumped together as
"Common". The current list of scripts is:

Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Balinese,
Bamum, Bassa_Vah, Batak, Bengali, Bopomofo, Brahmi, Braille, Buginese,
Buhid, Canadian_Aboriginal, Carian, Caucasian_Albanian, Chakma, Cham,
Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret,
Devanagari, Duployan, Egyptian_Hieroglyphs, Elbasan, Ethiopic,
Georgian, Glagolitic, Gothic, Grantha, Greek, Gujarati, Gurmukhi, Han,
Hangul, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited,
Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi,
Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao,
Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian,
Mahajani, Malayalam, Mandaic, Manichaean, Meetei_Mayek, Mende_Kikakui,
Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro,
Multani, Myanmar, Nabataean, New_Tai_Lue, Nko, Ogham, Ol_Chiki,
Old_Hungarian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian,
Old_South_Arabian, Old_Turkic, Oriya, Osmanya, Pahawh_Hmong, Palmyrene,
Pau_Cin_Hau, Phags_Pa, Phoenician, Psalter_Pahlavi, Rejang, Runic,
Samaritan, Saurashtra, Sharada, Shavian, Siddham, SignWriting, Sinhala,
Sora_Sompeng, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa,
Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Telugu, Thaana, Thai,
Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi.

Each character has exactly one Unicode general category property,
specified by a two-letter abbreviation. For compatibility with Perl,
negation can be specified by including a circumflex between the opening
brace and the property name. For example, \p{^Lu} is the same as \P{Lu}.

If only one letter is specified with \p or \P, it includes all the
general category properties that start with that letter. In this case,
in the absence of negation, the curly brackets in the escape sequence
are optional; these two examples have the same effect:

  \p{L}
  \pL
669
670 The following general category property codes are supported:
671
672 C Other
673 Cc Control
674 Cf Format
675 Cn Unassigned
676 Co Private use
677 Cs Surrogate
678
679 L Letter
680 Ll Lower case letter
681 Lm Modifier letter
682 Lo Other letter
683 Lt Title case letter
684 Lu Upper case letter
685
686 M Mark
687 Mc Spacing mark
688 Me Enclosing mark
689 Mn Non-spacing mark
690
691 N Number
692 Nd Decimal number
693 Nl Letter number
694 No Other number
695
696 P Punctuation
697 Pc Connector punctuation
698 Pd Dash punctuation
699 Pe Close punctuation
700 Pf Final punctuation
701 Pi Initial punctuation
702 Po Other punctuation
703 Ps Open punctuation
704
705 S Symbol
706 Sc Currency symbol
707 Sk Modifier symbol
708 Sm Mathematical symbol
709 So Other symbol
710
711 Z Separator
712 Zl Line separator
713 Zp Paragraph separator
714 Zs Space separator
715
716 The special property L& is also supported: it matches a character that
717 has the Lu, Ll, or Lt property, in other words, a letter that is not
718 classified as a modifier or "other".
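
As an illustration only, the selection rules above can be modelled in Python with the standard unicodedata module. This is a sketch, not PCRE2 code: has_property() is a hypothetical helper, and Python's Unicode tables may be a different version from those built into PCRE2.

```python
import unicodedata

def has_property(ch: str, prop: str) -> bool:
    """Rough model of \\p{prop} for general category codes (hypothetical helper)."""
    negate = prop.startswith("^")          # \p{^Lu} is the same as \P{Lu}
    if negate:
        prop = prop[1:]
    category = unicodedata.category(ch)    # two-letter code such as "Lu"
    if prop == "L&":                       # special property: Lu, Ll, or Lt
        result = category in ("Lu", "Ll", "Lt")
    elif len(prop) == 1:                   # one letter covers every code starting with it
        result = category.startswith(prop)
    else:
        result = category == prop
    return result != negate
```

With this model, has_property("A", "L") and has_property("A", "Lu") are both true, while a modifier letter such as U+02B0 satisfies L but not L&.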
719
720 The Cs (Surrogate) property applies only to characters in the range
721 U+D800 to U+DFFF. Such characters are not valid in Unicode strings and
722 so cannot be tested by PCRE2, unless UTF validity checking has been
723 turned off (see the discussion of PCRE2_NO_UTF_CHECK in the pcre2api
724 page). Perl does not support the Cs property.
725
726 The long synonyms for property names that Perl supports (such as
727 \p{Letter}) are not supported by PCRE2, nor is it permitted to prefix
728 any of these properties with "Is".
729
730 No character that is in the Unicode table has the Cn (unassigned) prop‐
731 erty. Instead, this property is assumed for any code point that is not
732 in the Unicode table.
733
734 Specifying caseless matching does not affect these escape sequences.
735 For example, \p{Lu} always matches only upper case letters. This is
736 different from the behaviour of current versions of Perl.
737
738 Matching characters by Unicode property is not fast, because PCRE2 has
739 to do a multistage table lookup in order to find a character's prop‐
740 erty. That is why the traditional escape sequences such as \d and \w do
741 not use Unicode properties in PCRE2 by default, though you can make
742 them do so by setting the PCRE2_UCP option or by starting the pattern
743 with (*UCP).
744
745 Extended grapheme clusters
746
747 The \X escape matches any number of Unicode characters that form an
748 "extended grapheme cluster", and treats the sequence as an atomic group
749 (see below). Unicode supports various kinds of composite character by
750 giving each character a grapheme breaking property, and having rules
751 that use these properties to define the boundaries of extended grapheme
752 clusters. \X always matches at least one character. Then it decides
753 whether to add additional characters according to the following rules
754 for ending a cluster:
755
756 1. End at the end of the subject string.
757
758 2. Do not end between CR and LF; otherwise end after any control char‐
759 acter.
760
761 3. Do not break Hangul (a Korean script) syllable sequences. Hangul
762 characters are of five types: L, V, T, LV, and LVT. An L character may
763 be followed by an L, V, LV, or LVT character; an LV or V character may
764 be followed by a V or T character; an LVT or T character may be followed
765 only by a T character.
766
767 4. Do not end before extending characters or spacing marks. Characters
768 with the "mark" property always have the "extend" grapheme breaking
769 property.
770
771 5. Do not end after prepend characters.
772
773 6. Otherwise, end the cluster.
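
Rule 3 above can be written as a small lookup table. The following Python sketch models only that rule, working on syllable type names rather than real characters; the names HANGUL_MAY_FOLLOW, hangul_cluster_continues(), and split_clusters() are invented for this illustration and are not part of PCRE2.

```python
# For each Hangul syllable type, the types that may follow it
# without ending the cluster (rule 3 only).
HANGUL_MAY_FOLLOW = {
    "L":   {"L", "V", "LV", "LVT"},
    "LV":  {"V", "T"},
    "V":   {"V", "T"},
    "LVT": {"T"},
    "T":   {"T"},
}

def hangul_cluster_continues(prev_type: str, next_type: str) -> bool:
    """True if next_type may follow prev_type inside one cluster."""
    return next_type in HANGUL_MAY_FOLLOW.get(prev_type, set())

def split_clusters(types):
    """Group a sequence of syllable types into clusters using rule 3 alone."""
    clusters = []
    for t in types:
        if clusters and hangul_cluster_continues(clusters[-1][-1], t):
            clusters[-1].append(t)
        else:
            clusters.append([t])
    return clusters
```

A real implementation would first have to classify each character into one of the five types and apply the other rules as well.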
774
775 PCRE2's additional properties
776
777 As well as the standard Unicode properties described above, PCRE2 sup‐
778 ports four more that make it possible to convert traditional escape
779 sequences such as \w and \s to use Unicode properties. PCRE2 uses these
780 non-standard, non-Perl properties internally when PCRE2_UCP is set.
781 However, they may also be used explicitly. These properties are:
782
783 Xan Any alphanumeric character
784 Xps Any POSIX space character
785 Xsp Any Perl space character
786 Xwd Any Perl "word" character
787
788 Xan matches characters that have either the L (letter) or the N (num‐
789 ber) property. Xps matches the characters tab, linefeed, vertical tab,
790 form feed, or carriage return, and any other character that has the Z
791 (separator) property. Xsp is the same as Xps; in PCRE1 it used to
792 exclude vertical tab, for Perl compatibility, but Perl changed. Xwd
793 matches the same characters as Xan, plus underscore.
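
The definitions of these four properties can be restated as predicates. The Python sketch below follows the wording above; the function names are hypothetical, and the results depend on the Unicode version shipped with Python rather than on PCRE2's own tables.

```python
import unicodedata

def is_xan(ch: str) -> bool:
    """Xan: any character with the L (letter) or N (number) property."""
    return unicodedata.category(ch)[0] in "LN"

def is_xps(ch: str) -> bool:
    """Xps: tab, linefeed, vertical tab, form feed, carriage return, or any Z character."""
    return ch in "\t\n\v\f\r" or unicodedata.category(ch)[0] == "Z"

# Xsp is described above as being the same as Xps.
is_xsp = is_xps

def is_xwd(ch: str) -> bool:
    """Xwd: the same characters as Xan, plus underscore."""
    return is_xan(ch) or ch == "_"
```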
794
795 There is another non-standard property, Xuc, which matches any charac‐
796 ter that can be represented by a Universal Character Name in C++ and
797 other programming languages. These are the characters $, @, ` (grave
798 accent), and all characters with Unicode code points greater than or
799 equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that
800 most base (ASCII) characters are excluded. (Universal Character Names
801 are of the form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.
802 Note that the Xuc property does not match these sequences but the char‐
803 acters that they represent.)
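
The Xuc description can likewise be expressed as a range test. This is a minimal Python sketch of the rule as stated above; is_xuc() is a name made up for the example.

```python
def is_xuc(ch: str) -> bool:
    """Model of Xuc: $, @, grave accent, or any code point >= U+00A0 except surrogates."""
    if ch in "$@`":
        return True
    cp = ord(ch)
    return cp >= 0xA0 and not (0xD800 <= cp <= 0xDFFF)
```

Ordinary ASCII letters and digits fail the test, which is the exclusion of most base characters noted above.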
804
805 Resetting the match start
806
807 The escape sequence \K causes any previously matched characters not to
808 be included in the final matched sequence. For example, the pattern:
809
810 foo\Kbar
811
812 matches "foobar", but reports that it has matched "bar". This feature
813 is similar to a lookbehind assertion (described below). However, in
814 this case, the part of the subject before the real match does not have
815 to be of fixed length, as lookbehind assertions do. The use of \K does
816 not interfere with the setting of captured substrings. For example,
817 when the pattern
818
819 (foo)\Kbar
820
821 matches "foobar", the first substring is still set to "foo".
822
823 Perl documents that the use of \K within assertions is "not well
824 defined". In PCRE2, \K is acted upon when it occurs inside positive
825 assertions, but is ignored in negative assertions. Note that when a
826 pattern such as (?=ab\K) matches, the reported start of the match can
827 be greater than the end of the match.
828
829 Simple assertions
830
831 The final use of backslash is for certain simple assertions. An asser‐
832 tion specifies a condition that has to be met at a particular point in
833 a match, without consuming any characters from the subject string. The
834 use of subpatterns for more complicated assertions is described below.
835 The backslashed assertions are:
836
837 \b matches at a word boundary
838 \B matches when not at a word boundary
839 \A matches at the start of the subject
840 \Z matches at the end of the subject
841 also matches before a newline at the end of the subject
842 \z matches only at the end of the subject
843 \G matches at the first matching position in the subject
844
845 Inside a character class, \b has a different meaning; it matches the
846 backspace character. If any other of these assertions appears in a
847 character class, an "invalid escape sequence" error is generated.
848
849 A word boundary is a position in the subject string where the current
850 character and the previous character do not both match \w or \W (i.e.
851 one matches \w and the other matches \W), or the start or end of the
852 string if the first or last character matches \w, respectively. In a
853 UTF mode, the meanings of \w and \W can be changed by setting the
854 PCRE2_UCP option. When this is done, it also affects \b and \B. Neither
855 PCRE2 nor Perl has a separate "start of word" or "end of word" metase‐
856 quence. However, whatever follows \b normally determines which it is.
857 For example, the fragment \ba matches "a" at the start of a word.
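
The word boundary definition can be checked position by position. The Python sketch below assumes the default ASCII meaning of \w (letters, digits, and underscore); WORD_CHARS and is_word_boundary() are hypothetical names used only for this illustration.

```python
import string

# Default (non-UCP) \w set assumed here: ASCII letters, digits, underscore.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")

def is_word_boundary(subject: str, pos: int) -> bool:
    """True if \\b would match at offset pos under the definition above."""
    before = pos > 0 and subject[pos - 1] in WORD_CHARS
    after = pos < len(subject) and subject[pos] in WORD_CHARS
    # A boundary exists when exactly one side is a word character;
    # the start and end of the string count as non-word.
    return before != after
```

For the subject "ab cd" this reports boundaries at offsets 0, 2, 3, and 5.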
858
859 The \A, \Z, and \z assertions differ from the traditional circumflex
860 and dollar (described in the next section) in that they only ever match
861 at the very start and end of the subject string, whatever options are
862 set. Thus, they are independent of multiline mode. These three asser‐
863 tions are not affected by the PCRE2_NOTBOL or PCRE2_NOTEOL options,
864 which affect only the behaviour of the circumflex and dollar metachar‐
865 acters. However, if the startoffset argument of pcre2_match() is non-
866 zero, indicating that matching is to start at a point other than the
867 beginning of the subject, \A can never match. The difference between
868 \Z and \z is that \Z matches before a newline at the end of the string
869 as well as at the very end, whereas \z matches only at the end.
870
871 The \G assertion is true only when the current matching position is at
872 the start point of the match, as specified by the startoffset argument
873 of pcre2_match(). It differs from \A when the value of startoffset is
874 non-zero. By calling pcre2_match() multiple times with appropriate
875 arguments, you can mimic Perl's /g option, and it is in this kind of
876 implementation where \G can be useful.
877
878 Note, however, that PCRE2's interpretation of \G, as the start of the
879 current match, is subtly different from Perl's, which defines it as the
880 end of the previous match. In Perl, these can be different when the
881 previously matched string was empty. Because PCRE2 does just one match
882 at a time, it cannot reproduce this behaviour.
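
The repeated-call technique can be sketched in Python, with the caveat that Python's re module has no \G. In this example, Pattern.match(subject, pos), which must match exactly at pos, stands in for calling pcre2_match() with a \G-anchored pattern and a non-zero startoffset. The TOKEN pattern and scan() are invented for the example.

```python
import re

# Every alternative consumes at least one character, so the loop always advances.
TOKEN = re.compile(r"\d+|[a-z]+|\s+")

def scan(subject: str):
    """Collect consecutive tokens, each starting where the previous one ended."""
    tokens = []
    pos = 0
    while pos < len(subject):
        m = TOKEN.match(subject, pos)   # anchored at pos, like \G with startoffset
        if m is None:
            break                       # nothing matches at this position; stop
        tokens.append(m.group())
        pos = m.end()                   # the next attempt starts at the end of this match
    return tokens, pos
```

An application that allows empty matches needs extra care to keep advancing, which is where the difference from Perl described above becomes relevant.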
883
884 If all the alternatives of a pattern begin with \G, the expression is
885 anchored to the starting match position, and the "anchored" flag is set
886 in the compiled regular expression.
887
889
890 The circumflex and dollar metacharacters are zero-width assertions.
891 That is, they test for a particular condition being true without con‐
892 suming any characters from the subject string. These two metacharacters
893 are concerned with matching the starts and ends of lines. If the new‐
894 line convention is set so that only the two-character sequence CRLF is
895 recognized as a newline, isolated CR and LF characters are treated as
896 ordinary data characters, and are not recognized as newlines.
897
898 Outside a character class, in the default matching mode, the circumflex
899 character is an assertion that is true only if the current matching
900 point is at the start of the subject string. If the startoffset argu‐
901 ment of pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum‐
902 flex can never match if the PCRE2_MULTILINE option is unset. Inside a
903 character class, circumflex has an entirely different meaning (see
904 below).
905
906 Circumflex need not be the first character of the pattern if a number
907 of alternatives are involved, but it should be the first thing in each
908 alternative in which it appears if the pattern is ever to match that
909 branch. If all possible alternatives start with a circumflex, that is,
910 if the pattern is constrained to match only at the start of the sub‐
911 ject, it is said to be an "anchored" pattern. (There are also other
912 constructs that can cause a pattern to be anchored.)
913
914 The dollar character is an assertion that is true only if the current
915 matching point is at the end of the subject string, or immediately
916 before a newline at the end of the string (by default), unless
917 PCRE2_NOTEOL is set. Note, however, that it does not actually match the
918 newline. Dollar need not be the last character of the pattern if a num‐
919 ber of alternatives are involved, but it should be the last item in any
920 branch in which it appears. Dollar has no special meaning in a charac‐
921 ter class.
922
923 The meaning of dollar can be changed so that it matches only at the
924 very end of the string, by setting the PCRE2_DOLLAR_ENDONLY option at
925 compile time. This does not affect the \Z assertion.
926
927 The meanings of the circumflex and dollar metacharacters are changed if
928 the PCRE2_MULTILINE option is set. When this is the case, a dollar
929 character matches before any newlines in the string, as well as at the
930 very end, and a circumflex matches immediately after internal newlines
931 as well as at the start of the subject string. It does not match after
932 a newline that ends the string, for compatibility with Perl. However,
933 this can be changed by setting the PCRE2_ALT_CIRCUMFLEX option.
934
935 For example, the pattern /^abc$/ matches the subject string "def\nabc"
936 (where \n represents a newline) in multiline mode, but not otherwise.
937 Consequently, patterns that are anchored in single line mode because
938 all branches start with ^ are not anchored in multiline mode, and a
939 match for circumflex is possible when the startoffset argument of
940 pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option is ignored
941 if PCRE2_MULTILINE is set.
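
The /^abc$/ example can be tried with Python's re module, which is not PCRE2 but behaves the same way for these particular cases. re.MULTILINE plays the role of PCRE2_MULTILINE, and LF is the only newline assumed.

```python
import re

subject = "def\nabc"

# Default mode: ^ matches only at the start of the subject, so there is no match.
single = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches immediately after the internal newline.
multi = re.search(r"^abc$", subject, re.MULTILINE)

# Default mode: $ matches before a newline that ends the string,
# but the newline itself is not part of the match.
trailing = re.search(r"abc$", "abc\n")
```

Other details, such as circumflex after a newline that ends the string, differ between the two engines and are not shown here.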
942
943 When the newline convention (see "Newline conventions" above)
944 recognizes the two-character sequence CRLF as a newline, this is preferred,
945 even if the single characters CR and LF are also recognized as new‐
946 lines. For example, if the newline convention is "any", a multiline
947 mode circumflex matches before "xyz" in the string "abc\r\nxyz" rather
948 than after CR, even though CR on its own is a valid newline. (It also
949 matches at the very start of the string, of course.)
950
951 Note that the sequences \A, \Z, and \z can be used to match the start
952 and end of the subject in both modes, and if all branches of a pattern
953 start with \A it is always anchored, whether or not PCRE2_MULTILINE is
954 set.
955
957
958 Outside a character class, a dot in the pattern matches any one charac‐
959 ter in the subject string except (by default) a character that signi‐
960 fies the end of a line.
961
962 When a line ending is defined as a single character, dot never matches
963 that character; when the two-character sequence CRLF is used, dot does
964 not match CR if it is immediately followed by LF, but otherwise it
965 matches all characters (including isolated CRs and LFs). When any Uni‐
966 code line endings are being recognized, dot does not match CR or LF or
967 any of the other line ending characters.
968
969 The behaviour of dot with regard to newlines can be changed. If the
970 PCRE2_DOTALL option is set, a dot matches any one character, without
971 exception. If the two-character sequence CRLF is present in the sub‐
972 ject string, it takes two dots to match it.
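
A short Python sketch shows the effect of the "dot matches all" setting. Python's re module treats only LF as a line ending, so this corresponds to just one of the PCRE2 newline conventions; re.DOTALL stands in for PCRE2_DOTALL, and dot_matches() is a hypothetical helper.

```python
import re

def dot_matches(pattern: str, subject: str, dotall: bool = False) -> bool:
    """True if pattern matches the whole subject, optionally in dotall mode."""
    flags = re.DOTALL if dotall else 0
    return re.fullmatch(pattern, subject, flags) is not None
```

With this helper, "a.b" fails on "a\nb" by default and succeeds in dotall mode, while "a\r\nb" needs the two-dot pattern "a..b".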
973
974 The handling of dot is entirely independent of the handling of circum‐
975 flex and dollar, the only relationship being that they both involve
976 newlines. Dot has no special meaning in a character class.
977
978 The escape sequence \N behaves like a dot, except that it is not
979 affected by the PCRE2_DOTALL option. In other words, it matches any
980 character except one that signifies the end of a line. Perl also uses
981 \N to match characters by name; PCRE2 does not support this.
982
984
985 Outside a character class, the escape sequence \C matches any one code
986 unit, whether or not a UTF mode is set. In the 8-bit library, one code
987 unit is one byte; in the 16-bit library it is a 16-bit unit; in the
988 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
989 line-ending characters. The feature is provided in Perl in order to
990 match individual bytes in UTF-8 mode, but it is unclear how it can use‐
991 fully be used.
992
993 Because \C breaks up characters into individual code units, matching
994 one unit with \C in UTF-8 or UTF-16 mode means that the rest of the
995 string may start with a malformed UTF character. This has undefined
996 results, because PCRE2 assumes that it is matching character by charac‐
997 ter in a valid UTF string (by default it checks the subject string's
998 validity at the start of processing unless the PCRE2_NO_UTF_CHECK
999 option is used).
1000
1001 An application can lock out the use of \C by setting the
1002 PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also
1003 possible to build PCRE2 with the use of \C permanently disabled.
1004
1005 PCRE2 does not allow \C to appear in lookbehind assertions (described
1006 below) in UTF-8 or UTF-16 modes, because this would make it impossible
1007 to calculate the length of the lookbehind. Neither the alternative
1008 matching function pcre2_dfa_match() nor the JIT optimizer supports \C in
1009 these UTF modes. The former gives a match-time error; the latter fails
1010 to optimize and so the match is always run using the interpreter.
1011
1012 In the 32-bit library, however, \C is always supported (when not
1013 explicitly locked out) because it always matches a single code unit,
1014 whether or not UTF-32 is specified.
1015
1016 In general, the \C escape sequence is best avoided. However, one way of
1017 using it that avoids the problem of malformed UTF-8 or UTF-16 charac‐
1018 ters is to use a lookahead to check the length of the next character,
1019 as in this pattern, which could be used with a UTF-8 string (ignore
1020 white space and line breaks):
1021
1022 (?| (?=[\x00-\x7f])(\C) |
1023 (?=[\x80-\x{7ff}])(\C)(\C) |
1024 (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
1025 (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
1026
1027 In this example, a group that starts with (?| resets the capturing
1028 parentheses numbers in each alternative (see "Duplicate Subpattern Num‐
1029 bers" below). The assertions at the start of each branch check the next
1030 UTF-8 character for values whose encoding uses 1, 2, 3, or 4 bytes,
1031 respectively. The character's individual bytes are then captured by the
1032 appropriate number of \C groups.
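
The four ranges tested by the lookaheads correspond to the UTF-8 encoded lengths of the code points. The Python sketch below restates that mapping and lists the individual code units that the \C groups would capture; both function names are invented for the example.

```python
def utf8_code_units(code_point: int) -> int:
    """Number of UTF-8 bytes for a code point, using the ranges in the pattern above."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x1FFFFF:
        return 4
    raise ValueError("code point out of range")

def utf8_units(ch: str):
    """The individual bytes of one character, as one-byte bytes objects."""
    data = ch.encode("utf-8")
    return [data[i:i + 1] for i in range(len(data))]
```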
1033
1035
1036 An opening square bracket introduces a character class, terminated by a
1037 closing square bracket. A closing square bracket on its own is not spe‐
1038 cial by default. If a closing square bracket is required as a member
1039 of the class, it should be the first data character in the class (after
1040 an initial circumflex, if present) or escaped with a backslash. This
1041 means that, by default, an empty class cannot be defined. However, if
1042 the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at
1043 the start does end the (empty) class.
1044
1045 A character class matches a single character in the subject. A matched
1046 character must be in the set of characters defined by the class, unless
1047 the first character in the class definition is a circumflex, in which
1048 case the subject character must not be in the set defined by the class.
1049 If a circumflex is actually required as a member of the class, ensure
1050 it is not the first character, or escape it with a backslash.
1051
1052 For example, the character class [aeiou] matches any lower case vowel,
1053 while [^aeiou] matches any character that is not a lower case vowel.
1054 Note that a circumflex is just a convenient notation for specifying the
1055 characters that are in the class by enumerating those that are not. A
1056 class that starts with a circumflex is not an assertion; it still con‐
1057 sumes a character from the subject string, and therefore it fails if
1058 the current pointer is at the end of the string.
1059
1060 When caseless matching is set, any letters in a class represent both
1061 their upper case and lower case versions, so for example, a caseless
1062 [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
1063 match "A", whereas a caseful version would.
1064
1065 Characters that might indicate line breaks are never treated in any
1066 special way when matching character classes, whatever line-ending
1067 sequence is in use, and whatever setting of the PCRE2_DOTALL and
1068 PCRE2_MULTILINE options is used. A class such as [^a] always matches
1069 one of these characters.
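
The class behaviour described above can be observed with Python's re module, which agrees with PCRE2 for these simple cases (it is used here only as a convenient stand-in). re.IGNORECASE corresponds to caseless matching, and matches() is a hypothetical helper.

```python
import re

def matches(pattern: str, subject: str, flags: int = 0) -> bool:
    """True if pattern matches anywhere in subject."""
    return re.search(pattern, subject, flags) is not None

vowel = "[aeiou]"
non_vowel = "[^aeiou]"
```

For example, matches(non_vowel, "") is false because a negated class still has to consume a character, and matches("[^a]", "\n") is true because line breaks get no special treatment inside a class.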
1070
1071 The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
1072 \w, and \W may appear in a character class, and add the characters that
1073 they match to the class. For example, [\dABCDEF] matches any hexadeci‐
1074 mal digit. In UTF modes, the PCRE2_UCP option affects the meanings of
1075 \d, \s, \w and their upper case partners, just as it does when they
1076 appear outside a character class, as described in the section entitled
1077 "Generic character types" above. The escape sequence \b has a different
1078 meaning inside a character class; it matches the backspace character.
1079 The sequences \B, \N, \R, and \X are not special inside a character
1080 class. Like any other unrecognized escape sequences, they cause an
1081 error.
1082
1083 The minus (hyphen) character can be used to specify a range of charac‐
1084 ters in a character class. For example, [d-m] matches any letter
1085 between d and m, inclusive. If a minus character is required in a
1086 class, it must be escaped with a backslash or appear in a position
1087 where it cannot be interpreted as indicating a range, typically as the
1088 first or last character in the class, or immediately after a range. For
1089 example, [b-d-z] matches letters in the range b to d, a hyphen charac‐
1090 ter, or z.
1091
1092 Perl treats a hyphen as a literal if it appears before or after a POSIX
1093 class (see below) or a character type escape such as \d, but gives a
1094 warning in its warning mode, as this is most likely a user error. As
1095 PCRE2 has no facility for warning, an error is given in these cases.
1096
1097 It is not possible to have the literal character "]" as the end charac‐
1098 ter of a range. A pattern such as [W-]46] is interpreted as a class of
1099 two characters ("W" and "-") followed by a literal string "46]", so it
1100 would match "W46]" or "-46]". However, if the "]" is escaped with a
1101 backslash it is interpreted as the end of range, so [W-\]46] is inter‐
1102 preted as a class containing a range followed by two other characters.
1103 The octal or hexadecimal representation of "]" can also be used to end
1104 a range.
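
The hyphen and closing-bracket rules above can be tried out with Python's re module, which parses these three classes in the same way. It is a different engine, so treat this as an illustration rather than a statement about PCRE2 itself; in_class() is a hypothetical helper.

```python
import re

def in_class(pattern: str, subject: str) -> bool:
    """True if pattern matches the whole of subject."""
    return re.fullmatch(pattern, subject) is not None

RANGE_THEN_HYPHEN = r"[b-d-z]"    # range b-d, a literal hyphen, and z
CLASS_THEN_LITERAL = r"[W-]46]"   # class of "W" and "-", then the literal "46]"
ESCAPED_END = r"[W-\]46]"         # range W to ], plus "4" and "6"
```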
1105
1106 Ranges normally include all code points between the start and end char‐
1107 acters, inclusive. They can also be used for code points specified
1108 numerically, for example [\000-\037]. Ranges can include any characters
1109 that are valid for the current mode.
1110
1111 There is a special case in EBCDIC environments for ranges whose end
1112 points are both specified as literal letters in the same case. For com‐
1113 patibility with Perl, EBCDIC code points within the range that are not
1114 letters are omitted. For example, [h-k] matches only four characters,
1115 even though the codes for h and k are 0x88 and 0x92, a range of 11 code
1116 points. However, if the range is specified numerically, for example,
1117 [\x88-\x92] or [h-\x92], all code points are included.
1118
1119 If a range that includes letters is used when caseless matching is set,
1120 it matches the letters in either case. For example, [W-c] is equivalent
1121 to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if
1122 character tables for a French locale are in use, [\xc8-\xcb] matches
1123 accented E characters in both cases.
1124
1125 A circumflex can conveniently be used with the upper case character
1126 types to specify a more restricted set of characters than the matching
1127 lower case type. For example, the class [^\W_] matches any letter or
1128 digit, but not underscore, whereas [\w] includes underscore. A positive
1129 character class should be read as "something OR something OR ..." and a
1130 negative class as "NOT something AND NOT something AND NOT ...".
1131
1132 The only metacharacters that are recognized in character classes are
1133 backslash, hyphen (only where it can be interpreted as specifying a
1134 range), circumflex (only at the start), opening square bracket (only
1135 when it can be interpreted as introducing a POSIX class name, or for a
1136 special compatibility feature - see the next two sections), and the
1137 terminating closing square bracket. However, escaping other non-
1138 alphanumeric characters does no harm.
1139
1141
1142 Perl supports the POSIX notation for character classes. This uses names
1143 enclosed by [: and :] within the enclosing square brackets. PCRE2 also
1144 supports this notation. For example,
1145
1146 [01[:alpha:]%]
1147
1148 matches "0", "1", any alphabetic character, or "%". The supported class
1149 names are:
1150
1151 alnum letters and digits
1152 alpha letters
1153 ascii character codes 0 - 127
1154 blank space or tab only
1155 cntrl control characters
1156 digit decimal digits (same as \d)
1157 graph printing characters, excluding space
1158 lower lower case letters
1159 print printing characters, including space
1160 punct printing characters, excluding letters and digits and space
1161 space white space (the same as \s from PCRE 8.34)
1162 upper upper case letters
1163 word "word" characters (same as \w)
1164 xdigit hexadecimal digits
1165
1166 The default "space" characters are HT (9), LF (10), VT (11), FF (12),
1167 CR (13), and space (32). If locale-specific matching is taking place,
1168 the list of space characters may be different; there may be fewer or
1169 more of them. "Space" and \s match the same set of characters.
1170
1171 The name "word" is a Perl extension, and "blank" is a GNU extension
1172 from Perl 5.8. Another Perl extension is negation, which is indicated
1173 by a ^ character after the colon. For example,
1174
1175 [12[:^digit:]]
1176
1177 matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the
1178 POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
1179 these are not supported, and an error is given if they are encountered.
1180
1181 By default, characters with values greater than 127 do not match any of
1182 the POSIX character classes, although this may be different for charac‐
1183 ters in the range 128-255 when locale-specific matching is happening.
1184 However, if the PCRE2_UCP option is passed to pcre2_compile(), some of
1185 the classes are changed so that Unicode character properties are used.
1186 This is achieved by replacing certain POSIX classes with other
1187 sequences, as follows:
1188
1189 [:alnum:] becomes \p{Xan}
1190 [:alpha:] becomes \p{L}
1191 [:blank:] becomes \h
1192 [:cntrl:] becomes \p{Cc}
1193 [:digit:] becomes \p{Nd}
1194 [:lower:] becomes \p{Ll}
1195 [:space:] becomes \p{Xps}
1196 [:upper:] becomes \p{Lu}
1197 [:word:] becomes \p{Xwd}
1198
1199 Negated versions, such as [:^alpha:] use \P instead of \p. Three other
1200 POSIX classes are handled specially in UCP mode:
1201
1202 [:graph:] This matches characters that have glyphs that mark the page
1203 when printed. In Unicode property terms, it matches all char‐
1204 acters with the L, M, N, P, S, or Cf properties, except for:
1205
1206 U+061C Arabic Letter Mark
1207 U+180E Mongolian Vowel Separator
1208 U+2066 - U+2069 Various "isolate"s
1209
1210
1211 [:print:] This matches the same characters as [:graph:] plus space
1212 characters that are not controls, that is, characters with
1213 the Zs property.
1214
1215 [:punct:] This matches all characters that have the Unicode P (punctua‐
1216 tion) property, plus those characters with code points less
1217 than 256 that have the S (Symbol) property.
1218
1219 The other POSIX classes are unchanged, and match only characters with
1220 code points less than 256.
1221
1223
1224 In the POSIX.2 compliant library that was included in 4.4BSD Unix, the
1225 ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word"
1226 and "end of word". PCRE2 treats these items as follows:
1227
1228 [[:<:]] is converted to \b(?=\w)
1229 [[:>:]] is converted to \b(?<=\w)
1230
1231 Only these exact character sequences are recognized. A sequence such as
1232 [a[:<:]b] provokes an error for an unrecognized POSIX class name. This
1233 support is not compatible with Perl. It is provided to help migrations
1234 from other environments, and is best not used in any new patterns. Note
1235 that \b matches at the start and the end of a word (see "Simple asser‐
1236 tions" above), and in a Perl-style pattern the preceding or following
1237 character normally shows which is wanted, without the need for the
1238 assertions that are used above in order to give exactly the POSIX be‐
1239 haviour.
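
The conversion shown above amounts to a textual substitution. The Python sketch below applies it with plain string replacement and then runs the result through Python's re module, which supports \b and both kinds of assertion. translate_bsd_word_boundaries() is a hypothetical helper; a real converter would have to avoid touching escaped or quoted text.

```python
import re

def translate_bsd_word_boundaries(pattern: str) -> str:
    """Rewrite [[:<:]] and [[:>:]] using the replacements listed above."""
    return (pattern.replace("[[:<:]]", r"\b(?=\w)")
                   .replace("[[:>:]]", r"\b(?<=\w)"))

# A whole-word search written in the BSD style, then translated.
whole_word_cat = translate_bsd_word_boundaries("[[:<:]]cat[[:>:]]")
```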
1240
1242
1243 Vertical bar characters are used to separate alternative patterns. For
1244 example, the pattern
1245
1246 gilbert|sullivan
1247
1248 matches either "gilbert" or "sullivan". Any number of alternatives may
1249 appear, and an empty alternative is permitted (matching the empty
1250 string). The matching process tries each alternative in turn, from left
1251 to right, and the first one that succeeds is used. If the alternatives
1252 are within a subpattern (defined below), "succeeds" means matching the
1253 rest of the main pattern as well as the alternative in the subpattern.
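
Left-to-right selection of alternatives can be seen in a short Python example. Python's re module uses the same ordering rule, so it serves here as an approximation of PCRE2's behaviour; first_match() is a hypothetical helper.

```python
import re

def first_match(pattern: str, subject: str):
    """Return the matched text, or None if there is no match."""
    m = re.search(pattern, subject)
    return m.group() if m else None

# Inside a group, an alternative "succeeds" only if the rest of the pattern
# also matches, so the second alternative is used here.
grouped = re.fullmatch(r"(a|ab)c", "abc")
```

For instance, first_match("a|ab", "ab") returns "a" because the first alternative succeeds and the second is never tried.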
1254
1256
1257 The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, and
1258 PCRE2_EXTENDED options (which are Perl-compatible) can be changed from
1259 within the pattern by a sequence of Perl option letters enclosed
1260 between "(?" and ")". The option letters are
1261
1262 i for PCRE2_CASELESS
1263 m for PCRE2_MULTILINE
1264 s for PCRE2_DOTALL
1265 x for PCRE2_EXTENDED
1266
1267 For example, (?im) sets caseless, multiline matching. It is also possi‐
1268 ble to unset these options by preceding the letter with a hyphen, and a
1269 combined setting and unsetting such as (?im-sx), which sets PCRE2_CASE‐
1270 LESS and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and
1271 PCRE2_EXTENDED, is also permitted. If a letter appears both before and
1272 after the hyphen, the option is unset. An empty options setting "(?)"
1273 is allowed. Needless to say, it has no effect.
1274
1275 The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be
1276 changed in the same way as the Perl-compatible options by using the
1277 characters J and U respectively.
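
The set and unset rules for option letters can be modelled with a few lines of Python. This sketch only interprets the text of an item such as "(?im-sx)"; it does not call PCRE2, and parse_option_setting() and OPTION_LETTERS are names made up for the example.

```python
OPTION_LETTERS = {
    "i": "PCRE2_CASELESS",
    "m": "PCRE2_MULTILINE",
    "s": "PCRE2_DOTALL",
    "x": "PCRE2_EXTENDED",
    "J": "PCRE2_DUPNAMES",
    "U": "PCRE2_UNGREEDY",
}

def parse_option_setting(item: str):
    """Return (options to set, options to unset) for an item like "(?im-sx)"."""
    if not (item.startswith("(?") and item.endswith(")")):
        raise ValueError(f"not an option setting: {item!r}")
    before, _, after = item[2:-1].partition("-")
    for letter in before + after:
        if letter not in OPTION_LETTERS:
            raise ValueError(f"unknown option letter: {letter!r}")
    to_unset = {OPTION_LETTERS[c] for c in after}
    # A letter on both sides of the hyphen ends up unset.
    to_set = {OPTION_LETTERS[c] for c in before} - to_unset
    return to_set, to_unset
```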
1278
1279 When one of these option changes occurs at top level (that is, not
1280 inside subpattern parentheses), the change applies to the remainder of
1281 the pattern that follows. An option change within a subpattern (see
1282 below for a description of subpatterns) affects only that part of the
1283 subpattern that follows it, so
1284
1285 (a(?i)b)c
1286
1287 matches abc and aBc and no other strings (assuming PCRE2_CASELESS is
1288 not used). By this means, options can be made to have different set‐
1289 tings in different parts of the pattern. Any changes made in one alter‐
1290 native do carry on into subsequent branches within the same subpattern.
1291 For example,
1292
1293 (a(?i)b|c)
1294
1295 matches "ab", "aB", "c", and "C", even though when matching "C" the
1296 first branch is abandoned before the option setting. This is because
1297 the effects of option settings happen at compile time. There would be
1298 some very weird behaviour otherwise.
1299
1300 As a convenient shorthand, if any option settings are required at the
1301 start of a non-capturing subpattern (see the next section), the option
1302 letters may appear between the "?" and the ":". Thus the two patterns
1303
1304 (?i:saturday|sunday)
1305 (?:(?i)saturday|sunday)
1306
1307 match exactly the same set of strings.
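
The scoped form can be tried with Python's re module, which accepts the (?i:...) syntax (Python 3.6 or later assumed). Recent Python versions reject a bare (?i) in the middle of a pattern, so the second pattern below uses the scoped form as a stand-in for the earlier (a(?i)b)c example. It illustrates the idea and is not a test of PCRE2.

```python
import re

# Caseless matching applies to the whole non-capturing group.
WEEKEND = re.compile(r"(?i:saturday|sunday)")

# Caseless matching applies to "b" only; "a" and "c" stay caseful.
SCOPED = re.compile(r"(a(?i:b))c")
```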
1308
1309 Note: There are other PCRE2-specific options that can be set by the
1310 application when the compiling function is called. The pattern can con‐
1311 tain special leading sequences such as (*CRLF) to override what the
1312 application has set or what has been defaulted. Details are given in
1313 the section entitled "Newline sequences" above. There are also the
1314 (*UTF) and (*UCP) leading sequences that can be used to set UTF and
1315 Unicode property modes; they are equivalent to setting the PCRE2_UTF
1316 and PCRE2_UCP options, respectively. However, the application can set
1317 the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP options, which lock out the use
1318 of the (*UTF) and (*UCP) sequences.
1319
1321
1322 Subpatterns are delimited by parentheses (round brackets), which can be
1323 nested. Turning part of a pattern into a subpattern does two things:
1324
1325 1. It localizes a set of alternatives. For example, the pattern
1326
1327 cat(aract|erpillar|)
1328
1329 matches "cataract", "caterpillar", or "cat". Without the parentheses,
1330 it would match "cataract", "erpillar" or an empty string.
1331
1332 2. It sets up the subpattern as a capturing subpattern. This means
1333 that, when the whole pattern matches, the portion of the subject string
1334 that matched the subpattern is passed back to the caller, separately
1335 from the portion that matched the whole pattern. (This applies only to
1336 the traditional matching function; the DFA matching function does not
1337 support capturing.)
1338
1339 Opening parentheses are counted from left to right (starting from 1) to
1340 obtain numbers for the capturing subpatterns. For example, if the
1341 string "the red king" is matched against the pattern
1342
1343 the ((red|white) (king|queen))
1344
1345 the captured substrings are "red king", "red", and "king", and are num‐
1346 bered 1, 2, and 3, respectively.
1347
1348 The fact that plain parentheses fulfil two functions is not always
1349 helpful. There are often times when a grouping subpattern is required
1350 without a capturing requirement. If an opening parenthesis is followed
1351 by a question mark and a colon, the subpattern does not do any captur‐
1352 ing, and is not counted when computing the number of any subsequent
1353 capturing subpatterns. For example, if the string "the white queen" is
1354 matched against the pattern
1355
1356 the ((?:red|white) (king|queen))
1357
1358 the captured substrings are "white queen" and "queen", and are numbered
1359 1 and 2. The maximum number of capturing subpatterns is 65535.
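
The numbering rules above can be tried out directly. The following is a minimal sketch that uses Python's standard re module as a stand-in for PCRE2 (an assumption: it is a different engine, but it numbers capturing parentheses in the same left-to-right way). The helper name captures is hypothetical.

```python
import re


def captures(pattern, subject):
    """Return the captured substrings in group-number order, or None."""
    match = re.fullmatch(pattern, subject)
    return match.groups() if match else None


CAPTURING = r"the ((red|white) (king|queen))"
NON_CAPTURING = r"the ((?:red|white) (king|queen))"
```

Here captures(CAPTURING, "the red king") yields three substrings, and captures(NON_CAPTURING, "the white queen") yields only two, because the (?: group is not counted.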
1360
1361 As a convenient shorthand, if any option settings are required at the
1362 start of a non-capturing subpattern, the option letters may appear
1363 between the "?" and the ":". Thus the two patterns
1364
1365 (?i:saturday|sunday)
1366 (?:(?i)saturday|sunday)
1367
1368 match exactly the same set of strings. Because alternative branches are
1369 tried from left to right, and options are not reset until the end of
1370 the subpattern is reached, an option setting in one branch does affect
1371 subsequent branches, so the above patterns match "SUNDAY" as well as
1372 "Saturday".
1373
1375
1376 Perl 5.10 introduced a feature whereby each alternative in a subpattern
1377 uses the same numbers for its capturing parentheses. Such a subpattern
1378 starts with (?| and is itself a non-capturing subpattern. For example,
1379 consider this pattern:
1380
1381 (?|(Sat)ur|(Sun))day
1382
1383 Because the two alternatives are inside a (?| group, both sets of cap‐
1384 turing parentheses are numbered one. Thus, when the pattern matches,
1385 you can look at captured substring number one, whichever alternative
1386 matched. This construct is useful when you want to capture part, but
1387 not all, of one of a number of alternatives. Inside a (?| group, paren‐
1388 theses are numbered as usual, but the number is reset at the start of
1389 each branch. The numbers of any capturing parentheses that follow the
1390 subpattern start after the highest number used in any branch. The fol‐
1391 lowing example is taken from the Perl documentation. The numbers under‐
1392 neath show in which buffer the captured content will be stored.
1393
1394 # before  ---------------branch-reset----------- after
1395 / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
1396 # 1            2         2  3        2     3     4
1397
1398 A back reference to a numbered subpattern uses the most recent value
1399 that is set for that number by any subpattern. The following pattern
1400 matches "abcabc" or "defdef":
1401
1402 /(?|(abc)|(def))\1/
1403
1404 In contrast, a subroutine call to a numbered subpattern always refers
1405 to the first one in the pattern with the given number. The following
1406 pattern matches "abcabc" or "defabc":
1407
1408 /(?|(abc)|(def))(?1)/
1409
1410 A relative reference such as (?-1) is no different: it is just a conve‐
1411 nient way of computing an absolute group number.
1412
1413 If a condition test for a subpattern's having matched refers to a non-
1414 unique number, the test is true if any of the subpatterns of that num‐
1415 ber have matched.
1416
1417 An alternative approach to using this "branch reset" feature is to use
1418 duplicate named subpatterns, as described in the next section.
1419
1421
1422 Identifying capturing parentheses by number is simple, but it can be
1423 very hard to keep track of the numbers in complicated regular expres‐
1424 sions. Furthermore, if an expression is modified, the numbers may
1425 change. To help with this difficulty, PCRE2 supports the naming of sub‐
1426 patterns. This feature was not added to Perl until release 5.10. Python
1427 had the feature earlier, and PCRE1 introduced it at release 4.0, using
1428 the Python syntax. PCRE2 supports both the Perl and the Python syntax.
1429 Perl allows identically numbered subpatterns to have different names,
1430 but PCRE2 does not.
1431
1432 In PCRE2, a subpattern can be named in one of three ways: (?<name>...)
1433 or (?'name'...) as in Perl, or (?P<name>...) as in Python. References
1434 to capturing parentheses from other parts of the pattern, such as back
1435 references, recursion, and conditions, can be made by name as well as
1436 by number.
1437
1438 Names consist of up to 32 alphanumeric characters and underscores, but
1439 must start with a non-digit. Named capturing parentheses are still
1440 allocated numbers as well as names, exactly as if the names were not
1441 present. The PCRE2 API provides function calls for extracting the name-
1442 to-number translation table from a compiled pattern. There are also
1443 convenience functions for extracting a captured substring by name.
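
As an illustration of names coexisting with numbers, here is a minimal sketch using Python's standard re module (an assumption: it is not the PCRE2 API). It uses the Python-style (?P<name>...) syntax mentioned above. Python's groupindex attribute and Match.group(name) play roles loosely analogous to the name-to-number table and the by-name convenience functions. The pattern DATE and the helper names are hypothetical.

```python
import re

# Two named groups; they are still numbered 1 and 2 as if unnamed.
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")


def name_table(compiled):
    """Return the name-to-number mapping of a compiled pattern."""
    return dict(compiled.groupindex)


def extract(compiled, subject, name):
    """Return the substring captured by the named group, or None."""
    match = compiled.search(subject)
    return match.group(name) if match else None
```

The same captured text is reachable by name or by number, so extract(DATE, "2015-01", "year") and DATE.search("2015-01").group(1) give the same string.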
1444
1445 By default, a name must be unique within a pattern, but it is possible
1446 to relax this constraint by setting the PCRE2_DUPNAMES option at com‐
1447 pile time. (Duplicate names are also always permitted for subpatterns
1448 with the same number, set up as described in the previous section.)
1449 Duplicate names can be useful for patterns where only one instance of
1450 the named parentheses can match. Suppose you want to match the name of
1451 a weekday, either as a 3-letter abbreviation or as the full name, and
1452 in both cases you want to extract the abbreviation. This pattern
1453 (ignoring the line breaks) does the job:
1454
1455 (?<DN>Mon|Fri|Sun)(?:day)?|
1456 (?<DN>Tue)(?:sday)?|
1457 (?<DN>Wed)(?:nesday)?|
1458 (?<DN>Thu)(?:rsday)?|
1459 (?<DN>Sat)(?:urday)?
1460
1461 There are five capturing substrings, but only one is ever set after a
1462 match. (An alternative way of solving this problem is to use a "branch
1463 reset" subpattern, as described in the previous section.)
1464
1465 The convenience functions for extracting the data by name return the
1466 substring for the first (and in this example, the only) subpattern of
1467 that name that matched. This saves searching to find which numbered
1468 subpattern it was.
1469
1470 If you make a back reference to a non-unique named subpattern from
1471 elsewhere in the pattern, the subpatterns to which the name refers are
1472 checked in the order in which they appear in the overall pattern. The
1473 first one that is set is used for the reference. For example, this pat‐
1474 tern matches both "foofoo" and "barbar" but not "foobar" or "barfoo":
1475
1476 (?:(?<n>foo)|(?<n>bar))\k<n>
1477
1478
1479 If you make a subroutine call to a non-unique named subpattern, the one
1480 that corresponds to the first occurrence of the name is used. In the
1481 absence of duplicate numbers (see the previous section) this is the one
1482 with the lowest number.
1483
1484 If you use a named reference in a condition test (see the section about
1485 conditions below), either to check whether a subpattern has matched, or
1486 to check for recursion, all subpatterns with the same name are tested.
1487 If the condition is true for any one of them, the overall condition is
1488 true. This is the same behaviour as testing by number. For further
1489 details of the interfaces for handling named subpatterns, see the
1490 pcre2api documentation.
1491
1492 Warning: You cannot use different names to distinguish between two sub‐
1493 patterns with the same number because PCRE2 uses only the numbers when
1494 matching. For this reason, an error is given at compile time if differ‐
1495 ent names are given to subpatterns with the same number. However, you
1496 can always give the same name to subpatterns with the same number, even
1497 when PCRE2_DUPNAMES is not set.
1498
1500
1501 Repetition is specified by quantifiers, which can follow any of the
1502 following items:
1503
1504 a literal data character
1505 the dot metacharacter
1506 the \C escape sequence
1507 the \X escape sequence
1508 the \R escape sequence
1509 an escape such as \d or \pL that matches a single character
1510 a character class
1511 a back reference
1512 a parenthesized subpattern (including most assertions)
1513 a subroutine call to a subpattern (recursive or otherwise)
1514
1515 The general repetition quantifier specifies a minimum and maximum num‐
1516 ber of permitted matches, by giving the two numbers in curly brackets
1517 (braces), separated by a comma. The numbers must be less than 65536,
1518 and the first must be less than or equal to the second. For example:
1519
1520 z{2,4}
1521
1522 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
1523 special character. If the second number is omitted, but the comma is
1524 present, there is no upper limit; if the second number and the comma
1525 are both omitted, the quantifier specifies an exact number of required
1526 matches. Thus
1527
1528 [aeiou]{3,}
1529
1530 matches at least 3 successive vowels, but may match many more, whereas
1531
1532 \d{8}
1533
1534 matches exactly 8 digits. An opening curly bracket that appears in a
1535 position where a quantifier is not allowed, or one that does not match
1536 the syntax of a quantifier, is taken as a literal character. For exam‐
1537 ple, {,6} is not a quantifier, but a literal string of four characters.
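
The three quantifier forms above can be exercised as follows. This is a minimal sketch using Python's standard re module as a stand-in for PCRE2 (an assumption). Python treats {,6} as a quantifier, unlike the behaviour described above, so that case is deliberately left out. The helper name full is hypothetical.

```python
import re


def full(pattern, subject):
    """Return True if the pattern matches the entire subject string."""
    return re.fullmatch(pattern, subject) is not None


BOUNDED = r"z{2,4}"          # between 2 and 4 repetitions
OPEN_ENDED = r"[aeiou]{3,}"  # at least 3, no upper limit
EXACT = r"\d{8}"             # exactly 8
```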
1538
1539 In UTF modes, quantifiers apply to characters rather than to individual
1540 code units. Thus, for example, \x{100}{2} matches two characters, each
1541 of which is represented by a two-byte sequence in a UTF-8 string. Simi‐
1542 larly, \X{3} matches three Unicode extended grapheme clusters, each of
1543 which may be several code units long (and they may be of different
1544 lengths).
1545
1546 The quantifier {0} is permitted, causing the expression to behave as if
1547 the previous item and the quantifier were not present. This may be use‐
1548 ful for subpatterns that are referenced as subroutines from elsewhere
1549 in the pattern (but see also the section entitled "Defining subpatterns
1550 for use by reference only" below). Items other than subpatterns that
1551 have a {0} quantifier are omitted from the compiled pattern.
1552
1553 For convenience, the three most common quantifiers have single-charac‐
1554 ter abbreviations:
1555
1556 * is equivalent to {0,}
1557 + is equivalent to {1,}
1558 ? is equivalent to {0,1}
1559
1560 It is possible to construct infinite loops by following a subpattern
1561 that can match no characters with a quantifier that has no upper limit,
1562 for example:
1563
1564 (a?)*
1565
1566 Earlier versions of Perl and PCRE1 used to give an error at compile
1567 time for such patterns. However, because there are cases where this can
1568 be useful, such patterns are now accepted, but if any repetition of the
1569 subpattern does in fact match no characters, the loop is forcibly bro‐
1570 ken.
1571
1572 By default, the quantifiers are "greedy", that is, they match as much
1573 as possible (up to the maximum number of permitted times), without
1574 causing the rest of the pattern to fail. The classic example of where
1575 this gives problems is in trying to match comments in C programs. These
1576 appear between /* and */ and within the comment, individual * and /
1577 characters may appear. An attempt to match C comments by applying the
1578 pattern
1579
1580 /\*.*\*/
1581
1582 to the string
1583
1584 /* first comment */ not comment /* second comment */
1585
1586 fails, because it matches the entire string owing to the greediness of
1587 the .* item.
1588
1589 If a quantifier is followed by a question mark, it ceases to be greedy,
1590 and instead matches the minimum number of times possible, so the pat‐
1591 tern
1592
1593 /\*.*?\*/
1594
1595 does the right thing with the C comments. The meaning of the various
1596 quantifiers is not otherwise changed, just the preferred number of
1597 matches. Do not confuse this use of question mark with its use as a
1598 quantifier in its own right. Because it has two uses, it can sometimes
1599 appear doubled, as in
1600
1601 \d??\d
1602
1603 which matches one digit by preference, but can match two if that is the
1604 only way the rest of the pattern matches.
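
The difference between greedy and minimizing repeats is easy to observe. The sketch below uses Python's standard re module as a stand-in for PCRE2 (an assumption: for these constructs the two engines are expected to agree). The names SUBJECT, GREEDY, LAZY and find_comments are hypothetical.

```python
import re

SUBJECT = "/* first comment */ not comment /* second comment */"

GREEDY = r"/\*.*\*/"   # .* runs to the last "*/" in the subject
LAZY = r"/\*.*?\*/"    # .*? stops at the first "*/" it can reach


def find_comments(pattern):
    """Return every non-overlapping match of the pattern in SUBJECT."""
    return re.findall(pattern, SUBJECT)
```

The greedy pattern returns the whole subject as a single match, while the minimizing one returns the two comments separately.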
1605
1606 If the PCRE2_UNGREEDY option is set (an option that is not available in
1607 Perl), the quantifiers are not greedy by default, but individual ones
1608 can be made greedy by following them with a question mark. In other
1609 words, it inverts the default behaviour.
1610
1611 When a parenthesized subpattern is quantified with a minimum repeat
1612 count that is greater than 1 or with a limited maximum, more memory is
1613 required for the compiled pattern, in proportion to the size of the
1614 minimum or maximum.
1615
1616 If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option
1617 (equivalent to Perl's /s) is set, thus allowing the dot to match new‐
1618 lines, the pattern is implicitly anchored, because whatever follows
1619 will be tried against every character position in the subject string,
1620 so there is no point in retrying the overall match at any position
1621 after the first. PCRE2 normally treats such a pattern as though it were
1622 preceded by \A.
1623
1624 In cases where it is known that the subject string contains no new‐
1625 lines, it is worth setting PCRE2_DOTALL in order to obtain this opti‐
1626 mization, or alternatively, using ^ to indicate anchoring explicitly.
1627
1628 However, there are some cases where the optimization cannot be used.
1629 When .* is inside capturing parentheses that are the subject of a back
1630 reference elsewhere in the pattern, a match at the start may fail where
1631 a later one succeeds. Consider, for example:
1632
1633 (.*)abc\1
1634
1635 If the subject is "xyz123abc123" the match point is the fourth charac‐
1636 ter. For this reason, such a pattern is not implicitly anchored.
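
The example can be reproduced with a short sketch. It uses Python's standard re module as a stand-in for PCRE2 (an assumption) and reports where the first match begins. Offsets are counted from zero, so the fourth character has offset 3. The helper name first_match is hypothetical.

```python
import re


def first_match(pattern, subject):
    """Return (start offset, matched text) of the first match, or None."""
    match = re.search(pattern, subject)
    return (match.start(), match.group(0)) if match else None


BACKREF_AFTER_DOTSTAR = r"(.*)abc\1"
```

Attempts at offsets 0, 1 and 2 fail because the text captured before "abc" does not recur after it; the attempt at offset 3 captures "123" and succeeds.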
1637
1638 Another case where implicit anchoring is not applied is when the lead‐
1639 ing .* is inside an atomic group. Once again, a match at the start may
1640 fail where a later one succeeds. Consider this pattern:
1641
1642 (?>.*?a)b
1643
1644 It matches "ab" in the subject "aab". The use of the backtracking
1645 control verbs (*PRUNE) and (*SKIP) also disables this optimization, and
1646 there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
1647
1648 When a capturing subpattern is repeated, the value captured is the sub‐
1649 string that matched the final iteration. For example, after
1650
1651 (tweedle[dume]{3}\s*)+
1652
1653 has matched "tweedledum tweedledee" the value of the captured substring
1654 is "tweedledee". However, if there are nested capturing subpatterns,
1655 the corresponding captured values may have been set in previous itera‐
1656 tions. For example, after
1657
1658 (a|(b))+
1659
1660 matches "aba" the value of the second captured substring is "b".
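
Both cases can be checked with a minimal sketch that uses Python's standard re module as a stand-in for PCRE2 (an assumption: Python happens to retain nested captures from earlier iterations in the same way for these patterns). The helper name final_captures is hypothetical.

```python
import re


def final_captures(pattern, subject):
    """Return the capture values left after a full match, or None."""
    match = re.fullmatch(pattern, subject)
    return match.groups() if match else None


REPEATED = r"(tweedle[dume]{3}\s*)+"
NESTED = r"(a|(b))+"
```

For NESTED applied to "aba", group 1 holds the final "a", while group 2 still holds the "b" set during the second iteration.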
1661
1663
1664 With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
1665 repetition, failure of what follows normally causes the repeated item
1666 to be re-evaluated to see if a different number of repeats allows the
1667 rest of the pattern to match. Sometimes it is useful to prevent this,
1668 either to change the nature of the match, or to cause it to fail earlier
1669 than it otherwise might, when the author of the pattern knows there is
1670 no point in carrying on.
1671
1672 Consider, for example, the pattern \d+foo when applied to the subject
1673 line
1674
1675 123456bar
1676
1677 After matching all 6 digits and then failing to match "foo", the normal
1678 action of the matcher is to try again with only 5 digits matching the
1679 \d+ item, and then with 4, and so on, before ultimately failing.
1680 "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
1681 the means for specifying that once a subpattern has matched, it is not
1682 to be re-evaluated in this way.
1683
1684 If we use atomic grouping for the previous example, the matcher gives
1685 up immediately on failing to match "foo" the first time. The notation
1686 is a kind of special parenthesis, starting with (?> as in this example:
1687
1688 (?>\d+)foo
1689
1690 This kind of parenthesis "locks up" the part of the pattern it con‐
1691 tains once it has matched, and a failure further into the pattern is
1692 prevented from backtracking into it. Backtracking past it to previous
1693 items, however, works as normal.
1694
1695 An alternative description is that a subpattern of this type matches
1696 exactly the string of characters that an identical standalone pattern
1697 would match, if anchored at the current point in the subject string.
1698
1699 Atomic grouping subpatterns are not capturing subpatterns. Simple cases
1700 such as the above example can be thought of as a maximizing repeat that
1701 must swallow everything it can. So, while both \d+ and \d+? are pre‐
1702 pared to adjust the number of digits they match in order to make the
1703 rest of the pattern match, (?>\d+) can only match an entire sequence of
1704 digits.
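
The effect of "locking up" a repeat can be observed even without atomic group syntax. Python's standard re module gained (?>...) only in version 3.11, so this sketch emulates the atomic group with a lookahead followed by a back reference. This is a substitute technique, not the PCRE2 construct itself, and it relies on the assumption that a lookahead is not re-entered once it has succeeded. All names are hypothetical.

```python
import re

# Ordinary greedy repeat: \d+ gives back one digit so the literal "5" can match.
BACKTRACKING = r"\d+5"

# Emulated atomic group (assumption: stands in for (?>\d+)5). The lookahead
# captures the longest run of digits once; \1 then consumes exactly that run.
ATOMIC_LIKE = r"(?=(\d+))(?:\1)5"


def search_text(pattern, subject):
    """Return the text of the first match, or None if there is none."""
    match = re.search(pattern, subject)
    return match.group(0) if match else None
```

Applied to "12345", the backtracking pattern matches the whole string, whereas the atomic-like pattern finds no match because the digit run has already swallowed the final 5 and cannot give it back.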
1705
1706 Atomic groups in general can of course contain arbitrarily complicated
1707 subpatterns, and can be nested. However, when the subpattern for an
1708 atomic group is just a single repeated item, as in the example above, a
1709 simpler notation, called a "possessive quantifier" can be used. This
1710 consists of an additional + character following a quantifier. Using
1711 this notation, the previous example can be rewritten as
1712
1713 \d++foo
1714
1715 Note that a possessive quantifier can be used with an entire group, for
1716 example:
1717
1718 (abc|xyz){2,3}+
1719
1720 Possessive quantifiers are always greedy; the setting of the
1721 PCRE2_UNGREEDY option is ignored. They are a convenient notation for
1722 the simpler forms of atomic group. However, there is no difference in
1723 the meaning of a possessive quantifier and the equivalent atomic group,
1724 though there may be a performance difference; possessive quantifiers
1725 should be slightly faster.
1726
1727 The possessive quantifier syntax is an extension to the Perl 5.8 syn‐
1728 tax. Jeffrey Friedl originated the idea (and the name) in the first
1729 edition of his book. Mike McCloskey liked it, so implemented it when he
1730 built Sun's Java package, and PCRE1 copied it from there. It ultimately
1731 found its way into Perl at release 5.10.
1732
1733 PCRE2 has an optimization that automatically "possessifies" certain
1734 simple pattern constructs. For example, the sequence A+B is treated as
1735 A++B because there is no point in backtracking into a sequence of A's
1736 when B must follow. To disable this feature, set PCRE2_NO_AUTO_POSSESS
1737 or start the pattern with (*NO_AUTO_POSSESS).
1738
1739 When a pattern contains an unlimited repeat inside a subpattern that
1740 can itself be repeated an unlimited number of times, the use of an
1741 atomic group is the only way to avoid some failing matches taking a
1742 very long time indeed. The pattern
1743
1744 (\D+|<\d+>)*[!?]
1745
1746 matches an unlimited number of substrings that either consist of non-
1747 digits, or digits enclosed in <>, followed by either ! or ?. When it
1748 matches, it runs quickly. However, if it is applied to
1749
1750 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1751
1752 it takes a long time before reporting failure. This is because the
1753 string can be divided between the internal \D+ repeat and the external
1754 * repeat in a large number of ways, and all have to be tried. (The
1755 example uses [!?] rather than a single character at the end, because
1756 both PCRE2 and Perl have an optimization that allows for fast failure
1757 when a single character is used. They remember the last single charac‐
1758 ter that is required for a match, and fail early if it is not present
1759 in the string.) If the pattern is changed so that it uses an atomic
1760 group, like this:
1761
1762 ((?>\D+)|<\d+>)*[!?]
1763
1764 sequences of non-digits cannot be broken, and failure happens quickly.
1765
1767
1768 Outside a character class, a backslash followed by a digit greater than
1769 0 (and possibly further digits) is a back reference to a capturing sub‐
1770 pattern earlier (that is, to its left) in the pattern, provided there
1771 have been that many previous capturing left parentheses.
1772
1773 However, if the decimal number following the backslash is less than 8,
1774 it is always taken as a back reference, and causes an error only if
1775 there are not that many capturing left parentheses in the entire pat‐
1776 tern. In other words, the parentheses that are referenced need not be
1777 to the left of the reference for numbers less than 8. A "forward back
1778 reference" of this type can make sense when a repetition is involved
1779 and the subpattern to the right has participated in an earlier itera‐
1780 tion.
1781
1782 It is not possible to have a numerical "forward back reference" to a
1783 subpattern whose number is 8 or more using this syntax because a
1784 sequence such as \50 is interpreted as a character defined in octal.
1785 See the subsection entitled "Non-printing characters" above for further
1786 details of the handling of digits following a backslash. There is no
1787 such problem when named parentheses are used. A back reference to any
1788 subpattern is possible using named parentheses (see below).
1789
1790 Another way of avoiding the ambiguity inherent in the use of digits
1791 following a backslash is to use the \g escape sequence. This escape
1792 must be followed by a signed or unsigned number, optionally enclosed in
1793 braces. These examples are all identical:
1794
1795 (ring), \1
1796 (ring), \g1
1797 (ring), \g{1}
1798
1799 An unsigned number specifies an absolute reference without the ambigu‐
1800 ity that is present in the older syntax. It is also useful when literal
1801 digits follow the reference. A signed number is a relative reference.
1802 Consider this example:
1803
1804 (abc(def)ghi)\g{-1}
1805
1806 The sequence \g{-1} is a reference to the most recently started
1807 capturing subpattern before \g, that is, it is equivalent to \2 in this
1808 example. Similarly, \g{-2} would be equivalent to \1. The use of relative
1809 references can be helpful in long patterns, and also in patterns that
1810 are created by joining together fragments that contain references
1811 within themselves.
1812
1813 The sequence \g{+1} is a reference to the next capturing subpattern.
1814 This kind of forward reference can be useful in patterns that repeat.
1815 Perl does not support the use of + in this way.
1816
1817 A back reference matches whatever actually matched the capturing sub‐
1818 pattern in the current subject string, rather than anything matching
1819 the subpattern itself (see "Subpatterns as subroutines" below for a way
1820 of doing that). So the pattern
1821
1822 (sens|respons)e and \1ibility
1823
1824 matches "sense and sensibility" and "response and responsibility", but
1825 not "sense and responsibility". If caseful matching is in force at the
1826 time of the back reference, the case of letters is relevant. For exam‐
1827 ple,
1828
1829 ((?i)rah)\s+\1
1830
1831 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
1832 original capturing subpattern is matched caselessly.
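
Both examples can be verified with a minimal sketch that uses Python's standard re module as a stand-in for PCRE2 (an assumption). Python does not accept (?i) in the middle of a pattern, so the second pattern uses the scoped form (?i:rah), which likewise leaves the back reference outside the caseless region. The helper name full is hypothetical.

```python
import re


def full(pattern, subject):
    """Return True if the pattern matches the entire subject string."""
    return re.fullmatch(pattern, subject) is not None


# \1 must repeat the text that group 1 actually matched.
SENSE = r"(sens|respons)e and \1ibility"

# The group is matched caselessly, but the back reference is caseful.
RAH = r"((?i:rah))\s+\1"
```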
1833
1834 There are several different ways of writing back references to named
1835 subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
1836 \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's
1837 unified back reference syntax, in which \g can be used for both numeric
1838 and named references, is also supported. We could rewrite the above
1839 example in any of the following ways:
1840
1841 (?<p1>(?i)rah)\s+\k<p1>
1842 (?'p1'(?i)rah)\s+\k{p1}
1843 (?P<p1>(?i)rah)\s+(?P=p1)
1844 (?<p1>(?i)rah)\s+\g{p1}
1845
1846 A subpattern that is referenced by name may appear in the pattern
1847 before or after the reference.
1848
1849 There may be more than one back reference to the same subpattern. If a
1850 subpattern has not actually been used in a particular match, any back
1851 references to it always fail by default. For example, the pattern
1852
1853 (a|(bc))\2
1854
1855 always fails if it starts to match "a" rather than "bc". However, if
1856 the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a back
1857 reference to an unset value matches an empty string.
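
The default behaviour can be seen in a minimal sketch using Python's standard re module as a stand-in for PCRE2 (an assumption). Python has no counterpart to PCRE2_MATCH_UNSET_BACKREF, so only the default case, in which a reference to an unset group fails, is shown. The helper name search_text is hypothetical.

```python
import re

UNSET_REFERENCE = r"(a|(bc))\2"


def search_text(pattern, subject):
    """Return the text of the first match, or None if there is none."""
    match = re.search(pattern, subject)
    return match.group(0) if match else None
```

When the first alternative matches "a", group 2 is never set, so \2 cannot match and the overall attempt fails.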
1858
1859 Because there may be many capturing parentheses in a pattern, all dig‐
1860 its following a backslash are taken as part of a potential back refer‐
1861 ence number. If the pattern continues with a digit character, some
1862 delimiter must be used to terminate the back reference. If the
1863 PCRE2_EXTENDED option is set, this can be white space. Otherwise, the
1864 \g{ syntax or an empty comment (see "Comments" below) can be used.
1865
1866 Recursive back references
1867
1868 A back reference that occurs inside the parentheses to which it refers
1869 fails when the subpattern is first used, so, for example, (a\1) never
1870 matches. However, such references can be useful inside repeated sub‐
1871 patterns. For example, the pattern
1872
1873 (a|b\1)+
1874
1875 matches any number of "a"s and also "aba", "ababbaa" etc. At each iter‐
1876 ation of the subpattern, the back reference matches the character
1877 string corresponding to the previous iteration. In order for this to
1878 work, the pattern must be such that the first iteration does not need
1879 to match the back reference. This can be done using alternation, as in
1880 the example above, or by a quantifier with a minimum of zero.
1881
1882 Back references of this type cause the group that they reference to be
1883 treated as an atomic group. Once the whole group has been matched, a
1884 subsequent matching failure cannot cause backtracking into the middle
1885 of the group.
1886
1888
1889 An assertion is a test on the characters following or preceding the
1890 current matching point that does not consume any characters. The simple
1891 assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described
1892 above.
1893
1894 More complicated assertions are coded as subpatterns. There are two
1895 kinds: those that look ahead of the current position in the subject
1896 string, and those that look behind it. An assertion subpattern is
1897 matched in the normal way, except that it does not cause the current
1898 matching position to be changed.
1899
1900 Assertion subpatterns are not capturing subpatterns. If such an asser‐
1901 tion contains capturing subpatterns within it, these are counted for
1902 the purposes of numbering the capturing subpatterns in the whole pat‐
1903 tern. However, substring capturing is carried out only for positive
1904 assertions. (Perl sometimes, but not always, does do capturing in nega‐
1905 tive assertions.)
1906
1907 WARNING: If a positive assertion containing one or more capturing sub‐
1908 patterns succeeds, but failure to match later in the pattern causes
1909 backtracking over this assertion, the captures within the assertion are
1910 reset only if no higher numbered captures are already set. This is,
1911 unfortunately, a fundamental limitation of the current implementation;
1912 it may get removed in a future reworking.
1913
1914 For compatibility with Perl, most assertion subpatterns may be
1915 repeated; though it makes no sense to assert the same thing several
1916 times, the side effect of capturing parentheses may occasionally be
1917 useful. However, an assertion that forms the condition for a
1918 conditional subpattern may not be quantified. In practice, for other
1919 assertions, there are only three cases:
1920
1921 (1) If the quantifier is {0}, the assertion is never obeyed during
1922 matching. However, it may contain internal capturing parenthesized
1923 groups that are called from elsewhere via the subroutine mechanism.
1924
1925 (2) If the quantifier is {0,n} where n is greater than zero, it is treated
1926 as if it were {0,1}. At run time, the rest of the pattern match is
1927 tried with and without the assertion, the order depending on the greed‐
1928 iness of the quantifier.
1929
1930 (3) If the minimum repetition is greater than zero, the quantifier is
1931 ignored. The assertion is obeyed just once when encountered during
1932 matching.
1933
1934 Lookahead assertions
1935
1936 Lookahead assertions start with (?= for positive assertions and (?! for
1937 negative assertions. For example,
1938
1939 \w+(?=;)
1940
1941 matches a word followed by a semicolon, but does not include the semi‐
1942 colon in the match, and
1943
1944 foo(?!bar)
1945
1946 matches any occurrence of "foo" that is not followed by "bar". Note
1947 that the apparently similar pattern
1948
1949 (?!foo)bar
1950
1951 does not find an occurrence of "bar" that is preceded by something
1952 other than "foo"; it finds any occurrence of "bar" whatsoever, because
1953 the assertion (?!foo) is always true when the next three characters are
1954 "bar". A lookbehind assertion is needed to achieve the other effect.
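
The three lookahead patterns above can be compared side by side. This is a minimal sketch using Python's standard re module as a stand-in for PCRE2 (an assumption: lookahead behaves the same way for these cases). Offsets are counted from zero, and the helper name spans is hypothetical.

```python
import re


def spans(pattern, subject):
    """Return (start offset, matched text) for every match in the subject."""
    return [(m.start(), m.group(0)) for m in re.finditer(pattern, subject)]


WORD_BEFORE_SEMICOLON = r"\w+(?=;)"
FOO_NOT_BEFORE_BAR = r"foo(?!bar)"
MISLEADING = r"(?!foo)bar"
```

Note that spans(MISLEADING, "foobar") still finds "bar" at offset 3, which illustrates why a lookbehind is needed for the other effect.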
1955
1956 If you want to force a matching failure at some point in a pattern, the
1957 most convenient way to do it is with (?!) because an empty string
1958 always matches, so an assertion that requires there not to be an empty
1959 string must always fail. The backtracking control verb (*FAIL) or (*F)
1960 is a synonym for (?!).
1961
1962 Lookbehind assertions
1963
1964 Lookbehind assertions start with (?<= for positive assertions and (?<!
1965 for negative assertions. For example,
1966
1967 (?<!foo)bar
1968
1969 does find an occurrence of "bar" that is not preceded by "foo". The
1970 contents of a lookbehind assertion are restricted such that all the
1971 strings it matches must have a fixed length. However, if there are sev‐
1972 eral top-level alternatives, they do not all have to have the same
1973 fixed length. Thus
1974
1975 (?<=bullock|donkey)
1976
1977 is permitted, but
1978
1979 (?<!dogs?|cats?)
1980
1981 causes an error at compile time. Branches that match different length
1982 strings are permitted only at the top level of a lookbehind assertion.
1983 This is an extension compared with Perl, which requires all branches to
1984 match the same length of string. An assertion such as
1985
1986 (?<=ab(c|de))
1987
1988 is not permitted, because its single top-level branch can match two
1989 different lengths, but it is acceptable to PCRE2 if rewritten to use
1990 two top-level branches:
1991
1992 (?<=abc|abde)
1993
1994 In some cases, the escape sequence \K (see above) can be used instead
1995 of a lookbehind assertion to get round the fixed-length restriction.
1996
1997 The implementation of lookbehind assertions is, for each alternative,
1998 to temporarily move the current position back by the fixed length and
1999 then try to match. If there are insufficient characters before the cur‐
2000 rent position, the assertion fails.
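
As a rough illustration of this procedure, the following Python sketch
steps back by the length of each alternative and compares. It is not
PCRE2's code: the function lookbehind() and its parameters are
hypothetical names, and every alternative is assumed to be a literal
string, so that its fixed length is simply len(alt).

```python
def lookbehind(subject, pos, alternatives, negative=False):
    """Emulate a lookbehind whose top-level alternatives are literals."""
    matched = False
    for alt in alternatives:
        start = pos - len(alt)   # step back by this alternative's length
        if start < 0:
            continue             # too few characters before pos
        if subject[start:pos] == alt:
            matched = True
            break
    # A negative lookbehind succeeds exactly when nothing matched.
    return matched != negative
```

Under these assumptions, lookbehind("my donkey", 9, ["bullock",
"donkey"]) is true, and a negative lookbehind at the very start of a
subject succeeds because no alternative has enough characters to match.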
2001
2002 In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which
2003 matches a single code unit even in a UTF mode) to appear in lookbehind
2004 assertions, because it makes it impossible to calculate the length of
2005 the lookbehind. The \X and \R escapes, which can match different num‐
2006 bers of code units, are never permitted in lookbehinds.
2007
2008 "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
2009 lookbehinds, as long as the subpattern matches a fixed-length string.
2010 However, recursion, that is, a "subroutine" call into a group that is
2011 already active, is not supported.
2012
2013 Perl does not support back references in lookbehinds. PCRE2 does sup‐
2014 port them, but only if certain conditions are met. The
2015 PCRE2_MATCH_UNSET_BACKREF option must not be set, there must be no use
2016 of (?| in the pattern (it creates duplicate subpattern numbers), and if
2017 the back reference is by name, the name must be unique. Of course, the
2018 referenced subpattern must itself be of fixed length. The following
2019 pattern matches words containing at least two characters that begin and
2020 end with the same character:
2021
2022 \b(\w)\w++(?<=\1)
2023
2024 Possessive quantifiers can be used in conjunction with lookbehind
2025 assertions to specify efficient matching of fixed-length strings at the
2026 end of subject strings. Consider a simple pattern such as
2027
2028 abcd$
2029
2030 when applied to a long string that does not match. Because matching
2031 proceeds from left to right, PCRE2 will look for each "a" in the sub‐
2032 ject and then see if what follows matches the rest of the pattern. If
2033 the pattern is specified as
2034
2035 ^.*abcd$
2036
2037 the initial .* matches the entire string at first, but when this fails
2038 (because there is no following "a"), it backtracks to match all but the
2039 last character, then all but the last two characters, and so on. Once
2040 again the search for "a" covers the entire string, from right to left,
2041 so we are no better off. However, if the pattern is written as
2042
2043 ^.*+(?<=abcd)
2044
2045 there can be no backtracking for the .*+ item because of the possessive
2046 quantifier; it can match only the entire string. The subsequent lookbe‐
2047 hind assertion does a single test on the last four characters. If it
2048 fails, the match fails immediately. For long strings, this approach
2049 makes a significant difference to the processing time.
2050
2051 Using multiple assertions
2052
2053 Several assertions (of any sort) may occur in succession. For example,
2054
2055 (?<=\d{3})(?<!999)foo
2056
2057 matches "foo" preceded by three digits that are not "999". Notice that
2058 each of the assertions is applied independently at the same point in
2059 the subject string. First there is a check that the previous three
2060 characters are all digits, and then there is a check that the same
2061 three characters are not "999". This pattern does not match "foo" pre‐
2062 ceded by six characters, the first three of which are digits and the
2063 last three of which are not "999". For example, it doesn't match
2064 "123abcfoo". A pattern to do that is
2065
2066 (?<=\d{3}...)(?<!999)foo
2067
2068 This time the first assertion looks at the preceding six characters,
2069 checking that the first three are digits, and then the second assertion
2070 checks that the preceding three characters are not "999".
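
The two patterns can be tried out with Python's re module. That module
is a different engine from PCRE2, so this is only an approximation, but
it accepts both patterns and applies each assertion at the same point
in the subject. The names same_point, six_back, and matches() are made
up for the example.

```python
import re

# Both assertions inspect the three characters just before "foo".
same_point = re.compile(r'(?<=\d{3})(?<!999)foo')
# The first assertion now looks back over six characters.
six_back = re.compile(r'(?<=\d{3}...)(?<!999)foo')

def matches(pattern, subject):
    """True if the pattern is found anywhere in the subject."""
    return pattern.search(subject) is not None

for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(subject, matches(same_point, subject), matches(six_back, subject))
```

Of these four subjects, only "123foo" satisfies the first pattern and
only "123abcfoo" satisfies the second.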
2071
2072 Assertions can be nested in any combination. For example,
2073
2074 (?<=(?<!foo)bar)baz
2075
2076 matches an occurrence of "baz" that is preceded by "bar" which in turn
2077 is not preceded by "foo", while
2078
2079 (?<=\d{3}(?!999)...)foo
2080
2081 is another pattern that matches "foo" preceded by three digits and any
2082 three characters that are not "999".
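
A minimal sketch of the nested forms, again using Python's re module as
a stand-in for PCRE2 (an assumption; the engines differ in other
respects). The variable names are hypothetical.

```python
import re

# "baz" preceded by "bar" which is itself not preceded by "foo".
bar_not_after_foo = re.compile(r'(?<=(?<!foo)bar)baz')
# "foo" preceded by three digits and three characters that are not "999".
digits_then_other = re.compile(r'(?<=\d{3}(?!999)...)foo')

print(bool(bar_not_after_foo.search("xyzbarbaz")))
print(bool(bar_not_after_foo.search("foobarbaz")))
print(bool(digits_then_other.search("123abcfoo")))
print(bool(digits_then_other.search("123999foo")))
```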
2083
2085
2086 It is possible to cause the matching process to obey a subpattern con‐
2087 ditionally or to choose between two alternative subpatterns, depending
2088 on the result of an assertion, or whether a specific capturing subpat‐
2089 tern has already been matched. The two possible forms of conditional
2090 subpattern are:
2091
2092 (?(condition)yes-pattern)
2093 (?(condition)yes-pattern|no-pattern)
2094
2095 If the condition is satisfied, the yes-pattern is used; otherwise the
2096 no-pattern (if present) is used. If there are more than two alterna‐
2097 tives in the subpattern, a compile-time error occurs. Each of the two
2098 alternatives may itself contain nested subpatterns of any form, includ‐
2099 ing conditional subpatterns; the restriction to two alternatives
2100 applies only at the level of the condition. This pattern fragment is an
2101 example where the alternatives are complex:
2102
2103 (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
2104
2105
2106 There are five kinds of condition: references to subpatterns, refer‐
2107 ences to recursion, two pseudo-conditions called DEFINE and VERSION,
2108 and assertions.
2109
2110 Checking for a used subpattern by number
2111
2112 If the text between the parentheses consists of a sequence of digits,
2113 the condition is true if a capturing subpattern of that number has pre‐
2114 viously matched. If there is more than one capturing subpattern with
2115 the same number (see the earlier section about duplicate subpattern
2116 numbers), the condition is true if any of them have matched. An alter‐
2117 native notation is to precede the digits with a plus or minus sign. In
2118 this case, the subpattern number is relative rather than absolute. The
2119 most recently opened parentheses can be referenced by (?(-1), the next
2120 most recent by (?(-2), and so on. Inside loops it can also make sense
2121 to refer to subsequent groups. The next parentheses to be opened can be
2122 referenced as (?(+1), and so on. (The value zero in any of these forms
2123 is not used; it provokes a compile-time error.)
2124
2125 Consider the following pattern, which contains non-significant white
2126 space to make it more readable (assume the PCRE2_EXTENDED option) and
2127 to divide it into three parts for ease of discussion:
2128
2129 ( \( )? [^()]+ (?(1) \) )
2130
2131 The first part matches an optional opening parenthesis, and if that
2132 character is present, sets it as the first captured substring. The sec‐
2133 ond part matches one or more characters that are not parentheses. The
2134 third part is a conditional subpattern that tests whether or not the
2135 first set of parentheses matched. If they did, that is, if the subject
2136 started with an opening parenthesis, the condition is true, and so the
2137 yes-pattern is executed and a closing parenthesis is required. Other‐
2138 wise, since no-pattern is not present, the subpattern matches nothing.
2139 In other words, this pattern matches a sequence of non-parentheses,
2140 optionally enclosed in parentheses.
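
Python's re module also supports conditions of the form (?(1)...), so
the behaviour described above can be checked with a short sketch. Here
re.VERBOSE stands in for PCRE2_EXTENDED, and the helper whole() is a
hypothetical name that requires the entire subject to match.

```python
import re

# White space in the pattern is ignored because of re.VERBOSE.
pattern = re.compile(r'( \( )? [^()]+ (?(1) \) )', re.VERBOSE)

def whole(subject):
    """True if the entire subject matches the pattern."""
    return pattern.fullmatch(subject) is not None

for subject in ("abc", "(abc)", "(abc", "abc)"):
    print(subject, whole(subject))
```

The first two subjects match as a whole; the last two do not, because
an opening parenthesis demands a closing one and vice versa.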
2141
2142 If you were embedding this pattern in a larger one, you could use a
2143 relative reference:
2144
2145 ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
2146
2147 This makes the fragment independent of the parentheses in the larger
2148 pattern.
2149
2150 Checking for a used subpattern by name
2151
2152 Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
2153 used subpattern by name. For compatibility with earlier versions of
2154 PCRE1, which had this facility before Perl, the syntax (?(name)...) is
2155 also recognized. Note, however, that undelimited names consisting of
2156 the letter R followed by digits are ambiguous (see the following sec‐
2157 tion).
2158
2159 Rewriting the above example to use a named subpattern gives this:
2160
2161 (?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
2162
2163 If the name used in a condition of this kind is a duplicate, the test
2164 is applied to all subpatterns of the same name, and is true if any one
2165 of them has matched.
2166
2167 Checking for pattern recursion
2168
2169 "Recursion" in this sense refers to any subroutine-like call from one
2170 part of the pattern to another, whether or not it is actually recur‐
2171 sive. See the sections entitled "Recursive patterns" and "Subpatterns
2172 as subroutines" below for details of recursion and subpattern calls.
2173
2174 If a condition is the string (R), and there is no subpattern with the
2175 name R, the condition is true if matching is currently in a recursion
2176 or subroutine call to the whole pattern or any subpattern. If digits
2177 follow the letter R, and there is no subpattern with that name, the
2178 condition is true if the most recent call is into a subpattern with the
2179 given number, which must exist somewhere in the overall pattern. This
2180 is a contrived example that is equivalent to a+b:
2181
2182 ((?(R1)a+|(?1)b))
2183
2184 However, in both cases, if there is a subpattern with a matching name,
2185 the condition tests for its being set, as described in the section
2186 above, instead of testing for recursion. For example, creating a group
2187 with the name R1 by adding (?<R1>) to the above pattern completely
2188 changes its meaning.
2189
2190 If a name preceded by ampersand follows the letter R, for example:
2191
2192 (?(R&name)...)
2193
2194 the condition is true if the most recent recursion is into a subpattern
2195 of that name (which must exist within the pattern).
2196
2197 This condition does not check the entire recursion stack. It tests only
2198 the current level. If the name used in a condition of this kind is a
2199 duplicate, the test is applied to all subpatterns of the same name, and
2200 is true if any one of them is the most recent recursion.
2201
2202 At "top level", all these recursion test conditions are false.
2203
2204 Defining subpatterns for use by reference only
2205
2206 If the condition is the string (DEFINE), the condition is always false,
2207 even if there is a group with the name DEFINE. In this case, there may
2208 be only one alternative in the subpattern. It is always skipped if con‐
2209 trol reaches this point in the pattern; the idea of DEFINE is that it
2210 can be used to define subroutines that can be referenced from else‐
2211 where. (The use of subroutines is described below.) For example, a pat‐
2212 tern to match an IPv4 address such as "192.168.23.245" could be written
2213 like this (ignore white space and line breaks):
2214
2215 (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
2216 \b (?&byte) (\.(?&byte)){3} \b
2217
2218 The first part of the pattern is a DEFINE group inside which another
2219 group named "byte" is defined. This matches an individual component of
2220 an IPv4 address (a number less than 256). When matching takes place,
2221 this part of the pattern is skipped because DEFINE acts like a false
2222 condition. The rest of the pattern uses references to the named group
2223 to match the four dot-separated components of an IPv4 address, insist‐
2224 ing on a word boundary at each end.
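
Python's re module has no DEFINE group, but the same reuse of a
"byte" subpattern can be approximated by building the pattern from a
string. This is a workaround under that assumption, not an equivalent
of DEFINE; BYTE and IPV4 are hypothetical names, and the alternatives
are copied from the pattern above.

```python
import re

# One component of an IPv4 address (a number less than 256).
BYTE = r'(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)'

# Four dot-separated components with a word boundary at each end.
IPV4 = re.compile(r'\b' + BYTE + r'(?:\.' + BYTE + r'){3}\b')

print(bool(IPV4.fullmatch("192.168.23.245")))
print(bool(IPV4.fullmatch("256.1.1.1")))
```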
2225
2226 Checking the PCRE2 version
2227
2228 Programs that link with a PCRE2 library can check the version by call‐
2229 ing pcre2_config() with appropriate arguments. Users of applications
2230 that do not have access to the underlying code cannot do this. A spe‐
2231 cial "condition" called VERSION exists to allow such users to discover
2232 which version of PCRE2 they are dealing with by using this condition to
2233 match a string such as "yesno". VERSION must be followed either by "="
2234 or ">=" and a version number. For example:
2235
2236 (?(VERSION>=10.4)yes|no)
2237
2238 This pattern matches "yes" if the PCRE2 version is greater than or
2239 equal to 10.4, or "no" otherwise. The fractional part of the version
2240 number may not contain more than two digits.
2241
2242 Assertion conditions
2243
2244 If the condition is not in any of the above formats, it must be an
2245 assertion. This may be a positive or negative lookahead or lookbehind
2246 assertion. Consider this pattern, again containing non-significant
2247 white space, and with the two alternatives on the second line:
2248
2249 (?(?=[^a-z]*[a-z])
2250 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
2251
2252 The condition is a positive lookahead assertion that matches an
2253 optional sequence of non-letters followed by a letter. In other words,
2254 it tests for the presence of at least one letter in the subject. If a
2255 letter is found, the subject is matched against the first alternative;
2256 otherwise it is matched against the second. This pattern matches
2257 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
2258 letters and dd are digits.
2259
2261
2262 There are two ways of including comments in patterns that are processed
2263 by PCRE2. In both cases, the start of the comment must not be in a
2264 character class, nor in the middle of any other sequence of related
2265 characters such as (?: or a subpattern name or number. The characters
2266 that make up a comment play no part in the pattern matching.
2267
2268 The sequence (?# marks the start of a comment that continues up to the
2269 next closing parenthesis. Nested parentheses are not permitted. If the
2270 PCRE2_EXTENDED option is set, an unescaped # character also introduces
2271 a comment, which in this case continues to immediately after the next
2272 newline character or character sequence in the pattern. Which charac‐
2273 ters are interpreted as newlines is controlled by an option passed to
2274 the compiling function or by a special sequence at the start of the
2275 pattern, as described in the section entitled "Newline conventions"
2276 above. Note that the end of this type of comment is a literal newline
2277 sequence in the pattern; escape sequences that happen to represent a
2278 newline do not count. For example, consider this pattern when
2279 PCRE2_EXTENDED is set, and the default newline convention (a single
2280 linefeed character) is in force:
2281
2282 abc #comment \n still comment
2283
2284 On encountering the # character, pcre2_compile() skips along, looking
2285 for a newline in the pattern. The sequence \n is still literal at this
2286 stage, so it does not terminate the comment. Only an actual character
2287 with the code value 0x0a (the default newline) does so.
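
The same distinction can be observed with Python's re module, whose
re.VERBOSE flag treats # comments in a comparable way (Python is used
here only as an illustration; it is not PCRE2). The names escaped and
literal are hypothetical.

```python
import re

# In a raw string, \n is a backslash followed by "n", not a newline,
# so the comment runs to the end of the pattern.
escaped = re.compile(r'abc #comment \n still comment', re.VERBOSE)

# Here the pattern contains a real linefeed (0x0a), which ends the
# comment, so "def" is part of the pattern.
literal = re.compile('abc #comment \n def', re.VERBOSE)

print(bool(escaped.fullmatch("abc")))
print(bool(literal.fullmatch("abcdef")))
```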
2288
2290
2291 Consider the problem of matching a string in parentheses, allowing for
2292 unlimited nested parentheses. Without the use of recursion, the best
2293 that can be done is to use a pattern that matches up to some fixed
2294 depth of nesting. It is not possible to handle an arbitrary nesting
2295 depth.
2296
2297 For some time, Perl has provided a facility that allows regular expres‐
2298 sions to recurse (amongst other things). It does this by interpolating
2299 Perl code in the expression at run time, and the code can refer to the
2300 expression itself. A Perl pattern using code interpolation to solve the
2301 parentheses problem can be created like this:
2302
2303 $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
2304
2305 The (?p{...}) item interpolates Perl code at run time, and in this case
2306 refers recursively to the pattern in which it appears.
2307
2308 Obviously, PCRE2 cannot support the interpolation of Perl code.
2309 Instead, it supports special syntax for recursion of the entire pat‐
2310 tern, and also for individual subpattern recursion. After its introduc‐
2311 tion in PCRE1 and Python, this kind of recursion was subsequently
2312 introduced into Perl at release 5.10.
2313
2314 A special item that consists of (? followed by a number greater than
2315 zero and a closing parenthesis is a recursive subroutine call of the
2316 subpattern of the given number, provided that it occurs inside that
2317 subpattern. (If not, it is a non-recursive subroutine call, which is
2318 described in the next section.) The special item (?R) or (?0) is a
2319 recursive call of the entire regular expression.
2320
2321 This PCRE2 pattern solves the nested parentheses problem (assume the
2322 PCRE2_EXTENDED option is set so that white space is ignored):
2323
2324 \( ( [^()]++ | (?R) )* \)
2325
2326 First it matches an opening parenthesis. Then it matches any number of
2327 substrings which can either be a sequence of non-parentheses, or a
2328 recursive match of the pattern itself (that is, a correctly parenthe‐
2329 sized substring). Finally there is a closing parenthesis. Note the use
2330 of a possessive quantifier to avoid backtracking into sequences of non-
2331 parentheses.
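
To make the structure of this pattern concrete, here is a hand-written
Python function that follows the same three steps. It is a sketch of
what the pattern describes, not of how PCRE2 executes it, and the name
match_parens() is hypothetical. Like the pattern, it matches at a given
position rather than searching.

```python
def match_parens(s, pos=0):
    """Return the offset just past a balanced group at pos, or -1."""
    if pos >= len(s) or s[pos] != "(":
        return -1                        # no opening parenthesis
    pos += 1
    while pos < len(s) and s[pos] != ")":
        if s[pos] == "(":
            end = match_parens(s, pos)   # corresponds to (?R)
            if end < 0:
                return -1
            pos = end
        else:
            # corresponds to [^()]++ : consume a run of non-parentheses
            while pos < len(s) and s[pos] not in "()":
                pos += 1
    if pos < len(s) and s[pos] == ")":
        return pos + 1                   # closing parenthesis found
    return -1
```

For example, match_parens("(ab(cd)ef)") returns 10, the length of the
whole string, whereas match_parens("(a(b)") returns -1.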
2332
2333 If this were part of a larger pattern, you would not want to recurse
2334 the entire pattern, so instead you could use this:
2335
2336 ( \( ( [^()]++ | (?1) )* \) )
2337
2338 We have put the pattern into parentheses, and caused the recursion to
2339 refer to them instead of the whole pattern.
2340
2341 In a larger pattern, keeping track of parenthesis numbers can be
2342 tricky. This is made easier by the use of relative references. Instead
2343 of (?1) in the pattern above you can write (?-2) to refer to the second
2344 most recently opened parentheses preceding the recursion. In other
2345 words, a negative number counts capturing parentheses leftwards from
2346 the point at which it is encountered.
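
The arithmetic behind a relative reference can be sketched as follows.
The function absolute_group() is a hypothetical helper, and it assumes
that opened_so_far is the highest capturing group number assigned
before the reference is encountered.

```python
def absolute_group(relative, opened_so_far):
    """Convert a signed relative group reference to an absolute number."""
    if relative == 0:
        raise ValueError("zero is not a valid relative reference")
    if relative < 0:
        number = opened_so_far + relative + 1   # -1 is the most recent group
    else:
        number = opened_so_far + relative       # +1 is the next group to open
    if number < 1:
        raise ValueError("reference to a non-existent group")
    return number
```

In the pattern above, two groups have been opened when the recursion is
reached, so absolute_group(-2, 2) gives 1, which is why (?-2) can
replace (?1).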
2347
2348 Be aware, however, that if duplicate subpattern numbers are in use, rel‐
2349 ative references refer to the earliest subpattern with the appropriate
2350 number. Consider, for example:
2351
2352 (?|(a)|(b)) (c) (?-2)
2353
2354 The first two capturing groups (a) and (b) are both numbered 1, and
2355 group (c) is number 2. When the reference (?-2) is encountered, the
2356 second most recently opened parentheses has the number 1, but it is the
2357 first such group (the (a) group) to which the recursion refers. This
2358 would be the same if an absolute reference (?1) was used. In other
2359 words, relative references are just a shorthand for computing a group
2360 number.
2361
2362 It is also possible to refer to subsequently opened parentheses, by
2363 writing references such as (?+2). However, these cannot be recursive
2364 because the reference is not inside the parentheses that are refer‐
2365 enced. They are always non-recursive subroutine calls, as described in
2366 the next section.
2367
2368 An alternative approach is to use named parentheses. The Perl syntax
2369 for this is (?&name); PCRE1's earlier syntax (?P>name) is also sup‐
2370 ported. We could rewrite the above example as follows:
2371
2372 (?<pn> \( ( [^()]++ | (?&pn) )* \) )
2373
2374 If there is more than one subpattern with the same name, the earliest
2375 one is used.
2376
2377 The example pattern that we have been looking at contains nested unlim‐
2378 ited repeats, and so the use of a possessive quantifier for matching
2379 strings of non-parentheses is important when applying the pattern to
2380 strings that do not match. For example, when this pattern is applied to
2381
2382 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
2383
2384 it yields "no match" quickly. However, if a possessive quantifier is
2385 not used, the match runs for a very long time indeed because there are
2386 so many different ways the + and * repeats can carve up the subject,
2387 and all have to be tested before failure can be reported.
2388
2389 At the end of a match, the values of capturing parentheses are those
2390 from the outermost level. If you want to obtain intermediate values, a
2391 callout function can be used (see below and the pcre2callout documenta‐
2392 tion). If the pattern above is matched against
2393
2394 (ab(cd)ef)
2395
2396 the value for the inner capturing parentheses (numbered 2) is "ef",
2397 which is the last value taken on at the top level. If a capturing sub‐
2398 pattern is not matched at the top level, its final captured value is
2399 unset, even if it was (temporarily) set at a deeper level during the
2400 matching process.
2401
2402 If there are more than 15 capturing parentheses in a pattern, PCRE2 has
2403 to obtain extra memory from the heap to store data during a recursion.
2404 If no memory can be obtained, the match fails with the
2405 PCRE2_ERROR_NOMEMORY error.
2406
2407 Do not confuse the (?R) item with the condition (R), which tests for
2408 recursion. Consider this pattern, which matches text in angle brack‐
2409 ets, allowing for arbitrary nesting. Only digits are allowed in nested
2410 brackets (that is, when recursing), whereas any characters are permit‐
2411 ted at the outer level.
2412
2413 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
2414
2415 In this pattern, (?(R) is the start of a conditional subpattern, with
2416 two different alternatives for the recursive and non-recursive cases.
2417 The (?R) item is the actual recursive call.
2418
2419 Differences in recursion processing between PCRE2 and Perl
2420
2421 Recursion processing in PCRE2 differs from Perl in two important ways.
2422 In PCRE2 (like Python, but unlike Perl), a recursive subpattern call is
2423 always treated as an atomic group. That is, once it has matched some of
2424 the subject string, it is never re-entered, even if it contains untried
2425 alternatives and there is a subsequent matching failure. This can be
2426 illustrated by the following pattern, which purports to match a palin‐
2427 dromic string that contains an odd number of characters (for example,
2428 "a", "aba", "abcba", "abcdcba"):
2429
2430 ^(.|(.)(?1)\2)$
2431
2432 The idea is that it either matches a single character, or two identical
2433 characters surrounding a sub-palindrome. In Perl, this pattern works;
2434 in PCRE2 it does not if the subject is longer than three characters.
2435 Consider the subject string "abcba":
2436
2437 At the top level, the first character is matched, but as it is not at
2438 the end of the string, the first alternative fails; the second alterna‐
2439 tive is taken and the recursion kicks in. The recursive call to subpat‐
2440 tern 1 successfully matches the next character ("b"). (Note that the
2441 beginning and end of line tests are not part of the recursion).
2442
2443 Back at the top level, the next character ("c") is compared with what
2444 subpattern 2 matched, which was "a". This fails. Because the recursion
2445 is treated as an atomic group, there are now no backtracking points,
2446 and so the entire match fails. (Perl is able, at this point, to re-
2447 enter the recursion and try the second alternative.) However, if the
2448 pattern is written with the alternatives in the other order, things are
2449 different:
2450
2451 ^((.)(?1)\2|.)$
2452
2453 This time, the recursing alternative is tried first, and continues to
2454 recurse until it runs out of characters, at which point the recursion
2455 fails. But this time we do have another alternative to try at the
2456 higher level. That is the big difference: in the previous case the
2457 remaining alternative is at a deeper recursion level, which PCRE2 can‐
2458 not use.
2459
2460 To change the pattern so that it matches all palindromic strings, not
2461 just those with an odd number of characters, it is tempting to change
2462 the pattern to this:
2463
2464 ^((.)(?1)\2|.?)$
2465
2466 Again, this works in Perl, but not in PCRE2, and for the same reason.
2467 When a deeper recursion has matched a single character, it cannot be
2468 entered again in order to match an empty string. The solution is to
2469 separate the two cases, and write out the odd and even cases as alter‐
2470 natives at the higher level:
2471
2472 ^(?:((.)(?1)\2|)|((.)(?3)\4|.))$
2473
2474 If you want to match typical palindromic phrases, the pattern has to
2475 ignore all non-word characters, which can be done like this:
2476
2477 ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
2478
2479 If run with the PCRE2_CASELESS option, this pattern matches phrases
2480 such as "A man, a plan, a canal: Panama!" and it works in both PCRE2
2481 and Perl. Note the use of the possessive quantifier *+ to avoid back‐
2482 tracking into sequences of non-word characters. Without this, PCRE2
2483 takes a great deal longer (ten times or more) to match typical phrases,
2484 and Perl takes so long that you think it has gone into a loop.
2485
2486 WARNING: The palindrome-matching patterns above work only if the sub‐
2487 ject string does not start with a palindrome that is shorter than the
2488 entire string. For example, although "abcba" is correctly matched, if
2489 the subject is "ababa", PCRE2 finds the palindrome "aba" at the start,
2490 then fails at top level because the end of the string does not follow.
2491 Once again, it cannot jump back into the recursion to try other alter‐
2492 natives, so the entire match fails.
2493
2494 The second way in which PCRE2 and Perl differ in their recursion pro‐
2495 cessing is in the handling of captured values. In Perl, when a subpat‐
2496 tern is called recursively or as a subroutine (see the next section),
2497 it has no access to any values that were captured outside the recur‐
2498 sion, whereas in PCRE2 these values can be referenced. Consider this
2499 pattern:
2500
2501 ^(.)(\1|a(?2))
2502
2503 In PCRE2, this pattern matches "bab". The first capturing parentheses
2504 match "b". In the second group, the back reference \1 (which is "b")
2505 fails against the next character "a", so the second alternative matches
2506 "a" and then recurses. In the recursion, \1 does match "b" and so the
2507 whole match succeeds. In Perl, the pattern fails to match because inside
2508 the recursive call \1 cannot access the externally set value.
2509
2511
2512 If the syntax for a recursive subpattern call (either by number or by
2513 name) is used outside the parentheses to which it refers, it operates
2514 like a subroutine in a programming language. The called subpattern may
2515 be defined before or after the reference. A numbered reference can be
2516 absolute or relative, as in these examples:
2517
2518 (...(absolute)...)...(?2)...
2519 (...(relative)...)...(?-1)...
2520 (...(?+1)...(relative)...
2521
2522 An earlier example pointed out that the pattern
2523
2524 (sens|respons)e and \1ibility
2525
2526 matches "sense and sensibility" and "response and responsibility", but
2527 not "sense and responsibility". If instead the pattern
2528
2529 (sens|respons)e and (?1)ibility
2530
2531 is used, it does match "sense and responsibility" as well as the other
2532 two strings. Another example is given in the discussion of DEFINE
2533 above.
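
The difference between a back reference and a subroutine call can be
approximated in Python. Python's re module has no (?1) syntax, so in
this sketch the group's pattern is simply repeated by hand, which is
assumed to be equivalent for this example only. The variable names are
hypothetical.

```python
import re

# \1 must match the same text that group 1 matched.
back_reference = re.compile(r'(sens|respons)e and \1ibility')

# Stand-in for (?1): the group's pattern is matched afresh.
pattern_reuse = re.compile(r'(sens|respons)e and (?:sens|respons)ibility')

subject = "sense and responsibility"
print(bool(back_reference.fullmatch(subject)))
print(bool(pattern_reuse.fullmatch(subject)))
```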
2534
2535 All subroutine calls, whether recursive or not, are always treated as
2536 atomic groups. That is, once a subroutine has matched some of the sub‐
2537 ject string, it is never re-entered, even if it contains untried alter‐
2538 natives and there is a subsequent matching failure. Any capturing
2539 parentheses that are set during the subroutine call revert to their
2540 previous values afterwards.
2541
2542 Processing options such as case-independence are fixed when a subpat‐
2543 tern is defined, so if it is used as a subroutine, such options cannot
2544 be changed for different calls. For example, consider this pattern:
2545
2546 (abc)(?i:(?-1))
2547
2548 It matches "abcabc". It does not match "abcABC" because the change of
2549 processing option does not affect the called subpattern.
2550
2552
2553 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
2554 name or a number enclosed either in angle brackets or single quotes is
2555 an alternative syntax for referencing a subpattern as a subroutine,
2556 possibly recursively. Here are two of the examples used above, rewrit‐
2557 ten using this syntax:
2558
2559 (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
2560 (sens|respons)e and \g'1'ibility
2561
2562 PCRE2 supports an extension to Oniguruma: if a number is preceded by a
2563 plus or a minus sign it is taken as a relative reference. For example:
2564
2565 (abc)(?i:\g<-1>)
2566
2567 Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
2568 synonymous. The former is a back reference; the latter is a subroutine
2569 call.
2570
2572
2573 Perl has a feature whereby using the sequence (?{...}) causes arbitrary
2574 Perl code to be obeyed in the middle of matching a regular expression.
2575 This makes it possible, amongst other things, to extract different sub‐
2576 strings that match the same pair of parentheses when there is a repeti‐
2577 tion.
2578
2579 PCRE2 provides a similar feature, but of course it cannot obey arbi‐
2580 trary Perl code. The feature is called "callout". The caller of PCRE2
2581 provides an external function by putting its entry point in a match
2582 context using the function pcre2_set_callout(), and then passing that
2583 context to pcre2_match() or pcre2_dfa_match(). If no match context is
2584 passed, or if the callout entry point is set to NULL, callouts are dis‐
2585 abled.
2586
2587 Within a regular expression, (?C<arg>) indicates a point at which the
2588 external function is to be called. There are two kinds of callout:
2589 those with a numerical argument and those with a string argument. (?C)
2590 on its own with no argument is treated as (?C0). A numerical argument
2591 allows the application to distinguish between different callouts.
2592 String arguments were added for release 10.20 to make it possible for
2593 script languages that use PCRE2 to embed short scripts within patterns
2594 in a similar way to Perl.
2595
2596 During matching, when PCRE2 reaches a callout point, the external func‐
2597 tion is called. It is provided with the number or string argument of
2598 the callout, the position in the pattern, and one item of data that is
2599 also set in the match block. The callout function may cause matching to
2600 proceed, to backtrack, or to fail.
2601
2602 By default, PCRE2 implements a number of optimizations at matching
2603 time, and one side-effect is that sometimes callouts are skipped. If
2604 you need all possible callouts to happen, you need to set options that
2605 disable the relevant optimizations. More details, including a complete
2606 description of the programming interface to the callout function, are
2607 given in the pcre2callout documentation.
2608
2609 Callouts with numerical arguments
2610
2611 If you just want to have a means of identifying different callout
2612 points, put a number less than 256 after the letter C. For example,
2613 this pattern has two callout points:
2614
2615 (?C1)abc(?C2)def
2616
2617 If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical
2618 callouts are automatically installed before each item in the pattern.
2619 They are all numbered 255. If there is a conditional group in the pat‐
2620 tern whose condition is an assertion, an additional callout is inserted
2621 just before the condition. An explicit callout may also be set at this
2622 position, as in this example:
2623
2624 (?(?C9)(?=a)abc|def)
2625
2626 Note that this applies only to assertion conditions, not to other types
2627 of condition.
2628
2629 Callouts with string arguments
2630
2631 A delimited string may be used instead of a number as a callout argu‐
2632 ment. The starting delimiter must be one of ` ' " ^ % # $ { and the
2633 ending delimiter is the same as the start, except for {, where the end‐
2634 ing delimiter is }. If the ending delimiter is needed within the
2635 string, it must be doubled. For example:
2636
2637 (?C'ab ''c'' d')xyz(?C{any text})pqr
2638
2639 The doubling is removed before the string is passed to the callout
2640 function.
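
The quoting rule can be sketched as a small Python parser. This only
illustrates the rule as described here and is not PCRE2's code; the
names DELIMITERS and parse_callout_string() are hypothetical, and the
input is assumed to be the text that follows "(?C".

```python
# Starting delimiter -> ending delimiter.
DELIMITERS = {"`": "`", "'": "'", '"': '"', "^": "^",
              "%": "%", "#": "#", "$": "$", "{": "}"}

def parse_callout_string(text):
    """Return (argument, characters consumed) for a delimited string."""
    if not text or text[0] not in DELIMITERS:
        raise ValueError("not a string callout argument")
    end = DELIMITERS[text[0]]
    out = []
    i = 1
    while i < len(text):
        if text[i] == end:
            if i + 1 < len(text) and text[i + 1] == end:
                out.append(end)      # doubled delimiter: keep one copy
                i += 2
                continue
            return "".join(out), i + 1
        out.append(text[i])
        i += 1
    raise ValueError("missing ending delimiter")
```

Applied to the text 'ab ''c'' d')xyz, the function returns the string
ab 'c' d with the doubling removed.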
2641
2643
2644 Perl 5.10 introduced a number of "Special Backtracking Control Verbs",
2645 which are still described in the Perl documentation as "experimental
2646 and subject to change or removal in a future version of Perl". It goes
2647 on to say: "Their usage in production code should be noted to avoid
2648 problems during upgrades." The same remarks apply to the PCRE2 features
2649 described in this section.
2650
2651 The new verbs make use of what was previously invalid syntax: an open‐
2652 ing parenthesis followed by an asterisk. They are generally of the form
2653 (*VERB) or (*VERB:NAME). Some verbs take either form, possibly behaving
2654 differently depending on whether or not a name is present.
2655
2656 By default, for compatibility with Perl, a name is any sequence of
2657 characters that does not include a closing parenthesis. The name is not
2658 processed in any way, and it is not possible to include a closing
2659 parenthesis in the name. This can be changed by setting the
2660 PCRE2_ALT_VERBNAMES option, but the result is no longer Perl-compati‐
2661 ble.
2662
2663 When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to
2664 verb names and only an unescaped closing parenthesis terminates the
2665 name. However, the only backslash items that are permitted are \Q, \E,
2666 and sequences such as \x{100} that define character code points. Char‐
2667 acter type escapes such as \d are faulted.
2668
2669 A closing parenthesis can be included in a name either as \) or between
2670 \Q and \E. In addition to backslash processing, if the PCRE2_EXTENDED
2671 option is also set, unescaped whitespace in verb names is skipped, and
2672 #-comments are recognized, exactly as in the rest of the pattern.
2673 PCRE2_EXTENDED does not affect verb names unless PCRE2_ALT_VERBNAMES is
2674 also set.
2675
2676 The maximum length of a name is 255 in the 8-bit library and 65535 in
2677 the 16-bit and 32-bit libraries. If the name is empty, that is, if the
2678 closing parenthesis immediately follows the colon, the effect is as if
2679 the colon were not there. Any number of these verbs may occur in a pat‐
2680 tern.
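
As a rough illustration of the default (Perl-compatible) naming rules,
the following Python sketch splits a verb item into its verb and name.
It is not PCRE2's parser: the function parse_verb is hypothetical, it
does not check which verbs require or forbid a name, and it counts the
name length in characters, which equals code units only for ASCII names
in the 8-bit library.

```python
import re

# Default rule: the name is everything up to the first ")".
VERB_RE = re.compile(r"\(\*([A-Z_]*)(?::([^)]*))?\)")
MAX_NAME_LEN_8BIT = 255  # limit quoted above for the 8-bit library

def parse_verb(item):
    """Return (verb, name) for an item of the form (*VERB) or (*VERB:NAME)."""
    m = VERB_RE.fullmatch(item)
    if m is None:
        raise ValueError("not of the form (*VERB) or (*VERB:NAME)")
    verb, name = m.group(1), m.group(2)
    if verb == "" and name is not None:
        verb = "MARK"        # (*:NAME) is the short form of (*MARK:NAME)
    if verb == "":
        raise ValueError("missing verb")
    if not name:
        name = None          # empty name: as if the colon were not there
    elif len(name) > MAX_NAME_LEN_8BIT:
        raise ValueError("verb name too long")
    return verb, name
```

With this sketch, (*PRUNE:) and (*PRUNE) both give the verb PRUNE with
no name, matching the statement above about an empty name.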
2681
2682 Since these verbs are specifically related to backtracking, most of
2683 them can be used only when the pattern is to be matched using the tra‐
2684 ditional matching function, because it uses a backtracking algorithm.
2685 With the exception of (*FAIL), which behaves like a failing negative
2686 assertion, the backtracking control verbs cause an error if encountered
2687 by the DFA matching function.
2688
2689 The behaviour of these verbs in repeated groups, assertions, and in
2690 subpatterns called as subroutines (whether or not recursively) is docu‐
2691 mented below.
2692
2693 Optimizations that affect backtracking verbs
2694
2695 PCRE2 contains some optimizations that are used to speed up matching by
2696 running some checks at the start of each match attempt. For example, it
2697 may know the minimum length of a matching subject, or that a particular
2698 character must be present. When one of these optimizations bypasses the
2699 running of a match, any included backtracking verbs will not, of
2700 course, be processed. You can suppress the start-of-match optimizations
2701 by setting the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com‐
2702 pile(), or by starting the pattern with (*NO_START_OPT). There is more
2703 discussion of this option in the section entitled "Compiling a pattern"
2704 in the pcre2api documentation.
2705
2706 Experiments with Perl suggest that it too has similar optimizations,
2707 sometimes leading to anomalous results.
2708
2709 Verbs that act immediately
2710
2711 The following verbs act as soon as they are encountered. They may not
2712 be followed by a name.
2713
2714 (*ACCEPT)
2715
2716 This verb causes the match to end successfully, skipping the remainder
2717 of the pattern. However, when it is inside a subpattern that is called
2718 as a subroutine, only that subpattern is ended successfully. Matching
2719 then continues at the outer level. If (*ACCEPT) is triggered in a posi‐
2720 tive assertion, the assertion succeeds; in a negative assertion, the
2721 assertion fails.
2722
2723 If (*ACCEPT) is inside capturing parentheses, the data so far is cap‐
2724 tured. For example:
2725
2726 A((?:A|B(*ACCEPT)|C)D)
2727
2728 This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap‐
2729 tured by the outer parentheses.
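
Python's re module has no (*ACCEPT), but for this one pattern the same
matches and captures can be reproduced by moving the B branch out to an
alternative of the capturing group. The sketch below is an emulation
under that assumption, written for this example only; the rewritten
pattern and the helper match_and_capture are not taken from PCRE2.

```python
import re

# Emulates A((?:A|B(*ACCEPT)|C)D): after "B" the match ends at once,
# so "B" becomes an alternative that is not followed by "D".
EMULATED = re.compile(r"A((?:A|C)D|B)")

def match_and_capture(subject):
    """Return (whole match, group 1) at the start of subject, or None."""
    m = EMULATED.match(subject)
    return (m.group(0), m.group(1)) if m else None
```

For the subject "AB" this returns the match "AB" with "B" captured, and
for "AAD" it returns "AAD" with "AD" captured, as described above.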
2730
2731 (*FAIL) or (*F)
2732
2733 This verb causes a matching failure, forcing backtracking to occur. It
2734 is equivalent to (?!) but easier to read. The Perl documentation notes
2735 that it is probably useful only when combined with (?{}) or (??{}).
2736 Those are, of course, Perl features that are not present in PCRE2. The
2737 nearest equivalent is the callout feature, as for example in this pat‐
2738 tern:
2739
2740 a+(?C)(*FAIL)
2741
2742 A match with the string "aaaa" always fails, but the callout is taken
2743 before each backtrack happens (in this example, 10 times).
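
The figure of 10 can be checked with a small model of a naive
backtracking matcher. The Python sketch below is a simulation written
for this example, not PCRE2 itself, and it assumes that no
start-of-match optimization or other shortcut removes any attempt.

```python
def count_callouts(subject):
    """Count how often the callout in a+(?C)(*FAIL) would be reached."""
    callouts = 0
    for start in range(len(subject) + 1):      # unanchored: try each start
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1                           # greedy a+ takes the whole run
        # a+ then gives back one "a" at a time. Each length from run down
        # to 1 reaches the callout once before (*FAIL) forces a backtrack.
        for _length in range(run, 0, -1):
            callouts += 1
    return callouts
```

For "aaaa" the attempts starting at the four letters contribute 4, 3, 2
and 1 callouts, giving 10 in total.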
2744
2745 Recording which path was taken
2746
2747 There is one verb whose main purpose is to track how a match was
2748 arrived at, though it also has a secondary use in conjunction with
2749 advancing the match starting point (see (*SKIP) below).
2750
2751 (*MARK:NAME) or (*:NAME)
2752
2753 A name is always required with this verb. There may be as many
2754 instances of (*MARK) as you like in a pattern, and their names do not
2755 have to be unique.
2756
2757 When a match succeeds, the name of the last-encountered (*MARK:NAME),
2758 (*PRUNE:NAME), or (*THEN:NAME) on the matching path is passed back to
2759 the caller as described in the section entitled "Other information
2760 about the match" in the pcre2api documentation. Here is an example of
2761 pcre2test output, where the "mark" modifier requests the retrieval and
2762 outputting of (*MARK) data:
2763
2764 re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
2765 data> XY
2766 0: XY
2767 MK: A
2768 XZ
2769 0: XZ
2770 MK: B
2771
2772 The (*MARK) name is tagged with "MK:" in this output, and in this exam‐
2773 ple it indicates which of the two alternatives matched. This is a more
2774 efficient way of obtaining this information than putting each alterna‐
2775 tive in its own capturing parentheses.
2776
2777 If a verb with a name is encountered in a positive assertion that is
2778 true, the name is recorded and passed back if it is the last-encoun‐
2779 tered. This does not happen for negative assertions or failing positive
2780 assertions.
2781
2782 After a partial match or a failed match, the last encountered name in
2783 the entire match process is returned. For example:
2784
2785 re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
2786 data> XP
2787 No match, mark = B
2788
2789 Note that in this unanchored example the mark is retained from the
2790 match attempt that started at the letter "X" in the subject. Subsequent
2791 match attempts starting at "P" and then with an empty string do not get
2792 as far as the (*MARK) item, but nevertheless do not reset it.
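
The way the mark is carried across match attempts can be modelled with
a short Python sketch. It is a simulation of the two-alternative example
only, with each alternative reduced to literal text before and after the
mark; the function match_with_mark and its data layout are assumptions
made for illustration.

```python
# Each alternative is modelled as (text before the mark, mark, text after).
ALTERNATIVES = [("X", "A", "Y"), ("X", "B", "Z")]

def match_with_mark(subject, alternatives=ALTERNATIVES):
    """Return (matched text or None, mark reported to the caller)."""
    last_mark = None
    for start in range(len(subject) + 1):      # unanchored bumpalong
        for before, mark, after in alternatives:
            if not subject.startswith(before, start):
                continue                       # mark not reached on this path
            last_mark = mark                   # (*MARK) encountered
            pos = start + len(before)
            if subject.startswith(after, pos):
                return subject[start:pos + len(after)], mark
    return None, last_mark                     # no match: last mark seen
```

For the subject "XP" the sketch reports no match with mark B, because
the attempts starting at "P" and at the end of the subject never reach a
mark and so leave the earlier value in place.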
2793
2794 If you are interested in (*MARK) values after failed matches, you
2795 should probably set the PCRE2_NO_START_OPTIMIZE option (see above) to
2796 ensure that the match is always attempted.
2797
2798 Verbs that act after backtracking
2799
2800 The following verbs do nothing when they are encountered. Matching con‐
2801 tinues with what follows, but if there is no subsequent match, causing
2802 a backtrack to the verb, a failure is forced. That is, backtracking
2803 cannot pass to the left of the verb. However, when one of these verbs
2804 appears inside an atomic group (which includes any group that is called
2805 as a subroutine) or in an assertion that is true, its effect is con‐
2806 fined to that group, because once the group has been matched, there is
2807 never any backtracking into it. In this situation, backtracking has to
2808 jump to the left of the entire atomic group or assertion.
2809
2810 These verbs differ in exactly what kind of failure occurs when back‐
2811 tracking reaches them. The behaviour described below is what happens
2812 when the verb is not in a subroutine or an assertion. Subsequent sec‐
2813 tions cover these special cases.
2814
2815 (*COMMIT)
2816
2817 This verb, which may not be followed by a name, causes the whole match
2818 to fail outright if there is a later matching failure that causes back‐
2819 tracking to reach it. Even if the pattern is unanchored, no further
2820 attempts to find a match by advancing the starting point take place. If
2821 (*COMMIT) is the only backtracking verb that is encountered, once it
2822 has been passed pcre2_match() is committed to finding a match at the
2823 current starting point, or not at all. For example:
2824
2825 a+(*COMMIT)b
2826
2827 This matches "xxaab" but not "aacaab". It can be thought of as a kind
2828 of dynamic anchor, or "I've started, so I must finish." The name of the
2829 most recently passed (*MARK) in the path is passed back when (*COMMIT)
2830 forces a match failure.
2831
2832 If there is more than one backtracking verb in a pattern, a different
2833 one that follows (*COMMIT) may be triggered first, so merely passing
2834 (*COMMIT) during a match does not always guarantee that a match must be
2835 at this starting point.
2836
2837 Note that (*COMMIT) at the start of a pattern is not the same as an
2838 anchor, unless PCRE2's start-of-match optimizations are turned off, as
2839 shown in this output from pcre2test:
2840
2841 re> /(*COMMIT)abc/
2842 data> xyzabc
2843 0: abc
2844 data>
2845 re> /(*COMMIT)abc/no_start_optimize
2846 data> xyzabc
2847 No match
2848
2849 For the first pattern, PCRE2 knows that any match must start with "a",
2850 so the optimization skips along the subject to "a" before applying the
2851 pattern to the first set of data. The match attempt then succeeds. The
2852 second pattern disables the optimization that skips along to the first
2853 character. The pattern is now applied starting at "x", and so the
2854 (*COMMIT) causes the match to fail without trying any other starting
2855 points.
2856
2857 (*PRUNE) or (*PRUNE:NAME)
2858
2859 This verb causes the match to fail at the current starting position in
2860 the subject if there is a later matching failure that causes backtrack‐
2861 ing to reach it. If the pattern is unanchored, the normal "bumpalong"
2862 advance to the next starting character then happens. Backtracking can
2863 occur as usual to the left of (*PRUNE), before it is reached, or when
2864 matching to the right of (*PRUNE), but if there is no match to the
2865 right, backtracking cannot cross (*PRUNE). In simple cases, the use of
2866 (*PRUNE) is just an alternative to an atomic group or possessive quan‐
2867 tifier, but there are some uses of (*PRUNE) that cannot be expressed in
2868 any other way. In an anchored pattern (*PRUNE) has the same effect as
2869 (*COMMIT).
2870
2871 The behaviour of (*PRUNE:NAME) is not the same as
2872 (*MARK:NAME)(*PRUNE). It is like (*MARK:NAME) in that the name is
2873 remembered for passing back to the caller. However, (*SKIP:NAME)
2874 searches only for names set with (*MARK), ignoring those set by
2875 (*PRUNE) or (*THEN).
2876
2877 (*SKIP)
2878
2879 This verb, when given without a name, is like (*PRUNE), except that if
2880 the pattern is unanchored, the "bumpalong" advance is not to the next
2881 character, but to the position in the subject where (*SKIP) was encoun‐
2882 tered. (*SKIP) signifies that whatever text was matched leading up to
2883 it cannot be part of a successful match. Consider:
2884
2885 a+(*SKIP)b
2886
2887 If the subject is "aaaac...", after the first match attempt fails
2888 (starting at the first character in the string), the starting point
2889 skips on to start the next attempt at "c". Note that a possessive quan‐
2890 tifier does not have the same effect as this example; although it would
2891 suppress backtracking during the first match attempt, the second
2892 attempt would start at the second character instead of skipping on to
2893 "c".
2894
2895 (*SKIP:NAME)
2896
2897 When (*SKIP) has an associated name, its behaviour is modified. When it
2898 is triggered, the previous path through the pattern is searched for the
2899 most recent (*MARK) that has the same name. If one is found, the
2900 "bumpalong" advance is to the subject position that corresponds to that
2901 (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with
2902 a matching name is found, the (*SKIP) is ignored.
2903
2904 Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It
2905 ignores names that are set by (*PRUNE:NAME) or (*THEN:NAME).
2906
2907 (*THEN) or (*THEN:NAME)
2908
2909 This verb causes a skip to the next innermost alternative when back‐
2910 tracking reaches it. That is, it cancels any further backtracking
2911 within the current alternative. Its name comes from the observation
2912 that it can be used for a pattern-based if-then-else block:
2913
2914 ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
2915
2916 If the COND1 pattern matches, FOO is tried (and possibly further items
2917 after the end of the group if FOO succeeds); on failure, the matcher
2918 skips to the second alternative and tries COND2, without backtracking
2919 into COND1. If that succeeds and BAR fails, COND3 is tried. If subse‐
2920 quently BAZ fails, there are no more alternatives, so there is a back‐
2921 track to whatever came before the entire group. If (*THEN) is not
2922 inside an alternation, it acts like (*PRUNE).
2923
2924 The behaviour of (*THEN:NAME) is not the same as
2925 (*MARK:NAME)(*THEN). It is like (*MARK:NAME) in that the name is
2926 remembered for passing back to the caller. However, (*SKIP:NAME)
2927 searches only for names set with (*MARK), ignoring those set by
2928 (*PRUNE) and (*THEN).
2929
2930 A subpattern that does not contain a | character is just a part of the
2931 enclosing alternative; it is not a nested alternation with only one
2932 alternative. The effect of (*THEN) extends beyond such a subpattern to
2933 the enclosing alternative. Consider this pattern, where A, B, etc. are
2934 complex pattern fragments that do not contain any | characters at this
2935 level:
2936
2937 A (B(*THEN)C) | D
2938
2939 If A and B are matched, but there is a failure in C, matching does not
2940 backtrack into A; instead it moves to the next alternative, that is, D.
2941 However, if the subpattern containing (*THEN) is given an alternative,
2942 it behaves differently:
2943
2944 A (B(*THEN)C | (*FAIL)) | D
2945
2946 The effect of (*THEN) is now confined to the inner subpattern. After a
2947 failure in C, matching moves to (*FAIL), which causes the whole subpat‐
2948 tern to fail because there are no more alternatives to try. In this
2949 case, matching does now backtrack into A.
2950
2951 Note that a conditional subpattern is not considered as having two
2952 alternatives, because only one is ever used. In other words, the |
2953 character in a conditional subpattern has a different meaning. Ignoring
2954 white space, consider:
2955
2956 ^.*? (?(?=a) a | b(*THEN)c )
2957
2958 If the subject is "ba", this pattern does not match. Because .*? is
2959 ungreedy, it initially matches zero characters. The condition (?=a)
2960 then fails, the character "b" is matched, but "c" is not. At this
2961 point, matching does not backtrack to .*? as might perhaps be expected
2962 from the presence of the | character. The conditional subpattern is
2963 part of the single alternative that comprises the whole pattern, and so
2964 the match fails. (If there were a backtrack into .*?, allowing it to
2965 match "b", the match would succeed.)
2966
2967 The verbs just described provide four different "strengths" of control
2968 when subsequent matching fails. (*THEN) is the weakest, carrying on the
2969 match at the next alternative. (*PRUNE) comes next, failing the match
2970 at the current starting position, but allowing an advance to the next
2971 character (for an unanchored pattern). (*SKIP) is similar, except that
2972 the advance may be more than one character. (*COMMIT) is the strongest,
2973 causing the entire match to fail.
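
The difference between three of these strengths shows up in which
starting positions are tried. The Python sketch below models the single
pattern shape a+b with an optional verb between a+ and b, as in the
(*COMMIT) and (*SKIP) examples above. It is a simulation, not PCRE2: the
function simulate is hypothetical, (*THEN) is not modelled because this
pattern has no alternation, and start-of-match optimizations are assumed
to be off.

```python
def simulate(subject, verb=None):
    """Model a+b with an optional verb between a+ and b.

    verb is None, "PRUNE", "SKIP" or "COMMIT".
    Returns (matched text or None, list of starting positions tried).
    """
    tried = []
    start = 0
    while start <= len(subject):
        tried.append(start)
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1                       # greedy a+
        next_start = start + 1             # normal bumpalong
        if end > start:                    # a+ matched, so the verb is passed
            if subject.startswith("b", end):
                return subject[start:end + 1], tried
            # "b" failed. Without a verb, a+ would give back characters,
            # but no shorter run of "a" is followed by "b" either.
            if verb == "COMMIT":
                return None, tried         # no further starting points
            if verb == "SKIP":
                next_start = end           # resume where (*SKIP) was passed
        start = next_start
    return None, tried
```

For the subject "aaaac", the model tries positions 0 to 5 with no verb
or with PRUNE, but only 0, 4 and 5 with SKIP. For "aacaab", COMMIT stops
after position 0 with no match, whereas PRUNE goes on to find "aab".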
2974
2975 More than one backtracking verb
2976
2977 If more than one backtracking verb is present in a pattern, the one
2978 that is backtracked onto first acts. For example, consider this pat‐
2979 tern, where A, B, etc. are complex pattern fragments:
2980
2981 (A(*COMMIT)B(*THEN)C|ABD)
2982
2983 If A matches but B fails, the backtrack to (*COMMIT) causes the entire
2984 match to fail. However, if A and B match, but C fails, the backtrack to
2985 (*THEN) causes the next alternative (ABD) to be tried. This behaviour
2986 is consistent, but is not always the same as Perl's. It means that if
2987 two or more backtracking verbs appear in succession, all but the last
2988 of them have no effect. Consider this example:
2989
2990 ...(*COMMIT)(*PRUNE)...
2991
2992 If there is a matching failure to the right, backtracking onto (*PRUNE)
2993 causes it to be triggered, and its action is taken. There can never be
2994 a backtrack onto (*COMMIT).
2995
2996 Backtracking verbs in repeated groups
2997
2998 PCRE2 differs from Perl in its handling of backtracking verbs in
2999 repeated groups. For example, consider:
3000
3001 /(a(*COMMIT)b)+ac/
3002
3003 If the subject is "abac", Perl matches, but PCRE2 fails because the
3004 (*COMMIT) in the second repeat of the group acts.
3005
3006 Backtracking verbs in assertions
3007
3008 (*FAIL) in an assertion has its normal effect: it forces an immediate
3009 backtrack.
3010
3011 (*ACCEPT) in a positive assertion causes the assertion to succeed with‐
3012 out any further processing. In a negative assertion, (*ACCEPT) causes
3013 the assertion to fail without any further processing.
3014
3015 The other backtracking verbs are not treated specially if they appear
3016 in a positive assertion. In particular, (*THEN) skips to the next
3017 alternative in the innermost enclosing group that has alternations,
3018 whether or not this is within the assertion.
3019
3020 Negative assertions are, however, different, in order to ensure that
3021 changing a positive assertion into a negative assertion changes its
3022 result. Backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes a neg‐
3023 ative assertion to be true, without considering any further alternative
3024 branches in the assertion. Backtracking into (*THEN) causes it to skip
3025 to the next enclosing alternative within the assertion (the normal be‐
3026 haviour), but if the assertion does not have such an alternative,
3027 (*THEN) behaves like (*PRUNE).
3028
3029 Backtracking verbs in subroutines
3030
3031 These behaviours occur whether or not the subpattern is called recur‐
3032 sively. Perl's treatment of subroutines is different in some cases.
3033
3034 (*FAIL) in a subpattern called as a subroutine has its normal effect:
3035 it forces an immediate backtrack.
3036
3037 (*ACCEPT) in a subpattern called as a subroutine causes the subroutine
3038 match to succeed without any further processing. Matching then contin‐
3039 ues after the subroutine call.
3040
3041 (*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as a subroutine
3042 cause the subroutine match to fail.
3043
3044 (*THEN) skips to the next alternative in the innermost enclosing group
3045 within the subpattern that has alternatives. If there is no such group
3046 within the subpattern, (*THEN) causes the subroutine match to fail.
3047
3048 SEE ALSO
3049
3050 pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3),
3051 pcre2(3).
3052
3053 AUTHOR
3054
3055 Philip Hazel
3056 University Computing Service
3057 Cambridge, England.
3058
3059 REVISION
3060
3061 Last updated: 27 December 2016
3062 Copyright (c) 1997-2016 University of Cambridge.
3063
3064
3065
3066PCRE2 10.23 27 December 2016 PCRE2PATTERN(3)