1PCRE2PATTERN(3) Library Functions Manual PCRE2PATTERN(3)
2
3
4
6 PCRE2 - Perl-compatible regular expressions (revised API)
7
9
10 The syntax and semantics of the regular expressions that are supported
11 by PCRE2 are described in detail below. There is a quick-reference syn‐
12 tax summary in the pcre2syntax page. PCRE2 tries to match Perl syntax
13 and semantics as closely as it can. PCRE2 also supports some alterna‐
14 tive regular expression syntax (which does not conflict with the Perl
15 syntax) in order to provide some compatibility with regular expressions
16 in Python, .NET, and Oniguruma.
17
18 Perl's regular expressions are described in its own documentation, and
19 regular expressions in general are covered in a number of books, some
20 of which have copious examples. Jeffrey Friedl's "Mastering Regular
21 Expressions", published by O'Reilly, covers regular expressions in
22 great detail. This description of PCRE2's regular expressions is
23 intended as reference material.
24
25 This document discusses the regular expression patterns that are sup‐
26 ported by PCRE2 when its main matching function, pcre2_match(), is
27 used. PCRE2 also has an alternative matching function,
28 pcre2_dfa_match(), which matches using a different algorithm that is
29 not Perl-compatible. Some of the features discussed below are not
30 available when DFA matching is used. The advantages and disadvantages
31 of the alternative function, and how it differs from the normal func‐
32 tion, are discussed in the pcre2matching page.
33
35
36 A number of options that can be passed to pcre2_compile() can also be
37 set by special items at the start of a pattern. These are not Perl-com‐
38 patible, but are provided to make these options accessible to pattern
39 writers who are not able to change the program that processes the pat‐
40 tern. Any number of these items may appear, but they must all be
41 together right at the start of the pattern string, and the letters must
42 be in upper case.
43
44 UTF support
45
46 In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
47 as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32
48 can be specified for the 32-bit library, in which case it constrains
49 the character values to valid Unicode code points. To process UTF
50 strings, PCRE2 must be built to include Unicode support (which is the
51 default). When using UTF strings you must either call the compiling
52 function with one or both of the PCRE2_UTF or PCRE2_MATCH_INVALID_UTF
53 options, or the pattern must start with the special sequence (*UTF),
54 which is equivalent to setting the PCRE2_UTF option. How setting a
55 UTF mode affects pattern matching is mentioned in several places below.
56 There is also a summary of features in the pcre2unicode page.
57
58 Some applications that allow their users to supply patterns may wish to
59 restrict them to non-UTF data for security reasons. If the
60 PCRE2_NEVER_UTF option is passed to pcre2_compile(), (*UTF) is not
61 allowed, and its appearance in a pattern causes an error.
62
63 Unicode property support
64
65 Another special sequence that may appear at the start of a pattern is
66 (*UCP). This has the same effect as setting the PCRE2_UCP option: it
67 causes sequences such as \d and \w to use Unicode properties to deter‐
68 mine character types, instead of recognizing only characters with codes
69 less than 256 via a lookup table. It also causes upper/lower casing
70 operations to use Unicode properties for characters with code points
71 greater than 127, even when UTF is not set.
72
73 Some applications that allow their users to supply patterns may wish to
74 restrict them for security reasons. If the PCRE2_NEVER_UCP option is
75 passed to pcre2_compile(), (*UCP) is not allowed, and its appearance in
76 a pattern causes an error.
77
78 Locking out empty string matching
79
80 Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same
81 effect as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option
82 to whichever matching function is subsequently called to match the pat‐
83 tern. These options lock out the matching of empty strings, either
84 entirely, or only at the start of the subject.
85
86 Disabling auto-possessification
87
88 If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as
89 setting the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from making
90 quantifiers possessive when what follows cannot match the repeated
91 item. For example, by default a+b is treated as a++b. For more details,
92 see the pcre2api documentation.
93
94 Disabling start-up optimizations
95
96 If a pattern starts with (*NO_START_OPT), it has the same effect as
97 setting the PCRE2_NO_START_OPTIMIZE option. This disables several opti‐
98 mizations for quickly reaching "no match" results. For more details,
99 see the pcre2api documentation.
100
101 Disabling automatic anchoring
102
103 If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect
104 as setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimiza‐
105 tions that apply to patterns whose top-level branches all start with .*
106 (match any number of arbitrary characters). For more details, see the
107 pcre2api documentation.
108
109 Disabling JIT compilation
110
111 If a pattern that starts with (*NO_JIT) is successfully compiled, an
112 attempt by the application to apply the JIT optimization by calling
113 pcre2_jit_compile() is ignored.
114
115 Setting match resource limits
116
117 The pcre2_match() function contains a counter that is incremented every
118 time it goes round its main loop. The caller of pcre2_match() can set a
119 limit on this counter, which therefore limits the amount of computing
120 resource used for a match. The maximum depth of nested backtracking can
121 also be limited; this indirectly restricts the amount of heap memory
122 that is used, but there is also an explicit memory limit that can be
123 set.
124
125 These facilities are provided to catch runaway matches that are pro‐
126 voked by patterns with huge matching trees. A common example is a pat‐
127 tern with nested unlimited repeats applied to a long string that does
128 not match. When one of these limits is reached, pcre2_match() gives an
129 error return. The limits can also be set by items at the start of the
130 pattern of the form
131
132 (*LIMIT_HEAP=d)
133 (*LIMIT_MATCH=d)
134 (*LIMIT_DEPTH=d)
135
136 where d is any number of decimal digits. However, the value of the set‐
137 ting must be less than the value set (or defaulted) by the caller of
138 pcre2_match() for it to have any effect. In other words, the pattern
139 writer can lower the limits set by the programmer, but not raise them.
140 If there is more than one setting of one of these limits, the lower
141 value is used. The heap limit is specified in kibibytes (units of 1024
142 bytes).
143
144 Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This
145 name is still recognized for backwards compatibility.
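
The lowering rule described above can be modelled in a few lines. The following Python sketch is not part of PCRE2 and does not call it: the function name effective_limits, the dictionary of caller limits, and the example values are assumptions made for illustration, and the parser recognizes only the start-of-pattern items named in this document.

```python
import re

# One leading item of the form (*NAME) or (*NAME=digits).
_ITEM = re.compile(r"\(\*([A-Z_]+)(?:=([0-9]+))?\)")

# Start-of-pattern items named in this document that take no value.
_OTHER_ITEMS = {
    "UTF", "UCP", "NOTEMPTY", "NOTEMPTY_ATSTART", "NO_AUTO_POSSESS",
    "NO_START_OPT", "NO_DOTSTAR_ANCHOR", "NO_JIT",
    "CR", "LF", "CRLF", "ANYCRLF", "ANY", "NUL",
    "BSR_ANYCRLF", "BSR_UNICODE",
}

# LIMIT_RECURSION is the pre-10.30 name for LIMIT_DEPTH.
_ALIASES = {"LIMIT_RECURSION": "LIMIT_DEPTH"}
_LIMITS = {"LIMIT_HEAP", "LIMIT_MATCH", "LIMIT_DEPTH"}


def effective_limits(pattern, caller_limits):
    """Model of the rule: a pattern may lower a caller's limit, never raise it.

    caller_limits must hold all three LIMIT_* keys (set or defaulted by
    the caller). LIMIT_HEAP values are in kibibytes.
    """
    limits = dict(caller_limits)
    pos = 0
    while True:
        item = _ITEM.match(pattern, pos)
        if item is None:
            break
        name = _ALIASES.get(item.group(1), item.group(1))
        value = item.group(2)
        if name in _LIMITS and value is not None:
            # Repeated settings: the lowest value seen so far is kept.
            limits[name] = min(limits[name], int(value))
        elif not (name in _OTHER_ITEMS and value is None):
            break  # not a start-of-pattern item: the pattern proper begins
        pos = item.end()
    return limits
```

With caller limits of 20000 (heap), 10000000 (match), and 10000000 (depth), a pattern starting with (*LIMIT_MATCH=5000)(*LIMIT_MATCH=900)(*LIMIT_HEAP=99999999) leaves the heap limit at 20000 and reduces the match limit to 900.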
146
147 The heap limit applies only when the pcre2_match() or pcre2_dfa_match()
148 interpreters are used for matching. It does not apply to JIT. The match
149 limit is used (but in a different way) when JIT is being used, or when
150 pcre2_dfa_match() is called, to limit computing resource usage by those
151 matching functions. The depth limit is ignored by JIT but is relevant
152 for DFA matching, which uses function recursion for recursions within
153 the pattern and for lookaround assertions and atomic groups. In this
154 case, the depth limit controls the depth of such recursion.
155
156 Newline conventions
157
158 PCRE2 supports six different conventions for indicating line breaks in
159 strings: a single CR (carriage return) character, a single LF (line‐
160 feed) character, the two-character sequence CRLF, any of the three pre‐
161 ceding, any Unicode newline sequence, or the NUL character (binary
162 zero). The pcre2api page has further discussion about newlines, and
163 shows how to set the newline convention when calling pcre2_compile().
164
165 It is also possible to specify a newline convention by starting a pat‐
166 tern string with one of the following sequences:
167
168 (*CR) carriage return
169 (*LF) linefeed
170 (*CRLF) carriage return, followed by linefeed
171 (*ANYCRLF) any of the three above
172 (*ANY) all Unicode newline sequences
173 (*NUL) the NUL character (binary zero)
174
175 These override the default and the options given to the compiling func‐
176 tion. For example, on a Unix system where LF is the default newline
177 sequence, the pattern
178
179 (*CR)a.b
180
181 changes the convention to CR. That pattern matches "a\nb" because LF is
182 no longer a newline. If more than one of these settings is present, the
183 last one is used.
184
185 The newline convention affects where the circumflex and dollar asser‐
186 tions are true. It also affects the interpretation of the dot metachar‐
187 acter when PCRE2_DOTALL is not set, and the behaviour of \N when not
188 followed by an opening brace. However, it does not affect what the \R
189 escape sequence matches. By default, this is any Unicode newline
190 sequence, for Perl compatibility. However, this can be changed; see the
191 next section and the description of \R in the section entitled "Newline
192 sequences" below. A change of \R setting can be combined with a change
193 of newline convention.
194
195 Specifying what \R matches
196
197 It is possible to restrict \R to match only CR, LF, or CRLF (instead of
198 the complete set of Unicode line endings) by setting the option
199 PCRE2_BSR_ANYCRLF at compile time. This effect can also be achieved by
200 starting a pattern with (*BSR_ANYCRLF). For completeness, (*BSR_UNI‐
201 CODE) is also recognized, corresponding to PCRE2_BSR_UNICODE.
202
204
205 PCRE2 can be compiled to run in an environment that uses EBCDIC as its
206 character code instead of ASCII or Unicode (typically a mainframe sys‐
207 tem). In the sections below, character code values are ASCII or Uni‐
208 code; in an EBCDIC environment these characters may have different code
209 values, and there are no code points greater than 255.
210
212
213 A regular expression is a pattern that is matched against a subject
214 string from left to right. Most characters stand for themselves in a
215 pattern, and match the corresponding characters in the subject. As a
216 trivial example, the pattern
217
218 The quick brown fox
219
220 matches a portion of a subject string that is identical to itself. When
221 caseless matching is specified (the PCRE2_CASELESS option or (?i)
222 within the pattern), letters are matched independently of case. Note
223 that there are two ASCII characters, K and S, that, in addition to
224 their lower case ASCII equivalents, are case-equivalent with Unicode
225 U+212A (Kelvin sign) and U+017F (long S) respectively when either
226 PCRE2_UTF or PCRE2_UCP is set.
227
228 The power of regular expressions comes from the ability to include wild
229 cards, character classes, alternatives, and repetitions in the pattern.
230 These are encoded in the pattern by the use of metacharacters, which do
231 not stand for themselves but instead are interpreted in some special
232 way.
233
234 There are two different sets of metacharacters: those that are recog‐
235 nized anywhere in the pattern except within square brackets, and those
236 that are recognized within square brackets. Outside square brackets,
237 the metacharacters are as follows:
238
239 \ general escape character with several uses
240 ^ assert start of string (or line, in multiline mode)
241 $ assert end of string (or line, in multiline mode)
242 . match any character except newline (by default)
243 [ start character class definition
244 | start of alternative branch
245 ( start group or control verb
246 ) end group or control verb
247 * 0 or more quantifier
248 + 1 or more quantifier; also "possessive quantifier"
249 ? 0 or 1 quantifier; also quantifier minimizer
250 { start min/max quantifier
251
252 Part of a pattern that is in square brackets is called a "character
253 class". In a character class the only metacharacters are:
254
255 \ general escape character
256 ^ negate the class, but only if the first character
257 - indicates character range
258 [ POSIX character class (if followed by POSIX syntax)
259 ] terminates the character class
260
261 If a pattern is compiled with the PCRE2_EXTENDED option, most white
262 space in the pattern, other than in a character class, and characters
263 between a # outside a character class and the next newline, inclusive,
264 are ignored. An escaping backslash can be used to include a white space
265 or a # character as part of the pattern. If the PCRE2_EXTENDED_MORE
266 option is set, the same applies, but in addition unescaped space and
267 horizontal tab characters are ignored inside a character class. Note:
268 only these two characters are ignored, not the full set of pattern
269 white space characters that are ignored outside a character class.
270 Option settings can be changed within a pattern; see the section enti‐
271 tled "Internal Option Setting" below.
272
273 The following sections describe the use of each of the metacharacters.
274
276
277 The backslash character has several uses. Firstly, if it is followed by
278 a character that is not a digit or a letter, it takes away any special
279 meaning that character may have. This use of backslash as an escape
280 character applies both inside and outside character classes.
281
282 For example, if you want to match a * character, you must write \* in
283 the pattern. This escaping action applies whether or not the following
284 character would otherwise be interpreted as a metacharacter, so it is
285 always safe to precede a non-alphanumeric with backslash to specify
286 that it stands for itself. In particular, if you want to match a back‐
287 slash, you write \\.
288
289 Only ASCII digits and letters have any special meaning after a back‐
290 slash. All other characters (in particular, those whose code points are
291 greater than 127) are treated as literals.
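
Because a backslash before any character other than an ASCII letter or digit yields that character literally, a program can quote arbitrary text for use in a pattern mechanically. The following Python sketch illustrates the idea; quote_literal is a hypothetical helper, not a PCRE2 function, and it assumes the text is inserted outside any \Q...\E sequence.

```python
import string

# Only ASCII letters and digits acquire a special meaning after a backslash,
# so every other character can safely be preceded by one.
_ALNUM = set(string.ascii_letters + string.digits)


def quote_literal(text):
    """Return text with every character except ASCII letters/digits escaped."""
    return "".join(ch if ch in _ALNUM else "\\" + ch for ch in text)
```

For example, quote_literal("1+1=2") returns the pattern text 1\+1\=2, and a single backslash becomes \\.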
292
293 If you want to treat all characters in a sequence as literals, you can
294 do so by putting them between \Q and \E. This is different from Perl in
295 that $ and @ are handled as literals in \Q...\E sequences in PCRE2,
296 whereas in Perl, $ and @ cause variable interpolation. Also, Perl does
297 "double-quotish backslash interpolation" on any backslashes between \Q
298 and \E which, its documentation says, "may lead to confusing results".
299 PCRE2 treats a backslash between \Q and \E just like any other charac‐
300 ter. Note the following examples:
301
302 Pattern            PCRE2 matches   Perl matches
303
304 \Qabc$xyz\E        abc$xyz         abc followed by the
305                                    contents of $xyz
306 \Qabc\$xyz\E       abc\$xyz        abc\$xyz
307 \Qabc\E\$\Qxyz\E   abc$xyz         abc$xyz
308 \QA\B\E            A\B             A\B
309 \Q\\E              \               \\E
310
311 The \Q...\E sequence is recognized both inside and outside character
312 classes. An isolated \E that is not preceded by \Q is ignored. If \Q
313 is not followed by \E later in the pattern, the literal interpretation
314 continues to the end of the pattern (that is, \E is assumed at the
315 end). If the isolated \Q is inside a character class, this causes an
316 error, because the character class is not terminated by a closing
317 square bracket.
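
The PCRE2 column of the table and the rules for isolated \Q and \E can be checked against a small model. The Python sketch below rewrites a pattern so that each \Q...\E sequence is replaced by individually escaped characters. The name strip_quoting is an assumption, and the model deliberately ignores character classes and all other pattern syntax.

```python
def strip_quoting(pattern):
    """Rewrite \\Q...\\E sequences as backslash-escaped literals.

    Models three rules: text between \\Q and \\E is literal (including
    backslashes), an isolated \\E is ignored, and an unterminated \\Q
    runs to the end of the pattern.
    """
    out = []
    i, n = 0, len(pattern)
    quoting = False
    while i < n:
        two = pattern[i:i + 2]
        if quoting:
            if two == "\\E":
                quoting = False
                i += 2
            else:
                ch = pattern[i]
                # Letters and digits need no escape; everything else gets one.
                out.append(ch if ch.isalnum() else "\\" + ch)
                i += 1
        elif two == "\\Q":
            quoting = True
            i += 2
        elif two == "\\E":
            i += 2  # isolated \E: ignored
        elif pattern[i] == "\\" and i + 1 < n:
            out.append(two)  # ordinary escape pair, copied unchanged
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)
```

Applied to the table rows, \Qabc$xyz\E becomes abc\$xyz and \Q\\E becomes \\, which matches a single backslash.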
318
319 Non-printing characters
320
321 A second use of backslash provides a way of encoding non-printing char‐
322 acters in patterns in a visible manner. There is no restriction on the
323 appearance of non-printing characters in a pattern, but when a pattern
324 is being prepared by text editing, it is often easier to use one of the
325 following escape sequences instead of the binary character it repre‐
326 sents. In an ASCII or Unicode environment, these escapes are as fol‐
327 lows:
328
329 \a            alarm, that is, the BEL character (hex 07)
330 \cx           "control-x", where x is any printable ASCII character
331 \e            escape (hex 1B)
332 \f            form feed (hex 0C)
333 \n            linefeed (hex 0A)
334 \r            carriage return (hex 0D) (but see below)
335 \t            tab (hex 09)
336 \0dd          character with octal code 0dd
337 \ddd          character with octal code ddd, or backreference
338 \o{ddd..}     character with octal code ddd..
339 \xhh          character with hex code hh
340 \x{hhh..}     character with hex code hhh..
341 \N{U+hhh..}   character with Unicode hex code point hhh..
342
343 By default, after \x that is not followed by {, from zero to two hexa‐
344 decimal digits are read (letters can be in upper or lower case). Any
345 number of hexadecimal digits may appear between \x{ and }. If a charac‐
346 ter other than a hexadecimal digit appears between \x{ and }, or if
347 there is no terminating }, an error occurs.
348
349 Characters whose code points are less than 256 can be defined by either
350 of the two syntaxes for \x or by an octal sequence. There is no differ‐
351 ence in the way they are handled. For example, \xdc is exactly the same
352 as \x{dc} or \334. However, using the braced versions does make such
353 sequences easier to read.
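
The equivalence of the three spellings is plain arithmetic: hex dc and octal 334 are both decimal 220. A short Python sketch that generates all three forms for a code point below 256 follows; escape_forms is a hypothetical helper used only for illustration.

```python
def escape_forms(code_point):
    """Return the \\xhh, \\x{hh} and three-digit octal spellings of a code point."""
    if not 0 <= code_point <= 0xFF:
        raise ValueError("only code points below 256 have all three forms")
    return (
        "\\x%02x" % code_point,   # two-digit hexadecimal form
        "\\x{%x}" % code_point,   # braced hexadecimal form
        "\\%03o" % code_point,    # three-digit octal form
    )
```

For example, escape_forms(0xdc) yields \xdc, \x{dc}, and \334. Note that an octal spelling without a leading zero can be read as a backreference in some patterns, as described later in this section.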
354
355 Support is available for some ECMAScript (aka JavaScript) escape
356 sequences via two compile-time options. If PCRE2_ALT_BSUX is set, the
357 sequence \x followed by { is not recognized. Only if \x is followed by
358 two hexadecimal digits is it recognized as a character escape. Other‐
359 wise it is interpreted as a literal "x" character. In this mode, sup‐
360 port for code points greater than 255 is provided by \u, which must be
361 followed by four hexadecimal digits; otherwise it is interpreted as a
362 literal "u" character.
363
364 PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in
365 addition, \u{hhh..} is recognized as the character specified by hexa‐
366 decimal code point. There may be any number of hexadecimal digits.
367 This syntax is from ECMAScript 6.
368
369 The \N{U+hhh..} escape sequence is recognized only when PCRE2 is oper‐
370 ating in UTF mode. Perl also uses \N{name} to specify characters by
371 Unicode name; PCRE2 does not support this. Note that when \N is not
372 followed by an opening brace (curly bracket) it has an entirely differ‐
373 ent meaning, matching any character that is not a newline.
374
375 There are some legacy applications where the escape sequence \r is
376 expected to match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option
377 is set, \r in a pattern is converted to \n so that it matches a LF
378 (linefeed) instead of a CR (carriage return) character.
379
380 The precise effect of \cx on ASCII characters is as follows: if x is a
381 lower case letter, it is converted to upper case. Then bit 6 of the
382 character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
383 (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
384 hex 7B (; is 3B). If the code unit following \c has a value less than
385 32 or greater than 126, a compile-time error occurs.
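
The transformation is small enough to state as code. The following Python sketch reproduces the ASCII rule just described; control_escape is an illustrative name, and a ValueError stands in for the compile-time error.

```python
def control_escape(x):
    """Return the code value produced by \\cx in an ASCII environment."""
    code = ord(x)
    if code < 32 or code > 126:
        raise ValueError("character after \\c must be printable ASCII")
    if "a" <= x <= "z":
        code = ord(x.upper())  # lower case letters are folded to upper case
    return code ^ 0x40         # invert bit 6 (hex 40)
```

Thus control_escape("A") is hex 01, control_escape("{") is hex 3B, and control_escape(";") is hex 7B.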
386
387 When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported.
388 \a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values.
389 The \c escape is processed as specified for Perl in the perlebcdic doc‐
390 ument. The only characters that are allowed after \c are A-Z, a-z, or
391 one of @, [, \, ], ^, _, or ?. Any other character provokes a compile-
392 time error. The sequence \c@ encodes character code 0; after \c the
393 letters (in either case) encode characters 1-26 (hex 01 to hex 1A); [,
394 \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c?
395 becomes either 255 (hex FF) or 95 (hex 5F).
396
397 Thus, apart from \c?, these escapes generate the same character code
398 values as they do in an ASCII environment, though the meanings of the
399 values mostly differ. For example, \cG always generates code value 7,
400 which is BEL in ASCII but DEL in EBCDIC.
401
402 The sequence \c? generates DEL (127, hex 7F) in an ASCII environment,
403 but because 127 is not a control character in EBCDIC, Perl makes it
404 generate the APC character. Unfortunately, there are several variants
405 of EBCDIC. In most of them the APC character has the value 255 (hex
406 FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If
407 certain other characters have POSIX-BC values, PCRE2 makes \c? generate
408 95; otherwise it generates 255.
409
410 After \0 up to two further octal digits are read. If there are fewer
411 than two digits, just those that are present are used. Thus the
412 sequence \0\x\015 specifies two binary zeros followed by a CR character
413 (code value 13). Make sure you supply two digits after the initial zero
414 if the pattern character that follows is itself an octal digit.
415
416 The escape \o must be followed by a sequence of octal digits, enclosed
417 in braces. An error occurs if this is not the case. This escape is a
418 recent addition to Perl; it provides a way of specifying character code
419 points as octal numbers greater than 0777, and it also allows octal
420 numbers and backreferences to be unambiguously specified.
421
422 For greater clarity and unambiguity, it is best to avoid following \ by
423 a digit greater than zero. Instead, use \o{} or \x{} to specify numeri‐
424 cal character code points, and \g{} to specify backreferences. The fol‐
425 lowing paragraphs describe the old, ambiguous syntax.
426
427 The handling of a backslash followed by a digit other than 0 is compli‐
428 cated, and Perl has changed over time, causing PCRE2 also to change.
429
430 Outside a character class, PCRE2 reads the digit and any following dig‐
431 its as a decimal number. If the number is less than 10, begins with the
432 digit 8 or 9, or if there are at least that many previous capture
433 groups in the expression, the entire sequence is taken as a backrefer‐
434 ence. A description of how this works is given later, following the
435 discussion of parenthesized groups. Otherwise, up to three octal dig‐
436 its are read to form a character code.
437
438 Inside a character class, PCRE2 handles \8 and \9 as the literal char‐
439 acters "8" and "9", and otherwise reads up to three octal digits fol‐
440 lowing the backslash, using them to generate a data character. Any sub‐
441 sequent digits stand for themselves. For example, outside a character
442 class:
443
444 \040    is another way of writing an ASCII space
445 \40     is the same, provided there are fewer than 40
446         previous capture groups
447 \7      is always a backreference
448 \11     might be a backreference, or another way of
449         writing a tab
450 \011    is always a tab
451 \0113   is a tab followed by the character "3"
452 \113    might be a backreference, otherwise the
453         character with octal code 113
454 \377    might be a backreference, otherwise
455         the value 255 (decimal)
456 \81     is always a backreference
457
458 Note that octal values of 100 or greater that are specified using this
459 syntax must not be introduced by a leading zero, because no more than
460 three octal digits are ever read.
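
The decision procedure for a backslash followed by digits, outside a character class, can be written out as follows. This Python sketch is a model of the rules stated above, not PCRE2's own code; the function name and the (kind, value, consumed) return shape are assumptions, and the limits on character values for each library width are not checked.

```python
OCTAL = "01234567"
DECIMAL = "0123456789"


def _leading(text, alphabet, limit=None):
    """Return the longest prefix of text drawn from alphabet (at most limit chars)."""
    end = 0
    while end < len(text) and text[end] in alphabet and (limit is None or end < limit):
        end += 1
    return text[:end]


def interpret_backslash_digits(text, previous_groups):
    """Classify a backslash followed by digits, outside a character class.

    text is what follows the backslash and must start with an ASCII digit.
    previous_groups is the number of capture groups opened so far.
    Returns (kind, value, consumed); digits beyond `consumed` stand for
    themselves.
    """
    digits = _leading(text, DECIMAL)
    if not digits:
        raise ValueError("text must start with a digit")
    if digits[0] != "0":
        number = int(digits)
        if number < 10 or digits[0] in "89" or number <= previous_groups:
            return ("backreference", number, len(digits))
    # Either \0 plus up to two further octal digits, or up to three octal digits.
    octal = _leading(text, OCTAL, limit=3)
    return ("character", int(octal, 8), len(octal))
```

With no previous capture groups, interpret_backslash_digits("40", 0) gives a character with value 32 (space); with 40 previous groups the same text is a backreference to group 40.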
461
462 Constraints on character values
463
464 Characters that are specified using octal or hexadecimal numbers are
465 limited to certain values, as follows:
466
467 8-bit non-UTF mode    no greater than 0xff
468 16-bit non-UTF mode   no greater than 0xffff
469 32-bit non-UTF mode   no greater than 0xffffffff
470 All UTF modes         no greater than 0x10ffff and a valid code point
471
472 Invalid Unicode code points are all those in the range 0xd800 to 0xdfff
473 (the so-called "surrogate" code points). The check for these can be
474 disabled by the caller of pcre2_compile() by setting the option
475 PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only in
476 UTF-8 and UTF-32 modes, because these values are not representable in
477 UTF-16.
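
The constraints above can be summarized as a single predicate. The Python sketch below is illustrative only; escape_value_allowed and its parameters are hypothetical names, with allow_surrogate_escapes standing for the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option.

```python
NON_UTF_MAX = {8: 0xFF, 16: 0xFFFF, 32: 0xFFFFFFFF}


def escape_value_allowed(value, width, utf, allow_surrogate_escapes=False):
    """Return True if a numeric escape may specify this value.

    width is the code unit width of the library (8, 16 or 32).
    """
    if width not in NON_UTF_MAX:
        raise ValueError("width must be 8, 16 or 32")
    if value < 0:
        return False
    if not utf:
        return value <= NON_UTF_MAX[width]
    if value > 0x10FFFF:
        return False
    if 0xD800 <= value <= 0xDFFF:
        # The surrogate check can be lifted only in UTF-8 and UTF-32 modes.
        return allow_surrogate_escapes and width != 16
    return True
```

For example, the value 0x100 is rejected in 8-bit non-UTF mode but accepted in 8-bit UTF mode, and 0xD800 is accepted in UTF modes only when the surrogate check is disabled and the library is not the 16-bit one.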
478
479 Escape sequences in character classes
480
481 All the sequences that define a single character value can be used both
482 inside and outside character classes. In addition, inside a character
483 class, \b is interpreted as the backspace character (hex 08).
484
485 When not followed by an opening brace, \N is not allowed in a character
486 class. \B, \R, and \X are not special inside a character class. Like
487 other unrecognized alphabetic escape sequences, they cause an error.
488 Outside a character class, these sequences have different meanings.
489
490 Unsupported escape sequences
491
492 In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its
493 string handler and used to modify the case of following characters. By
494 default, PCRE2 does not support these escape sequences in patterns.
495 However, if either of the PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX
496 options is set, \U matches a "U" character, and \u can be used to
497 define a character by code point, as described above.
498
499 Absolute and relative backreferences
500
501 The sequence \g followed by a signed or unsigned number, optionally
502 enclosed in braces, is an absolute or relative backreference. A named
503 backreference can be coded as \g{name}. Backreferences are discussed
504 later, following the discussion of parenthesized groups.
505
506 Absolute and relative subroutine calls
507
508 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
509 name or a number enclosed either in angle brackets or single quotes, is
510 an alternative syntax for referencing a capture group as a subroutine.
511 Details are discussed later. Note that \g{...} (Perl syntax) and
512 \g<...> (Oniguruma syntax) are not synonymous. The former is a backref‐
513 erence; the latter is a subroutine call.
514
515 Generic character types
516
517 Another use of backslash is for specifying generic character types:
518
519 \d any decimal digit
520 \D any character that is not a decimal digit
521 \h any horizontal white space character
522 \H any character that is not a horizontal white space character
523 \N any character that is not a newline
524 \s any white space character
525 \S any character that is not a white space character
526 \v any vertical white space character
527 \V any character that is not a vertical white space character
528 \w any "word" character
529 \W any "non-word" character
530
531 The \N escape sequence has the same meaning as the "." metacharacter
532 when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change
533 the meaning of \N. Note that when \N is followed by an opening brace it
534 has a different meaning. See the section entitled "Non-printing charac‐
535 ters" above for details. Perl also uses \N{name} to specify characters
536 by Unicode name; PCRE2 does not support this.
537
538 Each pair of lower and upper case escape sequences partitions the com‐
539 plete set of characters into two disjoint sets. Any given character
540 matches one, and only one, of each pair. The sequences can appear both
541 inside and outside character classes. They each match one character of
542 the appropriate type. If the current matching point is at the end of
543 the subject string, all of them fail, because there is no character to
544 match.
545
546 The default \s characters are HT (9), LF (10), VT (11), FF (12), CR
547 (13), and space (32), which are defined as white space in the "C"
548 locale. This list may vary if locale-specific matching is taking place.
549 For example, in some locales the "non-breaking space" character (\xA0)
550 is recognized as white space, and in others the VT character is not.
551
552 A "word" character is an underscore or any character that is a letter
553 or digit. By default, the definition of letters and digits is con‐
554 trolled by PCRE2's low-valued character tables, and may vary if locale-
555 specific matching is taking place (see "Locale support" in the pcre2api
556 page). For example, in a French locale such as "fr_FR" in Unix-like
557 systems, or "french" in Windows, some character codes greater than 127
558 are used for accented letters, and these are then matched by \w. The
559 use of locales with Unicode is discouraged.
560
561 By default, characters whose code points are greater than 127 never
562 match \d, \s, or \w, and always match \D, \S, and \W, although this may
563 be different for characters in the range 128-255 when locale-specific
564 matching is happening. These escape sequences retain their original
565 meanings from before Unicode support was available, mainly for effi‐
566 ciency reasons. If the PCRE2_UCP option is set, the behaviour is
567 changed so that Unicode properties are used to determine character
568 types, as follows:
569
570 \d any character that matches \p{Nd} (decimal digit)
571 \s any character that matches \p{Z} or \h or \v
572 \w any character that matches \p{L} or \p{N}, plus underscore
573
574 The upper case escapes match the inverse sets of characters. Note that
575 \d matches only decimal digits, whereas \w matches any Unicode digit,
576 as well as any Unicode letter, and underscore. Note also that PCRE2_UCP
577 affects \b and \B because they are defined in terms of \w and \W.
578 Matching these sequences is noticeably slower when PCRE2_UCP is set.
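
The difference between the default and PCRE2_UCP definitions of \d and \w can be illustrated with Unicode general categories. The Python sketch below uses the standard unicodedata module as a stand-in for PCRE2's property tables, which is an assumption: the two may be built from different Unicode versions. The default branch models the "C" locale only, and matches_d and matches_w are hypothetical names.

```python
import unicodedata

ASCII_DIGITS = set("0123456789")
ASCII_WORD = set(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_"
)


def matches_d(ch, ucp=False):
    """Model of \\d: ASCII digits by default, \\p{Nd} when ucp is set."""
    if ucp:
        return unicodedata.category(ch) == "Nd"
    return ch in ASCII_DIGITS


def matches_w(ch, ucp=False):
    """Model of \\w: ASCII word characters by default; \\p{L}, \\p{N} or
    underscore when ucp is set."""
    if ucp:
        return ch == "_" or unicodedata.category(ch)[0] in "LN"
    return ch in ASCII_WORD
```

In this model U+0663 (Arabic-Indic digit three) matches \d only when ucp is set, while U+2167 (Roman numeral eight, category Nl) matches \w but never \d, because \d is restricted to decimal digits.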
579
580 The sequences \h, \H, \v, and \V, in contrast to the other sequences,
581 which match only ASCII characters by default, always match a specific
582 list of code points, whether or not PCRE2_UCP is set. The horizontal
583 space characters are:
584
585 U+0009 Horizontal tab (HT)
586 U+0020 Space
587 U+00A0 Non-break space
588 U+1680 Ogham space mark
589 U+180E Mongolian vowel separator
590 U+2000 En quad
591 U+2001 Em quad
592 U+2002 En space
593 U+2003 Em space
594 U+2004 Three-per-em space
595 U+2005 Four-per-em space
596 U+2006 Six-per-em space
597 U+2007 Figure space
598 U+2008 Punctuation space
599 U+2009 Thin space
600 U+200A Hair space
601 U+202F Narrow no-break space
602 U+205F Medium mathematical space
603 U+3000 Ideographic space
604
605 The vertical space characters are:
606
607 U+000A Linefeed (LF)
608 U+000B Vertical tab (VT)
609 U+000C Form feed (FF)
610 U+000D Carriage return (CR)
611 U+0085 Next line (NEL)
612 U+2028 Line separator
613 U+2029 Paragraph separator
614
615 In 8-bit, non-UTF-8 mode, only the characters with code points less
616 than 256 are relevant.
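
For programs that need the same fixed lists, the two tables above can be transcribed directly. The Python sketch below copies the code points listed in this section; the names HORIZONTAL, VERTICAL, and relevant_in_8bit are illustrative assumptions.

```python
# Code points matched by \h (U+2000 to U+200A are contiguous).
HORIZONTAL = {0x0009, 0x0020, 0x00A0, 0x1680, 0x180E,
              0x202F, 0x205F, 0x3000} | set(range(0x2000, 0x200B))

# Code points matched by \v.
VERTICAL = {0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029}


def relevant_in_8bit(code_points):
    """Return the subset that can occur in 8-bit, non-UTF-8 mode."""
    return sorted(cp for cp in code_points if cp < 256)
```

In 8-bit, non-UTF-8 mode this leaves HT, space, and U+00A0 for \h, and LF, VT, FF, CR, and NEL for \v.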
617
618 Newline sequences
619
620 Outside a character class, by default, the escape sequence \R matches
621 any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent
622 to the following:
623
624 (?>\r\n|\n|\x0b|\f|\r|\x85)
625
626 This is an example of an "atomic group", details of which are given
627 below. This particular group matches either the two-character sequence
628 CR followed by LF, or one of the single characters LF (linefeed,
629 U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car‐
630 riage return, U+000D), or NEL (next line, U+0085). Because this is an
631 atomic group, the two-character sequence is treated as a single unit
632 that cannot be split.
633
634 In other modes, two additional characters whose code points are greater
635 than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa‐
636 rator, U+2029). Unicode support is not needed for these characters to
637 be recognized.
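
The default behaviour of \R, including the treatment of CRLF as an indivisible unit, can be modelled without a regular expression engine. The Python sketch below is such a model; match_R and its eight_bit_non_utf flag are hypothetical names, and the function returns the number of characters consumed rather than a match object.

```python
SINGLE_8BIT = {"\n", "\x0b", "\x0c", "\r", "\x85"}
SINGLE_WIDE = SINGLE_8BIT | {"\u2028", "\u2029"}


def match_R(subject, pos=0, eight_bit_non_utf=False):
    """Return how many characters \\R would consume at pos (0 if no match)."""
    # CRLF is tried first and, as in an atomic group, is never split.
    if subject.startswith("\r\n", pos):
        return 2
    singles = SINGLE_8BIT if eight_bit_non_utf else SINGLE_WIDE
    if pos < len(subject) and subject[pos] in singles:
        return 1
    return 0
```

Because the two-character sequence is consumed as a unit, a pattern such as \R\n cannot match the subject "\r\n": after \R has taken both characters there is nothing left for \n, and the atomic group does not give the LF back.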
638
639 It is possible to restrict \R to match only CR, LF, or CRLF (instead of
640 the complete set of Unicode line endings) by setting the option
641 PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation for "back‐
642 slash R".) This can be made the default when PCRE2 is built; if this is
643 the case, the other behaviour can be requested via the PCRE2_BSR_UNI‐
644 CODE option. It is also possible to specify these settings by starting
645 a pattern string with one of the following sequences:
646
647 (*BSR_ANYCRLF) CR, LF, or CRLF only
648 (*BSR_UNICODE) any Unicode newline sequence
649
650 These override the default and the options given to the compiling func‐
651 tion. Note that these special settings, which are not Perl-compatible,
652 are recognized only at the very start of a pattern, and that they must
653 be in upper case. If more than one of them is present, the last one is
654 used. They can be combined with a change of newline convention; for
655 example, a pattern can start with:
656
657 (*ANY)(*BSR_ANYCRLF)
658
659 They can also be combined with the (*UTF) or (*UCP) special sequences.
660 Inside a character class, \R is treated as an unrecognized escape
661 sequence, and causes an error.
662
663 Unicode character properties
664
665 When PCRE2 is built with Unicode support (the default), three addi‐
666 tional escape sequences that match characters with specific properties
667 are available. They can be used in any mode, though in 8-bit and 16-bit
668 non-UTF modes these sequences are of course limited to testing charac‐
669 ters whose code points are less than U+0100 and U+10000, respectively.
670 In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode
671 limit) may be encountered. These are all treated as being in the
672 Unknown script and with an unassigned type. The extra escape sequences
673 are:
674
675 \p{xx} a character with the xx property
676 \P{xx} a character without the xx property
677 \X a Unicode extended grapheme cluster
678
679 The property names represented by xx above are case-sensitive. There is
680 support for Unicode script names, Unicode general category properties,
681 "Any", which matches any character (including newline), and some spe‐
682 cial PCRE2 properties (described in the next section). Other Perl
683 properties such as "InMusicalSymbols" are not supported by PCRE2. Note
684 that \P{Any} does not match any characters, so always causes a match
685 failure.
686
687 Sets of Unicode characters are defined as belonging to certain scripts.
688 A character from one of these sets can be matched using a script name.
689 For example:
690
691 \p{Greek}
692 \P{Han}
693
694 Unassigned characters (and in non-UTF 32-bit mode, characters with code
695 points greater than 0x10FFFF) are assigned the "Unknown" script. Others
696 that are not part of an identified script are lumped together as "Com‐
697 mon". The current list of scripts is:
698
699 Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Bali‐
700 nese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi,
701 Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Caucasian_Alba‐
702 nian, Chakma, Cham, Cherokee, Chorasmian, Common, Coptic, Cuneiform,
703 Cypriot, Cyrillic, Deseret, Devanagari, Dives_Akuru, Dogra, Duployan,
704 Egyptian_Hieroglyphs, Elbasan, Elymaic, Ethiopic, Georgian, Glagolitic,
705 Gothic, Grantha, Greek, Gujarati, Gunjala_Gondi, Gurmukhi, Han, Hangul,
706 Hanifi_Rohingya, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic,
707 Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese,
708 Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khitan_Small_Script,
709 Khmer, Khojki, Khudawadi, Lao, Latin, Lepcha, Limbu, Linear_A, Lin‐
710 ear_B, Lisu, Lycian, Lydian, Mahajani, Makasar, Malayalam, Mandaic,
711 Manichaean, Marchen, Masaram_Gondi, Medefaidrin, Meetei_Mayek,
712 Mende_Kikakui, Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Modi, Mon‐
713 golian, Mro, Multani, Myanmar, Nabataean, Nandinagari, New_Tai_Lue,
714 Newa, Nko, Nushu, Nyakeng_Puachue_Hmong, Ogham, Ol_Chiki, Old_Hungar‐
715 ian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, Old_Sog‐
716 dian, Old_South_Arabian, Old_Turkic, Oriya, Osage, Osmanya,
717 Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician,
718 Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Sha‐
719 vian, Siddham, SignWriting, Sinhala, Sogdian, Sora_Sompeng, Soyombo,
720 Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham,
721 Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana, Thai, Tibetan, Tifi‐
722 nagh, Tirhuta, Ugaritic, Unknown, Vai, Wancho, Warang_Citi, Yezidi, Yi,
723 Zanabazar_Square.
724
725 Each character has exactly one Unicode general category property, spec‐
726 ified by a two-letter abbreviation. For compatibility with Perl, nega‐
727 tion can be specified by including a circumflex between the opening
728 brace and the property name. For example, \p{^Lu} is the same as
729 \P{Lu}.
730
731 If only one letter is specified with \p or \P, it includes all the gen‐
732 eral category properties that start with that letter. In this case, in
733 the absence of negation, the curly brackets in the escape sequence are
734 optional; these two examples have the same effect:
735
736 \p{L}
737 \pL
738
739 The following general category property codes are supported:
740
741 C Other
742 Cc Control
743 Cf Format
744 Cn Unassigned
745 Co Private use
746 Cs Surrogate
747
748 L Letter
749 Ll Lower case letter
750 Lm Modifier letter
751 Lo Other letter
752 Lt Title case letter
753 Lu Upper case letter
754
755 M Mark
756 Mc Spacing mark
757 Me Enclosing mark
758 Mn Non-spacing mark
759
760 N Number
761 Nd Decimal number
762 Nl Letter number
763 No Other number
764
765 P Punctuation
766 Pc Connector punctuation
767 Pd Dash punctuation
768 Pe Close punctuation
769 Pf Final punctuation
770 Pi Initial punctuation
771 Po Other punctuation
772 Ps Open punctuation
773
774 S Symbol
775 Sc Currency symbol
776 Sk Modifier symbol
777 Sm Mathematical symbol
778 So Other symbol
779
780 Z Separator
781 Zl Line separator
782 Zp Paragraph separator
783 Zs Space separator
784
785 The special property L& is also supported: it matches a character that
786 has the Lu, Ll, or Lt property, in other words, a letter that is not
787 classified as a modifier or "other".
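
The relationship between one-letter codes, two-letter codes, and L& can be sketched with Python's unicodedata module, which reports the general category of a character. This is an illustration only: it does not call PCRE2, the helper name has_category is hypothetical, and the Unicode version used by Python may differ from the one PCRE2 was built with.

```python
import unicodedata


def has_category(ch, code):
    """Return True if ch has the general category named by code."""
    category = unicodedata.category(ch)
    if code == "L&":
        # L& means Lu, Ll, or Lt: a cased letter.
        return category in ("Lu", "Ll", "Lt")
    # A one-letter code covers every category that starts with that letter.
    return category.startswith(code)


print(has_category("A", "Lu"), has_category("a", "Lu"), has_category("a", "L"))
```

A negated test such as \P{Lu} or \p{^Lu} corresponds to "not has_category(ch, 'Lu')" in this sketch.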
788
789 The Cs (Surrogate) property applies only to characters whose code
790 points are in the range U+D800 to U+DFFF. These characters are no dif‐
791 ferent to any other character when PCRE2 is not in UTF mode (using the
792 16-bit or 32-bit library). However, they are not valid in Unicode
793 strings and so cannot be tested by PCRE2 in UTF mode, unless UTF valid‐
794 ity checking has been turned off (see the discussion of
795 PCRE2_NO_UTF_CHECK in the pcre2api page).
796
797 The long synonyms for property names that Perl supports (such as
798 \p{Letter}) are not supported by PCRE2, nor is it permitted to prefix
799 any of these properties with "Is".
800
801 No character that is in the Unicode table has the Cn (unassigned) prop‐
802 erty. Instead, this property is assumed for any code point that is not
803 in the Unicode table.
804
805 Specifying caseless matching does not affect these escape sequences.
806 For example, \p{Lu} always matches only upper case letters. This is
807 different from the behaviour of current versions of Perl.
808
809 Matching characters by Unicode property is not fast, because PCRE2 has
810 to do a multistage table lookup in order to find a character's prop‐
811 erty. That is why the traditional escape sequences such as \d and \w do
812 not use Unicode properties in PCRE2 by default, though you can make
813 them do so by setting the PCRE2_UCP option or by starting the pattern
814 with (*UCP).
815
816 Extended grapheme clusters
817
818 The \X escape matches any number of Unicode characters that form an
819 "extended grapheme cluster", and treats the sequence as an atomic group
820 (see below). Unicode supports various kinds of composite character by
821 giving each character a grapheme breaking property, and having rules
822 that use these properties to define the boundaries of extended grapheme
823 clusters. The rules are defined in Unicode Standard Annex 29, "Unicode
824 Text Segmentation". Unicode 11.0.0 abandoned the use of some previous
825 properties that had been used for emojis. Instead it introduced vari‐
826 ous emoji-specific properties. PCRE2 uses only the Extended Picto‐
827 graphic property.
828
829 \X always matches at least one character. Then it decides whether to
830 add additional characters according to the following rules for ending a
831 cluster:
832
833 1. End at the end of the subject string.
834
835 2. Do not end between CR and LF; otherwise end after any control char‐
836 acter.
837
838 3. Do not break Hangul (a Korean script) syllable sequences. Hangul
839 characters are of five types: L, V, T, LV, and LVT. An L character may
840 be followed by an L, V, LV, or LVT character; an LV or V character may
841 be followed by a V or T character; an LVT or T character may be followed
842 only by a T character.
843
844 4. Do not end before extending characters or spacing marks or the
845 "zero-width joiner" character. Characters with the "mark" property
846 always have the "extend" grapheme breaking property.
847
848 5. Do not end after prepend characters.
849
850 6. Do not break within emoji modifier sequences or emoji zwj sequences.
851 That is, do not break between characters with the Extended_Pictographic
852 property. Extend and ZWJ characters are allowed between the charac‐
853 ters.
854
855 7. Do not break within emoji flag sequences. That is, do not break
856 between regional indicator (RI) characters if there are an odd number
857 of RI characters before the break point.
858
859 8. Otherwise, end the cluster.
860
861 PCRE2's additional properties
862
863 As well as the standard Unicode properties described above, PCRE2 sup‐
864 ports four more that make it possible to convert traditional escape
865 sequences such as \w and \s to use Unicode properties. PCRE2 uses these
866 non-standard, non-Perl properties internally when PCRE2_UCP is set.
867 However, they may also be used explicitly. These properties are:
868
869 Xan Any alphanumeric character
870 Xps Any POSIX space character
871 Xsp Any Perl space character
872 Xwd Any Perl "word" character
873
874 Xan matches characters that have either the L (letter) or the N (num‐
875 ber) property. Xps matches the characters tab, linefeed, vertical tab,
876 form feed, or carriage return, and any other character that has the Z
877 (separator) property. Xsp is the same as Xps; in PCRE1 it used to
878 exclude vertical tab, for Perl compatibility, but Perl changed. Xwd
879 matches the same characters as Xan, plus underscore.
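
The definitions above can be written down directly in terms of general categories. The following Python sketch follows the wording of this section only; it is not PCRE2's implementation, and the function names are hypothetical.

```python
import unicodedata


def is_xan(ch):
    # L (letter) or N (number) general category.
    return unicodedata.category(ch)[0] in "LN"


def is_xps(ch):
    # Tab, linefeed, vertical tab, form feed, carriage return,
    # or any character with a Z (separator) category.
    return ch in "\t\n\x0b\x0c\r" or unicodedata.category(ch).startswith("Z")


def is_xwd(ch):
    # The same characters as Xan, plus underscore.
    return is_xan(ch) or ch == "_"


print(is_xan("_"), is_xwd("_"), is_xps("\u2028"))
```

Since Xsp is the same as Xps, no separate function is needed for it in this sketch.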
880
881 There is another non-standard property, Xuc, which matches any charac‐
882 ter that can be represented by a Universal Character Name in C++ and
883 other programming languages. These are the characters $, @, ` (grave
884 accent), and all characters with Unicode code points greater than or
885 equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that
886 most base (ASCII) characters are excluded. (Universal Character Names
887 are of the form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.
888 Note that the Xuc property does not match these sequences but the char‐
889 acters that they represent.)
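
The Xuc rule is simple enough to state as a predicate. This Python sketch encodes only the description given above (the name is_xuc is hypothetical) and is not taken from the PCRE2 sources.

```python
def is_xuc(ch):
    """Return True if ch is covered by the Xuc description above."""
    if ch in "$@`":
        return True
    code_point = ord(ch)
    # U+00A0 and above, except the surrogates U+D800 to U+DFFF.
    return code_point >= 0xA0 and not (0xD800 <= code_point <= 0xDFFF)


print(is_xuc("$"), is_xuc("A"), is_xuc("\u20ac"))
```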
890
891 Resetting the match start
892
893 In normal use, the escape sequence \K causes any previously matched
894 characters not to be included in the final matched sequence that is
895 returned. For example, the pattern:
896
897 foo\Kbar
898
899 matches "foobar", but reports that it has matched "bar". \K does not
900 interact with anchoring in any way. The pattern:
901
902 ^foo\Kbar
903
904 matches only when the subject begins with "foobar" (in single line
905 mode), though it again reports the matched string as "bar". This fea‐
906 ture is similar to a lookbehind assertion (described below). However,
907 in this case, the part of the subject before the real match does not
908 have to be of fixed length, as lookbehind assertions do. The use of \K
909 does not interfere with the setting of captured substrings. For exam‐
910 ple, when the pattern
911
912 (foo)\Kbar
913
914 matches "foobar", the first substring is still set to "foo".
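
Python's re module does not support \K, but the effect on the reported match can be imitated by capturing the part that follows the \K position and reporting only that group. The sketch below is such an imitation for the pattern (foo)\Kbar; the function name match_after_keep is hypothetical.

```python
import re


def match_after_keep(subject):
    # Imitates (foo)\Kbar: group 2 stands for everything after \K.
    m = re.search(r"(foo)(bar)", subject)
    if m is None:
        return None
    # Reported text, reported offsets, and the first captured substring.
    return m.group(2), m.span(2), m.group(1)


print(match_after_keep("xfoobar"))
```

The reported match is "bar" at offsets 4 to 7, while the first captured substring is still "foo".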
915
916 Perl used to document that the use of \K within lookaround assertions
917 is "not well defined", but from version 5.32.0 Perl does not support
918 this usage at all. In PCRE2, \K is acted upon when it occurs inside
919 positive assertions, but is ignored in negative assertions. Note that
920 when a pattern such as (?=ab\K) matches, the reported start of the
921 match can be greater than the end of the match. Using \K in a lookbe‐
922 hind assertion at the start of a pattern can also lead to odd effects.
923 For example, consider this pattern:
924
925 (?<=\Kfoo)bar
926
927 If the subject is "foobar", a call to pcre2_match() with a starting
928 offset of 3 succeeds and reports the matching string as "foobar", that
929 is, the start of the reported match is earlier than where the match
930 started.
931
932 Simple assertions
933
934 The final use of backslash is for certain simple assertions. An asser‐
935 tion specifies a condition that has to be met at a particular point in
936 a match, without consuming any characters from the subject string. The
937 use of groups for more complicated assertions is described below. The
938 backslashed assertions are:
939
940 \b matches at a word boundary
941 \B matches when not at a word boundary
942 \A matches at the start of the subject
943 \Z matches at the end of the subject
944    also matches before a newline at the end of the subject
945 \z matches only at the end of the subject
946 \G matches at the first matching position in the subject
947
948 Inside a character class, \b has a different meaning; it matches the
949 backspace character. If any other of these assertions appears in a
950 character class, an "invalid escape sequence" error is generated.
951
952 A word boundary is a position in the subject string where the current
953 character and the previous character do not both match \w or \W (i.e.
954 one matches \w and the other matches \W), or the start or end of the
955 string if the first or last character matches \w, respectively. When
956 PCRE2 is built with Unicode support, the meanings of \w and \W can be
957 changed by setting the PCRE2_UCP option. When this is done, it also
958 affects \b and \B. Neither PCRE2 nor Perl has a separate "start of
959 word" or "end of word" metasequence. However, whatever follows \b nor‐
960 mally determines which it is. For example, the fragment \ba matches "a"
961 at the start of a word.
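
The way in which the item next to \b selects the start or the end of a word can be tried out with Python's re module, whose \b behaves in the same way for this simple ASCII subject. This is an illustration, not PCRE2 code.

```python
import re

subject = "an apple a banana"

# \b followed by "a": an "a" at the start of a word.
starts = [m.start() for m in re.finditer(r"\ba", subject)]
# "a" followed by \b: an "a" at the end of a word.
ends = [m.start() for m in re.finditer(r"a\b", subject)]

print(starts)  # [0, 3, 9]
print(ends)    # [9, 16]
```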
962
963 The \A, \Z, and \z assertions differ from the traditional circumflex
964 and dollar (described in the next section) in that they only ever match
965 at the very start and end of the subject string, whatever options are
966 set. Thus, they are independent of multiline mode. These three asser‐
967 tions are not affected by the PCRE2_NOTBOL or PCRE2_NOTEOL options,
968 which affect only the behaviour of the circumflex and dollar metachar‐
969 acters. However, if the startoffset argument of pcre2_match() is non-
970 zero, indicating that matching is to start at a point other than the
971 beginning of the subject, \A can never match. The difference between
972 \Z and \z is that \Z matches before a newline at the end of the string
973 as well as at the very end, whereas \z matches only at the end.
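
The difference between the two end assertions can be demonstrated in Python, with one caveat that is easy to trip over: Python's \Z matches only at the very end and so corresponds to PCRE2's \z. In this sketch PCRE2's \Z is approximated by a lookahead; the variable names are illustrative.

```python
import re

subject = "abc\n"

# Python's \Z matches only at the very end, like PCRE2's \z.
very_end = re.search(r"abc\Z", subject)
# A lookahead approximates PCRE2's \Z: at the end, or before a final newline.
before_final_newline = re.search(r"abc(?=\n?\Z)", subject)
# \A is independent of multiline mode, unlike circumflex.
start_only = re.search(r"\Aabc", "x\nabc", re.MULTILINE)
circumflex = re.search(r"^abc", "x\nabc", re.MULTILINE)

print(very_end, before_final_newline.span(), start_only, circumflex.span())
```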
974
975 The \G assertion is true only when the current matching position is at
976 the start point of the matching process, as specified by the startoff‐
977 set argument of pcre2_match(). It differs from \A when the value of
978 startoffset is non-zero. By calling pcre2_match() multiple times with
979 appropriate arguments, you can mimic Perl's /g option, and it is in
980 this kind of implementation where \G can be useful.
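
A loop of this kind can be sketched in Python, which has no \G but whose Pattern.match(subject, position) succeeds only at the given position and so plays a similar role here. This is an analogy rather than PCRE2 code; the function name leading_tokens and the token pattern are hypothetical.

```python
import re


def leading_tokens(subject, start=0):
    """Collect consecutive numbers, each match starting where the last ended."""
    token = re.compile(r"\d+\s*")
    tokens = []
    position = start
    while True:
        # match() is anchored at position, much as \G is anchored at startoffset.
        m = token.match(subject, position)
        if m is None:
            break
        tokens.append(m.group().strip())
        # The pattern never matches an empty string, so position always advances.
        position = m.end()
    return tokens


print(leading_tokens("12 34 x 56"))
```

The loop stops at "x" instead of skipping ahead to "56", which is the behaviour that anchoring each attempt to the end of the previous match is meant to provide.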
981
982 Note, however, that PCRE2's implementation of \G, being true at the
983 starting character of the matching process, is subtly different from
984 Perl's, which defines it as true at the end of the previous match. In
985 Perl, these can be different when the previously matched string was
986 empty. Because PCRE2 does just one match at a time, it cannot reproduce
987 this behaviour.
988
989 If all the alternatives of a pattern begin with \G, the expression is
990 anchored to the starting match position, and the "anchored" flag is set
991 in the compiled regular expression.
992
994
995 The circumflex and dollar metacharacters are zero-width assertions.
996 That is, they test for a particular condition being true without con‐
997 suming any characters from the subject string. These two metacharacters
998 are concerned with matching the starts and ends of lines. If the new‐
999 line convention is set so that only the two-character sequence CRLF is
1000 recognized as a newline, isolated CR and LF characters are treated as
1001 ordinary data characters, and are not recognized as newlines.
1002
1003 Outside a character class, in the default matching mode, the circumflex
1004 character is an assertion that is true only if the current matching
1005 point is at the start of the subject string. If the startoffset argu‐
1006 ment of pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum‐
1007 flex can never match if the PCRE2_MULTILINE option is unset. Inside a
1008 character class, circumflex has an entirely different meaning (see
1009 below).
1010
1011 Circumflex need not be the first character of the pattern if a number
1012 of alternatives are involved, but it should be the first thing in each
1013 alternative in which it appears if the pattern is ever to match that
1014 branch. If all possible alternatives start with a circumflex, that is,
1015 if the pattern is constrained to match only at the start of the sub‐
1016 ject, it is said to be an "anchored" pattern. (There are also other
1017 constructs that can cause a pattern to be anchored.)
1018
1019 The dollar character is an assertion that is true only if the current
1020 matching point is at the end of the subject string, or immediately
1021 before a newline at the end of the string (by default), unless
1022 PCRE2_NOTEOL is set. Note, however, that it does not actually match the
1023 newline. Dollar need not be the last character of the pattern if a num‐
1024 ber of alternatives are involved, but it should be the last item in any
1025 branch in which it appears. Dollar has no special meaning in a charac‐
1026 ter class.
1027
1028 The meaning of dollar can be changed so that it matches only at the
1029 very end of the string, by setting the PCRE2_DOLLAR_ENDONLY option at
1030 compile time. This does not affect the \Z assertion.
1031
1032 The meanings of the circumflex and dollar metacharacters are changed if
1033 the PCRE2_MULTILINE option is set. When this is the case, a dollar
1034 character matches before any newlines in the string, as well as at the
1035 very end, and a circumflex matches immediately after internal newlines
1036 as well as at the start of the subject string. It does not match after
1037 a newline that ends the string, for compatibility with Perl. However,
1038 this can be changed by setting the PCRE2_ALT_CIRCUMFLEX option.
1039
1040 For example, the pattern /^abc$/ matches the subject string "def\nabc"
1041 (where \n represents a newline) in multiline mode, but not otherwise.
1042 Consequently, patterns that are anchored in single line mode because
1043 all branches start with ^ are not anchored in multiline mode, and a
1044 match for circumflex is possible when the startoffset argument of
1045 pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option is ignored
1046 if PCRE2_MULTILINE is set.
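
The example can be reproduced with Python's re module, whose re.MULTILINE flag and pos argument behave like PCRE2_MULTILINE and startoffset for this case. It is given as an illustration only and does not call PCRE2.

```python
import re

subject = "def\nabc"

single_line = re.search(r"^abc$", subject)
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# Starting the search at offset 4, just after the newline.
from_offset_single = re.compile(r"^abc").search(subject, 4)
from_offset_multi = re.compile(r"^abc", re.MULTILINE).search(subject, 4)

print(single_line, multi_line.span(), from_offset_single, from_offset_multi.span())
```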
1047
1048 When the newline convention (see "Newline conventions" below) recog‐
1049 nizes the two-character sequence CRLF as a newline, this is preferred,
1050 even if the single characters CR and LF are also recognized as new‐
1051 lines. For example, if the newline convention is "any", a multiline
1052 mode circumflex matches before "xyz" in the string "abc\r\nxyz" rather
1053 than after CR, even though CR on its own is a valid newline. (It also
1054 matches at the very start of the string, of course.)
1055
1056 Note that the sequences \A, \Z, and \z can be used to match the start
1057 and end of the subject in both modes, and if all branches of a pattern
1058 start with \A it is always anchored, whether or not PCRE2_MULTILINE is
1059 set.
1060
1062
1063 Outside a character class, a dot in the pattern matches any one charac‐
1064 ter in the subject string except (by default) a character that signi‐
1065 fies the end of a line.
1066
1067 When a line ending is defined as a single character, dot never matches
1068 that character; when the two-character sequence CRLF is used, dot does
1069 not match CR if it is immediately followed by LF, but otherwise it
1070 matches all characters (including isolated CRs and LFs). When any Uni‐
1071 code line endings are being recognized, dot does not match CR or LF or
1072 any of the other line ending characters.
1073
1074 The behaviour of dot with regard to newlines can be changed. If the
1075 PCRE2_DOTALL option is set, a dot matches any one character, without
1076 exception. If the two-character sequence CRLF is present in the sub‐
1077 ject string, it takes two dots to match it.
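
The point about CRLF needing two dots can be checked with Python's re.DOTALL flag, which corresponds to PCRE2_DOTALL for this purpose. Note that Python always treats LF alone as the line ending for dot, so this sketch says nothing about the other PCRE2 newline conventions.

```python
import re

crlf_subject = "a\r\nb"

one_dot = re.fullmatch(r"a.b", crlf_subject, re.DOTALL)
two_dots = re.fullmatch(r"a..b", crlf_subject, re.DOTALL)

# Without DOTALL, dot does not match the linefeed.
default_dot = re.fullmatch(r"a.b", "a\nb")
dotall_dot = re.fullmatch(r"a.b", "a\nb", re.DOTALL)

print(one_dot, two_dots is not None, default_dot, dotall_dot is not None)
```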
1078
1079 The handling of dot is entirely independent of the handling of circum‐
1080 flex and dollar, the only relationship being that they both involve
1081 newlines. Dot has no special meaning in a character class.
1082
1083 The escape sequence \N when not followed by an opening brace behaves
1084 like a dot, except that it is not affected by the PCRE2_DOTALL option.
1085 In other words, it matches any character except one that signifies the
1086 end of a line.
1087
1088 When \N is followed by an opening brace it has a different meaning. See
1089 the section entitled "Non-printing characters" above for details. Perl
1090 also uses \N{name} to specify characters by Unicode name; PCRE2 does
1091 not support this.
1092
1094
1095 Outside a character class, the escape sequence \C matches any one code
1096 unit, whether or not a UTF mode is set. In the 8-bit library, one code
1097 unit is one byte; in the 16-bit library it is a 16-bit unit; in the
1098 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
1099 line-ending characters. The feature is provided in Perl in order to
1100 match individual bytes in UTF-8 mode, but it is unclear how it can use‐
1101 fully be used.
1102
1103 Because \C breaks up characters into individual code units, matching
1104 one unit with \C in UTF-8 or UTF-16 mode means that the rest of the
1105 string may start with a malformed UTF character. This has undefined
1106 results, because PCRE2 assumes that it is matching character by charac‐
1107 ter in a valid UTF string (by default it checks the subject string's
1108 validity at the start of processing unless the PCRE2_NO_UTF_CHECK or
1109 PCRE2_MATCH_INVALID_UTF option is used).
1110
1111 An application can lock out the use of \C by setting the
1112 PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also
1113 possible to build PCRE2 with the use of \C permanently disabled.
1114
1115 PCRE2 does not allow \C to appear in lookbehind assertions (described
1116 below) in UTF-8 or UTF-16 modes, because this would make it impossible
1117 to calculate the length of the lookbehind. Neither the alternative
1118 matching function pcre2_dfa_match() nor the JIT optimizer support \C in
1119 these UTF modes. The former gives a match-time error; the latter fails
1120 to optimize and so the match is always run using the interpreter.
1121
1122 In the 32-bit library, however, \C is always supported (when not
1123 explicitly locked out) because it always matches a single code unit,
1124 whether or not UTF-32 is specified.
1125
1126 In general, the \C escape sequence is best avoided. However, one way of
1127 using it that avoids the problem of malformed UTF-8 or UTF-16 charac‐
1128 ters is to use a lookahead to check the length of the next character,
1129 as in this pattern, which could be used with a UTF-8 string (ignore
1130 white space and line breaks):
1131
1132 (?| (?=[\x00-\x7f])(\C) |
1133 (?=[\x80-\x{7ff}])(\C)(\C) |
1134 (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
1135 (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
1136
1137 In this example, a group that starts with (?| resets the capturing
1138 parentheses numbers in each alternative (see "Duplicate Group Numbers"
1139 below). The assertions at the start of each branch check the next UTF-8
1140 character for values whose encoding uses 1, 2, 3, or 4 bytes, respec‐
1141 tively. The character's individual bytes are then captured by the
1142 appropriate number of \C groups.
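
The four code point ranges used in the lookaheads correspond to the UTF-8 encoded lengths of 1, 2, 3, and 4 bytes. That correspondence can be confirmed independently of PCRE2 with a few lines of Python; the helper name utf8_code_units is hypothetical.

```python
def utf8_code_units(ch):
    """Return the UTF-8 code units (bytes) of one character as integers."""
    return list(ch.encode("utf-8"))


# The first and last code point of each of the first three ranges,
# and the first code point of the fourth.
for ch in ("\x00", "\x7f", "\x80", "\u07ff", "\u0800", "\uffff", "\U00010000"):
    print(hex(ord(ch)), len(utf8_code_units(ch)))
```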
1143
1145
1146 An opening square bracket introduces a character class, terminated by a
1147 closing square bracket. A closing square bracket on its own is not spe‐
1148 cial by default. If a closing square bracket is required as a member
1149 of the class, it should be the first data character in the class (after
1150 an initial circumflex, if present) or escaped with a backslash. This
1151 means that, by default, an empty class cannot be defined. However, if
1152 the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at
1153 the start does end the (empty) class.
1154
1155 A character class matches a single character in the subject. A matched
1156 character must be in the set of characters defined by the class, unless
1157 the first character in the class definition is a circumflex, in which
1158 case the subject character must not be in the set defined by the class.
1159 If a circumflex is actually required as a member of the class, ensure
1160 it is not the first character, or escape it with a backslash.
1161
1162 For example, the character class [aeiou] matches any lower case vowel,
1163 while [^aeiou] matches any character that is not a lower case vowel.
1164 Note that a circumflex is just a convenient notation for specifying the
1165 characters that are in the class by enumerating those that are not. A
1166 class that starts with a circumflex is not an assertion; it still con‐
1167 sumes a character from the subject string, and therefore it fails if
1168 the current pointer is at the end of the string.
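
These points apply equally to Python's re module, so they can be illustrated there without PCRE2. In particular, the last assignment shows that a negated class is not an assertion: it needs a character to consume.

```python
import re

vowel = re.compile(r"[aeiou]")
not_vowel = re.compile(r"[^aeiou]")

print(vowel.fullmatch("e") is not None)      # a lower case vowel
print(not_vowel.fullmatch("E") is not None)  # upper case is not in the set
print(not_vowel.fullmatch("\n") is not None) # a newline is just another character

# A circumflex that is not first in the class is a literal member.
literal_circumflex = re.fullmatch(r"[a^]", "^")

# Nothing follows "x", so the negated class has no character to consume.
at_end = re.search(r"x[^aeiou]", "x")
```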
1169
1170 Characters in a class may be specified by their code points using \o,
1171 \x, or \N{U+hh..} in the usual way. When caseless matching is set, any
1172 letters in a class represent both their upper case and lower case ver‐
1173 sions, so for example, a caseless [aeiou] matches "A" as well as "a",
1174 and a caseless [^aeiou] does not match "A", whereas a caseful version
1175 would. Note that there are two ASCII characters, K and S, that, in
1176 addition to their lower case ASCII equivalents, are case-equivalent
1177 with Unicode U+212A (Kelvin sign) and U+017F (long S) respectively when
1178 either PCRE2_UTF or PCRE2_UCP is set.
1179
1180 Characters that might indicate line breaks are never treated in any
1181 special way when matching character classes, whatever line-ending
1182 sequence is in use, and whatever setting of the PCRE2_DOTALL and
1183 PCRE2_MULTILINE options is used. A class such as [^a] always matches
1184 one of these characters.
1185
1186 The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
1187 \S, \v, \V, \w, and \W may appear in a character class, and add the
1188 characters that they match to the class. For example, [\dABCDEF]
1189 matches any hexadecimal digit. In UTF modes, the PCRE2_UCP option
1190 affects the meanings of \d, \s, \w and their upper case partners, just
1191 as it does when they appear outside a character class, as described in
1192 the section entitled "Generic character types" above. The escape
1193 sequence \b has a different meaning inside a character class; it
1194 matches the backspace character. The sequences \B, \R, and \X are not
1195 special inside a character class. Like any other unrecognized escape
1196 sequences, they cause an error. The same is true for \N when not fol‐
1197 lowed by an opening brace.
1198
1199 The minus (hyphen) character can be used to specify a range of charac‐
1200 ters in a character class. For example, [d-m] matches any letter
1201 between d and m, inclusive. If a minus character is required in a
1202 class, it must be escaped with a backslash or appear in a position
1203 where it cannot be interpreted as indicating a range, typically as the
1204 first or last character in the class, or immediately after a range. For
1205 example, [b-d-z] matches letters in the range b to d, a hyphen charac‐
1206 ter, or z.
1207
1208 Perl treats a hyphen as a literal if it appears before or after a POSIX
1209 class (see below) or before or after a character type escape such as
1210 \d or \H. However, unless the hyphen is the last character in the
1211 class, Perl outputs a warning in its warning mode, as this is most
1212 likely a user error. As PCRE2 has no facility for warning, an error is
1213 given in these cases.
1214
1215 It is not possible to have the literal character "]" as the end charac‐
1216 ter of a range. A pattern such as [W-]46] is interpreted as a class of
1217 two characters ("W" and "-") followed by a literal string "46]", so it
1218 would match "W46]" or "-46]". However, if the "]" is escaped with a
1219 backslash it is interpreted as the end of range, so [W-\]46] is inter‐
1220 preted as a class containing a range followed by two other characters.
1221 The octal or hexadecimal representation of "]" can also be used to end
1222 a range.
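
Python's re module parses these particular classes in the same way, so the hyphen and closing bracket rules can be tried out as follows. This is offered as an illustration; Python and PCRE2 differ in other corner cases of class syntax.

```python
import re

# A hyphen immediately after a range is a literal: b to d, "-", or "z".
hyphen_after_range = re.findall(r"[b-d-z]", "abcd-ez")

# "]" cannot end a range: the class is "W" or "-", followed by literal "46]".
literal_tail = re.compile(r"[W-]46]")

# An escaped "]" can end a range: the range W to "]", plus "4" and "6".
escaped_end = re.compile(r"[W-\]46]")

print(hyphen_after_range)
print(literal_tail.fullmatch("-46]") is not None)
print(escaped_end.fullmatch("Z") is not None, escaped_end.fullmatch("-") is None)
```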
1223
1224 Ranges normally include all code points between the start and end char‐
1225 acters, inclusive. They can also be used for code points specified
1226 numerically, for example [\000-\037]. Ranges can include any characters
1227 that are valid for the current mode. In any UTF mode, the so-called
1228 "surrogate" characters (those whose code points lie between 0xd800 and
1229 0xdfff inclusive) may not be specified explicitly by default (the
1230 PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables this check). How‐
1231 ever, ranges such as [\x{d7ff}-\x{e000}], which include the surrogates,
1232 are always permitted.
1233
1234 There is a special case in EBCDIC environments for ranges whose end
1235 points are both specified as literal letters in the same case. For com‐
1236 patibility with Perl, EBCDIC code points within the range that are not
1237 letters are omitted. For example, [h-k] matches only four characters,
1238 even though the codes for h and k are 0x88 and 0x92, a range of 11 code
1239 points. However, if the range is specified numerically, for example,
1240 [\x88-\x92] or [h-\x92], all code points are included.
1241
1242 If a range that includes letters is used when caseless matching is set,
1243 it matches the letters in either case. For example, [W-c] is equivalent
1244 to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if
1245 character tables for a French locale are in use, [\xc8-\xcb] matches
1246 accented E characters in both cases.
1247
1248 A circumflex can conveniently be used with the upper case character
1249 types to specify a more restricted set of characters than the matching
1250 lower case type. For example, the class [^\W_] matches any letter or
1251 digit, but not underscore, whereas [\w] includes underscore. A positive
1252 character class should be read as "something OR something OR ..." and a
1253 negative class as "NOT something AND NOT something AND NOT ...".
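
The [^\W_] idiom works the same way in Python's re module, which makes for a quick check of the "NOT something AND NOT something" reading. The variable names are illustrative.

```python
import re

subject = "a_1 -B"

# NOT a non-word character AND NOT underscore: letters and digits only.
letters_and_digits = re.findall(r"[^\W_]", subject)
# The positive class \w includes underscore.
word_characters = re.findall(r"[\w]", subject)

print(letters_and_digits)  # ['a', '1', 'B']
print(word_characters)     # ['a', '_', '1', 'B']
```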
1254
1255 The only metacharacters that are recognized in character classes are
1256 backslash, hyphen (only where it can be interpreted as specifying a
1257 range), circumflex (only at the start), opening square bracket (only
1258 when it can be interpreted as introducing a POSIX class name, or for a
1259 special compatibility feature - see the next two sections), and the
1260 terminating closing square bracket. However, escaping other non-
1261 alphanumeric characters does no harm.
1262
1264
1265 Perl supports the POSIX notation for character classes. This uses names
1266 enclosed by [: and :] within the enclosing square brackets. PCRE2 also
1267 supports this notation. For example,
1268
1269 [01[:alpha:]%]
1270
1271 matches "0", "1", any alphabetic character, or "%". The supported class
1272 names are:
1273
1274 alnum letters and digits
1275 alpha letters
1276 ascii character codes 0 - 127
1277 blank space or tab only
1278 cntrl control characters
1279 digit decimal digits (same as \d)
1280 graph printing characters, excluding space
1281 lower lower case letters
1282 print printing characters, including space
1283 punct printing characters, excluding letters and digits and space
1284 space white space (the same as \s from PCRE1 8.34)
1285 upper upper case letters
1286 word "word" characters (same as \w)
1287 xdigit hexadecimal digits
1288
1289 The default "space" characters are HT (9), LF (10), VT (11), FF (12),
1290 CR (13), and space (32). If locale-specific matching is taking place,
1291 the list of space characters may be different; there may be fewer or
1292 more of them. "Space" and \s match the same set of characters.
1293
1294 The name "word" is a Perl extension, and "blank" is a GNU extension
1295 from Perl 5.8. Another Perl extension is negation, which is indicated
1296 by a ^ character after the colon. For example,
1297
1298 [12[:^digit:]]
1299
1300 matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the
1301 POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
1302 these are not supported, and an error is given if they are encountered.
1303
1304 By default, characters with values greater than 127 do not match any of
1305 the POSIX character classes, although this may be different for charac‐
1306 ters in the range 128-255 when locale-specific matching is happening.
1307 However, if the PCRE2_UCP option is passed to pcre2_compile(), some of
1308 the classes are changed so that Unicode character properties are used.
1309 This is achieved by replacing certain POSIX classes with other
1310 sequences, as follows:
1311
1312 [:alnum:] becomes \p{Xan}
1313 [:alpha:] becomes \p{L}
1314 [:blank:] becomes \h
1315 [:cntrl:] becomes \p{Cc}
1316 [:digit:] becomes \p{Nd}
1317 [:lower:] becomes \p{Ll}
1318 [:space:] becomes \p{Xps}
1319 [:upper:] becomes \p{Lu}
1320 [:word:] becomes \p{Xwd}
1321
1322 Negated versions, such as [:^alpha:], use \P instead of \p. Three other
1323 POSIX classes are handled specially in UCP mode:
1324
1325 [:graph:] This matches characters that have glyphs that mark the page
1326 when printed. In Unicode property terms, it matches all char‐
1327 acters with the L, M, N, P, S, or Cf properties, except for:
1328
1329 U+061C Arabic Letter Mark
1330 U+180E Mongolian Vowel Separator
1331 U+2066 - U+2069 Various "isolate"s
1332
1333
1334 [:print:] This matches the same characters as [:graph:] plus space
1335 characters that are not controls, that is, characters with
1336 the Zs property.
1337
1338 [:punct:] This matches all characters that have the Unicode P (punctua‐
1339 tion) property, plus those characters with code points less
1340 than 256 that have the S (Symbol) property.
1341
1342 The other POSIX classes are unchanged, and match only characters with
1343 code points less than 256.
1344
1346
1347 In the POSIX.2 compliant library that was included in 4.4BSD Unix, the
1348 ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word"
1349 and "end of word". PCRE2 treats these items as follows:
1350
1351 [[:<:]] is converted to \b(?=\w)
1352 [[:>:]] is converted to \b(?<=\w)
1353
1354 Only these exact character sequences are recognized. A sequence such as
1355 [a[:<:]b] provokes an error for an unrecognized POSIX class name. This
1356 support is not compatible with Perl. It is provided to help migrations
1357 from other environments, and is best not used in any new patterns. Note
1358 that \b matches at the start and the end of a word (see "Simple asser‐
1359 tions" above), and in a Perl-style pattern the preceding or following
1360 character normally shows which is wanted, without the need for the
1361 assertions that are used above in order to give exactly the POSIX be‐
1362 haviour.
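
The two replacement sequences can be tried directly. The sketch below
uses Python's re module as a stand-in (an assumption; re does not accept
[[:<:]] or [[:>:]] itself, but it does support \b and lookaround). The
subject string is illustrative.

```python
import re

subject = "the cat"

# \b(?=\w): a word boundary that is followed by a word character,
# that is, the start of a word.
starts = [m.start() for m in re.finditer(r"\b(?=\w)", subject)]

# \b(?<=\w): a word boundary that is preceded by a word character,
# that is, the end of a word.
ends = [m.start() for m in re.finditer(r"\b(?<=\w)", subject)]

# Plain \b reports both kinds of position.
both = [m.start() for m in re.finditer(r"\b", subject)]

print(starts)  # [0, 4]
print(ends)    # [3, 7]
print(both)    # [0, 3, 4, 7]
```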
1363
1365
1366 Vertical bar characters are used to separate alternative patterns. For
1367 example, the pattern
1368
1369 gilbert|sullivan
1370
1371 matches either "gilbert" or "sullivan". Any number of alternatives may
1372 appear, and an empty alternative is permitted (matching the empty
1373 string). The matching process tries each alternative in turn, from left
1374 to right, and the first one that succeeds is used. If the alternatives
1375 are within a group (defined below), "succeeds" means matching the rest
1376 of the main pattern as well as the alternative in the group.
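
A minimal sketch of the left-to-right rule, using Python's re module as
a stand-in for PCRE2 (an assumption; both engines try alternatives in
order). The a|ab patterns are illustrative and not taken from the text
above.

```python
import re

# Either alternative can match.
first = re.match(r"gilbert|sullivan", "sullivan").group()

# Alternatives are tried left to right, so "a" wins even though "ab"
# would match more of the subject.
leftmost = re.match(r"a|ab", "ab").group()

# Inside a group, "succeeds" includes the rest of the pattern: the "a"
# branch cannot be followed by "c" in "abc", so the "ab" branch is used.
in_group = re.fullmatch(r"(a|ab)c", "abc").group(1)

print(first)     # sullivan
print(leftmost)  # a
print(in_group)  # ab
```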
1377
1379
1380 The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
1381 PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options
1382 can be changed from within the pattern by a sequence of letters
1383 enclosed between "(?" and ")". These options are Perl-compatible, and
1384 are described in detail in the pcre2api documentation. The option let‐
1385 ters are:
1386
1387 i for PCRE2_CASELESS
1388 m for PCRE2_MULTILINE
1389 n for PCRE2_NO_AUTO_CAPTURE
1390 s for PCRE2_DOTALL
1391 x for PCRE2_EXTENDED
1392 xx for PCRE2_EXTENDED_MORE
1393
1394 For example, (?im) sets caseless, multiline matching. It is also possi‐
1395 ble to unset these options by preceding the relevant letters with a
1396 hyphen, for example (?-im). The two "extended" options are not indepen‐
1397 dent; unsetting either one cancels the effects of both of them.
1398
1399 A combined setting and unsetting such as (?im-sx), which sets
1400 PCRE2_CASELESS and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and
1401 PCRE2_EXTENDED, is also permitted. Only one hyphen may appear in the
1402 options string. If a letter appears both before and after the hyphen,
1403 the option is unset. An empty options setting "(?)" is allowed. Need‐
1404 less to say, it has no effect.
1405
1406 If the first character following (? is a circumflex, it causes all of
1407 the above options to be unset. Thus, (?^) is equivalent to (?-imnsx).
1408 Letters may follow the circumflex to cause some options to be re-
1409 instated, but a hyphen may not appear.
1410
1411 The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be
1412 changed in the same way as the Perl-compatible options by using the
1413 characters J and U respectively. However, these are not unset by (?^).
1414
1415 When one of these option changes occurs at top level (that is, not
1416 inside group parentheses), the change applies to the remainder of the
1417 pattern that follows. An option change within a group (see below for a
1418 description of groups) affects only that part of the group that follows
1419 it, so
1420
1421 (a(?i)b)c
1422
1423 matches abc and aBc and no other strings (assuming PCRE2_CASELESS is
1424 not used). By this means, options can be made to have different set‐
1425 tings in different parts of the pattern. Any changes made in one alter‐
1426 native do carry on into subsequent branches within the same group. For
1427 example,
1428
1429 (a(?i)b|c)
1430
1431 matches "ab", "aB", "c", and "C", even though when matching "C" the
1432 first branch is abandoned before the option setting. This is because
1433 the effects of option settings happen at compile time. There would be
1434 some very weird behaviour otherwise.
1435
1436 As a convenient shorthand, if any option settings are required at the
1437 start of a non-capturing group (see the next section), the option let‐
1438 ters may appear between the "?" and the ":". Thus the two patterns
1439
1440 (?i:saturday|sunday)
1441 (?:(?i)saturday|sunday)
1442
1443 match exactly the same set of strings.
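
The scoped form can be tried with Python's re module, used here as a
stand-in for PCRE2 (an assumption). Recent versions of re reject a bare
(?i) in the middle of a pattern, so only the (?i:...) form is shown. The
pattern a(?i:b)c is an illustrative rewrite of the (a(?i)b)c example
without the capture group.

```python
import re

# Option letters between "?" and ":" apply to the whole group.
weekend = re.compile(r"(?i:saturday|sunday)")
weekend_hits = [s for s in ("Saturday", "SUNDAY", "monday")
                if weekend.fullmatch(s)]

# Only the "b" is matched caselessly; "a" and "c" stay caseful.
partial = re.compile(r"a(?i:b)c")
partial_hits = [s for s in ("abc", "aBc", "Abc", "abC")
                if partial.fullmatch(s)]

print(weekend_hits)  # ['Saturday', 'SUNDAY']
print(partial_hits)  # ['abc', 'aBc']
```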
1444
1445 Note: There are other PCRE2-specific options, applying to the whole
1446 pattern, which can be set by the application when the compiling func‐
1447 tion is called. In addition, the pattern can contain special leading
1448 sequences such as (*CRLF) to override what the application has set or
1449 what has been defaulted. Details are given in the section entitled
1450 "Newline sequences" above. There are also the (*UTF) and (*UCP) leading
1451 sequences that can be used to set UTF and Unicode property modes; they
1452 are equivalent to setting the PCRE2_UTF and PCRE2_UCP options, respec‐
1453 tively. However, the application can set the PCRE2_NEVER_UTF and
1454 PCRE2_NEVER_UCP options, which lock out the use of the (*UTF) and
1455 (*UCP) sequences.
1456
1458
1459 Groups are delimited by parentheses (round brackets), which can be
1460 nested. Turning part of a pattern into a group does two things:
1461
1462 1. It localizes a set of alternatives. For example, the pattern
1463
1464 cat(aract|erpillar|)
1465
1466 matches "cataract", "caterpillar", or "cat". Without the parentheses,
1467 it would match "cataract", "erpillar" or an empty string.
1468
1469 2. It creates a "capture group". This means that, when the whole pat‐
1470 tern matches, the portion of the subject string that matched the group
1471 is passed back to the caller, separately from the portion that matched
1472 the whole pattern. (This applies only to the traditional matching
1473 function; the DFA matching function does not support capturing.)
1474
1475 Opening parentheses are counted from left to right (starting from 1) to
1476 obtain numbers for capture groups. For example, if the string "the red
1477 king" is matched against the pattern
1478
1479 the ((red|white) (king|queen))
1480
1481 the captured substrings are "red king", "red", and "king", and are num‐
1482 bered 1, 2, and 3, respectively.
1483
1484 The fact that plain parentheses fulfil two functions is not always
1485 helpful. There are often times when grouping is required without cap‐
1486 turing. If an opening parenthesis is followed by a question mark and a
1487 colon, the group does not do any capturing, and is not counted when
1488 computing the number of any subsequent capture groups. For example, if
1489 the string "the white queen" is matched against the pattern
1490
1491 the ((?:red|white) (king|queen))
1492
1493 the captured substrings are "white queen" and "queen", and are numbered
1494 1 and 2. The maximum number of capture groups is 65535.
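
The two numbering examples can be reproduced with Python's re module,
used here as a stand-in for PCRE2 (an assumption; both count opening
parentheses from left to right and skip (?: groups).

```python
import re

# Three capture groups, numbered by their opening parentheses.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
all_groups = capturing.groups()

# (?:red|white) groups without capturing, so "queen" becomes group 2.
mixed = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")
mixed_groups = mixed.groups()

print(all_groups)    # ('red king', 'red', 'king')
print(mixed_groups)  # ('white queen', 'queen')
```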
1495
1496 As a convenient shorthand, if any option settings are required at the
1497 start of a non-capturing group, the option letters may appear between
1498 the "?" and the ":". Thus the two patterns
1499
1500 (?i:saturday|sunday)
1501 (?:(?i)saturday|sunday)
1502
1503 match exactly the same set of strings. Because alternative branches are
1504 tried from left to right, and options are not reset until the end of
1505 the group is reached, an option setting in one branch does affect sub‐
1506 sequent branches, so the above patterns match "SUNDAY" as well as "Sat‐
1507 urday".
1508
1510
1511 Perl 5.10 introduced a feature whereby each alternative in a group uses
1512 the same numbers for its capturing parentheses. Such a group starts
1513 with (?| and is itself a non-capturing group. For example, consider
1514 this pattern:
1515
1516 (?|(Sat)ur|(Sun))day
1517
1518 Because the two alternatives are inside a (?| group, both sets of cap‐
1519 turing parentheses are numbered one. Thus, when the pattern matches,
1520 you can look at captured substring number one, whichever alternative
1521 matched. This construct is useful when you want to capture part, but
1522 not all, of one of a number of alternatives. Inside a (?| group, paren‐
1523 theses are numbered as usual, but the number is reset at the start of
1524 each branch. The numbers of any capturing parentheses that follow the
1525 whole group start after the highest number used in any branch. The fol‐
1526 lowing example is taken from the Perl documentation. The numbers under‐
1527 neath show in which buffer the captured content will be stored.
1528
1529 # before  ---------------branch-reset----------- after
1530 / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
1531 # 1            2         2  3        2     3     4
1532
1533 A backreference to a capture group uses the most recent value that is
1534 set for the group. The following pattern matches "abcabc" or "defdef":
1535
1536 /(?|(abc)|(def))\1/
1537
1538 In contrast, a subroutine call to a capture group always refers to the
1539 first one in the pattern with the given number. The following pattern
1540 matches "abcabc" or "defabc":
1541
1542 /(?|(abc)|(def))(?1)/
1543
1544 A relative reference such as (?-1) is no different: it is just a conve‐
1545 nient way of computing an absolute group number.
1546
1547 If a condition test for a group's having matched refers to a non-unique
1548 number, the test is true if any group with that number has matched.
1549
1550 An alternative approach to using this "branch reset" feature is to use
1551 duplicate named groups, as described in the next section.
1552
1554
1555 Identifying capture groups by number is simple, but it can be very hard
1556 to keep track of the numbers in complicated patterns. Furthermore, if
1557 an expression is modified, the numbers may change. To help with this
1558 difficulty, PCRE2 supports the naming of capture groups. This feature
1559 was not added to Perl until release 5.10. Python had the feature ear‐
1560 lier, and PCRE1 introduced it at release 4.0, using the Python syntax.
1561 PCRE2 supports both the Perl and the Python syntax.
1562
1563 In PCRE2, a capture group can be named in one of three ways:
1564 (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python.
1565 Names may be up to 32 code units long. When PCRE2_UTF is not set, they
1566 may contain only ASCII alphanumeric characters and underscores, but
1567 must start with a non-digit. When PCRE2_UTF is set, the syntax of group
1568 names is extended to allow any Unicode letter or Unicode decimal digit.
1569 In other words, group names must match one of these patterns:
1570
1571 ^[_A-Za-z][_A-Za-z0-9]*\z when PCRE2_UTF is not set
1572 ^[_\p{L}][_\p{L}\p{Nd}]*\z when PCRE2_UTF is set
1573
1574 References to capture groups from other parts of the pattern, such as
1575 backreferences, recursion, and conditions, can all be made by name as
1576 well as by number.
1577
1578 Named capture groups are allocated numbers as well as names, exactly as
1579 if the names were not present. In both PCRE2 and Perl, capture groups
1580 are primarily identified by numbers; any names are just aliases for
1581 these numbers. The PCRE2 API provides function calls for extracting the
1582 complete name-to-number translation table from a compiled pattern, as
1583 well as convenience functions for extracting captured substrings by
1584 name.
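
The idea that names are aliases for numbers can be sketched with
Python's re module, which accepts the (?P<name>...) syntax. This is a
stand-in and not the PCRE2 API: groupindex only plays the role of the
name-to-number table here, and the group names "year" and "month" are
hypothetical.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.fullmatch("2024-06")

# The same group can be fetched by name or by number.
by_name = m.group("year")
by_number = m.group(1)

# The name-to-number mapping for the compiled pattern.
name_table = dict(date.groupindex)

print(by_name, by_number)  # 2024 2024
print(name_table)          # {'year': 1, 'month': 2}
```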
1585
1586 Warning: When more than one capture group has the same number, as
1587 described in the previous section, a name given to one of them applies
1588 to all of them. Perl allows identically numbered groups to have differ‐
1589 ent names. Consider this pattern, where there are two capture groups,
1590 both numbered 1:
1591
1592 (?|(?<AA>aa)|(?<BB>bb))
1593
1594 Perl allows this, with both names AA and BB as aliases of group 1.
1595 Thus, after a successful match, both names yield the same value (either
1596 "aa" or "bb").
1597
1598 In an attempt to reduce confusion, PCRE2 does not allow the same group
1599 number to be associated with more than one name. The example above pro‐
1600 vokes a compile-time error. However, there is still scope for confu‐
1601 sion. Consider this pattern:
1602
1603 (?|(?<AA>aa)|(bb))
1604
1605 Although the second group number 1 is not explicitly named, the name AA
1606 is still an alias for any group 1. Whether the pattern matches "aa" or
1607 "bb", a reference by name to group AA yields the matched string.
1608
1609 By default, a name must be unique within a pattern, except that dupli‐
1610 cate names are permitted for groups with the same number, for example:
1611
1612 (?|(?<AA>aa)|(?<AA>bb))
1613
1614 The duplicate name constraint can be disabled by setting the PCRE2_DUP‐
1615 NAMES option at compile time, or by the use of (?J) within the pattern,
1616 as described in the section entitled "Internal Option Setting" above.
1617
1618 Duplicate names can be useful for patterns where only one instance of
1619 the named capture group can match. Suppose you want to match the name
1620 of a weekday, either as a 3-letter abbreviation or as the full name,
1621 and in both cases you want to extract the abbreviation. This pattern
1622 (ignoring the line breaks) does the job:
1623
1624 (?J)
1625 (?<DN>Mon|Fri|Sun)(?:day)?|
1626 (?<DN>Tue)(?:sday)?|
1627 (?<DN>Wed)(?:nesday)?|
1628 (?<DN>Thu)(?:rsday)?|
1629 (?<DN>Sat)(?:urday)?
1630
1631 There are five capture groups, but only one is ever set after a match.
1632 The convenience functions for extracting the data by name return the
1633 substring for the first (and in this example, the only) group of that
1634 name that matched. This saves searching to find which numbered group it
1635 was. (An alternative way of solving this problem is to use a "branch
1636 reset" group, as described in the previous section.)
1637
1638 If you make a backreference to a non-unique named group from elsewhere
1639 in the pattern, the groups to which the name refers are checked in the
1640 order in which they appear in the overall pattern. The first one that
1641 is set is used for the reference. For example, this pattern matches
1642 both "foofoo" and "barbar" but not "foobar" or "barfoo":
1643
1644 (?J)(?:(?<n>foo)|(?<n>bar))\k<n>
1645
1646
1647 If you make a subroutine call to a non-unique named group, the one that
1648 corresponds to the first occurrence of the name is used. In the absence
1649 of duplicate numbers this is the one with the lowest number.
1650
1651 If you use a named reference in a condition test (see the section about
1652 conditions below), either to check whether a capture group has matched,
1653 or to check for recursion, all groups with the same name are tested. If
1654 the condition is true for any one of them, the overall condition is
1655 true. This is the same behaviour as testing by number. For further
1656 details of the interfaces for handling named capture groups, see the
1657 pcre2api documentation.
1658
1660
1661 Repetition is specified by quantifiers, which can follow any of the
1662 following items:
1663
1664 a literal data character
1665 the dot metacharacter
1666 the \C escape sequence
1667 the \R escape sequence
1668 the \X escape sequence
1669 an escape such as \d or \pL that matches a single character
1670 a character class
1671 a backreference
1672 a parenthesized group (including lookaround assertions)
1673 a subroutine call (recursive or otherwise)
1674
1675 The general repetition quantifier specifies a minimum and maximum num‐
1676 ber of permitted matches, by giving the two numbers in curly brackets
1677 (braces), separated by a comma. The numbers must be less than 65536,
1678 and the first must be less than or equal to the second. For example,
1679
1680 z{2,4}
1681
1682 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
1683 special character. If the second number is omitted, but the comma is
1684 present, there is no upper limit; if the second number and the comma
1685 are both omitted, the quantifier specifies an exact number of required
1686 matches. Thus
1687
1688 [aeiou]{3,}
1689
1690 matches at least 3 successive vowels, but may match many more, whereas
1691
1692 \d{8}
1693
1694 matches exactly 8 digits. An opening curly bracket that appears in a
1695 position where a quantifier is not allowed, or one that does not match
1696 the syntax of a quantifier, is taken as a literal character. For exam‐
1697 ple, {,6} is not a quantifier, but a literal string of four characters.
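
The three quantifier forms can be checked with a small sketch in
Python's re module, used as a stand-in for PCRE2 (an assumption). The
helper name "full" and the test strings are illustrative. Python treats
{,6} as {0,6}, unlike the behaviour described above, so that case is
left out.

```python
import re

def full(pattern, subject):
    """Return True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None

# {2,4}: between two and four repeats.
bounded = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz")
           if full(r"z{2,4}", s)]

# {3,}: at least three repeats, no upper limit.
open_ended = [s for s in ("ae", "aei", "aeiou")
              if full(r"[aeiou]{3,}", s)]

# {8}: exactly eight repeats.
exact = [s for s in ("1234567", "12345678", "123456789")
         if full(r"\d{8}", s)]

print(bounded)     # ['zz', 'zzz', 'zzzz']
print(open_ended)  # ['aei', 'aeiou']
print(exact)       # ['12345678']
```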
1698
1699 In UTF modes, quantifiers apply to characters rather than to individual
1700 code units. Thus, for example, \x{100}{2} matches two characters, each
1701 of which is represented by a two-byte sequence in a UTF-8 string. Simi‐
1702 larly, \X{3} matches three Unicode extended grapheme clusters, each of
1703 which may be several code units long (and they may be of different
1704 lengths).
1705
1706 The quantifier {0} is permitted, causing the expression to behave as if
1707 the previous item and the quantifier were not present. This may be use‐
1708 ful for capture groups that are referenced as subroutines from else‐
1709 where in the pattern (but see also the section entitled "Defining cap‐
1710 ture groups for use by reference only" below). Except for parenthesized
1711 groups, items that have a {0} quantifier are omitted from the compiled
1712 pattern.
1713
1714 For convenience, the three most common quantifiers have single-charac‐
1715 ter abbreviations:
1716
1717 * is equivalent to {0,}
1718 + is equivalent to {1,}
1719 ? is equivalent to {0,1}
1720
1721 It is possible to construct infinite loops by following a group that
1722 can match no characters with a quantifier that has no upper limit, for
1723 example:
1724
1725 (a?)*
1726
1727 Earlier versions of Perl and PCRE1 used to give an error at compile
1728 time for such patterns. However, because there are cases where this can
1729 be useful, such patterns are now accepted, but whenever an iteration of
1730 such a group matches no characters, matching moves on to the next item
1731 in the pattern instead of repeatedly matching an empty string. This
1732 does not prevent backtracking into any of the iterations if a subse‐
1733 quent item fails to match.
1734
1735 By default, quantifiers are "greedy", that is, they match as much as
1736 possible (up to the maximum number of permitted times), without causing
1737 the rest of the pattern to fail. The classic example of where this
1738 gives problems is in trying to match comments in C programs. These
1739 appear between /* and */ and within the comment, individual * and /
1740 characters may appear. An attempt to match C comments by applying the
1741 pattern
1742
1743 /\*.*\*/
1744
1745 to the string
1746
1747 /* first comment */ not comment /* second comment */
1748
1749 fails, because it matches the entire string owing to the greediness of
1750 the .* item. However, if a quantifier is followed by a question mark,
1751 it ceases to be greedy, and instead matches the minimum number of times
1752 possible, so the pattern
1753
1754 /\*.*?\*/
1755
1756 does the right thing with the C comments. The meaning of the various
1757 quantifiers is not otherwise changed, just the preferred number of
1758 matches. Do not confuse this use of question mark with its use as a
1759 quantifier in its own right. Because it has two uses, it can sometimes
1760 appear doubled, as in
1761
1762 \d??\d
1763
1764 which matches one digit by preference, but can match two if that is the
1765 only way the rest of the pattern matches.
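
The difference between greedy and minimizing repeats can be seen with
the C comment example. The sketch uses Python's re module as a stand-in
for PCRE2 (an assumption; both treat a trailing question mark on a
quantifier the same way).

```python
import re

source = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last */ in the string.
greedy = re.findall(r"/\*.*\*/", source)

# Minimizing .*? stops at the first */ it can reach.
lazy = re.findall(r"/\*.*?\*/", source)

# \d??\d prefers one digit, but takes two when the rest requires it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()

print(greedy == [source])     # True
print(lazy)                   # ['/* first comment */', '/* second comment */']
print(one_digit, two_digits)  # 1 12
```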
1766
1767 If the PCRE2_UNGREEDY option is set (an option that is not available in
1768 Perl), the quantifiers are not greedy by default, but individual ones
1769 can be made greedy by following them with a question mark. In other
1770 words, it inverts the default behaviour.
1771
1772 When a parenthesized group is quantified with a minimum repeat count
1773 that is greater than 1 or with a limited maximum, more memory is
1774 required for the compiled pattern, in proportion to the size of the
1775 minimum or maximum.
1776
1777 If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option
1778 (equivalent to Perl's /s) is set, thus allowing the dot to match new‐
1779 lines, the pattern is implicitly anchored, because whatever follows
1780 will be tried against every character position in the subject string,
1781 so there is no point in retrying the overall match at any position
1782 after the first. PCRE2 normally treats such a pattern as though it were
1783 preceded by \A.
1784
1785 In cases where it is known that the subject string contains no new‐
1786 lines, it is worth setting PCRE2_DOTALL in order to obtain this opti‐
1787 mization, or alternatively, using ^ to indicate anchoring explicitly.
1788
1789 However, there are some cases where the optimization cannot be used.
1790 When .* is inside capturing parentheses that are the subject of a
1791 backreference elsewhere in the pattern, a match at the start may fail
1792 where a later one succeeds. Consider, for example:
1793
1794 (.*)abc\1
1795
1796 If the subject is "xyz123abc123" the match point is the fourth charac‐
1797 ter. For this reason, such a pattern is not implicitly anchored.
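
The match point can be confirmed with a short sketch. It uses Python's
re module as a stand-in (an assumption; the sketch says nothing about
PCRE2's internal optimization, only about where the match is found).

```python
import re

pattern = r"(.*)abc\1"
subject = "xyz123abc123"

# An unanchored search finds the match at offset 3, the fourth character.
found = re.search(pattern, subject)
start = found.start()
whole = found.group()

# Forcing the match to begin at offset 0 fails, which is why the
# pattern cannot be treated as implicitly anchored.
anchored = re.match(pattern, subject)

print(start, whole)  # 3 123abc123
print(anchored)      # None
```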
1798
1799 Another case where implicit anchoring is not applied is when the lead‐
1800 ing .* is inside an atomic group. Once again, a match at the start may
1801 fail where a later one succeeds. Consider this pattern:
1802
1803 (?>.*?a)b
1804
1805 It matches "ab" in the subject "aab". The use of the backtracking
1806 control verbs (*PRUNE) and (*SKIP) also disables this optimization, and
1807 there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
1808
1809 When a capture group is repeated, the value captured is the substring
1810 that matched the final iteration. For example, after
1811
1812 (tweedle[dume]{3}\s*)+
1813
1814 has matched "tweedledum tweedledee" the value of the captured substring
1815 is "tweedledee". However, if there are nested capture groups, the cor‐
1816 responding captured values may have been set in previous iterations.
1817 For example, after
1818
1819 (a|(b))+
1820
1821 matches "aba" the value of the second captured substring is "b".
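
Both cases can be reproduced with Python's re module, used here as a
stand-in for PCRE2 (an assumption; for these two patterns it keeps
captured values in the same way).

```python
import re

# The outer group is overwritten on each iteration, so it ends up
# holding the text of the final iteration.
outer = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = outer.group(1)

# Group 2 is not touched by the final "a" iteration, so it still holds
# the "b" that was captured in the previous one.
nested = re.fullmatch(r"(a|(b))+", "aba")
group_one = nested.group(1)
group_two = nested.group(2)

print(last_iteration)        # tweedledee
print(group_one, group_two)  # a b
```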
1822
1824
1825 With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
1826 repetition, failure of what follows normally causes the repeated item
1827 to be re-evaluated to see if a different number of repeats allows the
1828 rest of the pattern to match. Sometimes it is useful to prevent this,
1829 either to change the nature of the match, or to cause it to fail earlier
1830 than it otherwise might, when the author of the pattern knows there is
1831 no point in carrying on.
1832
1833 Consider, for example, the pattern \d+foo when applied to the subject
1834 line
1835
1836 123456bar
1837
1838 After matching all 6 digits and then failing to match "foo", the normal
1839 action of the matcher is to try again with only 5 digits matching the
1840 \d+ item, and then with 4, and so on, before ultimately failing.
1841 "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
1842 the means for specifying that once a group has matched, it is not to be
1843 re-evaluated in this way.
1844
1845 If we use atomic grouping for the previous example, the matcher gives
1846 up immediately on failing to match "foo" the first time. The notation
1847 is a kind of special parenthesis, starting with (?> as in this example:
1848
1849 (?>\d+)foo
1850
1851 Perl 5.28 introduced an experimental alphabetic form starting with (*
1852 which may be easier to remember:
1853
1854 (*atomic:\d+)foo
1855
1856 This kind of parenthesized group "locks up" the part of the pattern it
1857 contains once it has matched, and a failure further into the pattern is
1858 prevented from backtracking into it. Backtracking past it to previous
1859 items, however, works as normal.
1860
1861 An alternative description is that a group of this type matches exactly
1862 the string of characters that an identical standalone pattern would
1863 match, if anchored at the current point in the subject string.
1864
1865 Atomic groups are not capture groups. Simple cases such as the above
1866 example can be thought of as a maximizing repeat that must swallow
1867 everything it can. So, while both \d+ and \d+? are prepared to adjust
1868 the number of digits they match in order to make the rest of the pat‐
1869 tern match, (?>\d+) can only match an entire sequence of digits.
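
The effect on the result of a match can be sketched in Python's re
module. Two assumptions apply: re is only a stand-in for PCRE2, and
because older versions of re lack (?>...), the sketch swaps in the
lookahead-plus-backreference idiom (?=(\d+))\1, which behaves like an
atomic group because a lookahead is not re-entered after it succeeds.
The pattern with a trailing \d is illustrative.

```python
import re

subject = "1234"

# Ordinary greedy repeat: \d+ gives back one digit so that the final
# \d can match.
backtracking = re.search(r"\d+\d", subject)

# Stand-in for (?>\d+)\d. The lookahead captures the longest run of
# digits and \1 consumes exactly that run. No digit is ever given
# back, so the final \d has nothing left to match.
atomic_like = re.search(r"(?=(\d+))\1\d", subject)

print(backtracking.group())  # 1234
print(atomic_like)           # None
```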
1870
1871 Atomic groups in general can of course contain arbitrarily complicated
1872 expressions, and can be nested. However, when the contents of an atomic
1873 group is just a single repeated item, as in the example above, a sim‐
1874 pler notation, called a "possessive quantifier" can be used. This con‐
1875 sists of an additional + character following a quantifier. Using this
1876 notation, the previous example can be rewritten as
1877
1878 \d++foo
1879
1880 Note that a possessive quantifier can be used with an entire group, for
1881 example:
1882
1883 (abc|xyz){2,3}+
1884
1885 Possessive quantifiers are always greedy; the setting of the
1886 PCRE2_UNGREEDY option is ignored. They are a convenient notation for
1887 the simpler forms of atomic group. However, there is no difference in
1888 the meaning of a possessive quantifier and the equivalent atomic group,
1889 though there may be a performance difference; possessive quantifiers
1890 should be slightly faster.
1891
1892 The possessive quantifier syntax is an extension to the Perl 5.8 syn‐
1893 tax. Jeffrey Friedl originated the idea (and the name) in the first
1894 edition of his book. Mike McCloskey liked it, so implemented it when he
1895 built Sun's Java package, and PCRE1 copied it from there. It found its
1896 way into Perl at release 5.10.
1897
1898 PCRE2 has an optimization that automatically "possessifies" certain
1899 simple pattern constructs. For example, the sequence A+B is treated as
1900 A++B because there is no point in backtracking into a sequence of A's
1901 when B must follow. This feature can be disabled by setting the
1902 PCRE2_NO_AUTO_POSSESS option, or by starting the pattern with (*NO_AUTO_POSSESS).
1903
1904 When a pattern contains an unlimited repeat inside a group that can
1905 itself be repeated an unlimited number of times, the use of an atomic
1906 group is the only way to avoid some failing matches taking a very long
1907 time indeed. The pattern
1908
1909 (\D+|<\d+>)*[!?]
1910
1911 matches an unlimited number of substrings that either consist of non-
1912 digits, or digits enclosed in <>, followed by either ! or ?. When it
1913 matches, it runs quickly. However, if it is applied to
1914
1915 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1916
1917 it takes a long time before reporting failure. This is because the
1918 string can be divided between the internal \D+ repeat and the external
1919 * repeat in a large number of ways, and all have to be tried. (The
1920 example uses [!?] rather than a single character at the end, because
1921 both PCRE2 and Perl have an optimization that allows for fast failure
1922 when a single character is used. They remember the last single charac‐
1923 ter that is required for a match, and fail early if it is not present
1924 in the string.) If the pattern is changed so that it uses an atomic
1925 group, like this:
1926
1927 ((?>\D+)|<\d+>)*[!?]
1928
1929 sequences of non-digits cannot be broken, and failure happens quickly.
1930
1932
1933 Outside a character class, a backslash followed by a digit greater than
1934 0 (and possibly further digits) is a backreference to a capture group
1935 earlier (that is, to its left) in the pattern, provided there have been
1936 that many previous capture groups.
1937
1938 However, if the decimal number following the backslash is less than 8,
1939 it is always taken as a backreference, and causes an error only if
1940 there are not that many capture groups in the entire pattern. In other
1941 words, the group that is referenced need not be to the left of the ref‐
1942 erence for numbers less than 8. A "forward backreference" of this type
1943 can make sense when a repetition is involved and the group to the right
1944 has participated in an earlier iteration.
1945
1946 It is not possible to have a numerical "forward backreference" to a
1947 group whose number is 8 or more using this syntax because a sequence
1948 such as \50 is interpreted as a character defined in octal. See the
1949 subsection entitled "Non-printing characters" above for further details
1950 of the handling of digits following a backslash. Other forms of back‐
1951 referencing do not suffer from this restriction. In particular, there
1952 is no problem when named capture groups are used (see below).
1953
1954 Another way of avoiding the ambiguity inherent in the use of digits
1955 following a backslash is to use the \g escape sequence. This escape
1956 must be followed by a signed or unsigned number, optionally enclosed in
1957 braces. These examples are all equivalent:
1958
1959 (ring), \1
1960 (ring), \g1
1961 (ring), \g{1}
1962
1963 An unsigned number specifies an absolute reference without the ambigu‐
1964 ity that is present in the older syntax. It is also useful when literal
1965 digits follow the reference. A signed number is a relative reference.
1966 Consider this example:
1967
1968 (abc(def)ghi)\g{-1}
1969
1970 The sequence \g{-1} is a reference to the most recently started capture
1971 group before \g, that is, it is equivalent to \2 in this example.
1972 Similarly, \g{-2} would be equivalent to \1. The use of relative references
1973 can be helpful in long patterns, and also in patterns that are created
1974 by joining together fragments that contain references within them‐
1975 selves.
1976
1977 The sequence \g{+1} is a reference to the next capture group. This kind
1978 of forward reference can be useful in patterns that repeat. Perl does
1979 not support the use of + in this way.
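
The arithmetic behind relative references can be sketched in a few lines
of Python. This is only an illustration: the function name
resolve_relative and its groups_opened parameter are hypothetical, and
PCRE2 does this bookkeeping internally while compiling the pattern.

```python
def resolve_relative(offset, groups_opened):
    """Convert a signed relative reference to an absolute group number.

    groups_opened is the number of capture groups whose opening
    parenthesis appears before the reference (a hypothetical input).
    """
    if offset == 0:
        # This sketch treats a zero offset as invalid.
        raise ValueError("a relative reference cannot be zero")
    if offset < 0:
        # -1 is the most recently started group, -2 the one before it.
        absolute = groups_opened + offset + 1
    else:
        # +1 is the next group to be opened after the reference.
        absolute = groups_opened + offset
    if absolute < 1:
        raise ValueError("reference to a non-existent group")
    return absolute
```

For (abc(def)ghi)\g{-1}, two groups have been opened before the
reference, so resolve_relative(-1, 2) gives 2 and resolve_relative(-2, 2)
gives 1, while a +1 reference at the same place would refer to group 3.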
1980
1981 A backreference matches whatever actually most recently matched the
1982 capture group in the current subject string, rather than anything at
1983 all that matches the group (see "Groups as subroutines" below for a way
1984 of doing that). So the pattern
1985
1986 (sens|respons)e and \1ibility
1987
1988 matches "sense and sensibility" and "response and responsibility", but
1989 not "sense and responsibility". If caseful matching is in force at the
1990 time of the backreference, the case of letters is relevant. For exam‐
1991 ple,
1992
1993 ((?i)rah)\s+\1
1994
1995 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
1996 original capture group is matched caselessly.
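
The two examples above can be tried in any engine that shares these
backreference semantics. The following minimal sketch uses Python's
built-in re module as a stand-in for PCRE2 (an assumption made so that
the example is self-contained). The inline option is written in the
scoped form (?i:...) because Python does not accept (?i) in the middle
of a pattern; the helper name matches is hypothetical.

```python
import re

# Python's re module stands in for PCRE2 here; it is not the PCRE2 engine.
same_word = re.compile(r"(sens|respons)e and \1ibility")


def matches(regex, subject):
    """Return True if the whole subject matches the compiled regex."""
    return regex.fullmatch(subject) is not None


ok_1 = matches(same_word, "sense and sensibility")
ok_2 = matches(same_word, "response and responsibility")
bad = matches(same_word, "sense and responsibility")

# The group is matched caselessly, but the backreference is caseful.
rah = re.compile(r"((?i:rah))\s+\1")
rah_results = {s: matches(rah, s) for s in ("rah rah", "RAH RAH", "RAH rah")}
```

Only "RAH rah" is rejected by the second pattern, because the text that
the backreference must repeat is "RAH" exactly as it was captured.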
1997
1998 There are several different ways of writing backreferences to named
1999 capture groups. The .NET syntax \k{name} and the Perl syntax \k<name>
2000 or \k'name' are supported, as is the Python syntax (?P=name). Perl
2001 5.10's unified backreference syntax, in which \g can be used for both
2002 numeric and named references, is also supported. We could rewrite the
2003 above example in any of the following ways:
2004
2005 (?<p1>(?i)rah)\s+\k<p1>
2006 (?'p1'(?i)rah)\s+\k{p1}
2007 (?P<p1>(?i)rah)\s+(?P=p1)
2008 (?<p1>(?i)rah)\s+\g{p1}
2009
2010 A capture group that is referenced by name may appear in the pattern
2011 before or after the reference.
2012
2013 There may be more than one backreference to the same group. If a group
2014 has not actually been used in a particular match, backreferences to it
2015 always fail by default. For example, the pattern
2016
2017 (a|(bc))\2
2018
2019 always fails if it starts to match "a" rather than "bc". However, if
2020 the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backref‐
2021 erence to an unset value matches an empty string.
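
The default behaviour can be observed with a short sketch. It uses
Python's re module, which treats a backreference to an unset group the
same way; Python has no counterpart of PCRE2_MATCH_UNSET_BACKREF, so
only the default case is shown.

```python
import re

# Python's re module, used as a stand-in for PCRE2's default behaviour.
pattern = re.compile(r"(a|(bc))\2")

# Group 2 is set to "bc", so \2 must match a second "bc".
took_bc = pattern.fullmatch("bcbc")

# Group 1 matched "a" and group 2 is unset, so \2 cannot match at all.
took_a = pattern.fullmatch("a")
no_match_anywhere = pattern.search("aa")
```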
2022
2023 Because there may be many capture groups in a pattern, all digits fol‐
2024 lowing a backslash are taken as part of a potential backreference num‐
2025 ber. If the pattern continues with a digit character, some delimiter
2026 must be used to terminate the backreference. If the PCRE2_EXTENDED or
2027 PCRE2_EXTENDED_MORE option is set, this can be white space. Otherwise,
2028 the \g{} syntax or an empty comment (see "Comments" below) can be used.
2029
2030 Recursive backreferences
2031
2032 A backreference that occurs inside the group to which it refers fails
2033 when the group is first used, so, for example, (a\1) never matches.
2034 However, such references can be useful inside repeated groups. For
2035 example, the pattern
2036
2037 (a|b\1)+
2038
2039 matches any number of "a"s and also "aba", "ababbaa" etc. At each iter‐
2040 ation of the group, the backreference matches the character string cor‐
2041 responding to the previous iteration. In order for this to work, the
2042 pattern must be such that the first iteration does not need to match
2043 the backreference. This can be done using alternation, as in the exam‐
2044 ple above, or by a quantifier with a minimum of zero.
2045
2046 In versions of PCRE2 earlier than 10.25, backreferences of this type
2047 caused the group that they reference to be treated as an atomic
2048 group. This restriction no longer applies, and backtracking into such
2049 groups can occur as normal.
2050
2052
2053 An assertion is a test on the characters following or preceding the
2054 current matching point that does not consume any characters. The simple
2055 assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described
2056 above.
2057
2058 More complicated assertions are coded as parenthesized groups. There
2059 are two kinds: those that look ahead of the current position in the
2060 subject string, and those that look behind it, and in each case an
2061 assertion may be positive (must match for the assertion to be true) or
2062 negative (must not match for the assertion to be true). An assertion
2063 group is matched in the normal way, and if it is true, matching contin‐
2064 ues after it, but with the matching position in the subject string
2065 reset to what it was before the assertion was processed.
2066
2067 The Perl-compatible lookaround assertions are atomic. If an assertion
2068 is true, but there is a subsequent matching failure, there is no back‐
2069 tracking into the assertion. However, there are some cases where non-
2070 atomic assertions can be useful. PCRE2 has some support for these,
2071 described in the section entitled "Non-atomic assertions" below, but
2072 they are not Perl-compatible.
2073
2074 A lookaround assertion may appear as the condition in a conditional
2075 group (see below). In this case, the result of matching the assertion
2076 determines which branch of the condition is followed.
2077
2078 Assertion groups are not capture groups. If an assertion contains cap‐
2079 ture groups within it, these are counted for the purposes of numbering
2080 the capture groups in the whole pattern. Within each branch of an
2081 assertion, locally captured substrings may be referenced in the usual
2082 way. For example, a sequence such as (.)\g{-1} can be used to check
2083 that two adjacent characters are the same.
2084
2085 When a branch within an assertion fails to match, any substrings that
2086 were captured are discarded (as happens with any pattern branch that
2087 fails to match). A negative assertion is true only when all its
2088 branches fail to match; this means that no captured substrings are ever
2089 retained after a successful negative assertion. When an assertion con‐
2090 tains a matching branch, what happens depends on the type of assertion.
2091
2092 For a positive assertion, internally captured substrings in the suc‐
2093 cessful branch are retained, and matching continues with the next pat‐
2094 tern item after the assertion. For a negative assertion, a matching
2095 branch means that the assertion is not true. If such an assertion is
2096 being used as a condition in a conditional group (see below), captured
2097 substrings are retained, because matching continues with the "no"
2098 branch of the condition. For other failing negative assertions, control
2099 passes to the previous backtracking point, thus discarding any captured
2100 strings within the assertion.
2101
2102 Most assertion groups may be repeated; though it makes no sense to
2103 assert the same thing several times, the side effect of capturing in
2104 positive assertions may occasionally be useful. However, an assertion
2105 that forms the condition for a conditional group may not be quantified.
2106 PCRE2 used to restrict the repetition of assertions, but from release
2107 10.35 the only restriction is that an unlimited maximum repetition is
2108 changed to be one more than the minimum. For example, {3,} is treated
2109 as {3,4}.
2110
2111 Alphabetic assertion names
2112
2113 Traditionally, symbolic sequences such as (?= and (?<= have been used
2114 to specify lookaround assertions. Perl 5.28 introduced some experimen‐
2115 tal alphabetic alternatives which might be easier to remember. They all
2116 start with (* instead of (? and must be written using lower case let‐
2117 ters. PCRE2 supports the following synonyms:
2118
2119 (*positive_lookahead: or (*pla: is the same as (?=
2120 (*negative_lookahead: or (*nla: is the same as (?!
2121 (*positive_lookbehind: or (*plb: is the same as (?<=
2122 (*negative_lookbehind: or (*nlb: is the same as (?<!
2123
2124 For example, (*pla:foo) is the same assertion as (?=foo). In the fol‐
2125 lowing sections, the various assertions are described using the origi‐
2126 nal symbolic forms.
2127
2128 Lookahead assertions
2129
2130 Lookahead assertions start with (?= for positive assertions and (?! for
2131 negative assertions. For example,
2132
2133 \w+(?=;)
2134
2135 matches a word followed by a semicolon, but does not include the semi‐
2136 colon in the match, and
2137
2138 foo(?!bar)
2139
2140 matches any occurrence of "foo" that is not followed by "bar". Note
2141 that the apparently similar pattern
2142
2143 (?!foo)bar
2144
2145 does not find an occurrence of "bar" that is preceded by something
2146 other than "foo"; it finds any occurrence of "bar" whatsoever, because
2147 the assertion (?!foo) is always true when the next three characters are
2148 "bar". A lookbehind assertion is needed to achieve the other effect.
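
The difference between these patterns is easy to check. The sketch below
uses Python's re module, whose lookahead and lookbehind assertions behave
in the same way for these simple cases; the subject strings are invented
for the illustration.

```python
import re

subject = "foobar bazbar"

# Both occurrences of "bar" are found, including the one preceded by "foo",
# because (?!foo) only inspects the text starting at the current position.
lookahead_hits = [m.start() for m in re.finditer(r"(?!foo)bar", subject)]

# A negative lookbehind gives the effect that was probably intended.
lookbehind_hits = [m.start() for m in re.finditer(r"(?<!foo)bar", subject)]

# foo(?!bar) finds only a "foo" that is not followed by "bar".
foo_hits = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]
```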
2149
2150 If you want to force a matching failure at some point in a pattern, the
2151 most convenient way to do it is with (?!) because an empty string
2152 always matches, so an assertion that requires there not to be an empty
2153 string must always fail. The backtracking control verb (*FAIL) or (*F)
2154 is a synonym for (?!).
2155
2156 Lookbehind assertions
2157
2158 Lookbehind assertions start with (?<= for positive assertions and (?<!
2159 for negative assertions. For example,
2160
2161 (?<!foo)bar
2162
2163 does find an occurrence of "bar" that is not preceded by "foo". The
2164 contents of a lookbehind assertion are restricted such that all the
2165 strings it matches must have a fixed length. However, if there are sev‐
2166 eral top-level alternatives, they do not all have to have the same
2167 fixed length. Thus
2168
2169 (?<=bullock|donkey)
2170
2171 is permitted, but
2172
2173 (?<!dogs?|cats?)
2174
2175 causes an error at compile time. Branches that match different length
2176 strings are permitted only at the top level of a lookbehind assertion.
2177 This is an extension compared with Perl, which requires all branches to
2178 match the same length of string. An assertion such as
2179
2180 (?<=ab(c|de))
2181
2182 is not permitted, because its single top-level branch can match two
2183 different lengths, but it is acceptable to PCRE2 if rewritten to use
2184 two top-level branches:
2185
2186 (?<=abc|abde)
2187
2188 In some cases, the escape sequence \K (see above) can be used instead
2189 of a lookbehind assertion to get round the fixed-length restriction.
2190
2191 The implementation of lookbehind assertions is, for each alternative,
2192 to temporarily move the current position back by the fixed length and
2193 then try to match. If there are insufficient characters before the cur‐
2194 rent position, the assertion fails.
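
The procedure just described can be modelled in a few lines. This is a
simplified sketch and not PCRE2's code: the function name lookbehind is
hypothetical, and each alternative is restricted to a literal string so
that its fixed length is simply the length of the string.

```python
def lookbehind(subject, position, alternatives):
    """Model a positive lookbehind whose branches are literal strings.

    Each alternative has its own fixed length, so each one is tried by
    stepping back that many characters from the current position.
    """
    for literal in alternatives:
        start = position - len(literal)
        if start < 0:
            # Not enough characters before the current position.
            continue
        if subject[start:position] == literal:
            return True
    return False


# Models (?<=bullock|donkey) at two different positions.
after_donkey = lookbehind("a donkey cart", 8, ["bullock", "donkey"])
too_early = lookbehind("donkey", 3, ["bullock", "donkey"])
```

Because each branch is tested separately, the branches may have
different lengths, which is why several top-level alternatives are
acceptable while a variable-length item inside one branch is not.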
2195
2196 In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which
2197 matches a single code unit even in a UTF mode) to appear in lookbehind
2198 assertions, because it makes it impossible to calculate the length of
2199 the lookbehind. The \X and \R escapes, which can match different num‐
2200 bers of code units, are never permitted in lookbehinds.
2201
2202 "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
2203 lookbehinds, as long as the called capture group matches a fixed-length
2204 string. However, recursion, that is, a "subroutine" call into a group
2205 that is already active, is not supported.
2206
2207 Perl does not support backreferences in lookbehinds. PCRE2 does support
2208 them, but only if certain conditions are met. The
2209 PCRE2_MATCH_UNSET_BACKREF option must not be set, there must be no use
2210 of (?| in the pattern (it creates duplicate group numbers), and if the
2211 backreference is by name, the name must be unique. Of course, the ref‐
2212 erenced group must itself match a fixed length substring. The following
2213 pattern matches words containing at least two characters that begin and
2214 end with the same character:
2215
2216 \b(\w)\w++(?<=\1)
2217
2218 Possessive quantifiers can be used in conjunction with lookbehind
2219 assertions to specify efficient matching of fixed-length strings at the
2220 end of subject strings. Consider a simple pattern such as
2221
2222 abcd$
2223
2224 when applied to a long string that does not match. Because matching
2225 proceeds from left to right, PCRE2 will look for each "a" in the sub‐
2226 ject and then see if what follows matches the rest of the pattern. If
2227 the pattern is specified as
2228
2229 ^.*abcd$
2230
2231 the initial .* matches the entire string at first, but when this fails
2232 (because there is no following "a"), it backtracks to match all but the
2233 last character, then all but the last two characters, and so on. Once
2234 again the search for "a" covers the entire string, from right to left,
2235 so we are no better off. However, if the pattern is written as
2236
2237 ^.*+(?<=abcd)
2238
2239 there can be no backtracking for the .*+ item because of the possessive
2240 quantifier; it can match only the entire string. The subsequent lookbe‐
2241 hind assertion does a single test on the last four characters. If it
2242 fails, the match fails immediately. For long strings, this approach
2243 makes a significant difference to the processing time.
2244
2245 Using multiple assertions
2246
2247 Several assertions (of any sort) may occur in succession. For example,
2248
2249 (?<=\d{3})(?<!999)foo
2250
2251 matches "foo" preceded by three digits that are not "999". Notice that
2252 each of the assertions is applied independently at the same point in
2253 the subject string. First there is a check that the previous three
2254 characters are all digits, and then there is a check that the same
2255 three characters are not "999". This pattern does not match "foo"
2256 preceded by six characters, the first three of which are digits and the
2257 last three of which are not "999". For example, it does not match
2258 "123abcfoo". A pattern to do that is
2259
2260 (?<=\d{3}...)(?<!999)foo
2261
2262 This time the first assertion looks at the preceding six characters,
2263 checking that the first three are digits, and then the second assertion
2264 checks that the preceding three characters are not "999".
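
A short sketch makes the difference between the two patterns concrete.
It uses Python's re module as a stand-in for PCRE2, which is adequate
here because every lookbehind involved has a single fixed length; the
helper name found is hypothetical.

```python
import re

# Both lookbehinds are tested at the same position, just before "foo".
last_three = re.compile(r"(?<=\d{3})(?<!999)foo")
# The first lookbehind now spans six characters; the second still spans three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")


def found(regex, subject):
    """Return True if the regex matches somewhere in the subject."""
    return regex.search(subject) is not None


subjects = ("123foo", "999foo", "123abcfoo", "123999foo")
results = {s: (found(last_three, s), found(six_back, s)) for s in subjects}
```

In results, "123foo" is accepted only by the first pattern and
"123abcfoo" only by the second; "999foo" and "123999foo" are rejected by
both because the three characters before "foo" are "999".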
2265
2266 Assertions can be nested in any combination. For example,
2267
2268 (?<=(?<!foo)bar)baz
2269
2270 matches an occurrence of "baz" that is preceded by "bar" which in turn
2271 is not preceded by "foo", while
2272
2273 (?<=\d{3}(?!999)...)foo
2274
2275 is another pattern that matches "foo" preceded by three digits and any
2276 three characters that are not "999".
2277
2279
2280 The traditional Perl-compatible lookaround assertions are atomic. That
2281 is, if an assertion is true, but there is a subsequent matching fail‐
2282 ure, there is no backtracking into the assertion. However, there are
2283 some cases where non-atomic positive assertions can be useful. PCRE2
2284 provides these using the following syntax:
2285
2286 (*non_atomic_positive_lookahead: or (*napla: or (?*
2287 (*non_atomic_positive_lookbehind: or (*naplb: or (?<*
2288
2289 Consider the problem of finding the right-most word in a string that
2290 also appears earlier in the string, that is, it must appear at least
2291 twice in total. This pattern returns the required result as captured
2292 substring 1:
2293
2294 ^(?x)(*napla: .* \b(\w++)) (?> .*? \b\1\b ){2}
2295
2296 For a subject such as "word1 word2 word3 word2 word3 word4" the result
2297 is "word3". How does it work? At the start, ^(?x) anchors the pattern
2298 and sets the "x" option, which causes white space (introduced for read‐
2299 ability) to be ignored. Inside the assertion, the greedy .* at first
2300 consumes the entire string, but then has to backtrack until the rest of
2301 the assertion can match a word, which is captured by group 1. In other
2302 words, when the assertion first succeeds, it captures the right-most
2303 word in the string.
2304
2305 The current matching point is then reset to the start of the subject,
2306 and the rest of the pattern match checks for two occurrences of the
2307 captured word, using an ungreedy .*? to scan from the left. If this
2308 succeeds, we are done, but if the last word in the string does not
2309 occur twice, this part of the pattern fails. If a traditional atomic
2310 lookahead (?= or (*pla: had been used, the assertion could not be
2311 re-entered, and the whole match would fail. The pattern would succeed only
2312 if the very last word in the subject was found twice.
2313
2314 Using a non-atomic lookahead, however, means that when the last word
2315 does not occur twice in the string, the lookahead can backtrack and
2316 find the second-last word, and so on, until either the match succeeds,
2317 or all words have been tested.
2318
2319 Two conditions must be met for a non-atomic assertion to be useful: the
2320 contents of one or more capturing groups must change after a backtrack
2321 into the assertion, and there must be a backreference to a changed
2322 group later in the pattern. If this is not the case, the rest of the
2323 pattern match fails exactly as before because nothing has changed, so
2324 using a non-atomic assertion just wastes resources.
2325
2326 There is one exception to backtracking into a non-atomic assertion. If
2327 an (*ACCEPT) control verb is triggered, the assertion succeeds atomi‐
2328 cally. That is, a subsequent match failure cannot backtrack into the
2329 assertion.
2330
2331 Non-atomic assertions are not supported by the alternative matching
2332 function pcre2_dfa_match(). They are supported by JIT, but only if they
2333 do not contain any control verbs such as (*ACCEPT). (This may change in
2334 the future.) Note that assertions that appear as conditions for conditional
2335 groups (see below) must be atomic.
2336
2338
2339 In concept, a script run is a sequence of characters that are all from
2340 the same Unicode script such as Latin or Greek. However, because some
2341 scripts are commonly used together, and because some diacritical and
2342 other marks are used with multiple scripts, it is not that simple.
2343 There is a full description of the rules that PCRE2 uses in the section
2344 entitled "Script Runs" in the pcre2unicode documentation.
2345
2346 If part of a pattern is enclosed between (*script_run: or (*sr: and a
2347 closing parenthesis, it fails if the sequence of characters that it
2348 matches is not a script run. After a failure, normal backtracking
2349 occurs. Script runs can be used to detect spoofing attacks using char‐
2350 acters that look the same, but are from different scripts. The string
2351 "paypal.com" is an infamous example, where the letters could be a mix‐
2352 ture of Latin and Cyrillic. This pattern ensures that the matched char‐
2353 acters in a sequence of non-spaces that follow white space are a script
2354 run:
2355
2356 \s+(*sr:\S+)
2357
2358 To be sure that they are all from the Latin script (for example), a
2359 lookahead can be used:
2360
2361 \s+(?=\p{Latin})(*sr:\S+)
2362
2363 This works as long as the first character is expected to be a character
2364 in that script, and not (for example) punctuation, which is allowed
2365 with any script. If this is not the case, a more creative lookahead is
2366 needed. For example, if digits, underscore, and dots are permitted at
2367 the start:
2368
2369 \s+(?=[0-9_.]*\p{Latin})(*sr:\S+)
2370
2371
2372 In many cases, backtracking into a script run pattern fragment is not
2373 desirable. The script run can employ an atomic group to prevent this.
2374 Because this is a common requirement, a shorthand notation is provided
2375 by (*atomic_script_run: or (*asr:
2376
2377 (*asr:...) is the same as (*sr:(?>...))
2378
2379 Note that the atomic group is inside the script run. Putting it outside
2380 would not prevent backtracking into the script run pattern.
2381
2382 Support for script runs is not available if PCRE2 is compiled without
2383 Unicode support. A compile-time error is given if any of the above con‐
2384 structs is encountered. Script runs are not supported by the alternate
2385 matching function, pcre2_dfa_match() because they use the same mecha‐
2386 nism as capturing parentheses.
2387
2388 Warning: The (*ACCEPT) control verb (see below) should not be used
2389 within a script run group, because it causes an immediate exit from the
2390 group, bypassing the script run checking.
2391
2393
2394 It is possible to cause the matching process to obey a pattern fragment
2395 conditionally or to choose between two alternative fragments, depending
2396 on the result of an assertion, or whether a specific capture group has
2397 already been matched. The two possible forms of conditional group are:
2398
2399 (?(condition)yes-pattern)
2400 (?(condition)yes-pattern|no-pattern)
2401
2402 If the condition is satisfied, the yes-pattern is used; otherwise the
2403 no-pattern (if present) is used. An absent no-pattern is equivalent to
2404 an empty string (it always matches). If there are more than two alter‐
2405 natives in the group, a compile-time error occurs. Each of the two
2406 alternatives may itself contain nested groups of any form, including
2407 conditional groups; the restriction to two alternatives applies only at
2408 the level of the condition itself. This pattern fragment is an example
2409 where the alternatives are complex:
2410
2411 (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
2412
2413
2414 There are five kinds of condition: references to capture groups, refer‐
2415 ences to recursion, two pseudo-conditions called DEFINE and VERSION,
2416 and assertions.
2417
2418 Checking for a used capture group by number
2419
2420 If the text between the parentheses consists of a sequence of digits,
2421 the condition is true if a capture group of that number has previously
2422 matched. If there is more than one capture group with the same number
2423 (see the earlier section about duplicate group numbers), the condition
2424 is true if any of them have matched. An alternative notation is to pre‐
2425 cede the digits with a plus or minus sign. In this case, the group num‐
2426 ber is relative rather than absolute. The most recently opened capture
2427 group can be referenced by (?(-1), the next most recent by (?(-2), and
2428 so on. Inside loops it can also make sense to refer to subsequent
2429 groups. The next capture group can be referenced as (?(+1), and so on.
2430 (The value zero in any of these forms is not used; it provokes a com‐
2431 pile-time error.)
2432
2433 Consider the following pattern, which contains non-significant white
2434 space to make it more readable (assume the PCRE2_EXTENDED option) and
2435 to divide it into three parts for ease of discussion:
2436
2437 ( \( )? [^()]+ (?(1) \) )
2438
2439 The first part matches an optional opening parenthesis, and if that
2440 character is present, sets it as the first captured substring. The sec‐
2441 ond part matches one or more characters that are not parentheses. The
2442 third part is a conditional group that tests whether or not the first
2443 capture group matched. If it did, that is, if the subject started with an
2444 opening parenthesis, the condition is true, and so the yes-pattern is
2445 executed and a closing parenthesis is required. Otherwise, since no-
2446 pattern is not present, the conditional group matches nothing. In other
2447 words, this pattern matches a sequence of non-parentheses, optionally
2448 enclosed in parentheses.
2449
2450 If you were embedding this pattern in a larger one, you could use a
2451 relative reference:
2452
2453 ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
2454
2455 This makes the fragment independent of the parentheses in the larger
2456 pattern.
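
The absolute form of this pattern can be exercised with Python's re
module, which accepts the same (?(1)...) syntax; re.VERBOSE plays the
role of PCRE2_EXTENDED. This is a sketch with a stand-in engine, and the
relative form (?(-1)...) is not available there. The helper name whole
is hypothetical.

```python
import re

# re.VERBOSE lets the pattern keep its non-significant white space.
optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)


def whole(subject):
    """Return True if the entire subject matches the pattern."""
    return optional_parens.fullmatch(subject) is not None


results = {s: whole(s) for s in ("abc", "(abc)", "(abc", "abc)")}
```

The first two subjects are accepted. "(abc" is rejected because group 1
matched and the closing parenthesis is then required, and "abc)" is
rejected because the conditional group matches nothing when group 1 is
unset.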
2457
2458 Checking for a used capture group by name
2459
2460 Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
2461 used capture group by name. For compatibility with earlier versions of
2462 PCRE1, which had this facility before Perl, the syntax (?(name)...) is
2463 also recognized. Note, however, that undelimited names consisting of
2464 the letter R followed by digits are ambiguous (see the following sec‐
2465 tion). Rewriting the above example to use a named group gives this:
2466
2467 (?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
2468
2469 If the name used in a condition of this kind is a duplicate, the test
2470 is applied to all groups of the same name, and is true if any one of
2471 them has matched.
2472
2473 Checking for pattern recursion
2474
2475 "Recursion" in this sense refers to any subroutine-like call from one
2476 part of the pattern to another, whether or not it is actually recur‐
2477 sive. See the sections entitled "Recursive patterns" and "Groups as
2478 subroutines" below for details of recursion and subroutine calls.
2479
2480 If a condition is the string (R), and there is no capture group with
2481 the name R, the condition is true if matching is currently in a recur‐
2482 sion or subroutine call to the whole pattern or any capture group. If
2483 digits follow the letter R, and there is no group with that name, the
2484 condition is true if the most recent call is into a group with the
2485 given number, which must exist somewhere in the overall pattern. This
2486 is a contrived example that is equivalent to a+b:
2487
2488 ((?(R1)a+|(?1)b))
2489
2490 However, in both cases, if there is a capture group with a matching
2491 name, the condition tests for its being set, as described in the sec‐
2492 tion above, instead of testing for recursion. For example, creating a
2493 group with the name R1 by adding (?<R1>) to the above pattern com‐
2494 pletely changes its meaning.
2495
2496 If a name preceded by ampersand follows the letter R, for example:
2497
2498 (?(R&name)...)
2499
2500 the condition is true if the most recent recursion is into a group of
2501 that name (which must exist within the pattern).
2502
2503 This condition does not check the entire recursion stack. It tests only
2504 the current level. If the name used in a condition of this kind is a
2505 duplicate, the test is applied to all groups of the same name, and is
2506 true if any one of them is the most recent recursion.
2507
2508 At "top level", all these recursion test conditions are false.
2509
2510 Defining capture groups for use by reference only
2511
2512 If the condition is the string (DEFINE), the condition is always false,
2513 even if there is a group with the name DEFINE. In this case, there may
2514 be only one alternative in the rest of the conditional group. It is
2515 always skipped if control reaches this point in the pattern; the idea
2516 of DEFINE is that it can be used to define subroutines that can be ref‐
2517 erenced from elsewhere. (The use of subroutines is described below.)
2518 For example, a pattern to match an IPv4 address such as
2519 "192.168.23.245" could be written like this (ignore white space and
2520 line breaks):
2521
2522 (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
2523 \b (?&byte) (\.(?&byte)){3} \b
2524
2525 The first part of the pattern is a DEFINE group inside which another
2526 group named "byte" is defined. This matches an individual component of
2527 an IPv4 address (a number less than 256). When matching takes place,
2528 this part of the pattern is skipped because DEFINE acts like a false
2529 condition. The rest of the pattern uses references to the named group
2530 to match the four dot-separated components of an IPv4 address, insist‐
2531 ing on a word boundary at each end.
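
For comparison, the same "define once, use several times" idea can be
approximated in an engine that lacks DEFINE by building the pattern from
a shared string. The sketch below does this with Python's re module; it
is not equivalent to a subroutine call, since the component is simply
copied into the pattern text, and the names BYTE and is_ipv4 are
hypothetical.

```python
import re

# One definition of the component, reused by interpolation. Python's re has
# no DEFINE condition or (?&name) call, so this only approximates the idea.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# re.ASCII restricts \d to the digits 0-9.
ipv4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b", re.ASCII)


def is_ipv4(text):
    """Return True if text is exactly one dotted IPv4 address."""
    return ipv4.fullmatch(text) is not None
```

With this sketch, is_ipv4("192.168.23.245") is true, while "256.1.1.1"
and "1.2.3" are rejected.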
2532
2533 Checking the PCRE2 version
2534
2535 Programs that link with a PCRE2 library can check the version by call‐
2536 ing pcre2_config() with appropriate arguments. Users of applications
2537 that do not have access to the underlying code cannot do this. A spe‐
2538 cial "condition" called VERSION exists to allow such users to discover
2539 which version of PCRE2 they are dealing with by using this condition to
2540 match a string such as "yesno". VERSION must be followed either by "="
2541 or ">=" and a version number. For example:
2542
2543 (?(VERSION>=10.4)yes|no)
2544
2545 This pattern matches "yes" if the PCRE2 version is greater than or equal
2546 to 10.4, or "no" otherwise. The fractional part of the version number may
2547 not contain more than two digits.
2548
2549 Assertion conditions
2550
2551 If the condition is not in any of the above formats, it must be a
2552 parenthesized assertion. This may be a positive or negative lookahead
2553 or lookbehind assertion. However, it must be a traditional atomic
2554 assertion, not one of the PCRE2-specific non-atomic assertions.
2555
2556 Consider this pattern, again containing non-significant white space,
2557 and with the two alternatives on the second line:
2558
2559 (?(?=[^a-z]*[a-z])
2560 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
2561
2562 The condition is a positive lookahead assertion that matches an
2563 optional sequence of non-letters followed by a letter. In other words,
2564 it tests for the presence of at least one letter in the subject. If a
2565 letter is found, the subject is matched against the first alternative;
2566 otherwise it is matched against the second. This pattern matches
2567 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
2568 letters and dd are digits.
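
The effect of an assertion condition can be emulated, in engines that do
not support it, by two branches guarded by complementary lookaheads. The
following sketch does this with Python's re module, which has no
assertion conditions; it is an emulation written for this illustration,
not the way PCRE2 processes the conditional group, and the name
HAS_LETTER is hypothetical.

```python
import re

# The condition from the pattern above, reused in both guards.
HAS_LETTER = r"[^a-z]*[a-z]"

# First branch: the lookahead succeeds, so the "yes" pattern is required.
# Second branch: the lookahead fails, so the "no" pattern is required.
date_like = re.compile(
    rf"(?: (?={HAS_LETTER}) \d{{2}}-[a-z]{{3}}-\d{{2}}"
    rf"  | (?!{HAS_LETTER}) \d{{2}}-\d{{2}}-\d{{2}} )",
    re.VERBOSE,
)

results = {
    s: date_like.fullmatch(s) is not None
    for s in ("12-abc-34", "12-34-56", "12-ab-34", "12-34-5a")
}
```

The last subject is rejected even though it has the right overall shape
for neither form: the presence of a letter selects the first
alternative, and there is no fallback to the second.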
2569
2570 When an assertion that is a condition contains capture groups, any cap‐
2571 turing that occurs in a matching branch is retained afterwards, for
2572 both positive and negative assertions, because matching always contin‐
2573 ues after the assertion, whether it succeeds or fails. (Compare non-
2574 conditional assertions, for which captures are retained only for posi‐
2575 tive assertions that succeed.)
2576
2578
2579 There are two ways of including comments in patterns that are processed
2580 by PCRE2. In both cases, the start of the comment must not be in a
2581 character class, nor in the middle of any other sequence of related
2582 characters such as (?: or a group name or number. The characters that
2583 make up a comment play no part in the pattern matching.
2584
2585 The sequence (?# marks the start of a comment that continues up to the
2586 next closing parenthesis. Nested parentheses are not permitted. If the
2587 PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped #
2588 character also introduces a comment, which in this case continues to
2589 immediately after the next newline character or character sequence in
2590 the pattern. Which characters are interpreted as newlines is controlled
2591 by an option passed to the compiling function or by a special sequence
2592 at the start of the pattern, as described in the section entitled "New‐
2593 line conventions" above. Note that the end of this type of comment is a
2594 literal newline sequence in the pattern; escape sequences that happen
2595 to represent a newline do not count. For example, consider this pattern
2596 when PCRE2_EXTENDED is set, and the default newline convention (a sin‐
2597 gle linefeed character) is in force:
2598
2599 abc #comment \n still comment
2600
2601 On encountering the # character, pcre2_compile() skips along, looking
2602 for a newline in the pattern. The sequence \n is still literal at this
2603 stage, so it does not terminate the comment. Only an actual character
2604 with the code value 0x0a (the default newline) does so.
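
The same distinction between an escape sequence and an actual newline
character can be seen in Python's re module, whose re.VERBOSE comments
end only at a real linefeed. The sketch below relies on that stand-in
engine; the second pattern is an invented variation in which the string
contains a genuine newline.

```python
import re

# Backslash-n inside the comment is two ordinary characters, not a newline,
# so everything after "#" is still part of the comment.
one_line = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Here the pattern string contains a real linefeed, which ends the comment,
# so "def" is part of the pattern.
two_lines = re.compile("abc #comment \n def", re.VERBOSE)

only_abc = one_line.fullmatch("abc") is not None
needs_def = two_lines.fullmatch("abcdef") is not None
abc_alone = two_lines.fullmatch("abc") is not None
```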
2605
2607
2608 Consider the problem of matching a string in parentheses, allowing for
2609 unlimited nested parentheses. Without the use of recursion, the best
2610 that can be done is to use a pattern that matches up to some fixed
2611 depth of nesting. It is not possible to handle an arbitrary nesting
2612 depth.
2613
2614 For some time, Perl has provided a facility that allows regular expres‐
2615 sions to recurse (amongst other things). It does this by interpolating
2616 Perl code in the expression at run time, and the code can refer to the
2617 expression itself. A Perl pattern using code interpolation to solve the
2618 parentheses problem can be created like this:
2619
2620 $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
2621
2622 The (?p{...}) item interpolates Perl code at run time, and in this case
2623 refers recursively to the pattern in which it appears.
2624
2625 Obviously, PCRE2 cannot support the interpolation of Perl code.
2626 Instead, it supports special syntax for recursion of the entire pat‐
2627 tern, and also for individual capture group recursion. After its intro‐
2628 duction in PCRE1 and Python, this kind of recursion was subsequently
2629 introduced into Perl at release 5.10.
2630
2631 A special item that consists of (? followed by a number greater than
2632 zero and a closing parenthesis is a recursive subroutine call of the
2633 capture group of the given number, provided that it occurs inside that
2634 group. (If not, it is a non-recursive subroutine call, which is
2635 described in the next section.) The special item (?R) or (?0) is a
2636 recursive call of the entire regular expression.
2637
2638 This PCRE2 pattern solves the nested parentheses problem (assume the
2639 PCRE2_EXTENDED option is set so that white space is ignored):
2640
2641 \( ( [^()]++ | (?R) )* \)
2642
2643 First it matches an opening parenthesis. Then it matches any number of
2644 substrings which can either be a sequence of non-parentheses, or a
2645 recursive match of the pattern itself (that is, a correctly parenthe‐
2646 sized substring). Finally there is a closing parenthesis. Note the use
2647 of a possessive quantifier to avoid backtracking into sequences of non-
2648 parentheses.
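
The structure of this pattern can be mirrored by a small recursive function. The Python sketch below is a model of the description above, not of how PCRE2 is implemented; the name match_parens is hypothetical, and the function is anchored at the given position, whereas pcre2_match() would normally also try later starting points.

```python
def match_parens(s, pos=0):
    """Model of  \\( ( [^()]++ | (?R) )* \\)  anchored at pos.

    Returns the index just past the matched text, or None on failure.
    """
    if pos >= len(s) or s[pos] != "(":
        return None                      # opening parenthesis required
    pos += 1
    while pos < len(s) and s[pos] != ")":
        if s[pos] == "(":
            end = match_parens(s, pos)   # the (?R) recursive call
            if end is None:
                return None
            pos = end
        else:
            # [^()]++ : take the whole run and never give any of it back
            while pos < len(s) and s[pos] not in "()":
                pos += 1
    if pos < len(s):                     # s[pos] is the closing parenthesis
        return pos + 1
    return None


balanced_end = match_parens("(ab(cd)ef)")
unbalanced = match_parens("(" + "a" * 53 + "()")
```

Because the run of non-parentheses is consumed in one step, the unbalanced subject is rejected after a single pass, which corresponds to the role of the possessive quantifier in the pattern.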
2649
2650 If this were part of a larger pattern, you would not want to recurse
2651 the entire pattern, so instead you could use this:
2652
2653 ( \( ( [^()]++ | (?1) )* \) )
2654
2655 We have put the pattern into parentheses, and caused the recursion to
2656 refer to them instead of the whole pattern.
2657
2658 In a larger pattern, keeping track of parenthesis numbers can be
2659 tricky. This is made easier by the use of relative references. Instead
2660 of (?1) in the pattern above you can write (?-2) to refer to the second
2661 most recently opened parentheses preceding the recursion. In other
2662 words, a negative number counts capturing parentheses leftwards from
2663 the point at which it is encountered.
2664
2665 Be aware, however, that if duplicate capture group numbers are in use,
2666 relative references refer to the earliest group with the appropriate
2667 number. Consider, for example:
2668
2669 (?|(a)|(b)) (c) (?-2)
2670
2671 The first two capture groups (a) and (b) are both numbered 1, and group
2672 (c) is number 2. When the reference (?-2) is encountered, the second
2673 most recently opened parentheses has the number 1, but it is the first
2674 such group (the (a) group) to which the recursion refers. This would be
2675 the same if an absolute reference (?1) was used. In other words, rela‐
2676 tive references are just a shorthand for computing a group number.
2677
2678 It is also possible to refer to subsequent capture groups, by writing
2679 references such as (?+2). However, these cannot be recursive because
2680 the reference is not inside the parentheses that are referenced. They
2681 are always non-recursive subroutine calls, as described in the next
2682 section.
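
The arithmetic behind relative references can be sketched as follows. This Python fragment is only an illustration: resolve_relative is a hypothetical helper, and it assumes that highest_opened is the highest capture group number opened to the left of the reference. It does not model references written inside a (?| group.

```python
def resolve_relative(offset, highest_opened):
    """Turn a relative group reference into an absolute group number.

    offset is the signed number from (?-n) or (?+n); highest_opened is
    the highest group number opened before the reference (an assumption
    of this sketch).
    """
    if offset < 0:
        target = highest_opened + offset + 1   # count leftwards
    elif offset > 0:
        target = highest_opened + offset       # count groups yet to open
    else:
        raise ValueError("a relative reference needs a non-zero offset")
    if target < 1:
        raise ValueError("reference to a non-existent group")
    return target


# ( \( ( [^()]++ | (?-2) )* \) ) : two groups are open at the reference
nested = resolve_relative(-2, 2)
# (?|(a)|(b)) (c) (?-2) : highest number so far is 2, so (?-2) is group 1
duplicate = resolve_relative(-2, 2)
# A forward reference such as (?+2) after one opened group names group 3
forward = resolve_relative(+2, 1)
```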
2683
2684 An alternative approach is to use named parentheses. The Perl syntax
2685 for this is (?&name); PCRE1's earlier syntax (?P>name) is also sup‐
2686 ported. We could rewrite the above example as follows:
2687
2688 (?<pn> \( ( [^()]++ | (?&pn) )* \) )
2689
2690 If there is more than one group with the same name, the earliest one is
2691 used.
2692
2693 The example pattern that we have been looking at contains nested unlim‐
2694 ited repeats, and so the use of a possessive quantifier for matching
2695 strings of non-parentheses is important when applying the pattern to
2696 strings that do not match. For example, when this pattern is applied to
2697
2698 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
2699
2700 it yields "no match" quickly. However, if a possessive quantifier is
2701 not used, the match runs for a very long time indeed because there are
2702 so many different ways the + and * repeats can carve up the subject,
2703 and all have to be tested before failure can be reported.
2704
2705 At the end of a match, the values of capturing parentheses are those
2706 from the outermost level. If you want to obtain intermediate values, a
2707 callout function can be used (see below and the pcre2callout documenta‐
2708 tion). If the pattern above is matched against
2709
2710 (ab(cd)ef)
2711
2712 the value for the inner capturing parentheses (numbered 2) is "ef",
2713 which is the last value taken on at the top level. If a capture group
2714 is not matched at the top level, its final captured value is unset,
2715 even if it was (temporarily) set at a deeper level during the matching
2716 process.
2717
2718 Do not confuse the (?R) item with the condition (R), which tests for
2719 recursion. Consider this pattern, which matches text in angle brack‐
2720 ets, allowing for arbitrary nesting. Only digits are allowed in nested
2721 brackets (that is, when recursing), whereas any characters are permit‐
2722 ted at the outer level.
2723
2724 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
2725
2726 In this pattern, (?(R) is the start of a conditional group, with two
2727 different alternatives for the recursive and non-recursive cases. The
2728 (?R) item is the actual recursive call.
2729
2730 Differences in recursion processing between PCRE2 and Perl
2731
2732 Some former differences between PCRE2 and Perl no longer exist.
2733
2734 Before release 10.30, recursion processing in PCRE2 differed from Perl
2735 in that a recursive subroutine call was always treated as an atomic
2736 group. That is, once it had matched some of the subject string, it was
2737 never re-entered, even if it contained untried alternatives and there
2738 was a subsequent matching failure. (Historical note: PCRE implemented
2739 recursion before Perl did.)
2740
2741 Starting with release 10.30, recursive subroutine calls are no longer
2742 treated as atomic. That is, they can be re-entered to try unused alter‐
2743 natives if there is a matching failure later in the pattern. This is
2744 now compatible with the way Perl works. If you want a subroutine call
2745 to be atomic, you must explicitly enclose it in an atomic group.
2746
2747 Supporting backtracking into recursions simplifies certain types of
2748 recursive pattern. For example, this pattern matches palindromic
2749 strings:
2750
2751 ^((.)(?1)\2|.?)$
2752
2753 The second branch in the group matches a single central character in
2754 the palindrome when there are an odd number of characters, or nothing
2755 when there are an even number of characters, but in order to work it
2756 has to be able to try the second case when the rest of the pattern
2757 match fails. If you want to match typical palindromic phrases, the pat‐
2758 tern has to ignore all non-word characters, which can be done like
2759 this:
2760
2761 ^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
2762
2763 If run with the PCRE2_CASELESS option, this pattern matches phrases
2764 such as "A man, a plan, a canal: Panama!". Note the use of the posses‐
2765 sive quantifier *+ to avoid backtracking into sequences of non-word
2766 characters. Without this, PCRE2 takes a great deal longer (ten times or
2767 more) to match typical phrases, and Perl takes so long that you think
2768 it has gone into a loop.
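
To see why the simple palindrome pattern ^((.)(?1)\2|.?)$ depends on backtracking into the recursion, it may help to model group 1 as a generator that offers every possible end position in turn. The Python sketch below is an assumption-laden model (the names group1 and is_palindrome are hypothetical, and the special treatment of newlines by dot is ignored); it is not PCRE2's algorithm.

```python
def group1(s, i):
    """Yield every position at which group 1 could end when started at i.

    The caller may reject an end position and ask for the next one,
    which models backtracking into the recursive call (?1).
    """
    # First alternative: (.)(?1)\2
    if i < len(s):
        c = s[i]                          # captured by (.)
        for e in group1(s, i + 1):        # recursive call (?1)
            if e < len(s) and s[e] == c:  # backreference \2
                yield e + 1
    # Second alternative: .?  (greedy: one character first, then none)
    if i < len(s):
        yield i + 1
    yield i


def is_palindrome(s):
    # ^ ... $ : group 1 must start at 0 and end at the end of the subject
    return any(e == len(s) for e in group1(s, 0))


odd = is_palindrome("abcba")
even = is_palindrome("abba")
not_palindrome = is_palindrome("abc")
```

If each call could return only its first result, as with an atomic subroutine call, the inner calls would always consume a character via .? and the outer backreference would have nothing left to match.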
2769
2770 Another way in which PCRE2 and Perl used to differ in their recursion
2771 processing is in the handling of captured values. Formerly in Perl,
2772 when a group was called recursively or as a subroutine (see the next
2773 section), it had no access to any values that were captured outside the
2774 recursion, whereas in PCRE2 these values can be referenced. Consider
2775 this pattern:
2776
2777 ^(.)(\1|a(?2))
2778
2779 This pattern matches "bab". The first capturing parentheses match "b".
2780 In the second group, the backreference \1 (holding "b") fails against
2781 the next character "a", so the second alternative matches "a" and then
2782 recurses. In the recursion, \1 does match "b", so the whole match
2783 succeeds. This used to fail in Perl, but works in later versions (tested with 5.024).
2784
2785 GROUPS AS SUBROUTINES
2786
2787 If the syntax for a recursive group call (either by number or by name)
2788 is used outside the parentheses to which it refers, it operates a bit
2789 like a subroutine in a programming language. More accurately, PCRE2
2790 treats the referenced group as an independent subpattern which it tries
2791 to match at the current matching position. The called group may be
2792 defined before or after the reference. A numbered reference can be
2793 absolute or relative, as in these examples:
2794
2795 (...(absolute)...)...(?2)...
2796 (...(relative)...)...(?-1)...
2797 (...(?+1)...(relative)...
2798
2799 An earlier example pointed out that the pattern
2800
2801 (sens|respons)e and \1ibility
2802
2803 matches "sense and sensibility" and "response and responsibility", but
2804 not "sense and responsibility". If instead the pattern
2805
2806 (sens|respons)e and (?1)ibility
2807
2808 is used, it does match "sense and responsibility" as well as the other
2809 two strings. Another example is given in the discussion of DEFINE
2810 above.
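
The difference between a backreference and a subroutine call can be tried out with Python's standard re module. Python's re has no (?1) syntax, so in this sketch the call is emulated by writing the group's subpattern out a second time; that emulation is an assumption of the example and only approximates a subroutine call.

```python
import re

# Backreference: \1 must repeat the text that group 1 actually matched.
backref = re.compile(r"(sens|respons)e and \1ibility")

# Emulated subroutine call: the subpattern of group 1 is tried again
# from scratch, so it may match different text the second time.
emulated_call = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]
backref_results = [bool(backref.fullmatch(s)) for s in subjects]
call_results = [bool(emulated_call.fullmatch(s)) for s in subjects]
```

Only the mixed subject distinguishes the two forms: the backreference rejects it, while re-running the subpattern accepts it.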
2811
2812 Like recursions, subroutine calls used to be treated as atomic, but
2813 this changed at PCRE2 release 10.30, so backtracking into subroutine
2814 calls can now occur. However, any capturing parentheses that are set
2815 during the subroutine call revert to their previous values afterwards.
2816
2817 Processing options such as case-independence are fixed when a group is
2818 defined, so if it is used as a subroutine, such options cannot be
2819 changed for different calls. For example, consider this pattern:
2820
2821 (abc)(?i:(?-1))
2822
2823 It matches "abcabc". It does not match "abcABC" because the change of
2824 processing option does not affect the called group.
2825
2826 The behaviour of backtracking control verbs in groups when called as
2827 subroutines is described in the section entitled "Backtracking verbs in
2828 subroutines" below.
2829
2830 ONIGURUMA SUBROUTINE SYNTAX
2831
2832 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
2833 name or a number enclosed either in angle brackets or single quotes is
2834 an alternative syntax for calling a group as a subroutine, possibly
2835 recursively. Here are two of the examples used above, rewritten using
2836 this syntax:
2837
2838 (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
2839 (sens|respons)e and \g'1'ibility
2840
2841 PCRE2 supports an extension to Oniguruma: if a number is preceded by a
2842 plus or a minus sign it is taken as a relative reference. For example:
2843
2844 (abc)(?i:\g<-1>)
2845
2846 Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
2847 synonymous. The former is a backreference; the latter is a subroutine
2848 call.
2849
2850 CALLOUTS
2851
2852 Perl has a feature whereby using the sequence (?{...}) causes arbitrary
2853 Perl code to be obeyed in the middle of matching a regular expression.
2854 This makes it possible, amongst other things, to extract different sub‐
2855 strings that match the same pair of parentheses when there is a repeti‐
2856 tion.
2857
2858 PCRE2 provides a similar feature, but of course it cannot obey arbi‐
2859 trary Perl code. The feature is called "callout". The caller of PCRE2
2860 provides an external function by putting its entry point in a match
2861 context using the function pcre2_set_callout(), and then passing that
2862 context to pcre2_match() or pcre2_dfa_match(). If no match context is
2863 passed, or if the callout entry point is set to NULL, callouts are dis‐
2864 abled.
2865
2866 Within a regular expression, (?C<arg>) indicates a point at which the
2867 external function is to be called. There are two kinds of callout:
2868 those with a numerical argument and those with a string argument. (?C)
2869 on its own with no argument is treated as (?C0). A numerical argument
2870 allows the application to distinguish between different callouts.
2871 String arguments were added for release 10.20 to make it possible for
2872 script languages that use PCRE2 to embed short scripts within patterns
2873 in a similar way to Perl.
2874
2875 During matching, when PCRE2 reaches a callout point, the external func‐
2876 tion is called. It is provided with the number or string argument of
2877 the callout, the position in the pattern, and one item of data that is
2878 also set in the match block. The callout function may cause matching to
2879 proceed, to backtrack, or to fail.
2880
2881 By default, PCRE2 implements a number of optimizations at matching
2882 time, and one side-effect is that sometimes callouts are skipped. If
2883 you need all possible callouts to happen, you need to set options that
2884 disable the relevant optimizations. More details, including a complete
2885 description of the programming interface to the callout function, are
2886 given in the pcre2callout documentation.
2887
2888 Callouts with numerical arguments
2889
2890 If you just want to have a means of identifying different callout
2891 points, put a number less than 256 after the letter C. For example,
2892 this pattern has two callout points:
2893
2894 (?C1)abc(?C2)def
2895
2896 If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical
2897 callouts are automatically installed before each item in the pattern.
2898 They are all numbered 255. If there is a conditional group in the pat‐
2899 tern whose condition is an assertion, an additional callout is inserted
2900 just before the condition. An explicit callout may also be set at this
2901 position, as in this example:
2902
2903 (?(?C9)(?=a)abc|def)
2904
2905 Note that this applies only to assertion conditions, not to other types
2906 of condition.
2907
2908 Callouts with string arguments
2909
2910 A delimited string may be used instead of a number as a callout argu‐
2911 ment. The starting delimiter must be one of ` ' " ^ % # $ { and the
2912 ending delimiter is the same as the start, except for {, where the end‐
2913 ing delimiter is }. If the ending delimiter is needed within the
2914 string, it must be doubled. For example:
2915
2916 (?C'ab ''c'' d')xyz(?C{any text})pqr
2917
2918 The doubling is removed before the string is passed to the callout
2919 function.
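
The delimiter and doubling rules can be sketched as a small parser. This Python example is a model written for illustration: parse_callout_string is a hypothetical name, the input is assumed to be the pattern text immediately after "(?C", and the requirement that a closing parenthesis follows the ending delimiter is an assumption of the sketch.

```python
def parse_callout_string(text):
    """Parse a string callout argument.

    Returns (argument, index just past the closing parenthesis).
    """
    same_end_delimiters = "`'\"^%#$"
    if not text:
        raise ValueError("missing callout argument")
    start = text[0]
    if start == "{":
        end_delim = "}"                 # the one asymmetric pair
    elif start in same_end_delimiters:
        end_delim = start
    else:
        raise ValueError("not a string callout delimiter")

    out = []
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == end_delim:
            if i + 1 < len(text) and text[i + 1] == end_delim:
                out.append(end_delim)   # doubled delimiter: keep one copy
                i += 2
                continue
            if i + 1 < len(text) and text[i + 1] == ")":
                return "".join(out), i + 2
            raise ValueError("expected ')' after the ending delimiter")
        out.append(ch)
        i += 1
    raise ValueError("unterminated callout string")


quoted, quoted_end = parse_callout_string("'ab ''c'' d')xyz")
braced, braced_end = parse_callout_string("{any text})pqr")
```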
2920
2921 BACKTRACKING CONTROL
2922
2923 There are a number of special "Backtracking Control Verbs" (to use
2924 Perl's terminology) that modify the behaviour of backtracking during
2925 matching. They are generally of the form (*VERB) or (*VERB:NAME). Some
2926 verbs take either form, and may behave differently depending on whether
2927 or not a name argument is present. The names are not required to be
2928 unique within the pattern.
2929
2930 By default, for compatibility with Perl, a name is any sequence of
2931 characters that does not include a closing parenthesis. The name is not
2932 processed in any way, and it is not possible to include a closing
2933 parenthesis in the name. This can be changed by setting the
2934 PCRE2_ALT_VERBNAMES option, but the result is no longer Perl-compati‐
2935 ble.
2936
2937 When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to
2938 verb names and only an unescaped closing parenthesis terminates the
2939 name. However, the only backslash items that are permitted are \Q, \E,
2940 and sequences such as \x{100} that define character code points. Char‐
2941 acter type escapes such as \d are faulted.
2942
2943 A closing parenthesis can be included in a name either as \) or between
2944 \Q and \E. In addition to backslash processing, if the PCRE2_EXTENDED
2945 or PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb
2946 names is skipped, and #-comments are recognized, exactly as in the rest
2947 of the pattern. PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect
2948 verb names unless PCRE2_ALT_VERBNAMES is also set.
2949
2950 The maximum length of a name is 255 in the 8-bit library and 65535 in
2951 the 16-bit and 32-bit libraries. If the name is empty, that is, if the
2952 closing parenthesis immediately follows the colon, the effect is as if
2953 the colon were not there. Any number of these verbs may occur in a pat‐
2954 tern. Except for (*ACCEPT), they may not be quantified.
2955
2956 Since these verbs are specifically related to backtracking, most of
2957 them can be used only when the pattern is to be matched using the tra‐
2958 ditional matching function, because that uses a backtracking algorithm.
2959 With the exception of (*FAIL), which behaves like a failing negative
2960 assertion, the backtracking control verbs cause an error if encountered
2961 by the DFA matching function.
2962
2963 The behaviour of these verbs in repeated groups, assertions, and in
2964 capture groups called as subroutines (whether or not recursively) is
2965 documented below.
2966
2967 Optimizations that affect backtracking verbs
2968
2969 PCRE2 contains some optimizations that are used to speed up matching by
2970 running some checks at the start of each match attempt. For example, it
2971 may know the minimum length of matching subject, or that a particular
2972 character must be present. When one of these optimizations bypasses the
2973 running of a match, any included backtracking verbs will not, of
2974 course, be processed. You can suppress the start-of-match optimizations
2975 by setting the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com‐
2976 pile(), or by starting the pattern with (*NO_START_OPT). There is more
2977 discussion of this option in the section entitled "Compiling a pattern"
2978 in the pcre2api documentation.
2979
2980 Experiments with Perl suggest that it too has similar optimizations,
2981 and like PCRE2, turning them off can change the result of a match.
2982
2983 Verbs that act immediately
2984
2985 The following verbs act as soon as they are encountered.
2986
2987 (*ACCEPT) or (*ACCEPT:NAME)
2988
2989 This verb causes the match to end successfully, skipping the remainder
2990 of the pattern. However, when it is inside a capture group that is
2991 called as a subroutine, only that group is ended successfully. Matching
2992 then continues at the outer level. If (*ACCEPT) is triggered in a posi‐
2993 tive assertion, the assertion succeeds; in a negative assertion, the
2994 assertion fails.
2995
2996 If (*ACCEPT) is inside capturing parentheses, the data so far is cap‐
2997 tured. For example:
2998
2999 A((?:A|B(*ACCEPT)|C)D)
3000
3001 This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap‐
3002 tured by the outer parentheses.
3003
3004 (*ACCEPT) is the only backtracking verb that is allowed to be quanti‐
3005 fied because an ungreedy quantification with a minimum of zero acts
3006 only when a backtrack happens. Consider, for example,
3007
3008 (A(*ACCEPT)??B)C
3009
3010 where A, B, and C may be complex expressions. After matching "A", the
3011 matcher processes "BC"; if that fails, causing a backtrack, (*ACCEPT)
3012 is triggered and the match succeeds. In both cases, all but C is cap‐
3013 tured. Whereas (*COMMIT) (see below) means "fail on backtrack", a
3014 repeated (*ACCEPT) of this type means "succeed on backtrack".
3015
3016 Warning: (*ACCEPT) should not be used within a script run group,
3017 because it causes an immediate exit from the group, bypassing the
3018 script run checking.
3019
3020 (*FAIL) or (*FAIL:NAME)
3021
3022 This verb causes a matching failure, forcing backtracking to occur. It
3023 may be abbreviated to (*F). It is equivalent to (?!) but easier to
3024 read. The Perl documentation notes that it is probably useful only when
3025 combined with (?{}) or (??{}). Those are, of course, Perl features that
3026 are not present in PCRE2. The nearest equivalent is the callout fea‐
3027 ture, as for example in this pattern:
3028
3029 a+(?C)(*FAIL)
3030
3031 A match with the string "aaaa" always fails, but the callout is taken
3032 before each backtrack happens (in this example, 10 times).
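
The figure of 10 callouts can be reproduced with a small simulation. The Python sketch below models the backtracking described above rather than calling PCRE2; count_callouts is a hypothetical name, and the model assumes that no optimization skips any match attempt.

```python
def count_callouts(subject):
    """Count callouts taken while a+(?C)(*FAIL) fails on subject."""
    calls = 0
    for start in range(len(subject)):
        # a+ is greedy: first take the longest run of "a" from start
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # (*FAIL) forces a backtrack after every callout, so a+ gives
        # back one character at a time until only one "a" remains
        for _length in range(run, 0, -1):
            calls += 1
    return calls


total = count_callouts("aaaa")
```

For "aaaa" the four starting positions contribute 4, 3, 2, and 1 callouts respectively.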
3033
3034 (*ACCEPT:NAME) and (*FAIL:NAME) behave the same as
3035 (*MARK:NAME)(*ACCEPT) and (*MARK:NAME)(*FAIL), respectively, that is, a
3036 (*MARK) is recorded just before the verb acts.
3037
3038 Recording which path was taken
3039
3040 There is one verb whose main purpose is to track how a match was
3041 arrived at, though it also has a secondary use in conjunction with
3042 advancing the match starting point (see (*SKIP) below).
3043
3044 (*MARK:NAME) or (*:NAME)
3045
3046 A name is always required with this verb. For all the other backtrack‐
3047 ing control verbs, a NAME argument is optional.
3048
3049 When a match succeeds, the last-encountered mark name on
3050 the matching path is passed back to the caller as described in the sec‐
3051 tion entitled "Other information about the match" in the pcre2api docu‐
3052 mentation. This applies to all instances of (*MARK) and other verbs,
3053 including those inside assertions and atomic groups. However, there are
3054 differences in those cases when (*MARK) is used in conjunction with
3055 (*SKIP) as described below.
3056
3057 The mark name that was last encountered on the matching path is passed
3058 back. A verb without a NAME argument is ignored for this purpose. Here
3059 is an example of pcre2test output, where the "mark" modifier requests
3060 the retrieval and outputting of (*MARK) data:
3061
3062 re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
3063 data> XY
3064 0: XY
3065 MK: A
3066 XZ
3067 0: XZ
3068 MK: B
3069
3070 The (*MARK) name is tagged with "MK:" in this output, and in this exam‐
3071 ple it indicates which of the two alternatives matched. This is a more
3072 efficient way of obtaining this information than putting each alterna‐
3073 tive in its own capturing parentheses.
3074
3075 If a verb with a name is encountered in a positive assertion that is
3076 true, the name is recorded and passed back if it is the last-encoun‐
3077 tered. This does not happen for negative assertions or failing positive
3078 assertions.
3079
3080 After a partial match or a failed match, the last encountered name in
3081 the entire match process is returned. For example:
3082
3083 re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
3084 data> XP
3085 No match, mark = B
3086
3087 Note that in this unanchored example the mark is retained from the
3088 match attempt that started at the letter "X" in the subject. Subsequent
3089 match attempts starting at "P" and then with an empty string do not get
3090 as far as the (*MARK) item, but nevertheless do not reset it.
3091
3092 If you are interested in (*MARK) values after failed matches, you
3093 should probably set the PCRE2_NO_START_OPTIMIZE option (see above) to
3094 ensure that the match is always attempted.
3095
3096 Verbs that act after backtracking
3097
3098 The following verbs do nothing when they are encountered. Matching con‐
3099 tinues with what follows, but if there is a subsequent match failure,
3100 causing a backtrack to the verb, a failure is forced. That is, back‐
3101 tracking cannot pass to the left of the verb. However, when one of
3102 these verbs appears inside an atomic group or in a lookaround assertion
3103 that is true, its effect is confined to that group, because once the
3104 group has been matched, there is never any backtracking into it. Back‐
3105 tracking from beyond an assertion or an atomic group ignores the entire
3106 group, and seeks a preceding backtracking point.
3107
3108 These verbs differ in exactly what kind of failure occurs when back‐
3109 tracking reaches them. The behaviour described below is what happens
3110 when the verb is not in a subroutine or an assertion. Subsequent sec‐
3111 tions cover these special cases.
3112
3113 (*COMMIT) or (*COMMIT:NAME)
3114
3115 This verb causes the whole match to fail outright if there is a later
3116 matching failure that causes backtracking to reach it. Even if the pat‐
3117 tern is unanchored, no further attempts to find a match by advancing
3118 the starting point take place. If (*COMMIT) is the only backtracking
3119 verb that is encountered, once it has been passed, pcre2_match() is com‐
3120 mitted to finding a match at the current starting point, or not at all.
3121 For example:
3122
3123 a+(*COMMIT)b
3124
3125 This matches "xxaab" but not "aacaab". It can be thought of as a kind
3126 of dynamic anchor, or "I've started, so I must finish."
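
The two subjects can be traced with a simulation of this one pattern. The Python sketch below is a model, not PCRE2 itself: match_a_plus_commit_b is a hypothetical name, and the start-of-match optimizations are assumed to be disabled.

```python
def match_a_plus_commit_b(subject):
    """Model of a+(*COMMIT)b; returns (start, end) or None."""
    for start in range(len(subject) + 1):
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end == start:
            # a+ failed before (*COMMIT) was reached, so the normal
            # advance to the next starting point takes place
            continue
        if end < len(subject) and subject[end] == "b":
            return (start, end + 1)
        # "b" failed after (*COMMIT) was passed: backtracking reaches
        # the verb and the whole match fails with no further attempts
        return None
    return None


found = match_a_plus_commit_b("xxaab")
not_found = match_a_plus_commit_b("aacaab")
```

In "aacaab" the later "aab" is never tried, because the failure at "c" occurs after the verb has been passed.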
3127
3128 The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COM‐
3129 MIT). It is like (*MARK:NAME) in that the name is remembered for pass‐
3130 ing back to the caller. However, (*SKIP:NAME) searches only for names
3131 that are set with (*MARK), ignoring those set by any of the other back‐
3132 tracking verbs.
3133
3134 If there is more than one backtracking verb in a pattern, a different
3135 one that follows (*COMMIT) may be triggered first, so merely passing
3136 (*COMMIT) during a match does not always guarantee that a match must be
3137 at this starting point.
3138
3139 Note that (*COMMIT) at the start of a pattern is not the same as an
3140 anchor, unless PCRE2's start-of-match optimizations are turned off, as
3141 shown in this output from pcre2test:
3142
3143 re> /(*COMMIT)abc/
3144 data> xyzabc
3145 0: abc
3146 data>
3147 re> /(*COMMIT)abc/no_start_optimize
3148 data> xyzabc
3149 No match
3150
3151 For the first pattern, PCRE2 knows that any match must start with "a",
3152 so the optimization skips along the subject to "a" before applying the
3153 pattern to the first set of data. The match attempt then succeeds. The
3154 second pattern disables the optimization that skips along to the first
3155 character. The pattern is now applied starting at "x", and so the
3156 (*COMMIT) causes the match to fail without trying any other starting
3157 points.
3158
3159 (*PRUNE) or (*PRUNE:NAME)
3160
3161 This verb causes the match to fail at the current starting position in
3162 the subject if there is a later matching failure that causes backtrack‐
3163 ing to reach it. If the pattern is unanchored, the normal "bumpalong"
3164 advance to the next starting character then happens. Backtracking can
3165 occur as usual to the left of (*PRUNE), before it is reached, or when
3166 matching to the right of (*PRUNE), but if there is no match to the
3167 right, backtracking cannot cross (*PRUNE). In simple cases, the use of
3168 (*PRUNE) is just an alternative to an atomic group or possessive quan‐
3169 tifier, but there are some uses of (*PRUNE) that cannot be expressed in
3170 any other way. In an anchored pattern (*PRUNE) has the same effect as
3171 (*COMMIT).
3172
3173 The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
3174 It is like (*MARK:NAME) in that the name is remembered for passing back
3175 to the caller. However, (*SKIP:NAME) searches only for names set with
3176 (*MARK), ignoring those set by other backtracking verbs.
3177
3178 (*SKIP)
3179
3180 This verb, when given without a name, is like (*PRUNE), except that if
3181 the pattern is unanchored, the "bumpalong" advance is not to the next
3182 character, but to the position in the subject where (*SKIP) was encoun‐
3183 tered. (*SKIP) signifies that whatever text was matched leading up to
3184 it cannot be part of a successful match if there is a later mismatch.
3185 Consider:
3186
3187 a+(*SKIP)b
3188
3189 If the subject is "aaaac...", after the first match attempt fails
3190 (starting at the first character in the string), the starting point
3191 skips on to start the next attempt at "c". Note that a possessive quan‐
3192 tifier does not have the same effect as this example; although it would
3193 suppress backtracking during the first match attempt, the second
3194 attempt would start at the second character instead of skipping on to
3195 "c".
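
The difference in starting points can be made concrete with a simulation. The Python sketch below models a+(*SKIP)b and the possessive a++b for this kind of subject only; start_positions is a hypothetical name, and the model assumes that the start-of-match optimizations are disabled, so an attempt is also made at the end of the subject.

```python
def start_positions(subject, use_skip):
    """Return (starting points tried, match span or None).

    use_skip=True models a+(*SKIP)b; use_skip=False models a++b.
    """
    starts = []
    pos = 0
    while pos <= len(subject):
        starts.append(pos)
        end = pos
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end > pos and end < len(subject) and subject[end] == "b":
            return starts, (pos, end + 1)
        if use_skip and end > pos:
            pos = end        # advance to where (*SKIP) was encountered
        else:
            pos += 1         # normal "bumpalong" by one character
    return starts, None


with_skip = start_positions("aaaac", use_skip=True)
possessive = start_positions("aaaac", use_skip=False)
```

With (*SKIP), the attempts start at offsets 0, 4 (the "c"), and 5; without it, every offset from 0 to 5 is tried.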
3196
3197 If (*SKIP) is used to specify a new starting position that is the same
3198 as the starting position of the current match, or (by being inside a
3199 lookbehind) earlier, the position specified by (*SKIP) is ignored, and
3200 instead the normal "bumpalong" occurs.
3201
3202 (*SKIP:NAME)
3203
3204 When (*SKIP) has an associated name, its behaviour is modified. When
3205 such a (*SKIP) is triggered, the previous path through the pattern is
3206 searched for the most recent (*MARK) that has the same name. If one is
3207 found, the "bumpalong" advance is to the subject position that corre‐
3208 sponds to that (*MARK) instead of to where (*SKIP) was encountered. If
3209 no (*MARK) with a matching name is found, the (*SKIP) is ignored.
3210
3211 The search for a (*MARK) name uses the normal backtracking mechanism,
3212 which means that it does not see (*MARK) settings that are inside
3213 atomic groups or assertions, because they are never re-entered by back‐
3214 tracking. Compare the following pcre2test examples:
3215
3216 re> /a(?>(*MARK:X))(*SKIP:X)(*F)|(.)/
3217 data> abc
3218 0: a
3219 1: a
3220 data>
3221 re> /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/
3222 data> abc
3223 0: b
3224 1: b
3225
3226 In the first example, the (*MARK) setting is in an atomic group, so it
3227 is not seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored.
3228 This allows the second branch of the pattern to be tried at the first
3229 character position. In the second example, the (*MARK) setting is not
3230 in an atomic group. This allows (*SKIP:X) to find the (*MARK) when it
3231 backtracks, and this causes a new matching attempt to start at the sec‐
3232 ond character. This time, the (*MARK) is never seen because "a" does
3233 not match "b", so the matcher immediately jumps to the second branch of
3234 the pattern.
3235
3236 Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It
3237 ignores names that are set by other backtracking verbs.
3238
3239 (*THEN) or (*THEN:NAME)
3240
3241 This verb causes a skip to the next innermost alternative when back‐
3242 tracking reaches it. That is, it cancels any further backtracking
3243 within the current alternative. Its name comes from the observation
3244 that it can be used for a pattern-based if-then-else block:
3245
3246 ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
3247
3248 If the COND1 pattern matches, FOO is tried (and possibly further items
3249 after the end of the group if FOO succeeds); on failure, the matcher
3250 skips to the second alternative and tries COND2, without backtracking
3251 into COND1. If that succeeds and BAR fails, COND3 is tried. If subse‐
3252 quently BAZ fails, there are no more alternatives, so there is a back‐
3253 track to whatever came before the entire group. If (*THEN) is not
3254 inside an alternation, it acts like (*PRUNE).
3255
3256 The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN).
3257 It is like (*MARK:NAME) in that the name is remembered for passing back
3258 to the caller. However, (*SKIP:NAME) searches only for names set with
3259 (*MARK), ignoring those set by other backtracking verbs.
3260
3261 A group that does not contain a | character is just a part of the
3262 enclosing alternative; it is not a nested alternation with only one
3263 alternative. The effect of (*THEN) extends beyond such a group to the
3264 enclosing alternative. Consider this pattern, where A, B, etc. are
3265 complex pattern fragments that do not contain any | characters at this
3266 level:
3267
3268 A (B(*THEN)C) | D
3269
3270 If A and B are matched, but there is a failure in C, matching does not
3271 backtrack into A; instead it moves to the next alternative, that is, D.
3272 However, if the group containing (*THEN) is given an alternative, it
3273 behaves differently:
3274
3275 A (B(*THEN)C | (*FAIL)) | D
3276
3277 The effect of (*THEN) is now confined to the inner group. After a fail‐
3278 ure in C, matching moves to (*FAIL), which causes the whole group to
3279 fail because there are no more alternatives to try. In this case,
3280 matching does backtrack into A.
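
The difference between the two patterns can be sketched with a toy Python
model in which backtracking onto (*THEN) raises an exception that only an
alternation intercepts. This is an illustration of the rule just stated, not
PCRE2's implementation. All names are invented for the sketch, and the
fragments chosen for A, B, C and D are arbitrary simple stand-ins.

```python
class Then(Exception):
    """Raised when backtracking reaches a (*THEN)."""


def lit(text):
    def match(subject, pos):
        if subject.startswith(text, pos):
            yield pos + len(text)
    return match


def any_char(subject, pos):
    """Model of '.', ignoring newline handling."""
    if pos < len(subject):
        yield pos + 1


def greedy_plus(char):
    def match(subject, pos):
        end = pos
        while end < len(subject) and subject[end] == char:
            end += 1
        for stop in range(end, pos, -1):
            yield stop
    return match


def then_verb(subject, pos):
    yield pos        # matches the empty string when first reached
    raise Then()     # acts only when backtracking returns here


def fail(subject, pos):
    """Model of (*FAIL): never matches."""
    return iter(())


def seq(*parts):
    """Concatenation; also models a group that contains no | character."""
    def match(subject, pos):
        if not parts:
            yield pos
            return
        for mid in parts[0](subject, pos):
            yield from seq(*parts[1:])(subject, mid)
    return match


def alt(*branches):
    """Alternation: the only construct here that intercepts (*THEN)."""
    def match(subject, pos):
        for branch in branches:
            try:
                yield from branch(subject, pos)
            except Then:
                continue   # skip to the next alternative
    return match


def first_end(fragment, subject, pos=0):
    return next(fragment(subject, pos), None)


A, B, C, D = greedy_plus("a"), any_char, lit("b"), lit("aa")

# A (B(*THEN)C) | D
no_inner_bar = alt(seq(A, seq(B, then_verb, C)), D)
# A (B(*THEN)C | (*FAIL)) | D
inner_bar = alt(seq(A, alt(seq(B, then_verb, C), fail)), D)

print(first_end(no_inner_bar, "aab"))  # 2
print(first_end(inner_bar, "aab"))     # 3
```

With the subject "aab", the first model jumps straight to D and matches
"aa". The second confines (*THEN) to the inner group, backtracks into A, and
matches "aab".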
3281
3282 Note that a conditional group is not considered as having two alterna‐
3283 tives, because only one is ever used. In other words, the | character
3284 in a conditional group has a different meaning. Ignoring white space,
3285 consider:
3286
3287 ^.*? (?(?=a) a | b(*THEN)c )
3288
3289 If the subject is "ba", this pattern does not match. Because .*? is
3290 ungreedy, it initially matches zero characters. The condition (?=a)
3291 then fails, the character "b" is matched, but "c" is not. At this
3292 point, matching does not backtrack to .*? as might perhaps be expected
3293 from the presence of the | character. The conditional group is part of
3294 the single alternative that comprises the whole pattern, and so the
3295 match fails. (If there was a backtrack into .*?, allowing it to match
3296 "b", the match would succeed.)
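
The "ba" example can be traced with a toy Python model in which a
conditional group simply selects one branch and does not intercept (*THEN).
This is a sketch of the behaviour described above, not PCRE2's
implementation. The helper names are invented, and the second pattern, which
omits (*THEN), is added only for comparison.

```python
class Then(Exception):
    """Raised when backtracking reaches a (*THEN)."""


def lit(text):
    def match(subject, pos):
        if subject.startswith(text, pos):
            yield pos + len(text)
    return match


def lazy_any(subject, pos):
    """Model of .*? : shortest match first (newlines not special here)."""
    for end in range(pos, len(subject) + 1):
        yield end


def then_verb(subject, pos):
    yield pos
    raise Then()


def seq(*parts):
    def match(subject, pos):
        if not parts:
            yield pos
            return
        for mid in parts[0](subject, pos):
            yield from seq(*parts[1:])(subject, mid)
    return match


def conditional(lookahead, yes, no):
    """Model of (?(?=LOOKAHEAD) YES | NO); it does not catch Then."""
    def match(subject, pos):
        condition_true = next(lookahead(subject, pos), None) is not None
        yield from (yes if condition_true else no)(subject, pos)
    return match


def anchored_match(fragment, subject):
    """Model of a ^-anchored match; returns the end offset or None."""
    try:
        return next(fragment(subject, 0), None)
    except Then:
        return None   # no enclosing alternation: the attempt fails


# ^.*? (?(?=a) a | b(*THEN)c )
with_then = seq(lazy_any, conditional(
    lit("a"), lit("a"), seq(lit("b"), then_verb, lit("c"))))
# ^.*? (?(?=a) a | bc )
without_then = seq(lazy_any, conditional(
    lit("a"), lit("a"), seq(lit("b"), lit("c"))))

print(anchored_match(with_then, "ba"))     # None
print(anchored_match(without_then, "ba"))  # 2
```

Without (*THEN), the model backtracks into .*?, lets it consume "b", and then
matches "a" through the condition's first branch.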
3297
3298 The verbs just described provide four different "strengths" of control
3299 when subsequent matching fails. (*THEN) is the weakest, carrying on the
3300 match at the next alternative. (*PRUNE) comes next, failing the match
3301 at the current starting position, but allowing an advance to the next
3302 character (for an unanchored pattern). (*SKIP) is similar, except that
3303 the advance may be more than one character. (*COMMIT) is the strongest,
3304 causing the entire match to fail.
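
As a rough illustration of these strengths, the following Python sketch
computes where an unanchored match would be retried after a verb is
triggered. It is a simplified model, not PCRE2 code. The function and
parameter names are invented, offsets are counted in characters, and the rule
that (*SKIP) never retries at the same offset is an assumption of the model.

```python
def next_start(verb, start, subject_len, skip_to=None):
    """Offset at which matching is retried after `verb` is triggered.

    Returns None if no further attempt is made. (*THEN) is not handled
    because it does not end the attempt at `start`; it only moves matching
    to the next alternative.
    """
    if verb == "COMMIT":
        return None                          # the entire match fails
    if verb == "PRUNE":
        candidate = start + 1                # advance by one character
    elif verb == "SKIP":
        # Advance to where (*SKIP) was passed. Assumption of this model:
        # the advance is always at least one character.
        candidate = max(skip_to, start + 1)
    else:
        raise ValueError(f"unknown verb: {verb}")
    return candidate if candidate <= subject_len else None


print(next_start("PRUNE", 0, 10))             # 1
print(next_start("SKIP", 0, 10, skip_to=4))   # 4
print(next_start("COMMIT", 0, 10))            # None
```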
3305
3306 More than one backtracking verb
3307
3308 If more than one backtracking verb is present in a pattern, the one
3309 that is backtracked onto first acts. For example, consider this pat‐
3310 tern, where A, B, etc. are complex pattern fragments:
3311
3312 (A(*COMMIT)B(*THEN)C|ABD)
3313
3314 If A matches but B fails, the backtrack to (*COMMIT) causes the entire
3315 match to fail. However, if A and B match, but C fails, the backtrack to
3316 (*THEN) causes the next alternative (ABD) to be tried. This behaviour
3317 is consistent, but is not always the same as Perl's. It means that if
3318 two or more backtracking verbs appear in succession, all but the last
3319 of them have no effect. Consider this example:
3320
3321 ...(*COMMIT)(*PRUNE)...
3322
3323 If there is a matching failure to the right, backtracking onto (*PRUNE)
3324 causes it to be triggered, and its action is taken. There can never be
3325 a backtrack onto (*COMMIT).
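
The rule that the first verb to be backtracked onto is the one that acts can
be sketched in Python by modelling each verb as a point that always succeeds
when passed and raises an exception when backtracking returns to it. This is
a toy model with invented names, not PCRE2's implementation, and it only
reports which verb is triggered rather than carrying out its action.

```python
class VerbTriggered(Exception):
    def __init__(self, name):
        super().__init__(name)
        self.name = name


def lit(text):
    def match(subject, pos):
        if subject.startswith(text, pos):
            yield pos + len(text)
    return match


def verb(name):
    def match(subject, pos):
        yield pos                   # passing the verb always succeeds
        raise VerbTriggered(name)   # backtracking onto it triggers it
    return match


def seq(*parts):
    def match(subject, pos):
        if not parts:
            yield pos
            return
        for mid in parts[0](subject, pos):
            yield from seq(*parts[1:])(subject, mid)
    return match


def triggered_verb(fragment, subject, pos=0):
    """Name of the verb that acts, or None if none is triggered."""
    try:
        for _ in fragment(subject, pos):
            return None             # matched; no verb was backtracked onto
    except VerbTriggered as exc:
        return exc.name
    return None                     # failed without reaching any verb


# a(*COMMIT)b(*THEN)c : first alternative of the example, with literals
spaced = seq(lit("a"), verb("COMMIT"), lit("b"), verb("THEN"), lit("c"))
# a(*COMMIT)(*PRUNE)b : two verbs in succession
adjacent = seq(lit("a"), verb("COMMIT"), verb("PRUNE"), lit("b"))

print(triggered_verb(spaced, "ax"))    # COMMIT
print(triggered_verb(spaced, "abx"))   # THEN
print(triggered_verb(adjacent, "ac"))  # PRUNE
```

In the last case the model never returns to (*COMMIT), because (*PRUNE) is
reached first on the way back.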
3326
3327 Backtracking verbs in repeated groups
3328
3329 PCRE2 sometimes differs from Perl in its handling of backtracking verbs
3330 in repeated groups. For example, consider:
3331
3332 /(a(*COMMIT)b)+ac/
3333
3334 If the subject is "abac", Perl matches unless its optimizations are
3335 disabled, but PCRE2 always fails because the (*COMMIT) in the second
3336 repeat of the group acts.
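
The PCRE2 behaviour in this example can be traced with a toy Python model of
a greedy repeated group. It is a sketch under simplifying assumptions, not
PCRE2's implementation, and it says nothing about how Perl's optimizations
work. The helper names are invented for the illustration.

```python
class Commit(Exception):
    """Raised when backtracking reaches (*COMMIT)."""


def lit(text):
    def match(subject, pos):
        if subject.startswith(text, pos):
            yield pos + len(text)
    return match


def commit(subject, pos):
    yield pos          # passing (*COMMIT) always succeeds
    raise Commit()     # backtracking onto it fails the entire match


def seq(*parts):
    def match(subject, pos):
        if not parts:
            yield pos
            return
        for mid in parts[0](subject, pos):
            yield from seq(*parts[1:])(subject, mid)
    return match


def plus(fragment):
    """Greedy repeat: try another iteration before settling for fewer."""
    def match(subject, pos):
        for mid in fragment(subject, pos):
            if mid > pos:                       # guard against empty loops
                yield from match(subject, mid)
            yield mid
    return match


def search(fragment, subject):
    """Unanchored search; returns (start, end) or None."""
    try:
        for start in range(len(subject) + 1):
            for end in fragment(subject, start):
                return (start, end)
    except Commit:
        return None    # no later starting position is tried
    return None


# (a(*COMMIT)b)+ac
with_commit = seq(plus(seq(lit("a"), commit, lit("b"))), lit("ac"))
# (ab)+ac, for comparison
without_commit = seq(plus(seq(lit("a"), lit("b"))), lit("ac"))

print(search(with_commit, "abac"))     # None
print(search(without_commit, "abac"))  # (0, 4)
```

In the model, the second iteration of the group matches "a", passes
(*COMMIT), and fails on "c". The resulting backtrack reaches (*COMMIT) before
the repeat can fall back to a single iteration.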
3337
3338 Backtracking verbs in assertions
3339
3340 (*FAIL) in any assertion has its normal effect: it forces an immediate
3341 backtrack. The behaviour of the other backtracking verbs depends on
3342 whether the assertion is standalone or is acting as the condition in a
3343 conditional group.
3344
3345 (*ACCEPT) in a standalone positive assertion causes the assertion to
3346 succeed without any further processing; captured substrings and a mark
3347 name (if set) are retained. In a standalone negative assertion,
3348 (*ACCEPT) causes the assertion to fail without any further processing;
3349 captured substrings and any mark name are discarded.
3350
3351 If the assertion is a condition, (*ACCEPT) causes the condition to be
3352 true for a positive assertion and false for a negative one; captured
3353 substrings are retained in both cases.
3354
3355 The remaining verbs act only when a later failure causes a backtrack to
3356 reach them. This means that, for the Perl-compatible assertions, their
3357 effect is confined to the assertion, because Perl lookaround assertions
3358 are atomic. A backtrack that occurs after such an assertion is complete
3359 does not jump back into the assertion. Note in particular that a
3360 (*MARK) name that is set in an assertion is not "seen" by an instance
3361 of (*SKIP:NAME) later in the pattern.
3362
3363 PCRE2 now supports non-atomic positive assertions, as described in the
3364 section entitled "Non-atomic assertions" above. These assertions must
3365 be standalone (not used as conditions). They are not Perl-compatible.
3366 For these assertions, a later backtrack does jump back into the asser‐
3367 tion, and therefore verbs such as (*COMMIT) can be triggered by back‐
3368 tracks from later in the pattern.
3369
3370 The effect of (*THEN) is not allowed to escape beyond an assertion. If
3371 there are no more branches to try, (*THEN) causes a positive assertion
3372 to be false, and a negative assertion to be true.
3373
3374 The other backtracking verbs are not treated specially if they appear
3375 in a standalone positive assertion. In a conditional positive asser‐
3376 tion, backtracking (from within the assertion) into (*COMMIT), (*SKIP),
3377 or (*PRUNE) causes the condition to be false. However, for both stand‐
3378 alone and conditional negative assertions, backtracking into (*COMMIT),
3379 (*SKIP), or (*PRUNE) causes the assertion to be true, without consider‐
3380 ing any further alternative branches.
3381
3382 Backtracking verbs in subroutines
3383
3384 These behaviours occur whether or not the group is called recursively.
3385
3386 (*ACCEPT) in a group called as a subroutine causes the subroutine match
3387 to succeed without any further processing. Matching then continues
3388 after the subroutine call. Perl documents this behaviour. Perl's treat‐
3389 ment of the other verbs in subroutines is different in some cases.
3390
3391 (*FAIL) in a group called as a subroutine has its normal effect: it
3392 forces an immediate backtrack.
3393
3394 (*COMMIT), (*SKIP), and (*PRUNE) cause the subroutine match to fail
3395 when triggered by being backtracked to in a group called as a subrou‐
3396 tine. There is then a backtrack at the outer level.
3397
3398 (*THEN), when triggered, skips to the next alternative in the innermost
3399 enclosing group that has alternatives (its normal behaviour). However,
3400 if there is no such group within the subroutine's group, the subroutine
3401 match fails and there is a backtrack at the outer level.
3402
3404
3405 pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3),
3406 pcre2(3).
3407
3409
3410 Philip Hazel
3411 University Computing Service
3412 Cambridge, England.
3413
3415
3416 Last updated: 06 October 2020
3417 Copyright (c) 1997-2020 University of Cambridge.
3418
3419
3420
3421PCRE2 10.35 06 October 2020 PCRE2PATTERN(3)