PCRE2PATTERN(3) Library Functions Manual PCRE2PATTERN(3)

PCRE2 - Perl-compatible regular expressions (revised API)

The syntax and semantics of the regular expressions that are supported
by PCRE2 are described in detail below. There is a quick-reference
syntax summary in the pcre2syntax page. PCRE2 tries to match Perl syntax
and semantics as closely as it can. PCRE2 also supports some
alternative regular expression syntax (which does not conflict with the
Perl syntax) in order to provide some compatibility with regular
expressions in Python, .NET, and Oniguruma.

Perl's regular expressions are described in its own documentation, and
regular expressions in general are covered in a number of books, some
of which have copious examples. Jeffrey Friedl's "Mastering Regular
Expressions", published by O'Reilly, covers regular expressions in great
detail. This description of PCRE2's regular expressions is intended as
reference material.

This document discusses the regular expression patterns that are
supported by PCRE2 when its main matching function, pcre2_match(), is
used. PCRE2 also has an alternative matching function,
pcre2_dfa_match(), which matches using a different algorithm that is
not Perl-compatible. Some of the features discussed below are not
available when DFA matching is used. The advantages and disadvantages
of the alternative function, and how it differs from the normal
function, are discussed in the pcre2matching page.

A number of options that can be passed to pcre2_compile() can also be
set by special items at the start of a pattern. These are not
Perl-compatible, but are provided to make these options accessible to
pattern writers who are not able to change the program that processes
the pattern. Any number of these items may appear, but they must all be
together right at the start of the pattern string, and the letters must
be in upper case.

UTF support

In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32
can be specified for the 32-bit library, in which case it constrains
the character values to valid Unicode code points. To process UTF
strings, PCRE2 must be built to include Unicode support (which is the
default). When using UTF strings you must either call the compiling
function with one or both of the PCRE2_UTF or PCRE2_MATCH_INVALID_UTF
options, or the pattern must start with the special sequence (*UTF),
which is equivalent to setting the PCRE2_UTF option. How setting a
UTF mode affects pattern matching is mentioned in several places below.
There is also a summary of features in the pcre2unicode page.

Some applications that allow their users to supply patterns may wish to
restrict them to non-UTF data for security reasons. If the
PCRE2_NEVER_UTF option is passed to pcre2_compile(), (*UTF) is not
allowed, and its appearance in a pattern causes an error.

Unicode property support

Another special sequence that may appear at the start of a pattern is
(*UCP). This has the same effect as setting the PCRE2_UCP option: it
causes sequences such as \d and \w to use Unicode properties to
determine character types, instead of recognizing only characters with
codes less than 256 via a lookup table. It also causes upper/lower
casing operations to use Unicode properties for characters with code
points greater than 127, even when UTF is not set.

Some applications that allow their users to supply patterns may wish to
restrict them for security reasons. If the PCRE2_NEVER_UCP option is
passed to pcre2_compile(), (*UCP) is not allowed, and its appearance in
a pattern causes an error.

Locking out empty string matching

Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same
effect as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option
to whichever matching function is subsequently called to match the
pattern. These options lock out the matching of empty strings, either
entirely, or only at the start of the subject.

Disabling auto-possessification

If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as
setting the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from making
quantifiers possessive when what follows cannot match the repeated
item. For example, by default a+b is treated as a++b. For more details,
see the pcre2api documentation.

Disabling start-up optimizations

If a pattern starts with (*NO_START_OPT), it has the same effect as
setting the PCRE2_NO_START_OPTIMIZE option. This disables several
optimizations for quickly reaching "no match" results. For more details,
see the pcre2api documentation.

Disabling automatic anchoring

If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect
as setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables
optimizations that apply to patterns whose top-level branches all start
with .* (match any number of arbitrary characters). For more details,
see the pcre2api documentation.

Disabling JIT compilation

If a pattern that starts with (*NO_JIT) is successfully compiled, an
attempt by the application to apply the JIT optimization by calling
pcre2_jit_compile() is ignored.

Setting match resource limits

The pcre2_match() function contains a counter that is incremented every
time it goes round its main loop. The caller of pcre2_match() can set a
limit on this counter, which therefore limits the amount of computing
resource used for a match. The maximum depth of nested backtracking can
also be limited; this indirectly restricts the amount of heap memory
that is used, but there is also an explicit memory limit that can be
set.

These facilities are provided to catch runaway matches that are
provoked by patterns with huge matching trees. A common example is a
pattern with nested unlimited repeats applied to a long string that does
not match. When one of these limits is reached, pcre2_match() gives an
error return. The limits can also be set by items at the start of the
pattern of the form

  (*LIMIT_HEAP=d)
  (*LIMIT_MATCH=d)
  (*LIMIT_DEPTH=d)

where d is any number of decimal digits. However, the value of the
setting must be less than the value set (or defaulted) by the caller of
pcre2_match() for it to have any effect. In other words, the pattern
writer can lower the limits set by the programmer, but not raise them.
If there is more than one setting of one of these limits, the lower
value is used. The heap limit is specified in kibibytes (units of 1024
bytes).
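
The "lower but never raise" rule can be modelled in a few lines. The
following Python sketch is not PCRE2 code: effective_limits() and the
dictionary of caller limits are hypothetical names chosen for
illustration, and the scanner assumes that only (*LIMIT_...) items appear
at the start of the pattern.

```python
import re

# Hypothetical model of the rule described above. It recognizes only the
# three (*LIMIT_...=d) items; other start-of-pattern items are not handled.
_LIMIT_ITEM = re.compile(r"\(\*(LIMIT_HEAP|LIMIT_MATCH|LIMIT_DEPTH)=(\d+)\)")


def effective_limits(pattern, caller_limits):
    """Return the limits in force after applying leading limit items.

    caller_limits must hold a value for each of the three limit names.
    """
    limits = dict(caller_limits)  # do not modify the caller's settings
    pos = 0
    while True:
        item = _LIMIT_ITEM.match(pattern, pos)
        if item is None:
            break
        name, value = item.group(1), int(item.group(2))
        # A pattern item can lower a limit but can never raise it, and
        # when an item is repeated the lowest value wins.
        limits[name] = min(limits[name], value)
        pos = item.end()
    return limits
```

With caller limits of 1000 for LIMIT_MATCH, a pattern that starts with
(*LIMIT_MATCH=200)(*LIMIT_MATCH=5000) ends up with 200 in this model: the
first item lowers the limit and the second has no effect.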

Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This
name is still recognized for backwards compatibility.

The heap limit applies only when the pcre2_match() or pcre2_dfa_match()
interpreters are used for matching. It does not apply to JIT. The match
limit is used (but in a different way) when JIT is being used, or when
pcre2_dfa_match() is called, to limit computing resource usage by those
matching functions. The depth limit is ignored by JIT but is relevant
for DFA matching, which uses function recursion for recursions within
the pattern and for lookaround assertions and atomic groups. In this
case, the depth limit controls the depth of such recursion.

Newline conventions

PCRE2 supports six different conventions for indicating line breaks in
strings: a single CR (carriage return) character, a single LF
(linefeed) character, the two-character sequence CRLF, any of the three
preceding, any Unicode newline sequence, or the NUL character (binary
zero). The pcre2api page has further discussion about newlines, and
shows how to set the newline convention when calling pcre2_compile().

It is also possible to specify a newline convention by starting a
pattern string with one of the following sequences:

  (*CR)        carriage return
  (*LF)        linefeed
  (*CRLF)      carriage return, followed by linefeed
  (*ANYCRLF)   any of the three above
  (*ANY)       all Unicode newline sequences
  (*NUL)       the NUL character (binary zero)

These override the default and the options given to the compiling
function. For example, on a Unix system where LF is the default newline
sequence, the pattern

  (*CR)a.b

changes the convention to CR. That pattern matches "a\nb" because LF is
no longer a newline. If more than one of these settings is present, the
last one is used.
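
The effect of (*CR) on the dot in this example can be imitated with
Python's re module, which has no newline-convention setting of its own.
The sketch below is an illustration only: emulate() is a hypothetical
helper that replaces each dot with a negated character class, it covers
just the two single-character conventions, and it assumes that every dot
in the pattern is an unescaped metacharacter outside a character class.

```python
import re

# What an unescaped dot must not match under each convention. CRLF,
# ANYCRLF, ANY and NUL are deliberately left out of this sketch.
DOT_CLASS = {
    "CR": r"[^\r]",
    "LF": r"[^\n]",
}


def emulate(pattern, convention):
    """Compile pattern with '.' rewritten for the given newline convention."""
    # Naive textual substitution; see the assumptions stated above.
    return re.compile(pattern.replace(".", DOT_CLASS[convention]))


cr_version = emulate("a.b", "CR")   # stands in for (*CR)a.b
lf_version = emulate("a.b", "LF")   # stands in for the LF default

matches_under_cr = cr_version.fullmatch("a\nb") is not None
matches_under_lf = lf_version.fullmatch("a\nb") is not None
```

Under the CR stand-in the subject "a\nb" matches, because LF is an
ordinary character there; under the LF stand-in it does not.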

The newline convention affects where the circumflex and dollar
assertions are true. It also affects the interpretation of the dot
metacharacter when PCRE2_DOTALL is not set, and the behaviour of \N when
not followed by an opening brace. However, it does not affect what the \R
escape sequence matches. By default, this is any Unicode newline
sequence, for Perl compatibility. However, this can be changed; see the
next section and the description of \R in the section entitled "Newline
sequences" below. A change of \R setting can be combined with a change
of newline convention.

Specifying what \R matches

It is possible to restrict \R to match only CR, LF, or CRLF (instead of
the complete set of Unicode line endings) by setting the option
PCRE2_BSR_ANYCRLF at compile time. This effect can also be achieved by
starting a pattern with (*BSR_ANYCRLF). For completeness,
(*BSR_UNICODE) is also recognized, corresponding to PCRE2_BSR_UNICODE.

PCRE2 can be compiled to run in an environment that uses EBCDIC as its
character code instead of ASCII or Unicode (typically a mainframe
system). In the sections below, character code values are ASCII or
Unicode; in an EBCDIC environment these characters may have different
code values, and there are no code points greater than 255.

A regular expression is a pattern that is matched against a subject
string from left to right. Most characters stand for themselves in a
pattern, and match the corresponding characters in the subject. As a
trivial example, the pattern

  The quick brown fox

matches a portion of a subject string that is identical to itself. When
caseless matching is specified (the PCRE2_CASELESS option or (?i)
within the pattern), letters are matched independently of case. Note
that there are two ASCII characters, K and S, that, in addition to
their lower case ASCII equivalents, are case-equivalent with Unicode
U+212A (Kelvin sign) and U+017F (long S) respectively when either
PCRE2_UTF or PCRE2_UCP is set.

The power of regular expressions comes from the ability to include wild
cards, character classes, alternatives, and repetitions in the pattern.
These are encoded in the pattern by the use of metacharacters, which do
not stand for themselves but instead are interpreted in some special
way.

There are two different sets of metacharacters: those that are
recognized anywhere in the pattern except within square brackets, and
those that are recognized within square brackets. Outside square
brackets, the metacharacters are as follows:

  \      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start group or control verb
  )      end group or control verb
  *      0 or more quantifier
  +      1 or more quantifier; also "possessive quantifier"
  ?      0 or 1 quantifier; also quantifier minimizer
  {      start min/max quantifier

Part of a pattern that is in square brackets is called a "character
class". In a character class the only metacharacters are:

  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (if followed by POSIX syntax)
  ]      terminates the character class

If a pattern is compiled with the PCRE2_EXTENDED option, most white
space in the pattern, other than in a character class, and characters
between a # outside a character class and the next newline, inclusive,
are ignored. An escaping backslash can be used to include a white space
or a # character as part of the pattern. If the PCRE2_EXTENDED_MORE
option is set, the same applies, but in addition unescaped space and
horizontal tab characters are ignored inside a character class. Note:
only these two characters are ignored, not the full set of pattern white
space characters that are ignored outside a character class. Option
settings can be changed within a pattern; see the section entitled
"Internal Option Setting" below.

The following sections describe the use of each of the metacharacters.

The backslash character has several uses. Firstly, if it is followed by
a character that is not a digit or a letter, it takes away any special
meaning that character may have. This use of backslash as an escape
character applies both inside and outside character classes.

For example, if you want to match a * character, you must write \* in
the pattern. This escaping action applies whether or not the following
character would otherwise be interpreted as a metacharacter, so it is
always safe to precede a non-alphanumeric with backslash to specify
that it stands for itself. In particular, if you want to match a
backslash, you write \\.

Only ASCII digits and letters have any special meaning after a
backslash. All other characters (in particular, those whose code points
are greater than 127) are treated as literals.

If you want to treat all characters in a sequence as literals, you can
do so by putting them between \Q and \E. This is different from Perl in
that $ and @ are handled as literals in \Q...\E sequences in PCRE2,
whereas in Perl, $ and @ cause variable interpolation. Also, Perl does
"double-quotish backslash interpolation" on any backslashes between \Q
and \E which, its documentation says, "may lead to confusing results".
PCRE2 treats a backslash between \Q and \E just like any other
character. Note the following examples:

  Pattern            PCRE2 matches   Perl matches

  \Qabc$xyz\E        abc$xyz         abc followed by the
                                       contents of $xyz
  \Qabc\$xyz\E       abc\$xyz        abc\$xyz
  \Qabc\E\$\Qxyz\E   abc$xyz         abc$xyz
  \QA\B\E            A\B             A\B
  \Q\\E              \               \\E

The \Q...\E sequence is recognized both inside and outside character
classes. An isolated \E that is not preceded by \Q is ignored. If \Q
is not followed by \E later in the pattern, the literal interpretation
continues to the end of the pattern (that is, \E is assumed at the
end). If the isolated \Q is inside a character class, this causes an
error, because the character class is not terminated by a closing
square bracket.
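
The \Q...\E rules in the "PCRE2 matches" column can be modelled by
translating such a pattern into one that Python's re module accepts. The
sketch below is an assumption-laden illustration, not PCRE2's parser:
expand_quoting() is a hypothetical helper, it handles only patterns
outside character classes, and it relies on re.escape() to make the
quoted text literal.

```python
import re


def expand_quoting(pattern):
    """Rewrite \\Q...\\E spans as escaped literals (hypothetical model)."""
    out = []
    i = 0
    n = len(pattern)
    while i < n:
        if pattern.startswith("\\Q", i):
            # Everything up to the next \E is literal, backslashes included.
            end = pattern.find("\\E", i + 2)
            if end == -1:
                # No \E later in the pattern: one is assumed at the end.
                out.append(re.escape(pattern[i + 2:]))
                i = n
            else:
                out.append(re.escape(pattern[i + 2:end]))
                i = end + 2
        elif pattern.startswith("\\E", i):
            # An isolated \E that is not preceded by \Q is ignored.
            i += 2
        elif pattern[i] == "\\" and i + 1 < n:
            # Any other escape is copied through unchanged.
            out.append(pattern[i:i + 2])
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)


# The first and last rows of the table above, as PCRE2 reads them.
dollar_is_literal = re.fullmatch(expand_quoting(r"\Qabc$xyz\E"), "abc$xyz")
single_backslash = re.fullmatch(expand_quoting(r"\Q\\E"), "\\")
```

Both fullmatch() calls succeed in this model, in line with the PCRE2
column: $ is not interpolated, and \Q\\E quotes exactly one backslash.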

Non-printing characters

A second use of backslash provides a way of encoding non-printing
characters in patterns in a visible manner. There is no restriction on
the appearance of non-printing characters in a pattern, but when a
pattern is being prepared by text editing, it is often easier to use one
of the following escape sequences instead of the binary character it
represents. In an ASCII or Unicode environment, these escapes are as
follows:

  \a           alarm, that is, the BEL character (hex 07)
  \cx          "control-x", where x is any printable ASCII character
  \e           escape (hex 1B)
  \f           form feed (hex 0C)
  \n           linefeed (hex 0A)
  \r           carriage return (hex 0D) (but see below)
  \t           tab (hex 09)
  \0dd         character with octal code 0dd
  \ddd         character with octal code ddd, or backreference
  \o{ddd..}    character with octal code ddd..
  \xhh         character with hex code hh
  \x{hhh..}    character with hex code hhh..
  \N{U+hhh..}  character with Unicode hex code point hhh..

By default, after \x that is not followed by {, from zero to two
hexadecimal digits are read (letters can be in upper or lower case). Any
number of hexadecimal digits may appear between \x{ and }. If a
character other than a hexadecimal digit appears between \x{ and }, or
if there is no terminating }, an error occurs.

Characters whose code points are less than 256 can be defined by either
of the two syntaxes for \x or by an octal sequence. There is no
difference in the way they are handled. For example, \xdc is exactly the
same as \x{dc} or \334. However, using the braced versions does make
such sequences easier to read.

Support is available for some ECMAScript (aka JavaScript) escape
sequences via two compile-time options. If PCRE2_ALT_BSUX is set, the
sequence \x followed by { is not recognized. Only if \x is followed by
two hexadecimal digits is it recognized as a character escape.
Otherwise it is interpreted as a literal "x" character. In this mode,
support for code points greater than 255 is provided by \u, which must
be followed by four hexadecimal digits; otherwise it is interpreted as a
literal "u" character.

PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in
addition, \u{hhh..} is recognized as the character specified by
hexadecimal code point. There may be any number of hexadecimal digits.
This syntax is from ECMAScript 6.

The \N{U+hhh..} escape sequence is recognized only when PCRE2 is
operating in UTF mode. Perl also uses \N{name} to specify characters by
Unicode name; PCRE2 does not support this. Note that when \N is not
followed by an opening brace (curly bracket) it has an entirely
different meaning, matching any character that is not a newline.

There are some legacy applications where the escape sequence \r is
expected to match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option
is set, \r in a pattern is converted to \n so that it matches a LF
(linefeed) instead of a CR (carriage return) character.

The precise effect of \cx on ASCII characters is as follows: if x is a
lower case letter, it is converted to upper case. Then bit 6 of the
character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
(A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
hex 7B (; is 3B). If the code unit following \c has a value less than
32 or greater than 126, a compile-time error occurs.
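
The \cx arithmetic for an ASCII environment is small enough to write out.
The function below is a sketch of the rule as stated in the preceding
paragraph; control_escape() is a hypothetical name, and the ValueError
merely stands in for PCRE2's compile-time error.

```python
def control_escape(x):
    # Model of \cx in an ASCII environment; returns the resulting code value.
    code = ord(x)
    if code < 32 or code > 126:
        # PCRE2 reports a compile-time error here; this sketch raises.
        raise ValueError("\\c must be followed by a printable ASCII character")
    if "a" <= x <= "z":
        code = ord(x.upper())   # lower case letters are converted first
    return code ^ 0x40          # invert bit 6 (hex 40)


for ch in "AZa{;":
    print(ch, hex(control_escape(ch)))
```

Because the operation is an exclusive OR, it maps letters down into the
control range and maps characters such as ; up by hex 40, which is why
\c{ and \c; swap places in the examples above.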

When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported.
\a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values.
The \c escape is processed as specified for Perl in the perlebcdic
document. The only characters that are allowed after \c are A-Z, a-z, or
one of @, [, \, ], ^, _, or ?. Any other character provokes a
compile-time error. The sequence \c@ encodes character code 0; after \c
the letters (in either case) encode characters 1-26 (hex 01 to hex 1A);
[, \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c?
becomes either 255 (hex FF) or 95 (hex 5F).

Thus, apart from \c?, these escapes generate the same character code
values as they do in an ASCII environment, though the meanings of the
values mostly differ. For example, \cG always generates code value 7,
which is BEL in ASCII but DEL in EBCDIC.

The sequence \c? generates DEL (127, hex 7F) in an ASCII environment,
but because 127 is not a control character in EBCDIC, Perl makes it
generate the APC character. Unfortunately, there are several variants
of EBCDIC. In most of them the APC character has the value 255 (hex
FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If
certain other characters have POSIX-BC values, PCRE2 makes \c? generate
95; otherwise it generates 255.

After \0 up to two further octal digits are read. If there are fewer
than two digits, just those that are present are used. Thus the
sequence \0\x\015 specifies two binary zeros followed by a CR character
(code value 13). Make sure you supply two digits after the initial zero
if the pattern character that follows is itself an octal digit.

The escape \o must be followed by a sequence of octal digits, enclosed
in braces. An error occurs if this is not the case. This escape is a
recent addition to Perl; it provides a way of specifying character code
points as octal numbers greater than 0777, and it also allows octal
numbers and backreferences to be unambiguously specified.

For greater clarity and unambiguity, it is best to avoid following \ by
a digit greater than zero. Instead, use \o{} or \x{} to specify
numerical character code points, and \g{} to specify backreferences. The
following paragraphs describe the old, ambiguous syntax.

The handling of a backslash followed by a digit other than 0 is
complicated, and Perl has changed over time, causing PCRE2 also to
change.

Outside a character class, PCRE2 reads the digit and any following
digits as a decimal number. If the number is less than 10, begins with
the digit 8 or 9, or if there are at least that many previous capture
groups in the expression, the entire sequence is taken as a
backreference. A description of how this works is given later, following
the discussion of parenthesized groups. Otherwise, up to three octal
digits are read to form a character code.

Inside a character class, PCRE2 handles \8 and \9 as the literal
characters "8" and "9", and otherwise reads up to three octal digits
following the backslash, using them to generate a data character. Any
subsequent digits stand for themselves. For example, outside a character
class:

  \040   is another way of writing an ASCII space
  \40    is the same, provided there are fewer than 40
           previous capture groups
  \7     is always a backreference
  \11    might be a backreference, or another way of
           writing a tab
  \011   is always a tab
  \0113  is a tab followed by the character "3"
  \113   might be a backreference, otherwise the
           character with octal code 113
  \377   might be a backreference, otherwise
           the value 255 (decimal)
  \81    is always a backreference

Note that octal values of 100 or greater that are specified using this
syntax must not be introduced by a leading zero, because no more than
three octal digits are ever read.
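
The decision made outside a character class for a backslash followed by
a non-zero digit can be written as a short function. This is a model of
the rule as described above, not PCRE2's source: the function name, its
arguments, and the shape of the returned tuple are all assumptions made
for illustration.

```python
def classify_backslash_digits(digits, group_count):
    # digits:      the run of decimal digits after the backslash; the
    #              first digit must not be 0 (that case is \0dd).
    # group_count: number of capture groups that precede this point.
    # Returns ("backreference", n, "") or ("character", code, leftover),
    # where leftover holds digits that were not consumed.
    number = int(digits)
    if number < 10 or digits[0] in "89" or group_count >= number:
        return ("backreference", number, "")
    # Otherwise up to three octal digits form a character code.
    octal = ""
    for ch in digits[:3]:
        if ch not in "01234567":
            break
        octal += ch
    return ("character", int(octal, 8), digits[len(octal):])


# \11 depends on how many groups precede it; \7 and \81 do not.
print(classify_backslash_digits("11", 0))
print(classify_backslash_digits("11", 11))
print(classify_backslash_digits("7", 0))
print(classify_backslash_digits("81", 0))
```

The dependence on group_count is the reason the text recommends \o{},
\x{} and \g{} instead: with those forms the reading of a sequence does
not change when capture groups are added to or removed from the pattern.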

Constraints on character values

Characters that are specified using octal or hexadecimal numbers are
limited to certain values, as follows:

  8-bit non-UTF mode    no greater than 0xff
  16-bit non-UTF mode   no greater than 0xffff
  32-bit non-UTF mode   no greater than 0xffffffff
  All UTF modes         no greater than 0x10ffff and a valid code point

Invalid Unicode code points are all those in the range 0xd800 to 0xdfff
(the so-called "surrogate" code points). The check for these can be
disabled by the caller of pcre2_compile() by setting the option
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only in
UTF-8 and UTF-32 modes, because these values are not representable in
UTF-16.
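
The table and the surrogate rule can be combined into one predicate. The
sketch below restates the constraints in Python; is_allowed() and its
parameter names are hypothetical, and allow_surrogate_escapes stands in
for the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option.

```python
def is_allowed(value, width, utf=False, allow_surrogate_escapes=False):
    """Return True if an escaped character value is within the limits.

    width is the code unit width of the library in use: 8, 16 or 32.
    """
    if width not in (8, 16, 32):
        raise ValueError("code unit width must be 8, 16 or 32")
    if not utf:
        # Non-UTF modes: anything that fits in one code unit.
        return 0 <= value <= (1 << width) - 1
    if value < 0 or value > 0x10FFFF:
        return False
    if 0xD800 <= value <= 0xDFFF:
        # Surrogates are invalid unless the check is disabled, which is
        # possible only in UTF-8 and UTF-32 modes.
        return allow_surrogate_escapes and width != 16
    return True
```

For instance, is_allowed(0x100, 8) is False while is_allowed(0x100, 16)
is True, and a surrogate such as 0xd800 is accepted in a UTF mode only
when the check is disabled and the width is not 16.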

Escape sequences in character classes

All the sequences that define a single character value can be used both
inside and outside character classes. In addition, inside a character
class, \b is interpreted as the backspace character (hex 08).

When not followed by an opening brace, \N is not allowed in a character
class. \B, \R, and \X are not special inside a character class. Like
other unrecognized alphabetic escape sequences, they cause an error.
Outside a character class, these sequences have different meanings.

Unsupported escape sequences

In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its
string handler and used to modify the case of following characters. By
default, PCRE2 does not support these escape sequences in patterns.
However, if either of the PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX
options is set, \U matches a "U" character, and \u can be used to define
a character by code point, as described above.

Absolute and relative backreferences

The sequence \g followed by a signed or unsigned number, optionally
enclosed in braces, is an absolute or relative backreference. A named
backreference can be coded as \g{name}. Backreferences are discussed
later, following the discussion of parenthesized groups.

Absolute and relative subroutine calls

For compatibility with Oniguruma, the non-Perl syntax \g followed by a
name or a number enclosed either in angle brackets or single quotes, is
an alternative syntax for referencing a capture group as a subroutine.
Details are discussed later. Note that \g{...} (Perl syntax) and
\g<...> (Oniguruma syntax) are not synonymous. The former is a
backreference; the latter is a subroutine call.

Generic character types

Another use of backslash is for specifying generic character types:

  \d     any decimal digit
  \D     any character that is not a decimal digit
  \h     any horizontal white space character
  \H     any character that is not a horizontal white space character
  \N     any character that is not a newline
  \s     any white space character
  \S     any character that is not a white space character
  \v     any vertical white space character
  \V     any character that is not a vertical white space character
  \w     any "word" character
  \W     any "non-word" character

The \N escape sequence has the same meaning as the "." metacharacter
when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change
the meaning of \N. Note that when \N is followed by an opening brace it
has a different meaning. See the section entitled "Non-printing
characters" above for details. Perl also uses \N{name} to specify
characters by Unicode name; PCRE2 does not support this.

Each pair of lower and upper case escape sequences partitions the
complete set of characters into two disjoint sets. Any given character
matches one, and only one, of each pair. The sequences can appear both
inside and outside character classes. They each match one character of
the appropriate type. If the current matching point is at the end of
the subject string, all of them fail, because there is no character to
match.

The default \s characters are HT (9), LF (10), VT (11), FF (12), CR
(13), and space (32), which are defined as white space in the "C"
locale. This list may vary if locale-specific matching is taking place.
For example, in some locales the "non-breaking space" character (\xA0)
is recognized as white space, and in others the VT character is not.

A "word" character is an underscore or any character that is a letter
or digit. By default, the definition of letters and digits is
controlled by PCRE2's low-valued character tables, and may vary if
locale-specific matching is taking place (see "Locale support" in the
pcre2api page). For example, in a French locale such as "fr_FR" in
Unix-like systems, or "french" in Windows, some character codes greater
than 127 are used for accented letters, and these are then matched by
\w. The use of locales with Unicode is discouraged.

By default, characters whose code points are greater than 127 never
match \d, \s, or \w, and always match \D, \S, and \W, although this may
be different for characters in the range 128-255 when locale-specific
matching is happening. These escape sequences retain their original
meanings from before Unicode support was available, mainly for
efficiency reasons. If the PCRE2_UCP option is set, the behaviour is
changed so that Unicode properties are used to determine character
types, as follows:

  \d     any character that matches \p{Nd} (decimal digit)
  \s     any character that matches \p{Z} or \h or \v
  \w     any character that matches \p{L} or \p{N}, plus underscore

The upper case escapes match the inverse sets of characters. Note that
\d matches only decimal digits, whereas \w matches any Unicode digit,
as well as any Unicode letter, and underscore. Note also that PCRE2_UCP
affects \b and \B, because they are defined in terms of \w and \W.
Matching these sequences is noticeably slower when PCRE2_UCP is set.
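
Python's re module offers a loosely comparable switch, which may help to
picture the difference between the default and the PCRE2_UCP behaviour.
This is an analogy only: Python defaults to Unicode-aware classes for
str patterns and needs re.ASCII to restrict them, which is the reverse
of PCRE2's default, and the exact character sets of the two libraries
are not claimed to be identical.

```python
import re

ARABIC_INDIC_THREE = "\u0663"   # a decimal digit (category Nd) above 127
E_ACUTE = "\u00e9"              # a letter above 127


def matches(escape, ch, unicode_aware):
    """Test one character against \\d, \\w and so on in either mode."""
    flags = 0 if unicode_aware else re.ASCII
    return re.fullmatch(escape, ch, flags) is not None


# ASCII-only classes: roughly the PCRE2 default described above.
ascii_digit = matches(r"\d", ARABIC_INDIC_THREE, unicode_aware=False)
ascii_word = matches(r"\w", E_ACUTE, unicode_aware=False)

# Unicode-aware classes: roughly what PCRE2_UCP or (*UCP) asks for.
unicode_digit = matches(r"\d", ARABIC_INDIC_THREE, unicode_aware=True)
unicode_word = matches(r"\w", E_ACUTE, unicode_aware=True)
```

In the ASCII-only mode both characters fail \d and \w and therefore
match \D and \W, which mirrors the statement that code points greater
than 127 never match \d, \s, or \w by default.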

The sequences \h, \H, \v, and \V, in contrast to the other sequences,
which match only ASCII characters by default, always match a specific
list of code points, whether or not PCRE2_UCP is set. The horizontal
space characters are:

  U+0009     Horizontal tab (HT)
  U+0020     Space
  U+00A0     Non-break space
  U+1680     Ogham space mark
  U+180E     Mongolian vowel separator
  U+2000     En quad
  U+2001     Em quad
  U+2002     En space
  U+2003     Em space
  U+2004     Three-per-em space
  U+2005     Four-per-em space
  U+2006     Six-per-em space
  U+2007     Figure space
  U+2008     Punctuation space
  U+2009     Thin space
  U+200A     Hair space
  U+202F     Narrow no-break space
  U+205F     Medium mathematical space
  U+3000     Ideographic space

The vertical space characters are:

  U+000A     Linefeed (LF)
  U+000B     Vertical tab (VT)
  U+000C     Form feed (FF)
  U+000D     Carriage return (CR)
  U+0085     Next line (NEL)
  U+2028     Line separator
  U+2029     Paragraph separator

In 8-bit, non-UTF-8 mode, only the characters with code points less
than 256 are relevant.
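
Since \h and \v are defined by fixed lists, they are easy to reproduce in
an engine that lacks them. The following sketch builds equivalent
character classes for Python's re module from the two lists above; the
names HORIZONTAL, VERTICAL and char_class() are hypothetical.

```python
import re

# Code points copied from the two lists above.
HORIZONTAL = ([0x0009, 0x0020, 0x00A0, 0x1680, 0x180E]
              + list(range(0x2000, 0x200B))      # U+2000 to U+200A
              + [0x202F, 0x205F, 0x3000])
VERTICAL = [0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029]


def char_class(code_points, negate=False):
    """Build a character class such as [\\u0009\\u0020...] from a list."""
    body = "".join("\\u%04x" % cp for cp in code_points)
    return "[" + ("^" if negate else "") + body + "]"


H = char_class(HORIZONTAL)                # stands in for \h
NOT_H = char_class(HORIZONTAL, True)      # stands in for \H
V = char_class(VERTICAL)                  # stands in for \v
NOT_V = char_class(VERTICAL, True)        # stands in for \V

ideographic_space_is_h = re.fullmatch(H, "\u3000") is not None
line_separator_is_v = re.fullmatch(V, "\u2028") is not None
```

Each negated class matches exactly the characters that its positive
counterpart rejects, so the pairs partition the character set in the way
described for the generic character types.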

Newline sequences

Outside a character class, by default, the escape sequence \R matches
any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent
to the following:

  (?>\r\n|\n|\x0b|\f|\r|\x85)

This is an example of an "atomic group", details of which are given
below. This particular group matches either the two-character sequence
CR followed by LF, or one of the single characters LF (linefeed,
U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR
(carriage return, U+000D), or NEL (next line, U+0085). Because this is
an atomic group, the two-character sequence is treated as a single unit
that cannot be split.

In other modes, two additional characters whose code points are greater
than 255 are added: LS (line separator, U+2028) and PS (paragraph
separator, U+2029). Unicode support is not needed for these characters
to be recognized.
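
The alternation above can be used to split text into lines in an engine
that has no \R of its own. The Python sketch below is an approximation
under stated assumptions: it uses an ordinary non-capturing group instead
of an atomic group, so it does not reproduce the "cannot be split"
property when more pattern follows; BSR_UNICODE, BSR_ANYCRLF and
split_lines() are hypothetical names.

```python
import re

# Non-atomic stand-ins for \R. CRLF is listed first so that it is
# preferred over a lone CR at the same position.
BSR_UNICODE = r"(?:\r\n|\n|\x0b|\f|\r|\x85|\u2028|\u2029)"
BSR_ANYCRLF = r"(?:\r\n|\n|\r)"


def split_lines(text, bsr=BSR_UNICODE):
    """Split text at every newline sequence matched by the chosen set."""
    return re.split(bsr, text)


mixed = "a\r\nb\nc\x85d\u2028e"
print(split_lines(mixed))
print(split_lines("a\x0bb\r\nc", BSR_ANYCRLF))
```

In the second call VT is not treated as a line ending, because the
restricted set recognizes only CR, LF, and CRLF.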

It is possible to restrict \R to match only CR, LF, or CRLF (instead of
the complete set of Unicode line endings) by setting the option
PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation for
"backslash R".) This can be made the default when PCRE2 is built; if
this is the case, the other behaviour can be requested via the
PCRE2_BSR_UNICODE option. It is also possible to specify these settings
by starting a pattern string with one of the following sequences:

  (*BSR_ANYCRLF)   CR, LF, or CRLF only
  (*BSR_UNICODE)   any Unicode newline sequence

These override the default and the options given to the compiling
function. Note that these special settings, which are not
Perl-compatible, are recognized only at the very start of a pattern, and
that they must be in upper case. If more than one of them is present,
the last one is used. They can be combined with a change of newline
convention; for example, a pattern can start with:

  (*ANY)(*BSR_ANYCRLF)

They can also be combined with the (*UTF) or (*UCP) special sequences.
Inside a character class, \R is treated as an unrecognized escape
sequence, and causes an error.
662
663 Unicode character properties
664
665 When PCRE2 is built with Unicode support (the default), three addi‐
666 tional escape sequences that match characters with specific properties
667 are available. They can be used in any mode, though in 8-bit and 16-bit
668 non-UTF modes these sequences are of course limited to testing charac‐
669 ters whose code points are less than U+0100 and U+10000, respectively.
670 In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode
671 limit) may be encountered. These are all treated as being in the Un‐
672 known script and with an unassigned type.
673
674 Matching characters by Unicode property is not fast, because PCRE2 has
675 to do a multistage table lookup in order to find a character's prop‐
676 erty. That is why the traditional escape sequences such as \d and \w do
677 not use Unicode properties in PCRE2 by default, though you can make
678 them do so by setting the PCRE2_UCP option or by starting the pattern
679 with (*UCP).
680
681 The extra escape sequences that provide property support are:
682
683 \p{xx} a character with the xx property
684 \P{xx} a character without the xx property
685 \X a Unicode extended grapheme cluster
686
687 The property names represented by xx above are not case-sensitive, and
688 in accordance with Unicode's "loose matching" rules, spaces, hyphens,
689 and underscores are ignored. There is support for Unicode script names,
690 Unicode general category properties, "Any", which matches any character
691 (including newline), Bidi_Class, a number of binary (yes/no) proper‐
692 ties, and some special PCRE2 properties (described below). Certain
693 other Perl properties such as "InMusicalSymbols" are not supported by
694 PCRE2. Note that \P{Any} does not match any characters, so always
695 causes a match failure.
696
697 Script properties for \p and \P
698
699 There are three different syntax forms for matching a script. Each Uni‐
700 code character has a basic script and, optionally, a list of other
701 scripts ("Script Extensions") with which it is commonly used. Using the
702 Adlam script as an example, \p{sc:Adlam} matches characters whose basic
703 script is Adlam, whereas \p{scx:Adlam} matches, in addition, characters
704 that have Adlam in their extensions list. The full names "script" and
705 "script extensions" for the property types are recognized, and an equals
706 sign is an alternative to the colon. If a script name is given without
707 a property type, for example, \p{Adlam}, it is treated as \p{scx:Ad‐
708 lam}. Perl changed to this interpretation at release 5.26 and PCRE2
709 changed at release 10.40.
710
711 Unassigned characters (and in non-UTF 32-bit mode, characters with code
712 points greater than 0x10FFFF) are assigned the "Unknown" script. Others
713 that are not part of an identified script are lumped together as "Com‐
714 mon". The current list of recognized script names and their 4-character
715 abbreviations can be obtained by running this command:
716
717 pcre2test -LS
718
719
720 The general category property for \p and \P
721
722 Each character has exactly one Unicode general category property, spec‐
723 ified by a two-letter abbreviation. For compatibility with Perl, nega‐
724 tion can be specified by including a circumflex between the opening
725 brace and the property name. For example, \p{^Lu} is the same as
726 \P{Lu}.
727
728 If only one letter is specified with \p or \P, it includes all the gen‐
729 eral category properties that start with that letter. In this case, in
730 the absence of negation, the curly brackets in the escape sequence are
731 optional; these two examples have the same effect:
732
733 \p{L}
734 \pL
735
736 The following general category property codes are supported:
737
738 C Other
739 Cc Control
740 Cf Format
741 Cn Unassigned
742 Co Private use
743 Cs Surrogate
744
745 L Letter
746 Ll Lower case letter
747 Lm Modifier letter
748 Lo Other letter
749 Lt Title case letter
750 Lu Upper case letter
751
752 M Mark
753 Mc Spacing mark
754 Me Enclosing mark
755 Mn Non-spacing mark
756
757 N Number
758 Nd Decimal number
759 Nl Letter number
760 No Other number
761
762 P Punctuation
763 Pc Connector punctuation
764 Pd Dash punctuation
765 Pe Close punctuation
766 Pf Final punctuation
767 Pi Initial punctuation
768 Po Other punctuation
769 Ps Open punctuation
770
771 S Symbol
772 Sc Currency symbol
773 Sk Modifier symbol
774 Sm Mathematical symbol
775 So Other symbol
776
777 Z Separator
778 Zl Line separator
779 Zp Paragraph separator
780 Zs Space separator
781
782 The special property LC, which has the synonym L&, is also supported:
783 it matches a character that has the Lu, Ll, or Lt property, in other
784 words, a letter that is not classified as a modifier or "other".
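
The naming rules above (two-letter codes, single-letter groups, circumflex negation, and the LC synonym) can be modelled with Python's unicodedata module. This is a sketch under assumptions: Python's Unicode tables may be a different version from those built into PCRE2, the function name has_category is hypothetical, and unlike PCRE2 this model is case-sensitive and does no loose matching of names.

```python
import unicodedata

def has_category(ch, prop):
    """Model \\p{prop} for general category codes, using Python's tables."""
    negate = prop.startswith("^")      # \p{^Lu} is the same as \P{Lu}
    name = prop[1:] if negate else prop
    cat = unicodedata.category(ch)     # always a two-letter code, e.g. "Lu"
    if name in ("LC", "L&"):           # cased letter: Lu, Ll, or Lt
        hit = cat in ("Lu", "Ll", "Lt")
    elif len(name) == 1:               # a single letter covers the whole group
        hit = cat.startswith(name)
    else:
        hit = cat == name
    return hit != negate

print(has_category("A", "Lu"), has_category("a", "^L"), has_category("\u01c5", "LC"))
```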
785
786 The Cs (Surrogate) property applies only to characters whose code
787 points are in the range U+D800 to U+DFFF. These characters are no dif‐
788 ferent to any other character when PCRE2 is not in UTF mode (using the
789 16-bit or 32-bit library). However, they are not valid in Unicode
790 strings and so cannot be tested by PCRE2 in UTF mode, unless UTF valid‐
791 ity checking has been turned off (see the discussion of
792 PCRE2_NO_UTF_CHECK in the pcre2api page).
793
794 The long synonyms for property names that Perl supports (such as
795 \p{Letter}) are not supported by PCRE2, nor is it permitted to prefix
796 any of these properties with "Is".
797
798 No character that is in the Unicode table has the Cn (unassigned) prop‐
799 erty. Instead, this property is assumed for any code point that is not
800 in the Unicode table.
801
802 Specifying caseless matching does not affect these escape sequences.
803 For example, \p{Lu} always matches only upper case letters. This is
804 different from the behaviour of current versions of Perl.
805
806 Binary (yes/no) properties for \p and \P
807
808 Unicode defines a number of binary properties, that is, properties
809 whose only values are true or false. You can obtain a list of those
810 that are recognized by \p and \P, along with their abbreviations, by
811 running this command:
812
813 pcre2test -LP
814
815
816 The Bidi_Class property for \p and \P
817
818 \p{Bidi_Class:<class>} matches a character with the given class
819 \p{BC:<class>} matches a character with the given class
820
821 The recognized classes are:
822
823 AL Arabic letter
824 AN Arabic number
825 B paragraph separator
826 BN boundary neutral
827 CS common separator
828 EN European number
829 ES European separator
830 ET European terminator
831 FSI first strong isolate
832 L left-to-right
833 LRE left-to-right embedding
834 LRI left-to-right isolate
835 LRO left-to-right override
836 NSM non-spacing mark
837 ON other neutral
838 PDF pop directional format
839 PDI pop directional isolate
840 R right-to-left
841 RLE right-to-left embedding
842 RLI right-to-left isolate
843 RLO right-to-left override
844 S segment separator
845 WS white space
846
847 An equals sign may be used instead of a colon. The class names are
848 case-insensitive; only the short names listed above are recognized.
849
850 Extended grapheme clusters
851
852 The \X escape matches any number of Unicode characters that form an
853 "extended grapheme cluster", and treats the sequence as an atomic group
854 (see below). Unicode supports various kinds of composite character by
855 giving each character a grapheme breaking property, and having rules
856 that use these properties to define the boundaries of extended grapheme
857 clusters. The rules are defined in Unicode Standard Annex 29, "Unicode
858 Text Segmentation". Unicode 11.0.0 abandoned the use of some previous
859 properties that had been used for emojis. Instead it introduced vari‐
860 ous emoji-specific properties. PCRE2 uses only the Extended Picto‐
861 graphic property.
862
863 \X always matches at least one character. Then it decides whether to
864 add additional characters according to the following rules for ending a
865 cluster:
866
867 1. End at the end of the subject string.
868
869 2. Do not end between CR and LF; otherwise end after any control char‐
870 acter.
871
872 3. Do not break Hangul (a Korean script) syllable sequences. Hangul
873 characters are of five types: L, V, T, LV, and LVT. An L character may
874 be followed by an L, V, LV, or LVT character; an LV or V character may
875 be followed by a V or T character; an LVT or T character may be fol‐
876 lowed only by a T character.
877
878 4. Do not end before extending characters or spacing marks or the
879 "zero-width joiner" character. Characters with the "mark" property al‐
880 ways have the "extend" grapheme breaking property.
881
882 5. Do not end after prepend characters.
883
884 6. Do not break within emoji modifier sequences or emoji ZWJ sequences.
885 That is, do not break between characters with the Extended_Pictographic
886 property. Extend and ZWJ characters are allowed between the charac‐
887 ters.
888
889 7. Do not break within emoji flag sequences. That is, do not break be‐
890 tween regional indicator (RI) characters if there are an odd number of
891 RI characters before the break point.
892
893 8. Otherwise, end the cluster.
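
Two of the rules above are simple enough to model directly. The following sketch covers rule 3 (Hangul syllable types) and rule 7 (regional indicators) only. It works on already-classified input instead of real characters, and the names HANGUL_FOLLOWERS, split_hangul, and split_regional_indicators are hypothetical; it is not how PCRE2 implements \X.

```python
# Rule 3: which Hangul syllable types may follow each type within a cluster.
HANGUL_FOLLOWERS = {
    "L": {"L", "V", "LV", "LVT"},
    "LV": {"V", "T"},
    "V": {"V", "T"},
    "LVT": {"T"},
    "T": {"T"},
}

def split_hangul(types):
    """Group a sequence of Hangul syllable types into clusters (rule 3 only)."""
    clusters = []
    for t in types:
        if clusters and t in HANGUL_FOLLOWERS[clusters[-1][-1]]:
            clusters[-1].append(t)     # no break is allowed before t
        else:
            clusters.append([t])       # start a new cluster
    return clusters

def split_regional_indicators(count):
    """Cluster sizes for a run of `count` RI characters (rule 7 only)."""
    sizes = []
    while count > 0:
        sizes.append(min(2, count))    # a break needs an even number of RIs before it
        count -= 2
    return sizes

print(split_hangul(["L", "V", "T", "L", "LV", "T", "T", "V"]))
print(split_regional_indicators(5))
```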
894
895 PCRE2's additional properties
896
897 As well as the standard Unicode properties described above, PCRE2 sup‐
898 ports four more that make it possible to convert traditional escape se‐
899 quences such as \w and \s to use Unicode properties. PCRE2 uses these
900 non-standard, non-Perl properties internally when PCRE2_UCP is set.
901 However, they may also be used explicitly. These properties are:
902
903 Xan Any alphanumeric character
904 Xps Any POSIX space character
905 Xsp Any Perl space character
906 Xwd Any Perl "word" character
907
908 Xan matches characters that have either the L (letter) or the N (num‐
909 ber) property. Xps matches the characters tab, linefeed, vertical tab,
910 form feed, or carriage return, and any other character that has the Z
911 (separator) property. Xsp is the same as Xps; in PCRE1 it used to ex‐
912 clude vertical tab, for Perl compatibility, but Perl changed. Xwd
913 matches the same characters as Xan, plus underscore.
914
915 There is another non-standard property, Xuc, which matches any charac‐
916 ter that can be represented by a Universal Character Name in C++ and
917 other programming languages. These are the characters $, @, ` (grave
918 accent), and all characters with Unicode code points greater than or
919 equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that
920 most base (ASCII) characters are excluded. (Universal Character Names
921 are of the form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.
922 Note that the Xuc property does not match these sequences but the char‐
923 acters that they represent.)
924
925 Resetting the match start
926
927 In normal use, the escape sequence \K causes any previously matched
928 characters not to be included in the final matched sequence that is re‐
929 turned. For example, the pattern:
930
931 foo\Kbar
932
933 matches "foobar", but reports that it has matched "bar". \K does not
934 interact with anchoring in any way. The pattern:
935
936 ^foo\Kbar
937
938 matches only when the subject begins with "foobar" (in single line
939 mode), though it again reports the matched string as "bar". This fea‐
940 ture is similar to a lookbehind assertion (described below). However,
941 in this case, the part of the subject before the real match does not
942 have to be of fixed length, as lookbehind assertions do. The use of \K
943 does not interfere with the setting of captured substrings. For exam‐
944 ple, when the pattern
945
946 (foo)\Kbar
947
948 matches "foobar", the first substring is still set to "foo".
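
Python's re module has no \K, but the reported result can be imitated with a capturing group, which may help when reasoning about what \K returns. This is a swapped-in technique, not PCRE2 behaviour: the helper name match_with_reset and the group name "kept" are hypothetical, and the sketch assumes the two pattern parts do not themselves define a group called "kept".

```python
import re

def match_with_reset(prefix, rest, subject):
    """Imitate prefix\\Krest: match both parts, report only the rest."""
    m = re.search("(?:" + prefix + ")(?P<kept>" + rest + ")", subject)
    if m is None:
        return None
    start, end = m.span("kept")
    return start, end, m.group("kept")

# The prefix may be of variable length, as with \K.
print(match_with_reset("fo+", "bar", "xfooobar"))
```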
949
950 From version 5.32.0 Perl forbids the use of \K in lookaround asser‐
951 tions. From release 10.38 PCRE2 also forbids this by default. However,
952 the PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK option can be used when calling
953 pcre2_compile() to re-enable the previous behaviour. When this option
954 is set, \K is acted upon when it occurs inside positive assertions, but
955 is ignored in negative assertions. Note that when a pattern such as
956 (?=ab\K) matches, the reported start of the match can be greater than
957 the end of the match. Using \K in a lookbehind assertion at the start
958 of a pattern can also lead to odd effects. For example, consider this
959 pattern:
960
961 (?<=\Kfoo)bar
962
963 If the subject is "foobar", a call to pcre2_match() with a starting
964 offset of 3 succeeds and reports the matching string as "foobar", that
965 is, the start of the reported match is earlier than where the match
966 started.
967
968 Simple assertions
969
970 The final use of backslash is for certain simple assertions. An asser‐
971 tion specifies a condition that has to be met at a particular point in
972 a match, without consuming any characters from the subject string. The
973 use of groups for more complicated assertions is described below. The
974 backslashed assertions are:
975
976 \b matches at a word boundary
977 \B matches when not at a word boundary
978 \A matches at the start of the subject
979 \Z matches at the end of the subject
980 also matches before a newline at the end of the subject
981 \z matches only at the end of the subject
982 \G matches at the first matching position in the subject
983
984 Inside a character class, \b has a different meaning; it matches the
985 backspace character. If any other of these assertions appears in a
986 character class, an "invalid escape sequence" error is generated.
987
988 A word boundary is a position in the subject string where the current
989 character and the previous character do not both match \w or \W (i.e.
990 one matches \w and the other matches \W), or the start or end of the
991 string if the first or last character matches \w, respectively. When
992 PCRE2 is built with Unicode support, the meanings of \w and \W can be
993 changed by setting the PCRE2_UCP option. When this is done, it also af‐
994 fects \b and \B. Neither PCRE2 nor Perl has a separate "start of word"
995 or "end of word" metasequence. However, whatever follows \b normally
996 determines which it is. For example, the fragment \ba matches "a" at
997 the start of a word.
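
The definition of a word boundary can be written out as a small predicate. The sketch below assumes the ASCII meaning of \w (letters, digits, underscore), as when PCRE2_UCP is not set, and checks itself against the \b of Python's re module, which is used here only as a convenient reference. The function names are hypothetical.

```python
import re
import string

WORD_CHARS = set(string.ascii_letters + string.digits + "_")  # ASCII \w

def is_word_boundary(subject, pos):
    """True if \\b would match at offset pos, using the ASCII meaning of \\w."""
    before = pos > 0 and subject[pos - 1] in WORD_CHARS
    after = pos < len(subject) and subject[pos] in WORD_CHARS
    return before != after             # exactly one side is a word character

def boundaries(subject):
    """All offsets in subject at which \\b would match."""
    return [p for p in range(len(subject) + 1) if is_word_boundary(subject, p)]

def re_boundaries(subject):
    """The same offsets as reported by Python's own \\b, for comparison."""
    return [m.start() for m in re.finditer(r"\b", subject, re.ASCII)]

print(boundaries("ab cd"), re_boundaries("ab cd"))
```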
998
999 The \A, \Z, and \z assertions differ from the traditional circumflex
1000 and dollar (described in the next section) in that they only ever match
1001 at the very start and end of the subject string, whatever options are
1002 set. Thus, they are independent of multiline mode. These three asser‐
1003 tions are not affected by the PCRE2_NOTBOL or PCRE2_NOTEOL options,
1004 which affect only the behaviour of the circumflex and dollar metachar‐
1005 acters. However, if the startoffset argument of pcre2_match() is non-
1006 zero, indicating that matching is to start at a point other than the
1007 beginning of the subject, \A can never match. The difference between
1008 \Z and \z is that \Z matches before a newline at the end of the string
1009 as well as at the very end, whereas \z matches only at the end.
1010
1011 The \G assertion is true only when the current matching position is at
1012 the start point of the matching process, as specified by the startoff‐
1013 set argument of pcre2_match(). It differs from \A when the value of
1014 startoffset is non-zero. By calling pcre2_match() multiple times with
1015 appropriate arguments, you can mimic Perl's /g option, and it is in
1016 this kind of implementation where \G can be useful.
1017
1018 Note, however, that PCRE2's implementation of \G, being true at the
1019 starting character of the matching process, is subtly different from
1020 Perl's, which defines it as true at the end of the previous match. In
1021 Perl, these can be different when the previously matched string was
1022 empty. Because PCRE2 does just one match at a time, it cannot reproduce
1023 this behaviour.
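
The idea of repeated matching from a moving start offset can be sketched in Python, where Pattern.match(string, pos) anchors the attempt at pos in much the same way that \G anchors it at startoffset. This is an analogy using a different library, not the PCRE2 API, and the helper name match_all_contiguous is hypothetical.

```python
import re

def match_all_contiguous(pattern, subject, start=0):
    """Collect successive matches, each anchored where the previous one ended."""
    compiled = re.compile(pattern)
    found = []
    pos = start
    while pos <= len(subject):
        m = compiled.match(subject, pos)   # anchored at pos, like \G
        if m is None:
            break                          # a gap ends the run of matches
        found.append(m.group())
        # Step past an empty match so the loop always makes progress.
        pos = m.end() if m.end() > pos else pos + 1
    return found

print(match_all_contiguous(r"\d+,?", "12,345,x6"))
```

The loop stops at the first position where the pattern does not match, so the trailing "6" is never reached; an unanchored search would have found it.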
1024
1025 If all the alternatives of a pattern begin with \G, the expression is
1026 anchored to the starting match position, and the "anchored" flag is set
1027 in the compiled regular expression.
1028
1030
1031 The circumflex and dollar metacharacters are zero-width assertions.
1032 That is, they test for a particular condition being true without con‐
1033 suming any characters from the subject string. These two metacharacters
1034 are concerned with matching the starts and ends of lines. If the new‐
1035 line convention is set so that only the two-character sequence CRLF is
1036 recognized as a newline, isolated CR and LF characters are treated as
1037 ordinary data characters, and are not recognized as newlines.
1038
1039 Outside a character class, in the default matching mode, the circumflex
1040 character is an assertion that is true only if the current matching
1041 point is at the start of the subject string. If the startoffset argu‐
1042 ment of pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum‐
1043 flex can never match if the PCRE2_MULTILINE option is unset. Inside a
1044 character class, circumflex has an entirely different meaning (see be‐
1045 low).
1046
1047 Circumflex need not be the first character of the pattern if a number
1048 of alternatives are involved, but it should be the first thing in each
1049 alternative in which it appears if the pattern is ever to match that
1050 branch. If all possible alternatives start with a circumflex, that is,
1051 if the pattern is constrained to match only at the start of the sub‐
1052 ject, it is said to be an "anchored" pattern. (There are also other
1053 constructs that can cause a pattern to be anchored.)
1054
1055 The dollar character is an assertion that is true only if the current
1056 matching point is at the end of the subject string, or immediately be‐
1057 fore a newline at the end of the string (by default), unless PCRE2_NO‐
1058 TEOL is set. Note, however, that it does not actually match the new‐
1059 line. Dollar need not be the last character of the pattern if a number
1060 of alternatives are involved, but it should be the last item in any
1061 branch in which it appears. Dollar has no special meaning in a charac‐
1062 ter class.
1063
1064 The meaning of dollar can be changed so that it matches only at the
1065 very end of the string, by setting the PCRE2_DOLLAR_ENDONLY option at
1066 compile time. This does not affect the \Z assertion.
1067
1068 The meanings of the circumflex and dollar metacharacters are changed if
1069 the PCRE2_MULTILINE option is set. When this is the case, a dollar
1070 character matches before any newlines in the string, as well as at the
1071 very end, and a circumflex matches immediately after internal newlines
1072 as well as at the start of the subject string. It does not match after
1073 a newline that ends the string, for compatibility with Perl. However,
1074 this can be changed by setting the PCRE2_ALT_CIRCUMFLEX option.
1075
1076 For example, the pattern /^abc$/ matches the subject string "def\nabc"
1077 (where \n represents a newline) in multiline mode, but not otherwise.
1078 Consequently, patterns that are anchored in single line mode because
1079 all branches start with ^ are not anchored in multiline mode, and a
1080 match for circumflex is possible when the startoffset argument of
1081 pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option is ignored
1082 if PCRE2_MULTILINE is set.
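
The example above can be reproduced with Python's re module, whose circumflex and dollar follow the same Perl rules for this case. Python is only a stand-in here: it recognizes LF alone as a newline and knows nothing of PCRE2 options such as PCRE2_DOLLAR_ENDONLY.

```python
import re

subject = "def\nabc"

# Single-line (default) mode: ^ matches only at the very start of the subject.
single = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches just after the internal newline.
multi = re.search(r"^abc$", subject, re.MULTILINE)

# By default $ also matches just before a newline that ends the subject.
before_final_newline = re.search(r"abc$", "abc\n")

print(single, multi.span(), before_final_newline.span())
```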
1083
1084 When the newline convention (see "Newline conventions" above)
1085 recognizes the two-character sequence CRLF as a newline, this is preferred,
1086 even if the single characters CR and LF are also recognized as new‐
1087 lines. For example, if the newline convention is "any", a multiline
1088 mode circumflex matches before "xyz" in the string "abc\r\nxyz" rather
1089 than after CR, even though CR on its own is a valid newline. (It also
1090 matches at the very start of the string, of course.)
1091
1092 Note that the sequences \A, \Z, and \z can be used to match the start
1093 and end of the subject in both modes, and if all branches of a pattern
1094 start with \A it is always anchored, whether or not PCRE2_MULTILINE is
1095 set.
1096
1098
1099 Outside a character class, a dot in the pattern matches any one charac‐
1100 ter in the subject string except (by default) a character that signi‐
1101 fies the end of a line. One or more characters may be specified as line
1102 terminators (see "Newline conventions" above).
1103
1104 Dot never matches a single line-ending character. When the two-charac‐
1105 ter sequence CRLF is the only line ending, dot does not match CR if it
1106 is immediately followed by LF, but otherwise it matches all characters
1107 (including isolated CRs and LFs). When ANYCRLF is selected for line
1108 endings, no occurrences of CR or LF match dot. When all Unicode line
1109 endings are being recognized, dot does not match CR or LF or any of the
1110 other line ending characters.
1111
1112 The behaviour of dot with regard to newlines can be changed. If the
1113 PCRE2_DOTALL option is set, a dot matches any one character, without
1114 exception. If the two-character sequence CRLF is present in the sub‐
1115 ject string, it takes two dots to match it.
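
The rules for dot under the different newline conventions can be summarized as a predicate. This is a model of the description above, not PCRE2 code: the function name dot_matches is hypothetical, and the convention is passed as a plain string ("LF", "CR", "CRLF", "ANYCRLF", or "ANY") chosen for this sketch.

```python
# Single characters that count as line endings when every Unicode newline
# is recognized: LF, VT, FF, CR, NEL, LS, PS.
ANY_NEWLINE_CHARS = set("\n\x0b\x0c\r\x85\u2028\u2029")

def dot_matches(subject, pos, newline="LF", dotall=False):
    """Model whether an unescaped dot matches the character at subject[pos]."""
    ch = subject[pos]
    if dotall:
        return True                      # PCRE2_DOTALL: no exceptions
    if newline == "CRLF":
        # Only a CR that is immediately followed by LF is refused.
        return not (ch == "\r" and subject[pos + 1:pos + 2] == "\n")
    if newline == "ANYCRLF":
        return ch not in "\r\n"
    if newline == "ANY":
        return ch not in ANY_NEWLINE_CHARS
    if newline in ("LF", "CR"):
        return ch != {"LF": "\n", "CR": "\r"}[newline]
    raise ValueError("unknown newline convention: " + newline)

print(dot_matches("a\r\nb", 1, "CRLF"), dot_matches("a\rb", 1, "CRLF"))
```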
1116
1117 The handling of dot is entirely independent of the handling of circum‐
1118 flex and dollar, the only relationship being that they both involve
1119 newlines. Dot has no special meaning in a character class.
1120
1121 The escape sequence \N when not followed by an opening brace behaves
1122 like a dot, except that it is not affected by the PCRE2_DOTALL option.
1123 In other words, it matches any character except one that signifies the
1124 end of a line.
1125
1126 When \N is followed by an opening brace it has a different meaning. See
1127 the section entitled "Non-printing characters" above for details. Perl
1128 also uses \N{name} to specify characters by Unicode name; PCRE2 does
1129 not support this.
1130
1132
1133 Outside a character class, the escape sequence \C matches any one code
1134 unit, whether or not a UTF mode is set. In the 8-bit library, one code
1135 unit is one byte; in the 16-bit library it is a 16-bit unit; in the
1136 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
1137 line-ending characters. The feature is provided in Perl in order to
1138 match individual bytes in UTF-8 mode, but it is unclear how it can use‐
1139 fully be used.
1140
1141 Because \C breaks up characters into individual code units, matching
1142 one unit with \C in UTF-8 or UTF-16 mode means that the rest of the
1143 string may start with a malformed UTF character. This has undefined re‐
1144 sults, because PCRE2 assumes that it is matching character by character
1145 in a valid UTF string (by default it checks the subject string's valid‐
1146 ity at the start of processing unless the PCRE2_NO_UTF_CHECK or
1147 PCRE2_MATCH_INVALID_UTF option is used).
1148
1149 An application can lock out the use of \C by setting the
1150 PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also
1151 possible to build PCRE2 with the use of \C permanently disabled.
1152
1153 PCRE2 does not allow \C to appear in lookbehind assertions (described
1154 below) in UTF-8 or UTF-16 modes, because this would make it impossible
1155 to calculate the length of the lookbehind. Neither the alternative
1156 matching function pcre2_dfa_match() nor the JIT optimizer supports \C in
1157 these UTF modes. The former gives a match-time error; the latter fails
1158 to optimize and so the match is always run using the interpreter.
1159
1160 In the 32-bit library, however, \C is always supported (when not ex‐
1161 plicitly locked out) because it always matches a single code unit,
1162 whether or not UTF-32 is specified.
1163
1164 In general, the \C escape sequence is best avoided. However, one way of
1165 using it that avoids the problem of malformed UTF-8 or UTF-16 charac‐
1166 ters is to use a lookahead to check the length of the next character,
1167 as in this pattern, which could be used with a UTF-8 string (ignore
1168 white space and line breaks):
1169
1170 (?| (?=[\x00-\x7f])(\C) |
1171 (?=[\x80-\x{7ff}])(\C)(\C) |
1172 (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
1173 (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
1174
1175 In this example, a group that starts with (?| resets the capturing
1176 parentheses numbers in each alternative (see "Duplicate Group Numbers"
1177 below). The assertions at the start of each branch check the next UTF-8
1178 character for values whose encoding uses 1, 2, 3, or 4 bytes, respec‐
1179 tively. The character's individual bytes are then captured by the ap‐
1180 propriate number of \C groups.
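
The four lookahead ranges correspond to the standard UTF-8 encoding lengths. A short Python sketch makes the correspondence concrete; it does not use \C at all (Python's re has no such escape), and the helper names utf8_length and code_units are hypothetical.

```python
def utf8_length(code_point):
    """Number of UTF-8 code units (bytes) used to encode code_point."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4

def code_units(ch):
    """The individual bytes that successive \\C items would capture."""
    return list(ch.encode("utf-8"))

# U+20AC (euro sign) falls in the third range, so three bytes are expected.
print(utf8_length(0x20AC), [hex(b) for b in code_units("\u20ac")])
```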
1181
1183
1184 An opening square bracket introduces a character class, terminated by a
1185 closing square bracket. A closing square bracket on its own is not spe‐
1186 cial by default. If a closing square bracket is required as a member
1187 of the class, it should be the first data character in the class (after
1188 an initial circumflex, if present) or escaped with a backslash. This
1189 means that, by default, an empty class cannot be defined. However, if
1190 the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at
1191 the start does end the (empty) class.
1192
1193 A character class matches a single character in the subject. A matched
1194 character must be in the set of characters defined by the class, unless
1195 the first character in the class definition is a circumflex, in which
1196 case the subject character must not be in the set defined by the class.
1197 If a circumflex is actually required as a member of the class, ensure
1198 it is not the first character, or escape it with a backslash.
1199
1200 For example, the character class [aeiou] matches any lower case vowel,
1201 while [^aeiou] matches any character that is not a lower case vowel.
1202 Note that a circumflex is just a convenient notation for specifying the
1203 characters that are in the class by enumerating those that are not. A
1204 class that starts with a circumflex is not an assertion; it still con‐
1205 sumes a character from the subject string, and therefore it fails if
1206 the current pointer is at the end of the string.
1207
1208 Characters in a class may be specified by their code points using \o,
1209 \x, or \N{U+hh..} in the usual way. When caseless matching is set, any
1210 letters in a class represent both their upper case and lower case ver‐
1211 sions, so for example, a caseless [aeiou] matches "A" as well as "a",
1212 and a caseless [^aeiou] does not match "A", whereas a caseful version
1213 would. Note that there are two ASCII characters, K and S, that, in ad‐
1214 dition to their lower case ASCII equivalents, are case-equivalent with
1215 Unicode U+212A (Kelvin sign) and U+017F (long S) respectively when ei‐
1216 ther PCRE2_UTF or PCRE2_UCP is set.
1217
1218 Characters that might indicate line breaks are never treated in any
1219 special way when matching character classes, whatever line-ending se‐
1220 quence is in use, and whatever setting of the PCRE2_DOTALL and
1221 PCRE2_MULTILINE options is used. A class such as [^a] always matches
1222 one of these characters.
1223
1224 The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
1225 \S, \v, \V, \w, and \W may appear in a character class, and add the
1226 characters that they match to the class. For example, [\dABCDEF]
1227 matches any hexadecimal digit. In UTF modes, the PCRE2_UCP option af‐
1228 fects the meanings of \d, \s, \w and their upper case partners, just as
1229 it does when they appear outside a character class, as described in the
1230 section entitled "Generic character types" above. The escape sequence
1231 \b has a different meaning inside a character class; it matches the
1232 backspace character. The sequences \B, \R, and \X are not special in‐
1233 side a character class. Like any other unrecognized escape sequences,
1234 they cause an error. The same is true for \N when not followed by an
1235 opening brace.
1236
1237 The minus (hyphen) character can be used to specify a range of charac‐
1238 ters in a character class. For example, [d-m] matches any letter be‐
1239 tween d and m, inclusive. If a minus character is required in a class,
1240 it must be escaped with a backslash or appear in a position where it
1241 cannot be interpreted as indicating a range, typically as the first or
1242 last character in the class, or immediately after a range. For example,
1243 [b-d-z] matches letters in the range b to d, a hyphen character, or z.
1244
1245 Perl treats a hyphen as a literal if it appears before or after a POSIX
1246 class (see below) or before or after a character type escape such as
1247 \d or \H. However, unless the hyphen is the last character in the
1248 class, Perl outputs a warning in its warning mode, as this is most
1249 likely a user error. As PCRE2 has no facility for warning, an error is
1250 given in these cases.
1251
1252 It is not possible to have the literal character "]" as the end charac‐
1253 ter of a range. A pattern such as [W-]46] is interpreted as a class of
1254 two characters ("W" and "-") followed by a literal string "46]", so it
1255 would match "W46]" or "-46]". However, if the "]" is escaped with a
1256 backslash it is interpreted as the end of range, so [W-\]46] is inter‐
1257 preted as a class containing a range followed by two other characters.
1258 The octal or hexadecimal representation of "]" can also be used to end
1259 a range.
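
The hyphen and closing-bracket cases above can be checked quickly in Python's re module, which happens to parse these particular classes the same way. It is a different engine, so this is a sketch for experimenting with the notation and not a statement about PCRE2 internals; the helper name members is hypothetical.

```python
import re

def members(char_class, candidates):
    """Return the candidate characters that the class matches."""
    return "".join(c for c in candidates if re.fullmatch(char_class, c))

# A hyphen immediately after a range is a literal: b, c, d, "-", or z.
print(members(r"[b-d-z]", "abcde-yz"))

# "]" cannot end a range unescaped: class of "W" and "-", then literal "46]".
print(bool(re.fullmatch(r"[W-]46]", "-46]")))

# Escaped, "]" does end the range W..], followed by the members 4 and 6.
print(members(r"[W-\]46]", "VWX[]^46"))
```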
1260
1261 Ranges normally include all code points between the start and end char‐
1262 acters, inclusive. They can also be used for code points specified nu‐
1263 merically, for example [\000-\037]. Ranges can include any characters
1264 that are valid for the current mode. In any UTF mode, the so-called
1265 "surrogate" characters (those whose code points lie between 0xd800 and
1266 0xdfff inclusive) may not be specified explicitly by default (the
1267 PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables this check). How‐
1268 ever, ranges such as [\x{d7ff}-\x{e000}], which include the surrogates,
1269 are always permitted.
1270
1271 There is a special case in EBCDIC environments for ranges whose end
1272 points are both specified as literal letters in the same case. For com‐
1273 patibility with Perl, EBCDIC code points within the range that are not
1274 letters are omitted. For example, [h-k] matches only four characters,
1275 even though the codes for h and k are 0x88 and 0x92, a range of 11 code
1276 points. However, if the range is specified numerically, for example,
1277 [\x88-\x92] or [h-\x92], all code points are included.
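
The gap between h and k in EBCDIC can be seen with one of the EBCDIC codecs that ships with Python. The choice of cp037 is an assumption made for this sketch; other EBCDIC code pages agree on the letter positions but may differ in the characters that sit between them.

```python
# cp037 is one common EBCDIC code page shipped with Python (an assumption;
# other EBCDIC variants place some non-letter characters differently).
low = "h".encode("cp037")[0]     # code for h
high = "k".encode("cp037")[0]    # code for k

# Every character whose code lies in the numeric range, then only the
# unaccented letters, which is what the literal range [h-k] keeps.
in_range = [bytes([code]).decode("cp037") for code in range(low, high + 1)]
letters = [ch for ch in in_range if ch.isascii() and ch.isalpha()]

print(hex(low), hex(high), len(in_range), letters)
```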
1278
1279 If a range that includes letters is used when caseless matching is set,
1280 it matches the letters in either case. For example, [W-c] is equivalent
1281 to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if
1282 character tables for a French locale are in use, [\xc8-\xcb] matches
1283 accented E characters in both cases.
1284
1285 A circumflex can conveniently be used with the upper case character
1286 types to specify a more restricted set of characters than the matching
1287 lower case type. For example, the class [^\W_] matches any letter or
1288 digit, but not underscore, whereas [\w] includes underscore. A positive
1289 character class should be read as "something OR something OR ..." and a
1290 negative class as "NOT something AND NOT something AND NOT ...".
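
The two readings can be compared side by side. The sketch uses Python's re module with the re.ASCII flag as a stand-in for the default (non-UCP) meaning of \w in PCRE2, and the function names are hypothetical.

```python
import re

def alnum_only(ch):
    """[^\\W_]: NOT a non-word character AND NOT an underscore."""
    return re.fullmatch(r"[^\W_]", ch, re.ASCII) is not None

def word_char(ch):
    """[\\w]: a letter OR a digit OR an underscore."""
    return re.fullmatch(r"[\w]", ch, re.ASCII) is not None

for ch in "a7_-":
    print(repr(ch), alnum_only(ch), word_char(ch))
```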
1291
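The difference between [^\W_] and [\w] is easy to see on a small sample. The sketch below again uses Python's re module as a stand-in; the re.ASCII flag is an assumption made here so that \w is limited to ASCII letters, digits, and underscore.

```python
import re

# NOT non-word AND NOT underscore: letters and digits only.
letters_digits = re.compile(r"[^\W_]", re.ASCII)
# The positive class keeps the underscore.
word_chars = re.compile(r"[\w]", re.ASCII)

sample = "a Z 7 _ - ."
only_alnum = letters_digits.findall(sample)
with_underscore = word_chars.findall(sample)

print(only_alnum)
print(with_underscore)
```
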
1292 The only metacharacters that are recognized in character classes are
1293 backslash, hyphen (only where it can be interpreted as specifying a
1294 range), circumflex (only at the start), opening square bracket (only
1295 when it can be interpreted as introducing a POSIX class name, or for a
1296 special compatibility feature - see the next two sections), and the
1297 terminating closing square bracket. However, escaping other non-al‐
1298 phanumeric characters does no harm.
1299
1301
1302 Perl supports the POSIX notation for character classes. This uses names
1303 enclosed by [: and :] within the enclosing square brackets. PCRE2 also
1304 supports this notation. For example,
1305
1306 [01[:alpha:]%]
1307
1308 matches "0", "1", any alphabetic character, or "%". The supported class
1309 names are:
1310
1311 alnum letters and digits
1312 alpha letters
1313 ascii character codes 0 - 127
1314 blank space or tab only
1315 cntrl control characters
1316 digit decimal digits (same as \d)
1317 graph printing characters, excluding space
1318 lower lower case letters
1319 print printing characters, including space
1320 punct printing characters, excluding letters and digits and space
1321 space white space (the same as \s from PCRE2 8.34)
1322 upper upper case letters
1323 word "word" characters (same as \w)
1324 xdigit hexadecimal digits
1325
1326 The default "space" characters are HT (9), LF (10), VT (11), FF (12),
1327 CR (13), and space (32). If locale-specific matching is taking place,
1328 the list of space characters may be different; there may be fewer or
1329 more of them. "Space" and \s match the same set of characters.
1330
1331 The name "word" is a Perl extension, and "blank" is a GNU extension
1332 from Perl 5.8. Another Perl extension is negation, which is indicated
1333 by a ^ character after the colon. For example,
1334
1335 [12[:^digit:]]
1336
1337 matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the
1338 POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
1339 these are not supported, and an error is given if they are encountered.
1340
1341 By default, characters with values greater than 127 do not match any of
1342 the POSIX character classes, although this may be different for charac‐
1343 ters in the range 128-255 when locale-specific matching is happening.
1344 However, if the PCRE2_UCP option is passed to pcre2_compile(), some of
1345 the classes are changed so that Unicode character properties are used.
1346 This is achieved by replacing certain POSIX classes with other se‐
1347 quences, as follows:
1348
1349 [:alnum:] becomes \p{Xan}
1350 [:alpha:] becomes \p{L}
1351 [:blank:] becomes \h
1352 [:cntrl:] becomes \p{Cc}
1353 [:digit:] becomes \p{Nd}
1354 [:lower:] becomes \p{Ll}
1355 [:space:] becomes \p{Xps}
1356 [:upper:] becomes \p{Lu}
1357 [:word:] becomes \p{Xwd}
1358
1359 Negated versions, such as [:^alpha:], use \P instead of \p. Three other
1360 POSIX classes are handled specially in UCP mode:
1361
1362 [:graph:] This matches characters that have glyphs that mark the page
1363 when printed. In Unicode property terms, it matches all char‐
1364 acters with the L, M, N, P, S, or Cf properties, except for:
1365
1366 U+061C Arabic Letter Mark
1367 U+180E Mongolian Vowel Separator
1368 U+2066 - U+2069 Various "isolate"s
1369
1370
1371 [:print:] This matches the same characters as [:graph:] plus space
1372 characters that are not controls, that is, characters with
1373 the Zs property.
1374
1375 [:punct:] This matches all characters that have the Unicode P (punctua‐
1376 tion) property, plus those characters with code points less
1377 than 256 that have the S (Symbol) property.
1378
1379 The other POSIX classes are unchanged, and match only characters with
1380 code points less than 256.
1381
1383
1384 In the POSIX.2 compliant library that was included in 4.4BSD Unix, the
1385 ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word"
1386 and "end of word". PCRE2 treats these items as follows:
1387
1388 [[:<:]] is converted to \b(?=\w)
1389 [[:>:]] is converted to \b(?<=\w)
1390
1391 Only these exact character sequences are recognized. A sequence such as
1392 [a[:<:]b] provokes an error for an unrecognized POSIX class name. This
1393 support is not compatible with Perl. It is provided to help migrations
1394 from other environments, and is best not used in any new patterns. Note
1395 that \b matches at the start and the end of a word (see "Simple asser‐
1396 tions" above), and in a Perl-style pattern the preceding or following
1397 character normally shows which is wanted, without the need for the as‐
1398 sertions that are used above in order to give exactly the POSIX behav‐
1399 iour.
1400
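The two converted forms use only \b and lookaround assertions, so their effect can be tried in any engine that supports those. The following sketch uses Python's re module (an assumption; it does not accept the [[:<:]] syntax itself) to list the positions where each converted form matches. The constant names are illustrative.

```python
import re

text = "one two_3 four"

START_OF_WORD = r"\b(?=\w)"    # what [[:<:]] is converted to
END_OF_WORD = r"\b(?<=\w)"     # what [[:>:]] is converted to

# Both patterns match the empty string, so only the offsets are of interest.
starts = [m.start() for m in re.finditer(START_OF_WORD, text)]
ends = [m.start() for m in re.finditer(END_OF_WORD, text)]

print(starts)
print(ends)
```
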
1402
1403 Vertical bar characters are used to separate alternative patterns. For
1404 example, the pattern
1405
1406 gilbert|sullivan
1407
1408 matches either "gilbert" or "sullivan". Any number of alternatives may
1409 appear, and an empty alternative is permitted (matching the empty
1410 string). The matching process tries each alternative in turn, from left
1411 to right, and the first one that succeeds is used. If the alternatives
1412 are within a group (defined below), "succeeds" means matching the rest
1413 of the main pattern as well as the alternative in the group.
1414
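The left-to-right rule has visible consequences when one alternative is a prefix of another. The sketch below illustrates this with Python's re module, which is assumed here to order alternatives the same way; the extra patterns are made up for the illustration.

```python
import re

# Either alternative can match.
names = re.findall(r"gilbert|sullivan", "gilbert and sullivan")

# The first alternative that succeeds is used, even when a later one
# would match more text.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a group, "succeeds" includes the rest of the pattern, so the
# matcher moves on to the second alternative here.
rest_matters = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_ok = re.fullmatch(r"x(?:y|)", "x") is not None

print(names, first_wins, rest_matters, empty_ok)
```
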
1416
1417 The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
1418 PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options
1419 can be changed from within the pattern by a sequence of letters en‐
1420 closed between "(?" and ")". These options are Perl-compatible, and
1421 are described in detail in the pcre2api documentation. The option let‐
1422 ters are:
1423
1424 i for PCRE2_CASELESS
1425 m for PCRE2_MULTILINE
1426 n for PCRE2_NO_AUTO_CAPTURE
1427 s for PCRE2_DOTALL
1428 x for PCRE2_EXTENDED
1429 xx for PCRE2_EXTENDED_MORE
1430
1431 For example, (?im) sets caseless, multiline matching. It is also possi‐
1432 ble to unset these options by preceding the relevant letters with a hy‐
1433 phen, for example (?-im). The two "extended" options are not indepen‐
1434 dent; unsetting either one cancels the effects of both of them.
1435
1436 A combined setting and unsetting such as (?im-sx), which sets
1437 PCRE2_CASELESS and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and
1438 PCRE2_EXTENDED, is also permitted. Only one hyphen may appear in the
1439 options string. If a letter appears both before and after the hyphen,
1440 the option is unset. An empty options setting "(?)" is allowed. Need‐
1441 less to say, it has no effect.
1442
1443 If the first character following (? is a circumflex, it causes all of
1444 the above options to be unset. Thus, (?^) is equivalent to (?-imnsx).
1445 Letters may follow the circumflex to cause some options to be re-in‐
1446 stated, but a hyphen may not appear.
1447
1448 The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be
1449 changed in the same way as the Perl-compatible options by using the
1450 characters J and U respectively. However, these are not unset by (?^).
1451
1452 When one of these option changes occurs at top level (that is, not in‐
1453 side group parentheses), the change applies to the remainder of the
1454 pattern that follows. An option change within a group (see below for a
1455 description of groups) affects only that part of the group that follows
1456 it, so
1457
1458 (a(?i)b)c
1459
1460 matches abc and aBc and no other strings (assuming PCRE2_CASELESS is
1461 not used). By this means, options can be made to have different set‐
1462 tings in different parts of the pattern. Any changes made in one alter‐
1463 native do carry on into subsequent branches within the same group. For
1464 example,
1465
1466 (a(?i)b|c)
1467
1468 matches "ab", "aB", "c", and "C", even though when matching "C" the
1469 first branch is abandoned before the option setting. This is because
1470 the effects of option settings happen at compile time. There would be
1471 some very weird behaviour otherwise.
1472
1473 As a convenient shorthand, if any option settings are required at the
1474 start of a non-capturing group (see the next section), the option let‐
1475 ters may appear between the "?" and the ":". Thus the two patterns
1476
1477 (?i:saturday|sunday)
1478 (?:(?i)saturday|sunday)
1479
1480 match exactly the same set of strings.
1481
1482 Note: There are other PCRE2-specific options, applying to the whole
1483 pattern, which can be set by the application when the compiling func‐
1484 tion is called. In addition, the pattern can contain special leading
1485 sequences such as (*CRLF) to override what the application has set or
1486 what has been defaulted. Details are given in the section entitled
1487 "Newline sequences" above. There are also the (*UTF) and (*UCP) leading
1488 sequences that can be used to set UTF and Unicode property modes; they
1489 are equivalent to setting the PCRE2_UTF and PCRE2_UCP options, respec‐
1490 tively. However, the application can set the PCRE2_NEVER_UTF and
1491 PCRE2_NEVER_UCP options, which lock out the use of the (*UTF) and
1492 (*UCP) sequences.
1493
1495
1496 Groups are delimited by parentheses (round brackets), which can be
1497 nested. Turning part of a pattern into a group does two things:
1498
1499 1. It localizes a set of alternatives. For example, the pattern
1500
1501 cat(aract|erpillar|)
1502
1503 matches "cataract", "caterpillar", or "cat". Without the parentheses,
1504 it would match "cataract", "erpillar" or an empty string.
1505
1506 2. It creates a "capture group". This means that, when the whole pat‐
1507 tern matches, the portion of the subject string that matched the group
1508 is passed back to the caller, separately from the portion that matched
1509 the whole pattern. (This applies only to the traditional matching
1510 function; the DFA matching function does not support capturing.)
1511
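The first effect, localizing alternatives, can be illustrated by comparing the pattern with and without its parentheses. This sketch uses Python's re module as a stand-in for PCRE2; the variable names are illustrative.

```python
import re

grouped = re.compile(r"cat(aract|erpillar|)")
ungrouped = re.compile(r"cataract|erpillar|")

# With the group, "cat" is a common prefix of all three alternatives.
grouped_hits = [s for s in ["cataract", "caterpillar", "cat"]
                if grouped.fullmatch(s)]

# Without it, the vertical bars split the whole pattern.
ungrouped_hits = [s for s in ["cataract", "erpillar", ""]
                  if ungrouped.fullmatch(s)]
caterpillar_ungrouped = ungrouped.fullmatch("caterpillar")

print(grouped_hits)
print(ungrouped_hits)
print(caterpillar_ungrouped)
```
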
1512 Opening parentheses are counted from left to right (starting from 1) to
1513 obtain numbers for capture groups. For example, if the string "the red
1514 king" is matched against the pattern
1515
1516 the ((red|white) (king|queen))
1517
1518 the captured substrings are "red king", "red", and "king", and are num‐
1519 bered 1, 2, and 3, respectively.
1520
1521 The fact that plain parentheses fulfil two functions is not always
1522 helpful. There are often times when grouping is required without cap‐
1523 turing. If an opening parenthesis is followed by a question mark and a
1524 colon, the group does not do any capturing, and is not counted when
1525 computing the number of any subsequent capture groups. For example, if
1526 the string "the white queen" is matched against the pattern
1527
1528 the ((?:red|white) (king|queen))
1529
1530 the captured substrings are "white queen" and "queen", and are numbered
1531 1 and 2. The maximum number of capture groups is 65535.
1532
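The numbering of the two example patterns can be reproduced as follows. The sketch uses Python's re module, which is assumed to count opening parentheses in the same way; it says nothing about how PCRE2 returns the substrings to its caller.

```python
import re

# Three capture groups, numbered by the position of the opening parenthesis.
m = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
captured = m.groups()

# (?: ... ) groups without capturing and is skipped when counting.
m2 = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")
captured_noncapturing = m2.groups()

print(captured)
print(captured_noncapturing)
```
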
1533 As a convenient shorthand, if any option settings are required at the
1534 start of a non-capturing group, the option letters may appear between
1535 the "?" and the ":". Thus the two patterns
1536
1537 (?i:saturday|sunday)
1538 (?:(?i)saturday|sunday)
1539
1540 match exactly the same set of strings. Because alternative branches are
1541 tried from left to right, and options are not reset until the end of
1542 the group is reached, an option setting in one branch does affect sub‐
1543 sequent branches, so the above patterns match "SUNDAY" as well as "Sat‐
1544 urday".
1545
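The shorthand form can be tried with Python's re module, used here as a stand-in. Only the first spelling is shown, because Python (since version 3.11) rejects an option setting such as (?i) that is not at the very start of the pattern. The second pattern in the sketch is made up to show that the option stops at the end of the group.

```python
import re

weekend = re.compile(r"(?i:saturday|sunday)")

# The option applies to every branch of the group.
hits = [s for s in ["Saturday", "SUNDAY", "sunday"] if weekend.fullmatch(s)]

# The option is confined to the group: text after it stays case-sensitive.
scoped = re.compile(r"(?i:saturday|sunday) night")
lower_tail = scoped.fullmatch("SUNDAY night") is not None
upper_tail = scoped.fullmatch("SUNDAY NIGHT") is not None

print(hits, lower_tail, upper_tail)
```
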
1547
1548 Perl 5.10 introduced a feature whereby each alternative in a group uses
1549 the same numbers for its capturing parentheses. Such a group starts
1550 with (?| and is itself a non-capturing group. For example, consider
1551 this pattern:
1552
1553 (?|(Sat)ur|(Sun))day
1554
1555 Because the two alternatives are inside a (?| group, both sets of cap‐
1556 turing parentheses are numbered one. Thus, when the pattern matches,
1557 you can look at captured substring number one, whichever alternative
1558 matched. This construct is useful when you want to capture part, but
1559 not all, of one of a number of alternatives. Inside a (?| group, paren‐
1560 theses are numbered as usual, but the number is reset at the start of
1561 each branch. The numbers of any capturing parentheses that follow the
1562 whole group start after the highest number used in any branch. The fol‐
1563 lowing example is taken from the Perl documentation. The numbers under‐
1564 neath show in which buffer the captured content will be stored.
1565
1566 # before  ---------------branch-reset----------- after
1567 / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
1568 # 1            2         2  3        2     3     4
1569
1570 A backreference to a capture group uses the most recent value that is
1571 set for the group. The following pattern matches "abcabc" or "defdef":
1572
1573 /(?|(abc)|(def))\1/
1574
1575 In contrast, a subroutine call to a capture group always refers to the
1576 first one in the pattern with the given number. The following pattern
1577 matches "abcabc" or "defabc":
1578
1579 /(?|(abc)|(def))(?1)/
1580
1581 A relative reference such as (?-1) is no different: it is just a conve‐
1582 nient way of computing an absolute group number.
1583
1584 If a condition test for a group's having matched refers to a non-unique
1585 number, the test is true if any group with that number has matched.
1586
1587 An alternative approach to using this "branch reset" feature is to use
1588 duplicate named groups, as described in the next section.
1589
1591
1592 Identifying capture groups by number is simple, but it can be very hard
1593 to keep track of the numbers in complicated patterns. Furthermore, if
1594 an expression is modified, the numbers may change. To help with this
1595 difficulty, PCRE2 supports the naming of capture groups. This feature
1596 was not added to Perl until release 5.10. Python had the feature ear‐
1597 lier, and PCRE1 introduced it at release 4.0, using the Python syntax.
1598 PCRE2 supports both the Perl and the Python syntax.
1599
1600 In PCRE2, a capture group can be named in one of three ways:
1601 (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python.
1602 Names may be up to 32 code units long. When PCRE2_UTF is not set, they
1603 may contain only ASCII alphanumeric characters and underscores, but
1604 must start with a non-digit. When PCRE2_UTF is set, the syntax of group
1605 names is extended to allow any Unicode letter or Unicode decimal digit.
1606 In other words, group names must match one of these patterns:
1607
1608 ^[_A-Za-z][_A-Za-z0-9]*\z when PCRE2_UTF is not set
1609 ^[_\p{L}][_\p{L}\p{Nd}]*\z when PCRE2_UTF is set
1610
1611 References to capture groups from other parts of the pattern, such as
1612 backreferences, recursion, and conditions, can all be made by name as
1613 well as by number.
1614
1615 Named capture groups are allocated numbers as well as names, exactly as
1616 if the names were not present. In both PCRE2 and Perl, capture groups
1617 are primarily identified by numbers; any names are just aliases for
1618 these numbers. The PCRE2 API provides function calls for extracting the
1619 complete name-to-number translation table from a compiled pattern, as
1620 well as convenience functions for extracting captured substrings by
1621 name.
1622
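The idea that names are aliases for numbers can be seen in any engine that exposes a name-to-number table. The sketch below uses Python's re module and its groupindex attribute as an analogue; it is not the PCRE2 API, which is described in the pcre2api documentation. The pattern and names are made up for the illustration, and the Python (?P<name>...) syntax is used because both engines accept it.

```python
import re

# Groups 1 and 3 are named; group 2 is not. Numbers are allocated
# exactly as if the names were absent.
date = re.compile(r"(?P<year>\d{4})-(\d{2})-(?P<day>\d{2})")

name_to_number = dict(date.groupindex)

m = date.fullmatch("2024-02-29")
by_name = m.group("day")
by_number = m.group(3)

print(name_to_number, by_name, by_number)
```
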
1623 Warning: When more than one capture group has the same number, as de‐
1624 scribed in the previous section, a name given to one of them applies to
1625 all of them. Perl allows identically numbered groups to have different
1626 names. Consider this pattern, where there are two capture groups, both
1627 numbered 1:
1628
1629 (?|(?<AA>aa)|(?<BB>bb))
1630
1631 Perl allows this, with both names AA and BB as aliases of group 1.
1632 Thus, after a successful match, both names yield the same value (either
1633 "aa" or "bb").
1634
1635 In an attempt to reduce confusion, PCRE2 does not allow the same group
1636 number to be associated with more than one name. The example above pro‐
1637 vokes a compile-time error. However, there is still scope for confu‐
1638 sion. Consider this pattern:
1639
1640 (?|(?<AA>aa)|(bb))
1641
1642 Although the second group number 1 is not explicitly named, the name AA
1643 is still an alias for any group 1. Whether the pattern matches "aa" or
1644 "bb", a reference by name to group AA yields the matched string.
1645
1646 By default, a name must be unique within a pattern, except that dupli‐
1647 cate names are permitted for groups with the same number, for example:
1648
1649 (?|(?<AA>aa)|(?<AA>bb))
1650
1651 The duplicate name constraint can be disabled by setting the PCRE2_DUP‐
1652 NAMES option at compile time, or by the use of (?J) within the pattern,
1653 as described in the section entitled "Internal Option Setting" above.
1654
1655 Duplicate names can be useful for patterns where only one instance of
1656 the named capture group can match. Suppose you want to match the name
1657 of a weekday, either as a 3-letter abbreviation or as the full name,
1658 and in both cases you want to extract the abbreviation. This pattern
1659 (ignoring the line breaks) does the job:
1660
1661 (?J)
1662 (?<DN>Mon|Fri|Sun)(?:day)?|
1663 (?<DN>Tue)(?:sday)?|
1664 (?<DN>Wed)(?:nesday)?|
1665 (?<DN>Thu)(?:rsday)?|
1666 (?<DN>Sat)(?:urday)?
1667
1668 There are five capture groups, but only one is ever set after a match.
1669 The convenience functions for extracting the data by name return the
1670 substring for the first (and in this example, the only) group of that
1671 name that matched. This saves searching to find which numbered group it
1672 was. (An alternative way of solving this problem is to use a "branch
1673 reset" group, as described in the previous section.)
1674
1675 If you make a backreference to a non-unique named group from elsewhere
1676 in the pattern, the groups to which the name refers are checked in the
1677 order in which they appear in the overall pattern. The first one that
1678 is set is used for the reference. For example, this pattern matches
1679 both "foofoo" and "barbar" but not "foobar" or "barfoo":
1680
1681 (?J)(?:(?<n>foo)|(?<n>bar))\k<n>
1682
1683
1684 If you make a subroutine call to a non-unique named group, the one that
1685 corresponds to the first occurrence of the name is used. In the absence
1686 of duplicate numbers this is the one with the lowest number.
1687
1688 If you use a named reference in a condition test (see the section about
1689 conditions below), either to check whether a capture group has matched,
1690 or to check for recursion, all groups with the same name are tested. If
1691 the condition is true for any one of them, the overall condition is
1692 true. This is the same behaviour as testing by number. For further de‐
1693 tails of the interfaces for handling named capture groups, see the
1694 pcre2api documentation.
1695
1697
1698 Repetition is specified by quantifiers, which can follow any of the
1699 following items:
1700
1701 a literal data character
1702 the dot metacharacter
1703 the \C escape sequence
1704 the \R escape sequence
1705 the \X escape sequence
1706 an escape such as \d or \pL that matches a single character
1707 a character class
1708 a backreference
1709 a parenthesized group (including lookaround assertions)
1710 a subroutine call (recursive or otherwise)
1711
1712 The general repetition quantifier specifies a minimum and maximum num‐
1713 ber of permitted matches, by giving the two numbers in curly brackets
1714 (braces), separated by a comma. The numbers must be less than 65536,
1715 and the first must be less than or equal to the second. For example,
1716
1717 z{2,4}
1718
1719 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
1720 special character. If the second number is omitted, but the comma is
1721 present, there is no upper limit; if the second number and the comma
1722 are both omitted, the quantifier specifies an exact number of required
1723 matches. Thus
1724
1725 [aeiou]{3,}
1726
1727 matches at least 3 successive vowels, but may match many more, whereas
1728
1729 \d{8}
1730
1731 matches exactly 8 digits. An opening curly bracket that appears in a
1732 position where a quantifier is not allowed, or one that does not match
1733 the syntax of a quantifier, is taken as a literal character. For exam‐
1734 ple, {,6} is not a quantifier, but a literal string of four characters.
1735
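The three quantifier forms above (bounded, open-ended, and exact) behave as follows on small subjects. The sketch uses Python's re module as a stand-in. Python differs from the text on one point: it reads {,6} as {0,6}, so that case is left out.

```python
import re

# {2,4}: between two and four repetitions.
two_to_four = [s for s in ["z", "zz", "zzz", "zzzz", "zzzzz"]
               if re.fullmatch(r"z{2,4}", s)]

# {3,}: at least three, with no upper limit.
at_least_three = re.search(r"[aeiou]{3,}", "queueing").group()

# {8}: exactly eight.
exactly_eight = re.fullmatch(r"\d{8}", "20240229") is not None
nine_digits = re.fullmatch(r"\d{8}", "202402290") is not None

print(two_to_four, at_least_three, exactly_eight, nine_digits)
```
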
1736 In UTF modes, quantifiers apply to characters rather than to individual
1737 code units. Thus, for example, \x{100}{2} matches two characters, each
1738 of which is represented by a two-byte sequence in a UTF-8 string. Simi‐
1739 larly, \X{3} matches three Unicode extended grapheme clusters, each of
1740 which may be several code units long (and they may be of different
1741 lengths).
1742
1743 The quantifier {0} is permitted, causing the expression to behave as if
1744 the previous item and the quantifier were not present. This may be use‐
1745 ful for capture groups that are referenced as subroutines from else‐
1746 where in the pattern (but see also the section entitled "Defining cap‐
1747 ture groups for use by reference only" below). Except for parenthesized
1748 groups, items that have a {0} quantifier are omitted from the compiled
1749 pattern.
1750
1751 For convenience, the three most common quantifiers have single-charac‐
1752 ter abbreviations:
1753
1754 * is equivalent to {0,}
1755 + is equivalent to {1,}
1756 ? is equivalent to {0,1}
1757
1758 It is possible to construct infinite loops by following a group that
1759 can match no characters with a quantifier that has no upper limit, for
1760 example:
1761
1762 (a?)*
1763
1764 Earlier versions of Perl and PCRE1 used to give an error at compile
1765 time for such patterns. However, because there are cases where this can
1766 be useful, such patterns are now accepted, but whenever an iteration of
1767 such a group matches no characters, matching moves on to the next item
1768 in the pattern instead of repeatedly matching an empty string. This
1769 does not prevent backtracking into any of the iterations if a subse‐
1770 quent item fails to match.
1771
1772 By default, quantifiers are "greedy", that is, they match as much as
1773 possible (up to the maximum number of permitted times), without causing
1774 the rest of the pattern to fail. The classic example of where this
1775 gives problems is in trying to match comments in C programs. These ap‐
1776 pear between /* and */ and within the comment, individual * and / char‐
1777 acters may appear. An attempt to match C comments by applying the pat‐
1778 tern
1779
1780 /\*.*\*/
1781
1782 to the string
1783
1784 /* first comment */ not comment /* second comment */
1785
1786 fails, because it matches the entire string owing to the greediness of
1787 the .* item. However, if a quantifier is followed by a question mark,
1788 it ceases to be greedy, and instead matches the minimum number of times
1789 possible, so the pattern
1790
1791 /\*.*?\*/
1792
1793 does the right thing with the C comments. The meaning of the various
1794 quantifiers is not otherwise changed, just the preferred number of
1795 matches. Do not confuse this use of question mark with its use as a
1796 quantifier in its own right. Because it has two uses, it can sometimes
1797 appear doubled, as in
1798
1799 \d??\d
1800
1801 which matches one digit by preference, but can match two if that is the
1802 only way the rest of the pattern matches.
1803
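The greedy and minimizing forms can be compared directly on the subject string from the text. The sketch below uses Python's re module, which is assumed to follow the same rules for these quantifiers.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last */ in the subject.
greedy = re.search(r"/\*.*\*/", subject).group()
# Minimizing .*? stops at the first */.
lazy = re.search(r"/\*.*?\*/", subject).group()

# \d??\d prefers one digit, but takes two when the rest of the
# match (here, reaching the end of the subject) requires it.
prefers_one = re.match(r"\d??\d", "12").group()
takes_two = re.fullmatch(r"\d??\d", "12").group()

print(greedy)
print(lazy)
print(prefers_one, takes_two)
```
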
1804 If the PCRE2_UNGREEDY option is set (an option that is not available in
1805 Perl), the quantifiers are not greedy by default, but individual ones
1806 can be made greedy by following them with a question mark. In other
1807 words, it inverts the default behaviour.
1808
1809 When a parenthesized group is quantified with a minimum repeat count
1810 that is greater than 1 or with a limited maximum, more memory is re‐
1811 quired for the compiled pattern, in proportion to the size of the mini‐
1812 mum or maximum.
1813
1814 If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option
1815 (equivalent to Perl's /s) is set, thus allowing the dot to match new‐
1816 lines, the pattern is implicitly anchored, because whatever follows
1817 will be tried against every character position in the subject string,
1818 so there is no point in retrying the overall match at any position af‐
1819 ter the first. PCRE2 normally treats such a pattern as though it were
1820 preceded by \A.
1821
1822 In cases where it is known that the subject string contains no new‐
1823 lines, it is worth setting PCRE2_DOTALL in order to obtain this opti‐
1824 mization, or alternatively, using ^ to indicate anchoring explicitly.
1825
1826 However, there are some cases where the optimization cannot be used.
1827 When .* is inside capturing parentheses that are the subject of a
1828 backreference elsewhere in the pattern, a match at the start may fail
1829 where a later one succeeds. Consider, for example:
1830
1831 (.*)abc\1
1832
1833 If the subject is "xyz123abc123" the match point is the fourth charac‐
1834 ter. For this reason, such a pattern is not implicitly anchored.
1835
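The match point for this example can be confirmed with a quick search. The sketch uses Python's re module; Python does not perform the implicit anchoring optimization, so it only shows where a correct matcher has to find the match.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

start = m.start()        # 0-based offset, so 3 means the fourth character
whole = m.group()
captured = m.group(1)

print(start, whole, captured)
```
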
1836 Another case where implicit anchoring is not applied is when the lead‐
1837 ing .* is inside an atomic group. Once again, a match at the start may
1838 fail where a later one succeeds. Consider this pattern:
1839
1840 (?>.*?a)b
1841
1842 It matches "ab" in the subject "aab". The use of the backtracking con‐
1843 trol verbs (*PRUNE) and (*SKIP) also disables this optimization, and
1844 there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
1845
1846 When a capture group is repeated, the value captured is the substring
1847 that matched the final iteration. For example, after
1848
1849 (tweedle[dume]{3}\s*)+
1850
1851 has matched "tweedledum tweedledee" the value of the captured substring
1852 is "tweedledee". However, if there are nested capture groups, the cor‐
1853 responding captured values may have been set in previous iterations.
1854 For example, after
1855
1856 (a|(b))+
1857
1858 matches "aba" the value of the second captured substring is "b".
1859
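Both captured-value examples can be reproduced in a few lines. The sketch uses Python's re module, which is assumed here to keep values from earlier iterations in the same way; some other engines reset nested groups on each iteration.

```python
import re

# The outer group keeps what its final iteration matched.
m = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m.group(1)

# The nested group was last set in the second iteration and keeps that
# value, although the final iteration matched "a".
n = re.fullmatch(r"(a|(b))+", "aba")
outer, inner = n.group(1), n.group(2)

print(last_iteration, outer, inner)
```
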
1861
1862 With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
1863 repetition, failure of what follows normally causes the repeated item
1864 to be re-evaluated to see if a different number of repeats allows the
1865 rest of the pattern to match. Sometimes it is useful to prevent this,
1866 either to change the nature of the match, or to cause it to fail earlier
1867 than it otherwise might, when the author of the pattern knows there is
1868 no point in carrying on.
1869
1870 Consider, for example, the pattern \d+foo when applied to the subject
1871 line
1872
1873 123456bar
1874
1875 After matching all 6 digits and then failing to match "foo", the normal
1876 action of the matcher is to try again with only 5 digits matching the
1877 \d+ item, and then with 4, and so on, before ultimately failing.
1878 "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
1879 the means for specifying that once a group has matched, it is not to be
1880 re-evaluated in this way.
1881
1882 If we use atomic grouping for the previous example, the matcher gives
1883 up immediately on failing to match "foo" the first time. The notation
1884 is a kind of special parenthesis, starting with (?> as in this example:
1885
1886 (?>\d+)foo
1887
1888 Perl 5.28 introduced an experimental alphabetic form starting with (*
1889 which may be easier to remember:
1890
1891 (*atomic:\d+)foo
1892
1893 This kind of parenthesized group "locks up" the part of the pattern it
1894 contains once it has matched, and a failure further into the pattern is
1895 prevented from backtracking into it. Backtracking past it to previous
1896 items, however, works as normal.
1897
1898 An alternative description is that a group of this type matches exactly
1899 the string of characters that an identical standalone pattern would
1900 match, if anchored at the current point in the subject string.
1901
1902 Atomic groups are not capture groups. Simple cases such as the above
1903 example can be thought of as a maximizing repeat that must swallow ev‐
1904 erything it can. So, while both \d+ and \d+? are prepared to adjust
1905 the number of digits they match in order to make the rest of the pat‐
1906 tern match, (?>\d+) can only match an entire sequence of digits.
1907
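The effect of refusing to give characters back shows most clearly when the item after the group needs one of them. The sketch below uses Python's re module. To stay portable it does not use the (?>...) syntax (which Python accepts from version 3.11) but a common emulation: a lookahead captures the run, and a backreference then consumes it. That emulation, the named group, and the subject strings are choices made for this illustration, not part of PCRE2.

```python
import re

# Ordinary greedy repeat: \d+ gives back the final "6", so this matches.
backtracking = re.fullmatch(r"\d+6", "123456")

# Emulated (?>\d+)6: the lookahead captures the longest run of digits and
# is not re-entered on backtracking, so nothing can be given back.
# A named group avoids writing \1 directly before the literal digit 6.
atomic_like = re.fullmatch(r"(?=(?P<digits>\d+))(?P=digits)6", "123456")

# When the following item does not need any of the digits, both agree.
plain_foo = re.fullmatch(r"\d+foo", "123foo") is not None
atomic_foo = re.fullmatch(r"(?=(?P<digits>\d+))(?P=digits)foo", "123foo") is not None

print(backtracking is not None, atomic_like is not None, plain_foo, atomic_foo)
```
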
1908 Atomic groups in general can of course contain arbitrarily complicated
1909 expressions, and can be nested. However, when the contents of an atomic
1910 group are just a single repeated item, as in the example above, a sim‐
1911 pler notation, called a "possessive quantifier", can be used. This con‐
1912 sists of an additional + character following a quantifier. Using this
1913 notation, the previous example can be rewritten as
1914
1915 \d++foo
1916
1917 Note that a possessive quantifier can be used with an entire group, for
1918 example:
1919
1920 (abc|xyz){2,3}+
1921
1922 Possessive quantifiers are always greedy; the setting of the PCRE2_UN‐
1923 GREEDY option is ignored. They are a convenient notation for the sim‐
1924 pler forms of atomic group. However, there is no difference in the
1925 meaning of a possessive quantifier and the equivalent atomic group,
1926 though there may be a performance difference; possessive quantifiers
1927 should be slightly faster.
1928
1929 The possessive quantifier syntax is an extension to the Perl 5.8 syn‐
1930 tax. Jeffrey Friedl originated the idea (and the name) in the first
1931 edition of his book. Mike McCloskey liked it, so implemented it when he
1932 built Sun's Java package, and PCRE1 copied it from there. It found its
1933 way into Perl at release 5.10.
1934
1935 PCRE2 has an optimization that automatically "possessifies" certain
1936 simple pattern constructs. For example, the sequence A+B is treated as
1937 A++B because there is no point in backtracking into a sequence of A's
1938 when B must follow. This feature can be disabled by the PCRE2_NO_AUTO‐
1939 POSSESS option, or starting the pattern with (*NO_AUTO_POSSESS).
1940
1941 When a pattern contains an unlimited repeat inside a group that can it‐
1942 self be repeated an unlimited number of times, the use of an atomic
1943 group is the only way to avoid some failing matches taking a very long
1944 time indeed. The pattern
1945
1946 (\D+|<\d+>)*[!?]
1947
1948 matches an unlimited number of substrings that either consist of non-
1949 digits, or digits enclosed in <>, followed by either ! or ?. When it
1950 matches, it runs quickly. However, if it is applied to
1951
1952 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1953
1954 it takes a long time before reporting failure. This is because the
1955 string can be divided between the internal \D+ repeat and the external
1956 * repeat in a large number of ways, and all have to be tried. (The ex‐
1957 ample uses [!?] rather than a single character at the end, because both
1958 PCRE2 and Perl have an optimization that allows for fast failure when a
1959 single character is used. They remember the last single character that
1960 is required for a match, and fail early if it is not present in the
1961 string.) If the pattern is changed so that it uses an atomic group,
1962 like this:
1963
1964 ((?>\D+)|<\d+>)*[!?]
1965
1966 sequences of non-digits cannot be broken, and failure happens quickly.
1967
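A rough way to try this out is sketched below with Python's re module. The atomic group is emulated with a lookahead and a backreference, as an assumption made for portability (Python 3.11 and later also accept (?>...) directly). The unmodified pattern is run only against a short subject, because the number of ways to split the string grows very quickly with its length. The subject lengths are arbitrary.

```python
import re

plain = re.compile(r"(\D+|<\d+>)*[!?]")

# Stand-in for ((?>\D+)|<\d+>)*[!?]: once the lookahead has captured a run
# of non-digits, the backreference must consume all of it.
atomic_like = re.compile(r"((?=(?P<run>\D+))(?P=run)|<\d+>)*[!?]")

short_subject = "a" * 12     # small enough for the plain pattern to finish
long_subject = "a" * 50      # a long run of a's, as in the text above

plain_short = plain.match(short_subject)
atomic_long = atomic_like.match(long_subject)

print(plain_short, atomic_long)
```

Both calls report failure; the point of the sketch is that the second one does so without trying every way of dividing the subject.
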
1969
1970 Outside a character class, a backslash followed by a digit greater than
1971 0 (and possibly further digits) is a backreference to a capture group
1972 earlier (that is, to its left) in the pattern, provided there have been
1973 that many previous capture groups.
1974
1975 However, if the decimal number following the backslash is less than 8,
1976 it is always taken as a backreference, and causes an error only if
1977 there are not that many capture groups in the entire pattern. In other
1978 words, the group that is referenced need not be to the left of the ref‐
1979 erence for numbers less than 8. A "forward backreference" of this type
1980 can make sense when a repetition is involved and the group to the right
1981 has participated in an earlier iteration.
1982
1983 It is not possible to have a numerical "forward backreference" to a
1984 group whose number is 8 or more using this syntax because a sequence
1985 such as \50 is interpreted as a character defined in octal. See the
1986 subsection entitled "Non-printing characters" above for further details
1987 of the handling of digits following a backslash. Other forms of back‐
1988 referencing do not suffer from this restriction. In particular, there
1989 is no problem when named capture groups are used (see below).
1990
1991 Another way of avoiding the ambiguity inherent in the use of digits
1992 following a backslash is to use the \g escape sequence. This escape
1993 must be followed by a signed or unsigned number, optionally enclosed in
1994 braces. These examples are all identical:
1995
1996 (ring), \1
1997 (ring), \g1
1998 (ring), \g{1}
1999
2000 An unsigned number specifies an absolute reference without the ambigu‐
2001 ity that is present in the older syntax. It is also useful when literal
2002 digits follow the reference. A signed number is a relative reference.
2003 Consider this example:
2004
2005 (abc(def)ghi)\g{-1}
2006
2007 The sequence \g{-1} is a reference to the most recently started capture
2008 group before \g, that is, it is equivalent to \2 in this example.
2009 Similarly, \g{-2} would be equivalent to \1. The use of relative references
2010 can be helpful in long patterns, and also in patterns that are created
2011 by joining together fragments that contain references within them‐
2012 selves.
2013
2014 The sequence \g{+1} is a reference to the next capture group. This kind
2015 of forward reference can be useful in patterns that repeat. Perl does
2016 not support the use of + in this way.
2017
2018 A backreference matches whatever actually most recently matched the
2019 capture group in the current subject string, rather than anything at
2020 all that matches the group (see "Groups as subroutines" below for a way
2021 of doing that). So the pattern
2022
2023 (sens|respons)e and \1ibility
2024
2025 matches "sense and sensibility" and "response and responsibility", but
2026 not "sense and responsibility". If caseful matching is in force at the
2027 time of the backreference, the case of letters is relevant. For exam‐
2028 ple,
2029
2030 ((?i)rah)\s+\1
2031
2032 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
2033 original capture group is matched caselessly.
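
The first pattern above can be tried out with a minimal sketch. It assumes Python's standard re module as a stand-in: that engine is not PCRE2, but it accepts the same \1 syntax and should behave the same way for this simple case.

```python
import re

# The backreference \1 must repeat the text captured by group 1,
# not merely match the group's sub-pattern again.
pattern = re.compile(r"(sens|respons)e and \1ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]
results = {s: pattern.fullmatch(s) is not None for s in subjects}

for subject, matched in results.items():
    print(f"{subject!r}: {'match' if matched else 'no match'}")
```

Only the last subject is rejected, because "respons" is not the text that group 1 captured.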
2034
2035 There are several different ways of writing backreferences to named
2036 capture groups. The .NET syntax \k{name} and the Perl syntax \k<name>
2037 or \k'name' are supported, as is the Python syntax (?P=name). Perl
2038 5.10's unified backreference syntax, in which \g can be used for both
2039 numeric and named references, is also supported. We could rewrite the
2040 above example in any of the following ways:
2041
2042 (?<p1>(?i)rah)\s+\k<p1>
2043 (?'p1'(?i)rah)\s+\k{p1}
2044 (?P<p1>(?i)rah)\s+(?P=p1)
2045 (?<p1>(?i)rah)\s+\g{p1}
2046
2047 A capture group that is referenced by name may appear in the pattern
2048 before or after the reference.
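
As an illustration of the Python-style named syntax, here is a small sketch that again assumes Python's re module rather than PCRE2. Recent Python versions reject an unscoped (?i) in mid-pattern, so the sketch substitutes the scoped form (?i:...); the backreference itself stays outside the caseless scope and is therefore caseful, as in the example above.

```python
import re

# (?P<p1>...) names the group; (?P=p1) is a backreference to it.
# Only the group itself is matched caselessly.
pattern = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

matches = {s: pattern.fullmatch(s) is not None
           for s in ("rah rah", "RAH RAH", "RAH rah")}

for subject, matched in matches.items():
    print(f"{subject!r}: {'match' if matched else 'no match'}")
```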
2049
2050 There may be more than one backreference to the same group. If a group
2051 has not actually been used in a particular match, backreferences to it
2052 always fail by default. For example, the pattern
2053
2054 (a|(bc))\2
2055
2056 always fails if it starts to match "a" rather than "bc". However, if
2057 the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backref‐
2058 erence to an unset value matches an empty string.
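
The default behaviour can be sketched as follows, assuming Python's re module, which also makes a backreference to an unset group fail. It has no counterpart to PCRE2_MATCH_UNSET_BACKREF, so only the default case is shown.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 takes part in the match only when the "bc" alternative is used.
took_bc_branch = pattern.fullmatch("bcbc")
took_a_branch = pattern.fullmatch("aa")

print(took_bc_branch.group(2))
print(took_a_branch)
```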
2059
2060 Because there may be many capture groups in a pattern, all digits fol‐
2061 lowing a backslash are taken as part of a potential backreference num‐
2062 ber. If the pattern continues with a digit character, some delimiter
2063 must be used to terminate the backreference. If the PCRE2_EXTENDED or
2064 PCRE2_EXTENDED_MORE option is set, this can be white space. Otherwise,
2065 the \g{} syntax or an empty comment (see "Comments" below) can be used.
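
Two of these delimiters can be sketched with Python's re module, used here as an assumed stand-in for PCRE2. It does not accept \g{} inside a pattern, so only the empty comment and the extended-mode white space are shown; re.VERBOSE plays the role of PCRE2_EXTENDED.

```python
import re

# Intended meaning: group 1, a backreference to it, then a literal "1".
with_comment = re.compile(r"(a)\1(?#)1")        # empty comment as delimiter
with_space = re.compile(r"(a)\1 1", re.VERBOSE)  # ignored white space as delimiter

for compiled in (with_comment, with_space):
    print(compiled.pattern, "->", compiled.fullmatch("aa1") is not None)
```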
2066
2067 Recursive backreferences
2068
2069 A backreference that occurs inside the group to which it refers fails
2070 when the group is first used, so, for example, (a\1) never matches.
2071 However, such references can be useful inside repeated groups. For ex‐
2072 ample, the pattern
2073
2074 (a|b\1)+
2075
2076 matches any number of "a"s and also "aba", "ababbaa" etc. At each iter‐
2077 ation of the group, the backreference matches the character string cor‐
2078 responding to the previous iteration. In order for this to work, the
2079 pattern must be such that the first iteration does not need to match
2080 the backreference. This can be done using alternation, as in the exam‐
2081 ple above, or by a quantifier with a minimum of zero.
2082
2083 In versions of PCRE2 earlier than 10.25, backreferences of this type
2084 caused the group that they reference to be treated as an atomic
2085 group. This restriction no longer applies, and backtracking into such
2086 groups can occur as normal.
2087
2089
2090 An assertion is a test on the characters following or preceding the
2091 current matching point that does not consume any characters. The simple
2092 assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described
2093 above.
2094
2095 More complicated assertions are coded as parenthesized groups. There
2096 are two kinds: those that look ahead of the current position in the
2097 subject string, and those that look behind it, and in each case an as‐
2098 sertion may be positive (must match for the assertion to be true) or
2099 negative (must not match for the assertion to be true). An assertion
2100 group is matched in the normal way, and if it is true, matching contin‐
2101 ues after it, but with the matching position in the subject string re‐
2102 set to what it was before the assertion was processed.
2103
2104 The Perl-compatible lookaround assertions are atomic. If an assertion
2105 is true, but there is a subsequent matching failure, there is no back‐
2106 tracking into the assertion. However, there are some cases where non-
2107 atomic assertions can be useful. PCRE2 has some support for these, de‐
2108 scribed in the section entitled "Non-atomic assertions" below, but they
2109 are not Perl-compatible.
2110
2111 A lookaround assertion may appear as the condition in a conditional
2112 group (see below). In this case, the result of matching the assertion
2113 determines which branch of the condition is followed.
2114
2115 Assertion groups are not capture groups. If an assertion contains cap‐
2116 ture groups within it, these are counted for the purposes of numbering
2117 the capture groups in the whole pattern. Within each branch of an as‐
2118 sertion, locally captured substrings may be referenced in the usual
2119 way. For example, a sequence such as (.)\g{-1} can be used to check
2120 that two adjacent characters are the same.
2121
2122 When a branch within an assertion fails to match, any substrings that
2123 were captured are discarded (as happens with any pattern branch that
2124 fails to match). A negative assertion is true only when all its
2125 branches fail to match; this means that no captured substrings are ever
2126 retained after a successful negative assertion. When an assertion con‐
2127 tains a matching branch, what happens depends on the type of assertion.
2128
2129 For a positive assertion, internally captured substrings in the suc‐
2130 cessful branch are retained, and matching continues with the next pat‐
2131 tern item after the assertion. For a negative assertion, a matching
2132 branch means that the assertion is not true. If such an assertion is
2133 being used as a condition in a conditional group (see below), captured
2134 substrings are retained, because matching continues with the "no"
2135 branch of the condition. For other failing negative assertions, control
2136 passes to the previous backtracking point, thus discarding any captured
2137 strings within the assertion.
2138
2139 Most assertion groups may be repeated; though it makes no sense to as‐
2140 sert the same thing several times, the side effect of capturing in pos‐
2141 itive assertions may occasionally be useful. However, an assertion that
2142 forms the condition for a conditional group may not be quantified.
2143 PCRE2 used to restrict the repetition of assertions, but from release
2144 10.35 the only restriction is that an unlimited maximum repetition is
2145 changed to be one more than the minimum. For example, {3,} is treated
2146 as {3,4}.
2147
2148 Alphabetic assertion names
2149
2150 Traditionally, symbolic sequences such as (?= and (?<= have been used
2151 to specify lookaround assertions. Perl 5.28 introduced some experimen‐
2152 tal alphabetic alternatives which might be easier to remember. They all
2153 start with (* instead of (? and must be written using lower case let‐
2154 ters. PCRE2 supports the following synonyms:
2155
2156 (*positive_lookahead: or (*pla: is the same as (?=
2157 (*negative_lookahead: or (*nla: is the same as (?!
2158 (*positive_lookbehind: or (*plb: is the same as (?<=
2159 (*negative_lookbehind: or (*nlb: is the same as (?<!
2160
2161 For example, (*pla:foo) is the same assertion as (?=foo). In the fol‐
2162 lowing sections, the various assertions are described using the origi‐
2163 nal symbolic forms.
2164
2165 Lookahead assertions
2166
2167 Lookahead assertions start with (?= for positive assertions and (?! for
2168 negative assertions. For example,
2169
2170 \w+(?=;)
2171
2172 matches a word followed by a semicolon, but does not include the semi‐
2173 colon in the match, and
2174
2175 foo(?!bar)
2176
2177 matches any occurrence of "foo" that is not followed by "bar". Note
2178 that the apparently similar pattern
2179
2180 (?!foo)bar
2181
2182 does not find an occurrence of "bar" that is preceded by something
2183 other than "foo"; it finds any occurrence of "bar" whatsoever, because
2184 the assertion (?!foo) is always true when the next three characters are
2185 "bar". A lookbehind assertion is needed to achieve the other effect.
2186
2187 If you want to force a matching failure at some point in a pattern, the
2188 most convenient way to do it is with (?!) because an empty string al‐
2189 ways matches, so an assertion that requires there not to be an empty
2190 string must always fail. The backtracking control verb (*FAIL) or (*F)
2191 is a synonym for (?!).
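
The lookahead examples in this subsection can be checked with a short sketch. It assumes Python's re module, which shares the (?= and (?! syntax with PCRE2; the (*FAIL) verb is PCRE2-specific and is not used here.

```python
import re

# \w+(?=;) matches a word that is followed by a semicolon, without
# including the semicolon in the match.
word = re.search(r"\w+(?=;)", "int count;")

# foo(?!bar) skips the "foo" in "foobar" and finds the one in "foobaz".
starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar still matches the "bar" in "foobar": at the point where
# "bar" starts, the next three characters are "bar", not "foo".
misleading = re.search(r"(?!foo)bar", "foobar")

# (?!) can never succeed, so it forces a failure at that point.
forced_failure = re.search(r"a(?!)", "aaa")

print(word.group(), starts, misleading.start(), forced_failure)
```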
2192
2193 Lookbehind assertions
2194
2195 Lookbehind assertions start with (?<= for positive assertions and (?<!
2196 for negative assertions. For example,
2197
2198 (?<!foo)bar
2199
2200 does find an occurrence of "bar" that is not preceded by "foo". The
2201 contents of a lookbehind assertion are restricted such that all the
2202 strings it matches must have a fixed length. However, if there are sev‐
2203 eral top-level alternatives, they do not all have to have the same
2204 fixed length. Thus
2205
2206 (?<=bullock|donkey)
2207
2208 is permitted, but
2209
2210 (?<!dogs?|cats?)
2211
2212 causes an error at compile time. Branches that match different length
2213 strings are permitted only at the top level of a lookbehind assertion.
2214 This is an extension compared with Perl, which requires all branches to
2215 match the same length of string. An assertion such as
2216
2217 (?<=ab(c|de))
2218
2219 is not permitted, because its single top-level branch can match two
2220 different lengths, but it is acceptable to PCRE2 if rewritten to use
2221 two top-level branches:
2222
2223 (?<=abc|abde)
2224
2225 In some cases, the escape sequence \K (see above) can be used instead
2226 of a lookbehind assertion to get round the fixed-length restriction.
2227
2228 The implementation of lookbehind assertions is, for each alternative,
2229 to temporarily move the current position back by the fixed length and
2230 then try to match. If there are insufficient characters before the cur‐
2231 rent position, the assertion fails.
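
The effect of stepping back by a fixed length can be seen in a minimal sketch, again assuming Python's re module. That engine is stricter than PCRE2 and requires every alternative in a lookbehind to have the same length, so only single-branch lookbehinds are used.

```python
import re

# (?<!foo)bar rejects a "bar" that directly follows "foo".
after_foo = re.search(r"(?<!foo)bar", "foobar")
after_baz = re.search(r"(?<!foo)bar", "bazbar")

# At the very start of the subject there are fewer than three preceding
# characters, so the inner "foo" cannot match and the negative
# assertion is therefore true.
at_start = re.search(r"(?<!foo)bar", "bar")

# The positive form fails for the same reason: (?<=abc) needs three
# characters before the current position and only two are available.
too_short = re.search(r"(?<=abc)d", "bcd")

print(after_foo, after_baz.start(), at_start.start(), too_short)
```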
2232
2233 In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which
2234 matches a single code unit even in a UTF mode) to appear in lookbehind
2235 assertions, because it makes it impossible to calculate the length of
2236 the lookbehind. The \X and \R escapes, which can match different num‐
2237 bers of code units, are never permitted in lookbehinds.
2238
2239 "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
2240 lookbehinds, as long as the called capture group matches a fixed-length
2241 string. However, recursion, that is, a "subroutine" call into a group
2242 that is already active, is not supported.
2243
2244 Perl does not support backreferences in lookbehinds. PCRE2 does support
2245 them, but only if certain conditions are met. The PCRE2_MATCH_UN‐
2246 SET_BACKREF option must not be set, there must be no use of (?| in the
2247 pattern (it creates duplicate group numbers), and if the backreference
2248 is by name, the name must be unique. Of course, the referenced group
2249 must itself match a fixed length substring. The following pattern
2250 matches words containing at least two characters that begin and end
2251 with the same character:
2252
2253 \b(\w)\w++(?<=\1)
2254
2255 Possessive quantifiers can be used in conjunction with lookbehind as‐
2256 sertions to specify efficient matching of fixed-length strings at the
2257 end of subject strings. Consider a simple pattern such as
2258
2259 abcd$
2260
2261 when applied to a long string that does not match. Because matching
2262 proceeds from left to right, PCRE2 will look for each "a" in the sub‐
2263 ject and then see if what follows matches the rest of the pattern. If
2264 the pattern is specified as
2265
2266 ^.*abcd$
2267
2268 the initial .* matches the entire string at first, but when this fails
2269 (because there is no following "a"), it backtracks to match all but the
2270 last character, then all but the last two characters, and so on. Once
2271 again the search for "a" covers the entire string, from right to left,
2272 so we are no better off. However, if the pattern is written as
2273
2274 ^.*+(?<=abcd)
2275
2276 there can be no backtracking for the .*+ item because of the possessive
2277 quantifier; it can match only the entire string. The subsequent lookbe‐
2278 hind assertion does a single test on the last four characters. If it
2279 fails, the match fails immediately. For long strings, this approach
2280 makes a significant difference to the processing time.
2281
2282 Using multiple assertions
2283
2284 Several assertions (of any sort) may occur in succession. For example,
2285
2286 (?<=\d{3})(?<!999)foo
2287
2288 matches "foo" preceded by three digits that are not "999". Notice that
2289 each of the assertions is applied independently at the same point in
2290 the subject string. First there is a check that the previous three
2291 characters are all digits, and then there is a check that the same
2292 three characters are not "999". This pattern does not match "foo"
2293 preceded by six characters, the first three of which are digits and the
2294 last three of which are not "999". For example, it doesn't match
2295 "123abcfoo". A pattern to do that is
2296
2297 (?<=\d{3}...)(?<!999)foo
2298
2299 This time the first assertion looks at the preceding six characters,
2300 checking that the first three are digits, and then the second assertion
2301 checks that the preceding three characters are not "999".
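
The difference between the two patterns can be made concrete with a small sketch, assuming Python's re module as a stand-in for PCRE2. The helper name found is hypothetical and exists only for this example.

```python
import re

three_digits_not_999 = re.compile(r"(?<=\d{3})(?<!999)foo")
six_chars = re.compile(r"(?<=\d{3}...)(?<!999)foo")

def found(pattern, subject):
    """Return True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None

# Both assertions in the first pattern inspect the same three characters.
print(found(three_digits_not_999, "123foo"))      # digits, and not 999
print(found(three_digits_not_999, "999foo"))      # rejected by (?<!999)
print(found(three_digits_not_999, "123abcfoo"))   # "abc" are not digits

# The second pattern looks back six characters in its first assertion.
print(found(six_chars, "123abcfoo"))
print(found(six_chars, "123999foo"))
```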
2302
2303 Assertions can be nested in any combination. For example,
2304
2305 (?<=(?<!foo)bar)baz
2306
2307 matches an occurrence of "baz" that is preceded by "bar" which in turn
2308 is not preceded by "foo", while
2309
2310 (?<=\d{3}(?!999)...)foo
2311
2312 is another pattern that matches "foo" preceded by three digits and any
2313 three characters that are not "999".
2314
2316
2317 The traditional Perl-compatible lookaround assertions are atomic. That
2318 is, if an assertion is true, but there is a subsequent matching fail‐
2319 ure, there is no backtracking into the assertion. However, there are
2320 some cases where non-atomic positive assertions can be useful. PCRE2
2321 provides these using the following syntax:
2322
2323 (*non_atomic_positive_lookahead: or (*napla: or (?*
2324 (*non_atomic_positive_lookbehind: or (*naplb: or (?<*
2325
2326 Consider the problem of finding the right-most word in a string that
2327 also appears earlier in the string, that is, it must appear at least
2328 twice in total. This pattern returns the required result as captured
2329 substring 1:
2330
2331 ^(?x)(*napla: .* \b(\w++)) (?> .*? \b\1\b ){2}
2332
2333 For a subject such as "word1 word2 word3 word2 word3 word4" the result
2334 is "word3". How does it work? At the start, ^(?x) anchors the pattern
2335 and sets the "x" option, which causes white space (introduced for read‐
2336 ability) to be ignored. Inside the assertion, the greedy .* at first
2337 consumes the entire string, but then has to backtrack until the rest of
2338 the assertion can match a word, which is captured by group 1. In other
2339 words, when the assertion first succeeds, it captures the right-most
2340 word in the string.
2341
2342 The current matching point is then reset to the start of the subject,
2343 and the rest of the pattern match checks for two occurrences of the
2344 captured word, using an ungreedy .*? to scan from the left. If this
2345 succeeds, we are done, but if the last word in the string does not oc‐
2346 cur twice, this part of the pattern fails. If a traditional atomic
2347 lookahead (?= or (*pla: had been used, the assertion could not be
2348 re-entered, and the whole match would fail. The pattern would succeed only
2349 if the very last word in the subject was found twice.
2350
2351 Using a non-atomic lookahead, however, means that when the last word
2352 does not occur twice in the string, the lookahead can backtrack and
2353 find the second-last word, and so on, until either the match succeeds,
2354 or all words have been tested.
2355
2356 Two conditions must be met for a non-atomic assertion to be useful: the
2357 contents of one or more capturing groups must change after a backtrack
2358 into the assertion, and there must be a backreference to a changed
2359 group later in the pattern. If this is not the case, the rest of the
2360 pattern match fails exactly as before because nothing has changed, so
2361 using a non-atomic assertion just wastes resources.
2362
2363 There is one exception to backtracking into a non-atomic assertion. If
2364 an (*ACCEPT) control verb is triggered, the assertion succeeds atomi‐
2365 cally. That is, a subsequent match failure cannot backtrack into the
2366 assertion.
2367
2368 Non-atomic assertions are not supported by the alternative matching
2369 function pcre2_dfa_match(). They are supported by JIT, but only if they
2370 do not contain any control verbs such as (*ACCEPT). (This may change in
2371 future). Note that assertions that appear as conditions for conditional
2372 groups (see below) must be atomic.
2373
2375
2376 In concept, a script run is a sequence of characters that are all from
2377 the same Unicode script such as Latin or Greek. However, because some
2378 scripts are commonly used together, and because some diacritical and
2379 other marks are used with multiple scripts, it is not that simple.
2380 There is a full description of the rules that PCRE2 uses in the section
2381 entitled "Script Runs" in the pcre2unicode documentation.
2382
2383 If part of a pattern is enclosed between (*script_run: or (*sr: and a
2384 closing parenthesis, it fails if the sequence of characters that it
2385 matches is not a script run. After a failure, normal backtracking
2386 occurs. Script runs can be used to detect spoofing attacks using
2387 characters that look the same, but are from different scripts. The string
2388 "paypal.com" is an infamous example, where the letters could be a mix‐
2389 ture of Latin and Cyrillic. This pattern ensures that the matched char‐
2390 acters in a sequence of non-spaces that follow white space are a script
2391 run:
2392
2393 \s+(*sr:\S+)
2394
2395 To be sure that they are all from the Latin script (for example), a
2396 lookahead can be used:
2397
2398 \s+(?=\p{Latin})(*sr:\S+)
2399
2400 This works as long as the first character is expected to be a character
2401 in that script, and not (for example) punctuation, which is allowed
2402 with any script. If this is not the case, a more creative lookahead is
2403 needed. For example, if digits, underscore, and dots are permitted at
2404 the start:
2405
2406 \s+(?=[0-9_.]*\p{Latin})(*sr:\S+)
2407
2408
2409 In many cases, backtracking into a script run pattern fragment is not
2410 desirable. The script run can employ an atomic group to prevent this.
2411 Because this is a common requirement, a shorthand notation is provided
2412 by (*atomic_script_run: or (*asr:
2413
2414 (*asr:...) is the same as (*sr:(?>...))
2415
2416 Note that the atomic group is inside the script run. Putting it outside
2417 would not prevent backtracking into the script run pattern.
2418
2419 Support for script runs is not available if PCRE2 is compiled without
2420 Unicode support. A compile-time error is given if any of the above
2421 constructs is encountered. Script runs are not supported by the
2422 alternative matching function, pcre2_dfa_match(), because they use the
2423 same mechanism as capturing parentheses.
2424
2425 Warning: The (*ACCEPT) control verb (see below) should not be used
2426 within a script run group, because it causes an immediate exit from the
2427 group, bypassing the script run checking.
2428
2430
2431 It is possible to cause the matching process to obey a pattern fragment
2432 conditionally or to choose between two alternative fragments, depending
2433 on the result of an assertion, or whether a specific capture group has
2434 already been matched. The two possible forms of conditional group are:
2435
2436 (?(condition)yes-pattern)
2437 (?(condition)yes-pattern|no-pattern)
2438
2439 If the condition is satisfied, the yes-pattern is used; otherwise the
2440 no-pattern (if present) is used. An absent no-pattern is equivalent to
2441 an empty string (it always matches). If there are more than two alter‐
2442 natives in the group, a compile-time error occurs. Each of the two al‐
2443 ternatives may itself contain nested groups of any form, including con‐
2444 ditional groups; the restriction to two alternatives applies only at
2445 the level of the condition itself. This pattern fragment is an example
2446 where the alternatives are complex:
2447
2448 (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
2449
2450
2451 There are five kinds of condition: references to capture groups, refer‐
2452 ences to recursion, two pseudo-conditions called DEFINE and VERSION,
2453 and assertions.
2454
2455 Checking for a used capture group by number
2456
2457 If the text between the parentheses consists of a sequence of digits,
2458 the condition is true if a capture group of that number has previously
2459 matched. If there is more than one capture group with the same number
2460 (see the earlier section about duplicate group numbers), the condition
2461 is true if any of them have matched. An alternative notation is to pre‐
2462 cede the digits with a plus or minus sign. In this case, the group num‐
2463 ber is relative rather than absolute. The most recently opened capture
2464 group can be referenced by (?(-1), the next most recent by (?(-2), and
2465 so on. Inside loops it can also make sense to refer to subsequent
2466 groups. The next capture group can be referenced as (?(+1), and so on.
2467 (The value zero in any of these forms is not used; it provokes a com‐
2468 pile-time error.)
2469
2470 Consider the following pattern, which contains non-significant white
2471 space to make it more readable (assume the PCRE2_EXTENDED option) and
2472 to divide it into three parts for ease of discussion:
2473
2474 ( \( )? [^()]+ (?(1) \) )
2475
2476 The first part matches an optional opening parenthesis, and if that
2477 character is present, sets it as the first captured substring. The sec‐
2478 ond part matches one or more characters that are not parentheses. The
2479 third part is a conditional group that tests whether or not the first
2480 capture group matched. If it did, that is, if the subject started with an
2481 opening parenthesis, the condition is true, and so the yes-pattern is
2482 executed and a closing parenthesis is required. Otherwise, since no-
2483 pattern is not present, the conditional group matches nothing. In other
2484 words, this pattern matches a sequence of non-parentheses, optionally
2485 enclosed in parentheses.
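
The three-part pattern above can be exercised with a minimal sketch. It assumes Python's re module, which supports the same (?(1)...) condition, with re.VERBOSE standing in for PCRE2_EXTENDED.

```python
import re

# Part 1: optional "(" captured as group 1.
# Part 2: one or more characters that are not parentheses.
# Part 3: ")" is required only if group 1 took part in the match.
pattern = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

subjects = ["(abc)", "abc", "(abc", "abc)"]
outcomes = {s: pattern.fullmatch(s) is not None for s in subjects}

for subject, matched in outcomes.items():
    print(f"{subject!r}: {'match' if matched else 'no match'}")
```

The whole subject is required to match here, so the two unbalanced strings are rejected.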
2486
2487 If you were embedding this pattern in a larger one, you could use a
2488 relative reference:
2489
2490 ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
2491
2492 This makes the fragment independent of the parentheses in the larger
2493 pattern.
2494
2495 Checking for a used capture group by name
2496
2497 Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
2498 used capture group by name. For compatibility with earlier versions of
2499 PCRE1, which had this facility before Perl, the syntax (?(name)...) is
2500 also recognized. Note, however, that undelimited names consisting of
2501 the letter R followed by digits are ambiguous (see the following sec‐
2502 tion). Rewriting the above example to use a named group gives this:
2503
2504 (?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
2505
2506 If the name used in a condition of this kind is a duplicate, the test
2507 is applied to all groups of the same name, and is true if any one of
2508 them has matched.
2509
2510 Checking for pattern recursion
2511
2512 "Recursion" in this sense refers to any subroutine-like call from one
2513 part of the pattern to another, whether or not it is actually recur‐
2514 sive. See the sections entitled "Recursive patterns" and "Groups as
2515 subroutines" below for details of recursion and subroutine calls.
2516
2517 If a condition is the string (R), and there is no capture group with
2518 the name R, the condition is true if matching is currently in a recur‐
2519 sion or subroutine call to the whole pattern or any capture group. If
2520 digits follow the letter R, and there is no group with that name, the
2521 condition is true if the most recent call is into a group with the
2522 given number, which must exist somewhere in the overall pattern. This
2523 is a contrived example that is equivalent to a+b:
2524
2525 ((?(R1)a+|(?1)b))
2526
2527 However, in both cases, if there is a capture group with a matching
2528 name, the condition tests for its being set, as described in the sec‐
2529 tion above, instead of testing for recursion. For example, creating a
2530 group with the name R1 by adding (?<R1>) to the above pattern com‐
2531 pletely changes its meaning.
2532
2533 If a name preceded by ampersand follows the letter R, for example:
2534
2535 (?(R&name)...)
2536
2537 the condition is true if the most recent recursion is into a group of
2538 that name (which must exist within the pattern).
2539
2540 This condition does not check the entire recursion stack. It tests only
2541 the current level. If the name used in a condition of this kind is a
2542 duplicate, the test is applied to all groups of the same name, and is
2543 true if any one of them is the most recent recursion.
2544
2545 At "top level", all these recursion test conditions are false.
2546
2547 Defining capture groups for use by reference only
2548
2549 If the condition is the string (DEFINE), the condition is always false,
2550 even if there is a group with the name DEFINE. In this case, there may
2551 be only one alternative in the rest of the conditional group. It is al‐
2552 ways skipped if control reaches this point in the pattern; the idea of
2553 DEFINE is that it can be used to define subroutines that can be refer‐
2554 enced from elsewhere. (The use of subroutines is described below.) For
2555 example, a pattern to match an IPv4 address such as "192.168.23.245"
2556 could be written like this (ignore white space and line breaks):
2557
2558 (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
2559 \b (?&byte) (\.(?&byte)){3} \b
2560
2561 The first part of the pattern is a DEFINE group inside which another
2562 group named "byte" is defined. This matches an individual component of
2563 an IPv4 address (a number less than 256). When matching takes place,
2564 this part of the pattern is skipped because DEFINE acts like a false
2565 condition. The rest of the pattern uses references to the named group
2566 to match the four dot-separated components of an IPv4 address, insist‐
2567 ing on a word boundary at each end.
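
Python's re module has neither (?(DEFINE)...) nor (?&name) calls, so the following sketch does not demonstrate DEFINE itself. It uses a different technique, building the pattern from a reusable string fragment, to show the same idea of defining the "byte" sub-pattern once and using it four times. The name BYTE is hypothetical.

```python
import re

# One definition of an IPv4 component (a number less than 256),
# using the same alternatives as the "byte" group above.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Reuse the fragment: one component, then three dot-prefixed components,
# with a word boundary at each end.
ipv4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")

found = ipv4.search("host 10.0.0.1 up")
print(found.group())
print(ipv4.fullmatch("192.168.23.245") is not None)
print(ipv4.fullmatch("256.1.1.1") is not None)
```

Unlike a DEFINE group, this approach copies the fragment into the pattern text each time it is used, so the compiled pattern grows with each use.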
2568
2569 Checking the PCRE2 version
2570
2571 Programs that link with a PCRE2 library can check the version by call‐
2572 ing pcre2_config() with appropriate arguments. Users of applications
2573 that do not have access to the underlying code cannot do this. A spe‐
2574 cial "condition" called VERSION exists to allow such users to discover
2575 which version of PCRE2 they are dealing with by using this condition to
2576 match a string such as "yesno". VERSION must be followed either by "="
2577 or ">=" and a version number. For example:
2578
2579 (?(VERSION>=10.4)yes|no)
2580
2581 This pattern matches "yes" if the PCRE2 version is greater than or equal to
2582 10.4, or "no" otherwise. The fractional part of the version number may
2583 not contain more than two digits.
2584
2585 Assertion conditions
2586
2587 If the condition is not in any of the above formats, it must be a
2588 parenthesized assertion. This may be a positive or negative lookahead
2589 or lookbehind assertion. However, it must be a traditional atomic as‐
2590 sertion, not one of the PCRE2-specific non-atomic assertions.
2591
2592 Consider this pattern, again containing non-significant white space,
2593 and with the two alternatives on the second line:
2594
2595 (?(?=[^a-z]*[a-z])
2596 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
2597
2598 The condition is a positive lookahead assertion that matches an op‐
2599 tional sequence of non-letters followed by a letter. In other words, it
2600 tests for the presence of at least one letter in the subject. If a let‐
2601 ter is found, the subject is matched against the first alternative;
2602 otherwise it is matched against the second. This pattern matches
2603 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
2604 letters and dd are digits.
2605
2606 When an assertion that is a condition contains capture groups, any cap‐
2607 turing that occurs in a matching branch is retained afterwards, for
2608 both positive and negative assertions, because matching always contin‐
2609 ues after the assertion, whether it succeeds or fails. (Compare non-
2610 conditional assertions, for which captures are retained only for posi‐
2611 tive assertions that succeed.)
2612
2614
2615 There are two ways of including comments in patterns that are processed
2616 by PCRE2. In both cases, the start of the comment must not be in a
2617 character class, nor in the middle of any other sequence of related
2618 characters such as (?: or a group name or number. The characters that
2619 make up a comment play no part in the pattern matching.
2620
2621 The sequence (?# marks the start of a comment that continues up to the
2622 next closing parenthesis. Nested parentheses are not permitted. If the
2623 PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped #
2624 character also introduces a comment, which in this case continues to
2625 immediately after the next newline character or character sequence in
2626 the pattern. Which characters are interpreted as newlines is controlled
2627 by an option passed to the compiling function or by a special sequence
2628 at the start of the pattern, as described in the section entitled "New‐
2629 line conventions" above. Note that the end of this type of comment is a
2630 literal newline sequence in the pattern; escape sequences that happen
2631 to represent a newline do not count. For example, consider this pattern
2632 when PCRE2_EXTENDED is set, and the default newline convention (a sin‐
2633 gle linefeed character) is in force:
2634
2635 abc #comment \n still comment
2636
2637 On encountering the # character, pcre2_compile() skips along, looking
2638 for a newline in the pattern. The sequence \n is still literal at this
2639 stage, so it does not terminate the comment. Only an actual character
2640 with the code value 0x0a (the default newline) does so.
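
Python's standard re module applies the same rule in its VERBOSE mode,
so it can serve as a convenient analogy (it is a different engine, used
here only as an assumption-laden illustration). The names escaped and
literal are hypothetical.

```python
import re

# Raw string: "\n" is a backslash followed by "n", not a linefeed, so the
# comment runs to the end of the pattern and only "abc" is compiled.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: "\n" is a real linefeed (0x0a), which ends the comment,
# so "def" on the following line is part of the pattern.
literal = re.compile("abc #comment \n def", re.VERBOSE)
```

Here escaped matches only "abc", whereas literal matches "abcdef".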
2641
2643
2644 Consider the problem of matching a string in parentheses, allowing for
2645 unlimited nested parentheses. Without the use of recursion, the best
2646 that can be done is to use a pattern that matches up to some fixed
2647 depth of nesting. It is not possible to handle an arbitrary nesting
2648 depth.
2649
2650 For some time, Perl has provided a facility that allows regular expres‐
2651 sions to recurse (amongst other things). It does this by interpolating
2652 Perl code in the expression at run time, and the code can refer to the
2653 expression itself. A Perl pattern using code interpolation to solve the
2654 parentheses problem can be created like this:
2655
2656 $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
2657
2658 The (?p{...}) item interpolates Perl code at run time, and in this case
2659 refers recursively to the pattern in which it appears.
2660
2661 Obviously, PCRE2 cannot support the interpolation of Perl code. In‐
2662 stead, it supports special syntax for recursion of the entire pattern,
2663 and also for individual capture group recursion. After its introduction
2664 in PCRE1 and Python, this kind of recursion was subsequently introduced
2665 into Perl at release 5.10.
2666
2667 A special item that consists of (? followed by a number greater than
2668 zero and a closing parenthesis is a recursive subroutine call of the
2669 capture group of the given number, provided that it occurs inside that
2670 group. (If not, it is a non-recursive subroutine call, which is de‐
2671 scribed in the next section.) The special item (?R) or (?0) is a recur‐
2672 sive call of the entire regular expression.
2673
2674 This PCRE2 pattern solves the nested parentheses problem (assume the
2675 PCRE2_EXTENDED option is set so that white space is ignored):
2676
2677 \( ( [^()]++ | (?R) )* \)
2678
2679 First it matches an opening parenthesis. Then it matches any number of
2680 substrings which can either be a sequence of non-parentheses, or a re‐
2681 cursive match of the pattern itself (that is, a correctly parenthesized
2682 substring). Finally there is a closing parenthesis. Note the use of a
2683 possessive quantifier to avoid backtracking into sequences of non-
2684 parentheses.
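
The three steps just described can be mirrored by a small hand-written
recursive function. This is only a sketch of what the pattern accepts
when anchored at a given offset, not of how PCRE2 is implemented;
match_nested is a hypothetical name.

```python
def match_nested(s, pos=0):
    """Return the offset just past the balanced group at pos, or None."""
    if pos >= len(s) or s[pos] != "(":
        return None                      # \( must match first
    i = pos + 1
    while i < len(s):
        if s[i] == "(":
            end = match_nested(s, i)     # plays the role of (?R)
            if end is None:
                return None
            i = end
        elif s[i] == ")":
            return i + 1                 # the final \)
        else:
            i += 1                       # part of a [^()]++ run
    return None                          # ran out of subject: no match
```

Applied to "(ab(cd)ef)" the function returns 10, the length of the
string; applied to "(a(b)" it returns None.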
2685
2686 If this were part of a larger pattern, you would not want to recurse
2687 the entire pattern, so instead you could use this:
2688
2689 ( \( ( [^()]++ | (?1) )* \) )
2690
2691 We have put the pattern into parentheses, and caused the recursion to
2692 refer to them instead of the whole pattern.
2693
2694 In a larger pattern, keeping track of parenthesis numbers can be
2695 tricky. This is made easier by the use of relative references. Instead
2696 of (?1) in the pattern above you can write (?-2) to refer to the second
2697 most recently opened parentheses preceding the recursion. In other
2698 words, a negative number counts capturing parentheses leftwards from
2699 the point at which it is encountered.
2700
2701 Be aware, however, that if duplicate capture group numbers are in use,
2702 relative references refer to the earliest group with the appropriate
2703 number. Consider, for example:
2704
2705 (?|(a)|(b)) (c) (?-2)
2706
2707 The first two capture groups (a) and (b) are both numbered 1, and group
2708 (c) is number 2. When the reference (?-2) is encountered, the second
2709 most recently opened parentheses has the number 1, but it is the first
2710 such group (the (a) group) to which the recursion refers. This would be
2711 the same if an absolute reference (?1) was used. In other words, rela‐
2712 tive references are just a shorthand for computing a group number.
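
The arithmetic behind this shorthand can be sketched as follows. It
assumes that the highest capture group number assigned to the left of
the reference is already known; resolve_relative and its parameters are
hypothetical names, not part of any PCRE2 interface.

```python
def resolve_relative(reference, highest_so_far):
    """Convert a signed relative reference such as -2 or +2 to a number.

    highest_so_far is the highest capture group number assigned to the
    left of the reference.
    """
    if reference < 0:
        number = highest_so_far + reference + 1   # count leftwards
    elif reference > 0:
        number = highest_so_far + reference       # count rightwards
    else:
        raise ValueError("a relative reference must be non-zero")
    if number < 1:
        raise ValueError("reference points before the first group")
    return number
```

For (?|(a)|(b)) (c) (?-2) the highest number so far is 2, so
resolve_relative(-2, 2) gives 1, the same group that (?1) would call.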
2713
2714 It is also possible to refer to subsequent capture groups, by writing
2715 references such as (?+2). However, these cannot be recursive because
2716 the reference is not inside the parentheses that are referenced. They
2717 are always non-recursive subroutine calls, as described in the next
2718 section.
2719
2720 An alternative approach is to use named parentheses. The Perl syntax
2721 for this is (?&name); PCRE1's earlier syntax (?P>name) is also sup‐
2722 ported. We could rewrite the above example as follows:
2723
2724 (?<pn> \( ( [^()]++ | (?&pn) )* \) )
2725
2726 If there is more than one group with the same name, the earliest one is
2727 used.
2728
2729 The example pattern that we have been looking at contains nested unlim‐
2730 ited repeats, and so the use of a possessive quantifier for matching
2731 strings of non-parentheses is important when applying the pattern to
2732 strings that do not match. For example, when this pattern is applied to
2733
2734 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
2735
2736 it yields "no match" quickly. However, if a possessive quantifier is
2737 not used, the match runs for a very long time indeed because there are
2738 so many different ways the + and * repeats can carve up the subject,
2739 and all have to be tested before failure can be reported.
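
To give a feel for the numbers involved, the sketch below counts the
ways in which a run of n characters can be cut into consecutive
non-empty pieces, which is one way of picturing how an inner + and an
outer * can share out the run of "a" characters. It illustrates the
growth rate only and is not an exact model of PCRE2's search;
count_splits is a hypothetical name.

```python
def count_splits(n):
    """Count the ways a run of n characters can be cut into pieces."""
    ways = [1] + [0] * n       # ways[0]: an empty run has one (empty) split
    for length in range(1, n + 1):
        # The first piece may take 1..length characters; the remainder
        # is split in every possible way.
        ways[length] = sum(
            ways[length - first] for first in range(1, length + 1)
        )
    return ways[n]
```

The count doubles with each extra character, so for the 53 characters
in the subject above it is 2**52.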
2740
2741 At the end of a match, the values of capturing parentheses are those
2742 from the outermost level. If you want to obtain intermediate values, a
2743 callout function can be used (see below and the pcre2callout documenta‐
2744 tion). If the pattern above is matched against
2745
2746 (ab(cd)ef)
2747
2748 the value for the inner capturing parentheses (numbered 2) is "ef",
2749 which is the last value taken on at the top level. If a capture group
2750 is not matched at the top level, its final captured value is unset,
2751 even if it was (temporarily) set at a deeper level during the matching
2752 process.
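
The following sketch imitates the pattern ( \( ( [^()]++ | (?1) )* \) )
by hand in order to show why group 2 ends up as "ef". Values set inside
a recursive call are discarded on return, and only the iterations at the
current level update the value. It is a simplified model, not PCRE2's
implementation, and match_group1 is a hypothetical name.

```python
def match_group1(s, pos=0):
    """Return (end, group2) for a balanced group at pos, or None.

    group2 is the text of the last iteration of the inner group at this
    level, or None if the inner group never matched at this level.
    """
    if pos >= len(s) or s[pos] != "(":
        return None
    i = pos + 1
    group2 = None
    while i < len(s) and s[i] != ")":
        if s[i] == "(":
            inner = match_group1(s, i)        # the (?1) call
            if inner is None:
                return None
            end, _ = inner                    # inner captures are dropped
            group2 = s[i:end]
            i = end
        else:
            j = i
            while j < len(s) and s[j] not in "()":
                j += 1                        # one [^()]++ run
            group2 = s[i:j]
            i = j
    if i < len(s):
        return i + 1, group2                  # consumed the closing ")"
    return None
```

For "(ab(cd)ef)" the result is (10, "ef"); for "()" it is (2, None),
corresponding to an unset group.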
2753
2754 Do not confuse the (?R) item with the condition (R), which tests for
2755 recursion. Consider this pattern, which matches text in angle brack‐
2756 ets, allowing for arbitrary nesting. Only digits are allowed in nested
2757 brackets (that is, when recursing), whereas any characters are permit‐
2758 ted at the outer level.
2759
2760 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
2761
2762 In this pattern, (?(R) is the start of a conditional group, with two
2763 different alternatives for the recursive and non-recursive cases. The
2764 (?R) item is the actual recursive call.
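
The difference between the two items can be seen in a hand-written
equivalent, where the (R) condition becomes a flag and the (?R) item
becomes a function call. This is a sketch of what the pattern accepts
when anchored at a given offset; match_angle and recursing are
hypothetical names.

```python
DIGITS = "0123456789"


def match_angle(s, pos=0, recursing=False):
    """Return the offset just past the bracketed text at pos, or None."""
    if pos >= len(s) or s[pos] != "<":
        return None
    i = pos + 1
    while i < len(s):
        c = s[i]
        if c == ">":
            return i + 1
        if c == "<":
            # (?R): the actual recursive call
            end = match_angle(s, i, recursing=True)
            if end is None:
                return None
            i = end
        elif recursing and c not in DIGITS:
            return None      # (?(R) \d++ ...): digits only when recursing
        else:
            i += 1           # \d++ when recursing, [^<>]*+ otherwise
    return None
```

With this function "<abc<123>def>" is accepted in full, whereas
"<abc<12a>def>" is rejected because of the letter in the nested
brackets.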
2765
2766 Differences in recursion processing between PCRE2 and Perl
2767
2768 Some former differences between PCRE2 and Perl no longer exist.
2769
2770 Before release 10.30, recursion processing in PCRE2 differed from Perl
2771 in that a recursive subroutine call was always treated as an atomic
2772 group. That is, once it had matched some of the subject string, it was
2773 never re-entered, even if it contained untried alternatives and there
2774 was a subsequent matching failure. (Historical note: PCRE implemented
2775 recursion before Perl did.)
2776
2777 Starting with release 10.30, recursive subroutine calls are no longer
2778 treated as atomic. That is, they can be re-entered to try unused alter‐
2779 natives if there is a matching failure later in the pattern. This is
2780 now compatible with the way Perl works. If you want a subroutine call
2781 to be atomic, you must explicitly enclose it in an atomic group.
2782
2783 Supporting backtracking into recursions simplifies certain types of re‐
2784 cursive pattern. For example, this pattern matches palindromic strings:
2785
2786 ^((.)(?1)\2|.?)$
2787
2788 The second branch in the group matches a single central character in
2789 the palindrome when there are an odd number of characters, or nothing
2790 when there are an even number of characters, but in order to work it
2791 has to be able to try the second case when the rest of the pattern
2792 match fails. If you want to match typical palindromic phrases, the pat‐
2793 tern has to ignore all non-word characters, which can be done like
2794 this:
2795
2796 ^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
2797
2798 If run with the PCRE2_CASELESS option, this pattern matches phrases
2799 such as "A man, a plan, a canal: Panama!". Note the use of the posses‐
2800 sive quantifier *+ to avoid backtracking into sequences of non-word
2801 characters. Without this, PCRE2 takes a great deal longer (ten times or
2802 more) to match typical phrases, and Perl takes so long that you think
2803 it has gone into a loop.
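
The need to re-enter a recursion can be illustrated with a hand-written
model of the simpler pattern ^((.)(?1)\2|.?)$ in which each call is a
generator that offers every way the group can match, one at a time. The
caller asks for the next possibility whenever what follows fails, which
corresponds to backtracking into the recursion. This is a sketch that
ignores the special treatment of newlines by the dot; group1 and
is_palindrome are hypothetical names.

```python
def group1(s, i):
    """Yield each offset at which group 1 can end when started at i."""
    if i < len(s):
        c = s[i]                         # (.) captures one character
        for j in group1(s, i + 1):       # (?1), re-entered on demand
            if j < len(s) and s[j] == c:
                yield j + 1              # \2 matched the same character
        yield i + 1                      # second branch: . taken
    yield i                              # second branch: . omitted


def is_palindrome(s):
    """Model of ^((.)(?1)\\2|.?)$ applied to s."""
    return any(end == len(s) for end in group1(s, 0))
```

If each recursive call offered only its first result, as an atomic
group would, is_palindrome("abba") would fail.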
2804
2805 Another way in which PCRE2 and Perl used to differ in their recursion
2806 processing is in the handling of captured values. Formerly in Perl,
2807 when a group was called recursively or as a subroutine (see the next
2808 section), it had no access to any values that were captured outside the
2809 recursion, whereas in PCRE2 these values can be referenced. Consider
2810 this pattern:
2811
2812 ^(.)(\1|a(?2))
2813
2814 This pattern matches "bab". The first capturing parentheses match "b",
2815 then in the second group, when the backreference \1 fails to match "b",
2816 the second alternative matches "a" and then recurses. In the recursion,
2817 \1 does now match "b" and so the whole match succeeds. This match used
2818 to fail in Perl, but in later versions (I tried 5.024) it now works.
2819
2821
2822 If the syntax for a recursive group call (either by number or by name)
2823 is used outside the parentheses to which it refers, it operates a bit
2824 like a subroutine in a programming language. More accurately, PCRE2
2825 treats the referenced group as an independent subpattern which it tries
2826 to match at the current matching position. The called group may be de‐
2827 fined before or after the reference. A numbered reference can be abso‐
2828 lute or relative, as in these examples:
2829
2830 (...(absolute)...)...(?2)...
2831 (...(relative)...)...(?-1)...
2832 (...(?+1)...(relative)...
2833
2834 An earlier example pointed out that the pattern
2835
2836 (sens|respons)e and \1ibility
2837
2838 matches "sense and sensibility" and "response and responsibility", but
2839 not "sense and responsibility". If instead the pattern
2840
2841 (sens|respons)e and (?1)ibility
2842
2843 is used, it does match "sense and responsibility" as well as the other
2844 two strings. Another example is given in the discussion of DEFINE
2845 above.
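
The contrast can be reproduced approximately with Python's standard re
module, used here only as a stand-in. That module supports the
backreference but has no (?1) syntax, so the subroutine call is imitated
by writing the group's contents out a second time, which is what the
call amounts to for this pattern. The names backref and subroutine are
hypothetical.

```python
import re

# \1 must match the same text that group 1 actually matched.
backref = re.compile(r"(sens|respons)e and \1ibility")

# Stand-in for (?1): the group's pattern is applied afresh, so either
# alternative may be chosen regardless of what group 1 matched.
subroutine = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")
```

Only the second pattern accepts "sense and responsibility".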
2846
2847 Like recursions, subroutine calls used to be treated as atomic, but
2848 this changed at PCRE2 release 10.30, so backtracking into subroutine
2849 calls can now occur. However, any capturing parentheses that are set
2850 during the subroutine call revert to their previous values afterwards.
2851
2852 Processing options such as case-independence are fixed when a group is
2853 defined, so if it is used as a subroutine, such options cannot be
2854 changed for different calls. For example, consider this pattern:
2855
2856 (abc)(?i:(?-1))
2857
2858 It matches "abcabc". It does not match "abcABC" because the change of
2859 processing option does not affect the called group.
2860
2861 The behaviour of backtracking control verbs in groups when called as
2862 subroutines is described in the section entitled "Backtracking verbs in
2863 subroutines" below.
2864
2866
2867 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
2868 name or a number enclosed either in angle brackets or single quotes, is
2869 an alternative syntax for calling a group as a subroutine, possibly re‐
2870 cursively. Here are two of the examples used above, rewritten using
2871 this syntax:
2872
2873 (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
2874 (sens|respons)e and \g'1'ibility
2875
2876 PCRE2 supports an extension to Oniguruma: if a number is preceded by a
2877 plus or a minus sign it is taken as a relative reference. For example:
2878
2879 (abc)(?i:\g<-1>)
2880
2881 Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
2882 synonymous. The former is a backreference; the latter is a subroutine
2883 call.
2884
2886
2887 Perl has a feature whereby using the sequence (?{...}) causes arbitrary
2888 Perl code to be obeyed in the middle of matching a regular expression.
2889 This makes it possible, amongst other things, to extract different sub‐
2890 strings that match the same pair of parentheses when there is a repeti‐
2891 tion.
2892
2893 PCRE2 provides a similar feature, but of course it cannot obey arbi‐
2894 trary Perl code. The feature is called "callout". The caller of PCRE2
2895 provides an external function by putting its entry point in a match
2896 context using the function pcre2_set_callout(), and then passing that
2897 context to pcre2_match() or pcre2_dfa_match(). If no match context is
2898 passed, or if the callout entry point is set to NULL, callouts are dis‐
2899 abled.
2900
2901 Within a regular expression, (?C<arg>) indicates a point at which the
2902 external function is to be called. There are two kinds of callout:
2903 those with a numerical argument and those with a string argument. (?C)
2904 on its own with no argument is treated as (?C0). A numerical argument
2905 allows the application to distinguish between different callouts.
2906 String arguments were added for release 10.20 to make it possible for
2907 script languages that use PCRE2 to embed short scripts within patterns
2908 in a similar way to Perl.
2909
2910 During matching, when PCRE2 reaches a callout point, the external func‐
2911 tion is called. It is provided with the number or string argument of
2912 the callout, the position in the pattern, and one item of data that is
2913 also set in the match block. The callout function may cause matching to
2914 proceed, to backtrack, or to fail.
2915
2916 By default, PCRE2 implements a number of optimizations at matching
2917 time, and one side-effect is that sometimes callouts are skipped. If
2918 you need all possible callouts to happen, you need to set options that
2919 disable the relevant optimizations. More details, including a complete
2920 description of the programming interface to the callout function, are
2921 given in the pcre2callout documentation.
2922
2923 Callouts with numerical arguments
2924
2925 If you just want to have a means of identifying different callout
2926 points, put a number less than 256 after the letter C. For example,
2927 this pattern has two callout points:
2928
2929 (?C1)abc(?C2)def
2930
2931 If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical
2932 callouts are automatically installed before each item in the pattern.
2933 They are all numbered 255. If there is a conditional group in the pat‐
2934 tern whose condition is an assertion, an additional callout is inserted
2935 just before the condition. An explicit callout may also be set at this
2936 position, as in this example:
2937
2938 (?(?C9)(?=a)abc|def)
2939
2940 Note that this applies only to assertion conditions, not to other types
2941 of condition.
2942
2943 Callouts with string arguments
2944
2945 A delimited string may be used instead of a number as a callout argu‐
2946 ment. The starting delimiter must be one of ` ' " ^ % # $ { and the
2947 ending delimiter is the same as the start, except for {, where the end‐
2948 ing delimiter is }. If the ending delimiter is needed within the
2949 string, it must be doubled. For example:
2950
2951 (?C'ab ''c'' d')xyz(?C{any text})pqr
2952
2953 The doubling is removed before the string is passed to the callout
2954 function.
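
The delimiter rules can be sketched as a small parser that is handed
the text beginning at the opening delimiter. It models the rules stated
above and is not taken from the PCRE2 source; DELIMITERS and
parse_callout_string are hypothetical names.

```python
# Each opening delimiter mapped to its closing delimiter.
DELIMITERS = {
    "`": "`", "'": "'", '"': '"', "^": "^",
    "%": "%", "#": "#", "$": "$", "{": "}",
}


def parse_callout_string(text):
    """Return (argument, length) for a delimited string at text[0].

    length counts both delimiters, so text[length:] is what follows the
    string (normally the closing parenthesis of the callout).
    """
    if not text or text[0] not in DELIMITERS:
        raise ValueError("not a callout string delimiter")
    closing = DELIMITERS[text[0]]
    pieces = []
    i = 1
    while i < len(text):
        if text[i] == closing:
            if i + 1 < len(text) and text[i + 1] == closing:
                pieces.append(closing)   # doubled delimiter: keep one
                i += 2
                continue
            return "".join(pieces), i + 1
        pieces.append(text[i])
        i += 1
    raise ValueError("unterminated callout string")
```

For the first callout in the example above, the argument passed on is
ab 'c' d, with the doubled quotes reduced to single ones.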
2955
2957
2958 There are a number of special "Backtracking Control Verbs" (to use
2959 Perl's terminology) that modify the behaviour of backtracking during
2960 matching. They are generally of the form (*VERB) or (*VERB:NAME). Some
2961 verbs take either form, and may behave differently depending on whether
2962 or not a name argument is present. The names are not required to be
2963 unique within the pattern.
2964
2965 By default, for compatibility with Perl, a name is any sequence of
2966 characters that does not include a closing parenthesis. The name is not
2967 processed in any way, and it is not possible to include a closing
2968 parenthesis in the name. This can be changed by setting the
2969 PCRE2_ALT_VERBNAMES option, but the result is no longer Perl-compati‐
2970 ble.
2971
2972 When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to
2973 verb names and only an unescaped closing parenthesis terminates the
2974 name. However, the only backslash items that are permitted are \Q, \E,
2975 and sequences such as \x{100} that define character code points. Char‐
2976 acter type escapes such as \d are faulted.
2977
2978 A closing parenthesis can be included in a name either as \) or between
2979 \Q and \E. In addition to backslash processing, if the PCRE2_EXTENDED
2980 or PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb
2981 names is skipped, and #-comments are recognized, exactly as in the rest
2982 of the pattern. PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect
2983 verb names unless PCRE2_ALT_VERBNAMES is also set.
2984
2985 The maximum length of a name is 255 in the 8-bit library and 65535 in
2986 the 16-bit and 32-bit libraries. If the name is empty, that is, if the
2987 closing parenthesis immediately follows the colon, the effect is as if
2988 the colon were not there. Any number of these verbs may occur in a pat‐
2989 tern. Except for (*ACCEPT), they may not be quantified.
2990
2991 Since these verbs are specifically related to backtracking, most of
2992 them can be used only when the pattern is to be matched using the tra‐
2993 ditional matching function, because that uses a backtracking algorithm.
2994 With the exception of (*FAIL), which behaves like a failing negative
2995 assertion, the backtracking control verbs cause an error if encountered
2996 by the DFA matching function.
2997
2998 The behaviour of these verbs in repeated groups, assertions, and in
2999 capture groups called as subroutines (whether or not recursively) is
3000 documented below.
3001
3002 Optimizations that affect backtracking verbs
3003
3004 PCRE2 contains some optimizations that are used to speed up matching by
3005 running some checks at the start of each match attempt. For example, it
3006 may know the minimum length of matching subject, or that a particular
3007 character must be present. When one of these optimizations bypasses the
3008 running of a match, any included backtracking verbs will not, of
3009 course, be processed. You can suppress the start-of-match optimizations
3010 by setting the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com‐
3011 pile(), or by starting the pattern with (*NO_START_OPT). There is more
3012 discussion of this option in the section entitled "Compiling a pattern"
3013 in the pcre2api documentation.
3014
3015 Experiments with Perl suggest that it too has similar optimizations,
3016 and like PCRE2, turning them off can change the result of a match.
3017
3018 Verbs that act immediately
3019
3020 The following verbs act as soon as they are encountered.
3021
3022 (*ACCEPT) or (*ACCEPT:NAME)
3023
3024 This verb causes the match to end successfully, skipping the remainder
3025 of the pattern. However, when it is inside a capture group that is
3026 called as a subroutine, only that group is ended successfully. Matching
3027 then continues at the outer level. If (*ACCEPT) is triggered in a posi‐
3028 tive assertion, the assertion succeeds; in a negative assertion, the
3029 assertion fails.
3030
3031 If (*ACCEPT) is inside capturing parentheses, the data so far is cap‐
3032 tured. For example:
3033
3034 A((?:A|B(*ACCEPT)|C)D)
3035
3036 This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap‐
3037 tured by the outer parentheses.
3038
3039 (*ACCEPT) is the only backtracking verb that is allowed to be quanti‐
3040 fied because an ungreedy quantification with a minimum of zero acts
3041 only when a backtrack happens. Consider, for example,
3042
3043 (A(*ACCEPT)??B)C
3044
3045 where A, B, and C may be complex expressions. After matching "A", the
3046 matcher processes "BC"; if that fails, causing a backtrack, (*ACCEPT)
3047 is triggered and the match succeeds. In both cases, all but C is cap‐
3048 tured. Whereas (*COMMIT) (see below) means "fail on backtrack", a re‐
3049 peated (*ACCEPT) of this type means "succeed on backtrack".
3050
3051 Warning: (*ACCEPT) should not be used within a script run group, be‐
3052 cause it causes an immediate exit from the group, bypassing the script
3053 run checking.
3054
3055 (*FAIL) or (*FAIL:NAME)
3056
3057 This verb causes a matching failure, forcing backtracking to occur. It
3058 may be abbreviated to (*F). It is equivalent to (?!) but easier to
3059 read. The Perl documentation notes that it is probably useful only when
3060 combined with (?{}) or (??{}). Those are, of course, Perl features that
3061 are not present in PCRE2. The nearest equivalent is the callout fea‐
3062 ture, as for example in this pattern:
3063
3064 a+(?C)(*FAIL)
3065
3066 A match with the string "aaaa" always fails, but the callout is taken
3067 before each backtrack happens (in this example, 10 times).
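
The figure of 10 can be checked by counting: at each starting position
the greedy a+ first takes every available "a" and then gives them back
one at a time, and the callout is reached once for each length tried.
The sketch below models just that counting; count_callouts is a
hypothetical name.

```python
def count_callouts(subject):
    """Count how often (?C) is reached for a+(?C)(*FAIL) on subject."""
    calls = 0
    for start in range(len(subject)):
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ tries run, run - 1, ..., 1 characters; each try reaches the
        # callout before (*FAIL) forces the next backtrack.
        calls += run
    return calls
```

For "aaaa" the starting positions contribute 4, 3, 2, and 1 callouts
respectively.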
3068
3069 (*ACCEPT:NAME) and (*FAIL:NAME) behave the same as (*MARK:NAME)(*AC‐
3070 CEPT) and (*MARK:NAME)(*FAIL), respectively, that is, a (*MARK) is
3071 recorded just before the verb acts.
3072
3073 Recording which path was taken
3074
3075 There is one verb whose main purpose is to track how a match was ar‐
3076 rived at, though it also has a secondary use in conjunction with ad‐
3077 vancing the match starting point (see (*SKIP) below).
3078
3079 (*MARK:NAME) or (*:NAME)
3080
3081 A name is always required with this verb. For all the other backtrack‐
3082 ing control verbs, a NAME argument is optional.
3083
3084 When a match succeeds, the last-encountered mark name on
3085 the matching path is passed back to the caller as described in the sec‐
3086 tion entitled "Other information about the match" in the pcre2api docu‐
3087 mentation. This applies to all instances of (*MARK) and other verbs,
3088 including those inside assertions and atomic groups. However, there are
3089 differences in those cases when (*MARK) is used in conjunction with
3090 (*SKIP) as described below.
3091
3092 The mark name that was last encountered on the matching path is passed
3093 back. A verb without a NAME argument is ignored for this purpose. Here
3094 is an example of pcre2test output, where the "mark" modifier requests
3095 the retrieval and outputting of (*MARK) data:
3096
3097 re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
3098 data> XY
3099 0: XY
3100 MK: A
3101 XZ
3102 0: XZ
3103 MK: B
3104
3105 The (*MARK) name is tagged with "MK:" in this output, and in this exam‐
3106 ple it indicates which of the two alternatives matched. This is a more
3107 efficient way of obtaining this information than putting each alterna‐
3108 tive in its own capturing parentheses.
3109
3110 If a verb with a name is encountered in a positive assertion that is
3111 true, the name is recorded and passed back if it is the last-encoun‐
3112 tered. This does not happen for negative assertions or failing positive
3113 assertions.
3114
3115 After a partial match or a failed match, the last encountered name in
3116 the entire match process is returned. For example:
3117
3118 re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
3119 data> XP
3120 No match, mark = B
3121
3122 Note that in this unanchored example the mark is retained from the
3123 match attempt that started at the letter "X" in the subject. Subsequent
3124 match attempts starting at "P" and then with an empty string do not get
3125 as far as the (*MARK) item, but nevertheless do not reset it.
3126
3127 If you are interested in (*MARK) values after failed matches, you
3128 should probably set the PCRE2_NO_START_OPTIMIZE option (see above) to
3129 ensure that the match is always attempted.
3130
3131 Verbs that act after backtracking
3132
3133 The following verbs do nothing when they are encountered. Matching con‐
3134 tinues with what follows, but if there is a subsequent match failure,
3135 causing a backtrack to the verb, a failure is forced. That is, back‐
3136 tracking cannot pass to the left of the verb. However, when one of
3137 these verbs appears inside an atomic group or in a lookaround assertion
3138 that is true, its effect is confined to that group, because once the
3139 group has been matched, there is never any backtracking into it. Back‐
3140 tracking from beyond an assertion or an atomic group ignores the entire
3141 group, and seeks a preceding backtracking point.
3142
3143 These verbs differ in exactly what kind of failure occurs when back‐
3144 tracking reaches them. The behaviour described below is what happens
3145 when the verb is not in a subroutine or an assertion. Subsequent sec‐
3146 tions cover these special cases.
3147
3148 (*COMMIT) or (*COMMIT:NAME)
3149
3150 This verb causes the whole match to fail outright if there is a later
3151 matching failure that causes backtracking to reach it. Even if the pat‐
3152 tern is unanchored, no further attempts to find a match by advancing
3153 the starting point take place. If (*COMMIT) is the only backtracking
3154 verb that is encountered, once it has been passed pcre2_match() is com‐
3155 mitted to finding a match at the current starting point, or not at all.
3156 For example:
3157
3158 a+(*COMMIT)b
3159
3160 This matches "xxaab" but not "aacaab". It can be thought of as a kind
3161 of dynamic anchor, or "I've started, so I must finish."
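
The behaviour of this example can be modelled by a search loop that
gives up entirely once the verb has been passed and the remainder
fails. The sketch assumes that start-of-match optimizations are not in
play and is specific to the pattern a+(*COMMIT)b; match_commit is a
hypothetical name.

```python
def match_commit(subject):
    """Model a+(*COMMIT)b; return (start, end) offsets or None."""
    for start in range(len(subject)):
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        if run == 0:
            continue            # a+ failed before (*COMMIT): bump along
        end = start + run
        if end < len(subject) and subject[end] == "b":
            return start, end + 1
        return None             # backtracked onto (*COMMIT): give up
    return None
```

In "xxaab" the verb is first passed at the third character and the
match succeeds there. In "aacaab" it is passed at the first character,
"b" then fails against "c", and the later "aab" is never tried.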
3162
3163 The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COM‐
3164 MIT). It is like (*MARK:NAME) in that the name is remembered for pass‐
3165 ing back to the caller. However, (*SKIP:NAME) searches only for names
3166 that are set with (*MARK), ignoring those set by any of the other back‐
3167 tracking verbs.
3168
3169 If there is more than one backtracking verb in a pattern, a different
3170 one that follows (*COMMIT) may be triggered first, so merely passing
3171 (*COMMIT) during a match does not always guarantee that a match must be
3172 at this starting point.
3173
3174 Note that (*COMMIT) at the start of a pattern is not the same as an an‐
3175 chor, unless PCRE2's start-of-match optimizations are turned off, as
3176 shown in this output from pcre2test:
3177
3178 re> /(*COMMIT)abc/
3179 data> xyzabc
3180 0: abc
3181 data>
3182 re> /(*COMMIT)abc/no_start_optimize
3183 data> xyzabc
3184 No match
3185
3186 For the first pattern, PCRE2 knows that any match must start with "a",
3187 so the optimization skips along the subject to "a" before applying the
3188 pattern to the first set of data. The match attempt then succeeds. The
3189 second pattern disables the optimization that skips along to the first
3190 character. The pattern is now applied starting at "x", and so the
3191 (*COMMIT) causes the match to fail without trying any other starting
3192 points.
3193
3194 (*PRUNE) or (*PRUNE:NAME)
3195
3196 This verb causes the match to fail at the current starting position in
3197 the subject if there is a later matching failure that causes backtrack‐
3198 ing to reach it. If the pattern is unanchored, the normal "bumpalong"
3199 advance to the next starting character then happens. Backtracking can
3200 occur as usual to the left of (*PRUNE), before it is reached, or when
3201 matching to the right of (*PRUNE), but if there is no match to the
3202 right, backtracking cannot cross (*PRUNE). In simple cases, the use of
3203 (*PRUNE) is just an alternative to an atomic group or possessive quan‐
3204 tifier, but there are some uses of (*PRUNE) that cannot be expressed in
3205 any other way. In an anchored pattern (*PRUNE) has the same effect as
3206 (*COMMIT).
3207
3208 The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
3209 It is like (*MARK:NAME) in that the name is remembered for passing back
3210 to the caller. However, (*SKIP:NAME) searches only for names set with
3211 (*MARK), ignoring those set by other backtracking verbs.
3212
3213 (*SKIP)
3214
3215 This verb, when given without a name, is like (*PRUNE), except that if
3216 the pattern is unanchored, the "bumpalong" advance is not to the next
3217 character, but to the position in the subject where (*SKIP) was encoun‐
3218 tered. (*SKIP) signifies that whatever text was matched leading up to
3219 it cannot be part of a successful match if there is a later mismatch.
3220 Consider:
3221
3222 a+(*SKIP)b
3223
3224 If the subject is "aaaac...", after the first match attempt fails
3225 (starting at the first character in the string), the starting point
3226 skips on to start the next attempt at "c". Note that a possessive quan‐
3227 tifier does not have the same effect as this example; although it would
3228 suppress backtracking during the first match attempt, the second at‐
3229 tempt would start at the second character instead of skipping on to
3230 "c".
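
The difference lies in the starting positions that are tried, as this
model of the two patterns shows. It assumes that start-of-match
optimizations are not in play; start_positions and skip are
hypothetical names.

```python
def start_positions(subject, skip):
    """Model a+(*SKIP)b (skip=True) or a++b (skip=False).

    Return (match, starts): match is a (start, end) pair or None, and
    starts lists the starting offsets that were tried, in order.
    """
    starts = []
    start = 0
    while start <= len(subject):
        starts.append(start)
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        end = start + run
        if run > 0 and end < len(subject) and subject[end] == "b":
            return (start, end + 1), starts
        if skip and run > 0:
            start = end      # resume where (*SKIP) was encountered
        else:
            start += 1       # normal bumpalong
    return None, starts
```

For the subject "aaaac" the (*SKIP) version tries offsets 0, 4, and 5,
whereas the possessive version tries every offset from 0 to 5.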
3231
3232 If (*SKIP) is used to specify a new starting position that is the same
3233 as the starting position of the current match, or (by being inside a
3234 lookbehind) earlier, the position specified by (*SKIP) is ignored, and
3235 instead the normal "bumpalong" occurs.
3236
3237 (*SKIP:NAME)
3238
3239 When (*SKIP) has an associated name, its behaviour is modified. When
3240 such a (*SKIP) is triggered, the previous path through the pattern is
3241 searched for the most recent (*MARK) that has the same name. If one is
3242 found, the "bumpalong" advance is to the subject position that corre‐
3243 sponds to that (*MARK) instead of to where (*SKIP) was encountered. If
3244 no (*MARK) with a matching name is found, the (*SKIP) is ignored.
3245
3246 The search for a (*MARK) name uses the normal backtracking mechanism,
3247 which means that it does not see (*MARK) settings that are inside
3248 atomic groups or assertions, because they are never re-entered by back‐
3249 tracking. Compare the following pcre2test examples:
3250
3251 re> /a(?>(*MARK:X))(*SKIP:X)(*F)|(.)/
3252 data> abc
3253 0: a
3254 1: a
3255 data>
3256 re> /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/
3257 data> abc
3258 0: b
3259 1: b
3260
3261 In the first example, the (*MARK) setting is in an atomic group, so it
3262 is not seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored.
3263 This allows the second branch of the pattern to be tried at the first
3264 character position. In the second example, the (*MARK) setting is not
3265 in an atomic group. This allows (*SKIP:X) to find the (*MARK) when it
3266 backtracks, and this causes a new matching attempt to start at the sec‐
3267 ond character. This time, the (*MARK) is never seen because "a" does
3268 not match "b", so the matcher immediately jumps to the second branch of
3269 the pattern.
3270
3271 Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It
3272 ignores names that are set by other backtracking verbs.
3273
3274 (*THEN) or (*THEN:NAME)
3275
3276 This verb causes a skip to the next innermost alternative when back‐
3277 tracking reaches it. That is, it cancels any further backtracking
3278 within the current alternative. Its name comes from the observation
3279 that it can be used for a pattern-based if-then-else block:
3280
3281 ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
3282
3283 If the COND1 pattern matches, FOO is tried (and possibly further items
3284 after the end of the group if FOO succeeds); on failure, the matcher
3285 skips to the second alternative and tries COND2, without backtracking
3286 into COND1. If that succeeds and BAR fails, COND3 is tried. If subse‐
3287 quently BAZ fails, there are no more alternatives, so there is a back‐
3288 track to whatever came before the entire group. If (*THEN) is not in‐
3289 side an alternation, it acts like (*PRUNE).
3290
3291 The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN).
3292 It is like (*MARK:NAME) in that the name is remembered for passing back
3293 to the caller. However, (*SKIP:NAME) searches only for names set with
3294 (*MARK), ignoring those set by other backtracking verbs.
3295
3296 A group that does not contain a | character is just a part of the en‐
3297 closing alternative; it is not a nested alternation with only one al‐
3298 ternative. The effect of (*THEN) extends beyond such a group to the en‐
3299 closing alternative. Consider this pattern, where A, B, etc. are com‐
3300 plex pattern fragments that do not contain any | characters at this
3301 level:
3302
3303 A (B(*THEN)C) | D
3304
3305 If A and B are matched, but there is a failure in C, matching does not
3306 backtrack into A; instead it moves to the next alternative, that is, D.
3307 However, if the group containing (*THEN) is given an alternative, it
3308 behaves differently:
3309
3310 A (B(*THEN)C | (*FAIL)) | D
3311
3312 The effect of (*THEN) is now confined to the inner group. After a fail‐
3313 ure in C, matching moves to (*FAIL), which causes the whole group to
3314 fail because there are no more alternatives to try. In this case,
3315 matching does backtrack into A.
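
As a concrete sketch of these two cases, take A as a+?, B as a, C as c,
and D as aa. These fragments are illustrative choices, not taken from
the PCRE2 distribution, and the output shown is what the rules above
predict for pcre2test:

      re> /a+?(?:a(*THEN)c)|aa/
    data: aaac
     0: aa
    data:
      re> /a+?(?:a(*THEN)c|(*FAIL))|aa/
    data: aaac
     0: aaac

In the first pattern, the failure of c moves matching directly to the
alternative aa. In the second, the inner group fails as a whole, a+? is
re-entered and extended to "aa", and the first alternative then matches
the entire subject.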
3316
3317 Note that a conditional group is not considered as having two alterna‐
3318 tives, because only one is ever used. In other words, the | character
3319 in a conditional group has a different meaning. Ignoring white space,
3320 consider:
3321
3322 ^.*? (?(?=a) a | b(*THEN)c )
3323
3324 If the subject is "ba", this pattern does not match. Because .*? is un‐
3325 greedy, it initially matches zero characters. The condition (?=a) then
3326 fails, the character "b" is matched, but "c" is not. At this point,
3327 matching does not backtrack to .*? as might perhaps be expected from
3328 the presence of the | character. The conditional group is part of the
3329 single alternative that comprises the whole pattern, and so the match
3330 fails. (If there were a backtrack into .*?, allowing it to match "b",
3331 the match would succeed.)
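
This can be checked with pcre2test, using the /x modifier so that the
white space in the pattern is ignored. The second pattern, which simply
drops the (*THEN), is an illustrative addition for contrast; the outputs
are those implied by the explanation above:

      re> /^.*? (?(?=a) a | b(*THEN)c )/x
    data: ba
    No match
    data:
      re> /^.*? (?(?=a) a | bc )/x
    data: ba
     0: ba

Without (*THEN), the failure of "c" backtracks into .*?, which then
matches "b"; the condition (?=a) becomes true and "a" is matched.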
3332
3333 The verbs just described provide four different "strengths" of control
3334 when subsequent matching fails. (*THEN) is the weakest, carrying on the
3335 match at the next alternative. (*PRUNE) comes next, failing the match
3336 at the current starting position, but allowing an advance to the next
3337 character (for an unanchored pattern). (*SKIP) is similar, except that
3338 the advance may be more than one character. (*COMMIT) is the strongest,
3339 causing the entire match to fail.
3340
3341 More than one backtracking verb
3342
3343 If more than one backtracking verb is present in a pattern, the one
3344 that is backtracked onto first acts. For example, consider this pat‐
3345 tern, where A, B, etc. are complex pattern fragments:
3346
3347 (A(*COMMIT)B(*THEN)C|ABD)
3348
3349 If A matches but B fails, the backtrack to (*COMMIT) causes the entire
3350 match to fail. However, if A and B match, but C fails, the backtrack to
3351 (*THEN) causes the next alternative (ABD) to be tried. This behaviour
3352 is consistent, but is not always the same as Perl's. It means that if
3353 two or more backtracking verbs appear in succession, all but the last
3354 of them have no effect. Consider this example:
3355
3356 ...(*COMMIT)(*PRUNE)...
3357
3358 If there is a matching failure to the right, backtracking onto (*PRUNE)
3359 causes it to be triggered, and its action is taken. There can never be
3360 a backtrack onto (*COMMIT).
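
Returning to the pattern (A(*COMMIT)B(*THEN)C|ABD), a small pcre2test
sketch may help. The single letters a, b, c, and d stand in for the
fragments A, B, C, and D; this simplification is an assumption made for
illustration only:

      re> /(a(*COMMIT)b(*THEN)c|abd)/
    data: abd
     0: abd
     1: abd
    data: acd
    No match

For "abd", the failure of c backtracks onto (*THEN), so the second
alternative is tried and matches. For "acd", b fails, the backtrack
reaches (*COMMIT), and the entire match fails.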
3361
3362 Backtracking verbs in repeated groups
3363
3364 PCRE2 sometimes differs from Perl in its handling of backtracking verbs
3365 in repeated groups. For example, consider:
3366
3367 /(a(*COMMIT)b)+ac/
3368
3369 If the subject is "abac", Perl matches unless its optimizations are
3370 disabled, but PCRE2 always fails because the (*COMMIT) in the second
3371 repeat of the group acts.
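
The PCRE2 side of this comparison, as it would be expected to appear in
pcre2test:

      re> /(a(*COMMIT)b)+ac/
    data: abac
    No match

The first iteration of the group matches "ab". The second iteration
matches "a" and passes (*COMMIT) before b fails against "c", so the
resulting backtrack reaches (*COMMIT) and the match is abandoned.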
3372
3373 Backtracking verbs in assertions
3374
3375 (*FAIL) in any assertion has its normal effect: it forces an immediate
3376 backtrack. The behaviour of the other backtracking verbs depends on
3377 whether the assertion is standalone or is acting as the condition in
3378 a conditional group.
3379
3380 (*ACCEPT) in a standalone positive assertion causes the assertion to
3381 succeed without any further processing; captured substrings and a
3382 mark name (if set) are retained. In a standalone negative assertion,
3383 (*ACCEPT) causes the assertion to fail without any further process‐
3384 ing; captured substrings and any mark name are discarded.
3385
3386 If the assertion is a condition, (*ACCEPT) causes the condition to be
3387 true for a positive assertion and false for a negative one; captured
3388 substrings are retained in both cases.
3389
3390 The remaining verbs act only when a later failure causes a backtrack to
3391 reach them. This means that, for the Perl-compatible assertions, their
3392 effect is confined to the assertion, because Perl lookaround assertions
3393 are atomic. A backtrack that occurs after such an assertion is complete
3394 does not jump back into the assertion. Note in particular that a
3395 (*MARK) name that is set in an assertion is not "seen" by an instance
3396 of (*SKIP:NAME) later in the pattern.
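
The point about (*MARK) can be sketched by adapting the atomic-group
example given for (*SKIP:NAME), placing the (*MARK) inside a lookahead
instead. This variant is an illustration, and the output is what the
rule above predicts:

      re> /a(?=(*MARK:X))(*SKIP:X)(*F)|(.)/
    data: abc
     0: a
     1: a

Because the (*MARK) is not seen, the (*SKIP:X) is ignored, and the
second branch is tried at the first character, just as in the
atomic-group case.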
3397
3398 PCRE2 now supports non-atomic positive assertions, as described in the
3399 section entitled "Non-atomic assertions" above. These assertions must
3400 be standalone (not used as conditions). They are not Perl-compatible.
3401 For these assertions, a later backtrack does jump back into the asser‐
3402 tion, and therefore verbs such as (*COMMIT) can be triggered by back‐
3403 tracks from later in the pattern.
3404
3405 The effect of (*THEN) is not allowed to escape beyond an assertion. If
3406 there are no more branches to try, (*THEN) causes a positive assertion
3407 to be false, and a negative assertion to be true.
3408
3409 The other backtracking verbs are not treated specially if they appear
3410 in a standalone positive assertion. In a conditional positive asser‐
3411 tion, backtracking (from within the assertion) into (*COMMIT), (*SKIP),
3412 or (*PRUNE) causes the condition to be false. However, for both stand‐
3413 alone and conditional negative assertions, backtracking into (*COMMIT),
3414 (*SKIP), or (*PRUNE) causes the assertion to be true, without consider‐
3415 ing any further alternative branches.
3416
3417 Backtracking verbs in subroutines
3418
3419 The rules below apply whether or not the group is called recursively.
3420
3421 (*ACCEPT) in a group called as a subroutine causes the subroutine match
3422 to succeed without any further processing. Matching then continues af‐
3423 ter the subroutine call. Perl documents this behaviour. Perl's treat‐
3424 ment of the other verbs in subroutines is different in some cases.
3425
3426 (*FAIL) in a group called as a subroutine has its normal effect: it
3427 forces an immediate backtrack.
3428
3429 In a group called as a subroutine, (*COMMIT), (*SKIP), and (*PRUNE),
3430 when triggered by being backtracked onto, cause the subroutine match
3431 to fail. There is then a backtrack at the outer level.
3432
3433 (*THEN), when triggered, skips to the next alternative in the innermost
3434 enclosing group that has alternatives (its normal behaviour). However,
3435 if there is no such group within the subroutine's group, the subroutine
3436 match fails and there is a backtrack at the outer level.
3437
3438 SEE ALSO
3439
3440 pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3),
3441 pcre2(3).
3442
3443 AUTHOR
3444
3445 Philip Hazel
3446 Retired from University Computing Service
3447 Cambridge, England.
3448
3449 REVISION
3450
3451 Last updated: 12 January 2022
3452 Copyright (c) 1997-2022 University of Cambridge.
3453
3454
3455
3456PCRE2 10.40 12 January 2022 PCRE2PATTERN(3)