1 PCRE2PATTERN(3) Library Functions Manual PCRE2PATTERN(3)
2
3
4
6 PCRE2 - Perl-compatible regular expressions (revised API)
7
9
10 The syntax and semantics of the regular expressions that are supported
11 by PCRE2 are described in detail below. There is a quick-reference syn‐
12 tax summary in the pcre2syntax page. PCRE2 tries to match Perl syntax
13 and semantics as closely as it can. PCRE2 also supports some alterna‐
14 tive regular expression syntax (which does not conflict with the Perl
15 syntax) in order to provide some compatibility with regular expressions
16 in Python, .NET, and Oniguruma.
17
18 Perl's regular expressions are described in its own documentation, and
19 regular expressions in general are covered in a number of books, some
20 of which have copious examples. Jeffrey Friedl's "Mastering Regular
21 Expressions", published by O'Reilly, covers regular expressions in
22 great detail. This description of PCRE2's regular expressions is
23 intended as reference material.
24
25 This document discusses the regular expression patterns that are sup‐
26 ported by PCRE2 when its main matching function, pcre2_match(), is
27 used. PCRE2 also has an alternative matching function,
28 pcre2_dfa_match(), which matches using a different algorithm that is
29 not Perl-compatible. Some of the features discussed below are not
30 available when DFA matching is used. The advantages and disadvantages
31 of the alternative function, and how it differs from the normal func‐
32 tion, are discussed in the pcre2matching page.
33
35
36 A number of options that can be passed to pcre2_compile() can also be
37 set by special items at the start of a pattern. These are not Perl-com‐
38 patible, but are provided to make these options accessible to pattern
39 writers who are not able to change the program that processes the pat‐
40 tern. Any number of these items may appear, but they must all be
41 together right at the start of the pattern string, and the letters must
42 be in upper case.
43
44 UTF support
45
46 In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
47 as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32
48 can be specified for the 32-bit library, in which case it constrains
49 the character values to valid Unicode code points. To process UTF
50 strings, PCRE2 must be built to include Unicode support (which is the
51 default). When using UTF strings you must either call the compiling
52 function with the PCRE2_UTF option, or the pattern must start with the
53 special sequence (*UTF), which is equivalent to setting the relevant
54 option. How setting a UTF mode affects pattern matching is mentioned in
55 several places below. There is also a summary of features in the
56 pcre2unicode page.
57
58 Some applications that allow their users to supply patterns may wish to
59 restrict them to non-UTF data for security reasons. If the
60 PCRE2_NEVER_UTF option is passed to pcre2_compile(), (*UTF) is not
61 allowed, and its appearance in a pattern causes an error.
62
63 Unicode property support
64
65 Another special sequence that may appear at the start of a pattern is
66 (*UCP). This has the same effect as setting the PCRE2_UCP option: it
67 causes sequences such as \d and \w to use Unicode properties to deter‐
68 mine character types, instead of recognizing only characters with codes
69 less than 256 via a lookup table.
70
71 Some applications that allow their users to supply patterns may wish to
72 restrict them for security reasons. If the PCRE2_NEVER_UCP option is
73 passed to pcre2_compile(), (*UCP) is not allowed, and its appearance in
74 a pattern causes an error.
75
76 Locking out empty string matching
77
78 Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same
79 effect as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option
80 to whichever matching function is subsequently called to match the pat‐
81 tern. These options lock out the matching of empty strings, either
82 entirely, or only at the start of the subject.
83
84 Disabling auto-possessification
85
86 If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as
87 setting the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from making
88 quantifiers possessive when what follows cannot match the repeated
89 item. For example, by default a+b is treated as a++b. For more details,
90 see the pcre2api documentation.
91
92 Disabling start-up optimizations
93
94 If a pattern starts with (*NO_START_OPT), it has the same effect as
95 setting the PCRE2_NO_START_OPTIMIZE option. This disables several opti‐
96 mizations for quickly reaching "no match" results. For more details,
97 see the pcre2api documentation.
98
99 Disabling automatic anchoring
100
101 If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect
102 as setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimiza‐
103 tions that apply to patterns whose top-level branches all start with .*
104 (match any number of arbitrary characters). For more details, see the
105 pcre2api documentation.
106
107 Disabling JIT compilation
108
109 If a pattern that starts with (*NO_JIT) is successfully compiled, an
110 attempt by the application to apply the JIT optimization by calling
111 pcre2_jit_compile() is ignored.
112
113 Setting match resource limits
114
115 The pcre2_match() function contains a counter that is incremented every
116 time it goes round its main loop. The caller of pcre2_match() can set a
117 limit on this counter, which therefore limits the amount of computing
118 resource used for a match. The maximum depth of nested backtracking can
119 also be limited; this indirectly restricts the amount of heap memory
120 that is used, but there is also an explicit memory limit that can be
121 set.
122
123 These facilities are provided to catch runaway matches that are pro‐
124 voked by patterns with huge matching trees. A common example is a pat‐
125 tern with nested unlimited repeats applied to a long string that does
126 not match. When one of these limits is reached, pcre2_match() gives an
127 error return. The limits can also be set by items at the start of the
128 pattern of the form
129
130 (*LIMIT_HEAP=d)
131 (*LIMIT_MATCH=d)
132 (*LIMIT_DEPTH=d)
133
134 where d is any number of decimal digits. However, the value of the set‐
135 ting must be less than the value set (or defaulted) by the caller of
136 pcre2_match() for it to have any effect. In other words, the pattern
137 writer can lower the limits set by the programmer, but not raise them.
138 If there is more than one setting of one of these limits, the lower
139 value is used. The heap limit is specified in kibibytes (units of 1024
140 bytes).
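
The rule that a pattern can lower a limit but never raise it can be sketched as follows. This is a pure-Python model, not PCRE2 code: the function name effective_limits and the dictionary keys are hypothetical, and for brevity the sketch scans only consecutive limit items at the very start of the pattern.

```python
import re

# Hypothetical model of (*LIMIT_...=d) items; this is not the PCRE2 API.
_LIMIT_ITEM = re.compile(r"\(\*LIMIT_(HEAP|MATCH|DEPTH|RECURSION)=(\d+)\)")

def effective_limits(pattern, caller_limits):
    """Return the limits in force after reading leading limit items."""
    limits = dict(caller_limits)
    pos = 0
    while True:
        m = _LIMIT_ITEM.match(pattern, pos)
        if m is None:
            break
        name, value = m.group(1), int(m.group(2))
        if name == "RECURSION":
            name = "DEPTH"  # old name, still recognized
        # A pattern item can only lower a limit; the lowest setting wins.
        limits[name] = min(limits[name], value)
        pos = m.end()
    return limits

caller = {"HEAP": 20000, "MATCH": 1000, "DEPTH": 250}
print(effective_limits("(*LIMIT_MATCH=500)(*LIMIT_DEPTH=99999)abc", caller))
```

In this model the match limit drops to 500, while the attempt to raise the depth limit to 99999 has no effect.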
141
142 Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This
143 name is still recognized for backwards compatibility.
144
145 The heap limit applies only when the pcre2_match() or pcre2_dfa_match()
146 interpreters are used for matching. It does not apply to JIT. The match
147 limit is used (but in a different way) when JIT is being used, or when
148 pcre2_dfa_match() is called, to limit computing resource usage by those
149 matching functions. The depth limit is ignored by JIT but is relevant
150 for DFA matching, which uses function recursion for recursions within
151 the pattern and for lookaround assertions and atomic groups. In this
152 case, the depth limit controls the depth of such recursion.
153
154 Newline conventions
155
156 PCRE2 supports six different conventions for indicating line breaks in
157 strings: a single CR (carriage return) character, a single LF (line‐
158 feed) character, the two-character sequence CRLF, any of the three pre‐
159 ceding, any Unicode newline sequence, or the NUL character (binary
160 zero). The pcre2api page has further discussion about newlines, and
161 shows how to set the newline convention when calling pcre2_compile().
162
163 It is also possible to specify a newline convention by starting a pat‐
164 tern string with one of the following sequences:
165
166 (*CR) carriage return
167 (*LF) linefeed
168 (*CRLF) carriage return, followed by linefeed
169 (*ANYCRLF) any of the three above
170 (*ANY) all Unicode newline sequences
171 (*NUL) the NUL character (binary zero)
172
173 These override the default and the options given to the compiling func‐
174 tion. For example, on a Unix system where LF is the default newline
175 sequence, the pattern
176
177 (*CR)a.b
178
179 changes the convention to CR. That pattern matches "a\nb" because LF is
180 no longer a newline. If more than one of these settings is present, the
181 last one is used.
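
The "last one is used" rule, and the effect of (*CR) on the dot metacharacter, can be modelled in a few lines. This is an illustrative sketch only: split_leading_items and emulate_dot are hypothetical names, Python's re module stands in for the matcher, and the dot emulation is a toy that handles only the CR and LF conventions in a pattern with no escapes or classes.

```python
import re

# Hypothetical model of the items at the start of a pattern; not PCRE2 code.
NEWLINE_ITEMS = {"CR", "LF", "CRLF", "ANYCRLF", "ANY", "NUL"}
OTHER_ITEMS = {"UTF", "UCP", "NOTEMPTY", "NOTEMPTY_ATSTART",
               "NO_AUTO_POSSESS", "NO_START_OPT", "NO_DOTSTAR_ANCHOR",
               "NO_JIT", "BSR_ANYCRLF", "BSR_UNICODE", "LIMIT_HEAP",
               "LIMIT_MATCH", "LIMIT_DEPTH", "LIMIT_RECURSION"}
_ITEM = re.compile(r"\(\*([A-Z_]+)(?:=\d+)?\)")

def split_leading_items(pattern, default_newline="LF"):
    """Return (newline_convention, rest_of_pattern); the last setting wins."""
    newline = default_newline
    pos = 0
    while True:
        m = _ITEM.match(pattern, pos)
        if m is None:
            break
        name = m.group(1)
        if name in NEWLINE_ITEMS:
            newline = name
        elif name not in OTHER_ITEMS:
            break  # not a start-of-pattern option item
        pos = m.end()
    return newline, pattern[pos:]

def emulate_dot(rest, convention):
    """Toy rewrite of '.' as a negated class (CR or LF conventions only)."""
    excluded = {"CR": "\\r", "LF": "\\n"}[convention]
    return rest.replace(".", "[^" + excluded + "]")

convention, rest = split_leading_items("(*CR)a.b")
print(convention, bool(re.fullmatch(emulate_dot(rest, convention), "a\nb")))
```

Under the emulated CR convention the dot accepts "\n", so "a\nb" matches; under the LF convention the same subject is rejected.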
182
183 The newline convention affects where the circumflex and dollar asser‐
184 tions are true. It also affects the interpretation of the dot metachar‐
185 acter when PCRE2_DOTALL is not set, and the behaviour of \N when not
186 followed by an opening brace. However, it does not affect what the \R
187 escape sequence matches. By default, this is any Unicode newline
188 sequence, for Perl compatibility. However, this can be changed; see the
189 next section and the description of \R in the section entitled "Newline
190 sequences" below. A change of \R setting can be combined with a change
191 of newline convention.
192
193 Specifying what \R matches
194
195 It is possible to restrict \R to match only CR, LF, or CRLF (instead of
196 the complete set of Unicode line endings) by setting the option
197 PCRE2_BSR_ANYCRLF at compile time. This effect can also be achieved by
198 starting a pattern with (*BSR_ANYCRLF). For completeness, (*BSR_UNI‐
199 CODE) is also recognized, corresponding to PCRE2_BSR_UNICODE.
200
202
203 PCRE2 can be compiled to run in an environment that uses EBCDIC as its
204 character code instead of ASCII or Unicode (typically a mainframe sys‐
205 tem). In the sections below, character code values are ASCII or Uni‐
206 code; in an EBCDIC environment these characters may have different code
207 values, and there are no code points greater than 255.
208
210
211 A regular expression is a pattern that is matched against a subject
212 string from left to right. Most characters stand for themselves in a
213 pattern, and match the corresponding characters in the subject. As a
214 trivial example, the pattern
215
216 The quick brown fox
217
218 matches a portion of a subject string that is identical to itself. When
219 caseless matching is specified (the PCRE2_CASELESS option), letters are
220 matched independently of case.
221
222 The power of regular expressions comes from the ability to include wild
223 cards, character classes, alternatives, and repetitions in the pattern.
224 These are encoded in the pattern by the use of metacharacters, which do
225 not stand for themselves but instead are interpreted in some special
226 way.
227
228 There are two different sets of metacharacters: those that are recog‐
229 nized anywhere in the pattern except within square brackets, and those
230 that are recognized within square brackets. Outside square brackets,
231 the metacharacters are as follows:
232
233 \ general escape character with several uses
234 ^ assert start of string (or line, in multiline mode)
235 $ assert end of string (or line, in multiline mode)
236 . match any character except newline (by default)
237 [ start character class definition
238 | start of alternative branch
239 ( start group or control verb
240 ) end group or control verb
241 * 0 or more quantifier
242 + 1 or more quantifier; also "possessive quantifier"
243 ? 0 or 1 quantifier; also quantifier minimizer
244 { start min/max quantifier
245
246 Part of a pattern that is in square brackets is called a "character
247 class". In a character class the only metacharacters are:
248
249 \ general escape character
250 ^ negate the class, but only if the first character
251 - indicates character range
252 [ POSIX character class (if followed by POSIX syntax)
253 ] terminates the character class
254
255 The following sections describe the use of each of the metacharacters.
256
258
259 The backslash character has several uses. Firstly, if it is followed by
260 a character that is not a digit or a letter, it takes away any special
261 meaning that character may have. This use of backslash as an escape
262 character applies both inside and outside character classes.
263
264 For example, if you want to match a * character, you must write \* in
265 the pattern. This escaping action applies whether or not the following
266 character would otherwise be interpreted as a metacharacter, so it is
267 always safe to precede a non-alphanumeric with backslash to specify
268 that it stands for itself. In particular, if you want to match a back‐
269 slash, you write \\.
270
271 In a UTF mode, only ASCII digits and letters have any special meaning
272 after a backslash. All other characters (in particular, those whose
273 code points are greater than 127) are treated as literals.
274
275 If a pattern is compiled with the PCRE2_EXTENDED option, most white
276 space in the pattern (other than in a character class), and characters
277 between a # outside a character class and the next newline, inclusive,
278 are ignored. An escaping backslash can be used to include a white space
279 or # character as part of the pattern.
280
281 If you want to treat all characters in a sequence as literals, you can
282 do so by putting them between \Q and \E. This is different from Perl in
283 that $ and @ are handled as literals in \Q...\E sequences in PCRE2,
284 whereas in Perl, $ and @ cause variable interpolation. Also, Perl does
285 "double-quotish backslash interpolation" on any backslashes between \Q
286 and \E which, its documentation says, "may lead to confusing results".
287 PCRE2 treats a backslash between \Q and \E just like any other charac‐
288 ter. Note the following examples:
289
290 Pattern PCRE2 matches Perl matches
291
292 \Qabc$xyz\E abc$xyz abc followed by the
293 contents of $xyz
294 \Qabc\$xyz\E abc\$xyz abc\$xyz
295 \Qabc\E\$\Qxyz\E abc$xyz abc$xyz
296 \QA\B\E A\B A\B
297 \Q\\E \ \\E
298
299 The \Q...\E sequence is recognized both inside and outside character
300 classes. An isolated \E that is not preceded by \Q is ignored. If \Q
301 is not followed by \E later in the pattern, the literal interpretation
302 continues to the end of the pattern (that is, \E is assumed at the
303 end). If the isolated \Q is inside a character class, this causes an
304 error, because the character class is not terminated by a closing
305 square bracket.
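
The PCRE2 column of the table above can be reproduced with a small pre-processor. This is a sketch under stated assumptions: expand_quotes is a hypothetical helper, Python's re.escape stands in for literal treatment, and character classes are not handled.

```python
import re

def expand_quotes(pattern):
    """Rewrite \\Q...\\E runs as escaped literals (hypothetical helper)."""
    out = []
    i = 0
    n = len(pattern)
    while i < n:
        if pattern.startswith("\\Q", i):
            end = pattern.find("\\E", i + 2)
            if end == -1:
                end = n  # no \E: the literal run continues to the end
            out.append(re.escape(pattern[i + 2:end]))
            i = end + 2
        elif pattern.startswith("\\E", i):
            i += 2  # an isolated \E is ignored
        elif pattern[i] == "\\" and i + 1 < n:
            out.append(pattern[i:i + 2])  # keep any other escape pair intact
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)

for source in (r"\Qabc$xyz\E", r"\Qabc\E\$\Qxyz\E", r"\QA\B\E"):
    print(source, "->", expand_quotes(source))
```

Each rewritten pattern can then be given to re.fullmatch() to confirm that it matches the literal text shown in the "PCRE2 matches" column.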
306
307 Non-printing characters
308
309 A second use of backslash provides a way of encoding non-printing char‐
310 acters in patterns in a visible manner. There is no restriction on the
311 appearance of non-printing characters in a pattern, but when a pattern
312 is being prepared by text editing, it is often easier to use one of the
313 following escape sequences instead of the binary character it repre‐
314 sents. In an ASCII or Unicode environment, these escapes are as fol‐
315 lows:
316
317 \a alarm, that is, the BEL character (hex 07)
318 \cx "control-x", where x is any printable ASCII character
319 \e escape (hex 1B)
320 \f form feed (hex 0C)
321 \n linefeed (hex 0A)
322 \r carriage return (hex 0D) (but see below)
323 \t tab (hex 09)
324 \0dd character with octal code 0dd
325 \ddd character with octal code ddd, or backreference
326 \o{ddd..} character with octal code ddd..
327 \xhh character with hex code hh
328 \x{hhh..} character with hex code hhh..
329 \N{U+hhh..} character with Unicode hex code point hhh..
330
331 By default, after \x that is not followed by {, from zero to two hexa‐
332 decimal digits are read (letters can be in upper or lower case). Any
333 number of hexadecimal digits may appear between \x{ and }. If a charac‐
334 ter other than a hexadecimal digit appears between \x{ and }, or if
335 there is no terminating }, an error occurs.
336
337 Characters whose code points are less than 256 can be defined by either
338 of the two syntaxes for \x or by an octal sequence. There is no differ‐
339 ence in the way they are handled. For example, \xdc is exactly the same
340 as \x{dc} or \334. However, using the braced versions does make such
341 sequences easier to read.
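
The default reading rules for \x can be sketched as a small parser. The function name parse_x_escape is hypothetical, and rejecting an empty \x{} is an assumption of this sketch rather than something stated above.

```python
import string

HEX = set(string.hexdigits)

def parse_x_escape(text):
    """Parse what follows "\\x" in a pattern; return (code_point, rest)."""
    if text.startswith("{"):
        end = text.find("}")
        digits = text[1:end] if end != -1 else ""
        # Assumption: an empty \x{} is treated as malformed as well.
        if end == -1 or not digits or any(c not in HEX for c in digits):
            raise ValueError("malformed \\x{...}")
        return int(digits, 16), text[end + 1:]
    count = 0
    while count < 2 and count < len(text) and text[count] in HEX:
        count += 1  # from zero to two hexadecimal digits are read
    return (int(text[:count], 16) if count else 0), text[count:]

# \xdc, \x{dc} and \334 all denote the same code point.
print(parse_x_escape("dc")[0], parse_x_escape("{dc}")[0], int("334", 8))
```

All three values printed are 220, which is hex DC and octal 334.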
342
343 Support is available for some ECMAScript (aka JavaScript) escape
344 sequences via two compile-time options. If PCRE2_ALT_BSUX is set, the
345 sequence \x followed by { is not recognized. Only if \x is followed by
346 two hexadecimal digits is it recognized as a character escape.
347 Otherwise it is interpreted as a literal "x" character. In this mode,
348 support for code points greater than 255 is provided by \u, which must be
349 followed by four hexadecimal digits; otherwise it is interpreted as a
350 literal "u" character.
351
352 PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in
353 addition, \u{hhh..} is recognized as the character specified by hexa‐
354 decimal code point. There may be any number of hexadecimal digits.
355 This syntax is from ECMAScript 6.
356
357 The \N{U+hhh..} escape sequence is recognized only when the PCRE2_UTF
358 option is set, that is, when PCRE2 is operating in a Unicode mode. Perl
359 also uses \N{name} to specify characters by Unicode name; PCRE2 does
360 not support this. Note that when \N is not followed by an opening
361 brace (curly bracket) it has an entirely different meaning, matching
362 any character that is not a newline.
363
364 There are some legacy applications where the escape sequence \r is
365 expected to match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option
366 is set, \r in a pattern is converted to \n so that it matches a LF
367 (linefeed) instead of a CR (carriage return) character.
368
369 The precise effect of \cx on ASCII characters is as follows: if x is a
370 lower case letter, it is converted to upper case. Then bit 6 of the
371 character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
372 (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
373 hex 7B (; is 3B). If the code unit following \c has a value less than
374 32 or greater than 126, a compile-time error occurs.
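
The arithmetic for \cx in an ASCII environment can be written out directly. The function name control_escape is hypothetical; the sketch only mirrors the steps described above.

```python
def control_escape(x):
    """Code generated by \\cx in an ASCII environment (hypothetical helper)."""
    code = ord(x)
    if code < 32 or code > 126:
        raise ValueError("character after \\c must be printable ASCII")
    if "a" <= x <= "z":
        code = ord(x.upper())  # lower case letters are converted first
    return code ^ 0x40         # invert bit 6 (hex 40)

for ch in "AZ{;":
    print(ch, hex(control_escape(ch)))
```

The loop reproduces the four values quoted in the text: hex 01, 1A, 3B, and 7B.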
375
376 When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported.
377 \a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values.
378 The \c escape is processed as specified for Perl in the perlebcdic doc‐
379 ument. The only characters that are allowed after \c are A-Z, a-z, or
380 one of @, [, \, ], ^, _, or ?. Any other character provokes a compile-
381 time error. The sequence \c@ encodes character code 0; after \c the
382 letters (in either case) encode characters 1-26 (hex 01 to hex 1A); [,
383 \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c?
384 becomes either 255 (hex FF) or 95 (hex 5F).
385
386 Thus, apart from \c?, these escapes generate the same character code
387 values as they do in an ASCII environment, though the meanings of the
388 values mostly differ. For example, \cG always generates code value 7,
389 which is BEL in ASCII but DEL in EBCDIC.
390
391 The sequence \c? generates DEL (127, hex 7F) in an ASCII environment,
392 but because 127 is not a control character in EBCDIC, Perl makes it
393 generate the APC character. Unfortunately, there are several variants
394 of EBCDIC. In most of them the APC character has the value 255 (hex
395 FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If
396 certain other characters have POSIX-BC values, PCRE2 makes \c? generate
397 95; otherwise it generates 255.
398
399 After \0 up to two further octal digits are read. If there are fewer
400 than two digits, just those that are present are used. Thus the
401 sequence \0\x\015 specifies two binary zeros followed by a CR character
402 (code value 13). Make sure you supply two digits after the initial zero
403 if the pattern character that follows is itself an octal digit.
404
405 The escape \o must be followed by a sequence of octal digits, enclosed
406 in braces. An error occurs if this is not the case. This escape is a
407 recent addition to Perl; it provides a way of specifying character code
408 points as octal numbers greater than 0777, and it also allows octal
409 numbers and backreferences to be unambiguously specified.
410
411 For greater clarity and unambiguity, it is best to avoid following \ by
412 a digit greater than zero. Instead, use \o{} or \x{} to specify numeri‐
413 cal character code points, and \g{} to specify backreferences. The fol‐
414 lowing paragraphs describe the old, ambiguous syntax.
415
416 The handling of a backslash followed by a digit other than 0 is compli‐
417 cated, and Perl has changed over time, causing PCRE2 also to change.
418
419 Outside a character class, PCRE2 reads the digit and any following dig‐
420 its as a decimal number. If the number is less than 10, begins with the
421 digit 8 or 9, or if there are at least that many previous capture
422 groups in the expression, the entire sequence is taken as a backrefer‐
423 ence. A description of how this works is given later, following the
424 discussion of parenthesized groups. Otherwise, up to three octal dig‐
425 its are read to form a character code.
426
427 Inside a character class, PCRE2 handles \8 and \9 as the literal char‐
428 acters "8" and "9", and otherwise reads up to three octal digits fol‐
429 lowing the backslash, using them to generate a data character. Any sub‐
430 sequent digits stand for themselves. For example, outside a character
431 class:
432
433 \040 is another way of writing an ASCII space
434 \40 is the same, provided there are fewer than 40
435 previous capture groups
436 \7 is always a backreference
437 \11 might be a backreference, or another way of
438 writing a tab
439 \011 is always a tab
440 \0113 is a tab followed by the character "3"
441 \113 might be a backreference, otherwise the
442 character with octal code 113
443 \377 might be a backreference, otherwise
444 the value 255 (decimal)
445 \81 is always a backreference
446
447 Note that octal values of 100 or greater that are specified using this
448 syntax must not be introduced by a leading zero, because no more than
449 three octal digits are ever read.
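
The decision procedure for a backslash followed by digits, outside a character class, can be sketched as below. The function name classify_digit_escape is hypothetical, the argument is assumed to be the complete run of decimal digits after the backslash, and the limits on character values in each mode are not checked.

```python
OCTAL = "01234567"

def classify_digit_escape(digits, group_count):
    """Classify a backslash followed by `digits`, outside a character class.

    Returns ("backref", number, "") or ("char", code, trailing_digits).
    """
    if digits[0] == "0":
        # After the initial 0, up to two further octal digits are read.
        used = 1
        while used < 3 and used < len(digits) and digits[used] in OCTAL:
            used += 1
        return "char", int(digits[:used], 8), digits[used:]
    number = int(digits)
    if number < 10 or digits[0] in "89" or number <= group_count:
        return "backref", number, ""
    # Otherwise up to three octal digits form a character code.
    used = 0
    while used < 3 and used < len(digits) and digits[used] in OCTAL:
        used += 1
    return "char", int(digits[:used], 8), digits[used:]

for digits in ("040", "40", "7", "11", "0113", "81"):
    print("\\" + digits, classify_digit_escape(digits, group_count=0))
```

With no previous capture groups, \40 and \11 are read as octal character codes; calling the function with group_count=40 or group_count=11 turns them into backreferences, which is the ambiguity the table describes.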
450
451 Constraints on character values
452
453 Characters that are specified using octal or hexadecimal numbers are
454 limited to certain values, as follows:
455
456 8-bit non-UTF mode no greater than 0xff
457 16-bit non-UTF mode no greater than 0xffff
458 32-bit non-UTF mode no greater than 0xffffffff
459 All UTF modes no greater than 0x10ffff and a valid code point
460
461 Invalid Unicode code points are all those in the range 0xd800 to 0xdfff
462 (the so-called "surrogate" code points). The check for these can be
463 disabled by the caller of pcre2_compile() by setting the option
464 PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only in
465 UTF-8 and UTF-32 modes, because these values are not representable in
466 UTF-16.
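
The limits in the table, together with the surrogate rule, can be expressed as a single check. This is a sketch with hypothetical names (check_code_point and its parameters); width is the code unit width of the library (8, 16, or 32), and allow_surrogates models the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option.

```python
def check_code_point(value, width, utf, allow_surrogates=False):
    """Return True if `value` is acceptable for the given width and mode."""
    if utf:
        if value > 0x10FFFF:
            return False
        if 0xD800 <= value <= 0xDFFF:
            # Surrogates can be allowed only in UTF-8 and UTF-32 modes.
            return allow_surrogates and width in (8, 32)
        return True
    return value <= {8: 0xFF, 16: 0xFFFF, 32: 0xFFFFFFFF}[width]

print(check_code_point(0x100, 8, utf=False))   # above 0xff in 8-bit non-UTF mode
print(check_code_point(0xD800, 16, utf=True, allow_surrogates=True))
```

Both calls report False: the first value exceeds the 8-bit non-UTF limit, and the second is a surrogate in UTF-16 mode, where the check cannot be disabled.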
467
468 Escape sequences in character classes
469
470 All the sequences that define a single character value can be used both
471 inside and outside character classes. In addition, inside a character
472 class, \b is interpreted as the backspace character (hex 08).
473
474 When not followed by an opening brace, \N is not allowed in a character
475 class. \B, \R, and \X are not special inside a character class. Like
476 other unrecognized alphabetic escape sequences, they cause an error.
477 Outside a character class, these sequences have different meanings.
478
479 Unsupported escape sequences
480
481 In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its
482 string handler and used to modify the case of following characters. By
483 default, PCRE2 does not support these escape sequences in patterns.
484 However, if either of the PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX
485 options is set, \U matches a "U" character, and \u can be used to
486 define a character by code point, as described above.
487
488 Absolute and relative backreferences
489
490 The sequence \g followed by a signed or unsigned number, optionally
491 enclosed in braces, is an absolute or relative backreference. A named
492 backreference can be coded as \g{name}. Backreferences are discussed
493 later, following the discussion of parenthesized groups.
494
495 Absolute and relative subroutine calls
496
497 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
498 name or a number enclosed either in angle brackets or single quotes, is
499 an alternative syntax for referencing a capture group as a subroutine.
500 Details are discussed later. Note that \g{...} (Perl syntax) and
501 \g<...> (Oniguruma syntax) are not synonymous. The former is a backref‐
502 erence; the latter is a subroutine call.
503
504 Generic character types
505
506 Another use of backslash is for specifying generic character types:
507
508 \d any decimal digit
509 \D any character that is not a decimal digit
510 \h any horizontal white space character
511 \H any character that is not a horizontal white space character
512 \N any character that is not a newline
513 \s any white space character
514 \S any character that is not a white space character
515 \v any vertical white space character
516 \V any character that is not a vertical white space character
517 \w any "word" character
518 \W any "non-word" character
519
520 The \N escape sequence has the same meaning as the "." metacharacter
521 when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change
522 the meaning of \N. Note that when \N is followed by an opening brace it
523 has a different meaning. See the section entitled "Non-printing charac‐
524 ters" above for details. Perl also uses \N{name} to specify characters
525 by Unicode name; PCRE2 does not support this.
526
527 Each pair of lower and upper case escape sequences partitions the com‐
528 plete set of characters into two disjoint sets. Any given character
529 matches one, and only one, of each pair. The sequences can appear both
530 inside and outside character classes. They each match one character of
531 the appropriate type. If the current matching point is at the end of
532 the subject string, all of them fail, because there is no character to
533 match.
534
535 The default \s characters are HT (9), LF (10), VT (11), FF (12), CR
536 (13), and space (32), which are defined as white space in the "C"
537 locale. This list may vary if locale-specific matching is taking place.
538 For example, in some locales the "non-breaking space" character (\xA0)
539 is recognized as white space, and in others the VT character is not.
540
541 A "word" character is an underscore or any character that is a letter
542 or digit. By default, the definition of letters and digits is con‐
543 trolled by PCRE2's low-valued character tables, and may vary if locale-
544 specific matching is taking place (see "Locale support" in the pcre2api
545 page). For example, in a French locale such as "fr_FR" in Unix-like
546 systems, or "french" in Windows, some character codes greater than 127
547 are used for accented letters, and these are then matched by \w. The
548 use of locales with Unicode is discouraged.
549
550 By default, characters whose code points are greater than 127 never
551 match \d, \s, or \w, and always match \D, \S, and \W, although this may
552 be different for characters in the range 128-255 when locale-specific
553 matching is happening. These escape sequences retain their original
554 meanings from before Unicode support was available, mainly for effi‐
555 ciency reasons. If the PCRE2_UCP option is set, the behaviour is
556 changed so that Unicode properties are used to determine character
557 types, as follows:
558
559 \d any character that matches \p{Nd} (decimal digit)
560 \s any character that matches \p{Z} or \h or \v
561 \w any character that matches \p{L} or \p{N}, plus underscore
562
563 The upper case escapes match the inverse sets of characters. Note that
564 \d matches only decimal digits, whereas \w matches any Unicode digit,
565 as well as any Unicode letter, and underscore. Note also that PCRE2_UCP
566 affects \b and \B, because they are defined in terms of \w and \W.
567 Matching these sequences is noticeably slower when PCRE2_UCP is set.
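
The difference between \d and \w under PCRE2_UCP can be illustrated with Python's unicodedata module. This is a model only: the function names are hypothetical, the default case assumes no locale-specific tables, and the Unicode version behind unicodedata may differ from the tables built into a given PCRE2 release.

```python
import unicodedata

def is_digit_ucp(ch):
    """\\d with PCRE2_UCP as described above: \\p{Nd}."""
    return unicodedata.category(ch) == "Nd"

def is_word_ucp(ch):
    """\\w with PCRE2_UCP as described above: \\p{L} or \\p{N}, plus "_"."""
    return ch == "_" or unicodedata.category(ch)[0] in "LN"

def is_word_default(ch):
    """Default \\w, assuming no locale tables: ASCII letters, digits, "_"."""
    return ch.isascii() and (ch.isalnum() or ch == "_")

# U+00B2 SUPERSCRIPT TWO is a digit of category No, not Nd.
print(is_digit_ucp("\u00b2"), is_word_ucp("\u00b2"))
# U+00E9 is a letter, but it lies outside the ASCII range.
print(is_word_default("\u00e9"), is_word_ucp("\u00e9"))
```

Superscript two is accepted by the \w model but not by the \d model, and an accented letter counts as a word character only in the UCP model.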
568
569 The sequences \h, \H, \v, and \V, in contrast to the other sequences,
570 which match only ASCII characters by default, always match a specific
571 list of code points, whether or not PCRE2_UCP is set. The horizontal
572 space characters are:
573
574 U+0009 Horizontal tab (HT)
575 U+0020 Space
576 U+00A0 Non-break space
577 U+1680 Ogham space mark
578 U+180E Mongolian vowel separator
579 U+2000 En quad
580 U+2001 Em quad
581 U+2002 En space
582 U+2003 Em space
583 U+2004 Three-per-em space
584 U+2005 Four-per-em space
585 U+2006 Six-per-em space
586 U+2007 Figure space
587 U+2008 Punctuation space
588 U+2009 Thin space
589 U+200A Hair space
590 U+202F Narrow no-break space
591 U+205F Medium mathematical space
592 U+3000 Ideographic space
593
594 The vertical space characters are:
595
596 U+000A Linefeed (LF)
597 U+000B Vertical tab (VT)
598 U+000C Form feed (FF)
599 U+000D Carriage return (CR)
600 U+0085 Next line (NEL)
601 U+2028 Line separator
602 U+2029 Paragraph separator
603
604 In 8-bit, non-UTF-8 mode, only the characters with code points less
605 than 256 are relevant.
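
The two fixed lists can be written as sets of code points, which makes it easy to see which entries remain relevant in 8-bit non-UTF mode. The names HSPACE, VSPACE, and relevant_in_8bit are hypothetical; the values are taken from the lists above.

```python
# Code points matched by \h (horizontal space), from the list above.
HSPACE = {0x0009, 0x0020, 0x00A0, 0x1680, 0x180E,
          *range(0x2000, 0x200B), 0x202F, 0x205F, 0x3000}

# Code points matched by \v (vertical space), from the list above.
VSPACE = {*range(0x000A, 0x000E), 0x0085, 0x2028, 0x2029}

def relevant_in_8bit(points):
    """Code points that can occur in an 8-bit non-UTF subject."""
    return sorted(p for p in points if p < 256)

print([hex(p) for p in relevant_in_8bit(HSPACE)])
print([hex(p) for p in relevant_in_8bit(VSPACE)])
```

In 8-bit non-UTF mode only three horizontal and five vertical space characters remain.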
606
607 Newline sequences
608
609 Outside a character class, by default, the escape sequence \R matches
610 any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent
611 to the following:
612
613 (?>\r\n|\n|\x0b|\f|\r|\x85)
614
615 This is an example of an "atomic group", details of which are given
616 below. This particular group matches either the two-character sequence
617 CR followed by LF, or one of the single characters LF (linefeed,
618 U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car‐
619 riage return, U+000D), or NEL (next line, U+0085). Because this is an
620 atomic group, the two-character sequence is treated as a single unit
621 that cannot be split.
622
623 In other modes, two additional characters whose code points are greater
624 than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa‐
625 rator, U+2029). Unicode support is not needed for these characters to
626 be recognized.
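
The default meaning of \R can be approximated in Python, whose re module has no \R escape. This is an emulation under stated assumptions: the name BSR_UNICODE is hypothetical, and a plain non-capturing group is used in place of an atomic group, so the emulation could backtrack from CRLF to a lone CR where PCRE2 would not.

```python
import re

# Emulation of the default \R for Python str subjects; not PCRE2 itself.
# CRLF is listed first so that it is preferred over a lone CR.
BSR_UNICODE = re.compile("(?:\r\n|\n|\x0b|\f|\r|\x85|\u2028|\u2029)")

text = "one\r\ntwo\nthree\x85four\u2028five"
print(BSR_UNICODE.split(text))
print(BSR_UNICODE.findall("a\r\nb\rc"))
```

The CRLF pair is reported as one newline sequence rather than two, and the lone CR as another.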
627
628 It is possible to restrict \R to match only CR, LF, or CRLF (instead of
629 the complete set of Unicode line endings) by setting the option
630 PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation for
631 "backslash R".) This can be made the default when PCRE2 is built; if this is
632 the case, the other behaviour can be requested via the PCRE2_BSR_UNI‐
633 CODE option. It is also possible to specify these settings by starting
634 a pattern string with one of the following sequences:
635
636 (*BSR_ANYCRLF) CR, LF, or CRLF only
637 (*BSR_UNICODE) any Unicode newline sequence
638
639 These override the default and the options given to the compiling func‐
640 tion. Note that these special settings, which are not Perl-compatible,
641 are recognized only at the very start of a pattern, and that they must
642 be in upper case. If more than one of them is present, the last one is
643 used. They can be combined with a change of newline convention; for
644 example, a pattern can start with:
645
646 (*ANY)(*BSR_ANYCRLF)
647
648 They can also be combined with the (*UTF) or (*UCP) special sequences.
649 Inside a character class, \R is treated as an unrecognized escape
650 sequence, and causes an error.
651
652 Unicode character properties
653
654 When PCRE2 is built with Unicode support (the default), three addi‐
655 tional escape sequences that match characters with specific properties
656 are available. They can be used in any mode, though in 8-bit and 16-bit
657 non-UTF modes these sequences are of course limited to testing charac‐
658 ters whose code points are less than U+0100 and U+10000, respectively.
659 In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode
660 limit) may be encountered. These are all treated as being in the
661 Unknown script and with an unassigned type. The extra escape sequences
662 are:
663
664 \p{xx} a character with the xx property
665 \P{xx} a character without the xx property
666 \X a Unicode extended grapheme cluster
667
668 The property names represented by xx above are case-sensitive. There is
669 support for Unicode script names, Unicode general category properties,
670 "Any", which matches any character (including newline), and some spe‐
671 cial PCRE2 properties (described in the next section). Other Perl
672 properties such as "InMusicalSymbols" are not supported by PCRE2. Note
673 that \P{Any} does not match any characters, so always causes a match
674 failure.
675
676 Sets of Unicode characters are defined as belonging to certain scripts.
677 A character from one of these sets can be matched using a script name.
678 For example:
679
680 \p{Greek}
681 \P{Han}
682
683 Unassigned characters (and in non-UTF 32-bit mode, characters with code
684 points greater than 0x10FFFF) are assigned the "Unknown" script. Others
685 that are not part of an identified script are lumped together as "Com‐
686 mon". The current list of scripts is:
687
688 Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Bali‐
689 nese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi,
690 Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Caucasian_Alba‐
691 nian, Chakma, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot,
692 Cyrillic, Deseret, Devanagari, Dogra, Duployan, Egyptian_Hieroglyphs,
693 Elbasan, Ethiopic, Georgian, Glagolitic, Gothic, Grantha, Greek,
694 Gujarati, Gunjala_Gondi, Gurmukhi, Han, Hangul, Hanifi_Rohingya,
695 Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited,
696 Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kan‐
697 nada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao,
698 Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Maha‐
699 jani, Makasar, Malayalam, Mandaic, Manichaean, Marchen, Masaram_Gondi,
700 Medefaidrin, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive,
701 Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Multani, Myanmar,
702 Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham, Ol_Chiki, Old_Hungar‐
703 ian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, Old_Sog‐
704 dian, Old_South_Arabian, Old_Turkic, Oriya, Osage, Osmanya,
705 Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician,
706 Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Sha‐
707 vian, Siddham, SignWriting, Sinhala, Sogdian, Sora_Sompeng, Soyombo,
708 Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham,
709 Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana, Thai, Tibetan, Tifi‐
710 nagh, Tirhuta, Ugaritic, Unknown, Vai, Warang_Citi, Yi, Zan‐
711 abazar_Square.
712
713 Each character has exactly one Unicode general category property, spec‐
714 ified by a two-letter abbreviation. For compatibility with Perl, nega‐
715 tion can be specified by including a circumflex between the opening
716 brace and the property name. For example, \p{^Lu} is the same as
717 \P{Lu}.
718
719 If only one letter is specified with \p or \P, it includes all the gen‐
720 eral category properties that start with that letter. In this case, in
721 the absence of negation, the curly brackets in the escape sequence are
722 optional; these two examples have the same effect:
723
724 \p{L}
725 \pL
726
727 The following general category property codes are supported:
728
729 C Other
730 Cc Control
731 Cf Format
732 Cn Unassigned
733 Co Private use
734 Cs Surrogate
735
736 L Letter
737 Ll Lower case letter
738 Lm Modifier letter
739 Lo Other letter
740 Lt Title case letter
741 Lu Upper case letter
742
743 M Mark
744 Mc Spacing mark
745 Me Enclosing mark
746 Mn Non-spacing mark
747
748 N Number
749 Nd Decimal number
750 Nl Letter number
751 No Other number
752
753 P Punctuation
754 Pc Connector punctuation
755 Pd Dash punctuation
756 Pe Close punctuation
757 Pf Final punctuation
758 Pi Initial punctuation
759 Po Other punctuation
760 Ps Open punctuation
761
762 S Symbol
763 Sc Currency symbol
764 Sk Modifier symbol
765 Sm Mathematical symbol
766 So Other symbol
767
768 Z Separator
769 Zl Line separator
770 Zp Paragraph separator
771 Zs Space separator
772
773 The special property L& is also supported: it matches a character that
774 has the Lu, Ll, or Lt property, in other words, a letter that is not
775 classified as a modifier or "other".
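The way one-letter codes, two-letter codes, and L& relate to each other can be sketched in Python using unicodedata.category(), which returns the two-letter general category of a character. This is only a model: the function name is hypothetical, and Python's Unicode tables may be a different version from those built into PCRE2.

```python
import unicodedata

def has_general_category(ch, prop):
    # Rough model of a general category test; "prop" is a code from the
    # table above, or the special value "L&".
    category = unicodedata.category(ch)  # two-letter code such as "Lu"
    if prop == "L&":
        return category in ("Lu", "Ll", "Lt")
    if len(prop) == 1:
        # A single letter covers every category that starts with it.
        return category.startswith(prop)
    return category == prop

upper_is_lu = has_general_category("A", "Lu")
lower_is_l = has_general_category("a", "L")
digit_is_l = has_general_category("1", "L")
titlecase_is_cased = has_general_category("\u01c5", "L&")  # Lt character
modifier_is_cased = has_general_category("\u02b0", "L&")   # Lm character
```

The negated form (\P or a leading circumflex) corresponds to applying "not" to the result.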
776
777 The Cs (Surrogate) property applies only to characters whose code
778 points are in the range U+D800 to U+DFFF. These characters are no dif‐
779 ferent to any other character when PCRE2 is not in UTF mode (using the
780 16-bit or 32-bit library). However, they are not valid in Unicode
781 strings and so cannot be tested by PCRE2 in UTF mode, unless UTF valid‐
782 ity checking has been turned off (see the discussion of
783 PCRE2_NO_UTF_CHECK in the pcre2api page).
784
785 The long synonyms for property names that Perl supports (such as
786 \p{Letter}) are not supported by PCRE2, nor is it permitted to prefix
787 any of these properties with "Is".
788
789 No character that is in the Unicode table has the Cn (unassigned) prop‐
790 erty. Instead, this property is assumed for any code point that is not
791 in the Unicode table.
792
793 Specifying caseless matching does not affect these escape sequences.
794 For example, \p{Lu} always matches only upper case letters. This is
795 different from the behaviour of current versions of Perl.
796
797 Matching characters by Unicode property is not fast, because PCRE2 has
798 to do a multistage table lookup in order to find a character's prop‐
799 erty. That is why the traditional escape sequences such as \d and \w do
800 not use Unicode properties in PCRE2 by default, though you can make
801 them do so by setting the PCRE2_UCP option or by starting the pattern
802 with (*UCP).
803
804 Extended grapheme clusters
805
806 The \X escape matches any number of Unicode characters that form an
807 "extended grapheme cluster", and treats the sequence as an atomic group
808 (see below). Unicode supports various kinds of composite character by
809 giving each character a grapheme breaking property, and having rules
810 that use these properties to define the boundaries of extended grapheme
811 clusters. The rules are defined in Unicode Standard Annex 29, "Unicode
812 Text Segmentation". Unicode 11.0.0 abandoned the use of some previous
813 properties that had been used for emojis. Instead it introduced vari‐
814 ous emoji-specific properties. PCRE2 uses only the Extended Picto‐
815 graphic property.
816
817 \X always matches at least one character. Then it decides whether to
818 add additional characters according to the following rules for ending a
819 cluster:
820
821 1. End at the end of the subject string.
822
823 2. Do not end between CR and LF; otherwise end after any control char‐
824 acter.
825
826 3. Do not break Hangul (a Korean script) syllable sequences. Hangul
827 characters are of five types: L, V, T, LV, and LVT. An L character may
828 be followed by an L, V, LV, or LVT character; an LV or V character may
829 be followed by a V or T character; an LVT or T character may be followed
830 only by a T character.
831
832 4. Do not end before extending characters or spacing marks or the
833 "zero-width joiner" character. Characters with the "mark" property
834 always have the "extend" grapheme breaking property.
835
836 5. Do not end after prepend characters.
837
838 6. Do not break within emoji modifier sequences or emoji zwj sequences.
839 That is, do not break between characters with the Extended_Pictographic
840 property. Extend and ZWJ characters are allowed between the charac‐
841 ters.
842
843 7. Do not break within emoji flag sequences. That is, do not break
844 between regional indicator (RI) characters if there are an odd number
845 of RI characters before the break point.
846
847 8. Otherwise, end the cluster.
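The following Python sketch shows the general shape of such a segmentation loop. It is deliberately incomplete: it handles only rules 1, 2, 8, and an approximation of rule 4 (treating every character with an M general category, plus the zero-width joiner, as extending). The Hangul, prepend, emoji, and regional indicator rules are omitted, and the function names are hypothetical.

```python
import unicodedata

ZWJ = "\u200d"  # zero-width joiner

def is_extender(ch):
    # Approximation of rule 4: marks (general category M*) and ZWJ.
    return ch == ZWJ or unicodedata.category(ch).startswith("M")

def simple_clusters(subject):
    clusters = []
    i = 0
    n = len(subject)
    while i < n:
        start = i
        ch = subject[i]
        i += 1  # a cluster always contains at least one character
        if ch == "\r" and i < n and subject[i] == "\n":
            i += 1  # rule 2: never end between CR and LF
        elif unicodedata.category(ch) != "Cc":
            # Rule 2 ends the cluster after a control character; otherwise
            # rule 4 applies: do not end before extending characters.
            while i < n and is_extender(subject[i]):
                i += 1
        clusters.append(subject[start:i])  # rules 1 and 8
    return clusters

clusters = simple_clusters("e\u0301a\r\nb")
```

Here "e" followed by a combining acute accent stays together as one cluster, as does the CRLF pair.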
848
849 PCRE2's additional properties
850
851 As well as the standard Unicode properties described above, PCRE2 sup‐
852 ports four more that make it possible to convert traditional escape
853 sequences such as \w and \s to use Unicode properties. PCRE2 uses these
854 non-standard, non-Perl properties internally when PCRE2_UCP is set.
855 However, they may also be used explicitly. These properties are:
856
857 Xan Any alphanumeric character
858 Xps Any POSIX space character
859 Xsp Any Perl space character
860 Xwd Any Perl "word" character
861
862 Xan matches characters that have either the L (letter) or the N (num‐
863 ber) property. Xps matches the characters tab, linefeed, vertical tab,
864 form feed, or carriage return, and any other character that has the Z
865 (separator) property. Xsp is the same as Xps; in PCRE1 it used to
866 exclude vertical tab, for Perl compatibility, but Perl changed. Xwd
867 matches the same characters as Xan, plus underscore.
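The definitions above translate almost directly into predicates. The Python sketch below models them with unicodedata.category(); the function names are hypothetical, and Python's Unicode tables may differ in version from PCRE2's.

```python
import unicodedata

def is_xan(ch):
    # Letter (L*) or number (N*) general category.
    return unicodedata.category(ch)[0] in "LN"

def is_xps(ch):
    # Tab, linefeed, vertical tab, form feed, carriage return,
    # or any character with a separator (Z*) general category.
    return ch in "\t\n\x0b\f\r" or unicodedata.category(ch).startswith("Z")

# Xsp is described as the same as Xps.
is_xsp = is_xps

def is_xwd(ch):
    # Same as Xan, plus underscore.
    return ch == "_" or is_xan(ch)
```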
868
869 There is another non-standard property, Xuc, which matches any charac‐
870 ter that can be represented by a Universal Character Name in C++ and
871 other programming languages. These are the characters $, @, ` (grave
872 accent), and all characters with Unicode code points greater than or
873 equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that
874 most base (ASCII) characters are excluded. (Universal Character Names
875 are of the form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.
876 Note that the Xuc property does not match these sequences but the char‐
877 acters that they represent.)
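As a minimal sketch of the Xuc definition given above (the function name is hypothetical), the test can be written in Python as:

```python
def is_xuc(ch):
    # The three ASCII characters that are explicitly included.
    if ch in "$@`":
        return True
    code = ord(ch)
    # Everything from U+00A0 upwards, except the surrogate range.
    return code >= 0xA0 and not (0xD800 <= code <= 0xDFFF)
```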
878
879 Resetting the match start
880
881 In normal use, the escape sequence \K causes any previously matched
882 characters not to be included in the final matched sequence that is
883 returned. For example, the pattern:
884
885 foo\Kbar
886
887 matches "foobar", but reports that it has matched "bar". \K does not
888 interact with anchoring in any way. The pattern:
889
890 ^foo\Kbar
891
892 matches only when the subject begins with "foobar" (in single line
893 mode), though it again reports the matched string as "bar". This fea‐
894 ture is similar to a lookbehind assertion (described below). However,
895 in this case, the part of the subject before the real match does not
896 have to be of fixed length, as lookbehind assertions do. The use of \K
897 does not interfere with the setting of captured substrings. For exam‐
898 ple, when the pattern
899
900 (foo)\Kbar
901
902 matches "foobar", the first substring is still set to "foo".
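Python's standard re module has no \K, so the sketch below does not use it; it only emulates the reported result in two ways. A fixed-length prefix can be expressed as a lookbehind, which reports the same offsets that foo\Kbar is described as reporting. A variable-length prefix, which \K permits but a Python lookbehind does not, is emulated here by reading the span of a capture group instead of the whole match.

```python
import re

# Fixed-length prefix: the lookbehind leaves "foo" out of the reported match.
m1 = re.search(r"(?<=foo)bar", "foobar")
reported_1 = (m1.start(), m1.end())

# Variable-length prefix: match it normally, then report only group 1.
m2 = re.search(r"fo+(bar)", "fooobar")
reported_2 = m2.span(1)
```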
903
904 Perl documents that the use of \K within assertions is "not well
905 defined". In PCRE2, \K is acted upon when it occurs inside positive
906 assertions, but is ignored in negative assertions. Note that when a
907 pattern such as (?=ab\K) matches, the reported start of the match can
908 be greater than the end of the match. Using \K in a lookbehind asser‐
909 tion at the start of a pattern can also lead to odd effects. For exam‐
910 ple, consider this pattern:
911
912 (?<=\Kfoo)bar
913
914 If the subject is "foobar", a call to pcre2_match() with a starting
915 offset of 3 succeeds and reports the matching string as "foobar", that
916 is, the start of the reported match is earlier than where the match
917 started.
918
919 Simple assertions
920
921 The final use of backslash is for certain simple assertions. An asser‐
922 tion specifies a condition that has to be met at a particular point in
923 a match, without consuming any characters from the subject string. The
924 use of groups for more complicated assertions is described below. The
925 backslashed assertions are:
926
927 \b matches at a word boundary
928 \B matches when not at a word boundary
929 \A matches at the start of the subject
930 \Z matches at the end of the subject
931 also matches before a newline at the end of the subject
932 \z matches only at the end of the subject
933 \G matches at the first matching position in the subject
934
935 Inside a character class, \b has a different meaning; it matches the
936 backspace character. If any other of these assertions appears in a
937 character class, an "invalid escape sequence" error is generated.
938
939 A word boundary is a position in the subject string where the current
940 character and the previous character do not both match \w or \W (i.e.
941 one matches \w and the other matches \W), or the start or end of the
942 string if the first or last character matches \w, respectively. When
943 PCRE2 is built with Unicode support, the meanings of \w and \W can be
944 changed by setting the PCRE2_UCP option. When this is done, it also
945 affects \b and \B. Neither PCRE2 nor Perl has a separate "start of
946 word" or "end of word" metasequence. However, whatever follows \b nor‐
947 mally determines which it is. For example, the fragment \ba matches "a"
948 at the start of a word.
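The word boundary definition can be modelled directly. The Python sketch below assumes the default (non-UCP) meaning of \w, that is, ASCII letters, digits, and underscore; the function name is hypothetical.

```python
import string

WORD_CHARS = set(string.ascii_letters + string.digits + "_")

def is_word_boundary(subject, pos):
    # True when exactly one of the neighbouring characters is a word
    # character; a missing neighbour (start or end of the subject)
    # counts as a non-word character.
    before = pos > 0 and subject[pos - 1] in WORD_CHARS
    after = pos < len(subject) and subject[pos] in WORD_CHARS
    return before != after

subject = "ab cd"
boundaries = [p for p in range(len(subject) + 1) if is_word_boundary(subject, p)]
```

For "ab cd" the boundaries fall at offsets 0, 2, 3, and 5. Offsets 0 and 3 are followed by a word character (starts of words), and offsets 2 and 5 are preceded by one (ends of words).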
949
950 The \A, \Z, and \z assertions differ from the traditional circumflex
951 and dollar (described in the next section) in that they only ever match
952 at the very start and end of the subject string, whatever options are
953 set. Thus, they are independent of multiline mode. These three asser‐
954 tions are not affected by the PCRE2_NOTBOL or PCRE2_NOTEOL options,
955 which affect only the behaviour of the circumflex and dollar metachar‐
956 acters. However, if the startoffset argument of pcre2_match() is non-
957 zero, indicating that matching is to start at a point other than the
958 beginning of the subject, \A can never match. The difference between
959 \Z and \z is that \Z matches before a newline at the end of the string
960 as well as at the very end, whereas \z matches only at the end.
961
962 The \G assertion is true only when the current matching position is at
963 the start point of the matching process, as specified by the startoff‐
964 set argument of pcre2_match(). It differs from \A when the value of
965 startoffset is non-zero. By calling pcre2_match() multiple times with
966 appropriate arguments, you can mimic Perl's /g option, and it is in
967 this kind of implementation that \G can be useful.
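The repeated-call technique can be sketched in Python, with the caveat that this uses Python's re module rather than PCRE2. Python has no \G, but Pattern.match(string, pos) succeeds only if the match begins exactly at pos, which plays the role of a leading \G with pos as the start offset. The helper name is hypothetical.

```python
import re

def match_all_contiguous(pattern, subject, start_offset=0):
    compiled = re.compile(pattern)
    results = []
    pos = start_offset
    while pos <= len(subject):
        # Anchored at pos: comparable to a pattern that starts with \G.
        m = compiled.match(subject, pos)
        if m is None:
            break
        results.append(m.group())
        if m.end() == pos:
            break  # empty match: stop rather than loop forever
        pos = m.end()  # the next call starts where this match ended
    return results

digits = match_all_contiguous(r"\d", "123a45")
later_digits = match_all_contiguous(r"\d", "123a45", 4)
```

The first call stops at "a" because the next match would not begin at the current start position; the digits after it are found only when matching is restarted at offset 4.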
968
969 Note, however, that PCRE2's implementation of \G, being true at the
970 starting character of the matching process, is subtly different from
971 Perl's, which defines it as true at the end of the previous match. In
972 Perl, these can be different when the previously matched string was
973 empty. Because PCRE2 does just one match at a time, it cannot reproduce
974 this behaviour.
975
976 If all the alternatives of a pattern begin with \G, the expression is
977 anchored to the starting match position, and the "anchored" flag is set
978 in the compiled regular expression.
979
981
982 The circumflex and dollar metacharacters are zero-width assertions.
983 That is, they test for a particular condition being true without con‐
984 suming any characters from the subject string. These two metacharacters
985 are concerned with matching the starts and ends of lines. If the new‐
986 line convention is set so that only the two-character sequence CRLF is
987 recognized as a newline, isolated CR and LF characters are treated as
988 ordinary data characters, and are not recognized as newlines.
989
990 Outside a character class, in the default matching mode, the circumflex
991 character is an assertion that is true only if the current matching
992 point is at the start of the subject string. If the startoffset argu‐
993 ment of pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum‐
994 flex can never match if the PCRE2_MULTILINE option is unset. Inside a
995 character class, circumflex has an entirely different meaning (see
996 below).
997
998 Circumflex need not be the first character of the pattern if a number
999 of alternatives are involved, but it should be the first thing in each
1000 alternative in which it appears if the pattern is ever to match that
1001 branch. If all possible alternatives start with a circumflex, that is,
1002 if the pattern is constrained to match only at the start of the sub‐
1003 ject, it is said to be an "anchored" pattern. (There are also other
1004 constructs that can cause a pattern to be anchored.)
1005
1006 The dollar character is an assertion that is true only if the current
1007 matching point is at the end of the subject string, or immediately
1008 before a newline at the end of the string (by default), unless
1009 PCRE2_NOTEOL is set. Note, however, that it does not actually match the
1010 newline. Dollar need not be the last character of the pattern if a num‐
1011 ber of alternatives are involved, but it should be the last item in any
1012 branch in which it appears. Dollar has no special meaning in a charac‐
1013 ter class.
1014
1015 The meaning of dollar can be changed so that it matches only at the
1016 very end of the string, by setting the PCRE2_DOLLAR_ENDONLY option at
1017 compile time. This does not affect the \Z assertion.
1018
1019 The meanings of the circumflex and dollar metacharacters are changed if
1020 the PCRE2_MULTILINE option is set. When this is the case, a dollar
1021 character matches before any newlines in the string, as well as at the
1022 very end, and a circumflex matches immediately after internal newlines
1023 as well as at the start of the subject string. It does not match after
1024 a newline that ends the string, for compatibility with Perl. However,
1025 this can be changed by setting the PCRE2_ALT_CIRCUMFLEX option.
1026
1027 For example, the pattern /^abc$/ matches the subject string "def\nabc"
1028 (where \n represents a newline) in multiline mode, but not otherwise.
1029 Consequently, patterns that are anchored in single line mode because
1030 all branches start with ^ are not anchored in multiline mode, and a
1031 match for circumflex is possible when the startoffset argument of
1032 pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option is ignored
1033 if PCRE2_MULTILINE is set.
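For this particular example, Python's re module behaves in the same way, so it can serve as a quick illustration. Here re.MULTILINE stands in for PCRE2_MULTILINE; this is not PCRE2 itself, and the two engines may differ in other edge cases.

```python
import re

subject = "def\nabc"

# Default mode: circumflex matches only at the start of the subject.
single_line = re.search(r"^abc$", subject)

# Multiline mode: circumflex also matches after the internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# Default mode: dollar also matches just before a final newline,
# without consuming it.
before_final_newline = re.search(r"abc$", "abc\n")
```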
1034
1035 When the newline convention (see "Newline conventions" below) recog‐
1036 nizes the two-character sequence CRLF as a newline, this is preferred,
1037 even if the single characters CR and LF are also recognized as new‐
1038 lines. For example, if the newline convention is "any", a multiline
1039 mode circumflex matches before "xyz" in the string "abc\r\nxyz" rather
1040 than after CR, even though CR on its own is a valid newline. (It also
1041 matches at the very start of the string, of course.)
1042
1043 Note that the sequences \A, \Z, and \z can be used to match the start
1044 and end of the subject in both modes, and if all branches of a pattern
1045 start with \A it is always anchored, whether or not PCRE2_MULTILINE is
1046 set.
1047
1049
1050 Outside a character class, a dot in the pattern matches any one charac‐
1051 ter in the subject string except (by default) a character that signi‐
1052 fies the end of a line.
1053
1054 When a line ending is defined as a single character, dot never matches
1055 that character; when the two-character sequence CRLF is used, dot does
1056 not match CR if it is immediately followed by LF, but otherwise it
1057 matches all characters (including isolated CRs and LFs). When any Uni‐
1058 code line endings are being recognized, dot does not match CR or LF or
1059 any of the other line ending characters.
1060
1061 The behaviour of dot with regard to newlines can be changed. If the
1062 PCRE2_DOTALL option is set, a dot matches any one character, without
1063 exception. If the two-character sequence CRLF is present in the sub‐
1064 ject string, it takes two dots to match it.
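The effect can be seen with Python's re module, used here only as a stand-in in which re.DOTALL plays the role of PCRE2_DOTALL and LF is the line ending:

```python
import re

# By default a dot does not match the newline character.
default_dot = re.search(r"a.c", "a\nc")

# With the "dot matches all" option it does.
dotall_dot = re.search(r"a.c", "a\nc", re.DOTALL)

# CRLF is two characters, so one dot is not enough to match it.
one_dot_crlf = re.fullmatch(r"a.b", "a\r\nb", re.DOTALL)
two_dots_crlf = re.fullmatch(r"a..b", "a\r\nb", re.DOTALL)
```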
1065
1066 The handling of dot is entirely independent of the handling of circum‐
1067 flex and dollar, the only relationship being that they both involve
1068 newlines. Dot has no special meaning in a character class.
1069
1070 The escape sequence \N when not followed by an opening brace behaves
1071 like a dot, except that it is not affected by the PCRE2_DOTALL option.
1072 In other words, it matches any character except one that signifies the
1073 end of a line.
1074
1075 When \N is followed by an opening brace it has a different meaning. See
1076 the section entitled "Non-printing characters" above for details. Perl
1077 also uses \N{name} to specify characters by Unicode name; PCRE2 does
1078 not support this.
1079
1081
1082 Outside a character class, the escape sequence \C matches any one code
1083 unit, whether or not a UTF mode is set. In the 8-bit library, one code
1084 unit is one byte; in the 16-bit library it is a 16-bit unit; in the
1085 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
1086 line-ending characters. The feature is provided in Perl in order to
1087 match individual bytes in UTF-8 mode, but it is unclear how it can use‐
1088 fully be used.
1089
1090 Because \C breaks up characters into individual code units, matching
1091 one unit with \C in UTF-8 or UTF-16 mode means that the rest of the
1092 string may start with a malformed UTF character. This has undefined
1093 results, because PCRE2 assumes that it is matching character by charac‐
1094 ter in a valid UTF string (by default it checks the subject string's
1095 validity at the start of processing unless the PCRE2_NO_UTF_CHECK
1096 option is used).
1097
1098 An application can lock out the use of \C by setting the
1099 PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also
1100 possible to build PCRE2 with the use of \C permanently disabled.
1101
1102 PCRE2 does not allow \C to appear in lookbehind assertions (described
1103 below) in UTF-8 or UTF-16 modes, because this would make it impossible
1104 to calculate the length of the lookbehind. Neither the alternative
1105 matching function pcre2_dfa_match() nor the JIT optimizer supports \C in
1106 these UTF modes. The former gives a match-time error; the latter fails
1107 to optimize and so the match is always run using the interpreter.
1108
1109 In the 32-bit library, however, \C is always supported (when not
1110 explicitly locked out) because it always matches a single code unit,
1111 whether or not UTF-32 is specified.
1112
1113 In general, the \C escape sequence is best avoided. However, one way of
1114 using it that avoids the problem of malformed UTF-8 or UTF-16 charac‐
1115 ters is to use a lookahead to check the length of the next character,
1116 as in this pattern, which could be used with a UTF-8 string (ignore
1117 white space and line breaks):
1118
1119 (?| (?=[\x00-\x7f])(\C) |
1120 (?=[\x80-\x{7ff}])(\C)(\C) |
1121 (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
1122 (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
1123
1124 In this example, a group that starts with (?| resets the capturing
1125 parentheses numbers in each alternative (see "Duplicate Group Numbers"
1126 below). The assertions at the start of each branch check the next UTF-8
1127 character for values whose encoding uses 1, 2, 3, or 4 bytes, respec‐
1128 tively. The character's individual bytes are then captured by the
1129 appropriate number of \C groups.
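The code point ranges tested by the four lookaheads correspond to the UTF-8 encoding lengths. The Python sketch below (function names are hypothetical) computes the length from the code point using the same ranges and lists the individual code units, which is what the \C groups in the pattern would capture.

```python
def utf8_length(ch):
    # Same ranges as the lookaheads in the pattern above.
    code = ord(ch)
    if code <= 0x7F:
        return 1
    if code <= 0x7FF:
        return 2
    if code <= 0xFFFF:
        return 3
    return 4

def utf8_code_units(ch):
    # The individual bytes of the character's UTF-8 encoding.
    return list(ch.encode("utf-8"))

samples = ["A", "\u00e9", "\u20ac", "\U0001f600"]
lengths = [utf8_length(ch) for ch in samples]
```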
1130
1132
1133 An opening square bracket introduces a character class, terminated by a
1134 closing square bracket. A closing square bracket on its own is not spe‐
1135 cial by default. If a closing square bracket is required as a member
1136 of the class, it should be the first data character in the class (after
1137 an initial circumflex, if present) or escaped with a backslash. This
1138 means that, by default, an empty class cannot be defined. However, if
1139 the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at
1140 the start does end the (empty) class.
1141
1142 A character class matches a single character in the subject. A matched
1143 character must be in the set of characters defined by the class, unless
1144 the first character in the class definition is a circumflex, in which
1145 case the subject character must not be in the set defined by the class.
1146 If a circumflex is actually required as a member of the class, ensure
1147 it is not the first character, or escape it with a backslash.
1148
1149 For example, the character class [aeiou] matches any lower case vowel,
1150 while [^aeiou] matches any character that is not a lower case vowel.
1151 Note that a circumflex is just a convenient notation for specifying the
1152 characters that are in the class by enumerating those that are not. A
1153 class that starts with a circumflex is not an assertion; it still con‐
1154 sumes a character from the subject string, and therefore it fails if
1155 the current pointer is at the end of the string.
1156
1157 Characters in a class may be specified by their code points using \o,
1158 \x, or \N{U+hh..} in the usual way. When caseless matching is set, any
1159 letters in a class represent both their upper case and lower case ver‐
1160 sions, so for example, a caseless [aeiou] matches "A" as well as "a",
1161 and a caseless [^aeiou] does not match "A", whereas a caseful version
1162 would.
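These points can be checked with Python's re module, which is not PCRE2 but handles these simple classes in the same way; re.IGNORECASE stands in for caseless matching.

```python
import re

# Caseful: "A" is not a lower case vowel, so the negated class matches it.
caseful_negated = re.match(r"[^aeiou]", "A")

# Caseless: "A" now counts as a vowel, so the negated class does not match.
caseless_negated = re.match(r"[^aeiou]", "A", re.IGNORECASE)
caseless_positive = re.match(r"[aeiou]", "A", re.IGNORECASE)

# A negated class still has to consume a character, so it fails at the
# end of the subject.
at_end = re.search(r"x[^aeiou]", "x")
```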
1163
1164 Characters that might indicate line breaks are never treated in any
1165 special way when matching character classes, whatever line-ending
1166 sequence is in use, and whatever setting of the PCRE2_DOTALL and
1167 PCRE2_MULTILINE options is used. A class such as [^a] always matches
1168 one of these characters.
1169
1170 The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
1171 \S, \v, \V, \w, and \W may appear in a character class, and add the
1172 characters that they match to the class. For example, [\dABCDEF]
1173 matches any hexadecimal digit. In UTF modes, the PCRE2_UCP option
1174 affects the meanings of \d, \s, \w and their upper case partners, just
1175 as it does when they appear outside a character class, as described in
1176 the section entitled "Generic character types" above. The escape
1177 sequence \b has a different meaning inside a character class; it
1178 matches the backspace character. The sequences \B, \R, and \X are not
1179 special inside a character class. Like any other unrecognized escape
1180 sequences, they cause an error. The same is true for \N when not fol‐
1181 lowed by an opening brace.
1182
1183 The minus (hyphen) character can be used to specify a range of charac‐
1184 ters in a character class. For example, [d-m] matches any letter
1185 between d and m, inclusive. If a minus character is required in a
1186 class, it must be escaped with a backslash or appear in a position
1187 where it cannot be interpreted as indicating a range, typically as the
1188 first or last character in the class, or immediately after a range. For
1189 example, [b-d-z] matches letters in the range b to d, a hyphen charac‐
1190 ter, or z.
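Python's re module parses this particular class in the same way, so it can be used for a quick check (it is a stand-in, not PCRE2):

```python
import re

hyphen_class = re.compile(r"[b-d-z]")

# Test each candidate character on its own against the class.
matched = [ch for ch in "abcd-eyz" if hyphen_class.fullmatch(ch)]
```

Only b, c, d, the hyphen, and z are accepted; e and y are rejected because the second hyphen does not start a range from d to z.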
1191
1192 Perl treats a hyphen as a literal if it appears before or after a POSIX
1193 class (see below) or before or after a character type escape such as
1194 \d or \H. However, unless the hyphen is the last character in the
1195 class, Perl outputs a warning in its warning mode, as this is most
1196 likely a user error. As PCRE2 has no facility for warning, an error is
1197 given in these cases.
1198
1199 It is not possible to have the literal character "]" as the end charac‐
1200 ter of a range. A pattern such as [W-]46] is interpreted as a class of
1201 two characters ("W" and "-") followed by a literal string "46]", so it
1202 would match "W46]" or "-46]". However, if the "]" is escaped with a
1203 backslash it is interpreted as the end of range, so [W-\]46] is inter‐
1204 preted as a class containing a range followed by two other characters.
1205 The octal or hexadecimal representation of "]" can also be used to end
1206 a range.
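The two readings can be compared using Python's re module, which happens to parse these two classes in the same way (it is used here as a stand-in, not as PCRE2):

```python
import re

# Unescaped: the class is just "W" and "-", followed by the literal "46]".
unescaped = re.compile(r"[W-]46]")

# Escaped: a range from "W" to "]", plus the characters "4" and "6".
escaped = re.compile(r"[W-\]46]")

unescaped_ok = [s for s in ("W46]", "-46]", "X46]") if unescaped.fullmatch(s)]
escaped_ok = [ch for ch in "WZ]46-" if escaped.fullmatch(ch)]
```

In the escaped form the hyphen is no longer a member of the class, whereas "Z" is, because it lies between "W" and "]".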
1207
1208 Ranges normally include all code points between the start and end char‐
1209 acters, inclusive. They can also be used for code points specified
1210 numerically, for example [\000-\037]. Ranges can include any characters
1211 that are valid for the current mode. In any UTF mode, the so-called
1212 "surrogate" characters (those whose code points lie between 0xd800 and
1213 0xdfff inclusive) may not be specified explicitly by default (the
1214 PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables this check). How‐
1215 ever, ranges such as [\x{d7ff}-\x{e000}], which include the surrogates,
1216 are always permitted.
1217
1218 There is a special case in EBCDIC environments for ranges whose end
1219 points are both specified as literal letters in the same case. For com‐
1220 patibility with Perl, EBCDIC code points within the range that are not
1221 letters are omitted. For example, [h-k] matches only four characters,
1222 even though the codes for h and k are 0x88 and 0x92, a range of 11 code
1223 points. However, if the range is specified numerically, for example,
1224 [\x88-\x92] or [h-\x92], all code points are included.
1225
1226 If a range that includes letters is used when caseless matching is set,
1227 it matches the letters in either case. For example, [W-c] is equivalent
1228 to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if
1229 character tables for a French locale are in use, [\xc8-\xcb] matches
1230 accented E characters in both cases.
1231
1232 A circumflex can conveniently be used with the upper case character
1233 types to specify a more restricted set of characters than the matching
1234 lower case type. For example, the class [^\W_] matches any letter or
1235 digit, but not underscore, whereas [\w] includes underscore. A positive
1236 character class should be read as "something OR something OR ..." and a
1237 negative class as "NOT something AND NOT something AND NOT ...".
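A short check of the [^\W_] idiom, using Python's re module as a stand-in with ASCII-only \w (re.ASCII):

```python
import re

subject = "a_1-B"

# NOT a non-word character AND NOT an underscore: letters and digits only.
letters_and_digits = re.findall(r"[^\W_]", subject, re.ASCII)

# Plain \w in a class still includes the underscore.
word_chars = re.findall(r"[\w]", subject, re.ASCII)
```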
1238
1239 The only metacharacters that are recognized in character classes are
1240 backslash, hyphen (only where it can be interpreted as specifying a
1241 range), circumflex (only at the start), opening square bracket (only
1242 when it can be interpreted as introducing a POSIX class name, or for a
1243 special compatibility feature - see the next two sections), and the
1244 terminating closing square bracket. However, escaping other non-
1245 alphanumeric characters does no harm.
1246
1248
1249 Perl supports the POSIX notation for character classes. This uses names
1250 enclosed by [: and :] within the enclosing square brackets. PCRE2 also
1251 supports this notation. For example,
1252
1253 [01[:alpha:]%]
1254
1255 matches "0", "1", any alphabetic character, or "%". The supported class
1256 names are:
1257
1258 alnum letters and digits
1259 alpha letters
1260 ascii character codes 0 - 127
1261 blank space or tab only
1262 cntrl control characters
1263 digit decimal digits (same as \d)
1264 graph printing characters, excluding space
1265 lower lower case letters
1266 print printing characters, including space
1267 punct printing characters, excluding letters and digits and space
1268 space white space (the same as \s from PCRE2 8.34)
1269 upper upper case letters
1270 word "word" characters (same as \w)
1271 xdigit hexadecimal digits
1272
1273 The default "space" characters are HT (9), LF (10), VT (11), FF (12),
1274 CR (13), and space (32). If locale-specific matching is taking place,
1275 the list of space characters may be different; there may be fewer or
1276 more of them. "Space" and \s match the same set of characters.
1277
1278 The name "word" is a Perl extension, and "blank" is a GNU extension
1279 from Perl 5.8. Another Perl extension is negation, which is indicated
1280 by a ^ character after the colon. For example,
1281
1282 [12[:^digit:]]
1283
1284 matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the
1285 POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
1286 these are not supported, and an error is given if they are encountered.
1287
1288 By default, characters with values greater than 127 do not match any of
1289 the POSIX character classes, although this may be different for charac‐
1290 ters in the range 128-255 when locale-specific matching is happening.
1291 However, if the PCRE2_UCP option is passed to pcre2_compile(), some of
1292 the classes are changed so that Unicode character properties are used.
1293 This is achieved by replacing certain POSIX classes with other
1294 sequences, as follows:
1295
1296 [:alnum:] becomes \p{Xan}
1297 [:alpha:] becomes \p{L}
1298 [:blank:] becomes \h
1299 [:cntrl:] becomes \p{Cc}
1300 [:digit:] becomes \p{Nd}
1301 [:lower:] becomes \p{Ll}
1302 [:space:] becomes \p{Xps}
1303 [:upper:] becomes \p{Lu}
1304 [:word:] becomes \p{Xwd}
1305
1306 Negated versions, such as [:^alpha:], use \P instead of \p. Three other
1307 POSIX classes are handled specially in UCP mode:
1308
1309 [:graph:] This matches characters that have glyphs that mark the page
1310 when printed. In Unicode property terms, it matches all char‐
1311 acters with the L, M, N, P, S, or Cf properties, except for:
1312
1313 U+061C Arabic Letter Mark
1314 U+180E Mongolian Vowel Separator
1315 U+2066 - U+2069 Various "isolate"s
1316
1317
1318 [:print:] This matches the same characters as [:graph:] plus space
1319 characters that are not controls, that is, characters with
1320 the Zs property.
1321
1322 [:punct:] This matches all characters that have the Unicode P (punctua‐
1323 tion) property, plus those characters with code points less
1324 than 256 that have the S (Symbol) property.
1325
1326 The other POSIX classes are unchanged, and match only characters with
1327 code points less than 256.
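
The three specially handled classes can be approximated with Python's
unicodedata module. This is a sketch under stated assumptions: the
function names are hypothetical, and the Unicode version built into
Python may differ from the one PCRE2 was built with, so results for
recently added characters may not agree.

```python
import unicodedata

# Code points excluded from [:graph:] even though they have a listed
# property: U+061C, U+180E, and U+2066 to U+2069.
GRAPH_EXCEPTIONS = {0x061C, 0x180E} | set(range(0x2066, 0x206A))


def ucp_graph(ch):
    """L, M, N, P, S, or Cf property, minus the listed exceptions."""
    if ord(ch) in GRAPH_EXCEPTIONS:
        return False
    category = unicodedata.category(ch)
    return category[0] in "LMNPS" or category == "Cf"


def ucp_print(ch):
    """Everything in [:graph:] plus characters with the Zs property."""
    return ucp_graph(ch) or unicodedata.category(ch) == "Zs"


def ucp_punct(ch):
    """P property, plus S property for code points below 256."""
    category = unicodedata.category(ch)
    return category[0] == "P" or (category[0] == "S" and ord(ch) < 256)
```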
1328
1330
1331 In the POSIX.2 compliant library that was included in 4.4BSD Unix, the
1332 ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word"
1333 and "end of word". PCRE2 treats these items as follows:
1334
1335 [[:<:]] is converted to \b(?=\w)
1336 [[:>:]] is converted to \b(?<=\w)
1337
1338 Only these exact character sequences are recognized. A sequence such as
1339 [a[:<:]b] provokes an error for an unrecognized POSIX class name. This
1340 support is not compatible with Perl. It is provided to help migrations
1341 from other environments, and is best not used in any new patterns. Note
1342 that \b matches at the start and the end of a word (see "Simple asser‐
1343 tions" above), and in a Perl-style pattern the preceding or following
1344 character normally shows which is wanted, without the need for the
1345 assertions that are used above in order to give exactly the POSIX be‐
1346 haviour.
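
The conversion is a plain textual substitution of the two exact
sequences. The sketch below performs that substitution and then checks
the result with Python's re module, which is used here only as a
stand-in because it treats \b and these lookarounds in the same way for
ASCII text. The function name is hypothetical.

```python
import re


def translate_bsd_word_boundaries(pattern):
    """Rewrite only the two exact sequences, as the text describes."""
    return (pattern.replace("[[:<:]]", r"\b(?=\w)")
                   .replace("[[:>:]]", r"\b(?<=\w)"))


converted = translate_bsd_word_boundaries("[[:<:]]cat[[:>:]]")
whole_word = re.search(converted, "the cat sat")
inside_word = re.search(converted, "concatenate")
```

Here whole_word is a match object and inside_word is None, because
"cat" inside "concatenate" has word characters on both sides.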
1347
1349
1350 Vertical bar characters are used to separate alternative patterns. For
1351 example, the pattern
1352
1353 gilbert|sullivan
1354
1355 matches either "gilbert" or "sullivan". Any number of alternatives may
1356 appear, and an empty alternative is permitted (matching the empty
1357 string). The matching process tries each alternative in turn, from left
1358 to right, and the first one that succeeds is used. If the alternatives
1359 are within a group (defined below), "succeeds" means matching the rest
1360 of the main pattern as well as the alternative in the group.
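
The left-to-right rule has visible consequences when one alternative is
a prefix of another. The following sketch uses Python's re module,
which follows the same ordered-alternation rule for these simple
patterns; it is an illustration, not PCRE2 itself.

```python
import re

# The first alternative that succeeds is used, even if a later one
# would match more of the subject.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a group, "succeeds" includes matching the rest of the pattern,
# so the first alternative is abandoned when "c" cannot follow it.
in_group = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_ok = re.fullmatch(r"gilbert|", "") is not None
```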
1361
1363
1364 The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
1365 PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options
1366 can be changed from within the pattern by a sequence of letters
1367 enclosed between "(?" and ")". These options are Perl-compatible, and
1368 are described in detail in the pcre2api documentation. The option let‐
1369 ters are:
1370
1371 i for PCRE2_CASELESS
1372 m for PCRE2_MULTILINE
1373 n for PCRE2_NO_AUTO_CAPTURE
1374 s for PCRE2_DOTALL
1375 x for PCRE2_EXTENDED
1376 xx for PCRE2_EXTENDED_MORE
1377
1378 For example, (?im) sets caseless, multiline matching. It is also possi‐
1379 ble to unset these options by preceding the relevant letters with a
1380 hyphen, for example (?-im). The two "extended" options are not indepen‐
1381 dent; unsetting either one cancels the effects of both of them.
1382
1383 A combined setting and unsetting such as (?im-sx), which sets
1384 PCRE2_CASELESS and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and
1385 PCRE2_EXTENDED, is also permitted. Only one hyphen may appear in the
1386 options string. If a letter appears both before and after the hyphen,
1387 the option is unset. An empty options setting "(?)" is allowed. Need‐
1388 less to say, it has no effect.
1389
1390 If the first character following (? is a circumflex, it causes all of
1391 the above options to be unset. Thus, (?^) is equivalent to (?-imnsx).
1392 Letters may follow the circumflex to cause some options to be re-
1393 instated, but a hyphen may not appear.
1394
1395 The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be
1396 changed in the same way as the Perl-compatible options by using the
1397 characters J and U respectively. However, these are not unset by (?^).
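
The rules for one option-setting item can be summarized as a small
model. This Python sketch is an assumption-laden illustration of the
rules stated above, not PCRE2's parser: options are represented simply
as a set of letters, and the names split_letters and apply_item are
hypothetical.

```python
PERL_OPTIONS = {"i", "m", "n", "s", "x", "xx"}  # cleared by (?^)
PCRE2_ONLY = {"J", "U"}                         # not cleared by (?^)


def split_letters(letters):
    """Split "imxx" into ["i", "m", "xx"], rejecting unknown letters."""
    out, i = [], 0
    while i < len(letters):
        opt = "xx" if letters.startswith("xx", i) else letters[i]
        if opt not in PERL_OPTIONS | PCRE2_ONLY:
            raise ValueError("unknown option letter: " + opt)
        out.append(opt)
        i += len(opt)
    return out


def apply_item(item, active):
    """Return the option set after an item such as "(?im-sx)"."""
    if not (item.startswith("(?") and item.endswith(")")):
        raise ValueError("not an option-setting item")
    body = item[2:-1]
    result = set(active)
    if body.startswith("^"):
        if "-" in body:
            raise ValueError("a hyphen may not appear after ^")
        result -= PERL_OPTIONS
        to_set, to_unset = split_letters(body[1:]), []
    else:
        if body.count("-") > 1:
            raise ValueError("only one hyphen may appear")
        before, _, after = body.partition("-")
        to_set, to_unset = split_letters(before), split_letters(after)
    result |= set(to_set)
    for opt in to_unset:
        # Unsetting either "extended" option cancels both of them.
        result -= {"x", "xx"} if opt in ("x", "xx") else {opt}
    return result
```

Because unsetting is applied after setting, a letter that appears on
both sides of the hyphen ends up unset.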
1398
1399 When one of these option changes occurs at top level (that is, not
1400 inside group parentheses), the change applies to the remainder of the
1401 pattern that follows. An option change within a group (see below for a
1402 description of groups) affects only that part of the group that follows
1403 it, so
1404
1405 (a(?i)b)c
1406
1407 matches abc and aBc and no other strings (assuming PCRE2_CASELESS is
1408 not used). By this means, options can be made to have different set‐
1409 tings in different parts of the pattern. Any changes made in one alter‐
1410 native do carry on into subsequent branches within the same group. For
1411 example,
1412
1413 (a(?i)b|c)
1414
1415 matches "ab", "aB", "c", and "C", even though when matching "C" the
1416 first branch is abandoned before the option setting. This is because
1417 the effects of option settings happen at compile time. There would be
1418 some very weird behaviour otherwise.
1419
1420 As a convenient shorthand, if any option settings are required at the
1421 start of a non-capturing group (see the next section), the option let‐
1422 ters may appear between the "?" and the ":". Thus the two patterns
1423
1424 (?i:saturday|sunday)
1425 (?:(?i)saturday|sunday)
1426
1427 match exactly the same set of strings.
1428
1429 Note: There are other PCRE2-specific options, applying to the whole
1430 pattern, which can be set by the application when the compiling func‐
1431 tion is called. In addition, the pattern can contain special leading
1432 sequences such as (*CRLF) to override what the application has set or
1433 what has been defaulted. Details are given in the section entitled
1434 "Newline sequences" above. There are also the (*UTF) and (*UCP) leading
1435 sequences that can be used to set UTF and Unicode property modes; they
1436 are equivalent to setting the PCRE2_UTF and PCRE2_UCP options, respec‐
1437 tively. However, the application can set the PCRE2_NEVER_UTF and
1438 PCRE2_NEVER_UCP options, which lock out the use of the (*UTF) and
1439 (*UCP) sequences.
1440
1442
1443 Groups are delimited by parentheses (round brackets), which can be
1444 nested. Turning part of a pattern into a group does two things:
1445
1446 1. It localizes a set of alternatives. For example, the pattern
1447
1448 cat(aract|erpillar|)
1449
1450 matches "cataract", "caterpillar", or "cat". Without the parentheses,
1451 it would match "cataract", "erpillar" or an empty string.
1452
1453 2. It creates a "capture group". This means that, when the whole pat‐
1454 tern matches, the portion of the subject string that matched the group
1455 is passed back to the caller, separately from the portion that matched
1456 the whole pattern. (This applies only to the traditional matching
1457 function; the DFA matching function does not support capturing.)
1458
1459 Opening parentheses are counted from left to right (starting from 1) to
1460 obtain numbers for capture groups. For example, if the string "the red
1461 king" is matched against the pattern
1462
1463 the ((red|white) (king|queen))
1464
1465 the captured substrings are "red king", "red", and "king", and are num‐
1466 bered 1, 2, and 3, respectively.
1467
1468 The fact that plain parentheses fulfil two functions is not always
1469 helpful. There are often times when grouping is required without cap‐
1470 turing. If an opening parenthesis is followed by a question mark and a
1471 colon, the group does not do any capturing, and is not counted when
1472 computing the number of any subsequent capture groups. For example, if
1473 the string "the white queen" is matched against the pattern
1474
1475 the ((?:red|white) (king|queen))
1476
1477 the captured substrings are "white queen" and "queen", and are numbered
1478 1 and 2. The maximum number of capture groups is 65535.
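
The numbering rules can be checked with any engine that shares them.
The sketch below uses Python's re module as a stand-in; for these
particular patterns its grouping and numbering behave as described
here, but it is not PCRE2 and the variable names are illustrative.

```python
import re

# Capture groups are numbered by the position of their opening
# parenthesis, counting from 1.
numbered = re.fullmatch(r"the ((red|white) (king|queen))",
                        "the red king").groups()

# A (?: group is skipped when numbering.
non_capturing = re.fullmatch(r"the ((?:red|white) (king|queen))",
                             "the white queen").groups()

# Parentheses localize the alternatives; without them the vertical
# bars split the whole pattern.
localized = [re.fullmatch(r"cat(aract|erpillar|)", s) is not None
             for s in ("cataract", "caterpillar", "cat")]
unlocalized = [re.fullmatch(r"cataract|erpillar|", s) is not None
               for s in ("cataract", "erpillar", "", "caterpillar")]
```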
1479
1480 As a convenient shorthand, if any option settings are required at the
1481 start of a non-capturing group, the option letters may appear between
1482 the "?" and the ":". Thus the two patterns
1483
1484 (?i:saturday|sunday)
1485 (?:(?i)saturday|sunday)
1486
1487 match exactly the same set of strings. Because alternative branches are
1488 tried from left to right, and options are not reset until the end of
1489 the group is reached, an option setting in one branch does affect sub‐
1490 sequent branches, so the above patterns match "SUNDAY" as well as "Sat‐
1491 urday".
1492
1494
1495 Perl 5.10 introduced a feature whereby each alternative in a group uses
1496 the same numbers for its capturing parentheses. Such a group starts
1497 with (?| and is itself a non-capturing group. For example, consider
1498 this pattern:
1499
1500 (?|(Sat)ur|(Sun))day
1501
1502 Because the two alternatives are inside a (?| group, both sets of cap‐
1503 turing parentheses are numbered one. Thus, when the pattern matches,
1504 you can look at captured substring number one, whichever alternative
1505 matched. This construct is useful when you want to capture part, but
1506 not all, of one of a number of alternatives. Inside a (?| group, paren‐
1507 theses are numbered as usual, but the number is reset at the start of
1508 each branch. The numbers of any capturing parentheses that follow the
1509 whole group start after the highest number used in any branch. The fol‐
1510 lowing example is taken from the Perl documentation. The numbers under‐
1511 neath show in which buffer the captured content will be stored.
1512
1513 # before  ---------------branch-reset----------- after
1514 / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
1515 # 1            2         2  3        2     3     4
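
The numbering rule for (?| groups can be expressed as a short
algorithm. The Python sketch below is a deliberately simplified model:
it assumes the pattern contains no escaped parentheses, no character
classes, and no (? constructs other than (?| and (?:, and the function
name number_groups is hypothetical.

```python
def number_groups(pattern):
    """Return the capture group numbers in order of opening parenthesis."""
    numbers = []
    count = 0      # most recently assigned group number
    stack = []     # one entry per open group:
                   #   None for an ordinary or (?: group,
                   #   [start, highest] for a (?| group
    i = 0
    while i < len(pattern):
        if pattern.startswith("(?|", i):
            stack.append([count, count])
            i += 3
        elif pattern.startswith("(?:", i):
            stack.append(None)
            i += 3
        elif pattern[i] == "(":
            count += 1
            numbers.append(count)
            stack.append(None)
            i += 1
        elif pattern[i] == "|":
            # A new branch of a (?| group restarts the numbering.
            if stack and stack[-1] is not None:
                frame = stack[-1]
                frame[1] = max(frame[1], count)
                count = frame[0]
            i += 1
        elif pattern[i] == ")":
            frame = stack.pop()
            if frame is not None:
                # Continue after the highest number used in any branch.
                count = max(frame[1], count)
            i += 1
        else:
            i += 1
    return numbers


perl_example = "( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z )"
```

Applied to perl_example, the function reproduces the numbers shown
under the pattern in the Perl documentation example.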
1516
1517 A backreference to a capture group uses the most recent value that is
1518 set for the group. The following pattern matches "abcabc" or "defdef":
1519
1520 /(?|(abc)|(def))\1/
1521
1522 In contrast, a subroutine call to a capture group always refers to the
1523 first one in the pattern with the given number. The following pattern
1524 matches "abcabc" or "defabc":
1525
1526 /(?|(abc)|(def))(?1)/
1527
1528 A relative reference such as (?-1) is no different: it is just a conve‐
1529 nient way of computing an absolute group number.
1530
1531 If a condition test for a group's having matched refers to a non-unique
1532 number, the test is true if any group with that number has matched.
1533
1534 An alternative approach to using this "branch reset" feature is to use
1535 duplicate named groups, as described in the next section.
1536
1538
1539 Identifying capture groups by number is simple, but it can be very hard
1540 to keep track of the numbers in complicated patterns. Furthermore, if
1541 an expression is modified, the numbers may change. To help with this
1542 difficulty, PCRE2 supports the naming of capture groups. This feature
1543 was not added to Perl until release 5.10. Python had the feature ear‐
1544 lier, and PCRE1 introduced it at release 4.0, using the Python syntax.
1545 PCRE2 supports both the Perl and the Python syntax.
1546
1547 In PCRE2, a capture group can be named in one of three ways:
1548 (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python.
1549 Names may be up to 32 code units long. When PCRE2_UTF is not set, they
1550 may contain only ASCII alphanumeric characters and underscores, but
1551 must start with a non-digit. When PCRE2_UTF is set, the syntax of group
1552 names is extended to allow any Unicode letter or Unicode decimal digit.
1553 In other words, group names must match one of these patterns:
1554
1555 ^[_A-Za-z][_A-Za-z0-9]*\z when PCRE2_UTF is not set
1556 ^[_\p{L}][_\p{L}\p{Nd}]*\z when PCRE2_UTF is set
1557
1558 References to capture groups from other parts of the pattern, such as
1559 backreferences, recursion, and conditions, can all be made by name as
1560 well as by number.
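
The naming rules can be turned into a small validity check. In this
Python sketch the function name is hypothetical, and the length test
assumes the 8-bit library, where one code unit is one byte of UTF-8;
the Unicode tables in Python may also differ from those in PCRE2.

```python
import re
import unicodedata

MAX_NAME_CODE_UNITS = 32


def valid_group_name(name, utf=False):
    """Check a group name against the rules for non-UTF or UTF mode."""
    if len(name.encode("utf-8")) > MAX_NAME_CODE_UNITS:
        return False
    if not utf:
        # ASCII letters, digits, and underscore; must not start with a digit.
        return re.fullmatch(r"[_A-Za-z][_A-Za-z0-9]*", name) is not None

    def is_letter(ch):
        return ch == "_" or unicodedata.category(ch).startswith("L")

    def is_letter_or_digit(ch):
        return is_letter(ch) or unicodedata.category(ch) == "Nd"

    return (bool(name) and is_letter(name[0])
            and all(is_letter_or_digit(ch) for ch in name[1:]))
```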
1561
1562 Named capture groups are allocated numbers as well as names, exactly as
1563 if the names were not present. In both PCRE2 and Perl, capture groups
1564 are primarily identified by numbers; any names are just aliases for
1565 these numbers. The PCRE2 API provides function calls for extracting the
1566 complete name-to-number translation table from a compiled pattern, as
1567 well as convenience functions for extracting captured substrings by
1568 name.
1569
1570 Warning: When more than one capture group has the same number, as
1571 described in the previous section, a name given to one of them applies
1572 to all of them. Perl allows identically numbered groups to have differ‐
1573 ent names. Consider this pattern, where there are two capture groups,
1574 both numbered 1:
1575
1576 (?|(?<AA>aa)|(?<BB>bb))
1577
1578 Perl allows this, with both names AA and BB as aliases of group 1.
1579 Thus, after a successful match, both names yield the same value (either
1580 "aa" or "bb").
1581
1582 In an attempt to reduce confusion, PCRE2 does not allow the same group
1583 number to be associated with more than one name. The example above pro‐
1584 vokes a compile-time error. However, there is still scope for confu‐
1585 sion. Consider this pattern:
1586
1587 (?|(?<AA>aa)|(bb))
1588
1589 Although the second group number 1 is not explicitly named, the name AA
1590 is still an alias for any group 1. Whether the pattern matches "aa" or
1591 "bb", a reference by name to group AA yields the matched string.
1592
1593 By default, a name must be unique within a pattern, except that dupli‐
1594 cate names are permitted for groups with the same number, for example:
1595
1596 (?|(?<AA>aa)|(?<AA>bb))
1597
1598 The duplicate name constraint can be disabled by setting the PCRE2_DUP‐
1599 NAMES option at compile time, or by the use of (?J) within the pattern.
1600 Duplicate names can be useful for patterns where only one instance of
1601 the named capture group can match. Suppose you want to match the name
1602 of a weekday, either as a 3-letter abbreviation or as the full name,
1603 and in both cases you want to extract the abbreviation. This pattern
1604 (ignoring the line breaks) does the job:
1605
1606 (?<DN>Mon|Fri|Sun)(?:day)?|
1607 (?<DN>Tue)(?:sday)?|
1608 (?<DN>Wed)(?:nesday)?|
1609 (?<DN>Thu)(?:rsday)?|
1610 (?<DN>Sat)(?:urday)?
1611
1612 There are five capture groups, but only one is ever set after a match.
1613 The convenience functions for extracting the data by name return the
1614 substring for the first (and in this example, the only) group of that
1615 name that matched. This saves searching to find which numbered group it
1616 was. (An alternative way of solving this problem is to use a "branch
1617 reset" group, as described in the previous section.)
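
For comparison, this is the search that a lookup by duplicate name
avoids. Python's re module rejects duplicate group names, so the sketch
below uses five unnamed groups and finds the one that was set; it
illustrates the problem being solved rather than the PCRE2 API, and the
names WEEKDAY and abbreviation are hypothetical.

```python
import re

WEEKDAY = re.compile(
    r"(Mon|Fri|Sun)(?:day)?|"
    r"(Tue)(?:sday)?|"
    r"(Wed)(?:nesday)?|"
    r"(Thu)(?:rsday)?|"
    r"(Sat)(?:urday)?"
)


def abbreviation(day):
    """Return the 3-letter abbreviation, or None if day is not a weekday."""
    m = WEEKDAY.fullmatch(day)
    if m is None:
        return None
    # Only one of the five groups is ever set after a match.
    return next(g for g in m.groups() if g is not None)
```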
1618
1619 If you make a backreference to a non-unique named group from elsewhere
1620 in the pattern, the groups to which the name refers are checked in the
1621 order in which they appear in the overall pattern. The first one that
1622 is set is used for the reference. For example, this pattern matches
1623 both "foofoo" and "barbar" but not "foobar" or "barfoo":
1624
1625 (?:(?<n>foo)|(?<n>bar))\k<n>
1626
1627
1628 If you make a subroutine call to a non-unique named group, the one that
1629 corresponds to the first occurrence of the name is used. In the absence
1630 of duplicate numbers this is the one with the lowest number.
1631
1632 If you use a named reference in a condition test (see the section about
1633 conditions below), either to check whether a capture group has matched,
1634 or to check for recursion, all groups with the same name are tested. If
1635 the condition is true for any one of them, the overall condition is
1636 true. This is the same behaviour as testing by number. For further
1637 details of the interfaces for handling named capture groups, see the
1638 pcre2api documentation.
1639
1641
1642 Repetition is specified by quantifiers, which can follow any of the
1643 following items:
1644
1645 a literal data character
1646 the dot metacharacter
1647 the \C escape sequence
1648 the \R escape sequence
1649 the \X escape sequence
1650 an escape such as \d or \pL that matches a single character
1651 a character class
1652 a backreference
1653 a parenthesized group (including most assertions)
1654 a subroutine call (recursive or otherwise)
1655
1656 The general repetition quantifier specifies a minimum and maximum num‐
1657 ber of permitted matches, by giving the two numbers in curly brackets
1658 (braces), separated by a comma. The numbers must be less than 65536,
1659 and the first must be less than or equal to the second. For example,
1660
1661 z{2,4}
1662
1663 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
1664 special character. If the second number is omitted, but the comma is
1665 present, there is no upper limit; if the second number and the comma
1666 are both omitted, the quantifier specifies an exact number of required
1667 matches. Thus
1668
1669 [aeiou]{3,}
1670
1671 matches at least 3 successive vowels, but may match many more, whereas
1672
1673 \d{8}
1674
1675 matches exactly 8 digits. An opening curly bracket that appears in a
1676 position where a quantifier is not allowed, or one that does not match
1677 the syntax of a quantifier, is taken as a literal character. For exam‐
1678 ple, {,6} is not a quantifier, but a literal string of four characters.
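
The syntax rules for the general quantifier can be captured in a short
checker. This Python sketch is a model of the rules stated above, not
PCRE2 code: parse_quantifier is a hypothetical name, and out-of-range
or out-of-order numbers are reported here as ValueError.

```python
import re

MAX_REPEAT = 65535
_QUANTIFIER = re.compile(r"\{(\d+)(?:(,)(\d*))?\}")


def parse_quantifier(text):
    """Return (min, max), with max None meaning "no upper limit".

    Return None when text does not have quantifier syntax, in which
    case it would be taken as literal characters.
    """
    m = _QUANTIFIER.fullmatch(text)
    if m is None:
        return None
    low = int(m.group(1))
    if m.group(2) is None:
        high = low            # {n}: exact count
    elif m.group(3) == "":
        high = None           # {n,}: no upper limit
    else:
        high = int(m.group(3))
    if low > MAX_REPEAT or (high is not None and high > MAX_REPEAT):
        raise ValueError("number too big in quantifier")
    if high is not None and low > high:
        raise ValueError("numbers out of order in quantifier")
    return low, high
```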
1679
1680 In UTF modes, quantifiers apply to characters rather than to individual
1681 code units. Thus, for example, \x{100}{2} matches two characters, each
1682 of which is represented by a two-byte sequence in a UTF-8 string. Simi‐
1683 larly, \X{3} matches three Unicode extended grapheme clusters, each of
1684 which may be several code units long (and they may be of different
1685 lengths).
1686
1687 The quantifier {0} is permitted, causing the expression to behave as if
1688 the previous item and the quantifier were not present. This may be use‐
1689 ful for capture groups that are referenced as subroutines from else‐
1690 where in the pattern (but see also the section entitled "Defining cap‐
1691 ture groups for use by reference only" below). Except for parenthesized
1692 groups, items that have a {0} quantifier are omitted from the compiled
1693 pattern.
1694
1695 For convenience, the three most common quantifiers have single-charac‐
1696 ter abbreviations:
1697
1698 * is equivalent to {0,}
1699 + is equivalent to {1,}
1700 ? is equivalent to {0,1}
1701
1702 It is possible to construct infinite loops by following a group that
1703 can match no characters with a quantifier that has no upper limit, for
1704 example:
1705
1706 (a?)*
1707
1708 Earlier versions of Perl and PCRE1 used to give an error at compile
1709 time for such patterns. However, because there are cases where this can
1710 be useful, such patterns are now accepted, but if any repetition of the
1711 group does in fact match no characters, the loop is forcibly broken.
1712
1713 By default, quantifiers are "greedy", that is, they match as much as
1714 possible (up to the maximum number of permitted times), without causing
1715 the rest of the pattern to fail. The classic example of where this
1716 gives problems is in trying to match comments in C programs. These
1717 appear between /* and */ and within the comment, individual * and /
1718 characters may appear. An attempt to match C comments by applying the
1719 pattern
1720
1721 /\*.*\*/
1722
1723 to the string
1724
1725 /* first comment */ not comment /* second comment */
1726
1727 fails, because it matches the entire string owing to the greediness of
1728 the .* item. However, if a quantifier is followed by a question mark,
1729 it ceases to be greedy, and instead matches the minimum number of times
1730 possible, so the pattern
1731
1732 /\*.*?\*/
1733
1734 does the right thing with the C comments. The meaning of the various
1735 quantifiers is not otherwise changed, just the preferred number of
1736 matches. Do not confuse this use of question mark with its use as a
1737 quantifier in its own right. Because it has two uses, it can sometimes
1738 appear doubled, as in
1739
1740 \d??\d
1741
1742 which matches one digit by preference, but can match two if that is the
1743 only way the rest of the pattern matches.
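
The difference between the greedy and minimizing forms is easy to
observe. The sketch below uses Python's re module as a stand-in, since
it gives the same results as described here for these patterns; the
variable names are illustrative.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last */ in the subject.
greedy = re.search(r"/\*.*\*/", subject).group()

# Minimizing: .*? stops at the first */ each time.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest of the match
# requires it (here, because the whole subject must be consumed).
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()
```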
1744
1745 If the PCRE2_UNGREEDY option is set (an option that is not available in
1746 Perl), the quantifiers are not greedy by default, but individual ones
1747 can be made greedy by following them with a question mark. In other
1748 words, it inverts the default behaviour.
1749
1750 When a parenthesized group is quantified with a minimum repeat count
1751 that is greater than 1 or with a limited maximum, more memory is
1752 required for the compiled pattern, in proportion to the size of the
1753 minimum or maximum.
1754
1755 If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option
1756 (equivalent to Perl's /s) is set, thus allowing the dot to match new‐
1757 lines, the pattern is implicitly anchored, because whatever follows
1758 will be tried against every character position in the subject string,
1759 so there is no point in retrying the overall match at any position
1760 after the first. PCRE2 normally treats such a pattern as though it were
1761 preceded by \A.
1762
1763 In cases where it is known that the subject string contains no new‐
1764 lines, it is worth setting PCRE2_DOTALL in order to obtain this opti‐
1765 mization, or alternatively, using ^ to indicate anchoring explicitly.
1766
1767 However, there are some cases where the optimization cannot be used.
1768 When .* is inside capturing parentheses that are the subject of a
1769 backreference elsewhere in the pattern, a match at the start may fail
1770 where a later one succeeds. Consider, for example:
1771
1772 (.*)abc\1
1773
1774 If the subject is "xyz123abc123" the match point is the fourth charac‐
1775 ter. For this reason, such a pattern is not implicitly anchored.
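
The match position in this example can be confirmed with a short
experiment. Python's re module is used below purely as a stand-in; it
does not perform PCRE2's implicit anchoring at all, but it finds the
same match, which shows why anchoring such a pattern would be wrong.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123", re.DOTALL)

match_offset = m.start()   # zero-based offset of the match
captured = m.group(1)
```

The zero-based offset 3 corresponds to the fourth character of the
subject.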
1776
1777 Another case where implicit anchoring is not applied is when the lead‐
1778 ing .* is inside an atomic group. Once again, a match at the start may
1779 fail where a later one succeeds. Consider this pattern:
1780
1781 (?>.*?a)b
1782
1783 It matches "ab" in the subject "aab". The use of the backtracking con‐
1784 trol verbs (*PRUNE) and (*SKIP) also disables this optimization, and
1785 there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
1786
1787 When a capture group is repeated, the value captured is the substring
1788 that matched the final iteration. For example, after
1789
1790 (tweedle[dume]{3}\s*)+
1791
1792 has matched "tweedledum tweedledee" the value of the captured substring
1793 is "tweedledee". However, if there are nested capture groups, the cor‐
1794 responding captured values may have been set in previous iterations.
1795 For example, after
1796
1797 (a|(b))+
1798
1799 matches "aba" the value of the second captured substring is "b".
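
Both examples can be reproduced with Python's re module, which keeps
the value from the final iteration and, like the behaviour described
here, leaves a nested group set from an earlier iteration. This is a
stand-in for illustration, not PCRE2.

```python
import re

# The outer group holds what its final iteration matched.
last_iteration = re.fullmatch(r"(tweedle[dume]{3}\s*)+",
                              "tweedledum tweedledee").group(1)

# Group 2 was set in the second iteration and is not cleared by the
# third iteration, which matches "a" without using group 2.
nested = re.fullmatch(r"(a|(b))+", "aba").groups()
```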
1800
1802
1803 With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
1804 repetition, failure of what follows normally causes the repeated item
1805 to be re-evaluated to see if a different number of repeats allows the
1806 rest of the pattern to match. Sometimes it is useful to prevent this,
1807 either to change the nature of the match, or to cause it to fail earlier
1808 than it otherwise might, when the author of the pattern knows there is
1809 no point in carrying on.
1810
1811 Consider, for example, the pattern \d+foo when applied to the subject
1812 line
1813
1814 123456bar
1815
1816 After matching all 6 digits and then failing to match "foo", the normal
1817 action of the matcher is to try again with only 5 digits matching the
1818 \d+ item, and then with 4, and so on, before ultimately failing.
1819 "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
1820 the means for specifying that once a group has matched, it is not to be
1821 re-evaluated in this way.
1822
1823 If we use atomic grouping for the previous example, the matcher gives
1824 up immediately on failing to match "foo" the first time. The notation
1825 is a kind of special parenthesis, starting with (?> as in this example:
1826
1827 (?>\d+)foo
1828
1829 Perl 5.28 introduced an experimental alphabetic form starting with (*
1830 which may be easier to remember:
1831
1832 (*atomic:\d+)foo
1833
1834 This kind of parenthesized group "locks up" the part of the pattern it
1835 contains once it has matched, and a failure further into the pattern is
1836 prevented from backtracking into it. Backtracking past it to previous
1837 items, however, works as normal.
1838
1839 An alternative description is that a group of this type matches exactly
1840 the string of characters that an identical standalone pattern would
1841 match, if anchored at the current point in the subject string.
1842
1843 Atomic groups are not capture groups. Simple cases such as the above
1844 example can be thought of as a maximizing repeat that must swallow
1845 everything it can. So, while both \d+ and \d+? are prepared to adjust
1846 the number of digits they match in order to make the rest of the pat‐
1847 tern match, (?>\d+) can only match an entire sequence of digits.
1848
1849 Atomic groups in general can of course contain arbitrarily complicated
1850 expressions, and can be nested. However, when the contents of an atomic
1851 group are just a single repeated item, as in the example above, a sim‐
1852 pler notation, called a "possessive quantifier", can be used. This con‐
1853 sists of an additional + character following a quantifier. Using this
1854 notation, the previous example can be rewritten as
1855
1856 \d++foo
1857
1858 Note that a possessive quantifier can be used with an entire group, for
1859 example:
1860
1861 (abc|xyz){2,3}+
1862
1863 Possessive quantifiers are always greedy; the setting of the
1864 PCRE2_UNGREEDY option is ignored. They are a convenient notation for
1865 the simpler forms of atomic group. However, there is no difference in
1866 the meaning of a possessive quantifier and the equivalent atomic group,
1867 though there may be a performance difference; possessive quantifiers
1868 should be slightly faster.
1869
1870 The possessive quantifier syntax is an extension to the Perl 5.8 syn‐
1871 tax. Jeffrey Friedl originated the idea (and the name) in the first
1872 edition of his book. Mike McCloskey liked it, so implemented it when he
1873 built Sun's Java package, and PCRE1 copied it from there. It found its
1874 way into Perl at release 5.10.
1875
1876 PCRE2 has an optimization that automatically "possessifies" certain
1877 simple pattern constructs. For example, the sequence A+B is treated as
1878 A++B because there is no point in backtracking into a sequence of A's
1879 when B must follow. This feature can be disabled by the PCRE2_NO_AUTO‐
1880 POSSESS option, or by starting the pattern with (*NO_AUTO_POSSESS).
1881
1882 When a pattern contains an unlimited repeat inside a group that can
1883 itself be repeated an unlimited number of times, the use of an atomic
1884 group is the only way to avoid some failing matches taking a very long
1885 time indeed. The pattern
1886
1887 (\D+|<\d+>)*[!?]
1888
1889 matches an unlimited number of substrings that either consist of non-
1890 digits, or digits enclosed in <>, followed by either ! or ?. When it
1891 matches, it runs quickly. However, if it is applied to
1892
1893 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1894
1895 it takes a long time before reporting failure. This is because the
1896 string can be divided between the internal \D+ repeat and the external
1897 * repeat in a large number of ways, and all have to be tried. (The
1898 example uses [!?] rather than a single character at the end, because
1899 both PCRE2 and Perl have an optimization that allows for fast failure
1900 when a single character is used. They remember the last single charac‐
1901 ter that is required for a match, and fail early if it is not present
1902 in the string.) If the pattern is changed so that it uses an atomic
1903 group, like this:
1904
1905 ((?>\D+)|<\d+>)*[!?]
1906
1907 sequences of non-digits cannot be broken, and failure happens quickly.
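
The scale of the problem can be estimated by counting the ways a run of
non-digits can be shared out between the inner \D+ and the outer *. The
Python sketch below is a rough counting model only: it ignores engine
optimizations and retries at later start positions, and count_splits is
a hypothetical name.

```python
def count_splits(n):
    """Number of ways to cut a run of n characters into consecutive,
    non-empty pieces, one piece per iteration of the outer group."""
    ways = [1] + [0] * n
    for length in range(1, n + 1):
        # The first piece takes 1..length characters; the remainder
        # can then be cut in ways[length - first] ways.
        ways[length] = sum(ways[length - first]
                           for first in range(1, length + 1))
    return ways[n]
```

The count doubles with each extra character (it equals 2 to the power
n-1), whereas with the atomic group a run of non-digits can be taken in
only one way.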
1908
1910
1911 Outside a character class, a backslash followed by a digit greater than
1912 0 (and possibly further digits) is a backreference to a capture group
1913 earlier (that is, to its left) in the pattern, provided there have been
1914 that many previous capture groups.
1915
1916 However, if the decimal number following the backslash is less than 8,
1917 it is always taken as a backreference, and causes an error only if
1918 there are not that many capture groups in the entire pattern. In other
1919 words, the group that is referenced need not be to the left of the ref‐
1920 erence for numbers less than 8. A "forward backreference" of this type
1921 can make sense when a repetition is involved and the group to the right
1922 has participated in an earlier iteration.
1923
1924 It is not possible to have a numerical "forward backreference" to a
1925 group whose number is 8 or more using this syntax because a sequence
1926 such as \50 is interpreted as a character defined in octal. See the
1927 subsection entitled "Non-printing characters" above for further details
1928 of the handling of digits following a backslash. Other forms of back‐
1929 referencing do not suffer from this restriction. In particular, there
1930 is no problem when named capture groups are used (see below).
1931
1932 Another way of avoiding the ambiguity inherent in the use of digits
1933 following a backslash is to use the \g escape sequence. This escape
1934 must be followed by a signed or unsigned number, optionally enclosed in
1935 braces. These examples are all identical:
1936
1937 (ring), \1
1938 (ring), \g1
1939 (ring), \g{1}
1940
1941 An unsigned number specifies an absolute reference without the ambigu‐
1942 ity that is present in the older syntax. It is also useful when literal
1943 digits follow the reference. A signed number is a relative reference.
1944 Consider this example:
1945
1946 (abc(def)ghi)\g{-1}
1947
1948 The sequence \g{-1} is a reference to the most recently started capture
1949 group before \g, that is, it is equivalent to \2 in this example. Simi‐
1950 larly, \g{-2} would be equivalent to \1. The use of relative references
1951 can be helpful in long patterns, and also in patterns that are created
1952 by joining together fragments that contain references within them‐
1953 selves.
1954
1955 The sequence \g{+1} is a reference to the next capture group. This kind
1956 of forward reference can be useful in patterns that repeat. Perl does
1957 not support the use of + in this way.
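
Converting a relative reference to an absolute group number is simple
arithmetic on the number of capture groups started before the
reference. The following Python sketch shows that calculation; the
function name and its parameters are hypothetical.

```python
def resolve_relative(offset, groups_started):
    """Map \\g{-n} or \\g{+n} to an absolute group number.

    groups_started is the number of capture groups whose opening
    parenthesis appears before the reference.
    """
    if offset < 0:
        target = groups_started + offset + 1
    elif offset > 0:
        target = groups_started + offset
    else:
        raise ValueError("a relative reference needs a non-zero offset")
    if target < 1:
        raise ValueError("reference to a non-existent group")
    return target
```

In (abc(def)ghi)\g{-1} two groups have been started before the
reference, so an offset of -1 resolves to group 2 and -2 to group 1.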
1958
1959 A backreference matches whatever actually most recently matched the
1960 capture group in the current subject string, rather than anything at
1961 all that matches the group (see "Groups as subroutines" below for a way
1962 of doing that). So the pattern
1963
1964 (sens|respons)e and \1ibility
1965
1966 matches "sense and sensibility" and "response and responsibility", but
1967 not "sense and responsibility". If caseful matching is in force at the
1968 time of the backreference, the case of letters is relevant. For exam‐
1969 ple,
1970
1971 ((?i)rah)\s+\1
1972
1973 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
1974 original capture group is matched caselessly.
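
The two examples above can be tried in any engine that shares the \1
syntax. The following is a minimal sketch using Python's re module, which
is a different engine from PCRE2 and is used here only as a stand-in. The
helper name matches() is hypothetical, and (?i:...) is Python's scoped
spelling of the inline (?i) setting used in the text.

```python
import re

# Stand-in engine: Python's re, not PCRE2. Only the \1 syntax is shared.
PAIR = re.compile(r"(sens|respons)e and \1ibility")

# The group is matched caselessly, but the backreference outside the
# scoped (?i:...) is compared casefully.
RAH = re.compile(r"((?i:rah))\s+\1")


def matches(pattern, subject):
    """Return True if the whole subject is matched by the pattern."""
    return pattern.fullmatch(subject) is not None
```

With these definitions, matches(RAH, "RAH RAH") is true while
matches(RAH, "RAH rah") is false, because \1 must repeat the captured text
exactly.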
1975
1976 There are several different ways of writing backreferences to named
1977 capture groups. The .NET syntax \k{name} and the Perl syntax \k<name>
1978 or \k'name' are supported, as is the Python syntax (?P=name). Perl
1979 5.10's unified backreference syntax, in which \g can be used for both
1980 numeric and named references, is also supported. We could rewrite the
1981 above example in any of the following ways:
1982
1983 (?<p1>(?i)rah)\s+\k<p1>
1984 (?'p1'(?i)rah)\s+\k{p1}
1985 (?P<p1>(?i)rah)\s+(?P=p1)
1986 (?<p1>(?i)rah)\s+\g{p1}
1987
1988 A capture group that is referenced by name may appear in the pattern
1989 before or after the reference.
1990
1991 There may be more than one backreference to the same group. If a group
1992 has not actually been used in a particular match, backreferences to it
1993 always fail by default. For example, the pattern
1994
1995 (a|(bc))\2
1996
1997 always fails if it starts to match "a" rather than "bc". However, if
1998 the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backref‐
1999 erence to an unset value matches an empty string.
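
The default behaviour for an unset group can be observed with a short
sketch. It uses Python's re module as a stand-in engine (an assumption of
this example, not part of PCRE2); that engine also makes a reference to an
unset group fail, and it has no counterpart to PCRE2_MATCH_UNSET_BACKREF.
The helper name try_match() is hypothetical.

```python
import re

# Stand-in engine: Python's re. A backreference to a group that did not
# take part in the match fails, as in PCRE2's default mode.
UNSET = re.compile(r"(a|(bc))\2")


def try_match(subject):
    """Return the captured groups, or None if the subject does not match."""
    m = UNSET.fullmatch(subject)
    return None if m is None else m.groups()
```

Here try_match("bcbc") succeeds because group 2 is set, whereas
try_match("a") fails because group 2 never matched.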
2000
2001 Because there may be many capture groups in a pattern, all digits fol‐
2002 lowing a backslash are taken as part of a potential backreference num‐
2003 ber. If the pattern continues with a digit character, some delimiter
2004 must be used to terminate the backreference. If the PCRE2_EXTENDED or
2005 PCRE2_EXTENDED_MORE option is set, this can be white space. Otherwise,
2006 the \g{} syntax or an empty comment (see "Comments" below) can be used.
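
Two of the delimiters mentioned above, ignored white space and an empty
comment, can be sketched with Python's re module. This is a stand-in
engine, not PCRE2, and it does not accept \g{} inside a pattern; the
helper name is_rejected() is hypothetical. In this engine the undelimited
form is read as a reference to group 10 and refused, which illustrates why
a delimiter is needed.

```python
import re

# Stand-in engine: Python's re, not PCRE2.
with_comment = re.compile(r"(a)\1(?#)0")          # empty comment ends \1
with_space = re.compile(r"(a)\1 0", re.VERBOSE)   # ignored space ends \1


def is_rejected(pattern):
    """Return True if this engine refuses to compile the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return True
    return False
```

Both compiled patterns match "aa0": group 1, its repetition, and then a
literal zero.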
2007
2008 Recursive backreferences
2009
2010 A backreference that occurs inside the group to which it refers fails
2011 when the group is first used, so, for example, (a\1) never matches.
2012 However, such references can be useful inside repeated groups. For
2013 example, the pattern
2014
2015 (a|b\1)+
2016
2017 matches any number of "a"s and also "aba", "ababbaa" etc. At each iter‐
2018 ation of the group, the backreference matches the character string cor‐
2019 responding to the previous iteration. In order for this to work, the
2020 pattern must be such that the first iteration does not need to match
2021 the backreference. This can be done using alternation, as in the exam‐
2022 ple above, or by a quantifier with a minimum of zero.
2023
2024 Backreferences of this type cause the group that they reference to be
2025 treated as an atomic group. Once the whole group has been matched, a
2026 subsequent matching failure cannot cause backtracking into the middle
2027 of the group.
2028
2030
2031 An assertion is a test on the characters following or preceding the
2032 current matching point that does not consume any characters. The simple
2033 assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described
2034 above.
2035
2036 More complicated assertions are coded as parenthesized groups. There
2037 are two kinds: those that look ahead of the current position in the
2038 subject string, and those that look behind it, and in each case an
2039 assertion may be positive (must match for the assertion to be true) or
2040 negative (must not match for the assertion to be true). An assertion
2041 group is matched in the normal way, and if it is true, matching
2042 continues after it, but with the matching position in the subject string
2043 reset to what it was before the assertion was processed.
2044
2045 A lookaround assertion may also appear as the condition in a condi‐
2046 tional group (see below). In this case, the result of matching the
2047 assertion determines which branch of the condition is followed.
2048
2049 Assertion groups are not capture groups. If an assertion contains cap‐
2050 ture groups within it, these are counted for the purposes of numbering
2051 the capture groups in the whole pattern. Within each branch of an
2052 assertion, locally captured substrings may be referenced in the usual
2053 way. For example, a sequence such as (.)\g{-1} can be used to check
2054 that two adjacent characters are the same.
2055
2056 When a branch within an assertion fails to match, any substrings that
2057 were captured are discarded (as happens with any pattern branch that
2058 fails to match). A negative assertion is true only when all its
2059 branches fail to match; this means that no captured substrings are ever
2060 retained after a successful negative assertion. When an assertion con‐
2061 tains a matching branch, what happens depends on the type of assertion.
2062
2063 For a positive assertion, internally captured substrings in the suc‐
2064 cessful branch are retained, and matching continues with the next pat‐
2065 tern item after the assertion. For a negative assertion, a matching
2066 branch means that the assertion is not true. If such an assertion is
2067 being used as a condition in a conditional group (see below), captured
2068 substrings are retained, because matching continues with the "no"
2069 branch of the condition. For other failing negative assertions, control
2070 passes to the previous backtracking point, thus discarding any captured
2071 strings within the assertion.
2072
2073 For compatibility with Perl, most assertion groups may be repeated;
2074 though it makes no sense to assert the same thing several times, the
2075 side effect of capturing may occasionally be useful. However, an
2076 assertion that forms the condition for a conditional group may not be
2077 quantified. In practice, for other assertions, there are only three cases:
2078
2079 (1) If the quantifier is {0}, the assertion is never obeyed during
2080 matching. However, it may contain internal capture groups that are
2081 called from elsewhere via the subroutine mechanism.
2082
2083 (2) If the quantifier is {0,n} where n is greater than zero, it is treated
2084 as if it were {0,1}. At run time, the rest of the pattern match is
2085 tried with and without the assertion, the order depending on the greed‐
2086 iness of the quantifier.
2087
2088 (3) If the minimum repetition is greater than zero, the quantifier is
2089 ignored. The assertion is obeyed just once when encountered during
2090 matching.
2091
2092 Alphabetic assertion names
2093
2094 Traditionally, symbolic sequences such as (?= and (?<= have been used
2095 to specify lookaround assertions. Perl 5.28 introduced some experimen‐
2096 tal alphabetic alternatives which might be easier to remember. They all
2097 start with (* instead of (? and must be written using lower case let‐
2098 ters. PCRE2 supports the following synonyms:
2099
2100 (*positive_lookahead: or (*pla: is the same as (?=
2101 (*negative_lookahead: or (*nla: is the same as (?!
2102 (*positive_lookbehind: or (*plb: is the same as (?<=
2103 (*negative_lookbehind: or (*nlb: is the same as (?<!
2104
2105 For example, (*pla:foo) is the same assertion as (?=foo). In the fol‐
2106 lowing sections, the various assertions are described using the origi‐
2107 nal symbolic forms.
2108
2109 Lookahead assertions
2110
2111 Lookahead assertions start with (?= for positive assertions and (?! for
2112 negative assertions. For example,
2113
2114 \w+(?=;)
2115
2116 matches a word followed by a semicolon, but does not include the semi‐
2117 colon in the match, and
2118
2119 foo(?!bar)
2120
2121 matches any occurrence of "foo" that is not followed by "bar". Note
2122 that the apparently similar pattern
2123
2124 (?!foo)bar
2125
2126 does not find an occurrence of "bar" that is preceded by something
2127 other than "foo"; it finds any occurrence of "bar" whatsoever, because
2128 the assertion (?!foo) is always true when the next three characters are
2129 "bar". A lookbehind assertion is needed to achieve the other effect.
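
The three lookahead patterns above can be compared side by side. The
sketch below uses Python's re module, which shares the (?= and (?! syntax
but is a different engine from PCRE2; the helper name found() is
hypothetical.

```python
import re


def found(pattern, subject):
    """Return (start, text) for every match of pattern in subject."""
    return [(m.start(), m.group(0)) for m in re.finditer(pattern, subject)]
```

For example, found(r"(?!foo)bar", "foobar") still reports "bar" at offset
3, because the assertion is tested at the point where "bar" begins, not
before it.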
2130
2131 If you want to force a matching failure at some point in a pattern, the
2132 most convenient way to do it is with (?!) because an empty string
2133 always matches, so an assertion that requires there not to be an empty
2134 string must always fail. The backtracking control verb (*FAIL) or (*F)
2135 is a synonym for (?!).
2136
2137 Lookbehind assertions
2138
2139 Lookbehind assertions start with (?<= for positive assertions and (?<!
2140 for negative assertions. For example,
2141
2142 (?<!foo)bar
2143
2144 does find an occurrence of "bar" that is not preceded by "foo". The
2145 contents of a lookbehind assertion are restricted such that all the
2146 strings it matches must have a fixed length. However, if there are sev‐
2147 eral top-level alternatives, they do not all have to have the same
2148 fixed length. Thus
2149
2150 (?<=bullock|donkey)
2151
2152 is permitted, but
2153
2154 (?<!dogs?|cats?)
2155
2156 causes an error at compile time. Branches that match different length
2157 strings are permitted only at the top level of a lookbehind assertion.
2158 This is an extension compared with Perl, which requires all branches to
2159 match the same length of string. An assertion such as
2160
2161 (?<=ab(c|de))
2162
2163 is not permitted, because its single top-level branch can match two
2164 different lengths, but it is acceptable to PCRE2 if rewritten to use
2165 two top-level branches:
2166
2167 (?<=abc|abde)
2168
2169 In some cases, the escape sequence \K (see above) can be used instead
2170 of a lookbehind assertion to get round the fixed-length restriction.
2171
2172 The implementation of lookbehind assertions is, for each alternative,
2173 to temporarily move the current position back by the fixed length and
2174 then try to match. If there are insufficient characters before the cur‐
2175 rent position, the assertion fails.
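
The per-alternative step back can be made explicit in an engine that lacks
PCRE2's extension. The sketch below uses Python's re module as a stand-in
(an assumption of this example): it rejects alternatives of different
lengths inside a single lookbehind, so each alternative is written as its
own assertion. The names used are hypothetical.

```python
import re

# Stand-in engine: Python's re, not PCRE2. One lookbehind per alternative
# mirrors "move back by the fixed length, then try to match".
AFTER_ABC_OR_ABDE = re.compile(r"(?:(?<=abc)|(?<=abde))X")
NOT_AFTER_FOO = re.compile(r"(?<!foo)bar")


def first_start(pattern, subject):
    """Return the offset of the first match, or -1 if there is none."""
    m = pattern.search(subject)
    return -1 if m is None else m.start()
```

When fewer characters precede the current position than the assertion
needs, a positive lookbehind fails and a negative one succeeds, so
first_start(NOT_AFTER_FOO, "bar") is 0.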
2176
2177 In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which
2178 matches a single code unit even in a UTF mode) to appear in lookbehind
2179 assertions, because it makes it impossible to calculate the length of
2180 the lookbehind. The \X and \R escapes, which can match different num‐
2181 bers of code units, are never permitted in lookbehinds.
2182
2183 "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
2184 lookbehinds, as long as the called capture group matches a fixed-length
2185 string. However, recursion, that is, a "subroutine" call into a group
2186 that is already active, is not supported.
2187
2188 Perl does not support backreferences in lookbehinds. PCRE2 does support
2189 them, but only if certain conditions are met. The
2190 PCRE2_MATCH_UNSET_BACKREF option must not be set, there must be no use
2191 of (?| in the pattern (it creates duplicate group numbers), and if the
2192 backreference is by name, the name must be unique. Of course, the ref‐
2193 erenced group must itself match a fixed length substring. The following
2194 pattern matches words containing at least two characters that begin and
2195 end with the same character:
2196
2197 \b(\w)\w++(?<=\1)
2198
2199 Possessive quantifiers can be used in conjunction with lookbehind
2200 assertions to specify efficient matching of fixed-length strings at the
2201 end of subject strings. Consider a simple pattern such as
2202
2203 abcd$
2204
2205 when applied to a long string that does not match. Because matching
2206 proceeds from left to right, PCRE2 will look for each "a" in the sub‐
2207 ject and then see if what follows matches the rest of the pattern. If
2208 the pattern is specified as
2209
2210 ^.*abcd$
2211
2212 the initial .* matches the entire string at first, but when this fails
2213 (because there is no following "a"), it backtracks to match all but the
2214 last character, then all but the last two characters, and so on. Once
2215 again the search for "a" covers the entire string, from right to left,
2216 so we are no better off. However, if the pattern is written as
2217
2218 ^.*+(?<=abcd)
2219
2220 there can be no backtracking for the .*+ item because of the possessive
2221 quantifier; it can match only the entire string. The subsequent lookbe‐
2222 hind assertion does a single test on the last four characters. If it
2223 fails, the match fails immediately. For long strings, this approach
2224 makes a significant difference to the processing time.
2225
2226 Using multiple assertions
2227
2228 Several assertions (of any sort) may occur in succession. For example,
2229
2230 (?<=\d{3})(?<!999)foo
2231
2232 matches "foo" preceded by three digits that are not "999". Notice that
2233 each of the assertions is applied independently at the same point in
2234 the subject string. First there is a check that the previous three
2235 characters are all digits, and then there is a check that the same
2236 three characters are not "999". This pattern does not match "foo"
2237 preceded by six characters, the first three of which are digits and the
2238 last three of which are not "999". For example, it doesn't match
2239 "123abcfoo". A pattern to do that is
2240
2241 (?<=\d{3}...)(?<!999)foo
2242
2243 This time the first assertion looks at the preceding six characters,
2244 checking that the first three are digits, and then the second assertion
2245 checks that the preceding three characters are not "999".
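
The difference between the two patterns can be checked with a few
subjects. This is a sketch using Python's re module, a stand-in engine
that accepts both patterns unchanged; the helper name hits() is
hypothetical.

```python
import re

# Stand-in engine: Python's re, not PCRE2.
THREE_BACK = re.compile(r"(?<=\d{3})(?<!999)foo")
SIX_BACK = re.compile(r"(?<=\d{3}...)(?<!999)foo")


def hits(pattern, subject):
    """Return True if the pattern matches somewhere in the subject."""
    return pattern.search(subject) is not None
```

THREE_BACK accepts "123foo" but not "123abcfoo", while SIX_BACK accepts
"123abcfoo" but not "123999foo".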
2246
2247 Assertions can be nested in any combination. For example,
2248
2249 (?<=(?<!foo)bar)baz
2250
2251 matches an occurrence of "baz" that is preceded by "bar" which in turn
2252 is not preceded by "foo", while
2253
2254 (?<=\d{3}(?!999)...)foo
2255
2256 is another pattern that matches "foo" preceded by three digits and any
2257 three characters that are not "999".
2258
2260
2261 In concept, a script run is a sequence of characters that are all from
2262 the same Unicode script such as Latin or Greek. However, because some
2263 scripts are commonly used together, and because some diacritical and
2264 other marks are used with multiple scripts, it is not that simple.
2265 There is a full description of the rules that PCRE2 uses in the section
2266 entitled "Script Runs" in the pcre2unicode documentation.
2267
2268 If part of a pattern is enclosed between (*script_run: or (*sr: and a
2269 closing parenthesis, it fails if the sequence of characters that it
2270 matches is not a script run. After a failure, normal backtracking
2271 occurs. Script runs can be used to detect spoofing attacks using char‐
2272 acters that look the same, but are from different scripts. The string
2273 "paypal.com" is an infamous example, where the letters could be a mix‐
2274 ture of Latin and Cyrillic. This pattern ensures that the matched char‐
2275 acters in a sequence of non-spaces that follow white space are a script
2276 run:
2277
2278 \s+(*sr:\S+)
2279
2280 To be sure that they are all from the Latin script (for example), a
2281 lookahead can be used:
2282
2283 \s+(?=\p{Latin})(*sr:\S+)
2284
2285 This works as long as the first character is expected to be a character
2286 in that script, and not (for example) punctuation, which is allowed
2287 with any script. If this is not the case, a more creative lookahead is
2288 needed. For example, if digits, underscore, and dots are permitted at
2289 the start:
2290
2291 \s+(?=[0-9_.]*\p{Latin})(*sr:\S+)
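
The spoofing scenario can be illustrated outside PCRE2 with a deliberately
rough check. The sketch below is not PCRE2's script run algorithm: it
guesses a script from the first word of each letter's Unicode character
name, and it ignores the rules for shared marks and for scripts that are
commonly used together. Both function names are hypothetical.

```python
import unicodedata


def rough_scripts(text):
    """Approximate the scripts in text from Unicode character names."""
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}


def looks_like_one_script(text):
    """Return True if every letter appears to come from one script."""
    return len(rough_scripts(text)) <= 1
```

A string such as "p\u0430ypal", in which the second letter is CYRILLIC
SMALL LETTER A, is reported as mixing two scripts even though it renders
like "paypal".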
2292
2293
2294 In many cases, backtracking into a script run pattern fragment is not
2295 desirable. The script run can employ an atomic group to prevent this.
2296 Because this is a common requirement, a shorthand notation is provided
2297 by (*atomic_script_run: or (*asr:
2298
2299 (*asr:...) is the same as (*sr:(?>...))
2300
2301 Note that the atomic group is inside the script run. Putting it outside
2302 would not prevent backtracking into the script run pattern.
2303
2304 Support for script runs is not available if PCRE2 is compiled without
2305 Unicode support. A compile-time error is given if any of the above con‐
2306 structs is encountered. Script runs are not supported by the alternate
2307 matching function, pcre2_dfa_match() because they use the same mecha‐
2308 nism as capturing parentheses.
2309
2310 Warning: The (*ACCEPT) control verb (see below) should not be used
2311 within a script run group, because it causes an immediate exit from the
2312 group, bypassing the script run checking.
2313
2315
2316 It is possible to cause the matching process to obey a pattern fragment
2317 conditionally or to choose between two alternative fragments, depending
2318 on the result of an assertion, or whether a specific capture group has
2319 already been matched. The two possible forms of conditional group are:
2320
2321 (?(condition)yes-pattern)
2322 (?(condition)yes-pattern|no-pattern)
2323
2324 If the condition is satisfied, the yes-pattern is used; otherwise the
2325 no-pattern (if present) is used. An absent no-pattern is equivalent to
2326 an empty string (it always matches). If there are more than two alter‐
2327 natives in the group, a compile-time error occurs. Each of the two
2328 alternatives may itself contain nested groups of any form, including
2329 conditional groups; the restriction to two alternatives applies only at
2330 the level of the condition itself. This pattern fragment is an example
2331 where the alternatives are complex:
2332
2333 (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
2334
2335
2336 There are five kinds of condition: references to capture groups, refer‐
2337 ences to recursion, two pseudo-conditions called DEFINE and VERSION,
2338 and assertions.
2339
2340 Checking for a used capture group by number
2341
2342 If the text between the parentheses consists of a sequence of digits,
2343 the condition is true if a capture group of that number has previously
2344 matched. If there is more than one capture group with the same number
2345 (see the earlier section about duplicate group numbers), the condition
2346 is true if any of them have matched. An alternative notation is to pre‐
2347 cede the digits with a plus or minus sign. In this case, the group num‐
2348 ber is relative rather than absolute. The most recently opened capture
2349 group can be referenced by (?(-1), the next most recent by (?(-2), and
2350 so on. Inside loops it can also make sense to refer to subsequent
2351 groups. The next capture group can be referenced as (?(+1), and so on.
2352 (The value zero in any of these forms is not used; it provokes a com‐
2353 pile-time error.)
2354
2355 Consider the following pattern, which contains non-significant white
2356 space to make it more readable (assume the PCRE2_EXTENDED option) and
2357 to divide it into three parts for ease of discussion:
2358
2359 ( \( )? [^()]+ (?(1) \) )
2360
2361 The first part matches an optional opening parenthesis, and if that
2362 character is present, sets it as the first captured substring. The sec‐
2363 ond part matches one or more characters that are not parentheses. The
2364 third part is a conditional group that tests whether or not the first
2365 capture group matched. If it did, that is, if the subject started with an
2366 opening parenthesis, the condition is true, and so the yes-pattern is
2367 executed and a closing parenthesis is required. Otherwise, since no-
2368 pattern is not present, the conditional group matches nothing. In other
2369 words, this pattern matches a sequence of non-parentheses, optionally
2370 enclosed in parentheses.
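
The behaviour of this pattern can be checked with a short sketch. Python's
re module, used here as a stand-in engine rather than PCRE2, accepts the
same (?(1)...) condition, and its VERBOSE flag plays the role of
PCRE2_EXTENDED. The helper name accepts() is hypothetical.

```python
import re

# Stand-in engine: Python's re, not PCRE2.
OPTIONAL_PARENS = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)


def accepts(subject):
    """Return True if the whole subject is matched by the pattern."""
    return OPTIONAL_PARENS.fullmatch(subject) is not None
```

Both "abc" and "(abc)" are accepted, whereas "(abc" and "abc)" are not,
because the closing parenthesis is required exactly when group 1 matched.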
2371
2372 If you were embedding this pattern in a larger one, you could use a
2373 relative reference:
2374
2375 ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
2376
2377 This makes the fragment independent of the parentheses in the larger
2378 pattern.
2379
2380 Checking for a used capture group by name
2381
2382 Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
2383 used capture group by name. For compatibility with earlier versions of
2384 PCRE1, which had this facility before Perl, the syntax (?(name)...) is
2385 also recognized. Note, however, that undelimited names consisting of
2386 the letter R followed by digits are ambiguous (see the following sec‐
2387 tion). Rewriting the above example to use a named group gives this:
2388
2389 (?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
2390
2391 If the name used in a condition of this kind is a duplicate, the test
2392 is applied to all groups of the same name, and is true if any one of
2393 them has matched.
2394
2395 Checking for pattern recursion
2396
2397 "Recursion" in this sense refers to any subroutine-like call from one
2398 part of the pattern to another, whether or not it is actually recur‐
2399 sive. See the sections entitled "Recursive patterns" and "Groups as
2400 subroutines" below for details of recursion and subroutine calls.
2401
2402 If a condition is the string (R), and there is no capture group with
2403 the name R, the condition is true if matching is currently in a recur‐
2404 sion or subroutine call to the whole pattern or any capture group. If
2405 digits follow the letter R, and there is no group with that name, the
2406 condition is true if the most recent call is into a group with the
2407 given number, which must exist somewhere in the overall pattern. This
2408 is a contrived example that is equivalent to a+b:
2409
2410 ((?(R1)a+|(?1)b))
2411
2412 However, in both cases, if there is a capture group with a matching
2413 name, the condition tests for its being set, as described in the sec‐
2414 tion above, instead of testing for recursion. For example, creating a
2415 group with the name R1 by adding (?<R1>) to the above pattern com‐
2416 pletely changes its meaning.
2417
2418 If a name preceded by ampersand follows the letter R, for example:
2419
2420 (?(R&name)...)
2421
2422 the condition is true if the most recent recursion is into a group of
2423 that name (which must exist within the pattern).
2424
2425 This condition does not check the entire recursion stack. It tests only
2426 the current level. If the name used in a condition of this kind is a
2427 duplicate, the test is applied to all groups of the same name, and is
2428 true if any one of them is the most recent recursion.
2429
2430 At "top level", all these recursion test conditions are false.
2431
2432 Defining capture groups for use by reference only
2433
2434 If the condition is the string (DEFINE), the condition is always false,
2435 even if there is a group with the name DEFINE. In this case, there may
2436 be only one alternative in the rest of the conditional group. It is
2437 always skipped if control reaches this point in the pattern; the idea
2438 of DEFINE is that it can be used to define subroutines that can be ref‐
2439 erenced from elsewhere. (The use of subroutines is described below.)
2440 For example, a pattern to match an IPv4 address such as
2441 "192.168.23.245" could be written like this (ignore white space and
2442 line breaks):
2443
2444 (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
2445 \b (?&byte) (\.(?&byte)){3} \b
2446
2447 The first part of the pattern is a DEFINE group inside which another
2448 group named "byte" is defined. This matches an individual component of
2449 an IPv4 address (a number less than 256). When matching takes place,
2450 this part of the pattern is skipped because DEFINE acts like a false
2451 condition. The rest of the pattern uses references to the named group
2452 to match the four dot-separated components of an IPv4 address, insist‐
2453 ing on a word boundary at each end.
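
Engines without DEFINE or subroutine calls can approximate the same reuse
by building the pattern from a shared fragment. The sketch below does this
with Python's re module; string interpolation is a stand-in for the
(?&byte) calls and is an assumption of this example, as are the names
BYTE, IPV4 and is_ipv4().

```python
import re

# One component of an IPv4 address, written once and reused textually.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Four dot-separated components with a word boundary at each end.
# re.ASCII restricts \d to the digits 0-9.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b", re.ASCII)


def is_ipv4(text):
    """Return True if the whole text is a dotted IPv4 address."""
    return IPV4.fullmatch(text) is not None
```

Unlike a real subroutine call, interpolation copies the fragment into the
pattern four times, so the compiled pattern grows with each use.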
2454
2455 Checking the PCRE2 version
2456
2457 Programs that link with a PCRE2 library can check the version by call‐
2458 ing pcre2_config() with appropriate arguments. Users of applications
2459 that do not have access to the underlying code cannot do this. A spe‐
2460 cial "condition" called VERSION exists to allow such users to discover
2461 which version of PCRE2 they are dealing with by using this condition to
2462 match a string such as "yesno". VERSION must be followed either by "="
2463 or ">=" and a version number. For example:
2464
2465 (?(VERSION>=10.4)yes|no)
2466
2467 This pattern matches "yes" if the PCRE2 version is greater than or equal to
2468 10.4, or "no" otherwise. The fractional part of the version number may
2469 not contain more than two digits.
2470
2471 Assertion conditions
2472
2473 If the condition is not in any of the above formats, it must be a
2474 parenthesized assertion. This may be a positive or negative lookahead
2475 or lookbehind assertion. Consider this pattern, again containing non-
2476 significant white space, and with the two alternatives on the second
2477 line:
2478
2479 (?(?=[^a-z]*[a-z])
2480 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
2481
2482 The condition is a positive lookahead assertion that matches an
2483 optional sequence of non-letters followed by a letter. In other words,
2484 it tests for the presence of at least one letter in the subject. If a
2485 letter is found, the subject is matched against the first alternative;
2486 otherwise it is matched against the second. This pattern matches
2487 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
2488 letters and dd are digits.
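
When the condition is a lookahead that captures nothing, the same choice
can be written as an ordinary alternation in which each branch is guarded
by the assertion or by its negation. The sketch below uses that rewrite
with Python's re module, which does not accept assertions as conditions;
the rewrite and the names HAS_LETTER and DATE_LIKE are assumptions of this
example, not PCRE2 syntax.

```python
import re

# The condition from the pattern above: some letter lies ahead.
HAS_LETTER = r"[^a-z]*[a-z]"

# (?(?=A)X|Y) rewritten as (?:(?=A)X|(?!A)Y).
DATE_LIKE = re.compile(
    rf"(?:(?={HAS_LETTER})\d{{2}}-[a-z]{{3}}-\d{{2}}"
    rf"|(?!{HAS_LETTER})\d{{2}}-\d{{2}}-\d{{2}})"
)
```

The rewrite evaluates the condition once per branch instead of once
overall, which is one reason the conditional form can be preferable where
it is available.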
2489
2490 When an assertion that is a condition contains capture groups, any cap‐
2491 turing that occurs in a matching branch is retained afterwards, for
2492 both positive and negative assertions, because matching always contin‐
2493 ues after the assertion, whether it succeeds or fails. (Compare non-
2494 conditional assertions, for which captures are retained only for posi‐
2495 tive assertions that succeed.)
2496
2498
2499 There are two ways of including comments in patterns that are processed
2500 by PCRE2. In both cases, the start of the comment must not be in a
2501 character class, nor in the middle of any other sequence of related
2502 characters such as (?: or a group name or number. The characters that
2503 make up a comment play no part in the pattern matching.
2504
2505 The sequence (?# marks the start of a comment that continues up to the
2506 next closing parenthesis. Nested parentheses are not permitted. If the
2507 PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped #
2508 character also introduces a comment, which in this case continues to
2509 immediately after the next newline character or character sequence in
2510 the pattern. Which characters are interpreted as newlines is controlled
2511 by an option passed to the compiling function or by a special sequence
2512 at the start of the pattern, as described in the section entitled "New‐
2513 line conventions" above. Note that the end of this type of comment is a
2514 literal newline sequence in the pattern; escape sequences that happen
2515 to represent a newline do not count. For example, consider this pattern
2516 when PCRE2_EXTENDED is set, and the default newline convention (a sin‐
2517 gle linefeed character) is in force:
2518
2519 abc #comment \n still comment
2520
2521 On encountering the # character, pcre2_compile() skips along, looking
2522 for a newline in the pattern. The sequence \n is still literal at this
2523 stage, so it does not terminate the comment. Only an actual character
2524 with the code value 0x0a (the default newline) does so.
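
The distinction between an escape sequence and a literal newline can be
seen in any engine with a comparable extended mode. The sketch below uses
Python's re module with its VERBOSE flag as a stand-in for PCRE2_EXTENDED;
that engine applies the same rule, but it is not PCRE2, and the variable
names are hypothetical.

```python
import re

# Raw string: the pattern holds a backslash followed by "n", not a
# linefeed, so the comment runs to the end of the pattern.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: "\n" is a real linefeed (0x0a) inside the pattern, so
# the comment stops there and "def" is part of the pattern again.
literal = re.compile("abc #comment \n def", re.VERBOSE)
```

The first pattern is equivalent to abc and the second to abcdef.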
2525
2527
2528 Consider the problem of matching a string in parentheses, allowing for
2529 unlimited nested parentheses. Without the use of recursion, the best
2530 that can be done is to use a pattern that matches up to some fixed
2531 depth of nesting. It is not possible to handle an arbitrary nesting
2532 depth.
2533
2534 For some time, Perl has provided a facility that allows regular expres‐
2535 sions to recurse (amongst other things). It does this by interpolating
2536 Perl code in the expression at run time, and the code can refer to the
2537 expression itself. A Perl pattern using code interpolation to solve the
2538 parentheses problem can be created like this:
2539
2540 $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
2541
2542 The (?p{...}) item interpolates Perl code at run time, and in this case
2543 refers recursively to the pattern in which it appears.
2544
2545 Obviously, PCRE2 cannot support the interpolation of Perl code.
2546 Instead, it supports special syntax for recursion of the entire pat‐
2547 tern, and also for individual capture group recursion. After its intro‐
2548 duction in PCRE1 and Python, this kind of recursion was subsequently
2549 introduced into Perl at release 5.10.
2550
2551 A special item that consists of (? followed by a number greater than
2552 zero and a closing parenthesis is a recursive subroutine call of the
2553 capture group of the given number, provided that it occurs inside that
2554 group. (If not, it is a non-recursive subroutine call, which is
2555 described in the next section.) The special item (?R) or (?0) is a
2556 recursive call of the entire regular expression.
2557
2558 This PCRE2 pattern solves the nested parentheses problem (assume the
2559 PCRE2_EXTENDED option is set so that white space is ignored):
2560
2561 \( ( [^()]++ | (?R) )* \)
2562
2563 First it matches an opening parenthesis. Then it matches any number of
2564 substrings which can either be a sequence of non-parentheses, or a
2565 recursive match of the pattern itself (that is, a correctly parenthe‐
2566 sized substring). Finally there is a closing parenthesis. Note the use
2567 of a possessive quantifier to avoid backtracking into sequences of non-
2568 parentheses.
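
What the recursive pattern accepts can be spelled out as a small
hand-written matcher. The sketch below is plain Python, not PCRE2 and not
a regular expression; the function name match_parens() is hypothetical.
Each step is labelled with the pattern item it mirrors.

```python
def match_parens(subject, pos=0):
    """Return the offset just past a balanced group at pos, or -1."""
    # \(  : an opening parenthesis is required first.
    if pos >= len(subject) or subject[pos] != "(":
        return -1
    pos += 1
    # ( ... )*  : repeat until the closing parenthesis is found.
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            return pos + 1              # \)  : end of this level
        if ch == "(":
            pos = match_parens(subject, pos)   # (?R) : recurse
            if pos < 0:
                return -1
        else:
            pos += 1                    # [^()]++ : a non-parenthesis
    return -1                           # ran out of subject before \)
```

For the subject "(ab(cd)ef)" the function returns 10, the length of the
whole string, and for an unbalanced subject such as "(a(b)" it returns -1.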
2569
2570 If this were part of a larger pattern, you would not want to recurse
2571 the entire pattern, so instead you could use this:
2572
2573 ( \( ( [^()]++ | (?1) )* \) )
2574
2575 We have put the pattern into parentheses, and caused the recursion to
2576 refer to them instead of the whole pattern.
2577
2578 In a larger pattern, keeping track of parenthesis numbers can be
2579 tricky. This is made easier by the use of relative references. Instead
2580 of (?1) in the pattern above you can write (?-2) to refer to the second
2581 most recently opened parentheses preceding the recursion. In other
2582 words, a negative number counts capturing parentheses leftwards from
2583 the point at which it is encountered.
2584
2585 Be aware, however, that if duplicate capture group numbers are in use,
2586 relative references refer to the earliest group with the appropriate
2587 number. Consider, for example:
2588
2589 (?|(a)|(b)) (c) (?-2)
2590
2591 The first two capture groups (a) and (b) are both numbered 1, and group
2592 (c) is number 2. When the reference (?-2) is encountered, the second
2593 most recently opened parentheses has the number 1, but it is the first
2594 such group (the (a) group) to which the recursion refers. This would be
2595 the same if an absolute reference (?1) was used. In other words, rela‐
2596 tive references are just a shorthand for computing a group number.
2597
2598 It is also possible to refer to subsequent capture groups, by writing
2599 references such as (?+2). However, these cannot be recursive because
2600 the reference is not inside the parentheses that are referenced. They
2601 are always non-recursive subroutine calls, as described in the next
2602 section.
2603
2604 An alternative approach is to use named parentheses. The Perl syntax
2605 for this is (?&name); PCRE1's earlier syntax (?P>name) is also sup‐
2606 ported. We could rewrite the above example as follows:
2607
2608 (?<pn> \( ( [^()]++ | (?&pn) )* \) )
2609
2610 If there is more than one group with the same name, the earliest one is
2611 used.
2612
2613 The example pattern that we have been looking at contains nested unlim‐
2614 ited repeats, and so the use of a possessive quantifier for matching
2615 strings of non-parentheses is important when applying the pattern to
2616 strings that do not match. For example, when this pattern is applied to
2617
2618 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
2619
2620 it yields "no match" quickly. However, if a possessive quantifier is
2621 not used, the match runs for a very long time indeed because there are
2622 so many different ways the + and * repeats can carve up the subject,
2623 and all have to be tested before failure can be reported.
2624
2625 At the end of a match, the values of capturing parentheses are those
2626 from the outermost level. If you want to obtain intermediate values, a
2627 callout function can be used (see below and the pcre2callout documenta‐
2628 tion). If the pattern above is matched against
2629
2630 (ab(cd)ef)
2631
2632 the value for the inner capturing parentheses (numbered 2) is "ef",
2633 which is the last value taken on at the top level. If a capture group
2634 is not matched at the top level, its final captured value is unset,
2635 even if it was (temporarily) set at a deeper level during the matching
2636 process.
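
The following Python sketch mimics what the named pattern above
accepts, and which value is left in group 2 at the end. It is a
hand-written recursive function, not PCRE2 itself; match_pn and its
return convention are assumptions made for the illustration, and the
match is tried only at the given position.

```python
def match_pn(subject, pos=0):
    """Mimic (?<pn> \\( ( [^()]++ | (?&pn) )* \\) ) starting at pos.

    Returns (end_offset, last_group2_value) or None if there is no
    match at pos.
    """
    if pos >= len(subject) or subject[pos] != "(":
        return None
    i = pos + 1
    group2 = None  # value of the inner capture group at this level
    while i < len(subject):
        if subject[i] == ")":
            return i + 1, group2
        if subject[i] == "(":
            inner = match_pn(subject, i)  # the (?&pn) call
            if inner is None:
                return None
            end, _ = inner  # captures set inside the call are dropped
            group2 = subject[i:end]
            i = end
        else:
            j = i
            while j < len(subject) and subject[j] not in "()":
                j += 1
            group2 = subject[i:j]  # one iteration of [^()]++
            i = j
    return None  # ran out of subject before the closing parenthesis


print(match_pn("(ab(cd)ef)"))
```

For "(ab(cd)ef)" the top level sets group 2 to "ab", then "(cd)", then
"ef", so "ef" is what remains; the "cd" seen inside the recursive call
never reaches the caller.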
2637
2638 Do not confuse the (?R) item with the condition (R), which tests for
2639 recursion. Consider this pattern, which matches text in angle brack‐
2640 ets, allowing for arbitrary nesting. Only digits are allowed in nested
2641 brackets (that is, when recursing), whereas any characters are permit‐
2642 ted at the outer level.
2643
2644 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
2645
2646 In this pattern, (?(R) is the start of a conditional group, with two
2647 different alternatives for the recursive and non-recursive cases. The
2648 (?R) item is the actual recursive call.
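
The difference between the two items may be easier to see in ordinary
code. In this Python sketch (an illustration, not PCRE2; the names
match_angle and recursing are invented here) the recursing argument
plays the part of the (R) condition, and the nested function call
plays the part of (?R). The match is tried only at the given position.

```python
def match_angle(subject, pos=0, recursing=False):
    """Mimic < (?: (?(R) \\d++ | [^<>]*+) | (?R)) * > starting at pos.

    Returns the end offset of the match, or None.
    """
    if pos >= len(subject) or subject[pos] != "<":
        return None
    i = pos + 1
    while i < len(subject):
        ch = subject[i]
        if ch == ">":
            return i + 1
        if ch == "<":
            # (?R): the actual recursive call
            end = match_angle(subject, i, recursing=True)
            if end is None:
                return None
            i = end
        elif recursing and ch not in "0123456789":
            # (?(R) \d++ ...): only digits are allowed when recursing
            return None
        else:
            i += 1
    return None


print(match_angle("<abc<123>def>"))
print(match_angle("<abc<12x>def>"))
```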
2649
2650 Differences in recursion processing between PCRE2 and Perl
2651
2652 Some former differences between PCRE2 and Perl no longer exist.
2653
2654 Before release 10.30, recursion processing in PCRE2 differed from Perl
2655 in that a recursive subroutine call was always treated as an atomic
2656 group. That is, once it had matched some of the subject string, it was
2657 never re-entered, even if it contained untried alternatives and there
2658 was a subsequent matching failure. (Historical note: PCRE implemented
2659 recursion before Perl did.)
2660
2661 Starting with release 10.30, recursive subroutine calls are no longer
2662 treated as atomic. That is, they can be re-entered to try unused alter‐
2663 natives if there is a matching failure later in the pattern. This is
2664 now compatible with the way Perl works. If you want a subroutine call
2665 to be atomic, you must explicitly enclose it in an atomic group.
2666
2667 Supporting backtracking into recursions simplifies certain types of
2668 recursive pattern. For example, this pattern matches palindromic
2669 strings:
2670
2671 ^((.)(?1)\2|.?)$
2672
2673 The second branch in the group matches a single central character in
2674 the palindrome when there are an odd number of characters, or nothing
2675 when there are an even number of characters, but in order to work it
2676 has to be able to try the second case when the rest of the pattern
2677 match fails. If you want to match typical palindromic phrases, the pat‐
2678 tern has to ignore all non-word characters, which can be done like
2679 this:
2680
2681 ^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
2682
2683 If run with the PCRE2_CASELESS option, this pattern matches phrases
2684 such as "A man, a plan, a canal: Panama!". Note the use of the posses‐
2685 sive quantifier *+ to avoid backtracking into sequences of non-word
2686 characters. Without this, PCRE2 takes a great deal longer (ten times or
2687 more) to match typical phrases, and Perl takes so long that you think
2688 it has gone into a loop.
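
For comparison, here is roughly what the two palindrome patterns
accept, written as plain Python. This is a sketch of the language
being matched, not of how PCRE2 matches it; the function names are
invented, non-word characters are taken to be anything outside
A-Z, a-z, 0-9 and underscore, and the fact that "." does not normally
match a newline is ignored.

```python
import re


def is_palindrome(text):
    """Mimic ^((.)(?1)\\2|.?)$ on text."""
    if len(text) <= 1:
        return True  # the .? branch: a central character, or nothing
    # (.) ... \2 : the outer characters must be equal, and the text
    # between them must itself match group 1.
    return text[0] == text[-1] and is_palindrome(text[1:-1])


def is_palindromic_phrase(phrase):
    """Mimic the \\W*+ version of the pattern, matched caselessly."""
    word_chars = re.sub(r"[^A-Za-z0-9_]", "", phrase).lower()
    return is_palindrome(word_chars)


print(is_palindrome("abcba"))
print(is_palindromic_phrase("A man, a plan, a canal: Panama!"))
```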
2689
2690 Another way in which PCRE2 and Perl used to differ in their recursion
2691 processing is in the handling of captured values. Formerly in Perl,
2692 when a group was called recursively or as a subroutine (see the next
2693 section), it had no access to any values that were captured outside the
2694 recursion, whereas in PCRE2 these values can be referenced. Consider
2695 this pattern:
2696
2697 ^(.)(\1|a(?2))
2698
2699 This pattern matches "bab". The first capturing parentheses match "b",
2700 then in the second group, when the backreference \1 fails to match "b",
2701 the second alternative matches "a" and then recurses. In the recursion,
2702 \1 does now match "b" and so the whole match succeeds. This match used
2703 to fail in Perl, but it succeeds in later versions (tested with 5.024).
2704
2706
2707 If the syntax for a recursive group call (either by number or by name)
2708 is used outside the parentheses to which it refers, it operates a bit
2709 like a subroutine in a programming language. More accurately, PCRE2
2710 treats the referenced group as an independent subpattern which it tries
2711 to match at the current matching position. The called group may be
2712 defined before or after the reference. A numbered reference can be
2713 absolute or relative, as in these examples:
2714
2715 (...(absolute)...)...(?2)...
2716 (...(relative)...)...(?-1)...
2717 (...(?+1)...(relative)...
2718
2719 An earlier example pointed out that the pattern
2720
2721 (sens|respons)e and \1ibility
2722
2723 matches "sense and sensibility" and "response and responsibility", but
2724 not "sense and responsibility". If instead the pattern
2725
2726 (sens|respons)e and (?1)ibility
2727
2728 is used, it does match "sense and responsibility" as well as the other
2729 two strings. Another example is given in the discussion of DEFINE
2730 above.
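
The difference between the two patterns can be tried out with Python's
re module, with one caveat: re has no (?1) syntax, so in this sketch
the subroutine call is imitated by writing the group's contents out
again in a non-capturing group. The variable names are invented for
the example.

```python
import re

# Backreference: the second occurrence must repeat the matched TEXT.
backref = re.compile(r"(sens|respons)e and \1ibility")

# Imitation of (?1): the second occurrence re-runs the PATTERN.
call_like = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")

for subject in ("sense and sensibility",
                "response and responsibility",
                "sense and responsibility"):
    print(subject,
          backref.fullmatch(subject) is not None,
          call_like.fullmatch(subject) is not None)
```

Only the last subject distinguishes the two: the backreference rejects
it, whereas re-running the group's pattern accepts it.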
2731
2732 Like recursions, subroutine calls used to be treated as atomic, but
2733 this changed at PCRE2 release 10.30, so backtracking into subroutine
2734 calls can now occur. However, any capturing parentheses that are set
2735 during the subroutine call revert to their previous values afterwards.
2736
2737 Processing options such as case-independence are fixed when a group is
2738 defined, so if it is used as a subroutine, such options cannot be
2739 changed for different calls. For example, consider this pattern:
2740
2741 (abc)(?i:(?-1))
2742
2743 It matches "abcabc". It does not match "abcABC" because the change of
2744 processing option does not affect the called group.
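
One consequence is that a subroutine call is not the same thing as a
textual copy of the group. Python's re module has no subroutine
calls, but it does have scoped options, so the contrast can be
sketched as follows. The second pattern is only an imitation of the
call, made by switching the option off again around the copy; the
names are invented for the example.

```python
import re

# A textual copy of the group picks up the enclosing (?i: ...) setting.
textual_copy = re.compile(r"(abc)(?i:abc)")

# Closer to the behaviour of (?i:(?-1)): the copy keeps the
# case-sensitive setting in force where the group was defined.
call_like = re.compile(r"(abc)(?i:(?-i:abc))")


def matches(pattern, subject):
    return pattern.fullmatch(subject) is not None


print(matches(textual_copy, "abcABC"))
print(matches(call_like, "abcABC"))
print(matches(call_like, "abcabc"))
```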
2745
2746 The behaviour of backtracking control verbs in groups when called as
2747 subroutines is described in the section entitled "Backtracking verbs in
2748 subroutines" below.
2749
2751
2752 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
2753 name or a number enclosed either in angle brackets or single quotes, is
2754 an alternative syntax for calling a group as a subroutine, possibly
2755 recursively. Here are two of the examples used above, rewritten using
2756 this syntax:
2757
2758 (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
2759 (sens|respons)e and \g'1'ibility
2760
2761 PCRE2 supports an extension to Oniguruma: if a number is preceded by a
2762 plus or a minus sign it is taken as a relative reference. For example:
2763
2764 (abc)(?i:\g<-1>)
2765
2766 Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
2767 synonymous. The former is a backreference; the latter is a subroutine
2768 call.
2769
2771
2772 Perl has a feature whereby using the sequence (?{...}) causes arbitrary
2773 Perl code to be obeyed in the middle of matching a regular expression.
2774 This makes it possible, amongst other things, to extract different sub‐
2775 strings that match the same pair of parentheses when there is a repeti‐
2776 tion.
2777
2778 PCRE2 provides a similar feature, but of course it cannot obey arbi‐
2779 trary Perl code. The feature is called "callout". The caller of PCRE2
2780 provides an external function by putting its entry point in a match
2781 context using the function pcre2_set_callout(), and then passing that
2782 context to pcre2_match() or pcre2_dfa_match(). If no match context is
2783 passed, or if the callout entry point is set to NULL, callouts are dis‐
2784 abled.
2785
2786 Within a regular expression, (?C<arg>) indicates a point at which the
2787 external function is to be called. There are two kinds of callout:
2788 those with a numerical argument and those with a string argument. (?C)
2789 on its own with no argument is treated as (?C0). A numerical argument
2790 allows the application to distinguish between different callouts.
2791 String arguments were added for release 10.20 to make it possible for
2792 script languages that use PCRE2 to embed short scripts within patterns
2793 in a similar way to Perl.
2794
2795 During matching, when PCRE2 reaches a callout point, the external func‐
2796 tion is called. It is provided with the number or string argument of
2797 the callout, the position in the pattern, and one item of data that is
2798 also set in the match block. The callout function may cause matching to
2799 proceed, to backtrack, or to fail.
2800
2801 By default, PCRE2 implements a number of optimizations at matching
2802 time, and one side-effect is that sometimes callouts are skipped. If
2803 you need all possible callouts to happen, you need to set options that
2804 disable the relevant optimizations. More details, including a complete
2805 description of the programming interface to the callout function, are
2806 given in the pcre2callout documentation.
2807
2808 Callouts with numerical arguments
2809
2810 If you just want to have a means of identifying different callout
2811 points, put a number less than 256 after the letter C. For example,
2812 this pattern has two callout points:
2813
2814 (?C1)abc(?C2)def
2815
2816 If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical
2817 callouts are automatically installed before each item in the pattern.
2818 They are all numbered 255. If there is a conditional group in the pat‐
2819 tern whose condition is an assertion, an additional callout is inserted
2820 just before the condition. An explicit callout may also be set at this
2821 position, as in this example:
2822
2823 (?(?C9)(?=a)abc|def)
2824
2825 Note that this applies only to assertion conditions, not to other types
2826 of condition.
2827
2828 Callouts with string arguments
2829
2830 A delimited string may be used instead of a number as a callout argu‐
2831 ment. The starting delimiter must be one of ` ' " ^ % # $ { and the
2832 ending delimiter is the same as the start, except for {, where the end‐
2833 ing delimiter is }. If the ending delimiter is needed within the
2834 string, it must be doubled. For example:
2835
2836 (?C'ab ''c'' d')xyz(?C{any text})pqr
2837
2838 The doubling is removed before the string is passed to the callout
2839 function.
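
The delimiter rules can be expressed as a small parser. The Python
below is a sketch written from the description above, not a copy of
PCRE2's own code; parse_callout_string is an invented name, and its
argument is assumed to be the pattern text that follows "(?C".

```python
# Opening delimiter -> closing delimiter, as listed above.
CLOSERS = {"`": "`", "'": "'", '"': '"', "^": "^",
           "%": "%", "#": "#", "$": "$", "{": "}"}


def parse_callout_string(text):
    """Return (argument, offset just past the closing delimiter)."""
    if not text or text[0] not in CLOSERS:
        raise ValueError("not a valid starting delimiter")
    closer = CLOSERS[text[0]]
    pieces = []
    i = 1
    while i < len(text):
        if text[i] == closer:
            if i + 1 < len(text) and text[i + 1] == closer:
                pieces.append(closer)  # doubled: one literal delimiter
                i += 2
                continue
            return "".join(pieces), i + 1
        pieces.append(text[i])
        i += 1
    raise ValueError("missing ending delimiter")


print(parse_callout_string("'ab ''c'' d')xyz"))
print(parse_callout_string("{any text})pqr"))
```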
2840
2842
2843 There are a number of special "Backtracking Control Verbs" (to use
2844 Perl's terminology) that modify the behaviour of backtracking during
2845 matching. They are generally of the form (*VERB) or (*VERB:NAME). Some
2846 verbs take either form, possibly behaving differently depending on
2847 whether or not a name is present. The names are not required to be
2848 unique within the pattern.
2849
2850 By default, for compatibility with Perl, a name is any sequence of
2851 characters that does not include a closing parenthesis. The name is not
2852 processed in any way, and it is not possible to include a closing
2853 parenthesis in the name. This can be changed by setting the
2854 PCRE2_ALT_VERBNAMES option, but the result is no longer Perl-compati‐
2855 ble.
2856
2857 When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to
2858 verb names and only an unescaped closing parenthesis terminates the
2859 name. However, the only backslash items that are permitted are \Q, \E,
2860 and sequences such as \x{100} that define character code points. Char‐
2861 acter type escapes such as \d are faulted.
2862
2863 A closing parenthesis can be included in a name either as \) or between
2864 \Q and \E. In addition to backslash processing, if the PCRE2_EXTENDED
2865 or PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb
2866 names is skipped, and #-comments are recognized, exactly as in the rest
2867 of the pattern. PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect
2868 verb names unless PCRE2_ALT_VERBNAMES is also set.
2869
2870 The maximum length of a name is 255 in the 8-bit library and 65535 in
2871 the 16-bit and 32-bit libraries. If the name is empty, that is, if the
2872 closing parenthesis immediately follows the colon, the effect is as if
2873 the colon were not there. Any number of these verbs may occur in a pat‐
2874 tern.
2875
2876 Since these verbs are specifically related to backtracking, most of
2877 them can be used only when the pattern is to be matched using the tra‐
2878 ditional matching function, because that uses a backtracking algorithm.
2879 With the exception of (*FAIL), which behaves like a failing negative
2880 assertion, the backtracking control verbs cause an error if encountered
2881 by the DFA matching function.
2882
2883 The behaviour of these verbs in repeated groups, assertions, and in
2884 capture groups called as subroutines (whether or not recursively) is
2885 documented below.
2886
2887 Optimizations that affect backtracking verbs
2888
2889 PCRE2 contains some optimizations that are used to speed up matching by
2890 running some checks at the start of each match attempt. For example, it
2891 may know the minimum length of matching subject, or that a particular
2892 character must be present. When one of these optimizations bypasses the
2893 running of a match, any included backtracking verbs will not, of
2894 course, be processed. You can suppress the start-of-match optimizations
2895 by setting the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com‐
2896 pile(), or by starting the pattern with (*NO_START_OPT). There is more
2897 discussion of this option in the section entitled "Compiling a pattern"
2898 in the pcre2api documentation.
2899
2900 Experiments with Perl suggest that it too has similar optimizations,
2901 and like PCRE2, turning them off can change the result of a match.
2902
2903 Verbs that act immediately
2904
2905 The following verbs act as soon as they are encountered.
2906
2907 (*ACCEPT) or (*ACCEPT:NAME)
2908
2909 This verb causes the match to end successfully, skipping the remainder
2910 of the pattern. However, when it is inside a capture group that is
2911 called as a subroutine, only that group is ended successfully. Matching
2912 then continues at the outer level. If (*ACCEPT) is triggered in a
2913 positive assertion, the assertion succeeds; in a negative assertion,
2914 the assertion fails.
2915
2916 If (*ACCEPT) is inside capturing parentheses, the data so far is cap‐
2917 tured. For example:
2918
2919 A((?:A|B(*ACCEPT)|C)D)
2920
2921 This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap‐
2922 tured by the outer parentheses.
2923
2924 Warning: (*ACCEPT) should not be used within a script run group,
2925 because it causes an immediate exit from the group, bypassing the
2926 script run checking.
2927
2928 (*FAIL) or (*FAIL:NAME)
2929
2930 This verb causes a matching failure, forcing backtracking to occur. It
2931 may be abbreviated to (*F). It is equivalent to (?!) but easier to
2932 read. The Perl documentation notes that it is probably useful only when
2933 combined with (?{}) or (??{}). Those are, of course, Perl features that
2934 are not present in PCRE2. The nearest equivalent is the callout fea‐
2935 ture, as for example in this pattern:
2936
2937 a+(?C)(*FAIL)
2938
2939 A match with the string "aaaa" always fails, but the callout is taken
2940 before each backtrack happens (in this example, 10 times).
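
The figure of 10 can be checked by counting by hand, or with a short
simulation such as the Python below. It assumes a plain backtracking
matcher with no optimizations that tries every starting position in
turn; count_callouts is an invented name, and the code models only
this one pattern.

```python
def count_callouts(subject):
    """Count how often (?C) is reached for a+(?C)(*FAIL)."""
    calls = 0
    for start in range(len(subject) + 1):
        # Greedy a+: find the longest run of "a" at this start.
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # (*FAIL) forces a+ to give back one character at a time; the
        # callout is reached once for each length from run down to 1.
        for _length in range(run, 0, -1):
            calls += 1
    return calls


print(count_callouts("aaaa"))
```

The four starting positions contribute 4, 3, 2 and 1 callouts
respectively.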
2941
2942 (*ACCEPT:NAME) and (*FAIL:NAME) are treated as (*MARK:NAME)(*ACCEPT)
2943 and (*MARK:NAME)(*FAIL), respectively.
2944
2945 Recording which path was taken
2946
2947 There is one verb whose main purpose is to track how a match was
2948 arrived at, though it also has a secondary use in conjunction with
2949 advancing the match starting point (see (*SKIP) below).
2950
2951 (*MARK:NAME) or (*:NAME)
2952
2953 A name is always required with this verb. For all the other backtrack‐
2954 ing control verbs, a NAME argument is optional.
2955
2956 When a match succeeds, the last-encountered mark name on the matching
2957 path is passed back to the caller as described in the section entitled
2958 "Other information about the match" in the pcre2api documentation.
2959 This applies to all instances of (*MARK) and other verbs,
2960 including those inside assertions and atomic groups. However, there are
2961 differences in those cases when (*MARK) is used in conjunction with
2962 (*SKIP) as described below.
2963
2964 The mark name that was last encountered on the matching path is passed
2965 back. A verb without a NAME argument is ignored for this purpose. Here
2966 is an example of pcre2test output, where the "mark" modifier requests
2967 the retrieval and outputting of (*MARK) data:
2968
2969 re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
2970 data> XY
2971 0: XY
2972 MK: A
2973 data> XZ
2974 0: XZ
2975 MK: B
2976
2977 The (*MARK) name is tagged with "MK:" in this output, and in this exam‐
2978 ple it indicates which of the two alternatives matched. This is a more
2979 efficient way of obtaining this information than putting each alterna‐
2980 tive in its own capturing parentheses.
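
For comparison, this is what the capturing-parentheses approach looks
like in a library that has no (*MARK). The sketch uses Python's re
module, not PCRE2; the group names A and B simply echo the mark names
above, and which_alternative is an invented helper.

```python
import re

# Each alternative is wrapped in its own named capture group.
tagged = re.compile(r"(?P<A>XY)|(?P<B>XZ)")


def which_alternative(subject):
    """Return the name of the alternative that matched, or None."""
    found = tagged.search(subject)
    return None if found is None else found.lastgroup


print(which_alternative("XY"))
print(which_alternative("XZ"))
```

The information obtained is the same, but every alternative costs a
capture group, which is the overhead that (*MARK) avoids.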
2981
2982 If a verb with a name is encountered in a positive assertion that is
2983 true, the name is recorded and passed back if it is the last-encoun‐
2984 tered. This does not happen for negative assertions or failing positive
2985 assertions.
2986
2987 After a partial match or a failed match, the last encountered name in
2988 the entire match process is returned. For example:
2989
2990 re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
2991 data> XP
2992 No match, mark = B
2993
2994 Note that in this unanchored example the mark is retained from the
2995 match attempt that started at the letter "X" in the subject. Subsequent
2996 match attempts starting at "P" and then with an empty string do not get
2997 as far as the (*MARK) item, but nevertheless do not reset it.
2998
2999 If you are interested in (*MARK) values after failed matches, you
3000 should probably set the PCRE2_NO_START_OPTIMIZE option (see above) to
3001 ensure that the match is always attempted.
3002
3003 Verbs that act after backtracking
3004
3005 The following verbs do nothing when they are encountered. Matching con‐
3006 tinues with what follows, but if there is a subsequent match failure,
3007 causing a backtrack to the verb, a failure is forced. That is, back‐
3008 tracking cannot pass to the left of the verb. However, when one of
3009 these verbs appears inside an atomic group or in a lookaround assertion
3010 that is true, its effect is confined to that group, because once the
3011 group has been matched, there is never any backtracking into it. Back‐
3012 tracking from beyond an assertion or an atomic group ignores the entire
3013 group, and seeks a preceding backtracking point.
3014
3015 These verbs differ in exactly what kind of failure occurs when back‐
3016 tracking reaches them. The behaviour described below is what happens
3017 when the verb is not in a subroutine or an assertion. Subsequent sec‐
3018 tions cover these special cases.
3019
3020 (*COMMIT) or (*COMMIT:NAME)
3021
3022 This verb causes the whole match to fail outright if there is a later
3023 matching failure that causes backtracking to reach it. Even if the pat‐
3024 tern is unanchored, no further attempts to find a match by advancing
3025 the starting point take place. If (*COMMIT) is the only backtracking
3026 verb that is encountered, once it has been passed pcre2_match() is com‐
3027 mitted to finding a match at the current starting point, or not at all.
3028 For example:
3029
3030 a+(*COMMIT)b
3031
3032 This matches "xxaab" but not "aacaab". It can be thought of as a kind
3033 of dynamic anchor, or "I've started, so I must finish."
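
The effect on this particular pattern can be imitated in a few lines
of Python. The function below is a hand-written model of a+(*COMMIT)b
under the assumption of a simple backtracking matcher with no
start-of-match optimizations; it is not PCRE2 code, and match_commit
is an invented name.

```python
def match_commit(subject):
    """Model a+(*COMMIT)b; return the matched text or None."""
    for start in range(len(subject)):
        i = start
        while i < len(subject) and subject[i] == "a":
            i += 1
        if i == start:
            # a+ failed before (*COMMIT) was reached, so the usual
            # advance to the next starting point still happens.
            continue
        # (*COMMIT) has now been passed.
        if i < len(subject) and subject[i] == "b":
            return subject[start:i + 1]
        # Failure after (*COMMIT): no backtracking into a+, and no
        # further starting points are tried.
        return None
    return None


print(match_commit("xxaab"))
print(match_commit("aacaab"))
```

In "aacaab" the attempt at the first character passes (*COMMIT) and
then fails at "c", so the later "aab" is never looked at.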
3034
3035 The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COM‐
3036 MIT). It is like (*MARK:NAME) in that the name is remembered for pass‐
3037 ing back to the caller. However, (*SKIP:NAME) searches only for names
3038 that are set with (*MARK), ignoring those set by any of the other back‐
3039 tracking verbs.
3040
3041 If there is more than one backtracking verb in a pattern, a different
3042 one that follows (*COMMIT) may be triggered first, so merely passing
3043 (*COMMIT) during a match does not always guarantee that a match must be
3044 at this starting point.
3045
3046 Note that (*COMMIT) at the start of a pattern is not the same as an
3047 anchor, unless PCRE2's start-of-match optimizations are turned off, as
3048 shown in this output from pcre2test:
3049
3050 re> /(*COMMIT)abc/
3051 data> xyzabc
3052 0: abc
3053 data>
3054 re> /(*COMMIT)abc/no_start_optimize
3055 data> xyzabc
3056 No match
3057
3058 For the first pattern, PCRE2 knows that any match must start with "a",
3059 so the optimization skips along the subject to "a" before applying the
3060 pattern to the first set of data. The match attempt then succeeds. The
3061 second pattern disables the optimization that skips along to the first
3062 character. The pattern is now applied starting at "x", and so the
3063 (*COMMIT) causes the match to fail without trying any other starting
3064 points.
3065
3066 (*PRUNE) or (*PRUNE:NAME)
3067
3068 This verb causes the match to fail at the current starting position in
3069 the subject if there is a later matching failure that causes backtrack‐
3070 ing to reach it. If the pattern is unanchored, the normal "bumpalong"
3071 advance to the next starting character then happens. Backtracking can
3072 occur as usual to the left of (*PRUNE), before it is reached, or when
3073 matching to the right of (*PRUNE), but if there is no match to the
3074 right, backtracking cannot cross (*PRUNE). In simple cases, the use of
3075 (*PRUNE) is just an alternative to an atomic group or possessive quan‐
3076 tifier, but there are some uses of (*PRUNE) that cannot be expressed in
3077 any other way. In an anchored pattern (*PRUNE) has the same effect as
3078 (*COMMIT).
3079
3080 The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
3081 It is like (*MARK:NAME) in that the name is remembered for passing back
3082 to the caller. However, (*SKIP:NAME) searches only for names set with
3083 (*MARK), ignoring those set by other backtracking verbs.
3084
3085 (*SKIP)
3086
3087 This verb, when given without a name, is like (*PRUNE), except that if
3088 the pattern is unanchored, the "bumpalong" advance is not to the next
3089 character, but to the position in the subject where (*SKIP) was encoun‐
3090 tered. (*SKIP) signifies that whatever text was matched leading up to
3091 it cannot be part of a successful match if there is a later mismatch.
3092 Consider:
3093
3094 a+(*SKIP)b
3095
3096 If the subject is "aaaac...", after the first match attempt fails
3097 (starting at the first character in the string), the starting point
3098 skips on to start the next attempt at "c". Note that a possessive
3099 quantifier does not have the same effect as this example; although it
3100 would suppress backtracking during the first match attempt, the second
3101 attempt would start at the second character instead of skipping on to
3102 "c".
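
The difference in starting points can be made visible with a small
simulation. The Python below models only the two patterns a+(*SKIP)b
and a++b, assumes a simple matcher with no start-of-match
optimizations, and records every offset at which a match attempt
begins; start_positions is an invented name.

```python
def start_positions(subject, use_skip):
    """Return (offsets at which attempts started, matched text)."""
    starts = []
    start = 0
    while start <= len(subject):
        starts.append(start)
        i = start
        while i < len(subject) and subject[i] == "a":
            i += 1
        if i > start and i < len(subject) and subject[i] == "b":
            return starts, subject[start:i + 1]
        if use_skip and i > start:
            start = i   # advance to where (*SKIP) was encountered
        else:
            start += 1  # normal advance by one character
    return starts, None


print(start_positions("aaaac", use_skip=True))
print(start_positions("aaaac", use_skip=False))
```

With (*SKIP) the attempts begin at offsets 0, 4 and 5; with the
possessive quantifier they begin at every offset from 0 to 5.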
3103
3104 (*SKIP:NAME)
3105
3106 When (*SKIP) has an associated name, its behaviour is modified. When
3107 such a (*SKIP) is triggered, the previous path through the pattern is
3108 searched for the most recent (*MARK) that has the same name. If one is
3109 found, the "bumpalong" advance is to the subject position that corre‐
3110 sponds to that (*MARK) instead of to where (*SKIP) was encountered. If
3111 no (*MARK) with a matching name is found, the (*SKIP) is ignored.
3112
3113 The search for a (*MARK) name uses the normal backtracking mechanism,
3114 which means that it does not see (*MARK) settings that are inside
3115 atomic groups or assertions, because they are never re-entered by back‐
3116 tracking. Compare the following pcre2test examples:
3117
3118 re> /a(?>(*MARK:X))(*SKIP:X)(*F)|(.)/
3119 data> abc
3120 0: a
3121 1: a
3122 data>
3123 re> /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/
3124 data> abc
3125 0: b
3126 1: b
3128 In the first example, the (*MARK) setting is in an atomic group, so it
3129 is not seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored.
3130 This allows the second branch of the pattern to be tried at the first
3131 character position. In the second example, the (*MARK) setting is not
3132 in an atomic group. This allows (*SKIP:X) to find the (*MARK) when it
3133 backtracks, and this causes a new matching attempt to start at the sec‐
3134 ond character. This time, the (*MARK) is never seen because "a" does
3135 not match "b", so the matcher immediately jumps to the second branch of
3136 the pattern.
3137
3138 Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It
3139 ignores names that are set by other backtracking verbs.
3140
3141 (*THEN) or (*THEN:NAME)
3142
3143 This verb causes a skip to the next innermost alternative when back‐
3144 tracking reaches it. That is, it cancels any further backtracking
3145 within the current alternative. Its name comes from the observation
3146 that it can be used for a pattern-based if-then-else block:
3147
3148 ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
3149
3150 If the COND1 pattern matches, FOO is tried (and possibly further items
3151 after the end of the group if FOO succeeds); on failure, the matcher
3152 skips to the second alternative and tries COND2, without backtracking
3153 into COND1. If that succeeds and BAR fails, COND3 is tried. If subse‐
3154 quently BAZ fails, there are no more alternatives, so there is a back‐
3155 track to whatever came before the entire group. If (*THEN) is not
3156 inside an alternation, it acts like (*PRUNE).
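
The if-then-else reading can be sketched in Python as follows. This is
a simplified model, not PCRE2: each alternative is given as a pair of
Python regular expressions (condition, body), nothing is assumed to
follow the group, and match_with_then is an invented name.

```python
import re


def match_with_then(alternatives, subject, pos=0):
    """Model ( COND1 (*THEN) BODY1 | COND2 (*THEN) BODY2 | ... ).

    Returns the end offset of the match, or None.
    """
    for cond, body in alternatives:
        cond_match = re.compile(cond).match(subject, pos)
        if cond_match is None:
            continue  # condition failed: ordinary move to next branch
        body_match = re.compile(body).match(subject, cond_match.end())
        if body_match is not None:
            return body_match.end()
        # Body failed after (*THEN) was passed: go to the next
        # alternative without retrying the condition another way.
    return None


# Model of (?:a+(*THEN)ab|a) applied to "aab"
print(match_with_then([("a+", "ab"), ("a", "")], "aab"))
# The same alternation without (*THEN), using ordinary backtracking
print(re.match(r"(?:a+ab|a)", "aab").end())
```

With (*THEN), a+ takes "aa", the body fails, and the second
alternative matches just "a". Without it, a+ is allowed to give back
a character and the first alternative matches all of "aab".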
3157
3158 The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN).
3159 It is like (*MARK:NAME) in that the name is remembered for passing back
3160 to the caller. However, (*SKIP:NAME) searches only for names set with
3161 (*MARK), ignoring those set by other backtracking verbs.
3162
3163 A group that does not contain a | character is just a part of the
3164 enclosing alternative; it is not a nested alternation with only one
3165 alternative. The effect of (*THEN) extends beyond such a group to the
3166 enclosing alternative. Consider this pattern, where A, B, etc. are
3167 complex pattern fragments that do not contain any | characters at this
3168 level:
3169
3170 A (B(*THEN)C) | D
3171
3172 If A and B are matched, but there is a failure in C, matching does not
3173 backtrack into A; instead it moves to the next alternative, that is, D.
3174 However, if the group containing (*THEN) is given an alternative, it
3175 behaves differently:
3176
3177 A (B(*THEN)C | (*FAIL)) | D
3178
3179 The effect of (*THEN) is now confined to the inner group. After a fail‐
3180 ure in C, matching moves to (*FAIL), which causes the whole group to
3181 fail because there are no more alternatives to try. In this case,
3182 matching does backtrack into A.
3183
3184 Note that a conditional group is not considered as having two alterna‐
3185 tives, because only one is ever used. In other words, the | character
3186 in a conditional group has a different meaning. Ignoring white space,
3187 consider:
3188
3189 ^.*? (?(?=a) a | b(*THEN)c )
3190
3191 If the subject is "ba", this pattern does not match. Because .*? is
3192 ungreedy, it initially matches zero characters. The condition (?=a)
3193 then fails, the character "b" is matched, but "c" is not. At this
3194 point, matching does not backtrack to .*? as might perhaps be expected
3195 from the presence of the | character. The conditional group is part of
3196 the single alternative that comprises the whole pattern, and so the
3197 match fails. (If there was a backtrack into .*?, allowing it to match
3198 "b", the match would succeed.)
3199
3200 The verbs just described provide four different "strengths" of control
3201 when subsequent matching fails. (*THEN) is the weakest, carrying on the
3202 match at the next alternative. (*PRUNE) comes next, failing the match
3203 at the current starting position, but allowing an advance to the next
3204 character (for an unanchored pattern). (*SKIP) is similar, except that
3205 the advance may be more than one character. (*COMMIT) is the strongest,
3206 causing the entire match to fail.
3207
3208 More than one backtracking verb
3209
3210 If more than one backtracking verb is present in a pattern, the one
3211 that is backtracked onto first acts. For example, consider this pat‐
3212 tern, where A, B, etc. are complex pattern fragments:
3213
3214 (A(*COMMIT)B(*THEN)C|ABD)
3215
3216 If A matches but B fails, the backtrack to (*COMMIT) causes the entire
3217 match to fail. However, if A and B match, but C fails, the backtrack to
3218 (*THEN) causes the next alternative (ABD) to be tried. This behaviour
3219 is consistent, but is not always the same as Perl's. It means that if
3220 two or more backtracking verbs appear in succession, all but the last
3221 of them have no effect. Consider this example:
3222
3223 ...(*COMMIT)(*PRUNE)...
3224
3225 If there is a matching failure to the right, backtracking onto (*PRUNE)
3226 causes it to be triggered, and its action is taken. There can never be
3227 a backtrack onto (*COMMIT).
3228
3229 Backtracking verbs in repeated groups
3230
3231 PCRE2 sometimes differs from Perl in its handling of backtracking verbs
3232 in repeated groups. For example, consider:
3233
3234 /(a(*COMMIT)b)+ac/
3235
3236 If the subject is "abac", Perl matches unless its optimizations are
3237 disabled, but PCRE2 always fails because the (*COMMIT) in the second
3238 repeat of the group acts.
3239
3240 Backtracking verbs in assertions
3241
3242 (*FAIL) in any assertion has its normal effect: it forces an immediate
3243 backtrack. The behaviour of the other backtracking verbs depends on
3244 whether the assertion is standalone or is acting as the condition in
3245 a conditional group.
3246
3247 (*ACCEPT) in a standalone positive assertion causes the assertion to
3248 succeed without any further processing; captured strings and a mark
3249 name (if set) are retained. In a standalone negative assertion,
3250 (*ACCEPT) causes the assertion to fail without any further processing;
3251 captured substrings and any mark name are discarded.
3252
3253 If the assertion is a condition, (*ACCEPT) causes the condition to be
3254 true for a positive assertion and false for a negative one; captured
3255 substrings are retained in both cases.
3256
3257 The remaining verbs act only when a later failure causes a backtrack to
3258 reach them. This means that their effect is confined to the assertion,
3259 because lookaround assertions are atomic. A backtrack that occurs after
3260 an assertion is complete does not jump back into the assertion. Note in
3261 particular that a (*MARK) name that is set in an assertion is not
3262 "seen" by an instance of (*SKIP:NAME) later in the pattern.
3263
3264 The effect of (*THEN) is not allowed to escape beyond an assertion. If
3265 there are no more branches to try, (*THEN) causes a positive assertion
3266 to be false, and a negative assertion to be true.
3267
3268 The other backtracking verbs are not treated specially if they appear
3269 in a standalone positive assertion. In a conditional positive asser‐
3270 tion, backtracking (from within the assertion) into (*COMMIT), (*SKIP),
3271 or (*PRUNE) causes the condition to be false. However, for both stand‐
3272 alone and conditional negative assertions, backtracking into (*COMMIT),
3273 (*SKIP), or (*PRUNE) causes the assertion to be true, without consider‐
3274 ing any further alternative branches.
3275
3276 Backtracking verbs in subroutines
3277
3278 These behaviours occur whether or not the group is called recursively.
3279
3280 (*ACCEPT) in a group called as a subroutine causes the subroutine match
3281 to succeed without any further processing. Matching then continues
3282 after the subroutine call. Perl documents this behaviour. Perl's treat‐
3283 ment of the other verbs in subroutines is different in some cases.
3284
3285 (*FAIL) in a group called as a subroutine has its normal effect: it
3286 forces an immediate backtrack.
3287
3288 (*COMMIT), (*SKIP), and (*PRUNE) cause the subroutine match to fail
3289 when triggered by being backtracked to in a group called as a subrou‐
3290 tine. There is then a backtrack at the outer level.
3291
3292 (*THEN), when triggered, skips to the next alternative in the innermost
3293 enclosing group that has alternatives (its normal behaviour). However,
3294 if there is no such group within the subroutine's group, the subroutine
3295 match fails and there is a backtrack at the outer level.
3296
3297 SEE ALSO
3298
3299 pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3),
3300 pcre2(3).
3301
3302 AUTHOR
3303
3304 Philip Hazel
3305 University Computing Service
3306 Cambridge, England.
3307
3308 REVISION
3309
3310 Last updated: 12 February 2019
3311 Copyright (c) 1997-2019 University of Cambridge.
3312
3313
3314
3315PCRE2 10.33 12 February 2019 PCRE2PATTERN(3)