re(3)                    Erlang Module Definition                    re(3)
2
3
4
6 re - Perl-like regular expressions for Erlang.
7
9 This module contains regular expression matching functions for strings
10 and binaries.
11
12 The regular expression syntax and semantics resemble that of Perl.
13
14 The matching algorithms of the library are based on the PCRE library,
15 but not all of the PCRE library is interfaced and some parts of the
16 library go beyond what PCRE offers. Currently PCRE version 8.40
17 (release date 2017-01-11) is used. The sections of the PCRE documenta‐
18 tion that are relevant to this module are included here.
19
20 Note:
21 The Erlang literal syntax for strings uses the "\" (backslash) charac‐
22 ter as an escape code. You need to escape backslashes in literal
23 strings, both in your code and in the shell, with an extra backslash,
24 that is, "\\".
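
As a brief illustration of the note above, the following sketch
matches a run of digits. The subject string is made up for the
example.

```erlang
%% The regular expression \d+ is written "\\d+" in an Erlang string literal.
Digits = re:run("abc123", "\\d+").
%% Digits = {match,[{3,3}]}
```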
25
26
28 mp() = {re_pattern, term(), term(), term(), term()}
29
30 Opaque data type containing a compiled regular expression. mp()
31 is guaranteed to be a tuple() having the atom re_pattern as its
32 first element, to allow for matching in guards. The arity of the
33 tuple or the content of the other fields can change in future
34 Erlang/OTP releases.
35
36 nl_spec() = cr | crlf | lf | anycrlf | any
37
38 compile_option() =
39 unicode |
40 anchored |
41 caseless |
42 dollar_endonly |
43 dotall |
44 extended |
45 firstline |
46 multiline |
47 no_auto_capture |
48 dupnames |
49 ungreedy |
50 {newline, nl_spec()} |
51 bsr_anycrlf |
52 bsr_unicode |
53 no_start_optimize |
54 ucp |
55 never_utf
56
58 version() -> binary()
59
Returns a binary containing the version string of the PCRE
library that Erlang/OTP was compiled with.
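
A minimal sketch of calling this function. The exact contents of
the binary depend on the PCRE release bundled with the running
system, so no particular value is assumed here.

```erlang
%% Only the type of the result is checked, not its contents.
Version = re:version(),
true = is_binary(Version).
```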
62
63 compile(Regexp) -> {ok, MP} | {error, ErrSpec}
64
65 Types:
66
67 Regexp = iodata()
68 MP = mp()
69 ErrSpec =
70 {ErrString :: string(), Position :: integer() >= 0}
71
The same as compile(Regexp, []).
73
74 compile(Regexp, Options) -> {ok, MP} | {error, ErrSpec}
75
76 Types:
77
78 Regexp = iodata() | unicode:charlist()
79 Options = [Option]
80 Option = compile_option()
81 MP = mp()
82 ErrSpec =
83 {ErrString :: string(), Position :: integer() >= 0}
84
85 Compiles a regular expression, with the syntax described below,
86 into an internal format to be used later as a parameter to run/2
87 and run/3.
88
89 Compiling the regular expression before matching is useful if
90 the same expression is to be used in matching against multiple
91 subjects during the lifetime of the program. Compiling once and
92 executing many times is far more efficient than compiling each
93 time one wants to match.
94
95 When option unicode is specified, the regular expression is to
96 be specified as a valid Unicode charlist(), otherwise as any
97 valid iodata().
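
The compile-once, run-many pattern described above can be
sketched as follows. The pattern and the subject list are
illustrative only.

```erlang
%% Compile the pattern once ...
{ok, MP} = re:compile("[0-9]+"),
Subjects = ["abc", "a1", "42"],
%% ... and reuse the compiled mp() for every subject.
Results = [re:run(S, MP, [{capture, first, list}]) || S <- Subjects].
%% Results = [nomatch, {match,["1"]}, {match,["42"]}]
```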
98
99 Options:
100
101 unicode:
102 The regular expression is specified as a Unicode charlist()
103 and the resulting regular expression code is to be run
104 against a valid Unicode charlist() subject. Also consider
105 option ucp when using Unicode characters.
106
107 anchored:
108 The pattern is forced to be "anchored", that is, it is con‐
109 strained to match only at the first matching point in the
110 string that is searched (the "subject string"). This effect
111 can also be achieved by appropriate constructs in the pat‐
112 tern itself.
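
A small sketch of the effect, using a made-up subject in which
the pattern only occurs at offset 1:

```erlang
%% Without anchored, the search may start anywhere in the subject.
Unanchored = re:run("xabc", "abc"),            %% {match,[{1,3}]}
%% With anchored, only a match starting at offset 0 is accepted.
Anchored = re:run("xabc", "abc", [anchored]).  %% nomatch
```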
113
114 caseless:
115 Letters in the pattern match both uppercase and lowercase
116 letters. It is equivalent to Perl option /i and can be
117 changed within a pattern by a (?i) option setting. Uppercase
118 and lowercase letters are defined as in the ISO 8859-1 char‐
119 acter set.
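
For illustration, a sketch with an arbitrary subject:

```erlang
%% Matching is case-sensitive by default.
Sensitive = re:run("HELLO", "hello"),               %% nomatch
%% With caseless, letters match regardless of case.
Insensitive = re:run("HELLO", "hello", [caseless]). %% {match,[{0,5}]}
```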
120
121 dollar_endonly:
122 A dollar metacharacter in the pattern matches only at the
123 end of the subject string. Without this option, a dollar
124 also matches immediately before a newline at the end of the
125 string (but not before any other newlines). This option is
126 ignored if option multiline is specified. There is no equiv‐
127 alent option in Perl, and it cannot be set within a pattern.
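
The difference can be sketched with a subject that ends in a
newline (the subject is chosen only for the example):

```erlang
%% By default, $ also matches just before a trailing newline.
Default = re:run("abc\n", "c$"),                    %% {match,[{2,1}]}
%% With dollar_endonly, $ matches only at the very end of the subject.
EndOnly = re:run("abc\n", "c$", [dollar_endonly]).  %% nomatch
```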
128
129 dotall:
130 A dot in the pattern matches all characters, including those
131 indicating newline. Without it, a dot does not match when
132 the current position is at a newline. This option is equiva‐
133 lent to Perl option /s and it can be changed within a pat‐
134 tern by a (?s) option setting. A negative class, such as
135 [^a], always matches newline characters, independent of the
136 setting of this option.
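
A minimal sketch, assuming the default LF newline convention:

```erlang
%% A dot does not match the newline between "a" and "b" by default.
NoDotAll = re:run("a\nb", "a.b"),           %% nomatch
%% With dotall, the dot matches the newline as well.
DotAll = re:run("a\nb", "a.b", [dotall]).   %% {match,[{0,3}]}
```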
137
138 extended:
139 If this option is set, most white space characters in the
140 pattern are totally ignored except when escaped or inside a
141 character class. However, white space is not allowed within
142 sequences such as (?> that introduce various parenthesized
143 subpatterns, nor within a numerical quantifier such as
144 {1,3}. However, ignorable white space is permitted between
145 an item and a following quantifier and between a quantifier
146 and a following + that indicates possessiveness.
147
White space formerly did not include the VT character (code
11), because Perl did not treat this character as white
space. However, Perl changed at release 5.18, so PCRE
followed at release 8.34, and VT is now treated as white space.
152
153 This also causes characters between an unescaped # outside a
154 character class and the next newline, inclusive, to be
155 ignored. This is equivalent to Perl's /x option, and it can
156 be changed within a pattern by a (?x) option setting.
157
158 With this option, comments inside complicated patterns can
159 be included. However, notice that this applies only to data
160 characters. Whitespace characters can never appear within
161 special character sequences in a pattern, for example within
162 sequence (?( that introduces a conditional subpattern.
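
As a sketch of how this can be used, the following commented
pattern extracts a year and a month. The pattern and subject are
hypothetical; note that the newline inside the string literal
ends the first comment.

```erlang
%% White space and "# ..." comments are ignored under extended.
Pattern = "(\\d{4})  # year\n - (\\d{2})  # month",
Parts = re:run("2017-01", Pattern,
               [extended, {capture, all_but_first, list}]).
%% Parts = {match,["2017","01"]}
```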
163
164 firstline:
165 An unanchored pattern is required to match before or at the
166 first newline in the subject string, although the matched
167 text can continue over the newline.
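
A short sketch with a two-line subject made up for the example:

```erlang
%% Without firstline, the match may start on the second line.
AnyLine = re:run("one\ntwo", "two"),                 %% {match,[{4,3}]}
%% With firstline, the match must start at or before the first newline.
FirstLine = re:run("one\ntwo", "two", [firstline]).  %% nomatch
```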
168
169 multiline:
170 By default, PCRE treats the subject string as consisting of
171 a single line of characters (even if it contains newlines).
172 The "start of line" metacharacter (^) matches only at the
173 start of the string, while the "end of line" metacharacter
174 ($) matches only at the end of the string, or before a ter‐
175 minating newline (unless option dollar_endonly is speci‐
176 fied). This is the same as in Perl.
177
178 When this option is specified, the "start of line" and "end
179 of line" constructs match immediately following or immedi‐
180 ately before internal newlines in the subject string,
181 respectively, as well as at the very start and end. This is
182 equivalent to Perl option /m and can be changed within a
183 pattern by a (?m) option setting. If there are no newlines
184 in a subject string, or no occurrences of ^ or $ in a pat‐
185 tern, setting multiline has no effect.
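
For illustration, a sketch using a two-line subject:

```erlang
%% By default, ^ and $ refer to the start and end of the whole subject.
SingleLine = re:run("one\ntwo", "^two$"),               %% nomatch
%% With multiline, they also match at internal newlines.
MultiLine = re:run("one\ntwo", "^two$", [multiline]).   %% {match,[{4,3}]}
```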
186
187 no_auto_capture:
188 Disables the use of numbered capturing parentheses in the
189 pattern. Any opening parenthesis that is not followed by ?
190 behaves as if it is followed by ?:. Named parentheses can
191 still be used for capturing (and they acquire numbers in the
192 usual way). There is no equivalent option in Perl.
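
The following sketch compares the captures with and without the
option. The group name x is hypothetical.

```erlang
RE = "(a)(?<x>b)",
%% Both the plain and the named group capture by default.
WithAuto = re:run("ab", RE, [{capture, all, list}]),
%% WithAuto = {match,["ab","a","b"]}
%% With no_auto_capture, only the named group captures.
NoAuto = re:run("ab", RE, [no_auto_capture, {capture, all, list}]).
%% NoAuto = {match,["ab","b"]}
```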
193
194 dupnames:
195 Names used to identify capturing subpatterns need not be
196 unique. This can be helpful for certain types of pattern
197 when it is known that only one instance of the named subpat‐
198 tern can ever be matched. More details of named subpatterns
199 are provided below.
200
201 ungreedy:
202 Inverts the "greediness" of the quantifiers so that they are
203 not greedy by default, but become greedy if followed by "?".
204 It is not compatible with Perl. It can also be set by a (?U)
205 option setting within the pattern.
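
A minimal sketch of the inverted greediness, with an arbitrary
subject:

```erlang
%% Greedy by default: .* consumes as much as possible.
Greedy = re:run("<a><b>", "<.*>", [{capture, first, list}]),
%% Greedy = {match,["<a><b>"]}
%% With ungreedy, the same quantifier consumes as little as possible.
Lazy = re:run("<a><b>", "<.*>", [ungreedy, {capture, first, list}]).
%% Lazy = {match,["<a>"]}
```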
206
207 {newline, NLSpec}:
208 Overrides the default definition of a newline in the subject
209 string, which is LF (ASCII 10) in Erlang.
210
211 cr:
Newline is indicated by a single character CR (ASCII 13).
213
214 lf:
215 Newline is indicated by a single character LF (ASCII 10),
216 the default.
217
218 crlf:
219 Newline is indicated by the two-character CRLF (ASCII 13
220 followed by ASCII 10) sequence.
221
222 anycrlf:
223 Any of the three preceding sequences is to be recognized.
224
225 any:
226 Any of the newline sequences above, and the Unicode
227 sequences VT (vertical tab, U+000B), FF (formfeed,
228 U+000C), NEL (next line, U+0085), LS (line separator,
229 U+2028), and PS (paragraph separator, U+2029).
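
As a sketch of why the newline convention matters, consider a
subject with a CRLF line ending (the subject is made up for the
example):

```erlang
%% With the default LF convention, "a" is followed by CR, so $ fails.
LfDefault = re:run("a\r\nb", "a$", [multiline]),                  %% nomatch
%% With {newline, crlf}, $ matches just before the CRLF pair.
CrLf = re:run("a\r\nb", "a$", [multiline, {newline, crlf}]).      %% {match,[{0,1}]}
```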
230
231 bsr_anycrlf:
Specifies that \R is to match only the CR, LF, or CRLF
sequences, not the Unicode-specific newline characters.
235
236 bsr_unicode:
Specifies that \R is to match all the Unicode newline
characters (including CRLF, and so on, the default).
239
240 no_start_optimize:
241 Disables optimization that can malfunction if "Special
242 start-of-pattern items" are present in the regular expres‐
243 sion. A typical example would be when matching "DEFABC"
244 against "(*COMMIT)ABC", where the start optimization of PCRE
245 would skip the subject up to "A" and never realize that the
246 (*COMMIT) instruction is to have made the matching fail.
247 This option is only relevant if you use "start-of-pattern
248 items", as discussed in section PCRE Regular Expression
249 Details.
250
251 ucp:
252 Specifies that Unicode character properties are to be used
253 when resolving \B, \b, \D, \d, \S, \s, \W and \w. Without
254 this flag, only ISO Latin-1 properties are used. Using Uni‐
255 code properties hurts performance, but is semantically cor‐
256 rect when working with Unicode characters beyond the ISO
257 Latin-1 range.
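
A sketch of the difference, using the Greek letter alpha (code
point 945), which lies outside the ISO Latin-1 range. Only
match/nomatch is requested, so no offsets are assumed.

```erlang
Alpha = [945],
%% Without ucp, \w does not recognize characters beyond Latin-1.
Plain = re:run(Alpha, "\\w", [unicode, {capture, none}]),      %% nomatch
%% With ucp, Unicode properties are consulted and alpha is a letter.
Ucp = re:run(Alpha, "\\w", [unicode, ucp, {capture, none}]).   %% match
```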
258
259 never_utf:
260 Specifies that the (*UTF) and/or (*UTF8) "start-of-pattern
261 items" are forbidden. This flag cannot be combined with
262 option unicode. Useful if ISO Latin-1 patterns from an
263 external source are to be compiled.
264
265 inspect(MP, Item) -> {namelist, [binary()]}
266
267 Types:
268
269 MP = mp()
270 Item = namelist
271
272 Takes a compiled regular expression and an item, and returns the
273 relevant data from the regular expression. The only supported
274 item is namelist, which returns the tuple {namelist,
275 [binary()]}, containing the names of all (unique) named subpat‐
276 terns in the regular expression. For example:
277
1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
{ok,{re_pattern,3,0,0,
                <<69,82,67,80,119,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255,
                  255,255,...>>}}
2> re:inspect(MP,namelist).
{namelist,[<<"A">>,<<"B">>,<<"C">>]}
3> {ok,MPD} = re:compile("(?<C>A)|(?<B>B)|(?<C>C)",[dupnames]).
{ok,{re_pattern,3,0,0,
                <<69,82,67,80,119,0,0,0,0,0,8,0,1,0,0,0,255,255,255,255,
                  255,255,...>>}}
4> re:inspect(MPD,namelist).
{namelist,[<<"B">>,<<"C">>]}
290
291 Notice in the second example that the duplicate name only occurs
292 once in the returned list, and that the list is in alphabetical
293 order regardless of where the names are positioned in the regu‐
294 lar expression. The order of the names is the same as the order
295 of captured subexpressions if {capture, all_names} is specified
296 as an option to run/3. You can therefore create a name-to-value
297 mapping from the result of run/3 like this:
298
1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
{ok,{re_pattern,3,0,0,
                <<69,82,67,80,119,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255,
                  255,255,...>>}}
2> {namelist, N} = re:inspect(MP,namelist).
{namelist,[<<"A">>,<<"B">>,<<"C">>]}
3> {match,L} = re:run("AA",MP,[{capture,all_names,binary}]).
{match,[<<"A">>,<<>>,<<>>]}
4> NameMap = lists:zip(N,L).
[{<<"A">>,<<"A">>},{<<"B">>,<<>>},{<<"C">>,<<>>}]
309
310 replace(Subject, RE, Replacement) -> iodata() | unicode:charlist()
311
312 Types:
313
314 Subject = iodata() | unicode:charlist()
315 RE = mp() | iodata()
316 Replacement = iodata() | unicode:charlist()
317
318 Same as replace(Subject, RE, Replacement, []).
319
320 replace(Subject, RE, Replacement, Options) ->
321 iodata() | unicode:charlist()
322
323 Types:
324
325 Subject = iodata() | unicode:charlist()
326 RE = mp() | iodata() | unicode:charlist()
327 Replacement = iodata() | unicode:charlist()
328 Options = [Option]
329 Option =
330 anchored |
331 global |
332 notbol |
333 noteol |
334 notempty |
335 notempty_atstart |
336 {offset, integer() >= 0} |
337 {newline, NLSpec} |
338 bsr_anycrlf |
339 {match_limit, integer() >= 0} |
340 {match_limit_recursion, integer() >= 0} |
341 bsr_unicode |
342 {return, ReturnType} |
343 CompileOpt
344 ReturnType = iodata | list | binary
345 CompileOpt = compile_option()
346 NLSpec = cr | crlf | lf | anycrlf | any
347
348 Replaces the matched part of the Subject string with the con‐
349 tents of Replacement.
350
351 The permissible options are the same as for run/3, except that
352 option capture is not allowed. Instead a {return, ReturnType} is
353 present. The default return type is iodata, constructed in a way
354 to minimize copying. The iodata result can be used directly in
355 many I/O operations. If a flat list() is desired, specify
356 {return, list}. If a binary is desired, specify {return,
357 binary}.
358
359 As in function run/3, an mp() compiled with option unicode
360 requires Subject to be a Unicode charlist(). If compilation is
361 done implicitly and the unicode compilation option is specified
to this function, both the regular expression and Subject are to
be specified as valid Unicode charlist()s.
364
The replacement string can contain the special character &,
which inserts the whole matching expression in the result, and
the special sequences \N (where N is an integer > 0), \gN, or
\g{N}, which insert subexpression number N in the result. If no
subexpression with that number is generated by the regular
expression, nothing is inserted.
371
372 To insert an & or a \ in the result, precede it with a \. Notice
373 that Erlang already gives a special meaning to \ in literal
374 strings, so a single \ must be written as "\\" and therefore a
375 double \ as "\\\\".
376
377 Example:
378
379 re:replace("abcd","c","[&]",[{return,list}]).
380
381 gives
382
383 "ab[c]d"
384
385 while
386
387 re:replace("abcd","c","[\\&]",[{return,list}]).
388
389 gives
390
391 "ab[&]d"
392
393 As with run/3, compilation errors raise the badarg exception.
394 compile/2 can be used to get more information about the error.
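
As a further sketch, the following calls show subexpression
references and option global. The date and the separator strings
are made up for the example.

```erlang
%% \3, \2 and \1 (written with doubled backslashes) reorder the groups.
Reordered = re:replace("2017-01-11", "(\\d+)-(\\d+)-(\\d+)",
                       "\\3/\\2/\\1", [{return, list}]).
%% Reordered = "11/01/2017"

%% Without global only the first match is replaced; with it, all are.
Joined = re:replace("a-b-c", "-", "+", [global, {return, binary}]).
%% Joined = <<"a+b+c">>
```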
395
396 run(Subject, RE) -> {match, Captured} | nomatch
397
398 Types:
399
400 Subject = iodata() | unicode:charlist()
401 RE = mp() | iodata()
402 Captured = [CaptureData]
403 CaptureData = {integer(), integer()}
404
405 Same as run(Subject,RE,[]).
406
407 run(Subject, RE, Options) ->
408 {match, Captured} | match | nomatch | {error, ErrType}
409
410 Types:
411
412 Subject = iodata() | unicode:charlist()
413 RE = mp() | iodata() | unicode:charlist()
414 Options = [Option]
415 Option =
416 anchored |
417 global |
418 notbol |
419 noteol |
420 notempty |
421 notempty_atstart |
422 report_errors |
423 {offset, integer() >= 0} |
424 {match_limit, integer() >= 0} |
425 {match_limit_recursion, integer() >= 0} |
426 {newline, NLSpec :: nl_spec()} |
427 bsr_anycrlf |
428 bsr_unicode |
429 {capture, ValueSpec} |
430 {capture, ValueSpec, Type} |
431 CompileOpt
432 Type = index | list | binary
433 ValueSpec =
434 all | all_but_first | all_names | first | none | Val‐
435 ueList
436 ValueList = [ValueID]
437 ValueID = integer() | string() | atom()
438 CompileOpt = compile_option()
439 See compile/2.
440 Captured = [CaptureData] | [[CaptureData]]
441 CaptureData =
442 {integer(), integer()} | ListConversionData | binary()
443 ListConversionData =
444 string() |
445 {error, string(), binary()} |
446 {incomplete, string(), binary()}
447 ErrType =
448 match_limit | match_limit_recursion | {compile, Com‐
449 pileErr}
450 CompileErr =
451 {ErrString :: string(), Position :: integer() >= 0}
452
453 Executes a regular expression matching, and returns
454 match/{match, Captured} or nomatch. The regular expression can
455 be specified either as iodata() in which case it is automati‐
456 cally compiled (as by compile/2) and executed, or as a precom‐
457 piled mp() in which case it is executed against the subject
458 directly.
459
460 When compilation is involved, exception badarg is thrown if a
461 compilation error occurs. Call compile/2 to get information
462 about the location of the error in the regular expression.
463
464 If the regular expression is previously compiled, the option
465 list can only contain the following options:
466
467 * anchored
468
469 * {capture, ValueSpec}/{capture, ValueSpec, Type}
470
471 * global
472
473 * {match_limit, integer() >= 0}
474
475 * {match_limit_recursion, integer() >= 0}
476
477 * {newline, NLSpec}
478
479 * notbol
480
481 * notempty
482
483 * notempty_atstart
484
485 * noteol
486
487 * {offset, integer() >= 0}
488
489 * report_errors
490
491 Otherwise all options valid for function compile/2 are also
492 allowed. Options allowed both for compilation and execution of a
493 match, namely anchored and {newline, NLSpec}, affect both the
494 compilation and execution if present together with a non-precom‐
495 piled regular expression.
496
497 If the regular expression was previously compiled with option
498 unicode, Subject is to be provided as a valid Unicode
499 charlist(), otherwise any iodata() will do. If compilation is
500 involved and option unicode is specified, both Subject and the
501 regular expression are to be specified as valid Unicode
502 charlists().
503
504 {capture, ValueSpec}/{capture, ValueSpec, Type} defines what to
505 return from the function upon successful matching. The capture
506 tuple can contain both a value specification, telling which of
507 the captured substrings are to be returned, and a type specifi‐
508 cation, telling how captured substrings are to be returned (as
509 index tuples, lists, or binaries). The options are described in
510 detail below.
511
512 If the capture options describe that no substring capturing is
513 to be done ({capture, none}), the function returns the single
514 atom match upon successful matching, otherwise the tuple {match,
515 ValueList}. Disabling capturing can be done either by specifying
516 none or an empty list as ValueSpec.
517
518 Option report_errors adds the possibility that an error tuple is
519 returned. The tuple either indicates a matching error
520 (match_limit or match_limit_recursion), or a compilation error,
521 where the error tuple has the format {error, {compile, Com‐
522 pileErr}}. Notice that if option report_errors is not specified,
523 the function never returns error tuples, but reports compilation
524 errors as a badarg exception and failed matches because of
525 exceeded match limits simply as nomatch.
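
The two error-handling styles can be sketched as follows. The
pattern "(" is deliberately malformed; the exact error string and
position are not assumed here.

```erlang
%% Without report_errors, a compilation error raises badarg.
Caught = (catch re:run("abc", "(")),
%% Caught = {'EXIT',{badarg, ...}}

%% With report_errors, the same error is returned as a tuple.
Reported = re:run("abc", "(", [report_errors]).
%% Reported = {error,{compile,{ErrString,Position}}}
```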
526
527 The following options are relevant for execution:
528
529 anchored:
530 Limits run/3 to matching at the first matching position. If
531 a pattern was compiled with anchored, or turned out to be
532 anchored by virtue of its contents, it cannot be made unan‐
533 chored at matching time, hence there is no unanchored
534 option.
535
536 global:
Implements global (repetitive) search (flag g in Perl). Each
match is returned as a separate list() containing the
specific match and any matching subexpressions (or as specified
by option capture). The Captured part of the return value is
hence a list() of list()s when this option is specified.
542
543 The interaction of option global with a regular expression
544 that matches an empty string surprises some users. When
545 option global is specified, run/3 handles empty matches in
546 the same way as Perl: a zero-length match at any point is
547 also retried with options [anchored, notempty_atstart]. If
548 that search gives a result of length > 0, the result is
549 included. Example:
550
551 re:run("cat","(|at)",[global]).
552
553 The following matchings are performed:
554
555 At offset 0:
The regular expression (|at) first matches at the initial
position of string cat, giving the result set
[{0,0},{0,0}] (the second {0,0} is because of the
subexpression marked by the parentheses). As the length of the
match is 0, we do not advance to the next position yet.
561
562 At offset 0 with [anchored, notempty_atstart]:
563 The search is retried with options [anchored,
564 notempty_atstart] at the same position, which does not
565 give any interesting result of longer length, so the
566 search position is advanced to the next character (a).
567
568 At offset 1:
569 The search results in [{1,0},{1,0}], so this search is
570 also repeated with the extra options.
571
572 At offset 1 with [anchored, notempty_atstart]:
Alternative at is found and the result is [{1,2},{1,2}].
The result is added to the list of results and the
position in the search string is advanced two steps.
576
577 At offset 3:
578 The search once again matches the empty string, giving
579 [{3,0},{3,0}].
580
At offset 3 with [anchored, notempty_atstart]:
582 This gives no result of length > 0 and we are at the last
583 position, so the global search is complete.
584
585 The result of the call is:
586
587 {match,[[{0,0},{0,0}],[{1,0},{1,0}],[{1,2},{1,2}],[{3,0},{3,0}]]}
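
For comparison, a sketch of the more common case where no match
is empty, so each match simply appears as one inner list. The
subject is made up for the example.

```erlang
%% One inner list per match; each holds only the complete match
%% because {capture, first, list} is requested.
Numbers = re:run("a1b22c333", "[0-9]+", [global, {capture, first, list}]).
%% Numbers = {match,[["1"],["22"],["333"]]}
```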
588
589 notempty:
590 An empty string is not considered to be a valid match if
591 this option is specified. If alternatives in the pattern
592 exist, they are tried. If all the alternatives match the
593 empty string, the entire match fails.
594
595 Example:
596
597 If the following pattern is applied to a string not begin‐
598 ning with "a" or "b", it would normally match the empty
599 string at the start of the subject:
600
601 a?b?
602
603 With option notempty, this match is invalid, so run/3
604 searches further into the string for occurrences of "a" or
605 "b".
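
The example above can be sketched with a concrete subject (chosen
only for illustration):

```erlang
%% Normally the empty string at offset 0 is a valid match.
Empty = re:run("xxb", "a?b?"),                 %% {match,[{0,0}]}
%% With notempty, the search continues until "b" is found.
NonEmpty = re:run("xxb", "a?b?", [notempty]).  %% {match,[{2,1}]}
```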
606
607 notempty_atstart:
608 Like notempty, except that an empty string match that is not
609 at the start of the subject is permitted. If the pattern is
610 anchored, such a match can occur only if the pattern con‐
611 tains \K.
612
613 Perl has no direct equivalent of notempty or
614 notempty_atstart, but it does make a special case of a pat‐
615 tern match of the empty string within its split() function,
616 and when using modifier /g. The Perl behavior can be emu‐
617 lated after matching a null string by first trying the match
618 again at the same offset with notempty_atstart and anchored,
619 and then, if that fails, by advancing the starting offset
620 (see below) and trying an ordinary match again.
621
622 notbol:
623 Specifies that the first character of the subject string is
624 not the beginning of a line, so the circumflex metacharacter
625 is not to match before it. Setting this without multiline
626 (at compile time) causes circumflex never to match. This
627 option only affects the behavior of the circumflex metachar‐
628 acter. It does not affect \A.
629
630 noteol:
631 Specifies that the end of the subject string is not the end
632 of a line, so the dollar metacharacter is not to match it
633 nor (except in multiline mode) a newline immediately before
634 it. Setting this without multiline (at compile time) causes
635 dollar never to match. This option affects only the behavior
636 of the dollar metacharacter. It does not affect \Z or \z.
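
A minimal sketch of notbol and noteol, without multiline, using
an arbitrary subject:

```erlang
%% ^ and $ match at the subject boundaries by default ...
Bol = re:run("abc", "^a"),               %% {match,[{0,1}]}
Eol = re:run("abc", "c$"),               %% {match,[{2,1}]}
%% ... but not when the boundaries are declared not to be line ends.
NotBol = re:run("abc", "^a", [notbol]),  %% nomatch
NotEol = re:run("abc", "c$", [noteol]).  %% nomatch
```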
637
638 report_errors:
639 Gives better control of the error handling in run/3. When
640 specified, compilation errors (if the regular expression is
641 not already compiled) and runtime errors are explicitly
642 returned as an error tuple.
643
644 The following are the possible runtime errors:
645
646 match_limit:
647 The PCRE library sets a limit on how many times the inter‐
648 nal match function can be called. Defaults to 10,000,000
649 in the library compiled for Erlang. If {error,
650 match_limit} is returned, the execution of the regular
651 expression has reached this limit. This is normally to be
652 regarded as a nomatch, which is the default return value
653 when this occurs, but by specifying report_errors, you are
654 informed when the match fails because of too many internal
655 calls.
656
657 match_limit_recursion:
658 This error is very similar to match_limit, but occurs when
659 the internal match function of PCRE is "recursively"
660 called more times than the match_limit_recursion limit,
which defaults to 10,000,000 as well. Notice that as long
as the match_limit and match_limit_recursion values are kept
at the default values, the match_limit_recursion error
cannot occur, as the match_limit error occurs before that
(each recursive call is also a call, but not conversely).
Both limits can however be changed, either by setting
limits directly in the regular expression string (see section
PCRE Regular Expression Details) or by specifying options
to run/3.
670
671 It is important to understand that what is referred to as
672 "recursion" when limiting matches is not recursion on the C
673 stack of the Erlang machine or on the Erlang process stack.
674 The PCRE version compiled into the Erlang VM uses machine
675 "heap" memory to store values that must be kept over recur‐
676 sion in regular expression matches.
677
678 {match_limit, integer() >= 0}:
679 Limits the execution time of a match in an implementation-
680 specific way. It is described as follows by the PCRE docu‐
681 mentation:
682
683 The match_limit field provides a means of preventing PCRE from using
684 up a vast amount of resources when running patterns that are not going
685 to match, but which have a very large number of possibilities in their
686 search trees. The classic example is a pattern that uses nested
687 unlimited repeats.
688
689 Internally, pcre_exec() uses a function called match(), which it calls
690 repeatedly (sometimes recursively). The limit set by match_limit is
691 imposed on the number of times this function is called during a match,
692 which has the effect of limiting the amount of backtracking that can
693 take place. For patterns that are not anchored, the count restarts
694 from zero for each position in the subject string.
695
696 This means that runaway regular expression matches can fail
697 faster if the limit is lowered using this option. The
698 default value 10,000,000 is compiled into the Erlang VM.
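
As a sketch of lowering the limit, the following uses a pattern
with nested unlimited repeats against a subject that cannot
match. The limit of 100 and the subject are arbitrary choices
for the example; the results shown are what is expected when the
limit is exceeded.

```erlang
%% 24 "a" characters followed by "!", so "(a+)+$" can never match.
Subject = lists:duplicate(24, $a) ++ "!",
%% Without report_errors, exceeding the limit is reported as nomatch.
Quiet = re:run(Subject, "(a+)+$", [{match_limit, 100}]),
%% With report_errors, the reason is made explicit.
Loud = re:run(Subject, "(a+)+$", [{match_limit, 100}, report_errors]).
%% Loud = {error,match_limit}
```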
699
700 Note:
This option in no way affects the execution of the Erlang
VM in terms of "long running BIFs". run/3 always gives control
back to the scheduler of Erlang processes at intervals that
ensure the real-time properties of the Erlang system.
705
706
707 {match_limit_recursion, integer() >= 0}:
708 Limits the execution time and memory consumption of a match
709 in an implementation-specific way, very similar to
710 match_limit. It is described as follows by the PCRE documen‐
711 tation:
712
713 The match_limit_recursion field is similar to match_limit, but instead
714 of limiting the total number of times that match() is called, it
715 limits the depth of recursion. The recursion depth is a smaller number
716 than the total number of calls, because not all calls to match() are
717 recursive. This limit is of use only if it is set smaller than
718 match_limit.
719
720 Limiting the recursion depth limits the amount of machine stack that
721 can be used, or, when PCRE has been compiled to use memory on the heap
722 instead of the stack, the amount of heap memory that can be used.
723
724 The Erlang VM uses a PCRE library where heap memory is used
725 when regular expression match recursion occurs. This there‐
726 fore limits the use of machine heap, not C stack.
727
728 Specifying a lower value can result in matches with deep
729 recursion failing, when they should have matched:
730
1> re:run("aaaaaaaaaaaaaz","(a+)*z").
{match,[{0,14},{0,13}]}
2> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5}]).
nomatch
3> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5},report_errors]).
{error,match_limit_recursion}
737
738 This option and option match_limit are only to be used in
739 rare cases. Understanding of the PCRE library internals is
740 recommended before tampering with these limits.
741
742 {offset, integer() >= 0}:
743 Start matching at the offset (position) specified in the
744 subject string. The offset is zero-based, so that the
745 default is {offset,0} (all of the subject string).
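
For illustration, a sketch showing that returned indexes remain
relative to the start of the subject, not to the offset:

```erlang
%% Starting at offset 1 skips the first "abc"; the second one is found.
FromOne = re:run("abcabc", "abc", [{offset, 1}]).
%% FromOne = {match,[{3,3}]}
```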
746
747 {newline, NLSpec}:
748 Overrides the default definition of a newline in the subject
749 string, which is LF (ASCII 10) in Erlang.
750
751 cr:
752 Newline is indicated by a single character CR (ASCII 13).
753
754 lf:
755 Newline is indicated by a single character LF (ASCII 10),
756 the default.
757
758 crlf:
759 Newline is indicated by the two-character CRLF (ASCII 13
760 followed by ASCII 10) sequence.
761
762 anycrlf:
Any of the three preceding sequences is to be recognized.
764
765 any:
766 Any of the newline sequences above, and the Unicode
767 sequences VT (vertical tab, U+000B), FF (formfeed,
768 U+000C), NEL (next line, U+0085), LS (line separator,
769 U+2028), and PS (paragraph separator, U+2029).
770
771 bsr_anycrlf:
Specifies that \R is to match only the CR, LF, or CRLF
sequences, not the Unicode-specific newline characters.
(Overrides the compilation option.)
775
776 bsr_unicode:
Specifies that \R is to match all the Unicode newline
characters (including CRLF, and so on, the default).
(Overrides the compilation option.)
780
781 {capture, ValueSpec}/{capture, ValueSpec, Type}:
782 Specifies which captured substrings are returned and in what
783 format. By default, run/3 captures all of the matching part
784 of the substring and all capturing subpatterns (all of the
785 pattern is automatically captured). The default return type
786 is (zero-based) indexes of the captured parts of the string,
787 specified as {Offset,Length} pairs (the index Type of cap‐
788 turing).
789
790 As an example of the default behavior, the following call
791 returns, as first and only captured string, the matching
792 part of the subject ("abcd" in the middle) as an index pair
793 {3,4}, where character positions are zero-based, just as in
794 offsets:
795
796 re:run("ABCabcdABC","abcd",[]).
797
798 The return value of this call is:
799
800 {match,[{3,4}]}
801
802 Another (and quite common) case is where the regular expres‐
803 sion matches all of the subject:
804
805 re:run("ABCabcdABC",".*abcd.*",[]).
806
807 Here the return value correspondingly points out all of the
808 string, beginning at index 0, and it is 10 characters long:
809
810 {match,[{0,10}]}
811
812 If the regular expression contains capturing subpatterns,
813 like in:
814
815 re:run("ABCabcdABC",".*(abcd).*",[]).
816
817 all of the matched subject is captured, as well as the cap‐
818 tured substrings:
819
820 {match,[{0,10},{3,4}]}
821
822 The complete matching pattern always gives the first return
823 value in the list and the remaining subpatterns are added in
824 the order they occurred in the regular expression.
825
826 The capture tuple is built up as follows:
827
828 ValueSpec:
829 Specifies which captured (sub)patterns are to be returned.
830 ValueSpec can either be an atom describing a predefined
831 set of return values, or a list containing the indexes or
832 the names of specific subpatterns to return.
833
834 The following are the predefined sets of subpatterns:
835
836 all:
837 All captured subpatterns including the complete matching
838 string. This is the default.
839
840 all_names:
841 All named subpatterns in the regular expression, as if a
842 list() of all the names in alphabetical order was speci‐
843 fied. The list of all names can also be retrieved with
844 inspect/2.
845
846 first:
847 Only the first captured subpattern, which is always the
848 complete matching part of the subject. All explicitly
849 captured subpatterns are discarded.
850
851 all_but_first:
852 All but the first matching subpattern, that is, all
853 explicitly captured subpatterns, but not the complete
854 matching part of the subject string. This is useful if
855 the regular expression as a whole matches a large part
856 of the subject, but the part you are interested in is in
857 an explicitly captured subpattern. If the return type is
858 list or binary, not returning subpatterns you are not
859 interested in is a good way to optimize.
860
861 none:
Returns no matching subpatterns. On a successful match,
the function returns the single atom match instead of
the {match, list()} tuple. Specifying an empty list
gives the same behavior.
866
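As a rough illustration of the predefined sets, the following shell
expressions reuse the subject and pattern from the earlier examples.
The date pattern with the names year and month is made up for this
sketch, and Type list is used only to make the results easy to read.
Each expression is a pattern match, so a wrong expectation would raise
a badmatch error:

```erlang
S = "ABCabcdABC".
{match,["ABCabcdABC","abcd"]} = re:run(S, ".*(abcd).*", [{capture,all,list}]).
{match,["ABCabcdABC"]}        = re:run(S, ".*(abcd).*", [{capture,first,list}]).
{match,["abcd"]}              = re:run(S, ".*(abcd).*", [{capture,all_but_first,list}]).
match                         = re:run(S, ".*(abcd).*", [{capture,none}]).
match                         = re:run(S, ".*(abcd).*", [{capture,[]}]).
%% all_names returns the named subpatterns in alphabetical order of
%% their names, so "month" comes before "year" (hypothetical pattern).
{match,["05","2024"]} =
    re:run("2024-05", "(?<year>\\d+)-(?<month>\\d+)", [{capture,all_names,list}]).
```
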
867 The value list is a list of indexes for the subpatterns to
868 return, where index 0 is for all of the pattern, and 1 is
869 for the first explicit capturing subpattern in the regular
870 expression, and so on. When using named captured subpat‐
871 terns (see below) in the regular expression, one can use
872 atom()s or string()s to specify the subpatterns to be
873 returned. For example, consider the regular expression:
874
875 ".*(abcd).*"
876
877 matched against string "ABCabcdABC", capturing only the
878 "abcd" part (the first explicit subpattern):
879
880 re:run("ABCabcdABC",".*(abcd).*",[{capture,[1]}]).
881
882 The call gives the following result, as the first explic‐
883 itly captured subpattern is "(abcd)", matching "abcd" in
884 the subject, at (zero-based) position 3, of length 4:
885
886 {match,[{3,4}]}
887
888 Consider the same regular expression, but with the subpat‐
889 tern explicitly named 'FOO':
890
891 ".*(?<FOO>abcd).*"
892
893 With this expression, we could still give the index of the
894 subpattern with the following call:
895
896 re:run("ABCabcdABC",".*(?<FOO>abcd).*",[{capture,[1]}]).
897
898 giving the same result as before. But, as the subpattern
899 is named, we can also specify its name in the value list:
900
901 re:run("ABCabcdABC",".*(?<FOO>abcd).*",[{capture,['FOO']}]).
902
903 This would give the same result as the earlier examples,
904 namely:
905
906 {match,[{3,4}]}
907
The value list can specify indexes or names not present
909 in the regular expression, in which case the return values
910 vary depending on the type. If the type is index, the
911 tuple {-1,0} is returned for values with no corresponding
912 subpattern in the regular expression, but for the other
913 types (binary and list), the values are the empty binary
914 or list, respectively.
915
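A small sketch of this rule, assuming the pattern from the earlier
examples, which has only one capturing subpattern. Index 2 is asked
for although it does not exist in the pattern:

```erlang
S = "ABCabcdABC".
%% Index 2 has no corresponding subpattern in the regular expression.
{match,[{3,4},{-1,0}]}     = re:run(S, ".*(abcd).*", [{capture,[1,2],index}]).
{match,["abcd",[]]}        = re:run(S, ".*(abcd).*", [{capture,[1,2],list}]).
{match,[<<"abcd">>,<<>>]}  = re:run(S, ".*(abcd).*", [{capture,[1,2],binary}]).
```
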
916 Type:
917 Optionally specifies how captured substrings are to be
918 returned. If omitted, the default of index is used.
919
920 Type can be one of the following:
921
922 index:
923 Returns captured substrings as pairs of byte indexes
924 into the subject string and length of the matching
925 string in the subject (as if the subject string was
926 flattened with erlang:iolist_to_binary/1 or uni‐
927 code:characters_to_binary/2 before matching). Notice
928 that option unicode results in byte-oriented indexes in
a (possibly virtual) UTF-8 encoded binary. A byte index
tuple {0,2} can therefore represent one or two characters
when unicode is in effect. This can seem counter-intuitive,
but has been deemed the most effective and useful way to do
it. Returning lists instead can result in simpler code if
that is desired. This return type is the default.
936
937 list:
938 Returns matching substrings as lists of characters
(Erlang string()s). If option unicode is used in combination
with the \C sequence in the regular expression, a
941 captured subpattern can contain bytes that are not valid
942 UTF-8 (\C matches bytes regardless of character encod‐
943 ing). In that case the list capturing can result in the
944 same types of tuples that unicode:characters_to_list/2
945 can return, namely three-tuples with tag incomplete or
946 error, the successfully converted characters and the
947 invalid UTF-8 tail of the conversion as a binary. The
948 best strategy is to avoid using the \C sequence when
949 capturing lists.
950
951 binary:
952 Returns matching substrings as binaries. If option uni‐
953 code is used, these binaries are in UTF-8. If the \C
954 sequence is used together with unicode, the binaries can
955 be invalid UTF-8.
956
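The three return types can be compared side by side. The sketch below
also shows the byte-oriented indexes that option unicode produces. The
subject [229,$b] is an assumption chosen for illustration, where 229 is
the code point of the Latin-1 letter a with a ring above:

```erlang
S = "ABCabcdABC".
{match,[{3,4}]}      = re:run(S, "abcd", [{capture,first,index}]).
{match,["abcd"]}     = re:run(S, "abcd", [{capture,first,list}]).
{match,[<<"abcd">>]} = re:run(S, "abcd", [{capture,first,binary}]).
%% Without unicode the subject is treated as bytes, so "b" is at offset 1.
{match,[{1,1}]} = re:run([229,$b], "b", []).
%% With unicode the subject is matched as UTF-8, where code point 229
%% occupies two bytes, so "b" is at byte offset 2.
{match,[{2,1}]} = re:run([229,$b], "b", [unicode]).
```
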
957 In general, subpatterns that were not assigned a value in
958 the match are returned as the tuple {-1,0} when type is
959 index. Unassigned subpatterns are returned as the empty
960 binary or list, respectively, for other return types. Con‐
961 sider the following regular expression:
962
963 ".*((?<FOO>abdd)|a(..d)).*"
964
965 There are three explicitly capturing subpatterns, where the
966 opening parenthesis position determines the order in the
967 result, hence ((?<FOO>abdd)|a(..d)) is subpattern index 1,
968 (?<FOO>abdd) is subpattern index 2, and (..d) is subpattern
969 index 3. When matched against the following string:
970
971 "ABCabcdABC"
972
973 the subpattern at index 2 does not match, as "abdd" is not
974 present in the string, but the complete pattern matches
975 (because of the alternative a(..d)). The subpattern at index
976 2 is therefore unassigned and the default return value is:
977
978 {match,[{0,10},{3,4},{-1,0},{4,3}]}
979
980 Setting the capture Type to binary gives:
981
982 {match,[<<"ABCabcdABC">>,<<"abcd">>,<<>>,<<"bcd">>]}
983
984 Here the empty binary (<<>>) represents the unassigned sub‐
985 pattern. In the binary case, some information about the
986 matching is therefore lost, as <<>> can also be an empty
987 string captured.
988
989 If differentiation between empty matches and non-existing
990 subpatterns is necessary, use the type index and do the con‐
991 version to the final type in Erlang code.
992
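One possible way to do that conversion is sketched below. Representing
an unassigned subpattern by the atom undefined is a choice made for
this example only, not something the re module prescribes:

```erlang
Subject = <<"ABCabcdABC">>.
{match, Indexes} =
    re:run(Subject, ".*((?<FOO>abdd)|a(..d)).*", [{capture,all,index}]).
%% Map {-1,0} to undefined and every other pair to the corresponding
%% part of the subject binary.
Parts = [case I of
             {-1,0}    -> undefined;
             {Off,Len} -> binary:part(Subject, Off, Len)
         end || I <- Indexes].
[<<"ABCabcdABC">>, <<"abcd">>, undefined, <<"bcd">>] = Parts.
```
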
When option global is specified, the capture specification
affects each match separately, so that:
995
996 re:run("cacb","c(a|b)",[global,{capture,[1],list}]).
997
998 gives
999
1000 {match,[["a"],["b"]]}
1001
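For comparison, a short sketch of the same call with other capture
specifications. With global, the result is always one inner list per
match:

```erlang
{match,[["ca","a"],["cb","b"]]} =
    re:run("cacb", "c(a|b)", [global,{capture,all,list}]).
%% The default capture specification (all, index) applied to each match.
{match,[[{0,2},{1,1}],[{2,2},{3,1}]]} = re:run("cacb", "c(a|b)", [global]).
nomatch = re:run("xyz", "c(a|b)", [global]).
```
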
For descriptions of options that affect only the compilation
step, see compile/2.
1004
1005 split(Subject, RE) -> SplitList
1006
1007 Types:
1008
1009 Subject = iodata() | unicode:charlist()
1010 RE = mp() | iodata()
1011 SplitList = [iodata() | unicode:charlist()]
1012
1013 Same as split(Subject, RE, []).
1014
1015 split(Subject, RE, Options) -> SplitList
1016
1017 Types:
1018
1019 Subject = iodata() | unicode:charlist()
1020 RE = mp() | iodata() | unicode:charlist()
1021 Options = [Option]
1022 Option =
1023 anchored |
1024 notbol |
1025 noteol |
1026 notempty |
1027 notempty_atstart |
1028 {offset, integer() >= 0} |
1029 {newline, nl_spec()} |
1030 {match_limit, integer() >= 0} |
1031 {match_limit_recursion, integer() >= 0} |
1032 bsr_anycrlf |
1033 bsr_unicode |
1034 {return, ReturnType} |
1035 {parts, NumParts} |
1036 group |
1037 trim |
1038 CompileOpt
1039 NumParts = integer() >= 0 | infinity
1040 ReturnType = iodata | list | binary
1041 CompileOpt = compile_option()
1042 See compile/2.
1043 SplitList = [RetData] | [GroupedRetData]
1044 GroupedRetData = [RetData]
1045 RetData = iodata() | unicode:charlist() | binary() | list()
1046
1047 Splits the input into parts by finding tokens according to the
1048 regular expression supplied. The splitting is basically done by
1049 running a global regular expression match and dividing the ini‐
1050 tial string wherever a match occurs. The matching part of the
1051 string is removed from the output.
1052
1053 As in run/3, an mp() compiled with option unicode requires Sub‐
1054 ject to be a Unicode charlist(). If compilation is done implic‐
1055 itly and the unicode compilation option is specified to this
1056 function, both the regular expression and Subject are to be
1057 specified as valid Unicode charlist()s.
1058
The result is given as a list of "strings" of the data type
specified in option return (default iodata).
1061
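A minimal sketch of the return types, using a made-up comma-separated
subject. As the exact shape of the default iodata result is not
guaranteed, the last expression normalizes each part before comparing:

```erlang
["a","b","c"]             = re:split("a,b,c", ",", [{return,list}]).
[<<"a">>,<<"b">>,<<"c">>] = re:split("a,b,c", ",", [{return,binary}]).
%% Default return type: convert each part instead of relying on its shape.
[<<"a">>,<<"b">>,<<"c">>] = [iolist_to_binary(P) || P <- re:split("a,b,c", ",")].
```
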
1062 If subexpressions are specified in the regular expression, the
1063 matching subexpressions are returned in the resulting list as
1064 well. For example:
1065
1066 re:split("Erlang","[ln]",[{return,list}]).
1067
1068 gives
1069
1070 ["Er","a","g"]
1071
1072 while
1073
1074 re:split("Erlang","([ln])",[{return,list}]).
1075
1076 gives
1077
1078 ["Er","l","a","n","g"]
1079
1080 The text matching the subexpression (marked by the parentheses
1081 in the regular expression) is inserted in the result list where
1082 it was found. This means that concatenating the result of a
1083 split where the whole regular expression is a single subexpres‐
1084 sion (as in the last example) always results in the original
1085 string.
1086
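This property can be checked directly in the shell. The sketch below
uses lists:append/1 for the list case and iolist_to_binary/1 for the
default return type:

```erlang
Parts = re:split("Erlang", "([ln])", [{return,list}]).
["Er","l","a","n","g"] = Parts.
%% Concatenating the parts gives back the original string.
"Erlang" = lists:append(Parts).
<<"Erlang">> = iolist_to_binary(re:split("Erlang", "([ln])")).
```
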
1087 As there is no matching subexpression for the last part in the
1088 example (the "g"), nothing is inserted after that. To make the
1089 group of strings and the parts matching the subexpressions more
1090 obvious, one can use option group, which groups together the
1091 part of the subject string with the parts matching the subex‐
1092 pressions when the string was split:
1093
1094 re:split("Erlang","([ln])",[{return,list},group]).
1095
1096 gives
1097
1098 [["Er","l"],["a","n"],["g"]]
1099
1100 Here the regular expression first matched the "l", causing "Er"
1101 to be the first part in the result. When the regular expression
1102 matched, the (only) subexpression was bound to the "l", so the
1103 "l" is inserted in the group together with "Er". The next match
1104 is of the "n", making "a" the next part to be returned. As the
1105 subexpression is bound to substring "n" in this case, the "n" is
1106 inserted into this group. The last group consists of the remain‐
1107 ing string, as no more matches are found.
1108
1109 By default, all parts of the string, including the empty
1110 strings, are returned from the function, for example:
1111
1112 re:split("Erlang","[lg]",[{return,list}]).
1113
1114 gives
1115
1116 ["Er","an",[]]
1117
as matching the "g" at the end of the string leaves an
empty remainder, which is also returned. This behavior differs from
the default behavior of the split function in Perl, where empty
strings at the end are by default removed. To get the "trimming"
default behavior of Perl, specify trim as an option:
1123
1124 re:split("Erlang","[lg]",[{return,list},trim]).
1125
1126 gives
1127
1128 ["Er","an"]
1129
The trim option means "give me as many parts as possible,
except empty ones at the end", which can sometimes be useful. You can
also specify how many parts you want, by specifying {parts,N}:
1133
1134 re:split("Erlang","[lg]",[{return,list},{parts,2}]).
1135
1136 gives
1137
1138 ["Er","ang"]
1139
1140 Notice that the last part is "ang", not "an", as splitting was
1141 specified into two parts, and the splitting stops when enough
1142 parts are given, which is why the result differs from that of
1143 trim.
1144
More than three parts are not possible with this input, so
1146
1147 re:split("Erlang","[lg]",[{return,list},{parts,4}]).
1148
1149 gives the same result as the default, which is to be viewed as
1150 "an infinite number of parts".
1151
1152 Specifying 0 as the number of parts gives the same effect as
1153 option trim. If subexpressions are captured, empty subexpres‐
1154 sions matched at the end are also stripped from the result if
1155 trim or {parts,0} is specified.
1156
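The interplay of {parts,N} and trim can be sketched with a subject that
has empty fields both in the middle and at the end. The subject is made
up for illustration. Notice that only trailing empty parts are removed:

```erlang
S = "a,b,,c,,".
["a","b",[],"c",[],[]] = re:split(S, ",", [{return,list}]).
["a","b,,c,,"]         = re:split(S, ",", [{return,list},{parts,2}]).
["a","b",",c,,"]       = re:split(S, ",", [{return,list},{parts,3}]).
%% trim and {parts,0} drop the empty parts at the end, but keep the
%% empty part between "b" and "c".
["a","b",[],"c"]       = re:split(S, ",", [{return,list},trim]).
["a","b",[],"c"]       = re:split(S, ",", [{return,list},{parts,0}]).
```
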
1157 The trim behavior corresponds exactly to the Perl default.
1158 {parts,N}, where N is a positive integer, corresponds exactly to
1159 the Perl behavior with a positive numerical third parameter. The
1160 default behavior of split/3 corresponds to the Perl behavior
1161 when a negative integer is specified as the third parameter for
1162 the Perl routine.
1163
1164 Summary of options not previously described for function run/3:
1165
1166 {return,ReturnType}:
1167 Specifies how the parts of the original string are presented
1168 in the result list. Valid types:
1169
1170 iodata:
1171 The variant of iodata() that gives the least copying of
1172 data with the current implementation (often a binary, but
1173 do not depend on it).
1174
1175 binary:
1176 All parts returned as binaries.
1177
1178 list:
1179 All parts returned as lists of characters ("strings").
1180
1181 group:
1182 Groups together the part of the string with the parts of the
1183 string matching the subexpressions of the regular expres‐
1184 sion.
1185
1186 The return value from the function is in this case a list()
1187 of list()s. Each sublist begins with the string picked out
1188 of the subject string, followed by the parts matching each
1189 of the subexpressions in order of occurrence in the regular
1190 expression.
1191
1192 {parts,N}:
1193 Specifies the number of parts the subject string is to be
1194 split into.
1195
1196 The number of parts is to be a positive integer for a spe‐
1197 cific maximum number of parts, and infinity for the maximum
1198 number of parts possible (the default). Specifying {parts,0}
1199 gives as many parts as possible disregarding empty parts at
1200 the end, the same as specifying trim.
1201
1202 trim:
1203 Specifies that empty parts at the end of the result list are
1204 to be disregarded. The same as specifying {parts,0}. This
1205 corresponds to the default behavior of the split built-in
1206 function in Perl.
1207
1209 The following sections contain reference material for the regular
1210 expressions used by this module. The information is based on the PCRE
1211 documentation, with changes where this module behaves differently to
1212 the PCRE library.
1213
1215 The syntax and semantics of the regular expressions supported by PCRE
1216 are described in detail in the following sections. Perl's regular
1217 expressions are described in its own documentation, and regular expres‐
1218 sions in general are covered in many books, some with copious examples.
1219 Jeffrey Friedl's "Mastering Regular Expressions", published by
1220 O'Reilly, covers regular expressions in great detail. This description
1221 of the PCRE regular expressions is intended as reference material.
1222
1223 The reference material is divided into the following sections:
1224
1225 * Special Start-of-Pattern Items
1226
1227 * Characters and Metacharacters
1228
1229 * Backslash
1230
1231 * Circumflex and Dollar
1232
1233 * Full Stop (Period, Dot) and \N
1234
1235 * Matching a Single Data Unit
1236
1237 * Square Brackets and Character Classes
1238
1239 * Posix Character Classes
1240
1241 * Vertical Bar
1242
1243 * Internal Option Setting
1244
1245 * Subpatterns
1246
1247 * Duplicate Subpattern Numbers
1248
1249 * Named Subpatterns
1250
1251 * Repetition
1252
1253 * Atomic Grouping and Possessive Quantifiers
1254
1255 * Back References
1256
1257 * Assertions
1258
1259 * Conditional Subpatterns
1260
1261 * Comments
1262
1263 * Recursive Patterns
1264
1265 * Subpatterns as Subroutines
1266
1267 * Oniguruma Subroutine Syntax
1268
1269 * Backtracking Control
1270
1272 Some options that can be passed to compile/2 can also be set by special
1273 items at the start of a pattern. These are not Perl-compatible, but are
1274 provided to make these options accessible to pattern writers who are
1275 not able to change the program that processes the pattern. Any number
1276 of these items can appear, but they must all be together right at the
1277 start of the pattern string, and the letters must be in upper case.
1278
1279 UTF Support
1280
1281 Unicode support is basically UTF-8 based. To use Unicode characters,
1282 you either call compile/2 or run/3 with option unicode, or the pattern
1283 must start with one of these special sequences:
1284
1285 (*UTF8)
1286 (*UTF)
1287
Both sequences have the same effect: the input string is interpreted as
UTF-8. Notice that with these sequences, the automatic conversion of
1290 lists to UTF-8 is not performed by the re functions. Therefore, using
1291 these sequences is not recommended. Add option unicode when running
1292 compile/2 instead.
1293
Some applications that allow their users to supply patterns may wish to
1295 restrict them to non-UTF data for security reasons. If option never_utf
1296 is set at compile time, (*UTF), and so on, are not allowed, and their
1297 appearance causes an error.
1298
1299 Unicode Property Support
1300
1301 The following is another special sequence that can appear at the start
1302 of a pattern:
1303
1304 (*UCP)
1305
1306 This has the same effect as setting option ucp: it causes sequences
1307 such as \d and \w to use Unicode properties to determine character
1308 types, instead of recognizing only characters with codes < 256 through
1309 a lookup table.
1310
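A small sketch of the difference, assuming a subject that consists of
the single code point 16#0663 (ARABIC-INDIC DIGIT THREE), which is a
decimal digit outside the ASCII range:

```erlang
%% Without ucp, \d recognizes only ASCII digits.
nomatch         = re:run([16#0663], "\\d", [unicode]).
%% With ucp, Unicode properties are used. The index pair is in bytes:
%% the code point occupies two bytes in UTF-8.
{match,[{0,2}]} = re:run([16#0663], "\\d", [unicode,ucp]).
{match,[{0,2}]} = re:run([16#0663], "(*UCP)\\d", [unicode]).
```
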
1311 Disabling Startup Optimizations
1312
1313 If a pattern starts with (*NO_START_OPT), it has the same effect as
1314 setting option no_start_optimize at compile time.
1315
1316 Newline Conventions
1317
1318 PCRE supports five conventions for indicating line breaks in strings: a
1319 single CR (carriage return) character, a single LF (line feed) charac‐
1320 ter, the two-character sequence CRLF, any of the three preceding, and
1321 any Unicode newline sequence.
1322
1323 A newline convention can also be specified by starting a pattern string
1324 with one of the following five sequences:
1325
1326 (*CR):
1327 Carriage return
1328
1329 (*LF):
1330 Line feed
1331
1332 (*CRLF):
Carriage return followed by line feed
1334
1335 (*ANYCRLF):
1336 Any of the three above
1337
1338 (*ANY):
1339 All Unicode newline sequences
1340
1341 These override the default and the options specified to compile/2. For
1342 example, the following pattern changes the convention to CR:
1343
1344 (*CR)a.b
1345
1346 This pattern matches a\nb, as LF is no longer a newline. If more than
1347 one of them is present, the last one is used.
1348
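The effect on the dot metacharacter can be sketched as follows. In
Erlang source code the subject "a\nb" contains a real line feed:

```erlang
%% With the default convention, the dot does not match the line feed.
nomatch         = re:run("a\nb", "a.b", []).
%% With CR as the only newline, LF is an ordinary character.
{match,[{0,3}]} = re:run("a\nb", "(*CR)a.b", []).
%% The same convention selected with an option instead.
{match,[{0,3}]} = re:run("a\nb", "a.b", [{newline,cr}]).
```
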
1349 The newline convention affects where the circumflex and dollar asser‐
1350 tions are true. It also affects the interpretation of the dot metachar‐
1351 acter when dotall is not set, and the behavior of \N. However, it does
1352 not affect what the \R escape sequence matches. By default, this is any
1353 Unicode newline sequence, for Perl compatibility. However, this can be
1354 changed; see the description of \R in section Newline Sequences. A
1355 change of the \R setting can be combined with a change of the newline
1356 convention.
1357
1358 Setting Match and Recursion Limits
1359
1360 The caller of run/3 can set a limit on the number of times the internal
1361 match() function is called and on the maximum depth of recursive calls.
1362 These facilities are provided to catch runaway matches that are pro‐
1363 voked by patterns with huge matching trees (a typical example is a pat‐
1364 tern with nested unlimited repeats) and to avoid running out of system
1365 stack by too much recursion. When one of these limits is reached,
1366 pcre_exec() gives an error return. The limits can also be set by items
1367 at the start of the pattern of the following forms:
1368
1369 (*LIMIT_MATCH=d)
1370 (*LIMIT_RECURSION=d)
1371
1372 Here d is any number of decimal digits. However, the value of the set‐
1373 ting must be less than the value set by the caller of run/3 for it to
1374 have any effect. That is, the pattern writer can lower the limit set by
1375 the programmer, but not raise it. If there is more than one setting of
1376 one of these limits, the lower value is used.
1377
1378 The default value for both the limits is 10,000,000 in the Erlang VM.
1379 Notice that the recursion limit does not affect the stack depth of the
1380 VM, as PCRE for Erlang is compiled in such a way that the match func‐
1381 tion never does recursion on the C stack.
1382
1383 Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of
1384 the limits set by the caller, not increase them.
1385
1387 A regular expression is a pattern that is matched against a subject
1388 string from left to right. Most characters stand for themselves in a
1389 pattern and match the corresponding characters in the subject. As a
1390 trivial example, the following pattern matches a portion of a subject
1391 string that is identical to itself:
1392
1393 The quick brown fox
1394
1395 When caseless matching is specified (option caseless), letters are
1396 matched independently of case.
1397
1398 The power of regular expressions comes from the ability to include
1399 alternatives and repetitions in the pattern. These are encoded in the
1400 pattern by the use of metacharacters, which do not stand for themselves
1401 but instead are interpreted in some special way.
1402
1403 Two sets of metacharacters exist: those that are recognized anywhere in
1404 the pattern except within square brackets, and those that are recog‐
1405 nized within square brackets. Outside square brackets, the metacharac‐
1406 ters are as follows:
1407
1408 \:
1409 General escape character with many uses
1410
1411 ^:
1412 Assert start of string (or line, in multiline mode)
1413
1414 $:
1415 Assert end of string (or line, in multiline mode)
1416
1417 .:
1418 Match any character except newline (by default)
1419
1420 [:
1421 Start character class definition
1422
1423 |:
1424 Start of alternative branch
1425
1426 (:
1427 Start subpattern
1428
1429 ):
1430 End subpattern
1431
1432 ?:
1433 Extends the meaning of (, also 0 or 1 quantifier, also quantifier
1434 minimizer
1435
1436 *:
0 or more quantifier
1438
1439 +:
1440 1 or more quantifier, also "possessive quantifier"
1441
1442 {:
1443 Start min/max quantifier
1444
1445 Part of a pattern within square brackets is called a "character class".
1446 The following are the only metacharacters in a character class:
1447
1448 \:
1449 General escape character
1450
1451 ^:
1452 Negate the class, but only if the first character
1453
1454 -:
1455 Indicates character range
1456
1457 [:
1458 Posix character class (only if followed by Posix syntax)
1459
1460 ]:
1461 Terminates the character class
1462
1463 The following sections describe the use of each metacharacter.
1464
1466 The backslash character has many uses. First, if it is followed by a
1467 character that is not a number or a letter, it takes away any special
1468 meaning that a character can have. This use of backslash as an escape
1469 character applies both inside and outside character classes.
1470
1471 For example, if you want to match a * character, you write \* in the
1472 pattern. This escaping action applies if the following character would
1473 otherwise be interpreted as a metacharacter, so it is always safe to
1474 precede a non-alphanumeric with backslash to specify that it stands for
1475 itself. In particular, if you want to match a backslash, write \\.
1476
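In Erlang source code every backslash of the pattern must itself be
doubled, as described in the note at the start of this manual page.
The subjects in this sketch are made up for illustration:

```erlang
%% The pattern \* is written "\\*" in Erlang source.
{match,[{1,1}]} = re:run("2*3", "\\*", []).
%% The pattern \\ (one escaped backslash) is written "\\\\". The subject
%% "a\\b" is the three characters a, backslash, and b.
{match,[{1,1}]} = re:run("a\\b", "\\\\", []).
%% Escaping a non-alphanumeric character without special meaning is safe.
{match,[{0,3}]} = re:run("a-b", "a\\-b", []).
```
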
1477 In unicode mode, only ASCII numbers and letters have any special mean‐
1478 ing after a backslash. All other characters (in particular, those whose
1479 code points are > 127) are treated as literals.
1480
1481 If a pattern is compiled with option extended, whitespace in the pat‐
1482 tern (other than in a character class) and characters between a # out‐
1483 side a character class and the next newline are ignored. An escaping
1484 backslash can be used to include a whitespace or # character as part of
1485 the pattern.
1486
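A brief sketch of option extended, with made-up patterns. Unescaped
whitespace and the trailing comment are dropped from the pattern, while
escaped whitespace and an escaped # are kept:

```erlang
{match,["abc"]} = re:run("abc", "a b c  # whitespace and comment are ignored",
                         [extended,{capture,first,list}]).
{match,["a b"]} = re:run("a b", "a\\ b", [extended,{capture,first,list}]).
{match,["a#b"]} = re:run("a#b", "a\\#b", [extended,{capture,first,list}]).
%% The pattern "a b" is read as "ab", which does not occur in "a b".
nomatch         = re:run("a b", "a b", [extended]).
```
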
1487 To remove the special meaning from a sequence of characters, put them
1488 between \Q and \E. This is different from Perl in that $ and @ are han‐
1489 dled as literals in \Q...\E sequences in PCRE, while $ and @ cause
1490 variable interpolation in Perl. Notice the following examples:
1491
Pattern            PCRE matches   Perl matches

\Qabc$xyz\E        abc$xyz        abc followed by the contents of $xyz
\Qabc\$xyz\E       abc\$xyz       abc\$xyz
\Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
1497
1498 The \Q...\E sequence is recognized both inside and outside character
1499 classes. An isolated \E that is not preceded by \Q is ignored. If \Q is
1500 not followed by \E later in the pattern, the literal interpretation
1501 continues to the end of the pattern (that is, \E is assumed at the
1502 end). If the isolated \Q is inside a character class, this causes an
1503 error, as the character class is not terminated.
1504
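The literal treatment can be tried out as follows. This is only a
sketch, and the subjects are chosen for illustration:

```erlang
%% $ is an ordinary character between \Q and \E.
{match,["abc$xyz"]} = re:run("abc$xyz", "\\Qabc$xyz\\E", [{capture,first,list}]).
%% The metacharacters . and * lose their meaning as well.
{match,[{1,2}]} = re:run("a.*b", "\\Q.*\\E", []).
nomatch         = re:run("aXYb", "\\Q.*\\E", []).
%% A missing \E is assumed at the end of the pattern.
{match,[{1,2}]} = re:run("a.*b", "\\Q.*", []).
```
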
1505 Non-Printing Characters
1506
1507 A second use of backslash provides a way of encoding non-printing char‐
1508 acters in patterns in a visible manner. There is no restriction on the
1509 appearance of non-printing characters, apart from the binary zero that
1510 terminates a pattern. When a pattern is prepared by text editing, it is
1511 often easier to use one of the following escape sequences than the
1512 binary character it represents:
1513
1514 \a:
1515 Alarm, that is, the BEL character (hex 07)
1516
1517 \cx:
1518 "Control-x", where x is any ASCII character
1519
1520 \e:
1521 Escape (hex 1B)
1522
1523 \f:
1524 Form feed (hex 0C)
1525
1526 \n:
1527 Line feed (hex 0A)
1528
1529 \r:
1530 Carriage return (hex 0D)
1531
1532 \t:
1533 Tab (hex 09)
1534
1535 \0dd:
1536 Character with octal code 0dd
1537
1538 \ddd:
1539 Character with octal code ddd, or back reference
1540
1541 \o{ddd..}:
Character with octal code ddd..
1543
1544 \xhh:
1545 Character with hex code hh
1546
1547 \x{hhh..}:
1548 Character with hex code hhh..
1549
1550 Note:
\0dd is always an octal code, and \8 and \9 are the literal
characters "8" and "9".
1553
1554
1555 The precise effect of \cx on ASCII characters is as follows: if x is a
1556 lowercase letter, it is converted to upper case. Then bit 6 of the
1557 character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
1558 (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
1559 hex 7B (; is 3B). If the data item (byte or 16-bit value) following \c
1560 has a value > 127, a compile-time error occurs. This locks out non-
1561 ASCII characters in all modes.
1562
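The bit inversion can be verified with a few shell expressions. This is
a sketch based on the character codes given above:

```erlang
%% "I" is hex 49. Inverting bit 6 (hex 40) gives hex 09, the tab character.
{match,[{0,1}]} = re:run("\t", "\\cI", []).
%% A lowercase letter is converted to upper case first.
{match,[{0,1}]} = re:run("\t", "\\ci", []).
%% ";" is hex 3B, so \c; is hex 7B, the "{" character.
{match,[{0,1}]} = re:run("{", "\\c;", []).
```
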
1563 The \c facility was designed for use with ASCII characters, but with
1564 the extension to Unicode it is even less useful than it once was.
1565
1566 After \0 up to two further octal digits are read. If there are fewer
1567 than two digits, just those that are present are used. Thus the
1568 sequence \0\x\015 specifies two binary zeros followed by a CR character
1569 (code value 13). Make sure you supply two digits after the initial zero
1570 if the pattern character that follows is itself an octal digit.
1571
1572 The escape \o must be followed by a sequence of octal digits, enclosed
1573 in braces. An error occurs if this is not the case. This escape is a
recent addition to Perl; it provides a way of specifying character code
1575 points as octal numbers greater than 0777, and it also allows octal
1576 numbers and back references to be unambiguously specified.
1577
1578 For greater clarity and unambiguity, it is best to avoid following \ by
1579 a digit greater than zero. Instead, use \o{} or \x{} to specify charac‐
1580 ter numbers, and \g{} to specify back references. The following para‐
1581 graphs describe the old, ambiguous syntax.
1582
1583 The handling of a backslash followed by a digit other than 0 is compli‐
1584 cated, and Perl has changed in recent releases, causing PCRE also to
1585 change. Outside a character class, PCRE reads the digit and any follow‐
1586 ing digits as a decimal number. If the number is < 8, or if there have
1587 been at least that many previous capturing left parentheses in the
1588 expression, the entire sequence is taken as a back reference. A
1589 description of how this works is provided later, following the discus‐
1590 sion of parenthesized subpatterns.
1591
1592 Inside a character class, or if the decimal number following \ is > 7
1593 and there have not been that many capturing subpatterns, PCRE handles
\8 and \9 as the literal characters "8" and "9", and otherwise re-reads
up to three octal digits following the backslash, using them to
generate a data character. Any subsequent digits stand for themselves.
1597 For example:
1598
1599 \040:
1600 Another way of writing an ASCII space
1601
1602 \40:
1603 The same, provided there are < 40 previous capturing subpatterns
1604
1605 \7:
1606 Always a back reference
1607
1608 \11:
1609 Can be a back reference, or another way of writing a tab
1610
1611 \011:
1612 Always a tab
1613
1614 \0113:
1615 A tab followed by character "3"
1616
1617 \113:
1618 Can be a back reference, otherwise the character with octal code
1619 113
1620
1621 \377:
1622 Can be a back reference, otherwise value 255 (decimal)
1623
1624 \81:
1625 Either a back reference, or the two characters "8" and "1"
1626
1627 Notice that octal values >= 100 that are specified using this syntax
1628 must not be introduced by a leading zero, as no more than three octal
1629 digits are ever read.
1630
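A few of the cases above, sketched as shell expressions. The patterns
contain fewer than 40 capturing subpatterns, so \40 is read as an octal
code here:

```erlang
%% Octal 40 is the ASCII space, with or without the leading zero.
{match,[{1,1}]}       = re:run("a b", "\\040", []).
{match,[{1,1}]}       = re:run("a b", "\\40", []).
%% \1 after one capturing subpattern is a back reference.
{match,[{0,2},{0,1}]} = re:run("aa", "(a)\\1", []).
%% \0113 is a tab (octal 011) followed by the character "3".
{match,[{0,2}]}       = re:run("\t3", "\\0113", []).
```
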
1631 By default, after \x that is not followed by {, from zero to two hexa‐
1632 decimal digits are read (letters can be in upper or lower case). Any
1633 number of hexadecimal digits may appear between \x{ and }. If a charac‐
1634 ter other than a hexadecimal digit appears between \x{ and }, or if
1635 there is no terminating }, an error occurs.
1636
1637 Characters whose value is less than 256 can be defined by either of the
1638 two syntaxes for \x. There is no difference in the way they are han‐
1639 dled. For example, \xdc is exactly the same as \x{dc}.
1640
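The unambiguous forms can be compared in the shell. The sketch assumes
the PCRE version named at the start of this manual page, as \o{} is a
recent addition:

```erlang
%% Four ways of writing a tab character (hex 09, octal 11) in a pattern.
Tab = "a\tb".
{match,[{0,3}]} = re:run(Tab, "a\\x09b", []).
{match,[{0,3}]} = re:run(Tab, "a\\x{9}b", []).
{match,[{0,3}]} = re:run(Tab, "a\\011b", []).
{match,[{0,3}]} = re:run(Tab, "a\\o{11}b", []).
%% \xdc and \x{dc} denote the same character (code 220).
{match,[{0,1}]} = re:run([220], "\\xdc", []).
{match,[{0,1}]} = re:run([220], "\\x{dc}", []).
```

In the second expression only two hexadecimal digits are read after \x,
so the following "b" is an ordinary pattern character.
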
1641 Constraints on character values
1642
1643 Characters that are specified using octal or hexadecimal numbers are
1644 limited to certain values, as follows:
1645
1646 8-bit non-UTF mode:
1647 < 0x100
1648
1649 8-bit UTF-8 mode:
1650 < 0x10ffff and a valid codepoint
1651
1652 Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-
1653 called "surrogate" codepoints), and 0xffef.
1654
1655 Escape sequences in character classes
1656
1657 All the sequences that define a single character value can be used both
1658 inside and outside character classes. Also, inside a character class,
1659 \b is interpreted as the backspace character (hex 08).
1660
1661 \N is not allowed in a character class. \B, \R, and \X are not special
1662 inside a character class. Like other unrecognized escape sequences,
1663 they are treated as the literal characters "B", "R", and "X". Outside a
1664 character class, these sequences have different meanings.
1665
1666 Unsupported Escape Sequences
1667
1668 In Perl, the sequences \l, \L, \u, and \U are recognized by its string
1669 handler and used to modify the case of following characters. PCRE does
1670 not support these escape sequences.
1671
1672 Absolute and Relative Back References
1673
1674 The sequence \g followed by an unsigned or a negative number, option‐
1675 ally enclosed in braces, is an absolute or relative back reference. A
1676 named back reference can be coded as \g{name}. Back references are dis‐
1677 cussed later, following the discussion of parenthesized subpatterns.
1678
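The different spellings of a back reference can be sketched as follows.
The subpattern name pair is made up for this example:

```erlang
{match,["abab"]} = re:run("abab", "(ab)\\g1", [{capture,first,list}]).
{match,["abab"]} = re:run("abab", "(ab)\\g{1}", [{capture,first,list}]).
%% A negative number is relative: -1 is the most recently opened
%% capturing subpattern before the reference.
{match,["abab"]} = re:run("abab", "(ab)\\g{-1}", [{capture,first,list}]).
{match,["abab"]} = re:run("abab", "(?<pair>ab)\\g{pair}", [{capture,first,list}]).
nomatch          = re:run("abcd", "(ab)\\g{1}", []).
```
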
1679 Absolute and Relative Subroutine Calls
1680
1681 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
1682 name or a number enclosed either in angle brackets or single quotes, is
1683 alternative syntax for referencing a subpattern as a "subroutine".
1684 Details are discussed later. Notice that \g{...} (Perl syntax) and
1685 \g<...> (Oniguruma syntax) are not synonymous. The former is a back
1686 reference and the latter is a subroutine call.
1687
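The difference between the two forms shows up when the second number
differs from the first. This is a sketch with made-up subjects:

```erlang
%% \g{1} is a back reference: it must match the same text as subpattern 1.
nomatch           = re:run("12-34", "(\\d+)-\\g{1}", []).
{match,["12-12"]} = re:run("12-12", "(\\d+)-\\g{1}", [{capture,first,list}]).
%% \g<1> is a subroutine call: it applies the subpattern \d+ again.
{match,["12-34"]} = re:run("12-34", "(\\d+)-\\g<1>", [{capture,first,list}]).
```
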
1688 Generic Character Types
1689
1690 Another use of backslash is for specifying generic character types:
1691
1692 \d:
1693 Any decimal digit
1694
1695 \D:
1696 Any character that is not a decimal digit
1697
1698 \h:
1699 Any horizontal whitespace character
1700
1701 \H:
1702 Any character that is not a horizontal whitespace character
1703
1704 \s:
1705 Any whitespace character
1706
1707 \S:
1708 Any character that is not a whitespace character
1709
1710 \v:
1711 Any vertical whitespace character
1712
1713 \V:
1714 Any character that is not a vertical whitespace character
1715
1716 \w:
1717 Any "word" character
1718
1719 \W:
1720 Any "non-word" character
1721
1722 There is also the single sequence \N, which matches a non-newline char‐
1723 acter. This is the same as the "." metacharacter when dotall is not
1724 set. Perl also uses \N to match characters by name, but PCRE does not
1725 support this.
1726
1727 Each pair of lowercase and uppercase escape sequences partitions the
1728 complete set of characters into two disjoint sets. Any given character
1729 matches one, and only one, of each pair. The sequences can appear both
1730 inside and outside character classes. They each match one character of
1731 the appropriate type. If the current matching point is at the end of
1732 the subject string, all fail, as there is no character to match.
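
A minimal sketch of these character types in use, assuming default
options and an invented subject string:

```erlang
%% \d+ matches the first run of decimal digits.
TypeDigits = re:run("order 66 shipped", "\\d+",
                    [{capture, first, list}]).
%% {match,["66"]}

%% \w+ with option global collects every run of "word" characters.
TypeWords = re:run("order 66 shipped", "\\w+",
                   [global, {capture, first, list}]).
%% {match,[["order"],["66"],["shipped"]]}

%% At the end of the subject even a negated type such as \D fails,
%% because there is no character left to match.
TypeAtEnd = re:run("abc", "abc\\D").
%% nomatch
```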
1733
1734 For compatibility with Perl, \s formerly did not match the VT character
1735 (code 11), which made it different from the POSIX "space" class.
1736 However, Perl added VT at release 5.18, and PCRE followed suit at
1737 release 8.34. The default \s characters are now HT (9), LF (10), VT
1738 (11), FF (12), CR (13), and space (32), which are defined as white
1739 space in the "C" locale. This list may vary if locale-specific matching
1740 is taking place. For example, in some locales the "non-breaking space"
1741 character (\xA0) is recognized as white space, and in others the VT
1742 character is not.
1743
1744 A "word" character is an underscore or any character that is a letter
1745 or a digit. By default, the definition of letters and digits is con‐
1746 trolled by the PCRE low-valued character tables, in Erlang's case (and
1747 without option unicode), the ISO Latin-1 character set.
1748
1749 By default, in unicode mode, characters with values > 255, that is, all
1750 characters outside the ISO Latin-1 character set, never match \d, \s,
1751 or \w, and always match \D, \S, and \W. These sequences retain their
1752 original meanings from before UTF support was available, mainly for
1753 efficiency reasons. However, if option ucp is set, the behavior is
1754 changed so that Unicode properties are used to determine character
1755 types, as follows:
1756
1757 \d:
1758 Any character that \p{Nd} matches (decimal digit)
1759
1760 \s:
1761 Any character that matches \p{Z} or \h or \v
1762
1763 \w:
1764 Any character that matches \p{L} or \p{N}, plus underscore
1765
1766 The uppercase escapes match the inverse sets of characters. Notice that
1767 \d matches only decimal digits, while \w matches any Unicode digit, any
1768 Unicode letter, and underscore. Notice also that ucp affects \b and \B,
1769 as they are defined in terms of \w and \W. Matching these sequences is
1770 noticeably slower when ucp is set.
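
The effect of ucp can be sketched with a single character outside ISO
Latin-1. The choice of U+03B1 is only an illustration; {capture, none}
is used so that the result is simply match or nomatch.

```erlang
%% U+03B1 (Greek small letter alpha) lies outside ISO Latin-1.
UcpSubject = [16#3B1].

%% unicode alone: \w keeps its pre-UTF meaning and does not match.
UcpOff = re:run(UcpSubject, "\\w", [unicode, {capture, none}]).
%% nomatch

%% unicode plus ucp: \w is defined through \p{L} and \p{N}.
UcpOn = re:run(UcpSubject, "\\w", [unicode, ucp, {capture, none}]).
%% match
```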
1771
1772 The sequences \h, \H, \v, and \V are features that were added to Perl
1773 in release 5.10. In contrast to the other sequences, which match only
1774 ASCII characters by default, these always match certain high-valued
1775 code points, regardless of whether ucp is set.
1776
1777 The following are the horizontal space characters:
1778
1779 U+0009:
1780 Horizontal tab (HT)
1781
1782 U+0020:
1783 Space
1784
1785 U+00A0:
1786 Non-break space
1787
1788 U+1680:
1789 Ogham space mark
1790
1791 U+180E:
1792 Mongolian vowel separator
1793
1794 U+2000:
1795 En quad
1796
1797 U+2001:
1798 Em quad
1799
1800 U+2002:
1801 En space
1802
1803 U+2003:
1804 Em space
1805
1806 U+2004:
1807 Three-per-em space
1808
1809 U+2005:
1810 Four-per-em space
1811
1812 U+2006:
1813 Six-per-em space
1814
1815 U+2007:
1816 Figure space
1817
1818 U+2008:
1819 Punctuation space
1820
1821 U+2009:
1822 Thin space
1823
1824 U+200A:
1825 Hair space
1826
1827 U+202F:
1828 Narrow no-break space
1829
1830 U+205F:
1831 Medium mathematical space
1832
1833 U+3000:
1834 Ideographic space
1835
1836 The following are the vertical space characters:
1837
1838 U+000A:
1839 Line feed (LF)
1840
1841 U+000B:
1842 Vertical tab (VT)
1843
1844 U+000C:
1845 Form feed (FF)
1846
1847 U+000D:
1848 Carriage return (CR)
1849
1850 U+0085:
1851 Next line (NEL)
1852
1853 U+2028:
1854 Line separator
1855
1856 U+2029:
1857 Paragraph separator
1858
1859 In 8-bit, non-UTF-8 mode, only the characters with code points < 256
1860 are relevant.
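
As a small sketch of the two lists above in 8-bit mode (the subjects
are invented; no unicode option is given, so only code points < 256
are involved):

```erlang
%% \h splits on the tab and on the space.
HvFields = re:split("a\tb c", "\\h", [{return, list}]).
%% ["a","b","c"]

%% \h also matches the non-break space (code 160) ...
HvNbsp = re:run([16#A0], "\\h", [{capture, none}]).
%% match

%% ... and \v matches next line (NEL, code 133), but not a space.
HvNel = re:run([16#85], "\\v", [{capture, none}]).
%% match
HvSpace = re:run(" ", "\\v", [{capture, none}]).
%% nomatch
```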
1861
1862 Newline Sequences
1863
1864 Outside a character class, by default, the escape sequence \R matches
1865 any Unicode newline sequence. In non-UTF-8 mode, \R is equivalent to
1866 the following:
1867
1868 (?>\r\n|\n|\x0b|\f|\r|\x85)
1869
1870 This is an example of an "atomic group"; details are provided below.
1871
1872 This particular group matches either the two-character sequence CR fol‐
1873 lowed by LF, or one of the single characters LF (line feed, U+000A), VT
1874 (vertical tab, U+000B), FF (form feed, U+000C), CR (carriage return,
1875 U+000D), or NEL (next line, U+0085). The two-character sequence is
1876 treated as a single unit that cannot be split.
1877
1878 In Unicode mode, two more characters whose code points are > 255 are
1879 added: LS (line separator, U+2028) and PS (paragraph separator,
1880 U+2029). Unicode character property support is not needed for these
1881 characters to be recognized.
1882
1883 \R can be restricted to match only CR, LF, or CRLF (instead of the com‐
1884 plete set of Unicode line endings) by setting option bsr_anycrlf either
1885 at compile time or when the pattern is matched. (BSR is an acronym for
1886 "backslash R".) This can be made the default when PCRE is built; if so,
1887 the other behavior can be requested through option bsr_unicode. These
1888 settings can also be specified by starting a pattern string with one of
1889 the following sequences:
1890
1891 (*BSR_ANYCRLF):
1892 CR, LF, or CRLF only
1893
1894 (*BSR_UNICODE):
1895 Any Unicode newline sequence
1896
1897 These override the default and the options specified to the compiling
1898 function, but they can themselves be overridden by options specified to
1899 a matching function. Notice that these special settings, which are not
1900 Perl-compatible, are recognized only at the very start of a pattern,
1901 and that they must be in upper case. If more than one of them is
1902 present, the last one is used. They can be combined with a change of
1903 newline convention; for example, a pattern can start with:
1904
1905 (*ANY)(*BSR_ANYCRLF)
1906
1907 They can also be combined with the (*UTF8), (*UTF), or (*UCP) special
1908 sequences. Inside a character class, \R is treated as an unrecognized
1909 escape sequence, and so matches the letter "R" by default.
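
A minimal sketch of \R and its restriction, assuming the default
settings of this module (no unicode option, \R matching any newline
sequence). The subjects are invented for illustration.

```erlang
%% By default \R treats CRLF as one unit and also accepts LF and CR.
NlLines = re:split("one\r\ntwo\nthree\rfour", "\\R", [{return, list}]).
%% ["one","two","three","four"]

%% Vertical tab (written "\v" in an Erlang string) is a newline
%% sequence for the default \R ...
NlVt = re:run("a\vb", "a\\Rb", [{capture, none}]).
%% match

%% ... but not when \R is restricted to CR, LF, or CRLF, either by
%% option or by a setting at the very start of the pattern.
NlVtOpt = re:run("a\vb", "a\\Rb", [bsr_anycrlf, {capture, none}]).
%% nomatch
NlVtPat = re:run("a\vb", "(*BSR_ANYCRLF)a\\Rb", [{capture, none}]).
%% nomatch
```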
1910
1911 Unicode Character Properties
1912
1913 Three more escape sequences that match characters with specific proper‐
1914 ties are available. When in 8-bit non-UTF-8 mode, these sequences are
1915 limited to testing characters whose code points are < 256, but they do
1916 work in this mode. The following are the extra escape sequences:
1917
1918 \p{xx}:
1919 A character with property xx
1920
1921 \P{xx}:
1922 A character without property xx
1923
1924 \X:
1925 A Unicode extended grapheme cluster
1926
1927 The property names represented by xx above are limited to the Unicode
1928 script names, the general category properties, "Any", which matches any
1929 character (including newline), and some special PCRE properties
1930 (described in the next section). Other Perl properties, such as "InMu‐
1931 sicalSymbols", are currently not supported by PCRE. Notice that \P{Any}
1932 does not match any characters and always causes a match failure.
1933
1934 Sets of Unicode characters are defined as belonging to certain scripts.
1935 A character from one of these sets can be matched using a script name,
1936 for example:
1937
1938 \p{Greek} \P{Han}
1939
1940 Those that are not part of an identified script are lumped together as
1941 "Common". The following is the current list of scripts:
1942
1943 * Arabic
1944
1945 * Armenian
1946
1947 * Avestan
1948
1949 * Balinese
1950
1951 * Bamum
1952
1953 * Bassa_Vah
1954
1955 * Batak
1956
1957 * Bengali
1958
1959 * Bopomofo
1960
1961 * Braille
1962
1963 * Buginese
1964
1965 * Buhid
1966
1967 * Canadian_Aboriginal
1968
1969 * Carian
1970
1971 * Caucasian_Albanian
1972
1973 * Chakma
1974
1975 * Cham
1976
1977 * Cherokee
1978
1979 * Common
1980
1981 * Coptic
1982
1983 * Cuneiform
1984
1985 * Cypriot
1986
1987 * Cyrillic
1988
1989 * Deseret
1990
1991 * Devanagari
1992
1993 * Duployan
1994
1995 * Egyptian_Hieroglyphs
1996
1997 * Elbasan
1998
1999 * Ethiopic
2000
2001 * Georgian
2002
2003 * Glagolitic
2004
2005 * Gothic
2006
2007 * Grantha
2008
2009 * Greek
2010
2011 * Gujarati
2012
2013 * Gurmukhi
2014
2015 * Han
2016
2017 * Hangul
2018
2019 * Hanunoo
2020
2021 * Hebrew
2022
2023 * Hiragana
2024
2025 * Imperial_Aramaic
2026
2027 * Inherited
2028
2029 * Inscriptional_Pahlavi
2030
2031 * Inscriptional_Parthian
2032
2033 * Javanese
2034
2035 * Kaithi
2036
2037 * Kannada
2038
2039 * Katakana
2040
2041 * Kayah_Li
2042
2043 * Kharoshthi
2044
2045 * Khmer
2046
2047 * Khojki
2048
2049 * Khudawadi
2050
2051 * Lao
2052
2053 * Latin
2054
2055 * Lepcha
2056
2057 * Limbu
2058
2059 * Linear_A
2060
2061 * Linear_B
2062
2063 * Lisu
2064
2065 * Lycian
2066
2067 * Lydian
2068
2069 * Mahajani
2070
2071 * Malayalam
2072
2073 * Mandaic
2074
2075 * Manichaean
2076
2077 * Meetei_Mayek
2078
2079 * Mende_Kikakui
2080
2081 * Meroitic_Cursive
2082
2083 * Meroitic_Hieroglyphs
2084
2085 * Miao
2086
2087 * Modi
2088
2089 * Mongolian
2090
2091 * Mro
2092
2093 * Myanmar
2094
2095 * Nabataean
2096
2097 * New_Tai_Lue
2098
2099 * Nko
2100
2101 * Ogham
2102
2103 * Ol_Chiki
2104
2105 * Old_Italic
2106
2107 * Old_North_Arabian
2108
2109 * Old_Permic
2110
2111 * Old_Persian
2112
2113 * Oriya
2114
2115 * Old_South_Arabian
2116
2117 * Old_Turkic
2118
2119 * Osmanya
2120
2121 * Pahawh_Hmong
2122
2123 * Palmyrene
2124
2125 * Pau_Cin_Hau
2126
2127 * Phags_Pa
2128
2129 * Phoenician
2130
2131 * Psalter_Pahlavi
2132
2133 * Rejang
2134
2135 * Runic
2136
2137 * Samaritan
2138
2139 * Saurashtra
2140
2141 * Sharada
2142
2143 * Shavian
2144
2145 * Siddham
2146
2147 * Sinhala
2148
2149 * Sora_Sompeng
2150
2151 * Sundanese
2152
2153 * Syloti_Nagri
2154
2155 * Syriac
2156
2157 * Tagalog
2158
2159 * Tagbanwa
2160
2161 * Tai_Le
2162
2163 * Tai_Tham
2164
2165 * Tai_Viet
2166
2167 * Takri
2168
2169 * Tamil
2170
2171 * Telugu
2172
2173 * Thaana
2174
2175 * Thai
2176
2177 * Tibetan
2178
2179 * Tifinagh
2180
2181 * Tirhuta
2182
2183 * Ugaritic
2184
2185 * Vai
2186
2187 * Warang_Citi
2188
2189 * Yi
2190
2191 Each character has exactly one Unicode general category property, spec‐
2192 ified by a two-letter acronym. For compatibility with Perl, negation
2193 can be specified by including a circumflex between the opening brace
2194 and the property name. For example, \p{^Lu} is the same as \P{Lu}.
2195
2196 If only one letter is specified with \p or \P, it includes all the gen‐
2197 eral category properties that start with that letter. In this case, in
2198 the absence of negation, the curly brackets in the escape sequence are
2199 optional. The following two examples have the same effect:
2200
2201 \p{L}
2202 \pL
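
The script names, the negation syntax, and the brace-less form can be
tried out as in the following sketch. The subjects are invented; with
option unicode, the reported offsets are byte positions in the UTF-8
encoding of the subject.

```erlang
%% A script name selects characters of that script; U+03B1 is Greek.
PropGreek = re:run([$a, 16#3B1], "\\p{Greek}", [unicode]).
%% {match,[{1,2}]}  (byte offset 1, two UTF-8 bytes)

%% For a one-letter property the braces are optional.
PropBraces = re:run("42 apples", "\\p{L}+",
                    [unicode, {capture, first, list}]).
PropShort = re:run("42 apples", "\\pL+",
                   [unicode, {capture, first, list}]).
%% both are {match,["apples"]}

%% \p{^Lu} is the same as \P{Lu}: the first non-uppercase character.
PropCaret = re:run("ABc", "\\p{^Lu}", [unicode]).
PropUpperP = re:run("ABc", "\\P{Lu}", [unicode]).
%% both are {match,[{2,1}]}
```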
2203
2204 The following general category property codes are supported:
2205
2206 C:
2207 Other
2208
2209 Cc:
2210 Control
2211
2212 Cf:
2213 Format
2214
2215 Cn:
2216 Unassigned
2217
2218 Co:
2219 Private use
2220
2221 Cs:
2222 Surrogate
2223
2224 L:
2225 Letter
2226
2227 Ll:
2228 Lowercase letter
2229
2230 Lm:
2231 Modifier letter
2232
2233 Lo:
2234 Other letter
2235
2236 Lt:
2237 Title case letter
2238
2239 Lu:
2240 Uppercase letter
2241
2242 M:
2243 Mark
2244
2245 Mc:
2246 Spacing mark
2247
2248 Me:
2249 Enclosing mark
2250
2251 Mn:
2252 Non-spacing mark
2253
2254 N:
2255 Number
2256
2257 Nd:
2258 Decimal number
2259
2260 Nl:
2261 Letter number
2262
2263 No:
2264 Other number
2265
2266 P:
2267 Punctuation
2268
2269 Pc:
2270 Connector punctuation
2271
2272 Pd:
2273 Dash punctuation
2274
2275 Pe:
2276 Close punctuation
2277
2278 Pf:
2279 Final punctuation
2280
2281 Pi:
2282 Initial punctuation
2283
2284 Po:
2285 Other punctuation
2286
2287 Ps:
2288 Open punctuation
2289
2290 S:
2291 Symbol
2292
2293 Sc:
2294 Currency symbol
2295
2296 Sk:
2297 Modifier symbol
2298
2299 Sm:
2300 Mathematical symbol
2301
2302 So:
2303 Other symbol
2304
2305 Z:
2306 Separator
2307
2308 Zl:
2309 Line separator
2310
2311 Zp:
2312 Paragraph separator
2313
2314 Zs:
2315 Space separator
2316
2317 The special property L& is also supported. It matches a character that
2318 has the Lu, Ll, or Lt property, that is, a letter that is not classi‐
2319 fied as a modifier or "other".
2320
2321 The Cs (Surrogate) property applies only to characters in the range
2322 U+D800 to U+DFFF. Such characters are invalid in Unicode strings and so
2323 cannot be tested by PCRE. Perl does not support the Cs property.
2324
2325 The long synonyms for property names supported by Perl (such as \p{Let‐
2326 ter}) are not supported by PCRE. It is not permitted to prefix any of
2327 these properties with "Is".
2328
2329 No character in the Unicode table has the Cn (unassigned) property.
2330 This property is instead assumed for any code point that is not in the
2331 Unicode table.
2332
2333 Specifying caseless matching does not affect these escape sequences.
2334 For example, \p{Lu} always matches only uppercase letters. This is dif‐
2335 ferent from the behavior of current versions of Perl.
2336
2337 Matching characters by Unicode property is not fast, as PCRE must do a
2338 multistage table lookup to find a character property. That is why the
2339 traditional escape sequences such as \d and \w do not use Unicode prop‐
2340 erties in PCRE by default. However, you can make them do so by setting
2341 option ucp or by starting the pattern with (*UCP).
2342
2343 Extended Grapheme Clusters
2344
2345 The \X escape matches any number of Unicode characters that form an
2346 "extended grapheme cluster", and treats the sequence as an atomic group
2347 (see below). Up to and including release 8.31, PCRE matched an earlier,
2348 simpler definition that was equivalent to (?>\PM\pM*). That is, it
2349 matched a character without the "mark" property, followed by zero or
2350 more characters with the "mark" property. Characters with the "mark"
2351 property are typically non-spacing accents that affect the preceding
2352 character.
2353
2354 This simple definition was extended in Unicode to include more compli‐
2355 cated kinds of composite character by giving each character a grapheme
2356 breaking property, and creating rules that use these properties to
2357 define the boundaries of extended grapheme clusters. In PCRE releases
2358 later than 8.31, \X matches one of these clusters.
2359
2360 \X always matches at least one character. Then it decides whether to
2361 add more characters according to the following rules for ending a clus‐
2362 ter:
2363
2364 * End at the end of the subject string.
2365
2366 * Do not end between CR and LF; otherwise end after any control char‐
2367 acter.
2368
2369 * Do not break Hangul (a Korean script) syllable sequences. Hangul
2370 characters are of five types: L, V, T, LV, and LVT. An L character
2371 can be followed by an L, V, LV, or LVT character. An LV or V char‐
2372 acter can be followed by a V or T character. An LVT or T character
2373 can be followed only by a T character.
2374
2375 * Do not end before extending characters or spacing marks. Characters
2376 with the "mark" property always have the "extend" grapheme breaking
2377 property.
2378
2379 * Do not end after prepend characters.
2380
2381 * Otherwise, end the cluster.
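
Two of these rules can be observed with a short sketch. The subject
(an "e" followed by a combining acute accent) is chosen only for
illustration; with {capture, first, list} and option unicode the
result is a list of code points.

```erlang
%% "e" followed by U+0301 (combining acute accent), then "x".
GcSubject = [$e, 16#301, $x].

%% \X takes the base character together with its combining mark ...
GcCluster = re:run(GcSubject, "\\X", [unicode, {capture, first, list}]).
%% {match,[[101,769]]}

%% ... whereas a dot matches the base character only.
GcDot = re:run(GcSubject, ".", [unicode, {capture, first, list}]).
%% {match,["e"]}

%% A cluster does not end between CR and LF.
GcCrLf = re:run("\r\n", "\\X", [unicode]).
%% {match,[{0,2}]}
```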
2382
2383 PCRE Additional Properties
2384
2385 In addition to the standard Unicode properties described earlier, PCRE
2386 supports four more that make it possible to convert traditional escape
2387 sequences, such as \w and \s, to use Unicode properties. PCRE uses these
2388 non-standard, non-Perl properties internally when the ucp option is
2389 passed. However, they can also be used explicitly. The properties are
2390 as follows:
2391
2392 Xan:
2393 Any alphanumeric character. Matches characters that have either the
2394 L (letter) or the N (number) property.
2395
2396 Xps:
2397 Any Posix space character. Matches the characters tab, line feed,
2398 vertical tab, form feed, carriage return, and any other character
2399 that has the Z (separator) property.
2400
2401 Xsp:
2402 Any Perl space character. Matches the same characters as Xps; since
2403 PCRE release 8.34, vertical tab is no longer excluded.
2404
2405 Xwd:
2406 Any Perl "word" character. Matches the same characters as Xan, plus
2407 underscore.
2408
2409 Perl and POSIX space are now the same. Perl added VT to its space char‐
2410 acter set at release 5.18 and PCRE changed at release 8.34.
2411
2412 Xan matches characters that have either the L (letter) or the N (num‐
2413 ber) property. Xps matches the characters tab, linefeed, vertical tab,
2414 form feed, or carriage return, and any other character that has the Z
2415 (separator) property. Xsp is the same as Xps; it used to exclude verti‐
2416 cal tab, for Perl compatibility, but Perl changed, and so PCRE followed
2417 at release 8.34. Xwd matches the same characters as Xan, plus under‐
2418 score.
2419
2420 There is another non-standard property, Xuc, which matches any charac‐
2421 ter that can be represented by a Universal Character Name in C++ and
2422 other programming languages. These are the characters $, @, ` (grave
2423 accent), and all characters with Unicode code points >= U+00A0, except
2424 for the surrogates U+D800 to U+DFFF. Notice that most base (ASCII)
2425 characters are excluded. (Universal Character Names are of the form
2426 \uHHHH or \UHHHHHHHH, where H is a hexadecimal digit. Notice that the
2427 Xuc property does not match these sequences but the characters that
2428 they represent.)
2429
2430 Resetting the Match Start
2431
2432 The escape sequence \K causes any previously matched characters not to
2433 be included in the final matched sequence. For example, the following
2434 pattern matches "foobar", but reports that it has matched "bar":
2435
2436 foo\Kbar
2437
2438 This feature is similar to a lookbehind assertion (described below).
2439 However, in this case, the part of the subject before the real match
2440 does not have to be of fixed length, as lookbehind assertions do. The
2441 use of \K does not interfere with the setting of captured substrings.
2442 For example, when the following pattern matches "foobar", the first
2443 substring is still set to "foo":
2444
2445 (foo)\Kbar
2446
2447 Perl documents that the use of \K within assertions is "not well
2448 defined". In PCRE, \K is acted upon when it occurs inside positive
2449 assertions, but is ignored in negative assertions. Note that when a
2450 pattern such as (?=ab\K) matches, the reported start of the match can
2451 be greater than the end of the match.
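
The two patterns above can be run as follows; this is only a sketch
with {capture, all, list} chosen to make the reported substrings
visible.

```erlang
%% The whole of "foobar" must match, but only "bar" is reported.
ResetPlain = re:run("foobar", "foo\\Kbar", [{capture, all, list}]).
%% {match,["bar"]}

%% \K does not disturb captured substrings: group 1 is still "foo".
ResetGroup = re:run("foobar", "(foo)\\Kbar", [{capture, all, list}]).
%% {match,["bar","foo"]}
```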
2452
2453 Simple Assertions
2454
2455 The final use of backslash is for certain simple assertions. An asser‐
2456 tion specifies a condition that must be met at a particular point in a
2457 match, without consuming any characters from the subject string. The
2458 use of subpatterns for more complicated assertions is described below.
2459 The following are the backslashed assertions:
2460
2461 \b:
2462 Matches at a word boundary.
2463
2464 \B:
2465 Matches when not at a word boundary.
2466
2467 \A:
2468 Matches at the start of the subject.
2469
2470 \Z:
2471 Matches at the end of the subject, and before a newline at the end
2472 of the subject.
2473
2474 \z:
2475 Matches only at the end of the subject.
2476
2477 \G:
2478 Matches at the first matching position in the subject.
2479
2480 Inside a character class, \b has a different meaning; it matches the
2481 backspace character. If any other of these assertions appears in a
2482 character class, by default it matches the corresponding literal char‐
2483 acter (for example, \B matches the letter B).
2484
2485 A word boundary is a position in the subject string where the current
2486 character and the previous character do not both match \w or \W (that
2487 is, one matches \w and the other matches \W), or the start or end of
2488 the string if the first or last character matches \w, respectively. In
2489 UTF mode, the meanings of \w and \W can be changed by setting option
2490 ucp. When this is done, it also affects \b and \B. PCRE and Perl do not
2491 have a separate "start of word" or "end of word" metasequence. However,
2492 whatever follows \b normally determines which it is. For example, the
2493 fragment \ba matches "a" at the start of a word.
2494
2495 The \A, \Z, and \z assertions differ from the traditional circumflex
2496 and dollar (described in the next section) in that they only ever match
2497 at the very start and end of the subject string, whatever options are
2498 set. Thus, they are independent of multiline mode. These three asser‐
2499 tions are not affected by options notbol or noteol, which affect only
2500 the behavior of the circumflex and dollar metacharacters. However, if
2501 argument startoffset of run/3 is non-zero, indicating that matching is
2502 to start at a point other than the beginning of the subject, \A can
2503 never match. The difference between \Z and \z is that \Z matches before
2504 a newline at the end of the string and at the very end, while \z
2505 matches only at the end.
2506
2507 The \G assertion is true only when the current matching position is at
2508 the start point of the match, as specified by argument startoffset of
2509 run/3. It differs from \A when the value of startoffset is non-zero. By
2510 calling run/3 multiple times with appropriate arguments, you can mimic
2511 the Perl option /g, and it is in this kind of implementation where \G
2512 can be useful.
2513
2514 Notice, however, that the PCRE interpretation of \G, as the start of
2515 the current match, is subtly different from Perl, which defines it as
2516 the end of the previous match. In Perl, these can be different when the
2517 previously matched string was empty. As PCRE does only one match at a
2518 time, it cannot reproduce this behavior.
2519
2520 If all the alternatives of a pattern begin with \G, the expression is
2521 anchored to the starting match position, and the "anchored" flag is set
2522 in the compiled regular expression.
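
The simple assertions can be sketched as follows. The subjects are
invented, and the start offset is passed with option {offset, N} of
run/3, which is assumed here to be the "startoffset" that the text
refers to.

```erlang
%% \b: "cat" inside "concat" is not at a word boundary; the
%% standalone word at offset 9 is.
AsWord = re:run("concat a cat", "\\bcat\\b").
%% {match,[{9,3}]}

%% \Z also matches before a final newline, \z only at the very end.
AsBigZ = re:run("abc\n", "abc\\Z", [{capture, none}]).
%% match
AsSmallZ = re:run("abc\n", "abc\\z", [{capture, none}]).
%% nomatch

%% With a non-zero start offset, \G matches at the start point of
%% the match, while \A can never match.
AsG = re:run("xxabc", "\\Gabc", [{offset, 2}]).
%% {match,[{2,3}]}
AsA = re:run("xxabc", "\\Aabc", [{offset, 2}]).
%% nomatch
```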
2523
2525 The circumflex and dollar metacharacters are zero-width assertions.
2526 That is, they test for a particular condition to be true without con‐
2527 suming any characters from the subject string.
2528
2529 Outside a character class, in the default matching mode, the circumflex
2530 character is an assertion that is true only if the current matching
2531 point is at the start of the subject string. If argument startoffset of
2532 run/3 is non-zero, circumflex can never match if option multiline is
2533 unset. Inside a character class, circumflex has an entirely different
2534 meaning (see below).
2535
2536 Circumflex need not be the first character of the pattern if some
2537 alternatives are involved, but it must be the first thing in each
2538 alternative in which it appears if the pattern is ever to match that
2539 branch. If all possible alternatives start with a circumflex, that is,
2540 if the pattern is constrained to match only at the start of the sub‐
2541 ject, it is said to be an "anchored" pattern. (There are also other
2542 constructs that can cause a pattern to be anchored.)
2543
2544 The dollar character is an assertion that is true only if the current
2545 matching point is at the end of the subject string, or immediately
2546 before a newline at the end of the string (by default). Notice however
2547 that it does not match the newline. Dollar need not be the last
2548 character of the pattern if some alternatives are involved, but it
2549 must be the last item in any branch in which it appears. Dollar has no
2550 special meaning in a character class.
2551
2552 The meaning of dollar can be changed so that it matches only at the
2553 very end of the string, by setting option dollar_endonly at compile
2554 time. This does not affect the \Z assertion.
2555
2556 The meanings of the circumflex and dollar characters are changed if
2557 option multiline is set. When this is the case, a circumflex matches
2558 immediately after internal newlines and at the start of the subject
2559 string. It does not match after a newline that ends the string. A dol‐
2560 lar matches before any newlines in the string, and at the very end,
2561 when multiline is set. When newline is specified as the two-character
2562 sequence CRLF, isolated CR and LF characters do not indicate newlines.
2563
2564 For example, the pattern /^abc$/ matches the subject string "def\nabc"
2565 (where \n represents a newline) in multiline mode, but not otherwise.
2566 So, patterns that are anchored in single-line mode because all branches
2567 start with ^ are not anchored in multiline mode, and a match for cir‐
2568 cumflex is possible when argument startoffset of run/3 is non-zero.
2569 Option dollar_endonly is ignored if multiline is set.
2570
2571 Notice that the sequences \A, \Z, and \z can be used to match the start
2572 and end of the subject in both modes. If all branches of a pattern
2573 start with \A, it is always anchored, whether or not multiline is set.
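
A minimal sketch of circumflex and dollar under the options discussed
in this section, assuming the default newline setting (LF):

```erlang
%% Without multiline, ^ matches only at the start of the subject.
AnchSingle = re:run("def\nabc", "^abc$").
%% nomatch

%% With multiline, ^ also matches right after the internal newline.
AnchMulti = re:run("def\nabc", "^abc$", [multiline]).
%% {match,[{4,3}]}

%% By default $ also matches before a newline that ends the string;
%% dollar_endonly restricts it to the very end.
AnchDollar = re:run("abc\n", "abc$", [{capture, none}]).
%% match
AnchEndOnly = re:run("abc\n", "abc$", [dollar_endonly, {capture, none}]).
%% nomatch
```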
2574
2576 Outside a character class, a dot in the pattern matches any character
2577 in the subject string except (by default) a character that signifies
2578 the end of a line.
2579
2580 When a line ending is defined as a single character, dot never matches
2581 that character. When the two-character sequence CRLF is used, dot does
2582 not match CR if it is immediately followed by LF, otherwise it matches
2583 all characters (including isolated CRs and LFs). When any Unicode line
2584 endings are recognized, dot does not match CR, LF, or any of the other
2585 line-ending characters.
2586
2587 The behavior of dot regarding newlines can be changed. If option dotall
2588 is set, a dot matches any character, without exception. If the two-
2589 character sequence CRLF is present in the subject string, it takes two
2590 dots to match it.
2591
2592 The handling of dot is entirely independent of the handling of circum‐
2593 flex and dollar; the only relationship is that both involve newlines.
2594 Dot has no special meaning in a character class.
2595
2596 The escape sequence \N behaves like a dot, except that it is not
2597 affected by option dotall. That is, it matches any character
2598 except one that signifies the end of a line. Perl also uses \N to match
2599 characters by name but PCRE does not support this.
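
The behavior of dot and \N can be sketched as follows, assuming the
default newline setting (LF) and invented subjects:

```erlang
%% By default a dot does not match the newline.
DotPlain = re:run("a\nb", "a.b", [{capture, none}]).
%% nomatch

%% With dotall it does.
DotAll = re:run("a\nb", "a.b", [dotall, {capture, none}]).
%% match

%% \N is not affected by dotall.
DotN = re:run("a\nb", "a\\Nb", [dotall, {capture, none}]).
%% nomatch

%% Even with dotall, the sequence CRLF takes two dots.
DotCrLfOne = re:run("a\r\nb", "a.b", [dotall, {capture, none}]).
%% nomatch
DotCrLfTwo = re:run("a\r\nb", "a..b", [dotall, {capture, none}]).
%% match
```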
2600
2602 Outside a character class, the escape sequence \C matches any data
2603 unit, whether or not a UTF mode is set. One data unit is one byte.
2604 Unlike a dot, \C always matches line-ending characters. The feature is
2605 provided in Perl to match individual bytes in UTF-8 mode, but it is
2606 unclear how it can usefully be used. As \C breaks up characters into
2607 individual data units, matching one unit with \C in a UTF mode means
2608 that the remaining string can start with a malformed UTF character.
2609 This has undefined results, as PCRE assumes that it deals with valid
2610 UTF strings.
2611
2612 PCRE does not allow \C to appear in lookbehind assertions (described
2613 below) in a UTF mode, as this would make it impossible to calculate the
2614 length of the lookbehind.
2615
2616 The \C escape sequence is best avoided. However, one way of using it
2617 that avoids the problem of malformed UTF characters is to use a looka‐
2618 head to check the length of the next character, as in the following
2619 pattern, which can be used with a UTF-8 string (ignore whitespace and
2620 line breaks):
2621
2622 (?| (?=[\x00-\x7f])(\C) |
2623 (?=[\x80-\x{7ff}])(\C)(\C) |
2624 (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
2625 (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
2626
2627 A group that starts with (?| resets the capturing parentheses numbers
2628 in each alternative (see section Duplicate Subpattern Numbers). The
2629 assertions at the start of each branch check the next UTF-8 character
2630 for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
2631 individual bytes of the character are then captured by the appropriate
2632 number of groups.
2633
2635 An opening square bracket introduces a character class, terminated by a
2636 closing square bracket. A closing square bracket on its own is not spe‐
2637 cial by default. However, if option PCRE_JAVASCRIPT_COMPAT is set, a
2638 lone closing square bracket causes a compile-time error. If a closing
2639 square bracket is required as a member of the class, it must be the
2640 first data character in the class (after an initial circumflex, if
2641 present) or escaped with a backslash.
2642
2643 A character class matches a single character in the subject. In a UTF
2644 mode, the character can be more than one data unit long. A matched
2645 character must be in the set of characters defined by the class, unless
2646 the first character in the class definition is a circumflex, in which
2647 case the subject character must not be in the set defined by the class.
2648 If a circumflex is required as a member of the class, ensure that it is
2649 not the first character, or escape it with a backslash.
2650
2651 For example, the character class [aeiou] matches any lowercase vowel,
2652 while [^aeiou] matches any character that is not a lowercase vowel.
2653 Notice that a circumflex is just a convenient notation for specifying
2654 the characters that are in the class by enumerating those that are not.
2655 A class that starts with a circumflex is not an assertion; it still
2656 consumes a character from the subject string, and therefore it fails if
2657 the current pointer is at the end of the string.
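
These rules can be checked with a few calls; the subjects are invented
and default options are assumed.

```erlang
%% "sky" contains no lowercase vowel ...
ClsNone = re:run("sky", "[aeiou]").
%% nomatch

%% ... so the negated class matches its first character.
ClsNeg = re:run("sky", "[^aeiou]").
%% {match,[{0,1}]}

%% A negated class still consumes a character: after "a" nothing is
%% left in the subject, so the match fails.
ClsEnd = re:run("a", "a[^b]").
%% nomatch

%% A closing square bracket as the first data character is a member
%% of the class.
ClsBracket = re:run("]", "[]a]").
%% {match,[{0,1}]}
```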
2658
2659 In UTF-8 mode, characters with values > 255 can be included in
2660 a class as a literal string of data units, or by using the \x{ escaping
2661 mechanism.
2662
2663 When caseless matching is set, any letters in a class represent both
2664 their uppercase and lowercase versions. For example, a caseless [aeiou]
2665 matches "A" and "a", and a caseless [^aeiou] does not match "A", but a
2666 caseful version would. In a UTF mode, PCRE always understands the con‐
2667 cept of case for characters whose values are < 256, so caseless match‐
2668 ing is always possible. For characters with higher values, the concept
2669 of case is supported only if PCRE is compiled with Unicode property
2670 support. If you want to use caseless matching in a UTF mode for charac‐
2671 ters >= 256, ensure that PCRE is compiled with Unicode property support
2672 and with UTF support.
2673
2674 Characters that can indicate line breaks are never treated in any spe‐
2675 cial way when matching character classes, whatever line-ending sequence
2676 is in use, and whatever setting of options dotall and multiline is
2677 used. A class such as [^a] always matches one of these charac‐
2678 ters.
2679
2680 The minus (hyphen) character can be used to specify a range of charac‐
2681 ters in a character class. For example, [d-m] matches any letter
2682 between d and m, inclusive. If a minus character is required in a
2683 class, it must be escaped with a backslash or appear in a position
2684 where it cannot be interpreted as indicating a range, typically as the
2685 first or last character in the class, or immediately after a range. For
2686 example, [b-d-z] matches letters in the range b to d, a hyphen charac‐
2687 ter, or z.
2688
2689 The literal character "]" cannot be the end character of a range. A
2690 pattern such as [W-]46] is interpreted as a class of two characters
2691 ("W" and "-") followed by a literal string "46]", so it would match
2692 "W46]" or "-46]". However, if "]" is escaped with a backslash, it is
2693 interpreted as the end of range, so [W-\]46] is interpreted as a class
2694 containing a range followed by two other characters. The octal or hexa‐
2695 decimal representation of "]" can also be used to end a range.
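
The hyphen and closing-bracket cases described above can be sketched
as follows (invented subjects, default options):

```erlang
%% In [b-d-z] the second hyphen follows a range, so it is literal.
RngHyphen = re:run("-", "[b-d-z]", [{capture, none}]).
%% match
RngOutside = re:run("e", "[b-d-z]", [{capture, none}]).
%% nomatch

%% [W-]46] is the class [W-] followed by the literal string "46]".
RngLiteral = re:run("W46]", "[W-]46]").
%% {match,[{0,4}]}

%% With the bracket escaped, W-\] is a range, which includes "Y".
RngEscaped = re:run("Y", "[W-\\]46]", [{capture, none}]).
%% match
```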
2696
2697 An error is generated if a POSIX character class (see below) or an
2698 escape sequence other than one that defines a single character appears
2699 at a point where a range ending character is expected. For example,
2700 [z-\xff] is valid, but [A-\d] and [A-[:digit:]] are not.
2701
2702 Ranges operate in the collating sequence of character values. They can
2703 also be used for characters specified numerically, for example,
2704 [\000-\037]. Ranges can include any characters that are valid for the
2705 current mode.
2706
2707 If a range that includes letters is used when caseless matching is set,
2708 it matches the letters in either case. For example, [W-c] is equivalent
2709 to [][\\^_`wxyzabc], matched caselessly. In a non-UTF mode, if charac‐
2710 ter tables for a French locale are in use, [\xc8-\xcb] matches accented
2711 E characters in both cases. In UTF modes, PCRE supports the concept of
2712 case for characters with values > 255 only when it is compiled with
2713 Unicode property support.
2714
2715 The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
2716 \w, and \W can appear in a character class, and add the characters that
2717 they match to the class. For example, [\dABCDEF] matches any hexadeci‐
2718 mal digit. In UTF modes, option ucp affects the meanings of \d, \s, \w
2719 and their uppercase partners, just as it does when they appear outside
2720 a character class, as described in section Generic Character Types ear‐
2721 lier. The escape sequence \b has a different meaning inside a character
2722 class; it matches the backspace character. The sequences \B, \N, \R,
2723 and \X are not special inside a character class. Like any other unrec‐
2724 ognized escape sequences, they are treated as the literal characters
2725 "B", "N", "R", and "X".
2726
2727 A circumflex can conveniently be used with the uppercase character
2728 types to specify a more restricted set of characters than the matching
2729 lowercase type. For example, class [^\W_] matches any letter or digit,
2730 but not underscore, while [\w] includes underscore. A positive charac‐
2731 ter class is to be read as "something OR something OR ..." and a nega‐
2732 tive class as "NOT something AND NOT something AND NOT ...".
2733
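As an illustration of the negated uppercase type, the sketch below
extracts the first run of letters and digits from a subject. The names
negated_type_demo and first_alnum_run/1 are hypothetical; note the
doubled backslash that an Erlang string literal requires.

```erlang
-module(negated_type_demo).
-export([first_alnum_run/1]).

%% Returns the first run of letters and digits (underscore excluded)
%% found in Subject, or the atom nomatch if there is none.
first_alnum_run(Subject) ->
    case re:run(Subject, "[^\\W_]+", [{capture, first, list}]) of
        {match, [Run]} -> Run;
        nomatch -> nomatch
    end.
```
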
2734 Only the following metacharacters are recognized in character classes:
2735
2736 * Backslash
2737
2738 * Hyphen (only where it can be interpreted as specifying a range)
2739
2740 * Circumflex (only at the start)
2741
2742 * Opening square bracket (only when it can be interpreted as intro‐
2743 ducing a Posix class name, or for a special compatibility feature;
2744 see the next two sections)
2745
2746 * Terminating closing square bracket
2747
2748 However, escaping other non-alphanumeric characters does no harm.
2749
2751 Perl supports the Posix notation for character classes. This uses names
2752 enclosed by [: and :] within the enclosing square brackets. PCRE also
2753 supports this notation. For example, the following matches "0", "1",
2754 any alphabetic character, or "%":
2755
2756 [01[:alpha:]%]
2757
2758 The following are the supported class names:
2759
2760 alnum:
2761 Letters and digits
2762
2763 alpha:
2764 Letters
2765
2766 ascii:
2767 Character codes 0-127
2768
2769 blank:
2770 Space or tab only
2771
2772 cntrl:
2773 Control characters
2774
2775 digit:
2776 Decimal digits (same as \d)
2777
2778 graph:
2779 Printing characters, excluding space
2780
2781 lower:
2782 Lowercase letters
2783
2784 print:
2785 Printing characters, including space
2786
2787 punct:
2788 Printing characters, excluding letters, digits, and space
2789
2790 space:
2791 Whitespace (the same as \s from PCRE 8.34)
2792
2793 upper:
2794 Uppercase letters
2795
2796 word:
2797 "Word" characters (same as \w)
2798
2799 xdigit:
2800 Hexadecimal digits
2801
2802 The default "space" characters are HT (9), LF (10), VT (11), FF (12),
2803 CR (13), and space (32). If locale-specific matching is taking place,
2804 the list of space characters may be different; there may be fewer or
2805 more of them. "Space" used to be different to \s, which did not include
2806 VT, for Perl compatibility. However, Perl changed at release 5.18, and
2807 PCRE followed at release 8.34. "Space" and \s now match the same set of
2808 characters.
2809
2810 The name "word" is a Perl extension, and "blank" is a GNU extension
2811 from Perl 5.8. Another Perl extension is negation, which is indicated
2812 by a ^ character after the colon. For example, the following matches
2813 "1", "2", or any non-digit:
2814
2815 [12[:^digit:]]
2816
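The two class examples above can be tried with a small helper such as
the following sketch. The module name posix_class_demo and the function
matches/2 are assumptions made for illustration; the helper anchors the
given class so that it must match the whole one-character subject.

```erlang
-module(posix_class_demo).
-export([matches/2]).

%% Returns true if the one-character string Char is matched by the
%% bracketed class given in Class, for example "[01[:alpha:]%]".
matches(Char, Class) ->
    Pattern = "^" ++ Class ++ "$",
    re:run(Char, Pattern, [{capture, none}]) =:= match.
```

For example, matches("%", "[01[:alpha:]%]") is true and
matches("2", "[01[:alpha:]%]") is false, while the negated form
"[12[:^digit:]]" accepts "1" and "x" but not "3".
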
2817 PCRE (and Perl) also recognize the Posix syntax [.ch.] and [=ch=] where
2818 "ch" is a "collating element", but these are not supported, and an
2819 error is given if they are encountered.
2820
2821 By default, characters with values > 255 do not match any of the Posix
2822 character classes. However, if option ucp is set at compile time, some
2823 of the classes are changed so that Unicode character properties are
2824 used. This is achieved by replacing certain Posix classes by
2825 other sequences, as follows:
2826
2827 [:alnum:]:
2828 Becomes \p{Xan}
2829
2830 [:alpha:]:
2831 Becomes \p{L}
2832
2833 [:blank:]:
2834 Becomes \h
2835
2836 [:digit:]:
2837 Becomes \p{Nd}
2838
2839 [:lower:]:
2840 Becomes \p{Ll}
2841
2842 [:space:]:
2843 Becomes \p{Xps}
2844
2845 [:upper:]:
2846 Becomes \p{Lu}
2847
2848 [:word:]:
2849 Becomes \p{Xwd}
2850
2851 Negated versions, such as [:^alpha:], use \P instead of \p. Three other
2852 POSIX classes are handled specially in UCP mode:
2853
2854 [:graph:]:
2855 This matches characters that have glyphs that mark the page when
2856 printed. In Unicode property terms, it matches all characters with
2857 the L, M, N, P, S, or Cf properties, except for:
2858
2859 U+061C:
2860 Arabic Letter Mark
2861
2862 U+180E:
2863 Mongolian Vowel Separator
2864
2865 U+2066 - U+2069:
2866 Various "isolate"s
2867
2868 [:print:]:
2869 This matches the same characters as [:graph:] plus space characters
2870 that are not controls, that is, characters with the Zs property.
2871
2872 [:punct:]:
2873 This matches all characters that have the Unicode P (punctuation)
2874 property, plus those characters whose code points are less than 128
2875 that have the S (Symbol) property.
2876
2877 The other POSIX classes are unchanged, and match only characters with
2878 code points less than 128.
2879
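The effect of Unicode property mode on a POSIX class can be observed
with a sketch like the one below, which assumes the Erlang options
unicode and ucp. The names ucp_class_demo and is_alpha/2 are
hypothetical. The code point 945 is GREEK SMALL LETTER ALPHA (U+03B1),
which lies above 255.

```erlang
-module(ucp_class_demo).
-export([is_alpha/2]).

%% Subject is a list of Unicode code points holding one character.
%% ExtraOpts is a list of additional options, for example [ucp].
%% Returns true if the character is matched by [[:alpha:]].
is_alpha(Subject, ExtraOpts) ->
    Opts = [unicode, {capture, none} | ExtraOpts],
    re:run(Subject, "^[[:alpha:]]$", Opts) =:= match.
```

Under these assumptions, is_alpha([945], []) is expected to be false,
and is_alpha([945], [ucp]) is expected to be true, because [:alpha:] is
then treated as \p{L}.
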
2880 Compatibility Feature for Word Boundaries
2881
2882 In the POSIX.2 compliant library that was included in 4.4BSD Unix, the
2883 ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word"
2884 and "end of word". PCRE treats these items as follows:
2885
2886 [[:<:]]:
2887 is converted to \b(?=\w)
2888
2889 [[:>:]]:
2890 is converted to \b(?<=\w)
2891
2892 Only these exact character sequences are recognized. A sequence such as
2893 [a[:<:]b] provokes an error for an unrecognized POSIX class name. This
2894 support is not compatible with Perl. It is provided to help migrations
2895 from other environments, and is best not used in any new patterns. Note
2896 that \b matches at the start and the end of a word (see "Simple asser‐
2897 tions" above), and in a Perl-style pattern the preceding or following
2898 character normally shows which is wanted, without the need for the
2899 assertions that are used above in order to give exactly the POSIX be‐
2900 haviour.
2901
2903 Vertical bar characters are used to separate alternative patterns. For
2904 example, the following pattern matches either "gilbert" or "sullivan":
2905
2906 gilbert|sullivan
2907
2908 Any number of alternatives can appear, and an empty alternative is per‐
2909 mitted (matching the empty string). The matching process tries each
2910 alternative in turn, from left to right, and the first that succeeds is
2911 used. If the alternatives are within a subpattern (defined in section
2912 Subpatterns), "succeeds" means matching the remaining main pattern and
2913 the alternative in the subpattern.
2914
2916 The settings of the Perl-compatible options caseless, multiline,
2917 dotall, and extended can be changed from within the pattern by a
2918 sequence of Perl option letters enclosed between "(?" and ")". The
2919 option letters are as follows:
2920
2921 i:
2922 For caseless
2923
2924 m:
2925 For multiline
2926
2927 s:
2928 For dotall
2929
2930 x:
2931 For extended
2932
2933 For example, (?im) sets caseless, multiline matching. These options can
2934 also be unset by preceding the letter with a hyphen. A combined setting
2935 and unsetting such as (?im-sx), which sets caseless and multiline,
2936 while unsetting dotall and extended, is also permitted. If a letter
2937 appears both before and after the hyphen, the option is unset.
2938
2939 The PCRE-specific options dupnames, ungreedy, and extra can be changed
2940 in the same way as the Perl-compatible options by using the characters
2941 J, U, and X respectively.
2942
2943 When one of these option changes occurs at top-level (that is, not
2944 inside subpattern parentheses), the change applies to the remainder of
2945 the pattern that follows.
2946
2947 An option change within a subpattern (see section Subpatterns) affects
2948 only that part of the subpattern that follows it. So, the following
2949 matches abc and aBc and no other strings (assuming caseless is not
2950 used):
2951
2952 (a(?i)b)c
2953
2954 By this means, options can be made to have different settings in dif‐
2955 ferent parts of the pattern. Any changes made in one alternative do
2956 carry on into subsequent branches within the same subpattern. For exam‐
2957 ple:
2958
2959 (a(?i)b|c)
2960
2961 matches "ab", "aB", "c", and "C", although when matching "C" the first
2962 branch is abandoned before the option setting. This is because the
2963 effects of option settings occur at compile time. There would be some
2964 weird behavior otherwise.
2965
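The scoping of an option change inside a subpattern can be checked with
a sketch such as the following. The module name inline_option_demo and
the function matches/2 are illustrative; the patterns are the two from
the text, anchored with ^ and $ so that the whole subject must match.

```erlang
-module(inline_option_demo).
-export([matches/2]).

%% Returns true if Pattern matches Subject. No compile options are
%% passed, so caseless matching is off unless the pattern sets it.
matches(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]) =:= match.
```

With "^(a(?i)b)c$", the subjects "abc" and "aBc" match, while "Abc" and
"abC" do not. With "^(a(?i)b|c)$", the subject "C" matches, because the
option setting carries into the second branch.
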
2966 Note:
2967 Other PCRE-specific options can be set by the application when the com‐
2968 piling or matching functions are called. Sometimes the pattern can con‐
2969 tain special leading sequences, such as (*CRLF), to override what the
2970 application has set or what has been defaulted. Details are provided in
2971 section Newline Sequences earlier.
2972
2973 The (*UTF8) and (*UCP) leading sequences can be used to set UTF and
2974 Unicode property modes. They are equivalent to setting options unicode
2975 and ucp, respectively. The (*UTF) sequence is a generic version that
2976 can be used with any of the libraries. However, the application can set
2977 option never_utf, which locks out the use of the (*UTF) sequences.
2978
2979
2981 Subpatterns are delimited by parentheses (round brackets), which can be
2982 nested. Turning part of a pattern into a subpattern does two things:
2983
2984 1.:
2985 It localizes a set of alternatives. For example, the following pat‐
2986 tern matches "cataract", "caterpillar", or "cat":
2987
2988 cat(aract|erpillar|)
2989
2990 Without the parentheses, it would match "cataract", "erpillar", or
2991 an empty string.
2992
2993 2.:
2994 It sets up the subpattern as a capturing subpattern. That is, when
2995 the complete pattern matches, that portion of the subject string
2996 that matched the subpattern is passed back to the caller through
2997 the return value of run/3.
2998
2999 Opening parentheses are counted from left to right (starting from 1) to
3000 obtain numbers for the capturing subpatterns. For example, if the
3001 string "the red king" is matched against the following pattern, the
3002 captured substrings are "red king", "red", and "king", and are numbered
3003 1, 2, and 3, respectively:
3004
3005 the ((red|white) (king|queen))
3006
3007 It is not always helpful that plain parentheses fulfill two functions.
3008 Often a grouping subpattern is required without a capturing require‐
3009 ment. If an opening parenthesis is followed by a question mark and a
3010 colon, the subpattern does not do any capturing, and is not counted
3011 when computing the number of any subsequent capturing subpatterns. For
3012 example, if the string "the white queen" is matched against the follow‐
3013 ing pattern, the captured substrings are "white queen" and "queen", and
3014 are numbered 1 and 2:
3015
3016 the ((?:red|white) (king|queen))
3017
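The numbering of capturing and non-capturing subpatterns can be
inspected through the capture option of run/3. The following is a
sketch only; capture_number_demo and captures/2 are hypothetical names.

```erlang
-module(capture_number_demo).
-export([captures/2]).

%% Returns the substrings captured by subpatterns 1, 2, ... as a list
%% of strings, or the atom nomatch.
captures(Subject, Pattern) ->
    case re:run(Subject, Pattern, [{capture, all_but_first, list}]) of
        {match, Groups} -> Groups;
        nomatch -> nomatch
    end.
```

Applied to the two patterns above, the first call yields three captured
substrings and the second only two, because (?:red|white) is not
counted.
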
3018 The maximum number of capturing subpatterns is 65535.
3019
3020 As a convenient shorthand, if any option settings are required at the
3021 start of a non-capturing subpattern, the option letters can appear
3022 between "?" and ":". Thus, the following two patterns match the same
3023 set of strings:
3024
3025 (?i:saturday|sunday)
3026 (?:(?i)saturday|sunday)
3027
3028 As alternative branches are tried from left to right, and options are
3029 not reset until the end of the subpattern is reached, an option setting
3030 in one branch does affect subsequent branches, so the above patterns
3031 match both "SUNDAY" and "Saturday".
3032
3034 Perl 5.10 introduced a feature where each alternative in a subpattern
3035 uses the same numbers for its capturing parentheses. Such a subpattern
3036 starts with (?| and is itself a non-capturing subpattern. For example,
3037 consider the following pattern:
3038
3039 (?|(Sat)ur|(Sun))day
3040
3041 As the two alternatives are inside a (?| group, both sets of capturing
3042 parentheses are numbered one. Thus, when the pattern matches, you can
3043 look at captured substring number one, whichever alternative matched.
3044 This construct is useful when you want to capture a part, but not all,
3045 of one of many alternatives. Inside a (?| group, parentheses are num‐
3046 bered as usual, but the number is reset at the start of each branch.
3047 The numbers of any capturing parentheses that follow the subpattern
3048 start after the highest number used in any branch. The following exam‐
3049 ple is from the Perl documentation; the numbers underneath show in
3050 which buffer the captured content is stored:
3051
3052 # before  ---------------branch-reset----------- after
3053 / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
3054 # 1            2         2  3        2     3     4
3055
3056 A back reference to a numbered subpattern uses the most recent value
3057 that is set for that number by any subpattern. The following pattern
3058 matches "abcabc" or "defdef":
3059
3060 /(?|(abc)|(def))\1/
3061
3062 In contrast, a subroutine call to a numbered subpattern always refers
3063 to the first one in the pattern with the given number. The following
3064 pattern matches "abcabc" or "defabc":
3065
3066 /(?|(abc)|(def))(?1)/
3067
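The branch reset patterns above can be exercised from Erlang as in the
following sketch. The names branch_reset_demo, day_prefix/1, and
matches/2 are assumptions made for illustration. The back reference and
subroutine patterns are anchored here with ^ and $ so that the whole
subject must match.

```erlang
-module(branch_reset_demo).
-export([day_prefix/1, matches/2]).

%% Returns the substring captured as number 1, whichever branch of the
%% (?| group matched, or the atom nomatch.
day_prefix(Subject) ->
    case re:run(Subject, "(?|(Sat)ur|(Sun))day", [{capture, [1], list}]) of
        {match, [Prefix]} -> Prefix;
        nomatch -> nomatch
    end.

%% Returns true if Pattern matches Subject.
matches(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]) =:= match.
```

The back reference \1 repeats whatever text was captured, so "defdef"
matches and "defabc" does not. The subroutine call (?1) re-runs the
first subpattern numbered 1, which is (abc), so the outcome is reversed.
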
3068 If a condition test for a subpattern having matched refers to a non-
3069 unique number, the test is true if any of the subpatterns of that num‐
3070 ber have matched.
3071
3072 An alternative approach using this "branch reset" feature is to use
3073 duplicate named subpatterns, as described in the next section.
3074
3076 Identifying capturing parentheses by number is simple, but it can be
3077 hard to keep track of the numbers in complicated regular expressions.
3078 Also, if an expression is modified, the numbers can change. To help
3079 with this difficulty, PCRE supports the naming of subpatterns. This
3080 feature was not added to Perl until release 5.10. Python had the fea‐
3081 ture earlier, and PCRE introduced it at release 4.0, using the Python
3082 syntax. PCRE now supports both the Perl and the Python syntax. Perl
3083 allows identically numbered subpatterns to have different names, but
3084 PCRE does not.
3085
3086 In PCRE, a subpattern can be named in one of three ways: (?<name>...)
3087 or (?'name'...) as in Perl, or (?P<name>...) as in Python. References
3088 to capturing parentheses from other parts of the pattern, such as back
3089 references, recursion, and conditions, can be made by name and by num‐
3090 ber.
3091
3092 Names consist of up to 32 alphanumeric characters and underscores, but
3093 must start with a non-digit. Named capturing parentheses are still
3094 allocated numbers as well as names, exactly as if the names were not
3095 present. The capture specification to run/3 can use named values if
3096 they are present in the regular expression.
3097
3098 By default, a name must be unique within a pattern, but this constraint
3099 can be relaxed by setting option dupnames at compile time. (Duplicate
3100 names are also always permitted for subpatterns with the same number,
3101 set up as described in the previous section.) Duplicate names can be
3102 useful for patterns where only one instance of the named parentheses
3103 can match. Suppose that you want to match the name of a weekday, either
3104 as a 3-letter abbreviation or as the full name, and in both cases you
3105 want to extract the abbreviation. The following pattern (ignoring the
3106 line breaks) does the job:
3107
3108 (?<DN>Mon|Fri|Sun)(?:day)?|
3109 (?<DN>Tue)(?:sday)?|
3110 (?<DN>Wed)(?:nesday)?|
3111 (?<DN>Thu)(?:rsday)?|
3112 (?<DN>Sat)(?:urday)?
3113
3114 There are five capturing substrings, but only one is ever set after a
3115 match. (An alternative way of solving this problem is to use a "branch
3116 reset" subpattern, as described in the previous section.)
3117
3118 For capturing named subpatterns whose names are not unique, the first
3119 matching occurrence (counted from left to right in the subject) is
3120 returned from run/3, if the name is specified in the values part of the
3121 capture statement. The all_names capturing value matches all the names
3122 in the same way.
3123
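The weekday pattern can be tried with option dupnames and a named
capture specification, as in this sketch. The module name dupnames_demo
and the function day_abbrev/1 are hypothetical; the pattern is the one
shown above, joined into a single string.

```erlang
-module(dupnames_demo).
-export([day_abbrev/1]).

%% Returns the three-letter abbreviation captured under the name DN,
%% or the atom nomatch if Subject contains no weekday name.
day_abbrev(Subject) ->
    %% Adjacent string literals are concatenated by the compiler.
    Pattern = "(?<DN>Mon|Fri|Sun)(?:day)?|"
              "(?<DN>Tue)(?:sday)?|"
              "(?<DN>Wed)(?:nesday)?|"
              "(?<DN>Thu)(?:rsday)?|"
              "(?<DN>Sat)(?:urday)?",
    case re:run(Subject, Pattern, [dupnames, {capture, ['DN'], list}]) of
        {match, [Abbrev]} -> Abbrev;
        nomatch -> nomatch
    end.
```

For the subject "Thursday", the value returned for DN is expected to be
"Thu", as only the fourth set of named parentheses takes part in the
match.
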
3124 Note:
3125 You cannot use different names to distinguish between two subpatterns
3126 with the same number, as PCRE uses only the numbers when matching. For
3127 this reason, an error is given at compile time if different names are
3128 specified to subpatterns with the same number. However, you can specify
3129 the same name to subpatterns with the same number, even when dupnames
3130 is not set.
3131
3132
3134 Repetition is specified by quantifiers, which can follow any of the
3135 following items:
3136
3137 * A literal data character
3138
3139 * The dot metacharacter
3140
3141 * The \C escape sequence
3142
3143 * The \X escape sequence
3144
3145 * The \R escape sequence
3146
3147 * An escape such as \d or \pL that matches a single character
3148
3149 * A character class
3150
3151 * A back reference (see the next section)
3152
3153 * A parenthesized subpattern (including assertions)
3154
3155 * A subroutine call to a subpattern (recursive or otherwise)
3156
3157 The general repetition quantifier specifies a minimum and maximum num‐
3158 ber of permitted matches, by giving the two numbers in curly brackets
3159 (braces), separated by a comma. The numbers must be < 65536, and the
3160 first must be less than or equal to the second. For example, the fol‐
3161 lowing matches "zz", "zzz", or "zzzz":
3162
3163 z{2,4}
3164
3165 A closing brace on its own is not a special character. If the second
3166 number is omitted, but the comma is present, there is no upper limit.
3167 If the second number and the comma are both omitted, the quantifier
3168 specifies an exact number of required matches. Thus, the following
3169 matches at least three successive vowels, but can match many more:
3170
3171 [aeiou]{3,}
3172
3173 The following matches exactly eight digits:
3174
3175 \d{8}
3176
3177 An opening curly bracket that appears in a position where a quantifier
3178 is not allowed, or one that does not match the syntax of a quantifier,
3179 is taken as a literal character. For example, {,6} is not a quantifier,
3180 but a literal string of four characters.
3181
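The three quantifier forms shown above can be compared with a small
helper like the following sketch; quantifier_demo and first_match/2 are
illustrative names, not part of the re module.

```erlang
-module(quantifier_demo).
-export([first_match/2]).

%% Returns the text of the first match of Pattern in Subject as a
%% string, or the atom nomatch.
first_match(Subject, Pattern) ->
    case re:run(Subject, Pattern, [{capture, first, list}]) of
        {match, [Text]} -> Text;
        nomatch -> nomatch
    end.
```

Because quantifiers are greedy by default, z{2,4} applied to "zzzzz"
takes four characters, and a single "z" does not match at all.
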
3182 In Unicode mode, quantifiers apply to characters rather than to indi‐
3183 vidual data units. Thus, for example, \x{100}{2} matches two charac‐
3184 ters, each of which is represented by a 2-byte sequence in a UTF-8
3185 string. Similarly, \X{3} matches three Unicode extended grapheme clus‐
3186 ters, each of which can be many data units long (and they can be of
3187 different lengths).
3188
3189 The quantifier {0} is permitted, causing the expression to behave as if
3190 the previous item and the quantifier were not present. This can be use‐
3191 ful for subpatterns that are referenced as subroutines from elsewhere
3192 in the pattern (but see also section Defining Subpatterns for Use by
3193 Reference Only). Items other than subpatterns that have a {0} quanti‐
3194 fier are omitted from the compiled pattern.
3195
3196 For convenience, the three most common quantifiers have single-charac‐
3197 ter abbreviations:
3198
3199 *:
3200 Equivalent to {0,}
3201
3202 +:
3203 Equivalent to {1,}
3204
3205 ?:
3206 Equivalent to {0,1}
3207
3208 Infinite loops can be constructed by following a subpattern that can
3209 match no characters with a quantifier that has no upper limit, for
3210 example:
3211
3212 (a?)*
3213
3214 Earlier versions of Perl and PCRE used to give an error at compile time
3215 for such patterns. As there are cases where this can be useful, such
3216 patterns are now accepted. However, if any repetition of the
3217 subpattern matches no characters, the loop is forcibly broken.
3218
3219 By default, the quantifiers are "greedy", that is, they match as much
3220 as possible (up to the maximum number of permitted times), without
3221 causing the remaining pattern to fail. The classic example of where
3222 this gives problems is in trying to match comments in C programs. These
3223 appear between /* and */. Within the comment, individual * and / char‐
3224 acters can appear. An attempt to match C comments by applying the pat‐
3225 tern
3226
3227 /\*.*\*/
3228
3229 to the string
3230
3231 /* first comment */ not comment /* second comment */
3232
3233 fails, as it matches the entire string owing to the greediness of the
3234 .* item.
3235
3236 However, if a quantifier is followed by a question mark, it ceases to
3237 be greedy, and instead matches the minimum number of times possible, so
3238 the following pattern does the right thing with the C comments:
3239
3240 /\*.*?\*/
3241
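The difference between the greedy and the minimizing pattern can be
seen by running both against the same subject. This is a sketch under
the assumption that no options such as ungreedy are set; the names
lazy_comment_demo and first_comment/2 are hypothetical.

```erlang
-module(lazy_comment_demo).
-export([first_comment/2]).

%% Returns the first match of the C comment pattern in Subject, using
%% either the greedy or the lazy form of the .* item.
first_comment(Subject, greedy) -> run(Subject, "/\\*.*\\*/");
first_comment(Subject, lazy)   -> run(Subject, "/\\*.*?\\*/").

run(Subject, Pattern) ->
    case re:run(Subject, Pattern, [{capture, first, list}]) of
        {match, [Text]} -> Text;
        nomatch -> nomatch
    end.
```

With the subject string shown earlier, the greedy form returns the
entire string, while the lazy form returns only "/* first comment */".
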
3242 The meaning of the various quantifiers is not otherwise changed, only
3243 the preferred number of matches. Do not confuse this use of question
3244 mark with its use as a quantifier in its own right. As it has two uses,
3245 it can sometimes appear doubled, as in
3246
3247 \d??\d
3248
3249 which matches one digit by preference, but can match two if that is the
3250 only way the remaining pattern matches.
3251
3252 If option ungreedy is set (an option that is not available in Perl),
3253 the quantifiers are not greedy by default, but individual ones can be
3254 made greedy by following them with a question mark. That is, it inverts
3255 the default behavior.
3256
3257 When a parenthesized subpattern is quantified with a minimum repeat
3258 count that is > 1 or with a limited maximum, more memory is required
3259 for the compiled pattern, in proportion to the size of the minimum or
3260 maximum.
3261
3262 If a pattern starts with .* or .{0,} and option dotall (equivalent to
3263 Perl option /s) is set, thus allowing the dot to match newlines, the
3264 pattern is implicitly anchored, because whatever follows is tried
3265 against every character position in the subject string. So, there is no
3266 point in retrying the overall match at any position after the first.
3267 PCRE normally treats such a pattern as if it was preceded by \A.
3268
3269 In cases where it is known that the subject string contains no new‐
3270 lines, it is worth setting dotall to obtain this optimization, or
3271 alternatively using ^ to indicate anchoring explicitly.
3272
3273 However, there are some cases where the optimization cannot be used.
3274 When .* is inside capturing parentheses that are the subject of a back
3275 reference elsewhere in the pattern, a match at the start can fail where
3276 a later one succeeds. Consider, for example:
3277
3278 (.*)abc\1
3279
3280 If the subject is "xyz123abc123", the match point is the fourth charac‐
3281 ter. Therefore, such a pattern is not implicitly anchored.
3282
3283 Another case where implicit anchoring is not applied is when the lead‐
3284 ing .* is inside an atomic group. Once again, a match at the start can
3285 fail where a later one succeeds. Consider the following pattern:
3286
3287 (?>.*?a)b
3288
3289 It matches "ab" in the subject "aab". The use of the backtracking
3290 control verbs (*PRUNE) and (*SKIP) also disables this optimization.
3291
3292 When a capturing subpattern is repeated, the value captured is the sub‐
3293 string that matched the final iteration. For example, after
3294
3295 (tweedle[dume]{3}\s*)+
3296
3297 has matched "tweedledum tweedledee", the value of the captured sub‐
3298 string is "tweedledee". However, if there are nested capturing subpat‐
3299 terns, the corresponding captured values can have been set in previous
3300 iterations. For example, after
3301
3302 /(a|(b))+/
3303
3304 matches "aba", the value of the second captured substring is "b".
3305
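Both cases can be reproduced with run/3 by asking for all numbered
subpatterns. The sketch below uses the hypothetical names
repeat_capture_demo and groups/2.

```erlang
-module(repeat_capture_demo).
-export([groups/2]).

%% Returns the values of the numbered capturing subpatterns after a
%% successful match, or the atom nomatch.
groups(Subject, Pattern) ->
    case re:run(Subject, Pattern, [{capture, all_but_first, list}]) of
        {match, Values} -> Values;
        nomatch -> nomatch
    end.
```

For "(a|(b))+" matched against "aba", the result is ["a", "b"]: the
outer group holds the final iteration, while the inner group keeps the
value set in the second iteration.
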
3307 With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
3308 repetition, failure of what follows normally causes the repeated item
3309 to be re-evaluated to see if a different number of repeats allows the
3310 remaining pattern to match. Sometimes it is useful to prevent this,
3311 either to change the nature of the match, or to cause it to fail ear‐
3312 lier than it otherwise might, when the author of the pattern knows that
3313 there is no point in carrying on.
3314
3315 Consider, for example, the pattern \d+foo when applied to the following
3316 subject line:
3317
3318 123456bar
3319
3320 After matching all six digits and then failing to match "foo", the nor‐
3321 mal action of the matcher is to try again with only five digits match‐
3322 ing item \d+, and then with four, and so on, before ultimately failing.
3323 "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
3324 the means for specifying that once a subpattern has matched, it is not
3325 to be re-evaluated in this way.
3326
3327 If atomic grouping is used for the previous example, the matcher gives
3328 up immediately on failing to match "foo" the first time. The notation
3329 is a kind of special parenthesis, starting with (?> as in the following
3330 example:
3331
3332 (?>\d+)foo
3333
3334 This kind of parenthesis "locks up" the part of the pattern it contains
3335 once it has matched, and a failure further into the pattern is pre‐
3336 vented from backtracking into it. Backtracking past it to previous
3337 items, however, works as normal.
3338
3339 An alternative description is that a subpattern of this type matches
3340 the string of characters that an identical standalone pattern would
3341 match, if anchored at the current point in the subject string.
3342
3343 Atomic grouping subpatterns are not capturing subpatterns. Simple cases
3344 such as the above example can be thought of as a maximizing repeat that
3345 must swallow everything it can. So, while both \d+ and \d+? are pre‐
3346 pared to adjust the number of digits they match to make the remaining
3347 pattern match, (?>\d+) can only match an entire sequence of digits.
3348
3349 Atomic groups in general can contain any complicated subpatterns, and
3350 can be nested. However, when the subpattern for an atomic group is just
3351 a single repeated item, as in the example above, a simpler notation,
3352 called a "possessive quantifier" can be used. This consists of an extra
3353 + character following a quantifier. Using this notation, the previous
3354 example can be rewritten as
3355
3356 \d++foo
3357
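The practical effect of "locking up" a repeat shows when the following
item needs one of the characters that the repeat has consumed. The
sketch below compares an ordinary repeat with its atomic and possessive
forms; atomic_group_demo and matches/2 are illustrative names.

```erlang
-module(atomic_group_demo).
-export([matches/2]).

%% Returns true if Pattern matches Subject.
matches(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]) =:= match.
```

Against the subject "1234", the pattern "^\\d+\\d$" matches because \d+
gives back its last digit. The patterns "^(?>\\d+)\\d$" and
"^\\d++\\d$" do not match, because the digits cannot be given back.
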
3358 Notice that a possessive quantifier can be used with an entire group,
3359 for example:
3360
3361 (abc|xyz){2,3}+
3362
3363 Possessive quantifiers are always greedy; the setting of option
3364 ungreedy is ignored. They are a convenient notation for the simpler
3365 forms of an atomic group. There is no difference in meaning between
3366 a possessive quantifier and the equivalent atomic group, but there
3367 can be a performance difference; possessive quantifiers are probably
3368 slightly faster.
3369
3370 The possessive quantifier syntax is an extension to the Perl 5.8 syn‐
3371 tax. Jeffrey Friedl originated the idea (and the name) in the first
3372 edition of his book. Mike McCloskey liked it, so implemented it when he
3373 built the Sun Java package, and PCRE copied it from there. It ulti‐
3374 mately found its way into Perl at release 5.10.
3375
3376 PCRE has an optimization that automatically "possessifies" certain sim‐
3377 ple pattern constructs. For example, the sequence A+B is treated as
3378 A++B, as there is no point in backtracking into a sequence of A's when
3379 B must follow.
3380
3381 When a pattern contains an unlimited repeat inside a subpattern that
3382 can itself be repeated an unlimited number of times, the use of an
3383 atomic group is the only way to avoid some failing matches taking a
3384 long time. The pattern
3385
3386 (\D+|<\d+>)*[!?]
3387
3388 matches an unlimited number of substrings that either consist of non-
3389 digits, or digits enclosed in <>, followed by ! or ?. When it matches,
3390 it runs quickly. However, if it is applied to
3391
3392 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
3393
3394 it takes a long time before reporting failure. This is because the
3395 string can be divided between the internal \D+ repeat and the external
3396 * repeat in many ways, and all must be tried. (The example uses [!?]
3397 rather than a single character at the end, as both PCRE and Perl have
3398 an optimization that allows for fast failure when a single character is
3399 used. They remember the last single character that is required for a
3400 match, and fail early if it is not present in the string.) If the pat‐
3401 tern is changed so that it uses an atomic group, like the following,
3402 sequences of non-digits cannot be broken, and failure happens quickly:
3403
3404 ((?>\D+)|<\d+>)*[!?]
3405
3407 Outside a character class, a backslash followed by a digit > 0 (and
3408 possibly further digits) is a back reference to a capturing subpattern
3409 earlier (that is, to its left) in the pattern, provided there have been
3410 that many previous capturing left parentheses.
3411
3412 However, if the decimal number following the backslash is < 10, it is
3413 always taken as a back reference, and causes an error only if there are
3414 not that many capturing left parentheses in the entire pattern. That
3415 is, the parentheses that are referenced need not be to the left of
3416 the reference for numbers < 10. A "forward back reference" of this type
3417 can make sense when a repetition is involved and the subpattern to the
3418 right has participated in an earlier iteration.
3419
3420 It is not possible to have a numerical "forward back reference" to a
3421 subpattern whose number is 10 or more using this syntax, as a sequence
3422 such as \50 is interpreted as a character defined in octal. For more
3423 details of the handling of digits following a backslash, see section
3424 Non-Printing Characters earlier. There is no such problem when named
3425 parentheses are used. A back reference to any subpattern is possible
3426 using named parentheses (see below).
3427
3428 Another way to avoid the ambiguity inherent in the use of digits fol‐
3429 lowing a backslash is to use the \g escape sequence. This escape must
3430 be followed by an unsigned number or a negative number, optionally
3431 enclosed in braces. The following examples are identical:
3432
3433 (ring), \1
3434 (ring), \g1
3435 (ring), \g{1}
3436
3437 An unsigned number specifies an absolute reference without the ambigu‐
3438 ity that is present in the older syntax. It is also useful when literal
3439 digits follow the reference. A negative number is a relative reference.
3440 Consider the following example:
3441
3442 (abc(def)ghi)\g{-1}
3443
3444 The sequence \g{-1} is a reference to the most recently started captur‐
3445 ing subpattern before \g, that is, it is equivalent to \2 in this exam‐
3446 ple. Similarly, \g{-2} would be equivalent to \1. The use of relative
3447 references can be helpful in long patterns, and also in patterns that
3448 are created by joining fragments containing references within them‐
3449 selves.
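
As a hedged illustration (the module name re_relref_demo and the
subject string are made up for this sketch), the absolute reference \2
and the relative reference \g{-1} can be compared from Erlang.
Backslashes are doubled because of Erlang string escaping:

```erlang
-module(re_relref_demo).
-export([run/0]).

%% Hypothetical demo module; only re:run/3 comes from the re module.
run() ->
    Opts = [{capture, all, list}],
    %% "\\2" is the pattern text \2 (a bare "\2" would be an octal
    %% character escape in an Erlang string).
    Absolute = re:run("abcdefghidef", "(abc(def)ghi)\\2", Opts),
    %% \g{-1} refers to the most recently started group, here group 2.
    Relative = re:run("abcdefghidef", "(abc(def)ghi)\\g{-1}", Opts),
    {Absolute, Relative}.
```

Both calls are expected to return the same captures, because in this
pattern \g{-1} and \2 name the same group.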
3450
3451 A back reference matches whatever matched the capturing subpattern in
3452 the current subject string, rather than anything matching the subpattern
3453 itself (section Subpatterns as Subroutines describes a way of doing
3454 that). So, the following pattern matches "sense and sensibility" and
3455 "response and responsibility", but not "sense and responsibility":
3456
3457 (sens|respons)e and \1ibility
3458
3459 If caseful matching is in force at the time of the back reference, the
3460 case of letters is relevant. For example, the following matches "rah
3461 rah" and "RAH RAH", but not "RAH rah", although the original capturing
3462 subpattern is matched caselessly:
3463
3464 ((?i)rah)\s+\1
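
A minimal sketch of this behavior from Erlang follows. The module and
function names are hypothetical; the pattern is the one shown above
with backslashes doubled for the Erlang string literal:

```erlang
-module(re_caseful_demo).
-export([matches/1]).

%% Hypothetical helper: true if the subject matches the pattern above.
matches(Subject) ->
    %% (?i) applies only inside group 1; the back reference \1 is
    %% compared casefully against what group 1 actually matched.
    case re:run(Subject, "((?i)rah)\\s+\\1") of
        {match, _} -> true;
        nomatch -> false
    end.
```

With this helper, "rah rah" and "RAH RAH" are accepted, while
"RAH rah" is rejected.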
3465
3466 There are many different ways of writing back references to named sub‐
3467 patterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
3468 \k'name' are supported, as is the Python syntax (?P=name). The unified
3469 back reference syntax in Perl 5.10, in which \g can be used for both
3470 numeric and named references, is also supported. The previous example
3471 can be rewritten in the following ways:
3472
3473 (?<p1>(?i)rah)\s+\k<p1>
3474 (?'p1'(?i)rah)\s+\k{p1}
3475 (?P<p1>(?i)rah)\s+(?P=p1)
3476 (?<p1>(?i)rah)\s+\g{p1}
3477
3478 A subpattern that is referenced by name can appear in the pattern
3479 before or after the reference.
3480
3481 There can be more than one back reference to the same subpattern. If a
3482 subpattern has not been used in a particular match, any back reference
3483 to it always fails. For example, the following pattern always fails if
3484 it starts to match "a" rather than "bc":
3485
3486 (a|(bc))\2
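
The following sketch (hypothetical module name, made-up subjects) shows
the effect of a back reference to a group that did not take part in the
match:

```erlang
-module(re_unset_backref_demo).
-export([run/1]).

%% Hypothetical helper returning only the overall match as a string.
run(Subject) ->
    %% When the "a" branch is taken, group 2 is unset, so \2 fails.
    re:run(Subject, "(a|(bc))\\2", [{capture, first, list}]).
```

For the subject "aa" no match is found, because \2 is unset whenever
the "a" alternative is used. For "abcbc" the match is "bcbc", found
after the attempt starting at "a" has failed.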
3487
3488 As there can be many capturing parentheses in a pattern, all digits
3489 following the backslash are taken as part of a potential back reference
3490 number. If the pattern continues with a digit character, some delimiter
3491 must be used to terminate the back reference. If option extended is
3492 set, this can be whitespace. Otherwise an empty comment (see section
3493 Comments) can be used.
3494
3495 Recursive Back References
3496
3497 A back reference that occurs inside the parentheses to which it refers
3498 fails when the subpattern is first used, so, for example, (a\1) never
3499 matches. However, such references can be useful inside repeated subpat‐
3500 terns. For example, the following pattern matches any number of "a"s
3501 and also "aba", "ababbaa", and so on:
3502
3503 (a|b\1)+
3504
3505 At each iteration of the subpattern, the back reference matches the
3506 character string corresponding to the previous iteration. In order for
3507 this to work, the pattern must be such that the first iteration does
3508 not need to match the back reference. This can be done using alterna‐
3509 tion, as in the example above, or by a quantifier with a minimum of
3510 zero.
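
As an illustration only, the pattern can be tried from Erlang. The
module name is hypothetical, and the anchors ^ and $ are added in this
sketch so that the whole subject must be consumed:

```erlang
-module(re_recursive_backref_demo).
-export([matches/1]).

%% Hypothetical helper: true if the entire subject matches (a|b\1)+.
matches(Subject) ->
    %% With {capture, none}, re:run/3 returns the atom match or nomatch.
    re:run(Subject, "^(a|b\\1)+$", [{capture, none}]) =:= match.
```

For "ababbaa" the iterations match "a", "ba", "bba", and "a" in turn:
each b\1 repeats what the previous iteration matched. A subject such as
"ba" is rejected, as \1 is unset in the first iteration.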
3511
3512 Back references of this type cause the group that they reference to be
3513 treated as an atomic group. Once the whole group has been matched, a
3514 subsequent matching failure cannot cause backtracking into the middle
3515 of the group.
3516
3518 An assertion is a test on the characters following or preceding the
3519 current matching point that does not consume any characters. The simple
3520 assertions coded as \b, \B, \A, \G, \Z, \z, ^, and $ are described in
3521 the previous sections.
3522
3523 More complicated assertions are coded as subpatterns. There are two
3524 kinds: those that look ahead of the current position in the subject
3525 string, and those that look behind it. An assertion subpattern is
3526 matched in the normal way, except that it does not cause the current
3527 matching position to be changed.
3528
3529 Assertion subpatterns are not capturing subpatterns. If such an asser‐
3530 tion contains capturing subpatterns within it, these are counted for
3531 the purposes of numbering the capturing subpatterns in the whole pat‐
3532 tern. However, substring capturing is done only for positive asser‐
3533 tions. (Perl sometimes, but not always, performs capturing in negative
3534 assertions.)
3535
3536 Warning:
3537 If a positive assertion containing one or more capturing subpatterns
3538 succeeds, but failure to match later in the pattern causes backtracking
3539 over this assertion, the captures within the assertion are reset only
3540 if no higher numbered captures are already set. This is, unfortunately,
3541 a fundamental limitation of the current implementation, and as PCRE1 is
3542 now in maintenance-only status, it is unlikely ever to change.
3543
3544
3545 For compatibility with Perl, assertion subpatterns can be repeated.
3546 Although it makes no sense to assert the same thing many times, the
3547 side effect of capturing parentheses can occasionally be useful. In
3548 practice, there are only three cases:
3549
3550 * If the quantifier is {0}, the assertion is never obeyed during
3551 matching. However, it can contain internal capturing parenthesized
3552 groups that are called from elsewhere through the subroutine mecha‐
3553 nism.
3554
3555 * If the quantifier is {0,n}, where n > 0, it is treated as if it were
3556 {0,1}. At runtime, the remaining pattern match is tried with and
3557 without the assertion; the order depends on the greediness of the
3558 quantifier.
3559
3560 * If the minimum repetition is > 0, the quantifier is ignored. The
3561 assertion is obeyed only once when encountered during matching.
3562
3563 Lookahead Assertions
3564
3565 Lookahead assertions start with (?= for positive assertions and (?! for
3566 negative assertions. For example, the following matches a word followed
3567 by a semicolon, but does not include the semicolon in the match:
3568
3569 \w+(?=;)
3570
3571 The following matches any occurrence of "foo" that is not followed by
3572 "bar":
3573
3574 foo(?!bar)
3575
3576 Notice that the apparently similar pattern
3577
3578 (?!foo)bar
3579
3580 does not find an occurrence of "bar" that is preceded by something
3581 other than "foo". It finds any occurrence of "bar" whatsoever, as the
3582 assertion (?!foo) is always true when the next three characters are
3583 "bar". A lookbehind assertion is needed to achieve the other effect.
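
The three lookahead patterns above can be compared with a small sketch
such as the following. The module name, helper function, and subject
strings are assumptions made for illustration:

```erlang
-module(re_lookahead_demo).
-export([first/2]).

%% Hypothetical helper: returns only the overall match, as a string.
first(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, first, list}]).
```

For example, first("word;", "\\w+(?=;)") yields "word" without the
semicolon, first("foobar", "foo(?!bar)") finds nothing, and
first("foobar", "(?!foo)bar") still finds "bar", as the assertion is
tested at the position where "bar" starts.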
3584
3585 If you want to force a matching failure at some point in a pattern, the
3586 most convenient way to do it is with (?!), as an empty string always
3587 matches. So, an assertion that requires there not to be an empty
3588 string must always fail. The backtracking control verb (*FAIL) or (*F)
3589 is a synonym for (?!).
3590
3591 Lookbehind Assertions
3592
3593 Lookbehind assertions start with (?<= for positive assertions and (?<!
3594 for negative assertions. For example, the following finds an occurrence
3595 of "bar" that is not preceded by "foo":
3596
3597 (?<!foo)bar
3598
3599 The contents of a lookbehind assertion are restricted such that all the
3600 strings it matches must have a fixed length. However, if there are many
3601 top-level alternatives, they do not all have to have the same fixed
3602 length. Thus, the following is permitted:
3603
3604 (?<=bullock|donkey)
3605
3606 The following causes an error at compile time:
3607
3608 (?<!dogs?|cats?)
3609
3610 Branches that match different length strings are permitted only at the
3611 top-level of a lookbehind assertion. This is an extension compared with
3612 Perl, which requires all branches to match the same length of string.
3613 An assertion such as the following is not permitted, as its single top-
3614 level branch can match two different lengths:
3615
3616 (?<=ab(c|de))
3617
3618 However, it is acceptable to PCRE if rewritten to use two top-level
3619 branches:
3620
3621 (?<=abc|abde)
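
The fixed-length restriction is checked when the pattern is compiled,
so it can be observed with re:compile/1. The following is a sketch with
a hypothetical module and helper name:

```erlang
-module(re_lookbehind_demo).
-export([compiles/1]).

%% Hypothetical helper: true if the pattern is accepted by re:compile/1.
compiles(Pattern) ->
    case re:compile(Pattern) of
        {ok, _MP} -> true;
        %% The error term carries a message string and a position.
        {error, {_Reason, _Position}} -> false
    end.
```

With this helper, "(?<=bullock|donkey)" and "(?<=abc|abde)" are
accepted, while "(?<!dogs?|cats?)" and "(?<=ab(c|de))" are rejected.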
3622
3623 Sometimes the escape sequence \K (see above) can be used instead of a
3624 lookbehind assertion to get round the fixed-length restriction.
3625
3626 The implementation of lookbehind assertions is, for each alternative,
3627 to move the current position back temporarily by the fixed length and
3628 then try to match. If there are insufficient characters before the cur‐
3629 rent position, the assertion fails.
3630
3631 In a UTF mode, PCRE does not allow the \C escape (which matches a sin‐
3632 gle data unit even in a UTF mode) to appear in lookbehind assertions,
3633 as it makes it impossible to calculate the length of the lookbehind.
3634 The \X and \R escapes, which can match different numbers of data units,
3635 are not permitted either.
3636
3637 "Subroutine" calls (see below), such as (?2) or (?&X), are permitted in
3638 lookbehinds, as long as the subpattern matches a fixed-length string.
3639 Recursion, however, is not supported.
3640
3641 Possessive quantifiers can be used with lookbehind assertions to spec‐
3642 ify efficient matching of fixed-length strings at the end of subject
3643 strings. Consider the following simple pattern when applied to a long
3644 string that does not match:
3645
3646 abcd$
3647
3648 As matching proceeds from left to right, PCRE looks for each "a" in the
3649 subject and then sees if what follows matches the remaining pattern. If
3650 the pattern is specified as
3651
3652 ^.*abcd$
3653
3654 the initial .* matches the entire string at first. However, when this
3655 fails (as there is no following "a"), it backtracks to match all but
3656 the last character, then all but the last two characters, and so on.
3657 Once again the search for "a" covers the entire string, from right to
3658 left, so we are no better off. However, if the pattern is written as
3659
3660 ^.*+(?<=abcd)
3661
3662 there can be no backtracking for the .*+ item; it can match only the
3663 entire string. The subsequent lookbehind assertion does a single test
3664 on the last four characters. If it fails, the match fails immediately.
3665 For long strings, this approach makes a significant difference to the
3666 processing time.
3667
3668 Using Multiple Assertions
3669
3670 Many assertions (of any sort) can occur in succession. For example, the
3671 following matches "foo" preceded by three digits that are not "999":
3672
3673 (?<=\d{3})(?<!999)foo
3674
3675 Notice that each of the assertions is applied independently at the same
3676 point in the subject string. First there is a check that the previous
3677 three characters are all digits, and then there is a check that the
3678 same three characters are not "999". This pattern does not match "foo"
3679 preceded by six characters, the first three of which are digits and the last
3680 three of which are not "999". For example, it does not match "123abc‐
3681 foo". A pattern to do that is the following:
3682
3683 (?<=\d{3}...)(?<!999)foo
3684
3685 This time the first assertion looks at the preceding six characters,
3686 checks that the first three are digits, and then the second assertion
3687 checks that the preceding three characters are not "999".
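
The difference between the two patterns can be checked with a sketch
like the following (module name, helper, and subjects are made up for
illustration):

```erlang
-module(re_multi_assert_demo).
-export([matches/2]).

%% Hypothetical helper: true if Pattern matches somewhere in Subject.
matches(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]) =:= match.
```

Here "(?<=\\d{3})(?<!999)foo" accepts "123foo" but rejects "999foo" and
"123abcfoo", while "(?<=\\d{3}...)(?<!999)foo" accepts "123abcfoo" and
rejects "123999foo".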
3688
3689 Assertions can be nested in any combination. For example, the following
3690 matches an occurrence of "baz" that is preceded by "bar", which in turn
3691 is not preceded by "foo":
3692
3693 (?<=(?<!foo)bar)baz
3694
3695 The following pattern matches "foo" preceded by three digits and any
3696 three characters that are not "999":
3697
3698 (?<=\d{3}(?!999)...)foo
3699
3701 It is possible to cause the matching process to obey a subpattern con‐
3702 ditionally or to choose between two alternative subpatterns, depending
3703 on the result of an assertion, or whether a specific capturing subpat‐
3704 tern has already been matched. The following are the two possible forms
3705 of conditional subpattern:
3706
3707 (?(condition)yes-pattern)
3708 (?(condition)yes-pattern|no-pattern)
3709
3710 If the condition is satisfied, the yes-pattern is used, otherwise the
3711 no-pattern (if present). If more than two alternatives exist in the
3712 subpattern, a compile-time error occurs. Each of the two alternatives
3713 can itself contain nested subpatterns of any form, including condi‐
3714 tional subpatterns; the restriction to two alternatives applies only at
3715 the level of the condition. The following pattern fragment is an exam‐
3716 ple where the alternatives are complex:
3717
3718 (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
3719
3720 There are four kinds of condition: references to subpatterns, refer‐
3721 ences to recursion, a pseudo-condition called DEFINE, and assertions.
3722
3723 Checking for a Used Subpattern By Number
3724
3725 If the text between the parentheses consists of a sequence of digits,
3726 the condition is true if a capturing subpattern of that number has pre‐
3727 viously matched. If more than one capturing subpattern with the same
3728 number exists (see section Duplicate Subpattern Numbers earlier), the
3729 condition is true if any of them have matched. An alternative notation
3730 is to precede the digits with a plus or minus sign. In this case, the
3731 subpattern number is relative rather than absolute. The most recently
3732 opened parentheses can be referenced by (?(-1), the next most recent by
3733 (?(-2), and so on. Inside loops, it can also make sense to refer to
3734 subsequent groups. The next parentheses to be opened can be referenced
3735 as (?(+1), and so on. (The value zero in any of these forms is not
3736 used; it provokes a compile-time error.)
3737
3738 Consider the following pattern, which contains non-significant white‐
3739 space to make it more readable (assume option extended) and to divide
3740 it into three parts for ease of discussion:
3741
3742 ( \( )? [^()]+ (?(1) \) )
3743
3744 The first part matches an optional opening parenthesis, and if that
3745 character is present, sets it as the first captured substring. The sec‐
3746 ond part matches one or more characters that are not parentheses. The
3747 third part is a conditional subpattern that tests whether the first set
3748 of parentheses matched or not. If they did, that is, if the subject started
3749 with an opening parenthesis, the condition is true, and so the yes-pat‐
3750 tern is executed and a closing parenthesis is required. Otherwise, as
3751 no-pattern is not present, the subpattern matches nothing. That is,
3752 this pattern matches a sequence of non-parentheses, optionally enclosed
3753 in parentheses.
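
A hedged sketch of this pattern in Erlang follows. The module name is
hypothetical, and the anchors ^ and $ are an addition in this example
so that unbalanced input is rejected instead of matching a shorter
substring:

```erlang
-module(re_conditional_demo).
-export([matches/1]).

%% Hypothetical helper. Option extended makes the whitespace in the
%% pattern non-significant.
matches(Subject) ->
    Pattern = "^ ( \\( )? [^()]+ (?(1) \\) ) $",
    re:run(Subject, Pattern, [extended, {capture, none}]) =:= match.
```

Under these assumptions "abc" and "(abc)" are accepted, while "(abc"
and "abc)" are rejected.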
3754
3755 If this pattern is embedded in a larger one, a relative reference can
3756 be used:
3757
( \( )? [^()]+ (?(-1) \) )

3758 This makes the fragment independent of the parentheses in the larger
3759 pattern.
3760
3761 Checking for a Used Subpattern By Name
3762
3763 Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
3764 used subpattern by name. For compatibility with earlier versions of
3765 PCRE, which had this facility before Perl, the syntax (?(name)...) is
3766 also recognized.
3767
3768 Rewriting the previous example to use a named subpattern gives:
3769
3770 (?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
3771
3772 If the name used in a condition of this kind is a duplicate, the test
3773 is applied to all subpatterns of the same name, and is true if any one
3774 of them has matched.
3775
3776 Checking for Pattern Recursion
3777
3778 If the condition is the string (R), and there is no subpattern with the
3779 name R, the condition is true if a recursive call to the whole pattern
3780 or any subpattern has been made. If digits or a name preceded by amper‐
3781 sand follow the letter R, for example:
3782
3783 (?(R3)...) or (?(R&name)...)
3784
3785 the condition is true if the most recent recursion is into a subpattern
3786 whose number or name is given. This condition does not check the entire
3787 recursion stack. If the name used in a condition of this kind is a
3788 duplicate, the test is applied to all subpatterns of the same name, and
3789 is true if any one of them is the most recent recursion.
3790
3791 At "top-level", all these recursion test conditions are false. The syn‐
3792 tax for recursive patterns is described below.
3793
3794 Defining Subpatterns for Use By Reference Only
3795
3796 If the condition is the string (DEFINE), and there is no subpattern
3797 with the name DEFINE, the condition is always false. In this case,
3798 there can be only one alternative in the subpattern. It is always
3799 skipped if control reaches this point in the pattern. The idea of
3800 DEFINE is that it can be used to define "subroutines" that can be ref‐
3801 erenced from elsewhere. (The use of subroutines is described below.)
3802 For example, a pattern to match an IPv4 address, such as
3803 "192.168.23.245", can be written like this (ignore whitespace and line
3804 breaks):
3805
3806 (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) \b (?&byte) (\.(?&byte)){3} \b
3807
3808 The first part of the pattern is a DEFINE group inside which
3809 another group named "byte" is defined. This matches an individual com‐
3810 ponent of an IPv4 address (a number < 256). When matching takes place,
3811 this part of the pattern is skipped, as DEFINE acts like a false condi‐
3812 tion. The remaining pattern uses references to the named group to match
3813 the four dot-separated components of an IPv4 address, insisting on a
3814 word boundary at each end.
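
The IPv4 pattern can be used from Erlang roughly as follows. This is a
sketch: the module name, function names, and subject strings are
assumptions, and the pattern is split over two adjacent string literals
only for readability:

```erlang
-module(re_define_demo).
-export([find/1]).

%% The pattern from the text, with backslashes doubled for Erlang.
pattern() ->
    "(?(DEFINE) (?<byte> 2[0-4]\\d | 25[0-5] | 1\\d\\d | [1-9]?\\d) )"
    " \\b (?&byte) (\\.(?&byte)){3} \\b".

%% Hypothetical helper: returns the first IPv4-like match, if any.
find(Subject) ->
    re:run(Subject, pattern(), [extended, {capture, first, list}]).
```

For example, find("host 192.168.23.245 up") returns the address, and
find("256.1.1.1") finds nothing, as 256 is not a valid component.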
3815
3816 Assertion Conditions
3817
3818 If the condition is not in any of the above formats, it must be an
3819 assertion. This can be a positive or negative lookahead or lookbehind
3820 assertion. Consider the following pattern, containing non-significant
3821 whitespace, and with the two alternatives on the second line:
3822
3823 (?(?=[^a-z]*[a-z])
3824 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
3825
3826 The condition is a positive lookahead assertion that matches an
3827 optional sequence of non-letters followed by a letter. That is, it
3828 tests for the presence of at least one letter in the subject. If a let‐
3829 ter is found, the subject is matched against the first alternative,
3830 otherwise it is matched against the second. This pattern matches
3831 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
3832 letters and dd are digits.
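
As an illustration, the pattern can be exercised as follows. The module
name is hypothetical, and ^ and $ are added in this sketch so that the
whole subject must have one of the two forms:

```erlang
-module(re_assert_cond_demo).
-export([matches/1]).

%% Hypothetical helper using the assertion condition from the text.
matches(Subject) ->
    Pattern = "^ (?(?=[^a-z]*[a-z])"
              " \\d{2}-[a-z]{3}-\\d{2} | \\d{2}-\\d{2}-\\d{2} ) $",
    re:run(Subject, Pattern, [extended, {capture, none}]) =:= match.
```

Here "12-abc-34" and "12-34-56" are accepted. "12-ab-34" is rejected:
it contains a letter, so only the first alternative is tried, and that
alternative requires three letters.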
3833
3835 There are two ways to include comments in patterns that are processed
3836 by PCRE. In both cases, the start of the comment must not be in a char‐
3837 acter class, or in the middle of any other sequence of related charac‐
3838 ters such as (?: or a subpattern name or number. The characters that
3839 make up a comment play no part in the pattern matching.
3840
3841 The sequence (?# marks the start of a comment that continues up to the
3842 next closing parenthesis. Nested parentheses are not permitted. If
3843 option extended is set, an unescaped # character also introduces a
3844 comment, which in this case continues to immediately after the next
3845 newline character or character sequence in the pattern. Which charac‐
3846 ters are interpreted as newlines is controlled by the options passed to
3847 a compiling function or by a special sequence at the start of the pat‐
3848 tern, as described in section Newline Conventions earlier.
3849
3850 Notice that the end of this type of comment is a literal newline
3851 sequence in the pattern; escape sequences that happen to represent a
3852 newline do not count. For example, consider the following pattern when
3853 extended is set, and the default newline convention is in force:
3854
3855 abc #comment \n still comment
3856
3857 On encountering character #, pcre_compile() skips along, looking for a
3858 newline in the pattern. The sequence \n is still literal at this stage,
3859 so it does not terminate the comment. Only a character with code value
3860 0x0a (the default newline) does so.
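
In Erlang source this distinction lines up with string escaping: "\\n"
puts the two characters \ and n into the pattern, while "\n" puts a
real linefeed there. The following sketch (hypothetical module name,
made-up subjects) shows both cases:

```erlang
-module(re_comment_demo).
-export([run/0]).

%% Hypothetical demo of how a # comment ends under option extended.
run() ->
    Opts = [extended, {capture, first, list}],
    %% Pattern text contains \n as an escape sequence: the comment runs
    %% to the end of the pattern, so the pattern is effectively "abc".
    Escaped = re:run("abc", "abc #comment \\n still comment", Opts),
    %% Pattern text contains a literal linefeed: the comment ends there,
    %% so the pattern is effectively "abcd".
    Literal = re:run("abcd", "abc #comment \n d", Opts),
    {Escaped, Literal}.
```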
3861
3863 Consider the problem of matching a string in parentheses, allowing for
3864 unlimited nested parentheses. Without the use of recursion, the best
3865 that can be done is to use a pattern that matches up to some fixed
3866 depth of nesting. It is not possible to handle an arbitrary nesting
3867 depth.
3868
3869 For some time, Perl has provided a facility that allows regular expres‐
3870 sions to recurse (among other things). It does this by interpolating
3871 Perl code in the expression at runtime, and the code can refer to the
3872 expression itself. A Perl pattern using code interpolation to solve the
3873 parentheses problem can be created like this:
3874
3875 $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
3876
3877 Item (?p{...}) interpolates Perl code at runtime, and in this case
3878 refers recursively to the pattern in which it appears.
3879
3880 Obviously, PCRE cannot support the interpolation of Perl code. Instead,
3881 it supports special syntax for recursion of the entire pattern, and for
3882 individual subpattern recursion. After its introduction in PCRE and
3883 Python, this kind of recursion was later introduced into Perl at
3884 release 5.10.
3885
3886 A special item that consists of (? followed by a number > 0 and a clos‐
3887 ing parenthesis is a recursive subroutine call of the subpattern of the
3888 given number, if it occurs inside that subpattern. (If not, it is a
3889 non-recursive subroutine call, which is described in the next section.)
3890 The special item (?R) or (?0) is a recursive call of the entire regular
3891 expression.
3892
3893 This PCRE pattern solves the nested parentheses problem (assume that
3894 option extended is set so that whitespace is ignored):
3895
3896 \( ( [^()]++ | (?R) )* \)
3897
3898 First it matches an opening parenthesis. Then it matches any number of
3899 substrings, which can either be a sequence of non-parentheses or a
3900 recursive match of the pattern itself (that is, a correctly parenthe‐
3901 sized substring). Finally there is a closing parenthesis. Notice the
3902 use of a possessive quantifier to avoid backtracking into sequences of
3903 non-parentheses.
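
A minimal sketch of this pattern in Erlang, with a hypothetical module
name and made-up subjects:

```erlang
-module(re_recursion_demo).
-export([find/1]).

%% Hypothetical helper: returns the first balanced parenthesized
%% substring, if any.
find(Subject) ->
    Pattern = "\\( ( [^()]++ | (?R) )* \\)",
    re:run(Subject, Pattern, [extended, {capture, first, list}]).
```

For example, find("x(ab(cd)ef)y") returns "(ab(cd)ef)", while
find("(ab(cd") finds nothing, as no opening parenthesis in that subject
has a matching closing one.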
3904
3905 If this were part of a larger pattern, you would not want to recurse the
3906 entire pattern, so instead you can use:
3907
3908 ( \( ( [^()]++ | (?1) )* \) )
3909
3910 The pattern is here within parentheses so that the recursion refers to
3911 them instead of the whole pattern.
3912
3913 In a larger pattern, keeping track of parenthesis numbers can be
3914 tricky. This is made easier by the use of relative references. Instead
3915 of (?1) in the pattern above, you can write (?-2) to refer to the sec‐
3916 ond most recently opened parentheses preceding the recursion. That is,
3917 a negative number counts capturing parentheses leftwards from the point
3918 at which it is encountered.
3919
3920 It is also possible to refer to later opened parentheses, by writing
3921 references such as (?+2). However, these cannot be recursive, as the
3922 reference is not inside the parentheses that are referenced. They are
3923 always non-recursive subroutine calls, as described in the next sec‐
3924 tion.
3925
3926 An alternative approach is to use named parentheses instead. The Perl
3927 syntax for this is (?&name). The earlier PCRE syntax (?P>name) is also
3928 supported. We can rewrite the above example as follows:
3929
3930 (?<pn> \( ( [^()]++ | (?&pn) )* \) )
3931
3932 If there is more than one subpattern with the same name, the earliest
3933 one is used.
3934
3935 This particular example pattern that we have studied contains nested
3936 unlimited repeats, and so the use of a possessive quantifier for match‐
3937 ing strings of non-parentheses is important when applying the pattern
3938 to strings that do not match. For example, when this pattern is applied
3939 to
3940
3941 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
3942
3943 it gives "no match" quickly. However, if a possessive quantifier is not
3944 used, the match runs for a long time, as there are so many different
3945 ways the + and * repeats can carve up the subject, and all must be
3946 tested before failure can be reported.
3947
3948 At the end of a match, the values of capturing parentheses are those
3949 from the outermost level. If the pattern above is matched against
3950
3951 (ab(cd)ef)
3952
3953 the value for the inner capturing parentheses (numbered 2) is "ef",
3954 which is the last value taken on at the top-level. If a capturing sub‐
3955 pattern is not matched at the top level, its final captured value is
3956 unset, even if it was (temporarily) set at a deeper level during the
3957 matching process.
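
The captured values can be inspected with a sketch like the following
(hypothetical module name; the pattern is the numbered-group variant
shown earlier):

```erlang
-module(re_recursion_capture_demo).
-export([run/0]).

%% Hypothetical demo: capture all groups after a recursive match.
run() ->
    Pattern = "( \\( ( [^()]++ | (?1) )* \\) )",
    re:run("(ab(cd)ef)", Pattern, [extended, {capture, all, list}]).
```

The returned list holds the overall match, group 1, and group 2, where
group 2 is expected to be "ef" as described above.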
3958
3959 Do not confuse item (?R) with condition (R), which tests for recursion.
3960 Consider the following pattern, which matches text in angle brackets,
3961 allowing for arbitrary nesting. Only digits are allowed in nested
3962 brackets (that is, when recursing), while any characters are permitted
3963 at the outer level.
3964
3965 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
3966
3967 Here (?(R) is the start of a conditional subpattern, with two different
3968 alternatives for the recursive and non-recursive cases. Item (?R) is
3969 the actual recursive call.
3970
3971 Differences in Recursion Processing between PCRE and Perl
3972
3973 Recursion processing in PCRE differs from Perl in two important ways.
3974 In PCRE (like Python, but unlike Perl), a recursive subpattern call is
3975 always treated as an atomic group. That is, once it has matched some of
3976 the subject string, it is never re-entered, even if it contains untried
3977 alternatives and there is a subsequent matching failure. This can be
3978 illustrated by the following pattern, which is intended to match a
3979 palindromic string containing an odd number of characters (for example, "a",
3980 "aba", "abcba", "abcdcba"):
3981
3982 ^(.|(.)(?1)\2)$
3983
3984 The idea is that it either matches a single character, or two identical
3985 characters surrounding a subpalindrome. In Perl, this pattern works; in
3986 PCRE it does not work if the subject string is longer than three characters.
3987 Consider the subject string "abcba".
3988
3989 At the top level, the first character is matched, but as it is not at
3990 the end of the string, the first alternative fails, the second alterna‐
3991 tive is taken, and the recursion kicks in. The recursive call to sub‐
3992 pattern 1 successfully matches the next character ("b"). (Notice that
3993 the beginning and end of line tests are not part of the recursion.)
3994
3995 Back at the top level, the next character ("c") is compared with what
3996 subpattern 2 matched, which was "a". This fails. As the recursion is
3997 treated as an atomic group, there are now no backtracking points, and
3998 so the entire match fails. (Perl can now re-enter the recursion and try
3999 the second alternative.) However, if the pattern is written with the
4000 alternatives in the other order, things are different:
4001
4002 ^((.)(?1)\2|.)$
4003
4004 This time, the recursing alternative is tried first, and continues to
4005 recurse until it runs out of characters, at which point the recursion
4006 fails. But this time we have another alternative to try at the higher
4007 level. That is the significant difference: in the previous case the
4008 remaining alternative is at a deeper recursion level, which PCRE cannot
4009 use.
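
The effect of the order of the alternatives can be observed with a
sketch such as this one (hypothetical module and helper names):

```erlang
-module(re_palindrome_demo).
-export([matches/2]).

%% Hypothetical helper: true if Pattern matches Subject.
matches(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]) =:= match.
```

With "^(.|(.)(?1)\\2)$" the subject "aba" is accepted but "abcba" is
rejected, whereas "^((.)(?1)\\2|.)$" accepts "abcba".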
4010
4011 To change the pattern so that it matches all palindromic strings, not
4012 only those with an odd number of characters, it is tempting to change
4013 the pattern to this:
4014
4015 ^((.)(?1)\2|.?)$
4016
4017 Again, this works in Perl, but not in PCRE, and for the same reason.
4018 When a deeper recursion has matched a single character, it cannot be
4019 entered again to match an empty string. The solution is to separate the
4020 two cases, and write out the odd and even cases as alternatives at the
4021 higher level:
4022
4023 ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
4024
4025 If you want to match typical palindromic phrases, the pattern must
4026 ignore all non-word characters, which can be done as follows:
4027
4028 ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
4029
4030 If run with option caseless, this pattern matches phrases such as "A
4031 man, a plan, a canal: Panama!" and it works well in both PCRE and Perl.
4032 Notice the use of the possessive quantifier *+ to avoid backtracking
4033 into sequences of non-word characters. Without this, PCRE takes much
4034 longer (10 times or more) to match typical phrases, and Perl takes so
4035 long that you think it has gone into a loop.
4036
4037 Note:
4038 The palindrome-matching patterns above work only if the subject string
4039 does not start with a palindrome that is shorter than the entire
4040 string. For example, although "abcba" is correctly matched, if the sub‐
4041 ject is "ababa", PCRE finds palindrome "aba" at the start, and then
4042 fails at top level, as the end of the string does not follow. Once
4043 again, it cannot jump back into the recursion to try other alterna‐
4044 tives, so the entire match fails.
4045
4046
4047 The second way in which PCRE and Perl differ in their recursion processing
4048 is in the handling of captured values. In Perl, when a subpattern
4049 is called recursively or as a subroutine (see the next section),
4050 it has no access to any values that were captured outside the recur‐
4051 sion. In PCRE these values can be referenced. Consider the following
4052 pattern:
4053
4054 ^(.)(\1|a(?2))
4055
4056 In PCRE, it matches "bab". The first capturing parentheses match "b",
4057 then in the second group, when the back reference \1 fails to match
4058 "b", the second alternative matches "a", and then recurses. In the
4059 recursion, \1 does now match "b" and so the whole match succeeds. In
4060 Perl, the pattern fails to match because inside the recursive call \1
4061 cannot access the externally set value.
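
As the re module is based on PCRE, the PCRE behavior is what an Erlang
program is expected to observe. A sketch with a hypothetical module
name:

```erlang
-module(re_recursion_values_demo).
-export([run/0]).

%% Hypothetical demo: inside the recursive call (?2), \1 can still see
%% the value captured by group 1 outside the recursion.
run() ->
    re:run("bab", "^(.)(\\1|a(?2))", [{capture, first, list}]).
```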
4062
4064 If the syntax for a recursive subpattern call (either by number or by
4065 name) is used outside the parentheses to which it refers, it operates
4066 like a subroutine in a programming language. The called subpattern can
4067 be defined before or after the reference. A numbered reference can be
4068 absolute or relative, as in the following examples:
4069
4070 (...(absolute)...)...(?2)...
4071 (...(relative)...)...(?-1)...
4072 (...(?+1)...(relative)...
4073
4074 An earlier example pointed out that the following pattern matches
4075 "sense and sensibility" and "response and responsibility", but not
4076 "sense and responsibility":
4077
4078 (sens|respons)e and \1ibility
4079
4080 If instead the following pattern is used, it matches "sense and respon‐
4081 sibility" and the other two strings:
4082
4083 (sens|respons)e and (?1)ibility
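
The difference between the back reference \1 and the subroutine call
(?1) can be sketched as follows (hypothetical module and helper names):

```erlang
-module(re_subroutine_demo).
-export([matches/2]).

%% Hypothetical helper: true if Pattern matches Subject.
matches(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]) =:= match.
```

With the subject "sense and responsibility", the pattern
"(sens|respons)e and \\1ibility" does not match, as \1 must repeat
"sens". The pattern "(sens|respons)e and (?1)ibility" does match, as
(?1) runs the subpattern again and can choose "respons".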
4084
4085 Another example is provided in the discussion of DEFINE earlier.
4086
4087 All subroutine calls, recursive or not, are always treated as atomic
4088 groups. That is, once a subroutine has matched some of the subject
4089 string, it is never re-entered, even if it contains untried alterna‐
4090 tives and there is a subsequent matching failure. Any capturing paren‐
4091 theses that are set during the subroutine call revert to their previous
4092 values afterwards.
4093
4094 Processing options such as case-independence are fixed when a subpat‐
4095 tern is defined, so if it is used as a subroutine, such options cannot
4096 be changed for different calls. For example, the following pattern
4097 matches "abcabc" but not "abcABC", as the change of processing option
4098 does not affect the called subpattern:
4099
4100 (abc)(?i:(?-1))
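
A small sketch of this example in Erlang (the module name is
hypothetical):

```erlang
-module(re_subroutine_options_demo).
-export([matches/1]).

%% Hypothetical helper. The (?i: ... ) wrapper does not make the called
%% subpattern caseless; (abc) keeps the options it was defined with.
matches(Subject) ->
    re:run(Subject, "(abc)(?i:(?-1))", [{capture, none}]) =:= match.
```

Here matches("abcabc") is true and matches("abcABC") is false.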
4101
4103 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
4104 name or a number enclosed either in angle brackets or single quotes is
4105 an alternative syntax for referencing a subpattern as a subroutine,
4106 possibly recursively. Here are two of the examples used above, rewritten
4107 using this syntax:
4108
4109 (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
4110 (sens|respons)e and \g'1'ibility
4111
4112 PCRE supports an extension to Oniguruma: if a number is preceded by a
4113 plus or minus sign, it is taken as a relative reference, for example:
4114
4115 (abc)(?i:\g<-1>)
4116
4117 Notice that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are
4118 not synonymous. The former is a back reference; the latter is a subrou‐
4119 tine call.
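
The distinction can be illustrated with the earlier "sense and
responsibility" subject. The following is a sketch only; the patterns
are the ones used above with the two \g forms substituted.

```erlang
%% \g'1' (Oniguruma syntax) is a subroutine call: group 1 is run again
%% and can match "respons".
{match, _} = re:run("sense and responsibility",
                    "(sens|respons)e and \\g'1'ibility"),

%% \g{1} (Perl syntax) is a back reference: "sens" must be repeated.
nomatch = re:run("sense and responsibility",
                 "(sens|respons)e and \\g{1}ibility"),
{match, _} = re:run("sense and sensibility",
                    "(sens|respons)e and \\g{1}ibility").
```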
4120
4122 Perl 5.10 introduced some "Special Backtracking Control Verbs", which
4123 are still described in the Perl documentation as "experimental and sub‐
4124 ject to change or removal in a future version of Perl". It goes on to
4125 say: "Their usage in production code should be noted to avoid problems
4126 during upgrades." The same remarks apply to the PCRE features described
4127 in this section.
4128
4129 The new verbs make use of what was previously invalid syntax: an open‐
4130 ing parenthesis followed by an asterisk. They are generally of the form
4131 (*VERB) or (*VERB:NAME). Some can take either form, possibly behaving
4132 differently depending on whether a name is present. A name is any
4133 sequence of characters that does not include a closing parenthesis. The
4134 maximum name length is 255 in the 8-bit library and 65535 in the 16-bit
4135 and 32-bit libraries. If the name is empty, that is, if the closing
4136 parenthesis immediately follows the colon, the effect is as if the
4137 colon was not there. Any number of these verbs can occur in a pattern.
4138
4139 The behavior of these verbs in repeated groups, assertions, and in sub‐
4140 patterns called as subroutines (whether or not recursively) is
4141 described below.
4142
4143 Optimizations That Affect Backtracking Verbs
4144
4145 PCRE contains some optimizations that are used to speed up matching by
4146 running some checks at the start of each match attempt. For example, it
4147 can know the minimum length of a matching subject, or that a particular
4148 character must be present. When one of these optimizations bypasses the
4149 running of a match, any included backtracking verbs are not processed.
4150 You can suppress the start-of-match optimizations by setting option
4151 no_start_optimize when calling compile/2 or run/3, or by starting the
4152 pattern with (*NO_START_OPT).
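
The two ways of suppressing the optimizations can be sketched as
follows. The pattern "(*COMMIT)abc" and the subject "xyzabc" are chosen
for illustration; the results assume the PCRE 8.x behavior described on
this page.

```erlang
%% With the optimizations active, PCRE skips ahead to "a" before the
%% first attempt, so (*COMMIT) is first passed there.
{match, ["abc"]} =
    re:run("xyzabc", "(*COMMIT)abc", [{capture, all, list}]),

%% Option no_start_optimize given to compile/2: the first attempt now
%% starts at "x", and (*COMMIT) rules out any later starting point.
{ok, MP} = re:compile("(*COMMIT)abc", [no_start_optimize]),
nomatch = re:run("xyzabc", MP),

%% The same effect requested from inside the pattern.
nomatch = re:run("xyzabc", "(*NO_START_OPT)(*COMMIT)abc").
```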
4153
4154 Experiments with Perl suggest that it too has similar optimizations,
4155 sometimes leading to anomalous results.
4156
4157 Verbs That Act Immediately
4158
4159 The following verbs act as soon as they are encountered. They must not
4160 be followed by a name.
4161
4162 (*ACCEPT)
4163
4164 This verb causes the match to end successfully, skipping the remainder
4165 of the pattern. However, when it is inside a subpattern that is called
4166 as a subroutine, only that subpattern is ended successfully. Matching
4167 then continues at the outer level. If (*ACCEPT) is triggered in a posi‐
4168 tive assertion, the assertion succeeds; in a negative assertion, the
4169 assertion fails.
4170
4171 If (*ACCEPT) is inside capturing parentheses, the data so far is cap‐
4172 tured. For example, the following matches "AB", "AAD", or "ACD". When
4173 it matches "AB", "B" is captured by the outer parentheses.
4174
4175 A((?:A|B(*ACCEPT)|C)D)
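
A short sketch of this pattern run from Erlang, with list captures
chosen here so that the content of the outer parentheses is visible:

```erlang
%% "B" is reached, (*ACCEPT) ends the match at once, and the outer
%% parentheses capture what they have seen so far.
{match, ["AB", "B"]} =
    re:run("AB", "A((?:A|B(*ACCEPT)|C)D)", [{capture, all, list}]),

%% The other alternatives still require the trailing "D".
{match, ["AAD", "AD"]} =
    re:run("AAD", "A((?:A|B(*ACCEPT)|C)D)", [{capture, all, list}]).
```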
4176
4177 The following verb causes a matching failure, forcing backtracking to
4178 occur. It is equivalent to (?!) but easier to read.
4179
4180 (*FAIL) or (*F)
4181
4182 The Perl documentation states that it is probably useful only when com‐
4183 bined with (?{}) or (??{}). Those are Perl features that are not
4184 present in PCRE. The nearest equivalent in PCRE is the callout feature,
4185 as in the pattern a+(?C)(*FAIL). With that pattern, a match with the
4186 string "aaaa" always fails, but the callout is taken before each
4187 backtrack occurs (in this example, 10 times).
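
The effect of (*FAIL) on its own can be observed directly. The patterns
below are illustrative and not taken from the PCRE documentation.

```erlang
%% "foo" is matched first, then (*FAIL) forces a backtrack, so only
%% the second alternative can ever succeed.
{match, ["bar"]} =
    re:run("foobar", "foo(*FAIL)|bar", [{capture, all, list}]),

%% (*F) is the short form; (?!) is the equivalent empty assertion.
nomatch = re:run("aaaa", "a+(*F)"),
nomatch = re:run("aaaa", "a+(?!)").
```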
4188
4189 Recording Which Path Was Taken
4190
4191 The main purpose of the (*MARK) verb is to track how a match was
4192 arrived at, although it also has a secondary use in conjunction with
4193 advancing the match starting point (see (*SKIP) below).
4194
4195 Note:
4196 In Erlang, there is no interface to retrieve a mark with run/2,3, so
4197 only the secondary purpose is relevant to the Erlang programmer.
4198
4199 The rest of this section is therefore deliberately not adapted for
4200 reading by the Erlang programmer, but the examples can help in under‐
4201 standing NAMES as they can be used by (*SKIP).
4202
4203
4204 (*MARK:NAME) or (*:NAME)
4205
4206 A name is always required with this verb. There can be as many
4207 instances of (*MARK) as you like in a pattern, and their names do not
4208 have to be unique.
4209
4210 When a match succeeds, the name of the last encountered (*MARK:NAME),
4211 (*PRUNE:NAME), or (*THEN:NAME) on the matching path is passed back to
4212 the caller as described in section "Extra data for pcre_exec()" in the
4213 pcreapi documentation. In the following example of pcretest output, the
4214 /K modifier requests the retrieval and outputting of (*MARK) data:
4215
4216 re> /X(*MARK:A)Y|X(*MARK:B)Z/K
4217 data> XY
4218 0: XY
4219 MK: A
4220 XZ
4221 0: XZ
4222 MK: B
4223
4224 The (*MARK) name is tagged with "MK:" in this output, and in this exam‐
4225 ple it indicates which of the two alternatives matched. This is a more
4226 efficient way of obtaining this information than putting each alterna‐
4227 tive in its own capturing parentheses.
4228
4229 If a verb with a name is encountered in a positive assertion that is
4230 true, the name is recorded and passed back if it is the last encoun‐
4231 tered. This does not occur for negative assertions or failing positive
4232 assertions.
4233
4234 After a partial match or a failed match, the last encountered name in
4235 the entire match process is returned, for example:
4236
4237 re> /X(*MARK:A)Y|X(*MARK:B)Z/K
4238 data> XP
4239 No match, mark = B
4240
4241 Notice that in this unanchored example, the mark is retained from the
4242 match attempt that started at letter "X" in the subject. Subsequent
4243 match attempts starting at "P" and then with an empty string do not get
4244 as far as the (*MARK) item, but nevertheless do not reset it.
4245
4246 Verbs That Act after Backtracking
4247
4248 The following verbs do nothing when they are encountered. Matching con‐
4249 tinues with what follows, but if there is no subsequent match, causing
4250 a backtrack to the verb, a failure is forced. That is, backtracking
4251 cannot pass to the left of the verb. However, when one of these verbs
4252 appears inside an atomic group or an assertion that is true, its effect
4253 is confined to that group, as once the group has been matched, there is
4254 never any backtracking into it. In this situation, backtracking can
4255 "jump back" to the left of the entire atomic group or assertion.
4256 (Remember also, as stated above, that this localization also applies in
4257 subroutine calls.)
4258
4259 These verbs differ in exactly what kind of failure occurs when back‐
4260 tracking reaches them. The behavior described below is what occurs when
4261 the verb is not in a subroutine or an assertion. Subsequent sections
4262 cover these special cases.
4263
4264 The following verb, which must not be followed by a name, causes the
4265 whole match to fail outright if there is a later matching failure that
4266 causes backtracking to reach it. Even if the pattern is unanchored, no
4267 further attempts to find a match by advancing the starting point take
4268 place.
4269
4270 (*COMMIT)
4271
4272 If (*COMMIT) is the only backtracking verb that is encountered, once it
4273 has been passed, run/2,3 is committed to finding a match at the current
4274 starting point, or not at all, for example:
4275
4276 a+(*COMMIT)b
4277
4278 This matches "xxaab" but not "aacaab". It can be thought of as a kind
4279 of dynamic anchor, or "I've started, so I must finish". The name of the
4280 most recently passed (*MARK) in the path is passed back when (*COMMIT)
4281 forces a match failure.
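
The two subjects above can be tried as follows. The third call, without
the verb, is added here only for comparison.

```erlang
%% No "a" is matched before the third character, so (*COMMIT) is first
%% passed there and the match succeeds.
{match, ["aab"]} =
    re:run("xxaab", "a+(*COMMIT)b", [{capture, all, list}]),

%% "aa" is matched, (*COMMIT) is passed, "b" fails against "c": the
%% whole match fails, although "aab" occurs later in the subject.
nomatch = re:run("aacaab", "a+(*COMMIT)b"),

%% Without the verb, the later "aab" is found.
{match, ["aab"]} = re:run("aacaab", "a+b", [{capture, all, list}]).
```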
4282
4283 If more than one backtracking verb exists in a pattern, a different one
4284 that follows (*COMMIT) can be triggered first, so merely passing (*COM‐
4285 MIT) during a match does not always guarantee that a match must be at
4286 this starting point.
4287
4288 Notice that (*COMMIT) at the start of a pattern is not the same as an
4289 anchor, unless the PCRE start-of-match optimizations are turned off, as
4290 shown in the following example:
4291
4292 1> re:run("xyzabc","(*COMMIT)abc",[{capture,all,list}]).
4293 {match,["abc"]}
4294 2> re:run("xyzabc","(*COMMIT)abc",[{capture,all,list},no_start_optimize]).
4295 nomatch
4296
4297 For this pattern, PCRE knows that any match must start with "a", so the
4298 optimization skips along the subject to "a" before applying the pattern
4299 to the first set of data. The match attempt then succeeds. In the sec‐
4300 ond call the no_start_optimize disables the optimization that skips
4301 along to the first character. The pattern is now applied starting at
4302 "x", and so the (*COMMIT) causes the match to fail without trying any
4303 other starting points.
4304
4305 The following verb causes the match to fail at the current starting
4306 position in the subject if there is a later matching failure that
4307 causes backtracking to reach it:
4308
4309 (*PRUNE) or (*PRUNE:NAME)
4310
4311 If the pattern is unanchored, the normal "bumpalong" advance to the
4312 next starting character then occurs. Backtracking can occur as usual to
4313 the left of (*PRUNE), before it is reached, or when matching to the
4314 right of (*PRUNE), but if there is no match to the right, backtracking
4315 cannot cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an
4316 alternative to an atomic group or possessive quantifier, but there are
4317 some uses of (*PRUNE) that cannot be expressed in any other way. In an
4318 anchored pattern, (*PRUNE) has the same effect as (*COMMIT).
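
A small sketch of these rules, using illustrative patterns that are not
part of the PCRE documentation:

```erlang
%% The attempts at offsets 0 and 1 fail at (*PRUNE), but the normal
%% "bumpalong" continues, and "aab" is found at offset 3.
{match, [{3, 3}]} = re:run("aacaab", "a+(*PRUNE)b"),

%% Backtracking cannot cross (*PRUNE), so a+ never gives back an "a".
%% This behaves like the atomic group (?>a+)ab.
nomatch = re:run("aaab", "a+(*PRUNE)ab"),
{match, [{0, 4}]} = re:run("aaab", "a+ab"),

%% In an anchored pattern, (*PRUNE) has the same effect as (*COMMIT).
nomatch = re:run("aacaab", "a+(*PRUNE)b", [anchored]).
```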
4319
4320 The behavior of (*PRUNE:NAME) is not the same as
4321 (*MARK:NAME)(*PRUNE). It is like (*MARK:NAME) in that the name is
4322 remembered for passing back to the caller. However, (*SKIP:NAME)
4323 searches only for names set with (*MARK).
4324
4325 Note:
4326 The fact that (*PRUNE:NAME) remembers the name is useless to the Erlang
4327 programmer, as names cannot be retrieved.
4328
4329
4330 The following verb, when specified without a name, is like (*PRUNE),
4331 except that if the pattern is unanchored, the "bumpalong" advance is
4332 not to the next character, but to the position in the subject where
4333 (*SKIP) was encountered.
4334
4335 (*SKIP)
4336
4337 (*SKIP) signifies that whatever text was matched leading up to it can‐
4338 not be part of a successful match. Consider:
4339
4340 a+(*SKIP)b
4341
4342 If the subject is "aaaac...", after the first match attempt fails
4343 (starting at the first character in the string), the starting point
4344 skips on to start the next attempt at "c". Notice that a possessive
4345 quantifier does not have the same effect as this example; although it
4346 would suppress backtracking during the first match attempt, the second
4347 attempt would start at the second character instead of skipping on to
4348 "c".
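
With a+(*SKIP)b the different restart position does not change the
final result. The following sketch uses a different, illustrative
pattern in which the restart position decides whether a match is found.

```erlang
%% (*PRUNE): after the failure at offset 0, the next attempt starts at
%% offset 1, where the second alternative matches "ac".
{match, [{1, 2}]} = re:run("aac", "aa(*PRUNE)b|ac"),

%% (*SKIP): the next attempt starts at offset 2, where (*SKIP) was
%% encountered, so the "ac" at offset 1 is never tried.
nomatch = re:run("aac", "aa(*SKIP)b|ac").
```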
4349
4350 When (*SKIP) has an associated name, its behavior is modified:
4351
4352 (*SKIP:NAME)
4353
4354 When this is triggered, the previous path through the pattern is
4355 searched for the most recent (*MARK) that has the same name. If one is
4356 found, the "bumpalong" advance is to the subject position that corre‐
4357 sponds to that (*MARK) instead of to where (*SKIP) was encountered. If
4358 no (*MARK) with a matching name is found, (*SKIP) is ignored.
4359
4360 Notice that (*SKIP:NAME) searches only for names set by (*MARK:NAME).
4361 It ignores names that are set by (*PRUNE:NAME) or (*THEN:NAME).
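
The following sketch shows a named (*SKIP) next to the unnamed form.
The pattern, the mark name m, and the unused name x are all
illustrative.

```erlang
%% Unnamed (*SKIP): the next attempt starts at offset 2, so "ac" at
%% offset 1 is missed.
nomatch = re:run("aac", "aa(*SKIP)b|ac"),

%% (*SKIP:m): the next attempt starts where (*MARK:m) was passed, that
%% is, at offset 1, and the second alternative matches there.
{match, [{1, 2}]} = re:run("aac", "a(*MARK:m)a(*SKIP:m)b|ac"),

%% No (*MARK:x) exists, so (*SKIP:x) is ignored and the normal
%% "bumpalong" to offset 1 takes place.
{match, [{1, 2}]} = re:run("aac", "aa(*SKIP:x)b|ac").
```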
4362
4363 The following verb causes a skip to the next innermost alternative when
4364 backtracking reaches it. That is, it cancels any further backtracking
4365 within the current alternative.
4366
4367 (*THEN) or (*THEN:NAME)
4368
4369 The verb name comes from the observation that it can be used for a pat‐
4370 tern-based if-then-else block:
4371
4372 ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
4373
4374 If the COND1 pattern matches, FOO is tried (and possibly further items
4375 after the end of the group if FOO succeeds). On failure, the matcher
4376 skips to the second alternative and tries COND2, without backtracking
4377 into COND1. If that succeeds and BAR fails, COND3 is tried. If BAZ then
4378 fails, there are no more alternatives, so there is a backtrack to what‐
4379 ever came before the entire group. If (*THEN) is not inside an alterna‐
4380 tion, it acts like (*PRUNE).
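
As a concrete sketch of this if-then-else behavior, with a+ in the role
of COND1 (the pattern is illustrative only):

```erlang
%% Without (*THEN): a+ gives back one "a" so that "ab" can match.
{match, ["aab"]} =
    re:run("aab", "^(?:a+ab|a)", [{capture, all, list}]),

%% With (*THEN): when "ab" fails, there is no backtracking into a+.
%% The matcher goes straight to the second alternative.
{match, ["a"]} =
    re:run("aab", "^(?:a+(*THEN)ab|a)", [{capture, all, list}]).
```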
4381
4382 The behavior of (*THEN:NAME) is not the same as
4383 (*MARK:NAME)(*THEN). It is like (*MARK:NAME) in that the name is remem‐
4384 bered for passing back to the caller. However, (*SKIP:NAME) searches
4385 only for names set with (*MARK).
4386
4387 Note:
4388 The fact that (*THEN:NAME) remembers the name is useless to the Erlang
4389 programmer, as names cannot be retrieved.
4390
4391
4392 A subpattern that does not contain a | character is just a part of the
4393 enclosing alternative; it is not a nested alternation with only one
4394 alternative. The effect of (*THEN) extends beyond such a subpattern to
4395 the enclosing alternative. Consider the following pattern, where A, B,
4396 and so on, are complex pattern fragments that do not contain any |
4397 characters at this level:
4398
4399 A (B(*THEN)C) | D
4400
4401 If A and B are matched, but there is a failure in C, matching does not
4402 backtrack into A; instead it moves to the next alternative, that is, D.
4403 However, if the subpattern containing (*THEN) is given an alternative,
4404 it behaves differently:
4405
4406 A (B(*THEN)C | (*FAIL)) | D
4407
4408 The effect of (*THEN) is now confined to the inner subpattern. After a
4409 failure in C, matching moves to (*FAIL), which causes the whole subpat‐
4410 tern to fail, as there are no more alternatives to try. In this case,
4411 matching does now backtrack into A.
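
The two forms can be compared with concrete fragments. In this sketch A
is a*, B is \w, C is b, and D is x. These choices, and the anchors that
rule out later starting points, are assumptions made for illustration.

```erlang
%% No "|" inside the group: after C fails, (*THEN) moves to the outer
%% alternative ^x without backtracking into a*, so nothing matches.
nomatch = re:run("aab", "^a*(\\w(*THEN)b)|^x"),

%% With an inner alternative, (*THEN) is confined to the group. The
%% group fails at (*FAIL), a* gives back one "a", and the match is
%% found.
{match, ["aab", "ab"]} =
    re:run("aab", "^a*(\\w(*THEN)b|(*FAIL))|^x",
           [{capture, all, list}]).
```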
4412
4413 Notice that a conditional subpattern is not considered as having two
4414 alternatives, as only one is ever used. That is, the | character in a
4415 conditional subpattern has a different meaning. Ignoring whitespace,
4416 consider:
4417
4418 ^.*? (?(?=a) a | b(*THEN)c )
4419
4420 If the subject is "ba", this pattern does not match. As .*? is
4421 ungreedy, it initially matches zero characters. The condition (?=a)
4422 then fails, the character "b" is matched, but "c" is not. At this
4423 point, matching does not backtrack to .*? as can perhaps be expected
4424 from the presence of the | character. The conditional subpattern is
4425 part of the single alternative that comprises the whole pattern, and so
4426 the match fails. (If there was a backtrack into .*?, allowing it to
4427 match "b", the match would succeed.)
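
This can be reproduced as follows. Option extended is used so that the
whitespace in the pattern is ignored; the second call, without the
verb, is added for comparison.

```erlang
%% (*THEN) prevents the backtrack into .*?, so the match fails.
nomatch = re:run("ba", "^.*? (?(?=a) a | b(*THEN)c )", [extended]),

%% Without the verb, .*? is extended to "b", the condition becomes
%% true, and "a" is matched.
{match, ["ba"]} =
    re:run("ba", "^.*? (?(?=a) a | bc )",
           [extended, {capture, all, list}]).
```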
4428
4429 The verbs described above provide four different "strengths" of control
4430 when subsequent matching fails:
4431
4432 * (*THEN) is the weakest, carrying on the match at the next alterna‐
4433 tive.
4434
4435 * (*PRUNE) comes next, failing the match at the current starting
4436 position, but allowing an advance to the next character (for an
4437 unanchored pattern).
4438
4439 * (*SKIP) is similar, except that the advance can be more than one
4440 character.
4441
4442 * (*COMMIT) is the strongest, causing the entire match to fail.
4443
4444 More than One Backtracking Verb
4445
4446 If more than one backtracking verb is present in a pattern, the one
4447 that is backtracked onto first acts. For example, consider the follow‐
4448 ing pattern, where A, B, and so on, are complex pattern fragments:
4449
4450 (A(*COMMIT)B(*THEN)C|ABD)
4451
4452 If A matches but B fails, the backtrack to (*COMMIT) causes the entire
4453 match to fail. However, if A and B match, but C fails, the backtrack to
4454 (*THEN) causes the next alternative (ABD) to be tried. This behavior is
4455 consistent, but is not always the same as in Perl. It means that if two
4456 or more backtracking verbs appear in succession, only the last of them
4457 has any effect. Consider a pattern containing (*COMMIT)(*PRUNE).
4458
4459 If there is a matching failure to the right, backtracking onto (*PRUNE)
4460 causes it to be triggered, and its action is taken. There can never be
4461 a backtrack onto (*COMMIT).
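
The pattern (A(*COMMIT)B(*THEN)C|ABD) can be sketched with single
letters standing in for the fragments A, B, C, and D (an illustrative
substitution):

```erlang
%% "a" and "b" match, "c" fails: (*THEN) is backtracked onto first, so
%% the second alternative is tried and matches.
{match, ["abd", "abd"]} =
    re:run("abd", "(a(*COMMIT)b(*THEN)c|abd)", [{capture, all, list}]),

%% "a" matches, "b" fails: the backtrack reaches (*COMMIT), and the
%% whole match fails although "abd" occurs later in the subject.
nomatch = re:run("axabd", "(a(*COMMIT)b(*THEN)c|abd)").
```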
4462
4463 Backtracking Verbs in Repeated Groups
4464
4465 PCRE differs from Perl in its handling of backtracking verbs in
4466 repeated groups. For example, consider:
4467
4468 /(a(*COMMIT)b)+ac/
4469
4470 If the subject is "abac", Perl matches, but PCRE fails because the
4471 (*COMMIT) in the second repeat of the group acts.
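
As this module is based on PCRE, the PCRE result is the one to expect
from Erlang. A sketch, with the verb-free pattern added for comparison:

```erlang
%% The second repeat of the group matches "a", passes (*COMMIT), and
%% then fails on "c", which makes the whole match fail.
nomatch = re:run("abac", "(a(*COMMIT)b)+ac"),

%% Without the verb, the group matches once and "ac" follows.
{match, [{0, 4}, {0, 2}]} = re:run("abac", "(ab)+ac").
```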
4472
4473 Backtracking Verbs in Assertions
4474
4475 (*FAIL) in an assertion has its normal effect: it forces an immediate
4476 backtrack.
4477
4478 (*ACCEPT) in a positive assertion causes the assertion to succeed with‐
4479 out any further processing. In a negative assertion, (*ACCEPT) causes
4480 the assertion to fail without any further processing.
4481
4482 The other backtracking verbs are not treated specially if they appear
4483 in a positive assertion. In particular, (*THEN) skips to the next
4484 alternative in the innermost enclosing group that has alternations,
4485 regardless of whether this is within the assertion.
4486
4487 Negative assertions are, however, different, to ensure that changing a
4488 positive assertion into a negative assertion changes its result. Back‐
4489 tracking into (*COMMIT), (*SKIP), or (*PRUNE) causes a negative asser‐
4490 tion to be true, without considering any further alternative branches
4491 in the assertion. Backtracking into (*THEN) causes it to skip to the
4492 next enclosing alternative within the assertion (the normal behavior),
4493 but if the assertion does not have such an alternative, (*THEN) behaves
4494 like (*PRUNE).
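
The effect of (*ACCEPT) in both kinds of assertion can be sketched as
follows. The patterns are illustrative only.

```erlang
%% Positive assertion: (*ACCEPT) makes it succeed after "a", so the
%% "b" inside the assertion is never tested.
{match, ["ac"]} =
    re:run("ac", "(?=a(*ACCEPT)b)ac", [{capture, all, list}]),
nomatch = re:run("ac", "(?=ab)ac"),

%% Negative assertion: (*ACCEPT) makes it fail after "a".
nomatch = re:run("ac", "(?!a(*ACCEPT)b)ac"),
{match, ["ac"]} = re:run("ac", "(?!ab)ac", [{capture, all, list}]).
```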
4495
4496 Backtracking Verbs in Subroutines
4497
4498 These behaviors occur whether or not the subpattern is called
4499 recursively. The treatment of subroutines in Perl is different in some
4500 cases.
4501
4502 * (*FAIL) in a subpattern called as a subroutine has its normal
4503 effect: it forces an immediate backtrack.
4504
4505 * (*ACCEPT) in a subpattern called as a subroutine causes the subrou‐
4506 tine match to succeed without any further processing. Matching then
4507 continues after the subroutine call.
4508
4509 * (*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as a sub‐
4510 routine cause the subroutine match to fail.
4511
4512 * (*THEN) skips to the next alternative in the innermost enclosing
4513 group within the subpattern that has alternatives. If there is no
4514 such group within the subpattern, (*THEN) causes the subroutine
4515 match to fail.
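
Two of these rules are sketched below. The subpatterns are placed in a
DEFINE group so that they are used only as subroutines; the patterns
are illustrative and the results assume the PCRE 8.x behavior described
on this page.

```erlang
%% (*ACCEPT) ends only the subroutine call; matching then continues
%% with "d" after the call.
{match, ["ad"]} =
    re:run("ad", "(?(DEFINE)(a(*ACCEPT)b))(?1)d",
           [{capture, first, list}]),

%% Inline, a backtrack onto (*COMMIT) fails the whole match.
nomatch = re:run("aac", "(?:a+(*COMMIT)b|aac)"),

%% In a subroutine, it fails only the call, so the second alternative
%% is still tried.
{match, ["aac"]} =
    re:run("aac", "(?(DEFINE)(a+(*COMMIT)b))(?:(?1)|aac)",
           [{capture, first, list}]).
```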
4516
4517Ericsson AB stdlib 3.8.2.1 re(3)