re(3) Erlang Module Definition re(3)

6 re - Perl-like regular expressions for Erlang.
7
9 This module contains regular expression matching functions for strings
10 and binaries.
11
12 The regular expression syntax and semantics resemble that of Perl.
13
14 The matching algorithms of the library are based on the PCRE library,
15 but not all of the PCRE library is interfaced and some parts of the li‐
16 brary go beyond what PCRE offers. Currently PCRE version 8.40 (release
17 date 2017-01-11) is used. The sections of the PCRE documentation that
18 are relevant to this module are included here.
19
20 Note:
21 The Erlang literal syntax for strings uses the "\" (backslash) charac‐
22 ter as an escape code. You need to escape backslashes in literal
23 strings, both in your code and in the shell, with an extra backslash,
24 that is, "\\".
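
As a minimal sketch of this escaping rule (the subject "abc42" and
the variable names are only illustrative), the regular expression
\d+ is written with a doubled backslash in Erlang source:

```erlang
%% The source literal "\\d+" holds three characters: \, d and +.
Pattern = "\\d+".
PatternLength = length(Pattern).         % 3
Result = re:run("abc42", Pattern).       % {match,[{3,2}]}
```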
25
26
28 mp() = {re_pattern, term(), term(), term(), term()}
29
30 Opaque data type containing a compiled regular expression. mp()
31 is guaranteed to be a tuple() having the atom re_pattern as its
32 first element, to allow for matching in guards. The arity of the
33 tuple or the content of the other fields can change in future
34 Erlang/OTP releases.
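
Because the first element is guaranteed to be re_pattern, a guard
can tell a compiled expression from an uncompiled one. A minimal
sketch, where IsCompiled is a hypothetical helper and not part of
this module:

```erlang
%% Only the first tuple element is inspected; the arity and the
%% remaining fields are deliberately not relied on.
IsCompiled = fun(T) when is_tuple(T), element(1, T) =:= re_pattern -> true;
                (_) -> false
             end.
{ok, MP} = re:compile("a+").
FromCompiled = IsCompiled(MP).           % true
FromString = IsCompiled("a+").           % false
```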
35
36 nl_spec() = cr | crlf | lf | anycrlf | any
37
38 compile_option() =
39 unicode | anchored | caseless | dollar_endonly | dotall |
40 extended | firstline | multiline | no_auto_capture |
41 dupnames | ungreedy |
42 {newline, nl_spec()} |
43 bsr_anycrlf | bsr_unicode | no_start_optimize | ucp |
44 never_utf
45
47 version() -> binary()
48
Returns a binary containing the version string of the PCRE
library that was used when Erlang/OTP was compiled.
51
52 compile(Regexp) -> {ok, MP} | {error, ErrSpec}
53
54 Types:
55
56 Regexp = iodata()
57 MP = mp()
58 ErrSpec =
59 {ErrString :: string(), Position :: integer() >= 0}
60
The same as compile(Regexp, []).
62
63 compile(Regexp, Options) -> {ok, MP} | {error, ErrSpec}
64
65 Types:
66
67 Regexp = iodata() | unicode:charlist()
68 Options = [Option]
69 Option = compile_option()
70 MP = mp()
71 ErrSpec =
72 {ErrString :: string(), Position :: integer() >= 0}
73
74 Compiles a regular expression, with the syntax described below,
75 into an internal format to be used later as a parameter to run/2
76 and run/3.
77
78 Compiling the regular expression before matching is useful if
79 the same expression is to be used in matching against multiple
80 subjects during the lifetime of the program. Compiling once and
81 executing many times is far more efficient than compiling each
82 time one wants to match.
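
The compile-once pattern described above can be sketched as
follows. The pattern, the subject list, and the variable names are
assumptions chosen for illustration:

```erlang
%% Compile once, then reuse the compiled expression for every subject.
{ok, Digits} = re:compile("^[0-9]+$").
Subjects = ["123", "abc", "42", "4x2"].
OnlyDigits = [S || S <- Subjects,
                   re:run(S, Digits, [{capture, none}]) =:= match].
%% OnlyDigits is ["123","42"]
```

{capture, none} is used here because only the yes/no answer is
needed, so no capture data has to be built.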
83
84 When option unicode is specified, the regular expression is to
85 be specified as a valid Unicode charlist(), otherwise as any
86 valid iodata().
87
88 Options:
89
90 unicode:
91 The regular expression is specified as a Unicode charlist()
92 and the resulting regular expression code is to be run
93 against a valid Unicode charlist() subject. Also consider
94 option ucp when using Unicode characters.
95
96 anchored:
97 The pattern is forced to be "anchored", that is, it is con‐
98 strained to match only at the first matching point in the
99 string that is searched (the "subject string"). This effect
100 can also be achieved by appropriate constructs in the pat‐
101 tern itself.
102
103 caseless:
104 Letters in the pattern match both uppercase and lowercase
105 letters. It is equivalent to Perl option /i and can be
106 changed within a pattern by a (?i) option setting. Uppercase
107 and lowercase letters are defined as in the ISO 8859-1 char‐
108 acter set.
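
A short sketch of the option and of the equivalent in-pattern
setting (subject and variable names are illustrative):

```erlang
CaseSensitive = re:run("HELLO", "hello").               % nomatch
WithOption = re:run("HELLO", "hello", [caseless]).      % {match,[{0,5}]}
InPattern = re:run("HELLO", "(?i)hello").               % {match,[{0,5}]}
```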
109
110 dollar_endonly:
111 A dollar metacharacter in the pattern matches only at the
112 end of the subject string. Without this option, a dollar
113 also matches immediately before a newline at the end of the
114 string (but not before any other newlines). This option is
115 ignored if option multiline is specified. There is no equiv‐
116 alent option in Perl, and it cannot be set within a pattern.
117
118 dotall:
119 A dot in the pattern matches all characters, including those
120 indicating newline. Without it, a dot does not match when
121 the current position is at a newline. This option is equiva‐
122 lent to Perl option /s and it can be changed within a pat‐
123 tern by a (?s) option setting. A negative class, such as
124 [^a], always matches newline characters, independent of the
125 setting of this option.
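
The effect can be sketched on a three-character subject that
contains a newline (the subject is an assumption for illustration):

```erlang
%% Without dotall the dot stops at the newline.
Default = re:run("a\nb", ".+").                 % {match,[{0,1}]}
%% With dotall the dot also consumes the newline.
DotAll = re:run("a\nb", ".+", [dotall]).        % {match,[{0,3}]}
%% A negative class matches the newline regardless of the option.
NegClass = re:run("a\nb", "[^x]+").             % {match,[{0,3}]}
```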
126
127 extended:
128 If this option is set, most white space characters in the
129 pattern are totally ignored except when escaped or inside a
130 character class. However, white space is not allowed within
131 sequences such as (?> that introduce various parenthesized
132 subpatterns, nor within a numerical quantifier such as
133 {1,3}. However, ignorable white space is permitted between
134 an item and a following quantifier and between a quantifier
135 and a following + that indicates possessiveness.
136
White space formerly did not include the VT character (code
11), because Perl did not treat this character as white
space. However, Perl changed at release 5.18, so PCRE
followed at release 8.34, and VT is now treated as white space.
141
142 This also causes characters between an unescaped # outside a
143 character class and the next newline, inclusive, to be ig‐
144 nored. This is equivalent to Perl's /x option, and it can be
145 changed within a pattern by a (?x) option setting.
146
147 With this option, comments inside complicated patterns can
148 be included. However, notice that this applies only to data
149 characters. Whitespace characters can never appear within
150 special character sequences in a pattern, for example within
151 sequence (?( that introduces a conditional subpattern.
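
A minimal sketch of a commented pattern, assuming a year-month
subject chosen only for illustration. With extended, the spaces and
the trailing comment are ignored; without it they are literal
pattern text:

```erlang
%% Spaces separate the items; everything from # onwards is a comment.
Pattern = "(\\d{4}) - (\\d{2})  # year, then month".
Extended = re:run("2017-01", Pattern, [extended]).
%% Extended is {match,[{0,7},{0,4},{5,2}]}
Literal = re:run("2017-01", Pattern).
%% Literal is nomatch
```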
152
153 firstline:
154 An unanchored pattern is required to match before or at the
155 first newline in the subject string, although the matched
156 text can continue over the newline.
157
158 multiline:
159 By default, PCRE treats the subject string as consisting of
160 a single line of characters (even if it contains newlines).
161 The "start of line" metacharacter (^) matches only at the
162 start of the string, while the "end of line" metacharacter
163 ($) matches only at the end of the string, or before a ter‐
164 minating newline (unless option dollar_endonly is speci‐
165 fied). This is the same as in Perl.
166
167 When this option is specified, the "start of line" and "end
168 of line" constructs match immediately following or immedi‐
169 ately before internal newlines in the subject string, re‐
170 spectively, as well as at the very start and end. This is
171 equivalent to Perl option /m and can be changed within a
172 pattern by a (?m) option setting. If there are no newlines
173 in a subject string, or no occurrences of ^ or $ in a pat‐
174 tern, setting multiline has no effect.
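
The difference can be sketched with a global search over a
three-line subject (the subject is an assumption for illustration):

```erlang
Subject = "one\ntwo\nthree".
%% ^ matches only at the very start of the subject.
SingleLine = re:run(Subject, "^\\w+", [global, {capture, first, list}]).
%% SingleLine is {match,[["one"]]}
%% ^ also matches immediately after each internal newline.
MultiLine = re:run(Subject, "^\\w+",
                   [global, multiline, {capture, first, list}]).
%% MultiLine is {match,[["one"],["two"],["three"]]}
```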
175
176 no_auto_capture:
177 Disables the use of numbered capturing parentheses in the
178 pattern. Any opening parenthesis that is not followed by ?
179 behaves as if it is followed by ?:. Named parentheses can
180 still be used for capturing (and they acquire numbers in the
181 usual way). There is no equivalent option in Perl.
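
As a sketch (the pattern and the group name x are illustrative),
the plain group below stops capturing while the named group is
still reported and becomes subpattern number 1:

```erlang
Default = re:run("ab", "(a)(?<x>b)").
%% Default is {match,[{0,2},{0,1},{1,1}]}
NamedOnly = re:run("ab", "(a)(?<x>b)", [no_auto_capture]).
%% NamedOnly is {match,[{0,2},{1,1}]}
```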
182
183 dupnames:
184 Names used to identify capturing subpatterns need not be
185 unique. This can be helpful for certain types of pattern
186 when it is known that only one instance of the named subpat‐
187 tern can ever be matched. More details of named subpatterns
188 are provided below.
189
190 ungreedy:
191 Inverts the "greediness" of the quantifiers so that they are
192 not greedy by default, but become greedy if followed by "?".
193 It is not compatible with Perl. It can also be set by a (?U)
194 option setting within the pattern.
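
A short sketch of the inversion, using an illustrative subject
with two bracketed items:

```erlang
Greedy = re:run("<a><b>", "<.+>").                    % {match,[{0,6}]}
Ungreedy = re:run("<a><b>", "<.+>", [ungreedy]).      % {match,[{0,3}]}
%% Under ungreedy, a trailing ? makes the quantifier greedy again.
Regreedy = re:run("<a><b>", "<.+?>", [ungreedy]).     % {match,[{0,6}]}
InPattern = re:run("<a><b>", "(?U)<.+>").             % {match,[{0,3}]}
```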
195
196 {newline, NLSpec}:
197 Overrides the default definition of a newline in the subject
198 string, which is LF (ASCII 10) in Erlang.
199
200 cr:
Newline is indicated by a single character CR (ASCII 13).
202
203 lf:
204 Newline is indicated by a single character LF (ASCII 10),
205 the default.
206
207 crlf:
208 Newline is indicated by the two-character CRLF (ASCII 13
209 followed by ASCII 10) sequence.
210
211 anycrlf:
212 Any of the three preceding sequences is to be recognized.
213
214 any:
215 Any of the newline sequences above, and the Unicode se‐
216 quences VT (vertical tab, U+000B), FF (formfeed, U+000C),
217 NEL (next line, U+0085), LS (line separator, U+2028), and
218 PS (paragraph separator, U+2029).
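
A minimal sketch of overriding the newline definition, assuming a
subject whose lines are separated by a lone CR:

```erlang
%% With the default (LF), the CR is not a line break, so ^ cannot
%% match before "b".
DefaultNl = re:run("a\rb", "^b", [multiline]).                  % nomatch
%% With {newline, cr}, ^ matches right after the CR.
CrNl = re:run("a\rb", "^b", [multiline, {newline, cr}]).        % {match,[{2,1}]}
```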
219
220 bsr_anycrlf:
Specifies that \R is to match only the CR, LF, or CRLF
sequences, not the Unicode-specific newline characters.
224
225 bsr_unicode:
Specifies that \R is to match all the Unicode newline
characters (including CRLF, and so on). This is the default.
228
229 no_start_optimize:
Disables an optimization that can malfunction if "Special
start-of-pattern items" are present in the regular
expression. A typical example is matching "DEFABC" against
"(*COMMIT)ABC", where the start optimization of PCRE skips
the subject up to "A" and never notices that the (*COMMIT)
instruction should have made the match fail. This option is
only relevant if you use "start-of-pattern items", as
discussed in section PCRE Regular Expression Details.
239
240 ucp:
241 Specifies that Unicode character properties are to be used
242 when resolving \B, \b, \D, \d, \S, \s, \W and \w. Without
243 this flag, only ISO Latin-1 properties are used. Using Uni‐
244 code properties hurts performance, but is semantically cor‐
245 rect when working with Unicode characters beyond the ISO
246 Latin-1 range.
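
As a sketch, the Cyrillic letter U+0436 (code point 1078, chosen
here only as an example of a letter beyond ISO Latin-1) is a word
character only when ucp is given:

```erlang
%% [1078] is a one-character Unicode charlist().
Latin1Props = re:run([1078], "\\w", [unicode, {capture, none}]).
%% Latin1Props is nomatch
UnicodeProps = re:run([1078], "\\w", [unicode, ucp, {capture, none}]).
%% UnicodeProps is match
```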
247
248 never_utf:
249 Specifies that the (*UTF) and/or (*UTF8) "start-of-pattern
250 items" are forbidden. This flag cannot be combined with op‐
251 tion unicode. Useful if ISO Latin-1 patterns from an exter‐
252 nal source are to be compiled.
253
254 inspect(MP, Item) -> {namelist, [binary()]}
255
256 Types:
257
258 MP = mp()
259 Item = namelist
260
Takes a compiled regular expression and an item, and returns the
relevant data from the regular expression. The only supported
item is namelist, which returns the tuple {namelist, [binary()]},
containing the names of all (unique) named subpatterns in the
regular expression. For example:
266
267 1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
268 {ok,{re_pattern,3,0,0,
269 <<69,82,67,80,119,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255,
270 255,255,...>>}}
271 2> re:inspect(MP,namelist).
272 {namelist,[<<"A">>,<<"B">>,<<"C">>]}
273 3> {ok,MPD} = re:compile("(?<C>A)|(?<B>B)|(?<C>C)",[dupnames]).
274 {ok,{re_pattern,3,0,0,
275 <<69,82,67,80,119,0,0,0,0,0,8,0,1,0,0,0,255,255,255,255,
276 255,255,...>>}}
277 4> re:inspect(MPD,namelist).
278 {namelist,[<<"B">>,<<"C">>]}
279
280 Notice in the second example that the duplicate name only occurs
281 once in the returned list, and that the list is in alphabetical
282 order regardless of where the names are positioned in the regu‐
283 lar expression. The order of the names is the same as the order
284 of captured subexpressions if {capture, all_names} is specified
285 as an option to run/3. You can therefore create a name-to-value
286 mapping from the result of run/3 like this:
287
288 1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
289 {ok,{re_pattern,3,0,0,
290 <<69,82,67,80,119,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255,
291 255,255,...>>}}
292 2> {namelist, N} = re:inspect(MP,namelist).
293 {namelist,[<<"A">>,<<"B">>,<<"C">>]}
294 3> {match,L} = re:run("AA",MP,[{capture,all_names,binary}]).
295 {match,[<<"A">>,<<>>,<<>>]}
296 4> NameMap = lists:zip(N,L).
297 [{<<"A">>,<<"A">>},{<<"B">>,<<>>},{<<"C">>,<<>>}]
298
299 replace(Subject, RE, Replacement) -> iodata() | unicode:charlist()
300
301 Types:
302
303 Subject = iodata() | unicode:charlist()
304 RE = mp() | iodata()
305 Replacement = iodata() | unicode:charlist()
306
307 Same as replace(Subject, RE, Replacement, []).
308
309 replace(Subject, RE, Replacement, Options) ->
310 iodata() | unicode:charlist()
311
312 Types:
313
314 Subject = iodata() | unicode:charlist()
315 RE = mp() | iodata() | unicode:charlist()
316 Replacement = iodata() | unicode:charlist()
317 Options = [Option]
318 Option =
319 anchored | global | notbol | noteol | notempty |
320 notempty_atstart |
321 {offset, integer() >= 0} |
322 {newline, NLSpec} |
323 bsr_anycrlf |
324 {match_limit, integer() >= 0} |
325 {match_limit_recursion, integer() >= 0} |
326 bsr_unicode |
327 {return, ReturnType} |
328 CompileOpt
329 ReturnType = iodata | list | binary
330 CompileOpt = compile_option()
331 NLSpec = cr | crlf | lf | anycrlf | any
332
333 Replaces the matched part of the Subject string with the con‐
334 tents of Replacement.
335
336 The permissible options are the same as for run/3, except that
337 option capture is not allowed. Instead a {return, ReturnType} is
338 present. The default return type is iodata, constructed in a way
339 to minimize copying. The iodata result can be used directly in
many I/O operations. If a flat list() is desired, specify
{return, list}. If a binary is desired, specify {return, binary}.
342
As in function run/3, an mp() compiled with option unicode
requires Subject to be a Unicode charlist(). If compilation is
done implicitly and the unicode compilation option is specified
to this function, both the regular expression and Subject are to
be specified as valid Unicode charlist()s.
348
The replacement string can contain the special character &,
which inserts the whole matching expression in the result, and
the special sequence \N (where N is an integer > 0), \gN, or
\g{N}, which inserts subexpression number N in the result. If no
subexpression with that number is generated by the regular
expression, nothing is inserted.
355
356 To insert an & or a \ in the result, precede it with a \. Notice
357 that Erlang already gives a special meaning to \ in literal
358 strings, so a single \ must be written as "\\" and therefore a
359 double \ as "\\\\".
360
361 Example:
362
363 re:replace("abcd","c","[&]",[{return,list}]).
364
365 gives
366
367 "ab[c]d"
368
369 while
370
371 re:replace("abcd","c","[\\&]",[{return,list}]).
372
373 gives
374
375 "ab[&]d"
376
377 As with run/3, compilation errors raise the badarg exception.
378 compile/2 can be used to get more information about the error.
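
The numbered replacement sequences and option global described
above can be sketched as follows (subjects and variable names are
illustrative):

```erlang
%% \2 and \1 are written "\\2" and "\\1" in Erlang source.
Swapped = re:replace("John Smith", "(\\w+) (\\w+)", "\\2, \\1",
                     [{return, list}]).
%% Swapped is "Smith, John"
Braced = re:replace("abc", "(b)", "[\\g{1}]", [{return, list}]).
%% Braced is "a[b]c"
%% With global, every match is replaced, not only the first one.
AllDashes = re:replace("a-b-c", "-", "+", [global, {return, binary}]).
%% AllDashes is <<"a+b+c">>
```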
379
380 run(Subject, RE) -> {match, Captured} | nomatch
381
382 Types:
383
384 Subject = iodata() | unicode:charlist()
385 RE = mp() | iodata()
386 Captured = [CaptureData]
387 CaptureData = {integer(), integer()}
388
389 Same as run(Subject,RE,[]).
390
391 run(Subject, RE, Options) ->
392 {match, Captured} | match | nomatch | {error, ErrType}
393
394 Types:
395
396 Subject = iodata() | unicode:charlist()
397 RE = mp() | iodata() | unicode:charlist()
398 Options = [Option]
399 Option =
400 anchored | global | notbol | noteol | notempty |
401 notempty_atstart | report_errors |
402 {offset, integer() >= 0} |
403 {match_limit, integer() >= 0} |
404 {match_limit_recursion, integer() >= 0} |
405 {newline, NLSpec :: nl_spec()} |
406 bsr_anycrlf | bsr_unicode |
407 {capture, ValueSpec} |
408 {capture, ValueSpec, Type} |
409 CompileOpt
410 Type = index | list | binary
411 ValueSpec =
all | all_but_first | all_names | first | none | ValueList
414 ValueList = [ValueID]
415 ValueID = integer() | string() | atom()
416 CompileOpt = compile_option()
417 See compile/2.
418 Captured = [CaptureData] | [[CaptureData]]
419 CaptureData =
420 {integer(), integer()} | ListConversionData | binary()
421 ListConversionData =
422 string() |
423 {error, string(), binary()} |
424 {incomplete, string(), binary()}
425 ErrType =
match_limit | match_limit_recursion | {compile, CompileErr}
428 CompileErr =
429 {ErrString :: string(), Position :: integer() >= 0}
430
431 Executes a regular expression matching, and returns
432 match/{match, Captured} or nomatch. The regular expression can
433 be specified either as iodata() in which case it is automati‐
434 cally compiled (as by compile/2) and executed, or as a precom‐
435 piled mp() in which case it is executed against the subject di‐
436 rectly.
437
438 When compilation is involved, exception badarg is thrown if a
439 compilation error occurs. Call compile/2 to get information
440 about the location of the error in the regular expression.
441
442 If the regular expression is previously compiled, the option
443 list can only contain the following options:
444
445 * anchored
446
447 * {capture, ValueSpec}/{capture, ValueSpec, Type}
448
449 * global
450
451 * {match_limit, integer() >= 0}
452
453 * {match_limit_recursion, integer() >= 0}
454
455 * {newline, NLSpec}
456
457 * notbol
458
459 * notempty
460
461 * notempty_atstart
462
463 * noteol
464
465 * {offset, integer() >= 0}
466
467 * report_errors
468
469 Otherwise all options valid for function compile/2 are also al‐
470 lowed. Options allowed both for compilation and execution of a
471 match, namely anchored and {newline, NLSpec}, affect both the
472 compilation and execution if present together with a non-precom‐
473 piled regular expression.
474
475 If the regular expression was previously compiled with option
476 unicode, Subject is to be provided as a valid Unicode
477 charlist(), otherwise any iodata() will do. If compilation is
478 involved and option unicode is specified, both Subject and the
regular expression are to be specified as valid Unicode
charlist()s.
481
482 {capture, ValueSpec}/{capture, ValueSpec, Type} defines what to
483 return from the function upon successful matching. The capture
484 tuple can contain both a value specification, telling which of
485 the captured substrings are to be returned, and a type specifi‐
486 cation, telling how captured substrings are to be returned (as
487 index tuples, lists, or binaries). The options are described in
488 detail below.
489
490 If the capture options describe that no substring capturing is
491 to be done ({capture, none}), the function returns the single
492 atom match upon successful matching, otherwise the tuple {match,
493 ValueList}. Disabling capturing can be done either by specifying
494 none or an empty list as ValueSpec.
495
496 Option report_errors adds the possibility that an error tuple is
497 returned. The tuple either indicates a matching error
498 (match_limit or match_limit_recursion), or a compilation error,
where the error tuple has the format {error, {compile,
CompileErr}}. Notice that if option report_errors is not specified,
501 the function never returns error tuples, but reports compilation
502 errors as a badarg exception and failed matches because of ex‐
503 ceeded match limits simply as nomatch.
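
The two ways a compilation error surfaces can be sketched with a
deliberately malformed pattern (an unbalanced parenthesis, chosen
only for illustration). The exact error string is not assumed here:

```erlang
%% With report_errors the error is returned as a tuple.
Reported = re:run("abc", "(", [report_errors]).
%% Reported has the shape {error,{compile,{ErrString,Position}}}
%% Without it, the same error is raised as a badarg exception.
Raised = (catch re:run("abc", "(")).
%% Raised has the shape {'EXIT',{badarg,_}}
```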
504
505 The following options are relevant for execution:
506
507 anchored:
508 Limits run/3 to matching at the first matching position. If
509 a pattern was compiled with anchored, or turned out to be
510 anchored by virtue of its contents, it cannot be made unan‐
511 chored at matching time, hence there is no unanchored op‐
512 tion.
513
514 global:
Implements global (repetitive) search (flag g in Perl). Each
match is returned as a separate list() containing the
specific match and any matching subexpressions (or as specified
by option capture). The Captured part of the return value is
hence a list() of list()s when this option is specified.
520
521 The interaction of option global with a regular expression
522 that matches an empty string surprises some users. When op‐
523 tion global is specified, run/3 handles empty matches in the
524 same way as Perl: a zero-length match at any point is also
525 retried with options [anchored, notempty_atstart]. If that
526 search gives a result of length > 0, the result is included.
527 Example:
528
529 re:run("cat","(|at)",[global]).
530
531 The following matchings are performed:
532
533 At offset 0:
The regular expression (|at) first matches at the initial
position of string cat, giving the result set
[{0,0},{0,0}] (the second {0,0} is because of the
subexpression marked by the parentheses). As the length of the
match is 0, we do not advance to the next position yet.
539
540 At offset 0 with [anchored, notempty_atstart]:
The search is retried with options [anchored,
notempty_atstart] at the same position, which does not give any
interesting result of longer length, so the search position
is advanced to the next character (a).
545
546 At offset 1:
547 The search results in [{1,0},{1,0}], so this search is
548 also repeated with the extra options.
549
550 At offset 1 with [anchored, notempty_atstart]:
Alternative at is found and the result is [{1,2},{1,2}].
The result is added to the list of results and the
position in the search string is advanced two steps.
554
555 At offset 3:
556 The search once again matches the empty string, giving
557 [{3,0},{3,0}].
558
At offset 3 with [anchored, notempty_atstart]:
560 This gives no result of length > 0 and we are at the last
561 position, so the global search is complete.
562
563 The result of the call is:
564
565 {match,[[{0,0},{0,0}],[{1,0},{1,0}],[{1,2},{1,2}],[{3,0},{3,0}]]}
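
For comparison, a global search whose matches are never empty
gives one inner list per match. A minimal sketch with an
illustrative subject:

```erlang
Indexes = re:run("a1b22c333", "\\d+", [global]).
%% Indexes is {match,[[{1,1}],[{3,2}],[{6,3}]]}
Strings = re:run("a1b22c333", "\\d+", [global, {capture, first, list}]).
%% Strings is {match,[["1"],["22"],["333"]]}
```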
566
567 notempty:
568 An empty string is not considered to be a valid match if
569 this option is specified. If alternatives in the pattern ex‐
570 ist, they are tried. If all the alternatives match the empty
571 string, the entire match fails.
572
573 Example:
574
575 If the following pattern is applied to a string not begin‐
576 ning with "a" or "b", it would normally match the empty
577 string at the start of the subject:
578
579 a?b?
580
581 With option notempty, this match is invalid, so run/3
582 searches further into the string for occurrences of "a" or
583 "b".
584
585 notempty_atstart:
586 Like notempty, except that an empty string match that is not
587 at the start of the subject is permitted. If the pattern is
588 anchored, such a match can occur only if the pattern con‐
589 tains \K.
590
Perl has no direct equivalent of notempty or
notempty_atstart, but it does make a special case of a pattern match of
593 the empty string within its split() function, and when using
594 modifier /g. The Perl behavior can be emulated after match‐
595 ing a null string by first trying the match again at the
596 same offset with notempty_atstart and anchored, and then, if
597 that fails, by advancing the starting offset (see below) and
598 trying an ordinary match again.
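
The difference between notempty and notempty_atstart can be
sketched as follows (subjects and variable names are illustrative):

```erlang
%% a?b? normally matches the empty string at offset 0 of "xab".
Default = re:run("xab", "a?b?").                        % {match,[{0,0}]}
%% notempty rejects that match, so "ab" is found further in.
NotEmpty = re:run("xab", "a?b?", [notempty]).           % {match,[{1,2}]}
%% "xb" contains no "a", so a? can only ever match the empty string.
NeverEmpty = re:run("xb", "a?", [notempty]).            % nomatch
%% notempty_atstart rejects the empty match only at the start.
AtStartOnly = re:run("xb", "a?", [notempty_atstart]).   % {match,[{1,0}]}
```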
599
600 notbol:
601 Specifies that the first character of the subject string is
602 not the beginning of a line, so the circumflex metacharacter
603 is not to match before it. Setting this without multiline
604 (at compile time) causes circumflex never to match. This op‐
605 tion only affects the behavior of the circumflex metacharac‐
606 ter. It does not affect \A.
607
608 noteol:
609 Specifies that the end of the subject string is not the end
610 of a line, so the dollar metacharacter is not to match it
611 nor (except in multiline mode) a newline immediately before
612 it. Setting this without multiline (at compile time) causes
613 dollar never to match. This option affects only the behavior
614 of the dollar metacharacter. It does not affect \Z or \z.
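
A short sketch of notbol and noteol on a single-line subject,
showing that \A and \z are unaffected (the subject is
illustrative):

```erlang
BolDefault = re:run("abc", "^a").                 % {match,[{0,1}]}
BolDenied = re:run("abc", "^a", [notbol]).        % nomatch
StartAnchor = re:run("abc", "\\Aa", [notbol]).    % {match,[{0,1}]}

EolDefault = re:run("abc", "c$").                 % {match,[{2,1}]}
EolDenied = re:run("abc", "c$", [noteol]).        % nomatch
EndAnchor = re:run("abc", "c\\z", [noteol]).      % {match,[{2,1}]}
```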
615
616 report_errors:
617 Gives better control of the error handling in run/3. When
618 specified, compilation errors (if the regular expression is
619 not already compiled) and runtime errors are explicitly re‐
620 turned as an error tuple.
621
622 The following are the possible runtime errors:
623
624 match_limit:
625 The PCRE library sets a limit on how many times the inter‐
626 nal match function can be called. Defaults to 10,000,000
627 in the library compiled for Erlang. If {error,
628 match_limit} is returned, the execution of the regular ex‐
629 pression has reached this limit. This is normally to be
630 regarded as a nomatch, which is the default return value
631 when this occurs, but by specifying report_errors, you are
632 informed when the match fails because of too many internal
633 calls.
634
635 match_limit_recursion:
This error is very similar to match_limit, but occurs when
the internal match function of PCRE is "recursively"
called more times than the match_limit_recursion limit,
which defaults to 10,000,000 as well. Notice that as long
as the match_limit and match_limit_recursion values are
kept at the default values, the match_limit_recursion error
cannot occur, as the match_limit error occurs before that
(each recursive call is also a call, but not conversely).
Both limits can however be changed, either by setting
limits directly in the regular expression string (see section
PCRE Regular Expression Details) or by specifying options
to run/3.
648
649 It is important to understand that what is referred to as
650 "recursion" when limiting matches is not recursion on the C
651 stack of the Erlang machine or on the Erlang process stack.
652 The PCRE version compiled into the Erlang VM uses machine
653 "heap" memory to store values that must be kept over recur‐
654 sion in regular expression matches.
655
656 {match_limit, integer() >= 0}:
657 Limits the execution time of a match in an implementation-
658 specific way. It is described as follows by the PCRE docu‐
659 mentation:
660
661 The match_limit field provides a means of preventing PCRE from using
662 up a vast amount of resources when running patterns that are not going
663 to match, but which have a very large number of possibilities in their
664 search trees. The classic example is a pattern that uses nested
665 unlimited repeats.
666
667 Internally, pcre_exec() uses a function called match(), which it calls
668 repeatedly (sometimes recursively). The limit set by match_limit is
669 imposed on the number of times this function is called during a match,
670 which has the effect of limiting the amount of backtracking that can
671 take place. For patterns that are not anchored, the count restarts
672 from zero for each position in the subject string.
673
674 This means that runaway regular expression matches can fail
675 faster if the limit is lowered using this option. The de‐
676 fault value 10,000,000 is compiled into the Erlang VM.
677
678 Note:
This option in no way affects the execution of the Erlang
VM in terms of "long running BIFs". run/3 always gives control
back to the scheduler of Erlang processes at intervals that
ensure the real-time properties of the Erlang system.
683
684
685 {match_limit_recursion, integer() >= 0}:
686 Limits the execution time and memory consumption of a match
687 in an implementation-specific way, very similar to
688 match_limit. It is described as follows by the PCRE documen‐
689 tation:
690
691 The match_limit_recursion field is similar to match_limit, but instead
692 of limiting the total number of times that match() is called, it
693 limits the depth of recursion. The recursion depth is a smaller number
694 than the total number of calls, because not all calls to match() are
695 recursive. This limit is of use only if it is set smaller than
696 match_limit.
697
698 Limiting the recursion depth limits the amount of machine stack that
699 can be used, or, when PCRE has been compiled to use memory on the heap
700 instead of the stack, the amount of heap memory that can be used.
701
702 The Erlang VM uses a PCRE library where heap memory is used
703 when regular expression match recursion occurs. This there‐
704 fore limits the use of machine heap, not C stack.
705
706 Specifying a lower value can result in matches with deep re‐
707 cursion failing, when they should have matched:
708
709 1> re:run("aaaaaaaaaaaaaz","(a+)*z").
710 {match,[{0,14},{0,13}]}
711 2> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5}]).
712 nomatch
713 3> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5},report_errors]).
714 {error,match_limit_recursion}
715
716 This option and option match_limit are only to be used in
717 rare cases. Understanding of the PCRE library internals is
718 recommended before tampering with these limits.
719
720 {offset, integer() >= 0}:
721 Start matching at the offset (position) specified in the
722 subject string. The offset is zero-based, so that the de‐
723 fault is {offset,0} (all of the subject string).
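
As a sketch (the subject is illustrative), starting at offset 1
skips the first occurrence. The reported index is still counted
from the start of the subject:

```erlang
FromStart = re:run("abcabc", "ab").                  % {match,[{0,2}]}
FromOne = re:run("abcabc", "ab", [{offset, 1}]).     % {match,[{3,2}]}
```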
724
725 {newline, NLSpec}:
726 Overrides the default definition of a newline in the subject
727 string, which is LF (ASCII 10) in Erlang.
728
729 cr:
730 Newline is indicated by a single character CR (ASCII 13).
731
732 lf:
733 Newline is indicated by a single character LF (ASCII 10),
734 the default.
735
736 crlf:
737 Newline is indicated by the two-character CRLF (ASCII 13
738 followed by ASCII 10) sequence.
739
740 anycrlf:
Any of the three preceding sequences is to be recognized.
742
743 any:
744 Any of the newline sequences above, and the Unicode se‐
745 quences VT (vertical tab, U+000B), FF (formfeed, U+000C),
746 NEL (next line, U+0085), LS (line separator, U+2028), and
747 PS (paragraph separator, U+2029).
748
749 bsr_anycrlf:
Specifies that \R is to match only the CR, LF, or CRLF
sequences, not the Unicode-specific newline characters.
(Overrides the compilation option.)
753
754 bsr_unicode:
Specifies that \R is to match all the Unicode newline
characters (including CRLF, and so on). This is the default.
(Overrides the compilation option.)
758
759 {capture, ValueSpec}/{capture, ValueSpec, Type}:
Specifies which captured substrings are returned and in what
format. By default, run/3 captures all of the matching part
of the subject and all capturing subpatterns (all of the
pattern is automatically captured). The default return type
is (zero-based) indexes of the captured parts of the string,
specified as {Offset,Length} pairs (the index Type of
capturing).
767
768 As an example of the default behavior, the following call
769 returns, as first and only captured string, the matching
770 part of the subject ("abcd" in the middle) as an index pair
771 {3,4}, where character positions are zero-based, just as in
772 offsets:
773
774 re:run("ABCabcdABC","abcd",[]).
775
776 The return value of this call is:
777
778 {match,[{3,4}]}
779
780 Another (and quite common) case is where the regular expres‐
781 sion matches all of the subject:
782
783 re:run("ABCabcdABC",".*abcd.*",[]).
784
785 Here the return value correspondingly points out all of the
786 string, beginning at index 0, and it is 10 characters long:
787
788 {match,[{0,10}]}
789
790 If the regular expression contains capturing subpatterns,
791 like in:
792
793 re:run("ABCabcdABC",".*(abcd).*",[]).
794
795 all of the matched subject is captured, as well as the cap‐
796 tured substrings:
797
798 {match,[{0,10},{3,4}]}
799
800 The complete matching pattern always gives the first return
801 value in the list and the remaining subpatterns are added in
802 the order they occurred in the regular expression.
803
804 The capture tuple is built up as follows:
805
806 ValueSpec:
807 Specifies which captured (sub)patterns are to be returned.
808 ValueSpec can either be an atom describing a predefined
809 set of return values, or a list containing the indexes or
810 the names of specific subpatterns to return.
811
812 The following are the predefined sets of subpatterns:
813
814 all:
815 All captured subpatterns including the complete matching
816 string. This is the default.
817
818 all_names:
819 All named subpatterns in the regular expression, as if a
820 list() of all the names in alphabetical order was speci‐
821 fied. The list of all names can also be retrieved with
822 inspect/2.
823
824 first:
825 Only the first captured subpattern, which is always the
826 complete matching part of the subject. All explicitly
827 captured subpatterns are discarded.
828
829 all_but_first:
830 All but the first matching subpattern, that is, all ex‐
831 plicitly captured subpatterns, but not the complete
832 matching part of the subject string. This is useful if
833 the regular expression as a whole matches a large part
834 of the subject, but the part you are interested in is in
835 an explicitly captured subpattern. If the return type is
836 list or binary, not returning subpatterns you are not
837 interested in is a good way to optimize.
838
839 none:
840 Returns no matching subpatterns, gives the single atom
841 match as the return value of the function when matching
842 successfully instead of the {match, list()} return.
843 Specifying an empty list gives the same behavior.
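
As a hedged illustration of the predefined sets, the following shell sketch runs one named pattern with each of them. The variable name NamedRE is only illustrative:

```erlang
NamedRE = ".*(?<FOO>abcd).*",
%% first: only the complete match.
{match, ["ABCabcdABC"]} = re:run("ABCabcdABC", NamedRE, [{capture, first, list}]),
%% all_but_first: only the explicitly captured subpatterns.
{match, ["abcd"]} = re:run("ABCabcdABC", NamedRE, [{capture, all_but_first, list}]),
%% all_names: the named subpatterns (here only FOO).
{match, ["abcd"]} = re:run("ABCabcdABC", NamedRE, [{capture, all_names, list}]),
%% none and the empty list both reduce the result to the atom match.
match = re:run("ABCabcdABC", NamedRE, [{capture, none}]),
match = re:run("ABCabcdABC", NamedRE, [{capture, []}]).
```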
844
845 The value list is a list of indexes for the subpatterns to
846 return, where index 0 is for all of the pattern, and 1 is
847 for the first explicit capturing subpattern in the regular
848 expression, and so on. When using named captured subpat‐
849 terns (see below) in the regular expression, one can use
850 atom()s or string()s to specify the subpatterns to be re‐
851 turned. For example, consider the regular expression:
852
853 ".*(abcd).*"
854
855 matched against string "ABCabcdABC", capturing only the
856 "abcd" part (the first explicit subpattern):
857
858 re:run("ABCabcdABC",".*(abcd).*",[{capture,[1]}]).
859
860 The call gives the following result, as the first explic‐
861 itly captured subpattern is "(abcd)", matching "abcd" in
862 the subject, at (zero-based) position 3, of length 4:
863
864 {match,[{3,4}]}
865
866 Consider the same regular expression, but with the subpat‐
867 tern explicitly named 'FOO':
868
869 ".*(?<FOO>abcd).*"
870
871 With this expression, we could still give the index of the
872 subpattern with the following call:
873
874 re:run("ABCabcdABC",".*(?<FOO>abcd).*",[{capture,[1]}]).
875
876 giving the same result as before. But, as the subpattern
877 is named, we can also specify its name in the value list:
878
879 re:run("ABCabcdABC",".*(?<FOO>abcd).*",[{capture,['FOO']}]).
880
881 This would give the same result as the earlier examples,
882 namely:
883
884 {match,[{3,4}]}
885
886 The values list can specify indexes or names not present
887 in the regular expression, in which case the return values
888 vary depending on the type. If the type is index, the tu‐
889 ple {-1,0} is returned for values with no corresponding
890 subpattern in the regular expression, but for the other
891 types (binary and list), the values are the empty binary
892 or list, respectively.
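
A small sketch of this behavior, assuming the pattern used earlier, which has only one explicit subpattern, so that index 2 does not exist:

```erlang
{match, [{3,4}, {-1,0}]} =
    re:run("ABCabcdABC", ".*(abcd).*", [{capture, [1, 2], index}]),
{match, ["abcd", []]} =
    re:run("ABCabcdABC", ".*(abcd).*", [{capture, [1, 2], list}]),
{match, [<<"abcd">>, <<>>]} =
    re:run("ABCabcdABC", ".*(abcd).*", [{capture, [1, 2], binary}]).
```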
893
894 Type:
895 Optionally specifies how captured substrings are to be re‐
896 turned. If omitted, the default of index is used.
897
898 Type can be one of the following:
899
900 index:
901 Returns captured substrings as pairs of byte indexes
902 into the subject string and length of the matching
903 string in the subject (as if the subject string was
904 flattened with erlang:iolist_to_binary/1 or uni‐
905 code:characters_to_binary/2 before matching). Notice
906 that option unicode results in byte-oriented indexes in
907 a (possibly virtual) UTF-8 encoded binary. A byte index
908 tuple {0,2} can therefore represent one or two charac‐
909 ters when unicode is in effect. This can seem counter-
910 intuitive, but has been deemed the most effective and
911 useful way to do it. Returning lists instead can result
912 in simpler code if that is desired. This return type is
913 the default.
914
915 list:
916 Returns matching substrings as lists of characters (Erlang
917 string()s). If option unicode is used in combination with
918 the \C sequence in the regular expression, a
919 captured subpattern can contain bytes that are not valid
920 UTF-8 (\C matches bytes regardless of character encod‐
921 ing). In that case the list capturing can result in the
922 same types of tuples that unicode:characters_to_list/2
923 can return, namely three-tuples with tag incomplete or
924 error, the successfully converted characters and the in‐
925 valid UTF-8 tail of the conversion as a binary. The best
926 strategy is to avoid using the \C sequence when captur‐
927 ing lists.
928
929 binary:
930 Returns matching substrings as binaries. If option uni‐
931 code is used, these binaries are in UTF-8. If the \C se‐
932 quence is used together with unicode, the binaries can
933 be invalid UTF-8.
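
The byte-oriented indexes can be illustrated with a subject containing U+00E5, which occupies two bytes in UTF-8. This is a minimal shell sketch; the variable name Aring is only illustrative:

```erlang
Aring = [16#E5, $b, $c],
%% "b" is the second character but starts at byte offset 2.
{match, [{2,1}]} = re:run(Aring, "b", [unicode]),
%% {0,2} is one character here ...
{match, [{0,2}]} = re:run(Aring, ".", [unicode]),
%% ... but two characters for a pure ASCII subject.
{match, [{0,2}]} = re:run("ab", "..", [unicode]),
%% The list and binary types return the data itself instead of offsets.
{match, [[16#E5]]} = re:run(Aring, ".", [unicode, {capture, all, list}]),
{match, [<<16#C3, 16#A5>>]} = re:run(Aring, ".", [unicode, {capture, all, binary}]).
```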
934
935 In general, subpatterns that were not assigned a value in
936 the match are returned as the tuple {-1,0} when type is in‐
937 dex. Unassigned subpatterns are returned as the empty binary
938 or list, respectively, for other return types. Consider the
939 following regular expression:
940
941 ".*((?<FOO>abdd)|a(..d)).*"
942
943 There are three explicitly capturing subpatterns, where the
944 opening parenthesis position determines the order in the re‐
945 sult, hence ((?<FOO>abdd)|a(..d)) is subpattern index 1,
946 (?<FOO>abdd) is subpattern index 2, and (..d) is subpattern
947 index 3. When matched against the following string:
948
949 "ABCabcdABC"
950
951 the subpattern at index 2 does not match, as "abdd" is not
952 present in the string, but the complete pattern matches (be‐
953 cause of the alternative a(..d)). The subpattern at index 2
954 is therefore unassigned and the default return value is:
955
956 {match,[{0,10},{3,4},{-1,0},{4,3}]}
957
958 Setting the capture Type to binary gives:
959
960 {match,[<<"ABCabcdABC">>,<<"abcd">>,<<>>,<<"bcd">>]}
961
962 Here the empty binary (<<>>) represents the unassigned sub‐
963 pattern. In the binary case, some information about the
964 matching is therefore lost, as <<>> can also be an empty
965 string captured.
966
967 If differentiation between empty matches and non-existing
968 subpatterns is necessary, use the type index and do the con‐
969 version to the final type in Erlang code.
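
One possible way to do that conversion is sketched below. The atom unassigned and the names Bin and Convert are assumptions made for this illustration, not part of the module:

```erlang
Bin = <<"ABCabcdABC">>,
{match, Indexes} =
    re:run(Bin, ".*((?<FOO>abdd)|a(..d)).*", [{capture, all, index}]),
%% {-1,0} marks an unassigned subpattern; anything else is cut out of
%% the subject, so a captured empty string would become <<>>.
Convert = fun({-1, 0})       -> unassigned;
             ({Start, Len})  -> binary:part(Bin, Start, Len)
          end,
[<<"ABCabcdABC">>, <<"abcd">>, unassigned, <<"bcd">>] =
    lists:map(Convert, Indexes).
```

The sketch uses a binary subject without option unicode, so the byte offsets apply directly to Bin.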
970
971 When option global is specified, the capture specification
972 affects each match separately, so that:
973
974 re:run("cacb","c(a|b)",[global,{capture,[1],list}]).
975
976 gives
977
978 {match,[["a"],["b"]]}
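
For comparison, a sketch of the same global match with the default value specification (all), where every match contributes its own list:

```erlang
{match, [["ca", "a"], ["cb", "b"]]} =
    re:run("cacb", "c(a|b)", [global, {capture, all, list}]).
```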
979
980 For descriptions of options that affect only the compilation
981 step, see compile/2.
982
983 split(Subject, RE) -> SplitList
984
985 Types:
986
987 Subject = iodata() | unicode:charlist()
988 RE = mp() | iodata()
989 SplitList = [iodata() | unicode:charlist()]
990
991 Same as split(Subject, RE, []).
992
993 split(Subject, RE, Options) -> SplitList
994
995 Types:
996
997 Subject = iodata() | unicode:charlist()
998 RE = mp() | iodata() | unicode:charlist()
999 Options = [Option]
1000 Option =
1001 anchored | notbol | noteol | notempty | notempty_atstart
1002 |
1003 {offset, integer() >= 0} |
1004 {newline, nl_spec()} |
1005 {match_limit, integer() >= 0} |
1006 {match_limit_recursion, integer() >= 0} |
1007 bsr_anycrlf | bsr_unicode |
1008 {return, ReturnType} |
1009 {parts, NumParts} |
1010 group | trim | CompileOpt
1011 NumParts = integer() >= 0 | infinity
1012 ReturnType = iodata | list | binary
1013 CompileOpt = compile_option()
1014 See compile/2.
1015 SplitList = [RetData] | [GroupedRetData]
1016 GroupedRetData = [RetData]
1017 RetData = iodata() | unicode:charlist() | binary() | list()
1018
1019 Splits the input into parts by finding tokens according to the
1020 regular expression supplied. The splitting is basically done by
1021 running a global regular expression match and dividing the ini‐
1022 tial string wherever a match occurs. The matching part of the
1023 string is removed from the output.
1024
1025 As in run/3, an mp() compiled with option unicode requires Sub‐
1026 ject to be a Unicode charlist(). If compilation is done implic‐
1027 itly and the unicode compilation option is specified to this
1028 function, both the regular expression and Subject are to be
1029 specified as valid Unicode charlist()s.
1030
1031 The result is given as a list of "strings", the preferred data
1032 type specified in option return (default iodata).
1033
1034 If subexpressions are specified in the regular expression, the
1035 matching subexpressions are returned in the resulting list as
1036 well. For example:
1037
1038 re:split("Erlang","[ln]",[{return,list}]).
1039
1040 gives
1041
1042 ["Er","a","g"]
1043
1044 while
1045
1046 re:split("Erlang","([ln])",[{return,list}]).
1047
1048 gives
1049
1050 ["Er","l","a","n","g"]
1051
1052 The text matching the subexpression (marked by the parentheses
1053 in the regular expression) is inserted in the result list where
1054 it was found. This means that concatenating the result of a
1055 split where the whole regular expression is a single subexpres‐
1056 sion (as in the last example) always results in the original
1057 string.
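
This property can be checked with a short shell sketch (the variable name Parts is illustrative):

```erlang
Parts = re:split("Erlang", "([ln])", [{return, list}]),
["Er", "l", "a", "n", "g"] = Parts,
%% Concatenating the parts restores the subject.
"Erlang" = lists:append(Parts).
```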
1058
1059 As there is no matching subexpression for the last part in the
1060 example (the "g"), nothing is inserted after that. To make the
1061 group of strings and the parts matching the subexpressions more
1062 obvious, one can use option group, which groups together the
1063 part of the subject string with the parts matching the subex‐
1064 pressions when the string was split:
1065
1066 re:split("Erlang","([ln])",[{return,list},group]).
1067
1068 gives
1069
1070 [["Er","l"],["a","n"],["g"]]
1071
1072 Here the regular expression first matched the "l", causing "Er"
1073 to be the first part in the result. When the regular expression
1074 matched, the (only) subexpression was bound to the "l", so the
1075 "l" is inserted in the group together with "Er". The next match
1076 is of the "n", making "a" the next part to be returned. As the
1077 subexpression is bound to substring "n" in this case, the "n" is
1078 inserted into this group. The last group consists of the remain‐
1079 ing string, as no more matches are found.
1080
1081 By default, all parts of the string, including the empty
1082 strings, are returned from the function, for example:
1083
1084 re:split("Erlang","[lg]",[{return,list}]).
1085
1086 gives
1087
1088 ["Er","an",[]]
1089
1090 as the matching of the "g" at the end of the string leaves an
1091 empty rest, which is also returned. This behavior differs from
1092 the default behavior of the split function in Perl, where empty
1093 strings at the end are by default removed. To get the "trimming"
1094 default behavior of Perl, specify trim as an option:
1095
1096 re:split("Erlang","[lg]",[{return,list},trim]).
1097
1098 gives
1099
1100 ["Er","an"]
1101
1102 The "trim" option says: "give me as many parts as possible
1103 except the empty ones", which can sometimes be useful. You can
1104 also specify how many parts you want, by specifying {parts,N}:
1105
1106 re:split("Erlang","[lg]",[{return,list},{parts,2}]).
1107
1108 gives
1109
1110 ["Er","ang"]
1111
1112 Notice that the last part is "ang", not "an", as splitting was
1113 specified into two parts, and the splitting stops when enough
1114 parts are given, which is why the result differs from that of
1115 trim.
1116
1117 More than three parts are not possible with this input data, so
1118
1119 re:split("Erlang","[lg]",[{return,list},{parts,4}]).
1120
1121 gives the same result as the default, which is to be viewed as
1122 "an infinite number of parts".
1123
1124 Specifying 0 as the number of parts gives the same effect as op‐
1125 tion trim. If subexpressions are captured, empty subexpressions
1126 matched at the end are also stripped from the result if trim or
1127 {parts,0} is specified.
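
The relationship between trim and the parts values can be summarized in a shell sketch:

```erlang
%% {parts,0} and trim both drop the empty part at the end.
["Er", "an"]     = re:split("Erlang", "[lg]", [{return, list}, {parts, 0}]),
["Er", "an"]     = re:split("Erlang", "[lg]", [{return, list}, trim]),
%% Asking for more parts than exist behaves like the default (infinity).
["Er", "an", []] = re:split("Erlang", "[lg]", [{return, list}, {parts, 4}]),
["Er", "an", []] = re:split("Erlang", "[lg]", [{return, list}, {parts, infinity}]).
```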
1128
1129 The trim behavior corresponds exactly to the Perl default.
1130 {parts,N}, where N is a positive integer, corresponds exactly to
1131 the Perl behavior with a positive numerical third parameter. The
1132 default behavior of split/3 corresponds to the Perl behavior
1133 when a negative integer is specified as the third parameter for
1134 the Perl routine.
1135
1136 Summary of options not previously described for function run/3:
1137
1138 {return,ReturnType}:
1139 Specifies how the parts of the original string are presented
1140 in the result list. Valid types:
1141
1142 iodata:
1143 The variant of iodata() that gives the least copying of
1144 data with the current implementation (often a binary, but
1145 do not depend on it).
1146
1147 binary:
1148 All parts returned as binaries.
1149
1150 list:
1151 All parts returned as lists of characters ("strings").
1152
1153 group:
1154 Groups together the part of the string with the parts of the
1155 string matching the subexpressions of the regular expres‐
1156 sion.
1157
1158 The return value from the function is in this case a list()
1159 of list()s. Each sublist begins with the string picked out
1160 of the subject string, followed by the parts matching each
1161 of the subexpressions in order of occurrence in the regular
1162 expression.
1163
1164 {parts,N}:
1165 Specifies the number of parts the subject string is to be
1166 split into.
1167
1168 The number of parts is to be a positive integer for a spe‐
1169 cific maximum number of parts, and infinity for the maximum
1170 number of parts possible (the default). Specifying {parts,0}
1171 gives as many parts as possible disregarding empty parts at
1172 the end, the same as specifying trim.
1173
1174 trim:
1175 Specifies that empty parts at the end of the result list are
1176 to be disregarded. The same as specifying {parts,0}. This
1177 corresponds to the default behavior of the split built-in
1178 function in Perl.
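
A hedged sketch of the return types. Because the exact shape of the iodata result is not guaranteed, each part is normalized with iolist_to_binary/1 before it is compared; the variable name IoParts is illustrative:

```erlang
[<<"Er">>, <<"a">>, <<"g">>] = re:split("Erlang", "[ln]", [{return, binary}]),
["Er", "a", "g"]             = re:split("Erlang", "[ln]", [{return, list}]),
%% iodata is the default return type.
IoParts = re:split("Erlang", "[ln]"),
[<<"Er">>, <<"a">>, <<"g">>] = [iolist_to_binary(P) || P <- IoParts],
%% group returns a list of lists.
[["Er", "l"], ["a", "n"], ["g"]] =
    re:split("Erlang", "([ln])", [{return, list}, group]).
```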
1179
1181 The following sections contain reference material for the regular ex‐
1182 pressions used by this module. The information is based on the PCRE
1183 documentation, with changes where this module behaves differently to
1184 the PCRE library.
1185
1187 The syntax and semantics of the regular expressions supported by PCRE
1188 are described in detail in the following sections. Perl's regular ex‐
1189 pressions are described in its own documentation, and regular expres‐
1190 sions in general are covered in many books, some with copious examples.
1191 Jeffrey Friedl's "Mastering Regular Expressions", published by
1192 O'Reilly, covers regular expressions in great detail. This description
1193 of the PCRE regular expressions is intended as reference material.
1194
1195 The reference material is divided into the following sections:
1196
1197 * Special Start-of-Pattern Items
1198
1199 * Characters and Metacharacters
1200
1201 * Backslash
1202
1203 * Circumflex and Dollar
1204
1205 * Full Stop (Period, Dot) and \N
1206
1207 * Matching a Single Data Unit
1208
1209 * Square Brackets and Character Classes
1210
1211 * Posix Character Classes
1212
1213 * Vertical Bar
1214
1215 * Internal Option Setting
1216
1217 * Subpatterns
1218
1219 * Duplicate Subpattern Numbers
1220
1221 * Named Subpatterns
1222
1223 * Repetition
1224
1225 * Atomic Grouping and Possessive Quantifiers
1226
1227 * Back References
1228
1229 * Assertions
1230
1231 * Conditional Subpatterns
1232
1233 * Comments
1234
1235 * Recursive Patterns
1236
1237 * Subpatterns as Subroutines
1238
1239 * Oniguruma Subroutine Syntax
1240
1241 * Backtracking Control
1242
1244 Some options that can be passed to compile/2 can also be set by special
1245 items at the start of a pattern. These are not Perl-compatible, but are
1246 provided to make these options accessible to pattern writers who are
1247 not able to change the program that processes the pattern. Any number
1248 of these items can appear, but they must all be together right at the
1249 start of the pattern string, and the letters must be in upper case.
1250
1251 UTF Support
1252
1253 Unicode support is basically UTF-8 based. To use Unicode characters,
1254 you either call compile/2 or run/3 with option unicode, or the pattern
1255 must start with one of these special sequences:
1256
1257 (*UTF8)
1258 (*UTF)
1259
1260 Both sequences have the same effect: the input string is interpreted as
1261 UTF-8. Notice that with these instructions, the automatic conversion of
1262 lists to UTF-8 is not performed by the re functions. Therefore, using
1263 these sequences is not recommended. Add option unicode when running
1264 compile/2 instead.
1265
1266 Some applications that allow their users to supply patterns may wish to
1267 restrict them to non-UTF data for security reasons. If option never_utf
1268 is set at compile time, (*UTF), and so on, are not allowed, and their
1269 appearance causes an error.
1270
1271 Unicode Property Support
1272
1273 The following is another special sequence that can appear at the start
1274 of a pattern:
1275
1276 (*UCP)
1277
1278 This has the same effect as setting option ucp: it causes sequences
1279 such as \d and \w to use Unicode properties to determine character
1280 types, instead of recognizing only characters with codes < 256 through
1281 a lookup table.
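
The difference can be sketched with U+0661 (ARABIC-INDIC DIGIT ONE), which has the Unicode property Nd and is encoded as two bytes in UTF-8. The variable name Digit is illustrative:

```erlang
Digit = [16#661],
%% Without ucp, code points above 255 never match \d.
nomatch = re:run(Digit, "\\d", [unicode]),
%% With ucp, Unicode properties decide.
{match, [{0,2}]} = re:run(Digit, "\\d", [unicode, ucp]),
%% The same setting requested from inside the pattern.
{match, [{0,2}]} = re:run(Digit, "(*UCP)\\d", [unicode]).
```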
1282
1283 Disabling Startup Optimizations
1284
1285 If a pattern starts with (*NO_START_OPT), it has the same effect as
1286 setting option no_start_optimize at compile time.
1287
1288 Newline Conventions
1289
1290 PCRE supports five conventions for indicating line breaks in strings: a
1291 single CR (carriage return) character, a single LF (line feed) charac‐
1292 ter, the two-character sequence CRLF, any of the three preceding, and
1293 any Unicode newline sequence.
1294
1295 A newline convention can also be specified by starting a pattern string
1296 with one of the following five sequences:
1297
1298 (*CR):
1299 Carriage return
1300
1301 (*LF):
1302 Line feed
1303
1304 (*CRLF):
1305 Carriage return followed by line feed
1306
1307 (*ANYCRLF):
1308 Any of the three above
1309
1310 (*ANY):
1311 All Unicode newline sequences
1312
1313 These override the default and the options specified to compile/2. For
1314 example, the following pattern changes the convention to CR:
1315
1316 (*CR)a.b
1317
1318 This pattern matches a\nb, as LF is no longer a newline. If more than
1319 one of these sequences is present, the last one is used.
1320
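A minimal shell sketch of the (*CR) example, assuming the default newline convention of this module (LF):

```erlang
%% By default "." does not match "\n".
nomatch = re:run("a\nb", "a.b"),
%% With (*CR), LF is an ordinary character and "." matches it.
{match, [{0,3}]} = re:run("a\nb", "(*CR)a.b"),
%% The same convention selected through the newline option.
{match, [{0,3}]} = re:run("a\nb", "a.b", [{newline, cr}]).
```
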
1321 The newline convention affects where the circumflex and dollar asser‐
1322 tions are true. It also affects the interpretation of the dot metachar‐
1323 acter when dotall is not set, and the behavior of \N. However, it does
1324 not affect what the \R escape sequence matches. By default, this is any
1325 Unicode newline sequence, for Perl compatibility. However, this can be
1326 changed; see the description of \R in section Newline Sequences. A
1327 change of the \R setting can be combined with a change of the newline
1328 convention.
1329
1330 Setting Match and Recursion Limits
1331
1332 The caller of run/3 can set a limit on the number of times the internal
1333 match() function is called and on the maximum depth of recursive calls.
1334 These facilities are provided to catch runaway matches that are pro‐
1335 voked by patterns with huge matching trees (a typical example is a pat‐
1336 tern with nested unlimited repeats) and to avoid running out of system
1337 stack by too much recursion. When one of these limits is reached,
1338 pcre_exec() gives an error return. The limits can also be set by items
1339 at the start of the pattern of the following forms:
1340
1341 (*LIMIT_MATCH=d)
1342 (*LIMIT_RECURSION=d)
1343
1344 Here d is any number of decimal digits. However, the value of the set‐
1345 ting must be less than the value set by the caller of run/3 for it to
1346 have any effect. That is, the pattern writer can lower the limit set by
1347 the programmer, but not raise it. If there is more than one setting of
1348 one of these limits, the lower value is used.
1349
1350 The default value for both the limits is 10,000,000 in the Erlang VM.
1351 Notice that the recursion limit does not affect the stack depth of the
1352 VM, as PCRE for Erlang is compiled in such a way that the match func‐
1353 tion never does recursion on the C stack.
1354
1355 Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of
1356 the limits set by the caller, not increase them.
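
The two ways of setting the match limit can be sketched as follows. The subject and pattern are an assumed example of nested unlimited repeats, and the names Runaway and Classify are illustrative. Which result is returned can depend on internal optimizations, so the sketch classifies the outcome instead of asserting one:

```erlang
%% Nested unlimited repeats against a subject that cannot match.
Runaway = lists:duplicate(30, $a) ++ "b",
%% The pattern writer lowers the match limit from inside the pattern.
FromPattern = re:run(Runaway, "(*LIMIT_MATCH=1000)(a+)+$", [report_errors]),
%% The caller sets the same limit through an option.
FromOption = re:run(Runaway, "(a+)+$", [{match_limit, 1000}, report_errors]),
Classify = fun({error, Reason}) -> {stopped, Reason};
              (nomatch)         -> no_match;
              ({match, _})      -> matched
           end,
{Classify(FromPattern), Classify(FromOption)}.
```

Option report_errors is used so that a reached limit is reported as an error tuple instead of being folded into nomatch.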
1357
1359 A regular expression is a pattern that is matched against a subject
1360 string from left to right. Most characters stand for themselves in a
1361 pattern and match the corresponding characters in the subject. As a
1362 trivial example, the following pattern matches a portion of a subject
1363 string that is identical to itself:
1364
1365 The quick brown fox
1366
1367 When caseless matching is specified (option caseless), letters are
1368 matched independently of case.
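
For example, in the Erlang shell:

```erlang
%% Case matters by default.
nomatch = re:run("THE QUICK BROWN FOX", "The quick brown fox"),
%% With caseless, all 19 characters of the subject match.
{match, [{0,19}]} =
    re:run("THE QUICK BROWN FOX", "The quick brown fox", [caseless]).
```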
1369
1370 The power of regular expressions comes from the ability to include al‐
1371 ternatives and repetitions in the pattern. These are encoded in the
1372 pattern by the use of metacharacters, which do not stand for themselves
1373 but instead are interpreted in some special way.
1374
1375 Two sets of metacharacters exist: those that are recognized anywhere in
1376 the pattern except within square brackets, and those that are recog‐
1377 nized within square brackets. Outside square brackets, the metacharac‐
1378 ters are as follows:
1379
1380 \:
1381 General escape character with many uses
1382
1383 ^:
1384 Assert start of string (or line, in multiline mode)
1385
1386 $:
1387 Assert end of string (or line, in multiline mode)
1388
1389 .:
1390 Match any character except newline (by default)
1391
1392 [:
1393 Start character class definition
1394
1395 |:
1396 Start of alternative branch
1397
1398 (:
1399 Start subpattern
1400
1401 ):
1402 End subpattern
1403
1404 ?:
1405 Extends the meaning of (, also 0 or 1 quantifier, also quantifier
1406 minimizer
1407
1408 *:
1409 0 or more quantifier
1410
1411 +:
1412 1 or more quantifier, also "possessive quantifier"
1413
1414 {:
1415 Start min/max quantifier
1416
1417 Part of a pattern within square brackets is called a "character class".
1418 The following are the only metacharacters in a character class:
1419
1420 \:
1421 General escape character
1422
1423 ^:
1424 Negate the class, but only if the first character
1425
1426 -:
1427 Indicates character range
1428
1429 [:
1430 Posix character class (only if followed by Posix syntax)
1431
1432 ]:
1433 Terminates the character class
1434
1435 The following sections describe the use of each metacharacter.
1436
1438 The backslash character has many uses. First, if it is followed by a
1439 character that is not a number or a letter, it takes away any special
1440 meaning that a character can have. This use of backslash as an escape
1441 character applies both inside and outside character classes.
1442
1443 For example, if you want to match a * character, you write \* in the
1444 pattern. This escaping action applies if the following character would
1445 otherwise be interpreted as a metacharacter, so it is always safe to
1446 precede a non-alphanumeric with backslash to specify that it stands for
1447 itself. In particular, if you want to match a backslash, write \\.
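
In Erlang source code, each backslash of the pattern must itself be doubled inside the string literal. A minimal shell sketch:

```erlang
%% The pattern \* is written "\\*" and matches a literal asterisk.
{match, [{1,1}]} = re:run("2*3", "\\*"),
%% The pattern \\ is written "\\\\" and matches one backslash.
%% The subject "a\\b" is the three characters a, \ and b.
{match, [{1,1}]} = re:run("a\\b", "\\\\"),
%% Escaping a non-alphanumeric without special meaning is harmless.
{match, [{0,1}]} = re:run("#", "\\#").
```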
1448
1449 In unicode mode, only ASCII numbers and letters have any special mean‐
1450 ing after a backslash. All other characters (in particular, those whose
1451 code points are > 127) are treated as literals.
1452
1453 If a pattern is compiled with option extended, whitespace in the pat‐
1454 tern (other than in a character class) and characters between a # out‐
1455 side a character class and the next newline are ignored. An escaping
1456 backslash can be used to include a whitespace or # character as part of
1457 the pattern.
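
A short sketch of option extended:

```erlang
%% Whitespace and the comment are ignored; the pattern is just "ab".
{match, [{0,2}]} = re:run("ab", "a b  # trailing comment", [extended]),
%% An escaped space is part of the pattern.
{match, [{0,3}]} = re:run("a b", "a\\ b", [extended]),
%% Whitespace inside a character class is kept as well.
{match, [{0,3}]} = re:run("a b", "a[ ]b", [extended]).
```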
1458
1459 To remove the special meaning from a sequence of characters, put them
1460 between \Q and \E. This is different from Perl in that $ and @ are han‐
1461 dled as literals in \Q...\E sequences in PCRE, while $ and @ cause
1462 variable interpolation in Perl. Notice the following examples:
1463
1464 Pattern PCRE matches Perl matches
1465
1466 \Qabc$xyz\E abc$xyz abc followed by the contents of $xyz
1467 \Qabc\$xyz\E abc\$xyz abc\$xyz
1468 \Qabc\E\$\Qxyz\E abc$xyz abc$xyz
1469
1470 The \Q...\E sequence is recognized both inside and outside character
1471 classes. An isolated \E that is not preceded by \Q is ignored. If \Q is
1472 not followed by \E later in the pattern, the literal interpretation
1473 continues to the end of the pattern (that is, \E is assumed at the
1474 end). If the isolated \Q is inside a character class, this causes an
1475 error, as the character class is not terminated.
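
The three table rows can be tried from the Erlang shell. This is a sketch; note again that every pattern backslash is doubled in the string literal:

```erlang
%% $ is literal inside \Q...\E.
{match, [{0,7}]} = re:run("abc$xyz", "\\Qabc$xyz\\E"),
%% A backslash inside \Q...\E is literal too; the subject is abc\$xyz.
{match, [{0,8}]} = re:run("abc\\$xyz", "\\Qabc\\$xyz\\E"),
%% Leaving the quoted region to escape $ in the ordinary way.
{match, [{0,7}]} = re:run("abc$xyz", "\\Qabc\\E\\$\\Qxyz\\E").
```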
1476
1477 Non-Printing Characters
1478
1479 A second use of backslash provides a way of encoding non-printing char‐
1480 acters in patterns in a visible manner. There is no restriction on the
1481 appearance of non-printing characters, apart from the binary zero that
1482 terminates a pattern. When a pattern is prepared by text editing, it is
1483 often easier to use one of the following escape sequences than the bi‐
1484 nary character it represents:
1485
1486 \a:
1487 Alarm, that is, the BEL character (hex 07)
1488
1489 \cx:
1490 "Control-x", where x is any ASCII character
1491
1492 \e:
1493 Escape (hex 1B)
1494
1495 \f:
1496 Form feed (hex 0C)
1497
1498 \n:
1499 Line feed (hex 0A)
1500
1501 \r:
1502 Carriage return (hex 0D)
1503
1504 \t:
1505 Tab (hex 09)
1506
1507 \0dd:
1508 Character with octal code 0dd
1509
1510 \ddd:
1511 Character with octal code ddd, or back reference
1512
1513 \o{ddd..}:
1514 Character with octal code ddd..
1515
1516 \xhh:
1517 Character with hex code hh
1518
1519 \x{hhh..}:
1520 Character with hex code hhh..
1521
1522 Note:
1523 Note that \0dd is always an octal code, and that \8 and \9 are the lit‐
1524 eral characters "8" and "9".
1525
1526
1527 The precise effect of \cx on ASCII characters is as follows: if x is a
1528 lowercase letter, it is converted to upper case. Then bit 6 of the
1529 character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
1530 (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
1531 hex 7B (; is 3B). If the data item (byte or 16-bit value) following \c
1532 has a value > 127, a compile-time error occurs. This locks out non-
1533 ASCII characters in all modes.
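
A sketch of the \cx rule, using one-byte subjects given as lists:

```erlang
%% \cA is hex 01.
{match, [{0,1}]} = re:run([16#01], "\\cA"),
%% A lowercase letter is converted to upper case first.
{match, [{0,1}]} = re:run([16#01], "\\ca"),
%% 16#7B ({) with bit 6 inverted is 16#3B (;).
{match, [{0,1}]} = re:run(";", "\\c{").
```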
1534
1535 The \c facility was designed for use with ASCII characters, but with
1536 the extension to Unicode it is even less useful than it once was.
1537
1538 After \0 up to two further octal digits are read. If there are fewer
1539 than two digits, just those that are present are used. Thus the se‐
1540 quence \0\x\015 specifies two binary zeros followed by a CR character
1541 (code value 13). Make sure you supply two digits after the initial zero
1542 if the pattern character that follows is itself an octal digit.
1543
1544 The escape \o must be followed by a sequence of octal digits, enclosed
1545 in braces. An error occurs if this is not the case. This escape is a
1546 recent addition to Perl; it provides a way of specifying character code
1547 points as octal numbers greater than 0777, and it also allows octal
1548 numbers and back references to be unambiguously specified.
1549
1550 For greater clarity and unambiguity, it is best to avoid following \ by
1551 a digit greater than zero. Instead, use \o{} or \x{} to specify charac‐
1552 ter numbers, and \g{} to specify back references. The following para‐
1553 graphs describe the old, ambiguous syntax.
1554
1555 The handling of a backslash followed by a digit other than 0 is compli‐
1556 cated, and Perl has changed in recent releases, causing PCRE also to
1557 change. Outside a character class, PCRE reads the digit and any follow‐
1558 ing digits as a decimal number. If the number is < 8, or if there have
1559 been at least that many previous capturing left parentheses in the ex‐
1560 pression, the entire sequence is taken as a back reference. A descrip‐
1561 tion of how this works is provided later, following the discussion of
1562 parenthesized subpatterns.
1563
1564 Inside a character class, or if the decimal number following \ is > 7
1565 and there have not been that many capturing subpatterns, PCRE handles
1566 \8 and \9 as the literal characters "8" and "9", and otherwise re-reads
1567 up to three octal digits following the backslash, and uses them to
1568 generate a data character. Any subsequent digits stand for themselves.
1569 For example:
1570
1571 \040:
1572 Another way of writing an ASCII space
1573
1574 \40:
1575 The same, provided there are < 40 previous capturing subpatterns
1576
1577 \7:
1578 Always a back reference
1579
1580 \11:
1581 Can be a back reference, or another way of writing a tab
1582
1583 \011:
1584 Always a tab
1585
1586 \0113:
1587 A tab followed by character "3"
1588
1589 \113:
1590 Can be a back reference, otherwise the character with octal code
1591 113
1592
1593 \377:
1594 Can be a back reference, otherwise value 255 (decimal)
1595
1596 \81:
1597 Either a back reference, or the two characters "8" and "1"
1598
1599 Notice that octal values >= 100 that are specified using this syntax
1600 must not be introduced by a leading zero, as no more than three octal
1601 digits are ever read.
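
Some of the unambiguous cases can be checked from the shell, as in the following sketch:

```erlang
{match, [{0,1}]} = re:run(" ", "\\040"),       % octal 40 is a space
{match, [{0,1}]} = re:run("\t", "\\011"),      % always a tab
{match, [{0,2}]} = re:run("\t3", "\\0113"),    % a tab, then "3"
{match, [{0,1}]} = re:run("\t", "\\o{11}"),    % octal via \o{}
%% \1 refers back to the first group: "a" followed by "a" again.
{match, [{0,2}, {0,1}]} = re:run("aa", "(a)\\1"),
%% The same back reference written with \g{}.
{match, [{0,2}, {0,1}]} = re:run("aa", "(a)\\g{1}").
```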
1602
1603 By default, after \x that is not followed by {, from zero to two hexa‐
1604 decimal digits are read (letters can be in upper or lower case). Any
1605 number of hexadecimal digits may appear between \x{ and }. If a charac‐
1606 ter other than a hexadecimal digit appears between \x{ and }, or if
1607 there is no terminating }, an error occurs.
1608
1609 Characters whose value is less than 256 can be defined by either of the
1610 two syntaxes for \x. There is no difference in the way they are han‐
1611 dled. For example, \xdc is exactly the same as \x{dc}.
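
A sketch of both \x forms. The last line assumes unicode mode, where U+20AC occupies three bytes in UTF-8:

```erlang
%% 16#DC is the byte 220; both spellings match it.
{match, [{0,1}]} = re:run([16#DC], "\\xdc"),
{match, [{0,1}]} = re:run([16#DC], "\\x{dc}"),
%% With braces, code points above 255 can be named in unicode mode.
{match, [{0,3}]} = re:run([16#20AC], "\\x{20ac}", [unicode]).
```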
1612
1613 Constraints on character values
1614
1615 Characters that are specified using octal or hexadecimal numbers are
1616 limited to certain values, as follows:
1617
1618 8-bit non-UTF mode:
1619 < 0x100
1620
1621 8-bit UTF-8 mode:
1622 < 0x10ffff and a valid codepoint
1623
1624 Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-
1625 called "surrogate" codepoints), and 0xffef.
1626
1627 Escape sequences in character classes
1628
1629 All the sequences that define a single character value can be used both
1630 inside and outside character classes. Also, inside a character class,
1631 \b is interpreted as the backspace character (hex 08).
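
The two meanings of \b can be contrasted in a short sketch:

```erlang
%% Inside a character class, \b is the backspace character (hex 08).
{match, [{0,1}]} = re:run([16#08], "[\\b]"),
%% Outside a class, \b asserts a word boundary and consumes nothing.
{match, [{3,2}]} = re:run("ab cd", "\\bcd").
```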
1632
1633 \N is not allowed in a character class. \B, \R, and \X are not special
1634 inside a character class. Like other unrecognized escape sequences,
1635 they are treated as the literal characters "B", "R", and "X". Outside a
1636 character class, these sequences have different meanings.
1637
1638 Unsupported Escape Sequences
1639
1640 In Perl, the sequences \l, \L, \u, and \U are recognized by its string
1641 handler and used to modify the case of following characters. PCRE does
1642 not support these escape sequences.
1643
1644 Absolute and Relative Back References
1645
1646 The sequence \g followed by an unsigned or a negative number, option‐
1647 ally enclosed in braces, is an absolute or relative back reference. A
1648 named back reference can be coded as \g{name}. Back references are dis‐
1649 cussed later, following the discussion of parenthesized subpatterns.
1650
1651 Absolute and Relative Subroutine Calls
1652
1653 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
1654 name or a number enclosed either in angle brackets or single quotes, is
1655 alternative syntax for referencing a subpattern as a "subroutine". De‐
1656 tails are discussed later. Notice that \g{...} (Perl syntax) and
1657 \g<...> (Oniguruma syntax) are not synonymous. The former is a back
1658 reference and the latter is a subroutine call.
1659
1660 Generic Character Types
1661
1662 Another use of backslash is for specifying generic character types:
1663
1664 \d:
1665 Any decimal digit
1666
1667 \D:
1668 Any character that is not a decimal digit
1669
1670 \h:
1671 Any horizontal whitespace character
1672
1673 \H:
1674 Any character that is not a horizontal whitespace character
1675
1676 \s:
1677 Any whitespace character
1678
1679 \S:
1680 Any character that is not a whitespace character
1681
1682 \v:
1683 Any vertical whitespace character
1684
1685 \V:
1686 Any character that is not a vertical whitespace character
1687
1688 \w:
1689 Any "word" character
1690
1691 \W:
1692 Any "non-word" character
1693
1694 There is also the single sequence \N, which matches a non-newline char‐
1695 acter. This is the same as the "." metacharacter when dotall is not
1696 set. Perl also uses \N to match characters by name, but PCRE does not
1697 support this.
1698
1699 Each pair of lowercase and uppercase escape sequences partitions the
1700 complete set of characters into two disjoint sets. Any given character
1701 matches one, and only one, of each pair. The sequences can appear both
1702 inside and outside character classes. They each match one character of
1703 the appropriate type. If the current matching point is at the end of
1704 the subject string, all fail, as there is no character to match.
1705
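As a small illustration of the generic character types, the following shell expressions can be pasted into erl. The subject strings are made up for this sketch; note the doubled backslashes required by Erlang string literals.

```erlang
%% \d+ picks out the first run of decimal digits, \w+ the first run of
%% "word" characters.
{match, ["123"]} = re:run("abc 123", "\\d+", [{capture, first, list}]).
{match, ["abc"]} = re:run("abc 123", "\\w+", [{capture, first, list}]).
%% \s matches the space at byte offset 3 (default capture is {Offset, Length}).
{match, [{3, 1}]} = re:run("abc 123", "\\s").
%% At the end of the subject there is no character left, so even \D fails.
nomatch = re:run("", "\\D").
```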
1706 For compatibility with Perl, \s did not use to match the VT character
1707 (code 11), which made it different from the POSIX "space" class.
1708 However, Perl added VT at release 5.18, and PCRE followed suit at re‐
1709 lease 8.34. The default \s characters are now HT (9), LF (10), VT (11),
1710 FF (12), CR (13), and space (32), which are defined as white space in
1711 the "C" locale. This list may vary if locale-specific matching is tak‐
1712 ing place. For example, in some locales the "non-breaking space" char‐
1713 acter (\xA0) is recognized as white space, and in others the VT charac‐
1714 ter is not.
1715
1716 A "word" character is an underscore or any character that is a letter
1717 or a digit. By default, the definition of letters and digits is
1718 controlled by the PCRE low-valued character tables; in Erlang's case
1719 (and without option unicode), this is the ISO Latin-1 character set.
1720
1721 By default, in unicode mode, characters with values > 255, that is, all
1722 characters outside the ISO Latin-1 character set, never match \d, \s,
1723 or \w, and always match \D, \S, and \W. These sequences retain their
1724 original meanings from before UTF support was available, mainly for ef‐
1725 ficiency reasons. However, if option ucp is set, the behavior is
1726 changed so that Unicode properties are used to determine character
1727 types, as follows:
1728
1729 \d:
1730 Any character that \p{Nd} matches (decimal digit)
1731
1732 \s:
1733 Any character that matches \p{Z}, \h, or \v
1734
1735 \w:
1736 Any character that matches \p{L} or \p{N}, plus underscore
1737
1738 The uppercase escapes match the inverse sets of characters. Notice that
1739 \d matches only decimal digits, while \w matches any Unicode digit, any
1740 Unicode letter, and underscore. Notice also that ucp affects \b and \B,
1741 as they are defined in terms of \w and \W. Matching these sequences is
1742 noticeably slower when ucp is set.
1743
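The effect of option ucp can be sketched with a digit that lies outside Latin-1. The variable name ArabicDigit and the choice of U+0661 (ARABIC-INDIC DIGIT ONE, general category Nd) are illustrative assumptions, not part of the module.

```erlang
%% A one-character subject given as a list of code points.
ArabicDigit = [16#0661].
%% In unicode mode without ucp, \d only knows the low-valued digits.
nomatch = re:run(ArabicDigit, "\\d", [unicode, {capture, none}]).
%% With ucp, \d behaves like \p{Nd}.
match = re:run(ArabicDigit, "\\d", [unicode, ucp, {capture, none}]).
%% The same can be requested from inside the pattern with (*UCP).
match = re:run(ArabicDigit, "(*UCP)\\d", [unicode, {capture, none}]).
```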
1744 The sequences \h, \H, \v, and \V are features that were added to Perl
1745 in release 5.10. In contrast to the other sequences, which match only
1746 ASCII characters by default, these always match certain high-valued
1747 code points, regardless of whether ucp is set.
1748
1749 The following are the horizontal space characters:
1750
1751 U+0009:
1752 Horizontal tab (HT)
1753
1754 U+0020:
1755 Space
1756
1757 U+00A0:
1758 Non-break space
1759
1760 U+1680:
1761 Ogham space mark
1762
1763 U+180E:
1764 Mongolian vowel separator
1765
1766 U+2000:
1767 En quad
1768
1769 U+2001:
1770 Em quad
1771
1772 U+2002:
1773 En space
1774
1775 U+2003:
1776 Em space
1777
1778 U+2004:
1779 Three-per-em space
1780
1781 U+2005:
1782 Four-per-em space
1783
1784 U+2006:
1785 Six-per-em space
1786
1787 U+2007:
1788 Figure space
1789
1790 U+2008:
1791 Punctuation space
1792
1793 U+2009:
1794 Thin space
1795
1796 U+200A:
1797 Hair space
1798
1799 U+202F:
1800 Narrow no-break space
1801
1802 U+205F:
1803 Medium mathematical space
1804
1805 U+3000:
1806 Ideographic space
1807
1808 The following are the vertical space characters:
1809
1810 U+000A:
1811 Line feed (LF)
1812
1813 U+000B:
1814 Vertical tab (VT)
1815
1816 U+000C:
1817 Form feed (FF)
1818
1819 U+000D:
1820 Carriage return (CR)
1821
1822 U+0085:
1823 Next line (NEL)
1824
1825 U+2028:
1826 Line separator
1827
1828 U+2029:
1829 Paragraph separator
1830
1831 In 8-bit, non-UTF-8 mode, only the characters with code points < 256
1832 are relevant.
1833
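A minimal sketch of \h and \v, using a few code points from the two lists above. The subjects are illustrative only; option ucp is deliberately left out to show that it is not needed.

```erlang
%% Tab is horizontal whitespace, line feed is vertical whitespace.
match   = re:run("\t", "\\h", [{capture, none}]).
nomatch = re:run("\n", "\\h", [{capture, none}]).
match   = re:run("\n", "\\v", [{capture, none}]).
%% High-valued code points match in unicode mode even without ucp:
%% U+2003 (em space) and U+2028 (line separator).
match = re:run([16#2003], "\\h", [unicode, {capture, none}]).
match = re:run([16#2028], "\\v", [unicode, {capture, none}]).
```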
1834 Newline Sequences
1835
1836 Outside a character class, by default, the escape sequence \R matches
1837 any Unicode newline sequence. In non-UTF-8 mode, \R is equivalent to
1838 the following:
1839
1840 (?>\r\n|\n|\x0b|\f|\r|\x85)
1841
1842 This is an example of an "atomic group"; details are provided below.
1843
1844 This particular group matches either the two-character sequence CR fol‐
1845 lowed by LF, or one of the single characters LF (line feed, U+000A), VT
1846 (vertical tab, U+000B), FF (form feed, U+000C), CR (carriage return,
1847 U+000D), or NEL (next line, U+0085). The two-character sequence is
1848 treated as a single unit that cannot be split.
1849
1850 In Unicode mode, two more characters whose code points are > 255 are
1851 added: LS (line separator, U+2028) and PS (paragraph separator,
1852 U+2029). Unicode character property support is not needed for these
1853 characters to be recognized.
1854
1855 \R can be restricted to match only CR, LF, or CRLF (instead of the com‐
1856 plete set of Unicode line endings) by setting option bsr_anycrlf either
1857 at compile time or when the pattern is matched. (BSR is an acronym for
1858 "backslash R".) This can be made the default when PCRE is built; if so,
1859 the other behavior can be requested through option bsr_unicode. These
1860 settings can also be specified by starting a pattern string with one of
1861 the following sequences:
1862
1863 (*BSR_ANYCRLF):
1864 CR, LF, or CRLF only
1865
1866 (*BSR_UNICODE):
1867 Any Unicode newline sequence
1868
1869 These override the default and the options specified to the compiling
1870 function, but they can themselves be overridden by options specified to
1871 a matching function. Notice that these special settings, which are not
1872 Perl-compatible, are recognized only at the very start of a pattern,
1873 and that they must be in upper case. If more than one of them is
1874 present, the last one is used. They can be combined with a change of
1875 newline convention; for example, a pattern can start with:
1876
1877 (*ANY)(*BSR_ANYCRLF)
1878
1879 They can also be combined with the (*UTF8), (*UTF), or (*UCP) special
1880 sequences. Inside a character class, \R is treated as an unrecognized
1881 escape sequence, and so matches the letter "R" by default.
1882
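The following sketch shows \R used as a line splitter and how bsr_anycrlf narrows it. The subjects are invented for the example, and it assumes the default setting described above, in which \R matches any Unicode newline sequence.

```erlang
%% CRLF is consumed as one unit, so no empty element appears between
%% "a" and "b".
[<<"a">>, <<"b">>, <<"c">>, <<"d">>] =
    re:split(<<"a\r\nb\nc\rd">>, "\\R", [{return, binary}]).
%% By default \R also matches VT ("\v" in an Erlang string).
match = re:run("a\vb", "a\\Rb", [{capture, none}]).
%% Restricted to CR, LF, or CRLF, either by option or in the pattern.
nomatch = re:run("a\vb", "a\\Rb", [bsr_anycrlf, {capture, none}]).
nomatch = re:run("a\vb", "(*BSR_ANYCRLF)a\\Rb", [{capture, none}]).
```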
1883 Unicode Character Properties
1884
1885 Three more escape sequences that match characters with specific proper‐
1886 ties are available. When in 8-bit non-UTF-8 mode, these sequences are
1887 limited to testing characters whose code points are < 256, but they do
1888 work in this mode. The following are the extra escape sequences:
1889
1890 \p{xx}:
1891 A character with property xx
1892
1893 \P{xx}:
1894 A character without property xx
1895
1896 \X:
1897 A Unicode extended grapheme cluster
1898
1899 The property names represented by xx above are limited to the Unicode
1900 script names, the general category properties, "Any", which matches any
1901 character (including newline), and some special PCRE properties (de‐
1902 scribed in the next section). Other Perl properties, such as "InMusi‐
1903 calSymbols", are currently not supported by PCRE. Notice that \P{Any}
1904 does not match any characters and always causes a match failure.
1905
1906 Sets of Unicode characters are defined as belonging to certain scripts.
1907 A character from one of these sets can be matched using a script name,
1908 for example:
1909
1910 \p{Greek} \P{Han}
1911
1912 Those that are not part of an identified script are lumped together as
1913 "Common". The following is the current list of scripts:
1914
1915 * Arabic
1916
1917 * Armenian
1918
1919 * Avestan
1920
1921 * Balinese
1922
1923 * Bamum
1924
1925 * Bassa_Vah
1926
1927 * Batak
1928
1929 * Bengali
1930
1931 * Bopomofo
1932
1933 * Braille
1934
1935 * Buginese
1936
1937 * Buhid
1938
1939 * Canadian_Aboriginal
1940
1941 * Carian
1942
1943 * Caucasian_Albanian
1944
1945 * Chakma
1946
1947 * Cham
1948
1949 * Cherokee
1950
1951 * Common
1952
1953 * Coptic
1954
1955 * Cuneiform
1956
1957 * Cypriot
1958
1959 * Cyrillic
1960
1961 * Deseret
1962
1963 * Devanagari
1964
1965 * Duployan
1966
1967 * Egyptian_Hieroglyphs
1968
1969 * Elbasan
1970
1971 * Ethiopic
1972
1973 * Georgian
1974
1975 * Glagolitic
1976
1977 * Gothic
1978
1979 * Grantha
1980
1981 * Greek
1982
1983 * Gujarati
1984
1985 * Gurmukhi
1986
1987 * Han
1988
1989 * Hangul
1990
1991 * Hanunoo
1992
1993 * Hebrew
1994
1995 * Hiragana
1996
1997 * Imperial_Aramaic
1998
1999 * Inherited
2000
2001 * Inscriptional_Pahlavi
2002
2003 * Inscriptional_Parthian
2004
2005 * Javanese
2006
2007 * Kaithi
2008
2009 * Kannada
2010
2011 * Katakana
2012
2013 * Kayah_Li
2014
2015 * Kharoshthi
2016
2017 * Khmer
2018
2019 * Khojki
2020
2021 * Khudawadi
2022
2023 * Lao
2024
2025 * Latin
2026
2027 * Lepcha
2028
2029 * Limbu
2030
2031 * Linear_A
2032
2033 * Linear_B
2034
2035 * Lisu
2036
2037 * Lycian
2038
2039 * Lydian
2040
2041 * Mahajani
2042
2043 * Malayalam
2044
2045 * Mandaic
2046
2047 * Manichaean
2048
2049 * Meetei_Mayek
2050
2051 * Mende_Kikakui
2052
2053 * Meroitic_Cursive
2054
2055 * Meroitic_Hieroglyphs
2056
2057 * Miao
2058
2059 * Modi
2060
2061 * Mongolian
2062
2063 * Mro
2064
2065 * Myanmar
2066
2067 * Nabataean
2068
2069 * New_Tai_Lue
2070
2071 * Nko
2072
2073 * Ogham
2074
2075 * Ol_Chiki
2076
2077 * Old_Italic
2078
2079 * Old_North_Arabian
2080
2081 * Old_Permic
2082
2083 * Old_Persian
2084
2085 * Oriya
2086
2087 * Old_South_Arabian
2088
2089 * Old_Turkic
2090
2091 * Osmanya
2092
2093 * Pahawh_Hmong
2094
2095 * Palmyrene
2096
2097 * Pau_Cin_Hau
2098
2099 * Phags_Pa
2100
2101 * Phoenician
2102
2103 * Psalter_Pahlavi
2104
2105 * Rejang
2106
2107 * Runic
2108
2109 * Samaritan
2110
2111 * Saurashtra
2112
2113 * Sharada
2114
2115 * Shavian
2116
2117 * Siddham
2118
2119 * Sinhala
2120
2121 * Sora_Sompeng
2122
2123 * Sundanese
2124
2125 * Syloti_Nagri
2126
2127 * Syriac
2128
2129 * Tagalog
2130
2131 * Tagbanwa
2132
2133 * Tai_Le
2134
2135 * Tai_Tham
2136
2137 * Tai_Viet
2138
2139 * Takri
2140
2141 * Tamil
2142
2143 * Telugu
2144
2145 * Thaana
2146
2147 * Thai
2148
2149 * Tibetan
2150
2151 * Tifinagh
2152
2153 * Tirhuta
2154
2155 * Ugaritic
2156
2157 * Vai
2158
2159 * Warang_Citi
2160
2161 * Yi
2162
2163 Each character has exactly one Unicode general category property, spec‐
2164 ified by a two-letter acronym. For compatibility with Perl, negation
2165 can be specified by including a circumflex between the opening brace
2166 and the property name. For example, \p{^Lu} is the same as \P{Lu}.
2167
2168 If only one letter is specified with \p or \P, it includes all the gen‐
2169 eral category properties that start with that letter. In this case, in
2170 the absence of negation, the curly brackets in the escape sequence are
2171 optional. The following two examples have the same effect:
2172
2173 \p{L}
2174 \pL
2175
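The property escapes can be tried out as follows. This is a sketch with made-up subjects; U+03A9 (Greek capital omega) is chosen only as a convenient member of the Greek script.

```erlang
%% General category: uppercase letter, and its two negated spellings.
{match, ["B"]} = re:run("aB", "\\p{Lu}", [{capture, first, list}]).
{match, ["a"]} = re:run("Ba", "\\P{Lu}", [{capture, first, list}]).
{match, ["a"]} = re:run("Ba", "\\p{^Lu}", [{capture, first, list}]).
%% Single-letter form without braces: any letter.
{match, ["a"]} = re:run("1a", "\\pL", [{capture, first, list}]).
%% Script names.
match   = re:run([16#03A9], "\\p{Greek}", [unicode, {capture, none}]).
nomatch = re:run("x", "\\p{Greek}", [{capture, none}]).
```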
2176 The following general category property codes are supported:
2177
2178 C:
2179 Other
2180
2181 Cc:
2182 Control
2183
2184 Cf:
2185 Format
2186
2187 Cn:
2188 Unassigned
2189
2190 Co:
2191 Private use
2192
2193 Cs:
2194 Surrogate
2195
2196 L:
2197 Letter
2198
2199 Ll:
2200 Lowercase letter
2201
2202 Lm:
2203 Modifier letter
2204
2205 Lo:
2206 Other letter
2207
2208 Lt:
2209 Title case letter
2210
2211 Lu:
2212 Uppercase letter
2213
2214 M:
2215 Mark
2216
2217 Mc:
2218 Spacing mark
2219
2220 Me:
2221 Enclosing mark
2222
2223 Mn:
2224 Non-spacing mark
2225
2226 N:
2227 Number
2228
2229 Nd:
2230 Decimal number
2231
2232 Nl:
2233 Letter number
2234
2235 No:
2236 Other number
2237
2238 P:
2239 Punctuation
2240
2241 Pc:
2242 Connector punctuation
2243
2244 Pd:
2245 Dash punctuation
2246
2247 Pe:
2248 Close punctuation
2249
2250 Pf:
2251 Final punctuation
2252
2253 Pi:
2254 Initial punctuation
2255
2256 Po:
2257 Other punctuation
2258
2259 Ps:
2260 Open punctuation
2261
2262 S:
2263 Symbol
2264
2265 Sc:
2266 Currency symbol
2267
2268 Sk:
2269 Modifier symbol
2270
2271 Sm:
2272 Mathematical symbol
2273
2274 So:
2275 Other symbol
2276
2277 Z:
2278 Separator
2279
2280 Zl:
2281 Line separator
2282
2283 Zp:
2284 Paragraph separator
2285
2286 Zs:
2287 Space separator
2288
2289 The special property L& is also supported. It matches a character that
2290 has the Lu, Ll, or Lt property, that is, a letter that is not classi‐
2291 fied as a modifier or "other".
2292
2293 The Cs (Surrogate) property applies only to characters in the range
2294 U+D800 to U+DFFF. Such characters are invalid in Unicode strings and so
2295 cannot be tested by PCRE. Perl does not support the Cs property.
2296
2297 The long synonyms for property names supported by Perl (such as \p{Let‐
2298 ter}) are not supported by PCRE. It is not permitted to prefix any of
2299 these properties with "Is".
2300
2301 No character in the Unicode table has the Cn (unassigned) property.
2302 This property is instead assumed for any code point that is not in the
2303 Unicode table.
2304
2305 Specifying caseless matching does not affect these escape sequences.
2306 For example, \p{Lu} always matches only uppercase letters. This is dif‐
2307 ferent from the behavior of current versions of Perl.
2308
2309 Matching characters by Unicode property is not fast, as PCRE must do a
2310 multistage table lookup to find a character property. That is why the
2311 traditional escape sequences such as \d and \w do not use Unicode prop‐
2312 erties in PCRE by default. However, you can make them do so by setting
2313 option ucp or by starting the pattern with (*UCP).
2314
2315 Extended Grapheme Clusters
2316
2317 The \X escape matches any number of Unicode characters that form an
2318 "extended grapheme cluster", and treats the sequence as an atomic group
2319 (see below). Up to and including release 8.31, PCRE matched an earlier,
2320 simpler definition that was equivalent to (?>\PM\pM*). That is, it
2321 matched a character without the "mark" property, followed by zero or
2322 more characters with the "mark" property. Characters with the "mark"
2323 property are typically non-spacing accents that affect the preceding
2324 character.
2325
2326 This simple definition was extended in Unicode to include more compli‐
2327 cated kinds of composite character by giving each character a grapheme
2328 breaking property, and creating rules that use these properties to de‐
2329 fine the boundaries of extended grapheme clusters. In PCRE releases
2330 later than 8.31, \X matches one of these clusters.
2331
2332 \X always matches at least one character. Then it decides whether to
2333 add more characters according to the following rules for ending a clus‐
2334 ter:
2335
2336 * End at the end of the subject string.
2337
2338 * Do not end between CR and LF; otherwise end after any control char‐
2339 acter.
2340
2341 * Do not break Hangul (a Korean script) syllable sequences. Hangul
2342 characters are of five types: L, V, T, LV, and LVT. An L character
2343 can be followed by an L, V, LV, or LVT character. An LV or V char‐
2344 acter can be followed by a V or T character. An LVT or T character
2345 can be followed only by a T character.
2346
2347 * Do not end before extending characters or spacing marks. Characters
2348 with the "mark" property always have the "extend" grapheme breaking
2349 property.
2350
2351 * Do not end after prepend characters.
2352
2353 * Otherwise, end the cluster.
2354
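Two of the cluster rules above can be observed directly. In this sketch the subjects are illustrative: the first is "e" followed by U+0301 (combining acute accent), the second starts with CR LF.

```erlang
%% The combining mark is an extending character, so the cluster does not
%% end before it; "x" starts a new cluster.
{match, [[$e, 16#0301]]} =
    re:run([$e, 16#0301, $x], "\\X", [unicode, {capture, first, list}]).
%% A cluster never ends between CR and LF.
{match, ["\r\n"]} =
    re:run("\r\nx", "\\X", [unicode, {capture, first, list}]).
```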
2355 PCRE Additional Properties
2356
2357 In addition to the standard Unicode properties described earlier, PCRE
2358 supports four more that make it possible to convert traditional escape
2359 sequences, such as \w and \s, to use Unicode properties. PCRE uses these
2360 non-standard, non-Perl properties internally when the ucp option is
2361 passed. However, they can also be used explicitly. The properties are
2362 as follows:
2363
2364 Xan:
2365 Any alphanumeric character. Matches characters that have either the
2366 L (letter) or the N (number) property.
2367
2368 Xps:
2369 Any Posix space character. Matches the characters tab, line feed,
2370 vertical tab, form feed, carriage return, and any other character
2371 that has the Z (separator) property.
2372
2373 Xsp:
2374 Any Perl space character. Matches the same characters as Xps; before
2375 PCRE release 8.34, vertical tab was excluded.
2376
2377 Xwd:
2378 Any Perl "word" character. Matches the same characters as Xan, plus
2379 underscore.
2380
2381 Perl and POSIX space are now the same. Perl added VT to its space char‐
2382 acter set at release 5.18 and PCRE changed at release 8.34.
2383
2384 Xan matches characters that have either the L (letter) or the N (num‐
2385 ber) property. Xps matches the characters tab, linefeed, vertical tab,
2386 form feed, or carriage return, and any other character that has the Z
2387 (separator) property. Xsp is the same as Xps; it used to exclude verti‐
2388 cal tab, for Perl compatibility, but Perl changed, and so PCRE followed
2389 at release 8.34. Xwd matches the same characters as Xan, plus under‐
2390 score.
2391
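The additional properties can also be used explicitly through \p, as in this small sketch (the subjects are invented for illustration):

```erlang
%% Xwd is Xan plus underscore, so it spans "a_1" where Xan stops at "a".
{match, ["a_1"]} = re:run("a_1 b", "\\p{Xwd}+", [{capture, first, list}]).
{match, ["a"]}   = re:run("a_1 b", "\\p{Xan}+", [{capture, first, list}]).
%% Xps covers tab as well as characters with the Z (separator) property.
{match, ["\t "]} = re:run("a\t b", "\\p{Xps}+", [{capture, first, list}]).
```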
2392 There is another non-standard property, Xuc, which matches any charac‐
2393 ter that can be represented by a Universal Character Name in C++ and
2394 other programming languages. These are the characters $, @, ` (grave
2395 accent), and all characters with Unicode code points >= U+00A0, except
2396 for the surrogates U+D800 to U+DFFF. Notice that most base (ASCII)
2397 characters are excluded. (Universal Character Names are of the form
2398 \uHHHH or \UHHHHHHHH, where H is a hexadecimal digit. Notice that the
2399 Xuc property does not match these sequences but the characters that
2400 they represent.)
2401
2402 Resetting the Match Start
2403
2404 The escape sequence \K causes any previously matched characters not to
2405 be included in the final matched sequence. For example, the following
2406 pattern matches "foobar", but reports that it has matched "bar":
2407
2408 foo\Kbar
2409
2410 This feature is similar to a lookbehind assertion (described below).
2411 However, in this case, the part of the subject before the real match
2412 does not have to be of fixed length, as lookbehind assertions do. The
2413 use of \K does not interfere with the setting of captured substrings.
2414 For example, when the following pattern matches "foobar", the first
2415 substring is still set to "foo":
2416
2417 (foo)\Kbar
2418
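The two patterns above behave as follows when run from the Erlang shell. The call to replace/4 is an extra illustration of why \K can be handy; the replacement string "BAZ" is arbitrary.

```erlang
%% The whole match is reported as "bar" although "foobar" was matched.
{match, ["bar"]} = re:run("foobar", "foo\\Kbar", [{capture, all, list}]).
%% Captured substrings are unaffected: group 1 is still "foo".
{match, ["bar", "foo"]} =
    re:run("foobar", "(foo)\\Kbar", [{capture, all, list}]).
%% Only the part after \K is replaced.
"fooBAZ" = re:replace("foobar", "foo\\Kbar", "BAZ", [{return, list}]).
```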
2419 Perl documents that the use of \K within assertions is "not well de‐
2420 fined". In PCRE, \K is acted upon when it occurs inside positive asser‐
2421 tions, but is ignored in negative assertions. Note that when a pattern
2422 such as (?=ab\K) matches, the reported start of the match can be
2423 greater than the end of the match.
2424
2425 Simple Assertions
2426
2427 The final use of backslash is for certain simple assertions. An asser‐
2428 tion specifies a condition that must be met at a particular point in a
2429 match, without consuming any characters from the subject string. The
2430 use of subpatterns for more complicated assertions is described below.
2431 The following are the backslashed assertions:
2432
2433 \b:
2434 Matches at a word boundary.
2435
2436 \B:
2437 Matches when not at a word boundary.
2438
2439 \A:
2440 Matches at the start of the subject.
2441
2442 \Z:
2443 Matches at the end of the subject, and before a newline at the end
2444 of the subject.
2445
2446 \z:
2447 Matches only at the end of the subject.
2448
2449 \G:
2450 Matches at the first matching position in the subject.
2451
2452 Inside a character class, \b has a different meaning; it matches the
2453 backspace character. If any other of these assertions appears in a
2454 character class, by default it matches the corresponding literal char‐
2455 acter (for example, \B matches the letter B).
2456
2457 A word boundary is a position in the subject string where the current
2458 character and the previous character do not both match \w or \W (that
2459 is, one matches \w and the other matches \W), or the start or end of
2460 the string if the first or last character matches \w, respectively. In
2461 UTF mode, the meanings of \w and \W can be changed by setting option
2462 ucp. When this is done, it also affects \b and \B. PCRE and Perl do not
2463 have a separate "start of word" or "end of word" metasequence. However,
2464 whatever follows \b normally determines which it is. For example, the
2465 fragment \ba matches "a" at the start of a word.
2466
2467 The \A, \Z, and \z assertions differ from the traditional circumflex
2468 and dollar (described in the next section) in that they only ever match
2469 at the very start and end of the subject string, whatever options are
2470 set. Thus, they are independent of multiline mode. These three asser‐
2471 tions are not affected by options notbol or noteol, which affect only
2472 the behavior of the circumflex and dollar metacharacters. However, if
2473 argument startoffset of run/3 is non-zero, indicating that matching is
2474 to start at a point other than the beginning of the subject, \A can
2475 never match. The difference between \Z and \z is that \Z matches before
2476 a newline at the end of the string and at the very end, while \z
2477 matches only at the end.
2478
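A short sketch of the differences between these assertions, using made-up subjects and assuming the default newline setting (LF):

```erlang
%% \Z tolerates a final newline, \z does not.
match   = re:run("abc\n", "abc\\Z", [{capture, none}]).
nomatch = re:run("abc\n", "abc\\z", [{capture, none}]).
%% In multiline mode ^ matches after an internal newline, \A still does not.
match   = re:run("x\nabc", "^abc", [multiline, {capture, none}]).
nomatch = re:run("x\nabc", "\\Aabc", [multiline, {capture, none}]).
%% \b on both sides: the "cat" inside "concat" is not a whole word.
{match, [[{0, 3}]]} = re:run("cat concat", "\\bcat\\b", [global]).
```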
2479 The \G assertion is true only when the current matching position is at
2480 the start point of the match, as specified by argument startoffset of
2481 run/3. It differs from \A when the value of startoffset is non-zero. By
2482 calling run/3 multiple times with appropriate arguments, you can mimic
2483 the Perl option /g, and it is in this kind of implementation where \G
2484 can be useful.
2485
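In run/3 the start offset is passed with option {offset, N}. The following sketch, with an arbitrary subject, contrasts \G and \A at a non-zero offset:

```erlang
%% \G is true at the start point of the match attempt (offset 1).
{match, [{1, 1}]} = re:run("aab", "\\Ga", [{offset, 1}]).
%% \A refers to the real start of the subject, so it cannot match here.
nomatch = re:run("aab", "\\Aa", [{offset, 1}]).
%% \G pins the match to the offset; without it, "b" is found further on.
nomatch = re:run("aab", "\\Gb", [{offset, 1}]).
{match, [{2, 1}]} = re:run("aab", "b", [{offset, 1}]).
```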
2486 Notice, however, that the PCRE interpretation of \G, as the start of
2487 the current match, is subtly different from Perl, which defines it as
2488 the end of the previous match. In Perl, these can be different when the
2489 previously matched string was empty. As PCRE does only one match at a
2490 time, it cannot reproduce this behavior.
2491
2492 If all the alternatives of a pattern begin with \G, the expression is
2493 anchored to the starting match position, and the "anchored" flag is set
2494 in the compiled regular expression.
2495
2497 The circumflex and dollar metacharacters are zero-width assertions.
2498 That is, they test for a particular condition to be true without con‐
2499 suming any characters from the subject string.
2500
2501 Outside a character class, in the default matching mode, the circumflex
2502 character is an assertion that is true only if the current matching
2503 point is at the start of the subject string. If argument startoffset of
2504 run/3 is non-zero, circumflex can never match if option multiline is
2505 unset. Inside a character class, circumflex has an entirely different
2506 meaning (see below).
2507
2508 Circumflex need not be the first character of the pattern if some
2509 alternatives are involved, but it must be the first thing in each
2510 alternative in which it appears if the pattern is ever to match that
2511 branch. If all possible alternatives start with a circumflex, that is,
2512 if the pattern is constrained to match only at the start of the sub‐
2513 ject, it is said to be an "anchored" pattern. (There are also other
2514 constructs that can cause a pattern to be anchored.)
2515
2516 The dollar character is an assertion that is true only if the current
2517 matching point is at the end of the subject string, or immediately be‐
2518 fore a newline at the end of the string (by default). Notice however
2519 that it does not match the newline. Dollar need not be the last
2520 character of the pattern if some alternatives are involved, but it must
2521 be the last item in any branch in which it appears. Dollar has no
2522 special meaning in a character class.
2523
2524 The meaning of dollar can be changed so that it matches only at the
2525 very end of the string, by setting option dollar_endonly at compile
2526 time. This does not affect the \Z assertion.
2527
2528 The meanings of the circumflex and dollar characters are changed if op‐
2529 tion multiline is set. When this is the case, a circumflex matches im‐
2530 mediately after internal newlines and at the start of the subject
2531 string. It does not match after a newline that ends the string. A dol‐
2532 lar matches before any newlines in the string, and at the very end,
2533 when multiline is set. When newline is specified as the two-character
2534 sequence CRLF, isolated CR and LF characters do not indicate newlines.
2535
2536 For example, the pattern /^abc$/ matches the subject string "def\nabc"
2537 (where \n represents a newline) in multiline mode, but not otherwise.
2538 So, patterns that are anchored in single-line mode because all branches
2539 start with ^ are not anchored in multiline mode, and a match for cir‐
2540 cumflex is possible when argument startoffset of run/3 is non-zero. Op‐
2541 tion dollar_endonly is ignored if multiline is set.
2542
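The interaction of multiline and dollar_endonly described above can be checked with a few calls. This is a sketch with illustrative subjects, assuming the default newline setting (LF):

```erlang
%% The example from the text: only multiline mode lets ^ match after "\n".
nomatch = re:run("def\nabc", "^abc$", [{capture, none}]).
match   = re:run("def\nabc", "^abc$", [multiline, {capture, none}]).
%% By default $ also matches just before a final newline.
match   = re:run("abc\n", "abc$", [{capture, none}]).
nomatch = re:run("abc\n", "abc$", [dollar_endonly, {capture, none}]).
%% dollar_endonly is ignored when multiline is set.
match   = re:run("abc\n", "abc$", [dollar_endonly, multiline, {capture, none}]).
```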
2543 Notice that the sequences \A, \Z, and \z can be used to match the start
2544 and end of the subject in both modes. If all branches of a pattern
2545 start with \A, it is always anchored, regardless of whether multiline is set.
2546
2548 Outside a character class, a dot in the pattern matches any character
2549 in the subject string except (by default) a character that signifies
2550 the end of a line.
2551
2552 When a line ending is defined as a single character, dot never matches
2553 that character. When the two-character sequence CRLF is used, dot does
2554 not match CR if it is immediately followed by LF, otherwise it matches
2555 all characters (including isolated CRs and LFs). When any Unicode line
2556 endings are recognized, dot does not match CR, LF, or any of the other
2557 line-ending characters.
2558
2559 The behavior of dot regarding newlines can be changed. If option dotall
2560 is set, a dot matches any character, without exception. If the two-
2561 character sequence CRLF is present in the subject string, it takes two
2562 dots to match it.
2563
2564 The handling of dot is entirely independent of the handling of
2565 circumflex and dollar; the only relationship is that both involve newlines.
2566 Dot has no special meaning in a character class.
2567
2568 The escape sequence \N behaves like a dot, except that it is not
2569 affected by option dotall. That is, it matches any character except
2570 one that signifies the end of a line. Perl also uses \N to match
2571 characters by name, but PCRE does not support this.
2572
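A minimal sketch of dot, dotall, and \N, using invented subjects and assuming the default newline setting (LF):

```erlang
%% By default dot does not match the line feed.
nomatch = re:run("a\nb", "a.b", [{capture, none}]).
%% With dotall it matches any character.
match   = re:run("a\nb", "a.b", [dotall, {capture, none}]).
%% \N ignores dotall and still refuses the line feed.
nomatch = re:run("a\nb", "a\\Nb", [dotall, {capture, none}]).
%% CRLF is two characters, so it takes two dots.
match   = re:run("a\r\nb", "a..b", [dotall, {capture, none}]).
```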
2574 Outside a character class, the escape sequence \C matches any data
2575 unit, regardless of whether a UTF mode is set. One data unit is one byte.
2576 Unlike a dot, \C always matches line-ending characters. The feature is
2577 provided in Perl to match individual bytes in UTF-8 mode, but it is un‐
2578 clear how it can usefully be used. As \C breaks up characters into in‐
2579 dividual data units, matching one unit with \C in a UTF mode means that
2580 the remaining string can start with a malformed UTF character. This has
2581 undefined results, as PCRE assumes that it deals with valid UTF
2582 strings.
2583
2584 PCRE does not allow \C to appear in lookbehind assertions (described
2585 below) in a UTF mode, as this would make it impossible to calculate the
2586 length of the lookbehind.
2587
2588 The \C escape sequence is best avoided. However, one way of using it
2589 that avoids the problem of malformed UTF characters is to use a look‐
2590 ahead to check the length of the next character, as in the following
2591 pattern, which can be used with a UTF-8 string (ignore whitespace and
2592 line breaks):
2593
2594 (?| (?=[\x00-\x7f])(\C) |
2595 (?=[\x80-\x{7ff}])(\C)(\C) |
2596 (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
2597 (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
2598
2599 A group that starts with (?| resets the capturing parentheses numbers
2600 in each alternative (see section Duplicate Subpattern Numbers). The as‐
2601 sertions at the start of each branch check the next UTF-8 character for
2602 values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The indi‐
2603 vidual bytes of the character are then captured by the appropriate num‐
2604 ber of groups.
2605
2607 An opening square bracket introduces a character class, terminated by a
2608 closing square bracket. A closing square bracket on its own is not spe‐
2609 cial by default. However, if option PCRE_JAVASCRIPT_COMPAT is set, a
2610 lone closing square bracket causes a compile-time error. If a closing
2611 square bracket is required as a member of the class, it must be the
2612 first data character in the class (after an initial circumflex, if
2613 present) or escaped with a backslash.
2614
2615 A character class matches a single character in the subject. In a UTF
2616 mode, the character can be more than one data unit long. A matched
2617 character must be in the set of characters defined by the class, unless
2618 the first character in the class definition is a circumflex, in which
2619 case the subject character must not be in the set defined by the class.
2620 If a circumflex is required as a member of the class, ensure that it is
2621 not the first character, or escape it with a backslash.
2622
2623 For example, the character class [aeiou] matches any lowercase vowel,
2624 while [^aeiou] matches any character that is not a lowercase vowel. No‐
2625 tice that a circumflex is just a convenient notation for specifying the
2626 characters that are in the class by enumerating those that are not. A
2627 class that starts with a circumflex is not an assertion; it still con‐
2628 sumes a character from the subject string, and therefore it fails if
2629 the current pointer is at the end of the string.
2630
2631 In UTF-8 mode, characters with values > 255 can be included in a class
2632 as a literal string of data units, or by using the \x{ escaping
2633 mechanism.
2634
2635 When caseless matching is set, any letters in a class represent both
2636 their uppercase and lowercase versions. For example, a caseless [aeiou]
2637 matches "A" and "a", and a caseless [^aeiou] does not match "A", but a
2638 caseful version would. In a UTF mode, PCRE always understands the con‐
2639 cept of case for characters whose values are < 256, so caseless match‐
2640 ing is always possible. For characters with higher values, the concept
2641 of case is supported only if PCRE is compiled with Unicode property
2642 support. If you want to use caseless matching in a UTF mode for
2643 characters >= 256, ensure that PCRE is compiled with Unicode property
2644 support and with UTF support.
2645
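The basic class behavior, including negation and caseless matching, can be sketched as follows (the subjects are illustrative):

```erlang
%% First vowel and first non-vowel in "hello".
{match, ["e"]} = re:run("hello", "[aeiou]", [{capture, first, list}]).
{match, ["h"]} = re:run("hello", "[^aeiou]", [{capture, first, list}]).
%% A negated class still consumes a character, so it fails on "".
nomatch = re:run("", "[^aeiou]", [{capture, none}]).
%% Caseless: "A" counts as a vowel in both the plain and the negated class.
match   = re:run("A", "[aeiou]", [caseless, {capture, none}]).
nomatch = re:run("A", "[^aeiou]", [caseless, {capture, none}]).
match   = re:run("A", "[^aeiou]", [{capture, none}]).
```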
2646 Characters that can indicate line breaks are never treated in any
2647 special way when matching character classes, whatever line-ending
2648 sequence is in use, and whatever setting of options dotall and multiline
2649 is used. A class such as [^a] always matches one of these
2650 characters.
2651
2652 The minus (hyphen) character can be used to specify a range of charac‐
2653 ters in a character class. For example, [d-m] matches any letter be‐
2654 tween d and m, inclusive. If a minus character is required in a class,
2655 it must be escaped with a backslash or appear in a position where it
2656 cannot be interpreted as indicating a range, typically as the first or
2657 last character in the class, or immediately after a range. For example,
2658 [b-d-z] matches letters in the range b to d, a hyphen character, or z.
2659
2660 The literal character "]" cannot be the end character of a range. A
2661 pattern such as [W-]46] is interpreted as a class of two characters
2662 ("W" and "-") followed by a literal string "46]", so it would match
2663 "W46]" or "-46]". However, if "]" is escaped with a backslash, it is
2664 interpreted as the end of range, so [W-\]46] is interpreted as a class
2665 containing a range followed by two other characters. The octal or hexa‐
2666 decimal representation of "]" can also be used to end a range.
2667
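The hyphen and closing-bracket rules are easy to get wrong, so here is a small sketch that exercises the examples from the text. The subjects are chosen for illustration.

```erlang
%% [b-d-z]: range b to d, a literal hyphen, or z.
{match, ["-"]} = re:run("a-", "[b-d-z]", [{capture, first, list}]).
%% [W-]46]: the class [W-] followed by the literal string "46]".
{match, ["W46]"]} = re:run("W46]", "[W-]46]", [{capture, first, list}]).
{match, ["-46]"]} = re:run("-46]", "[W-]46]", [{capture, first, list}]).
%% [W-\]46]: one class holding the range W to ] plus "4" and "6".
{match, ["Z"]} = re:run("Z", "[W-\\]46]", [{capture, first, list}]).
nomatch = re:run("-", "[W-\\]46]", [{capture, none}]).
```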
2668 An error is generated if a POSIX character class (see below) or an es‐
2669 cape sequence other than one that defines a single character appears at
2670 a point where a range ending character is expected. For example,
2671 [z-\xff] is valid, but [A-\d] and [A-[:digit:]] are not.
2672
2673 Ranges operate in the collating sequence of character values. They can
2674 also be used for characters specified numerically, for example,
2675 [\000-\037]. Ranges can include any characters that are valid for the
2676 current mode.
2677
2678 If a range that includes letters is used when caseless matching is set,
2679 it matches the letters in either case. For example, [W-c] is equivalent
2680 to [][\\^_`wxyzabc], matched caselessly. In a non-UTF mode, if charac‐
2681 ter tables for a French locale are in use, [\xc8-\xcb] matches accented
2682 E characters in both cases. In UTF modes, PCRE supports the concept of
2683 case for characters with values > 255 only when it is compiled with
2684 Unicode property support.
2685
2686 The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
2687 \w, and \W can appear in a character class, and add the characters that
2688 they match to the class. For example, [\dABCDEF] matches any hexadeci‐
2689 mal digit. In UTF modes, option ucp affects the meanings of \d, \s, \w
2690 and their uppercase partners, just as it does when they appear outside
2691 a character class, as described in section Generic Character Types ear‐
2692 lier. The escape sequence \b has a different meaning inside a character
2693 class; it matches the backspace character. The sequences \B, \N, \R,
2694 and \X are not special inside a character class. Like any other unrec‐
2695 ognized escape sequences, they are treated as the literal characters
2696 "B", "N", "R", and "X".
2697
2698 A circumflex can conveniently be used with the uppercase character
2699 types to specify a more restricted set of characters than the matching
2700 lowercase type. For example, class [^\W_] matches any letter or digit,
2701 but not underscore, while [\w] includes underscore. A positive charac‐
2702 ter class is to be read as "something OR something OR ..." and a nega‐
2703 tive class as "NOT something AND NOT something AND NOT ...".
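
As an illustration of escape sequences inside a class, the sketch below tests single characters against three of the classes discussed above. The module and function names are hypothetical; only re:run/3 is part of the library.

```erlang
-module(class_escapes).
-export([is_hex_digit/1, is_alnum/1, is_backspace/1]).

%% Match a one-character subject against an anchored class.
one_char(Char, Pattern) ->
    re:run([Char], Pattern, [{capture, none}]) =:= match.

%% \d adds the decimal digits to the class; A-F are listed explicitly.
is_hex_digit(Char) -> one_char(Char, "^[\\dABCDEF]$").

%% Negating \W and underscore leaves letters and digits.
is_alnum(Char) -> one_char(Char, "^[^\\W_]$").

%% Inside a class, \b is the backspace character (code 8).
is_backspace(Char) -> one_char(Char, "^[\\b]$").
```

For example, is_hex_digit($c) is false because the class lists only the uppercase letters and matching is caseful by default.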
2704
2705 Only the following metacharacters are recognized in character classes:
2706
2707 * Backslash
2708
2709 * Hyphen (only where it can be interpreted as specifying a range)
2710
2711 * Circumflex (only at the start)
2712
2713 * Opening square bracket (only when it can be interpreted as intro‐
2714 ducing a Posix class name, or for a special compatibility feature;
2715 see the next two sections)
2716
2717 * Terminating closing square bracket
2718
2719 However, escaping other non-alphanumeric characters does no harm.
2720
2722 Perl supports the Posix notation for character classes. This uses names
2723 enclosed by [: and :] within the enclosing square brackets. PCRE also
2724 supports this notation. For example, the following matches "0", "1",
2725 any alphabetic character, or "%":
2726
2727 [01[:alpha:]%]
2728
2729 The following are the supported class names:
2730
2731 alnum:
2732 Letters and digits
2733
2734 alpha:
2735 Letters
2736
2737 ascii:
2738 Character codes 0-127
2739
2740 blank:
2741 Space or tab only
2742
2743 cntrl:
2744 Control characters
2745
2746 digit:
2747 Decimal digits (same as \d)
2748
2749 graph:
2750 Printing characters, excluding space
2751
2752 lower:
2753 Lowercase letters
2754
2755 print:
2756 Printing characters, including space
2757
2758 punct:
2759 Printing characters, excluding letters, digits, and space
2760
2761 space:
2762 Whitespace (the same as \s from PCRE 8.34)
2763
2764 upper:
2765 Uppercase letters
2766
2767 word:
2768 "Word" characters (same as \w)
2769
2770 xdigit:
2771 Hexadecimal digits
2772
2773 The default "space" characters are HT (9), LF (10), VT (11), FF (12),
2774 CR (13), and space (32). If locale-specific matching is taking place,
2775 the list of space characters may be different; there may be fewer or
2776 more of them. "Space" used to be different to \s, which did not include
2777 VT, for Perl compatibility. However, Perl changed at release 5.18, and
2778 PCRE followed at release 8.34. "Space" and \s now match the same set of
2779 characters.
2780
2781 The name "word" is a Perl extension, and "blank" is a GNU extension
2782 from Perl 5.8. Another Perl extension is negation, which is indicated
2783 by a ^ character after the colon. For example, the following matches
2784 "1", "2", or any non-digit:
2785
2786 [12[:^digit:]]
2787
2788 PCRE (and Perl) also recognize the Posix syntax [.ch.] and [=ch=] where
2789 "ch" is a "collating element", but these are not supported, and an er‐
2790 ror is given if they are encountered.
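
The two example classes above, and the rejection of collating elements, can be tried as follows. This is a sketch under the assumption that default (non-UCP) matching is in use; the module name posix_classes and its functions are invented for the example.

```erlang
-module(posix_classes).
-export([classify/1, collating_element_accepted/0]).

matches(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]) =:= match.

%% Returns {InFirstClass, InSecondClass} for a single character.
classify(Char) ->
    {matches([Char], "^[01[:alpha:]%]$"),   % "0", "1", a letter, or "%"
     matches([Char], "^[12[:^digit:]]$")}.  % "1", "2", or any non-digit

%% [.ch.] is recognized as POSIX syntax but not supported,
%% so compilation is expected to fail.
collating_element_accepted() ->
    case re:compile("[[.ch.]]") of
        {ok, _MP} -> true;
        {error, _ErrSpec} -> false
    end.
```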
2791
2792 By default, characters with values > 255 do not match any of the Posix
2793 character classes. However, if option ucp is set, some of the classes
2794 are changed so that Unicode character properties are used. This is
2795 achieved by replacing certain Posix classes by other sequences, as
2796 follows:
2797
2798 [:alnum:]:
2799 Becomes \p{Xan}
2800
2801 [:alpha:]:
2802 Becomes \p{L}
2803
2804 [:blank:]:
2805 Becomes \h
2806
2807 [:digit:]:
2808 Becomes \p{Nd}
2809
2810 [:lower:]:
2811 Becomes \p{Ll}
2812
2813 [:space:]:
2814 Becomes \p{Xps}
2815
2816 [:upper:]:
2817 Becomes \p{Lu}
2818
2819 [:word:]:
2820 Becomes \p{Xwd}
2821
2822 Negated versions, such as [:^alpha:], use \P instead of \p. Three other
2823 POSIX classes are handled specially in UCP mode:
2824
2825 [:graph:]:
2826 This matches characters that have glyphs that mark the page when
2827 printed. In Unicode property terms, it matches all characters with
2828 the L, M, N, P, S, or Cf properties, except for:
2829
2830 U+061C:
2831 Arabic Letter Mark
2832
2833 U+180E:
2834 Mongolian Vowel Separator
2835
2836 U+2066 - U+2069:
2837 Various "isolate"s
2838
2839 [:print:]:
2840 This matches the same characters as [:graph:] plus space characters
2841 that are not controls, that is, characters with the Zs property.
2842
2843 [:punct:]:
2844 This matches all characters that have the Unicode P (punctuation)
2845 property, plus those characters whose code points are less than 128
2846 that have the S (Symbol) property.
2847
2848 The other POSIX classes are unchanged, and match only characters with
2849 code points less than 128.
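
The effect of option ucp on a POSIX class can be observed with a character above 255. The sketch below uses U+03B1 (Greek small letter alpha) and assumes that the regular expression library in the running emulator was built with Unicode property support; the module name ucp_classes is hypothetical.

```erlang
-module(ucp_classes).
-export([alpha/2]).

%% True if the single code point matches [[:alpha:]] under Options.
alpha(CodePoint, Options) ->
    re:run([CodePoint], "[[:alpha:]]", [{capture, none} | Options]) =:= match.
```

With [unicode] alone the call alpha(16#3B1, [unicode]) is expected to return false, while adding ucp turns [:alpha:] into \p{L} and the same character matches.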
2850
2851 Compatibility Feature for Word Boundaries
2852
2853 In the POSIX.2 compliant library that was included in 4.4BSD Unix, the
2854 ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word"
2855 and "end of word". PCRE treats these items as follows:
2856
2857 [[:<:]]:
2858 is converted to \b(?=\w)
2859
2860 [[:>:]]:
2861 is converted to \b(?<=\w)
2862
2863 Only these exact character sequences are recognized. A sequence such as
2864 [a[:<:]b] provokes an error for an unrecognized POSIX class name. This
2865 support is not compatible with Perl. It is provided to help migrations
2866 from other environments, and is best not used in any new patterns. Note
2867 that \b matches at the start and the end of a word (see "Simple asser‐
2868 tions" above), and in a Perl-style pattern the preceding or following
2869 character normally shows which is wanted, without the need for the as‐
2870 sertions that are used above in order to give exactly the POSIX behav‐
2871 iour.
2872
2874 Vertical bar characters are used to separate alternative patterns. For
2875 example, the following pattern matches either "gilbert" or "sullivan":
2876
2877 gilbert|sullivan
2878
2879 Any number of alternatives can appear, and an empty alternative is per‐
2880 mitted (matching the empty string). The matching process tries each al‐
2881 ternative in turn, from left to right, and the first that succeeds is
2882 used. If the alternatives are within a subpattern (defined in section
2883 Subpatterns), "succeeds" means matching the remaining main pattern and
2884 the alternative in the subpattern.
2885
2887 The settings of the Perl-compatible options caseless, multiline,
2888 dotall, and extended can be changed from within the pattern by a se‐
2889 quence of Perl option letters enclosed between "(?" and ")". The option
2890 letters are as follows:
2891
2892 i:
2893 For caseless
2894
2895 m:
2896 For multiline
2897
2898 s:
2899 For dotall
2900
2901 x:
2902 For extended
2903
2904 For example, (?im) sets caseless, multiline matching. These options can
2905 also be unset by preceding the letter with a hyphen. A combined setting
2906 and unsetting such as (?im-sx), which sets caseless and multiline,
2907 while unsetting dotall and extended, is also permitted. If a letter ap‐
2908 pears both before and after the hyphen, the option is unset.
2909
2910 The PCRE-specific options dupnames, ungreedy, and extra can be changed
2911 in the same way as the Perl-compatible options by using the characters
2912 J, U, and X respectively.
2913
2914 When one of these option changes occurs at top-level (that is, not in‐
2915 side subpattern parentheses), the change applies to the remainder of
2916 the pattern that follows.
2917
2918 An option change within a subpattern (see section Subpatterns) affects
2919 only that part of the subpattern that follows it. So, the following
2920 matches abc and aBc and no other strings (assuming caseless is not
2921 used):
2922
2923 (a(?i)b)c
2924
2925 By this means, options can be made to have different settings in dif‐
2926 ferent parts of the pattern. Any changes made in one alternative do
2927 carry on into subsequent branches within the same subpattern. For exam‐
2928 ple:
2929
2930 (a(?i)b|c)
2931
2932 matches "ab", "aB", "c", and "C", although when matching "C" the first
2933 branch is abandoned before the option setting. This is because the ef‐
2934 fects of option settings occur at compile time. There would be some
2935 weird behavior otherwise.
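
The scope of an option change inside a subpattern can be checked with the two patterns above. This is an illustrative sketch; the module name inline_options and the helper full/2 are not part of the library.

```erlang
-module(inline_options).
-export([scoped/0, branches/0]).

%% True if Pattern matches the whole of Subject.
full(Subject, Pattern) ->
    re:run(Subject, ["^(?:", Pattern, ")$"], [{capture, none}]) =:= match.

%% (?i) applies only to the rest of the enclosing subpattern,
%% so "b" is caseless while "a" and "c" stay caseful.
scoped() ->
    [{S, full(S, "(a(?i)b)c")} || S <- ["abc", "aBc", "Abc", "abC"]].

%% The setting made in the first branch carries into the second one.
branches() ->
    [{S, full(S, "(a(?i)b|c)")} || S <- ["ab", "aB", "c", "C", "Ab"]].
```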
2936
2937 Note:
2938 Other PCRE-specific options can be set by the application when the com‐
2939 piling or matching functions are called. Sometimes the pattern can con‐
2940 tain special leading sequences, such as (*CRLF), to override what the
2941 application has set or what has been defaulted. Details are provided in
2942 section Newline Sequences earlier.
2943
2944 The (*UTF8) and (*UCP) leading sequences can be used to set UTF and
2945 Unicode property modes. They are equivalent to setting options unicode
2946 and ucp, respectively. The (*UTF) sequence is a generic version that
2947 can be used with any of the libraries. However, the application can set
2948 option never_utf, which locks out the use of the (*UTF) sequences.
2949
2950
2952 Subpatterns are delimited by parentheses (round brackets), which can be
2953 nested. Turning part of a pattern into a subpattern does two things:
2954
2955 1.:
2956 It localizes a set of alternatives. For example, the following pat‐
2957 tern matches "cataract", "caterpillar", or "cat":
2958
2959 cat(aract|erpillar|)
2960
2961 Without the parentheses, it would match "cataract", "erpillar", or
2962 an empty string.
2963
2964 2.:
2965 It sets up the subpattern as a capturing subpattern. That is, when
2966 the complete pattern matches, that portion of the subject string
2967 that matched the subpattern is passed back to the caller through
2968 the return value of run/3.
2969
2970 Opening parentheses are counted from left to right (starting from 1) to
2971 obtain numbers for the capturing subpatterns. For example, if the
2972 string "the red king" is matched against the following pattern, the
2973 captured substrings are "red king", "red", and "king", and are numbered
2974 1, 2, and 3, respectively:
2975
2976 the ((red|white) (king|queen))
2977
2978 It is not always helpful that plain parentheses fulfill two functions.
2979 Often a grouping subpattern is required without a capturing require‐
2980 ment. If an opening parenthesis is followed by a question mark and a
2981 colon, the subpattern does not do any capturing, and is not counted
2982 when computing the number of any subsequent capturing subpatterns. For
2983 example, if the string "the white queen" is matched against the follow‐
2984 ing pattern, the captured substrings are "white queen" and "queen", and
2985 are numbered 1 and 2:
2986
2987 the ((?:red|white) (king|queen))
2988
2989 The maximum number of capturing subpatterns is 65535.
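
The numbering of capturing and non-capturing subpatterns can be seen by asking run/3 for all groups except the complete match. A minimal sketch follows; the module name capture_groups and the function groups/2 are made up for the example.

```erlang
-module(capture_groups).
-export([groups/2]).

%% Returns the captured substrings of groups 1..N as strings.
groups(Subject, Pattern) ->
    case re:run(Subject, Pattern, [{capture, all_but_first, list}]) of
        {match, Captured} -> Captured;
        nomatch -> nomatch
    end.
```

Calling groups("the red king", "the ((red|white) (king|queen))") yields three substrings, while the pattern with (?:red|white) yields only two for "the white queen".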
2990
2991 As a convenient shorthand, if any option settings are required at the
2992 start of a non-capturing subpattern, the option letters can appear be‐
2993 tween "?" and ":". Thus, the following two patterns match the same set
2994 of strings:
2995
2996 (?i:saturday|sunday)
2997 (?:(?i)saturday|sunday)
2998
2999 As alternative branches are tried from left to right, and options are
3000 not reset until the end of the subpattern is reached, an option setting
3001 in one branch does affect subsequent branches, so the above patterns
3002 match both "SUNDAY" and "Saturday".
3003
3005 Perl 5.10 introduced a feature where each alternative in a subpattern
3006 uses the same numbers for its capturing parentheses. Such a subpattern
3007 starts with (?| and is itself a non-capturing subpattern. For example,
3008 consider the following pattern:
3009
3010 (?|(Sat)ur|(Sun))day
3011
3012 As the two alternatives are inside a (?| group, both sets of capturing
3013 parentheses are numbered one. Thus, when the pattern matches, you can
3014 look at captured substring number one, whichever alternative matched.
3015 This construct is useful when you want to capture a part, but not all,
3016 of one of many alternatives. Inside a (?| group, parentheses are num‐
3017 bered as usual, but the number is reset at the start of each branch.
3018 The numbers of any capturing parentheses that follow the subpattern
3019 start after the highest number used in any branch. The following exam‐
3020 ple is from the Perl documentation; the numbers underneath show in
3021 which buffer the captured content is stored:
3022
3023 # before  ---------------branch-reset----------- after
3024 / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
3025 # 1            2         2  3        2     3     4
3026
3027 A back reference to a numbered subpattern uses the most recent value
3028 that is set for that number by any subpattern. The following pattern
3029 matches "abcabc" or "defdef":
3030
3031 /(?|(abc)|(def))\1/
3032
3033 In contrast, a subroutine call to a numbered subpattern always refers
3034 to the first one in the pattern with the given number. The following
3035 pattern matches "abcabc" or "defabc":
3036
3037 /(?|(abc)|(def))(?1)/
3038
3039 If a condition test for a subpattern having matched refers to a non-
3040 unique number, the test is true if any of the subpatterns of that num‐
3041 ber have matched.
3042
3043 An alternative approach to using this "branch reset" feature is to use
3044 duplicate named subpatterns, as described in the next section.
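
The branch reset examples of this section can be exercised from Erlang as sketched below. The module name branch_reset and its functions are hypothetical; the patterns are those shown above, with the surrounding slashes removed and anchors added.

```erlang
-module(branch_reset).
-export([stem/1, backref/1, subroutine/1]).

full(Subject, Pattern) ->
    re:run(Subject, ["^(?:", Pattern, ")$"], [{capture, none}]) =:= match.

%% Both alternatives store their capture in group 1.
stem(Subject) ->
    case re:run(Subject, "^(?|(Sat)ur|(Sun))day$", [{capture, [1], list}]) of
        {match, [Stem]} -> Stem;
        nomatch -> nomatch
    end.

%% \1 repeats whatever group 1 most recently captured.
backref(Subject) -> full(Subject, "(?|(abc)|(def))\\1").

%% (?1) always calls the first group numbered 1, that is, (abc).
subroutine(Subject) -> full(Subject, "(?|(abc)|(def))(?1)").
```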
3045
3047 Identifying capturing parentheses by number is simple, but it can be
3048 hard to keep track of the numbers in complicated regular expressions.
3049 Also, if an expression is modified, the numbers can change. To help
3050 with this difficulty, PCRE supports the naming of subpatterns. This
3051 feature was not added to Perl until release 5.10. Python had the fea‐
3052 ture earlier, and PCRE introduced it at release 4.0, using the Python
3053 syntax. PCRE now supports both the Perl and the Python syntax. Perl al‐
3054 lows identically numbered subpatterns to have different names, but PCRE
3055 does not.
3056
3057 In PCRE, a subpattern can be named in one of three ways: (?<name>...)
3058 or (?'name'...) as in Perl, or (?P<name>...) as in Python. References
3059 to capturing parentheses from other parts of the pattern, such as back
3060 references, recursion, and conditions, can be made by name and by num‐
3061 ber.
3062
3063 Names consist of up to 32 alphanumeric characters and underscores, but
3064 must start with a non-digit. Named capturing parentheses are still al‐
3065 located numbers as well as names, exactly as if the names were not
3066 present. The capture specification to run/3 can use named values if
3067 they are present in the regular expression.
3068
3069 By default, a name must be unique within a pattern, but this constraint
3070 can be relaxed by setting option dupnames at compile time. (Duplicate
3071 names are also always permitted for subpatterns with the same number,
3072 set up as described in the previous section.) Duplicate names can be
3073 useful for patterns where only one instance of the named parentheses
3074 can match. Suppose that you want to match the name of a weekday, either
3075 as a 3-letter abbreviation or as the full name, and in both cases you
3076 want to extract the abbreviation. The following pattern (ignoring the
3077 line breaks) does the job:
3078
3079 (?<DN>Mon|Fri|Sun)(?:day)?|
3080 (?<DN>Tue)(?:sday)?|
3081 (?<DN>Wed)(?:nesday)?|
3082 (?<DN>Thu)(?:rsday)?|
3083 (?<DN>Sat)(?:urday)?
3084
3085 There are five capturing substrings, but only one is ever set after a
3086 match. (An alternative way of solving this problem is to use a "branch
3087 reset" subpattern, as described in the previous section.)
3088
3089 For capturing named subpatterns whose names are not unique, the first
3090 matching occurrence (counted from left to right in the subject) is re‐
3091 turned from run/3, if the name is specified in the values part of the
3092 capture statement. The all_names capturing value matches all the names
3093 in the same way.
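
Named subpatterns can be requested by name in the capture specification of run/3. The sketch below shows a pattern with unique names and the weekday pattern above compiled with option dupnames. The module name named_groups, the function names, and the group names year and month are invented for the example.

```erlang
-module(named_groups).
-export([year_month/1, day_abbrev/1]).

%% Unique names: ask for both groups by name.
year_month(Subject) ->
    case re:run(Subject, "(?<year>\\d{4})-(?<month>\\d{2})",
                [{capture, [year, month], list}]) of
        {match, [Year, Month]} -> {Year, Month};
        nomatch -> nomatch
    end.

%% Duplicate names: only one DN group takes part in any match.
day_abbrev(Subject) ->
    Pattern = "(?<DN>Mon|Fri|Sun)(?:day)?|"
              "(?<DN>Tue)(?:sday)?|"
              "(?<DN>Wed)(?:nesday)?|"
              "(?<DN>Thu)(?:rsday)?|"
              "(?<DN>Sat)(?:urday)?",
    case re:run(Subject, Pattern, [dupnames, {capture, ['DN'], list}]) of
        {match, [Abbrev]} -> Abbrev;
        nomatch -> nomatch
    end.
```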
3094
3095 Note:
3096 You cannot use different names to distinguish between two subpatterns
3097 with the same number, as PCRE uses only the numbers when matching. For
3098 this reason, an error is given at compile time if different names are
3099 specified to subpatterns with the same number. However, you can specify
3100 the same name to subpatterns with the same number, even when dupnames
3101 is not set.
3102
3103
3105 Repetition is specified by quantifiers, which can follow any of the
3106 following items:
3107
3108 * A literal data character
3109
3110 * The dot metacharacter
3111
3112 * The \C escape sequence
3113
3114 * The \X escape sequence
3115
3116 * The \R escape sequence
3117
3118 * An escape such as \d or \pL that matches a single character
3119
3120 * A character class
3121
3122 * A back reference (see the next section)
3123
3124 * A parenthesized subpattern (including assertions)
3125
3126 * A subroutine call to a subpattern (recursive or otherwise)
3127
3128 The general repetition quantifier specifies a minimum and maximum num‐
3129 ber of permitted matches, by giving the two numbers in curly brackets
3130 (braces), separated by a comma. The numbers must be < 65536, and the
3131 first must be less than or equal to the second. For example, the fol‐
3132 lowing matches "zz", "zzz", or "zzzz":
3133
3134 z{2,4}
3135
3136 A closing brace on its own is not a special character. If the second
3137 number is omitted, but the comma is present, there is no upper limit.
3138 If the second number and the comma are both omitted, the quantifier
3139 specifies an exact number of required matches. Thus, the following
3140 matches at least three successive vowels, but can match many more:
3141
3142 [aeiou]{3,}
3143
3144 The following matches exactly eight digits:
3145
3146 \d{8}
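
The three quantifier examples above behave as follows when tried with run/3. This is a sketch; the module name quantifiers and the function first_match/2 are hypothetical, and the subjects are chosen only for illustration.

```erlang
-module(quantifiers).
-export([first_match/2]).

%% Returns the text of the first match as a string, or nomatch.
first_match(Subject, Pattern) ->
    case re:run(Subject, Pattern, [{capture, first, list}]) of
        {match, [Matched]} -> Matched;
        nomatch -> nomatch
    end.
```

For example, first_match("zzzzz", "z{2,4}") returns "zzzz", because the quantifier is greedy but stops at its upper limit of four.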
3147
3148 An opening curly bracket that appears in a position where a quantifier
3149 is not allowed, or one that does not match the syntax of a quantifier,
3150 is taken as a literal character. For example, {,6} is not a quantifier,
3151 but a literal string of four characters.
3152
3153 In Unicode mode, quantifiers apply to characters rather than to indi‐
3154 vidual data units. Thus, for example, \x{100}{2} matches two charac‐
3155 ters, each of which is represented by a 2-byte sequence in a UTF-8
3156 string. Similarly, \X{3} matches three Unicode extended grapheme clus‐
3157 ters, each of which can be many data units long (and they can be of
3158 different lengths).
3159
3160 The quantifier {0} is permitted, causing the expression to behave as if
3161 the previous item and the quantifier were not present. This can be use‐
3162 ful for subpatterns that are referenced as subroutines from elsewhere
3163 in the pattern (but see also section Defining Subpatterns for Use by
3164 Reference Only). Items other than subpatterns that have a {0} quanti‐
3165 fier are omitted from the compiled pattern.
3166
3167 For convenience, the three most common quantifiers have single-charac‐
3168 ter abbreviations:
3169
3170 *:
3171 Equivalent to {0,}
3172
3173 +:
3174 Equivalent to {1,}
3175
3176 ?:
3177 Equivalent to {0,1}
3178
3179 Infinite loops can be constructed by following a subpattern that can
3180 match no characters with a quantifier that has no upper limit, for ex‐
3181 ample:
3182
3183 (a?)*
3184
3185 Earlier versions of Perl and PCRE used to give an error at compile time
3186 for such patterns. As there are cases where this can be useful, such
3187 patterns are now accepted, but if any repetition of the subpattern
3188 matches no characters, the loop is forcibly broken.
3189
3190 By default, the quantifiers are "greedy", that is, they match as much
3191 as possible (up to the maximum number of permitted times), without
3192 causing the remaining pattern to fail. The classic example of where
3193 this gives problems is in trying to match comments in C programs. These
3194 appear between /* and */. Within the comment, individual * and / char‐
3195 acters can appear. An attempt to match C comments by applying the pat‐
3196 tern
3197
3198 /\*.*\*/
3199
3200 to the string
3201
3202 /* first comment */ not comment /* second comment */
3203
3204 fails, as it matches the entire string owing to the greediness of the
3205 .* item.
3206
3207 However, if a quantifier is followed by a question mark, it ceases to
3208 be greedy, and instead matches the minimum number of times possible, so
3209 the following pattern does the right thing with the C comments:
3210
3211 /\*.*?\*/
3212
3213 The meaning of the various quantifiers is not otherwise changed, only
3214 the preferred number of matches. Do not confuse this use of question
3215 mark with its use as a quantifier in its own right. As it has two uses,
3216 it can sometimes appear doubled, as in
3217
3218 \d??\d
3219
3220 which matches one digit by preference, but can match two if that is the
3221 only way the remaining pattern matches.
3222
3223 If option ungreedy is set (an option that is not available in Perl),
3224 the quantifiers are not greedy by default, but individual ones can be
3225 made greedy by following them with a question mark. That is, it inverts
3226 the default behavior.
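
The difference between greedy and minimizing quantifiers, and the effect of option ungreedy, can be seen with the C comment example above. The following is a minimal sketch; the module name greedy_lazy and its functions are made up for illustration.

```erlang
-module(greedy_lazy).
-export([subject/0, first_match/2]).

subject() ->
    "/* first comment */ not comment /* second comment */".

%% Returns the first match of Pattern in the example subject.
first_match(Pattern, Options) ->
    case re:run(subject(), Pattern, [{capture, first, list} | Options]) of
        {match, [Matched]} -> Matched;
        nomatch -> nomatch
    end.
```

With no options, "/\\*.*\\*/" returns the whole subject and "/\\*.*?\\*/" returns only the first comment. Passing [ungreedy] swaps the two results.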
3227
3228 When a parenthesized subpattern is quantified with a minimum repeat
3229 count that is > 1 or with a limited maximum, more memory is required
3230 for the compiled pattern, in proportion to the size of the minimum or
3231 maximum.
3232
3233 If a pattern starts with .* or .{0,} and option dotall (equivalent to
3234 Perl option /s) is set, thus allowing the dot to match newlines, the
3235 pattern is implicitly anchored, because whatever follows is tried
3236 against every character position in the subject string. So, there is no
3237 point in retrying the overall match at any position after the first.
3238 PCRE normally treats such a pattern as if it was preceded by \A.
3239
3240 In cases where it is known that the subject string contains no new‐
3241 lines, it is worth setting dotall to obtain this optimization, or al‐
3242 ternatively using ^ to indicate anchoring explicitly.
3243
3244 However, there are some cases where the optimization cannot be used.
3245 When .* is inside capturing parentheses that are the subject of a back
3246 reference elsewhere in the pattern, a match at the start can fail where
3247 a later one succeeds. Consider, for example:
3248
3249 (.*)abc\1
3250
3251 If the subject is "xyz123abc123", the match point is the fourth charac‐
3252 ter. Therefore, such a pattern is not implicitly anchored.
3253
3254 Another case where implicit anchoring is not applied is when the lead‐
3255 ing .* is inside an atomic group. Once again, a match at the start can
3256 fail where a later one succeeds. Consider the following pattern:
3257
3258 (?>.*?a)b
3259
3260 It matches "ab" in the subject "aab". The use of the backtracking con‐
3261 trol verbs (*PRUNE) and (*SKIP) also disables this optimization.
3262
3263 When a capturing subpattern is repeated, the value captured is the sub‐
3264 string that matched the final iteration. For example, after
3265
3266 (tweedle[dume]{3}\s*)+
3267
3268 has matched "tweedledum tweedledee", the value of the captured sub‐
3269 string is "tweedledee". However, if there are nested capturing subpat‐
3270 terns, the corresponding captured values can have been set in previous
3271 iterations. For example, after
3272
3273 /(a|(b))+/
3274
3275 matches "aba", the value of the second captured substring is "b".
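
The value kept for a repeated capturing subpattern can be inspected by requesting a single group number. This is a sketch with a hypothetical module name, repeated_capture, and a hypothetical helper, group/3.

```erlang
-module(repeated_capture).
-export([group/3]).

%% Returns the substring captured by group number N, or nomatch.
group(Subject, Pattern, N) ->
    case re:run(Subject, Pattern, [{capture, [N], list}]) of
        {match, [Value]} -> Value;
        nomatch -> nomatch
    end.
```

In the second example above, group 1 ends up holding the last iteration ("a"), while group 2 keeps the value set in the previous iteration ("b").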
3276
3278 With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
3279 repetition, failure of what follows normally causes the repeated item
3280 to be re-evaluated to see if a different number of repeats allows the
3281 remaining pattern to match. Sometimes it is useful to prevent this, ei‐
3282 ther to change the nature of the match, or to cause it to fail earlier
3283 than it otherwise might, when the author of the pattern knows that
3284 there is no point in carrying on.
3285
3286 Consider, for example, the pattern \d+foo when applied to the following
3287 subject line:
3288
3289 123456bar
3290
3291 After matching all six digits and then failing to match "foo", the nor‐
3292 mal action of the matcher is to try again with only five digits match‐
3293 ing item \d+, and then with four, and so on, before ultimately failing.
3294 "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
3295 the means for specifying that once a subpattern has matched, it is not
3296 to be re-evaluated in this way.
3297
3298 If atomic grouping is used for the previous example, the matcher gives
3299 up immediately on failing to match "foo" the first time. The notation
3300 is a kind of special parenthesis, starting with (?> as in the following
3301 example:
3302
3303 (?>\d+)foo
3304
3305 This kind of parenthesis "locks up" the part of the pattern it contains
3306 once it has matched, and a failure further into the pattern is pre‐
3307 vented from backtracking into it. Backtracking past it to previous
3308 items, however, works as normal.
3309
3310 An alternative description is that a subpattern of this type matches
3311 the string of characters that an identical standalone pattern would
3312 match, if anchored at the current point in the subject string.
3313
3314 Atomic grouping subpatterns are not capturing subpatterns. Simple cases
3315 such as the above example can be thought of as a maximizing repeat that
3316 must swallow everything it can. So, while both \d+ and \d+? are pre‐
3317 pared to adjust the number of digits they match to make the remaining
3318 pattern match, (?>\d+) can only match an entire sequence of digits.
3319
3320 Atomic groups in general can contain any complicated subpatterns, and
3321 can be nested. However, when the subpattern for an atomic group is just
3322 a single repeated item, as in the example above, a simpler notation,
3323 called a "possessive quantifier" can be used. This consists of an extra
3324 + character following a quantifier. Using this notation, the previous
3325 example can be rewritten as
3326
3327 \d++foo
3328
3329 Notice that a possessive quantifier can be used with an entire group,
3330 for example:
3331
3332 (abc|xyz){2,3}+
3333
3334 Possessive quantifiers are always greedy; the setting of option un‐
3335 greedy is ignored. They are a convenient notation for the simpler forms
3336 of an atomic group. There is no difference in meaning between a
3337 possessive quantifier and the equivalent atomic group, but there can be
3338 a performance difference; possessive quantifiers are probably slightly
3339 faster.
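
The change in matching behavior caused by an atomic group or a possessive quantifier shows up when the following item needs a character that the repeat has already consumed. The sketch below uses the subject "12345" and patterns ending in 5, chosen for illustration; the module name atomic_groups is hypothetical.

```erlang
-module(atomic_groups).
-export([matches/2]).

matches(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]) =:= match.
```

Here matches("12345", "\\d+5") is true, because \d+ gives back the last digit. Both matches("12345", "(?>\\d+)5") and matches("12345", "\\d++5") are false, because the digits cannot be given back once matched.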
3340
3341 The possessive quantifier syntax is an extension to the Perl 5.8 syn‐
3342 tax. Jeffrey Friedl originated the idea (and the name) in the first
3343 edition of his book. Mike McCloskey liked it, so implemented it when he
3344 built the Sun Java package, and PCRE copied it from there. It ulti‐
3345 mately found its way into Perl at release 5.10.
3346
3347 PCRE has an optimization that automatically "possessifies" certain sim‐
3348 ple pattern constructs. For example, the sequence A+B is treated as
3349 A++B, as there is no point in backtracking into a sequence of A's when
3350 B must follow.
3351
3352 When a pattern contains an unlimited repeat inside a subpattern that
3353 can itself be repeated an unlimited number of times, the use of an
3354 atomic group is the only way to avoid some failing matches taking a
3355 long time. The pattern
3356
3357 (\D+|<\d+>)*[!?]
3358
3359 matches an unlimited number of substrings that either consist of non-
3360 digits, or digits enclosed in <>, followed by ! or ?. When it matches,
3361 it runs quickly. However, if it is applied to
3362
3363 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
3364
3365 it takes a long time before reporting failure. This is because the
3366 string can be divided between the internal \D+ repeat and the external
3367 * repeat in many ways, and all must be tried. (The example uses [!?]
3368 rather than a single character at the end, as both PCRE and Perl have
3369 an optimization that allows for fast failure when a single character is
3370 used. They remember the last single character that is required for a
3371 match, and fail early if it is not present in the string.) If the pat‐
3372 tern is changed so that it uses an atomic group, like the following,
3373 sequences of non-digits cannot be broken, and failure happens quickly:
3374
3375 ((?>\D+)|<\d+>)*[!?]
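
The atomic version of the pattern can be tried on the failing subject without a long wait. The sketch below deliberately runs only the atomic form; the module name nested_repeat and the function quick_match/1 are invented for the example.

```erlang
-module(nested_repeat).
-export([quick_match/1, long_subject/0]).

%% 52 "a" characters and no "!" or "?", as in the example above.
long_subject() ->
    lists:duplicate(52, $a).

%% Sequences of non-digits are matched atomically, so a failing
%% subject is rejected without trying every way of splitting it.
quick_match(Subject) ->
    re:run(Subject, "((?>\\D+)|<\\d+>)*[!?]", [{capture, none}]) =:= match.
```

Running the non-atomic pattern on long_subject() is not recommended as a test. Depending on the match_limit setting of run/3, it can take a long time or be cut off before an answer is found.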
3376
3378 Outside a character class, a backslash followed by a digit > 0 (and
3379 possibly further digits) is a back reference to a capturing subpattern
3380 earlier (that is, to its left) in the pattern, provided there have been
3381 that many previous capturing left parentheses.
3382
3383 However, if the decimal number following the backslash is < 10, it is
3384 always taken as a back reference, and causes an error only if there are
3385 not that many capturing left parentheses in the entire pattern. That
3386 is, the parentheses that are referenced need not be to the left of
3387 the reference for numbers < 10. A "forward back reference" of this type
3388 can make sense when a repetition is involved and the subpattern to the
3389 right has participated in an earlier iteration.
3390
3391 It is not possible to have a numerical "forward back reference" to a
3392 subpattern whose number is 10 or more using this syntax, as a sequence
3393 such as \50 is interpreted as a character defined in octal. For more
3394 details of the handling of digits following a backslash, see section
3395 Non-Printing Characters earlier. There is no such problem when named
3396 parentheses are used. A back reference to any subpattern is possible
3397 using named parentheses (see below).
3398
3399 Another way to avoid the ambiguity inherent in the use of digits fol‐
3400 lowing a backslash is to use the \g escape sequence. This escape must
3401 be followed by an unsigned number or a negative number, optionally en‐
3402 closed in braces. The following examples are identical:
3403
3404 (ring), \1
3405 (ring), \g1
3406 (ring), \g{1}
3407
3408 An unsigned number specifies an absolute reference without the ambigu‐
3409 ity that is present in the older syntax. It is also useful when literal
3410 digits follow the reference. A negative number is a relative reference.
3411 Consider the following example:
3412
3413 (abc(def)ghi)\g{-1}
3414
3415 The sequence \g{-1} is a reference to the most recently started captur‐
3416 ing subpattern before \g, that is, it is equivalent to \2 in this exam‐
3417 ple. Similarly, \g{-2} would be equivalent to \1. The use of relative
3418 references can be helpful in long patterns, and also in patterns that
3419 are created by joining fragments containing references within them‐
3420 selves.
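
The following is a minimal sketch of the \g forms described above, assuming
the re module behaves as this page describes. The module name
re_g_backref_sketch and the subject strings are illustrative and not part
of the original text. Backslashes are doubled because the patterns are
written as Erlang string literals.

```erlang
-module(re_g_backref_sketch).
-export([demo/0]).

%% Each line is a pattern-match assertion; demo/0 returns ok only if
%% every re:run/3 call yields the expected captures.
demo() ->
    Opts = [{capture, all, list}],
    %% \1, \g1 and \g{1} all refer to capturing group 1.
    {match, ["ringring", "ring"]} = re:run("ringring", "(ring)\\1", Opts),
    {match, ["ringring", "ring"]} = re:run("ringring", "(ring)\\g1", Opts),
    {match, ["ringring", "ring"]} = re:run("ringring", "(ring)\\g{1}", Opts),
    %% \g{-1} is relative: the most recently started group, here group 2.
    {match, ["abcdefghidef", "abcdefghi", "def"]} =
        re:run("abcdefghidef", "(abc(def)ghi)\\g{-1}", Opts),
    %% Braces separate the reference from a literal digit that follows.
    {match, ["aa1", "a"]} = re:run("aa1", "(a)\\g{1}1", Opts),
    ok.
```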
3421
3422 A back reference matches whatever matched the capturing subpattern in
3423 the current subject string, rather than anything matching the subpat‐
3424 tern itself (section Subpatterns as Subroutines describes a way of doing
3425 that). So, the following pattern matches "sense and sensibility" and
3426 "response and responsibility", but not "sense and responsibility":
3427
3428 (sens|respons)e and \1ibility
3429
3430 If caseful matching is in force at the time of the back reference, the
3431 case of letters is relevant. For example, the following matches "rah
3432 rah" and "RAH RAH", but not "RAH rah", although the original capturing
3433 subpattern is matched caselessly:
3434
3435 ((?i)rah)\s+\1
3436
3437 There are many different ways of writing back references to named sub‐
3438 patterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
3439 \k'name' are supported, as is the Python syntax (?P=name). The unified
3440 back reference syntax in Perl 5.10, in which \g can be used for both
3441 numeric and named references, is also supported. The previous example
3442 can be rewritten in the following ways:
3443
3444 (?<p1>(?i)rah)\s+\k<p1>
3445 (?'p1'(?i)rah)\s+\k{p1}
3446 (?P<p1>(?i)rah)\s+(?P=p1)
3447 (?<p1>(?i)rah)\s+\g{p1}
3448
3449 A subpattern that is referenced by name can appear in the pattern be‐
3450 fore or after the reference.
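
As a hedged illustration, the numbered pattern and its four named rewrites
can be checked against the same three subjects. The module name
re_named_backref_sketch is hypothetical; the patterns are the ones listed
above, with backslashes doubled for Erlang string literals.

```erlang
-module(re_named_backref_sketch).
-export([demo/0]).

demo() ->
    Patterns = ["((?i)rah)\\s+\\1",
                "(?<p1>(?i)rah)\\s+\\k<p1>",
                "(?'p1'(?i)rah)\\s+\\k{p1}",
                "(?P<p1>(?i)rah)\\s+(?P=p1)",
                "(?<p1>(?i)rah)\\s+\\g{p1}"],
    lists:foreach(
      fun(P) ->
              %% The group is matched caselessly, but the back reference
              %% is caseful, so both halves must use the same case.
              {match, _} = re:run("rah rah", P),
              {match, _} = re:run("RAH RAH", P),
              nomatch = re:run("RAH rah", P)
      end, Patterns),
    ok.
```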
3451
3452 There can be more than one back reference to the same subpattern. If a
3453 subpattern has not been used in a particular match, any back reference
3454 to it always fails. For example, the following pattern always fails if
3455 it starts to match "a" rather than "bc":
3456
3457 (a|(bc))\2
3458
3459 As there can be many capturing parentheses in a pattern, all digits
3460 following the backslash are taken as part of a potential back reference
3461 number. If the pattern continues with a digit character, some delimiter
3462 must be used to terminate the back reference. If option extended is
3463 set, this can be whitespace. Otherwise an empty comment (see section
3464 Comments) can be used.
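
A small sketch of the two points above: a back reference to an unset group
fails, and a delimiter is needed before a literal digit. The module name
re_backref_delim_sketch and the subjects are assumptions made for
illustration.

```erlang
-module(re_backref_delim_sketch).
-export([demo/0]).

demo() ->
    %% Group 2 is unset whenever the "a" branch is taken, so \2 fails.
    nomatch = re:run("aa", "(a|(bc))\\2"),
    {match, [{0,4} | _]} = re:run("bcbc", "(a|(bc))\\2"),
    %% An empty comment terminates the back reference before the digit 1.
    {match, [{0,3}, {0,1}]} = re:run("aa1", "(a)\\1(?#)1"),
    %% With option extended, whitespace serves as the delimiter.
    {match, [{0,3}, {0,1}]} = re:run("aa1", "(a)\\1 1", [extended]),
    ok.
```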
3465
3466 Recursive Back References
3467
3468 A back reference that occurs inside the parentheses to which it refers
3469 fails when the subpattern is first used, so, for example, (a\1) never
3470 matches. However, such references can be useful inside repeated subpat‐
3471 terns. For example, the following pattern matches any number of "a"s
3472 and also "aba", "ababbaa", and so on:
3473
3474 (a|b\1)+
3475
3476 At each iteration of the subpattern, the back reference matches the
3477 character string corresponding to the previous iteration. In order for
3478 this to work, the pattern must be such that the first iteration does
3479 not need to match the back reference. This can be done using alterna‐
3480 tion, as in the example above, or by a quantifier with a minimum of
3481 zero.
3482
3483 Back references of this type cause the group that they reference to be
3484 treated as an atomic group. Once the whole group has been matched, a
3485 subsequent matching failure cannot cause backtracking into the middle
3486 of the group.
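
The behavior can be sketched as follows. The pattern is anchored here so
that the whole subject must be consumed; the anchors and the module name
re_recursive_backref_sketch are additions for illustration.

```erlang
-module(re_recursive_backref_sketch).
-export([demo/0]).

demo() ->
    Opts = [{capture, first, list}],
    P = "^(a|b\\1)+$",
    {match, ["aaa"]} = re:run("aaa", P, Opts),
    %% "a", then "b" followed by the previous iteration ("a").
    {match, ["aba"]} = re:run("aba", P, Opts),
    %% "a", "ba", "bba", "a": each b\1 repeats the previous iteration.
    {match, ["ababbaa"]} = re:run("ababbaa", P, Opts),
    %% (a\1) refers to itself on first use, so it never matches.
    nomatch = re:run("aa", "(a\\1)"),
    ok.
```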
3487
3489 An assertion is a test on the characters following or preceding the
3490 current matching point that does not consume any characters. The simple
3491 assertions coded as \b, \B, \A, \G, \Z, \z, ^, and $ are described in
3492 the previous sections.
3493
3494 More complicated assertions are coded as subpatterns. There are two
3495 kinds: those that look ahead of the current position in the subject
3496 string, and those that look behind it. An assertion subpattern is
3497 matched in the normal way, except that it does not cause the current
3498 matching position to be changed.
3499
3500 Assertion subpatterns are not capturing subpatterns. If such an asser‐
3501 tion contains capturing subpatterns within it, these are counted for
3502 the purposes of numbering the capturing subpatterns in the whole pat‐
3503 tern. However, substring capturing is done only for positive asser‐
3504 tions. (Perl sometimes, but not always, performs capturing in negative
3505 assertions.)
3506
3507 Warning:
3508 If a positive assertion containing one or more capturing subpatterns
3509 succeeds, but failure to match later in the pattern causes backtracking
3510 over this assertion, the captures within the assertion are reset only
3511 if no higher numbered captures are already set. This is, unfortunately,
3512 a fundamental limitation of the current implementation, and as PCRE1 is
3513 now in maintenance-only status, it is unlikely ever to change.
3514
3515
3516 For compatibility with Perl, assertion subpatterns can be repeated.
3517 Although it makes no sense to assert the same thing many times, the
3518 side effect of capturing parentheses can occasionally be useful. In
3519 practice, there are only three cases:
3520
3521 * If the quantifier is {0}, the assertion is never obeyed during
3522 matching. However, it can contain internal capturing parenthesized
3523 groups that are called from elsewhere through the subroutine mecha‐
3524 nism.
3525
3526 * If the quantifier is {0,n}, where n > 0, it is treated as if it was
3527 {0,1}. At runtime, the remaining pattern match is tried with and
3528 without the assertion; the order depends on the greediness of the
3529 quantifier.
3530
3531 * If the minimum repetition is > 0, the quantifier is ignored. The
3532 assertion is obeyed only once when encountered during matching.
3533
3534 Lookahead Assertions
3535
3536 Lookahead assertions start with (?= for positive assertions and (?! for
3537 negative assertions. For example, the following matches a word followed
3538 by a semicolon, but does not include the semicolon in the match:
3539
3540 \w+(?=;)
3541
3542 The following matches any occurrence of "foo" that is not followed by
3543 "bar":
3544
3545 foo(?!bar)
3546
3547 Notice that the apparently similar pattern
3548
3549 (?!foo)bar
3550
3551 does not find an occurrence of "bar" that is preceded by something
3552 other than "foo". It finds any occurrence of "bar" whatsoever, as the
3553 assertion (?!foo) is always true when the next three characters are
3554 "bar". A lookbehind assertion is needed to achieve the other effect.
3555
3556 If you want to force a matching failure at some point in a pattern, the
3557 most convenient way to do it is with (?!), as an empty string always
3558 matches. So, an assertion that requires that there not be an empty
3559 string must always fail. The backtracking control verb (*FAIL) or (*F)
3560 is a synonym for (?!).
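
A minimal sketch of the lookahead patterns above, with subject strings
chosen for illustration. The module name re_lookahead_sketch is
hypothetical.

```erlang
-module(re_lookahead_sketch).
-export([demo/0]).

demo() ->
    %% The semicolon is asserted but not included in the match.
    {match, ["done"]} =
        re:run("done; next", "\\w+(?=;)", [{capture, first, list}]),
    %% Only the second "foo" (offset 7) is not followed by "bar".
    {match, [{7,3}]} = re:run("foobar foobaz", "foo(?!bar)"),
    %% (?!foo)bar still finds "bar" preceded by "foo": the assertion
    %% looks ahead from offset 3, where the next characters are "bar".
    {match, [{3,3}]} = re:run("foobar", "(?!foo)bar"),
    %% (?!) and (*FAIL) force a failure at that point in the pattern.
    nomatch = re:run("anything", "(?!)"),
    nomatch = re:run("anything", "a(*FAIL)"),
    ok.
```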
3561
3562 Lookbehind Assertions
3563
3564 Lookbehind assertions start with (?<= for positive assertions and (?<!
3565 for negative assertions. For example, the following finds an occurrence
3566 of "bar" that is not preceded by "foo":
3567
3568 (?<!foo)bar
3569
3570 The contents of a lookbehind assertion are restricted such that all the
3571 strings it matches must have a fixed length. However, if there are many
3572 top-level alternatives, they do not all have to have the same fixed
3573 length. Thus, the following is permitted:
3574
3575 (?<=bullock|donkey)
3576
3577 The following causes an error at compile time:
3578
3579 (?<!dogs?|cats?)
3580
3581 Branches that match different length strings are permitted only at the
3582 top-level of a lookbehind assertion. This is an extension compared with
3583 Perl, which requires all branches to match the same length of string.
3584 An assertion such as the following is not permitted, as its single top-
3585 level branch can match two different lengths:
3586
3587 (?<=ab(c|de))
3588
3589 However, it is acceptable to PCRE if rewritten to use two top-level
3590 branches:
3591
3592 (?<=abc|abde)
3593
3594 Sometimes the escape sequence \K (see above) can be used instead of a
3595 lookbehind assertion to get round the fixed-length restriction.
3596
3597 The implementation of lookbehind assertions is, for each alternative,
3598 to move the current position back temporarily by the fixed length and
3599 then try to match. If there are insufficient characters before the cur‐
3600 rent position, the assertion fails.
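
The lookbehind rules can be sketched like this. The module name
re_lookbehind_sketch, the trailing literals added to the patterns, and the
subjects are illustrative. Because the fixed-length restriction belongs to
the PCRE version described here, variable_length_check/0 reports what the
running system does instead of asserting an error.

```erlang
-module(re_lookbehind_sketch).
-export([demo/0, variable_length_check/0]).

demo() ->
    nomatch = re:run("foobar", "(?<!foo)bar"),
    {match, [{3,3}]} = re:run("bazbar", "(?<!foo)bar"),
    %% Top-level alternatives may have different fixed lengths.
    {match, [{6,1}]} = re:run("donkeys", "(?<=bullock|donkey)s"),
    {match, [{7,1}]} = re:run("bullocks", "(?<=bullock|donkey)s"),
    {match, [{4,1}]} = re:run("abdex", "(?<=abc|abde)x"),
    %% Too few characters before the "s": the assertion fails.
    nomatch = re:run("onkeys", "(?<=donkey)s"),
    ok.

%% Returns rejected when the fixed-length rule described above is
%% enforced; a different PCRE version might return accepted.
variable_length_check() ->
    case re:compile("(?<!dogs?|cats?)") of
        {error, _} -> rejected;
        {ok, _} -> accepted
    end.
```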
3601
3602 In a UTF mode, PCRE does not allow the \C escape (which matches a sin‐
3603 gle data unit even in a UTF mode) to appear in lookbehind assertions,
3604 as it makes it impossible to calculate the length of the lookbehind.
3605 The \X and \R escapes, which can match different numbers of data units,
3606 are not permitted either.
3607
3608 "Subroutine" calls (see below), such as (?2) or (?&X), are permitted in
3609 lookbehinds, as long as the subpattern matches a fixed-length string.
3610 Recursion, however, is not supported.
3611
3612 Possessive quantifiers can be used with lookbehind assertions to spec‐
3613 ify efficient matching of fixed-length strings at the end of subject
3614 strings. Consider the following simple pattern when applied to a long
3615 string that does not match:
3616
3617 abcd$
3618
3619 As matching proceeds from left to right, PCRE looks for each "a" in the
3620 subject and then sees if what follows matches the remaining pattern. If
3621 the pattern is specified as
3622
3623 ^.*abcd$
3624
3625 the initial .* matches the entire string at first. However, when this
3626 fails (as there is no following "a"), it backtracks to match all but
3627 the last character, then all but the last two characters, and so on.
3628 Once again the search for "a" covers the entire string, from right to
3629 left, so we are no better off. However, if the pattern is written as
3630
3631 ^.*+(?<=abcd)
3632
3633 there can be no backtracking for the .*+ item; it can match only the
3634 entire string. The subsequent lookbehind assertion does a single test
3635 on the last four characters. If it fails, the match fails immediately.
3636 For long strings, this approach makes a significant difference to the
3637 processing time.
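
A short sketch of the possessive form, using small subjects only to show
which strings match; it does not measure the time difference. The module
name re_possessive_tail_sketch is hypothetical.

```erlang
-module(re_possessive_tail_sketch).
-export([demo/0]).

demo() ->
    P = "^.*+(?<=abcd)",
    %% .*+ consumes the whole subject; the lookbehind then checks the
    %% last four characters once.
    {match, [{0,8}]} = re:run("xxxxabcd", P),
    %% The subject ends in "bcde", so the single test fails and no
    %% backtracking into .*+ is possible.
    nomatch = re:run("xxxxabcde", P),
    ok.
```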
3638
3639 Using Multiple Assertions
3640
3641 Many assertions (of any sort) can occur in succession. For example, the
3642 following matches "foo" preceded by three digits that are not "999":
3643
3644 (?<=\d{3})(?<!999)foo
3645
3646 Notice that each of the assertions is applied independently at the same
3647 point in the subject string. First there is a check that the previous
3648 three characters are all digits, and then there is a check that the
3649 same three characters are not "999". This pattern does not match "foo"
3650 preceded by six characters, the first three of which are digits and the last
3651 three of which are not "999". For example, it does not match "123abc‐
3652 foo". A pattern to do that is the following:
3653
3654 (?<=\d{3}...)(?<!999)foo
3655
3656 This time the first assertion looks at the preceding six characters,
3657 checks that the first three are digits, and then the second assertion
3658 checks that the preceding three characters are not "999".
3659
3660 Assertions can be nested in any combination. For example, the following
3661 matches an occurrence of "baz" that is preceded by "bar", which in turn
3662 is not preceded by "foo":
3663
3664 (?<=(?<!foo)bar)baz
3665
3666 The following pattern matches "foo" preceded by three digits and any
3667 three characters that are not "999":
3668
3669 (?<=\d{3}(?!999)...)foo
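
The four patterns in this section can be compared on a few subjects. This
is a sketch with illustrative subject strings; the module name
re_multi_assert_sketch is an assumption.

```erlang
-module(re_multi_assert_sketch).
-export([demo/0]).

demo() ->
    %% Both assertions look at the same three preceding characters.
    P1 = "(?<=\\d{3})(?<!999)foo",
    {match, [{3,3}]} = re:run("123foo", P1),
    nomatch = re:run("999foo", P1),
    nomatch = re:run("123abcfoo", P1),
    %% Three digits, then any three characters that are not "999".
    P2 = "(?<=\\d{3}...)(?<!999)foo",
    {match, [{6,3}]} = re:run("123abcfoo", P2),
    nomatch = re:run("123999foo", P2),
    %% Nested: "baz" preceded by "bar" that is not preceded by "foo".
    P3 = "(?<=(?<!foo)bar)baz",
    {match, [{6,3}]} = re:run("123barbaz", P3),
    nomatch = re:run("foobarbaz", P3),
    %% A lookahead nested inside the lookbehind.
    P4 = "(?<=\\d{3}(?!999)...)foo",
    {match, [{6,3}]} = re:run("123abcfoo", P4),
    nomatch = re:run("123999foo", P4),
    ok.
```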
3670
3672 It is possible to cause the matching process to obey a subpattern con‐
3673 ditionally or to choose between two alternative subpatterns, depending
3674 on the result of an assertion, or whether a specific capturing subpat‐
3675 tern has already been matched. The following are the two possible forms
3676 of conditional subpattern:
3677
3678 (?(condition)yes-pattern)
3679 (?(condition)yes-pattern|no-pattern)
3680
3681 If the condition is satisfied, the yes-pattern is used, otherwise the
3682 no-pattern (if present). If more than two alternatives exist in the
3683 subpattern, a compile-time error occurs. Each of the two alternatives
3684 can itself contain nested subpatterns of any form, including condi‐
3685 tional subpatterns; the restriction to two alternatives applies only at
3686 the level of the condition. The following pattern fragment is an exam‐
3687 ple where the alternatives are complex:
3688
3689 (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
3690
3691 There are four kinds of condition: references to subpatterns, refer‐
3692 ences to recursion, a pseudo-condition called DEFINE, and assertions.
3693
3694 Checking for a Used Subpattern By Number
3695
3696 If the text between the parentheses consists of a sequence of digits,
3697 the condition is true if a capturing subpattern of that number has pre‐
3698 viously matched. If more than one capturing subpattern with the same
3699 number exists (see section Duplicate Subpattern Numbers earlier), the
3700 condition is true if any of them have matched. An alternative notation
3701 is to precede the digits with a plus or minus sign. In this case, the
3702 subpattern number is relative rather than absolute. The most recently
3703 opened parentheses can be referenced by (?(-1), the next most recent by
3704 (?(-2), and so on. Inside loops, it can also make sense to refer to
3705 subsequent groups. The next parentheses to be opened can be referenced
3706 as (?(+1), and so on. (The value zero in any of these forms is not
3707 used; it provokes a compile-time error.)
3708
3709 Consider the following pattern, which contains non-significant white‐
3710 space to make it more readable (assume option extended) and to divide
3711 it into three parts for ease of discussion:
3712
3713 ( \( )? [^()]+ (?(1) \) )
3714
3715 The first part matches an optional opening parenthesis, and if that
3716 character is present, sets it as the first captured substring. The sec‐
3717 ond part matches one or more characters that are not parentheses. The
3718 third part is a conditional subpattern that tests whether the first set
3719 of parentheses matched or not. If they did, that is, if the subject started
3720 with an opening parenthesis, the condition is true, and so the yes-pat‐
3721 tern is executed and a closing parenthesis is required. Otherwise, as
3722 no-pattern is not present, the subpattern matches nothing. That is,
3723 this pattern matches a sequence of non-parentheses, optionally enclosed
3724 in parentheses.
3725
3726 If this pattern is embedded in a larger one, a relative reference can
3727 be used:
3728
( \( )? [^()]+ (?(-1) \) )

3729 This makes the fragment independent of the parentheses in the larger
3730 pattern.
3731
3732 Checking for a Used Subpattern By Name
3733
3734 Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
3735 used subpattern by name. For compatibility with earlier versions of
3736 PCRE, which had this facility before Perl, the syntax (?(name)...) is
3737 also recognized.
3738
3739 Rewriting the previous example to use a named subpattern gives:
3740
3741 (?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
3742
3743 If the name used in a condition of this kind is a duplicate, the test
3744 is applied to all subpatterns of the same name, and is true if any one
3745 of them has matched.
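
The three ways of testing the same group (by number, by relative number,
and by name) can be sketched as follows. The ^ and $ anchors are added so
that unbalanced subjects are rejected instead of matching in part; the
module name re_conditional_sketch is hypothetical.

```erlang
-module(re_conditional_sketch).
-export([demo/0]).

demo() ->
    Patterns = ["^( \\( )? [^()]+ (?(1) \\) )$",
                "^( \\( )? [^()]+ (?(-1) \\) )$",
                "^(?<OPEN> \\( )? [^()]+ (?(<OPEN>) \\) )$"],
    lists:foreach(
      fun(P) ->
              %% A closing parenthesis is required only if an opening
              %% one was captured.
              {match, _} = re:run("(abc)", P, [extended]),
              {match, _} = re:run("abc", P, [extended]),
              nomatch = re:run("(abc", P, [extended]),
              nomatch = re:run("abc)", P, [extended])
      end, Patterns),
    ok.
```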
3746
3747 Checking for Pattern Recursion
3748
3749 If the condition is the string (R), and there is no subpattern with the
3750 name R, the condition is true if a recursive call to the whole pattern
3751 or any subpattern has been made. If digits or a name preceded by amper‐
3752 sand follow the letter R, for example:
3753
3754 (?(R3)...) or (?(R&name)...)
3755
3756 the condition is true if the most recent recursion is into a subpattern
3757 whose number or name is given. This condition does not check the entire
3758 recursion stack. If the name used in a condition of this kind is a du‐
3759 plicate, the test is applied to all subpatterns of the same name, and
3760 is true if any one of them is the most recent recursion.
3761
3762 At "top-level", all these recursion test conditions are false. The syn‐
3763 tax for recursive patterns is described below.
3764
3765 Defining Subpatterns for Use By Reference Only
3766
3767 If the condition is the string (DEFINE), and there is no subpattern
3768 with the name DEFINE, the condition is always false. In this case,
3769 there can be only one alternative in the subpattern. It is always
3770 skipped if control reaches this point in the pattern. The idea of DE‐
3771 FINE is that it can be used to define "subroutines" that can be refer‐
3772 enced from elsewhere. (The use of subroutines is described below.) For
3773 example, a pattern to match an IPv4 address, such as "192.168.23.245",
3774 can be written like this (ignore whitespace and line breaks):
3775
3776 (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) \b (?&byte) (\.(?&byte)){3} \b
3777
3778 The first part of the pattern is a DEFINE group inside which an‐
3779 other group named "byte" is defined. This matches an individual compo‐
3780 nent of an IPv4 address (a number < 256). When matching takes place,
3781 this part of the pattern is skipped, as DEFINE acts like a false condi‐
3782 tion. The remaining pattern uses references to the named group to match
3783 the four dot-separated components of an IPv4 address, insisting on a
3784 word boundary at each end.
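
A sketch of the IPv4 pattern in use, compiled with option extended so that
the whitespace in the pattern is ignored. The module name
re_define_ipv4_sketch, the helper ipv4_pattern/0, and the subjects are
illustrative assumptions.

```erlang
-module(re_define_ipv4_sketch).
-export([demo/0]).

%% The pattern from the text, split over two adjacent string literals
%% (which Erlang concatenates). Backslashes are doubled.
ipv4_pattern() ->
    "(?(DEFINE) (?<byte> 2[0-4]\\d | 25[0-5] | 1\\d\\d | [1-9]?\\d) )"
    " \\b (?&byte) (\\.(?&byte)){3} \\b".

demo() ->
    Opts = [extended, {capture, first, list}],
    {match, ["192.168.23.245"]} =
        re:run("addr 192.168.23.245 ok", ipv4_pattern(), Opts),
    %% 256 is not a valid component, and too few components remain
    %% after it, so nothing matches.
    nomatch = re:run("256.1.1.1", ipv4_pattern(), Opts),
    ok.
```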
3785
3786 Assertion Conditions
3787
3788 If the condition is not in any of the above formats, it must be an as‐
3789 sertion. This can be a positive or negative lookahead or lookbehind as‐
3790 sertion. Consider the following pattern, containing non-significant
3791 whitespace, and with the two alternatives on the second line:
3792
3793 (?(?=[^a-z]*[a-z])
3794 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
3795
3796 The condition is a positive lookahead assertion that matches an op‐
3797 tional sequence of non-letters followed by a letter. That is, it tests
3798 for the presence of at least one letter in the subject. If a letter is
3799 found, the subject is matched against the first alternative, otherwise
3800 it is matched against the second. This pattern matches strings in one
3801 of the two forms dd-aaa-dd or dd-dd-dd, where aaa are letters and dd
3802 are digits.
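
A minimal sketch of this assertion condition. The pattern is written on one
line and anchored with ^ and $ so that each subject is tested as a whole;
the anchors, the module name re_assert_cond_sketch, and the subjects are
additions for illustration.

```erlang
-module(re_assert_cond_sketch).
-export([demo/0]).

demo() ->
    P = "^(?(?=[^a-z]*[a-z]) \\d{2}-[a-z]{3}-\\d{2} | \\d{2}-\\d{2}-\\d{2} )$",
    Opts = [extended],
    %% A letter is present, so the first alternative is used.
    {match, _} = re:run("12-abc-34", P, Opts),
    %% No letter, so the second alternative is used.
    {match, _} = re:run("12-34-56", P, Opts),
    %% A letter selects the first alternative, which needs three letters.
    nomatch = re:run("12-ab-34", P, Opts),
    nomatch = re:run("12-345-67", P, Opts),
    ok.
```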
3803
3805 There are two ways to include comments in patterns that are processed
3806 by PCRE. In both cases, the start of the comment must not be in a char‐
3807 acter class, or in the middle of any other sequence of related charac‐
3808 ters such as (?: or a subpattern name or number. The characters that
3809 make up a comment play no part in the pattern matching.
3810
3811 The sequence (?# marks the start of a comment that continues up to the
3812 next closing parenthesis. Nested parentheses are not permitted. If op‐
3813 tion extended is set, an unescaped # character also introduces a
3814 comment, which in this case continues to immediately after the next
3815 newline character or character sequence in the pattern. Which charac‐
3816 ters are interpreted as newlines is controlled by the options passed to
3817 a compiling function or by a special sequence at the start of the pat‐
3818 tern, as described in section Newline Conventions earlier.
3819
3820 Notice that the end of this type of comment is a literal newline se‐
3821 quence in the pattern; escape sequences that happen to represent a new‐
3822 line do not count. For example, consider the following pattern when ex‐
3823 tended is set, and the default newline convention is in force:
3824
3825 abc #comment \n still comment
3826
3827 On encountering character #, pcre_compile() skips along, looking for a
3828 newline in the pattern. The sequence \n is still literal at this stage,
3829 so it does not terminate the comment. Only a character with code value
3830 0x0a (the default newline) does so.
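
In Erlang source the distinction shows up in how the string literal is
written: "\\n" puts a backslash and the letter n into the pattern, whereas
"\n" puts a real newline character (0x0a) there. The sketch below assumes
the default newline convention; the module name re_comment_sketch is
hypothetical.

```erlang
-module(re_comment_sketch).
-export([demo/0]).

demo() ->
    Opts = [extended, {capture, first, list}],
    %% (?#...) comments work with or without option extended.
    {match, [{0,6}]} = re:run("abcdef", "abc(?#a comment)def"),
    %% "\\n" is the two characters \ and n: the comment runs to the end
    %% of the pattern, which is therefore just "abc".
    {match, ["abc"]} = re:run("abc", "abc #comment \\n still comment", Opts),
    %% "\n" is a literal newline: it ends the comment, so "def" is part
    %% of the pattern.
    {match, ["abcdef"]} = re:run("abcdef", "abc #comment \n def", Opts),
    nomatch = re:run("abc", "abc #comment \n def", Opts),
    ok.
```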
3831
3833 Consider the problem of matching a string in parentheses, allowing for
3834 unlimited nested parentheses. Without the use of recursion, the best
3835 that can be done is to use a pattern that matches up to some fixed
3836 depth of nesting. It is not possible to handle an arbitrary nesting
3837 depth.
3838
3839 For some time, Perl has provided a facility that allows regular expres‐
3840 sions to recurse (among other things). It does this by interpolating
3841 Perl code in the expression at runtime, and the code can refer to the
3842 expression itself. A Perl pattern using code interpolation to solve the
3843 parentheses problem can be created like this:
3844
3845 $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
3846
3847 Item (?p{...}) interpolates Perl code at runtime, and in this case
3848 refers recursively to the pattern in which it appears.
3849
3850 Obviously, PCRE cannot support the interpolation of Perl code. Instead,
3851 it supports special syntax for recursion of the entire pattern, and for
3852 individual subpattern recursion. After its introduction in PCRE and
3853 Python, this kind of recursion was later introduced into Perl at re‐
3854 lease 5.10.
3855
3856 A special item that consists of (? followed by a number > 0 and a clos‐
3857 ing parenthesis is a recursive subroutine call of the subpattern of the
3858 given number, if it occurs inside that subpattern. (If not, it is a
3859 non-recursive subroutine call, which is described in the next section.)
3860 The special item (?R) or (?0) is a recursive call of the entire regular
3861 expression.
3862
3863 This PCRE pattern solves the nested parentheses problem (assume that
3864 option extended is set so that whitespace is ignored):
3865
3866 \( ( [^()]++ | (?R) )* \)
3867
3868 First it matches an opening parenthesis. Then it matches any number of
3869 substrings, which can either be a sequence of non-parentheses or a re‐
3870 cursive match of the pattern itself (that is, a correctly parenthesized
3871 substring). Finally there is a closing parenthesis. Notice the use of a
3872 possessive quantifier to avoid backtracking into sequences of non-
3873 parentheses.
3874
3875 If this was part of a larger pattern, you would not want to recurse the
3876 entire pattern, so instead you can use:
3877
3878 ( \( ( [^()]++ | (?1) )* \) )
3879
3880 The pattern is here within parentheses so that the recursion refers to
3881 them instead of the whole pattern.
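
Both forms can be tried from Erlang as sketched below. Anchors are added
around the second pattern, which is possible because (?1) recurses only
into the group and not into the anchors. The module name
re_recursion_sketch and the subjects are illustrative.

```erlang
-module(re_recursion_sketch).
-export([demo/0]).

demo() ->
    Opts = [extended, {capture, first, list}],
    %% (?R) recurses into the whole pattern.
    Whole = "\\( ( [^()]++ | (?R) )* \\)",
    {match, ["(ab(cd)ef)"]} = re:run("x(ab(cd)ef)y", Whole, Opts),
    %% (?1) recurses into group 1 only, so the pattern can be embedded.
    Sub = "^( \\( ( [^()]++ | (?1) )* \\) )$",
    {match, ["(a(b(c)))"]} = re:run("(a(b(c)))", Sub, Opts),
    %% Unbalanced input is rejected.
    nomatch = re:run("(a(b)", Sub, Opts),
    ok.
```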
3882
3883 In a larger pattern, keeping track of parenthesis numbers can be
3884 tricky. This is made easier by the use of relative references. Instead
3885 of (?1) in the pattern above, you can write (?-2) to refer to the sec‐
3886 ond most recently opened parentheses preceding the recursion. That is,
3887 a negative number counts capturing parentheses leftwards from the point
3888 at which it is encountered.
3889
3890 It is also possible to refer to later opened parentheses, by writing
3891 references such as (?+2). However, these cannot be recursive, as the
3892 reference is not inside the parentheses that are referenced. They are
3893 always non-recursive subroutine calls, as described in the next sec‐
3894 tion.
3895
3896 An alternative approach is to use named parentheses instead. The Perl
3897 syntax for this is (?&name). The earlier PCRE syntax (?P>name) is also
3898 supported. We can rewrite the above example as follows:
3899
3900 (?<pn> \( ( [^()]++ | (?&pn) )* \) )
3901
3902 If there is more than one subpattern with the same name, the earliest
3903 one is used.
3904
3905 This particular example pattern that we have studied contains nested
3906 unlimited repeats, and so the use of a possessive quantifier for match‐
3907 ing strings of non-parentheses is important when applying the pattern
3908 to strings that do not match. For example, when this pattern is applied
3909 to
3910
3911 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
3912
3913 it gives "no match" quickly. However, if a possessive quantifier is not
3914 used, the match runs for a long time, as there are so many different
3915 ways the + and * repeats can carve up the subject, and all must be
3916 tested before failure can be reported.
3917
3918 At the end of a match, the values of capturing parentheses are those
3919 from the outermost level. If the pattern above is matched against
3920
3921 (ab(cd)ef)
3922
3923 the value for the inner capturing parentheses (numbered 2) is "ef",
3924 which is the last value taken on at the top-level. If a capturing sub‐
3925 pattern is not matched at the top level, its final captured value is
3926 unset, even if it was (temporarily) set at a deeper level during the
3927 matching process.
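
The captured values can be inspected with the capture option of re:run/3,
as in this sketch. The module name re_recursion_capture_sketch is
hypothetical; the named variant uses the group name pn from the earlier
example.

```erlang
-module(re_recursion_capture_sketch).
-export([demo/0]).

demo() ->
    P = "( \\( ( [^()]++ | (?1) )* \\) )",
    %% Group 2 ends up as "ef", its last value at the top level; the
    %% "cd" captured inside the recursion is not kept.
    {match, ["(ab(cd)ef)", "(ab(cd)ef)", "ef"]} =
        re:run("(ab(cd)ef)", P, [extended, {capture, all, list}]),
    %% The same with a named group, selecting group pn and group 2.
    Named = "(?<pn> \\( ( [^()]++ | (?&pn) )* \\) )",
    {match, ["(ab(cd)ef)", "ef"]} =
        re:run("(ab(cd)ef)", Named, [extended, {capture, [pn, 2], list}]),
    ok.
```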
3928
3929 Do not confuse item (?R) with condition (R), which tests for recursion.
3930 Consider the following pattern, which matches text in angle brackets,
3931 allowing for arbitrary nesting. Only digits are allowed in nested
3932 brackets (that is, when recursing), while any characters are permitted
3933 at the outer level.
3934
3935 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
3936
3937 Here (?(R) is the start of a conditional subpattern, with two different
3938 alternatives for the recursive and non-recursive cases. Item (?R) is
3939 the actual recursive call.
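
A hedged sketch of this pattern in use. The space before the * quantifier
is left out of the Erlang string as a precaution; the module name
re_recursion_cond_sketch and the subjects are illustrative.

```erlang
-module(re_recursion_cond_sketch).
-export([demo/0]).

demo() ->
    P = "< (?: (?(R) \\d++ | [^<>]*+) | (?R))* >",
    Opts = [extended, {capture, first, list}],
    %% Any characters at the outer level, digits in the nested brackets.
    {match, ["<abc<123>def>"]} = re:run("<abc<123>def>", P, Opts),
    %% A letter inside nested brackets makes the outer attempt fail; the
    %% search then finds "<x>" on its own, where it is at top level.
    {match, ["<x>"]} = re:run("<abc<x>def>", P, Opts),
    ok.
```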
3940
3941 Differences in Recursion Processing between PCRE and Perl
3942
3943 Recursion processing in PCRE differs from Perl in two important ways.
3944 In PCRE (like Python, but unlike Perl), a recursive subpattern call is
3945 always treated as an atomic group. That is, once it has matched some of
3946 the subject string, it is never re-entered, even if it contains untried
3947 alternatives and there is a subsequent matching failure. This can be
3948 illustrated by the following pattern, which is intended to match a palin‐
3949 dromic string containing an odd number of characters (for example, "a",
3950 "aba", "abcba", "abcdcba"):
3951
3952 ^(.|(.)(?1)\2)$
3953
3954 The idea is that it either matches a single character, or two identical
3955 characters surrounding a subpalindrome. In Perl, this pattern works; in
3956 PCRE it does not work if the subject is longer than three characters.
3957 Consider the subject string "abcba".
3958
3959 At the top level, the first character is matched, but as it is not at
3960 the end of the string, the first alternative fails, the second alterna‐
3961 tive is taken, and the recursion kicks in. The recursive call to sub‐
3962 pattern 1 successfully matches the next character ("b"). (Notice that
3963 the beginning and end of line tests are not part of the recursion.)
3964
3965 Back at the top level, the next character ("c") is compared with what
3966 subpattern 2 matched, which was "a". This fails. As the recursion is
3967 treated as an atomic group, there are now no backtracking points, and
3968 so the entire match fails. (Perl can now re-enter the recursion and try
3969 the second alternative.) However, if the pattern is written with the
3970 alternatives in the other order, things are different:
3971
3972 ^((.)(?1)\2|.)$
3973
3974 This time, the recursing alternative is tried first, and continues to
3975 recurse until it runs out of characters, at which point the recursion
3976 fails. But this time we have another alternative to try at the higher
3977 level. That is the significant difference: in the previous case the re‐
3978 maining alternative is at a deeper recursion level, which PCRE cannot
3979 use.
3980
3981 To change the pattern so that it matches all palindromic strings, not
3982 only those with an odd number of characters, it is tempting to change
3983 the pattern to this:
3984
3985 ^((.)(?1)\2|.?)$
3986
3987 Again, this works in Perl, but not in PCRE, and for the same reason.
3988 When a deeper recursion has matched a single character, it cannot be
3989 entered again to match an empty string. The solution is to separate the
3990 two cases, and write out the odd and even cases as alternatives at the
3991 higher level:
3992
3993 ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
3994
3995 If you want to match typical palindromic phrases, the pattern must ig‐
3996 nore all non-word characters, which can be done as follows:
3997
3998 ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
3999
4000 If run with option caseless, this pattern matches phrases such as "A
4001 man, a plan, a canal: Panama!" and it works well in both PCRE and Perl.
4002 Notice the use of the possessive quantifier *+ to avoid backtracking
4003 into sequences of non-word characters. Without this, PCRE takes much
4004 longer (10 times or more) to match typical phrases, and Perl takes so
4005 long that you think it has gone into a loop.
4006
4007 Note:
4008 The palindrome-matching patterns above work only if the subject string
4009 does not start with a palindrome that is shorter than the entire
4010 string. For example, although "abcba" is correctly matched, if the sub‐
4011 ject is "ababa", PCRE finds palindrome "aba" at the start, and then
4012 fails at top level, as the end of the string does not follow. Once
4013 again, it cannot jump back into the recursion to try other alterna‐
4014 tives, so the entire match fails.
4015
4016
4017 The second way in which PCRE and Perl differ in their recursion pro‐
4018 cessing is in the handling of captured values. In Perl, when a subpat‐
4019 tern is called recursively or as a subroutine (see the next section),
4020 it has no access to any values that were captured outside the recur‐
4021 sion. In PCRE these values can be referenced. Consider the following
4022 pattern:
4023
4024 ^(.)(\1|a(?2))
4025
4026 In PCRE, it matches "bab". The first capturing parentheses match "b",
4027 then in the second group, when the back reference \1 fails to match
4028 "b", the second alternative matches "a", and then recurses. In the re‐
4029 cursion, \1 does now match "b" and so the whole match succeeds. In
4030 Perl, the pattern fails to match because inside the recursive call \1
4031 cannot access the externally set value.
4032
4034 If the syntax for a recursive subpattern call (either by number or by
4035 name) is used outside the parentheses to which it refers, it operates
4036 like a subroutine in a programming language. The called subpattern can
4037 be defined before or after the reference. A numbered reference can be
4038 absolute or relative, as in the following examples:
4039
4040 (...(absolute)...)...(?2)...
4041 (...(relative)...)...(?-1)...
4042 (...(?+1)...(relative)...
4043
4044 An earlier example pointed out that the following pattern matches
4045 "sense and sensibility" and "response and responsibility", but not
4046 "sense and responsibility":
4047
4048 (sens|respons)e and \1ibility
4049
4050 If instead the following pattern is used, it matches "sense and respon‐
4051 sibility" and the other two strings:
4052
4053 (sens|respons)e and (?1)ibility
4054
4055 Another example is provided in the discussion of DEFINE earlier.
4056
4057 All subroutine calls, recursive or not, are always treated as atomic
4058 groups. That is, once a subroutine has matched some of the subject
4059 string, it is never re-entered, even if it contains untried alterna‐
4060 tives and there is a subsequent matching failure. Any capturing paren‐
4061 theses that are set during the subroutine call revert to their previous
4062 values afterwards.
4063
4064 Processing options such as case-independence are fixed when a subpat‐
4065 tern is defined, so if it is used as a subroutine, such options cannot
4066 be changed for different calls. For example, the following pattern
4067 matches "abcabc" but not "abcABC", as the change of processing option
4068 does not affect the called subpattern:
4069
4070 (abc)(?i:(?-1))
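
The difference between a back reference and a subroutine call, and the
fixed processing options, can be sketched as follows. The module name
re_subroutine_sketch is hypothetical.

```erlang
-module(re_subroutine_sketch).
-export([demo/0]).

demo() ->
    S = "sense and responsibility",
    %% \1 must repeat the text "sens", so this fails.
    nomatch = re:run(S, "(sens|respons)e and \\1ibility"),
    %% (?1) re-runs the subpattern, so "respons" is accepted. Group 1
    %% still holds "sens": captures set during the call are reverted.
    {match, [S, "sens"]} =
        re:run(S, "(sens|respons)e and (?1)ibility", [{capture, all, list}]),
    %% (?i:...) around the call does not make the called group caseless.
    {match, _} = re:run("abcabc", "(abc)(?i:(?-1))"),
    nomatch = re:run("abcABC", "(abc)(?i:(?-1))"),
    ok.
```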
4071
4073 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
4074 name or a number enclosed either in angle brackets or single quotes, is
4075 alternative syntax for referencing a subpattern as a subroutine, possi‐
4076 bly recursively. Here follow two of the examples used above, rewritten
4077 using this syntax:
4078
4079 (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
4080 (sens|respons)e and \g'1'ibility
4081
4082 PCRE supports an extension to Oniguruma: if a number is preceded by a
4083 plus or minus sign, it is taken as a relative reference, for example:
4084
4085 (abc)(?i:\g<-1>)
4086
4087 Notice that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are
4088 not synonymous. The former is a back reference; the latter is a subrou‐
4089 tine call.
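
The following shell sketch shows the two forms side by side, reusing
the "sense and responsibility" subject from the earlier example. The
variable name S is only illustrative:

```erlang
S = "sense and responsibility".

%% \g{1} is a back reference: it must repeat "sens".
nomatch = re:run(S, "(sens|respons)e and \\g{1}ibility").

%% \g<1> and \g'1' are subroutine calls: the subpattern is matched
%% again, so "respons" is accepted.
{match, [{0, 24} | _]} = re:run(S, "(sens|respons)e and \\g<1>ibility").
{match, [{0, 24} | _]} = re:run(S, "(sens|respons)e and \\g'1'ibility").
```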
4090
4092 Perl 5.10 introduced some "Special Backtracking Control Verbs", which
4093 are still described in the Perl documentation as "experimental and sub‐
4094 ject to change or removal in a future version of Perl". It goes on to
4095 say: "Their usage in production code should be noted to avoid problems
4096 during upgrades." The same remarks apply to the PCRE features described
4097 in this section.
4098
4099 The new verbs make use of what was previously invalid syntax: an open‐
4100 ing parenthesis followed by an asterisk. They are generally of the form
4101 (*VERB) or (*VERB:NAME). Some can take either form, possibly behaving
4102 differently depending on whether a name is present. A name is any se‐
4103 quence of characters that does not include a closing parenthesis. The
4104 maximum name length is 255 in the 8-bit library and 65535 in the 16-bit
4105 and 32-bit libraries. If the name is empty, that is, if the closing
4106 parenthesis immediately follows the colon, the effect is as if the
4107 colon was not there. Any number of these verbs can occur in a pattern.
4108
4109 The behavior of these verbs in repeated groups, assertions, and in sub‐
4110 patterns called as subroutines (whether or not recursively) is de‐
4111 scribed below.
4112
4113 Optimizations That Affect Backtracking Verbs
4114
4115 PCRE contains some optimizations that are used to speed up matching by
4116 running some checks at the start of each match attempt. For example, it
4117 can know the minimum length of a matching subject, or that a particular
4118 character must be present. When one of these optimizations bypasses the
4119 running of a match, any included backtracking verbs are not processed.
4120 You can suppress the start-of-match optimizations by setting option
4121 no_start_optimize when calling compile/2 or run/3, or by starting the
4122 pattern with (*NO_START_OPT).
4123
4124 Experiments with Perl suggest that it too has similar optimizations,
4125 sometimes leading to anomalous results.
4126
4127 Verbs That Act Immediately
4128
4129 The following verbs act as soon as they are encountered. They must not
4130 be followed by a name.
4131
4132 (*ACCEPT)
4133
4134 This verb causes the match to end successfully, skipping the remainder
4135 of the pattern. However, when it is inside a subpattern that is called
4136 as a subroutine, only that subpattern is ended successfully. Matching
4137 then continues at the outer level. If (*ACCEPT) is triggered in a posi‐
4138 tive assertion, the assertion succeeds; in a negative assertion, the
4139 assertion fails.
4140
4141 If (*ACCEPT) is inside capturing parentheses, the data so far is cap‐
4142 tured. For example, the following matches "AB", "AAD", or "ACD". When
4143 it matches "AB", "B" is captured by the outer parentheses.
4144
4145 A((?:A|B(*ACCEPT)|C)D)
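
A short shell sketch of this pattern, using option {capture,all,list}
so that the captured text is readable. The variable names are only
illustrative:

```erlang
Pattern = "A((?:A|B(*ACCEPT)|C)D)".
Opts = [{capture, all, list}].

%% (*ACCEPT) ends the match right after "B"; the outer parentheses
%% capture what was matched so far, that is, "B".
{match, ["AB", "B"]} = re:run("AB", Pattern, Opts).

%% The other alternatives must still be followed by "D".
{match, ["AAD", "AD"]} = re:run("AAD", Pattern, Opts).
{match, ["ACD", "CD"]} = re:run("ACD", Pattern, Opts).
```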
4146
4147 The following verb causes a matching failure, forcing backtracking to
4148 occur. It is equivalent to (?!) but easier to read.
4149
4150 (*FAIL) or (*F)
4151
4152 The Perl documentation states that it is probably useful only when com‐
4153 bined with (?{}) or (??{}). Those are Perl features that are not
4154 present in PCRE.
4155
4156 In PCRE, the nearest equivalent is the callout feature, as in a+(?C)(*FAIL): a match with the
4157 string "aaaa" always fails, but the callout is taken before each backtrack (in this example, 10 times).
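
A minimal shell sketch of (*FAIL); the third pattern is an added
illustration of (*F) rejecting a single alternative:

```erlang
%% (*FAIL) and (?!) both force a failure wherever they are reached,
%% so these patterns can never match.
nomatch = re:run("aaaa", "a+(*FAIL)").
nomatch = re:run("aaaa", "a+(?!)").

%% Inside one alternative, (*F) only rejects that alternative;
%% ordinary backtracking then tries the other one.
{match, ["b"]} = re:run("ab", "a(*F)|b", [{capture, all, list}]).
```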
4158
4159 Recording Which Path Was Taken
4160
4161 The main purpose of the (*MARK) verb is to track how a match was arrived
4162 at, although it also has a secondary use in conjunction with advancing
4163 the match starting point (see (*SKIP) below).
4164
4165 Note:
4166 In Erlang, there is no interface to retrieve a mark with run/2,3, so
4167 only the secondary purpose is relevant to the Erlang programmer.
4168
4169 The rest of this section is therefore deliberately not adapted for
4170 reading by the Erlang programmer, but the examples can help in under‐
4171 standing NAMES as they can be used by (*SKIP).
4172
4173
4174 (*MARK:NAME) or (*:NAME)
4175
4176 A name is always required with this verb. There can be as many in‐
4177 stances of (*MARK) as you like in a pattern, and their names do not
4178 have to be unique.
4179
4180 When a match succeeds, the name of the last encountered (*MARK:NAME),
4181 (*PRUNE:NAME), or (*THEN:NAME) on the matching path is passed back to
4182 the caller as described in section "Extra data for pcre_exec()" in the
4183 pcreapi documentation. In the following example of pcretest output, the
4184 /K modifier requests the retrieval and outputting of (*MARK) data:
4185
4186 re> /X(*MARK:A)Y|X(*MARK:B)Z/K
4187 data> XY
4188 0: XY
4189 MK: A
4190 XZ
4191 0: XZ
4192 MK: B
4193
4194 The (*MARK) name is tagged with "MK:" in this output, and in this exam‐
4195 ple it indicates which of the two alternatives matched. This is a more
4196 efficient way of obtaining this information than putting each alterna‐
4197 tive in its own capturing parentheses.
4198
4199 If a verb with a name is encountered in a positive assertion that is
4200 true, the name is recorded and passed back if it is the last encoun‐
4201 tered. This does not occur for negative assertions or failing positive
4202 assertions.
4203
4204 After a partial match or a failed match, the last encountered name in
4205 the entire match process is returned, for example:
4206
4207 re> /X(*MARK:A)Y|X(*MARK:B)Z/K
4208 data> XP
4209 No match, mark = B
4210
4211 Notice that in this unanchored example, the mark is retained from the
4212 match attempt that started at letter "X" in the subject. Subsequent
4213 match attempts starting at "P" and then with an empty string do not get
4214 as far as the (*MARK) item, but nevertheless do not reset it.
4215
4216 Verbs That Act after Backtracking
4217
4218 The following verbs do nothing when they are encountered. Matching con‐
4219 tinues with what follows, but if there is no subsequent match, causing
4220 a backtrack to the verb, a failure is forced. That is, backtracking
4221 cannot pass to the left of the verb. However, when one of these verbs
4222 appears inside an atomic group or an assertion that is true, its effect
4223 is confined to that group, as once the group has been matched, there is
4224 never any backtracking into it. In this situation, backtracking can
4225 "jump back" to the left of the entire atomic group or assertion. (Re‐
4226 member also, as stated above, that this localization also applies in
4227 subroutine calls.)
4228
4229 These verbs differ in exactly what kind of failure occurs when back‐
4230 tracking reaches them. The behavior described below is what occurs when
4231 the verb is not in a subroutine or an assertion. Subsequent sections
4232 cover these special cases.
4233
4234 The following verb, which must not be followed by a name, causes the
4235 whole match to fail outright if there is a later matching failure that
4236 causes backtracking to reach it. Even if the pattern is unanchored, no
4237 further attempts to find a match by advancing the starting point take
4238 place.
4239
4240 (*COMMIT)
4241
4242 If (*COMMIT) is the only backtracking verb that is encountered, once it
4243 has been passed, run/2,3 is committed to finding a match at the current
4244 starting point, or not at all, for example:
4245
4246 a+(*COMMIT)b
4247
4248 This matches "xxaab" but not "aacaab". It can be thought of as a kind
4249 of dynamic anchor, or "I've started, so I must finish". The name of the
4250 most recently passed (*MARK) in the path is passed back when (*COMMIT)
4251 forces a match failure.
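
The two subjects mentioned above can be tried from the shell. This is
a minimal sketch; the last call, without (*COMMIT), is added only for
contrast:

```erlang
Opts = [{capture, all, list}].

%% No "a" is found at the two "x" positions, so (*COMMIT) is not
%% reached there; the attempt that starts at the first "a" succeeds.
{match, ["aab"]} = re:run("xxaab", "a+(*COMMIT)b", Opts).

%% The attempt at offset 0 passes (*COMMIT) and then fails on "c",
%% so no later starting point is tried.
nomatch = re:run("aacaab", "a+(*COMMIT)b").

%% Without the verb, the later "aab" is found as usual.
{match, ["aab"]} = re:run("aacaab", "a+b", Opts).
```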
4252
4253 If more than one backtracking verb exists in a pattern, a different one
4254 that follows (*COMMIT) can be triggered first, so merely passing (*COM‐
4255 MIT) during a match does not always guarantee that a match must be at
4256 this starting point.
4257
4258 Notice that (*COMMIT) at the start of a pattern is not the same as an
4259 anchor, unless the PCRE start-of-match optimizations are turned off, as
4260 shown in the following example:
4261
4262 1> re:run("xyzabc","(*COMMIT)abc",[{capture,all,list}]).
4263 {match,["abc"]}
4264 2> re:run("xyzabc","(*COMMIT)abc",[{capture,all,list},no_start_optimize]).
4265 nomatch
4266
4267 For this pattern, PCRE knows that any match must start with "a", so the
4268 optimization skips along the subject to "a" before applying the pattern
4269 to the first set of data. The match attempt then succeeds. In the sec‐
4270 ond call the no_start_optimize disables the optimization that skips
4271 along to the first character. The pattern is now applied starting at
4272 "x", and so the (*COMMIT) causes the match to fail without trying any
4273 other starting points.
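
The same effect can be obtained inside the pattern itself. The
following sketch assumes, as stated earlier in this section, that a
leading (*NO_START_OPT) is equivalent to option no_start_optimize:

```erlang
%% The optimization is disabled from within the pattern, so the first
%% attempt starts at "x" and (*COMMIT) rules out any other start.
nomatch = re:run("xyzabc", "(*NO_START_OPT)(*COMMIT)abc").

%% Equivalent call using the option instead.
nomatch = re:run("xyzabc", "(*COMMIT)abc", [no_start_optimize]).
```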
4274
4275 The following verb causes the match to fail at the current starting po‐
4276 sition in the subject if there is a later matching failure that causes
4277 backtracking to reach it:
4278
4279 (*PRUNE) or (*PRUNE:NAME)
4280
4281 If the pattern is unanchored, the normal "bumpalong" advance to the
4282 next starting character then occurs. Backtracking can occur as usual to
4283 the left of (*PRUNE), before it is reached, or when matching to the
4284 right of (*PRUNE), but if there is no match to the right, backtracking
4285 cannot cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an
4286 alternative to an atomic group or possessive quantifier, but there are
4287 some uses of (*PRUNE) that cannot be expressed in any other way. In an
4288 anchored pattern, (*PRUNE) has the same effect as (*COMMIT).
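
As an illustration, the following shell sketch applies (*PRUNE) to a
subject in which the first attempts fail. The pattern and subject are
chosen for this example only:

```erlang
%% The attempts at offsets 0 and 1 pass (*PRUNE) and then fail on "c".
%% Each failure abandons only that starting point, so the attempt
%% starting at offset 3 still succeeds.
{match, [{3, 3}]} = re:run("aacaab", "a+(*PRUNE)b").

%% In this simple case an atomic group gives the same result.
{match, [{3, 3}]} = re:run("aacaab", "(?>a+)b").
```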
4289
4290 The behavior of (*PRUNE:NAME) is not the same as
4291 (*MARK:NAME)(*PRUNE). It is like (*MARK:NAME) in that the name is re‐
4292 membered for passing back to the caller. However, (*SKIP:NAME) searches
4293 only for names set with (*MARK).
4294
4295 Note:
4296 The fact that (*PRUNE:NAME) remembers the name is useless to the Erlang
4297 programmer, as names cannot be retrieved.
4298
4299
4300 The following verb, when specified without a name, is like (*PRUNE),
4301 except that if the pattern is unanchored, the "bumpalong" advance is
4302 not to the next character, but to the position in the subject where
4303 (*SKIP) was encountered.
4304
4305 (*SKIP)
4306
4307 (*SKIP) signifies that whatever text was matched leading up to it can‐
4308 not be part of a successful match. Consider:
4309
4310 a+(*SKIP)b
4311
4312 If the subject is "aaaac...", after the first match attempt fails
4313 (starting at the first character in the string), the starting point
4314 skips on to start the next attempt at "c". Notice that a possessive
4315 quantifier does not have the same effect as this example; although it
4316 would suppress backtracking during the first match attempt, the second
4317 attempt would start at the second character instead of skipping on to
4318 "c".
4319
4320 When (*SKIP) has an associated name, its behavior is modified:
4321
4322 (*SKIP:NAME)
4323
4324 When this is triggered, the previous path through the pattern is
4325 searched for the most recent (*MARK) that has the same name. If one is
4326 found, the "bumpalong" advance is to the subject position that corre‐
4327 sponds to that (*MARK) instead of to where (*SKIP) was encountered. If
4328 no (*MARK) with a matching name is found, (*SKIP) is ignored.
4329
4330 Notice that (*SKIP:NAME) searches only for names set by (*MARK:NAME).
4331 It ignores names that are set by (*PRUNE:NAME) or (*THEN:NAME).
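
The effect on the next starting point can be observed through the
offsets returned by run/2. In the following sketch the patterns, the
subject "aaac", and the mark name m are made up for illustration; the
first alternative always fails on "c" after the verb has been passed:

```erlang
%% (*PRUNE): the next attempt starts at offset 1, where the second
%% alternative matches.
{match, [{1, 1}]} = re:run("aaac", "aaa(*PRUNE)b|a").

%% (*SKIP): the next attempt starts where (*SKIP) was encountered,
%% offset 3, and nothing matches from there.
nomatch = re:run("aaac", "aaa(*SKIP)b|a").

%% (*SKIP:m): the next attempt starts at the position recorded by
%% (*MARK:m), offset 2.
{match, [{2, 1}]} = re:run("aaac", "aa(*MARK:m)a(*SKIP:m)b|a").
```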
4332
4333 The following verb causes a skip to the next innermost alternative when
4334 backtracking reaches it. That is, it cancels any further backtracking
4335 within the current alternative.
4336
4337 (*THEN) or (*THEN:NAME)
4338
4339 The verb name comes from the observation that it can be used for a pat‐
4340 tern-based if-then-else block:
4341
4342 ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
4343
4344 If the COND1 pattern matches, FOO is tried (and possibly further items
4345 after the end of the group if FOO succeeds). On failure, the matcher
4346 skips to the second alternative and tries COND2, without backtracking
4347 into COND1. If that succeeds and BAR fails, COND3 is tried. If BAZ then
4348 fails, there are no more alternatives, so there is a backtrack to what‐
4349 ever came before the entire group. If (*THEN) is not inside an alterna‐
4350 tion, it acts like (*PRUNE).
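
A small shell sketch of this behavior, with a+ playing the role of
COND1. The patterns are made up for illustration:

```erlang
Opts = [{capture, all, list}].

%% Without (*THEN), a+ gives back one "a" and the first alternative
%% matches the whole subject.
{match, ["aab"]} = re:run("aab", "^(?:a+ab|aa)", Opts).

%% With (*THEN), the failure after a+ has taken "aa" moves straight to
%% the second alternative, without backtracking into a+.
{match, ["aa"]} = re:run("aab", "^(?:a+(*THEN)ab|aa)", Opts).

%% If the second alternative cannot match either, the group fails.
nomatch = re:run("aab", "^(?:a+(*THEN)ab|x)").
```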
4351
4352 The behavior of (*THEN:NAME) is not the same as
4353 (*MARK:NAME)(*THEN). It is like (*MARK:NAME) in that the name is remem‐
4354 bered for passing back to the caller. However, (*SKIP:NAME) searches
4355 only for names set with (*MARK).
4356
4357 Note:
4358 The fact that (*THEN:NAME) remembers the name is useless to the Erlang
4359 programmer, as names cannot be retrieved.
4360
4361
4362 A subpattern that does not contain a | character is just a part of the
4363 enclosing alternative; it is not a nested alternation with only one al‐
4364 ternative. The effect of (*THEN) extends beyond such a subpattern to
4365 the enclosing alternative. Consider the following pattern, where A, B,
4366 and so on, are complex pattern fragments that do not contain any |
4367 characters at this level:
4368
4369 A (B(*THEN)C) | D
4370
4371 If A and B are matched, but there is a failure in C, matching does not
4372 backtrack into A; instead it moves to the next alternative, that is, D.
4373 However, if the subpattern containing (*THEN) is given an alternative,
4374 it behaves differently:
4375
4376 A (B(*THEN)C | (*FAIL)) | D
4377
4378 The effect of (*THEN) is now confined to the inner subpattern. After a
4379 failure in C, matching moves to (*FAIL), which causes the whole subpat‐
4380 tern to fail, as there are no more alternatives to try. In this case,
4381 matching does now backtrack into A.
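
The two cases can be sketched in the shell with .*? standing in for A,
b for B, c for C, and d for D. These concrete fragments and the subject
are assumptions made for this example:

```erlang
Opts = [{capture, all, list}].

%% The inner group has no alternatives, so (*THEN) acts on the outer
%% alternation: after "b" is followed by "x", matching moves to "d"
%% without backtracking into .*?, and the match fails.
nomatch = re:run("abxbc", "^(?:.*?(b(*THEN)c)|d)").

%% With an inner alternative, (*THEN) is confined to the inner group.
%% The group fails, .*? is extended, and the later "bc" is found.
{match, ["abxbc", "bc"]} =
    re:run("abxbc", "^(?:.*?(b(*THEN)c|(*FAIL))|d)", Opts).
```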
4382
4383 Notice that a conditional subpattern is not considered as having two
4384 alternatives, as only one is ever used. That is, the | character in a
4385 conditional subpattern has a different meaning. Ignoring whitespace,
4386 consider:
4387
4388 ^.*? (?(?=a) a | b(*THEN)c )
4389
4390 If the subject is "ba", this pattern does not match. As .*? is un‐
4391 greedy, it initially matches zero characters. The condition (?=a) then
4392 fails, the character "b" is matched, but "c" is not. At this point,
4393 matching does not backtrack to .*? as can perhaps be expected from the
4394 presence of the | character. The conditional subpattern is part of the
4395 single alternative that comprises the whole pattern, and so the match
4396 fails. (If there was a backtrack into .*?, allowing it to match "b",
4397 the match would succeed.)
4398
4399 The verbs described above provide four different "strengths" of control
4400 when subsequent matching fails:
4401
4402 * (*THEN) is the weakest, carrying on the match at the next alterna‐
4403 tive.
4404
4405 * (*PRUNE) comes next, failing the match at the current starting
4406 position, but allowing an advance to the next character (for an
4407 unanchored pattern).
4408
4409 * (*SKIP) is similar, except that the advance can be more than one
4410 character.
4411
4412 * (*COMMIT) is the strongest, causing the entire match to fail.
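
The four strengths can be compared by running one pattern with each
verb in turn. In the following sketch, the fun Run, the pattern, and the
subject "aaaca" are hypothetical and chosen so that the first
alternative always fails on "c" after the verb has been passed:

```erlang
%% Run is an illustrative helper: it builds (?:aaa(*VERB)b|a) for the
%% given verb name and matches it against "aaaca".
Run = fun(Verb) -> re:run("aaaca", "(?:aaa(*" ++ Verb ++ ")b|a)") end.

%% (*THEN): the second alternative is tried at the same position.
{match, [{0, 1}]} = Run("THEN").

%% (*PRUNE): the attempt at offset 0 fails; offset 1 is tried next.
{match, [{1, 1}]} = Run("PRUNE").

%% (*SKIP): the next attempt starts at offset 3, so the first match
%% found is the final "a".
{match, [{4, 1}]} = Run("SKIP").

%% (*COMMIT): the entire match fails.
nomatch = Run("COMMIT").
```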
4413
4414 More than One Backtracking Verb
4415
4416 If more than one backtracking verb is present in a pattern, the one
4417 that is backtracked onto first acts. For example, consider the follow‐
4418 ing pattern, where A, B, and so on, are complex pattern fragments:
4419
4420 (A(*COMMIT)B(*THEN)C|ABD)
4421
4422 If A matches but B fails, the backtrack to (*COMMIT) causes the entire
4423 match to fail. However, if A and B match, but C fails, the backtrack to
4424 (*THEN) causes the next alternative (ABD) to be tried. This behavior is
4425 consistent, but is not always the same as in Perl. It means that if two
4426 or more backtracking verbs appear in succession, all but the last of
4427 them have no effect. Consider the following example:

...(*COMMIT)(*PRUNE)...
4428
4429 If there is a matching failure to the right, backtracking onto (*PRUNE)
4430 causes it to be triggered, and its action is taken. There can never be
4431 a backtrack onto (*COMMIT).
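
Both situations can be sketched in the shell. The single letters a, b,
c, and d stand in for the fragments A, B, C, and D; these patterns and
subjects are made up for illustration:

```erlang
P = "(?:a(*COMMIT)b(*THEN)c|abd)".

%% "a" and "b" match but "c" fails: backtracking reaches (*THEN)
%% first, so the second alternative is tried and matches.
{match, [{0, 3}]} = re:run("abd", P).

%% "a" matches but "b" fails: backtracking reaches (*COMMIT), so the
%% later "abd" in the subject is never tried.
nomatch = re:run("acabd", P).

%% Two verbs in succession: only (*PRUNE) is ever backtracked onto,
%% so later starting points are still tried.
{match, [{3, 3}]} = re:run("aacaab", "a+(*COMMIT)(*PRUNE)b").
```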
4432
4433 Backtracking Verbs in Repeated Groups
4434
4435 PCRE differs from Perl in its handling of backtracking verbs in re‐
4436 peated groups. For example, consider:
4437
4438 /(a(*COMMIT)b)+ac/
4439
4440 If the subject is "abac", Perl matches, but PCRE fails because the
4441 (*COMMIT) in the second repeat of the group acts.
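
In the Erlang shell this can be sketched as follows; the second call,
without the verb, is added only for contrast:

```erlang
%% The second repeat of the group passes (*COMMIT) and then fails on
%% "c", which makes the whole match fail.
nomatch = re:run("abac", "(a(*COMMIT)b)+ac").

%% Without the verb, the group matches once and "ac" follows.
{match, [{0, 4}, {0, 2}]} = re:run("abac", "(ab)+ac").
```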
4442
4443 Backtracking Verbs in Assertions
4444
4445 (*FAIL) in an assertion has its normal effect: it forces an immediate
4446 backtrack.
4447
4448 (*ACCEPT) in a positive assertion causes the assertion to succeed with‐
4449 out any further processing. In a negative assertion, (*ACCEPT) causes
4450 the assertion to fail without any further processing.
4451
4452 The other backtracking verbs are not treated specially if they appear
4453 in a positive assertion. In particular, (*THEN) skips to the next
4454 alternative in the innermost enclosing group that has alternations,
4455 regardless of whether this is within the assertion.
4456
4457 Negative assertions are, however, different, to ensure that changing a
4458 positive assertion into a negative assertion changes its result. Back‐
4459 tracking into (*COMMIT), (*SKIP), or (*PRUNE) causes a negative asser‐
4460 tion to be true, without considering any further alternative branches
4461 in the assertion. Backtracking into (*THEN) causes it to skip to the
4462 next enclosing alternative within the assertion (the normal behavior),
4463 but if the assertion does not have such an alternative, (*THEN) behaves
4464 like (*PRUNE).
4465
4466 Backtracking Verbs in Subroutines
4467
4468 These behaviors occur regardless of whether the subpattern is called
4469 recursively. The treatment of subroutines in Perl is different in some
4470 cases.
4471
4472 * (*FAIL) in a subpattern called as a subroutine has its normal ef‐
4473 fect: it forces an immediate backtrack.
4474
4475 * (*ACCEPT) in a subpattern called as a subroutine causes the subrou‐
4476 tine match to succeed without any further processing. Matching then
4477 continues after the subroutine call.
4478
4479 * (*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as a sub‐
4480 routine cause the subroutine match to fail.
4481
4482 * (*THEN) skips to the next alternative in the innermost enclosing
4483 group within the subpattern that has alternatives. If there is no
4484 such group within the subpattern, (*THEN) causes the subroutine
4485 match to fail.
4486
4487Ericsson AB stdlib 3.14.2.1 re(3)