re(3) Erlang Module Definition re(3)
2
3
4
6 re - Perl-like regular expressions for Erlang.
7
9 This module contains regular expression matching functions for strings
10 and binaries.
11
The regular expression syntax and semantics resemble those of Perl.
13
14 The matching algorithms of the library are based on the PCRE library,
15 but not all of the PCRE library is interfaced and some parts of the li‐
16 brary go beyond what PCRE offers. Currently PCRE version 8.40 (release
17 date 2017-01-11) is used. The sections of the PCRE documentation that
18 are relevant to this module are included here.
19
20 Note:
21 The Erlang literal syntax for strings uses the "\" (backslash) charac‐
22 ter as an escape code. You need to escape backslashes in literal
23 strings, both in your code and in the shell, with an extra backslash,
24 that is, "\\".
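
As a minimal sketch of the escaping rule (the subject string "abc123"
is made up for illustration), the following shell expressions show
that the literal "\\d+" is the three-character regular expression \d+:

```erlang
Pattern = "\\d+".          % the regular expression \d+ (3 characters)
Len = length(Pattern).     % 3
Digits = re:run("abc123", Pattern, [{capture, first, list}]).
%% Digits is {match,["123"]}
```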
25
26
28 mp() = {re_pattern, term(), term(), term(), term()}
29
30 Opaque data type containing a compiled regular expression. mp()
31 is guaranteed to be a tuple() having the atom re_pattern as its
32 first element, to allow for matching in guards. The arity of the
33 tuple or the content of the other fields can change in future
34 Erlang/OTP releases.
35
36 nl_spec() = cr | crlf | lf | anycrlf | any
37
38 compile_option() =
39 unicode | anchored | caseless | dollar_endonly | dotall |
40 extended | firstline | multiline | no_auto_capture |
41 dupnames | ungreedy |
42 {newline, nl_spec()} |
43 bsr_anycrlf | bsr_unicode | no_start_optimize | ucp |
44 never_utf
45
46 replace_fun() =
47 fun((binary(), [binary()]) -> iodata() | unicode:charlist())
48
50 version() -> binary()
51
Returns a binary containing the PCRE version of the system that
was used in the Erlang/OTP compilation.
54
55 compile(Regexp) -> {ok, MP} | {error, ErrSpec}
56
57 Types:
58
59 Regexp = iodata()
60 MP = mp()
61 ErrSpec =
62 {ErrString :: string(), Position :: integer() >= 0}
63
The same as compile(Regexp,[]).
65
66 compile(Regexp, Options) -> {ok, MP} | {error, ErrSpec}
67
68 Types:
69
70 Regexp = iodata() | unicode:charlist()
71 Options = [Option]
72 Option = compile_option()
73 MP = mp()
74 ErrSpec =
75 {ErrString :: string(), Position :: integer() >= 0}
76
77 Compiles a regular expression, with the syntax described below,
78 into an internal format to be used later as a parameter to run/2
79 and run/3.
80
81 Compiling the regular expression before matching is useful if
82 the same expression is to be used in matching against multiple
83 subjects during the lifetime of the program. Compiling once and
84 executing many times is far more efficient than compiling each
85 time one wants to match.
86
87 When option unicode is specified, the regular expression is to
88 be specified as a valid Unicode charlist(), otherwise as any
89 valid iodata().
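
A short sketch of the compile-once, run-many pattern described above.
The pattern and the subject strings are illustrative assumptions, not
taken from this manual. The last expression shows the shape of the
error tuple for a malformed pattern:

```erlang
{ok, MP} = re:compile("[0-9]+").
%% Reuse the compiled pattern against several subjects.
Hits = [re:run(S, MP, [{capture, none}]) || S <- ["a1", "bb", "42"]].
%% Hits is [match,nomatch,match]

%% An unterminated character class fails to compile.
{error, {ErrString, Position}} = re:compile("[0-9").
```

ErrString is a human-readable description and Position is the offset
in the pattern at which the error was detected.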
90
91 Options:
92
93 unicode:
94 The regular expression is specified as a Unicode charlist()
95 and the resulting regular expression code is to be run
96 against a valid Unicode charlist() subject. Also consider
97 option ucp when using Unicode characters.
98
99 anchored:
100 The pattern is forced to be "anchored", that is, it is con‐
101 strained to match only at the first matching point in the
102 string that is searched (the "subject string"). This effect
103 can also be achieved by appropriate constructs in the pat‐
104 tern itself.
105
106 caseless:
107 Letters in the pattern match both uppercase and lowercase
108 letters. It is equivalent to Perl option /i and can be
109 changed within a pattern by a (?i) option setting. Uppercase
110 and lowercase letters are defined as in the ISO 8859-1 char‐
111 acter set.
112
113 dollar_endonly:
114 A dollar metacharacter in the pattern matches only at the
115 end of the subject string. Without this option, a dollar
116 also matches immediately before a newline at the end of the
117 string (but not before any other newlines). This option is
118 ignored if option multiline is specified. There is no equiv‐
119 alent option in Perl, and it cannot be set within a pattern.
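
For illustration only (the subject "abc\n" is an assumed example), the
difference is visible when the subject ends with a newline:

```erlang
Default = re:run("abc\n", "c$", [{capture, none}]).
%% Default is match: $ matches before the final newline
EndOnly = re:run("abc\n", "c$", [dollar_endonly, {capture, none}]).
%% EndOnly is nomatch: $ matches only at the very end
```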
120
121 dotall:
122 A dot in the pattern matches all characters, including those
123 indicating newline. Without it, a dot does not match when
124 the current position is at a newline. This option is equiva‐
125 lent to Perl option /s and it can be changed within a pat‐
126 tern by a (?s) option setting. A negative class, such as
127 [^a], always matches newline characters, independent of the
128 setting of this option.
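
A small sketch with an assumed two-line subject, showing the dot with
and without dotall, and a negative class for comparison:

```erlang
NoDotall = re:run("a\nb", "a.b", [{capture, none}]).
%% NoDotall is nomatch: the dot does not match the newline
WithDotall = re:run("a\nb", "a.b", [dotall, {capture, none}]).
%% WithDotall is match
NegClass = re:run("a\nb", "a[^x]b", [{capture, none}]).
%% NegClass is match: [^x] matches newline regardless of dotall
```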
129
130 extended:
131 If this option is set, most white space characters in the
132 pattern are totally ignored except when escaped or inside a
133 character class. However, white space is not allowed within
134 sequences such as (?> that introduce various parenthesized
135 subpatterns, nor within a numerical quantifier such as
136 {1,3}. However, ignorable white space is permitted between
137 an item and a following quantifier and between a quantifier
138 and a following + that indicates possessiveness.
139
White space formerly did not include the VT character (code
11), because Perl did not treat this character as white
space. However, Perl changed at release 5.18, so PCRE followed
at release 8.34, and VT is now treated as white space.
144
145 This also causes characters between an unescaped # outside a
146 character class and the next newline, inclusive, to be ig‐
147 nored. This is equivalent to Perl's /x option, and it can be
148 changed within a pattern by a (?x) option setting.
149
150 With this option, comments inside complicated patterns can
151 be included. However, notice that this applies only to data
152 characters. Whitespace characters can never appear within
153 special character sequences in a pattern, for example within
154 sequence (?( that introduces a conditional subpattern.
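
As a sketch (the pattern and subject are assumptions chosen for
illustration), the same pattern text behaves differently with and
without extended, because the spaces and the trailing # comment are
ignored only when the option is set:

```erlang
Regexp = "(\\d{4}) - (\\d{2})  # year, then month".
Ext = re:run("2024-05", Regexp, [extended, {capture, all_but_first, list}]).
%% Ext is {match,["2024","05"]}
Plain = re:run("2024-05", Regexp, [{capture, all_but_first, list}]).
%% Plain is nomatch: the spaces and the comment text are literal
```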
155
156 firstline:
157 An unanchored pattern is required to match before or at the
158 first newline in the subject string, although the matched
159 text can continue over the newline.
160
161 multiline:
162 By default, PCRE treats the subject string as consisting of
163 a single line of characters (even if it contains newlines).
164 The "start of line" metacharacter (^) matches only at the
165 start of the string, while the "end of line" metacharacter
166 ($) matches only at the end of the string, or before a ter‐
167 minating newline (unless option dollar_endonly is speci‐
168 fied). This is the same as in Perl.
169
170 When this option is specified, the "start of line" and "end
171 of line" constructs match immediately following or immedi‐
172 ately before internal newlines in the subject string, re‐
173 spectively, as well as at the very start and end. This is
174 equivalent to Perl option /m and can be changed within a
175 pattern by a (?m) option setting. If there are no newlines
176 in a subject string, or no occurrences of ^ or $ in a pat‐
177 tern, setting multiline has no effect.
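
A minimal sketch with an assumed two-line subject:

```erlang
Single = re:run("one\ntwo", "^two$", [{capture, none}]).
%% Single is nomatch: ^ matches only at the start of the subject
Multi = re:run("one\ntwo", "^two$", [multiline, {capture, none}]).
%% Multi is match: ^ also matches after the internal newline
```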
178
179 no_auto_capture:
180 Disables the use of numbered capturing parentheses in the
181 pattern. Any opening parenthesis that is not followed by ?
182 behaves as if it is followed by ?:. Named parentheses can
183 still be used for capturing (and they acquire numbers in the
184 usual way). There is no equivalent option in Perl.
185
186 dupnames:
187 Names used to identify capturing subpatterns need not be
188 unique. This can be helpful for certain types of pattern
189 when it is known that only one instance of the named subpat‐
190 tern can ever be matched. More details of named subpatterns
191 are provided below.
192
193 ungreedy:
194 Inverts the "greediness" of the quantifiers so that they are
195 not greedy by default, but become greedy if followed by "?".
196 It is not compatible with Perl. It can also be set by a (?U)
197 option setting within the pattern.
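
To illustrate the inversion (the subject "<a><b>" is an assumed
example), compare the default, ungreedy, and ungreedy with an explicit
"?" quantifier suffix:

```erlang
Greedy = re:run("<a><b>", "<.*>", [{capture, first, list}]).
%% Greedy is {match,["<a><b>"]}
Lazy = re:run("<a><b>", "<.*>", [ungreedy, {capture, first, list}]).
%% Lazy is {match,["<a>"]}
Flipped = re:run("<a><b>", "<.*?>", [ungreedy, {capture, first, list}]).
%% Flipped is {match,["<a><b>"]}: "?" makes the quantifier greedy again
```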
198
199 {newline, NLSpec}:
200 Overrides the default definition of a newline in the subject
201 string, which is LF (ASCII 10) in Erlang.
202
203 cr:
Newline is indicated by a single character CR (ASCII 13).
205
206 lf:
207 Newline is indicated by a single character LF (ASCII 10),
208 the default.
209
210 crlf:
211 Newline is indicated by the two-character CRLF (ASCII 13
212 followed by ASCII 10) sequence.
213
214 anycrlf:
215 Any of the three preceding sequences is to be recognized.
216
217 any:
218 Any of the newline sequences above, and the Unicode se‐
219 quences VT (vertical tab, U+000B), FF (formfeed, U+000C),
220 NEL (next line, U+0085), LS (line separator, U+2028), and
221 PS (paragraph separator, U+2029).
222
223 bsr_anycrlf:
Specifies that \R is to match only the CR, LF,
225 or CRLF sequences, not the Unicode-specific newline charac‐
226 ters.
227
228 bsr_unicode:
Specifies that \R is to match all the Unicode
230 newline characters (including CRLF, and so on, the default).
231
232 no_start_optimize:
233 Disables optimization that can malfunction if "Special
234 start-of-pattern items" are present in the regular expres‐
235 sion. A typical example would be when matching "DEFABC"
236 against "(*COMMIT)ABC", where the start optimization of PCRE
237 would skip the subject up to "A" and never realize that the
238 (*COMMIT) instruction is to have made the matching fail.
239 This option is only relevant if you use "start-of-pattern
240 items", as discussed in section PCRE Regular Expression De‐
241 tails.
242
243 ucp:
244 Specifies that Unicode character properties are to be used
245 when resolving \B, \b, \D, \d, \S, \s, \W and \w. Without
246 this flag, only ISO Latin-1 properties are used. Using Uni‐
247 code properties hurts performance, but is semantically cor‐
248 rect when working with Unicode characters beyond the ISO
249 Latin-1 range.
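
A sketch using a character outside the ISO Latin-1 range. The choice
of Greek small letter alpha (code point 945) is an assumption made for
illustration, written as an integer list to avoid source encoding
issues:

```erlang
Alpha = [945].   % U+03B1, GREEK SMALL LETTER ALPHA
NoUcp = re:run(Alpha, "\\w", [unicode, {capture, none}]).
%% NoUcp is nomatch: \w does not use Unicode properties
WithUcp = re:run(Alpha, "\\w", [unicode, ucp, {capture, none}]).
%% WithUcp is match
```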
250
251 never_utf:
252 Specifies that the (*UTF) and/or (*UTF8) "start-of-pattern
253 items" are forbidden. This flag cannot be combined with op‐
254 tion unicode. Useful if ISO Latin-1 patterns from an exter‐
255 nal source are to be compiled.
256
257 inspect(MP, Item) -> {namelist, [binary()]}
258
259 Types:
260
261 MP = mp()
262 Item = namelist
263
264 Takes a compiled regular expression and an item, and returns the
265 relevant data from the regular expression. The only supported
266 item is namelist, which returns the tuple {namelist, [bi‐
267 nary()]}, containing the names of all (unique) named subpatterns
268 in the regular expression. For example:
269
270 1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
271 {ok,{re_pattern,3,0,0,
272 <<69,82,67,80,119,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255,
273 255,255,...>>}}
274 2> re:inspect(MP,namelist).
275 {namelist,[<<"A">>,<<"B">>,<<"C">>]}
276 3> {ok,MPD} = re:compile("(?<C>A)|(?<B>B)|(?<C>C)",[dupnames]).
277 {ok,{re_pattern,3,0,0,
278 <<69,82,67,80,119,0,0,0,0,0,8,0,1,0,0,0,255,255,255,255,
279 255,255,...>>}}
280 4> re:inspect(MPD,namelist).
281 {namelist,[<<"B">>,<<"C">>]}
282
283 Notice in the second example that the duplicate name only occurs
284 once in the returned list, and that the list is in alphabetical
285 order regardless of where the names are positioned in the regu‐
286 lar expression. The order of the names is the same as the order
287 of captured subexpressions if {capture, all_names} is specified
288 as an option to run/3. You can therefore create a name-to-value
289 mapping from the result of run/3 like this:
290
291 1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
292 {ok,{re_pattern,3,0,0,
293 <<69,82,67,80,119,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255,
294 255,255,...>>}}
295 2> {namelist, N} = re:inspect(MP,namelist).
296 {namelist,[<<"A">>,<<"B">>,<<"C">>]}
297 3> {match,L} = re:run("AA",MP,[{capture,all_names,binary}]).
298 {match,[<<"A">>,<<>>,<<>>]}
299 4> NameMap = lists:zip(N,L).
300 [{<<"A">>,<<"A">>},{<<"B">>,<<>>},{<<"C">>,<<>>}]
301
302 replace(Subject, RE, Replacement) -> iodata() | unicode:charlist()
303
304 Types:
305
306 Subject = iodata() | unicode:charlist()
307 RE = mp() | iodata()
308 Replacement = iodata() | unicode:charlist() | replace_fun()
309
310 Same as replace(Subject, RE, Replacement, []).
311
312 replace(Subject, RE, Replacement, Options) ->
313 iodata() | unicode:charlist()
314
315 Types:
316
317 Subject = iodata() | unicode:charlist()
318 RE = mp() | iodata() | unicode:charlist()
319 Replacement = iodata() | unicode:charlist() | replace_fun()
320 Options = [Option]
321 Option =
322 anchored | global | notbol | noteol | notempty |
323 notempty_atstart |
324 {offset, integer() >= 0} |
325 {newline, NLSpec} |
326 bsr_anycrlf |
327 {match_limit, integer() >= 0} |
328 {match_limit_recursion, integer() >= 0} |
329 bsr_unicode |
330 {return, ReturnType} |
331 CompileOpt
332 ReturnType = iodata | list | binary
333 CompileOpt = compile_option()
334 NLSpec = cr | crlf | lf | anycrlf | any
335
336 Replaces the matched part of the Subject string with Replace‐
337 ment.
338
339 The permissible options are the same as for run/3, except that
340 option capture is not allowed. Instead a {return, ReturnType} is
341 present. The default return type is iodata, constructed in a way
342 to minimize copying. The iodata result can be used directly in
343 many I/O operations. If a flat list() is desired, specify {re‐
344 turn, list}. If a binary is desired, specify {return, binary}.
345
346 As in function run/3, an mp() compiled with option unicode re‐
347 quires Subject to be a Unicode charlist(). If compilation is
348 done implicitly and the unicode compilation option is specified
to this function, both the regular expression and Subject are to be
specified as valid Unicode charlist()s.
351
If the replacement is given as a string, it can contain the special
character &, which inserts the whole matching expression in the
result, and the special sequences \N (where N is an integer > 0),
\gN, or \g{N}, each of which inserts subexpression number N in the
result. If no subexpression with that number is generated by the
regular expression, nothing is inserted.
358
359 To insert an & or a \ in the result, precede it with a \. Notice
360 that Erlang already gives a special meaning to \ in literal
361 strings, so a single \ must be written as "\\" and therefore a
362 double \ as "\\\\".
363
364 Example:
365
366 re:replace("abcd","c","[&]",[{return,list}]).
367
368 gives
369
370 "ab[c]d"
371
372 while
373
374 re:replace("abcd","c","[\\&]",[{return,list}]).
375
376 gives
377
378 "ab[&]d"
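
A few further sketches of the replacement sequences, using made-up
subjects. They show numbered references, & combined with option
global, and the \g{N} and \gN forms:

```erlang
Swapped = re:replace("John Smith", "(\\w+) (\\w+)", "\\2, \\1",
                     [{return, list}]).
%% Swapped is "Smith, John"
Wrapped = re:replace("a-b-c", "[a-z]", "<&>", [global, {return, binary}]).
%% Wrapped is <<"<a>-<b>-<c>">>
Doubled = re:replace("abc", "(b)", "\\g{1}\\g1", [{return, list}]).
%% Doubled is "abbc"
```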
379
380 If the replacement is given as a fun, it will be called with the
381 whole matching expression as the first argument and a list of
382 subexpression matches in the order in which they appear in the
383 regular expression. The returned value will be inserted in the
384 result.
385
386 Example:
387
re:replace("abcd", ".(.)",
           fun(Whole, [<<C>>]) ->
               <<$#, Whole/binary, $-, (C - $a + $A), $#>>
           end,
           [{return, list}]).
389
390 gives
391
392 "#ab-B#cd"
393
394 Note:
395 Non-matching optional subexpressions will not be included in the
396 list of subexpression matches if they are the last subexpres‐
397 sions in the regular expression.
398
399 Example:
400
401 The regular expression "(a)(b)?(c)?" ("a", optionally followed
402 by "b", optionally followed by "c") will create the following
403 subexpression lists:
404
405 * [<<"a">>, <<"b">>, <<"c">>] when applied to the string "abc"
406
407 * [<<"a">>, <<>>, <<"c">>] when applied to the string "acx"
408
409 * [<<"a">>, <<"b">>] when applied to the string "abx"
410
411 * [<<"a">>] when applied to the string "axx"
412
413 As with run/3, compilation errors raise the badarg exception.
414 compile/2 can be used to get more information about the error.
415
416 run(Subject, RE) -> {match, Captured} | nomatch
417
418 Types:
419
420 Subject = iodata() | unicode:charlist()
421 RE = mp() | iodata()
422 Captured = [CaptureData]
423 CaptureData = {integer(), integer()}
424
425 Same as run(Subject,RE,[]).
426
427 run(Subject, RE, Options) ->
428 {match, Captured} | match | nomatch | {error, ErrType}
429
430 Types:
431
432 Subject = iodata() | unicode:charlist()
433 RE = mp() | iodata() | unicode:charlist()
434 Options = [Option]
435 Option =
436 anchored | global | notbol | noteol | notempty |
437 notempty_atstart | report_errors |
438 {offset, integer() >= 0} |
439 {match_limit, integer() >= 0} |
440 {match_limit_recursion, integer() >= 0} |
441 {newline, NLSpec :: nl_spec()} |
442 bsr_anycrlf | bsr_unicode |
443 {capture, ValueSpec} |
444 {capture, ValueSpec, Type} |
445 CompileOpt
446 Type = index | list | binary
447 ValueSpec =
448 all | all_but_first | all_names | first | none | Val‐
449 ueList
450 ValueList = [ValueID]
451 ValueID = integer() | string() | atom()
452 CompileOpt = compile_option()
453 See compile/2.
454 Captured = [CaptureData] | [[CaptureData]]
455 CaptureData =
456 {integer(), integer()} | ListConversionData | binary()
457 ListConversionData =
458 string() |
459 {error, string(), binary()} |
460 {incomplete, string(), binary()}
461 ErrType =
462 match_limit | match_limit_recursion | {compile, Com‐
463 pileErr}
464 CompileErr =
465 {ErrString :: string(), Position :: integer() >= 0}
466
467 Executes a regular expression matching, and returns
468 match/{match, Captured} or nomatch. The regular expression can
469 be specified either as iodata() in which case it is automati‐
470 cally compiled (as by compile/2) and executed, or as a precom‐
471 piled mp() in which case it is executed against the subject di‐
472 rectly.
473
474 When compilation is involved, exception badarg is thrown if a
475 compilation error occurs. Call compile/2 to get information
476 about the location of the error in the regular expression.
477
478 If the regular expression is previously compiled, the option
479 list can only contain the following options:
480
481 * anchored
482
483 * {capture, ValueSpec}/{capture, ValueSpec, Type}
484
485 * global
486
487 * {match_limit, integer() >= 0}
488
489 * {match_limit_recursion, integer() >= 0}
490
491 * {newline, NLSpec}
492
493 * notbol
494
495 * notempty
496
497 * notempty_atstart
498
499 * noteol
500
501 * {offset, integer() >= 0}
502
503 * report_errors
504
505 Otherwise all options valid for function compile/2 are also al‐
506 lowed. Options allowed both for compilation and execution of a
507 match, namely anchored and {newline, NLSpec}, affect both the
508 compilation and execution if present together with a non-precom‐
509 piled regular expression.
510
511 If the regular expression was previously compiled with option
512 unicode, Subject is to be provided as a valid Unicode
513 charlist(), otherwise any iodata() will do. If compilation is
514 involved and option unicode is specified, both Subject and the
515 regular expression are to be specified as valid Unicode
516 charlists().
517
518 {capture, ValueSpec}/{capture, ValueSpec, Type} defines what to
519 return from the function upon successful matching. The capture
520 tuple can contain both a value specification, telling which of
521 the captured substrings are to be returned, and a type specifi‐
522 cation, telling how captured substrings are to be returned (as
523 index tuples, lists, or binaries). The options are described in
524 detail below.
525
526 If the capture options describe that no substring capturing is
527 to be done ({capture, none}), the function returns the single
528 atom match upon successful matching, otherwise the tuple {match,
529 ValueList}. Disabling capturing can be done either by specifying
530 none or an empty list as ValueSpec.
531
532 Option report_errors adds the possibility that an error tuple is
533 returned. The tuple either indicates a matching error
534 (match_limit or match_limit_recursion), or a compilation error,
535 where the error tuple has the format {error, {compile, Com‐
536 pileErr}}. Notice that if option report_errors is not specified,
537 the function never returns error tuples, but reports compilation
538 errors as a badarg exception and failed matches because of ex‐
539 ceeded match limits simply as nomatch.
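
The two error-reporting modes can be sketched as follows. The
malformed pattern "a(b" is an assumption chosen for illustration:

```erlang
%% Without report_errors, a compilation error raises badarg.
Raised = try re:run("abc", "a(b") catch error:badarg -> badarg end.
%% Raised is badarg

%% With report_errors, the same error is returned as a tuple.
Reported = re:run("abc", "a(b", [report_errors]).
%% Reported is {error,{compile,{ErrString,Position}}}
```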
540
541 The following options are relevant for execution:
542
543 anchored:
544 Limits run/3 to matching at the first matching position. If
545 a pattern was compiled with anchored, or turned out to be
546 anchored by virtue of its contents, it cannot be made unan‐
547 chored at matching time, hence there is no unanchored op‐
548 tion.
549
550 global:
551 Implements global (repetitive) search (flag g in Perl). Each
552 match is returned as a separate list() containing the spe‐
553 cific match and any matching subexpressions (or as specified
by option capture). The Captured part of the return value is
555 hence a list() of list()s when this option is specified.
556
557 The interaction of option global with a regular expression
558 that matches an empty string surprises some users. When op‐
559 tion global is specified, run/3 handles empty matches in the
560 same way as Perl: a zero-length match at any point is also
561 retried with options [anchored, notempty_atstart]. If that
562 search gives a result of length > 0, the result is included.
563 Example:
564
565 re:run("cat","(|at)",[global]).
566
567 The following matchings are performed:
568
569 At offset 0:
The regular expression (|at) first matches at the initial
571 position of string cat, giving the result set
572 [{0,0},{0,0}] (the second {0,0} is because of the subex‐
573 pression marked by the parentheses). As the length of the
574 match is 0, we do not advance to the next position yet.
575
576 At offset 0 with [anchored, notempty_atstart]:
577 The search is retried with options [anchored, notempty_at‐
578 start] at the same position, which does not give any in‐
579 teresting result of longer length, so the search position
580 is advanced to the next character (a).
581
582 At offset 1:
583 The search results in [{1,0},{1,0}], so this search is
584 also repeated with the extra options.
585
586 At offset 1 with [anchored, notempty_atstart]:
Alternative at is found and the result is [{1,2},{1,2}].
588 The result is added to the list of results and the posi‐
589 tion in the search string is advanced two steps.
590
591 At offset 3:
592 The search once again matches the empty string, giving
593 [{3,0},{3,0}].
594
At offset 3 with [anchored, notempty_atstart]:
596 This gives no result of length > 0 and we are at the last
597 position, so the global search is complete.
598
599 The result of the call is:
600
601 {match,[[{0,0},{0,0}],[{1,0},{1,0}],[{1,2},{1,2}],[{3,0},{3,0}]]}
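
For comparison, a more typical use of global, sketched with an assumed
subject and a pattern that cannot match the empty string. Each match
is returned as its own list:

```erlang
Numbers = re:run("a1b22c333", "[0-9]+", [global, {capture, first, list}]).
%% Numbers is {match,[["1"],["22"],["333"]]}
```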
602
603 notempty:
604 An empty string is not considered to be a valid match if
605 this option is specified. If alternatives in the pattern ex‐
606 ist, they are tried. If all the alternatives match the empty
607 string, the entire match fails.
608
609 Example:
610
611 If the following pattern is applied to a string not begin‐
612 ning with "a" or "b", it would normally match the empty
613 string at the start of the subject:
614
615 a?b?
616
617 With option notempty, this match is invalid, so run/3
618 searches further into the string for occurrences of "a" or
619 "b".
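
A sketch of this behavior with assumed subject strings:

```erlang
Empty = re:run("xyz", "a?b?").
%% Empty is {match,[{0,0}]}: the empty string at offset 0
None = re:run("xyz", "a?b?", [notempty]).
%% None is nomatch: no "a" or "b" anywhere in the subject
Later = re:run("xyb", "a?b?", [notempty]).
%% Later is {match,[{2,1}]}: the search moves on to the "b"
```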
620
621 notempty_atstart:
622 Like notempty, except that an empty string match that is not
623 at the start of the subject is permitted. If the pattern is
624 anchored, such a match can occur only if the pattern con‐
625 tains \K.
626
627 Perl has no direct equivalent of notempty or notempty_at‐
628 start, but it does make a special case of a pattern match of
629 the empty string within its split() function, and when using
630 modifier /g. The Perl behavior can be emulated after match‐
631 ing a null string by first trying the match again at the
632 same offset with notempty_atstart and anchored, and then, if
633 that fails, by advancing the starting offset (see below) and
634 trying an ordinary match again.
635
636 notbol:
637 Specifies that the first character of the subject string is
638 not the beginning of a line, so the circumflex metacharacter
639 is not to match before it. Setting this without multiline
640 (at compile time) causes circumflex never to match. This op‐
641 tion only affects the behavior of the circumflex metacharac‐
642 ter. It does not affect \A.
643
644 noteol:
645 Specifies that the end of the subject string is not the end
646 of a line, so the dollar metacharacter is not to match it
647 nor (except in multiline mode) a newline immediately before
648 it. Setting this without multiline (at compile time) causes
649 dollar never to match. This option affects only the behavior
650 of the dollar metacharacter. It does not affect \Z or \z.
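
The effect of notbol and noteol can be sketched as follows (the
subject "abc" is an assumed example). The last expression shows that
\A is unaffected by notbol:

```erlang
Bol = re:run("abc", "^a", [notbol, {capture, none}]).
%% Bol is nomatch: ^ no longer matches at the start
Eol = re:run("abc", "c$", [noteol, {capture, none}]).
%% Eol is nomatch: $ no longer matches at the end
AbsStart = re:run("abc", "\\Aa", [notbol, {capture, none}]).
%% AbsStart is match
```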
651
652 report_errors:
653 Gives better control of the error handling in run/3. When
654 specified, compilation errors (if the regular expression is
655 not already compiled) and runtime errors are explicitly re‐
656 turned as an error tuple.
657
658 The following are the possible runtime errors:
659
660 match_limit:
661 The PCRE library sets a limit on how many times the inter‐
662 nal match function can be called. Defaults to 10,000,000
663 in the library compiled for Erlang. If {error,
664 match_limit} is returned, the execution of the regular ex‐
665 pression has reached this limit. This is normally to be
666 regarded as a nomatch, which is the default return value
667 when this occurs, but by specifying report_errors, you are
668 informed when the match fails because of too many internal
669 calls.
670
671 match_limit_recursion:
672 This error is very similar to match_limit, but occurs when
673 the internal match function of PCRE is "recursively"
674 called more times than the match_limit_recursion limit,
675 which defaults to 10,000,000 as well. Notice that as long
as the match_limit and match_limit_recursion values are kept
677 at the default values, the match_limit_recursion error
678 cannot occur, as the match_limit error occurs before that
679 (each recursive call is also a call, but not conversely).
680 Both limits can however be changed, either by setting lim‐
681 its directly in the regular expression string (see section
PCRE Regular Expression Details) or by specifying options
683 to run/3.
684
685 It is important to understand that what is referred to as
686 "recursion" when limiting matches is not recursion on the C
687 stack of the Erlang machine or on the Erlang process stack.
688 The PCRE version compiled into the Erlang VM uses machine
689 "heap" memory to store values that must be kept over recur‐
690 sion in regular expression matches.
691
692 {match_limit, integer() >= 0}:
693 Limits the execution time of a match in an implementation-
694 specific way. It is described as follows by the PCRE docu‐
695 mentation:
696
697 The match_limit field provides a means of preventing PCRE from using
698 up a vast amount of resources when running patterns that are not going
699 to match, but which have a very large number of possibilities in their
700 search trees. The classic example is a pattern that uses nested
701 unlimited repeats.
702
703 Internally, pcre_exec() uses a function called match(), which it calls
704 repeatedly (sometimes recursively). The limit set by match_limit is
705 imposed on the number of times this function is called during a match,
706 which has the effect of limiting the amount of backtracking that can
707 take place. For patterns that are not anchored, the count restarts
708 from zero for each position in the subject string.
709
710 This means that runaway regular expression matches can fail
711 faster if the limit is lowered using this option. The de‐
712 fault value 10,000,000 is compiled into the Erlang VM.
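
A sketch of a lowered limit. The pattern uses nested unlimited
repeats, and the subject cannot match because of its trailing "!".
Both the subject and the limit of 1000 are arbitrary assumptions for
illustration, and the exact number of internal calls needed is
implementation-specific:

```erlang
Subject = lists:duplicate(25, $a) ++ "!".
Quiet = re:run(Subject, "(a+)+$", [{match_limit, 1000}]).
%% Quiet is nomatch
Loud = re:run(Subject, "(a+)+$", [{match_limit, 1000}, report_errors]).
%% Loud is {error,match_limit}
```

Without report_errors, the exceeded limit cannot be told apart from an
ordinary failed match.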
713
714 Note:
This option in no way affects the execution of the Erlang VM
in terms of "long running BIFs". run/3 always gives control
back to the scheduler of Erlang processes at intervals that
ensure the real-time properties of the Erlang system.
719
720
721 {match_limit_recursion, integer() >= 0}:
722 Limits the execution time and memory consumption of a match
723 in an implementation-specific way, very similar to
724 match_limit. It is described as follows by the PCRE documen‐
725 tation:
726
727 The match_limit_recursion field is similar to match_limit, but instead
728 of limiting the total number of times that match() is called, it
729 limits the depth of recursion. The recursion depth is a smaller number
730 than the total number of calls, because not all calls to match() are
731 recursive. This limit is of use only if it is set smaller than
732 match_limit.
733
734 Limiting the recursion depth limits the amount of machine stack that
735 can be used, or, when PCRE has been compiled to use memory on the heap
736 instead of the stack, the amount of heap memory that can be used.
737
738 The Erlang VM uses a PCRE library where heap memory is used
739 when regular expression match recursion occurs. This there‐
740 fore limits the use of machine heap, not C stack.
741
742 Specifying a lower value can result in matches with deep re‐
743 cursion failing, when they should have matched:
744
745 1> re:run("aaaaaaaaaaaaaz","(a+)*z").
746 {match,[{0,14},{0,13}]}
747 2> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5}]).
748 nomatch
749 3> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5},report_errors]).
750 {error,match_limit_recursion}
751
752 This option and option match_limit are only to be used in
753 rare cases. Understanding of the PCRE library internals is
754 recommended before tampering with these limits.
755
756 {offset, integer() >= 0}:
757 Start matching at the offset (position) specified in the
758 subject string. The offset is zero-based, so that the de‐
759 fault is {offset,0} (all of the subject string).
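
For illustration, with an assumed subject that contains the pattern
twice:

```erlang
FromStart = re:run("abcabc", "abc").
%% FromStart is {match,[{0,3}]}
FromOne = re:run("abcabc", "abc", [{offset, 1}]).
%% FromOne is {match,[{3,3}]}: the first occurrence is skipped
```

The returned index is still relative to the start of the subject, not
to the offset.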
760
761 {newline, NLSpec}:
762 Overrides the default definition of a newline in the subject
763 string, which is LF (ASCII 10) in Erlang.
764
765 cr:
766 Newline is indicated by a single character CR (ASCII 13).
767
768 lf:
769 Newline is indicated by a single character LF (ASCII 10),
770 the default.
771
772 crlf:
773 Newline is indicated by the two-character CRLF (ASCII 13
774 followed by ASCII 10) sequence.
775
776 anycrlf:
Any of the three preceding sequences is to be recognized.
778
779 any:
780 Any of the newline sequences above, and the Unicode se‐
781 quences VT (vertical tab, U+000B), FF (formfeed, U+000C),
782 NEL (next line, U+0085), LS (line separator, U+2028), and
783 PS (paragraph separator, U+2029).
784
785 bsr_anycrlf:
Specifies that \R is to match only the CR, LF,
787 or CRLF sequences, not the Unicode-specific newline charac‐
788 ters. (Overrides the compilation option.)
789
790 bsr_unicode:
Specifies that \R is to match all the Unicode
792 newline characters (including CRLF, and so on, the default).
793 (Overrides the compilation option.)
794
795 {capture, ValueSpec}/{capture, ValueSpec, Type}:
796 Specifies which captured substrings are returned and in what
797 format. By default, run/3 captures all of the matching part
798 of the substring and all capturing subpatterns (all of the
799 pattern is automatically captured). The default return type
800 is (zero-based) indexes of the captured parts of the string,
801 specified as {Offset,Length} pairs (the index Type of cap‐
802 turing).
803
804 As an example of the default behavior, the following call
805 returns, as first and only captured string, the matching
806 part of the subject ("abcd" in the middle) as an index pair
807 {3,4}, where character positions are zero-based, just as in
808 offsets:
809
810 re:run("ABCabcdABC","abcd",[]).
811
812 The return value of this call is:
813
814 {match,[{3,4}]}
815
816 Another (and quite common) case is where the regular expres‐
817 sion matches all of the subject:
818
819 re:run("ABCabcdABC",".*abcd.*",[]).
820
821 Here the return value correspondingly points out all of the
822 string, beginning at index 0, and it is 10 characters long:
823
824 {match,[{0,10}]}
825
826 If the regular expression contains capturing subpatterns,
827 like in:
828
829 re:run("ABCabcdABC",".*(abcd).*",[]).
830
831 all of the matched subject is captured, as well as the cap‐
832 tured substrings:
833
834 {match,[{0,10},{3,4}]}
835
836 The complete matching pattern always gives the first return
837 value in the list and the remaining subpatterns are added in
838 the order they occurred in the regular expression.
839
840 The capture tuple is built up as follows:
841
842 ValueSpec:
843 Specifies which captured (sub)patterns are to be returned.
844 ValueSpec can either be an atom describing a predefined
845 set of return values, or a list containing the indexes or
846 the names of specific subpatterns to return.
847
848 The following are the predefined sets of subpatterns:
849
850 all:
851 All captured subpatterns including the complete matching
852 string. This is the default.
853
854 all_names:
855 All named subpatterns in the regular expression, as if a
856 list() of all the names in alphabetical order was speci‐
857 fied. The list of all names can also be retrieved with
858 inspect/2.
859
860 first:
861 Only the first captured subpattern, which is always the
862 complete matching part of the subject. All explicitly
863 captured subpatterns are discarded.
864
865 all_but_first:
866 All but the first matching subpattern, that is, all ex‐
867 plicitly captured subpatterns, but not the complete
868 matching part of the subject string. This is useful if
869 the regular expression as a whole matches a large part
870 of the subject, but the part you are interested in is in
871 an explicitly captured subpattern. If the return type is
872 list or binary, not returning subpatterns you are not
873 interested in is a good way to optimize.
874
875 none:
876 Returns no matching subpatterns, gives the single atom
877 match as the return value of the function when matching
878 successfully instead of the {match, list()} return.
879 Specifying an empty list gives the same behavior.
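
The predefined ValueSpec atoms can be compared side by side in the
shell. The following is a minimal sketch, not an exhaustive
reference; the subject and pattern are borrowed from the surrounding
examples, the variable names S and P are only illustrative, and the
named subpattern FOO is there so that all_names has something to
return:

```erlang
S = "ABCabcdABC".
P = ".*(?<FOO>abcd).*".
%% all (the default): complete match first, then each subpattern
{match, [{0,10}, {3,4}]} = re:run(S, P, [{capture, all}]).
%% all_names: only the named subpatterns, here just FOO
{match, ["abcd"]} = re:run(S, P, [{capture, all_names, list}]).
%% first: only the complete matching part
{match, [<<"ABCabcdABC">>]} = re:run(S, P, [{capture, first, binary}]).
%% all_but_first: only the explicitly captured subpatterns
{match, ["abcd"]} = re:run(S, P, [{capture, all_but_first, list}]).
%% none, or an empty value list: the bare atom match
match = re:run(S, P, [{capture, none}]).
match = re:run(S, P, [{capture, []}]).
```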
880
881 The value list is a list of indexes for the subpatterns to
882 return, where index 0 is for all of the pattern, and 1 is
883 for the first explicit capturing subpattern in the regular
884 expression, and so on. When using named captured subpat‐
885 terns (see below) in the regular expression, one can use
886 atom()s or string()s to specify the subpatterns to be re‐
887 turned. For example, consider the regular expression:
888
889 ".*(abcd).*"
890
891 matched against string "ABCabcdABC", capturing only the
892 "abcd" part (the first explicit subpattern):
893
894 re:run("ABCabcdABC",".*(abcd).*",[{capture,[1]}]).
895
896 The call gives the following result, as the first explic‐
897 itly captured subpattern is "(abcd)", matching "abcd" in
898 the subject, at (zero-based) position 3, of length 4:
899
900 {match,[{3,4}]}
901
902 Consider the same regular expression, but with the subpat‐
903 tern explicitly named 'FOO':
904
905 ".*(?<FOO>abcd).*"
906
907 With this expression, we could still give the index of the
908 subpattern with the following call:
909
910 re:run("ABCabcdABC",".*(?<FOO>abcd).*",[{capture,[1]}]).
911
912 giving the same result as before. But, as the subpattern
913 is named, we can also specify its name in the value list:
914
915 re:run("ABCabcdABC",".*(?<FOO>abcd).*",[{capture,['FOO']}]).
916
917 This would give the same result as the earlier examples,
918 namely:
919
920 {match,[{3,4}]}
921
922 The values list can specify indexes or names not present
923 in the regular expression, in which case the return values
924 vary depending on the type. If the type is index, the tu‐
925 ple {-1,0} is returned for values with no corresponding
926 subpattern in the regular expression, but for the other
927 types (binary and list), the values are the empty binary
928 or list, respectively.
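
As a small illustration of the paragraph above, the following sketch
asks for a subpattern index (2) and a name ('BAR') that do not exist
in the regular expression. Both are made up for this example:

```erlang
%% Type index: the missing subpattern is reported as {-1,0}
{match, [{3,4}, {-1,0}]} =
    re:run("ABCabcdABC", ".*(abcd).*", [{capture, [1, 2]}]).
%% Type list: the missing subpattern is the empty list
{match, ["abcd", []]} =
    re:run("ABCabcdABC", ".*(abcd).*", [{capture, [1, 2], list}]).
%% Type binary: the missing name is the empty binary
{match, [<<"abcd">>, <<>>]} =
    re:run("ABCabcdABC", ".*(abcd).*", [{capture, [1, 'BAR'], binary}]).
```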
929
930 Type:
931 Optionally specifies how captured substrings are to be re‐
932 turned. If omitted, the default of index is used.
933
934 Type can be one of the following:
935
936 index:
937 Returns captured substrings as pairs of byte indexes
938 into the subject string and length of the matching
939 string in the subject (as if the subject string was
940 flattened with erlang:iolist_to_binary/1 or uni‐
941 code:characters_to_binary/2 before matching). Notice
942 that option unicode results in byte-oriented indexes in
943 a (possibly virtual) UTF-8 encoded binary. A byte index
944 tuple {0,2} can therefore represent one or two charac‐
945 ters when unicode is in effect. This can seem counter-
946 intuitive, but has been deemed the most effective and
947 useful way to do it. Returning lists instead can result
948 in simpler code if that is desired. This return type is
949 the default.
950
951 list:
952 Returns matching substrings as lists of characters (Erlang
953 string()s). If option unicode is used in combination with
954 the \C sequence in the regular expression, a
955 captured subpattern can contain bytes that are not valid
956 UTF-8 (\C matches bytes regardless of character encod‐
957 ing). In that case the list capturing can result in the
958 same types of tuples that unicode:characters_to_list/2
959 can return, namely three-tuples with tag incomplete or
960 error, the successfully converted characters and the in‐
961 valid UTF-8 tail of the conversion as a binary. The best
962 strategy is to avoid using the \C sequence when captur‐
963 ing lists.
964
965 binary:
966 Returns matching substrings as binaries. If option uni‐
967 code is used, these binaries are in UTF-8. If the \C se‐
968 quence is used together with unicode, the binaries can
969 be invalid UTF-8.
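
The difference between the three return types, and the byte-oriented
nature of index under option unicode, can be sketched as follows. The
subject is written as a list of code points (U+00E5, U+00E4, U+00F6)
so that the example does not depend on the source file or terminal
encoding; S and P are illustrative names only:

```erlang
S = [16#E5, 16#E4, 16#F6].
P = [16#E4].
%% unicode: each of these characters is two bytes in UTF-8, so the
%% second character starts at byte offset 2 and is 2 bytes long
{match, [{2,2}]} = re:run(S, P, [unicode]).
%% list: the capture comes back as a list of code points
{match, [[16#E4]]} = re:run(S, P, [unicode, {capture, all, list}]).
%% binary: the capture comes back as UTF-8 encoded bytes
{match, [<<16#C3, 16#A4>>]} = re:run(S, P, [unicode, {capture, all, binary}]).
%% without unicode the list is treated as bytes (one byte per element)
{match, [{1,1}]} = re:run(S, P, []).
```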
970
971 In general, subpatterns that were not assigned a value in
972 the match are returned as the tuple {-1,0} when type is in‐
973 dex. Unassigned subpatterns are returned as the empty binary
974 or list, respectively, for other return types. Consider the
975 following regular expression:
976
977 ".*((?<FOO>abdd)|a(..d)).*"
978
979 There are three explicitly capturing subpatterns, where the
980 opening parenthesis position determines the order in the re‐
981 sult, hence ((?<FOO>abdd)|a(..d)) is subpattern index 1,
982 (?<FOO>abdd) is subpattern index 2, and (..d) is subpattern
983 index 3. When matched against the following string:
984
985 "ABCabcdABC"
986
987 the subpattern at index 2 does not match, as "abdd" is not
988 present in the string, but the complete pattern matches (be‐
989 cause of the alternative a(..d)). The subpattern at index 2
990 is therefore unassigned and the default return value is:
991
992 {match,[{0,10},{3,4},{-1,0},{4,3}]}
993
994 Setting the capture Type to binary gives:
995
996 {match,[<<"ABCabcdABC">>,<<"abcd">>,<<>>,<<"bcd">>]}
997
998 Here the empty binary (<<>>) represents the unassigned sub‐
999 pattern. In the binary case, some information about the
1000 matching is therefore lost, as <<>> can also be an empty
1001 string captured.
1002
1003 If differentiation between empty matches and non-existing
1004 subpatterns is necessary, use the type index and do the con‐
1005 version to the final type in Erlang code.
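
One possible way to do that conversion is sketched below. The fun
ToBin and the choice of the atom undefined as the marker for an
unassigned subpattern are assumptions made for this example, not
something the re module provides:

```erlang
Subject = <<"ABCabcdABC">>.
{match, Indexes} =
    re:run(Subject, ".*((?<FOO>abdd)|a(..d)).*", [{capture, all, index}]).
%% Map {-1,0} to undefined, everything else to the matching bytes
ToBin = fun({-1, 0}) -> undefined;
           ({Off, Len}) -> binary:part(Subject, Off, Len)
        end.
[<<"ABCabcdABC">>, <<"abcd">>, undefined, <<"bcd">>] =
    lists:map(ToBin, Indexes).
```

A genuinely empty capture would come back as {Offset,0} with a
non-negative Offset and therefore still be converted to <<>>.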
1006
1007 When option global is specified, the capture specification
1008 affects each match separately, so that:
1009
1010 re:run("cacb","c(a|b)",[global,{capture,[1],list}]).
1011
1012 gives
1013
1014 {match,[["a"],["b"]]}
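
For comparison, a short sketch of the same global call with the
default ValueSpec (all), where every match contributes its own list
holding the complete match followed by the subpattern:

```erlang
{match, [["ca","a"], ["cb","b"]]} =
    re:run("cacb", "c(a|b)", [global, {capture, all, list}]).
{match, [[{0,2},{1,1}], [{2,2},{3,1}]]} =
    re:run("cacb", "c(a|b)", [global]).
```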
1015
1016 For a description of options only affecting the compilation
1017 step, see compile/2.
1018
1019 split(Subject, RE) -> SplitList
1020
1021 Types:
1022
1023 Subject = iodata() | unicode:charlist()
1024 RE = mp() | iodata()
1025 SplitList = [iodata() | unicode:charlist()]
1026
1027 Same as split(Subject, RE, []).
1028
1029 split(Subject, RE, Options) -> SplitList
1030
1031 Types:
1032
1033 Subject = iodata() | unicode:charlist()
1034 RE = mp() | iodata() | unicode:charlist()
1035 Options = [Option]
1036 Option =
1037 anchored | notbol | noteol | notempty | notempty_atstart
1038 |
1039 {offset, integer() >= 0} |
1040 {newline, nl_spec()} |
1041 {match_limit, integer() >= 0} |
1042 {match_limit_recursion, integer() >= 0} |
1043 bsr_anycrlf | bsr_unicode |
1044 {return, ReturnType} |
1045 {parts, NumParts} |
1046 group | trim | CompileOpt
1047 NumParts = integer() >= 0 | infinity
1048 ReturnType = iodata | list | binary
1049 CompileOpt = compile_option()
1050 See compile/2.
1051 SplitList = [RetData] | [GroupedRetData]
1052 GroupedRetData = [RetData]
1053 RetData = iodata() | unicode:charlist() | binary() | list()
1054
1055 Splits the input into parts by finding tokens according to the
1056 regular expression supplied. The splitting is basically done by
1057 running a global regular expression match and dividing the ini‐
1058 tial string wherever a match occurs. The matching part of the
1059 string is removed from the output.
1060
1061 As in run/3, an mp() compiled with option unicode requires Sub‐
1062 ject to be a Unicode charlist(). If compilation is done implic‐
1063 itly and the unicode compilation option is specified to this
1064 function, both the regular expression and Subject are to be
1065 specified as valid Unicode charlist()s.
1066
1067 The result is given as a list of "strings", the preferred data
1068 type specified in option return (default iodata).
1069
1070 If subexpressions are specified in the regular expression, the
1071 matching subexpressions are returned in the resulting list as
1072 well. For example:
1073
1074 re:split("Erlang","[ln]",[{return,list}]).
1075
1076 gives
1077
1078 ["Er","a","g"]
1079
1080 while
1081
1082 re:split("Erlang","([ln])",[{return,list}]).
1083
1084 gives
1085
1086 ["Er","l","a","n","g"]
1087
1088 The text matching the subexpression (marked by the parentheses
1089 in the regular expression) is inserted in the result list where
1090 it was found. This means that concatenating the result of a
1091 split where the whole regular expression is a single subexpres‐
1092 sion (as in the last example) always results in the original
1093 string.
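
The round-trip property described above can be checked directly. In
this sketch Parts is just an illustrative variable name:

```erlang
Parts = re:split("Erlang", "([ln])", [{return, list}]).
["Er", "l", "a", "n", "g"] = Parts.
%% Concatenating the parts gives back the original subject
"Erlang" = lists:append(Parts).
```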
1094
1095 As there is no matching subexpression for the last part in the
1096 example (the "g"), nothing is inserted after that. To make the
1097 group of strings and the parts matching the subexpressions more
1098 obvious, one can use option group, which groups together the
1099 part of the subject string with the parts matching the subex‐
1100 pressions when the string was split:
1101
1102 re:split("Erlang","([ln])",[{return,list},group]).
1103
1104 gives
1105
1106 [["Er","l"],["a","n"],["g"]]
1107
1108 Here the regular expression first matched the "l", causing "Er"
1109 to be the first part in the result. When the regular expression
1110 matched, the (only) subexpression was bound to the "l", so the
1111 "l" is inserted in the group together with "Er". The next match
1112 is of the "n", making "a" the next part to be returned. As the
1113 subexpression is bound to substring "n" in this case, the "n" is
1114 inserted into this group. The last group consists of the remain‐
1115 ing string, as no more matches are found.
1116
1117 By default, all parts of the string, including the empty
1118 strings, are returned from the function, for example:
1119
1120 re:split("Erlang","[lg]",[{return,list}]).
1121
1122 gives
1123
1124 ["Er","an",[]]
1125
1126 as the matching of the "g" in the end of the string leaves an
1127 empty rest, which is also returned. This behavior differs from
1128 the default behavior of the split function in Perl, where empty
1129 strings at the end are by default removed. To get the "trimming"
1130 default behavior of Perl, specify trim as an option:
1131
1132 re:split("Erlang","[lg]",[{return,list},trim]).
1133
1134 gives
1135
1136 ["Er","an"]
1137
1138 The "trim" option says: "give me as many parts as possible
1139 except the empty ones", which sometimes can be useful. You can
1140 also specify how many parts you want, by specifying {parts,N}:
1141
1142 re:split("Erlang","[lg]",[{return,list},{parts,2}]).
1143
1144 gives
1145
1146 ["Er","ang"]
1147
1148 Notice that the last part is "ang", not "an", as splitting was
1149 specified into two parts, and the splitting stops when enough
1150 parts are given, which is why the result differs from that of
1151 trim.
1152
1153 More than three parts are not possible with this input, so
1154
1155 re:split("Erlang","[lg]",[{return,list},{parts,4}]).
1156
1157 gives the same result as the default, which is to be viewed as
1158 "an infinite number of parts".
1159
1160 Specifying 0 as the number of parts gives the same effect as op‐
1161 tion trim. If subexpressions are captured, empty subexpressions
1162 matched at the end are also stripped from the result if trim or
1163 {parts,0} is specified.
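
The parts and trim variants discussed so far can be put next to each
other as a quick sketch, using the same subject and pattern as above:

```erlang
Opts = [{return, list}].
%% default, infinity, and a count larger than possible all agree
["Er", "an", []] = re:split("Erlang", "[lg]", Opts).
["Er", "an", []] = re:split("Erlang", "[lg]", [{parts, infinity} | Opts]).
["Er", "an", []] = re:split("Erlang", "[lg]", [{parts, 4} | Opts]).
%% a smaller count stops splitting early
["Er", "ang"] = re:split("Erlang", "[lg]", [{parts, 2} | Opts]).
%% {parts,0} and trim both drop the trailing empty part
["Er", "an"] = re:split("Erlang", "[lg]", [{parts, 0} | Opts]).
["Er", "an"] = re:split("Erlang", "[lg]", [trim | Opts]).
```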
1164
1165 The trim behavior corresponds exactly to the Perl default.
1166 {parts,N}, where N is a positive integer, corresponds exactly to
1167 the Perl behavior with a positive numerical third parameter. The
1168 default behavior of split/3 corresponds to the Perl behavior
1169 when a negative integer is specified as the third parameter for
1170 the Perl routine.
1171
1172 Summary of options not previously described for function run/3:
1173
1174 {return,ReturnType}:
1175 Specifies how the parts of the original string are presented
1176 in the result list. Valid types:
1177
1178 iodata:
1179 The variant of iodata() that gives the least copying of
1180 data with the current implementation (often a binary, but
1181 do not depend on it).
1182
1183 binary:
1184 All parts returned as binaries.
1185
1186 list:
1187 All parts returned as lists of characters ("strings").
1188
1189 group:
1190 Groups together the part of the string with the parts of the
1191 string matching the subexpressions of the regular expres‐
1192 sion.
1193
1194 The return value from the function is in this case a list()
1195 of list()s. Each sublist begins with the string picked out
1196 of the subject string, followed by the parts matching each
1197 of the subexpressions in order of occurrence in the regular
1198 expression.
1199
1200 {parts,N}:
1201 Specifies the number of parts the subject string is to be
1202 split into.
1203
1204 The number of parts is to be a positive integer for a spe‐
1205 cific maximum number of parts, and infinity for the maximum
1206 number of parts possible (the default). Specifying {parts,0}
1207 gives as many parts as possible disregarding empty parts at
1208 the end, the same as specifying trim.
1209
1210 trim:
1211 Specifies that empty parts at the end of the result list are
1212 to be disregarded. The same as specifying {parts,0}. This
1213 corresponds to the default behavior of the split built-in
1214 function in Perl.
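
A brief sketch of the return types follows. Because the exact shape
of the iodata result is not specified, the last lines normalize each
part with iolist_to_binary/1 before comparing; the comma-separated
subject is made up for this example:

```erlang
[<<"a">>, <<"b">>, <<"c">>] = re:split("a,b,c", ",", [{return, binary}]).
["a", "b", "c"] = re:split(<<"a,b,c">>, ",", [{return, list}]).
%% default (iodata): do not rely on the representation of each part
Parts = re:split("a,b,c", ",").
[<<"a">>, <<"b">>, <<"c">>] = [iolist_to_binary(P) || P <- Parts].
```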
1215
1217 The following sections contain reference material for the regular ex‐
1218 pressions used by this module. The information is based on the PCRE
1219 documentation, with changes where this module behaves differently to
1220 the PCRE library.
1221
1223 The syntax and semantics of the regular expressions supported by PCRE
1224 are described in detail in the following sections. Perl's regular ex‐
1225 pressions are described in its own documentation, and regular expres‐
1226 sions in general are covered in many books, some with copious examples.
1227 Jeffrey Friedl's "Mastering Regular Expressions", published by
1228 O'Reilly, covers regular expressions in great detail. This description
1229 of the PCRE regular expressions is intended as reference material.
1230
1231 The reference material is divided into the following sections:
1232
1233 * Special Start-of-Pattern Items
1234
1235 * Characters and Metacharacters
1236
1237 * Backslash
1238
1239 * Circumflex and Dollar
1240
1241 * Full Stop (Period, Dot) and \N
1242
1243 * Matching a Single Data Unit
1244
1245 * Square Brackets and Character Classes
1246
1247 * Posix Character Classes
1248
1249 * Vertical Bar
1250
1251 * Internal Option Setting
1252
1253 * Subpatterns
1254
1255 * Duplicate Subpattern Numbers
1256
1257 * Named Subpatterns
1258
1259 * Repetition
1260
1261 * Atomic Grouping and Possessive Quantifiers
1262
1263 * Back References
1264
1265 * Assertions
1266
1267 * Conditional Subpatterns
1268
1269 * Comments
1270
1271 * Recursive Patterns
1272
1273 * Subpatterns as Subroutines
1274
1275 * Oniguruma Subroutine Syntax
1276
1277 * Backtracking Control
1278
1280 Some options that can be passed to compile/2 can also be set by special
1281 items at the start of a pattern. These are not Perl-compatible, but are
1282 provided to make these options accessible to pattern writers who are
1283 not able to change the program that processes the pattern. Any number
1284 of these items can appear, but they must all be together right at the
1285 start of the pattern string, and the letters must be in upper case.
1286
1287 UTF Support
1288
1289 Unicode support is basically UTF-8 based. To use Unicode characters,
1290 you either call compile/2 or run/3 with option unicode, or the pattern
1291 must start with one of these special sequences:
1292
1293 (*UTF8)
1294 (*UTF)
1295
1296 Both options give the same effect: the input string is interpreted as
1297 UTF-8. Notice that with these instructions, the automatic conversion of
1298 lists to UTF-8 is not performed by the re functions. Therefore, using
1299 these sequences is not recommended. Add option unicode when running
1300 compile/2 instead.
1301
1302 Some applications that allow their users to supply patterns may wish to
1303 restrict them to non-UTF data for security reasons. If option never_utf
1304 is set at compile time, (*UTF), and so on, are not allowed, and their
1305 appearance causes an error.
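
The following sketch illustrates both points. The subject is a single
code point above 255 (U+0436, chosen arbitrarily) given as a list,
and the error term is matched only by shape, as the exact error
string is not assumed here:

```erlang
%% Option unicode: the list is converted to UTF-8 before matching,
%% so the single character is reported as two bytes
{match, [{0,2}]} = re:run([16#436], ".", [unicode]).
%% never_utf: a pattern that tries to enable UTF mode is rejected
{error, {_ErrString, _Position}} = re:compile("(*UTF)a", [never_utf]).
```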
1306
1307 Unicode Property Support
1308
1309 The following is another special sequence that can appear at the start
1310 of a pattern:
1311
1312 (*UCP)
1313
1314 This has the same effect as setting option ucp: it causes sequences
1315 such as \d and \w to use Unicode properties to determine character
1316 types, instead of recognizing only characters with codes < 256 through
1317 a lookup table.
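
As a sketch of the effect, a Cyrillic letter (U+0436, picked only as
an example of a code point above 255) is a word character only when
Unicode properties are consulted:

```erlang
%% Without ucp, \w never matches characters above 255
nomatch = re:run([16#436], "\\w", [unicode]).
%% With option ucp, or with (*UCP) at the start of the pattern
{match, [{0,2}]} = re:run([16#436], "\\w", [unicode, ucp]).
{match, [{0,2}]} = re:run([16#436], "(*UCP)\\w", [unicode]).
```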
1318
1319 Disabling Startup Optimizations
1320
1321 If a pattern starts with (*NO_START_OPT), it has the same effect as
1322 setting option no_start_optimize at compile time.
1323
1324 Newline Conventions
1325
1326 PCRE supports five conventions for indicating line breaks in strings: a
1327 single CR (carriage return) character, a single LF (line feed) charac‐
1328 ter, the two-character sequence CRLF, any of the three preceding, and
1329 any Unicode newline sequence.
1330
1331 A newline convention can also be specified by starting a pattern string
1332 with one of the following five sequences:
1333
1334 (*CR):
1335 Carriage return
1336
1337 (*LF):
1338 Line feed
1339
1340 (*CRLF):
1341 Carriage return followed by line feed
1342
1343 (*ANYCRLF):
1344 Any of the three above
1345
1346 (*ANY):
1347 All Unicode newline sequences
1348
1349 These override the default and the options specified to compile/2. For
1350 example, the following pattern changes the convention to CR:
1351
1352 (*CR)a.b
1353
1354 This pattern matches a\nb, as LF is no longer a newline. If more than
1355 one of them is present, the last one is used.
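
The example above can be tried in the shell. This sketch assumes the
default newline convention (LF) when nothing else is specified:

```erlang
%% Default: LF is a newline, so the dot does not match it
nomatch = re:run("a\nb", "a.b").
%% With CR as the newline, LF is an ordinary character
{match, [{0,3}]} = re:run("a\nb", "(*CR)a.b").
{match, [{0,3}]} = re:run("a\nb", "a.b", [{newline, cr}]).
%% When several items are present, the last one is used
{match, [{0,3}]} = re:run("a\nb", "(*LF)(*CR)a.b").
```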
1356
1357 The newline convention affects where the circumflex and dollar asser‐
1358 tions are true. It also affects the interpretation of the dot metachar‐
1359 acter when dotall is not set, and the behavior of \N. However, it does
1360 not affect what the \R escape sequence matches. By default, this is any
1361 Unicode newline sequence, for Perl compatibility. However, this can be
1362 changed; see the description of \R in section Newline Sequences. A
1363 change of the \R setting can be combined with a change of the newline
1364 convention.
1365
1366 Setting Match and Recursion Limits
1367
1368 The caller of run/3 can set a limit on the number of times the internal
1369 match() function is called and on the maximum depth of recursive calls.
1370 These facilities are provided to catch runaway matches that are pro‐
1371 voked by patterns with huge matching trees (a typical example is a pat‐
1372 tern with nested unlimited repeats) and to avoid running out of system
1373 stack by too much recursion. When one of these limits is reached,
1374 pcre_exec() gives an error return. The limits can also be set by items
1375 at the start of the pattern of the following forms:
1376
1377 (*LIMIT_MATCH=d)
1378 (*LIMIT_RECURSION=d)
1379
1380 Here d is any number of decimal digits. However, the value of the set‐
1381 ting must be less than the value set by the caller of run/3 for it to
1382 have any effect. That is, the pattern writer can lower the limit set by
1383 the programmer, but not raise it. If there is more than one setting of
1384 one of these limits, the lower value is used.
1385
1386 The default value for both the limits is 10,000,000 in the Erlang VM.
1387 Notice that the recursion limit does not affect the stack depth of the
1388 VM, as PCRE for Erlang is compiled in such a way that the match func‐
1389 tion never does recursion on the C stack.
1390
1391 Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of
1392 the limits set by the caller, not increase them.
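
A hedged sketch of how a lowered match limit shows up to the caller
is given below. It uses a pattern with nested unlimited repeats and
relies on the run/3 option report_errors to make the limit visible;
the limit value 3000 is arbitrary and the exact number of internal
calls needed is not something to depend on:

```erlang
Subject = "aaaaaaaaaaaaaz".
%% Without report_errors, hitting the limit looks like a failed match
nomatch = re:run(Subject, "(a+)*zz", [{match_limit, 3000}]).
%% Limit set by the caller
{error, match_limit} =
    re:run(Subject, "(a+)*zz", [{match_limit, 3000}, report_errors]).
%% Limit lowered from inside the pattern
{error, match_limit} =
    re:run(Subject, "(*LIMIT_MATCH=3000)(a+)*zz", [report_errors]).
```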
1393
1395 A regular expression is a pattern that is matched against a subject
1396 string from left to right. Most characters stand for themselves in a
1397 pattern and match the corresponding characters in the subject. As a
1398 trivial example, the following pattern matches a portion of a subject
1399 string that is identical to itself:
1400
1401 The quick brown fox
1402
1403 When caseless matching is specified (option caseless), letters are
1404 matched independently of case.
1405
1406 The power of regular expressions comes from the ability to include al‐
1407 ternatives and repetitions in the pattern. These are encoded in the
1408 pattern by the use of metacharacters, which do not stand for themselves
1409 but instead are interpreted in some special way.
1410
1411 Two sets of metacharacters exist: those that are recognized anywhere in
1412 the pattern except within square brackets, and those that are recog‐
1413 nized within square brackets. Outside square brackets, the metacharac‐
1414 ters are as follows:
1415
1416 \:
1417 General escape character with many uses
1418
1419 ^:
1420 Assert start of string (or line, in multiline mode)
1421
1422 $:
1423 Assert end of string (or line, in multiline mode)
1424
1425 .:
1426 Match any character except newline (by default)
1427
1428 [:
1429 Start character class definition
1430
1431 |:
1432 Start of alternative branch
1433
1434 (:
1435 Start subpattern
1436
1437 ):
1438 End subpattern
1439
1440 ?:
1441 Extends the meaning of (, also 0 or 1 quantifier, also quantifier
1442 minimizer
1443
1444 *:
1445 0 or more quantifier
1446
1447 +:
1448 1 or more quantifier, also "possessive quantifier"
1449
1450 {:
1451 Start min/max quantifier
1452
1453 Part of a pattern within square brackets is called a "character class".
1454 The following are the only metacharacters in a character class:
1455
1456 \:
1457 General escape character
1458
1459 ^:
1460 Negate the class, but only if the first character
1461
1462 -:
1463 Indicates character range
1464
1465 [:
1466 Posix character class (only if followed by Posix syntax)
1467
1468 ]:
1469 Terminates the character class
1470
1471 The following sections describe the use of each metacharacter.
1472
1474 The backslash character has many uses. First, if it is followed by a
1475 character that is not a number or a letter, it takes away any special
1476 meaning that a character can have. This use of backslash as an escape
1477 character applies both inside and outside character classes.
1478
1479 For example, if you want to match a * character, you write \* in the
1480 pattern. This escaping action applies if the following character would
1481 otherwise be interpreted as a metacharacter, so it is always safe to
1482 precede a non-alphanumeric with backslash to specify that it stands for
1483 itself. In particular, if you want to match a backslash, write \\.
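
Because escaping any non-alphanumeric is safe, a literal string can
be turned into a pattern mechanically. Quote below is a hypothetical
helper written for this sketch (it is not part of the re module) and
only considers ASCII letters and digits as alphanumeric. Note the
doubled backslashes required by Erlang string syntax:

```erlang
Quote = fun(Str) ->
            lists:flatmap(
              fun(C) when C >= $a, C =< $z;
                          C >= $A, C =< $Z;
                          C >= $0, C =< $9 -> [C];
                 (C) -> [$\\, C]
              end, Str)
        end.
%% Unquoted, "." and "*" are metacharacters
{match, [{0,5}]} = re:run("aXbbc", "a.b*c").
%% Quoted, the pattern only matches the literal text
nomatch = re:run("aXbbc", Quote("a.b*c")).
{match, [{0,5}]} = re:run("a.b*c", Quote("a.b*c")).
%% Matching a single backslash
{match, [{0,1}]} = re:run("\\", "\\\\").
```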
1484
1485 In unicode mode, only ASCII numbers and letters have any special mean‐
1486 ing after a backslash. All other characters (in particular, those whose
1487 code points are > 127) are treated as literals.
1488
1489 If a pattern is compiled with option extended, whitespace in the pat‐
1490 tern (other than in a character class) and characters between a # out‐
1491 side a character class and the next newline are ignored. An escaping
1492 backslash can be used to include a whitespace or # character as part of
1493 the pattern.
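
A small sketch of option extended, with made-up subjects:

```erlang
%% Whitespace and the comment are ignored: the pattern is just "ab"
{match, [{0,2}]} = re:run("ab", "a b  # trailing comment", [extended]).
nomatch = re:run("a b", "a b", [extended]).
%% An escaping backslash keeps the space or # as part of the pattern
{match, [{0,3}]} = re:run("a b", "a\\ b", [extended]).
{match, [{0,3}]} = re:run("a#b", "a\\#b", [extended]).
```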
1494
1495 To remove the special meaning from a sequence of characters, put them
1496 between \Q and \E. This is different from Perl in that $ and @ are han‐
1497 dled as literals in \Q...\E sequences in PCRE, while $ and @ cause
1498 variable interpolation in Perl. Notice the following examples:
1499
1500 Pattern PCRE matches Perl matches
1501
1502 \Qabc$xyz\E abc$xyz abc followed by the contents of $xyz
1503 \Qabc\$xyz\E abc\$xyz abc\$xyz
1504 \Qabc\E\$\Qxyz\E abc$xyz abc$xyz
1505
1506 The \Q...\E sequence is recognized both inside and outside character
1507 classes. An isolated \E that is not preceded by \Q is ignored. If \Q is
1508 not followed by \E later in the pattern, the literal interpretation
1509 continues to the end of the pattern (that is, \E is assumed at the
1510 end). If the isolated \Q is inside a character class, this causes an
1511 error, as the character class is not terminated.
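
The three rows of the table above, plus the unterminated \Q case, can
be sketched as follows (backslashes are doubled for Erlang string
syntax):

```erlang
{match, [{0,7}]} = re:run("abc$xyz", "\\Qabc$xyz\\E").
{match, [{0,8}]} = re:run("abc\\$xyz", "\\Qabc\\$xyz\\E").
{match, [{0,7}]} = re:run("abc$xyz", "\\Qabc\\E\\$\\Qxyz\\E").
%% \Q without \E: literal interpretation runs to the end of the pattern
{match, [{0,3}]} = re:run("a.b", "\\Qa.b").
nomatch = re:run("axb", "\\Qa.b").
```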
1512
1513 Non-Printing Characters
1514
1515 A second use of backslash provides a way of encoding non-printing char‐
1516 acters in patterns in a visible manner. There is no restriction on the
1517 appearance of non-printing characters, apart from the binary zero that
1518 terminates a pattern. When a pattern is prepared by text editing, it is
1519 often easier to use one of the following escape sequences than the bi‐
1520 nary character it represents:
1521
1522 \a:
1523 Alarm, that is, the BEL character (hex 07)
1524
1525 \cx:
1526 "Control-x", where x is any ASCII character
1527
1528 \e:
1529 Escape (hex 1B)
1530
1531 \f:
1532 Form feed (hex 0C)
1533
1534 \n:
1535 Line feed (hex 0A)
1536
1537 \r:
1538 Carriage return (hex 0D)
1539
1540 \t:
1541 Tab (hex 09)
1542
1543 \0dd:
1544 Character with octal code 0dd
1545
1546 \ddd:
1547 Character with octal code ddd, or back reference
1548
1549 \o{ddd..}:
1550 character with octal code ddd..
1551
1552 \xhh:
1553 Character with hex code hh
1554
1555 \x{hhh..}:
1556 Character with hex code hhh..
1557
1558 Note:
1559 Note that \0dd is always an octal code, and that \8 and \9 are the lit‐
1560 eral characters "8" and "9".
1561
1562
1563 The precise effect of \cx on ASCII characters is as follows: if x is a
1564 lowercase letter, it is converted to upper case. Then bit 6 of the
1565 character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
1566 (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
1567 hex 7B (; is 3B). If the data item (byte or 16-bit value) following \c
1568 has a value > 127, a compile-time error occurs. This locks out non-
1569 ASCII characters in all modes.
1570
1571 The \c facility was designed for use with ASCII characters, but with
1572 the extension to Unicode it is even less useful than it once was.
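
The bit inversion described above can be observed with a few
single-character subjects. This is only an illustrative sketch:

```erlang
%% \cA and \ca both denote hex 01
{match, [{0,1}]} = re:run([16#01], "\\cA").
{match, [{0,1}]} = re:run([16#01], "\\ca").
%% \c; is hex 7B, which is "{", and \c{ is hex 3B, which is ";"
{match, [{0,1}]} = re:run("{", "\\c;").
{match, [{0,1}]} = re:run(";", "\\c{").
```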
1573
1574 After \0 up to two further octal digits are read. If there are fewer
1575 than two digits, just those that are present are used. Thus the se‐
1576 quence \0\x\015 specifies two binary zeros followed by a CR character
1577 (code value 13). Make sure you supply two digits after the initial zero
1578 if the pattern character that follows is itself an octal digit.
1579
1580 The escape \o must be followed by a sequence of octal digits, enclosed
1581 in braces. An error occurs if this is not the case. This escape is a
1582 recent addition to Perl; it provides a way of specifying character code
1583 points as octal numbers greater than 0777, and it also allows octal
1584 numbers and back references to be unambiguously specified.
1585
1586 For greater clarity and unambiguity, it is best to avoid following \ by
1587 a digit greater than zero. Instead, use \o{} or \x{} to specify charac‐
1588 ter numbers, and \g{} to specify back references. The following para‐
1589 graphs describe the old, ambiguous syntax.
1590
1591 The handling of a backslash followed by a digit other than 0 is compli‐
1592 cated, and Perl has changed in recent releases, causing PCRE also to
1593 change. Outside a character class, PCRE reads the digit and any follow‐
1594 ing digits as a decimal number. If the number is < 8, or if there have
1595 been at least that many previous capturing left parentheses in the ex‐
1596 pression, the entire sequence is taken as a back reference. A descrip‐
1597 tion of how this works is provided later, following the discussion of
1598 parenthesized subpatterns.
1599
1600 Inside a character class, or if the decimal number following \ is > 7
1601 and there have not been that many capturing subpatterns, PCRE handles
1602 \8 and \9 as the literal characters "8" and "9", and otherwise re-reads
1603 up to three octal digits following the backslash, and uses them to
1604 generate a data character. Any subsequent digits stand for themselves.
1605 For example:
1606
1607 \040:
1608 Another way of writing an ASCII space
1609
1610 \40:
1611 The same, provided there are < 40 previous capturing subpatterns
1612
1613 \7:
1614 Always a back reference
1615
1616 \11:
1617 Can be a back reference, or another way of writing a tab
1618
1619 \011:
1620 Always a tab
1621
1622 \0113:
1623 A tab followed by character "3"
1624
1625 \113:
1626 Can be a back reference, otherwise the character with octal code
1627 113
1628
1629 \377:
1630 Can be a back reference, otherwise value 255 (decimal)
1631
1632 \81:
1633 Either a back reference, or the two characters "8" and "1"
1634
1635 Notice that octal values >= 100 that are specified using this syntax
1636 must not be introduced by a leading zero, as no more than three octal
1637 digits are ever read.
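
Some of the cases listed above, together with the unambiguous \o{}
and \g{} forms recommended earlier, can be sketched like this:

```erlang
%% Octal escapes for an ASCII space
{match, [{0,1}]} = re:run(" ", "\\040").
{match, [{0,1}]} = re:run(" ", "\\40").     % fewer than 40 subpatterns
%% \0113 is a tab followed by the character "3"
{match, [{0,2}]} = re:run("\t3", "\\0113").
%% A single digit after the subpattern is a back reference
{match, [{0,2}, {0,1}]} = re:run("aa", "(a)\\1").
%% Unambiguous alternatives
{match, [{0,2}, {0,1}]} = re:run("aa", "(a)\\g{1}").
{match, [{0,1}]} = re:run("\t", "\\o{11}").
```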
1638
1639 By default, after \x that is not followed by {, from zero to two hexa‐
1640 decimal digits are read (letters can be in upper or lower case). Any
1641 number of hexadecimal digits may appear between \x{ and }. If a charac‐
1642 ter other than a hexadecimal digit appears between \x{ and }, or if
1643 there is no terminating }, an error occurs.
1644
1645 Characters whose value is less than 256 can be defined by either of the
1646 two syntaxes for \x. There is no difference in the way they are han‐
1647 dled. For example, \xdc is exactly the same as \x{dc}.
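
A sketch of the two \x syntaxes, using a one-byte subject with value
hex DC. The error term is matched by shape only, as the error string
itself is not assumed here:

```erlang
{match, [{0,1}]} = re:run([16#DC], "\\xdc").
{match, [{0,1}]} = re:run([16#DC], "\\xDC").
{match, [{0,1}]} = re:run([16#DC], "\\x{dc}").
%% A non-hexadecimal character between \x{ and } is a compile error
{error, {_ErrString, _Position}} = re:compile("\\x{zz}").
```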
1648
1649 Constraints on character values
1650
1651 Characters that are specified using octal or hexadecimal numbers are
1652 limited to certain values, as follows:
1653
1654 8-bit non-UTF mode:
1655 < 0x100
1656
1657 8-bit UTF-8 mode:
1658 < 0x10ffff and a valid codepoint
1659
1660 Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-
1661 called "surrogate" codepoints), and 0xffef.
1662
1663 Escape sequences in character classes
1664
1665 All the sequences that define a single character value can be used both
1666 inside and outside character classes. Also, inside a character class,
1667 \b is interpreted as the backspace character (hex 08).
1668
1669 \N is not allowed in a character class. \B, \R, and \X are not special
1670 inside a character class. Like other unrecognized escape sequences,
1671 they are treated as the literal characters "B", "R", and "X". Outside a
1672 character class, these sequences have different meanings.
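
For instance, \b inside a character class matches the backspace
character and nothing else, as this short sketch shows:

```erlang
{match, [{0,1}]} = re:run([16#08], "[\\b]").
nomatch = re:run("b", "[\\b]").
```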
1673
1674 Unsupported Escape Sequences
1675
1676 In Perl, the sequences \l, \L, \u, and \U are recognized by its string
1677 handler and used to modify the case of following characters. PCRE does
1678 not support these escape sequences.
1679
1680 Absolute and Relative Back References
1681
1682 The sequence \g followed by an unsigned or a negative number, option‐
1683 ally enclosed in braces, is an absolute or relative back reference. A
1684 named back reference can be coded as \g{name}. Back references are dis‐
1685 cussed later, following the discussion of parenthesized subpatterns.
1686
1687 Absolute and Relative Subroutine Calls
1688
1689 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
1690 name or a number enclosed either in angle brackets or single quotes, is
1691 alternative syntax for referencing a subpattern as a "subroutine". De‐
1692 tails are discussed later. Notice that \g{...} (Perl syntax) and
1693 \g<...> (Oniguruma syntax) are not synonymous. The former is a back
1694 reference and the latter is a subroutine call.
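
The difference between the two syntaxes can be seen in a short sketch
(module name hypothetical). A back reference must match the same text
that the group captured, while a subroutine call re-runs the group's
pattern.

```erlang
-module(backref_vs_call_demo).
-export([run/0]).

run() ->
    Subject = "sense and responsibility",
    %% \g{1} is a back reference: group 1 captured "sens", so the
    %% subject would have to continue with "sensibility".
    BackRef = re:run(Subject, "(sens|respons)e and \\g{1}ibility",
                     [{capture, none}]),
    %% \g<1> is a subroutine call: group 1's pattern is tried again
    %% and may take the other alternative.
    SubCall = re:run(Subject, "(sens|respons)e and \\g<1>ibility",
                     [{capture, none}]),
    {BackRef, SubCall}.
    %% expected: {nomatch,match}
```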
1695
1696 Generic Character Types
1697
1698 Another use of backslash is for specifying generic character types:
1699
1700 \d:
1701 Any decimal digit
1702
1703 \D:
1704 Any character that is not a decimal digit
1705
1706 \h:
1707 Any horizontal whitespace character
1708
1709 \H:
1710 Any character that is not a horizontal whitespace character
1711
1712 \s:
1713 Any whitespace character
1714
1715 \S:
1716 Any character that is not a whitespace character
1717
1718 \v:
1719 Any vertical whitespace character
1720
1721 \V:
1722 Any character that is not a vertical whitespace character
1723
1724 \w:
1725 Any "word" character
1726
1727 \W:
1728 Any "non-word" character
1729
1730 There is also the single sequence \N, which matches a non-newline char‐
1731 acter. This is the same as the "." metacharacter when dotall is not
1732 set. Perl also uses \N to match characters by name, but PCRE does not
1733 support this.
1734
1735 Each pair of lowercase and uppercase escape sequences partitions the
1736 complete set of characters into two disjoint sets. Any given character
1737 matches one, and only one, of each pair. The sequences can appear both
1738 inside and outside character classes. They each match one character of
1739 the appropriate type. If the current matching point is at the end of
1740 the subject string, all fail, as there is no character to match.
1741
1742 For compatibility with Perl, \s formerly did not match the VT character
1743 (code 11), which made it different from the POSIX "space" class.
1744 However, Perl added VT at release 5.18, and PCRE followed suit at re‐
1745 lease 8.34. The default \s characters are now HT (9), LF (10), VT (11),
1746 FF (12), CR (13), and space (32), which are defined as white space in
1747 the "C" locale. This list may vary if locale-specific matching is tak‐
1748 ing place. For example, in some locales the "non-breaking space" char‐
1749 acter (\xA0) is recognized as white space, and in others the VT charac‐
1750 ter is not.
1751
1752 A "word" character is an underscore or any character that is a letter
1753 or a digit. By default, the definition of letters and digits is
1754 controlled by the PCRE low-valued character tables; in Erlang's case
1755 (and without option unicode), these cover the ISO Latin-1 character set.
1756
1757 By default, in unicode mode, characters with values > 255, that is, all
1758 characters outside the ISO Latin-1 character set, never match \d, \s,
1759 or \w, and always match \D, \S, and \W. These sequences retain their
1760 original meanings from before UTF support was available, mainly for ef‐
1761 ficiency reasons. However, if option ucp is set, the behavior is
1762 changed so that Unicode properties are used to determine character
1763 types, as follows:
1764
1765 \d:
1766 Any character that \p{Nd} matches (decimal digit)
1767
1768 \s:
1769 Any character that matches \p{Z} or \h or \v
1770
1771 \w:
1772 Any character that matches \p{L} or \p{N}, plus underscore
1773
1774 The uppercase escapes match the inverse sets of characters. Notice that
1775 \d matches only decimal digits, while \w matches any Unicode digit, any
1776 Unicode letter, and underscore. Notice also that ucp affects \b and \B,
1777 as they are defined in terms of \w and \W. Matching these sequences is
1778 noticeably slower when ucp is set.
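
A small sketch of the effect of ucp on \d (module name hypothetical).
The subject is a single non-ASCII decimal digit, given as a code point
so that the example does not depend on source file encoding.

```erlang
-module(ucp_digit_demo).
-export([run/0]).

run() ->
    %% U+0661 ARABIC-INDIC DIGIT ONE has general category Nd.
    Subject = [16#0661],
    %% Without ucp, characters above 255 never match \d.
    Plain = re:run(Subject, "\\d", [unicode, {capture, none}]),
    %% With ucp, \d behaves like \p{Nd}.
    Ucp   = re:run(Subject, "\\d", [unicode, ucp, {capture, none}]),
    {Plain, Ucp}.
    %% expected: {nomatch,match}
```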
1779
1780 The sequences \h, \H, \v, and \V are features that were added to Perl
1781 in release 5.10. In contrast to the other sequences, which match only
1782 ASCII characters by default, these always match certain high-valued
1783 code points, regardless of whether ucp is set.
1784
1785 The following are the horizontal space characters:
1786
1787 U+0009:
1788 Horizontal tab (HT)
1789
1790 U+0020:
1791 Space
1792
1793 U+00A0:
1794 Non-break space
1795
1796 U+1680:
1797 Ogham space mark
1798
1799 U+180E:
1800 Mongolian vowel separator
1801
1802 U+2000:
1803 En quad
1804
1805 U+2001:
1806 Em quad
1807
1808 U+2002:
1809 En space
1810
1811 U+2003:
1812 Em space
1813
1814 U+2004:
1815 Three-per-em space
1816
1817 U+2005:
1818 Four-per-em space
1819
1820 U+2006:
1821 Six-per-em space
1822
1823 U+2007:
1824 Figure space
1825
1826 U+2008:
1827 Punctuation space
1828
1829 U+2009:
1830 Thin space
1831
1832 U+200A:
1833 Hair space
1834
1835 U+202F:
1836 Narrow no-break space
1837
1838 U+205F:
1839 Medium mathematical space
1840
1841 U+3000:
1842 Ideographic space
1843
1844 The following are the vertical space characters:
1845
1846 U+000A:
1847 Line feed (LF)
1848
1849 U+000B:
1850 Vertical tab (VT)
1851
1852 U+000C:
1853 Form feed (FF)
1854
1855 U+000D:
1856 Carriage return (CR)
1857
1858 U+0085:
1859 Next line (NEL)
1860
1861 U+2028:
1862 Line separator
1863
1864 U+2029:
1865 Paragraph separator
1866
1867 In 8-bit, non-UTF-8 mode, only the characters with code points < 256
1868 are relevant.
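
The horizontal and vertical space escapes can be tried out as follows.
This is only a sketch with a hypothetical module name; the subjects are
chosen from the two lists above.

```erlang
-module(hv_space_demo).
-export([run/0]).

run() ->
    %% Split on runs of horizontal whitespace (tab followed by space).
    Words   = re:split("a\t b", "\\h+", [{return, list}]),
    %% U+2003 (em space) is in the horizontal list; ucp is not needed.
    EmSpace = re:run([16#2003], "\\h", [unicode, {capture, none}]),
    %% Line feed is vertical whitespace, an ordinary space is not.
    Newline = re:run("\n", "\\v", [{capture, none}]),
    Space   = re:run(" ", "\\v", [{capture, none}]),
    {Words, EmSpace, Newline, Space}.
    %% expected: {["a","b"],match,match,nomatch}
```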
1869
1870 Newline Sequences
1871
1872 Outside a character class, by default, the escape sequence \R matches
1873 any Unicode newline sequence. In non-UTF-8 mode, \R is equivalent to
1874 the following:
1875
1876 (?>\r\n|\n|\x0b|\f|\r|\x85)
1877
1878 This is an example of an "atomic group"; details are provided below.
1879
1880 This particular group matches either the two-character sequence CR fol‐
1881 lowed by LF, or one of the single characters LF (line feed, U+000A), VT
1882 (vertical tab, U+000B), FF (form feed, U+000C), CR (carriage return,
1883 U+000D), or NEL (next line, U+0085). The two-character sequence is
1884 treated as a single unit that cannot be split.
1885
1886 In Unicode mode, two more characters whose code points are > 255 are
1887 added: LS (line separator, U+2028) and PS (paragraph separator,
1888 U+2029). Unicode character property support is not needed for these
1889 characters to be recognized.
1890
1891 \R can be restricted to match only CR, LF, or CRLF (instead of the com‐
1892 plete set of Unicode line endings) by setting option bsr_anycrlf either
1893 at compile time or when the pattern is matched. (BSR is an acronym for
1894 "backslash R".) This can be made the default when PCRE is built; if so,
1895 the other behavior can be requested through option bsr_unicode. These
1896 settings can also be specified by starting a pattern string with one of
1897 the following sequences:
1898
1899 (*BSR_ANYCRLF):
1900 CR, LF, or CRLF only
1901
1902 (*BSR_UNICODE):
1903 Any Unicode newline sequence
1904
1905 These override the default and the options specified to the compiling
1906 function, but they can themselves be overridden by options specified to
1907 a matching function. Notice that these special settings, which are not
1908 Perl-compatible, are recognized only at the very start of a pattern,
1909 and that they must be in upper case. If more than one of them is
1910 present, the last one is used. They can be combined with a change of
1911 newline convention; for example, a pattern can start with:
1912
1913 (*ANY)(*BSR_ANYCRLF)
1914
1915 They can also be combined with the (*UTF8), (*UTF), or (*UCP) special
1916 sequences. Inside a character class, \R is treated as an unrecognized
1917 escape sequence, and so matches the letter "R" by default.
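
The following sketch (module name hypothetical) shows \R with its
default meaning and restricted to CR, LF, or CRLF, both through option
bsr_anycrlf and through the (*BSR_ANYCRLF) pattern prefix. It assumes
the default build, where \R matches any Unicode newline sequence.

```erlang
-module(newline_seq_demo).
-export([run/0]).

run() ->
    %% CRLF is consumed as one unit, so no empty field appears.
    Mixed   = re:split("a\r\nb\nc\rd", "\\R", [{return, list}]),
    %% A subject containing VT (code 11) and LF.
    Subject = [$a, 11, $b, $\n, $c],
    Default = re:split(Subject, "\\R", [{return, list}]),
    Option  = re:split(Subject, "\\R", [bsr_anycrlf, {return, list}]),
    Prefix  = re:split(Subject, "(*BSR_ANYCRLF)\\R", [{return, list}]),
    {Mixed, Default, Option, Prefix}.
    %% expected:
    %%   Mixed   = ["a","b","c","d"]
    %%   Default = ["a","b","c"]        (VT counts as a newline)
    %%   Option  = [[$a,11,$b],"c"]     (VT is ordinary data)
    %%   Prefix  = [[$a,11,$b],"c"]
```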
1918
1919 Unicode Character Properties
1920
1921 Three more escape sequences that match characters with specific proper‐
1922 ties are available. When in 8-bit non-UTF-8 mode, these sequences are
1923 limited to testing characters whose code points are < 256, but they do
1924 work in this mode. The following are the extra escape sequences:
1925
1926 \p{xx}:
1927 A character with property xx
1928
1929 \P{xx}:
1930 A character without property xx
1931
1932 \X:
1933 A Unicode extended grapheme cluster
1934
1935 The property names represented by xx above are limited to the Unicode
1936 script names, the general category properties, "Any", which matches any
1937 character (including newline), and some special PCRE properties (de‐
1938 scribed in the next section). Other Perl properties, such as "InMusi‐
1939 calSymbols", are currently not supported by PCRE. Notice that \P{Any}
1940 does not match any characters and always causes a match failure.
1941
1942 Sets of Unicode characters are defined as belonging to certain scripts.
1943 A character from one of these sets can be matched using a script name,
1944 for example:
1945
1946 \p{Greek} \P{Han}
1947
1948 Those that are not part of an identified script are lumped together as
1949 "Common". The following is the current list of scripts:
1950
1951 * Arabic
1952
1953 * Armenian
1954
1955 * Avestan
1956
1957 * Balinese
1958
1959 * Bamum
1960
1961 * Bassa_Vah
1962
1963 * Batak
1964
1965 * Bengali
1966
1967 * Bopomofo
1968
1969 * Braille
1970
1971 * Buginese
1972
1973 * Buhid
1974
1975 * Canadian_Aboriginal
1976
1977 * Carian
1978
1979 * Caucasian_Albanian
1980
1981 * Chakma
1982
1983 * Cham
1984
1985 * Cherokee
1986
1987 * Common
1988
1989 * Coptic
1990
1991 * Cuneiform
1992
1993 * Cypriot
1994
1995 * Cyrillic
1996
1997 * Deseret
1998
1999 * Devanagari
2000
2001 * Duployan
2002
2003 * Egyptian_Hieroglyphs
2004
2005 * Elbasan
2006
2007 * Ethiopic
2008
2009 * Georgian
2010
2011 * Glagolitic
2012
2013 * Gothic
2014
2015 * Grantha
2016
2017 * Greek
2018
2019 * Gujarati
2020
2021 * Gurmukhi
2022
2023 * Han
2024
2025 * Hangul
2026
2027 * Hanunoo
2028
2029 * Hebrew
2030
2031 * Hiragana
2032
2033 * Imperial_Aramaic
2034
2035 * Inherited
2036
2037 * Inscriptional_Pahlavi
2038
2039 * Inscriptional_Parthian
2040
2041 * Javanese
2042
2043 * Kaithi
2044
2045 * Kannada
2046
2047 * Katakana
2048
2049 * Kayah_Li
2050
2051 * Kharoshthi
2052
2053 * Khmer
2054
2055 * Khojki
2056
2057 * Khudawadi
2058
2059 * Lao
2060
2061 * Latin
2062
2063 * Lepcha
2064
2065 * Limbu
2066
2067 * Linear_A
2068
2069 * Linear_B
2070
2071 * Lisu
2072
2073 * Lycian
2074
2075 * Lydian
2076
2077 * Mahajani
2078
2079 * Malayalam
2080
2081 * Mandaic
2082
2083 * Manichaean
2084
2085 * Meetei_Mayek
2086
2087 * Mende_Kikakui
2088
2089 * Meroitic_Cursive
2090
2091 * Meroitic_Hieroglyphs
2092
2093 * Miao
2094
2095 * Modi
2096
2097 * Mongolian
2098
2099 * Mro
2100
2101 * Myanmar
2102
2103 * Nabataean
2104
2105 * New_Tai_Lue
2106
2107 * Nko
2108
2109 * Ogham
2110
2111 * Ol_Chiki
2112
2113 * Old_Italic
2114
2115 * Old_North_Arabian
2116
2117 * Old_Permic
2118
2119 * Old_Persian
2120
2121 * Oriya
2122
2123 * Old_South_Arabian
2124
2125 * Old_Turkic
2126
2127 * Osmanya
2128
2129 * Pahawh_Hmong
2130
2131 * Palmyrene
2132
2133 * Pau_Cin_Hau
2134
2135 * Phags_Pa
2136
2137 * Phoenician
2138
2139 * Psalter_Pahlavi
2140
2141 * Rejang
2142
2143 * Runic
2144
2145 * Samaritan
2146
2147 * Saurashtra
2148
2149 * Sharada
2150
2151 * Shavian
2152
2153 * Siddham
2154
2155 * Sinhala
2156
2157 * Sora_Sompeng
2158
2159 * Sundanese
2160
2161 * Syloti_Nagri
2162
2163 * Syriac
2164
2165 * Tagalog
2166
2167 * Tagbanwa
2168
2169 * Tai_Le
2170
2171 * Tai_Tham
2172
2173 * Tai_Viet
2174
2175 * Takri
2176
2177 * Tamil
2178
2179 * Telugu
2180
2181 * Thaana
2182
2183 * Thai
2184
2185 * Tibetan
2186
2187 * Tifinagh
2188
2189 * Tirhuta
2190
2191 * Ugaritic
2192
2193 * Vai
2194
2195 * Warang_Citi
2196
2197 * Yi
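
As an illustration of matching by script name, the sketch below (module
name hypothetical) uses the Greek script from the list above. The Greek
letters are written as code points to keep the example independent of
source file encoding.

```erlang
-module(script_prop_demo).
-export([run/0]).

run() ->
    %% "ab " followed by alpha, beta, gamma (U+03B1..U+03B3).
    Subject  = [$a, $b, $\s, 16#3b1, 16#3b2, 16#3b3],
    Greek    = re:run(Subject, "\\p{Greek}+",
                      [unicode, {capture, first, list}]),
    NotGreek = re:run(Subject, "\\P{Greek}+",
                      [unicode, {capture, first, list}]),
    {Greek, NotGreek}.
    %% expected: {{match,[[945,946,947]]},{match,["ab "]}}
```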
2198
2199 Each character has exactly one Unicode general category property, spec‐
2200 ified by a two-letter acronym. For compatibility with Perl, negation
2201 can be specified by including a circumflex between the opening brace
2202 and the property name. For example, \p{^Lu} is the same as \P{Lu}.
2203
2204 If only one letter is specified with \p or \P, it includes all the gen‐
2205 eral category properties that start with that letter. In this case, in
2206 the absence of negation, the curly brackets in the escape sequence are
2207 optional. The following two examples have the same effect:
2208
2209 \p{L}
2210 \pL
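
A short sketch of the general category syntax (module name
hypothetical). It checks that the circumflex form of negation agrees
with \P and that the braces are optional for a one-letter property.

```erlang
-module(category_prop_demo).
-export([run/0]).

run() ->
    S = "Hello World",
    Opts = [global, {return, list}],
    %% Replace every uppercase letter.
    Upper = re:replace(S, "\\p{Lu}", "_", Opts),
    %% \p{^Lu} and \P{Lu} should give identical results.
    SameNegation =
        re:replace(S, "\\p{^Lu}", "_", Opts) =:=
        re:replace(S, "\\P{Lu}", "_", Opts),
    %% \p{L} and \pL should give identical results.
    Braced = re:run(S, "\\p{L}+", [{capture, first, list}]),
    Bare   = re:run(S, "\\pL+", [{capture, first, list}]),
    {Upper, SameNegation, Braced, Bare}.
    %% expected: {"_ello _orld",true,{match,["Hello"]},{match,["Hello"]}}
```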
2211
2212 The following general category property codes are supported:
2213
2214 C:
2215 Other
2216
2217 Cc:
2218 Control
2219
2220 Cf:
2221 Format
2222
2223 Cn:
2224 Unassigned
2225
2226 Co:
2227 Private use
2228
2229 Cs:
2230 Surrogate
2231
2232 L:
2233 Letter
2234
2235 Ll:
2236 Lowercase letter
2237
2238 Lm:
2239 Modifier letter
2240
2241 Lo:
2242 Other letter
2243
2244 Lt:
2245 Title case letter
2246
2247 Lu:
2248 Uppercase letter
2249
2250 M:
2251 Mark
2252
2253 Mc:
2254 Spacing mark
2255
2256 Me:
2257 Enclosing mark
2258
2259 Mn:
2260 Non-spacing mark
2261
2262 N:
2263 Number
2264
2265 Nd:
2266 Decimal number
2267
2268 Nl:
2269 Letter number
2270
2271 No:
2272 Other number
2273
2274 P:
2275 Punctuation
2276
2277 Pc:
2278 Connector punctuation
2279
2280 Pd:
2281 Dash punctuation
2282
2283 Pe:
2284 Close punctuation
2285
2286 Pf:
2287 Final punctuation
2288
2289 Pi:
2290 Initial punctuation
2291
2292 Po:
2293 Other punctuation
2294
2295 Ps:
2296 Open punctuation
2297
2298 S:
2299 Symbol
2300
2301 Sc:
2302 Currency symbol
2303
2304 Sk:
2305 Modifier symbol
2306
2307 Sm:
2308 Mathematical symbol
2309
2310 So:
2311 Other symbol
2312
2313 Z:
2314 Separator
2315
2316 Zl:
2317 Line separator
2318
2319 Zp:
2320 Paragraph separator
2321
2322 Zs:
2323 Space separator
2324
2325 The special property L& is also supported. It matches a character that
2326 has the Lu, Ll, or Lt property, that is, a letter that is not classi‐
2327 fied as a modifier or "other".
2328
2329 The Cs (Surrogate) property applies only to characters in the range
2330 U+D800 to U+DFFF. Such characters are invalid in Unicode strings and so
2331 cannot be tested by PCRE. Perl does not support the Cs property.
2332
2333 The long synonyms for property names supported by Perl (such as \p{Let‐
2334 ter}) are not supported by PCRE. It is not permitted to prefix any of
2335 these properties with "Is".
2336
2337 No character in the Unicode table has the Cn (unassigned) property.
2338 This property is instead assumed for any code point that is not in the
2339 Unicode table.
2340
2341 Specifying caseless matching does not affect these escape sequences.
2342 For example, \p{Lu} always matches only uppercase letters. This is dif‐
2343 ferent from the behavior of current versions of Perl.
2344
2345 Matching characters by Unicode property is not fast, as PCRE must do a
2346 multistage table lookup to find a character property. That is why the
2347 traditional escape sequences such as \d and \w do not use Unicode prop‐
2348 erties in PCRE by default. However, you can make them do so by setting
2349 option ucp or by starting the pattern with (*UCP).
2350
2351 Extended Grapheme Clusters
2352
2353 The \X escape matches any number of Unicode characters that form an
2354 "extended grapheme cluster", and treats the sequence as an atomic group
2355 (see below). Up to and including release 8.31, PCRE matched an earlier,
2356 simpler definition that was equivalent to (?>\PM\pM*). That is, it
2357 matched a character without the "mark" property, followed by zero or
2358 more characters with the "mark" property. Characters with the "mark"
2359 property are typically non-spacing accents that affect the preceding
2360 character.
2361
2362 This simple definition was extended in Unicode to include more compli‐
2363 cated kinds of composite character by giving each character a grapheme
2364 breaking property, and creating rules that use these properties to de‐
2365 fine the boundaries of extended grapheme clusters. In PCRE releases
2366 later than 8.31, \X matches one of these clusters.
2367
2368 \X always matches at least one character. Then it decides whether to
2369 add more characters according to the following rules for ending a clus‐
2370 ter:
2371
2372 * End at the end of the subject string.
2373
2374 * Do not end between CR and LF; otherwise end after any control char‐
2375 acter.
2376
2377 * Do not break Hangul (a Korean script) syllable sequences. Hangul
2378 characters are of five types: L, V, T, LV, and LVT. An L character
2379 can be followed by an L, V, LV, or LVT character. An LV or V char‐
2380 acter can be followed by a V or T character. An LVT or T character
2381 can be followed only by a T character.
2382
2383 * Do not end before extending characters or spacing marks. Characters
2384 with the "mark" property always have the "extend" grapheme breaking
2385 property.
2386
2387 * Do not end after prepend characters.
2388
2389 * Otherwise, end the cluster.
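
The rule about extending characters can be illustrated with a base
letter followed by a combining mark. This is a sketch only, with a
hypothetical module name.

```erlang
-module(grapheme_demo).
-export([run/0]).

run() ->
    %% "e" + U+0301 COMBINING ACUTE ACCENT, followed by "x".
    Subject = [$e, 16#0301, $x],
    %% \X takes the base character together with the mark.
    Cluster = re:run(Subject, "\\X", [unicode, {capture, first, list}]),
    %% A dot takes one character only.
    Dot     = re:run(Subject, ".", [unicode, {capture, first, list}]),
    {Cluster, Dot}.
    %% expected: {{match,[[101,769]]},{match,["e"]}}
```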
2390
2391 PCRE Additional Properties
2392
2393 In addition to the standard Unicode properties described earlier, PCRE
2394 supports four more that make it possible to convert traditional escape
2395 sequences, such as \w and \s, to use Unicode properties. PCRE uses these
2396 non-standard, non-Perl properties internally when the ucp option is
2397 passed. However, they can also be used explicitly. The properties are
2398 as follows:
2399
2400 Xan:
2401 Any alphanumeric character. Matches characters that have either the
2402 L (letter) or the N (number) property.
2403
2404 Xps:
2405 Any Posix space character. Matches the characters tab, line feed,
2406 vertical tab, form feed, carriage return, and any other character
2407 that has the Z (separator) property.
2408
2409 Xsp:
2410 Any Perl space character. Matches the same characters as Xps. (Before
2411 PCRE release 8.34, vertical tab was excluded.)
2412
2413 Xwd:
2414 Any Perl "word" character. Matches the same characters as Xan, plus
2415 underscore.
2416
2417 Perl and POSIX space are now the same. Perl added VT to its space char‐
2418 acter set at release 5.18 and PCRE changed at release 8.34.
2419
2420 Xan matches characters that have either the L (letter) or the N (num‐
2421 ber) property. Xps matches the characters tab, linefeed, vertical tab,
2422 form feed, or carriage return, and any other character that has the Z
2423 (separator) property. Xsp is the same as Xps; it used to exclude verti‐
2424 cal tab, for Perl compatibility, but Perl changed, and so PCRE followed
2425 at release 8.34. Xwd matches the same characters as Xan, plus under‐
2426 score.
2427
2428 There is another non-standard property, Xuc, which matches any charac‐
2429 ter that can be represented by a Universal Character Name in C++ and
2430 other programming languages. These are the characters $, @, ` (grave
2431 accent), and all characters with Unicode code points >= U+00A0, except
2432 for the surrogates U+D800 to U+DFFF. Notice that most base (ASCII)
2433 characters are excluded. (Universal Character Names are of the form
2434 \uHHHH or \UHHHHHHHH, where H is a hexadecimal digit. Notice that the
2435 Xuc property does not match these sequences but the characters that
2436 they represent.)
2437
2438 Resetting the Match Start
2439
2440 The escape sequence \K causes any previously matched characters not to
2441 be included in the final matched sequence. For example, the following
2442 pattern matches "foobar", but reports that it has matched "bar":
2443
2444 foo\Kbar
2445
2446 This feature is similar to a lookbehind assertion (described below).
2447 However, in this case, the part of the subject before the real match
2448 does not have to be of fixed length, as lookbehind assertions do. The
2449 use of \K does not interfere with the setting of captured substrings.
2450 For example, when the following pattern matches "foobar", the first
2451 substring is still set to "foo":
2452
2453 (foo)\Kbar
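
Both patterns above can be tried from Erlang as in the following sketch
(module name hypothetical). With {capture, all, list}, the first element
is the reported overall match and the rest are the captured substrings.

```erlang
-module(reset_start_demo).
-export([run/0]).

run() ->
    %% "foo" must be present, but only "bar" is reported.
    Plain   = re:run("foobar", "foo\\Kbar", [{capture, all, list}]),
    %% \K does not disturb capture group 1.
    Grouped = re:run("foobar", "(foo)\\Kbar", [{capture, all, list}]),
    {Plain, Grouped}.
    %% expected: {{match,["bar"]},{match,["bar","foo"]}}
```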
2454
2455 Perl documents that the use of \K within assertions is "not well de‐
2456 fined". In PCRE, \K is acted upon when it occurs inside positive asser‐
2457 tions, but is ignored in negative assertions. Note that when a pattern
2458 such as (?=ab\K) matches, the reported start of the match can be
2459 greater than the end of the match.
2460
2461 Simple Assertions
2462
2463 The final use of backslash is for certain simple assertions. An asser‐
2464 tion specifies a condition that must be met at a particular point in a
2465 match, without consuming any characters from the subject string. The
2466 use of subpatterns for more complicated assertions is described below.
2467 The following are the backslashed assertions:
2468
2469 \b:
2470 Matches at a word boundary.
2471
2472 \B:
2473 Matches when not at a word boundary.
2474
2475 \A:
2476 Matches at the start of the subject.
2477
2478 \Z:
2479 Matches at the end of the subject, and before a newline at the end
2480 of the subject.
2481
2482 \z:
2483 Matches only at the end of the subject.
2484
2485 \G:
2486 Matches at the first matching position in the subject.
2487
2488 Inside a character class, \b has a different meaning; it matches the
2489 backspace character. If any other of these assertions appears in a
2490 character class, by default it matches the corresponding literal char‐
2491 acter (for example, \B matches the letter B).
2492
2493 A word boundary is a position in the subject string where the current
2494 character and the previous character do not both match \w or \W (that
2495 is, one matches \w and the other matches \W), or the start or end of
2496 the string if the first or last character matches \w, respectively. In
2497 UTF mode, the meanings of \w and \W can be changed by setting option
2498 ucp. When this is done, it also affects \b and \B. PCRE and Perl do not
2499 have a separate "start of word" or "end of word" metasequence. However,
2500 whatever follows \b normally determines which it is. For example, the
2501 fragment \ba matches "a" at the start of a word.
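
A minimal sketch of word-boundary matching (module name hypothetical).
The results are reported as {Offset, Length} pairs.

```erlang
-module(word_boundary_demo).
-export([run/0]).

run() ->
    Opts = [global, {capture, first, index}],
    %% Only the standalone words match, not the "cat" inside "concat".
    Whole = re:run("cat concat cat", "\\bcat\\b", Opts),
    %% \ba matches "a" only where it starts a word.
    Start = re:run("a banana", "\\ba", Opts),
    {Whole, Start}.
    %% expected: {{match,[[{0,3}],[{11,3}]]},{match,[[{0,1}]]}}
```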
2502
2503 The \A, \Z, and \z assertions differ from the traditional circumflex
2504 and dollar (described in the next section) in that they only ever match
2505 at the very start and end of the subject string, whatever options are
2506 set. Thus, they are independent of multiline mode. These three asser‐
2507 tions are not affected by options notbol or noteol, which affect only
2508 the behavior of the circumflex and dollar metacharacters. However, if
2509 argument startoffset of run/3 is non-zero, indicating that matching is
2510 to start at a point other than the beginning of the subject, \A can
2511 never match. The difference between \Z and \z is that \Z matches before
2512 a newline at the end of the string and at the very end, while \z
2513 matches only at the end.
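
The following sketch (module name hypothetical) contrasts \A, \Z, and
\z with each other and with circumflex. It assumes that the startoffset
argument mentioned above corresponds to option {offset, N} of run/3.

```erlang
-module(subject_anchor_demo).
-export([run/0]).

run() ->
    None = {capture, none},
    %% \Z also matches before a newline that ends the subject.
    UpperZ = re:run("abc\n", "abc\\Z", [None]),
    %% \z matches only at the very end.
    LowerZ = re:run("abc\n", "abc\\z", [None]),
    %% In multiline mode circumflex matches after the newline ...
    Caret  = re:run("x\nabc", "^abc", [multiline, None]),
    %% ... but \A is independent of multiline.
    StartA = re:run("x\nabc", "\\Aabc", [multiline, None]),
    %% With a non-zero start offset, \A can never match.
    Offset = re:run("abc", "\\Ab", [{offset, 1}, None]),
    {UpperZ, LowerZ, Caret, StartA, Offset}.
    %% expected: {match,nomatch,match,nomatch,nomatch}
```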
2514
2515 The \G assertion is true only when the current matching position is at
2516 the start point of the match, as specified by argument startoffset of
2517 run/3. It differs from \A when the value of startoffset is non-zero. By
2518 calling run/3 multiple times with appropriate arguments, you can mimic
2519 the Perl option /g, and it is in this kind of implementation where \G
2520 can be useful.
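
A sketch of \G with a non-zero start point (module name hypothetical),
again assuming that startoffset is supplied through option {offset, N}
of run/3.

```erlang
-module(start_point_demo).
-export([run/0]).

run() ->
    %% At offset 0 the subject starts with "x", so \Gabc fails.
    AtZero  = re:run("xabc", "\\Gabc", [{capture, none}]),
    %% Starting at offset 1, \G is true there and "abc" follows.
    AtOne   = re:run("xabc", "\\Gabc",
                     [{offset, 1}, {capture, first, index}]),
    %% \A still refers to the real start of the subject.
    AnchorA = re:run("xabc", "\\Aabc", [{offset, 1}, {capture, none}]),
    {AtZero, AtOne, AnchorA}.
    %% expected: {nomatch,{match,[{1,3}]},nomatch}
```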
2521
2522 Notice, however, that the PCRE interpretation of \G, as the start of
2523 the current match, is subtly different from Perl, which defines it as
2524 the end of the previous match. In Perl, these can be different when the
2525 previously matched string was empty. As PCRE does only one match at a
2526 time, it cannot reproduce this behavior.
2527
2528 If all the alternatives of a pattern begin with \G, the expression is
2529 anchored to the starting match position, and the "anchored" flag is set
2530 in the compiled regular expression.
2531
2533 The circumflex and dollar metacharacters are zero-width assertions.
2534 That is, they test for a particular condition to be true without con‐
2535 suming any characters from the subject string.
2536
2537 Outside a character class, in the default matching mode, the circumflex
2538 character is an assertion that is true only if the current matching
2539 point is at the start of the subject string. If argument startoffset of
2540 run/3 is non-zero, circumflex can never match if option multiline is
2541 unset. Inside a character class, circumflex has an entirely different
2542 meaning (see below).
2543
2544 Circumflex need not be the first character of the pattern if several
2545 alternatives are involved, but it must be the first thing in each
2546 alternative in which it appears if the pattern is ever to match that
2547 branch. If all possible alternatives start with a circumflex, that is,
2548 if the pattern is constrained to match only at the start of the sub‐
2549 ject, it is said to be an "anchored" pattern. (There are also other
2550 constructs that can cause a pattern to be anchored.)
2551
2552 The dollar character is an assertion that is true only if the current
2553 matching point is at the end of the subject string, or immediately be‐
2554 fore a newline at the end of the string (by default). Notice however
2555 that it does not match the newline. Dollar need not be the last
2556 character of the pattern if several alternatives are involved, but it
2557 must be the last item in any branch in which it appears. Dollar has no
2558 special meaning in a character class.
2559
2560 The meaning of dollar can be changed so that it matches only at the
2561 very end of the string, by setting option dollar_endonly at compile
2562 time. This does not affect the \Z assertion.
2563
2564 The meanings of the circumflex and dollar characters are changed if op‐
2565 tion multiline is set. When this is the case, a circumflex matches im‐
2566 mediately after internal newlines and at the start of the subject
2567 string. It does not match after a newline that ends the string. A dol‐
2568 lar matches before any newlines in the string, and at the very end,
2569 when multiline is set. When newline is specified as the two-character
2570 sequence CRLF, isolated CR and LF characters do not indicate newlines.
2571
2572 For example, the pattern /^abc$/ matches the subject string "def\nabc"
2573 (where \n represents a newline) in multiline mode, but not otherwise.
2574 So, patterns that are anchored in single-line mode because all branches
2575 start with ^ are not anchored in multiline mode, and a match for cir‐
2576 cumflex is possible when argument startoffset of run/3 is non-zero. Op‐
2577 tion dollar_endonly is ignored if multiline is set.
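
The example above, together with the effect of dollar_endonly, can be
sketched as follows (module name hypothetical). The default newline
convention, a single LF, is assumed.

```erlang
-module(caret_dollar_demo).
-export([run/0]).

run() ->
    None = {capture, none},
    %% Single-line mode: ^ matches only at the start of the subject.
    Single  = re:run("def\nabc", "^abc$", [None]),
    %% Multiline mode: ^ also matches after the internal newline.
    Multi   = re:run("def\nabc", "^abc$", [multiline, None]),
    %% By default $ matches before a newline that ends the subject.
    Dollar  = re:run("abc\n", "abc$", [None]),
    %% With dollar_endonly it matches only at the very end.
    EndOnly = re:run("abc\n", "abc$", [dollar_endonly, None]),
    {Single, Multi, Dollar, EndOnly}.
    %% expected: {nomatch,match,match,nomatch}
```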
2578
2579 Notice that the sequences \A, \Z, and \z can be used to match the start
2580 and end of the subject in both modes. If all branches of a pattern start
2581 with \A, it is always anchored, regardless of whether multiline is set.
2582
2584 Outside a character class, a dot in the pattern matches any character
2585 in the subject string except (by default) a character that signifies
2586 the end of a line.
2587
2588 When a line ending is defined as a single character, dot never matches
2589 that character. When the two-character sequence CRLF is used, dot does
2590 not match CR if it is immediately followed by LF, otherwise it matches
2591 all characters (including isolated CRs and LFs). When any Unicode line
2592 endings are recognized, dot does not match CR, LF, or any of the other
2593 line-ending characters.
2594
2595 The behavior of dot regarding newlines can be changed. If option dotall
2596 is set, a dot matches any character, without exception. If the two-
2597 character sequence CRLF is present in the subject string, it takes two
2598 dots to match it.
2599
2600 The handling of dot is entirely independent of the handling of
2601 circumflex and dollar; the only relationship is that both involve
2602 newlines. Dot has no special meaning in a character class.
2603
2604 The escape sequence \N behaves like a dot, except that it is not
2605 affected by option dotall. That is, it matches any character except
2606 one that signifies the end of a line. Perl also uses \N to match char‐
2607 acters by name but PCRE does not support this.
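
A brief sketch of how dot and \N treat a newline (module name
hypothetical), assuming the default LF newline convention.

```erlang
-module(dot_newline_demo).
-export([run/0]).

run() ->
    None = {capture, none},
    %% By default a dot does not match the newline.
    Dot    = re:run("a\nb", "a.b", [None]),
    %% With dotall it matches any character.
    DotAll = re:run("a\nb", "a.b", [dotall, None]),
    %% \N is not affected by dotall.
    BigN   = re:run("a\nb", "a\\Nb", [dotall, None]),
    {Dot, DotAll, BigN}.
    %% expected: {nomatch,match,nomatch}
```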
2608
2610 Outside a character class, the escape sequence \C matches any data
2611 unit, regardless of whether a UTF mode is set. One data unit is one byte.
2612 Unlike a dot, \C always matches line-ending characters. The feature is
2613 provided in Perl to match individual bytes in UTF-8 mode, but it is un‐
2614 clear how it can usefully be used. As \C breaks up characters into in‐
2615 dividual data units, matching one unit with \C in a UTF mode means that
2616 the remaining string can start with a malformed UTF character. This has
2617 undefined results, as PCRE assumes that it deals with valid UTF
2618 strings.
2619
2620 PCRE does not allow \C to appear in lookbehind assertions (described
2621 below) in a UTF mode, as this would make it impossible to calculate the
2622 length of the lookbehind.
2623
2624 The \C escape sequence is best avoided. However, one way of using it
2625 that avoids the problem of malformed UTF characters is to use a look‐
2626 ahead to check the length of the next character, as in the following
2627 pattern, which can be used with a UTF-8 string (ignore whitespace and
2628 line breaks):
2629
2630 (?| (?=[\x00-\x7f])(\C) |
2631 (?=[\x80-\x{7ff}])(\C)(\C) |
2632 (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
2633 (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
2634
2635 A group that starts with (?| resets the capturing parentheses numbers
2636 in each alternative (see section Duplicate Subpattern Numbers). The as‐
2637 sertions at the start of each branch check the next UTF-8 character for
2638 values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The indi‐
2639 vidual bytes of the character are then captured by the appropriate num‐
2640 ber of groups.
2641
2643 An opening square bracket introduces a character class, terminated by a
2644 closing square bracket. A closing square bracket on its own is not spe‐
2645 cial by default. However, if option PCRE_JAVASCRIPT_COMPAT is set, a
2646 lone closing square bracket causes a compile-time error. If a closing
2647 square bracket is required as a member of the class, it is to be the
2648 first data character in the class (after an initial circumflex, if
2649 present) or escaped with a backslash.
2650
2651 A character class matches a single character in the subject. In a UTF
2652 mode, the character can be more than one data unit long. A matched
2653 character must be in the set of characters defined by the class, unless
2654 the first character in the class definition is a circumflex, in which
2655 case the subject character must not be in the set defined by the class.
2656 If a circumflex is required as a member of the class, ensure that it is
2657 not the first character, or escape it with a backslash.
2658
2659 For example, the character class [aeiou] matches any lowercase vowel,
2660 while [^aeiou] matches any character that is not a lowercase vowel. No‐
2661 tice that a circumflex is just a convenient notation for specifying the
2662 characters that are in the class by enumerating those that are not. A
2663 class that starts with a circumflex is not an assertion; it still con‐
2664 sumes a character from the subject string, and therefore it fails if
2665 the current pointer is at the end of the string.
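
The vowel classes above can be tried as in this sketch (module name
hypothetical). The last two calls show that a negated class still has
to consume a character.

```erlang
-module(vowel_class_demo).
-export([run/0]).

run() ->
    %% Remove every lowercase vowel.
    Stripped = re:replace("education", "[aeiou]", "",
                          [global, {return, list}]),
    %% Nothing follows "a", so the negated class has nothing to consume.
    AtEnd    = re:run("a", "a[^aeiou]", [{capture, none}]),
    %% A newline is "not a vowel" and is consumed like any character.
    Newline  = re:run("a\n", "a[^aeiou]", [{capture, none}]),
    {Stripped, AtEnd, Newline}.
    %% expected: {"dctn",nomatch,match}
```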
2666
2667 In UTF-8 mode, characters with values > 255 (0xff) can be included in
2668 a class as a literal string of data units, or by using the \x{ escaping
2669 mechanism.
2670
2671 When caseless matching is set, any letters in a class represent both
2672 their uppercase and lowercase versions. For example, a caseless [aeiou]
2673 matches "A" and "a", and a caseless [^aeiou] does not match "A", but a
2674 caseful version would. In a UTF mode, PCRE always understands the con‐
2675 cept of case for characters whose values are < 256, so caseless match‐
2676 ing is always possible. For characters with higher values, the concept
2677 of case is supported only if PCRE is compiled with Unicode property
2678 support. If you want to use caseless matching in a UTF mode for
2679 characters >= 256, ensure that PCRE is compiled with Unicode property
2680 support and with UTF support.
2681
Characters that can indicate line breaks are never treated in any
special way when matching character classes, whatever line-ending
sequence is in use, and whatever setting of options dotall and
multiline is used. A class such as [^a] always matches one of these
characters.
2687
2688 The minus (hyphen) character can be used to specify a range of charac‐
2689 ters in a character class. For example, [d-m] matches any letter be‐
2690 tween d and m, inclusive. If a minus character is required in a class,
2691 it must be escaped with a backslash or appear in a position where it
2692 cannot be interpreted as indicating a range, typically as the first or
2693 last character in the class, or immediately after a range. For example,
2694 [b-d-z] matches letters in the range b to d, a hyphen character, or z.
2695
2696 The literal character "]" cannot be the end character of a range. A
2697 pattern such as [W-]46] is interpreted as a class of two characters
2698 ("W" and "-") followed by a literal string "46]", so it would match
2699 "W46]" or "-46]". However, if "]" is escaped with a backslash, it is
2700 interpreted as the end of range, so [W-\]46] is interpreted as a class
2701 containing a range followed by two other characters. The octal or hexa‐
2702 decimal representation of "]" can also be used to end a range.
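
The hyphen and "]" rules can be tried out with a small sketch such as
the following. The module name class_ranges and the helper fun Hit are
hypothetical. Notice that the backslash is doubled because the pattern
is written as an Erlang string literal.

```erlang
-module(class_ranges).
-export([demo/0]).

demo() ->
    Hit = fun(Subject, Pattern) ->
              re:run(Subject, Pattern, [{capture, first, list}])
          end,
    [Hit("-", "[b-d-z]"),       % hyphen after a range is a literal
     Hit("e", "[b-d-z]"),       % "e" is outside b-d and is not "-" or "z"
     Hit("-46]", "[W-]46]"),    % class [W-] followed by the literal "46]"
     Hit("Z", "[W-\\]46]")].    % escaped "]" ends the range W to ]
```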
2703
2704 An error is generated if a POSIX character class (see below) or an es‐
2705 cape sequence other than one that defines a single character appears at
2706 a point where a range ending character is expected. For example,
2707 [z-\xff] is valid, but [A-\d] and [A-[:digit:]] are not.
2708
2709 Ranges operate in the collating sequence of character values. They can
2710 also be used for characters specified numerically, for example,
2711 [\000-\037]. Ranges can include any characters that are valid for the
2712 current mode.
2713
2714 If a range that includes letters is used when caseless matching is set,
2715 it matches the letters in either case. For example, [W-c] is equivalent
2716 to [][\\^_`wxyzabc], matched caselessly. In a non-UTF mode, if charac‐
2717 ter tables for a French locale are in use, [\xc8-\xcb] matches accented
2718 E characters in both cases. In UTF modes, PCRE supports the concept of
2719 case for characters with values > 255 only when it is compiled with
2720 Unicode property support.
2721
2722 The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
2723 \w, and \W can appear in a character class, and add the characters that
2724 they match to the class. For example, [\dABCDEF] matches any hexadeci‐
2725 mal digit. In UTF modes, option ucp affects the meanings of \d, \s, \w
2726 and their uppercase partners, just as it does when they appear outside
2727 a character class, as described in section Generic Character Types ear‐
2728 lier. The escape sequence \b has a different meaning inside a character
2729 class; it matches the backspace character. The sequences \B, \N, \R,
2730 and \X are not special inside a character class. Like any other unrec‐
2731 ognized escape sequences, they are treated as the literal characters
2732 "B", "N", "R", and "X".
2733
2734 A circumflex can conveniently be used with the uppercase character
2735 types to specify a more restricted set of characters than the matching
2736 lowercase type. For example, class [^\W_] matches any letter or digit,
2737 but not underscore, while [\w] includes underscore. A positive charac‐
2738 ter class is to be read as "something OR something OR ..." and a nega‐
2739 tive class as "NOT something AND NOT something AND NOT ...".
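
A minimal sketch of escape sequences inside classes follows. The module
name class_escapes is hypothetical. The subject [8] is a one-character
string holding the backspace character.

```erlang
-module(class_escapes).
-export([demo/0]).

demo() ->
    [re:run("x7F", "[\\dABCDEF]+", [{capture, first, list}]), % hex digits
     re:run([8], "[\\b]", [{capture, none}]),     % \b in a class is backspace
     re:run("_", "[^\\W_]", [{capture, none}]),   % underscore is excluded
     re:run("_", "[\\w]", [{capture, none}])].    % \w includes underscore
```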
2740
2741 Only the following metacharacters are recognized in character classes:
2742
2743 * Backslash
2744
2745 * Hyphen (only where it can be interpreted as specifying a range)
2746
2747 * Circumflex (only at the start)
2748
2749 * Opening square bracket (only when it can be interpreted as intro‐
2750 ducing a Posix class name, or for a special compatibility feature;
2751 see the next two sections)
2752
2753 * Terminating closing square bracket
2754
2755 However, escaping other non-alphanumeric characters does no harm.
2756
2758 Perl supports the Posix notation for character classes. This uses names
2759 enclosed by [: and :] within the enclosing square brackets. PCRE also
2760 supports this notation. For example, the following matches "0", "1",
2761 any alphabetic character, or "%":
2762
2763 [01[:alpha:]%]
2764
2765 The following are the supported class names:
2766
2767 alnum:
2768 Letters and digits
2769
2770 alpha:
2771 Letters
2772
2773 blank:
2774 Space or tab only
2775
2776 cntrl:
2777 Control characters
2778
2779 digit:
2780 Decimal digits (same as \d)
2781
2782 graph:
2783 Printing characters, excluding space
2784
2785 lower:
2786 Lowercase letters
2787
2788 print:
2789 Printing characters, including space
2790
2791 punct:
2792 Printing characters, excluding letters, digits, and space
2793
2794 space:
2795 Whitespace (the same as \s from PCRE 8.34)
2796
2797 upper:
2798 Uppercase letters
2799
2800 word:
2801 "Word" characters (same as \w)
2802
2803 xdigit:
2804 Hexadecimal digits
2805
2806 There is another character class, ascii, that erroneously matches
2807 Latin-1 characters instead of the 0-127 range specified by POSIX. This
2808 cannot be fixed without altering the behaviour of other classes, so we
recommend matching the range with [\0-\x7f] instead.
2810
2811 The default "space" characters are HT (9), LF (10), VT (11), FF (12),
2812 CR (13), and space (32). If locale-specific matching is taking place,
2813 the list of space characters may be different; there may be fewer or
2814 more of them. "Space" used to be different to \s, which did not include
2815 VT, for Perl compatibility. However, Perl changed at release 5.18, and
2816 PCRE followed at release 8.34. "Space" and \s now match the same set of
2817 characters.
2818
2819 The name "word" is a Perl extension, and "blank" is a GNU extension
2820 from Perl 5.8. Another Perl extension is negation, which is indicated
2821 by a ^ character after the colon. For example, the following matches
2822 "1", "2", or any non-digit:
2823
2824 [12[:^digit:]]
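
The two POSIX class examples in this section can be checked with a
sketch like the following. The module name posix_classes is
hypothetical.

```erlang
-module(posix_classes).
-export([demo/0]).

demo() ->
    None = [{capture, none}],
    [re:run("%", "[01[:alpha:]%]", None),   % "%" is listed explicitly
     re:run("2", "[01[:alpha:]%]", None),   % "2" is not 0, 1, a letter, or "%"
     re:run("x", "[12[:^digit:]]", None),   % "x" is a non-digit
     re:run("3", "[12[:^digit:]]", None)].  % "3" is a digit other than 1 or 2
```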
2825
2826 PCRE (and Perl) also recognize the Posix syntax [.ch.] and [=ch=] where
2827 "ch" is a "collating element", but these are not supported, and an er‐
2828 ror is given if they are encountered.
2829
By default, characters with values > 255 do not match any of the Posix
character classes. However, if option ucp is passed to compile/2, some
of the classes are changed so that Unicode character properties are
used. This is achieved by replacing certain Posix classes by other
sequences, as follows:
2835
2836 [:alnum:]:
2837 Becomes \p{Xan}
2838
2839 [:alpha:]:
2840 Becomes \p{L}
2841
2842 [:blank:]:
2843 Becomes \h
2844
2845 [:digit:]:
2846 Becomes \p{Nd}
2847
2848 [:lower:]:
2849 Becomes \p{Ll}
2850
2851 [:space:]:
2852 Becomes \p{Xps}
2853
2854 [:upper:]:
2855 Becomes \p{Lu}
2856
2857 [:word:]:
2858 Becomes \p{Xwd}
2859
2860 Negated versions, such as [:^alpha:], use \P instead of \p. Three other
2861 POSIX classes are handled specially in UCP mode:
2862
2863 [:graph:]:
2864 This matches characters that have glyphs that mark the page when
2865 printed. In Unicode property terms, it matches all characters with
2866 the L, M, N, P, S, or Cf properties, except for:
2867
2868 U+061C:
2869 Arabic Letter Mark
2870
2871 U+180E:
2872 Mongolian Vowel Separator
2873
2874 U+2066 - U+2069:
2875 Various "isolate"s
2876
2877 [:print:]:
2878 This matches the same characters as [:graph:] plus space characters
2879 that are not controls, that is, characters with the Zs property.
2880
2881 [:punct:]:
2882 This matches all characters that have the Unicode P (punctuation)
2883 property, plus those characters whose code points are less than 128
2884 that have the S (Symbol) property.
2885
2886 The other POSIX classes are unchanged, and match only characters with
2887 code points less than 128.
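
The effect of option ucp on a POSIX class can be sketched as follows,
assuming a subject that consists of the single code point U+0100 (Latin
capital letter A with macron, general category Lu). The module name
posix_ucp is hypothetical.

```erlang
-module(posix_ucp).
-export([demo/0]).

demo() ->
    Subject = [16#100],   % one character with a value > 255
    [re:run(Subject, "[[:upper:]]", [unicode, {capture, none}]),
     re:run(Subject, "[[:upper:]]", [unicode, ucp, {capture, none}])].
```

Without ucp the class does not match the character; with ucp,
[:upper:] is handled as \p{Lu} and the character matches.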
2888
2889 Compatibility Feature for Word Boundaries
2890
2891 In the POSIX.2 compliant library that was included in 4.4BSD Unix, the
2892 ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word"
2893 and "end of word". PCRE treats these items as follows:
2894
2895 [[:<:]]:
2896 is converted to \b(?=\w)
2897
2898 [[:>:]]:
2899 is converted to \b(?<=\w)
2900
2901 Only these exact character sequences are recognized. A sequence such as
[a[:<:]b] provokes an error for an unrecognized POSIX class name. This
2903 support is not compatible with Perl. It is provided to help migrations
2904 from other environments, and is best not used in any new patterns. Note
2905 that \b matches at the start and the end of a word (see "Simple asser‐
2906 tions" above), and in a Perl-style pattern the preceding or following
2907 character normally shows which is wanted, without the need for the as‐
2908 sertions that are used above in order to give exactly the POSIX behav‐
2909 iour.
2910
2912 Vertical bar characters are used to separate alternative patterns. For
2913 example, the following pattern matches either "gilbert" or "sullivan":
2914
2915 gilbert|sullivan
2916
2917 Any number of alternatives can appear, and an empty alternative is per‐
2918 mitted (matching the empty string). The matching process tries each al‐
2919 ternative in turn, from left to right, and the first that succeeds is
2920 used. If the alternatives are within a subpattern (defined in section
2921 Subpatterns), "succeeds" means matching the remaining main pattern and
2922 the alternative in the subpattern.
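
The left-to-right rule can be sketched as follows. The module name
alternation is hypothetical. The third call shows that an alternative
inside a subpattern only "succeeds" if the rest of the pattern (here
the $ assertion) also matches.

```erlang
-module(alternation).
-export([demo/0]).

demo() ->
    Opts = [{capture, first, list}],
    [re:run("sullivan", "gilbert|sullivan", Opts),
     re:run("abcd", "ab|abcd", Opts),        % first alternative wins
     re:run("abcd", "(?:ab|abcd)$", Opts)].  % "ab" fails at $, "abcd" is used
```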
2923
2925 The settings of the Perl-compatible options caseless, multiline,
2926 dotall, and extended can be changed from within the pattern by a se‐
2927 quence of Perl option letters enclosed between "(?" and ")". The option
2928 letters are as follows:
2929
2930 i:
2931 For caseless
2932
2933 m:
2934 For multiline
2935
2936 s:
2937 For dotall
2938
2939 x:
2940 For extended
2941
2942 For example, (?im) sets caseless, multiline matching. These options can
2943 also be unset by preceding the letter with a hyphen. A combined setting
2944 and unsetting such as (?im-sx), which sets caseless and multiline,
2945 while unsetting dotall and extended, is also permitted. If a letter ap‐
2946 pears both before and after the hyphen, the option is unset.
2947
2948 The PCRE-specific options dupnames, ungreedy, and extra can be changed
2949 in the same way as the Perl-compatible options by using the characters
2950 J, U, and X respectively.
2951
2952 When one of these option changes occurs at top-level (that is, not in‐
2953 side subpattern parentheses), the change applies to the remainder of
2954 the pattern that follows.
2955
2956 An option change within a subpattern (see section Subpatterns) affects
2957 only that part of the subpattern that follows it. So, the following
2958 matches abc and aBc and no other strings (assuming caseless is not
2959 used):
2960
2961 (a(?i)b)c
2962
2963 By this means, options can be made to have different settings in dif‐
2964 ferent parts of the pattern. Any changes made in one alternative do
2965 carry on into subsequent branches within the same subpattern. For exam‐
2966 ple:
2967
2968 (a(?i)b|c)
2969
2970 matches "ab", "aB", "c", and "C", although when matching "C" the first
2971 branch is abandoned before the option setting. This is because the ef‐
2972 fects of option settings occur at compile time. There would be some
2973 weird behavior otherwise.
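
A small sketch of these scoping rules, using the two patterns above.
The module name inline_options is hypothetical.

```erlang
-module(inline_options).
-export([demo/0]).

demo() ->
    None = [{capture, none}],
    [re:run("aBc", "(a(?i)b)c", None),    % (?i) covers only "b"
     re:run("abC", "(a(?i)b)c", None),    % "c" is outside the subpattern
     re:run("C", "(a(?i)b|c)", None)].    % (?i) carries into the next branch
```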
2974
2975 Note:
2976 Other PCRE-specific options can be set by the application when the com‐
2977 piling or matching functions are called. Sometimes the pattern can con‐
2978 tain special leading sequences, such as (*CRLF), to override what the
2979 application has set or what has been defaulted. Details are provided in
2980 section Newline Sequences earlier.
2981
2982 The (*UTF8) and (*UCP) leading sequences can be used to set UTF and
2983 Unicode property modes. They are equivalent to setting options unicode
2984 and ucp, respectively. The (*UTF) sequence is a generic version that
2985 can be used with any of the libraries. However, the application can set
2986 option never_utf, which locks out the use of the (*UTF) sequences.
2987
2988
2990 Subpatterns are delimited by parentheses (round brackets), which can be
2991 nested. Turning part of a pattern into a subpattern does two things:
2992
2993 1.:
2994 It localizes a set of alternatives. For example, the following pat‐
2995 tern matches "cataract", "caterpillar", or "cat":
2996
2997 cat(aract|erpillar|)
2998
2999 Without the parentheses, it would match "cataract", "erpillar", or
3000 an empty string.
3001
3002 2.:
3003 It sets up the subpattern as a capturing subpattern. That is, when
3004 the complete pattern matches, that portion of the subject string
3005 that matched the subpattern is passed back to the caller through
3006 the return value of run/3.
3007
3008 Opening parentheses are counted from left to right (starting from 1) to
3009 obtain numbers for the capturing subpatterns. For example, if the
3010 string "the red king" is matched against the following pattern, the
3011 captured substrings are "red king", "red", and "king", and are numbered
3012 1, 2, and 3, respectively:
3013
3014 the ((red|white) (king|queen))
3015
3016 It is not always helpful that plain parentheses fulfill two functions.
3017 Often a grouping subpattern is required without a capturing require‐
3018 ment. If an opening parenthesis is followed by a question mark and a
3019 colon, the subpattern does not do any capturing, and is not counted
3020 when computing the number of any subsequent capturing subpatterns. For
3021 example, if the string "the white queen" is matched against the follow‐
3022 ing pattern, the captured substrings are "white queen" and "queen", and
3023 are numbered 1 and 2:
3024
3025 the ((?:red|white) (king|queen))
3026
3027 The maximum number of capturing subpatterns is 65535.
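
The numbering of captured substrings can be observed with option
{capture, all, list}, as in the following sketch. The module name
capture_numbers is hypothetical. The first element of each result list
is the complete match, followed by the numbered subpatterns.

```erlang
-module(capture_numbers).
-export([demo/0]).

demo() ->
    Opts = [{capture, all, list}],
    [re:run("the red king",
            "the ((red|white) (king|queen))", Opts),
     re:run("the white queen",
            "the ((?:red|white) (king|queen))", Opts)].
```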
3028
3029 As a convenient shorthand, if any option settings are required at the
3030 start of a non-capturing subpattern, the option letters can appear be‐
3031 tween "?" and ":". Thus, the following two patterns match the same set
3032 of strings:
3033
3034 (?i:saturday|sunday)
3035 (?:(?i)saturday|sunday)
3036
3037 As alternative branches are tried from left to right, and options are
3038 not reset until the end of the subpattern is reached, an option setting
3039 in one branch does affect subsequent branches, so the above patterns
3040 match both "SUNDAY" and "Saturday".
3041
3043 Perl 5.10 introduced a feature where each alternative in a subpattern
3044 uses the same numbers for its capturing parentheses. Such a subpattern
3045 starts with (?| and is itself a non-capturing subpattern. For example,
3046 consider the following pattern:
3047
3048 (?|(Sat)ur|(Sun))day
3049
3050 As the two alternatives are inside a (?| group, both sets of capturing
3051 parentheses are numbered one. Thus, when the pattern matches, you can
3052 look at captured substring number one, whichever alternative matched.
3053 This construct is useful when you want to capture a part, but not all,
3054 of one of many alternatives. Inside a (?| group, parentheses are num‐
3055 bered as usual, but the number is reset at the start of each branch.
3056 The numbers of any capturing parentheses that follow the subpattern
3057 start after the highest number used in any branch. The following exam‐
3058 ple is from the Perl documentation; the numbers underneath show in
3059 which buffer the captured content is stored:
3060
# before  ---------------branch-reset----------- after
/ ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
# 1            2         2  3        2     3     4
3064
3065 A back reference to a numbered subpattern uses the most recent value
3066 that is set for that number by any subpattern. The following pattern
3067 matches "abcabc" or "defdef":
3068
3069 /(?|(abc)|(def))\1/
3070
3071 In contrast, a subroutine call to a numbered subpattern always refers
3072 to the first one in the pattern with the given number. The following
3073 pattern matches "abcabc" or "defabc":
3074
3075 /(?|(abc)|(def))(?1)/
3076
3077 If a condition test for a subpattern having matched refers to a non-
3078 unique number, the test is true if any of the subpatterns of that num‐
3079 ber have matched.
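
The patterns discussed in this section can be exercised with a sketch
like the following. The module name branch_reset is hypothetical. The
surrounding "/" delimiters shown above are Perl notation and are not
part of the pattern passed to re:run/3.

```erlang
-module(branch_reset).
-export([demo/0]).

demo() ->
    All = [{capture, all, list}],
    None = [{capture, none}],
    [re:run("Saturday", "(?|(Sat)ur|(Sun))day", All),
     re:run("Sunday", "(?|(Sat)ur|(Sun))day", All),
     re:run("defdef", "(?|(abc)|(def))\\1", None),    % back reference
     re:run("defabc", "(?|(abc)|(def))\\1", None),
     re:run("defabc", "(?|(abc)|(def))(?1)", None)].  % subroutine call
```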
3080
3081 An alternative approach using this "branch reset" feature is to use du‐
3082 plicate named subpatterns, as described in the next section.
3083
3085 Identifying capturing parentheses by number is simple, but it can be
3086 hard to keep track of the numbers in complicated regular expressions.
3087 Also, if an expression is modified, the numbers can change. To help
3088 with this difficulty, PCRE supports the naming of subpatterns. This
3089 feature was not added to Perl until release 5.10. Python had the fea‐
3090 ture earlier, and PCRE introduced it at release 4.0, using the Python
3091 syntax. PCRE now supports both the Perl and the Python syntax. Perl al‐
3092 lows identically numbered subpatterns to have different names, but PCRE
3093 does not.
3094
3095 In PCRE, a subpattern can be named in one of three ways: (?<name>...)
3096 or (?'name'...) as in Perl, or (?P<name>...) as in Python. References
3097 to capturing parentheses from other parts of the pattern, such as back
3098 references, recursion, and conditions, can be made by name and by num‐
3099 ber.
3100
3101 Names consist of up to 32 alphanumeric characters and underscores, but
3102 must start with a non-digit. Named capturing parentheses are still al‐
3103 located numbers as well as names, exactly as if the names were not
3104 present. The capture specification to run/3 can use named values if
3105 they are present in the regular expression.
3106
3107 By default, a name must be unique within a pattern, but this constraint
3108 can be relaxed by setting option dupnames at compile time. (Duplicate
3109 names are also always permitted for subpatterns with the same number,
3110 set up as described in the previous section.) Duplicate names can be
3111 useful for patterns where only one instance of the named parentheses
3112 can match. Suppose that you want to match the name of a weekday, either
3113 as a 3-letter abbreviation or as the full name, and in both cases you
3114 want to extract the abbreviation. The following pattern (ignoring the
3115 line breaks) does the job:
3116
3117 (?<DN>Mon|Fri|Sun)(?:day)?|
3118 (?<DN>Tue)(?:sday)?|
3119 (?<DN>Wed)(?:nesday)?|
3120 (?<DN>Thu)(?:rsday)?|
3121 (?<DN>Sat)(?:urday)?
3122
3123 There are five capturing substrings, but only one is ever set after a
3124 match. (An alternative way of solving this problem is to use a "branch
3125 reset" subpattern, as described in the previous section.)
3126
For capturing named subpatterns whose names are not unique, the first
3128 matching occurrence (counted from left to right in the subject) is re‐
3129 turned from run/3, if the name is specified in the values part of the
3130 capture statement. The all_names capturing value matches all the names
3131 in the same way.
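
The following sketch shows named values in the capture specification of
run/3, first with unique names and then with the weekday pattern above
and option dupnames. The module name named_groups and the names year
and month are hypothetical.

```erlang
-module(named_groups).
-export([demo/0]).

demo() ->
    Weekday = "(?<DN>Mon|Fri|Sun)(?:day)?|"
              "(?<DN>Tue)(?:sday)?|"
              "(?<DN>Wed)(?:nesday)?|"
              "(?<DN>Thu)(?:rsday)?|"
              "(?<DN>Sat)(?:urday)?",
    [re:run("2017-01", "(?<year>\\d{4})-(?<month>\\d{2})",
            [{capture, [year, month], list}]),
     %% Only the fourth DN subpattern is set for this subject.
     re:run("Thursday", Weekday, [dupnames, {capture, ['DN'], list}])].
```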
3132
3133 Note:
3134 You cannot use different names to distinguish between two subpatterns
3135 with the same number, as PCRE uses only the numbers when matching. For
3136 this reason, an error is given at compile time if different names are
3137 specified to subpatterns with the same number. However, you can specify
3138 the same name to subpatterns with the same number, even when dupnames
3139 is not set.
3140
3141
3143 Repetition is specified by quantifiers, which can follow any of the
3144 following items:
3145
3146 * A literal data character
3147
3148 * The dot metacharacter
3149
3150 * The \C escape sequence
3151
3152 * The \X escape sequence
3153
3154 * The \R escape sequence
3155
3156 * An escape such as \d or \pL that matches a single character
3157
3158 * A character class
3159
3160 * A back reference (see the next section)
3161
3162 * A parenthesized subpattern (including assertions)
3163
3164 * A subroutine call to a subpattern (recursive or otherwise)
3165
3166 The general repetition quantifier specifies a minimum and maximum num‐
3167 ber of permitted matches, by giving the two numbers in curly brackets
3168 (braces), separated by a comma. The numbers must be < 65536, and the
3169 first must be less than or equal to the second. For example, the fol‐
3170 lowing matches "zz", "zzz", or "zzzz":
3171
3172 z{2,4}
3173
3174 A closing brace on its own is not a special character. If the second
3175 number is omitted, but the comma is present, there is no upper limit.
3176 If the second number and the comma are both omitted, the quantifier
3177 specifies an exact number of required matches. Thus, the following
3178 matches at least three successive vowels, but can match many more:
3179
3180 [aeiou]{3,}
3181
3182 The following matches exactly eight digits:
3183
3184 \d{8}
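
The three quantifier forms above can be sketched as follows. The module
name quantifiers is hypothetical.

```erlang
-module(quantifiers).
-export([demo/0]).

demo() ->
    Opts = [{capture, first, list}],
    [re:run("zzzzz", "z{2,4}", Opts),          % at most four are taken
     re:run("z", "z{2,4}", Opts),              % fewer than the minimum
     re:run("aeiouae", "[aeiou]{3,}", Opts),   % no upper limit
     re:run("id 12345678", "\\d{8}", Opts)].   % exactly eight digits
```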
3185
3186 An opening curly bracket that appears in a position where a quantifier
3187 is not allowed, or one that does not match the syntax of a quantifier,
3188 is taken as a literal character. For example, {,6} is not a quantifier,
3189 but a literal string of four characters.
3190
3191 In Unicode mode, quantifiers apply to characters rather than to indi‐
3192 vidual data units. Thus, for example, \x{100}{2} matches two charac‐
3193 ters, each of which is represented by a 2-byte sequence in a UTF-8
3194 string. Similarly, \X{3} matches three Unicode extended grapheme clus‐
3195 ters, each of which can be many data units long (and they can be of
3196 different lengths).
3197
3198 The quantifier {0} is permitted, causing the expression to behave as if
3199 the previous item and the quantifier were not present. This can be use‐
3200 ful for subpatterns that are referenced as subroutines from elsewhere
3201 in the pattern (but see also section Defining Subpatterns for Use by
3202 Reference Only). Items other than subpatterns that have a {0} quanti‐
3203 fier are omitted from the compiled pattern.
3204
3205 For convenience, the three most common quantifiers have single-charac‐
3206 ter abbreviations:
3207
3208 *:
3209 Equivalent to {0,}
3210
3211 +:
3212 Equivalent to {1,}
3213
3214 ?:
3215 Equivalent to {0,1}
3216
3217 Infinite loops can be constructed by following a subpattern that can
3218 match no characters with a quantifier that has no upper limit, for ex‐
3219 ample:
3220
3221 (a?)*
3222
Earlier versions of Perl and PCRE used to give an error at compile time
for such patterns. However, as there are cases where this can be
useful, such patterns are now accepted. If any repetition of the
subpattern matches no characters, the loop is forcibly broken.
3227
3228 By default, the quantifiers are "greedy", that is, they match as much
3229 as possible (up to the maximum number of permitted times), without
3230 causing the remaining pattern to fail. The classic example of where
3231 this gives problems is in trying to match comments in C programs. These
3232 appear between /* and */. Within the comment, individual * and / char‐
3233 acters can appear. An attempt to match C comments by applying the pat‐
3234 tern
3235
3236 /\*.*\*/
3237
3238 to the string
3239
3240 /* first comment */ not comment /* second comment */
3241
3242 fails, as it matches the entire string owing to the greediness of the
3243 .* item.
3244
3245 However, if a quantifier is followed by a question mark, it ceases to
3246 be greedy, and instead matches the minimum number of times possible, so
3247 the following pattern does the right thing with the C comments:
3248
3249 /\*.*?\*/
3250
3251 The meaning of the various quantifiers is not otherwise changed, only
3252 the preferred number of matches. Do not confuse this use of question
3253 mark with its use as a quantifier in its own right. As it has two uses,
3254 it can sometimes appear doubled, as in
3255
3256 \d??\d
3257
3258 which matches one digit by preference, but can match two if that is the
3259 only way the remaining pattern matches.
3260
3261 If option ungreedy is set (an option that is not available in Perl),
3262 the quantifiers are not greedy by default, but individual ones can be
3263 made greedy by following them with a question mark. That is, it inverts
3264 the default behavior.
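
Greedy and non-greedy matching can be compared with a sketch like the
following, which reuses the C comment subject from above. The module
name greediness is hypothetical.

```erlang
-module(greediness).
-export([demo/0]).

demo() ->
    S = "/* first comment */ not comment /* second comment */",
    Opts = [{capture, first, list}],
    [re:run(S, "/\\*.*\\*/", Opts),              % greedy: whole string
     re:run(S, "/\\*.*?\\*/", Opts),             % lazy: first comment only
     re:run("12", "\\d??\\d", Opts),             % one digit by preference
     re:run("aaa", "a+", [ungreedy | Opts]),     % ungreedy inverts the default
     re:run("aaa", "a+?", [ungreedy | Opts])].   % "?" makes it greedy again
```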
3265
3266 When a parenthesized subpattern is quantified with a minimum repeat
3267 count that is > 1 or with a limited maximum, more memory is required
3268 for the compiled pattern, in proportion to the size of the minimum or
3269 maximum.
3270
3271 If a pattern starts with .* or .{0,} and option dotall (equivalent to
3272 Perl option /s) is set, thus allowing the dot to match newlines, the
3273 pattern is implicitly anchored, because whatever follows is tried
3274 against every character position in the subject string. So, there is no
3275 point in retrying the overall match at any position after the first.
3276 PCRE normally treats such a pattern as if it was preceded by \A.
3277
3278 In cases where it is known that the subject string contains no new‐
3279 lines, it is worth setting dotall to obtain this optimization, or al‐
3280 ternatively using ^ to indicate anchoring explicitly.
3281
3282 However, there are some cases where the optimization cannot be used.
3283 When .* is inside capturing parentheses that are the subject of a back
3284 reference elsewhere in the pattern, a match at the start can fail where
3285 a later one succeeds. Consider, for example:
3286
3287 (.*)abc\1
3288
3289 If the subject is "xyz123abc123", the match point is the fourth charac‐
3290 ter. Therefore, such a pattern is not implicitly anchored.
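
The match point can be observed with index captures, as in this sketch.
The module name leading_dotstar is hypothetical. Offsets returned by
run/3 are zero-based, so the fourth character has offset 3.

```erlang
-module(leading_dotstar).
-export([demo/0]).

demo() ->
    %% Returns {Offset, Length} for the whole match and for subpattern 1.
    re:run("xyz123abc123", "(.*)abc\\1", [dotall, {capture, all, index}]).
```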
3291
3292 Another case where implicit anchoring is not applied is when the lead‐
3293 ing .* is inside an atomic group. Once again, a match at the start can
3294 fail where a later one succeeds. Consider the following pattern:
3295
3296 (?>.*?a)b
3297
It matches "ab" in the subject "aab". The use of the backtracking
control verbs (*PRUNE) and (*SKIP) also disables this optimization.
3300
3301 When a capturing subpattern is repeated, the value captured is the sub‐
3302 string that matched the final iteration. For example, after
3303
3304 (tweedle[dume]{3}\s*)+
3305
3306 has matched "tweedledum tweedledee", the value of the captured sub‐
3307 string is "tweedledee". However, if there are nested capturing subpat‐
3308 terns, the corresponding captured values can have been set in previous
3309 iterations. For example, after
3310
3311 /(a|(b))+/
3312
3313 matches "aba", the value of the second captured substring is "b".
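
Both examples can be reproduced with a sketch such as the following.
The module name repeated_capture is hypothetical.

```erlang
-module(repeated_capture).
-export([demo/0]).

demo() ->
    All = [{capture, all, list}],
    [re:run("tweedledum tweedledee", "(tweedle[dume]{3}\\s*)+", All),
     %% Subpattern 2 keeps "b" from the second iteration, although the
     %% final iteration matched "a" through the other alternative.
     re:run("aba", "(a|(b))+", All)].
```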
3314
3316 With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
3317 repetition, failure of what follows normally causes the repeated item
3318 to be re-evaluated to see if a different number of repeats allows the
3319 remaining pattern to match. Sometimes it is useful to prevent this, ei‐
3320 ther to change the nature of the match, or to cause it to fail earlier
3321 than it otherwise might, when the author of the pattern knows that
3322 there is no point in carrying on.
3323
3324 Consider, for example, the pattern \d+foo when applied to the following
3325 subject line:
3326
3327 123456bar
3328
3329 After matching all six digits and then failing to match "foo", the nor‐
3330 mal action of the matcher is to try again with only five digits match‐
3331 ing item \d+, and then with four, and so on, before ultimately failing.
3332 "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
3333 the means for specifying that once a subpattern has matched, it is not
3334 to be re-evaluated in this way.
3335
3336 If atomic grouping is used for the previous example, the matcher gives
3337 up immediately on failing to match "foo" the first time. The notation
3338 is a kind of special parenthesis, starting with (?> as in the following
3339 example:
3340
3341 (?>\d+)foo
3342
3343 This kind of parenthesis "locks up" the part of the pattern it contains
3344 once it has matched, and a failure further into the pattern is pre‐
3345 vented from backtracking into it. Backtracking past it to previous
3346 items, however, works as normal.
3347
3348 An alternative description is that a subpattern of this type matches
3349 the string of characters that an identical standalone pattern would
3350 match, if anchored at the current point in the subject string.
3351
3352 Atomic grouping subpatterns are not capturing subpatterns. Simple cases
3353 such as the above example can be thought of as a maximizing repeat that
3354 must swallow everything it can. So, while both \d+ and \d+? are pre‐
3355 pared to adjust the number of digits they match to make the remaining
3356 pattern match, (?>\d+) can only match an entire sequence of digits.
3357
3358 Atomic groups in general can contain any complicated subpatterns, and
3359 can be nested. However, when the subpattern for an atomic group is just
3360 a single repeated item, as in the example above, a simpler notation,
3361 called a "possessive quantifier" can be used. This consists of an extra
3362 + character following a quantifier. Using this notation, the previous
3363 example can be rewritten as
3364
3365 \d++foo
3366
3367 Notice that a possessive quantifier can be used with an entire group,
3368 for example:
3369
3370 (abc|xyz){2,3}+
3371
Possessive quantifiers are always greedy; the setting of option
ungreedy is ignored. They are a convenient notation for the simpler
forms of an atomic group. There is no difference in meaning between a
possessive quantifier and the equivalent atomic group, but there can be
a performance difference; possessive quantifiers are probably slightly
faster.
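
The difference between ordinary, atomic, and possessive repetition can
be sketched as follows. The module name atomic_groups and the subject
"12345" are hypothetical, chosen so that a match is possible only if
the repeat gives back its last digit.

```erlang
-module(atomic_groups).
-export([demo/0]).

demo() ->
    [re:run("12345", "\\d+5", [{capture, first, list}]),  % backtracks one digit
     re:run("12345", "(?>\\d+)5", [{capture, none}]),     % atomic: no give-back
     re:run("12345", "\\d++5", [{capture, none}]),        % possessive: the same
     re:run("123456foo", "\\d++foo", [{capture, first, list}])].
```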
3378
3379 The possessive quantifier syntax is an extension to the Perl 5.8 syn‐
3380 tax. Jeffrey Friedl originated the idea (and the name) in the first
3381 edition of his book. Mike McCloskey liked it, so implemented it when he
3382 built the Sun Java package, and PCRE copied it from there. It ulti‐
3383 mately found its way into Perl at release 5.10.
3384
PCRE has an optimization that automatically "possessifies" certain
simple pattern constructs. For example, the sequence A+B is treated as
A++B, as there is no point in backtracking into a sequence of A's when
B must follow.
3389
3390 When a pattern contains an unlimited repeat inside a subpattern that
3391 can itself be repeated an unlimited number of times, the use of an
3392 atomic group is the only way to avoid some failing matches taking a
3393 long time. The pattern
3394
3395 (\D+|<\d+>)*[!?]
3396
3397 matches an unlimited number of substrings that either consist of non-
3398 digits, or digits enclosed in <>, followed by ! or ?. When it matches,
3399 it runs quickly. However, if it is applied to
3400
3401 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
3402
3403 it takes a long time before reporting failure. This is because the
3404 string can be divided between the internal \D+ repeat and the external
3405 * repeat in many ways, and all must be tried. (The example uses [!?]
3406 rather than a single character at the end, as both PCRE and Perl have
3407 an optimization that allows for fast failure when a single character is
3408 used. They remember the last single character that is required for a
3409 match, and fail early if it is not present in the string.) If the pat‐
3410 tern is changed so that it uses an atomic group, like the following,
3411 sequences of non-digits cannot be broken, and failure happens quickly:
3412
3413 ((?>\D+)|<\d+>)*[!?]
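
A sketch of the atomic form applied to the failing subject follows. The
module name nested_repeat is hypothetical. Only the atomic pattern is
run here, because the non-atomic one can take a long time on this
subject.

```erlang
-module(nested_repeat).
-export([demo/0]).

demo() ->
    Atomic = "((?>\\D+)|<\\d+>)*[!?]",
    Subject = lists:duplicate(52, $a),   % 52 "a" characters, no "!" or "?"
    [re:run(Subject, Atomic, [{capture, none}]),   % fails quickly
     re:run("<12>?", Atomic, [{capture, none}])].  % still matches
```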
3414
3416 Outside a character class, a backslash followed by a digit > 0 (and
3417 possibly further digits) is a back reference to a capturing subpattern
3418 earlier (that is, to its left) in the pattern, provided there have been
3419 that many previous capturing left parentheses.
3420
3421 However, if the decimal number following the backslash is < 10, it is
3422 always taken as a back reference, and causes an error only if there are
3423 not that many capturing left parentheses in the entire pattern. That
3424 is, the parentheses that are referenced need not be to the left of
3425 the reference for numbers < 10. A "forward back reference" of this type
3426 can make sense when a repetition is involved and the subpattern to the
3427 right has participated in an earlier iteration.
3428
3429 It is not possible to have a numerical "forward back reference" to a
3430 subpattern whose number is 10 or more using this syntax, as a sequence
3431 such as \50 is interpreted as a character defined in octal. For more
3432 details of the handling of digits following a backslash, see section
3433 Non-Printing Characters earlier. There is no such problem when named
3434 parentheses are used. A back reference to any subpattern is possible
3435 using named parentheses (see below).
3436
3437 Another way to avoid the ambiguity inherent in the use of digits fol‐
3438 lowing a backslash is to use the \g escape sequence. This escape must
3439 be followed by an unsigned number or a negative number, optionally en‐
3440 closed in braces. The following examples are identical:
3441
3442 (ring), \1
3443 (ring), \g1
3444 (ring), \g{1}
3445
3446 An unsigned number specifies an absolute reference without the ambigu‐
3447 ity that is present in the older syntax. It is also useful when literal
3448 digits follow the reference. A negative number is a relative reference.
3449 Consider the following example:
3450
3451 (abc(def)ghi)\g{-1}
3452
3453 The sequence \g{-1} is a reference to the most recently started captur‐
3454 ing subpattern before \g, that is, it is equivalent to \2 in this exam‐
3455 ple. Similarly, \g{-2} would be equivalent to \1. The use of relative
3456 references can be helpful in long patterns, and also in patterns that
3457 are created by joining fragments containing references within them‐
3458 selves.
3459
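As a sketch of the relative form, the following hypothetical module runs the pattern above with \g{-1} and with \2 against the same subject. Only re:run/3 is assumed; the names are illustrative.

```erlang
-module(relative_backref_demo).
-export([run/0]).

run() ->
    Subject = "abcdefghidef",
    Opts = [{capture, all, list}],
    %% \g{-1} refers to the most recently started group, here (def).
    Relative = re:run(Subject, "(abc(def)ghi)\\g{-1}", Opts),
    %% The absolute reference \2 names the same group.
    Absolute = re:run(Subject, "(abc(def)ghi)\\2", Opts),
    {Relative, Absolute}.
```

Both calls are expected to return the same result, with "def" as the second captured substring.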
3460 A back reference matches whatever matched the capturing subpattern in
3461 the current subject string, rather than anything matching the subpattern
3462 itself (section Subpatterns as Subroutines describes a way of doing
3463 that). So, the following pattern matches "sense and sensibility" and
3464 "response and responsibility", but not "sense and responsibility":
3465
3466 (sens|respons)e and \1ibility
3467
3468 If caseful matching is in force at the time of the back reference, the
3469 case of letters is relevant. For example, the following matches "rah
3470 rah" and "RAH RAH", but not "RAH rah", although the original capturing
3471 subpattern is matched caselessly:
3472
3473 ((?i)rah)\s+\1
3474
3475 There are many different ways of writing back references to named sub‐
3476 patterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
3477 \k'name' are supported, as is the Python syntax (?P=name). The unified
3478 back reference syntax in Perl 5.10, in which \g can be used for both
3479 numeric and named references, is also supported. The previous example
3480 can be rewritten in the following ways:
3481
3482 (?<p1>(?i)rah)\s+\k<p1>
3483 (?'p1'(?i)rah)\s+\k{p1}
3484 (?P<p1>(?i)rah)\s+(?P=p1)
3485 (?<p1>(?i)rah)\s+\g{p1}
3486
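The equivalence of these forms can be checked with a short sketch. The module and function names are hypothetical; the numbered form from the previous example is included for comparison.

```erlang
-module(named_backref_demo).
-export([all_equivalent/0]).

%% The four named forms from the text plus the numbered original.
patterns() ->
    ["(?<p1>(?i)rah)\\s+\\k<p1>",
     "(?'p1'(?i)rah)\\s+\\k{p1}",
     "(?P<p1>(?i)rah)\\s+(?P=p1)",
     "(?<p1>(?i)rah)\\s+\\g{p1}",
     "((?i)rah)\\s+\\1"].

%% True if every form accepts "RAH RAH" and "rah rah" but rejects
%% "RAH rah", because the back reference itself is caseful.
all_equivalent() ->
    Opts = [{capture, none}],
    lists:all(fun(P) ->
                      re:run("RAH RAH", P, Opts) =:= match andalso
                      re:run("rah rah", P, Opts) =:= match andalso
                      re:run("RAH rah", P, Opts) =:= nomatch
              end, patterns()).
```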
3487 A subpattern that is referenced by name can appear in the pattern be‐
3488 fore or after the reference.
3489
3490 There can be more than one back reference to the same subpattern. If a
3491 subpattern has not been used in a particular match, any back reference
3492 to it always fails. For example, the following pattern always fails if
3493 it starts to match "a" rather than "bc":
3494
3495 (a|(bc))\2
3496
3497 As there can be many capturing parentheses in a pattern, all digits
3498 following the backslash are taken as part of a potential back reference
3499 number. If the pattern continues with a digit character, some delimiter
3500 must be used to terminate the back reference. If option extended is
3501 set, this can be whitespace. Otherwise an empty comment (see section
3502 Comments) can be used.
3503
3504 Recursive Back References
3505
3506 A back reference that occurs inside the parentheses to which it refers
3507 fails when the subpattern is first used, so, for example, (a\1) never
3508 matches. However, such references can be useful inside repeated subpat‐
3509 terns. For example, the following pattern matches any number of "a"s
3510 and also "aba", "ababbaa", and so on:
3511
3512 (a|b\1)+
3513
3514 At each iteration of the subpattern, the back reference matches the
3515 character string corresponding to the previous iteration. In order for
3516 this to work, the pattern must be such that the first iteration does
3517 not need to match the back reference. This can be done using alterna‐
3518 tion, as in the example above, or by a quantifier with a minimum of
3519 zero.
3520
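A small sketch of the iteration behaviour follows. It anchors the pattern with ^ and $ (an addition for the purpose of the test, not in the original pattern) so that only whole-subject matches count. The module name is hypothetical.

```erlang
-module(recursive_backref_demo).
-export([check/0]).

check() ->
    %% Anchored variant of (a|b\1)+ so partial matches are rejected.
    RE = "^(a|b\\1)+$",
    Subjects = ["aaa", "aba", "ababbaa", "abb"],
    [{S, re:run(S, RE, [{capture, none}])} || S <- Subjects].
```

For "ababbaa" the iterations match "a", "ba", "bba", and "a" in turn; "abb" is rejected because after "a" the second iteration requires "ba".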
3521 Back references of this type cause the group that they reference to be
3522 treated as an atomic group. Once the whole group has been matched, a
3523 subsequent matching failure cannot cause backtracking into the middle
3524 of the group.
3525
3527 An assertion is a test on the characters following or preceding the
3528 current matching point that does not consume any characters. The simple
3529 assertions coded as \b, \B, \A, \G, \Z, \z, ^, and $ are described in
3530 the previous sections.
3531
3532 More complicated assertions are coded as subpatterns. There are two
3533 kinds: those that look ahead of the current position in the subject
3534 string, and those that look behind it. An assertion subpattern is
3535 matched in the normal way, except that it does not cause the current
3536 matching position to be changed.
3537
3538 Assertion subpatterns are not capturing subpatterns. If such an asser‐
3539 tion contains capturing subpatterns within it, these are counted for
3540 the purposes of numbering the capturing subpatterns in the whole pat‐
3541 tern. However, substring capturing is done only for positive asser‐
3542 tions. (Perl sometimes, but not always, performs capturing in negative
3543 assertions.)
3544
3545 Warning:
3546 If a positive assertion containing one or more capturing subpatterns
3547 succeeds, but failure to match later in the pattern causes backtracking
3548 over this assertion, the captures within the assertion are reset only
3549 if no higher numbered captures are already set. This is, unfortunately,
3550 a fundamental limitation of the current implementation, and as PCRE1 is
3551 now in maintenance-only status, it is unlikely ever to change.
3552
3553
3554 For compatibility with Perl, assertion subpatterns can be repeated.
3555 Although it makes no sense to assert the same thing many times, the
3556 side effect of capturing parentheses can occasionally be useful. In
3557 practice, there are only three cases:
3558
3559 * If the quantifier is {0}, the assertion is never obeyed during
3560 matching. However, it can contain internal capturing parenthesized
3561 groups that are called from elsewhere through the subroutine mecha‐
3562 nism.
3563
3564 * If the quantifier is {0,n}, where n > 0, it is treated as if it were
3565 {0,1}. At runtime, the remaining pattern match is tried with and
3566 without the assertion; the order depends on the greediness of the
3567 quantifier.
3568
3569 * If the minimum repetition is > 0, the quantifier is ignored. The
3570 assertion is obeyed only once when encountered during matching.
3571
3572 Lookahead Assertions
3573
3574 Lookahead assertions start with (?= for positive assertions and (?! for
3575 negative assertions. For example, the following matches a word followed
3576 by a semicolon, but does not include the semicolon in the match:
3577
3578 \w+(?=;)
3579
3580 The following matches any occurrence of "foo" that is not followed by
3581 "bar":
3582
3583 foo(?!bar)
3584
3585 Notice that the apparently similar pattern
3586
3587 (?!foo)bar
3588
3589 does not find an occurrence of "bar" that is preceded by something
3590 other than "foo". It finds any occurrence of "bar" whatsoever, as the
3591 assertion (?!foo) is always true when the next three characters are
3592 "bar". A lookbehind assertion is needed to achieve the other effect.
3593
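The three lookahead patterns above can be compared in a short sketch. The module name and the subject strings are illustrative assumptions.

```erlang
-module(lookahead_demo).
-export([run/0]).

run() ->
    Index = [{capture, first, index}],
    %% The semicolon is asserted but not included in the match.
    [re:run("total;", "\\w+(?=;)", [{capture, first, list}]),
     %% Only the second "foo" is not followed by "bar".
     re:run("foobar foobaz", "foo(?!bar)", Index),
     %% (?!foo) is trivially true where "bar" starts, so "bar" is
     %% found even though "foo" precedes it.
     re:run("foobar", "(?!foo)bar", Index)].
```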
3594 If you want to force a matching failure at some point in a pattern, the
3595 most convenient way to do it is with (?!), as an empty string always
3596 matches. So, an assertion that requires there not to be an empty
3597 string must always fail. The backtracking control verb (*FAIL) or (*F)
3598 is a synonym for (?!).
3599
3600 Lookbehind Assertions
3601
3602 Lookbehind assertions start with (?<= for positive assertions and (?<!
3603 for negative assertions. For example, the following finds an occurrence
3604 of "bar" that is not preceded by "foo":
3605
3606 (?<!foo)bar
3607
3608 The contents of a lookbehind assertion are restricted such that all the
3609 strings it matches must have a fixed length. However, if there are many
3610 top-level alternatives, they do not all have to have the same fixed
3611 length. Thus, the following is permitted:
3612
3613 (?<=bullock|donkey)
3614
3615 The following causes an error at compile time:
3616
3617 (?<!dogs?|cats?)
3618
3619 Branches that match different length strings are permitted only at the
3620 top-level of a lookbehind assertion. This is an extension compared with
3621 Perl, which requires all branches to match the same length of string.
3622 An assertion such as the following is not permitted, as its single top-
3623 level branch can match two different lengths:
3624
3625 (?<=ab(c|de))
3626
3627 However, it is acceptable to PCRE if rewritten to use two top-level
3628 branches:
3629
3630 (?<=abc|abde)
3631
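The fixed-length rule can be observed through re:compile/1, which returns an error tuple for the rejected forms. This is a sketch with hypothetical module and function names; the exact error text and position are not asserted here.

```erlang
-module(lookbehind_demo).
-export([run/0, compile_results/0]).

%% (?<!foo)bar finds "bar" only when "foo" does not precede it.
run() ->
    Opts = [{capture, first, index}],
    {re:run("bazbar", "(?<!foo)bar", Opts),
     re:run("foobar", "(?<!foo)bar", Opts)}.

%% Reports ok or error for each lookbehind pattern from the text.
compile_results() ->
    Patterns = ["(?<=bullock|donkey)",
                "(?<!dogs?|cats?)",
                "(?<=ab(c|de))",
                "(?<=abc|abde)"],
    [element(1, re:compile(P)) || P <- Patterns].
```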
3632 Sometimes the escape sequence \K (see above) can be used instead of a
3633 lookbehind assertion to get round the fixed-length restriction.
3634
3635 The implementation of lookbehind assertions is, for each alternative,
3636 to move the current position back temporarily by the fixed length and
3637 then try to match. If there are insufficient characters before the cur‐
3638 rent position, the assertion fails.
3639
3640 In a UTF mode, PCRE does not allow the \C escape (which matches a sin‐
3641 gle data unit even in a UTF mode) to appear in lookbehind assertions,
3642 as it makes it impossible to calculate the length of the lookbehind.
3643 The \X and \R escapes, which can match different numbers of data units,
3644 are not permitted either.
3645
3646 "Subroutine" calls (see below), such as (?2) or (?&X), are permitted in
3647 lookbehinds, as long as the subpattern matches a fixed-length string.
3648 Recursion, however, is not supported.
3649
3650 Possessive quantifiers can be used with lookbehind assertions to spec‐
3651 ify efficient matching of fixed-length strings at the end of subject
3652 strings. Consider the following simple pattern when applied to a long
3653 string that does not match:
3654
3655 abcd$
3656
3657 As matching proceeds from left to right, PCRE looks for each "a" in the
3658 subject and then sees if what follows matches the remaining pattern. If
3659 the pattern is specified as
3660
3661 ^.*abcd$
3662
3663 the initial .* matches the entire string at first. However, when this
3664 fails (as there is no following "a"), it backtracks to match all but
3665 the last character, then all but the last two characters, and so on.
3666 Once again the search for "a" covers the entire string, from right to
3667 left, so we are no better off. However, if the pattern is written as
3668
3669 ^.*+(?<=abcd)
3670
3671 there can be no backtracking for the .*+ item; it can match only the
3672 entire string. The subsequent lookbehind assertion does a single test
3673 on the last four characters. If it fails, the match fails immediately.
3674 For long strings, this approach makes a significant difference to the
3675 processing time.
3676
3677 Using Multiple Assertions
3678
3679 Many assertions (of any sort) can occur in succession. For example, the
3680 following matches "foo" preceded by three digits that are not "999":
3681
3682 (?<=\d{3})(?<!999)foo
3683
3684 Notice that each of the assertions is applied independently at the same
3685 point in the subject string. First there is a check that the previous
3686 three characters are all digits, and then there is a check that the
3687 same three characters are not "999". This pattern does not match "foo"
3688 preceded by six characters, the first three of which are digits and the
3689 last three of which are not "999". For example, it does not match
3690 "123abcfoo". A pattern to do that is the following:
3691
3692 (?<=\d{3}...)(?<!999)foo
3693
3694 This time the first assertion looks at the preceding six characters,
3695 checks that the first three are digits, and then the second assertion
3696 checks that the preceding three characters are not "999".
3697
3698 Assertions can be nested in any combination. For example, the following
3699 matches an occurrence of "baz" that is preceded by "bar", which in turn
3700 is not preceded by "foo":
3701
3702 (?<=(?<!foo)bar)baz
3703
3704 The following pattern matches "foo" preceded by three digits and any
3705 three characters that are not "999":
3706
3707 (?<=\d{3}(?!999)...)foo
3708
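The following sketch runs the assertion combinations from this section against a few subjects. The module name and the choice of subjects are assumptions made for illustration.

```erlang
-module(multi_assert_demo).
-export([run/0]).

run() ->
    Cases = [{"(?<=\\d{3})(?<!999)foo",    "123foo"},
             {"(?<=\\d{3})(?<!999)foo",    "999foo"},
             {"(?<=\\d{3})(?<!999)foo",    "123abcfoo"},
             {"(?<=\\d{3}...)(?<!999)foo", "123abcfoo"},
             {"(?<=\\d{3}(?!999)...)foo",  "123abcfoo"},
             {"(?<=\\d{3}(?!999)...)foo",  "123999foo"}],
    %% Each pattern is applied to the subject paired with it.
    [re:run(Subject, RE, [{capture, none}]) || {RE, Subject} <- Cases].
```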
3710 It is possible to cause the matching process to obey a subpattern con‐
3711 ditionally or to choose between two alternative subpatterns, depending
3712 on the result of an assertion, or whether a specific capturing subpat‐
3713 tern has already been matched. The following are the two possible forms
3714 of conditional subpattern:
3715
3716 (?(condition)yes-pattern)
3717 (?(condition)yes-pattern|no-pattern)
3718
3719 If the condition is satisfied, the yes-pattern is used, otherwise the
3720 no-pattern (if present). If more than two alternatives exist in the
3721 subpattern, a compile-time error occurs. Each of the two alternatives
3722 can itself contain nested subpatterns of any form, including condi‐
3723 tional subpatterns; the restriction to two alternatives applies only at
3724 the level of the condition. The following pattern fragment is an exam‐
3725 ple where the alternatives are complex:
3726
3727 (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
3728
3729 There are four kinds of condition: references to subpatterns, refer‐
3730 ences to recursion, a pseudo-condition called DEFINE, and assertions.
3731
3732 Checking for a Used Subpattern By Number
3733
3734 If the text between the parentheses consists of a sequence of digits,
3735 the condition is true if a capturing subpattern of that number has pre‐
3736 viously matched. If more than one capturing subpattern with the same
3737 number exists (see section Duplicate Subpattern Numbers earlier), the
3738 condition is true if any of them have matched. An alternative notation
3739 is to precede the digits with a plus or minus sign. In this case, the
3740 subpattern number is relative rather than absolute. The most recently
3741 opened parentheses can be referenced by (?(-1), the next most recent by
3742 (?(-2), and so on. Inside loops, it can also make sense to refer to
3743 subsequent groups. The next parentheses to be opened can be referenced
3744 as (?(+1), and so on. (The value zero in any of these forms is not
3745 used; it provokes a compile-time error.)
3746
3747 Consider the following pattern, which contains non-significant white‐
3748 space to make it more readable (assume option extended) and to divide
3749 it into three parts for ease of discussion:
3750
3751 ( \( )? [^()]+ (?(1) \) )
3752
3753 The first part matches an optional opening parenthesis, and if that
3754 character is present, sets it as the first captured substring. The sec‐
3755 ond part matches one or more characters that are not parentheses. The
3756 third part is a conditional subpattern that tests whether the first set
3757 of parentheses matched or not. If they did, that is, if the subject started
3758 with an opening parenthesis, the condition is true, and so the yes-pat‐
3759 tern is executed and a closing parenthesis is required. Otherwise, as
3760 no-pattern is not present, the subpattern matches nothing. That is,
3761 this pattern matches a sequence of non-parentheses, optionally enclosed
3762 in parentheses.
3763
3764 If this pattern is embedded in a larger one, a relative reference can
3765 be used, that is, (?(-1) in place of (?(1) in the condition.
3766
3767 This makes the fragment independent of the parentheses in the larger
3768 pattern.
3769
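A sketch of the conditional pattern in both absolute and relative form follows. The ^ and $ anchors are added here so that only whole-subject matches count; the module name is hypothetical.

```erlang
-module(conditional_demo).
-export([run/0]).

run() ->
    %% (?(1) tests group 1 by number; (?(-1) tests the most recently
    %% opened group, which is the same group in this pattern.
    Absolute = "^ ( \\( )? [^()]+ (?(1) \\) ) $",
    Relative = "^ ( \\( )? [^()]+ (?(-1) \\) ) $",
    Opts = [extended, {capture, none}],
    [{S, re:run(S, Absolute, Opts), re:run(S, Relative, Opts)}
     || S <- ["(abc)", "abc", "(abc", "abc)"]].
```

Subjects with a single unbalanced parenthesis are rejected by both forms.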
3770 Checking for a Used Subpattern By Name
3771
3772 Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
3773 used subpattern by name. For compatibility with earlier versions of
3774 PCRE, which had this facility before Perl, the syntax (?(name)...) is
3775 also recognized.
3776
3777 Rewriting the previous example to use a named subpattern gives:
3778
3779 (?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
3780
3781 If the name used in a condition of this kind is a duplicate, the test
3782 is applied to all subpatterns of the same name, and is true if any one
3783 of them has matched.
3784
3785 Checking for Pattern Recursion
3786
3787 If the condition is the string (R), and there is no subpattern with the
3788 name R, the condition is true if a recursive call to the whole pattern
3789 or any subpattern has been made. If digits or a name preceded by amper‐
3790 sand follow the letter R, for example:
3791
3792 (?(R3)...) or (?(R&name)...)
3793
3794 the condition is true if the most recent recursion is into a subpattern
3795 whose number or name is given. This condition does not check the entire
3796 recursion stack. If the name used in a condition of this kind is a du‐
3797 plicate, the test is applied to all subpatterns of the same name, and
3798 is true if any one of them is the most recent recursion.
3799
3800 At "top-level", all these recursion test conditions are false. The syn‐
3801 tax for recursive patterns is described below.
3802
3803 Defining Subpatterns for Use By Reference Only
3804
3805 If the condition is the string (DEFINE), and there is no subpattern
3806 with the name DEFINE, the condition is always false. In this case,
3807 there can be only one alternative in the subpattern. It is always
3808 skipped if control reaches this point in the pattern. The idea of DE‐
3809 FINE is that it can be used to define "subroutines" that can be refer‐
3810 enced from elsewhere. (The use of subroutines is described below.) For
3811 example, a pattern to match an IPv4 address, such as "192.168.23.245",
3812 can be written like this (ignore whitespace and line breaks):
3813
3814 (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) \b (?&byte) (\.(?&byte)){3} \b
3815
3816 The first part of the pattern is a DEFINE group inside which another
3817 group named "byte" is defined. This matches an individual component
3818 of an IPv4 address (a number < 256). When matching takes place,
3819 this part of the pattern is skipped, as DEFINE acts like a false condi‐
3820 tion. The remaining pattern uses references to the named group to match
3821 the four dot-separated components of an IPv4 address, insisting on a
3822 word boundary at each end.
3823
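The DEFINE pattern can be wrapped in a helper as sketched below. The module and function names are hypothetical; option extended is passed so that the whitespace in the pattern is ignored.

```erlang
-module(define_demo).
-export([find_ipv4/1]).

%% Returns {match, [Address]} for the first IPv4 address in Subject,
%% or nomatch. The pattern is split over two adjacent string literals,
%% which Erlang concatenates at compile time.
find_ipv4(Subject) ->
    RE = "(?(DEFINE) (?<byte> 2[0-4]\\d | 25[0-5] | 1\\d\\d | [1-9]?\\d) )"
         " \\b (?&byte) (\\.(?&byte)){3} \\b",
    re:run(Subject, RE, [extended, {capture, first, list}]).
```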
3824 Assertion Conditions
3825
3826 If the condition is not in any of the above formats, it must be an as‐
3827 sertion. This can be a positive or negative lookahead or lookbehind as‐
3828 sertion. Consider the following pattern, containing non-significant
3829 whitespace, and with the two alternatives on the second line:
3830
3831 (?(?=[^a-z]*[a-z])
3832 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
3833
3834 The condition is a positive lookahead assertion that matches an op‐
3835 tional sequence of non-letters followed by a letter. That is, it tests
3836 for the presence of at least one letter in the subject. If a letter is
3837 found, the subject is matched against the first alternative, otherwise
3838 it is matched against the second. This pattern matches strings in one
3839 of the two forms dd-aaa-dd or dd-dd-dd, where aaa are letters and dd
3840 are digits.
3841
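A sketch of the assertion condition in use follows. The ^ and $ anchors are an addition so that the whole subject must have one of the two forms; the module name is hypothetical.

```erlang
-module(assert_condition_demo).
-export([run/0]).

run() ->
    %% If a lowercase letter occurs anywhere ahead, require dd-aaa-dd,
    %% otherwise require dd-dd-dd.
    RE = "^(?(?=[^a-z]*[a-z]) \\d{2}-[a-z]{3}-\\d{2} | \\d{2}-\\d{2}-\\d{2} )$",
    Opts = [extended, {capture, none}],
    [re:run(S, RE, Opts) || S <- ["12-abc-34", "12-34-56", "12-ab-34"]].
```

The last subject contains letters, so only the first alternative is tried, and it fails because there are two letters rather than three.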
3843 There are two ways to include comments in patterns that are processed
3844 by PCRE. In both cases, the start of the comment must not be in a char‐
3845 acter class, or in the middle of any other sequence of related charac‐
3846 ters such as (?: or a subpattern name or number. The characters that
3847 make up a comment play no part in the pattern matching.
3848
3849 The sequence (?# marks the start of a comment that continues up to the
3850 next closing parenthesis. Nested parentheses are not permitted. If option
3851 extended is set, an unescaped # character also introduces a
3852 comment, which in this case continues to immediately after the next
3853 newline character or character sequence in the pattern. Which charac‐
3854 ters are interpreted as newlines is controlled by the options passed to
3855 a compiling function or by a special sequence at the start of the pat‐
3856 tern, as described in section Newline Conventions earlier.
3857
3858 Notice that the end of this type of comment is a literal newline se‐
3859 quence in the pattern; escape sequences that happen to represent a new‐
3860 line do not count. For example, consider the following pattern when ex‐
3861 tended is set, and the default newline convention is in force:
3862
3863 abc #comment \n still comment
3864
3865 On encountering character #, pcre_compile() skips along, looking for a
3866 newline in the pattern. The sequence \n is still literal at this stage,
3867 so it does not terminate the comment. Only a character with code value
3868 0x0a (the default newline) does so.
3869
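In Erlang source the distinction shows up in how the string literal is written: "\\n" places a backslash and the letter n in the pattern, while "\n" places a real linefeed. The following sketch, with a hypothetical module name, assumes the default newline convention (LF).

```erlang
-module(comment_demo).
-export([run/0]).

run() ->
    Opts = [extended, {capture, first, list}],
    %% Backslash-n in the pattern: the comment runs to the end of the
    %% pattern, so "def" is never part of it.
    Escaped = re:run("abcdef", "abc #comment \\n def", Opts),
    %% A real linefeed ends the comment, so the pattern is abc def.
    Literal = re:run("abcdef", "abc #comment \n def", Opts),
    %% The (?#...) form needs no option.
    Inline = re:run("abcdef", "abc(?#comment)def", [{capture, first, list}]),
    {Escaped, Literal, Inline}.
```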
3871 Consider the problem of matching a string in parentheses, allowing for
3872 unlimited nested parentheses. Without the use of recursion, the best
3873 that can be done is to use a pattern that matches up to some fixed
3874 depth of nesting. It is not possible to handle an arbitrary nesting
3875 depth.
3876
3877 For some time, Perl has provided a facility that allows regular expres‐
3878 sions to recurse (among other things). It does this by interpolating
3879 Perl code in the expression at runtime, and the code can refer to the
3880 expression itself. A Perl pattern using code interpolation to solve the
3881 parentheses problem can be created like this:
3882
3883 $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
3884
3885 Item (?p{...}) interpolates Perl code at runtime, and in this case
3886 refers recursively to the pattern in which it appears.
3887
3888 Obviously, PCRE cannot support the interpolation of Perl code. Instead,
3889 it supports special syntax for recursion of the entire pattern, and for
3890 individual subpattern recursion. After its introduction in PCRE and
3891 Python, this kind of recursion was later introduced into Perl at re‐
3892 lease 5.10.
3893
3894 A special item that consists of (? followed by a number > 0 and a clos‐
3895 ing parenthesis is a recursive subroutine call of the subpattern of the
3896 given number, if it occurs inside that subpattern. (If not, it is a
3897 non-recursive subroutine call, which is described in the next section.)
3898 The special item (?R) or (?0) is a recursive call of the entire regular
3899 expression.
3900
3901 This PCRE pattern solves the nested parentheses problem (assume that
3902 option extended is set so that whitespace is ignored):
3903
3904 \( ( [^()]++ | (?R) )* \)
3905
3906 First it matches an opening parenthesis. Then it matches any number of
3907 substrings, which can either be a sequence of non-parentheses or a re‐
3908 cursive match of the pattern itself (that is, a correctly parenthesized
3909 substring). Finally there is a closing parenthesis. Notice the use of a
3910 possessive quantifier to avoid backtracking into sequences of non-
3911 parentheses.
3912
3913 If this were part of a larger pattern, you would not want to recurse the
3914 entire pattern, so instead you can use:
3915
3916 ( \( ( [^()]++ | (?1) )* \) )
3917
3918 The pattern is here within parentheses so that the recursion refers to
3919 them instead of the whole pattern.
3920
3921 In a larger pattern, keeping track of parenthesis numbers can be
3922 tricky. This is made easier by the use of relative references. Instead
3923 of (?1) in the pattern above, you can write (?-2) to refer to the sec‐
3924 ond most recently opened parentheses preceding the recursion. That is,
3925 a negative number counts capturing parentheses leftwards from the point
3926 at which it is encountered.
3927
3928 It is also possible to refer to later opened parentheses, by writing
3929 references such as (?+2). However, these cannot be recursive, as the
3930 reference is not inside the parentheses that are referenced. They are
3931 always non-recursive subroutine calls, as described in the next sec‐
3932 tion.
3933
3934 An alternative approach is to use named parentheses instead. The Perl
3935 syntax for this is (?&name). The earlier PCRE syntax (?P>name) is also
3936 supported. We can rewrite the above example as follows:
3937
3938 (?<pn> \( ( [^()]++ | (?&pn) )* \) )
3939
3940 If there is more than one subpattern with the same name, the earliest
3941 one is used.
3942
3943 This particular example pattern that we have studied contains nested
3944 unlimited repeats, and so the use of a possessive quantifier for match‐
3945 ing strings of non-parentheses is important when applying the pattern
3946 to strings that do not match. For example, when this pattern is applied
3947 to
3948
3949 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
3950
3951 it gives "no match" quickly. However, if a possessive quantifier is not
3952 used, the match runs for a long time, as there are so many different
3953 ways the + and * repeats can carve up the subject, and all must be
3954 tested before failure can be reported.
3955
3956 At the end of a match, the values of capturing parentheses are those
3957 from the outermost level. If the pattern above is matched against
3958
3959 (ab(cd)ef)
3960
3961 the value for the inner capturing parentheses (numbered 2) is "ef",
3962 which is the last value taken on at the top-level. If a capturing sub‐
3963 pattern is not matched at the top level, its final captured value is
3964 unset, even if it was (temporarily) set at a deeper level during the
3965 matching process.
3966
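Both recursive forms can be tried as sketched below. The module and function names are hypothetical; option extended is passed because the patterns contain whitespace.

```erlang
-module(recursion_demo).
-export([whole/1, numbered/1]).

%% (?R) recurses into the whole pattern.
whole(Subject) ->
    re:run(Subject, "\\( ( [^()]++ | (?R) )* \\)",
           [extended, {capture, first, list}]).

%% (?1) recurses into group 1 only. All captures are returned to show
%% the value left in group 2 at the outermost level.
numbered(Subject) ->
    re:run(Subject, "( \\( ( [^()]++ | (?1) )* \\) )",
           [extended, {capture, all, list}]).
```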
3967 Do not confuse item (?R) with condition (R), which tests for recursion.
3968 Consider the following pattern, which matches text in angle brackets,
3969 allowing for arbitrary nesting. Only digits are allowed in nested
3970 brackets (that is, when recursing), while any characters are permitted
3971 at the outer level.
3972
3973 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
3974
3975 Here (?(R) is the start of a conditional subpattern, with two different
3976 alternatives for the recursive and non-recursive cases. Item (?R) is
3977 the actual recursive call.
3978
3979 Differences in Recursion Processing between PCRE and Perl
3980
3981 Recursion processing in PCRE differs from Perl in two important ways.
3982 In PCRE (like Python, but unlike Perl), a recursive subpattern call is
3983 always treated as an atomic group. That is, once it has matched some of
3984 the subject string, it is never re-entered, even if it contains untried
3985 alternatives and there is a subsequent matching failure. This can be
3986 illustrated by the following pattern, which means to match a palin‐
3987 dromic string containing an odd number of characters (for example, "a",
3988 "aba", "abcba", "abcdcba"):
3989
3990 ^(.|(.)(?1)\2)$
3991
3992 The idea is that it either matches a single character, or two identical
3993 characters surrounding a subpalindrome. In Perl, this pattern works; in
3994 PCRE it does not work if the subject string is longer than three
3995 characters. Consider the subject string "abcba".
3996
3997 At the top level, the first character is matched, but as it is not at
3998 the end of the string, the first alternative fails, the second alterna‐
3999 tive is taken, and the recursion kicks in. The recursive call to sub‐
4000 pattern 1 successfully matches the next character ("b"). (Notice that
4001 the beginning and end of line tests are not part of the recursion.)
4002
4003 Back at the top level, the next character ("c") is compared with what
4004 subpattern 2 matched, which was "a". This fails. As the recursion is
4005 treated as an atomic group, there are now no backtracking points, and
4006 so the entire match fails. (Perl can now re-enter the recursion and try
4007 the second alternative.) However, if the pattern is written with the
4008 alternatives in the other order, things are different:
4009
4010 ^((.)(?1)\2|.)$
4011
4012 This time, the recursing alternative is tried first, and continues to
4013 recurse until it runs out of characters, at which point the recursion
4014 fails. But this time we have another alternative to try at the higher
4015 level. That is the significant difference: in the previous case the re‐
4016 maining alternative is at a deeper recursion level, which PCRE cannot
4017 use.
4018
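The effect of the alternative order can be seen by running both patterns on the same subjects. This is a sketch with a hypothetical module name.

```erlang
-module(palindrome_demo).
-export([run/0]).

run() ->
    Opts = [{capture, none}],
    %% Single-character alternative first: the recursion matches one
    %% character atomically and cannot be re-entered.
    First = "^(.|(.)(?1)\\2)$",
    %% Recursing alternative first: the fallback is at the higher level.
    Swapped = "^((.)(?1)\\2|.)$",
    [{S, re:run(S, First, Opts), re:run(S, Swapped, Opts)}
     || S <- ["aba", "abcba"]].
```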
4019 To change the pattern so that it matches all palindromic strings, not
4020 only those with an odd number of characters, it is tempting to change
4021 the pattern to this:
4022
4023 ^((.)(?1)\2|.?)$
4024
4025 Again, this works in Perl, but not in PCRE, and for the same reason.
4026 When a deeper recursion has matched a single character, it cannot be
4027 entered again to match an empty string. The solution is to separate the
4028 two cases, and write out the odd and even cases as alternatives at the
4029 higher level:
4030
4031 ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
4032
4033 If you want to match typical palindromic phrases, the pattern must ig‐
4034 nore all non-word characters, which can be done as follows:
4035
4036 ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
4037
4038 If run with option caseless, this pattern matches phrases such as "A
4039 man, a plan, a canal: Panama!" and it works well in both PCRE and Perl.
4040 Notice the use of the possessive quantifier *+ to avoid backtracking
4041 into sequences of non-word characters. Without this, PCRE takes much
4042 longer (10 times or more) to match typical phrases, and Perl takes so
4043 long that you think it has gone into a loop.
4044
4045 Note:
4046 The palindrome-matching patterns above work only if the subject string
4047 does not start with a palindrome that is shorter than the entire
4048 string. For example, although "abcba" is correctly matched, if the sub‐
4049 ject is "ababa", PCRE finds palindrome "aba" at the start, and then
4050 fails at top level, as the end of the string does not follow. Once
4051 again, it cannot jump back into the recursion to try other alterna‐
4052 tives, so the entire match fails.
4053
4054
4055 The second way in which PCRE and Perl differ in their recursion
4056 processing is in the handling of captured values. In Perl, when a
4057 subpattern is called recursively or as a subroutine (see the next section),
4058 it has no access to any values that were captured outside the recur‐
4059 sion. In PCRE these values can be referenced. Consider the following
4060 pattern:
4061
4062 ^(.)(\1|a(?2))
4063
4064 In PCRE, it matches "bab". The first capturing parentheses match "b",
4065 then in the second group, the back reference \1 (now "b") fails to
4066 match the next character "a", so the second alternative matches "a"
4067 and then recurses. In the recursion, \1 does now match "b" and so the
4068 whole match succeeds. In Perl, the pattern fails to match because
4069 inside the recursive call \1 cannot access the externally set value.
4070
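The PCRE behaviour described here can be observed with a one-line sketch. The module name is hypothetical.

```erlang
-module(recursion_capture_demo).
-export([run/0]).

%% Inside the recursive call (?2), \1 still sees the "b" captured by
%% group 1 outside the recursion.
run() ->
    re:run("bab", "^(.)(\\1|a(?2))", [{capture, first, list}]).
```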
4072 If the syntax for a recursive subpattern call (either by number or by
4073 name) is used outside the parentheses to which it refers, it operates
4074 like a subroutine in a programming language. The called subpattern can
4075 be defined before or after the reference. A numbered reference can be
4076 absolute or relative, as in the following examples:
4077
4078 (...(absolute)...)...(?2)...
4079 (...(relative)...)...(?-1)...
4080 (...(?+1)...(relative)...
4081
4082 An earlier example pointed out that the following pattern matches
4083 "sense and sensibility" and "response and responsibility", but not
4084 "sense and responsibility":
4085
4086 (sens|respons)e and \1ibility
4087
4088 If instead the following pattern is used, it matches "sense and respon‐
4089 sibility" and the other two strings:
4090
4091 (sens|respons)e and (?1)ibility
4092
4093 Another example is provided in the discussion of DEFINE earlier.
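
As a sketch of the difference between a back reference and a subroutine call, the two patterns above can be run against the mixed subject. The capture option is an illustrative choice; each line is a pattern match that fails with badmatch if the result differs.

```erlang
%% The back reference \1 must repeat the text that group 1 actually matched.
nomatch = re:run("sense and responsibility",
                 "(sens|respons)e and \\1ibility").

%% The subroutine call (?1) re-runs the pattern of group 1, so either
%% word is accepted in the second position.
{match, ["sense and responsibility"]} =
    re:run("sense and responsibility",
           "(sens|respons)e and (?1)ibility",
           [{capture, first, list}]).
```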
4094
4095 All subroutine calls, recursive or not, are always treated as atomic
4096 groups. That is, once a subroutine has matched some of the subject
4097 string, it is never re-entered, even if it contains untried alterna‐
4098 tives and there is a subsequent matching failure. Any capturing paren‐
4099 theses that are set during the subroutine call revert to their previous
4100 values afterwards.
4101
4102 Processing options such as case-independence are fixed when a subpat‐
4103 tern is defined, so if it is used as a subroutine, such options cannot
4104 be changed for different calls. For example, the following pattern
4105 matches "abcabc" but not "abcABC", as the change of processing option
4106 does not affect the called subpattern:
4107
4108 (abc)(?i:(?-1))
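
A short sketch of this rule, assuming the Erlang shell; {capture, none} is used only to keep the results compact, and the third pattern is an illustrative contrast that is not part of the text above.

```erlang
%% The called subpattern (abc) keeps the case-sensitive setting it was
%% defined with, even though the call sits inside (?i:...).
match   = re:run("abcabc", "(abc)(?i:(?-1))", [{capture, none}]).
nomatch = re:run("abcABC", "(abc)(?i:(?-1))", [{capture, none}]).

%% For contrast: a literal written inside (?i:...) is matched caselessly.
match   = re:run("abcABC", "(abc)(?i:abc)", [{capture, none}]).
```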
4109
For compatibility with Oniguruma, the non-Perl syntax \g followed by a
name or a number enclosed in either angle brackets or single quotes is
an alternative syntax for referencing a subpattern as a subroutine,
possibly recursively. Here are two of the examples used above, rewritten
using this syntax:
4116
4117 (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
4118 (sens|respons)e and \g'1'ibility
4119
4120 PCRE supports an extension to Oniguruma: if a number is preceded by a
4121 plus or minus sign, it is taken as a relative reference, for example:
4122
4123 (abc)(?i:\g<-1>)
4124
4125 Notice that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are
4126 not synonymous. The former is a back reference; the latter is a subrou‐
4127 tine call.
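
The distinction can be sketched from the Erlang shell as follows. The subjects and capture options are illustrative assumptions; the extended option is needed for the last pattern because it contains insignificant whitespace.

```erlang
%% \g{1} is a back reference: it must repeat the text captured by group 1.
nomatch = re:run("sense and responsibility",
                 "(sens|respons)e and \\g{1}ibility").

%% \g<1> and \g'1' are subroutine calls: they re-run the pattern of group 1.
{match, ["sense and responsibility"]} =
    re:run("sense and responsibility",
           "(sens|respons)e and \\g<1>ibility", [{capture, first, list}]).
{match, ["sense and responsibility"]} =
    re:run("sense and responsibility",
           "(sens|respons)e and \\g'1'ibility", [{capture, first, list}]).

%% A named, recursive call that matches nested parentheses.
{match, ["(ab(cd)ef)"]} =
    re:run("(ab(cd)ef)",
           "(?<pn> \\( ( (?>[^()]+) | \\g<pn> )* \\) )",
           [extended, {capture, first, list}]).
```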
4128
4130 Perl 5.10 introduced some "Special Backtracking Control Verbs", which
4131 are still described in the Perl documentation as "experimental and sub‐
4132 ject to change or removal in a future version of Perl". It goes on to
4133 say: "Their usage in production code should be noted to avoid problems
4134 during upgrades." The same remarks apply to the PCRE features described
4135 in this section.
4136
4137 The new verbs make use of what was previously invalid syntax: an open‐
4138 ing parenthesis followed by an asterisk. They are generally of the form
4139 (*VERB) or (*VERB:NAME). Some can take either form, possibly behaving
4140 differently depending on whether a name is present. A name is any se‐
4141 quence of characters that does not include a closing parenthesis. The
4142 maximum name length is 255 in the 8-bit library and 65535 in the 16-bit
4143 and 32-bit libraries. If the name is empty, that is, if the closing
4144 parenthesis immediately follows the colon, the effect is as if the
4145 colon was not there. Any number of these verbs can occur in a pattern.
4146
4147 The behavior of these verbs in repeated groups, assertions, and in sub‐
4148 patterns called as subroutines (whether or not recursively) is de‐
4149 scribed below.
4150
4151 Optimizations That Affect Backtracking Verbs
4152
PCRE contains some optimizations that are used to speed up matching by
running some checks at the start of each match attempt. For example, it
can know the minimum length of a matching subject, or that a particular
character must be present. When one of these optimizations bypasses the
running of a match, any included backtracking verbs are not processed.
You can suppress the start-of-match optimizations by setting option
no_start_optimize when calling compile/2 or run/3, or by starting the
pattern with (*NO_START_OPT).
4161
4162 Experiments with Perl suggest that it too has similar optimizations,
4163 sometimes leading to anomalous results.
4164
4165 Verbs That Act Immediately
4166
4167 The following verbs act as soon as they are encountered. They must not
4168 be followed by a name.
4169
4170 (*ACCEPT)
4171
4172 This verb causes the match to end successfully, skipping the remainder
4173 of the pattern. However, when it is inside a subpattern that is called
4174 as a subroutine, only that subpattern is ended successfully. Matching
4175 then continues at the outer level. If (*ACCEPT) is triggered in a posi‐
4176 tive assertion, the assertion succeeds; in a negative assertion, the
4177 assertion fails.
4178
4179 If (*ACCEPT) is inside capturing parentheses, the data so far is cap‐
4180 tured. For example, the following matches "AB", "AAD", or "ACD". When
4181 it matches "AB", "B" is captured by the outer parentheses.
4182
4183 A((?:A|B(*ACCEPT)|C)D)
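
As a sketch, the three subjects named above can be checked from the Erlang shell. The {capture, all, list} option is an illustrative choice that returns the whole match followed by the contents of the outer parentheses.

```erlang
%% (*ACCEPT) ends the match right after "B"; the outer group captures "B".
{match, ["AB", "B"]} =
    re:run("AB", "A((?:A|B(*ACCEPT)|C)D)", [{capture, all, list}]).

%% The other two alternatives still require the trailing "D".
{match, ["AAD", "AD"]} =
    re:run("AAD", "A((?:A|B(*ACCEPT)|C)D)", [{capture, all, list}]).
{match, ["ACD", "CD"]} =
    re:run("ACD", "A((?:A|B(*ACCEPT)|C)D)", [{capture, all, list}]).
```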
4184
4185 The following verb causes a matching failure, forcing backtracking to
4186 occur. It is equivalent to (?!) but easier to read.
4187
4188 (*FAIL) or (*F)
4189
The Perl documentation states that it is probably useful only when
combined with (?{}) or (??{}). Those are Perl features that are not
present in PCRE. The nearest equivalent in PCRE is the callout feature,
as in the following pattern:

a+(?C)(*FAIL)

A match with the string "aaaa" always fails, but the callout is taken
before each backtrack occurs (in this example, 10 times).
4196
4197 Recording Which Path Was Taken
4198
The main purpose of this verb is to track how a match was arrived at,
although it also has a secondary use in conjunction with advancing the
match starting point (see (*SKIP) below).
4202
4203 Note:
4204 In Erlang, there is no interface to retrieve a mark with run/2,3, so
4205 only the secondary purpose is relevant to the Erlang programmer.
4206
4207 The rest of this section is therefore deliberately not adapted for
4208 reading by the Erlang programmer, but the examples can help in under‐
4209 standing NAMES as they can be used by (*SKIP).
4210
4211
4212 (*MARK:NAME) or (*:NAME)
4213
4214 A name is always required with this verb. There can be as many in‐
4215 stances of (*MARK) as you like in a pattern, and their names do not
4216 have to be unique.
4217
4218 When a match succeeds, the name of the last encountered (*MARK:NAME),
4219 (*PRUNE:NAME), or (*THEN:NAME) on the matching path is passed back to
4220 the caller as described in section "Extra data for pcre_exec()" in the
4221 pcreapi documentation. In the following example of pcretest output, the
4222 /K modifier requests the retrieval and outputting of (*MARK) data:
4223
4224 re> /X(*MARK:A)Y|X(*MARK:B)Z/K
4225 data> XY
4226 0: XY
4227 MK: A
4228 XZ
4229 0: XZ
4230 MK: B
4231
4232 The (*MARK) name is tagged with "MK:" in this output, and in this exam‐
4233 ple it indicates which of the two alternatives matched. This is a more
4234 efficient way of obtaining this information than putting each alterna‐
4235 tive in its own capturing parentheses.
4236
4237 If a verb with a name is encountered in a positive assertion that is
4238 true, the name is recorded and passed back if it is the last encoun‐
4239 tered. This does not occur for negative assertions or failing positive
4240 assertions.
4241
4242 After a partial match or a failed match, the last encountered name in
4243 the entire match process is returned, for example:
4244
4245 re> /X(*MARK:A)Y|X(*MARK:B)Z/K
4246 data> XP
4247 No match, mark = B
4248
Notice that in this unanchored example, the mark is retained from the
match attempt that started at letter "X" in the subject. Subsequent
match attempts starting at "P" and then with an empty string do not get
as far as the (*MARK) item, but nevertheless do not reset it.
4253
4254 Verbs That Act after Backtracking
4255
4256 The following verbs do nothing when they are encountered. Matching con‐
4257 tinues with what follows, but if there is no subsequent match, causing
4258 a backtrack to the verb, a failure is forced. That is, backtracking
4259 cannot pass to the left of the verb. However, when one of these verbs
4260 appears inside an atomic group or an assertion that is true, its effect
4261 is confined to that group, as once the group has been matched, there is
4262 never any backtracking into it. In this situation, backtracking can
4263 "jump back" to the left of the entire atomic group or assertion. (Re‐
4264 member also, as stated above, that this localization also applies in
4265 subroutine calls.)
4266
4267 These verbs differ in exactly what kind of failure occurs when back‐
4268 tracking reaches them. The behavior described below is what occurs when
4269 the verb is not in a subroutine or an assertion. Subsequent sections
4270 cover these special cases.
4271
4272 The following verb, which must not be followed by a name, causes the
4273 whole match to fail outright if there is a later matching failure that
4274 causes backtracking to reach it. Even if the pattern is unanchored, no
4275 further attempts to find a match by advancing the starting point take
4276 place.
4277
4278 (*COMMIT)
4279
If (*COMMIT) is the only backtracking verb that is encountered, once it
has been passed, run/2,3 is committed to finding a match at the current
starting point, or not at all, for example:
4283
4284 a+(*COMMIT)b
4285
4286 This matches "xxaab" but not "aacaab". It can be thought of as a kind
4287 of dynamic anchor, or "I've started, so I must finish". The name of the
4288 most recently passed (*MARK) in the path is passed back when (*COMMIT)
4289 forces a match failure.
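
The two subjects mentioned above can be checked from the Erlang shell. This is a minimal sketch; the capture option is an illustrative choice.

```erlang
%% No attempt passes (*COMMIT) and then fails, so the match is found at
%% the first "a".
{match, ["aab"]} =
    re:run("xxaab", "a+(*COMMIT)b", [{capture, first, list}]).

%% The attempt at offset 0 passes (*COMMIT) and then fails on "c"; no
%% later starting point is tried, although "aab" occurs further on.
nomatch = re:run("aacaab", "a+(*COMMIT)b").
```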
4290
4291 If more than one backtracking verb exists in a pattern, a different one
4292 that follows (*COMMIT) can be triggered first, so merely passing (*COM‐
4293 MIT) during a match does not always guarantee that a match must be at
4294 this starting point.
4295
4296 Notice that (*COMMIT) at the start of a pattern is not the same as an
4297 anchor, unless the PCRE start-of-match optimizations are turned off, as
4298 shown in the following example:
4299
4300 1> re:run("xyzabc","(*COMMIT)abc",[{capture,all,list}]).
4301 {match,["abc"]}
4302 2> re:run("xyzabc","(*COMMIT)abc",[{capture,all,list},no_start_optimize]).
4303 nomatch
4304
4305 For this pattern, PCRE knows that any match must start with "a", so the
4306 optimization skips along the subject to "a" before applying the pattern
4307 to the first set of data. The match attempt then succeeds. In the sec‐
4308 ond call the no_start_optimize disables the optimization that skips
4309 along to the first character. The pattern is now applied starting at
4310 "x", and so the (*COMMIT) causes the match to fail without trying any
4311 other starting points.
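
As mentioned in section "Optimizations That Affect Backtracking Verbs", starting the pattern with (*NO_START_OPT) is an alternative to the no_start_optimize option. A sketch using the same subject as above:

```erlang
%% With the start-of-match optimization active, matching begins at "a".
{match, ["abc"]} =
    re:run("xyzabc", "(*COMMIT)abc", [{capture, all, list}]).

%% Disabling the optimization inside the pattern has the same effect as
%% passing the no_start_optimize option.
nomatch = re:run("xyzabc", "(*NO_START_OPT)(*COMMIT)abc").
nomatch = re:run("xyzabc", "(*COMMIT)abc", [no_start_optimize]).
```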
4312
4313 The following verb causes the match to fail at the current starting po‐
4314 sition in the subject if there is a later matching failure that causes
4315 backtracking to reach it:
4316
4317 (*PRUNE) or (*PRUNE:NAME)
4318
4319 If the pattern is unanchored, the normal "bumpalong" advance to the
4320 next starting character then occurs. Backtracking can occur as usual to
4321 the left of (*PRUNE), before it is reached, or when matching to the
4322 right of (*PRUNE), but if there is no match to the right, backtracking
4323 cannot cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an
4324 alternative to an atomic group or possessive quantifier, but there are
4325 some uses of (*PRUNE) that cannot be expressed in any other way. In an
4326 anchored pattern, (*PRUNE) has the same effect as (*COMMIT).
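
A sketch of the simple case, where (*PRUNE) behaves like a possessive quantifier or an atomic group. The pattern and subject are illustrative assumptions, chosen so that a match is possible only if a+ gives back one character.

```erlang
%% Ordinary backtracking: a+ gives back one "a" so that "ab" can match.
{match, ["aaab"]} = re:run("aaab", "a+ab", [{capture, first, list}]).

%% Backtracking cannot cross (*PRUNE), so every starting position fails.
nomatch = re:run("aaab", "a+(*PRUNE)ab").

%% The possessive quantifier and the atomic group give the same result.
nomatch = re:run("aaab", "a++ab").
nomatch = re:run("aaab", "(?>a+)ab").
```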
4327
The behavior of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
It is like (*MARK:NAME) in that the name is remembered for passing back
to the caller. However, (*SKIP:NAME) searches only for names set with
(*MARK).
4332
4333 Note:
4334 The fact that (*PRUNE:NAME) remembers the name is useless to the Erlang
4335 programmer, as names cannot be retrieved.
4336
4337
4338 The following verb, when specified without a name, is like (*PRUNE),
4339 except that if the pattern is unanchored, the "bumpalong" advance is
4340 not to the next character, but to the position in the subject where
4341 (*SKIP) was encountered.
4342
4343 (*SKIP)
4344
4345 (*SKIP) signifies that whatever text was matched leading up to it can‐
4346 not be part of a successful match. Consider:
4347
4348 a+(*SKIP)b
4349
4350 If the subject is "aaaac...", after the first match attempt fails
4351 (starting at the first character in the string), the starting point
4352 skips on to start the next attempt at "c". Notice that a possessive
4353 quantifier does not have the same effect as this example; although it
4354 would suppress backtracking during the first match attempt, the second
4355 attempt would start at the second character instead of skipping on to
4356 "c".
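
With the pattern a+(*SKIP)b alone, the skip changes only how many starting points are tried, not the result. The following sketch adds a second alternative (an assumption made purely for illustration) so that the changed starting point becomes visible in the returned match.

```erlang
%% Without (*SKIP): the first alternative fails at offset 0, and the
%% second alternative then matches from offset 0.
{match, ["aaac"]} = re:run("aaac", "a+b|\\w+", [{capture, first, list}]).

%% With (*SKIP): the failure at "c" abandons offset 0 altogether and the
%% next attempt starts where (*SKIP) was encountered, that is, at "c".
{match, ["c"]} =
    re:run("aaac", "a+(*SKIP)b|\\w+", [{capture, first, list}]).
```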
4357
4358 When (*SKIP) has an associated name, its behavior is modified:
4359
4360 (*SKIP:NAME)
4361
4362 When this is triggered, the previous path through the pattern is
4363 searched for the most recent (*MARK) that has the same name. If one is
4364 found, the "bumpalong" advance is to the subject position that corre‐
4365 sponds to that (*MARK) instead of to where (*SKIP) was encountered. If
4366 no (*MARK) with a matching name is found, (*SKIP) is ignored.
4367
4368 Notice that (*SKIP:NAME) searches only for names set by (*MARK:NAME).
4369 It ignores names that are set by (*PRUNE:NAME) or (*THEN:NAME).
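
A sketch of the difference between (*SKIP) and (*SKIP:NAME). The mark name "here", the subject, and the second alternative are assumptions made for illustration; the expected results follow from the bumpalong rules described above.

```erlang
%% Plain (*SKIP): the next attempt starts where (*SKIP) was encountered,
%% which is at "c".
{match, ["c"]} =
    re:run("aaac", "aa+(*SKIP)b|\\w+", [{capture, first, list}]).

%% (*SKIP:here): the next attempt starts at the position of the most
%% recent (*MARK:here), one character after the current start. The first
%% alternative is retried at offsets 1 and 2, and the second alternative
%% finally matches from offset 2.
{match, ["ac"]} =
    re:run("aaac", "a(*MARK:here)a+(*SKIP:here)b|\\w+",
           [{capture, first, list}]).
```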
4370
4371 The following verb causes a skip to the next innermost alternative when
4372 backtracking reaches it. That is, it cancels any further backtracking
4373 within the current alternative.
4374
4375 (*THEN) or (*THEN:NAME)
4376
4377 The verb name comes from the observation that it can be used for a pat‐
4378 tern-based if-then-else block:
4379
4380 ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
4381
4382 If the COND1 pattern matches, FOO is tried (and possibly further items
4383 after the end of the group if FOO succeeds). On failure, the matcher
4384 skips to the second alternative and tries COND2, without backtracking
4385 into COND1. If that succeeds and BAR fails, COND3 is tried. If BAZ then
4386 fails, there are no more alternatives, so there is a backtrack to what‐
4387 ever came before the entire group. If (*THEN) is not inside an alterna‐
4388 tion, it acts like (*PRUNE).
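
A sketch of how (*THEN) cancels backtracking within the current alternative. The pattern and subject are illustrative assumptions in which COND1 is a+, FOO is ab, and the second alternative is the literal x.

```erlang
%% Without (*THEN): a+ gives back one "a" and the first alternative
%% matches.
{match, ["aaab"]} =
    re:run("aaab", "^(?:a+ab|x)", [{capture, first, list}]).

%% With (*THEN): when "ab" fails, matching moves straight to the
%% alternative "x" without backtracking into a+, so nothing matches.
nomatch = re:run("aaab", "^(?:a+(*THEN)ab|x)").
```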
4389
The behavior of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN). It
is like (*MARK:NAME) in that the name is remembered for passing back to
the caller. However, (*SKIP:NAME) searches only for names set with
(*MARK).
4394
4395 Note:
4396 The fact that (*THEN:NAME) remembers the name is useless to the Erlang
4397 programmer, as names cannot be retrieved.
4398
4399
4400 A subpattern that does not contain a | character is just a part of the
4401 enclosing alternative; it is not a nested alternation with only one al‐
4402 ternative. The effect of (*THEN) extends beyond such a subpattern to
4403 the enclosing alternative. Consider the following pattern, where A, B,
4404 and so on, are complex pattern fragments that do not contain any |
4405 characters at this level:
4406
4407 A (B(*THEN)C) | D
4408
4409 If A and B are matched, but there is a failure in C, matching does not
4410 backtrack into A; instead it moves to the next alternative, that is, D.
4411 However, if the subpattern containing (*THEN) is given an alternative,
4412 it behaves differently:
4413
4414 A (B(*THEN)C | (*FAIL)) | D
4415
4416 The effect of (*THEN) is now confined to the inner subpattern. After a
4417 failure in C, matching moves to (*FAIL), which causes the whole subpat‐
4418 tern to fail, as there are no more alternatives to try. In this case,
4419 matching does now backtrack into A.
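
The two forms can be compared with concrete fragments. In this sketch A is \w+?, B is b, C is c, and D is z; these fragments and the subject are assumptions chosen so that backtracking into A is the only way to find a match.

```erlang
%% The inner group has no | character, so (*THEN) acts on the top-level
%% alternation: after C fails, only D is tried, and it fails too.
nomatch = re:run("abxbc", "^\\w+?(b(*THEN)c)|z").

%% With an alternative inside the group, (*THEN) is confined to it. The
%% group fails through (*FAIL), matching backtracks into A, and the
%% later "bc" is found.
{match, ["abxbc"]} =
    re:run("abxbc", "^\\w+?(b(*THEN)c|(*FAIL))|z",
           [{capture, first, list}]).
```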
4420
4421 Notice that a conditional subpattern is not considered as having two
4422 alternatives, as only one is ever used. That is, the | character in a
4423 conditional subpattern has a different meaning. Ignoring whitespace,
4424 consider:
4425
4426 ^.*? (?(?=a) a | b(*THEN)c )
4427
4428 If the subject is "ba", this pattern does not match. As .*? is un‐
4429 greedy, it initially matches zero characters. The condition (?=a) then
4430 fails, the character "b" is matched, but "c" is not. At this point,
4431 matching does not backtrack to .*? as can perhaps be expected from the
4432 presence of the | character. The conditional subpattern is part of the
4433 single alternative that comprises the whole pattern, and so the match
4434 fails. (If there was a backtrack into .*?, allowing it to match "b",
4435 the match would succeed.)
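
This can be checked from the Erlang shell. The extended option makes the whitespace in the pattern insignificant; the second call, without (*THEN), is an illustrative contrast that is not part of the text above.

```erlang
%% With (*THEN), there is no backtrack into .*? and the match fails.
nomatch = re:run("ba", "^.*? (?(?=a) a | b(*THEN)c )", [extended]).

%% Without (*THEN), .*? is extended to "b", the condition (?=a) then
%% holds, and "a" is matched.
{match, ["ba"]} =
    re:run("ba", "^.*? (?(?=a) a | bc )",
           [extended, {capture, first, list}]).
```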
4436
4437 The verbs described above provide four different "strengths" of control
4438 when subsequent matching fails:
4439
4440 * (*THEN) is the weakest, carrying on the match at the next alterna‐
4441 tive.
4442
* (*PRUNE) comes next, failing the match at the current starting
position, but allowing an advance to the next character (for an
unanchored pattern).
4446
4447 * (*SKIP) is similar, except that the advance can be more than one
4448 character.
4449
4450 * (*COMMIT) is the strongest, causing the entire match to fail.
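
The strengths can be compared by running one pattern shape with each verb against the same subject. This is a sketch under the assumption that the verb is not inside an alternation, in which case (*THEN) acts like (*PRUNE). The default capture returns {Offset, Length} of the match.

```erlang
%% (*THEN), (*PRUNE), and (*SKIP) all abandon the attempt at offset 0 and
%% let a later starting point find "aab" at offset 3.
{match, [{3, 3}]} = re:run("aacaab", "a+(*THEN)b").
{match, [{3, 3}]} = re:run("aacaab", "a+(*PRUNE)b").
{match, [{3, 3}]} = re:run("aacaab", "a+(*SKIP)b").

%% (*COMMIT) fails the entire match after the first failed attempt.
nomatch = re:run("aacaab", "a+(*COMMIT)b").
```

In this subject, (*PRUNE) retries at offset 1, whereas (*SKIP) moves directly to offset 2; the difference lies in the attempts made, not in the result.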
4451
4452 More than One Backtracking Verb
4453
4454 If more than one backtracking verb is present in a pattern, the one
4455 that is backtracked onto first acts. For example, consider the follow‐
4456 ing pattern, where A, B, and so on, are complex pattern fragments:
4457
4458 (A(*COMMIT)B(*THEN)C|ABD)
4459
If A matches but B fails, the backtrack to (*COMMIT) causes the entire
match to fail. However, if A and B match, but C fails, the backtrack to
(*THEN) causes the next alternative (ABD) to be tried. This behavior is
consistent, but is not always the same as in Perl. It means that if two
or more backtracking verbs appear in succession, all but the last of
them have no effect. Consider the following example:

...(*COMMIT)(*PRUNE)...
4466
4467 If there is a matching failure to the right, backtracking onto (*PRUNE)
4468 causes it to be triggered, and its action is taken. There can never be
4469 a backtrack onto (*COMMIT).
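
A sketch with (*COMMIT) and (*PRUNE) placed in succession, in both orders. The surrounding pattern and the subject are illustrative assumptions; in each case only the verb that is backtracked onto first takes effect.

```erlang
%% (*PRUNE) is reached first when backtracking, so the pattern behaves
%% like a+(*PRUNE)b and a later starting point succeeds.
{match, [{3, 3}]} = re:run("aacaab", "a+(*COMMIT)(*PRUNE)b").

%% With the order reversed, (*COMMIT) is reached first and the entire
%% match fails.
nomatch = re:run("aacaab", "a+(*PRUNE)(*COMMIT)b").
```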
4470
4471 Backtracking Verbs in Repeated Groups
4472
4473 PCRE differs from Perl in its handling of backtracking verbs in re‐
4474 peated groups. For example, consider:
4475
4476 /(a(*COMMIT)b)+ac/
4477
4478 If the subject is "abac", Perl matches, but PCRE fails because the
4479 (*COMMIT) in the second repeat of the group acts.
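
The PCRE result can be checked from the Erlang shell. The second call, without (*COMMIT), is an illustrative contrast that is not part of the text above.

```erlang
%% The second repeat of the group passes (*COMMIT) and then fails on "c",
%% which fails the entire match.
nomatch = re:run("abac", "(a(*COMMIT)b)+ac").

%% Without the verb, the group matches once and "ac" follows.
{match, ["abac"]} = re:run("abac", "(ab)+ac", [{capture, first, list}]).
```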
4480
4481 Backtracking Verbs in Assertions
4482
4483 (*FAIL) in an assertion has its normal effect: it forces an immediate
4484 backtrack.
4485
4486 (*ACCEPT) in a positive assertion causes the assertion to succeed with‐
4487 out any further processing. In a negative assertion, (*ACCEPT) causes
4488 the assertion to fail without any further processing.
4489
The other backtracking verbs are not treated specially if they appear
in a positive assertion. In particular, (*THEN) skips to the next
alternative in the innermost enclosing group that has alternations,
regardless of whether this is within the assertion.
4494
4495 Negative assertions are, however, different, to ensure that changing a
4496 positive assertion into a negative assertion changes its result. Back‐
4497 tracking into (*COMMIT), (*SKIP), or (*PRUNE) causes a negative asser‐
4498 tion to be true, without considering any further alternative branches
4499 in the assertion. Backtracking into (*THEN) causes it to skip to the
4500 next enclosing alternative within the assertion (the normal behavior),
4501 but if the assertion does not have such an alternative, (*THEN) behaves
4502 like (*PRUNE).
4503
4504 Backtracking Verbs in Subroutines
4505
These behaviors occur regardless of whether the subpattern is called
recursively. The treatment of subroutines in Perl is different in some
cases.
4509
4510 * (*FAIL) in a subpattern called as a subroutine has its normal ef‐
4511 fect: it forces an immediate backtrack.
4512
4513 * (*ACCEPT) in a subpattern called as a subroutine causes the subrou‐
4514 tine match to succeed without any further processing. Matching then
4515 continues after the subroutine call.
4516
4517 * (*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as a sub‐
4518 routine cause the subroutine match to fail.
4519
4520 * (*THEN) skips to the next alternative in the innermost enclosing
4521 group within the subpattern that has alternatives. If there is no
4522 such group within the subpattern, (*THEN) causes the subroutine
4523 match to fail.
4524
Ericsson AB stdlib 5.1.1 re(3)