re(3) Erlang Module Definition re(3)

re - Perl-like regular expressions for Erlang.
7
9 This module contains regular expression matching functions for strings
10 and binaries.
11
12 The regular expression syntax and semantics resemble that of Perl.
13
14 The matching algorithms of the library are based on the PCRE library,
15 but not all of the PCRE library is interfaced and some parts of the
16 library go beyond what PCRE offers. Currently PCRE version 8.40
17 (release date 2017-01-11) is used. The sections of the PCRE documenta‐
18 tion that are relevant to this module are included here.
19
20 Note:
21 The Erlang literal syntax for strings uses the "\" (backslash) charac‐
22 ter as an escape code. You need to escape backslashes in literal
23 strings, both in your code and in the shell, with an extra backslash,
24 that is, "\\".
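
As a minimal sketch of the escaping rule above, the following hypothetical module (the names re_escape_demo and digits/1 are illustrative only, not part of re) passes the regular expression \d+ to run/3. In Erlang source the pattern has to be written "\\d+":

```erlang
-module(re_escape_demo).
-export([digits/1]).

%% Returns the first run of decimal digits in Subject as a list,
%% or nomatch. The regular expression \d+ is written "\\d+" here,
%% because "\\" is the Erlang literal for a single backslash.
%% (A bare "\d" in an Erlang string is the delete character.)
digits(Subject) ->
    case re:run(Subject, "\\d+", [{capture, first, list}]) of
        {match, [Digits]} -> Digits;
        nomatch -> nomatch
    end.
```

For example, re_escape_demo:digits("abc42def") is expected to return "42".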
25
26
28 mp() = {re_pattern, term(), term(), term(), term()}
29
30 Opaque data type containing a compiled regular expression. mp()
31 is guaranteed to be a tuple() having the atom re_pattern as its
32 first element, to allow for matching in guards. The arity of the
33 tuple or the content of the other fields can change in future
34 Erlang/OTP releases.
35
36 nl_spec() = cr | crlf | lf | anycrlf | any
37
38 compile_option() =
39 unicode | anchored | caseless | dollar_endonly | dotall |
40 extended | firstline | multiline | no_auto_capture |
41 dupnames | ungreedy |
42 {newline, nl_spec()} |
43 bsr_anycrlf | bsr_unicode | no_start_optimize | ucp |
44 never_utf
45
47 version() -> binary()
48
Returns a binary containing the version of the PCRE library
that was used when Erlang/OTP was compiled.
51
52 compile(Regexp) -> {ok, MP} | {error, ErrSpec}
53
54 Types:
55
56 Regexp = iodata()
57 MP = mp()
58 ErrSpec =
59 {ErrString :: string(), Position :: integer() >= 0}
60
The same as compile(Regexp,[]).
62
63 compile(Regexp, Options) -> {ok, MP} | {error, ErrSpec}
64
65 Types:
66
67 Regexp = iodata() | unicode:charlist()
68 Options = [Option]
69 Option = compile_option()
70 MP = mp()
71 ErrSpec =
72 {ErrString :: string(), Position :: integer() >= 0}
73
74 Compiles a regular expression, with the syntax described below,
75 into an internal format to be used later as a parameter to run/2
76 and run/3.
77
78 Compiling the regular expression before matching is useful if
79 the same expression is to be used in matching against multiple
80 subjects during the lifetime of the program. Compiling once and
81 executing many times is far more efficient than compiling each
82 time one wants to match.
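
The compile-once, run-many pattern described above could be sketched as follows. The module and function names are hypothetical and chosen only for illustration:

```erlang
-module(re_compile_demo).
-export([count_matching/2]).

%% Compiles Pattern once and counts how many of the Subjects match.
%% The compiled mp() is reused for every call to re:run/3.
count_matching(Pattern, Subjects) ->
    {ok, MP} = re:compile(Pattern),
    length([S || S <- Subjects,
                 re:run(S, MP, [{capture, none}]) =:= match]).
```

Here re_compile_demo:count_matching("^a+b", ["aab", "ab", "ba"]) should give 2, as only the first two subjects start with one or more "a" followed by "b".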
83
84 When option unicode is specified, the regular expression is to
85 be specified as a valid Unicode charlist(), otherwise as any
86 valid iodata().
87
88 Options:
89
90 unicode:
91 The regular expression is specified as a Unicode charlist()
92 and the resulting regular expression code is to be run
93 against a valid Unicode charlist() subject. Also consider
94 option ucp when using Unicode characters.
95
96 anchored:
97 The pattern is forced to be "anchored", that is, it is con‐
98 strained to match only at the first matching point in the
99 string that is searched (the "subject string"). This effect
100 can also be achieved by appropriate constructs in the pat‐
101 tern itself.
102
103 caseless:
104 Letters in the pattern match both uppercase and lowercase
105 letters. It is equivalent to Perl option /i and can be
106 changed within a pattern by a (?i) option setting. Uppercase
107 and lowercase letters are defined as in the ISO 8859-1 char‐
108 acter set.
109
110 dollar_endonly:
111 A dollar metacharacter in the pattern matches only at the
112 end of the subject string. Without this option, a dollar
113 also matches immediately before a newline at the end of the
114 string (but not before any other newlines). This option is
115 ignored if option multiline is specified. There is no equiv‐
116 alent option in Perl, and it cannot be set within a pattern.
117
118 dotall:
119 A dot in the pattern matches all characters, including those
120 indicating newline. Without it, a dot does not match when
121 the current position is at a newline. This option is equiva‐
122 lent to Perl option /s and it can be changed within a pat‐
123 tern by a (?s) option setting. A negative class, such as
124 [^a], always matches newline characters, independent of the
125 setting of this option.
126
127 extended:
128 If this option is set, most white space characters in the
129 pattern are totally ignored except when escaped or inside a
130 character class. However, white space is not allowed within
131 sequences such as (?> that introduce various parenthesized
132 subpatterns, nor within a numerical quantifier such as
133 {1,3}. However, ignorable white space is permitted between
134 an item and a following quantifier and between a quantifier
135 and a following + that indicates possessiveness.
136
White space formerly did not include the VT character (code
11), because Perl did not treat this character as white
space. However, Perl changed at release 5.18, so PCRE
followed at release 8.34, and VT is now treated as white space.
141
142 This also causes characters between an unescaped # outside a
143 character class and the next newline, inclusive, to be
144 ignored. This is equivalent to Perl's /x option, and it can
145 be changed within a pattern by a (?x) option setting.
146
147 With this option, comments inside complicated patterns can
148 be included. However, notice that this applies only to data
149 characters. Whitespace characters can never appear within
150 special character sequences in a pattern, for example within
151 sequence (?( that introduces a conditional subpattern.
152
153 firstline:
154 An unanchored pattern is required to match before or at the
155 first newline in the subject string, although the matched
156 text can continue over the newline.
157
158 multiline:
159 By default, PCRE treats the subject string as consisting of
160 a single line of characters (even if it contains newlines).
161 The "start of line" metacharacter (^) matches only at the
162 start of the string, while the "end of line" metacharacter
163 ($) matches only at the end of the string, or before a ter‐
164 minating newline (unless option dollar_endonly is speci‐
165 fied). This is the same as in Perl.
166
167 When this option is specified, the "start of line" and "end
168 of line" constructs match immediately following or immedi‐
169 ately before internal newlines in the subject string,
170 respectively, as well as at the very start and end. This is
171 equivalent to Perl option /m and can be changed within a
172 pattern by a (?m) option setting. If there are no newlines
173 in a subject string, or no occurrences of ^ or $ in a pat‐
174 tern, setting multiline has no effect.
175
176 no_auto_capture:
177 Disables the use of numbered capturing parentheses in the
178 pattern. Any opening parenthesis that is not followed by ?
179 behaves as if it is followed by ?:. Named parentheses can
180 still be used for capturing (and they acquire numbers in the
181 usual way). There is no equivalent option in Perl.
182
183 dupnames:
184 Names used to identify capturing subpatterns need not be
185 unique. This can be helpful for certain types of pattern
186 when it is known that only one instance of the named subpat‐
187 tern can ever be matched. More details of named subpatterns
188 are provided below.
189
190 ungreedy:
191 Inverts the "greediness" of the quantifiers so that they are
192 not greedy by default, but become greedy if followed by "?".
193 It is not compatible with Perl. It can also be set by a (?U)
194 option setting within the pattern.
195
196 {newline, NLSpec}:
197 Overrides the default definition of a newline in the subject
198 string, which is LF (ASCII 10) in Erlang.
199
200 cr:
Newline is indicated by a single character CR (ASCII 13).
202
203 lf:
204 Newline is indicated by a single character LF (ASCII 10),
205 the default.
206
207 crlf:
208 Newline is indicated by the two-character CRLF (ASCII 13
209 followed by ASCII 10) sequence.
210
211 anycrlf:
212 Any of the three preceding sequences is to be recognized.
213
214 any:
215 Any of the newline sequences above, and the Unicode
216 sequences VT (vertical tab, U+000B), FF (formfeed,
217 U+000C), NEL (next line, U+0085), LS (line separator,
218 U+2028), and PS (paragraph separator, U+2029).
219
220 bsr_anycrlf:
Specifies that \R is to match only the CR, LF, or CRLF
sequences, not the Unicode-specific newline characters.
224
225 bsr_unicode:
Specifies that \R is to match all the Unicode newline
characters (including CRLF, and so on). This is the default.
228
229 no_start_optimize:
Disables optimization that can malfunction if "Special
start-of-pattern items" are present in the regular
expression. A typical example is matching "DEFABC"
against "(*COMMIT)ABC", where the start optimization of PCRE
would skip the subject up to "A" and never realize that the
(*COMMIT) instruction should have made the match fail.
This option is only relevant if you use "start-of-pattern
items", as discussed in section PCRE Regular Expression
Details.
239
240 ucp:
241 Specifies that Unicode character properties are to be used
242 when resolving \B, \b, \D, \d, \S, \s, \W and \w. Without
243 this flag, only ISO Latin-1 properties are used. Using Uni‐
244 code properties hurts performance, but is semantically cor‐
245 rect when working with Unicode characters beyond the ISO
246 Latin-1 range.
247
248 never_utf:
249 Specifies that the (*UTF) and/or (*UTF8) "start-of-pattern
250 items" are forbidden. This flag cannot be combined with
251 option unicode. Useful if ISO Latin-1 patterns from an
252 external source are to be compiled.
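
As an illustration of some of the compile options above, the following sketch passes them to run/3 together with a non-precompiled pattern. The module name re_options_demo is hypothetical, and the subjects and patterns are made up for the example:

```erlang
-module(re_options_demo).
-export([demo/0]).

%% Each line asserts the result by pattern matching; demo/0 returns ok
%% only if every assertion holds.
demo() ->
    %% caseless: the lowercase pattern matches an uppercase subject.
    {match, ["HELLO"]} =
        re:run("HELLO", "hello", [caseless, {capture, first, list}]),
    %% Without multiline, ^ matches only at the start of the subject.
    nomatch = re:run("one\ntwo", "^two", []),
    %% With multiline, ^ also matches right after the internal newline.
    {match, [{4, 3}]} = re:run("one\ntwo", "^two", [multiline]),
    %% Without dotall, a dot does not match the newline character.
    nomatch = re:run("a\nb", "a.b", []),
    {match, [{0, 3}]} = re:run("a\nb", "a.b", [dotall]),
    %% extended: white space and the # comment in the pattern are ignored.
    {match, ["2024"]} =
        re:run("year 2024", "\\d{4}  # four digits",
               [extended, {capture, first, list}]),
    %% ungreedy: quantifiers match as little as possible by default.
    {match, ["<a>"]} =
        re:run("<a><b>", "<.+>", [ungreedy, {capture, first, list}]),
    ok.
```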
253
254 inspect(MP, Item) -> {namelist, [binary()]}
255
256 Types:
257
258 MP = mp()
259 Item = namelist
260
261 Takes a compiled regular expression and an item, and returns the
262 relevant data from the regular expression. The only supported
263 item is namelist, which returns the tuple {namelist,
264 [binary()]}, containing the names of all (unique) named subpat‐
265 terns in the regular expression. For example:
266
267 1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
268 {ok,{re_pattern,3,0,0,
269 <<69,82,67,80,119,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255,
270 255,255,...>>}}
271 2> re:inspect(MP,namelist).
272 {namelist,[<<"A">>,<<"B">>,<<"C">>]}
273 3> {ok,MPD} = re:compile("(?<C>A)|(?<B>B)|(?<C>C)",[dupnames]).
274 {ok,{re_pattern,3,0,0,
275 <<69,82,67,80,119,0,0,0,0,0,8,0,1,0,0,0,255,255,255,255,
276 255,255,...>>}}
277 4> re:inspect(MPD,namelist).
278 {namelist,[<<"B">>,<<"C">>]}
279
280 Notice in the second example that the duplicate name only occurs
281 once in the returned list, and that the list is in alphabetical
282 order regardless of where the names are positioned in the regu‐
283 lar expression. The order of the names is the same as the order
284 of captured subexpressions if {capture, all_names} is specified
285 as an option to run/3. You can therefore create a name-to-value
286 mapping from the result of run/3 like this:
287
288 1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
289 {ok,{re_pattern,3,0,0,
290 <<69,82,67,80,119,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255,
291 255,255,...>>}}
292 2> {namelist, N} = re:inspect(MP,namelist).
293 {namelist,[<<"A">>,<<"B">>,<<"C">>]}
294 3> {match,L} = re:run("AA",MP,[{capture,all_names,binary}]).
295 {match,[<<"A">>,<<>>,<<>>]}
296 4> NameMap = lists:zip(N,L).
297 [{<<"A">>,<<"A">>},{<<"B">>,<<>>},{<<"C">>,<<>>}]
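
The same idea can be wrapped in a helper that returns a map. This is only a sketch; the module re_names_demo and the function named_captures/2 are hypothetical names, not part of re:

```erlang
-module(re_names_demo).
-export([named_captures/2]).

%% Returns {ok, Map} from subpattern name to captured binary, or nomatch.
%% Relies on inspect/2 and {capture, all_names} using the same name order.
named_captures(Subject, Pattern) ->
    {ok, MP} = re:compile(Pattern),
    {namelist, Names} = re:inspect(MP, namelist),
    case re:run(Subject, MP, [{capture, all_names, binary}]) of
        {match, Values} -> {ok, maps:from_list(lists:zip(Names, Values))};
        match -> {ok, #{}};      % pattern matched but has no named subpatterns
        nomatch -> nomatch
    end.
```

For example, calling it with subject "2024-05" and pattern "(?<year>\\d{4})-(?<month>\\d{2})" should map <<"year">> to <<"2024">> and <<"month">> to <<"05">>.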
298
299 replace(Subject, RE, Replacement) -> iodata() | unicode:charlist()
300
301 Types:
302
303 Subject = iodata() | unicode:charlist()
304 RE = mp() | iodata()
305 Replacement = iodata() | unicode:charlist()
306
307 Same as replace(Subject, RE, Replacement, []).
308
309 replace(Subject, RE, Replacement, Options) ->
310 iodata() | unicode:charlist()
311
312 Types:
313
314 Subject = iodata() | unicode:charlist()
315 RE = mp() | iodata() | unicode:charlist()
316 Replacement = iodata() | unicode:charlist()
317 Options = [Option]
318 Option =
319 anchored | global | notbol | noteol | notempty |
320 notempty_atstart |
321 {offset, integer() >= 0} |
322 {newline, NLSpec} |
323 bsr_anycrlf |
324 {match_limit, integer() >= 0} |
325 {match_limit_recursion, integer() >= 0} |
326 bsr_unicode |
327 {return, ReturnType} |
328 CompileOpt
329 ReturnType = iodata | list | binary
330 CompileOpt = compile_option()
331 NLSpec = cr | crlf | lf | anycrlf | any
332
333 Replaces the matched part of the Subject string with the con‐
334 tents of Replacement.
335
336 The permissible options are the same as for run/3, except that
337 option capture is not allowed. Instead a {return, ReturnType} is
338 present. The default return type is iodata, constructed in a way
339 to minimize copying. The iodata result can be used directly in
340 many I/O operations. If a flat list() is desired, specify
341 {return, list}. If a binary is desired, specify {return,
342 binary}.
343
344 As in function run/3, an mp() compiled with option unicode
345 requires Subject to be a Unicode charlist(). If compilation is
346 done implicitly and the unicode compilation option is specified
to this function, both the regular expression and Subject are to
be specified as valid Unicode charlist()s.
349
The replacement string can contain the special character &,
which inserts the whole matching expression in the result, and
the special sequences \N (where N is an integer > 0), \gN, or
\g{N}, which insert subexpression number N in the result. If
no subexpression with that number is generated by the regular
expression, nothing is inserted.
356
357 To insert an & or a \ in the result, precede it with a \. Notice
358 that Erlang already gives a special meaning to \ in literal
359 strings, so a single \ must be written as "\\" and therefore a
360 double \ as "\\\\".
361
362 Example:
363
364 re:replace("abcd","c","[&]",[{return,list}]).
365
366 gives
367
368 "ab[c]d"
369
370 while
371
372 re:replace("abcd","c","[\\&]",[{return,list}]).
373
374 gives
375
376 "ab[&]d"
377
378 As with run/3, compilation errors raise the badarg exception.
379 compile/2 can be used to get more information about the error.
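
A further sketch, showing the \N sequences and option global. The module and function names are hypothetical and exist only for this example:

```erlang
-module(re_replace_demo).
-export([swap_pair/1, mask_digits/1]).

%% Turns the first "key=value" into "value=key". \1 and \2 refer to the
%% first and second subexpression and are written "\\1" and "\\2".
swap_pair(Subject) ->
    re:replace(Subject, "(\\w+)=(\\w+)", "\\2=\\1", [{return, list}]).

%% Replaces every digit with "#". Without global, only the first
%% match would be replaced.
mask_digits(Subject) ->
    re:replace(Subject, "[0-9]", "#", [global, {return, binary}]).
```

With these definitions, swap_pair("key=value") should give "value=key" and mask_digits("a1b22") should give <<"a#b##">>.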
380
381 run(Subject, RE) -> {match, Captured} | nomatch
382
383 Types:
384
385 Subject = iodata() | unicode:charlist()
386 RE = mp() | iodata()
387 Captured = [CaptureData]
388 CaptureData = {integer(), integer()}
389
390 Same as run(Subject,RE,[]).
391
392 run(Subject, RE, Options) ->
393 {match, Captured} | match | nomatch | {error, ErrType}
394
395 Types:
396
397 Subject = iodata() | unicode:charlist()
398 RE = mp() | iodata() | unicode:charlist()
399 Options = [Option]
400 Option =
401 anchored | global | notbol | noteol | notempty |
402 notempty_atstart | report_errors |
403 {offset, integer() >= 0} |
404 {match_limit, integer() >= 0} |
405 {match_limit_recursion, integer() >= 0} |
406 {newline, NLSpec :: nl_spec()} |
407 bsr_anycrlf | bsr_unicode |
408 {capture, ValueSpec} |
409 {capture, ValueSpec, Type} |
410 CompileOpt
411 Type = index | list | binary
ValueSpec =
all | all_but_first | all_names | first | none | ValueList
415 ValueList = [ValueID]
416 ValueID = integer() | string() | atom()
417 CompileOpt = compile_option()
418 See compile/2.
419 Captured = [CaptureData] | [[CaptureData]]
420 CaptureData =
421 {integer(), integer()} | ListConversionData | binary()
422 ListConversionData =
423 string() |
424 {error, string(), binary()} |
425 {incomplete, string(), binary()}
ErrType =
match_limit | match_limit_recursion | {compile, CompileErr}
429 CompileErr =
430 {ErrString :: string(), Position :: integer() >= 0}
431
432 Executes a regular expression matching, and returns
433 match/{match, Captured} or nomatch. The regular expression can
434 be specified either as iodata() in which case it is automati‐
435 cally compiled (as by compile/2) and executed, or as a precom‐
436 piled mp() in which case it is executed against the subject
437 directly.
438
439 When compilation is involved, exception badarg is thrown if a
440 compilation error occurs. Call compile/2 to get information
441 about the location of the error in the regular expression.
442
443 If the regular expression is previously compiled, the option
444 list can only contain the following options:
445
446 * anchored
447
448 * {capture, ValueSpec}/{capture, ValueSpec, Type}
449
450 * global
451
452 * {match_limit, integer() >= 0}
453
454 * {match_limit_recursion, integer() >= 0}
455
456 * {newline, NLSpec}
457
458 * notbol
459
460 * notempty
461
462 * notempty_atstart
463
464 * noteol
465
466 * {offset, integer() >= 0}
467
468 * report_errors
469
470 Otherwise all options valid for function compile/2 are also
471 allowed. Options allowed both for compilation and execution of a
472 match, namely anchored and {newline, NLSpec}, affect both the
473 compilation and execution if present together with a non-precom‐
474 piled regular expression.
475
476 If the regular expression was previously compiled with option
477 unicode, Subject is to be provided as a valid Unicode
478 charlist(), otherwise any iodata() will do. If compilation is
479 involved and option unicode is specified, both Subject and the
480 regular expression are to be specified as valid Unicode
481 charlists().
482
483 {capture, ValueSpec}/{capture, ValueSpec, Type} defines what to
484 return from the function upon successful matching. The capture
485 tuple can contain both a value specification, telling which of
486 the captured substrings are to be returned, and a type specifi‐
487 cation, telling how captured substrings are to be returned (as
488 index tuples, lists, or binaries). The options are described in
489 detail below.
490
If the capture options specify that no substring capturing is
to be done ({capture, none}), the function returns the single
atom match upon successful matching, otherwise the tuple {match,
Captured}. Capturing can be disabled by specifying either
none or an empty list as ValueSpec.
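
A small sketch of the two ways to disable capturing (the module name is hypothetical):

```erlang
-module(re_capture_none_demo).
-export([demo/0]).

%% Both none and the empty value list make run/3 return the bare atom
%% match on success instead of a {match, Captured} tuple.
demo() ->
    match = re:run("abc", "b", [{capture, none}]),
    match = re:run("abc", "b", [{capture, []}]),
    nomatch = re:run("abc", "x", [{capture, none}]),
    ok.
```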
496
497 Option report_errors adds the possibility that an error tuple is
498 returned. The tuple either indicates a matching error
499 (match_limit or match_limit_recursion), or a compilation error,
where the error tuple has the format {error, {compile,
CompileErr}}. Notice that if option report_errors is not specified,
502 the function never returns error tuples, but reports compilation
503 errors as a badarg exception and failed matches because of
504 exceeded match limits simply as nomatch.
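
One possible way to handle all return shapes when report_errors is specified is sketched below. The module, the function safe_run/2, and the result atoms bad_regexp and limit_exceeded are hypothetical names chosen for the example:

```erlang
-module(re_errors_demo).
-export([safe_run/2]).

%% Runs RE against Subject without raising badarg for a bad pattern.
%% RE is assumed to be an uncompiled pattern or an mp().
safe_run(Subject, RE) ->
    case re:run(Subject, RE, [report_errors, {capture, first, list}]) of
        {match, [Whole]} ->
            {ok, Whole};
        nomatch ->
            nomatch;
        {error, {compile, {ErrString, Position}}} ->
            {bad_regexp, ErrString, Position};
        {error, Limit} ->
            %% Limit is match_limit or match_limit_recursion.
            {limit_exceeded, Limit}
    end.
```

For example, safe_run("abc", "(b") is expected to return a bad_regexp tuple, because the closing parenthesis is missing from the pattern.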
505
506 The following options are relevant for execution:
507
508 anchored:
509 Limits run/3 to matching at the first matching position. If
510 a pattern was compiled with anchored, or turned out to be
511 anchored by virtue of its contents, it cannot be made unan‐
512 chored at matching time, hence there is no unanchored
513 option.
514
515 global:
Implements global (repetitive) search (flag g in Perl). Each
match is returned as a separate list() containing the
specific match and any matching subexpressions (or as specified
by option capture). The Captured part of the return value is
hence a list() of list()s when this option is specified.
521
522 The interaction of option global with a regular expression
523 that matches an empty string surprises some users. When
524 option global is specified, run/3 handles empty matches in
525 the same way as Perl: a zero-length match at any point is
526 also retried with options [anchored, notempty_atstart]. If
527 that search gives a result of length > 0, the result is
528 included. Example:
529
530 re:run("cat","(|at)",[global]).
531
532 The following matchings are performed:
533
534 At offset 0:
The regular expression (|at) first matches at the initial
position of string cat, giving the result set
[{0,0},{0,0}] (the second {0,0} is because of the
subexpression marked by the parentheses). As the length of the
match is 0, we do not advance to the next position yet.
540
541 At offset 0 with [anchored, notempty_atstart]:
542 The search is retried with options [anchored,
543 notempty_atstart] at the same position, which does not
544 give any interesting result of longer length, so the
545 search position is advanced to the next character (a).
546
547 At offset 1:
548 The search results in [{1,0},{1,0}], so this search is
549 also repeated with the extra options.
550
551 At offset 1 with [anchored, notempty_atstart]:
Alternative at is found and the result is [{1,2},{1,2}].
The result is added to the list of results and the
position in the search string is advanced two steps.
555
556 At offset 3:
557 The search once again matches the empty string, giving
558 [{3,0},{3,0}].
559
At offset 3 with [anchored, notempty_atstart]:
561 This gives no result of length > 0 and we are at the last
562 position, so the global search is complete.
563
564 The result of the call is:
565
566 {match,[[{0,0},{0,0}],[{1,0},{1,0}],[{1,2},{1,2}],[{3,0},{3,0}]]}
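
For a pattern that cannot match the empty string, the nested result is simpler. The following sketch (hypothetical module and function names) collects every run of digits in a subject:

```erlang
-module(re_global_demo).
-export([all_numbers/1]).

%% Returns all decimal numbers found in Subject as integers.
%% With global, each match is its own inner list; {capture, first, list}
%% makes that inner list contain just the matched text.
all_numbers(Subject) ->
    case re:run(Subject, "[0-9]+", [global, {capture, first, list}]) of
        {match, Matches} -> [list_to_integer(M) || [M] <- Matches];
        nomatch -> []
    end.
```

Here all_numbers("a1b22c333") should return [1,22,333].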
567
568 notempty:
569 An empty string is not considered to be a valid match if
570 this option is specified. If alternatives in the pattern
571 exist, they are tried. If all the alternatives match the
572 empty string, the entire match fails.
573
574 Example:
575
576 If the following pattern is applied to a string not begin‐
577 ning with "a" or "b", it would normally match the empty
578 string at the start of the subject:
579
580 a?b?
581
582 With option notempty, this match is invalid, so run/3
583 searches further into the string for occurrences of "a" or
584 "b".
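
The behavior described above can be sketched as follows (the module name and the subjects are made up for the example):

```erlang
-module(re_notempty_demo).
-export([demo/0]).

demo() ->
    %% a?b? matches the empty string at offset 0 of "xab".
    {match, [{0, 0}]} = re:run("xab", "a?b?"),
    %% With notempty, that empty match is rejected and the search
    %% continues until "ab" is found at offset 1.
    {match, [{1, 2}]} = re:run("xab", "a?b?", [notempty]),
    %% If only empty matches are possible, the whole match fails.
    nomatch = re:run("xyz", "a?b?", [notempty]),
    ok.
```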
585
586 notempty_atstart:
587 Like notempty, except that an empty string match that is not
588 at the start of the subject is permitted. If the pattern is
589 anchored, such a match can occur only if the pattern con‐
590 tains \K.
591
592 Perl has no direct equivalent of notempty or
593 notempty_atstart, but it does make a special case of a pat‐
594 tern match of the empty string within its split() function,
595 and when using modifier /g. The Perl behavior can be emu‐
596 lated after matching a null string by first trying the match
597 again at the same offset with notempty_atstart and anchored,
598 and then, if that fails, by advancing the starting offset
599 (see below) and trying an ordinary match again.
600
601 notbol:
602 Specifies that the first character of the subject string is
603 not the beginning of a line, so the circumflex metacharacter
604 is not to match before it. Setting this without multiline
605 (at compile time) causes circumflex never to match. This
606 option only affects the behavior of the circumflex metachar‐
607 acter. It does not affect \A.
608
609 noteol:
610 Specifies that the end of the subject string is not the end
611 of a line, so the dollar metacharacter is not to match it
612 nor (except in multiline mode) a newline immediately before
613 it. Setting this without multiline (at compile time) causes
614 dollar never to match. This option affects only the behavior
615 of the dollar metacharacter. It does not affect \Z or \z.
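
A brief sketch of notbol and noteol, which can be useful when the subject is only a fragment of a longer text. The module name is hypothetical:

```erlang
-module(re_notbol_demo).
-export([demo/0]).

demo() ->
    {match, [{0, 3}]} = re:run("abc", "^abc"),
    %% notbol without multiline: circumflex never matches.
    nomatch = re:run("abc", "^abc", [notbol]),
    %% \A is not affected by notbol.
    {match, [{0, 3}]} = re:run("abc", "\\Aabc", [notbol]),
    %% noteol without multiline: dollar never matches.
    nomatch = re:run("abc", "abc$", [noteol]),
    %% \z is not affected by noteol.
    {match, [{0, 3}]} = re:run("abc", "abc\\z", [noteol]),
    ok.
```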
616
617 report_errors:
618 Gives better control of the error handling in run/3. When
619 specified, compilation errors (if the regular expression is
620 not already compiled) and runtime errors are explicitly
621 returned as an error tuple.
622
623 The following are the possible runtime errors:
624
625 match_limit:
626 The PCRE library sets a limit on how many times the inter‐
627 nal match function can be called. Defaults to 10,000,000
628 in the library compiled for Erlang. If {error,
629 match_limit} is returned, the execution of the regular
630 expression has reached this limit. This is normally to be
631 regarded as a nomatch, which is the default return value
632 when this occurs, but by specifying report_errors, you are
633 informed when the match fails because of too many internal
634 calls.
635
636 match_limit_recursion:
637 This error is very similar to match_limit, but occurs when
638 the internal match function of PCRE is "recursively"
639 called more times than the match_limit_recursion limit,
which defaults to 10,000,000 as well. Notice that as long
as the match_limit and match_limit_recursion values are kept
at the default values, the match_limit_recursion error
cannot occur, as the match_limit error occurs before that
(each recursive call is also a call, but not conversely).
Both limits can however be changed, either by setting
limits directly in the regular expression string (see section
PCRE Regular Expression Details) or by specifying options
to run/3.
649
650 It is important to understand that what is referred to as
651 "recursion" when limiting matches is not recursion on the C
652 stack of the Erlang machine or on the Erlang process stack.
653 The PCRE version compiled into the Erlang VM uses machine
654 "heap" memory to store values that must be kept over recur‐
655 sion in regular expression matches.
656
657 {match_limit, integer() >= 0}:
658 Limits the execution time of a match in an implementation-
659 specific way. It is described as follows by the PCRE docu‐
660 mentation:
661
662 The match_limit field provides a means of preventing PCRE from using
663 up a vast amount of resources when running patterns that are not going
664 to match, but which have a very large number of possibilities in their
665 search trees. The classic example is a pattern that uses nested
666 unlimited repeats.
667
668 Internally, pcre_exec() uses a function called match(), which it calls
669 repeatedly (sometimes recursively). The limit set by match_limit is
670 imposed on the number of times this function is called during a match,
671 which has the effect of limiting the amount of backtracking that can
672 take place. For patterns that are not anchored, the count restarts
673 from zero for each position in the subject string.
674
675 This means that runaway regular expression matches can fail
676 faster if the limit is lowered using this option. The
677 default value 10,000,000 is compiled into the Erlang VM.
678
679 Note:
This option in no way affects the execution of the Erlang
VM in terms of "long running BIFs". run/3 always gives control
back to the scheduler of Erlang processes at intervals that
ensure the real-time properties of the Erlang system.
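
A lowered limit is typically combined with report_errors, so that a match abandoned because of the limit can be told apart from an ordinary nomatch. A minimal sketch, with hypothetical module, function, and result names:

```erlang
-module(re_match_limit_demo).
-export([bounded_run/3]).

%% Runs RE against Subject with match_limit set to Limit and reports
%% whether the limit was reached. Limit is assumed to be lower than the
%% default recursion limit, so match_limit is the error expected first.
bounded_run(Subject, RE, Limit) ->
    Opts = [{match_limit, Limit}, report_errors, {capture, none}],
    case re:run(Subject, RE, Opts) of
        match -> match;
        nomatch -> nomatch;
        {error, match_limit} -> limit_exceeded;
        {error, Other} -> {error, Other}
    end.
```

With a pattern that uses nested unlimited repeats and a subject that cannot match, a small Limit is expected to yield limit_exceeded instead of a long-running search. The exact number of internal calls a given pattern needs depends on the PCRE version in use.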
684
685
686 {match_limit_recursion, integer() >= 0}:
687 Limits the execution time and memory consumption of a match
688 in an implementation-specific way, very similar to
689 match_limit. It is described as follows by the PCRE documen‐
690 tation:
691
692 The match_limit_recursion field is similar to match_limit, but instead
693 of limiting the total number of times that match() is called, it
694 limits the depth of recursion. The recursion depth is a smaller number
695 than the total number of calls, because not all calls to match() are
696 recursive. This limit is of use only if it is set smaller than
697 match_limit.
698
699 Limiting the recursion depth limits the amount of machine stack that
700 can be used, or, when PCRE has been compiled to use memory on the heap
701 instead of the stack, the amount of heap memory that can be used.
702
703 The Erlang VM uses a PCRE library where heap memory is used
704 when regular expression match recursion occurs. This there‐
705 fore limits the use of machine heap, not C stack.
706
707 Specifying a lower value can result in matches with deep
708 recursion failing, when they should have matched:
709
710 1> re:run("aaaaaaaaaaaaaz","(a+)*z").
711 {match,[{0,14},{0,13}]}
712 2> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5}]).
713 nomatch
714 3> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5},report_errors]).
715 {error,match_limit_recursion}
716
717 This option and option match_limit are only to be used in
718 rare cases. Understanding of the PCRE library internals is
719 recommended before tampering with these limits.
720
721 {offset, integer() >= 0}:
722 Start matching at the offset (position) specified in the
723 subject string. The offset is zero-based, so that the
724 default is {offset,0} (all of the subject string).
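
A short sketch of the option (hypothetical module name). Notice that the returned indexes still count from the start of the subject, not from the offset:

```erlang
-module(re_offset_demo).
-export([demo/0]).

demo() ->
    {match, [{0, 3}]} = re:run("abcabc", "abc"),
    %% Starting at offset 1 skips the first "abc".
    {match, [{3, 3}]} = re:run("abcabc", "abc", [{offset, 1}]),
    %% From offset 4, no complete "abc" remains.
    nomatch = re:run("abcabc", "abc", [{offset, 4}]),
    ok.
```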
725
726 {newline, NLSpec}:
727 Overrides the default definition of a newline in the subject
728 string, which is LF (ASCII 10) in Erlang.
729
730 cr:
731 Newline is indicated by a single character CR (ASCII 13).
732
733 lf:
734 Newline is indicated by a single character LF (ASCII 10),
735 the default.
736
737 crlf:
738 Newline is indicated by the two-character CRLF (ASCII 13
739 followed by ASCII 10) sequence.
740
741 anycrlf:
Any of the three preceding sequences is to be recognized.
743
744 any:
745 Any of the newline sequences above, and the Unicode
746 sequences VT (vertical tab, U+000B), FF (formfeed,
747 U+000C), NEL (next line, U+0085), LS (line separator,
748 U+2028), and PS (paragraph separator, U+2029).
749
750 bsr_anycrlf:
Specifies that \R is to match only the CR, LF, or CRLF
sequences, not the Unicode-specific newline characters.
(Overrides the compilation option.)
754
755 bsr_unicode:
Specifies that \R is to match all the Unicode newline
characters (including CRLF, and so on). This is the default.
(Overrides the compilation option.)
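
The difference between the two \R settings can be sketched as follows. The module name is hypothetical, and the sketch assumes that VT (written "\v" in Erlang) is among the characters \R matches by default:

```erlang
-module(re_bsr_demo).
-export([demo/0]).

demo() ->
    Subject = "a\vb",
    %% Default (bsr_unicode): \R matches the vertical tab.
    {match, [{0, 3}]} = re:run(Subject, "a\\Rb"),
    %% bsr_anycrlf: \R matches only CR, LF, or CRLF.
    nomatch = re:run(Subject, "a\\Rb", [bsr_anycrlf]),
    %% CRLF is matched by \R as a single two-character sequence.
    {match, [{0, 4}]} = re:run("a\r\nb", "a\\Rb", [bsr_anycrlf]),
    ok.
```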
759
760 {capture, ValueSpec}/{capture, ValueSpec, Type}:
Specifies which captured substrings are returned and in what
format. By default, run/3 captures all of the matching part
of the subject and all capturing subpatterns (all of the
pattern is automatically captured). The default return type
is (zero-based) indexes of the captured parts of the string,
specified as {Offset,Length} pairs (the index Type of
capturing).
768
769 As an example of the default behavior, the following call
770 returns, as first and only captured string, the matching
771 part of the subject ("abcd" in the middle) as an index pair
772 {3,4}, where character positions are zero-based, just as in
773 offsets:
774
775 re:run("ABCabcdABC","abcd",[]).
776
777 The return value of this call is:
778
779 {match,[{3,4}]}
780
781 Another (and quite common) case is where the regular expres‐
782 sion matches all of the subject:
783
784 re:run("ABCabcdABC",".*abcd.*",[]).
785
786 Here the return value correspondingly points out all of the
787 string, beginning at index 0, and it is 10 characters long:
788
789 {match,[{0,10}]}
790
791 If the regular expression contains capturing subpatterns,
792 like in:
793
794 re:run("ABCabcdABC",".*(abcd).*",[]).
795
796 all of the matched subject is captured, as well as the cap‐
797 tured substrings:
798
799 {match,[{0,10},{3,4}]}
800
801 The complete matching pattern always gives the first return
802 value in the list and the remaining subpatterns are added in
803 the order they occurred in the regular expression.
804
805 The capture tuple is built up as follows:
806
807 ValueSpec:
808 Specifies which captured (sub)patterns are to be returned.
809 ValueSpec can either be an atom describing a predefined
810 set of return values, or a list containing the indexes or
811 the names of specific subpatterns to return.
812
813 The following are the predefined sets of subpatterns:
814
815 all:
816 All captured subpatterns including the complete matching
817 string. This is the default.
818
819 all_names:
820 All named subpatterns in the regular expression, as if a
821 list() of all the names in alphabetical order was speci‐
822 fied. The list of all names can also be retrieved with
823 inspect/2.
824
825 first:
826 Only the first captured subpattern, which is always the
827 complete matching part of the subject. All explicitly
828 captured subpatterns are discarded.
829
830 all_but_first:
831 All but the first matching subpattern, that is, all
832 explicitly captured subpatterns, but not the complete
833 matching part of the subject string. This is useful if
834 the regular expression as a whole matches a large part
835 of the subject, but the part you are interested in is in
836 an explicitly captured subpattern. If the return type is
837 list or binary, not returning subpatterns you are not
838 interested in is a good way to optimize.
839
840 none:
841 Returns no matching subpatterns, gives the single atom
842 match as the return value of the function when matching
843 successfully instead of the {match, list()} return.
844 Specifying an empty list gives the same behavior.
845
846 The value list is a list of indexes for the subpatterns to
847 return, where index 0 is for all of the pattern, and 1 is
848 for the first explicit capturing subpattern in the regular
849 expression, and so on. When using named captured subpat‐
850 terns (see below) in the regular expression, one can use
851 atom()s or string()s to specify the subpatterns to be
852 returned. For example, consider the regular expression:
853
854 ".*(abcd).*"
855
856 matched against string "ABCabcdABC", capturing only the
857 "abcd" part (the first explicit subpattern):
858
859 re:run("ABCabcdABC",".*(abcd).*",[{capture,[1]}]).
860
861 The call gives the following result, as the first explic‐
862 itly captured subpattern is "(abcd)", matching "abcd" in
863 the subject, at (zero-based) position 3, of length 4:
864
865 {match,[{3,4}]}
866
867 Consider the same regular expression, but with the subpat‐
868 tern explicitly named 'FOO':
869
870 ".*(?<FOO>abcd).*"
871
872 With this expression, we could still give the index of the
873 subpattern with the following call:
874
875 re:run("ABCabcdABC",".*(?<FOO>abcd).*",[{capture,[1]}]).
876
877 giving the same result as before. But, as the subpattern
878 is named, we can also specify its name in the value list:
879
880 re:run("ABCabcdABC",".*(?<FOO>abcd).*",[{capture,['FOO']}]).
881
882 This would give the same result as the earlier examples,
883 namely:
884
885 {match,[{3,4}]}
886
887 The values list can specify indexes or names not present
888 in the regular expression, in which case the return values
889 vary depending on the type. If the type is index, the
890 tuple {-1,0} is returned for values with no corresponding
891 subpattern in the regular expression, but for the other
892 types (binary and list), the values are the empty binary
893 or list, respectively.
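
The handling of values with no corresponding subpattern can be sketched as follows. This is a minimal illustration, not part of the original reference; the module name re_missing_capture_demo is hypothetical, and index 2 is deliberately chosen because the expression has only one explicit subpattern.

```erlang
-module(re_missing_capture_demo).
-export([demo/0]).

demo() ->
    Subject = "ABCabcdABC",
    RE = ".*(abcd).*",
    %% Index 2 refers to a subpattern that the expression does not contain.
    {match, [{3,4}, {-1,0}]}    = re:run(Subject, RE, [{capture, [1,2], index}]),
    {match, ["abcd", []]}       = re:run(Subject, RE, [{capture, [1,2], list}]),
    {match, [<<"abcd">>, <<>>]} = re:run(Subject, RE, [{capture, [1,2], binary}]),
    ok.
```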
894
895 Type:
896 Optionally specifies how captured substrings are to be
897 returned. If omitted, the default of index is used.
898
899 Type can be one of the following:
900
901 index:
902 Returns captured substrings as pairs of byte indexes
903 into the subject string and length of the matching
904 string in the subject (as if the subject string was
905 flattened with erlang:iolist_to_binary/1 or uni‐
906 code:characters_to_binary/2 before matching). Notice
907 that option unicode results in byte-oriented indexes in
908 a (possibly virtual) UTF-8 encoded binary. A byte index
909 tuple {0,2} can therefore represent one or two charac‐
910 ters when unicode is in effect. This can seem counter-
911 intuitive, but has been deemed the most effective and
912 useful way to do it. Returning lists instead can result
913 in simpler code if that is desired. This return type is
914 the default.
915
916 list:
917 Returns matching substrings as lists of characters
918 (Erlang string()s). If option unicode is used in
919 combination with the \C sequence in the regular expression, a
920 captured subpattern can contain bytes that are not valid
921 UTF-8 (\C matches bytes regardless of character encod‐
922 ing). In that case the list capturing can result in the
923 same types of tuples that unicode:characters_to_list/2
924 can return, namely three-tuples with tag incomplete or
925 error, the successfully converted characters and the
926 invalid UTF-8 tail of the conversion as a binary. The
927 best strategy is to avoid using the \C sequence when
928 capturing lists.
929
930 binary:
931 Returns matching substrings as binaries. If option uni‐
932 code is used, these binaries are in UTF-8. If the \C
933 sequence is used together with unicode, the binaries can
934 be invalid UTF-8.
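
The difference between the three return types under option unicode can be sketched as follows. This is an illustration under stated assumptions: the module name re_type_demo is hypothetical, and the subject is written as code points (U+00E5, U+00E4, U+00F6) so that the example does not depend on source file encoding. Each of these characters occupies two bytes in UTF-8.

```erlang
-module(re_type_demo).
-export([demo/0]).

demo() ->
    Subject = [16#E5, 16#E4, 16#F6],
    RE = [16#E4],
    %% index (the default): byte offset 2 and byte length 2 in the UTF-8 form.
    {match, [{2,2}]}         = re:run(Subject, RE, [unicode]),
    %% list: one Unicode code point.
    {match, [[16#E4]]}       = re:run(Subject, RE, [unicode, {capture, all, list}]),
    %% binary: the UTF-8 encoding of that code point.
    {match, [<<195,164>>]}   = re:run(Subject, RE, [unicode, {capture, all, binary}]),
    ok.
```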
935
936 In general, subpatterns that were not assigned a value in
937 the match are returned as the tuple {-1,0} when type is
938 index. Unassigned subpatterns are returned as the empty
939 binary or list, respectively, for other return types. Con‐
940 sider the following regular expression:
941
942 ".*((?<FOO>abdd)|a(..d)).*"
943
944 There are three explicitly capturing subpatterns, where the
945 opening parenthesis position determines the order in the
946 result, hence ((?<FOO>abdd)|a(..d)) is subpattern index 1,
947 (?<FOO>abdd) is subpattern index 2, and (..d) is subpattern
948 index 3. When matched against the following string:
949
950 "ABCabcdABC"
951
952 the subpattern at index 2 does not match, as "abdd" is not
953 present in the string, but the complete pattern matches
954 (because of the alternative a(..d)). The subpattern at index
955 2 is therefore unassigned and the default return value is:
956
957 {match,[{0,10},{3,4},{-1,0},{4,3}]}
958
959 Setting the capture Type to binary gives:
960
961 {match,[<<"ABCabcdABC">>,<<"abcd">>,<<>>,<<"bcd">>]}
962
963 Here the empty binary (<<>>) represents the unassigned sub‐
964 pattern. In the binary case, some information about the
965 matching is therefore lost, as <<>> can also be an empty
966 string captured.
967
968 If differentiation between empty matches and non-existing
969 subpatterns is necessary, use the type index and do the con‐
970 version to the final type in Erlang code.
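
One possible way to do that conversion is sketched below. The module name re_index_convert_demo, the function captures/2, and the choice of the atom undefined for unassigned subpatterns are all assumptions made for this illustration. The sketch also assumes a subject matched without option unicode, so that erlang:iolist_to_binary/1 yields the binary that the byte indexes refer to.

```erlang
-module(re_index_convert_demo).
-export([captures/2]).

%% Runs RE against Subject and returns each capture as a binary,
%% or the atom undefined for a subpattern that took no part in the match.
captures(Subject, RE) ->
    Bin = iolist_to_binary(Subject),
    case re:run(Bin, RE, [{capture, all, index}]) of
        {match, Pairs} -> {match, [to_part(Bin, P) || P <- Pairs]};
        nomatch        -> nomatch
    end.

to_part(_Bin, {-1, 0})     -> undefined;
to_part(Bin, {Start, Len}) -> binary:part(Bin, Start, Len).
```

With the expression and subject used above, captures("ABCabcdABC", ".*((?<FOO>abdd)|a(..d)).*") keeps the unassigned subpattern distinguishable from an empty capture.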
971
972 When option global is specified, the capture specification
973 affects each match separately, so that:
974
975 re:run("cacb","c(a|b)",[global,{capture,[1],list}]).
976
977 gives
978
979 {match,[["a"],["b"]]}
980
981 For descriptions of options that affect only the compilation
982 step, see compile/2.
983
984 split(Subject, RE) -> SplitList
985
986 Types:
987
988 Subject = iodata() | unicode:charlist()
989 RE = mp() | iodata()
990 SplitList = [iodata() | unicode:charlist()]
991
992 Same as split(Subject, RE, []).
993
994 split(Subject, RE, Options) -> SplitList
995
996 Types:
997
998 Subject = iodata() | unicode:charlist()
999 RE = mp() | iodata() | unicode:charlist()
1000 Options = [Option]
1001 Option =
1002 anchored | notbol | noteol | notempty | notempty_atstart
1003 |
1004 {offset, integer() >= 0} |
1005 {newline, nl_spec()} |
1006 {match_limit, integer() >= 0} |
1007 {match_limit_recursion, integer() >= 0} |
1008 bsr_anycrlf | bsr_unicode |
1009 {return, ReturnType} |
1010 {parts, NumParts} |
1011 group | trim | CompileOpt
1012 NumParts = integer() >= 0 | infinity
1013 ReturnType = iodata | list | binary
1014 CompileOpt = compile_option()
1015 See compile/2.
1016 SplitList = [RetData] | [GroupedRetData]
1017 GroupedRetData = [RetData]
1018 RetData = iodata() | unicode:charlist() | binary() | list()
1019
1020 Splits the input into parts by finding tokens according to the
1021 regular expression supplied. The splitting is basically done by
1022 running a global regular expression match and dividing the ini‐
1023 tial string wherever a match occurs. The matching part of the
1024 string is removed from the output.
1025
1026 As in run/3, an mp() compiled with option unicode requires Sub‐
1027 ject to be a Unicode charlist(). If compilation is done implic‐
1028 itly and the unicode compilation option is specified to this
1029 function, both the regular expression and Subject are to be
1030 specified as valid Unicode charlist()s.
1031
1032 The result is given as a list of "strings", in the data type
1033 specified by option return (default iodata).
1034
1035 If subexpressions are specified in the regular expression, the
1036 matching subexpressions are returned in the resulting list as
1037 well. For example:
1038
1039 re:split("Erlang","[ln]",[{return,list}]).
1040
1041 gives
1042
1043 ["Er","a","g"]
1044
1045 while
1046
1047 re:split("Erlang","([ln])",[{return,list}]).
1048
1049 gives
1050
1051 ["Er","l","a","n","g"]
1052
1053 The text matching the subexpression (marked by the parentheses
1054 in the regular expression) is inserted in the result list where
1055 it was found. This means that concatenating the result of a
1056 split where the whole regular expression is a single subexpres‐
1057 sion (as in the last example) always results in the original
1058 string.
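
The round trip described above can be checked with a short sketch. The module name re_split_roundtrip_demo is hypothetical.

```erlang
-module(re_split_roundtrip_demo).
-export([demo/0]).

demo() ->
    Subject = "Erlang",
    %% The whole regular expression is a single subexpression.
    Parts = re:split(Subject, "([ln])", [{return, list}]),
    ["Er","l","a","n","g"] = Parts,
    %% Concatenating the parts gives back the original string.
    Subject = lists:append(Parts),
    ok.
```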
1059
1060 As there is no matching subexpression for the last part in the
1061 example (the "g"), nothing is inserted after that. To make the
1062 group of strings and the parts matching the subexpressions more
1063 obvious, one can use option group, which groups together the
1064 part of the subject string with the parts matching the subex‐
1065 pressions when the string was split:
1066
1067 re:split("Erlang","([ln])",[{return,list},group]).
1068
1069 gives
1070
1071 [["Er","l"],["a","n"],["g"]]
1072
1073 Here the regular expression first matched the "l", causing "Er"
1074 to be the first part in the result. When the regular expression
1075 matched, the (only) subexpression was bound to the "l", so the
1076 "l" is inserted in the group together with "Er". The next match
1077 is of the "n", making "a" the next part to be returned. As the
1078 subexpression is bound to substring "n" in this case, the "n" is
1079 inserted into this group. The last group consists of the remain‐
1080 ing string, as no more matches are found.
1081
1082 By default, all parts of the string, including the empty
1083 strings, are returned from the function, for example:
1084
1085 re:split("Erlang","[lg]",[{return,list}]).
1086
1087 gives
1088
1089 ["Er","an",[]]
1090
1091 as the matching of the "g" in the end of the string leaves an
1092 empty rest, which is also returned. This behavior differs from
1093 the default behavior of the split function in Perl, where empty
1094 strings at the end are by default removed. To get the "trimming"
1095 default behavior of Perl, specify trim as an option:
1096
1097 re:split("Erlang","[lg]",[{return,list},trim]).
1098
1099 gives
1100
1101 ["Er","an"]
1102
1103 The "trim" option says: "give me as many parts as possible
1104 except the empty ones", which sometimes can be useful. You can
1105 also specify how many parts you want, by specifying {parts,N}:
1106
1107 re:split("Erlang","[lg]",[{return,list},{parts,2}]).
1108
1109 gives
1110
1111 ["Er","ang"]
1112
1113 Notice that the last part is "ang", not "an", as splitting was
1114 specified into two parts, and the splitting stops when enough
1115 parts are given, which is why the result differs from that of
1116 trim.
1117
1118 More than three parts are not possible with this input data, so
1119
1120 re:split("Erlang","[lg]",[{return,list},{parts,4}]).
1121
1122 gives the same result as the default, which is to be viewed as
1123 "an infinite number of parts".
1124
1125 Specifying 0 as the number of parts gives the same effect as
1126 option trim. If subexpressions are captured, empty subexpres‐
1127 sions matched at the end are also stripped from the result if
1128 trim or {parts,0} is specified.
1129
1130 The trim behavior corresponds exactly to the Perl default.
1131 {parts,N}, where N is a positive integer, corresponds exactly to
1132 the Perl behavior with a positive numerical third parameter. The
1133 default behavior of split/3 corresponds to the Perl behavior
1134 when a negative integer is specified as the third parameter for
1135 the Perl routine.
1136
1137 Summary of options not previously described for function run/3:
1138
1139 {return,ReturnType}:
1140 Specifies how the parts of the original string are presented
1141 in the result list. Valid types:
1142
1143 iodata:
1144 The variant of iodata() that gives the least copying of
1145 data with the current implementation (often a binary, but
1146 do not depend on it).
1147
1148 binary:
1149 All parts returned as binaries.
1150
1151 list:
1152 All parts returned as lists of characters ("strings").
1153
1154 group:
1155 Groups together the part of the string with the parts of the
1156 string matching the subexpressions of the regular expres‐
1157 sion.
1158
1159 The return value from the function is in this case a list()
1160 of list()s. Each sublist begins with the string picked out
1161 of the subject string, followed by the parts matching each
1162 of the subexpressions in order of occurrence in the regular
1163 expression.
1164
1165 {parts,N}:
1166 Specifies the number of parts the subject string is to be
1167 split into.
1168
1169 The number of parts is to be a positive integer for a spe‐
1170 cific maximum number of parts, and infinity for the maximum
1171 number of parts possible (the default). Specifying {parts,0}
1172 gives as many parts as possible disregarding empty parts at
1173 the end, the same as specifying trim.
1174
1175 trim:
1176 Specifies that empty parts at the end of the result list are
1177 to be disregarded. The same as specifying {parts,0}. This
1178 corresponds to the default behavior of the split built-in
1179 function in Perl.
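
The options summarized above can be exercised together in a short sketch. The module name re_split_options_demo is hypothetical; the subject and expressions are the ones used in the earlier examples.

```erlang
-module(re_split_options_demo).
-export([demo/0]).

demo() ->
    %% {return,binary}: every part is a binary.
    [<<"Er">>, <<"a">>, <<"g">>] =
        re:split("Erlang", "[ln]", [{return, binary}]),
    %% {parts,0} drops the trailing empty part, just like trim.
    ["Er","an"] = re:split("Erlang", "[lg]", [{return, list}, {parts, 0}]),
    ["Er","an"] = re:split("Erlang", "[lg]", [{return, list}, trim]),
    %% group: each sublist holds a part followed by its subexpression matches.
    [["Er","l"],["a","n"],["g"]] =
        re:split("Erlang", "([ln])", [{return, list}, group]),
    ok.
```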
1180
1182 The following sections contain reference material for the regular
1183 expressions used by this module. The information is based on the PCRE
1184 documentation, with changes where this module behaves differently to
1185 the PCRE library.
1186
1188 The syntax and semantics of the regular expressions supported by PCRE
1189 are described in detail in the following sections. Perl's regular
1190 expressions are described in its own documentation, and regular expres‐
1191 sions in general are covered in many books, some with copious examples.
1192 Jeffrey Friedl's "Mastering Regular Expressions", published by
1193 O'Reilly, covers regular expressions in great detail. This description
1194 of the PCRE regular expressions is intended as reference material.
1195
1196 The reference material is divided into the following sections:
1197
1198 * Special Start-of-Pattern Items
1199
1200 * Characters and Metacharacters
1201
1202 * Backslash
1203
1204 * Circumflex and Dollar
1205
1206 * Full Stop (Period, Dot) and \N
1207
1208 * Matching a Single Data Unit
1209
1210 * Square Brackets and Character Classes
1211
1212 * Posix Character Classes
1213
1214 * Vertical Bar
1215
1216 * Internal Option Setting
1217
1218 * Subpatterns
1219
1220 * Duplicate Subpattern Numbers
1221
1222 * Named Subpatterns
1223
1224 * Repetition
1225
1226 * Atomic Grouping and Possessive Quantifiers
1227
1228 * Back References
1229
1230 * Assertions
1231
1232 * Conditional Subpatterns
1233
1234 * Comments
1235
1236 * Recursive Patterns
1237
1238 * Subpatterns as Subroutines
1239
1240 * Oniguruma Subroutine Syntax
1241
1242 * Backtracking Control
1243
1245 Some options that can be passed to compile/2 can also be set by special
1246 items at the start of a pattern. These are not Perl-compatible, but are
1247 provided to make these options accessible to pattern writers who are
1248 not able to change the program that processes the pattern. Any number
1249 of these items can appear, but they must all be together right at the
1250 start of the pattern string, and the letters must be in upper case.
1251
1252 UTF Support
1253
1254 Unicode support is basically UTF-8 based. To use Unicode characters,
1255 you either call compile/2 or run/3 with option unicode, or the pattern
1256 must start with one of these special sequences:
1257
1258 (*UTF8)
1259 (*UTF)
1260
1261 Both sequences have the same effect: the input string is interpreted as
1262 UTF-8. Notice that with these instructions, the automatic conversion of
1263 lists to UTF-8 is not performed by the re functions. Therefore, using
1264 these sequences is not recommended. Add option unicode when running
1265 compile/2 instead.
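
The recommended approach, passing option unicode to compile/2, can be sketched as follows. The module name re_unicode_compile_demo is hypothetical, and U+0400 is used only as an example of a character that occupies two bytes in UTF-8.

```erlang
-module(re_unicode_compile_demo).
-export([demo/0]).

demo() ->
    %% Compile once with option unicode; the subject can then be a
    %% Unicode charlist() containing code points above 255.
    {ok, MP} = re:compile("\\x{400}+", [unicode]),
    Subject = [$a, 16#400, 16#400, $b],
    {match, [[16#400, 16#400]]} = re:run(Subject, MP, [{capture, all, list}]),
    %% Indexes are byte-oriented: offset 1, length 4 in the UTF-8 form.
    {match, [{1,4}]} = re:run(Subject, MP, [{capture, all, index}]),
    ok.
```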
1266
1267 Some applications that allow their users to supply patterns can wish to
1268 restrict them to non-UTF data for security reasons. If option never_utf
1269 is set at compile time, (*UTF), and so on, are not allowed, and their
1270 appearance causes an error.
1271
1272 Unicode Property Support
1273
1274 The following is another special sequence that can appear at the start
1275 of a pattern:
1276
1277 (*UCP)
1278
1279 This has the same effect as setting option ucp: it causes sequences
1280 such as \d and \w to use Unicode properties to determine character
1281 types, instead of recognizing only characters with codes < 256 through
1282 a lookup table.
1283
1284 Disabling Startup Optimizations
1285
1286 If a pattern starts with (*NO_START_OPT), it has the same effect as
1287 setting option no_start_optimize at compile time.
1288
1289 Newline Conventions
1290
1291 PCRE supports five conventions for indicating line breaks in strings: a
1292 single CR (carriage return) character, a single LF (line feed) charac‐
1293 ter, the two-character sequence CRLF, any of the three preceding, and
1294 any Unicode newline sequence.
1295
1296 A newline convention can also be specified by starting a pattern string
1297 with one of the following five sequences:
1298
1299 (*CR):
1300 Carriage return
1301
1302 (*LF):
1303 Line feed
1304
1305 (*CRLF):
1306 Carriage return followed by line feed
1307
1308 (*ANYCRLF):
1309 Any of the three above
1310
1311 (*ANY):
1312 All Unicode newline sequences
1313
1314 These override the default and the options specified to compile/2. For
1315 example, the following pattern changes the convention to CR:
1316
1317 (*CR)a.b
1318
1319 This pattern matches a\nb, as LF is no longer a newline. If more than
1320 one of these sequences is present, the last one is used.
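
The effect of the newline convention on the dot metacharacter can be sketched as follows. The module name re_newline_demo is hypothetical; the sketch assumes the default convention is LF, as stated for option newline in compile/2.

```erlang
-module(re_newline_demo).
-export([demo/0]).

demo() ->
    %% With the default convention (LF), dot does not match "\n".
    nomatch = re:run("a\nb", "a.b"),
    %% With CR as the newline, LF is an ordinary character for dot.
    {match, [{0,3}]} = re:run("a\nb", "(*CR)a.b"),
    %% The same convention given as an option instead.
    {match, [{0,3}]} = re:run("a\nb", "a.b", [{newline, cr}]),
    ok.
```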
1321
1322 The newline convention affects where the circumflex and dollar asser‐
1323 tions are true. It also affects the interpretation of the dot metachar‐
1324 acter when dotall is not set, and the behavior of \N. However, it does
1325 not affect what the \R escape sequence matches. By default, this is any
1326 Unicode newline sequence, for Perl compatibility. However, this can be
1327 changed; see the description of \R in section Newline Sequences. A
1328 change of the \R setting can be combined with a change of the newline
1329 convention.
1330
1331 Setting Match and Recursion Limits
1332
1333 The caller of run/3 can set a limit on the number of times the internal
1334 match() function is called and on the maximum depth of recursive calls.
1335 These facilities are provided to catch runaway matches that are pro‐
1336 voked by patterns with huge matching trees (a typical example is a pat‐
1337 tern with nested unlimited repeats) and to avoid running out of system
1338 stack by too much recursion. When one of these limits is reached,
1339 pcre_exec() gives an error return. The limits can also be set by items
1340 at the start of the pattern of the following forms:
1341
1342 (*LIMIT_MATCH=d)
1343 (*LIMIT_RECURSION=d)
1344
1345 Here d is any number of decimal digits. However, the value of the set‐
1346 ting must be less than the value set by the caller of run/3 for it to
1347 have any effect. That is, the pattern writer can lower the limit set by
1348 the programmer, but not raise it. If there is more than one setting of
1349 one of these limits, the lower value is used.
1350
1351 The default value for both the limits is 10,000,000 in the Erlang VM.
1352 Notice that the recursion limit does not affect the stack depth of the
1353 VM, as PCRE for Erlang is compiled in such a way that the match func‐
1354 tion never does recursion on the C stack.
1355
1356 Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of
1357 the limits set by the caller, not increase them.
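
A caller-side limit can be sketched as follows. The module name re_limit_demo, the function bounded_match/2, the limit value 1000, and the {aborted, Reason} return shape are assumptions made for this illustration; the options match_limit and report_errors are those described for run/3.

```erlang
-module(re_limit_demo).
-export([bounded_match/2]).

%% Returns true or false for an ordinary outcome, or {aborted, Reason}
%% when matching is abandoned because a limit was reached.
bounded_match(Subject, RE) ->
    Options = [{match_limit, 1000}, report_errors, {capture, none}],
    case re:run(Subject, RE, Options) of
        match           -> true;
        nomatch         -> false;
        {error, Reason} -> {aborted, Reason}
    end.
```

Without report_errors, reaching a limit is reported as nomatch, so the third clause is what lets a caller tell a runaway match from a genuine failure. A pattern with nested unlimited repeats applied to a long subject is the kind of input for which the {aborted, Reason} branch is intended.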
1358
1360 A regular expression is a pattern that is matched against a subject
1361 string from left to right. Most characters stand for themselves in a
1362 pattern and match the corresponding characters in the subject. As a
1363 trivial example, the following pattern matches a portion of a subject
1364 string that is identical to itself:
1365
1366 The quick brown fox
1367
1368 When caseless matching is specified (option caseless), letters are
1369 matched independently of case.
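
As a brief sketch of the above (the module name re_caseless_demo is hypothetical):

```erlang
-module(re_caseless_demo).
-export([demo/0]).

demo() ->
    Subject = "THE QUICK BROWN FOX",
    RE = "The quick brown fox",
    %% By default, letters must match exactly.
    nomatch = re:run(Subject, RE),
    %% With caseless, all 19 characters match.
    {match, [{0,19}]} = re:run(Subject, RE, [caseless]),
    ok.
```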
1370
1371 The power of regular expressions comes from the ability to include
1372 alternatives and repetitions in the pattern. These are encoded in the
1373 pattern by the use of metacharacters, which do not stand for themselves
1374 but instead are interpreted in some special way.
1375
1376 Two sets of metacharacters exist: those that are recognized anywhere in
1377 the pattern except within square brackets, and those that are recog‐
1378 nized within square brackets. Outside square brackets, the metacharac‐
1379 ters are as follows:
1380
1381 \:
1382 General escape character with many uses
1383
1384 ^:
1385 Assert start of string (or line, in multiline mode)
1386
1387 $:
1388 Assert end of string (or line, in multiline mode)
1389
1390 .:
1391 Match any character except newline (by default)
1392
1393 [:
1394 Start character class definition
1395
1396 |:
1397 Start of alternative branch
1398
1399 (:
1400 Start subpattern
1401
1402 ):
1403 End subpattern
1404
1405 ?:
1406 Extends the meaning of (, also 0 or 1 quantifier, also quantifier
1407 minimizer
1408
1409 *:
1410 0 or more quantifier
1411
1412 +:
1413 1 or more quantifier, also "possessive quantifier"
1414
1415 {:
1416 Start min/max quantifier
1417
1418 Part of a pattern within square brackets is called a "character class".
1419 The following are the only metacharacters in a character class:
1420
1421 \:
1422 General escape character
1423
1424 ^:
1425 Negate the class, but only if the first character
1426
1427 -:
1428 Indicates character range
1429
1430 [:
1431 Posix character class (only if followed by Posix syntax)
1432
1433 ]:
1434 Terminates the character class
1435
1436 The following sections describe the use of each metacharacter.
1437
1439 The backslash character has many uses. First, if it is followed by a
1440 character that is not a number or a letter, it takes away any special
1441 meaning that a character can have. This use of backslash as an escape
1442 character applies both inside and outside character classes.
1443
1444 For example, if you want to match a * character, you write \* in the
1445 pattern. This escaping action applies if the following character would
1446 otherwise be interpreted as a metacharacter, so it is always safe to
1447 precede a non-alphanumeric with backslash to specify that it stands for
1448 itself. In particular, if you want to match a backslash, write \\.
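
In Erlang source code each backslash of the pattern must itself be escaped in the string literal, which the following sketch illustrates. The module name re_escape_demo is hypothetical.

```erlang
-module(re_escape_demo).
-export([demo/0]).

demo() ->
    %% The pattern \* is written "\\*" in an Erlang string literal.
    {match, [{1,1}]} = re:run("2*3", "\\*"),
    %% The pattern \\ (a literal backslash) is written "\\\\".
    %% The subject "a\\b" is the three characters a, \ and b.
    {match, [{1,1}]} = re:run("a\\b", "\\\\"),
    ok.
```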
1449
1450 In unicode mode, only ASCII numbers and letters have any special mean‐
1451 ing after a backslash. All other characters (in particular, those whose
1452 code points are > 127) are treated as literals.
1453
1454 If a pattern is compiled with option extended, whitespace in the pat‐
1455 tern (other than in a character class) and characters between a # out‐
1456 side a character class and the next newline are ignored. An escaping
1457 backslash can be used to include a whitespace or # character as part of
1458 the pattern.
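
A short sketch of option extended follows. The module name re_extended_demo is hypothetical.

```erlang
-module(re_extended_demo).
-export([demo/0]).

demo() ->
    %% Whitespace and the trailing comment are ignored: the pattern is abc.
    {match, [{0,3}]} = re:run("abc", "a b c   # a comment", [extended]),
    %% An escaped space is part of the pattern.
    {match, [{0,3}]} = re:run("a c", "a\\ c", [extended]),
    %% An escaped # is a literal character, not the start of a comment.
    {match, [{0,2}]} = re:run("a#", "a\\#", [extended]),
    ok.
```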
1459
1460 To remove the special meaning from a sequence of characters, put them
1461 between \Q and \E. This is different from Perl in that $ and @ are han‐
1462 dled as literals in \Q...\E sequences in PCRE, while $ and @ cause
1463 variable interpolation in Perl. Notice the following examples:
1464
1465 Pattern PCRE matches Perl matches
1466
1467 \Qabc$xyz\E abc$xyz abc followed by the contents of $xyz
1468 \Qabc\$xyz\E abc\$xyz abc\$xyz
1469 \Qabc\E\$\Qxyz\E abc$xyz abc$xyz
1470
1471 The \Q...\E sequence is recognized both inside and outside character
1472 classes. An isolated \E that is not preceded by \Q is ignored. If \Q is
1473 not followed by \E later in the pattern, the literal interpretation
1474 continues to the end of the pattern (that is, \E is assumed at the
1475 end). If the isolated \Q is inside a character class, this causes an
1476 error, as the character class is not terminated.
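
The first two rows of the table can be reproduced with a short sketch. The module name re_quote_demo is hypothetical.

```erlang
-module(re_quote_demo).
-export([demo/0]).

demo() ->
    %% Unquoted, $ is an assertion, so the pattern cannot match.
    nomatch = re:run("abc$xyz", "abc$xyz"),
    %% Between \Q and \E the $ is a literal character.
    {match, [{0,7}]} = re:run("abc$xyz", "\\Qabc$xyz\\E"),
    %% A backslash between \Q and \E is literal as well.
    {match, [{0,8}]} = re:run("abc\\$xyz", "\\Qabc\\$xyz\\E"),
    ok.
```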
1477
1478 Non-Printing Characters
1479
1480 A second use of backslash provides a way of encoding non-printing char‐
1481 acters in patterns in a visible manner. There is no restriction on the
1482 appearance of non-printing characters, apart from the binary zero that
1483 terminates a pattern. When a pattern is prepared by text editing, it is
1484 often easier to use one of the following escape sequences than the
1485 binary character it represents:
1486
1487 \a:
1488 Alarm, that is, the BEL character (hex 07)
1489
1490 \cx:
1491 "Control-x", where x is any ASCII character
1492
1493 \e:
1494 Escape (hex 1B)
1495
1496 \f:
1497 Form feed (hex 0C)
1498
1499 \n:
1500 Line feed (hex 0A)
1501
1502 \r:
1503 Carriage return (hex 0D)
1504
1505 \t:
1506 Tab (hex 09)
1507
1508 \0dd:
1509 Character with octal code 0dd
1510
1511 \ddd:
1512 Character with octal code ddd, or back reference
1513
1514 \o{ddd..}:
1515 Character with octal code ddd..
1516
1517 \xhh:
1518 Character with hex code hh
1519
1520 \x{hhh..}:
1521 Character with hex code hhh..
1522
1523 Note:
1524 Note that \0dd is always an octal code, and that \8 and \9 are the lit‐
1525 eral characters "8" and "9".
1526
1527
1528 The precise effect of \cx on ASCII characters is as follows: if x is a
1529 lowercase letter, it is converted to upper case. Then bit 6 of the
1530 character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
1531 (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
1532 hex 7B (; is 3B). If the data item (byte or 16-bit value) following \c
1533 has a value > 127, a compile-time error occurs. This locks out non-
1534 ASCII characters in all modes.
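
The conversions described above can be sketched as follows. The module name re_control_demo is hypothetical.

```erlang
-module(re_control_demo).
-export([demo/0]).

demo() ->
    %% \cA is hex 01; a lowercase letter is converted to upper case first.
    {match, [{0,1}]} = re:run([16#01], "\\cA"),
    {match, [{0,1}]} = re:run([16#01], "\\ca"),
    %% \cZ is hex 1A.
    {match, [{0,1}]} = re:run([16#1A], "\\cZ"),
    ok.
```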
1535
1536 The \c facility was designed for use with ASCII characters, but with
1537 the extension to Unicode it is even less useful than it once was.
1538
1539 After \0 up to two further octal digits are read. If there are fewer
1540 than two digits, just those that are present are used. Thus the
1541 sequence \0\x\015 specifies two binary zeros followed by a CR character
1542 (code value 13). Make sure you supply two digits after the initial zero
1543 if the pattern character that follows is itself an octal digit.
1544
1545 The escape \o must be followed by a sequence of octal digits, enclosed
1546 in braces. An error occurs if this is not the case. This escape is a
1547 recent addition to Perl; it provides a way of specifying character code
1548 points as octal numbers greater than 0777, and it also allows octal
1549 numbers and back references to be unambiguously specified.
1550
1551 For greater clarity and unambiguity, it is best to avoid following \ by
1552 a digit greater than zero. Instead, use \o{} or \x{} to specify charac‐
1553 ter numbers, and \g{} to specify back references. The following para‐
1554 graphs describe the old, ambiguous syntax.
1555
1556 The handling of a backslash followed by a digit other than 0 is compli‐
1557 cated, and Perl has changed in recent releases, causing PCRE also to
1558 change. Outside a character class, PCRE reads the digit and any follow‐
1559 ing digits as a decimal number. If the number is < 8, or if there have
1560 been at least that many previous capturing left parentheses in the
1561 expression, the entire sequence is taken as a back reference. A
1562 description of how this works is provided later, following the discus‐
1563 sion of parenthesized subpatterns.
1564
1565 Inside a character class, or if the decimal number following \ is > 7
1566 and there have not been that many capturing subpatterns, PCRE handles
1567 \8 and \9 as the literal characters "8" and "9", and otherwise re-reads
1568 up to three octal digits following the backslash, and uses them to
1569 generate a data character. Any subsequent digits stand for themselves.
1570 For example:
1571
1572 \040:
1573 Another way of writing an ASCII space
1574
1575 \40:
1576 The same, provided there are < 40 previous capturing subpatterns
1577
1578 \7:
1579 Always a back reference
1580
1581 \11:
1582 Can be a back reference, or another way of writing a tab
1583
1584 \011:
1585 Always a tab
1586
1587 \0113:
1588 A tab followed by character "3"
1589
1590 \113:
1591 Can be a back reference, otherwise the character with octal code
1592 113
1593
1594 \377:
1595 Can be a back reference, otherwise value 255 (decimal)
1596
1597 \81:
1598 Either a back reference, or the two characters "8" and "1"
1599
1600 Notice that octal values >= 100 that are specified using this syntax
1601 must not be introduced by a leading zero, as no more than three octal
1602 digits are ever read.
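
Some of the octal forms listed above can be tried with a short sketch. The module name re_octal_demo is hypothetical. None of the patterns contains a capturing subpattern, so \113 cannot be a back reference here.

```erlang
-module(re_octal_demo).
-export([demo/0]).

demo() ->
    %% \040 is an ASCII space.
    {match, [{1,1}]} = re:run("a b", "\\040"),
    %% \011 is always a tab.
    {match, [{0,1}]} = re:run("\t", "\\011"),
    %% \0113 is a tab followed by the character "3".
    {match, [{0,2}]} = re:run("\t3", "\\0113"),
    %% \113 is the character with octal code 113, which is "K".
    {match, [{0,1}]} = re:run("K", "\\113"),
    %% The unambiguous \o{} form of the same character.
    {match, [{0,1}]} = re:run("K", "\\o{113}"),
    ok.
```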
1603
1604 By default, after \x that is not followed by {, from zero to two hexa‐
1605 decimal digits are read (letters can be in upper or lower case). Any
1606 number of hexadecimal digits may appear between \x{ and }. If a charac‐
1607 ter other than a hexadecimal digit appears between \x{ and }, or if
1608 there is no terminating }, an error occurs.
1609
1610 Characters whose value is less than 256 can be defined by either of the
1611 two syntaxes for \x. There is no difference in the way they are han‐
1612 dled. For example, \xdc is exactly the same as \x{dc}.
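
The equivalence of the two \x syntaxes can be checked with a short sketch. The module name re_hex_demo is hypothetical; the subject is the single Latin-1 character with code 16#DC, matched without option unicode.

```erlang
-module(re_hex_demo).
-export([demo/0]).

demo() ->
    Subject = [16#DC],
    {match, [{0,1}]} = re:run(Subject, "\\xdc"),
    {match, [{0,1}]} = re:run(Subject, "\\x{dc}"),
    %% Hexadecimal letters can be in upper or lower case.
    {match, [{0,1}]} = re:run(Subject, "\\xDC"),
    ok.
```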
1613
1614 Constraints on character values
1615
1616 Characters that are specified using octal or hexadecimal numbers are
1617 limited to certain values, as follows:
1618
1619 8-bit non-UTF mode:
1620 < 0x100
1621
1622 8-bit UTF-8 mode:
1623 < 0x10ffff and a valid codepoint
1624
1625 Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-
1626 called "surrogate" codepoints), and 0xffef.
1627
1628 Escape sequences in character classes
1629
1630 All the sequences that define a single character value can be used both
1631 inside and outside character classes. Also, inside a character class,
1632 \b is interpreted as the backspace character (hex 08).
1633
1634 \N is not allowed in a character class. \B, \R, and \X are not special
1635 inside a character class. Like other unrecognized escape sequences,
1636 they are treated as the literal characters "B", "R", and "X". Outside a
1637 character class, these sequences have different meanings.
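
The two meanings of \b can be sketched as follows. The module name re_backspace_demo is hypothetical; the word boundary meaning of \b outside a character class is the usual PCRE one.

```erlang
-module(re_backspace_demo).
-export([demo/0]).

demo() ->
    %% The subject is a, backspace (hex 08), c.
    Subject = "a\bc",
    %% Inside a character class, \b is the backspace character.
    {match, [{1,1}]} = re:run(Subject, "[\\b]"),
    %% Outside a class, \b is a zero-width word boundary assertion,
    %% which is first true at the start of the subject.
    {match, [{0,0}]} = re:run(Subject, "\\b"),
    ok.
```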
1638
1639 Unsupported Escape Sequences
1640
1641 In Perl, the sequences \l, \L, \u, and \U are recognized by its string
1642 handler and used to modify the case of following characters. PCRE does
1643 not support these escape sequences.
1644
1645 Absolute and Relative Back References
1646
1647 The sequence \g followed by an unsigned or a negative number, option‐
1648 ally enclosed in braces, is an absolute or relative back reference. A
1649 named back reference can be coded as \g{name}. Back references are dis‐
1650 cussed later, following the discussion of parenthesized subpatterns.
1651
1652 Absolute and Relative Subroutine Calls
1653
1654 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
1655 name or a number enclosed either in angle brackets or single quotes, is
1656 alternative syntax for referencing a subpattern as a "subroutine".
1657 Details are discussed later. Notice that \g{...} (Perl syntax) and
1658 \g<...> (Oniguruma syntax) are not synonymous. The former is a back
1659 reference and the latter is a subroutine call.
1660
1661 Generic Character Types
1662
1663 Another use of backslash is for specifying generic character types:
1664
1665 \d:
1666 Any decimal digit
1667
1668 \D:
1669 Any character that is not a decimal digit
1670
1671 \h:
1672 Any horizontal whitespace character
1673
1674 \H:
1675 Any character that is not a horizontal whitespace character
1676
1677 \s:
1678 Any whitespace character
1679
1680 \S:
1681 Any character that is not a whitespace character
1682
1683 \v:
1684 Any vertical whitespace character
1685
1686 \V:
1687 Any character that is not a vertical whitespace character
1688
1689 \w:
1690 Any "word" character
1691
1692 \W:
1693 Any "non-word" character
1694
1695 There is also the single sequence \N, which matches a non-newline char‐
1696 acter. This is the same as the "." metacharacter when dotall is not
1697 set. Perl also uses \N to match characters by name, but PCRE does not
1698 support this.
1699
1700 Each pair of lowercase and uppercase escape sequences partitions the
1701 complete set of characters into two disjoint sets. Any given character
1702 matches one, and only one, of each pair. The sequences can appear both
1703 inside and outside character classes. They each match one character of
1704 the appropriate type. If the current matching point is at the end of
1705 the subject string, all fail, as there is no character to match.
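
As an illustration, the sketch below applies a few of these sequences
with re:run/3. The module name char_types_demo and the subject strings
are hypothetical; the expected results in the comments assume the
default options (no unicode, no ucp, newline is LF).

```erlang
%% Hypothetical demo module for the generic character types.
-module(char_types_demo).
-export([demo/0]).

demo() ->
    Opts = [{capture, first, list}],
    [re:run("id: 42", "\\d+", Opts),                    % {match,["42"]}
     re:run("foo_1 bar", "\\w+", Opts),                 % {match,["foo_1"]}
     re:run("a\tb", "\\s", [{capture, first, index}]),  % {match,[{1,1}]}
     re:run("\nx", "\\N", Opts),                        % {match,["x"]}
     %% At the end of the subject there is no character to match,
     %% so even the negated type \D fails.
     re:run("", "\\D", Opts)].                          % nomatch
```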
1706
For compatibility with Perl, \s formerly did not match the VT character
(code 11), which made it different from the POSIX "space" class.
However, Perl added VT at release 5.18, and PCRE followed suit at
release 8.34. The default \s characters are now HT (9), LF (10), VT
(11), FF (12), CR (13), and space (32), which are defined as white
space in the "C" locale. This list may vary if locale-specific matching
is taking place. For example, in some locales the "non-breaking space"
character (\xA0) is recognized as white space, and in others the VT
character is not.
1716
1717 A "word" character is an underscore or any character that is a letter
1718 or a digit. By default, the definition of letters and digits is con‐
1719 trolled by the PCRE low-valued character tables, in Erlang's case (and
1720 without option unicode), the ISO Latin-1 character set.
1721
1722 By default, in unicode mode, characters with values > 255, that is, all
1723 characters outside the ISO Latin-1 character set, never match \d, \s,
1724 or \w, and always match \D, \S, and \W. These sequences retain their
1725 original meanings from before UTF support was available, mainly for
1726 efficiency reasons. However, if option ucp is set, the behavior is
1727 changed so that Unicode properties are used to determine character
1728 types, as follows:
1729
\d:
Any character that matches \p{Nd} (decimal digit)

\s:
Any character that matches \p{Z}, \h, or \v

\w:
Any character that matches \p{L} or \p{N}, plus underscore
1738
1739 The uppercase escapes match the inverse sets of characters. Notice that
1740 \d matches only decimal digits, while \w matches any Unicode digit, any
1741 Unicode letter, and underscore. Notice also that ucp affects \b and \B,
1742 as they are defined in terms of \w and \W. Matching these sequences is
1743 noticeably slower when ucp is set.
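
The effect of option ucp can be sketched as follows. The module name
ucp_demo is hypothetical, and the Cyrillic letter is only a convenient
example of a character with a value > 255.

```erlang
%% Hypothetical demo module showing how ucp changes \w.
-module(ucp_demo).
-export([demo/0]).

demo() ->
    Zhe = [16#0436],  % CYRILLIC SMALL LETTER ZHE, a Unicode letter
    %% Without ucp, characters > 255 never match \w.
    WithoutUcp = re:run(Zhe, "\\w", [unicode, {capture, none}]),
    %% With ucp, \w is decided by Unicode properties (\p{L}, \p{N}).
    WithUcp = re:run(Zhe, "\\w", [unicode, ucp, {capture, none}]),
    {WithoutUcp, WithUcp}.
```

Under these assumptions, demo() is expected to return {nomatch, match}.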
1744
The sequences \h, \H, \v, and \V are features that were added to Perl
in release 5.10. In contrast to the other sequences, which match only
ASCII characters by default, these always match certain high-valued
code points, regardless of whether ucp is set.
1749
1750 The following are the horizontal space characters:
1751
1752 U+0009:
1753 Horizontal tab (HT)
1754
1755 U+0020:
1756 Space
1757
1758 U+00A0:
1759 Non-break space
1760
1761 U+1680:
1762 Ogham space mark
1763
1764 U+180E:
1765 Mongolian vowel separator
1766
1767 U+2000:
1768 En quad
1769
1770 U+2001:
1771 Em quad
1772
1773 U+2002:
1774 En space
1775
1776 U+2003:
1777 Em space
1778
1779 U+2004:
1780 Three-per-em space
1781
1782 U+2005:
1783 Four-per-em space
1784
1785 U+2006:
1786 Six-per-em space
1787
1788 U+2007:
1789 Figure space
1790
1791 U+2008:
1792 Punctuation space
1793
1794 U+2009:
1795 Thin space
1796
1797 U+200A:
1798 Hair space
1799
1800 U+202F:
1801 Narrow no-break space
1802
1803 U+205F:
1804 Medium mathematical space
1805
1806 U+3000:
1807 Ideographic space
1808
1809 The following are the vertical space characters:
1810
1811 U+000A:
1812 Line feed (LF)
1813
1814 U+000B:
1815 Vertical tab (VT)
1816
1817 U+000C:
1818 Form feed (FF)
1819
1820 U+000D:
1821 Carriage return (CR)
1822
1823 U+0085:
1824 Next line (NEL)
1825
1826 U+2028:
1827 Line separator
1828
1829 U+2029:
1830 Paragraph separator
1831
1832 In 8-bit, non-UTF-8 mode, only the characters with code points < 256
1833 are relevant.
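
A short sketch of \h and \v on characters from the two lists above. The
module name hv_space_demo is hypothetical. Option ucp is deliberately
not passed, to show that the high-valued code points match without it.

```erlang
%% Hypothetical demo module for horizontal and vertical whitespace.
-module(hv_space_demo).
-export([demo/0]).

demo() ->
    None = [unicode, {capture, none}],
    [re:run([16#2003], "\\h", None),  % U+2003 em space: match
     re:run([16#2028], "\\v", None),  % U+2028 line separator: match
     re:run("\n", "\\h", None),       % LF is not horizontal: nomatch
     re:run("\n", "\\v", None)].      % LF is vertical: match
```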
1834
1835 Newline Sequences
1836
1837 Outside a character class, by default, the escape sequence \R matches
1838 any Unicode newline sequence. In non-UTF-8 mode, \R is equivalent to
1839 the following:
1840
1841 (?>\r\n|\n|\x0b|\f|\r|\x85)
1842
This is an example of an "atomic group"; details are provided below.
1844
1845 This particular group matches either the two-character sequence CR fol‐
1846 lowed by LF, or one of the single characters LF (line feed, U+000A), VT
1847 (vertical tab, U+000B), FF (form feed, U+000C), CR (carriage return,
1848 U+000D), or NEL (next line, U+0085). The two-character sequence is
1849 treated as a single unit that cannot be split.
1850
1851 In Unicode mode, two more characters whose code points are > 255 are
1852 added: LS (line separator, U+2028) and PS (paragraph separator,
1853 U+2029). Unicode character property support is not needed for these
1854 characters to be recognized.
1855
1856 \R can be restricted to match only CR, LF, or CRLF (instead of the com‐
1857 plete set of Unicode line endings) by setting option bsr_anycrlf either
1858 at compile time or when the pattern is matched. (BSR is an acronym for
1859 "backslash R".) This can be made the default when PCRE is built; if so,
1860 the other behavior can be requested through option bsr_unicode. These
1861 settings can also be specified by starting a pattern string with one of
1862 the following sequences:
1863
1864 (*BSR_ANYCRLF):
1865 CR, LF, or CRLF only
1866
1867 (*BSR_UNICODE):
1868 Any Unicode newline sequence
1869
1870 These override the default and the options specified to the compiling
1871 function, but they can themselves be overridden by options specified to
1872 a matching function. Notice that these special settings, which are not
1873 Perl-compatible, are recognized only at the very start of a pattern,
1874 and that they must be in upper case. If more than one of them is
1875 present, the last one is used. They can be combined with a change of
1876 newline convention; for example, a pattern can start with:
1877
1878 (*ANY)(*BSR_ANYCRLF)
1879
1880 They can also be combined with the (*UTF8), (*UTF), or (*UCP) special
1881 sequences. Inside a character class, \R is treated as an unrecognized
1882 escape sequence, and so matches the letter "R" by default.
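
The following sketch shows \R splitting text with mixed line endings,
and the two ways of restricting it described above. The module name
bsr_demo is hypothetical; the form feed subject is chosen because it is
a newline for \R by default but not under bsr_anycrlf.

```erlang
%% Hypothetical demo module for \R and the BSR settings.
-module(bsr_demo).
-export([demo/0]).

demo() ->
    %% CRLF is consumed as one unit, so no empty element appears.
    Lines = re:split("a\r\nb\nc", "\\R", [{return, list}]),
    Default = re:run("\f", "\\R", [{capture, none}]),
    ByOption = re:run("\f", "\\R", [bsr_anycrlf, {capture, none}]),
    ByPrefix = re:run("\f", "(*BSR_ANYCRLF)\\R", [{capture, none}]),
    {Lines, Default, ByOption, ByPrefix}.
```

Under these assumptions, demo() is expected to return
{["a","b","c"], match, nomatch, nomatch}.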
1883
1884 Unicode Character Properties
1885
1886 Three more escape sequences that match characters with specific proper‐
1887 ties are available. When in 8-bit non-UTF-8 mode, these sequences are
1888 limited to testing characters whose code points are < 256, but they do
1889 work in this mode. The following are the extra escape sequences:
1890
1891 \p{xx}:
1892 A character with property xx
1893
1894 \P{xx}:
1895 A character without property xx
1896
1897 \X:
1898 A Unicode extended grapheme cluster
1899
The property names represented by xx above are limited to the Unicode
script names, the general category properties, "Any", which matches any
character (including newline), and some special PCRE properties
(described in the next section). Other Perl properties, such as
"InMusicalSymbols", are currently not supported by PCRE. Notice that
\P{Any} does not match any characters and always causes a match
failure.
1906
1907 Sets of Unicode characters are defined as belonging to certain scripts.
1908 A character from one of these sets can be matched using a script name,
1909 for example:
1910
1911 \p{Greek} \P{Han}
1912
1913 Those that are not part of an identified script are lumped together as
1914 "Common". The following is the current list of scripts:
1915
1916 * Arabic
1917
1918 * Armenian
1919
1920 * Avestan
1921
1922 * Balinese
1923
1924 * Bamum
1925
1926 * Bassa_Vah
1927
1928 * Batak
1929
1930 * Bengali
1931
1932 * Bopomofo
1933
1934 * Braille
1935
1936 * Buginese
1937
1938 * Buhid
1939
1940 * Canadian_Aboriginal
1941
1942 * Carian
1943
1944 * Caucasian_Albanian
1945
1946 * Chakma
1947
1948 * Cham
1949
1950 * Cherokee
1951
1952 * Common
1953
1954 * Coptic
1955
1956 * Cuneiform
1957
1958 * Cypriot
1959
1960 * Cyrillic
1961
1962 * Deseret
1963
1964 * Devanagari
1965
1966 * Duployan
1967
1968 * Egyptian_Hieroglyphs
1969
1970 * Elbasan
1971
1972 * Ethiopic
1973
1974 * Georgian
1975
1976 * Glagolitic
1977
1978 * Gothic
1979
1980 * Grantha
1981
1982 * Greek
1983
1984 * Gujarati
1985
1986 * Gurmukhi
1987
1988 * Han
1989
1990 * Hangul
1991
1992 * Hanunoo
1993
1994 * Hebrew
1995
1996 * Hiragana
1997
1998 * Imperial_Aramaic
1999
2000 * Inherited
2001
2002 * Inscriptional_Pahlavi
2003
2004 * Inscriptional_Parthian
2005
2006 * Javanese
2007
2008 * Kaithi
2009
2010 * Kannada
2011
2012 * Katakana
2013
2014 * Kayah_Li
2015
2016 * Kharoshthi
2017
2018 * Khmer
2019
2020 * Khojki
2021
2022 * Khudawadi
2023
2024 * Lao
2025
2026 * Latin
2027
2028 * Lepcha
2029
2030 * Limbu
2031
2032 * Linear_A
2033
2034 * Linear_B
2035
2036 * Lisu
2037
2038 * Lycian
2039
2040 * Lydian
2041
2042 * Mahajani
2043
2044 * Malayalam
2045
2046 * Mandaic
2047
2048 * Manichaean
2049
2050 * Meetei_Mayek
2051
2052 * Mende_Kikakui
2053
2054 * Meroitic_Cursive
2055
2056 * Meroitic_Hieroglyphs
2057
2058 * Miao
2059
2060 * Modi
2061
2062 * Mongolian
2063
2064 * Mro
2065
2066 * Myanmar
2067
2068 * Nabataean
2069
2070 * New_Tai_Lue
2071
2072 * Nko
2073
2074 * Ogham
2075
2076 * Ol_Chiki
2077
2078 * Old_Italic
2079
2080 * Old_North_Arabian
2081
2082 * Old_Permic
2083
2084 * Old_Persian
2085
2086 * Oriya
2087
2088 * Old_South_Arabian
2089
2090 * Old_Turkic
2091
2092 * Osmanya
2093
2094 * Pahawh_Hmong
2095
2096 * Palmyrene
2097
2098 * Pau_Cin_Hau
2099
2100 * Phags_Pa
2101
2102 * Phoenician
2103
2104 * Psalter_Pahlavi
2105
2106 * Rejang
2107
2108 * Runic
2109
2110 * Samaritan
2111
2112 * Saurashtra
2113
2114 * Sharada
2115
2116 * Shavian
2117
2118 * Siddham
2119
2120 * Sinhala
2121
2122 * Sora_Sompeng
2123
2124 * Sundanese
2125
2126 * Syloti_Nagri
2127
2128 * Syriac
2129
2130 * Tagalog
2131
2132 * Tagbanwa
2133
2134 * Tai_Le
2135
2136 * Tai_Tham
2137
2138 * Tai_Viet
2139
2140 * Takri
2141
2142 * Tamil
2143
2144 * Telugu
2145
2146 * Thaana
2147
2148 * Thai
2149
2150 * Tibetan
2151
2152 * Tifinagh
2153
2154 * Tirhuta
2155
2156 * Ugaritic
2157
2158 * Vai
2159
2160 * Warang_Citi
2161
2162 * Yi
2163
2164 Each character has exactly one Unicode general category property, spec‐
2165 ified by a two-letter acronym. For compatibility with Perl, negation
2166 can be specified by including a circumflex between the opening brace
2167 and the property name. For example, \p{^Lu} is the same as \P{Lu}.
2168
2169 If only one letter is specified with \p or \P, it includes all the gen‐
2170 eral category properties that start with that letter. In this case, in
2171 the absence of negation, the curly brackets in the escape sequence are
2172 optional. The following two examples have the same effect:
2173
2174 \p{L}
2175 \pL
2176
2177 The following general category property codes are supported:
2178
2179 C:
2180 Other
2181
2182 Cc:
2183 Control
2184
2185 Cf:
2186 Format
2187
2188 Cn:
2189 Unassigned
2190
2191 Co:
2192 Private use
2193
2194 Cs:
2195 Surrogate
2196
2197 L:
2198 Letter
2199
2200 Ll:
2201 Lowercase letter
2202
2203 Lm:
2204 Modifier letter
2205
2206 Lo:
2207 Other letter
2208
2209 Lt:
2210 Title case letter
2211
2212 Lu:
2213 Uppercase letter
2214
2215 M:
2216 Mark
2217
2218 Mc:
2219 Spacing mark
2220
2221 Me:
2222 Enclosing mark
2223
2224 Mn:
2225 Non-spacing mark
2226
2227 N:
2228 Number
2229
2230 Nd:
2231 Decimal number
2232
2233 Nl:
2234 Letter number
2235
2236 No:
2237 Other number
2238
2239 P:
2240 Punctuation
2241
2242 Pc:
2243 Connector punctuation
2244
2245 Pd:
2246 Dash punctuation
2247
2248 Pe:
2249 Close punctuation
2250
2251 Pf:
2252 Final punctuation
2253
2254 Pi:
2255 Initial punctuation
2256
2257 Po:
2258 Other punctuation
2259
2260 Ps:
2261 Open punctuation
2262
2263 S:
2264 Symbol
2265
2266 Sc:
2267 Currency symbol
2268
2269 Sk:
2270 Modifier symbol
2271
2272 Sm:
2273 Mathematical symbol
2274
2275 So:
2276 Other symbol
2277
2278 Z:
2279 Separator
2280
2281 Zl:
2282 Line separator
2283
2284 Zp:
2285 Paragraph separator
2286
2287 Zs:
2288 Space separator
2289
2290 The special property L& is also supported. It matches a character that
2291 has the Lu, Ll, or Lt property, that is, a letter that is not classi‐
2292 fied as a modifier or "other".
2293
2294 The Cs (Surrogate) property applies only to characters in the range
2295 U+D800 to U+DFFF. Such characters are invalid in Unicode strings and so
2296 cannot be tested by PCRE. Perl does not support the Cs property.
2297
The long synonyms for property names supported by Perl (such as
\p{Letter}) are not supported by PCRE. It is not permitted to prefix
any of these properties with "Is".
2301
2302 No character in the Unicode table has the Cn (unassigned) property.
2303 This property is instead assumed for any code point that is not in the
2304 Unicode table.
2305
2306 Specifying caseless matching does not affect these escape sequences.
2307 For example, \p{Lu} always matches only uppercase letters. This is dif‐
2308 ferent from the behavior of current versions of Perl.
2309
2310 Matching characters by Unicode property is not fast, as PCRE must do a
2311 multistage table lookup to find a character property. That is why the
2312 traditional escape sequences such as \d and \w do not use Unicode prop‐
2313 erties in PCRE by default. However, you can make them do so by setting
2314 option ucp or by starting the pattern with (*UCP).
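
The property escapes can be tried out as in the sketch below. The
module name uniprop_demo and the subjects are hypothetical; option
unicode is passed so that characters > 255 can appear in the subject.

```erlang
%% Hypothetical demo module for \p, \P, and script names.
-module(uniprop_demo).
-export([demo/0]).

demo() ->
    List = [unicode, {capture, first, list}],
    None = [unicode, {capture, none}],
    [re:run("aB", "\\p{Lu}", List),         % {match,["B"]}
     re:run([16#03B1], "\\p{Greek}", None), % Greek alpha: match
     re:run("B", "\\p{^Lu}", None),         % same as \P{Lu}: nomatch
     re:run("ab1", "\\pL+", List)].         % braces optional: {match,["ab"]}
```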
2315
2316 Extended Grapheme Clusters
2317
2318 The \X escape matches any number of Unicode characters that form an
2319 "extended grapheme cluster", and treats the sequence as an atomic group
2320 (see below). Up to and including release 8.31, PCRE matched an earlier,
2321 simpler definition that was equivalent to (?>\PM\pM*). That is, it
2322 matched a character without the "mark" property, followed by zero or
2323 more characters with the "mark" property. Characters with the "mark"
2324 property are typically non-spacing accents that affect the preceding
2325 character.
2326
2327 This simple definition was extended in Unicode to include more compli‐
2328 cated kinds of composite character by giving each character a grapheme
2329 breaking property, and creating rules that use these properties to
2330 define the boundaries of extended grapheme clusters. In PCRE releases
2331 later than 8.31, \X matches one of these clusters.
2332
2333 \X always matches at least one character. Then it decides whether to
2334 add more characters according to the following rules for ending a clus‐
2335 ter:
2336
2337 * End at the end of the subject string.
2338
2339 * Do not end between CR and LF; otherwise end after any control char‐
2340 acter.
2341
2342 * Do not break Hangul (a Korean script) syllable sequences. Hangul
2343 characters are of five types: L, V, T, LV, and LVT. An L character
2344 can be followed by an L, V, LV, or LVT character. An LV or V char‐
2345 acter can be followed by a V or T character. An LVT or T character
2346 can be followed only by a T character.
2347
2348 * Do not end before extending characters or spacing marks. Characters
2349 with the "mark" property always have the "extend" grapheme breaking
2350 property.
2351
2352 * Do not end after prepend characters.
2353
2354 * Otherwise, end the cluster.
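
Two of the rules above can be checked with a small sketch. The module
name grapheme_demo is hypothetical. Captures are requested as lists so
that the result is a list of code points rather than byte offsets.

```erlang
%% Hypothetical demo module for \X.
-module(grapheme_demo).
-export([demo/0]).

demo() ->
    Opts = [unicode, {capture, first, list}],
    %% "e" followed by U+0301 (combining acute accent) forms one
    %% cluster, because a cluster does not end before a mark.
    Accent = re:run([$e, 16#0301, $x], "\\X", Opts),
    %% CR followed by LF is never split.
    CrLf = re:run("\r\nx", "\\X", Opts),
    {Accent, CrLf}.
```

Under these assumptions, demo() is expected to return
{{match,[[101,769]]}, {match,["\r\n"]}}.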
2355
2356 PCRE Additional Properties
2357
2358 In addition to the standard Unicode properties described earlier, PCRE
2359 supports four more that make it possible to convert traditional escape
2360 sequences, such as \w and \s to use Unicode properties. PCRE uses these
2361 non-standard, non-Perl properties internally when the ucp option is
2362 passed. However, they can also be used explicitly. The properties are
2363 as follows:
2364
2365 Xan:
2366 Any alphanumeric character. Matches characters that have either the
2367 L (letter) or the N (number) property.
2368
2369 Xps:
2370 Any Posix space character. Matches the characters tab, line feed,
2371 vertical tab, form feed, carriage return, and any other character
2372 that has the Z (separator) property.
2373
Xsp:
Any Perl space character. Matches the same characters as Xps; before
PCRE release 8.34, vertical tab was excluded.
2377
2378 Xwd:
2379 Any Perl "word" character. Matches the same characters as Xan, plus
2380 underscore.
2381
Perl and POSIX space are now the same. Perl added VT to its space
character set at release 5.18, and PCRE followed at release 8.34.
Consequently, Xsp, which used to exclude vertical tab for Perl
compatibility, now matches the same characters as Xps.
2392
2393 There is another non-standard property, Xuc, which matches any charac‐
2394 ter that can be represented by a Universal Character Name in C++ and
2395 other programming languages. These are the characters $, @, ` (grave
2396 accent), and all characters with Unicode code points >= U+00A0, except
2397 for the surrogates U+D800 to U+DFFF. Notice that most base (ASCII)
2398 characters are excluded. (Universal Character Names are of the form
2399 \uHHHH or \UHHHHHHHH, where H is a hexadecimal digit. Notice that the
2400 Xuc property does not match these sequences but the characters that
2401 they represent.)
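
The additional properties are used with \p like any other property, as
in the following sketch. The module name xprops_demo is hypothetical.

```erlang
%% Hypothetical demo module for the PCRE-specific properties.
-module(xprops_demo).
-export([demo/0]).

demo() ->
    None = [unicode, {capture, none}],
    [re:run("_", "\\p{Xan}", None),   % underscore is not alphanumeric: nomatch
     re:run("_", "\\p{Xwd}", None),   % Xwd adds underscore: match
     re:run("\t", "\\p{Xps}", None),  % tab is a space character: match
     re:run("a", "\\p{Xuc}", None),   % base ASCII letter: nomatch
     re:run("$", "\\p{Xuc}", None)].  % one of $, @, `: match
```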
2402
2403 Resetting the Match Start
2404
2405 The escape sequence \K causes any previously matched characters not to
2406 be included in the final matched sequence. For example, the following
2407 pattern matches "foobar", but reports that it has matched "bar":
2408
2409 foo\Kbar
2410
2411 This feature is similar to a lookbehind assertion (described below).
2412 However, in this case, the part of the subject before the real match
2413 does not have to be of fixed length, as lookbehind assertions do. The
2414 use of \K does not interfere with the setting of captured substrings.
2415 For example, when the following pattern matches "foobar", the first
2416 substring is still set to "foo":
2417
2418 (foo)\Kbar
2419
2420 Perl documents that the use of \K within assertions is "not well
2421 defined". In PCRE, \K is acted upon when it occurs inside positive
2422 assertions, but is ignored in negative assertions. Note that when a
2423 pattern such as (?=ab\K) matches, the reported start of the match can
2424 be greater than the end of the match.
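
The two patterns above can be run as in the following sketch. The
module name reset_start_demo is hypothetical.

```erlang
%% Hypothetical demo module for \K.
-module(reset_start_demo).
-export([demo/0]).

demo() ->
    Opts = [{capture, all, list}],
    %% "foo" must be present, but is left out of the reported match.
    Plain = re:run("foobar", "foo\\Kbar", Opts),
    %% The captured substring is unaffected by \K.
    Grouped = re:run("foobar", "(foo)\\Kbar", Opts),
    {Plain, Grouped}.
```

Under these assumptions, demo() is expected to return
{{match,["bar"]}, {match,["bar","foo"]}}.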
2425
2426 Simple Assertions
2427
2428 The final use of backslash is for certain simple assertions. An asser‐
2429 tion specifies a condition that must be met at a particular point in a
2430 match, without consuming any characters from the subject string. The
2431 use of subpatterns for more complicated assertions is described below.
2432 The following are the backslashed assertions:
2433
2434 \b:
2435 Matches at a word boundary.
2436
2437 \B:
2438 Matches when not at a word boundary.
2439
2440 \A:
2441 Matches at the start of the subject.
2442
2443 \Z:
2444 Matches at the end of the subject, and before a newline at the end
2445 of the subject.
2446
2447 \z:
2448 Matches only at the end of the subject.
2449
2450 \G:
2451 Matches at the first matching position in the subject.
2452
2453 Inside a character class, \b has a different meaning; it matches the
2454 backspace character. If any other of these assertions appears in a
2455 character class, by default it matches the corresponding literal char‐
2456 acter (for example, \B matches the letter B).
2457
2458 A word boundary is a position in the subject string where the current
2459 character and the previous character do not both match \w or \W (that
2460 is, one matches \w and the other matches \W), or the start or end of
2461 the string if the first or last character matches \w, respectively. In
2462 UTF mode, the meanings of \w and \W can be changed by setting option
2463 ucp. When this is done, it also affects \b and \B. PCRE and Perl do not
2464 have a separate "start of word" or "end of word" metasequence. However,
2465 whatever follows \b normally determines which it is. For example, the
2466 fragment \ba matches "a" at the start of a word.
2467
2468 The \A, \Z, and \z assertions differ from the traditional circumflex
2469 and dollar (described in the next section) in that they only ever match
2470 at the very start and end of the subject string, whatever options are
2471 set. Thus, they are independent of multiline mode. These three asser‐
2472 tions are not affected by options notbol or noteol, which affect only
2473 the behavior of the circumflex and dollar metacharacters. However, if
2474 argument startoffset of run/3 is non-zero, indicating that matching is
2475 to start at a point other than the beginning of the subject, \A can
2476 never match. The difference between \Z and \z is that \Z matches before
2477 a newline at the end of the string and at the very end, while \z
2478 matches only at the end.
2479
2480 The \G assertion is true only when the current matching position is at
2481 the start point of the match, as specified by argument startoffset of
2482 run/3. It differs from \A when the value of startoffset is non-zero. By
2483 calling run/3 multiple times with appropriate arguments, you can mimic
2484 the Perl option /g, and it is in this kind of implementation where \G
2485 can be useful.
2486
2487 Notice, however, that the PCRE interpretation of \G, as the start of
2488 the current match, is subtly different from Perl, which defines it as
2489 the end of the previous match. In Perl, these can be different when the
2490 previously matched string was empty. As PCRE does only one match at a
2491 time, it cannot reproduce this behavior.
2492
2493 If all the alternatives of a pattern begin with \G, the expression is
2494 anchored to the starting match position, and the "anchored" flag is set
2495 in the compiled regular expression.
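
The assertions can be compared side by side in a sketch such as the
following. The module name assert_demo is hypothetical. The start
offset is given with option {offset, N} of re:run/3.

```erlang
%% Hypothetical demo module for the backslashed assertions.
-module(assert_demo).
-export([demo/0]).

demo() ->
    None = [{capture, none}],
    [re:run("a cat sat", "\\bcat\\b", None),            % match
     re:run("concat", "\\bcat\\b", None),               % nomatch
     re:run("abc\n", "abc\\Z", None),                   % match
     re:run("abc\n", "abc\\z", None),                   % nomatch
     %% With a non-zero start offset, \A can never match ...
     re:run("abcabc", "\\Aabc", [{offset, 3} | None]),  % nomatch
     %% ... but \G matches at the start point of the match.
     re:run("abcabc", "\\Gabc", [{offset, 3}])].        % {match,[{3,3}]}
```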
2496
2498 The circumflex and dollar metacharacters are zero-width assertions.
2499 That is, they test for a particular condition to be true without con‐
2500 suming any characters from the subject string.
2501
2502 Outside a character class, in the default matching mode, the circumflex
2503 character is an assertion that is true only if the current matching
2504 point is at the start of the subject string. If argument startoffset of
2505 run/3 is non-zero, circumflex can never match if option multiline is
2506 unset. Inside a character class, circumflex has an entirely different
2507 meaning (see below).
2508
Circumflex need not be the first character of the pattern if some
alternatives are involved, but it must be the first thing in each
alternative in which it appears if the pattern is ever to match that
branch. If all possible alternatives start with a circumflex, that is,
if the pattern is constrained to match only at the start of the
subject, it is said to be an "anchored" pattern. (There are also other
constructs that can cause a pattern to be anchored.)
2516
The dollar character is an assertion that is true only if the current
matching point is at the end of the subject string, or immediately
before a newline at the end of the string (by default). Notice,
however, that it does not match the newline. Dollar need not be the
last character of the pattern if some alternatives are involved, but it
must be the last item in any branch in which it appears. Dollar has no
special meaning in a character class.
2524
2525 The meaning of dollar can be changed so that it matches only at the
2526 very end of the string, by setting option dollar_endonly at compile
2527 time. This does not affect the \Z assertion.
2528
2529 The meanings of the circumflex and dollar characters are changed if
2530 option multiline is set. When this is the case, a circumflex matches
2531 immediately after internal newlines and at the start of the subject
2532 string. It does not match after a newline that ends the string. A dol‐
2533 lar matches before any newlines in the string, and at the very end,
2534 when multiline is set. When newline is specified as the two-character
2535 sequence CRLF, isolated CR and LF characters do not indicate newlines.
2536
2537 For example, the pattern /^abc$/ matches the subject string "def\nabc"
2538 (where \n represents a newline) in multiline mode, but not otherwise.
2539 So, patterns that are anchored in single-line mode because all branches
2540 start with ^ are not anchored in multiline mode, and a match for cir‐
2541 cumflex is possible when argument startoffset of run/3 is non-zero.
2542 Option dollar_endonly is ignored if multiline is set.
2543
Notice that the sequences \A, \Z, and \z can be used to match the start
and end of the subject in both modes. If all branches of a pattern
start with \A, it is always anchored, regardless of whether multiline
is set.
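
The effect of options multiline and dollar_endonly can be sketched as
follows. The module name anchors_demo is hypothetical; the default
newline convention (LF) is assumed.

```erlang
%% Hypothetical demo module for circumflex and dollar.
-module(anchors_demo).
-export([demo/0]).

demo() ->
    None = [{capture, none}],
    [re:run("def\nabc", "^abc$", None),                  % nomatch
     re:run("def\nabc", "^abc$", [multiline | None]),    % match
     %% By default, dollar also matches before a final newline.
     re:run("abc\n", "abc$", None),                      % match
     re:run("abc\n", "abc$", [dollar_endonly | None])].  % nomatch
```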
2547
2549 Outside a character class, a dot in the pattern matches any character
2550 in the subject string except (by default) a character that signifies
2551 the end of a line.
2552
2553 When a line ending is defined as a single character, dot never matches
2554 that character. When the two-character sequence CRLF is used, dot does
2555 not match CR if it is immediately followed by LF, otherwise it matches
2556 all characters (including isolated CRs and LFs). When any Unicode line
2557 endings are recognized, dot does not match CR, LF, or any of the other
2558 line-ending characters.
2559
2560 The behavior of dot regarding newlines can be changed. If option dotall
2561 is set, a dot matches any character, without exception. If the two-
2562 character sequence CRLF is present in the subject string, it takes two
2563 dots to match it.
2564
The handling of dot is entirely independent of the handling of
circumflex and dollar; the only relationship is that both involve
newlines. Dot has no special meaning in a character class.
2568
The escape sequence \N behaves like a dot, except that it is not
affected by option dotall. That is, it matches any character except one
that signifies the end of a line. Perl also uses \N to match characters
by name, but PCRE does not support this.
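
A sketch of how dot and \N treat a line feed under different options.
The module name dot_demo is hypothetical.

```erlang
%% Hypothetical demo module for dot, dotall, and \N.
-module(dot_demo).
-export([demo/0]).

demo() ->
    None = [{capture, none}],
    [re:run("a\nb", "a.b", None),                       % nomatch
     re:run("a\nb", "a.b", [dotall | None]),            % match
     %% \N ignores dotall and still refuses the newline.
     re:run("a\nb", "a\\Nb", [dotall | None]),          % nomatch
     %% When the line ending is CRLF, an isolated LF is an ordinary
     %% character, so dot matches it.
     re:run("a\nb", "a.b", [{newline, crlf} | None])].  % match
```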
2573
Outside a character class, the escape sequence \C matches any data
unit, regardless of whether a UTF mode is set. One data unit is one
byte. Unlike a dot, \C always matches line-ending characters. The
feature is provided in Perl to match individual bytes in UTF-8 mode,
but it is unclear how it can usefully be used. As \C breaks up
characters into individual data units, matching one unit with \C in a
UTF mode means that the remaining string can start with a malformed UTF
character. This has undefined results, as PCRE assumes that it deals
with valid UTF strings.
2584
2585 PCRE does not allow \C to appear in lookbehind assertions (described
2586 below) in a UTF mode, as this would make it impossible to calculate the
2587 length of the lookbehind.
2588
2589 The \C escape sequence is best avoided. However, one way of using it
2590 that avoids the problem of malformed UTF characters is to use a looka‐
2591 head to check the length of the next character, as in the following
2592 pattern, which can be used with a UTF-8 string (ignore whitespace and
2593 line breaks):
2594
2595 (?| (?=[\x00-\x7f])(\C) |
2596 (?=[\x80-\x{7ff}])(\C)(\C) |
2597 (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
2598 (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
2599
2600 A group that starts with (?| resets the capturing parentheses numbers
2601 in each alternative (see section Duplicate Subpattern Numbers). The
2602 assertions at the start of each branch check the next UTF-8 character
2603 for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
2604 individual bytes of the character are then captured by the appropriate
2605 number of groups.
2606
An opening square bracket introduces a character class, terminated by a
closing square bracket. A closing square bracket on its own is not
special by default. However, if option PCRE_JAVASCRIPT_COMPAT is set, a
lone closing square bracket causes a compile-time error. If a closing
square bracket is required as a member of the class, it must be the
first data character in the class (after an initial circumflex, if
present) or escaped with a backslash.
2615
2616 A character class matches a single character in the subject. In a UTF
2617 mode, the character can be more than one data unit long. A matched
2618 character must be in the set of characters defined by the class, unless
2619 the first character in the class definition is a circumflex, in which
2620 case the subject character must not be in the set defined by the class.
2621 If a circumflex is required as a member of the class, ensure that it is
2622 not the first character, or escape it with a backslash.
2623
2624 For example, the character class [aeiou] matches any lowercase vowel,
2625 while [^aeiou] matches any character that is not a lowercase vowel.
2626 Notice that a circumflex is just a convenient notation for specifying
2627 the characters that are in the class by enumerating those that are not.
2628 A class that starts with a circumflex is not an assertion; it still
2629 consumes a character from the subject string, and therefore it fails if
2630 the current pointer is at the end of the string.
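
The basic class behavior described above can be sketched as follows.
The module name class_demo and the subjects are hypothetical.

```erlang
%% Hypothetical demo module for simple character classes.
-module(class_demo).
-export([demo/0]).

demo() ->
    None = [{capture, none}],
    [re:run("rhythm", "[aeiou]", None),                       % nomatch
     re:run("rhythm", "[^aeiou]", [{capture, first, list}]),  % {match,["r"]}
     %% A negated class still consumes a character, so it fails at
     %% the end of the subject.
     re:run("a", "a[^aeiou]", None),                          % nomatch
     %% A closing bracket placed first is a member of the class.
     re:run("]", "[]a]", None)].                              % match
```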
2631
In UTF-8 mode, characters with values > 255 (0xff) can be included in a
class as a literal string of data units, or by using the \x{ escaping
mechanism.
2635
When caseless matching is set, any letters in a class represent both
their uppercase and lowercase versions. For example, a caseless [aeiou]
matches "A" and "a", and a caseless [^aeiou] does not match "A", but a
caseful version would. In a UTF mode, PCRE always understands the
concept of case for characters whose values are < 256, so caseless
matching is always possible. For characters with higher values, the
concept of case is supported only if PCRE is compiled with Unicode
property support. If you want to use caseless matching in a UTF mode
for characters >= 256, ensure that PCRE is compiled with Unicode
property support and with UTF support.
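
The [aeiou] example from this paragraph can be written as the following
sketch. The module name caseless_class_demo is hypothetical.

```erlang
%% Hypothetical demo module for caseless character classes.
-module(caseless_class_demo).
-export([demo/0]).

demo() ->
    None = [{capture, none}],
    [re:run("A", "[aeiou]", [caseless | None]),   % match
     re:run("A", "[^aeiou]", [caseless | None]),  % nomatch
     re:run("A", "[^aeiou]", None)].              % caseful: match
```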
2646
Characters that can indicate line breaks are never treated in any
special way when matching character classes, whatever line-ending
sequence is in use, and whatever setting of options dotall and
multiline is used. A class such as [^a] always matches one of these
characters.
2652
2653 The minus (hyphen) character can be used to specify a range of charac‐
2654 ters in a character class. For example, [d-m] matches any letter
2655 between d and m, inclusive. If a minus character is required in a
2656 class, it must be escaped with a backslash or appear in a position
2657 where it cannot be interpreted as indicating a range, typically as the
2658 first or last character in the class, or immediately after a range. For
2659 example, [b-d-z] matches letters in the range b to d, a hyphen charac‐
2660 ter, or z.
2661
2662 The literal character "]" cannot be the end character of a range. A
2663 pattern such as [W-]46] is interpreted as a class of two characters
2664 ("W" and "-") followed by a literal string "46]", so it would match
2665 "W46]" or "-46]". However, if "]" is escaped with a backslash, it is
2666 interpreted as the end of range, so [W-\]46] is interpreted as a class
2667 containing a range followed by two other characters. The octal or hexa‐
2668 decimal representation of "]" can also be used to end a range.
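
The hyphen and bracket cases above can be checked with a sketch such as
the following. The module name range_demo is hypothetical; the patterns
are anchored with ^ and $ so that the whole subject must be accounted
for.

```erlang
%% Hypothetical demo module for ranges in character classes.
-module(range_demo).
-export([demo/0]).

demo() ->
    None = [{capture, none}],
    [re:run("-", "^[b-d-z]$", None),     % hyphen after a range: match
     re:run("e", "^[b-d-z]$", None),     % not b-d, "-", or z: nomatch
     %% [W-] is a two-character class followed by the literal "46]".
     re:run("-46]", "^[W-]46]$", None),  % match
     %% With "]" escaped, W-\] is a range and 4 and 6 are members.
     re:run("X", "^[W-\\]46]$", None),   % X lies between W and ]: match
     re:run("6", "^[W-\\]46]$", None)].  % match
```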
2669
2670 An error is generated if a POSIX character class (see below) or an
2671 escape sequence other than one that defines a single character appears
2672 at a point where a range ending character is expected. For example,
2673 [z-\xff] is valid, but [A-\d] and [A-[:digit:]] are not.
2674
2675 Ranges operate in the collating sequence of character values. They can
2676 also be used for characters specified numerically, for example,
2677 [\000-\037]. Ranges can include any characters that are valid for the
2678 current mode.
2679
2680 If a range that includes letters is used when caseless matching is set,
2681 it matches the letters in either case. For example, [W-c] is equivalent
2682 to [][\\^_`wxyzabc], matched caselessly. In a non-UTF mode, if charac‐
2683 ter tables for a French locale are in use, [\xc8-\xcb] matches accented
2684 E characters in both cases. In UTF modes, PCRE supports the concept of
2685 case for characters with values > 255 only when it is compiled with
2686 Unicode property support.
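
A small shell sketch of the [W-c] example, using option caseless from compile_option(). The variable name Caseless is illustrative only.

```erlang
Caseless = [caseless, {capture, none}].
match   = re:run("w", "^[W-c]$", Caseless).  % lowercase partner of W
match   = re:run("C", "^[W-c]$", Caseless).  % uppercase partner of c
match   = re:run("^", "^[W-c]$", Caseless).  % non-letter inside the range
nomatch = re:run("d", "^[W-c]$", Caseless).  % outside the range in both cases
%% Without caseless, only the code points from "W" to "c" match.
nomatch = re:run("w", "^[W-c]$", [{capture, none}]).
```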
2687
2688 The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
2689 \w, and \W can appear in a character class, and add the characters that
2690 they match to the class. For example, [\dABCDEF] matches any hexadeci‐
2691 mal digit. In UTF modes, option ucp affects the meanings of \d, \s, \w
2692 and their uppercase partners, just as it does when they appear outside
2693 a character class, as described in section Generic Character Types ear‐
2694 lier. The escape sequence \b has a different meaning inside a character
2695 class; it matches the backspace character. The sequences \B, \N, \R,
2696 and \X are not special inside a character class. Like any other unrec‐
2697 ognized escape sequences, they are treated as the literal characters
2698 "B", "N", "R", and "X".
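
The following sketch shows a character type escape inside a class, and the backspace meaning of \b. It assumes string-literal patterns (doubled backslashes); the subject [8] is a one-element list holding the backspace code.

```erlang
NoCap = [{capture, none}].
match   = re:run("7", "^[\\dABCDEF]$", NoCap).  % \d adds the digits to the class
match   = re:run("E", "^[\\dABCDEF]$", NoCap).
nomatch = re:run("G", "^[\\dABCDEF]$", NoCap).
%% Inside a class, \b is the backspace character (code 8), not a word boundary.
match   = re:run([8], "^[\\b]$", NoCap).
nomatch = re:run("b", "^[\\b]$", NoCap).
```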
2699
2700 A circumflex can conveniently be used with the uppercase character
2701 types to specify a more restricted set of characters than the matching
2702 lowercase type. For example, class [^\W_] matches any letter or digit,
2703 but not underscore, while [\w] includes underscore. A positive charac‐
2704 ter class is to be read as "something OR something OR ..." and a nega‐
2705 tive class as "NOT something AND NOT something AND NOT ...".
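
For example, the negated class [^\W_] can be compared with [\w] in the shell. This is a minimal sketch; NoCap is an illustrative name.

```erlang
NoCap = [{capture, none}].
match   = re:run("a", "^[^\\W_]$", NoCap).  % letter
match   = re:run("5", "^[^\\W_]$", NoCap).  % digit
nomatch = re:run("_", "^[^\\W_]$", NoCap).  % underscore is excluded explicitly
nomatch = re:run("-", "^[^\\W_]$", NoCap).  % not a word character at all
match   = re:run("_", "^[\\w]$", NoCap).    % \w itself includes underscore
```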
2706
2707 Only the following metacharacters are recognized in character classes:
2708
2709 * Backslash
2710
2711 * Hyphen (only where it can be interpreted as specifying a range)
2712
2713 * Circumflex (only at the start)
2714
2715 * Opening square bracket (only when it can be interpreted as intro‐
2716 ducing a Posix class name, or for a special compatibility feature;
2717 see the next two sections)
2718
2719 * Terminating closing square bracket
2720
2721 However, escaping other non-alphanumeric characters does no harm.
2722
2724 Perl supports the Posix notation for character classes. This uses names
2725 enclosed by [: and :] within the enclosing square brackets. PCRE also
2726 supports this notation. For example, the following matches "0", "1",
2727 any alphabetic character, or "%":
2728
2729 [01[:alpha:]%]
2730
2731 The following are the supported class names:
2732
2733 alnum:
2734 Letters and digits
2735
2736 alpha:
2737 Letters
2738
2739 ascii:
2740 Character codes 0-127
2741
2742 blank:
2743 Space or tab only
2744
2745 cntrl:
2746 Control characters
2747
2748 digit:
2749 Decimal digits (same as \d)
2750
2751 graph:
2752 Printing characters, excluding space
2753
2754 lower:
2755 Lowercase letters
2756
2757 print:
2758 Printing characters, including space
2759
2760 punct:
2761 Printing characters, excluding letters, digits, and space
2762
2763 space:
2764 Whitespace (the same as \s from PCRE 8.34)
2765
2766 upper:
2767 Uppercase letters
2768
2769 word:
2770 "Word" characters (same as \w)
2771
2772 xdigit:
2773 Hexadecimal digits
2774
2775 The default "space" characters are HT (9), LF (10), VT (11), FF (12),
2776 CR (13), and space (32). If locale-specific matching is taking place,
2777 the list of space characters may be different; there may be fewer or
2778 more of them. "Space" used to be different to \s, which did not include
2779 VT, for Perl compatibility. However, Perl changed at release 5.18, and
2780 PCRE followed at release 8.34. "Space" and \s now match the same set of
2781 characters.
2782
2783 The name "word" is a Perl extension, and "blank" is a GNU extension
2784 from Perl 5.8. Another Perl extension is negation, which is indicated
2785 by a ^ character after the colon. For example, the following matches
2786 "1", "2", or any non-digit:
2787
2788 [12[:^digit:]]
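
Both class examples from this section can be tried as follows. The sketch only uses run/3 with {capture, none}; the variable name is illustrative.

```erlang
NoCap = [{capture, none}].
match   = re:run("x", "^[01[:alpha:]%]$", NoCap).
match   = re:run("%", "^[01[:alpha:]%]$", NoCap).
nomatch = re:run("2", "^[01[:alpha:]%]$", NoCap).
%% Negated POSIX class: "1", "2", or any non-digit.
match   = re:run("2", "^[12[:^digit:]]$", NoCap).
match   = re:run("x", "^[12[:^digit:]]$", NoCap).
nomatch = re:run("3", "^[12[:^digit:]]$", NoCap).
```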
2789
2790 PCRE (and Perl) also recognize the Posix syntax [.ch.] and [=ch=] where
2791 "ch" is a "collating element", but these are not supported, and an
2792 error is given if they are encountered.
2793
By default, characters with values > 255 do not match any of the Posix
character classes. However, if option ucp is passed to compile/2, some
of the classes are changed so that Unicode character properties are
used. This is achieved by replacing certain Posix classes by other
sequences, as follows:
2799
2800 [:alnum:]:
2801 Becomes \p{Xan}
2802
2803 [:alpha:]:
2804 Becomes \p{L}
2805
2806 [:blank:]:
2807 Becomes \h
2808
2809 [:digit:]:
2810 Becomes \p{Nd}
2811
2812 [:lower:]:
2813 Becomes \p{Ll}
2814
2815 [:space:]:
2816 Becomes \p{Xps}
2817
2818 [:upper:]:
2819 Becomes \p{Lu}
2820
2821 [:word:]:
2822 Becomes \p{Xwd}
2823
2824 Negated versions, such as [:^alpha:], use \P instead of \p. Three other
2825 POSIX classes are handled specially in UCP mode:
2826
2827 [:graph:]:
2828 This matches characters that have glyphs that mark the page when
2829 printed. In Unicode property terms, it matches all characters with
2830 the L, M, N, P, S, or Cf properties, except for:
2831
2832 U+061C:
2833 Arabic Letter Mark
2834
2835 U+180E:
2836 Mongolian Vowel Separator
2837
2838 U+2066 - U+2069:
2839 Various "isolate"s
2840
2841 [:print:]:
2842 This matches the same characters as [:graph:] plus space characters
2843 that are not controls, that is, characters with the Zs property.
2844
2845 [:punct:]:
2846 This matches all characters that have the Unicode P (punctuation)
2847 property, plus those characters whose code points are less than 128
2848 that have the S (Symbol) property.
2849
2850 The other POSIX classes are unchanged, and match only characters with
2851 code points less than 128.
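
The effect of Unicode property mode on a POSIX class can be sketched as follows, assuming options unicode and ucp are passed directly to run/3. The subject is a code point list holding U+03BB (a Greek lowercase letter); the variable name Lambda is illustrative.

```erlang
Lambda = [16#3BB].
%% Without ucp, a character > 255 matches no POSIX class.
nomatch = re:run(Lambda, "^[[:alpha:]]$", [unicode, {capture, none}]).
%% With ucp, [:alpha:] is treated as \p{L}.
match   = re:run(Lambda, "^[[:alpha:]]$", [unicode, ucp, {capture, none}]).
```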
2852
2853 Compatibility Feature for Word Boundaries
2854
2855 In the POSIX.2 compliant library that was included in 4.4BSD Unix, the
2856 ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word"
2857 and "end of word". PCRE treats these items as follows:
2858
2859 [[:<:]]:
2860 is converted to \b(?=\w)
2861
2862 [[:>:]]:
2863 is converted to \b(?<=\w)
2864
Only these exact character sequences are recognized. A sequence such as
[a[:<:]b] provokes an error for an unrecognized POSIX class name. This
support is not compatible with Perl. It is provided to help migrations
from other environments, and is best not used in any new patterns. Note
that \b matches at the start and the end of a word (see "Simple
assertions" above), and in a Perl-style pattern the preceding or
following character normally shows which is wanted, without the need
for the assertions that are used above in order to give exactly the
POSIX behaviour.
2874
2876 Vertical bar characters are used to separate alternative patterns. For
2877 example, the following pattern matches either "gilbert" or "sullivan":
2878
2879 gilbert|sullivan
2880
Any number of alternatives can appear, and an empty alternative is
permitted (matching the empty string). The matching process tries each
alternative in turn, from left to right, and the first that succeeds is
used. If the alternatives are within a subpattern (defined in section
Subpatterns), "succeeds" means matching both the alternative in the
subpattern and the remainder of the main pattern.
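
The left-to-right rule can be observed in the shell. The sketch below uses the illustrative alternatives "gil" and "gilbert" (not from the text above) so that the order of the branches changes the result; First is an illustrative variable name.

```erlang
First = [{capture, first, list}].
{match, ["gil"]}     = re:run("gilbert", "gil|gilbert", First).  % first branch wins
{match, ["gilbert"]} = re:run("gilbert", "gilbert|gil", First).
{match, [""]}        = re:run("xyz", "gilbert|", First).         % empty alternative
%% Inside a subpattern, a branch only "succeeds" if the rest matches too:
%% "gil" is tried first, but "$" then fails, so "gilbert" is used.
{match, ["gilbert", "gilbert"]} =
    re:run("gilbert", "(gil|gilbert)$", [{capture, all, list}]).
```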
2887
2889 The settings of the Perl-compatible options caseless, multiline,
2890 dotall, and extended can be changed from within the pattern by a
2891 sequence of Perl option letters enclosed between "(?" and ")". The
2892 option letters are as follows:
2893
2894 i:
2895 For caseless
2896
2897 m:
2898 For multiline
2899
2900 s:
2901 For dotall
2902
2903 x:
2904 For extended
2905
2906 For example, (?im) sets caseless, multiline matching. These options can
2907 also be unset by preceding the letter with a hyphen. A combined setting
2908 and unsetting such as (?im-sx), which sets caseless and multiline,
2909 while unsetting dotall and extended, is also permitted. If a letter
2910 appears both before and after the hyphen, the option is unset.
2911
2912 The PCRE-specific options dupnames, ungreedy, and extra can be changed
2913 in the same way as the Perl-compatible options by using the characters
2914 J, U, and X respectively.
2915
2916 When one of these option changes occurs at top-level (that is, not
2917 inside subpattern parentheses), the change applies to the remainder of
2918 the pattern that follows.
2919
2920 An option change within a subpattern (see section Subpatterns) affects
2921 only that part of the subpattern that follows it. So, the following
2922 matches abc and aBc and no other strings (assuming caseless is not
2923 used):
2924
2925 (a(?i)b)c
2926
2927 By this means, options can be made to have different settings in dif‐
2928 ferent parts of the pattern. Any changes made in one alternative do
2929 carry on into subsequent branches within the same subpattern. For exam‐
2930 ple:
2931
2932 (a(?i)b|c)
2933
2934 matches "ab", "aB", "c", and "C", although when matching "C" the first
2935 branch is abandoned before the option setting. This is because the
2936 effects of option settings occur at compile time. There would be some
2937 weird behavior otherwise.
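
The scope of an internal option setting can be checked with a few self-checking matches. This is a sketch; the anchors ^ and $ are added here only to make the expectations exact, and the last subject is an illustrative example.

```erlang
NoCap = [{capture, none}].
match   = re:run("abc", "^(a(?i)b)c$", NoCap).
match   = re:run("aBc", "^(a(?i)b)c$", NoCap).
nomatch = re:run("abC", "^(a(?i)b)c$", NoCap).   % "c" is outside the subpattern
nomatch = re:run("Abc", "^(a(?i)b)c$", NoCap).   % "a" precedes the (?i) setting
%% The setting carries on into the following branch.
match   = re:run("C", "^(a(?i)b|c)$", NoCap).
nomatch = re:run("Ab", "^(a(?i)b|c)$", NoCap).
%% Top-level (?im): caseless and multiline for the rest of the pattern.
match   = re:run("ONE\ntwo", "(?im)^TWO$", NoCap).
```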
2938
2939 Note:
2940 Other PCRE-specific options can be set by the application when the com‐
2941 piling or matching functions are called. Sometimes the pattern can con‐
2942 tain special leading sequences, such as (*CRLF), to override what the
2943 application has set or what has been defaulted. Details are provided in
2944 section Newline Sequences earlier.
2945
2946 The (*UTF8) and (*UCP) leading sequences can be used to set UTF and
2947 Unicode property modes. They are equivalent to setting options unicode
2948 and ucp, respectively. The (*UTF) sequence is a generic version that
2949 can be used with any of the libraries. However, the application can set
2950 option never_utf, which locks out the use of the (*UTF) sequences.
2951
2952
2954 Subpatterns are delimited by parentheses (round brackets), which can be
2955 nested. Turning part of a pattern into a subpattern does two things:
2956
2957 1.:
2958 It localizes a set of alternatives. For example, the following pat‐
2959 tern matches "cataract", "caterpillar", or "cat":
2960
2961 cat(aract|erpillar|)
2962
2963 Without the parentheses, it would match "cataract", "erpillar", or
2964 an empty string.
2965
2966 2.:
2967 It sets up the subpattern as a capturing subpattern. That is, when
2968 the complete pattern matches, that portion of the subject string
2969 that matched the subpattern is passed back to the caller through
2970 the return value of run/3.
2971
2972 Opening parentheses are counted from left to right (starting from 1) to
2973 obtain numbers for the capturing subpatterns. For example, if the
2974 string "the red king" is matched against the following pattern, the
2975 captured substrings are "red king", "red", and "king", and are numbered
2976 1, 2, and 3, respectively:
2977
2978 the ((red|white) (king|queen))
2979
2980 It is not always helpful that plain parentheses fulfill two functions.
2981 Often a grouping subpattern is required without a capturing require‐
2982 ment. If an opening parenthesis is followed by a question mark and a
2983 colon, the subpattern does not do any capturing, and is not counted
2984 when computing the number of any subsequent capturing subpatterns. For
2985 example, if the string "the white queen" is matched against the follow‐
2986 ing pattern, the captured substrings are "white queen" and "queen", and
2987 are numbered 1 and 2:
2988
2989 the ((?:red|white) (king|queen))
2990
2991 The maximum number of capturing subpatterns is 65535.
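
The numbering of capturing and non-capturing subpatterns can be seen by asking run/3 for all captures as lists. In this sketch the first element of each result is the complete match, followed by the numbered subpatterns.

```erlang
{match, ["the red king", "red king", "red", "king"]} =
    re:run("the red king", "the ((red|white) (king|queen))",
           [{capture, all, list}]).
%% With (?:...), "white" is grouped but neither captured nor counted.
{match, ["the white queen", "white queen", "queen"]} =
    re:run("the white queen", "the ((?:red|white) (king|queen))",
           [{capture, all, list}]).
```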
2992
2993 As a convenient shorthand, if any option settings are required at the
2994 start of a non-capturing subpattern, the option letters can appear
2995 between "?" and ":". Thus, the following two patterns match the same
2996 set of strings:
2997
2998 (?i:saturday|sunday)
2999 (?:(?i)saturday|sunday)
3000
3001 As alternative branches are tried from left to right, and options are
3002 not reset until the end of the subpattern is reached, an option setting
3003 in one branch does affect subsequent branches, so the above patterns
3004 match both "SUNDAY" and "Saturday".
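
A quick check of the two equivalent forms, as a sketch with anchors added to make the expectations exact:

```erlang
NoCap = [{capture, none}].
match   = re:run("SUNDAY",   "^(?i:saturday|sunday)$", NoCap).
match   = re:run("Saturday", "^(?i:saturday|sunday)$", NoCap).
match   = re:run("SUNDAY",   "^(?:(?i)saturday|sunday)$", NoCap).
match   = re:run("Saturday", "^(?:(?i)saturday|sunday)$", NoCap).
%% Without the option setting, matching is caseful.
nomatch = re:run("SUNDAY",   "^(?:saturday|sunday)$", NoCap).
```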
3005
3007 Perl 5.10 introduced a feature where each alternative in a subpattern
3008 uses the same numbers for its capturing parentheses. Such a subpattern
3009 starts with (?| and is itself a non-capturing subpattern. For example,
3010 consider the following pattern:
3011
3012 (?|(Sat)ur|(Sun))day
3013
3014 As the two alternatives are inside a (?| group, both sets of capturing
3015 parentheses are numbered one. Thus, when the pattern matches, you can
3016 look at captured substring number one, whichever alternative matched.
3017 This construct is useful when you want to capture a part, but not all,
3018 of one of many alternatives. Inside a (?| group, parentheses are num‐
3019 bered as usual, but the number is reset at the start of each branch.
3020 The numbers of any capturing parentheses that follow the subpattern
3021 start after the highest number used in any branch. The following exam‐
3022 ple is from the Perl documentation; the numbers underneath show in
3023 which buffer the captured content is stored:
3024
# before  ---------------branch-reset----------- after
/ ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
# 1            2         2  3        2     3     4
3028
3029 A back reference to a numbered subpattern uses the most recent value
3030 that is set for that number by any subpattern. The following pattern
3031 matches "abcabc" or "defdef":
3032
3033 /(?|(abc)|(def))\1/
3034
3035 In contrast, a subroutine call to a numbered subpattern always refers
3036 to the first one in the pattern with the given number. The following
3037 pattern matches "abcabc" or "defabc":
3038
3039 /(?|(abc)|(def))(?1)/
3040
3041 If a condition test for a subpattern having matched refers to a non-
3042 unique number, the test is true if any of the subpatterns of that num‐
3043 ber have matched.
3044
An alternative approach to using this "branch reset" feature is to use
duplicate named subpatterns, as described in the next section.
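
The branch reset examples can be tried as follows. This is a sketch: the surrounding "/" delimiters of the Perl notation are dropped, backslashes are doubled for the Erlang string literal, and anchors are added to make the expectations exact.

```erlang
All = [{capture, all, list}].
NoCap = [{capture, none}].
%% Whichever branch matches, the captured part is substring number one.
{match, ["Saturday", "Sat"]} = re:run("Saturday", "(?|(Sat)ur|(Sun))day", All).
{match, ["Sunday", "Sun"]}   = re:run("Sunday", "(?|(Sat)ur|(Sun))day", All).
%% A back reference uses the most recent value set for number 1.
match   = re:run("defdef", "^(?|(abc)|(def))\\1$", NoCap).
nomatch = re:run("defabc", "^(?|(abc)|(def))\\1$", NoCap).
%% A subroutine call refers to the first subpattern numbered 1, that is, (abc).
match   = re:run("defabc", "^(?|(abc)|(def))(?1)$", NoCap).
nomatch = re:run("defdef", "^(?|(abc)|(def))(?1)$", NoCap).
```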
3047
3049 Identifying capturing parentheses by number is simple, but it can be
3050 hard to keep track of the numbers in complicated regular expressions.
3051 Also, if an expression is modified, the numbers can change. To help
3052 with this difficulty, PCRE supports the naming of subpatterns. This
3053 feature was not added to Perl until release 5.10. Python had the fea‐
3054 ture earlier, and PCRE introduced it at release 4.0, using the Python
3055 syntax. PCRE now supports both the Perl and the Python syntax. Perl
3056 allows identically numbered subpatterns to have different names, but
3057 PCRE does not.
3058
3059 In PCRE, a subpattern can be named in one of three ways: (?<name>...)
3060 or (?'name'...) as in Perl, or (?P<name>...) as in Python. References
3061 to capturing parentheses from other parts of the pattern, such as back
3062 references, recursion, and conditions, can be made by name and by num‐
3063 ber.
3064
3065 Names consist of up to 32 alphanumeric characters and underscores, but
3066 must start with a non-digit. Named capturing parentheses are still
3067 allocated numbers as well as names, exactly as if the names were not
3068 present. The capture specification to run/3 can use named values if
3069 they are present in the regular expression.
3070
3071 By default, a name must be unique within a pattern, but this constraint
3072 can be relaxed by setting option dupnames at compile time. (Duplicate
3073 names are also always permitted for subpatterns with the same number,
3074 set up as described in the previous section.) Duplicate names can be
3075 useful for patterns where only one instance of the named parentheses
3076 can match. Suppose that you want to match the name of a weekday, either
3077 as a 3-letter abbreviation or as the full name, and in both cases you
3078 want to extract the abbreviation. The following pattern (ignoring the
3079 line breaks) does the job:
3080
3081 (?<DN>Mon|Fri|Sun)(?:day)?|
3082 (?<DN>Tue)(?:sday)?|
3083 (?<DN>Wed)(?:nesday)?|
3084 (?<DN>Thu)(?:rsday)?|
3085 (?<DN>Sat)(?:urday)?
3086
3087 There are five capturing substrings, but only one is ever set after a
3088 match. (An alternative way of solving this problem is to use a "branch
3089 reset" subpattern, as described in the previous section.)
3090
For capturing named subpatterns whose names are not unique, the first
matching occurrence (counted from left to right in the subject) is
returned from run/3, if the name is specified in the values part of the
capture statement. The all_names capturing value matches all the names
in the same way.
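
The weekday pattern can be used from Erlang roughly as follows. This is a sketch: the names DayRE and DayMP and the "year" example are illustrative, and the pattern is written as adjacent string literals, which the compiler joins into one string.

```erlang
DayRE = "(?<DN>Mon|Fri|Sun)(?:day)?|(?<DN>Tue)(?:sday)?|"
        "(?<DN>Wed)(?:nesday)?|(?<DN>Thu)(?:rsday)?|(?<DN>Sat)(?:urday)?".
%% Duplicate names need option dupnames at compile time.
{ok, DayMP} = re:compile(DayRE, [dupnames]).
{match, ["Wed"]} = re:run("Wednesday", DayMP, [{capture, ['DN'], list}]).
{match, ["Sat"]} = re:run("Sat", DayMP, [{capture, ['DN'], list}]).
%% Without dupnames the same pattern is rejected.
{error, _} = re:compile(DayRE).
%% A unique name works without any extra option.
{match, ["2024"]} =
    re:run("year 2024", "(?<year>\\d{4})", [{capture, [year], list}]).
```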
3096
3097 Note:
You cannot use different names to distinguish between two subpatterns
with the same number, as PCRE uses only the numbers when matching. For
this reason, an error is given at compile time if different names are
given to subpatterns with the same number. However, you can give the
same name to subpatterns with the same number, even when dupnames is
not set.
3104
3105
3107 Repetition is specified by quantifiers, which can follow any of the
3108 following items:
3109
3110 * A literal data character
3111
3112 * The dot metacharacter
3113
3114 * The \C escape sequence
3115
3116 * The \X escape sequence
3117
3118 * The \R escape sequence
3119
3120 * An escape such as \d or \pL that matches a single character
3121
3122 * A character class
3123
3124 * A back reference (see the next section)
3125
3126 * A parenthesized subpattern (including assertions)
3127
3128 * A subroutine call to a subpattern (recursive or otherwise)
3129
3130 The general repetition quantifier specifies a minimum and maximum num‐
3131 ber of permitted matches, by giving the two numbers in curly brackets
3132 (braces), separated by a comma. The numbers must be < 65536, and the
3133 first must be less than or equal to the second. For example, the fol‐
3134 lowing matches "zz", "zzz", or "zzzz":
3135
3136 z{2,4}
3137
3138 A closing brace on its own is not a special character. If the second
3139 number is omitted, but the comma is present, there is no upper limit.
3140 If the second number and the comma are both omitted, the quantifier
3141 specifies an exact number of required matches. Thus, the following
3142 matches at least three successive vowels, but can match many more:
3143
3144 [aeiou]{3,}
3145
3146 The following matches exactly eight digits:
3147
3148 \d{8}
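
The three quantifier forms can be checked in the shell. This is a sketch; the subjects are illustrative and First is an illustrative variable name.

```erlang
First = [{capture, first, list}].
{match, ["zzzz"]}  = re:run("zzzzz", "z{2,4}", First).         % at most four
nomatch            = re:run("z", "z{2,4}", First).             % fewer than two
{match, ["ueuei"]} = re:run("queueing", "[aeiou]{3,}", First). % no upper limit
match   = re:run("12345678", "^\\d{8}$", [{capture, none}]).   % exactly eight
nomatch = re:run("1234567", "^\\d{8}$", [{capture, none}]).
```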
3149
3150 An opening curly bracket that appears in a position where a quantifier
3151 is not allowed, or one that does not match the syntax of a quantifier,
3152 is taken as a literal character. For example, {,6} is not a quantifier,
3153 but a literal string of four characters.
3154
3155 In Unicode mode, quantifiers apply to characters rather than to indi‐
3156 vidual data units. Thus, for example, \x{100}{2} matches two charac‐
3157 ters, each of which is represented by a 2-byte sequence in a UTF-8
3158 string. Similarly, \X{3} matches three Unicode extended grapheme clus‐
3159 ters, each of which can be many data units long (and they can be of
3160 different lengths).
3161
3162 The quantifier {0} is permitted, causing the expression to behave as if
3163 the previous item and the quantifier were not present. This can be use‐
3164 ful for subpatterns that are referenced as subroutines from elsewhere
3165 in the pattern (but see also section Defining Subpatterns for Use by
3166 Reference Only). Items other than subpatterns that have a {0} quanti‐
3167 fier are omitted from the compiled pattern.
3168
3169 For convenience, the three most common quantifiers have single-charac‐
3170 ter abbreviations:
3171
3172 *:
3173 Equivalent to {0,}
3174
3175 +:
3176 Equivalent to {1,}
3177
3178 ?:
3179 Equivalent to {0,1}
3180
3181 Infinite loops can be constructed by following a subpattern that can
3182 match no characters with a quantifier that has no upper limit, for
3183 example:
3184
3185 (a?)*
3186
Earlier versions of Perl and PCRE used to give an error at compile time
for such patterns. As there are cases where this can be useful, such
patterns are now accepted, but if any repetition of the subpattern
matches no characters, the loop is forcibly broken.
3191
3192 By default, the quantifiers are "greedy", that is, they match as much
3193 as possible (up to the maximum number of permitted times), without
3194 causing the remaining pattern to fail. The classic example of where
3195 this gives problems is in trying to match comments in C programs. These
3196 appear between /* and */. Within the comment, individual * and / char‐
3197 acters can appear. An attempt to match C comments by applying the pat‐
3198 tern
3199
3200 /\*.*\*/
3201
3202 to the string
3203
3204 /* first comment */ not comment /* second comment */
3205
3206 fails, as it matches the entire string owing to the greediness of the
3207 .* item.
3208
3209 However, if a quantifier is followed by a question mark, it ceases to
3210 be greedy, and instead matches the minimum number of times possible, so
3211 the following pattern does the right thing with the C comments:
3212
3213 /\*.*?\*/
3214
3215 The meaning of the various quantifiers is not otherwise changed, only
3216 the preferred number of matches. Do not confuse this use of question
3217 mark with its use as a quantifier in its own right. As it has two uses,
3218 it can sometimes appear doubled, as in
3219
3220 \d??\d
3221
3222 which matches one digit by preference, but can match two if that is the
3223 only way the remaining pattern matches.
3224
3225 If option ungreedy is set (an option that is not available in Perl),
3226 the quantifiers are not greedy by default, but individual ones can be
3227 made greedy by following them with a question mark. That is, it inverts
3228 the default behavior.
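
Greedy, minimizing, and ungreedy matching can be compared on the C comment example. This is a sketch; CSubject and First are illustrative names, and option global is used only to show that the minimizing pattern finds both comments.

```erlang
CSubject = "/* first comment */ not comment /* second comment */".
First = [{capture, first, list}].
%% Greedy .* swallows everything up to the last "*/".
{match, [CSubject]} = re:run(CSubject, "/\\*.*\\*/", First).
%% Minimizing .*? stops at the first "*/".
{match, ["/* first comment */"]} = re:run(CSubject, "/\\*.*?\\*/", First).
{match, [["/* first comment */"], ["/* second comment */"]]} =
    re:run(CSubject, "/\\*.*?\\*/", [global | First]).
%% Option ungreedy inverts the default, so the plain .* is now minimizing.
{match, ["/* first comment */"]} =
    re:run(CSubject, "/\\*.*\\*/", [ungreedy | First]).
%% A doubled question mark: one digit by preference, two if needed.
{match, ["1"]}  = re:run("12", "\\d??\\d", First).
{match, ["12"]} = re:run("12", "^\\d??\\d$", First).
```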
3229
3230 When a parenthesized subpattern is quantified with a minimum repeat
3231 count that is > 1 or with a limited maximum, more memory is required
3232 for the compiled pattern, in proportion to the size of the minimum or
3233 maximum.
3234
3235 If a pattern starts with .* or .{0,} and option dotall (equivalent to
3236 Perl option /s) is set, thus allowing the dot to match newlines, the
3237 pattern is implicitly anchored, because whatever follows is tried
3238 against every character position in the subject string. So, there is no
3239 point in retrying the overall match at any position after the first.
3240 PCRE normally treats such a pattern as if it was preceded by \A.
3241
3242 In cases where it is known that the subject string contains no new‐
3243 lines, it is worth setting dotall to obtain this optimization, or
3244 alternatively using ^ to indicate anchoring explicitly.
3245
3246 However, there are some cases where the optimization cannot be used.
3247 When .* is inside capturing parentheses that are the subject of a back
3248 reference elsewhere in the pattern, a match at the start can fail where
3249 a later one succeeds. Consider, for example:
3250
3251 (.*)abc\1
3252
3253 If the subject is "xyz123abc123", the match point is the fourth charac‐
3254 ter. Therefore, such a pattern is not implicitly anchored.
3255
3256 Another case where implicit anchoring is not applied is when the lead‐
3257 ing .* is inside an atomic group. Once again, a match at the start can
3258 fail where a later one succeeds. Consider the following pattern:
3259
3260 (?>.*?a)b
3261
It matches "ab" in the subject "aab". The use of the backtracking
control verbs (*PRUNE) and (*SKIP) also disables this optimization.
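
Both exceptions can be observed through the match offsets that run/3 returns by default, as {Start, Length} pairs with a zero-based start. The sketch below sets option dotall, the case in which a leading .* would otherwise be implicitly anchored.

```erlang
%% The match starts at offset 3 (the fourth character), not at 0.
{match, [{3, 9}, {3, 3}]} = re:run("xyz123abc123", "(.*)abc\\1", [dotall]).
%% The atomic group fails at offset 0 and succeeds at offset 1.
{match, [{1, 2}]} = re:run("aab", "(?>.*?a)b", [dotall]).
```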
3264
3265 When a capturing subpattern is repeated, the value captured is the sub‐
3266 string that matched the final iteration. For example, after
3267
3268 (tweedle[dume]{3}\s*)+
3269
3270 has matched "tweedledum tweedledee", the value of the captured sub‐
3271 string is "tweedledee". However, if there are nested capturing subpat‐
3272 terns, the corresponding captured values can have been set in previous
3273 iterations. For example, after
3274
3275 /(a|(b))+/
3276
3277 matches "aba", the value of the second captured substring is "b".
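
Both repeated-capture examples can be reproduced with {capture, all, list}. In this sketch the "/" delimiters of the Perl notation are dropped.

```erlang
All = [{capture, all, list}].
%% Subpattern 1 holds the text of the final iteration only.
{match, ["tweedledum tweedledee", "tweedledee"]} =
    re:run("tweedledum tweedledee", "(tweedle[dume]{3}\\s*)+", All).
%% Subpattern 2 keeps "b" from the second iteration, although the
%% final iteration matched "a".
{match, ["aba", "a", "b"]} = re:run("aba", "(a|(b))+", All).
```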
3278
3280 With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
3281 repetition, failure of what follows normally causes the repeated item
3282 to be re-evaluated to see if a different number of repeats allows the
3283 remaining pattern to match. Sometimes it is useful to prevent this,
3284 either to change the nature of the match, or to cause it to fail ear‐
3285 lier than it otherwise might, when the author of the pattern knows that
3286 there is no point in carrying on.
3287
3288 Consider, for example, the pattern \d+foo when applied to the following
3289 subject line:
3290
3291 123456bar
3292
3293 After matching all six digits and then failing to match "foo", the nor‐
3294 mal action of the matcher is to try again with only five digits match‐
3295 ing item \d+, and then with four, and so on, before ultimately failing.
3296 "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
3297 the means for specifying that once a subpattern has matched, it is not
3298 to be re-evaluated in this way.
3299
3300 If atomic grouping is used for the previous example, the matcher gives
3301 up immediately on failing to match "foo" the first time. The notation
3302 is a kind of special parenthesis, starting with (?> as in the following
3303 example:
3304
3305 (?>\d+)foo
3306
3307 This kind of parenthesis "locks up" the part of the pattern it contains
3308 once it has matched, and a failure further into the pattern is pre‐
3309 vented from backtracking into it. Backtracking past it to previous
3310 items, however, works as normal.
3311
3312 An alternative description is that a subpattern of this type matches
3313 the string of characters that an identical standalone pattern would
3314 match, if anchored at the current point in the subject string.
3315
3316 Atomic grouping subpatterns are not capturing subpatterns. Simple cases
3317 such as the above example can be thought of as a maximizing repeat that
3318 must swallow everything it can. So, while both \d+ and \d+? are pre‐
3319 pared to adjust the number of digits they match to make the remaining
3320 pattern match, (?>\d+) can only match an entire sequence of digits.
3321
3322 Atomic groups in general can contain any complicated subpatterns, and
3323 can be nested. However, when the subpattern for an atomic group is just
3324 a single repeated item, as in the example above, a simpler notation,
3325 called a "possessive quantifier" can be used. This consists of an extra
3326 + character following a quantifier. Using this notation, the previous
3327 example can be rewritten as
3328
3329 \d++foo
3330
3331 Notice that a possessive quantifier can be used with an entire group,
3332 for example:
3333
3334 (abc|xyz){2,3}+
3335
Possessive quantifiers are always greedy; the setting of option
ungreedy is ignored. They are a convenient notation for the simpler
forms of an atomic group. There is no difference in meaning between a
possessive quantifier and the equivalent atomic group, but there can be
a performance difference; possessive quantifiers are probably slightly
faster.
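
The difference between an ordinary repeat, an atomic group, and a possessive quantifier shows up when the repeated item must give something back. The sketch below uses the illustrative pattern \d+5, which is not taken from the text above.

```erlang
NoCap = [{capture, none}].
%% Ordinary greedy repeat: \d+ gives back the final "5" so the match succeeds.
{match, ["12345"]} = re:run("12345", "\\d+5", [{capture, first, list}]).
%% Atomic group and possessive quantifier: nothing is given back.
nomatch = re:run("12345", "(?>\\d+)5", NoCap).
nomatch = re:run("12345", "\\d++5", NoCap).
%% When no backtracking is needed, all forms behave alike.
match   = re:run("123foo", "(?>\\d+)foo", NoCap).
match   = re:run("123foo", "\\d++foo", NoCap).
%% A possessive quantifier applied to an entire group.
match   = re:run("abcxyzabc", "^(abc|xyz){2,3}+$", NoCap).
```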
3342
3343 The possessive quantifier syntax is an extension to the Perl 5.8 syn‐
3344 tax. Jeffrey Friedl originated the idea (and the name) in the first
3345 edition of his book. Mike McCloskey liked it, so implemented it when he
3346 built the Sun Java package, and PCRE copied it from there. It ulti‐
3347 mately found its way into Perl at release 5.10.
3348
PCRE has an optimization that automatically "possessifies" certain
simple pattern constructs. For example, the sequence A+B is treated as
A++B, as there is no point in backtracking into a sequence of A's when
B must follow.
3353
3354 When a pattern contains an unlimited repeat inside a subpattern that
3355 can itself be repeated an unlimited number of times, the use of an
3356 atomic group is the only way to avoid some failing matches taking a
3357 long time. The pattern
3358
3359 (\D+|<\d+>)*[!?]
3360
3361 matches an unlimited number of substrings that either consist of non-
3362 digits, or digits enclosed in <>, followed by ! or ?. When it matches,
3363 it runs quickly. However, if it is applied to
3364
3365 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
3366
3367 it takes a long time before reporting failure. This is because the
3368 string can be divided between the internal \D+ repeat and the external
3369 * repeat in many ways, and all must be tried. (The example uses [!?]
3370 rather than a single character at the end, as both PCRE and Perl have
3371 an optimization that allows for fast failure when a single character is
3372 used. They remember the last single character that is required for a
3373 match, and fail early if it is not present in the string.) If the pat‐
3374 tern is changed so that it uses an atomic group, like the following,
3375 sequences of non-digits cannot be broken, and failure happens quickly:
3376
3377 ((?>\D+)|<\d+>)*[!?]
3378
3380 Outside a character class, a backslash followed by a digit > 0 (and
3381 possibly further digits) is a back reference to a capturing subpattern
3382 earlier (that is, to its left) in the pattern, provided there have been
3383 that many previous capturing left parentheses.
3384
However, if the decimal number following the backslash is < 10, it is
always taken as a back reference, and causes an error only if there are
not that many capturing left parentheses in the entire pattern. That
is, the parentheses that are referenced need not be to the left of the
reference for numbers < 10. A "forward back reference" of this type can
make sense when a repetition is involved and the subpattern to the
right has participated in an earlier iteration.
3392
3393 It is not possible to have a numerical "forward back reference" to a
3394 subpattern whose number is 10 or more using this syntax, as a sequence
3395 such as \50 is interpreted as a character defined in octal. For more
3396 details of the handling of digits following a backslash, see section
3397 Non-Printing Characters earlier. There is no such problem when named
3398 parentheses are used. A back reference to any subpattern is possible
3399 using named parentheses (see below).
3400
3401 Another way to avoid the ambiguity inherent in the use of digits fol‐
3402 lowing a backslash is to use the \g escape sequence. This escape must
3403 be followed by an unsigned number or a negative number, optionally
3404 enclosed in braces. The following examples are identical:
3405
3406 (ring), \1
3407 (ring), \g1
3408 (ring), \g{1}
3409
3410 An unsigned number specifies an absolute reference without the ambigu‐
3411 ity that is present in the older syntax. It is also useful when literal
3412 digits follow the reference. A negative number is a relative reference.
3413 Consider the following example:
3414
3415 (abc(def)ghi)\g{-1}
3416
3417 The sequence \g{-1} is a reference to the most recently started captur‐
3418 ing subpattern before \g, that is, it is equivalent to \2 in this exam‐
3419 ple. Similarly, \g{-2} would be equivalent to \1. The use of relative
3420 references can be helpful in long patterns, and also in patterns that
3421 are created by joining fragments containing references within them‐
3422 selves.
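
The absolute and relative forms can be checked from Erlang with a short sketch such as the following. The module name backref_g_demo is made up for illustration, and every backslash is doubled because the patterns are written as Erlang string literals.

```erlang
-module(backref_g_demo).
-export([run/0]).

%% Returns ok if every assertion holds; crashes with badmatch otherwise.
run() ->
    None = [{capture, none}],
    %% \1, \g1 and \g{1} all refer to group 1.
    lists:foreach(fun(P) -> match = re:run("ring, ring", P, None) end,
                  ["(ring), \\1", "(ring), \\g1", "(ring), \\g{1}"]),
    %% Braces keep a following literal digit out of the group number.
    match = re:run("ringring0", "(ring)\\g{1}0", None),
    %% \g{-1} is the most recently started group, here (def).
    match = re:run("abcdefghidef", "(abc(def)ghi)\\g{-1}", None),
    %% \g{-2} is the group started before that, here (abc(def)ghi).
    match = re:run("abcdefghiabcdefghi", "(abc(def)ghi)\\g{-2}", None),
    nomatch = re:run("abcdefghixyz", "(abc(def)ghi)\\g{-1}", None),
    ok.
```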
3423
3424 A back reference matches whatever matched the capturing subpattern in
3425 the current subject string, rather than anything matching the subpat‐
3426 tern itself (section Subpattern as Subroutines describes a way of doing
3427 that). So, the following pattern matches "sense and sensibility" and
3428 "response and responsibility", but not "sense and responsibility":
3429
3430 (sens|respons)e and \1ibility
3431
3432 If caseful matching is in force at the time of the back reference, the
3433 case of letters is relevant. For example, the following matches "rah
3434 rah" and "RAH RAH", but not "RAH rah", although the original capturing
3435 subpattern is matched caselessly:
3436
3437 ((?i)rah)\s+\1
3438
3439 There are many different ways of writing back references to named sub‐
3440 patterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
3441 \k'name' are supported, as is the Python syntax (?P=name). The unified
3442 back reference syntax in Perl 5.10, in which \g can be used for both
3443 numeric and named references, is also supported. The previous example
3444 can be rewritten in the following ways:
3445
3446 (?<p1>(?i)rah)\s+\k<p1>
3447 (?'p1'(?i)rah)\s+\k{p1}
3448 (?P<p1>(?i)rah)\s+(?P=p1)
3449 (?<p1>(?i)rah)\s+\g{p1}
3450
3451 A subpattern that is referenced by name can appear in the pattern
3452 before or after the reference.
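
As a rough illustration, the four named forms above can be run side by side. The module name named_backref_demo is hypothetical. The sketch also shows that the back reference stays caseful even though the group itself was matched caselessly.

```erlang
-module(named_backref_demo).
-export([run/0]).

run() ->
    None = [{capture, none}],
    Patterns = ["(?<p1>(?i)rah)\\s+\\k<p1>",
                "(?'p1'(?i)rah)\\s+\\k{p1}",
                "(?P<p1>(?i)rah)\\s+(?P=p1)",
                "(?<p1>(?i)rah)\\s+\\g{p1}"],
    lists:foreach(
      fun(P) ->
              match   = re:run("rah rah", P, None),
              match   = re:run("RAH RAH", P, None),
              %% (?i) ends with the group, so the reference is caseful.
              nomatch = re:run("RAH rah", P, None)
      end, Patterns),
    %% A named group can also be requested by name in the capture option.
    {match, ["RAH"]} = re:run("RAH RAH", hd(Patterns), [{capture, [p1], list}]),
    ok.
```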
3453
3454 There can be more than one back reference to the same subpattern. If a
3455 subpattern has not been used in a particular match, any back reference
3456 to it always fails. For example, the following pattern always fails if
3457 it starts to match "a" rather than "bc":
3458
3459 (a|(bc))\2
3460
3461 As there can be many capturing parentheses in a pattern, all digits
3462 following the backslash are taken as part of a potential back reference
3463 number. If the pattern continues with a digit character, some delimiter
3464 must be used to terminate the back reference. If option extended is
3465 set, this can be whitespace. Otherwise an empty comment (see section
3466 Comments) can be used.
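
The unset-group case and the two ways of ending a back reference before a literal digit can be sketched as follows. The module name backref_delimiter_demo is an assumption made for this example.

```erlang
-module(backref_delimiter_demo).
-export([run/0]).

run() ->
    None = [{capture, none}],
    %% Group 2 takes part only when "bc" is matched, so \2 fails after "a".
    match   = re:run("bcbc", "(a|(bc))\\2", None),
    nomatch = re:run("aa", "(a|(bc))\\2", None),
    %% An empty comment separates \1 from the literal digit 1.
    match = re:run("aa1", "(a)\\1(?#)1", None),
    %% With option extended, whitespace does the same job.
    match = re:run("aa1", "(a)\\1 1", [extended | None]),
    ok.
```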
3467
3468 Recursive Back References
3469
3470 A back reference that occurs inside the parentheses to which it refers
3471 fails when the subpattern is first used, so, for example, (a\1) never
3472 matches. However, such references can be useful inside repeated subpat‐
3473 terns. For example, the following pattern matches any number of "a"s
3474 and also "aba", "ababbaa", and so on:
3475
3476 (a|b\1)+
3477
3478 At each iteration of the subpattern, the back reference matches the
3479 character string corresponding to the previous iteration. In order for
3480 this to work, the pattern must be such that the first iteration does
3481 not need to match the back reference. This can be done using alterna‐
3482 tion, as in the example above, or by a quantifier with a minimum of
3483 zero.
3484
3485 Back references of this type cause the group that they reference to be
3486 treated as an atomic group. Once the whole group has been matched, a
3487 subsequent matching failure cannot cause backtracking into the middle
3488 of the group.
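
A minimal sketch of the recursive back reference, using the strings named above. The ^ and $ anchors are added here so that the whole subject must be consumed; the module name recursive_backref_demo is made up.

```erlang
-module(recursive_backref_demo).
-export([run/0]).

run() ->
    None = [{capture, none}],
    %% \1 is unset the first time it is needed, so this never matches.
    nomatch = re:run("aa", "(a\\1)", None),
    P = "^(a|b\\1)+$",
    %% Each b\1 iteration reuses what the previous iteration captured.
    lists:foreach(fun(S) -> match = re:run(S, P, None) end,
                  ["a", "aaa", "aba", "ababbaa"]),
    %% After "a", the next iteration would need "ba", not just "b".
    nomatch = re:run("ab", P, None),
    ok.
```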
3489
3490 ASSERTIONS
3491 An assertion is a test on the characters following or preceding the
3492 current matching point that does not consume any characters. The simple
3493 assertions coded as \b, \B, \A, \G, \Z, \z, ^, and $ are described in
3494 the previous sections.
3495
3496 More complicated assertions are coded as subpatterns. There are two
3497 kinds: those that look ahead of the current position in the subject
3498 string, and those that look behind it. An assertion subpattern is
3499 matched in the normal way, except that it does not cause the current
3500 matching position to be changed.
3501
3502 Assertion subpatterns are not capturing subpatterns. If such an asser‐
3503 tion contains capturing subpatterns within it, these are counted for
3504 the purposes of numbering the capturing subpatterns in the whole pat‐
3505 tern. However, substring capturing is done only for positive asser‐
3506 tions. (Perl sometimes, but not always, performs capturing in negative
3507 assertions.)
3508
3509 Warning:
3510 If a positive assertion containing one or more capturing subpatterns
3511 succeeds, but failure to match later in the pattern causes backtracking
3512 over this assertion, the captures within the assertion are reset only
3513 if no higher numbered captures are already set. This is, unfortunately,
3514 a fundamental limitation of the current implementation, and as PCRE1 is
3515 now in maintenance-only status, it is unlikely ever to change.
3516
3517
3518 For compatibility with Perl, assertion subpatterns can be repeated.
3519 Although it makes no sense to assert the same thing many times, the
3520 side effect of capturing parentheses can occasionally be useful. In
3521 practice, there are only three cases:
3522
3523 * If the quantifier is {0}, the assertion is never obeyed during
3524 matching. However, it can contain internal capturing parenthesized
3525 groups that are called from elsewhere through the subroutine mecha‐
3526 nism.
3527
3528 * If the quantifier is {0,n}, where n > 0, it is treated as if it were
3529 {0,1}. At runtime, the remaining pattern match is tried with and
3530 without the assertion; the order depends on the greediness of the
3531 quantifier.
3532
3533 * If the minimum repetition is > 0, the quantifier is ignored. The
3534 assertion is obeyed only once when encountered during matching.
3535
3536 Lookahead Assertions
3537
3538 Lookahead assertions start with (?= for positive assertions and (?! for
3539 negative assertions. For example, the following matches a word followed
3540 by a semicolon, but does not include the semicolon in the match:
3541
3542 \w+(?=;)
3543
3544 The following matches any occurrence of "foo" that is not followed by
3545 "bar":
3546
3547 foo(?!bar)
3548
3549 Notice that the apparently similar pattern
3550
3551 (?!foo)bar
3552
3553 does not find an occurrence of "bar" that is preceded by something
3554 other than "foo". It finds any occurrence of "bar" whatsoever, as the
3555 assertion (?!foo) is always true when the next three characters are
3556 "bar". A lookbehind assertion is needed to achieve the other effect.
3557
3558 If you want to force a matching failure at some point in a pattern, the
3559 most convenient way to do it is with (?!), as an empty string always
3560 matches. So, an assertion that requires that there is no empty
3561 string must always fail. The backtracking control verb (*FAIL) or (*F)
3562 is a synonym for (?!).
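
The lookahead examples above can be tried with a sketch like this one. The module name lookahead_demo and the subject strings are chosen for illustration only.

```erlang
-module(lookahead_demo).
-export([run/0]).

run() ->
    %% The semicolon is tested but not included in the match.
    {match, ["word"]} =
        re:run("word; more", "\\w+(?=;)", [{capture, first, list}]),
    %% Only the second "foo" (offset 7) is not followed by "bar".
    {match, [{7, 3}]} =
        re:run("foobar foobaz", "foo(?!bar)", [{capture, first, index}]),
    %% (?!foo)bar still finds the "bar" in "foobar".
    {match, [{3, 3}]} =
        re:run("foobar", "(?!foo)bar", [{capture, first, index}]),
    %% (?!) forces a failure at that point.
    nomatch = re:run("abc", "abc(?!)", [{capture, none}]),
    ok.
```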
3563
3564 Lookbehind Assertions
3565
3566 Lookbehind assertions start with (?<= for positive assertions and (?<!
3567 for negative assertions. For example, the following finds an occurrence
3568 of "bar" that is not preceded by "foo":
3569
3570 (?<!foo)bar
3571
3572 The contents of a lookbehind assertion are restricted such that all the
3573 strings it matches must have a fixed length. However, if there are many
3574 top-level alternatives, they do not all have to have the same fixed
3575 length. Thus, the following is permitted:
3576
3577 (?<=bullock|donkey)
3578
3579 The following causes an error at compile time:
3580
3581 (?<!dogs?|cats?)
3582
3583 Branches that match different length strings are permitted only at the
3584 top-level of a lookbehind assertion. This is an extension compared with
3585 Perl, which requires all branches to match the same length of string.
3586 An assertion such as the following is not permitted, as its single top-
3587 level branch can match two different lengths:
3588
3589 (?<=ab(c|de))
3590
3591 However, it is acceptable to PCRE if rewritten to use two top-level
3592 branches:
3593
3594 (?<=abc|abde)
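
The fixed-length rule can be probed with re:compile/1, as in this sketch (module name lookbehind_demo is made up). The two patterns that the text describes as rejected are reported rather than asserted, because the outcome depends on the PCRE version that the running Erlang/OTP release is built on.

```erlang
-module(lookbehind_demo).
-export([run/0]).

run() ->
    nomatch = re:run("foobar", "(?<!foo)bar", [{capture, none}]),
    {match, [{3, 3}]} =
        re:run("bazbar", "(?<!foo)bar", [{capture, first, index}]),
    %% Top-level branches of different fixed lengths are accepted.
    {ok, _} = re:compile("(?<=bullock|donkey)"),
    {ok, _} = re:compile("(?<=abc|abde)"),
    %% Returns [{Pattern, ok | error}]; with the PCRE version described
    %% in this document both are expected to be error.
    [{P, element(1, re:compile(P))}
     || P <- ["(?<!dogs?|cats?)", "(?<=ab(c|de))"]].
```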
3595
3596 Sometimes the escape sequence \K (see above) can be used instead of a
3597 lookbehind assertion to get round the fixed-length restriction.
3598
3599 The implementation of lookbehind assertions is, for each alternative,
3600 to move the current position back temporarily by the fixed length and
3601 then try to match. If there are insufficient characters before the cur‐
3602 rent position, the assertion fails.
3603
3604 In a UTF mode, PCRE does not allow the \C escape (which matches a sin‐
3605 gle data unit even in a UTF mode) to appear in lookbehind assertions,
3606 as it makes it impossible to calculate the length of the lookbehind.
3607 The \X and \R escapes, which can match different numbers of data units,
3608 are not permitted either.
3609
3610 "Subroutine" calls (see below), such as (?2) or (?&X), are permitted in
3611 lookbehinds, as long as the subpattern matches a fixed-length string.
3612 Recursion, however, is not supported.
3613
3614 Possessive quantifiers can be used with lookbehind assertions to spec‐
3615 ify efficient matching of fixed-length strings at the end of subject
3616 strings. Consider the following simple pattern when applied to a long
3617 string that does not match:
3618
3619 abcd$
3620
3621 As matching proceeds from left to right, PCRE looks for each "a" in the
3622 subject and then sees if what follows matches the remaining pattern. If
3623 the pattern is specified as
3624
3625 ^.*abcd$
3626
3627 the initial .* matches the entire string at first. However, when this
3628 fails (as there is no following "a"), it backtracks to match all but
3629 the last character, then all but the last two characters, and so on.
3630 Once again the search for "a" covers the entire string, from right to
3631 left, so we are no better off. However, if the pattern is written as
3632
3633 ^.*+(?<=abcd)
3634
3635 there can be no backtracking for the .*+ item; it can match only the
3636 entire string. The subsequent lookbehind assertion does a single test
3637 on the last four characters. If it fails, the match fails immediately.
3638 For long strings, this approach makes a significant difference to the
3639 processing time.
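
The following sketch (hypothetical module name possessive_lookbehind_demo) only confirms that the possessive form gives the same verdicts as the plain pattern on short subjects; it does not measure the difference in processing time.

```erlang
-module(possessive_lookbehind_demo).
-export([run/0]).

run() ->
    None = [{capture, none}],
    Fast = "^.*+(?<=abcd)",
    match   = re:run("xxxxabcd", Fast, None),
    nomatch = re:run("xxxxabce", Fast, None),
    %% Same verdicts as the plain pattern.
    match   = re:run("xxxxabcd", "abcd$", None),
    nomatch = re:run("xxxxabce", "abcd$", None),
    ok.
```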
3640
3641 Using Multiple Assertions
3642
3643 Many assertions (of any sort) can occur in succession. For example, the
3644 following matches "foo" preceded by three digits that are not "999":
3645
3646 (?<=\d{3})(?<!999)foo
3647
3648 Notice that each of the assertions is applied independently at the same
3649 point in the subject string. First there is a check that the previous
3650 three characters are all digits, and then there is a check that the
3651 same three characters are not "999". This pattern does not match "foo"
3652 preceded by six characters, the first three of which are digits and the last
3653 three of which are not "999". For example, it does not match "123abc‐
3654 foo". A pattern to do that is the following:
3655
3656 (?<=\d{3}...)(?<!999)foo
3657
3658 This time the first assertion looks at the preceding six characters,
3659 checks that the first three are digits, and then the second assertion
3660 checks that the preceding three characters are not "999".
3661
3662 Assertions can be nested in any combination. For example, the following
3663 matches an occurrence of "baz" that is preceded by "bar", which in turn
3664 is not preceded by "foo":
3665
3666 (?<=(?<!foo)bar)baz
3667
3668 The following pattern matches "foo" preceded by three digits and any
3669 three characters that are not "999":
3670
3671 (?<=\d{3}(?!999)...)foo
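
The four patterns in this section can be compared on a few subjects, as sketched below. The module name multi_assertion_demo and the subjects are illustrative.

```erlang
-module(multi_assertion_demo).
-export([run/0]).

run() ->
    None = [{capture, none}],
    %% Both assertions look at the same three preceding characters.
    P1 = "(?<=\\d{3})(?<!999)foo",
    match   = re:run("123foo", P1, None),
    nomatch = re:run("999foo", P1, None),
    nomatch = re:run("123abcfoo", P1, None),
    %% The first assertion now looks back six characters.
    P2 = "(?<=\\d{3}...)(?<!999)foo",
    match   = re:run("123abcfoo", P2, None),
    nomatch = re:run("123999foo", P2, None),
    %% Nested: "baz" preceded by a "bar" that is not preceded by "foo".
    P3 = "(?<=(?<!foo)bar)baz",
    match   = re:run("barbaz", P3, None),
    nomatch = re:run("foobarbaz", P3, None),
    %% A lookahead nested inside a lookbehind.
    P4 = "(?<=\\d{3}(?!999)...)foo",
    match   = re:run("123abcfoo", P4, None),
    nomatch = re:run("123999foo", P4, None),
    ok.
```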
3672
3673 CONDITIONAL SUBPATTERNS
3674 It is possible to cause the matching process to obey a subpattern con‐
3675 ditionally or to choose between two alternative subpatterns, depending
3676 on the result of an assertion, or whether a specific capturing subpat‐
3677 tern has already been matched. The following are the two possible forms
3678 of conditional subpattern:
3679
3680 (?(condition)yes-pattern)
3681 (?(condition)yes-pattern|no-pattern)
3682
3683 If the condition is satisfied, the yes-pattern is used, otherwise the
3684 no-pattern (if present). If more than two alternatives exist in the
3685 subpattern, a compile-time error occurs. Each of the two alternatives
3686 can itself contain nested subpatterns of any form, including condi‐
3687 tional subpatterns; the restriction to two alternatives applies only at
3688 the level of the condition. The following pattern fragment is an exam‐
3689 ple where the alternatives are complex:
3690
3691 (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
3692
3693 There are four kinds of condition: references to subpatterns, refer‐
3694 ences to recursion, a pseudo-condition called DEFINE, and assertions.
3695
3696 Checking for a Used Subpattern By Number
3697
3698 If the text between the parentheses consists of a sequence of digits,
3699 the condition is true if a capturing subpattern of that number has pre‐
3700 viously matched. If more than one capturing subpattern with the same
3701 number exists (see section Duplicate Subpattern Numbers earlier), the
3702 condition is true if any of them have matched. An alternative notation
3703 is to precede the digits with a plus or minus sign. In this case, the
3704 subpattern number is relative rather than absolute. The most recently
3705 opened parentheses can be referenced by (?(-1), the next most recent by
3706 (?(-2), and so on. Inside loops, it can also make sense to refer to
3707 subsequent groups. The next parentheses to be opened can be referenced
3708 as (?(+1), and so on. (The value zero in any of these forms is not
3709 used; it provokes a compile-time error.)
3710
3711 Consider the following pattern, which contains non-significant white‐
3712 space to make it more readable (assume option extended) and to divide
3713 it into three parts for ease of discussion:
3714
3715 ( \( )? [^()]+ (?(1) \) )
3716
3717 The first part matches an optional opening parenthesis, and if that
3718 character is present, sets it as the first captured substring. The sec‐
3719 ond part matches one or more characters that are not parentheses. The
3720 third part is a conditional subpattern that tests whether the first set
3721 of parentheses matched or not. If they did, that is, if subject started
3722 with an opening parenthesis, the condition is true, and so the yes-pat‐
3723 tern is executed and a closing parenthesis is required. Otherwise, as
3724 no-pattern is not present, the subpattern matches nothing. That is,
3725 this pattern matches a sequence of non-parentheses, optionally enclosed
3726 in parentheses.
3727
3728 If this pattern is embedded in a larger one, a relative reference can
3729 be used:
3730
( \( )? [^()]+ (?(-1) \) )

3731 This makes the fragment independent of the parentheses in the larger
3732 pattern.
3733
3734 Checking for a Used Subpattern By Name
3735
3736 Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
3737 used subpattern by name. For compatibility with earlier versions of
3738 PCRE, which had this facility before Perl, the syntax (?(name)...) is
3739 also recognized.
3740
3741 Rewriting the previous example to use a named subpattern gives:
3742
3743 (?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
3744
3745 If the name used in a condition of this kind is a duplicate, the test
3746 is applied to all subpatterns of the same name, and is true if any one
3747 of them has matched.
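
A small sketch of the optional-parenthesis pattern in its absolute, relative, and named forms. The ^ and $ anchors are an addition made here so that the whole subject must match, and the module name conditional_demo is made up.

```erlang
-module(conditional_demo).
-export([run/0]).

run() ->
    Opts = [extended, {capture, none}],
    Patterns = ["^ ( \\( )? [^()]+ (?(1) \\) ) $",
                "^ ( \\( )? [^()]+ (?(-1) \\) ) $",
                "^ (?<OPEN> \\( )? [^()]+ (?(<OPEN>) \\) ) $"],
    lists:foreach(
      fun(P) ->
              match   = re:run("abc", P, Opts),
              match   = re:run("(abc)", P, Opts),
              %% An opening parenthesis makes the closing one mandatory.
              nomatch = re:run("(abc", P, Opts),
              %% Without an opening one, a closing one is not accepted.
              nomatch = re:run("abc)", P, Opts)
      end, Patterns),
    ok.
```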
3748
3749 Checking for Pattern Recursion
3750
3751 If the condition is the string (R), and there is no subpattern with the
3752 name R, the condition is true if a recursive call to the whole pattern
3753 or any subpattern has been made. If digits or a name preceded by amper‐
3754 sand follow the letter R, for example:
3755
3756 (?(R3)...) or (?(R&name)...)
3757
3758 the condition is true if the most recent recursion is into a subpattern
3759 whose number or name is given. This condition does not check the entire
3760 recursion stack. If the name used in a condition of this kind is a
3761 duplicate, the test is applied to all subpatterns of the same name, and
3762 is true if any one of them is the most recent recursion.
3763
3764 At "top-level", all these recursion test conditions are false. The syn‐
3765 tax for recursive patterns is described below.
3766
3767 Defining Subpatterns for Use By Reference Only
3768
3769 If the condition is the string (DEFINE), and there is no subpattern
3770 with the name DEFINE, the condition is always false. In this case,
3771 there can be only one alternative in the subpattern. It is always
3772 skipped if control reaches this point in the pattern. The idea of
3773 DEFINE is that it can be used to define "subroutines" that can be ref‐
3774 erenced from elsewhere. (The use of subroutines is described below.)
3775 For example, a pattern to match an IPv4 address, such as
3776 "192.168.23.245", can be written like this (ignore whitespace and line
3777 breaks):
3778
3779 (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) \b (?&byte) (\.(?&byte)){3} \b
3780
3781 The first part of the pattern is a DEFINE group, inside which
3782 another group named "byte" is defined. This matches an individual
3783 component of an IPv4 address (a number < 256). When matching takes place,
3784 this part of the pattern is skipped, as DEFINE acts like a false condi‐
3785 tion. The remaining pattern uses references to the named group to match
3786 the four dot-separated components of an IPv4 address, insisting on a
3787 word boundary at each end.
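
The IPv4 pattern can be compiled once and reused, as in this sketch. The module name define_demo and the sample subjects are assumptions for illustration.

```erlang
-module(define_demo).
-export([run/0]).

run() ->
    %% Adjacent string literals are concatenated by the compiler.
    P = "(?(DEFINE) (?<byte> 2[0-4]\\d | 25[0-5] | 1\\d\\d | [1-9]?\\d) )"
        " \\b (?&byte) (\\.(?&byte)){3} \\b",
    {ok, MP} = re:compile(P, [extended]),
    {match, ["192.168.23.245"]} =
        re:run("host 192.168.23.245 up", MP, [{capture, first, list}]),
    %% Only three components, so there is no match.
    nomatch = re:run("1.2.3", MP, [{capture, none}]),
    ok.
```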
3788
3789 Assertion Conditions
3790
3791 If the condition is not in any of the above formats, it must be an
3792 assertion. This can be a positive or negative lookahead or lookbehind
3793 assertion. Consider the following pattern, containing non-significant
3794 whitespace, and with the two alternatives on the second line:
3795
3796 (?(?=[^a-z]*[a-z])
3797 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
3798
3799 The condition is a positive lookahead assertion that matches an
3800 optional sequence of non-letters followed by a letter. That is, it
3801 tests for the presence of at least one letter in the subject. If a let‐
3802 ter is found, the subject is matched against the first alternative,
3803 otherwise it is matched against the second. This pattern matches
3804 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
3805 letters and dd are digits.
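
A minimal sketch of the assertion condition, with ^ and $ added here to force a whole-subject match (module name assertion_condition_demo is hypothetical):

```erlang
-module(assertion_condition_demo).
-export([run/0]).

run() ->
    P = "^(?(?=[^a-z]*[a-z])"
        " \\d{2}-[a-z]{3}-\\d{2} | \\d{2}-\\d{2}-\\d{2} )$",
    Opts = [extended, {capture, none}],
    %% A letter is present, so the first alternative is used.
    match   = re:run("12-abc-34", P, Opts),
    %% No letter, so the second alternative is used.
    match   = re:run("12-34-56", P, Opts),
    %% A letter selects the first alternative, which then fails; the
    %% second alternative is not tried.
    nomatch = re:run("12-ab-34", P, Opts),
    ok.
```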
3806
3807 COMMENTS
3808 There are two ways to include comments in patterns that are processed
3809 by PCRE. In both cases, the start of the comment must not be in a char‐
3810 acter class, or in the middle of any other sequence of related charac‐
3811 ters such as (?: or a subpattern name or number. The characters that
3812 make up a comment play no part in the pattern matching.
3813
3814 The sequence (?# marks the start of a comment that continues up to the
3815 next closing parenthesis. Nested parentheses are not permitted. If
3816 option extended is set, an unescaped # character also introduces a
3817 comment, which in this case continues to immediately after the next
3818 newline character or character sequence in the pattern. Which charac‐
3819 ters are interpreted as newlines is controlled by the options passed to
3820 a compiling function or by a special sequence at the start of the pat‐
3821 tern, as described in section Newline Conventions earlier.
3822
3823 Notice that the end of this type of comment is a literal newline
3824 sequence in the pattern; escape sequences that happen to represent a
3825 newline do not count. For example, consider the following pattern when
3826 extended is set, and the default newline convention is in force:
3827
3828 abc #comment \n still comment
3829
3830 On encountering character #, pcre_compile() skips along, looking for a
3831 newline in the pattern. The sequence \n is still literal at this stage,
3832 so it does not terminate the comment. Only a character with code value
3833 0x0a (the default newline) does so.
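
In Erlang source the distinction shows up as the difference between "\\n" (a backslash and the letter n in the pattern) and "\n" (a real linefeed in the pattern). The sketch below assumes the default newline convention; the module name comment_demo is made up.

```erlang
-module(comment_demo).
-export([run/0]).

run() ->
    %% (?# ... ) comments do not need option extended.
    match = re:run("abc", "ab(?#a comment)c", [{capture, none}]),
    %% "\\n" leaves the two characters \ and n in the pattern, so the
    %% comment runs to the end and only "abc" has to match.
    {match, ["abc"]} = re:run("abc", "abc #comment \\n still comment",
                              [extended, {capture, first, list}]),
    %% "\n" is a literal linefeed, so it ends the comment and "def"
    %% becomes part of the pattern.
    nomatch = re:run("abc", "abc #comment\ndef", [extended, {capture, none}]),
    match   = re:run("abcdef", "abc #comment\ndef", [extended, {capture, none}]),
    ok.
```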
3834
3835 RECURSIVE PATTERNS
3836 Consider the problem of matching a string in parentheses, allowing for
3837 unlimited nested parentheses. Without the use of recursion, the best
3838 that can be done is to use a pattern that matches up to some fixed
3839 depth of nesting. It is not possible to handle an arbitrary nesting
3840 depth.
3841
3842 For some time, Perl has provided a facility that allows regular expres‐
3843 sions to recurse (among other things). It does this by interpolating
3844 Perl code in the expression at runtime, and the code can refer to the
3845 expression itself. A Perl pattern using code interpolation to solve the
3846 parentheses problem can be created like this:
3847
3848 $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
3849
3850 Item (?p{...}) interpolates Perl code at runtime, and in this case
3851 refers recursively to the pattern in which it appears.
3852
3853 Obviously, PCRE cannot support the interpolation of Perl code. Instead,
3854 it supports special syntax for recursion of the entire pattern, and for
3855 individual subpattern recursion. After its introduction in PCRE and
3856 Python, this kind of recursion was later introduced into Perl at
3857 release 5.10.
3858
3859 A special item that consists of (? followed by a number > 0 and a clos‐
3860 ing parenthesis is a recursive subroutine call of the subpattern of the
3861 given number, if it occurs inside that subpattern. (If not, it is a
3862 non-recursive subroutine call, which is described in the next section.)
3863 The special item (?R) or (?0) is a recursive call of the entire regular
3864 expression.
3865
3866 This PCRE pattern solves the nested parentheses problem (assume that
3867 option extended is set so that whitespace is ignored):
3868
3869 \( ( [^()]++ | (?R) )* \)
3870
3871 First it matches an opening parenthesis. Then it matches any number of
3872 substrings, which can either be a sequence of non-parentheses or a
3873 recursive match of the pattern itself (that is, a correctly parenthe‐
3874 sized substring). Finally there is a closing parenthesis. Notice the
3875 use of a possessive quantifier to avoid backtracking into sequences of
3876 non-parentheses.
3877
3878 If this was part of a larger pattern, you would not want to recurse the
3879 entire pattern, so instead you can use:
3880
3881 ( \( ( [^()]++ | (?1) )* \) )
3882
3883 The pattern is here within parentheses so that the recursion refers to
3884 them instead of the whole pattern.
3885
3886 In a larger pattern, keeping track of parenthesis numbers can be
3887 tricky. This is made easier by the use of relative references. Instead
3888 of (?1) in the pattern above, you can write (?-2) to refer to the sec‐
3889 ond most recently opened parentheses preceding the recursion. That is,
3890 a negative number counts capturing parentheses leftwards from the point
3891 at which it is encountered.
3892
3893 It is also possible to refer to later opened parentheses, by writing
3894 references such as (?+2). However, these cannot be recursive, as the
3895 reference is not inside the parentheses that are referenced. They are
3896 always non-recursive subroutine calls, as described in the next sec‐
3897 tion.
3898
3899 An alternative approach is to use named parentheses instead. The Perl
3900 syntax for this is (?&name). The earlier PCRE syntax (?P>name) is also
3901 supported. We can rewrite the above example as follows:
3902
3903 (?<pn> \( ( [^()]++ | (?&pn) )* \) )
3904
3905 If there is more than one subpattern with the same name, the earliest
3906 one is used.
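
The whole-pattern, numbered, relative, and named forms of the recursion can be compared with a sketch such as this one. The module name recursion_demo and the subjects are illustrative.

```erlang
-module(recursion_demo).
-export([run/0]).

run() ->
    Opts = [extended, {capture, first, list}],
    Patterns = ["\\( ( [^()]++ | (?R) )* \\)",
                "( \\( ( [^()]++ | (?1) )* \\) )",
                "( \\( ( [^()]++ | (?-2) )* \\) )",
                "(?<pn> \\( ( [^()]++ | (?&pn) )* \\) )"],
    lists:foreach(
      fun(P) ->
              %% The balanced substring is found inside a longer subject.
              {match, ["(ab(cd)ef)"]} = re:run("x (ab(cd)ef) y", P, Opts),
              %% Unbalanced parentheses give no match.
              nomatch = re:run("(ab(cd", P, Opts)
      end, Patterns),
    ok.
```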
3907
3908 This particular example pattern that we have studied contains nested
3909 unlimited repeats, and so the use of a possessive quantifier for match‐
3910 ing strings of non-parentheses is important when applying the pattern
3911 to strings that do not match. For example, when this pattern is applied
3912 to
3913
3914 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
3915
3916 it gives "no match" quickly. However, if a possessive quantifier is not
3917 used, the match runs for a long time, as there are so many different
3918 ways the + and * repeats can carve up the subject, and all must be
3919 tested before failure can be reported.
3920
3921 At the end of a match, the values of capturing parentheses are those
3922 from the outermost level. If the pattern above is matched against
3923
3924 (ab(cd)ef)
3925
3926 the value for the inner capturing parentheses (numbered 2) is "ef",
3927 which is the last value taken on at the top-level. If a capturing sub‐
3928 pattern is not matched at the top level, its final captured value is
3929 unset, even if it was (temporarily) set at a deeper level during the
3930 matching process.
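
The captured values can be inspected with {capture, all, list}. This sketch uses the numbered form of the pattern, so that the inner group is group 2; the module name recursion_capture_demo is hypothetical.

```erlang
-module(recursion_capture_demo).
-export([run/0]).

run() ->
    P = "( \\( ( [^()]++ | (?1) )* \\) )",
    {match, [Whole, Group1, Group2]} =
        re:run("(ab(cd)ef)", P, [extended, {capture, all, list}]),
    "(ab(cd)ef)" = Whole,
    "(ab(cd)ef)" = Group1,
    %% "ef" is the last value group 2 took at the top level; the "cd"
    %% captured inside the recursive call is not kept.
    "ef" = Group2,
    ok.
```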
3931
3932 Do not confuse item (?R) with condition (R), which tests for recursion.
3933 Consider the following pattern, which matches text in angle brackets,
3934 allowing for arbitrary nesting. Only digits are allowed in nested
3935 brackets (that is, when recursing), while any characters are permitted
3936 at the outer level.
3937
3938 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
3939
3940 Here (?(R) is the start of a conditional subpattern, with two different
3941 alternatives for the recursive and non-recursive cases. Item (?R) is
3942 the actual recursive call.
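
A short sketch of the angle-bracket pattern (module name recursion_condition_demo is made up). re:run/3 is not anchored, so when the outer brackets cannot match, the search moves on and can still find an inner pair at the top level.

```erlang
-module(recursion_condition_demo).
-export([run/0]).

run() ->
    P = "< (?: (?(R) \\d++ | [^<>]*+) | (?R)) * >",
    Opts = [extended, {capture, first, list}],
    %% Digits are accepted in the nested brackets.
    {match, ["<abc<123>def>"]} = re:run("<abc<123>def>", P, Opts),
    %% Letters are not accepted when recursing, so the outer brackets
    %% fail; the leftmost match is the inner "<xyz>", taken at top level.
    {match, ["<xyz>"]} = re:run("<abc<xyz>def>", P, Opts),
    ok.
```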
3943
3944 Differences in Recursion Processing between PCRE and Perl
3945
3946 Recursion processing in PCRE differs from Perl in two important ways.
3947 In PCRE (like Python, but unlike Perl), a recursive subpattern call is
3948 always treated as an atomic group. That is, once it has matched some of
3949 the subject string, it is never re-entered, even if it contains untried
3950 alternatives and there is a subsequent matching failure. This can be
3951 illustrated by the following pattern, which means to match a palin‐
3952 dromic string containing an odd number of characters (for example, "a",
3953 "aba", "abcba", "abcdcba"):
3954
3955 ^(.|(.)(?1)\2)$
3956
3957 The idea is that it either matches a single character, or two identical
3958 characters surrounding a subpalindrome. In Perl, this pattern works; in
3959 PCRE it does not work if the subject string is longer than three characters.
3960 Consider the subject string "abcba".
3961
3962 At the top level, the first character is matched, but as it is not at
3963 the end of the string, the first alternative fails, the second alterna‐
3964 tive is taken, and the recursion kicks in. The recursive call to sub‐
3965 pattern 1 successfully matches the next character ("b"). (Notice that
3966 the beginning and end of line tests are not part of the recursion.)
3967
3968 Back at the top level, the next character ("c") is compared with what
3969 subpattern 2 matched, which was "a". This fails. As the recursion is
3970 treated as an atomic group, there are now no backtracking points, and
3971 so the entire match fails. (Perl can now re-enter the recursion and try
3972 the second alternative.) However, if the pattern is written with the
3973 alternatives in the other order, things are different:
3974
3975 ^((.)(?1)\2|.)$
3976
3977 This time, the recursing alternative is tried first, and continues to
3978 recurse until it runs out of characters, at which point the recursion
3979 fails. But this time we have another alternative to try at the higher
3980 level. That is the significant difference: in the previous case the
3981 remaining alternative is at a deeper recursion level, which PCRE cannot
3982 use.
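
The effect of the order of the alternatives can be observed with the sketch below (module name palindrome_order_demo is made up). The results that depend on recursion being atomic are returned rather than asserted: with the PCRE version described in this document both are expected to be nomatch, but a release built on a different PCRE version may return match.

```erlang
-module(palindrome_order_demo).
-export([run/0]).

run() ->
    None = [{capture, none}],
    SingleFirst    = "^(.|(.)(?1)\\2)$",
    RecursionFirst = "^((.)(?1)\\2|.)$",
    %% Three characters work with either order.
    match = re:run("aba", SingleFirst, None),
    %% With the recursing alternative first, "abcba" is matched.
    match = re:run("abcba", RecursionFirst, None),
    %% Version-dependent results, returned for inspection:
    %% {"abcba" with the single character first,
    %%  "ababa" with the recursion first}.
    {re:run("abcba", SingleFirst, None),
     re:run("ababa", RecursionFirst, None)}.
```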
3983
3984 To change the pattern so that it matches all palindromic strings, not
3985 only those with an odd number of characters, it is tempting to change
3986 the pattern to this:
3987
3988 ^((.)(?1)\2|.?)$
3989
3990 Again, this works in Perl, but not in PCRE, and for the same reason.
3991 When a deeper recursion has matched a single character, it cannot be
3992 entered again to match an empty string. The solution is to separate the
3993 two cases, and write out the odd and even cases as alternatives at the
3994 higher level:
3995
3996 ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
3997
3998 If you want to match typical palindromic phrases, the pattern must
3999 ignore all non-word characters, which can be done as follows:
4000
4001 ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
4002
4003 If run with option caseless, this pattern matches phrases such as "A
4004 man, a plan, a canal: Panama!" and it works well in both PCRE and Perl.
4005 Notice the use of the possessive quantifier *+ to avoid backtracking
4006 into sequences of non-word characters. Without this, PCRE takes much
4007 longer (10 times or more) to match typical phrases, and Perl takes so
4008 long that you think it has gone into a loop.
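
The phrase-matching pattern can be tried as follows. The module name palindrome_phrase_demo and the subjects other than the Panama phrase are chosen for illustration.

```erlang
-module(palindrome_phrase_demo).
-export([run/0]).

run() ->
    %% The pattern is split over two literals only to keep lines short.
    P = "^\\W*+(?:((.)\\W*+(?1)\\W*+\\2|)"
        "|((.)\\W*+(?3)\\W*+\\4|\\W*+.\\W*+))\\W*+$",
    Opts = [caseless, {capture, none}],
    match   = re:run("A man, a plan, a canal: Panama!", P, Opts),
    %% An even number of characters is handled by the first branch.
    match   = re:run("1221", P, Opts),
    nomatch = re:run("The quick brown fox", P, Opts),
    ok.
```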
4009
4010 Note:
4011 The palindrome-matching patterns above work only if the subject string
4012 does not start with a palindrome that is shorter than the entire
4013 string. For example, although "abcba" is correctly matched, if the sub‐
4014 ject is "ababa", PCRE finds palindrome "aba" at the start, and then
4015 fails at top level, as the end of the string does not follow. Once
4016 again, it cannot jump back into the recursion to try other alterna‐
4017 tives, so the entire match fails.
4018
4019
4020 The second way in which PCRE and Perl differ in their recursion
4021 processing is in the handling of captured values. In Perl, when a subpattern
4022 is called recursively or as a subroutine (see the next section),
4023 it has no access to any values that were captured outside the recur‐
4024 sion. In PCRE these values can be referenced. Consider the following
4025 pattern:
4026
4027 ^(.)(\1|a(?2))
4028
4029 In PCRE, it matches "bab". The first capturing parentheses match "b",
4030 then in the second group, when the back reference \1 fails to match
4031 "b", the second alternative matches "a", and then recurses. In the
4032 recursion, \1 does now match "b" and so the whole match succeeds. In
4033 Perl, the pattern fails to match because inside the recursive call \1
4034 cannot access the externally set value.
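
This behavior can be confirmed from Erlang with a one-line check (module name recursion_outer_capture_demo is made up):

```erlang
-module(recursion_outer_capture_demo).
-export([run/0]).

run() ->
    %% Inside the recursive call to group 2, \1 still sees the "b"
    %% that group 1 captured outside the recursion.
    {match, ["bab"]} =
        re:run("bab", "^(.)(\\1|a(?2))", [{capture, first, list}]),
    ok.
```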
4035
4036 SUBPATTERNS AS SUBROUTINES
4037 If the syntax for a recursive subpattern call (either by number or by
4038 name) is used outside the parentheses to which it refers, it operates
4039 like a subroutine in a programming language. The called subpattern can
4040 be defined before or after the reference. A numbered reference can be
4041 absolute or relative, as in the following examples:
4042
4043 (...(absolute)...)...(?2)...
4044 (...(relative)...)...(?-1)...
4045 (...(?+1)...(relative)...
4046
4047 An earlier example pointed out that the following pattern matches
4048 "sense and sensibility" and "response and responsibility", but not
4049 "sense and responsibility":
4050
4051 (sens|respons)e and \1ibility
4052
4053 If instead the following pattern is used, it matches "sense and respon‐
4054 sibility" and the other two strings:
4055
4056 (sens|respons)e and (?1)ibility
4057
4058 Another example is provided in the discussion of DEFINE earlier.
4059
4060 All subroutine calls, recursive or not, are always treated as atomic
4061 groups. That is, once a subroutine has matched some of the subject
4062 string, it is never re-entered, even if it contains untried alterna‐
4063 tives and there is a subsequent matching failure. Any capturing paren‐
4064 theses that are set during the subroutine call revert to their previous
4065 values afterwards.
4066
4067 Processing options such as case-independence are fixed when a subpat‐
4068 tern is defined, so if it is used as a subroutine, such options cannot
4069 be changed for different calls. For example, the following pattern
4070 matches "abcabc" but not "abcABC", as the change of processing option
4071 does not affect the called subpattern:
4072
4073 (abc)(?i:(?-1))
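
The difference between a back reference and a subroutine call, and the fixed processing options of a called subpattern, can be sketched like this (module name subroutine_demo is hypothetical):

```erlang
-module(subroutine_demo).
-export([run/0]).

run() ->
    None = [{capture, none}],
    BackRef = "(sens|respons)e and \\1ibility",
    SubCall = "(sens|respons)e and (?1)ibility",
    %% The back reference must repeat the text matched by group 1.
    nomatch = re:run("sense and responsibility", BackRef, None),
    %% The subroutine call runs the pattern of group 1 again.
    match   = re:run("sense and responsibility", SubCall, None),
    match   = re:run("sense and sensibility", SubCall, None),
    %% The called group keeps the case sensitivity it was defined with.
    match   = re:run("abcabc", "(abc)(?i:(?-1))", None),
    nomatch = re:run("abcABC", "(abc)(?i:(?-1))", None),
    ok.
```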
4074
4075 ONIGURUMA SUBROUTINE SYNTAX
4076 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
4077 name or a number enclosed either in angle brackets or single quotes, is
4078 alternative syntax for referencing a subpattern as a subroutine, possibly
4079 recursively. Here follow two of the examples used above, rewritten
4080 using this syntax:
4081
4082 (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
4083 (sens|respons)e and \g'1'ibility
4084
4085 PCRE supports an extension to Oniguruma: if a number is preceded by a
4086 plus or minus sign, it is taken as a relative reference, for example:
4087
4088 (abc)(?i:\g<-1>)
4089
4090 Notice that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are
4091 not synonymous. The former is a back reference; the latter is a subrou‐
4092 tine call.
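
The distinction can be illustrated with the "sense and responsibility" example. This is a minimal sketch; the module name re_g_syntax_demo and the function demo/0 are hypothetical.

```erlang
-module(re_g_syntax_demo).
-export([demo/0]).

%% \g{1} is a back reference; \g<1> is a subroutine call.
demo() ->
    Subject = "sense and responsibility",
    Perl = re:run(Subject, "(sens|respons)e and \\g{1}ibility",
                  [{capture, none}]),
    Oniguruma = re:run(Subject, "(sens|respons)e and \\g<1>ibility",
                       [{capture, none}]),
    {Perl, Oniguruma}.
%% Expected: {nomatch, match}
```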
4093
4095 Perl 5.10 introduced some "Special Backtracking Control Verbs", which
4096 are still described in the Perl documentation as "experimental and sub‐
4097 ject to change or removal in a future version of Perl". It goes on to
4098 say: "Their usage in production code should be noted to avoid problems
4099 during upgrades." The same remarks apply to the PCRE features described
4100 in this section.
4101
4102 The new verbs make use of what was previously invalid syntax: an open‐
4103 ing parenthesis followed by an asterisk. They are generally of the form
4104 (*VERB) or (*VERB:NAME). Some can take either form, possibly behaving
4105 differently depending on whether a name is present. A name is any
4106 sequence of characters that does not include a closing parenthesis. The
4107 maximum name length is 255 in the 8-bit library and 65535 in the 16-bit
4108 and 32-bit libraries. If the name is empty, that is, if the closing
4109 parenthesis immediately follows the colon, the effect is as if the
4110 colon was not there. Any number of these verbs can occur in a pattern.
4111
4112 The behavior of these verbs in repeated groups, assertions, and in sub‐
4113 patterns called as subroutines (whether or not recursively) is
4114 described below.
4115
4116 Optimizations That Affect Backtracking Verbs
4117
PCRE contains some optimizations that are used to speed up matching by
running some checks at the start of each match attempt. For example, it
can know the minimum length of a matching subject, or that a particular
character must be present. When one of these optimizations bypasses the
running of a match, any included backtracking verbs are not processed.
You can suppress the start-of-match optimizations by setting option
no_start_optimize when calling compile/2 or run/3, or by starting the
pattern with (*NO_START_OPT).
4126
4127 Experiments with Perl suggest that it too has similar optimizations,
4128 sometimes leading to anomalous results.
4129
4130 Verbs That Act Immediately
4131
4132 The following verbs act as soon as they are encountered. They must not
4133 be followed by a name.
4134
4135 (*ACCEPT)
4136
4137 This verb causes the match to end successfully, skipping the remainder
4138 of the pattern. However, when it is inside a subpattern that is called
4139 as a subroutine, only that subpattern is ended successfully. Matching
4140 then continues at the outer level. If (*ACCEPT) is triggered in a posi‐
4141 tive assertion, the assertion succeeds; in a negative assertion, the
4142 assertion fails.
4143
4144 If (*ACCEPT) is inside capturing parentheses, the data so far is cap‐
4145 tured. For example, the following matches "AB", "AAD", or "ACD". When
4146 it matches "AB", "B" is captured by the outer parentheses.
4147
4148 A((?:A|B(*ACCEPT)|C)D)
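
A short sketch of the captures produced by this pattern follows. The module name re_accept_demo and the function demo/0 are hypothetical.

```erlang
-module(re_accept_demo).
-export([demo/0]).

%% For "AB", (*ACCEPT) ends the match inside the outer parentheses,
%% so group 1 holds only "B" and the trailing D is never required.
demo() ->
    Pattern = "A((?:A|B(*ACCEPT)|C)D)",
    [re:run(Subject, Pattern, [{capture, all, list}])
     || Subject <- ["AB", "AAD", "ACD"]].
%% Expected:
%% [{match, ["AB", "B"]}, {match, ["AAD", "AD"]}, {match, ["ACD", "CD"]}]
```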
4149
4150 The following verb causes a matching failure, forcing backtracking to
4151 occur. It is equivalent to (?!) but easier to read.
4152
4153 (*FAIL) or (*F)
4154
4155 The Perl documentation states that it is probably useful only when com‐
4156 bined with (?{}) or (??{}). Those are Perl features that are not
4157 present in PCRE.
4158
The nearest equivalent in PCRE is the callout feature, for example in
the pattern a+(?C)(*FAIL). A match with the string "aaaa" always fails,
but the callout is taken before each backtrack occurs (in this example,
10 times).
4161
4162 Recording Which Path Was Taken
4163
The main purpose of this verb is to track how a match was arrived at,
although it also has a secondary use in conjunction with advancing the
match starting point (see (*SKIP) below).
4167
4168 Note:
4169 In Erlang, there is no interface to retrieve a mark with run/2,3, so
4170 only the secondary purpose is relevant to the Erlang programmer.
4171
4172 The rest of this section is therefore deliberately not adapted for
4173 reading by the Erlang programmer, but the examples can help in under‐
4174 standing NAMES as they can be used by (*SKIP).
4175
4176
4177 (*MARK:NAME) or (*:NAME)
4178
4179 A name is always required with this verb. There can be as many
4180 instances of (*MARK) as you like in a pattern, and their names do not
4181 have to be unique.
4182
4183 When a match succeeds, the name of the last encountered (*MARK:NAME),
4184 (*PRUNE:NAME), or (*THEN:NAME) on the matching path is passed back to
4185 the caller as described in section "Extra data for pcre_exec()" in the
4186 pcreapi documentation. In the following example of pcretest output, the
4187 /K modifier requests the retrieval and outputting of (*MARK) data:
4188
4189 re> /X(*MARK:A)Y|X(*MARK:B)Z/K
4190 data> XY
4191 0: XY
4192 MK: A
4193 XZ
4194 0: XZ
4195 MK: B
4196
4197 The (*MARK) name is tagged with "MK:" in this output, and in this exam‐
4198 ple it indicates which of the two alternatives matched. This is a more
4199 efficient way of obtaining this information than putting each alterna‐
4200 tive in its own capturing parentheses.
4201
4202 If a verb with a name is encountered in a positive assertion that is
4203 true, the name is recorded and passed back if it is the last encoun‐
4204 tered. This does not occur for negative assertions or failing positive
4205 assertions.
4206
4207 After a partial match or a failed match, the last encountered name in
4208 the entire match process is returned, for example:
4209
4210 re> /X(*MARK:A)Y|X(*MARK:B)Z/K
4211 data> XP
4212 No match, mark = B
4213
4214 Notice that in this unanchored example, the mark is retained from the
4215 match attempt that started at letter "X" in the subject. Subsequent
4216 match attempts starting at "P" and then with an empty string do not get
as far as the (*MARK) item, but nevertheless do not reset it.
4218
4219 Verbs That Act after Backtracking
4220
4221 The following verbs do nothing when they are encountered. Matching con‐
4222 tinues with what follows, but if there is no subsequent match, causing
4223 a backtrack to the verb, a failure is forced. That is, backtracking
4224 cannot pass to the left of the verb. However, when one of these verbs
4225 appears inside an atomic group or an assertion that is true, its effect
4226 is confined to that group, as once the group has been matched, there is
4227 never any backtracking into it. In this situation, backtracking can
4228 "jump back" to the left of the entire atomic group or assertion.
(Remember, as stated above, that this localization also applies in
subroutine calls.)
4231
4232 These verbs differ in exactly what kind of failure occurs when back‐
4233 tracking reaches them. The behavior described below is what occurs when
4234 the verb is not in a subroutine or an assertion. Subsequent sections
4235 cover these special cases.
4236
4237 The following verb, which must not be followed by a name, causes the
4238 whole match to fail outright if there is a later matching failure that
4239 causes backtracking to reach it. Even if the pattern is unanchored, no
4240 further attempts to find a match by advancing the starting point take
4241 place.
4242
4243 (*COMMIT)
4244
4245 If (*COMMIT) is the only backtracking verb that is encountered, once it
has been passed, run/2,3 is committed to finding a match at the current
4247 starting point, or not at all, for example:
4248
4249 a+(*COMMIT)b
4250
4251 This matches "xxaab" but not "aacaab". It can be thought of as a kind
4252 of dynamic anchor, or "I've started, so I must finish". The name of the
4253 most recently passed (*MARK) in the path is passed back when (*COMMIT)
4254 forces a match failure.
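
The following sketch compares the pattern with and without (*COMMIT). The module name re_commit_demo and the function demo/0 are hypothetical.

```erlang
-module(re_commit_demo).
-export([demo/0]).

%% In "aacaab" the first attempt passes (*COMMIT) after "aa" and then
%% fails on "c", so no later starting point is tried.
demo() ->
    Opts = [{capture, first, list}],
    {re:run("xxaab", "a+(*COMMIT)b", Opts),
     re:run("aacaab", "a+(*COMMIT)b", Opts),
     %% Same subject without the verb, for comparison.
     re:run("aacaab", "a+b", Opts)}.
%% Expected: {{match, ["aab"]}, nomatch, {match, ["aab"]}}
```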
4255
4256 If more than one backtracking verb exists in a pattern, a different one
4257 that follows (*COMMIT) can be triggered first, so merely passing (*COM‐
4258 MIT) during a match does not always guarantee that a match must be at
4259 this starting point.
4260
4261 Notice that (*COMMIT) at the start of a pattern is not the same as an
4262 anchor, unless the PCRE start-of-match optimizations are turned off, as
4263 shown in the following example:
4264
4265 1> re:run("xyzabc","(*COMMIT)abc",[{capture,all,list}]).
4266 {match,["abc"]}
4267 2> re:run("xyzabc","(*COMMIT)abc",[{capture,all,list},no_start_optimize]).
4268 nomatch
4269
4270 For this pattern, PCRE knows that any match must start with "a", so the
4271 optimization skips along the subject to "a" before applying the pattern
to the first set of data. The match attempt then succeeds. In the second
call, the no_start_optimize option disables the optimization that skips
along to the first character. The pattern is now applied starting at
4275 "x", and so the (*COMMIT) causes the match to fail without trying any
4276 other starting points.
4277
4278 The following verb causes the match to fail at the current starting
4279 position in the subject if there is a later matching failure that
4280 causes backtracking to reach it:
4281
4282 (*PRUNE) or (*PRUNE:NAME)
4283
4284 If the pattern is unanchored, the normal "bumpalong" advance to the
4285 next starting character then occurs. Backtracking can occur as usual to
4286 the left of (*PRUNE), before it is reached, or when matching to the
4287 right of (*PRUNE), but if there is no match to the right, backtracking
4288 cannot cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an
4289 alternative to an atomic group or possessive quantifier, but there are
4290 some uses of (*PRUNE) that cannot be expressed in any other way. In an
4291 anchored pattern, (*PRUNE) has the same effect as (*COMMIT).
4292
The behavior of (*PRUNE:NAME) is not the same as
4294 (*MARK:NAME)(*PRUNE). It is like (*MARK:NAME) in that the name is
4295 remembered for passing back to the caller. However, (*SKIP:NAME)
4296 searches only for names set with (*MARK).
4297
4298 Note:
4299 The fact that (*PRUNE:NAME) remembers the name is useless to the Erlang
4300 programmer, as names cannot be retrieved.
4301
4302
4303 The following verb, when specified without a name, is like (*PRUNE),
4304 except that if the pattern is unanchored, the "bumpalong" advance is
4305 not to the next character, but to the position in the subject where
4306 (*SKIP) was encountered.
4307
4308 (*SKIP)
4309
4310 (*SKIP) signifies that whatever text was matched leading up to it can‐
4311 not be part of a successful match. Consider:
4312
4313 a+(*SKIP)b
4314
4315 If the subject is "aaaac...", after the first match attempt fails
4316 (starting at the first character in the string), the starting point
4317 skips on to start the next attempt at "c". Notice that a possessive
4318 quantifier does not have the same effect as this example; although it
4319 would suppress backtracking during the first match attempt, the second
4320 attempt would start at the second character instead of skipping on to
4321 "c".
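
The difference in the "bumpalong" position can be made visible with a pattern whose second alternative would match from a skipped position. The patterns below are not from the PCRE documentation; they are illustrative assumptions, as are the module name re_skip_demo and the function demo/0.

```erlang
-module(re_skip_demo).
-export([demo/0]).

%% Subject "aac": the first attempt matches "aa", passes the verb,
%% and fails on "b".
%% (*PRUNE): the next attempt starts at offset 1, where ".c" matches "ac".
%% (*SKIP):  the next attempt starts at offset 2 (where the verb was
%%           passed), so "ac" is never tried.
demo() ->
    Opts = [{capture, first, list}],
    {re:run("aac", "aa(*PRUNE)b|.c", Opts),
     re:run("aac", "aa(*SKIP)b|.c", Opts)}.
%% Expected: {{match, ["ac"]}, nomatch}
```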
4322
4323 When (*SKIP) has an associated name, its behavior is modified:
4324
4325 (*SKIP:NAME)
4326
4327 When this is triggered, the previous path through the pattern is
4328 searched for the most recent (*MARK) that has the same name. If one is
4329 found, the "bumpalong" advance is to the subject position that corre‐
4330 sponds to that (*MARK) instead of to where (*SKIP) was encountered. If
4331 no (*MARK) with a matching name is found, (*SKIP) is ignored.
4332
4333 Notice that (*SKIP:NAME) searches only for names set by (*MARK:NAME).
4334 It ignores names that are set by (*PRUNE:NAME) or (*THEN:NAME).
4335
4336 The following verb causes a skip to the next innermost alternative when
4337 backtracking reaches it. That is, it cancels any further backtracking
4338 within the current alternative.
4339
4340 (*THEN) or (*THEN:NAME)
4341
4342 The verb name comes from the observation that it can be used for a pat‐
4343 tern-based if-then-else block:
4344
4345 ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
4346
4347 If the COND1 pattern matches, FOO is tried (and possibly further items
4348 after the end of the group if FOO succeeds). On failure, the matcher
4349 skips to the second alternative and tries COND2, without backtracking
4350 into COND1. If that succeeds and BAR fails, COND3 is tried. If BAZ then
4351 fails, there are no more alternatives, so there is a backtrack to what‐
4352 ever came before the entire group. If (*THEN) is not inside an alterna‐
4353 tion, it acts like (*PRUNE).
4354
The behavior of (*THEN:NAME) is not the same as
4356 (*MARK:NAME)(*THEN). It is like (*MARK:NAME) in that the name is remem‐
4357 bered for passing back to the caller. However, (*SKIP:NAME) searches
4358 only for names set with (*MARK).
4359
4360 Note:
4361 The fact that (*THEN:NAME) remembers the name is useless to the Erlang
4362 programmer, as names cannot be retrieved.
4363
4364
4365 A subpattern that does not contain a | character is just a part of the
4366 enclosing alternative; it is not a nested alternation with only one
4367 alternative. The effect of (*THEN) extends beyond such a subpattern to
4368 the enclosing alternative. Consider the following pattern, where A, B,
4369 and so on, are complex pattern fragments that do not contain any |
4370 characters at this level:
4371
4372 A (B(*THEN)C) | D
4373
4374 If A and B are matched, but there is a failure in C, matching does not
4375 backtrack into A; instead it moves to the next alternative, that is, D.
4376 However, if the subpattern containing (*THEN) is given an alternative,
4377 it behaves differently:
4378
4379 A (B(*THEN)C | (*FAIL)) | D
4380
4381 The effect of (*THEN) is now confined to the inner subpattern. After a
4382 failure in C, matching moves to (*FAIL), which causes the whole subpat‐
4383 tern to fail, as there are no more alternatives to try. In this case,
4384 matching does now backtrack into A.
4385
4386 Notice that a conditional subpattern is not considered as having two
4387 alternatives, as only one is ever used. That is, the | character in a
4388 conditional subpattern has a different meaning. Ignoring whitespace,
4389 consider:
4390
4391 ^.*? (?(?=a) a | b(*THEN)c )
4392
4393 If the subject is "ba", this pattern does not match. As .*? is
4394 ungreedy, it initially matches zero characters. The condition (?=a)
4395 then fails, the character "b" is matched, but "c" is not. At this
4396 point, matching does not backtrack to .*? as can perhaps be expected
4397 from the presence of the | character. The conditional subpattern is
4398 part of the single alternative that comprises the whole pattern, and so
4399 the match fails. (If there was a backtrack into .*?, allowing it to
4400 match "b", the match would succeed.)
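
The conditional example can be tried as follows, with the whitespace removed from the pattern. This is a sketch only; the module name re_then_demo and the function demo/0 are hypothetical, and the second pattern (without the verb) is added here for comparison.

```erlang
-module(re_then_demo).
-export([demo/0]).

%% With (*THEN), the failure of "c" does not backtrack into .*?,
%% so "ba" is rejected. Without the verb, .*? can grow to "b",
%% the condition (?=a) becomes true, and "ba" matches.
demo() ->
    WithThen = re:run("ba", "^.*?(?(?=a)a|b(*THEN)c)", [{capture, none}]),
    WithoutThen = re:run("ba", "^.*?(?(?=a)a|bc)", [{capture, none}]),
    {WithThen, WithoutThen}.
%% Expected: {nomatch, match}
```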
4401
4402 The verbs described above provide four different "strengths" of control
4403 when subsequent matching fails:
4404
4405 * (*THEN) is the weakest, carrying on the match at the next alterna‐
4406 tive.
4407
* (*PRUNE) comes next, failing the match at the current starting
position, but allowing an advance to the next character (for an
unanchored pattern).
4411
4412 * (*SKIP) is similar, except that the advance can be more than one
4413 character.
4414
4415 * (*COMMIT) is the strongest, causing the entire match to fail.
4416
4417 More than One Backtracking Verb
4418
4419 If more than one backtracking verb is present in a pattern, the one
4420 that is backtracked onto first acts. For example, consider the follow‐
4421 ing pattern, where A, B, and so on, are complex pattern fragments:
4422
4423 (A(*COMMIT)B(*THEN)C|ABD)
4424
4425 If A matches but B fails, the backtrack to (*COMMIT) causes the entire
4426 match to fail. However, if A and B match, but C fails, the backtrack to
4427 (*THEN) causes the next alternative (ABD) to be tried. This behavior is
4428 consistent, but is not always the same as in Perl. It means that if two
or more backtracking verbs appear in succession, all but the last of
them have no effect. Consider the following example:

...(*COMMIT)(*PRUNE)...

4432 If there is a matching failure to the right, backtracking onto (*PRUNE)
4433 causes it to be triggered, and its action is taken. There can never be
4434 a backtrack onto (*COMMIT).
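
As a concrete sketch of which verb acts first, the fragments A, B, C, and D in (A(*COMMIT)B(*THEN)C|ABD) can be replaced by single letters. The subjects, the module name re_multi_verb_demo, and the function demo/0 are assumptions made for illustration.

```erlang
-module(re_multi_verb_demo).
-export([demo/0]).

%% "abd":    a and b match, c fails; the backtrack reaches (*THEN) first,
%%           so the second alternative "abd" is tried and matches.
%% "ax abd": a matches, b fails; the backtrack reaches (*COMMIT), so the
%%           whole match fails although "abd" occurs later in the subject.
demo() ->
    Pattern = "(a(*COMMIT)b(*THEN)c|abd)",
    Opts = [{capture, first, list}],
    {re:run("abd", Pattern, Opts),
     re:run("ax abd", Pattern, Opts)}.
%% Expected: {{match, ["abd"]}, nomatch}
```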
4435
4436 Backtracking Verbs in Repeated Groups
4437
4438 PCRE differs from Perl in its handling of backtracking verbs in
4439 repeated groups. For example, consider:
4440
4441 /(a(*COMMIT)b)+ac/
4442
4443 If the subject is "abac", Perl matches, but PCRE fails because the
4444 (*COMMIT) in the second repeat of the group acts.
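
The PCRE behavior described here can be observed from Erlang with the following sketch. The module name re_repeat_demo and the function demo/0 are hypothetical; the second pattern omits the verb for comparison.

```erlang
-module(re_repeat_demo).
-export([demo/0]).

%% On "abac", the second repeat of the group matches "a", passes
%% (*COMMIT), and fails on "c"; backtracking onto the verb then
%% fails the whole match.
demo() ->
    {re:run("abac", "(a(*COMMIT)b)+ac", [{capture, none}]),
     re:run("abac", "(ab)+ac", [{capture, none}])}.
%% Expected: {nomatch, match}
```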
4445
4446 Backtracking Verbs in Assertions
4447
4448 (*FAIL) in an assertion has its normal effect: it forces an immediate
4449 backtrack.
4450
4451 (*ACCEPT) in a positive assertion causes the assertion to succeed with‐
4452 out any further processing. In a negative assertion, (*ACCEPT) causes
4453 the assertion to fail without any further processing.
4454
4455 The other backtracking verbs are not treated specially if they appear
4456 in a positive assertion. In particular, (*THEN) skips to the next
4457 alternative in the innermost enclosing group that has alternations,
regardless of whether this is within the assertion.
4459
4460 Negative assertions are, however, different, to ensure that changing a
4461 positive assertion into a negative assertion changes its result. Back‐
4462 tracking into (*COMMIT), (*SKIP), or (*PRUNE) causes a negative asser‐
4463 tion to be true, without considering any further alternative branches
4464 in the assertion. Backtracking into (*THEN) causes it to skip to the
4465 next enclosing alternative within the assertion (the normal behavior),
4466 but if the assertion does not have such an alternative, (*THEN) behaves
4467 like (*PRUNE).
4468
4469 Backtracking Verbs in Subroutines
4470
These behaviors occur regardless of whether the subpattern is called
recursively. The treatment of subroutines in Perl is different in some
cases.
4474
4475 * (*FAIL) in a subpattern called as a subroutine has its normal
4476 effect: it forces an immediate backtrack.
4477
4478 * (*ACCEPT) in a subpattern called as a subroutine causes the subrou‐
4479 tine match to succeed without any further processing. Matching then
4480 continues after the subroutine call.
4481
4482 * (*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as a sub‐
4483 routine cause the subroutine match to fail.
4484
4485 * (*THEN) skips to the next alternative in the innermost enclosing
4486 group within the subpattern that has alternatives. If there is no
4487 such group within the subpattern, (*THEN) causes the subroutine
4488 match to fail.
4489
4490Ericsson AB stdlib 3.10 re(3)