re_syntax(n)                 Tcl Built-In Commands                re_syntax(n)

______________________________________________________________________________

NAME

       re_syntax - Syntax of Tcl regular expressions.
_________________________________________________________________

DESCRIPTION

       A regular expression describes strings of characters.  It's a pattern
       that matches certain strings and doesn't match others.

DIFFERENT FLAVORS OF REs

       Regular expressions (``RE''s), as defined by POSIX, come in two
       flavors: extended REs (``EREs'') and basic REs (``BREs'').  EREs are
       roughly those of the traditional egrep, while BREs are roughly those of
       the traditional ed.  This implementation adds a third flavor, advanced
       REs (``AREs''), basically EREs with some significant extensions.

       This manual page primarily describes AREs.  BREs mostly exist for
       backward compatibility in some old programs; they will be discussed at
       the end.  POSIX EREs are almost an exact subset of AREs.  Features of
       AREs that are not present in EREs will be indicated.

REGULAR EXPRESSION SYNTAX

       Tcl regular expressions are implemented using the package written by
       Henry Spencer, based on the 1003.2 spec and some (not quite all) of the
       Perl5 extensions (thanks, Henry!).  Much of the description of regular
       expressions below is copied verbatim from his manual entry.

       An ARE is one or more branches, separated by `|', matching anything
       that matches any of the branches.

       A branch is zero or more constraints or quantified atoms, concatenated.
       It matches a match for the first, followed by a match for the second,
       etc; an empty branch matches the empty string.

       A quantified atom is an atom possibly followed by a single quantifier.
       Without a quantifier, it matches a match for the atom.  The
       quantifiers, and what a so-quantified atom matches, are:

         *     a sequence of 0 or more matches of the atom

         +     a sequence of 1 or more matches of the atom

         ?     a sequence of 0 or 1 matches of the atom

         {m}   a sequence of exactly m matches of the atom

         {m,}  a sequence of m or more matches of the atom

         {m,n} a sequence of m through n (inclusive) matches of the atom; m
               may not exceed n

         *?  +?  ??  {m}?  {m,}?  {m,n}?
               non-greedy quantifiers, which match the same possibilities, but
               prefer the smallest number rather than the largest number of
               matches (see MATCHING)

       The forms using { and } are known as bounds.  The numbers m and n are
       unsigned decimal integers with permissible values from 0 to 255
       inclusive.
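
       As a brief illustration of the quantifiers and bounds above, the
       following sketch uses Tcl's regexp command.  The sample string and the
       variable names are invented for this example.

```tcl
# Greedy versus non-greedy quantifiers and bounds on a made-up sample string.
set s "caaat"

regexp {a+}     $s greedy    ;# prefers the longest run of a's
regexp {a+?}    $s lazy      ;# prefers the shortest run of a's
regexp {a{2}}   $s exact     ;# exactly two a's
regexp {a{1,2}} $s bounded   ;# one or two a's, preferring two

puts [list $greedy $lazy $exact $bounded]   ;# prints: aaa a aa aa
```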

       An atom is one of:

         (re)  (where re is any regular expression) matches a match for re,
               with the match noted for possible reporting

         (?:re)
               as previous, but does no reporting (a ``non-capturing'' set of
               parentheses)

         ()    matches an empty string, noted for possible reporting

         (?:)  matches an empty string, without reporting

         [chars]
               a bracket expression, matching any one of the chars (see
               BRACKET EXPRESSIONS for more detail)

         .     matches any single character

         \k    (where k is a non-alphanumeric character) matches that
               character taken as an ordinary character, e.g. \\ matches a
               backslash character

         \c    where c is alphanumeric (possibly followed by other
               characters), an escape (AREs only), see ESCAPES below

         {     when followed by a character other than a digit, matches the
               left-brace character `{'; when followed by a digit, it is the
               beginning of a bound (see above)

         x     where x is a single character with no other significance,
               matches that character.
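
       The difference between capturing and non-capturing parentheses can be
       sketched as follows.  The date string and variable names are invented
       for illustration; regexp stores each reported subexpression in the
       corresponding variable.

```tcl
# Capturing versus non-capturing parentheses on a made-up date string.
set date "2024-05-17"

# Only the (re) groups are reported; the month inside (?:re) is matched
# but is not assigned to any variable.
regexp {(\d+)-(?:\d+)-(\d+)} $date whole year day

puts "whole=$whole year=$year day=$day"
# prints: whole=2024-05-17 year=2024 day=17
```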

       A constraint matches an empty string when specific conditions are met.
       A constraint may not be followed by a quantifier.  The simple
       constraints are as follows; some more constraints are described later,
       under ESCAPES.

         ^       matches at the beginning of a line

         $       matches at the end of a line

         (?=re)  positive lookahead (AREs only), matches at any point where a
                 substring matching re begins

         (?!re)  negative lookahead (AREs only), matches at any point where no
                 substring matching re begins

       The lookahead constraints may not contain back references (see later),
       and all parentheses within them are considered non-capturing.
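
       A small sketch of the two lookahead constraints follows.  The input
       strings are invented; the point is that the lookahead text is tested
       but never becomes part of the match.

```tcl
# Positive lookahead: match a name only where ".tcl" follows, without
# including ".tcl" in the match.
regexp {\w+(?=\.tcl)} "script.tcl" stem

# Negative lookahead: match words starting with "foo" unless "bar" is next.
set words [regexp -inline -all {foo(?!bar)\w*} "foobar foobaz"]

puts $stem    ;# prints: script
puts $words   ;# prints: foobaz
```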

       An RE may not end with `\'.

BRACKET EXPRESSIONS

       A bracket expression is a list of characters enclosed in `[]'.  It
       normally matches any single character from the list (but see below).
       If the list begins with `^', it matches any single character (but see
       below) not from the rest of the list.

       If two characters in the list are separated by `-', this is shorthand
       for the full range of characters between those two (inclusive) in the
       collating sequence, e.g. [0-9] in ASCII matches any decimal digit.
       Two ranges may not share an endpoint, so e.g. a-c-e is illegal.
       Ranges are very collating-sequence-dependent, and portable programs
       should avoid relying on them.

       To include a literal ] or - in the list, the simplest method is to
       enclose it in [. and .] to make it a collating element (see below).
       Alternatively, make it the first character (following a possible `^'),
       or (AREs only) precede it with `\'.  Alternatively, for `-', make it
       the last character, or the second endpoint of a range.  To use a
       literal - as the first endpoint of a range, make it a collating element
       or (AREs only) precede it with `\'.  With the exception of these, some
       combinations using [ (see next paragraphs), and escapes, all other
       special characters lose their special significance within a bracket
       expression.

       Within a bracket expression, a collating element (a character, a
       multi-character sequence that collates as if it were a single
       character, or a collating-sequence name for either) enclosed in [. and
       .] stands for the sequence of characters of that collating element.
       The sequence is a single element of the bracket expression's list.  A
       bracket expression in a locale that has multi-character collating
       elements can thus match more than one character.  So (insidiously), a
       bracket expression that starts with ^ can match multi-character
       collating elements even if none of them appear in the bracket
       expression!  (Note: Tcl currently has no multi-character collating
       elements.  This information is only for illustration.)

       For example, assume the collating sequence includes a ch
       multi-character collating element.  Then the RE [[.ch.]]*c (zero or
       more ch's followed by c) matches the first five characters of
       `chchcc'.  Also, the RE [^c]b matches all of `chb' (because [^c]
       matches the multi-character ch).

       Within a bracket expression, a collating element enclosed in [= and =]
       is an equivalence class, standing for the sequences of characters of
       all collating elements equivalent to that one, including itself.  (If
       there are no other equivalent collating elements, the treatment is as
       if the enclosing delimiters were `[.' and `.]'.)  For example, if o and
       ô are the members of an equivalence class, then `[[=o=]]', `[[=ô=]]',
       and `[oô]' are all synonymous.  An equivalence class may not be an
       endpoint of a range.  (Note: Tcl currently implements only the Unicode
       locale.  It doesn't define any equivalence classes.  The examples above
       are just illustrations.)

       Within a bracket expression, the name of a character class enclosed in
       [: and :] stands for the list of all characters (not all collating
       elements!) belonging to that class.  Standard character classes are:

              alpha       A letter.
              upper       An upper-case letter.
              lower       A lower-case letter.
              digit       A decimal digit.
              xdigit      A hexadecimal digit.
              alnum       An alphanumeric (letter or digit).
              print       An alphanumeric (same as alnum).
              blank       A space or tab character.
              space       A character producing white space in displayed text.
              punct       A punctuation character.
              graph       A character with a visible representation.
              cntrl       A control character.

       A locale may provide others.  (Note that the current Tcl implementation
       has only one locale: the Unicode locale.)  A character class may not be
       used as an endpoint of a range.
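
       The following sketch exercises a few of the bracket-expression forms
       described above: character classes, a complemented list, and a literal
       ] placed first in the list.  The input strings and variable names are
       invented for illustration.

```tcl
# Bracket expressions on made-up input.
set line "  x_1 = 42;"

# An identifier: a letter or underscore, then letters, digits, underscores.
regexp {[[:alpha:]_][[:alnum:]_]*} $line ident

# A complemented list: the first run of characters that are not white space.
regexp {[^[:space:]]+} $line token

# A ] that comes first in the list is an ordinary member of the list.
regexp {[]x]+} "a]x]b" run

puts [list $ident $token $run]   ;# prints: x_1 x_1 \]x\]
```

       Note that puts shows the last element with backslashes only because
       Tcl quotes the brackets when formatting the list; the value of run
       itself is the three characters ]x].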

       There are two special cases of bracket expressions: the bracket
       expressions [[:<:]] and [[:>:]] are constraints, matching empty strings
       at the beginning and end of a word respectively.  A word is defined as
       a sequence of word characters that is neither preceded nor followed by
       word characters.  A word character is an alnum character or an
       underscore (_).  These special bracket expressions are deprecated;
       users of AREs should use constraint escapes instead (see below).

ESCAPES

       Escapes (AREs only), which begin with a \ followed by an alphanumeric
       character, come in several varieties: character entry, class
       shorthands, constraint escapes, and back references.  A \ followed by
       an alphanumeric character but not constituting a valid escape is
       illegal in AREs.  In EREs, there are no escapes: outside a bracket
       expression, a \ followed by an alphanumeric character merely stands for
       that character as an ordinary character, and inside a bracket
       expression, \ is an ordinary character.  (The latter is the one actual
       incompatibility between EREs and AREs.)

       Character-entry escapes (AREs only) exist to make it easier to specify
       non-printing and otherwise inconvenient characters in REs:

         \a   alert (bell) character, as in C

         \b   backspace, as in C

         \B   synonym for \ to help reduce backslash doubling in some
              applications where there are multiple levels of backslash
              processing

         \cX  (where X is any character) the character whose low-order 5 bits
              are the same as those of X, and whose other bits are all zero

         \e   the character whose collating-sequence name is `ESC', or failing
              that, the character with octal value 033

         \f   formfeed, as in C

         \n   newline, as in C

         \r   carriage return, as in C

         \t   horizontal tab, as in C

         \uwxyz
              (where wxyz is exactly four hexadecimal digits) the Unicode
              character U+wxyz in the local byte ordering

         \Ustuvwxyz
              (where stuvwxyz is exactly eight hexadecimal digits) reserved
              for a somewhat-hypothetical Unicode extension to 32 bits
         \v   vertical tab, as in C

         \xhhh
              (where hhh is any sequence of hexadecimal digits) the character
              whose hexadecimal value is 0xhhh (a single character no matter
              how many hexadecimal digits are used).

         \0   the character whose value is 0

         \xy  (where xy is exactly two octal digits, and is not a back
              reference (see below)) the character whose octal value is 0xy

         \xyz (where xyz is exactly three octal digits, and is not a back
              reference (see below)) the character whose octal value is 0xyz

       Hexadecimal digits are `0'-`9', `a'-`f', and `A'-`F'.  Octal digits are
       `0'-`7'.

       The character-entry escapes are always taken as ordinary characters.
       For example, \135 is ] in ASCII, but \135 does not terminate a bracket
       expression.  Beware, however, that some applications (e.g., C
       compilers) interpret such sequences themselves before the
       regular-expression package gets to see them, which may require doubling
       (quadrupling, etc.) the `\'.
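
       A short sketch of character-entry escapes follows.  The patterns are
       brace-quoted so that the escapes reach the RE package untouched, while
       the subject string uses Tcl's own backslash substitution.  The strings
       and variable names are invented for illustration.

```tcl
# The subject contains a real tab, produced by Tcl's backslash substitution.
set subject "A\tB"

# In braces, \x41, \t and \u0042 are interpreted by the RE package itself.
set ok [regexp {\x41\t\u0042} $subject]

# \135 denotes ], but as an escape it does not close the bracket expression.
regexp {[a\135]+} "x]a]y" run

puts $ok   ;# prints: 1
```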

       Class-shorthand escapes (AREs only) provide shorthands for certain
       commonly-used character classes:

         \d        [[:digit:]]

         \s        [[:space:]]

         \w        [[:alnum:]_] (note underscore)

         \D        [^[:digit:]]

         \S        [^[:space:]]

         \W        [^[:alnum:]_] (note underscore)

       Within bracket expressions, `\d', `\s', and `\w' lose their outer
       brackets, and `\D', `\S', and `\W' are illegal.  (So, for example,
       [a-c\d] is equivalent to [a-c[:digit:]].  Also, [a-c\D], which is
       equivalent to [a-c^[:digit:]], is illegal.)

       A constraint escape (AREs only) is a constraint, matching the empty
       string if specific conditions are met, written as an escape:

         \A    matches only at the beginning of the string (see MATCHING,
               below, for how this differs from `^')

         \m    matches only at the beginning of a word

         \M    matches only at the end of a word

         \y    matches only at the beginning or end of a word

         \Y    matches only at a point that is not the beginning or end of a
               word

         \Z    matches only at the end of the string (see MATCHING, below, for
               how this differs from `$')

         \m    (where m is a nonzero digit) a back reference, see below

         \mnn  (where m is a nonzero digit, and nn is some more digits, and
               the decimal value mnn is not greater than the number of closing
               capturing parentheses seen so far) a back reference, see below

       A word is defined as in the specification of [[:<:]] and [[:>:]] above.
       Constraint escapes are illegal within bracket expressions.
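
       The class shorthands and word constraints can be tried out with a
       sketch like the one below.  The sample text and variable names are
       invented; -inline and -all are switches of the regexp command.

```tcl
set text "foo bar_1 baz42"

# \m and \M anchor at the start and end of a word; \w includes underscore.
set words [regexp -inline -all {\m\w+\M} $text]

# \d is shorthand for [[:digit:]].
set digits [regexp -inline -all {\d+} $text]

# \A matches only at the very beginning of the string.
set first [regexp -inline {\A\w+} $text]

puts $words    ;# prints: foo bar_1 baz42
puts $digits   ;# prints: 1 42
puts $first    ;# prints: foo
```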

       A back reference (AREs only) matches the same string matched by the
       parenthesized subexpression specified by the number, so that (e.g.)
       ([bc])\1 matches bb or cc but not `bc'.  The subexpression must
       entirely precede the back reference in the RE.  Subexpressions are
       numbered in the order of their leading parentheses.  Non-capturing
       parentheses do not define subexpressions.
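
       The back-reference example from the paragraph above can be checked
       with a sketch such as this one; the subject strings and variable names
       are invented for illustration.

```tcl
# ([bc])\1 needs the same character twice: "bb" or "cc", never "bc".
set r1 [regexp {([bc])\1} "abcc" m1]   ;# r1 is 1, m1 is "cc"
set r2 [regexp {([bc])\1} "abcb"]      ;# r2 is 0: no doubled b or c

# Non-capturing parentheses are not numbered, so \1 refers to (\w+).
regexp {(?:the) (\w+) \1} "the cat cat sat" m2 word

puts [list $r1 $m1 $r2 $m2 $word]
# prints: 1 cc 0 {the cat cat} cat
```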

       There is an inherent historical ambiguity between octal character-entry
       escapes and back references, which is resolved by heuristics, as hinted
       at above.  A leading zero always indicates an octal escape.  A single
       non-zero digit, not followed by another digit, is always taken as a
       back reference.  A multi-digit sequence not starting with a zero is
       taken as a back reference if it comes after a suitable subexpression
       (i.e. the number is in the legal range for a back reference), and
       otherwise is taken as octal.

METASYNTAX

       In addition to the main syntax described above, there are some special
       forms and miscellaneous syntactic facilities available.

       Normally the flavor of RE being used is specified by
       application-dependent means.  However, this can be overridden by a
       director.  If an RE of any flavor begins with `***:', the rest of the
       RE is an ARE.  If an RE of any flavor begins with `***=', the rest of
       the RE is taken to be a literal string, with all characters considered
       ordinary characters.
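
       A minimal sketch of the two directors follows, assuming the regexp
       command is used to do the matching; the patterns and subject strings
       are invented for illustration.

```tcl
# ***= makes the rest of the RE a literal string, so "." is not a wildcard.
set lit1 [regexp {***=a.c} "a.c"]   ;# 1: the literal text a.c is present
set lit2 [regexp {***=a.c} "abc"]   ;# 0: "." does not match "b" here

# ***: declares the rest of the RE to be an ARE, where "." is a wildcard.
set are [regexp {***:a.c} "abc"]    ;# 1

puts [list $lit1 $lit2 $are]   ;# prints: 1 0 1
```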

       An ARE may begin with embedded options: a sequence (?xyz) (where xyz is
       one or more alphabetic characters) specifies options affecting the rest
       of the RE.  These supplement, and can override, any options specified
       by the application.  The available option letters are:

         b  rest of RE is a BRE

         c  case-sensitive matching (usual default)

         e  rest of RE is an ERE

         i  case-insensitive matching (see MATCHING, below)

         m  historical synonym for n

         n  newline-sensitive matching (see MATCHING, below)

         p  partial newline-sensitive matching (see MATCHING, below)

         q  rest of RE is a literal (``quoted'') string, all ordinary
            characters

         s  non-newline-sensitive matching (usual default)

         t  tight syntax (usual default; see below)

         w  inverse partial newline-sensitive (``weird'') matching (see
            MATCHING, below)

         x  expanded syntax (see below)

       Embedded options take effect at the ) terminating the sequence.  They
       are available only at the start of an ARE, and may not be used later
       within it.
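
       Two of the embedded options, i and q, are sketched below.  The subject
       strings and variable names are invented for illustration.

```tcl
# (?i) turns on case-insensitive matching for the rest of the RE.
set ci [regexp {(?i)hello} "Say HELLO"]   ;# 1
set cs [regexp {hello} "Say HELLO"]       ;# 0: default is case-sensitive

# (?q) makes the rest of the RE a literal string, so "+" is ordinary.
set q1 [regexp {(?q)a+b} "xa+by"]         ;# 1: the text a+b is present
set q2 [regexp {(?q)a+b} "aab"]           ;# 0: "+" is not a quantifier

puts [list $ci $cs $q1 $q2]   ;# prints: 1 0 1 0
```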

       In addition to the usual (tight) RE syntax, in which all characters are
       significant, there is an expanded syntax, available in all flavors of
       RE with the -expanded switch, or in AREs with the embedded x option.
       In the expanded syntax, white-space characters are ignored and all
       characters between a # and the following newline (or the end of the RE)
       are ignored, permitting paragraphing and commenting a complex RE.
       There are three exceptions to that basic rule:

         a white-space character or `#' preceded by `\' is retained

         white space or `#' within a bracket expression is retained

         white space and comments are illegal within multi-character symbols
         like the ARE `(?:' or the BRE `\('

       Expanded-syntax white-space characters are blank, tab, newline, and any
       character that belongs to the space character class.
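
       The expanded syntax might be used as in the sketch below, which writes
       the same decimal-number pattern once with the embedded x option and
       once with the -expanded switch.  The pattern, subject string, and
       variable names are invented for illustration.

```tcl
# Embedded x option: white space and #-comments inside the RE are ignored.
set re {(?x)
    \d+     # integer part
    \.      # decimal point
    \d+     # fractional part
}
regexp $re "pi is about 3.14159" number

# The same pattern using the -expanded switch instead of (?x).
regexp -expanded {
    \d+ \. \d+   # a decimal number
} "pi is about 3.14159" number2

puts [list $number $number2]   ;# prints: 3.14159 3.14159
```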

       Finally, in an ARE, outside bracket expressions, the sequence `(?#ttt)'
       (where ttt is any text not containing a `)') is a comment, completely
       ignored.  Again, this is not allowed between the characters of
       multi-character symbols like `(?:'.  Such comments are more a
       historical artifact than a useful facility, and their use is
       deprecated; use the expanded syntax instead.

       None of these metasyntax extensions is available if the application (or
       an initial ***= director) has specified that the user's input be
       treated as a literal string rather than as an RE.

MATCHING

       In the event that an RE could match more than one substring of a given
       string, the RE matches the one starting earliest in the string.  If the
       RE could match more than one substring starting at that point, its
       choice is determined by its preference: either the longest substring,
       or the shortest.

       Most atoms, and all constraints, have no preference.  A parenthesized
       RE has the same preference (possibly none) as the RE.  A quantified
       atom with quantifier {m} or {m}? has the same preference (possibly
       none) as the atom itself.  A quantified atom with other normal
       quantifiers (including {m,n} with m equal to n) prefers longest match.
       A quantified atom with other non-greedy quantifiers (including {m,n}?
       with m equal to n) prefers shortest match.  A branch has the same
       preference as the first quantified atom in it which has a preference.
       An RE consisting of two or more branches connected by the | operator
       prefers longest match.

       Subject to the constraints imposed by the rules for matching the whole
       RE, subexpressions also match the longest or shortest possible
       substrings, based on their preferences, with subexpressions starting
       earlier in the RE taking priority over ones starting later.  Note that
       outer subexpressions thus take priority over their component
       subexpressions.

       Note that the quantifiers {1,1} and {1,1}? can be used to force longest
       and shortest preference, respectively, on a subexpression or a whole
       RE.

       Match lengths are measured in characters, not collating elements.  An
       empty string is considered longer than no match at all.  For example,
       bb* matches the three middle characters of `abbbc',
       (week|wee)(night|knights) matches all ten characters of `weeknights',
       when (.*).* is matched against abc the parenthesized subexpression
       matches all three characters, and when (a*)* is matched against bc both
       the whole RE and the parenthesized subexpression match an empty string.
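
       The four examples in the paragraph above can be reproduced with the
       regexp command, as in this sketch.  The variable names are invented
       for illustration.

```tcl
# bb* matches the three middle characters of abbbc.
regexp {bb*} "abbbc" m1

# The longest overall match wins, so the first group settles for "wee".
regexp {(week|wee)(night|knights)} "weeknights" m2 first second

# The earlier subexpression takes priority and grabs all three characters.
regexp {(.*).*} "abc" m3 sub3

# Against "bc", both the whole RE and the subexpression match empty strings.
set r4 [regexp {(a*)*} "bc" m4 sub4]

puts [list $m1 $m2 $first $second $sub3 $r4 $m4 $sub4]
# prints: bbb weeknights wee knights abc 1 {} {}
```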

       If case-independent matching is specified, the effect is much as if all
       case distinctions had vanished from the alphabet.  When an alphabetic
       that exists in multiple cases appears as an ordinary character outside
       a bracket expression, it is effectively transformed into a bracket
       expression containing both cases, so that x becomes `[xX]'.  When it
       appears inside a bracket expression, all case counterparts of it are
       added to the bracket expression, so that [x] becomes [xX] and [^x]
       becomes `[^xX]'.

       If newline-sensitive matching is specified, . and bracket expressions
       using ^ will never match the newline character (so that matches will
       never cross newlines unless the RE explicitly arranges it) and ^ and $
       will match the empty string after and before a newline respectively, in
       addition to matching at beginning and end of string respectively.  ARE
       \A and \Z continue to match beginning or end of string only.

       If partial newline-sensitive matching is specified, this affects . and
       bracket expressions as with newline-sensitive matching, but not ^ and
       `$'.

       If inverse partial newline-sensitive matching is specified, this
       affects ^ and $ as with newline-sensitive matching, but not . and
       bracket expressions.  This isn't very useful but is provided for
       symmetry.
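
       The effect of the newline-sensitivity options can be sketched as
       follows, using the embedded n and p options on a made-up three-line
       string.  The variable names are invented for illustration.

```tcl
set text "alpha\nbeta\ngamma"

# Default (non-newline-sensitive): ^ and $ match only at the string ends,
# and \w cannot cross a newline, so nothing matches.
set plain [regexp -inline -all {^\w+$} $text]

# (?n): ^ and $ also match after and before each newline.
set lines [regexp -inline -all {(?n)^\w+$} $text]

# \A still matches only at the beginning of the whole string.
set atStart [regexp {(?n)\Abeta} $text]

# (?p): "." stops at a newline, but ^ and $ keep their default meaning.
regexp {(?p)^.*} $text firstLine

puts [list $plain $lines $atStart $firstLine]
# prints: {} {alpha beta gamma} 0 alpha
```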

LIMITS AND COMPATIBILITY

       No particular limit is imposed on the length of REs.  Programs intended
       to be highly portable should not employ REs longer than 256 bytes, as a
       POSIX-compliant implementation can refuse to accept such REs.

       The only feature of AREs that is actually incompatible with POSIX EREs
       is that \ does not lose its special significance inside bracket
       expressions.  All other ARE features use syntax which is illegal or has
       undefined or unspecified effects in POSIX EREs; the *** syntax of
       directors likewise is outside the POSIX syntax for both BREs and EREs.

       Many of the ARE extensions are borrowed from Perl, but some have been
       changed to clean them up, and a few Perl extensions are not present.
       Incompatibilities of note include `\b', `\B', the lack of special
       treatment for a trailing newline, the addition of complemented bracket
       expressions to the things affected by newline-sensitive matching, the
       restrictions on parentheses and back references in lookahead
       constraints, and the longest/shortest-match (rather than first-match)
       matching semantics.

       The matching rules for REs containing both normal and non-greedy
       quantifiers have changed since early beta-test versions of this
       package.  (The new rules are much simpler and cleaner, but don't work
       as hard at guessing the user's real intentions.)

       Henry Spencer's original 1986 regexp package, still in widespread use
       (e.g., in pre-8.1 releases of Tcl), implemented an early version of
       today's EREs.  There are four incompatibilities between regexp's
       near-EREs (`RREs' for short) and AREs.  In roughly increasing order of
       significance:

              In AREs, \ followed by an alphanumeric character is either an
              escape or an error, while in RREs, it was just another way of
              writing the alphanumeric.  This should not be a problem because
              there was no reason to write such a sequence in RREs.

              { followed by a digit in an ARE is the beginning of a bound,
              while in RREs, { was always an ordinary character.  Such
              sequences should be rare, and will often result in an error
              because following characters will not look like a valid bound.

              In AREs, \ remains a special character within `[]', so a literal
              \ within [] must be written `\\'.  \\ also gives a literal \
              within [] in RREs, but only truly paranoid programmers routinely
              doubled the backslash.

              AREs report the longest/shortest match for the RE, rather than
              the first found in a specified search order.  This may affect
              some RREs which were written in the expectation that the first
              match would be reported.  (The careful crafting of RREs to
              optimize the search order for fast matching is obsolete (AREs
              examine all possible matches in parallel, and their performance
              is largely insensitive to their complexity) but cases where the
              search order was exploited to deliberately find a match which
              was not the longest/shortest will need rewriting.)

BASIC REGULAR EXPRESSIONS

       BREs differ from EREs in several respects.  `|', `+', and ? are
       ordinary characters and there is no equivalent for their functionality.
       The delimiters for bounds are \{ and `\}', with { and } by themselves
       ordinary characters.  The parentheses for nested subexpressions are \(
       and `\)', with ( and ) by themselves ordinary characters.  ^ is an
       ordinary character except at the beginning of the RE or the beginning
       of a parenthesized subexpression, $ is an ordinary character except at
       the end of the RE or the end of a parenthesized subexpression, and * is
       an ordinary character if it appears at the beginning of the RE or the
       beginning of a parenthesized subexpression (after a possible leading
       `^').  Finally, single-digit back references are available, and \< and
       \> are synonyms for [[:<:]] and [[:>:]] respectively; no other escapes
       are available.
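
       As a rough sketch of the BRE rules above, the embedded b option
       (described under METASYNTAX) can be used to select the BRE flavor from
       the regexp command.  The subject strings and variable names are
       invented for illustration.

```tcl
# In a BRE, + is an ordinary character rather than a quantifier.
set plus [regexp {(?b)a+b} "a+b"]   ;# 1: matches the literal text a+b
set none [regexp {(?b)a+b} "aab"]   ;# 0: "+" does not mean "one or more"

# Bounds are written \{ \} and subexpressions \( \).
regexp {(?b)a\{2\}} "caaat" two
regexp {(?b)\(ab\)\1} "xababx" twice

puts [list $plus $none $two $twice]   ;# prints: 1 0 aa abab
```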

KEYWORDS

       match, regular expression, string

Tcl                                   8.1                         re_syntax(n)