re_syntax(n)              Tcl Built-In Commands              re_syntax(n)

______________________________________________________________________________

NAME
re_syntax - Syntax of Tcl regular expressions.
______________________________________________________________________________

DESCRIPTION
A regular expression describes strings of characters. It's a pattern
that matches certain strings and doesn't match others.

DIFFERENT FLAVORS OF REs
Regular expressions (``RE''s), as defined by POSIX, come in two
flavors: extended REs (``EREs'') and basic REs (``BREs''). EREs are
roughly those of the traditional egrep, while BREs are roughly those of
the traditional ed. This implementation adds a third flavor, advanced
REs (``AREs''), basically EREs with some significant extensions.

This manual page primarily describes AREs. BREs mostly exist for
backward compatibility in some old programs; they will be discussed at
the end. POSIX EREs are almost an exact subset of AREs. Features of
AREs that are not present in EREs will be indicated.

REGULAR EXPRESSION SYNTAX
Tcl regular expressions are implemented using the package written by
Henry Spencer, based on the 1003.2 spec and some (not quite all) of the
Perl5 extensions (thanks, Henry!). Much of the description of regular
expressions below is copied verbatim from his manual entry.

An ARE is one or more branches, separated by `|', matching anything
that matches any of the branches.

A branch is zero or more constraints or quantified atoms, concatenated.
It matches a match for the first, followed by a match for the second,
etc; an empty branch matches the empty string.

A quantified atom is an atom possibly followed by a single quantifier.
Without a quantifier, it matches a match for the atom. The quantifiers,
and what a so-quantified atom matches, are:

  *       a sequence of 0 or more matches of the atom

  +       a sequence of 1 or more matches of the atom

  ?       a sequence of 0 or 1 matches of the atom

  {m}     a sequence of exactly m matches of the atom

  {m,}    a sequence of m or more matches of the atom

  {m,n}   a sequence of m through n (inclusive) matches of the atom; m
          may not exceed n

  *? +? ?? {m}? {m,}? {m,n}?
          non-greedy quantifiers, which match the same possibilities, but
          prefer the smallest number rather than the largest number of
          matches (see MATCHING)

The forms using { and } are known as bounds. The numbers m and n are
unsigned decimal integers with permissible values from 0 to 255
inclusive.
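
The quantifiers above can be tried out with a short sketch. It assumes
the stock Tcl regexp command, whose default flavor is taken here to be
ARE; the variable names are illustrative only.

```tcl
# Greedy quantifier: a+ takes as many a's as are available.
regexp {a+} "caaat" greedy

# Non-greedy quantifier: a+? prefers the smallest number, a single a.
regexp {a+?} "caaat" lazy

# Bound {2,3}: two or three a's; the anchors forbid anything else.
set three [regexp {^a{2,3}$} "aaa"]
set four  [regexp {^a{2,3}$} "aaaa"]

puts [list $greedy $lazy $three $four]   ;# prints: aaa a 1 0
```

Writing the RE in braces keeps Tcl itself from substituting $ or
backslash sequences before the RE package sees them.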

An atom is one of:

  (re)    (where re is any regular expression) matches a match for re,
          with the match noted for possible reporting

  (?:re)  as previous, but does no reporting (a ``non-capturing'' set of
          parentheses)

  ()      matches an empty string, noted for possible reporting

  (?:)    matches an empty string, without reporting

  [chars] a bracket expression, matching any one of the chars (see
          BRACKET EXPRESSIONS for more detail)

  .       matches any single character

  \k      (where k is a non-alphanumeric character) matches that
          character taken as an ordinary character, e.g. \\ matches a
          backslash character

  \c      where c is alphanumeric (possibly followed by other
          characters), an escape (AREs only), see ESCAPES below

  {       when followed by a character other than a digit, matches the
          left-brace character `{'; when followed by a digit, it is the
          beginning of a bound (see above)

  x       where x is a single character with no other significance,
          matches that character.
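
As a rough illustration of the atoms listed above, the following
sketch contrasts capturing and non-capturing parentheses and an
escaped period. It assumes Tcl's regexp command in its default (ARE)
mode; the sample strings and variable names are made up for the
example.

```tcl
# (re) reports its match; (?:re) groups without reporting, so only two
# submatch variables are needed for three parenthesized groups.
regexp {(\d+)-(?:\d+)-(\d+)} "2024-05-17" whole first last

# An unescaped . matches any single character; \. matches only a period.
set anyChar [regexp {^a.c$} "abc"]
set litMiss [regexp {^a\.c$} "abc"]
set litHit  [regexp {^a\.c$} "a.c"]

puts [list $whole $first $last $anyChar $litMiss $litHit]
```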

A constraint matches an empty string when specific conditions are met.
A constraint may not be followed by a quantifier. The simple
constraints are as follows; some more constraints are described later,
under ESCAPES.

  ^       matches at the beginning of a line

  $       matches at the end of a line

  (?=re)  positive lookahead (AREs only), matches at any point where a
          substring matching re begins

  (?!re)  negative lookahead (AREs only), matches at any point where no
          substring matching re begins

The lookahead constraints may not contain back references (see later),
and all parentheses within them are considered non-capturing.
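
The lookahead constraints can be sketched as follows, again assuming
Tcl's regexp command with its default ARE flavor. Because a constraint
matches only an empty string, the text inspected by the lookahead is
not part of the reported match.

```tcl
# Positive lookahead: "foo" matches only where "bar" follows, but the
# reported match is just "foo".
regexp {foo(?=bar)} "foobar" ahead

# Negative lookahead: "foo" matches only where "bar" does NOT follow.
set negMiss [regexp {foo(?!bar)} "foobar"]
set negHit  [regexp {foo(?!bar)} "foobaz"]

puts [list $ahead $negMiss $negHit]   ;# prints: foo 0 1
```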

An RE may not end with `\'.

BRACKET EXPRESSIONS
A bracket expression is a list of characters enclosed in `[]'. It
normally matches any single character from the list (but see below).
If the list begins with `^', it matches any single character (but see
below) not from the rest of the list.

If two characters in the list are separated by `-', this is shorthand
for the full range of characters between those two (inclusive) in the
collating sequence, e.g. [0-9] in ASCII matches any decimal digit.
Two ranges may not share an endpoint, so e.g. a-c-e is illegal.
Ranges are very collating-sequence-dependent, and portable programs
should avoid relying on them.

To include a literal ] or - in the list, the simplest method is to
enclose it in [. and .] to make it a collating element (see below).
Alternatively, make it the first character (following a possible `^'),
or (AREs only) precede it with `\'. Alternatively, for `-', make it
the last character, or the second endpoint of a range. To use a
literal - as the first endpoint of a range, make it a collating element
or (AREs only) precede it with `\'. With the exception of these, some
combinations using [ (see next paragraphs), and escapes, all other
special characters lose their special significance within a bracket
expression.
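
The rules for ranges and for literal ] and - can be checked with a
small sketch. It assumes Tcl's regexp command in ARE mode (the
backslash forms are AREs only); the variable names are illustrative.

```tcl
# A range, and a negated list.
set rangeHit [regexp {^[0-9]+$} "2024"]
set negated  [regexp {^[^0-9]+$} "abc"]

# ] placed first in the list is a literal ].
set firstBracket [regexp {^[]a]+$} "a]a"]

# - placed last in the list is a literal -.
set lastHyphen [regexp {^[a-]+$} "a-a-"]

# AREs only: \] and \- are literals anywhere in the list.
set escaped [regexp {^[a\]\-]+$} "a]-"]

puts [list $rangeHit $negated $firstBracket $lastHyphen $escaped]
```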

Within a bracket expression, a collating element (a character, a
multi-character sequence that collates as if it were a single
character, or a collating-sequence name for either) enclosed in [. and
.] stands for the sequence of characters of that collating element.
The sequence is a single element of the bracket expression's list. A
bracket expression in a locale that has multi-character collating
elements can thus match more than one character. So (insidiously), a
bracket expression that starts with ^ can match multi-character
collating elements even if none of them appear in the bracket
expression! (Note: Tcl currently has no multi-character collating
elements. This information is only for illustration.)

For example, assume the collating sequence includes a ch
multi-character collating element. Then the RE [[.ch.]]*c (zero or more
ch's followed by c) matches the first five characters of `chchcc'.
Also, the RE [^c]b matches all of `chb' (because [^c] matches the
multi-character ch).

Within a bracket expression, a collating element enclosed in [= and =]
is an equivalence class, standing for the sequences of characters of
all collating elements equivalent to that one, including itself. (If
there are no other equivalent collating elements, the treatment is as
if the enclosing delimiters were `[.' and `.]'.) For example, if o and
ô are the members of an equivalence class, then `[[=o=]]', `[[=ô=]]',
and `[oô]' are all synonymous. An equivalence class may not be an
endpoint of a range. (Note: Tcl currently implements only the Unicode
locale. It doesn't define any equivalence classes. The examples above
are just illustrations.)

Within a bracket expression, the name of a character class enclosed in
[: and :] stands for the list of all characters (not all collating
elements!) belonging to that class. Standard character classes are:

  alpha   A letter.
  upper   An upper-case letter.
  lower   A lower-case letter.
  digit   A decimal digit.
  xdigit  A hexadecimal digit.
  alnum   An alphanumeric (letter or digit).
  print   An alphanumeric (same as alnum).
  blank   A space or tab character.
  space   A character producing white space in displayed text.
  punct   A punctuation character.
  graph   A character with a visible representation.
  cntrl   A control character.

A locale may provide others. (Note that the current Tcl implementation
has only one locale: the Unicode locale.) A character class may not be
used as an endpoint of a range.
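
A few of the standard character classes in use, as a hedged sketch
that assumes Tcl's regexp command and its -all and -inline switches;
the sample strings are invented for the example.

```tcl
# [:alpha:] must itself sit inside a bracket expression: [[:alpha:]].
set alphaOnly [regexp {^[[:alpha:]]+$} "Tcl"]
set withDigit [regexp {^[[:alpha:]]+$} "Tcl8"]

# Hexadecimal digits in either case.
set hex [regexp {^[[:xdigit:]]+$} "1aF"]

# Collect every run of decimal digits as a list.
set numbers [regexp -all -inline {[[:digit:]]+} "a12b345"]

puts [list $alphaOnly $withDigit $hex $numbers]
```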

There are two special cases of bracket expressions: the bracket
expressions [[:<:]] and [[:>:]] are constraints, matching empty strings
at the beginning and end of a word respectively. A word is defined as
a sequence of word characters that is neither preceded nor followed by
word characters. A word character is an alnum character or an
underscore (_). These special bracket expressions are deprecated; users
of AREs should use constraint escapes instead (see below).

ESCAPES
Escapes (AREs only), which begin with a \ followed by an alphanumeric
character, come in several varieties: character entry, class
shorthands, constraint escapes, and back references. A \ followed by an
alphanumeric character but not constituting a valid escape is illegal
in AREs. In EREs, there are no escapes: outside a bracket expression,
a \ followed by an alphanumeric character merely stands for that
character as an ordinary character, and inside a bracket expression, \
is an ordinary character. (The latter is the one actual incompatibility
between EREs and AREs.)

Character-entry escapes (AREs only) exist to make it easier to specify
non-printing and otherwise inconvenient characters in REs:

  \a      alert (bell) character, as in C

  \b      backspace, as in C

  \B      synonym for \ to help reduce backslash doubling in some
          applications where there are multiple levels of backslash
          processing

  \cX     (where X is any character) the character whose low-order 5 bits
          are the same as those of X, and whose other bits are all zero

  \e      the character whose collating-sequence name is `ESC', or failing
          that, the character with octal value 033

  \f      formfeed, as in C

  \n      newline, as in C

  \r      carriage return, as in C

  \t      horizontal tab, as in C

  \uwxyz  (where wxyz is exactly four hexadecimal digits) the Unicode
          character U+wxyz in the local byte ordering

  \Ustuvwxyz
          (where stuvwxyz is exactly eight hexadecimal digits) reserved
          for a somewhat-hypothetical Unicode extension to 32 bits

  \v      vertical tab, as in C

  \xhhh   (where hhh is any sequence of hexadecimal digits) the character
          whose hexadecimal value is 0xhhh (a single character no matter
          how many hexadecimal digits are used).

  \0      the character whose value is 0

  \xy     (where xy is exactly two octal digits, and is not a back
          reference (see below)) the character whose octal value is 0xy

  \xyz    (where xyz is exactly three octal digits, and is not a back
          reference (see below)) the character whose octal value is 0xyz

Hexadecimal digits are `0'-`9', `a'-`f', and `A'-`F'. Octal digits are
`0'-`7'.

The character-entry escapes are always taken as ordinary characters.
For example, \135 is ] in ASCII, but \135 does not terminate a bracket
expression. Beware, however, that some applications (e.g., C
compilers) interpret such sequences themselves before the
regular-expression package gets to see them, which may require doubling
(quadrupling, etc.) the `\'.
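
A brief sketch of character-entry escapes, assuming Tcl's regexp
command in ARE mode. The REs are written in braces so that Tcl passes
the backslashes through untouched and the RE package interprets them;
the subject string "a\tb" is in double quotes so that Tcl itself turns
\t into a tab there.

```tcl
# \t in the RE is interpreted by the RE package as a horizontal tab.
set tab [regexp {^a\tb$} "a\tb"]

# \x41 is A and \u0042 is B.
set ab [regexp {^\x41\u0042$} "AB"]

# \135 is ] in ASCII, but as a character-entry escape it is an ordinary
# character and does not close the bracket expression.
set closeBracket [regexp {^[\135]$} {]}]

puts [list $tab $ab $closeBracket]
```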

Class-shorthand escapes (AREs only) provide shorthands for certain
commonly-used character classes:

  \d      [[:digit:]]

  \s      [[:space:]]

  \w      [[:alnum:]_] (note underscore)

  \D      [^[:digit:]]

  \S      [^[:space:]]

  \W      [^[:alnum:]_] (note underscore)

Within bracket expressions, `\d', `\s', and `\w' lose their outer
brackets, and `\D', `\S', and `\W' are illegal. (So, for example,
[a-c\d] is equivalent to [a-c[:digit:]]. Also, [a-c\D], which is
equivalent to [a-c^[:digit:]], is illegal.)
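
The class shorthands behave like the bracket expressions they stand
for, as this sketch suggests. It assumes Tcl's regexp command in ARE
mode; the inputs are made up for illustration.

```tcl
# \d+ finds each run of digits.
set numbers [regexp -all -inline {\d+} "a12b345"]

# \w covers letters, digits and the underscore.
set ident [regexp {^\w+$} "foo_bar9"]

# \S and \s: two non-space runs separated by white space.
set spaced [regexp {^\S+\s+\S+$} "hello   world"]

# Inside a bracket expression, \d loses its outer brackets.
set mixed [regexp {^[a-c\d]+$} "a1b2c3"]

puts [list $numbers $ident $spaced $mixed]
```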

A constraint escape (AREs only) is a constraint, matching the empty
string if specific conditions are met, written as an escape:

  \A      matches only at the beginning of the string (see MATCHING,
          below, for how this differs from `^')

  \m      matches only at the beginning of a word

  \M      matches only at the end of a word

  \y      matches only at the beginning or end of a word

  \Y      matches only at a point that is not the beginning or end of a
          word

  \Z      matches only at the end of the string (see MATCHING, below, for
          how this differs from `$')

  \m      (where m is a nonzero digit) a back reference, see below

  \mnn    (where m is a nonzero digit, and nn is some more digits, and
          the decimal value mnn is not greater than the number of closing
          capturing parentheses seen so far) a back reference, see below

A word is defined as in the specification of [[:<:]] and [[:>:]] above.
Constraint escapes are illegal within bracket expressions.
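
The word-boundary escapes can be illustrated with a short sketch,
assuming Tcl's regexp command in ARE mode and an invented sample
sentence.

```tcl
# \m...\M: "cat" only as a whole word.
set wholeWords [regexp -all -inline {\mcat\M} "cat concat catalog cat"]

# \y: "cat" at either edge of a word; here that means word-initial.
set atEdge [regexp -all -inline {\ycat} "cat concat catalog"]

# \Y: "cat" starting at a point that is not a word boundary.
set inside [regexp {\Ycat} "concat"]

puts [list $wholeWords $atEdge $inside]
```

The first call finds the two standalone occurrences, the second finds
the standalone word and the start of "catalog", and the third succeeds
because "cat" begins in the middle of "concat".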

A back reference (AREs only) matches the same string matched by the
parenthesized subexpression specified by the number, so that (e.g.)
([bc])\1 matches bb or cc but not `bc'. The subexpression must
entirely precede the back reference in the RE. Subexpressions are
numbered in the order of their leading parentheses. Non-capturing
parentheses do not define subexpressions.

There is an inherent historical ambiguity between octal character-entry
escapes and back references, which is resolved by heuristics, as hinted
at above. A leading zero always indicates an octal escape. A single
non-zero digit, not followed by another digit, is always taken as a
back reference. A multi-digit sequence not starting with a zero is
taken as a back reference if it comes after a suitable subexpression
(i.e. the number is in the legal range for a back reference), and
otherwise is taken as octal.
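
The ([bc])\1 example from the text, plus a doubled-word check, can be
sketched as follows. This assumes Tcl's regexp command in ARE mode;
the sentence and variable names are illustrative.

```tcl
# \1 must repeat exactly what the first parenthesized group matched.
set bb [regexp {^([bc])\1$} "bb"]
set cc [regexp {^([bc])\1$} "cc"]
set bc [regexp {^([bc])\1$} "bc"]

# Find a word that is immediately repeated.
regexp {\m(\w+)\s+\1\M} "it is is fine" pair word

puts [list $bb $cc $bc $pair $word]
```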

METASYNTAX
In addition to the main syntax described above, there are some special
forms and miscellaneous syntactic facilities available.

Normally the flavor of RE being used is specified by
application-dependent means. However, this can be overridden by a
director. If an RE of any flavor begins with `***:', the rest of the
RE is an ARE. If an RE of any flavor begins with `***=', the rest of
the RE is taken to be a literal string, with all characters considered
ordinary characters.
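
A minimal sketch of the two directors, assuming Tcl's regexp command;
the test strings are invented.

```tcl
# ***= makes the rest a literal string, so . is an ordinary period.
set literalMiss [regexp {***=a.c} "abc"]
set literalHit  [regexp {***=a.c} "xa.cx"]

# ***: makes the rest an ARE, so . matches any character.
set areHit [regexp {***:a.c} "abc"]

puts [list $literalMiss $literalHit $areHit]   ;# prints: 0 1 1
```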

An ARE may begin with embedded options: a sequence (?xyz) (where xyz is
one or more alphabetic characters) specifies options affecting the rest
of the RE. These supplement, and can override, any options specified
by the application. The available option letters are:

  b  rest of RE is a BRE

  c  case-sensitive matching (usual default)

  e  rest of RE is an ERE

  i  case-insensitive matching (see MATCHING, below)

  m  historical synonym for n

  n  newline-sensitive matching (see MATCHING, below)

  p  partial newline-sensitive matching (see MATCHING, below)

  q  rest of RE is a literal (``quoted'') string, all ordinary
     characters

  s  non-newline-sensitive matching (usual default)

  t  tight syntax (usual default; see below)

  w  inverse partial newline-sensitive (``weird'') matching (see
     MATCHING, below)

  x  expanded syntax (see below)

Embedded options take effect at the ) terminating the sequence. They
are available only at the start of an ARE, and may not be used later
within it.
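
Embedded options can be sketched as below, assuming Tcl's regexp
command and its -nocase switch as the application-level option. The
last line relies on the statement above that embedded options can
override options specified by the application.

```tcl
# Default matching is case-sensitive.
set plain [regexp {tcl} "TCL rocks"]

# (?i) at the start of the ARE turns on case-insensitive matching.
set nocase [regexp {(?i)tcl} "TCL rocks"]

# (?c) asks for case-sensitive matching even though -nocase was given.
set override [regexp -nocase {(?c)tcl} "TCL rocks"]

puts [list $plain $nocase $override]
```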

In addition to the usual (tight) RE syntax, in which all characters are
significant, there is an expanded syntax, available in all flavors of
RE with the -expanded switch, or in AREs with the embedded x option.
In the expanded syntax, white-space characters are ignored and all
characters between a # and the following newline (or the end of the RE)
are ignored, permitting paragraphing and commenting a complex RE.
There are three exceptions to that basic rule:

  a white-space character or `#' preceded by `\' is retained

  white space or `#' within a bracket expression is retained

  white space and comments are illegal within multi-character symbols
  like the ARE `(?:' or the BRE `\('

Expanded-syntax white-space characters are blank, tab, newline, and any
character that belongs to the space character class.
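
The expanded syntax lets an RE be laid out and commented, as in this
sketch. It assumes Tcl's regexp command in ARE mode; the pattern and
the sample text are hypothetical.

```tcl
# (?x) must come first; the white space and # comments that follow are
# ignored by the RE package.
set re {(?x)
    (\d{3})    # three-digit prefix
    -          # literal hyphen
    (\d{4})    # four-digit line number
}

regexp $re "call 555-1234 now" whole prefix line

puts [list $whole $prefix $line]   ;# prints: 555-1234 555 1234
```

Inside a braced Tcl word the # characters are plain text, so they
reach the RE package and are treated as RE comments there.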

Finally, in an ARE, outside bracket expressions, the sequence `(?#ttt)'
(where ttt is any text not containing a `)') is a comment, completely
ignored. Again, this is not allowed between the characters of
multi-character symbols like `(?:'. Such comments are more a historical
artifact than a useful facility, and their use is deprecated; use the
expanded syntax instead.

None of these metasyntax extensions is available if the application (or
an initial ***= director) has specified that the user's input be
treated as a literal string rather than as an RE.

MATCHING
In the event that an RE could match more than one substring of a given
string, the RE matches the one starting earliest in the string. If the
RE could match more than one substring starting at that point, its
choice is determined by its preference: either the longest substring,
or the shortest.

Most atoms, and all constraints, have no preference. A parenthesized
RE has the same preference (possibly none) as the RE. A quantified
atom with quantifier {m} or {m}? has the same preference (possibly
none) as the atom itself. A quantified atom with other normal
quantifiers (including {m,n} with m equal to n) prefers longest match.
A quantified atom with other non-greedy quantifiers (including {m,n}?
with m equal to n) prefers shortest match. A branch has the same
preference as the first quantified atom in it which has a preference.
An RE consisting of two or more branches connected by the | operator
prefers longest match.

Subject to the constraints imposed by the rules for matching the whole
RE, subexpressions also match the longest or shortest possible
substrings, based on their preferences, with subexpressions starting
earlier in the RE taking priority over ones starting later. Note that
outer subexpressions thus take priority over their component
subexpressions.

Note that the quantifiers {1,1} and {1,1}? can be used to force
longest and shortest preference, respectively, on a subexpression or a
whole RE.

Match lengths are measured in characters, not collating elements. An
empty string is considered longer than no match at all. For example,
bb* matches the three middle characters of `abbbc',
(week|wee)(night|knights) matches all ten characters of `weeknights',
when (.*).* is matched against abc the parenthesized subexpression
matches all three characters, and when (a*)* is matched against bc both
the whole RE and the parenthesized subexpression match an empty string.
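
The examples in the preceding paragraph can be reproduced with a short
sketch, assuming Tcl's regexp command in ARE mode. The last two lines
add a greedy/non-greedy contrast that is not taken from the text.

```tcl
# bb* takes the three middle characters of abbbc.
regexp {bb*} "abbbc" middle

# The longest overall match wins: wee + knights, all ten characters.
regexp {(week|wee)(night|knights)} "weeknights" all first second

# The earlier subexpression gets priority and takes everything.
regexp {(.*).*} "abc" outer inner

# Both the whole RE and the subexpression match an empty string.
set found [regexp {(a*)*} "bc" emptyAll emptySub]

# Longest versus shortest preference.
regexp {a.*c}  "abcabc" longest
regexp {a.*?c} "abcabc" shortest

puts [list $middle $all $first $second $inner $found $longest $shortest]
```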

If case-independent matching is specified, the effect is much as if all
case distinctions had vanished from the alphabet. When an alphabetic
that exists in multiple cases appears as an ordinary character outside
a bracket expression, it is effectively transformed into a bracket
expression containing both cases, so that x becomes `[xX]'. When it
appears inside a bracket expression, all case counterparts of it are
added to the bracket expression, so that [x] becomes [xX] and [^x]
becomes `[^xX]'.

If newline-sensitive matching is specified, . and bracket expressions
using ^ will never match the newline character (so that matches will
never cross newlines unless the RE explicitly arranges it) and ^ and $
will match the empty string after and before a newline respectively, in
addition to matching at beginning and end of string respectively. ARE
\A and \Z continue to match beginning or end of string only.

If partial newline-sensitive matching is specified, this affects . and
bracket expressions as with newline-sensitive matching, but not ^ and
`$'.

If inverse partial newline-sensitive matching is specified, this
affects ^ and $ as with newline-sensitive matching, but not . and
bracket expressions. This isn't very useful but is provided for
symmetry.
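
The three newline-related modes can be compared side by side. This
sketch assumes Tcl's regexp command in ARE mode and selects each mode
with the embedded options n, p and w; the two-line subject string is
made up.

```tcl
set text "one\ntwo"

# Default (non-newline-sensitive): ^ and $ match only at the ends of
# the string, and . may match a newline.
set defAnchor [regexp {^two$} $text]
set defDot    [regexp {e.t} $text]

# n: ^ and $ also match around newlines; . never matches a newline;
# \A still matches only at the very beginning of the string.
set nAnchor [regexp {(?n)^two$} $text]
set nDot    [regexp {(?n)e.t} $text]
set nStart  [regexp {(?n)\Atwo} $text]

# p: affects . (and bracket expressions) but not ^ and $.
set pAnchor [regexp {(?p)^two$} $text]
set pDot    [regexp {(?p)e.t} $text]

# w: affects ^ and $ but not . (and bracket expressions).
set wAnchor [regexp {(?w)^two$} $text]
set wDot    [regexp {(?w)e.t} $text]

puts [list $defAnchor $defDot $nAnchor $nDot $nStart \
          $pAnchor $pDot $wAnchor $wDot]
```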

LIMITS AND COMPATIBILITY
No particular limit is imposed on the length of REs. Programs intended
to be highly portable should not employ REs longer than 256 bytes, as a
POSIX-compliant implementation can refuse to accept such REs.

The only feature of AREs that is actually incompatible with POSIX EREs
is that \ does not lose its special significance inside bracket
expressions. All other ARE features use syntax which is illegal or has
undefined or unspecified effects in POSIX EREs; the *** syntax of
directors likewise is outside the POSIX syntax for both BREs and EREs.

Many of the ARE extensions are borrowed from Perl, but some have been
changed to clean them up, and a few Perl extensions are not present.
Incompatibilities of note include `\b', `\B', the lack of special
treatment for a trailing newline, the addition of complemented bracket
expressions to the things affected by newline-sensitive matching, the
restrictions on parentheses and back references in lookahead
constraints, and the longest/shortest-match (rather than first-match)
matching semantics.

The matching rules for REs containing both normal and non-greedy
quantifiers have changed since early beta-test versions of this
package. (The new rules are much simpler and cleaner, but don't work as
hard at guessing the user's real intentions.)

Henry Spencer's original 1986 regexp package, still in widespread use
(e.g., in pre-8.1 releases of Tcl), implemented an early version of
today's EREs. There are four incompatibilities between regexp's
near-EREs (`RREs' for short) and AREs. In roughly increasing order of
significance:

  In AREs, \ followed by an alphanumeric character is either an
  escape or an error, while in RREs, it was just another way of
  writing the alphanumeric. This should not be a problem because
  there was no reason to write such a sequence in RREs.

  { followed by a digit in an ARE is the beginning of a bound,
  while in RREs, { was always an ordinary character. Such
  sequences should be rare, and will often result in an error
  because following characters will not look like a valid bound.

  In AREs, \ remains a special character within `[]', so a literal
  \ within [] must be written `\\'. \\ also gives a literal \
  within [] in RREs, but only truly paranoid programmers routinely
  doubled the backslash.

  AREs report the longest/shortest match for the RE, rather than
  the first found in a specified search order. This may affect
  some RREs which were written in the expectation that the first
  match would be reported. (The careful crafting of RREs to
  optimize the search order for fast matching is obsolete (AREs
  examine all possible matches in parallel, and their performance is
  largely insensitive to their complexity) but cases where the
  search order was exploited to deliberately find a match which
  was not the longest/shortest will need rewriting.)

BASIC REGULAR EXPRESSIONS
BREs differ from EREs in several respects. `|', `+', and ? are
ordinary characters and there is no equivalent for their functionality.
The delimiters for bounds are \{ and `\}', with { and } by themselves
ordinary characters. The parentheses for nested subexpressions are \(
and `\)', with ( and ) by themselves ordinary characters. ^ is an
ordinary character except at the beginning of the RE or the beginning
of a parenthesized subexpression, $ is an ordinary character except at
the end of the RE or the end of a parenthesized subexpression, and * is
an ordinary character if it appears at the beginning of the RE or the
beginning of a parenthesized subexpression (after a possible leading
`^'). Finally, single-digit back references are available, and \< and
\> are synonyms for [[:<:]] and [[:>:]] respectively; no other escapes
are available.
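
A few of the BRE differences can be observed by selecting the BRE
flavor with the embedded b option. This is a sketch that assumes
Tcl's regexp command, whose default flavor is taken to be ARE so that
(?b) is recognized; the sample strings are illustrative.

```tcl
# Bounds are written \{ and \} in a BRE.
regexp {(?b)a\{2\}} "caab" pair

# + is an ordinary character in a BRE.
regexp {(?b)a+} "a+" plus

# Subexpressions use \( and \); single-digit back references work.
set doubled [regexp {(?b)\(ab\)\1} "abab"]

# | is an ordinary character, not alternation.
set altMiss [regexp {(?b)a|b} "a"]
set altHit  [regexp {(?b)a|b} "a|b"]

puts [list $pair $plus $doubled $altMiss $altHit]
```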

SEE ALSO
RegExp(3), regexp(n), regsub(n), lsearch(n), switch(n), text(n)

KEYWORDS
match, regular expression, string

Tcl 8.1                                                          re_syntax(n)