re_syntax(n) Tcl Built-In Commands re_syntax(n)

______________________________________________________________________________

NAME
re_syntax - Syntax of Tcl regular expressions
_________________________________________________________________

DESCRIPTION
A regular expression describes strings of characters. It's a pattern
that matches certain strings and does not match others.

DIFFERENT FLAVORS OF REs
Regular expressions (“RE”s), as defined by POSIX, come in two flavors:
extended REs (“ERE”s) and basic REs (“BRE”s). EREs are roughly those
of the traditional egrep, while BREs are roughly those of the
traditional ed. This implementation adds a third flavor, advanced REs
(“ARE”s), basically EREs with some significant extensions.

This manual page primarily describes AREs. BREs mostly exist for
backward compatibility in some old programs; they will be discussed at
the end. POSIX EREs are almost an exact subset of AREs. Features of AREs
that are not present in EREs will be indicated.

REGULAR EXPRESSION SYNTAX
Tcl regular expressions are implemented using the package written by
Henry Spencer, based on the 1003.2 spec and some (not quite all) of the
Perl5 extensions (thanks, Henry!). Much of the description of regular
expressions below is copied verbatim from his manual entry.

An ARE is one or more branches, separated by “|”, matching anything
that matches any of the branches.

A branch is zero or more constraints or quantified atoms, concatenated.
It matches a match for the first, followed by a match for the second,
etc.; an empty branch matches the empty string.

QUANTIFIERS
A quantified atom is an atom possibly followed by a single quantifier.
Without a quantifier, it matches a single match for the atom. The
quantifiers, and what a so-quantified atom matches, are:

*      a sequence of 0 or more matches of the atom

+      a sequence of 1 or more matches of the atom

?      a sequence of 0 or 1 matches of the atom

{m}    a sequence of exactly m matches of the atom

{m,}   a sequence of m or more matches of the atom

{m,n}  a sequence of m through n (inclusive) matches of the atom; m
       may not exceed n

*? +? ?? {m}? {m,}? {m,n}?
       non-greedy quantifiers, which match the same possibilities, but
       prefer the smallest number rather than the largest number of
       matches (see MATCHING)

The forms using { and } are known as bounds. The numbers m and n are
unsigned decimal integers with permissible values from 0 to 255
inclusive.
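As an illustration of the quantifiers, the following Tcl sketch contrasts a greedy quantifier, its non-greedy counterpart, and a bound. The subject string and the variable names are invented for the example; only the standard regexp command is assumed.

```tcl
# Illustrative only: the subject string and variable names are made up.
set s "aaaa"

regexp {a+}     $s greedy    ;# + prefers the largest number of matches
regexp {a+?}    $s lazy      ;# +? prefers the smallest number of matches
regexp {a{2,3}} $s bounded   ;# bound: 2 through 3 matches, prefers 3

puts [list $greedy $lazy $bounded]   ;# aaaa a aaa

# A bound whose minimum exceeds its maximum is rejected when compiled.
set failed [catch {regexp {a{3,2}} $s} msg]
puts $failed                          ;# 1
```

The patterns are written in braces so that Tcl passes them to the regular-expression package unchanged.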

ATOMS
An atom is one of:

(re)   matches a match for re (re is any regular expression) with the
       match noted for possible reporting

(?:re)
       as previous, but does no reporting (a “non-capturing” set of
       parentheses)

()     matches an empty string, noted for possible reporting

(?:)   matches an empty string, without reporting

[chars]
       a bracket expression, matching any one of the chars (see
       BRACKET EXPRESSIONS for more detail)

.      matches any single character

\k     matches the non-alphanumeric character k taken as an ordinary
       character, e.g. \\ matches a backslash character

\c     where c is alphanumeric (possibly followed by other
       characters), an escape (AREs only), see ESCAPES below

{      when followed by a character other than a digit, matches the
       left-brace character “{”; when followed by a digit, it is the
       beginning of a bound (see above)

x      where x is a single character with no other significance,
       matches that character.
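A short sketch of how capturing and non-capturing parentheses differ when the match is reported through regexp's match variables. The sample string and variable names are hypothetical.

```tcl
# Illustrative only: the sample string and variable names are made up.
set s "key=value"

# Capturing parentheses: each parenthesized match is noted for reporting.
regexp {(\w+)=(\w+)} $s whole k v
puts [list $whole $k $v]      ;# key=value key value

# Non-capturing parentheses group without reporting, so only one
# submatch variable is filled in.
regexp {(?:\w+)=(\w+)} $s whole2 v2
puts [list $whole2 $v2]       ;# key=value value

# A backslash makes a non-alphanumeric character ordinary.
puts [regexp {a\.c} "abc"]    ;# 0, the dot is literal here
puts [regexp {a.c}  "abc"]    ;# 1, the dot matches any single character
```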

CONSTRAINTS
A constraint matches an empty string when specific conditions are met.
A constraint may not be followed by a quantifier. The simple
constraints are as follows; some more constraints are described later,
under ESCAPES.

^      matches at the beginning of a line

$      matches at the end of a line

(?=re) positive lookahead (AREs only), matches at any point where a
       substring matching re begins

(?!re) negative lookahead (AREs only), matches at any point where no
       substring matching re begins

The lookahead constraints may not contain back references (see later),
and all parentheses within them are considered non-capturing.

An RE may not end with “\”.
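The simple constraints can be tried out as follows. This is only a sketch with invented sample strings; it shows that a lookahead tests what follows without becoming part of the match.

```tcl
# Positive lookahead: match foo only where bar begins right afterwards.
puts [regexp {foo(?=bar)} "foobar" m]   ;# 1
puts $m                                  ;# foo (the lookahead consumes nothing)

# Negative lookahead: match foo only where bar does not begin afterwards.
puts [regexp {foo(?!bar)} "foobar"]     ;# 0
puts [regexp {foo(?!bar)} "foobaz"]     ;# 1

# ^ and $ tie the match to the beginning and the end.
puts [regexp {^foo$} "foo"]             ;# 1
puts [regexp {^foo$} "a foo"]           ;# 0
```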

BRACKET EXPRESSIONS
A bracket expression is a list of characters enclosed in “[]”. It
normally matches any single character from the list (but see below). If
the list begins with “^”, it matches any single character (but see
below) not from the rest of the list.

If two characters in the list are separated by “-”, this is shorthand
for the full range of characters between those two (inclusive) in the
collating sequence, e.g. “[0-9]” in Unicode matches any conventional
decimal digit. Two ranges may not share an endpoint, so e.g. “a-c-e”
is illegal. Ranges in Tcl always use the Unicode collating sequence,
but other programs may use other collating sequences and this can be a
source of incompatibility between programs.

To include a literal ] or - in the list, the simplest method is to
enclose it in [. and .] to make it a collating element (see below).
Alternatively, make it the first character (following a possible “^”),
or (AREs only) precede it with “\”. Alternatively, for “-”, make it
the last character, or the second endpoint of a range. To use a literal
- as the first endpoint of a range, make it a collating element or
(AREs only) precede it with “\”. With the exception of these, some
combinations using [ (see next paragraphs), and escapes, all other
special characters lose their special significance within a bracket
expression.
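The rules for literal brackets and hyphens are easy to get wrong, so here is a small sketch of three of the placements described above. The variable name rb is invented; it merely holds a right bracket so that the example does not depend on how the subject string is quoted.

```tcl
set rb "\]"   ;# hypothetical helper variable holding a right bracket

# A right bracket placed first in the list is literal.
puts [regexp {^[]x]$} $rb]       ;# 1

# In an ARE, a backslash makes the hyphen literal instead of a range operator.
puts [regexp {^[a\-z]$} "-"]     ;# 1
puts [regexp {^[a\-z]$} "m"]     ;# 0, no a-z range was formed

# A hyphen placed last in the list is also literal.
puts [regexp {^[az-]$} "-"]      ;# 1

# A leading ^ complements the list.
puts [regexp {^[^0-9]$} "7"]     ;# 0
puts [regexp {^[^0-9]$} "x"]     ;# 1
```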

CHARACTER CLASSES
Within a bracket expression, the name of a character class enclosed in
[: and :] stands for the list of all characters (not all collating
elements!) belonging to that class. Standard character classes are:

alpha   A letter.

upper   An upper-case letter.

lower   A lower-case letter.

digit   A decimal digit.

xdigit  A hexadecimal digit.

alnum   An alphanumeric (letter or digit).

print   A "printable" (same as graph, except also including space).

blank   A space or tab character.

space   A character producing white space in displayed text.

punct   A punctuation character.

graph   A character with a visible representation (includes both alnum
        and punct).

cntrl   A control character.

A locale may provide others. A character class may not be used as an
endpoint of a range.

       (Note: the current Tcl implementation has only one locale, the
       Unicode locale, which supports exactly the above classes.)
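A brief sketch of character classes in use. The identifier pattern and the sample strings are assumptions made up for the example, not part of the syntax definition.

```tcl
# Hypothetical pattern: a letter or underscore, then letters, digits, underscores.
set ident {^[[:alpha:]_][[:alnum:]_]*$}
puts [regexp $ident "x_1"]    ;# 1
puts [regexp $ident "1x"]     ;# 0

# -all returns the number of matches: one space and one tab here.
puts [regexp -all {[[:space:]]} "a b\tc"]   ;# 2

# Classes can be combined with each other and complemented.
puts [regexp {^[^[:digit:][:space:]]+$} "abc"]   ;# 1
```

Note the doubled brackets: the outer pair delimits the bracket expression and the inner [: :] pair names the class.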

BRACKETED CONSTRAINTS
There are two special cases of bracket expressions: the bracket
expressions “[[:<:]]” and “[[:>:]]” are constraints, matching empty
strings at the beginning and end of a word respectively. A word is
defined as a sequence of word characters that is neither preceded nor
followed by word characters. A word character is an alnum character or
an underscore (“_”). These special bracket expressions are deprecated;
users of AREs should use constraint escapes instead (see below).

COLLATING ELEMENTS
Within a bracket expression, a collating element (a character, a
multi-character sequence that collates as if it were a single character,
or a collating-sequence name for either) enclosed in [. and .] stands
for the sequence of characters of that collating element. The sequence
is a single element of the bracket expression's list. A bracket
expression in a locale that has multi-character collating elements can
thus match more than one character. So (insidiously), a bracket
expression that starts with ^ can match multi-character collating
elements even if none of them appear in the bracket expression!

       (Note: Tcl has no multi-character collating elements. This
       information is only for illustration.)

For example, assume the collating sequence includes a ch
multi-character collating element. Then the RE “[[.ch.]]*c” (zero or
more “chs” followed by “c”) matches the first five characters of
“chchcc”. Also, the RE “[^c]b” matches all of “chb” (because “[^c]”
matches the multi-character “ch”).

EQUIVALENCE CLASSES
Within a bracket expression, a collating element enclosed in [= and =]
is an equivalence class, standing for the sequences of characters of
all collating elements equivalent to that one, including itself. (If
there are no other equivalent collating elements, the treatment is as
if the enclosing delimiters were “[.” and “.]”.) For example, if o and
ô are the members of an equivalence class, then “[[=o=]]”, “[[=ô=]]”,
and “[oô]” are all synonymous. An equivalence class may not be an
endpoint of a range.

       (Note: Tcl implements only the Unicode locale. It does not
       define any equivalence classes. The examples above are just
       illustrations.)

ESCAPES
Escapes (AREs only), which begin with a \ followed by an alphanumeric
character, come in several varieties: character entry, class
shorthands, constraint escapes, and back references. A \ followed by an
alphanumeric character but not constituting a valid escape is illegal
in AREs. In EREs, there are no escapes: outside a bracket expression, a
\ followed by an alphanumeric character merely stands for that
character as an ordinary character, and inside a bracket expression, \
is an ordinary character. (The latter is the one actual incompatibility
between EREs and AREs.)

CHARACTER-ENTRY ESCAPES
Character-entry escapes (AREs only) exist to make it easier to specify
non-printing and otherwise inconvenient characters in REs:

\a     alert (bell) character, as in C

\b     backspace, as in C

\B     synonym for \ to help reduce backslash doubling in some
       applications where there are multiple levels of backslash
       processing

\cX    (where X is any character) the character whose low-order 5 bits
       are the same as those of X, and whose other bits are all zero

\e     the character whose collating-sequence name is “ESC”, or failing
       that, the character with octal value 033

\f     formfeed, as in C

\n     newline, as in C

\r     carriage return, as in C

\t     horizontal tab, as in C

\uwxyz
       (where wxyz is exactly four hexadecimal digits) the Unicode
       character U+wxyz in the local byte ordering

\Ustuvwxyz
       (where stuvwxyz is exactly eight hexadecimal digits) reserved
       for a somewhat-hypothetical Unicode extension to 32 bits

\v     vertical tab, as in C

\xhhh
       (where hhh is any sequence of hexadecimal digits) the character
       whose hexadecimal value is 0xhhh (a single character no matter
       how many hexadecimal digits are used).

\0     the character whose value is 0

\xy    (where xy is exactly two octal digits, and is not a back
       reference (see below)) the character whose octal value is 0xy

\xyz   (where xyz is exactly three octal digits, and is not a back
       reference (see below)) the character whose octal value is 0xyz

Hexadecimal digits are “0”-“9”, “a”-“f”, and “A”-“F”. Octal digits are
“0”-“7”.

The character-entry escapes are always taken as ordinary characters.
For example, \135 is ] in Unicode, but \135 does not terminate a
bracket expression. Beware, however, that some applications (e.g., C
compilers and the Tcl interpreter if the regular expression is not
quoted with braces) interpret such sequences themselves before the
regular-expression package gets to see them, which may require doubling
(quadrupling, etc.) the “\”.
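The warning about backslash doubling can be made concrete with a small sketch. The variable tab is invented for the example; the point is that a braced pattern reaches the regular-expression package untouched, while a double-quoted one is first processed by Tcl.

```tcl
set tab "\t"     ;# Tcl itself turns \t into a tab inside double quotes

# Braces: Tcl passes the backslash through, the RE package interprets \t.
puts [regexp {a\tb} "a${tab}b"]      ;# 1

# Double quotes: Tcl reduces the doubled backslash first, so the RE
# package again sees \t.
puts [regexp "a\\tb" "a${tab}b"]     ;# 1

# Hexadecimal and Unicode character entry; both denote the letter A.
puts [regexp {^\x41$}   "A"]         ;# 1
puts [regexp {^\u0041$} "A"]         ;# 1

# Octal 135 is the right bracket, but written as an escape it does not
# terminate the bracket expression.
puts [regexp {^[a\135]$} "\]"]       ;# 1
```

Bracing the pattern is usually the simpler choice, since only one level of backslash processing is then involved.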

CLASS-SHORTHAND ESCAPES
Class-shorthand escapes (AREs only) provide shorthands for certain
commonly-used character classes:

\d     [[:digit:]]

\s     [[:space:]]

\w     [[:alnum:]_] (note underscore)

\D     [^[:digit:]]

\S     [^[:space:]]

\W     [^[:alnum:]_] (note underscore)

Within bracket expressions, “\d”, “\s”, and “\w” lose their outer
brackets, and “\D”, “\S”, and “\W” are illegal. (So, for example,
“[a-c\d]” is equivalent to “[a-c[:digit:]]”. Also, “[a-c\D]”, which is
equivalent to “[a-c^[:digit:]]”, is illegal.)
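A sketch of the class shorthands, both on their own and inside a bracket expression. The sample strings are invented. The last command only records whether the complemented shorthand is rejected inside brackets, as the text says it should be; no particular error message is assumed.

```tcl
puts [regexp {^\d+$} "2024"]          ;# 1
puts [regexp {^\w+$} "foo_bar1"]      ;# 1, \w includes the underscore
puts [regexp {^\S+$} "two words"]     ;# 0, the space is not matched by \S

# Inside a bracket expression \d loses its outer brackets.
puts [regexp {^[a-c\d]+$} "a1b2c3"]   ;# 1
puts [regexp {^[a-c\d]+$} "d4"]       ;# 0

# According to the text, \D is illegal inside a bracket expression, so
# compiling this pattern is expected to raise an error.
set rejected [catch {regexp {[a-c\D]} "x"}]
```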

CONSTRAINT ESCAPES
A constraint escape (AREs only) is a constraint, matching the empty
string if specific conditions are met, written as an escape:

\A     matches only at the beginning of the string (see MATCHING,
       below, for how this differs from “^”)

\m     matches only at the beginning of a word

\M     matches only at the end of a word

\y     matches only at the beginning or end of a word

\Y     matches only at a point that is not the beginning or end of a
       word

\Z     matches only at the end of the string (see MATCHING, below, for
       how this differs from “$”)

\m     (where m is a nonzero digit) a back reference, see below

\mnn   (where m is a nonzero digit, and nn is some more digits, and
       the decimal value mnn is not greater than the number of closing
       capturing parentheses seen so far) a back reference, see below

A word is defined as in the specification of “[[:<:]]” and “[[:>:]]”
above. Constraint escapes are illegal within bracket expressions.
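The word-boundary escapes can be sketched as follows, using invented sample strings. Each command prints 1 when the constraint is satisfied somewhere in the subject and 0 otherwise.

```tcl
puts [regexp {\mcat\M} "a cat sat"]      ;# 1, cat stands as a whole word
puts [regexp {\mcat\M} "concatenate"]    ;# 0, cat is inside a longer word
puts [regexp {\Ycat\Y} "concatenate"]    ;# 1, neither end is a word boundary
puts [regexp {\ycat}   "a cat sat"]      ;# 1

# \A and \Z anchor to the string as a whole.
puts [regexp {\Acat\Z} "cat"]            ;# 1
puts [regexp {\Acat\Z} "cat sat"]        ;# 0
```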

BACK REFERENCES
A back reference (AREs only) matches the same string matched by the
parenthesized subexpression specified by the number, so that (e.g.)
“([bc])\1” matches “bb” or “cc” but not “bc”. The subexpression must
entirely precede the back reference in the RE. Subexpressions are
numbered in the order of their leading parentheses. Non-capturing
parentheses do not define subexpressions.

There is an inherent historical ambiguity between octal character-entry
escapes and back references, which is resolved by heuristics, as hinted
at above. A leading zero always indicates an octal escape. A single
non-zero digit, not followed by another digit, is always taken as a
back reference. A multi-digit sequence not starting with a zero is
taken as a back reference if it comes after a suitable subexpression
(i.e. the number is in the legal range for a back reference), and
otherwise is taken as octal.
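The example from the text, plus a doubled-word search, can be run as a small sketch. The sentence and the variable names pair and w are made up for illustration.

```tcl
puts [regexp {^([bc])\1$} "bb"]    ;# 1
puts [regexp {^([bc])\1$} "bc"]    ;# 0

# Find a doubled word; w receives the text of subexpression 1.
regexp {\m(\w+) \1\M} "it was the the cat" pair w
puts [list $pair $w]               ;# {the the} the

# Non-capturing parentheses are not numbered, so \1 refers to (c) here.
puts [regexp {^(?:a|b)(c)\1$} "acc"]   ;# 1
```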

METASYNTAX
In addition to the main syntax described above, there are some special
forms and miscellaneous syntactic facilities available.

Normally the flavor of RE being used is specified by
application-dependent means. However, this can be overridden by a
director. If an RE of any flavor begins with “***:”, the rest of the RE
is an ARE. If an RE of any flavor begins with “***=”, the rest of the RE
is taken to be a literal string, with all characters considered ordinary
characters.

An ARE may begin with embedded options: a sequence (?xyz) (where xyz is
one or more alphabetic characters) specifies options affecting the rest
of the RE. These supplement, and can override, any options specified by
the application. The available option letters are:

b      rest of RE is a BRE

c      case-sensitive matching (usual default)

e      rest of RE is an ERE

i      case-insensitive matching (see MATCHING, below)

m      historical synonym for n

n      newline-sensitive matching (see MATCHING, below)

p      partial newline-sensitive matching (see MATCHING, below)

q      rest of RE is a literal (“quoted”) string, all ordinary characters

s      non-newline-sensitive matching (usual default)

t      tight syntax (usual default; see below)

w      inverse partial newline-sensitive (“weird”) matching (see
       MATCHING, below)

x      expanded syntax (see below)

Embedded options take effect at the ) terminating the sequence. They
are available only at the start of an ARE, and may not be used later
within it.
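A sketch of a director and a few embedded options, using invented sample strings. It assumes the patterns are handed to the regexp command, which treats them as AREs unless told otherwise.

```tcl
# The literal-string director: the dot is not a wildcard.
puts [regexp {***=a.c} "a.c"]     ;# 1
puts [regexp {***=a.c} "abc"]     ;# 0

# Option i: case-insensitive matching.
puts [regexp {(?i)hello} "HeLLo"] ;# 1

# Option q: literal (quoted) string, so + is an ordinary character.
puts [regexp {(?q)a+b} "a+b"]     ;# 1
puts [regexp {(?q)a+b} "aab"]     ;# 0

# Several option letters may be combined in one sequence.
puts [regexp {(?in)^b$} "a\nB"]   ;# 1
```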

In addition to the usual (tight) RE syntax, in which all characters are
significant, there is an expanded syntax, available in all flavors of
RE with the -expanded switch, or in AREs with the embedded x option. In
the expanded syntax, white-space characters are ignored and all
characters between a # and the following newline (or the end of the RE)
are ignored, permitting paragraphing and commenting a complex RE. There
are three exceptions to that basic rule:

·  a white-space character or “#” preceded by “\” is retained

·  white space or “#” within a bracket expression is retained

·  white space and comments are illegal within multi-character symbols
   like the ARE “(?:” or the BRE “\(”

Expanded-syntax white-space characters are blank, tab, newline, and any
character that belongs to the space character class.
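The expanded syntax might be used like this to comment a date pattern. The pattern, the date string, and the variable names are all invented for the sketch.

```tcl
# Hypothetical date pattern, laid out and commented with the x option.
set re {(?x)
    (\d{4}) -    # year
    (\d{2}) -    # month
    (\d{2})      # day
}
puts [regexp $re "2024-01-31" all y m d]   ;# 1
puts [list $all $y $m $d]                    ;# 2024-01-31 2024 01 31

# A backslash keeps a white-space character significant.
puts [regexp {(?x) a \  b} "a b"]            ;# 1
puts [regexp {(?x) a b}    "a b"]            ;# 0, the RE is just ab
```

The same pattern without the leading (?x) could be passed with the -expanded switch instead.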

Finally, in an ARE, outside bracket expressions, the sequence “(?#ttt)”
(where ttt is any text not containing a “)”) is a comment, completely
ignored. Again, this is not allowed between the characters of
multi-character symbols like “(?:”. Such comments are more a historical
artifact than a useful facility, and their use is deprecated; use the
expanded syntax instead.

None of these metasyntax extensions is available if the application (or
an initial “***=” director) has specified that the user's input be
treated as a literal string rather than as an RE.

MATCHING
In the event that an RE could match more than one substring of a given
string, the RE matches the one starting earliest in the string. If the
RE could match more than one substring starting at that point, its
choice is determined by its preference: either the longest substring,
or the shortest.

Most atoms, and all constraints, have no preference. A parenthesized RE
has the same preference (possibly none) as the RE. A quantified atom
with quantifier {m} or {m}? has the same preference (possibly none) as
the atom itself. A quantified atom with other normal quantifiers
(including {m,n} with m equal to n) prefers longest match. A quantified
atom with other non-greedy quantifiers (including {m,n}? with m equal
to n) prefers shortest match. A branch has the same preference as the
first quantified atom in it which has a preference. An RE consisting of
two or more branches connected by the | operator prefers longest match.

Subject to the constraints imposed by the rules for matching the whole
RE, subexpressions also match the longest or shortest possible
substrings, based on their preferences, with subexpressions starting
earlier in the RE taking priority over ones starting later. Note that
outer subexpressions thus take priority over their component
subexpressions.

Note that the quantifiers {1,1} and {1,1}? can be used to force longest
and shortest preference, respectively, on a subexpression or a whole
RE.

Match lengths are measured in characters, not collating elements. An
empty string is considered longer than no match at all. For example,
“bb*” matches the three middle characters of “abbbc”,
“(week|wee)(night|knights)” matches all ten characters of “weeknights”,
when “(.*).*” is matched against “abc” the parenthesized subexpression
matches all three characters, and when “(a*)*” is matched against “bc”
both the whole RE and the parenthesized subexpression match an empty
string.
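The four examples in the preceding paragraph can be reproduced with the following sketch; only the variable names are invented.

```tcl
regexp {bb*} "abbbc" m1
puts $m1                            ;# bbb

# The longest overall match wins, so the first group settles for wee.
regexp {(week|wee)(night|knights)} "weeknights" all first second
puts [list $all $first $second]     ;# weeknights wee knights

# The earlier subexpression takes priority and consumes everything.
regexp {(.*).*} "abc" all2 sub
puts [list $all2 $sub]              ;# abc abc

# Both the whole RE and the subexpression match an empty string.
puts [regexp {(a*)*} "bc" all3 sub3]                        ;# 1
puts [list [string length $all3] [string length $sub3]]    ;# 0 0
```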

If case-independent matching is specified, the effect is much as if all
case distinctions had vanished from the alphabet. When an alphabetic
that exists in multiple cases appears as an ordinary character outside
a bracket expression, it is effectively transformed into a bracket
expression containing both cases, so that x becomes “[xX]”. When it
appears inside a bracket expression, all case counterparts of it are
added to the bracket expression, so that “[x]” becomes “[xX]” and
“[^x]” becomes “[^xX]”.

If newline-sensitive matching is specified, . and bracket expressions
using ^ will never match the newline character (so that matches will
never cross newlines unless the RE explicitly arranges it) and ^ and $
will match the empty string after and before a newline respectively, in
addition to matching at beginning and end of string respectively. ARE
\A and \Z continue to match beginning or end of string only.

If partial newline-sensitive matching is specified, this affects . and
bracket expressions as with newline-sensitive matching, but not ^ and
$.

If inverse partial newline-sensitive matching is specified, this
affects ^ and $ as with newline-sensitive matching, but not . and
bracket expressions. This is not very useful but is provided for
symmetry.
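The four newline modes are easiest to compare side by side. The sketch below uses an invented two-line string and selects each mode with an embedded option; the -line switch of regexp is used once as the command-level equivalent of the n option.

```tcl
set text "one\ntwo"

# Default: ^ and $ match only at the ends of the string, . matches newline.
puts [regexp {^two$}   $text]         ;# 0
puts [regexp {one.two} $text]         ;# 1

# Newline-sensitive, via the n option or the -line switch.
puts [regexp {(?n)^two$}   $text]     ;# 1
puts [regexp -line {one.two} $text]   ;# 0

# Partial (p): affects . and bracket expressions, but not ^ and $.
puts [regexp {(?p)^two$}   $text]     ;# 0
puts [regexp {(?p)one.two} $text]     ;# 0

# Inverse partial (w): affects ^ and $, but not . and bracket expressions.
puts [regexp {(?w)^two$}   $text]     ;# 1
puts [regexp {(?w)one.two} $text]     ;# 1

# \A keeps its meaning under newline-sensitive matching.
puts [regexp {(?n)\Atwo} $text]       ;# 0
```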

LIMITS AND COMPATIBILITY
No particular limit is imposed on the length of REs. Programs intended
to be highly portable should not employ REs longer than 256 bytes, as a
POSIX-compliant implementation can refuse to accept such REs.

The only feature of AREs that is actually incompatible with POSIX EREs
is that \ does not lose its special significance inside bracket
expressions. All other ARE features use syntax which is illegal or has
undefined or unspecified effects in POSIX EREs; the *** syntax of
directors likewise is outside the POSIX syntax for both BREs and EREs.

Many of the ARE extensions are borrowed from Perl, but some have been
changed to clean them up, and a few Perl extensions are not present.
Incompatibilities of note include “\b”, “\B”, the lack of special
treatment for a trailing newline, the addition of complemented bracket
expressions to the things affected by newline-sensitive matching, the
restrictions on parentheses and back references in lookahead
constraints, and the longest/shortest-match (rather than first-match)
matching semantics.

The matching rules for REs containing both normal and non-greedy
quantifiers have changed since early beta-test versions of this package.
(The new rules are much simpler and cleaner, but do not work as hard at
guessing the user's real intentions.)

Henry Spencer's original 1986 regexp package, still in widespread use
(e.g., in pre-8.1 releases of Tcl), implemented an early version of
today's EREs. There are four incompatibilities between regexp's
near-EREs (“RREs” for short) and AREs. In roughly increasing order of
significance:

·  In AREs, \ followed by an alphanumeric character is either an escape
   or an error, while in RREs, it was just another way of writing the
   alphanumeric. This should not be a problem because there was no
   reason to write such a sequence in RREs.

·  { followed by a digit in an ARE is the beginning of a bound, while
   in RREs, { was always an ordinary character. Such sequences should
   be rare, and will often result in an error because following
   characters will not look like a valid bound.

·  In AREs, \ remains a special character within “[]”, so a literal \
   within [] must be written “\\”. \\ also gives a literal \ within []
   in RREs, but only truly paranoid programmers routinely doubled the
   backslash.

·  AREs report the longest/shortest match for the RE, rather than the
   first found in a specified search order. This may affect some RREs
   which were written in the expectation that the first match would be
   reported. (The careful crafting of RREs to optimize the search order
   for fast matching is obsolete (AREs examine all possible matches in
   parallel, and their performance is largely insensitive to their
   complexity) but cases where the search order was exploited to
   deliberately find a match which was not the longest/shortest will
   need rewriting.)

BASIC REGULAR EXPRESSIONS
BREs differ from EREs in several respects. “|”, “+”, and “?” are
ordinary characters and there is no equivalent for their functionality.
The delimiters for bounds are “\{” and “\}”, with { and } by themselves
ordinary characters. The parentheses for nested subexpressions are “\(”
and “\)”, with ( and ) by themselves ordinary characters. ^ is an
ordinary character except at the beginning of the RE or the beginning of
a parenthesized subexpression, $ is an ordinary character except at the
end of the RE or the end of a parenthesized subexpression, and * is an
ordinary character if it appears at the beginning of the RE or the
beginning of a parenthesized subexpression (after a possible leading
“^”). Finally, single-digit back references are available, and \< and
\> are synonyms for “[[:<:]]” and “[[:>:]]” respectively; no other
escapes are available.
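A sketch of the BRE differences, selecting the BRE flavor with the embedded b option described under METASYNTAX. The sample strings are invented.

```tcl
# Backslashed parentheses group in a BRE; a single-digit back reference
# is available.
puts [regexp {(?b)^\(ab\)\1$} "abab"]    ;# 1

# + is an ordinary character in a BRE.
puts [regexp {(?b)^a+$} "a+"]            ;# 1
puts [regexp {(?b)^a+$} "aa"]            ;# 0

# Bounds use backslashed braces; bare parentheses are ordinary characters.
puts [regexp {(?b)^a\{2\}$} "aa"]        ;# 1
puts [regexp {(?b)^(a)$} "(a)"]          ;# 1
```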

SEE ALSO
RegExp(3), regexp(n), regsub(n), lsearch(n), switch(n), text(n)

KEYWORDS
match, regular expression, string

Tcl 8.1 re_syntax(n)