re_syntax(n)                Tcl Built-In Commands                re_syntax(n)

______________________________________________________________________________

NAME
re_syntax - Syntax of Tcl regular expressions
_________________________________________________________________

DESCRIPTION
A regular expression describes strings of characters. It is a pattern
that matches certain strings and does not match others.
14
DIFFERENT FLAVORS OF REs
Regular expressions (“RE”s), as defined by POSIX, come in two flavors:
extended REs (“ERE”s) and basic REs (“BRE”s). EREs are roughly those
of the traditional egrep, while BREs are roughly those of the
traditional ed. This implementation adds a third flavor, advanced REs
(“ARE”s), which are essentially EREs with some significant extensions.

This manual page primarily describes AREs. BREs mostly exist for
backward compatibility in some old programs; they are discussed at the
end. POSIX EREs are almost an exact subset of AREs. Features of AREs
that are not present in EREs are indicated.
26
REGULAR EXPRESSION SYNTAX
Tcl regular expressions are implemented using the package written by
Henry Spencer, based on the 1003.2 spec and some (not quite all) of the
Perl5 extensions (thanks, Henry!). Much of the description of regular
expressions below is copied verbatim from his manual entry.

An ARE is one or more branches, separated by “|”, matching anything
that matches any of the branches.

A branch is zero or more constraints or quantified atoms, concatenated.
It matches a match for the first, followed by a match for the second,
etc.; an empty branch matches the empty string.
39
40 QUANTIFIERS
41 A quantified atom is an atom possibly followed by a single quantifier.
42 Without a quantifier, it matches a single match for the atom. The
43 quantifiers, and what a so-quantified atom matches, are:
44
45 * a sequence of 0 or more matches of the atom
46
47 + a sequence of 1 or more matches of the atom
48
49 ? a sequence of 0 or 1 matches of the atom
50
51 {m} a sequence of exactly m matches of the atom
52
53 {m,} a sequence of m or more matches of the atom
54
55 {m,n} a sequence of m through n (inclusive) matches of the atom; m
56 may not exceed n
57
58 *? +? ?? {m}? {m,}? {m,n}?
59 non-greedy quantifiers, which match the same possibilities, but
60 prefer the smallest number rather than the largest number of
61 matches (see MATCHING)
62
63 The forms using { and } are known as bounds. The numbers m and n are
64 unsigned decimal integers with permissible values from 0 to 255 inclu‐
65 sive.
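
The quantifiers above can be tried out with the regexp command (see
regexp(n)). The following is a minimal sketch; the sample string and
the variable names are invented for illustration only.

```tcl
# Arbitrary sample input for this illustration.
set s "caaab x1234"

# Greedy + takes as many a's as it can.
set greedy [regexp -inline {a+} $s]          ;# aaa
# Non-greedy +? takes as few as it can.
set lazy   [regexp -inline {a+?} $s]         ;# a
# Bounds: exactly two digits, then two through three digits.
set two    [regexp -inline {[0-9]{2}} $s]    ;# 12
set upto3  [regexp -inline {[0-9]{2,3}} $s]  ;# 123
# ? makes the preceding atom optional.
set opt    [regexp -inline {xy?1} $s]        ;# x1

puts "$greedy $lazy $two $upto3 $opt"
```

Writing each RE in braces keeps the Tcl parser from touching the
bounds or any backslashes before the RE package sees them.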
66
67 ATOMS
68 An atom is one of:
69
70 (re) matches a match for re (re is any regular expression) with the
71 match noted for possible reporting
72
73 (?:re)
74 as previous, but does no reporting (a “non-capturing” set of
75 parentheses)
76
77 () matches an empty string, noted for possible reporting
78
79 (?:) matches an empty string, without reporting
80
81 [chars]
82 a bracket expression, matching any one of the chars (see
83 BRACKET EXPRESSIONS for more detail)
84
85 . matches any single character
86
87 \k matches the non-alphanumeric character k taken as an ordinary
88 character, e.g. \\ matches a backslash character
89
90 \c where c is alphanumeric (possibly followed by other charac‐
91 ters), an escape (AREs only), see ESCAPES below
92
93 { when followed by a character other than a digit, matches the
94 left-brace character “{”; when followed by a digit, it is the
95 beginning of a bound (see above)
96
97 x where x is a single character with no other significance,
98 matches that character.
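
As a short illustration of the atoms listed above (a sketch only; the
sample strings and variable names are made up), capturing and
non-capturing parentheses differ in what regexp reports:

```tcl
# (re) reports its match; (?:re) groups without reporting.
set caps [regexp -inline {(a)(?:b)(c)} "abc"]   ;# abc a c
# A backslash makes a non-alphanumeric character ordinary.
set dot  [regexp -inline {a\.c} "abc a.c"]      ;# a.c
# An unescaped . matches any single character.
set any  [regexp -inline {a.c} "abc a.c"]       ;# abc

puts $caps
```

The list returned for the first RE has three elements: the whole
match and the two reported subexpressions.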
99
100 CONSTRAINTS
101 A constraint matches an empty string when specific conditions are met.
102 A constraint may not be followed by a quantifier. The simple con‐
103 straints are as follows; some more constraints are described later,
104 under ESCAPES.
105
106 ^ matches at the beginning of a line
107
108 $ matches at the end of a line
109
110 (?=re) positive lookahead (AREs only), matches at any point where a
111 substring matching re begins
112
113 (?!re) negative lookahead (AREs only), matches at any point where no
114 substring matching re begins
115
116 The lookahead constraints may not contain back references (see later),
117 and all parentheses within them are considered non-capturing.
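
The lookahead constraints can be sketched as follows. This is an
illustration only, with invented sample strings; it assumes the
default (non-newline-sensitive) matching mode of regexp.

```tcl
# Positive lookahead: "foo" only where "bar" follows; "bar" is not consumed.
set pos  [regexp -inline {foo(?=bar)} "foobar"]   ;# foo
# Negative lookahead: "foo" only where "bar" does not follow.
set neg1 [regexp {foo(?!bar)} "foobar"]           ;# 0
set neg2 [regexp -inline {foo(?!bar)} "foobaz"]   ;# foo
# ^ and $ tie the match to both ends of the string here.
set anch [regexp {^abc$} "xabc"]                  ;# 0

puts "$pos $neg1 $neg2 $anch"
```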
118
119 An RE may not end with “\”.
120
BRACKET EXPRESSIONS
A bracket expression is a list of characters enclosed in “[]”. It
normally matches any single character from the list (but see below). If
the list begins with “^”, it matches any single character (but see
below) not from the rest of the list.
126
127 If two characters in the list are separated by “-”, this is shorthand
128 for the full range of characters between those two (inclusive) in the
129 collating sequence, e.g. “[0-9]” in Unicode matches any conventional
130 decimal digit. Two ranges may not share an endpoint, so e.g. “a-c-e”
131 is illegal. Ranges in Tcl always use the Unicode collating sequence,
132 but other programs may use other collating sequences and this can be a
133 source of incompatibility between programs.
134
135 To include a literal ] or - in the list, the simplest method is to
136 enclose it in [. and .] to make it a collating element (see below).
137 Alternatively, make it the first character (following a possible “^”),
138 or (AREs only) precede it with “\”. Alternatively, for “-”, make it
139 the last character, or the second endpoint of a range. To use a literal
140 - as the first endpoint of a range, make it a collating element or
141 (AREs only) precede it with “\”. With the exception of these, some
142 combinations using [ (see next paragraphs), and escapes, all other spe‐
143 cial characters lose their special significance within a bracket
144 expression.
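
The rules for literal ] and - can be illustrated with a small sketch
(sample strings and variable names are invented; the REs are assumed
to be AREs, which is the default for regexp):

```tcl
# ] placed first in the list is taken literally.
regexp {[]a]+} {x]a]y} m1        ;# m1 is ]a]
# - placed last in the list is taken literally.
regexp {[a-c-]+} {xb-ay} m2      ;# m2 is b-a
# In AREs, a backslash also makes ] or - literal inside brackets.
regexp {[\]\-]+} {a]-b} m3       ;# m3 is ]-
# A leading ^ complements the list.
regexp {[^0-9]+} {12ab34} m4     ;# m4 is ab

puts "$m1 $m2 $m3 $m4"
```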
145
146 CHARACTER CLASSES
147 Within a bracket expression, the name of a character class enclosed in
148 [: and :] stands for the list of all characters (not all collating ele‐
149 ments!) belonging to that class. Standard character classes are:
150
151 alpha A letter.
152
153 upper An upper-case letter.
154
155 lower A lower-case letter.
156
157 digit A decimal digit.
158
159 xdigit A hexadecimal digit.
160
161 alnum An alphanumeric (letter or digit).
162
163 print A "printable" (same as graph, except also including space).
164
165 blank A space or tab character.
166
167 space A character producing white space in displayed text.
168
169 punct A punctuation character.
170
171 graph A character with a visible representation (includes both alnum
172 and punct).
173
174 cntrl A control character.
175
176 A locale may provide others. A character class may not be used as an
177 endpoint of a range.
178
179 (Note: the current Tcl implementation has only one locale, the
180 Unicode locale, which supports exactly the above classes.)
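
A brief sketch of character classes in use; the sample strings and
variable names are assumptions made for this illustration.

```tcl
# One class per bracket expression.
regexp {[[:alpha:]]+} "42abc7" m1                 ;# m1 is abc
# Several classes may share one bracket expression.
regexp {[[:digit:][:space:]]+} "ab 12 cd" m2      ;# m2 is " 12 "
# A class may appear in a complemented list.
regexp {[^[:alnum:]]+} "ab--cd" m3                ;# m3 is --
# Classes combine with ordinary atoms and quantifiers.
regexp {[[:upper:]][[:lower:]]*} "the Tcl way" m4 ;# m4 is Tcl

puts "$m1|$m2|$m3|$m4"
```

Note the doubled brackets: the outer pair delimits the bracket
expression, the inner [: and :] delimit the class name.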
181
182 BRACKETED CONSTRAINTS
183 There are two special cases of bracket expressions: the bracket expres‐
184 sions “[[:<:]]” and “[[:>:]]” are constraints, matching empty strings
185 at the beginning and end of a word respectively. A word is defined as
186 a sequence of word characters that is neither preceded nor followed by
187 word characters. A word character is an alnum character or an under‐
188 score (“_”). These special bracket expressions are deprecated; users
189 of AREs should use constraint escapes instead (see below).
190
191 COLLATING ELEMENTS
192 Within a bracket expression, a collating element (a character, a multi-
193 character sequence that collates as if it were a single character, or a
194 collating-sequence name for either) enclosed in [. and .] stands for
195 the sequence of characters of that collating element. The sequence is a
196 single element of the bracket expression's list. A bracket expression
197 in a locale that has multi-character collating elements can thus match
198 more than one character. So (insidiously), a bracket expression that
199 starts with ^ can match multi-character collating elements even if none
200 of them appear in the bracket expression!
201
202 (Note: Tcl has no multi-character collating elements. This
203 information is only for illustration.)
204
205 For example, assume the collating sequence includes a ch multi-charac‐
206 ter collating element. Then the RE “[[.ch.]]*c” (zero or more “chs”
207 followed by “c”) matches the first five characters of “chchcc”. Also,
208 the RE “[^c]b” matches all of “chb” (because “[^c]” matches the multi-
209 character “ch”).
210
211 EQUIVALENCE CLASSES
212 Within a bracket expression, a collating element enclosed in [= and =]
213 is an equivalence class, standing for the sequences of characters of
214 all collating elements equivalent to that one, including itself. (If
215 there are no other equivalent collating elements, the treatment is as
216 if the enclosing delimiters were “[.” and “.]”.) For example, if o and
217 ô are the members of an equivalence class, then “[[=o=]]”, “[[=ô=]]”,
218 and “[oô]” are all synonymous. An equivalence class may not be an end‐
219 point of a range.
220
221 (Note: Tcl implements only the Unicode locale. It does not
222 define any equivalence classes. The examples above are just
223 illustrations.)
224
ESCAPES
Escapes (AREs only), which begin with a \ followed by an alphanumeric
character, come in several varieties: character entry, class
shorthands, constraint escapes, and back references. A \ followed by an
alphanumeric character but not constituting a valid escape is illegal
in AREs. In EREs, there are no escapes: outside a bracket expression, a
\ followed by an alphanumeric character merely stands for that
character as an ordinary character, and inside a bracket expression, \
is an ordinary character. (The latter is the one actual incompatibility
between EREs and AREs.)
235
236 CHARACTER-ENTRY ESCAPES
237 Character-entry escapes (AREs only) exist to make it easier to specify
238 non-printing and otherwise inconvenient characters in REs:
239
240 \a alert (bell) character, as in C
241
242 \b backspace, as in C
243
244 \B synonym for \ to help reduce backslash doubling in some applica‐
245 tions where there are multiple levels of backslash processing
246
247 \cX (where X is any character) the character whose low-order 5 bits
248 are the same as those of X, and whose other bits are all zero
249
250 \e the character whose collating-sequence name is “ESC”, or failing
251 that, the character with octal value 033
252
253 \f formfeed, as in C
254
255 \n newline, as in C
256
257 \r carriage return, as in C
258
259 \t horizontal tab, as in C
260
\uwxyz
(where wxyz is one to four hexadecimal digits) the Unicode
character U+wxyz in the local byte ordering

\Ustuvwxyz
(where stuvwxyz is one to eight hexadecimal digits) reserved
for a Unicode extension up to 21 bits. The digits are parsed
until the first non-hexadecimal character is encountered, the
maximum of eight hexadecimal digits is reached, or an overflow
would occur beyond the maximum value of U+10ffff.

\v vertical tab, as in C

\xhh (where hh is one or two hexadecimal digits) the character whose
hexadecimal value is 0xhh.
276
277 \0 the character whose value is 0
278
279 \xyz (where xyz is exactly three octal digits, and is not a back ref‐
280 erence (see below)) the character whose octal value is 0xyz. The
281 first digit must be in the range 0-3, otherwise the two-digit
282 form is assumed.
283
284 \xy (where xy is exactly two octal digits, and is not a back refer‐
285 ence (see below)) the character whose octal value is 0xy
286
287 Hexadecimal digits are “0”-“9”, “a”-“f”, and “A”-“F”. Octal digits are
288 “0”-“7”.
289
290 The character-entry escapes are always taken as ordinary characters.
291 For example, \135 is ] in Unicode, but \135 does not terminate a
292 bracket expression. Beware, however, that some applications (e.g., C
293 compilers and the Tcl interpreter if the regular expression is not
294 quoted with braces) interpret such sequences themselves before the reg‐
295 ular-expression package gets to see them, which may require doubling
296 (quadrupling, etc.) the “\”.
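
The interaction between Tcl quoting and character-entry escapes can
be sketched as follows. The example relies on the fact that Tcl's own
backslash substitution has no \e sequence and simply drops the
backslash; the variable names are invented for illustration.

```tcl
# Braced: the RE package sees \e, which stands for ESC (octal 033).
set esc1 [regexp {\e} "\x1b"]      ;# 1
set esc2 [regexp {\e} "e"]         ;# 0
# Double-quoted: Tcl drops the backslash first, so the RE is just e.
set esc3 [regexp "\e" "e"]         ;# 1
# Doubling the backslash gets the escape through the quotes.
set esc4 [regexp "\\e" "\x1b"]     ;# 1
# \135 is ] but, being an escape, does not close the bracket expression.
regexp {[\135a]+} {x]a]y} m        ;# m is ]a]

puts "$esc1 $esc2 $esc3 $esc4 $m"
```

Enclosing the RE in braces is usually the simplest way to avoid
having to count backslashes.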
297
298 CLASS-SHORTHAND ESCAPES
299 Class-shorthand escapes (AREs only) provide shorthands for certain com‐
300 monly-used character classes:
301
302 \d [[:digit:]]
303
304 \s [[:space:]]
305
306 \w [[:alnum:]_] (note underscore)
307
308 \D [^[:digit:]]
309
310 \S [^[:space:]]
311
312 \W [^[:alnum:]_] (note underscore)
313
Within bracket expressions, “\d”, “\s”, and “\w” lose their outer
brackets, and “\D”, “\S”, and “\W” are illegal. (So, for example,
“[a-c\d]” is equivalent to “[a-c[:digit:]]”. Also, “[a-c\D]”, which is
equivalent to “[a-c^[:digit:]]”, is illegal.)
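
A minimal sketch of the class shorthands, using invented sample
strings and variable names:

```tcl
# \d is any decimal digit.
regexp {\d+} "abc 123" m1                ;# m1 is 123
# \w is an alphanumeric or underscore.
regexp {\w+} "--foo_bar9--" m2           ;# m2 is foo_bar9
# \s and \S are white space and its complement.
regexp {\S+\s+\S+} "  one   two  " m3    ;# m3 is "one   two"
# Inside brackets \d contributes its members to the list.
regexp {[a-c\d]+} "xyzb2a9q" m4          ;# m4 is b2a9

puts "$m1|$m2|$m3|$m4"
```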
318
319 CONSTRAINT ESCAPES
320 A constraint escape (AREs only) is a constraint, matching the empty
321 string if specific conditions are met, written as an escape:
322
323 \A matches only at the beginning of the string (see MATCHING,
324 below, for how this differs from “^”)
325
326 \m matches only at the beginning of a word
327
328 \M matches only at the end of a word
329
330 \y matches only at the beginning or end of a word
331
332 \Y matches only at a point that is not the beginning or end of a
333 word
334
335 \Z matches only at the end of the string (see MATCHING, below, for
336 how this differs from “$”)
337
338 \m (where m is a nonzero digit) a back reference, see below
339
340 \mnn (where m is a nonzero digit, and nn is some more digits, and
341 the decimal value mnn is not greater than the number of closing
342 capturing parentheses seen so far) a back reference, see below
343
344 A word is defined as in the specification of “[[:<:]]” and “[[:>:]]”
345 above. Constraint escapes are illegal within bracket expressions.
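
The word-boundary escapes can be illustrated with the -indices switch
of regexp, which reports the first and last character positions of
the match. This is a sketch with an invented sample string.

```tcl
set s "concat cat"

# Without constraints the first "cat" found is inside "concat".
regexp -indices {cat} $s plain       ;# plain is 3 5
# \m and \M restrict the match to a whole word.
regexp -indices {\mcat\M} $s word    ;# word is 7 9
# \Y holds where there is no word boundary, as in the middle of "concat".
set inside [regexp {\Ycat} $s]       ;# 1
# \A and \Z tie the match to the ends of the string.
set whole  [regexp {\Acat\Z} $s]     ;# 0

puts "$plain | $word | $inside | $whole"
```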
346
347 BACK REFERENCES
348 A back reference (AREs only) matches the same string matched by the
349 parenthesized subexpression specified by the number, so that (e.g.)
350 “([bc])\1” matches “bb” or “cc” but not “bc”. The subexpression must
351 entirely precede the back reference in the RE. Subexpressions are num‐
352 bered in the order of their leading parentheses. Non-capturing paren‐
353 theses do not define subexpressions.
354
355 There is an inherent historical ambiguity between octal character-entry
356 escapes and back references, which is resolved by heuristics, as hinted
357 at above. A leading zero always indicates an octal escape. A single
358 non-zero digit, not followed by another digit, is always taken as a
359 back reference. A multi-digit sequence not starting with a zero is
360 taken as a back reference if it comes after a suitable subexpression
361 (i.e. the number is in the legal range for a back reference), and oth‐
362 erwise is taken as octal.
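
The back-reference rules can be sketched as follows. The sample
strings and variable names are assumptions for illustration; the last
command shows the octal fallback when no subexpression precedes a
multi-digit sequence.

```tcl
# The example RE from the text: the same character twice.
set re {([bc])\1}
set bb [regexp $re "bb"]     ;# 1
set bc [regexp $re "bc"]     ;# 0

# A doubled word: \1 must repeat whatever (\w+) matched.
regexp {\m(\w+)\s+\1\M} "this is is a test" -> dup   ;# dup is is

# No capturing parentheses precede \101, so it is read as octal 101 (A).
set octal [regexp {\101} "A"]   ;# 1

puts "$bb $bc $dup $octal"
```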
363
METASYNTAX
In addition to the main syntax described above, there are some special
forms and miscellaneous syntactic facilities available.

Normally the flavor of RE being used is specified by
application-dependent means. However, this can be overridden by a
director. If an RE of any flavor begins with “***:”, the rest of the RE
is an ARE. If an RE of any flavor begins with “***=”, the rest of the RE
is taken to be a literal string, with all characters considered
ordinary characters.
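
A minimal sketch of the two directors. The regexp command already
treats its pattern as an ARE by default, so “***:” changes nothing
in the last command and is shown only for completeness.

```tcl
# ***= makes the rest of the RE a literal string, so . is ordinary.
set lit1 [regexp {***=a.c} "a.c"]   ;# 1
set lit2 [regexp {***=a.c} "abc"]   ;# 0
# ***: declares the rest of the RE to be an ARE.
set are  [regexp {***:\d+} "42"]    ;# 1

puts "$lit1 $lit2 $are"
```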
373
374 An ARE may begin with embedded options: a sequence (?xyz) (where xyz is
375 one or more alphabetic characters) specifies options affecting the rest
376 of the RE. These supplement, and can override, any options specified by
377 the application. The available option letters are:
378
379 b rest of RE is a BRE
380
381 c case-sensitive matching (usual default)
382
383 e rest of RE is an ERE
384
385 i case-insensitive matching (see MATCHING, below)
386
387 m historical synonym for n
388
389 n newline-sensitive matching (see MATCHING, below)
390
391 p partial newline-sensitive matching (see MATCHING, below)
392
393 q rest of RE is a literal (“quoted”) string, all ordinary characters
394
395 s non-newline-sensitive matching (usual default)
396
397 t tight syntax (usual default; see below)
398
399 w inverse partial newline-sensitive (“weird”) matching (see MATCH‐
400 ING, below)
401
402 x expanded syntax (see below)
403
404 Embedded options take effect at the ) terminating the sequence. They
405 are available only at the start of an ARE, and may not be used later
406 within it.
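
Embedded options can be sketched like this (sample strings and
variable names are invented for illustration):

```tcl
# i: case-insensitive matching.
set ci   [regexp {(?i)hello} "HeLLo"]             ;# 1
set cs   [regexp {hello} "HeLLo"]                 ;# 0
# q: the rest of the RE is a literal string.
set q    [regexp {(?q)a.c} "abc"]                 ;# 0
# Several option letters may be combined in one sequence.
set both [regexp {(?in)^world$} "Hello\nWORLD"]   ;# 1

puts "$ci $cs $q $both"
```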
407
408 In addition to the usual (tight) RE syntax, in which all characters are
409 significant, there is an expanded syntax, available in all flavors of
410 RE with the -expanded switch, or in AREs with the embedded x option. In
411 the expanded syntax, white-space characters are ignored and all charac‐
412 ters between a # and the following newline (or the end of the RE) are
413 ignored, permitting paragraphing and commenting a complex RE. There are
414 three exceptions to that basic rule:
415
416 · a white-space character or “#” preceded by “\” is retained
417
418 · white space or “#” within a bracket expression is retained
419
420 · white space and comments are illegal within multi-character symbols
421 like the ARE “(?:” or the BRE “\(”
422
423 Expanded-syntax white-space characters are blank, tab, newline, and any
424 character that belongs to the space character class.
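
The expanded syntax can be sketched as follows. The date-like sample
text and the variable names are assumptions for illustration. Comments
inside a braced RE should avoid unbalanced braces, since the Tcl
parser still counts them.

```tcl
set re {(?x)
    (\d{4})  -   # four-digit year, then a literal hyphen
    (\d{2})      # two-digit month
}
set ok [regexp $re "on 2024-05-17" -> year month]
# ok is 1, year is 2024, month is 05

# A white-space character preceded by \ is retained.
set sp [regexp {(?x) a \  b } "a b"]      ;# 1

# The -expanded switch selects the same syntax without (?x).
set sw [regexp -expanded {a b c} "abc"]   ;# 1

puts "$ok $year $month $sp $sw"
```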
425
426 Finally, in an ARE, outside bracket expressions, the sequence “(?#ttt)”
427 (where ttt is any text not containing a “)”) is a comment, completely
428 ignored. Again, this is not allowed between the characters of multi-
429 character symbols like “(?:”. Such comments are more a historical
430 artifact than a useful facility, and their use is deprecated; use the
431 expanded syntax instead.
432
433 None of these metasyntax extensions is available if the application (or
434 an initial “***=” director) has specified that the user's input be
435 treated as a literal string rather than as an RE.
436
MATCHING
In the event that an RE could match more than one substring of a given
string, the RE matches the one starting earliest in the string. If the
RE could match more than one substring starting at that point, its
choice is determined by its preference: either the longest substring,
or the shortest.
443
444 Most atoms, and all constraints, have no preference. A parenthesized RE
445 has the same preference (possibly none) as the RE. A quantified atom
446 with quantifier {m} or {m}? has the same preference (possibly none) as
447 the atom itself. A quantified atom with other normal quantifiers
448 (including {m,n} with m equal to n) prefers longest match. A quantified
449 atom with other non-greedy quantifiers (including {m,n}? with m equal
450 to n) prefers shortest match. A branch has the same preference as the
451 first quantified atom in it which has a preference. An RE consisting of
452 two or more branches connected by the | operator prefers longest match.
453
454 Subject to the constraints imposed by the rules for matching the whole
455 RE, subexpressions also match the longest or shortest possible sub‐
456 strings, based on their preferences, with subexpressions starting ear‐
457 lier in the RE taking priority over ones starting later. Note that
458 outer subexpressions thus take priority over their component subexpres‐
459 sions.
460
461 The quantifiers {1,1} and {1,1}? can be used to force longest and
462 shortest preference, respectively, on a subexpression or a whole RE.
463
NOTE: This means that you can usually make an RE non-greedy
overall by putting {1,1}? after one of the first non-constraint
atoms or parenthesized sub-expressions in it. When working at this
level of complexity, it pays to experiment with the placement of
this non-greediness override on a suitable range of input texts.

For example, this regular expression is non-greedy, and will
match the shortest substring possible given that “abc” will be
matched as early as possible (the quantifier does not change
that):

    ab{1,1}?c.*x.*cba

The atom “a” has no greediness preference; we explicitly give
one for “b”, and the remaining quantifiers are overridden to be
non-greedy by the preceding non-greedy quantifier.
481
482 Match lengths are measured in characters, not collating elements. An
483 empty string is considered longer than no match at all. For example,
484 “bb*” matches the three middle characters of “abbbc”,
485 “(week|wee)(night|knights)” matches all ten characters of “weeknights”,
486 when “(.*).*” is matched against “abc” the parenthesized subexpression
487 matches all three characters, and when “(a*)*” is matched against “bc”
488 both the whole RE and the parenthesized subexpression match an empty
489 string.
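
The examples in the preceding paragraph, together with the preference
rule for branches, can be tried out as follows. This is a sketch; the
variable names and the extra sample strings are invented.

```tcl
# Earliest start wins, then the preference decides the length.
regexp {bb*} "abbbc" m1                                  ;# m1 is bbb
# An RE of several branches prefers the longest overall match.
regexp {(week|wee)(night|knights)} "weeknights" m2 a b
# m2 is weeknights, a is wee, b is knights
# A subexpression that starts earlier takes priority.
regexp {(.*).*} "abc" m3 sub                             ;# sub is abc
# One quantifier: greedy versus non-greedy.
regexp {a.*b} "aXbYb" m4                                 ;# m4 is aXbYb
regexp {a.*?b} "aXbYb" m5                                ;# m5 is aXb
# The first quantified atom with a preference sets it for the whole
# branch, so the later \d+ also ends up matching as little as possible.
regexp {(.*?)(\d+)} "abc123" m6 pre num
# m6 is abc1, pre is abc, num is 1

puts "$m1 $m2 $a $b $sub $m4 $m5 $m6"
```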
490
491 If case-independent matching is specified, the effect is much as if all
492 case distinctions had vanished from the alphabet. When an alphabetic
493 that exists in multiple cases appears as an ordinary character outside
494 a bracket expression, it is effectively transformed into a bracket
495 expression containing both cases, so that x becomes “[xX]”. When it
496 appears inside a bracket expression, all case counterparts of it are
497 added to the bracket expression, so that “[x]” becomes “[xX]” and
498 “[^x]” becomes “[^xX]”.
499
500 If newline-sensitive matching is specified, . and bracket expressions
501 using ^ will never match the newline character (so that matches will
502 never cross newlines unless the RE explicitly arranges it) and ^ and $
503 will match the empty string after and before a newline respectively, in
504 addition to matching at beginning and end of string respectively. ARE
505 \A and \Z continue to match beginning or end of string only.
506
507 If partial newline-sensitive matching is specified, this affects . and
508 bracket expressions as with newline-sensitive matching, but not ^ and
509 $.
510
511 If inverse partial newline-sensitive matching is specified, this
512 affects ^ and $ as with newline-sensitive matching, but not . and
513 bracket expressions. This is not very useful but is provided for symme‐
514 try.
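
The four newline modes can be compared side by side using the
embedded options n, p and w described under METASYNTAX. This is a
sketch with an invented two-line sample string.

```tcl
set text "one\ntwo"

# Default: ^ and $ match only at the ends of the string; . matches newline.
set d1 [regexp {^two$} $text]         ;# 0
set d2 [regexp {one.two} $text]       ;# 1
# n: newline-sensitive; both behaviours change.
set n1 [regexp {(?n)^two$} $text]     ;# 1
set n2 [regexp {(?n)one.two} $text]   ;# 0
# p: partial; only . and bracket expressions change.
set p1 [regexp {(?p)^two$} $text]     ;# 0
set p2 [regexp {(?p)one.two} $text]   ;# 0
# w: inverse partial; only ^ and $ change.
set w1 [regexp {(?w)^two$} $text]     ;# 1
set w2 [regexp {(?w)one.two} $text]   ;# 1
# \A and \Z still refer to the ends of the whole string.
set z  [regexp {(?n)\Atwo\Z} $text]   ;# 0

puts "$d1$d2 $n1$n2 $p1$p2 $w1$w2 $z"
```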
515
LIMITS AND COMPATIBILITY
No particular limit is imposed on the length of REs. Programs intended
to be highly portable should not employ REs longer than 256 bytes, as a
POSIX-compliant implementation can refuse to accept such REs.
520
521 The only feature of AREs that is actually incompatible with POSIX EREs
522 is that \ does not lose its special significance inside bracket expres‐
523 sions. All other ARE features use syntax which is illegal or has unde‐
524 fined or unspecified effects in POSIX EREs; the *** syntax of directors
525 likewise is outside the POSIX syntax for both BREs and EREs.
526
527 Many of the ARE extensions are borrowed from Perl, but some have been
528 changed to clean them up, and a few Perl extensions are not present.
529 Incompatibilities of note include “\b”, “\B”, the lack of special
530 treatment for a trailing newline, the addition of complemented bracket
531 expressions to the things affected by newline-sensitive matching, the
532 restrictions on parentheses and back references in lookahead con‐
533 straints, and the longest/shortest-match (rather than first-match)
534 matching semantics.
535
536 The matching rules for REs containing both normal and non-greedy quan‐
537 tifiers have changed since early beta-test versions of this package.
538 (The new rules are much simpler and cleaner, but do not work as hard at
539 guessing the user's real intentions.)
540
541 Henry Spencer's original 1986 regexp package, still in widespread use
542 (e.g., in pre-8.1 releases of Tcl), implemented an early version of
543 today's EREs. There are four incompatibilities between regexp's near-
544 EREs (“RREs” for short) and AREs. In roughly increasing order of sig‐
545 nificance:
546
547 · In AREs, \ followed by an alphanumeric character is either an escape
548 or an error, while in RREs, it was just another way of writing the
549 alphanumeric. This should not be a problem because there was no rea‐
550 son to write such a sequence in RREs.
551
552 · { followed by a digit in an ARE is the beginning of a bound, while
553 in RREs, { was always an ordinary character. Such sequences should
554 be rare, and will often result in an error because following charac‐
555 ters will not look like a valid bound.
556
557 · In AREs, \ remains a special character within “[]”, so a literal \
558 within [] must be written “\\”. \\ also gives a literal \ within []
559 in RREs, but only truly paranoid programmers routinely doubled the
560 backslash.
561
562 · AREs report the longest/shortest match for the RE, rather than the
563 first found in a specified search order. This may affect some RREs
564 which were written in the expectation that the first match would be
565 reported. (The careful crafting of RREs to optimize the search order
566 for fast matching is obsolete (AREs examine all possible matches in
567 parallel, and their performance is largely insensitive to their com‐
568 plexity) but cases where the search order was exploited to deliber‐
569 ately find a match which was not the longest/shortest will need
570 rewriting.)
571
BASIC REGULAR EXPRESSIONS
BREs differ from EREs in several respects. “|”, “+”, and “?” are
ordinary characters and there is no equivalent for their functionality.
The delimiters for bounds are “\{” and “\}”, with “{” and “}” by
themselves ordinary characters. The parentheses for nested
subexpressions are “\(” and “\)”, with “(” and “)” by themselves
ordinary characters. “^” is an ordinary character except at the
beginning of the RE or the beginning of a parenthesized subexpression,
“$” is an ordinary character except at the end of the RE or the end of
a parenthesized subexpression, and “*” is an ordinary character if it
appears at the beginning of the RE or the beginning of a parenthesized
subexpression (after a possible leading “^”). Finally, single-digit
back references are available, and “\<” and “\>” are synonyms for
“[[:<:]]” and “[[:>:]]” respectively; no other escapes are available.
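
A BRE can be tried from Tcl by starting an ARE with the embedded
option b (see METASYNTAX). The following is a sketch with invented
sample strings and variable names.

```tcl
# | is an ordinary character in a BRE.
set alt1 [regexp {(?b)a|b} "a"]              ;# 0
set alt2 [regexp {(?b)a|b} "x a|b y"]        ;# 1
# \( \) delimit subexpressions and \{ \} delimit bounds.
regexp {(?b)\(ab\)\{2\}} "xababy" m g        ;# m is abab, g is ab
# Unescaped ( and ) are ordinary characters.
set paren [regexp {(?b)f(x)} "f(x)"]         ;# 1
# Single-digit back references are available.
set dup [regexp {(?b)\(.\)\1} "hello"]       ;# 1

puts "$alt1 $alt2 $m $g $paren $dup"
```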
586
SEE ALSO
RegExp(3), regexp(n), regsub(n), lsearch(n), switch(n), text(n)

KEYWORDS
match, regular expression, string

Tcl 8.1                                                      re_syntax(n)