regexp(5)          Standards, Environments, and Macros          regexp(5)

NAME
regexp, compile, step, advance - simple regular expression compile and
match routines

SYNOPSIS
    #define INIT declarations
    #define GETC(void) getc code
    #define PEEKC(void) peekc code
    #define UNGETC(c) ungetc code
    #define RETURN(ptr) return code
    #define ERROR(val) error code

    extern char *loc1, *loc2, *locs;

    #include <regexp.h>

    char *compile(char *instring, char *expbuf, const char *endbuf, int eof);

    int step(const char *string, const char *expbuf);

    int advance(const char *string, const char *expbuf);

DESCRIPTION
Regular Expressions (REs) provide a mechanism to select specific
strings from a set of character strings. The Simple Regular Expressions
described below differ from the Internationalized Regular Expressions
described on the regex(5) manual page in the following ways:

o  only Basic Regular Expressions are supported

o  the Internationalization features (character class, equivalence
   class, and multi-character collation) are not supported.

The functions step(), advance(), and compile() are general purpose
regular expression matching routines to be used in programs that
perform regular expression matching. These functions are defined by the
<regexp.h> header.

The functions step() and advance() do pattern matching given a
character string and a compiled regular expression as input.

The function compile() takes as input a regular expression as defined
below and produces a compiled expression that can be used with step()
or advance().

Basic Regular Expressions
A regular expression specifies a set of character strings. A member of
this set of strings is said to be matched by the regular expression.
Some characters have special meaning when used in a regular expression;
other characters stand for themselves.

The following one-character REs match a single character:

1.1  An ordinary character (not one of those discussed in 1.2 below)
     is a one-character RE that matches itself.

1.2  A backslash (\) followed by any special character is a
     one-character RE that matches the special character itself. The
     special characters are:

     a.  ., *, [, and \ (period, asterisk, left square bracket, and
         backslash, respectively), which are always special, except
         when they appear within square brackets ([]; see 1.4 below).

     b.  ^ (caret or circumflex), which is special at the beginning
         of an entire RE (see 4.1 and 4.3 below), or when it
         immediately follows the left of a pair of square brackets
         ([]) (see 1.4 below).

     c.  $ (dollar sign), which is special at the end of an entire
         RE (see 4.2 below).

     d.  The character used to bound (that is, delimit) an entire
         RE, which is special for that RE (for example, see how
         slash (/) is used in the g command of ed(1)).

1.3  A period (.) is a one-character RE that matches any character
     except new-line.

1.4  A non-empty string of characters enclosed in square brackets
     ([]) is a one-character RE that matches any one character in
     that string. If, however, the first character of the string is a
     circumflex (^), the one-character RE matches any character
     except new-line and the remaining characters in the string. The
     ^ has this special meaning only if it occurs first in the
     string. The minus (-) may be used to indicate a range of
     consecutive characters; for example, [0-9] is equivalent to
     [0123456789]. The - loses this special meaning if it occurs
     first (after an initial ^, if any) or last in the string. The
     right square bracket (]) does not terminate such a string when
     it is the first character within it (after an initial ^, if
     any); for example, []a-f] matches either a right square bracket
     (]) or one of the ASCII letters a through f inclusive. The four
     characters listed in 1.2.a above stand for themselves within
     such a string of characters.

The following rules may be used to construct REs from one-character
REs:

2.1  A one-character RE is an RE that matches whatever the
     one-character RE matches.

2.2  A one-character RE followed by an asterisk (*) is an RE that
     matches 0 or more occurrences of the one-character RE. If there
     is any choice, the longest leftmost string that permits a match
     is chosen.

2.3  A one-character RE followed by \{m\}, \{m,\}, or \{m,n\} is an RE
     that matches a range of occurrences of the one-character RE. The
     values of m and n must be non-negative integers less than 256;
     \{m\} matches exactly m occurrences; \{m,\} matches at least m
     occurrences; \{m,n\} matches any number of occurrences between m
     and n inclusive. Whenever a choice exists, the RE matches as
     many occurrences as possible.

2.4  The concatenation of REs is an RE that matches the concatenation
     of the strings matched by each component of the RE.

2.5  An RE enclosed between the character sequences \( and \) is an RE
     that matches whatever the unadorned RE matches.

2.6  The expression \n matches the same string of characters as was
     matched by an expression enclosed between \( and \) earlier in
     the same RE. Here n is a digit; the sub-expression specified is
     that beginning with the n-th occurrence of \( counting from the
     left. For example, the expression ^\(.*\)\1$ matches a line
     consisting of two repeated appearances of the same string.
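
The construction rules above can be tried out with a short C sketch.
Because <regexp.h> is not available on every current system, the sketch
deliberately substitutes the POSIX regcomp() and regexec() interface
from <regex.h>, which accepts the same Basic Regular Expression
constructs shown here when called with no flags. The helper name
bre_matches() is hypothetical and belongs to neither interface.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of <regexp.h> or <regex.h>).
 * Returns 1 if the Basic Regular Expression `pattern` matches somewhere
 * in `text`, 0 if it does not, and -1 if the pattern fails to compile.
 */
int bre_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    /* A cflags value of 0 selects Basic Regular Expression syntax. */
    if (regcomp(&re, pattern, 0) != 0)
        return -1;

    /* No match offsets are requested, so nmatch is 0 and pmatch is NULL. */
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);

    return rc == 0;
}
```

Under these assumptions, bre_matches("^\\(.*\\)\\1$", "abcabc") returns
1 because the line is the string "abc" repeated twice (rule 2.6), while
the same pattern applied to "abcabd" returns 0. Likewise
"^a\\{2,3\\}$" accepts "aaa" but rejects "aaaa" (rule 2.3).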

An RE may be constrained to match words.

3.1  \< constrains an RE to match the beginning of a string or to
     follow a character that is not a digit, underscore, or letter. The
     first character matching the RE must be a digit, underscore, or
     letter.

3.2  \> constrains an RE to match the end of a string or to precede a
     character that is not a digit, underscore, or letter.

An entire RE may be constrained to match only an initial segment or
final segment of a line (or both).

4.1  A circumflex (^) at the beginning of an entire RE constrains
     that RE to match an initial segment of a line.

4.2  A dollar sign ($) at the end of an entire RE constrains that RE
     to match a final segment of a line.

4.3  The construction ^entire RE$ constrains the entire RE to match
     the entire line.

The null RE (for example, //) is equivalent to the last RE encountered.

Addressing with REs
Addresses are constructed as follows:

1.   The character "." addresses the current line.

2.   The character "$" addresses the last line of the buffer.

3.   A decimal number n addresses the n-th line of the buffer.

4.   'x addresses the line marked with the mark name character x,
     which must be an ASCII lower-case letter (a-z). Lines are
     marked with the k command of ed(1).

5.   An RE enclosed by slashes (/) addresses the first line found
     by searching forward from the line following the current
     line toward the end of the buffer and stopping at the first
     line containing a string matching the RE. If necessary, the
     search wraps around to the beginning of the buffer and
     continues up to and including the current line, so that the
     entire buffer is searched.

6.   An RE enclosed in question marks (?) addresses the first line
     found by searching backward from the line preceding the
     current line toward the beginning of the buffer and stopping at
     the first line containing a string matching the RE. If
     necessary, the search wraps around to the end of the buffer and
     continues up to and including the current line.

7.   An address followed by a plus sign (+) or a minus sign (-)
     followed by a decimal number specifies that address plus
     (respectively minus) the indicated number of lines. A
     shorthand for .+5 is .5.

8.   If an address begins with + or -, the addition or subtraction
     is taken with respect to the current line; for example,
     -5 is understood to mean .-5.

9.   If an address ends with + or -, then 1 is added to or
     subtracted from the address, respectively. As a consequence of
     this rule and of Rule 8, immediately above, the address -
     refers to the line preceding the current line. (To maintain
     compatibility with earlier versions of the editor, the
     character ^ in addresses is entirely equivalent to -.) Moreover,
     trailing + and - characters have a cumulative effect, so --
     refers to the current line less 2.

10.  For convenience, a comma (,) stands for the address pair
     1,$, while a semicolon (;) stands for the pair .,$.
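
Rules 7 through 9 amount to simple arithmetic on line numbers. The
following C sketch illustrates that arithmetic only. The helper
apply_offsets() is hypothetical; it is not part of any editor or of
<regexp.h>. It assumes the caller has already resolved the base address
(the current line when the address begins with + or -, per Rule 8), and
it performs no range checking against the buffer.

```c
#include <ctype.h>

/*
 * Hypothetical helper: apply the +/- suffix of an address to `base`.
 * `suffix` is the text that follows the base address, for example
 * "+5", "-5", "5", "+", "-", "--" or "^".
 */
long apply_offsets(long base, const char *suffix)
{
    const char *s = suffix;
    long line = base;

    while (*s != '\0') {
        int sign = 1;

        if (*s == '+') {
            s++;
        } else if (*s == '-' || *s == '^') {   /* ^ is equivalent to - */
            sign = -1;
            s++;
        } else if (!isdigit((unsigned char)*s)) {
            break;                              /* not part of an offset */
        }

        if (isdigit((unsigned char)*s)) {
            /* Rule 7: a signed (or bare, meaning +) decimal count. */
            long n = 0;
            while (isdigit((unsigned char)*s))
                n = n * 10 + (*s++ - '0');
            line += sign * n;
        } else {
            /* Rule 9: a + or - with no number counts as 1. */
            line += sign;
        }
    }
    return line;
}
```

With the current line at 10, apply_offsets(10, "--") yields 8,
apply_offsets(10, "5") yields 15 (the .5 shorthand), and
apply_offsets(10, "^") yields 9.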

Characters With Special Meaning
Characters that have special meaning except when they appear within
square brackets ([]) or are preceded by \ are: ., *, [, \. Other
special characters, such as $, have special meaning in more restricted
contexts.

The character ^ at the beginning of an expression permits a successful
match only immediately after a newline, and the character $ at the end
of an expression requires a trailing newline.

Two characters have special meaning only when used within square
brackets. The character - denotes a range, [c-c], unless it is just
after the open bracket or before the closing bracket, [-c] or [c-], in
which case it has no special meaning. When used within brackets, the
character ^ has the meaning "complement of" if it immediately follows
the open bracket (example: [^c]); elsewhere between brackets (example:
[c^]) it stands for the ordinary character ^.

The special meaning of the \ operator can be escaped only by preceding
it with another \, for example \\.

Macros
Programs must have the following five macros declared before the
#include <regexp.h> statement. These macros are used by the compile()
routine. The macros GETC, PEEKC, and UNGETC operate on the regular
expression given as input to compile().

GETC         This macro returns the value of the next character
             (byte) in the regular expression pattern. Successive
             calls to GETC should return successive characters of
             the regular expression.

PEEKC        This macro returns the next character (byte) in the
             regular expression. Immediately successive calls to PEEKC
             should return the same character, which should also be
             the next character returned by GETC.

UNGETC(c)    This macro causes the argument c to be returned by the
             next call to GETC and PEEKC. No more than one character
             of pushback is ever needed and this character is
             guaranteed to be the last character read by GETC. The
             return value of the macro UNGETC(c) is always ignored.

RETURN(ptr)  This macro is used on normal exit of the compile()
             routine. The value of the argument ptr is a pointer to the
             character after the last character of the compiled
             regular expression. This is useful to programs which have
             memory allocation to manage.

ERROR(val)   This macro is the abnormal return from the compile()
             routine. The argument val is an error number (see the
             list of error numbers below for meanings). This call
             should never return.
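
The contract among GETC, PEEKC, and UNGETC can be exercised without
<regexp.h>. The C sketch below defines the three macros over a plain
character pointer and uses them to measure a delimited pattern. The
helper pattern_length() and its rest parameter are hypothetical; the
code illustrates the macro contract and is not a description of what
compile() does internally.

```c
#include <stddef.h>

/* The same shape of definitions a caller might supply to <regexp.h>. */
#define GETC()     (*sp++)
#define PEEKC()    (*sp)
#define UNGETC(c)  (--sp)

/*
 * Hypothetical helper: count the bytes of the regular expression that
 * starts at `instring` and ends at the first unescaped `eof` delimiter
 * (or at the terminating null). If `rest` is not NULL, it receives a
 * pointer to the unread terminator.
 */
size_t pattern_length(const char *instring, int eof, const char **rest)
{
    const char *sp = instring;      /* the job INIT usually does */
    size_t n = 0;
    int c;

    for (;;) {
        c = GETC();
        if (c == '\0' || c == eof) {
            UNGETC(c);              /* push back the byte just read */
            break;
        }
        n++;
        if (c == '\\' && PEEKC() != '\0') {
            (void) GETC();          /* escaped byte belongs to the RE */
            n++;
        }
    }
    if (rest != NULL)
        *rest = sp;
    return n;
}
```

For the input a\/b/g with delimiter /, the sketch counts the four bytes
a\/b and leaves rest pointing at the closing slash, because only the
single byte most recently read by GETC is ever pushed back.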

compile()
The syntax of the compile() routine is as follows:

    compile(instring, expbuf, endbuf, eof)

The first parameter, instring, is never used explicitly by the
compile() routine but is useful for programs that pass down different
pointers to input characters. It is sometimes used in the INIT
declaration (see below). Programs which call functions to input
characters or have characters in an external array can pass down a
value of (char *)0 for this parameter.

The next parameter, expbuf, is a character pointer. It points to the
place where the compiled regular expression will be placed.

The parameter endbuf is one more than the highest address where the
compiled regular expression may be placed. If the compiled expression
cannot fit in (endbuf-expbuf) bytes, a call to ERROR(50) is made.

The parameter eof is the character which marks the end of the regular
expression. This character is usually a /.

Each program that includes the <regexp.h> header file must have a
#define statement for INIT. It is used for dependent declarations and
initializations. Most often it is used to set a register variable to
point to the beginning of the regular expression so that this register
variable can be used in the declarations for GETC, PEEKC, and UNGETC.
Otherwise it can be used to declare external variables that might be
used by GETC, PEEKC and UNGETC. (See EXAMPLES below.)

step(), advance()
The first parameter to the step() and advance() functions is a pointer
to a string of characters to be checked for a match. This string should
be null-terminated.

The second parameter, expbuf, is the compiled regular expression which
was obtained by a call to the function compile().

The function step() returns non-zero if some substring of string
matches the regular expression in expbuf and 0 if there is no match.
If there is a match, two external character pointers are set as a side
effect to the call to step(). The variable loc1 points to the first
character that matched the regular expression; the variable loc2 points
to the character after the last character that matches the regular
expression. Thus if the regular expression matches the entire input
string, loc1 will point to the first character of string and loc2 will
point to the null at the end of string.

The function advance() returns non-zero if the initial substring of
string matches the regular expression in expbuf. If there is a match,
an external character pointer, loc2, is set as a side effect. The
variable loc2 points to the next character in string after the last
character that matched.
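
The difference between the two functions, and the meaning of loc1 and
loc2, can be illustrated with the following C sketch. It is an
approximation built on the POSIX regcomp() and regexec() interface
rather than on <regexp.h>: step_like(), advance_like(), match_start,
and match_end are hypothetical names, and each call takes the pattern
text instead of a buffer produced by compile().

```c
#include <regex.h>
#include <stddef.h>

/* Stand-ins for the externals loc1 and loc2 (hypothetical names). */
const char *match_start;
const char *match_end;

/* Shared search: fills *m and returns 1 if `pattern` matches `string`. */
static int bre_search(const char *string, const char *pattern,
                      regmatch_t *m)
{
    regex_t re;
    int found;

    if (regcomp(&re, pattern, 0) != 0)   /* 0 = Basic Regular Expressions */
        return 0;
    found = (regexec(&re, string, 1, m, 0) == 0);
    regfree(&re);
    return found;
}

/* Like step(): a match may begin anywhere in the string. */
int step_like(const char *string, const char *pattern)
{
    regmatch_t m;

    if (!bre_search(string, pattern, &m))
        return 0;
    match_start = string + m.rm_so;      /* first matched character */
    match_end = string + m.rm_eo;        /* one past the last match */
    return 1;
}

/* Like advance(): the match must begin at the first character. */
int advance_like(const char *string, const char *pattern)
{
    regmatch_t m;

    /* regexec() reports the leftmost match, so a start offset other
     * than 0 means nothing matches at the beginning of the string. */
    if (!bre_search(string, pattern, &m) || m.rm_so != 0)
        return 0;
    match_end = string + m.rm_eo;
    return 1;
}
```

Given the string xxabbbc and the pattern ab*, step_like() succeeds with
match_start at offset 2 and match_end at offset 6, whereas
advance_like() fails because the string does not begin with a match.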

When advance() encounters a * or \{ \} sequence in the regular
expression, it will advance its pointer to the string to be matched as
far as possible and will recursively call itself trying to match the
rest of the string to the rest of the regular expression. As long as
there is no match, advance() will back up along the string until it
finds a match or reaches the point in the string that initially matched
the * or \{ \}. It is sometimes desirable to stop this backing up
before the initial point in the string is reached. If the external
character pointer locs is equal to the point in the string at some time
during the backing up process, advance() will break out of the loop
that backs up and will return zero.

The external variables circf, sed, and nbra are reserved.

EXAMPLES
Example 1 Using Regular Expression Macros and Calls

The following is an example of how the regular expression macros and
calls might be defined by an application program:

    #define INIT        register char *sp = instring;
    #define GETC()      (*sp++)
    #define PEEKC()     (*sp)
    #define UNGETC(c)   (--sp)
    #define RETURN(c)   return (c);   /* compile() returns char * */
    #define ERROR(c)    regerr()      /* application-defined; must not return */

    #include <regexp.h>
    . . .
    (void) compile(*argv, expbuf, &expbuf[ESIZE], '\0');
    . . .
    if (step(linebuf, expbuf))
        succeed;

DIAGNOSTICS
The function compile() uses the macro RETURN on success and the macro
ERROR on failure (see above). The functions step() and advance() return
non-zero on a successful match and zero if there is no match. Errors
are:

11    range endpoint too large.
16    bad number.
25    \ digit out of range.
36    illegal or missing delimiter.
41    no remembered search string.
42    \( \) imbalance.
43    too many \(.
44    more than 2 numbers given in \{ \}.
45    } expected after \.
46    first number exceeds second in \{ \}.
49    [ ] imbalance.
50    regular expression overflow.
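
An application's ERROR macro typically needs the text that goes with
each number. The C sketch below is one possible lookup table built
from the list above. The function name regexp_errmsg() and the
fallback string are hypothetical; <regexp.h> itself supplies no such
function, and the caller remains responsible for making sure ERROR
does not return.

```c
#include <stddef.h>

/*
 * Hypothetical helper: map an error number passed to ERROR(val) to the
 * message listed for it in this manual page.
 */
const char *regexp_errmsg(int val)
{
    static const struct {
        int code;
        const char *text;
    } table[] = {
        { 11, "range endpoint too large" },
        { 16, "bad number" },
        { 25, "\\ digit out of range" },
        { 36, "illegal or missing delimiter" },
        { 41, "no remembered search string" },
        { 42, "\\( \\) imbalance" },
        { 43, "too many \\(" },
        { 44, "more than 2 numbers given in \\{ \\}" },
        { 45, "} expected after \\" },
        { 46, "first number exceeds second in \\{ \\}" },
        { 49, "[ ] imbalance" },
        { 50, "regular expression overflow" },
    };
    size_t i;

    for (i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (table[i].code == val)
            return table[i].text;
    }
    return "unknown error";         /* number not listed above */
}
```

A regerr()-style handler could print regexp_errmsg(val) and then exit
or longjmp(), which satisfies the requirement that ERROR never return.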

SEE ALSO
regex(5)

SunOS 5.11                      20 May 2002                      regexp(5)