regexp(5)             Standards, Environments, and Macros            regexp(5)

NAME

       regexp, compile, step, advance - simple regular expression compile and
       match routines

SYNOPSIS

       #define INIT declarations
       #define GETC(void) getc code
       #define PEEKC(void) peekc code
       #define UNGETC(void) ungetc code
       #define RETURN(ptr) return code
       #define ERROR(val) error code

       extern char *loc1, *loc2, *locs;

       #include <regexp.h>

       char *compile(char *instring, char *expbuf, const char *endbuf, int eof);

       int step(const char *string, const char *expbuf);

       int advance(const char *string, const char *expbuf);

DESCRIPTION

       Regular Expressions (REs) provide a mechanism to select specific
       strings from a set of character strings. The Simple Regular Expressions
       described below differ from the Internationalized Regular Expressions
       described on the regex(5) manual page in the following ways:

           o      only Basic Regular Expressions are supported

           o      the Internationalization features (character class,
                  equivalence class, and multi-character collation) are not
                  supported.

       The functions step(), advance(), and compile() are general purpose
       regular expression matching routines to be used in programs that
       perform regular expression matching. These functions are defined by
       the <regexp.h> header.

       The functions step() and advance() do pattern matching given a
       character string and a compiled regular expression as input.

       The function compile() takes as input a regular expression as defined
       below and produces a compiled expression that can be used with step()
       or advance().

   Basic Regular Expressions
       A regular expression specifies a set of character strings. A member of
       this set of strings is said to be matched by the regular expression.
       Some characters have special meaning when used in a regular expression;
       other characters stand for themselves.

       The following one-character REs match a single character:

       1.1    An ordinary character (not one of those discussed in 1.2 below)
              is a one-character RE that matches itself.

       1.2    A backslash (\) followed by any special character is a
              one-character RE that matches the special character itself. The
              special characters are:

              a.    ., *, [, and \ (period, asterisk, left square bracket, and
                    backslash, respectively), which are always special, except
                    when they appear within square brackets ([]; see 1.4
                    below).

              b.    ^ (caret or circumflex), which is special at the beginning
                    of an entire RE (see 4.1 and 4.3 below), or when it
                    immediately follows the left of a pair of square brackets
                    ([]) (see 1.4 below).

              c.    $ (dollar sign), which is special at the end of an entire
                    RE (see 4.2 below).

              d.    The character used to bound (that is, delimit) an entire
                    RE, which is special for that RE (for example, see how
                    slash (/) is used in the g command, below.)

       1.3    A period (.) is a one-character RE that matches any character
              except new-line.

       1.4    A non-empty string of characters enclosed in square brackets
              ([]) is a one-character RE that matches any one character in
              that string. If, however, the first character of the string is a
              circumflex (^), the one-character RE matches any character
              except new-line and the remaining characters in the string. The
              ^ has this special meaning only if it occurs first in the
              string. The minus (-) may be used to indicate a range of
              consecutive characters; for example, [0-9] is equivalent to
              [0123456789]. The - loses this special meaning if it occurs
              first (after an initial ^, if any) or last in the string. The
              right square bracket (]) does not terminate such a string when
              it is the first character within it (after an initial ^, if
              any); for example, []a-f] matches either a right square bracket
              (]) or one of the ASCII letters a through f inclusive. The four
              characters listed in 1.2.a above stand for themselves within
              such a string of characters.
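
       The bracket rules in 1.4 can be sketched as a small stand-alone
       function. This is only an illustration of the rules as stated
       above, not the code used by <regexp.h>; the name bracket_match()
       and its return convention are assumptions made for this example.

```c
#include <stddef.h>

/*
 * Hypothetical helper: test whether character c is matched by the
 * bracket expression starting at re (for example "[]a-f]").
 * Returns 1 on match, 0 on no match, -1 if the brackets are unbalanced
 * or enclose an empty string.
 */
static int bracket_match(const char *re, int c)
{
    const char *p = re;
    const char *first;
    int negate = 0, found = 0;
    unsigned char uc = (unsigned char)c;

    if (*p++ != '[')
        return -1;
    if (*p == '^') {            /* ^ is special only when it occurs first */
        negate = 1;
        p++;
    }
    first = p;
    for (;; p++) {
        if (*p == '\0')
            return -1;          /* no closing bracket */
        if (*p == ']' && p != first)
            break;              /* a ] in first position is an ordinary member */
        if (p[1] == '-' && p[2] != ']' && p[2] != '\0') {
            /* lo-hi range; a - that is first or last is handled as ordinary */
            if ((unsigned char)p[0] <= uc && uc <= (unsigned char)p[2])
                found = 1;
            p += 2;
        } else if ((unsigned char)*p == uc) {
            found = 1;
        }
    }
    if (negate)
        return !found && c != '\n';   /* negated sets never match new-line */
    return found;
}
```

       For instance, bracket_match("[]a-f]", ']') and
       bracket_match("[^0-9]", 'x') both report a match, while
       bracket_match("[^0-9]", '\n') does not.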

       The following rules may be used to construct REs from one-character
       REs:

       2.1    A one-character RE is a RE that matches whatever the
              one-character RE matches.

       2.2    A one-character RE followed by an asterisk (*) is a RE that
              matches 0 or more occurrences of the one-character RE. If there
              is any choice, the longest leftmost string that permits a match
              is chosen.

       2.3    A one-character RE followed by \{m\}, \{m,\}, or \{m,n\} is a RE
              that matches a range of occurrences of the one-character RE. The
              values of m and n must be non-negative integers less than 256;
              \{m\} matches exactly m occurrences; \{m,\} matches at least m
              occurrences; \{m,n\} matches any number of occurrences between m
              and n inclusive. Whenever a choice exists, the RE matches as
              many occurrences as possible.

       2.4    The concatenation of REs is a RE that matches the concatenation
              of the strings matched by each component of the RE.

       2.5    A RE enclosed between the character sequences \( and \) is a RE
              that matches whatever the unadorned RE matches.

       2.6    The expression \n matches the same string of characters as was
              matched by an expression enclosed between \( and \) earlier in
              the same RE. Here n is a digit; the sub-expression specified is
              that beginning with the n-th occurrence of \( counting from the
              left. For example, the expression ^\(.*\)\1$ matches a line
              consisting of two repeated appearances of the same string.
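
       Rules 2.2, 2.3, and 2.6 can be tried out on systems that no longer
       ship <regexp.h>. The sketch below is an assumption-laden
       substitute: it uses the POSIX regcomp() and regexec() interfaces
       from <regex.h> in basic-RE mode instead of compile() and step(),
       and the helper name bre_matches() is invented for this example.
       POSIX basic REs accept the same \{m,n\} and \n syntax, but the two
       interfaces are not otherwise identical.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper built on POSIX <regex.h>, not on <regexp.h>.
 * Returns 1 if text contains a match for pattern, 0 if it does not,
 * and -1 if the pattern fails to compile.
 */
static int bre_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    /* Omitting REG_EXTENDED selects basic regular expression syntax. */
    if (regcomp(&re, pattern, 0) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

       With this helper, bre_matches("^\\(.*\\)\\1$", "abcabc") reports a
       match because the line is the string abc repeated twice, while
       "abcabd" is rejected. Note that each backslash is doubled only
       because the patterns are written as C string literals.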

       An RE may be constrained to match words.

       3.1    \< constrains a RE to match the beginning of a string or to
              follow a character that is not a digit, underscore, or letter.
              The first character matching the RE must be a digit,
              underscore, or letter.

       3.2    \> constrains a RE to match the end of a string or to precede a
              character that is not a digit, underscore, or letter.
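
       The two word constraints amount to simple tests on the characters
       around a position. The following sketch restates 3.1 and 3.2 as C
       predicates; the function names are hypothetical, and isalnum() is
       assumed to agree with "digit or letter" (true in the C locale).

```c
#include <ctype.h>

/* A word character, per 3.1 and 3.2: a digit, an underscore, or a letter. */
static int is_word_char(int c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Could a match constrained by \< begin at p within line? */
static int at_word_start(const char *line, const char *p)
{
    int boundary_before = (p == line) || !is_word_char(p[-1]);

    return boundary_before && is_word_char(*p);
}

/* Could a match constrained by \> end just before p? */
static int at_word_end(const char *p)
{
    return *p == '\0' || !is_word_char(*p);
}
```

       In the string "foo bar_1", a \< match may start at the f or at the
       b, and a \> match may end before the space or at the end of the
       string.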

       An entire RE may be constrained to match only an initial segment or
       final segment of a line (or both).

       4.1    A circumflex (^) at the beginning of an entire RE constrains
              that RE to match an initial segment of a line.

       4.2    A dollar sign ($) at the end of an entire RE constrains that RE
              to match a final segment of a line.

       4.3    The construction ^entire RE$ constrains the entire RE to match
              the entire line.

       The null RE (for example, //) is equivalent to the last RE encountered.

   Addressing with REs
       Addresses are constructed as follows:

           1.     The character "." addresses the current line.

           2.     The character "$" addresses the last line of the buffer.

           3.     A decimal number n addresses the n-th line of the buffer.

           4.     'x addresses the line marked with the mark name character x,
                  which must be an ASCII lower-case letter (a-z). Lines are
                  marked with the k command described below.

           5.     A RE enclosed by slashes (/) addresses the first line found
                  by searching forward from the line following the current
                  line toward the end of the buffer and stopping at the first
                  line containing a string matching the RE. If necessary, the
                  search wraps around to the beginning of the buffer and
                  continues up to and including the current line, so that the
                  entire buffer is searched.

           6.     A RE enclosed in question marks (?) addresses the first line
                  found by searching backward from the line preceding the
                  current line toward the beginning of the buffer and stopping
                  at the first line containing a string matching the RE. If
                  necessary, the search wraps around to the end of the buffer
                  and continues up to and including the current line.

           7.     An address followed by a plus sign (+) or a minus sign (-)
                  followed by a decimal number specifies that address plus
                  (respectively minus) the indicated number of lines. A
                  shorthand for .+5 is .5.

           8.     If an address begins with + or -, the addition or
                  subtraction is taken with respect to the current line; for
                  example, -5 is understood to mean .-5.

           9.     If an address ends with + or -, then 1 is added to or
                  subtracted from the address, respectively. As a consequence
                  of this rule and of Rule 8, immediately above, the address -
                  refers to the line preceding the current line. (To maintain
                  compatibility with earlier versions of the editor, the
                  character ^ in addresses is entirely equivalent to -.)
                  Moreover, trailing + and - characters have a cumulative
                  effect, so -- refers to the current line less 2.

           10.    For convenience, a comma (,) stands for the address pair
                  1,$, while a semicolon (;) stands for the pair .,$.
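
       The arithmetic in rules 1 to 3 and 7 to 9 can be sketched as
       follows. This is an illustrative reading of those rules only: the
       function name resolve_address() is hypothetical, and the sketch
       deliberately leaves out marks, RE searches, the .5 shorthand, and
       the comma and semicolon pairs, returning -1 for any of them.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Hypothetical resolver for addresses made of an optional base
 * (".", "$", or a decimal number) followed by any run of +n / -n terms.
 * current and last are the current and last line numbers of the buffer.
 * Returns the resolved line number, or -1 for syntax outside this sketch.
 */
static long resolve_address(const char *addr, long current, long last)
{
    const char *p = addr;
    char *end;
    long line = current;      /* rule 8: a leading + or - is relative to "." */

    if (*p == '.') {
        p++;
    } else if (*p == '$') {
        line = last;
        p++;
    } else if (isdigit((unsigned char)*p)) {
        line = strtol(p, &end, 10);
        p = end;
    }

    while (*p == '+' || *p == '-' || *p == '^') {   /* ^ is equivalent to - */
        int sign = (*p == '+') ? 1 : -1;
        long n = 1;           /* rule 9: a bare + or - counts as one line */

        p++;
        if (isdigit((unsigned char)*p)) {
            n = strtol(p, &end, 10);
            p = end;
        }
        line += sign * n;     /* trailing signs accumulate */
    }
    return (*p == '\0') ? line : -1;
}
```

       With the current line at 10 in a 100-line buffer, .+5 resolves to
       15, -5 to 5, - to 9, -- to 8, and $-1 to 99.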

   Characters With Special Meaning
       Characters that have special meaning except when they appear within
       square brackets ([]) or are preceded by \ are: ., *, [, \. Other
       special characters, such as $, have special meaning in more restricted
       contexts.

       The character ^ at the beginning of an expression permits a successful
       match only immediately after a newline, and the character $ at the end
       of an expression requires a trailing newline.

       Two characters have special meaning only when used within square
       brackets. The character - denotes a range, [c-c], unless it is just
       after the open bracket or before the closing bracket, [-c] or [c-], in
       which case it has no special meaning. When used within brackets, the
       character ^ has the meaning "complement of" if it immediately follows
       the open bracket (example: [^c]); elsewhere between brackets (example:
       [c^]) it stands for the ordinary character ^.

       The special meaning of the \ operator can be escaped only by preceding
       it with another \, for example \\.

   Macros
       Programs must have the following five macros declared before the
       #include <regexp.h> statement. These macros are used by the compile()
       routine. The macros GETC, PEEKC, and UNGETC operate on the regular
       expression given as input to compile().

       GETC           This macro returns the value of the next character
                      (byte) in the regular expression pattern. Successive
                      calls to GETC should return successive characters of
                      the regular expression.

       PEEKC          This macro returns the next character (byte) in the
                      regular expression. Immediately successive calls to
                      PEEKC should return the same character, which should
                      also be the next character returned by GETC.

       UNGETC         This macro causes the argument c to be returned by the
                      next call to GETC and PEEKC. No more than one character
                      of pushback is ever needed and this character is
                      guaranteed to be the last character read by GETC. The
                      return value of the macro UNGETC(c) is always ignored.

       RETURN(ptr)    This macro is used on normal exit of the compile()
                      routine. The value of the argument ptr is a pointer to
                      the character after the last character of the compiled
                      regular expression. This is useful to programs which
                      have memory allocation to manage.

       ERROR(val)     This macro is the abnormal return from the compile()
                      routine. The argument val is an error number (see ERRORS
                      below for meanings). This call should never return.
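
       The contract between GETC and PEEKC can be exercised without
       <regexp.h>. In the sketch below the macro bodies follow the style
       of the EXAMPLES section, but the variable sp, the const qualifier,
       and the function pattern_length() are assumptions made for this
       illustration; the function merely walks a pattern the way a
       compiler routine might, up to the first unescaped delimiter.

```c
#include <stddef.h>

/* Hypothetical macro definitions operating on a string pointer sp. */
#define INIT       const char *sp = instring;
#define GETC()     (*sp++)
#define PEEKC()    (*sp)
#define UNGETC(c)  (--sp)   /* defined for completeness; unused in this sketch */

/*
 * Count the bytes of the regular expression in instring that precede
 * the first unescaped occurrence of the delimiter eof.
 */
static size_t pattern_length(const char *instring, int eof)
{
    INIT
    size_t n = 0;
    int c;

    while ((c = GETC()) != eof && c != '\0') {
        n++;
        /* PEEKC shows the character the next GETC would return. */
        if (c == '\\' && PEEKC() != '\0') {
            (void) GETC();  /* an escaped character never ends the pattern */
            n++;
        }
    }
    return n;
}
```

       For the input a\/b/rest with / as the delimiter, the function
       counts four bytes: a, \, /, and b.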

   compile()
       The syntax of the compile() routine is as follows:

         compile(instring, expbuf, endbuf, eof)

       The first parameter, instring, is never used explicitly by the
       compile() routine but is useful for programs that pass down different
       pointers to input characters. It is sometimes used in the INIT
       declaration (see below). Programs which call functions to input
       characters or have characters in an external array can pass down a
       value of (char *)0 for this parameter.

       The next parameter, expbuf, is a character pointer. It points to the
       place where the compiled regular expression will be placed.

       The parameter endbuf is one more than the highest address where the
       compiled regular expression may be placed. If the compiled expression
       cannot fit in (endbuf-expbuf) bytes, a call to ERROR(50) is made.

       The parameter eof is the character which marks the end of the regular
       expression. This character is usually a /.

       Each program that includes the <regexp.h> header file must have a
       #define statement for INIT. It is used for dependent declarations and
       initializations. Most often it is used to set a register variable to
       point to the beginning of the regular expression so that this register
       variable can be used in the declarations for GETC, PEEKC, and UNGETC.
       Otherwise it can be used to declare external variables that might be
       used by GETC, PEEKC and UNGETC. (See EXAMPLES below.)

   step(), advance()
       The first parameter to the step() and advance() functions is a pointer
       to a string of characters to be checked for a match. This string should
       be null-terminated.

       The second parameter, expbuf, is the compiled regular expression which
       was obtained by a call to the function compile().

       The function step() returns non-zero if some substring of string
       matches the regular expression in expbuf and 0 if there is no match.
       If there is a match, two external character pointers are set as a side
       effect to the call to step(). The variable loc1 points to the first
       character that matched the regular expression; the variable loc2 points
       to the character after the last character that matches the regular
       expression. Thus if the regular expression matches the entire input
       string, loc1 will point to the first character of string and loc2 will
       point to the null at the end of string.

       The function advance() returns non-zero if the initial substring of
       string matches the regular expression in expbuf. If there is a match,
       an external character pointer, loc2, is set as a side effect. The
       variable loc2 points to the next character in string after the last
       character that matched.

       When advance() encounters a * or \{ \} sequence in the regular
       expression, it will advance its pointer to the string to be matched as
       far as possible and will recursively call itself trying to match the
       rest of the string to the rest of the regular expression. As long as
       there is no match, advance() will back up along the string until it
       finds a match or reaches the point in the string that initially matched
       the * or \{ \}. It is sometimes desirable to stop this backing up
       before the initial point in the string is reached. If the external
       character pointer locs is equal to the point in the string at some time
       during the backing up process, advance() will break out of the loop
       that backs up and will return zero.
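
       The relationship between the two functions, and the advance-then-
       back-up handling of *, can be sketched with a toy matcher. This is
       not the <regexp.h> implementation: it works on uncompiled pattern
       text, supports only ordinary characters, the period, and the
       asterisk, and ignores locs. The names match_here(),
       advance_sketch(), step_sketch(), sk_loc1, and sk_loc2 are
       hypothetical stand-ins.

```c
#include <stddef.h>

static const char *sk_loc1, *sk_loc2;  /* stand-ins for loc1 and loc2 */

/* Does the one-character RE r (an ordinary character or '.') match *s? */
static int one_char(const char *s, char r)
{
    return *s != '\0' && (r == '.' ? *s != '\n' : *s == r);
}

/* Anchored match of pat at s; returns a pointer past the match, or NULL. */
static const char *match_here(const char *s, const char *pat)
{
    if (*pat == '\0')
        return s;
    if (pat[1] == '*') {
        const char *t = s;

        while (one_char(t, pat[0]))   /* advance as far as possible */
            t++;
        for (;;) {                    /* then back up until the rest matches */
            const char *end = match_here(t, pat + 2);

            if (end != NULL)
                return end;
            if (t == s)
                return NULL;          /* reached the initial point */
            t--;
        }
    }
    if (one_char(s, pat[0]))
        return match_here(s + 1, pat + 1);
    return NULL;
}

/* Like advance(): only the initial substring of string may match. */
static int advance_sketch(const char *string, const char *pat)
{
    const char *end = match_here(string, pat);

    if (end == NULL)
        return 0;
    sk_loc2 = end;
    return 1;
}

/* Like step(): try an anchored match at each successive position. */
static int step_sketch(const char *string, const char *pat)
{
    const char *s;

    for (s = string; ; s++) {
        const char *end = match_here(s, pat);

        if (end != NULL) {
            sk_loc1 = s;
            sk_loc2 = end;
            return 1;
        }
        if (*s == '\0')
            return 0;
    }
}
```

       On the string xxabbbc, step_sketch() finds ab*c starting at the
       third character, whereas advance_sketch() fails for the same
       pattern because the string does not begin with a.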

       The external variables circf, sed, and nbra are reserved.

EXAMPLES

       Example 1 Using Regular Expression Macros and Calls

       The following is an example of how the regular expression macros and
       calls might be defined by an application program:

         #define INIT       register char *sp = instring;
         #define GETC()     (*sp++)
         #define PEEKC()    (*sp)
         #define UNGETC(c)  (--sp)
         #define RETURN(c)  return;
         #define ERROR(c)   regerr()

         #include <regexp.h>
          . . .
               (void) compile(*argv, expbuf, &expbuf[ESIZE],'\0');
          . . .
               if (step(linebuf, expbuf))
                                 succeed;

DIAGNOSTICS

       The function compile() uses the macro RETURN on success and the macro
       ERROR on failure (see above). The functions step() and advance() return
       non-zero on a successful match and zero if there is no match. Errors
       are:

       11    range endpoint too large.

       16    bad number.

       25    \ digit out of range.

       36    illegal or missing delimiter.

       41    no remembered search string.

       42    \( \) imbalance.

       43    too many \(.

       44    more than 2 numbers given in \{ \}.

       45    } expected after \.

       46    first number exceeds second in \{ \}.

       49    [ ] imbalance.

       50    regular expression overflow.
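
       A program's ERROR macro typically needs to turn these numbers into
       text. The lookup below is one possible sketch; the function name
       regexp_error_message() is hypothetical, and the strings are taken
       from the list above.

```c
/*
 * Hypothetical helper: map an error number passed to ERROR(val)
 * to the message listed in this section.
 */
static const char *regexp_error_message(int val)
{
    switch (val) {
    case 11: return "range endpoint too large";
    case 16: return "bad number";
    case 25: return "\\ digit out of range";
    case 36: return "illegal or missing delimiter";
    case 41: return "no remembered search string";
    case 42: return "\\( \\) imbalance";
    case 43: return "too many \\(";
    case 44: return "more than 2 numbers given in \\{ \\}";
    case 45: return "} expected after \\";
    case 46: return "first number exceeds second in \\{ \\}";
    case 49: return "[ ] imbalance";
    case 50: return "regular expression overflow";
    default: return "unknown error";
    }
}
```

       An ERROR definition could, for example, print
       regexp_error_message(val) and then exit or longjmp, since the
       macro is required never to return.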

SEE ALSO

       regex(5)

SunOS 5.11                        20 May 2002                        regexp(5)