REGCOMP(P)                 POSIX Programmer's Manual                REGCOMP(P)

NAME

       regcomp, regerror, regexec, regfree - regular expression matching

SYNOPSIS

       #include <regex.h>

       int regcomp(regex_t *restrict preg, const char *restrict pattern,
              int cflags);
       size_t regerror(int errcode, const regex_t *restrict preg,
              char *restrict errbuf, size_t errbuf_size);
       int regexec(const regex_t *restrict preg, const char *restrict string,
              size_t nmatch, regmatch_t pmatch[restrict], int eflags);
       void regfree(regex_t *preg);

DESCRIPTION

       These functions interpret basic and extended regular expressions as
       described in the Base Definitions volume of IEEE Std 1003.1-2001,
       Chapter 9, Regular Expressions.

       The regex_t structure is defined in <regex.h> and contains at least the
       following member:

          Member Type  Member Name  Description
          size_t       re_nsub      Number of parenthesized subexpressions.

       The regmatch_t structure is defined in <regex.h> and contains at least
       the following members:

          Member Type  Member Name  Description
          regoff_t     rm_so        Byte offset from start of string to
                                    start of substring.
          regoff_t     rm_eo        Byte offset from start of string of the
                                    first character after the end of
                                    substring.

       The regcomp() function shall compile the regular expression contained
       in the string pointed to by the pattern argument and place the results
       in the structure pointed to by preg. The cflags argument is the
       bitwise-inclusive OR of zero or more of the following flags, which are
       defined in the <regex.h> header:

       REG_EXTENDED
              Use Extended Regular Expressions.

       REG_ICASE
              Ignore case in match. (See the Base Definitions volume of
              IEEE Std 1003.1-2001, Chapter 9, Regular Expressions.)

       REG_NOSUB
              Report only success/fail in regexec().

       REG_NEWLINE
              Change the handling of <newline>s, as described in the text.

       The default regular expression type for pattern is a Basic Regular
       Expression. The application can specify Extended Regular Expressions
       using the REG_EXTENDED cflags flag.

       If the REG_NOSUB flag was not set in cflags, then regcomp() shall set
       re_nsub to the number of parenthesized subexpressions (delimited by
       "\(\)" in basic regular expressions or "()" in extended regular
       expressions) found in pattern.
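
       As a rough illustration of how cflags selects the expression type and
       how re_nsub is set, the following sketch compiles a pattern and reports
       the number of parenthesized subexpressions. The helper name
       count_subexpressions() is hypothetical and not part of <regex.h>; the
       choice to return -1 on a compile error is an assumption made for
       brevity.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of <regex.h>): returns the value of
 * re_nsub for pattern compiled with cflags, or -1 if regcomp() fails.
 * The caller is assumed not to pass REG_NOSUB.
 */
long count_subexpressions(const char *pattern, int cflags)
{
    regex_t re;
    long nsub;

    assert(pattern != NULL);
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;          /* preg content is undefined; nothing to free */
    nsub = (long)re.re_nsub;
    regfree(&re);
    return nsub;
}
```

       With this sketch, "(a)" compiled with cflags of 0 should report no
       subexpressions, because parentheses are ordinary characters in a basic
       regular expression, while the same pattern compiled with REG_EXTENDED
       should report one.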

       The regexec() function compares the null-terminated string specified by
       string with the compiled regular expression preg initialized by a
       previous call to regcomp(). If it finds a match, regexec() shall return
       0; otherwise, it shall return non-zero indicating either no match or an
       error. The eflags argument is the bitwise-inclusive OR of zero or more
       of the following flags, which are defined in the <regex.h> header:

       REG_NOTBOL
              The first character of the string pointed to by string is not
              the beginning of the line. Therefore, the circumflex character
              ('^'), when taken as a special character, shall not match the
              beginning of string.

       REG_NOTEOL
              The last character of the string pointed to by string is not the
              end of the line. Therefore, the dollar sign ('$'), when taken
              as a special character, shall not match the end of string.
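
       The effect of the two eflags values can be sketched as follows. The
       helper re_matches() is a hypothetical name introduced for this example;
       it compiles a basic regular expression and treats a compile error as
       "no match", which is a simplification.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of <regex.h>): returns 1 if text matches
 * the basic regular expression pattern under the given eflags, else 0.
 */
int re_matches(const char *pattern, const char *text, int eflags)
{
    regex_t re;
    int status;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return 0;           /* simplification: compile error is "no match" */
    status = regexec(&re, text, (size_t)0, NULL, eflags);
    regfree(&re);
    return status == 0;
}
```

       Under these assumptions, re_matches("^ab", "abc", 0) should succeed,
       while the same call with REG_NOTBOL should fail because the circumflex
       may no longer match at the start of string. REG_NOTEOL has the
       corresponding effect on a trailing dollar sign.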

       If nmatch is 0 or REG_NOSUB was set in the cflags argument to
       regcomp(), then regexec() shall ignore the pmatch argument. Otherwise,
       the application shall ensure that the pmatch argument points to an
       array with at least nmatch elements, and regexec() shall fill in the
       elements of that array with offsets of the substrings of string that
       correspond to the parenthesized subexpressions of pattern:
       pmatch[i].rm_so shall be the byte offset of the beginning and
       pmatch[i].rm_eo shall be one greater than the byte offset of the end of
       substring i. (Subexpression i begins at the ith matched open
       parenthesis, counting from 1.) Offsets in pmatch[0] identify the
       substring that corresponds to the entire regular expression. Unused
       elements of pmatch up to pmatch[nmatch-1] shall be filled with -1. If
       there are more than nmatch subexpressions in pattern (pattern itself
       counts as a subexpression), then regexec() shall still do the match,
       but shall record only the first nmatch substrings.
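
       One way to turn the offsets in pmatch into substrings is sketched
       below. The helper re_group() and the MAX_GROUPS limit are assumptions
       made for this example only; the sketch uses extended regular
       expressions and copies the text between rm_so and rm_eo into a
       caller-supplied buffer.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

#define MAX_GROUPS 8    /* arbitrary limit chosen for this sketch */

/*
 * Hypothetical helper (not part of <regex.h>): copies the substring for
 * subexpression idx (0 is the whole match) into out. Returns its length,
 * or -1 if there is no match, idx is out of range, the subexpression did
 * not participate, or out is too small.
 */
long re_group(const char *pattern, const char *text, size_t idx,
              char *out, size_t outsize)
{
    regex_t re;
    regmatch_t pm[MAX_GROUPS];
    long len = -1;

    assert(pattern != NULL && text != NULL && out != NULL);
    if (idx >= MAX_GROUPS || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, MAX_GROUPS, pm, 0) == 0 && pm[idx].rm_so != -1) {
        /* rm_eo is one past the end, so the difference is the length. */
        size_t n = (size_t)(pm[idx].rm_eo - pm[idx].rm_so);
        if (n < outsize) {
            memcpy(out, text + pm[idx].rm_so, n);
            out[n] = '\0';
            len = (long)n;
        }
    }
    regfree(&re);
    return len;
}
```

       Because unused elements of pmatch are filled with -1, asking this
       sketch for a subexpression that the pattern does not contain simply
       yields -1.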

       When matching a basic or extended regular expression, any given
       parenthesized subexpression of pattern might participate in the match
       of several different substrings of string, or it might not match any
       substring even though the pattern as a whole did match. The following
       rules shall be used to determine which substrings to report in pmatch
       when matching regular expressions:

        1. If subexpression i in a regular expression is not contained within
           another subexpression, and it participated in the match several
           times, then the byte offsets in pmatch[i] shall delimit the last
           such match.

        2. If subexpression i is not contained within another subexpression,
           and it did not participate in an otherwise successful match, the
           byte offsets in pmatch[i] shall be -1. A subexpression does not
           participate in the match when '*' or "\{\}" appears immediately
           after the subexpression in a basic regular expression, or '*',
           '?', or "{}" appears immediately after the subexpression in an
           extended regular expression, and the subexpression did not match
           (matched 0 times);

           or when '|' is used in an extended regular expression to select
           this subexpression or another, and the other subexpression
           matched.

        3. If subexpression i is contained within another subexpression j, and
           i is not contained within any other subexpression that is contained
           within j, and a match of subexpression j is reported in pmatch[j],
           then the match or non-match of subexpression i reported in
           pmatch[i] shall be as described in 1. and 2. above, but within the
           substring reported in pmatch[j] rather than the whole string. The
           offsets in pmatch[i] are still relative to the start of string.

        4. If subexpression i is contained in subexpression j, and the byte
           offsets in pmatch[j] are -1, then the byte offsets in pmatch[i]
           shall also be -1.

        5. If subexpression i matched a zero-length string, then both byte
           offsets in pmatch[i] shall be the byte offset of the character or
           null terminator immediately following the zero-length string.
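
       Rules 1, 2, and 5 can be observed with a small sketch such as the
       following. The helper re_offsets() and the NGROUPS limit are
       hypothetical; the sketch compiles an extended regular expression and
       hands back the raw offsets of one subexpression.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

#define NGROUPS 4       /* arbitrary size chosen for this sketch */

/*
 * Hypothetical helper (not part of <regex.h>): stores the offsets of
 * subexpression idx in *so and *eo. Returns 0 if text matched, else -1.
 */
int re_offsets(const char *pattern, const char *text, size_t idx,
               long *so, long *eo)
{
    regex_t re;
    regmatch_t pm[NGROUPS];
    int rc = -1;

    assert(idx < NGROUPS && so != NULL && eo != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, NGROUPS, pm, 0) == 0) {
        *so = (long)pm[idx].rm_so;
        *eo = (long)pm[idx].rm_eo;
        rc = 0;
    }
    regfree(&re);
    return rc;
}
```

       For example, matching "(a|b)*" against "ab" should report offsets 1
       and 2 for subexpression 1 (the last iteration, per rule 1), and
       matching "(a)|(b)" against "b" should report -1 for subexpression 1
       (rule 2). Matching "(a*)b" against "b" should report 0 for both offsets
       of subexpression 1 (rule 5).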

       If, when regexec() is called, the locale is different from when the
       regular expression was compiled, the result is undefined.

       If REG_NEWLINE is not set in cflags, then a <newline> in pattern or
       string shall be treated as an ordinary character. If REG_NEWLINE is
       set, then <newline> shall be treated as an ordinary character except as
       follows:

        1. A <newline> in string shall not be matched by a period outside a
           bracket expression or by any form of a non-matching list (see the
           Base Definitions volume of IEEE Std 1003.1-2001, Chapter 9, Regular
           Expressions).

        2. A circumflex ('^') in pattern, when used to specify expression
           anchoring (see the Base Definitions volume of IEEE Std 1003.1-2001,
           Section 9.3.8, BRE Expression Anchoring), shall match the
           zero-length string immediately after a <newline> in string,
           regardless of the setting of REG_NOTBOL.

        3. A dollar sign ('$') in pattern, when used to specify expression
           anchoring, shall match the zero-length string immediately before a
           <newline> in string, regardless of the setting of REG_NOTEOL.
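
       The three REG_NEWLINE rules can be checked with a sketch along these
       lines. The helper re_find() is a hypothetical name; it returns the
       byte offset at which the whole expression first matches, and it
       assumes the caller does not pass REG_NOSUB.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of <regex.h>): returns the byte offset of
 * the start of the first match of pattern in text, or -1 if there is no
 * match or the pattern does not compile. cflags must not include
 * REG_NOSUB, since the offset is read from pmatch[0].
 */
long re_find(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    regmatch_t pm;
    long pos = -1;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    if (regexec(&re, text, (size_t)1, &pm, 0) == 0)
        pos = (long)pm.rm_so;
    regfree(&re);
    return pos;
}
```

       Given the string "one\ntwo", the pattern "^two" should match at offset
       4 only when REG_NEWLINE is set, and the pattern "e.t" should match only
       when it is not set, because the period then may match the <newline>.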

       The regfree() function frees any memory allocated by regcomp()
       associated with preg.

       The following constants are defined as error return values:

       REG_NOMATCH
              regexec() failed to match.

       REG_BADPAT
              Invalid regular expression.

       REG_ECOLLATE
              Invalid collating element referenced.

       REG_ECTYPE
              Invalid character class type referenced.

       REG_EESCAPE
              Trailing '\' in pattern.

       REG_ESUBREG
              Number in "\digit" invalid or in error.

       REG_EBRACK
              "[]" imbalance.

       REG_EPAREN
              "\(\)" or "()" imbalance.

       REG_EBRACE
              "\{\}" imbalance.

       REG_BADBR
              Content of "\{\}" invalid: not a number, number too large, more
              than two numbers, first larger than second.

       REG_ERANGE
              Invalid endpoint in range expression.

       REG_ESPACE
              Out of memory.

       REG_BADRPT
              '?', '*', or '+' not preceded by valid regular expression.

       The regerror() function provides a mapping from error codes returned by
       regcomp() and regexec() to unspecified printable strings. It generates
       a string corresponding to the value of the errcode argument, which the
       application shall ensure is the last non-zero value returned by
       regcomp() or regexec() with the given value of preg. If errcode is not
       such a value, the content of the generated string is unspecified.

       If preg is a null pointer, but errcode is a value returned by a
       previous call to regexec() or regcomp(), regerror() still generates an
       error string corresponding to the value of errcode, but it might not be
       as detailed under some implementations.

       If the errbuf_size argument is not 0, regerror() shall place the
       generated string into the buffer of size errbuf_size bytes pointed to
       by errbuf. If the string (including the terminating null) cannot fit in
       the buffer, regerror() shall truncate the string and null-terminate the
       result.

       If errbuf_size is 0, regerror() shall ignore the errbuf argument, and
       return the size of the buffer needed to hold the generated string.

       If the preg argument to regexec() or regfree() is not a compiled
       regular expression returned by regcomp(), the result is undefined. A
       preg is no longer treated as a compiled regular expression after it is
       given to regfree().

RETURN VALUE

       Upon successful completion, the regcomp() function shall return 0.
       Otherwise, it shall return an integer value indicating an error as
       described in <regex.h>, and the content of preg is undefined. If a code
       is returned, the interpretation shall be as given in <regex.h>.

       If regcomp() detects an invalid RE, it may return REG_BADPAT, or it may
       return one of the error codes that more precisely describes the error.

       Upon successful completion, the regexec() function shall return 0.
       Otherwise, it shall return REG_NOMATCH to indicate no match.

       Upon successful completion, the regerror() function shall return the
       number of bytes needed to hold the entire generated string, including
       the null termination. If the return value is greater than errbuf_size,
       the string returned in the buffer pointed to by errbuf has been
       truncated.

       The regfree() function shall not return a value.

ERRORS

       No errors are defined.

       The following sections are informative.

EXAMPLES

              #include <stddef.h>     /* NULL */
              #include <regex.h>

              /*
               * Match string against the extended regular expression in
               * pattern, treating errors as no match.
               *
               * Return 1 for match, 0 for no match.
               */
              int
              match(const char *string, const char *pattern)
              {
                  int        status;
                  regex_t    re;

                  if (regcomp(&re, pattern, REG_EXTENDED|REG_NOSUB) != 0) {
                      return(0);      /* Compile error: treat as no match. */
                  }
                  status = regexec(&re, string, (size_t) 0, NULL, 0);
                  regfree(&re);
                  if (status != 0) {
                      return(0);      /* No match, or an error. */
                  }
                  return(1);
              }

       The following demonstrates how the REG_NOTBOL flag could be used with
       regexec() to find all substrings in a line that match a pattern
       supplied by a user. (For simplicity of the example, very little error
       checking is done.)

              const char *p = buffer;   /* Start of the text not yet searched. */

              (void) regcomp(&re, pattern, 0);
              /* This call to regexec() finds the first match on the line. */
              error = regexec(&re, p, 1, &pm, 0);
              while (error == 0) {  /* While matches found. */
                  /* Substring found between p + pm.rm_so and p + pm.rm_eo. */
                  if (pm.rm_eo == 0 && *p == '\0')
                      break;        /* Empty match at end of line: stop. */
                  /* Offsets are relative to p; step past an empty match. */
                  p += (pm.rm_eo > 0) ? pm.rm_eo : 1;
                  /* This call to regexec() finds the next match. */
                  error = regexec(&re, p, 1, &pm, REG_NOTBOL);
              }
              regfree(&re);
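
       A self-contained variant of the loop above might look like the
       following sketch. The function name count_matches() is hypothetical,
       and the decision to advance by one byte after an empty match is an
       assumption that is only safe for single-byte character sets.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of <regex.h>): counts the non-overlapping
 * matches of the basic regular expression pattern in line. Returns -1 if
 * the pattern does not compile.
 */
long count_matches(const char *pattern, const char *line)
{
    regex_t re;
    regmatch_t pm;
    const char *p = line;   /* start of the text not yet searched */
    long count = 0;
    int eflags = 0;         /* the first search starts at the real BOL */

    assert(pattern != NULL && line != NULL);
    if (regcomp(&re, pattern, 0) != 0)
        return -1;
    while (regexec(&re, p, (size_t)1, &pm, eflags) == 0) {
        count++;
        if (pm.rm_eo == 0 && *p == '\0')
            break;          /* empty match at the end of the line */
        /* Offsets are relative to p, not to line. */
        p += (pm.rm_eo > 0) ? pm.rm_eo : 1;
        eflags = REG_NOTBOL;    /* p no longer points at the line start */
    }
    regfree(&re);
    return count;
}
```

       The REG_NOTBOL flag on every search after the first keeps an anchored
       pattern such as "^ab" from matching again in the middle of the line.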

APPLICATION USAGE

       An application could use:

              regerror(code, preg, (char *)NULL, (size_t)0)

       to find out how big a buffer is needed for the generated string,
       malloc() a buffer to hold the string, and then call regerror() again to
       get the string. Alternatively, it could allocate a fixed, static buffer
       that is big enough to hold most strings, and then use malloc() to
       allocate a larger buffer if it finds that this is too small.
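
       The first approach, two calls to regerror() around a malloc(), could be
       sketched as follows. The helper name re_error_string() is hypothetical,
       and returning a null pointer on allocation failure is a design choice
       made for this example.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of <regex.h>): returns a malloc()ed copy
 * of the message for errcode, or NULL if the allocation fails. The caller
 * is responsible for calling free() on the result.
 */
char *re_error_string(int errcode, const regex_t *preg)
{
    /* With errbuf_size 0, regerror() only reports the size needed. */
    size_t need = regerror(errcode, preg, (char *)NULL, (size_t)0);
    char *msg;

    assert(need > 0);       /* the size includes the terminating null */
    msg = malloc(need);
    if (msg != NULL)
        (void)regerror(errcode, preg, msg, need);
    return msg;
}
```

       A caller would typically pass the non-zero value returned by regcomp()
       or regexec() together with the same preg, print the message, and then
       free() it.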

       To match a pattern as described in the Shell and Utilities volume of
       IEEE Std 1003.1-2001, Section 2.13, Pattern Matching Notation, use the
       fnmatch() function.

RATIONALE

       The regexec() function must fill in all nmatch elements of pmatch,
       where nmatch and pmatch are supplied by the application, even if some
       elements of pmatch do not correspond to subexpressions in pattern. The
       application writer should note that there is probably no reason for
       using a value of nmatch that is larger than preg->re_nsub+1.

       The REG_NEWLINE flag supports a use of RE matching that is needed in
       some applications like text editors. In such applications, the user
       supplies an RE asking the application to find a line that matches the
       given expression. An anchor in such an RE anchors at the beginning or
       end of any line. Such an application can pass a sequence of
       <newline>-separated lines to regexec() as a single long string and
       specify REG_NEWLINE to regcomp() to get the desired behavior. The
       application must ensure that there are no explicit <newline>s in
       pattern if it wants to ensure that any match occurs entirely within a
       single line.

       The REG_NEWLINE flag affects the behavior of regexec(), but it is in
       the cflags parameter to regcomp() to allow flexibility of
       implementation. Some implementations will want to generate the same
       compiled RE in regcomp() regardless of the setting of REG_NEWLINE and
       have regexec() handle anchors differently based on the setting of the
       flag. Other implementations will generate different compiled REs based
       on the REG_NEWLINE flag.

       The REG_ICASE flag supports the operations taken by the grep -i option
       and the historical implementations of ex and vi. Including this flag
       will make it easier for application code to be written that does the
       same thing as these utilities.

       The substrings reported in pmatch[] are defined using offsets from the
       start of the string rather than pointers. Since this is a new
       interface, there should be no impact on historical implementations or
       applications, and offsets should be just as easy to use as pointers.
       The change to offsets was made to facilitate future extensions in which
       the string to be searched is presented to regexec() in blocks, allowing
       a string to be searched that is not all in memory at once.

       The type regoff_t is used for the elements of pmatch[] to ensure that
       the application can represent either the largest possible array in
       memory (important for an application conforming to the Shell and
       Utilities volume of IEEE Std 1003.1-2001) or the largest possible file
       (important for an application using the extension where a file is
       searched in chunks).

       The standard developers rejected the inclusion of a regsub() function
       that would be used to do substitutions for a matched RE. While such a
       routine would be useful to some applications, its utility would be much
       more limited than the matching function described here. Both RE parsing
       and substitution are possible to implement without support other than
       that required by the ISO C standard, but matching is much more complex
       than substituting. The only difficult part of substitution, given the
       information supplied by regexec(), is finding the next character in a
       string when there can be multi-byte characters. That is a much larger
       issue, and one that needs a more general solution.

       The errno variable has not been used for error returns to avoid filling
       the errno name space for this feature.

       The interface is defined so that the matched substrings rm_so and rm_eo
       are in a separate regmatch_t structure instead of in regex_t. This
       allows a single compiled RE to be used simultaneously in several
       contexts; in main() and a signal handler, perhaps, or in multiple
       threads of lightweight processes. (The preg argument to regexec() is
       declared with type const, so the implementation is not permitted to use
       the structure to store intermediate results.) It also allows an
       application to request an arbitrary number of substrings from an RE.
       The number of subexpressions in the RE is reported in re_nsub in preg.
       With this change to regexec(), consideration was given to dropping the
       REG_NOSUB flag since the user can now specify this with a zero nmatch
       argument to regexec(). However, keeping REG_NOSUB allows an
       implementation to use a different (perhaps more efficient) algorithm if
       it knows in regcomp() that no subexpressions need be reported. The
       implementation is only required to fill in pmatch if nmatch is not zero
       and if REG_NOSUB is not specified. Note that the size_t type, as
       defined in the ISO C standard, is unsigned, so the description of
       regexec() does not need to address negative values of nmatch.

       REG_NOTBOL was added to allow an application to do repeated searches
       for the same pattern in a line. If the pattern contains a circumflex
       character that should match the beginning of a line, then the pattern
       should only match when matched against the beginning of the line.
       Without the REG_NOTBOL flag, the application could rewrite the
       expression for subsequent matches, but in the general case this would
       require parsing the expression. The need for REG_NOTEOL is not as
       clear; it was added for symmetry.

       The addition of the regerror() function addresses the historical need
       for conforming application programs to have access to error information
       more than "Function failed to compile/match your RE for unknown
       reasons".

       This interface provides for two different methods of dealing with error
       conditions. The specific error codes (REG_EBRACE, for example), defined
       in <regex.h>, allow an application to recover from an error if it is so
       able. Many applications, especially those that use patterns supplied by
       a user, will not try to deal with specific error cases, but will just
       use regerror() to obtain a human-readable error message to present to
       the user.

       The regerror() function uses a scheme similar to confstr() to deal with
       the problem of allocating memory to hold the generated string. The
       scheme used by strerror() in the ISO C standard was considered
       unacceptable since it creates difficulties for multi-threaded
       applications.

       The preg argument is provided to regerror() to allow an implementation
       to generate a more descriptive message than would be possible with
       errcode alone. An implementation might, for example, save the character
       offset of the offending character of the pattern in a field of preg,
       and then include that in the generated message string. The
       implementation may also ignore preg.

       A REG_FILENAME flag was considered, but omitted. This flag caused
       regexec() to match patterns as described in the Shell and Utilities
       volume of IEEE Std 1003.1-2001, Section 2.13, Pattern Matching Notation
       instead of REs. This service is now provided by the fnmatch() function.

       Notice that there is a difference in philosophy between the
       ISO POSIX-2:1993 standard and IEEE Std 1003.1-2001 in how to handle a
       "bad" regular expression. The ISO POSIX-2:1993 standard says that many
       bad constructs "produce undefined results", or that "the interpretation
       is undefined". IEEE Std 1003.1-2001, however, says that the
       interpretation of such REs is unspecified. The term "undefined" means
       that the action by the application is an error, of similar severity to
       passing a bad pointer to a function.

       The regcomp() and regexec() functions are required to accept any
       null-terminated string as the pattern argument. If the meaning of the
       string is "undefined", the behavior of the function is "unspecified".
       IEEE Std 1003.1-2001 does not specify how the functions will interpret
       the pattern; they might return error codes, or they might do pattern
       matching in some completely unexpected way, but they should not do
       something like abort the process.

FUTURE DIRECTIONS

       None.

SEE ALSO

       fnmatch(), glob(), the Shell and Utilities volume of
       IEEE Std 1003.1-2001, Section 2.13, Pattern Matching Notation, the Base
       Definitions volume of IEEE Std 1003.1-2001, Chapter 9, Regular
       Expressions, <regex.h>, <sys/types.h>

COPYRIGHT

       Portions of this text are reprinted and reproduced in electronic form
       from IEEE Std 1003.1, 2003 Edition, Standard for Information Technology
       -- Portable Operating System Interface (POSIX), The Open Group Base
       Specifications Issue 6, Copyright (C) 2001-2003 by the Institute of
       Electrical and Electronics Engineers, Inc and The Open Group. In the
       event of any discrepancy between this version and the original IEEE and
       The Open Group Standard, the original IEEE and The Open Group Standard
       is the referee document. The original Standard can be obtained online
       at http://www.opengroup.org/unix/online.html .

IEEE/The Open Group                  2003                           REGCOMP(P)