REGCOMP(3P)                POSIX Programmer's Manual               REGCOMP(3P)

PROLOG

       This manual page is part of the POSIX Programmer's Manual. The Linux
       implementation of this interface may differ (consult the corresponding
       Linux manual page for details of Linux behavior), or the interface may
       not be implemented on Linux.

NAME

       regcomp, regerror, regexec, regfree - regular expression matching

SYNOPSIS

       #include <regex.h>

       int regcomp(regex_t *restrict preg, const char *restrict pattern,
              int cflags);
       size_t regerror(int errcode, const regex_t *restrict preg,
              char *restrict errbuf, size_t errbuf_size);
       int regexec(const regex_t *restrict preg, const char *restrict string,
              size_t nmatch, regmatch_t pmatch[restrict], int eflags);
       void regfree(regex_t *preg);

DESCRIPTION

       These functions interpret basic and extended regular expressions as
       described in the Base Definitions volume of IEEE Std 1003.1-2001,
       Chapter 9, Regular Expressions.

       The regex_t structure is defined in <regex.h> and contains at least the
       following member:

          Member Type  Member Name  Description
          size_t       re_nsub      Number of parenthesized subexpressions.

       The regmatch_t structure is defined in <regex.h> and contains at least
       the following members:

          Member Type  Member Name  Description
          regoff_t     rm_so        Byte offset from start of string to
                                    start of substring.
          regoff_t     rm_eo        Byte offset from start of string of the
                                    first character after the end of
                                    substring.

       The regcomp() function shall compile the regular expression contained
       in the string pointed to by the pattern argument and place the results
       in the structure pointed to by preg. The cflags argument is the
       bitwise-inclusive OR of zero or more of the following flags, which are
       defined in the <regex.h> header:

       REG_EXTENDED
              Use Extended Regular Expressions.

       REG_ICASE
              Ignore case in match. (See the Base Definitions volume of
              IEEE Std 1003.1-2001, Chapter 9, Regular Expressions.)

       REG_NOSUB
              Report only success/fail in regexec().

       REG_NEWLINE
              Change the handling of <newline>s, as described in the text.

       The default regular expression type for pattern is a Basic Regular
       Expression. The application can specify Extended Regular Expressions
       using the REG_EXTENDED cflags flag.

       If the REG_NOSUB flag was not set in cflags, then regcomp() shall set
       re_nsub to the number of parenthesized subexpressions (delimited by
       "\(\)" in basic regular expressions or "()" in extended regular
       expressions) found in pattern.
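
       As an informal illustration of the paragraphs above, the following
       sketch compiles a pattern with caller-supplied cflags and reads back
       re_nsub. The helper name count_subexpressions() is hypothetical and not
       part of <regex.h>; only regcomp(), regfree(), and the re_nsub member
       come from this page.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Compile pattern with the given cflags and report how many parenthesized
 * subexpressions it contains. Returns 0 on success and stores the count in
 * *nsub; otherwise returns the non-zero regcomp() error code.
 * count_subexpressions() is a hypothetical helper, not part of <regex.h>.
 * The caller is assumed not to pass REG_NOSUB in cflags.
 */
int
count_subexpressions(const char *pattern, int cflags, size_t *nsub)
{
    regex_t re;
    int     rc;

    rc = regcomp(&re, pattern, cflags);
    if (rc != 0)
        return rc;          /* Contents of re are undefined on failure. */
    *nsub = re.re_nsub;     /* Set by regcomp() when REG_NOSUB is not used. */
    regfree(&re);
    return 0;
}
```

       Called with "(a)(b(c))" and REG_EXTENDED, the helper would be expected
       to report three subexpressions. The same string compiled with cflags
       of 0 is a basic regular expression in which unescaped parentheses are
       ordinary characters, so no subexpressions are counted.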

       The regexec() function compares the null-terminated string specified by
       string with the compiled regular expression preg initialized by a
       previous call to regcomp(). If it finds a match, regexec() shall return
       0; otherwise, it shall return non-zero indicating either no match or an
       error. The eflags argument is the bitwise-inclusive OR of zero or more
       of the following flags, which are defined in the <regex.h> header:

       REG_NOTBOL
              The first character of the string pointed to by string is not
              the beginning of the line. Therefore, the circumflex character
              ('^'), when taken as a special character, shall not match the
              beginning of string.

       REG_NOTEOL
              The last character of the string pointed to by string is not the
              end of the line. Therefore, the dollar sign ('$'), when taken
              as a special character, shall not match the end of string.

       If nmatch is 0 or REG_NOSUB was set in the cflags argument to
       regcomp(), then regexec() shall ignore the pmatch argument. Otherwise,
       the application shall ensure that the pmatch argument points to an
       array with at least nmatch elements, and regexec() shall fill in the
       elements of that array with offsets of the substrings of string that
       correspond to the parenthesized subexpressions of pattern:
       pmatch[i].rm_so shall be the byte offset of the beginning and
       pmatch[i].rm_eo shall be one greater than the byte offset of the end of
       substring i. (Subexpression i begins at the ith matched open
       parenthesis, counting from 1.) Offsets in pmatch[0] identify the
       substring that corresponds to the entire regular expression. Unused
       elements of pmatch up to pmatch[nmatch-1] shall be filled with -1. If
       there are more than nmatch subexpressions in pattern (pattern itself
       counts as a subexpression), then regexec() shall still do the match,
       but shall record only the first nmatch substrings.
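
       The use of the pmatch offsets described above can be sketched as
       follows. The helper copy_group() and the constant MAX_GROUPS are
       hypothetical names chosen for this illustration; the fixed array size
       is an assumption made to keep the example short.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

#define MAX_GROUPS 10   /* Hypothetical limit chosen for this sketch. */

/*
 * Copy the text matched by subexpression `group` of the ERE `pattern` in
 * `string` into out (at most outsz bytes, always null-terminated).
 * Returns the number of bytes copied, or -1 on a compile error, no match,
 * a group that did not participate, or a group index >= MAX_GROUPS.
 * copy_group() is a hypothetical helper, not part of <regex.h>.
 */
long
copy_group(const char *pattern, const char *string, size_t group,
           char *out, size_t outsz)
{
    regex_t    re;
    regmatch_t pm[MAX_GROUPS];
    size_t     len;
    int        rc;

    if (group >= MAX_GROUPS || outsz == 0)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, string, MAX_GROUPS, pm, 0);
    regfree(&re);
    /* Unused elements are filled with -1, so this also rejects groups
       that do not exist in the pattern. */
    if (rc != 0 || pm[group].rm_so == -1)
        return -1;

    len = (size_t)(pm[group].rm_eo - pm[group].rm_so);
    if (len >= outsz)
        len = outsz - 1;                    /* Truncate to fit. */
    memcpy(out, string + pm[group].rm_so, len);
    out[len] = '\0';
    return (long)len;
}
```

       Group 0 selects the text matched by the entire expression; groups 1
       and above select parenthesized subexpressions in order of their open
       parentheses.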

       When matching a basic or extended regular expression, any given
       parenthesized subexpression of pattern might participate in the match
       of several different substrings of string, or it might not match any
       substring even though the pattern as a whole did match. The following
       rules shall be used to determine which substrings to report in pmatch
       when matching regular expressions:

        1. If subexpression i in a regular expression is not contained within
           another subexpression, and it participated in the match several
           times, then the byte offsets in pmatch[i] shall delimit the last
           such match.

        2. If subexpression i is not contained within another subexpression,
           and it did not participate in an otherwise successful match, the
           byte offsets in pmatch[i] shall be -1. A subexpression does not
           participate in the match when:

           '*' or "\{\}" appears immediately after the subexpression in a
           basic regular expression, or '*', '?', or "{}" appears immediately
           after the subexpression in an extended regular expression, and the
           subexpression did not match (matched 0 times)

           or:

           '|' is used in an extended regular expression to select this
           subexpression or another, and the other subexpression matched.

        3. If subexpression i is contained within another subexpression j, and
           i is not contained within any other subexpression that is contained
           within j, and a match of subexpression j is reported in pmatch[j],
           then the match or non-match of subexpression i reported in
           pmatch[i] shall be as described in 1. and 2. above, but within the
           substring reported in pmatch[j] rather than the whole string. The
           offsets in pmatch[i] are still relative to the start of string.

        4. If subexpression i is contained in subexpression j, and the byte
           offsets in pmatch[j] are -1, then the offsets in pmatch[i] shall
           also be -1.

        5. If subexpression i matched a zero-length string, then both byte
           offsets in pmatch[i] shall be the byte offset of the character or
           null terminator immediately following the zero-length string.
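
       Rules 1, 2, and 5 above can be observed with a small probe such as the
       one below. The helper group1_offsets() is a hypothetical name; the
       sketch assumes an implementation that follows these rules and only
       reports what regexec() stores for subexpression 1.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Run the ERE `pattern` against `string` and store the offsets of
 * subexpression 1 in *so and *eo. Returns the regexec() status, or the
 * regcomp() error code if compilation fails; *so and *eo are written
 * only when the return value is 0.
 * group1_offsets() is a hypothetical helper, not part of <regex.h>.
 */
int
group1_offsets(const char *pattern, const char *string, long *so, long *eo)
{
    regex_t    re;
    regmatch_t pm[2];   /* pm[0]: whole match, pm[1]: subexpression 1. */
    int        rc;

    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;
    rc = regexec(&re, string, 2, pm, 0);
    regfree(&re);
    if (rc == 0) {
        *so = (long)pm[1].rm_so;
        *eo = (long)pm[1].rm_eo;
    }
    return rc;
}
```

       With "(a|b)*c" against "abc", subexpression 1 participates twice and
       rule 1 selects the last participation, the "b" at offsets 1 to 2. With
       "(a)|b" against "b", the other alternative matched, so rule 2 gives
       offsets of -1. With "(x*)b" against "b", the subexpression matches a
       zero-length string and rule 5 gives both offsets as 0.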

       If, when regexec() is called, the locale is different from when the
       regular expression was compiled, the result is undefined.

       If REG_NEWLINE is not set in cflags, then a <newline> in pattern or
       string shall be treated as an ordinary character. If REG_NEWLINE is
       set, then <newline> shall be treated as an ordinary character except as
       follows:

        1. A <newline> in string shall not be matched by a period outside a
           bracket expression or by any form of a non-matching list (see the
           Base Definitions volume of IEEE Std 1003.1-2001, Chapter 9, Regular
           Expressions).

        2. A circumflex ('^') in pattern, when used to specify expression
           anchoring (see the Base Definitions volume of IEEE Std 1003.1-2001,
           Section 9.3.8, BRE Expression Anchoring), shall match the
           zero-length string immediately after a <newline> in string,
           regardless of the setting of REG_NOTBOL.

        3. A dollar sign ('$') in pattern, when used to specify expression
           anchoring, shall match the zero-length string immediately before a
           <newline> in string, regardless of the setting of REG_NOTEOL.
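
       The effect of REG_NEWLINE described in the list above can be sketched
       with a helper that runs the same pattern with and without the flag.
       The name matches_with() is hypothetical; the pattern is treated as a
       basic regular expression unless the caller adds REG_EXTENDED.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Return 1 if `pattern` matches somewhere in `text`, 0 if it does not,
 * and -1 if the pattern fails to compile. `cflags` is passed to regcomp()
 * together with REG_NOSUB, since only success or failure is needed.
 * matches_with() is a hypothetical helper, not part of <regex.h>.
 */
int
matches_with(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int     rc;

    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

       For the two-line string "first\nsecond", the pattern "^second" is
       expected to match only when REG_NEWLINE is passed, and likewise
       "first$". Conversely, "first.second" is expected to match only without
       REG_NEWLINE, because with the flag set a period no longer matches the
       <newline>.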

       The regfree() function frees any memory allocated by regcomp()
       associated with preg.

       The following constants are defined as error return values:

       REG_NOMATCH
              regexec() failed to match.

       REG_BADPAT
              Invalid regular expression.

       REG_ECOLLATE
              Invalid collating element referenced.

       REG_ECTYPE
              Invalid character class type referenced.

       REG_EESCAPE
              Trailing '\' in pattern.

       REG_ESUBREG
              Number in "\digit" invalid or in error.

       REG_EBRACK
              "[]" imbalance.

       REG_EPAREN
              "\(\)" or "()" imbalance.

       REG_EBRACE
              "\{\}" imbalance.

       REG_BADBR
              Content of "\{\}" invalid: not a number, number too large, more
              than two numbers, first larger than second.

       REG_ERANGE
              Invalid endpoint in range expression.

       REG_ESPACE
              Out of memory.

       REG_BADRPT
              '?', '*', or '+' not preceded by valid regular expression.

       The regerror() function provides a mapping from error codes returned by
       regcomp() and regexec() to unspecified printable strings. It generates
       a string corresponding to the value of the errcode argument, which the
       application shall ensure is the last non-zero value returned by
       regcomp() or regexec() with the given value of preg. If errcode is not
       such a value, the content of the generated string is unspecified.

       If preg is a null pointer, but errcode is a value returned by a
       previous call to regexec() or regcomp(), regerror() still generates an
       error string corresponding to the value of errcode, but it might not be
       as detailed under some implementations.

       If the errbuf_size argument is not 0, regerror() shall place the
       generated string into the buffer of size errbuf_size bytes pointed to
       by errbuf. If the string (including the terminating null) cannot fit in
       the buffer, regerror() shall truncate the string and null-terminate the
       result.

       If errbuf_size is 0, regerror() shall ignore the errbuf argument, and
       return the size of the buffer needed to hold the generated string.

       If the preg argument to regexec() or regfree() is not a compiled
       regular expression returned by regcomp(), the result is undefined. A
       preg is no longer treated as a compiled regular expression after it is
       given to regfree().

RETURN VALUE

       Upon successful completion, the regcomp() function shall return 0.
       Otherwise, it shall return an integer value indicating an error as
       described in <regex.h>, and the content of preg is undefined. If a code
       is returned, the interpretation shall be as given in <regex.h>.

       If regcomp() detects an invalid RE, it may return REG_BADPAT, or it may
       return one of the error codes that more precisely describes the error.

       Upon successful completion, the regexec() function shall return 0.
       Otherwise, it shall return REG_NOMATCH to indicate no match.

       Upon successful completion, the regerror() function shall return the
       number of bytes needed to hold the entire generated string, including
       the null termination. If the return value is greater than errbuf_size,
       the string returned in the buffer pointed to by errbuf has been
       truncated.

       The regfree() function shall not return a value.

ERRORS

       No errors are defined.

       The following sections are informative.

EXAMPLES

              #include <regex.h>
              #include <stddef.h>     /* For NULL. */

              /*
               * Match string against the extended regular expression in
               * pattern, treating errors as no match.
               *
               * Return 1 for match, 0 for no match.
               */
              int
              match(const char *string, const char *pattern)
              {
                  int        status;
                  regex_t    re;

                  if (regcomp(&re, pattern, REG_EXTENDED|REG_NOSUB) != 0) {
                      return(0);      /* Compile error: report as no match. */
                  }
                  status = regexec(&re, string, (size_t) 0, NULL, 0);
                  regfree(&re);
                  if (status != 0) {
                      return(0);      /* No match, or a regexec() error. */
                  }
                  return(1);
              }

       The following demonstrates how the REG_NOTBOL flag could be used with
       regexec() to find all substrings in a line that match a pattern
       supplied by a user. (For simplicity of the example, very little error
       checking is done.) Because the offsets reported by each call are
       relative to the string passed to that call, the example keeps a
       pointer p to the start of the part of buffer not yet searched.

              const char *p = &buffer[0];

              (void) regcomp(&re, pattern, 0);
              /* This call to regexec() finds the first match on the line. */
              error = regexec(&re, p, 1, &pm, 0);
              while (error == 0) {  /* While matches found. */
                  /* Substring found between p + pm.rm_so and p + pm.rm_eo. */
                  if (pm.rm_eo == 0)
                      break;        /* Empty match at p: stop, do not loop. */
                  p += pm.rm_eo;    /* Offsets are relative to p. */
                  /* This call to regexec() finds the next match. */
                  error = regexec(&re, p, 1, &pm, REG_NOTBOL);
              }

APPLICATION USAGE

       An application could use:

              regerror(code, preg, (char *)NULL, (size_t)0)

       to find out how big a buffer is needed for the generated string,
       malloc() a buffer to hold the string, and then call regerror() again to
       get the string. Alternatively, it could allocate a fixed, static buffer
       that is big enough to hold most strings, and then use malloc() to
       allocate a larger buffer if it finds that this is too small.
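
       The first approach above, sizing the buffer with one call and filling
       it with a second, might look like the following sketch. The helper
       name regerror_alloc() is hypothetical; ownership of the returned
       buffer by the caller is a design choice made for this example.

```c
#include <regex.h>
#include <stdlib.h>

/*
 * Return a malloc()ed copy of the message for `code`, or NULL if memory
 * is exhausted. The caller frees the result. `preg` may be a null pointer,
 * in which case the message might be less detailed.
 * regerror_alloc() is a hypothetical helper, not part of <regex.h>.
 */
char *
regerror_alloc(int code, const regex_t *preg)
{
    /* With errbuf_size 0, regerror() only reports the size needed,
       including the terminating null. */
    size_t needed = regerror(code, preg, NULL, 0);
    char  *msg = malloc(needed);

    if (msg != NULL)
        (void) regerror(code, preg, msg, needed);
    return msg;
}
```

       A caller would typically pass the non-zero value returned by regcomp()
       or regexec() together with the same preg, print the message, and then
       free() it.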

       To match a pattern as described in the Shell and Utilities volume of
       IEEE Std 1003.1-2001, Section 2.13, Pattern Matching Notation, use the
       fnmatch() function.

RATIONALE

       The regexec() function must fill in all nmatch elements of pmatch,
       where nmatch and pmatch are supplied by the application, even if some
       elements of pmatch do not correspond to subexpressions in pattern. The
       application writer should note that there is probably no reason for
       using a value of nmatch that is larger than preg->re_nsub+1.
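
       One way to follow the sizing advice above is to allocate exactly
       re_nsub+1 elements once the expression has been compiled. The sketch
       below does this; match_all_groups() is a hypothetical helper, and it
       assumes the expression was compiled without REG_NOSUB.

```c
#include <regex.h>
#include <stdlib.h>

/*
 * Match the compiled expression `re` against `string`, allocating exactly
 * re->re_nsub + 1 regmatch_t elements. On a match, returns the array
 * (the caller frees it) and stores its length in *n; otherwise returns
 * NULL. `re` must have been compiled without REG_NOSUB.
 * match_all_groups() is a hypothetical helper, not part of <regex.h>.
 */
regmatch_t *
match_all_groups(const regex_t *re, const char *string, size_t *n)
{
    size_t      count = re->re_nsub + 1;   /* Element 0 is the whole match. */
    regmatch_t *pm = malloc(count * sizeof *pm);

    if (pm == NULL)
        return NULL;
    if (regexec(re, string, count, pm, 0) != 0) {
        free(pm);                          /* No match, or an error. */
        return NULL;
    }
    *n = count;
    return pm;
}
```

       Because the array is sized from re_nsub, no element is left unused and
       no subexpression is dropped for lack of space.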

       The REG_NEWLINE flag supports a use of RE matching that is needed in
       some applications like text editors. In such applications, the user
       supplies an RE asking the application to find a line that matches the
       given expression. An anchor in such an RE anchors at the beginning or
       end of any line. Such an application can pass a sequence of
       <newline>-separated lines to regexec() as a single long string and
       specify REG_NEWLINE to regcomp() to get the desired behavior. The
       application must ensure that there are no explicit <newline>s in
       pattern if it wants to ensure that any match occurs entirely within a
       single line.

       The REG_NEWLINE flag affects the behavior of regexec(), but it is in
       the cflags parameter to regcomp() to allow flexibility of
       implementation. Some implementations will want to generate the same
       compiled RE in regcomp() regardless of the setting of REG_NEWLINE and
       have regexec() handle anchors differently based on the setting of the
       flag. Other implementations will generate different compiled REs based
       on the REG_NEWLINE.

       The REG_ICASE flag supports the operations taken by the grep -i option
       and the historical implementations of ex and vi. Including this flag
       will make it easier for application code to be written that does the
       same thing as these utilities.

       The substrings reported in pmatch[] are defined using offsets from the
       start of the string rather than pointers. Since this is a new
       interface, there should be no impact on historical implementations or
       applications, and offsets should be just as easy to use as pointers.
       The change to offsets was made to facilitate future extensions in which
       the string to be searched is presented to regexec() in blocks, allowing
       a string to be searched that is not all in memory at once.

       The type regoff_t is used for the elements of pmatch[] to ensure that
       the application can represent either the largest possible array in
       memory (important for an application conforming to the Shell and
       Utilities volume of IEEE Std 1003.1-2001) or the largest possible file
       (important for an application using the extension where a file is
       searched in chunks).

       The standard developers rejected the inclusion of a regsub() function
       that would be used to do substitutions for a matched RE. While such a
       routine would be useful to some applications, its utility would be much
       more limited than the matching function described here. Both RE parsing
       and substitution are possible to implement without support other than
       that required by the ISO C standard, but matching is much more complex
       than substituting. The only difficult part of substitution, given the
       information supplied by regexec(), is finding the next character in a
       string when there can be multi-byte characters. That is a much larger
       issue, and one that needs a more general solution.

       The errno variable has not been used for error returns to avoid filling
       the errno name space for this feature.

       The interface is defined so that the matched substrings rm_so and rm_eo
       are in a separate regmatch_t structure instead of in regex_t. This
       allows a single compiled RE to be used simultaneously in several
       contexts; in main() and a signal handler, perhaps, or in multiple
       threads of lightweight processes. (The preg argument to regexec() is
       declared with type const, so the implementation is not permitted to use
       the structure to store intermediate results.) It also allows an
       application to request an arbitrary number of substrings from an RE.
       The number of subexpressions in the RE is reported in re_nsub in preg.
       With this change to regexec(), consideration was given to dropping the
       REG_NOSUB flag since the user can now specify this with a zero nmatch
       argument to regexec(). However, keeping REG_NOSUB allows an
       implementation to use a different (perhaps more efficient) algorithm if
       it knows in regcomp() that no subexpressions need be reported. The
       implementation is only required to fill in pmatch if nmatch is not zero
       and if REG_NOSUB is not specified. Note that the size_t type, as
       defined in the ISO C standard, is unsigned, so the description of
       regexec() does not need to address negative values of nmatch.

       REG_NOTBOL was added to allow an application to do repeated searches
       for the same pattern in a line. If the pattern contains a circumflex
       character that should match the beginning of a line, then the pattern
       should only match when matched against the beginning of the line.
       Without the REG_NOTBOL flag, the application could rewrite the
       expression for subsequent matches, but in the general case this would
       require parsing the expression. The need for REG_NOTEOL is not as
       clear; it was added for symmetry.

       The addition of the regerror() function addresses the historical need
       for conforming application programs to have access to error information
       more than "Function failed to compile/match your RE for unknown
       reasons".

       This interface provides for two different methods of dealing with error
       conditions. The specific error codes (REG_EBRACE, for example), defined
       in <regex.h>, allow an application to recover from an error if it is so
       able. Many applications, especially those that use patterns supplied by
       a user, will not try to deal with specific error cases, but will just
       use regerror() to obtain a human-readable error message to present to
       the user.

       The regerror() function uses a scheme similar to confstr() to deal with
       the problem of allocating memory to hold the generated string. The
       scheme used by strerror() in the ISO C standard was considered
       unacceptable since it creates difficulties for multi-threaded
       applications.

       The preg argument is provided to regerror() to allow an implementation
       to generate a more descriptive message than would be possible with
       errcode alone. An implementation might, for example, save the character
       offset of the offending character of the pattern in a field of preg,
       and then include that in the generated message string. The
       implementation may also ignore preg.

       A REG_FILENAME flag was considered, but omitted. This flag caused
       regexec() to match patterns as described in the Shell and Utilities
       volume of IEEE Std 1003.1-2001, Section 2.13, Pattern Matching Notation
       instead of REs. This service is now provided by the fnmatch() function.

       Notice that there is a difference in philosophy between the
       ISO POSIX-2:1993 standard and IEEE Std 1003.1-2001 in how to handle a
       "bad" regular expression. The ISO POSIX-2:1993 standard says that many
       bad constructs "produce undefined results", or that "the interpretation
       is undefined". IEEE Std 1003.1-2001, however, says that the
       interpretation of such REs is unspecified. The term "undefined" means
       that the action by the application is an error, of similar severity to
       passing a bad pointer to a function.

       The regcomp() and regexec() functions are required to accept any
       null-terminated string as the pattern argument. If the meaning of the
       string is "undefined", the behavior of the function is "unspecified".
       IEEE Std 1003.1-2001 does not specify how the functions will interpret
       the pattern; they might return error codes, or they might do pattern
       matching in some completely unexpected way, but they should not do
       something like abort the process.

FUTURE DIRECTIONS

       None.

SEE ALSO

       fnmatch(), glob(), Shell and Utilities volume of IEEE Std 1003.1-2001,
       Section 2.13, Pattern Matching Notation, Base Definitions volume of
       IEEE Std 1003.1-2001, Chapter 9, Regular Expressions, <regex.h>,
       <sys/types.h>

COPYRIGHT

       Portions of this text are reprinted and reproduced in electronic form
       from IEEE Std 1003.1, 2003 Edition, Standard for Information Technology
       -- Portable Operating System Interface (POSIX), The Open Group Base
       Specifications Issue 6, Copyright (C) 2001-2003 by the Institute of
       Electrical and Electronics Engineers, Inc and The Open Group. In the
       event of any discrepancy between this version and the original IEEE and
       The Open Group Standard, the original IEEE and The Open Group Standard
       is the referee document. The original Standard can be obtained online
       at http://www.opengroup.org/unix/online.html .

IEEE/The Open Group                  2003                          REGCOMP(3P)