REGCOMP(3P)                POSIX Programmer's Manual               REGCOMP(3P)

PROLOG

       This manual page is part of the POSIX Programmer's Manual. The Linux
       implementation of this interface may differ (consult the corresponding
       Linux manual page for details of Linux behavior), or the interface may
       not be implemented on Linux.

NAME

       regcomp, regerror, regexec, regfree - regular expression matching

SYNOPSIS

       #include <regex.h>

       int regcomp(regex_t *restrict preg, const char *restrict pattern,
           int cflags);
       size_t regerror(int errcode, const regex_t *restrict preg,
           char *restrict errbuf, size_t errbuf_size);
       int regexec(const regex_t *restrict preg, const char *restrict string,
           size_t nmatch, regmatch_t pmatch[restrict], int eflags);
       void regfree(regex_t *preg);

DESCRIPTION

       These functions interpret basic and extended regular expressions as
       described in the Base Definitions volume of POSIX.1-2008, Chapter 9,
       Regular Expressions.

       The regex_t structure is defined in <regex.h> and contains at least the
       following member:

             ┌──────────────┬──────────────┬───────────────────────────┐
             │ Member Type  │ Member Name  │ Description               │
             ├──────────────┼──────────────┼───────────────────────────┤
             │ size_t       │ re_nsub      │ Number of parenthesized   │
             │              │              │ subexpressions.           │
             └──────────────┴──────────────┴───────────────────────────┘

       The regmatch_t structure is defined in <regex.h> and contains at least
       the following members:

             ┌──────────────┬──────────────┬───────────────────────────┐
             │ Member Type  │ Member Name  │ Description               │
             ├──────────────┼──────────────┼───────────────────────────┤
             │ regoff_t     │ rm_so        │ Byte offset from start of │
             │              │              │ string to start of        │
             │              │              │ substring.                │
             │ regoff_t     │ rm_eo        │ Byte offset from start of │
             │              │              │ string of the first       │
             │              │              │ character after the end   │
             │              │              │ of substring.             │
             └──────────────┴──────────────┴───────────────────────────┘

       The regcomp() function shall compile the regular expression contained
       in the string pointed to by the pattern argument and place the results
       in the structure pointed to by preg. The cflags argument is the
       bitwise-inclusive OR of zero or more of the following flags, which are
       defined in the <regex.h> header:

       REG_EXTENDED  Use Extended Regular Expressions.

       REG_ICASE     Ignore case in match (see the Base Definitions volume of
                     POSIX.1-2008, Chapter 9, Regular Expressions).

       REG_NOSUB     Report only success/fail in regexec().

       REG_NEWLINE   Change the handling of <newline> characters, as described
                     in the text.

       The default regular expression type for pattern is a Basic Regular
       Expression. The application can specify Extended Regular Expressions
       using the REG_EXTENDED cflags flag.
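
       As an illustration of how cflags changes the interpretation of a
       pattern, the following minimal sketch compiles a pattern with
       caller-chosen flags and reports whether it matches. The helper name
       matches_with_flags() is hypothetical and not part of <regex.h>.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of <regex.h>): compile pattern with the
 * given cflags and report whether it matches anywhere in string.
 *
 * Return 1 for match, 0 for no match, -1 if the pattern does not compile.
 */
int
matches_with_flags(const char *pattern, const char *string, int cflags)
{
    regex_t re;
    int     status;

    /* Only success or failure is needed here, so REG_NOSUB is added. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    status = regexec(&re, string, (size_t) 0, NULL, 0);
    regfree(&re);
    return status == 0 ? 1 : 0;
}
```

       With cflags of 0 the pattern "ab+c" is a Basic Regular Expression, in
       which '+' is an ordinary character; with REG_EXTENDED the same pattern
       treats '+' as a repetition operator.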

       If the REG_NOSUB flag was not set in cflags, then regcomp() shall set
       re_nsub to the number of parenthesized subexpressions (delimited by
       "\(\)" in basic regular expressions or "()" in extended regular
       expressions) found in pattern.
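
       The re_nsub member can be read once regcomp() has succeeded. The sketch
       below assumes a hypothetical helper, count_subexpressions(), that
       returns the value as a long.

```c
#include <regex.h>

/*
 * Hypothetical helper: return the number of parenthesized
 * subexpressions in pattern, or -1 if the pattern does not compile.
 * REG_NOSUB must not be part of cflags, since re_nsub is only
 * required to be set when that flag is absent.
 */
long
count_subexpressions(const char *pattern, int cflags)
{
    regex_t re;
    long    n;

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    n = (long) re.re_nsub;
    regfree(&re);
    return n;
}
```

       A caller would typically size the pmatch array passed to regexec() as
       re_nsub + 1, the extra element holding the match of the whole pattern.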

       The regexec() function compares the null-terminated string specified by
       string with the compiled regular expression preg initialized by a
       previous call to regcomp(). If it finds a match, regexec() shall return
       0; otherwise, it shall return non-zero indicating either no match or an
       error. The eflags argument is the bitwise-inclusive OR of zero or more
       of the following flags, which are defined in the <regex.h> header:

       REG_NOTBOL    The first character of the string pointed to by string is
                     not the beginning of the line. Therefore, the
                     <circumflex> character ('^'), when taken as a special
                     character, shall not match the beginning of string.

       REG_NOTEOL    The last character of the string pointed to by string is
                     not the end of the line. Therefore, the <dollar-sign>
                     ('$'), when taken as a special character, shall not match
                     the end of string.
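
       The effect of the two eflags values can be observed with a small
       sketch. The helper name exec_with_eflags() is hypothetical; it compiles
       a basic regular expression and passes the caller's eflags through to
       regexec().

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: match the basic regular expression pattern
 * against string using the supplied eflags.
 *
 * Return 1 for match, 0 for no match, -1 if the pattern does not compile.
 */
int
exec_with_eflags(const char *pattern, const char *string, int eflags)
{
    regex_t re;
    int     status;

    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;
    status = regexec(&re, string, (size_t) 0, NULL, eflags);
    regfree(&re);
    return status == 0 ? 1 : 0;
}
```

       Patterns without anchors are unaffected by either flag, so passing
       REG_NOTBOL or REG_NOTEOL only matters when '^' or '$' is used as an
       anchor.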

       If nmatch is 0 or REG_NOSUB was set in the cflags argument to
       regcomp(), then regexec() shall ignore the pmatch argument. Otherwise,
       the application shall ensure that the pmatch argument points to an
       array with at least nmatch elements, and regexec() shall fill in the
       elements of that array with offsets of the substrings of string that
       correspond to the parenthesized subexpressions of pattern:
       pmatch[i].rm_so shall be the byte offset of the beginning and
       pmatch[i].rm_eo shall be one greater than the byte offset of the end of
       substring i. (Subexpression i begins at the ith matched open
       parenthesis, counting from 1.) Offsets in pmatch[0] identify the
       substring that corresponds to the entire regular expression. Unused
       elements of pmatch up to pmatch[nmatch-1] shall be filled with -1. If
       there are more than nmatch subexpressions in pattern (pattern itself
       counts as a subexpression), then regexec() shall still do the match,
       but shall record only the first nmatch substrings.
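
       The offsets in pmatch can be turned into substrings by copying the
       bytes between rm_so and rm_eo. The following is a minimal sketch; the
       helper copy_group() and the limit MAX_GROUPS are assumptions made for
       the example, not part of the interface.

```c
#include <regex.h>
#include <string.h>

#define MAX_GROUPS 10   /* Arbitrary limit chosen for this sketch. */

/*
 * Hypothetical helper: match the extended regular expression pattern
 * against string and copy the text of subexpression number group
 * (0 means the whole match) into out as a null-terminated string.
 *
 * Return 0 on success, -1 on compile error, no match, a subexpression
 * that did not participate, or a buffer that is too small.
 */
int
copy_group(const char *pattern, const char *string, size_t group,
    char *out, size_t outsz)
{
    regex_t    re;
    regmatch_t pm[MAX_GROUPS];
    size_t     len;
    int        rc = -1;

    if (group >= MAX_GROUPS || outsz == 0)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    /* Unused elements of pm are filled with -1 by regexec(). */
    if (regexec(&re, string, MAX_GROUPS, pm, 0) == 0 &&
        pm[group].rm_so != -1) {
        len = (size_t) (pm[group].rm_eo - pm[group].rm_so);
        if (len < outsz) {
            memcpy(out, string + pm[group].rm_so, len);
            out[len] = '\0';
            rc = 0;
        }
    }
    regfree(&re);
    return rc;
}
```

       Because rm_eo is the offset of the first byte after the substring, the
       length is simply rm_eo - rm_so.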

       When matching a basic or extended regular expression, any given
       parenthesized subexpression of pattern might participate in the match
       of several different substrings of string, or it might not match any
       substring even though the pattern as a whole did match. The following
       rules shall be used to determine which substrings to report in pmatch
       when matching regular expressions:

        1. If subexpression i in a regular expression is not contained within
           another subexpression, and it participated in the match several
           times, then the byte offsets in pmatch[i] shall delimit the last
           such match.

        2. If subexpression i is not contained within another subexpression,
           and it did not participate in an otherwise successful match, the
           byte offsets in pmatch[i] shall be -1. A subexpression does not
           participate in the match when:

                  '*' or "\{\}" appears immediately after the subexpression
                  in a basic regular expression, or '*', '?', or "{}" appears
                  immediately after the subexpression in an extended regular
                  expression, and the subexpression did not match (matched 0
                  times)

           or:

                  '|' is used in an extended regular expression to select this
                  subexpression or another, and the other subexpression
                  matched.

        3. If subexpression i is contained within another subexpression j, and
           i is not contained within any other subexpression that is contained
           within j, and a match of subexpression j is reported in pmatch[j],
           then the match or non-match of subexpression i reported in
           pmatch[i] shall be as described in 1. and 2. above, but within the
           substring reported in pmatch[j] rather than the whole string. The
           offsets in pmatch[i] are still relative to the start of string.

        4. If subexpression i is contained in subexpression j, and the byte
           offsets in pmatch[j] are -1, then the byte offsets in pmatch[i]
           shall also be -1.

        5. If subexpression i matched a zero-length string, then both byte
           offsets in pmatch[i] shall be the byte offset of the character or
           null terminator immediately following the zero-length string.
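
       Rules 1, 2, and 5 above can be observed with a short sketch that
       reports the offsets recorded for subexpression 1. The helper name
       group1_offsets() is hypothetical, and the offsets are converted to long
       only to keep the example simple.

```c
#include <regex.h>

/*
 * Hypothetical helper: match the extended regular expression pattern
 * against string and store the offsets of subexpression 1 in *so and
 * *eo.
 *
 * Return 0 if the pattern as a whole matched, -1 otherwise.
 */
int
group1_offsets(const char *pattern, const char *string, long *so, long *eo)
{
    regex_t    re;
    regmatch_t pm[2];   /* pm[0]: whole match, pm[1]: subexpression 1 */
    int        status;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    status = regexec(&re, string, 2, pm, 0);
    regfree(&re);
    if (status != 0)
        return -1;
    *so = (long) pm[1].rm_so;
    *eo = (long) pm[1].rm_eo;
    return 0;
}
```

       For example, matching "(a)|b" against "b" succeeds through the second
       alternative, so subexpression 1 does not participate and both of its
       offsets are -1. Matching "(a|b)*c" against "abc" reports the last
       iteration of the subexpression, and matching "(a*)b" against "b"
       reports a zero-length substring at offset 0.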

       If, when regexec() is called, the locale is different from when the
       regular expression was compiled, the result is undefined.

       If REG_NEWLINE is not set in cflags, then a <newline> in pattern or
       string shall be treated as an ordinary character. If REG_NEWLINE is
       set, then <newline> shall be treated as an ordinary character except as
       follows:

        1. A <newline> in string shall not be matched by a <period> outside a
           bracket expression or by any form of a non-matching list (see the
           Base Definitions volume of POSIX.1-2008, Chapter 9, Regular
           Expressions).

        2. A <circumflex> ('^') in pattern, when used to specify expression
           anchoring (see the Base Definitions volume of POSIX.1-2008, Section
           9.3.8, BRE Expression Anchoring), shall match the zero-length
           string immediately after a <newline> in string, regardless of the
           setting of REG_NOTBOL.

        3. A <dollar-sign> ('$') in pattern, when used to specify expression
           anchoring, shall match the zero-length string immediately before a
           <newline> in string, regardless of the setting of REG_NOTEOL.
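
       The three exceptions above can be checked with a sketch that compiles
       the same pattern with and without REG_NEWLINE. The helper name
       line_match() and its newline_aware parameter are assumptions made for
       the example.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: match the extended regular expression pattern
 * against text, adding REG_NEWLINE to cflags when newline_aware is
 * non-zero.
 *
 * Return 1 for match, 0 for no match, -1 if the pattern does not compile.
 */
int
line_match(const char *pattern, const char *text, int newline_aware)
{
    regex_t re;
    int     cflags = REG_EXTENDED | REG_NOSUB;
    int     status;

    if (newline_aware)
        cflags |= REG_NEWLINE;
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    status = regexec(&re, text, (size_t) 0, NULL, 0);
    regfree(&re);
    return status == 0 ? 1 : 0;
}
```

       With REG_NEWLINE set, "^b$" matches the middle line of "a\nb\nc",
       while "a.c" and "a[^x]c" no longer match "a\nc" because neither the
       <period> nor the non-matching list may consume the <newline>.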

       The regfree() function frees any memory allocated by regcomp()
       associated with preg.

       The following constants are defined as the minimum set of error return
       values, although other errors listed as implementation extensions in
       <regex.h> are possible:

       REG_BADBR     Content of "\{\}" invalid: not a number, number too
                     large, more than two numbers, first larger than second.

       REG_BADPAT    Invalid regular expression.

       REG_BADRPT    '?', '*', or '+' not preceded by valid regular
                     expression.

       REG_EBRACE    "\{\}" imbalance.

       REG_EBRACK    "[]" imbalance.

       REG_ECOLLATE  Invalid collating element referenced.

       REG_ECTYPE    Invalid character class type referenced.

       REG_EESCAPE   Trailing <backslash> character in pattern.

       REG_EPAREN    "\(\)" or "()" imbalance.

       REG_ERANGE    Invalid endpoint in range expression.

       REG_ESPACE    Out of memory.

       REG_ESUBREG   Number in "\digit" invalid or in error.

       REG_NOMATCH   regexec() failed to match.

       If more than one error occurs in processing a function call, any one of
       the possible constants may be returned, as the order of detection is
       unspecified.

       The regerror() function provides a mapping from error codes returned by
       regcomp() and regexec() to unspecified printable strings. It generates
       a string corresponding to the value of the errcode argument, which the
       application shall ensure is the last non-zero value returned by
       regcomp() or regexec() with the given value of preg. If errcode is not
       such a value, the content of the generated string is unspecified.

       If preg is a null pointer, but errcode is a value returned by a
       previous call to regexec() or regcomp(), regerror() still generates an
       error string corresponding to the value of errcode, but it might not be
       as detailed under some implementations.

       If the errbuf_size argument is not 0, regerror() shall place the
       generated string into the buffer of size errbuf_size bytes pointed to
       by errbuf. If the string (including the terminating null) cannot fit in
       the buffer, regerror() shall truncate the string and null-terminate the
       result.

       If errbuf_size is 0, regerror() shall ignore the errbuf argument, and
       return the size of the buffer needed to hold the generated string.

       If the preg argument to regexec() or regfree() is not a compiled
       regular expression returned by regcomp(), the result is undefined. A
       preg is no longer treated as a compiled regular expression after it is
       given to regfree().

RETURN VALUE

       Upon successful completion, the regcomp() function shall return 0.
       Otherwise, it shall return an integer value indicating an error as
       described in <regex.h>, and the content of preg is undefined. If a code
       is returned, the interpretation shall be as given in <regex.h>.

       If regcomp() detects an invalid RE, it may return REG_BADPAT, or it may
       return one of the error codes that more precisely describes the error.

       Upon successful completion, the regexec() function shall return 0.
       Otherwise, it shall return REG_NOMATCH to indicate no match.

       Upon successful completion, the regerror() function shall return the
       number of bytes needed to hold the entire generated string, including
       the null termination. If the return value is greater than errbuf_size,
       the string returned in the buffer pointed to by errbuf has been
       truncated.
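
       The return value of regerror() can therefore be used to detect
       truncation when a fixed-size buffer is supplied. A minimal sketch,
       using a hypothetical helper named error_message():

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: write the message for errcode into buf, which
 * holds bufsz bytes.
 *
 * Return 1 if the message was truncated, 0 if it fit completely.
 */
int
error_message(int errcode, const regex_t *preg, char *buf, size_t bufsz)
{
    /* needed counts the terminating null byte as well. */
    size_t needed = regerror(errcode, preg, buf, bufsz);

    return needed > bufsz ? 1 : 0;
}
```

       The wording of the message itself is unspecified, so an application
       should not compare it against fixed text.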

       The regfree() function shall not return a value.

ERRORS

       No errors are defined.

       The following sections are informative.

EXAMPLES

           #include <regex.h>
           #include <stddef.h>     /* For NULL. */

           /*
            * Match string against the extended regular expression in
            * pattern, treating errors as no match.
            *
            * Return 1 for match, 0 for no match.
            */

           int
           match(const char *string, const char *pattern)
           {
               int     status;
               regex_t re;

               if (regcomp(&re, pattern, REG_EXTENDED|REG_NOSUB) != 0) {
                   return(0);      /* Compile error: treat as no match. */
               }
               status = regexec(&re, string, (size_t) 0, NULL, 0);
               regfree(&re);
               if (status != 0) {
                   return(0);      /* No match, or an error. */
               }
               return(1);
           }

       The following demonstrates how the REG_NOTBOL flag could be used with
       regexec() to find all substrings in a line that match a pattern
       supplied by a user. (For simplicity of the example, very little error
       checking is done.)

           const char *p = buffer;  /* Start of the text not yet searched. */

           (void) regcomp(&re, pattern, 0);
           /* This call to regexec() finds the first match on the line. */
           error = regexec(&re, p, 1, &pm, 0);
           while (error == 0 && pm.rm_eo > 0) {  /* While matches found. */
               /* Substring found between p + pm.rm_so and p + pm.rm_eo. */
               p += pm.rm_eo;  /* Offsets are relative to p, not buffer. */
               /* This call to regexec() finds the next match. */
               error = regexec(&re, p, 1, &pm, REG_NOTBOL);
           }
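
       A self-contained variant of the loop above might look like the
       following sketch. The helper name count_matches() is hypothetical. The
       search position is advanced by the offsets returned from the previous
       call, since those offsets are relative to the string passed to that
       call, and an empty match advances by one byte so that the loop always
       terminates.

```c
#include <regex.h>

/*
 * Hypothetical helper: count the non-overlapping matches of the basic
 * regular expression pattern in line.
 *
 * Return the number of matches, or -1 if the pattern does not compile.
 */
int
count_matches(const char *pattern, const char *line)
{
    regex_t     re;
    regmatch_t  pm;
    const char *p = line;      /* Start of the text not yet searched. */
    int         eflags = 0;
    int         count = 0;

    if (regcomp(&re, pattern, 0) != 0)
        return -1;
    while (regexec(&re, p, 1, &pm, eflags) == 0) {
        count++;
        if (pm.rm_eo == 0) {
            /* Empty match at p: step one byte to guarantee progress. */
            if (*p == '\0')
                break;
            p++;
        } else {
            /* pm.rm_eo is relative to p, not to the start of line. */
            p += pm.rm_eo;
        }
        /* p no longer points at the beginning of the line. */
        eflags = REG_NOTBOL;
    }
    regfree(&re);
    return count;
}
```

       Because REG_NOTBOL is passed on every call after the first, a pattern
       such as "^a" is counted at most once per line.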

APPLICATION USAGE

       An application could use:

           regerror(code,preg,(char *)NULL,(size_t)0)

       to find out how big a buffer is needed for the generated string,
       malloc() a buffer to hold the string, and then call regerror() again to
       get the string. Alternatively, it could allocate a fixed, static buffer
       that is big enough to hold most strings, and then use malloc() to
       allocate a larger buffer if it finds that this is too small.
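
       The first approach, two calls to regerror() around a malloc(), can be
       sketched as follows. The helper name regerror_alloc() is hypothetical;
       the caller is assumed to free() the returned string.

```c
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return a newly allocated string holding the
 * message for errcode, or NULL if memory could not be allocated.
 * The caller releases the string with free().
 */
char *
regerror_alloc(int errcode, const regex_t *preg)
{
    /* With a size of 0 the buffer argument is ignored. */
    size_t needed = regerror(errcode, preg, (char *) NULL, (size_t) 0);
    char  *msg = malloc(needed);

    if (msg != NULL)
        (void) regerror(errcode, preg, msg, needed);
    return msg;
}
```

       The size returned by the first call already includes the terminating
       null byte, so no extra byte needs to be added before calling malloc().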

       To match a pattern as described in the Shell and Utilities volume of
       POSIX.1-2008, Section 2.13, Pattern Matching Notation, use the
       fnmatch() function.

RATIONALE

       The regexec() function must fill in all nmatch elements of pmatch,
       where nmatch and pmatch are supplied by the application, even if some
       elements of pmatch do not correspond to subexpressions in pattern. The
       application developer should note that there is probably no reason for
       using a value of nmatch that is larger than preg->re_nsub+1.

       The REG_NEWLINE flag supports a use of RE matching that is needed in
       some applications like text editors. In such applications, the user
       supplies an RE asking the application to find a line that matches the
       given expression. An anchor in such an RE anchors at the beginning or
       end of any line. Such an application can pass a sequence of
       <newline>-separated lines to regexec() as a single long string and
       specify REG_NEWLINE to regcomp() to get the desired behavior. The
       application must ensure that there are no explicit <newline> characters
       in pattern if it wants to ensure that any match occurs entirely within
       a single line.

       The REG_NEWLINE flag affects the behavior of regexec(), but it is in
       the cflags parameter to regcomp() to allow flexibility of
       implementation. Some implementations will want to generate the same
       compiled RE in regcomp() regardless of the setting of REG_NEWLINE and
       have regexec() handle anchors differently based on the setting of the
       flag. Other implementations will generate different compiled REs based
       on the setting of REG_NEWLINE.

       The REG_ICASE flag supports the operations taken by the grep -i option
       and the historical implementations of ex and vi. Including this flag
       will make it easier for application code to be written that does the
       same thing as these utilities.

       The substrings reported in pmatch[] are defined using offsets from the
       start of the string rather than pointers. This allows type-safe access
       to both constant and non-constant strings.

       The type regoff_t is used for the elements of pmatch[] to ensure that
       the application can represent large arrays in memory (important for an
       application conforming to the Shell and Utilities volume of
       POSIX.1-2008).

       The 1992 edition of this standard required regoff_t to be at least as
       wide as off_t, to facilitate future extensions in which the string to
       be searched is taken from a file. However, these future extensions have
       not appeared. The requirement rules out popular implementations with
       32-bit regoff_t and 64-bit off_t, so it has been removed.

       The standard developers rejected the inclusion of a regsub() function
       that would be used to do substitutions for a matched RE. While such a
       routine would be useful to some applications, its utility would be much
       more limited than the matching function described here. Both RE parsing
       and substitution are possible to implement without support other than
       that required by the ISO C standard, but matching is much more complex
       than substituting. The only difficult part of substitution, given the
       information supplied by regexec(), is finding the next character in a
       string when there can be multi-byte characters. That is a much larger
       issue, and one that needs a more general solution.

       The errno variable has not been used for error returns to avoid filling
       the errno name space for this feature.

       The interface is defined so that the matched substrings rm_so and rm_eo
       are in a separate regmatch_t structure instead of in regex_t. This
       allows a single compiled RE to be used simultaneously in several
       contexts; in main() and a signal handler, perhaps, or in multiple
       threads of lightweight processes. (The preg argument to regexec() is
       declared with type const, so the implementation is not permitted to use
       the structure to store intermediate results.) It also allows an
       application to request an arbitrary number of substrings from an RE.
       The number of subexpressions in the RE is reported in re_nsub in preg.
       With this change to regexec(), consideration was given to dropping the
       REG_NOSUB flag since the user can now specify this with a zero nmatch
       argument to regexec(). However, keeping REG_NOSUB allows an
       implementation to use a different (perhaps more efficient) algorithm if
       it knows in regcomp() that no subexpressions need be reported. The
       implementation is only required to fill in pmatch if nmatch is not zero
       and if REG_NOSUB is not specified. Note that the size_t type, as
       defined in the ISO C standard, is unsigned, so the description of
       regexec() does not need to address negative values of nmatch.

       REG_NOTBOL was added to allow an application to do repeated searches
       for the same pattern in a line. If the pattern contains a <circumflex>
       character that should match the beginning of a line, then the pattern
       should only match when matched against the beginning of the line.
       Without the REG_NOTBOL flag, the application could rewrite the
       expression for subsequent matches, but in the general case this would
       require parsing the expression. The need for REG_NOTEOL is not as
       clear; it was added for symmetry.

       The addition of the regerror() function addresses the historical need
       for conforming application programs to have access to error information
       more than "Function failed to compile/match your RE for unknown
       reasons".

       This interface provides for two different methods of dealing with error
       conditions. The specific error codes (REG_EBRACE, for example), defined
       in <regex.h>, allow an application to recover from an error if it is so
       able. Many applications, especially those that use patterns supplied by
       a user, will not try to deal with specific error cases, but will just
       use regerror() to obtain a human-readable error message to present to
       the user.

       The regerror() function uses a scheme similar to confstr() to deal with
       the problem of allocating memory to hold the generated string. The
       scheme used by strerror() in the ISO C standard was considered
       unacceptable since it creates difficulties for multi-threaded
       applications.

       The preg argument is provided to regerror() to allow an implementation
       to generate a more descriptive message than would be possible with
       errcode alone. An implementation might, for example, save the character
       offset of the offending character of the pattern in a field of preg,
       and then include that in the generated message string. The
       implementation may also ignore preg.

       A REG_FILENAME flag was considered, but omitted. This flag caused
       regexec() to match patterns as described in the Shell and Utilities
       volume of POSIX.1-2008, Section 2.13, Pattern Matching Notation instead
       of REs. This service is now provided by the fnmatch() function.

       Notice that there is a difference in philosophy between the
       ISO POSIX-2:1993 standard and POSIX.1-2008 in how to handle a "bad"
       regular expression. The ISO POSIX-2:1993 standard says that many bad
       constructs "produce undefined results", or that "the interpretation
       is undefined". POSIX.1-2008, however, says that the interpretation of
       such REs is unspecified. The term "undefined" means that the action
       by the application is an error, of similar severity to passing a bad
       pointer to a function.

       The regcomp() and regexec() functions are required to accept any
       null-terminated string as the pattern argument. If the meaning of the
       string is "undefined", the behavior of the function is "unspecified".
       POSIX.1-2008 does not specify how the functions will interpret the
       pattern; they might return error codes, or they might do pattern
       matching in some completely unexpected way, but they should not do
       something like abort the process.

FUTURE DIRECTIONS

       None.

SEE ALSO

       fnmatch(), glob()

       The Base Definitions volume of POSIX.1-2008, Chapter 9, Regular
       Expressions, <regex.h>, <sys/types.h>

       The Shell and Utilities volume of POSIX.1-2008, Section 2.13, Pattern
       Matching Notation

COPYRIGHT

       Portions of this text are reprinted and reproduced in electronic form
       from IEEE Std 1003.1, 2013 Edition, Standard for Information Technology
       -- Portable Operating System Interface (POSIX), The Open Group Base
       Specifications Issue 7, Copyright (C) 2013 by the Institute of
       Electrical and Electronics Engineers, Inc and The Open Group. (This is
       POSIX.1-2008 with the 2013 Technical Corrigendum 1 applied.) In the
       event of any discrepancy between this version and the original IEEE and
       The Open Group Standard, the original IEEE and The Open Group Standard
       is the referee document. The original Standard can be obtained online
       at http://www.unix.org/online.html .

       Any typographical or formatting errors that appear in this page are
       most likely to have been introduced during the conversion of the source
       files to man page format. To report such errors, see
       https://www.kernel.org/doc/man-pages/reporting_bugs.html .



IEEE/The Open Group                  2013                          REGCOMP(3P)