REGCOMP(3P)          POSIX Programmer's Manual          REGCOMP(3P)

PROLOG
This manual page is part of the POSIX Programmer's Manual. The Linux
implementation of this interface may differ (consult the corresponding
Linux manual page for details of Linux behavior), or the interface may
not be implemented on Linux.

NAME
regcomp, regerror, regexec, regfree - regular expression matching

SYNOPSIS
    #include <regex.h>

    int regcomp(regex_t *restrict preg, const char *restrict pattern,
                int cflags);
    size_t regerror(int errcode, const regex_t *restrict preg,
                    char *restrict errbuf, size_t errbuf_size);
    int regexec(const regex_t *restrict preg, const char *restrict string,
                size_t nmatch, regmatch_t pmatch[restrict], int eflags);
    void regfree(regex_t *preg);

DESCRIPTION
These functions interpret basic and extended regular expressions as
described in the Base Definitions volume of IEEE Std 1003.1-2001,
Chapter 9, Regular Expressions.

The regex_t structure is defined in <regex.h> and contains at least the
following member:

    Member Type   Member Name   Description
    size_t        re_nsub       Number of parenthesized subexpressions.

The regmatch_t structure is defined in <regex.h> and contains at least
the following members:

    Member Type   Member Name   Description
    regoff_t      rm_so         Byte offset from start of string to
                                start of substring.
    regoff_t      rm_eo         Byte offset from start of string of the
                                first character after the end of
                                substring.

The regcomp() function shall compile the regular expression contained
in the string pointed to by the pattern argument and place the results
in the structure pointed to by preg. The cflags argument is the
bitwise-inclusive OR of zero or more of the following flags, which are
defined in the <regex.h> header:

REG_EXTENDED
    Use Extended Regular Expressions.

REG_ICASE
    Ignore case in match. (See the Base Definitions volume of
    IEEE Std 1003.1-2001, Chapter 9, Regular Expressions.)

REG_NOSUB
    Report only success/fail in regexec().

REG_NEWLINE
    Change the handling of <newline>s, as described in the text.

The default regular expression type for pattern is a Basic Regular
Expression. The application can specify Extended Regular Expressions
using the REG_EXTENDED cflags flag.

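As an illustration of the difference between the two expression types,
the following sketch compiles the same pattern text with and without
REG_EXTENDED. The helper name matches() is hypothetical and not part of
this interface. It relies on '+' being an ordinary character in a Basic
Regular Expression and a repetition operator in an Extended Regular
Expression.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: compile pattern with the given cflags and test
 * it against string.
 * Returns 1 on match, 0 on no match, -1 if the pattern does not compile.
 */
static int matches(const char *pattern, const char *string, int cflags)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: only success/failure is wanted here. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, string, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this helper, matches("ab+c", "abbc", REG_EXTENDED) should report a
match, while matches("ab+c", "abbc", 0) should not, because the basic
expression looks for the literal text "ab+c".
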
If the REG_NOSUB flag was not set in cflags, then regcomp() shall set
re_nsub to the number of parenthesized subexpressions (delimited by
"\(\)" in basic regular expressions or "()" in extended regular
expressions) found in pattern.

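A minimal sketch of reading re_nsub after compilation follows. The
function name count_groups() is an assumption made for this example.
REG_NOSUB is deliberately not set, since re_nsub is only required to be
filled in without it.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the number of parenthesized
 * subexpressions in pattern, or (size_t)-1 if it does not compile.
 * cflags must not include REG_NOSUB.
 */
static size_t count_groups(const char *pattern, int cflags)
{
    regex_t re;
    size_t n;

    if (regcomp(&re, pattern, cflags) != 0)
        return (size_t)-1;
    n = re.re_nsub;     /* Set by regcomp() on success. */
    regfree(&re);
    return n;
}
```

For instance, count_groups("(a)(b(c))", REG_EXTENDED) should yield 3,
and count_groups("(a)", 0) should yield 0, because unescaped
parentheses are ordinary characters in a basic regular expression.
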
The regexec() function compares the null-terminated string specified by
string with the compiled regular expression preg initialized by a
previous call to regcomp(). If it finds a match, regexec() shall return
0; otherwise, it shall return non-zero indicating either no match or an
error. The eflags argument is the bitwise-inclusive OR of zero or more
of the following flags, which are defined in the <regex.h> header:

REG_NOTBOL
    The first character of the string pointed to by string is not
    the beginning of the line. Therefore, the circumflex character
    ('^'), when taken as a special character, shall not match the
    beginning of string.

REG_NOTEOL
    The last character of the string pointed to by string is not the
    end of the line. Therefore, the dollar sign ('$'), when taken
    as a special character, shall not match the end of string.

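The effect of the two eflags values can be sketched as follows. The
helper name match_eflags() is hypothetical; it compiles a basic regular
expression and passes the caller's eflags straight to regexec().

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: match string against the BRE pattern using the
 * given eflags.
 * Returns 1 on match, 0 on no match, -1 if the pattern does not compile.
 */
static int match_eflags(const char *pattern, const char *string, int eflags)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, string, 0, NULL, eflags);
    regfree(&re);
    return rc == 0;
}
```

Here match_eflags("^a", "abc", 0) should succeed, whereas the same call
with REG_NOTBOL should fail, since '^' may no longer match at the start
of string. REG_NOTEOL has the corresponding effect on "c$".
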
If nmatch is 0 or REG_NOSUB was set in the cflags argument to
regcomp(), then regexec() shall ignore the pmatch argument. Otherwise,
the application shall ensure that the pmatch argument points to an array
with at least nmatch elements, and regexec() shall fill in the elements
of that array with offsets of the substrings of string that correspond
to the parenthesized subexpressions of pattern: pmatch[i].rm_so shall
be the byte offset of the beginning and pmatch[i].rm_eo shall be one
greater than the byte offset of the end of substring i. (Subexpression
i begins at the ith matched open parenthesis, counting from 1.) Offsets
in pmatch[0] identify the substring that corresponds to the entire
regular expression. Unused elements of pmatch up to pmatch[nmatch-1]
shall be filled with -1. If there are more than nmatch subexpressions
in pattern (pattern itself counts as a subexpression), then regexec()
shall still do the match, but shall record only the first nmatch
substrings.

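The following sketch shows one way to turn the offsets in pmatch into
substrings. The function name capture() and the fixed array size of 10
are assumptions chosen for brevity; they are not part of this
interface.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: match string against the ERE pattern and copy
 * the text of subexpression idx (0 = whole match) into out, truncating
 * to outsz - 1 bytes.
 * Returns 0 on success, -1 on no match, on error, or if the
 * subexpression did not participate in the match.
 */
static int capture(const char *pattern, const char *string, size_t idx,
                   char *out, size_t outsz)
{
    regex_t re;
    regmatch_t pm[10];      /* assumption: idx is at most 9 */
    size_t len;
    int rc;

    if (idx >= 10 || outsz == 0)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, string, 10, pm, 0);
    regfree(&re);
    if (rc != 0 || pm[idx].rm_so == -1)
        return -1;      /* No match, or element filled with -1. */

    /* rm_eo is one past the last byte, so the length is rm_eo - rm_so. */
    len = (size_t)(pm[idx].rm_eo - pm[idx].rm_so);
    if (len >= outsz)
        len = outsz - 1;
    memcpy(out, string + pm[idx].rm_so, len);
    out[len] = '\0';
    return 0;
}
```

Matching "([a-z]+)=([0-9]+)" against "x key=42 y" with this helper
should give "key=42" for index 0, "key" for index 1, and "42" for
index 2. Index 3 fails because that element is unused and holds -1.
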
When matching a basic or extended regular expression, any given
parenthesized subexpression of pattern might participate in the match of
several different substrings of string, or it might not match any
substring even though the pattern as a whole did match. The following
rules shall be used to determine which substrings to report in pmatch
when matching regular expressions:

1. If subexpression i in a regular expression is not contained within
   another subexpression, and it participated in the match several
   times, then the byte offsets in pmatch[i] shall delimit the last
   such match.

2. If subexpression i is not contained within another subexpression,
   and it did not participate in an otherwise successful match, the
   byte offsets in pmatch[i] shall be -1. A subexpression does not
   participate in the match when: '*' or "\{\}" appears immediately
   after the subexpression in a basic regular expression, or '*', '?',
   or "{}" appears immediately after the subexpression in an extended
   regular expression, and the subexpression did not match (matched 0
   times); or: '|' is used in an extended regular expression to select
   this subexpression or another, and the other subexpression matched.

3. If subexpression i is contained within another subexpression j, and
   i is not contained within any other subexpression that is contained
   within j, and a match of subexpression j is reported in pmatch[j],
   then the match or non-match of subexpression i reported in
   pmatch[i] shall be as described in 1. and 2. above, but within the
   substring reported in pmatch[j] rather than the whole string. The
   offsets in pmatch[i] are still relative to the start of string.

4. If subexpression i is contained in subexpression j, and the byte
   offsets in pmatch[j] are -1, then the byte offsets in pmatch[i]
   shall also be -1.

5. If subexpression i matched a zero-length string, then both byte
   offsets in pmatch[i] shall be the byte offset of the character or
   null terminator immediately following the zero-length string.

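Rules 1, 2, and 5 can be observed with a small probe such as the sketch
below. The helper name group_span() and the array size of 10 are
assumptions for this example. Offsets are returned as long purely for
convenience.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: match string against the ERE pattern and report
 * the offsets recorded for subexpression idx (0 = whole match).
 * Returns 0 on a match, -1 on no match, compile failure, or idx > 9.
 */
static int group_span(const char *pattern, const char *string, size_t idx,
                      long *so, long *eo)
{
    regex_t re;
    regmatch_t pm[10];      /* assumption: idx is at most 9 */
    int rc;

    if (idx >= 10 || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, string, 10, pm, 0);
    regfree(&re);
    if (rc != 0)
        return -1;
    *so = (long)pm[idx].rm_so;  /* -1 if the subexpression did not participate */
    *eo = (long)pm[idx].rm_eo;
    return 0;
}
```

Under rule 1, "(a|b)*" matched against "ab" should report offsets 1 and
2 for subexpression 1 (the last iteration). Under rule 2, "(a)|(b)"
matched against "b" should report -1 for subexpression 1. Under rule 5,
"(a*)b" matched against "b" should report 0 and 0 for subexpression 1.
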
If, when regexec() is called, the locale is different from when the
regular expression was compiled, the result is undefined.

If REG_NEWLINE is not set in cflags, then a <newline> in pattern or
string shall be treated as an ordinary character. If REG_NEWLINE is
set, then <newline> shall be treated as an ordinary character except as
follows:

1. A <newline> in string shall not be matched by a period outside a
   bracket expression or by any form of a non-matching list (see the
   Base Definitions volume of IEEE Std 1003.1-2001, Chapter 9, Regular
   Expressions).

2. A circumflex ('^') in pattern, when used to specify expression
   anchoring (see the Base Definitions volume of IEEE Std 1003.1-2001,
   Section 9.3.8, BRE Expression Anchoring), shall match the
   zero-length string immediately after a <newline> in string,
   regardless of the setting of REG_NOTBOL.

3. A dollar sign ('$') in pattern, when used to specify expression
   anchoring, shall match the zero-length string immediately before a
   <newline> in string, regardless of the setting of REG_NOTEOL.

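A short sketch of the REG_NEWLINE rules follows. The helper name
first_match_offset() is hypothetical. It returns the byte offset of the
first match so that the effect on anchoring is visible.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the byte offset in text at which the
 * first match of the BRE pattern starts, or -1 if there is no match
 * or the pattern does not compile.
 * cflags must not include REG_NOSUB, since an offset is requested.
 */
static long first_match_offset(const char *pattern, const char *text,
                               int cflags)
{
    regex_t re;
    regmatch_t pm;
    int rc;

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    rc = regexec(&re, text, 1, &pm, 0);
    regfree(&re);
    return rc == 0 ? (long)pm.rm_so : -1;
}
```

With the text "foo\nbar\nbaz", the pattern "^bar$" should match at
offset 4 when REG_NEWLINE is set and should not match without it.
Conversely, "foo.bar" should match "foo\nbar" only when REG_NEWLINE is
not set, because the period then may match the <newline>.
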
The regfree() function frees any memory allocated by regcomp()
associated with preg.

The following constants are defined as error return values:

REG_NOMATCH
    regexec() failed to match.

REG_BADPAT
    Invalid regular expression.

REG_ECOLLATE
    Invalid collating element referenced.

REG_ECTYPE
    Invalid character class type referenced.

REG_EESCAPE
    Trailing '\' in pattern.

REG_ESUBREG
    Number in "\digit" invalid or in error.

REG_EBRACK
    "[]" imbalance.

REG_EPAREN
    "\(\)" or "()" imbalance.

REG_EBRACE
    "\{\}" imbalance.

REG_BADBR
    Content of "\{\}" invalid: not a number, number too large, more
    than two numbers, first larger than second.

REG_ERANGE
    Invalid endpoint in range expression.

REG_ESPACE
    Out of memory.

REG_BADRPT
    '?', '*', or '+' not preceded by valid regular expression.

The regerror() function provides a mapping from error codes returned by
regcomp() and regexec() to unspecified printable strings. It generates
a string corresponding to the value of the errcode argument, which the
application shall ensure is the last non-zero value returned by
regcomp() or regexec() with the given value of preg. If errcode is not
such a value, the content of the generated string is unspecified.

If preg is a null pointer, but errcode is a value returned by a
previous call to regexec() or regcomp(), regerror() still generates an
error string corresponding to the value of errcode, but it might not be
as detailed under some implementations.

If the errbuf_size argument is not 0, regerror() shall place the
generated string into the buffer of size errbuf_size bytes pointed to by
errbuf. If the string (including the terminating null) cannot fit in
the buffer, regerror() shall truncate the string and null-terminate the
result.

If errbuf_size is 0, regerror() shall ignore the errbuf argument, and
return the size of the buffer needed to hold the generated string.

If the preg argument to regexec() or regfree() is not a compiled
regular expression returned by regcomp(), the result is undefined. A
preg is no longer treated as a compiled regular expression after it is
given to regfree().

RETURN VALUE
Upon successful completion, the regcomp() function shall return 0.
Otherwise, it shall return an integer value indicating an error as
described in <regex.h>, and the content of preg is undefined. If a code
is returned, the interpretation shall be as given in <regex.h>.

If regcomp() detects an invalid RE, it may return REG_BADPAT, or it may
return one of the error codes that more precisely describes the error.

Upon successful completion, the regexec() function shall return 0.
Otherwise, it shall return REG_NOMATCH to indicate no match.

Upon successful completion, the regerror() function shall return the
number of bytes needed to hold the entire generated string, including
the null termination. If the return value is greater than errbuf_size,
the string returned in the buffer pointed to by errbuf has been
truncated.

The regfree() function shall not return a value.

ERRORS
No errors are defined.

The following sections are informative.

EXAMPLES
    #include <regex.h>
    #include <stddef.h>

    /*
     * Match string against the extended regular expression in
     * pattern, treating errors as no match.
     *
     * Return 1 for match, 0 for no match.
     */
    int
    match(const char *string, const char *pattern)
    {
        int status;
        regex_t re;

        if (regcomp(&re, pattern, REG_EXTENDED|REG_NOSUB) != 0) {
            return(0);      /* Pattern did not compile: report no match. */
        }
        status = regexec(&re, string, (size_t) 0, NULL, 0);
        regfree(&re);
        if (status != 0) {
            return(0);      /* No match or error: report no match. */
        }
        return(1);
    }

The following demonstrates how the REG_NOTBOL flag could be used with
regexec() to find all substrings in a line that match a pattern
supplied by a user. (For simplicity of the example, very little error
checking is done.)

    size_t off = 0;     /* Offset in buffer where the current search starts. */

    (void) regcomp(&re, pattern, 0);
    /* This call to regexec() finds the first match on the line. */
    error = regexec(&re, &buffer[0], 1, &pm, 0);
    while (error == 0) {    /* While matches found. */
        /* Substring found between buffer + off + pm.rm_so and
           buffer + off + pm.rm_eo; the offsets are relative to the
           string passed to regexec(). An empty match would need off
           advanced by one more byte to avoid looping forever. */
        off += pm.rm_eo;
        /* This call to regexec() finds the next match. */
        error = regexec(&re, buffer + off, 1, &pm, REG_NOTBOL);
    }

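A self-contained variant of the repeated search might look like the
sketch below. The function name count_matches() is hypothetical. It
keeps a running offset because the values in pm are relative to the
string passed to each regexec() call, and it steps over empty matches
one byte at a time, which is an assumption that is only adequate for
single-byte character encodings.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the non-overlapping matches of the BRE
 * pattern in line. Returns -1 if the pattern does not compile.
 */
static int count_matches(const char *pattern, const char *line)
{
    regex_t re;
    regmatch_t pm;
    size_t off = 0;     /* Start of the current search within line. */
    int eflags = 0;     /* First search starts at the beginning of line. */
    int count = 0;

    if (regcomp(&re, pattern, 0) != 0)
        return -1;
    while (regexec(&re, line + off, 1, &pm, eflags) == 0) {
        count++;
        if (pm.rm_eo == pm.rm_so) {
            /* Empty match: stop at end of line, else skip one byte. */
            if (line[off + (size_t)pm.rm_eo] == '\0')
                break;
            off += (size_t)pm.rm_eo + 1;
        } else {
            off += (size_t)pm.rm_eo;
        }
        /* Later searches no longer start at the beginning of the line. */
        eflags = REG_NOTBOL;
    }
    regfree(&re);
    return count;
}
```

For example, count_matches("^ab", "abab") should return 1: without
REG_NOTBOL on the second search, the trailing "ab" would wrongly be
counted as a match at the beginning of a line.
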
APPLICATION USAGE
An application could use:

    regerror(code, preg, (char *)NULL, (size_t)0)

to find out how big a buffer is needed for the generated string,
malloc() a buffer to hold the string, and then call regerror() again to
get the string. Alternatively, it could allocate a fixed, static buffer
that is big enough to hold most strings, and then use malloc() to
allocate a larger buffer if it finds that this is too small.

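The two-call approach described above can be sketched as follows. The
function name regex_strerror() is an assumption made for this example.
The caller owns the returned buffer and releases it with free().

```c
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return a newly allocated copy of the message
 * for code, or NULL if memory cannot be allocated.
 */
static char *regex_strerror(int code, const regex_t *preg)
{
    size_t need;
    char *msg;

    /* First call: errbuf_size 0 asks only for the required size. */
    need = regerror(code, preg, (char *)NULL, (size_t)0);
    if (need == 0)
        need = 1;       /* Defensive: always leave room for the null. */
    msg = malloc(need);
    if (msg == NULL)
        return NULL;
    msg[0] = '\0';
    /* Second call: the buffer is large enough, so nothing is truncated. */
    (void) regerror(code, preg, msg, need);
    return msg;
}
```

A typical use is to call regex_strerror(rc, &re) right after regcomp()
returns a non-zero rc, print the message, and then free() it.
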
To match a pattern as described in the Shell and Utilities volume of
IEEE Std 1003.1-2001, Section 2.13, Pattern Matching Notation, use the
fnmatch() function.

RATIONALE
The regexec() function must fill in all nmatch elements of pmatch,
where nmatch and pmatch are supplied by the application, even if some
elements of pmatch do not correspond to subexpressions in pattern. The
application writer should note that there is probably no reason for
using a value of nmatch that is larger than preg->re_nsub+1.

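One way to follow this advice is to size the pmatch array from re_nsub
after compilation, as in the sketch below. The function name
count_participating() is hypothetical and exists only to give the
allocated array an observable use.

```c
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper: match string against the ERE pattern and count
 * the subexpressions that participated in the match.
 * Returns -1 on no match, compile failure, or allocation failure.
 */
static int count_participating(const char *pattern, const char *string)
{
    regex_t re;
    regmatch_t *pm;
    size_t n, i;
    int hits = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    n = re.re_nsub + 1;     /* One element per subexpression, plus pmatch[0]. */
    pm = malloc(n * sizeof *pm);
    if (pm == NULL) {
        regfree(&re);
        return -1;
    }
    if (regexec(&re, string, n, pm, 0) != 0) {
        hits = -1;
    } else {
        for (i = 1; i < n; i++) {
            if (pm[i].rm_so != -1)
                hits++;
        }
    }
    free(pm);
    regfree(&re);
    return hits;
}
```

With "(a)|(b)|(c)" matched against "b", only one of the three
subexpressions participates, so the function should return 1.
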
The REG_NEWLINE flag supports a use of RE matching that is needed in
some applications like text editors. In such applications, the user
supplies an RE asking the application to find a line that matches the
given expression. An anchor in such an RE anchors at the beginning or
end of any line. Such an application can pass a sequence of
<newline>-separated lines to regexec() as a single long string and
specify REG_NEWLINE to regcomp() to get the desired behavior. The
application must ensure that there are no explicit <newline>s in pattern
if it wants to ensure that any match occurs entirely within a single
line.

The REG_NEWLINE flag affects the behavior of regexec(), but it is in
the cflags parameter to regcomp() to allow flexibility of
implementation. Some implementations will want to generate the same
compiled RE in regcomp() regardless of the setting of REG_NEWLINE and
have regexec() handle anchors differently based on the setting of the
flag. Other implementations will generate different compiled REs based
on the REG_NEWLINE.

The REG_ICASE flag supports the operations taken by the grep -i option
and the historical implementations of ex and vi. Including this flag
will make it easier for application code to be written that does the
same thing as these utilities.

The substrings reported in pmatch[] are defined using offsets from the
start of the string rather than pointers. Since this is a new
interface, there should be no impact on historical implementations or
applications, and offsets should be just as easy to use as pointers.
The change to offsets was made to facilitate future extensions in which
the string to be searched is presented to regexec() in blocks, allowing
a string to be searched that is not all in memory at once.

The type regoff_t is used for the elements of pmatch[] to ensure that
the application can represent either the largest possible array in
memory (important for an application conforming to the Shell and
Utilities volume of IEEE Std 1003.1-2001) or the largest possible file
(important for an application using the extension where a file is
searched in chunks).

The standard developers rejected the inclusion of a regsub() function
that would be used to do substitutions for a matched RE. While such a
routine would be useful to some applications, its utility would be much
more limited than the matching function described here. Both RE parsing
and substitution are possible to implement without support other than
that required by the ISO C standard, but matching is much more complex
than substituting. The only difficult part of substitution, given the
information supplied by regexec(), is finding the next character in a
string when there can be multi-byte characters. That is a much larger
issue, and one that needs a more general solution.

The errno variable has not been used for error returns to avoid filling
the errno name space for this feature.

The interface is defined so that the matched substrings rm_so and rm_eo
are in a separate regmatch_t structure instead of in regex_t. This
allows a single compiled RE to be used simultaneously in several
contexts; in main() and a signal handler, perhaps, or in multiple
threads of lightweight processes. (The preg argument to regexec() is
declared with type const, so the implementation is not permitted to use
the structure to store intermediate results.) It also allows an
application to request an arbitrary number of substrings from an RE.
The number of subexpressions in the RE is reported in re_nsub in preg.
With this change to regexec(), consideration was given to dropping the
REG_NOSUB flag since the user can now specify this with a zero nmatch
argument to regexec(). However, keeping REG_NOSUB allows an
implementation to use a different (perhaps more efficient) algorithm if
it knows in regcomp() that no subexpressions need be reported. The
implementation is only required to fill in pmatch if nmatch is not zero
and if REG_NOSUB is not specified. Note that the size_t type, as
defined in the ISO C standard, is unsigned, so the description of
regexec() does not need to address negative values of nmatch.

REG_NOTBOL was added to allow an application to do repeated searches
for the same pattern in a line. If the pattern contains a circumflex
character that should match the beginning of a line, then the pattern
should only match when matched against the beginning of the line.
Without the REG_NOTBOL flag, the application could rewrite the
expression for subsequent matches, but in the general case this would
require parsing the expression. The need for REG_NOTEOL is not as
clear; it was added for symmetry.

The addition of the regerror() function addresses the historical need
for conforming application programs to have access to error information
more than "Function failed to compile/match your RE for unknown
reasons".

This interface provides for two different methods of dealing with error
conditions. The specific error codes (REG_EBRACE, for example), defined
in <regex.h>, allow an application to recover from an error if it is so
able. Many applications, especially those that use patterns supplied by
a user, will not try to deal with specific error cases, but will just
use regerror() to obtain a human-readable error message to present to
the user.

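Both methods can be combined in one routine, as in the following
sketch. The function name compile_or_explain() and the choice to
special-case REG_ESPACE are assumptions made for illustration. An
application may care about different codes, and an implementation may
return REG_BADPAT where a more specific code is expected.

```c
#include <regex.h>
#include <stdio.h>

/*
 * Hypothetical helper: compile the ERE pattern into *re. On failure,
 * place a diagnostic in msg (of size msgsz) and return the error code
 * from regcomp(); on success, set msg to the empty string and return 0.
 */
static int compile_or_explain(regex_t *re, const char *pattern,
                              char *msg, size_t msgsz)
{
    int rc = regcomp(re, pattern, REG_EXTENDED);

    if (rc == 0) {
        if (msgsz > 0)
            msg[0] = '\0';
        return 0;
    }
    if (rc == REG_ESPACE) {
        /* Method 1: act on a specific code the caller might retry. */
        (void) snprintf(msg, msgsz, "out of memory compiling pattern");
    } else {
        /* Method 2: let regerror() produce a message for the user. */
        (void) regerror(rc, re, msg, msgsz);
    }
    return rc;
}
```

On success the caller remains responsible for calling regfree() on the
compiled expression once it is no longer needed.
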
The regerror() function uses a scheme similar to confstr() to deal with
the problem of allocating memory to hold the generated string. The
scheme used by strerror() in the ISO C standard was considered
unacceptable since it creates difficulties for multi-threaded
applications.

The preg argument is provided to regerror() to allow an implementation
to generate a more descriptive message than would be possible with
errcode alone. An implementation might, for example, save the character
offset of the offending character of the pattern in a field of preg,
and then include that in the generated message string. The
implementation may also ignore preg.

A REG_FILENAME flag was considered, but omitted. This flag caused
regexec() to match patterns as described in the Shell and Utilities
volume of IEEE Std 1003.1-2001, Section 2.13, Pattern Matching Notation
instead of REs. This service is now provided by the fnmatch() function.

Notice that there is a difference in philosophy between the
ISO POSIX-2:1993 standard and IEEE Std 1003.1-2001 in how to handle a
"bad" regular expression. The ISO POSIX-2:1993 standard says that many
bad constructs "produce undefined results", or that "the interpretation
is undefined". IEEE Std 1003.1-2001, however, says that the
interpretation of such REs is unspecified. The term "undefined" means
that the action by the application is an error, of similar severity to
passing a bad pointer to a function.

The regcomp() and regexec() functions are required to accept any
null-terminated string as the pattern argument. If the meaning of the
string is "undefined", the behavior of the function is "unspecified".
IEEE Std 1003.1-2001 does not specify how the functions will interpret
the pattern; they might return error codes, or they might do pattern
matching in some completely unexpected way, but they should not do
something like abort the process.

FUTURE DIRECTIONS
None.

SEE ALSO
fnmatch(), glob(), Shell and Utilities volume of IEEE Std 1003.1-2001,
Section 2.13, Pattern Matching Notation, Base Definitions volume of
IEEE Std 1003.1-2001, Chapter 9, Regular Expressions, <regex.h>,
<sys/types.h>

COPYRIGHT
Portions of this text are reprinted and reproduced in electronic form
from IEEE Std 1003.1, 2003 Edition, Standard for Information Technology
-- Portable Operating System Interface (POSIX), The Open Group Base
Specifications Issue 6, Copyright (C) 2001-2003 by the Institute of
Electrical and Electronics Engineers, Inc and The Open Group. In the
event of any discrepancy between this version and the original IEEE and
The Open Group Standard, the original IEEE and The Open Group Standard
is the referee document. The original Standard can be obtained online
at http://www.opengroup.org/unix/online.html .

IEEE/The Open Group                2003                REGCOMP(3P)