REGCOMP(P)                POSIX Programmer's Manual               REGCOMP(P)

regcomp, regerror, regexec, regfree - regular expression matching

#include <regex.h>

int regcomp(regex_t *restrict preg, const char *restrict pattern,
       int cflags);
size_t regerror(int errcode, const regex_t *restrict preg,
       char *restrict errbuf, size_t errbuf_size);
int regexec(const regex_t *restrict preg, const char *restrict string,
       size_t nmatch, regmatch_t pmatch[restrict], int eflags);
void regfree(regex_t *preg);

21 These functions interpret basic and extended regular expressions as
22 described in the Base Definitions volume of IEEE Std 1003.1-2001, Chap‐
23 ter 9, Regular Expressions.
24
The regex_t structure is defined in <regex.h> and contains at least the
following member:

Member Type    Member Name    Description
size_t         re_nsub        Number of parenthesized subexpressions.

The regmatch_t structure is defined in <regex.h> and contains at least
the following members:

Member Type    Member Name    Description
regoff_t       rm_so          Byte offset from start of string to
                              start of substring.
regoff_t       rm_eo          Byte offset from start of string of the
                              first character after the end of
                              substring.
40
41 The regcomp() function shall compile the regular expression contained
42 in the string pointed to by the pattern argument and place the results
43 in the structure pointed to by preg. The cflags argument is the bit‐
44 wise-inclusive OR of zero or more of the following flags, which are
45 defined in the <regex.h> header:
46
47 REG_EXTENDED
48 Use Extended Regular Expressions.
49
50 REG_ICASE
51 Ignore case in match. (See the Base Definitions volume of
52 IEEE Std 1003.1-2001, Chapter 9, Regular Expressions.)
53
54 REG_NOSUB
55 Report only success/fail in regexec().
56
57 REG_NEWLINE
58 Change the handling of <newline>s, as described in the text.
59
60
61 The default regular expression type for pattern is a Basic Regular
62 Expression. The application can specify Extended Regular Expressions
63 using the REG_EXTENDED cflags flag.
64
65 If the REG_NOSUB flag was not set in cflags, then regcomp() shall set
66 re_nsub to the number of parenthesized subexpressions (delimited by
67 "\(\)" in basic regular expressions or "()" in extended regular expres‐
68 sions) found in pattern.
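
As a minimal sketch of how cflags and re_nsub interact, the helper below (count_subexpressions() is a hypothetical name, not part of <regex.h>) compiles a pattern and reports re_nsub. It deliberately avoids REG_NOSUB, since re_nsub is only required to be set when that flag is absent.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: compile pattern with the given cflags and
 * return re_nsub, or (size_t)-1 if regcomp() reports an error.
 * Callers should not pass REG_NOSUB here.
 */
size_t
count_subexpressions(const char *pattern, int cflags)
{
    regex_t re;
    size_t nsub;

    if (regcomp(&re, pattern, cflags) != 0)
        return (size_t)-1;
    nsub = re.re_nsub;      /* Read before regfree() invalidates re. */
    regfree(&re);
    return nsub;
}
```

With REG_EXTENDED the delimiters are "()"; with cflags of 0 the pattern is a basic regular expression, so unescaped parentheses are ordinary characters and only "\(\)" pairs are counted.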
69
70 The regexec() function compares the null-terminated string specified by
71 string with the compiled regular expression preg initialized by a pre‐
72 vious call to regcomp(). If it finds a match, regexec() shall return
73 0; otherwise, it shall return non-zero indicating either no match or an
74 error. The eflags argument is the bitwise-inclusive OR of zero or more
75 of the following flags, which are defined in the <regex.h> header:
76
REG_NOTBOL
    The first character of the string pointed to by string is not
    the beginning of the line. Therefore, the circumflex character
    ('^'), when taken as a special character, shall not match the
    beginning of string.

REG_NOTEOL
    The last character of the string pointed to by string is not the
    end of the line. Therefore, the dollar sign ('$'), when taken
    as a special character, shall not match the end of string.
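
The effect of the two eflags values can be illustrated with a small sketch. The helper name matches_with_eflags() is an assumption made for this example; it compiles a basic regular expression with REG_NOSUB and reports only whether regexec() succeeded under the given eflags.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if the basic regular expression in
 * pattern matches string when regexec() is given eflags, else 0.
 * Compile errors are treated as "no match".
 */
int
matches_with_eflags(const char *pattern, const char *string, int eflags)
{
    regex_t re;
    int status;

    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return 0;
    status = regexec(&re, string, 0, NULL, eflags);
    regfree(&re);
    return status == 0;
}
```

For instance, "^abc" matches "abc" with eflags of 0 but not with REG_NOTBOL, and "abc$" stops matching "abc" once REG_NOTEOL is set. An unanchored pattern is unaffected by either flag.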
87
88
If nmatch is 0 or REG_NOSUB was set in the cflags argument to
regcomp(), then regexec() shall ignore the pmatch argument. Otherwise, the
application shall ensure that the pmatch argument points to an array
with at least nmatch elements, and regexec() shall fill in the elements
of that array with offsets of the substrings of string that correspond
to the parenthesized subexpressions of pattern: pmatch[i].rm_so shall
be the byte offset of the beginning and pmatch[i].rm_eo shall be one
greater than the byte offset of the end of substring i. (Subexpression
i begins at the ith matched open parenthesis, counting from 1.) Offsets
in pmatch[0] identify the substring that corresponds to the entire
regular expression. Unused elements of pmatch up to pmatch[nmatch-1]
shall be filled with -1. If there are more than nmatch subexpressions
in pattern (pattern itself counts as a subexpression), then regexec()
shall still do the match, but shall record only the first nmatch
substrings.
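
The offsets in pmatch can be turned into substrings as in the following sketch. The helper copy_submatch() and its fixed bound of 10 pmatch elements are assumptions made for illustration. It copies the text of subexpression idx (0 for the whole match) out of string.

```c
#include <regex.h>
#include <string.h>

#define MAX_SUBMATCH 10     /* Assumed bound for this sketch only. */

/*
 * Hypothetical helper: match the extended regular expression in
 * pattern against string and copy the text of subexpression idx
 * into out. Returns the number of bytes copied (excluding the
 * terminating null), or -1 on error, no match, a subexpression
 * that did not participate, or a buffer that is too small.
 */
int
copy_submatch(const char *pattern, const char *string, size_t idx,
    char *out, size_t outsize)
{
    regex_t re;
    regmatch_t pm[MAX_SUBMATCH];
    size_t len;
    int rc = -1;

    if (idx >= MAX_SUBMATCH || outsize == 0)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, string, MAX_SUBMATCH, pm, 0) == 0 &&
        pm[idx].rm_so != -1) {
        /* rm_eo is one past the last byte, so the difference is the length. */
        len = (size_t)(pm[idx].rm_eo - pm[idx].rm_so);
        if (len < outsize) {
            memcpy(out, string + pm[idx].rm_so, len);
            out[len] = '\0';
            rc = (int)len;
        }
    }
    regfree(&re);
    return rc;
}
```

Because unused elements are filled with -1, asking for an index beyond the subexpressions in pattern fails cleanly in this sketch rather than reading stale offsets.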
104
105 When matching a basic or extended regular expression, any given paren‐
106 thesized subexpression of pattern might participate in the match of
107 several different substrings of string, or it might not match any sub‐
108 string even though the pattern as a whole did match. The following
109 rules shall be used to determine which substrings to report in pmatch
110 when matching regular expressions:
111
1. If subexpression i in a regular expression is not contained within
   another subexpression, and it participated in the match several
   times, then the byte offsets in pmatch[i] shall delimit the last
   such match.

2. If subexpression i is not contained within another subexpression,
   and it did not participate in an otherwise successful match, the
   byte offsets in pmatch[i] shall be -1. A subexpression does not
   participate in the match when:

   '*' or "\{\}" appears immediately after the subexpression in a
   basic regular expression, or '*', '?', or "{}" appears immediately
   after the subexpression in an extended regular expression, and the
   subexpression did not match (matched 0 times)

   or:

   '|' is used in an extended regular expression to select this
   subexpression or another, and the other subexpression matched.

3. If subexpression i is contained within another subexpression j, and
   i is not contained within any other subexpression that is contained
   within j, and a match of subexpression j is reported in pmatch[j],
   then the match or non-match of subexpression i reported in
   pmatch[i] shall be as described in 1. and 2. above, but within the
   substring reported in pmatch[j] rather than the whole string. The
   offsets in pmatch[i] are still relative to the start of string.

4. If subexpression i is contained in subexpression j, and the byte
   offsets in pmatch[j] are -1, then the byte offsets in pmatch[i]
   shall also be -1.

5. If subexpression i matched a zero-length string, then both byte
   offsets in pmatch[i] shall be the byte offset of the character or
   null terminator immediately following the zero-length string.
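
Rules 1 and 2 can be observed with a short sketch. The helper submatch_offsets() is a hypothetical name, and the four-element pmatch array is an assumption that suits only small patterns. It reports the raw rm_so and rm_eo values for one subexpression.

```c
#include <regex.h>

/*
 * Hypothetical helper: run the extended regular expression in
 * pattern against string and store the offsets of subexpression
 * idx (0..3) in *so and *eo. Returns the regexec() status, or -1
 * if idx is out of range or regcomp() fails.
 */
int
submatch_offsets(const char *pattern, const char *string, size_t idx,
    long *so, long *eo)
{
    regex_t re;
    regmatch_t pm[4];
    int status;

    if (idx >= 4 || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    status = regexec(&re, string, 4, pm, 0);
    if (status == 0) {
        *so = (long)pm[idx].rm_so;
        *eo = (long)pm[idx].rm_eo;
    }
    regfree(&re);
    return status;
}
```

Matching "(a)|(b)" against "b" leaves subexpression 1 at -1 because the other alternative matched (rule 2), while "(a|b)+" against "ab" reports the last iteration, the "b" at offsets 1 to 2 (rule 1).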
144
145 If, when regexec() is called, the locale is different from when the
146 regular expression was compiled, the result is undefined.
147
148 If REG_NEWLINE is not set in cflags, then a <newline> in pattern or
149 string shall be treated as an ordinary character. If REG_NEWLINE is
150 set, then <newline> shall be treated as an ordinary character except as
151 follows:
152
153 1. A <newline> in string shall not be matched by a period outside a
154 bracket expression or by any form of a non-matching list (see the
155 Base Definitions volume of IEEE Std 1003.1-2001, Chapter 9, Regular
156 Expressions).
157
158 2. A circumflex ( '^' ) in pattern, when used to specify expression
159 anchoring (see the Base Definitions volume of IEEE Std 1003.1-2001,
160 Section 9.3.8, BRE Expression Anchoring), shall match the zero-
161 length string immediately after a <newline> in string, regardless
162 of the setting of REG_NOTBOL.
163
164 3. A dollar sign ( '$' ) in pattern, when used to specify expression
165 anchoring, shall match the zero-length string immediately before a
166 <newline> in string, regardless of the setting of REG_NOTEOL.
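
A minimal sketch of the REG_NEWLINE behavior follows. The helper first_match_offset() is a hypothetical name; it compiles a basic regular expression with the caller's cflags (which should not include REG_NOSUB, since the offset is needed) and returns where the first match starts.

```c
#include <regex.h>

/*
 * Hypothetical helper: return the byte offset in text at which the
 * basic regular expression in pattern first matches, or -1 if there
 * is no match or regcomp() fails. cflags is passed to regcomp().
 */
long
first_match_offset(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    regmatch_t pm;
    long off = -1;

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    if (regexec(&re, text, 1, &pm, 0) == 0)
        off = (long)pm.rm_so;
    regfree(&re);
    return off;
}
```

Given the text "one\ntwo\nthree", the pattern "^two$" matches at offset 4 only when REG_NEWLINE is set. Conversely, "e.t" matches "one\ntwo" across the <newline> only when REG_NEWLINE is not set, because the period then treats <newline> as an ordinary character.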
167
168 The regfree() function frees any memory allocated by regcomp() associ‐
169 ated with preg.
170
171 The following constants are defined as error return values:
172
173 REG_NOMATCH
174 regexec() failed to match.
175
176 REG_BADPAT
177 Invalid regular expression.
178
179 REG_ECOLLATE
180 Invalid collating element referenced.
181
182 REG_ECTYPE
183 Invalid character class type referenced.
184
185 REG_EESCAPE
186 Trailing '\' in pattern.
187
188 REG_ESUBREG
189 Number in "\digit" invalid or in error.
190
191 REG_EBRACK
192 "[]" imbalance.
193
194 REG_EPAREN
195 "\(\)" or "()" imbalance.
196
197 REG_EBRACE
198 "\{\}" imbalance.
199
200 REG_BADBR
201 Content of "\{\}" invalid: not a number, number too large, more
202 than two numbers, first larger than second.
203
204 REG_ERANGE
205 Invalid endpoint in range expression.
206
207 REG_ESPACE
208 Out of memory.
209
210 REG_BADRPT
211 '?' , '*' , or '+' not preceded by valid regular expression.
212
213
214 The regerror() function provides a mapping from error codes returned by
215 regcomp() and regexec() to unspecified printable strings. It generates
216 a string corresponding to the value of the errcode argument, which the
217 application shall ensure is the last non-zero value returned by reg‐
218 comp() or regexec() with the given value of preg. If errcode is not
219 such a value, the content of the generated string is unspecified.
220
If preg is a null pointer, but errcode is a value returned by a
previous call to regexec() or regcomp(), regerror() still generates an
error string corresponding to the value of errcode, but it might not be
as detailed under some implementations.
225
226 If the errbuf_size argument is not 0, regerror() shall place the gener‐
227 ated string into the buffer of size errbuf_size bytes pointed to by
228 errbuf. If the string (including the terminating null) cannot fit in
229 the buffer, regerror() shall truncate the string and null-terminate the
230 result.
231
232 If errbuf_size is 0, regerror() shall ignore the errbuf argument, and
233 return the size of the buffer needed to hold the generated string.
234
235 If the preg argument to regexec() or regfree() is not a compiled regu‐
236 lar expression returned by regcomp(), the result is undefined. A preg
237 is no longer treated as a compiled regular expression after it is given
238 to regfree().
239
241 Upon successful completion, the regcomp() function shall return 0. Oth‐
242 erwise, it shall return an integer value indicating an error as
243 described in <regex.h>, and the content of preg is undefined. If a code
244 is returned, the interpretation shall be as given in <regex.h>.
245
246 If regcomp() detects an invalid RE, it may return REG_BADPAT, or it may
247 return one of the error codes that more precisely describes the error.
248
249 Upon successful completion, the regexec() function shall return 0. Oth‐
250 erwise, it shall return REG_NOMATCH to indicate no match.
251
252 Upon successful completion, the regerror() function shall return the
253 number of bytes needed to hold the entire generated string, including
254 the null termination. If the return value is greater than errbuf_size,
255 the string returned in the buffer pointed to by errbuf has been trun‐
256 cated.
257
258 The regfree() function shall not return a value.
259
261 No errors are defined.
262
263 The following sections are informative.
264
#include <regex.h>
#include <stddef.h>     /* NULL */

/*
 * Match string against the extended regular expression in
 * pattern, treating errors as no match.
 *
 * Return 1 for match, 0 for no match.
 */
int
match(const char *string, const char *pattern)
{
    int status;
    regex_t re;

    /* REG_NOSUB: only success or failure is needed, not offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED|REG_NOSUB) != 0) {
        return(0);      /* Report error. */
    }
    status = regexec(&re, string, (size_t) 0, NULL, 0);
    regfree(&re);
    if (status != 0) {
        return(0);      /* Report error. */
    }
    return(1);
}
294
295 The following demonstrates how the REG_NOTBOL flag could be used with
296 regexec() to find all substrings in a line that match a pattern sup‐
297 plied by a user. (For simplicity of the example, very little error
298 checking is done.)
299
300
size_t off = 0;     /* Offset in buffer where the current search starts. */

(void) regcomp(&re, pattern, 0);
/* This call to regexec() finds the first match on the line. */
error = regexec(&re, buffer, 1, &pm, 0);
while (error == 0) {    /* While matches found. */
    /* Substring found between buffer[off + pm.rm_so] and
       buffer[off + pm.rm_eo]; offsets are relative to buffer + off. */
    if (pm.rm_eo == 0)
        break;          /* Empty match at search start: avoid looping forever. */
    off += pm.rm_eo;
    /* This call to regexec() finds the next match. */
    error = regexec(&re, buffer + off, 1, &pm, REG_NOTBOL);
}
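
A self-contained variant of the repeated-search loop might look like the sketch below. The helper count_matches() is a hypothetical name, and stepping one byte past an empty match is a design choice of this sketch (it assumes single-byte characters), not something the standard prescribes.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: count successive non-overlapping matches of
 * the basic regular expression in pattern within line. Returns -1
 * if regcomp() fails.
 */
int
count_matches(const char *pattern, const char *line)
{
    regex_t re;
    regmatch_t pm;
    size_t off = 0;     /* Start of the part of line not yet searched. */
    int eflags = 0;     /* First search starts at the real line start. */
    int count = 0;

    if (regcomp(&re, pattern, 0) != 0)
        return -1;
    while (regexec(&re, line + off, 1, &pm, eflags) == 0) {
        count++;
        /* pm offsets are relative to line + off, not to line. */
        if (pm.rm_eo > 0)
            off += (size_t)pm.rm_eo;
        else if (line[off] != '\0')
            off++;      /* Empty match at search start: step one byte. */
        else
            break;      /* Empty match at end of line: done. */
        eflags = REG_NOTBOL;
    }
    regfree(&re);
    return count;
}
```

Because every search after the first passes REG_NOTBOL, an anchored pattern such as "^a" is counted once in "aaa" rather than three times.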
309
An application could use:

    regerror(code, preg, (char *)NULL, (size_t)0)

to find out how big a buffer is needed for the generated string,
malloc() a buffer to hold the string, and then call regerror() again to
get the string. Alternatively, it could allocate a fixed, static buffer
that is big enough to hold most strings, and then use malloc() to
allocate a larger buffer if it finds that this is too small.
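
The two-call approach can be sketched as follows. The helper regex_error_string() is a hypothetical name introduced here; the caller is assumed to free() the returned buffer.

```c
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return a malloc()ed copy of the message for
 * errcode, or NULL if memory cannot be allocated. preg is the
 * regex_t that was passed to the failing regcomp() or regexec().
 */
char *
regex_error_string(int errcode, const regex_t *preg)
{
    /* With errbuf_size 0, regerror() only reports the size needed,
       including the terminating null. */
    size_t need = regerror(errcode, preg, NULL, 0);
    char *msg = malloc(need);

    if (msg != NULL)
        (void) regerror(errcode, preg, msg, need);
    return msg;
}
```

The text of the message is unspecified, so an application should display it rather than compare it against fixed strings.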
321
322 To match a pattern as described in the Shell and Utilities volume of
323 IEEE Std 1003.1-2001, Section 2.13, Pattern Matching Notation, use the
324 fnmatch() function.
325
The regexec() function must fill in all nmatch elements of pmatch,
where nmatch and pmatch are supplied by the application, even if some
elements of pmatch do not correspond to subexpressions in pattern. The
application writer should note that there is probably no reason for
using a value of nmatch that is larger than preg->re_nsub + 1.
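
Sizing pmatch from re_nsub can be sketched as below. The helper count_participating() is a hypothetical name; it allocates exactly re_nsub + 1 elements and counts how many parenthesized subexpressions took part in the match.

```c
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper: match the extended regular expression in
 * pattern against string and return how many subexpressions
 * (1..re_nsub) participated. Returns -1 on a compile error, an
 * allocation failure, or no match.
 */
int
count_participating(const char *pattern, const char *string)
{
    regex_t re;
    regmatch_t *pm;
    size_t i, n;
    int count = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    n = re.re_nsub + 1;     /* Element 0 is the whole match. */
    pm = malloc(n * sizeof *pm);
    if (pm != NULL && regexec(&re, string, n, pm, 0) == 0) {
        count = 0;
        for (i = 1; i < n; i++)
            if (pm[i].rm_so != -1)
                count++;
    }
    free(pm);
    regfree(&re);
    return count;
}
```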
332
333 The REG_NEWLINE flag supports a use of RE matching that is needed in
334 some applications like text editors. In such applications, the user
335 supplies an RE asking the application to find a line that matches the
336 given expression. An anchor in such an RE anchors at the beginning or
337 end of any line. Such an application can pass a sequence of <new‐
338 line>-separated lines to regexec() as a single long string and specify
339 REG_NEWLINE to regcomp() to get the desired behavior. The application
340 must ensure that there are no explicit <newline>s in pattern if it
341 wants to ensure that any match occurs entirely within a single line.
342
343 The REG_NEWLINE flag affects the behavior of regexec(), but it is in
344 the cflags parameter to regcomp() to allow flexibility of implementa‐
345 tion. Some implementations will want to generate the same compiled RE
346 in regcomp() regardless of the setting of REG_NEWLINE and have
347 regexec() handle anchors differently based on the setting of the flag.
348 Other implementations will generate different compiled REs based on the
349 REG_NEWLINE.
350
351 The REG_ICASE flag supports the operations taken by the grep -i option
352 and the historical implementations of ex and vi. Including this flag
353 will make it easier for application code to be written that does the
354 same thing as these utilities.
355
356 The substrings reported in pmatch[] are defined using offsets from the
357 start of the string rather than pointers. Since this is a new inter‐
358 face, there should be no impact on historical implementations or appli‐
359 cations, and offsets should be just as easy to use as pointers. The
360 change to offsets was made to facilitate future extensions in which the
361 string to be searched is presented to regexec() in blocks, allowing a
362 string to be searched that is not all in memory at once.
363
364 The type regoff_t is used for the elements of pmatch[] to ensure that
365 the application can represent either the largest possible array in mem‐
366 ory (important for an application conforming to the Shell and Utilities
367 volume of IEEE Std 1003.1-2001) or the largest possible file (important
368 for an application using the extension where a file is searched in
369 chunks).
370
371 The standard developers rejected the inclusion of a regsub() function
372 that would be used to do substitutions for a matched RE. While such a
373 routine would be useful to some applications, its utility would be much
374 more limited than the matching function described here. Both RE parsing
375 and substitution are possible to implement without support other than
376 that required by the ISO C standard, but matching is much more complex
377 than substituting. The only difficult part of substitution, given the
378 information supplied by regexec(), is finding the next character in a
379 string when there can be multi-byte characters. That is a much larger
380 issue, and one that needs a more general solution.
381
382 The errno variable has not been used for error returns to avoid filling
383 the errno name space for this feature.
384
The interface is defined so that the matched substrings rm_so and rm_eo
are in a separate regmatch_t structure instead of in regex_t. This
allows a single compiled RE to be used simultaneously in several
contexts; in main() and a signal handler, perhaps, or in multiple threads
of lightweight processes. (The preg argument to regexec() is declared
with type const, so the implementation is not permitted to use the
structure to store intermediate results.) It also allows an application
to request an arbitrary number of substrings from an RE. The number of
subexpressions in the RE is reported in re_nsub in preg. With this
change to regexec(), consideration was given to dropping the REG_NOSUB
flag since the user can now specify this with a zero nmatch argument to
regexec(). However, keeping REG_NOSUB allows an implementation to use
a different (perhaps more efficient) algorithm if it knows in regcomp()
that no subexpressions need be reported. The implementation is only
required to fill in pmatch if nmatch is not zero and if REG_NOSUB is
not specified. Note that the size_t type, as defined in the ISO C
standard, is unsigned, so the description of regexec() does not need to
address negative values of nmatch.
403
404 REG_NOTBOL was added to allow an application to do repeated searches
405 for the same pattern in a line. If the pattern contains a circumflex
406 character that should match the beginning of a line, then the pattern
407 should only match when matched against the beginning of the line. With‐
408 out the REG_NOTBOL flag, the application could rewrite the expression
409 for subsequent matches, but in the general case this would require
410 parsing the expression. The need for REG_NOTEOL is not as clear; it was
411 added for symmetry.
412
413 The addition of the regerror() function addresses the historical need
414 for conforming application programs to have access to error information
415 more than "Function failed to compile/match your RE for unknown rea‐
416 sons".
417
418 This interface provides for two different methods of dealing with error
419 conditions. The specific error codes (REG_EBRACE, for example), defined
420 in <regex.h>, allow an application to recover from an error if it is so
421 able. Many applications, especially those that use patterns supplied by
422 a user, will not try to deal with specific error cases, but will just
423 use regerror() to obtain a human-readable error message to present to
424 the user.
425
426 The regerror() function uses a scheme similar to confstr() to deal with
427 the problem of allocating memory to hold the generated string. The
428 scheme used by strerror() in the ISO C standard was considered unac‐
429 ceptable since it creates difficulties for multi-threaded applications.
430
431 The preg argument is provided to regerror() to allow an implementation
432 to generate a more descriptive message than would be possible with
433 errcode alone. An implementation might, for example, save the character
434 offset of the offending character of the pattern in a field of preg,
435 and then include that in the generated message string. The implementa‐
436 tion may also ignore preg.
437
438 A REG_FILENAME flag was considered, but omitted. This flag caused
439 regexec() to match patterns as described in the Shell and Utilities
440 volume of IEEE Std 1003.1-2001, Section 2.13, Pattern Matching Notation
441 instead of REs. This service is now provided by the fnmatch() function.
442
443 Notice that there is a difference in philosophy between the
444 ISO POSIX-2:1993 standard and IEEE Std 1003.1-2001 in how to handle a
445 "bad" regular expression. The ISO POSIX-2:1993 standard says that many
446 bad constructs "produce undefined results", or that "the interpretation
447 is undefined". IEEE Std 1003.1-2001, however, says that the interpreta‐
448 tion of such REs is unspecified. The term "undefined" means that the
449 action by the application is an error, of similar severity to passing a
450 bad pointer to a function.
451
452 The regcomp() and regexec() functions are required to accept any null-
453 terminated string as the pattern argument. If the meaning of the string
454 is "undefined", the behavior of the function is "unspecified".
455 IEEE Std 1003.1-2001 does not specify how the functions will interpret
456 the pattern; they might return error codes, or they might do pattern
457 matching in some completely unexpected way, but they should not do
458 something like abort the process.
459
461 None.
462
fnmatch(), glob(), Shell and Utilities volume of
IEEE Std 1003.1-2001, Section 2.13, Pattern Matching Notation, Base
Definitions volume of IEEE Std 1003.1-2001, Chapter 9, Regular
Expressions, <regex.h>, <sys/types.h>
468
Portions of this text are reprinted and reproduced in electronic form
from IEEE Std 1003.1, 2003 Edition, Standard for Information Technology
-- Portable Operating System Interface (POSIX), The Open Group Base
Specifications Issue 6, Copyright (C) 2001-2003 by the Institute of
Electrical and Electronics Engineers, Inc and The Open Group. In the
event of any discrepancy between this version and the original IEEE and
The Open Group Standard, the original IEEE and The Open Group Standard
is the referee document. The original Standard can be obtained online
at http://www.opengroup.org/unix/online.html.

IEEE/The Open Group                  2003                         REGCOMP(P)