REGCOMP(3P)                POSIX Programmer's Manual               REGCOMP(3P)

PROLOG
This manual page is part of the POSIX Programmer's Manual. The Linux
implementation of this interface may differ (consult the corresponding
Linux manual page for details of Linux behavior), or the interface may
not be implemented on Linux.

NAME
regcomp, regerror, regexec, regfree - regular expression matching

SYNOPSIS
#include <regex.h>

int regcomp(regex_t *restrict preg, const char *restrict pattern,
    int cflags);
size_t regerror(int errcode, const regex_t *restrict preg,
    char *restrict errbuf, size_t errbuf_size);
int regexec(const regex_t *restrict preg, const char *restrict string,
    size_t nmatch, regmatch_t pmatch[restrict], int eflags);
void regfree(regex_t *preg);

DESCRIPTION
These functions interpret basic and extended regular expressions as
described in the Base Definitions volume of POSIX.1‐2008, Chapter 9,
Regular Expressions.

The regex_t structure is defined in <regex.h> and contains at least the
following member:

┌──────────────┬──────────────┬───────────────────────────┐
│Member Type   │ Member Name  │ Description               │
├──────────────┼──────────────┼───────────────────────────┤
│size_t        │re_nsub       │ Number of parenthesized   │
│              │              │ subexpressions.           │
└──────────────┴──────────────┴───────────────────────────┘

The regmatch_t structure is defined in <regex.h> and contains at least
the following members:

┌──────────────┬──────────────┬───────────────────────────┐
│Member Type   │ Member Name  │ Description               │
├──────────────┼──────────────┼───────────────────────────┤
│regoff_t      │rm_so         │ Byte offset from start of │
│              │              │ string to start of        │
│              │              │ substring.                │
│regoff_t      │rm_eo         │ Byte offset from start of │
│              │              │ string of the first       │
│              │              │ character after the end   │
│              │              │ of substring.             │
└──────────────┴──────────────┴───────────────────────────┘

The regcomp() function shall compile the regular expression contained
in the string pointed to by the pattern argument and place the results
in the structure pointed to by preg. The cflags argument is the
bitwise-inclusive OR of zero or more of the following flags, which are
defined in the <regex.h> header:

REG_EXTENDED  Use Extended Regular Expressions.

REG_ICASE     Ignore case in match (see the Base Definitions volume of
              POSIX.1‐2008, Chapter 9, Regular Expressions).

REG_NOSUB     Report only success/fail in regexec().

REG_NEWLINE   Change the handling of <newline> characters, as described
              in the text.

The default regular expression type for pattern is a Basic Regular
Expression. The application can specify Extended Regular Expressions
using the REG_EXTENDED cflags flag.

If the REG_NOSUB flag was not set in cflags, then regcomp() shall set
re_nsub to the number of parenthesized subexpressions (delimited by
"\(\)" in basic regular expressions or "()" in extended regular
expressions) found in pattern.

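As an illustration of how cflags selects the expression type and how
re_nsub is set, the following is a minimal sketch. The helper name
count_subexpressions() is hypothetical and is not part of this
interface.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Compile pattern with the given cflags and return re_nsub, or
 * (size_t)-1 if regcomp() fails.  count_subexpressions() is a
 * hypothetical helper used only for this illustration.
 */
size_t
count_subexpressions(const char *pattern, int cflags)
{
    regex_t re;
    size_t nsub;

    if (regcomp(&re, pattern, cflags) != 0)
        return (size_t)-1;
    nsub = re.re_nsub;  /* Only guaranteed when cflags lacks REG_NOSUB. */
    regfree(&re);
    return nsub;
}
```

With cflags of 0 the pattern is a basic regular expression, so "(a)(b)"
contains only ordinary characters and no subexpressions, whereas the
same pattern compiled with REG_EXTENDED contains two.
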
The regexec() function compares the null-terminated string specified by
string with the compiled regular expression preg initialized by a
previous call to regcomp(). If it finds a match, regexec() shall return
0; otherwise, it shall return non-zero indicating either no match or an
error. The eflags argument is the bitwise-inclusive OR of zero or more
of the following flags, which are defined in the <regex.h> header:

REG_NOTBOL    The first character of the string pointed to by string is
              not the beginning of the line. Therefore, the <circumflex>
              character ('^'), when taken as a special character,
              shall not match the beginning of string.

REG_NOTEOL    The last character of the string pointed to by string is
              not the end of the line. Therefore, the <dollar-sign>
              ('$'), when taken as a special character, shall not match
              the end of string.

If nmatch is 0 or REG_NOSUB was set in the cflags argument to
regcomp(), then regexec() shall ignore the pmatch argument. Otherwise, the
application shall ensure that the pmatch argument points to an array
with at least nmatch elements, and regexec() shall fill in the elements
of that array with offsets of the substrings of string that correspond
to the parenthesized subexpressions of pattern: pmatch[i].rm_so shall
be the byte offset of the beginning and pmatch[i].rm_eo shall be one
greater than the byte offset of the end of substring i. (Subexpression
i begins at the ith matched open parenthesis, counting from 1.) Offsets
in pmatch[0] identify the substring that corresponds to the entire
regular expression. Unused elements of pmatch up to pmatch[nmatch-1] shall
be filled with -1. If there are more than nmatch subexpressions in
pattern (pattern itself counts as a subexpression), then regexec() shall
still do the match, but shall record only the first nmatch substrings.

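The offsets in pmatch can be turned into a copy of the matched text as
sketched below. The helper copy_submatch() and the limit MAX_GROUPS are
assumptions made for this illustration; they do not appear in
<regex.h>.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

#define MAX_GROUPS 10   /* Hypothetical upper bound on nmatch. */

/*
 * Copy the text matched by subexpression idx of the extended regular
 * expression pattern into out (outsz bytes, always null-terminated).
 * Returns the number of bytes copied, or -1 if the pattern does not
 * compile, does not match, or subexpression idx was not filled in.
 */
int
copy_submatch(const char *pattern, const char *string, size_t idx,
              char *out, size_t outsz)
{
    regex_t re;
    regmatch_t pm[MAX_GROUPS];
    size_t len;
    int rc;

    if (idx >= MAX_GROUPS || outsz == 0)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, string, MAX_GROUPS, pm, 0);
    regfree(&re);
    if (rc != 0 || pm[idx].rm_so == -1)   /* Unused elements hold -1. */
        return -1;
    len = (size_t)(pm[idx].rm_eo - pm[idx].rm_so);
    if (len >= outsz)
        len = outsz - 1;                  /* Truncate to fit. */
    memcpy(out, string + pm[idx].rm_so, len);
    out[len] = '\0';
    return (int)len;
}
```

Index 0 selects the text matched by the whole expression; indexes from
1 upward select the parenthesized subexpressions in order of their
opening parentheses.
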
When matching a basic or extended regular expression, any given
parenthesized subexpression of pattern might participate in the match of
several different substrings of string, or it might not match any
substring even though the pattern as a whole did match. The following
rules shall be used to determine which substrings to report in pmatch
when matching regular expressions:

1. If subexpression i in a regular expression is not contained within
   another subexpression, and it participated in the match several
   times, then the byte offsets in pmatch[i] shall delimit the last
   such match.

2. If subexpression i is not contained within another subexpression,
   and it did not participate in an otherwise successful match, the
   byte offsets in pmatch[i] shall be -1. A subexpression does not
   participate in the match when:

      '*' or "\{\}" appears immediately after the subexpression in a
      basic regular expression, or '*', '?', or "{}" appears immediately
      after the subexpression in an extended regular expression, and the
      subexpression did not match (matched 0 times)

   or:

      '|' is used in an extended regular expression to select this
      subexpression or another, and the other subexpression
      matched.

3. If subexpression i is contained within another subexpression j, and
   i is not contained within any other subexpression that is contained
   within j, and a match of subexpression j is reported in pmatch[j],
   then the match or non-match of subexpression i reported in
   pmatch[i] shall be as described in 1. and 2. above, but within the
   substring reported in pmatch[j] rather than the whole string. The
   offsets in pmatch[i] are still relative to the start of string.

4. If subexpression i is contained in subexpression j, and the byte
   offsets in pmatch[j] are -1, then the byte offsets in pmatch[i] shall
   also be -1.

5. If subexpression i matched a zero-length string, then both byte
   offsets in pmatch[i] shall be the byte offset of the character or
   null terminator immediately following the zero-length string.

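The rules above can be observed with a small wrapper such as the
following sketch. The name submatch_offsets() is hypothetical, and the
patterns mentioned afterward are chosen only for illustration.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Compile pattern as an extended regular expression, match it against
 * string, and store up to n offset pairs in pm.  Returns 0 on a match,
 * otherwise the non-zero value returned by regcomp() or regexec().
 * submatch_offsets() is a hypothetical helper for this illustration.
 */
int
submatch_offsets(const char *pattern, const char *string,
                 regmatch_t *pm, size_t n)
{
    regex_t re;
    int rc;

    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;
    rc = regexec(&re, string, n, pm, 0);
    regfree(&re);
    return rc;
}
```

Matching "(a)|(b)" against "b" should leave both offsets of pm[1] at -1
(rule 2), since the other alternative matched. Matching "(a|b)*"
against "ab" should report the last iteration in pm[1], offsets 1 and 2
(rule 1). Matching "(x*)b" against "ab" should report a zero-length
pm[1] with both offsets equal to 1 (rule 5).
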
If, when regexec() is called, the locale is different from when the
regular expression was compiled, the result is undefined.

If REG_NEWLINE is not set in cflags, then a <newline> in pattern or
string shall be treated as an ordinary character. If REG_NEWLINE is
set, then <newline> shall be treated as an ordinary character except as
follows:

1. A <newline> in string shall not be matched by a <period> outside a
   bracket expression or by any form of a non-matching list (see the
   Base Definitions volume of POSIX.1‐2008, Chapter 9, Regular
   Expressions).

2. A <circumflex> ('^') in pattern, when used to specify expression
   anchoring (see the Base Definitions volume of POSIX.1‐2008, Section
   9.3.8, BRE Expression Anchoring), shall match the zero-length
   string immediately after a <newline> in string, regardless of the
   setting of REG_NOTBOL.

3. A <dollar-sign> ('$') in pattern, when used to specify expression
   anchoring, shall match the zero-length string immediately before a
   <newline> in string, regardless of the setting of REG_NOTEOL.

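The effect of REG_NEWLINE on anchors and on <period> can be checked
with a sketch like the one below. The helper first_match_offset() is
hypothetical; it compiles a basic regular expression with the supplied
cflags and reports where the first match starts.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Return the byte offset in text at which the basic regular expression
 * pattern first matches, or -1 if it does not compile or does not
 * match.  first_match_offset() is a hypothetical helper.
 */
long
first_match_offset(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    regmatch_t pm;
    int rc;

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    rc = regexec(&re, text, 1, &pm, 0);
    regfree(&re);
    return rc == 0 ? (long)pm.rm_so : -1;
}
```

For the two-line string "a\nb\n", the pattern "^b" finds nothing when
cflags is 0, but matches at offset 2 when REG_NEWLINE is set. In the
other direction, "a.b" matches "a\nb" only when REG_NEWLINE is not set,
because the <period> may then match the <newline>.
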
The regfree() function frees any memory allocated by regcomp()
associated with preg.

The following constants are defined as the minimum set of error return
values, although other errors listed as implementation extensions in
<regex.h> are possible:

REG_BADBR     Content of "\{\}" invalid: not a number, number too
              large, more than two numbers, first larger than second.

REG_BADPAT    Invalid regular expression.

REG_BADRPT    '?', '*', or '+' not preceded by valid regular
              expression.

REG_EBRACE    "\{\}" imbalance.

REG_EBRACK    "[]" imbalance.

REG_ECOLLATE  Invalid collating element referenced.

REG_ECTYPE    Invalid character class type referenced.

REG_EESCAPE   Trailing <backslash> character in pattern.

REG_EPAREN    "\(\)" or "()" imbalance.

REG_ERANGE    Invalid endpoint in range expression.

REG_ESPACE    Out of memory.

REG_ESUBREG   Number in "\digit" invalid or in error.

REG_NOMATCH   regexec() failed to match.

If more than one error occurs in processing a function call, any one of
the possible constants may be returned, as the order of detection is
unspecified.

The regerror() function provides a mapping from error codes returned by
regcomp() and regexec() to unspecified printable strings. It generates
a string corresponding to the value of the errcode argument, which the
application shall ensure is the last non-zero value returned by
regcomp() or regexec() with the given value of preg. If errcode is not
such a value, the content of the generated string is unspecified.

If preg is a null pointer, but errcode is a value returned by a
previous call to regexec() or regcomp(), the regerror() function still
generates an error string corresponding to the value of errcode, but it
might not be as detailed under some implementations.

If the errbuf_size argument is not 0, regerror() shall place the
generated string into the buffer of size errbuf_size bytes pointed to by
errbuf. If the string (including the terminating null) cannot fit in
the buffer, regerror() shall truncate the string and null-terminate the
result.

If errbuf_size is 0, regerror() shall ignore the errbuf argument, and
return the size of the buffer needed to hold the generated string.

If the preg argument to regexec() or regfree() is not a compiled
regular expression returned by regcomp(), the result is undefined. A preg
is no longer treated as a compiled regular expression after it is given
to regfree().

RETURN VALUE
Upon successful completion, the regcomp() function shall return 0.
Otherwise, it shall return an integer value indicating an error as
described in <regex.h>, and the content of preg is undefined. If a code
is returned, the interpretation shall be as given in <regex.h>.

If regcomp() detects an invalid RE, it may return REG_BADPAT, or it may
return one of the error codes that more precisely describes the error.

Upon successful completion, the regexec() function shall return 0.
Otherwise, it shall return REG_NOMATCH to indicate no match.

Upon successful completion, the regerror() function shall return the
number of bytes needed to hold the entire generated string, including
the null termination. If the return value is greater than errbuf_size,
the string returned in the buffer pointed to by errbuf has been
truncated.

The regfree() function shall not return a value.

ERRORS
No errors are defined.

The following sections are informative.

EXAMPLES
    #include <regex.h>
    #include <stddef.h>     /* NULL */

    /*
     * Match string against the extended regular expression in
     * pattern, treating errors as no match.
     *
     * Return 1 for match, 0 for no match.
     */

    int
    match(const char *string, const char *pattern)
    {
        int status;
        regex_t re;

        if (regcomp(&re, pattern, REG_EXTENDED|REG_NOSUB) != 0) {
            return(0);      /* Compile error: report as no match. */
        }
        status = regexec(&re, string, (size_t) 0, NULL, 0);
        regfree(&re);
        if (status != 0) {
            return(0);      /* No match (or regexec() error). */
        }
        return(1);
    }

The following demonstrates how the REG_NOTBOL flag could be used with
regexec() to find all substrings in a line that match a pattern
supplied by a user. (For simplicity of the example, very little error
checking is done, and a pattern that can match the empty string would
need extra handling to make progress.)

    (void) regcomp (&re, pattern, 0);
    p = &buffer[0];     /* p: char * cursor into buffer. */
    /* This call to regexec() finds the first match on the line. */
    error = regexec (&re, p, 1, &pm, 0);
    while (error == 0) {  /* While matches found. */
        /* Substring found between p + pm.rm_so and p + pm.rm_eo. */
        /* Offsets are relative to p, so step p past this match. */
        p += pm.rm_eo;
        /* This call to regexec() finds the next match. */
        error = regexec (&re, p, 1, &pm, REG_NOTBOL);
    }

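A self-contained variant of the repeated-search loop might look like the
following sketch. The function count_matches() is a hypothetical name;
stepping one byte past an empty match at the cursor is an assumption
added here so that the loop always makes progress.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Count the non-overlapping matches of the basic regular expression
 * pattern in line.  Returns 0 if the pattern does not compile.
 * count_matches() is a hypothetical helper for this illustration.
 */
size_t
count_matches(const char *pattern, const char *line)
{
    regex_t re;
    regmatch_t pm;
    const char *p = line;   /* Start of the part still to be searched. */
    size_t count = 0;
    int eflags = 0;

    if (regcomp(&re, pattern, 0) != 0)
        return 0;
    while (regexec(&re, p, 1, &pm, eflags) == 0) {
        count++;
        if (pm.rm_eo == 0) {
            /* Empty match at the cursor: step one byte, or stop. */
            if (*p == '\0')
                break;
            p++;
        } else {
            p += pm.rm_eo;  /* Offsets are relative to p. */
        }
        eflags = REG_NOTBOL;    /* p no longer starts the line. */
    }
    regfree(&re);
    return count;
}
```

Because later calls pass REG_NOTBOL, an anchored pattern such as "^ab"
is counted once in "ababab", whereas the unanchored "ab" is counted
three times.
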
APPLICATION USAGE
An application could use:

    regerror(code,preg,(char *)NULL,(size_t)0)

to find out how big a buffer is needed for the generated string,
malloc() a buffer to hold the string, and then call regerror() again to
get the string. Alternatively, it could allocate a fixed, static buffer
that is big enough to hold most strings, and then use malloc() to
allocate a larger buffer if it finds that this is too small.

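The two-call approach described above could be written as in this
sketch. The function name regex_error_string() is hypothetical; the
caller is assumed to free() the returned buffer.

```c
#include <regex.h>
#include <stdlib.h>

/*
 * Return a malloc()ed copy of the message for code, or NULL if memory
 * cannot be allocated.  code is assumed to be the last non-zero value
 * returned by regcomp() or regexec() for preg.
 * regex_error_string() is a hypothetical helper.
 */
char *
regex_error_string(int code, const regex_t *preg)
{
    size_t need;
    char *msg;

    /* First call: errbuf_size of 0 returns the size required. */
    need = regerror(code, preg, (char *)NULL, (size_t)0);
    msg = malloc(need);
    if (msg == NULL)
        return NULL;
    /* Second call: the buffer is large enough, so nothing is truncated. */
    (void) regerror(code, preg, msg, need);
    return msg;
}
```

The size returned by regerror() includes the terminating null, so no
extra byte needs to be added before calling malloc().
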
To match a pattern as described in the Shell and Utilities volume of
POSIX.1‐2008, Section 2.13, Pattern Matching Notation, use the
fnmatch() function.

RATIONALE
The regexec() function must fill in all nmatch elements of pmatch,
where nmatch and pmatch are supplied by the application, even if some
elements of pmatch do not correspond to subexpressions in pattern. The
application developer should note that there is probably no reason for
using a value of nmatch that is larger than preg->re_nsub+1.

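One way to size pmatch exactly, following the note above, is sketched
here. The function match_all_groups() is a hypothetical name, and the
expression is assumed to have been compiled without REG_NOSUB so that
re_nsub is set.

```c
#include <regex.h>
#include <stdlib.h>

/*
 * Match string against the compiled expression re using an array of
 * exactly re->re_nsub + 1 elements.  On a match, returns the malloc()ed
 * array and stores its length in *count; the caller frees it.  Returns
 * NULL on no match or if memory cannot be allocated.
 * match_all_groups() is a hypothetical helper.
 */
regmatch_t *
match_all_groups(const regex_t *re, const char *string, size_t *count)
{
    size_t n = re->re_nsub + 1;     /* Whole match plus each group. */
    regmatch_t *pm = malloc(n * sizeof *pm);

    if (pm == NULL)
        return NULL;
    if (regexec(re, string, n, pm, 0) != 0) {
        free(pm);
        return NULL;
    }
    *count = n;
    return pm;
}
```

Since preg is passed as a pointer to const, the same compiled
expression can be reused for any number of such calls.
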
The REG_NEWLINE flag supports a use of RE matching that is needed in
some applications like text editors. In such applications, the user
supplies an RE asking the application to find a line that matches the
given expression. An anchor in such an RE anchors at the beginning or
end of any line. Such an application can pass a sequence of
<newline>-separated lines to regexec() as a single long string and specify
REG_NEWLINE to regcomp() to get the desired behavior. The application
must ensure that there are no explicit <newline> characters in pattern
if it wants to ensure that any match occurs entirely within a single
line.

The REG_NEWLINE flag affects the behavior of regexec(), but it is in
the cflags parameter to regcomp() to allow flexibility of
implementation. Some implementations will want to generate the same
compiled RE in regcomp() regardless of the setting of REG_NEWLINE and have
regexec() handle anchors differently based on the setting of the flag.
Other implementations will generate different compiled REs based on the
REG_NEWLINE.

The REG_ICASE flag supports the operations taken by the grep -i option
and the historical implementations of ex and vi. Including this flag
will make it easier for application code to be written that does the
same thing as these utilities.

The substrings reported in pmatch[] are defined using offsets from the
start of the string rather than pointers. This allows type-safe access
to both constant and non-constant strings.

The type regoff_t is used for the elements of pmatch[] to ensure that
the application can represent large arrays in memory (important for an
application conforming to the Shell and Utilities volume of
POSIX.1‐2008).

The 1992 edition of this standard required regoff_t to be at least as
wide as off_t, to facilitate future extensions in which the string to
be searched is taken from a file. However, these future extensions have
not appeared. The requirement rules out popular implementations with
32-bit regoff_t and 64-bit off_t, so it has been removed.

The standard developers rejected the inclusion of a regsub() function
that would be used to do substitutions for a matched RE. While such a
routine would be useful to some applications, its utility would be much
more limited than the matching function described here. Both RE parsing
and substitution are possible to implement without support other than
that required by the ISO C standard, but matching is much more complex
than substituting. The only difficult part of substitution, given the
information supplied by regexec(), is finding the next character in a
string when there can be multi-byte characters. That is a much larger
issue, and one that needs a more general solution.

The errno variable has not been used for error returns to avoid filling
the errno name space for this feature.

The interface is defined so that the matched substrings rm_so and rm_eo
are in a separate regmatch_t structure instead of in regex_t. This
allows a single compiled RE to be used simultaneously in several
contexts; in main() and a signal handler, perhaps, or in multiple threads
of lightweight processes. (The preg argument to regexec() is declared
with type const, so the implementation is not permitted to use the
structure to store intermediate results.) It also allows an application
to request an arbitrary number of substrings from an RE. The number of
subexpressions in the RE is reported in re_nsub in preg. With this
change to regexec(), consideration was given to dropping the REG_NOSUB
flag since the user can now specify this with a zero nmatch argument to
regexec(). However, keeping REG_NOSUB allows an implementation to use
a different (perhaps more efficient) algorithm if it knows in regcomp()
that no subexpressions need be reported. The implementation is only
required to fill in pmatch if nmatch is not zero and if REG_NOSUB is
not specified. Note that the size_t type, as defined in the ISO C
standard, is unsigned, so the description of regexec() does not need to
address negative values of nmatch.

REG_NOTBOL was added to allow an application to do repeated searches
for the same pattern in a line. If the pattern contains a <circumflex>
character that should match the beginning of a line, then the pattern
should only match when matched against the beginning of the line.
Without the REG_NOTBOL flag, the application could rewrite the
expression for subsequent matches, but in the general case this would
require parsing the expression. The need for REG_NOTEOL is not as clear;
it was added for symmetry.

The addition of the regerror() function addresses the historical need
for conforming application programs to have access to error information
more than ``Function failed to compile/match your RE for unknown
reasons''.

This interface provides for two different methods of dealing with error
conditions. The specific error codes (REG_EBRACE, for example), defined
in <regex.h>, allow an application to recover from an error if it is so
able. Many applications, especially those that use patterns supplied by
a user, will not try to deal with specific error cases, but will just
use regerror() to obtain a human-readable error message to present to
the user.

The regerror() function uses a scheme similar to confstr() to deal with
the problem of allocating memory to hold the generated string. The
scheme used by strerror() in the ISO C standard was considered
unacceptable since it creates difficulties for multi-threaded applications.

The preg argument is provided to regerror() to allow an implementation
to generate a more descriptive message than would be possible with
errcode alone. An implementation might, for example, save the character
offset of the offending character of the pattern in a field of preg,
and then include that in the generated message string. The
implementation may also ignore preg.

A REG_FILENAME flag was considered, but omitted. This flag caused
regexec() to match patterns as described in the Shell and Utilities
volume of POSIX.1‐2008, Section 2.13, Pattern Matching Notation instead
of REs. This service is now provided by the fnmatch() function.

Notice that there is a difference in philosophy between the
ISO POSIX‐2:1993 standard and POSIX.1‐2008 in how to handle a ``bad''
regular expression. The ISO POSIX‐2:1993 standard says that many bad
constructs ``produce undefined results'', or that ``the interpretation
is undefined''. POSIX.1‐2008, however, says that the interpretation of
such REs is unspecified. The term ``undefined'' means that the action
by the application is an error, of similar severity to passing a bad
pointer to a function.

The regcomp() and regexec() functions are required to accept any
null-terminated string as the pattern argument. If the meaning of the
string is ``undefined'', the behavior of the function is ``unspecified''.
POSIX.1‐2008 does not specify how the functions will interpret the
pattern; they might return error codes, or they might do pattern matching
in some completely unexpected way, but they should not do something
like abort the process.

FUTURE DIRECTIONS
None.

SEE ALSO
fnmatch(), glob()

The Base Definitions volume of POSIX.1‐2008, Chapter 9, Regular
Expressions, <regex.h>, <sys/types.h>

The Shell and Utilities volume of POSIX.1‐2008, Section 2.13, Pattern
Matching Notation

COPYRIGHT
Portions of this text are reprinted and reproduced in electronic form
from IEEE Std 1003.1, 2013 Edition, Standard for Information Technology
-- Portable Operating System Interface (POSIX), The Open Group Base
Specifications Issue 7, Copyright (C) 2013 by the Institute of
Electrical and Electronics Engineers, Inc and The Open Group. (This is
POSIX.1-2008 with the 2013 Technical Corrigendum 1 applied.) In the
event of any discrepancy between this version and the original IEEE and
The Open Group Standard, the original IEEE and The Open Group Standard
is the referee document. The original Standard can be obtained online
at http://www.unix.org/online.html .

Any typographical or formatting errors that appear in this page are
most likely to have been introduced during the conversion of the source
files to man page format. To report such errors, see
https://www.kernel.org/doc/man-pages/reporting_bugs.html .



IEEE/The Open Group                  2013                          REGCOMP(3P)