PCRE2POSIX(3)              Library Functions Manual              PCRE2POSIX(3)

NAME
PCRE2 - Perl-compatible regular expressions (revised API)

SYNOPSIS

#include <pcre2posix.h>

int pcre2_regcomp(regex_t *preg, const char *pattern,
     int cflags);

int pcre2_regexec(const regex_t *preg, const char *string,
     size_t nmatch, regmatch_t pmatch[], int eflags);

size_t pcre2_regerror(int errcode, const regex_t *preg,
     char *errbuf, size_t errbuf_size);

void pcre2_regfree(regex_t *preg);

DESCRIPTION

This set of functions provides a POSIX-style API for the PCRE2 regular
expression 8-bit library. There are no POSIX-style wrappers for PCRE2's
16-bit and 32-bit libraries. See the pcre2api documentation for a
description of PCRE2's native API, which contains much additional
functionality.

The functions described here are wrapper functions that ultimately call
the PCRE2 native API. Their prototypes are defined in the pcre2posix.h
header file, and they all have unique names starting with pcre2_.
However, the pcre2posix.h header also contains macro definitions that
convert the standard POSIX names such as regcomp() into pcre2_regcomp()
etc. This means that a program can use the usual POSIX names without
running the risk of accidentally linking with POSIX functions from a
different library.

On Unix-like systems the PCRE2 POSIX library is called libpcre2-posix,
so it can be accessed by adding -lpcre2-posix to the command for
linking an application. Because the POSIX functions call the native
ones, it is also necessary to add -lpcre2-8.

Although they are not defined as prototypes in pcre2posix.h, the library
does contain functions with the POSIX names regcomp() etc. These simply
pass their arguments to the PCRE2 functions. These functions are
provided for backwards compatibility with earlier versions of PCRE2, so
that existing programs do not have to be recompiled.

Calling the header file pcre2posix.h avoids any conflict with other
POSIX libraries. It can, of course, be renamed or aliased as regex.h,
which is the "correct" name, if there is no clash. It provides two
structure types, regex_t for compiled internal forms, and regmatch_t
for returning captured substrings. It also defines some constants whose
names start with "REG_"; these are used for setting options and
identifying error codes.

USING THE POSIX FUNCTIONS

Those POSIX option bits that can reasonably be mapped to PCRE2 native
options have been implemented. In addition, the option REG_EXTENDED is
defined with the value zero. This has no effect, but since programs
that are written to the POSIX interface often use it, this makes it
easier to slot in PCRE2 as a replacement library. Other POSIX options
are not even defined.

There are also some options that are not defined by POSIX. These have
been added at the request of users who want to make use of certain
PCRE2-specific features via the POSIX calling interface or to add BSD
or GNU functionality.

When PCRE2 is called via these functions, it is only the API that is
POSIX-like in style. The syntax and semantics of the regular
expressions themselves are still those of Perl, subject to the setting
of various PCRE2 options, as described below. "POSIX-like in style"
means that the API approximates to the POSIX definition; it is not
fully POSIX-compatible, and in multi-unit encoding domains it is
probably even less compatible.

The descriptions below use the actual names of the functions, but, as
described above, the standard POSIX names (without the pcre2_ prefix)
may also be used.

COMPILING A PATTERN

The function pcre2_regcomp() is called to compile a pattern into an
internal form. By default, the pattern is a C string terminated by a
binary zero (but see REG_PEND below). The preg argument is a pointer to
a regex_t structure that is used as a base for storing information
about the compiled regular expression. (It is also used for input when
REG_PEND is set.)

The argument cflags is either zero, or contains one or more of the bits
defined by the following macros:

  REG_DOTALL

The PCRE2_DOTALL option is set when the regular expression is passed
for compilation to the native function. Note that REG_DOTALL is not
part of the POSIX standard.

  REG_ICASE

The PCRE2_CASELESS option is set when the regular expression is passed
for compilation to the native function.

  REG_NEWLINE

The PCRE2_MULTILINE option is set when the regular expression is passed
for compilation to the native function. Note that this does not mimic
the defined POSIX behaviour for REG_NEWLINE (see the following
section).

  REG_NOSPEC

The PCRE2_LITERAL option is set when the regular expression is passed
for compilation to the native function. This disables all meta
characters in the pattern, causing it to be treated as a literal
string. The only other options that are allowed with REG_NOSPEC are
REG_ICASE, REG_NOSUB, REG_PEND, and REG_UTF. Note that REG_NOSPEC is
not part of the POSIX standard.

  REG_NOSUB

When a pattern that is compiled with this flag is passed to
pcre2_regexec() for matching, the nmatch and pmatch arguments are
ignored, and no captured strings are returned. Versions of the PCRE2
library prior to 10.22 used to set the PCRE2_NO_AUTO_CAPTURE compile
option, but this no longer happens because it disables the use of
backreferences.

  REG_PEND

If this option is set, the re_endp field in the preg structure (which
has the type const char *) must be set to point to the character beyond
the end of the pattern before calling pcre2_regcomp(). The pattern
itself may now contain binary zeros, which are treated as data
characters. Without REG_PEND, a binary zero terminates the pattern and
the re_endp field is ignored. This is a GNU extension to the POSIX
standard and should be used with caution in software intended to be
portable to other systems.

  REG_UCP

The PCRE2_UCP option is set when the regular expression is passed for
compilation to the native function. This causes PCRE2 to use Unicode
properties when matching \d, \w, etc., instead of just recognizing
ASCII values. Note that REG_UCP is not part of the POSIX standard.

  REG_UNGREEDY

The PCRE2_UNGREEDY option is set when the regular expression is passed
for compilation to the native function. Note that REG_UNGREEDY is not
part of the POSIX standard.

  REG_UTF

The PCRE2_UTF option is set when the regular expression is passed for
compilation to the native function. This causes the pattern itself and
all data strings used for matching it to be treated as UTF-8 strings.
Note that REG_UTF is not part of the POSIX standard.

In the absence of these flags, no options are passed to the native
function. This means that the regex is compiled with PCRE2 default
semantics. In particular, the way it handles newline characters in the
subject string is the Perl way, not the POSIX way. Note that setting
PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE.
It does not affect the way newlines are matched by the dot
metacharacter (they are not) or by a negative class such as [^a] (they
are).

The yield of pcre2_regcomp() is zero on success, and non-zero
otherwise. The preg structure is filled in on success, and one other
member of the structure (as well as re_endp) is public: re_nsub
contains the number of capturing subpatterns in the regular expression.
Various error codes are defined in the header file.

NOTE: If the yield of pcre2_regcomp() is non-zero, you must not attempt
to use the contents of the preg structure. If, for example, you pass it
to pcre2_regexec(), the result is undefined and your program is likely
to crash.

MATCHING NEWLINE CHARACTERS

This area is not simple, because POSIX and Perl take different views of
things. It is not possible to get PCRE2 to obey POSIX semantics, but
then PCRE2 was never intended to be a POSIX engine. The following table
lists the different possibilities for matching newline characters in
Perl and PCRE2:

                            Default   Change with

  . matches newline           no      PCRE2_DOTALL
  newline matches [^a]        yes     not changeable
  $ matches \n at end         yes     PCRE2_DOLLAR_ENDONLY
  $ matches \n in middle      no      PCRE2_MULTILINE
  ^ matches \n in middle      no      PCRE2_MULTILINE

This is the equivalent table for a POSIX-compatible pattern matcher:

                            Default   Change with

  . matches newline           yes     REG_NEWLINE
  newline matches [^a]        yes     REG_NEWLINE
  $ matches \n at end         no      REG_NEWLINE
  $ matches \n in middle      no      REG_NEWLINE
  ^ matches \n in middle      no      REG_NEWLINE

This behaviour is not what happens when PCRE2 is called via its POSIX
API. By default, PCRE2's behaviour is the same as Perl's, except that
there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2
and Perl, there is no way to stop newline from matching [^a].

Default POSIX newline handling can be obtained by setting PCRE2_DOTALL
and PCRE2_DOLLAR_ENDONLY when calling pcre2_compile() directly, but
there is no way to make PCRE2 behave exactly as for the REG_NEWLINE
action. When using the POSIX API, passing REG_NEWLINE to PCRE2's
pcre2_regcomp() function causes PCRE2_MULTILINE to be passed to
pcre2_compile(), and REG_DOTALL passes PCRE2_DOTALL. There is no way to
pass PCRE2_DOLLAR_ENDONLY.

MATCHING A PATTERN

The function pcre2_regexec() is called to match a compiled pattern preg
against a given string, which is by default terminated by a zero byte
(but see REG_STARTEND below), subject to the options in eflags. These
can be:

  REG_NOTBOL

The PCRE2_NOTBOL option is set when calling the underlying PCRE2
matching function.

  REG_NOTEMPTY

The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2
matching function. Note that REG_NOTEMPTY is not part of the POSIX
standard. However, setting this option can give more POSIX-like
behaviour in some situations.

  REG_NOTEOL

The PCRE2_NOTEOL option is set when calling the underlying PCRE2
matching function.

  REG_STARTEND

When this option is set, the subject string starts at string +
pmatch[0].rm_so and ends at string + pmatch[0].rm_eo, which should
point to the first character beyond the string. There may be binary
zeros within the subject string, and indeed, using REG_STARTEND is the
only way to pass a subject string that contains a binary zero.

Whatever the value of pmatch[0].rm_so, the offsets of the matched
string and any captured substrings are still given relative to the
start of string itself. (Before PCRE2 release 10.30 these were given
relative to string + pmatch[0].rm_so, but this differs from other
implementations.)

This is a BSD extension, compatible with but not specified by IEEE
Standard 1003.2 (POSIX.2), and should be used with caution in software
intended to be portable to other systems. Note that a non-zero rm_so
does not imply REG_NOTBOL; REG_STARTEND affects only the location and
length of the string, not how it is matched. REG_STARTEND cannot be
combined with a NULL pmatch argument; if it is, the error REG_INVARG is
returned.

If the pattern was compiled with the REG_NOSUB flag, no data about any
matched strings is returned. The nmatch and pmatch arguments of
pcre2_regexec() are ignored (except possibly as input for
REG_STARTEND).

The value of nmatch may be zero, and the value of pmatch may be NULL
(unless REG_STARTEND is set); in both these cases no data about any
matched strings is returned.

Otherwise, the portion of the string that was matched, and also any
captured substrings, are returned via the pmatch argument, which points
to an array of nmatch structures of type regmatch_t, containing the
members rm_so and rm_eo. These contain the byte offset to the first
character of each substring and the offset to the first character after
the end of each substring, respectively. The 0th element of the vector
relates to the entire portion of string that was matched; subsequent
elements relate to the capturing subpatterns of the regular expression.
Unused entries in the array have both structure members set to -1.

A successful match yields a zero return; various error codes are
defined in the header file, of which REG_NOMATCH is the "expected"
failure code.

ERROR MESSAGES

The pcre2_regerror() function maps a non-zero error code from either
pcre2_regcomp() or pcre2_regexec() to a printable message. If preg is
not NULL, the error should have arisen from the use of that structure.
A message terminated by a binary zero is placed in errbuf. If the
buffer is too short, only the first errbuf_size - 1 characters of the
error message are used. The yield of the function is the size of buffer
needed to hold the whole message, including the terminating zero. This
value is greater than errbuf_size if the message was truncated.

MEMORY USAGE

Compiling a regular expression causes memory to be allocated and
associated with the preg structure. The function pcre2_regfree() frees
all such memory, after which preg may no longer be used as a compiled
expression.

AUTHOR

Philip Hazel
University Computing Service
Cambridge, England.

REVISION

Last updated: 30 January 2019
Copyright (c) 1997-2019 University of Cambridge.

PCRE2 10.33                     30 January 2019                  PCRE2POSIX(3)