PCRE2POSIX(3)              Library Functions Manual              PCRE2POSIX(3)

NAME

       PCRE2 - Perl-compatible regular expressions (revised API)

SYNOPSIS

       #include <pcre2posix.h>

       int pcre2_regcomp(regex_t *preg, const char *pattern,
            int cflags);

       int pcre2_regexec(const regex_t *preg, const char *string,
            size_t nmatch, regmatch_t pmatch[], int eflags);

       size_t pcre2_regerror(int errcode, const regex_t *preg,
            char *errbuf, size_t errbuf_size);

       void pcre2_regfree(regex_t *preg);

DESCRIPTION

       This set of functions provides a POSIX-style API for the PCRE2 regular
       expression 8-bit library. There are no POSIX-style wrappers for PCRE2's
       16-bit and 32-bit libraries. See the pcre2api documentation for a
       description of PCRE2's native API, which contains much additional
       functionality.

       The functions described here are wrapper functions that ultimately call
       the PCRE2 native API. Their prototypes are defined in the pcre2posix.h
       header file, and they all have unique names starting with pcre2_.
       However, the pcre2posix.h header also contains macro definitions that
       convert the standard POSIX names such as regcomp() into pcre2_regcomp()
       etc. This means that a program can use the usual POSIX names without
       running the risk of accidentally linking with POSIX functions from a
       different library.

       On Unix-like systems the PCRE2 POSIX library is called libpcre2-posix,
       so it can be accessed by adding -lpcre2-posix to the command for
       linking an application. Because the POSIX functions call the native
       ones, it is also necessary to add -lpcre2-8.

       Although they are not defined as prototypes in pcre2posix.h, the
       library does contain functions with the POSIX names regcomp() etc.
       These simply pass their arguments to the PCRE2 functions. They are
       provided for backwards compatibility with earlier versions of PCRE2,
       so that existing programs do not have to be recompiled.

       Calling the header file pcre2posix.h avoids any conflict with other
       POSIX libraries. It can, of course, be renamed or aliased as regex.h,
       which is the "correct" name, if there is no clash. It provides two
       structure types, regex_t for compiled internal forms, and regmatch_t
       for returning captured substrings. It also defines some constants whose
       names start with "REG_"; these are used for setting options and
       identifying error codes.

USING THE POSIX FUNCTIONS

       Those POSIX option bits that can reasonably be mapped to PCRE2 native
       options have been implemented. In addition, the option REG_EXTENDED is
       defined with the value zero. This has no effect, but since programs
       that are written to the POSIX interface often use it, this makes it
       easier to slot in PCRE2 as a replacement library. Other POSIX options
       are not even defined.

       There are also some options that are not defined by POSIX. These have
       been added at the request of users who want to make use of certain
       PCRE2-specific features via the POSIX calling interface or to add BSD
       or GNU functionality.

       When PCRE2 is called via these functions, it is only the API that is
       POSIX-like in style. The syntax and semantics of the regular
       expressions themselves are still those of Perl, subject to the setting
       of various PCRE2 options, as described below. "POSIX-like in style"
       means that the API approximates to the POSIX definition; it is not
       fully POSIX-compatible, and in multi-unit encoding domains it is
       probably even less compatible.

       The descriptions below use the actual names of the functions, but, as
       described above, the standard POSIX names (without the pcre2_ prefix)
       may also be used.

COMPILING A PATTERN

       The function pcre2_regcomp() is called to compile a pattern into an
       internal form. By default, the pattern is a C string terminated by a
       binary zero (but see REG_PEND below). The preg argument is a pointer to
       a regex_t structure that is used as a base for storing information
       about the compiled regular expression. (It is also used for input when
       REG_PEND is set.)

       The argument cflags is either zero, or contains one or more of the bits
       defined by the following macros:

         REG_DOTALL

       The PCRE2_DOTALL option is set when the regular expression is passed
       for compilation to the native function. Note that REG_DOTALL is not
       part of the POSIX standard.

         REG_ICASE

       The PCRE2_CASELESS option is set when the regular expression is passed
       for compilation to the native function.

         REG_NEWLINE

       The PCRE2_MULTILINE option is set when the regular expression is passed
       for compilation to the native function. Note that this does not mimic
       the defined POSIX behaviour for REG_NEWLINE (see the following
       section).

         REG_NOSPEC

       The PCRE2_LITERAL option is set when the regular expression is passed
       for compilation to the native function. This disables all
       metacharacters in the pattern, causing it to be treated as a literal
       string. The only other options that are allowed with REG_NOSPEC are
       REG_ICASE, REG_NOSUB, REG_PEND, and REG_UTF. Note that REG_NOSPEC is
       not part of the POSIX standard.

         REG_NOSUB

       When a pattern that is compiled with this flag is passed to
       pcre2_regexec() for matching, the nmatch and pmatch arguments are
       ignored, and no captured strings are returned. Versions of the PCRE2
       library prior to 10.22 used to set the PCRE2_NO_AUTO_CAPTURE compile
       option, but this no longer happens because it disables the use of
       backreferences.

         REG_PEND

       If this option is set, the re_endp field in the preg structure (which
       has the type const char *) must be set to point to the character beyond
       the end of the pattern before calling pcre2_regcomp(). The pattern
       itself may now contain binary zeros, which are treated as data
       characters. Without REG_PEND, a binary zero terminates the pattern and
       the re_endp field is ignored. This is a GNU extension to the POSIX
       standard and should be used with caution in software intended to be
       portable to other systems.

         REG_UCP

       The PCRE2_UCP option is set when the regular expression is passed for
       compilation to the native function. This causes PCRE2 to use Unicode
       properties when matching \d, \w, etc., instead of just recognizing
       ASCII values. Note that REG_UCP is not part of the POSIX standard.

         REG_UNGREEDY

       The PCRE2_UNGREEDY option is set when the regular expression is passed
       for compilation to the native function. Note that REG_UNGREEDY is not
       part of the POSIX standard.

         REG_UTF

       The PCRE2_UTF option is set when the regular expression is passed for
       compilation to the native function. This causes the pattern itself and
       all data strings used for matching it to be treated as UTF-8 strings.
       Note that REG_UTF is not part of the POSIX standard.

       In the absence of these flags, no options are passed to the native
       function. This means that the regex is compiled with PCRE2 default
       semantics. In particular, the way it handles newline characters in the
       subject string is the Perl way, not the POSIX way. Note that setting
       PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE.
       It does not affect the way newlines are matched by the dot
       metacharacter (they are not) or by a negative class such as [^a] (they
       are).

       The yield of pcre2_regcomp() is zero on success, and non-zero
       otherwise. The preg structure is filled in on success, and one other
       member of the structure (as well as re_endp) is public: re_nsub
       contains the number of capturing subpatterns in the regular expression.
       Various error codes are defined in the header file.

       NOTE: If the yield of pcre2_regcomp() is non-zero, you must not attempt
       to use the contents of the preg structure. If, for example, you pass it
       to pcre2_regexec(), the result is undefined and your program is likely
       to crash.

MATCHING NEWLINE CHARACTERS

       This area is not simple, because POSIX and Perl take different views of
       things. It is not possible to get PCRE2 to obey POSIX semantics, but
       then PCRE2 was never intended to be a POSIX engine. The following table
       lists the different possibilities for matching newline characters in
       Perl and PCRE2:

                                 Default   Change with

         . matches newline          no     PCRE2_DOTALL
         newline matches [^a]       yes    not changeable
         $ matches \n at end        yes    PCRE2_DOLLAR_ENDONLY
         $ matches \n in middle     no     PCRE2_MULTILINE
         ^ matches \n in middle     no     PCRE2_MULTILINE

       This is the equivalent table for a POSIX-compatible pattern matcher:

                                 Default   Change with

         . matches newline          yes    REG_NEWLINE
         newline matches [^a]       yes    REG_NEWLINE
         $ matches \n at end        no     REG_NEWLINE
         $ matches \n in middle     no     REG_NEWLINE
         ^ matches \n in middle     no     REG_NEWLINE

       This behaviour is not what happens when PCRE2 is called via its POSIX
       API. By default, PCRE2's behaviour is the same as Perl's, except that
       there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2
       and Perl, there is no way to stop newline from matching [^a].

       Default POSIX newline handling can be obtained by setting PCRE2_DOTALL
       and PCRE2_DOLLAR_ENDONLY when calling pcre2_compile() directly, but
       there is no way to make PCRE2 behave exactly as for the REG_NEWLINE
       action. When using the POSIX API, passing REG_NEWLINE to PCRE2's
       pcre2_regcomp() function causes PCRE2_MULTILINE to be passed to
       pcre2_compile(), and REG_DOTALL passes PCRE2_DOTALL. There is no way to
       pass PCRE2_DOLLAR_ENDONLY.

MATCHING A PATTERN

       The function pcre2_regexec() is called to match a compiled pattern preg
       against a given string, which is by default terminated by a zero byte
       (but see REG_STARTEND below), subject to the options in eflags. These
       can be:

         REG_NOTBOL

       The PCRE2_NOTBOL option is set when calling the underlying PCRE2
       matching function.

         REG_NOTEMPTY

       The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2
       matching function. Note that REG_NOTEMPTY is not part of the POSIX
       standard. However, setting this option can give more POSIX-like
       behaviour in some situations.

         REG_NOTEOL

       The PCRE2_NOTEOL option is set when calling the underlying PCRE2
       matching function.

         REG_STARTEND

       When this option is set, the subject string starts at string +
       pmatch[0].rm_so and ends at string + pmatch[0].rm_eo, which should
       point to the first character beyond the string. There may be binary
       zeros within the subject string, and indeed, using REG_STARTEND is the
       only way to pass a subject string that contains a binary zero.

       Whatever the value of pmatch[0].rm_so, the offsets of the matched
       string and any captured substrings are still given relative to the
       start of string itself. (Before PCRE2 release 10.30 these were given
       relative to string + pmatch[0].rm_so, but this differs from other
       implementations.)

       This is a BSD extension, compatible with but not specified by IEEE
       Standard 1003.2 (POSIX.2), and should be used with caution in software
       intended to be portable to other systems. Note that a non-zero rm_so
       does not imply REG_NOTBOL; REG_STARTEND affects only the location and
       length of the string, not how it is matched. Setting REG_STARTEND and
       passing pmatch as NULL are mutually exclusive; the error REG_INVARG is
       returned.

       If the pattern was compiled with the REG_NOSUB flag, no data about any
       matched strings is returned. The nmatch and pmatch arguments of
       pcre2_regexec() are ignored (except possibly as input for
       REG_STARTEND).

       The value of nmatch may be zero, and the value of pmatch may be NULL
       (unless REG_STARTEND is set); in both these cases no data about any
       matched strings is returned.

       Otherwise, the portion of the string that was matched, and also any
       captured substrings, are returned via the pmatch argument, which points
       to an array of nmatch structures of type regmatch_t, containing the
       members rm_so and rm_eo. These contain the byte offset to the first
       character of each substring and the offset to the first character after
       the end of each substring, respectively. The 0th element of the vector
       relates to the entire portion of string that was matched; subsequent
       elements relate to the capturing subpatterns of the regular expression.
       Unused entries in the array have both structure members set to -1.

       A successful match yields a zero return; various error codes are
       defined in the header file, of which REG_NOMATCH is the "expected"
       failure code.

ERROR MESSAGES

       The pcre2_regerror() function maps a non-zero error code from either
       pcre2_regcomp() or pcre2_regexec() to a printable message. If preg is
       not NULL, the error should have arisen from the use of that structure.
       A message terminated by a binary zero is placed in errbuf. If the
       buffer is too short, only the first errbuf_size - 1 characters of the
       error message are used. The yield of the function is the size of buffer
       needed to hold the whole message, including the terminating zero. This
       value is greater than errbuf_size if the message was truncated.

MEMORY USAGE

       Compiling a regular expression causes memory to be allocated and
       associated with the preg structure. The function pcre2_regfree() frees
       all such memory, after which preg may no longer be used as a compiled
       expression.

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.

REVISION

       Last updated: 30 January 2019
       Copyright (c) 1997-2019 University of Cambridge.

PCRE2 10.33                     30 January 2019                  PCRE2POSIX(3)