PCREPOSIX(3)               Library Functions Manual               PCREPOSIX(3)

NAME

       PCRE - Perl-compatible regular expressions.

SYNOPSIS OF POSIX API

       #include <pcreposix.h>

       int regcomp(regex_t *preg, const char *pattern,
            int cflags);

       int regexec(regex_t *preg, const char *string,
            size_t nmatch, regmatch_t pmatch[], int eflags);

       size_t regerror(int errcode, const regex_t *preg,
            char *errbuf, size_t errbuf_size);

       void regfree(regex_t *preg);

DESCRIPTION

       This set of functions provides a POSIX-style API to the PCRE regular
       expression package. See the pcreapi documentation for a description of
       PCRE's native API, which contains much additional functionality.

       The functions described here are just wrapper functions that ultimately
       call the PCRE native API. Their prototypes are defined in the
       pcreposix.h header file, and on Unix systems the library itself is
       called pcreposix.a, so can be accessed by adding -lpcreposix to the
       command for linking an application that uses them. Because the POSIX
       functions call the native ones, it is also necessary to add -lpcre.

       I have implemented only those POSIX option bits that can be reasonably
       mapped to PCRE native options. In addition, the option REG_EXTENDED is
       defined with the value zero. This has no effect, but since programs
       that are written to the POSIX interface often use it, this makes it
       easier to slot in PCRE as a replacement library. Other POSIX options
       are not even defined.

       There are also some other options that are not defined by POSIX. These
       have been added at the request of users who want to make use of certain
       PCRE-specific features via the POSIX calling interface.

       When PCRE is called via these functions, it is only the API that is
       POSIX-like in style. The syntax and semantics of the regular
       expressions themselves are still those of Perl, subject to the setting
       of various PCRE options, as described below. "POSIX-like in style"
       means that the API approximates to the POSIX definition; it is not
       fully POSIX-compatible, and in multi-byte encoding domains it is
       probably even less compatible.

       The header for these functions is supplied as pcreposix.h to avoid any
       potential clash with other POSIX libraries. It can, of course, be
       renamed or aliased as regex.h, which is the "correct" name. It provides
       two structure types, regex_t for compiled internal forms, and
       regmatch_t for returning captured substrings. It also defines some
       constants whose names start with "REG_"; these are used for setting
       options and identifying error codes.


COMPILING A PATTERN

       The function regcomp() is called to compile a pattern into an internal
       form. The pattern is a C string terminated by a binary zero, and is
       passed in the argument pattern. The preg argument is a pointer to a
       regex_t structure that is used as a base for storing information about
       the compiled regular expression.

       The argument cflags is either zero, or contains one or more of the bits
       defined by the following macros:

         REG_DOTALL

       The PCRE_DOTALL option is set when the regular expression is passed for
       compilation to the native function. Note that REG_DOTALL is not part of
       the POSIX standard.

         REG_ICASE

       The PCRE_CASELESS option is set when the regular expression is passed
       for compilation to the native function.

         REG_NEWLINE

       The PCRE_MULTILINE option is set when the regular expression is passed
       for compilation to the native function. Note that this does not mimic
       the defined POSIX behaviour for REG_NEWLINE (see the following
       section).

         REG_NOSUB

       The PCRE_NO_AUTO_CAPTURE option is set when the regular expression is
       passed for compilation to the native function. In addition, when a
       pattern that is compiled with this flag is passed to regexec() for
       matching, the nmatch and pmatch arguments are ignored, and no captured
       strings are returned.

         REG_UCP

       The PCRE_UCP option is set when the regular expression is passed for
       compilation to the native function. This causes PCRE to use Unicode
       properties when matching \d, \w, etc., instead of just recognizing
       ASCII values. Note that REG_UCP is not part of the POSIX standard.

         REG_UNGREEDY

       The PCRE_UNGREEDY option is set when the regular expression is passed
       for compilation to the native function. Note that REG_UNGREEDY is not
       part of the POSIX standard.

         REG_UTF8

       The PCRE_UTF8 option is set when the regular expression is passed for
       compilation to the native function. This causes the pattern itself and
       all data strings used for matching it to be treated as UTF-8 strings.
       Note that REG_UTF8 is not part of the POSIX standard.

       In the absence of these flags, no options are passed to the native
       function. This means that the regex is compiled with PCRE default
       semantics. In particular, the way it handles newline characters in the
       subject string is the Perl way, not the POSIX way. Note that setting
       PCRE_MULTILINE has only some of the effects specified for REG_NEWLINE.
       It does not affect the way newlines are matched by . (they are not) or
       by a negative class such as [^a] (they are).

       The yield of regcomp() is zero on success, and non-zero otherwise. The
       preg structure is filled in on success, and one member of the structure
       is public: re_nsub contains the number of capturing subpatterns in the
       regular expression. Various error codes are defined in the header file.

       NOTE: If the yield of regcomp() is non-zero, you must not attempt to
       use the contents of the preg structure. If, for example, you pass it to
       regexec(), the result is undefined and your program is likely to crash.


MATCHING NEWLINE CHARACTERS

       This area is not simple, because POSIX and Perl take different views of
       things. It is not possible to get PCRE to obey POSIX semantics, but
       then PCRE was never intended to be a POSIX engine. The following table
       lists the different possibilities for matching newline characters in
       PCRE:

                                 Default   Change with

         . matches newline          no     PCRE_DOTALL
         newline matches [^a]       yes    not changeable
         $ matches \n at end        yes    PCRE_DOLLAR_ENDONLY
         $ matches \n in middle     no     PCRE_MULTILINE
         ^ matches \n in middle     no     PCRE_MULTILINE

       This is the equivalent table for POSIX:

                                 Default   Change with

         . matches newline          yes    REG_NEWLINE
         newline matches [^a]       yes    REG_NEWLINE
         $ matches \n at end        no     REG_NEWLINE
         $ matches \n in middle     no     REG_NEWLINE
         ^ matches \n in middle     no     REG_NEWLINE

       PCRE's behaviour is the same as Perl's, except that there is no
       equivalent for PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl,
       there is no way to stop newline from matching [^a].

       The default POSIX newline handling can be obtained by setting
       PCRE_DOTALL and PCRE_DOLLAR_ENDONLY, but there is no way to make PCRE
       behave exactly as for the REG_NEWLINE action.


MATCHING A PATTERN

       The function regexec() is called to match a compiled pattern preg
       against a given string, which is by default terminated by a zero byte
       (but see REG_STARTEND below), subject to the options in eflags. These
       can be:

         REG_NOTBOL

       The PCRE_NOTBOL option is set when calling the underlying PCRE matching
       function.

         REG_NOTEMPTY

       The PCRE_NOTEMPTY option is set when calling the underlying PCRE
       matching function. Note that REG_NOTEMPTY is not part of the POSIX
       standard. However, setting this option can give more POSIX-like
       behaviour in some situations.

         REG_NOTEOL

       The PCRE_NOTEOL option is set when calling the underlying PCRE matching
       function.

         REG_STARTEND

       The string is considered to start at string + pmatch[0].rm_so and to
       have a terminating NUL located at string + pmatch[0].rm_eo (there need
       not actually be a NUL at that location), regardless of the value of
       nmatch. This is a BSD extension, compatible with but not specified by
       IEEE Standard 1003.2 (POSIX.2), and should be used with caution in
       software intended to be portable to other systems. Note that a non-zero
       rm_so does not imply REG_NOTBOL; REG_STARTEND affects only the location
       of the string, not how it is matched.

       If the pattern was compiled with the REG_NOSUB flag, no data about any
       matched strings is returned. The nmatch and pmatch arguments of
       regexec() are ignored.

       If the value of nmatch is zero, or if the value of pmatch is NULL, no
       data about any matched strings is returned.

       Otherwise, the portion of the string that was matched, and also any
       captured substrings, are returned via the pmatch argument, which points
       to an array of nmatch structures of type regmatch_t, containing the
       members rm_so and rm_eo. These contain the offset to the first
       character of each substring and the offset to the first character after
       the end of each substring, respectively. The 0th element of the vector
       relates to the entire portion of string that was matched; subsequent
       elements relate to the capturing subpatterns of the regular expression.
       Unused entries in the array have both structure members set to -1.

       A successful match yields a zero return; various error codes are
       defined in the header file, of which REG_NOMATCH is the "expected"
       failure code.

ERROR MESSAGES

       The regerror() function maps a non-zero error code from either
       regcomp() or regexec() to a printable message. If preg is not NULL, the
       error should have arisen from the use of that structure. A message
       terminated by a binary zero is placed in errbuf. The length of the
       message, including the zero, is limited to errbuf_size. The yield of
       the function is the size of buffer needed to hold the whole message.


MEMORY USAGE

       Compiling a regular expression causes memory to be allocated and
       associated with the preg structure. The function regfree() frees all
       such memory, after which preg may no longer be used as a compiled
       expression.

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.

REVISION

       Last updated: 16 May 2010
       Copyright (c) 1997-2010 University of Cambridge.

                                                                  PCREPOSIX(3)