pcre(3) - f14

PCRE(3)                    Library Functions Manual                    PCRE(3)

NAME

       PCRE - Perl-compatible regular expressions

INTRODUCTION

       The PCRE library is a set of functions that implement regular
       expression pattern matching using the same syntax and semantics
       as Perl, with just a few differences. Some features that appeared
       in Python and PCRE before they appeared in Perl are also
       available using the Python syntax. There is also some support for
       one or two .NET and Oniguruma syntax items, and there is an
       option for requesting some minor changes that give better
       JavaScript compatibility.

       The current implementation of PCRE corresponds approximately with
       Perl 5.10/5.11, including support for UTF-8 encoded strings and
       Unicode general category properties. However, UTF-8 and Unicode
       support has to be explicitly enabled; it is not the default. The
       Unicode tables correspond to Unicode release 5.2.0.

       In addition to the Perl-compatible matching function, PCRE
       contains an alternative function that matches the same compiled
       patterns in a different way. In certain circumstances, the
       alternative function has some advantages. For a discussion of the
       two matching algorithms, see the pcrematching page.

       PCRE is written in C and released as a C library. A number of
       people have written wrappers and interfaces of various kinds. In
       particular, Google Inc. have provided a comprehensive C++
       wrapper. This is now included as part of the PCRE distribution.
       The pcrecpp page has details of this interface. Other people's
       contributions can be found in the Contrib directory at the
       primary FTP site, which is:

       ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre

       Details of exactly which Perl regular expression features are and
       are not supported by PCRE are given in separate documents. See
       the pcrepattern and pcrecompat pages. There is a syntax summary
       in the pcresyntax page.

       Some features of PCRE can be included, excluded, or changed when
       the library is built. The pcre_config() function makes it
       possible for a client to discover which features are available.
       The features themselves are described in the pcrebuild page.
       Documentation about building PCRE for various operating systems
       can be found in the README and NON-UNIX-USE files in the source
       distribution.

       The library contains a number of undocumented internal functions
       and data tables that are used by more than one of the exported
       external functions, but which are not intended for use by
       external callers. Their names all begin with "_pcre_", which
       hopefully will not provoke any name clashes. In some
       environments, it is possible to control which external symbols
       are exported when a shared library is built, and in these cases
       the undocumented symbols are not exported.

USER DOCUMENTATION

       The user documentation for PCRE comprises a number of different
       sections. In the "man" format, each of these is a separate "man
       page". In the HTML format, each is a separate page, linked from
       the index page. In the plain text format, all the sections,
       except the pcredemo section, are concatenated, for ease of
       searching. The sections are as follows:

         pcre              this document
         pcre-config       show PCRE installation configuration information
         pcreapi           details of PCRE's native C API
         pcrebuild         options for building PCRE
         pcrecallout       details of the callout feature
         pcrecompat        discussion of Perl compatibility
         pcrecpp           details of the C++ wrapper
         pcredemo          a demonstration C program that uses PCRE
         pcregrep          description of the pcregrep command
         pcrematching      discussion of the two matching algorithms
         pcrepartial       details of the partial matching facility
         pcrepattern       syntax and semantics of supported
                             regular expressions
         pcreperform       discussion of performance issues
         pcreposix         the POSIX-compatible C API
         pcreprecompile    details of saving and re-using precompiled patterns
         pcresample        discussion of the pcredemo program
         pcrestack         discussion of stack usage
         pcresyntax        quick syntax reference
         pcretest          description of the pcretest testing command

       In addition, in the "man" and HTML formats, there is a short page
       for each C library function, listing its arguments and results.

LIMITATIONS

       There are some size limitations in PCRE, but it is hoped that
       they will never in practice be relevant.

       The maximum length of a compiled pattern is 65539 (sic) bytes if
       PCRE is compiled with the default internal linkage size of 2. If
       you want to process regular expressions that are truly enormous,
       you can compile PCRE with an internal linkage size of 3 or 4 (see
       the README file in the source distribution and the pcrebuild
       documentation for details). In these cases the limit is
       substantially larger. However, the speed of execution is slower.

       All values in repeating quantifiers must be less than 65536.

       There is no limit to the number of parenthesized subpatterns, but
       there can be no more than 65535 capturing subpatterns.

       The maximum length of a name for a named subpattern is 32
       characters, and the maximum number of named subpatterns is 10000.

       The maximum length of a subject string is the largest positive
       number that an integer variable can hold. However, when using the
       traditional matching function, PCRE uses recursion to handle
       subpatterns and indefinite repetition. This means that the
       available stack space may limit the size of a subject string that
       can be processed by certain patterns. For a discussion of stack
       issues, see the pcrestack documentation.

UTF-8 AND UNICODE PROPERTY SUPPORT

       From release 3.3, PCRE has had some support for character strings
       encoded in the UTF-8 format. For release 4.0 this was greatly
       extended to cover most common requirements, and in release 5.0
       additional support for Unicode general category properties was
       added.

       In order to process UTF-8 strings, you must build PCRE to include
       UTF-8 support in the code, and, in addition, you must call
       pcre_compile() with the PCRE_UTF8 option flag, or the pattern
       must start with the sequence (*UTF8). When either of these is the
       case, both the pattern and any subject strings that are matched
       against it are treated as UTF-8 strings instead of strings of
       1-byte characters.

       If you compile PCRE with UTF-8 support, but do not use it at run
       time, the library will be a bit bigger, but the additional run
       time overhead is limited to testing the PCRE_UTF8 flag
       occasionally, so it should not be very big.

       If PCRE is built with Unicode character property support (which
       implies UTF-8 support), the escape sequences \p{..}, \P{..}, and
       \X are supported. The available properties that can be tested are
       limited to the general category properties such as Lu for an
       upper case letter or Nd for a decimal number, the Unicode script
       names such as Arabic or Han, and the derived properties Any and
       L&. A full list is given in the pcrepattern documentation. Only
       the short names for properties are supported. For example, \p{L}
       matches a letter. Its Perl synonym, \p{Letter}, is not supported.
       Furthermore, in Perl, many properties may optionally be prefixed
       by "Is", for compatibility with Perl 5.6. PCRE does not support
       this.

   Validity of UTF-8 strings

       When you set the PCRE_UTF8 flag, the strings passed as patterns
       and subjects are (by default) checked for validity on entry to
       the relevant functions. From release 7.3 of PCRE, the check is
       according to the rules of RFC 3629, which are themselves derived
       from the Unicode specification. Earlier releases of PCRE followed
       the rules of RFC 2279, which allows the full range of 31-bit
       values (0 to 0x7FFFFFFF). The current check allows only values in
       the range U+0 to U+10FFFF, excluding U+D800 to U+DFFF.

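       The range rules above can be illustrated with a small standalone
       validator. This is a sketch written for this page, not PCRE's
       internal check; the function name is hypothetical. Besides the
       range and surrogate tests described above, it rejects overlong
       encodings, which RFC 3629 also forbids.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not PCRE code: returns 1 if the buffer is valid
 * UTF-8 under RFC 3629, otherwise 0. */
int utf8_is_valid_rfc3629(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);

    while (i < len) {
        unsigned char c = s[i];
        uint32_t cp, min;
        size_t extra, k;

        if (c < 0x80)                { cp = c;        extra = 0; min = 0; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; min = 0x10000; }
        else return 0;  /* stray continuation byte, or a 5/6-byte lead */

        if (extra > len - i - 1)
            return 0;                       /* truncated sequence */

        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;                   /* not a continuation byte */
            cp = (cp << 6) | (uint32_t)(s[i + k] & 0x3F);
        }

        if (cp < min)
            return 0;                       /* overlong encoding */
        if (cp > 0x10FFFF)
            return 0;                       /* beyond U+10FFFF */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogate */

        i += extra + 1;
    }
    return 1;
}
```

       For example, the three bytes ED A0 80 (an encoding of U+D800) and
       the four bytes F4 90 80 80 (an encoding of 0x110000) are both
       rejected, while F4 8F BF BF (U+10FFFF) is accepted.
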
       The excluded code points are the Unicode surrogates: the high
       surrogates (U+D800 to U+DBFF) and the low surrogates (U+DC00 to
       U+DFFF). Of the Low Surrogate Area, the Unicode Standard says
       this: "The Low Surrogate Area does not contain any character
       assignments, consequently no character code charts or namelists
       are provided for this area. Surrogates are reserved for use with
       UTF-16 and then must be used in pairs." The code points that are
       encoded by UTF-16 pairs are available as independent code points
       in the UTF-8 encoding. (In other words, the whole surrogate thing
       is a fudge for UTF-16 which unfortunately messes up UTF-8.)

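       To make the pairing concrete, the following sketch combines a
       UTF-16 surrogate pair into the single code point that UTF-8
       encodes directly. It is an illustration only and uses no PCRE
       functions; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: combine a UTF-16 surrogate pair into one code
 * point. Returns 1 and stores the result in *cp on success, or 0 if
 * the two values do not form a valid high/low pair. */
int utf16_pair_to_code_point(uint16_t high, uint16_t low, uint32_t *cp)
{
    assert(cp != NULL);

    if (high < 0xD800 || high > 0xDBFF)
        return 0;                   /* not a high surrogate */
    if (low < 0xDC00 || low > 0xDFFF)
        return 0;                   /* not a low surrogate */

    /* Each surrogate carries 10 bits; the pair addresses the code
     * points from U+10000 to U+10FFFF. */
    *cp = 0x10000u + (((uint32_t)(high - 0xD800) << 10)
                      | (uint32_t)(low - 0xDC00));
    return 1;
}
```

       For example, the pair D83D DE00 yields U+1F600, which a UTF-8
       string holds as one four-byte character, so the surrogate values
       themselves never need to appear in valid UTF-8.
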
       If an invalid UTF-8 string is passed to PCRE, an error return
       (PCRE_ERROR_BADUTF8) is given. In some situations, you may
       already know that your strings are valid, and therefore want to
       skip these checks in order to improve performance. If you set the
       PCRE_NO_UTF8_CHECK flag at compile time or at run time, PCRE
       assumes that the pattern or subject it is given (respectively)
       contains only valid UTF-8 codes. In this case, it does not
       diagnose an invalid UTF-8 string.

       If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is
       set, what happens depends on why the string is invalid. If the
       string conforms to the "old" definition of UTF-8 (RFC 2279), it
       is processed as a string of characters in the range 0 to
       0x7FFFFFFF. In other words, apart from the initial validity test,
       PCRE (when in UTF-8 mode) handles strings according to the more
       liberal rules of RFC 2279. However, if the string does not even
       conform to RFC 2279, the result is undefined. Your program may
       crash.

       If you want to process strings of values in the full range 0 to
       0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC,
       you can set PCRE_NO_UTF8_CHECK to bypass the more restrictive
       test. However, in this situation, you will have to apply your own
       validity check.

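       A validity check of your own for the old encoding might start
       from something like the sketch below. It is an assumption about
       what such a check could look like, not code taken from PCRE, and
       the function name is hypothetical. It tests structure only (lead
       bytes announcing sequences of one to six bytes, each followed by
       the right number of continuation bytes) and deliberately does
       not reject surrogates, large values, or overlong forms.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the buffer is structurally
 * well-formed under the old RFC 2279 scheme (sequences of 1 to 6
 * bytes, values 0 to 0x7FFFFFFF), otherwise 0. */
int utf8_is_wellformed_rfc2279(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);

    while (i < len) {
        unsigned char c = s[i];
        size_t extra, k;

        if (c < 0x80)                extra = 0;
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3;
        else if ((c & 0xFC) == 0xF8) extra = 4;
        else if ((c & 0xFE) == 0xFC) extra = 5;
        else return 0;  /* continuation byte as lead, or 0xFE/0xFF */

        if (extra > len - i - 1)
            return 0;                       /* truncated sequence */

        for (k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;                   /* not a continuation byte */

        i += extra + 1;
    }
    return 1;
}
```

       Under this liberal check the six bytes FD BF BF BF BF BF (the
       value 0x7FFFFFFF) are accepted, whereas a lone 0x80 or a
       truncated sequence is rejected.
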
   General comments about UTF-8 mode

       1. An unbraced hexadecimal escape sequence (such as \xb3) matches
       a two-byte UTF-8 character if the value is greater than 127.

       2. Octal numbers up to \777 are recognized, and match two-byte
       UTF-8 characters for values greater than \177.

       3. Repeat quantifiers apply to complete UTF-8 characters, not to
       individual bytes, for example: \x{100}{3}.

       4. The dot metacharacter matches one UTF-8 character instead of a
       single byte.

       5. The escape sequence \C can be used to match a single byte in
       UTF-8 mode, but its use can lead to some strange effects. This
       facility is not available in the alternative matching function,
       pcre_dfa_exec().

       6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W
       correctly test characters of any code value, but, by default, the
       characters that PCRE recognizes as digits, spaces, or word
       characters remain the same set as before, all with values less
       than 256. This remains true even when PCRE is built to include
       Unicode property support, because to do otherwise would slow down
       PCRE in many common cases. Note that this also applies to \b,
       because it is defined in terms of \w and \W. If you really want
       to test for a wider sense of, say, "digit", you can use explicit
       Unicode property tests such as \p{Nd}. Alternatively, if you set
       the PCRE_UCP option, the way that the character escapes work is
       changed so that Unicode properties are used to determine which
       characters match. There are more details in the section on
       generic character types in the pcrepattern documentation.

       7. Similarly, characters that match the POSIX named character
       classes are all low-valued characters, unless the PCRE_UCP option
       is set.

       8. However, the Perl 5.10 horizontal and vertical whitespace
       matching escapes (\h, \H, \v, and \V) do match all the
       appropriate Unicode characters, whether or not PCRE_UCP is set.

       9. Case-insensitive matching applies only to characters whose
       values are less than 128, unless PCRE is built with Unicode
       property support. Even when Unicode property support is
       available, PCRE still uses its own character tables when checking
       the case of low-valued characters, so as not to degrade
       performance. The Unicode property information is used only for
       characters with higher values. Even when Unicode property support
       is available, PCRE supports case-insensitive matching only when
       there is a one-to-one mapping between a letter's cases. There are
       a small number of many-to-one mappings in Unicode; these are not
       supported by PCRE.
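
       Items 1 and 2 above follow from the way UTF-8 encodes values:
       anything from 128 to 2047 occupies two bytes. The sketch below
       shows that encoding. It is a standalone illustration, not PCRE
       code, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE: encode one code point as
 * UTF-8 under RFC 3629 rules. Writes up to 4 bytes to out and returns
 * the byte count, or 0 if cp is a surrogate or exceeds 0x10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);

    if (cp < 0x80) {                    /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                 /* 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* surrogates are excluded */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {               /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

       For example, utf8_encode(0xB3, buf) returns 2 and stores the
       bytes C2 B3, so in UTF-8 mode \xb3 matches that two-byte
       character. The largest octal escape, \777 (0x1FF), encodes as the
       two bytes C7 BF.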

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.

       Putting an actual email address here seems to have been a spam
       magnet, so I've taken it away. If you want to email me, use my
       two initials, followed by the two digits 10, at the domain
       cam.ac.uk.

REVISION

       Last updated: 12 May 2010
       Copyright (c) 1997-2010 University of Cambridge.