PCRE(3)                    Library Functions Manual                    PCRE(3)

NAME

       PCRE - Perl-compatible regular expressions

INTRODUCTION

       The  PCRE  library is a set of functions that implement regular expres‐
       sion pattern matching using the same syntax and semantics as Perl, with
       just  a  few  differences. Certain features that appeared in Python and
       PCRE before they appeared in Perl are also available using  the  Python
       syntax.  There is also some support for certain .NET and Oniguruma syn‐
       tax items, and there is an option for  requesting  some  minor  changes
       that give better JavaScript compatibility.

       The  current  implementation of PCRE (release 7.x) corresponds approxi‐
       mately with Perl 5.10, including support for UTF-8 encoded strings  and
       Unicode general category properties. However, UTF-8 and Unicode support
       has to be explicitly enabled; it is not the default. The Unicode tables
       correspond to Unicode release 5.0.0.

       In  addition to the Perl-compatible matching function, PCRE contains an
       alternative matching function that matches the same  compiled  patterns
       in  a different way. In certain circumstances, the alternative function
       has some advantages. For a discussion of the two  matching  algorithms,
       see the pcrematching page.

       PCRE  is  written  in C and released as a C library. A number of people
       have written wrappers and interfaces of various kinds.  In  particular,
       Google  Inc.   have  provided  a comprehensive C++ wrapper. This is now
       included as part of the PCRE distribution. The pcrecpp page has details
       of  this  interface.  Other  people's contributions can be found in the
       Contrib directory at the primary FTP site, which is:

       ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre

       Details of exactly which Perl regular expression features are  and  are
       not supported by PCRE are given in separate documents. See the pcrepat‐
       tern and pcrecompat pages. There is a syntax summary in the  pcresyntax
       page.

       Some  features  of  PCRE can be included, excluded, or changed when the
       library is built. The pcre_config() function makes it  possible  for  a
       client  to  discover  which  features are available. The features them‐
       selves are described in the pcrebuild page. Documentation about  build‐
       ing  PCRE for various operating systems can be found in the README file
       in the source distribution.

       The library contains a number of undocumented  internal  functions  and
       data  tables  that  are  used by more than one of the exported external
       functions, but which are not intended  for  use  by  external  callers.
       Their  names  all begin with "_pcre_", which hopefully will not provoke
       any name clashes. In some environments, it is possible to control which
       external  symbols  are  exported when a shared library is built, and in
       these cases the undocumented symbols are not exported.

USER DOCUMENTATION

       The user documentation for PCRE comprises a number  of  different  sec‐
       tions.  In the "man" format, each of these is a separate "man page". In
       the HTML format, each is a separate page, linked from the  index  page.
       In  the  plain text format, all the sections are concatenated, for ease
       of searching. The sections are as follows:

         pcre              this document
         pcre-config       show PCRE installation configuration information
         pcreapi           details of PCRE's native C API
         pcrebuild         options for building PCRE
         pcrecallout       details of the callout feature
         pcrecompat        discussion of Perl compatibility
         pcrecpp           details of the C++ wrapper
         pcregrep          description of the pcregrep command
         pcrematching      discussion of the two matching algorithms
         pcrepartial       details of the partial matching facility
         pcrepattern       syntax and semantics of supported
                             regular expressions
         pcresyntax        quick syntax reference
         pcreperform       discussion of performance issues
         pcreposix         the POSIX-compatible C API
         pcreprecompile    details of saving and re-using precompiled patterns
         pcresample        discussion of the sample program
         pcrestack         discussion of stack usage
         pcretest          description of the pcretest testing command

       In addition, in the "man" and HTML formats, there is a short  page  for
       each C library function, listing its arguments and results.

LIMITATIONS

       There  are some size limitations in PCRE but it is hoped that they will
       never in practice be relevant.

       The maximum length of a compiled pattern is 65539 (sic) bytes  if  PCRE
       is compiled with the default internal linkage size of 2. If you want to
       process regular expressions that are truly enormous,  you  can  compile
       PCRE  with  an  internal linkage size of 3 or 4 (see the README file in
       the source distribution and the pcrebuild documentation  for  details).
       In  these  cases the limit is substantially larger.  However, the speed
       of execution is slower.

       All values in repeating quantifiers must be less than 65536.

       There is no limit to the number of parenthesized subpatterns, but there
       can be no more than 65535 capturing subpatterns.

       The maximum length of a name for a named subpattern is  32  characters,
       and the maximum number of named subpatterns is 10000.

       The maximum length of a subject string is the largest  positive  number
       that  an integer variable can hold. However, when using the traditional
       matching function, PCRE uses recursion to handle subpatterns and indef‐
       inite  repetition.  This means that the available stack space may limit
       the size of a subject string that can be processed by certain patterns.
       For a discussion of stack issues, see the pcrestack documentation.
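
       A program that generates patterns may want to test  them  against  the
       numeric  limits  listed  in  this  section  before compiling them. The
       following C sketch does so for repeat counts  and  subpattern  names.
       The constant and function names are hypothetical and are not part of
       the PCRE API. The sketch assumes that names consist  of  single-byte
       characters, so that strlen() counts characters, and it treats an empty
       name as invalid, which is an assumption rather than a documented rule.

```c
#include <string.h>

/* Values copied from the limits listed above; all names are hypothetical. */
enum {
    MAX_REPEAT_VALUE   = 65535,  /* quantifier values must be below 65536 */
    MAX_CAPTURE_GROUPS = 65535,  /* capturing subpatterns */
    MAX_NAME_LENGTH    = 32,     /* characters in a subpattern name */
    MAX_NAMED_GROUPS   = 10000   /* named subpatterns */
};

/* Returns 1 if n may be used as a value in a repeating quantifier. */
int repeat_value_ok(unsigned long n)
{
    return n <= MAX_REPEAT_VALUE;
}

/* Returns 1 if name is short enough for a named subpattern.
   Rejecting the empty string is an assumption of this sketch. */
int group_name_ok(const char *name)
{
    size_t len = strlen(name);
    return len > 0 && len <= MAX_NAME_LENGTH;
}
```

       A check of this kind only reports a problem earlier; pattern
       compilation still applies the library's own limits.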

UTF-8 AND UNICODE PROPERTY SUPPORT

       From  release  3.3,  PCRE  has  had  some support for character strings
       encoded in the UTF-8 format. For release 4.0 this was greatly  extended
       to  cover  most common requirements, and in release 5.0 additional sup‐
       port for Unicode general category properties was added.

       In order to process UTF-8 strings, you must build PCRE to include UTF-8
       support  in  the  code,  and, in addition, you must call pcre_compile()
       with the PCRE_UTF8 option flag. When you do this, both the pattern  and
       any  subject  strings  that are matched against it are treated as UTF-8
       strings instead of just strings of bytes.

       If you compile PCRE with UTF-8 support, but do not use it at run  time,
       the  library will be a bit bigger, but the additional run time overhead
       is limited to testing the PCRE_UTF8 flag occasionally, so should not be
       very big.

       If PCRE is built with Unicode character property support (which implies
       UTF-8 support), the escape sequences \p{..}, \P{..}, and  \X  are  sup‐
       ported.  The available properties that can be tested are limited to the
       general category properties such as Lu for an upper case letter  or  Nd
       for  a  decimal number, the Unicode script names such as Arabic or Han,
       and the derived properties Any and L&. A full  list  is  given  in  the
       pcrepattern documentation. Only the short names for properties are sup‐
       ported. For example, \p{L} matches a letter. Its Perl synonym,  \p{Let‐
       ter},  is  not  supported.   Furthermore,  in Perl, many properties may
       optionally be prefixed by "Is", for compatibility with Perl  5.6.  PCRE
       does not support this.

   Validity of UTF-8 strings

       When  you  set  the  PCRE_UTF8 flag, the strings passed as patterns and
       subjects are (by default) checked for validity on entry to the relevant
       functions. From release 7.3 of PCRE, the check is according to the rules
       of RFC 3629, which are themselves derived from the  Unicode  specifica‐
       tion.  Earlier  releases  of PCRE followed the rules of RFC 2279, which
       allows the full range of 31-bit values (0 to 0x7FFFFFFF).  The  current
       check allows only values in the range U+0 to U+10FFFF, excluding U+D800
       to U+DFFF.
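
       The range rules above can be expressed as a short C check. This  is  a
       minimal  sketch  and  not  the code that PCRE uses internally; the
       function name utf8_valid_rfc3629 is hypothetical. As well as the range
       test, it rejects overlong encodings, which RFC 3629 also forbids, and
       on failure it reports the offset of the first byte  of  the  failing
       character.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE. Returns 1 if buf[0..len) is valid
   UTF-8 under RFC 3629. Otherwise returns 0 and stores in *bad_offset the
   offset of the first byte of the failing character. */
int utf8_valid_rfc3629(const unsigned char *buf, size_t len, size_t *bad_offset)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        unsigned long cp, min;
        size_t extra, k;

        if (c < 0x80) { i++; continue; }                 /* one-byte character */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else { *bad_offset = i; return 0; }  /* stray continuation, or a lead
                                                byte for 5 or more bytes */

        if (len - i <= extra) { *bad_offset = i; return 0; }   /* truncated */

        for (k = 1; k <= extra; k++) {
            if ((buf[i + k] & 0xC0) != 0x80) { *bad_offset = i; return 0; }
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }

        /* Reject overlong forms, values above U+10FFFF, and U+D800-U+DFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            *bad_offset = i;
            return 0;
        }
        i += extra + 1;
    }
    return 1;
}
```

       For the four bytes 'a', 0xED, 0xA0, 0x80 (an ASCII letter followed by
       an encoding of U+D800), this function returns 0 and sets the offset to
       1.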

       The excluded code points are the "Low Surrogate Area"  of  Unicode,  of
       which  the Unicode Standard says this: "The Low Surrogate Area does not
       contain any  character  assignments,  consequently  no  character  code
       charts or namelists are provided for this area. Surrogates are reserved
       for use with UTF-16 and then must be used in pairs."  The  code  points
       that  are  encoded  by  UTF-16  pairs are available as independent code
       points in the UTF-8 encoding. (In  other  words,  the  whole  surrogate
       thing is a fudge for UTF-16 which unfortunately messes up UTF-8.)

       If an invalid UTF-8 string is passed to PCRE, an error return is given.
       At compile time, the only additional information is the offset  to  the
       first byte of the failing character. The runtime functions (pcre_exec()
       and pcre_dfa_exec()) pass back this  information  as  well  as  a  more
       detailed  reason  code if the caller has provided memory in which to do
       this.

       In some situations, you may already know that your strings  are  valid,
       and  therefore  want  to  skip these checks in order to improve perfor‐
       mance. If you set the PCRE_NO_UTF8_CHECK flag at compile time or at run
       time,  PCRE  assumes  that  the pattern or subject it is given (respec‐
       tively) contains only valid UTF-8 codes. In  this  case,  it  does  not
       diagnose an invalid UTF-8 string.

       If  you  pass  an  invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set,
       what happens depends on why the string is invalid. If the  string  con‐
       forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a
       string of characters in the range 0  to  0x7FFFFFFF.  In  other  words,
       apart from the initial validity test, PCRE (when in UTF-8 mode) handles
       strings according to the more liberal rules of RFC  2279.  However,  if
       the  string does not even conform to RFC 2279, the result is undefined.
       Your program may crash.

       If you want to process strings  of  values  in  the  full  range  0  to
       0x7FFFFFFF,  encoded in a UTF-8-like manner as per the old RFC, you can
       set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in
       this situation, you will have to apply your own validity check.
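
       A caller that bypasses the built-in test in order to use the old  RFC
       2279 range needs some check of its own. The C sketch below, under the
       hypothetical name utf8_decode_rfc2279, decodes one sequence of up  to
       six bytes and verifies only its structure (the lead byte and the con‐
       tinuation bytes). It does not reject overlong forms; whether that  is
       strict enough for a given application is for the caller to decide.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE. Decodes the sequence starting at p,
   of which avail bytes are readable. Returns the sequence length (1 to 6)
   and stores the value (0 to 0x7FFFFFFF) in *value, or returns 0 if the
   bytes are not well formed in the RFC 2279 sense. */
size_t utf8_decode_rfc2279(const unsigned char *p, size_t avail,
                           unsigned long *value)
{
    unsigned char c;
    unsigned long v;
    size_t n, k;

    if (avail == 0) return 0;
    c = p[0];

    if (c < 0x80)      { *value = c; return 1; }
    else if (c < 0xC0) return 0;                /* continuation byte as lead */
    else if (c < 0xE0) { n = 2; v = c & 0x1F; }
    else if (c < 0xF0) { n = 3; v = c & 0x0F; }
    else if (c < 0xF8) { n = 4; v = c & 0x07; }
    else if (c < 0xFC) { n = 5; v = c & 0x03; }
    else if (c < 0xFE) { n = 6; v = c & 0x01; }
    else return 0;                              /* 0xFE and 0xFF are never valid */

    if (avail < n) return 0;                    /* truncated sequence */

    for (k = 1; k < n; k++) {
        if ((p[k] & 0xC0) != 0x80) return 0;    /* not a continuation byte */
        v = (v << 6) | (p[k] & 0x3F);
    }
    *value = v;
    return n;
}
```

       Calling this in a loop over a whole string, and stopping at the first
       zero return, gives a structural check of the kind described above.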

   General comments about UTF-8 mode

       1.  An  unbraced  hexadecimal  escape sequence (such as \xb3) matches a
       two-byte UTF-8 character if the value is greater than 127.

       2. Octal numbers up to \777 are recognized, and  match  two-byte  UTF-8
       characters for values greater than \177.

       3.  Repeat quantifiers apply to complete UTF-8 characters, not to indi‐
       vidual bytes, for example: \x{100}{3}.

       4. The dot metacharacter matches one UTF-8 character instead of a  sin‐
       gle byte.

       5.  The  escape sequence \C can be used to match a single byte in UTF-8
       mode, but its use can lead to some strange effects.  This  facility  is
       not available in the alternative matching function, pcre_dfa_exec().

       6.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
       test characters of any code value, but the characters that PCRE  recog‐
       nizes  as  digits,  spaces,  or  word characters remain the same set as
       before, all with values less than 256. This remains true even when PCRE
       includes  Unicode  property support, because to do otherwise would slow
       down PCRE in many common cases. If you really want to test for a  wider
       sense  of,  say,  "digit",  you must use Unicode property tests such as
       \p{Nd}.

       7. Similarly, characters that match the POSIX named  character  classes
       are all low-valued characters.

       8.  However, the Perl 5.10 horizontal and vertical white space matching
       escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char‐
       acters.

       9.  Case-insensitive  matching  applies only to characters whose values
       are less than 128, unless PCRE is built with Unicode property  support.
       Even  when  Unicode  property support is available, PCRE still uses its
       own character tables when checking the case of  low-valued  characters,
       so  as not to degrade performance.  The Unicode property information is
       used only for characters with higher values. Even when Unicode property
       support is available, PCRE supports case-insensitive matching only when
       there is a one-to-one mapping between a letter's  cases.  There  are  a
       small  number  of  many-to-one  mappings in Unicode; these are not sup‐
       ported by PCRE.
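
       Items 1 and 2 above rest on the fact that every value from 128 up  to
       2047  occupies  exactly  two bytes in UTF-8. A minimal C sketch of that
       encoding follows; the helper name utf8_encode_small is hypothetical and
       is not part of the PCRE API.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE. Encodes a value below 0x800 as
   UTF-8 and returns the number of bytes written (1 or 2), or 0 if the
   value is too large for this sketch. */
size_t utf8_encode_small(unsigned int value, unsigned char out[2])
{
    if (value < 0x80) {                 /* 0 to 127: a single byte */
        out[0] = (unsigned char)value;
        return 1;
    }
    if (value < 0x800) {                /* 128 to 2047: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (value >> 6));
        out[1] = (unsigned char)(0x80 | (value & 0x3F));
        return 2;
    }
    return 0;
}
```

       With this helper, the value 0xb3 is written as the  bytes  0xC2  0xB3,
       and octal 777 (decimal 511) as the bytes 0xC7 0xBF.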

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.

       Putting an actual email address here seems to have been a spam  magnet,
       so  I've  taken  it away. If you want to email me, use my two initials,
       followed by the two digits 10, at the domain cam.ac.uk.

REVISION

       Last updated: 12 April 2008
       Copyright (c) 1997-2011 University of Cambridge.

                                                                       PCRE(3)