PCRE(3)                    Library Functions Manual                    PCRE(3)

NAME
PCRE - Perl-compatible regular expressions

INTRODUCTION

The PCRE library is a set of functions that implement regular
expression pattern matching using the same syntax and semantics as Perl,
with just a few differences. Some features that appeared in Python and
PCRE before they appeared in Perl are also available using the Python
syntax; there is some support for one or two .NET and Oniguruma syntax
items; and there is an option for requesting some minor changes that
give better JavaScript compatibility.

The current implementation of PCRE corresponds approximately with Perl
5.10/5.11, including support for UTF-8 encoded strings and Unicode
general category properties. However, UTF-8 and Unicode support has to
be explicitly enabled; it is not the default. The Unicode tables
correspond to Unicode release 5.2.0.

In addition to the Perl-compatible matching function, PCRE contains an
alternative function that matches the same compiled patterns in a
different way. In certain circumstances, the alternative function has
some advantages. For a discussion of the two matching algorithms, see
the pcrematching page.

PCRE is written in C and released as a C library. A number of people
have written wrappers and interfaces of various kinds. In particular,
Google Inc. have provided a comprehensive C++ wrapper. This is now
included as part of the PCRE distribution. The pcrecpp page has details
of this interface. Other people's contributions can be found in the
Contrib directory at the primary FTP site, which is:

ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre

Details of exactly which Perl regular expression features are and are
not supported by PCRE are given in separate documents. See the
pcrepattern and pcrecompat pages. There is a syntax summary in the
pcresyntax page.

Some features of PCRE can be included, excluded, or changed when the
library is built. The pcre_config() function makes it possible for a
client to discover which features are available. The features
themselves are described in the pcrebuild page. Documentation about
building PCRE for various operating systems can be found in the README
and NON-UNIX-USE files in the source distribution.

The library contains a number of undocumented internal functions and
data tables that are used by more than one of the exported external
functions, but which are not intended for use by external callers.
Their names all begin with "_pcre_", which hopefully will not provoke
any name clashes. In some environments, it is possible to control which
external symbols are exported when a shared library is built, and in
these cases the undocumented symbols are not exported.

USER DOCUMENTATION

The user documentation for PCRE comprises a number of different
sections. In the "man" format, each of these is a separate "man page".
In the HTML format, each is a separate page, linked from the index
page. In the plain text format, all the sections, except the pcredemo
section, are concatenated, for ease of searching. The sections are as
follows:

  pcre              this document
  pcre-config       show PCRE installation configuration information
  pcreapi           details of PCRE's native C API
  pcrebuild         options for building PCRE
  pcrecallout       details of the callout feature
  pcrecompat        discussion of Perl compatibility
  pcrecpp           details of the C++ wrapper
  pcredemo          a demonstration C program that uses PCRE
  pcregrep          description of the pcregrep command
  pcrematching      discussion of the two matching algorithms
  pcrepartial       details of the partial matching facility
  pcrepattern       syntax and semantics of supported regular expressions
  pcreperform       discussion of performance issues
  pcreposix         the POSIX-compatible C API
  pcreprecompile    details of saving and re-using precompiled patterns
  pcresample        discussion of the pcredemo program
  pcrestack         discussion of stack usage
  pcresyntax        quick syntax reference
  pcretest          description of the pcretest testing command

In addition, in the "man" and HTML formats, there is a short page for
each C library function, listing its arguments and results.

LIMITATIONS

There are some size limitations in PCRE but it is hoped that they will
never in practice be relevant.

The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE
is compiled with the default internal linkage size of 2. If you want to
process regular expressions that are truly enormous, you can compile
PCRE with an internal linkage size of 3 or 4 (see the README file in
the source distribution and the pcrebuild documentation for details).
In these cases the limit is substantially larger. However, the speed
of execution is slower.

All values in repeating quantifiers must be less than 65536.

There is no limit to the number of parenthesized subpatterns, but there
can be no more than 65535 capturing subpatterns.

The maximum length of a name for a named subpattern is 32 characters,
and the maximum number of named subpatterns is 10000.

The maximum length of a subject string is the largest positive number
that an integer variable can hold. However, when using the traditional
matching function, PCRE uses recursion to handle subpatterns and
indefinite repetition. This means that the available stack space may
limit the size of a subject string that can be processed by certain
patterns. For a discussion of stack issues, see the pcrestack
documentation.

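Some of the numeric limits above can be checked by a caller before a
pattern or subject reaches the library. The following C fragment is a
minimal sketch of such checks. The macro and function names are
hypothetical (they are not part of the PCRE API), and the values are
simply those quoted in this section.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Limits quoted in this section; the names are illustrative only. */
#define MAX_QUANTIFIER_VALUE 65535UL  /* repeat values must be < 65536 */
#define MAX_GROUP_NAME_LEN   32       /* characters in a subpattern name */

/* Return 1 if n may appear as a count in a repeating quantifier. */
int quantifier_value_ok(unsigned long n)
{
    return n <= MAX_QUANTIFIER_VALUE;
}

/* Return 1 if name is short enough to name a subpattern. */
int group_name_ok(const char *name)
{
    assert(name != NULL);
    return strlen(name) <= MAX_GROUP_NAME_LEN;
}

/* A subject length must fit in an int. Return the length as an int,
   or -1 if it is too large to be represented. */
int subject_length_as_int(size_t len)
{
    return len <= (size_t)INT_MAX ? (int)len : -1;
}
```

These checks cover only the fixed numeric limits. The stack-related
limit on subject size depends on the pattern and on the environment,
and cannot be tested in this way.
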
UTF-8 AND UNICODE PROPERTY SUPPORT

From release 3.3, PCRE has had some support for character strings
encoded in the UTF-8 format. For release 4.0 this was greatly extended
to cover most common requirements, and in release 5.0 additional
support for Unicode general category properties was added.

In order to process UTF-8 strings, you must build PCRE to include UTF-8
support in the code, and, in addition, you must call pcre_compile()
with the PCRE_UTF8 option flag, or the pattern must start with the
sequence (*UTF8). When either of these is the case, both the pattern
and any subject strings that are matched against it are treated as
UTF-8 strings instead of strings of 1-byte characters.

If you compile PCRE with UTF-8 support, but do not use it at run time,
the library will be a bit bigger, but the additional run time overhead
is limited to testing the PCRE_UTF8 flag occasionally, so should not be
very big.

If PCRE is built with Unicode character property support (which implies
UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are
supported. The available properties that can be tested are limited to
the general category properties such as Lu for an upper case letter or
Nd for a decimal number, the Unicode script names such as Arabic or
Han, and the derived properties Any and L&. A full list is given in the
pcrepattern documentation. Only the short names for properties are
supported. For example, \p{L} matches a letter. Its Perl synonym,
\p{Letter}, is not supported. Furthermore, in Perl, many properties may
optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE
does not support this.

Validity of UTF-8 strings

When you set the PCRE_UTF8 flag, the strings passed as patterns and
subjects are (by default) checked for validity on entry to the relevant
functions. From release 7.3 of PCRE, the check is according to the
rules of RFC 3629, which are themselves derived from the Unicode
specification. Earlier releases of PCRE followed the rules of RFC 2279,
which allows the full range of 31-bit values (0 to 0x7FFFFFFF). The
current check allows only values in the range U+0 to U+10FFFF,
excluding U+D800 to U+DFFF.

The excluded code points are the surrogate code points of Unicode (the
High Surrogate Area, U+D800 to U+DBFF, and the Low Surrogate Area,
U+DC00 to U+DFFF). Of the Low Surrogate Area, the Unicode Standard says
this: "The Low Surrogate Area does not contain any character
assignments, consequently no character code charts or namelists are
provided for this area. Surrogates are reserved for use with UTF-16 and
then must be used in pairs." The code points that are encoded by UTF-16
pairs are available as independent code points in the UTF-8 encoding.
(In other words, the whole surrogate thing is a fudge for UTF-16 which
unfortunately messes up UTF-8.)

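As an illustration of the RFC 3629 rules described above, the following
C sketch checks a byte string for well-formed sequences, rejects
overlong encodings, and rejects code points above U+10FFFF or in the
range U+D800 to U+DFFF. It is a hypothetical stand-alone helper written
for this page: the name utf8_is_valid_rfc3629 is not a PCRE function,
and the code is not taken from the PCRE sources.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if buf[0..len) is valid UTF-8 under the RFC 3629 rules,
   and 0 otherwise. Illustrative only; not PCRE's own checking code. */
int utf8_is_valid_rfc3629(const unsigned char *buf, size_t len)
{
    size_t i = 0, k;

    assert(buf != NULL || len == 0);
    while (i < len) {
        unsigned char c = buf[i];
        unsigned long cp;    /* decoded code point */
        unsigned long min;   /* smallest value allowed for this length */
        size_t extra;        /* continuation bytes expected */

        if (c < 0x80) { i++; continue; }   /* single-byte character */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;       /* stray continuation byte or 5/6-byte lead */

        if (len - i <= extra) return 0;    /* sequence is truncated */
        for (k = 1; k <= extra; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80) return 0;   /* not 10xxxxxx */
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min) return 0;                      /* overlong encoding */
        if (cp > 0x10FFFF) return 0;                 /* beyond Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate */
        i += extra + 1;
    }
    return 1;
}
```

With this helper, the three bytes 0xED 0xA0 0x80 (an encoding of
U+D800) and the four bytes 0xF4 0x90 0x80 0x80 (an encoding of
0x110000) are both rejected, whereas 0xF4 0x8F 0xBF 0xBF (U+10FFFF) is
accepted.
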
If an invalid UTF-8 string is passed to PCRE, an error return
(PCRE_ERROR_BADUTF8) is given. In some situations, you may already know
that your strings are valid, and therefore want to skip these checks in
order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag at
compile time or at run time, PCRE assumes that the pattern or subject
it is given (respectively) contains only valid UTF-8 codes. In this
case, it does not diagnose an invalid UTF-8 string.

If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set,
what happens depends on why the string is invalid. If the string
conforms to the "old" definition of UTF-8 (RFC 2279), it is processed
as a string of characters in the range 0 to 0x7FFFFFFF. In other words,
apart from the initial validity test, PCRE (when in UTF-8 mode) handles
strings according to the more liberal rules of RFC 2279. However, if
the string does not even conform to RFC 2279, the result is undefined.
Your program may crash.

If you want to process strings of values in the full range 0 to
0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you can
set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in
this situation, you will have to apply your own validity check.

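One possible form of such a check is sketched below in C. It tests only
the structure required by the old RFC 2279 scheme: each lead byte must
announce a sequence of 1 to 6 bytes and must be followed by the right
number of continuation bytes. The function names are hypothetical, the
code is not part of PCRE, and overlong encodings are deliberately not
rejected, so an application may well need something stricter.

```c
#include <assert.h>
#include <stddef.h>

/* Length (1..6) of a sequence that starts with lead byte c under the
   older RFC 2279 scheme, or 0 if c cannot start a sequence. */
static size_t rfc2279_seq_len(unsigned char c)
{
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    if ((c & 0xFC) == 0xF8) return 5;
    if ((c & 0xFE) == 0xFC) return 6;
    return 0;   /* continuation byte, 0xFE, or 0xFF */
}

/* Return 1 if buf[0..len) is structurally well formed in the RFC 2279
   sense, and 0 otherwise. Illustrative only. */
int utf8_is_wellformed_rfc2279(const unsigned char *buf, size_t len)
{
    size_t i = 0, k, n;

    assert(buf != NULL || len == 0);
    while (i < len) {
        n = rfc2279_seq_len(buf[i]);
        if (n == 0 || n > len - i) return 0;   /* bad lead or truncated */
        for (k = 1; k < n; k++)
            if ((buf[i + k] & 0xC0) != 0x80) return 0;
        i += n;
    }
    return 1;
}
```

This sketch accepts the six-byte encoding of 0x7FFFFFFF (0xFD followed
by five 0xBF bytes) and also accepts encoded surrogates, both of which
the RFC 3629 check rejects.
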
General comments about UTF-8 mode

1. An unbraced hexadecimal escape sequence (such as \xb3) matches a
two-byte UTF-8 character if the value is greater than 127.

2. Octal numbers up to \777 are recognized, and match two-byte UTF-8
characters for values greater than \177.

3. Repeat quantifiers apply to complete UTF-8 characters, not to
individual bytes, for example: \x{100}{3}.

4. The dot metacharacter matches one UTF-8 character instead of a
single byte.

5. The escape sequence \C can be used to match a single byte in UTF-8
mode, but its use can lead to some strange effects. This facility is
not available in the alternative matching function, pcre_dfa_exec().

6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
test characters of any code value, but, by default, the characters that
PCRE recognizes as digits, spaces, or word characters remain the same
set as before, all with values less than 256. This remains true even
when PCRE is built to include Unicode property support, because to do
otherwise would slow down PCRE in many common cases. Note that this
also applies to \b, because it is defined in terms of \w and \W. If you
really want to test for a wider sense of, say, "digit", you can use
explicit Unicode property tests such as \p{Nd}. Alternatively, if you
set the PCRE_UCP option, the way that the character escapes work is
changed so that Unicode properties are used to determine which
characters match. There are more details in the section on generic
character types in the pcrepattern documentation.

7. Similarly, characters that match the POSIX named character classes
are all low-valued characters, unless the PCRE_UCP option is set.

8. However, the Perl 5.10 horizontal and vertical whitespace matching
escapes (\h, \H, \v, and \V) do match all the appropriate Unicode
characters, whether or not PCRE_UCP is set.

9. Case-insensitive matching applies only to characters whose values
are less than 128, unless PCRE is built with Unicode property support.
Even when Unicode property support is available, PCRE still uses its
own character tables when checking the case of low-valued characters,
so as not to degrade performance. The Unicode property information is
used only for characters with higher values. Even when Unicode property
support is available, PCRE supports case-insensitive matching only when
there is a one-to-one mapping between a letter's cases. There are a
small number of many-to-one mappings in Unicode; these are not
supported by PCRE.

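Comments 1 to 4 above all turn on the difference between characters and
bytes in a UTF-8 string. The following C sketch makes that difference
concrete by encoding a code point as UTF-8 and by counting the
characters in an already valid UTF-8 buffer. Both function names are
hypothetical helpers for illustration, not PCRE functions, and the
encoder makes no attempt to reject surrogate code points.

```c
#include <assert.h>
#include <stddef.h>

/* Encode code point cp (0 to 0x10FFFF) as UTF-8 into out[]. Returns
   the number of bytes written (1..4), or 0 if cp is out of range. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Count characters, not bytes, in a valid UTF-8 buffer: every byte
   that is not a continuation byte (10xxxxxx) starts a character. */
size_t utf8_char_count(const unsigned char *buf, size_t len)
{
    size_t i, count = 0;

    for (i = 0; i < len; i++)
        if ((buf[i] & 0xC0) != 0x80)
            count++;
    return count;
}
```

With these helpers, the value 0xb3 (179) encodes as the two bytes 0xC2
0xB3, and the octal value \777 (511) as 0xC7 0xBF. A subject matched in
full by \x{100}{3} consists of three characters but six bytes (0xC4
0x80 repeated three times).
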
AUTHOR

Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.

Putting an actual email address here seems to have been a spam magnet,
so I've taken it away. If you want to email me, use my two initials,
followed by the two digits 10, at the domain cam.ac.uk.

REVISION

Last updated: 12 May 2010
Copyright (c) 1997-2010 University of Cambridge.