PCRECOMPAT(3) Library Functions Manual PCRECOMPAT(3)

PCRE - Perl-compatible regular expressions

This document describes the differences in the ways that PCRE and Perl
handle regular expressions. The differences described here are with
respect to Perl 5.10/5.11.

1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details
of what it does have are given in the section on UTF-8 support in the
main pcre page.

2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl
permits them, but they do not mean what you might think. For example,
(?!a){3} does not assert that the next three characters are not "a". It
merely asserts three times that the next character is not "a".

3. Capturing subpatterns that occur inside negative lookahead
assertions are counted, but their entries in the offsets vector are
never set. Perl sets its numerical variables from any such patterns
that are matched before the assertion fails to match something (thereby
succeeding), but only if the negative lookahead assertion contains just
one branch.

4. Though binary zero characters are supported in the subject string,
they are not allowed in a pattern string because it is passed as a
normal C string, terminated by zero. The escape sequence \0 can be used
in the pattern to represent a binary zero.

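The effect of zero termination described in item 4 can be seen without
calling PCRE at all. The following is a minimal sketch in plain C; the
names visible_pattern_length, raw_nul_pattern, and escaped_nul_pattern
are illustrative only and are not part of the PCRE API.

```c
#include <stddef.h>
#include <string.h>

/* Length of a pattern as seen by any code that treats it as a
   zero-terminated C string. */
size_t visible_pattern_length(const char *pattern)
{
    return strlen(pattern);
}

/* A literal binary zero inside the array: everything after "ab" is
   invisible to a consumer that stops at the first zero byte. */
static const char raw_nul_pattern[] = "ab\0cd";

/* The two characters backslash and '0' instead: the whole text
   survives as a C string, and the \0 escape is left for the regular
   expression engine to interpret as a binary zero. */
static const char escaped_nul_pattern[] = "ab\\0cd";
```

The first array occupies six bytes, but only "ab" is visible as a C
string; the second is seen in full, which is why the escape form is the
one to use in a pattern.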
5. The following Perl escape sequences are not supported: \l, \u, \L,
\U, and \N. In fact these are implemented by Perl's general
string-handling and are not part of its pattern matching engine. If any
of these are encountered by PCRE, an error is generated.

6. The Perl escape sequences \p, \P, and \X are supported only if PCRE
is built with Unicode character property support. The properties that
can be tested with \p and \P are limited to the general category
properties such as Lu and Nd, script names such as Greek or Han, and
the derived properties Any and L&. PCRE does support the Cs (surrogate)
property, which Perl does not; the Perl documentation says "Because
Perl hides the need for the user to understand the internal
representation of Unicode characters, there is no need to implement the
somewhat messy concept of surrogates."

7. PCRE does support the \Q...\E escape for quoting substrings.
Characters in between are treated as literals. This is slightly
different from Perl in that $ and @ are also handled as literals inside
the quotes. In Perl, they cause variable interpolation (but of course
PCRE does not have variables). Note the following examples:

    Pattern            PCRE matches   Perl matches

    \Qabc$xyz\E        abc$xyz        abc followed by the
                                      contents of $xyz
    \Qabc\$xyz\E       abc\$xyz       abc\$xyz
    \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz

The \Q...\E sequence is recognized both inside and outside character
classes.

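Because everything between \Q and \E is literal in PCRE, including $
and @, the sequence can be used to embed arbitrary text in a pattern.
The following is a minimal sketch of such a helper. The function name
quote_literal is hypothetical and is not part of the PCRE API. The only
input that needs special care is text that itself contains \E, which
the sketch handles by closing the quote, matching a literal backslash
and E, and reopening the quote.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE: returns a newly allocated
   pattern fragment that matches `literal` exactly, or NULL if memory
   cannot be allocated. The caller frees the result. */
char *quote_literal(const char *literal)
{
    static const char q_open[] = "\\Q";            /* \Q         */
    static const char q_close[] = "\\E";           /* \E         */
    static const char q_splice[] = "\\E\\\\E\\Q";  /* \E \\E \Q  */
    const size_t splice_len = sizeof q_splice - 1;

    /* Count occurrences of \E in the input to size the output. */
    size_t occurrences = 0;
    for (const char *p = strstr(literal, q_close); p != NULL;
         p = strstr(p + 2, q_close))
        occurrences++;

    /* \Q + text + growth for each replaced \E + \E + terminator. */
    size_t size = 2 + strlen(literal) + occurrences * (splice_len - 2)
                  + 2 + 1;
    char *out = malloc(size);
    if (out == NULL)
        return NULL;

    char *w = out;
    memcpy(w, q_open, 2);
    w += 2;

    const char *r = literal;
    for (const char *hit = strstr(r, q_close); hit != NULL;
         hit = strstr(r, q_close)) {
        size_t chunk = (size_t)(hit - r);
        memcpy(w, r, chunk);            /* text before the \E        */
        w += chunk;
        memcpy(w, q_splice, splice_len); /* close, literal \E, reopen */
        w += splice_len;
        r = hit + 2;
    }

    size_t rest = strlen(r);
    memcpy(w, r, rest);
    w += rest;
    memcpy(w, q_close, 2);
    w += 2;
    *w = '\0';
    return out;
}
```

For the input abc$xyz the sketch produces \Qabc$xyz\E, the first
pattern in the table above. Such a fragment is suitable for PCRE only;
passed to Perl, the $xyz would be interpolated as described in item 7.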
8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
constructions. However, there is support for recursive patterns, which
are not available in Perl 5.8 but are in Perl 5.10. Also, the PCRE
"callout" feature allows an external function to be called during
pattern matching. See the pcrecallout documentation for details.

9. Subpatterns that are called recursively or as "subroutines" are
always treated as atomic groups in PCRE. This is like Python, but
unlike Perl. There is a discussion of an example that explains this in
more detail in the section on recursion differences from Perl in the
pcrepattern page.

10. There are some differences that are concerned with the settings of
captured strings when part of a pattern is repeated. For example,
matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2
unset, but in PCRE it is set to "b".

11. PCRE's handling of duplicate subpattern numbers and duplicate
subpattern names is not as general as Perl's. This is a consequence of
the fact that PCRE works internally just with numbers, using an
external table to translate between numbers and names. In particular, a
pattern such as (?|(?<a>A)|(?<b>B)), where the two capturing
parentheses have the same number but different names, is not supported,
and causes an error at compile time. If it were allowed, it would not
be possible to distinguish which parentheses matched, because both
names map to capturing subpattern number 1.

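The ambiguity described in item 11 can be illustrated with a simplified
model of a name-to-number table. This is a sketch only: the structure
name_entry, the function group_number, and the table rejected_table are
hypothetical and do not reflect PCRE's actual internal layout.

```c
#include <stddef.h>
#include <string.h>

/* Simplified, hypothetical model of a table that translates
   subpattern names to numbers. */
struct name_entry {
    const char *name;
    int number;
};

/* Returns the subpattern number recorded for `name`, or -1 if the
   name is not in the table. */
int group_number(const struct name_entry *table, size_t count,
                 const char *name)
{
    for (size_t i = 0; i < count; i++)
        if (strcmp(table[i].name, name) == 0)
            return table[i].number;
    return -1;
}

/* What the table would contain for (?|(?<a>A)|(?<b>B)) if the pattern
   were accepted: both names translate to subpattern number 1. */
static const struct name_entry rejected_table[] = {
    { "a", 1 },
    { "b", 1 },
};
```

In this model a lookup of "a" and a lookup of "b" both yield 1. Since
captured text is recorded per number, the caller would receive the same
substring for either name and could not tell which alternative matched.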
12. PCRE provides some extensions to the Perl regular expression
facilities. Perl 5.10 includes new features that are not in earlier
versions of Perl, some of which (such as named parentheses) have been
in PCRE for some time. This list is with respect to Perl 5.10:

(a) Although lookbehind assertions in PCRE must match fixed length
strings, each alternative branch of a lookbehind assertion can match a
different length of string. Perl requires them all to have the same
length.

(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
meta-character matches only at the very end of the string.

(c) If PCRE_EXTRA is set, a backslash followed by a letter with no
special meaning is faulted. Otherwise, like Perl, the backslash is
quietly ignored. (Perl can be made to issue a warning.)

(d) If PCRE_UNGREEDY is set, the greediness of the repetition
quantifiers is inverted, that is, by default they are not greedy, but
if followed by a question mark they are.

(e) PCRE_ANCHORED can be used at matching time to force a pattern to be
tried only at the first matching position in the subject string.

(f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and
PCRE_NOTEMPTY_ATSTART options for pcre_exec(), and the
PCRE_NO_AUTO_CAPTURE option for pcre_compile(), have no Perl
equivalents.

(g) The \R escape sequence can be restricted to match only CR, LF, or
CRLF by the PCRE_BSR_ANYCRLF option.

(h) The callout facility is PCRE-specific.

(i) The partial matching facility is PCRE-specific.

(j) Patterns compiled by PCRE can be saved and re-used at a later time,
even on different hosts that have the other endianness.

(k) The alternative matching function (pcre_dfa_exec()) matches in a
different way and is not Perl-compatible.

(l) PCRE recognizes some special sequences such as (*CR) at the start
of a pattern that set overall options that cannot be changed within the
pattern.

Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.

Last updated: 12 May 2010
Copyright (c) 1997-2010 University of Cambridge.