PCRECPP(3)                 Library Functions Manual                 PCRECPP(3)

PCRE - Perl-compatible regular expressions.

    #include <pcrecpp.h>

The C++ wrapper for PCRE was provided by Google Inc. Some additional
functionality was added by Giuseppe Maxia. This brief man page was
constructed from the notes in the pcrecpp.h file, which should be
consulted for further details.

The "FullMatch" operation checks that the supplied text matches the
supplied pattern exactly. If pointer arguments are supplied, it copies
the sub-strings matched by sub-patterns into them.

Example: successful match
    pcrecpp::RE re("h.*o");
    re.FullMatch("hello");

Example: unsuccessful match (requires full match):
    pcrecpp::RE re("e");
    !re.FullMatch("hello");

Example: creating a temporary RE object:
    pcrecpp::RE("h.*o").FullMatch("hello");

You can pass in a "const char*" or a "string" for "text". The examples
below tend to use a const char*. As the examples above show, you can
either store the RE object in a variable or use a temporary RE object.
The examples below use one form or the other arbitrarily; either is
correct for any of them.

You must supply extra pointer arguments to extract matched subpieces.

Example: extracts "ruby" into "s" and 1234 into "i"
    int i;
    string s;
    pcrecpp::RE re("(\\w+):(\\d+)");
    re.FullMatch("ruby:1234", &s, &i);

Example: does not try to extract any extra sub-patterns
    re.FullMatch("ruby:1234", &s);

Example: does not try to extract into NULL
    re.FullMatch("ruby:1234", NULL, &i);

Example: integer overflow causes failure
    !re.FullMatch("ruby:1234567891234", NULL, &i);

Example: fails because there aren't enough sub-patterns:
    !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);

Example: fails because string cannot be stored in integer
    !pcrecpp::RE("(.*)").FullMatch("ruby", &i);

The provided pointer arguments can be pointers to any scalar numeric
type, or one of:

    string        (matched piece is copied to string)
    StringPiece   (StringPiece is mutated to point to matched piece)
    T             (where "bool T::ParseFrom(const char*, int)" exists)
    NULL          (the corresponding matched sub-pattern is not copied)
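
The "T" entry in the list above means that a user-defined type can
receive a capture, provided it has a member function with that
signature. The following is a minimal sketch of such a type. The name
Percent and its parsing rules are invented for illustration, and the
block does not call pcrecpp itself. It assumes the captured text is
passed as a pointer plus a length and is not NUL-terminated.

```cpp
#include <cassert>

// Hypothetical user type (not part of pcrecpp): a percentage such as "42%".
struct Percent {
  int value = 0;

  // Signature as listed above. Only the first `length` bytes of `data`
  // are treated as the captured text; no terminating NUL is assumed.
  bool ParseFrom(const char* data, int length) {
    if (length < 2 || data[length - 1] != '%') return false;
    int v = 0;
    for (int i = 0; i < length - 1; ++i) {
      if (data[i] < '0' || data[i] > '9') return false;  // digits only
      v = v * 10 + (data[i] - '0');
      if (v > 100) return false;                         // reject values above 100
    }
    value = v;
    return true;
  }
};

// Small self-check showing the intended behavior of the sketch.
void percent_example() {
  Percent p;
  assert(p.ParseFrom("42%", 3) && p.value == 42);
  assert(!p.ParseFrom("42", 2));    // missing '%'
  assert(!p.ParseFrom("x%", 2));    // not a number
}
```

Going by the list above, a pointer to such an object would then be
passed like any other argument, for example as "&p" in a FullMatch call.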

The function returns true iff all of the following conditions are
satisfied:

  a. "text" matches "pattern" exactly;

  b. The number of matched sub-patterns is >= number of supplied
     pointers;

  c. The "i"th argument has a suitable type for holding the
     string captured as the "i"th sub-pattern. If you pass in
     void * NULL for the "i"th argument, or a non-void * NULL
     of the correct type, or pass fewer arguments than the
     number of sub-patterns, the "i"th captured sub-pattern is
     ignored.

CAVEAT: An optional sub-pattern that does not exist in the matched
string is assigned the empty string. Therefore, the following will
return false (because the empty string is not a valid number):

    int number;
    pcrecpp::RE("[a-z]+(\\d+)?").FullMatch("abc", &number);
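
The failures described above (integer overflow, non-numeric text, and
the empty string from an unmatched optional sub-pattern) all come down
to the captured text not being convertible to the target type. The
sketch below shows one way such a check could be written with strtol.
The helper parse_int_sketch is hypothetical and is not taken from the
pcrecpp sources.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical helper (not pcrecpp code): store `text` in *out only if
// the whole of `text` is a base-10 integer that fits in an int.
bool parse_int_sketch(const std::string& text, int* out) {
  if (text.empty()) return false;                  // empty capture is not a number
  if (std::isspace(static_cast<unsigned char>(text[0]))) return false;
  errno = 0;
  char* end = nullptr;
  const long value = std::strtol(text.c_str(), &end, 10);
  if (end == text.c_str() || *end != '\0') return false;  // no digits, or trailing junk
  if (errno == ERANGE) return false;                      // does not fit in a long
  if (value < INT_MIN || value > INT_MAX) return false;   // does not fit in an int
  *out = static_cast<int>(value);
  return true;
}

// Mirrors the cases discussed in the text.
void parse_int_example() {
  int i = 0;
  assert(parse_int_sketch("1234", &i) && i == 1234);
  assert(!parse_int_sketch("1234567891234", &i));  // overflow
  assert(!parse_int_sketch("ruby", &i));           // not a number
  assert(!parse_int_sketch("", &i));               // unmatched optional group
}
```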

The matching interface supports at most 16 arguments per call. If you
need more, consider using the more general interface
pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch.

You can use the "QuoteMeta" operation to insert backslashes before all
potentially meaningful characters in a string. The returned string,
used as a regular expression, will exactly match the original string.

Example:
    string quoted = RE::QuoteMeta(unquoted);

Note that it is legal to escape a character even if it has no special
meaning in a regular expression, so this function does that. (This
also makes it identical to the Perl function of the same name; see
"perldoc -f quotemeta".) For example, "1.5-2.0?" becomes
"1\.5\-2\.0\?".
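
The escaping rule can be pictured with a few lines of standard C++.
This is a simplified sketch and not the pcrecpp implementation. The
name quote_meta_sketch is hypothetical, and the sketch makes no special
provision for NUL bytes or non-ASCII bytes, which the real function may
treat differently.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical sketch: put a backslash in front of every byte that is
// not a letter, a digit, or an underscore (as classified by isalnum in
// the current C locale).
std::string quote_meta_sketch(const std::string& unquoted) {
  std::string quoted;
  quoted.reserve(unquoted.size() * 2);
  for (unsigned char c : unquoted) {
    if (!std::isalnum(c) && c != '_') quoted += '\\';
    quoted += static_cast<char>(c);
  }
  return quoted;
}

// Reproduces the example from the text.
void quote_meta_example() {
  assert(quote_meta_sketch("1.5-2.0?") == "1\\.5\\-2\\.0\\?");
}
```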

You can use the "PartialMatch" operation when you want the pattern to
match any substring of the text.

Example: simple search for a string:
    pcrecpp::RE("ell").PartialMatch("hello");

Example: find first number in a string:
    int number;
    pcrecpp::RE re("(\\d+)");
    re.PartialMatch("x*100 + 20", &number);
    assert(number == 100);
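
The difference between a full match and a substring match can be tried
without PCRE installed. The sketch below uses std::regex from the C++
standard library as a stand-in: for simple patterns like these,
std::regex_match behaves like FullMatch and std::regex_search like
PartialMatch. The helper names are made up for this illustration and
are not pcrecpp functions; std::regex also uses a different pattern
dialect from PCRE.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical stand-in: true only if the pattern accounts for all of `text`.
bool full_match(const std::string& pattern, const std::string& text) {
  return std::regex_match(text, std::regex(pattern));
}

// Hypothetical stand-in: true if the pattern matches anywhere inside `text`.
bool partial_match(const std::string& pattern, const std::string& text) {
  return std::regex_search(text, std::regex(pattern));
}

// Stores the first run of digits found in `text` in *number.
// std::stoi throws std::out_of_range if the digits do not fit in an int.
bool first_number(const std::string& text, int* number) {
  std::smatch m;
  if (!std::regex_search(text, m, std::regex("(\\d+)"))) return false;
  *number = std::stoi(m[1].str());
  return true;
}

void full_vs_partial_example() {
  assert(full_match("h.*o", "hello"));
  assert(!full_match("e", "hello"));     // "e" is only part of the text
  assert(partial_match("e", "hello"));
  int number = 0;
  assert(first_number("x*100 + 20", &number) && number == 100);
}
```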

By default, pattern and text are plain text, one byte per character.
The UTF8 flag, passed to the constructor, causes both pattern and
string to be treated as UTF-8 text, still a byte stream but potentially
multiple bytes per character. In practice, the text is likelier to be
UTF-8 than the pattern, but the match returned may depend on the UTF8
flag, so always use it when matching UTF-8 text. For example, "." will
match one byte normally, but with UTF8 set it may match all the bytes
of a multi-byte character.

Example:
    pcrecpp::RE_Options options;
    options.set_utf8();
    pcrecpp::RE re(utf8_pattern, options);
    re.FullMatch(utf8_string);

Example: using the convenience function UTF8():
    pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
    re.FullMatch(utf8_string);

NOTE: The UTF8 flag is ignored if pcre was not configured with the
--enable-utf8 flag.
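
To see why byte-oriented and UTF-8 matching can disagree, it helps to
count bytes and characters separately. The sketch below is plain C++
and does not use PCRE. The helper names are hypothetical, and the code
only inspects lead bytes and lengths; it does not validate continuation
bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with `lead`, or 0 if `lead` cannot start a sequence.
std::size_t utf8_sequence_length(unsigned char lead) {
  if (lead < 0x80) return 1;              // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0) return 2;    // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;    // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;    // 11110xxx
  return 0;                               // continuation byte or invalid lead
}

// Counts characters by stepping from lead byte to lead byte.
// Returns std::string::npos if a lead byte is invalid or a sequence is cut short.
std::size_t utf8_character_count(const std::string& text) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < text.size();) {
    const std::size_t len =
        utf8_sequence_length(static_cast<unsigned char>(text[i]));
    if (len == 0 || i + len > text.size()) return std::string::npos;
    i += len;
    ++count;
  }
  return count;
}

void utf8_example() {
  const std::string word = "h\xC3\xA9llo";   // "hello" with an e-acute (2 bytes)
  assert(word.size() == 6);                  // six bytes...
  assert(utf8_character_count(word) == 5);   // ...but five characters
}
```

A pattern such as "h.llo" therefore needs "." to consume two bytes in
order to match this text as a whole.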

PCRE defines some modifiers to change the behavior of the regular
expression engine. The C++ wrapper defines an auxiliary class,
RE_Options, as a vehicle to pass such modifiers to a RE class.
Currently, the following modifiers are supported:

    modifier              description                Perl equivalent

    PCRE_CASELESS         case insensitive match     /i
    PCRE_MULTILINE        multiple lines match       /m
    PCRE_DOTALL           dot matches newlines       /s
    PCRE_DOLLAR_ENDONLY   $ matches only at end      N/A
    PCRE_EXTRA            strict escape parsing      N/A
    PCRE_EXTENDED         ignore white space         /x
    PCRE_UTF8             handles UTF8 chars         built-in
    PCRE_UNGREEDY         reverses * and *?          N/A
    PCRE_NO_AUTO_CAPTURE  disables capturing parens  N/A (*)

(*) Both Perl and PCRE allow non-capturing parentheses by means of the
"?:" modifier within the pattern itself, e.g. (?:ab|cd) does not
capture, while (ab|cd) does.

For a full account of how each modifier works, please check the PCRE
API reference page.

For each modifier, there are two member functions whose names are
formed from the modifier name in lowercase, without the "PCRE_" prefix.
For instance, PCRE_CASELESS is handled by

    bool caseless()

which returns true if the modifier is set, and

    RE_Options & set_caseless(bool)

which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can
be accessed through the set_match_limit() and match_limit() member
functions. Setting match_limit to a non-zero value will limit the
execution of pcre to keep it from doing bad things like blowing the
stack or taking an eternity to return a result. A value of 5000 is good
enough to stop stack blowup in a 2MB thread stack. Setting match_limit
to zero disables match limiting. Alternatively, you can use
set_match_limit_recursion() and match_limit_recursion(), which use
PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit how much PCRE recurses.
match_limit() limits the number of matches PCRE does;
match_limit_recursion() limits the depth of internal recursion, and
therefore the amount of stack that is used.

Normally, to pass one or more modifiers to a RE class, you declare a
RE_Options object, set the appropriate options, and pass this object to
a RE constructor. Example:

    RE_Options opt;
    opt.set_caseless(true);
    if (RE("HELLO", opt).PartialMatch("hello world")) ...

RE_Options has two constructors. The default constructor takes no
arguments and creates a set of flags that are all off. The other takes
a single option_flags parameter, which exists to ease the transfer of
legacy code from C programs. This lets you do

    RE(pattern,
       RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);

However, new code is better off doing

    RE(pattern,
       RE_Options().set_caseless(true).set_multiline(true))
      .PartialMatch(str);

If you are going to pass one of the most used modifiers, there are some
convenience functions that return a RE_Options class with the
appropriate modifier already set: CASELESS(), UTF8(), MULTILINE(),
DOTALL(), and EXTENDED().

If you need to set several options at once, and you don't want to go
to the trouble of declaring a RE_Options object and setting each option
separately, you can do it on the fly instead. The set_xxxxx() member
functions can be chained, since each of them returns a reference to its
class object. For example, to pass PCRE_CASELESS, PCRE_EXTENDED, and
PCRE_MULTILINE to a RE with one statement, you may write:

    RE(" ^ xyz \\s+ .* blah$",
       RE_Options()
         .set_caseless(true)
         .set_extended(true)
         .set_multiline(true)).PartialMatch(sometext);
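
Chaining works because each setter returns a reference to the object it
was called on. The sketch below shows that mechanism in isolation. The
class OptionsSketch and the flag values are hypothetical; they are not
the real RE_Options class or the real PCRE_* constants, which are
defined in pcrecpp.h and pcre.h.

```cpp
#include <cassert>

// Hypothetical flag values for this sketch only.
enum SketchFlag { kCaseless = 1 << 0, kMultiline = 1 << 1, kExtended = 1 << 2 };

class OptionsSketch {
 public:
  OptionsSketch() : flags_(0) {}                              // all flags off
  explicit OptionsSketch(int option_flags) : flags_(option_flags) {}

  bool caseless() const { return (flags_ & kCaseless) != 0; }
  bool multiline() const { return (flags_ & kMultiline) != 0; }
  bool extended() const { return (flags_ & kExtended) != 0; }

  // Each setter returns *this so that calls can be chained.
  OptionsSketch& set_caseless(bool on) { return set(kCaseless, on); }
  OptionsSketch& set_multiline(bool on) { return set(kMultiline, on); }
  OptionsSketch& set_extended(bool on) { return set(kExtended, on); }

  int flags() const { return flags_; }

 private:
  OptionsSketch& set(int flag, bool on) {
    if (on) flags_ |= flag; else flags_ &= ~flag;
    return *this;
  }
  int flags_;
};

void options_example() {
  OptionsSketch chained = OptionsSketch().set_caseless(true).set_multiline(true);
  OptionsSketch from_flags(kCaseless | kMultiline);
  assert(chained.flags() == from_flags.flags());   // both styles agree
  assert(chained.caseless() && chained.multiline() && !chained.extended());
}
```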

The "Consume" operation may be useful if you want to repeatedly match
regular expressions at the front of a string and skip over them as they
match. This requires use of the "StringPiece" type, which represents a
sub-range of a real string. Like RE, StringPiece is defined in the
pcrecpp namespace.

Example: read lines of the form "var = value" from a string.
    string contents = ...;                 // Fill string somehow
    pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece

    string var;
    int value;
    pcrecpp::RE re("(\\w+) = (\\d+)\n");
    while (re.Consume(&input, &var, &value)) {
      ...;
    }

Each successful call to "Consume" will set "var" and "value", and also
advance "input" so it points past the matched text.

The "FindAndConsume" operation is similar to "Consume" but does not
anchor your match at the beginning of the string. For example, you
could extract all words from a string by repeatedly calling

    pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)
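
The consume-and-advance loop can be imitated with std::regex by
matching only at the current position and then moving that position
past the match. This is a stand-in built on the standard library, not
on PCRE. The names consume_sketch and read_assignments are hypothetical,
and an iterator pair plays the role that StringPiece plays above.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for Consume: if `re` matches starting exactly at
// *pos, fill *m, advance *pos past the match, and return true.
bool consume_sketch(const std::regex& re, std::string::const_iterator* pos,
                    std::string::const_iterator end, std::smatch* m) {
  if (!std::regex_search(*pos, end, *m, re,
                         std::regex_constants::match_continuous)) {
    return false;
  }
  *pos = (*m)[0].second;   // one past the last matched character
  return true;
}

// Reads leading lines of the form "var = value" and stops at the first
// position where the pattern no longer matches.
std::vector<std::pair<std::string, int>> read_assignments(
    const std::string& contents) {
  static const std::regex re("(\\w+) = (\\d+)\n");
  std::vector<std::pair<std::string, int>> result;
  std::string::const_iterator pos = contents.begin();
  std::smatch m;
  while (consume_sketch(re, &pos, contents.end(), &m)) {
    result.emplace_back(m[1].str(), std::stoi(m[2].str()));
  }
  return result;
}

void consume_example() {
  const std::string contents = "a = 1\nbb = 22\nrest";
  const std::vector<std::pair<std::string, int>> v = read_assignments(contents);
  assert(v.size() == 2);
  assert(v[0].first == "a" && v[0].second == 1);
  assert(v[1].first == "bb" && v[1].second == 22);
}
```

Dropping the match_continuous flag would give the unanchored behavior
described for "FindAndConsume".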

By default, if you pass a pointer to a numeric value, the corresponding
text is interpreted as a base-10 number. You can instead wrap the
pointer with a call to one of the operators Hex(), Octal(), or CRadix()
to interpret the text in another base. The CRadix operator interprets
C-style "0" (base-8) and "0x" (base-16) prefixes, but defaults to
base-10.

Example:
    int a, b, c, d;
    pcrecpp::RE re("(.*) (.*) (.*) (.*)");
    re.FullMatch("100 40 0100 0x40",
                 pcrecpp::Octal(&a), pcrecpp::Hex(&b),
                 pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));

will leave 64 in a, b, c, and d.
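
The arithmetic behind that result can be checked with the C library
function strtol, whose base-0 mode follows the same prefix rules as
those described for CRadix. This is only an illustration of the number
bases; the helper parse_in_base is hypothetical, and the sketch does not
claim to show how pcrecpp performs the conversion.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: parse `text` in the given base. Base 0 makes
// strtol honour a leading "0" (octal) or "0x" (hexadecimal) and fall
// back to base 10 otherwise.
long parse_in_base(const char* text, int base) {
  return std::strtol(text, nullptr, base);
}

void radix_example() {
  assert(parse_in_base("100", 8) == 64);    // octal:  1*64 + 0*8 + 0
  assert(parse_in_base("40", 16) == 64);    // hex:    4*16 + 0
  assert(parse_in_base("0100", 0) == 64);   // "0" prefix selects octal
  assert(parse_in_base("0x40", 0) == 64);   // "0x" prefix selects hex
}
```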

You can replace the first match of "pattern" in "str" with "rewrite".
Within "rewrite", backslash-escaped digits (\1 to \9) can be used to
insert the text matched by the corresponding parenthesized group from
the pattern. \0 in "rewrite" refers to the entire matching text. For
example:

    string s = "yabba dabba doo";
    pcrecpp::RE("b+").Replace("d", &s);

will leave "s" containing "yada dabba doo". The result is true if the
pattern matches and a replacement occurs, false otherwise.

GlobalReplace is like Replace except that it replaces all occurrences
of the pattern in the string with the rewrite. Replacements are not
subject to re-matching. For example:

    string s = "yabba dabba doo";
    pcrecpp::RE("b+").GlobalReplace("d", &s);

will leave "s" containing "yada dada doo". It returns the number of
replacements made.

Extract is like Replace, except that if the pattern matches, "rewrite"
is copied into "out" (an additional argument) with substitutions. The
non-matching portions of "text" are ignored. It returns true iff a
match occurred and the extraction happened successfully; if no match
occurs, "out" is left unaffected.
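
The three behaviors can be compared side by side using std::regex from
the standard library as a stand-in for PCRE. The helper names are
hypothetical. Two differences from the text above apply to the sketch:
std::regex writes group references as $1 rather than \1, and the
extract stand-in returns an empty string when nothing matches instead
of leaving an output argument untouched.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Stand-in for Replace: rewrite only the first match.
std::string replace_first(const std::string& s, const std::string& pattern,
                          const std::string& rewrite) {
  return std::regex_replace(s, std::regex(pattern), rewrite,
                            std::regex_constants::format_first_only);
}

// Stand-in for GlobalReplace: rewrite every match.
std::string replace_all(const std::string& s, const std::string& pattern,
                        const std::string& rewrite) {
  return std::regex_replace(s, std::regex(pattern), rewrite);
}

// Stand-in for Extract: keep only the rewritten first match and drop the
// text around it.
std::string extract_sketch(const std::string& s, const std::string& pattern,
                           const std::string& rewrite) {
  return std::regex_replace(s, std::regex(pattern), rewrite,
                            std::regex_constants::format_first_only |
                                std::regex_constants::format_no_copy);
}

void replace_example() {
  assert(replace_first("yabba dabba doo", "b+", "d") == "yada dabba doo");
  assert(replace_all("yabba dabba doo", "b+", "d") == "yada dada doo");
  assert(extract_sketch("boris@kremvax.ru", "(.*)@([^.]*)", "$2!$1") ==
         "kremvax!boris");
}
```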

The C++ wrapper was contributed by Google Inc.
Copyright (c) 2007 Google Inc.

Last updated: 12 November 2007