PCRECPP(3)                 Library Functions Manual                 PCRECPP(3)

PCRE - Perl-compatible regular expressions.

#include <pcrecpp.h>

The C++ wrapper for PCRE was provided by Google Inc. Some additional
functionality was added by Giuseppe Maxia. This brief man page was
constructed from the notes in the pcrecpp.h file, which should be
consulted for further details.

The "FullMatch" operation checks that supplied text matches a supplied
pattern exactly. If pointer arguments are supplied, it copies matched
sub-strings that match sub-patterns into them.

Example: successful match
    pcrecpp::RE re("h.*o");
    re.FullMatch("hello");

Example: unsuccessful match (requires full match)
    pcrecpp::RE re("e");
    !re.FullMatch("hello");

Example: creating a temporary RE object
    pcrecpp::RE("h.*o").FullMatch("hello");

You can pass in a "const char*" or a "string" for "text". The examples
below tend to use a const char*. You can, as in the different examples
above, store the RE object explicitly in a variable or use a temporary
RE object. The examples below use one mode or the other arbitrarily.
Either could correctly be used for any of these examples.

You must supply extra pointer arguments to extract matched subpieces.

Example: extracts "ruby" into "s" and 1234 into "i"
    int i;
    string s;
    pcrecpp::RE re("(\\w+):(\\d+)");
    re.FullMatch("ruby:1234", &s, &i);

Example: does not try to extract any extra sub-patterns
    re.FullMatch("ruby:1234", &s);

Example: does not try to extract into NULL
    re.FullMatch("ruby:1234", NULL, &i);

Example: integer overflow causes failure
    !re.FullMatch("ruby:1234567891234", NULL, &i);

Example: fails because there aren't enough sub-patterns
    !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);

Example: fails because string cannot be stored in integer
    !pcrecpp::RE("(.*)").FullMatch("ruby", &i);

The provided pointer arguments can be pointers to any scalar numeric
type, or one of:

    string        (matched piece is copied to string)
    StringPiece   (StringPiece is mutated to point to matched piece)
    T             (where "bool T::ParseFrom(const char*, int)" exists)
    NULL          (the corresponding matched sub-pattern is not copied)
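
The "T" entry in the list above can be illustrated with a small
user-defined type. The following is a minimal sketch, not code from
pcrecpp: the type name Percent and its accepted format ("42%") are
hypothetical, and the only assumption taken from the text is the member
signature "bool ParseFrom(const char*, int)". The second argument is
treated as the length of the captured piece, so the sketch does not rely
on a terminating NUL.

```cpp
// Hypothetical user-defined type, shown only to illustrate the
// ParseFrom(const char*, int) contract named in the list above.
struct Percent {
    int value = 0;

    // Parses text such as "42%" from the n bytes starting at str.
    // Returns false, leaving value untouched, if the text is malformed
    // or the number is greater than 100.
    bool ParseFrom(const char* str, int n) {
        if (str == nullptr || n < 2 || str[n - 1] != '%') return false;
        int parsed = 0;
        for (int i = 0; i < n - 1; ++i) {
            if (str[i] < '0' || str[i] > '9') return false;
            parsed = parsed * 10 + (str[i] - '0');
            if (parsed > 100) return false;  // also keeps parsed from overflowing
        }
        value = parsed;
        return true;
    }
};
```

A pointer to such an object would then be passed like any other pointer
argument, for example as "&p" in a FullMatch call whose sub-pattern
captures the text "42%". A false return from ParseFrom would make the
match as a whole report failure, in the same way as the integer examples
above.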

The function returns true iff all of the following conditions are
satisfied:

    a. "text" matches "pattern" exactly;

    b. The number of matched sub-patterns is >= number of supplied
       pointers;

    c. The "i"th argument has a suitable type for holding the
       string captured as the "i"th sub-pattern. If you pass in
       void * NULL for the "i"th argument, or a non-void * NULL
       of the correct type, or pass fewer arguments than the
       number of sub-patterns, the "i"th captured sub-pattern is
       ignored.

CAVEAT: An optional sub-pattern that does not exist in the matched
string is assigned the empty string. Therefore, the following will
return false (because the empty string is not a valid number):

    int number;
    pcrecpp::RE("[a-z]+(\\d+)?").FullMatch("abc", &number);

The matching interface supports at most 16 arguments per call. If you
need more, consider using the more general interface
pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch.

NOTE: Do not use no_arg, which is used internally to mark the end of a
list of optional arguments, as a placeholder for missing arguments, as
this can lead to segfaults.

You can use the "QuoteMeta" operation to insert backslashes before all
potentially meaningful characters in a string. The returned string,
used as a regular expression, will exactly match the original string.

Example:
    string quoted = pcrecpp::RE::QuoteMeta(unquoted);

Note that it's legal to escape a character even if it has no special
meaning in a regular expression, so this function does that. (This
also makes it identical to the Perl function of the same name; see
"perldoc -f quotemeta".) For example, "1.5-2.0?" becomes
"1\.5\-2\.0\?".
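
The quoting rule described above can be modelled in a few lines. This
is an illustrative re-implementation under stated assumptions, not the
pcrecpp source: the function name QuoteMetaSketch is hypothetical, and
it assumes that "potentially meaningful" means every byte other than an
ASCII letter, digit, or underscore. The real function may treat NUL
bytes and non-ASCII bytes differently.

```cpp
#include <string>

// Sketch only: puts a backslash before every byte that is not an
// ASCII letter, digit, or underscore.
std::string QuoteMetaSketch(const std::string& unquoted) {
    std::string quoted;
    quoted.reserve(unquoted.size() * 2);
    for (unsigned char c : unquoted) {
        const bool is_word = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                             (c >= '0' && c <= '9') || c == '_';
        if (!is_word) quoted += '\\';
        quoted += static_cast<char>(c);
    }
    return quoted;
}
```

Under these assumptions, QuoteMetaSketch("1.5-2.0?") yields the same
result as the example in the text, and a string made only of word
characters comes back unchanged.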

You can use the "PartialMatch" operation when you want the pattern to
match any substring of the text.

Example: simple search for a string
    pcrecpp::RE("ell").PartialMatch("hello");

Example: find first number in a string
    int number;
    pcrecpp::RE re("(\\d+)");
    re.PartialMatch("x*100 + 20", &number);
    assert(number == 100);

By default, pattern and text are plain text, one byte per character.
The UTF8 flag, passed to the constructor, causes both pattern and
string to be treated as UTF-8 text, still a byte stream but potentially
multiple bytes per character. In practice, the text is likelier to be
UTF-8 than the pattern, but the match returned may depend on the UTF8
flag, so always use it when matching UTF-8 text. For example, "." will
match one byte normally, but with UTF8 set it may match all the bytes
that make up one multi-byte character.

Example:
    pcrecpp::RE_Options options;
    options.set_utf8();
    pcrecpp::RE re(utf8_pattern, options);
    re.FullMatch(utf8_string);

Example: using the convenience function UTF8()
    pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
    re.FullMatch(utf8_string);

NOTE: The UTF8 flag is ignored if pcre was not configured with the
--enable-utf8 option.

PCRE defines some modifiers to change the behavior of the regular
expression engine. The C++ wrapper defines an auxiliary class,
RE_Options, as a vehicle to pass such modifiers to a RE class.
Currently, the following modifiers are supported:

    modifier              description                Perl equivalent

    PCRE_CASELESS         case insensitive match     /i
    PCRE_MULTILINE        multiple lines match       /m
    PCRE_DOTALL           dot matches newlines       /s
    PCRE_DOLLAR_ENDONLY   $ matches only at end      N/A
    PCRE_EXTRA            strict escape parsing      N/A
    PCRE_EXTENDED         ignore whitespace          /x
    PCRE_UTF8             handles UTF8 chars         built-in
    PCRE_UNGREEDY         reverses * and *?          N/A
    PCRE_NO_AUTO_CAPTURE  disables capturing parens  N/A (*)

(*) Both Perl and PCRE allow non-capturing parentheses by means of the
"?:" modifier within the pattern itself, e.g. (?:ab|cd) does not
capture, while (ab|cd) does.

For a full account of how each modifier works, please check the PCRE
API reference page.

For each modifier, there are two member functions whose names are made
out of the modifier in lowercase, without the "PCRE_" prefix. For
instance, PCRE_CASELESS is handled by

    bool caseless()

which returns true if the modifier is set, and

    RE_Options & set_caseless(bool)

which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can
be accessed through the set_match_limit() and match_limit() member
functions. Setting match_limit to a non-zero value will limit the
execution of pcre to keep it from doing bad things like blowing the
stack or taking an eternity to return a result. A value of 5000 is good
enough to stop stack blowup in a 2MB thread stack. Setting match_limit
to zero disables match limiting. Alternatively, you can call
set_match_limit_recursion(), which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION
to limit how much PCRE recurses. match_limit() limits the number of
times PCRE's internal matching function is called;
match_limit_recursion() limits the depth of internal recursion, and
therefore the amount of stack that is used.

Normally, to pass one or more modifiers to a RE class, you declare a
RE_Options object, set the appropriate options, and pass this object to
a RE constructor. Example:

    RE_Options opt;
    opt.set_caseless(true);
    if (RE("HELLO", opt).PartialMatch("hello world")) ...

RE_Options has two constructors. The default constructor takes no
arguments and creates a set of flags that are off by default. The other
takes an option_flags parameter, which is there to ease the transfer of
legacy code from C programs. This lets you do

    RE(pattern,
       RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);

However, new code is better off doing

    RE(pattern,
       RE_Options().set_caseless(true).set_multiline(true))
        .PartialMatch(str);

If you are going to pass one of the most used modifiers, there are some
convenience functions that return a RE_Options class with the
appropriate modifier already set: CASELESS(), UTF8(), MULTILINE(),
DOTALL(), and EXTENDED().

If you need to set several options at once, and you don't want to go
through the pains of declaring a RE_Options object and setting several
options, there is a parallel method that gives you this ability on the
fly. You can chain several set_xxxxx() member functions, since each of
them returns a reference to its class object. For example, to pass
PCRE_CASELESS, PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one
statement, you may write:

    RE(" ^ xyz \\s+ .* blah$",
       RE_Options()
         .set_caseless(true)
         .set_extended(true)
         .set_multiline(true)).PartialMatch(sometext);
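
The chaining shown above works because each setter returns a reference
to the object it was called on. The following miniature class is a
sketch of that pattern only. It is not the RE_Options source: the class
name OptionsSketch, the flag names, and the flag values are all
hypothetical and do not correspond to the real PCRE_* constants.

```cpp
// Hypothetical stand-in for an options class with chainable setters.
class OptionsSketch {
 public:
    // Illustrative bit values; not the real PCRE_* constants.
    enum Flag { kCaseless = 1 << 0, kMultiline = 1 << 1, kExtended = 1 << 2 };

    OptionsSketch() : flags_(0) {}                                // all flags off
    explicit OptionsSketch(int option_flags) : flags_(option_flags) {}

    bool caseless() const { return (flags_ & kCaseless) != 0; }
    bool multiline() const { return (flags_ & kMultiline) != 0; }
    bool extended() const { return (flags_ & kExtended) != 0; }

    // Each setter returns *this, which is what makes chaining possible.
    OptionsSketch& set_caseless(bool on) { return Set(kCaseless, on); }
    OptionsSketch& set_multiline(bool on) { return Set(kMultiline, on); }
    OptionsSketch& set_extended(bool on) { return Set(kExtended, on); }

    int all_options() const { return flags_; }

 private:
    OptionsSketch& Set(Flag f, bool on) {
        if (on) flags_ |= f; else flags_ &= ~f;
        return *this;
    }
    int flags_;
};
```

With this sketch, OptionsSketch().set_caseless(true).set_extended(true)
builds the same flag set as constructing from the two flags combined
with "|", which mirrors the two styles contrasted in the text.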

The "Consume" operation may be useful if you want to repeatedly match
regular expressions at the front of a string and skip over them as they
match. This requires use of the "StringPiece" type, which represents a
sub-range of a real string. Like RE, StringPiece is defined in the
pcrecpp namespace.

Example: read lines of the form "var = value" from a string
    string contents = ...;                 // Fill string somehow
    pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece

    string var;
    int value;
    pcrecpp::RE re("(\\w+) = (\\d+)\n");
    while (re.Consume(&input, &var, &value)) {
        ...;
    }

Each successful call to "Consume" will set "var" and "value", and also
advance "input" so it points past the matched text.

The "FindAndConsume" operation is similar to "Consume" but does not
anchor your match at the beginning of the string. For example, you
could extract all words from a string by repeatedly calling

    pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)
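
The difference between the two operations can be seen in a model that
does not need the library. The sketch below is a stand-in built on the
standard <regex> header, not pcrecpp: the function names are
hypothetical, an iterator plays the role of the advancing StringPiece,
and the match_continuous flag plays the role of the front anchor that
"Consume" applies.

```cpp
#include <map>
#include <regex>
#include <string>
#include <vector>

// Models the "Consume" loop: every match must start exactly at pos.
// Note: std::stoi throws on overflow; this sketch does not handle that.
std::map<std::string, int> ParseAssignments(const std::string& contents) {
    static const std::regex line_re("(\\w+) = (\\d+)\n");
    std::map<std::string, int> result;
    std::string::const_iterator pos = contents.begin();
    std::smatch m;
    while (std::regex_search(pos, contents.end(), m, line_re,
                             std::regex_constants::match_continuous)) {
        result[m[1].str()] = std::stoi(m[2].str());
        pos = m[0].second;  // advance past the matched text
    }
    return result;
}

// Models the "FindAndConsume" loop: a match may start anywhere at or
// after pos, so text between words is skipped.
std::vector<std::string> ExtractWords(const std::string& text) {
    static const std::regex word_re("(\\w+)");
    std::vector<std::string> words;
    std::string::const_iterator pos = text.begin();
    std::smatch m;
    while (std::regex_search(pos, text.end(), m, word_re)) {
        words.push_back(m[1].str());
        pos = m[0].second;
    }
    return words;
}
```

In the anchored loop, the first line that does not have the form
"var = value" stops the scan, because nothing matches at the current
position. The unanchored loop keeps going until no further match exists
anywhere in the remaining text.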

By default, if you pass a pointer to a numeric value, the corresponding
text is interpreted as a base-10 number. You can instead wrap the
pointer with a call to one of the operators Hex(), Octal(), or CRadix()
to interpret the text in another base. The CRadix operator interprets
C-style "0" (base-8) and "0x" (base-16) prefixes, but defaults to
base-10.

Example:
    int a, b, c, d;
    pcrecpp::RE re("(.*) (.*) (.*) (.*)");
    re.FullMatch("100 40 0100 0x40",
                 pcrecpp::Octal(&a), pcrecpp::Hex(&b),
                 pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));

will leave 64 in a, b, c, and d.
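
The arithmetic behind that result can be checked without the library.
The sketch below uses the standard strtol function, which follows the
same C-style prefix convention when its base argument is 0. The helper
name ParseIntRadix is hypothetical, and the code is a model of the
behaviour described in this page, not the wrapper's implementation. It
also rejects empty input and values that do not fit in an int, which
corresponds to the failure cases described for FullMatch.

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Sketch only: parses all of text as an int in the given base.
// base 8 models Octal(), base 16 models Hex(), base 0 models CRadix().
bool ParseIntRadix(const std::string& text, int base, int* out) {
    if (text.empty()) return false;  // the empty string is not a number
    errno = 0;
    char* end = nullptr;
    const long v = std::strtol(text.c_str(), &end, base);
    if (end != text.c_str() + text.size()) return false;  // unparsed trailing text
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX) return false;  // overflow
    *out = static_cast<int>(v);
    return true;
}
```

Parsing "100" in base 8, "40" in base 16, and "0100" and "0x40" in
base 0 gives 64 each time: 1*64, 4*16, 1*64 (octal prefix), and 4*16
(hexadecimal prefix).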

You can replace the first match of "pattern" in "str" with "rewrite".
Within "rewrite", backslash-escaped digits (\1 to \9) can be used to
insert the text matching the corresponding parenthesized group from the
pattern. \0 in "rewrite" refers to the entire matching text. For
example:

    string s = "yabba dabba doo";
    pcrecpp::RE("b+").Replace("d", &s);

will leave "s" containing "yada dabba doo". The result is true if the
pattern matches and a replacement occurs, false otherwise.

GlobalReplace is like Replace except that it replaces all occurrences
of the pattern in the string with the rewrite. Replacements are not
subject to re-matching. For example:

    string s = "yabba dabba doo";
    pcrecpp::RE("b+").GlobalReplace("d", &s);

will leave "s" containing "yada dada doo". It returns the number of
replacements made.

Extract is like Replace, except that if the pattern matches, "rewrite"
is copied into "out" (an additional argument) with substitutions. The
non-matching portions of "text" are ignored. Returns true iff a match
occurred and the extraction happened successfully; if no match occurs,
the string is left unaffected.
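
The three behaviours can be compared side by side in a model that does
not depend on the library. The sketch below is a stand-in built on the
standard <regex> header, not pcrecpp, and the function names are
hypothetical. One difference to keep in mind: std::regex writes group
references in the rewrite string as $1 and $&, where the rewrite strings
described above use \1 and \0.

```cpp
#include <iterator>
#include <regex>
#include <string>

// Models Replace: rewrites only the first match, reports whether one existed.
bool ReplaceFirst(const std::regex& re, const std::string& rewrite, std::string* s) {
    if (!std::regex_search(*s, re)) return false;
    *s = std::regex_replace(*s, re, rewrite,
                            std::regex_constants::format_first_only);
    return true;
}

// Models GlobalReplace: rewrites every match, returns how many there were.
int ReplaceAll(const std::regex& re, const std::string& rewrite, std::string* s) {
    const int count = static_cast<int>(std::distance(
        std::sregex_iterator(s->begin(), s->end(), re), std::sregex_iterator()));
    *s = std::regex_replace(*s, re, rewrite);
    return count;
}

// Models Extract: out receives only the expanded rewrite; the text
// around the match is dropped, and out is untouched when nothing matches.
bool ExtractFirst(const std::regex& re, const std::string& rewrite,
                  const std::string& text, std::string* out) {
    std::smatch m;
    if (!std::regex_search(text, m, re)) return false;
    *out = m.format(rewrite);
    return true;
}
```

Applied to "yabba dabba doo" with the pattern "b+" and the rewrite "d",
the first two functions reproduce the results given in the text. For the
third, the pattern "(\\w+)@(\\w+)" with the rewrite "$2!$1" turns
"mail boris@kremvax now" into "kremvax!boris", discarding "mail" and
"now".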

The C++ wrapper was contributed by Google Inc.
Copyright (c) 2007 Google Inc.

Last updated: 17 March 2009