PCRECPP(3) Library Functions Manual PCRECPP(3)



PCRE - Perl-compatible regular expressions.


#include <pcrecpp.h>


The C++ wrapper for PCRE was provided by Google Inc. Some additional
functionality was added by Giuseppe Maxia. This brief man page was
constructed from the notes in the pcrecpp.h file, which should be
consulted for further details.


The "FullMatch" operation checks that supplied text matches a supplied
pattern exactly. If pointer arguments are supplied, it copies matched
sub-strings that match sub-patterns into them.

Example: successful match
    pcrecpp::RE re("h.*o");
    re.FullMatch("hello");

Example: unsuccessful match (requires full match):
    pcrecpp::RE re("e");
    !re.FullMatch("hello");

Example: creating a temporary RE object:
    pcrecpp::RE("h.*o").FullMatch("hello");

You can pass in a "const char*" or a "string" for "text". The examples
below tend to use a const char*. You can, as in the different examples
above, store the RE object explicitly in a variable or use a temporary
RE object. The examples below use one mode or the other arbitrarily.
Either could correctly be used for any of these examples.

You must supply extra pointer arguments to extract matched subpieces.

Example: extracts "ruby" into "s" and 1234 into "i"
    int i;
    string s;
    pcrecpp::RE re("(\\w+):(\\d+)");
    re.FullMatch("ruby:1234", &s, &i);

Example: does not try to extract any extra sub-patterns
    re.FullMatch("ruby:1234", &s);

Example: does not try to extract into NULL
    re.FullMatch("ruby:1234", NULL, &i);

Example: integer overflow causes failure
    !re.FullMatch("ruby:1234567891234", NULL, &i);

Example: fails because there aren't enough sub-patterns:
    !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);

Example: fails because string cannot be stored in integer
    !pcrecpp::RE("(.*)").FullMatch("ruby", &i);

The provided pointer arguments can be pointers to any scalar numeric
type, or one of:

    string        (matched piece is copied to string)
    StringPiece   (StringPiece is mutated to point to matched piece)
    T             (where "bool T::ParseFrom(const char*, int)" exists)
    NULL          (the corresponding matched sub-pattern is not copied)

The function returns true iff all of the following conditions are
satisfied:

  a. "text" matches "pattern" exactly;

  b. The number of matched sub-patterns is >= number of supplied
     pointers;

  c. The "i"th argument has a suitable type for holding the
     string captured as the "i"th sub-pattern. If you pass in
     NULL for the "i"th argument, or pass fewer arguments than
     the number of sub-patterns, the "i"th captured sub-pattern
     is ignored.

CAVEAT: An optional sub-pattern that does not exist in the matched
string is assigned the empty string. Therefore, the following will
return false (because the empty string is not a valid number):

    int number;
    pcrecpp::RE("[a-z]+(\\d+)?").FullMatch("abc", &number);

The matching interface supports at most 16 arguments per call. If you
need more, consider using the more general interface
pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch.
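
The numeric failure cases above (overflow, non-numeric text, and the
empty string from an unmatched optional sub-pattern) can be pictured
with a small standalone helper. This is only a sketch: the name
parse_int_sketch is hypothetical, it is not part of pcrecpp, and the
wrapper's real conversion code may differ in detail.

```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical helper (not a pcrecpp function): converts a captured
// substring to int and reports failure in the cases the text describes.
bool parse_int_sketch(const std::string& text, int* out) {
    if (text.empty()) return false;  // empty capture is not a valid number
    errno = 0;
    char* end = nullptr;
    const long value = std::strtol(text.c_str(), &end, 10);
    // The whole capture must be consumed; "ruby" consumes nothing.
    if (end != text.c_str() + text.size()) return false;
    if (errno == ERANGE) return false;                     // overflows long
    if (value < INT_MIN || value > INT_MAX) return false;  // overflows int
    *out = static_cast<int>(value);
    return true;
}
```

With this helper, "1234" is stored as 1234, while "1234567891234",
"ruby", and "" are all rejected, which mirrors the FullMatch results
listed in the examples.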


You can use the "QuoteMeta" operation to insert backslashes before all
potentially meaningful characters in a string. The returned string,
used as a regular expression, will exactly match the original string.

Example:
    string quoted = RE::QuoteMeta(unquoted);

Note that it's legal to escape a character even if it has no special
meaning in a regular expression -- so this function does that. (This
also makes it identical to the Perl function of the same name; see
"perldoc -f quotemeta".) For example, "1.5-2.0?" becomes
"1\.5\-2\.0\?".
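
The escaping rule can be sketched in a few lines. The function below
is a hypothetical, ASCII-only illustration (quote_meta_sketch is not
the library routine): it assumes that every character other than a
letter, digit, or underscore gets a backslash, which is consistent with
the example above but ignores whatever special handling the real
QuoteMeta applies to other bytes.

```cpp
#include <cassert>
#include <string>

// Hypothetical ASCII-only sketch of the quoting rule; not RE::QuoteMeta.
std::string quote_meta_sketch(const std::string& unquoted) {
    std::string result;
    result.reserve(unquoted.size() * 2);
    for (char c : unquoted) {
        const bool word_char = (c >= 'a' && c <= 'z') ||
                               (c >= 'A' && c <= 'Z') ||
                               (c >= '0' && c <= '9') || c == '_';
        if (!word_char) result += '\\';  // escape everything else
        result += c;
    }
    return result;
}
```

Applied to "1.5-2.0?" this yields "1\.5\-2\.0\?", and a string made
only of letters, digits, and underscores comes back unchanged.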


You can use the "PartialMatch" operation when you want the pattern to
match any substring of the text.

Example: simple search for a string:
    pcrecpp::RE("ell").PartialMatch("hello");

Example: find first number in a string:
    int number;
    pcrecpp::RE re("(\\d+)");
    re.PartialMatch("x*100 + 20", &number);
    assert(number == 100);
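
For readers who want to experiment without PCRE installed, the
difference between a full and a partial match can be reproduced with
the standard <regex> library. This is an analogy only: std::regex uses
ECMAScript syntax rather than PCRE, and the function names below are
hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Analogue of FullMatch: the entire text must match the pattern.
bool full_match_sketch(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}

// Analogue of PartialMatch: any substring of the text may match.
bool partial_match_sketch(const std::string& pattern, const std::string& text) {
    return std::regex_search(text, std::regex(pattern));
}

// Analogue of the "find first number" example. std::stoi throws on
// overflow, unlike the wrapper, which returns false.
bool first_number_sketch(const std::string& text, int* number) {
    static const std::regex digits("(\\d+)");
    std::smatch m;
    if (!std::regex_search(text, m, digits)) return false;
    *number = std::stoi(m[1].str());
    return true;
}
```

Here "e" fails as a full match against "hello" but succeeds as a
partial match, and the first number found in "x*100 + 20" is 100.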


By default, pattern and text are plain text, one byte per character.
The UTF8 flag, passed to the constructor, causes both pattern and
string to be treated as UTF-8 text, still a byte stream but potentially
multiple bytes per character. In practice, the text is likelier to be
UTF-8 than the pattern, but the match returned may depend on the UTF8
flag, so always use it when matching UTF-8 text. For example, "." will
match one byte normally, but with UTF8 set it may match all the bytes
of a multi-byte character.

Example:
    pcrecpp::RE_Options options;
    options.set_utf8(true);
    pcrecpp::RE re(utf8_pattern, options);
    re.FullMatch(utf8_string);

Example: using the convenience function UTF8():
    pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
    re.FullMatch(utf8_string);

NOTE: The UTF8 flag is ignored if pcre was not configured with the
--enable-utf8 flag.
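
To see why the flag matters, compare the number of bytes in a string
with the number of UTF-8 characters it encodes. The helper below is a
hypothetical sketch, independent of PCRE, and assumes its input is
valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts UTF-8 characters by skipping continuation bytes, which
// always have the bit pattern 10xxxxxx. Assumes valid UTF-8 input.
std::size_t utf8_length_sketch(const std::string& bytes) {
    std::size_t count = 0;
    for (unsigned char c : bytes) {
        if ((c & 0xC0) != 0x80) ++count;  // lead byte or ASCII byte
    }
    return count;
}
```

The string "h\xC3\xA9" "llo" (the word "hello" with an accented e) is
six bytes long but five characters long. A byte-oriented "." would need
six repetitions to cover it, while a character-oriented "." needs five.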


PCRE defines some modifiers to change the behavior of the regular
expression engine. The C++ wrapper defines an auxiliary class,
RE_Options, as a vehicle to pass such modifiers to a RE class.
Currently, the following modifiers are supported:

    modifier              description                Perl equivalent

    PCRE_CASELESS         case insensitive match     /i
    PCRE_MULTILINE        multiple lines match       /m
    PCRE_DOTALL           dot matches newlines       /s
    PCRE_DOLLAR_ENDONLY   $ matches only at end      N/A
    PCRE_EXTRA            strict escape parsing      N/A
    PCRE_EXTENDED         ignore whitespace          /x
    PCRE_UTF8             handles UTF8 chars         built-in
    PCRE_UNGREEDY         reverses * and *?          N/A
    PCRE_NO_AUTO_CAPTURE  disables capturing parens  N/A (*)

(*) Both Perl and PCRE allow non-capturing parentheses by means of the
"?:" modifier within the pattern itself, e.g. (?:ab|cd) does not
capture, while (ab|cd) does.

For a full account of how each modifier works, please check the PCRE
API reference page.

For each modifier, there are two member functions whose names are
derived from the modifier in lowercase, without the "PCRE_" prefix. For
instance, PCRE_CASELESS is handled by

    bool caseless()

which returns true if the modifier is set, and

    RE_Options & set_caseless(bool)

which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can
be accessed through the set_match_limit() and match_limit() member
functions. Setting match_limit to a non-zero value will limit the
execution of pcre to keep it from doing bad things like blowing the
stack or taking an eternity to return a result. A value of 5000 is good
enough to stop stack blowup in a 2MB thread stack. Setting match_limit
to zero disables match limiting. Alternatively, you can call
match_limit_recursion() which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to
limit how much PCRE recurses. match_limit() limits the number of
matches PCRE does; match_limit_recursion() limits the depth of internal
recursion, and therefore the amount of stack that is used.

Normally, to pass one or more modifiers to a RE class, you declare a
RE_Options object, set the appropriate options, and pass this object to
a RE constructor. Example:

    RE_Options opt;
    opt.set_caseless(true);
    if (RE("HELLO", opt).PartialMatch("hello world")) ...

RE_Options has two constructors. The default constructor takes no
arguments and creates a set of flags that are off by default. The other
takes the parameter option_flags, which is there to facilitate transfer
of legacy code from C programs. This lets you do

    RE(pattern,
       RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);

However, new code is better off doing

    RE(pattern,
       RE_Options().set_caseless(true).set_multiline(true))
         .PartialMatch(str);

If you are going to pass one of the most used modifiers, there are some
convenience functions that return a RE_Options object with the
appropriate modifier already set: CASELESS(), UTF8(), MULTILINE(),
DOTALL(), and EXTENDED().

If you need to set several options at once, and you don't want to go
through the pains of declaring a RE_Options object and setting several
options, there is a parallel method that gives you such ability on the
fly. You can chain several set_xxxxx() member functions, since each of
them returns a reference to its class object. For example, to pass
PCRE_CASELESS, PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one
statement, you may write:

    RE(" ^ xyz \\s+ .* blah$",
       RE_Options()
         .set_caseless(true)
         .set_extended(true)
         .set_multiline(true)).PartialMatch(sometext);
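
The chaining works because each setter returns a reference to the
object it was called on. The class below is a minimal standalone
sketch of that pattern, not the real RE_Options: the class name, the
flags() accessor, and the flag values are all hypothetical (the real
PCRE_* constants come from pcre.h).

```cpp
#include <cassert>

// Hypothetical stand-in for an options class with chainable setters.
class OptionsSketch {
 public:
    // Made-up bit values, for illustration only.
    enum : int { kCaseless = 1, kMultiline = 2, kExtended = 4 };

    OptionsSketch() : flags_(0) {}                        // all flags off
    explicit OptionsSketch(int option_flags) : flags_(option_flags) {}

    bool caseless() const { return (flags_ & kCaseless) != 0; }
    bool multiline() const { return (flags_ & kMultiline) != 0; }
    bool extended() const { return (flags_ & kExtended) != 0; }

    // Each setter returns *this so that calls can be chained.
    OptionsSketch& set_caseless(bool on) { return set(kCaseless, on); }
    OptionsSketch& set_multiline(bool on) { return set(kMultiline, on); }
    OptionsSketch& set_extended(bool on) { return set(kExtended, on); }

    int flags() const { return flags_; }

 private:
    OptionsSketch& set(int bit, bool on) {
        if (on) flags_ |= bit; else flags_ &= ~bit;
        return *this;
    }
    int flags_;
};
```

In this sketch, OptionsSketch().set_caseless(true).set_multiline(true)
ends up with the same flags as OptionsSketch(kCaseless | kMultiline),
which is the equivalence between the two constructor styles that the
text describes.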



The "Consume" operation may be useful if you want to repeatedly match
regular expressions at the front of a string and skip over them as they
match. This requires use of the "StringPiece" type, which represents a
sub-range of a real string. Like RE, StringPiece is defined in the
pcrecpp namespace.

Example: read lines of the form "var = value" from a string.
    string contents = ...;                 // Fill string somehow
    pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece

    string var;
    int value;
    pcrecpp::RE re("(\\w+) = (\\d+)\n");
    while (re.Consume(&input, &var, &value)) {
      ...;
    }

Each successful call to "Consume" will set "var" and "value", and also
advance "input" so it points past the matched text.

The "FindAndConsume" operation is similar to "Consume" but does not
anchor your match at the beginning of the string. For example, you
could extract all words from a string by repeatedly calling

    pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)
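
The consume-and-advance loop can be imitated with the standard <regex>
library, which may help when reasoning about how "input" moves. This
is an analogy under stated assumptions, not the pcrecpp implementation:
the function name is hypothetical, an iterator plays the role of the
StringPiece, and the match_continuous flag supplies the anchoring at
the current position.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical analogue of the Consume loop: repeatedly matches
// "var = value\n" at the current position and advances past it.
std::vector<std::pair<std::string, int>> consume_assignments_sketch(
        const std::string& contents) {
    static const std::regex line_re("(\\w+) = (\\d+)\n");
    std::vector<std::pair<std::string, int>> result;
    std::string::const_iterator pos = contents.begin();
    std::smatch m;
    // match_continuous requires the match to start exactly at pos.
    while (std::regex_search(pos, contents.end(), m, line_re,
                             std::regex_constants::match_continuous)) {
        // std::stoi throws on overflow; Consume would return false.
        result.emplace_back(m[1].str(), std::stoi(m[2].str()));
        pos = m[0].second;  // advance past the matched text
    }
    return result;
}
```

Given "a = 1\nbb = 22\nrest", the loop collects ("a", 1) and
("bb", 22) and stops at "rest", where the anchored match fails.
Dropping the match_continuous flag gives the unanchored behaviour that
FindAndConsume provides.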


By default, if you pass a pointer to a numeric value, the corresponding
text is interpreted as a base-10 number. You can instead wrap the
pointer with a call to one of the operators Hex(), Octal(), or CRadix()
to interpret the text in another base. The CRadix operator interprets
C-style "0" (base-8) and "0x" (base-16) prefixes, but defaults to
base-10.

Example:
    int a, b, c, d;
    pcrecpp::RE re("(.*) (.*) (.*) (.*)");
    re.FullMatch("100 40 0100 0x40",
                 pcrecpp::Octal(&a), pcrecpp::Hex(&b),
                 pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));

will leave 64 in a, b, c, and d.
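
The arithmetic behind this example can be checked with the standard
strtol function, whose base argument follows the same prefix
conventions when it is 0. The wrapper below is a hypothetical helper
for illustration; whether pcrecpp uses strtol internally is an
assumption here, not something this page states.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: parses text in the given base. Base 0 means
// "detect a C-style prefix": leading "0x" is hexadecimal, leading "0"
// is octal, anything else is decimal.
long parse_radix_sketch(const char* text, int base) {
    return std::strtol(text, nullptr, base);
}
```

Parsing "100" in base 8, "40" in base 16, and both "0100" and "0x40"
in base 0 gives 64 each time, while "100" in base 0 stays decimal and
gives 100.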


You can replace the first match of "pattern" in "str" with "rewrite".
Within "rewrite", backslash-escaped digits (\1 to \9) can be used to
insert text matching the corresponding parenthesized group from the
pattern. \0 in "rewrite" refers to the entire matching text. For
example:

    string s = "yabba dabba doo";
    pcrecpp::RE("b+").Replace("d", &s);

will leave "s" containing "yada dabba doo". The result is true if the
pattern matches and a replacement occurs, false otherwise.

GlobalReplace is like Replace except that it replaces all occurrences
of the pattern in the string with the rewrite. Replacements are not
subject to re-matching. For example:

    string s = "yabba dabba doo";
    pcrecpp::RE("b+").GlobalReplace("d", &s);

will leave "s" containing "yada dada doo". It returns the number of
replacements made.

Extract is like Replace, except that if the pattern matches, "rewrite"
is copied into "out" (an additional argument) with substitutions. The
non-matching portions of "text" are ignored. Returns true iff a match
occurred and the extraction happened successfully; if no match occurs,
the string is left unaffected.
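
The two replacement modes can be tried out with the standard <regex>
library as a rough analogue. The function names are hypothetical and
this is not how pcrecpp is implemented. Note one difference in
notation: std::regex rewrite strings refer to groups as $1 and to the
whole match as $&, not as \1 and \0.

```cpp
#include <cassert>
#include <iterator>
#include <regex>
#include <string>

// Analogue of Replace: rewrites only the first match. Returns true
// if a match was found and replaced.
bool replace_first_sketch(const std::string& pattern,
                          const std::string& rewrite, std::string* str) {
    const std::regex re(pattern);
    if (!std::regex_search(*str, re)) return false;
    *str = std::regex_replace(*str, re, rewrite,
                              std::regex_constants::format_first_only);
    return true;
}

// Analogue of GlobalReplace: rewrites every match and returns the
// number of replacements made.
int global_replace_sketch(const std::string& pattern,
                          const std::string& rewrite, std::string* str) {
    const std::regex re(pattern);
    const int count = static_cast<int>(std::distance(
        std::sregex_iterator(str->begin(), str->end(), re),
        std::sregex_iterator()));
    *str = std::regex_replace(*str, re, rewrite);
    return count;
}
```

Starting from "yabba dabba doo" with pattern "b+" and rewrite "d", the
first function produces "yada dabba doo" and the second produces
"yada dada doo" with a count of 2, matching the results described
above.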


The C++ wrapper was contributed by Google Inc.
Copyright (c) 2007 Google Inc.


Last updated: 06 March 2007