PCRE2CALLOUT(3)          Library Functions Manual          PCRE2CALLOUT(3)

NAME
PCRE2 - Perl-compatible regular expressions (revised API)

SYNOPSIS

#include <pcre2.h>

int (*pcre2_callout)(pcre2_callout_block *, void *);

int pcre2_callout_enumerate(const pcre2_code *code,
  int (*callback)(pcre2_callout_enumerate_block *, void *),
  void *user_data);

DESCRIPTION

PCRE2 provides a feature called "callout", which is a means of
temporarily passing control to the caller of PCRE2 in the middle of
pattern matching. The caller of PCRE2 provides an external function by
putting its entry point in a match context (see pcre2_set_callout() in
the pcre2api documentation).

Within a regular expression, (?C<arg>) indicates a point at which the
external function is to be called. Different callout points can be
identified by putting a number less than 256 after the letter C. The
default value is zero. Alternatively, the argument may be a delimited
string. The starting delimiter must be one of ` ' " ^ % # $ { and the
ending delimiter is the same as the start, except for {, where the
ending delimiter is }. If the ending delimiter is needed within the
string, it must be doubled. For example, this pattern has two callout
points:

  (?C1)abc(?C"some ""arbitrary"" text")def

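The doubling rule for string arguments can be sketched in C as follows.
This is only an illustration: build_string_callout() is a hypothetical
helper, not a PCRE2 function, and it assumes an 8-bit, zero-terminated
string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of PCRE2): writes a (?C...) item with a
   string argument into out, doubling the ending delimiter wherever it
   occurs in text. Returns the number of characters written, or 0 if
   open is not an allowed delimiter or the buffer is too small. */
static size_t build_string_callout(const char *text, char open,
                                   char *out, size_t outsize)
{
    /* Allowed starting delimiters, as listed in the text above. */
    if (open == '\0' || strchr("`'\"^%#${", open) == NULL)
        return 0;

    char close = (open == '{') ? '}' : open;

    /* "(?C" + open + close + ")" + terminating zero, plus the body. */
    size_t need = 7;
    for (const char *p = text; *p != '\0'; p++)
        need += (*p == close) ? 2 : 1;
    if (need > outsize)
        return 0;

    size_t n = 0;
    memcpy(out, "(?C", 3);
    n = 3;
    out[n++] = open;
    for (const char *p = text; *p != '\0'; p++) {
        out[n++] = *p;
        if (*p == close)
            out[n++] = close;   /* double the ending delimiter */
    }
    out[n++] = close;
    out[n++] = ')';
    out[n] = '\0';
    return n;
}
```

With the delimiter " and the text some "arbitrary" text, this produces
the second callout item of the pattern shown above.
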
If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled,
PCRE2 automatically inserts callouts, all with number 255, before each
item in the pattern, except immediately before or after an explicit
callout. For example, if PCRE2_AUTO_CALLOUT is used with the pattern

  A(?C3)B

it is processed as if it were

  (?C255)A(?C3)B(?C255)

Here is a more complicated example:

  A(\d{2}|--)

With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were

  (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)

Notice that there is a callout before and after each parenthesis and
alternation bar. If the pattern contains a conditional group whose
condition is an assertion, an automatic callout is inserted immediately
before the condition. Such a callout may also be inserted explicitly,
for example:

  (?(?C9)(?=a)ab|de)  (?(?C%text%)(?!=d)ab|de)

This applies only to assertion conditions (because they are themselves
independent groups).

Callouts can be useful for tracking the progress of pattern matching.
The pcre2test program has a pattern qualifier (/auto_callout) that sets
automatic callouts. When any callouts are present, the output from
pcre2test indicates how the pattern is being matched. This is useful
information when you are trying to optimize the performance of a
particular pattern.

MISSING CALLOUTS

You should be aware that, because of optimizations in the way PCRE2
compiles and matches patterns, callouts sometimes do not happen exactly
as you might expect.

Auto-possessification

At compile time, PCRE2 "auto-possessifies" repeated items when it knows
that what follows cannot be part of the repeat. For example, a+[bc] is
compiled as if it were a++[bc]. The pcre2test output when this pattern
is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
to the string "aaaa" is:

  --->aaaa
   +0 ^        a+
   +2 ^   ^    [bc]
  No match

This indicates that when matching [bc] fails, there is no backtracking
into a+ (because it is being treated as a++) and therefore the callouts
that would be taken for the backtracks do not occur. You can disable
the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
pcre2_compile(), or starting the pattern with (*NO_AUTO_POSSESS). In
this case, the output changes to this:

  --->aaaa
   +0 ^        a+
   +2 ^   ^    [bc]
   +2 ^  ^     [bc]
   +2 ^ ^      [bc]
   +2 ^^       [bc]
  No match

This time, when matching [bc] fails, the matcher backtracks into a+ and
tries again, repeatedly, until a+ itself fails.

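When reading these traces, it helps to know how the marker column is
laid out. The sketch below assumes, as the pcre2test output suggests,
that the circumflexes mark the start of the current match attempt and
the current position in the subject. caret_line() is a hypothetical
helper written for this illustration, not part of PCRE2 or pcre2test.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: fills out with a marker line for a subject of
   subject_length code units, placing '^' at start_match and at
   current_position (a single caret if the two coincide). Returns 0 on
   success, -1 if an offset is out of range or the buffer is too small. */
static int caret_line(size_t subject_length, size_t start_match,
                      size_t current_position, char *out, size_t outsize)
{
    if (start_match > subject_length || current_position > subject_length)
        return -1;
    /* One column per code unit, one for "end of subject", one for NUL. */
    if (outsize < 2 || subject_length > outsize - 2)
        return -1;

    memset(out, ' ', subject_length + 1);
    out[start_match] = '^';
    out[current_position] = '^';
    out[subject_length + 1] = '\0';
    return 0;
}
```

For the four-character subject above, a start of 0 and a current
position of 4 give "^   ^", and each backtrack into a+ moves the second
circumflex one column to the left.
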
Automatic .* anchoring

By default, an optimization is applied when .* is the first significant
item in a pattern. If PCRE2_DOTALL is set, so that the dot can match
any character, the pattern is automatically anchored. If PCRE2_DOTALL
is not set, a match can start only after an internal newline or at the
beginning of the subject, and pcre2_compile() remembers this. This
optimization is disabled, however, if .* is in an atomic group or if
there is a back reference to the capturing group in which it appears.
It is also disabled if the pattern contains (*PRUNE) or (*SKIP).
However, the presence of callouts does not affect it.

For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT
and applied to the string "aa", the pcre2test output is:

  --->aa
   +0 ^      .*
   +2 ^ ^    \d
   +2 ^^     \d
   +2 ^      \d
  No match

This shows that all match attempts start at the beginning of the
subject. In other words, the pattern is anchored. You can disable this
optimization by passing PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or
starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the
output changes to:

  --->aa
   +0 ^      .*
   +2 ^ ^    \d
   +2 ^^     \d
   +2 ^      \d
   +0  ^     .*
   +2  ^^    \d
   +2  ^     \d
  No match

This shows more match attempts, starting at the second subject
character. Another optimization, described in the next section, means
that there is no subsequent attempt to match with an empty subject.

If a pattern has more than one top-level branch, automatic anchoring
occurs if all branches are anchorable.

Other optimizations

Other optimizations that provide fast "no match" results also affect
callouts. For example, if the pattern is

  ab(?C4)cd

PCRE2 knows that any matching string must contain the letter "d". If
the subject string is "abyz", the lack of "d" means that matching
doesn't ever start, and the callout is never reached. However, with
"abyd", though the result is still no match, the callout is obeyed.

PCRE2 also knows the minimum length of a matching string, and will
immediately give a "no match" return without actually running a match
if the subject is not long enough, or, for unanchored patterns, if it
has been scanned far enough.

You can disable these optimizations by passing the
PCRE2_NO_START_OPTIMIZE option to pcre2_compile(), or by starting the
pattern with (*NO_START_OPT). This slows down the matching process, but
does ensure that callouts such as the example above are obeyed.

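The effect of these two checks on callouts can be pictured with a small
conceptual sketch. It is not PCRE2's implementation, whose checks are
internal and more elaborate; worth_attempting() and its parameters are
hypothetical names chosen for this illustration, and an 8-bit subject
is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Conceptual sketch only: returns 1 if a match attempt is worth
   starting, 0 if a fast "no match" can be given. When the answer is 0,
   no part of the pattern runs, so no callout can be reached. */
static int worth_attempting(const char *subject, size_t length,
                            size_t min_length, char required_unit)
{
    if (length < min_length)
        return 0;                  /* subject is too short */
    if (memchr(subject, required_unit, length) == NULL)
        return 0;                  /* a required code unit is absent */
    return 1;
}
```

For the pattern ab(?C4)cd, with a minimum length of 4 and "d" as the
required code unit, the subject "abyz" is rejected before matching
starts, whereas "abyd" passes both checks and the callout is reached.
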
THE CALLOUT INTERFACE

During matching, when PCRE2 reaches a callout point, if an external
function is set in the match context, it is called. This applies to
both normal and DFA matching. The first argument to the callout
function is a pointer to a pcre2_callout_block. The second argument is
the void * callout data that was supplied when the callout was set up
by calling pcre2_set_callout() (see the pcre2api documentation). The
callout block structure contains the following fields:

  uint32_t      version;
  uint32_t      callout_number;
  uint32_t      capture_top;
  uint32_t      capture_last;
  PCRE2_SIZE   *offset_vector;
  PCRE2_SPTR    mark;
  PCRE2_SPTR    subject;
  PCRE2_SIZE    subject_length;
  PCRE2_SIZE    start_match;
  PCRE2_SIZE    current_position;
  PCRE2_SIZE    pattern_position;
  PCRE2_SIZE    next_item_length;
  PCRE2_SIZE    callout_string_offset;
  PCRE2_SIZE    callout_string_length;
  PCRE2_SPTR    callout_string;

The version field contains the version number of the block format. The
current version is 1; the three callout string fields were added for
this version. If you are writing an application that might use an
earlier release of PCRE2, you should check the version number before
accessing any of these fields. The version number will increase in
future if more fields are added, but the intention is never to remove
any of the existing fields.

Fields for numerical callouts

For a numerical callout, callout_string is NULL, and callout_number
contains the number of the callout, in the range 0-255. This is the
number that follows (?C for callouts that are part of the pattern; it
is 255 for automatically generated callouts.

Fields for string callouts

For callouts with string arguments, callout_number is always zero, and
callout_string points to the string that is contained within the
compiled pattern. Its length is given by callout_string_length.
Duplicated ending delimiters that were present in the original pattern
string have been turned into single characters, but there is no other
processing of the callout string argument. An additional code unit
containing binary zero is present after the string, but is not included
in the length. The delimiter that was used to start the string is also
stored within the pattern, immediately before the string itself. You
can access this delimiter as callout_string[-1] if you need it.

The callout_string_offset field is the code unit offset to the start of
the callout argument string within the original pattern string. This is
provided for the benefit of applications such as script languages that
might need to report errors in the callout string within the pattern.

Fields for all callouts

The remaining fields in the callout block are the same for both kinds
of callout.

The offset_vector field is a pointer to the vector of capturing offsets
(the "ovector") that was passed to the matching function in the match
data block. When pcre2_match() is used, the contents can be inspected
in order to extract substrings that have been matched so far, in the
same way as for extracting substrings after a match has completed. For
the DFA matching function, this field is not useful.

The subject and subject_length fields contain copies of the values that
were passed to the matching function.

The start_match field normally contains the offset within the subject
at which the current match attempt started. However, if the escape
sequence \K has been encountered, this value is changed to reflect the
modified starting point. If the pattern is not anchored, the callout
function may be called several times from the same point in the pattern
for different starting points in the subject.

The current_position field contains the offset within the subject of
the current match pointer.

When pcre2_match() is used, the capture_top field contains one more
than the number of the highest numbered captured substring so far. If
no substrings have been captured, the value of capture_top is one. This
is always the case when the DFA functions are used, because they do not
support captured substrings.

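As a rough illustration of how offset_vector and capture_top fit
together, the sketch below looks up one group's offsets. It uses plain
size_t values so that it stands alone; captured_so_far() and DEMO_UNSET
are hypothetical names. DEMO_UNSET mirrors PCRE2_UNSET, which pcre2.h
defines as ~(PCRE2_SIZE)0. Group 0 is deliberately excluded, on the
assumption that the overall match is not yet known during a callout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for PCRE2_UNSET so the sketch compiles without pcre2.h. */
#define DEMO_UNSET (~(size_t)0)

/* Hypothetical helper: if group n has been captured so far, stores its
   start offset and length and returns 1; otherwise returns 0. The
   offsets are pairs: element 2n is the start, 2n+1 is the end. */
static int captured_so_far(const size_t *offset_vector,
                           uint32_t capture_top, uint32_t n,
                           size_t *start, size_t *length)
{
    if (n == 0 || n >= capture_top)
        return 0;          /* beyond the highest group captured so far */

    size_t s = offset_vector[2 * (size_t)n];
    size_t e = offset_vector[2 * (size_t)n + 1];
    if (s == DEMO_UNSET || e == DEMO_UNSET || e < s)
        return 0;          /* group exists but is currently unset */

    *start = s;
    *length = e - s;
    return 1;
}
```

Inside a real callout function, the arguments would come from the
offset_vector and capture_top fields of the callout block, and the
resulting start and length index into the subject field.
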
The capture_last field contains the number of the most recently
captured substring. However, when a recursion exits, the value reverts
to what it was outside the recursion, as do the values of all captured
substrings. If no substrings have been captured, the value of
capture_last is 0. This is always the case for the DFA matching
functions.

The pattern_position field contains the offset in the pattern string to
the next item to be matched.

The next_item_length field contains the length of the next item to be
processed in the pattern string. When the callout is at the end of the
pattern, the length is zero. When the callout precedes an opening
parenthesis, the length includes meta characters that follow the
parenthesis. For example, in a callout before an assertion such as
(?=ab) the length is 3. For an alternation bar or a closing
parenthesis, the length is one, unless a closing parenthesis is
followed by a quantifier, in which case its length is included. (This
changed in release 10.23. In earlier releases, before an opening
parenthesis the length was that of the entire subpattern, and before an
alternation bar or a closing parenthesis the length was zero.)

The pattern_position and next_item_length fields are intended to help
in distinguishing between different automatic callouts, which all have
the same callout number. However, they are set for all callouts, and
are used by pcre2test to show the next item to be matched when
displaying callout information.

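One way an application might use these two fields is to copy out the
text of the next item, much as the right-hand column of the pcre2test
traces shows it. The sketch below assumes the application has kept its
own 8-bit, zero-terminated copy of the pattern string; next_item_text()
is a hypothetical helper, not a PCRE2 function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the next pattern item, identified by
   pattern_position and next_item_length, into out as a zero-terminated
   string. Returns 0 on success, -1 if the span lies outside the
   pattern or does not fit in the buffer. */
static int next_item_text(const char *pattern, size_t pattern_position,
                          size_t next_item_length,
                          char *out, size_t outsize)
{
    size_t plen = strlen(pattern);

    if (pattern_position > plen ||
        next_item_length > plen - pattern_position)
        return -1;
    if (next_item_length >= outsize)
        return -1;

    memcpy(out, pattern + pattern_position, next_item_length);
    out[next_item_length] = '\0';
    return 0;
}
```

For the pattern a+[bc], position 0 with length 2 yields "a+" and
position 2 with length 4 yields "[bc]"; a callout at the end of the
pattern has length zero and yields an empty string.
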
In callouts from pcre2_match() the mark field contains a pointer to the
zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
(*THEN) item in the match, or NULL if no such items have been passed.
Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
previous (*MARK). In callouts from the DFA matching function this field
always contains NULL.

RETURN VALUES FROM CALLOUTS

The external callout function returns an integer to PCRE2. If the value
is zero, matching proceeds as normal. If the value is greater than
zero, matching fails at the current point, but the testing of other
matching possibilities goes ahead, just as if a lookahead assertion had
failed. If the value is less than zero, the match is abandoned, and the
matching function returns the negative value.

Negative values should normally be chosen from the set of
PCRE2_ERROR_xxx values. In particular, PCRE2_ERROR_NOMATCH forces a
standard "no match" failure. The error number PCRE2_ERROR_CALLOUT is
reserved for use by callout functions; it will never be used by PCRE2
itself.

CALLOUT ENUMERATION

int pcre2_callout_enumerate(const pcre2_code *code,
  int (*callback)(pcre2_callout_enumerate_block *, void *),
  void *user_data);

A script language that supports the use of string arguments in callouts
might like to scan all the callouts in a pattern before running the
match. This can be done by calling pcre2_callout_enumerate(). The first
argument is a pointer to a compiled pattern, the second points to a
callback function, and the third is arbitrary user data. The callback
function is called for every callout in the pattern in the order in
which they appear. Its first argument is a pointer to a callout
enumeration block, and its second argument is the user_data value that
was passed to pcre2_callout_enumerate(). The data block contains the
following fields:

  version                Block version number
  pattern_position       Offset to next item in pattern
  next_item_length       Length of next item in pattern
  callout_number         Number for numbered callouts
  callout_string_offset  Offset to string within pattern
  callout_string_length  Length of callout string
  callout_string         Points to callout string or is NULL

The version number is currently 0. It will increase if new fields are
ever added to the block. The remaining fields are the same as their
namesakes in the pcre2_callout_block that is used for callouts during
matching, as described above.

Note that the value of pattern_position is unique for each callout.
However, if a callout occurs inside a group that is quantified with a
non-zero minimum or a fixed maximum, the group is replicated inside the
compiled pattern. For example, a pattern such as /(a){2}/ is compiled
as if it were /(a)(a)/. This means that the callout will be enumerated
more than once, but with the same value for pattern_position in each
case.

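An application that wants each callout reported once can therefore
filter on pattern_position. The following sketch shows one way of doing
so; position_set and add_position() are hypothetical names, and the
fixed capacity of 32 is an arbitrary choice for the illustration.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_POSITIONS 32

/* Hypothetical accumulator for the pattern positions seen so far. */
typedef struct {
    size_t seen[DEMO_MAX_POSITIONS];
    size_t count;
} position_set;

/* Records pattern_position. Returns 1 if it is new, 0 if it was seen
   before (a replicated group), and -1 if the set is full. */
static int add_position(position_set *set, size_t pattern_position)
{
    for (size_t i = 0; i < set->count; i++) {
        if (set->seen[i] == pattern_position)
            return 0;
    }
    if (set->count == DEMO_MAX_POSITIONS)
        return -1;
    set->seen[set->count++] = pattern_position;
    return 1;
}
```

An enumeration callback could receive a position_set through its
user_data argument, pass the block's pattern_position to
add_position(), and process the callout only when the result is 1.
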
The callback function should normally return zero. If it returns a
non-zero value, scanning the pattern stops, and that value is returned
from pcre2_callout_enumerate().

AUTHOR

Philip Hazel
University Computing Service
Cambridge, England.

REVISION

Last updated: 29 September 2016
Copyright (c) 1997-2016 University of Cambridge.

PCRE2 10.23                 29 September 2016                 PCRE2CALLOUT(3)