PCRE2CALLOUT(3)            Library Functions Manual            PCRE2CALLOUT(3)

NAME

       PCRE2 - Perl-compatible regular expressions (revised API)

SYNOPSIS

       #include <pcre2.h>

       int (*pcre2_callout)(pcre2_callout_block *, void *);

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);

DESCRIPTION

       PCRE2 provides a feature called "callout", which is a means of
       temporarily passing control to the caller of PCRE2 in the middle of
       pattern matching. The caller of PCRE2 provides an external function by
       putting its entry point in a match context (see pcre2_set_callout() in
       the pcre2api documentation).

       Within a regular expression, (?C<arg>) indicates a point at which the
       external function is to be called. Different callout points can be
       identified by putting a number less than 256 after the letter C. The
       default value is zero. Alternatively, the argument may be a delimited
       string. The starting delimiter must be one of ` ' " ^ % # $ { and the
       ending delimiter is the same as the start, except for {, where the
       ending delimiter is }. If the ending delimiter is needed within the
       string, it must be doubled. For example, this pattern has two callout
       points:

         (?C1)abc(?C"some ""arbitrary"" text")def
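
       An application that generates patterns may need to apply the
       delimiter-doubling rule itself. The following is a minimal sketch
       in plain C; make_string_callout() is a hypothetical helper, not
       part of the PCRE2 API, and it assumes 8-bit, zero-terminated text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a PCRE2 function). Writes a callout item of the
   form (?C<open>text<close>) into out, doubling each occurrence of the
   ending delimiter inside text. Returns the number of characters written,
   excluding the terminating zero, or 0 if open is not a permitted starting
   delimiter or the buffer is too small. */
static size_t make_string_callout(char open, const char *text,
                                  char *out, size_t outsize)
{
    assert(text != NULL && out != NULL);

    /* Permitted starting delimiters: ` ' " ^ % # $ { */
    if (open == '\0' || strchr("`'\"^%#${", open) == NULL) return 0;

    /* The ending delimiter equals the starting one, except that { ends with } */
    char close = (open == '{') ? '}' : open;

    size_t textlen = strlen(text);
    size_t doubled = 0;
    for (size_t i = 0; i < textlen; i++)
        if (text[i] == close) doubled++;

    /* "(?C" + open + text + extra delimiters + close + ")" + terminating zero */
    size_t need = textlen + doubled + 7;
    if (need > outsize) return 0;

    size_t n = 0;
    memcpy(out, "(?C", 3);
    n = 3;
    out[n++] = open;
    for (size_t i = 0; i < textlen; i++) {
        if (text[i] == close) out[n++] = close;   /* double the delimiter */
        out[n++] = text[i];
    }
    out[n++] = close;
    out[n++] = ')';
    out[n] = '\0';
    return n;
}
```

       Called with the delimiter " and the text some "arbitrary" text,
       the helper produces the string callout used in the pattern above.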

       If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled,
       PCRE2 automatically inserts callouts, all with number 255, before each
       item in the pattern except for immediately before or after an explicit
       callout. For example, if PCRE2_AUTO_CALLOUT is used with the pattern

         A(?C3)B

       it is processed as if it were

         (?C255)A(?C3)B(?C255)

       Here is a more complicated example:

         A(\d{2}|--)

       With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were

         (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)

       Notice that there is a callout before and after each parenthesis and
       alternation bar. If the pattern contains a conditional group whose
       condition is an assertion, an automatic callout is inserted immediately
       before the condition. Such a callout may also be inserted explicitly,
       for example:

         (?(?C9)(?=a)ab|de)  (?(?C%text%)(?!=d)ab|de)

       This applies only to assertion conditions (because they are themselves
       independent groups).

       Callouts can be useful for tracking the progress of pattern matching.
       The pcre2test program has a pattern qualifier (/auto_callout) that sets
       automatic callouts. When any callouts are present, the output from
       pcre2test indicates how the pattern is being matched. This is useful
       information when you are trying to optimize the performance of a
       particular pattern.

MISSING CALLOUTS

       You should be aware that, because of optimizations in the way PCRE2
       compiles and matches patterns, callouts sometimes do not happen exactly
       as you might expect.

   Auto-possessification

       At compile time, PCRE2 "auto-possessifies" repeated items when it knows
       that what follows cannot be part of the repeat. For example, a+[bc] is
       compiled as if it were a++[bc]. The pcre2test output when this pattern
       is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
       to the string "aaaa" is:

         --->aaaa
          +0 ^        a+
          +2 ^   ^    [bc]
         No match

       This indicates that when matching [bc] fails, there is no backtracking
       into a+ (because it is being treated as a++) and therefore the callouts
       that would be taken for the backtracks do not occur. You can disable
       the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
       pcre2_compile(), or starting the pattern with (*NO_AUTO_POSSESS). In
       this case, the output changes to this:

         --->aaaa
          +0 ^        a+
          +2 ^   ^    [bc]
          +2 ^  ^     [bc]
          +2 ^ ^      [bc]
          +2 ^^       [bc]
         No match

       This time, when matching [bc] fails, the matcher backtracks into a+ and
       tries again, repeatedly, until a+ itself fails.

   Automatic .* anchoring

       By default, an optimization is applied when .* is the first significant
       item in a pattern. If PCRE2_DOTALL is set, so that the dot can match
       any character, the pattern is automatically anchored. If PCRE2_DOTALL
       is not set, a match can start only after an internal newline or at the
       beginning of the subject, and pcre2_compile() remembers this. If a
       pattern has more than one top-level branch, automatic anchoring occurs
       if all branches are anchorable.

       This optimization is disabled, however, if .* is in an atomic group or
       if there is a backreference to the capturing group in which it appears.
       It is also disabled if the pattern contains (*PRUNE) or (*SKIP).
       However, the presence of callouts does not affect it.

       For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT
       and applied to the string "aa", the pcre2test output is:

         --->aa
          +0 ^      .*
          +2 ^ ^    \d
          +2 ^^     \d
          +2 ^      \d
         No match

       This shows that all match attempts start at the beginning of the
       subject. In other words, the pattern is anchored. You can disable this
       optimization by passing PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or
       starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the
       output changes to:

         --->aa
          +0 ^      .*
          +2 ^ ^    \d
          +2 ^^     \d
          +2 ^      \d
          +0  ^     .*
          +2  ^^    \d
          +2  ^     \d
         No match

       This shows more match attempts, starting at the second subject
       character. Another optimization, described in the next section, means
       that there is no subsequent attempt to match with an empty subject.

   Other optimizations

       Other optimizations that provide fast "no match" results also affect
       callouts. For example, if the pattern is

         ab(?C4)cd

       PCRE2 knows that any matching string must contain the letter "d". If
       the subject string is "abyz", the lack of "d" means that matching
       doesn't ever start, and the callout is never reached. However, with
       "abyd", though the result is still no match, the callout is obeyed.

       For most patterns PCRE2 also knows the minimum length of a matching
       string, and will immediately give a "no match" return without actually
       running a match if the subject is not long enough, or, for unanchored
       patterns, if it has been scanned far enough.

       You can disable these optimizations by passing the
       PCRE2_NO_START_OPTIMIZE option to pcre2_compile(), or by starting the
       pattern with (*NO_START_OPT). This slows down the matching process,
       but does ensure that callouts such as the example above are obeyed.

THE CALLOUT INTERFACE

       During matching, when PCRE2 reaches a callout point, if an external
       function is provided in the match context, it is called. This applies
       to normal, DFA, and JIT matching. The first argument to the callout
       function is a pointer to a pcre2_callout block. The second argument is
       the void * callout data that was supplied when the callout was set up
       by calling pcre2_set_callout() (see the pcre2api documentation). The
       callout block structure contains the following fields, not necessarily
       in this order:

         uint32_t      version;
         uint32_t      callout_number;
         uint32_t      capture_top;
         uint32_t      capture_last;
         uint32_t      callout_flags;
         PCRE2_SIZE   *offset_vector;
         PCRE2_SPTR    mark;
         PCRE2_SPTR    subject;
         PCRE2_SIZE    subject_length;
         PCRE2_SIZE    start_match;
         PCRE2_SIZE    current_position;
         PCRE2_SIZE    pattern_position;
         PCRE2_SIZE    next_item_length;
         PCRE2_SIZE    callout_string_offset;
         PCRE2_SIZE    callout_string_length;
         PCRE2_SPTR    callout_string;

       The version field contains the version number of the block format. The
       current version is 2; the three callout string fields were added for
       version 1, and the callout_flags field for version 2. If you are
       writing an application that might use an earlier release of PCRE2, you
       should check the version number before accessing any of these fields.
       The version number will increase in future if more fields are added,
       but the intention is never to remove any of the existing fields.

   Fields for numerical callouts

       For a numerical callout, callout_string is NULL, and callout_number
       contains the number of the callout, in the range 0-255. This is the
       number that follows (?C for callouts that are part of the pattern; it
       is 255 for automatically generated callouts.

   Fields for string callouts

       For callouts with string arguments, callout_number is always zero, and
       callout_string points to the string that is contained within the
       compiled pattern. Its length is given by callout_string_length.
       Duplicated ending delimiters that were present in the original pattern
       string have been turned into single characters, but there is no other
       processing of the callout string argument. An additional code unit
       containing binary zero is present after the string, but is not included
       in the length. The delimiter that was used to start the string is also
       stored within the pattern, immediately before the string itself. You
       can access this delimiter as callout_string[-1] if you need it.

       The callout_string_offset field is the code unit offset to the start of
       the callout argument string within the original pattern string. This is
       provided for the benefit of applications such as script languages that
       might need to report errors in the callout string within the pattern.

   Fields for all callouts

       The remaining fields in the callout block are the same for both kinds
       of callout.

       The offset_vector field is a pointer to a vector of capturing offsets
       (the "ovector"). You may read the elements in this vector, but you must
       not change any of them.

       For calls to pcre2_match(), the offset_vector field is not (since
       release 10.30) a pointer to the actual ovector that was passed to the
       matching function in the match data block. Instead it points to an
       internal ovector of a size large enough to hold all possible captured
       substrings in the pattern. Note that whenever a recursion or subroutine
       call within a pattern completes, the capturing state is reset to what
       it was before.

       The capture_last field contains the number of the most recently
       captured substring, and the capture_top field contains one more than
       the number of the highest numbered captured substring so far. If no
       substrings have yet been captured, the value of capture_last is 0 and
       the value of capture_top is 1. The values of these fields do not always
       differ by one; for example, when the callout in the pattern
       ((a)(b))(?C2) is taken, capture_last is 1 but capture_top is 4.

       The contents of ovector[2] to ovector[<capture_top>*2-1] can be
       inspected in order to extract substrings that have been matched so far,
       in the same way as extracting substrings after a match has completed.
       The values in ovector[0] and ovector[1] are always PCRE2_UNSET because
       the match is by definition not complete. Substrings that have not been
       captured but whose numbers are less than capture_top also have both of
       their ovector slots set to PCRE2_UNSET.

       For DFA matching, the offset_vector field points to the ovector that
       was passed to the matching function in the match data block for
       callouts at the top level, but to an internal ovector during the
       processing of pattern recursions, lookarounds, and atomic groups.
       However, these ovectors hold no useful information because
       pcre2_dfa_match() does not support substring capturing. The value of
       capture_top is always 1 and the value of capture_last is always 0 for
       DFA matching.

       The subject and subject_length fields contain copies of the values that
       were passed to the matching function.

       The start_match field normally contains the offset within the subject
       at which the current match attempt started. However, if the escape
       sequence \K has been encountered, this value is changed to reflect the
       modified starting point. If the pattern is not anchored, the callout
       function may be called several times from the same point in the pattern
       for different starting points in the subject.

       The current_position field contains the offset within the subject of
       the current match pointer.

       The pattern_position field contains the offset in the pattern string to
       the next item to be matched.

       The next_item_length field contains the length of the next item to be
       processed in the pattern string. When the callout is at the end of the
       pattern, the length is zero. When the callout precedes an opening
       parenthesis, the length includes meta characters that follow the
       parenthesis. For example, in a callout before an assertion such as
       (?=ab) the length is 3. For an alternation bar or a closing
       parenthesis, the length is one, unless a closing parenthesis is
       followed by a quantifier, in which case its length is included. (This
       changed in release 10.23. In earlier releases, before an opening
       parenthesis the length was that of the entire subpattern, and before an
       alternation bar or a closing parenthesis the length was zero.)

       The pattern_position and next_item_length fields are intended to help
       in distinguishing between different automatic callouts, which all have
       the same callout number. However, they are set for all callouts, and
       are used by pcre2test to show the next item to be matched when
       displaying callout information.

       In callouts from pcre2_match() the mark field contains a pointer to the
       zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
       (*THEN) item in the match, or NULL if no such items have been passed.
       Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
       previous (*MARK). In callouts from the DFA matching function this field
       always contains NULL.

       The callout_flags field is always zero in callouts from
       pcre2_dfa_match() or when JIT is being used. When pcre2_match() without
       JIT is used, the following bits may be set:

         PCRE2_CALLOUT_STARTMATCH

       This is set for the first callout after the start of matching for each
       new starting position in the subject.

         PCRE2_CALLOUT_BACKTRACK

       This is set if there has been a matching backtrack since the previous
       callout, or since the start of matching if this is the first callout
       from a pcre2_match() run.

       Both bits are set when a backtrack has caused a "bumpalong" to a new
       starting position in the subject. Output from pcre2test does not
       indicate the presence of these bits unless the callout_extra modifier
       is set.

       The information in the callout_flags field is provided so that
       applications can track and tell their users how matching with
       backtracking is done. This can be useful when trying to optimize
       patterns, or just to understand how PCRE2 works. There is no support in
       pcre2_dfa_match() because there is no backtracking in DFA matching, and
       there is no support in JIT because JIT is all about maximizing matching
       performance. In both these cases the callout_flags field is always
       zero.

RETURN VALUES FROM CALLOUTS

       The external callout function returns an integer to PCRE2. If the value
       is zero, matching proceeds as normal. If the value is greater than
       zero, matching fails at the current point, but the testing of other
       matching possibilities goes ahead, just as if a lookahead assertion had
       failed. If the value is less than zero, the match is abandoned, and the
       matching function returns the negative value.

       Negative values should normally be chosen from the set of
       PCRE2_ERROR_xxx values. In particular, PCRE2_ERROR_NOMATCH forces a
       standard "no match" failure. The error number PCRE2_ERROR_CALLOUT is
       reserved for use by callout functions; it will never be used by PCRE2
       itself.

CALLOUT ENUMERATION

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);

       A script language that supports the use of string arguments in callouts
       might like to scan all the callouts in a pattern before running the
       match. This can be done by calling pcre2_callout_enumerate(). The first
       argument is a pointer to a compiled pattern, the second points to a
       callback function, and the third is arbitrary user data. The callback
       function is called for every callout in the pattern in the order in
       which they appear. Its first argument is a pointer to a callout
       enumeration block, and its second argument is the user_data value that
       was passed to pcre2_callout_enumerate(). The data block contains the
       following fields:

         version                Block version number
         pattern_position       Offset to next item in pattern
         next_item_length       Length of next item in pattern
         callout_number         Number for numbered callouts
         callout_string_offset  Offset to string within pattern
         callout_string_length  Length of callout string
         callout_string         Points to callout string or is NULL

       The version number is currently 0. It will increase if new fields are
       ever added to the block. The remaining fields are the same as their
       namesakes in the pcre2_callout block that is used for callouts during
       matching, as described above.

       Note that the value of pattern_position is unique for each callout.
       However, if a callout occurs inside a group that is quantified with a
       non-zero minimum or a fixed maximum, the group is replicated inside the
       compiled pattern. For example, a pattern such as /(a){2}/ is compiled
       as if it were /(a)(a)/. This means that the callout will be enumerated
       more than once, but with the same value for pattern_position in each
       case.

       The callback function should normally return zero. If it returns a
       non-zero value, scanning the pattern stops, and that value is returned
       from pcre2_callout_enumerate().

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.

REVISION

       Last updated: 26 April 2018
       Copyright (c) 1997-2018 University of Cambridge.

PCRE2 10.32                      26 April 2018                 PCRE2CALLOUT(3)