PCRE2CALLOUT(3)            Library Functions Manual            PCRE2CALLOUT(3)

NAME

       PCRE2 - Perl-compatible regular expressions (revised API)

SYNOPSIS

       #include <pcre2.h>

       int (*pcre2_callout)(pcre2_callout_block *, void *);

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);

DESCRIPTION

       PCRE2 provides a feature called "callout", which is a means of
       temporarily passing control to the caller of PCRE2 in the middle of
       pattern matching. The caller of PCRE2 provides an external function by
       putting its entry point in a match context (see pcre2_set_callout() in
       the pcre2api documentation).

       When using the pcre2_substitute() function, an additional callout
       feature is available. This does a callout after each change to the
       subject string and is described in the pcre2api documentation; the rest
       of this document is concerned with callouts during pattern matching.

       Within a regular expression, (?C<arg>) indicates a point at which the
       external function is to be called. Different callout points can be
       identified by putting a number less than 256 after the letter C. The
       default value is zero. Alternatively, the argument may be a delimited
       string. The starting delimiter must be one of ` ' " ^ % # $ { and the
       ending delimiter is the same as the start, except for {, where the
       ending delimiter is }. If the ending delimiter is needed within the
       string, it must be doubled. For example, this pattern has two callout
       points:

         (?C1)abc(?C"some ""arbitrary"" text")def
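
       An application that generates patterns has to apply the doubling rule
       itself. The following is a minimal sketch of one way to do that; the
       helper build_string_callout() is hypothetical and is not part of the
       PCRE2 API. It uses only the standard C library.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of PCRE2): writes a string callout item
   into out, doubling every occurrence of the ending delimiter in text.
   Returns 0 on success, or -1 if delim is not one of the permitted
   starting delimiters or if out is too small. */
static int build_string_callout(const char *text, char delim,
                                char *out, size_t outsize)
{
    /* The permitted starting delimiters are ` ' " ^ % # $ { */
    if (delim == '\0' || strchr("`'\"^%#${", delim) == NULL)
        return -1;
    char end = (delim == '{') ? '}' : delim;

    /* Worst case: every character doubled, plus "(?C", the two
       delimiters, ")" and the terminating zero. */
    if (outsize < 2 * strlen(text) + 7)
        return -1;

    char *p = out;
    p += sprintf(p, "(?C%c", delim);
    for (; *text != '\0'; text++) {
        if (*text == end)
            *p++ = end;             /* double the ending delimiter */
        *p++ = *text;
    }
    *p++ = end;
    *p++ = ')';
    *p = '\0';
    return 0;
}
```

       Called with the text some "arbitrary" text and the delimiter ", this
       sketch produces the second callout item of the pattern shown above.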

       If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled,
       PCRE2 automatically inserts callouts, all with number 255, before each
       item in the pattern except for immediately before or after an explicit
       callout. For example, if PCRE2_AUTO_CALLOUT is used with the pattern

         A(?C3)B

       it is processed as if it were

         (?C255)A(?C3)B(?C255)

       Here is a more complicated example:

         A(\d{2}|--)

       With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were

         (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)

       Notice that there is a callout before and after each parenthesis and
       alternation bar. If the pattern contains a conditional group whose
       condition is an assertion, an automatic callout is inserted immediately
       before the condition. Such a callout may also be inserted explicitly,
       for example:

         (?(?C9)(?=a)ab|de)  (?(?C%text%)(?!=d)ab|de)

       This applies only to assertion conditions (because they are themselves
       independent groups).

       Callouts can be useful for tracking the progress of pattern matching.
       The pcre2test program has a pattern qualifier (/auto_callout) that sets
       automatic callouts. When any callouts are present, the output from
       pcre2test indicates how the pattern is being matched. This is useful
       information when you are trying to optimize the performance of a
       particular pattern.

MISSING CALLOUTS

       You should be aware that, because of optimizations in the way PCRE2
       compiles and matches patterns, callouts sometimes do not happen exactly
       as you might expect.

   Auto-possessification

       At compile time, PCRE2 "auto-possessifies" repeated items when it knows
       that what follows cannot be part of the repeat. For example, a+[bc] is
       compiled as if it were a++[bc]. The pcre2test output when this pattern
       is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
       to the string "aaaa" is:

         --->aaaa
          +0 ^        a+
          +2 ^   ^    [bc]
         No match

       This indicates that when matching [bc] fails, there is no backtracking
       into a+ (because it is being treated as a++) and therefore the callouts
       that would be taken for the backtracks do not occur. You can disable
       the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
       pcre2_compile(), or starting the pattern with (*NO_AUTO_POSSESS). In
       this case, the output changes to this:

         --->aaaa
          +0 ^        a+
          +2 ^   ^    [bc]
          +2 ^  ^     [bc]
          +2 ^ ^      [bc]
          +2 ^^       [bc]
         No match

       This time, when matching [bc] fails, the matcher backtracks into a+ and
       tries again, repeatedly, until a+ itself fails.

   Automatic .* anchoring

       By default, an optimization is applied when .* is the first significant
       item in a pattern. If PCRE2_DOTALL is set, so that the dot can match
       any character, the pattern is automatically anchored. If PCRE2_DOTALL
       is not set, a match can start only after an internal newline or at the
       beginning of the subject, and pcre2_compile() remembers this. If a
       pattern has more than one top-level branch, automatic anchoring occurs
       if all branches are anchorable.

       This optimization is disabled, however, if .* is in an atomic group or
       if there is a backreference to the capture group in which it appears.
       It is also disabled if the pattern contains (*PRUNE) or (*SKIP).
       However, the presence of callouts does not affect it.

       For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT
       and applied to the string "aa", the pcre2test output is:

         --->aa
          +0 ^      .*
          +2 ^ ^    \d
          +2 ^^     \d
          +2 ^      \d
         No match

       This shows that all match attempts start at the beginning of the
       subject. In other words, the pattern is anchored. You can disable this
       optimization by passing PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or
       starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the
       output changes to:

         --->aa
          +0 ^      .*
          +2 ^ ^    \d
          +2 ^^     \d
          +2 ^      \d
          +0  ^     .*
          +2  ^^    \d
          +2  ^     \d
         No match

       This shows more match attempts, starting at the second subject
       character. Another optimization, described in the next section, means
       that there is no subsequent attempt to match with an empty subject.

   Other optimizations

       Other optimizations that provide fast "no match" results also affect
       callouts. For example, if the pattern is

         ab(?C4)cd

       PCRE2 knows that any matching string must contain the letter "d". If
       the subject string is "abyz", the lack of "d" means that matching
       doesn't ever start, and the callout is never reached. However, with
       "abyd", though the result is still no match, the callout is obeyed.

       For most patterns PCRE2 also knows the minimum length of a matching
       string, and will immediately give a "no match" return without actually
       running a match if the subject is not long enough, or, for unanchored
       patterns, if it has been scanned far enough.

       You can disable these optimizations by passing the
       PCRE2_NO_START_OPTIMIZE option to pcre2_compile(), or by starting the
       pattern with (*NO_START_OPT). This slows down the matching process, but
       does ensure that callouts such as the example above are obeyed.

THE CALLOUT INTERFACE

       During matching, when PCRE2 reaches a callout point, if an external
       function is provided in the match context, it is called. This applies
       to normal, DFA, and JIT matching alike. The first argument to the
       callout function is a pointer to a pcre2_callout_block structure. The
       second argument is the void * callout data that was supplied when the
       callout was set up by calling pcre2_set_callout() (see the pcre2api
       documentation). The callout block structure contains the following
       fields, not necessarily in this order:

         uint32_t      version;
         uint32_t      callout_number;
         uint32_t      capture_top;
         uint32_t      capture_last;
         uint32_t      callout_flags;
         PCRE2_SIZE   *offset_vector;
         PCRE2_SPTR    mark;
         PCRE2_SPTR    subject;
         PCRE2_SIZE    subject_length;
         PCRE2_SIZE    start_match;
         PCRE2_SIZE    current_position;
         PCRE2_SIZE    pattern_position;
         PCRE2_SIZE    next_item_length;
         PCRE2_SIZE    callout_string_offset;
         PCRE2_SIZE    callout_string_length;
         PCRE2_SPTR    callout_string;

       The version field contains the version number of the block format. The
       current version is 2; the three callout string fields were added for
       version 1, and the callout_flags field for version 2. If you are
       writing an application that might use an earlier release of PCRE2, you
       should check the version number before accessing any of these fields.
       The version number will increase in future if more fields are added,
       but the intention is never to remove any of the existing fields.

   Fields for numerical callouts

       For a numerical callout, callout_string is NULL, and callout_number
       contains the number of the callout, in the range 0-255. This is the
       number that follows (?C for callouts that are part of the pattern; it
       is 255 for automatically generated callouts.

   Fields for string callouts

       For callouts with string arguments, callout_number is always zero, and
       callout_string points to the string that is contained within the
       compiled pattern. Its length is given by callout_string_length.
       Duplicated ending delimiters that were present in the original pattern
       string have been turned into single characters, but there is no other
       processing of the callout string argument. An additional code unit
       containing binary zero is present after the string, but is not included
       in the length. The delimiter that was used to start the string is also
       stored within the pattern, immediately before the string itself. You
       can access this delimiter as callout_string[-1] if you need it.

       The callout_string_offset field is the code unit offset to the start of
       the callout argument string within the original pattern string. This is
       provided for the benefit of applications such as script languages that
       might need to report errors in the callout string within the pattern.

   Fields for all callouts

       The remaining fields in the callout block are the same for both kinds
       of callout.

       The offset_vector field is a pointer to a vector of capturing offsets
       (the "ovector"). You may read the elements in this vector, but you must
       not change any of them.

       For calls to pcre2_match(), the offset_vector field is not (since
       release 10.30) a pointer to the actual ovector that was passed to the
       matching function in the match data block. Instead it points to an
       internal ovector of a size large enough to hold all possible captured
       substrings in the pattern. Note that whenever a recursion or subroutine
       call within a pattern completes, the capturing state is reset to what
       it was before.

       The capture_last field contains the number of the most recently
       captured substring, and the capture_top field contains one more than
       the number of the highest numbered captured substring so far. If no
       substrings have yet been captured, the value of capture_last is 0 and
       the value of capture_top is 1. The values of these fields do not always
       differ by one; for example, when the callout in the pattern
       ((a)(b))(?C2) is taken, capture_last is 1 but capture_top is 4.

       The contents of ovector[2] to ovector[<capture_top>*2-1] can be
       inspected in order to extract substrings that have been matched so far,
       in the same way as extracting substrings after a match has completed.
       The values in ovector[0] and ovector[1] are always PCRE2_UNSET because
       the match is by definition not complete. Substrings that have not been
       captured but whose numbers are less than capture_top also have both of
       their ovector slots set to PCRE2_UNSET.
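
       The walk over the ovector described above might look like the following
       sketch. To keep it independent of the library it takes plain C
       arguments; in a real callout they would come from the subject,
       offset_vector, and capture_top fields. The names report_captures() and
       UNSET_OFFSET are hypothetical; UNSET_OFFSET stands in for PCRE2_UNSET,
       on the assumption that PCRE2_SIZE is size_t.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for PCRE2_UNSET, which pcre2.h defines as (~(PCRE2_SIZE)0). */
#define UNSET_OFFSET (~(size_t)0)

/* Prints every group captured so far and returns how many were set.
   Pair 0 is skipped because the overall match is not yet complete. */
static unsigned report_captures(const char *subject, const size_t *ovector,
                                uint32_t capture_top)
{
    unsigned set = 0;
    for (uint32_t i = 1; i < capture_top; i++) {
        size_t start = ovector[2 * i];
        size_t end = ovector[2 * i + 1];
        if (start == UNSET_OFFSET || end == UNSET_OFFSET || end < start)
            continue;               /* group has not been captured */
        printf("group %u: \"%.*s\"\n", (unsigned)i,
               (int)(end - start), subject + start);
        set++;
    }
    return set;
}
```

       For the pattern ((a)(b))(?C2) applied to "ab", where capture_top is 4,
       the loop visits groups 1 to 3.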

       For DFA matching, the offset_vector field points to the ovector that
       was passed to the matching function in the match data block for
       callouts at the top level, but to an internal ovector during the
       processing of pattern recursions, lookarounds, and atomic groups.
       However, these ovectors hold no useful information because
       pcre2_dfa_match() does not support substring capturing. The value of
       capture_top is always 1 and the value of capture_last is always 0 for
       DFA matching.

       The subject and subject_length fields contain copies of the values that
       were passed to the matching function.

       The start_match field normally contains the offset within the subject
       at which the current match attempt started. However, if the escape
       sequence \K has been encountered, this value is changed to reflect the
       modified starting point. If the pattern is not anchored, the callout
       function may be called several times from the same point in the pattern
       for different starting points in the subject.

       The current_position field contains the offset within the subject of
       the current match pointer.

       The pattern_position field contains the offset in the pattern string to
       the next item to be matched.

       The next_item_length field contains the length of the next item to be
       processed in the pattern string. When the callout is at the end of the
       pattern, the length is zero. When the callout precedes an opening
       parenthesis, the length includes meta characters that follow the
       parenthesis. For example, in a callout before an assertion such as
       (?=ab) the length is 3. For an alternation bar or a closing
       parenthesis, the length is one, unless a closing parenthesis is
       followed by a quantifier, in which case its length is included. (This
       changed in release 10.23. In earlier releases, before an opening
       parenthesis the length was that of the entire group, and before an
       alternation bar or a closing parenthesis the length was zero.)

       The pattern_position and next_item_length fields are intended to help
       in distinguishing between different automatic callouts, which all have
       the same callout number. However, they are set for all callouts, and
       are used by pcre2test to show the next item to be matched when
       displaying callout information.
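
       An application that keeps its own copy of the pattern string can use
       these two fields to recover the text of the next item. The following
       is a minimal sketch using only the standard C library; the helper
       next_item_text() is hypothetical, and it assumes an 8-bit pattern that
       is stored as a zero-terminated string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the next pattern item into buf as a
   zero-terminated string, truncating it to fit, and returns buf.
   Returns NULL if bufsize is zero. */
static char *next_item_text(const char *pattern, size_t pattern_position,
                            size_t next_item_length,
                            char *buf, size_t bufsize)
{
    size_t patlen = strlen(pattern);
    size_t n = 0;

    if (bufsize == 0)
        return NULL;
    if (pattern_position < patlen) {
        n = patlen - pattern_position;      /* clamp to end of pattern */
        if (n > next_item_length)
            n = next_item_length;
        if (n > bufsize - 1)
            n = bufsize - 1;                /* clamp to the buffer */
        memcpy(buf, pattern + pattern_position, n);
    }
    buf[n] = '\0';
    return buf;
}
```

       With the pattern x(?=ab)y, a position of 1 and a length of 3 (the
       values described above for a callout before the assertion), the helper
       yields the text (?= for display.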

       In callouts from pcre2_match() the mark field contains a pointer to the
       zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
       (*THEN) item in the match, or NULL if no such items have been passed.
       Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
       previous (*MARK). In callouts from the DFA matching function this field
       always contains NULL.

       The callout_flags field is always zero in callouts from
       pcre2_dfa_match() or when JIT is being used. When pcre2_match() without
       JIT is used, the following bits may be set:

         PCRE2_CALLOUT_STARTMATCH

       This is set for the first callout after the start of matching for each
       new starting position in the subject.

         PCRE2_CALLOUT_BACKTRACK

       This is set if there has been a matching backtrack since the previous
       callout, or since the start of matching if this is the first callout
       from a pcre2_match() run.

       Both bits are set when a backtrack has caused a "bumpalong" to a new
       starting position in the subject. Output from pcre2test does not
       indicate the presence of these bits unless the callout_extra modifier
       is set.

       The information in the callout_flags field is provided so that
       applications can track and tell their users how matching with
       backtracking is done. This can be useful when trying to optimize
       patterns, or just to understand how PCRE2 works. There is no support in
       pcre2_dfa_match() because there is no backtracking in DFA matching, and
       there is no support in JIT because JIT is all about maximizing matching
       performance. In both these cases the callout_flags field is always
       zero.

RETURN VALUES FROM CALLOUTS

       The external callout function returns an integer to PCRE2. If the value
       is zero, matching proceeds as normal. If the value is greater than
       zero, matching fails at the current point, but the testing of other
       matching possibilities goes ahead, just as if a lookahead assertion had
       failed. If the value is less than zero, the match is abandoned, and the
       matching function returns the negative value.

       Negative values should normally be chosen from the set of
       PCRE2_ERROR_xxx values. In particular, PCRE2_ERROR_NOMATCH forces a
       standard "no match" failure. The error number PCRE2_ERROR_CALLOUT is
       reserved for use by callout functions; it will never be used by PCRE2
       itself.

CALLOUT ENUMERATION

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);

       A script language that supports the use of string arguments in callouts
       might like to scan all the callouts in a pattern before running the
       match. This can be done by calling pcre2_callout_enumerate(). The first
       argument is a pointer to a compiled pattern, the second points to a
       callback function, and the third is arbitrary user data. The callback
       function is called for every callout in the pattern in the order in
       which they appear. Its first argument is a pointer to a callout
       enumeration block, and its second argument is the user_data value that
       was passed to pcre2_callout_enumerate(). The data block contains the
       following fields:

         version                Block version number
         pattern_position       Offset to next item in pattern
         next_item_length       Length of next item in pattern
         callout_number         Number for numbered callouts
         callout_string_offset  Offset to string within pattern
         callout_string_length  Length of callout string
         callout_string         Points to callout string or is NULL

       The version number is currently 0. It will increase if new fields are
       ever added to the block. The remaining fields are the same as their
       namesakes in the pcre2_callout_block structure that is used for
       callouts during matching, as described above.

       Note that the value of pattern_position is unique for each callout.
       However, if a callout occurs inside a group that is quantified with a
       non-zero minimum or a fixed maximum, the group is replicated inside the
       compiled pattern. For example, a pattern such as /(a){2}/ is compiled
       as if it were /(a)(a)/. This means that the callout will be enumerated
       more than once, but with the same value for pattern_position in each
       case.

       The callback function should normally return zero. If it returns a
       non-zero value, scanning the pattern stops, and that value is returned
       from pcre2_callout_enumerate().

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.

REVISION

       Last updated: 03 February 2019
       Copyright (c) 1997-2019 University of Cambridge.

PCRE2 10.33                    03 February 2019                PCRE2CALLOUT(3)