PCRE2CALLOUT(3)            Library Functions Manual            PCRE2CALLOUT(3)

NAME

       PCRE2 - Perl-compatible regular expressions (revised API)

SYNOPSIS

       #include <pcre2.h>

       int (*pcre2_callout)(pcre2_callout_block *, void *);

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);

DESCRIPTION

       PCRE2 provides a feature called "callout", which is a means of
       temporarily passing control to the caller of PCRE2 in the middle of
       pattern matching. The caller of PCRE2 provides an external function by
       putting its entry point in a match context (see pcre2_set_callout() in
       the pcre2api documentation).

       Within a regular expression, (?C<arg>) indicates a point at which the
       external function is to be called. Different callout points can be
       identified by putting a number less than 256 after the letter C. The
       default value is zero. Alternatively, the argument may be a delimited
       string. The starting delimiter must be one of ` ' " ^ % # $ { and the
       ending delimiter is the same as the start, except for {, where the
       ending delimiter is }. If the ending delimiter is needed within the
       string, it must be doubled. For example, this pattern has two callout
       points:

         (?C1)abc(?C"some ""arbitrary"" text")def

       If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled,
       PCRE2 automatically inserts callouts, all with number 255, before each
       item in the pattern except for immediately before or after a callout
       item in the pattern. For example, if PCRE2_AUTO_CALLOUT is used with
       the pattern

         A(?C3)B

       it is processed as if it were

         (?C255)A(?C3)B(?C255)

       Here is a more complicated example:

         A(\d{2}|--)

       With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were

       (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)

       Notice that there is a callout before and after each parenthesis and
       alternation bar. If the pattern contains a conditional group whose
       condition is an assertion, an automatic callout is inserted immediately
       before the condition. Such a callout may also be inserted explicitly,
       for example:

         (?(?C9)(?=a)ab|de)  (?(?C%text%)(?!=d)ab|de)

       This applies only to assertion conditions (because they are themselves
       independent groups).

       Callouts can be useful for tracking the progress of pattern matching.
       The pcre2test program has a pattern qualifier (/auto_callout) that sets
       automatic callouts. When any callouts are present, the output from
       pcre2test indicates how the pattern is being matched. This is useful
       information when you are trying to optimize the performance of a
       particular pattern.

MISSING CALLOUTS

       You should be aware that, because of optimizations in the way PCRE2
       compiles and matches patterns, callouts sometimes do not happen exactly
       as you might expect.

   Auto-possessification

       At compile time, PCRE2 "auto-possessifies" repeated items when it knows
       that what follows cannot be part of the repeat. For example, a+[bc] is
       compiled as if it were a++[bc]. The pcre2test output when this pattern
       is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
       to the string "aaaa" is:

         --->aaaa
          +0 ^        a+
          +2 ^   ^    [bc]
         No match

       This indicates that when matching [bc] fails, there is no backtracking
       into a+ (because it is being treated as a++) and therefore the callouts
       that would be taken for the backtracks do not occur. You can disable
       the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
       pcre2_compile(), or starting the pattern with (*NO_AUTO_POSSESS). In
       this case, the output changes to this:

         --->aaaa
          +0 ^        a+
          +2 ^   ^    [bc]
          +2 ^  ^     [bc]
          +2 ^ ^      [bc]
          +2 ^^       [bc]
         No match

       This time, when matching [bc] fails, the matcher backtracks into a+ and
       tries again, repeatedly, until a+ itself fails.

   Automatic .* anchoring

       By default, an optimization is applied when .* is the first significant
       item in a pattern. If PCRE2_DOTALL is set, so that the dot can match
       any character, the pattern is automatically anchored. If PCRE2_DOTALL
       is not set, a match can start only after an internal newline or at the
       beginning of the subject, and pcre2_compile() remembers this. This
       optimization is disabled, however, if .* is in an atomic group or if
       there is a back reference to the capturing group in which it appears.
       It is also disabled if the pattern contains (*PRUNE) or (*SKIP).
       However, the presence of callouts does not affect it.

       For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT
       and applied to the string "aa", the pcre2test output is:

         --->aa
          +0 ^      .*
          +2 ^ ^    \d
          +2 ^^     \d
          +2 ^      \d
         No match

       This shows that all match attempts start at the beginning of the
       subject. In other words, the pattern is anchored. You can disable this
       optimization by passing PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or
       starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the
       output changes to:

         --->aa
          +0 ^      .*
          +2 ^ ^    \d
          +2 ^^     \d
          +2 ^      \d
          +0  ^     .*
          +2  ^^    \d
          +2  ^     \d
         No match

       This shows more match attempts, starting at the second subject
       character. Another optimization, described in the next section, means
       that there is no subsequent attempt to match with an empty subject.

       If a pattern has more than one top-level branch, automatic anchoring
       occurs if all branches are anchorable.

   Other optimizations

       Other optimizations that provide fast "no match" results also affect
       callouts. For example, if the pattern is

         ab(?C4)cd

       PCRE2 knows that any matching string must contain the letter "d". If
       the subject string is "abyz", the lack of "d" means that matching
       doesn't ever start, and the callout is never reached. However, with
       "abyd", though the result is still no match, the callout is obeyed.

       PCRE2 also knows the minimum length of a matching string, and will
       immediately give a "no match" return without actually running a match
       if the subject is not long enough, or, for unanchored patterns, if it
       has been scanned far enough.

       You can disable these optimizations by passing the
       PCRE2_NO_START_OPTIMIZE option to pcre2_compile(), or by starting the
       pattern with (*NO_START_OPT). This slows down the matching process, but
       does ensure that callouts such as the example above are obeyed.

THE CALLOUT INTERFACE

       During matching, when PCRE2 reaches a callout point, if an external
       function is set in the match context, it is called. This applies to
       both normal and DFA matching. The first argument to the callout
       function is a pointer to a pcre2_callout block. The second argument is
       the void * callout data that was supplied when the callout was set up
       by calling pcre2_set_callout() (see the pcre2api documentation). The
       callout block structure contains the following fields:

         uint32_t      version;
         uint32_t      callout_number;
         uint32_t      capture_top;
         uint32_t      capture_last;
         PCRE2_SIZE   *offset_vector;
         PCRE2_SPTR    mark;
         PCRE2_SPTR    subject;
         PCRE2_SIZE    subject_length;
         PCRE2_SIZE    start_match;
         PCRE2_SIZE    current_position;
         PCRE2_SIZE    pattern_position;
         PCRE2_SIZE    next_item_length;
         PCRE2_SIZE    callout_string_offset;
         PCRE2_SIZE    callout_string_length;
         PCRE2_SPTR    callout_string;

       The version field contains the version number of the block format. The
       current version is 1; the three callout string fields were added for
       this version. If you are writing an application that might use an
       earlier release of PCRE2, you should check the version number before
       accessing any of these fields. The version number will increase in
       future if more fields are added, but the intention is never to remove
       any of the existing fields.

   Fields for numerical callouts

       For a numerical callout, callout_string is NULL, and callout_number
       contains the number of the callout, in the range 0-255. This is the
       number that follows (?C for callouts that are part of the pattern; it
       is 255 for automatically generated callouts.

   Fields for string callouts

       For callouts with string arguments, callout_number is always zero, and
       callout_string points to the string that is contained within the
       compiled pattern. Its length is given by callout_string_length.
       Duplicated ending delimiters that were present in the original pattern
       string have been turned into single characters, but there is no other
       processing of the callout string argument. An additional code unit
       containing binary zero is present after the string, but is not
       included in the length. The delimiter that was used to start the
       string is also stored within the pattern, immediately before the
       string itself. You can access this delimiter as callout_string[-1] if
       you need it.

       The callout_string_offset field is the code unit offset to the start of
       the callout argument string within the original pattern string. This is
       provided for the benefit of applications such as script languages that
       might need to report errors in the callout string within the pattern.

   Fields for all callouts

       The remaining fields in the callout block are the same for both kinds
       of callout.

       The offset_vector field is a pointer to the vector of capturing offsets
       (the "ovector") that was passed to the matching function in the match
       data block. When pcre2_match() is used, the contents can be inspected
       in order to extract substrings that have been matched so far, in the
       same way as for extracting substrings after a match has completed. For
       the DFA matching function, this field is not useful.

       The subject and subject_length fields contain copies of the values that
       were passed to the matching function.

       The start_match field normally contains the offset within the subject
       at which the current match attempt started. However, if the escape
       sequence \K has been encountered, this value is changed to reflect the
       modified starting point. If the pattern is not anchored, the callout
       function may be called several times from the same point in the pattern
       for different starting points in the subject.

       The current_position field contains the offset within the subject of
       the current match pointer.

       When pcre2_match() is used, the capture_top field contains one more
       than the number of the highest numbered captured substring so far. If
       no substrings have been captured, the value of capture_top is one. This
       is always the case when the DFA functions are used, because they do not
       support captured substrings.

       The capture_last field contains the number of the most recently
       captured substring. However, when a recursion exits, the value reverts
       to what it was outside the recursion, as do the values of all captured
       substrings. If no substrings have been captured, the value of
       capture_last is 0. This is always the case for the DFA matching
       functions.

       The pattern_position field contains the offset in the pattern string to
       the next item to be matched.

       The next_item_length field contains the length of the next item to be
       processed in the pattern string. When the callout is at the end of the
       pattern, the length is zero. When the callout precedes an opening
       parenthesis, the length includes meta characters that follow the
       parenthesis. For example, in a callout before an assertion such as
       (?=ab) the length is 3. For an alternation bar or a closing
       parenthesis, the length is one, unless a closing parenthesis is
       followed by a quantifier, in which case its length is included. (This
       changed in release 10.23. In earlier releases, before an opening
       parenthesis the length was that of the entire subpattern, and before an
       alternation bar or a closing parenthesis the length was zero.)

       The pattern_position and next_item_length fields are intended to help
       in distinguishing between different automatic callouts, which all have
       the same callout number. However, they are set for all callouts, and
       are used by pcre2test to show the next item to be matched when
       displaying callout information.

       In callouts from pcre2_match() the mark field contains a pointer to the
       zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
       (*THEN) item in the match, or NULL if no such items have been passed.
       Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
       previous (*MARK). In callouts from the DFA matching function this field
       always contains NULL.

RETURN VALUES FROM CALLOUTS

       The external callout function returns an integer to PCRE2. If the value
       is zero, matching proceeds as normal. If the value is greater than
       zero, matching fails at the current point, but the testing of other
       matching possibilities goes ahead, just as if a lookahead assertion had
       failed. If the value is less than zero, the match is abandoned, and the
       matching function returns the negative value.

       Negative values should normally be chosen from the set of
       PCRE2_ERROR_xxx values. In particular, PCRE2_ERROR_NOMATCH forces a
       standard "no match" failure. The error number PCRE2_ERROR_CALLOUT is
       reserved for use by callout functions; it will never be used by PCRE2
       itself.

CALLOUT ENUMERATION

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);

       A script language that supports the use of string arguments in callouts
       might like to scan all the callouts in a pattern before running the
       match. This can be done by calling pcre2_callout_enumerate(). The first
       argument is a pointer to a compiled pattern, the second points to a
       callback function, and the third is arbitrary user data. The callback
       function is called for every callout in the pattern in the order in
       which they appear. Its first argument is a pointer to a callout
       enumeration block, and its second argument is the user_data value that
       was passed to pcre2_callout_enumerate(). The data block contains the
       following fields:

         version                Block version number
         pattern_position       Offset to next item in pattern
         next_item_length       Length of next item in pattern
         callout_number         Number for numbered callouts
         callout_string_offset  Offset to string within pattern
         callout_string_length  Length of callout string
         callout_string         Points to callout string or is NULL

       The version number is currently 0. It will increase if new fields are
       ever added to the block. The remaining fields are the same as their
       namesakes in the pcre2_callout block that is used for callouts during
       matching, as described above.

       Note that the value of pattern_position is unique for each callout.
       However, if a callout occurs inside a group that is quantified with a
       non-zero minimum or a fixed maximum, the group is replicated inside the
       compiled pattern. For example, a pattern such as /(a){2}/ is compiled
       as if it were /(a)(a)/. This means that the callout will be enumerated
       more than once, but with the same value for pattern_position in each
       case.

       The callback function should normally return zero. If it returns a
       non-zero value, scanning the pattern stops, and that value is returned
       from pcre2_callout_enumerate().

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.

REVISION

       Last updated: 29 September 2016
       Copyright (c) 1997-2016 University of Cambridge.

PCRE2 10.23                    29 September 2016               PCRE2CALLOUT(3)