PCRE2PARTIAL(3)            Library Functions Manual            PCRE2PARTIAL(3)

NAME

       PCRE2 - Perl-compatible regular expressions

PARTIAL MATCHING IN PCRE2

       In normal use of PCRE2, if the subject string that is passed to a
       matching function matches as far as it goes, but is too short to
       match the entire pattern, PCRE2_ERROR_NOMATCH is returned. There are
       circumstances where it might be helpful to distinguish this case from
       other cases in which there is no match.

       Consider, for example, an application where a human is required to
       type in data for a field with specific formatting requirements. An
       example might be a date in the form ddmmmyy, defined by this pattern:

         ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$

       If the application sees the user's keystrokes one by one, and can
       check that what has been typed so far is potentially valid, it is
       able to raise an error as soon as a mistake is made, by beeping and
       not reflecting the character that has been typed, for example. This
       immediate feedback is likely to be a better user interface than a
       check that is delayed until the entire string has been entered.
       Partial matching can also be useful when the subject string is very
       long and is not all available at once.

       PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT
       and PCRE2_PARTIAL_HARD options, which can be set when calling a
       matching function. The difference between the two options is whether
       or not a partial match is preferred to an alternative complete match,
       though the details differ between the two types of matching function.
       If both options are set, PCRE2_PARTIAL_HARD takes precedence.

       If you want to use partial matching with just-in-time optimized code,
       you must call pcre2_jit_compile() with one or both of these options:

         PCRE2_JIT_PARTIAL_SOFT
         PCRE2_JIT_PARTIAL_HARD

       PCRE2_JIT_COMPLETE should also be set if you are going to run
       non-partial matches on the same pattern. If the appropriate JIT mode
       has not been compiled, interpretive matching code is used.

       Setting a partial matching option disables two of PCRE2's standard
       optimizations. PCRE2 remembers the last literal code unit in a
       pattern, and abandons matching immediately if it is not present in
       the subject string. This optimization cannot be used for a subject
       string that might match only partially. PCRE2 also knows the minimum
       length of a matching string, and does not bother to run the matching
       function on shorter strings. This optimization is also disabled for
       partial matching.

PARTIAL MATCHING USING pcre2_match()

       A partial match occurs during a call to pcre2_match() when the end of
       the subject string is reached successfully, but matching cannot
       continue because more characters are needed. However, at least one
       character in the subject must have been inspected. This character
       need not form part of the final matched string; lookbehind assertions
       and the \K escape sequence provide ways of inspecting characters
       before the start of a matched string. The requirement for inspecting
       at least one character exists because an empty string can always be
       matched; without such a restriction there would always be a partial
       match of an empty string at the end of the subject.

       When a partial match is returned, the first two elements in the
       ovector point to the portion of the subject that was matched, but the
       values in the rest of the ovector are undefined. The appearance of \K
       in the pattern has no effect for a partial match. Consider this
       pattern:

         /abc\K123/

       If it is matched against "456abc123xyz" the result is a complete
       match, and the ovector defines the matched string as "123", because
       \K resets the "start of match" point. However, if a partial match is
       requested and the subject string is "456abc12", a partial match is
       found for the string "abc12", because all these characters are needed
       for a subsequent re-match with additional characters.

       What happens when a partial match is identified depends on which of
       the two partial matching options is set.

   PCRE2_PARTIAL_SOFT WITH pcre2_match()

       If PCRE2_PARTIAL_SOFT is set when pcre2_match() identifies a partial
       match, the partial match is remembered, but matching continues as
       normal, and other alternatives in the pattern are tried. If no
       complete match can be found, PCRE2_ERROR_PARTIAL is returned instead
       of PCRE2_ERROR_NOMATCH.

       This option is "soft" because it prefers a complete match over a
       partial match. All the various matching items in a pattern behave as
       if the subject string is potentially complete. For example, \z, \Z,
       and $ match at the end of the subject, as normal, and for \b and \B
       the end of the subject is treated as a non-alphanumeric.

       If there is more than one partial match, the first one that was found
       provides the data that is returned. Consider this pattern:

         /123\w+X|dogY/

       If this is matched against the subject string "abc123dog", both
       alternatives fail to match, but the end of the subject is reached
       during matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are
       set to 3 and 9, identifying "123dog" as the first partial match that
       was found. (In this example, there are two partial matches, because
       "dog" on its own partially matches the second alternative.)

   PCRE2_PARTIAL_HARD WITH pcre2_match()

       If PCRE2_PARTIAL_HARD is set for pcre2_match(), PCRE2_ERROR_PARTIAL
       is returned as soon as a partial match is found, without continuing
       to search for possible complete matches. This option is "hard"
       because it prefers an earlier partial match over a later complete
       match. For this reason, the assumption is made that the end of the
       supplied subject string may not be the true end of the available
       data, and so, if \z, \Z, \b, \B, or $ are encountered at the end of
       the subject, the result is PCRE2_ERROR_PARTIAL, provided that at
       least one character in the subject has been inspected.

   Comparing hard and soft partial matching

       The difference between the two partial matching options can be
       illustrated by a pattern such as:

         /dog(sbody)?/

       This matches either "dog" or "dogsbody", greedily (that is, it
       prefers the longer string if possible). If it is matched against the
       string "dog" with PCRE2_PARTIAL_SOFT, it yields a complete match for
       "dog". However, if PCRE2_PARTIAL_HARD is set, the result is
       PCRE2_ERROR_PARTIAL. On the other hand, if the pattern is made
       ungreedy the result is different:

         /dog(sbody)??/

       In this case the result is always a complete match because that is
       found first, and matching never continues after finding a complete
       match. It might be easier to follow this explanation by thinking of
       the two patterns like this:

         /dog(sbody)?/    is the same as  /dogsbody|dog/
         /dog(sbody)??/   is the same as  /dog|dogsbody/

       The second pattern will never match "dogsbody", because it will
       always find the shorter match first.

PARTIAL MATCHING USING pcre2_dfa_match()

       The DFA functions move along the subject string character by
       character, without backtracking, searching for all possible matches
       simultaneously. If the end of the subject is reached before the end
       of the pattern, there is the possibility of a partial match, again
       provided that at least one character has been inspected.

       When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only
       if there have been no complete matches. Otherwise, the complete
       matches are returned. However, if PCRE2_PARTIAL_HARD is set, a
       partial match takes precedence over any complete matches. The portion
       of the string that was matched when the longest partial match was
       found is set as the first matching string.

       Because the DFA functions always search for all possible matches, and
       there is no difference between greedy and ungreedy repetition, their
       behaviour is different from the standard functions when
       PCRE2_PARTIAL_HARD is set. Consider the string "dog" matched against
       the ungreedy pattern shown above:

         /dog(sbody)??/

       Whereas the standard function stops as soon as it finds the complete
       match for "dog", the DFA function also finds the partial match for
       "dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set.

PARTIAL MATCHING AND WORD BOUNDARIES

       If a pattern ends with one of the sequences \b or \B, which test for
       word boundaries, partial matching with PCRE2_PARTIAL_SOFT can give
       counter-intuitive results. Consider this pattern:

         /\bcat\b/

       This matches "cat", provided there is a word boundary at either end.
       If the subject string is "the cat", the comparison of the final "t"
       with a following character cannot take place, so a partial match is
       found. However, normal matching carries on, and \b matches at the end
       of the subject when the last character is a letter, so a complete
       match is found. The result, therefore, is not PCRE2_ERROR_PARTIAL.
       Using PCRE2_PARTIAL_HARD in this case does yield PCRE2_ERROR_PARTIAL,
       because then the partial match takes precedence.

EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST

       If the partial_soft (or ps) modifier is present on a pcre2test data
       line, the PCRE2_PARTIAL_SOFT option is used for the match. Here is a
       run of pcre2test that uses the date example quoted above:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 25jun04\=ps
          0: 25jun04
          1: jun
         data> 25dec3\=ps
         Partial match: 25dec3
         data> 3ju\=ps
         Partial match: 3ju
         data> 3juj\=ps
         No match
         data> j\=ps
         No match

       The first data string is matched completely, so pcre2test shows the
       matched substrings. The remaining four strings do not match the
       complete pattern, but the first two are partial matches. Similar
       output is obtained if DFA matching is used.

       If the partial_hard (or ph) modifier is present on a pcre2test data
       line, the PCRE2_PARTIAL_HARD option is set for the match.

MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()

       When a partial match has been found using a DFA matching function, it
       is possible to continue the match by providing additional subject
       data and calling the function again with the same compiled regular
       expression, this time setting the PCRE2_DFA_RESTART option. You must
       pass the same working space as before, because this is where details
       of the previous partial match are stored. Here is an example using
       pcre2test:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 23ja\=dfa,ps
         Partial match: 23ja
         data> n05\=dfa,dfa_restart
          0: n05

       The first call has "23ja" as the subject, and requests partial
       matching; the second call has "n05" as the subject for the continued
       (restarted) match. Notice that when the match is complete, only the
       last part is shown; PCRE2 does not retain the previously
       partially-matched string. It is up to the calling program to do that
       if it needs to.

       That means that, for an unanchored pattern, if a continued match
       fails, it is not possible to try again at a new starting point. All
       this facility is capable of doing is continuing with the previous
       match attempt. In the previous example, if the second set of data is
       "ug23" the result is no match, even though there would be a match for
       "aug23" if the entire string were given at once. Depending on the
       application, this may or may not be what you want. The only way to
       allow for starting again at the next character is to retain the
       matched part of the subject and try a new complete match.

       You can set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with
       PCRE2_DFA_RESTART to continue partial matching over multiple
       segments. This facility can be used to pass very long subject strings
       to the DFA matching functions.

MULTI-SEGMENT MATCHING WITH pcre2_match()

       Unlike the DFA function, it is not possible to restart the previous
       match with a new segment of data when using pcre2_match(). Instead,
       new data must be added to the previous subject string, and the entire
       match re-run, starting from the point where the partial match
       occurred. Earlier data can be discarded.

       It is best to use PCRE2_PARTIAL_HARD in this situation, because it
       does not treat the end of a segment as the end of the subject when
       matching \z, \Z, \b, \B, and $. Consider an unanchored pattern that
       matches dates:

           re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
         data> The date is 23ja\=ph
         Partial match: 23ja

       At this stage, an application could discard the text preceding
       "23ja", add on text from the next segment, and call the matching
       function again. Unlike the DFA matching function, the entire matching
       string must always be available, and the complete matching process
       occurs for each call, so more memory and more processing time are
       needed.
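
       The bookkeeping described above (discard the text that precedes the
       partial match, then add on the next segment) might be sketched as
       follows. This is only an illustration under stated assumptions:
       carry_over() is a hypothetical helper, not part of PCRE2; it assumes
       the 8-bit library and a caller-owned buffer that the next segment
       does not overlap. The keep_from argument stands for the first offset
       that must be retained, which is ovector[0] when the pattern contains
       no lookbehind.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not a PCRE2 function.
 *
 * buf holds len code units of subject in a buffer of cap code units.
 * keep_from is the first offset that must survive (for example,
 * ovector[0] after a partial match). Everything before keep_from is
 * dropped, the retained tail is moved to the front of buf, and the
 * next segment is appended after it.
 *
 * Returns the new subject length, or SIZE_MAX (leaving buf untouched)
 * if the result would not fit in cap code units. */
size_t carry_over(char *buf, size_t len, size_t cap, size_t keep_from,
                  const char *seg, size_t seg_len)
{
    assert(len <= cap && keep_from <= len);

    size_t kept = len - keep_from;
    if (seg_len > cap - kept)
        return SIZE_MAX;

    memmove(buf, buf + keep_from, kept);   /* source and target overlap */
    memcpy(buf + kept, seg, seg_len);      /* seg must not overlap buf */
    return kept + seg_len;
}
```

       With the date example, calling carry_over() on "The date is 23ja"
       with keep_from set to 12 and the next segment "n05 ok" leaves
       "23jan05 ok" in the buffer, which can then be passed to the matching
       function as the new subject.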

ISSUES WITH MULTI-SEGMENT MATCHING

       Certain types of pattern may give problems with multi-segment
       matching, whichever matching function is used.

       1. If the pattern contains a test for the beginning of a line, you
       need to pass the PCRE2_NOTBOL option when the subject string for any
       call does not start at the beginning of a line. There is also a
       PCRE2_NOTEOL option, but in practice when doing multi-segment
       matching you should be using PCRE2_PARTIAL_HARD, which includes the
       effect of PCRE2_NOTEOL.

       2. If a pattern contains a lookbehind assertion, characters that
       precede the start of the partial match may have been inspected during
       the matching process. When using pcre2_match(), sufficient characters
       must be retained for the next match attempt. You can ensure that
       enough characters are retained by doing the following:

       Before doing any matching, find the length of the longest lookbehind
       in the pattern by calling pcre2_pattern_info() with the
       PCRE2_INFO_MAXLOOKBEHIND option. Note that the resulting count is in
       characters, not code units. After a partial match, moving back from
       the ovector[0] offset in the subject by the number of characters
       given for the maximum lookbehind gets you to the earliest character
       that must be retained. In a non-UTF or a 32-bit situation, moving
       back is just a subtraction, but in UTF-8 or UTF-16 you have to count
       characters while moving back through the code units.

       Characters before the point you have now reached can be discarded,
       and after the next segment has been added to what is retained, you
       should run the next match with the startoffset argument set so that
       the match begins at the same point as before.

       For example, if the pattern "(?<=123)abc" is partially matched
       against the string "xx123ab", the ovector offsets are 5 and 7 ("ab").
       The maximum lookbehind count is 3, so all characters before offset 2
       can be discarded. The value of startoffset for the next match should
       be 3. When pcre2test displays a partial match, it indicates the
       lookbehind characters with '<' characters:

           re> "(?<=123)abc"
         data> xx123ab\=ph
         Partial match: 123ab
                        <<<

       3. Because a partial match must always contain at least one
       character, what might be considered a partial match of an empty
       string actually gives a "no match" result. For example:

           re> /c(?<=abc)x/
         data> ab\=ps
         No match

       If the next segment begins "cx", a match should be found, but this
       will only happen if characters from the previous segment are
       retained. For this reason, a "no match" result should be interpreted
       as "partial match of an empty string" when the pattern contains
       lookbehinds.

       4. Matching a subject string that is split into multiple segments may
       not always produce exactly the same result as matching over one
       single long string, especially when PCRE2_PARTIAL_SOFT is used. The
       section "Partial Matching and Word Boundaries" above describes an
       issue that arises if the pattern ends with \b or \B. Another kind of
       difference may occur when there are multiple matching possibilities,
       because (for PCRE2_PARTIAL_SOFT) a partial match result is given only
       when there are no completed matches. This means that as soon as the
       shortest match has been found, continuation to a new subject segment
       is no longer possible. Consider this pcre2test example:

           re> /dog(sbody)?/
         data> dogsb\=ps
          0: dog
         data> do\=ps,dfa
         Partial match: do
         data> gsb\=ps,dfa,dfa_restart
          0: g
         data> dogsbody\=dfa
          0: dogsbody
          1: dog

       The first data line passes the string "dogsb" to a standard matching
       function, setting the PCRE2_PARTIAL_SOFT option. Although the string
       is a partial match for "dogsbody", the result is not
       PCRE2_ERROR_PARTIAL, because the shorter string "dog" is a complete
       match. Similarly, when the subject is presented to a DFA matching
       function in several parts ("do" and "gsb" being the first two) the
       match stops when "dog" has been found, and it is not possible to
       continue. On the other hand, if "dogsbody" is presented as a single
       string, a DFA matching function finds both matches.

       Because of these problems, it is best to use PCRE2_PARTIAL_HARD when
       matching multi-segment data. The example above then behaves
       differently:

           re> /dog(sbody)?/
         data> dogsb\=ph
         Partial match: dogsb
         data> do\=ps,dfa
         Partial match: do
         data> gsb\=ph,dfa,dfa_restart
         Partial match: gsb

       5. Patterns that contain alternatives at the top level which do not
       all start with the same pattern item may not work as expected when
       PCRE2_DFA_RESTART is used. For example, consider this pattern:

         1234|3789

       If the first part of the subject is "ABC123", a partial match of the
       first alternative is found at offset 3. There is no partial match for
       the second alternative, because such a match does not start at the
       same point in the subject string. Attempting to continue with the
       string "7890" does not yield a match because only those alternatives
       that match at one point in the subject are remembered. The problem
       arises because the start of the second alternative matches within the
       first alternative. There is no problem with anchored patterns or
       patterns such as:

         1234|ABCD

       where no string can be a partial match for both alternatives. This is
       not a problem if a standard matching function is used, because the
       entire match has to be rerun each time:

           re> /1234|3789/
         data> ABC123\=ph
         Partial match: 123
         data> 1237890
          0: 3789

       Of course, instead of using PCRE2_DFA_RESTART, the same technique of
       re-running the entire match can also be used with the DFA matching
       function. Another possibility is to work with two buffers. If a
       partial match at offset n in the first buffer is followed by "no
       match" when PCRE2_DFA_RESTART is used on the second buffer, you can
       then try a new match starting at offset n+1 in the first buffer.
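
       For item 2 above, the step of moving back by characters rather than
       code units in a UTF-8 subject could be sketched like this. It is a
       minimal illustration, not PCRE2 code: retain_from() is a hypothetical
       helper that assumes the 8-bit library and a subject that is valid
       UTF-8, and max_lookbehind is assumed to be the value obtained from
       pcre2_pattern_info() with PCRE2_INFO_MAXLOOKBEHIND.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE2 function.
 *
 * Starting at match_start (ovector[0] after a partial match), step
 * back over at most max_lookbehind characters of a valid UTF-8
 * subject. Returns the code unit offset of the earliest character
 * that must be retained; the walk stops at offset 0 if the subject
 * is shorter than the lookbehind. */
size_t retain_from(const char *subject, size_t match_start,
                   size_t max_lookbehind)
{
    size_t pos = match_start;

    while (max_lookbehind > 0 && pos > 0) {
        pos--;
        /* Bytes of the form 10xxxxxx continue a character; keep
           going back until the lead byte is reached. */
        while (pos > 0 && ((unsigned char)subject[pos] & 0xC0) == 0x80)
            pos--;
        max_lookbehind--;
    }
    return pos;
}
```

       For the "(?<=123)abc" example, retain_from("xx123ab", 5, 3) gives 2,
       so everything before offset 2 can be discarded, and the startoffset
       for the next match is 5 - 2 = 3, in agreement with the figures above.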

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.

REVISION

       Last updated: 22 December 2014
       Copyright (c) 1997-2014 University of Cambridge.

PCRE2 10.00                    22 December 2014                PCRE2PARTIAL(3)