PCREPARTIAL(3)             Library Functions Manual             PCREPARTIAL(3)

NAME

       PCRE - Perl-compatible regular expressions

PARTIAL MATCHING IN PCRE

       In normal use of PCRE, if the subject string that is passed to a
       matching function matches as far as it goes, but is too short to match
       the entire pattern, PCRE_ERROR_NOMATCH is returned. There are
       circumstances where it might be helpful to distinguish this case from
       other cases in which there is no match.

       Consider, for example, an application where a human is required to type
       in data for a field with specific formatting requirements. An example
       might be a date in the form ddmmmyy, defined by this pattern:

         ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$

       If the application sees the user's keystrokes one by one, and can check
       that what has been typed so far is potentially valid, it is able to
       raise an error as soon as a mistake is made, by beeping and not
       reflecting the character that has been typed, for example. This
       immediate feedback is likely to be a better user interface than a check
       that is delayed until the entire string has been entered. Partial
       matching can also be useful when the subject string is very long and is
       not all available at once.

       PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
       PCRE_PARTIAL_HARD options, which can be set when calling any of the
       matching functions. For backwards compatibility, PCRE_PARTIAL is a
       synonym for PCRE_PARTIAL_SOFT. The essential difference between the two
       options is whether or not a partial match is preferred to an
       alternative complete match, though the details differ between the two
       types of matching function. If both options are set, PCRE_PARTIAL_HARD
       takes precedence.

       If you want to use partial matching with just-in-time optimized code,
       you must call pcre_study(), pcre16_study() or pcre32_study() with one
       or both of these options:

         PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
         PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE

       PCRE_STUDY_JIT_COMPILE should also be set if you are going to run
       non-partial matches on the same pattern. If the appropriate JIT study
       mode has not been set for a match, the interpretive matching code is
       used.

       Setting a partial matching option disables two of PCRE's standard
       optimizations. PCRE remembers the last literal data unit in a pattern,
       and abandons matching immediately if it is not present in the subject
       string. This optimization cannot be used for a subject string that
       might match only partially. If the pattern was studied, PCRE knows the
       minimum length of a matching string, and does not bother to run the
       matching function on shorter strings. This optimization is also
       disabled for partial matching.

PARTIAL MATCHING USING pcre_exec() OR pcre[16|32]_exec()

       A partial match occurs during a call to pcre_exec() or
       pcre[16|32]_exec() when the end of the subject string is reached
       successfully, but matching cannot continue because more characters are
       needed. However, at least one character in the subject must have been
       inspected. This character need not form part of the final matched
       string; lookbehind assertions and the \K escape sequence provide ways
       of inspecting characters before the start of a matched substring. The
       requirement for inspecting at least one character exists because an
       empty string can always be matched; without such a restriction there
       would always be a partial match of an empty string at the end of the
       subject.

       If there are at least two slots in the offsets vector when a partial
       match is returned, the first slot is set to the offset of the earliest
       character that was inspected. For convenience, the second offset points
       to the end of the subject so that a substring can easily be identified.
       If there are at least three slots in the offsets vector, the third slot
       is set to the offset of the character where matching started.

       For the majority of patterns, the contents of the first and third slots
       will be the same. However, for patterns that contain lookbehind
       assertions, or begin with \b or \B, characters before the one where
       matching started may have been inspected while carrying out the match.
       For example, consider this pattern:

         /(?<=abc)123/

       This pattern matches "123", but only if it is preceded by "abc". If the
       subject string is "xyzabc12", the first two offsets after a partial
       match (3 and 8) are for the substring "abc12", because all these
       characters were inspected. However, the third offset is set to 6,
       because that is the offset where matching began.
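
       The three offsets can be turned back into text as sketched below. The
       helpers copy_span() and split_partial() are hypothetical names, and the
       offset values are assumed to have come from a matching call that
       returned PCRE_ERROR_PARTIAL; the sketch itself does not call PCRE.

```c
#include <assert.h>
#include <string.h>

/* Copy subject[from..to) into out (capacity outsize), always terminating
   the result with NUL. A helper invented for this sketch. */
void copy_span(const char *subject, int from, int to,
               char *out, size_t outsize)
{
    size_t n;

    assert(subject != NULL && out != NULL);
    assert(from >= 0 && to >= from && outsize > 0);
    n = (size_t)(to - from);
    if (n >= outsize)
        n = outsize - 1;            /* truncate rather than overflow */
    memcpy(out, subject + from, n);
    out[n] = '\0';
}

/* Given the three offsets set for a partial match, extract the text that
   was inspected (offsets[0] to offsets[1]) and the text matched so far
   (offsets[2] to offsets[1]). Both buffers have capacity bufsize. */
void split_partial(const char *subject, const int offsets[3],
                   char *inspected, char *matched, size_t bufsize)
{
    copy_span(subject, offsets[0], offsets[1], inspected, bufsize);
    copy_span(subject, offsets[2], offsets[1], matched, bufsize);
}
```

       For the "xyzabc12" example, offsets of 3, 8 and 6 give "abc12" as the
       inspected text and "12" as the text matched so far.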

       What happens when a partial match is identified depends on which of the
       two partial matching options is set.

   PCRE_PARTIAL_SOFT WITH pcre_exec() OR pcre[16|32]_exec()

       If PCRE_PARTIAL_SOFT is set when pcre_exec() or pcre[16|32]_exec()
       identifies a partial match, the partial match is remembered, but
       matching continues as normal, and other alternatives in the pattern are
       tried. If no complete match can be found, PCRE_ERROR_PARTIAL is
       returned instead of PCRE_ERROR_NOMATCH.

       This option is "soft" because it prefers a complete match over a
       partial match. All the various matching items in a pattern behave as if
       the subject string is potentially complete. For example, \z, \Z, and $
       match at the end of the subject, as normal, and for \b and \B the end
       of the subject is treated as a non-alphanumeric.

       If there is more than one partial match, the first one that was found
       provides the data that is returned. Consider this pattern:

         /123\w+X|dogY/

       If this is matched against the subject string "abc123dog", both
       alternatives fail to match, but the end of the subject is reached
       during matching, so PCRE_ERROR_PARTIAL is returned. The offsets are set
       to 3 and 9, identifying "123dog" as the first partial match that was
       found. (In this example, there are two partial matches, because "dog"
       on its own partially matches the second alternative.)

   PCRE_PARTIAL_HARD WITH pcre_exec() OR pcre[16|32]_exec()

       If PCRE_PARTIAL_HARD is set for pcre_exec() or pcre[16|32]_exec(),
       PCRE_ERROR_PARTIAL is returned as soon as a partial match is found,
       without continuing to search for possible complete matches. This option
       is "hard" because it prefers an earlier partial match over a later
       complete match. For this reason, the assumption is made that the end of
       the supplied subject string may not be the true end of the available
       data, and so, if \z, \Z, \b, \B, or $ are encountered at the end of the
       subject, the result is PCRE_ERROR_PARTIAL, provided that at least one
       character in the subject has been inspected.

       Setting PCRE_PARTIAL_HARD also affects the way UTF-8 and UTF-16 subject
       strings are checked for validity. Normally, an invalid sequence causes
       the error PCRE_ERROR_BADUTF8 or PCRE_ERROR_BADUTF16. However, in the
       special case of a truncated character at the end of the subject,
       PCRE_ERROR_SHORTUTF8 or PCRE_ERROR_SHORTUTF16 is returned when
       PCRE_PARTIAL_HARD is set.

   Comparing hard and soft partial matching

       The difference between the two partial matching options can be
       illustrated by a pattern such as:

         /dog(sbody)?/

       This matches either "dog" or "dogsbody", greedily (that is, it prefers
       the longer string if possible). If it is matched against the string
       "dog" with PCRE_PARTIAL_SOFT, it yields a complete match for "dog".
       However, if PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL.
       On the other hand, if the pattern is made ungreedy the result is
       different:

         /dog(sbody)??/

       In this case the result is always a complete match because that is
       found first, and matching never continues after finding a complete
       match. It might be easier to follow this explanation by thinking of the
       two patterns like this:

         /dog(sbody)?/    is the same as  /dogsbody|dog/
         /dog(sbody)??/   is the same as  /dog|dogsbody/

       The second pattern will never match "dogsbody", because it will always
       find the shorter match first.
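
       The preference rules can be imitated with a toy matcher that tries
       literal alternatives in order, in the same way as the /dogsbody|dog/
       rewriting above. This is a simplified model for illustration, not PCRE
       code: it handles only literal strings anchored at the start of the
       subject, and the names outcome and match_alternatives() are invented.

```c
#include <assert.h>
#include <string.h>

enum outcome { NO_MATCH, PARTIAL, COMPLETE };

/* Toy model, not PCRE: try each literal alternative in order, anchored at
   the start of the subject, as a backtracking matcher tries /a|b/.
   Pass hard = 1 to imitate PCRE_PARTIAL_HARD, 0 for PCRE_PARTIAL_SOFT. */
enum outcome match_alternatives(const char **alts, int count,
                                const char *subject, int hard)
{
    size_t slen;
    int saw_partial = 0;

    assert(alts != NULL && subject != NULL);
    slen = strlen(subject);
    for (int i = 0; i < count; i++) {
        size_t alen = strlen(alts[i]);

        if (alen <= slen && strncmp(subject, alts[i], alen) == 0)
            return COMPLETE;        /* matching stops at a complete match */

        /* The subject ran out part way through this alternative. */
        if (slen > 0 && slen < alen && strncmp(subject, alts[i], slen) == 0) {
            if (hard)
                return PARTIAL;     /* hard: report the first partial match */
            saw_partial = 1;        /* soft: remember it and keep trying */
        }
    }
    return saw_partial ? PARTIAL : NO_MATCH;
}
```

       With the alternatives in the order "dogsbody", "dog" and the subject
       "dog", the model reports COMPLETE for soft matching and PARTIAL for
       hard matching; with the order reversed it reports COMPLETE in both
       cases.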

PARTIAL MATCHING USING pcre_dfa_exec() OR pcre[16|32]_dfa_exec()

       The DFA functions move along the subject string character by character,
       without backtracking, searching for all possible matches
       simultaneously. If the end of the subject is reached before the end of
       the pattern, there is the possibility of a partial match, again
       provided that at least one character has been inspected.

       When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if
       there have been no complete matches. Otherwise, the complete matches
       are returned. However, if PCRE_PARTIAL_HARD is set, a partial match
       takes precedence over any complete matches. The portion of the string
       that was inspected when the longest partial match was found is set as
       the first matching string, provided there are at least two slots in the
       offsets vector.

       Because the DFA functions always search for all possible matches, and
       there is no difference between greedy and ungreedy repetition, their
       behaviour is different from the standard functions when
       PCRE_PARTIAL_HARD is set. Consider the string "dog" matched against the
       ungreedy pattern shown above:

         /dog(sbody)??/

       Whereas the standard functions stop as soon as they find the complete
       match for "dog", the DFA functions also find the partial match for
       "dogsbody", and so return that when PCRE_PARTIAL_HARD is set.
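
       The DFA behaviour can be imitated by a toy model that examines every
       alternative before deciding what to report, instead of stopping at the
       first complete match. As before, this is a simplified illustration and
       not PCRE code: it handles only literal alternatives anchored at the
       start of the subject, and match_all_alternatives() is an invented name.

```c
#include <assert.h>
#include <string.h>

enum outcome { NO_MATCH, PARTIAL, COMPLETE };

/* Toy model, not PCRE: look at every literal alternative, as a DFA-style
   matcher follows all of them at once, and only then choose the result.
   Pass hard = 1 to imitate PCRE_PARTIAL_HARD, 0 for PCRE_PARTIAL_SOFT. */
enum outcome match_all_alternatives(const char **alts, int count,
                                    const char *subject, int hard)
{
    size_t slen;
    int complete = 0;
    int partial = 0;

    assert(alts != NULL && subject != NULL);
    slen = strlen(subject);
    for (int i = 0; i < count; i++) {
        size_t alen = strlen(alts[i]);

        if (alen <= slen && strncmp(subject, alts[i], alen) == 0)
            complete = 1;
        else if (slen > 0 && strncmp(subject, alts[i], slen) == 0)
            partial = 1;    /* the subject ran out inside this alternative */
    }
    if (hard && partial)
        return PARTIAL;     /* hard: a partial match takes precedence */
    if (complete)
        return COMPLETE;    /* soft: partial only if nothing completed */
    return partial ? PARTIAL : NO_MATCH;
}
```

       With the alternatives "dog" and "dogsbody" and the subject "dog", the
       model reports PARTIAL for hard matching, whatever the order of the
       alternatives, and COMPLETE for soft matching.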

PARTIAL MATCHING AND WORD BOUNDARIES

       If a pattern ends with one of the sequences \b or \B, which test for
       word boundaries, partial matching with PCRE_PARTIAL_SOFT can give
       counter-intuitive results. Consider this pattern:

         /\bcat\b/

       This matches "cat", provided there is a word boundary at either end. If
       the subject string is "the cat", the comparison of the final "t" with a
       following character cannot take place, so a partial match is found.
       However, normal matching carries on, and \b matches at the end of the
       subject when the last character is a letter, so a complete match is
       found. The result, therefore, is not PCRE_ERROR_PARTIAL. Using
       PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL, because
       then the partial match takes precedence.

FORMERLY RESTRICTED PATTERNS

       For releases of PCRE prior to 8.00, because of the way certain internal
       optimizations were implemented in the pcre_exec() function, the
       PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be
       used with all patterns. From release 8.00 onwards, the restrictions no
       longer apply, and partial matching can be requested for any pattern.

       Items that were formerly restricted were repeated single characters and
       repeated metasequences. If PCRE_PARTIAL was set for a pattern that did
       not conform to the restrictions, pcre_exec() returned the error code
       PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The
       PCRE_INFO_OKPARTIAL call to pcre_fullinfo() to find out if a compiled
       pattern can be used for partial matching now always returns 1.

EXAMPLE OF PARTIAL MATCHING USING PCRETEST

       If the escape sequence \P is present in a pcretest data line, the
       PCRE_PARTIAL_SOFT option is used for the match. Here is a run of
       pcretest that uses the date example quoted above:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 25jun04\P
          0: 25jun04
          1: jun
         data> 25dec3\P
         Partial match: 25dec3
         data> 3ju\P
         Partial match: 3ju
         data> 3juj\P
         No match
         data> j\P
         No match

       The first data string is matched completely, so pcretest shows the
       matched substrings. The remaining four strings do not match the
       complete pattern, but the first two are partial matches. Similar output
       is obtained if DFA matching is used.

       If the escape sequence \P is present more than once in a pcretest data
       line, the PCRE_PARTIAL_HARD option is set for the match.

MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() OR pcre[16|32]_dfa_exec()

       When a partial match has been found using a DFA matching function, it
       is possible to continue the match by providing additional subject data
       and calling the function again with the same compiled regular
       expression, this time setting the PCRE_DFA_RESTART option. You must
       pass the same working space as before, because this is where details of
       the previous partial match are stored. Here is an example using
       pcretest, using the \R escape sequence to set the PCRE_DFA_RESTART
       option (\D specifies the use of the DFA matching function):

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 23ja\P\D
         Partial match: 23ja
         data> n05\R\D
          0: n05

       The first call has "23ja" as the subject, and requests partial
       matching; the second call has "n05" as the subject for the continued
       (restarted) match. Notice that when the match is complete, only the
       last part is shown; PCRE does not retain the previously
       partially-matched string. It is up to the calling program to do that if
       it needs to.

       That means that, for an unanchored pattern, if a continued match fails,
       it is not possible to try again at a new starting point. All this
       facility is capable of doing is continuing with the previous match
       attempt. In the previous example, if the second set of data is "ug23"
       the result is no match, even though there would be a match for "aug23"
       if the entire string were given at once. Depending on the application,
       this may or may not be what you want. The only way to allow for
       starting again at the next character is to retain the matched part of
       the subject and try a new complete match.

       You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
       PCRE_DFA_RESTART to continue partial matching over multiple segments.
       This facility can be used to pass very long subject strings to the DFA
       matching functions.

MULTI-SEGMENT MATCHING WITH pcre_exec() OR pcre[16|32]_exec()

       From release 8.00, the standard matching functions can also be used to
       do multi-segment matching. Unlike the DFA functions, it is not possible
       to restart the previous match with a new segment of data. Instead, new
       data must be added to the previous subject string, and the entire match
       re-run, starting from the point where the partial match occurred.
       Earlier data can be discarded.

       It is best to use PCRE_PARTIAL_HARD in this situation, because it does
       not treat the end of a segment as the end of the subject when matching
       \z, \Z, \b, \B, and $. Consider an unanchored pattern that matches
       dates:

           re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
         data> The date is 23ja\P\P
         Partial match: 23ja

       At this stage, an application could discard the text preceding "23ja",
       add on text from the next segment, and call the matching function
       again. Unlike the DFA matching functions, the entire matching string
       must always be available, and the complete matching process occurs for
       each call, so more memory and more processing time are needed.

       Note: If the pattern contains lookbehind assertions, or \K, or starts
       with \b or \B, the string that is returned for a partial match includes
       characters that precede the start of what would be returned for a
       complete match, because it contains all the characters that were
       inspected during the partial match.
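
       The buffer handling described above, discarding the text before the
       partial match and adding the next segment, can be sketched as follows.
       The function carry_over() and its parameters are hypothetical, and
       keep_from is assumed to be the first offset (offsets[0]) from a call
       that returned PCRE_ERROR_PARTIAL, so that the characters inspected so
       far are kept. The sketch does not call PCRE itself.

```c
#include <assert.h>
#include <string.h>

/* Drop everything before keep_from from buf (current length len), then
   append the next segment. Returns the new length, or (size_t)-1 if the
   result, including its terminating NUL, would not fit in bufsize bytes. */
size_t carry_over(char *buf, size_t len, size_t bufsize,
                  size_t keep_from, const char *segment)
{
    size_t kept;
    size_t seglen;

    assert(buf != NULL && segment != NULL && keep_from <= len);
    kept = len - keep_from;
    seglen = strlen(segment);
    if (kept + seglen + 1 > bufsize)
        return (size_t)-1;

    memmove(buf, buf + keep_from, kept);        /* the regions may overlap */
    memcpy(buf + kept, segment, seglen + 1);    /* copies the NUL as well */
    return kept + seglen;
}
```

       In the example above the partial match "23ja" starts at offset 12 of
       "The date is 23ja", so calling carry_over() with keep_from set to 12
       leaves "23ja" at the start of the buffer, followed by the new segment,
       ready for the next call to the matching function.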

ISSUES WITH MULTI-SEGMENT MATCHING

       Certain types of pattern may give problems with multi-segment matching,
       whichever matching function is used.

       1. If the pattern contains a test for the beginning of a line, you need
       to pass the PCRE_NOTBOL option when the subject string for any call
       does not start at the beginning of a line. There is also a PCRE_NOTEOL
       option, but in practice when doing multi-segment matching you should be
       using PCRE_PARTIAL_HARD, which includes the effect of PCRE_NOTEOL.

       2. Lookbehind assertions that have already been obeyed are catered for
       in the offsets that are returned for a partial match. However, a
       lookbehind assertion later in the pattern could require even earlier
       characters to be inspected. You can handle this case by using the
       PCRE_INFO_MAXLOOKBEHIND option of the pcre_fullinfo() or
       pcre[16|32]_fullinfo() functions to obtain the length of the longest
       lookbehind in the pattern. This length is given in characters, not
       bytes. If you always retain at least that many characters before the
       partially matched string, all should be well. (Of course, near the
       start of the subject, fewer characters may be present; in that case all
       characters should be retained.)

       From release 8.33, there is a more accurate way of deciding which
       characters to retain. Instead of subtracting the length of the longest
       lookbehind from the earliest inspected character (offsets[0]), the
       match start position (offsets[2]) should be used, and the next match
       attempt started at the offsets[2] character by setting the startoffset
       argument of pcre_exec() or pcre_dfa_exec().

       For example, if the pattern "(?<=123)abc" is partially matched against
       the string "xx123a", the three offset values returned are 2, 6, and 5.
       This indicates that the matching process that gave a partial match
       started at offset 5, but the characters "123a" were all inspected. The
       maximum lookbehind for that pattern is 3, so taking that away from 5
       shows that we need only keep "123a", and the next match attempt can be
       started at offset 3 of the retained text (that is, at "a") when further
       characters have been added. When the match start is not the earliest
       inspected character, pcretest shows it explicitly:

           re> "(?<=123)abc"
         data> xx123a\P\P
         Partial match at offset 5: 123a

       3. Because a partial match must always contain at least one character,
       what might be considered a partial match of an empty string actually
       gives a "no match" result. For example:

           re> /c(?<=abc)x/
         data> ab\P
         No match

       If the next segment begins "cx", a match should be found, but this will
       only happen if characters from the previous segment are retained. For
       this reason, a "no match" result should be interpreted as "partial
       match of an empty string" when the pattern contains lookbehinds.

       4. Matching a subject string that is split into multiple segments may
       not always produce exactly the same result as matching over one single
       long string, especially when PCRE_PARTIAL_SOFT is used. The section
       "Partial Matching and Word Boundaries" above describes an issue that
       arises if the pattern ends with \b or \B. Another kind of difference
       may occur when there are multiple matching possibilities, because (for
       PCRE_PARTIAL_SOFT) a partial match result is given only when there are
       no completed matches. This means that as soon as the shortest match has
       been found, continuation to a new subject segment is no longer
       possible. Consider again this pcretest example:

           re> /dog(sbody)?/
         data> dogsb\P
          0: dog
         data> do\P\D
         Partial match: do
         data> gsb\R\P\D
          0: g
         data> dogsbody\D
          0: dogsbody
          1: dog

       The first data line passes the string "dogsb" to a standard matching
       function, setting the PCRE_PARTIAL_SOFT option. Although the string is
       a partial match for "dogsbody", the result is not PCRE_ERROR_PARTIAL,
       because the shorter string "dog" is a complete match. Similarly, when
       the subject is presented to a DFA matching function in several parts
       ("do" and "gsb" being the first two) the match stops when "dog" has
       been found, and it is not possible to continue. On the other hand, if
       "dogsbody" is presented as a single string, a DFA matching function
       finds both matches.

       Because of these problems, it is best to use PCRE_PARTIAL_HARD when
       matching multi-segment data. The example above then behaves
       differently:

           re> /dog(sbody)?/
         data> dogsb\P\P
         Partial match: dogsb
         data> do\P\D
         Partial match: do
         data> gsb\R\P\P\D
         Partial match: gsb

       5. Patterns that contain alternatives at the top level which do not all
       start with the same pattern item may not work as expected when
       PCRE_DFA_RESTART is used. For example, consider this pattern:

         1234|3789

       If the first part of the subject is "ABC123", a partial match of the
       first alternative is found at offset 3. There is no partial match for
       the second alternative, because such a match does not start at the same
       point in the subject string. Attempting to continue with the string
       "7890" does not yield a match because only those alternatives that
       match at one point in the subject are remembered. The problem arises
       because the start of the second alternative matches within the first
       alternative. There is no problem with anchored patterns or patterns
       such as:

         1234|ABCD

       where no string can be a partial match for both alternatives. This is
       not a problem if a standard matching function is used, because the
       entire match has to be rerun each time:

           re> /1234|3789/
         data> ABC123\P\P
         Partial match: 123
         data> 1237890
          0: 3789

       Of course, instead of using PCRE_DFA_RESTART, the same technique of
       re-running the entire match can also be used with the DFA matching
       functions. Another possibility is to work with two buffers. If a
       partial match at offset n in the first buffer is followed by "no match"
       when PCRE_DFA_RESTART is used on the second buffer, you can then try a
       new match starting at offset n+1 in the first buffer.
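
       The retention calculation in item 2 above can be written out as a small
       helper. The structure retain_plan and the function plan_retention() are
       hypothetical names; max_lookbehind stands for the value obtained with
       PCRE_INFO_MAXLOOKBEHIND, and the sketch assumes one byte per character,
       which does not hold for UTF-8 subjects.

```c
#include <assert.h>

/* Hypothetical result of the calculation described in item 2. */
struct retain_plan {
    int keep_from;      /* offset in the old buffer of the first character
                           that must be kept */
    int start_offset;   /* startoffset for the next matching call, relative
                           to the retained text */
};

/* match_start is offsets[2] from the partial match; max_lookbehind is the
   longest lookbehind in the pattern, counted in characters. */
struct retain_plan plan_retention(int match_start, int max_lookbehind)
{
    struct retain_plan plan;

    assert(match_start >= 0 && max_lookbehind >= 0);
    plan.keep_from = match_start - max_lookbehind;
    if (plan.keep_from < 0)
        plan.keep_from = 0;     /* near the start: keep everything */
    plan.start_offset = match_start - plan.keep_from;
    return plan;
}
```

       For the "xx123a" example, plan_retention(5, 3) gives a keep_from of 2
       and a start_offset of 3: the retained text is "123a" and the next
       attempt starts at "a".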

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.

REVISION

       Last updated: 02 July 2013
       Copyright (c) 1997-2013 University of Cambridge.

PCRE 8.34                        02 July 2013                   PCREPARTIAL(3)