PCREPARTIAL(3)             Library Functions Manual             PCREPARTIAL(3)

NAME

       PCRE - Perl-compatible regular expressions

PARTIAL MATCHING IN PCRE

       In normal use of PCRE, if the subject string that is passed to a match‐
       ing function matches as far as it goes, but is too short to  match  the
       entire pattern, PCRE_ERROR_NOMATCH is returned. There are circumstances
       where it might be helpful to distinguish this case from other cases  in
       which there is no match.

       Consider, for example, an application where a human is required to type
       in data for a field with specific formatting requirements.  An  example
       might be a date in the form ddmmmyy, defined by this pattern:

         ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$

       If the application sees the user's keystrokes one by one, and can check
       that what has been typed so far is potentially valid,  it  is  able  to
       raise  an  error  as  soon  as  a  mistake  is made, by beeping and not
       reflecting the character that has been typed, for example. This immedi‐
       ate  feedback is likely to be a better user interface than a check that
       is delayed until the entire string has been entered.  Partial  matching
       can  also be useful when the subject string is very long and is not all
       available at once.
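
       The three outcomes that such an application has to tell  apart  can  be
       modelled  without  PCRE  at all. The C sketch below is a hand-written
       stand-in for the date pattern above, not PCRE's  matching  algorithm;
       the  names  field_state,  try_date() and check_date_field() are invented
       for this illustration. It reports a complete match, a  partial  match
       (the  input  is a valid prefix of a date), or no match, and prefers a
       complete match when both are possible.

```c
#include <ctype.h>
#include <string.h>

/* Ordered so that a "better" outcome compares greater. */
enum field_state { FIELD_NOMATCH, FIELD_PARTIAL, FIELD_COMPLETE };

/* Try the date pattern with exactly ndigits leading day digits (1 or 2). */
static enum field_state try_date(const char *s, int ndigits)
{
    static const char *const months[] = {
        "jan", "feb", "mar", "apr", "may", "jun",
        "jul", "aug", "sep", "oct", "nov", "dec"
    };
    size_t len = strlen(s), pos = 0, m, k;
    int i, month_ok = 0, prefix_ok = 0;

    if (len == 0) return FIELD_NOMATCH;   /* nothing has been inspected */

    for (i = 0; i < ndigits; i++, pos++) {
        if (pos == len) return FIELD_PARTIAL;
        if (!isdigit((unsigned char)s[pos])) return FIELD_NOMATCH;
    }
    for (m = 0; m < 12 && !month_ok; m++) {
        for (k = 0; k < 3; k++) {
            if (pos + k == len) { prefix_ok = 1; break; }  /* input ran out */
            if (s[pos + k] != months[m][k]) break;
        }
        if (k == 3) month_ok = 1;
    }
    if (!month_ok) return prefix_ok ? FIELD_PARTIAL : FIELD_NOMATCH;
    pos += 3;
    for (i = 0; i < 2; i++, pos++) {
        if (pos == len) return FIELD_PARTIAL;
        if (!isdigit((unsigned char)s[pos])) return FIELD_NOMATCH;
    }
    return pos == len ? FIELD_COMPLETE : FIELD_NOMATCH;
}

/* Covers the optional first digit by trying both day widths. */
enum field_state check_date_field(const char *typed)
{
    enum field_state two = try_date(typed, 2);
    enum field_state one = try_date(typed, 1);
    return two > one ? two : one;
}
```

       In this model "25jun04" is complete, "25dec3" and "3ju"  are  partial,
       and "3juj" is no match, so a keystroke handler could reject the final
       "j" as soon as it is typed.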

       PCRE supports partial matching by means of  the  PCRE_PARTIAL_SOFT  and
       PCRE_PARTIAL_HARD  options,  which  can  be set when calling any of the
       matching functions. For backwards compatibility, PCRE_PARTIAL is a syn‐
       onym  for  PCRE_PARTIAL_SOFT.  The essential difference between the two
       options is whether or not a partial match is preferred to  an  alterna‐
       tive complete match, though the details differ between the two types of
       matching function. If both options  are  set,  PCRE_PARTIAL_HARD  takes
       precedence.

       If  you  want to use partial matching with just-in-time optimized code,
       you must call pcre_study(), pcre16_study() or  pcre32_study() with  one
       or both of these options:

         PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
         PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE

       PCRE_STUDY_JIT_COMPILE  should also be set if you are going to run non-
       partial matches on the same pattern. If the appropriate JIT study  mode
       has not been set for a match, the interpretive matching code is used.

       Setting a partial matching option disables two of PCRE's standard opti‐
       mizations. PCRE remembers the last literal data unit in a pattern,  and
       abandons  matching  immediately  if  it  is  not present in the subject
       string. This optimization cannot be used  for  a  subject  string  that
       might  match only partially. If the pattern was studied, PCRE knows the
       minimum length of a matching string, and does not  bother  to  run  the
       matching  function  on  shorter strings. This optimization is also dis‐
       abled for partial matching.

PARTIAL MATCHING USING pcre_exec() OR pcre[16|32]_exec()

       A  partial   match   occurs   during   a   call   to   pcre_exec()   or
       pcre[16|32]_exec()  when  the end of the subject string is reached suc‐
       cessfully, but matching cannot continue  because  more  characters  are
       needed.  However,  at least one character in the subject must have been
       inspected. This character need not  form  part  of  the  final  matched
       string;  lookbehind  assertions and the \K escape sequence provide ways
       of inspecting characters before the start of a matched  substring.  The
       requirement  for  inspecting  at  least one character exists because an
       empty string can always be matched; without such  a  restriction  there
       would  always  be  a partial match of an empty string at the end of the
       subject.

       If there are at least two slots in the offsets vector  when  a  partial
       match  is returned, the first slot is set to the offset of the earliest
       character that was inspected. For convenience, the second offset points
       to the end of the subject so that a substring can easily be identified.

       For  the majority of patterns, the first offset identifies the start of
       the partially matched string. However, for patterns that contain  look‐
       behind  assertions,  or  \K, or begin with \b or \B, earlier characters
       have been inspected while carrying out the match. For example:

         /(?<=abc)123/

       This pattern matches "123", but only if it is preceded by "abc". If the
       subject string is "xyzabc12", the offsets after a partial match are for
       the substring "abc12", because  all  these  characters  are  needed  if
       another match is tried with extra characters added to the subject.
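
       As an illustration of how the two offsets might be  used,  the  sketch
       below  copies  the  span  they identify out of the subject. The function
       name copy_partial_span() is hypothetical. The offsets are assumed  to
       be  the  first  two  ints  of  the  vector  passed to pcre_exec(), and are
       treated as byte offsets, as they are for the 8-bit library.

```c
#include <stddef.h>
#include <string.h>

/* Copy subject[ovector[0] .. ovector[1]) into out as a NUL-terminated
   string. Returns the number of bytes copied, or -1 if the offsets are
   inconsistent or out is too small. */
int copy_partial_span(const char *subject, const int *ovector,
                      char *out, size_t outsize)
{
    int start = ovector[0], end = ovector[1];
    size_t n;

    if (start < 0 || end < start) return -1;
    n = (size_t)(end - start);
    if (n + 1 > outsize) return -1;      /* need room for the terminator */
    memcpy(out, subject + start, n);
    out[n] = '\0';
    return (int)n;
}
```

       For the subject "xyzabc12" and the offsets 3 and 8,  the  copied  text
       is "abc12".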

       What happens when a partial match is identified depends on which of the
       two partial matching options is set.

   PCRE_PARTIAL_SOFT WITH pcre_exec() OR pcre[16|32]_exec()

       If PCRE_PARTIAL_SOFT is  set  when  pcre_exec()  or  pcre[16|32]_exec()
       identifies a partial match, the partial match is remembered, but match‐
       ing continues as normal, and other  alternatives  in  the  pattern  are
       tried.  If  no  complete  match  can  be  found,  PCRE_ERROR_PARTIAL is
       returned instead of PCRE_ERROR_NOMATCH.
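
       A caller therefore has three ordinary outcomes to handle.  The  sketch
       below  maps  a  return value from pcre_exec() onto them. The enum and
       the function verdict_from_rc() are names invented for this  illustra‐
       tion;  the  two  error  values  are repeated from pcre.h only so that the
       fragment compiles on its own.

```c
/* Values as defined in pcre.h for PCRE 8.x, repeated here so the sketch
   is self-contained. Real code should include <pcre.h> instead. */
#ifndef PCRE_ERROR_NOMATCH
#define PCRE_ERROR_NOMATCH  (-1)
#endif
#ifndef PCRE_ERROR_PARTIAL
#define PCRE_ERROR_PARTIAL  (-12)
#endif

enum input_verdict {
    INPUT_ACCEPT,        /* a complete match was found */
    INPUT_KEEP_TYPING,   /* valid so far, more characters needed */
    INPUT_REJECT,        /* cannot match, whatever is added */
    INPUT_ERROR          /* any other negative return value */
};

enum input_verdict verdict_from_rc(int rc)
{
    if (rc >= 0) return INPUT_ACCEPT;   /* 0 means the offsets vector was too small */
    if (rc == PCRE_ERROR_PARTIAL) return INPUT_KEEP_TYPING;
    if (rc == PCRE_ERROR_NOMATCH) return INPUT_REJECT;
    return INPUT_ERROR;
}
```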

       This option is "soft" because it prefers a complete match over  a  par‐
       tial  match.   All the various matching items in a pattern behave as if
       the subject string is potentially complete. For example, \z, \Z, and  $
       match  at  the end of the subject, as normal, and for \b and \B the end
       of the subject is treated as a non-alphanumeric.

       If there is more than one partial match, the first one that  was  found
       provides the data that is returned. Consider this pattern:

         /123\w+X|dogY/

       If  this is matched against the subject string "abc123dog", both alter‐
       natives fail to match, but the end of the  subject  is  reached  during
       matching,  so  PCRE_ERROR_PARTIAL is returned. The offsets are set to 3
       and 9, identifying "123dog" as the first partial match that was  found.
       (In  this  example, there are two partial matches, because "dog" on its
       own partially matches the second alternative.)

   PCRE_PARTIAL_HARD WITH pcre_exec() OR pcre[16|32]_exec()

       If PCRE_PARTIAL_HARD is  set  for  pcre_exec()  or  pcre[16|32]_exec(),
       PCRE_ERROR_PARTIAL  is  returned  as  soon as a partial match is found,
       without continuing to search for possible complete matches. This option
       is "hard" because it prefers an earlier partial match over a later com‐
       plete match. For this reason, the assumption is made that  the  end  of
       the  supplied  subject  string may not be the true end of the available
       data, and so, if \z, \Z, \b, \B, or $ are encountered at the end of the
       subject,  the  result is PCRE_ERROR_PARTIAL, provided that at least one
       character in the subject has been inspected.

       Setting PCRE_PARTIAL_HARD also affects the way UTF-8 and UTF-16 subject
       strings  are checked for validity. Normally, an invalid sequence causes
       the error PCRE_ERROR_BADUTF8 or PCRE_ERROR_BADUTF16.  However,  in  the
       special  case  of  a  truncated  character  at  the end of the subject,
       PCRE_ERROR_SHORTUTF8  or   PCRE_ERROR_SHORTUTF16   is   returned   when
       PCRE_PARTIAL_HARD is set.
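
       An application that feeds UTF-8 data in segments  may  prefer  to  hold
       back  a  truncated  character  itself  rather than wait for the error. The
       helper below is a sketch of such a check on the  application  side;  it
       is  not  part  of  PCRE, and the name utf8_incomplete_tail() is invented
       here. It looks only at the final bytes of  the  buffer  and  does  not
       validate the rest of the string.

```c
#include <stddef.h>

/* Returns how many bytes at the end of buf belong to a UTF-8 character
   whose remaining bytes have not arrived yet. Returns 0 if the buffer
   ends on a character boundary, or if the tail is malformed in some
   other way (that case is left for the matching function to report). */
size_t utf8_incomplete_tail(const unsigned char *buf, size_t len)
{
    size_t back, need;

    for (back = 1; back <= 4 && back <= len; back++) {
        unsigned char c = buf[len - back];

        if ((c & 0xC0) == 0x80) continue;        /* continuation byte */
        if      ((c & 0x80) == 0x00) need = 1;   /* ASCII */
        else if ((c & 0xE0) == 0xC0) need = 2;
        else if ((c & 0xF0) == 0xE0) need = 3;
        else if ((c & 0xF8) == 0xF0) need = 4;
        else return 0;                           /* not a valid lead byte */
        return back < need ? back : 0;
    }
    return 0;
}
```

       The caller could pass all but the reported number of trailing bytes  to
       the matching function and prepend the held-back bytes to the next seg‐
       ment.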

   Comparing hard and soft partial matching

       The  difference  between the two partial matching options can be illus‐
       trated by a pattern such as:

         /dog(sbody)?/

       This matches either "dog" or "dogsbody", greedily (that is, it  prefers
       the  longer  string  if  possible). If it is matched against the string
       "dog" with PCRE_PARTIAL_SOFT, it yields a  complete  match  for  "dog".
       However, if PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL.
       On the other hand, if the pattern is made ungreedy the result  is  dif‐
       ferent:

         /dog(sbody)??/

       In  this  case  the  result  is always a complete match because that is
       found first, and matching never  continues  after  finding  a  complete
       match. It might be easier to follow this explanation by thinking of the
       two patterns like this:

         /dog(sbody)?/    is the same as  /dogsbody|dog/
         /dog(sbody)??/   is the same as  /dog|dogsbody/

       The second pattern will never match "dogsbody", because it will  always
       find the shorter match first.
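
       The ordering argument can be checked with a toy model. The  C  sketch
       below  is  not  PCRE  code:  it handles only literal alternatives, tried
       strictly in order and anchored at the start of the  subject,  and  the
       names  outcome and match_in_order() are invented for this illustration.
       It follows the rules  described  above:  a  complete  match  ends  the
       search, a hard partial match ends it too, and a soft partial match is
       only remembered.

```c
#include <string.h>

enum outcome { OUT_NOMATCH, OUT_PARTIAL, OUT_COMPLETE };

/* Model of a backtracking matcher that tries literal alternatives in
   the order given. hard != 0 models PCRE_PARTIAL_HARD, hard == 0 models
   PCRE_PARTIAL_SOFT. */
enum outcome match_in_order(const char *const *alts, int nalts,
                            const char *subject, int hard)
{
    size_t slen = strlen(subject);
    int saw_partial = 0, i;

    for (i = 0; i < nalts; i++) {
        size_t alen = strlen(alts[i]);

        if (alen <= slen && strncmp(subject, alts[i], alen) == 0)
            return OUT_COMPLETE;           /* first complete match wins */
        if (slen > 0 && slen < alen && strncmp(subject, alts[i], slen) == 0) {
            if (hard) return OUT_PARTIAL;  /* hard: stop at first partial */
            saw_partial = 1;               /* soft: remember, keep trying */
        }
    }
    return saw_partial ? OUT_PARTIAL : OUT_NOMATCH;
}
```

       With the alternatives ordered "dogsbody", "dog" and the subject  "dog",
       the  model  gives  a  complete match in soft mode and a partial match in
       hard mode. With the order "dog", "dogsbody" it gives a complete  match
       in both modes.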

PARTIAL MATCHING USING pcre_dfa_exec() OR pcre[16|32]_dfa_exec()

       The DFA functions move along the subject string character by character,
       without backtracking, searching for  all  possible  matches  simultane‐
       ously.  If the end of the subject is reached before the end of the pat‐
       tern, there is the possibility of a partial match, again provided  that
       at least one character has been inspected.

       When  PCRE_PARTIAL_SOFT  is set, PCRE_ERROR_PARTIAL is returned only if
       there have been no complete matches. Otherwise,  the  complete  matches
       are  returned.   However,  if PCRE_PARTIAL_HARD is set, a partial match
       takes precedence over any complete matches. The portion of  the  string
       that  was  inspected when the longest partial match was found is set as
       the first matching string, provided there are at least two slots in the
       offsets vector.

       Because  the  DFA functions always search for all possible matches, and
       there is no difference between greedy and  ungreedy  repetition,  their
       behaviour  is  different  from  the  standard  functions when PCRE_PAR‐
       TIAL_HARD is  set.  Consider  the  string  "dog"  matched  against  the
       ungreedy pattern shown above:

         /dog(sbody)??/

       Whereas  the  standard functions stop as soon as they find the complete
       match for "dog", the DFA functions also  find  the  partial  match  for
       "dogsbody", and so return that when PCRE_PARTIAL_HARD is set.

PARTIAL MATCHING AND WORD BOUNDARIES

       If  a  pattern  ends with one of the sequences \b or \B, which test for
       word boundaries, partial matching with PCRE_PARTIAL_SOFT can give coun‐
       ter-intuitive results. Consider this pattern:

         /\bcat\b/

       This matches "cat", provided there is a word boundary at either end. If
       the subject string is "the cat", the comparison of the final "t" with a
       following  character  cannot  take  place, so a partial match is found.
       However, normal matching carries on, and \b matches at the end  of  the
       subject  when  the  last  character is a letter, so a complete match is
       found.  The  result,  therefore,  is  not   PCRE_ERROR_PARTIAL.   Using
       PCRE_PARTIAL_HARD  in  this case does yield PCRE_ERROR_PARTIAL, because
       then the partial match takes precedence.

FORMERLY RESTRICTED PATTERNS

       For releases of PCRE prior to 8.00, because of the way certain internal
       optimizations   were  implemented  in  the  pcre_exec()  function,  the
       PCRE_PARTIAL option (predecessor of  PCRE_PARTIAL_SOFT)  could  not  be
       used  with all patterns. From release 8.00 onwards, the restrictions no
       longer apply, and partial matching can be requested for any pattern.

       Items that were formerly restricted were repeated single characters and
       repeated metasequences. If PCRE_PARTIAL was set for a pattern that  did
       not  conform  to  the restrictions, pcre_exec() returned the error code
       PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in  use.  The
       PCRE_INFO_OKPARTIAL  call  to pcre_fullinfo() to find out if a compiled
       pattern can be used for partial matching now always returns 1.

EXAMPLE OF PARTIAL MATCHING USING PCRETEST

       If the escape sequence \P is present  in  a  pcretest  data  line,  the
       PCRE_PARTIAL_SOFT  option  is  used  for  the  match.  Here is a run of
       pcretest that uses the date example quoted above:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 25jun04\P
          0: 25jun04
          1: jun
         data> 25dec3\P
         Partial match: 25dec3
         data> 3ju\P
         Partial match: 3ju
         data> 3juj\P
         No match
         data> j\P
         No match

       The first data string is matched  completely,  so  pcretest  shows  the
       matched  substrings.  The  remaining four strings do not match the com‐
       plete pattern, but the first two are partial matches. Similar output is
       obtained if DFA matching is used.

       If  the escape sequence \P is present more than once in a pcretest data
       line, the PCRE_PARTIAL_HARD option is set for the match.

MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() OR pcre[16|32]_dfa_exec()

       When a partial match has been found using a DFA matching  function,  it
       is  possible to continue the match by providing additional subject data
       and calling the function again with the same compiled  regular  expres‐
       sion,  this time setting the PCRE_DFA_RESTART option. You must pass the
       same working space as before, because this is where details of the pre‐
       vious  partial  match  are  stored.  Here is an example using pcretest,
       using the \R escape sequence to set  the  PCRE_DFA_RESTART  option  (\D
       specifies the use of the DFA matching function):

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 23ja\P\D
         Partial match: 23ja
         data> n05\R\D
          0: n05

       The  first  call has "23ja" as the subject, and requests partial match‐
       ing; the second call  has  "n05"  as  the  subject  for  the  continued
       (restarted)  match.   Notice  that when the match is complete, only the
       last part is shown; PCRE does  not  retain  the  previously  partially-
       matched  string. It is up to the calling program to do that if it needs
       to.
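
       One way for a calling program to keep the  whole  matched  text  is  to
       append  the  relevant part of each segment to its own buffer as it goes.
       The sketch below assumes a small fixed-size buffer; struct  match_text
       and  match_text_append()  are  hypothetical  names, and deciding which
       bytes of each segment belong to the match (from the  returned  offsets)
       is left to the caller.

```c
#include <string.h>

struct match_text {
    char   data[256];   /* accumulated text, always NUL-terminated */
    size_t len;         /* number of bytes stored, excluding the NUL */
};

/* Append one piece of matched text. Returns 0 on success, or -1 if the
   piece would not fit (the buffer is then left unchanged). */
int match_text_append(struct match_text *t, const char *seg, size_t seglen)
{
    if (seglen >= sizeof t->data - t->len) return -1;  /* keep room for NUL */
    memcpy(t->data + t->len, seg, seglen);
    t->len += seglen;
    t->data[t->len] = '\0';
    return 0;
}
```

       Appending "23ja" after the first call and "n05" after the second leaves
       "23jan05" in the buffer.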

       You can set the PCRE_PARTIAL_SOFT  or  PCRE_PARTIAL_HARD  options  with
       PCRE_DFA_RESTART  to  continue partial matching over multiple segments.
       This facility can be used to pass very long subject strings to the  DFA
       matching functions.

MULTI-SEGMENT MATCHING WITH pcre_exec() OR pcre[16|32]_exec()

       From  release 8.00, the standard matching functions can also be used to
       do multi-segment matching. Unlike the DFA functions, it is not possible
       to  restart the previous match with a new segment of data. Instead, new
       data must be added to the previous subject string, and the entire match
       re-run,  starting from the point where the partial match occurred. Ear‐
       lier data can be discarded.

       It is best to use PCRE_PARTIAL_HARD in this situation, because it  does
       not  treat the end of a segment as the end of the subject when matching
       \z, \Z, \b, \B, and $. Consider  an  unanchored  pattern  that  matches
       dates:

           re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
         data> The date is 23ja\P\P
         Partial match: 23ja

       At  this stage, an application could discard the text preceding "23ja",
       add on text from the next  segment,  and  call  the  matching  function
       again.  Unlike  the  DFA matching functions, the entire matching string
       must always be available, and the complete matching process occurs  for
       each call, so more memory and more processing time are needed.
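
       The discard-and-append step might look like the following sketch.  The
       function  name  carry_over()  and  its  parameters are invented for this
       illustration; keep_from is assumed to be the  first  offset  returned
       for the partial match, in bytes.

```c
#include <string.h>

/* Drop everything in buf before keep_from, then append the next
   segment. buf holds len bytes and has room for cap bytes in total.
   Returns the new length, or (size_t)-1 if the arguments are
   inconsistent or the result (plus a NUL) would not fit. */
size_t carry_over(char *buf, size_t len, size_t cap,
                  size_t keep_from, const char *next, size_t nextlen)
{
    size_t kept;

    if (keep_from > len) return (size_t)-1;
    kept = len - keep_from;
    if (kept + nextlen + 1 > cap) return (size_t)-1;
    memmove(buf, buf + keep_from, kept);   /* regions may overlap */
    memcpy(buf + kept, next, nextlen);
    buf[kept + nextlen] = '\0';
    return kept + nextlen;
}
```

       Starting from "The date is 23ja" with keep_from set to 12,  and  adding
       the  segment  "n05  was  fine", the buffer becomes "23jan05 was fine",
       which is then passed to the matching function as the new subject.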

       Note:  If  the pattern contains lookbehind assertions, or \K, or starts
       with \b or \B, the string that is returned for a partial match includes
       characters  that  precede  the partially matched string itself, because
       these must be retained when adding on more characters for a  subsequent
       matching  attempt.   However, in some cases you may need to retain even
       earlier characters, as discussed in the next section.

ISSUES WITH MULTI-SEGMENT MATCHING

       Certain types of pattern may give problems with multi-segment matching,
       whichever matching function is used.

       1. If the pattern contains a test for the beginning of a line, you need
       to pass the PCRE_NOTBOL option when the subject  string  for  any  call
       does  not start at the beginning of a line. There is also a PCRE_NOTEOL
       option, but in practice when doing multi-segment matching you should be
       using PCRE_PARTIAL_HARD, which includes the effect of PCRE_NOTEOL.

       2.  Lookbehind assertions that have already been obeyed are catered for
       in the offsets that are returned for a partial match. However a lookbe‐
       hind  assertion later in the pattern could require even earlier charac‐
       ters  to  be  inspected.  You  can  handle  this  case  by  using   the
       PCRE_INFO_MAXLOOKBEHIND    option    of    the    pcre_fullinfo()    or
       pcre[16|32]_fullinfo() functions to obtain the length  of  the  largest
       lookbehind  in  the  pattern.  This  length is given in characters, not
       bytes. If you always retain at least that many  characters  before  the
       partially  matched  string,  all  should  be well. (Of course, near the
       start of the subject, fewer characters may be present; in that case all
       characters should be retained.)

       3.  Because a partial match must always contain at least one character,
       what might be considered a partial match of an  empty  string  actually
       gives a "no match" result. For example:

           re> /c(?<=abc)x/
         data> ab\P
         No match

       If the next segment begins "cx", a match should be found, but this will
       only happen if characters from the previous segment are  retained.  For
       this  reason,  a  "no  match"  result should be interpreted as "partial
       match of an empty string" when the pattern contains lookbehinds.

       4. Matching a subject string that is split into multiple  segments  may
       not  always produce exactly the same result as matching over one single
       long string, especially when PCRE_PARTIAL_SOFT  is  used.  The  section
       "Partial  Matching  and  Word Boundaries" above describes an issue that
       arises if the pattern ends with \b or \B. Another  kind  of  difference
       may  occur when there are multiple matching possibilities, because (for
       PCRE_PARTIAL_SOFT) a partial match result is given only when there  are
       no completed matches. This means that as soon as the shortest match has
       been found, continuation to a new subject segment is no  longer  possi‐
       ble. Consider again this pcretest example:

           re> /dog(sbody)?/
         data> dogsb\P
          0: dog
         data> do\P\D
         Partial match: do
         data> gsb\R\P\D
          0: g
         data> dogsbody\D
          0: dogsbody
          1: dog

       The  first  data  line passes the string "dogsb" to a standard matching
       function, setting the PCRE_PARTIAL_SOFT option. Although the string  is
       a  partial  match for "dogsbody", the result is not PCRE_ERROR_PARTIAL,
       because the shorter string "dog" is a complete match.  Similarly,  when
       the  subject  is  presented to a DFA matching function in several parts
       ("do" and "gsb" being the first two) the match  stops  when  "dog"  has
       been  found, and it is not possible to continue.  On the other hand, if
       "dogsbody" is presented as a single string,  a  DFA  matching  function
       finds both matches.

       Because  of  these  problems,  it is best to use PCRE_PARTIAL_HARD when
       matching multi-segment data. The example  above  then  behaves  differ‐
       ently:

           re> /dog(sbody)?/
         data> dogsb\P\P
         Partial match: dogsb
         data> do\P\D
         Partial match: do
         data> gsb\R\P\P\D
         Partial match: gsb

       5. Patterns that contain alternatives at the top level which do not all
       start with the  same  pattern  item  may  not  work  as  expected  when
       PCRE_DFA_RESTART is used. For example, consider this pattern:

         1234|3789

       If  the  first  part of the subject is "ABC123", a partial match of the
       first alternative is found at offset 3. There is no partial  match  for
       the second alternative, because such a match does not start at the same
       point in the subject string. Attempting to  continue  with  the  string
       "7890"  does  not  yield  a  match because only those alternatives that
       match at one point in the subject are remembered.  The  problem  arises
       because  the  start  of the second alternative matches within the first
       alternative. There is no problem with  anchored  patterns  or  patterns
       such as:

         1234|ABCD

       where  no  string can be a partial match for both alternatives. This is
       not a problem if a standard matching  function  is  used,  because  the
       entire match has to be rerun each time:

           re> /1234|3789/
         data> ABC123\P\P
         Partial match: 123
         data> 1237890
          0: 3789

       Of course, instead of using PCRE_DFA_RESTART, the same technique of re-
       running the entire match can also be used with the DFA  matching  func‐
       tions.  Another  possibility  is to work with two buffers. If a partial
       match at offset n in the first buffer is followed by  "no  match"  when
       PCRE_DFA_RESTART  is  used on the second buffer, you can then try a new
       match starting at offset n+1 in the first buffer.
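
       The retention rules in points 2 and 3 above reduce to a small calcula‐
       tion,  sketched  below. The function first_char_to_keep() is a hypo‐
       thetical helper, not a PCRE call. It assumes that  max_lookbehind  is
       the  value  obtained  with PCRE_INFO_MAXLOOKBEHIND, and that one char‐
       acter is one code unit (true for the 8-bit library without  UTF-8;  in
       UTF modes the count would have to be converted from characters).

```c
#include <stddef.h>

/* Offset of the first character to keep for the next matching attempt.
   had_partial is nonzero if the last call reported a partial match that
   starts at partial_start, and zero if it reported "no match", which is
   then treated as a partial match of an empty string at subject_len. */
size_t first_char_to_keep(int had_partial, size_t partial_start,
                          size_t subject_len, size_t max_lookbehind)
{
    size_t anchor = had_partial ? partial_start : subject_len;

    /* Near the start of the subject fewer characters are available,
       so everything is kept. */
    return anchor > max_lookbehind ? anchor - max_lookbehind : 0;
}
```

       For the pattern /c(?<=abc)x/ the largest lookbehind is three charac‐
       ters,  so  after  "no  match"  on  the two-character segment "ab" the
       result is 0: both characters are kept, and the next attempt can  then
       match "abcx".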

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.

REVISION

       Last updated: 24 June 2012
       Copyright (c) 1997-2012 University of Cambridge.

PCRE 8.31                        24 June 2012                   PCREPARTIAL(3)