PCREPARTIAL(3)             Library Functions Manual             PCREPARTIAL(3)

NAME

       PCRE - Perl-compatible regular expressions

PARTIAL MATCHING IN PCRE

       In normal use of PCRE, if the subject string that is passed to
       pcre_exec() or pcre_dfa_exec() matches as far as it goes, but is too
       short to match the entire pattern, PCRE_ERROR_NOMATCH is returned.
       There are circumstances where it might be helpful to distinguish this
       case from other cases in which there is no match.

       Consider, for example, an application where a human is required to type
       in data for a field with specific formatting requirements. An example
       might be a date in the form ddmmmyy, defined by this pattern:

         ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$

       If the application sees the user's keystrokes one by one, and can check
       that what has been typed so far is potentially valid, it is able to
       raise an error as soon as a mistake is made, by beeping and not
       reflecting the character that has been typed, for example. This
       immediate feedback is likely to be a better user interface than a check
       that is delayed until the entire string has been entered. Partial
       matching can also sometimes be useful when the subject string is very
       long and is not all available at once.

       PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
       PCRE_PARTIAL_HARD options, which can be set when calling pcre_exec() or
       pcre_dfa_exec(). For backwards compatibility, PCRE_PARTIAL is a synonym
       for PCRE_PARTIAL_SOFT. The essential difference between the two options
       is whether or not a partial match is preferred to an alternative
       complete match, though the details differ between the two matching
       functions. If both options are set, PCRE_PARTIAL_HARD takes precedence.

       Setting a partial matching option disables two of PCRE's optimizations.
       PCRE remembers the last literal byte in a pattern, and abandons
       matching immediately if such a byte is not present in the subject
       string. This optimization cannot be used for a subject string that
       might match only partially. If the pattern was studied, PCRE knows the
       minimum length of a matching string, and does not bother to run the
       matching function on shorter strings. This optimization is also
       disabled for partial matching.

PARTIAL MATCHING USING pcre_exec()

       A partial match occurs during a call to pcre_exec() whenever the end of
       the subject string is reached successfully, but matching cannot
       continue because more characters are needed. However, at least one
       character must have been matched. (In other words, a partial match can
       never be an empty string.)

       If PCRE_PARTIAL_SOFT is set, the partial match is remembered, but
       matching continues as normal, and other alternatives in the pattern are
       tried. If no complete match can be found, pcre_exec() returns
       PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. If there are at least
       two slots in the offsets vector, the first of them is set to the offset
       of the earliest character that was inspected when the partial match was
       found. For convenience, the second offset points to the end of the
       string so that a substring can easily be identified.

       For the majority of patterns, the first offset identifies the start of
       the partially matched string. However, for patterns that contain
       lookbehind assertions, or \K, or begin with \b or \B, earlier
       characters have been inspected while carrying out the match. For
       example:

         /(?<=abc)123/

       This pattern matches "123", but only if it is preceded by "abc". If the
       subject string is "xyzabc12", the offsets after a partial match are for
       the substring "abc12", because all these characters are needed if
       another match is tried with extra characters added.

       If there is more than one partial match, the first one that was found
       provides the data that is returned. Consider this pattern:

         /123\w+X|dogY/

       If this is matched against the subject string "abc123dog", both
       alternatives fail to match, but the end of the subject is reached
       during matching, so PCRE_ERROR_PARTIAL is returned instead of
       PCRE_ERROR_NOMATCH. The offsets are set to 3 and 9, identifying
       "123dog" as the first partial match that was found. (In this example,
       there are two partial matches, because "dog" on its own partially
       matches the second alternative.)

       If PCRE_PARTIAL_HARD is set for pcre_exec(), it returns
       PCRE_ERROR_PARTIAL as soon as a partial match is found, without
       continuing to search for possible complete matches. The difference
       between the two options can be illustrated by a pattern such as:

         /dog(sbody)?/

       This matches either "dog" or "dogsbody", greedily (that is, it prefers
       the longer string if possible). If it is matched against the string
       "dog" with PCRE_PARTIAL_SOFT, it yields a complete match for "dog".
       However, if PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL.
       On the other hand, if the pattern is made ungreedy the result is
       different:

         /dog(sbody)??/

       In this case the result is always a complete match because pcre_exec()
       finds that first, and it never continues after finding a match. It
       might be easier to follow this explanation by thinking of the two
       patterns like this:

         /dog(sbody)?/    is the same as  /dogsbody|dog/
         /dog(sbody)??/   is the same as  /dog|dogsbody/

       The second pattern will never match "dogsbody" when pcre_exec() is
       used, because it will always find the shorter match first.
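
       The ordering argument above can be tried out with a small sketch that
       does not call PCRE at all. It tests literal alternatives in order at
       the start of a subject, which is roughly how a backtracking matcher
       treats the rewritten patterns. The names try_alternatives() and enum
       outcome are invented for this illustration and are not part of the
       PCRE API; the hard argument only imitates the effect of
       PCRE_PARTIAL_HARD, and a zero value imitates PCRE_PARTIAL_SOFT.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for this sketch (not PCRE's own values). */
enum outcome { NO_MATCH, PARTIAL_MATCH, FULL_MATCH };

/* Try each literal alternative, in order, at the start of subject.
   hard != 0 imitates PCRE_PARTIAL_HARD: return on the first partial match.
   hard == 0 imitates PCRE_PARTIAL_SOFT: remember it and keep trying. */
enum outcome try_alternatives(const char *subject, const char *const *alts,
                              size_t n_alts, int hard)
{
    assert(subject != NULL && alts != NULL);
    size_t sub_len = strlen(subject);
    int saw_partial = 0;

    for (size_t i = 0; i < n_alts; i++) {
        size_t alt_len = strlen(alts[i]);
        size_t common = sub_len < alt_len ? sub_len : alt_len;

        if (strncmp(subject, alts[i], common) != 0)
            continue;              /* mismatch before either string ended */
        if (alt_len <= sub_len)
            return FULL_MATCH;     /* whole alternative matched: stop here */
        if (sub_len > 0) {         /* subject ran out after >= 1 character */
            if (hard)
                return PARTIAL_MATCH;
            saw_partial = 1;
        }
    }
    return saw_partial ? PARTIAL_MATCH : NO_MATCH;
}
```

       With the alternatives ordered as "dogsbody" then "dog" and the subject
       "dog", the sketch reports a complete match when hard is zero and a
       partial match when it is non-zero. With the order "dog" then
       "dogsbody", it reports a complete match in both cases.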

PARTIAL MATCHING USING pcre_dfa_exec()

       The pcre_dfa_exec() function moves along the subject string character
       by character, without backtracking, searching for all possible matches
       simultaneously. If the end of the subject is reached before the end of
       the pattern, there is the possibility of a partial match, again
       provided that at least one character has matched.

       When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if
       there have been no complete matches. Otherwise, the complete matches
       are returned. However, if PCRE_PARTIAL_HARD is set, a partial match
       takes precedence over any complete matches. The portion of the string
       that was inspected when the longest partial match was found is set as
       the first matching string, provided there are at least two slots in the
       offsets vector.

       Because pcre_dfa_exec() always searches for all possible matches, and
       there is no difference between greedy and ungreedy repetition, its
       behaviour is different from pcre_exec() when PCRE_PARTIAL_HARD is set.
       Consider the string "dog" matched against the ungreedy pattern shown
       above:

         /dog(sbody)??/

       Whereas pcre_exec() stops as soon as it finds the complete match for
       "dog", pcre_dfa_exec() also finds the partial match for "dogsbody", and
       so returns that when PCRE_PARTIAL_HARD is set.

PARTIAL MATCHING AND WORD BOUNDARIES

       If a pattern ends with one of the sequences \b or \B, which test for
       word boundaries, partial matching with PCRE_PARTIAL_SOFT can give
       counter-intuitive results. Consider this pattern:

         /\bcat\b/

       This matches "cat", provided there is a word boundary at either end. If
       the subject string is "the cat", the comparison of the final "t" with a
       following character cannot take place, so a partial match is found.
       However, pcre_exec() carries on with normal matching, which matches \b
       at the end of the subject when the last character is a letter, thus
       finding a complete match. The result, therefore, is not
       PCRE_ERROR_PARTIAL. The same thing happens with pcre_dfa_exec(),
       because it also finds the complete match.

       Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL,
       because then the partial match takes precedence.

FORMERLY RESTRICTED PATTERNS

       For releases of PCRE prior to 8.00, because of the way certain internal
       optimizations were implemented in the pcre_exec() function, the
       PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be
       used with all patterns. From release 8.00 onwards, the restrictions no
       longer apply, and partial matching with pcre_exec() can be requested
       for any pattern.

       Items that were formerly restricted were repeated single characters and
       repeated metasequences. If PCRE_PARTIAL was set for a pattern that did
       not conform to the restrictions, pcre_exec() returned the error code
       PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The
       PCRE_INFO_OKPARTIAL call to pcre_fullinfo() to find out if a compiled
       pattern can be used for partial matching now always returns 1.

EXAMPLE OF PARTIAL MATCHING USING PCRETEST

       If the escape sequence \P is present in a pcretest data line, the
       PCRE_PARTIAL_SOFT option is used for the match. Here is a run of
       pcretest that uses the date example quoted above:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 25jun04\P
          0: 25jun04
          1: jun
         data> 25dec3\P
         Partial match: 25dec3
         data> 3ju\P
         Partial match: 3ju
         data> 3juj\P
         No match
         data> j\P
         No match

       The first data string is matched completely, so pcretest shows the
       matched substrings. The remaining four strings do not match the
       complete pattern, but the first two are partial matches. Similar output
       is obtained when pcre_dfa_exec() is used.

       If the escape sequence \P is present more than once in a pcretest data
       line, the PCRE_PARTIAL_HARD option is set for the match.
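
       The three outcomes in the run above can also be reproduced by a
       hand-written checker, which may help to show what "potentially valid so
       far" means for the date pattern. The following is only a sketch under
       stated assumptions: it does not use PCRE, the names classify_date(),
       check_layout() and enum outcome are invented for the illustration, and
       it ignores the fact that $ can also match before a final newline.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for this sketch (not PCRE's own values). */
enum outcome { NO_MATCH, PARTIAL_MATCH, FULL_MATCH };

static const char *const MONTHS[12] = {
    "jan", "feb", "mar", "apr", "may", "jun",
    "jul", "aug", "sep", "oct", "nov", "dec"
};

/* Check s against one layout: day_digits digits, a month name, two digits. */
static enum outcome check_layout(const char *s, size_t day_digits)
{
    size_t len = strlen(s);
    size_t pos = 0;

    for (size_t i = 0; i < day_digits; i++, pos++) {
        if (pos == len)                  /* input ended inside the day */
            return pos > 0 ? PARTIAL_MATCH : NO_MATCH;
        if (!isdigit((unsigned char)s[pos]))
            return NO_MATCH;
    }

    /* What is left must agree with some month name as far as it goes. */
    size_t avail = len - pos;
    size_t cmp = avail < 3 ? avail : 3;
    int month_ok = 0;
    for (size_t m = 0; m < 12; m++) {
        if (strncmp(s + pos, MONTHS[m], cmp) == 0) {
            month_ok = 1;
            break;
        }
    }
    if (!month_ok)
        return NO_MATCH;
    if (avail < 3)
        return PARTIAL_MATCH;            /* input ended inside the month */
    pos += 3;

    for (size_t i = 0; i < 2; i++, pos++) {
        if (pos == len)
            return PARTIAL_MATCH;        /* input ended inside the year */
        if (!isdigit((unsigned char)s[pos]))
            return NO_MATCH;
    }
    return pos == len ? FULL_MATCH : NO_MATCH;
}

/* Classify s against the ddmmmyy pattern by trying both day widths. */
enum outcome classify_date(const char *s)
{
    assert(s != NULL);
    enum outcome one = check_layout(s, 1);
    enum outcome two = check_layout(s, 2);
    return one > two ? one : two;
}
```

       An application that checks keystrokes one by one could call
       classify_date() on the text typed so far and reject the latest
       character whenever the result is NO_MATCH. An empty string is reported
       as NO_MATCH, in line with the rule that a partial match is never empty.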

MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()

       When a partial match has been found using pcre_dfa_exec(), it is
       possible to continue the match by providing additional subject data and
       calling pcre_dfa_exec() again with the same compiled regular
       expression, this time setting the PCRE_DFA_RESTART option. You must
       pass the same working space as before, because this is where details of
       the previous partial match are stored. Here is an example using
       pcretest, in which the \R escape sequence sets the PCRE_DFA_RESTART
       option and \D specifies the use of pcre_dfa_exec():

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 23ja\P\D
         Partial match: 23ja
         data> n05\R\D
          0: n05

       The first call has "23ja" as the subject, and requests partial
       matching; the second call has "n05" as the subject for the continued
       (restarted) match. Notice that when the match is complete, only the
       last part is shown; PCRE does not retain the previously
       partially-matched string. It is up to the calling program to do that if
       it needs to.

       You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
       PCRE_DFA_RESTART to continue partial matching over multiple segments.
       This facility can be used to pass very long subject strings to
       pcre_dfa_exec().

MULTI-SEGMENT MATCHING WITH pcre_exec()

       From release 8.00, pcre_exec() can also be used to do multi-segment
       matching. Unlike with pcre_dfa_exec(), it is not possible to restart
       the previous match with a new segment of data. Instead, new data must
       be added to the previous subject string, and the entire match re-run,
       starting from the point where the partial match occurred. Earlier data
       can be discarded. Consider an unanchored pattern that matches dates:

           re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
         data> The date is 23ja\P
         Partial match: 23ja

       At this stage, an application could discard the text preceding "23ja",
       add on text from the next segment, and call pcre_exec() again. Unlike
       with pcre_dfa_exec(), the entire matching string must always be
       available, and the complete matching process occurs for each call, so
       more memory and more processing time are needed.

       Note: If the pattern contains lookbehind assertions, or \K, or starts
       with \b or \B, the string that is returned for a partial match will
       include characters that precede the partially matched string itself,
       because these must be retained when adding on more characters for a
       subsequent matching attempt.
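
       The buffer handling described above (discard the earlier text, keep
       everything from the first returned offset, append the next segment) can
       be sketched as follows. This is an illustration only: carry_over() is a
       hypothetical helper and not part of PCRE, and the keep_from argument
       stands for the first value that pcre_exec() places in the offsets
       vector after returning PCRE_ERROR_PARTIAL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Drop buf[0 .. keep_from) and append the next segment to what is left.
   buf holds len bytes plus a terminating NUL and has room for cap bytes.
   Returns the new length, or (size_t)-1 if the result would not fit. */
size_t carry_over(char *buf, size_t len, size_t cap, int keep_from,
                  const char *segment)
{
    assert(buf != NULL && segment != NULL);
    assert(keep_from >= 0 && (size_t)keep_from <= len && len < cap);

    size_t kept = len - (size_t)keep_from;
    size_t seg_len = strlen(segment);
    if (kept + seg_len + 1 > cap)
        return (size_t)-1;

    memmove(buf, buf + keep_from, kept);       /* regions may overlap */
    memcpy(buf + kept, segment, seg_len + 1);  /* copies the NUL as well */
    return kept + seg_len;
}
```

       In the example above the partial match "23ja" starts at offset 12 of
       "The date is 23ja", so calling carry_over() with keep_from set to 12
       and the segment "n05" leaves "23jan05" in the buffer, ready to be
       passed to the matching function again. Because the offset already
       accounts for lookbehind context, the same call also retains any
       preceding characters that a later attempt needs.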

ISSUES WITH MULTI-SEGMENT MATCHING

       Certain types of pattern may give problems with multi-segment matching,
       whichever matching function is used.

       1. If the pattern contains tests for the beginning or end of a line,
       you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as
       appropriate, when the subject string for any call does not contain the
       beginning or end of a line.

       2. Lookbehind assertions at the start of a pattern are catered for in
       the offsets that are returned for a partial match. However, in theory,
       a lookbehind assertion later in the pattern could require even earlier
       characters to be inspected, and it might not have been reached when a
       partial match occurs. This is probably an extremely unlikely case; you
       could guard against it to a certain extent by always including extra
       characters at the start.

       3. Matching a subject string that is split into multiple segments may
       not always produce exactly the same result as matching over one single
       long string, especially when PCRE_PARTIAL_SOFT is used. The section
       "Partial Matching and Word Boundaries" above describes an issue that
       arises if the pattern ends with \b or \B. Another kind of difference
       may occur when there are multiple matching possibilities, because a
       partial match result is given only when there are no completed matches.
       This means that as soon as the shortest match has been found,
       continuation to a new subject segment is no longer possible. Consider
       again this pcretest example:

           re> /dog(sbody)?/
         data> dogsb\P
          0: dog
         data> do\P\D
         Partial match: do
         data> gsb\R\P\D
          0: g
         data> dogsbody\D
          0: dogsbody
          1: dog

       The first data line passes the string "dogsb" to pcre_exec(), setting
       the PCRE_PARTIAL_SOFT option. Although the string is a partial match
       for "dogsbody", the result is not PCRE_ERROR_PARTIAL, because the
       shorter string "dog" is a complete match. Similarly, when the subject
       is presented to pcre_dfa_exec() in several parts ("do" and "gsb" being
       the first two) the match stops when "dog" has been found, and it is not
       possible to continue. On the other hand, if "dogsbody" is presented as
       a single string, pcre_dfa_exec() finds both matches.

       Because of these problems, it is probably best to use PCRE_PARTIAL_HARD
       when matching multi-segment data. The example above then behaves
       differently:

           re> /dog(sbody)?/
         data> dogsb\P\P
         Partial match: dogsb
         data> do\P\D
         Partial match: do
         data> gsb\R\P\P\D
         Partial match: gsb

       4. Patterns that contain alternatives at the top level which do not all
       start with the same pattern item may not work as expected when
       PCRE_DFA_RESTART is used with pcre_dfa_exec(). For example, consider
       this pattern:

         1234|3789

       If the first part of the subject is "ABC123", a partial match of the
       first alternative is found at offset 3. There is no partial match for
       the second alternative, because such a match does not start at the same
       point in the subject string. Attempting to continue with the string
       "7890" does not yield a match because only those alternatives that
       match at one point in the subject are remembered. The problem arises
       because the start of the second alternative matches within the first
       alternative. There is no problem with anchored patterns or patterns
       such as:

         1234|ABCD

       where no string can be a partial match for both alternatives. This is
       not a problem if pcre_exec() is used, because the entire match has to
       be rerun each time:

           re> /1234|3789/
         data> ABC123\P
         Partial match: 123
         data> 1237890
          0: 3789

       Of course, instead of using PCRE_DFA_RESTART, the same technique of
       re-running the entire match can also be used with pcre_dfa_exec().
       Another possibility is to work with two buffers. If a partial match at
       offset n in the first buffer is followed by "no match" when
       PCRE_DFA_RESTART is used on the second buffer, you can then try a new
       match starting at offset n+1 in the first buffer.
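
       The re-running technique for the pattern 1234|3789 can be followed step
       by step with a toy matcher for literal alternatives. The sketch below
       does not call PCRE; search() and enum outcome are names invented for
       the illustration, and the function only imitates the way a soft
       partial match is remembered while later starting points are still
       tried.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for this sketch (not PCRE's own values). */
enum outcome { NO_MATCH, PARTIAL_MATCH, FULL_MATCH };

/* Search subject for the first of several literal alternatives.
   On FULL_MATCH, ov[0] and ov[1] delimit the match. On PARTIAL_MATCH they
   delimit the first partial match found, which runs to the end of subject.
   On NO_MATCH, ov is left untouched. */
enum outcome search(const char *subject, const char *const *alts,
                    size_t n_alts, int ov[2])
{
    assert(subject != NULL && alts != NULL && ov != NULL);
    size_t len = strlen(subject);
    int have_partial = 0;

    for (size_t start = 0; start < len; start++) {
        size_t rest = len - start;
        for (size_t i = 0; i < n_alts; i++) {
            size_t alt_len = strlen(alts[i]);
            size_t common = rest < alt_len ? rest : alt_len;

            if (strncmp(subject + start, alts[i], common) != 0)
                continue;
            if (alt_len <= rest) {          /* complete match wins at once */
                ov[0] = (int)start;
                ov[1] = (int)(start + alt_len);
                return FULL_MATCH;
            }
            if (!have_partial) {            /* keep only the first partial */
                have_partial = 1;
                ov[0] = (int)start;
                ov[1] = (int)len;
            }
        }
    }
    return have_partial ? PARTIAL_MATCH : NO_MATCH;
}
```

       Searching "ABC123" for the alternatives "1234" and "3789" gives a
       partial match with offsets 3 and 6. Keeping "123", appending "7890",
       and searching "1237890" from scratch gives a complete match with
       offsets 2 and 6, which is "3789", the same result as in the pcretest
       run above.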

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.

REVISION

       Last updated: 19 October 2009
       Copyright (c) 1997-2009 University of Cambridge.

                                                                PCREPARTIAL(3)