PCREPARTIAL(3)          Library Functions Manual          PCREPARTIAL(3)

NAME
PCRE - Perl-compatible regular expressions

PARTIAL MATCHING IN PCRE

In normal use of PCRE, if the subject string that is passed to a
matching function matches as far as it goes, but is too short to match
the entire pattern, PCRE_ERROR_NOMATCH is returned. There are
circumstances where it might be helpful to distinguish this case from
other cases in which there is no match.

Consider, for example, an application where a human is required to type
in data for a field with specific formatting requirements. An example
might be a date in the form ddmmmyy, defined by this pattern:

  ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$

If the application sees the user's keystrokes one by one, and can check
that what has been typed so far is potentially valid, it is able to
raise an error as soon as a mistake is made, for example by beeping and
not echoing the character that has been typed. This immediate feedback
is likely to be a better user interface than a check that is delayed
until the entire string has been entered. Partial matching can also be
useful when the subject string is very long and is not all available at
once.

PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
PCRE_PARTIAL_HARD options, which can be set when calling any of the
matching functions. For backwards compatibility, PCRE_PARTIAL is a
synonym for PCRE_PARTIAL_SOFT. The essential difference between the two
options is whether or not a partial match is preferred to an
alternative complete match, though the details differ between the two
types of matching function. If both options are set, PCRE_PARTIAL_HARD
takes precedence.

If you want to use partial matching with just-in-time optimized code,
you must call pcre_study(), pcre16_study() or pcre32_study() with one
or both of these options:

  PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
  PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE

PCRE_STUDY_JIT_COMPILE should also be set if you are going to run
non-partial matches on the same pattern. If the appropriate JIT study
mode has not been set for a match, the interpretive matching code is
used.

Setting a partial matching option disables two of PCRE's standard
optimizations. First, PCRE remembers the last literal data unit in a
pattern, and abandons matching immediately if it is not present in the
subject string. This optimization cannot be used for a subject string
that might match only partially. Second, if the pattern was studied,
PCRE knows the minimum length of a matching string, and does not bother
to run the matching function on shorter strings. This optimization is
also disabled for partial matching.

PARTIAL MATCHING USING pcre_exec() OR pcre[16|32]_exec()

A partial match occurs during a call to pcre_exec() or
pcre[16|32]_exec() when the end of the subject string is reached
successfully, but matching cannot continue because more characters are
needed. However, at least one character in the subject must have been
inspected. This character need not form part of the final matched
string; lookbehind assertions and the \K escape sequence provide ways
of inspecting characters before the start of a matched substring. The
requirement for inspecting at least one character exists because an
empty string can always be matched; without such a restriction there
would always be a partial match of an empty string at the end of the
subject.

If there are at least two slots in the offsets vector when a partial
match is returned, the first slot is set to the offset of the earliest
character that was inspected. For convenience, the second offset points
to the end of the subject so that a substring can easily be identified.
If there are at least three slots in the offsets vector, the third slot
is set to the offset of the character where matching started.

For the majority of patterns, the contents of the first and third slots
will be the same. However, for patterns that contain lookbehind
assertions, or begin with \b or \B, characters before the one where
matching started may have been inspected while carrying out the match.
For example, consider this pattern:

  /(?<=abc)123/

This pattern matches "123", but only if it is preceded by "abc". If the
subject string is "xyzabc12", the first two offsets after a partial
match are for the substring "abc12", because all these characters were
inspected. However, the third offset is set to 6, because that is the
offset where matching began.
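
The layout of the three offsets can be illustrated with a small C
sketch. The helper names below are hypothetical and are not part of the
PCRE API; the code only does the arithmetic on an offsets vector that a
matching function is assumed to have filled in already.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the inspected part of the subject, described
   by offsets[0] and offsets[1] after a partial match, into out.
   Returns the number of characters copied, or -1 if the offsets are
   inconsistent or out is too small. */
static int copy_inspected(const char *subject, const int *offsets,
                          char *out, size_t outsize)
{
    int len = offsets[1] - offsets[0];
    if (offsets[0] < 0 || len < 0 || (size_t)len + 1 > outsize)
        return -1;
    memcpy(out, subject + offsets[0], (size_t)len);
    out[len] = '\0';
    return len;
}

/* Hypothetical helper: how many characters before the match start were
   inspected (non-zero for lookbehinds, \b, \B and similar items). */
static int context_before_start(const int *offsets)
{
    return offsets[2] - offsets[0];
}
```

With the offsets from the example above (3, 8 and 6 for the subject
"xyzabc12"), copy_inspected() yields "abc12" and context_before_start()
yields 3, the length of the lookbehind text "abc".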

What happens when a partial match is identified depends on which of the
two partial matching options is set.

PCRE_PARTIAL_SOFT WITH pcre_exec() OR pcre[16|32]_exec()

If PCRE_PARTIAL_SOFT is set when pcre_exec() or pcre[16|32]_exec()
identifies a partial match, the partial match is remembered, but
matching continues as normal, and other alternatives in the pattern are
tried. If no complete match can be found, PCRE_ERROR_PARTIAL is
returned instead of PCRE_ERROR_NOMATCH.
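
As a rough illustration of how a caller might act on these return
values in the keystroke scenario from the introduction, the sketch
below maps a result code to a user-interface action. The enum, the
function name and the DEMO_ constants are hypothetical. The numeric
values are assumed to mirror PCRE_ERROR_NOMATCH (-1) and
PCRE_ERROR_PARTIAL (-12) as defined in pcre.h for the 8.x series; real
code should include <pcre.h> and use the library's own names.

```c
/* Assumed to match pcre.h in PCRE 8.x; use the real macros in practice. */
#define DEMO_ERROR_NOMATCH  (-1)
#define DEMO_ERROR_PARTIAL  (-12)

enum field_action {
    FIELD_COMPLETE,       /* whole field is valid                      */
    FIELD_KEEP_TYPING,    /* valid so far, more input is needed        */
    FIELD_REJECT_KEY,     /* last keystroke made the input invalid     */
    FIELD_INTERNAL_ERROR  /* any other negative code from the matcher  */
};

/* Hypothetical helper: rc is the value returned by a matching function
   that was called with PCRE_PARTIAL_SOFT set. */
static enum field_action action_for_result(int rc)
{
    if (rc >= 0)
        return FIELD_COMPLETE;
    if (rc == DEMO_ERROR_PARTIAL)
        return FIELD_KEEP_TYPING;
    if (rc == DEMO_ERROR_NOMATCH)
        return FIELD_REJECT_KEY;
    return FIELD_INTERNAL_ERROR;
}
```

A non-negative return is treated as a complete match here, which
includes a return of zero (a match for which the offsets vector was too
small to hold all the captured substrings).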

This option is "soft" because it prefers a complete match over a
partial match. All the various matching items in a pattern behave as if
the subject string is potentially complete. For example, \z, \Z, and $
match at the end of the subject, as normal, and for \b and \B the end
of the subject is treated as a non-alphanumeric.

If there is more than one partial match, the first one that was found
provides the data that is returned. Consider this pattern:

  /123\w+X|dogY/

If this is matched against the subject string "abc123dog", both
alternatives fail to match, but the end of the subject is reached
during matching, so PCRE_ERROR_PARTIAL is returned. The offsets are set
to 3 and 9, identifying "123dog" as the first partial match that was
found. (In this example, there are two partial matches, because "dog"
on its own partially matches the second alternative.)

PCRE_PARTIAL_HARD WITH pcre_exec() OR pcre[16|32]_exec()

If PCRE_PARTIAL_HARD is set for pcre_exec() or pcre[16|32]_exec(),
PCRE_ERROR_PARTIAL is returned as soon as a partial match is found,
without continuing to search for possible complete matches. This option
is "hard" because it prefers an earlier partial match over a later
complete match. For this reason, the assumption is made that the end of
the supplied subject string may not be the true end of the available
data, and so, if \z, \Z, \b, \B, or $ are encountered at the end of the
subject, the result is PCRE_ERROR_PARTIAL, provided that at least one
character in the subject has been inspected.

Setting PCRE_PARTIAL_HARD also affects the way UTF-8 and UTF-16 subject
strings are checked for validity. Normally, an invalid sequence causes
the error PCRE_ERROR_BADUTF8 or PCRE_ERROR_BADUTF16. However, in the
special case of a truncated character at the end of the subject,
PCRE_ERROR_SHORTUTF8 or PCRE_ERROR_SHORTUTF16 is returned when
PCRE_PARTIAL_HARD is set.
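
An application that feeds UTF-8 data in segments may prefer to hold
back an incomplete trailing character itself rather than wait for the
"short" error. The following is a minimal sketch of such a check,
written for this page and not taken from PCRE. The function name is
hypothetical, and the code looks only at the final character; it does
not validate the rest of the string or reject overlong encodings.

```c
#include <stddef.h>

/* Hypothetical helper: return the number of bytes at the end of s that
   form the start of a UTF-8 character whose remaining bytes are missing,
   or 0 if the buffer does not end in a truncated character. */
static size_t utf8_incomplete_tail(const unsigned char *s, size_t len)
{
    size_t back = 0;
    size_t need;
    unsigned char lead;

    /* Step back over at most three continuation bytes (10xxxxxx). */
    while (back < len && back < 3 && (s[len - 1 - back] & 0xC0) == 0x80)
        back++;
    if (back == len)
        return 0;               /* empty, or continuation bytes only */

    lead = s[len - 1 - back];
    if ((lead & 0x80) == 0x00)      need = 1;
    else if ((lead & 0xE0) == 0xC0) need = 2;
    else if ((lead & 0xF0) == 0xE0) need = 3;
    else if ((lead & 0xF8) == 0xF0) need = 4;
    else return 0;              /* not a lead byte: invalid, not short */

    /* back + 1 bytes are present, counting the lead byte itself. */
    return (back + 1 < need) ? back + 1 : 0;
}
```

A caller could pass only the first len minus the returned count bytes
to the matching function and prepend the held-back bytes to the next
segment.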

Comparing hard and soft partial matching

The difference between the two partial matching options can be
illustrated by a pattern such as:

  /dog(sbody)?/

This matches either "dog" or "dogsbody", greedily (that is, it prefers
the longer string if possible). If it is matched against the string
"dog" with PCRE_PARTIAL_SOFT, it yields a complete match for "dog".
However, if PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL.
On the other hand, if the pattern is made ungreedy the result is
different:

  /dog(sbody)??/

In this case the result is always a complete match because that is
found first, and matching never continues after finding a complete
match. It might be easier to follow this explanation by thinking of the
two patterns like this:

  /dog(sbody)?/    is the same as  /dogsbody|dog/
  /dog(sbody)??/   is the same as  /dog|dogsbody/

The second pattern will never match "dogsbody", because it will always
find the shorter match first.
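
The ordering rule can be made concrete with a toy model. The sketch
below is not PCRE code and does not use the PCRE API: it treats a
pattern as an ordered list of literal alternatives anchored at the
start of the subject, which is an assumption made purely for
illustration. All names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

enum outcome { NO_MATCH, PARTIAL, COMPLETE };
enum mode    { SOFT, HARD };

/* Toy model of a backtracking matcher: alternatives are tried in order.
   SOFT remembers a partial match and keeps looking for a complete one;
   HARD returns at the first partial match it meets. */
static enum outcome try_in_order(const char *const *alts, int n,
                                 const char *subject, enum mode mode)
{
    size_t slen = strlen(subject);
    int saw_partial = 0;

    for (int i = 0; i < n; i++) {
        size_t alen = strlen(alts[i]);
        if (alen <= slen && strncmp(subject, alts[i], alen) == 0)
            return COMPLETE;            /* first complete match wins */
        /* Partial: subject ran out part way through this alternative,
           and at least one character was inspected. */
        if (slen > 0 && slen < alen &&
            strncmp(subject, alts[i], slen) == 0) {
            if (mode == HARD)
                return PARTIAL;
            saw_partial = 1;
        }
    }
    return saw_partial ? PARTIAL : NO_MATCH;
}
```

With the alternatives in the order "dogsbody", "dog" (the greedy
pattern) and the subject "dog", the model gives COMPLETE for SOFT and
PARTIAL for HARD. With the order "dog", "dogsbody" (the ungreedy
pattern) it gives COMPLETE in both modes, as described above.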

PARTIAL MATCHING USING pcre_dfa_exec() OR pcre[16|32]_dfa_exec()

The DFA functions move along the subject string character by character,
without backtracking, searching for all possible matches
simultaneously. If the end of the subject is reached before the end of
the pattern, there is the possibility of a partial match, again
provided that at least one character has been inspected.

When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if
there have been no complete matches. Otherwise, the complete matches
are returned. However, if PCRE_PARTIAL_HARD is set, a partial match
takes precedence over any complete matches. The portion of the string
that was inspected when the longest partial match was found is set as
the first matching string, provided there are at least two slots in the
offsets vector.

Because the DFA functions always search for all possible matches, and
there is no difference between greedy and ungreedy repetition, their
behaviour is different from the standard functions when
PCRE_PARTIAL_HARD is set. Consider the string "dog" matched against the
ungreedy pattern shown above:

  /dog(sbody)??/

Whereas the standard functions stop as soon as they find the complete
match for "dog", the DFA functions also find the partial match for
"dogsbody", and so return that when PCRE_PARTIAL_HARD is set.
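
The same behaviour can be shown with a toy model in which every
alternative is examined before a result is chosen. This is a sketch
written for illustration only, not PCRE code: the pattern is assumed to
be a set of literal alternatives anchored at the start of the subject,
and all names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

enum outcome { NO_MATCH, PARTIAL, COMPLETE };
enum mode    { SOFT, HARD };

/* Toy model of a DFA-style matcher: the order of the alternatives is
   irrelevant because all of them are followed at the same time. */
static enum outcome try_all(const char *const *alts, int n,
                            const char *subject, enum mode mode)
{
    size_t slen = strlen(subject);
    int complete = 0, partial = 0;

    for (int i = 0; i < n; i++) {
        size_t alen = strlen(alts[i]);
        if (alen <= slen && strncmp(subject, alts[i], alen) == 0)
            complete = 1;
        else if (slen > 0 && slen < alen &&
                 strncmp(subject, alts[i], slen) == 0)
            partial = 1;
    }
    if (mode == HARD && partial)
        return PARTIAL;         /* partial beats any complete match */
    if (complete)
        return COMPLETE;
    return partial ? PARTIAL : NO_MATCH;
}
```

For the alternatives "dog" and "dogsbody" with the subject "dog", the
model gives PARTIAL in HARD mode whichever way round the alternatives
are listed, and COMPLETE in SOFT mode.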

PARTIAL MATCHING AND WORD BOUNDARIES

If a pattern ends with one of the sequences \b or \B, which test for
word boundaries, partial matching with PCRE_PARTIAL_SOFT can give
counter-intuitive results. Consider this pattern:

  /\bcat\b/

This matches "cat", provided there is a word boundary at each end. If
the subject string is "the cat", the comparison of the final "t" with a
following character cannot take place, so a partial match is found.
However, normal matching carries on, and \b matches at the end of the
subject when the last character is a letter, so a complete match is
found. The result, therefore, is not PCRE_ERROR_PARTIAL. Using
PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL, because
then the partial match takes precedence.

FORMERLY RESTRICTED PATTERNS

For releases of PCRE prior to 8.00, because of the way certain internal
optimizations were implemented in the pcre_exec() function, the
PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be
used with all patterns. From release 8.00 onwards, the restrictions no
longer apply, and partial matching can be requested for any pattern.

Items that were formerly restricted were repeated single characters and
repeated metasequences. If PCRE_PARTIAL was set for a pattern that did
not conform to the restrictions, pcre_exec() returned the error code
PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The
PCRE_INFO_OKPARTIAL call to pcre_fullinfo() to find out if a compiled
pattern can be used for partial matching now always returns 1.

PARTIAL MATCHING AND PCRETEST

If the escape sequence \P is present in a pcretest data line, the
PCRE_PARTIAL_SOFT option is used for the match. Here is a run of
pcretest that uses the date example quoted above:

    re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
  data> 25jun04\P
   0: 25jun04
   1: jun
  data> 25dec3\P
  Partial match: 25dec3
  data> 3ju\P
  Partial match: 3ju
  data> 3juj\P
  No match
  data> j\P
  No match

The first data string is matched completely, so pcretest shows the
matched substrings. The remaining four strings do not match the
complete pattern, but the first two are partial matches. Similar output
is obtained if DFA matching is used.

If the escape sequence \P is present more than once in a pcretest data
line, the PCRE_PARTIAL_HARD option is set for the match.

MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() OR pcre[16|32]_dfa_exec()

When a partial match has been found using a DFA matching function, it
is possible to continue the match by providing additional subject data
and calling the function again with the same compiled regular
expression, this time setting the PCRE_DFA_RESTART option. You must
pass the same working space as before, because this is where details of
the previous partial match are stored. Here is an example using
pcretest, using the \R escape sequence to set the PCRE_DFA_RESTART
option (\D specifies the use of the DFA matching function):

    re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
  data> 23ja\P\D
  Partial match: 23ja
  data> n05\R\D
   0: n05

The first call has "23ja" as the subject, and requests partial
matching; the second call has "n05" as the subject for the continued
(restarted) match. Notice that when the match is complete, only the
last part is shown; PCRE does not retain the previously
partially-matched string. It is up to the calling program to do that if
it needs to.

That means that, for an unanchored pattern, if a continued match fails,
it is not possible to try again at a new starting point. All this
facility is capable of doing is continuing with the previous match
attempt. In the previous example, if the second set of data is "ug23"
the result is no match, even though there would be a match for "aug23"
if the entire string were given at once. Depending on the application,
this may or may not be what you want. The only way to allow for
starting again at the next character is to retain the matched part of
the subject and try a new complete match.

You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
PCRE_DFA_RESTART to continue partial matching over multiple segments.
This facility can be used to pass very long subject strings to the DFA
matching functions.

MULTI-SEGMENT MATCHING WITH pcre_exec() OR pcre[16|32]_exec()

From release 8.00, the standard matching functions can also be used to
do multi-segment matching. Unlike the DFA functions, it is not possible
to restart the previous match with a new segment of data. Instead, new
data must be added to the previous subject string, and the entire match
re-run, starting from the point where the partial match occurred.
Earlier data can be discarded.

It is best to use PCRE_PARTIAL_HARD in this situation, because it does
not treat the end of a segment as the end of the subject when matching
\z, \Z, \b, \B, and $. Consider an unanchored pattern that matches
dates:

    re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
  data> The date is 23ja\P\P
  Partial match: 23ja

At this stage, an application could discard the text preceding "23ja",
add on text from the next segment, and call the matching function
again. Unlike the DFA matching functions, the entire matching string
must always be available, and the complete matching process occurs for
each call, so more memory and more processing time are needed.
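
The discard-and-append step might look like the following sketch. The
function name and its calling convention are hypothetical and are not
part of PCRE; keep_from is assumed to be the first offset returned for
the partial match, and the buffer is assumed to be a NUL-terminated C
string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: drop everything in buf before keep_from, then
   append the next segment. cap is the total size of buf in bytes.
   Returns the new length, or -1 if keep_from is out of range or the
   result would not fit (buf is left unchanged in that case). */
static int retain_and_append(char *buf, size_t cap, int keep_from,
                             const char *next)
{
    size_t len = strlen(buf);
    size_t kept, add;

    if (keep_from < 0 || (size_t)keep_from > len)
        return -1;
    kept = len - (size_t)keep_from;
    add = strlen(next);
    if (kept + add + 1 > cap)
        return -1;

    memmove(buf, buf + keep_from, kept);   /* regions may overlap */
    memcpy(buf + kept, next, add + 1);     /* copies the final NUL */
    return (int)(kept + add);
}
```

For the example above, a buffer holding "The date is 23ja" with
keep_from set to 12 and a next segment of "n05" becomes "23jan05",
which can then be passed to the matching function as a new subject.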

Note: If the pattern contains lookbehind assertions, or \K, or starts
with \b or \B, the string that is returned for a partial match includes
characters that precede the start of what would be returned for a
complete match, because it contains all the characters that were
inspected during the partial match.

ISSUES WITH MULTI-SEGMENT MATCHING

Certain types of pattern may give problems with multi-segment matching,
whichever matching function is used.

1. If the pattern contains a test for the beginning of a line, you need
to pass the PCRE_NOTBOL option when the subject string for any call
does not start at the beginning of a line. There is also a PCRE_NOTEOL
option, but in practice when doing multi-segment matching you should be
using PCRE_PARTIAL_HARD, which includes the effect of PCRE_NOTEOL.

2. Lookbehind assertions that have already been obeyed are catered for
in the offsets that are returned for a partial match. However, a
lookbehind assertion later in the pattern could require even earlier
characters to be inspected. You can handle this case by using the
PCRE_INFO_MAXLOOKBEHIND option of the pcre_fullinfo() or
pcre[16|32]_fullinfo() functions to obtain the length of the longest
lookbehind in the pattern. This length is given in characters, not
bytes. If you always retain at least that many characters before the
partially matched string, all should be well. (Of course, near the
start of the subject, fewer characters may be present; in that case all
characters should be retained.)

From release 8.33, there is a more accurate way of deciding which
characters to retain. Instead of subtracting the length of the longest
lookbehind from the offset of the earliest inspected character
(offsets[0]), subtract it from the match start position (offsets[2]).
The next match attempt should then be started at the offsets[2]
character, by setting the startoffset argument of pcre_exec() or
pcre_dfa_exec().

For example, if the pattern "(?<=123)abc" is partially matched against
the string "xx123a", the three offset values returned are 2, 6, and 5.
This indicates that the matching process that gave a partial match
started at offset 5, but the characters "123a" were all inspected. The
maximum lookbehind for that pattern is 3, so taking that away from 5
gives 2, which shows that we need only keep "123a". When further
characters have been added, the next match attempt can be started at
offset 3 of the retained string (that is, at "a"). When the match start
is not the earliest inspected character, pcretest shows it explicitly:

    re> "(?<=123)abc"
  data> xx123a\P\P
  Partial match at offset 5: 123a

3. Because a partial match must always contain at least one character,
what might be considered a partial match of an empty string actually
gives a "no match" result. For example:

    re> /c(?<=abc)x/
  data> ab\P
  No match

If the next segment begins "cx", a match should be found, but this will
only happen if characters from the previous segment are retained. For
this reason, a "no match" result should be interpreted as "partial
match of an empty string" when the pattern contains lookbehinds.

4. Matching a subject string that is split into multiple segments may
not always produce exactly the same result as matching over one single
long string, especially when PCRE_PARTIAL_SOFT is used. The section
"Partial Matching and Word Boundaries" above describes an issue that
arises if the pattern ends with \b or \B. Another kind of difference
may occur when there are multiple matching possibilities, because (for
PCRE_PARTIAL_SOFT) a partial match result is given only when there are
no completed matches. This means that as soon as the shortest match has
been found, continuation to a new subject segment is no longer
possible. Consider again this pcretest example:

    re> /dog(sbody)?/
  data> dogsb\P
   0: dog
  data> do\P\D
  Partial match: do
  data> gsb\R\P\D
   0: g
  data> dogsbody\D
   0: dogsbody
   1: dog

The first data line passes the string "dogsb" to a standard matching
function, setting the PCRE_PARTIAL_SOFT option. Although the string is
a partial match for "dogsbody", the result is not PCRE_ERROR_PARTIAL,
because the shorter string "dog" is a complete match. Similarly, when
the subject is presented to a DFA matching function in several parts
("do" and "gsb" being the first two) the match stops when "dog" has
been found, and it is not possible to continue. On the other hand, if
"dogsbody" is presented as a single string, a DFA matching function
finds both matches.

Because of these problems, it is best to use PCRE_PARTIAL_HARD when
matching multi-segment data. The example above then behaves
differently:

    re> /dog(sbody)?/
  data> dogsb\P\P
  Partial match: dogsb
  data> do\P\D
  Partial match: do
  data> gsb\R\P\P\D
  Partial match: gsb

5. Patterns that contain alternatives at the top level which do not all
start with the same pattern item may not work as expected when
PCRE_DFA_RESTART is used. For example, consider this pattern:

  1234|3789

If the first part of the subject is "ABC123", a partial match of the
first alternative is found at offset 3. There is no partial match for
the second alternative, because such a match does not start at the same
point in the subject string. Attempting to continue with the string
"7890" does not yield a match because only those alternatives that
match at one point in the subject are remembered. The problem arises
because the start of the second alternative matches within the first
alternative. There is no problem with anchored patterns or patterns
such as:

  1234|ABCD

where no string can be a partial match for both alternatives. This is
not a problem if a standard matching function is used, because the
entire match has to be rerun each time:

    re> /1234|3789/
  data> ABC123\P\P
  Partial match: 123
  data> 1237890
   0: 3789

Of course, instead of using PCRE_DFA_RESTART, the same technique of
re-running the entire match can also be used with the DFA matching
functions. Another possibility is to work with two buffers. If a
partial match at offset n in the first buffer is followed by "no match"
when PCRE_DFA_RESTART is used on the second buffer, you can then try a
new match starting at offset n+1 in the first buffer.
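
The retention rule described in item 2 above comes down to a small
piece of arithmetic, sketched below. The struct and function names are
hypothetical and are not part of PCRE. The sketch assumes that one
character occupies one data unit, so that the lookbehind length (which
is reported in characters) can be subtracted directly from an offset;
with UTF-8 or UTF-16 subjects the caller would have to step back by
characters instead.

```c
/* Hypothetical result type: where to cut the old subject, and where the
   next match attempt should start within the retained text. */
struct retention {
    int keep_from;         /* first offset of the old subject to keep    */
    int next_startoffset;  /* startoffset to use on the retained string  */
};

/* match_start is offsets[2] from the partial match; max_lookbehind is
   the value obtained with PCRE_INFO_MAXLOOKBEHIND. */
static struct retention plan_retention(int match_start, int max_lookbehind)
{
    struct retention r;

    /* Near the start of the subject there may be fewer characters than
       the longest lookbehind; keep everything in that case. */
    r.keep_from = (match_start > max_lookbehind)
                      ? match_start - max_lookbehind
                      : 0;
    r.next_startoffset = match_start - r.keep_from;
    return r;
}
```

For the "xx123a" example in item 2, plan_retention(5, 3) gives a
keep_from of 2 (so "123a" is retained) and a next_startoffset of 3.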

AUTHOR

Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.

REVISION

Last updated: 02 July 2013
Copyright (c) 1997-2013 University of Cambridge.

PCRE 8.34                     02 July 2013                PCREPARTIAL(3)