PCREPARTIAL(3)          Library Functions Manual          PCREPARTIAL(3)

PCRE - Perl-compatible regular expressions

In normal use of PCRE, if the subject string that is passed to
pcre_exec() or pcre_dfa_exec() matches as far as it goes, but is too
short to match the entire pattern, PCRE_ERROR_NOMATCH is returned.
There are circumstances where it might be helpful to distinguish this
case from other cases in which there is no match.

Consider, for example, an application where a human is required to type
in data for a field with specific formatting requirements. An example
might be a date in the form ddmmmyy, defined by this pattern:

  ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$

If the application sees the user's keystrokes one by one, and can check
that what has been typed so far is potentially valid, it is able to
raise an error as soon as a mistake is made, by beeping and not
reflecting the character that has been typed, for example. This
immediate feedback is likely to be a better user interface than a check
that is delayed until the entire string has been entered. Partial
matching can also sometimes be useful when the subject string is very
long and is not all available at once.

PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
PCRE_PARTIAL_HARD options, which can be set when calling pcre_exec() or
pcre_dfa_exec(). For backwards compatibility, PCRE_PARTIAL is a synonym
for PCRE_PARTIAL_SOFT. The essential difference between the two options
is whether or not a partial match is preferred to an alternative
complete match, though the details differ between the two matching
functions. If both options are set, PCRE_PARTIAL_HARD takes precedence.

Setting a partial matching option disables two of PCRE's optimizations.
PCRE remembers the last literal byte in a pattern, and abandons
matching immediately if such a byte is not present in the subject
string. This optimization cannot be used for a subject string that
might match only partially. If the pattern was studied, PCRE knows the
minimum length of a matching string, and does not bother to run the
matching function on shorter strings. This optimization is also
disabled for partial matching.

A partial match occurs during a call to pcre_exec() whenever the end of
the subject string is reached successfully, but matching cannot
continue because more characters are needed. However, at least one
character must have been matched. (In other words, a partial match can
never be an empty string.)

If PCRE_PARTIAL_SOFT is set, the partial match is remembered, but
matching continues as normal, and other alternatives in the pattern are
tried. If no complete match can be found, pcre_exec() returns
PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. If there are at least
two slots in the offsets vector, the first of them is set to the offset
of the earliest character that was inspected when the partial match was
found. For convenience, the second offset points to the end of the
string so that a substring can easily be identified.

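As an illustration of how an application such as the keystroke checker
described earlier might act on these return values, here is a minimal
sketch in C. The input_state type and the classify_result() function
are hypothetical names invented for this example. The two error codes
are repeated from pcre.h (with the values -1 and -12) only so that the
fragment compiles without the library; a real program would include
<pcre.h> instead.

```c
/* Normally supplied by <pcre.h>; repeated here only so that this
   sketch compiles on its own. */
#ifndef PCRE_ERROR_NOMATCH
#define PCRE_ERROR_NOMATCH (-1)
#endif
#ifndef PCRE_ERROR_PARTIAL
#define PCRE_ERROR_PARTIAL (-12)
#endif

/* Hypothetical application-level verdicts; not part of PCRE. */
typedef enum {
    INPUT_COMPLETE,    /* the whole pattern matched */
    INPUT_INCOMPLETE,  /* valid so far, but more characters are needed */
    INPUT_INVALID,     /* no match and no partial match */
    INPUT_ERROR        /* any other negative return code */
} input_state;

/* Map the return value of pcre_exec() or pcre_dfa_exec() to a verdict.
   A return value of zero or more means that a match was found. */
input_state classify_result(int rc)
{
    if (rc >= 0) return INPUT_COMPLETE;
    if (rc == PCRE_ERROR_PARTIAL) return INPUT_INCOMPLETE;
    if (rc == PCRE_ERROR_NOMATCH) return INPUT_INVALID;
    return INPUT_ERROR;
}
```

With a mapping of this kind, the application would accept a keystroke
on INPUT_COMPLETE or INPUT_INCOMPLETE and reject it on INPUT_INVALID.
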
For the majority of patterns, the first offset identifies the start of
the partially matched string. However, for patterns that contain
lookbehind assertions, or \K, or begin with \b or \B, earlier
characters have been inspected while carrying out the match. For
example:

  /(?<=abc)123/

This pattern matches "123", but only if it is preceded by "abc". If the
subject string is "xyzabc12", the offsets after a partial match are for
the substring "abc12", because all these characters are needed if
another match is tried with extra characters added.

If there is more than one partial match, the first one that was found
provides the data that is returned. Consider this pattern:

  /123\w+X|dogY/

If this is matched against the subject string "abc123dog", both
alternatives fail to match, but the end of the subject is reached
during matching, so PCRE_ERROR_PARTIAL is returned instead of
PCRE_ERROR_NOMATCH. The offsets are set to 3 and 9, identifying
"123dog" as the first partial match that was found. (In this example,
there are two partial matches, because "dog" on its own partially
matches the second alternative.)

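The way the two offsets identify a substring can be sketched as
follows. The copy_partial() function is a hypothetical helper, not part
of PCRE; it assumes that ovector holds the first two offsets set by a
call that returned PCRE_ERROR_PARTIAL.

```c
#include <stddef.h>
#include <string.h>

/* Copy the text identified by the first two offsets of a partial match
   into out, which has room for out_size bytes including the
   terminating NUL. Returns the number of characters copied, or -1 if
   the offsets are inconsistent or the text does not fit. */
int copy_partial(const char *subject, const int *ovector,
                 char *out, size_t out_size)
{
    int start = ovector[0];   /* earliest character inspected */
    int end = ovector[1];     /* end of the subject string */
    size_t len;

    if (start < 0 || end < start) return -1;
    len = (size_t)(end - start);
    if (len + 1 > out_size) return -1;
    memcpy(out, subject + start, len);
    out[len] = '\0';
    return (int)len;
}
```

For the subject "abc123dog" and the offsets 3 and 9, this yields
"123dog". For the lookbehind example, the substring "abc12" of
"xyzabc12" corresponds to the offsets 3 and 8.
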
If PCRE_PARTIAL_HARD is set for pcre_exec(), it returns
PCRE_ERROR_PARTIAL as soon as a partial match is found, without
continuing to search for possible complete matches. The difference
between the two options can be illustrated by a pattern such as:

  /dog(sbody)?/

This matches either "dog" or "dogsbody", greedily (that is, it prefers
the longer string if possible). If it is matched against the string
"dog" with PCRE_PARTIAL_SOFT, it yields a complete match for "dog".
However, if PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL.
On the other hand, if the pattern is made ungreedy the result is
different:

  /dog(sbody)??/

In this case the result is always a complete match because pcre_exec()
finds that first, and it never continues after finding a match. It
might be easier to follow this explanation by thinking of the two
patterns like this:

  /dog(sbody)?/    is the same as  /dogsbody|dog/
  /dog(sbody)??/   is the same as  /dog|dogsbody/

The second pattern will never match "dogsbody" when pcre_exec() is
used, because it will always find the shorter match first.

The pcre_dfa_exec() function moves along the subject string character
by character, without backtracking, searching for all possible matches
simultaneously. If the end of the subject is reached before the end of
the pattern, there is the possibility of a partial match, again
provided that at least one character has matched.

When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if
there have been no complete matches. Otherwise, the complete matches
are returned. However, if PCRE_PARTIAL_HARD is set, a partial match
takes precedence over any complete matches. The portion of the string
that was inspected when the longest partial match was found is set as
the first matching string, provided there are at least two slots in the
offsets vector.

Because pcre_dfa_exec() always searches for all possible matches, and
there is no difference between greedy and ungreedy repetition, its
behaviour differs from that of pcre_exec() when PCRE_PARTIAL_HARD is
set. Consider the string "dog" matched against the ungreedy pattern
shown above:

  /dog(sbody)??/

Whereas pcre_exec() stops as soon as it finds the complete match for
"dog", pcre_dfa_exec() also finds the partial match for "dogsbody", and
so returns that when PCRE_PARTIAL_HARD is set.

If a pattern ends with one of the sequences \b or \B, which test for
word boundaries, partial matching with PCRE_PARTIAL_SOFT can give
counter-intuitive results. Consider this pattern:

  /\bcat\b/

This matches "cat", provided there is a word boundary at either end. If
the subject string is "the cat", the comparison of the final "t" with a
following character cannot take place, so a partial match is found.
However, pcre_exec() carries on with normal matching, which matches \b
at the end of the subject when the last character is a letter, thus
finding a complete match. The result, therefore, is not
PCRE_ERROR_PARTIAL. The same thing happens with pcre_dfa_exec(),
because it also finds the complete match.

Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL,
because then the partial match takes precedence.

For releases of PCRE prior to 8.00, because of the way certain internal
optimizations were implemented in the pcre_exec() function, the
PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be
used with all patterns. From release 8.00 onwards, the restrictions no
longer apply, and partial matching with pcre_exec() can be requested
for any pattern.

Items that were formerly restricted were repeated single characters and
repeated metasequences. If PCRE_PARTIAL was set for a pattern that did
not conform to the restrictions, pcre_exec() returned the error code
PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The
PCRE_INFO_OKPARTIAL call to pcre_fullinfo() to find out if a compiled
pattern can be used for partial matching now always returns 1.

If the escape sequence \P is present in a pcretest data line, the
PCRE_PARTIAL_SOFT option is used for the match. Here is a run of
pcretest that uses the date example quoted above:

    re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
  data> 25jun04\P
   0: 25jun04
   1: jun
  data> 25dec3\P
  Partial match: 25dec3
  data> 3ju\P
  Partial match: 3ju
  data> 3juj\P
  No match
  data> j\P
  No match

The first data string is matched completely, so pcretest shows the
matched substrings. The remaining four strings do not match the
complete pattern, but the first two are partial matches. Similar output
is obtained when pcre_dfa_exec() is used.

If the escape sequence \P is present more than once in a pcretest data
line, the PCRE_PARTIAL_HARD option is set for the match.

When a partial match has been found using pcre_dfa_exec(), it is
possible to continue the match by providing additional subject data and
calling pcre_dfa_exec() again with the same compiled regular
expression, this time setting the PCRE_DFA_RESTART option. You must
pass the same working space as before, because this is where details of
the previous partial match are stored. Here is an example using
pcretest, using the \R escape sequence to set the PCRE_DFA_RESTART
option (\D specifies the use of pcre_dfa_exec()):

    re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
  data> 23ja\P\D
  Partial match: 23ja
  data> n05\R\D
   0: n05

The first call has "23ja" as the subject, and requests partial
matching; the second call has "n05" as the subject for the continued
(restarted) match. Notice that when the match is complete, only the
last part is shown; PCRE does not retain the previously
partially-matched string. It is up to the calling program to do that if
it needs to.

You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
PCRE_DFA_RESTART to continue partial matching over multiple segments.
This facility can be used to pass very long subject strings to
pcre_dfa_exec().

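One way an application might choose the options for each call when
feeding segments to pcre_dfa_exec() is sketched below.
dfa_segment_options() is a hypothetical helper, not part of PCRE. The
fallback option values are assumptions about what the 8.x pcre.h
defines and are present only so that the fragment compiles on its own;
a real program would include <pcre.h>. The sketch uses
PCRE_PARTIAL_HARD, but PCRE_PARTIAL_SOFT could be substituted.

```c
/* Normally supplied by <pcre.h>; assumed fallback values. */
#ifndef PCRE_PARTIAL_HARD
#define PCRE_PARTIAL_HARD 0x08000000
#endif
#ifndef PCRE_DFA_RESTART
#define PCRE_DFA_RESTART  0x00020000
#endif

/* Hypothetical helper: choose the options for one call to
   pcre_dfa_exec() when the subject arrives in segments.
   resuming      - nonzero if the previous call returned
                   PCRE_ERROR_PARTIAL and is being continued
   more_expected - nonzero if further segments may follow this one */
int dfa_segment_options(int resuming, int more_expected)
{
    int options = 0;
    if (resuming) options |= PCRE_DFA_RESTART;        /* continue match */
    if (more_expected) options |= PCRE_PARTIAL_HARD;  /* allow partial */
    return options;
}
```

In terms of the pcretest run above, the first data line corresponds to
a call that is not resuming but expects more data, and the second to a
call that is resuming and expects no more.
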
From release 8.00, pcre_exec() can also be used to do multi-segment
matching. Unlike pcre_dfa_exec(), however, it cannot restart the
previous match with a new segment of data. Instead, new data must be
added to the previous subject string, and the entire match re-run,
starting from the point where the partial match occurred. Earlier data
can be discarded. Consider an unanchored pattern that matches dates:

    re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
  data> The date is 23ja\P
  Partial match: 23ja

At this stage, an application could discard the text preceding "23ja",
add on text from the next segment, and call pcre_exec() again. Unlike
with pcre_dfa_exec(), the entire matching string must always be
available, and the complete matching process occurs for each call, so
more memory and more processing time are needed.

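The buffer handling described above might look something like the
following sketch. carry_over() is a hypothetical helper, not part of
PCRE; it assumes that keep_from is the first offset set by the call
that returned PCRE_ERROR_PARTIAL, and that the caller then runs
pcre_exec() again on the updated buffer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: keep only the text from keep_from to the end of
   buf, then append the next segment. buf currently holds *len
   characters and has room for cap characters. Returns 0 on success, or
   -1 if the arguments are inconsistent or the combined text does not
   fit. */
int carry_over(char *buf, size_t *len, size_t cap, size_t keep_from,
               const char *next, size_t next_len)
{
    size_t kept;

    if (*len > cap || keep_from > *len) return -1;
    kept = *len - keep_from;
    if (next_len > cap - kept) return -1;
    memmove(buf, buf + keep_from, kept);   /* discard the earlier data */
    memcpy(buf + kept, next, next_len);    /* add on the new segment */
    *len = kept + next_len;
    return 0;
}
```

For the example above, the partial match "23ja" starts at offset 12 of
"The date is 23ja", so calling carry_over() with keep_from set to 12
and a next segment of "n05 was" leaves "23jan05 was" in the buffer.
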
Note: If the pattern contains lookbehind assertions, or \K, or starts
with \b or \B, the string that is returned for a partial match will
include characters that precede the partially matched string itself,
because these must be retained when adding on more characters for a
subsequent matching attempt.

Certain types of pattern may give problems with multi-segment matching,
whichever matching function is used.

1. If the pattern contains tests for the beginning or end of a line,
you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as
appropriate, when the subject string for any call does not contain the
beginning or end of a line.

2. Lookbehind assertions at the start of a pattern are catered for in
the offsets that are returned for a partial match. However, in theory,
a lookbehind assertion later in the pattern could require even earlier
characters to be inspected, and it might not have been reached when a
partial match occurs. This is probably an extremely unlikely case; you
could guard against it to a certain extent by always including extra
characters at the start.

3. Matching a subject string that is split into multiple segments may
not always produce exactly the same result as matching over one single
long string, especially when PCRE_PARTIAL_SOFT is used. The earlier
discussion of partial matching and word boundaries describes an issue
that arises if the pattern ends with \b or \B. Another kind of
difference may occur when there are multiple matching possibilities,
because a partial match result is given only when there are no
completed matches. This means that as soon as the shortest match has
been found, continuation to a new subject segment is no longer
possible. Consider again this pcretest example:

    re> /dog(sbody)?/
  data> dogsb\P
   0: dog
  data> do\P\D
  Partial match: do
  data> gsb\R\P\D
   0: g
  data> dogsbody\D
   0: dogsbody
   1: dog

The first data line passes the string "dogsb" to pcre_exec(), setting
the PCRE_PARTIAL_SOFT option. Although the string is a partial match
for "dogsbody", the result is not PCRE_ERROR_PARTIAL, because the
shorter string "dog" is a complete match. Similarly, when the subject
is presented to pcre_dfa_exec() in several parts ("do" and "gsb" being
the first two) the match stops when "dog" has been found, and it is not
possible to continue. On the other hand, if "dogsbody" is presented as
a single string, pcre_dfa_exec() finds both matches.

Because of these problems, it is probably best to use PCRE_PARTIAL_HARD
when matching multi-segment data. The example above then behaves
differently:

    re> /dog(sbody)?/
  data> dogsb\P\P
  Partial match: dogsb
  data> do\P\D
  Partial match: do
  data> gsb\R\P\P\D
  Partial match: gsb

4. Patterns that contain alternatives at the top level which do not all
start with the same pattern item may not work as expected when
PCRE_DFA_RESTART is used with pcre_dfa_exec(). For example, consider
this pattern:

  1234|3789

If the first part of the subject is "ABC123", a partial match of the
first alternative is found at offset 3. There is no partial match for
the second alternative, because such a match does not start at the same
point in the subject string. Attempting to continue with the string
"7890" does not yield a match because only those alternatives that
match at one point in the subject are remembered. The problem arises
because the start of the second alternative matches within the first
alternative. There is no problem with anchored patterns or patterns
such as:

  1234|ABCD

where no string can be a partial match for both alternatives. This is
not a problem if pcre_exec() is used, because the entire match has to
be rerun each time:

    re> /1234|3789/
  data> ABC123\P
  Partial match: 123
  data> 1237890
   0: 3789

Of course, instead of using PCRE_DFA_RESTART, the same technique of
re-running the entire match can also be used with pcre_dfa_exec().
Another possibility is to work with two buffers. If a partial match at
offset n in the first buffer is followed by "no match" when
PCRE_DFA_RESTART is used on the second buffer, you can then try a new
match starting at offset n+1 in the first buffer.

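The first issue in the list above, line boundaries within segments,
lends itself to a small sketch. line_boundary_options() is a
hypothetical helper, not part of PCRE, and the fallback option values
are assumptions about what pcre.h defines, present only so that the
fragment compiles on its own; a real program would include <pcre.h>.

```c
/* Normally supplied by <pcre.h>; assumed fallback values. */
#ifndef PCRE_NOTBOL
#define PCRE_NOTBOL 0x00000080
#endif
#ifndef PCRE_NOTEOL
#define PCRE_NOTEOL 0x00000100
#endif

/* Hypothetical helper: choose the line-boundary options for one
   segment of a longer subject.
   starts_line - nonzero if the segment begins at the start of a line
   ends_line   - nonzero if the segment stops at the end of a line */
int line_boundary_options(int starts_line, int ends_line)
{
    int options = 0;
    if (!starts_line) options |= PCRE_NOTBOL;  /* ^ must not match here */
    if (!ends_line) options |= PCRE_NOTEOL;    /* $ must not match here */
    return options;
}
```

The result would be combined, using bitwise OR, with whichever partial
matching options are passed to the matching function for that segment.
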
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.

Last updated: 19 October 2009
Copyright (c) 1997-2009 University of Cambridge.