1PCREAPI(3) Library Functions Manual PCREAPI(3)
2
3
4
6 PCRE - Perl-compatible regular expressions
7
9
10 #include <pcre.h>
11
12 pcre *pcre_compile(const char *pattern, int options,
13 const char **errptr, int *erroffset,
14 const unsigned char *tableptr);
15
16 pcre *pcre_compile2(const char *pattern, int options,
17 int *errorcodeptr,
18 const char **errptr, int *erroffset,
19 const unsigned char *tableptr);
20
21 pcre_extra *pcre_study(const pcre *code, int options,
22 const char **errptr);
23
24 int pcre_exec(const pcre *code, const pcre_extra *extra,
25 const char *subject, int length, int startoffset,
26 int options, int *ovector, int ovecsize);
27
28 int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
29 const char *subject, int length, int startoffset,
30 int options, int *ovector, int ovecsize,
31 int *workspace, int wscount);
32
33 int pcre_copy_named_substring(const pcre *code,
34 const char *subject, int *ovector,
35 int stringcount, const char *stringname,
36 char *buffer, int buffersize);
37
38 int pcre_copy_substring(const char *subject, int *ovector,
39 int stringcount, int stringnumber, char *buffer,
40 int buffersize);
41
42 int pcre_get_named_substring(const pcre *code,
43 const char *subject, int *ovector,
44 int stringcount, const char *stringname,
45 const char **stringptr);
46
47 int pcre_get_stringnumber(const pcre *code,
48 const char *name);
49
50 int pcre_get_stringtable_entries(const pcre *code,
51 const char *name, char **first, char **last);
52
53 int pcre_get_substring(const char *subject, int *ovector,
54 int stringcount, int stringnumber,
55 const char **stringptr);
56
57 int pcre_get_substring_list(const char *subject,
58 int *ovector, int stringcount, const char ***listptr);
59
60 void pcre_free_substring(const char *stringptr);
61
62 void pcre_free_substring_list(const char **stringptr);
63
64 const unsigned char *pcre_maketables(void);
65
66 int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
67 int what, void *where);
68
69 int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
70
71 int pcre_refcount(pcre *code, int adjust);
72
73 int pcre_config(int what, void *where);
74
75 char *pcre_version(void);
76
77 void *(*pcre_malloc)(size_t);
78
79 void (*pcre_free)(void *);
80
81 void *(*pcre_stack_malloc)(size_t);
82
83 void (*pcre_stack_free)(void *);
84
85 int (*pcre_callout)(pcre_callout_block *);
86
88
89 PCRE has its own native API, which is described in this document. There
90 are also some wrapper functions that correspond to the POSIX regular
91 expression API. These are described in the pcreposix documentation.
92 Both of these APIs define a set of C function calls. A C++ wrapper is
93 distributed with PCRE. It is documented in the pcrecpp page.
94
95 The native API C function prototypes are defined in the header file
96 pcre.h, and on Unix systems the library itself is called libpcre. It
97 can normally be accessed by adding -lpcre to the command for linking an
98 application that uses PCRE. The header file defines the macros
99 PCRE_MAJOR and PCRE_MINOR to contain the major and minor release num‐
100 bers for the library. Applications can use these to include support
101 for different releases of PCRE.
102
103 The functions pcre_compile(), pcre_compile2(), pcre_study(), and
104 pcre_exec() are used for compiling and matching regular expressions in
105 a Perl-compatible manner. A sample program that demonstrates the sim‐
106 plest way of using them is provided in the file called pcredemo.c in
107 the source distribution. The pcresample documentation describes how to
108 compile and run it.
109
110 A second matching function, pcre_dfa_exec(), which is not Perl-compati‐
111 ble, is also provided. This uses a different algorithm for the match‐
112 ing. The alternative algorithm finds all possible matches (at a given
113 point in the subject), and scans the subject just once. However, this
114 algorithm does not return captured substrings. A description of the two
115 matching algorithms and their advantages and disadvantages is given in
116 the pcrematching documentation.
117
118 In addition to the main compiling and matching functions, there are
119 convenience functions for extracting captured substrings from a subject
120 string that is matched by pcre_exec(). They are:
121
122 pcre_copy_substring()
123 pcre_copy_named_substring()
124 pcre_get_substring()
125 pcre_get_named_substring()
126 pcre_get_substring_list()
127 pcre_get_stringnumber()
128 pcre_get_stringtable_entries()
129
130 pcre_free_substring() and pcre_free_substring_list() are also provided,
131 to free the memory used for extracted strings.
132
133 The function pcre_maketables() is used to build a set of character
134 tables in the current locale for passing to pcre_compile(),
135 pcre_exec(), or pcre_dfa_exec(). This is an optional facility that is
136 provided for specialist use. Most commonly, no special tables are
137 passed, in which case internal tables that are generated when PCRE is
138 built are used.
139
140 The function pcre_fullinfo() is used to find out information about a
141 compiled pattern; pcre_info() is an obsolete version that returns only
142 some of the available information, but is retained for backwards com‐
143 patibility. The function pcre_version() returns a pointer to a string
144 containing the version of PCRE and its date of release.
145
146 The function pcre_refcount() maintains a reference count in a data
147 block containing a compiled pattern. This is provided for the benefit
148 of object-oriented applications.
149
150 The global variables pcre_malloc and pcre_free initially contain the
151 entry points of the standard malloc() and free() functions, respec‐
152 tively. PCRE calls the memory management functions via these variables,
153 so a calling program can replace them if it wishes to intercept the
154 calls. This should be done before calling any PCRE functions.
155
156 The global variables pcre_stack_malloc and pcre_stack_free are also
157 indirections to memory management functions. These special functions
158 are used only when PCRE is compiled to use the heap for remembering
159 data, instead of recursive function calls, when running the pcre_exec()
160 function. See the pcrebuild documentation for details of how to do
161 this. It is a non-standard way of building PCRE, for use in environ‐
162 ments that have limited stacks. Because of the greater use of memory
163 management, it runs more slowly. Separate functions are provided so
164 that special-purpose external code can be used for this case. When
165 used, these functions are always called in a stack-like manner (last
166 obtained, first freed), and always for memory blocks of the same size.
167 There is a discussion about PCRE's stack usage in the pcrestack docu‐
168 mentation.
169
170 The global variable pcre_callout initially contains NULL. It can be set
171 by the caller to a "callout" function, which PCRE will then call at
172 specified points during a matching operation. Details are given in the
173 pcrecallout documentation.
174
176
177 PCRE supports five different conventions for indicating line breaks in
178 strings: a single CR (carriage return) character, a single LF (line‐
179 feed) character, the two-character sequence CRLF, any of the three pre‐
180 ceding, or any Unicode newline sequence. The Unicode newline sequences
181 are the three just mentioned, plus the single characters VT (vertical
182 tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
183 separator, U+2028), and PS (paragraph separator, U+2029).
184
185 Each of the first three conventions is used by at least one operating
186 system as its standard newline sequence. When PCRE is built, a default
187 can be specified. The built-in default is LF, which is the Unix
188 standard. When PCRE is run, the default can be overridden, either when a
189 pattern is compiled, or when it is matched.
190
191 At compile time, the newline convention can be specified by the options
192 argument of pcre_compile(), or it can be specified by special text at
193 the start of the pattern itself; this overrides any other settings. See
194 the pcrepattern page for details of the special character sequences.
195
196 In the PCRE documentation the word "newline" is used to mean "the char‐
197 acter or pair of characters that indicate a line break". The choice of
198 newline convention affects the handling of the dot, circumflex, and
199 dollar metacharacters, the handling of #-comments in /x mode, and, when
200 CRLF is a recognized line ending sequence, the match position advance‐
201 ment for a non-anchored pattern. There is more detail about this in the
202 section on pcre_exec() options below.
203
204 The choice of newline convention does not affect the interpretation of
205 the \n or \r escape sequences, nor does it affect what \R matches,
206 which is controlled in a similar way, but by separate options.
207
209
210 The PCRE functions can be used in multi-threading applications, with
211 the proviso that the memory management functions pointed to by
212 pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
213 callout function pointed to by pcre_callout, are shared by all threads.
214
215 The compiled form of a regular expression is not altered during match‐
216 ing, so the same compiled pattern can safely be used by several threads
217 at once.
218
220
221 The compiled form of a regular expression can be saved and re-used at a
222 later time, possibly by a different program, and even on a host other
223 than the one on which it was compiled. Details are given in the
224 pcreprecompile documentation. However, compiling a regular expression
225 with one version of PCRE for use with a different version is not guar‐
226 anteed to work and may cause crashes.
227
229
230 int pcre_config(int what, void *where);
231
232 The function pcre_config() makes it possible for a PCRE client to dis‐
233 cover which optional features have been compiled into the PCRE library.
234 The pcrebuild documentation has more details about these optional fea‐
235 tures.
236
237 The first argument for pcre_config() is an integer, specifying which
238 information is required; the second argument is a pointer to a variable
239 into which the information is placed. The following information is
240 available:
241
242 PCRE_CONFIG_UTF8
243
244 The output is an integer that is set to one if UTF-8 support is avail‐
245 able; otherwise it is set to zero.
246
247 PCRE_CONFIG_UNICODE_PROPERTIES
248
249 The output is an integer that is set to one if support for Unicode
250 character properties is available; otherwise it is set to zero.
251
252 PCRE_CONFIG_NEWLINE
253
254 The output is an integer whose value specifies the default character
255 sequence that is recognized as meaning "newline". The five values that
256 are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,
257 and -1 for ANY. The default should normally be the standard sequence
258 for your operating system.
259
260 PCRE_CONFIG_BSR
261
262 The output is an integer whose value indicates what character sequences
263 the \R escape sequence matches by default. A value of 0 means that \R
264 matches any Unicode line ending sequence; a value of 1 means that \R
265 matches only CR, LF, or CRLF. The default can be overridden when a pat‐
266 tern is compiled or matched.
267
268 PCRE_CONFIG_LINK_SIZE
269
270 The output is an integer that contains the number of bytes used for
271 internal linkage in compiled regular expressions. The value is 2, 3, or
272 4. Larger values allow larger regular expressions to be compiled, at
273 the expense of slower matching. The default value of 2 is sufficient
274 for all but the most massive patterns, since it allows the compiled
275 pattern to be up to 64K in size.
276
277 PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
278
279 The output is an integer that contains the threshold above which the
280 POSIX interface uses malloc() for output vectors. Further details are
281 given in the pcreposix documentation.
282
283 PCRE_CONFIG_MATCH_LIMIT
284
285 The output is an integer that gives the default limit for the number of
286 internal matching function calls in a pcre_exec() execution. Further
287 details are given with pcre_exec() below.
288
289 PCRE_CONFIG_MATCH_LIMIT_RECURSION
290
291 The output is an integer that gives the default limit for the depth of
292 recursion when calling the internal matching function in a pcre_exec()
293 execution. Further details are given with pcre_exec() below.
294
295 PCRE_CONFIG_STACKRECURSE
296
297 The output is an integer that is set to one if internal recursion when
298 running pcre_exec() is implemented by recursive function calls that use
299 the stack to remember their state. This is the usual way that PCRE is
300 compiled. The output is zero if PCRE was compiled to use blocks of data
301 on the heap instead of recursive function calls. In this case,
302 pcre_stack_malloc and pcre_stack_free are called to manage memory
303 blocks on the heap, thus avoiding the use of the stack.
304
306
307 pcre *pcre_compile(const char *pattern, int options,
308 const char **errptr, int *erroffset,
309 const unsigned char *tableptr);
310
311 pcre *pcre_compile2(const char *pattern, int options,
312 int *errorcodeptr,
313 const char **errptr, int *erroffset,
314 const unsigned char *tableptr);
315
316 Either of the functions pcre_compile() or pcre_compile2() can be called
317 to compile a pattern into an internal form. The only difference between
318 the two interfaces is that pcre_compile2() has an additional argument,
319 errorcodeptr, via which a numerical error code can be returned.
320
321 The pattern is a C string terminated by a binary zero, and is passed in
322 the pattern argument. A pointer to a single block of memory that is
323 obtained via pcre_malloc is returned. This contains the compiled code
324 and related data. The pcre type is defined for the returned block; this
325 is a typedef for a structure whose contents are not externally defined.
326 It is up to the caller to free the memory (via pcre_free) when it is no
327 longer required.
328
329 Although the compiled code of a PCRE regex is relocatable, that is, it
330 does not depend on memory location, the complete pcre data block is not
331 fully relocatable, because it may contain a copy of the tableptr argu‐
332 ment, which is an address (see below).
333
334 The options argument contains various bit settings that affect the com‐
335 pilation. It should be zero if no options are required. The available
336 options are described below. Some of them, in particular, those that
337 are compatible with Perl, can also be set and unset from within the
338 pattern (see the detailed description in the pcrepattern documenta‐
339 tion). For these options, the contents of the options argument speci‐
340 fies their initial settings at the start of compilation and execution.
341 The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at the time
342 of matching as well as at compile time.
343
344 If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise,
345 if compilation of a pattern fails, pcre_compile() returns NULL, and
346 sets the variable pointed to by errptr to point to a textual error mes‐
347 sage. This is a static string that is part of the library. You must not
348 try to free it. Normally, the offset from the start of the pattern to
349 the byte that was being processed when the error was discovered is
350 placed in the variable pointed to by erroffset, which must not be NULL
351 (if it is, an immediate error is given). However, for an invalid UTF-8
352 string, the offset is that of the first byte of the failing character.
353 Also, some errors are not detected until checks are carried out when
354 the whole pattern has been scanned; in these cases the offset passed
355 back is the length of the pattern.
356
357 Note that the offset is in bytes, not characters, even in UTF-8 mode.
358 It may sometimes point into the middle of a UTF-8 character.
359
360 If pcre_compile2() is used instead of pcre_compile(), and the error‐
361 codeptr argument is not NULL, a non-zero error code number is returned
362 via this argument in the event of an error. This is in addition to the
363 textual error message. Error codes and messages are listed below.
364
365 If the final argument, tableptr, is NULL, PCRE uses a default set of
366 character tables that are built when PCRE is compiled, using the
367 default C locale. Otherwise, tableptr must be an address that is the
368 result of a call to pcre_maketables(). This value is stored with the
369 compiled pattern, and used again by pcre_exec(), unless another table
370 pointer is passed to it. For more discussion, see the section on locale
371 support below.
372
373 This code fragment shows a typical straightforward call to pcre_com‐
374 pile():
375
  pcre *re;
  const char *error;
  int erroffset;
  re = pcre_compile(
    "^A.*Z",          /* the pattern */
    0,                /* default options */
    &error,           /* for error message */
    &erroffset,       /* for error offset */
    NULL);            /* use default character tables */
385
386 The following names for option bits are defined in the pcre.h header
387 file:
388
389 PCRE_ANCHORED
390
391 If this bit is set, the pattern is forced to be "anchored", that is, it
392 is constrained to match only at the first matching point in the string
393 that is being searched (the "subject string"). This effect can also be
394 achieved by appropriate constructs in the pattern itself, which is the
395 only way to do it in Perl.
396
397 PCRE_AUTO_CALLOUT
398
399 If this bit is set, pcre_compile() automatically inserts callout items,
400 all with number 255, before each pattern item. For discussion of the
401 callout facility, see the pcrecallout documentation.
402
403 PCRE_BSR_ANYCRLF
404 PCRE_BSR_UNICODE
405
406 These options (which are mutually exclusive) control what the \R escape
407 sequence matches. The choice is either to match only CR, LF, or CRLF,
408 or to match any Unicode newline sequence. The default is specified when
409 PCRE is built. It can be overridden from within the pattern, or by set‐
410 ting an option when a compiled pattern is matched.
411
412 PCRE_CASELESS
413
414 If this bit is set, letters in the pattern match both upper and lower
415 case letters. It is equivalent to Perl's /i option, and it can be
416 changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
417 always understands the concept of case for characters whose values are
418 less than 128, so caseless matching is always possible. For characters
419 with higher values, the concept of case is supported if PCRE is com‐
420 piled with Unicode property support, but not otherwise. If you want to
421 use caseless matching for characters 128 and above, you must ensure
422 that PCRE is compiled with Unicode property support as well as with
423 UTF-8 support.
424
425 PCRE_DOLLAR_ENDONLY
426
427 If this bit is set, a dollar metacharacter in the pattern matches only
428 at the end of the subject string. Without this option, a dollar also
429 matches immediately before a newline at the end of the string (but not
430 before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
431 if PCRE_MULTILINE is set. There is no equivalent to this option in
432 Perl, and no way to set it within a pattern.
433
434 PCRE_DOTALL
435
436 If this bit is set, a dot metacharacter in the pattern matches all
437 characters, including those that indicate newline. Without it, a dot does
438 not match when the current position is at a newline. This option is
439 equivalent to Perl's /s option, and it can be changed within a pattern
440 by a (?s) option setting. A negative class such as [^a] always matches
441 newline characters, independent of the setting of this option.
442
443 PCRE_DUPNAMES
444
445 If this bit is set, names used to identify capturing subpatterns need
446 not be unique. This can be helpful for certain types of pattern when it
447 is known that only one instance of the named subpattern can ever be
448 matched. There are more details of named subpatterns below; see also
449 the pcrepattern documentation.
450
451 PCRE_EXTENDED
452
453 If this bit is set, white space data characters in the pattern are
454 totally ignored except when escaped or inside a character class. White
455 space does not include the VT character (code 11). In addition, charac‐
456 ters between an unescaped # outside a character class and the next new‐
457 line, inclusive, are also ignored. This is equivalent to Perl's /x
458 option, and it can be changed within a pattern by a (?x) option set‐
459 ting.
460
461 This option makes it possible to include comments inside complicated
462 patterns. Note, however, that this applies only to data characters.
463 White space characters may never appear within special character
464 sequences in a pattern, for example within the sequence (?( which
465 introduces a conditional subpattern.
466
467 PCRE_EXTRA
468
469 This option was invented in order to turn on additional functionality
470 of PCRE that is incompatible with Perl, but it is currently of very
471 little use. When set, any backslash in a pattern that is followed by a
472 letter that has no special meaning causes an error, thus reserving
473 these combinations for future expansion. By default, as in Perl, a
474 backslash followed by a letter with no special meaning is treated as a
475 literal. (Perl can, however, be persuaded to give a warning for this.)
476 There are at present no other features controlled by this option. It
477 can also be set by a (?X) option setting within a pattern.
478
479 PCRE_FIRSTLINE
480
481 If this option is set, an unanchored pattern is required to match
482 before or at the first newline in the subject string, though the
483 matched text may continue over the newline.
484
485 PCRE_JAVASCRIPT_COMPAT
486
487 If this option is set, PCRE's behaviour is changed in some ways so that
488 it is compatible with JavaScript rather than Perl. The changes are as
489 follows:
490
491 (1) A lone closing square bracket in a pattern causes a compile-time
492 error, because this is illegal in JavaScript (by default it is treated
493 as a data character). Thus, the pattern AB]CD becomes illegal when this
494 option is set.
495
496 (2) At run time, a back reference to an unset subpattern group matches
497 an empty string (by default this causes the current matching alterna‐
498 tive to fail). A pattern such as (\1)(a) succeeds when this option is
499 set (assuming it can find an "a" in the subject), whereas it fails by
500 default, for Perl compatibility.
501
502 PCRE_MULTILINE
503
504 By default, PCRE treats the subject string as consisting of a single
505 line of characters (even if it actually contains newlines). The "start
506 of line" metacharacter (^) matches only at the start of the string,
507 while the "end of line" metacharacter ($) matches only at the end of
508 the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
509 is set). This is the same as Perl.
510
511 When PCRE_MULTILINE is set, the "start of line" and "end of line"
512 constructs match immediately following or immediately before internal
513 newlines in the subject string, respectively, as well as at the very
514 start and end. This is equivalent to Perl's /m option, and it can be
515 changed within a pattern by a (?m) option setting. If there are no new‐
516 lines in a subject string, or no occurrences of ^ or $ in a pattern,
517 setting PCRE_MULTILINE has no effect.
518
519 PCRE_NEWLINE_CR
520 PCRE_NEWLINE_LF
521 PCRE_NEWLINE_CRLF
522 PCRE_NEWLINE_ANYCRLF
523 PCRE_NEWLINE_ANY
524
525 These options override the default newline definition that was chosen
526 when PCRE was built. Setting the first or the second specifies that a
527 newline is indicated by a single character (CR or LF, respectively).
528 Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the
529 two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies
530 that any of the three preceding sequences should be recognized. Setting
531 PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be
532 recognized. The Unicode newline sequences are the three just mentioned,
533 plus the single characters VT (vertical tab, U+000B), FF (form feed,
534 U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
535 (paragraph separator, U+2029). The last two are recognized only in
536 UTF-8 mode.
537
538 The newline setting in the options word uses three bits that are
539 treated as a number, giving eight possibilities. Currently only six are
540 used (default plus the five values above). This means that if you set
541 more than one newline option, the combination may or may not be sensi‐
542 ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
543 PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and
544 cause an error.
545
546 The only time that a line break is specially recognized when compiling
547 a pattern is if PCRE_EXTENDED is set, and an unescaped # outside a
548 character class is encountered. This indicates a comment that lasts
549 until after the next line break sequence. In other circumstances, line
550 break sequences are treated as literal data, except that in
551 PCRE_EXTENDED mode, both CR and LF are treated as white space charac‐
552 ters and are therefore ignored.
553
554 The newline option that is set at compile time becomes the default that
555 is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
556
557 PCRE_NO_AUTO_CAPTURE
558
559 If this option is set, it disables the use of numbered capturing paren‐
560 theses in the pattern. Any opening parenthesis that is not followed by
561 ? behaves as if it were followed by ?: but named parentheses can still
562 be used for capturing (and they acquire numbers in the usual way).
563 There is no equivalent of this option in Perl.
564
565 PCRE_UNGREEDY
566
567 This option inverts the "greediness" of the quantifiers so that they
568 are not greedy by default, but become greedy if followed by "?". It is
569 not compatible with Perl. It can also be set by a (?U) option setting
570 within the pattern.
571
572 PCRE_UTF8
573
574 This option causes PCRE to regard both the pattern and the subject as
575 strings of UTF-8 characters instead of single-byte character strings.
576 However, it is available only when PCRE is built to include UTF-8 sup‐
577 port. If not, the use of this option provokes an error. Details of how
578 this option changes the behaviour of PCRE are given in the section on
579 UTF-8 support in the main pcre page.
580
581 PCRE_NO_UTF8_CHECK
582
583 When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
584 automatically checked. There is a discussion about the validity of
585 UTF-8 strings in the main pcre page. If an invalid UTF-8 sequence of
586 bytes is found, pcre_compile() returns an error. If you already know
587 that your pattern is valid, and you want to skip this check for perfor‐
588 mance reasons, you can set the PCRE_NO_UTF8_CHECK option. When it is
589 set, the effect of passing an invalid UTF-8 string as a pattern is
590 undefined. It may cause your program to crash. Note that this option
591 can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the
592 UTF-8 validity checking of subject strings.
593
595
596 The following table lists the error codes that may be returned by
597 pcre_compile2(), along with the error messages that may be returned by
598 both compiling functions. As PCRE has developed, some error codes have
599 fallen out of use. To avoid confusion, they have not been re-used.
600
601 0 no error
602 1 \ at end of pattern
603 2 \c at end of pattern
604 3 unrecognized character follows \
605 4 numbers out of order in {} quantifier
606 5 number too big in {} quantifier
607 6 missing terminating ] for character class
608 7 invalid escape sequence in character class
609 8 range out of order in character class
610 9 nothing to repeat
611 10 [this code is not in use]
612 11 internal error: unexpected repeat
613 12 unrecognized character after (? or (?-
614 13 POSIX named classes are supported only within a class
615 14 missing )
616 15 reference to non-existent subpattern
617 16 erroffset passed as NULL
618 17 unknown option bit(s) set
619 18 missing ) after comment
620 19 [this code is not in use]
621 20 regular expression is too large
622 21 failed to get memory
623 22 unmatched parentheses
624 23 internal error: code overflow
625 24 unrecognized character after (?<
626 25 lookbehind assertion is not fixed length
627 26 malformed number or name after (?(
628 27 conditional group contains more than two branches
629 28 assertion expected after (?(
630 29 (?R or (?[+-]digits must be followed by )
631 30 unknown POSIX class name
632 31 POSIX collating elements are not supported
633 32 this version of PCRE is not compiled with PCRE_UTF8 support
634 33 [this code is not in use]
635 34 character value in \x{...} sequence is too large
636 35 invalid condition (?(0)
637 36 \C not allowed in lookbehind assertion
638 37 PCRE does not support \L, \l, \N, \U, or \u
639 38 number after (?C is > 255
640 39 closing ) for (?C expected
641 40 recursive call could loop indefinitely
642 41 unrecognized character after (?P
643 42 syntax error in subpattern name (missing terminator)
644 43 two named subpatterns have the same name
645 44 invalid UTF-8 string
646 45 support for \P, \p, and \X has not been compiled
647 46 malformed \P or \p sequence
648 47 unknown property name after \P or \p
649 48 subpattern name is too long (maximum 32 characters)
650 49 too many named subpatterns (maximum 10000)
651 50 [this code is not in use]
652 51 octal value is greater than \377 (not in UTF-8 mode)
653 52 internal error: overran compiling workspace
654 53 internal error: previously-checked referenced subpattern not
655 found
656 54 DEFINE group contains more than one branch
657 55 repeating a DEFINE group is not allowed
658 56 inconsistent NEWLINE options
659 57 \g is not followed by a braced, angle-bracketed, or quoted
660 name/number or by a plain number
661 58 a numbered reference must not be zero
662 59 (*VERB) with an argument is not supported
663 60 (*VERB) not recognized
664 61 number is too big
665 62 subpattern name expected
666 63 digit expected after (?+
667 64 ] is an invalid data character in JavaScript compatibility mode
668
669 The numbers 32 and 10000 in errors 48 and 49 are defaults; different
670 values may be used if the limits were changed when PCRE was built.
671
673
674 pcre_extra *pcre_study(const pcre *code, int options,
675 const char **errptr);
676
677 If a compiled pattern is going to be used several times, it is worth
678 spending more time analyzing it in order to speed up the time taken for
679 matching. The function pcre_study() takes a pointer to a compiled pat‐
680 tern as its first argument. If studying the pattern produces additional
681 information that will help speed up matching, pcre_study() returns a
682 pointer to a pcre_extra block, in which the study_data field points to
683 the results of the study.
684
685 The returned value from pcre_study() can be passed directly to
686 pcre_exec(). However, a pcre_extra block also contains other fields
687 that can be set by the caller before the block is passed; these are
688 described below in the section on matching a pattern.
689
690 If studying the pattern does not produce any additional information,
691 pcre_study() returns NULL. In that circumstance, if the calling program
692 wants to pass any of the other fields to pcre_exec(), it must set up
693 its own pcre_extra block.
694
695 The second argument of pcre_study() contains option bits. At present,
696 no options are defined, and this argument should always be zero.
697
698 The third argument for pcre_study() is a pointer for an error message.
699 If studying succeeds (even if no data is returned), the variable it
700 points to is set to NULL. Otherwise it is set to point to a textual
701 error message. This is a static string that is part of the library. You
702 must not try to free it. You should test the error pointer for NULL
703 after calling pcre_study(), to be sure that it has run successfully.
704
705 This is a typical call to pcre_study():
706
707 pcre_extra *pe;
708 pe = pcre_study(
709 re, /* result of pcre_compile() */
710 0, /* no options exist */
711 &error); /* set to NULL or points to a message */
712
713 At present, studying a pattern is useful only for non-anchored patterns
714 that do not have a single fixed starting character. A bitmap of possi‐
715 ble starting bytes is created.
716
718
719 PCRE handles caseless matching, and determines whether characters are
720 letters, digits, or whatever, by reference to a set of tables, indexed
721 by character value. When running in UTF-8 mode, this applies only to
722 characters with codes less than 128. Higher-valued codes never match
723 escapes such as \w or \d, but can be tested with \p if PCRE is built
724 with Unicode character property support. The use of locales with Uni‐
725 code is discouraged. If you are handling characters with codes greater
726 than 128, you should either use UTF-8 and Unicode, or use locales, but
727 not try to mix the two.
728
729 PCRE contains an internal set of tables that are used when the final
730 argument of pcre_compile() is NULL. These are sufficient for many
731 applications. Normally, the internal tables recognize only ASCII char‐
732 acters. However, when PCRE is built, it is possible to cause the inter‐
733 nal tables to be rebuilt in the default "C" locale of the local system,
734 which may cause them to be different.
735
736 The internal tables can always be overridden by tables supplied by the
737 application that calls PCRE. These may be created in a different locale
738 from the default. As more and more applications change to using Uni‐
739 code, the need for this locale support is expected to die away.
740
741 External tables are built by calling the pcre_maketables() function,
742 which has no arguments, in the relevant locale. The result can then be
743 passed to pcre_compile() or pcre_exec() as often as necessary. For
744 example, to build and use tables that are appropriate for the French
745 locale (where accented characters with values greater than 128 are
746 treated as letters), the following code could be used:
747
748 setlocale(LC_CTYPE, "fr_FR");
749 tables = pcre_maketables();
750 re = pcre_compile(..., tables);
751
752 The locale name "fr_FR" is used on Linux and other Unix-like systems;
753 if you are using Windows, the name for the French locale is "french".
754
755 When pcre_maketables() runs, the tables are built in memory that is
756 obtained via pcre_malloc. It is the caller's responsibility to ensure
757 that the memory containing the tables remains available for as long as
758 it is needed.
759
760 The pointer that is passed to pcre_compile() is saved with the compiled
761 pattern, and the same tables are used via this pointer by pcre_study()
762 and normally also by pcre_exec(). Thus, by default, for any single pat‐
763 tern, compilation, studying and matching all happen in the same locale,
764 but different patterns can be compiled in different locales.
765
766 It is possible to pass a table pointer or NULL (indicating the use of
767 the internal tables) to pcre_exec(). Although not intended for this
768 purpose, this facility could be used to match a pattern in a different
769 locale from the one in which it was compiled. Passing table pointers at
770 run time is discussed below in the section on matching a pattern.
771
773
774 int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
775 int what, void *where);
776
777 The pcre_fullinfo() function returns information about a compiled
778 pattern. It replaces the obsolete pcre_info() function, which is
779 nevertheless retained for backwards compatibility (documented below).
780
781 The first argument for pcre_fullinfo() is a pointer to the compiled
782 pattern. The second argument is the result of pcre_study(), or NULL if
783 the pattern was not studied. The third argument specifies which piece
784 of information is required, and the fourth argument is a pointer to a
785 variable to receive the data. The yield of the function is zero for
786 success, or one of the following negative numbers:
787
788 PCRE_ERROR_NULL the argument code was NULL
789 the argument where was NULL
790 PCRE_ERROR_BADMAGIC the "magic number" was not found
791 PCRE_ERROR_BADOPTION the value of what was invalid
792
793 The "magic number" is placed at the start of each compiled pattern as
794 a simple check against passing an arbitrary memory pointer. Here is a
795 typical call of pcre_fullinfo(), to obtain the length of the compiled
796 pattern:
797
798 int rc;
799 size_t length;
800 rc = pcre_fullinfo(
801 re, /* result of pcre_compile() */
802 pe, /* result of pcre_study(), or NULL */
803 PCRE_INFO_SIZE, /* what is required */
804 &length); /* where to put the data */
805
806 The possible values for the third argument are defined in pcre.h, and
807 are as follows:
808
809 PCRE_INFO_BACKREFMAX
810
811 Return the number of the highest back reference in the pattern. The
812 fourth argument should point to an int variable. Zero is returned if
813 there are no back references.
814
815 PCRE_INFO_CAPTURECOUNT
816
817 Return the number of capturing subpatterns in the pattern. The fourth
818 argument should point to an int variable.
819
820 PCRE_INFO_DEFAULT_TABLES
821
822 Return a pointer to the internal default character tables within PCRE.
823 The fourth argument should point to an unsigned char * variable. This
824 information call is provided for internal use by the pcre_study() func‐
825 tion. External callers can cause PCRE to use its internal tables by
826 passing a NULL table pointer.
827
828 PCRE_INFO_FIRSTBYTE
829
830 Return information about the first byte of any matched string, for a
831 non-anchored pattern. The fourth argument should point to an int vari‐
832 able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name
833 is still recognized for backwards compatibility.)
834
835 If there is a fixed first byte, for example, from a pattern such as
836 (cat|cow|coyote), its value is returned. Otherwise, if either
837
838 (a) the pattern was compiled with the PCRE_MULTILINE option, and every
839 branch starts with "^", or
840
841 (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
842 set (if it were set, the pattern would be anchored),
843
844 -1 is returned, indicating that the pattern matches only at the start
845 of a subject string or after any newline within the string. Otherwise
846 -2 is returned. For anchored patterns, -2 is returned.
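
The three kinds of result can be told apart by sign. The following is a
minimal sketch, assuming the int written by a PCRE_INFO_FIRSTBYTE query
has already been obtained; classify_first_byte() and the enum are
hypothetical names used for illustration and are not part of PCRE:

```c
/* Hypothetical helper, not part of the PCRE API. Classifies the int
   value that a PCRE_INFO_FIRSTBYTE query writes to its fourth argument. */
enum first_byte_kind {
    FIRST_BYTE_FIXED,       /* value >= 0: every match starts with this byte */
    FIRST_BYTE_LINE_START,  /* -1: matches only at start of subject or after a newline */
    FIRST_BYTE_NONE         /* -2: no first-byte information (or anchored pattern) */
};

enum first_byte_kind classify_first_byte(int value)
{
    if (value >= 0)
        return FIRST_BYTE_FIXED;
    if (value == -1)
        return FIRST_BYTE_LINE_START;
    return FIRST_BYTE_NONE;
}
```

For a pattern such as (cat|cow|coyote) the value passed in would be the
code of "c", which this sketch reports as FIRST_BYTE_FIXED.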
847
848 PCRE_INFO_FIRSTTABLE
849
850 If the pattern was studied, and this resulted in the construction of a
851 256-bit table indicating a fixed set of bytes for the first byte in any
852 matching string, a pointer to the table is returned. Otherwise NULL is
853 returned. The fourth argument should point to an unsigned char * vari‐
854 able.
855
856 PCRE_INFO_HASCRORLF
857
858 Return 1 if the pattern contains any explicit matches for CR or LF
859 characters, otherwise 0. The fourth argument should point to an int
860 variable. An explicit match is either a literal CR or LF character, or
861 \r or \n.
862
863 PCRE_INFO_JCHANGED
864
865 Return 1 if the (?J) or (?-J) option setting is used in the pattern,
866 otherwise 0. The fourth argument should point to an int variable. (?J)
867 and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
868
869 PCRE_INFO_LASTLITERAL
870
871 Return the value of the rightmost literal byte that must exist in any
872 matched string, other than at its start, if such a byte has been
873 recorded. The fourth argument should point to an int variable. If there
874 is no such byte, -1 is returned. For anchored patterns, a last literal
875 byte is recorded only if it follows something of variable length. For
876 example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
877 /^a\dz\d/ the returned value is -1.
878
879 PCRE_INFO_NAMECOUNT
880 PCRE_INFO_NAMEENTRYSIZE
881 PCRE_INFO_NAMETABLE
882
883 PCRE supports the use of named as well as numbered capturing parenthe‐
884 ses. The names are just an additional way of identifying the parenthe‐
885 ses, which still acquire numbers. Several convenience functions such as
886 pcre_get_named_substring() are provided for extracting captured sub‐
887 strings by name. It is also possible to extract the data directly, by
888 first converting the name to a number in order to access the correct
889 pointers in the output vector (described with pcre_exec() below). To do
890 the conversion, you need to use the name-to-number map, which is
891 described by these three values.
892
893 The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
894 gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
895 of each entry; both of these return an int value. The entry size
896 depends on the length of the longest name. PCRE_INFO_NAMETABLE returns
897 a pointer to the first entry of the table (a pointer to char). The
898 first two bytes of each entry are the number of the capturing parenthe‐
899 sis, most significant byte first. The rest of the entry is the corre‐
900 sponding name, zero terminated. The names are in alphabetical order.
901 When PCRE_DUPNAMES is set, duplicate names are in order of their paren‐
902 theses numbers. For example, consider the following pattern (assume
903 PCRE_EXTENDED is set, so white space - including newlines - is
904 ignored):
905
906 (?<date> (?<year>(\d\d)?\d\d) -
907 (?<month>\d\d) - (?<day>\d\d) )
908
909 There are four named subpatterns, so the table has four entries, and
910 each entry in the table is eight bytes long. The table is as follows,
911 with non-printing bytes shown in hexadecimal, and undefined bytes shown
912 as ??:
913
914 00 01 d a t e 00 ??
915 00 05 d a y 00 ?? ??
916 00 04 m o n t h 00
917 00 02 y e a r 00 ??
918
919 When writing code to extract data from named subpatterns using the
920 name-to-number map, remember that the length of the entries is likely
921 to be different for each compiled pattern.
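
The entry layout described above can be walked with a short loop. This
is a sketch only: lookup_group_number() is a hypothetical helper, and
its table, count and entry_size parameters are assumed to hold the
values already obtained from PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT
and PCRE_INFO_NAMEENTRYSIZE:

```c
#include <string.h>

/* Hypothetical helper, not a PCRE function. Scans a name-to-number map
   with the layout described above and returns the number of the
   capturing parenthesis called `name`, or -1 if there is no such name. */
int lookup_group_number(const unsigned char *table, int count,
                        int entry_size, const char *name)
{
    int i;
    for (i = 0; i < count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        /* Bytes 0 and 1: parenthesis number, most significant byte first. */
        int number = (entry[0] << 8) | entry[1];
        /* Bytes 2 onwards: the zero-terminated name. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return number;
    }
    return -1;
}
```

With the four-entry table shown above, looking up "day" gives 5. When
duplicate names are present this sketch returns the first entry found.
In real programs the library function pcre_get_stringnumber() performs
the name-to-number conversion.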
922
923 PCRE_INFO_OKPARTIAL
924
925 Return 1 if the pattern can be used for partial matching, otherwise 0.
926 The fourth argument should point to an int variable. The pcrepartial
927 documentation lists the restrictions that apply to patterns when par‐
928 tial matching is used.
929
930 PCRE_INFO_OPTIONS
931
932 Return a copy of the options with which the pattern was compiled. The
933 fourth argument should point to an unsigned long int variable. These
934 option bits are those specified in the call to pcre_compile(), modified
935 by any top-level option settings at the start of the pattern itself. In
936 other words, they are the options that will be in force when matching
937 starts. For example, if the pattern /(?im)abc(?-i)d/ is compiled with
938 the PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE,
939 and PCRE_EXTENDED.
940
941 A pattern is automatically anchored by PCRE if all of its top-level
942 alternatives begin with one of the following:
943
944 ^ unless PCRE_MULTILINE is set
945 \A always
946 \G always
947 .* if PCRE_DOTALL is set and there are no back
948 references to the subpattern in which .* appears
949
950 For such patterns, the PCRE_ANCHORED bit is set in the options returned
951 by pcre_fullinfo().
952
953 PCRE_INFO_SIZE
954
955 Return the size of the compiled pattern, that is, the value that was
956 passed as the argument to pcre_malloc() when PCRE was getting memory in
957 which to place the compiled data. The fourth argument should point to a
958 size_t variable.
959
960 PCRE_INFO_STUDYSIZE
961
962 Return the size of the data block pointed to by the study_data field in
963 a pcre_extra block. That is, it is the value that was passed to
964 pcre_malloc() when PCRE was getting memory into which to place the data
965 created by pcre_study(). The fourth argument should point to a size_t
966 variable.
967
969
970 int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
971
972 The pcre_info() function is now obsolete because its interface is too
973 restrictive to return all the available data about a compiled pattern.
974 New programs should use pcre_fullinfo() instead. The yield of
975 pcre_info() is the number of capturing subpatterns, or one of the fol‐
976 lowing negative numbers:
977
978 PCRE_ERROR_NULL the argument code was NULL
979 PCRE_ERROR_BADMAGIC the "magic number" was not found
980
981 If the optptr argument is not NULL, a copy of the options with which
982 the pattern was compiled is placed in the integer it points to (see
983 PCRE_INFO_OPTIONS above).
984
985 If the pattern is not anchored and the firstcharptr argument is not
986 NULL, it is used to pass back information about the first character of
987 any matched string (see PCRE_INFO_FIRSTBYTE above).
988
990
991 int pcre_refcount(pcre *code, int adjust);
992
993 The pcre_refcount() function is used to maintain a reference count in
994 the data block that contains a compiled pattern. It is provided for the
995 benefit of applications that operate in an object-oriented manner,
996 where different parts of the application may be using the same compiled
997 pattern, but you want to free the block when they are all done.
998
999 When a pattern is compiled, the reference count field is initialized to
1000 zero. It is changed only by calling this function, whose action is to
1001 add the adjust value (which may be positive or negative) to it. The
1002 yield of the function is the new value. However, the value of the count
1003 is constrained to lie between 0 and 65535, inclusive. If the new value
1004 is outside these limits, it is forced to the appropriate limit value.
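
The clamping rule can be modelled in a few lines. This is a stand-alone
illustration of the arithmetic described above, under the assumption
that the current count is already in range; model_refcount() is a
hypothetical name and the code does not call pcre_refcount() or
reproduce PCRE's implementation:

```c
/* Stand-alone model of the rule described above, not PCRE code.
   Adds `adjust` to `current` and forces the result into 0..65535. */
int model_refcount(int current, int adjust)
{
    long long updated = (long long)current + adjust;
    if (updated < 0)
        updated = 0;          /* forced to the lower limit */
    if (updated > 65535)
        updated = 65535;      /* forced to the upper limit */
    return (int)updated;
}
```

An application would typically call pcre_refcount() with +1 when it
starts sharing a compiled pattern and with -1 when it stops, and free
the block once the yield is zero.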
1005
1006 Except when it is zero, the reference count is not correctly preserved
1007 if a pattern is compiled on one host and then transferred to a host
1008 whose byte-order is different. (This seems a highly unlikely scenario.)
1009
1011
1012 int pcre_exec(const pcre *code, const pcre_extra *extra,
1013 const char *subject, int length, int startoffset,
1014 int options, int *ovector, int ovecsize);
1015
1016 The function pcre_exec() is called to match a subject string against a
1017 compiled pattern, which is passed in the code argument. If the pattern
1018 has been studied, the result of the study should be passed in the extra
1019 argument. This function is the main matching facility of the library,
1020 and it operates in a Perl-like manner. For specialist use there is also
1021 an alternative matching function, which is described below in the sec‐
1022 tion about the pcre_dfa_exec() function.
1023
1024 In most applications, the pattern will have been compiled (and option‐
1025 ally studied) in the same process that calls pcre_exec(). However, it
1026 is possible to save compiled patterns and study data, and then use them
1027 later in different processes, possibly even on different hosts. For a
1028 discussion about this, see the pcreprecompile documentation.
1029
1030 Here is an example of a simple call to pcre_exec():
1031
1032 int rc;
1033 int ovector[30];
1034 rc = pcre_exec(
1035 re, /* result of pcre_compile() */
1036 NULL, /* we didn't study the pattern */
1037 "some string", /* the subject string */
1038 11, /* the length of the subject string */
1039 0, /* start at offset 0 in the subject */
1040 0, /* default options */
1041 ovector, /* vector of integers for substring information */
1042 30); /* number of elements (NOT size in bytes) */
1043
1044 Extra data for pcre_exec()
1045
1046 If the extra argument is not NULL, it must point to a pcre_extra data
1047 block. The pcre_study() function returns such a block (when it doesn't
1048 return NULL), but you can also create one for yourself, and pass addi‐
1049 tional information in it. The pcre_extra block contains the following
1050 fields (not necessarily in this order):
1051
1052 unsigned long int flags;
1053 void *study_data;
1054 unsigned long int match_limit;
1055 unsigned long int match_limit_recursion;
1056 void *callout_data;
1057 const unsigned char *tables;
1058
1059 The flags field is a bitmap that specifies which of the other fields
1060 are set. The flag bits are:
1061
1062 PCRE_EXTRA_STUDY_DATA
1063 PCRE_EXTRA_MATCH_LIMIT
1064 PCRE_EXTRA_MATCH_LIMIT_RECURSION
1065 PCRE_EXTRA_CALLOUT_DATA
1066 PCRE_EXTRA_TABLES
1067
1068 Other flag bits should be set to zero. The study_data field is set in
1069 the pcre_extra block that is returned by pcre_study(), together with
1070 the appropriate flag bit. You should not set this yourself, but you may
1071 add to the block by setting the other fields and their corresponding
1072 flag bits.
1073
1074 The match_limit field provides a means of preventing PCRE from using up
1075 a vast amount of resources when running patterns that are not going to
1076 match, but which have a very large number of possibilities in their
1077 search trees. The classic example is the use of nested unlimited
1078 repeats.
1079
1080 Internally, PCRE uses a function called match() which it calls repeat‐
1081 edly (sometimes recursively). The limit set by match_limit is imposed
1082 on the number of times this function is called during a match, which
1083 has the effect of limiting the amount of backtracking that can take
1084 place. For patterns that are not anchored, the count restarts from zero
1085 for each position in the subject string.
1086
1087 The default value for the limit can be set when PCRE is built; the
1088 default default is 10 million, which handles all but the most extreme
1089 cases. You can override the default by supplying pcre_exec() with a
1090 pcre_extra block in which match_limit is set, and
1091 PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is
1092 exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
1093
1094 The match_limit_recursion field is similar to match_limit, but instead
1095 of limiting the total number of times that match() is called, it limits
1096 the depth of recursion. The recursion depth is a smaller number than
1097 the total number of calls, because not all calls to match() are recur‐
1098 sive. This limit is of use only if it is set smaller than match_limit.
1099
1100 Limiting the recursion depth limits the amount of stack that can be
1101 used, or, when PCRE has been compiled to use memory on the heap instead
1102 of the stack, the amount of heap memory that can be used.
1103
1104 The default value for match_limit_recursion can be set when PCRE is
1105 built; the default default is the same value as the default for
1106 match_limit. You can override the default by supplying pcre_exec() with
1107 a pcre_extra block in which match_limit_recursion is set, and
1108 PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the
1109 limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
1110
1111 The callout_data field is used in conjunction with the "callout"
1112 feature, which is described in the pcrecallout documentation.
1113
1114 The tables field is used to pass a character tables pointer to
1115 pcre_exec(); this overrides the value that is stored with the compiled
1116 pattern. A non-NULL value is stored with the compiled pattern only if
1117 custom tables were supplied to pcre_compile() via its tableptr argu‐
1118 ment. If NULL is passed to pcre_exec() using this mechanism, it forces
1119 PCRE's internal tables to be used. This facility is helpful when re-
1120 using patterns that have been saved after compiling with an external
1121 set of tables, because the external tables might be at a different
1122 address when pcre_exec() is called. See the pcreprecompile documenta‐
1123 tion for a discussion of saving compiled patterns for later use.
1124
1125 Option bits for pcre_exec()
1126
1127 The unused bits of the options argument for pcre_exec() must be zero.
1128 The only bits that may be set are PCRE_ANCHORED, PCRE_BSR_xxx,
1129 PCRE_NEWLINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
1130 PCRE_NO_UTF8_CHECK and PCRE_PARTIAL.
1131
1132 PCRE_ANCHORED
1133
1134 The PCRE_ANCHORED option limits pcre_exec() to matching at the first
1135 matching position. If a pattern was compiled with PCRE_ANCHORED, or
1136 turned out to be anchored by virtue of its contents, it cannot be made
1137 unanchored at matching time.
1138
1139 PCRE_BSR_ANYCRLF
1140 PCRE_BSR_UNICODE
1141
1142 These options (which are mutually exclusive) control what the \R escape
1143 sequence matches. The choice is either to match only CR, LF, or CRLF,
1144 or to match any Unicode newline sequence. These options override the
1145 choice that was made or defaulted when the pattern was compiled.
1146
1147 PCRE_NEWLINE_CR
1148 PCRE_NEWLINE_LF
1149 PCRE_NEWLINE_CRLF
1150 PCRE_NEWLINE_ANYCRLF
1151 PCRE_NEWLINE_ANY
1152
1153 These options override the newline definition that was chosen or
1154 defaulted when the pattern was compiled. For details, see the descrip‐
1155 tion of pcre_compile() above. During matching, the newline choice
1156 affects the behaviour of the dot, circumflex, and dollar metacharac‐
1157 ters. It may also alter the way the match position is advanced after a
1158 match failure for an unanchored pattern.
1159
1160 When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is
1161 set, and a match attempt for an unanchored pattern fails when the cur‐
1162 rent position is at a CRLF sequence, and the pattern contains no
1163 explicit matches for CR or LF characters, the match position is
1164 advanced by two characters instead of one, in other words, to after the
1165 CRLF.
1166
1167 The above rule is a compromise that makes the most common cases work as
1168 expected. For example, if the pattern is .+A (and the PCRE_DOTALL
1169 option is not set), it does not match the string "\r\nA" because, after
1170 failing at the start, it skips both the CR and the LF before retrying.
1171 However, the pattern [\r\n]A does match that string, because it con‐
1172 tains an explicit CR or LF reference, and so advances only by one char‐
1173 acter after the first failure.
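
The advance rule can be written down as a small function. This is a
hypothetical model for single-byte (non-UTF-8) subjects, not PCRE's
code; next_start_offset() and its parameters are names chosen for
illustration. The has_crorlf parameter stands for the property that
PCRE_INFO_HASCRORLF reports for the pattern:

```c
/* Hypothetical model of the rule described above, not PCRE code.
   crlf_is_newline: non-zero when the newline setting is
                    PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF or
                    PCRE_NEWLINE_ANY.
   has_crorlf:      non-zero when the pattern contains an explicit
                    match for CR or LF.
   Returns the offset at which the next match attempt would start after
   a failure at `offset` for an unanchored pattern. */
int next_start_offset(const char *subject, int length, int offset,
                      int crlf_is_newline, int has_crorlf)
{
    if (crlf_is_newline && !has_crorlf &&
        offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;    /* step over the whole CRLF */
    return offset + 1;        /* ordinary advance by one character */
}
```

For the subject "\r\nA" and a failure at offset 0, the sketch returns 2
for a pattern such as .+A and 1 for a pattern such as [\r\n]A.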
1174
1175 An explicit match for CR or LF is either a literal appearance of one of
1176 those characters, or one of the \r or \n escape sequences. Implicit
1177 matches such as [^X] do not count, nor does \s (which includes CR and
1178 LF in the characters that it matches).
1179
1180 Notwithstanding the above, anomalous effects may still occur when CRLF
1181 is a valid newline sequence and explicit \r or \n escapes appear in the
1182 pattern.
1183
1184 PCRE_NOTBOL
1185
1186 This option specifies that the first character of the subject string
1187 is not the beginning of a line, so the circumflex metacharacter should
1188 not match before it. Setting this without PCRE_MULTILINE (at compile
1189 time) causes circumflex never to match. This option affects only the
1190 behaviour of the circumflex metacharacter. It does not affect \A.
1191
1192 PCRE_NOTEOL
1193
1194 This option specifies that the end of the subject string is not the end
1195 of a line, so the dollar metacharacter should not match it nor (except
1196 in multiline mode) a newline immediately before it. Setting this with‐
1197 out PCRE_MULTILINE (at compile time) causes dollar never to match. This
1198 option affects only the behaviour of the dollar metacharacter. It does
1199 not affect \Z or \z.
1200
1201 PCRE_NOTEMPTY
1202
1203 An empty string is not considered to be a valid match if this option is
1204 set. If there are alternatives in the pattern, they are tried. If all
1205 the alternatives match the empty string, the entire match fails. For
1206 example, if the pattern
1207
1208 a?b?
1209
1210 is applied to a string not beginning with "a" or "b", it matches the
1211 empty string at the start of the subject. With PCRE_NOTEMPTY set, this
1212 match is not valid, so PCRE searches further into the string for occur‐
1213 rences of "a" or "b".
1214
1215 Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe‐
1216 cial case of a pattern match of the empty string within its split()
1217 function, and when using the /g modifier. It is possible to emulate
1218 Perl's behaviour after matching a null string by first trying the match
1219 again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then
1220 if that fails by advancing the starting offset (see below) and trying
1221 an ordinary match again. There is some code that demonstrates how to do
1222 this in the pcredemo.c sample program.
1223
1224 PCRE_NO_UTF8_CHECK
1225
1226 When PCRE_UTF8 is set at compile time, the validity of the subject as a
1227 UTF-8 string is automatically checked when pcre_exec() is subsequently
1228 called. The value of startoffset is also checked to ensure that it
1229 points to the start of a UTF-8 character. There is a discussion about
1230 the validity of UTF-8 strings in the section on UTF-8 support in the
1231 main pcre page. If an invalid UTF-8 sequence of bytes is found,
1232 pcre_exec() returns the error PCRE_ERROR_BADUTF8. Information about the
1233 precise nature of the error may also be returned (see the descriptions
1234 of these errors in the section entitled Error return values from
1235 pcre_exec() below). If startoffset contains a value that does not
1236 point to the start of a UTF-8 character (or to the end of the subject),
1237 PCRE_ERROR_BADUTF8_OFFSET is returned.
1238
1239 If you already know that your subject is valid, and you want to skip
1240 these checks for performance reasons, you can set the
1241 PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to
1242 do this for the second and subsequent calls to pcre_exec() if you are
1243 making repeated calls to find all the matches in a single subject
1244 string. However, you should be sure that the value of startoffset
1245 points to the start of a UTF-8 character (or the end of the subject).
1246 When PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid UTF-8
1247 string as a subject or an invalid value of startoffset is undefined.
1248 Your program may crash.
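
Because the library no longer checks startoffset when this option is
set, a caller may wish to check it first. The following is a minimal
sketch; is_utf8_char_start() is a hypothetical helper, not part of
PCRE, and it tests only the offset, not the validity of the string as a
whole:

```c
/* Hypothetical helper, not part of PCRE. Returns 1 if `offset` is the
   end of the subject or points at a byte that is not a UTF-8
   continuation byte (bit pattern 10xxxxxx), and 0 otherwise. It does
   not check that the subject is valid UTF-8. */
int is_utf8_char_start(const char *subject, int length, int offset)
{
    if (offset < 0 || offset > length)
        return 0;
    if (offset == length)
        return 1;             /* the end of the subject is acceptable */
    return (((unsigned char)subject[offset]) & 0xC0) != 0x80;
}
```

In a loop that finds all matches, the second element of the previous
match's offset pair always satisfies this test when the subject is
valid UTF-8.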
1249
1250 PCRE_PARTIAL
1251
1252 This option turns on the partial matching feature. If the subject
1253 string fails to match the pattern, but at some point during the match‐
1254 ing process the end of the subject was reached (that is, the subject
1255 partially matches the pattern and the failure to match occurred only
1256 because there were not enough subject characters), pcre_exec() returns
1257 PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is
1258 used, there are restrictions on what may appear in the pattern. These
1259 are discussed in the pcrepartial documentation.
1260
1261 The string to be matched by pcre_exec()
1262
1263 The subject string is passed to pcre_exec() as a pointer in subject, a
1264 length (in bytes) in length, and a starting byte offset in startoffset.
1265 If this is negative or greater than the length of the subject,
1266 pcre_exec() returns PCRE_ERROR_BADOFFSET.
1267
1268 In UTF-8 mode, the byte offset must point to the start of a UTF-8 char‐
1269 acter (or the end of the subject). Unlike the pattern string, the sub‐
1270 ject may contain binary zero bytes. When the starting offset is zero,
1271 the search for a match starts at the beginning of the subject, and this
1272 is by far the most common case.
1273
1274 A non-zero starting offset is useful when searching for another match
1275 in the same subject by calling pcre_exec() again after a previous suc‐
1276 cess. Setting startoffset differs from just passing over a shortened
1277 string and setting PCRE_NOTBOL in the case of a pattern that begins
1278 with any kind of lookbehind. For example, consider the pattern
1279
1280 \Biss\B
1281
1282 which finds occurrences of "iss" in the middle of words. (\B matches
1283 only if the current position in the subject is not a word boundary.)
1284 When applied to the string "Mississippi" the first call to pcre_exec()
1285 finds the first occurrence. If pcre_exec() is called again with just
1286 the remainder of the subject, namely "issippi", it does not match,
1287 because \B is always false at the start of the subject, which is deemed
1288 to be a word boundary. However, if pcre_exec() is passed the entire
1289 string again, but with startoffset set to 4, it finds the second occur‐
1290 rence of "iss" because it is able to look behind the starting point to
1291 discover that it is preceded by a letter.
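
The difference can be reproduced without the library. The following is
a hand-written stand-in for the pattern \Biss\B, assuming ASCII word
characters (letters, digits and underscore); find_inner_iss() is a
hypothetical function and is not how PCRE matches. Like pcre_exec(), it
begins searching at startoffset but may inspect the byte before it:

```c
#include <ctype.h>

/* Returns 1 if position pos holds a word character. Positions outside
   the subject count as non-word, which makes both ends of the subject
   word boundaries when they adjoin a word character. */
static int is_word(const char *s, int length, int pos)
{
    if (pos < 0 || pos >= length)
        return 0;
    return isalnum((unsigned char)s[pos]) || s[pos] == '_';
}

/* Hypothetical stand-in for matching \Biss\B, not PCRE code. Returns
   the offset of the first match at or after startoffset, or -1. */
int find_inner_iss(const char *subject, int length, int startoffset)
{
    int pos;
    for (pos = startoffset; pos + 3 <= length; pos++) {
        if (subject[pos] != 'i' || subject[pos + 1] != 's' ||
            subject[pos + 2] != 's')
            continue;
        /* \B before "iss": the bytes on either side of pos must be of
           the same kind. This reads subject[pos - 1], which may lie
           before startoffset. */
        if (is_word(subject, length, pos - 1) != is_word(subject, length, pos))
            continue;
        /* \B after "iss". */
        if (is_word(subject, length, pos + 2) != is_word(subject, length, pos + 3))
            continue;
        return pos;
    }
    return -1;
}
```

Called on the whole example string with startoffset 4, the sketch finds
the second "iss" at offset 4. Called on the remainder alone with
startoffset 0, it returns -1, because nothing precedes the first byte.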
1292
1293 If a non-zero starting offset is passed when the pattern is anchored,
1294 one attempt to match at the given offset is made. This can only succeed
1295 if the pattern does not require the match to be at the start of the
1296 subject.
1297
1298 How pcre_exec() returns captured substrings
1299
1300 In general, a pattern matches a certain portion of the subject, and in
1301 addition, further substrings from the subject may be picked out by
1302 parts of the pattern. Following the usage in Jeffrey Friedl's book,
1303 this is called "capturing" in what follows, and the phrase "capturing
1304 subpattern" is used for a fragment of a pattern that picks out a sub‐
1305 string. PCRE supports several other kinds of parenthesized subpattern
1306 that do not cause substrings to be captured.
1307
1308 Captured substrings are returned to the caller via a vector of integers
1309 whose address is passed in ovector. The number of elements in the vec‐
1310 tor is passed in ovecsize, which must be a non-negative number. Note:
1311 this argument is NOT the size of ovector in bytes.
1312
1313 The first two-thirds of the vector is used to pass back captured sub‐
1314 strings, each substring using a pair of integers. The remaining third
1315 of the vector is used as workspace by pcre_exec() while matching cap‐
1316 turing subpatterns, and is not available for passing back information.
1317 The number passed in ovecsize should always be a multiple of three. If
1318 it is not, it is rounded down.
1319
1320 When a match is successful, information about captured substrings is
1321 returned in pairs of integers, starting at the beginning of ovector,
1322 and continuing up to two-thirds of its length at the most. The first
1323 element of each pair is set to the byte offset of the first character
1324 in a substring, and the second is set to the byte offset of the first
1325 character after the end of a substring. Note: these values are always
1326 byte offsets, even in UTF-8 mode. They are not character counts.
1327
1328 The first pair of integers, ovector[0] and ovector[1], identify the
1329 portion of the subject string matched by the entire pattern. The next
1330 pair is used for the first capturing subpattern, and so on. The value
1331 returned by pcre_exec() is one more than the highest numbered pair that
1332 has been set. For example, if two substrings have been captured, the
1333 returned value is 3. If there are no capturing subpatterns, the return
1334 value from a successful match is 1, indicating that just the first pair
1335 of offsets has been set.
1336
1337 If a capturing subpattern is matched repeatedly, it is the last portion
1338 of the string that it matched that is returned.
1339
1340 If the vector is too small to hold all the captured substring offsets,
1341 it is used as far as possible (up to two-thirds of its length), and the
1342 function returns a value of zero. If the substring offsets are not of
1343 interest, pcre_exec() may be called with ovector passed as NULL and
1344 ovecsize as zero. However, if the pattern contains back references and
1345 the ovector is not big enough to remember the related substrings, PCRE
1346 has to get additional memory for use during matching. Thus it is usu‐
1347 ally advisable to supply an ovector.
1348
1349 The pcre_info() function can be used to find out how many capturing
1350 subpatterns there are in a compiled pattern. The smallest size for
1351 ovector that will allow for n captured substrings, in addition to the
1352 offsets of the substring matched by the whole pattern, is (n+1)*3.
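
The sizing rule above can be written down directly. The following is a minimal sketch, not part of the PCRE API: the helper names ovector_size_for(), usable_pairs(), and alloc_ovector() are hypothetical, and the capture count is assumed to have been obtained from the compiled pattern beforehand.

```c
#include <assert.h>
#include <stdlib.h>

/* Smallest ovecsize that holds n captured substrings plus the pair
   for the whole match: (n + 1) * 3. */
static int ovector_size_for(int capture_count)
{
    assert(capture_count >= 0);
    return (capture_count + 1) * 3;
}

/* Number of offset pairs that can be reported for a given ovecsize.
   The size is rounded down to a multiple of three and only the first
   two-thirds carry pairs, which works out to ovecsize / 3. */
static int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}

/* Allocate a vector sized for capture_count groups. The caller frees
   it with free(). Returns NULL if the allocation fails. */
static int *alloc_ovector(int capture_count, int *ovecsize_out)
{
    int size = ovector_size_for(capture_count);
    int *v = malloc((size_t)size * sizeof *v);
    if (v != NULL && ovecsize_out != NULL)
        *ovecsize_out = size;
    return v;
}
```

For a pattern with three capturing subpatterns this gives a vector of 12 integers, of which the first 8 hold the four offset pairs.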
1353
1354 It is possible for capturing subpattern number n+1 to match some part
1355 of the subject when subpattern n has not been used at all. For example,
1356 if the string "abc" is matched against the pattern (a|(z))(bc) the
1357 return from the function is 4, and subpatterns 1 and 3 are matched, but
1358 2 is not. When this happens, both values in the offset pairs corre‐
1359 sponding to unused subpatterns are set to -1.
1360
1361 Offset values that correspond to unused subpatterns at the end of the
1362 expression are also set to -1. For example, if the string "abc" is
1363 matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
1364 matched. The return from the function is 2, because the highest used
1365 capturing subpattern number is 1. However, you can refer to the offsets
1366 for the second and third capturing subpatterns if you wish (assuming
1367 the vector is large enough, of course).
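
The pair layout and the -1 convention for unset subpatterns can be illustrated without calling PCRE at all. The sketch below assumes an ovector that has already been filled in as the text describes for "abc" matched against (a|(z))(bc), that is {0,3, 0,1, -1,-1, 1,3} with a return value of 4. The helper names capture_length() and print_captures() are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Length in bytes of capture n, or -1 if the subpattern is unset
   (its offsets are -1) or n is outside the range reported by rc.
   Trailing unset groups beyond rc are not examined by this sketch. */
static int capture_length(const int *ovector, int rc, int n)
{
    assert(ovector != NULL);
    if (n < 0 || n >= rc || ovector[2 * n] < 0)
        return -1;
    return ovector[2 * n + 1] - ovector[2 * n];
}

/* Print every pair that a successful match reported. Offsets are
   byte offsets, so they index the subject directly. */
static void print_captures(const char *subject, const int *ovector, int rc)
{
    for (int i = 0; i < rc; i++) {
        int len = capture_length(ovector, rc, i);
        if (len < 0)
            printf("%d: <unset>\n", i);
        else
            printf("%d: \"%.*s\"\n", i, len, subject + ovector[2 * i]);
    }
}
```

Checking the first offset of a pair for a negative value is enough to tell an unset subpattern from one that matched an empty string, because an empty match has two equal, non-negative offsets.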
1368
1369 Some convenience functions are provided for extracting the captured
1370 substrings as separate strings. These are described below.
1371
1372 Error return values from pcre_exec()
1373
1374 If pcre_exec() fails, it returns a negative number. The following are
1375 defined in the header file:
1376
1377 PCRE_ERROR_NOMATCH (-1)
1378
1379 The subject string did not match the pattern.
1380
1381 PCRE_ERROR_NULL (-2)
1382
1383 Either code or subject was passed as NULL, or ovector was NULL and
1384 ovecsize was not zero.
1385
1386 PCRE_ERROR_BADOPTION (-3)
1387
1388 An unrecognized bit was set in the options argument.
1389
1390 PCRE_ERROR_BADMAGIC (-4)
1391
1392 PCRE stores a 4-byte "magic number" at the start of the compiled code,
1393 to catch the case when it is passed a junk pointer and to detect when a
1394 pattern that was compiled in an environment of one endianness is run in
1395 an environment with the other endianness. This is the error that PCRE
1396 gives when the magic number is not present.
1397
1398 PCRE_ERROR_UNKNOWN_OPCODE (-5)
1399
1400 While running the pattern match, an unknown item was encountered in the
1401 compiled pattern. This error could be caused by a bug in PCRE or by
1402 overwriting of the compiled pattern.
1403
1404 PCRE_ERROR_NOMEMORY (-6)
1405
1406 If a pattern contains back references, but the ovector that is passed
1407 to pcre_exec() is not big enough to remember the referenced substrings,
1408 PCRE gets a block of memory at the start of matching to use for this
1409 purpose. If the call via pcre_malloc() fails, this error is given. The
1410 memory is automatically freed at the end of matching.
1411
1412 PCRE_ERROR_NOSUBSTRING (-7)
1413
1414 This error is used by the pcre_copy_substring(), pcre_get_substring(),
1415 and pcre_get_substring_list() functions (see below). It is never
1416 returned by pcre_exec().
1417
1418 PCRE_ERROR_MATCHLIMIT (-8)
1419
1420 The backtracking limit, as specified by the match_limit field in a
1421 pcre_extra structure (or defaulted) was reached. See the description
1422 above.
1423
1424 PCRE_ERROR_CALLOUT (-9)
1425
1426 This error is never generated by pcre_exec() itself. It is provided for
1427 use by callout functions that want to yield a distinctive error code.
1428 See the pcrecallout documentation for details.
1429
1430 PCRE_ERROR_BADUTF8 (-10)
1431
1432 A string that contains an invalid UTF-8 byte sequence was passed as a
1433 subject, and the PCRE_NO_UTF8_CHECK option was not set. If the size of
1434 the output vector (ovecsize) is at least 2, the byte offset to the
1435 start of the invalid UTF-8 character is placed in the first element,
1436 and a reason code is placed in the second element. The reason
1437 codes are listed in the following section.
1438
1439 PCRE_ERROR_BADUTF8_OFFSET (-11)
1440
1441 The UTF-8 byte sequence that was passed as a subject was checked and
1442 found to be valid (the PCRE_NO_UTF8_CHECK option was not set), but the
1443 value of startoffset did not point to the beginning of a UTF-8 charac‐
1444 ter.
1445
1446 PCRE_ERROR_PARTIAL (-12)
1447
1448 The subject string did not match, but it did match partially. See the
1449 pcrepartial documentation for details of partial matching.
1450
1451 PCRE_ERROR_BADPARTIAL (-13)
1452
1453 The PCRE_PARTIAL option was used with a compiled pattern containing
1454 items that are not supported for partial matching. See the pcrepartial
1455 documentation for details of partial matching.
1456
1457 PCRE_ERROR_INTERNAL (-14)
1458
1459 An unexpected internal error has occurred. This error could be caused
1460 by a bug in PCRE or by overwriting of the compiled pattern.
1461
1462 PCRE_ERROR_BADCOUNT (-15)
1463
1464 This error is given if the value of the ovecsize argument is negative.
1465
1466 PCRE_ERROR_RECURSIONLIMIT (-21)
1467
1468 The internal recursion limit, as specified by the match_limit_recursion
1469 field in a pcre_extra structure (or defaulted) was reached. See the
1470 description above.
1471
1472 PCRE_ERROR_BADNEWLINE (-23)
1473
1474 An invalid combination of PCRE_NEWLINE_xxx options was given.
1475
1476 PCRE_ERROR_BADOFFSET (-24)
1477
1478 The value of startoffset was negative or greater than the length of the
1479 subject, that is, the value in length.
1480
1481 Error numbers -16 to -20 and -22 are not used by pcre_exec().
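
When logging failures it can help to turn the numeric code back into its name. The following is only a sketch: the function name exec_error_name() is hypothetical, and the numbers are copied from the list above. Real code would include pcre.h and switch on the PCRE_ERROR_xxx macros rather than on literals.

```c
#include <stdio.h>

/* Map a negative return value to the name given in the list above.
   Codes that the list does not mention fall through to the default. */
static const char *exec_error_name(int rc)
{
    switch (rc) {
    case -1:  return "PCRE_ERROR_NOMATCH";
    case -2:  return "PCRE_ERROR_NULL";
    case -3:  return "PCRE_ERROR_BADOPTION";
    case -4:  return "PCRE_ERROR_BADMAGIC";
    case -5:  return "PCRE_ERROR_UNKNOWN_OPCODE";
    case -6:  return "PCRE_ERROR_NOMEMORY";
    case -7:  return "PCRE_ERROR_NOSUBSTRING";
    case -8:  return "PCRE_ERROR_MATCHLIMIT";
    case -9:  return "PCRE_ERROR_CALLOUT";
    case -10: return "PCRE_ERROR_BADUTF8";
    case -11: return "PCRE_ERROR_BADUTF8_OFFSET";
    case -12: return "PCRE_ERROR_PARTIAL";
    case -13: return "PCRE_ERROR_BADPARTIAL";
    case -14: return "PCRE_ERROR_INTERNAL";
    case -15: return "PCRE_ERROR_BADCOUNT";
    case -21: return "PCRE_ERROR_RECURSIONLIMIT";
    case -23: return "PCRE_ERROR_BADNEWLINE";
    case -24: return "PCRE_ERROR_BADOFFSET";
    default:  return "unknown or unused error code";
    }
}

/* Report a failed call on stderr. */
static void report_exec_error(int rc)
{
    fprintf(stderr, "matching failed: %s (%d)\n", exec_error_name(rc), rc);
}
```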
1482
1483 Reason codes for invalid UTF-8 strings
1484
1485 When pcre_exec() returns either PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORT‐
1486 UTF8, and the size of the output vector (ovecsize) is at least 2, the
1487 offset of the start of the invalid UTF-8 character is placed in the
1488 first output vector element (ovector[0]) and a reason code is placed in
1489 the second element (ovector[1]). The reason codes are given names in
1490 the pcre.h header file:
1491
1492 PCRE_UTF8_ERR1
1493 PCRE_UTF8_ERR2
1494 PCRE_UTF8_ERR3
1495 PCRE_UTF8_ERR4
1496 PCRE_UTF8_ERR5
1497
1498 The string ends with a truncated UTF-8 character; the code specifies
1499 how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
1500 characters to be no longer than 4 bytes, the encoding scheme (origi‐
1501 nally defined by RFC 2279) allows for up to 6 bytes, and this is
1502 checked first; hence the possibility of 4 or 5 missing bytes.
1503
1504 PCRE_UTF8_ERR6
1505 PCRE_UTF8_ERR7
1506 PCRE_UTF8_ERR8
1507 PCRE_UTF8_ERR9
1508 PCRE_UTF8_ERR10
1509
1510 The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
1511 the character do not have the binary value 0b10 (that is, either the
1512 most significant bit is 0, or the next bit is 1).
1513
1514 PCRE_UTF8_ERR11
1515 PCRE_UTF8_ERR12
1516
1517 A character that is valid by the RFC 2279 rules is either 5 or 6 bytes
1518 long; these code points are excluded by RFC 3629.
1519
1520 PCRE_UTF8_ERR13
1521
1522 A 4-byte character has a value greater than 0x10ffff; these code points
1523 are excluded by RFC 3629.
1524
1525 PCRE_UTF8_ERR14
1526
1527 A 3-byte character has a value in the range 0xd800 to 0xdfff; this
1528 range of code points is reserved by RFC 3629 for use with UTF-16, and
1529 so is excluded from UTF-8.
1530
1531 PCRE_UTF8_ERR15
1532 PCRE_UTF8_ERR16
1533 PCRE_UTF8_ERR17
1534 PCRE_UTF8_ERR18
1535 PCRE_UTF8_ERR19
1536
1537 A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
1538 for a value that can be represented by fewer bytes, which is invalid.
1539 For example, the two bytes 0xc0, 0xae give the value 0x2e, whose cor‐
1540 rect coding uses just one byte.
1541
1542 PCRE_UTF8_ERR20
1543
1544 The two most significant bits of the first byte of a character have the
1545 binary value 0b10 (that is, the most significant bit is 1 and the sec‐
1546 ond is 0). Such a byte can only validly occur as the second or subse‐
1547 quent byte of a multi-byte character.
1548
1549 PCRE_UTF8_ERR21
1550
1551 The first byte of a character has the value 0xfe or 0xff. These values
1552 can never occur in a valid UTF-8 string.
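
Several of the conditions behind these reason codes are simple bit tests. The sketch below shows a few of them in isolation. It is not PCRE's validation code, and all function names are hypothetical. It covers the continuation-byte test, the overlong 2-byte case from the 0xc0, 0xae example, and the code point ranges excluded by RFC 3629.

```c
/* Second and later bytes of a character must have 0b10 as their two
   most significant bits. */
static int is_continuation(unsigned char b)
{
    return (b & 0xc0) == 0x80;
}

/* Value coded by a 2-byte sequence, computed without any validation,
   to show what an overlong form "codes for". */
static unsigned decode2(unsigned char b0, unsigned char b1)
{
    return ((unsigned)(b0 & 0x1f) << 6) | (unsigned)(b1 & 0x3f);
}

/* A 2-byte sequence is overlong if its value fits in one byte. */
static int is_overlong2(unsigned char b0, unsigned char b1)
{
    return (b0 & 0xe0) == 0xc0 && is_continuation(b1)
        && decode2(b0, b1) < 0x80;
}

/* Code points excluded by RFC 3629: above 0x10ffff, or in the
   surrogate range 0xd800 to 0xdfff. */
static int is_excluded_code_point(unsigned long cp)
{
    return cp > 0x10fffful || (cp >= 0xd800ul && cp <= 0xdffful);
}

/* First bytes that can never start a valid character: 0xfe and 0xff,
   or a byte that looks like a continuation byte. */
static int is_invalid_first_byte(unsigned char b)
{
    return b == 0xfe || b == 0xff || is_continuation(b);
}
```

With these helpers, decode2(0xc0, 0xae) yields 0x2e, which is why that pair is rejected as overlong.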
1553
1555
1556 int pcre_copy_substring(const char *subject, int *ovector,
1557 int stringcount, int stringnumber, char *buffer,
1558 int buffersize);
1559
1560 int pcre_get_substring(const char *subject, int *ovector,
1561 int stringcount, int stringnumber,
1562 const char **stringptr);
1563
1564 int pcre_get_substring_list(const char *subject,
1565 int *ovector, int stringcount, const char ***listptr);
1566
1567 Captured substrings can be accessed directly by using the offsets
1568 returned by pcre_exec() in ovector. For convenience, the functions
1569 pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub‐
1570 string_list() are provided for extracting captured substrings as new,
1571 separate, zero-terminated strings. These functions identify substrings
1572 by number. The next section describes functions for extracting named
1573 substrings.
1574
1575 A substring that contains a binary zero is correctly extracted and has
1576 a further zero added on the end, but the result is not, of course, a C
1577 string. However, you can process such a string by referring to the
1578 length that is returned by pcre_copy_substring() and pcre_get_sub‐
1579 string(). Unfortunately, the interface to pcre_get_substring_list() is
1580 not adequate for handling strings containing binary zeros, because the
1581 end of the final string is not independently indicated.
1582
1583 The first three arguments are the same for all three of these func‐
1584 tions: subject is the subject string that has just been successfully
1585 matched, ovector is a pointer to the vector of integer offsets that was
1586 passed to pcre_exec(), and stringcount is the number of substrings that
1587 were captured by the match, including the substring that matched the
1588 entire regular expression. This is the value returned by pcre_exec() if
1589 it is greater than zero. If pcre_exec() returned zero, indicating that
1590 it ran out of space in ovector, the value passed as stringcount should
1591 be the number of elements in the vector divided by three.
1592
1593 The functions pcre_copy_substring() and pcre_get_substring() extract a
1594 single substring, whose number is given as stringnumber. A value of
1595 zero extracts the substring that matched the entire pattern, whereas
1596 higher values extract the captured substrings. For pcre_copy_sub‐
1597 string(), the string is placed in buffer, whose length is given by
1598 buffersize, while for pcre_get_substring() a new block of memory is
1599 obtained via pcre_malloc, and its address is returned via stringptr.
1600 The yield of the function is the length of the string, not including
1601 the terminating zero, or one of these error codes:
1602
1603 PCRE_ERROR_NOMEMORY (-6)
1604
1605 The buffer was too small for pcre_copy_substring(), or the attempt to
1606 get memory failed for pcre_get_substring().
1607
1608 PCRE_ERROR_NOSUBSTRING (-7)
1609
1610 There is no substring whose number is stringnumber.
1611
1612 The pcre_get_substring_list() function extracts all available sub‐
1613 strings and builds a list of pointers to them. All this is done in a
1614 single block of memory that is obtained via pcre_malloc. The address of
1615 the memory block is returned via listptr, which is also the start of
1616 the list of string pointers. The end of the list is marked by a NULL
1617 pointer. The yield of the function is zero if all went well, or the
1618 error code
1619
1620 PCRE_ERROR_NOMEMORY (-6)
1621
1622 if the attempt to get the memory block failed.
1623
1624 When any of these functions encounters a substring that is unset, which
1625 can happen when capturing subpattern number n+1 matches some part of
1626 the subject, but subpattern n has not been used at all, they return an
1627 empty string. This can be distinguished from a genuine zero-length sub‐
1628 string by inspecting the appropriate offset in ovector, which is nega‐
1629 tive for unset substrings.
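
To make the contract of the by-number functions concrete, here is an illustrative re-implementation of the behaviour described above for pcre_copy_substring(). It is a sketch written from this description, not PCRE's source. The name copy_substring_sketch() and the SKETCH_ constants are hypothetical, and the constants simply mirror the values -6 and -7 given in the text.

```c
#include <string.h>

enum { SKETCH_NOMEMORY = -6, SKETCH_NOSUBSTRING = -7 };

/* Copy substring number stringnumber into buffer as a zero-terminated
   string. Returns its length (excluding the terminating zero),
   SKETCH_NOSUBSTRING if there is no such substring, or
   SKETCH_NOMEMORY if the buffer is too small. */
static int copy_substring_sketch(const char *subject, const int *ovector,
                                 int stringcount, int stringnumber,
                                 char *buffer, int buffersize)
{
    if (stringnumber < 0 || stringnumber >= stringcount)
        return SKETCH_NOSUBSTRING;

    int start = ovector[2 * stringnumber];
    int len = ovector[2 * stringnumber + 1] - start;

    /* An unset subpattern has negative offsets; it yields "". */
    if (start < 0) {
        start = 0;
        len = 0;
    }

    if (buffersize < len + 1)
        return SKETCH_NOMEMORY;

    memcpy(buffer, subject + start, (size_t)len);
    buffer[len] = '\0';
    return len;
}
```

Because the length is returned separately, a caller can still process a substring that contains binary zeros by using the return value instead of strlen().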
1630
1631 The two convenience functions pcre_free_substring() and pcre_free_sub‐
1632 string_list() can be used to free the memory returned by a previous
1633 call of pcre_get_substring() or pcre_get_substring_list(), respec‐
1634 tively. They do nothing more than call the function pointed to by
1635 pcre_free, which of course could be called directly from a C program.
1636 However, PCRE is used in some situations where it is linked via a spe‐
1637 cial interface to another programming language that cannot use
1638 pcre_free directly; it is for these cases that the functions are pro‐
1639 vided.
1640
1642
1643 int pcre_get_stringnumber(const pcre *code,
1644 const char *name);
1645
1646 int pcre_copy_named_substring(const pcre *code,
1647 const char *subject, int *ovector,
1648 int stringcount, const char *stringname,
1649 char *buffer, int buffersize);
1650
1651 int pcre_get_named_substring(const pcre *code,
1652 const char *subject, int *ovector,
1653 int stringcount, const char *stringname,
1654 const char **stringptr);
1655
1656 To extract a substring by name, you first have to find the associated
1657 number. For example, for this pattern
1658
1659 (a+)b(?<xxx>\d+)...
1660
1661 the number of the subpattern called "xxx" is 2. If the name is known to
1662 be unique (PCRE_DUPNAMES was not set), you can find the number from the
1663 name by calling pcre_get_stringnumber(). The first argument is the com‐
1664 piled pattern, and the second is the name. The yield of the function is
1665 the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no
1666 subpattern of that name.
1667
1668 Given the number, you can extract the substring directly, or use one of
1669 the functions described in the previous section. For convenience, there
1670 are also two functions that do the whole job.
1671
1672 Most of the arguments of pcre_copy_named_substring() and
1673 pcre_get_named_substring() are the same as those for the similarly
1674 named functions that extract by number. As these are described in the
1675 previous section, they are not re-described here. There are just two
1676 differences:
1677
1678 First, instead of a substring number, a substring name is given. Sec‐
1679 ond, there is an extra argument, given at the start, which is a pointer
1680 to the compiled pattern. This is needed in order to gain access to the
1681 name-to-number translation table.
1682
1683 These functions call pcre_get_stringnumber(), and if it succeeds, they
1684 then call pcre_copy_substring() or pcre_get_substring(), as appropri‐
1685 ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the
1686 behaviour may not be what you want (see the next section).
1687
1689
1690 int pcre_get_stringtable_entries(const pcre *code,
1691 const char *name, char **first, char **last);
1692
1693 When a pattern is compiled with the PCRE_DUPNAMES option, names for
1694 subpatterns are not required to be unique. Normally, patterns with
1695 duplicate names are such that in any one match, only one of the named
1696 subpatterns participates. An example is shown in the pcrepattern docu‐
1697 mentation.
1698
1699 When duplicates are present, pcre_copy_named_substring() and
1700 pcre_get_named_substring() return the first substring corresponding to
1701 the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING
1702 (-7) is returned; no data is returned. The pcre_get_stringnumber()
1703 function returns one of the numbers that are associated with the name,
1704 but it is not defined which it is.
1705
1706 If you want to get full details of all captured substrings for a given
1707 name, you must use the pcre_get_stringtable_entries() function. The
1708 first argument is the compiled pattern, and the second is the name. The
1709 third and fourth are pointers to variables which are updated by the
1710 function. After it has run, they point to the first and last entries in
1711 the name-to-number table for the given name. The function itself
1712 returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
1713 there are none. The format of the table is described above in the section
1714 entitled Information about a pattern. Given all the relevant entries for
1715 the name, you can extract each of their numbers, and hence the captured
1716 data, if any.
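
Walking the entries between first and last might look like the sketch below. It assumes the entry layout given in the section on information about a pattern for the 8-bit library: each entry starts with the subpattern number in two bytes, most significant byte first, followed by the zero-terminated name. The helper names are hypothetical, and entry_size stands for the value returned by pcre_get_stringtable_entries().

```c
#include <stdio.h>

/* Subpattern number stored in the first two bytes of an entry,
   most significant byte first (assumed layout, see above). */
static int entry_group_number(const char *entry)
{
    const unsigned char *p = (const unsigned char *)entry;
    return (p[0] << 8) | p[1];
}

/* Visit every entry from first to last inclusive. Stores up to max
   subpattern numbers in out[] and returns the number of entries seen.
   Returns 0 if entry_size is not positive. */
static int collect_group_numbers(const char *first, const char *last,
                                 int entry_size, int *out, int max)
{
    int count = 0;
    if (entry_size <= 0)
        return 0;
    for (const char *e = first; e <= last; e += entry_size) {
        if (count < max)
            out[count] = entry_group_number(e);
        count++;
    }
    return count;
}

/* Print each entry as "name -> number"; the name starts at byte 2. */
static void print_entries(const char *first, const char *last, int entry_size)
{
    if (entry_size <= 0)
        return;
    for (const char *e = first; e <= last; e += entry_size)
        printf("%s -> %d\n", e + 2, entry_group_number(e));
}
```

Each number collected this way can then be checked against ovector, or passed to one of the by-number extraction functions, to find which of the duplicate-named subpatterns actually took part in the match.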
1718
1720
1721 The traditional matching function uses a similar algorithm to Perl,
1722 which stops when it finds the first match, starting at a given point in
1723 the subject. If you want to find all possible matches, or the longest
1724 possible match, consider using the alternative matching function (see
1725 below) instead. If you cannot use the alternative function, but still
1726 need to find all possible matches, you can kludge it up by making use
1727 of the callout facility, which is described in the pcrecallout documen‐
1728 tation.
1729
1730 What you have to do is to insert a callout right at the end of the pat‐
1731 tern. When your callout function is called, extract and save the cur‐
1732 rent matched substring. Then return 1, which forces pcre_exec() to
1733 backtrack and try other alternatives. Ultimately, when it runs out of
1734 matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
1735
1737
1738 int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
1739 const char *subject, int length, int startoffset,
1740 int options, int *ovector, int ovecsize,
1741 int *workspace, int wscount);
1742
1743 The function pcre_dfa_exec() is called to match a subject string
1744 against a compiled pattern, using a matching algorithm that scans the
1745 subject string just once, and does not backtrack. This has different
1746 characteristics to the normal algorithm, and is not compatible with
1747 Perl. Some of the features of PCRE patterns are not supported. Never‐
1748 theless, there are times when this kind of matching can be useful. For
1749 a discussion of the two matching algorithms, see the pcrematching docu‐
1750 mentation.
1751
1752 The arguments for the pcre_dfa_exec() function are the same as for
1753 pcre_exec(), plus two extras. The ovector argument is used in a differ‐
1754 ent way, and this is described below. The other common arguments are
1755 used in the same way as for pcre_exec(), so their description is not
1756 repeated here.
1757
1758 The two additional arguments provide workspace for the function. The
1759 workspace vector should contain at least 20 elements. It is used for
1760 keeping track of multiple paths through the pattern tree. More
1761 workspace will be needed for patterns and subjects where there are a
1762 lot of potential matches.
1763
1764 Here is an example of a simple call to pcre_dfa_exec():
1765
1766 int rc;
1767 int ovector[10];
1768 int wspace[20];
1769 rc = pcre_dfa_exec(
1770 re, /* result of pcre_compile() */
1771 NULL, /* we didn't study the pattern */
1772 "some string", /* the subject string */
1773 11, /* the length of the subject string */
1774 0, /* start at offset 0 in the subject */
1775 0, /* default options */
1776 ovector, /* vector of integers for substring information */
1777 10, /* number of elements (NOT size in bytes) */
1778 wspace, /* working space vector */
1779 20); /* number of elements (NOT size in bytes) */
1780
1781 Option bits for pcre_dfa_exec()
1782
1783 The unused bits of the options argument for pcre_dfa_exec() must be
1784 zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW‐
1785 LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK,
1786 PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
1787 three of these are the same as for pcre_exec(), so their description is
1788 not repeated here.
1789
1790 PCRE_PARTIAL
1791
1792 This has the same general effect as it does for pcre_exec(), but the
1793 details are slightly different. When PCRE_PARTIAL is set for
1794 pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into
1795 PCRE_ERROR_PARTIAL if the end of the subject is reached, there have
1796 been no complete matches, but there is still at least one matching pos‐
1797 sibility. The portion of the string that provided the partial match is
1798 set as the first matching string.
1799
1800 PCRE_DFA_SHORTEST
1801
1802 Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to
1803 stop as soon as it has found one match. Because of the way the alterna‐
1804 tive algorithm works, this is necessarily the shortest possible match
1805 at the first possible matching point in the subject string.
1806
1807 PCRE_DFA_RESTART
1808
1809 When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and
1810 returns a partial match, it is possible to call it again, with addi‐
1811 tional subject characters, and have it continue with the same match.
1812 The PCRE_DFA_RESTART option requests this action; when it is set, the
1813 workspace and wscount arguments must reference the same vector as before
1814 because data about the match so far is left in them after a partial
1815 match. There is more discussion of this facility in the pcrepartial
1816 documentation.
1817
1818 Successful returns from pcre_dfa_exec()
1819
1820 When pcre_dfa_exec() succeeds, it may have matched more than one sub‐
1821 string in the subject. Note, however, that all the matches from one run
1822 of the function start at the same point in the subject. The shorter
1823 matches are all initial substrings of the longer matches. For example,
1824 if the pattern
1825
1826 <.*>
1827
1828 is matched against the string
1829
1830 This is <something> <something else> <something further> no more
1831
1832 the three matched strings are
1833
1834 <something>
1835 <something> <something else>
1836 <something> <something else> <something further>
1837
1838 On success, the yield of the function is a number greater than zero,
1839 which is the number of matched substrings. The substrings themselves
1840 are returned in ovector. Each string uses two elements; the first is
1841 the offset to the start, and the second is the offset to the end. In
1842 fact, all the strings have the same start offset. (Space could have
1843 been saved by giving this only once, but it was decided to retain some
1844 compatibility with the way pcre_exec() returns data, even though the
1845 meaning of the strings is different.)
1846
1847 The strings are returned in reverse order of length; that is, the long‐
1848 est matching string is given first. If there were too many matches to
1849 fit into ovector, the yield of the function is zero, and the vector is
1850 filled with the longest matches.
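
Reading the results back is a matter of walking the pairs in order. The sketch below does not call PCRE. It assumes an ovector of {8,56, 8,36, 8,19} with a return value of 3, offsets worked out by hand from the <.*> example above, longest match first. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Length in bytes of the i-th match reported by pcre_dfa_exec();
   i must be below the positive return value rc. */
static int dfa_match_length(const int *ovector, int rc, int i)
{
    assert(ovector != NULL && i >= 0 && i < rc);
    return ovector[2 * i + 1] - ovector[2 * i];
}

/* Print all matches in the order they appear in ovector, which is
   longest first. Every match starts at the same offset. */
static void print_dfa_matches(const char *subject, const int *ovector, int rc)
{
    for (int i = 0; i < rc; i++)
        printf("%.*s\n", dfa_match_length(ovector, rc, i),
               subject + ovector[2 * i]);
}
```

A caller that wants only the longest match can read ovector[0] and ovector[1] and ignore the rest.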
1851
1852 Error returns from pcre_dfa_exec()
1853
1854 The pcre_dfa_exec() function returns a negative number when it fails.
1855 Many of the errors are the same as for pcre_exec(), and these are
1856 described above. There are in addition the following errors that are
1857 specific to pcre_dfa_exec():
1858
1859 PCRE_ERROR_DFA_UITEM (-16)
1860
1861 This return is given if pcre_dfa_exec() encounters an item in the pat‐
1862 tern that it does not support, for instance, the use of \C or a back
1863 reference.
1864
1865 PCRE_ERROR_DFA_UCOND (-17)
1866
1867 This return is given if pcre_dfa_exec() encounters a condition item
1868 that uses a back reference for the condition, or a test for recursion
1869 in a specific group. These are not supported.
1870
1871 PCRE_ERROR_DFA_UMLIMIT (-18)
1872
1873 This return is given if pcre_dfa_exec() is called with an extra block
1874 that contains a setting of the match_limit field. This is not supported
1875 (it is meaningless).
1876
1877 PCRE_ERROR_DFA_WSSIZE (-19)
1878
1879 This return is given if pcre_dfa_exec() runs out of space in the
1880 workspace vector.
1881
1882 PCRE_ERROR_DFA_RECURSE (-20)
1883
1884 When a recursive subpattern is processed, the matching function calls
1885 itself recursively, using private vectors for ovector and workspace.
1886 This error is given if the output vector is not large enough. This
1887 should be extremely rare, as a vector of size 1000 is used.
1888
1890
1891 pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3),
1892 pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3).
1893
1895
1896 Philip Hazel
1897 University Computing Service
1898 Cambridge CB2 3QH, England.
1899
1901
1902 Last updated: 24 August 2008
1903 Copyright (c) 1997-2011 University of Cambridge.
1904
1905
1906
1907 PCREAPI(3)