1 PCREAPI(3) Library Functions Manual PCREAPI(3)
2
3
4
6 PCRE - Perl-compatible regular expressions
7
9
10 #include <pcre.h>
11
12 pcre *pcre_compile(const char *pattern, int options,
13 const char **errptr, int *erroffset,
14 const unsigned char *tableptr);
15
16 pcre *pcre_compile2(const char *pattern, int options,
17 int *errorcodeptr,
18 const char **errptr, int *erroffset,
19 const unsigned char *tableptr);
20
21 pcre_extra *pcre_study(const pcre *code, int options,
22 const char **errptr);
23
24 int pcre_exec(const pcre *code, const pcre_extra *extra,
25 const char *subject, int length, int startoffset,
26 int options, int *ovector, int ovecsize);
27
28 int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
29 const char *subject, int length, int startoffset,
30 int options, int *ovector, int ovecsize,
31 int *workspace, int wscount);
32
33 int pcre_copy_named_substring(const pcre *code,
34 const char *subject, int *ovector,
35 int stringcount, const char *stringname,
36 char *buffer, int buffersize);
37
38 int pcre_copy_substring(const char *subject, int *ovector,
39 int stringcount, int stringnumber, char *buffer,
40 int buffersize);
41
42 int pcre_get_named_substring(const pcre *code,
43 const char *subject, int *ovector,
44 int stringcount, const char *stringname,
45 const char **stringptr);
46
47 int pcre_get_stringnumber(const pcre *code,
48 const char *name);
49
50 int pcre_get_stringtable_entries(const pcre *code,
51 const char *name, char **first, char **last);
52
53 int pcre_get_substring(const char *subject, int *ovector,
54 int stringcount, int stringnumber,
55 const char **stringptr);
56
57 int pcre_get_substring_list(const char *subject,
58 int *ovector, int stringcount, const char ***listptr);
59
60 void pcre_free_substring(const char *stringptr);
61
62 void pcre_free_substring_list(const char **stringptr);
63
64 const unsigned char *pcre_maketables(void);
65
66 int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
67 int what, void *where);
68
69 int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
70
71 int pcre_refcount(pcre *code, int adjust);
72
73 int pcre_config(int what, void *where);
74
75 char *pcre_version(void);
76
77 void *(*pcre_malloc)(size_t);
78
79 void (*pcre_free)(void *);
80
81 void *(*pcre_stack_malloc)(size_t);
82
83 void (*pcre_stack_free)(void *);
84
85 int (*pcre_callout)(pcre_callout_block *);
86
88
89 PCRE has its own native API, which is described in this document. There
90 are also some wrapper functions that correspond to the POSIX regular
91 expression API. These are described in the pcreposix documentation.
92 Both of these APIs define a set of C function calls. A C++ wrapper is
93 distributed with PCRE. It is documented in the pcrecpp page.
94
95 The native API C function prototypes are defined in the header file
96 pcre.h, and on Unix systems the library itself is called libpcre. It
97 can normally be accessed by adding -lpcre to the command for linking an
98 application that uses PCRE. The header file defines the macros
99 PCRE_MAJOR and PCRE_MINOR to contain the major and minor release num‐
100 bers for the library. Applications can use these to include support
101 for different releases of PCRE.
102
103 In a Windows environment, if you want to statically link an application
104 program against a non-dll pcre.a file, you must define PCRE_STATIC
105 before including pcre.h or pcrecpp.h, because otherwise the pcre_mal‐
106 loc() and pcre_free() exported functions will be declared
107 __declspec(dllimport), with unwanted results.
108
109 The functions pcre_compile(), pcre_compile2(), pcre_study(), and
110 pcre_exec() are used for compiling and matching regular expressions in
111 a Perl-compatible manner. A sample program that demonstrates the sim‐
112 plest way of using them is provided in the file called pcredemo.c in
113 the PCRE source distribution. A listing of this program is given in the
114 pcredemo documentation, and the pcresample documentation describes how
115 to compile and run it.
116
117 A second matching function, pcre_dfa_exec(), which is not Perl-compati‐
118 ble, is also provided. This uses a different algorithm for the match‐
119 ing. The alternative algorithm finds all possible matches (at a given
120 point in the subject), and scans the subject just once (unless there
121 are lookbehind assertions). However, this algorithm does not return
122 captured substrings. A description of the two matching algorithms and
123 their advantages and disadvantages is given in the pcrematching docu‐
124 mentation.
125
126 In addition to the main compiling and matching functions, there are
127 convenience functions for extracting captured substrings from a subject
128 string that is matched by pcre_exec(). They are:
129
130 pcre_copy_substring()
131 pcre_copy_named_substring()
132 pcre_get_substring()
133 pcre_get_named_substring()
134 pcre_get_substring_list()
135 pcre_get_stringnumber()
136 pcre_get_stringtable_entries()
137
138 pcre_free_substring() and pcre_free_substring_list() are also provided,
139 to free the memory used for extracted strings.
140
141 The function pcre_maketables() is used to build a set of character
142 tables in the current locale for passing to pcre_compile(),
143 pcre_exec(), or pcre_dfa_exec(). This is an optional facility that is
144 provided for specialist use. Most commonly, no special tables are
145 passed, in which case internal tables that are generated when PCRE is
146 built are used.
147
148 The function pcre_fullinfo() is used to find out information about a
149 compiled pattern; pcre_info() is an obsolete version that returns only
150 some of the available information, but is retained for backwards com‐
151 patibility. The function pcre_version() returns a pointer to a string
152 containing the version of PCRE and its date of release.
153
154 The function pcre_refcount() maintains a reference count in a data
155 block containing a compiled pattern. This is provided for the benefit
156 of object-oriented applications.
157
158 The global variables pcre_malloc and pcre_free initially contain the
159 entry points of the standard malloc() and free() functions, respec‐
160 tively. PCRE calls the memory management functions via these variables,
161 so a calling program can replace them if it wishes to intercept the
162 calls. This should be done before calling any PCRE functions.
163
164 The global variables pcre_stack_malloc and pcre_stack_free are also
165 indirections to memory management functions. These special functions
166 are used only when PCRE is compiled to use the heap for remembering
167 data, instead of recursive function calls, when running the pcre_exec()
168 function. See the pcrebuild documentation for details of how to do
169 this. It is a non-standard way of building PCRE, for use in environ‐
170 ments that have limited stacks. Because of the greater use of memory
171 management, it runs more slowly. Separate functions are provided so
172 that special-purpose external code can be used for this case. When
173 used, these functions are always called in a stack-like manner (last
174 obtained, first freed), and always for memory blocks of the same size.
175 There is a discussion about PCRE's stack usage in the pcrestack docu‐
176 mentation.
177
178 The global variable pcre_callout initially contains NULL. It can be set
179 by the caller to a "callout" function, which PCRE will then call at
180 specified points during a matching operation. Details are given in the
181 pcrecallout documentation.
182
184
185 PCRE supports five different conventions for indicating line breaks in
186 strings: a single CR (carriage return) character, a single LF (line‐
187 feed) character, the two-character sequence CRLF, any of the three pre‐
188 ceding, or any Unicode newline sequence. The Unicode newline sequences
189 are the three just mentioned, plus the single characters VT (vertical
190 tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS (line
191 separator, U+2028), and PS (paragraph separator, U+2029).
192
193 Each of the first three conventions is used by at least one operating
194 system as its standard newline sequence. When PCRE is built, a default
195 can be specified. If none is specified, the default is LF, which is the
196 Unix standard. When PCRE is run, the default can be overridden, either when a
197 pattern is compiled, or when it is matched.
198
199 At compile time, the newline convention can be specified by the options
200 argument of pcre_compile(), or it can be specified by special text at
201 the start of the pattern itself; this overrides any other settings. See
202 the pcrepattern page for details of the special character sequences.
203
204 In the PCRE documentation the word "newline" is used to mean "the char‐
205 acter or pair of characters that indicate a line break". The choice of
206 newline convention affects the handling of the dot, circumflex, and
207 dollar metacharacters, the handling of #-comments in /x mode, and, when
208 CRLF is a recognized line ending sequence, the match position advance‐
209 ment for a non-anchored pattern. There is more detail about this in the
210 section on pcre_exec() options below.
211
212 The choice of newline convention does not affect the interpretation of
213 the \n or \r escape sequences, nor does it affect what \R matches,
214 which is controlled in a similar way, but by separate options.
215
217
218 The PCRE functions can be used in multi-threading applications, with
219 the proviso that the memory management functions pointed to by
220 pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
221 callout function pointed to by pcre_callout, are shared by all threads.
222
223 The compiled form of a regular expression is not altered during match‐
224 ing, so the same compiled pattern can safely be used by several threads
225 at once.
226
228
229 The compiled form of a regular expression can be saved and re-used at a
230 later time, possibly by a different program, and even on a host other
231 than the one on which it was compiled. Details are given in the
232 pcreprecompile documentation. However, compiling a regular expression
233 with one version of PCRE for use with a different version is not guar‐
234 anteed to work and may cause crashes.
235
237
238 int pcre_config(int what, void *where);
239
240 The function pcre_config() makes it possible for a PCRE client to dis‐
241 cover which optional features have been compiled into the PCRE library.
242 The pcrebuild documentation has more details about these optional fea‐
243 tures.
244
245 The first argument for pcre_config() is an integer, specifying which
246 information is required; the second argument is a pointer to a variable
247 into which the information is placed. The following information is
248 available:
249
250 PCRE_CONFIG_UTF8
251
252 The output is an integer that is set to one if UTF-8 support is avail‐
253 able; otherwise it is set to zero.
254
255 PCRE_CONFIG_UNICODE_PROPERTIES
256
257 The output is an integer that is set to one if support for Unicode
258 character properties is available; otherwise it is set to zero.
259
260 PCRE_CONFIG_NEWLINE
261
262 The output is an integer whose value specifies the default character
263 sequence that is recognized as meaning "newline". The five values that
264 are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,
265 and -1 for ANY. Though they are derived from ASCII, the same values
266 are returned in EBCDIC environments. The default should normally corre‐
267 spond to the standard sequence for your operating system.
268
269 PCRE_CONFIG_BSR
270
271 The output is an integer whose value indicates what character sequences
272 the \R escape sequence matches by default. A value of 0 means that \R
273 matches any Unicode line ending sequence; a value of 1 means that \R
274 matches only CR, LF, or CRLF. The default can be overridden when a pat‐
275 tern is compiled or matched.
276
277 PCRE_CONFIG_LINK_SIZE
278
279 The output is an integer that contains the number of bytes used for
280 internal linkage in compiled regular expressions. The value is 2, 3, or
281 4. Larger values allow larger regular expressions to be compiled, at
282 the expense of slower matching. The default value of 2 is sufficient
283 for all but the most massive patterns, since it allows the compiled
284 pattern to be up to 64K in size.
285
286 PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
287
288 The output is an integer that contains the threshold above which the
289 POSIX interface uses malloc() for output vectors. Further details are
290 given in the pcreposix documentation.
291
292 PCRE_CONFIG_MATCH_LIMIT
293
294 The output is a long integer that gives the default limit for the num‐
295 ber of internal matching function calls in a pcre_exec() execution.
296 Further details are given with pcre_exec() below.
297
298 PCRE_CONFIG_MATCH_LIMIT_RECURSION
299
300 The output is a long integer that gives the default limit for the depth
301 of recursion when calling the internal matching function in a
302 pcre_exec() execution. Further details are given with pcre_exec()
303 below.
304
305 PCRE_CONFIG_STACKRECURSE
306
307 The output is an integer that is set to one if internal recursion when
308 running pcre_exec() is implemented by recursive function calls that use
309 the stack to remember their state. This is the usual way that PCRE is
310 compiled. The output is zero if PCRE was compiled to use blocks of data
311 on the heap instead of recursive function calls. In this case,
312 pcre_stack_malloc and pcre_stack_free are called to manage memory
313 blocks on the heap, thus avoiding the use of the stack.
314
316
317 pcre *pcre_compile(const char *pattern, int options,
318 const char **errptr, int *erroffset,
319 const unsigned char *tableptr);
320
321 pcre *pcre_compile2(const char *pattern, int options,
322 int *errorcodeptr,
323 const char **errptr, int *erroffset,
324 const unsigned char *tableptr);
325
326 Either of the functions pcre_compile() or pcre_compile2() can be called
327 to compile a pattern into an internal form. The only difference between
328 the two interfaces is that pcre_compile2() has an additional argument,
329 errorcodeptr, via which a numerical error code can be returned. To
330 avoid too much repetition, we refer just to pcre_compile() below, but
331 the information applies equally to pcre_compile2().
332
333 The pattern is a C string terminated by a binary zero, and is passed in
334 the pattern argument. A pointer to a single block of memory that is
335 obtained via pcre_malloc is returned. This contains the compiled code
336 and related data. The pcre type is defined for the returned block; this
337 is a typedef for a structure whose contents are not externally defined.
338 It is up to the caller to free the memory (via pcre_free) when it is no
339 longer required.
340
341 Although the compiled code of a PCRE regex is relocatable, that is, it
342 does not depend on memory location, the complete pcre data block is not
343 fully relocatable, because it may contain a copy of the tableptr argu‐
344 ment, which is an address (see below).
345
346 The options argument contains various bit settings that affect the com‐
347 pilation. It should be zero if no options are required. The available
348 options are described below. Some of them (in particular, those that
349 are compatible with Perl, but some others as well) can also be set and
350 unset from within the pattern (see the detailed description in the
351 pcrepattern documentation). For those options that can be different in
352 different parts of the pattern, the contents of the options argument
353 specify their settings at the start of compilation and execution. The
354 PCRE_ANCHORED, PCRE_BSR_xxx, and PCRE_NEWLINE_xxx options can be set at
355 the time of matching as well as at compile time.
356
357 If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise,
358 if compilation of a pattern fails, pcre_compile() returns NULL, and
359 sets the variable pointed to by errptr to point to a textual error mes‐
360 sage. This is a static string that is part of the library. You must not
361 try to free it. The byte offset from the start of the pattern to the
362 character that was being processed when the error was discovered is
363 placed in the variable pointed to by erroffset, which must not be NULL.
364 If it is, an immediate error is given. Some errors are not detected
365 until checks are carried out when the whole pattern has been scanned;
366 in this case the offset is set to the end of the pattern.
367
368 If pcre_compile2() is used instead of pcre_compile(), and the error‐
369 codeptr argument is not NULL, a non-zero error code number is returned
370 via this argument in the event of an error. This is in addition to the
371 textual error message. Error codes and messages are listed below.
372
373 If the final argument, tableptr, is NULL, PCRE uses a default set of
374 character tables that are built when PCRE is compiled, using the
375 default C locale. Otherwise, tableptr must be an address that is the
376 result of a call to pcre_maketables(). This value is stored with the
377 compiled pattern, and used again by pcre_exec(), unless another table
378 pointer is passed to it. For more discussion, see the section on locale
379 support below.
380
381 This code fragment shows a typical straightforward call to pcre_com‐
382 pile():
383
384 pcre *re;
385 const char *error;
386 int erroffset;
387 re = pcre_compile(
388 "^A.*Z", /* the pattern */
389 0, /* default options */
390 &error, /* for error message */
391 &erroffset, /* for error offset */
392 NULL); /* use default character tables */
393
394 The following names for option bits are defined in the pcre.h header
395 file:
396
397 PCRE_ANCHORED
398
399 If this bit is set, the pattern is forced to be "anchored", that is, it
400 is constrained to match only at the first matching point in the string
401 that is being searched (the "subject string"). This effect can also be
402 achieved by appropriate constructs in the pattern itself, which is the
403 only way to do it in Perl.
404
405 PCRE_AUTO_CALLOUT
406
407 If this bit is set, pcre_compile() automatically inserts callout items,
408 all with number 255, before each pattern item. For discussion of the
409 callout facility, see the pcrecallout documentation.
410
411 PCRE_BSR_ANYCRLF
412 PCRE_BSR_UNICODE
413
414 These options (which are mutually exclusive) control what the \R escape
415 sequence matches. The choice is either to match only CR, LF, or CRLF,
416 or to match any Unicode newline sequence. The default is specified when
417 PCRE is built. It can be overridden from within the pattern, or by set‐
418 ting an option when a compiled pattern is matched.
419
420 PCRE_CASELESS
421
422 If this bit is set, letters in the pattern match both upper and lower
423 case letters. It is equivalent to Perl's /i option, and it can be
424 changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
425 always understands the concept of case for characters whose values are
426 less than 128, so caseless matching is always possible. For characters
427 with higher values, the concept of case is supported if PCRE is com‐
428 piled with Unicode property support, but not otherwise. If you want to
429 use caseless matching for characters 128 and above, you must ensure
430 that PCRE is compiled with Unicode property support as well as with
431 UTF-8 support.
432
433 PCRE_DOLLAR_ENDONLY
434
435 If this bit is set, a dollar metacharacter in the pattern matches only
436 at the end of the subject string. Without this option, a dollar also
437 matches immediately before a newline at the end of the string (but not
438 before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
439 if PCRE_MULTILINE is set. There is no equivalent to this option in
440 Perl, and no way to set it within a pattern.
441
442 PCRE_DOTALL
443
444 If this bit is set, a dot metacharacter in the pattern matches all
445 characters, including those that indicate newline. Without it, a dot does
446 not match when the current position is at a newline. This option is
447 equivalent to Perl's /s option, and it can be changed within a pattern
448 by a (?s) option setting. A negative class such as [^a] always matches
449 newline characters, independent of the setting of this option.
450
451 PCRE_DUPNAMES
452
453 If this bit is set, names used to identify capturing subpatterns need
454 not be unique. This can be helpful for certain types of pattern when it
455 is known that only one instance of the named subpattern can ever be
456 matched. There are more details of named subpatterns below; see also
457 the pcrepattern documentation.
458
459 PCRE_EXTENDED
460
461 If this bit is set, whitespace data characters in the pattern are
462 totally ignored except when escaped or inside a character class. White‐
463 space does not include the VT character (code 11). In addition, charac‐
464 ters between an unescaped # outside a character class and the next new‐
465 line, inclusive, are also ignored. This is equivalent to Perl's /x
466 option, and it can be changed within a pattern by a (?x) option set‐
467 ting.
468
469 This option makes it possible to include comments inside complicated
470 patterns. Note, however, that this applies only to data characters.
471 Whitespace characters may never appear within special character
472 sequences in a pattern, for example within the sequence (?( which
473 introduces a conditional subpattern.
474
475 PCRE_EXTRA
476
477 This option was invented in order to turn on additional functionality
478 of PCRE that is incompatible with Perl, but it is currently of very
479 little use. When set, any backslash in a pattern that is followed by a
480 letter that has no special meaning causes an error, thus reserving
481 these combinations for future expansion. By default, as in Perl, a
482 backslash followed by a letter with no special meaning is treated as a
483 literal. (Perl can, however, be persuaded to give an error for this, by
484 running it with the -w option.) There are at present no other features
485 controlled by this option. It can also be set by a (?X) option setting
486 within a pattern.
487
488 PCRE_FIRSTLINE
489
490 If this option is set, an unanchored pattern is required to match
491 before or at the first newline in the subject string, though the
492 matched text may continue over the newline.
493
494 PCRE_JAVASCRIPT_COMPAT
495
496 If this option is set, PCRE's behaviour is changed in some ways so that
497 it is compatible with JavaScript rather than Perl. The changes are as
498 follows:
499
500 (1) A lone closing square bracket in a pattern causes a compile-time
501 error, because this is illegal in JavaScript (by default it is treated
502 as a data character). Thus, the pattern AB]CD becomes illegal when this
503 option is set.
504
505 (2) At run time, a back reference to an unset subpattern group matches
506 an empty string (by default this causes the current matching alterna‐
507 tive to fail). A pattern such as (\1)(a) succeeds when this option is
508 set (assuming it can find an "a" in the subject), whereas it fails by
509 default, for Perl compatibility.
510
511 PCRE_MULTILINE
512
513 By default, PCRE treats the subject string as consisting of a single
514 line of characters (even if it actually contains newlines). The "start
515 of line" metacharacter (^) matches only at the start of the string,
516 while the "end of line" metacharacter ($) matches only at the end of
517 the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
518 is set). This is the same as Perl.
519
520 When PCRE_MULTILINE is set, the "start of line" and "end of line"
521 constructs match immediately following or immediately before internal
522 newlines in the subject string, respectively, as well as at the very
523 start and end. This is equivalent to Perl's /m option, and it can be
524 changed within a pattern by a (?m) option setting. If there are no new‐
525 lines in a subject string, or no occurrences of ^ or $ in a pattern,
526 setting PCRE_MULTILINE has no effect.
527
528 PCRE_NEWLINE_CR
529 PCRE_NEWLINE_LF
530 PCRE_NEWLINE_CRLF
531 PCRE_NEWLINE_ANYCRLF
532 PCRE_NEWLINE_ANY
533
534 These options override the default newline definition that was chosen
535 when PCRE was built. Setting the first or the second specifies that a
536 newline is indicated by a single character (CR or LF, respectively).
537 Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the
538 two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies
539 that any of the three preceding sequences should be recognized. Setting
540 PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be
541 recognized. The Unicode newline sequences are the three just mentioned,
542 plus the single characters VT (vertical tab, U+000B), FF (formfeed,
543 U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
544 (paragraph separator, U+2029). The last two are recognized only in
545 UTF-8 mode.
546
547 The newline setting in the options word uses three bits that are
548 treated as a number, giving eight possibilities. Currently only six are
549 used (default plus the five values above). This means that if you set
550 more than one newline option, the combination may or may not be sensi‐
551 ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
552 PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and
553 cause an error.
554
555 The only time that a line break is specially recognized when compiling
556 a pattern is if PCRE_EXTENDED is set, and an unescaped # outside a
557 character class is encountered. This indicates a comment that lasts
558 until after the next line break sequence. In other circumstances, line
559 break sequences are treated as literal data, except that in
560 PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters
561 and are therefore ignored.
562
563 The newline option that is set at compile time becomes the default that
564 is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
565
566 PCRE_NO_AUTO_CAPTURE
567
568 If this option is set, it disables the use of numbered capturing paren‐
569 theses in the pattern. Any opening parenthesis that is not followed by
570 ? behaves as if it were followed by ?: but named parentheses can still
571 be used for capturing (and they acquire numbers in the usual way).
572 There is no equivalent of this option in Perl.
573
574 PCRE_UCP
575
576 This option changes the way PCRE processes \b, \d, \s, \w, and some of
577 the POSIX character classes. By default, only ASCII characters are rec‐
578 ognized, but if PCRE_UCP is set, Unicode properties are used instead to
579 classify characters. More details are given in the section on generic
580 character types in the pcrepattern page. If you set PCRE_UCP, matching
581 one of the items it affects takes much longer. The option is available
582 only if PCRE has been compiled with Unicode property support.
583
584 PCRE_UNGREEDY
585
586 This option inverts the "greediness" of the quantifiers so that they
587 are not greedy by default, but become greedy if followed by "?". It is
588 not compatible with Perl. It can also be set by a (?U) option setting
589 within the pattern.
590
591 PCRE_UTF8
592
593 This option causes PCRE to regard both the pattern and the subject as
594 strings of UTF-8 characters instead of single-byte character strings.
595 However, it is available only when PCRE is built to include UTF-8 sup‐
596 port. If not, the use of this option provokes an error. Details of how
597 this option changes the behaviour of PCRE are given in the section on
598 UTF-8 support in the main pcre page.
599
600 PCRE_NO_UTF8_CHECK
601
602 When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
603 automatically checked. There is a discussion about the validity of
604 UTF-8 strings in the main pcre page. If an invalid UTF-8 sequence of
605 bytes is found, pcre_compile() returns an error. If you already know
606 that your pattern is valid, and you want to skip this check for perfor‐
607 mance reasons, you can set the PCRE_NO_UTF8_CHECK option. When it is
608 set, the effect of passing an invalid UTF-8 string as a pattern is
609 undefined. It may cause your program to crash. Note that this option
610 can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the
611 UTF-8 validity checking of subject strings.
612
614
615 The following table lists the error codes that may be returned by
616 pcre_compile2(), along with the error messages that may be returned by
617 both compiling functions. As PCRE has developed, some error codes have
618 fallen out of use. To avoid confusion, they have not been re-used.
619
620 0 no error
621 1 \ at end of pattern
622 2 \c at end of pattern
623 3 unrecognized character follows \
624 4 numbers out of order in {} quantifier
625 5 number too big in {} quantifier
626 6 missing terminating ] for character class
627 7 invalid escape sequence in character class
628 8 range out of order in character class
629 9 nothing to repeat
630 10 [this code is not in use]
631 11 internal error: unexpected repeat
632 12 unrecognized character after (? or (?-
633 13 POSIX named classes are supported only within a class
634 14 missing )
635 15 reference to non-existent subpattern
636 16 erroffset passed as NULL
637 17 unknown option bit(s) set
638 18 missing ) after comment
639 19 [this code is not in use]
640 20 regular expression is too large
641 21 failed to get memory
642 22 unmatched parentheses
643 23 internal error: code overflow
644 24 unrecognized character after (?<
645 25 lookbehind assertion is not fixed length
646 26 malformed number or name after (?(
647 27 conditional group contains more than two branches
648 28 assertion expected after (?(
649 29 (?R or (?[+-]digits must be followed by )
650 30 unknown POSIX class name
651 31 POSIX collating elements are not supported
652 32 this version of PCRE is not compiled with PCRE_UTF8 support
653 33 [this code is not in use]
654 34 character value in \x{...} sequence is too large
655 35 invalid condition (?(0)
656 36 \C not allowed in lookbehind assertion
657 37 PCRE does not support \L, \l, \N, \U, or \u
658 38 number after (?C is > 255
659 39 closing ) for (?C expected
660 40 recursive call could loop indefinitely
661 41 unrecognized character after (?P
662 42 syntax error in subpattern name (missing terminator)
663 43 two named subpatterns have the same name
664 44 invalid UTF-8 string
665 45 support for \P, \p, and \X has not been compiled
666 46 malformed \P or \p sequence
667 47 unknown property name after \P or \p
668 48 subpattern name is too long (maximum 32 characters)
669 49 too many named subpatterns (maximum 10000)
670 50 [this code is not in use]
671 51 octal value is greater than \377 (not in UTF-8 mode)
672 52 internal error: overran compiling workspace
673 53 internal error: previously-checked referenced subpattern
674 not found
675 54 DEFINE group contains more than one branch
676 55 repeating a DEFINE group is not allowed
677 56 inconsistent NEWLINE options
678 57 \g is not followed by a braced, angle-bracketed, or quoted
679 name/number or by a plain number
680 58 a numbered reference must not be zero
681 59 an argument is not allowed for (*ACCEPT), (*FAIL), or (*COMMIT)
682 60 (*VERB) not recognized
683 61 number is too big
684 62 subpattern name expected
685 63 digit expected after (?+
686 64 ] is an invalid data character in JavaScript compatibility mode
687 65 different names for subpatterns of the same number are
688 not allowed
689 66 (*MARK) must have an argument
690 67 this version of PCRE is not compiled with PCRE_UCP support
691
692 The numbers 32 and 10000 in errors 48 and 49 are defaults; different
693 values may be used if the limits were changed when PCRE was built.
694
696
697 pcre_extra *pcre_study(const pcre *code, int options,
698 const char **errptr);
699
700 If a compiled pattern is going to be used several times, it is worth
701 spending more time analyzing it in order to speed up the time taken for
702 matching. The function pcre_study() takes a pointer to a compiled pat‐
703 tern as its first argument. If studying the pattern produces additional
704 information that will help speed up matching, pcre_study() returns a
705 pointer to a pcre_extra block, in which the study_data field points to
706 the results of the study.
707
708 The returned value from pcre_study() can be passed directly to
709 pcre_exec() or pcre_dfa_exec(). However, a pcre_extra block also con‐
710 tains other fields that can be set by the caller before the block is
711 passed; these are described below in the section on matching a pattern.
712
713 If studying the pattern does not produce any useful information,
714 pcre_study() returns NULL. In that circumstance, if the calling program
715 wants to pass any of the other fields to pcre_exec() or
716 pcre_dfa_exec(), it must set up its own pcre_extra block.
717
718 The second argument of pcre_study() contains option bits. At present,
719 no options are defined, and this argument should always be zero.
720
721 The third argument for pcre_study() is a pointer for an error message.
722 If studying succeeds (even if no data is returned), the variable it
723 points to is set to NULL. Otherwise it is set to point to a textual
724 error message. This is a static string that is part of the library. You
725 must not try to free it. You should test the error pointer for NULL
726 after calling pcre_study(), to be sure that it has run successfully.
727
728 This is a typical call to pcre_study():
729
730 pcre_extra *pe;
731 pe = pcre_study(
732 re, /* result of pcre_compile() */
733 0, /* no options exist */
734 &error); /* set to NULL or points to a message */
735
736 Studying a pattern does two things: first, a lower bound for the length
737 of subject string that is needed to match the pattern is computed. This
738 does not mean that there are any strings of that length that match, but
739 it does guarantee that no shorter strings match. The value is used by
740 pcre_exec() and pcre_dfa_exec() to avoid wasting time by trying to
741 match strings that are shorter than the lower bound. You can find out
742 the value in a calling program via the pcre_fullinfo() function.
743
744 Studying a pattern is also useful for non-anchored patterns that do not
745 have a single fixed starting character. A bitmap of possible starting
746 bytes is created. This speeds up finding a position in the subject at
747 which to start matching.
748
749 The two optimizations just described can be disabled by setting the
750 PCRE_NO_START_OPTIMIZE option when calling pcre_exec() or
751 pcre_dfa_exec(). You might want to do this if your pattern contains
752 callouts, or makes use of (*MARK), and you make use of these in cases
753 where matching fails. See the discussion of PCRE_NO_START_OPTIMIZE
754 below.
755
757
758 PCRE handles caseless matching, and determines whether characters are
759 letters, digits, or whatever, by reference to a set of tables, indexed
760 by character value. When running in UTF-8 mode, this applies only to
761 characters with codes less than 128. By default, higher-valued codes
762 never match escapes such as \w or \d, but they can be tested with \p if
763 PCRE is built with Unicode character property support. Alternatively,
764 the PCRE_UCP option can be set at compile time; this causes \w and
765 friends to use Unicode property support instead of built-in tables. The
766 use of locales with Unicode is discouraged. If you are handling charac‐
767 ters with codes greater than 128, you should either use UTF-8 and Uni‐
768 code, or use locales, but not try to mix the two.
769
770 PCRE contains an internal set of tables that are used when the final
771 argument of pcre_compile() is NULL. These are sufficient for many
772 applications. Normally, the internal tables recognize only ASCII char‐
773 acters. However, when PCRE is built, it is possible to cause the inter‐
774 nal tables to be rebuilt in the default "C" locale of the local system,
775 which may cause them to be different.
776
777 The internal tables can always be overridden by tables supplied by the
778 application that calls PCRE. These may be created in a different locale
779 from the default. As more and more applications change to using Uni‐
780 code, the need for this locale support is expected to die away.
781
782 External tables are built by calling the pcre_maketables() function,
783 which has no arguments, in the relevant locale. The result can then be
784 passed to pcre_compile() or pcre_exec() as often as necessary. For
785 example, to build and use tables that are appropriate for the French
786 locale (where accented characters with values greater than 128 are
787 treated as letters), the following code could be used:
788
789 setlocale(LC_CTYPE, "fr_FR");
790 tables = pcre_maketables();
791 re = pcre_compile(..., tables);
792
793 The locale name "fr_FR" is used on Linux and other Unix-like systems;
794 if you are using Windows, the name for the French locale is "french".
795
796 When pcre_maketables() runs, the tables are built in memory that is
797 obtained via pcre_malloc. It is the caller's responsibility to ensure
798 that the memory containing the tables remains available for as long as
799 it is needed.
800
801 The pointer that is passed to pcre_compile() is saved with the compiled
802 pattern, and the same tables are used via this pointer by pcre_study()
803 and normally also by pcre_exec(). Thus, by default, for any single pat‐
804 tern, compilation, studying and matching all happen in the same locale,
805 but different patterns can be compiled in different locales.
806
807 It is possible to pass a table pointer or NULL (indicating the use of
808 the internal tables) to pcre_exec(). Although not intended for this
809 purpose, this facility could be used to match a pattern in a different
810 locale from the one in which it was compiled. Passing table pointers at
811 run time is discussed below in the section on matching a pattern.
812
814
815 int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
816 int what, void *where);
817
818 The pcre_fullinfo() function returns information about a compiled
819 pattern. It replaces the obsolete pcre_info() function, which is
820 nevertheless retained for backwards compatibility (and is documented below).
821
822 The first argument for pcre_fullinfo() is a pointer to the compiled
823 pattern. The second argument is the result of pcre_study(), or NULL if
824 the pattern was not studied. The third argument specifies which piece
825 of information is required, and the fourth argument is a pointer to a
826 variable to receive the data. The yield of the function is zero for
827 success, or one of the following negative numbers:
828
829 PCRE_ERROR_NULL the argument code was NULL
830 the argument where was NULL
831 PCRE_ERROR_BADMAGIC the "magic number" was not found
832 PCRE_ERROR_BADOPTION the value of what was invalid
833
834 The "magic number" is placed at the start of each compiled pattern as
835 a simple check against passing an arbitrary memory pointer. Here is a
836 typical call of pcre_fullinfo(), to obtain the length of the compiled
837 pattern:
838
839 int rc;
840 size_t length;
841 rc = pcre_fullinfo(
842 re, /* result of pcre_compile() */
843 pe, /* result of pcre_study(), or NULL */
844 PCRE_INFO_SIZE, /* what is required */
845 &length); /* where to put the data */
846
847 The possible values for the third argument are defined in pcre.h, and
848 are as follows:
849
850 PCRE_INFO_BACKREFMAX
851
852 Return the number of the highest back reference in the pattern. The
853 fourth argument should point to an int variable. Zero is returned if
854 there are no back references.
855
856 PCRE_INFO_CAPTURECOUNT
857
858 Return the number of capturing subpatterns in the pattern. The fourth
859 argument should point to an int variable.
860
861 PCRE_INFO_DEFAULT_TABLES
862
863 Return a pointer to the internal default character tables within PCRE.
864 The fourth argument should point to an unsigned char * variable. This
865 information call is provided for internal use by the pcre_study() func‐
866 tion. External callers can cause PCRE to use its internal tables by
867 passing a NULL table pointer.
868
869 PCRE_INFO_FIRSTBYTE
870
871 Return information about the first byte of any matched string, for a
872 non-anchored pattern. The fourth argument should point to an int vari‐
873 able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name
874 is still recognized for backwards compatibility.)
875
876 If there is a fixed first byte, for example, from a pattern such as
877 (cat|cow|coyote), its value is returned. Otherwise, if either
878
879 (a) the pattern was compiled with the PCRE_MULTILINE option, and every
880 branch starts with "^", or
881
882 (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
883 set (if it were set, the pattern would be anchored),
884
885 -1 is returned, indicating that the pattern matches only at the start
886 of a subject string or after any newline within the string. Otherwise
887 -2 is returned. For anchored patterns, -2 is returned.
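The three kinds of value that PCRE_INFO_FIRSTBYTE can store may be easier to handle in a calling program if they are classified once. The following is a minimal sketch that only encodes the rules stated above; the enum and the function name classify_first_byte are hypothetical and are not part of PCRE.

```c
/* Hypothetical names, not part of PCRE: the meanings of the int value
   that PCRE_INFO_FIRSTBYTE stores, as described in the text above. */
enum first_byte_kind {
    FIRST_BYTE_FIXED,       /* value >= 0: every match starts with this byte */
    FIRST_BYTE_LINE_START,  /* -1: matches only at the start of the subject
                               or after a newline within it */
    FIRST_BYTE_NONE         /* -2: no first-byte information (this is also
                               what anchored patterns yield) */
};

enum first_byte_kind classify_first_byte(int value)
{
    if (value >= 0)
        return FIRST_BYTE_FIXED;
    if (value == -1)
        return FIRST_BYTE_LINE_START;
    return FIRST_BYTE_NONE;
}
```

A caller would pass the int that pcre_fullinfo() filled in for PCRE_INFO_FIRSTBYTE, after checking that pcre_fullinfo() itself returned zero.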
888
889 PCRE_INFO_FIRSTTABLE
890
891 If the pattern was studied, and this resulted in the construction of a
892 256-bit table indicating a fixed set of bytes for the first byte in any
893 matching string, a pointer to the table is returned. Otherwise NULL is
894 returned. The fourth argument should point to an unsigned char * vari‐
895 able.
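As an illustration of how a 256-bit table can be consulted, here is a minimal sketch. The text above does not specify the bit layout, so the layout used here is an assumption: 32 bytes, with byte value c represented by bit (c % 8) of table byte (c / 8), least significant bit first. The function name first_byte_allowed is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE. ASSUMED layout: 32 bytes, where
   byte value c is represented by bit (c % 8) of first_table[c / 8].
   Returns nonzero if c may start a match according to the table. */
int first_byte_allowed(const unsigned char *first_table, unsigned char c)
{
    if (first_table == NULL)
        return 1;   /* no table was built, so no byte can be ruled out */
    return (first_table[c / 8] & (1u << (c % 8))) != 0;
}
```

The NULL case mirrors the fact that PCRE_INFO_FIRSTTABLE yields NULL when no table was constructed.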
896
897 PCRE_INFO_HASCRORLF
898
899 Return 1 if the pattern contains any explicit matches for CR or LF
900 characters, otherwise 0. The fourth argument should point to an int
901 variable. An explicit match is either a literal CR or LF character, or
902 \r or \n.
903
904 PCRE_INFO_JCHANGED
905
906 Return 1 if the (?J) or (?-J) option setting is used in the pattern,
907 otherwise 0. The fourth argument should point to an int variable. (?J)
908 and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
909
910 PCRE_INFO_LASTLITERAL
911
912 Return the value of the rightmost literal byte that must exist in any
913 matched string, other than at its start, if such a byte has been
914 recorded. The fourth argument should point to an int variable. If there
915 is no such byte, -1 is returned. For anchored patterns, a last literal
916 byte is recorded only if it follows something of variable length. For
917 example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
918 /^a\dz\d/ the returned value is -1.
919
920 PCRE_INFO_MINLENGTH
921
922 If the pattern was studied and a minimum length for matching subject
923 strings was computed, its value is returned. Otherwise the returned
924 value is -1. The value is a number of characters, not bytes (this may
925 be relevant in UTF-8 mode). The fourth argument should point to an int
926 variable. A non-negative value is a lower bound to the length of any
927 matching string. There may not be any strings of that length that do
928 actually match, but every string that does match is at least that long.
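Because the minimum length is counted in characters, a byte count cannot be compared with it directly in UTF-8 mode. The sketch below shows one way a calling program might pre-filter subjects, assuming the subject is already known to be valid UTF-8; both function names are hypothetical, and pcre_exec() performs its own check in any case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count characters in a subject that is ASSUMED to be
   valid UTF-8, by counting the bytes that are not continuation bytes
   (continuation bytes have the form 10xxxxxx). */
int utf8_char_count(const char *subject, int length)
{
    int i, count = 0;
    assert(subject != NULL && length >= 0);
    for (i = 0; i < length; i++)
        if (((unsigned char)subject[i] & 0xC0) != 0x80)
            count++;
    return count;
}

/* Hypothetical helper: returns 1 if the subject has fewer characters than
   the PCRE_INFO_MINLENGTH value and so cannot match. A minlength of -1
   means no minimum was recorded, so nothing can be ruled out. */
int shorter_than_minimum(const char *subject, int length, int utf8,
                         int minlength)
{
    int chars;
    if (minlength < 0)
        return 0;
    chars = utf8 ? utf8_char_count(subject, length) : length;
    return chars < minlength;
}
```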
929
930 PCRE_INFO_NAMECOUNT
931 PCRE_INFO_NAMEENTRYSIZE
932 PCRE_INFO_NAMETABLE
933
934 PCRE supports the use of named as well as numbered capturing parenthe‐
935 ses. The names are just an additional way of identifying the parenthe‐
936 ses, which still acquire numbers. Several convenience functions such as
937 pcre_get_named_substring() are provided for extracting captured sub‐
938 strings by name. It is also possible to extract the data directly, by
939 first converting the name to a number in order to access the correct
940 pointers in the output vector (described with pcre_exec() below). To do
941 the conversion, you need to use the name-to-number map, which is
942 described by these three values.
943
944 The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
945 gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
946 of each entry; both of these return an int value. The entry size
947 depends on the length of the longest name. PCRE_INFO_NAMETABLE returns
948 a pointer to the first entry of the table (a pointer to char). The
949 first two bytes of each entry are the number of the capturing parenthe‐
950 sis, most significant byte first. The rest of the entry is the corre‐
951 sponding name, zero terminated.
952
953 The names are in alphabetical order. Duplicate names may appear if (?|
954 is used to create multiple groups with the same number, as described in
955 the section on duplicate subpattern numbers in the pcrepattern page.
956 Duplicate names for subpatterns with different numbers are permitted
957 only if PCRE_DUPNAMES is set. In all cases of duplicate names, they
958 appear in the table in the order in which they were found in the pat‐
959 tern. In the absence of (?| this is the order of increasing number;
960 when (?| is used this is not necessarily the case because later subpat‐
961 terns may have lower numbers.
962
963 As a simple example of the name/number table, consider the following
964 pattern (assume PCRE_EXTENDED is set, so white space - including new‐
965 lines - is ignored):
966
967 (?<date> (?<year>(\d\d)?\d\d) -
968 (?<month>\d\d) - (?<day>\d\d) )
969
970 There are four named subpatterns, so the table has four entries, and
971 each entry in the table is eight bytes long. The table is as follows,
972 with non-printing bytes shown in hexadecimal, and undefined bytes shown
973 as ??:
974
975 00 01 d a t e 00 ??
976 00 05 d a y 00 ?? ??
977 00 04 m o n t h 00
978 00 02 y e a r 00 ??
979
980 When writing code to extract data from named subpatterns using the
981 name-to-number map, remember that the length of the entries is likely
982 to be different for each compiled pattern.
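The entry layout described above can be walked with a few lines of C. The following is a minimal sketch, assuming the table pointer, the entry count, and the entry size have already been obtained via PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT, and PCRE_INFO_NAMEENTRYSIZE. The function find_group_number and the array example_name_table are hypothetical names used only for illustration; the array reproduces the four-entry table shown above, with the undefined bytes set to zero.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE. Each entry is entry_size bytes:
   a two-byte group number, most significant byte first, followed by the
   zero-terminated name. Returns the group number of the first entry whose
   name matches, or -1 if the name is not in the table. */
int find_group_number(const unsigned char *table, int name_count,
                      int entry_size, const char *name)
{
    int i;
    assert(table != NULL && name != NULL);
    assert(entry_size > 2);   /* two number bytes plus at least a terminator */
    for (i = 0; i < name_count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        if (strcmp((const char *)entry + 2, name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* The example table from the text: four entries of eight bytes each. */
const unsigned char example_name_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0, 0,
    0, 5, 'd', 'a', 'y', 0, 0, 0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0, 0
};
```

With this table, looking up "month" yields 4 and looking up "day" yields 5. A linear scan is used for simplicity; because the names are in alphabetical order, a binary search would also work. When duplicate names are present, this sketch reports only the first entry found.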
983
984 PCRE_INFO_OKPARTIAL
985
986 Return 1 if the pattern can be used for partial matching with
987 pcre_exec(), otherwise 0. The fourth argument should point to an int
988 variable. From release 8.00, this always returns 1, because the
989 restrictions that previously applied to partial matching have been
990 lifted. The pcrepartial documentation gives details of partial match‐
991 ing.
992
993 PCRE_INFO_OPTIONS
994
995 Return a copy of the options with which the pattern was compiled. The
996 fourth argument should point to an unsigned long int variable. These
997 option bits are those specified in the call to pcre_compile(), modified
998 by any top-level option settings at the start of the pattern itself. In
999 other words, they are the options that will be in force when matching
1000 starts. For example, if the pattern /(?im)abc(?-i)d/ is compiled with
1001 the PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE,
1002 and PCRE_EXTENDED.
1003
1004 A pattern is automatically anchored by PCRE if all of its top-level
1005 alternatives begin with one of the following:
1006
1007 ^ unless PCRE_MULTILINE is set
1008 \A always
1009 \G always
1010 .* if PCRE_DOTALL is set and there are no back
1011 references to the subpattern in which .* appears
1012
1013 For such patterns, the PCRE_ANCHORED bit is set in the options returned
1014 by pcre_fullinfo().
1015
1016 PCRE_INFO_SIZE
1017
1018 Return the size of the compiled pattern, that is, the value that was
1019 passed as the argument to pcre_malloc() when PCRE was getting memory in
1020 which to place the compiled data. The fourth argument should point to a
1021 size_t variable.
1022
1023 PCRE_INFO_STUDYSIZE
1024
1025 Return the size of the data block pointed to by the study_data field in
1026 a pcre_extra block. That is, it is the value that was passed to
1027 pcre_malloc() when PCRE was getting memory into which to place the data
1028 created by pcre_study(). If pcre_extra is NULL, or there is no study
1029 data, zero is returned. The fourth argument should point to a size_t
1030 variable.
1031
1033
1034 int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
1035
1036 The pcre_info() function is now obsolete because its interface is too
1037 restrictive to return all the available data about a compiled pattern.
1038 New programs should use pcre_fullinfo() instead. The yield of
1039 pcre_info() is the number of capturing subpatterns, or one of the fol‐
1040 lowing negative numbers:
1041
1042 PCRE_ERROR_NULL the argument code was NULL
1043 PCRE_ERROR_BADMAGIC the "magic number" was not found
1044
1045 If the optptr argument is not NULL, a copy of the options with which
1046 the pattern was compiled is placed in the integer it points to (see
1047 PCRE_INFO_OPTIONS above).
1048
1049 If the pattern is not anchored and the firstcharptr argument is not
1050 NULL, it is used to pass back information about the first character of
1051 any matched string (see PCRE_INFO_FIRSTBYTE above).
1052
1054
1055 int pcre_refcount(pcre *code, int adjust);
1056
1057 The pcre_refcount() function is used to maintain a reference count in
1058 the data block that contains a compiled pattern. It is provided for the
1059 benefit of applications that operate in an object-oriented manner,
1060 where different parts of the application may be using the same compiled
1061 pattern, but you want to free the block when they are all done.
1062
1063 When a pattern is compiled, the reference count field is initialized to
1064 zero. It is changed only by calling this function, whose action is to
1065 add the adjust value (which may be positive or negative) to it. The
1066 yield of the function is the new value. However, the value of the count
1067 is constrained to lie between 0 and 65535, inclusive. If the new value
1068 is outside these limits, it is forced to the appropriate limit value.
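The clamping rule can be stated compactly in code. This is a model of the behaviour described above, written for illustration only; it is not PCRE's source, and the name adjusted_refcount is hypothetical.

```c
#include <assert.h>

/* Model of the rule described above (not PCRE source): add adjust to the
   current count and force the result into the range 0..65535. The
   comparisons are arranged so that no intermediate sum can overflow. */
int adjusted_refcount(int current, int adjust)
{
    assert(current >= 0 && current <= 65535);
    if (adjust > 65535 - current)
        return 65535;            /* would exceed the upper limit */
    if (adjust < -current)
        return 0;                /* would fall below zero */
    return current + adjust;
}
```

In an application, one plausible convention is to call pcre_refcount(re, 1) when a component starts using a pattern and pcre_refcount(re, -1) when it finishes, freeing the pattern once the returned value reaches zero.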
1069
1070 Except when it is zero, the reference count is not correctly preserved
1071 if a pattern is compiled on one host and then transferred to a host
1072 whose byte-order is different. (This seems a highly unlikely scenario.)
1073
1075
1076 int pcre_exec(const pcre *code, const pcre_extra *extra,
1077 const char *subject, int length, int startoffset,
1078 int options, int *ovector, int ovecsize);
1079
1080 The function pcre_exec() is called to match a subject string against a
1081 compiled pattern, which is passed in the code argument. If the pattern
1082 was studied, the result of the study should be passed in the extra
1083 argument. This function is the main matching facility of the library,
1084 and it operates in a Perl-like manner. For specialist use there is also
1085 an alternative matching function, which is described below in the sec‐
1086 tion about the pcre_dfa_exec() function.
1087
1088 In most applications, the pattern will have been compiled (and option‐
1089 ally studied) in the same process that calls pcre_exec(). However, it
1090 is possible to save compiled patterns and study data, and then use them
1091 later in different processes, possibly even on different hosts. For a
1092 discussion about this, see the pcreprecompile documentation.
1093
1094 Here is an example of a simple call to pcre_exec():
1095
1096 int rc;
1097 int ovector[30];
1098 rc = pcre_exec(
1099 re, /* result of pcre_compile() */
1100 NULL, /* we didn't study the pattern */
1101 "some string", /* the subject string */
1102 11, /* the length of the subject string */
1103 0, /* start at offset 0 in the subject */
1104 0, /* default options */
1105 ovector, /* vector of integers for substring information */
1106 30); /* number of elements (NOT size in bytes) */
1107
1108 Extra data for pcre_exec()
1109
1110 If the extra argument is not NULL, it must point to a pcre_extra data
1111 block. The pcre_study() function returns such a block (when it doesn't
1112 return NULL), but you can also create one for yourself, and pass addi‐
1113 tional information in it. The pcre_extra block contains the following
1114 fields (not necessarily in this order):
1115
1116 unsigned long int flags;
1117 void *study_data;
1118 unsigned long int match_limit;
1119 unsigned long int match_limit_recursion;
1120 void *callout_data;
1121 const unsigned char *tables;
1122 unsigned char **mark;
1123
1124 The flags field is a bitmap that specifies which of the other fields
1125 are set. The flag bits are:
1126
1127 PCRE_EXTRA_STUDY_DATA
1128 PCRE_EXTRA_MATCH_LIMIT
1129 PCRE_EXTRA_MATCH_LIMIT_RECURSION
1130 PCRE_EXTRA_CALLOUT_DATA
1131 PCRE_EXTRA_TABLES
1132 PCRE_EXTRA_MARK
1133
1134 Other flag bits should be set to zero. The study_data field is set in
1135 the pcre_extra block that is returned by pcre_study(), together with
1136 the appropriate flag bit. You should not set this yourself, but you may
1137 add to the block by setting the other fields and their corresponding
1138 flag bits.
1139
1140 The match_limit field provides a means of preventing PCRE from using up
1141 a vast amount of resources when running patterns that are not going to
1142 match, but which have a very large number of possibilities in their
1143 search trees. The classic example is a pattern that uses nested unlim‐
1144 ited repeats.
1145
1146 Internally, PCRE uses a function called match() which it calls repeat‐
1147 edly (sometimes recursively). The limit set by match_limit is imposed
1148 on the number of times this function is called during a match, which
1149 has the effect of limiting the amount of backtracking that can take
1150 place. For patterns that are not anchored, the count restarts from zero
1151 for each position in the subject string.
1152
1153 The default value for the limit can be set when PCRE is built; the
1154 default default is 10 million, which handles all but the most extreme
1155 cases. You can override the default by supplying pcre_exec() with a
1156 pcre_extra block in which match_limit is set, and
1157 PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is
1158 exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
1159
1160 The match_limit_recursion field is similar to match_limit, but instead
1161 of limiting the total number of times that match() is called, it limits
1162 the depth of recursion. The recursion depth is a smaller number than
1163 the total number of calls, because not all calls to match() are recur‐
1164 sive. This limit is of use only if it is set smaller than match_limit.
1165
1166 Limiting the recursion depth limits the amount of stack that can be
1167 used, or, when PCRE has been compiled to use memory on the heap instead
1168 of the stack, the amount of heap memory that can be used.
1169
1170 The default value for match_limit_recursion can be set when PCRE is
1171 built; the default default is the same value as the default for
1172 match_limit. You can override the default by supplying pcre_exec() with
1173 a pcre_extra block in which match_limit_recursion is set, and
1174 PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the
1175 limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
1176
1177 The callout_data field is used in conjunction with the "callout" fea‐
1178 ture, and is described in the pcrecallout documentation.
1179
1180 The tables field is used to pass a character tables pointer to
1181 pcre_exec(); this overrides the value that is stored with the compiled
1182 pattern. A non-NULL value is stored with the compiled pattern only if
1183 custom tables were supplied to pcre_compile() via its tableptr argu‐
1184 ment. If NULL is passed to pcre_exec() using this mechanism, it forces
1185 PCRE's internal tables to be used. This facility is helpful when re-
1186 using patterns that have been saved after compiling with an external
1187 set of tables, because the external tables might be at a different
1188 address when pcre_exec() is called. See the pcreprecompile documenta‐
1189 tion for a discussion of saving compiled patterns for later use.
1190
1191 If PCRE_EXTRA_MARK is set in the flags field, the mark field must be
1192 set to point to a char * variable. If the pattern contains any back‐
1193 tracking control verbs such as (*MARK:NAME), and the execution ends up
1194 with a name to pass back, a pointer to the name string (zero termi‐
1195 nated) is placed in the variable pointed to by the mark field. The
1196 names are within the compiled pattern; if you wish to retain such a
1197 name you must copy it before freeing the memory of a compiled pattern.
1198 If there is no name to pass back, the variable pointed to by the mark
1199 field is set to NULL. For details of the backtracking control verbs, see
1200 the section entitled "Backtracking control" in the pcrepattern documen‐
1201 tation.
1202
1203 Option bits for pcre_exec()
1204
1205 The unused bits of the options argument for pcre_exec() must be zero.
1206 The only bits that may be set are PCRE_ANCHORED, PCRE_BSR_xxx,
1207 PCRE_NEWLINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
1208 PCRE_NOTEMPTY_ATSTART, PCRE_NO_START_OPTIMIZE, PCRE_NO_UTF8_CHECK,
1209 PCRE_PARTIAL_SOFT, and PCRE_PARTIAL_HARD.
1210
1211 PCRE_ANCHORED
1212
1213 The PCRE_ANCHORED option limits pcre_exec() to matching at the first
1214 matching position. If a pattern was compiled with PCRE_ANCHORED, or
1215 turned out to be anchored by virtue of its contents, it cannot be made
1216 unanchored at matching time.
1217
1218 PCRE_BSR_ANYCRLF
1219 PCRE_BSR_UNICODE
1220
1221 These options (which are mutually exclusive) control what the \R escape
1222 sequence matches. The choice is either to match only CR, LF, or CRLF,
1223 or to match any Unicode newline sequence. These options override the
1224 choice that was made or defaulted when the pattern was compiled.
1225
1226 PCRE_NEWLINE_CR
1227 PCRE_NEWLINE_LF
1228 PCRE_NEWLINE_CRLF
1229 PCRE_NEWLINE_ANYCRLF
1230 PCRE_NEWLINE_ANY
1231
1232 These options override the newline definition that was chosen or
1233 defaulted when the pattern was compiled. For details, see the descrip‐
1234 tion of pcre_compile() above. During matching, the newline choice
1235 affects the behaviour of the dot, circumflex, and dollar metacharac‐
1236 ters. It may also alter the way the match position is advanced after a
1237 match failure for an unanchored pattern.
1238
1239 When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is
1240 set, and a match attempt for an unanchored pattern fails when the cur‐
1241 rent position is at a CRLF sequence, and the pattern contains no
1242 explicit matches for CR or LF characters, the match position is
1243 advanced by two characters instead of one, in other words, to after the
1244 CRLF.
1245
1246 The above rule is a compromise that makes the most common cases work as
1247 expected. For example, if the pattern is .+A (and the PCRE_DOTALL
1248 option is not set), it does not match the string "\r\nA" because, after
1249 failing at the start, it skips both the CR and the LF before retrying.
1250 However, the pattern [\r\n]A does match that string, because it con‐
1251 tains an explicit CR or LF reference, and so advances only by one char‐
1252 acter after the first failure.
1253
1254 An explicit match for CR or LF is either a literal appearance of one of
1255 those characters, or one of the \r or \n escape sequences. Implicit
1256 matches such as [^X] do not count, nor does \s (which includes CR and
1257 LF in the characters that it matches).
1258
1259 Notwithstanding the above, anomalous effects may still occur when CRLF
1260 is a valid newline sequence and explicit \r or \n escapes appear in the
1261 pattern.
1262
1263 PCRE_NOTBOL
1264
1265 This option specifies that the first character of the subject string is not
1266 the beginning of a line, so the circumflex metacharacter should not
1267 match before it. Setting this without PCRE_MULTILINE (at compile time)
1268 causes circumflex never to match. This option affects only the behav‐
1269 iour of the circumflex metacharacter. It does not affect \A.
1270
1271 PCRE_NOTEOL
1272
1273 This option specifies that the end of the subject string is not the end
1274 of a line, so the dollar metacharacter should not match it nor (except
1275 in multiline mode) a newline immediately before it. Setting this with‐
1276 out PCRE_MULTILINE (at compile time) causes dollar never to match. This
1277 option affects only the behaviour of the dollar metacharacter. It does
1278 not affect \Z or \z.
1279
1280 PCRE_NOTEMPTY
1281
1282 An empty string is not considered to be a valid match if this option is
1283 set. If there are alternatives in the pattern, they are tried. If all
1284 the alternatives match the empty string, the entire match fails. For
1285 example, if the pattern
1286
1287 a?b?
1288
1289 is applied to a string not beginning with "a" or "b", it matches an
1290 empty string at the start of the subject. With PCRE_NOTEMPTY set, this
1291 match is not valid, so PCRE searches further into the string for occur‐
1292 rences of "a" or "b".
1293
1294 PCRE_NOTEMPTY_ATSTART
1295
1296 This is like PCRE_NOTEMPTY, except that an empty string match that is
1297 not at the start of the subject is permitted. If the pattern is
1298 anchored, such a match can occur only if the pattern contains \K.
1299
1300 Perl has no direct equivalent of PCRE_NOTEMPTY or
1301 PCRE_NOTEMPTY_ATSTART, but it does make a special case of a pattern
1302 match of the empty string within its split() function, and when using
1303 the /g modifier. It is possible to emulate Perl's behaviour after
1304 matching a null string by first trying the match again at the same off‐
1305 set with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED, and then if that
1306 fails, by advancing the starting offset (see below) and trying an ordi‐
1307 nary match again. There is some code that demonstrates how to do this
1308 in the pcredemo sample program.
1309
1310 PCRE_NO_START_OPTIMIZE
1311
1312 There are a number of optimizations that pcre_exec() uses at the start
1313 of a match, in order to speed up the process. For example, if it is
1314 known that an unanchored match must start with a specific character, it
1315 searches the subject for that character, and fails immediately if it
1316 cannot find it, without actually running the main matching function.
1317 This means that a special item such as (*COMMIT) at the start of a pat‐
1318 tern is not considered until after a suitable starting point for the
1319 match has been found. When callouts or (*MARK) items are in use, these
1320 "start-up" optimizations can cause them to be skipped if the pattern is
1321 never actually used. The start-up optimizations are in effect a pre-
1322 scan of the subject that takes place before the pattern is run.
1323
1324 The PCRE_NO_START_OPTIMIZE option disables the start-up optimizations,
1325 possibly causing performance to suffer, but ensuring that in cases
1326 where the result is "no match", the callouts do occur, and that items
1327 such as (*COMMIT) and (*MARK) are considered at every possible starting
1328 position in the subject string. Setting PCRE_NO_START_OPTIMIZE can
1329 change the outcome of a matching operation. Consider the pattern
1330
1331 (*COMMIT)ABC
1332
When this is compiled, PCRE records the fact that a match must start
with the character "A". Suppose the subject string is "DEFABC". The
start-up optimization scans along the subject, finds "A" and runs the
first match attempt from there. The (*COMMIT) item means that the
pattern must match at the current starting position, which in this
case it does. However, if the same match is run with
PCRE_NO_START_OPTIMIZE set, the initial scan along the subject string
does not happen. The first match attempt is run starting from "D" and
when this fails, (*COMMIT) prevents any further matches being tried, so
the overall result is "no match".

If the pattern is studied, more start-up optimizations may be used. For
example, a minimum length for the subject may be recorded. Consider the
pattern
1345
1346 (*MARK:A)(X|Y)
1347
1348 The minimum length for a match is one character. If the subject is
1349 "ABC", there will be attempts to match "ABC", "BC", "C", and then
1350 finally an empty string. If the pattern is studied, the final attempt
1351 does not take place, because PCRE knows that the subject is too short,
1352 and so the (*MARK) is never encountered. In this case, studying the
1353 pattern does not affect the overall match result, which is still "no
1354 match", but it does affect the auxiliary information that is returned.
1355
1356 PCRE_NO_UTF8_CHECK
1357
1358 When PCRE_UTF8 is set at compile time, the validity of the subject as a
1359 UTF-8 string is automatically checked when pcre_exec() is subsequently
1360 called. The value of startoffset is also checked to ensure that it
1361 points to the start of a UTF-8 character. There is a discussion about
1362 the validity of UTF-8 strings in the section on UTF-8 support in the
1363 main pcre page. If an invalid UTF-8 sequence of bytes is found,
1364 pcre_exec() returns the error PCRE_ERROR_BADUTF8. If startoffset con‐
1365 tains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.
1366
1367 If you already know that your subject is valid, and you want to skip
1368 these checks for performance reasons, you can set the
1369 PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to
1370 do this for the second and subsequent calls to pcre_exec() if you are
1371 making repeated calls to find all the matches in a single subject
1372 string. However, you should be sure that the value of startoffset
1373 points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
1374 set, the effect of passing an invalid UTF-8 string as a subject, or a
1375 value of startoffset that does not point to the start of a UTF-8 char‐
1376 acter, is undefined. Your program may crash.
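
Because the behaviour is undefined when startoffset falls inside a
multi-byte character, a caller that sets PCRE_NO_UTF8_CHECK may want to
verify the offset itself. The following is a minimal sketch, not part
of PCRE: the helper name utf8_offset_is_char_start() is hypothetical.
It relies only on the UTF-8 rule that continuation bytes have the form
10xxxxxx, and it does not validate the subject as a whole.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a PCRE function): returns 1 if `offset` is a
   plausible value for startoffset in UTF-8 mode, that is, it does not
   point at a continuation byte (10xxxxxx). Returns 0 otherwise. */
int utf8_offset_is_char_start(const char *subject, int length, int offset)
{
    assert(subject != NULL);
    if (offset < 0 || offset > length)
        return 0;                 /* outside the subject altogether */
    if (offset == length)
        return 1;                 /* the very end is always a boundary */
    return ((unsigned char)subject[offset] & 0xC0) != 0x80;
}
```

This checks only the single byte at the offset; it assumes the subject
has already been validated once, for example by an earlier call to
pcre_exec() without PCRE_NO_UTF8_CHECK.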
1377
1378 PCRE_PARTIAL_HARD
1379 PCRE_PARTIAL_SOFT
1380
1381 These options turn on the partial matching feature. For backwards com‐
1382 patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
1383 match occurs if the end of the subject string is reached successfully,
1384 but there are not enough subject characters to complete the match. If
1385 this happens when PCRE_PARTIAL_HARD is set, pcre_exec() immediately
1386 returns PCRE_ERROR_PARTIAL. Otherwise, if PCRE_PARTIAL_SOFT is set,
1387 matching continues by testing any other alternatives. Only if they all
1388 fail is PCRE_ERROR_PARTIAL returned (instead of PCRE_ERROR_NOMATCH).
1389 The portion of the string that was inspected when the partial match was
1390 found is set as the first matching string. There is a more detailed
1391 discussion in the pcrepartial documentation.
1392
1393 The string to be matched by pcre_exec()
1394
1395 The subject string is passed to pcre_exec() as a pointer in subject, a
1396 length (in bytes) in length, and a starting byte offset in startoffset.
1397 In UTF-8 mode, the byte offset must point to the start of a UTF-8 char‐
1398 acter. Unlike the pattern string, the subject may contain binary zero
1399 bytes. When the starting offset is zero, the search for a match starts
1400 at the beginning of the subject, and this is by far the most common
1401 case.
1402
1403 A non-zero starting offset is useful when searching for another match
1404 in the same subject by calling pcre_exec() again after a previous suc‐
1405 cess. Setting startoffset differs from just passing over a shortened
1406 string and setting PCRE_NOTBOL in the case of a pattern that begins
1407 with any kind of lookbehind. For example, consider the pattern
1408
1409 \Biss\B
1410
which finds occurrences of "iss" in the middle of words. (\B matches
only if the current position in the subject is not a word boundary.)
When applied to the string "Mississippi" the first call to pcre_exec()
finds the first occurrence. If pcre_exec() is called again with just
the remainder of the subject, namely "issippi", it does not match,
because \B is always false at the start of the subject, which is deemed
to be a word boundary. However, if pcre_exec() is passed the entire
string again, but with startoffset set to 4, it finds the second
occurrence of "iss" because it is able to look behind the starting
point to discover that it is preceded by a letter.
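
The difference between the two calls can be modelled without PCRE at
all. The sketch below is a simplified ASCII-only model of the word
boundary test, written for illustration; is_word_char() and
is_word_boundary() are hypothetical names and this is not PCRE's
implementation. It shows that position 4 of the full string is not a
boundary, whereas position 0 of the shortened string is.

```c
#include <ctype.h>

/* Simplified model: a "word" character is an ASCII letter, digit, or
   underscore. Positions outside the subject count as non-word. */
int is_word_char(const char *subject, int length, int pos)
{
    if (pos < 0 || pos >= length)
        return 0;
    unsigned char c = (unsigned char)subject[pos];
    return isalnum(c) || c == '_';
}

/* Returns 1 if there is a word boundary immediately before `pos`,
   that is, exactly one of the characters on either side is a word
   character. \B succeeds where this returns 0. */
int is_word_boundary(const char *subject, int length, int pos)
{
    return is_word_char(subject, length, pos - 1)
        != is_word_char(subject, length, pos);
}
```

With the whole subject available, the character before offset 4 is
visible to the test; with a shortened subject, nothing precedes offset
0, so the position is treated as a boundary and \B fails.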
1421
1422 If a non-zero starting offset is passed when the pattern is anchored,
1423 one attempt to match at the given offset is made. This can only succeed
1424 if the pattern does not require the match to be at the start of the
1425 subject.
1426
1427 How pcre_exec() returns captured substrings
1428
1429 In general, a pattern matches a certain portion of the subject, and in
1430 addition, further substrings from the subject may be picked out by
1431 parts of the pattern. Following the usage in Jeffrey Friedl's book,
1432 this is called "capturing" in what follows, and the phrase "capturing
1433 subpattern" is used for a fragment of a pattern that picks out a sub‐
1434 string. PCRE supports several other kinds of parenthesized subpattern
1435 that do not cause substrings to be captured.
1436
1437 Captured substrings are returned to the caller via a vector of integers
1438 whose address is passed in ovector. The number of elements in the vec‐
1439 tor is passed in ovecsize, which must be a non-negative number. Note:
1440 this argument is NOT the size of ovector in bytes.
1441
1442 The first two-thirds of the vector is used to pass back captured sub‐
1443 strings, each substring using a pair of integers. The remaining third
1444 of the vector is used as workspace by pcre_exec() while matching cap‐
1445 turing subpatterns, and is not available for passing back information.
1446 The number passed in ovecsize should always be a multiple of three. If
1447 it is not, it is rounded down.
1448
1449 When a match is successful, information about captured substrings is
1450 returned in pairs of integers, starting at the beginning of ovector,
1451 and continuing up to two-thirds of its length at the most. The first
1452 element of each pair is set to the byte offset of the first character
1453 in a substring, and the second is set to the byte offset of the first
1454 character after the end of a substring. Note: these values are always
1455 byte offsets, even in UTF-8 mode. They are not character counts.
1456
The first pair of integers, ovector[0] and ovector[1], identifies the
portion of the subject string matched by the entire pattern. The next
pair is used for the first capturing subpattern, and so on. The value
returned by pcre_exec() is one more than the highest numbered pair that
has been set. For example, if two substrings have been captured, the
returned value is 3. If there are no capturing subpatterns, the return
value from a successful match is 1, indicating that just the first pair
of offsets has been set.
1465
1466 If a capturing subpattern is matched repeatedly, it is the last portion
1467 of the string that it matched that is returned.
1468
1469 If the vector is too small to hold all the captured substring offsets,
1470 it is used as far as possible (up to two-thirds of its length), and the
1471 function returns a value of zero. If the substring offsets are not of
1472 interest, pcre_exec() may be called with ovector passed as NULL and
1473 ovecsize as zero. However, if the pattern contains back references and
1474 the ovector is not big enough to remember the related substrings, PCRE
1475 has to get additional memory for use during matching. Thus it is usu‐
1476 ally advisable to supply an ovector.
1477
1478 The pcre_fullinfo() function can be used to find out how many capturing
1479 subpatterns there are in a compiled pattern. The smallest size for
1480 ovector that will allow for n captured substrings, in addition to the
1481 offsets of the substring matched by the whole pattern, is (n+1)*3.
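
The sizing rule, together with the rounding-down rule for ovecsize
described earlier, can be written as two small helpers. This is a
sketch only; ovector_size_for() and usable_pairs() are hypothetical
names, not PCRE functions. In a real program the capture count would
typically come from pcre_fullinfo() with PCRE_INFO_CAPTURECOUNT.

```c
#include <assert.h>

/* Smallest ovecsize that can return n captured substrings plus the
   whole-pattern match: (n+1) pairs, plus one third extra as workspace. */
int ovector_size_for(int capture_count)
{
    assert(capture_count >= 0);
    return (capture_count + 1) * 3;
}

/* Number of offset pairs that pcre_exec() can pass back for a given
   ovecsize. Only the first two-thirds of the vector holds results, and
   a size that is not a multiple of three is rounded down. */
int usable_pairs(int ovecsize)
{
    assert(ovecsize >= 0);
    return ovecsize / 3;
}
```

For example, a pattern with two capturing subpatterns needs an ovector
of at least 9 integers, and an ovector of 10 integers still provides
only 3 usable pairs.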
1482
1483 It is possible for capturing subpattern number n+1 to match some part
1484 of the subject when subpattern n has not been used at all. For example,
1485 if the string "abc" is matched against the pattern (a|(z))(bc) the
1486 return from the function is 4, and subpatterns 1 and 3 are matched, but
1487 2 is not. When this happens, both values in the offset pairs corre‐
1488 sponding to unused subpatterns are set to -1.
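
A caller that walks the returned pairs therefore has to test for -1
before using an offset. The following sketch shows one way to do so;
the helper names capture_is_set(), capture_length(), and
print_captures() are hypothetical and are not part of the PCRE API.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 if pair n was set by the match, 0 if it holds -1. */
int capture_is_set(const int *ovector, int n)
{
    assert(ovector != NULL && n >= 0);
    return ovector[2 * n] >= 0;
}

/* Length in bytes of captured substring n, or -1 if it is unset. */
int capture_length(const int *ovector, int n)
{
    if (!capture_is_set(ovector, n))
        return -1;
    return ovector[2 * n + 1] - ovector[2 * n];
}

/* Print pairs 0 .. rc-1, where rc is a positive return value from
   pcre_exec(). Uses %.*s, so output stops early at a binary zero. */
void print_captures(const char *subject, const int *ovector, int rc)
{
    for (int i = 0; i < rc; i++) {
        if (capture_is_set(ovector, i))
            printf("%d: %.*s\n", i, capture_length(ovector, i),
                   subject + ovector[2 * i]);
        else
            printf("%d: <unset>\n", i);
    }
}
```

For the example above, "abc" matched against (a|(z))(bc), the pairs
are (0,3), (0,1), (-1,-1), and (1,3), so subpattern 2 is reported as
unset while subpattern 3 has length 2.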
1489
1490 Offset values that correspond to unused subpatterns at the end of the
1491 expression are also set to -1. For example, if the string "abc" is
1492 matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
1493 matched. The return from the function is 2, because the highest used
1494 capturing subpattern number is 1. However, you can refer to the offsets
1495 for the second and third capturing subpatterns if you wish (assuming
1496 the vector is large enough, of course).
1497
1498 Some convenience functions are provided for extracting the captured
1499 substrings as separate strings. These are described below.
1500
1501 Error return values from pcre_exec()
1502
1503 If pcre_exec() fails, it returns a negative number. The following are
1504 defined in the header file:
1505
1506 PCRE_ERROR_NOMATCH (-1)
1507
1508 The subject string did not match the pattern.
1509
1510 PCRE_ERROR_NULL (-2)
1511
1512 Either code or subject was passed as NULL, or ovector was NULL and
1513 ovecsize was not zero.
1514
1515 PCRE_ERROR_BADOPTION (-3)
1516
1517 An unrecognized bit was set in the options argument.
1518
1519 PCRE_ERROR_BADMAGIC (-4)
1520
1521 PCRE stores a 4-byte "magic number" at the start of the compiled code,
1522 to catch the case when it is passed a junk pointer and to detect when a
1523 pattern that was compiled in an environment of one endianness is run in
1524 an environment with the other endianness. This is the error that PCRE
1525 gives when the magic number is not present.
1526
1527 PCRE_ERROR_UNKNOWN_OPCODE (-5)
1528
1529 While running the pattern match, an unknown item was encountered in the
1530 compiled pattern. This error could be caused by a bug in PCRE or by
1531 overwriting of the compiled pattern.
1532
1533 PCRE_ERROR_NOMEMORY (-6)
1534
1535 If a pattern contains back references, but the ovector that is passed
1536 to pcre_exec() is not big enough to remember the referenced substrings,
1537 PCRE gets a block of memory at the start of matching to use for this
1538 purpose. If the call via pcre_malloc() fails, this error is given. The
1539 memory is automatically freed at the end of matching.
1540
1541 This error is also given if pcre_stack_malloc() fails in pcre_exec().
1542 This can happen only when PCRE has been compiled with --disable-stack-
1543 for-recursion.
1544
1545 PCRE_ERROR_NOSUBSTRING (-7)
1546
1547 This error is used by the pcre_copy_substring(), pcre_get_substring(),
1548 and pcre_get_substring_list() functions (see below). It is never
1549 returned by pcre_exec().
1550
1551 PCRE_ERROR_MATCHLIMIT (-8)
1552
1553 The backtracking limit, as specified by the match_limit field in a
1554 pcre_extra structure (or defaulted) was reached. See the description
1555 above.
1556
1557 PCRE_ERROR_CALLOUT (-9)
1558
1559 This error is never generated by pcre_exec() itself. It is provided for
1560 use by callout functions that want to yield a distinctive error code.
1561 See the pcrecallout documentation for details.
1562
1563 PCRE_ERROR_BADUTF8 (-10)
1564
1565 A string that contains an invalid UTF-8 byte sequence was passed as a
1566 subject.
1567
1568 PCRE_ERROR_BADUTF8_OFFSET (-11)
1569
1570 The UTF-8 byte sequence that was passed as a subject was valid, but the
1571 value of startoffset did not point to the beginning of a UTF-8 charac‐
1572 ter.
1573
1574 PCRE_ERROR_PARTIAL (-12)
1575
1576 The subject string did not match, but it did match partially. See the
1577 pcrepartial documentation for details of partial matching.
1578
1579 PCRE_ERROR_BADPARTIAL (-13)
1580
1581 This code is no longer in use. It was formerly returned when the
1582 PCRE_PARTIAL option was used with a compiled pattern containing items
1583 that were not supported for partial matching. From release 8.00
1584 onwards, there are no restrictions on partial matching.
1585
1586 PCRE_ERROR_INTERNAL (-14)
1587
1588 An unexpected internal error has occurred. This error could be caused
1589 by a bug in PCRE or by overwriting of the compiled pattern.
1590
1591 PCRE_ERROR_BADCOUNT (-15)
1592
1593 This error is given if the value of the ovecsize argument is negative.
1594
1595 PCRE_ERROR_RECURSIONLIMIT (-21)
1596
1597 The internal recursion limit, as specified by the match_limit_recursion
1598 field in a pcre_extra structure (or defaulted) was reached. See the
1599 description above.
1600
1601 PCRE_ERROR_BADNEWLINE (-23)
1602
1603 An invalid combination of PCRE_NEWLINE_xxx options was given.
1604
1605 Error numbers -16 to -20 and -22 are not used by pcre_exec().
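
Callers often need only a coarse view of the return value. The sketch
below groups the values listed above into a few categories. The enum
and the function classify_exec_result() are hypothetical, and the
grouping is one possible design, not something PCRE prescribes. The
numeric literals are used only to keep the sketch standalone; real
code should use the PCRE_ERROR_xxx names from pcre.h.

```c
/* Coarse outcome categories for a pcre_exec() return value.
   These names are illustrative and are not defined by PCRE. */
enum exec_outcome {
    EXEC_MATCHED,            /* rc > 0: pairs 0 .. rc-1 are available   */
    EXEC_MATCHED_TRUNCATED,  /* rc == 0: matched, ovector was too small */
    EXEC_NO_MATCH,           /* PCRE_ERROR_NOMATCH (-1)                 */
    EXEC_PARTIAL,            /* PCRE_ERROR_PARTIAL (-12)                */
    EXEC_RESOURCE_LIMIT,     /* NOMEMORY, MATCHLIMIT, RECURSIONLIMIT    */
    EXEC_FAILED              /* any other negative value                */
};

enum exec_outcome classify_exec_result(int rc)
{
    if (rc > 0)
        return EXEC_MATCHED;
    if (rc == 0)
        return EXEC_MATCHED_TRUNCATED;
    switch (rc) {
    case -1:  return EXEC_NO_MATCH;        /* PCRE_ERROR_NOMATCH        */
    case -12: return EXEC_PARTIAL;         /* PCRE_ERROR_PARTIAL        */
    case -6:                               /* PCRE_ERROR_NOMEMORY       */
    case -8:                               /* PCRE_ERROR_MATCHLIMIT     */
    case -21: return EXEC_RESOURCE_LIMIT;  /* PCRE_ERROR_RECURSIONLIMIT */
    default:  return EXEC_FAILED;
    }
}
```

Treating a zero return as a successful but truncated match reflects
the description of ovector above: the match succeeded, but not all
offsets could be stored.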

EXTRACTING CAPTURED SUBSTRINGS BY NUMBER

1609 int pcre_copy_substring(const char *subject, int *ovector,
1610 int stringcount, int stringnumber, char *buffer,
1611 int buffersize);
1612
1613 int pcre_get_substring(const char *subject, int *ovector,
1614 int stringcount, int stringnumber,
1615 const char **stringptr);
1616
1617 int pcre_get_substring_list(const char *subject,
1618 int *ovector, int stringcount, const char ***listptr);
1619
1620 Captured substrings can be accessed directly by using the offsets
1621 returned by pcre_exec() in ovector. For convenience, the functions
1622 pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub‐
1623 string_list() are provided for extracting captured substrings as new,
1624 separate, zero-terminated strings. These functions identify substrings
1625 by number. The next section describes functions for extracting named
1626 substrings.
1627
1628 A substring that contains a binary zero is correctly extracted and has
1629 a further zero added on the end, but the result is not, of course, a C
1630 string. However, you can process such a string by referring to the
1631 length that is returned by pcre_copy_substring() and pcre_get_sub‐
1632 string(). Unfortunately, the interface to pcre_get_substring_list() is
1633 not adequate for handling strings containing binary zeros, because the
1634 end of the final string is not independently indicated.
1635
1636 The first three arguments are the same for all three of these func‐
1637 tions: subject is the subject string that has just been successfully
1638 matched, ovector is a pointer to the vector of integer offsets that was
1639 passed to pcre_exec(), and stringcount is the number of substrings that
1640 were captured by the match, including the substring that matched the
1641 entire regular expression. This is the value returned by pcre_exec() if
1642 it is greater than zero. If pcre_exec() returned zero, indicating that
1643 it ran out of space in ovector, the value passed as stringcount should
1644 be the number of elements in the vector divided by three.
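
The rule for choosing stringcount can be captured in a small helper.
This is a sketch under the rules just stated; stringcount_for() is a
hypothetical name and not a PCRE function.

```c
#include <assert.h>

/* Derive the stringcount argument for the substring functions from
   the value returned by pcre_exec() and the ovecsize that was passed.
   A negative rc (an error such as "no match") is returned unchanged,
   in which case no substrings should be extracted. */
int stringcount_for(int rc, int ovecsize)
{
    assert(ovecsize >= 0);
    if (rc > 0)
        return rc;            /* normal case: use the return value   */
    if (rc == 0)
        return ovecsize / 3;  /* ovector was too small: use its pairs */
    return rc;                /* error: nothing to extract            */
}
```

For instance, after a zero return with an ovector of 30 integers, the
value to pass as stringcount is 10.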
1645
1646 The functions pcre_copy_substring() and pcre_get_substring() extract a
1647 single substring, whose number is given as stringnumber. A value of
1648 zero extracts the substring that matched the entire pattern, whereas
1649 higher values extract the captured substrings. For pcre_copy_sub‐
1650 string(), the string is placed in buffer, whose length is given by
1651 buffersize, while for pcre_get_substring() a new block of memory is
1652 obtained via pcre_malloc, and its address is returned via stringptr.
1653 The yield of the function is the length of the string, not including
1654 the terminating zero, or one of these error codes:
1655
1656 PCRE_ERROR_NOMEMORY (-6)
1657
1658 The buffer was too small for pcre_copy_substring(), or the attempt to
1659 get memory failed for pcre_get_substring().
1660
1661 PCRE_ERROR_NOSUBSTRING (-7)
1662
1663 There is no substring whose number is stringnumber.
1664
1665 The pcre_get_substring_list() function extracts all available sub‐
1666 strings and builds a list of pointers to them. All this is done in a
1667 single block of memory that is obtained via pcre_malloc. The address of
1668 the memory block is returned via listptr, which is also the start of
1669 the list of string pointers. The end of the list is marked by a NULL
1670 pointer. The yield of the function is zero if all went well, or the
1671 error code
1672
1673 PCRE_ERROR_NOMEMORY (-6)
1674
1675 if the attempt to get the memory block failed.
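
Because the list is terminated by a NULL pointer, it can be walked
without knowing the count in advance. A minimal sketch follows;
count_substrings() is a hypothetical helper. Note that it uses only
the pointers, so it is subject to the limitation on binary zeros
described above when the strings themselves are examined.

```c
#include <assert.h>
#include <stddef.h>

/* Count the entries in a NULL-terminated list of string pointers,
   such as the one returned via listptr. */
int count_substrings(const char **list)
{
    int n = 0;
    assert(list != NULL);
    while (list[n] != NULL)
        n++;
    return n;
}
```

After the list has been processed, the whole block is released with a
single call to pcre_free_substring_list().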
1676
When any of these functions encounters a substring that is unset, which
can happen when capturing subpattern number n+1 matches some part of
the subject, but subpattern n has not been used at all, it returns an
empty string. This can be distinguished from a genuine zero-length
substring by inspecting the appropriate offset in ovector, which is
negative for unset substrings.
1683
1684 The two convenience functions pcre_free_substring() and pcre_free_sub‐
1685 string_list() can be used to free the memory returned by a previous
1686 call of pcre_get_substring() or pcre_get_substring_list(), respec‐
1687 tively. They do nothing more than call the function pointed to by
1688 pcre_free, which of course could be called directly from a C program.
1689 However, PCRE is used in some situations where it is linked via a spe‐
1690 cial interface to another programming language that cannot use
1691 pcre_free directly; it is for these cases that the functions are pro‐
1692 vided.

EXTRACTING CAPTURED SUBSTRINGS BY NAME

1696 int pcre_get_stringnumber(const pcre *code,
1697 const char *name);
1698
1699 int pcre_copy_named_substring(const pcre *code,
1700 const char *subject, int *ovector,
1701 int stringcount, const char *stringname,
1702 char *buffer, int buffersize);
1703
1704 int pcre_get_named_substring(const pcre *code,
1705 const char *subject, int *ovector,
1706 int stringcount, const char *stringname,
1707 const char **stringptr);
1708
To extract a substring by name, you first have to find the associated
number. For example, for this pattern
1711
1712 (a+)b(?<xxx>\d+)...
1713
1714 the number of the subpattern called "xxx" is 2. If the name is known to
1715 be unique (PCRE_DUPNAMES was not set), you can find the number from the
1716 name by calling pcre_get_stringnumber(). The first argument is the com‐
1717 piled pattern, and the second is the name. The yield of the function is
1718 the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no
1719 subpattern of that name.
1720
1721 Given the number, you can extract the substring directly, or use one of
1722 the functions described in the previous section. For convenience, there
1723 are also two functions that do the whole job.
1724
1725 Most of the arguments of pcre_copy_named_substring() and
1726 pcre_get_named_substring() are the same as those for the similarly
1727 named functions that extract by number. As these are described in the
1728 previous section, they are not re-described here. There are just two
1729 differences:
1730
1731 First, instead of a substring number, a substring name is given. Sec‐
1732 ond, there is an extra argument, given at the start, which is a pointer
1733 to the compiled pattern. This is needed in order to gain access to the
1734 name-to-number translation table.
1735
1736 These functions call pcre_get_stringnumber(), and if it succeeds, they
1737 then call pcre_copy_substring() or pcre_get_substring(), as appropri‐
1738 ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the
1739 behaviour may not be what you want (see the next section).
1740
1741 Warning: If the pattern uses the (?| feature to set up multiple subpat‐
1742 terns with the same number, as described in the section on duplicate
1743 subpattern numbers in the pcrepattern page, you cannot use names to
1744 distinguish the different subpatterns, because names are not included
1745 in the compiled code. The matching process uses only numbers. For this
1746 reason, the use of different names for subpatterns of the same number
1747 causes an error at compile time.

DUPLICATE SUBPATTERN NAMES

1751 int pcre_get_stringtable_entries(const pcre *code,
1752 const char *name, char **first, char **last);
1753
1754 When a pattern is compiled with the PCRE_DUPNAMES option, names for
1755 subpatterns are not required to be unique. (Duplicate names are always
1756 allowed for subpatterns with the same number, created by using the (?|
1757 feature. Indeed, if such subpatterns are named, they are required to
1758 use the same names.)
1759
1760 Normally, patterns with duplicate names are such that in any one match,
1761 only one of the named subpatterns participates. An example is shown in
1762 the pcrepattern documentation.
1763
1764 When duplicates are present, pcre_copy_named_substring() and
1765 pcre_get_named_substring() return the first substring corresponding to
1766 the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING
1767 (-7) is returned; no data is returned. The pcre_get_stringnumber()
1768 function returns one of the numbers that are associated with the name,
1769 but it is not defined which it is.
1770
1771 If you want to get full details of all captured substrings for a given
1772 name, you must use the pcre_get_stringtable_entries() function. The
1773 first argument is the compiled pattern, and the second is the name. The
1774 third and fourth are pointers to variables which are updated by the
1775 function. After it has run, they point to the first and last entries in
1776 the name-to-number table for the given name. The function itself
1777 returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
1778 there are none. The format of the table is described above in the sec‐
1779 tion entitled Information about a pattern. Given all the relevant
1780 entries for the name, you can extract each of their numbers, and hence
1781 the captured data, if any.
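
As a sketch of that last step, the function below walks the entries
between first and last and collects their subpattern numbers. It
assumes the 8-bit name table layout given in the section on
information about a pattern: each entry is entrysize bytes long and
begins with the subpattern number in two bytes, most significant byte
first, followed by the zero-terminated name. The function name
collect_group_numbers() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Collect the subpattern numbers of the name table entries from
   `first` to `last` inclusive, as set by
   pcre_get_stringtable_entries(); `entrysize` is that function's
   return value. At most max_numbers values are stored. Returns the
   number of values stored. */
int collect_group_numbers(const char *first, const char *last,
                          int entrysize, int *numbers, int max_numbers)
{
    int count = 0;
    const char *entry = first;

    assert(first != NULL && last != NULL && entrysize > 2);
    while (count < max_numbers) {
        const unsigned char *e = (const unsigned char *)entry;
        numbers[count++] = (e[0] << 8) | e[1];  /* big-endian number */
        if (entry == last)
            break;                              /* last entry handled */
        entry += entrysize;
    }
    return count;
}
```

Each collected number n can then be checked against ovector[2*n] to
see which of the duplicates, if any, took part in the match.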

FINDING ALL POSSIBLE MATCHES

1785 The traditional matching function uses a similar algorithm to Perl,
1786 which stops when it finds the first match, starting at a given point in
1787 the subject. If you want to find all possible matches, or the longest
1788 possible match, consider using the alternative matching function (see
1789 below) instead. If you cannot use the alternative function, but still
1790 need to find all possible matches, you can kludge it up by making use
1791 of the callout facility, which is described in the pcrecallout documen‐
1792 tation.
1793
1794 What you have to do is to insert a callout right at the end of the pat‐
1795 tern. When your callout function is called, extract and save the cur‐
1796 rent matched substring. Then return 1, which forces pcre_exec() to
1797 backtrack and try other alternatives. Ultimately, when it runs out of
1798 matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.

MATCHING A PATTERN: THE ALTERNATIVE FUNCTION

1802 int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
1803 const char *subject, int length, int startoffset,
1804 int options, int *ovector, int ovecsize,
1805 int *workspace, int wscount);
1806
1807 The function pcre_dfa_exec() is called to match a subject string
1808 against a compiled pattern, using a matching algorithm that scans the
1809 subject string just once, and does not backtrack. This has different
1810 characteristics to the normal algorithm, and is not compatible with
1811 Perl. Some of the features of PCRE patterns are not supported. Never‐
1812 theless, there are times when this kind of matching can be useful. For
1813 a discussion of the two matching algorithms, and a list of features
1814 that pcre_dfa_exec() does not support, see the pcrematching documenta‐
1815 tion.
1816
1817 The arguments for the pcre_dfa_exec() function are the same as for
1818 pcre_exec(), plus two extras. The ovector argument is used in a differ‐
1819 ent way, and this is described below. The other common arguments are
1820 used in the same way as for pcre_exec(), so their description is not
1821 repeated here.
1822
1823 The two additional arguments provide workspace for the function. The
1824 workspace vector should contain at least 20 elements. It is used for
1825 keeping track of multiple paths through the pattern tree. More
1826 workspace will be needed for patterns and subjects where there are a
1827 lot of potential matches.
1828
1829 Here is an example of a simple call to pcre_dfa_exec():
1830
1831 int rc;
1832 int ovector[10];
1833 int wspace[20];
1834 rc = pcre_dfa_exec(
1835 re, /* result of pcre_compile() */
1836 NULL, /* we didn't study the pattern */
1837 "some string", /* the subject string */
1838 11, /* the length of the subject string */
1839 0, /* start at offset 0 in the subject */
1840 0, /* default options */
1841 ovector, /* vector of integers for substring information */
1842 10, /* number of elements (NOT size in bytes) */
1843 wspace, /* working space vector */
1844 20); /* number of elements (NOT size in bytes) */
1845
1846 Option bits for pcre_dfa_exec()
1847
1848 The unused bits of the options argument for pcre_dfa_exec() must be
1849 zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW‐
1850 LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
1851 PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_BSR_ANYCRLF,
1852 PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD, PCRE_PAR‐
1853 TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
1854 four of these are exactly the same as for pcre_exec(), so their
1855 description is not repeated here.
1856
1857 PCRE_PARTIAL_HARD
1858 PCRE_PARTIAL_SOFT
1859
1860 These have the same general effect as they do for pcre_exec(), but the
1861 details are slightly different. When PCRE_PARTIAL_HARD is set for
1862 pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of the sub‐
1863 ject is reached and there is still at least one matching possibility
1864 that requires additional characters. This happens even if some complete
1865 matches have also been found. When PCRE_PARTIAL_SOFT is set, the return
1866 code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
1867 of the subject is reached, there have been no complete matches, but
1868 there is still at least one matching possibility. The portion of the
1869 string that was inspected when the longest partial match was found is
1870 set as the first matching string in both cases.
1871
1872 PCRE_DFA_SHORTEST
1873
1874 Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to
1875 stop as soon as it has found one match. Because of the way the alterna‐
1876 tive algorithm works, this is necessarily the shortest possible match
1877 at the first possible matching point in the subject string.
1878
1879 PCRE_DFA_RESTART
1880
When pcre_dfa_exec() returns a partial match, it is possible to call it
again, with additional subject characters, and have it continue with
the same match. The PCRE_DFA_RESTART option requests this action; when
it is set, the workspace and wscount arguments must reference the same
vector as before because data about the match so far is left in it
after a partial match. There is more discussion of this facility in the
pcrepartial documentation.
1888
1889 Successful returns from pcre_dfa_exec()
1890
1891 When pcre_dfa_exec() succeeds, it may have matched more than one sub‐
1892 string in the subject. Note, however, that all the matches from one run
1893 of the function start at the same point in the subject. The shorter
1894 matches are all initial substrings of the longer matches. For example,
1895 if the pattern
1896
1897 <.*>
1898
1899 is matched against the string
1900
1901 This is <something> <something else> <something further> no more
1902
1903 the three matched strings are
1904
1905 <something>
1906 <something> <something else>
1907 <something> <something else> <something further>
1908
1909 On success, the yield of the function is a number greater than zero,
1910 which is the number of matched substrings. The substrings themselves
1911 are returned in ovector. Each string uses two elements; the first is
1912 the offset to the start, and the second is the offset to the end. In
1913 fact, all the strings have the same start offset. (Space could have
1914 been saved by giving this only once, but it was decided to retain some
1915 compatibility with the way pcre_exec() returns data, even though the
1916 meaning of the strings is different.)
1917
1918 The strings are returned in reverse order of length; that is, the long‐
1919 est matching string is given first. If there were too many matches to
1920 fit into ovector, the yield of the function is zero, and the vector is
1921 filled with the longest matches.
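
The following sketch shows how the returned pairs might be turned back
into strings. The helper copy_dfa_match() is hypothetical, not a PCRE
function. The offsets quoted after the code were worked out by hand
for the example above; they are not output captured from a real run.

```c
#include <assert.h>
#include <string.h>

/* Copy match number i (0 is the longest) from the pairs returned by
   pcre_dfa_exec() into buffer as a zero-terminated string. `count` is
   the positive return value of pcre_dfa_exec(). Returns the length of
   the match, or -1 if i is out of range or the buffer is too small. */
int copy_dfa_match(const char *subject, const int *ovector, int count,
                   int i, char *buffer, int buffersize)
{
    assert(subject != NULL && ovector != NULL && buffer != NULL);
    if (i < 0 || i >= count)
        return -1;
    int start = ovector[2 * i];
    int len = ovector[2 * i + 1] - start;
    if (len + 1 > buffersize)
        return -1;
    memcpy(buffer, subject + start, (size_t)len);
    buffer[len] = '\0';
    return len;
}
```

For the subject shown earlier, the three matches all start at offset 8
and end at offsets 56, 36, and 19, in that order, so index 0 yields
the 48-byte match and index 2 yields "<something>".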
1922
1923 Error returns from pcre_dfa_exec()
1924
1925 The pcre_dfa_exec() function returns a negative number when it fails.
1926 Many of the errors are the same as for pcre_exec(), and these are
1927 described above. There are in addition the following errors that are
1928 specific to pcre_dfa_exec():
1929
1930 PCRE_ERROR_DFA_UITEM (-16)
1931
1932 This return is given if pcre_dfa_exec() encounters an item in the pat‐
1933 tern that it does not support, for instance, the use of \C or a back
1934 reference.
1935
1936 PCRE_ERROR_DFA_UCOND (-17)
1937
1938 This return is given if pcre_dfa_exec() encounters a condition item
1939 that uses a back reference for the condition, or a test for recursion
1940 in a specific group. These are not supported.
1941
1942 PCRE_ERROR_DFA_UMLIMIT (-18)
1943
1944 This return is given if pcre_dfa_exec() is called with an extra block
1945 that contains a setting of the match_limit field. This is not supported
1946 (it is meaningless).
1947
1948 PCRE_ERROR_DFA_WSSIZE (-19)
1949
1950 This return is given if pcre_dfa_exec() runs out of space in the
1951 workspace vector.
1952
1953 PCRE_ERROR_DFA_RECURSE (-20)
1954
1955 When a recursive subpattern is processed, the matching function calls
1956 itself recursively, using private vectors for ovector and workspace.
1957 This error is given if the output vector is not large enough. This
1958 should be extremely rare, as a vector of size 1000 is used.

SEE ALSO

pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3),
pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3),
pcrestack(3).

AUTHOR

1967 Philip Hazel
1968 University Computing Service
1969 Cambridge CB2 3QH, England.

REVISION

1973 Last updated: 21 June 2010
1974 Copyright (c) 1997-2010 University of Cambridge.