PCREAPI(3) Library Functions Manual PCREAPI(3)

PCRE - Perl-compatible regular expressions

#include <pcre.h>

pcre *pcre_compile(const char *pattern, int options,
     const char **errptr, int *erroffset,
     const unsigned char *tableptr);

pcre *pcre_compile2(const char *pattern, int options,
     int *errorcodeptr,
     const char **errptr, int *erroffset,
     const unsigned char *tableptr);

pcre_extra *pcre_study(const pcre *code, int options,
     const char **errptr);

void pcre_free_study(pcre_extra *extra);

int pcre_exec(const pcre *code, const pcre_extra *extra,
     const char *subject, int length, int startoffset,
     int options, int *ovector, int ovecsize);

int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
     const char *subject, int length, int startoffset,
     int options, int *ovector, int ovecsize,
     int *workspace, int wscount);

int pcre_copy_named_substring(const pcre *code,
     const char *subject, int *ovector,
     int stringcount, const char *stringname,
     char *buffer, int buffersize);

int pcre_copy_substring(const char *subject, int *ovector,
     int stringcount, int stringnumber, char *buffer,
     int buffersize);

int pcre_get_named_substring(const pcre *code,
     const char *subject, int *ovector,
     int stringcount, const char *stringname,
     const char **stringptr);

int pcre_get_stringnumber(const pcre *code,
     const char *name);

int pcre_get_stringtable_entries(const pcre *code,
     const char *name, char **first, char **last);

int pcre_get_substring(const char *subject, int *ovector,
     int stringcount, int stringnumber,
     const char **stringptr);

int pcre_get_substring_list(const char *subject,
     int *ovector, int stringcount, const char ***listptr);

void pcre_free_substring(const char *stringptr);

void pcre_free_substring_list(const char **stringptr);

int pcre_jit_exec(const pcre *code, const pcre_extra *extra,
     const char *subject, int length, int startoffset,
     int options, int *ovector, int ovecsize,
     pcre_jit_stack *jstack);

pcre_jit_stack *pcre_jit_stack_alloc(int startsize, int maxsize);

void pcre_jit_stack_free(pcre_jit_stack *stack);

void pcre_assign_jit_stack(pcre_extra *extra,
     pcre_jit_callback callback, void *data);

const unsigned char *pcre_maketables(void);

int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
     int what, void *where);

int pcre_refcount(pcre *code, int adjust);

int pcre_config(int what, void *where);

const char *pcre_version(void);

int pcre_pattern_to_host_byte_order(pcre *code,
     pcre_extra *extra, const unsigned char *tables);

void *(*pcre_malloc)(size_t);

void (*pcre_free)(void *);

void *(*pcre_stack_malloc)(size_t);

void (*pcre_stack_free)(void *);

int (*pcre_callout)(pcre_callout_block *);

int (*pcre_stack_guard)(void);

As well as support for 8-bit character strings, PCRE also supports
16-bit strings (from release 8.30) and 32-bit strings (from release
8.32), by means of two additional libraries. They can be built as well
as, or instead of, the 8-bit library. To avoid too much complication,
this document describes the 8-bit versions of the functions, with only
occasional references to the 16-bit and 32-bit libraries.

The 16-bit and 32-bit functions operate in the same way as their 8-bit
counterparts; they just use different data types for their arguments
and results, and their names start with pcre16_ or pcre32_ instead of
pcre_. For every option that has UTF8 in its name (for example,
PCRE_UTF8), there are corresponding 16-bit and 32-bit names with UTF8
replaced by UTF16 or UTF32, respectively. This facility is in fact just
cosmetic; the 16-bit and 32-bit option names define the same bit values.

References to bytes and UTF-8 in this document should be read as references
to 16-bit data units and UTF-16 when using the 16-bit library, or
32-bit data units and UTF-32 when using the 32-bit library, unless
specified otherwise. More details of the specific differences for the
16-bit and 32-bit libraries are given in the pcre16 and pcre32 pages.

PCRE has its own native API, which is described in this document. There
are also some wrapper functions (for the 8-bit library only) that correspond
to the POSIX regular expression API, but they do not give access
to all the functionality. They are described in the pcreposix documentation.
Both of these APIs define a set of C function calls. A C++
wrapper (again for the 8-bit library only) is also distributed with
PCRE. It is documented in the pcrecpp page.

The native API C function prototypes are defined in the header file
pcre.h, and on Unix-like systems the (8-bit) library itself is called
libpcre. It can normally be accessed by adding -lpcre to the command
for linking an application that uses PCRE. The header file defines the
macros PCRE_MAJOR and PCRE_MINOR to contain the major and minor release
numbers for the library. Applications can use these to include support
for different releases of PCRE.

In a Windows environment, if you want to statically link an application
program against a non-dll pcre.a file, you must define PCRE_STATIC before
including pcre.h or pcrecpp.h, because otherwise the pcre_malloc()
and pcre_free() exported functions will be declared __declspec(dllimport),
with unwanted results.

The functions pcre_compile(), pcre_compile2(), pcre_study(), and
pcre_exec() are used for compiling and matching regular expressions in
a Perl-compatible manner. A sample program that demonstrates the simplest
way of using them is provided in the file called pcredemo.c in
the PCRE source distribution. A listing of this program is given in the
pcredemo documentation, and the pcresample documentation describes how
to compile and run it.

Just-in-time compiler support is an optional feature of PCRE that can
be built in appropriate hardware environments. It greatly speeds up the
matching performance of many patterns. Simple programs can easily request
that it be used if available, by setting an option that is ignored
when it is not relevant. More complicated programs might need to
make use of the functions pcre_jit_stack_alloc(),
pcre_jit_stack_free(), and pcre_assign_jit_stack() in order to control
the JIT code's memory usage.

From release 8.32 there is also a direct interface for JIT execution,
which gives improved performance. The JIT-specific functions are discussed
in the pcrejit documentation.

A second matching function, pcre_dfa_exec(), which is not Perl-compatible,
is also provided. This uses a different algorithm for the matching.
The alternative algorithm finds all possible matches (at a given
point in the subject), and scans the subject just once (unless there
are lookbehind assertions). However, this algorithm does not return
captured substrings. A description of the two matching algorithms and
their advantages and disadvantages is given in the pcrematching documentation.

In addition to the main compiling and matching functions, there are
convenience functions for extracting captured substrings from a subject
string that is matched by pcre_exec(). They are:

  pcre_copy_substring()
  pcre_copy_named_substring()
  pcre_get_substring()
  pcre_get_named_substring()
  pcre_get_substring_list()
  pcre_get_stringnumber()
  pcre_get_stringtable_entries()

pcre_free_substring() and pcre_free_substring_list() are also provided,
to free the memory used for extracted strings.

The function pcre_maketables() is used to build a set of character tables
in the current locale for passing to pcre_compile(), pcre_exec(),
or pcre_dfa_exec(). This is an optional facility that is provided for
specialist use. Most commonly, no special tables are passed, in which
case internal tables that are generated when PCRE is built are used.

The function pcre_fullinfo() is used to find out information about a
compiled pattern. The function pcre_version() returns a pointer to a
string containing the version of PCRE and its date of release.

The function pcre_refcount() maintains a reference count in a data
block containing a compiled pattern. This is provided for the benefit
of object-oriented applications.

The global variables pcre_malloc and pcre_free initially contain the
entry points of the standard malloc() and free() functions, respectively.
PCRE calls the memory management functions via these variables,
so a calling program can replace them if it wishes to intercept the
calls. This should be done before calling any PCRE functions.

The global variables pcre_stack_malloc and pcre_stack_free are also indirections
to memory management functions. These special functions are
used only when PCRE is compiled to use the heap for remembering data,
instead of recursive function calls, when running the pcre_exec() function.
See the pcrebuild documentation for details of how to do this. It
is a non-standard way of building PCRE, for use in environments that
have limited stacks. Because of the greater use of memory management,
it runs more slowly. Separate functions are provided so that special-purpose
external code can be used for this case. When used, these functions
always allocate memory blocks of the same size. There is a discussion
about PCRE's stack usage in the pcrestack documentation.

The global variable pcre_callout initially contains NULL. It can be set
by the caller to a "callout" function, which PCRE will then call at
specified points during a matching operation. Details are given in the
pcrecallout documentation.

The global variable pcre_stack_guard initially contains NULL. It can be
set by the caller to a function that is called by PCRE whenever it
starts to compile a parenthesized part of a pattern. When parentheses
are nested, PCRE uses recursive function calls, which use up the system
stack. This function is provided so that applications with restricted
stacks can force a compilation error if the stack runs out. The function
should return zero if all is well, or non-zero to force an error.

PCRE supports five different conventions for indicating line breaks in
strings: a single CR (carriage return) character, a single LF (linefeed)
character, the two-character sequence CRLF, any of the three preceding,
or any Unicode newline sequence. The Unicode newline sequences
are the three just mentioned, plus the single characters VT (vertical
tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
separator, U+2028), and PS (paragraph separator, U+2029).

Each of the first three conventions is used by at least one operating
system as its standard newline sequence. When PCRE is built, a default
can be specified. If none is specified, the built-in default is LF, which
is the Unix standard. When PCRE is run, the default can be overridden,
either when a pattern is compiled, or when it is matched.

At compile time, the newline convention can be specified by the options
argument of pcre_compile(), or it can be specified by special text at
the start of the pattern itself; this overrides any other settings. See
the pcrepattern page for details of the special character sequences.

In the PCRE documentation the word "newline" is used to mean "the character
or pair of characters that indicate a line break". The choice of
newline convention affects the handling of the dot, circumflex, and
dollar metacharacters, the handling of #-comments in /x mode, and, when
CRLF is a recognized line ending sequence, the match position advancement
for a non-anchored pattern. There is more detail about this in the
section on pcre_exec() options below.

The choice of newline convention does not affect the interpretation of
the \n or \r escape sequences, nor does it affect what \R matches,
which is controlled in a similar way, but by separate options.

The PCRE functions can be used in multi-threading applications, with
the proviso that the memory management functions pointed to by
pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
callout and stack-checking functions pointed to by pcre_callout and
pcre_stack_guard, are shared by all threads.

The compiled form of a regular expression is not altered during matching,
so the same compiled pattern can safely be used by several threads
at once.

If the just-in-time optimization feature is being used, it needs separate
memory stack areas for each thread. See the pcrejit documentation
for more details.

The compiled form of a regular expression can be saved and re-used at a
later time, possibly by a different program, and even on a host other
than the one on which it was compiled. Details are given in the
pcreprecompile documentation, which includes a description of the
pcre_pattern_to_host_byte_order() function. However, compiling a regular
expression with one version of PCRE for use with a different version
is not guaranteed to work and may cause crashes.

int pcre_config(int what, void *where);

The function pcre_config() makes it possible for a PCRE client to discover
which optional features have been compiled into the PCRE library.
The pcrebuild documentation has more details about these optional features.

The first argument for pcre_config() is an integer, specifying which
information is required; the second argument is a pointer to a variable
into which the information is placed. The returned value is zero on
success, or the negative error code PCRE_ERROR_BADOPTION if the value
in the first argument is not recognized. The following information is
available:

  PCRE_CONFIG_UTF8

The output is an integer that is set to one if UTF-8 support is available;
otherwise it is set to zero. This value should normally be given
to the 8-bit version of this function, pcre_config(). If it is given to
the 16-bit or 32-bit version of this function, the result is
PCRE_ERROR_BADOPTION.

  PCRE_CONFIG_UTF16

The output is an integer that is set to one if UTF-16 support is available;
otherwise it is set to zero. This value should normally be given
to the 16-bit version of this function, pcre16_config(). If it is given
to the 8-bit or 32-bit version of this function, the result is
PCRE_ERROR_BADOPTION.

  PCRE_CONFIG_UTF32

The output is an integer that is set to one if UTF-32 support is available;
otherwise it is set to zero. This value should normally be given
to the 32-bit version of this function, pcre32_config(). If it is given
to the 8-bit or 16-bit version of this function, the result is
PCRE_ERROR_BADOPTION.

  PCRE_CONFIG_UNICODE_PROPERTIES

The output is an integer that is set to one if support for Unicode
character properties is available; otherwise it is set to zero.

  PCRE_CONFIG_JIT

The output is an integer that is set to one if support for just-in-time
compiling is available; otherwise it is set to zero.

  PCRE_CONFIG_JITTARGET

The output is a pointer to a zero-terminated "const char *" string. If
JIT support is available, the string contains the name of the architecture
for which the JIT compiler is configured, for example "x86 32bit
(little endian + unaligned)". If JIT support is not available, the result
is NULL.

  PCRE_CONFIG_NEWLINE

The output is an integer whose value specifies the default character
sequence that is recognized as meaning "newline". The values that are
supported in ASCII/Unicode environments are: 10 for LF, 13 for CR, 3338
for CRLF, -2 for ANYCRLF, and -1 for ANY. In EBCDIC environments, CR,
ANYCRLF, and ANY yield the same values. However, the value for LF is
normally 21, though some EBCDIC environments use 37. The corresponding
values for CRLF are 3349 and 3365. The default should normally correspond
to the standard sequence for your operating system.

  PCRE_CONFIG_BSR

The output is an integer whose value indicates what character sequences
the \R escape sequence matches by default. A value of 0 means that \R
matches any Unicode line ending sequence; a value of 1 means that \R
matches only CR, LF, or CRLF. The default can be overridden when a pattern
is compiled or matched.

  PCRE_CONFIG_LINK_SIZE

The output is an integer that contains the number of bytes used for internal
linkage in compiled regular expressions. For the 8-bit library,
the value can be 2, 3, or 4. For the 16-bit library, the value is either
2 or 4 and is still a number of bytes. For the 32-bit library, the
value is either 2 or 4 and is still a number of bytes. The default
value of 2 is sufficient for all but the most massive patterns, since
it allows the compiled pattern to be up to 64K in size. Larger values
allow larger regular expressions to be compiled, at the expense of
slower matching.

  PCRE_CONFIG_POSIX_MALLOC_THRESHOLD

The output is an integer that contains the threshold above which the
POSIX interface uses malloc() for output vectors. Further details are
given in the pcreposix documentation.

  PCRE_CONFIG_PARENS_LIMIT

The output is a long integer that gives the maximum depth of nesting of
parentheses (of any kind) in a pattern. This limit is imposed to cap
the amount of system stack used when a pattern is compiled. It is specified
when PCRE is built; the default is 250. This limit does not take
into account the stack that may already be used by the calling application.
For finer control over compilation stack usage, you can set a
pointer to an external checking function in pcre_stack_guard.

  PCRE_CONFIG_MATCH_LIMIT

The output is a long integer that gives the default limit for the number
of internal matching function calls in a pcre_exec() execution.
Further details are given with pcre_exec() below.

  PCRE_CONFIG_MATCH_LIMIT_RECURSION

The output is a long integer that gives the default limit for the depth
of recursion when calling the internal matching function in a
pcre_exec() execution. Further details are given with pcre_exec() below.

  PCRE_CONFIG_STACKRECURSE

The output is an integer that is set to one if internal recursion when
running pcre_exec() is implemented by recursive function calls that use
the stack to remember their state. This is the usual way that PCRE is
compiled. The output is zero if PCRE was compiled to use blocks of data
on the heap instead of recursive function calls. In this case,
pcre_stack_malloc and pcre_stack_free are called to manage memory
blocks on the heap, thus avoiding the use of the stack.

pcre *pcre_compile(const char *pattern, int options,
     const char **errptr, int *erroffset,
     const unsigned char *tableptr);

pcre *pcre_compile2(const char *pattern, int options,
     int *errorcodeptr,
     const char **errptr, int *erroffset,
     const unsigned char *tableptr);

Either of the functions pcre_compile() or pcre_compile2() can be called
to compile a pattern into an internal form. The only difference between
the two interfaces is that pcre_compile2() has an additional argument,
errorcodeptr, via which a numerical error code can be returned. To
avoid too much repetition, we refer just to pcre_compile() below, but
the information applies equally to pcre_compile2().

The pattern is a C string terminated by a binary zero, and is passed in
the pattern argument. A pointer to a single block of memory that is obtained
via pcre_malloc is returned. This contains the compiled code and
related data. The pcre type is defined for the returned block; this is
a typedef for a structure whose contents are not externally defined. It
is up to the caller to free the memory (via pcre_free) when it is no
longer required.

Although the compiled code of a PCRE regex is relocatable, that is, it
does not depend on memory location, the complete pcre data block is not
fully relocatable, because it may contain a copy of the tableptr argument,
which is an address (see below).

The options argument contains various bit settings that affect the compilation.
It should be zero if no options are required. The available
options are described below. Some of them (in particular, those that
are compatible with Perl, but some others as well) can also be set and
unset from within the pattern (see the detailed description in the
pcrepattern documentation). For those options that can be different in
different parts of the pattern, the contents of the options argument
specifies their settings at the start of compilation and execution. The
PCRE_ANCHORED, PCRE_BSR_xxx, PCRE_NEWLINE_xxx, PCRE_NO_UTF8_CHECK, and
PCRE_NO_START_OPTIMIZE options can be set at the time of matching as
well as at compile time.

If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise,
if compilation of a pattern fails, pcre_compile() returns NULL, and
sets the variable pointed to by errptr to point to a textual error message.
This is a static string that is part of the library. You must not
try to free it. Normally, the offset from the start of the pattern to
the data unit that was being processed when the error was discovered is
placed in the variable pointed to by erroffset, which must not be NULL
(if it is, an immediate error is given). However, for an invalid UTF-8
or UTF-16 string, the offset is that of the first data unit of the
failing character.

Some errors are not detected until the whole pattern has been scanned;
in these cases, the offset passed back is the length of the pattern.
Note that the offset is in data units, not characters, even in a UTF
mode. It may sometimes point into the middle of a UTF-8 or UTF-16 character.

If pcre_compile2() is used instead of pcre_compile(), and the errorcodeptr
argument is not NULL, a non-zero error code number is returned
via this argument in the event of an error. This is in addition to the
textual error message. Error codes and messages are listed below.

If the final argument, tableptr, is NULL, PCRE uses a default set of
character tables that are built when PCRE is compiled, using the default
C locale. Otherwise, tableptr must be an address that is the result
of a call to pcre_maketables(). This value is stored with the compiled
pattern, and used again by pcre_exec() and pcre_dfa_exec() when
the pattern is matched. For more discussion, see the section on locale
support below.

This code fragment shows a typical straightforward call to
pcre_compile():

  pcre *re;
  const char *error;
  int erroffset;
  re = pcre_compile(
    "^A.*Z",          /* the pattern */
    0,                /* default options */
    &error,           /* for error message */
    &erroffset,       /* for error offset */
    NULL);            /* use default character tables */

The following names for option bits are defined in the pcre.h header
file:

  PCRE_ANCHORED

If this bit is set, the pattern is forced to be "anchored", that is, it
is constrained to match only at the first matching point in the string
that is being searched (the "subject string"). This effect can also be
achieved by appropriate constructs in the pattern itself, which is the
only way to do it in Perl.

  PCRE_AUTO_CALLOUT

If this bit is set, pcre_compile() automatically inserts callout items,
all with number 255, before each pattern item. For discussion of the
callout facility, see the pcrecallout documentation.

  PCRE_BSR_ANYCRLF
  PCRE_BSR_UNICODE

These options (which are mutually exclusive) control what the \R escape
sequence matches. The choice is either to match only CR, LF, or CRLF,
or to match any Unicode newline sequence. The default is specified when
PCRE is built. It can be overridden from within the pattern, or by setting
an option when a compiled pattern is matched.

  PCRE_CASELESS

If this bit is set, letters in the pattern match both upper and lower
case letters. It is equivalent to Perl's /i option, and it can be
changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
always understands the concept of case for characters whose values are
less than 128, so caseless matching is always possible. For characters
with higher values, the concept of case is supported if PCRE is compiled
with Unicode property support, but not otherwise. If you want to
use caseless matching for characters 128 and above, you must ensure
that PCRE is compiled with Unicode property support as well as with
UTF-8 support.

  PCRE_DOLLAR_ENDONLY

If this bit is set, a dollar metacharacter in the pattern matches only
at the end of the subject string. Without this option, a dollar also
matches immediately before a newline at the end of the string (but not
before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
if PCRE_MULTILINE is set. There is no equivalent to this option in
Perl, and no way to set it within a pattern.

  PCRE_DOTALL

If this bit is set, a dot metacharacter in the pattern matches a character
of any value, including one that indicates a newline. However, it
only ever matches one character, even if newlines are coded as CRLF.
Without this option, a dot does not match when the current position is
at a newline. This option is equivalent to Perl's /s option, and it can
be changed within a pattern by a (?s) option setting. A negative class
such as [^a] always matches newline characters, independent of the setting
of this option.

  PCRE_DUPNAMES

If this bit is set, names used to identify capturing subpatterns need
not be unique. This can be helpful for certain types of pattern when it
is known that only one instance of the named subpattern can ever be
matched. There are more details of named subpatterns below; see also
the pcrepattern documentation.

  PCRE_EXTENDED

If this bit is set, most white space characters in the pattern are totally
ignored except when escaped or inside a character class. However,
white space is not allowed within sequences such as (?> that introduce
various parenthesized subpatterns, nor within a numerical quantifier
such as {1,3}. Ignorable white space is nevertheless permitted between an
item and a following quantifier and between a quantifier and a following
+ that indicates possessiveness.

White space formerly did not include the VT character (code 11), because
Perl did not treat this character as white space. However, Perl changed
at release 5.18, so PCRE followed at release 8.34, and VT is now
treated as white space.

PCRE_EXTENDED also causes characters between an unescaped # outside a
character class and the next newline, inclusive, to be ignored.
PCRE_EXTENDED is equivalent to Perl's /x option, and it can be changed
within a pattern by a (?x) option setting.

Which characters are interpreted as newlines is controlled by the options
passed to pcre_compile() or by a special sequence at the start of
the pattern, as described in the section entitled "Newline conventions"
in the pcrepattern documentation. Note that the end of this type of
comment is a literal newline sequence in the pattern; escape sequences
that happen to represent a newline do not count.

This option makes it possible to include comments inside complicated
patterns. Note, however, that this applies only to data characters.
White space characters may never appear within special character sequences
in a pattern, for example within the sequence (?( that introduces
a conditional subpattern.
620
621 PCRE_EXTRA
622
623 This option was invented in order to turn on additional functionality
624 of PCRE that is incompatible with Perl, but it is currently of very
625 little use. When set, any backslash in a pattern that is followed by a
626 letter that has no special meaning causes an error, thus reserving
627 these combinations for future expansion. By default, as in Perl, a
628 backslash followed by a letter with no special meaning is treated as a
629 literal. (Perl can, however, be persuaded to give an error for this, by
630 running it with the -w option.) There are at present no other features
631 controlled by this option. It can also be set by a (?X) option setting
632 within a pattern.
633
634 PCRE_FIRSTLINE
635
636 If this option is set, an unanchored pattern is required to match be‐
637 fore or at the first newline in the subject string, though the matched
638 text may continue over the newline.
639
640 PCRE_JAVASCRIPT_COMPAT
641
642 If this option is set, PCRE's behaviour is changed in some ways so that
643 it is compatible with JavaScript rather than Perl. The changes are as
644 follows:
645
646 (1) A lone closing square bracket in a pattern causes a compile-time
647 error, because this is illegal in JavaScript (by default it is treated
648 as a data character). Thus, the pattern AB]CD becomes illegal when this
649 option is set.
650
651 (2) At run time, a back reference to an unset subpattern group matches
652 an empty string (by default this causes the current matching alterna‐
653 tive to fail). A pattern such as (\1)(a) succeeds when this option is
654 set (assuming it can find an "a" in the subject), whereas it fails by
655 default, for Perl compatibility.
656
657 (3) \U matches an upper case "U" character; by default \U causes a com‐
658 pile time error (Perl uses \U to upper case subsequent characters).
659
660 (4) \u matches a lower case "u" character unless it is followed by four
661 hexadecimal digits, in which case the hexadecimal number defines the
662 code point to match. By default, \u causes a compile time error (Perl
663 uses it to upper case the following character).
664
665 (5) \x matches a lower case "x" character unless it is followed by two
666 hexadecimal digits, in which case the hexadecimal number defines the
667 code point to match. By default, as in Perl, a hexadecimal number is
668 always expected after \x, but it may have zero, one, or two digits (so,
669 for example, \xz matches a binary zero character followed by z).
670
671 PCRE_MULTILINE
672
673 By default, for the purposes of matching "start of line" and "end of
674 line", PCRE treats the subject string as consisting of a single line of
675 characters, even if it actually contains newlines. The "start of line"
676 metacharacter (^) matches only at the start of the string, and the "end
677 of line" metacharacter ($) matches only at the end of the string, or
678 before a terminating newline (except when PCRE_DOLLAR_ENDONLY is set).
679 Note, however, that unless PCRE_DOTALL is set, the "any character"
680 metacharacter (.) does not match at a newline. This behaviour (for ^,
681 $, and dot) is the same as Perl.
682
683 When PCRE_MULTILINE is set, the "start of line" and "end of line"
684 constructs match immediately following or immediately before internal
685 newlines in the subject string, respectively, as well as at the very
686 start and end. This is equivalent to Perl's /m option, and it can be
687 changed within a pattern by a (?m) option setting. If there are no new‐
688 lines in a subject string, or no occurrences of ^ or $ in a pattern,
689 setting PCRE_MULTILINE has no effect.
690
691 PCRE_NEVER_UTF
692
693 This option locks out interpretation of the pattern as UTF-8 (or UTF-16
694 or UTF-32 in the 16-bit and 32-bit libraries). In particular, it pre‐
695 vents the creator of the pattern from switching to UTF interpretation
696 by starting the pattern with (*UTF). This may be useful in applications
697 that process patterns from external sources. The combination of
698 PCRE_UTF8 and PCRE_NEVER_UTF also causes an error.
699
700 PCRE_NEWLINE_CR
701 PCRE_NEWLINE_LF
702 PCRE_NEWLINE_CRLF
703 PCRE_NEWLINE_ANYCRLF
704 PCRE_NEWLINE_ANY
705
706 These options override the default newline definition that was chosen
707 when PCRE was built. Setting the first or the second specifies that a
708 newline is indicated by a single character (CR or LF, respectively).
709 Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the
710 two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies
711 that any of the three preceding sequences should be recognized. Setting
712 PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be
713 recognized.
714
715 In an ASCII/Unicode environment, the Unicode newline sequences are the
716 three just mentioned, plus the single characters VT (vertical tab,
717 U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line sep‐
718 arator, U+2028), and PS (paragraph separator, U+2029). For the 8-bit
719 library, the last two are recognized only in UTF-8 mode.
720
721 When PCRE is compiled to run in an EBCDIC (mainframe) environment, the
722 code for CR is 0x0d, the same as ASCII. However, the character code for
723 LF is normally 0x15, though in some EBCDIC environments 0x25 is used.
724 Whichever of these is not LF is made to correspond to Unicode's NEL
725 character. EBCDIC codes are all less than 256. For more details, see
726 the pcrebuild documentation.
727
728 The newline setting in the options word uses three bits that are
729 treated as a number, giving eight possibilities. Currently only six are
730 used (default plus the five values above). This means that if you set
731 more than one newline option, the combination may or may not be sensi‐
732 ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
733 PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and
734 cause an error.
735
736 The only time that a line break in a pattern is specially recognized
737 when compiling is when PCRE_EXTENDED is set. CR and LF are white space
738 characters, and so are ignored in this mode. Also, an unescaped # out‐
739 side a character class indicates a comment that lasts until after the
740 next line break sequence. In other circumstances, line break sequences
741 in patterns are treated as literal data.
742
743 The newline option that is set at compile time becomes the default that
744 is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
745
746 PCRE_NO_AUTO_CAPTURE
747
748 If this option is set, it disables the use of numbered capturing paren‐
749 theses in the pattern. Any opening parenthesis that is not followed by
750 ? behaves as if it were followed by ?: but named parentheses can still
751 be used for capturing (and they acquire numbers in the usual way).
752 There is no equivalent of this option in Perl.
753
754 PCRE_NO_AUTO_POSSESS
755
756 If this option is set, it disables "auto-possessification". This is an
757 optimization that, for example, turns a+b into a++b in order to avoid
758 backtracks into a+ that can never be successful. However, if callouts
759 are in use, auto-possessification means that some of them are never
760 taken. You can set this option if you want the matching functions to do
761 a full unoptimized search and run all the callouts, but it is mainly
762 provided for testing purposes.
763
764 PCRE_NO_START_OPTIMIZE
765
766 This is an option that acts at matching time; that is, it is really an
767 option for pcre_exec() or pcre_dfa_exec(). If it is set at compile
768 time, it is remembered with the compiled pattern and assumed at match‐
769 ing time. This is necessary if you want to use JIT execution, because
770 the JIT compiler needs to know whether or not this option is set. For
771 details see the discussion of PCRE_NO_START_OPTIMIZE below.
772
773 PCRE_UCP
774
775 This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W,
776 \w, and some of the POSIX character classes. By default, only ASCII
777 characters are recognized, but if PCRE_UCP is set, Unicode properties
778 are used instead to classify characters. More details are given in the
779 section on generic character types in the pcrepattern page. If you set
780 PCRE_UCP, matching one of the items it affects takes much longer. The
781 option is available only if PCRE has been compiled with Unicode prop‐
782 erty support.
783
784 PCRE_UNGREEDY
785
786 This option inverts the "greediness" of the quantifiers so that they
787 are not greedy by default, but become greedy if followed by "?". It is
788 not compatible with Perl. It can also be set by a (?U) option setting
789 within the pattern.
790
791 PCRE_UTF8
792
793 This option causes PCRE to regard both the pattern and the subject as
794 strings of UTF-8 characters instead of single-byte strings. However, it
795 is available only when PCRE is built to include UTF support. If not,
796 the use of this option provokes an error. Details of how this option
797 changes the behaviour of PCRE are given in the pcreunicode page.
798
799 PCRE_NO_UTF8_CHECK
800
801 When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
802 automatically checked. There is a discussion about the validity of
803 UTF-8 strings in the pcreunicode page. If an invalid UTF-8 sequence is
804 found, pcre_compile() returns an error. If you already know that your
805 pattern is valid, and you want to skip this check for performance rea‐
806 sons, you can set the PCRE_NO_UTF8_CHECK option. When it is set, the
807 effect of passing an invalid UTF-8 string as a pattern is undefined. It
808 may cause your program to crash or loop. Note that this option can also
809 be passed to pcre_exec() and pcre_dfa_exec(), to suppress the validity
810 checking of subject strings only. If the same string is being matched
811 many times, the option can be safely set for the second and subsequent
812 matchings to improve performance.
813
815
816 The following table lists the error codes that may be returned by
817 pcre_compile2(), along with the error messages that may be returned by
818 both compiling functions. Note that error messages are always 8-bit
819 ASCII strings, even in 16-bit or 32-bit mode. As PCRE has developed,
820 some error codes have fallen out of use. To avoid confusion, they have
821 not been re-used.
822
823 0 no error
824 1 \ at end of pattern
825 2 \c at end of pattern
826 3 unrecognized character follows \
827 4 numbers out of order in {} quantifier
828 5 number too big in {} quantifier
829 6 missing terminating ] for character class
830 7 invalid escape sequence in character class
831 8 range out of order in character class
832 9 nothing to repeat
833 10 [this code is not in use]
834 11 internal error: unexpected repeat
835 12 unrecognized character after (? or (?-
836 13 POSIX named classes are supported only within a class
837 14 missing )
838 15 reference to non-existent subpattern
839 16 erroffset passed as NULL
840 17 unknown option bit(s) set
841 18 missing ) after comment
842 19 [this code is not in use]
843 20 regular expression is too large
844 21 failed to get memory
845 22 unmatched parentheses
846 23 internal error: code overflow
847 24 unrecognized character after (?<
848 25 lookbehind assertion is not fixed length
849 26 malformed number or name after (?(
850 27 conditional group contains more than two branches
851 28 assertion expected after (?(
852 29 (?R or (?[+-]digits must be followed by )
853 30 unknown POSIX class name
854 31 POSIX collating elements are not supported
855 32 this version of PCRE is compiled without UTF support
856 33 [this code is not in use]
857 34 character value in \x{} or \o{} is too large
858 35 invalid condition (?(0)
859 36 \C not allowed in lookbehind assertion
860 37 PCRE does not support \L, \l, \N{name}, \U, or \u
861 38 number after (?C is > 255
862 39 closing ) for (?C expected
863 40 recursive call could loop indefinitely
864 41 unrecognized character after (?P
865 42 syntax error in subpattern name (missing terminator)
866 43 two named subpatterns have the same name
867 44 invalid UTF-8 string (specifically UTF-8)
868 45 support for \P, \p, and \X has not been compiled
869 46 malformed \P or \p sequence
870 47 unknown property name after \P or \p
871 48 subpattern name is too long (maximum 32 characters)
872 49 too many named subpatterns (maximum 10000)
873 50 [this code is not in use]
874 51 octal value is greater than \377 in 8-bit non-UTF-8 mode
875 52 internal error: overran compiling workspace
876 53 internal error: previously-checked referenced subpattern
877 not found
878 54 DEFINE group contains more than one branch
879 55 repeating a DEFINE group is not allowed
880 56 inconsistent NEWLINE options
881 57 \g is not followed by a braced, angle-bracketed, or quoted
882 name/number or by a plain number
883 58 a numbered reference must not be zero
884 59 an argument is not allowed for (*ACCEPT), (*FAIL), or (*COMMIT)
885 60 (*VERB) not recognized or malformed
886 61 number is too big
887 62 subpattern name expected
888 63 digit expected after (?+
889 64 ] is an invalid data character in JavaScript compatibility mode
890 65 different names for subpatterns of the same number are
891 not allowed
892 66 (*MARK) must have an argument
893 67 this version of PCRE is not compiled with Unicode property
894 support
895 68 \c must be followed by an ASCII character
896 69 \k is not followed by a braced, angle-bracketed, or quoted name
897 70 internal error: unknown opcode in find_fixedlength()
898 71 \N is not supported in a class
899 72 too many forward references
900 73 disallowed Unicode code point (>= 0xd800 && <= 0xdfff)
901 74 invalid UTF-16 string (specifically UTF-16)
902 75 name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
903 76 character value in \u.... sequence is too large
904 77 invalid UTF-32 string (specifically UTF-32)
905 78 setting UTF is disabled by the application
906 79 non-hex character in \x{} (closing brace missing?)
907 80 non-octal character in \o{} (closing brace missing?)
908 81 missing opening brace after \o
909 82 parentheses are too deeply nested
910 83 invalid range in character class
911 84 group name must start with a non-digit
912 85 parentheses are too deeply nested (stack check)
913
914 The numbers 32 and 10000 in errors 48 and 49 are defaults; different
915 values may be used if the limits were changed when PCRE was built.
916
918
919 pcre_extra *pcre_study(const pcre *code, int options,
920 const char **errptr);
921
922 If a compiled pattern is going to be used several times, it is worth
923 spending more time analyzing it in order to reduce the time taken for
924 matching. The function pcre_study() takes a pointer to a compiled
925 pattern as its first argument. If studying the pattern produces additional
926 information that will help speed up matching, pcre_study() returns a
927 pointer to a pcre_extra block, in which the study_data field points to
928 the results of the study.
929
930 The returned value from pcre_study() can be passed directly to
931 pcre_exec() or pcre_dfa_exec(). However, a pcre_extra block also con‐
932 tains other fields that can be set by the caller before the block is
933 passed; these are described below in the section on matching a pattern.
934
935 If studying the pattern does not produce any useful information,
936 pcre_study() returns NULL by default. In that circumstance, if the
937 calling program wants to pass any of the other fields to pcre_exec() or
938 pcre_dfa_exec(), it must set up its own pcre_extra block. However, if
939 pcre_study() is called with the PCRE_STUDY_EXTRA_NEEDED option, it re‐
940 turns a pcre_extra block even if studying did not find any additional
941 information. It may still return NULL, however, if an error occurs in
942 pcre_study().
943
944 The second argument of pcre_study() contains option bits. There are
945 three further options in addition to PCRE_STUDY_EXTRA_NEEDED:
946
947 PCRE_STUDY_JIT_COMPILE
948 PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
949 PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
950
951 If any of these are set, and the just-in-time compiler is available,
952 the pattern is further compiled into machine code that executes much
953 faster than the pcre_exec() interpretive matching function. If the
954 just-in-time compiler is not available, these options are ignored. All
955 undefined bits in the options argument must be zero.
956
957 JIT compilation is a heavyweight optimization. It can take some time
958 for patterns to be analyzed, and for one-off matches and simple pat‐
959 terns the benefit of faster execution might be offset by a much slower
960 study time. Not all patterns can be optimized by the JIT compiler. For
961 those that cannot be handled, matching automatically falls back to the
962 pcre_exec() interpreter. For more details, see the pcrejit documenta‐
963 tion.
964
965 The third argument for pcre_study() is a pointer for an error message.
966 If studying succeeds (even if no data is returned), the variable it
967 points to is set to NULL. Otherwise it is set to point to a textual er‐
968 ror message. This is a static string that is part of the library. You
969 must not try to free it. You should test the error pointer for NULL af‐
970 ter calling pcre_study(), to be sure that it has run successfully.
971
972 When you are finished with a pattern, you can free the memory used for
973 the study data by calling pcre_free_study(). This function was added to
974 the API for release 8.20. For earlier versions, the memory could be
975 freed with pcre_free(), just like the pattern itself. This will still
976 work in cases where JIT optimization is not used, but it is advisable
977 to change to the new function when convenient.
978
979 This is a typical way in which pcre_study() is used (except that in a
980 real application there should be tests for errors):
981
982 int rc;
983 pcre *re;
984 pcre_extra *sd;
985 re = pcre_compile("pattern", 0, &error, &erroroffset, NULL);
986 sd = pcre_study(
987 re, /* result of pcre_compile() */
988 0, /* no options */
989 &error); /* set to NULL or points to a message */
990 rc = pcre_exec( /* see below for details of pcre_exec() options */
991 re, sd, "subject", 7, 0, 0, ovector, 30);
992 ...
993 pcre_free_study(sd);
994 pcre_free(re);
995
996 Studying a pattern does two things: first, a lower bound for the length
997 of subject string that is needed to match the pattern is computed. This
998 does not mean that there are any strings of that length that match, but
999 it does guarantee that no shorter strings match. The value is used to
1000 avoid wasting time by trying to match strings that are shorter than the
1001 lower bound. You can find out the value in a calling program via the
1002 pcre_fullinfo() function.
1003
1004 Studying a pattern is also useful for non-anchored patterns that do not
1005 have a single fixed starting character. A bitmap of possible starting
1006 bytes is created. This speeds up finding a position in the subject at
1007 which to start matching. (In 16-bit mode, the bitmap is used for 16-bit
1008 values less than 256. In 32-bit mode, the bitmap is used for 32-bit
1009 values less than 256.)
1010
1011 These two optimizations apply to both pcre_exec() and pcre_dfa_exec(),
1012 and the information is also used by the JIT compiler. The optimiza‐
1013 tions can be disabled by setting the PCRE_NO_START_OPTIMIZE option.
1014 You might want to do this if your pattern contains callouts or (*MARK)
1015 and you want to make use of these facilities in cases where matching
1016 fails.
1017
1018 PCRE_NO_START_OPTIMIZE can be specified at either compile time or
1019 execution time. However, if PCRE_NO_START_OPTIMIZE is passed to
1020 pcre_exec() (that is, after any JIT compilation has happened), JIT
1021 execution is disabled. For JIT execution to work with
1022 PCRE_NO_START_OPTIMIZE, the option must be set at compile time.
1023
1024 There is a longer discussion of PCRE_NO_START_OPTIMIZE below.
1025
1027
1028 PCRE handles caseless matching, and determines whether characters are
1029 letters, digits, or whatever, by reference to a set of tables, indexed
1030 by character code point. When running in UTF-8 mode, or in the 16- or
1031 32-bit libraries, this applies only to characters with code points less
1032 than 256. By default, higher-valued code points never match escapes
1033 such as \w or \d. However, if PCRE is built with Unicode property sup‐
1034 port, all characters can be tested with \p and \P, or, alternatively,
1035 the PCRE_UCP option can be set when a pattern is compiled; this causes
1036 \w and friends to use Unicode property support instead of the built-in
1037 tables.
1038
1039 The use of locales with Unicode is discouraged. If you are handling
1040 characters with code points greater than 128, you should either use
1041 Unicode support, or use locales, but not try to mix the two.
1042
1043 PCRE contains an internal set of tables that are used when the final
1044 argument of pcre_compile() is NULL. These are sufficient for many ap‐
1045 plications. Normally, the internal tables recognize only ASCII charac‐
1046 ters. However, when PCRE is built, it is possible to cause the internal
1047 tables to be rebuilt in the default "C" locale of the local system,
1048 which may cause them to be different.
1049
1050 The internal tables can always be overridden by tables supplied by the
1051 application that calls PCRE. These may be created in a different locale
1052 from the default. As more and more applications change to using Uni‐
1053 code, the need for this locale support is expected to die away.
1054
1055 External tables are built by calling the pcre_maketables() function,
1056 which has no arguments, in the relevant locale. The result can then be
1057 passed to pcre_compile() as often as necessary. For example, to build
1058 and use tables that are appropriate for the French locale (where ac‐
1059 cented characters with values greater than 128 are treated as letters),
1060 the following code could be used:
1061
1062 setlocale(LC_CTYPE, "fr_FR");
1063 tables = pcre_maketables();
1064 re = pcre_compile(..., tables);
1065
1066 The locale name "fr_FR" is used on Linux and other Unix-like systems;
1067 if you are using Windows, the name for the French locale is "french".
1068
1069 When pcre_maketables() runs, the tables are built in memory that is ob‐
1070 tained via pcre_malloc. It is the caller's responsibility to ensure
1071 that the memory containing the tables remains available for as long as
1072 it is needed.
1073
1074 The pointer that is passed to pcre_compile() is saved with the compiled
1075 pattern, and the same tables are used via this pointer by pcre_study()
1076 and also by pcre_exec() and pcre_dfa_exec(). Thus, for any single pat‐
1077 tern, compilation, studying and matching all happen in the same locale,
1078 but different patterns can be processed in different locales.
1079
1080 It is possible to pass a table pointer or NULL (indicating the use of
1081 the internal tables) to pcre_exec() or pcre_dfa_exec() (see the discus‐
1082 sion below in the section on matching a pattern). This facility is pro‐
1083 vided for use with pre-compiled patterns that have been saved and
1084 reloaded. Character tables are not saved with patterns, so if a non-
1085 standard table was used at compile time, it must be provided again when
1086 the reloaded pattern is matched. Attempting to use this facility to
1087 match a pattern in a different locale from the one in which it was com‐
1088 piled is likely to lead to anomalous (usually incorrect) results.
1089
1091
1092 int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
1093 int what, void *where);
1094
1095 The pcre_fullinfo() function returns information about a compiled pat‐
1096 tern. It replaces the pcre_info() function, which was removed from the
1097 library at version 8.30, after more than 10 years of obsolescence.
1098
1099 The first argument for pcre_fullinfo() is a pointer to the compiled
1100 pattern. The second argument is the result of pcre_study(), or NULL if
1101 the pattern was not studied. The third argument specifies which piece
1102 of information is required, and the fourth argument is a pointer to a
1103 variable to receive the data. The yield of the function is zero for
1104 success, or one of the following negative numbers:
1105
1106 PCRE_ERROR_NULL the argument code was NULL
1107 the argument where was NULL
1108 PCRE_ERROR_BADMAGIC the "magic number" was not found
1109 PCRE_ERROR_BADENDIANNESS the pattern was compiled with different
1110 endianness
1111 PCRE_ERROR_BADOPTION the value of what was invalid
1112 PCRE_ERROR_UNSET the requested field is not set
1113
1114 The "magic number" is placed at the start of each compiled pattern as a
1115 simple check against passing an arbitrary memory pointer. The endian‐
1116 ness error can occur if a compiled pattern is saved and reloaded on a
1117 different host. Here is a typical call of pcre_fullinfo(), to obtain
1118 the length of the compiled pattern:
1119
1120 int rc;
1121 size_t length;
1122 rc = pcre_fullinfo(
1123 re, /* result of pcre_compile() */
1124 sd, /* result of pcre_study(), or NULL */
1125 PCRE_INFO_SIZE, /* what is required */
1126 &length); /* where to put the data */
1127
1128 The possible values for the third argument are defined in pcre.h, and
1129 are as follows:
1130
1131 PCRE_INFO_BACKREFMAX
1132
1133 Return the number of the highest back reference in the pattern. The
1134 fourth argument should point to an int variable. Zero is returned if
1135 there are no back references.
1136
1137 PCRE_INFO_CAPTURECOUNT
1138
1139 Return the number of capturing subpatterns in the pattern. The fourth
1140 argument should point to an int variable.
1141
1142 PCRE_INFO_DEFAULT_TABLES
1143
1144 Return a pointer to the internal default character tables within PCRE.
1145 The fourth argument should point to an unsigned char * variable. This
1146 information call is provided for internal use by the pcre_study() func‐
1147 tion. External callers can cause PCRE to use its internal tables by
1148 passing a NULL table pointer.
1149
1150 PCRE_INFO_FIRSTBYTE (deprecated)
1151
1152 Return information about the first data unit of any matched string, for
1153 a non-anchored pattern. The name of this option refers to the 8-bit li‐
1154 brary, where data units are bytes. The fourth argument should point to
1155 an int variable. Negative values are used for special cases. However,
1156 this means that when the 32-bit library is in non-UTF-32 mode, the full
1157 32-bit range of characters cannot be returned. For this reason, this
1158 value is deprecated; use PCRE_INFO_FIRSTCHARACTERFLAGS and
1159 PCRE_INFO_FIRSTCHARACTER instead.
1160
1161 If there is a fixed first value, for example, the letter "c" from a
1162 pattern such as (cat|cow|coyote), its value is returned. In the 8-bit
1163 library, the value is always less than 256. In the 16-bit library the
1164 value can be up to 0xffff. In the 32-bit library the value can be up to
1165 0x10ffff.
1166
1167 If there is no fixed first value, and if either
1168
1169 (a) the pattern was compiled with the PCRE_MULTILINE option, and every
1170 branch starts with "^", or
1171
1172 (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
1173 set (if it were set, the pattern would be anchored),
1174
1175 -1 is returned, indicating that the pattern matches only at the start
1176 of a subject string or after any newline within the string. Otherwise
1177 -2 is returned. For anchored patterns, -2 is returned.
1178
1179 PCRE_INFO_FIRSTCHARACTER
1180
1181 Return the value of the first data unit (non-UTF character) of any
1182 matched string in the situation where PCRE_INFO_FIRSTCHARACTERFLAGS
1183 returns 1; otherwise return 0. The fourth argument should point to a
1184 uint32_t variable.
1185
1186 In the 8-bit library, the value is always less than 256. In the 16-bit
1187 library the value can be up to 0xffff. In the 32-bit library in UTF-32
1188 mode the value can be up to 0x10ffff, and up to 0xffffffff when not us‐
1189 ing UTF-32 mode.
1190
1191 PCRE_INFO_FIRSTCHARACTERFLAGS
1192
1193 Return information about the first data unit of any matched string, for
1194 a non-anchored pattern. The fourth argument should point to an int
1195 variable.
1196
1197 If there is a fixed first value, for example, the letter "c" from a
1198 pattern such as (cat|cow|coyote), 1 is returned, and the character
1199 value can be retrieved using PCRE_INFO_FIRSTCHARACTER. If there is no
1200 fixed first value, and if either
1201
1202 (a) the pattern was compiled with the PCRE_MULTILINE option, and every
1203 branch starts with "^", or
1204
1205 (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
1206 set (if it were set, the pattern would be anchored),
1207
1208 2 is returned, indicating that the pattern matches only at the start of
1209 a subject string or after any newline within the string. Otherwise 0 is
1210 returned. For anchored patterns, 0 is returned.
1211
1212 PCRE_INFO_FIRSTTABLE
1213
1214 If the pattern was studied, and this resulted in the construction of a
1215 256-bit table indicating a fixed set of values for the first data unit
1216 in any matching string, a pointer to the table is returned. Otherwise
1217 NULL is returned. The fourth argument should point to an unsigned char
1218 * variable.
1219
1220 PCRE_INFO_HASCRORLF
1221
1222 Return 1 if the pattern contains any explicit matches for CR or LF
1223 characters, otherwise 0. The fourth argument should point to an int
1224 variable. An explicit match is either a literal CR or LF character, or
1225 \r or \n.
1226
1227 PCRE_INFO_JCHANGED
1228
1229 Return 1 if the (?J) or (?-J) option setting is used in the pattern,
1230 otherwise 0. The fourth argument should point to an int variable. (?J)
1231 and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
1232
1233 PCRE_INFO_JIT
1234
1235 Return 1 if the pattern was studied with one of the JIT options, and
1236 just-in-time compiling was successful. The fourth argument should point
1237 to an int variable. A return value of 0 means that JIT support is not
1238 available in this version of PCRE, or that the pattern was not studied
1239 with a JIT option, or that the JIT compiler could not handle this par‐
1240 ticular pattern. See the pcrejit documentation for details of what can
1241 and cannot be handled.
1242
1243 PCRE_INFO_JITSIZE
1244
1245 If the pattern was successfully studied with a JIT option, return the
1246 size of the JIT compiled code, otherwise return zero. The fourth argu‐
1247 ment should point to a size_t variable.
1248
1249 PCRE_INFO_LASTLITERAL
1250
1251 Return the value of the rightmost literal data unit that must exist in
1252 any matched string, other than at its start, if such a value has been
1253 recorded. The fourth argument should point to an int variable. If there
1254 is no such value, -1 is returned. For anchored patterns, a last literal
1255 value is recorded only if it follows something of variable length. For
1256 example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
1257 /^a\dz\d/ the returned value is -1.
1258
1259 In the 32-bit library in non-UTF-32 mode, this value cannot represent
1260 the full 32-bit range of characters. For this reason it is deprecated;
1261 use the PCRE_INFO_REQUIREDCHARFLAGS and PCRE_INFO_REQUIREDCHAR values
1262 instead.
1263
1264 PCRE_INFO_MATCH_EMPTY
1265
1266 Return 1 if the pattern can match an empty string, otherwise 0. The
1267 fourth argument should point to an int variable.
1268
1269 PCRE_INFO_MATCHLIMIT
1270
1271 If the pattern set a match limit by including an item of the form
1272 (*LIMIT_MATCH=nnnn) at the start, the value is returned. The fourth ar‐
1273 gument should point to an unsigned 32-bit integer. If no such value has
1274 been set, the call to pcre_fullinfo() returns the error PCRE_ERROR_UN‐
1275 SET.
1276
1277 PCRE_INFO_MAXLOOKBEHIND
1278
1279 Return the number of characters (NB not data units) in the longest
1280 lookbehind assertion in the pattern. This information is useful when
1281 doing multi-segment matching using the partial matching facilities.
1282 Note that the simple assertions \b and \B require a one-character look‐
1283 behind. \A also registers a one-character lookbehind, though it does
1284 not actually inspect the previous character. This is to ensure that at
1285 least one character from the old segment is retained when a new segment
1286 is processed. Otherwise, if there are no lookbehinds in the pattern, \A
1287 might match incorrectly at the start of a new segment.
1288
1289 PCRE_INFO_MINLENGTH
1290
1291 If the pattern was studied and a minimum length for matching subject
1292 strings was computed, its value is returned. Otherwise the returned
1293 value is -1. The value is a number of characters, which in UTF mode may
1294 be different from the number of data units. The fourth argument should
1295 point to an int variable. A non-negative value is a lower bound to the
1296 length of any matching string. There may not be any strings of that
1297 length that do actually match, but every string that does match is at
1298 least that long.
1299
1300 PCRE_INFO_NAMECOUNT
1301 PCRE_INFO_NAMEENTRYSIZE
1302 PCRE_INFO_NAMETABLE
1303
1304 PCRE supports the use of named as well as numbered capturing parenthe‐
1305 ses. The names are just an additional way of identifying the parenthe‐
1306 ses, which still acquire numbers. Several convenience functions such as
1307 pcre_get_named_substring() are provided for extracting captured sub‐
1308 strings by name. It is also possible to extract the data directly, by
1309 first converting the name to a number in order to access the correct
1310 pointers in the output vector (described with pcre_exec() below). To do
1311 the conversion, you need to use the name-to-number map, which is de‐
1312 scribed by these three values.
1313
1314 The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
1315 gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
1316 of each entry; both of these return an int value. The entry size de‐
1317 pends on the length of the longest name. PCRE_INFO_NAMETABLE returns a
1318 pointer to the first entry of the table. This is a pointer to char in
1319 the 8-bit library, where the first two bytes of each entry are the num‐
1320 ber of the capturing parenthesis, most significant byte first. In the
1321 16-bit library, the pointer points to 16-bit data units, the first of
1322 which contains the parenthesis number. In the 32-bit library, the
1323 pointer points to 32-bit data units, the first of which contains the
1324 parenthesis number. The rest of the entry is the corresponding name,
1325 zero terminated.
1326
1327 The names are in alphabetical order. If (?| is used to create multiple
1328 groups with the same number, as described in the section on duplicate
1329 subpattern numbers in the pcrepattern page, the groups may be given the
1330 same name, but there is only one entry in the table. Different names
1331 for groups of the same number are not permitted. Duplicate names for
1332 subpatterns with different numbers are permitted, but only if PCRE_DUP‐
1333 NAMES is set. They appear in the table in the order in which they were
1334 found in the pattern. In the absence of (?| this is the order of in‐
1335 creasing number; when (?| is used this is not necessarily the case be‐
1336 cause later subpatterns may have lower numbers.
1337
1338 As a simple example of the name/number table, consider the following
1339 pattern after compilation by the 8-bit library (assume PCRE_EXTENDED is
1340 set, so white space - including newlines - is ignored):
1341
1342 (?<date> (?<year>(\d\d)?\d\d) -
1343 (?<month>\d\d) - (?<day>\d\d) )
1344
1345 There are four named subpatterns, so the table has four entries, and
1346 each entry in the table is eight bytes long. The table is as follows,
1347 with non-printing bytes shown in hexadecimal, and undefined bytes shown
1348 as ??:
1349
1350 00 01 d a t e 00 ??
1351 00 05 d a y 00 ?? ??
1352 00 04 m o n t h 00
1353 00 02 y e a r 00 ??
1354
1355 When writing code to extract data from named subpatterns using the
1356 name-to-number map, remember that the length of the entries is likely
1357 to be different for each compiled pattern.
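
As an illustration, a lookup over the 8-bit table might be sketched as
follows. This is a minimal sketch, not the library's own code:
find_group_number() is a hypothetical helper, and its three table
arguments are assumed to be the values obtained for
PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT, and PCRE_INFO_NAMEENTRYSIZE.
It does a linear scan rather than exploiting the alphabetical order,
and it applies only to the 8-bit layout.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a PCRE function): look up a group name in
 * an 8-bit name table and return its group number, or -1 if absent.
 *   table      - pointer obtained for PCRE_INFO_NAMETABLE
 *   count      - value obtained for PCRE_INFO_NAMECOUNT
 *   entry_size - value obtained for PCRE_INFO_NAMEENTRYSIZE */
int find_group_number(const unsigned char *table, int count,
                      int entry_size, const char *name)
{
    int i;
    for (i = 0; i < count; i++) {
        /* Entries are fixed size, so step by entry_size, not by name length. */
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;

        /* Bytes 0 and 1 hold the group number, most significant byte
         * first; the zero-terminated name starts at byte 2. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

With the example table above, looking up "day" yields 5 and "year"
yields 2. When duplicate names are present, this sketch returns the
first entry it finds; pcre_get_stringnumber() is the supported way to
do the conversion.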
1358
1359 PCRE_INFO_OKPARTIAL
1360
1361 Return 1 if the pattern can be used for partial matching with
1362 pcre_exec(), otherwise 0. The fourth argument should point to an int
1363 variable. From release 8.00, this always returns 1, because the re‐
1364 strictions that previously applied to partial matching have been
1365 lifted. The pcrepartial documentation gives details of partial match‐
1366 ing.
1367
1368 PCRE_INFO_OPTIONS
1369
1370 Return a copy of the options with which the pattern was compiled. The
1371 fourth argument should point to an unsigned long int variable. These
1372 option bits are those specified in the call to pcre_compile(), modified
1373 by any top-level option settings at the start of the pattern itself. In
1374 other words, they are the options that will be in force when matching
1375 starts. For example, if the pattern /(?im)abc(?-i)d/ is compiled with
1376 the PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE,
1377 and PCRE_EXTENDED.
1378
1379 A pattern is automatically anchored by PCRE if all of its top-level al‐
1380 ternatives begin with one of the following:
1381
1382 ^ unless PCRE_MULTILINE is set
1383 \A always
1384 \G always
1385 .* if PCRE_DOTALL is set and there are no back
1386 references to the subpattern in which .* appears
1387
1388 For such patterns, the PCRE_ANCHORED bit is set in the options returned
1389 by pcre_fullinfo().
1390
1391 PCRE_INFO_RECURSIONLIMIT
1392
1393 If the pattern set a recursion limit by including an item of the form
1394 (*LIMIT_RECURSION=nnnn) at the start, the value is returned. The fourth
1395 argument should point to an unsigned 32-bit integer. If no such value
1396 has been set, the call to pcre_fullinfo() returns the error PCRE_ER‐
1397 ROR_UNSET.
1398
1399 PCRE_INFO_SIZE
1400
1401 Return the size of the compiled pattern in bytes (for all three li‐
1402 braries). The fourth argument should point to a size_t variable. This
1403 value does not include the size of the pcre structure that is returned
1404 by pcre_compile(). The value that is passed as the argument to
1405 pcre_malloc() when pcre_compile() is getting memory in which to place
1406 the compiled data is the value returned by this option plus the size of
1407 the pcre structure. Studying a compiled pattern, with or without JIT,
1408 does not alter the value returned by this option.
1409
1410 PCRE_INFO_STUDYSIZE
1411
1412 Return the size in bytes (for all three libraries) of the data block
1413 pointed to by the study_data field in a pcre_extra block. If pcre_extra
1414 is NULL, or there is no study data, zero is returned. The fourth argu‐
1415 ment should point to a size_t variable. The study_data field is set by
1416 pcre_study() to record information that will speed up matching (see the
1417 section entitled "Studying a pattern" above). The format of the
1418 study_data block is private, but its length is made available via this
1419 option so that it can be saved and restored (see the pcreprecompile
1420 documentation for details).
1421
1422 PCRE_INFO_REQUIREDCHARFLAGS
1423
1424 Return 1 if there is a rightmost literal data unit that must exist in
1425 any matched string, other than at its start. The fourth argument should
1426 point to an int variable. If there is no such value, 0 is returned. If
1427 returning 1, the character value itself can be retrieved using
1428 PCRE_INFO_REQUIREDCHAR.
1429
1430 For anchored patterns, a last literal value is recorded only if it fol‐
1431 lows something of variable length. For example, for the pattern
1432 /^a\d+z\d+/ the returned value is 1 (with "z" returned from
1433 PCRE_INFO_REQUIREDCHAR), but for /^a\dz\d/ the returned value is 0.
1434
1435 PCRE_INFO_REQUIREDCHAR
1436
1437 Return the value of the rightmost literal data unit that must exist in
1438 any matched string, other than at its start, if such a value has been
1439 recorded. The fourth argument should point to a uint32_t variable. If
1440 there is no such value, 0 is returned.
1441
1443
1444 int pcre_refcount(pcre *code, int adjust);
1445
1446 The pcre_refcount() function is used to maintain a reference count in
1447 the data block that contains a compiled pattern. It is provided for the
1448 benefit of applications that operate in an object-oriented manner,
1449 where different parts of the application may be using the same compiled
1450 pattern, but you want to free the block when they are all done.
1451
1452 When a pattern is compiled, the reference count field is initialized to
1453 zero. It is changed only by calling this function, whose action is to
1454 add the adjust value (which may be positive or negative) to it. The
1455 yield of the function is the new value. However, the value of the count
1456 is constrained to lie between 0 and 65535, inclusive. If the new value
1457 is outside these limits, it is forced to the appropriate limit value.
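
The clamping rule can be modelled in a few lines. The following is
only a model of the documented arithmetic, under a hypothetical
function name; it makes no claim about how the library stores or
updates the count internally.

```c
/* Hypothetical model of the documented behaviour of pcre_refcount():
 * add adjust to the current count and force the result into the
 * range 0..65535. Not the library's source code. */
int model_refcount(int current, int adjust)
{
    long value = (long)current + (long)adjust;

    if (value < 0)
        value = 0;        /* forced to the lower limit */
    if (value > 65535)
        value = 65535;    /* forced to the upper limit */
    return (int)value;
}
```

Under this model, decrementing a count of zero leaves it at zero, and
adding 10 to a count of 65530 yields 65535.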
1458
1459 Except when it is zero, the reference count is not correctly preserved
1460 if a pattern is compiled on one host and then transferred to a host
1461 whose byte-order is different. (This seems a highly unlikely scenario.)
1462
1464
1465 int pcre_exec(const pcre *code, const pcre_extra *extra,
1466 const char *subject, int length, int startoffset,
1467 int options, int *ovector, int ovecsize);
1468
1469 The function pcre_exec() is called to match a subject string against a
1470 compiled pattern, which is passed in the code argument. If the pattern
1471 was studied, the result of the study should be passed in the extra ar‐
1472 gument. You can call pcre_exec() with the same code and extra arguments
1473 as many times as you like, in order to match different subject strings
1474 with the same pattern.
1475
1476 This function is the main matching facility of the library, and it op‐
1477 erates in a Perl-like manner. For specialist use there is also an al‐
1478 ternative matching function, which is described below in the section
1479 about the pcre_dfa_exec() function.
1480
1481 In most applications, the pattern will have been compiled (and option‐
1482 ally studied) in the same process that calls pcre_exec(). However, it
1483 is possible to save compiled patterns and study data, and then use them
1484 later in different processes, possibly even on different hosts. For a
1485 discussion about this, see the pcreprecompile documentation.
1486
1487 Here is an example of a simple call to pcre_exec():
1488
1489 int rc;
1490 int ovector[30];
1491 rc = pcre_exec(
1492 re, /* result of pcre_compile() */
1493 NULL, /* we didn't study the pattern */
1494 "some string", /* the subject string */
1495 11, /* the length of the subject string */
1496 0, /* start at offset 0 in the subject */
1497 0, /* default options */
1498 ovector, /* vector of integers for substring information */
1499 30); /* number of elements (NOT size in bytes) */
1500
1501 Extra data for pcre_exec()
1502
1503 If the extra argument is not NULL, it must point to a pcre_extra data
1504 block. The pcre_study() function returns such a block (when it doesn't
1505 return NULL), but you can also create one for yourself, and pass addi‐
1506 tional information in it. The pcre_extra block contains the following
1507 fields (not necessarily in this order):
1508
1509 unsigned long int flags;
1510 void *study_data;
1511 void *executable_jit;
1512 unsigned long int match_limit;
1513 unsigned long int match_limit_recursion;
1514 void *callout_data;
1515 const unsigned char *tables;
1516 unsigned char **mark;
1517
1518 In the 16-bit version of this structure, the mark field has type
1519 "PCRE_UCHAR16 **".
1520
1521 In the 32-bit version of this structure, the mark field has type
1522 "PCRE_UCHAR32 **".
1523
1524 The flags field is used to specify which of the other fields are set.
1525 The flag bits are:
1526
1527 PCRE_EXTRA_CALLOUT_DATA
1528 PCRE_EXTRA_EXECUTABLE_JIT
1529 PCRE_EXTRA_MARK
1530 PCRE_EXTRA_MATCH_LIMIT
1531 PCRE_EXTRA_MATCH_LIMIT_RECURSION
1532 PCRE_EXTRA_STUDY_DATA
1533 PCRE_EXTRA_TABLES
1534
1535 Other flag bits should be set to zero. The study_data field and some‐
1536 times the executable_jit field are set in the pcre_extra block that is
1537 returned by pcre_study(), together with the appropriate flag bits. You
1538 should not set these yourself, but you may add to the block by setting
1539 other fields and their corresponding flag bits.
1540
1541 The match_limit field provides a means of preventing PCRE from using up
1542 a vast amount of resources when running patterns that are not going to
1543 match, but which have a very large number of possibilities in their
1544 search trees. The classic example is a pattern that uses nested unlim‐
1545 ited repeats.
1546
1547 Internally, pcre_exec() uses a function called match(), which it calls
1548 repeatedly (sometimes recursively). The limit set by match_limit is im‐
1549 posed on the number of times this function is called during a match,
1550 which has the effect of limiting the amount of backtracking that can
1551 take place. For patterns that are not anchored, the count restarts from
1552 zero for each position in the subject string.
1553
1554 When pcre_exec() is called with a pattern that was successfully studied
1555 with a JIT option, the way that the matching is executed is entirely
1556 different. However, there is still the possibility of runaway matching
1557 that goes on for a very long time, and so the match_limit value is also
1558 used in this case (but in a different way) to limit how long the match‐
1559 ing can continue.
1560
1561 The default value for the limit can be set when PCRE is built; the
1562 built-in default is 10 million, which handles all but the most extreme
1563 cases. You can override the default by supplying pcre_exec() with a
1564 pcre_extra block in which match_limit is set, and PCRE_EX‐
1565 TRA_MATCH_LIMIT is set in the flags field. If the limit is exceeded,
1566 pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
1567
1568 A value for the match limit may also be supplied by an item at the
1569 start of a pattern of the form
1570
1571 (*LIMIT_MATCH=d)
1572
1573 where d is a decimal number. However, such a setting is ignored unless
1574 d is less than the limit set by the caller of pcre_exec() or, if no
1575 such limit is set, less than the default.
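
The way these three sources of a limit interact can be summarised by a
small function. This is a sketch of the rule as stated above, not
library code: effective_match_limit() and its parameters are
hypothetical, and a NULL pointer is used here to mean "not set".

```c
#include <stddef.h>

/* Hypothetical helper modelling the rule described above.
 *   build_default - the default chosen when PCRE was built
 *   caller_limit  - match_limit from pcre_extra, or NULL if the
 *                   PCRE_EXTRA_MATCH_LIMIT flag is not set
 *   pattern_limit - d from (*LIMIT_MATCH=d), or NULL if absent
 * Returns the limit that would be in force for the match. */
unsigned long effective_match_limit(unsigned long build_default,
                                    const unsigned long *caller_limit,
                                    const unsigned long *pattern_limit)
{
    /* The caller's setting, when present, replaces the default. */
    unsigned long limit = caller_limit ? *caller_limit : build_default;

    /* A setting in the pattern can only lower the limit, never raise it. */
    if (pattern_limit && *pattern_limit < limit)
        limit = *pattern_limit;
    return limit;
}
```

For example, with a default of 10 million and a caller limit of 5000,
a pattern that starts with (*LIMIT_MATCH=1000) runs with a limit of
1000, whereas (*LIMIT_MATCH=20000000) is ignored and the limit stays
at 5000.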
1576
1577 The match_limit_recursion field is similar to match_limit, but instead
1578 of limiting the total number of times that match() is called, it limits
1579 the depth of recursion. The recursion depth is a smaller number than
1580 the total number of calls, because not all calls to match() are recur‐
1581 sive. This limit is of use only if it is set smaller than match_limit.
1582
1583 Limiting the recursion depth limits the amount of machine stack that
1584 can be used, or, when PCRE has been compiled to use memory on the heap
1585 instead of the stack, the amount of heap memory that can be used. This
1586 limit is not relevant, and is ignored, when matching is done using JIT
1587 compiled code.
1588
1589 The default value for match_limit_recursion can be set when PCRE is
1590 built; the built-in default is the same value as the default for
1591 match_limit. You can override the default by supplying pcre_exec() with
1592 a pcre_extra block in which match_limit_recursion is set, and PCRE_EX‐
1593 TRA_MATCH_LIMIT_RECURSION is set in the flags field. If the limit is
1594 exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
1595
1596 A value for the recursion limit may also be supplied by an item at the
1597 start of a pattern of the form
1598
1599 (*LIMIT_RECURSION=d)
1600
1601 where d is a decimal number. However, such a setting is ignored unless
1602 d is less than the limit set by the caller of pcre_exec() or, if no
1603 such limit is set, less than the default.
1604
1605 The callout_data field is used in conjunction with the "callout" fea‐
1606 ture, and is described in the pcrecallout documentation.
1607
1608 The tables field is provided for use with patterns that have been pre-
1609 compiled using custom character tables, saved to disc or elsewhere, and
1610 then reloaded, because the tables that were used to compile a pattern
1611 are not saved with it. See the pcreprecompile documentation for a dis‐
1612 cussion of saving compiled patterns for later use. If NULL is passed
1613 using this mechanism, it forces PCRE's internal tables to be used.
1614
1615 Warning: The tables that pcre_exec() uses must be the same as those
1616 that were used when the pattern was compiled. If this is not the case,
1617 the behaviour of pcre_exec() is undefined. Therefore, when a pattern is
1618 compiled and matched in the same process, this field should never be
1619 set. In this (the most common) case, the correct table pointer is auto‐
1620 matically passed with the compiled pattern from pcre_compile() to
1621 pcre_exec().
1622
1623 If PCRE_EXTRA_MARK is set in the flags field, the mark field must be
1624 set to point to a suitable variable. If the pattern contains any back‐
1625 tracking control verbs such as (*MARK:NAME), and the execution ends up
1626 with a name to pass back, a pointer to the name string (zero termi‐
1627 nated) is placed in the variable pointed to by the mark field. The
1628 names are within the compiled pattern; if you wish to retain such a
1629 name you must copy it before freeing the memory of a compiled pattern.
1630 If there is no name to pass back, the variable pointed to by the mark
1631 field is set to NULL. For details of the backtracking control verbs,
1632 see the section entitled "Backtracking control" in the pcrepattern doc‐
1633 umentation.
1634
1635 Option bits for pcre_exec()
1636
1637 The unused bits of the options argument for pcre_exec() must be zero.
1638 The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx,
1639 PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
1640 PCRE_NO_START_OPTIMIZE, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD, and
1641 PCRE_PARTIAL_SOFT.
1642
1643 If the pattern was successfully studied with one of the just-in-time
1644 (JIT) compile options, the only supported options for JIT execution are
1645 PCRE_NO_UTF8_CHECK, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
1646 PCRE_NOTEMPTY_ATSTART, PCRE_PARTIAL_HARD, and PCRE_PARTIAL_SOFT. If an
1647 unsupported option is used, JIT execution is disabled and the normal
1648 interpretive code in pcre_exec() is run.
1649
1650 PCRE_ANCHORED
1651
1652 The PCRE_ANCHORED option limits pcre_exec() to matching at the first
1653 matching position. If a pattern was compiled with PCRE_ANCHORED, or
1654 turned out to be anchored by virtue of its contents, it cannot be made
1655 unanchored at matching time.
1656
1657 PCRE_BSR_ANYCRLF
1658 PCRE_BSR_UNICODE
1659
1660 These options (which are mutually exclusive) control what the \R escape
1661 sequence matches. The choice is either to match only CR, LF, or CRLF,
1662 or to match any Unicode newline sequence. These options override the
1663 choice that was made or defaulted when the pattern was compiled.
1664
1665 PCRE_NEWLINE_CR
1666 PCRE_NEWLINE_LF
1667 PCRE_NEWLINE_CRLF
1668 PCRE_NEWLINE_ANYCRLF
1669 PCRE_NEWLINE_ANY
1670
1671 These options override the newline definition that was chosen or de‐
1672 faulted when the pattern was compiled. For details, see the description
1673 of pcre_compile() above. During matching, the newline choice affects
1674 the behaviour of the dot, circumflex, and dollar metacharacters. It may
1675 also alter the way the match position is advanced after a match failure
1676 for an unanchored pattern.
1677
1678 When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is
1679 set, and a match attempt for an unanchored pattern fails when the cur‐
1680 rent position is at a CRLF sequence, and the pattern contains no ex‐
1681 plicit matches for CR or LF characters, the match position is advanced
1682 by two characters instead of one, in other words, to after the CRLF.
1683
1684 The above rule is a compromise that makes the most common cases work as
1685 expected. For example, if the pattern is .+A (and the PCRE_DOTALL op‐
1686 tion is not set), it does not match the string "\r\nA" because, after
1687 failing at the start, it skips both the CR and the LF before retrying.
1688 However, the pattern [\r\n]A does match that string, because it con‐
1689 tains an explicit CR or LF reference, and so advances only by one char‐
1690 acter after the first failure.
1691
1692 An explicit match for CR or LF is either a literal appearance of one of
1693 those characters, or one of the \r or \n escape sequences. Implicit
1694 matches such as [^X] do not count, nor does \s (which includes CR and
1695 LF in the characters that it matches).
1696
1697 Notwithstanding the above, anomalous effects may still occur when CRLF
1698 is a valid newline sequence and explicit \r or \n escapes appear in the
1699 pattern.
1700
1701 PCRE_NOTBOL
1702
1703 This option specifies that the first character of the subject string is not
1704 the beginning of a line, so the circumflex metacharacter should not
1705 match before it. Setting this without PCRE_MULTILINE (at compile time)
1706 causes circumflex never to match. This option affects only the behav‐
1707 iour of the circumflex metacharacter. It does not affect \A.
1708
1709 PCRE_NOTEOL
1710
1711 This option specifies that the end of the subject string is not the end
1712 of a line, so the dollar metacharacter should not match it nor (except
1713 in multiline mode) a newline immediately before it. Setting this with‐
1714 out PCRE_MULTILINE (at compile time) causes dollar never to match. This
1715 option affects only the behaviour of the dollar metacharacter. It does
1716 not affect \Z or \z.
1717
1718 PCRE_NOTEMPTY
1719
1720 An empty string is not considered to be a valid match if this option is
1721 set. If there are alternatives in the pattern, they are tried. If all
1722 the alternatives match the empty string, the entire match fails. For
1723 example, if the pattern
1724
1725 a?b?
1726
1727 is applied to a string not beginning with "a" or "b", it matches an
1728 empty string at the start of the subject. With PCRE_NOTEMPTY set, this
1729 match is not valid, so PCRE searches further into the string for occur‐
1730 rences of "a" or "b".
1731
1732 PCRE_NOTEMPTY_ATSTART
1733
1734 This is like PCRE_NOTEMPTY, except that an empty string match that is
1735 not at the start of the subject is permitted. If the pattern is an‐
1736 chored, such a match can occur only if the pattern contains \K.
1737
1738 Perl has no direct equivalent of PCRE_NOTEMPTY or PCRE_NOTEMPTY_AT‐
1739 START, but it does make a special case of a pattern match of the empty
1740 string within its split() function, and when using the /g modifier. It
1741 is possible to emulate Perl's behaviour after matching a null string by
1742 first trying the match again at the same offset with PCRE_NOTEMPTY_AT‐
1743 START and PCRE_ANCHORED, and then if that fails, by advancing the
1744 starting offset (see below) and trying an ordinary match again. There
1745 is some code that demonstrates how to do this in the pcredemo sample
1746 program. In the most general case, you have to check to see if the new‐
1747 line convention recognizes CRLF as a newline, and if so, and the cur‐
1748 rent character is CR followed by LF, advance the starting offset by two
1749 characters instead of one.
1750
1751 PCRE_NO_START_OPTIMIZE
1752
1753 There are a number of optimizations that pcre_exec() uses at the start
1754 of a match, in order to speed up the process. For example, if it is
1755 known that an unanchored match must start with a specific character, it
1756 searches the subject for that character, and fails immediately if it
1757 cannot find it, without actually running the main matching function.
1758 This means that a special item such as (*COMMIT) at the start of a pat‐
1759 tern is not considered until after a suitable starting point for the
1760 match has been found. Also, when callouts or (*MARK) items are in use,
1761 these "start-up" optimizations can cause them to be skipped if the pat‐
1762 tern is never actually used. The start-up optimizations are in effect a
1763 pre-scan of the subject that takes place before the pattern is run.
1764
1765 The PCRE_NO_START_OPTIMIZE option disables the start-up optimizations,
1766 possibly causing performance to suffer, but ensuring that in cases
1767 where the result is "no match", the callouts do occur, and that items
1768 such as (*COMMIT) and (*MARK) are considered at every possible starting
1769 position in the subject string. If PCRE_NO_START_OPTIMIZE is set at
1770 compile time, it cannot be unset at matching time. The use of
1771 PCRE_NO_START_OPTIMIZE at matching time (that is, passing it to
1772 pcre_exec()) disables JIT execution; in this situation, matching is
1773 always done interpretively.
1774
1775 Setting PCRE_NO_START_OPTIMIZE can change the outcome of a matching op‐
1776 eration. Consider the pattern
1777
1778 (*COMMIT)ABC
1779
1780 When this is compiled, PCRE records the fact that a match must start
1781 with the character "A". Suppose the subject string is "DEFABC". The
1782 start-up optimization scans along the subject, finds "A" and runs the
1783 first match attempt from there. The (*COMMIT) item means that the pat‐
1784 tern must match the current starting position, which in this case, it
1785 does. However, if the same match is run with PCRE_NO_START_OPTIMIZE
1786 set, the initial scan along the subject string does not happen. The
1787 first match attempt is run starting from "D" and when this fails,
1788 (*COMMIT) prevents any further matches being tried, so the overall re‐
1789 sult is "no match". If the pattern is studied, more start-up optimiza‐
1790 tions may be used. For example, a minimum length for the subject may be
1791 recorded. Consider the pattern
1792
1793 (*MARK:A)(X|Y)
1794
1795 The minimum length for a match is one character. If the subject is
1796 "ABC", there will be attempts to match "ABC", "BC", "C", and then fi‐
1797 nally an empty string. If the pattern is studied, the final attempt
1798 does not take place, because PCRE knows that the subject is too short,
1799 and so the (*MARK) is never encountered. In this case, studying the
1800 pattern does not affect the overall match result, which is still "no
1801 match", but it does affect the auxiliary information that is returned.
1802
1803 PCRE_NO_UTF8_CHECK
1804
1805 When PCRE_UTF8 is set at compile time, the validity of the subject as a
1806 UTF-8 string is automatically checked when pcre_exec() is subsequently
1807 called. The entire string is checked before any other processing takes
1808 place. The value of startoffset is also checked to ensure that it
1809 points to the start of a UTF-8 character. There is a discussion about
1810 the validity of UTF-8 strings in the pcreunicode page. If an invalid
1811 sequence of bytes is found, pcre_exec() returns the error PCRE_ER‐
1812 ROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is a trun‐
1813 cated character at the end of the subject, PCRE_ERROR_SHORTUTF8. In
1814 both cases, information about the precise nature of the error may also
1815 be returned (see the descriptions of these errors in the section enti‐
1816 tled Error return values from pcre_exec() below). If startoffset con‐
1817 tains a value that does not point to the start of a UTF-8 character (or
1818 to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned.
1819
1820 If you already know that your subject is valid, and you want to skip
1821 these checks for performance reasons, you can set the
1822 PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to
1823 do this for the second and subsequent calls to pcre_exec() if you are
1824 making repeated calls to find all the matches in a single subject
1825 string. However, you should be sure that the value of startoffset
1826 points to the start of a character (or the end of the subject). When
1827 PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid string as a
1828 subject or an invalid value of startoffset is undefined. Your program
1829 may crash or loop.
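
Before setting PCRE_NO_UTF8_CHECK on a follow-up call, an application
may wish to confirm that its computed startoffset is acceptable. The
sketch below shows one way to do so for the 8-bit library. The
function name is hypothetical, and the test covers only the offset: it
relies on the fact that UTF-8 continuation bytes have the bit pattern
10xxxxxx, and it does not validate the subject itself.

```c
/* Hypothetical helper: return 1 if offset is either the end of the
 * subject or does not fall on a UTF-8 continuation byte, else 0.
 * This checks the offset only; it does NOT validate the string. */
int is_utf8_start_offset(const char *subject, int length, int offset)
{
    if (offset < 0 || offset > length)
        return 0;                 /* outside the subject altogether */
    if (offset == length)
        return 1;                 /* the end of the subject is allowed */

    /* Continuation bytes look like 10xxxxxx; anything else starts a
     * character (or is an ASCII byte). */
    return ((unsigned char)subject[offset] & 0xC0) != 0x80;
}
```

For the four-byte subject "a", U+00E9 (encoded as C3 A9), "z", the
offsets 0, 1, 3, and 4 are accepted and offset 2 is rejected, because
it lies in the middle of the two-byte character.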
1830
1831 PCRE_PARTIAL_HARD
1832 PCRE_PARTIAL_SOFT
1833
1834 These options turn on the partial matching feature. For backwards com‐
1835 patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
1836 match occurs if the end of the subject string is reached successfully,
1837 but there are not enough subject characters to complete the match. If
1838 this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,
1839 matching continues by testing any remaining alternatives. Only if no
1840 complete match can be found is PCRE_ERROR_PARTIAL returned instead of
1841 PCRE_ERROR_NOMATCH. In other words, PCRE_PARTIAL_SOFT says that the
1842 caller is prepared to handle a partial match, but only if no complete
1843 match can be found.
1844
1845 If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this
1846 case, if a partial match is found, pcre_exec() immediately returns
1847 PCRE_ERROR_PARTIAL, without considering any other alternatives. In
1848 other words, when PCRE_PARTIAL_HARD is set, a partial match is
1849 considered to be more important than an alternative complete match.
1850
1851 In both cases, the portion of the string that was inspected when the
1852 partial match was found is set as the first matching string. There is a
1853 more detailed discussion of partial and multi-segment matching, with
1854 examples, in the pcrepartial documentation.
1855
1856 The string to be matched by pcre_exec()
1857
1858 The subject string is passed to pcre_exec() as a pointer in subject, a
1859 length in length, and a starting offset in startoffset. The units for
1860 length and startoffset are bytes for the 8-bit library, 16-bit data
1861 items for the 16-bit library, and 32-bit data items for the 32-bit li‐
1862 brary.
1863
1864 If startoffset is negative or greater than the length of the subject,
1865 pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting offset is
1866 zero, the search for a match starts at the beginning of the subject,
1867 and this is by far the most common case. In UTF-8 or UTF-16 mode, the
1868 offset must point to the start of a character, or the end of the sub‐
1869 ject (in UTF-32 mode, one data unit equals one character, so all off‐
1870 sets are valid). Unlike the pattern string, the subject may contain bi‐
1871 nary zeroes.
1872
1873 A non-zero starting offset is useful when searching for another match
1874 in the same subject by calling pcre_exec() again after a previous suc‐
1875 cess. Setting startoffset differs from just passing over a shortened
1876 string and setting PCRE_NOTBOL in the case of a pattern that begins
1877 with any kind of lookbehind. For example, consider the pattern
1878
1879 \Biss\B
1880
1881 which finds occurrences of "iss" in the middle of words. (\B matches
1882 only if the current position in the subject is not a word boundary.)
1883 When applied to the string "Mississippi" the first call to pcre_exec()
1884 finds the first occurrence. If pcre_exec() is called again with just
1885 the remainder of the subject, namely "issippi", it does not match, be‐
1886 cause \B is always false at the start of the subject, which is deemed
1887 to be a word boundary. However, if pcre_exec() is passed the entire
1888 string again, but with startoffset set to 4, it finds the second occur‐
1889 rence of "iss" because it is able to look behind the starting point to
1890 discover that it is preceded by a letter.
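
The difference can be made concrete with a simplified model of the \B
test. The code below is an illustration only, not PCRE's
implementation: the function names are hypothetical, and a "word
character" is assumed to be an ASCII letter, digit, or underscore. The
point is that the test looks at the character before the current
position, which exists only if the whole subject is passed.

```c
#include <ctype.h>

/* Hypothetical helper: 1 if subject[pos] exists and is an ASCII
 * letter, digit, or underscore; 0 otherwise (including positions
 * before the start or at/after the end of the subject). */
int is_word_char_at(const char *subject, int length, int pos)
{
    unsigned char c;

    if (pos < 0 || pos >= length)
        return 0;
    c = (unsigned char)subject[pos];
    return isalnum(c) || c == '_';
}

/* Simplified model of \B: succeeds (returns 1) when the characters on
 * both sides of pos are word characters, or neither is. */
int not_word_boundary(const char *subject, int length, int pos)
{
    return is_word_char_at(subject, length, pos - 1) ==
           is_word_char_at(subject, length, pos);
}
```

With the subject "Mississippi" and position 4, the model reports that
\B succeeds, because the preceding "s" is visible. With the shortened
subject "issippi" and position 0, nothing precedes the position, so it
counts as a word boundary and \B fails.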
1891
1892 Finding all the matches in a subject is tricky when the pattern can
1893 match an empty string. It is possible to emulate Perl's /g behaviour by
1894 first trying the match again at the same offset, with the
1895 PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED options, and then if that
1896 fails, advancing the starting offset and trying an ordinary match
1897 again. There is some code that demonstrates how to do this in the pcre‐
1898 demo sample program. In the most general case, you have to check to see
1899 if the newline convention recognizes CRLF as a newline, and if so, and
1900 the current character is CR followed by LF, advance the starting offset
1901 by two characters instead of one.
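
The "advance the starting offset" step might look like the following
sketch for the 8-bit library in non-UTF mode. The function name and
the crlf_is_newline flag are hypothetical; the caller is assumed to
have worked out whether the newline convention in force recognizes
CRLF. In UTF-8 mode the offset would additionally have to be moved
past any continuation bytes, which this sketch does not attempt.

```c
/* Hypothetical helper: compute the next starting offset after an
 * empty-string match at offset could not be extended. Returns -1 when
 * the end of the subject has been reached and the search is over.
 * Assumes one byte per character (non-UTF mode). */
int next_start_offset(const char *subject, int length, int offset,
                      int crlf_is_newline)
{
    if (offset >= length)
        return -1;                /* nothing left to try */

    /* Step over a whole CRLF pair when CRLF counts as a newline. */
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;

    return offset + 1;            /* ordinary single-character advance */
}
```

For the subject "a\r\nb" and offset 1, the result is 3 when CRLF is a
recognized newline and 2 when it is not.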
1902
1903 If a non-zero starting offset is passed when the pattern is anchored,
1904 one attempt to match at the given offset is made. This can only succeed
1905 if the pattern does not require the match to be at the start of the
1906 subject.
1907
1908 How pcre_exec() returns captured substrings
1909
1910 In general, a pattern matches a certain portion of the subject, and in
1911 addition, further substrings from the subject may be picked out by
1912 parts of the pattern. Following the usage in Jeffrey Friedl's book,
1913 this is called "capturing" in what follows, and the phrase "capturing
1914 subpattern" is used for a fragment of a pattern that picks out a sub‐
1915 string. PCRE supports several other kinds of parenthesized subpattern
1916 that do not cause substrings to be captured.
1917
1918 Captured substrings are returned to the caller via a vector of integers
1919 whose address is passed in ovector. The number of elements in the vec‐
1920 tor is passed in ovecsize, which must be a non-negative number. Note:
1921 this argument is NOT the size of ovector in bytes.
1922
1923 The first two-thirds of the vector is used to pass back captured sub‐
1924 strings, each substring using a pair of integers. The remaining third
1925 of the vector is used as workspace by pcre_exec() while matching cap‐
1926 turing subpatterns, and is not available for passing back information.
1927 The number passed in ovecsize should always be a multiple of three. If
1928 it is not, it is rounded down.
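
The arithmetic implied by the two-thirds rule is simple, and can be
written down as two helpers. Both names are hypothetical; they merely
restate the paragraph above in code.

```c
/* Hypothetical helper: the smallest ovecsize that leaves room for the
 * whole match plus capture_groups captured substrings. Each pair
 * needs two result slots and one workspace slot. */
int ovecsize_for_groups(int capture_groups)
{
    return (capture_groups + 1) * 3;
}

/* Hypothetical helper: how many offset pairs a vector of ovecsize
 * elements can pass back. Integer division gives the "rounded down
 * to a multiple of three" effect for a non-negative ovecsize. */
int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

A pattern with nine capturing subpatterns therefore needs an ovecsize
of 30, and a vector of 8 elements is treated like one of 6, giving two
usable pairs.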
1929
1930 When a match is successful, information about captured substrings is
1931 returned in pairs of integers, starting at the beginning of ovector,
1932 and continuing up to two-thirds of its length at the most. The first
1933 element of each pair is set to the offset of the first character in a
1934 substring, and the second is set to the offset of the first character
1935 after the end of a substring. These values are always data unit off‐
1936 sets, even in UTF mode. They are byte offsets in the 8-bit library,
1937 16-bit data item offsets in the 16-bit library, and 32-bit data item
1938 offsets in the 32-bit library. Note: they are not character counts.
1939
1940 The first pair of integers, ovector[0] and ovector[1], identify the
1941 portion of the subject string matched by the entire pattern. The next
1942 pair is used for the first capturing subpattern, and so on. The value
1943 returned by pcre_exec() is one more than the highest numbered pair that
1944 has been set. For example, if two substrings have been captured, the
1945 returned value is 3. If there are no capturing subpatterns, the return
1946 value from a successful match is 1, indicating that just the first pair
1947 of offsets has been set.
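
As an illustration of the offset pairs described above, the following sketch locates one captured substring inside the subject without copying it. The helper name capture_span() is hypothetical and not part of PCRE; it assumes ovector and rc come from a successful pcre_exec() call.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE. Given the ovector filled in by
   a successful pcre_exec() call and its positive return value rc, find
   where capture number n (0 = whole match) lies inside subject.
   Returns 1 and sets *start and *len, or 0 if pair n holds no match. */
int capture_span(const char *subject, const int *ovector, int rc,
                 int n, const char **start, size_t *len)
{
    if (n < 0 || n >= rc)        /* pairs beyond rc-1 carry no match data */
        return 0;
    if (ovector[2 * n] < 0)      /* a negative offset marks an unset group */
        return 0;
    *start = subject + ovector[2 * n];
    /* The second offset is one past the end, so the difference is the
       length in data units (bytes in the 8-bit library). */
    *len = (size_t)(ovector[2 * n + 1] - ovector[2 * n]);
    return 1;
}
```

The span is not zero-terminated, so it is best printed with a length-limited format such as "%.*s" or copied before use as a C string.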
1948
1949 If a capturing subpattern is matched repeatedly, it is the last portion
1950 of the string that it matched that is returned.
1951
1952 If the vector is too small to hold all the captured substring offsets,
1953 it is used as far as possible (up to two-thirds of its length), and the
1954 function returns a value of zero. If neither the actual string matched
1955 nor any captured substrings are of interest, pcre_exec() may be called
1956 with ovector passed as NULL and ovecsize as zero. However, if the pat‐
1957 tern contains back references and the ovector is not big enough to re‐
1958 member the related substrings, PCRE has to get additional memory for
1959 use during matching. Thus it is usually advisable to supply an ovector
1960 of reasonable size.
1961
1962 There are some cases where zero is returned (indicating vector over‐
1963 flow) when in fact the vector is exactly the right size for the final
1964 match. For example, consider the pattern
1965
1966 (a)(?:(b)c|bd)
1967
1968 If a vector of 6 elements (allowing for only 1 captured substring) is
1969 given with subject string "abd", pcre_exec() will try to set the second
1970 captured string, thereby recording a vector overflow, before failing to
1971 match "c" and backing up to try the second alternative. The zero re‐
1972 turn, however, does correctly indicate that the maximum number of slots
1973 (namely 2) have been filled. In similar cases where there is temporary
1974 overflow, but the final number of used slots is actually less than the
1975 maximum, a non-zero value is returned.
1976
1977 The pcre_fullinfo() function can be used to find out how many capturing
1978 subpatterns there are in a compiled pattern. The smallest size for
1979 ovector that will allow for n captured substrings, in addition to the
1980 offsets of the substring matched by the whole pattern, is (n+1)*3.
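
The sizing rules above can be written down as two small helpers. This is only a sketch: the function names are invented for illustration, and n would normally be obtained from pcre_fullinfo() with PCRE_INFO_CAPTURECOUNT.

```c
/* Smallest ovecsize that holds n captured substrings plus the pair for
   the whole match: (n+1)*3, as stated above. */
int ovector_size_for(int n_captures)
{
    return (n_captures + 1) * 3;
}

/* Number of offset pairs pcre_exec() can return for a given ovecsize.
   The size is rounded down to a multiple of three and only the first
   two-thirds hold pairs, which works out to ovecsize / 3. */
int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

For example, a vector of 10 elements behaves like one of 9 and has room for three pairs: the whole match and two captured substrings.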
1981
1982 It is possible for capturing subpattern number n+1 to match some part
1983 of the subject when subpattern n has not been used at all. For example,
1984 if the string "abc" is matched against the pattern (a|(z))(bc) the re‐
1985 turn from the function is 4, and subpatterns 1 and 3 are matched, but 2
1986 is not. When this happens, both values in the offset pairs correspond‐
1987 ing to unused subpatterns are set to -1.
1988
1989 Offset values that correspond to unused subpatterns at the end of the
1990 expression are also set to -1. For example, if the string "abc" is
1991 matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
1992 matched. The return from the function is 2, because the highest used
1993 capturing subpattern number is 1, and the offsets for the second
1994 and third capturing subpatterns (assuming the vector is large enough,
1995 of course) are set to -1.
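
Both situations (an unused subpattern in the middle, and unused subpatterns at the end) can be tested with one check. The helper below is a hypothetical sketch, not a PCRE function; rc is assumed to be the positive value returned by pcre_exec().

```c
/* Hypothetical helper: report whether capturing subpattern n took part
   in a successful match. */
int group_participated(const int *ovector, int rc, int n)
{
    if (n < 0 || n >= rc)
        return 0;                 /* beyond the highest numbered set pair */
    return ovector[2 * n] >= 0;   /* -1 marks an unused subpattern */
}
```

With the first example above ("abc" against (a|(z))(bc), return value 4) this reports subpatterns 1 and 3 as used and 2 as unused. With the second example (return value 2) it reports only subpattern 1 as used.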
1996
1997 Note: Elements in the first two-thirds of ovector that do not corre‐
1998 spond to capturing parentheses in the pattern are never changed. That
1999 is, if a pattern contains n capturing parentheses, no more than ovec‐
2000 tor[0] to ovector[2n+1] are set by pcre_exec(). The other elements (in
2001 the first two-thirds) retain whatever values they previously had.
2002
2003 Some convenience functions are provided for extracting the captured
2004 substrings as separate strings. These are described below.
2005
2006 Error return values from pcre_exec()
2007
2008 If pcre_exec() fails, it returns a negative number. The following are
2009 defined in the header file:
2010
2011 PCRE_ERROR_NOMATCH (-1)
2012
2013 The subject string did not match the pattern.
2014
2015 PCRE_ERROR_NULL (-2)
2016
2017 Either code or subject was passed as NULL, or ovector was NULL and
2018 ovecsize was not zero.
2019
2020 PCRE_ERROR_BADOPTION (-3)
2021
2022 An unrecognized bit was set in the options argument.
2023
2024 PCRE_ERROR_BADMAGIC (-4)
2025
2026 PCRE stores a 4-byte "magic number" at the start of the compiled code,
2027 to catch the case when it is passed a junk pointer and to detect when a
2028 pattern that was compiled in an environment of one endianness is run in
2029 an environment with the other endianness. This is the error that PCRE
2030 gives when the magic number is not present.
2031
2032 PCRE_ERROR_UNKNOWN_OPCODE (-5)
2033
2034 While running the pattern match, an unknown item was encountered in the
2035 compiled pattern. This error could be caused by a bug in PCRE or by
2036 overwriting of the compiled pattern.
2037
2038 PCRE_ERROR_NOMEMORY (-6)
2039
2040 If a pattern contains back references, but the ovector that is passed
2041 to pcre_exec() is not big enough to remember the referenced substrings,
2042 PCRE gets a block of memory at the start of matching to use for this
2043 purpose. If the call via pcre_malloc() fails, this error is given. The
2044 memory is automatically freed at the end of matching.
2045
2046 This error is also given if pcre_stack_malloc() fails in pcre_exec().
2047 This can happen only when PCRE has been compiled with --disable-stack-
2048 for-recursion.
2049
2050 PCRE_ERROR_NOSUBSTRING (-7)
2051
2052 This error is used by the pcre_copy_substring(), pcre_get_substring(),
2053 and pcre_get_substring_list() functions (see below). It is never re‐
2054 turned by pcre_exec().
2055
2056 PCRE_ERROR_MATCHLIMIT (-8)
2057
2058 The backtracking limit, as specified by the match_limit field in a
2059 pcre_extra structure (or defaulted) was reached. See the description
2060 above.
2061
2062 PCRE_ERROR_CALLOUT (-9)
2063
2064 This error is never generated by pcre_exec() itself. It is provided for
2065 use by callout functions that want to yield a distinctive error code.
2066 See the pcrecallout documentation for details.
2067
2068 PCRE_ERROR_BADUTF8 (-10)
2069
2070 A string that contains an invalid UTF-8 byte sequence was passed as a
2071 subject, and the PCRE_NO_UTF8_CHECK option was not set. If the size of
2072 the output vector (ovecsize) is at least 2, the byte offset to the
2073 start of the invalid UTF-8 character is placed in the first element,
2074 and a reason code is placed in the second element. The reason
2075 codes are listed in the following section. For backward compatibility,
2076 if PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 char‐
2077 acter at the end of the subject (reason codes 1 to 5), PCRE_ER‐
2078 ROR_SHORTUTF8 is returned instead of PCRE_ERROR_BADUTF8.
2079
2080 PCRE_ERROR_BADUTF8_OFFSET (-11)
2081
2082 The UTF-8 byte sequence that was passed as a subject was checked and
2083 found to be valid (the PCRE_NO_UTF8_CHECK option was not set), but the
2084 value of startoffset did not point to the beginning of a UTF-8 charac‐
2085 ter or the end of the subject.
2086
2087 PCRE_ERROR_PARTIAL (-12)
2088
2089 The subject string did not match, but it did match partially. See the
2090 pcrepartial documentation for details of partial matching.
2091
2092 PCRE_ERROR_BADPARTIAL (-13)
2093
2094 This code is no longer in use. It was formerly returned when the
2095 PCRE_PARTIAL option was used with a compiled pattern containing items
2096 that were not supported for partial matching. From release 8.00 on‐
2097 wards, there are no restrictions on partial matching.
2098
2099 PCRE_ERROR_INTERNAL (-14)
2100
2101 An unexpected internal error has occurred. This error could be caused
2102 by a bug in PCRE or by overwriting of the compiled pattern.
2103
2104 PCRE_ERROR_BADCOUNT (-15)
2105
2106 This error is given if the value of the ovecsize argument is negative.
2107
2108 PCRE_ERROR_RECURSIONLIMIT (-21)
2109
2110 The internal recursion limit, as specified by the match_limit_recursion
2111 field in a pcre_extra structure (or defaulted) was reached. See the de‐
2112 scription above.
2113
2114 PCRE_ERROR_BADNEWLINE (-23)
2115
2116 An invalid combination of PCRE_NEWLINE_xxx options was given.
2117
2118 PCRE_ERROR_BADOFFSET (-24)
2119
2120 The value of startoffset was negative or greater than the length of the
2121 subject, that is, the value in length.
2122
2123 PCRE_ERROR_SHORTUTF8 (-25)
2124
2125 This error is returned instead of PCRE_ERROR_BADUTF8 when the subject
2126 string ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD
2127 option is set. Information about the failure is returned as for
2128 PCRE_ERROR_BADUTF8. That information is in fact sufficient to detect
2129 this case, but this special error code for PCRE_PARTIAL_HARD predates
2130 the implementation of returned information; it is retained for
2131 backwards compatibility.
2132
2133 PCRE_ERROR_RECURSELOOP (-26)
2134
2135 This error is returned when pcre_exec() detects a recursion loop within
2136 the pattern. Specifically, it means that either the whole pattern or a
2137 subpattern has been called recursively for the second time at the same
2138 position in the subject string. Some simple patterns that might do this
2139 are detected and faulted at compile time, but more complicated cases,
2140 in particular mutual recursions between two different subpatterns, can‐
2141 not be detected until run time.
2142
2143 PCRE_ERROR_JIT_STACKLIMIT (-27)
2144
2145 This error is returned when a pattern that was successfully studied us‐
2146 ing a JIT compile option is being matched, but the memory available for
2147 the just-in-time processing stack is not large enough. See the pcrejit
2148 documentation for more details.
2149
2150 PCRE_ERROR_BADMODE (-28)
2151
2152 This error is given if a pattern that was compiled by the 8-bit library
2153 is passed to a 16-bit or 32-bit library function, or vice versa.
2154
2155 PCRE_ERROR_BADENDIANNESS (-29)
2156
2157 This error is given if a pattern that was compiled and saved is
2158 reloaded on a host with different endianness. The utility function
2159 pcre_pattern_to_host_byte_order() can be used to convert such a pattern
2160 so that it runs on the new host.
2161
2162 PCRE_ERROR_JIT_BADOPTION (-31)
2163
2164 This error is returned when a pattern that was successfully studied us‐
2165 ing a JIT compile option is being matched, but the matching mode (par‐
2166 tial or complete match) does not correspond to any JIT compilation
2167 mode. When the JIT fast path function is used, this error may be also
2168 given for invalid options. See the pcrejit documentation for more de‐
2169 tails.
2170
2171 PCRE_ERROR_BADLENGTH (-32)
2172
2173 This error is given if pcre_exec() is called with a negative value for
2174 the length argument.
2175
2176 Error numbers -16 to -20, -22, and -30 are not used by pcre_exec().
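
When reporting failures it can help to turn a negative return value into its symbolic name. The lookup table below is a minimal sketch that copies the numeric values from the list above (the value -31 for PCRE_ERROR_JIT_BADOPTION is taken from pcre.h). The function name exec_error_name() is invented for this example; real code would include <pcre.h> and compare against its macros rather than literal numbers.

```c
#include <stddef.h>

/* Illustrative table only; values as documented for pcre_exec(). */
struct exec_error { int code; const char *name; };

static const struct exec_error exec_errors[] = {
    {  -1, "PCRE_ERROR_NOMATCH" },
    {  -2, "PCRE_ERROR_NULL" },
    {  -3, "PCRE_ERROR_BADOPTION" },
    {  -4, "PCRE_ERROR_BADMAGIC" },
    {  -5, "PCRE_ERROR_UNKNOWN_OPCODE" },
    {  -6, "PCRE_ERROR_NOMEMORY" },
    {  -7, "PCRE_ERROR_NOSUBSTRING" },
    {  -8, "PCRE_ERROR_MATCHLIMIT" },
    {  -9, "PCRE_ERROR_CALLOUT" },
    { -10, "PCRE_ERROR_BADUTF8" },
    { -11, "PCRE_ERROR_BADUTF8_OFFSET" },
    { -12, "PCRE_ERROR_PARTIAL" },
    { -13, "PCRE_ERROR_BADPARTIAL" },
    { -14, "PCRE_ERROR_INTERNAL" },
    { -15, "PCRE_ERROR_BADCOUNT" },
    { -21, "PCRE_ERROR_RECURSIONLIMIT" },
    { -23, "PCRE_ERROR_BADNEWLINE" },
    { -24, "PCRE_ERROR_BADOFFSET" },
    { -25, "PCRE_ERROR_SHORTUTF8" },
    { -26, "PCRE_ERROR_RECURSELOOP" },
    { -27, "PCRE_ERROR_JIT_STACKLIMIT" },
    { -28, "PCRE_ERROR_BADMODE" },
    { -29, "PCRE_ERROR_BADENDIANNESS" },
    { -31, "PCRE_ERROR_JIT_BADOPTION" },
    { -32, "PCRE_ERROR_BADLENGTH" }
};

/* Return the symbolic name for a negative pcre_exec() return value, or
   a fallback string for values that pcre_exec() does not use. */
const char *exec_error_name(int rc)
{
    size_t i;
    for (i = 0; i < sizeof exec_errors / sizeof exec_errors[0]; i++)
        if (exec_errors[i].code == rc)
            return exec_errors[i].name;
    return "not a pcre_exec() error code";
}
```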
2177
2178 Reason codes for invalid UTF-8 strings
2179
2180 This section applies only to the 8-bit library. The corresponding in‐
2181 formation for the 16-bit and 32-bit libraries is given in the pcre16
2182 and pcre32 pages.
2183
2184 When pcre_exec() returns either PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORT‐
2185 UTF8, and the size of the output vector (ovecsize) is at least 2, the
2186 offset of the start of the invalid UTF-8 character is placed in the
2187 first output vector element (ovector[0]) and a reason code is placed in
2188 the second element (ovector[1]). The reason codes are given names in
2189 the pcre.h header file:
2190
2191 PCRE_UTF8_ERR1
2192 PCRE_UTF8_ERR2
2193 PCRE_UTF8_ERR3
2194 PCRE_UTF8_ERR4
2195 PCRE_UTF8_ERR5
2196
2197 The string ends with a truncated UTF-8 character; the code specifies
2198 how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
2199 characters to be no longer than 4 bytes, the encoding scheme (origi‐
2200 nally defined by RFC 2279) allows for up to 6 bytes, and this is
2201 checked first; hence the possibility of 4 or 5 missing bytes.
2202
2203 PCRE_UTF8_ERR6
2204 PCRE_UTF8_ERR7
2205 PCRE_UTF8_ERR8
2206 PCRE_UTF8_ERR9
2207 PCRE_UTF8_ERR10
2208
2209 The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
2210 the character do not have the binary value 0b10 (that is, either the
2211 most significant bit is 0, or the next bit is 1).
2212
2213 PCRE_UTF8_ERR11
2214 PCRE_UTF8_ERR12
2215
2216 A character that is valid by the RFC 2279 rules is either 5 or 6 bytes
2217 long; these code points are excluded by RFC 3629.
2218
2219 PCRE_UTF8_ERR13
2220
2221 A 4-byte character has a value greater than 0x10ffff; these code points
2222 are excluded by RFC 3629.
2223
2224 PCRE_UTF8_ERR14
2225
2226 A 3-byte character has a value in the range 0xd800 to 0xdfff; this
2227 range of code points is reserved by RFC 3629 for use with UTF-16, and
2228 so are excluded from UTF-8.
2229
2230 PCRE_UTF8_ERR15
2231 PCRE_UTF8_ERR16
2232 PCRE_UTF8_ERR17
2233 PCRE_UTF8_ERR18
2234 PCRE_UTF8_ERR19
2235
2236 A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
2237 for a value that can be represented by fewer bytes, which is invalid.
2238 For example, the two bytes 0xc0, 0xae give the value 0x2e, whose cor‐
2239 rect coding uses just one byte.
2240
2241 PCRE_UTF8_ERR20
2242
2243 The two most significant bits of the first byte of a character have the
2244 binary value 0b10 (that is, the most significant bit is 1 and the sec‐
2245 ond is 0). Such a byte can only validly occur as the second or subse‐
2246 quent byte of a multi-byte character.
2247
2248 PCRE_UTF8_ERR21
2249
2250 The first byte of a character has the value 0xfe or 0xff. These values
2251 can never occur in a valid UTF-8 string.
2252
2253 PCRE_UTF8_ERR22
2254
2255 This error code was formerly used when the presence of a so-called
2256 "non-character" caused an error. Unicode corrigendum #9 makes it clear
2257 that such characters should not cause a string to be rejected, and so
2258 this code is no longer in use and is never returned.
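
The reason codes fall into a few groups, and a caller that only wants a coarse diagnosis can classify the value found in ovector[1]. The sketch below assumes that PCRE_UTF8_ERRn has the numeric value n and that, within each group, the names correspond to the cases in the order listed above. The enum and function names are invented for this example and do not appear in pcre.h.

```c
/* Broad classes of UTF-8 failure, following the groups listed above. */
enum utf8_problem {
    UTF8_TRUNCATED,           /* reasons 1-5:   detail = missing bytes    */
    UTF8_BAD_CONTINUATION,    /* reasons 6-10:  detail = which byte (2-6) */
    UTF8_TOO_LONG,            /* reasons 11-12: detail = length (5 or 6)  */
    UTF8_OUT_OF_RANGE,        /* reason 13: value above 0x10ffff          */
    UTF8_SURROGATE,           /* reason 14: 0xd800 to 0xdfff              */
    UTF8_OVERLONG,            /* reasons 15-19: detail = length (2-6)     */
    UTF8_STRAY_CONTINUATION,  /* reason 20 */
    UTF8_INVALID_LEAD,        /* reason 21: first byte 0xfe or 0xff       */
    UTF8_UNUSED,              /* reason 22: no longer returned            */
    UTF8_UNKNOWN
};

/* Classify a reason code; *detail receives the extra number described in
   the comments above, or 0 when the class has none. */
enum utf8_problem classify_utf8_reason(int reason, int *detail)
{
    *detail = 0;
    if (reason >= 1 && reason <= 5) {
        *detail = reason;
        return UTF8_TRUNCATED;
    }
    if (reason >= 6 && reason <= 10) {
        *detail = reason - 4;
        return UTF8_BAD_CONTINUATION;
    }
    if (reason >= 11 && reason <= 12) {
        *detail = reason - 6;
        return UTF8_TOO_LONG;
    }
    if (reason == 13) return UTF8_OUT_OF_RANGE;
    if (reason == 14) return UTF8_SURROGATE;
    if (reason >= 15 && reason <= 19) {
        *detail = reason - 13;
        return UTF8_OVERLONG;
    }
    if (reason == 20) return UTF8_STRAY_CONTINUATION;
    if (reason == 21) return UTF8_INVALID_LEAD;
    if (reason == 22) return UTF8_UNUSED;
    return UTF8_UNKNOWN;
}
```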
2259
2261
2262 int pcre_copy_substring(const char *subject, int *ovector,
2263 int stringcount, int stringnumber, char *buffer,
2264 int buffersize);
2265
2266 int pcre_get_substring(const char *subject, int *ovector,
2267 int stringcount, int stringnumber,
2268 const char **stringptr);
2269
2270 int pcre_get_substring_list(const char *subject,
2271 int *ovector, int stringcount, const char ***listptr);
2272
2273 Captured substrings can be accessed directly by using the offsets re‐
2274 turned by pcre_exec() in ovector. For convenience, the functions
2275 pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub‐
2276 string_list() are provided for extracting captured substrings as new,
2277 separate, zero-terminated strings. These functions identify substrings
2278 by number. The next section describes functions for extracting named
2279 substrings.
2280
2281 A substring that contains a binary zero is correctly extracted and has
2282 a further zero added on the end, but the result is not, of course, a C
2283 string. However, you can process such a string by referring to the
2284 length that is returned by pcre_copy_substring() and pcre_get_sub‐
2285 string(). Unfortunately, the interface to pcre_get_substring_list() is
2286 not adequate for handling strings containing binary zeros, because the
2287 end of the final string is not independently indicated.
2288
2289 The first three arguments are the same for all three of these func‐
2290 tions: subject is the subject string that has just been successfully
2291 matched, ovector is a pointer to the vector of integer offsets that was
2292 passed to pcre_exec(), and stringcount is the number of substrings that
2293 were captured by the match, including the substring that matched the
2294 entire regular expression. This is the value returned by pcre_exec() if
2295 it is greater than zero. If pcre_exec() returned zero, indicating that
2296 it ran out of space in ovector, the value passed as stringcount should
2297 be the number of elements in the vector divided by three.
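
The rule for choosing stringcount can be captured in a small helper. This is a sketch with an invented name; rc is the value returned by pcre_exec() and ovecsize is the size that was passed to it.

```c
/* Value to pass as stringcount to the substring extraction functions.
   Returns 0 when rc is negative, because there was no match and there
   is nothing to extract. */
int stringcount_from_rc(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;              /* normal case: use the return value */
    if (rc == 0)
        return ovecsize / 3;    /* ovector overflowed: all pairs are in use */
    return 0;
}
```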
2298
2299 The functions pcre_copy_substring() and pcre_get_substring() extract a
2300 single substring, whose number is given as stringnumber. A value of
2301 zero extracts the substring that matched the entire pattern, whereas
2302 higher values extract the captured substrings. For pcre_copy_sub‐
2303 string(), the string is placed in buffer, whose length is given by
2304 buffersize, while for pcre_get_substring() a new block of memory is ob‐
2305 tained via pcre_malloc, and its address is returned via stringptr. The
2306 yield of the function is the length of the string, not including the
2307 terminating zero, or one of these error codes:
2308
2309 PCRE_ERROR_NOMEMORY (-6)
2310
2311 The buffer was too small for pcre_copy_substring(), or the attempt to
2312 get memory failed for pcre_get_substring().
2313
2314 PCRE_ERROR_NOSUBSTRING (-7)
2315
2316 There is no substring whose number is stringnumber.
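
To make the return values concrete, here is a sketch of a function that behaves as described for pcre_copy_substring(). It is an illustration written for this page, not the library's source code, and the SKETCH_ names are placeholders that mirror the documented error values.

```c
#include <string.h>

#define SKETCH_ERROR_NOMEMORY    (-6)   /* mirrors PCRE_ERROR_NOMEMORY    */
#define SKETCH_ERROR_NOSUBSTRING (-7)   /* mirrors PCRE_ERROR_NOSUBSTRING */

/* Copy substring number stringnumber into buffer as a zero-terminated
   string. Returns its length (excluding the terminating zero) or one of
   the two error values above. */
int copy_substring_sketch(const char *subject, const int *ovector,
                          int stringcount, int stringnumber,
                          char *buffer, int buffersize)
{
    int len;

    if (stringnumber < 0 || stringnumber >= stringcount)
        return SKETCH_ERROR_NOSUBSTRING;

    len = ovector[2 * stringnumber + 1] - ovector[2 * stringnumber];
    if (buffersize < len + 1)             /* need room for the zero */
        return SKETCH_ERROR_NOMEMORY;

    if (len > 0)                          /* unset groups have length 0 */
        memcpy(buffer, subject + ovector[2 * stringnumber], (size_t)len);
    buffer[len] = '\0';
    return len;
}
```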
2317
2318 The pcre_get_substring_list() function extracts all available sub‐
2319 strings and builds a list of pointers to them. All this is done in a
2320 single block of memory that is obtained via pcre_malloc. The address of
2321 the memory block is returned via listptr, which is also the start of
2322 the list of string pointers. The end of the list is marked by a NULL
2323 pointer. The yield of the function is zero if all went well, or the er‐
2324 ror code
2325
2326 PCRE_ERROR_NOMEMORY (-6)
2327
2328 if the attempt to get the memory block failed.
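
Because the list ends with a NULL pointer, it can be walked without knowing stringcount. A minimal sketch follows; the helper name is invented, and the list is assumed to come from a successful pcre_get_substring_list() call.

```c
#include <stddef.h>

/* Count the strings in a NULL-terminated list of string pointers. */
size_t substring_list_length(const char *const *list)
{
    size_t n = 0;
    while (list[n] != NULL)
        n++;
    return n;
}
```

After use, the whole block would be released with a single call to pcre_free_substring_list().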
2329
2330 When any of these functions encounters a substring that is unset, which
2331 can happen when capturing subpattern number n+1 matches some part of
2332 the subject, but subpattern n has not been used at all, they return an
2333 empty string. This can be distinguished from a genuine zero-length sub‐
2334 string by inspecting the appropriate offset in ovector, which is nega‐
2335 tive for unset substrings.
2336
2337 The two convenience functions pcre_free_substring() and pcre_free_sub‐
2338 string_list() can be used to free the memory returned by a previous
2339 call of pcre_get_substring() or pcre_get_substring_list(), respec‐
2340 tively. They do nothing more than call the function pointed to by
2341 pcre_free, which of course could be called directly from a C program.
2342 However, PCRE is used in some situations where it is linked via a spe‐
2343 cial interface to another programming language that cannot use
2344 pcre_free directly; it is for these cases that the functions are pro‐
2345 vided.
2346
2348
2349 int pcre_get_stringnumber(const pcre *code,
2350 const char *name);
2351
2352 int pcre_copy_named_substring(const pcre *code,
2353 const char *subject, int *ovector,
2354 int stringcount, const char *stringname,
2355 char *buffer, int buffersize);
2356
2357 int pcre_get_named_substring(const pcre *code,
2358 const char *subject, int *ovector,
2359 int stringcount, const char *stringname,
2360 const char **stringptr);
2361
2362 To extract a substring by name, you first have to find the associated
2363 number. For example, for this pattern
2364
2365 (a+)b(?<xxx>\d+)...
2366
2367 the number of the subpattern called "xxx" is 2. If the name is known to
2368 be unique (PCRE_DUPNAMES was not set), you can find the number from the
2369 name by calling pcre_get_stringnumber(). The first argument is the com‐
2370 piled pattern, and the second is the name. The yield of the function is
2371 the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no
2372 subpattern of that name.
2373
2374 Given the number, you can extract the substring directly, or use one of
2375 the functions described in the previous section. For convenience, there
2376 are also two functions that do the whole job.
2377
2378 Most of the arguments of pcre_copy_named_substring() and
2379 pcre_get_named_substring() are the same as those for the similarly
2380 named functions that extract by number. As these are described in the
2381 previous section, they are not re-described here. There are just two
2382 differences:
2383
2384 First, instead of a substring number, a substring name is given. Sec‐
2385 ond, there is an extra argument, given at the start, which is a pointer
2386 to the compiled pattern. This is needed in order to gain access to the
2387 name-to-number translation table.
2388
2389 These functions call pcre_get_stringnumber(), and if it succeeds, they
2390 then call pcre_copy_substring() or pcre_get_substring(), as appropri‐
2391 ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the
2392 behaviour may not be what you want (see the next section).
2393
2394 Warning: If the pattern uses the (?| feature to set up multiple subpat‐
2395 terns with the same number, as described in the section on duplicate
2396 subpattern numbers in the pcrepattern page, you cannot use names to
2397 distinguish the different subpatterns, because names are not included
2398 in the compiled code. The matching process uses only numbers. For this
2399 reason, the use of different names for subpatterns of the same number
2400 causes an error at compile time.
2401
2403
2404 int pcre_get_stringtable_entries(const pcre *code,
2405 const char *name, char **first, char **last);
2406
2407 When a pattern is compiled with the PCRE_DUPNAMES option, names for
2408 subpatterns are not required to be unique. (Duplicate names are always
2409 allowed for subpatterns with the same number, created by using the (?|
2410 feature. Indeed, if such subpatterns are named, they are required to
2411 use the same names.)
2412
2413 Normally, patterns with duplicate names are such that in any one match,
2414 only one of the named subpatterns participates. An example is shown in
2415 the pcrepattern documentation.
2416
2417 When duplicates are present, pcre_copy_named_substring() and
2418 pcre_get_named_substring() return the first substring corresponding to
2419 the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING
2420 (-7) is returned; no data is returned. The pcre_get_stringnumber()
2421 function returns one of the numbers that are associated with the name,
2422 but it is not defined which it is.
2423
2424 If you want to get full details of all captured substrings for a given
2425 name, you must use the pcre_get_stringtable_entries() function. The
2426 first argument is the compiled pattern, and the second is the name. The
2427 third and fourth are pointers to variables which are updated by the
2428 function. After it has run, they point to the first and last entries in
2429 the name-to-number table for the given name. The function itself re‐
2430 turns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if there
2430 are none. The format of the table is described above in the section
2431 entitled "Information about a pattern". Given all the relevant entries
2432 for the name, you can extract each of their numbers, and hence
2434 the captured data, if any.
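
The following sketch shows one way to walk the entries returned for a name and collect their subpattern numbers. It assumes the 8-bit library's table layout, in which each entry starts with the subpattern number in two bytes, most significant byte first, followed by the zero-terminated name. The function name is invented for this example.

```c
/* Collect subpattern numbers from the name-table entries first..last
   (inclusive), which are entrysize bytes apart. At most max numbers are
   stored; the count stored is returned. */
int collect_group_numbers(const char *first, const char *last,
                          int entrysize, int *numbers, int max)
{
    const unsigned char *p = (const unsigned char *)first;
    const unsigned char *end = (const unsigned char *)last;
    int count = 0;

    if (entrysize <= 0)
        return 0;
    for (; p <= end && count < max; p += entrysize)
        numbers[count++] = (p[0] << 8) | p[1];   /* big-endian number */
    return count;
}
```

Here first and last are the pointers set by pcre_get_stringtable_entries() and entrysize is its return value. Each collected number can then be checked against ovector to see whether that subpattern was set.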
2435
2437
2438 The traditional matching function uses a similar algorithm to Perl,
2439 which stops when it finds the first match, starting at a given point in
2440 the subject. If you want to find all possible matches, or the longest
2441 possible match, consider using the alternative matching function (see
2442 below) instead. If you cannot use the alternative function, but still
2443 need to find all possible matches, you can kludge it up by making use
2444 of the callout facility, which is described in the pcrecallout documen‐
2445 tation.
2446
2447 What you have to do is to insert a callout right at the end of the pat‐
2448 tern. When your callout function is called, extract and save the cur‐
2449 rent matched substring. Then return 1, which forces pcre_exec() to
2450 backtrack and try other alternatives. Ultimately, when it runs out of
2451 matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
2452
2454
2455 Matching certain patterns using pcre_exec() can use a lot of process
2456 stack, which in certain environments can be rather limited in size.
2457 Some users find it helpful to have an estimate of the amount of stack
2458 that is used by pcre_exec(), to help them set recursion limits, as de‐
2459 scribed in the pcrestack documentation. The estimate that is output by
2460 pcretest when called with the -m and -C options is obtained by calling
2461 pcre_exec() with the values NULL, NULL, NULL, -999, and -999 for its
2462 first five arguments.
2463
2464 Normally, if its first argument is NULL, pcre_exec() immediately re‐
2465 turns the negative error code PCRE_ERROR_NULL, but with this special
2466 combination of arguments, it returns instead a negative number whose
2467 absolute value is the approximate stack frame size in bytes. (A nega‐
2468 tive number is used so that it is clear that no match has happened.)
2469 The value is approximate because in some cases, recursive calls to
2470 pcre_exec() occur when there are one or two additional variables on the
2471 stack.
2472
2473 If PCRE has been compiled to use the heap instead of the stack for re‐
2474 cursion, the value returned is the size of each block that is obtained
2475 from the heap.
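
As a rough illustration of how the estimate might be used, the sketch below converts the special return value into a frame size and divides an assumed stack budget by it. Both function names are invented, the result is only an approximation, and some headroom should be left for the rest of the program.

```c
/* rc is the value returned by the special call described above; its
   absolute value is the approximate frame size in bytes. */
long frame_size_from_rc(int rc)
{
    return rc < 0 ? -(long)rc : 0;
}

/* Approximate number of recursion levels that fit in stack_bytes. */
long max_recursion_depth(long stack_bytes, long frame_bytes)
{
    if (frame_bytes <= 0)
        return 0;
    return stack_bytes / frame_bytes;
}
```

A value derived this way could serve as a starting point for the match_limit_recursion field in a pcre_extra block.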
2476
2478
2479 int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
2480 const char *subject, int length, int startoffset,
2481 int options, int *ovector, int ovecsize,
2482 int *workspace, int wscount);
2483
2484 The function pcre_dfa_exec() is called to match a subject string
2485 against a compiled pattern, using a matching algorithm that scans the
2486 subject string just once, and does not backtrack. This has different
2487 characteristics to the normal algorithm, and is not compatible with
2488 Perl. Some of the features of PCRE patterns are not supported. Never‐
2489 theless, there are times when this kind of matching can be useful. For
2490 a discussion of the two matching algorithms, and a list of features
2491 that pcre_dfa_exec() does not support, see the pcrematching documenta‐
2492 tion.
2493
2494 The arguments for the pcre_dfa_exec() function are the same as for
2495 pcre_exec(), plus two extras. The ovector argument is used in a differ‐
2496 ent way, and this is described below. The other common arguments are
2497 used in the same way as for pcre_exec(), so their description is not
2498 repeated here.
2499
2500 The two additional arguments provide workspace for the function. The
2501 workspace vector should contain at least 20 elements. It is used for
2502 keeping track of multiple paths through the pattern tree. More
2503 workspace will be needed for patterns and subjects where there are a
2504 lot of potential matches.
2505
2506 Here is an example of a simple call to pcre_dfa_exec():
2507
2508 int rc;
2509 int ovector[10];
2510 int wspace[20];
2511 rc = pcre_dfa_exec(
2512 re, /* result of pcre_compile() */
2513 NULL, /* we didn't study the pattern */
2514 "some string", /* the subject string */
2515 11, /* the length of the subject string */
2516 0, /* start at offset 0 in the subject */
2517 0, /* default options */
2518 ovector, /* vector of integers for substring information */
2519 10, /* number of elements (NOT size in bytes) */
2520 wspace, /* working space vector */
2521 20); /* number of elements (NOT size in bytes) */
2522
2523 Option bits for pcre_dfa_exec()
2524
2525 The unused bits of the options argument for pcre_dfa_exec() must be
2526 zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW‐
2527 LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_AT‐
2528 START, PCRE_NO_UTF8_CHECK, PCRE_BSR_ANYCRLF, PCRE_BSR_UNICODE,
2529 PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD, PCRE_PARTIAL_SOFT,
2530 PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last four of
2531 these are exactly the same as for pcre_exec(), so their description is
2532 not repeated here.
2533
2534 PCRE_PARTIAL_HARD
2535 PCRE_PARTIAL_SOFT
2536
2537 These have the same general effect as they do for pcre_exec(), but the
2538 details are slightly different. When PCRE_PARTIAL_HARD is set for
2539 pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of the sub‐
2540 ject is reached and there is still at least one matching possibility
2541 that requires additional characters. This happens even if some complete
2542 matches have also been found. When PCRE_PARTIAL_SOFT is set, the return
2543 code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
2544 of the subject is reached, there have been no complete matches, but
2545 there is still at least one matching possibility. The portion of the
2546 string that was inspected when the longest partial match was found is
2547 set as the first matching string in both cases. There is a more de‐
2548 tailed discussion of partial and multi-segment matching, with examples,
2549 in the pcrepartial documentation.
2550
2551 PCRE_DFA_SHORTEST
2552
2553 Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to
2554 stop as soon as it has found one match. Because of the way the alterna‐
2555 tive algorithm works, this is necessarily the shortest possible match
2556 at the first possible matching point in the subject string.
2557
2558 PCRE_DFA_RESTART
2559
2560 When pcre_dfa_exec() returns a partial match, it is possible to call it
2561 again, with additional subject characters, and have it continue with
2562 the same match. The PCRE_DFA_RESTART option requests this action; when
2563 it is set, the workspace and wscount arguments must reference the same
2564 vector as before because data about the match so far is left in them
2565 after a partial match. There is more discussion of this facility in the
2566 pcrepartial documentation.
2567
2568 Successful returns from pcre_dfa_exec()
2569
2570 When pcre_dfa_exec() succeeds, it may have matched more than one sub‐
2571 string in the subject. Note, however, that all the matches from one run
2572 of the function start at the same point in the subject. The shorter
2573 matches are all initial substrings of the longer matches. For example,
2574 if the pattern
2575
2576 <.*>
2577
2578 is matched against the string
2579
2580 This is <something> <something else> <something further> no more
2581
2582 the three matched strings are
2583
2584 <something>
2585 <something> <something else>
2586 <something> <something else> <something further>
2587
2588 On success, the yield of the function is a number greater than zero,
2589 which is the number of matched substrings. The substrings themselves
2590 are returned in ovector. Each string uses two elements; the first is
2591 the offset to the start, and the second is the offset to the end. In
2592 fact, all the strings have the same start offset. (Space could have
2593 been saved by giving this only once, but it was decided to retain some
2594 compatibility with the way pcre_exec() returns data, even though the
2595 meaning of the strings is different.)
2596
2597 The strings are returned in reverse order of length; that is, the long‐
2598 est matching string is given first. If there were too many matches to
2599 fit into ovector, the yield of the function is zero, and the vector is
2600 filled with the longest matches. Unlike pcre_exec(), pcre_dfa_exec()
2601 can use the entire ovector for returning matched strings.
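
The layout of the results can be illustrated with two small helpers. This is a sketch with invented names. The count of ovecsize/2 for a zero return is an inference from the statement that the entire vector is available for pairs.

```c
/* Number of matches available in ovector after pcre_dfa_exec():
   rc when it is positive, or every pair in the vector when rc is 0
   (too many matches to fit). Negative rc means no matches. */
int dfa_match_count(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;
    if (rc == 0)
        return ovecsize / 2;
    return 0;
}

/* Length of match number i, where i = 0 is the longest match. */
int dfa_match_length(const int *ovector, int i)
{
    return ovector[2 * i + 1] - ovector[2 * i];
}
```

For the <.*> example above, all three matches start at offset 8, so the vector would hold the pairs (8,56), (8,36) and (8,19), giving lengths 48, 28 and 11.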
2602
2603 NOTE: PCRE's "auto-possessification" optimization usually applies to
2604 character repeats at the end of a pattern (as well as internally). For
2605 example, the pattern "a\d+" is compiled as if it were "a\d++" because
2606 there is no point even considering the possibility of backtracking into
2607 the repeated digits. For DFA matching, this means that only one possi‐
2608 ble match is found. If you really do want multiple matches in such
2609 cases, either use an ungreedy repeat ("a\d+?") or set the
2610 PCRE_NO_AUTO_POSSESS option when compiling.
2611
2612 Error returns from pcre_dfa_exec()
2613
2614 The pcre_dfa_exec() function returns a negative number when it fails.
2615 Many of the errors are the same as for pcre_exec(), and these are de‐
2616 scribed above. There are in addition the following errors that are
2617 specific to pcre_dfa_exec():
2618
2619 PCRE_ERROR_DFA_UITEM (-16)
2620
2621 This return is given if pcre_dfa_exec() encounters an item in the pat‐
2622 tern that it does not support, for instance, the use of \C or a back
2623 reference.
2624
2625 PCRE_ERROR_DFA_UCOND (-17)
2626
2627 This return is given if pcre_dfa_exec() encounters a condition item
2628 that uses a back reference for the condition, or a test for recursion
2629 in a specific group. These are not supported.
2630
2631 PCRE_ERROR_DFA_UMLIMIT (-18)
2632
2633 This return is given if pcre_dfa_exec() is called with an extra block
2634 that contains a setting of the match_limit or match_limit_recursion
2635 fields. This is not supported (these fields are meaningless for DFA
2636 matching).
2637
2638 PCRE_ERROR_DFA_WSSIZE (-19)
2639
2640 This return is given if pcre_dfa_exec() runs out of space in the
2641 workspace vector.
2642
2643 PCRE_ERROR_DFA_RECURSE (-20)
2644
2645 When a recursive subpattern is processed, the matching function calls
2646 itself recursively, using private vectors for ovector and workspace.
2647 This error is given if the output vector is not large enough. This
2648 should be extremely rare, as a vector of size 1000 is used.
2649
2650 PCRE_ERROR_DFA_BADRESTART (-30)
2651
2652 When pcre_dfa_exec() is called with the PCRE_DFA_RESTART option, some
2653 plausibility checks are made on the contents of the workspace, which
2654 should contain data about the previous partial match. If any of these
2655 checks fail, this error is given.
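
As a rough illustration of how a caller might report these codes, the sketch below maps each one to a short description. The function names dfa_error_text() and dfa_retry_with_larger_workspace() are hypothetical. The fallback definitions exist only so that the fragment compiles without <pcre.h>, and they use the numeric values listed above. Treating PCRE_ERROR_DFA_WSSIZE as the only error worth retrying with a larger workspace is an assumption, not documented behaviour.

```c
/* Fallback definitions so this sketch compiles without <pcre.h>.
   The values are the ones given in the text above. */
#ifndef PCRE_ERROR_DFA_UITEM
#define PCRE_ERROR_DFA_UITEM      (-16)
#define PCRE_ERROR_DFA_UCOND      (-17)
#define PCRE_ERROR_DFA_UMLIMIT    (-18)
#define PCRE_ERROR_DFA_WSSIZE     (-19)
#define PCRE_ERROR_DFA_RECURSE    (-20)
#define PCRE_ERROR_DFA_BADRESTART (-30)
#endif

/* Hypothetical helper: describe a negative return from pcre_dfa_exec().
   Codes that are not specific to DFA matching get a generic message. */
static const char *dfa_error_text(int rc)
{
    switch (rc) {
    case PCRE_ERROR_DFA_UITEM:
        return "pattern item not supported by DFA matching "
               "(for instance \\C or a back reference)";
    case PCRE_ERROR_DFA_UCOND:
        return "condition uses a back reference or a test for "
               "recursion in a specific group";
    case PCRE_ERROR_DFA_UMLIMIT:
        return "extra block sets match_limit or match_limit_recursion";
    case PCRE_ERROR_DFA_WSSIZE:
        return "workspace vector ran out of space";
    case PCRE_ERROR_DFA_RECURSE:
        return "output vector too small for a recursive subpattern";
    case PCRE_ERROR_DFA_BADRESTART:
        return "workspace failed the PCRE_DFA_RESTART plausibility checks";
    default:
        return "not DFA-specific; see the pcre_exec() error list";
    }
}

/* Hypothetical policy: only a workspace overflow is assumed to be
   curable by repeating the call with a larger workspace vector. */
static int dfa_retry_with_larger_workspace(int rc)
{
    return rc == PCRE_ERROR_DFA_WSSIZE;
}
```

A caller could check dfa_retry_with_larger_workspace(rc) first, enlarge the workspace and repeat the call if it returns 1, and otherwise log dfa_error_text(rc).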
2656
2657 SEE ALSO
2658
2659 pcre16(3), pcre32(3), pcrebuild(3), pcrecallout(3), pcrecpp(3),
2660 pcrematching(3), pcrepartial(3), pcreposix(3), pcreprecompile(3), pcre‐
2661 sample(3), pcrestack(3).
2662
2663 AUTHOR
2664
2665 Philip Hazel
2666 University Computing Service
2667 Cambridge CB2 3QH, England.
2668
2669 REVISION
2670
2671 Last updated: 18 December 2015
2672 Copyright (c) 1997-2015 University of Cambridge.
2673
2674
2675
2676PCRE 8.39 18 December 2015 PCREAPI(3)