PERLREAPI(1)           Perl Programmers Reference Guide           PERLREAPI(1)

NAME

6       perlreapi - Perl regular expression plugin interface
7

DESCRIPTION

9       As of Perl 5.9.5 there is a new interface for plugging and using
10       regular expression engines other than the default one.
11
12       Each engine is supposed to provide access to a constant structure of
13       the following format:
14
15           typedef struct regexp_engine {
16               REGEXP* (*comp) (pTHX_
17                                const SV * const pattern, const U32 flags);
18               I32     (*exec) (pTHX_
19                                REGEXP * const rx,
20                                char* stringarg,
21                                char* strend, char* strbeg,
22                                SSize_t minend, SV* sv,
23                                void* data, U32 flags);
24               char*   (*intuit) (pTHX_
25                                  REGEXP * const rx, SV *sv,
26                                  const char * const strbeg,
27                                  char *strpos, char *strend, U32 flags,
28                                  struct re_scream_pos_data_s *data);
29               SV*     (*checkstr) (pTHX_ REGEXP * const rx);
30               void    (*free) (pTHX_ REGEXP * const rx);
31               void    (*numbered_buff_FETCH) (pTHX_
32                                               REGEXP * const rx,
33                                               const I32 paren,
34                                               SV * const sv);
35               void    (*numbered_buff_STORE) (pTHX_
36                                               REGEXP * const rx,
37                                               const I32 paren,
38                                               SV const * const value);
39               I32     (*numbered_buff_LENGTH) (pTHX_
40                                                REGEXP * const rx,
41                                                const SV * const sv,
42                                                const I32 paren);
43               SV*     (*named_buff) (pTHX_
44                                      REGEXP * const rx,
45                                      SV * const key,
46                                      SV * const value,
47                                      U32 flags);
48               SV*     (*named_buff_iter) (pTHX_
49                                           REGEXP * const rx,
50                                           const SV * const lastkey,
51                                           const U32 flags);
52               SV*     (*qr_package)(pTHX_ REGEXP * const rx);
           #ifdef USE_ITHREADS
               void*   (*dupe) (pTHX_ REGEXP * const rx, CLONE_PARAMS *param);
           #endif
               REGEXP* (*op_comp) (...);
           } regexp_engine;
57
58       When a regexp is compiled, its "engine" field is then set to point at
59       the appropriate structure, so that when it needs to be used Perl can
60       find the right routines to do so.
61
       In order to install a new regexp handler, $^H{regcomp} is set to an
       integer which (when cast appropriately) resolves to one of these
       structures.  When compiling, the "comp" method is executed, and the
       resulting "regexp" structure's engine field is expected to point back
       at the same structure.
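
       The contract described above (the "comp" routine stamps the engine
       pointer into the regexp, and every later call is dispatched through
       it) can be modelled in plain C without the Perl headers.  The sketch
       below is only an illustration: "toy_engine", "toy_regexp" and the
       "literal_*" functions are hypothetical stand-ins, not part of the Perl
       API, and the "engine" matches literal substrings only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for regexp_engine and REGEXP. */
struct toy_regexp;

typedef struct toy_engine {
    struct toy_regexp *(*comp)(const char *pattern);
    int  (*exec)(struct toy_regexp *rx, const char *subject);
    void (*free_rx)(struct toy_regexp *rx);
} toy_engine;

typedef struct toy_regexp {
    const toy_engine *engine;  /* which engine compiled this pattern */
    char *pprivate;            /* engine-private data: a copy of the pattern */
} toy_regexp;

static toy_regexp *literal_comp(const char *pattern);
static int literal_exec(toy_regexp *rx, const char *subject);
static void literal_free(toy_regexp *rx);

/* The constant structure an engine exposes. */
static const toy_engine literal_engine = {
    literal_comp, literal_exec, literal_free
};

static toy_regexp *literal_comp(const char *pattern)
{
    size_t len = strlen(pattern);
    toy_regexp *rx = malloc(sizeof *rx);
    if (rx == NULL)
        return NULL;
    rx->pprivate = malloc(len + 1);
    if (rx->pprivate == NULL) {
        free(rx);
        return NULL;
    }
    memcpy(rx->pprivate, pattern, len + 1);
    rx->engine = &literal_engine;  /* point back at the compiling engine */
    return rx;
}

static int literal_exec(toy_regexp *rx, const char *subject)
{
    return strstr(subject, rx->pprivate) != NULL;
}

/* Simplification: the real free callback releases only pprivate; this
 * toy version also frees the struct itself. */
static void literal_free(toy_regexp *rx)
{
    free(rx->pprivate);
    free(rx);
}

/* The caller's side: dispatch through rx->engine, never by name. */
static int toy_match(toy_regexp *rx, const char *subject)
{
    assert(rx != NULL && rx->engine != NULL);
    return rx->engine->exec(rx, subject);
}
```

       A caller would obtain a regexp with literal_engine.comp("ook") and
       then use only toy_match() and rx->engine->free_rx(), which is the
       reason the engine field has to point back at the right structure.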
67
68       The pTHX_ symbol in the definition is a macro used by Perl under
69       threading to provide an extra argument to the routine holding a pointer
70       back to the interpreter that is executing the regexp. So under
71       threading all routines get an extra argument.
72

Callbacks

74   comp
75           REGEXP* comp(pTHX_ const SV * const pattern, const U32 flags);
76
77       Compile the pattern stored in "pattern" using the given "flags" and
78       return a pointer to a prepared "REGEXP" structure that can perform the
79       match.  See "The REGEXP structure" below for an explanation of the
80       individual fields in the REGEXP struct.
81
82       The "pattern" parameter is the scalar that was used as the pattern.
83       Previous versions of Perl would pass two "char*" indicating the start
84       and end of the stringified pattern; the following snippet can be used
85       to get the old parameters:
86
87           STRLEN plen;
88           char*  exp = SvPV(pattern, plen);
89           char* xend = exp + plen;
90
       Since any scalar can be passed as a pattern, it's possible to implement
       an engine that does something with an array ("ook" =~ [ qw/ eek hlagh
       / ]) or with the non-stringified form of a compiled regular expression
       ("ook" =~ qr/eek/).  Perl's own engine will always stringify
       everything using the snippet above, but that doesn't mean other engines
       have to.
97
98       The "flags" parameter is a bitfield which indicates which of the
99       "msixpn" flags the regex was compiled with.  It also contains
100       additional info, such as if "use locale" is in effect.
101
102       The "eogc" flags are stripped out before being passed to the comp
103       routine.  The regex engine does not need to know if any of these are
104       set, as those flags should only affect what Perl does with the pattern
105       and its match variables, not how it gets compiled and executed.
106
107       By the time the comp callback is called, some of these flags have
108       already had effect (noted below where applicable).  However most of
109       their effect occurs after the comp callback has run, in routines that
110       read the "rx->extflags" field which it populates.
111
112       In general the flags should be preserved in "rx->extflags" after
113       compilation, although the regex engine might want to add or delete some
114       of them to invoke or disable some special behavior in Perl.  The flags
115       along with any special behavior they cause are documented below:
116
117       The pattern modifiers:
118
119       "/m" - RXf_PMf_MULTILINE
120           If this is in "rx->extflags" it will be passed to "Perl_fbm_instr"
121           by "pp_split" which will treat the subject string as a multi-line
122           string.
123
124       "/s" - RXf_PMf_SINGLELINE
125       "/i" - RXf_PMf_FOLD
126       "/x" - RXf_PMf_EXTENDED
127           If present on a regex, "#" comments will be handled differently by
128           the tokenizer in some cases.
129
130           TODO: Document those cases.
131
132       "/p" - RXf_PMf_KEEPCOPY
133           TODO: Document this
134
135       Character set
136           The character set rules are determined by an enum that is contained
137           in this field.  This is still experimental and subject to change,
138           but the current interface returns the rules by use of the in-line
139           function "get_regex_charset(const U32 flags)".  The only currently
140           documented value returned from it is REGEX_LOCALE_CHARSET, which is
141           set if "use locale" is in effect. If present in "rx->extflags",
142           "split" will use the locale dependent definition of whitespace when
143           RXf_SKIPWHITE or RXf_WHITE is in effect.  ASCII whitespace is
144           defined as per isSPACE, and by the internal macros "is_utf8_space"
145           under UTF-8, and "isSPACE_LC" under "use locale".
146
147       Additional flags:
148
149       RXf_SPLIT
150           This flag was removed in perl 5.18.0.  "split ' '" is now special-
151           cased solely in the parser.  RXf_SPLIT is still #defined, so you
152           can test for it.  This is how it used to work:
153
154           If "split" is invoked as "split ' '" or with no arguments (which
155           really means "split(' ', $_)", see split), Perl will set this flag.
156           The regex engine can then check for it and set the SKIPWHITE and
157           WHITE extflags.  To do this, the Perl engine does:
158
159               if (flags & RXf_SPLIT && r->prelen == 1 && r->precomp[0] == ' ')
160                   r->extflags |= (RXf_SKIPWHITE|RXf_WHITE);
161
162       These flags can be set during compilation to enable optimizations in
163       the "split" operator.
164
165       RXf_SKIPWHITE
166           This flag was removed in perl 5.18.0.  It is still #defined, so you
167           can set it, but doing so will have no effect.  This is how it used
168           to work:
169
170           If the flag is present in "rx->extflags" "split" will delete
171           whitespace from the start of the subject string before it's
172           operated on.  What is considered whitespace depends on if the
173           subject is a UTF-8 string and if the "RXf_PMf_LOCALE" flag is set.
174
           If RXf_WHITE is set in addition to this flag, "split" will behave
           like "split ' '" under the Perl engine.
177
178       RXf_START_ONLY
179           Tells the split operator to split the target string on newlines
180           ("\n") without invoking the regex engine.
181
182           Perl's engine sets this if the pattern is "/^/" ("plen == 1 && *exp
183           == '^'"), even under "/^/s"; see split.  Of course a different
184           regex engine might want to use the same optimizations with a
185           different syntax.
186
187       RXf_WHITE
188           Tells the split operator to split the target string on whitespace
189           without invoking the regex engine.  The definition of whitespace
190           varies depending on if the target string is a UTF-8 string and on
191           if RXf_PMf_LOCALE is set.
192
193           Perl's engine sets this flag if the pattern is "\s+".
194
195       RXf_NULL
196           Tells the split operator to split the target string on characters.
197           The definition of character varies depending on if the target
198           string is a UTF-8 string.
199
           Perl's engine sets this flag on empty patterns.  This optimization
           makes "split //" much faster than it would otherwise be.  It's even
           faster than "unpack".
203
204       RXf_NO_INPLACE_SUBST
205           Added in perl 5.18.0, this flag indicates that a regular expression
206           might perform an operation that would interfere with inplace
207           substitution. For instance it might contain lookbehind, or assign
208           to non-magical variables (such as $REGMARK and $REGERROR) during
209           matching.  "s///" will skip certain optimisations when this is set.
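
       The pattern checks that lead Perl's engine to set the split-related
       flags above (an empty pattern, a lone "^", and "\s+") can be sketched
       as a small classifier.  This is a standalone model: the "TOY_*" bit
       values and the function name are assumptions made for illustration,
       and a real engine would set the corresponding RXf_* bits from
       regexp.h in "rx->extflags" instead.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical bit values for illustration only; the real RXf_*
 * constants live in regexp.h and differ between Perl versions. */
#define TOY_START_ONLY 0x1u
#define TOY_WHITE      0x2u
#define TOY_NULL       0x4u

/* Decide which split optimisation, if any, applies to a pattern given
 * as a pointer and a byte length (as obtained from SvPV in comp). */
static unsigned toy_split_flags(const char *exp, size_t plen)
{
    assert(exp != NULL);
    if (plen == 0)
        return TOY_NULL;                 /* split //    */
    if (plen == 1 && exp[0] == '^')
        return TOY_START_ONLY;           /* split /^/   */
    if (plen == 3 && memcmp(exp, "\\s+", 3) == 0)
        return TOY_WHITE;                /* split /\s+/ */
    return 0;                            /* no shortcut; run the engine */
}
```

       An engine with a different pattern syntax could apply the same idea
       with its own tests, as long as the resulting flags mean the same thing
       to "split".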
210
211   exec
212           I32 exec(pTHX_ REGEXP * const rx,
213                    char *stringarg, char* strend, char* strbeg,
214                    SSize_t minend, SV* sv,
215                    void* data, U32 flags);
216
217       Execute a regexp. The arguments are
218
219       rx  The regular expression to execute.
220
221       sv  This is the SV to be matched against.  Note that the actual char
222           array to be matched against is supplied by the arguments described
223           below; the SV is just used to determine UTF8ness, "pos()" etc.
224
225       strbeg
226           Pointer to the physical start of the string.
227
228       strend
229           Pointer to the character following the physical end of the string
230           (i.e.  the "\0", if any).
231
232       stringarg
233           Pointer to the position in the string where matching should start;
234           it might not be equal to "strbeg" (for example in a later iteration
235           of "/.../g").
236
237       minend
238           Minimum length of string (measured in bytes from "stringarg") that
239           must match; if the engine reaches the end of the match but hasn't
240           reached this position in the string, it should fail.
241
242       data
243           Optimisation data; subject to change.
244
245       flags
246           Optimisation flags; subject to change.
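
       The "minend" rule can be read as a simple pointer comparison.  The
       helper below is a minimal sketch with a hypothetical name; it assumes
       that "match_end" points just past the last byte of a candidate match
       and that both pointers lie within the same subject string.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if a match ending at match_end extends at least minend
 * bytes past stringarg, and 0 if the engine should reject it. */
static int toy_minend_ok(const char *stringarg, const char *match_end,
                         ptrdiff_t minend)
{
    assert(stringarg != NULL && match_end >= stringarg);
    return (match_end - stringarg) >= minend;
}
```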
247
248   intuit
249           char* intuit(pTHX_
250                       REGEXP * const rx,
251                       SV *sv,
252                       const char * const strbeg,
253                       char *strpos,
254                       char *strend,
255                       const U32 flags,
256                       struct re_scream_pos_data_s *data);
257
258       Find the start position where a regex match should be attempted, or
259       possibly if the regex engine should not be run because the pattern
260       can't match.  This is called, as appropriate, by the core, depending on
261       the values of the "extflags" member of the "regexp" structure.
262
263       Arguments:
264
265           rx:     the regex to match against
266           sv:     the SV being matched: only used for utf8 flag; the string
267                   itself is accessed via the pointers below. Note that on
268                   something like an overloaded SV, SvPOK(sv) may be false
269                   and the string pointers may point to something unrelated to
270                   the SV itself.
271           strbeg: real beginning of string
272           strpos: the point in the string at which to begin matching
273           strend: pointer to the byte following the last char of the string
           flags:  currently unused; set to 0
275           data:   currently unused; set to NULL
276
277   checkstr
278           SV* checkstr(pTHX_ REGEXP * const rx);
279
280       Return a SV containing a string that must appear in the pattern. Used
281       by "split" for optimising matches.
282
283   free
284           void free(pTHX_ REGEXP * const rx);
285
286       Called by Perl when it is freeing a regexp pattern so that the engine
287       can release any resources pointed to by the "pprivate" member of the
288       "regexp" structure.  This is only responsible for freeing private data;
289       Perl will handle releasing anything else contained in the "regexp"
290       structure.
291
292   Numbered capture callbacks
293       Called to get/set the value of "$`", "$'", $& and their named
294       equivalents, ${^PREMATCH}, ${^POSTMATCH} and ${^MATCH}, as well as the
295       numbered capture groups ($1, $2, ...).
296
297       The "paren" parameter will be 1 for $1, 2 for $2 and so forth, and have
298       these symbolic values for the special variables:
299
300           ${^PREMATCH}  RX_BUFF_IDX_CARET_PREMATCH
301           ${^POSTMATCH} RX_BUFF_IDX_CARET_POSTMATCH
302           ${^MATCH}     RX_BUFF_IDX_CARET_FULLMATCH
303           $`            RX_BUFF_IDX_PREMATCH
304           $'            RX_BUFF_IDX_POSTMATCH
305           $&            RX_BUFF_IDX_FULLMATCH
306
307       Note that in Perl 5.17.3 and earlier, the last three constants were
308       also used for the caret variants of the variables.
309
       The names have been chosen by analogy with Tie::Scalar method names,
       with an additional LENGTH callback for efficiency.  However, named
       capture variables are currently not tied internally but implemented via
       magic.
314
315       numbered_buff_FETCH
316
317           void numbered_buff_FETCH(pTHX_ REGEXP * const rx, const I32 paren,
318                                    SV * const sv);
319
       Fetch a specified numbered capture.  "sv" should be set to the scalar
       to return.  The scalar is passed as an argument rather than being
       returned from the function because Perl already has a scalar to store
       the value when the callback is invoked; creating another one would be
       redundant.  The scalar can be set with "sv_setsv", "sv_setpvn" and
       friends; see perlapi.
326
327       This callback is where Perl untaints its own capture variables under
328       taint mode (see perlsec).  See the "Perl_reg_numbered_buff_fetch"
329       function in regcomp.c for how to untaint capture variables if that's
330       something you'd like your engine to do as well.
331
332       numbered_buff_STORE
333
334           void    (*numbered_buff_STORE) (pTHX_
335                                           REGEXP * const rx,
336                                           const I32 paren,
337                                           SV const * const value);
338
339       Set the value of a numbered capture variable.  "value" is the scalar
340       that is to be used as the new value.  It's up to the engine to make
341       sure this is used as the new value (or reject it).
342
343       Example:
344
345           if ("ook" =~ /(o*)/) {
346               # 'paren' will be '1' and 'value' will be 'ee'
347               $1 =~ tr/o/e/;
348           }
349
       Perl's own engine will croak on any attempt to modify the capture
       variables.  To do this in another engine, use the following callback
       (copied from "Perl_reg_numbered_buff_store"):
353
354           void
355           Example_reg_numbered_buff_store(pTHX_
356                                           REGEXP * const rx,
357                                           const I32 paren,
358                                           SV const * const value)
359           {
360               PERL_UNUSED_ARG(rx);
361               PERL_UNUSED_ARG(paren);
362               PERL_UNUSED_ARG(value);
363
364               if (!PL_localizing)
365                   Perl_croak(aTHX_ PL_no_modify);
366           }
367
368       Actually Perl will not always croak in a statement that looks like it
369       would modify a numbered capture variable.  This is because the STORE
370       callback will not be called if Perl can determine that it doesn't have
371       to modify the value.  This is exactly how tied variables behave in the
372       same situation:
373
374           package CaptureVar;
375           use parent 'Tie::Scalar';
376
377           sub TIESCALAR { bless [] }
378           sub FETCH { undef }
379           sub STORE { die "This doesn't get called" }
380
381           package main;
382
383           tie my $sv => "CaptureVar";
384           $sv =~ y/a/b/;
385
386       Because $sv is "undef" when the "y///" operator is applied to it, the
387       transliteration won't actually execute and the program won't "die".
388       This is different to how 5.8 and earlier versions behaved since the
389       capture variables were READONLY variables then; now they'll just die
390       when assigned to in the default engine.
391
392       numbered_buff_LENGTH
393
394           I32 numbered_buff_LENGTH (pTHX_
395                                     REGEXP * const rx,
396                                     const SV * const sv,
397                                     const I32 paren);
398
       Get the "length" of a capture variable.  There's a special callback for
       this so that Perl doesn't have to do a FETCH and run "length" on the
       result.  Since the length is (in Perl's case) known from the offsets
       stored in "rx->offs", this is much more efficient:

           I32 s1  = rx->offs[paren].start;
           I32 s2  = rx->offs[paren].end;
           I32 len = s2 - s1;
407
408       This is a little bit more complex in the case of UTF-8, see what
409       "Perl_reg_numbered_buff_length" does with is_utf8_string_loclen.
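
       The offset arithmetic, including the unmatched case and the UTF-8
       complication, can be sketched without the Perl headers.  Everything
       named "toy_*" below is hypothetical.  The character count assumes the
       buffer holds valid UTF-8 and simply skips continuation bytes; it is a
       stand-in for illustration, not what "Perl_reg_numbered_buff_length"
       actually does.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the shape of regexp_paren_pair; -1 means "did not match". */
typedef struct {
    long start;
    long end;
} toy_paren_pair;

/* Byte length of capture group `paren`, or -1 if it did not match. */
static long toy_capture_bytes(const toy_paren_pair *offs, int paren)
{
    assert(offs != NULL && paren >= 0);
    long s1 = offs[paren].start;
    long s2 = offs[paren].end;
    if (s1 == -1 || s2 == -1)
        return -1;
    return s2 - s1;
}

/* Character length, assuming valid UTF-8: count every byte that is not
 * a continuation byte (bit pattern 10xxxxxx). */
static long toy_capture_chars(const char *subbeg,
                              const toy_paren_pair *offs, int paren)
{
    long n = 0;
    if (toy_capture_bytes(offs, paren) < 0)
        return -1;
    for (long i = offs[paren].start; i < offs[paren].end; i++) {
        if (((unsigned char)subbeg[i] & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```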
410
411   Named capture callbacks
412       Called to get/set the value of "%+" and "%-", as well as by some
413       utility functions in re.
414
415       There are two callbacks, "named_buff" is called in all the cases the
416       FETCH, STORE, DELETE, CLEAR, EXISTS and SCALAR Tie::Hash callbacks
417       would be on changes to "%+" and "%-" and "named_buff_iter" in the same
418       cases as FIRSTKEY and NEXTKEY.
419
420       The "flags" parameter can be used to determine which of these
421       operations the callbacks should respond to.  The following flags are
422       currently defined:
423
       Which Tie::Hash operation is being performed from the Perl level on
       "%+" or "%-", if any:
426
427           RXapif_FETCH
428           RXapif_STORE
429           RXapif_DELETE
430           RXapif_CLEAR
431           RXapif_EXISTS
432           RXapif_SCALAR
433           RXapif_FIRSTKEY
434           RXapif_NEXTKEY
435
       Whether "%+" or "%-" is being operated on, if any:
437
438           RXapif_ONE /* %+ */
439           RXapif_ALL /* %- */
440
441       If this is being called as "re::regname", "re::regnames" or
442       "re::regnames_count", if any.  The first two will be combined with
443       "RXapif_ONE" or "RXapif_ALL".
444
445           RXapif_REGNAME
446           RXapif_REGNAMES
447           RXapif_REGNAMES_COUNT
448
449       Internally "%+" and "%-" are implemented with a real tied interface via
450       Tie::Hash::NamedCapture.  The methods in that package will call back
451       into these functions.  However the usage of Tie::Hash::NamedCapture for
452       this purpose might change in future releases.  For instance this might
453       be implemented by magic instead (would need an extension to mgvtbl).
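
       Because a single callback serves several operations, an
       implementation typically begins by decoding "flags".  The sketch below
       shows that dispatch pattern in standalone C.  The "TOY_*" constants
       are illustrative placeholders, not the real RXapif_* values from
       regexp.h, and the function merely names the operation where a real
       "named_buff" would perform it.

```c
#include <assert.h>
#include <string.h>

/* Illustrative bit values only; real code must use the RXapif_* macros. */
enum {
    TOY_FETCH  = 0x01, TOY_STORE  = 0x02, TOY_DELETE = 0x04,
    TOY_CLEAR  = 0x08, TOY_EXISTS = 0x10, TOY_SCALAR = 0x20,
    TOY_ONE    = 0x100,  /* %+ */
    TOY_ALL    = 0x200   /* %- */
};

/* Name the operation a named_buff-style callback was asked to perform. */
static const char *toy_named_buff_op(unsigned flags)
{
    /* Model assumption: a call targets %+ or %-, never both. */
    assert(!((flags & TOY_ONE) && (flags & TOY_ALL)));

    if (flags & TOY_FETCH)
        return (flags & TOY_ALL) ? "fetch %-" : "fetch %+";
    if (flags & TOY_EXISTS)
        return "exists";
    if (flags & TOY_SCALAR)
        return "scalar";
    if (flags & (TOY_STORE | TOY_DELETE | TOY_CLEAR))
        return "modify";
    return "unknown";
}
```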
454
455       named_buff
456
457           SV*     (*named_buff) (pTHX_ REGEXP * const rx, SV * const key,
458                                  SV * const value, U32 flags);
459
460       named_buff_iter
461
462           SV*     (*named_buff_iter) (pTHX_
463                                       REGEXP * const rx,
464                                       const SV * const lastkey,
465                                       const U32 flags);
466
467   qr_package
468           SV* qr_package(pTHX_ REGEXP * const rx);
469
470       The package the qr// magic object is blessed into (as seen by "ref
471       qr//").  It is recommended that engines change this to their package
472       name for identification regardless of if they implement methods on the
473       object.
474
475       The package this method returns should also have the internal "Regexp"
476       package in its @ISA.  "qr//->isa("Regexp")" should always be true
477       regardless of what engine is being used.
478
479       Example implementation might be:
480
481           SV*
482           Example_qr_package(pTHX_ REGEXP * const rx)
483           {
484               PERL_UNUSED_ARG(rx);
485               return newSVpvs("re::engine::Example");
486           }
487
488       Any method calls on an object created with "qr//" will be dispatched to
489       the package as a normal object.
490
491           use re::engine::Example;
492           my $re = qr//;
493           $re->meth; # dispatched to re::engine::Example::meth()
494
495       To retrieve the "REGEXP" object from the scalar in an XS function use
496       the "SvRX" macro, see "REGEXP Functions" in perlapi.
497
           void meth(SV * rv)
           PPCODE:
               REGEXP * re = SvRX(rv);
501
502   dupe
503           void* dupe(pTHX_ REGEXP * const rx, CLONE_PARAMS *param);
504
505       On threaded builds a regexp may need to be duplicated so that the
506       pattern can be used by multiple threads.  This routine is expected to
507       handle the duplication of any private data pointed to by the "pprivate"
508       member of the "regexp" structure.  It will be called with the
509       preconstructed new "regexp" structure as an argument, the "pprivate"
510       member will point at the old private structure, and it is this
511       routine's responsibility to construct a copy and return a pointer to it
512       (which Perl will then use to overwrite the field as passed to this
513       routine.)
514
515       This allows the engine to dupe its private data but also if necessary
516       modify the final structure if it really must.
517
518       On unthreaded builds this field doesn't exist.
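
       The duplication contract (receive the new structure whose "pprivate"
       still points at the old private data, return a fresh copy) might look
       like the following in a standalone model.  The types "toy_private" and
       "toy_dupe_rx" and the function "toy_dupe" are hypothetical; a real
       callback takes "pTHX_ REGEXP * const rx, CLONE_PARAMS *param" and
       would normally allocate with Perl's own memory routines.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical engine-private data: a compiled program buffer. */
typedef struct {
    char  *program;
    size_t len;
} toy_private;

/* Minimal stand-in for the part of the regexp struct that matters here. */
typedef struct {
    void *pprivate;
} toy_dupe_rx;

/* Deep-copy the private data that rx->pprivate currently points at and
 * return the copy; the caller overwrites rx->pprivate with the result. */
static void *toy_dupe(const toy_dupe_rx *rx)
{
    assert(rx != NULL && rx->pprivate != NULL);
    const toy_private *old = rx->pprivate;

    toy_private *copy = malloc(sizeof *copy);
    if (copy == NULL)
        return NULL;
    copy->program = malloc(old->len ? old->len : 1);
    if (copy->program == NULL) {
        free(copy);
        return NULL;
    }
    memcpy(copy->program, old->program, old->len);
    copy->len = old->len;
    return copy;
}
```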
519
520   op_comp
521       This is private to the Perl core and subject to change. Should be left
522       null.
523

The REGEXP structure

525       The REGEXP struct is defined in regexp.h.  All regex engines must be
526       able to correctly build such a structure in their "comp" routine.
527
       The REGEXP structure contains all the data that Perl needs to be aware
       of to properly work with the regular expression.  It includes data
       about optimisations that Perl can use to determine if the regex engine
       should really be used, and various other control info that is needed to
       properly execute patterns in various contexts, such as whether the
       pattern is anchored in some way, what flags were used during the
       compile, or whether the program contains special constructs that Perl
       needs to be aware of.
535
536       In addition it contains two fields that are intended for the private
537       use of the regex engine that compiled the pattern.  These are the
538       "intflags" and "pprivate" members.  "pprivate" is a void pointer to an
539       arbitrary structure, whose use and management is the responsibility of
540       the compiling engine.  Perl will never modify either of these values.
541
542           typedef struct regexp {
543               /* what engine created this regexp? */
544               const struct regexp_engine* engine;
545
546               /* what re is this a lightweight copy of? */
547               struct regexp* mother_re;
548
549               /* Information about the match that the Perl core uses to manage
550                * things */
551               U32 extflags;   /* Flags used both externally and internally */
               I32 minlen;     /* minimum possible number of chars in
                                  string to match */
               I32 minlenret;  /* minimum possible number of chars in $& */
555               U32 gofs;       /* chars left of pos that we search from */
556
557               /* substring data about strings that must appear
558                  in the final match, used for optimisations */
559               struct reg_substr_data *substrs;
560
561               U32 nparens;  /* number of capture groups */
562
563               /* private engine specific data */
564               U32 intflags;   /* Engine Specific Internal flags */
565               void *pprivate; /* Data private to the regex engine which
566                                  created this object. */
567
568               /* Data about the last/current match. These are modified during
569                * matching*/
570               U32 lastparen;            /* highest close paren matched ($+) */
571               U32 lastcloseparen;       /* last close paren matched ($^N) */
572               regexp_paren_pair *offs;  /* Array of offsets for (@-) and
573                                            (@+) */
574
575               char *subbeg;  /* saved or original string so \digit works
576                                 forever. */
577               SV_SAVED_COPY  /* If non-NULL, SV which is COW from original */
578               I32 sublen;    /* Length of string pointed by subbeg */
579               I32 suboffset;  /* byte offset of subbeg from logical start of
580                                  str */
581               I32 subcoffset; /* suboffset equiv, but in chars (for @-/@+) */
582
583               /* Information about the match that isn't often used */
584               I32 prelen;           /* length of precomp */
585               const char *precomp;  /* pre-compilation regular expression */
586
587               char *wrapped;  /* wrapped version of the pattern */
588               I32 wraplen;    /* length of wrapped */
589
590               I32 seen_evals;   /* number of eval groups in the pattern - for
591                                    security checks */
592               HV *paren_names;  /* Optional hash of paren names */
593
594               /* Refcount of this regexp */
595               I32 refcnt;             /* Refcount of this regexp */
596           } regexp;
597
598       The fields are discussed in more detail below:
599
600   "engine"
601       This field points at a "regexp_engine" structure which contains
602       pointers to the subroutines that are to be used for performing a match.
603       It is the compiling routine's responsibility to populate this field
604       before returning the regexp object.
605
       Internally this is set to "NULL" unless a custom engine is specified in
       $^H{regcomp}.  Perl's own set of callbacks can be accessed in the
       struct pointed to by "RE_ENGINE_PTR".
609
610   "mother_re"
611       TODO, see
612       <http://www.mail-archive.com/perl5-changes@perl.org/msg17328.html>
613
614   "extflags"
       This will be used by Perl to see what flags the regexp was compiled
       with.  It will normally be set to the value of the flags parameter by
       the comp callback.  See the comp documentation for valid flags.
618
619   "minlen" "minlenret"
620       The minimum string length (in characters) required for the pattern to
621       match.  This is used to prune the search space by not bothering to
622       match any closer to the end of a string than would allow a match.  For
623       instance there is no point in even starting the regex engine if the
624       minlen is 10 but the string is only 5 characters long.  There is no way
625       that the pattern can match.
626
627       "minlenret" is the minimum length (in characters) of the string that
628       would be found in $& after a match.
629
630       The difference between "minlen" and "minlenret" can be seen in the
631       following pattern:
632
633           /ns(?=\d)/
634
635       where the "minlen" would be 3 but "minlenret" would only be 2 as the \d
636       is required to match but is not actually included in the matched
637       content.  This distinction is particularly important as the
638       substitution logic uses the "minlenret" to tell if it can do in-place
639       substitutions (these can result in considerable speed-up).
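
       The pruning that "minlen" permits amounts to one subtraction.  The
       helper below is a minimal sketch with a hypothetical name; lengths are
       in characters, as in the text above.

```c
#include <assert.h>

/* Last character offset at which a match attempt can still succeed, or
 * -1 if the subject is too short for the pattern to match at all. */
static long toy_last_start(long subject_chars, long minlen)
{
    assert(subject_chars >= 0 && minlen >= 0);
    if (subject_chars < minlen)
        return -1;
    return subject_chars - minlen;
}
```

       With a minlen of 10 and a 5-character string the result is -1, which
       corresponds to the case where the engine need not be started.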
640
641   "gofs"
642       Left offset from pos() to start match at.
643
644   "substrs"
645       Substring data about strings that must appear in the final match.  This
646       is currently only used internally by Perl's engine, but might be used
647       in the future for all engines for optimisations.
648
649   "nparens", "lastparen", and "lastcloseparen"
650       These fields are used to keep track of: how many paren capture groups
651       there are in the pattern; which was the highest paren to be closed (see
652       "$+" in perlvar); and which was the most recent paren to be closed (see
653       "$^N" in perlvar).
654
655   "intflags"
656       The engine's private copy of the flags the pattern was compiled with.
657       Usually this is the same as "extflags" unless the engine chose to
658       modify one of them.
659
660   "pprivate"
661       A void* pointing to an engine-defined data structure.  The Perl engine
662       uses the "regexp_internal" structure (see "Base Structures" in
663       perlreguts) but a custom engine should use something else.
664
665   "offs"
       An array of "regexp_paren_pair" structures which define offsets into
       the string being matched, corresponding to the $& and $1, $2 etc.
       captures.  The "regexp_paren_pair" struct is defined as follows:
669
670           typedef struct regexp_paren_pair {
671               I32 start;
672               I32 end;
673           } regexp_paren_pair;
674
       If "->offs[num].start" or "->offs[num].end" is "-1" then that capture
       group did not match.  "->offs[0].start/end" represents $& (or
       "${^MATCH}" under "/p") and "->offs[paren].end" matches $$paren where
       "$paren >= 1".
679
680   "precomp" "prelen"
681       Used for optimisations.  "precomp" holds a copy of the pattern that was
682       compiled and "prelen" its length.  When a new pattern is to be compiled
683       (such as inside a loop) the internal "regcomp" operator checks if the
684       last compiled "REGEXP"'s "precomp" and "prelen" are equivalent to the
685       new one, and if so uses the old pattern instead of compiling a new one.
686
687       The relevant snippet from "Perl_pp_regcomp":
688
689               if (!re || !re->precomp || re->prelen != (I32)len ||
690                   memNE(re->precomp, t, len))
691               /* Compile a new pattern */
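
          The same test can be written as a standalone predicate.  The
          sketch below is only an illustration: "struct cached_pattern" and
          "must_recompile" are hypothetical names, "I32" is approximated
          with "int32_t", and "memNE" is spelled out as a "memcmp" call
          that returns non-zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef int32_t I32;  /* stand-in for Perl's I32 (assumption) */

/* Hypothetical, cut-down view of the two REGEXP fields involved. */
struct cached_pattern {
    const char *precomp;  /* copy of the pattern that was compiled */
    I32         prelen;   /* its length in bytes */
};

/* Returns true when the cached pattern cannot be reused for the new
 * pattern text `t` of length `len`. */
static bool must_recompile(const struct cached_pattern *re,
                           const char *t, size_t len)
{
    assert(t != NULL || len == 0);

    return !re || !re->precomp || re->prelen != (I32)len
        || memcmp(re->precomp, t, len) != 0;
}
```

          The length comparison comes first, so the byte comparison only
          runs when both patterns have the same length.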
692
693   "paren_names"
694       This is a hash used internally to track named capture groups and their
695       offsets.  The keys are the names of the buffers; the values are
696       dualvars, with the IV slot holding the number of buffers with the given
697       name and the pv being an embedded array of I32.  The values may also be
698       contained independently in the data array in cases where named
699       backreferences are used.
700
701   "substrs"
702       Holds information on the longest string that must occur at a fixed
703       offset from the start of the pattern, and the longest string that must
704       occur at a floating offset from the start of the pattern.  Used to do
705       Fast-Boyer-Moore searches on the string to find out if it's worth using
706       the regex engine at all, and if so where in the string to search.
707
708   "subbeg" "sublen" "saved_copy" "suboffset" "subcoffset"
709       Used during the execution phase for managing search and replace
710       patterns, and for providing the text for $&, $1 etc. "subbeg" points to
711       a buffer (either the original string, or a copy in the case of
712       "RX_MATCH_COPIED(rx)"), and "sublen" is the length of the buffer.  The
713       "RX_OFFS" start and end indices index into this buffer.
714
715       In the presence of the "REXEC_COPY_STR" flag, but with the addition of
716       the "REXEC_COPY_SKIP_PRE" or "REXEC_COPY_SKIP_POST" flags, an engine
717       can choose not to copy the full buffer (although it must still do so in
718       the presence of "RXf_PMf_KEEPCOPY" or the relevant bits being set in
719       "PL_sawampersand").  In this case, it may set "suboffset" to indicate
720       the number of bytes from the logical start of the buffer to the
721       physical start (i.e. "subbeg").  It should also set "subcoffset", the
722       number of characters in the offset. The latter is needed to support
723       "@-" and "@+" which work in characters, not bytes.
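
          One way to picture "suboffset" is sketched below.  It assumes
          that capture offsets stay relative to the logical start of the
          string, so a reader of the buffer subtracts "suboffset" before
          indexing from "subbeg".  "struct capture_buffer" and
          "capture_ptr" are hypothetical names, and "I32" is approximated
          with "int32_t".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t I32;  /* stand-in for Perl's I32 (assumption) */

typedef struct regexp_paren_pair {
    I32 start;
    I32 end;
} regexp_paren_pair;

/* Hypothetical, cut-down view of the buffer-related fields. */
struct capture_buffer {
    const char *subbeg;     /* physical start of the saved buffer */
    I32         sublen;     /* length of the saved buffer in bytes */
    I32         suboffset;  /* bytes skipped before subbeg */
};

/* Returns a pointer to the capture text inside the saved buffer and
 * stores its length in *len_out, or returns NULL if the group did not
 * match or lies outside the part of the string that was copied. */
static const char *capture_ptr(const struct capture_buffer *buf,
                               const regexp_paren_pair *pair,
                               I32 *len_out)
{
    assert(buf != NULL && pair != NULL && len_out != NULL);

    if (pair->start == -1 || pair->end == -1)
        return NULL;

    /* Translate the logical offset into a physical one. */
    const I32 phys = pair->start - buf->suboffset;
    const I32 len  = pair->end - pair->start;

    if (phys < 0 || len < 0 || phys + len > buf->sublen)
        return NULL;

    *len_out = len;
    return buf->subbeg + phys;
}
```

          For example, if the subject is "xxxxhello world" and the engine
          skipped the first four bytes when copying, "subbeg" points at
          "hello world", "suboffset" is 4, and a capture with offsets 4..9
          resolves to the first five bytes of the saved buffer.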
724
725   "wrapped" "wraplen"
726       Stores the string "qr//" stringifies to. The Perl engine for example
727       stores "(?^:eek)" in the case of "qr/eek/".
728
729       When using a custom engine that doesn't support the "(?:)" construct
730       for inline modifiers, it's probably best to have "qr//" stringify to
731       the supplied pattern.  Note that this will create undesired patterns in
732       cases such as:
733
734           my $x = qr/a|b/;  # "a|b"
735           my $y = qr/c/i;   # "c"
736           my $z = qr/$x$y/; # "a|bc"
737
738       There's no solution for this problem other than making the custom
739       engine understand a construct like "(?:)".
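
          To illustrate why the wrapper matters, the sketch below builds a
          "(?^flags:pattern)" string of the kind the Perl engine stores in
          "wrapped".  It is a simplified illustration, not the core's
          implementation: "wrap_pattern" is a hypothetical helper and the
          flags are passed in as a ready-made string such as "i".

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write "(?^<flags>:<pattern>)" into `out`.
 * Returns the length of the wrapped string, or -1 if it does not fit. */
static int wrap_pattern(const char *pattern, const char *flags,
                        char *out, size_t outsize)
{
    assert(pattern != NULL && flags != NULL && out != NULL);

    /* "(?^" + flags + ":" + pattern + ")" adds five fixed bytes. */
    const size_t needed = strlen(flags) + strlen(pattern) + 5;
    if (needed + 1 > outsize)
        return -1;

    snprintf(out, outsize, "(?^%s:%s)", flags, pattern);
    return (int)needed;
}
```

          With this wrapping, "a|b" and "c" with the "i" flag interpolate
          as "(?^:a|b)(?^i:c)", which keeps the alternation and the flag
          scoped to their own sub-patterns instead of collapsing into
          "a|bc".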
740
741   "seen_evals"
742       This stores the number of eval groups in the pattern.  This is used for
743       security purposes when embedding compiled regexes into larger patterns
744       with "qr//".
745
746   "refcnt"
747       The number of times the structure is referenced.  When this falls to 0,
748       the regexp is automatically freed by a call to pregfree.  This should
749       be set to 1 in each engine's "comp" routine.
750

HISTORY

752       Originally part of perlreguts.
753

AUTHORS

755       Originally written by Yves Orton, expanded by Ævar Arnfjörð
756       Bjarmason.
757

LICENSE

759       Copyright 2006 Yves Orton and 2007 Ævar Arnfjörð Bjarmason.
760
761       This program is free software; you can redistribute it and/or modify it
762       under the same terms as Perl itself.
763
764
765
766perl v5.32.1                      2021-05-31                      PERLREAPI(1)