perlreguts(1)

1PERLREGUTS(1)          Perl Programmers Reference Guide          PERLREGUTS(1)
2
3
4

NAME

6       perlreguts - Description of the Perl regular expression engine.
7

DESCRIPTION

9       This document is an attempt to shine some light on the guts of the
10       regex engine and how it works. The regex engine represents a
11       significant chunk of the perl codebase, but is relatively poorly
12       understood. This document is a meagre attempt at addressing this
13       situation. It is derived from the author's experience, comments in the
14       source code, other papers on the regex engine, feedback on the
15       perl5-porters mail list, and no doubt other places as well.
16
17       NOTICE! It should be clearly understood that the behavior and
18       structures discussed in this represents the state of the engine as the
19       author understood it at the time of writing. It is NOT an API
20       definition, it is purely an internals guide for those who want to hack
21       the regex engine, or understand how the regex engine works. Readers of
22       this document are expected to understand perl's regex syntax and its
23       usage in detail. If you want to learn about the basics of Perl's
24       regular expressions, see perlre. And if you want to replace the regex
25       engine with your own, see perlreapi.
26

OVERVIEW

28   A quick note on terms
29       There is some debate as to whether to say "regexp" or "regex". In this
30       document we will use the term "regex" unless there is a special reason
31       not to, in which case we will explain why.
32
33       When speaking about regexes we need to distinguish between their source
34       code form and their internal form. In this document we will use the
35       term "pattern" when we speak of their textual, source code form, and
36       the term "program" when we speak of their internal representation.
37       These correspond to the terms S-regex and B-regex that Mark Jason
38       Dominus employs in his paper on "Rx" ([1] in "REFERENCES").
39
40   What is a regular expression engine?
41       A regular expression engine is a program that takes a set of
42       constraints specified in a mini-language, and then applies those
43       constraints to a target string, and determines whether or not the
44       string satisfies the constraints. See perlre for a full definition of
45       the language.
46
47       In less grandiose terms, the first part of the job is to turn a pattern
48       into something the computer can efficiently use to find the matching
49       point in the string, and the second part is performing the search
50       itself.
51
52       To do this we need to produce a program by parsing the text. We then
53       need to execute the program to find the point in the string that
54       matches. And we need to do the whole thing efficiently.
55
56   Structure of a Regexp Program
57       High Level
58
59       Although it is a bit confusing and some people object to the
60       terminology, it is worth taking a look at a comment that has been in
61       regexp.h for years:
62
63       This is essentially a linear encoding of a nondeterministic finite-
64       state machine (aka syntax charts or "railroad normal form" in parsing
65       technology).
66
67       The term "railroad normal form" is a bit esoteric, with "syntax
68       diagram/charts", or "railroad diagram/charts" being more common terms.
69       Nevertheless it provides a useful mental image of a regex program: each
70       node can be thought of as a unit of track, with a single entry and in
71       most cases a single exit point (there are pieces of track that fork,
72       but statistically not many), and the whole forms a layout with a single
73       entry and single exit point. The matching process can be thought of as
74       a car that moves along the track, with the particular route through the
75       system being determined by the character read at each possible
76       connector point. A car can fall off the track at any point but it may
77       only proceed as long as it matches the track.
78
79       Thus the pattern "/foo(?:\w+|\d+|\s+)bar/" can be thought of as the
80       following chart:
81
82                             [start]
83                                |
84                              <foo>
85                                |
86                          +-----+-----+
87                          |     |     |
88                        <\w+> <\d+> <\s+>
89                          |     |     |
90                          +-----+-----+
91                                |
92                              <bar>
93                                |
94                              [end]
95
96       The truth of the matter is that perl's regular expressions these days
97       are much more complex than this kind of structure, but visualising it
98       this way can help when trying to get your bearings, and it matches the
99       current implementation pretty closely.
100
101       To be more precise, we will say that a regex program is an encoding of
102       a graph. Each node in the graph corresponds to part of the original
103       regex pattern, such as a literal string or a branch, and has a pointer
104       to the nodes representing the next component to be matched. Since
105       "node" and "opcode" already have other meanings in the perl source, we
106       will call the nodes in a regex program "regops".
107
108       The program is represented by an array of "regnode" structures, one or
109       more of which represent a single regop of the program. Struct "regnode"
110       is the smallest struct needed, and has a field structure which is
111       shared with all the other larger structures.  (Outside this document,
112       the term "regnode" is sometimes used to mean "regop", which could be
113       confusing.)
114
115       The "next" pointers of all regops except "BRANCH" implement
116       concatenation; a "next" pointer with a "BRANCH" on both ends of it is
117       connecting two alternatives.  [Here we have one of the subtle syntax
118       dependencies: an individual "BRANCH" (as opposed to a collection of
119       them) is never concatenated with anything because of operator
120       precedence.]
121
122       The operand of some types of regop is a literal string; for others, it
123       is a regop leading into a sub-program.  In particular, the operand of a
124       "BRANCH" node is the first regop of the branch.
125
126       NOTE: As the railroad metaphor suggests, this is not a tree structure:
127       the tail of the branch connects to the thing following the set of
128       "BRANCH"es.  It is a like a single line of railway track that splits as
129       it goes into a station or railway yard and rejoins as it comes out the
130       other side.
131
132       Regops
133
134       The base structure of a regop is defined in regexp.h as follows:
135
136           struct regnode {
137               U8  flags;    /* Various purposes, sometimes overridden */
138               U8  type;     /* Opcode value as specified by regnodes.h */
139               U16 next_off; /* Offset in size regnode */
140           };
141
142       Other larger "regnode"-like structures are defined in regcomp.h. They
143       are almost like subclasses in that they have the same fields as
144       "regnode", with possibly additional fields following in the structure,
145       and in some cases the specific meaning (and name) of some of base
146       fields are overridden. The following is a more complete description.
147
148       "regnode_1"
149       "regnode_2"
150           "regnode_1" structures have the same header, followed by a single
151           four-byte argument; "regnode_2" structures contain two two-byte
152           arguments instead:
153
154               regnode_1                U32 arg1;
155               regnode_2                U16 arg1;  U16 arg2;
156
157       "regnode_string"
158           "regnode_string" structures, used for literal strings, follow the
159           header with a one-byte length and then the string data. Strings are
160           padded on the tail end with zero bytes so that the total length of
161           the node is a multiple of four bytes:
162
163               regnode_string           char string[1];
164                                        U8 str_len; /* overrides flags */
165
166       "regnode_charclass"
167           Bracketed character classes are represented by "regnode_charclass"
168           structures, which have a four-byte argument and then a 32-byte
169           (256-bit) bitmap indicating which characters in the Latin1 range
170           are included in the class.
171
172               regnode_charclass        U32 arg1;
173                                        char bitmap[ANYOF_BITMAP_SIZE];
174
175           Various flags whose names begin with "ANYOF_" are used for special
176           situations.  Above Latin1 matches and things not known until run-
177           time are stored in "Perl's pprivate structure".
178
179       "regnode_charclass_posixl"
180           There is also a larger form of a char class structure used to
181           represent POSIX char classes under "/l" matching, called
182           "regnode_charclass_posixl" which has an additional 32-bit bitmap
183           indicating which POSIX char classes have been included.
184
185              regnode_charclass_posixl U32 arg1;
186                                       char bitmap[ANYOF_BITMAP_SIZE];
187                                       U32 classflags;
188
189       regnodes.h defines an array called "PL_regnode_arg_len[]" which gives
190       the size of each opcode in units of "size regnode" (4-byte). A macro is
191       used to calculate the size of an "EXACT" node based on its "str_len"
192       field.
193
194       The regops are defined in regnodes.h which is generated from
195       regcomp.sym by regcomp.pl. Currently the maximum possible number of
196       distinct regops is restricted to 256, with about a quarter already
197       used.
198
199       A set of macros makes accessing the fields easier and more consistent.
200       These include OP(), which is used to determine the type of a
201       "regnode"-like structure; NEXT_OFF(), which is the offset to the next
202       node (more on this later); ARG(), ARG1(), ARG2(), ARG_SET(), and
203       equivalents for reading and setting the arguments; and STR_LEN(),
204       STRING() and OPERAND() for manipulating strings and regop bearing
205       types.
206
207       What regnode is next?
208
209       There are two distinct concepts of "next regnode" in the regex engine,
210       and it is important to keep them distinct in your thinking as they
211       overlap conceptually in many places, but where they don't overlap the
212       difference is critical. For the majority of regnode types the two
213       concepts are (nearly) identical in practice. The two types are
214       "REGNODE_AFTER" which is used heavily during compilation but only
215       occasionally during execution and "regnext" which is used heavily
216       during execution, and only occasionally during compilation.
217
218       "REGNODE_AFTER"
219           This is the "positionally next regnode" in the compiled regex
220           program.  For the smaller regnode types it is "regnode_ptr+1" under
221           the hood, but as regnode sizes vary and can change over time we
222           offer macros which hide the gory details.
223
224           It is heavily used in the compiler phase but is only used by a few
225           select regnode types in the execution phase. It is also heavily
226           used in the code for dumping the regexp program for debugging.
227
228           There are a selection of macros which can be used to compute this
229           as efficiently as possible depending on the circumstances. The
230           canonical macro is REGNODE_AFTER(), which is the most powerful and
231           should handle any case we have, but is also potentially the
232           slowest. There are two additional macros for the special case that
233           you KNOW the current regnode size is constant, and you know its
234           type or opcode. In which case you can use REGNODE_AFTER_opcode() or
235           REGNODE_AFTER_type().
236
237           In older versions of the regex engine REGNODE_AFTER() was called
238           "NEXTOPER" but this was found to be confusing and it was renamed.
239           There is also a REGNODE_BEFORE(), but it is unsafe and should not
240           be used in new code.
241
242       "regnext"
243           This is the regnode which can be reached by jumping forward by the
244           value of the NEXT_OFF() member of the regnode, or in a few cases
245           for longer jumps by the "arg1" field of the "regnode_1" structure.
246           The subroutine regnext() handles this transparently. In the
247           majority of cases the "regnext" for a regnode is the regnode which
248           should be executed after the current one has successfully matched,
249           but in some cases this may not be true. In loop control and branch
250           control regnode types the regnext may signify something special,
251           for BRANCH nodes "regnext" is the next BRANCH that should be
252           executed if the current one fails execution, and some loop control
253           regnodes set the regnext to be the end of the loop so they can jump
254           to their cleanup if the current iteration fails to match.
255
256       Most regnode types do not create a branch in the execution flow, and
257       leaving aside optimizations the two concepts of "next" are the same.
258       For instance the "regnext" and "REGNODE_AFTER" of a SBOL opcode are the
259       same during compilation phase. The main place this is not true is
260       "BRANCH" regnodes where the "REGNODE_AFTER" represents the start of the
261       pattern in the branch and the "regnext" represents the linkage to the
262       next BRANCH should this one fail to match, or 0 if it is the last
263       branch. The looping logic for quantifiers also makes similar use of the
264       distinction between the two types, with "REGNODE_AFTER" being the
265       inside of the loop construct, and the "regnext" pointing at the end of
266       the loop.
267
268       During compilation the engine may not know what the regnext is for a
269       given node, so during compilation "regnext" is only used where it must
270       be used and is known to be correct. At the very end of the compilation
271       phase we walk the regex program and correct the regnext data as
272       appropriate, and also perform various optimizations which may result in
273       regnodes that were required during construction becoming redundant, or
274       we may replace a large regnode with a much smaller one and filling in
275       the gap with OPTIMIZED regnodes. Thus we might start with something
276       like this:
277
278           BRANCH
279             EXACT "foo"
280           BRANCH
281             EXACT "bar"
282           EXACT "!"
283
284       and replace it with something like:
285
286           TRIE foo|bar
287           OPTIMIZED
288           OPTIMIZED
289           OPTIMIZED
290           EXACT "!"
291
292       the "REGNODE_AFTER" for the "TRIE" node would be an "OPTIMIZED"
293       regnode, and in theory the "regnext" would be the same as the
294       "REGNODE_AFTER". But it would be inefficient to execute the OPTIMIZED
295       regnode as a noop three times, so the optimizer fixes the "regnext" so
296       such nodes are skipped during execution phase.
297
298       During execution phases we use the regnext() almost exclusively, and
299       only use "REGNODE_AFTER" in special cases where it has a well defined
300       meaning for a given regnode type. For instance /x+/ results in
301
302           PLUS
303               EXACT "x"
304           END
305
306       the "regnext" of the "PLUS" regnode is the "END" regnode, and the
307       "REGNODE_AFTER" of the "PLUS" regnode is the "EXACT" regnode. The
308       "regnext" and "REGNODE_AFTER" of the "EXACT" regnode is the "END"
309       regnode.
310

Process Overview

312       Broadly speaking, performing a match of a string against a pattern
313       involves the following steps:
314
315       A. Compilation
316            1. Parsing
317            2. Peep-hole optimisation and analysis
318       B. Execution
319            3. Start position and no-match optimisations
320            4. Program execution
321
322       Where these steps occur in the actual execution of a perl program is
323       determined by whether the pattern involves interpolating any string
324       variables. If interpolation occurs, then compilation happens at run
325       time. If it does not, then compilation is performed at compile time.
326       (The "/o" modifier changes this, as does "qr//" to a certain extent.)
327       The engine doesn't really care that much.
328
329   Compilation
330       This code resides primarily in regcomp.c, along with the header files
331       regcomp.h, regexp.h and regnodes.h.
332
333       Compilation starts with pregcomp(), which is mostly an initialisation
334       wrapper which farms work out to two other routines for the heavy
335       lifting: the first is reg(), which is the start point for parsing; the
336       second, study_chunk(), is responsible for optimisation.
337
338       Initialisation in pregcomp() mostly involves the creation and data-
339       filling of a special structure, "RExC_state_t" (defined in regcomp.c).
340       Almost all internally-used routines in regcomp.h take a pointer to one
341       of these structures as their first argument, with the name
342       "pRExC_state".  This structure is used to store the compilation state
343       and contains many fields. Likewise there are many macros which operate
344       on this variable: anything that looks like "RExC_xxxx" is a macro that
345       operates on this pointer/structure.
346
347       reg() is the start of the parse process. It is responsible for parsing
348       an arbitrary chunk of pattern up to either the end of the string, or
349       the first closing parenthesis it encounters in the pattern.  This means
350       it can be used to parse the top-level regex, or any section inside of a
351       grouping parenthesis. It also handles the "special parens" that perl's
352       regexes have. For instance when parsing "/x(?:foo)y/", reg() will at
353       one point be called to parse from the "?" symbol up to and including
354       the ")".
355
356       Additionally, reg() is responsible for parsing the one or more branches
357       from the pattern, and for "finishing them off" by correctly setting
358       their next pointers. In order to do the parsing, it repeatedly calls
359       out to regbranch(), which is responsible for handling up to the first
360       "|" symbol it sees.
361
362       regbranch() in turn calls regpiece() which handles "things" followed by
363       a quantifier. In order to parse the "things", regatom() is called. This
364       is the lowest level routine, which parses out constant strings,
365       character classes, and the various special symbols like "$". If
366       regatom() encounters a "(" character it in turn calls reg().
367
368       There used to be two main passes involved in parsing, the first to
369       calculate the size of the compiled program, and the second to actually
370       compile it.  But now there is only one main pass, with an initial crude
371       guess based on the length of the input pattern, which is increased if
372       necessary as parsing proceeds, and afterwards, trimmed to the actual
373       amount used.
374
375       However, it may happen that parsing must be restarted at the beginning
376       when various circumstances occur along the way.  An example is if the
377       program turns out to be so large that there are jumps in it that won't
378       fit in the normal 16 bits available.  There are two special regops that
379       can hold bigger jump destinations, BRANCHJ and LONGBRANCH.  The parse
380       is restarted, and these are used instead of the normal shorter ones.
381       Whenever restarting the parse is required, the function returns failure
382       and sets a flag as to what needs to be done.  This is passed up to the
383       top level routine which takes the appropriate action and restarts from
384       scratch.  In the case of needing longer jumps, the "RExC_use_BRANCHJ"
385       flag is set in the "RExC_state_t" structure, which the functions know
386       to inspect before deciding how to do branches.
387
388       In most instances, the function that discovers the issue sets the
389       causal flag and returns failure immediately.  "Parsing complications"
390       contains an explicit example of how this works.  In other cases, such
391       as a forward reference to a numbered parenthetical grouping, we need to
392       finish the parse to know if that numbered grouping actually appears in
393       the pattern.  In those cases, the parse is just redone at the end, with
394       the knowledge of how many groupings occur in it.
395
396       The routine regtail() is called by both reg() and regbranch() in order
397       to "set the tail pointer" correctly. When executing and we get to the
398       end of a branch, we need to go to the node following the grouping
399       parens. When parsing, however, we don't know where the end will be
400       until we get there, so when we do we must go back and update the
401       offsets as appropriate. "regtail" is used to make this easier.
402
403       A subtlety of the parsing process means that a regex like "/foo/" is
404       originally parsed into an alternation with a single branch. It is only
405       afterwards that the optimiser converts single branch alternations into
406       the simpler form.
407
408       Parse Call Graph and a Grammar
409
410       The call graph looks like this:
411
412        reg()                        # parse a top level regex, or inside of
413                                     # parens
414            regbranch()              # parse a single branch of an alternation
415                regpiece()           # parse a pattern followed by a quantifier
416                    regatom()        # parse a simple pattern
417                        regclass()   #   used to handle a class
418                        reg()        #   used to handle a parenthesised
419                                     #   subpattern
420                        ....
421                ...
422                regtail()            # finish off the branch
423            ...
424            regtail()                # finish off the branch sequence. Tie each
425                                     # branch's tail to the tail of the
426                                     # sequence
427                                     # (NEW) In Debug mode this is
428                                     # regtail_study().
429
430       A grammar form might be something like this:
431
432           atom  : constant | class
433           quant : '*' | '+' | '?' | '{min,max}'
434           _branch: piece
435                  | piece _branch
436                  | nothing
437           branch: _branch
438                 | _branch '|' branch
439           group : '(' branch ')'
440           _piece: atom | group
441           piece : _piece
442                 | _piece quant
443
444       Parsing complications
445
446       The implication of the above description is that a pattern containing
447       nested parentheses will result in a call graph which cycles through
448       reg(), regbranch(), regpiece(), regatom(), reg(), regbranch() etc
449       multiple times, until the deepest level of nesting is reached. All the
450       above routines return a pointer to a "regnode", which is usually the
451       last regnode added to the program. However, one complication is that
452       reg() returns NULL for parsing "(?:)" syntax for embedded modifiers,
453       setting the flag "TRYAGAIN". The "TRYAGAIN" propagates upwards until it
454       is captured, in some cases by regatom(), but otherwise unconditionally
455       by regbranch(). Hence it will never be returned by regbranch() to
456       reg(). This flag permits patterns such as "(?i)+" to be detected as
457       errors (Quantifier follows nothing in regex; marked by <-- HERE in
458       m/(?i)+ <-- HERE /).
459
460       Another complication is that the representation used for the program
461       differs if it needs to store Unicode, but it's not always possible to
462       know for sure whether it does until midway through parsing. The Unicode
463       representation for the program is larger, and cannot be matched as
464       efficiently. (See "Unicode and Localisation Support" below for more
465       details as to why.)  If the pattern contains literal Unicode, it's
466       obvious that the program needs to store Unicode. Otherwise, the parser
467       optimistically assumes that the more efficient representation can be
468       used, and starts sizing on this basis.  However, if it then encounters
469       something in the pattern which must be stored as Unicode, such as an
470       "\x{...}" escape sequence representing a character literal, then this
471       means that all previously calculated sizes need to be redone, using
472       values appropriate for the Unicode representation.  This is another
473       instance where the parsing needs to be restarted, and it can and is
474       done immediately.  The function returns failure, and sets the flag
475       "RESTART_UTF8" (encapsulated by using the macro "REQUIRE_UTF8").  This
476       restart request is propagated up the call chain in a similar fashion,
477       until it is "caught" in Perl_re_op_compile(), which marks the pattern
478       as containing Unicode, and restarts the sizing pass. It is also
479       possible for constructions within run-time code blocks to turn out to
480       need Unicode representation., which is signalled by
481       S_compile_runtime_code() returning false to Perl_re_op_compile().
482
483       The restart was previously implemented using a "longjmp" in regatom()
484       back to a "setjmp" in Perl_re_op_compile(), but this proved to be
485       problematic as the latter is a large function containing many automatic
486       variables, which interact badly with the emergent control flow of
487       "setjmp".
488
489       Debug Output
490
491       Starting in the 5.9.x development version of perl you can "use re Debug
492       => 'PARSE'" to see some trace information about the parse process. We
493       will start with some simple patterns and build up to more complex
494       patterns.
495
496       So when we parse "/foo/" we see something like the following table. The
497       left shows what is being parsed, and the number indicates where the
498       next regop would go. The stuff on the right is the trace output of the
499       graph. The names are chosen to be short to make it less dense on the
500       screen. 'tsdy' is a special form of regtail() which does some extra
501       analysis.
502
503        >foo<             1    reg
504                                 brnc
505                                   piec
506                                     atom
507        ><                4      tsdy~ EXACT <foo> (EXACT) (1)
508                                     ~ attach to END (3) offset to 2
509
510       The resulting program then looks like:
511
512          1: EXACT <foo>(3)
513          3: END(0)
514
515       As you can see, even though we parsed out a branch and a piece, it was
516       ultimately only an atom. The final program shows us how things work. We
517       have an "EXACT" regop, followed by an "END" regop. The number in parens
518       indicates where the "regnext" of the node goes. The "regnext" of an
519       "END" regop is unused, as "END" regops mean we have successfully
520       matched. The number on the left indicates the position of the regop in
521       the regnode array.
522
523       Now let's try a harder pattern. We will add a quantifier, so now we
524       have the pattern "/foo+/". We will see that regbranch() calls
525       regpiece() twice.
526
527        >foo+<            1    reg
528                                 brnc
529                                   piec
530                                     atom
531        >o+<              3        piec
532                                     atom
533        ><                6        tail~ EXACT <fo> (1)
534                          7      tsdy~ EXACT <fo> (EXACT) (1)
535                                     ~ PLUS (END) (3)
536                                     ~ attach to END (6) offset to 3
537
538       And we end up with the program:
539
540          1: EXACT <fo>(3)
541          3: PLUS(6)
542          4:   EXACT <o>(0)
543          6: END(0)
544
545       Now we have a special case. The "EXACT" regop has a "regnext" of 0.
546       This is because if it matches it should try to match itself again. The
547       "PLUS" regop handles the actual failure of the "EXACT" regop and acts
548       appropriately (going to regnode 6 if the "EXACT" matched at least once,
549       or failing if it didn't).
550
551       Now for something much more complex: "/x(?:foo*|b[a][rR])(foo|bar)$/"
552
553        >x(?:foo*|b...    1    reg
554                                 brnc
555                                   piec
556                                     atom
557        >(?:foo*|b[...    3        piec
558                                     atom
559        >?:foo*|b[a...                 reg
560        >foo*|b[a][...                   brnc
561                                           piec
562                                             atom
563        >o*|b[a][rR...    5                piec
564                                             atom
565        >|b[a][rR])...    8                tail~ EXACT <fo> (3)
566        >b[a][rR])(...    9              brnc
567                         10                piec
568                                             atom
569        >[a][rR])(f...   12                piec
570                                             atom
571        >a][rR])(fo...                         clas
572        >[rR])(foo|...   14                tail~ EXACT <b> (10)
573                                           piec
574                                             atom
575        >rR])(foo|b...                         clas
576        >)(foo|bar)...   25                tail~ EXACT <a> (12)
577                                         tail~ BRANCH (3)
578                         26              tsdy~ BRANCH (END) (9)
579                                             ~ attach to TAIL (25) offset to 16
580                                         tsdy~ EXACT <fo> (EXACT) (4)
581                                             ~ STAR (END) (6)
582                                             ~ attach to TAIL (25) offset to 19
583                                         tsdy~ EXACT <b> (EXACT) (10)
584                                             ~ EXACT <a> (EXACT) (12)
585                                             ~ ANYOF[Rr] (END) (14)
586                                             ~ attach to TAIL (25) offset to 11
587        >(foo|bar)$<               tail~ EXACT <x> (1)
588                                   piec
589                                     atom
590        >foo|bar)$<                    reg
591                         28              brnc
592                                           piec
593                                             atom
594        >|bar)$<         31              tail~ OPEN1 (26)
595        >bar)$<                          brnc
596                         32                piec
597                                             atom
598        >)$<             34              tail~ BRANCH (28)
599                         36              tsdy~ BRANCH (END) (31)
600                                            ~ attach to CLOSE1 (34) offset to 3
601                                         tsdy~ EXACT <foo> (EXACT) (29)
602                                            ~ attach to CLOSE1 (34) offset to 5
603                                         tsdy~ EXACT <bar> (EXACT) (32)
604                                            ~ attach to CLOSE1 (34) offset to 2
605        >$<                        tail~ BRANCH (3)
606                                       ~ BRANCH (9)
607                                       ~ TAIL (25)
608                                   piec
609                                     atom
610        ><               37        tail~ OPEN1 (26)
611                                       ~ BRANCH (28)
612                                       ~ BRANCH (31)
613                                       ~ CLOSE1 (34)
614                         38      tsdy~ EXACT <x> (EXACT) (1)
615                                     ~ BRANCH (END) (3)
616                                     ~ BRANCH (END) (9)
617                                     ~ TAIL (END) (25)
618                                     ~ OPEN1 (END) (26)
619                                     ~ BRANCH (END) (28)
620                                     ~ BRANCH (END) (31)
621                                     ~ CLOSE1 (END) (34)
622                                     ~ EOL (END) (36)
623                                     ~ attach to END (37) offset to 1
624
625       Resulting in the program
626
627          1: EXACT <x>(3)
628          3: BRANCH(9)
629          4:   EXACT <fo>(6)
630          6:   STAR(26)
631          7:     EXACT <o>(0)
632          9: BRANCH(25)
633         10:   EXACT <ba>(14)
634         12:   OPTIMIZED (2 nodes)
635         14:   ANYOF[Rr](26)
636         25: TAIL(26)
637         26: OPEN1(28)
638         28:   TRIE-EXACT(34)
639               [StS:1 Wds:2 Cs:6 Uq:5 #Sts:7 Mn:3 Mx:3 Stcls:bf]
640                 <foo>
641                 <bar>
642         30:   OPTIMIZED (4 nodes)
643         34: CLOSE1(36)
644         36: EOL(37)
645         37: END(0)
646
647       Here we can see a much more complex program, with various optimisations
648       in play. At regnode 10 we see an example where a character class with
649       only one character in it was turned into an "EXACT" node. We can also
650       see where an entire alternation was turned into a "TRIE-EXACT" node. As
651       a consequence, some of the regnodes have been marked as optimised away.
652       We can see that the "$" symbol has been converted into an "EOL" regop,
653       a special piece of code that looks for "\n" or the end of the string.
654
655       The next pointer for "BRANCH"es is interesting in that it points at
656       where execution should go if the branch fails. When executing, if the
657       engine tries to traverse from a branch to a "regnext" that isn't a
658       branch then the engine will know that the entire set of branches has
659       failed.
660
661       Peep-hole Optimisation and Analysis
662
663       The regular expression engine can be a weighty tool to wield. On long
664       strings and complex patterns it can end up having to do a lot of work
665       to find a match, and even more to decide that no match is possible.
666       Consider a situation like the following pattern.
667
668          'ababababababababababab' =~ /(a|b)*z/
669
670       The "(a|b)*" part can match at every char in the string, and then fail
671       every time because there is no "z" in the string. So obviously we can
672       avoid using the regex engine unless there is a "z" in the string.
673       Likewise in a pattern like:
674
675          /foo(\w+)bar/
676
677       In this case we know that the string must contain a "foo" which must be
678       followed by "bar". We can use Fast Boyer-Moore matching as implemented
679       in fbm_instr() to find the location of these strings. If they don't
680       exist then we don't need to resort to the much more expensive regex
681       engine.  Even better, if they do exist then we can use their positions
682       to reduce the search space that the regex engine needs to cover to
683       determine if the entire pattern matches.
684
685       There are various aspects of the pattern that can be used to facilitate
686       optimisations along these lines:
687
688       •    anchored fixed strings
689
690       •    floating fixed strings
691
692       •    minimum and maximum length requirements
693
694       •    start class
695
696       •    Beginning/End of line positions
697
698       Another form of optimisation that can occur is the post-parse "peep-
699       hole" optimisation, where inefficient constructs are replaced by more
700       efficient constructs. The "TAIL" regops which are used during parsing
701       to mark the end of branches and the end of groups are examples of this.
702       These regops are used as place-holders during construction and "always
703       match" so they can be "optimised away" by making the things that point
704       to the "TAIL" point to the thing that "TAIL" points to, thus "skipping"
705       the node.
706
707       Another optimisation that can occur is that of ""EXACT" merging" which
708       is where two consecutive "EXACT" nodes are merged into a single regop.
709       An even more aggressive form of this is that a branch sequence of the
710       form "EXACT BRANCH ... EXACT" can be converted into a "TRIE-EXACT"
711       regop.
712
713       All of this occurs in the routine study_chunk() which uses a special
714       structure "scan_data_t" to store the analysis that it has performed,
715       and does the "peep-hole" optimisations as it goes.
716
717       The code involved in study_chunk() is extremely cryptic. Be careful.
718       :-)
719
720   Execution
721       Execution of a regex generally involves two phases, the first being
722       finding the start point in the string where we should match from, and
723       the second being running the regop interpreter.
724
725       If we can tell that there is no valid start point then we don't bother
726       running the interpreter at all. Likewise, if we know from the analysis
727       phase that we cannot detect a short-cut to the start position, we go
728       straight to the interpreter.
729
730       The two entry points are re_intuit_start() and pregexec(). These
731       routines have a somewhat incestuous relationship with overlap between
732       their functions, and pregexec() may even call re_intuit_start() on its
733       own. Nevertheless other parts of the perl source code may call into
734       either, or both.
735
736       Execution of the interpreter itself used to be recursive, but thanks to
737       the efforts of Dave Mitchell in the 5.9.x development track, that has
738       changed: now an internal stack is maintained on the heap and the
739       routine is fully iterative. This can make it tricky as the code is
740       quite conservative about what state it stores, with the result that two
741       consecutive lines in the code can actually be running in totally
742       different contexts due to the simulated recursion.
743
744       Start position and no-match optimisations
745
746       re_intuit_start() is responsible for handling start points and no-match
747       optimisations as determined by the results of the analysis done by
748       study_chunk() (and described in "Peep-hole Optimisation and Analysis").
749
750       The basic structure of this routine is to try to find the start- and/or
751       end-points of where the pattern could match, and to ensure that the
752       string is long enough to match the pattern. It tries to use more
753       efficient methods over less efficient methods and may involve
754       considerable cross-checking of constraints to find the place in the
755       string that matches.  For instance it may try to determine that a given
756       fixed string must be not only present but a certain number of chars
757       before the end of the string, or whatever.
758
759       It calls several other routines, such as fbm_instr() which does Fast
760       Boyer Moore matching and find_byclass() which is responsible for
761       finding the start using the first mandatory regop in the program.
762
763       When the optimisation criteria have been satisfied, reg_try() is called
764       to perform the match.
765
766       Program execution
767
768       pregexec() is the main entry point for running a regex. It contains
769       support for initialising the regex interpreter's state, running
770       re_intuit_start() if needed, and running the interpreter on the string
771       from various start positions as needed. When it is necessary to use the
772       regex interpreter pregexec() calls regtry().
773
774       regtry() is the entry point into the regex interpreter. It expects as
775       arguments a pointer to a "regmatch_info" structure and a pointer to a
776       string.  It returns an integer 1 for success and a 0 for failure.  It
777       is basically a set-up wrapper around regmatch().
778
779       "regmatch" is the main "recursive loop" of the interpreter. It is
780       basically a giant switch statement that implements a state machine,
781       where the possible states are the regops themselves, plus a number of
782       additional intermediate and failure states. A few of the states are
783       implemented as subroutines but the bulk are inline code.
784

MISCELLANEOUS

786   Unicode and Localisation Support
787       When dealing with strings containing characters that cannot be
788       represented using an eight-bit character set, perl uses an internal
789       representation that is a permissive version of Unicode's UTF-8
790       encoding[2]. This uses single bytes to represent characters from the
791       ASCII character set, and sequences of two or more bytes for all other
792       characters. (See perlunitut for more information about the relationship
793       between UTF-8 and perl's encoding, utf8. The difference isn't important
794       for this discussion.)
795
796       No matter how you look at it, Unicode support is going to be a pain in
797       a regex engine. Tricks that might be fine when you have 256 possible
798       characters often won't scale to handle the size of the UTF-8 character
799       set.  Things you can take for granted with ASCII may not be true with
800       Unicode. For instance, in ASCII, it is safe to assume that
801       "sizeof(char1) == sizeof(char2)", but in UTF-8 it isn't. Unicode case
802       folding is vastly more complex than the simple rules of ASCII, and even
803       when not using Unicode but only localised single byte encodings, things
804       can get tricky (for example, LATIN SMALL LETTER SHARP S (U+00DF, ß)
805       should match 'SS' in localised case-insensitive matching).
806
807       Making things worse is that UTF-8 support was a later addition to the
808       regex engine (as it was to perl) and this necessarily  made things a
809       lot more complicated. Obviously it is easier to design a regex engine
810       with Unicode support in mind from the beginning than it is to retrofit
811       it to one that wasn't.
812
813       Nearly all regops that involve looking at the input string have two
814       cases, one for UTF-8, and one not. In fact, it's often more complex
815       than that, as the pattern may be UTF-8 as well.
816
817       Care must be taken when making changes to make sure that you handle
818       UTF-8 properly, both at compile time and at execution time, including
819       when the string and pattern are mismatched.
820
821   Base Structures
822       The "regexp" structure described in perlreapi is common to all regex
823       engines. Two of its fields are intended for the private use of the
824       regex engine that compiled the pattern. These are the "intflags" and
825       pprivate members. The "pprivate" is a void pointer to an arbitrary
826       structure whose use and management is the responsibility of the
827       compiling engine. perl will never modify either of these values. In the
828       case of the stock engine the structure pointed to by "pprivate" is
829       called "regexp_internal".
830
831       Its "pprivate" and "intflags" fields contain data specific to each
832       engine.
833
834       There are two structures used to store a compiled regular expression.
835       One, the "regexp" structure described in perlreapi is populated by the
836       engine currently being used and some of its fields read by perl to
837       implement things such as the stringification of "qr//".
838
839       The other structure is pointed to by the "regexp" struct's "pprivate"
840       and is in addition to "intflags" in the same struct considered to be
841       the property of the regex engine which compiled the regular expression;
842
843       The regexp structure contains all the data that perl needs to be aware
844       of to properly work with the regular expression. It includes data about
845       optimisations that perl can use to determine if the regex engine should
846       really be used, and various other control info that is needed to
847       properly execute patterns in various contexts such as is the pattern
848       anchored in some way, or what flags were used during the compile, or
849       whether the program contains special constructs that perl needs to be
850       aware of.
851
852       In addition it contains two fields that are intended for the private
853       use of the regex engine that compiled the pattern. These are the
854       "intflags" and pprivate members. The "pprivate" is a void pointer to an
855       arbitrary structure whose use and management is the responsibility of
856       the compiling engine. perl will never modify either of these values.
857
858       As mentioned earlier, in the case of the default engines, the
859       "pprivate" will be a pointer to a regexp_internal structure which holds
860       the compiled program and any additional data that is private to the
861       regex engine implementation.
862
863       Perl's "pprivate" structure
864
865       The following structure is used as the "pprivate" struct by perl's
866       regex engine. Since it is specific to perl it is only of curiosity
867       value to other engine implementations.
868
869           typedef struct regexp_internal {
870               regnode *regstclass;
871               struct reg_data *data;
872               struct reg_code_blocks *code_blocks;
873               U32 proglen;
874               U32 name_list_idx;
875               regnode program[1];
876           } regexp_internal;
877
878       Description of the attributes is as follows:
879
880       "regstclass"
881            Special regop that is used by re_intuit_start() to check if a
882            pattern can match at a certain position. For instance if the regex
883            engine knows that the pattern must start with a 'Z' then it can
884            scan the string until it finds one and then launch the regex
885            engine from there. The routine that handles this is called
886            find_by_class(). Sometimes this field points at a regop embedded
887            in the program, and sometimes it points at an independent
888            synthetic regop that has been constructed by the optimiser.
889
890       "data"
891            This field points at a "reg_data" structure, which is defined as
892            follows
893
894                struct reg_data {
895                    U32 count;
896                    U8 *what;
897                    void* data[1];
898                };
899
900            This structure is used for handling data structures that the regex
901            engine needs to handle specially during a clone or free operation
902            on the compiled product. Each element in the data array has a
903            corresponding element in the what array. During compilation regops
904            that need special structures stored will add an element to each
905            array using the add_data() routine and then store the index in the
906            regop.
907
908            In modern perls the 0th element of this structure is reserved and
909            is NEVER used to store anything of use. This is to allow things
910            that need to index into this array to represent "no value".
911
912       "code_blocks"
913            This optional structure is used to manage "(?{})" constructs in
914            the pattern.  It is made up of the following structures.
915
916                /* record the position of a (?{...}) within a pattern */
917                struct reg_code_block {
918                    STRLEN start;
919                    STRLEN end;
920                    OP     *block;
921                    REGEXP *src_regex;
922                };
923
924                /* array of reg_code_block's plus header info */
925                struct reg_code_blocks {
926                    int refcnt; /* we may be pointed to from a regex
927                                   and from the savestack */
928                    int  count; /* how many code blocks */
929                    struct reg_code_block *cb; /* array of reg_code_block's */
930                };
931
932       "proglen"
933            Stores the length of the compiled program in units of regops.
934
935       "name_list_idx"
936            This is the index into the data array where an AV is stored that
937            contains the names of any named capture buffers in the pattern,
938            should there be any. This is only used in the debugging version of
939            the regex engine and when RXp_PAREN_NAMES(prog) is true. It will
940            be 0 if there is no such data.
941
942       "program"
943            Compiled program. Inlined into the structure so the entire struct
944            can be treated as a single blob.
945

AUTHOR

954       by Yves Orton, 2006.
955
956       With excerpts from Perl, and contributions and suggestions from Ronald
957       J. Kimball, Dave Mitchell, Dominic Dunlop, Mark Jason Dominus, Stephen
958       McCamant, and David Landgren.
959
960       Now maintained by Perl 5 Porters.
961

LICENCE

963       Same terms as Perl.
964

REFERENCES

966       [1] <https://perl.plover.com/Rx/paper/>
967
968       [2] <https://www.unicode.org/>
969
970
971
972perl v5.38.2                      2023-11-30                     PERLREGUTS(1)