perlreguts(1)

1PERLREGUTS(1)          Perl Programmers Reference Guide          PERLREGUTS(1)
2
3
4

NAME

6       perlreguts - Description of the Perl regular expression engine.
7

DESCRIPTION

9       This document is an attempt to shine some light on the guts of the
10       regex engine and how it works. The regex engine represents a
11       significant chunk of the perl codebase, but is relatively poorly
12       understood. This document is a meagre attempt at addressing this
13       situation. It is derived from the author's experience, comments in the
14       source code, other papers on the regex engine, feedback on the
15       perl5-porters mail list, and no doubt other places as well.
16
17       NOTICE! It should be clearly understood that the behavior and
18       structures discussed in this represents the state of the engine as the
19       author understood it at the time of writing. It is NOT an API
20       definition, it is purely an internals guide for those who want to hack
21       the regex engine, or understand how the regex engine works. Readers of
22       this document are expected to understand perl's regex syntax and its
23       usage in detail. If you want to learn about the basics of Perl's
24       regular expressions, see perlre. And if you want to replace the regex
25       engine with your own, see perlreapi.
26

OVERVIEW

28   A quick note on terms
29       There is some debate as to whether to say "regexp" or "regex". In this
30       document we will use the term "regex" unless there is a special reason
31       not to, in which case we will explain why.
32
33       When speaking about regexes we need to distinguish between their source
34       code form and their internal form. In this document we will use the
35       term "pattern" when we speak of their textual, source code form, and
36       the term "program" when we speak of their internal representation.
37       These correspond to the terms S-regex and B-regex that Mark Jason
38       Dominus employs in his paper on "Rx" ([1] in "REFERENCES").
39
40   What is a regular expression engine?
41       A regular expression engine is a program that takes a set of
42       constraints specified in a mini-language, and then applies those
43       constraints to a target string, and determines whether or not the
44       string satisfies the constraints. See perlre for a full definition of
45       the language.
46
47       In less grandiose terms, the first part of the job is to turn a pattern
48       into something the computer can efficiently use to find the matching
49       point in the string, and the second part is performing the search
50       itself.
51
52       To do this we need to produce a program by parsing the text. We then
53       need to execute the program to find the point in the string that
54       matches. And we need to do the whole thing efficiently.
55
56   Structure of a Regexp Program
57       High Level
58
59       Although it is a bit confusing and some people object to the
60       terminology, it is worth taking a look at a comment that has been in
61       regexp.h for years:
62
63       This is essentially a linear encoding of a nondeterministic finite-
64       state machine (aka syntax charts or "railroad normal form" in parsing
65       technology).
66
67       The term "railroad normal form" is a bit esoteric, with "syntax
68       diagram/charts", or "railroad diagram/charts" being more common terms.
69       Nevertheless it provides a useful mental image of a regex program: each
70       node can be thought of as a unit of track, with a single entry and in
71       most cases a single exit point (there are pieces of track that fork,
72       but statistically not many), and the whole forms a layout with a single
73       entry and single exit point. The matching process can be thought of as
74       a car that moves along the track, with the particular route through the
75       system being determined by the character read at each possible
76       connector point. A car can fall off the track at any point but it may
77       only proceed as long as it matches the track.
78
79       Thus the pattern "/foo(?:\w+|\d+|\s+)bar/" can be thought of as the
80       following chart:
81
82                             [start]
83                                |
84                              <foo>
85                                |
86                          +-----+-----+
87                          |     |     |
88                        <\w+> <\d+> <\s+>
89                          |     |     |
90                          +-----+-----+
91                                |
92                              <bar>
93                                |
94                              [end]
95
96       The truth of the matter is that perl's regular expressions these days
97       are much more complex than this kind of structure, but visualising it
98       this way can help when trying to get your bearings, and it matches the
99       current implementation pretty closely.
100
101       To be more precise, we will say that a regex program is an encoding of
102       a graph. Each node in the graph corresponds to part of the original
103       regex pattern, such as a literal string or a branch, and has a pointer
104       to the nodes representing the next component to be matched. Since
105       "node" and "opcode" already have other meanings in the perl source, we
106       will call the nodes in a regex program "regops".
107
108       The program is represented by an array of "regnode" structures, one or
109       more of which represent a single regop of the program. Struct "regnode"
110       is the smallest struct needed, and has a field structure which is
111       shared with all the other larger structures.  (Outside this document,
112       the term "regnode" is sometimes used to mean "regop", which could be
113       confusing.)
114
115       The "next" pointers of all regops except "BRANCH" implement
116       concatenation; a "next" pointer with a "BRANCH" on both ends of it is
117       connecting two alternatives.  [Here we have one of the subtle syntax
118       dependencies: an individual "BRANCH" (as opposed to a collection of
119       them) is never concatenated with anything because of operator
120       precedence.]
121
122       The operand of some types of regop is a literal string; for others, it
123       is a regop leading into a sub-program.  In particular, the operand of a
124       "BRANCH" node is the first regop of the branch.
125
126       NOTE: As the railroad metaphor suggests, this is not a tree structure:
127       the tail of the branch connects to the thing following the set of
128       "BRANCH"es.  It is a like a single line of railway track that splits as
129       it goes into a station or railway yard and rejoins as it comes out the
130       other side.
131
132       Regops
133
134       The base structure of a regop is defined in regexp.h as follows:
135
136           struct regnode {
137               U8  flags;    /* Various purposes, sometimes overridden */
138               U8  type;     /* Opcode value as specified by regnodes.h */
139               U16 next_off; /* Offset in size regnode */
140           };
141
142       Other larger "regnode"-like structures are defined in regcomp.h. They
143       are almost like subclasses in that they have the same fields as
144       "regnode", with possibly additional fields following in the structure,
145       and in some cases the specific meaning (and name) of some of base
146       fields are overridden. The following is a more complete description.
147
148       "regnode_1"
149       "regnode_2"
150           "regnode_1" structures have the same header, followed by a single
151           four-byte argument; "regnode_2" structures contain two two-byte
152           arguments instead:
153
154               regnode_1                U32 arg1;
155               regnode_2                U16 arg1;  U16 arg2;
156
157       "regnode_string"
158           "regnode_string" structures, used for literal strings, follow the
159           header with a one-byte length and then the string data. Strings are
160           padded on the tail end with zero bytes so that the total length of
161           the node is a multiple of four bytes:
162
163               regnode_string           char string[1];
164                                        U8 str_len; /* overrides flags */
165
166       "regnode_charclass"
167           Bracketed character classes are represented by "regnode_charclass"
168           structures, which have a four-byte argument and then a 32-byte
169           (256-bit) bitmap indicating which characters in the Latin1 range
170           are included in the class.
171
172               regnode_charclass        U32 arg1;
173                                        char bitmap[ANYOF_BITMAP_SIZE];
174
175           Various flags whose names begin with "ANYOF_" are used for special
176           situations.  Above Latin1 matches and things not known until run-
177           time are stored in "Perl's pprivate structure".
178
179       "regnode_charclass_posixl"
180           There is also a larger form of a char class structure used to
181           represent POSIX char classes under "/l" matching, called
182           "regnode_charclass_posixl" which has an additional 32-bit bitmap
183           indicating which POSIX char classes have been included.
184
185              regnode_charclass_posixl U32 arg1;
186                                       char bitmap[ANYOF_BITMAP_SIZE];
187                                       U32 classflags;
188
189       regnodes.h defines an array called "regarglen[]" which gives the size
190       of each opcode in units of "size regnode" (4-byte). A macro is used to
191       calculate the size of an "EXACT" node based on its "str_len" field.
192
193       The regops are defined in regnodes.h which is generated from
194       regcomp.sym by regcomp.pl. Currently the maximum possible number of
195       distinct regops is restricted to 256, with about a quarter already
196       used.
197
198       A set of macros makes accessing the fields easier and more consistent.
199       These include "OP()", which is used to determine the type of a
200       "regnode"-like structure; "NEXT_OFF()", which is the offset to the next
201       node (more on this later); "ARG()", "ARG1()", "ARG2()", "ARG_SET()",
202       and equivalents for reading and setting the arguments; and "STR_LEN()",
203       "STRING()" and "OPERAND()" for manipulating strings and regop bearing
204       types.
205
206       What regop is next?
207
208       There are three distinct concepts of "next" in the regex engine, and it
209       is important to keep them clear.
210
211       •   There is the "next regnode" from a given regnode, a value which is
212           rarely useful except that sometimes it matches up in terms of value
213           with one of the others, and that sometimes the code assumes this to
214           always be so.
215
216       •   There is the "next regop" from a given regop/regnode. This is the
217           regop physically located after the current one, as determined by
218           the size of the current regop. This is often useful, such as when
219           dumping the structure we use this order to traverse. Sometimes the
220           code assumes that the "next regnode" is the same as the "next
221           regop", or in other words assumes that the sizeof a given regop
222           type is always going to be one regnode large.
223
224       •   There is the "regnext" from a given regop. This is the regop which
225           is reached by jumping forward by the value of "NEXT_OFF()", or in a
226           few cases for longer jumps by the "arg1" field of the "regnode_1"
227           structure. The subroutine "regnext()" handles this transparently.
228           This is the logical successor of the node, which in some cases,
229           like that of the "BRANCH" regop, has special meaning.
230

Process Overview

232       Broadly speaking, performing a match of a string against a pattern
233       involves the following steps:
234
235       A. Compilation
236            1. Parsing
237            2. Peep-hole optimisation and analysis
238       B. Execution
239            3. Start position and no-match optimisations
240            4. Program execution
241
242       Where these steps occur in the actual execution of a perl program is
243       determined by whether the pattern involves interpolating any string
244       variables. If interpolation occurs, then compilation happens at run
245       time. If it does not, then compilation is performed at compile time.
246       (The "/o" modifier changes this, as does "qr//" to a certain extent.)
247       The engine doesn't really care that much.
248
249   Compilation
250       This code resides primarily in regcomp.c, along with the header files
251       regcomp.h, regexp.h and regnodes.h.
252
253       Compilation starts with "pregcomp()", which is mostly an initialisation
254       wrapper which farms work out to two other routines for the heavy
255       lifting: the first is "reg()", which is the start point for parsing;
256       the second, "study_chunk()", is responsible for optimisation.
257
258       Initialisation in "pregcomp()" mostly involves the creation and data-
259       filling of a special structure, "RExC_state_t" (defined in regcomp.c).
260       Almost all internally-used routines in regcomp.h take a pointer to one
261       of these structures as their first argument, with the name
262       "pRExC_state".  This structure is used to store the compilation state
263       and contains many fields. Likewise there are many macros which operate
264       on this variable: anything that looks like "RExC_xxxx" is a macro that
265       operates on this pointer/structure.
266
267       "reg()" is the start of the parse process. It is responsible for
268       parsing an arbitrary chunk of pattern up to either the end of the
269       string, or the first closing parenthesis it encounters in the pattern.
270       This means it can be used to parse the top-level regex, or any section
271       inside of a grouping parenthesis. It also handles the "special parens"
272       that perl's regexes have. For instance when parsing "/x(?:foo)y/",
273       "reg()" will at one point be called to parse from the "?" symbol up to
274       and including the ")".
275
276       Additionally, "reg()" is responsible for parsing the one or more
277       branches from the pattern, and for "finishing them off" by correctly
278       setting their next pointers. In order to do the parsing, it repeatedly
279       calls out to "regbranch()", which is responsible for handling up to the
280       first "|" symbol it sees.
281
282       "regbranch()" in turn calls "regpiece()" which handles "things"
283       followed by a quantifier. In order to parse the "things", "regatom()"
284       is called. This is the lowest level routine, which parses out constant
285       strings, character classes, and the various special symbols like "$".
286       If "regatom()" encounters a "(" character it in turn calls "reg()".
287
288       There used to be two main passes involved in parsing, the first to
289       calculate the size of the compiled program, and the second to actually
290       compile it.  But now there is only one main pass, with an initial crude
291       guess based on the length of the input pattern, which is increased if
292       necessary as parsing proceeds, and afterwards, trimmed to the actual
293       amount used.
294
295       However, it may happen that parsing must be restarted at the beginning
296       when various circumstances occur along the way.  An example is if the
297       program turns out to be so large that there are jumps in it that won't
298       fit in the normal 16 bits available.  There are two special regops that
299       can hold bigger jump destinations, BRANCHJ and LONGBRANCH.  The parse
300       is restarted, and these are used instead of the normal shorter ones.
301       Whenever restarting the parse is required, the function returns failure
302       and sets a flag as to what needs to be done.  This is passed up to the
303       top level routine which takes the appropriate action and restarts from
304       scratch.  In the case of needing longer jumps, the "RExC_use_BRANCHJ"
305       flag is set in the "RExC_state_t" structure, which the functions know
306       to inspect before deciding how to do branches.
307
308       In most instances, the function that discovers the issue sets the
309       causal flag and returns failure immediately.  "Parsing complications"
310       contains an explicit example of how this works.  In other cases, such
311       as a forward reference to a numbered parenthetical grouping, we need to
312       finish the parse to know if that numbered grouping actually appears in
313       the pattern.  In those cases, the parse is just redone at the end, with
314       the knowledge of how many groupings occur in it.
315
316       The routine "regtail()" is called by both "reg()" and "regbranch()" in
317       order to "set the tail pointer" correctly. When executing and we get to
318       the end of a branch, we need to go to the node following the grouping
319       parens. When parsing, however, we don't know where the end will be
320       until we get there, so when we do we must go back and update the
321       offsets as appropriate. "regtail" is used to make this easier.
322
323       A subtlety of the parsing process means that a regex like "/foo/" is
324       originally parsed into an alternation with a single branch. It is only
325       afterwards that the optimiser converts single branch alternations into
326       the simpler form.
327
328       Parse Call Graph and a Grammar
329
330       The call graph looks like this:
331
332        reg()                        # parse a top level regex, or inside of
333                                     # parens
334            regbranch()              # parse a single branch of an alternation
335                regpiece()           # parse a pattern followed by a quantifier
336                    regatom()        # parse a simple pattern
337                        regclass()   #   used to handle a class
338                        reg()        #   used to handle a parenthesised
339                                     #   subpattern
340                        ....
341                ...
342                regtail()            # finish off the branch
343            ...
344            regtail()                # finish off the branch sequence. Tie each
345                                     # branch's tail to the tail of the
346                                     # sequence
347                                     # (NEW) In Debug mode this is
348                                     # regtail_study().
349
350       A grammar form might be something like this:
351
352           atom  : constant | class
353           quant : '*' | '+' | '?' | '{min,max}'
354           _branch: piece
355                  | piece _branch
356                  | nothing
357           branch: _branch
358                 | _branch '|' branch
359           group : '(' branch ')'
360           _piece: atom | group
361           piece : _piece
362                 | _piece quant
363
364       Parsing complications
365
366       The implication of the above description is that a pattern containing
367       nested parentheses will result in a call graph which cycles through
368       "reg()", "regbranch()", "regpiece()", "regatom()", "reg()",
369       "regbranch()" etc multiple times, until the deepest level of nesting is
370       reached. All the above routines return a pointer to a "regnode", which
371       is usually the last regnode added to the program. However, one
372       complication is that reg() returns NULL for parsing "(?:)" syntax for
373       embedded modifiers, setting the flag "TRYAGAIN". The "TRYAGAIN"
374       propagates upwards until it is captured, in some cases by "regatom()",
375       but otherwise unconditionally by "regbranch()". Hence it will never be
376       returned by "regbranch()" to "reg()". This flag permits patterns such
377       as "(?i)+" to be detected as errors (Quantifier follows nothing in
378       regex; marked by <-- HERE in m/(?i)+ <-- HERE /).
379
380       Another complication is that the representation used for the program
381       differs if it needs to store Unicode, but it's not always possible to
382       know for sure whether it does until midway through parsing. The Unicode
383       representation for the program is larger, and cannot be matched as
384       efficiently. (See "Unicode and Localisation Support" below for more
385       details as to why.)  If the pattern contains literal Unicode, it's
386       obvious that the program needs to store Unicode. Otherwise, the parser
387       optimistically assumes that the more efficient representation can be
388       used, and starts sizing on this basis.  However, if it then encounters
389       something in the pattern which must be stored as Unicode, such as an
390       "\x{...}" escape sequence representing a character literal, then this
391       means that all previously calculated sizes need to be redone, using
392       values appropriate for the Unicode representation.  This is another
393       instance where the parsing needs to be restarted, and it can and is
394       done immediately.  The function returns failure, and sets the flag
395       "RESTART_UTF8" (encapsulated by using the macro "REQUIRE_UTF8").  This
396       restart request is propagated up the call chain in a similar fashion,
397       until it is "caught" in "Perl_re_op_compile()", which marks the pattern
398       as containing Unicode, and restarts the sizing pass. It is also
399       possible for constructions within run-time code blocks to turn out to
400       need Unicode representation., which is signalled by
401       "S_compile_runtime_code()" returning false to "Perl_re_op_compile()".
402
403       The restart was previously implemented using a "longjmp" in "regatom()"
404       back to a "setjmp" in "Perl_re_op_compile()", but this proved to be
405       problematic as the latter is a large function containing many automatic
406       variables, which interact badly with the emergent control flow of
407       "setjmp".
408
409       Debug Output
410
411       Starting in the 5.9.x development version of perl you can "use re Debug
412       => 'PARSE'" to see some trace information about the parse process. We
413       will start with some simple patterns and build up to more complex
414       patterns.
415
416       So when we parse "/foo/" we see something like the following table. The
417       left shows what is being parsed, and the number indicates where the
418       next regop would go. The stuff on the right is the trace output of the
419       graph. The names are chosen to be short to make it less dense on the
420       screen. 'tsdy' is a special form of "regtail()" which does some extra
421       analysis.
422
423        >foo<             1    reg
424                                 brnc
425                                   piec
426                                     atom
427        ><                4      tsdy~ EXACT <foo> (EXACT) (1)
428                                     ~ attach to END (3) offset to 2
429
430       The resulting program then looks like:
431
432          1: EXACT <foo>(3)
433          3: END(0)
434
435       As you can see, even though we parsed out a branch and a piece, it was
436       ultimately only an atom. The final program shows us how things work. We
437       have an "EXACT" regop, followed by an "END" regop. The number in parens
438       indicates where the "regnext" of the node goes. The "regnext" of an
439       "END" regop is unused, as "END" regops mean we have successfully
440       matched. The number on the left indicates the position of the regop in
441       the regnode array.
442
443       Now let's try a harder pattern. We will add a quantifier, so now we
444       have the pattern "/foo+/". We will see that "regbranch()" calls
445       "regpiece()" twice.
446
447        >foo+<            1    reg
448                                 brnc
449                                   piec
450                                     atom
451        >o+<              3        piec
452                                     atom
453        ><                6        tail~ EXACT <fo> (1)
454                          7      tsdy~ EXACT <fo> (EXACT) (1)
455                                     ~ PLUS (END) (3)
456                                     ~ attach to END (6) offset to 3
457
458       And we end up with the program:
459
460          1: EXACT <fo>(3)
461          3: PLUS(6)
462          4:   EXACT <o>(0)
463          6: END(0)
464
465       Now we have a special case. The "EXACT" regop has a "regnext" of 0.
466       This is because if it matches it should try to match itself again. The
467       "PLUS" regop handles the actual failure of the "EXACT" regop and acts
468       appropriately (going to regnode 6 if the "EXACT" matched at least once,
469       or failing if it didn't).
470
471       Now for something much more complex: "/x(?:foo*|b[a][rR])(foo|bar)$/"
472
473        >x(?:foo*|b...    1    reg
474                                 brnc
475                                   piec
476                                     atom
477        >(?:foo*|b[...    3        piec
478                                     atom
479        >?:foo*|b[a...                 reg
480        >foo*|b[a][...                   brnc
481                                           piec
482                                             atom
483        >o*|b[a][rR...    5                piec
484                                             atom
485        >|b[a][rR])...    8                tail~ EXACT <fo> (3)
486        >b[a][rR])(...    9              brnc
487                         10                piec
488                                             atom
489        >[a][rR])(f...   12                piec
490                                             atom
491        >a][rR])(fo...                         clas
492        >[rR])(foo|...   14                tail~ EXACT <b> (10)
493                                           piec
494                                             atom
495        >rR])(foo|b...                         clas
496        >)(foo|bar)...   25                tail~ EXACT <a> (12)
497                                         tail~ BRANCH (3)
498                         26              tsdy~ BRANCH (END) (9)
499                                             ~ attach to TAIL (25) offset to 16
500                                         tsdy~ EXACT <fo> (EXACT) (4)
501                                             ~ STAR (END) (6)
502                                             ~ attach to TAIL (25) offset to 19
503                                         tsdy~ EXACT <b> (EXACT) (10)
504                                             ~ EXACT <a> (EXACT) (12)
505                                             ~ ANYOF[Rr] (END) (14)
506                                             ~ attach to TAIL (25) offset to 11
507        >(foo|bar)$<               tail~ EXACT <x> (1)
508                                   piec
509                                     atom
510        >foo|bar)$<                    reg
511                         28              brnc
512                                           piec
513                                             atom
514        >|bar)$<         31              tail~ OPEN1 (26)
515        >bar)$<                          brnc
516                         32                piec
517                                             atom
518        >)$<             34              tail~ BRANCH (28)
519                         36              tsdy~ BRANCH (END) (31)
520                                            ~ attach to CLOSE1 (34) offset to 3
521                                         tsdy~ EXACT <foo> (EXACT) (29)
522                                            ~ attach to CLOSE1 (34) offset to 5
523                                         tsdy~ EXACT <bar> (EXACT) (32)
524                                            ~ attach to CLOSE1 (34) offset to 2
525        >$<                        tail~ BRANCH (3)
526                                       ~ BRANCH (9)
527                                       ~ TAIL (25)
528                                   piec
529                                     atom
530        ><               37        tail~ OPEN1 (26)
531                                       ~ BRANCH (28)
532                                       ~ BRANCH (31)
533                                       ~ CLOSE1 (34)
534                         38      tsdy~ EXACT <x> (EXACT) (1)
535                                     ~ BRANCH (END) (3)
536                                     ~ BRANCH (END) (9)
537                                     ~ TAIL (END) (25)
538                                     ~ OPEN1 (END) (26)
539                                     ~ BRANCH (END) (28)
540                                     ~ BRANCH (END) (31)
541                                     ~ CLOSE1 (END) (34)
542                                     ~ EOL (END) (36)
543                                     ~ attach to END (37) offset to 1
544
545       Resulting in the program
546
547          1: EXACT <x>(3)
548          3: BRANCH(9)
549          4:   EXACT <fo>(6)
550          6:   STAR(26)
551          7:     EXACT <o>(0)
552          9: BRANCH(25)
553         10:   EXACT <ba>(14)
554         12:   OPTIMIZED (2 nodes)
555         14:   ANYOF[Rr](26)
556         25: TAIL(26)
557         26: OPEN1(28)
558         28:   TRIE-EXACT(34)
559               [StS:1 Wds:2 Cs:6 Uq:5 #Sts:7 Mn:3 Mx:3 Stcls:bf]
560                 <foo>
561                 <bar>
562         30:   OPTIMIZED (4 nodes)
563         34: CLOSE1(36)
564         36: EOL(37)
565         37: END(0)
566
567       Here we can see a much more complex program, with various optimisations
568       in play. At regnode 10 we see an example where a character class with
569       only one character in it was turned into an "EXACT" node. We can also
570       see where an entire alternation was turned into a "TRIE-EXACT" node. As
571       a consequence, some of the regnodes have been marked as optimised away.
572       We can see that the "$" symbol has been converted into an "EOL" regop,
573       a special piece of code that looks for "\n" or the end of the string.
574
575       The next pointer for "BRANCH"es is interesting in that it points at
576       where execution should go if the branch fails. When executing, if the
577       engine tries to traverse from a branch to a "regnext" that isn't a
578       branch then the engine will know that the entire set of branches has
579       failed.
580
581       Peep-hole Optimisation and Analysis
582
583       The regular expression engine can be a weighty tool to wield. On long
584       strings and complex patterns it can end up having to do a lot of work
585       to find a match, and even more to decide that no match is possible.
586       Consider a situation like the following pattern.
587
588          'ababababababababababab' =~ /(a|b)*z/
589
590       The "(a|b)*" part can match at every char in the string, and then fail
591       every time because there is no "z" in the string. So obviously we can
592       avoid using the regex engine unless there is a "z" in the string.
593       Likewise in a pattern like:
594
595          /foo(\w+)bar/
596
597       In this case we know that the string must contain a "foo" which must be
598       followed by "bar". We can use Fast Boyer-Moore matching as implemented
599       in "fbm_instr()" to find the location of these strings. If they don't
600       exist then we don't need to resort to the much more expensive regex
601       engine.  Even better, if they do exist then we can use their positions
602       to reduce the search space that the regex engine needs to cover to
603       determine if the entire pattern matches.
604
605       There are various aspects of the pattern that can be used to facilitate
606       optimisations along these lines:
607
608       •    anchored fixed strings
609
610       •    floating fixed strings
611
612       •    minimum and maximum length requirements
613
614       •    start class
615
616       •    Beginning/End of line positions
617
618       Another form of optimisation that can occur is the post-parse "peep-
619       hole" optimisation, where inefficient constructs are replaced by more
620       efficient constructs. The "TAIL" regops which are used during parsing
621       to mark the end of branches and the end of groups are examples of this.
622       These regops are used as place-holders during construction and "always
623       match" so they can be "optimised away" by making the things that point
624       to the "TAIL" point to the thing that "TAIL" points to, thus "skipping"
625       the node.
626
627       Another optimisation that can occur is that of ""EXACT" merging" which
628       is where two consecutive "EXACT" nodes are merged into a single regop.
629       An even more aggressive form of this is that a branch sequence of the
630       form "EXACT BRANCH ... EXACT" can be converted into a "TRIE-EXACT"
631       regop.
632
633       All of this occurs in the routine "study_chunk()" which uses a special
634       structure "scan_data_t" to store the analysis that it has performed,
635       and does the "peep-hole" optimisations as it goes.
636
637       The code involved in "study_chunk()" is extremely cryptic. Be careful.
638       :-)
639
640   Execution
641       Execution of a regex generally involves two phases, the first being
642       finding the start point in the string where we should match from, and
643       the second being running the regop interpreter.
644
645       If we can tell that there is no valid start point then we don't bother
646       running the interpreter at all. Likewise, if we know from the analysis
647       phase that we cannot detect a short-cut to the start position, we go
648       straight to the interpreter.
649
650       The two entry points are "re_intuit_start()" and "pregexec()". These
651       routines have a somewhat incestuous relationship with overlap between
652       their functions, and "pregexec()" may even call "re_intuit_start()" on
653       its own. Nevertheless other parts of the perl source code may call into
654       either, or both.
655
656       Execution of the interpreter itself used to be recursive, but thanks to
657       the efforts of Dave Mitchell in the 5.9.x development track, that has
658       changed: now an internal stack is maintained on the heap and the
659       routine is fully iterative. This can make it tricky as the code is
660       quite conservative about what state it stores, with the result that two
661       consecutive lines in the code can actually be running in totally
662       different contexts due to the simulated recursion.
663
664       Start position and no-match optimisations
665
666       "re_intuit_start()" is responsible for handling start points and no-
667       match optimisations as determined by the results of the analysis done
668       by "study_chunk()" (and described in "Peep-hole Optimisation and
669       Analysis").
670
671       The basic structure of this routine is to try to find the start- and/or
672       end-points of where the pattern could match, and to ensure that the
673       string is long enough to match the pattern. It tries to use more
674       efficient methods over less efficient methods and may involve
675       considerable cross-checking of constraints to find the place in the
676       string that matches.  For instance it may try to determine that a given
677       fixed string must be not only present but a certain number of chars
678       before the end of the string, or whatever.
679
680       It calls several other routines, such as "fbm_instr()" which does Fast
681       Boyer Moore matching and "find_byclass()" which is responsible for
682       finding the start using the first mandatory regop in the program.
683
684       When the optimisation criteria have been satisfied, "reg_try()" is
685       called to perform the match.
686
687       Program execution
688
689       "pregexec()" is the main entry point for running a regex. It contains
690       support for initialising the regex interpreter's state, running
691       "re_intuit_start()" if needed, and running the interpreter on the
692       string from various start positions as needed. When it is necessary to
693       use the regex interpreter "pregexec()" calls "regtry()".
694
695       "regtry()" is the entry point into the regex interpreter. It expects as
696       arguments a pointer to a "regmatch_info" structure and a pointer to a
697       string.  It returns an integer 1 for success and a 0 for failure.  It
698       is basically a set-up wrapper around "regmatch()".
699
700       "regmatch" is the main "recursive loop" of the interpreter. It is
701       basically a giant switch statement that implements a state machine,
702       where the possible states are the regops themselves, plus a number of
703       additional intermediate and failure states. A few of the states are
704       implemented as subroutines but the bulk are inline code.
705

MISCELLANEOUS

707   Unicode and Localisation Support
708       When dealing with strings containing characters that cannot be
709       represented using an eight-bit character set, perl uses an internal
710       representation that is a permissive version of Unicode's UTF-8
711       encoding[2]. This uses single bytes to represent characters from the
712       ASCII character set, and sequences of two or more bytes for all other
713       characters. (See perlunitut for more information about the relationship
714       between UTF-8 and perl's encoding, utf8. The difference isn't important
715       for this discussion.)
716
717       No matter how you look at it, Unicode support is going to be a pain in
718       a regex engine. Tricks that might be fine when you have 256 possible
719       characters often won't scale to handle the size of the UTF-8 character
720       set.  Things you can take for granted with ASCII may not be true with
721       Unicode. For instance, in ASCII, it is safe to assume that
722       "sizeof(char1) == sizeof(char2)", but in UTF-8 it isn't. Unicode case
723       folding is vastly more complex than the simple rules of ASCII, and even
724       when not using Unicode but only localised single byte encodings, things
725       can get tricky (for example, LATIN SMALL LETTER SHARP S (U+00DF, ss)
726       should match 'SS' in localised case-insensitive matching).
727
728       Making things worse is that UTF-8 support was a later addition to the
729       regex engine (as it was to perl) and this necessarily  made things a
730       lot more complicated. Obviously it is easier to design a regex engine
731       with Unicode support in mind from the beginning than it is to retrofit
732       it to one that wasn't.
733
734       Nearly all regops that involve looking at the input string have two
735       cases, one for UTF-8, and one not. In fact, it's often more complex
736       than that, as the pattern may be UTF-8 as well.
737
738       Care must be taken when making changes to make sure that you handle
739       UTF-8 properly, both at compile time and at execution time, including
740       when the string and pattern are mismatched.
741
742   Base Structures
743       The "regexp" structure described in perlreapi is common to all regex
744       engines. Two of its fields are intended for the private use of the
745       regex engine that compiled the pattern. These are the "intflags" and
746       pprivate members. The "pprivate" is a void pointer to an arbitrary
747       structure whose use and management is the responsibility of the
748       compiling engine. perl will never modify either of these values. In the
749       case of the stock engine the structure pointed to by "pprivate" is
750       called "regexp_internal".
751
752       Its "pprivate" and "intflags" fields contain data specific to each
753       engine.
754
755       There are two structures used to store a compiled regular expression.
756       One, the "regexp" structure described in perlreapi is populated by the
757       engine currently being. used and some of its fields read by perl to
758       implement things such as the stringification of "qr//".
759
760       The other structure is pointed to by the "regexp" struct's "pprivate"
761       and is in addition to "intflags" in the same struct considered to be
762       the property of the regex engine which compiled the regular expression;
763
764       The regexp structure contains all the data that perl needs to be aware
765       of to properly work with the regular expression. It includes data about
766       optimisations that perl can use to determine if the regex engine should
767       really be used, and various other control info that is needed to
768       properly execute patterns in various contexts such as is the pattern
769       anchored in some way, or what flags were used during the compile, or
770       whether the program contains special constructs that perl needs to be
771       aware of.
772
773       In addition it contains two fields that are intended for the private
774       use of the regex engine that compiled the pattern. These are the
775       "intflags" and pprivate members. The "pprivate" is a void pointer to an
776       arbitrary structure whose use and management is the responsibility of
777       the compiling engine. perl will never modify either of these values.
778
779       As mentioned earlier, in the case of the default engines, the
780       "pprivate" will be a pointer to a regexp_internal structure which holds
781       the compiled program and any additional data that is private to the
782       regex engine implementation.
783
784       Perl's "pprivate" structure
785
786       The following structure is used as the "pprivate" struct by perl's
787       regex engine. Since it is specific to perl it is only of curiosity
788       value to other engine implementations.
789
790           typedef struct regexp_internal {
791               regnode *regstclass;
792               struct reg_data *data;
793               struct reg_code_blocks *code_blocks;
794               U32 proglen;
795               U32 name_list_idx;
796               regnode program[1];
797           } regexp_internal;
798
799       Description of the attributes is as follows:
800
801       "regstclass"
802            Special regop that is used by "re_intuit_start()" to check if a
803            pattern can match at a certain position. For instance if the regex
804            engine knows that the pattern must start with a 'Z' then it can
805            scan the string until it finds one and then launch the regex
806            engine from there. The routine that handles this is called
807            "find_by_class()". Sometimes this field points at a regop embedded
808            in the program, and sometimes it points at an independent
809            synthetic regop that has been constructed by the optimiser.
810
811       "data"
812            This field points at a "reg_data" structure, which is defined as
813            follows
814
815                struct reg_data {
816                    U32 count;
817                    U8 *what;
818                    void* data[1];
819                };
820
821            This structure is used for handling data structures that the regex
822            engine needs to handle specially during a clone or free operation
823            on the compiled product. Each element in the data array has a
824            corresponding element in the what array. During compilation regops
825            that need special structures stored will add an element to each
826            array using the add_data() routine and then store the index in the
827            regop.
828
829            In modern perls the 0th element of this structure is reserved and
830            is NEVER used to store anything of use. This is to allow things
831            that need to index into this array to represent "no value".
832
833       "code_blocks"
834            This optional structure is used to manage "(?{})" constructs in
835            the pattern.  It is made up of the following structures.
836
837                /* record the position of a (?{...}) within a pattern */
838                struct reg_code_block {
839                    STRLEN start;
840                    STRLEN end;
841                    OP     *block;
842                    REGEXP *src_regex;
843                };
844
845                /* array of reg_code_block's plus header info */
846                struct reg_code_blocks {
847                    int refcnt; /* we may be pointed to from a regex
848                                   and from the savestack */
849                    int  count; /* how many code blocks */
850                    struct reg_code_block *cb; /* array of reg_code_block's */
851                };
852
853       "proglen"
854            Stores the length of the compiled program in units of regops.
855
856       "name_list_idx"
857            This is the index into the data array where an AV is stored that
858            contains the names of any named capture buffers in the pattern,
859            should there be any. This is only used in the debugging version of
860            the regex engine and when RXp_PAREN_NAMES(prog) is true. It will
861            be 0 if there is no such data.
862
863       "program"
864            Compiled program. Inlined into the structure so the entire struct
865            can be treated as a single blob.
866

AUTHOR

875       by Yves Orton, 2006.
876
877       With excerpts from Perl, and contributions and suggestions from Ronald
878       J. Kimball, Dave Mitchell, Dominic Dunlop, Mark Jason Dominus, Stephen
879       McCamant, and David Landgren.
880
881       Now maintained by Perl 5 Porters.
882

LICENCE

884       Same terms as Perl.
885

REFERENCES

887       [1] <https://perl.plover.com/Rx/paper/>
888
889       [2] <https://www.unicode.org/>
890
891
892
893perl v5.36.0                      2022-08-30                     PERLREGUTS(1)