1PERLREGUTS(1)          Perl Programmers Reference Guide          PERLREGUTS(1)
2
3
4

NAME

6       perlreguts - Description of the Perl regular expression engine.
7

DESCRIPTION

9       This document is an attempt to shine some light on the guts of the
10       regex engine and how it works. The regex engine represents a
11       significant chunk of the perl codebase, but is relatively poorly
12       understood. This document is a meagre attempt at addressing this
13       situation. It is derived from the author's experience, comments in the
14       source code, other papers on the regex engine, feedback on the
15       perl5-porters mail list, and no doubt other places as well.
16
17       NOTICE! It should be clearly understood that the behavior and
18       structures discussed in this represents the state of the engine as the
19       author understood it at the time of writing. It is NOT an API
20       definition, it is purely an internals guide for those who want to hack
21       the regex engine, or understand how the regex engine works. Readers of
22       this document are expected to understand perl's regex syntax and its
23       usage in detail. If you want to learn about the basics of Perl's
24       regular expressions, see perlre. And if you want to replace the regex
25       engine with your own, see perlreapi.
26

OVERVIEW

28   A quick note on terms
29       There is some debate as to whether to say "regexp" or "regex". In this
30       document we will use the term "regex" unless there is a special reason
31       not to, in which case we will explain why.
32
33       When speaking about regexes we need to distinguish between their source
34       code form and their internal form. In this document we will use the
35       term "pattern" when we speak of their textual, source code form, and
36       the term "program" when we speak of their internal representation.
37       These correspond to the terms S-regex and B-regex that Mark Jason
38       Dominus employs in his paper on "Rx" ([1] in "REFERENCES").
39
40   What is a regular expression engine?
41       A regular expression engine is a program that takes a set of
42       constraints specified in a mini-language, and then applies those
43       constraints to a target string, and determines whether or not the
44       string satisfies the constraints. See perlre for a full definition of
45       the language.
46
47       In less grandiose terms, the first part of the job is to turn a pattern
48       into something the computer can efficiently use to find the matching
49       point in the string, and the second part is performing the search
50       itself.
51
52       To do this we need to produce a program by parsing the text. We then
53       need to execute the program to find the point in the string that
54       matches. And we need to do the whole thing efficiently.
55
56   Structure of a Regexp Program
57       High Level
58
59       Although it is a bit confusing and some people object to the
60       terminology, it is worth taking a look at a comment that has been in
61       regexp.h for years:
62
63       This is essentially a linear encoding of a nondeterministic finite-
64       state machine (aka syntax charts or "railroad normal form" in parsing
65       technology).
66
67       The term "railroad normal form" is a bit esoteric, with "syntax
68       diagram/charts", or "railroad diagram/charts" being more common terms.
69       Nevertheless it provides a useful mental image of a regex program: each
70       node can be thought of as a unit of track, with a single entry and in
71       most cases a single exit point (there are pieces of track that fork,
72       but statistically not many), and the whole forms a layout with a single
73       entry and single exit point. The matching process can be thought of as
74       a car that moves along the track, with the particular route through the
75       system being determined by the character read at each possible
76       connector point. A car can fall off the track at any point but it may
77       only proceed as long as it matches the track.
78
79       Thus the pattern "/foo(?:\w+|\d+|\s+)bar/" can be thought of as the
80       following chart:
81
82                             [start]
83                                |
84                              <foo>
85                                |
86                          +-----+-----+
87                          |     |     |
88                        <\w+> <\d+> <\s+>
89                          |     |     |
90                          +-----+-----+
91                                |
92                              <bar>
93                                |
94                              [end]
95
96       The truth of the matter is that perl's regular expressions these days
97       are much more complex than this kind of structure, but visualising it
98       this way can help when trying to get your bearings, and it matches the
99       current implementation pretty closely.
100
101       To be more precise, we will say that a regex program is an encoding of
102       a graph. Each node in the graph corresponds to part of the original
103       regex pattern, such as a literal string or a branch, and has a pointer
104       to the nodes representing the next component to be matched. Since
105       "node" and "opcode" already have other meanings in the perl source, we
106       will call the nodes in a regex program "regops".
107
108       The program is represented by an array of "regnode" structures, one or
109       more of which represent a single regop of the program. Struct "regnode"
110       is the smallest struct needed, and has a field structure which is
111       shared with all the other larger structures.
112
113       The "next" pointers of all regops except "BRANCH" implement
114       concatenation; a "next" pointer with a "BRANCH" on both ends of it is
115       connecting two alternatives.  [Here we have one of the subtle syntax
116       dependencies: an individual "BRANCH" (as opposed to a collection of
117       them) is never concatenated with anything because of operator
118       precedence.]
119
120       The operand of some types of regop is a literal string; for others, it
121       is a regop leading into a sub-program.  In particular, the operand of a
122       "BRANCH" node is the first regop of the branch.
123
124       NOTE: As the railroad metaphor suggests, this is not a tree structure:
125       the tail of the branch connects to the thing following the set of
126       "BRANCH"es.  It is a like a single line of railway track that splits as
127       it goes into a station or railway yard and rejoins as it comes out the
128       other side.
129
130       Regops
131
132       The base structure of a regop is defined in regexp.h as follows:
133
134           struct regnode {
135               U8  flags;    /* Various purposes, sometimes overridden */
136               U8  type;     /* Opcode value as specified by regnodes.h */
137               U16 next_off; /* Offset in size regnode */
138           };
139
140       Other larger "regnode"-like structures are defined in regcomp.h. They
141       are almost like subclasses in that they have the same fields as
142       "regnode", with possibly additional fields following in the structure,
143       and in some cases the specific meaning (and name) of some of base
144       fields are overridden. The following is a more complete description.
145
146       "regnode_1"
147       "regnode_2"
148           "regnode_1" structures have the same header, followed by a single
149           four-byte argument; "regnode_2" structures contain two two-byte
150           arguments instead:
151
152               regnode_1                U32 arg1;
153               regnode_2                U16 arg1;  U16 arg2;
154
155       "regnode_string"
156           "regnode_string" structures, used for literal strings, follow the
157           header with a one-byte length and then the string data. Strings are
158           padded on the end with zero bytes so that the total length of the
159           node is a multiple of four bytes:
160
161               regnode_string           char string[1];
162                                        U8 str_len; /* overrides flags */
163
164       "regnode_charclass"
165           Bracketed character classes are represented by "regnode_charclass"
166           structures, which have a four-byte argument and then a 32-byte
167           (256-bit) bitmap indicating which characters in the Latin1 range
168           are included in the class.
169
170               regnode_charclass        U32 arg1;
171                                        char bitmap[ANYOF_BITMAP_SIZE];
172
173           Various flags whose names begin with "ANYOF_" are used for special
174           situations.  Above Latin1 matches and things not known until run-
175           time are stored in "Perl's pprivate structure".
176
177       "regnode_charclass_posixl"
178           There is also a larger form of a char class structure used to
179           represent POSIX char classes under "/l" matching, called
180           "regnode_charclass_posixl" which has an additional 32-bit bitmap
181           indicating which POSIX char classes have been included.
182
183              regnode_charclass_posixl U32 arg1;
184                                       char bitmap[ANYOF_BITMAP_SIZE];
185                                       U32 classflags;
186
187       regnodes.h defines an array called "regarglen[]" which gives the size
188       of each opcode in units of "size regnode" (4-byte). A macro is used to
189       calculate the size of an "EXACT" node based on its "str_len" field.
190
191       The regops are defined in regnodes.h which is generated from
192       regcomp.sym by regcomp.pl. Currently the maximum possible number of
193       distinct regops is restricted to 256, with about a quarter already
194       used.
195
196       A set of macros makes accessing the fields easier and more consistent.
197       These include "OP()", which is used to determine the type of a
198       "regnode"-like structure; "NEXT_OFF()", which is the offset to the next
199       node (more on this later); "ARG()", "ARG1()", "ARG2()", "ARG_SET()",
200       and equivalents for reading and setting the arguments; and "STR_LEN()",
201       "STRING()" and "OPERAND()" for manipulating strings and regop bearing
202       types.
203
204       What regop is next?
205
206       There are three distinct concepts of "next" in the regex engine, and it
207       is important to keep them clear.
208
209       ·   There is the "next regnode" from a given regnode, a value which is
210           rarely useful except that sometimes it matches up in terms of value
211           with one of the others, and that sometimes the code assumes this to
212           always be so.
213
214       ·   There is the "next regop" from a given regop/regnode. This is the
215           regop physically located after the current one, as determined by
216           the size of the current regop. This is often useful, such as when
217           dumping the structure we use this order to traverse. Sometimes the
218           code assumes that the "next regnode" is the same as the "next
219           regop", or in other words assumes that the sizeof a given regop
220           type is always going to be one regnode large.
221
222       ·   There is the "regnext" from a given regop. This is the regop which
223           is reached by jumping forward by the value of "NEXT_OFF()", or in a
224           few cases for longer jumps by the "arg1" field of the "regnode_1"
225           structure. The subroutine "regnext()" handles this transparently.
226           This is the logical successor of the node, which in some cases,
227           like that of the "BRANCH" regop, has special meaning.
228

Process Overview

230       Broadly speaking, performing a match of a string against a pattern
231       involves the following steps:
232
233       A. Compilation
234            1. Parsing for size
235            2. Parsing for construction
236            3. Peep-hole optimisation and analysis
237       B. Execution
238            4. Start position and no-match optimisations
239            5. Program execution
240
241       Where these steps occur in the actual execution of a perl program is
242       determined by whether the pattern involves interpolating any string
243       variables. If interpolation occurs, then compilation happens at run
244       time. If it does not, then compilation is performed at compile time.
245       (The "/o" modifier changes this, as does "qr//" to a certain extent.)
246       The engine doesn't really care that much.
247
248   Compilation
249       This code resides primarily in regcomp.c, along with the header files
250       regcomp.h, regexp.h and regnodes.h.
251
252       Compilation starts with "pregcomp()", which is mostly an initialisation
253       wrapper which farms work out to two other routines for the heavy
254       lifting: the first is "reg()", which is the start point for parsing;
255       the second, "study_chunk()", is responsible for optimisation.
256
257       Initialisation in "pregcomp()" mostly involves the creation and data-
258       filling of a special structure, "RExC_state_t" (defined in regcomp.c).
259       Almost all internally-used routines in regcomp.h take a pointer to one
260       of these structures as their first argument, with the name
261       "pRExC_state".  This structure is used to store the compilation state
262       and contains many fields. Likewise there are many macros which operate
263       on this variable: anything that looks like "RExC_xxxx" is a macro that
264       operates on this pointer/structure.
265
266       Parsing for size
267
268       In this pass the input pattern is parsed in order to calculate how much
269       space is needed for each regop we would need to emit. The size is also
270       used to determine whether long jumps will be required in the program.
271
272       This stage is controlled by the macro "SIZE_ONLY" being set.
273
274       The parse proceeds pretty much exactly as it does during the
275       construction phase, except that most routines are short-circuited to
276       change the size field "RExC_size" and not do anything else.
277
278       Parsing for construction
279
280       Once the size of the program has been determined, the pattern is parsed
281       again, but this time for real. Now "SIZE_ONLY" will be false, and the
282       actual construction can occur.
283
284       "reg()" is the start of the parse process. It is responsible for
285       parsing an arbitrary chunk of pattern up to either the end of the
286       string, or the first closing parenthesis it encounters in the pattern.
287       This means it can be used to parse the top-level regex, or any section
288       inside of a grouping parenthesis. It also handles the "special parens"
289       that perl's regexes have. For instance when parsing "/x(?:foo)y/"
290       "reg()" will at one point be called to parse from the "?" symbol up to
291       and including the ")".
292
293       Additionally, "reg()" is responsible for parsing the one or more
294       branches from the pattern, and for "finishing them off" by correctly
295       setting their next pointers. In order to do the parsing, it repeatedly
296       calls out to "regbranch()", which is responsible for handling up to the
297       first "|" symbol it sees.
298
299       "regbranch()" in turn calls "regpiece()" which handles "things"
300       followed by a quantifier. In order to parse the "things", "regatom()"
301       is called. This is the lowest level routine, which parses out constant
302       strings, character classes, and the various special symbols like "$".
303       If "regatom()" encounters a "(" character it in turn calls "reg()".
304
305       The routine "regtail()" is called by both "reg()" and "regbranch()" in
306       order to "set the tail pointer" correctly. When executing and we get to
307       the end of a branch, we need to go to the node following the grouping
308       parens. When parsing, however, we don't know where the end will be
309       until we get there, so when we do we must go back and update the
310       offsets as appropriate. "regtail" is used to make this easier.
311
312       A subtlety of the parsing process means that a regex like "/foo/" is
313       originally parsed into an alternation with a single branch. It is only
314       afterwards that the optimiser converts single branch alternations into
315       the simpler form.
316
317       Parse Call Graph and a Grammar
318
319       The call graph looks like this:
320
321        reg()                        # parse a top level regex, or inside of
322                                     # parens
323            regbranch()              # parse a single branch of an alternation
324                regpiece()           # parse a pattern followed by a quantifier
325                    regatom()        # parse a simple pattern
326                        regclass()   #   used to handle a class
327                        reg()        #   used to handle a parenthesised
328                                     #   subpattern
329                        ....
330                ...
331                regtail()            # finish off the branch
332            ...
333            regtail()                # finish off the branch sequence. Tie each
334                                     # branch's tail to the tail of the
335                                     # sequence
336                                     # (NEW) In Debug mode this is
337                                     # regtail_study().
338
339       A grammar form might be something like this:
340
341           atom  : constant | class
342           quant : '*' | '+' | '?' | '{min,max}'
343           _branch: piece
344                  | piece _branch
345                  | nothing
346           branch: _branch
347                 | _branch '|' branch
348           group : '(' branch ')'
349           _piece: atom | group
350           piece : _piece
351                 | _piece quant
352
353       Parsing complications
354
355       The implication of the above description is that a pattern containing
356       nested parentheses will result in a call graph which cycles through
357       "reg()", "regbranch()", "regpiece()", "regatom()", "reg()",
358       "regbranch()" etc multiple times, until the deepest level of nesting is
359       reached. All the above routines return a pointer to a "regnode", which
360       is usually the last regnode added to the program. However, one
361       complication is that reg() returns NULL for parsing "(?:)" syntax for
362       embedded modifiers, setting the flag "TRYAGAIN". The "TRYAGAIN"
363       propagates upwards until it is captured, in some cases by "regatom()",
364       but otherwise unconditionally by "regbranch()". Hence it will never be
365       returned by "regbranch()" to "reg()". This flag permits patterns such
366       as "(?i)+" to be detected as errors (Quantifier follows nothing in
367       regex; marked by <-- HERE in m/(?i)+ <-- HERE /).
368
369       Another complication is that the representation used for the program
370       differs if it needs to store Unicode, but it's not always possible to
371       know for sure whether it does until midway through parsing. The Unicode
372       representation for the program is larger, and cannot be matched as
373       efficiently. (See "Unicode and Localisation Support" below for more
374       details as to why.)  If the pattern contains literal Unicode, it's
375       obvious that the program needs to store Unicode. Otherwise, the parser
376       optimistically assumes that the more efficient representation can be
377       used, and starts sizing on this basis.  However, if it then encounters
378       something in the pattern which must be stored as Unicode, such as an
379       "\x{...}" escape sequence representing a character literal, then this
380       means that all previously calculated sizes need to be redone, using
381       values appropriate for the Unicode representation. Currently, all
382       regular expression constructions which can trigger this are parsed by
383       code in "regatom()".
384
385       To avoid wasted work when a restart is needed, the sizing pass is
386       abandoned - "regatom()" immediately returns NULL, setting the flag
387       "RESTART_UTF8".  (This action is encapsulated using the macro
388       "REQUIRE_UTF8".) This restart request is propagated up the call chain
389       in a similar fashion, until it is "caught" in "Perl_re_op_compile()",
390       which marks the pattern as containing Unicode, and restarts the sizing
391       pass. It is also possible for constructions within run-time code blocks
392       to turn out to need Unicode representation., which is signalled by
393       "S_compile_runtime_code()" returning false to "Perl_re_op_compile()".
394
395       The restart was previously implemented using a "longjmp" in "regatom()"
396       back to a "setjmp" in "Perl_re_op_compile()", but this proved to be
397       problematic as the latter is a large function containing many automatic
398       variables, which interact badly with the emergent control flow of
399       "setjmp".
400
401       Debug Output
402
403       In the 5.9.x development version of perl you can "use re Debug =>
404       'PARSE'" to see some trace information about the parse process. We will
405       start with some simple patterns and build up to more complex patterns.
406
407       So when we parse "/foo/" we see something like the following table. The
408       left shows what is being parsed, and the number indicates where the
409       next regop would go. The stuff on the right is the trace output of the
410       graph. The names are chosen to be short to make it less dense on the
411       screen. 'tsdy' is a special form of "regtail()" which does some extra
412       analysis.
413
414        >foo<             1    reg
415                                 brnc
416                                   piec
417                                     atom
418        ><                4      tsdy~ EXACT <foo> (EXACT) (1)
419                                     ~ attach to END (3) offset to 2
420
421       The resulting program then looks like:
422
423          1: EXACT <foo>(3)
424          3: END(0)
425
426       As you can see, even though we parsed out a branch and a piece, it was
427       ultimately only an atom. The final program shows us how things work. We
428       have an "EXACT" regop, followed by an "END" regop. The number in parens
429       indicates where the "regnext" of the node goes. The "regnext" of an
430       "END" regop is unused, as "END" regops mean we have successfully
431       matched. The number on the left indicates the position of the regop in
432       the regnode array.
433
434       Now let's try a harder pattern. We will add a quantifier, so now we
435       have the pattern "/foo+/". We will see that "regbranch()" calls
436       "regpiece()" twice.
437
438        >foo+<            1    reg
439                                 brnc
440                                   piec
441                                     atom
442        >o+<              3        piec
443                                     atom
444        ><                6        tail~ EXACT <fo> (1)
445                          7      tsdy~ EXACT <fo> (EXACT) (1)
446                                     ~ PLUS (END) (3)
447                                     ~ attach to END (6) offset to 3
448
449       And we end up with the program:
450
451          1: EXACT <fo>(3)
452          3: PLUS(6)
453          4:   EXACT <o>(0)
454          6: END(0)
455
456       Now we have a special case. The "EXACT" regop has a "regnext" of 0.
457       This is because if it matches it should try to match itself again. The
458       "PLUS" regop handles the actual failure of the "EXACT" regop and acts
459       appropriately (going to regnode 6 if the "EXACT" matched at least once,
460       or failing if it didn't).
461
462       Now for something much more complex: "/x(?:foo*|b[a][rR])(foo|bar)$/"
463
464        >x(?:foo*|b...    1    reg
465                                 brnc
466                                   piec
467                                     atom
468        >(?:foo*|b[...    3        piec
469                                     atom
470        >?:foo*|b[a...                 reg
471        >foo*|b[a][...                   brnc
472                                           piec
473                                             atom
474        >o*|b[a][rR...    5                piec
475                                             atom
476        >|b[a][rR])...    8                tail~ EXACT <fo> (3)
477        >b[a][rR])(...    9              brnc
478                         10                piec
479                                             atom
480        >[a][rR])(f...   12                piec
481                                             atom
482        >a][rR])(fo...                         clas
483        >[rR])(foo|...   14                tail~ EXACT <b> (10)
484                                           piec
485                                             atom
486        >rR])(foo|b...                         clas
487        >)(foo|bar)...   25                tail~ EXACT <a> (12)
488                                         tail~ BRANCH (3)
489                         26              tsdy~ BRANCH (END) (9)
490                                             ~ attach to TAIL (25) offset to 16
491                                         tsdy~ EXACT <fo> (EXACT) (4)
492                                             ~ STAR (END) (6)
493                                             ~ attach to TAIL (25) offset to 19
494                                         tsdy~ EXACT <b> (EXACT) (10)
495                                             ~ EXACT <a> (EXACT) (12)
496                                             ~ ANYOF[Rr] (END) (14)
497                                             ~ attach to TAIL (25) offset to 11
498        >(foo|bar)$<               tail~ EXACT <x> (1)
499                                   piec
500                                     atom
501        >foo|bar)$<                    reg
502                         28              brnc
503                                           piec
504                                             atom
505        >|bar)$<         31              tail~ OPEN1 (26)
506        >bar)$<                          brnc
507                         32                piec
508                                             atom
509        >)$<             34              tail~ BRANCH (28)
510                         36              tsdy~ BRANCH (END) (31)
511                                            ~ attach to CLOSE1 (34) offset to 3
512                                         tsdy~ EXACT <foo> (EXACT) (29)
513                                            ~ attach to CLOSE1 (34) offset to 5
514                                         tsdy~ EXACT <bar> (EXACT) (32)
515                                            ~ attach to CLOSE1 (34) offset to 2
516        >$<                        tail~ BRANCH (3)
517                                       ~ BRANCH (9)
518                                       ~ TAIL (25)
519                                   piec
520                                     atom
521        ><               37        tail~ OPEN1 (26)
522                                       ~ BRANCH (28)
523                                       ~ BRANCH (31)
524                                       ~ CLOSE1 (34)
525                         38      tsdy~ EXACT <x> (EXACT) (1)
526                                     ~ BRANCH (END) (3)
527                                     ~ BRANCH (END) (9)
528                                     ~ TAIL (END) (25)
529                                     ~ OPEN1 (END) (26)
530                                     ~ BRANCH (END) (28)
531                                     ~ BRANCH (END) (31)
532                                     ~ CLOSE1 (END) (34)
533                                     ~ EOL (END) (36)
534                                     ~ attach to END (37) offset to 1
535
536       Resulting in the program
537
538          1: EXACT <x>(3)
539          3: BRANCH(9)
540          4:   EXACT <fo>(6)
541          6:   STAR(26)
542          7:     EXACT <o>(0)
543          9: BRANCH(25)
544         10:   EXACT <ba>(14)
545         12:   OPTIMIZED (2 nodes)
546         14:   ANYOF[Rr](26)
547         25: TAIL(26)
548         26: OPEN1(28)
549         28:   TRIE-EXACT(34)
550               [StS:1 Wds:2 Cs:6 Uq:5 #Sts:7 Mn:3 Mx:3 Stcls:bf]
551                 <foo>
552                 <bar>
553         30:   OPTIMIZED (4 nodes)
554         34: CLOSE1(36)
555         36: EOL(37)
556         37: END(0)
557
558       Here we can see a much more complex program, with various optimisations
559       in play. At regnode 10 we see an example where a character class with
560       only one character in it was turned into an "EXACT" node. We can also
561       see where an entire alternation was turned into a "TRIE-EXACT" node. As
562       a consequence, some of the regnodes have been marked as optimised away.
563       We can see that the "$" symbol has been converted into an "EOL" regop,
564       a special piece of code that looks for "\n" or the end of the string.
565
566       The next pointer for "BRANCH"es is interesting in that it points at
567       where execution should go if the branch fails. When executing, if the
568       engine tries to traverse from a branch to a "regnext" that isn't a
569       branch then the engine will know that the entire set of branches has
570       failed.
571
572       Peep-hole Optimisation and Analysis
573
574       The regular expression engine can be a weighty tool to wield. On long
575       strings and complex patterns it can end up having to do a lot of work
576       to find a match, and even more to decide that no match is possible.
577       Consider a situation like the following pattern.
578
579          'ababababababababababab' =~ /(a|b)*z/
580
581       The "(a|b)*" part can match at every char in the string, and then fail
582       every time because there is no "z" in the string. So obviously we can
583       avoid using the regex engine unless there is a "z" in the string.
584       Likewise in a pattern like:
585
586          /foo(\w+)bar/
587
588       In this case we know that the string must contain a "foo" which must be
589       followed by "bar". We can use Fast Boyer-Moore matching as implemented
590       in "fbm_instr()" to find the location of these strings. If they don't
591       exist then we don't need to resort to the much more expensive regex
592       engine.  Even better, if they do exist then we can use their positions
593       to reduce the search space that the regex engine needs to cover to
594       determine if the entire pattern matches.
595
596       There are various aspects of the pattern that can be used to facilitate
597       optimisations along these lines:
598
599       ·    anchored fixed strings
600
601       ·    floating fixed strings
602
603       ·    minimum and maximum length requirements
604
605       ·    start class
606
607       ·    Beginning/End of line positions
608
609       Another form of optimisation that can occur is the post-parse "peep-
610       hole" optimisation, where inefficient constructs are replaced by more
611       efficient constructs. The "TAIL" regops which are used during parsing
612       to mark the end of branches and the end of groups are examples of this.
613       These regops are used as place-holders during construction and "always
614       match" so they can be "optimised away" by making the things that point
615       to the "TAIL" point to the thing that "TAIL" points to, thus "skipping"
616       the node.
617
618       Another optimisation that can occur is that of ""EXACT" merging" which
619       is where two consecutive "EXACT" nodes are merged into a single regop.
620       An even more aggressive form of this is that a branch sequence of the
621       form "EXACT BRANCH ... EXACT" can be converted into a "TRIE-EXACT"
622       regop.
623
624       All of this occurs in the routine "study_chunk()" which uses a special
625       structure "scan_data_t" to store the analysis that it has performed,
626       and does the "peep-hole" optimisations as it goes.
627
628       The code involved in "study_chunk()" is extremely cryptic. Be careful.
629       :-)
630
631   Execution
632       Execution of a regex generally involves two phases, the first being
633       finding the start point in the string where we should match from, and
634       the second being running the regop interpreter.
635
636       If we can tell that there is no valid start point then we don't bother
637       running the interpreter at all. Likewise, if we know from the analysis
638       phase that we cannot detect a short-cut to the start position, we go
639       straight to the interpreter.
640
641       The two entry points are "re_intuit_start()" and "pregexec()". These
642       routines have a somewhat incestuous relationship with overlap between
643       their functions, and "pregexec()" may even call "re_intuit_start()" on
644       its own. Nevertheless other parts of the perl source code may call into
645       either, or both.
646
647       Execution of the interpreter itself used to be recursive, but thanks to
648       the efforts of Dave Mitchell in the 5.9.x development track, that has
649       changed: now an internal stack is maintained on the heap and the
650       routine is fully iterative. This can make it tricky as the code is
651       quite conservative about what state it stores, with the result that two
652       consecutive lines in the code can actually be running in totally
653       different contexts due to the simulated recursion.
654
655       Start position and no-match optimisations
656
657       "re_intuit_start()" is responsible for handling start points and no-
658       match optimisations as determined by the results of the analysis done
659       by "study_chunk()" (and described in "Peep-hole Optimisation and
660       Analysis").
661
662       The basic structure of this routine is to try to find the start- and/or
663       end-points of where the pattern could match, and to ensure that the
664       string is long enough to match the pattern. It tries to use more
665       efficient methods over less efficient methods and may involve
666       considerable cross-checking of constraints to find the place in the
667       string that matches.  For instance it may try to determine that a given
668       fixed string must be not only present but a certain number of chars
669       before the end of the string, or whatever.
670
671       It calls several other routines, such as "fbm_instr()" which does Fast
672       Boyer Moore matching and "find_byclass()" which is responsible for
673       finding the start using the first mandatory regop in the program.
674
675       When the optimisation criteria have been satisfied, "reg_try()" is
676       called to perform the match.
677
678       Program execution
679
680       "pregexec()" is the main entry point for running a regex. It contains
681       support for initialising the regex interpreter's state, running
682       "re_intuit_start()" if needed, and running the interpreter on the
683       string from various start positions as needed. When it is necessary to
684       use the regex interpreter "pregexec()" calls "regtry()".
685
686       "regtry()" is the entry point into the regex interpreter. It expects as
687       arguments a pointer to a "regmatch_info" structure and a pointer to a
688       string.  It returns an integer 1 for success and a 0 for failure.  It
689       is basically a set-up wrapper around "regmatch()".
690
691       "regmatch" is the main "recursive loop" of the interpreter. It is
692       basically a giant switch statement that implements a state machine,
693       where the possible states are the regops themselves, plus a number of
694       additional intermediate and failure states. A few of the states are
695       implemented as subroutines but the bulk are inline code.
696

MISCELLANEOUS

698   Unicode and Localisation Support
699       When dealing with strings containing characters that cannot be
700       represented using an eight-bit character set, perl uses an internal
701       representation that is a permissive version of Unicode's UTF-8
702       encoding[2]. This uses single bytes to represent characters from the
703       ASCII character set, and sequences of two or more bytes for all other
704       characters. (See perlunitut for more information about the relationship
705       between UTF-8 and perl's encoding, utf8. The difference isn't important
706       for this discussion.)
707
708       No matter how you look at it, Unicode support is going to be a pain in
709       a regex engine. Tricks that might be fine when you have 256 possible
710       characters often won't scale to handle the size of the UTF-8 character
711       set.  Things you can take for granted with ASCII may not be true with
712       Unicode. For instance, in ASCII, it is safe to assume that
713       "sizeof(char1) == sizeof(char2)", but in UTF-8 it isn't. Unicode case
714       folding is vastly more complex than the simple rules of ASCII, and even
715       when not using Unicode but only localised single byte encodings, things
716       can get tricky (for example, LATIN SMALL LETTER SHARP S (U+00DF, ss)
717       should match 'SS' in localised case-insensitive matching).
718
719       Making things worse is that UTF-8 support was a later addition to the
720       regex engine (as it was to perl) and this necessarily  made things a
721       lot more complicated. Obviously it is easier to design a regex engine
722       with Unicode support in mind from the beginning than it is to retrofit
723       it to one that wasn't.
724
725       Nearly all regops that involve looking at the input string have two
726       cases, one for UTF-8, and one not. In fact, it's often more complex
727       than that, as the pattern may be UTF-8 as well.
728
729       Care must be taken when making changes to make sure that you handle
730       UTF-8 properly, both at compile time and at execution time, including
731       when the string and pattern are mismatched.
732
733   Base Structures
734       The "regexp" structure described in perlreapi is common to all regex
735       engines. Two of its fields are intended for the private use of the
736       regex engine that compiled the pattern. These are the "intflags" and
737       pprivate members. The "pprivate" is a void pointer to an arbitrary
738       structure whose use and management is the responsibility of the
739       compiling engine. perl will never modify either of these values. In the
740       case of the stock engine the structure pointed to by "pprivate" is
741       called "regexp_internal".
742
743       Its "pprivate" and "intflags" fields contain data specific to each
744       engine.
745
746       There are two structures used to store a compiled regular expression.
747       One, the "regexp" structure described in perlreapi is populated by the
748       engine currently being. used and some of its fields read by perl to
749       implement things such as the stringification of "qr//".
750
751       The other structure is pointed to by the "regexp" struct's "pprivate"
752       and is in addition to "intflags" in the same struct considered to be
753       the property of the regex engine which compiled the regular expression;
754
755       The regexp structure contains all the data that perl needs to be aware
756       of to properly work with the regular expression. It includes data about
757       optimisations that perl can use to determine if the regex engine should
758       really be used, and various other control info that is needed to
759       properly execute patterns in various contexts such as is the pattern
760       anchored in some way, or what flags were used during the compile, or
761       whether the program contains special constructs that perl needs to be
762       aware of.
763
764       In addition it contains two fields that are intended for the private
765       use of the regex engine that compiled the pattern. These are the
766       "intflags" and pprivate members. The "pprivate" is a void pointer to an
767       arbitrary structure whose use and management is the responsibility of
768       the compiling engine. perl will never modify either of these values.
769
770       As mentioned earlier, in the case of the default engines, the
771       "pprivate" will be a pointer to a regexp_internal structure which holds
772       the compiled program and any additional data that is private to the
773       regex engine implementation.
774
775       Perl's "pprivate" structure
776
777       The following structure is used as the "pprivate" struct by perl's
778       regex engine. Since it is specific to perl it is only of curiosity
779       value to other engine implementations.
780
781        typedef struct regexp_internal {
782                U32 *offsets;           /* offset annotations 20001228 MJD
783                                         * data about mapping the program to
784                                         * the string*/
785                regnode *regstclass;    /* Optional startclass as identified or
786                                         * constructed by the optimiser */
787                struct reg_data *data;  /* Additional miscellaneous data used
788                                         * by the program.  Used to make it
789                                         * easier to clone and free arbitrary
790                                         * data that the regops need. Often the
791                                         * ARG field of a regop is an index
792                                         * into this structure */
793                regnode program[1];     /* Unwarranted chumminess with
794                                         * compiler. */
795        } regexp_internal;
796
797       "offsets"
798            Offsets holds a mapping of offset in the "program" to offset in
799            the "precomp" string. This is only used by ActiveState's visual
800            regex debugger.
801
802       "regstclass"
803            Special regop that is used by "re_intuit_start()" to check if a
804            pattern can match at a certain position. For instance if the regex
805            engine knows that the pattern must start with a 'Z' then it can
806            scan the string until it finds one and then launch the regex
807            engine from there. The routine that handles this is called
808            "find_by_class()". Sometimes this field points at a regop embedded
809            in the program, and sometimes it points at an independent
810            synthetic regop that has been constructed by the optimiser.
811
812       "data"
813            This field points at a "reg_data" structure, which is defined as
814            follows
815
816                struct reg_data {
817                    U32 count;
818                    U8 *what;
819                    void* data[1];
820                };
821
822            This structure is used for handling data structures that the regex
823            engine needs to handle specially during a clone or free operation
824            on the compiled product. Each element in the data array has a
825            corresponding element in the what array. During compilation regops
826            that need special structures stored will add an element to each
827            array using the add_data() routine and then store the index in the
828            regop.
829
830       "program"
831            Compiled program. Inlined into the structure so the entire struct
832            can be treated as a single blob.
833

SEE ALSO

835       perlreapi
836
837       perlre
838
839       perlunitut
840

AUTHOR

842       by Yves Orton, 2006.
843
844       With excerpts from Perl, and contributions and suggestions from Ronald
845       J. Kimball, Dave Mitchell, Dominic Dunlop, Mark Jason Dominus, Stephen
846       McCamant, and David Landgren.
847

LICENCE

849       Same terms as Perl.
850

REFERENCES

852       [1] <http://perl.plover.com/Rx/paper/>
853
854       [2] <http://www.unicode.org>
855
856
857
858perl v5.30.2                      2020-03-27                     PERLREGUTS(1)
Impressum