1PERLREGUTS(1) Perl Programmers Reference Guide PERLREGUTS(1)
2
3
4
6 perlreguts - Description of the Perl regular expression engine.
7
9 This document is an attempt to shine some light on the guts of the
10 regex engine and how it works. The regex engine represents a
11 significant chunk of the perl codebase, but is relatively poorly
12 understood. This document is a meagre attempt at addressing this
13 situation. It is derived from the author's experience, comments in the
14 source code, other papers on the regex engine, feedback on the
15 perl5-porters mail list, and no doubt other places as well.
16
17 NOTICE! It should be clearly understood that the behavior and
18 structures discussed in this represents the state of the engine as the
19 author understood it at the time of writing. It is NOT an API
20 definition, it is purely an internals guide for those who want to hack
21 the regex engine, or understand how the regex engine works. Readers of
22 this document are expected to understand perl's regex syntax and its
23 usage in detail. If you want to learn about the basics of Perl's
24 regular expressions, see perlre. And if you want to replace the regex
25 engine with your own, see perlreapi.
26
28 A quick note on terms
29 There is some debate as to whether to say "regexp" or "regex". In this
30 document we will use the term "regex" unless there is a special reason
31 not to, in which case we will explain why.
32
33 When speaking about regexes we need to distinguish between their source
34 code form and their internal form. In this document we will use the
35 term "pattern" when we speak of their textual, source code form, and
36 the term "program" when we speak of their internal representation.
37 These correspond to the terms S-regex and B-regex that Mark Jason
38 Dominus employs in his paper on "Rx" ([1] in "REFERENCES").
39
40 What is a regular expression engine?
41 A regular expression engine is a program that takes a set of
42 constraints specified in a mini-language, and then applies those
43 constraints to a target string, and determines whether or not the
44 string satisfies the constraints. See perlre for a full definition of
45 the language.
46
47 In less grandiose terms, the first part of the job is to turn a pattern
48 into something the computer can efficiently use to find the matching
49 point in the string, and the second part is performing the search
50 itself.
51
52 To do this we need to produce a program by parsing the text. We then
53 need to execute the program to find the point in the string that
54 matches. And we need to do the whole thing efficiently.
55
56 Structure of a Regexp Program
57 High Level
58
59 Although it is a bit confusing and some people object to the
60 terminology, it is worth taking a look at a comment that has been in
61 regexp.h for years:
62
63 This is essentially a linear encoding of a nondeterministic finite-
64 state machine (aka syntax charts or "railroad normal form" in parsing
65 technology).
66
67 The term "railroad normal form" is a bit esoteric, with "syntax
68 diagram/charts", or "railroad diagram/charts" being more common terms.
69 Nevertheless it provides a useful mental image of a regex program: each
70 node can be thought of as a unit of track, with a single entry and in
71 most cases a single exit point (there are pieces of track that fork,
72 but statistically not many), and the whole forms a layout with a single
73 entry and single exit point. The matching process can be thought of as
74 a car that moves along the track, with the particular route through the
75 system being determined by the character read at each possible
76 connector point. A car can fall off the track at any point but it may
77 only proceed as long as it matches the track.
78
79 Thus the pattern "/foo(?:\w+|\d+|\s+)bar/" can be thought of as the
80 following chart:
81
82 [start]
83 |
84 <foo>
85 |
86 +-----+-----+
87 | | |
88 <\w+> <\d+> <\s+>
89 | | |
90 +-----+-----+
91 |
92 <bar>
93 |
94 [end]
95
96 The truth of the matter is that perl's regular expressions these days
97 are much more complex than this kind of structure, but visualising it
98 this way can help when trying to get your bearings, and it matches the
99 current implementation pretty closely.
100
101 To be more precise, we will say that a regex program is an encoding of
102 a graph. Each node in the graph corresponds to part of the original
103 regex pattern, such as a literal string or a branch, and has a pointer
104 to the nodes representing the next component to be matched. Since
105 "node" and "opcode" already have other meanings in the perl source, we
106 will call the nodes in a regex program "regops".
107
108 The program is represented by an array of "regnode" structures, one or
109 more of which represent a single regop of the program. Struct "regnode"
110 is the smallest struct needed, and has a field structure which is
111 shared with all the other larger structures.
112
113 The "next" pointers of all regops except "BRANCH" implement
114 concatenation; a "next" pointer with a "BRANCH" on both ends of it is
115 connecting two alternatives. [Here we have one of the subtle syntax
116 dependencies: an individual "BRANCH" (as opposed to a collection of
117 them) is never concatenated with anything because of operator
118 precedence.]
119
120 The operand of some types of regop is a literal string; for others, it
121 is a regop leading into a sub-program. In particular, the operand of a
122 "BRANCH" node is the first regop of the branch.
123
124 NOTE: As the railroad metaphor suggests, this is not a tree structure:
125 the tail of the branch connects to the thing following the set of
126 "BRANCH"es. It is a like a single line of railway track that splits as
127 it goes into a station or railway yard and rejoins as it comes out the
128 other side.
129
130 Regops
131
132 The base structure of a regop is defined in regexp.h as follows:
133
134 struct regnode {
135 U8 flags; /* Various purposes, sometimes overridden */
136 U8 type; /* Opcode value as specified by regnodes.h */
137 U16 next_off; /* Offset in size regnode */
138 };
139
140 Other larger "regnode"-like structures are defined in regcomp.h. They
141 are almost like subclasses in that they have the same fields as
142 "regnode", with possibly additional fields following in the structure,
143 and in some cases the specific meaning (and name) of some of base
144 fields are overridden. The following is a more complete description.
145
146 "regnode_1"
147 "regnode_2"
148 "regnode_1" structures have the same header, followed by a single
149 four-byte argument; "regnode_2" structures contain two two-byte
150 arguments instead:
151
152 regnode_1 U32 arg1;
153 regnode_2 U16 arg1; U16 arg2;
154
155 "regnode_string"
156 "regnode_string" structures, used for literal strings, follow the
157 header with a one-byte length and then the string data. Strings are
158 padded on the end with zero bytes so that the total length of the
159 node is a multiple of four bytes:
160
161 regnode_string char string[1];
162 U8 str_len; /* overrides flags */
163
164 "regnode_charclass"
165 Bracketed character classes are represented by "regnode_charclass"
166 structures, which have a four-byte argument and then a 32-byte
167 (256-bit) bitmap indicating which characters in the Latin1 range
168 are included in the class.
169
170 regnode_charclass U32 arg1;
171 char bitmap[ANYOF_BITMAP_SIZE];
172
173 Various flags whose names begin with "ANYOF_" are used for special
174 situations. Above Latin1 matches and things not known until run-
175 time are stored in "Perl's pprivate structure".
176
177 "regnode_charclass_posixl"
178 There is also a larger form of a char class structure used to
179 represent POSIX char classes under "/l" matching, called
180 "regnode_charclass_posixl" which has an additional 32-bit bitmap
181 indicating which POSIX char classes have been included.
182
183 regnode_charclass_posixl U32 arg1;
184 char bitmap[ANYOF_BITMAP_SIZE];
185 U32 classflags;
186
187 regnodes.h defines an array called "regarglen[]" which gives the size
188 of each opcode in units of "size regnode" (4-byte). A macro is used to
189 calculate the size of an "EXACT" node based on its "str_len" field.
190
191 The regops are defined in regnodes.h which is generated from
192 regcomp.sym by regcomp.pl. Currently the maximum possible number of
193 distinct regops is restricted to 256, with about a quarter already
194 used.
195
196 A set of macros makes accessing the fields easier and more consistent.
197 These include "OP()", which is used to determine the type of a
198 "regnode"-like structure; "NEXT_OFF()", which is the offset to the next
199 node (more on this later); "ARG()", "ARG1()", "ARG2()", "ARG_SET()",
200 and equivalents for reading and setting the arguments; and "STR_LEN()",
201 "STRING()" and "OPERAND()" for manipulating strings and regop bearing
202 types.
203
204 What regop is next?
205
206 There are three distinct concepts of "next" in the regex engine, and it
207 is important to keep them clear.
208
209 · There is the "next regnode" from a given regnode, a value which is
210 rarely useful except that sometimes it matches up in terms of value
211 with one of the others, and that sometimes the code assumes this to
212 always be so.
213
214 · There is the "next regop" from a given regop/regnode. This is the
215 regop physically located after the current one, as determined by
216 the size of the current regop. This is often useful, such as when
217 dumping the structure we use this order to traverse. Sometimes the
218 code assumes that the "next regnode" is the same as the "next
219 regop", or in other words assumes that the sizeof a given regop
220 type is always going to be one regnode large.
221
222 · There is the "regnext" from a given regop. This is the regop which
223 is reached by jumping forward by the value of "NEXT_OFF()", or in a
224 few cases for longer jumps by the "arg1" field of the "regnode_1"
225 structure. The subroutine "regnext()" handles this transparently.
226 This is the logical successor of the node, which in some cases,
227 like that of the "BRANCH" regop, has special meaning.
228
230 Broadly speaking, performing a match of a string against a pattern
231 involves the following steps:
232
233 A. Compilation
234 1. Parsing for size
235 2. Parsing for construction
236 3. Peep-hole optimisation and analysis
237 B. Execution
238 4. Start position and no-match optimisations
239 5. Program execution
240
241 Where these steps occur in the actual execution of a perl program is
242 determined by whether the pattern involves interpolating any string
243 variables. If interpolation occurs, then compilation happens at run
244 time. If it does not, then compilation is performed at compile time.
245 (The "/o" modifier changes this, as does "qr//" to a certain extent.)
246 The engine doesn't really care that much.
247
248 Compilation
249 This code resides primarily in regcomp.c, along with the header files
250 regcomp.h, regexp.h and regnodes.h.
251
252 Compilation starts with "pregcomp()", which is mostly an initialisation
253 wrapper which farms work out to two other routines for the heavy
254 lifting: the first is "reg()", which is the start point for parsing;
255 the second, "study_chunk()", is responsible for optimisation.
256
257 Initialisation in "pregcomp()" mostly involves the creation and data-
258 filling of a special structure, "RExC_state_t" (defined in regcomp.c).
259 Almost all internally-used routines in regcomp.h take a pointer to one
260 of these structures as their first argument, with the name
261 "pRExC_state". This structure is used to store the compilation state
262 and contains many fields. Likewise there are many macros which operate
263 on this variable: anything that looks like "RExC_xxxx" is a macro that
264 operates on this pointer/structure.
265
266 Parsing for size
267
268 In this pass the input pattern is parsed in order to calculate how much
269 space is needed for each regop we would need to emit. The size is also
270 used to determine whether long jumps will be required in the program.
271
272 This stage is controlled by the macro "SIZE_ONLY" being set.
273
274 The parse proceeds pretty much exactly as it does during the
275 construction phase, except that most routines are short-circuited to
276 change the size field "RExC_size" and not do anything else.
277
278 Parsing for construction
279
280 Once the size of the program has been determined, the pattern is parsed
281 again, but this time for real. Now "SIZE_ONLY" will be false, and the
282 actual construction can occur.
283
284 "reg()" is the start of the parse process. It is responsible for
285 parsing an arbitrary chunk of pattern up to either the end of the
286 string, or the first closing parenthesis it encounters in the pattern.
287 This means it can be used to parse the top-level regex, or any section
288 inside of a grouping parenthesis. It also handles the "special parens"
289 that perl's regexes have. For instance when parsing "/x(?:foo)y/"
290 "reg()" will at one point be called to parse from the "?" symbol up to
291 and including the ")".
292
293 Additionally, "reg()" is responsible for parsing the one or more
294 branches from the pattern, and for "finishing them off" by correctly
295 setting their next pointers. In order to do the parsing, it repeatedly
296 calls out to "regbranch()", which is responsible for handling up to the
297 first "|" symbol it sees.
298
299 "regbranch()" in turn calls "regpiece()" which handles "things"
300 followed by a quantifier. In order to parse the "things", "regatom()"
301 is called. This is the lowest level routine, which parses out constant
302 strings, character classes, and the various special symbols like "$".
303 If "regatom()" encounters a "(" character it in turn calls "reg()".
304
305 The routine "regtail()" is called by both "reg()" and "regbranch()" in
306 order to "set the tail pointer" correctly. When executing and we get to
307 the end of a branch, we need to go to the node following the grouping
308 parens. When parsing, however, we don't know where the end will be
309 until we get there, so when we do we must go back and update the
310 offsets as appropriate. "regtail" is used to make this easier.
311
312 A subtlety of the parsing process means that a regex like "/foo/" is
313 originally parsed into an alternation with a single branch. It is only
314 afterwards that the optimiser converts single branch alternations into
315 the simpler form.
316
317 Parse Call Graph and a Grammar
318
319 The call graph looks like this:
320
321 reg() # parse a top level regex, or inside of
322 # parens
323 regbranch() # parse a single branch of an alternation
324 regpiece() # parse a pattern followed by a quantifier
325 regatom() # parse a simple pattern
326 regclass() # used to handle a class
327 reg() # used to handle a parenthesised
328 # subpattern
329 ....
330 ...
331 regtail() # finish off the branch
332 ...
333 regtail() # finish off the branch sequence. Tie each
334 # branch's tail to the tail of the
335 # sequence
336 # (NEW) In Debug mode this is
337 # regtail_study().
338
339 A grammar form might be something like this:
340
341 atom : constant | class
342 quant : '*' | '+' | '?' | '{min,max}'
343 _branch: piece
344 | piece _branch
345 | nothing
346 branch: _branch
347 | _branch '|' branch
348 group : '(' branch ')'
349 _piece: atom | group
350 piece : _piece
351 | _piece quant
352
353 Parsing complications
354
355 The implication of the above description is that a pattern containing
356 nested parentheses will result in a call graph which cycles through
357 "reg()", "regbranch()", "regpiece()", "regatom()", "reg()",
358 "regbranch()" etc multiple times, until the deepest level of nesting is
359 reached. All the above routines return a pointer to a "regnode", which
360 is usually the last regnode added to the program. However, one
361 complication is that reg() returns NULL for parsing "(?:)" syntax for
362 embedded modifiers, setting the flag "TRYAGAIN". The "TRYAGAIN"
363 propagates upwards until it is captured, in some cases by "regatom()",
364 but otherwise unconditionally by "regbranch()". Hence it will never be
365 returned by "regbranch()" to "reg()". This flag permits patterns such
366 as "(?i)+" to be detected as errors (Quantifier follows nothing in
367 regex; marked by <-- HERE in m/(?i)+ <-- HERE /).
368
369 Another complication is that the representation used for the program
370 differs if it needs to store Unicode, but it's not always possible to
371 know for sure whether it does until midway through parsing. The Unicode
372 representation for the program is larger, and cannot be matched as
373 efficiently. (See "Unicode and Localisation Support" below for more
374 details as to why.) If the pattern contains literal Unicode, it's
375 obvious that the program needs to store Unicode. Otherwise, the parser
376 optimistically assumes that the more efficient representation can be
377 used, and starts sizing on this basis. However, if it then encounters
378 something in the pattern which must be stored as Unicode, such as an
379 "\x{...}" escape sequence representing a character literal, then this
380 means that all previously calculated sizes need to be redone, using
381 values appropriate for the Unicode representation. Currently, all
382 regular expression constructions which can trigger this are parsed by
383 code in "regatom()".
384
385 To avoid wasted work when a restart is needed, the sizing pass is
386 abandoned - "regatom()" immediately returns NULL, setting the flag
387 "RESTART_UTF8". (This action is encapsulated using the macro
388 "REQUIRE_UTF8".) This restart request is propagated up the call chain
389 in a similar fashion, until it is "caught" in "Perl_re_op_compile()",
390 which marks the pattern as containing Unicode, and restarts the sizing
391 pass. It is also possible for constructions within run-time code blocks
392 to turn out to need Unicode representation., which is signalled by
393 "S_compile_runtime_code()" returning false to "Perl_re_op_compile()".
394
395 The restart was previously implemented using a "longjmp" in "regatom()"
396 back to a "setjmp" in "Perl_re_op_compile()", but this proved to be
397 problematic as the latter is a large function containing many automatic
398 variables, which interact badly with the emergent control flow of
399 "setjmp".
400
401 Debug Output
402
403 In the 5.9.x development version of perl you can "use re Debug =>
404 'PARSE'" to see some trace information about the parse process. We will
405 start with some simple patterns and build up to more complex patterns.
406
407 So when we parse "/foo/" we see something like the following table. The
408 left shows what is being parsed, and the number indicates where the
409 next regop would go. The stuff on the right is the trace output of the
410 graph. The names are chosen to be short to make it less dense on the
411 screen. 'tsdy' is a special form of "regtail()" which does some extra
412 analysis.
413
414 >foo< 1 reg
415 brnc
416 piec
417 atom
418 >< 4 tsdy~ EXACT <foo> (EXACT) (1)
419 ~ attach to END (3) offset to 2
420
421 The resulting program then looks like:
422
423 1: EXACT <foo>(3)
424 3: END(0)
425
426 As you can see, even though we parsed out a branch and a piece, it was
427 ultimately only an atom. The final program shows us how things work. We
428 have an "EXACT" regop, followed by an "END" regop. The number in parens
429 indicates where the "regnext" of the node goes. The "regnext" of an
430 "END" regop is unused, as "END" regops mean we have successfully
431 matched. The number on the left indicates the position of the regop in
432 the regnode array.
433
434 Now let's try a harder pattern. We will add a quantifier, so now we
435 have the pattern "/foo+/". We will see that "regbranch()" calls
436 "regpiece()" twice.
437
438 >foo+< 1 reg
439 brnc
440 piec
441 atom
442 >o+< 3 piec
443 atom
444 >< 6 tail~ EXACT <fo> (1)
445 7 tsdy~ EXACT <fo> (EXACT) (1)
446 ~ PLUS (END) (3)
447 ~ attach to END (6) offset to 3
448
449 And we end up with the program:
450
451 1: EXACT <fo>(3)
452 3: PLUS(6)
453 4: EXACT <o>(0)
454 6: END(0)
455
456 Now we have a special case. The "EXACT" regop has a "regnext" of 0.
457 This is because if it matches it should try to match itself again. The
458 "PLUS" regop handles the actual failure of the "EXACT" regop and acts
459 appropriately (going to regnode 6 if the "EXACT" matched at least once,
460 or failing if it didn't).
461
462 Now for something much more complex: "/x(?:foo*|b[a][rR])(foo|bar)$/"
463
464 >x(?:foo*|b... 1 reg
465 brnc
466 piec
467 atom
468 >(?:foo*|b[... 3 piec
469 atom
470 >?:foo*|b[a... reg
471 >foo*|b[a][... brnc
472 piec
473 atom
474 >o*|b[a][rR... 5 piec
475 atom
476 >|b[a][rR])... 8 tail~ EXACT <fo> (3)
477 >b[a][rR])(... 9 brnc
478 10 piec
479 atom
480 >[a][rR])(f... 12 piec
481 atom
482 >a][rR])(fo... clas
483 >[rR])(foo|... 14 tail~ EXACT <b> (10)
484 piec
485 atom
486 >rR])(foo|b... clas
487 >)(foo|bar)... 25 tail~ EXACT <a> (12)
488 tail~ BRANCH (3)
489 26 tsdy~ BRANCH (END) (9)
490 ~ attach to TAIL (25) offset to 16
491 tsdy~ EXACT <fo> (EXACT) (4)
492 ~ STAR (END) (6)
493 ~ attach to TAIL (25) offset to 19
494 tsdy~ EXACT <b> (EXACT) (10)
495 ~ EXACT <a> (EXACT) (12)
496 ~ ANYOF[Rr] (END) (14)
497 ~ attach to TAIL (25) offset to 11
498 >(foo|bar)$< tail~ EXACT <x> (1)
499 piec
500 atom
501 >foo|bar)$< reg
502 28 brnc
503 piec
504 atom
505 >|bar)$< 31 tail~ OPEN1 (26)
506 >bar)$< brnc
507 32 piec
508 atom
509 >)$< 34 tail~ BRANCH (28)
510 36 tsdy~ BRANCH (END) (31)
511 ~ attach to CLOSE1 (34) offset to 3
512 tsdy~ EXACT <foo> (EXACT) (29)
513 ~ attach to CLOSE1 (34) offset to 5
514 tsdy~ EXACT <bar> (EXACT) (32)
515 ~ attach to CLOSE1 (34) offset to 2
516 >$< tail~ BRANCH (3)
517 ~ BRANCH (9)
518 ~ TAIL (25)
519 piec
520 atom
521 >< 37 tail~ OPEN1 (26)
522 ~ BRANCH (28)
523 ~ BRANCH (31)
524 ~ CLOSE1 (34)
525 38 tsdy~ EXACT <x> (EXACT) (1)
526 ~ BRANCH (END) (3)
527 ~ BRANCH (END) (9)
528 ~ TAIL (END) (25)
529 ~ OPEN1 (END) (26)
530 ~ BRANCH (END) (28)
531 ~ BRANCH (END) (31)
532 ~ CLOSE1 (END) (34)
533 ~ EOL (END) (36)
534 ~ attach to END (37) offset to 1
535
536 Resulting in the program
537
538 1: EXACT <x>(3)
539 3: BRANCH(9)
540 4: EXACT <fo>(6)
541 6: STAR(26)
542 7: EXACT <o>(0)
543 9: BRANCH(25)
544 10: EXACT <ba>(14)
545 12: OPTIMIZED (2 nodes)
546 14: ANYOF[Rr](26)
547 25: TAIL(26)
548 26: OPEN1(28)
549 28: TRIE-EXACT(34)
550 [StS:1 Wds:2 Cs:6 Uq:5 #Sts:7 Mn:3 Mx:3 Stcls:bf]
551 <foo>
552 <bar>
553 30: OPTIMIZED (4 nodes)
554 34: CLOSE1(36)
555 36: EOL(37)
556 37: END(0)
557
558 Here we can see a much more complex program, with various optimisations
559 in play. At regnode 10 we see an example where a character class with
560 only one character in it was turned into an "EXACT" node. We can also
561 see where an entire alternation was turned into a "TRIE-EXACT" node. As
562 a consequence, some of the regnodes have been marked as optimised away.
563 We can see that the "$" symbol has been converted into an "EOL" regop,
564 a special piece of code that looks for "\n" or the end of the string.
565
566 The next pointer for "BRANCH"es is interesting in that it points at
567 where execution should go if the branch fails. When executing, if the
568 engine tries to traverse from a branch to a "regnext" that isn't a
569 branch then the engine will know that the entire set of branches has
570 failed.
571
572 Peep-hole Optimisation and Analysis
573
574 The regular expression engine can be a weighty tool to wield. On long
575 strings and complex patterns it can end up having to do a lot of work
576 to find a match, and even more to decide that no match is possible.
577 Consider a situation like the following pattern.
578
579 'ababababababababababab' =~ /(a|b)*z/
580
581 The "(a|b)*" part can match at every char in the string, and then fail
582 every time because there is no "z" in the string. So obviously we can
583 avoid using the regex engine unless there is a "z" in the string.
584 Likewise in a pattern like:
585
586 /foo(\w+)bar/
587
588 In this case we know that the string must contain a "foo" which must be
589 followed by "bar". We can use Fast Boyer-Moore matching as implemented
590 in "fbm_instr()" to find the location of these strings. If they don't
591 exist then we don't need to resort to the much more expensive regex
592 engine. Even better, if they do exist then we can use their positions
593 to reduce the search space that the regex engine needs to cover to
594 determine if the entire pattern matches.
595
596 There are various aspects of the pattern that can be used to facilitate
597 optimisations along these lines:
598
599 · anchored fixed strings
600
601 · floating fixed strings
602
603 · minimum and maximum length requirements
604
605 · start class
606
607 · Beginning/End of line positions
608
609 Another form of optimisation that can occur is the post-parse "peep-
610 hole" optimisation, where inefficient constructs are replaced by more
611 efficient constructs. The "TAIL" regops which are used during parsing
612 to mark the end of branches and the end of groups are examples of this.
613 These regops are used as place-holders during construction and "always
614 match" so they can be "optimised away" by making the things that point
615 to the "TAIL" point to the thing that "TAIL" points to, thus "skipping"
616 the node.
617
618 Another optimisation that can occur is that of ""EXACT" merging" which
619 is where two consecutive "EXACT" nodes are merged into a single regop.
620 An even more aggressive form of this is that a branch sequence of the
621 form "EXACT BRANCH ... EXACT" can be converted into a "TRIE-EXACT"
622 regop.
623
624 All of this occurs in the routine "study_chunk()" which uses a special
625 structure "scan_data_t" to store the analysis that it has performed,
626 and does the "peep-hole" optimisations as it goes.
627
628 The code involved in "study_chunk()" is extremely cryptic. Be careful.
629 :-)
630
631 Execution
632 Execution of a regex generally involves two phases, the first being
633 finding the start point in the string where we should match from, and
634 the second being running the regop interpreter.
635
636 If we can tell that there is no valid start point then we don't bother
637 running the interpreter at all. Likewise, if we know from the analysis
638 phase that we cannot detect a short-cut to the start position, we go
639 straight to the interpreter.
640
641 The two entry points are "re_intuit_start()" and "pregexec()". These
642 routines have a somewhat incestuous relationship with overlap between
643 their functions, and "pregexec()" may even call "re_intuit_start()" on
644 its own. Nevertheless other parts of the perl source code may call into
645 either, or both.
646
647 Execution of the interpreter itself used to be recursive, but thanks to
648 the efforts of Dave Mitchell in the 5.9.x development track, that has
649 changed: now an internal stack is maintained on the heap and the
650 routine is fully iterative. This can make it tricky as the code is
651 quite conservative about what state it stores, with the result that two
652 consecutive lines in the code can actually be running in totally
653 different contexts due to the simulated recursion.
654
655 Start position and no-match optimisations
656
657 "re_intuit_start()" is responsible for handling start points and no-
658 match optimisations as determined by the results of the analysis done
659 by "study_chunk()" (and described in "Peep-hole Optimisation and
660 Analysis").
661
662 The basic structure of this routine is to try to find the start- and/or
663 end-points of where the pattern could match, and to ensure that the
664 string is long enough to match the pattern. It tries to use more
665 efficient methods over less efficient methods and may involve
666 considerable cross-checking of constraints to find the place in the
667 string that matches. For instance it may try to determine that a given
668 fixed string must be not only present but a certain number of chars
669 before the end of the string, or whatever.
670
671 It calls several other routines, such as "fbm_instr()" which does Fast
672 Boyer Moore matching and "find_byclass()" which is responsible for
673 finding the start using the first mandatory regop in the program.
674
675 When the optimisation criteria have been satisfied, "reg_try()" is
676 called to perform the match.
677
678 Program execution
679
680 "pregexec()" is the main entry point for running a regex. It contains
681 support for initialising the regex interpreter's state, running
682 "re_intuit_start()" if needed, and running the interpreter on the
683 string from various start positions as needed. When it is necessary to
684 use the regex interpreter "pregexec()" calls "regtry()".
685
686 "regtry()" is the entry point into the regex interpreter. It expects as
687 arguments a pointer to a "regmatch_info" structure and a pointer to a
688 string. It returns an integer 1 for success and a 0 for failure. It
689 is basically a set-up wrapper around "regmatch()".
690
691 "regmatch" is the main "recursive loop" of the interpreter. It is
692 basically a giant switch statement that implements a state machine,
693 where the possible states are the regops themselves, plus a number of
694 additional intermediate and failure states. A few of the states are
695 implemented as subroutines but the bulk are inline code.
696
698 Unicode and Localisation Support
699 When dealing with strings containing characters that cannot be
700 represented using an eight-bit character set, perl uses an internal
701 representation that is a permissive version of Unicode's UTF-8
702 encoding[2]. This uses single bytes to represent characters from the
703 ASCII character set, and sequences of two or more bytes for all other
704 characters. (See perlunitut for more information about the relationship
705 between UTF-8 and perl's encoding, utf8. The difference isn't important
706 for this discussion.)
707
708 No matter how you look at it, Unicode support is going to be a pain in
709 a regex engine. Tricks that might be fine when you have 256 possible
710 characters often won't scale to handle the size of the UTF-8 character
711 set. Things you can take for granted with ASCII may not be true with
712 Unicode. For instance, in ASCII, it is safe to assume that
713 "sizeof(char1) == sizeof(char2)", but in UTF-8 it isn't. Unicode case
714 folding is vastly more complex than the simple rules of ASCII, and even
715 when not using Unicode but only localised single byte encodings, things
716 can get tricky (for example, LATIN SMALL LETTER SHARP S (U+00DF, ss)
717 should match 'SS' in localised case-insensitive matching).
718
719 Making things worse is that UTF-8 support was a later addition to the
720 regex engine (as it was to perl) and this necessarily made things a
721 lot more complicated. Obviously it is easier to design a regex engine
722 with Unicode support in mind from the beginning than it is to retrofit
723 it to one that wasn't.
724
725 Nearly all regops that involve looking at the input string have two
726 cases, one for UTF-8, and one not. In fact, it's often more complex
727 than that, as the pattern may be UTF-8 as well.
728
729 Care must be taken when making changes to make sure that you handle
730 UTF-8 properly, both at compile time and at execution time, including
731 when the string and pattern are mismatched.
732
733 Base Structures
734 The "regexp" structure described in perlreapi is common to all regex
735 engines. Two of its fields are intended for the private use of the
736 regex engine that compiled the pattern. These are the "intflags" and
737 pprivate members. The "pprivate" is a void pointer to an arbitrary
738 structure whose use and management is the responsibility of the
739 compiling engine. perl will never modify either of these values. In the
740 case of the stock engine the structure pointed to by "pprivate" is
741 called "regexp_internal".
742
743 Its "pprivate" and "intflags" fields contain data specific to each
744 engine.
745
746 There are two structures used to store a compiled regular expression.
747 One, the "regexp" structure described in perlreapi is populated by the
748 engine currently being. used and some of its fields read by perl to
749 implement things such as the stringification of "qr//".
750
751 The other structure is pointed to by the "regexp" struct's "pprivate"
752 and is in addition to "intflags" in the same struct considered to be
753 the property of the regex engine which compiled the regular expression;
754
755 The regexp structure contains all the data that perl needs to be aware
756 of to properly work with the regular expression. It includes data about
757 optimisations that perl can use to determine if the regex engine should
758 really be used, and various other control info that is needed to
759 properly execute patterns in various contexts such as is the pattern
760 anchored in some way, or what flags were used during the compile, or
761 whether the program contains special constructs that perl needs to be
762 aware of.
763
764 In addition it contains two fields that are intended for the private
765 use of the regex engine that compiled the pattern. These are the
766 "intflags" and pprivate members. The "pprivate" is a void pointer to an
767 arbitrary structure whose use and management is the responsibility of
768 the compiling engine. perl will never modify either of these values.
769
770 As mentioned earlier, in the case of the default engines, the
771 "pprivate" will be a pointer to a regexp_internal structure which holds
772 the compiled program and any additional data that is private to the
773 regex engine implementation.
774
775 Perl's "pprivate" structure
776
777 The following structure is used as the "pprivate" struct by perl's
778 regex engine. Since it is specific to perl it is only of curiosity
779 value to other engine implementations.
780
781 typedef struct regexp_internal {
782 U32 *offsets; /* offset annotations 20001228 MJD
783 * data about mapping the program to
784 * the string*/
785 regnode *regstclass; /* Optional startclass as identified or
786 * constructed by the optimiser */
787 struct reg_data *data; /* Additional miscellaneous data used
788 * by the program. Used to make it
789 * easier to clone and free arbitrary
790 * data that the regops need. Often the
791 * ARG field of a regop is an index
792 * into this structure */
793 regnode program[1]; /* Unwarranted chumminess with
794 * compiler. */
795 } regexp_internal;
796
797 "offsets"
798 Offsets holds a mapping of offset in the "program" to offset in
799 the "precomp" string. This is only used by ActiveState's visual
800 regex debugger.
801
802 "regstclass"
803 Special regop that is used by "re_intuit_start()" to check if a
804 pattern can match at a certain position. For instance if the regex
805 engine knows that the pattern must start with a 'Z' then it can
806 scan the string until it finds one and then launch the regex
807 engine from there. The routine that handles this is called
808 "find_by_class()". Sometimes this field points at a regop embedded
809 in the program, and sometimes it points at an independent
810 synthetic regop that has been constructed by the optimiser.
811
812 "data"
813 This field points at a "reg_data" structure, which is defined as
814 follows
815
816 struct reg_data {
817 U32 count;
818 U8 *what;
819 void* data[1];
820 };
821
822 This structure is used for handling data structures that the regex
823 engine needs to handle specially during a clone or free operation
824 on the compiled product. Each element in the data array has a
825 corresponding element in the what array. During compilation regops
826 that need special structures stored will add an element to each
827 array using the add_data() routine and then store the index in the
828 regop.
829
830 "program"
831 Compiled program. Inlined into the structure so the entire struct
832 can be treated as a single blob.
833
835 perlreapi
836
837 perlre
838
839 perlunitut
840
842 by Yves Orton, 2006.
843
844 With excerpts from Perl, and contributions and suggestions from Ronald
845 J. Kimball, Dave Mitchell, Dominic Dunlop, Mark Jason Dominus, Stephen
846 McCamant, and David Landgren.
847
849 Same terms as Perl.
850
852 [1] <http://perl.plover.com/Rx/paper/>
853
854 [2] <http://www.unicode.org>
855
856
857
858perl v5.30.1 2019-11-29 PERLREGUTS(1)