1PERLREGUTS(1) Perl Programmers Reference Guide PERLREGUTS(1)
2
3
4
6 perlreguts - Description of the Perl regular expression engine.
7
9 This document is an attempt to shine some light on the guts of the
10 regex engine and how it works. The regex engine represents a
11 significant chunk of the perl codebase, but is relatively poorly
12 understood. This document is a meagre attempt at addressing this
13 situation. It is derived from the author's experience, comments in the
14 source code, other papers on the regex engine, feedback on the
15 perl5-porters mail list, and no doubt other places as well.
16
17 NOTICE! It should be clearly understood that the behavior and
18 structures discussed in this represents the state of the engine as the
19 author understood it at the time of writing. It is NOT an API
20 definition, it is purely an internals guide for those who want to hack
21 the regex engine, or understand how the regex engine works. Readers of
22 this document are expected to understand perl's regex syntax and its
23 usage in detail. If you want to learn about the basics of Perl's
24 regular expressions, see perlre. And if you want to replace the regex
25 engine with your own, see perlreapi.
26
28 A quick note on terms
29 There is some debate as to whether to say "regexp" or "regex". In this
30 document we will use the term "regex" unless there is a special reason
31 not to, in which case we will explain why.
32
33 When speaking about regexes we need to distinguish between their source
34 code form and their internal form. In this document we will use the
35 term "pattern" when we speak of their textual, source code form, and
36 the term "program" when we speak of their internal representation.
37 These correspond to the terms S-regex and B-regex that Mark Jason
38 Dominus employs in his paper on "Rx" ([1] in "REFERENCES").
39
40 What is a regular expression engine?
41 A regular expression engine is a program that takes a set of
42 constraints specified in a mini-language, and then applies those
43 constraints to a target string, and determines whether or not the
44 string satisfies the constraints. See perlre for a full definition of
45 the language.
46
47 In less grandiose terms, the first part of the job is to turn a pattern
48 into something the computer can efficiently use to find the matching
49 point in the string, and the second part is performing the search
50 itself.
51
52 To do this we need to produce a program by parsing the text. We then
53 need to execute the program to find the point in the string that
54 matches. And we need to do the whole thing efficiently.
55
56 Structure of a Regexp Program
57 High Level
58
59 Although it is a bit confusing and some people object to the
60 terminology, it is worth taking a look at a comment that has been in
61 regexp.h for years:
62
63 This is essentially a linear encoding of a nondeterministic finite-
64 state machine (aka syntax charts or "railroad normal form" in parsing
65 technology).
66
67 The term "railroad normal form" is a bit esoteric, with "syntax
68 diagram/charts", or "railroad diagram/charts" being more common terms.
69 Nevertheless it provides a useful mental image of a regex program: each
70 node can be thought of as a unit of track, with a single entry and in
71 most cases a single exit point (there are pieces of track that fork,
72 but statistically not many), and the whole forms a layout with a single
73 entry and single exit point. The matching process can be thought of as
74 a car that moves along the track, with the particular route through the
75 system being determined by the character read at each possible
76 connector point. A car can fall off the track at any point but it may
77 only proceed as long as it matches the track.
78
79 Thus the pattern "/foo(?:\w+|\d+|\s+)bar/" can be thought of as the
80 following chart:
81
82 [start]
83 |
84 <foo>
85 |
86 +-----+-----+
87 | | |
88 <\w+> <\d+> <\s+>
89 | | |
90 +-----+-----+
91 |
92 <bar>
93 |
94 [end]
95
96 The truth of the matter is that perl's regular expressions these days
97 are much more complex than this kind of structure, but visualising it
98 this way can help when trying to get your bearings, and it matches the
99 current implementation pretty closely.
100
101 To be more precise, we will say that a regex program is an encoding of
102 a graph. Each node in the graph corresponds to part of the original
103 regex pattern, such as a literal string or a branch, and has a pointer
104 to the nodes representing the next component to be matched. Since
105 "node" and "opcode" already have other meanings in the perl source, we
106 will call the nodes in a regex program "regops".
107
108 The program is represented by an array of "regnode" structures, one or
109 more of which represent a single regop of the program. Struct "regnode"
110 is the smallest struct needed, and has a field structure which is
111 shared with all the other larger structures. (Outside this document,
112 the term "regnode" is sometimes used to mean "regop", which could be
113 confusing.)
114
115 The "next" pointers of all regops except "BRANCH" implement
116 concatenation; a "next" pointer with a "BRANCH" on both ends of it is
117 connecting two alternatives. [Here we have one of the subtle syntax
118 dependencies: an individual "BRANCH" (as opposed to a collection of
119 them) is never concatenated with anything because of operator
120 precedence.]
121
122 The operand of some types of regop is a literal string; for others, it
123 is a regop leading into a sub-program. In particular, the operand of a
124 "BRANCH" node is the first regop of the branch.
125
126 NOTE: As the railroad metaphor suggests, this is not a tree structure:
127 the tail of the branch connects to the thing following the set of
128 "BRANCH"es. It is a like a single line of railway track that splits as
129 it goes into a station or railway yard and rejoins as it comes out the
130 other side.
131
132 Regops
133
134 The base structure of a regop is defined in regexp.h as follows:
135
136 struct regnode {
137 U8 flags; /* Various purposes, sometimes overridden */
138 U8 type; /* Opcode value as specified by regnodes.h */
139 U16 next_off; /* Offset in size regnode */
140 };
141
142 Other larger "regnode"-like structures are defined in regcomp.h. They
143 are almost like subclasses in that they have the same fields as
144 "regnode", with possibly additional fields following in the structure,
145 and in some cases the specific meaning (and name) of some of base
146 fields are overridden. The following is a more complete description.
147
148 "regnode_1"
149 "regnode_2"
150 "regnode_1" structures have the same header, followed by a single
151 four-byte argument; "regnode_2" structures contain two two-byte
152 arguments instead:
153
154 regnode_1 U32 arg1;
155 regnode_2 U16 arg1; U16 arg2;
156
157 "regnode_string"
158 "regnode_string" structures, used for literal strings, follow the
159 header with a one-byte length and then the string data. Strings are
160 padded on the tail end with zero bytes so that the total length of
161 the node is a multiple of four bytes:
162
163 regnode_string char string[1];
164 U8 str_len; /* overrides flags */
165
166 "regnode_charclass"
167 Bracketed character classes are represented by "regnode_charclass"
168 structures, which have a four-byte argument and then a 32-byte
169 (256-bit) bitmap indicating which characters in the Latin1 range
170 are included in the class.
171
172 regnode_charclass U32 arg1;
173 char bitmap[ANYOF_BITMAP_SIZE];
174
175 Various flags whose names begin with "ANYOF_" are used for special
176 situations. Above Latin1 matches and things not known until run-
177 time are stored in "Perl's pprivate structure".
178
179 "regnode_charclass_posixl"
180 There is also a larger form of a char class structure used to
181 represent POSIX char classes under "/l" matching, called
182 "regnode_charclass_posixl" which has an additional 32-bit bitmap
183 indicating which POSIX char classes have been included.
184
185 regnode_charclass_posixl U32 arg1;
186 char bitmap[ANYOF_BITMAP_SIZE];
187 U32 classflags;
188
189 regnodes.h defines an array called "regarglen[]" which gives the size
190 of each opcode in units of "size regnode" (4-byte). A macro is used to
191 calculate the size of an "EXACT" node based on its "str_len" field.
192
193 The regops are defined in regnodes.h which is generated from
194 regcomp.sym by regcomp.pl. Currently the maximum possible number of
195 distinct regops is restricted to 256, with about a quarter already
196 used.
197
198 A set of macros makes accessing the fields easier and more consistent.
199 These include "OP()", which is used to determine the type of a
200 "regnode"-like structure; "NEXT_OFF()", which is the offset to the next
201 node (more on this later); "ARG()", "ARG1()", "ARG2()", "ARG_SET()",
202 and equivalents for reading and setting the arguments; and "STR_LEN()",
203 "STRING()" and "OPERAND()" for manipulating strings and regop bearing
204 types.
205
206 What regop is next?
207
208 There are three distinct concepts of "next" in the regex engine, and it
209 is important to keep them clear.
210
211 • There is the "next regnode" from a given regnode, a value which is
212 rarely useful except that sometimes it matches up in terms of value
213 with one of the others, and that sometimes the code assumes this to
214 always be so.
215
216 • There is the "next regop" from a given regop/regnode. This is the
217 regop physically located after the current one, as determined by
218 the size of the current regop. This is often useful, such as when
219 dumping the structure we use this order to traverse. Sometimes the
220 code assumes that the "next regnode" is the same as the "next
221 regop", or in other words assumes that the sizeof a given regop
222 type is always going to be one regnode large.
223
224 • There is the "regnext" from a given regop. This is the regop which
225 is reached by jumping forward by the value of "NEXT_OFF()", or in a
226 few cases for longer jumps by the "arg1" field of the "regnode_1"
227 structure. The subroutine "regnext()" handles this transparently.
228 This is the logical successor of the node, which in some cases,
229 like that of the "BRANCH" regop, has special meaning.
230
232 Broadly speaking, performing a match of a string against a pattern
233 involves the following steps:
234
235 A. Compilation
236 1. Parsing
237 2. Peep-hole optimisation and analysis
238 B. Execution
239 3. Start position and no-match optimisations
240 4. Program execution
241
242 Where these steps occur in the actual execution of a perl program is
243 determined by whether the pattern involves interpolating any string
244 variables. If interpolation occurs, then compilation happens at run
245 time. If it does not, then compilation is performed at compile time.
246 (The "/o" modifier changes this, as does "qr//" to a certain extent.)
247 The engine doesn't really care that much.
248
249 Compilation
250 This code resides primarily in regcomp.c, along with the header files
251 regcomp.h, regexp.h and regnodes.h.
252
253 Compilation starts with "pregcomp()", which is mostly an initialisation
254 wrapper which farms work out to two other routines for the heavy
255 lifting: the first is "reg()", which is the start point for parsing;
256 the second, "study_chunk()", is responsible for optimisation.
257
258 Initialisation in "pregcomp()" mostly involves the creation and data-
259 filling of a special structure, "RExC_state_t" (defined in regcomp.c).
260 Almost all internally-used routines in regcomp.h take a pointer to one
261 of these structures as their first argument, with the name
262 "pRExC_state". This structure is used to store the compilation state
263 and contains many fields. Likewise there are many macros which operate
264 on this variable: anything that looks like "RExC_xxxx" is a macro that
265 operates on this pointer/structure.
266
267 "reg()" is the start of the parse process. It is responsible for
268 parsing an arbitrary chunk of pattern up to either the end of the
269 string, or the first closing parenthesis it encounters in the pattern.
270 This means it can be used to parse the top-level regex, or any section
271 inside of a grouping parenthesis. It also handles the "special parens"
272 that perl's regexes have. For instance when parsing "/x(?:foo)y/",
273 "reg()" will at one point be called to parse from the "?" symbol up to
274 and including the ")".
275
276 Additionally, "reg()" is responsible for parsing the one or more
277 branches from the pattern, and for "finishing them off" by correctly
278 setting their next pointers. In order to do the parsing, it repeatedly
279 calls out to "regbranch()", which is responsible for handling up to the
280 first "|" symbol it sees.
281
282 "regbranch()" in turn calls "regpiece()" which handles "things"
283 followed by a quantifier. In order to parse the "things", "regatom()"
284 is called. This is the lowest level routine, which parses out constant
285 strings, character classes, and the various special symbols like "$".
286 If "regatom()" encounters a "(" character it in turn calls "reg()".
287
288 There used to be two main passes involved in parsing, the first to
289 calculate the size of the compiled program, and the second to actually
290 compile it. But now there is only one main pass, with an initial crude
291 guess based on the length of the input pattern, which is increased if
292 necessary as parsing proceeds, and afterwards, trimmed to the actual
293 amount used.
294
295 However, it may happen that parsing must be restarted at the beginning
296 when various circumstances occur along the way. An example is if the
297 program turns out to be so large that there are jumps in it that won't
298 fit in the normal 16 bits available. There are two special regops that
299 can hold bigger jump destinations, BRANCHJ and LONGBRANCH. The parse
300 is restarted, and these are used instead of the normal shorter ones.
301 Whenever restarting the parse is required, the function returns failure
302 and sets a flag as to what needs to be done. This is passed up to the
303 top level routine which takes the appropriate action and restarts from
304 scratch. In the case of needing longer jumps, the "RExC_use_BRANCHJ"
305 flag is set in the "RExC_state_t" structure, which the functions know
306 to inspect before deciding how to do branches.
307
308 In most instances, the function that discovers the issue sets the
309 causal flag and returns failure immediately. "Parsing complications"
310 contains an explicit example of how this works. In other cases, such
311 as a forward reference to a numbered parenthetical grouping, we need to
312 finish the parse to know if that numbered grouping actually appears in
313 the pattern. In those cases, the parse is just redone at the end, with
314 the knowledge of how many groupings occur in it.
315
316 The routine "regtail()" is called by both "reg()" and "regbranch()" in
317 order to "set the tail pointer" correctly. When executing and we get to
318 the end of a branch, we need to go to the node following the grouping
319 parens. When parsing, however, we don't know where the end will be
320 until we get there, so when we do we must go back and update the
321 offsets as appropriate. "regtail" is used to make this easier.
322
323 A subtlety of the parsing process means that a regex like "/foo/" is
324 originally parsed into an alternation with a single branch. It is only
325 afterwards that the optimiser converts single branch alternations into
326 the simpler form.
327
328 Parse Call Graph and a Grammar
329
330 The call graph looks like this:
331
332 reg() # parse a top level regex, or inside of
333 # parens
334 regbranch() # parse a single branch of an alternation
335 regpiece() # parse a pattern followed by a quantifier
336 regatom() # parse a simple pattern
337 regclass() # used to handle a class
338 reg() # used to handle a parenthesised
339 # subpattern
340 ....
341 ...
342 regtail() # finish off the branch
343 ...
344 regtail() # finish off the branch sequence. Tie each
345 # branch's tail to the tail of the
346 # sequence
347 # (NEW) In Debug mode this is
348 # regtail_study().
349
350 A grammar form might be something like this:
351
352 atom : constant | class
353 quant : '*' | '+' | '?' | '{min,max}'
354 _branch: piece
355 | piece _branch
356 | nothing
357 branch: _branch
358 | _branch '|' branch
359 group : '(' branch ')'
360 _piece: atom | group
361 piece : _piece
362 | _piece quant
363
364 Parsing complications
365
366 The implication of the above description is that a pattern containing
367 nested parentheses will result in a call graph which cycles through
368 "reg()", "regbranch()", "regpiece()", "regatom()", "reg()",
369 "regbranch()" etc multiple times, until the deepest level of nesting is
370 reached. All the above routines return a pointer to a "regnode", which
371 is usually the last regnode added to the program. However, one
372 complication is that reg() returns NULL for parsing "(?:)" syntax for
373 embedded modifiers, setting the flag "TRYAGAIN". The "TRYAGAIN"
374 propagates upwards until it is captured, in some cases by "regatom()",
375 but otherwise unconditionally by "regbranch()". Hence it will never be
376 returned by "regbranch()" to "reg()". This flag permits patterns such
377 as "(?i)+" to be detected as errors (Quantifier follows nothing in
378 regex; marked by <-- HERE in m/(?i)+ <-- HERE /).
379
380 Another complication is that the representation used for the program
381 differs if it needs to store Unicode, but it's not always possible to
382 know for sure whether it does until midway through parsing. The Unicode
383 representation for the program is larger, and cannot be matched as
384 efficiently. (See "Unicode and Localisation Support" below for more
385 details as to why.) If the pattern contains literal Unicode, it's
386 obvious that the program needs to store Unicode. Otherwise, the parser
387 optimistically assumes that the more efficient representation can be
388 used, and starts sizing on this basis. However, if it then encounters
389 something in the pattern which must be stored as Unicode, such as an
390 "\x{...}" escape sequence representing a character literal, then this
391 means that all previously calculated sizes need to be redone, using
392 values appropriate for the Unicode representation. This is another
393 instance where the parsing needs to be restarted, and it can and is
394 done immediately. The function returns failure, and sets the flag
395 "RESTART_UTF8" (encapsulated by using the macro "REQUIRE_UTF8"). This
396 restart request is propagated up the call chain in a similar fashion,
397 until it is "caught" in "Perl_re_op_compile()", which marks the pattern
398 as containing Unicode, and restarts the sizing pass. It is also
399 possible for constructions within run-time code blocks to turn out to
400 need Unicode representation., which is signalled by
401 "S_compile_runtime_code()" returning false to "Perl_re_op_compile()".
402
403 The restart was previously implemented using a "longjmp" in "regatom()"
404 back to a "setjmp" in "Perl_re_op_compile()", but this proved to be
405 problematic as the latter is a large function containing many automatic
406 variables, which interact badly with the emergent control flow of
407 "setjmp".
408
409 Debug Output
410
411 Starting in the 5.9.x development version of perl you can "use re Debug
412 => 'PARSE'" to see some trace information about the parse process. We
413 will start with some simple patterns and build up to more complex
414 patterns.
415
416 So when we parse "/foo/" we see something like the following table. The
417 left shows what is being parsed, and the number indicates where the
418 next regop would go. The stuff on the right is the trace output of the
419 graph. The names are chosen to be short to make it less dense on the
420 screen. 'tsdy' is a special form of "regtail()" which does some extra
421 analysis.
422
423 >foo< 1 reg
424 brnc
425 piec
426 atom
427 >< 4 tsdy~ EXACT <foo> (EXACT) (1)
428 ~ attach to END (3) offset to 2
429
430 The resulting program then looks like:
431
432 1: EXACT <foo>(3)
433 3: END(0)
434
435 As you can see, even though we parsed out a branch and a piece, it was
436 ultimately only an atom. The final program shows us how things work. We
437 have an "EXACT" regop, followed by an "END" regop. The number in parens
438 indicates where the "regnext" of the node goes. The "regnext" of an
439 "END" regop is unused, as "END" regops mean we have successfully
440 matched. The number on the left indicates the position of the regop in
441 the regnode array.
442
443 Now let's try a harder pattern. We will add a quantifier, so now we
444 have the pattern "/foo+/". We will see that "regbranch()" calls
445 "regpiece()" twice.
446
447 >foo+< 1 reg
448 brnc
449 piec
450 atom
451 >o+< 3 piec
452 atom
453 >< 6 tail~ EXACT <fo> (1)
454 7 tsdy~ EXACT <fo> (EXACT) (1)
455 ~ PLUS (END) (3)
456 ~ attach to END (6) offset to 3
457
458 And we end up with the program:
459
460 1: EXACT <fo>(3)
461 3: PLUS(6)
462 4: EXACT <o>(0)
463 6: END(0)
464
465 Now we have a special case. The "EXACT" regop has a "regnext" of 0.
466 This is because if it matches it should try to match itself again. The
467 "PLUS" regop handles the actual failure of the "EXACT" regop and acts
468 appropriately (going to regnode 6 if the "EXACT" matched at least once,
469 or failing if it didn't).
470
471 Now for something much more complex: "/x(?:foo*|b[a][rR])(foo|bar)$/"
472
473 >x(?:foo*|b... 1 reg
474 brnc
475 piec
476 atom
477 >(?:foo*|b[... 3 piec
478 atom
479 >?:foo*|b[a... reg
480 >foo*|b[a][... brnc
481 piec
482 atom
483 >o*|b[a][rR... 5 piec
484 atom
485 >|b[a][rR])... 8 tail~ EXACT <fo> (3)
486 >b[a][rR])(... 9 brnc
487 10 piec
488 atom
489 >[a][rR])(f... 12 piec
490 atom
491 >a][rR])(fo... clas
492 >[rR])(foo|... 14 tail~ EXACT <b> (10)
493 piec
494 atom
495 >rR])(foo|b... clas
496 >)(foo|bar)... 25 tail~ EXACT <a> (12)
497 tail~ BRANCH (3)
498 26 tsdy~ BRANCH (END) (9)
499 ~ attach to TAIL (25) offset to 16
500 tsdy~ EXACT <fo> (EXACT) (4)
501 ~ STAR (END) (6)
502 ~ attach to TAIL (25) offset to 19
503 tsdy~ EXACT <b> (EXACT) (10)
504 ~ EXACT <a> (EXACT) (12)
505 ~ ANYOF[Rr] (END) (14)
506 ~ attach to TAIL (25) offset to 11
507 >(foo|bar)$< tail~ EXACT <x> (1)
508 piec
509 atom
510 >foo|bar)$< reg
511 28 brnc
512 piec
513 atom
514 >|bar)$< 31 tail~ OPEN1 (26)
515 >bar)$< brnc
516 32 piec
517 atom
518 >)$< 34 tail~ BRANCH (28)
519 36 tsdy~ BRANCH (END) (31)
520 ~ attach to CLOSE1 (34) offset to 3
521 tsdy~ EXACT <foo> (EXACT) (29)
522 ~ attach to CLOSE1 (34) offset to 5
523 tsdy~ EXACT <bar> (EXACT) (32)
524 ~ attach to CLOSE1 (34) offset to 2
525 >$< tail~ BRANCH (3)
526 ~ BRANCH (9)
527 ~ TAIL (25)
528 piec
529 atom
530 >< 37 tail~ OPEN1 (26)
531 ~ BRANCH (28)
532 ~ BRANCH (31)
533 ~ CLOSE1 (34)
534 38 tsdy~ EXACT <x> (EXACT) (1)
535 ~ BRANCH (END) (3)
536 ~ BRANCH (END) (9)
537 ~ TAIL (END) (25)
538 ~ OPEN1 (END) (26)
539 ~ BRANCH (END) (28)
540 ~ BRANCH (END) (31)
541 ~ CLOSE1 (END) (34)
542 ~ EOL (END) (36)
543 ~ attach to END (37) offset to 1
544
545 Resulting in the program
546
547 1: EXACT <x>(3)
548 3: BRANCH(9)
549 4: EXACT <fo>(6)
550 6: STAR(26)
551 7: EXACT <o>(0)
552 9: BRANCH(25)
553 10: EXACT <ba>(14)
554 12: OPTIMIZED (2 nodes)
555 14: ANYOF[Rr](26)
556 25: TAIL(26)
557 26: OPEN1(28)
558 28: TRIE-EXACT(34)
559 [StS:1 Wds:2 Cs:6 Uq:5 #Sts:7 Mn:3 Mx:3 Stcls:bf]
560 <foo>
561 <bar>
562 30: OPTIMIZED (4 nodes)
563 34: CLOSE1(36)
564 36: EOL(37)
565 37: END(0)
566
567 Here we can see a much more complex program, with various optimisations
568 in play. At regnode 10 we see an example where a character class with
569 only one character in it was turned into an "EXACT" node. We can also
570 see where an entire alternation was turned into a "TRIE-EXACT" node. As
571 a consequence, some of the regnodes have been marked as optimised away.
572 We can see that the "$" symbol has been converted into an "EOL" regop,
573 a special piece of code that looks for "\n" or the end of the string.
574
575 The next pointer for "BRANCH"es is interesting in that it points at
576 where execution should go if the branch fails. When executing, if the
577 engine tries to traverse from a branch to a "regnext" that isn't a
578 branch then the engine will know that the entire set of branches has
579 failed.
580
581 Peep-hole Optimisation and Analysis
582
583 The regular expression engine can be a weighty tool to wield. On long
584 strings and complex patterns it can end up having to do a lot of work
585 to find a match, and even more to decide that no match is possible.
586 Consider a situation like the following pattern.
587
588 'ababababababababababab' =~ /(a|b)*z/
589
590 The "(a|b)*" part can match at every char in the string, and then fail
591 every time because there is no "z" in the string. So obviously we can
592 avoid using the regex engine unless there is a "z" in the string.
593 Likewise in a pattern like:
594
595 /foo(\w+)bar/
596
597 In this case we know that the string must contain a "foo" which must be
598 followed by "bar". We can use Fast Boyer-Moore matching as implemented
599 in "fbm_instr()" to find the location of these strings. If they don't
600 exist then we don't need to resort to the much more expensive regex
601 engine. Even better, if they do exist then we can use their positions
602 to reduce the search space that the regex engine needs to cover to
603 determine if the entire pattern matches.
604
605 There are various aspects of the pattern that can be used to facilitate
606 optimisations along these lines:
607
608 • anchored fixed strings
609
610 • floating fixed strings
611
612 • minimum and maximum length requirements
613
614 • start class
615
616 • Beginning/End of line positions
617
618 Another form of optimisation that can occur is the post-parse "peep-
619 hole" optimisation, where inefficient constructs are replaced by more
620 efficient constructs. The "TAIL" regops which are used during parsing
621 to mark the end of branches and the end of groups are examples of this.
622 These regops are used as place-holders during construction and "always
623 match" so they can be "optimised away" by making the things that point
624 to the "TAIL" point to the thing that "TAIL" points to, thus "skipping"
625 the node.
626
627 Another optimisation that can occur is that of ""EXACT" merging" which
628 is where two consecutive "EXACT" nodes are merged into a single regop.
629 An even more aggressive form of this is that a branch sequence of the
630 form "EXACT BRANCH ... EXACT" can be converted into a "TRIE-EXACT"
631 regop.
632
633 All of this occurs in the routine "study_chunk()" which uses a special
634 structure "scan_data_t" to store the analysis that it has performed,
635 and does the "peep-hole" optimisations as it goes.
636
637 The code involved in "study_chunk()" is extremely cryptic. Be careful.
638 :-)
639
640 Execution
641 Execution of a regex generally involves two phases, the first being
642 finding the start point in the string where we should match from, and
643 the second being running the regop interpreter.
644
645 If we can tell that there is no valid start point then we don't bother
646 running the interpreter at all. Likewise, if we know from the analysis
647 phase that we cannot detect a short-cut to the start position, we go
648 straight to the interpreter.
649
650 The two entry points are "re_intuit_start()" and "pregexec()". These
651 routines have a somewhat incestuous relationship with overlap between
652 their functions, and "pregexec()" may even call "re_intuit_start()" on
653 its own. Nevertheless other parts of the perl source code may call into
654 either, or both.
655
656 Execution of the interpreter itself used to be recursive, but thanks to
657 the efforts of Dave Mitchell in the 5.9.x development track, that has
658 changed: now an internal stack is maintained on the heap and the
659 routine is fully iterative. This can make it tricky as the code is
660 quite conservative about what state it stores, with the result that two
661 consecutive lines in the code can actually be running in totally
662 different contexts due to the simulated recursion.
663
664 Start position and no-match optimisations
665
666 "re_intuit_start()" is responsible for handling start points and no-
667 match optimisations as determined by the results of the analysis done
668 by "study_chunk()" (and described in "Peep-hole Optimisation and
669 Analysis").
670
671 The basic structure of this routine is to try to find the start- and/or
672 end-points of where the pattern could match, and to ensure that the
673 string is long enough to match the pattern. It tries to use more
674 efficient methods over less efficient methods and may involve
675 considerable cross-checking of constraints to find the place in the
676 string that matches. For instance it may try to determine that a given
677 fixed string must be not only present but a certain number of chars
678 before the end of the string, or whatever.
679
680 It calls several other routines, such as "fbm_instr()" which does Fast
681 Boyer Moore matching and "find_byclass()" which is responsible for
682 finding the start using the first mandatory regop in the program.
683
684 When the optimisation criteria have been satisfied, "reg_try()" is
685 called to perform the match.
686
687 Program execution
688
689 "pregexec()" is the main entry point for running a regex. It contains
690 support for initialising the regex interpreter's state, running
691 "re_intuit_start()" if needed, and running the interpreter on the
692 string from various start positions as needed. When it is necessary to
693 use the regex interpreter "pregexec()" calls "regtry()".
694
695 "regtry()" is the entry point into the regex interpreter. It expects as
696 arguments a pointer to a "regmatch_info" structure and a pointer to a
697 string. It returns an integer 1 for success and a 0 for failure. It
698 is basically a set-up wrapper around "regmatch()".
699
700 "regmatch" is the main "recursive loop" of the interpreter. It is
701 basically a giant switch statement that implements a state machine,
702 where the possible states are the regops themselves, plus a number of
703 additional intermediate and failure states. A few of the states are
704 implemented as subroutines but the bulk are inline code.
705
707 Unicode and Localisation Support
708 When dealing with strings containing characters that cannot be
709 represented using an eight-bit character set, perl uses an internal
710 representation that is a permissive version of Unicode's UTF-8
711 encoding[2]. This uses single bytes to represent characters from the
712 ASCII character set, and sequences of two or more bytes for all other
713 characters. (See perlunitut for more information about the relationship
714 between UTF-8 and perl's encoding, utf8. The difference isn't important
715 for this discussion.)
716
717 No matter how you look at it, Unicode support is going to be a pain in
718 a regex engine. Tricks that might be fine when you have 256 possible
719 characters often won't scale to handle the size of the UTF-8 character
720 set. Things you can take for granted with ASCII may not be true with
721 Unicode. For instance, in ASCII, it is safe to assume that
722 "sizeof(char1) == sizeof(char2)", but in UTF-8 it isn't. Unicode case
723 folding is vastly more complex than the simple rules of ASCII, and even
724 when not using Unicode but only localised single byte encodings, things
725 can get tricky (for example, LATIN SMALL LETTER SHARP S (U+00DF, ss)
726 should match 'SS' in localised case-insensitive matching).
727
728 Making things worse is that UTF-8 support was a later addition to the
729 regex engine (as it was to perl) and this necessarily made things a
730 lot more complicated. Obviously it is easier to design a regex engine
731 with Unicode support in mind from the beginning than it is to retrofit
732 it to one that wasn't.
733
734 Nearly all regops that involve looking at the input string have two
735 cases, one for UTF-8, and one not. In fact, it's often more complex
736 than that, as the pattern may be UTF-8 as well.
737
738 Care must be taken when making changes to make sure that you handle
739 UTF-8 properly, both at compile time and at execution time, including
740 when the string and pattern are mismatched.
741
742 Base Structures
743 The "regexp" structure described in perlreapi is common to all regex
744 engines. Two of its fields are intended for the private use of the
745 regex engine that compiled the pattern. These are the "intflags" and
746 pprivate members. The "pprivate" is a void pointer to an arbitrary
747 structure whose use and management is the responsibility of the
748 compiling engine. perl will never modify either of these values. In the
749 case of the stock engine the structure pointed to by "pprivate" is
750 called "regexp_internal".
751
752 Its "pprivate" and "intflags" fields contain data specific to each
753 engine.
754
755 There are two structures used to store a compiled regular expression.
756 One, the "regexp" structure described in perlreapi is populated by the
757 engine currently being. used and some of its fields read by perl to
758 implement things such as the stringification of "qr//".
759
760 The other structure is pointed to by the "regexp" struct's "pprivate"
761 and is in addition to "intflags" in the same struct considered to be
762 the property of the regex engine which compiled the regular expression;
763
764 The regexp structure contains all the data that perl needs to be aware
765 of to properly work with the regular expression. It includes data about
766 optimisations that perl can use to determine if the regex engine should
767 really be used, and various other control info that is needed to
768 properly execute patterns in various contexts such as is the pattern
769 anchored in some way, or what flags were used during the compile, or
770 whether the program contains special constructs that perl needs to be
771 aware of.
772
773 In addition it contains two fields that are intended for the private
774 use of the regex engine that compiled the pattern. These are the
775 "intflags" and pprivate members. The "pprivate" is a void pointer to an
776 arbitrary structure whose use and management is the responsibility of
777 the compiling engine. perl will never modify either of these values.
778
779 As mentioned earlier, in the case of the default engines, the
780 "pprivate" will be a pointer to a regexp_internal structure which holds
781 the compiled program and any additional data that is private to the
782 regex engine implementation.
783
784 Perl's "pprivate" structure
785
786 The following structure is used as the "pprivate" struct by perl's
787 regex engine. Since it is specific to perl it is only of curiosity
788 value to other engine implementations.
789
790 typedef struct regexp_internal {
791 U32 *offsets; /* offset annotations 20001228 MJD
792 * data about mapping the program to
793 * the string*/
794 regnode *regstclass; /* Optional startclass as identified or
795 * constructed by the optimiser */
796 struct reg_data *data; /* Additional miscellaneous data used
797 * by the program. Used to make it
798 * easier to clone and free arbitrary
799 * data that the regops need. Often the
800 * ARG field of a regop is an index
801 * into this structure */
802 regnode program[1]; /* Unwarranted chumminess with
803 * compiler. */
804 } regexp_internal;
805
806 "offsets"
807 Offsets holds a mapping of offset in the "program" to offset in
808 the "precomp" string. This is only used by ActiveState's visual
809 regex debugger.
810
811 "regstclass"
812 Special regop that is used by "re_intuit_start()" to check if a
813 pattern can match at a certain position. For instance if the regex
814 engine knows that the pattern must start with a 'Z' then it can
815 scan the string until it finds one and then launch the regex
816 engine from there. The routine that handles this is called
817 "find_by_class()". Sometimes this field points at a regop embedded
818 in the program, and sometimes it points at an independent
819 synthetic regop that has been constructed by the optimiser.
820
821 "data"
822 This field points at a "reg_data" structure, which is defined as
823 follows
824
825 struct reg_data {
826 U32 count;
827 U8 *what;
828 void* data[1];
829 };
830
831 This structure is used for handling data structures that the regex
832 engine needs to handle specially during a clone or free operation
833 on the compiled product. Each element in the data array has a
834 corresponding element in the what array. During compilation regops
835 that need special structures stored will add an element to each
836 array using the add_data() routine and then store the index in the
837 regop.
838
839 "program"
840 Compiled program. Inlined into the structure so the entire struct
841 can be treated as a single blob.
842
844 perlreapi
845
846 perlre
847
848 perlunitut
849
851 by Yves Orton, 2006.
852
853 With excerpts from Perl, and contributions and suggestions from Ronald
854 J. Kimball, Dave Mitchell, Dominic Dunlop, Mark Jason Dominus, Stephen
855 McCamant, and David Landgren.
856
857 Now maintained by Perl 5 Porters.
858
860 Same terms as Perl.
861
863 [1] <https://perl.plover.com/Rx/paper/>
864
865 [2] <https://www.unicode.org/>
866
867
868
869perl v5.34.1 2022-03-15 PERLREGUTS(1)