1PERLREGUTS(1) Perl Programmers Reference Guide PERLREGUTS(1)
2
3
4
6 perlreguts - Description of the Perl regular expression engine.
7
9 This document is an attempt to shine some light on the guts of the
10 regex engine and how it works. The regex engine represents a
11 significant chunk of the perl codebase, but is relatively poorly
12 understood. This document is a meagre attempt at addressing this
13 situation. It is derived from the author's experience, comments in the
14 source code, other papers on the regex engine, feedback on the
15 perl5-porters mail list, and no doubt other places as well.
16
17 NOTICE! It should be clearly understood that the behavior and
18 structures discussed in this represents the state of the engine as the
19 author understood it at the time of writing. It is NOT an API
20 definition, it is purely an internals guide for those who want to hack
21 the regex engine, or understand how the regex engine works. Readers of
22 this document are expected to understand perl's regex syntax and its
23 usage in detail. If you want to learn about the basics of Perl's
24 regular expressions, see perlre. And if you want to replace the regex
25 engine with your own, see perlreapi.
26
28 A quick note on terms
29 There is some debate as to whether to say "regexp" or "regex". In this
30 document we will use the term "regex" unless there is a special reason
31 not to, in which case we will explain why.
32
33 When speaking about regexes we need to distinguish between their source
34 code form and their internal form. In this document we will use the
35 term "pattern" when we speak of their textual, source code form, and
36 the term "program" when we speak of their internal representation.
37 These correspond to the terms S-regex and B-regex that Mark Jason
38 Dominus employs in his paper on "Rx" ([1] in "REFERENCES").
39
40 What is a regular expression engine?
41 A regular expression engine is a program that takes a set of
42 constraints specified in a mini-language, and then applies those
43 constraints to a target string, and determines whether or not the
44 string satisfies the constraints. See perlre for a full definition of
45 the language.
46
47 In less grandiose terms, the first part of the job is to turn a pattern
48 into something the computer can efficiently use to find the matching
49 point in the string, and the second part is performing the search
50 itself.
51
52 To do this we need to produce a program by parsing the text. We then
53 need to execute the program to find the point in the string that
54 matches. And we need to do the whole thing efficiently.
55
56 Structure of a Regexp Program
57 High Level
58
59 Although it is a bit confusing and some people object to the
60 terminology, it is worth taking a look at a comment that has been in
61 regexp.h for years:
62
63 This is essentially a linear encoding of a nondeterministic finite-
64 state machine (aka syntax charts or "railroad normal form" in parsing
65 technology).
66
67 The term "railroad normal form" is a bit esoteric, with "syntax
68 diagram/charts", or "railroad diagram/charts" being more common terms.
69 Nevertheless it provides a useful mental image of a regex program: each
70 node can be thought of as a unit of track, with a single entry and in
71 most cases a single exit point (there are pieces of track that fork,
72 but statistically not many), and the whole forms a layout with a single
73 entry and single exit point. The matching process can be thought of as
74 a car that moves along the track, with the particular route through the
75 system being determined by the character read at each possible
76 connector point. A car can fall off the track at any point but it may
77 only proceed as long as it matches the track.
78
79 Thus the pattern "/foo(?:\w+|\d+|\s+)bar/" can be thought of as the
80 following chart:
81
82 [start]
83 |
84 <foo>
85 |
86 +-----+-----+
87 | | |
88 <\w+> <\d+> <\s+>
89 | | |
90 +-----+-----+
91 |
92 <bar>
93 |
94 [end]
95
96 The truth of the matter is that perl's regular expressions these days
97 are much more complex than this kind of structure, but visualising it
98 this way can help when trying to get your bearings, and it matches the
99 current implementation pretty closely.
100
101 To be more precise, we will say that a regex program is an encoding of
102 a graph. Each node in the graph corresponds to part of the original
103 regex pattern, such as a literal string or a branch, and has a pointer
104 to the nodes representing the next component to be matched. Since
105 "node" and "opcode" already have other meanings in the perl source, we
106 will call the nodes in a regex program "regops".
107
108 The program is represented by an array of "regnode" structures, one or
109 more of which represent a single regop of the program. Struct "regnode"
110 is the smallest struct needed, and has a field structure which is
111 shared with all the other larger structures. (Outside this document,
112 the term "regnode" is sometimes used to mean "regop", which could be
113 confusing.)
114
115 The "next" pointers of all regops except "BRANCH" implement
116 concatenation; a "next" pointer with a "BRANCH" on both ends of it is
117 connecting two alternatives. [Here we have one of the subtle syntax
118 dependencies: an individual "BRANCH" (as opposed to a collection of
119 them) is never concatenated with anything because of operator
120 precedence.]
121
122 The operand of some types of regop is a literal string; for others, it
123 is a regop leading into a sub-program. In particular, the operand of a
124 "BRANCH" node is the first regop of the branch.
125
126 NOTE: As the railroad metaphor suggests, this is not a tree structure:
127 the tail of the branch connects to the thing following the set of
128 "BRANCH"es. It is a like a single line of railway track that splits as
129 it goes into a station or railway yard and rejoins as it comes out the
130 other side.
131
132 Regops
133
134 The base structure of a regop is defined in regexp.h as follows:
135
136 struct regnode {
137 U8 flags; /* Various purposes, sometimes overridden */
138 U8 type; /* Opcode value as specified by regnodes.h */
139 U16 next_off; /* Offset in size regnode */
140 };
141
142 Other larger "regnode"-like structures are defined in regcomp.h. They
143 are almost like subclasses in that they have the same fields as
144 "regnode", with possibly additional fields following in the structure,
145 and in some cases the specific meaning (and name) of some of base
146 fields are overridden. The following is a more complete description.
147
148 "regnode_1"
149 "regnode_2"
150 "regnode_1" structures have the same header, followed by a single
151 four-byte argument; "regnode_2" structures contain two two-byte
152 arguments instead:
153
154 regnode_1 U32 arg1;
155 regnode_2 U16 arg1; U16 arg2;
156
157 "regnode_string"
158 "regnode_string" structures, used for literal strings, follow the
159 header with a one-byte length and then the string data. Strings are
160 padded on the tail end with zero bytes so that the total length of
161 the node is a multiple of four bytes:
162
163 regnode_string char string[1];
164 U8 str_len; /* overrides flags */
165
166 "regnode_charclass"
167 Bracketed character classes are represented by "regnode_charclass"
168 structures, which have a four-byte argument and then a 32-byte
169 (256-bit) bitmap indicating which characters in the Latin1 range
170 are included in the class.
171
172 regnode_charclass U32 arg1;
173 char bitmap[ANYOF_BITMAP_SIZE];
174
175 Various flags whose names begin with "ANYOF_" are used for special
176 situations. Above Latin1 matches and things not known until run-
177 time are stored in "Perl's pprivate structure".
178
179 "regnode_charclass_posixl"
180 There is also a larger form of a char class structure used to
181 represent POSIX char classes under "/l" matching, called
182 "regnode_charclass_posixl" which has an additional 32-bit bitmap
183 indicating which POSIX char classes have been included.
184
185 regnode_charclass_posixl U32 arg1;
186 char bitmap[ANYOF_BITMAP_SIZE];
187 U32 classflags;
188
189 regnodes.h defines an array called "PL_regnode_arg_len[]" which gives
190 the size of each opcode in units of "size regnode" (4-byte). A macro is
191 used to calculate the size of an "EXACT" node based on its "str_len"
192 field.
193
194 The regops are defined in regnodes.h which is generated from
195 regcomp.sym by regcomp.pl. Currently the maximum possible number of
196 distinct regops is restricted to 256, with about a quarter already
197 used.
198
199 A set of macros makes accessing the fields easier and more consistent.
200 These include OP(), which is used to determine the type of a
201 "regnode"-like structure; NEXT_OFF(), which is the offset to the next
202 node (more on this later); ARG(), ARG1(), ARG2(), ARG_SET(), and
203 equivalents for reading and setting the arguments; and STR_LEN(),
204 STRING() and OPERAND() for manipulating strings and regop bearing
205 types.
206
207 What regnode is next?
208
209 There are two distinct concepts of "next regnode" in the regex engine,
210 and it is important to keep them distinct in your thinking as they
211 overlap conceptually in many places, but where they don't overlap the
212 difference is critical. For the majority of regnode types the two
213 concepts are (nearly) identical in practice. The two types are
214 "REGNODE_AFTER" which is used heavily during compilation but only
215 occasionally during execution and "regnext" which is used heavily
216 during execution, and only occasionally during compilation.
217
218 "REGNODE_AFTER"
219 This is the "positionally next regnode" in the compiled regex
220 program. For the smaller regnode types it is "regnode_ptr+1" under
221 the hood, but as regnode sizes vary and can change over time we
222 offer macros which hide the gory details.
223
224 It is heavily used in the compiler phase but is only used by a few
225 select regnode types in the execution phase. It is also heavily
226 used in the code for dumping the regexp program for debugging.
227
228 There are a selection of macros which can be used to compute this
229 as efficiently as possible depending on the circumstances. The
230 canonical macro is REGNODE_AFTER(), which is the most powerful and
231 should handle any case we have, but is also potentially the
232 slowest. There are two additional macros for the special case that
233 you KNOW the current regnode size is constant, and you know its
234 type or opcode. In which case you can use REGNODE_AFTER_opcode() or
235 REGNODE_AFTER_type().
236
237 In older versions of the regex engine REGNODE_AFTER() was called
238 "NEXTOPER" but this was found to be confusing and it was renamed.
239 There is also a REGNODE_BEFORE(), but it is unsafe and should not
240 be used in new code.
241
242 "regnext"
243 This is the regnode which can be reached by jumping forward by the
244 value of the NEXT_OFF() member of the regnode, or in a few cases
245 for longer jumps by the "arg1" field of the "regnode_1" structure.
246 The subroutine regnext() handles this transparently. In the
247 majority of cases the "regnext" for a regnode is the regnode which
248 should be executed after the current one has successfully matched,
249 but in some cases this may not be true. In loop control and branch
250 control regnode types the regnext may signify something special,
251 for BRANCH nodes "regnext" is the next BRANCH that should be
252 executed if the current one fails execution, and some loop control
253 regnodes set the regnext to be the end of the loop so they can jump
254 to their cleanup if the current iteration fails to match.
255
256 Most regnode types do not create a branch in the execution flow, and
257 leaving aside optimizations the two concepts of "next" are the same.
258 For instance the "regnext" and "REGNODE_AFTER" of a SBOL opcode are the
259 same during compilation phase. The main place this is not true is
260 "BRANCH" regnodes where the "REGNODE_AFTER" represents the start of the
261 pattern in the branch and the "regnext" represents the linkage to the
262 next BRANCH should this one fail to match, or 0 if it is the last
263 branch. The looping logic for quantifiers also makes similar use of the
264 distinction between the two types, with "REGNODE_AFTER" being the
265 inside of the loop construct, and the "regnext" pointing at the end of
266 the loop.
267
268 During compilation the engine may not know what the regnext is for a
269 given node, so during compilation "regnext" is only used where it must
270 be used and is known to be correct. At the very end of the compilation
271 phase we walk the regex program and correct the regnext data as
272 appropriate, and also perform various optimizations which may result in
273 regnodes that were required during construction becoming redundant, or
274 we may replace a large regnode with a much smaller one and filling in
275 the gap with OPTIMIZED regnodes. Thus we might start with something
276 like this:
277
278 BRANCH
279 EXACT "foo"
280 BRANCH
281 EXACT "bar"
282 EXACT "!"
283
284 and replace it with something like:
285
286 TRIE foo|bar
287 OPTIMIZED
288 OPTIMIZED
289 OPTIMIZED
290 EXACT "!"
291
292 the "REGNODE_AFTER" for the "TRIE" node would be an "OPTIMIZED"
293 regnode, and in theory the "regnext" would be the same as the
294 "REGNODE_AFTER". But it would be inefficient to execute the OPTIMIZED
295 regnode as a noop three times, so the optimizer fixes the "regnext" so
296 such nodes are skipped during execution phase.
297
298 During execution phases we use the regnext() almost exclusively, and
299 only use "REGNODE_AFTER" in special cases where it has a well defined
300 meaning for a given regnode type. For instance /x+/ results in
301
302 PLUS
303 EXACT "x"
304 END
305
306 the "regnext" of the "PLUS" regnode is the "END" regnode, and the
307 "REGNODE_AFTER" of the "PLUS" regnode is the "EXACT" regnode. The
308 "regnext" and "REGNODE_AFTER" of the "EXACT" regnode is the "END"
309 regnode.
310
312 Broadly speaking, performing a match of a string against a pattern
313 involves the following steps:
314
315 A. Compilation
316 1. Parsing
317 2. Peep-hole optimisation and analysis
318 B. Execution
319 3. Start position and no-match optimisations
320 4. Program execution
321
322 Where these steps occur in the actual execution of a perl program is
323 determined by whether the pattern involves interpolating any string
324 variables. If interpolation occurs, then compilation happens at run
325 time. If it does not, then compilation is performed at compile time.
326 (The "/o" modifier changes this, as does "qr//" to a certain extent.)
327 The engine doesn't really care that much.
328
329 Compilation
330 This code resides primarily in regcomp.c, along with the header files
331 regcomp.h, regexp.h and regnodes.h.
332
333 Compilation starts with pregcomp(), which is mostly an initialisation
334 wrapper which farms work out to two other routines for the heavy
335 lifting: the first is reg(), which is the start point for parsing; the
336 second, study_chunk(), is responsible for optimisation.
337
338 Initialisation in pregcomp() mostly involves the creation and data-
339 filling of a special structure, "RExC_state_t" (defined in regcomp.c).
340 Almost all internally-used routines in regcomp.h take a pointer to one
341 of these structures as their first argument, with the name
342 "pRExC_state". This structure is used to store the compilation state
343 and contains many fields. Likewise there are many macros which operate
344 on this variable: anything that looks like "RExC_xxxx" is a macro that
345 operates on this pointer/structure.
346
347 reg() is the start of the parse process. It is responsible for parsing
348 an arbitrary chunk of pattern up to either the end of the string, or
349 the first closing parenthesis it encounters in the pattern. This means
350 it can be used to parse the top-level regex, or any section inside of a
351 grouping parenthesis. It also handles the "special parens" that perl's
352 regexes have. For instance when parsing "/x(?:foo)y/", reg() will at
353 one point be called to parse from the "?" symbol up to and including
354 the ")".
355
356 Additionally, reg() is responsible for parsing the one or more branches
357 from the pattern, and for "finishing them off" by correctly setting
358 their next pointers. In order to do the parsing, it repeatedly calls
359 out to regbranch(), which is responsible for handling up to the first
360 "|" symbol it sees.
361
362 regbranch() in turn calls regpiece() which handles "things" followed by
363 a quantifier. In order to parse the "things", regatom() is called. This
364 is the lowest level routine, which parses out constant strings,
365 character classes, and the various special symbols like "$". If
366 regatom() encounters a "(" character it in turn calls reg().
367
368 There used to be two main passes involved in parsing, the first to
369 calculate the size of the compiled program, and the second to actually
370 compile it. But now there is only one main pass, with an initial crude
371 guess based on the length of the input pattern, which is increased if
372 necessary as parsing proceeds, and afterwards, trimmed to the actual
373 amount used.
374
375 However, it may happen that parsing must be restarted at the beginning
376 when various circumstances occur along the way. An example is if the
377 program turns out to be so large that there are jumps in it that won't
378 fit in the normal 16 bits available. There are two special regops that
379 can hold bigger jump destinations, BRANCHJ and LONGBRANCH. The parse
380 is restarted, and these are used instead of the normal shorter ones.
381 Whenever restarting the parse is required, the function returns failure
382 and sets a flag as to what needs to be done. This is passed up to the
383 top level routine which takes the appropriate action and restarts from
384 scratch. In the case of needing longer jumps, the "RExC_use_BRANCHJ"
385 flag is set in the "RExC_state_t" structure, which the functions know
386 to inspect before deciding how to do branches.
387
388 In most instances, the function that discovers the issue sets the
389 causal flag and returns failure immediately. "Parsing complications"
390 contains an explicit example of how this works. In other cases, such
391 as a forward reference to a numbered parenthetical grouping, we need to
392 finish the parse to know if that numbered grouping actually appears in
393 the pattern. In those cases, the parse is just redone at the end, with
394 the knowledge of how many groupings occur in it.
395
396 The routine regtail() is called by both reg() and regbranch() in order
397 to "set the tail pointer" correctly. When executing and we get to the
398 end of a branch, we need to go to the node following the grouping
399 parens. When parsing, however, we don't know where the end will be
400 until we get there, so when we do we must go back and update the
401 offsets as appropriate. "regtail" is used to make this easier.
402
403 A subtlety of the parsing process means that a regex like "/foo/" is
404 originally parsed into an alternation with a single branch. It is only
405 afterwards that the optimiser converts single branch alternations into
406 the simpler form.
407
408 Parse Call Graph and a Grammar
409
410 The call graph looks like this:
411
412 reg() # parse a top level regex, or inside of
413 # parens
414 regbranch() # parse a single branch of an alternation
415 regpiece() # parse a pattern followed by a quantifier
416 regatom() # parse a simple pattern
417 regclass() # used to handle a class
418 reg() # used to handle a parenthesised
419 # subpattern
420 ....
421 ...
422 regtail() # finish off the branch
423 ...
424 regtail() # finish off the branch sequence. Tie each
425 # branch's tail to the tail of the
426 # sequence
427 # (NEW) In Debug mode this is
428 # regtail_study().
429
430 A grammar form might be something like this:
431
432 atom : constant | class
433 quant : '*' | '+' | '?' | '{min,max}'
434 _branch: piece
435 | piece _branch
436 | nothing
437 branch: _branch
438 | _branch '|' branch
439 group : '(' branch ')'
440 _piece: atom | group
441 piece : _piece
442 | _piece quant
443
444 Parsing complications
445
446 The implication of the above description is that a pattern containing
447 nested parentheses will result in a call graph which cycles through
448 reg(), regbranch(), regpiece(), regatom(), reg(), regbranch() etc
449 multiple times, until the deepest level of nesting is reached. All the
450 above routines return a pointer to a "regnode", which is usually the
451 last regnode added to the program. However, one complication is that
452 reg() returns NULL for parsing "(?:)" syntax for embedded modifiers,
453 setting the flag "TRYAGAIN". The "TRYAGAIN" propagates upwards until it
454 is captured, in some cases by regatom(), but otherwise unconditionally
455 by regbranch(). Hence it will never be returned by regbranch() to
456 reg(). This flag permits patterns such as "(?i)+" to be detected as
457 errors (Quantifier follows nothing in regex; marked by <-- HERE in
458 m/(?i)+ <-- HERE /).
459
460 Another complication is that the representation used for the program
461 differs if it needs to store Unicode, but it's not always possible to
462 know for sure whether it does until midway through parsing. The Unicode
463 representation for the program is larger, and cannot be matched as
464 efficiently. (See "Unicode and Localisation Support" below for more
465 details as to why.) If the pattern contains literal Unicode, it's
466 obvious that the program needs to store Unicode. Otherwise, the parser
467 optimistically assumes that the more efficient representation can be
468 used, and starts sizing on this basis. However, if it then encounters
469 something in the pattern which must be stored as Unicode, such as an
470 "\x{...}" escape sequence representing a character literal, then this
471 means that all previously calculated sizes need to be redone, using
472 values appropriate for the Unicode representation. This is another
473 instance where the parsing needs to be restarted, and it can and is
474 done immediately. The function returns failure, and sets the flag
475 "RESTART_UTF8" (encapsulated by using the macro "REQUIRE_UTF8"). This
476 restart request is propagated up the call chain in a similar fashion,
477 until it is "caught" in Perl_re_op_compile(), which marks the pattern
478 as containing Unicode, and restarts the sizing pass. It is also
479 possible for constructions within run-time code blocks to turn out to
480 need Unicode representation., which is signalled by
481 S_compile_runtime_code() returning false to Perl_re_op_compile().
482
483 The restart was previously implemented using a "longjmp" in regatom()
484 back to a "setjmp" in Perl_re_op_compile(), but this proved to be
485 problematic as the latter is a large function containing many automatic
486 variables, which interact badly with the emergent control flow of
487 "setjmp".
488
489 Debug Output
490
491 Starting in the 5.9.x development version of perl you can "use re Debug
492 => 'PARSE'" to see some trace information about the parse process. We
493 will start with some simple patterns and build up to more complex
494 patterns.
495
496 So when we parse "/foo/" we see something like the following table. The
497 left shows what is being parsed, and the number indicates where the
498 next regop would go. The stuff on the right is the trace output of the
499 graph. The names are chosen to be short to make it less dense on the
500 screen. 'tsdy' is a special form of regtail() which does some extra
501 analysis.
502
503 >foo< 1 reg
504 brnc
505 piec
506 atom
507 >< 4 tsdy~ EXACT <foo> (EXACT) (1)
508 ~ attach to END (3) offset to 2
509
510 The resulting program then looks like:
511
512 1: EXACT <foo>(3)
513 3: END(0)
514
515 As you can see, even though we parsed out a branch and a piece, it was
516 ultimately only an atom. The final program shows us how things work. We
517 have an "EXACT" regop, followed by an "END" regop. The number in parens
518 indicates where the "regnext" of the node goes. The "regnext" of an
519 "END" regop is unused, as "END" regops mean we have successfully
520 matched. The number on the left indicates the position of the regop in
521 the regnode array.
522
523 Now let's try a harder pattern. We will add a quantifier, so now we
524 have the pattern "/foo+/". We will see that regbranch() calls
525 regpiece() twice.
526
527 >foo+< 1 reg
528 brnc
529 piec
530 atom
531 >o+< 3 piec
532 atom
533 >< 6 tail~ EXACT <fo> (1)
534 7 tsdy~ EXACT <fo> (EXACT) (1)
535 ~ PLUS (END) (3)
536 ~ attach to END (6) offset to 3
537
538 And we end up with the program:
539
540 1: EXACT <fo>(3)
541 3: PLUS(6)
542 4: EXACT <o>(0)
543 6: END(0)
544
545 Now we have a special case. The "EXACT" regop has a "regnext" of 0.
546 This is because if it matches it should try to match itself again. The
547 "PLUS" regop handles the actual failure of the "EXACT" regop and acts
548 appropriately (going to regnode 6 if the "EXACT" matched at least once,
549 or failing if it didn't).
550
551 Now for something much more complex: "/x(?:foo*|b[a][rR])(foo|bar)$/"
552
553 >x(?:foo*|b... 1 reg
554 brnc
555 piec
556 atom
557 >(?:foo*|b[... 3 piec
558 atom
559 >?:foo*|b[a... reg
560 >foo*|b[a][... brnc
561 piec
562 atom
563 >o*|b[a][rR... 5 piec
564 atom
565 >|b[a][rR])... 8 tail~ EXACT <fo> (3)
566 >b[a][rR])(... 9 brnc
567 10 piec
568 atom
569 >[a][rR])(f... 12 piec
570 atom
571 >a][rR])(fo... clas
572 >[rR])(foo|... 14 tail~ EXACT <b> (10)
573 piec
574 atom
575 >rR])(foo|b... clas
576 >)(foo|bar)... 25 tail~ EXACT <a> (12)
577 tail~ BRANCH (3)
578 26 tsdy~ BRANCH (END) (9)
579 ~ attach to TAIL (25) offset to 16
580 tsdy~ EXACT <fo> (EXACT) (4)
581 ~ STAR (END) (6)
582 ~ attach to TAIL (25) offset to 19
583 tsdy~ EXACT <b> (EXACT) (10)
584 ~ EXACT <a> (EXACT) (12)
585 ~ ANYOF[Rr] (END) (14)
586 ~ attach to TAIL (25) offset to 11
587 >(foo|bar)$< tail~ EXACT <x> (1)
588 piec
589 atom
590 >foo|bar)$< reg
591 28 brnc
592 piec
593 atom
594 >|bar)$< 31 tail~ OPEN1 (26)
595 >bar)$< brnc
596 32 piec
597 atom
598 >)$< 34 tail~ BRANCH (28)
599 36 tsdy~ BRANCH (END) (31)
600 ~ attach to CLOSE1 (34) offset to 3
601 tsdy~ EXACT <foo> (EXACT) (29)
602 ~ attach to CLOSE1 (34) offset to 5
603 tsdy~ EXACT <bar> (EXACT) (32)
604 ~ attach to CLOSE1 (34) offset to 2
605 >$< tail~ BRANCH (3)
606 ~ BRANCH (9)
607 ~ TAIL (25)
608 piec
609 atom
610 >< 37 tail~ OPEN1 (26)
611 ~ BRANCH (28)
612 ~ BRANCH (31)
613 ~ CLOSE1 (34)
614 38 tsdy~ EXACT <x> (EXACT) (1)
615 ~ BRANCH (END) (3)
616 ~ BRANCH (END) (9)
617 ~ TAIL (END) (25)
618 ~ OPEN1 (END) (26)
619 ~ BRANCH (END) (28)
620 ~ BRANCH (END) (31)
621 ~ CLOSE1 (END) (34)
622 ~ EOL (END) (36)
623 ~ attach to END (37) offset to 1
624
625 Resulting in the program
626
627 1: EXACT <x>(3)
628 3: BRANCH(9)
629 4: EXACT <fo>(6)
630 6: STAR(26)
631 7: EXACT <o>(0)
632 9: BRANCH(25)
633 10: EXACT <ba>(14)
634 12: OPTIMIZED (2 nodes)
635 14: ANYOF[Rr](26)
636 25: TAIL(26)
637 26: OPEN1(28)
638 28: TRIE-EXACT(34)
639 [StS:1 Wds:2 Cs:6 Uq:5 #Sts:7 Mn:3 Mx:3 Stcls:bf]
640 <foo>
641 <bar>
642 30: OPTIMIZED (4 nodes)
643 34: CLOSE1(36)
644 36: EOL(37)
645 37: END(0)
646
647 Here we can see a much more complex program, with various optimisations
648 in play. At regnode 10 we see an example where a character class with
649 only one character in it was turned into an "EXACT" node. We can also
650 see where an entire alternation was turned into a "TRIE-EXACT" node. As
651 a consequence, some of the regnodes have been marked as optimised away.
652 We can see that the "$" symbol has been converted into an "EOL" regop,
653 a special piece of code that looks for "\n" or the end of the string.
654
655 The next pointer for "BRANCH"es is interesting in that it points at
656 where execution should go if the branch fails. When executing, if the
657 engine tries to traverse from a branch to a "regnext" that isn't a
658 branch then the engine will know that the entire set of branches has
659 failed.
660
661 Peep-hole Optimisation and Analysis
662
663 The regular expression engine can be a weighty tool to wield. On long
664 strings and complex patterns it can end up having to do a lot of work
665 to find a match, and even more to decide that no match is possible.
666 Consider a situation like the following pattern.
667
668 'ababababababababababab' =~ /(a|b)*z/
669
670 The "(a|b)*" part can match at every char in the string, and then fail
671 every time because there is no "z" in the string. So obviously we can
672 avoid using the regex engine unless there is a "z" in the string.
673 Likewise in a pattern like:
674
675 /foo(\w+)bar/
676
677 In this case we know that the string must contain a "foo" which must be
678 followed by "bar". We can use Fast Boyer-Moore matching as implemented
679 in fbm_instr() to find the location of these strings. If they don't
680 exist then we don't need to resort to the much more expensive regex
681 engine. Even better, if they do exist then we can use their positions
682 to reduce the search space that the regex engine needs to cover to
683 determine if the entire pattern matches.
684
685 There are various aspects of the pattern that can be used to facilitate
686 optimisations along these lines:
687
688 • anchored fixed strings
689
690 • floating fixed strings
691
692 • minimum and maximum length requirements
693
694 • start class
695
696 • Beginning/End of line positions
697
698 Another form of optimisation that can occur is the post-parse "peep-
699 hole" optimisation, where inefficient constructs are replaced by more
700 efficient constructs. The "TAIL" regops which are used during parsing
701 to mark the end of branches and the end of groups are examples of this.
702 These regops are used as place-holders during construction and "always
703 match" so they can be "optimised away" by making the things that point
704 to the "TAIL" point to the thing that "TAIL" points to, thus "skipping"
705 the node.
706
707 Another optimisation that can occur is that of ""EXACT" merging" which
708 is where two consecutive "EXACT" nodes are merged into a single regop.
709 An even more aggressive form of this is that a branch sequence of the
710 form "EXACT BRANCH ... EXACT" can be converted into a "TRIE-EXACT"
711 regop.
712
713 All of this occurs in the routine study_chunk() which uses a special
714 structure "scan_data_t" to store the analysis that it has performed,
715 and does the "peep-hole" optimisations as it goes.
716
717 The code involved in study_chunk() is extremely cryptic. Be careful.
718 :-)
719
720 Execution
721 Execution of a regex generally involves two phases, the first being
722 finding the start point in the string where we should match from, and
723 the second being running the regop interpreter.
724
725 If we can tell that there is no valid start point then we don't bother
726 running the interpreter at all. Likewise, if we know from the analysis
727 phase that we cannot detect a short-cut to the start position, we go
728 straight to the interpreter.
729
730 The two entry points are re_intuit_start() and pregexec(). These
731 routines have a somewhat incestuous relationship with overlap between
732 their functions, and pregexec() may even call re_intuit_start() on its
733 own. Nevertheless other parts of the perl source code may call into
734 either, or both.
735
736 Execution of the interpreter itself used to be recursive, but thanks to
737 the efforts of Dave Mitchell in the 5.9.x development track, that has
738 changed: now an internal stack is maintained on the heap and the
739 routine is fully iterative. This can make it tricky as the code is
740 quite conservative about what state it stores, with the result that two
741 consecutive lines in the code can actually be running in totally
742 different contexts due to the simulated recursion.
743
744 Start position and no-match optimisations
745
746 re_intuit_start() is responsible for handling start points and no-match
747 optimisations as determined by the results of the analysis done by
748 study_chunk() (and described in "Peep-hole Optimisation and Analysis").
749
750 The basic structure of this routine is to try to find the start- and/or
751 end-points of where the pattern could match, and to ensure that the
752 string is long enough to match the pattern. It tries to use more
753 efficient methods over less efficient methods and may involve
754 considerable cross-checking of constraints to find the place in the
755 string that matches. For instance it may try to determine that a given
756 fixed string must be not only present but a certain number of chars
757 before the end of the string, or whatever.
758
759 It calls several other routines, such as fbm_instr() which does Fast
760 Boyer Moore matching and find_byclass() which is responsible for
761 finding the start using the first mandatory regop in the program.
762
763 When the optimisation criteria have been satisfied, reg_try() is called
764 to perform the match.
765
766 Program execution
767
768 pregexec() is the main entry point for running a regex. It contains
769 support for initialising the regex interpreter's state, running
770 re_intuit_start() if needed, and running the interpreter on the string
771 from various start positions as needed. When it is necessary to use the
772 regex interpreter pregexec() calls regtry().
773
774 regtry() is the entry point into the regex interpreter. It expects as
775 arguments a pointer to a "regmatch_info" structure and a pointer to a
776 string. It returns an integer 1 for success and a 0 for failure. It
777 is basically a set-up wrapper around regmatch().
778
779 "regmatch" is the main "recursive loop" of the interpreter. It is
780 basically a giant switch statement that implements a state machine,
781 where the possible states are the regops themselves, plus a number of
782 additional intermediate and failure states. A few of the states are
783 implemented as subroutines but the bulk are inline code.
784
786 Unicode and Localisation Support
787 When dealing with strings containing characters that cannot be
788 represented using an eight-bit character set, perl uses an internal
789 representation that is a permissive version of Unicode's UTF-8
790 encoding[2]. This uses single bytes to represent characters from the
791 ASCII character set, and sequences of two or more bytes for all other
792 characters. (See perlunitut for more information about the relationship
793 between UTF-8 and perl's encoding, utf8. The difference isn't important
794 for this discussion.)
795
796 No matter how you look at it, Unicode support is going to be a pain in
797 a regex engine. Tricks that might be fine when you have 256 possible
798 characters often won't scale to handle the size of the UTF-8 character
799 set. Things you can take for granted with ASCII may not be true with
800 Unicode. For instance, in ASCII, it is safe to assume that
801 "sizeof(char1) == sizeof(char2)", but in UTF-8 it isn't. Unicode case
802 folding is vastly more complex than the simple rules of ASCII, and even
803 when not using Unicode but only localised single byte encodings, things
804 can get tricky (for example, LATIN SMALL LETTER SHARP S (U+00DF, ß)
805 should match 'SS' in localised case-insensitive matching).
806
807 Making things worse is that UTF-8 support was a later addition to the
808 regex engine (as it was to perl) and this necessarily made things a
809 lot more complicated. Obviously it is easier to design a regex engine
810 with Unicode support in mind from the beginning than it is to retrofit
811 it to one that wasn't.
812
813 Nearly all regops that involve looking at the input string have two
814 cases, one for UTF-8, and one not. In fact, it's often more complex
815 than that, as the pattern may be UTF-8 as well.
816
817 Care must be taken when making changes to make sure that you handle
818 UTF-8 properly, both at compile time and at execution time, including
819 when the string and pattern are mismatched.
820
821 Base Structures
822 The "regexp" structure described in perlreapi is common to all regex
823 engines. Two of its fields are intended for the private use of the
824 regex engine that compiled the pattern. These are the "intflags" and
825 pprivate members. The "pprivate" is a void pointer to an arbitrary
826 structure whose use and management is the responsibility of the
827 compiling engine. perl will never modify either of these values. In the
828 case of the stock engine the structure pointed to by "pprivate" is
829 called "regexp_internal".
830
831 Its "pprivate" and "intflags" fields contain data specific to each
832 engine.
833
834 There are two structures used to store a compiled regular expression.
835 One, the "regexp" structure described in perlreapi is populated by the
836 engine currently being used and some of its fields read by perl to
837 implement things such as the stringification of "qr//".
838
839 The other structure is pointed to by the "regexp" struct's "pprivate"
840 and is in addition to "intflags" in the same struct considered to be
841 the property of the regex engine which compiled the regular expression;
842
843 The regexp structure contains all the data that perl needs to be aware
844 of to properly work with the regular expression. It includes data about
845 optimisations that perl can use to determine if the regex engine should
846 really be used, and various other control info that is needed to
847 properly execute patterns in various contexts such as is the pattern
848 anchored in some way, or what flags were used during the compile, or
849 whether the program contains special constructs that perl needs to be
850 aware of.
851
852 In addition it contains two fields that are intended for the private
853 use of the regex engine that compiled the pattern. These are the
854 "intflags" and pprivate members. The "pprivate" is a void pointer to an
855 arbitrary structure whose use and management is the responsibility of
856 the compiling engine. perl will never modify either of these values.
857
858 As mentioned earlier, in the case of the default engines, the
859 "pprivate" will be a pointer to a regexp_internal structure which holds
860 the compiled program and any additional data that is private to the
861 regex engine implementation.
862
863 Perl's "pprivate" structure
864
865 The following structure is used as the "pprivate" struct by perl's
866 regex engine. Since it is specific to perl it is only of curiosity
867 value to other engine implementations.
868
869 typedef struct regexp_internal {
870 regnode *regstclass;
871 struct reg_data *data;
872 struct reg_code_blocks *code_blocks;
873 U32 proglen;
874 U32 name_list_idx;
875 regnode program[1];
876 } regexp_internal;
877
878 Description of the attributes is as follows:
879
880 "regstclass"
881 Special regop that is used by re_intuit_start() to check if a
882 pattern can match at a certain position. For instance if the regex
883 engine knows that the pattern must start with a 'Z' then it can
884 scan the string until it finds one and then launch the regex
885 engine from there. The routine that handles this is called
886 find_by_class(). Sometimes this field points at a regop embedded
887 in the program, and sometimes it points at an independent
888 synthetic regop that has been constructed by the optimiser.
889
890 "data"
891 This field points at a "reg_data" structure, which is defined as
892 follows
893
894 struct reg_data {
895 U32 count;
896 U8 *what;
897 void* data[1];
898 };
899
900 This structure is used for handling data structures that the regex
901 engine needs to handle specially during a clone or free operation
902 on the compiled product. Each element in the data array has a
903 corresponding element in the what array. During compilation regops
904 that need special structures stored will add an element to each
905 array using the add_data() routine and then store the index in the
906 regop.
907
908 In modern perls the 0th element of this structure is reserved and
909 is NEVER used to store anything of use. This is to allow things
910 that need to index into this array to represent "no value".
911
912 "code_blocks"
913 This optional structure is used to manage "(?{})" constructs in
914 the pattern. It is made up of the following structures.
915
916 /* record the position of a (?{...}) within a pattern */
917 struct reg_code_block {
918 STRLEN start;
919 STRLEN end;
920 OP *block;
921 REGEXP *src_regex;
922 };
923
924 /* array of reg_code_block's plus header info */
925 struct reg_code_blocks {
926 int refcnt; /* we may be pointed to from a regex
927 and from the savestack */
928 int count; /* how many code blocks */
929 struct reg_code_block *cb; /* array of reg_code_block's */
930 };
931
932 "proglen"
933 Stores the length of the compiled program in units of regops.
934
935 "name_list_idx"
936 This is the index into the data array where an AV is stored that
937 contains the names of any named capture buffers in the pattern,
938 should there be any. This is only used in the debugging version of
939 the regex engine and when RXp_PAREN_NAMES(prog) is true. It will
940 be 0 if there is no such data.
941
942 "program"
943 Compiled program. Inlined into the structure so the entire struct
944 can be treated as a single blob.
945
947 perlreapi
948
949 perlre
950
951 perlunitut
952
954 by Yves Orton, 2006.
955
956 With excerpts from Perl, and contributions and suggestions from Ronald
957 J. Kimball, Dave Mitchell, Dominic Dunlop, Mark Jason Dominus, Stephen
958 McCamant, and David Landgren.
959
960 Now maintained by Perl 5 Porters.
961
963 Same terms as Perl.
964
966 [1] <https://perl.plover.com/Rx/paper/>
967
968 [2] <https://www.unicode.org/>
969
970
971
972perl v5.38.2 2023-11-30 PERLREGUTS(1)