1PERLREGUTS(1)          Perl Programmers Reference Guide          PERLREGUTS(1)
2
3
4

NAME

6       perlreguts - Description of the Perl regular expression engine.
7

DESCRIPTION

9       This document is an attempt to shine some light on the guts of the
10       regex engine and how it works. The regex engine represents a
11       significant chunk of the perl codebase, but is relatively poorly
12       understood. This document is a meagre attempt at addressing this
13       situation. It is derived from the author's experience, comments in the
14       source code, other papers on the regex engine, feedback on the
15       perl5-porters mail list, and no doubt other places as well.
16
17       NOTICE! It should be clearly understood that the behavior and
18       structures discussed in this represents the state of the engine as the
19       author understood it at the time of writing. It is NOT an API
20       definition, it is purely an internals guide for those who want to hack
21       the regex engine, or understand how the regex engine works. Readers of
22       this document are expected to understand perl's regex syntax and its
23       usage in detail. If you want to learn about the basics of Perl's
24       regular expressions, see perlre. And if you want to replace the regex
25       engine with your own see see perlreapi.
26

OVERVIEW

28   A quick note on terms
29       There is some debate as to whether to say "regexp" or "regex". In this
30       document we will use the term "regex" unless there is a special reason
31       not to, in which case we will explain why.
32
33       When speaking about regexes we need to distinguish between their source
34       code form and their internal form. In this document we will use the
35       term "pattern" when we speak of their textual, source code form, and
36       the term "program" when we speak of their internal representation.
37       These correspond to the terms S-regex and B-regex that Mark Jason
38       Dominus employs in his paper on "Rx" ([1] in "REFERENCES").
39
40   What is a regular expression engine?
41       A regular expression engine is a program that takes a set of
42       constraints specified in a mini-language, and then applies those
43       constraints to a target string, and determines whether or not the
44       string satisfies the constraints. See perlre for a full definition of
45       the language.
46
47       In less grandiose terms, the first part of the job is to turn a pattern
48       into something the computer can efficiently use to find the matching
49       point in the string, and the second part is performing the search
50       itself.
51
52       To do this we need to produce a program by parsing the text. We then
53       need to execute the program to find the point in the string that
54       matches. And we need to do the whole thing efficiently.
55
56   Structure of a Regexp Program
57       High Level
58
59       Although it is a bit confusing and some people object to the
60       terminology, it is worth taking a look at a comment that has been in
61       regexp.h for years:
62
63       This is essentially a linear encoding of a nondeterministic finite-
64       state machine (aka syntax charts or "railroad normal form" in parsing
65       technology).
66
67       The term "railroad normal form" is a bit esoteric, with "syntax
68       diagram/charts", or "railroad diagram/charts" being more common terms.
69       Nevertheless it provides a useful mental image of a regex program: each
70       node can be thought of as a unit of track, with a single entry and in
71       most cases a single exit point (there are pieces of track that fork,
72       but statistically not many), and the whole forms a layout with a single
73       entry and single exit point. The matching process can be thought of as
74       a car that moves along the track, with the particular route through the
75       system being determined by the character read at each possible
76       connector point. A car can fall off the track at any point but it may
77       only proceed as long as it matches the track.
78
79       Thus the pattern "/foo(?:\w+|\d+|\s+)bar/" can be thought of as the
80       following chart:
81
82                             [start]
83                                |
84                              <foo>
85                                |
86                          +-----+-----+
87                          |     |     |
88                        <\w+> <\d+> <\s+>
89                          |     |     |
90                          +-----+-----+
91                                |
92                              <bar>
93                                |
94                              [end]
95
96       The truth of the matter is that perl's regular expressions these days
97       are much more complex than this kind of structure, but visualising it
98       this way can help when trying to get your bearings, and it matches the
99       current implementation pretty closely.
100
101       To be more precise, we will say that a regex program is an encoding of
102       a graph. Each node in the graph corresponds to part of the original
103       regex pattern, such as a literal string or a branch, and has a pointer
104       to the nodes representing the next component to be matched. Since
105       "node" and "opcode" already have other meanings in the perl source, we
106       will call the nodes in a regex program "regops".
107
108       The program is represented by an array of "regnode" structures, one or
109       more of which represent a single regop of the program. Struct "regnode"
110       is the smallest struct needed, and has a field structure which is
111       shared with all the other larger structures.
112
113       The "next" pointers of all regops except "BRANCH" implement
114       concatenation; a "next" pointer with a "BRANCH" on both ends of it is
115       connecting two alternatives.  [Here we have one of the subtle syntax
116       dependencies: an individual "BRANCH" (as opposed to a collection of
117       them) is never concatenated with anything because of operator
118       precedence.]
119
120       The operand of some types of regop is a literal string; for others, it
121       is a regop leading into a sub-program.  In particular, the operand of a
122       "BRANCH" node is the first regop of the branch.
123
124       NOTE: As the railroad metaphor suggests, this is not a tree structure:
125       the tail of the branch connects to the thing following the set of
126       "BRANCH"es.  It is a like a single line of railway track that splits as
127       it goes into a station or railway yard and rejoins as it comes out the
128       other side.
129
130       Regops
131
132       The base structure of a regop is defined in regexp.h as follows:
133
134           struct regnode {
135               U8  flags;    /* Various purposes, sometimes overridden */
136               U8  type;     /* Opcode value as specified by regnodes.h */
137               U16 next_off; /* Offset in size regnode */
138           };
139
140       Other larger "regnode"-like structures are defined in regcomp.h. They
141       are almost like subclasses in that they have the same fields as
142       "regnode", with possibly additional fields following in the structure,
143       and in some cases the specific meaning (and name) of some of base
144       fields are overridden. The following is a more complete description.
145
146       "regnode_1"
147       "regnode_2"
148           "regnode_1" structures have the same header, followed by a single
149           four-byte argument; "regnode_2" structures contain two two-byte
150           arguments instead:
151
152               regnode_1                U32 arg1;
153               regnode_2                U16 arg1;  U16 arg2;
154
155       "regnode_string"
156           "regnode_string" structures, used for literal strings, follow the
157           header with a one-byte length and then the string data. Strings are
158           padded on the end with zero bytes so that the total length of the
159           node is a multiple of four bytes:
160
161               regnode_string           char string[1];
162                                        U8 str_len; /* overrides flags */
163
164       "regnode_charclass"
165           Character classes are represented by "regnode_charclass"
166           structures, which have a four-byte argument and then a 32-byte
167           (256-bit) bitmap indicating which characters are included in the
168           class.
169
170               regnode_charclass        U32 arg1;
171                                        char bitmap[ANYOF_BITMAP_SIZE];
172
173       "regnode_charclass_class"
174           There is also a larger form of a char class structure used to
175           represent POSIX char classes called "regnode_charclass_class" which
176           has an additional 4-byte (32-bit) bitmap indicating which POSIX
177           char classes have been included.
178
179               regnode_charclass_class  U32 arg1;
180                                        char bitmap[ANYOF_BITMAP_SIZE];
181                                        char classflags[ANYOF_CLASSBITMAP_SIZE];
182
183       regnodes.h defines an array called "regarglen[]" which gives the size
184       of each opcode in units of "size regnode" (4-byte). A macro is used to
185       calculate the size of an "EXACT" node based on its "str_len" field.
186
187       The regops are defined in regnodes.h which is generated from
188       regcomp.sym by regcomp.pl. Currently the maximum possible number of
189       distinct regops is restricted to 256, with about a quarter already
190       used.
191
192       A set of macros makes accessing the fields easier and more consistent.
193       These include "OP()", which is used to determine the type of a
194       "regnode"-like structure; "NEXT_OFF()", which is the offset to the next
195       node (more on this later); "ARG()", "ARG1()", "ARG2()", "ARG_SET()",
196       and equivalents for reading and setting the arguments; and "STR_LEN()",
197       "STRING()" and "OPERAND()" for manipulating strings and regop bearing
198       types.
199
200       What regop is next?
201
202       There are three distinct concepts of "next" in the regex engine, and it
203       is important to keep them clear.
204
205       ·   There is the "next regnode" from a given regnode, a value which is
206           rarely useful except that sometimes it matches up in terms of value
207           with one of the others, and that sometimes the code assumes this to
208           always be so.
209
210       ·   There is the "next regop" from a given regop/regnode. This is the
211           regop physically located after the the current one, as determined
212           by the size of the current regop. This is often useful, such as
213           when dumping the structure we use this order to traverse. Sometimes
214           the code assumes that the "next regnode" is the same as the "next
215           regop", or in other words assumes that the sizeof a given regop
216           type is always going to be one regnode large.
217
218       ·   There is the "regnext" from a given regop. This is the regop which
219           is reached by jumping forward by the value of "NEXT_OFF()", or in a
220           few cases for longer jumps by the "arg1" field of the "regnode_1"
221           structure. The subroutine "regnext()" handles this transparently.
222           This is the logical successor of the node, which in some cases,
223           like that of the "BRANCH" regop, has special meaning.
224

Process Overview

226       Broadly speaking, performing a match of a string against a pattern
227       involves the following steps:
228
229       A. Compilation
230            1. Parsing for size
231            2. Parsing for construction
232            3. Peep-hole optimisation and analysis
233       B. Execution
234            4. Start position and no-match optimisations
235            5. Program execution
236
237       Where these steps occur in the actual execution of a perl program is
238       determined by whether the pattern involves interpolating any string
239       variables. If interpolation occurs, then compilation happens at run
240       time. If it does not, then compilation is performed at compile time.
241       (The "/o" modifier changes this, as does "qr//" to a certain extent.)
242       The engine doesn't really care that much.
243
244   Compilation
245       This code resides primarily in regcomp.c, along with the header files
246       regcomp.h, regexp.h and regnodes.h.
247
248       Compilation starts with "pregcomp()", which is mostly an initialisation
249       wrapper which farms work out to two other routines for the heavy
250       lifting: the first is "reg()", which is the start point for parsing;
251       the second, "study_chunk()", is responsible for optimisation.
252
253       Initialisation in "pregcomp()" mostly involves the creation and data-
254       filling of a special structure, "RExC_state_t" (defined in regcomp.c).
255       Almost all internally-used routines in regcomp.h take a pointer to one
256       of these structures as their first argument, with the name
257       "pRExC_state".  This structure is used to store the compilation state
258       and contains many fields. Likewise there are many macros which operate
259       on this variable: anything that looks like "RExC_xxxx" is a macro that
260       operates on this pointer/structure.
261
262       Parsing for size
263
264       In this pass the input pattern is parsed in order to calculate how much
265       space is needed for each regop we would need to emit. The size is also
266       used to determine whether long jumps will be required in the program.
267
268       This stage is controlled by the macro "SIZE_ONLY" being set.
269
270       The parse proceeds pretty much exactly as it does during the
271       construction phase, except that most routines are short-circuited to
272       change the size field "RExC_size" and not do anything else.
273
274       Parsing for construction
275
276       Once the size of the program has been determined, the pattern is parsed
277       again, but this time for real. Now "SIZE_ONLY" will be false, and the
278       actual construction can occur.
279
280       "reg()" is the start of the parse process. It is responsible for
281       parsing an arbitrary chunk of pattern up to either the end of the
282       string, or the first closing parenthesis it encounters in the pattern.
283       This means it can be used to parse the top-level regex, or any section
284       inside of a grouping parenthesis. It also handles the "special parens"
285       that perl's regexes have. For instance when parsing "/x(?:foo)y/"
286       "reg()" will at one point be called to parse from the "?" symbol up to
287       and including the ")".
288
289       Additionally, "reg()" is responsible for parsing the one or more
290       branches from the pattern, and for "finishing them off" by correctly
291       setting their next pointers. In order to do the parsing, it repeatedly
292       calls out to "regbranch()", which is responsible for handling up to the
293       first "|" symbol it sees.
294
295       "regbranch()" in turn calls "regpiece()" which handles "things"
296       followed by a quantifier. In order to parse the "things", "regatom()"
297       is called. This is the lowest level routine, which parses out constant
298       strings, character classes, and the various special symbols like "$".
299       If "regatom()" encounters a "(" character it in turn calls "reg()".
300
301       The routine "regtail()" is called by both "reg()" and "regbranch()" in
302       order to "set the tail pointer" correctly. When executing and we get to
303       the end of a branch, we need to go to the node following the grouping
304       parens. When parsing, however, we don't know where the end will be
305       until we get there, so when we do we must go back and update the
306       offsets as appropriate. "regtail" is used to make this easier.
307
308       A subtlety of the parsing process means that a regex like "/foo/" is
309       originally parsed into an alternation with a single branch. It is only
310       afterwards that the optimiser converts single branch alternations into
311       the simpler form.
312
313       Parse Call Graph and a Grammar
314
315       The call graph looks like this:
316
317           reg()                        # parse a top level regex, or inside of parens
318               regbranch()              # parse a single branch of an alternation
319                   regpiece()           # parse a pattern followed by a quantifier
320                       regatom()        # parse a simple pattern
321                           regclass()   #   used to handle a class
322                           reg()        #   used to handle a parenthesised subpattern
323                           ....
324                   ...
325                   regtail()            # finish off the branch
326               ...
327               regtail()                # finish off the branch sequence. Tie each
328                                        # branch's tail to the tail of the sequence
329                                        # (NEW) In Debug mode this is
330                                        # regtail_study().
331
332       A grammar form might be something like this:
333
334           atom  : constant | class
335           quant : '*' | '+' | '?' | '{min,max}'
336           _branch: piece
337                  | piece _branch
338                  | nothing
339           branch: _branch
340                 | _branch '|' branch
341           group : '(' branch ')'
342           _piece: atom | group
343           piece : _piece
344                 | _piece quant
345
346       Debug Output
347
348       In the 5.9.x development version of perl you can "use re Debug =>
349       'PARSE'" to see some trace information about the parse process. We will
350       start with some simple patterns and build up to more complex patterns.
351
352       So when we parse "/foo/" we see something like the following table. The
353       left shows what is being parsed, and the number indicates where the
354       next regop would go. The stuff on the right is the trace output of the
355       graph. The names are chosen to be short to make it less dense on the
356       screen. 'tsdy' is a special form of "regtail()" which does some extra
357       analysis.
358
359        >foo<             1    reg
360                                 brnc
361                                   piec
362                                     atom
363        ><                4      tsdy~ EXACT <foo> (EXACT) (1)
364                                     ~ attach to END (3) offset to 2
365
366       The resulting program then looks like:
367
368          1: EXACT <foo>(3)
369          3: END(0)
370
371       As you can see, even though we parsed out a branch and a piece, it was
372       ultimately only an atom. The final program shows us how things work. We
373       have an "EXACT" regop, followed by an "END" regop. The number in parens
374       indicates where the "regnext" of the node goes. The "regnext" of an
375       "END" regop is unused, as "END" regops mean we have successfully
376       matched. The number on the left indicates the position of the regop in
377       the regnode array.
378
379       Now let's try a harder pattern. We will add a quantifier, so now we
380       have the pattern "/foo+/". We will see that "regbranch()" calls
381       "regpiece()" twice.
382
383        >foo+<            1    reg
384                                 brnc
385                                   piec
386                                     atom
387        >o+<              3        piec
388                                     atom
389        ><                6        tail~ EXACT <fo> (1)
390                          7      tsdy~ EXACT <fo> (EXACT) (1)
391                                     ~ PLUS (END) (3)
392                                     ~ attach to END (6) offset to 3
393
394       And we end up with the program:
395
396          1: EXACT <fo>(3)
397          3: PLUS(6)
398          4:   EXACT <o>(0)
399          6: END(0)
400
401       Now we have a special case. The "EXACT" regop has a "regnext" of 0.
402       This is because if it matches it should try to match itself again. The
403       "PLUS" regop handles the actual failure of the "EXACT" regop and acts
404       appropriately (going to regnode 6 if the "EXACT" matched at least once,
405       or failing if it didn't).
406
407       Now for something much more complex: "/x(?:foo*|b[a][rR])(foo|bar)$/"
408
409        >x(?:foo*|b...    1    reg
410                                 brnc
411                                   piec
412                                     atom
413        >(?:foo*|b[...    3        piec
414                                     atom
415        >?:foo*|b[a...                 reg
416        >foo*|b[a][...                   brnc
417                                           piec
418                                             atom
419        >o*|b[a][rR...    5                piec
420                                             atom
421        >|b[a][rR])...    8                tail~ EXACT <fo> (3)
422        >b[a][rR])(...    9              brnc
423                         10                piec
424                                             atom
425        >[a][rR])(f...   12                piec
426                                             atom
427        >a][rR])(fo...                         clas
428        >[rR])(foo|...   14                tail~ EXACT <b> (10)
429                                           piec
430                                             atom
431        >rR])(foo|b...                         clas
432        >)(foo|bar)...   25                tail~ EXACT <a> (12)
433                                         tail~ BRANCH (3)
434                         26              tsdy~ BRANCH (END) (9)
435                                             ~ attach to TAIL (25) offset to 16
436                                         tsdy~ EXACT <fo> (EXACT) (4)
437                                             ~ STAR (END) (6)
438                                             ~ attach to TAIL (25) offset to 19
439                                         tsdy~ EXACT <b> (EXACT) (10)
440                                             ~ EXACT <a> (EXACT) (12)
441                                             ~ ANYOF[Rr] (END) (14)
442                                             ~ attach to TAIL (25) offset to 11
443        >(foo|bar)$<               tail~ EXACT <x> (1)
444                                   piec
445                                     atom
446        >foo|bar)$<                    reg
447                         28              brnc
448                                           piec
449                                             atom
450        >|bar)$<         31              tail~ OPEN1 (26)
451        >bar)$<                          brnc
452                         32                piec
453                                             atom
454        >)$<             34              tail~ BRANCH (28)
455                         36              tsdy~ BRANCH (END) (31)
456                                             ~ attach to CLOSE1 (34) offset to 3
457                                         tsdy~ EXACT <foo> (EXACT) (29)
458                                             ~ attach to CLOSE1 (34) offset to 5
459                                         tsdy~ EXACT <bar> (EXACT) (32)
460                                             ~ attach to CLOSE1 (34) offset to 2
461        >$<                        tail~ BRANCH (3)
462                                       ~ BRANCH (9)
463                                       ~ TAIL (25)
464                                   piec
465                                     atom
466        ><               37        tail~ OPEN1 (26)
467                                       ~ BRANCH (28)
468                                       ~ BRANCH (31)
469                                       ~ CLOSE1 (34)
470                         38      tsdy~ EXACT <x> (EXACT) (1)
471                                     ~ BRANCH (END) (3)
472                                     ~ BRANCH (END) (9)
473                                     ~ TAIL (END) (25)
474                                     ~ OPEN1 (END) (26)
475                                     ~ BRANCH (END) (28)
476                                     ~ BRANCH (END) (31)
477                                     ~ CLOSE1 (END) (34)
478                                     ~ EOL (END) (36)
479                                     ~ attach to END (37) offset to 1
480
481       Resulting in the program
482
483          1: EXACT <x>(3)
484          3: BRANCH(9)
485          4:   EXACT <fo>(6)
486          6:   STAR(26)
487          7:     EXACT <o>(0)
488          9: BRANCH(25)
489         10:   EXACT <ba>(14)
490         12:   OPTIMIZED (2 nodes)
491         14:   ANYOF[Rr](26)
492         25: TAIL(26)
493         26: OPEN1(28)
494         28:   TRIE-EXACT(34)
495               [StS:1 Wds:2 Cs:6 Uq:5 #Sts:7 Mn:3 Mx:3 Stcls:bf]
496                 <foo>
497                 <bar>
498         30:   OPTIMIZED (4 nodes)
499         34: CLOSE1(36)
500         36: EOL(37)
501         37: END(0)
502
503       Here we can see a much more complex program, with various optimisations
504       in play. At regnode 10 we see an example where a character class with
505       only one character in it was turned into an "EXACT" node. We can also
506       see where an entire alternation was turned into a "TRIE-EXACT" node. As
507       a consequence, some of the regnodes have been marked as optimised away.
508       We can see that the "$" symbol has been converted into an "EOL" regop,
509       a special piece of code that looks for "\n" or the end of the string.
510
511       The next pointer for "BRANCH"es is interesting in that it points at
512       where execution should go if the branch fails. When executing, if the
513       engine tries to traverse from a branch to a "regnext" that isn't a
514       branch then the engine will know that the entire set of branches has
515       failed.
516
517       Peep-hole Optimisation and Analysis
518
519       The regular expression engine can be a weighty tool to wield. On long
520       strings and complex patterns it can end up having to do a lot of work
521       to find a match, and even more to decide that no match is possible.
522       Consider a situation like the following pattern.
523
524          'ababababababababababab' =~ /(a|b)*z/
525
526       The "(a|b)*" part can match at every char in the string, and then fail
527       every time because there is no "z" in the string. So obviously we can
528       avoid using the regex engine unless there is a "z" in the string.
529       Likewise in a pattern like:
530
531          /foo(\w+)bar/
532
533       In this case we know that the string must contain a "foo" which must be
534       followed by "bar". We can use Fast Boyer-Moore matching as implemented
535       in "fbm_instr()" to find the location of these strings. If they don't
536       exist then we don't need to resort to the much more expensive regex
537       engine.  Even better, if they do exist then we can use their positions
538       to reduce the search space that the regex engine needs to cover to
539       determine if the entire pattern matches.
540
541       There are various aspects of the pattern that can be used to facilitate
542       optimisations along these lines:
543
544       ·    anchored fixed strings
545
546       ·    floating fixed strings
547
548       ·    minimum and maximum length requirements
549
550       ·    start class
551
552       ·    Beginning/End of line positions
553
554       Another form of optimisation that can occur is the post-parse "peep-
555       hole" optimisation, where inefficient constructs are replaced by more
556       efficient constructs. The "TAIL" regops which are used during parsing
557       to mark the end of branches and the end of groups are examples of this.
558       These regops are used as place-holders during construction and "always
559       match" so they can be "optimised away" by making the things that point
560       to the "TAIL" point to the thing that "TAIL" points to, thus "skipping"
561       the node.
562
563       Another optimisation that can occur is that of ""EXACT" merging" which
564       is where two consecutive "EXACT" nodes are merged into a single regop.
565       An even more aggressive form of this is that a branch sequence of the
566       form "EXACT BRANCH ... EXACT" can be converted into a "TRIE-EXACT"
567       regop.
568
569       All of this occurs in the routine "study_chunk()" which uses a special
570       structure "scan_data_t" to store the analysis that it has performed,
571       and does the "peep-hole" optimisations as it goes.
572
573       The code involved in "study_chunk()" is extremely cryptic. Be careful.
574       :-)
575
576   Execution
577       Execution of a regex generally involves two phases, the first being
578       finding the start point in the string where we should match from, and
579       the second being running the regop interpreter.
580
581       If we can tell that there is no valid start point then we don't bother
582       running interpreter at all. Likewise, if we know from the analysis
583       phase that we cannot detect a short-cut to the start position, we go
584       straight to the interpreter.
585
586       The two entry points are "re_intuit_start()" and "pregexec()". These
587       routines have a somewhat incestuous relationship with overlap between
588       their functions, and "pregexec()" may even call "re_intuit_start()" on
589       its own. Nevertheless other parts of the the perl source code may call
590       into either, or both.
591
592       Execution of the interpreter itself used to be recursive, but thanks to
593       the efforts of Dave Mitchell in the 5.9.x development track, that has
594       changed: now an internal stack is maintained on the heap and the
595       routine is fully iterative. This can make it tricky as the code is
596       quite conservative about what state it stores, with the result that
597       that two consecutive lines in the code can actually be running in
598       totally different contexts due to the simulated recursion.
599
600       Start position and no-match optimisations
601
602       "re_intuit_start()" is responsible for handling start points and no-
603       match optimisations as determined by the results of the analysis done
604       by "study_chunk()" (and described in "Peep-hole Optimisation and
605       Analysis").
606
607       The basic structure of this routine is to try to find the start- and/or
608       end-points of where the pattern could match, and to ensure that the
609       string is long enough to match the pattern. It tries to use more
610       efficient methods over less efficient methods and may involve
611       considerable cross-checking of constraints to find the place in the
612       string that matches.  For instance it may try to determine that a given
613       fixed string must be not only present but a certain number of chars
614       before the end of the string, or whatever.
615
616       It calls several other routines, such as "fbm_instr()" which does Fast
617       Boyer Moore matching and "find_byclass()" which is responsible for
618       finding the start using the first mandatory regop in the program.
619
620       When the optimisation criteria have been satisfied, "reg_try()" is
621       called to perform the match.
622
623       Program execution
624
625       "pregexec()" is the main entry point for running a regex. It contains
626       support for initialising the regex interpreter's state, running
627       "re_intuit_start()" if needed, and running the interpreter on the
628       string from various start positions as needed. When it is necessary to
629       use the regex interpreter "pregexec()" calls "regtry()".
630
631       "regtry()" is the entry point into the regex interpreter. It expects as
632       arguments a pointer to a "regmatch_info" structure and a pointer to a
633       string.  It returns an integer 1 for success and a 0 for failure.  It
634       is basically a set-up wrapper around "regmatch()".
635
636       "regmatch" is the main "recursive loop" of the interpreter. It is
637       basically a giant switch statement that implements a state machine,
638       where the possible states are the regops themselves, plus a number of
639       additional intermediate and failure states. A few of the states are
640       implemented as subroutines but the bulk are inline code.
641

MISCELLANEOUS

643   Unicode and Localisation Support
644       When dealing with strings containing characters that cannot be
645       represented using an eight-bit character set, perl uses an internal
646       representation that is a permissive version of Unicode's UTF-8
647       encoding[2]. This uses single bytes to represent characters from the
648       ASCII character set, and sequences of two or more bytes for all other
649       characters. (See perlunitut for more information about the relationship
650       between UTF-8 and perl's encoding, utf8 -- the difference isn't
651       important for this discussion.)
652
653       No matter how you look at it, Unicode support is going to be a pain in
654       a regex engine. Tricks that might be fine when you have 256 possible
655       characters often won't scale to handle the size of the UTF-8 character
656       set.  Things you can take for granted with ASCII may not be true with
657       Unicode. For instance, in ASCII, it is safe to assume that
658       "sizeof(char1) == sizeof(char2)", but in UTF-8 it isn't. Unicode case
659       folding is vastly more complex than the simple rules of ASCII, and even
660       when not using Unicode but only localised single byte encodings, things
661       can get tricky (for example, LATIN SMALL LETTER SHARP S (U+00DF, ss)
662       should match 'SS' in localised case-insensitive matching).
663
664       Making things worse is that UTF-8 support was a later addition to the
665       regex engine (as it was to perl) and this necessarily  made things a
666       lot more complicated. Obviously it is easier to design a regex engine
667       with Unicode support in mind from the beginning than it is to retrofit
668       it to one that wasn't.
669
670       Nearly all regops that involve looking at the input string have two
671       cases, one for UTF-8, and one not. In fact, it's often more complex
672       than that, as the pattern may be UTF-8 as well.
673
674       Care must be taken when making changes to make sure that you handle
675       UTF-8 properly, both at compile time and at execution time, including
676       when the string and pattern are mismatched.
677
678       The following comment in regcomp.h gives an example of exactly how
679       tricky this can be:
680
681           Two problematic code points in Unicode casefolding of EXACT nodes:
682
683           U+0390 - GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
684           U+03B0 - GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
685
686           which casefold to
687
688           Unicode                      UTF-8
689
690           U+03B9 U+0308 U+0301         0xCE 0xB9 0xCC 0x88 0xCC 0x81
691           U+03C5 U+0308 U+0301         0xCF 0x85 0xCC 0x88 0xCC 0x81
692
693           This means that in case-insensitive matching (or "loose matching",
694           as Unicode calls it), an EXACTF of length six (the UTF-8 encoded
695           byte length of the above casefolded versions) can match a target
696           string of length two (the byte length of UTF-8 encoded U+0390 or
697           U+03B0). This would rather mess up the minimum length computation.
698
699           What we'll do is to look for the tail four bytes, and then peek
700           at the preceding two bytes to see whether we need to decrease
701           the minimum length by four (six minus two).
702
703           Thanks to the design of UTF-8, there cannot be false matches:
704           A sequence of valid UTF-8 bytes cannot be a subsequence of
705           another valid sequence of UTF-8 bytes.
706
707   Base Structures
708       The "regexp" structure described in perlreapi is common to all regex
709       engines. Two of its fields that are intended for the private use of the
710       regex engine that compiled the pattern. These are the "intflags" and
711       pprivate members. The "pprivate" is a void pointer to an arbitrary
712       structure whose use and management is the responsibility of the
713       compiling engine. perl will never modify either of these values. In the
714       case of the stock engine the structure pointed to by "pprivate" is
715       called "regexp_internal".
716
717       Its "pprivate" and "intflags" fields contain data specific to each
718       engine.
719
720       There are two structures used to store a compiled regular expression.
721       One, the "regexp" structure described in perlreapi is populated by the
722       engine currently being. used and some of its fields read by perl to
723       implement things such as the stringification of "qr//".
724
725       The other structure is pointed to be the "regexp" struct's "pprivate"
726       and is in addition to "intflags" in the same struct considered to be
727       the property of the regex engine which compiled the regular expression;
728
729       The regexp structure contains all the data that perl needs to be aware
730       of to properly work with the regular expression. It includes data about
731       optimisations that perl can use to determine if the regex engine should
732       really be used, and various other control info that is needed to
733       properly execute patterns in various contexts such as is the pattern
734       anchored in some way, or what flags were used during the compile, or
735       whether the program contains special constructs that perl needs to be
736       aware of.
737
738       In addition it contains two fields that are intended for the private
739       use of the regex engine that compiled the pattern. These are the
740       "intflags" and pprivate members. The "pprivate" is a void pointer to an
741       arbitrary structure whose use and management is the responsibility of
742       the compiling engine. perl will never modify either of these values.
743
744       As mentioned earlier, in the case of the default engines, the
745       "pprivate" will be a pointer to a regexp_internal structure which holds
746       the compiled program and any additional data that is private to the
747       regex engine implementation.
748
749       Perl's "pprivate" structure
750
751       The following structure is used as the "pprivate" struct by perl's
752       regex engine. Since it is specific to perl it is only of curiosity
753       value to other engine implementations.
754
755           typedef struct regexp_internal {
756                   regexp_paren_ofs *swap; /* Swap copy of *startp / *endp */
757                   U32 *offsets;           /* offset annotations 20001228 MJD
758                                              data about mapping the program to the
759                                              string*/
760                   regnode *regstclass;    /* Optional startclass as identified or constructed
761                                              by the optimiser */
762                   struct reg_data *data;  /* Additional miscellaneous data used by the program.
763                                              Used to make it easier to clone and free arbitrary
764                                              data that the regops need. Often the ARG field of
765                                              a regop is an index into this structure */
766                   regnode program[1];     /* Unwarranted chumminess with compiler. */
767           } regexp_internal;
768
769       "swap"
770            "swap" formerly was an extra set of startp/endp stored in a
771            "regexp_paren_ofs" struct. This was used when the last successful
772            match was from the same pattern as the current pattern, so that a
773            partial match didn't overwrite the previous match's results, but
774            it caused a problem with re-entrant code such as trying to build
775            the UTF-8 swashes.  Currently unused and left for backward
776            compatibility with 5.10.0.
777
778       "offsets"
779            Offsets holds a mapping of offset in the "program" to offset in
780            the "precomp" string. This is only used by ActiveState's visual
781            regex debugger.
782
783       "regstclass"
784            Special regop that is used by "re_intuit_start()" to check if a
785            pattern can match at a certain position. For instance if the regex
786            engine knows that the pattern must start with a 'Z' then it can
787            scan the string until it finds one and then launch the regex
788            engine from there. The routine that handles this is called
789            "find_by_class()". Sometimes this field points at a regop embedded
790            in the program, and sometimes it points at an independent
791            synthetic regop that has been constructed by the optimiser.
792
793       "data"
794            This field points at a reg_data structure, which is defined as
795            follows
796
797                struct reg_data {
798                    U32 count;
799                    U8 *what;
800                    void* data[1];
801                };
802
803            This structure is used for handling data structures that the regex
804            engine needs to handle specially during a clone or free operation
805            on the compiled product. Each element in the data array has a
806            corresponding element in the what array. During compilation regops
807            that need special structures stored will add an element to each
808            array using the add_data() routine and then store the index in the
809            regop.
810
811       "program"
812            Compiled program. Inlined into the structure so the entire struct
813            can be treated as a single blob.
814

SEE ALSO

816       perlreapi
817
818       perlre
819
820       perlunitut
821

AUTHOR

823       by Yves Orton, 2006.
824
825       With excerpts from Perl, and contributions and suggestions from Ronald
826       J. Kimball, Dave Mitchell, Dominic Dunlop, Mark Jason Dominus, Stephen
827       McCamant, and David Landgren.
828

LICENCE

830       Same terms as Perl.
831

REFERENCES

833       [1] <http://perl.plover.com/Rx/paper/>
834
835       [2] <http://www.unicode.org>
836
837
838
839perl v5.10.1                      2017-03-22                     PERLREGUTS(1)
Impressum