perlfaq6(1)

PERLFAQ6(1)            Perl Programmers Reference Guide            PERLFAQ6(1)

NAME

       perlfaq6 - Regular Expressions ($Revision: 1.38 $, $Date: 2005/12/31
       00:54:37 $)

DESCRIPTION

       This section is surprisingly small because the rest of the FAQ is
       littered with answers involving regular expressions.  For example,
       decoding a URL and checking whether something is a number are handled
       with regular expressions, but those answers are found elsewhere in
       this document (in perlfaq9: "How do I decode or create those
       %-encodings on the web" and perlfaq4: "How do I determine whether a
       scalar is a number/whole/integer/float", to be precise).
       
       How can I hope to use regular expressions without creating illegible
       and unmaintainable code?

       Three techniques can make regular expressions maintainable and
       understandable.

       Comments Outside the Regex
           Describe what you're doing and how you're doing it, using normal
           Perl comments.

               # turn the line into the first word, a colon, and the
               # number of characters on the rest of the line
               s/^(\w+)(.*)/ lc($1) . ":" . length($2) /meg;
       
       Comments Inside the Regex
           The "/x" modifier causes whitespace to be ignored in a regex
           pattern (except in a character class), and also allows you to use
           normal comments there, too.  As you can imagine, whitespace and
           comments help a lot.

           "/x" lets you turn this:

               s{<(?:[^>'"]*|".*?"|'.*?')+>}{}gs;

           into this:

               s{ <                    # opening angle bracket
                   (?:                 # Non-backreffing grouping paren
                        [^>'"] *       # 0 or more things that are neither > nor ' nor "
                           |           #    or else
                        ".*?"          # a section between double quotes (stingy match)
                           |           #    or else
                        '.*?'          # a section between single quotes (stingy match)
                   ) +                 #   all occurring one or more times
                  >                    # closing angle bracket
               }{}gsx;                 # replace with nothing, i.e. delete

           It's still not quite so clear as prose, but it is very useful for
           describing the meaning of each part of the pattern.
       
       Different Delimiters
           While we normally think of patterns as being delimited with "/"
           characters, they can be delimited by almost any character.  perlre
           describes this.  For example, the "s///" above uses braces as
           delimiters.  Selecting another delimiter can avoid quoting the
           delimiter within the pattern:

               s/\/usr\/local/\/usr\/share/g;      # bad delimiter choice
               s#/usr/local#/usr/share#g;          # better
       
       I'm having trouble matching over more than one line.  What's wrong?

       Either you don't have more than one line in the string you're looking
       at (probably), or else you aren't using the correct modifier(s) on your
       pattern (possibly).

       There are many ways to get multiline data into a string.  If you want
       it to happen automatically while reading input, you'll want to set $/
       (probably to '' for paragraphs or "undef" for the whole file) to allow
       you to read more than one line at a time.

       Read perlre to help you decide which of "/s" and "/m" (or both) you
       might want to use: "/s" allows dot to include newline, and "/m" allows
       caret and dollar to match next to a newline, not just at the end of the
       string.  You do need to make sure that you've actually got a multiline
       string in there.
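       
       As a rough illustration of the two modifiers, the following sketch
       matches the same two-line string with and without "/s" and "/m".  The
       sample string and the variable names are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical two-line string used only for this illustration.
my $text = "first line\nsecond line\n";

# Without /s, dot stops at the newline, so the match cannot span lines.
my $dot_plain = $text =~ /first.*second/  ? 1 : 0;

# With /s, dot also matches the newline.
my $dot_s     = $text =~ /first.*second/s ? 1 : 0;

# Without /m, caret matches only at the very start of the string.
my $caret_plain = $text =~ /^second/  ? 1 : 0;

# With /m, caret also matches just after an embedded newline.
my $caret_m     = $text =~ /^second/m ? 1 : 0;

print "dot: $dot_plain $dot_s  caret: $caret_plain $caret_m\n";
```

       Only the versions with the modifier succeed here, so the script
       prints "dot: 0 1  caret: 0 1".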
       
       For example, this program detects duplicate words, even when they span
       line breaks (but not paragraph ones).  For this example, we don't need
       "/s" because we aren't using dot in a regular expression that we want
       to cross line boundaries.  Neither do we need "/m" because we aren't
       wanting caret or dollar to match at any point inside the record next to
       newlines.  But it's imperative that $/ be set to something other than
       the default, or else we won't actually ever have a multiline record
       read in.

           $/ = '';            # read in a whole paragraph, not just one line
           while ( <> ) {
               while ( /\b([\w'-]+)(\s+\1)+\b/gi ) {   # a word, then repeats of it
                   print "Duplicate $1 at paragraph $.\n";
               }
           }

       Here's code that finds lines that begin with "From " (which would
       be mangled by many mailers):

           $/ = '';            # read in a whole paragraph, not just one line
           while ( <> ) {
               while ( /^From /gm ) { # /m makes ^ match next to \n
                   print "leading from in paragraph $.\n";
               }
           }

       Here's code that finds everything between START and END in a file:

           undef $/;           # read in whole file, not just one line or paragraph
           while ( <> ) {
               while ( /START(.*?)END/sgm ) { # /s makes . cross line boundaries
                   print "$1\n";
               }
           }
       
       How can I pull out lines between two patterns that are themselves on
       different lines?

       You can use Perl's somewhat exotic ".." operator (documented in
       perlop):

           perl -ne 'print if /START/ .. /END/' file1 file2 ...

       If you wanted text and not lines, you would use

           perl -0777 -ne 'print "$1\n" while /START(.*?)END/gs' file1 file2 ...

       But if you want nested occurrences of "START" through "END", you'll run
       up against the problem described in the question in this section on
       matching balanced text.

       Here's another example of using "..":

           while (<>) {
               $in_header =   1  .. /^$/;
               $in_body   = /^$/ .. eof;
               # now choose between them
           } continue {
               $. = 0 if eof;          # restart line numbering for the next file
           }
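       
       The flip-flop behavior of ".." can also be tried on an in-memory
       list.  This is only a sketch: the sample lines and the @between array
       are invented for the illustration, and the match is written against a
       named variable instead of $_.

```perl
use strict;
use warnings;

# Hypothetical input lines; in the one-liner they would come from <>.
my @lines = ( "before", "START", "one", "two", "END", "after" );

my @between;
for my $line (@lines) {
    # In the boolean context of "if", ".." is the flip-flop operator:
    # false until the left side matches, then true up to and including
    # the line where the right side matches.
    push @between, $line if $line =~ /START/ .. $line =~ /END/;
}

print "$_\n" for @between;
```

       The marker lines themselves are part of the result; the script
       prints START, one, two and END, each on its own line.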
       
       I put a regular expression into $/ but it didn't work. What's wrong?

       $/ has to be a string; perl does not treat its value as a regular
       expression (see perlvar).  If you really need records separated by a
       pattern, you can use these examples.

       If you have File::Stream, this is easy.

           use File::Stream;

           my $stream = File::Stream->new(
               $filehandle,
               separator => qr/\s*,\s*/,
           );

           print "$_\n" while <$stream>;

       If you don't have File::Stream, you have to do a little more work.

       You can use the four argument form of sysread to continually add to a
       buffer.  After you add to the buffer, you check if you have a complete
       line (using your regular expression).

           local $_ = "";
           while( sysread FH, $_, 8192, length ) {
               while( s/^((?s).*?)your_pattern// ) {
                   my $record = $1;
                   # do stuff here.
               }
           }

       You can do the same thing with foreach and a match using the c flag
       and the \G anchor, if you do not mind your entire file being in memory
       at the end.

           local $_ = "";
           while( sysread FH, $_, 8192, length ) {
               foreach my $record ( m/\G((?s).*?)your_pattern/gc ) {
                   # do stuff here.
               }
               substr( $_, 0, pos ) = "" if pos;
           }
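       
       When the input is small enough to hold in memory, a simpler
       alternative (a different technique from the sysread loops) is to read
       it all and let split do the work, since split takes a real pattern.
       The sample data and the @records name below are assumptions made for
       this sketch.

```perl
use strict;
use warnings;

# Hypothetical data; in practice this might be slurped from a filehandle,
# for example with:  my $data = do { local $/; <$fh> };
my $data = "alpha , beta,gamma ,  delta";

# split accepts a regular expression as the record separator,
# which $/ does not.
my @records = split /\s*,\s*/, $data;

print "[$_]\n" for @records;
```

       The trade-off is memory: the whole input and the whole list of
       records exist at once, so this suits small inputs only.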
       
       How do I substitute case insensitively on the LHS while preserving case
       on the RHS?

       Here's a lovely Perlish solution by Larry Rosler.  It exploits
       properties of bitwise xor on ASCII strings.

           $_= "this is a TEsT case";

           $old = 'test';
           $new = 'success';

           s{(\Q$old\E)}
            { uc $new | (uc $1 ^ $1) .
               (uc(substr $1, -1) ^ substr $1, -1) x
                   (length($new) - length $1)
            }egi;

           print;

       And here it is as a subroutine, modeled after the above:

           sub preserve_case($$) {
               my ($old, $new) = @_;
               my $mask = uc $old ^ $old;

               uc $new | $mask .
                   substr($mask, -1) x (length($new) - length($old))
           }

           $a = "this is a TEsT case";
           $a =~ s/(test)/preserve_case($1, "success")/egi;
           print "$a\n";

       This prints:

           this is a SUcCESS case

       As an alternative, to keep the case of the replacement word if it is
       longer than the original, you can use this code, by Jeff Pinyan:

           sub preserve_case {
               my ($from, $to) = @_;
               my ($lf, $lt) = map length, @_;

               if ($lt < $lf) { $from = substr $from, 0, $lt }
               else { $from .= substr $to, $lf }

               return uc $to | ($from ^ uc $from);
           }

       This changes the sentence to "this is a SUcCess case."
       
       Just to show that C programmers can write C in any programming
       language, if you prefer a more C-like solution, the following script
       makes the substitution have the same case, letter by letter, as the
       original.  (It also happens to run about 240% slower than the Perlish
       solution runs.)  If the substitution has more characters than the
       string being substituted, the case of the last character is used for
       the rest of the substitution.

           # Original by Nathan Torkington, massaged by Jeffrey Friedl
           #
           sub preserve_case($$)
           {
               my ($old, $new) = @_;
               my ($state) = 0; # 0 = no change; 1 = lc; 2 = uc
               my ($i, $oldlen, $newlen, $c) = (0, length($old), length($new));
               my ($len) = $oldlen < $newlen ? $oldlen : $newlen;

               for ($i = 0; $i < $len; $i++) {
                   if ($c = substr($old, $i, 1), $c =~ /[\W\d_]/) {
                       $state = 0;
                   } elsif (lc $c eq $c) {
                       substr($new, $i, 1) = lc(substr($new, $i, 1));
                       $state = 1;
                   } else {
                       substr($new, $i, 1) = uc(substr($new, $i, 1));
                       $state = 2;
                   }
               }
               # finish up with any remaining new (for when new is longer than old)
               if ($newlen > $oldlen) {
                   if ($state == 1) {
                       substr($new, $oldlen) = lc(substr($new, $oldlen));
                   } elsif ($state == 2) {
                       substr($new, $oldlen) = uc(substr($new, $oldlen));
                   }
               }
               return $new;
           }
       
       How can I make "\w" match national character sets?

       Put "use locale;" in your script.  The \w character class is taken from
       the current locale.

       See perllocale for details.

       How can I match a locale-smart version of "/[a-zA-Z]/"?

       You can use the POSIX character class syntax "/[[:alpha:]]/" documented
       in perlre.

       No matter which locale you are in, the alphabetic characters are the
       characters in \w without the digits and the underscore.  As a regex,
       that looks like "/[^\W\d_]/".  Its complement, the non-alphabetics, is
       then everything in \W along with the digits and the underscore, or
       "/[\W\d_]/".
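       
       A small sketch can confirm that the two spellings agree.  The token
       list is made up for the example, and it runs without "use locale", so
       only plain ASCII letters are involved here.

```perl
use strict;
use warnings;

# Hypothetical tokens: two purely alphabetic, two not.
my @tokens = ( "abc", "a1", "a_b", "xyz" );

# POSIX character class syntax.
my @posix   = grep { /^[[:alpha:]]+$/ } @tokens;

# "Word characters minus digits and underscore".
my @derived = grep { /^[^\W\d_]+$/ } @tokens;

print "posix:   @posix\n";
print "derived: @derived\n";
```

       Both lists contain "abc" and "xyz"; the tokens with a digit or an
       underscore are rejected by either pattern.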
       
       How can I quote a variable to use in a regex?

       The Perl parser will expand $variable and @variable references in
       regular expressions unless the delimiter is a single quote.  Remember,
       too, that the right-hand side of a "s///" substitution is considered a
       double-quoted string (see perlop for more details).  Remember also that
       any regex special characters in the interpolated value will be acted
       on unless you precede the variable with \Q.  Here's an example:

           $string = "Placido P. Octopus";
           $regex  = "P.";

           $string =~ s/$regex/Polyp/;
           # $string is now "Polypacido P. Octopus"

       Because "." is special in regular expressions, and can match any single
       character, the regex "P." here has matched the "Pl" in the original
       string.

       To escape the special meaning of ".", we use "\Q":

           $string = "Placido P. Octopus";
           $regex  = "P.";

           $string =~ s/\Q$regex/Polyp/;
           # $string is now "Placido Polyp Octopus"

       The use of "\Q" causes the "." in the regex to be treated as a regular
       character, so that "P." matches a "P" followed by a dot.
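       
       The quotemeta function performs the same escaping as "\Q" and can be
       handy when the quoted text is needed as a value.  The following is a
       sketch only; the sample strings and variable names are invented for
       it.

```perl
use strict;
use warnings;

# Hypothetical text containing several regex metacharacters.
my $literal = "1+1=2 (really?)";
my $string  = "Is 1+1=2 (really?) true";

# Interpolated as-is, "+", "(", "?" and ")" keep their special meaning,
# so the pattern does not match the literal text.
my $unquoted = $string =~ /$literal/ ? 1 : 0;

# \Q ... \E quotes the interpolated value inside the pattern.
my $quoted_inline = $string =~ /\Q$literal\E true/ ? 1 : 0;

# quotemeta returns the escaped text as an ordinary string.
my $escaped = quotemeta $literal;
my $quoted_function = $string =~ /$escaped true/ ? 1 : 0;

print "unquoted: $unquoted, \\Q: $quoted_inline, quotemeta: $quoted_function\n";
print "escaped form: $escaped\n";
```

       Only the two quoted forms match.  quotemeta puts a backslash in
       front of every character that is not a letter, digit or underscore,
       including the spaces.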
       
       What is "/o" really for?

       Using a variable in a regular expression match forces a re-evaluation
       (and perhaps recompilation) each time the regular expression is
       encountered.  The "/o" modifier locks in the regex the first time it's
       used.  This always happens in a constant regular expression, and in
       fact, the pattern was compiled into the internal format at the same
       time your entire program was.

       Use of "/o" is irrelevant unless variable interpolation is used in the
       pattern, and if so, the regex engine will neither know nor care whether
       the variables change after the pattern is evaluated the very first
       time.

       "/o" is often used to gain an extra measure of efficiency by not
       performing subsequent evaluations when you know it won't matter
       (because you know the variables won't change), or more rarely, when you
       don't want the regex to notice if they do.

       For example, here's a "paragrep" program:

           $/ = '';  # paragraph mode
           $pat = shift;
           while (<>) {
               print if /$pat/o;
           }
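       
       Where the pattern may legitimately differ from one call to the next,
       the qr// operator (covered later in this document) is a possible
       alternative to "/o": the pattern is compiled once per qr//, and a new
       value is still honored.  The grep_paragraphs helper below is a
       hypothetical name used only for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the paragraphs that match $pat.
sub grep_paragraphs {
    my ( $pat, @paragraphs ) = @_;
    my $re = qr/$pat/;    # compiled once per call, reused for every paragraph
    return grep { $_ =~ $re } @paragraphs;
}

my @paragraphs = ( "I like perl.\n", "I like awk.\n", "perl5 rules\n" );

my @perl_hits = grep_paragraphs( 'perl5?', @paragraphs );
my @awk_hits  = grep_paragraphs( 'awk',    @paragraphs );   # a new pattern is honored

print "perl: ", scalar(@perl_hits), "  awk: ", scalar(@awk_hits), "\n";
```

       With "/o" in place of qr//, the second call would silently keep
       using the first pattern.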
       
       How do I use a regular expression to strip C style comments from a
       file?

       While this actually can be done, it's much harder than you'd think.
       For example, this one-liner

           perl -0777 -pe 's{/\*.*?\*/}{}gs' foo.c

       will work in many but not all cases.  You see, it's too simple-minded
       for certain kinds of C programs, in particular, those with what appear
       to be comments in quoted strings.  For that, you'd need something like
       this, created by Jeffrey Friedl and later modified by Fred Curtis.

           $/ = undef;
           $_ = <>;
           s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $2 ? $2 : ""#gse;
           print;

       This could, of course, be more legibly written with the "/x" modifier,
       adding whitespace and comments.  Here it is expanded, courtesy of Fred
       Curtis.

           s{
              /\*         ##  Start of /* ... */ comment
              [^*]*\*+    ##  Non-* followed by 1-or-more *'s
              (
                [^/*][^*]*\*+
              )*          ##  0-or-more things which don't start with /
                          ##    but do end with '*'
              /           ##  End of /* ... */ comment

            |         ##     OR  various things which aren't comments:

              (
                "           ##  Start of " ... " string
                (
                  \\.           ##  Escaped char
                |               ##    OR
                  [^"\\]        ##  Non "\
                )*
                "           ##  End of " ... " string

              |         ##     OR

                '           ##  Start of ' ... ' string
                (
                  \\.           ##  Escaped char
                |               ##    OR
                  [^'\\]        ##  Non '\
                )*
                '           ##  End of ' ... ' string

              |         ##     OR

                .           ##  Any other char
                [^/"'\\]*   ##  Chars which don't start a comment, string or escape
              )
            }{defined $2 ? $2 : ""}gxse;

       A slight modification also removes C++ comments:

           s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//[^\n]*|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $2 ? $2 : ""#gse;
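       
       To see the C++ variant at work, it can be applied to a short string
       instead of a file.  The C fragment below is invented for this sketch,
       and the pattern is the one shown above, written with brace
       delimiters.

```perl
use strict;
use warnings;

# Hypothetical C/C++ fragment: one block comment, one string literal that
# merely looks like a comment, and one // comment.
my $source = qq{int x; /* gone */ char *s = "/* kept */"; // also gone\nint y;\n};

# Copy the source, then strip comments from the copy.  $2 is defined only
# when the "not a comment" alternative matched, so that text is kept.
( my $stripped = $source ) =~
    s{/\*[^*]*\*+([^/*][^*]*\*+)*/|//[^\n]*|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)}
     {defined $2 ? $2 : ""}gse;

print $stripped;
```

       Both comments disappear, while the text inside the string literal
       survives because the string alternative consumes it before the
       comment alternatives get a chance to start there.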
       
       Can I use Perl regular expressions to match balanced text?

       Historically, Perl regular expressions were not capable of matching
       balanced text.  More recent versions of perl, including 5.6.1, have
       added experimental features that make it possible to do this.  Look at
       the documentation for the (??{ }) construct in recent perlre manual
       pages to see an example of matching balanced parentheses.  Be sure to
       take special notice of the warnings present in the manual before
       making use of this feature.

       CPAN contains many modules that can be useful for matching text
       depending on the context.  Damian Conway provides some useful patterns
       in Regexp::Common.  The module Text::Balanced provides a general
       solution to this problem.

       One of the common applications of balanced text matching is working
       with XML and HTML.  There are many modules available that support these
       needs.  Two examples are HTML::Parser and XML::Parser. There are many
       others.

       An elaborate subroutine (for 7-bit ASCII only) to pull out balanced and
       possibly nested single chars, like "`" and "'", "{" and "}", or "(" and
       ")" can be found in
       http://www.cpan.org/authors/id/TOMC/scripts/pull_quotes.gz .

       The C::Scan module from CPAN also contains such subs for internal use,
       but they are undocumented.
       
       What does it mean that regexes are greedy?  How can I get around it?

       Most people mean that greedy regexes match as much as they can.
       Technically speaking, it's actually the quantifiers ("?", "*", "+",
       "{}") that are greedy rather than the whole pattern; Perl prefers local
       greed and immediate gratification to overall greed.  To get non-greedy
       versions of the same quantifiers, use ("??", "*?", "+?", "{}?").

       An example:

               $s1 = $s2 = "I am very very cold";
               $s1 =~ s/ve.*y //;      # I am cold
               $s2 =~ s/ve.*?y //;     # I am very cold

       Notice how the second substitution stopped matching as soon as it
       encountered "y ".  The "*?" quantifier effectively tells the regular
       expression engine to find a match as quickly as possible and pass
       control on to whatever is next in line, like you would if you were
       playing hot potato.
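       
       Capturing what each quantifier consumed makes the difference easy to
       see.  This is a sketch on a made-up HTML-like string, not a
       recommendation to parse HTML this way.

```perl
use strict;
use warnings;

# Hypothetical sample string with several angle brackets.
my $html = "<b>bold</b> and <i>italic</i>";

# Greedy: .+ runs to the end, then backs off to the LAST ">".
my ($greedy) = $html =~ /<(.+)>/;

# Non-greedy: .+? stops at the FIRST ">" that lets the match succeed.
my ($stingy) = $html =~ /<(.+?)>/;

print "greedy: $greedy\n";
print "stingy: $stingy\n";
```

       The greedy capture holds everything between the first "<" and the
       last ">", while the non-greedy capture is just "b".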
       
       How do I process each word on each line?

       Use the split function:

           while (<>) {
               foreach $word ( split ) {
                   # do something with $word here
               }
           }

       Note that this isn't really a word in the English sense; it's just
       chunks of consecutive non-whitespace characters.

       To work with only alphanumeric sequences (including underscores), you
       might consider

           while (<>) {
               foreach $word (m/(\w+)/g) {
                   # do something with $word here
               }
           }
       
       How can I print out a word-frequency or line-frequency summary?

       To do this, you have to parse out each word in the input stream.  We'll
       pretend that by word you mean chunk of alphabetics, hyphens, or
       apostrophes, rather than the non-whitespace chunk idea of a word given
       in the previous question:

           while (<>) {
               while ( /(\b[^\W_\d][\w'-]+\b)/g ) {   # misses "`sheep'"
                   $seen{$1}++;
               }
           }
           while ( ($word, $count) = each %seen ) {
               print "$count $word\n";
           }

       If you wanted to do the same thing for lines, you wouldn't need a
       regular expression:

           while (<>) {
               $seen{$_}++;
           }
           while ( ($line, $count) = each %seen ) {
               print "$count $line";
           }

       If you want these output in a sorted order, see perlfaq4: "How do I
       sort a hash (optionally by value instead of key)?".
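       
       As one possible way to get sorted output, the counts can be ordered
       with sort before printing.  This sketch reuses the word pattern from
       above on a made-up sentence; the tie-breaking rule (alphabetical
       within equal counts) is an assumption of the example.

```perl
use strict;
use warnings;

# Hypothetical input text standing in for the lines read from <>.
my $text = "the cat and the hat and the bat";

my %seen;
$seen{$1}++ while $text =~ /(\b[^\W_\d][\w'-]+\b)/g;

# Most frequent first; alphabetical order breaks ties.
my @report;
for my $word ( sort { $seen{$b} <=> $seen{$a} or $a cmp $b } keys %seen ) {
    push @report, "$seen{$word} $word";
}

print "$_\n" for @report;
```

       This prints "3 the", "2 and", then the three single-occurrence words
       in alphabetical order.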
       
       How can I do approximate matching?

       See the module String::Approx available from CPAN.

       How do I efficiently match many regular expressions at once?

       (contributed by brian d foy)

       Avoid asking Perl to compile a regular expression every time you want
       to match it.  In this example, perl must recompile the regular
       expression for every iteration of the foreach() loop since it has no
       way to know what $pattern will be.

           @patterns = qw( foo bar baz );

           LINE: while( <> )
               {
               foreach $pattern ( @patterns )
                   {
                   if( /\b$pattern\b/i )
                       {
                       print;
                       next LINE;
                       }
                   }
               }

       The qr// operator showed up in perl 5.005.  It compiles a regular
       expression, but doesn't apply it.  When you use the pre-compiled
       version of the regex, perl does less work. In this example, I inserted
       a map() to turn each pattern into its pre-compiled form.  The rest of
       the script is the same, but faster.

           @patterns = map { qr/\b$_\b/i } qw( foo bar baz );

           LINE: while( <> )
               {
               foreach $pattern ( @patterns )
                   {
                   if( /$pattern/ )
                       {
                       print;
                       next LINE;
                       }
                   }
               }

       In some cases, you may be able to make several patterns into a single
       regular expression.  Beware of situations that require backtracking
       though.

           $regex = join '|', qw( foo bar baz );

           LINE: while( <> )
               {
               print if /\b(?:$regex)\b/i;
               }

       For more details on regular expression efficiency, see Mastering
       Regular Expressions by Jeffrey Friedl.  He explains how regular
       expression engines work and why some patterns are surprisingly
       inefficient.  Once you understand how perl applies regular
       expressions, you can tune them for individual situations.
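       
       The single-pattern approach can be combined with qr// so the
       alternation is compiled once.  In this sketch the input lines are
       invented, and the call to quotemeta is an extra precaution (an
       assumption of the example) for words that might contain regex
       metacharacters.

```perl
use strict;
use warnings;

my @words = qw( foo bar baz );

# quotemeta keeps any regex metacharacters in the words literal.
my $alternation = join '|', map { quotemeta($_) } @words;

# Compile the combined pattern once, outside any loop.
my $regex = qr/\b(?:$alternation)\b/i;

# Hypothetical input lines standing in for <>.
my @lines = ( "a Foo here", "food", "rebar", "baz!" );
my @hits  = grep { $_ =~ $regex } @lines;

print "$_\n" for @hits;
```

       Only "a Foo here" and "baz!" match: "food" and "rebar" contain the
       words but not at word boundaries.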
       
       Why don't word-boundary searches with "\b" work for me?

       (contributed by brian d foy)

       Ensure that you know what \b really does: it's the boundary between a
       word character, \w, and something that isn't a word character. That
       thing that isn't a word character might be \W, but it can also be the
       start or end of the string.

       It's not (not!) the boundary between whitespace and non-whitespace, and
       it's not the stuff between words we use to create sentences.

       In regex speak, a word boundary (\b) is a "zero width assertion",
       meaning that it doesn't represent a character in the string, but a
       condition at a certain position.

       For the regular expression, /\bPerl\b/, there has to be a word boundary
       before the "P" and after the "l".  As long as something other than a
       word character precedes the "P" and follows the "l", the pattern will
       match. These strings match /\bPerl\b/.

               "Perl"    # no word char before P or after l
               "Perl "   # same as previous (space is not a word char)
               "'Perl'"  # the ' char is not a word char
               "Perl's"  # no word char before P, non-word char after "l"

       These strings do not match /\bPerl\b/.

               "Perl_"   # _ is a word char!
               "Perler"  # no word char before P, but one after l

       You don't have to use \b to match words though.  You can look for
       non-word characters surrounded by word characters.  These strings match
       the pattern /\b'\b/.

               "don't"   # the ' char is surrounded by "n" and "t"
               "qep'a'"  # the ' char is surrounded by "p" and "a"

       This string does not match /\b'\b/.

               "foo'"    # there is no word char after non-word '

       You can also use the complement of \b, \B, to specify that there should
       not be a word boundary.

       In the pattern /\Bam\B/, there must be a word character before the "a"
       and after the "m". These strings match /\Bam\B/:

               "llama"   # "am" surrounded by word chars
               "Samuel"  # same

       These strings do not match /\Bam\B/:

               "Sam"      # no word boundary before "a", but one after "m"
               "I am Sam" # "am" surrounded by non-word chars
       
       Why does using $&, $`, or $' slow my program down?

       (contributed by Anno Siegel)

       Once Perl sees that you need one of these variables anywhere in the
       program, it provides them on each and every pattern match. That means
       that on every pattern match the entire string will be copied, part of
       it to $`, part to $&, and part to $'. Thus the penalty is most severe
       with long strings and patterns that match often. Avoid $&, $', and $`
       if you can, but if you can't, once you've used them at all, use them at
       will because you've already paid the price. Remember that some
       algorithms really appreciate them. As of the 5.005 release, the $&
       variable is no longer "expensive" the way the other two are.

       Since Perl 5.6.1 the special variables @- and @+ can functionally
       replace $`, $& and $'.  These arrays contain the offsets of the
       beginning and end of each match (see perlvar for the full story), so
       they give you essentially the same information, but without the risk
       of excessive string copying.
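       
       A minimal sketch of that replacement, using substr with the offsets
       in $-[0] and $+[0].  The sample string and the three variable names
       are made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical string used only for this illustration.
my $string = "hello brave new world";

my ( $prematch, $match, $postmatch );
if ( $string =~ /brave/ ) {
    # $-[0] is the offset where the whole match starts,
    # $+[0] is the offset just past its end (see perlvar).
    $prematch  = substr( $string, 0, $-[0] );               # what $` would hold
    $match     = substr( $string, $-[0], $+[0] - $-[0] );   # what $& would hold
    $postmatch = substr( $string, $+[0] );                  # what $' would hold
}

print "[$prematch][$match][$postmatch]\n";
```

       This prints "[hello ][brave][ new world]" without mentioning any of
       the three expensive variables in executable code.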
       
       What good is "\G" in a regular expression?

       You use the "\G" anchor to start the next match on the same string
       where the last match left off.  The regular expression engine cannot
       skip over any characters to find the next match with this anchor, so
       "\G" is similar to the beginning of string anchor, "^".  The "\G"
       anchor is typically used with the "g" flag.  It uses the value of pos()
       as the position to start the next match.  As the match operator makes
       successive matches, it updates pos() with the position of the next
       character past the last match (or the first character of the next
       match, depending on how you like to look at it). Each string has its
       own pos() value.

       Suppose you want to match all consecutive pairs of digits in a string
       like "1122a44" and stop matching when you encounter non-digits.  You
       want to match 11 and 22 but the letter "a" shows up between 22 and 44
       and you want to stop at "a". Simply matching pairs of digits skips over
       the "a" and still matches 44.

               $_ = "1122a44";
               my @pairs = m/(\d\d)/g;   # qw( 11 22 44 )

       If you use the \G anchor, you force the match after 22 to start with
       the "a".  The regular expression cannot match there since it does not
       find a digit, so the next match fails and the match operator returns
       the pairs it already found.

               $_ = "1122a44";
               my @pairs = m/\G(\d\d)/g; # qw( 11 22 )

       You can also use the "\G" anchor in scalar context. You still need the
       "g" flag.

               $_ = "1122a44";
               while( m/\G(\d\d)/g )
                       {
                       print "Found $1\n";
                       }

       After the match fails at the letter "a", perl resets pos() and the next
       match on the same string starts at the beginning.

               $_ = "1122a44";
               while( m/\G(\d\d)/g )
                       {
                       print "Found $1\n";
                       }

               print "Found $1 after while" if m/(\d\d)/g; # finds "11"

       You can disable pos() resets on fail with the "c" flag.  Subsequent
       matches start where the last successful match ended (the value of
       pos()) even if a match on the same string has failed in the meantime.
       In this case, the match after the while() loop starts at the "a" (where
       the last match stopped), and since it does not use any anchor it can
       skip over the "a" to find "44".

               $_ = "1122a44";
               while( m/\G(\d\d)/gc )
                       {
                       print "Found $1\n";
                       }

               print "Found $1 after while" if m/(\d\d)/g; # finds "44"

       Typically you use the "\G" anchor with the "c" flag when you want to
       try a different match if one fails, such as in a tokenizer. Jeffrey
       Friedl offers this example which works in 5.004 or later.

           while (<>) {
             chomp;
             PARSER: {
                  m/ \G( \d+\b    )/gcx   && do { print "number: $1\n";  redo; };
                  m/ \G( \w+      )/gcx   && do { print "word:   $1\n";  redo; };
                  m/ \G( \s+      )/gcx   && do { print "space:  $1\n";  redo; };
                  m/ \G( [^\w\d]+ )/gcx   && do { print "other:  $1\n";  redo; };
             }
           }

       For each line, the PARSER loop first tries to match a series of digits
       followed by a word boundary.  This match has to start at the place the
       last match left off (or the beginning of the string on the first
       match). Since "m/ \G( \d+\b )/gcx" uses the "c" flag, if the string
       does not match that regular expression, perl does not reset pos() and
       the next match starts at the same position to try a different pattern.

       Are Perl regexes DFAs or NFAs?  Are they POSIX compliant?

       While it's true that Perl's regular expressions resemble the DFAs
       (deterministic finite automata) of the egrep(1) program, they are in
       fact implemented as NFAs (non-deterministic finite automata) to allow
       backtracking and backreferencing.  And they aren't POSIX-style either,
       because those guarantee worst-case behavior for all cases.  (It seems
       that some people prefer guarantees of consistency, even when what's
       guaranteed is slowness.)  See the book "Mastering Regular Expressions"
       (from O'Reilly) by Jeffrey Friedl for all the details you could ever
       hope to know on these matters (a full citation appears in perlfaq2).
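
       Two visible consequences of this design can be sketched in a few
       lines: a backreference has to repeat whatever an earlier group
       captured, and alternatives are tried from left to right instead of
       picking the longest one, as POSIX leftmost-longest matching would.
       The sub names below are made up for this illustration.

```perl
use strict;
use warnings;

# \1 must match the same text that the first group captured.
# (first_doubled_word is a hypothetical helper for this example.)
sub first_doubled_word {
    my ($text) = @_;
    return $text =~ /\b(\w+)\s+\1\b/ ? $1 : undef;
}

# Perl takes the first alternative that lets the whole match succeed,
# even when a later alternative would match more text.
sub first_alternative_match {
    my ($text) = @_;
    return $text =~ /(foo|foobar)/ ? $1 : undef;
}

print first_doubled_word("this is is a test"), "\n";    # prints "is"
print first_alternative_match("foobar"), "\n";          # prints "foo"
```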

       What's wrong with using grep in a void context?

       The problem is that grep builds a return list, regardless of the
       context.  This means you're making Perl go to the trouble of building
       a list that you then just throw away. If the list is large, you waste
       both time and space.  If your intent is to iterate over the list, then
       use a for loop for this purpose.

       In perls older than 5.8.1, map suffers from this problem as well.  But
       since 5.8.1, this has been fixed, and map is context aware - in void
       context, no lists are constructed.
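
       As a rough sketch of that advice, the first sub below iterates with a
       for loop where a void-context grep might have been tempting, and the
       second uses grep because its result is actually wanted.  Both sub
       names are invented for this example.

```perl
use strict;
use warnings;

# Side effects only: use a for loop instead of "grep { chomp } @lines;".
# $line aliases each element of @copy, so chomp changes it in place.
# (strip_newlines is a hypothetical helper for this example.)
sub strip_newlines {
    my @copy = @_;
    for my $line (@copy) {
        chomp $line;
    }
    return @copy;
}

# grep is the right tool when the returned list is used.
sub starting_with {
    my ($prefix, @items) = @_;
    return grep { index($_, $prefix) == 0 } @items;
}

my @lines = strip_newlines("alpha\n", "beta\n", "apple\n");
print join(",", starting_with("a", @lines)), "\n";    # prints "alpha,apple"
```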

       How can I match strings with multibyte characters?

       Starting with Perl 5.6, Perl has had some level of multibyte character
       support.  Perl 5.8 or later is recommended.  Supported multibyte
       character repertoires include Unicode, and legacy encodings through
       the Encode module.  See perluniintro, perlunicode, and Encode.
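
       A minimal sketch of the Encode route, assuming the input arrives as
       UTF-8 bytes: decode the bytes into characters first, and the regex
       engine then counts and matches characters.  The sub names are made
       up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "caf\xC3\xA9" is "cafe" with an acute accent on the e, as UTF-8 bytes:
# five bytes, four characters.
# (char_count and is_four_chars are hypothetical helpers.)
sub char_count {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);    # bytes in, characters out
    return length $chars;
}

sub is_four_chars {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);
    return $chars =~ /^.{4}$/ ? 1 : 0;      # "." matches one character
}

my $bytes = "caf\xC3\xA9";
print length($bytes), " bytes, ", char_count($bytes), " characters\n";
```

       Matching the undecoded bytes against the same pattern would fail,
       because Perl would see five one-byte characters.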

       If you are stuck with older Perls, you can do Unicode with the
       "Unicode::String" module, and character conversions using the
       "Unicode::Map8" and "Unicode::Map" modules.  If you are using Japanese
       encodings, you might try jperl 5.005_03.

       Finally, the following set of approaches was offered by Jeffrey Friedl,
       whose article in issue #5 of The Perl Journal talks about this very
       matter.

       Let's suppose you have some weird Martian encoding where pairs of ASCII
       uppercase letters encode single Martian letters (i.e. the two bytes
       "CV" make a single Martian letter, as do the two bytes "SG", "VS",
       "XX", etc.). Other bytes represent single characters, just like ASCII.

       So, the string of Martian "I am CVSGXX!" uses 12 bytes to encode the
       nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.

       Now, say you want to search for the single character "/GX/". Perl
       doesn't know about Martian, so it'll find the two bytes "GX" in the "I
       am CVSGXX!"  string, even though that character isn't there: it just
       looks like it is because "SG" is next to "XX", but there's no real
       "GX".  This is a big problem.

       Here are a few ways, all painful, to deal with it:

          $martian =~ s/([A-Z][A-Z])/ $1 /g; # Make sure adjacent "martian"
                                             # bytes are no longer adjacent.
          print "found GX!\n" if $martian =~ /GX/;

       Or like this:

          @chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g;
          # above is conceptually similar to:     @chars = $text =~ m/(.)/g;
          #
          foreach $char (@chars) {
              # print and last must be separate statements: in
              # 'print "...", last' the last runs before anything prints
              if ($char eq 'GX') { print "found GX!\n"; last; }
          }

       Or like this:

          while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) {  # \G probably unneeded
              if ($1 eq 'GX') { print "found GX!\n"; last; }
          }

       Here's another, slightly less painful, way to do it from Benjamin
       Goldberg, who uses a zero-width negative look-behind assertion.

               print "found GX!\n" if  $martian =~ m/
                          (?<![A-Z])
                          (?:[A-Z][A-Z])*?
                          GX
                       /x;

       This succeeds if the "martian" character GX is in the string, and fails
       otherwise.  If you don't like using (?<!), a zero-width negative
       look-behind assertion, you can replace (?<![A-Z]) with (?:^|[^A-Z]).

       It does have the drawback of putting the wrong thing in $-[0] and
       $+[0], but this usually can be worked around.
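
       The approaches above can be wrapped in small subs and checked against
       the "I am CVSGXX!" string.  This is only a sketch; the names
       martian_chars and has_martian_char are invented here, and the second
       sub simply reuses the look-behind pattern shown above with the
       wanted character interpolated.

```perl
use strict;
use warnings;

# Split a Martian string into characters: two uppercase ASCII letters
# form one character, any other byte is a character on its own.
# Call in list context.  (martian_chars is a hypothetical helper.)
sub martian_chars {
    my ($martian) = @_;
    return $martian =~ m/\G([A-Z][A-Z]|.)/gs;
}

# True if the two-letter Martian character in the second argument
# occurs in the string.  (has_martian_char is a hypothetical helper.)
sub has_martian_char {
    my ($martian, $wanted) = @_;
    return $martian =~ m/
        (?<![A-Z])          # not inside a run of uppercase letters
        (?:[A-Z][A-Z])*?    # skip whole Martian characters
        \Q$wanted\E         # then the character being searched for
    /x ? 1 : 0;
}

my @chars = martian_chars("I am CVSGXX!");
print scalar(@chars), " characters\n";                   # prints "9 characters"
print has_martian_char("I am CVSGXX!", "GX"), "\n";      # prints "0"
print has_martian_char("I am CVSGXX!", "SG"), "\n";      # prints "1"
```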

       How do I match a pattern that is supplied by the user?

       Well, if it's really a pattern, then just use

           chomp($pattern = <STDIN>);
           if ($line =~ /$pattern/) { }

       Alternatively, since you have no guarantee that your user entered a
       valid regular expression, trap the exception this way:

           if (eval { $line =~ /$pattern/ }) { }
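
       One possible variation, sketched here under the assumption that the
       same pattern will be applied to many lines, is to compile it once
       with qr// inside the eval and keep the result.  The sub name
       compile_pattern is made up for this example.

```perl
use strict;
use warnings;

# Compile a user-supplied pattern once.  Returns a compiled regex, or
# undef if the pattern is not valid; $@ then holds the reason.
# (compile_pattern is a hypothetical helper for this example.)
sub compile_pattern {
    my ($pattern) = @_;
    my $re = eval { qr/$pattern/ };
    return $re;
}

for my $pattern ('^\d+$', '(') {
    my $re = compile_pattern($pattern);
    if (defined $re) {
        print "usable pattern:  $pattern\n";
    }
    else {
        print "invalid pattern: $pattern\n";
    }
}
```

       The compiled regex can then be used as "$line =~ $re" without any
       further eval.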

       If all you really want is to search for a string, not a pattern, then
       you should either use the index() function, which is made for string
       searching, or, if you can't be disabused of using a pattern match on a
       non-pattern, then be sure to use "\Q"..."\E", documented in perlre.

           chomp($pattern = <STDIN>);   # strip the trailing newline

           open (FILE, '<', $input) or die "Couldn't open input $input: $!; aborting";
           while (<FILE>) {
               print if /\Q$pattern\E/;
           }
           close FILE;
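
       The two alternatives can be compared side by side.  In this sketch
       (the sub names are invented for the example) both treat the search
       string "a.c" literally, so the dot does not match an arbitrary
       character.

```perl
use strict;
use warnings;

# Literal search with index(): no regex involved at all.
# (contains_by_index and contains_by_regex are hypothetical helpers.)
sub contains_by_index {
    my ($line, $string) = @_;
    return index($line, $string) >= 0 ? 1 : 0;
}

# Literal search with a regex: \Q...\E quotes the metacharacters.
sub contains_by_regex {
    my ($line, $string) = @_;
    return $line =~ /\Q$string\E/ ? 1 : 0;
}

for my $line ("xa.cx", "xabcx") {
    printf "%-6s index: %d  regex: %d\n",
        $line,
        contains_by_index($line, "a.c"),
        contains_by_regex($line, "a.c");
}
```

       Without "\Q", the pattern /a.c/ would also match "xabcx".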

AUTHOR AND COPYRIGHT

       Copyright (c) 1997-2006 Tom Christiansen, Nathan Torkington, and other
       authors as noted. All rights reserved.

       This documentation is free; you can redistribute it and/or modify it
       under the same terms as Perl itself.

       Irrespective of its distribution, all code examples in this file are
       hereby placed into the public domain.  You are permitted and encouraged
       to use this code in your own programs for fun or for profit as you see
       fit.  A simple comment in the code giving credit would be courteous but
       is not required.



perl v5.8.8                       2006-01-07                       PERLFAQ6(1)